CVPR 2026 · 📄 Paper
Narges Norouzi1, Idil Esen Zulfikar2,*, Niccolò Cavagnero1,*, Tommie Kerssies1, Bastian Leibe2, Gijs Dubbelman1, Daan de Geus1
¹ Eindhoven University of Technology, ² RWTH Aachen University, * Equal contribution
We introduce Video Encoder-only Mask Transformer (VidEoMT), a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It performs both spatial and temporal reasoning within the ViT encoder, without relying on dedicated tracking modules or heavy task-specific heads.
VidEoMT propagates information over time by reusing queries from the previous frame and fusing them with a compact set of learned, frame-agnostic queries. This design achieves competitive accuracy while being 5×–10× faster than existing approaches, reaching up to 160 FPS with a ViT-L backbone.
If you don't have Conda installed, install Miniconda and restart your shell:
```shell
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

Then create the environment, activate it, and install the dependencies:
```shell
conda create -n videomt python==3.12.3
conda activate videomt
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
python -m pip install --no-build-isolation 'git+https://github.com/facebookresearch/detectron2.git'
pip install git+https://github.com/cocodataset/panopticapi.git
python3 -m pip install -r requirements.txt
```

Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:

```shell
wandb login
```

Then download and prepare the datasets.
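Before moving on to the datasets, a quick sanity check can confirm that the environment is usable (a minimal sketch; it only verifies that the expected tools are on the `PATH` of the active `videomt` environment):

```shell
# Check that the tools installed above are reachable; prints "ok" or
# "missing" for each one.
for cmd in python pip wandb; do
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd: ok"
  else
    echo "$cmd: missing"
  fi
done
```

If anything reports `missing`, re-activate the environment (`conda activate videomt`) and re-run the install steps above.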
To evaluate a pre-trained VidEoMT model, first prepare the datasets by following the instructions in this link and download the trained weights from here. Once these are set up, run:
```shell
python train_net_video.py \
  --num-gpus 1 \
  --config-file /path/to/config.yaml \
  --eval-only MODEL.WEIGHTS /path/to/weight.pth \
  MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
  OUTPUT_DIR /path/to/output
```

🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
🔧 Replace /path/to/output with the path to the output folder.
🔧 Change the value of --num-gpus to the number of GPUs available to you.
For detailed instructions on running evaluation on different datasets, see Evaluation.
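To evaluate several checkpoints in sequence, the evaluation command can be wrapped in a small loop (a sketch; the config and checkpoint paths are placeholders you must fill in, and the loop skips any pair whose config file does not exist):

```shell
# Evaluate each (config, checkpoint) pair in turn, writing results to a
# separate output directory per run. All paths below are placeholders.
CONFIGS=("/path/to/config_a.yaml" "/path/to/config_b.yaml")
WEIGHTS=("/path/to/weight_a.pth" "/path/to/weight_b.pth")

for i in "${!CONFIGS[@]}"; do
  if [ ! -f "${CONFIGS[$i]}" ]; then
    echo "skipping missing config: ${CONFIGS[$i]}"
    continue
  fi
  python train_net_video.py \
    --num-gpus 1 \
    --config-file "${CONFIGS[$i]}" \
    --eval-only MODEL.WEIGHTS "${WEIGHTS[$i]}" \
    OUTPUT_DIR "/path/to/output/run_$i"
done
```

The arrays require bash; keeping one output directory per run avoids the runs overwriting each other's results.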
To calculate the FPS and GFLOPs, run:
```shell
python benchmark.py \
  --task fps \
  --config-file /path/to/config.yaml \
  --model-weights /path/to/weight.pth \
  --warmup-iters 100
```

```shell
export TIMM_FUSED_ATTN=0
python benchmark.py \
  --task flops \
  --config-file /path/to/config.yaml \
  --model-weights /path/to/weight.pth
```

🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
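Note that `export TIMM_FUSED_ATTN=0` changes the variable for the rest of your shell session (it disables timm's fused attention path, presumably so the FLOP counter can see the individual attention operations). If you prefer, the variable can be scoped to a single command instead, as the snippet below demonstrates:

```shell
# Set TIMM_FUSED_ATTN only for the child process instead of exporting it;
# the surrounding shell session is left untouched.
TIMM_FUSED_ATTN=0 sh -c 'echo "child sees: $TIMM_FUSED_ATTN"'
# → child sees: 0
echo "parent sees: ${TIMM_FUSED_ATTN:-unset}"
```

The same per-command form works for the benchmark itself: `TIMM_FUSED_ATTN=0 python benchmark.py --task flops ...`.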
We provide example visualizations below.
To generate additional visualization samples, please use the code in Visualization.
- [x] Inference code
- [x] Flops and FPS code
- [x] Visualization code
- [ ] Training code
- [ ] DINOv3 model zoo and code
We provide pre-trained weights for both DINOv2- and DINOv3-based VidEoMT models.
- DINOv2 Models - Original published results and pre-trained weights.
- DINOv3 Models - DINOv3-based models and pre-trained weights.
If you find this work useful in your research, please cite it using the BibTeX entry below:
```bibtex
@inproceedings{norouzi2026videomt,
  author    = {Norouzi, Narges and Zulfikar, Idil Esen and Cavagnero, Niccol\`{o} and Kerssies, Tommie and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},
  title     = {{VidEoMT: Your ViT is Secretly Also a Video Segmentation Model}},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026},
}
```

This project builds upon code from the following libraries and repositories:
- EoMT (MIT License)
- Hugging Face Transformers (Apache-2.0 License)
- PyTorch Image Models (timm) (Apache-2.0 License)
- CAVIS (MIT License)
- Mask2Former (Apache-2.0 License)
- Detectron2 (Apache-2.0 License)