
[CVPR 2026] Official code and models for Video Encoder-only Mask Transformer (VidEoMT).


VidEoMT: Your ViT is Secretly Also a Video Segmentation Model

CVPR 2026 · 📄 Paper

Narges Norouzi¹, Idil Esen Zulfikar²·*, Niccolò Cavagnero¹·*, Tommie Kerssies¹, Bastian Leibe², Gijs Dubbelman¹, Daan de Geus¹

¹ Eindhoven University of Technology, ² RWTH Aachen University, * Equal contribution

Overview

VidEoMT Overview

We introduce Video Encoder-only Mask Transformer (VidEoMT), a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It performs both spatial and temporal reasoning within the ViT encoder, without relying on dedicated tracking modules or heavy task-specific heads.

VidEoMT propagates information over time by reusing queries from the previous frame and fusing them with a compact set of learned, frame-agnostic queries. This design achieves competitive accuracy while being 5×–10× faster than existing approaches, reaching up to 160 FPS with a ViT-L backbone.
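
To make the idea concrete, here is a minimal NumPy sketch of the query-propagation scheme described above. This is an illustration, not the actual VidEoMT code: the function name `fuse_queries`, the convex-combination fusion, and the weight `alpha` are all assumptions chosen to show the mechanism of mixing previous-frame queries with learned, frame-agnostic ones.

```python
import numpy as np

def fuse_queries(prev_queries, learned_queries, alpha=0.5):
    """Conceptual sketch: combine the queries produced for the previous
    frame with a small set of learned, frame-agnostic queries.
    `alpha` (hypothetical) weighs propagated vs. learned information;
    the real model may use a learned fusion instead of a fixed blend."""
    assert prev_queries.shape == learned_queries.shape
    return alpha * prev_queries + (1.0 - alpha) * learned_queries

# Toy example: 100 queries of dimension 256.
rng = np.random.default_rng(0)
prev_q = rng.standard_normal((100, 256))     # queries from frame t-1
learned_q = rng.standard_normal((100, 256))  # frame-agnostic learned queries

frame_queries = fuse_queries(prev_q, learned_q)
print(frame_queries.shape)  # (100, 256)
```

With `alpha=1.0` the model would rely entirely on propagated queries (pure tracking); with `alpha=0.0` it would restart from the learned queries each frame, which is what makes the design robust to new objects entering the scene.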

Installation

If you don't have Conda installed, install Miniconda and restart your shell:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Then create the environment, activate it, and install the dependencies:

conda create -n videomt python==3.12.3
conda activate videomt
pip install torch==2.7.0 torchvision==0.22.0 --index-url https://download.pytorch.org/whl/cu126
python -m pip install --no-build-isolation 'git+https://github.com/facebookresearch/detectron2.git'  
pip install git+https://github.com/cocodataset/panopticapi.git
python3 -m pip install -r requirements.txt
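
After installation, a quick sanity check can confirm that the key dependencies resolved. The snippet below is a small helper we suggest (not part of the repository); it only checks that the packages installed by the commands above are importable.

```python
import importlib.util

# Import names corresponding to the packages installed above.
packages = ["torch", "torchvision", "detectron2", "panopticapi"]

# find_spec returns None when a package cannot be located.
missing = [p for p in packages if importlib.util.find_spec(p) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All core dependencies found.")
```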

Weights & Biases (wandb) is used for experiment logging and visualization. To enable wandb, log in to your account:

wandb login

Data preparation

Download and prepare the datasets.

Usage

Evaluation

To evaluate a pre-trained VidEoMT model, first prepare the datasets by following the instructions in the Data preparation section and download the trained weights from the Model Zoo. Once these are set up, run:

python train_net_video.py \
  --num-gpus 1 \
  --config-file /path/to/config.yaml \
  --eval-only MODEL.WEIGHTS /path/to/weight.pth \
  MODEL.MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
  OUTPUT_DIR /path/to/output

🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
🔧 Replace /path/to/output with the path to the output folder.
🔧 Change the value of --num-gpus to the number of GPUs available to you.
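
When evaluating several checkpoints, it can help to wrap the command in a small script so only the variables change. This is a suggested wrapper (not part of the repository); the paths are placeholders to be replaced exactly as in the list above.

```shell
#!/bin/sh
# Hypothetical wrapper around the evaluation command above.
# Edit these variables instead of the command itself.
CONFIG=/path/to/config.yaml
WEIGHTS=/path/to/weight.pth
OUTPUT=/path/to/output
NUM_GPUS=1

# Build and print the command first, so a typo in a path is caught early.
CMD="python train_net_video.py --num-gpus $NUM_GPUS \
  --config-file $CONFIG \
  --eval-only MODEL.WEIGHTS $WEIGHTS \
  MODEL.MODEL.BACKBONE.TEST.WINDOW_SIZE 1 \
  OUTPUT_DIR $OUTPUT"
echo "$CMD"
# Uncomment to actually run the evaluation:
# eval "$CMD"
```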

For detailed instructions on running evaluation on different datasets, see Evaluation.

Benchmark

To calculate the FPS and GFLOPs, run:

python benchmark.py \
  --task fps \
  --config-file    /path/to/config.yaml \
  --model-weights  /path/to/weight.pth  \
  --warmup-iters 100 

export TIMM_FUSED_ATTN=0 
python benchmark.py \
  --task flops \
  --config-file    /path/to/config.yaml \
  --model-weights  /path/to/weight.pth  

🔧 Replace /path/to/config.yaml with the path to the config file.
🔧 Replace /path/to/weight.pth with the path to the checkpoint to evaluate.
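
For reference, the sketch below shows the typical structure of an FPS measurement like the one `benchmark.py` performs: untimed warmup iterations (controlled above by `--warmup-iters`), then a timed loop. This is an assumed illustration with a dummy workload, not the repository's benchmark code; on GPU one would additionally synchronize (e.g. `torch.cuda.synchronize()`) around the timed region so asynchronous kernels are fully counted.

```python
import time

def measure_fps(step, warmup_iters=100, timed_iters=300):
    """Run `warmup_iters` untimed calls so caches and kernels settle,
    then time `timed_iters` calls and report throughput in calls/sec."""
    for _ in range(warmup_iters):
        step()
    start = time.perf_counter()
    for _ in range(timed_iters):
        step()
    elapsed = time.perf_counter() - start
    return timed_iters / elapsed

# Dummy "model step" standing in for one frame of inference.
fps = measure_fps(lambda: sum(i * i for i in range(1000)),
                  warmup_iters=10, timed_iters=50)
print(f"{fps:.1f} FPS")
```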

Demo

We provide example visualizations below.

To generate additional visualization samples, please use the code in Visualization.

Upcoming Features

- [x] Inference code
- [x] Flops and FPS code
- [x] Visualization code 
- [ ] Training code
- [ ] DINOv3 model zoo and code

Model Zoo

We provide pre-trained weights for both DINOv2- and DINOv3-based VidEoMT models.

  • DINOv2 Models - Original published results and pre-trained weights.
  • DINOv3 Models - DINOv3-based models and pre-trained weights.

Citation

If you find this work useful in your research, please cite it using the BibTeX entry below:

@inproceedings{norouzi2026videomt,
  author     = {Norouzi, Narges and Zulfikar, Idil Esen and Cavagnero, Niccol\`{o} and Kerssies, Tommie and Leibe, Bastian and Dubbelman, Gijs and {de Geus}, Daan},
  title      = {{VidEoMT: Your ViT is Secretly Also a Video Segmentation Model}},
  booktitle  = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year       = {2026},
}

Acknowledgements

This project builds upon code from the following libraries and repositories:
