VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

From MCG-NJU · Updated June 16, 2026

> [**VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training**](https://arxiv.org/abs/2203.12602)
> [Zhan Tong](https://github.com/yztongzhan), [Yibing Song](https://ybsong00.github.io/), [Jue Wang](https://juewang725.github.io/), [Limin Wang](http://wanglimin.github.io/)
> Nanjing University, Tencent AI Lab

The project is written primarily in Python, distributed under the CC BY-NC 4.0 license (reported by GitHub as "Other"), and was first published in 2022. It has 1,760 stars and 168 forks on GitHub. Key topics include: action-recognition, mae, masked-autoencoder, neurips-2022, pytorch.

Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

VideoMAE Framework

License: CC BY-NC 4.0

📰 News

[2023.4.18] 🎈 Everyone can download Kinetics-400, which is used in VideoMAE, from this link.
[2023.4.18] Code and pre-trained models of VideoMAE V2 have been released! Check and enjoy this repo!
[2023.4.17] We propose EVAD, an end-to-end Video Action Detection framework.
[2023.2.28] Our VideoMAE V2 is accepted by CVPR 2023! 🎉
[2023.1.16] Code and pre-trained models for Action Detection in VideoMAE are available!
[2022.12.27] 🎈 Everyone can download extracted VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from InternVideo.
[2022.11.20] 👀 VideoMAE is integrated into Hugging Face Spaces and Colab, supported by @Sayak Paul.
[2022.10.25] 👀 VideoMAE is integrated into MMAction2; the results on Kinetics-400 can be reproduced successfully.
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available!
[2022.10.19] The pre-trained models and scripts on UCF101 are available!
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉
[2022.8.8] 👀 VideoMAE is now integrated into the official 🤗 Hugging Face Transformers library!
[2022.7.7] We have updated new results on the downstream AVA 2.2 benchmark. Please refer to our paper for details.
[2022.4.24] Code and pre-trained models are available now!
[2022.3.24] Code and pre-trained models will be released here. Welcome to watch this repository for the latest updates.

✨ Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs the task of masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
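The tube masking idea can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the function name `tube_mask`, the 8 x 14 x 14 token grid, and the seed are assumptions chosen for the example. It samples one random spatial mask and reuses it for every temporal slice, so a masked patch position stays masked across the whole clip.

```python
import random


def tube_mask(num_temporal, num_spatial, mask_ratio, seed=None):
    """Return a num_temporal x num_spatial grid of booleans (True = masked).

    The same spatial positions are masked in every temporal slice,
    so each masked position forms a "tube" through time.
    """
    rng = random.Random(seed)
    num_masked = int(round(mask_ratio * num_spatial))
    masked_positions = set(rng.sample(range(num_spatial), num_masked))
    frame_mask = [pos in masked_positions for pos in range(num_spatial)]
    # Repeat the identical spatial mask along the temporal axis.
    return [list(frame_mask) for _ in range(num_temporal)]


# Hypothetical token grid: 8 temporal slices of 14 x 14 spatial patches.
mask = tube_mask(num_temporal=8, num_spatial=14 * 14, mask_ratio=0.9, seed=0)
visible_per_slice = sum(not m for m in mask[0])
print(f"masked per slice: {sum(mask[0])}, visible per slice: {visible_per_slice}")
```

Because every temporal slice shares one mask, a masked patch cannot be recovered by simply copying the same location from a neighbouring frame.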

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder and a plain ViT backbone to perform video self-supervised learning. Due to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
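To get a feel for why a high masking ratio reduces encoder work, the sketch below counts how many tokens remain visible when the encoder only processes unmasked tokens. The 16 x 16 patch size and the temporal tubelet size of 2 are assumptions taken for illustration, and `token_counts` is a hypothetical helper. The calculation does not derive the reported 3.2x speedup, which is an empirical measurement.

```python
def token_counts(frames, height, width, patch=16, tubelet=2, mask_ratio=0.9):
    """Return (total_tokens, visible_tokens) for one video clip.

    patch and tubelet are assumed values for this sketch.
    """
    total = (frames // tubelet) * (height // patch) * (width // patch)
    masked = int(mask_ratio * total)
    return total, total - masked


for ratio in (0.90, 0.95):
    total, visible = token_counts(16, 224, 224, mask_ratio=ratio)
    print(f"mask ratio {ratio:.2f}: {visible} of {total} tokens reach the encoder")
```

Under these assumptions a 16-frame 224x224 clip yields 1568 tokens, of which only about a tenth or fewer are visible, so the encoder runs on a much shorter sequence than it would without masking.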

😮 High performance, but NO extra data required

VideoMAE works well for video datasets of different scales and can achieve 87.4% top-1 accuracy on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones without needing any extra data or pre-trained models.

🚀 Main Results

✨ Something-Something V2

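| Method | Extra Data | Backbone | Resolution | Frames x Clips x Crops | Top-1 | Top-5 |
|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |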
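The "Frames x Clips x Crops" column describes multi-view testing: each video is evaluated on several temporal clips and spatial crops, and the per-view predictions are combined. The sketch below assumes the views are combined by averaging class scores, which is common practice but is not stated in this document; `parse_views`, `average_scores`, and `top_k` are hypothetical helpers, and the scores are made-up numbers.

```python
def parse_views(spec):
    """Split a 'frames x clips x crops' string into (frames, number_of_views)."""
    frames, clips, crops = (int(x) for x in spec.split("x"))
    return frames, clips * crops


def average_scores(view_scores):
    """Average per-class scores over all views of one video."""
    num_views = len(view_scores)
    num_classes = len(view_scores[0])
    return [sum(v[c] for v in view_scores) / num_views for c in range(num_classes)]


def top_k(scores, k):
    """Return the indices of the k highest-scoring classes."""
    return sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)[:k]


frames, num_views = parse_views("16x2x3")  # 16 frames, 2 clips x 3 crops
# Made-up scores for 2 views and 3 classes, for illustration only.
video_scores = average_scores([[0.2, 0.7, 0.1], [0.5, 0.4, 0.1]])
print(frames, num_views, top_k(video_scores, k=1))
```

A video counts as a top-1 hit when its label equals the single best class after aggregation, and as a top-5 hit when the label is among the five best.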

✨ Kinetics-400

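| Method | Extra Data | Backbone | Resolution | Frames x Clips x Crops | Top-1 | Top-5 |
|---|---|---|---|---|---|---|
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |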

✨ AVA 2.2

Please check the code and checkpoints in VideoMAE-Action-Detection.

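| Method | Extra Data | Extra Label | Backbone | Frames x Sample Rate | mAP |
|---|---|---|---|---|---|
| VideoMAE | Kinetics-400 | | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | | ViT-L | 16x4 | 39.3 |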

✨ UCF101 & HMDB51

| Method | Extra Data | Backbone | UCF101 | HMDB51 |
|---|---|---|---|---|
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instructions are in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instructions are in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide a visualization script in vis.sh. A Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: tongzhan@smail.nju.edu.cn

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their kind support.
This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

The majority of this project is released under the CC-BY-NC 4.0 license as found in the LICENSE file. Portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license. BEiT is licensed under the MIT license.

✏️ Citation

If you think this project is helpful, please feel free to leave a star⭐️ and cite our paper:

@inproceedings{tong2022videomae,
  title={Video{MAE}: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Zhan Tong and Yibing Song and Jue Wang and Limin Wang},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}

This article is auto-generated from MCG-NJU/VideoMAE via the GitHub API. Last fetched: 6/16/2026