HBI
[CVPR 2023 Highlight & TPAMI] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning
The implementation of the CVPR 2023 Highlight (Top 10%) paper [Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning](https://arxiv.org/abs/2303.14369). The project is written primarily in Python, distributed under the Apache License 2.0, and first published in 2023. Key topics include: cross-modal-retrieval, cvpr, video-question-answering, video-retrieval.
In this paper, we creatively model video-text as game players with multivariate cooperative game theory to wisely handle the uncertainty during fine-grained semantic interaction with diverse granularity, flexible combination, and vague intensity.
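To make the game-theoretic idea concrete, the following is a minimal sketch of the standard pairwise Banzhaf interaction index on a toy cooperative game. It is not the code in HBI/models/banzhaf.py; the function `banzhaf_interaction`, the player names, and the payoff `toy_value` are hypothetical and chosen only for illustration. The index averages, over all coalitions of the remaining players, the extra payoff obtained when two players act together rather than separately.

```python
from itertools import combinations


def banzhaf_interaction(players, i, j, value):
    """Pairwise Banzhaf interaction index of players i and j.

    `value` maps a frozenset of players to the payoff of that coalition.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Payoff of i and j together, beyond what each adds on its own.
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
    # Every coalition of the remaining players carries the same weight.
    return total / (2 ** len(others))


# Hypothetical toy game: a payoff is earned only when "frame" and "word" cooperate.
players = ["frame", "word", "background"]


def toy_value(coalition):
    return 1.0 if {"frame", "word"} <= coalition else 0.0


print(banzhaf_interaction(players, "frame", "word", toy_value))
print(banzhaf_interaction(players, "frame", "background", toy_value))
```

In this toy game the pair that must cooperate to earn the payoff receives a positive interaction, while a pair that gains nothing from cooperating receives zero. Enumerating all coalitions is exponential in the number of players, which is one reason a learned estimator is attractive for real video-text inputs.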
📌 Citation
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
@article{jin2024hierarchical,
title={Hierarchical Banzhaf Interaction for General Video-Language Representation Learning},
author={Jin, Peng and Li, Hao and Yuan, Li and Yan, Shuicheng and Chen, Jie},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
@inproceedings{jin2023video,
title={Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning},
author={Jin, Peng and Huang, Jinfa and Xiong, Pengfei and Tian, Shangxuan and Liu, Chang and Ji, Xiangyang and Yuan, Li and Chen, Jie},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={2472--2482},
year={2023}
}
<details open><summary>💡 I also have other text-video retrieval projects that may interest you ✨. </summary><p>

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model<br>
Accepted by ICCV 2023 | [DiffusionRet Code]<br>
Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations<br>
Accepted by NeurIPS 2022 | [EMCL Code]<br>
Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment<br>
Accepted by IJCAI 2023 | [DiCoSA Code]<br>
Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

</p></details>
📣 Updates
- [2023/10/15]: We release our pre-trained estimator weights. If you want to apply the estimator to other tasks, you can initialize a new estimator with the weights we provide. For better performance, you can train the estimator with a smaller learning rate and more epochs.
- [2023/10/11]: We release the code for the Banzhaf Interaction estimator. Recommended running parameters will be provided shortly, and we will also release our pre-trained estimator weights.
- [2023/10/08]: I am working on the code for the Banzhaf Interaction estimator, which is expected to be released soon.
- [2023/06/28]: Release code for reimplementing the experiments in the paper.
- [2023/03/28]: Our HBI has been selected as a Highlight paper at CVPR 2023! (Top 2.5% of 9155 submissions).
- [2023/02/28]: We will release the code as soon as possible. (I am busy with other deadlines. After that, I will open-source the code as soon as possible. Thank you for your understanding.)
⚡ Demo
<div align="center"> </div>

😍 Visualization
Example 1
<div align=center> <img src="static/images/Visualization_1.png" width="800px"> </div>

<details> <summary><b>More examples</b></summary>

Example 2
<div align=center> <img src="static/images/Visualization_2.png" width="800px"> </div>

Example 3
<div align=center> <img src="static/images/Visualization_3.png" width="800px"> </div>

Example 4
<div align=center> <img src="static/images/Visualization_4.png" width="800px"> </div>

Example 5
<div align=center> <img src="static/images/Visualization_5.png" width="800px"> </div>

Example 6
<div align=center> <img src="static/images/Visualization_6.png" width="800px"> </div>

Example 7
<div align=center> <img src="static/images/Visualization_0.png" width="800px"> </div>

</details>

🚀 Quick Start
Setup
Setup code environment
```shell
conda create -n HBI python=3.9
conda activate HBI
pip install -r requirements.txt
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
```
Download CLIP Model
```shell
cd HBI/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
```
Download Datasets
<div align=center>

| Datasets | Google Cloud | Baidu Yun | Peking University Yun |
|---|---|---|---|
| MSR-VTT | Download | Download | Download |
| MSVD | Download | Download | Download |
| ActivityNet | TODO | Download | Download |
| DiDeMo | TODO | Download | Download |

</div>
Train the Banzhaf Interaction Estimator
Train the estimator on the labels generated by BanzhafInteraction in HBI/models/banzhaf.py.
The training code is provided in banzhaf_estimator.py. We provide our trained weights; if you want to apply the estimator to other tasks, you can initialize a new estimator with the weights we provide.
We have tested the performance of Estimator_1e-2_epoch6 with R@1 of 48.2 (log) on the MSR-VTT dataset. If you want better performance, you can train the estimator with a smaller learning rate and more epochs.
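For readers unfamiliar with the metric, R@K (Recall at K) is the percentage of queries whose ground-truth item appears among the top K retrieved candidates. The sketch below shows one common way to compute it from a query-by-candidate similarity matrix. It assumes that candidate `i` is the ground truth for query `i`; the function `recall_at_k` and the scores in `sim` are hypothetical and are not taken from this repository's evaluation code.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    Assumes similarity[i][j] scores query i against candidate j, and that
    candidate i is the ground truth for query i (an assumption of this sketch).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: how many candidates score strictly higher than the truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Made-up scores for three text queries against three videos.
sim = [
    [0.9, 0.1, 0.3],  # query 0: ground truth ranked first
    [0.8, 0.5, 0.2],  # query 1: ground truth ranked second
    [0.2, 0.1, 0.7],  # query 2: ground truth ranked first
]

print(recall_at_k(sim, 1))
print(recall_at_k(sim, 2))
```

With these made-up scores, two of the three queries rank their ground truth first, and all three rank it within the top two.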
<div align=center>

| Models | Google Cloud | Baidu Yun | Peking University Yun | log |
|---|---|---|---|---|
| Estimator_1e-2_epoch1 | Download | Download | Download | log |
| Estimator_1e-2_epoch2 | Download | Download | Download | log |
| Estimator_1e-2_epoch3 | Download | Download | Download | log |
| Estimator_1e-2_epoch4 | Download | Download | Download | log |
| Estimator_1e-2_epoch5 | Download | Download | Download | log |
| Estimator_1e-2_epoch6 | Download | Download | Download | log |

</div>
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
banzhaf_estimator.py \
--do_train 1 \
--workers 8 \
--n_display 1 \
--epochs 10 \
--lr 1e-2 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```
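The command above reads `${DATA_PATH}` and `${OUTPUT_PATH}` from the shell environment, and an unset variable silently expands to an empty string. As an optional convenience, the following sketch checks that the variables are set before you launch a run. The helper `missing_variables` is hypothetical and not part of this repository.

```python
import os


def missing_variables(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Variables referenced by the estimator training command.
required = ["DATA_PATH", "OUTPUT_PATH"]
missing = missing_variables(required)
if missing:
    print("Please export before launching: " + ", ".join(missing))
else:
    print("All required variables are set.")
```

The same check can be reused for the retrieval and question-answering commands by adding `CHECKPOINT_PATH` or `ESTIMATOR_PATH` to the list.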
Text-video Retrieval
<div align=center>

| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|---|---|---|---|
| MSR-VTT | Download | Download | Download |
| ActivityNet | Download | Download | Download |

</div>
Eval on MSR-VTT
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
```
Train on MSR-VTT
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1
```
Eval on ActivityNet Captions
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
```
Train on ActivityNet Captions
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 10 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1
```
Video-question Answering
<div align=center>

| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|---|---|---|---|
| MSR-VTT-QA | Download | Download | Download |

</div>
Eval on MSR-VTT-QA
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_vqa.py \
--do_eval \
--num_thread_reader=8 \
--train_csv data/MSR-VTT/qa/train.jsonl \
--val_csv data/MSR-VTT/qa/test.jsonl \
--data_path data/MSR-VTT/qa/train_ans2label.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--max_words 32 \
--max_frames 12 \
--batch_size_val 16 \
--datatype msrvtt \
--expand_msrvtt_sentences \
--feature_framerate 1 \
--freeze_layer_num 0 \
--slice_framepos 2 \
--loose_type \
--linear_patch 2d \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
```
Train on MSR-VTT-QA
```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_vqa.py \
--do_train \
--num_thread_reader=8 \
--epochs=5 \
--batch_size=32 \
--n_display=50 \
--train_csv data/MSR-VTT/qa/train.jsonl \
--val_csv data/MSR-VTT/qa/test.jsonl \
--data_path data/MSR-VTT/qa/train_ans2label.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--lr 1e-4 \
--max_words 32 \
--max_frames 12 \
--batch_size_val 16 \
--datatype msrvtt \
--expand_msrvtt_sentences \
--feature_framerate 1 \
--coef_lr 1e-3 \
--freeze_layer_num 0 \
--slice_framepos 2 \
--loose_type \
--linear_patch 2d \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1
```
🎗️ Acknowledgments
Our code is based on EMCL, CLIP, CLIP4Clip and DRL. We sincerely appreciate their contributions.
