GitPedia

HBI

[CVPR 2023 Highlight & TPAMI] Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

From jpthu17 · Updated May 28, 2026

The implementation of the CVPR 2023 Highlight (Top 10%) paper [Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning](https://arxiv.org/abs/2303.14369). The project is written primarily in Python, distributed under the Apache License 2.0, and was first published in 2023. Key topics include cross-modal-retrieval, cvpr, video-question-answering, and video-retrieval.

<div align="center">

【CVPR'2023 Highlight🔥&TPAMI】Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

</div>

In this paper, we creatively model video-text as game players with multivariate cooperative game theory to wisely handle the uncertainty during fine-grained semantic interaction with diverse granularity, flexible combination, and vague intensity.
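To make the game-theoretic idea concrete, here is a minimal sketch of the classical pairwise Banzhaf interaction index on a toy cooperative game. The value function `toy_value`, the player names, and the scores are hypothetical and chosen only for illustration; this is not the repository's implementation, which may differ in how coalitions and values are defined.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Pairwise Banzhaf interaction index of players i and j.

    Averages v(S + {i, j}) - v(S + {i}) - v(S + {j}) + v(S)
    over every coalition S of the remaining players.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
    return total / 2 ** len(others)


# Hypothetical toy game: one video frame and two words. The frame and the
# word "dog" only score when they appear together; "the" adds a small
# bonus that does not depend on any other player.
def toy_value(coalition):
    score = 0.0
    if {"frame", "dog"} <= coalition:
        score += 1.0
    if "the" in coalition:
        score += 0.5
    return score


players = ["frame", "dog", "the"]
print(banzhaf_interaction(players, toy_value, "frame", "dog"))  # 1.0
print(banzhaf_interaction(players, toy_value, "frame", "the"))  # 0.0
```

A positive index means the two players contribute more together than separately, which is the signal used to decide which video and text units are worth pairing. Note that the exact computation enumerates every coalition of the remaining players, so its cost grows exponentially with the number of players.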

📌 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

```bibtex
@article{jin2024hierarchical,
  title={Hierarchical Banzhaf Interaction for General Video-Language Representation Learning},
  author={Jin, Peng and Li, Hao and Yuan, Li and Yan, Shuicheng and Chen, Jie},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}

@inproceedings{jin2023video,
  title={Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning},
  author={Jin, Peng and Huang, Jinfa and Xiong, Pengfei and Tian, Shangxuan and Liu, Chang and Ji, Xiangyang and Yuan, Li and Chen, Jie},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2472--2482},
  year={2023}
}
```

<details open><summary>💡 I also have other text-video retrieval projects that may interest you ✨. </summary><p>

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model<br>
Accepted by ICCV 2023 | [DiffusionRet Code]<br>
Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen

Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations<br>
Accepted by NeurIPS 2022 | [EMCL Code]<br>
Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song, David Clifton, Jie Chen

Text-Video Retrieval with Disentangled Conceptualization and Set-to-Set Alignment<br>
Accepted by IJCAI 2023 | [DiCoSA Code]<br>
Peng Jin, Hao Li, Zesen Cheng, Jinfa Huang, Zhennan Wang, Li Yuan, Chang Liu, Jie Chen

</p></details>

📣 Updates

  • [2023/10/15]: We release our pre-trained estimator weights. If you want to apply the estimator to other tasks, you can initialize a new estimator with the weights we provide. For better performance, you can train the estimator with a smaller learning rate and more epochs.
  • [2023/10/11]: We release code for Banzhaf Interaction estimator. Recommended running parameters will be provided shortly, and we will also release our pre-trained estimator weights.
  • [2023/10/08]: I am working on the code for Banzhaf Interaction estimator, which is expected to be released soon.
  • [2023/06/28]: Release code for reimplementing the experiments in the paper.
  • [2023/03/28]: Our HBI has been selected as a Highlight paper at CVPR 2023! (Top 2.5% of 9155 submissions).
  • [2023/02/28]: We will release the code asap. (I am busy with other DDLs. After that, I will open the source code as soon as possible. Please understand.)

⚡ Demo

<div align="center">

https://user-images.githubusercontent.com/53246557/221760113-4a523e7e-d743-4dff-9f16-357ab0be0d5b.mp4

</div>

😍 Visualization

Example 1

<div align=center> <img src="static/images/Visualization_1.png" width="800px"> </div> <details> <summary><b>More examples</b></summary>

Example 2

<div align=center> <img src="static/images/Visualization_2.png" width="800px"> </div>

Example 3

<div align=center> <img src="static/images/Visualization_3.png" width="800px"> </div>

Example 4

<div align=center> <img src="static/images/Visualization_4.png" width="800px"> </div>

Example 5

<div align=center> <img src="static/images/Visualization_5.png" width="800px"> </div>

Example 6

<div align=center> <img src="static/images/Visualization_6.png" width="800px"> </div>

Example 7

<div align=center> <img src="static/images/Visualization_0.png" width="800px"> </div> </details>

🚀 Quick Start

Setup

Setup code environment

```shell
conda create -n HBI python=3.9
conda activate HBI
pip install -r requirements.txt
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html
```

Download CLIP Model

```shell
cd HBI/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt
```
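The long hexadecimal path segment in each CLIP download URL appears to be the SHA256 digest of the checkpoint, which the official CLIP loader uses to validate downloads. Under that assumption, a downloaded file can be checked by hand with a sketch like the following. The helper names `expected_sha256` and `file_sha256` are made up for this example, and the local path assumes the script is run from HBI/models.

```python
import hashlib
import os


def expected_sha256(url):
    """Return the digest embedded as the second-to-last URL path segment."""
    return url.rstrip("/").split("/")[-2]


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


VIT_B_32_URL = (
    "https://openaipublic.azureedge.net/clip/models/"
    "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
)

if __name__ == "__main__":
    path = "ViT-B-32.pt"  # assumed location: the current directory
    if os.path.exists(path):
        ok = file_sha256(path) == expected_sha256(VIT_B_32_URL)
        print("checksum OK" if ok else "checksum mismatch, re-download the file")
    else:
        print(f"{path} not found")
```

A mismatch usually points to an interrupted download, in which case re-running `wget` is the simplest fix.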

Download Datasets

<div align=center>

| Datasets | Google Cloud | Baidu Yun | Peking University Yun |
|:---:|:---:|:---:|:---:|
| MSR-VTT | Download | Download | Download |
| MSVD | Download | Download | Download |
| ActivityNet | TODO | Download | Download |
| DiDeMo | TODO | Download | Download |

</div>

Train the Banzhaf Interaction Estimator

Train the estimator on the labels generated by `BanzhafInteraction` in HBI/models/banzhaf.py.

The training code is provided in banzhaf_estimator.py. We provide our trained weights; if you want to apply the estimator to other tasks, you can initialize a new estimator with them.

We have tested Estimator_1e-2_epoch6, which reaches an R@1 of 48.2 on the MSR-VTT dataset (see the corresponding log). For better performance, you can train the estimator with a smaller learning rate and more epochs.
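R@1 is the recall-at-1 retrieval metric: the percentage of text queries whose ground-truth video is ranked first. The sketch below computes recall@K from a similarity matrix, assuming that the correct video for query i is video i. That pairing is a common evaluation convention but is not confirmed from this repository's code, and the function name and toy scores are illustrative only.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose matching item ranks in the top k.

    sim[i][j] is the similarity between text query i and video j;
    the ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scored strictly higher than the true one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


toy_sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.6, 0.2],  # query 1: correct video ranked 2nd
    [0.2, 0.4, 0.7],  # query 2: correct video ranked 1st
]
print(recall_at_k(toy_sim, 1))  # 2 of 3 queries hit
print(recall_at_k(toy_sim, 2))  # all 3 queries hit
```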

<div align=center>

| Models | Google Cloud | Baidu Yun | Peking University Yun | log |
|:---:|:---:|:---:|:---:|:---:|
| Estimator_1e-2_epoch1 | Download | Download | Download | log |
| Estimator_1e-2_epoch2 | Download | Download | Download | log |
| Estimator_1e-2_epoch3 | Download | Download | Download | log |
| Estimator_1e-2_epoch4 | Download | Download | Download | log |
| Estimator_1e-2_epoch5 | Download | Download | Download | log |
| Estimator_1e-2_epoch6 | Download | Download | Download | log |

</div>

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=4 \
banzhaf_estimator.py \
--do_train 1 \
--workers 8 \
--n_display 1 \
--epochs 10 \
--lr 1e-2 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--output_dir ${OUTPUT_PATH}
```

Text-video Retrieval

<div align=center>

| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|:---:|:---:|:---:|:---:|
| MSR-VTT | Download | Download | Download |
| ActivityNet | Download | Download | Download |

</div>

Eval on MSR-VTT

```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
```
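When scripting several runs, it can help to assemble this command programmatically instead of exporting shell variables. The sketch below only builds the argument list for the MSR-VTT evaluation command above. `build_eval_command` and its parameters are hypothetical helpers, not part of the repository, the paths are placeholders, and every flag is copied from the command above.

```python
import shlex


def build_eval_command(data_path, checkpoint_path, output_path, nproc=2, port=2502):
    """Assemble the MSR-VTT evaluation command as an argv list."""
    return [
        "python", "-m", "torch.distributed.launch",
        "--master_port", str(port),
        f"--nproc_per_node={nproc}",
        "main_retrieval.py",
        "--do_eval", "1",
        "--workers", "8",
        "--n_display", "50",
        "--batch_size_val", "128",
        "--anno_path", "data/MSR-VTT/anns",
        "--video_path", f"{data_path}/MSRVTT_Videos",
        "--datatype", "msrvtt",
        "--max_words", "24",
        "--max_frames", "12",
        "--video_framerate", "1",
        "--init_model", checkpoint_path,
        "--output_dir", output_path,
    ]


# Placeholder paths; replace them with your own locations.
cmd = build_eval_command("/path/to/data", "/path/to/checkpoint", "/path/to/output")
print(shlex.join(cmd))
```

To launch the run, the list can be passed to `subprocess.run(cmd, env={**os.environ, "CUDA_VISIBLE_DEVICES": "0,1"})`, which mirrors the GPU selection in the shell version.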

Train on MSR-VTT

```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 50 \
--epochs 5 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path data/MSR-VTT/anns \
--video_path ${DATA_PATH}/MSRVTT_Videos \
--datatype msrvtt \
--max_words 24 \
--max_frames 12 \
--video_framerate 1 \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1
```

Eval on ActivityNet Captions

```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_retrieval.py \
--do_eval 1 \
--workers 8 \
--n_display 50 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
```

Train on ActivityNet Captions

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=8 \
main_retrieval.py \
--do_train 1 \
--workers 8 \
--n_display 10 \
--epochs 10 \
--lr 1e-4 \
--coef_lr 1e-3 \
--batch_size 128 \
--batch_size_val 128 \
--anno_path ${DATA_PATH}/ActivityNet \
--video_path ${DATA_PATH}/ActivityNet/Activity_Videos \
--datatype activity \
--max_words 64 \
--max_frames 64 \
--video_framerate 1 \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1
```

Video-question Answering

<div align=center>

| Checkpoint | Google Cloud | Baidu Yun | Peking University Yun |
|:---:|:---:|:---:|:---:|
| MSR-VTT-QA | Download | Download | Download |

</div>

Eval on MSR-VTT-QA

```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_vqa.py \
--do_eval \
--num_thread_reader=8 \
--train_csv data/MSR-VTT/qa/train.jsonl \
--val_csv data/MSR-VTT/qa/test.jsonl \
--data_path data/MSR-VTT/qa/train_ans2label.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--max_words 32 \
--max_frames 12 \
--batch_size_val 16 \
--datatype msrvtt \
--expand_msrvtt_sentences \
--feature_framerate 1 \
--freeze_layer_num 0 \
--slice_framepos 2 \
--loose_type \
--linear_patch 2d \
--init_model ${CHECKPOINT_PATH} \
--output_dir ${OUTPUT_PATH}
```

Train on MSR-VTT-QA

```shell
CUDA_VISIBLE_DEVICES=0,1 \
python -m torch.distributed.launch \
--master_port 2502 \
--nproc_per_node=2 \
main_vqa.py \
--do_train \
--num_thread_reader=8 \
--epochs=5 \
--batch_size=32 \
--n_display=50 \
--train_csv data/MSR-VTT/qa/train.jsonl \
--val_csv data/MSR-VTT/qa/test.jsonl \
--data_path data/MSR-VTT/qa/train_ans2label.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--lr 1e-4 \
--max_words 32 \
--max_frames 12 \
--batch_size_val 16 \
--datatype msrvtt \
--expand_msrvtt_sentences \
--feature_framerate 1 \
--coef_lr 1e-3 \
--freeze_layer_num 0 \
--slice_framepos 2 \
--loose_type \
--linear_patch 2d \
--estimator ${ESTIMATOR_PATH} \
--output_dir ${OUTPUT_PATH} \
--kl 2 \
--skl 1
```

🎗️ Acknowledgments

Our code is based on EMCL, CLIP, CLIP4Clip and DRL. We sincerely appreciate their contributions.

This article is auto-generated from jpthu17/HBI via the GitHub API. Last fetched: 6/28/2026