GitPedia

Vision Language Transformer

[ICCV2021 & TPAMI2023] Vision-Language Transformer and Query Generation for Referring Segmentation

From henghuiding·Updated June 13, 2026·View on GitHub·

Please consider citing our paper in your publications if the project helps your research. The project is written primarily in Python, distributed under the MIT License license, first published in 2021. Key topics include: iccv2021, keras, referring-segmentation, tensorflow, tpami.

Vision-Language Transformer and Query Generation for Referring Segmentation

Please consider citing our paper in your publications if the project helps your research.

@inproceedings{vision-language-transformer,
  title={Vision-Language Transformer and Query Generation for Referring Segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2021}
}

Introduction

Vision-Language Transformer (VLT) is a framework for referring segmentation task. Our method produces multiple query vector for one input language expression, and use each of them to “query” the input image, generating a set of responses. Then the network selectively aggregates these responses, in which queries that provide better comprehensions are spotlighted.

<p align="center"> <img src="fig0.png" width="500px"> </p>

Installation

  1. Environment:

    • Python 3.6

    • tensorflow 1.15

    • Other dependencies in requirements.txt

    • SpaCy model for embedding:

      python -m spacy download en_vectors_web_lg

  2. Dataset preparation

    • Put the folder of COCO training set ("train2014") under data/images/.

    • Download the RefCOCO dataset from here and extract them to data/. Then run the script for data preparation under data/:

      cd data
      python data_process_v2.py --data_root . --output_dir data_v2 --dataset [refcoco/refcoco+/refcocog] --split [unc/umd/google] --generate_mask
      

Evaluating

  1. Download pretrained models & config files from here.

  2. In the config file, set:

    • evaluate_model: path to the pretrained weights
    • evaluate_set: path to the dataset for evaluation.
  3. Run

    python vlt.py test [PATH_TO_CONFIG_FILE]
    

Training

  1. Pretrained Backbones:
    We use the backbone weights proviede by MCN.

    Note: we use the backbone that excludes all images that appears in the val/test splits of RefCOCO, RefCOCO+ and RefCOCOg.

  2. Specify hyperparameters, dataset path and pretrained weight path in the configuration file. Please refer to the examples under /config, or config file of our pretrained models.

  3. Run

    python vlt.py train [PATH_TO_CONFIG_FILE]
    

Acknowledgement

We borrowed a lot of codes from MCN, keras-transformer, RefCOCO API and keras-yolo3. Thanks for their excellent works!

Contributors

Showing top 2 contributors by commit count.

View all contributors on GitHub →

This article is auto-generated from henghuiding/Vision-Language-Transformer via the GitHub API.Last fetched: 6/14/2026