TIPSv2 (CVPR'26) and TIPS (ICLR'25)

From google-deepmind · Updated June 18, 2026

The project is written primarily in Jupyter Notebook, is distributed under the Apache License 2.0, and was first published in 2025. Key topics include image-text, spatial-understanding, and vision-encoder.


TIPS / TIPSv2

This repository contains the implementation and models introduced in:

  • TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment, CVPR 2026
  • TIPS: Text-Image Pretraining with Spatial Awareness, ICLR 2025
<p align="center"> <img src="./docs/images/overview.png" style="width:75%;" > </p>

The TIPS series of models (Text-Image Pretraining with Spatial Awareness) are foundational image-text encoders built for general-purpose computer vision and multimodal applications. Our models were validated on a comprehensive suite of 9 tasks and 20 datasets, displaying excellent performance that matches or exceeds other recent vision encoders, with particularly strong spatial awareness.

We recommend using the latest version, TIPSv2, but still provide the earlier TIPSv1 for completeness. For a more detailed overview, please visit the <a href="https://gdm-tipsv2.github.io/">Project Webpage</a> and read the two papers listed above.

See also our demos and notebooks for a quick start.

<p align="center"> <img src="./docs/images/pca.png" style="width:60%;" > </p>

Demos and notebooks

  • Demo-HF: HuggingFace demo for feature visualization, zero-shot segmentation, depth and normals estimation, and supervised segmentation
  • Inference-Colab-Pytorch: inference Colab in PyTorch
  • Inference-Colab-Jax: inference Colab in JAX

We also provide task-specific notebooks:

  • ZS-Pytorch: zero-shot segmentation (PyTorch)
  • FG-Seg-Pytorch: train a linear head for foreground segmentation (PyTorch)
  • DPT-Pytorch: inference with DPT heads for segmentation, depth and normals (PyTorch)
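
For readers new to zero-shot segmentation, the general idea can be sketched as follows: each image patch is labeled with the class prompt whose text embedding is most similar to the patch embedding. This is an illustration of the concept only, not the code in the ZS-Pytorch notebook. The function names and the toy vectors are hypothetical stand-ins for real patch and text embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def zero_shot_segment(patch_embeddings, text_embeddings):
    """Assign each patch the index of its most similar class prompt.

    patch_embeddings: grid (list of rows) of per-patch feature vectors.
    text_embeddings: one feature vector per class prompt.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    label_map = []
    for row in patch_embeddings:
        labels = []
        for patch in row:
            p = l2_normalize(patch)
            # For unit vectors the dot product equals the cosine similarity.
            scores = [sum(a * b for a, b in zip(p, t)) for t in texts]
            labels.append(scores.index(max(scores)))
        label_map.append(labels)
    return label_map


# Toy stand-ins (assumption): a 2x2 patch grid and two class prompts
# in a 3-dimensional feature space.
patches = [
    [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    [[0.1, 0.9, 0.2], [0.0, 1.0, 0.0]],
]
class_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(zero_shot_segment(patches, class_prompts))
```

In practice the label map would be upsampled from the patch grid to the image resolution; that step is omitted here.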

How to use

We provide both Pytorch and Jax (Scenic) implementations:

  • tips/pytorch/: PyTorch inference for the model.
  • tips/scenic/: Jax-based inference using the
    scenic library.

We provide links to all available checkpoints, for both Pytorch and Jax model
definitions, together with representative evals.

TIPSv2 models are also available on HuggingFace.

TIPSv2 models

| Model size | #Params vision / text | Pytorch ckp. | Jax ckp. | PASCAL seg.↑ | NYU-depth↓ | ImageNet-KNN↑ | Flickr I→T↑ | Flickr T→I↑ | ADE150-ZS↑ |
|---|---|---|---|---|---|---|---|---|---|
| g/14 | 1.1B / 389.1M | vision / text | vision / text | 85.1 | 0.334 | 83.7 | 95.1 | 85.9 | 17.8 |
| SO/14 | 412.4M / 448.3M | vision / text | vision / text | 85.2 | 0.339 | 82.8 | 94.8 | 84.0 | 23.3 |
| L/14 | 303.2M / 183.9M | vision / text | vision / text | 85.1 | 0.339 | 82.5 | 95.4 | 83.3 | 24.7 |
| B/14 | 85.7M / 109.6M | vision / text | vision / text | 84.0 | 0.374 | 79.8 | 92.6 | 80.0 | 17.4 |

TIPSv1 models

| Model size | #Params vision / text | Pytorch ckp. | Jax ckp. | PASCAL seg.↑ | NYU-depth↓ | ImageNet-KNN↑ | UNED-KNN↑ | Flickr I→T↑ | Flickr T→I↑ |
|---|---|---|---|---|---|---|---|---|---|
| g/14-HR | 1.1B / 389.1M | vision / text | vision / text | 83.1 | 0.363 | 83.2 | 68.4 | 93.8 | 83.8 |
| g/14-LR | 1.1B / 389.1M | vision / text | vision / text | 82.0 | 0.390 | 83.6 | 71.5 | 93.4 | 82.1 |
| SO/14-HR | 412.4M / 448.3M | vision / text | vision / text | 83.7 | 0.362 | 83.0 | 68.6 | 94.2 | 83.8 |
| L/14-HR | 303.2M / 183.9M | vision / text | vision / text | 83.9 | 0.372 | 82.5 | 67.8 | 93.6 | 83.5 |
| B/14-HR | 85.7M / 109.6M | vision / text | vision / text | 82.9 | 0.379 | 80.0 | 62.7 | 91.3 | 79.4 |
| S/14-HR | 21.6M / 33.6M | vision / text | vision / text | 80.6 | 0.425 | 75.1 | 57.7 | 86.3 | 74.7 |
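
When picking a model size, it can help to add up the vision and text parameter counts reported in the tables. The sketch below does that arithmetic on the "#Params vision / text" cells; the helper names are hypothetical, and the totals are only as precise as the rounded figures in the tables.

```python
def parse_params(text):
    """Convert a count such as '85.7M' or '1.1B' to a number of parameters."""
    units = {"M": 1e6, "B": 1e9}
    text = text.strip()
    return float(text[:-1]) * units[text[-1]]


def total_params(cell):
    """Sum the vision and text counts from a 'vision / text' table cell."""
    vision, text = cell.split("/")
    return parse_params(vision) + parse_params(text)


# Values copied from the "#Params vision / text" column of the tables above.
sizes = {
    "g/14": "1.1B / 389.1M",
    "SO/14": "412.4M / 448.3M",
    "L/14": "303.2M / 183.9M",
    "B/14": "85.7M / 109.6M",
    "S/14": "21.6M / 33.6M",  # Listed for TIPSv1 only.
}
for name, cell in sizes.items():
    print(f"{name:6s} {total_params(cell) / 1e6:8.1f}M parameters in total")
```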

Local Installation

To install locally instead of using the Colabs/HF, please follow the instructions below.

Installation (Pytorch)

Manage dependencies with a custom environment (e.g., Conda).

bash
conda create -n tips python=3.11
# Activate the environment.
conda activate tips

Install Pytorch dependencies.

bash
# Install pytorch (change to GPU version if needed)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# Install other dependencies.
pip install tensorflow_text mediapy jax jaxlib scikit-learn
# Optionally, install Jupyter to use the notebook.
pip install jupyter

Clone the code from this repo.

bash
git clone https://github.com/google-deepmind/tips.git
# Add the current directory to PYTHONPATH.
export PYTHONPATH=$PYTHONPATH:$(pwd)
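
A quick way to confirm that the PYTHONPATH change took effect is to ask Python whether it can locate the package. This is a minimal sketch, assuming it is run from the directory that contains the cloned `tips` folder; `is_importable` is a hypothetical helper, not part of the repository.

```python
import importlib.util
import os
import sys


def is_importable(package_name, extra_path=None):
    """Return True if `package_name` can be located on the import path."""
    if extra_path is not None and extra_path not in sys.path:
        # Mirrors the effect of adding the directory to PYTHONPATH.
        sys.path.insert(0, extra_path)
    return importlib.util.find_spec(package_name) is not None


# Assumption: the current directory is the one that contains the `tips` clone.
print("tips importable:", is_importable("tips", extra_path=os.getcwd()))
```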

Download the checkpoints locally. The script downloads all released checkpoints;
edit it first if you only need some of them.

bash
cd tips/pytorch/checkpoints
chmod +x download_checkpoints.sh
./download_checkpoints.sh
cd ../../..
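
After the download finishes, listing the files and their sizes is a simple sanity check. The sketch below assumes the script writes into the directory it is run from (`tips/pytorch/checkpoints`), which the source does not state; `list_checkpoints` is a hypothetical helper.

```python
from pathlib import Path


def list_checkpoints(directory):
    """Return (file name, size in MiB) for every regular file in `directory`."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (p.name, round(p.stat().st_size / 2**20, 1))
        for p in root.iterdir()
        if p.is_file()
    )


# Assumption: checkpoints land in the directory the script was run from.
for name, size_mib in list_checkpoints("tips/pytorch/checkpoints"):
    print(f"{size_mib:10.1f} MiB  {name}")
```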

Usage (Pytorch)

To run inference on one image and get the L2-normalized image embeddings from
the first and second CLS tokens, run:

bash
cd tips/pytorch && \
python run_image_encoder_inference.py \
  --model_path=${PATH_TO_CHECKPOINT} \
  --image_file=${PATH_TO_IMAGE} \
  --model_variant=${MODEL_VARIANT}

One can use is_low_res to specify whether a low-resolution or high-resolution
checkpoint is used.
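
To run the image encoder over many files, the command above can be assembled from Python. The sketch below only builds the argument list from the three flags shown in this section; the paths and the variant string are hypothetical placeholders, and `build_image_inference_command` is not part of the repository.

```python
import shlex


def build_image_inference_command(model_path, image_file, model_variant):
    """Assemble the argument list for run_image_encoder_inference.py."""
    return [
        "python",
        "run_image_encoder_inference.py",
        f"--model_path={model_path}",
        f"--image_file={image_file}",
        f"--model_variant={model_variant}",
    ]


# Hypothetical placeholder values; substitute real paths and a valid variant.
cmd = build_image_inference_command(
    "/path/to/checkpoint", "/path/to/image.jpg", "MODEL_VARIANT"
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, cwd="tips/pytorch", check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.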

To run text model inference and get the L2-normalized text embedding, run the
following command:

bash
cd tips/pytorch && \
python run_text_encoder_inference.py \
  --model_path=${PATH_TO_CHECKPOINT} \
  --tokenizer_path=${PATH_TO_TOKENIZER} \
  --model_variant=${MODEL_VARIANT} \
  --text_input=${TEXT_INPUT}
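
Because both encoders return L2-normalized embeddings, the dot product of an image embedding and a text embedding equals their cosine similarity, which is the usual score for image-text matching. The sketch below ranks candidate texts for one image; the vectors are toy stand-ins rather than outputs of the scripts above, and `rank_texts` is a hypothetical helper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_texts(image_embedding, text_embeddings):
    """Return text indices sorted from most to least similar, plus the scores."""
    image = l2_normalize(image_embedding)
    scores = []
    for text in text_embeddings:
        t = l2_normalize(text)
        # Dot product of unit vectors, i.e. cosine similarity.
        scores.append(sum(a * b for a, b in zip(image, t)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy stand-ins (assumption): one image embedding and three text embeddings.
image_vec = [0.2, 0.9, 0.1]
text_vecs = [[1.0, 0.0, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]
order, scores = rank_texts(image_vec, text_vecs)
print(order)
```

Normalizing again inside the helper is harmless for inputs that are already unit length and keeps the sketch correct for inputs that are not.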

Installation (JAX/Scenic)

Similar to using Pytorch, manage dependencies with a custom environment.

bash
conda create -n tips python=3.11
# Activate the environment.
conda activate tips
bash
# Install scenic.
git clone https://github.com/google-research/scenic.git scenic_src
cd scenic_src
pip install .
cd ..
rm -rf scenic_src
# Install other dependencies.
pip install pillow scikit-learn opencv-python tensorflow_text
# Optionally, install Jupyter to use the notebook.
pip install jupyter mediapy
# In case of using CUDA, install the CUDA-supported JAX libraries.
# For example, for CUDA 12 run:
# pip install --upgrade "jax[cuda12_pip]" -f \
#   https://storage.googleapis.com/jax-releases/jax_cuda_releases.html

Clone the code from this repo.

bash
git clone https://github.com/google-deepmind/tips.git
# Add the current directory to PYTHONPATH.
export PYTHONPATH=$PYTHONPATH:$(pwd)

Download the checkpoints (different files from Pytorch).

bash
cd tips/scenic/checkpoints
chmod +x download_checkpoints.sh
./download_checkpoints.sh
cd ../../..

Usage (Jax)

To run inference on an image, use the following script:

bash
cd tips/scenic
python run_tips_inference.py

Citing this work

The manuscripts for TIPS v1 and v2 can be found on arXiv (v1, v2).

Please consider citing this work using:

@InProceedings{tips_v2_paper,
    Title={{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
    Author={Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Ren\'e and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andr\'e},
    Booktitle={CVPR},
    year={2026},
}

@InProceedings{tips_v1_paper,
    Title={{TIPS: Text-Image Pretraining with Spatial Awareness}},
    Author={Maninis, Kevis-Kokitsi and Chen, Kaifeng and Ghosh, Soham and Karpur, Arjun and Chen, Koert and Xia, Ye and Cao, Bingyi and Salz, Daniel and Han, Guangxing and Dlabal, Jan and Gnanapragasam, Dan and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andr\'e},
    Booktitle={ICLR},
    year={2025},
}

License and disclaimer

Copyright 2025 DeepMind Technologies Limited

All software is licensed under the Apache License, Version 2.0 (Apache 2.0);
you may not use this file except in compliance with the Apache 2.0 license.
You may obtain a copy of the Apache 2.0 license at:
https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0
International License (CC-BY). You may obtain a copy of the CC-BY license at:
https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and
materials distributed here under the Apache 2.0 or CC-BY licenses are
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the licenses for the specific language governing
permissions and limitations under those licenses.

This is not an official Google product.


This article is auto-generated from google-deepmind/tips via the GitHub API. Last fetched: 6/19/2026