Awesome CLIP
Awesome list for research on CLIP (Contrastive Language-Image Pre-Training).
This repo collects research resources based on CLIP (Contrastive Language-Image Pre-Training), proposed by OpenAI. If you would like to contribute, please open an issue. The project was first published in 2021. It has 1,230 stars and 59 forks on GitHub. Key topics include clip, contrastive-learning, and pre-training.
CLIP
- Learning Transferable Visual Models From Natural Language Supervision [code]
- CLIP: Connecting Text and Images
- Multimodal Neurons in Artificial Neural Networks
Training
- OpenCLIP (3rd-party, PyTorch) [code]
- Train-CLIP (3rd-party, PyTorch) [code]
- Paddle-CLIP (3rd-party, PaddlePaddle) [code]
Applications
GAN
- VQGAN-CLIP [code]
- StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery [code]
- CLIP Guided Diffusion [code]
- CLIP2StyleGAN: Unsupervised Extraction of StyleGAN Edit Directions [code]
- TargetCLIP: Image-Based CLIP-Guided Essence Transfer [code]
- DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation [code]
- Clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP [code]
Object Detection
- Roboflow Zero-shot Object Tracking [code]
- Zero-Shot Detection via Vision and Language Knowledge Distillation [code]
- Crop-CLIP [code]
- Detic: Detecting Twenty-thousand Classes using Image-level Supervision [code]
- CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks
- SLIP: Self-supervision meets Language-Image Pre-training [code]
- ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension [code]
Information Retrieval
- Unsplash Image Search [code]
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [code]
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [code]
- Natural Language YouTube Search [code]
- CLIP-as-service: Embed images and sentences into fixed-length vectors with CLIP [code]
- clip-retrieval [code]
- A CLIP-Hitchhiker’s Guide to Long Video Retrieval [code]
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [code]
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval [code]
- Extending CLIP for Category-to-image Retrieval in E-commerce [code]
Representation Learning
- Wav2CLIP: Learning Robust Audio Representations From CLIP [code]
- CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotation [code]
- RegionCLIP: Region-based Language-Image Pretraining [code]
- CMA-CLIP: Cross-Modality Attention CLIP for Image-Text Classification [code]
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [code]
- CyCLIP: Cyclic Contrastive Language-Image Pretraining [code]
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [code]
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [code]
- UniCLIP: Unified Framework for Contrastive Language–Image Pre-training [code]
- SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model [code]
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [code]
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [code]
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [code]
- Fine-tuned CLIP Models are Efficient Video Learners [code]
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [code]
- Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm [code]
- Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision [code]
Text-to-3D Generation
- CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation [code]
- Text2Mesh: Text-Driven Neural Stylization for Meshes [code]
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [code]
- CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders [code]
- CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields [code]
- MotionCLIP: Exposing Human Motion Generation to CLIP Space [code]
- AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars [code]
- ClipFace: Text-guided Editing of Textured 3D Morphable Models [code]
Text-to-Image Generation
- Big Sleep: A simple command line tool for text to image generation [code]
- Deep Daze: A simple command line tool for text to image generation [code]
- CLIP-CLOP: CLIP-Guided Collage and Photomontage [code]
- CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP [code]
Prompt Learning
- Learning to Prompt for Vision-Language Models [code]
- Conditional Prompt Learning for Vision-Language Models [code]
- Prompt-aligned Gradient for Prompt Tuning [code]
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [code]
- Learning to Compose Soft Prompts for Compositional Zero-Shot Learning [code]
Video Understanding
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [code]
- FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks [code]
- Frozen CLIP Models are Efficient Video Learners [code]
- Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization [code]
- MovieCLIP: Visual Scene Recognition in Movies [code]
Image Captioning
- CLIP prefix captioning [code]
- CLIPScore: A Reference-free Evaluation Metric for Image Captioning [code]
- ClipCap: CLIP Prefix for Image Captioning [code]
- Text-Only Training for Image Captioning using Noise-Injected CLIP [code]
- Fine-grained Image Captioning with CLIP Reward [code]
Image Editing
- HairCLIP: Design Your Hair by Text and Reference Image [code]
- CLIPstyler: Image Style Transfer with a Single Text Condition [code]
- CLIPasso: Semantically-Aware Object Sketching [code]
- Image-based CLIP-Guided Essence Transfer [code]
- CLIPDraw: Synthesize drawings to match a text prompt! [code]
- CLIP-CLOP: CLIP-Guided Collage and Photomontage [code]
- Towards Counterfactual Image Manipulation via CLIP [code]
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [code]
- CLIPascene: Scene Sketching with Different Types and Levels of Abstraction [code]
Image Segmentation
- CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation [code]
- Image Segmentation Using Text and Image Prompts [code]
- Extract Free Dense Labels from CLIP [code]
- Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP [code]
3D Recognition
- PointCLIP: Point Cloud Understanding by CLIP [code]
- CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [code]
- MotionCLIP: Exposing Human Motion Generation to CLIP Space [code]
- LidarCLIP or: How I Learned to Talk to Point Clouds [code]
- CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP [code]
Audio
- AudioCLIP: Extending CLIP to Image, Text and Audio [code]
- Wav2CLIP: Learning Robust Audio Representations From CLIP [code]
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [code]
Language Tasks
Object Navigation
Localization
Others
- Multilingual-CLIP [code]
- CLIP (With Haiku + Jax!) [code]
- CLIP-Event: Connecting Text and Images with Event Structures [code]
- How Much Can CLIP Benefit Vision-and-Language Tasks? [code]
- CLIP meets GamePhysics: Towards bug identification in gameplay videos using zero-shot transfer learning [code]
- CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory [code]
- CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet [code]
- Task Residual for Tuning Vision-Language Models [code]
Acknowledgment
Inspired by Awesome Visual-Transformer.
This article is auto-generated from yzhuoning/Awesome-CLIP via the GitHub API. Last fetched: 6/13/2026
