VLMs Zero to Hero
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
The project is written primarily in Jupyter Notebook, distributed under the Apache License 2.0, and first published in 2024. It has 1,180 stars and 102 forks on GitHub. Key topics include bert-model, clip, computer-vision, embeddings, and gpt.
hello
Welcome to VLMs Zero to Hero! This series will take you on a journey from the
fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
tutorials
| notebook | open in colab | video | paper |
|---|---|---|---|
| 01.01. Word2Vec: Distributed Representations of Words and Phrases and their Compositionality | link | soon | link |
roadmap
natural language processing (NLP) fundamentals
- Word2Vec: Efficient Estimation of Word Representations in Vector Space (2013) and Distributed Representations of Words and Phrases and their Compositionality (2013)
- Seq2Seq: Sequence to Sequence Learning with Neural Networks (2014)
- Transformer: Attention Is All You Need (2017)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
- GPT: Improving Language Understanding by Generative Pre-Training (2018)
computer vision (CV) fundamentals
- AlexNet: ImageNet Classification with Deep Convolutional Neural Networks (2012)
- VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
- ResNet: Deep Residual Learning for Image Recognition (2015)
early vision-language models
- Show and Tell: A Neural Image Caption Generator (2014) and Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (2015)
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020)
- CLIP: Learning Transferable Visual Models from Natural Language Supervision (2021)
scale and efficiency
- Scaling Laws for Neural Language Models (2020)
- LoRA: Low-Rank Adaptation of Large Language Models (2021)
- QLoRA: Efficient Fine-tuning of Quantized LLMs (2023)
modern vision-language models
- Flamingo: A Visual Language Model for Few-Shot Learning (2022)
- LLaVA: Visual Instruction Tuning (2023)
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
- PaliGemma: A versatile 3B VLM for transfer (2024)
extra
contribute and suggest more papers
Are there important papers, models, or techniques we missed? Do you have a favorite
breakthrough in vision-language research that isn't listed here? We’d love to hear
your suggestions!
