<div align="center"> <h1 align="center">VLMs zero-to-hero</h1> <p>coming: january 2025...</p> </div>

hello

Welcome to VLMs Zero to Hero! This series will take you on a journey from the
fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.

tutorials

notebook	open in colab	video	paper
01.01. Word2Vec: Distributed Representations of Words and Phrases and their Compositionality	link	soon	link

roadmap

natural language processing (NLP) fundamentals

Word2Vec: Efficient Estimation of Word Representations in Vector Space (2013) and Distributed Representations of Words and Phrases and their Compositionality (2013)
Seq2Seq: Sequence to Sequence Learning with Neural Networks (2014)
Attention Is All You Need (2017)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
GPT: Improving Language Understanding by Generative Pre-Training (2018)

computer vision (CV) fundamentals

AlexNet: ImageNet Classification with Deep Convolutional Neural Networks (2012)
VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
ResNet: Deep Residual Learning for Image Recognition (2015)

early vision-language models

scale and efficiency

Scaling Laws for Neural Language Models (2020)
LoRA: Low-Rank Adaptation of Large Language Models (2021)
QLoRA: Efficient Finetuning of Quantized LLMs (2023)

modern vision-language models

Flamingo: A Visual Language Model for Few-Shot Learning (2022)
LLaVA: Visual Instruction Tuning (2023)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (2023)
PaliGemma: A versatile 3B VLM for transfer (2024)

extra

BLEU: a Method for Automatic Evaluation of Machine Translation (2002)

contribute and suggest more papers

Are there important papers, models, or techniques we missed? Do you have a favorite
breakthrough in vision-language research that isn't listed here? We’d love to hear
your suggestions!