GitPedia

How to train your gpt

Build a modern LLM from scratch. Every line commented. Explained like we are five.

From raiyanyahya · Updated June 18, 2026

> *A guide to building a world-class language model from absolute scratch. Taught like you're five. Built like you're an engineer.*
>
> *I made this with the goal of learning something I didn't understand completely. Specifically the attention part. I use AI a lot to understand key concepts and verify them.*

The project is written primarily in Jupyter Notebook, distributed under the MIT License, and first published in 2026. It has 2,244 stars and 299 forks on GitHub. Key topics include: attention-mechanism, deep-learning, educational, from-scratch, gpt.

🧠 How to Train Your GPT


<p align="center"> <img src="https://img.shields.io/badge/chapters-12-blue" alt="12 chapters"> <img src="https://img.shields.io/badge/lines-7%2C500%2B-green" alt="7,500+ lines"> <img src="https://img.shields.io/badge/topics_explained-26-teal" alt="26 topic explainers"> <img src="https://img.shields.io/badge/code%20commented-100%25-brightgreen" alt="100% commented"> <img src="https://img.shields.io/badge/prerequisite-python%20basics-orange" alt="Python basics only"> <img src="https://img.shields.io/badge/architecture-LLaMA%203%20style-purple" alt="LLaMA 3 style"> <img src="https://img.shields.io/badge/purpose-learning%20only-lightgrey" alt="Learning only"> <a href="https://colab.research.google.com/github/raiyanyahya/how-to-train-your-gpt/blob/master/notebooks/colab_train.ipynb"> <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" height="25"> </a> </p>

📖 What Is This?

This is a 12-chapter, 7,500+ line interactive textbook that teaches you how to build, train and run a modern language model from absolute scratch. The same family of architecture behind ChatGPT, Claude, LLaMA and Mistral.

Alongside the chapters there are 26 standalone topic explainers covering every technique in depth. RoPE, attention, RMSNorm, SwiGLU, KV cache, AdamW, mixed precision and more. Plus two narrative walkthroughs that trace a single sentence through the entire model step by step. Each file follows the same style: child language, no jargon, a code example you can run.

You won't just read about Transformers. You'll write every line yourself: tokenizer, embeddings, attention, training loop, inference engine. Every single line annotated to explain what it does and why it's there.


🤔 Why This Exists

Most ML tutorials fall into one of two traps:

| ❌ Too Shallow | ❌ Too Academic | ✅ This Guide |
|---|---|---|
| `model = GPT().fit(data)` | 40-page papers, dense notation | 5-year-old analogies → full working code |
| You learn to call APIs | Assumes PhD in ML | Zero ML experience required |
| No understanding of internals | No worked examples | Every line annotated with WHAT & WHY |

The goal: After finishing, you won't just know that attention "works". You'll understand the variance argument behind 1/√d_k. How RoPE captures relative position through rotation. Why pre-norm beats post-norm for deep networks. And exactly where every gradient flows during backpropagation.
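
Here is one way to see the variance argument for yourself. This is a minimal sketch in plain Python, not code from the guide: `dot_variance` is a hypothetical helper, and it assumes query and key entries are independent with mean 0 and variance 1. Under that assumption the dot product of two d_k-dimensional vectors has variance close to d_k, so dividing the score by √d_k brings the variance back to about 1.

```python
import random

def dot_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q.k for random d_k-dim vectors with unit-variance entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials

for d_k in (16, 64, 256):
    raw = dot_variance(d_k)
    # Dividing every score by sqrt(d_k) divides the variance by d_k.
    scaled = raw / d_k
    print(f"d_k={d_k:4d}  raw variance={raw:7.1f}  after scaling={scaled:.2f}")
```

The raw variance grows with d_k, which would push softmax toward one-hot outputs and tiny gradients. The scaled column stays near 1 no matter how wide the heads are.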


👥 Who Is This For?

| 🧑‍💻 You Are... | 📚 You Need... |
|---|---|
| A Python developer curious about how ChatGPT actually works | Basic Python (functions, classes, lists). No ML experience |
| A student who wants to deeply understand Transformers | Willingness to read ~3,500 lines of commented code |
| An engineer evaluating LLM architectures | Understanding of tradeoffs (RoPE vs learned, RMSNorm vs LayerNorm) |
| Someone who got lost at "attention" in other tutorials | Party analogy + worked numeric example with real numbers |

🔧 Prerequisites: Python basics (variables, functions, classes, pip install). That's it. No calculus, no linear algebra, no PyTorch experience required. We teach those as we go.


🗺️ Chapters

| Chapter | What You'll Learn |
|---|---|
| 0: Overview | What is a GPT? The big picture |
| 1: Setup | Install tools, GPU vs CPU, venv, PyTorch basics |
| 2: Tokenization | BPE walkthrough: how "unbelievably" becomes tokens |
| 3: Embeddings | How numbers become meaning. king − man + woman = queen |
| 4: Positional Encoding | RoPE: why LLaMA rotates vectors, not adds numbers |
| 5: Attention | ⭐ THE CORE. Q,K,V, scaling, causal mask, 8-step walkthrough |
| 6: Transformer Block | RMSNorm, SwiGLU, residuals, pre-norm vs post-norm |
| 7: Complete GPT Model | 151M parameter model (with SwiGLU), weight tying, logits explained |
| 8: Training Pipeline | Cross-entropy, backprop, AdamW, cosine warmup, mixed precision |
| 9: Inference | KV cache, temperature, top-k/p, beam search, repetition penalty |
| 10: Full Script | Runnable main.py: everything in one file |
| 11: Glossary | Architecture provenance table, parameter breakdown |

⭐ Start with Chapter 0 and read sequentially. Each builds on the previous.
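
To give a taste of what the tokenization chapter covers, here is a minimal sketch of the BPE training loop: count adjacent symbol pairs, merge the most frequent one, repeat. It is an illustration, not the guide's tokenizer. The toy word counts and the helper names `most_frequent_pair` and `merge_pair` are made up for this example, and real tokenizers work on bytes with far larger corpora.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus (hypothetical counts): each word starts as a tuple of characters.
words = {tuple("ably"): 4, tuple("unable"): 3, tuple("unbelievably"): 2, tuple("stab"): 1}

merges = []
for _ in range(4):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)
print([w for w in words if w[0] == "un"])
```

On this toy corpus the four learned merges are a+b, ab+l, abl+y and u+n, so "unbelievably" ends up starting with the token "un" and ending with the token "ably". With more data and more merges the middle collapses into larger pieces too.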


🏗️ What You'll Build

| 🧩 Component | 📏 Lines | 💡 What You'll Understand |
|---|---|---|
| BPE Tokenizer | ~60 | How GPT-4 splits "unbelievably" → "un" + "believ" + "ably" |
| Embeddings | ~30 | How "cat" and "dog" end up near each other in 768D space |
| RoPE | ~70 | Why LLaMA rotates vectors instead of adding position numbers |
| Multi-Head Attention | ~120 | The exact 8-step computation behind every modern LLM |
| Transformer Block | ~50 | Why residual connections are the "gradient highway" |
| Full GPT Model | ~200 | 151M parameter model with SwiGLU, weight tying and pre-norm |
| Training Pipeline | ~250 | AdamW, cosine warmup, mixed precision, gradient accumulation |
| Inference Engine | ~80 | KV cache, temperature, top-k/p, beam search |

💎 ~860 lines of core model code, ~2,600 lines of explanation and diagrams
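
The RoPE row is easier to believe once you see it with numbers. The sketch below is an illustration rather than the guide's implementation: it uses a single 2-D pair and one made-up rotation frequency (`theta=0.5`), while a real model rotates many pairs at different frequencies. The point it shows is that after rotating the query and key by their positions, the attention score depends only on how far apart the two tokens are.

```python
import math

def rotate(vec, position, theta=0.5):
    """Rotate a 2-D vector by position * theta radians (one RoPE frequency pair)."""
    x, y = vec
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and the rotated key."""
    rq, rk = rotate(q, q_pos), rotate(k, k_pos)
    return rq[0] * rk[0] + rq[1] * rk[1]

q, k = (1.0, 2.0), (0.5, -1.0)

near = score(q, k, q_pos=5, k_pos=3)        # tokens 2 positions apart
far = score(q, k, q_pos=105, k_pos=103)     # still 2 apart, shifted by 100
other = score(q, k, q_pos=5, k_pos=4)       # 1 position apart

print(near, far, other)
```

`near` and `far` agree up to floating-point error even though the absolute positions differ by 100, while `other` is different because the offset changed. No learned position parameters are involved.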


🏛️ Architecture

This guide implements the latest publicly-documented decoder-only Transformer:

| 🧬 Technique | 📦 Source Model | ⚡ Why It Matters |
|---|---|---|
| RoPE | LLaMA, Mistral, Qwen | Relative position without learned parameters |
| RMSNorm | LLaMA, Mistral, Gemma | 15% faster than LayerNorm, equally effective |
| SwiGLU | PaLM, LLaMA, Gemini | Learns which information to pass or block |
| Pre-Norm | GPT-3, all modern | Stable training at 100+ layers |
| AdamW | GPT-3+ | Better generalization than vanilla Adam |
| BPE | GPT-2/3/4 | Handles any text. Even unseen words and emoji |
| Weight Tying | GPT-2/3 | Saves 30% parameters, improves training signal |
| Mixed Precision | All production LLMs | 2× speed, half memory, same quality |

ℹ️ GPT-4 and Claude architectures are proprietary/undisclosed. This teaches the best publicly-confirmed architecture: what LLaMA 3, Mistral and Qwen 2.5 use.
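
As a small example of one row in that table, here is RMSNorm written in plain Python. Treat it as a sketch under assumptions: the function name `rms_norm` and the `eps` value are choices made for this example, and the guide's own version is a PyTorch module. The idea is that RMSNorm only rescales by the root-mean-square of the vector. It skips the mean subtraction and the bias that LayerNorm has, which is where the savings come from.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x so its root-mean-square is about 1, then apply a learned per-feature weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

x = [1.0, 2.0, 3.0, 4.0]
weight = [1.0] * len(x)          # the learned weights usually start at 1

y = rms_norm(x, weight)
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_after)
```

After normalization the RMS of the output is about 1, and the relative sizes of the features are unchanged.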


🚀 Quick Start

```bash
# 1. Clone
git clone https://github.com/raiyanyahya/how-to-train-your-gpt.git
cd how-to-train-your-gpt

# 2. Create environment
python -m venv gpt_env
source gpt_env/bin/activate      # Mac/Linux
# gpt_env\Scripts\activate       # Windows

# 3. Install dependencies (CPU version. For GPU see below)
# The PyTorch index only hosts PyTorch packages, so install the rest from PyPI.
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install tiktoken datasets numpy matplotlib

# Or use the requirements file
pip install -r requirements.txt

# 4. Verify GPU (optional but recommended)
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"

# 5. Start reading!
open chapters/00_overview.md     # macOS; use any editor or viewer elsewhere
```

Run the training script:

```bash
python main.py
```

💻 This uses the tiny config by default (d_model=256, 4 layers, 17M params), which trains in a few minutes on CPU. For the full GPT-2 scale config (151M params, 768 dims, 12 layers), edit the config in main.py and uncomment the larger configuration. You'll need a GPU for that one.
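
If you want to sanity-check those parameter counts, the arithmetic fits in a few lines. This is a back-of-the-envelope sketch, not a readout of main.py. It assumes a 50,257-token GPT-2 vocabulary, a SwiGLU hidden size of 4 × d_model, no bias terms, tied input and output embeddings, and RoPE (which adds no parameters). The function name `count_params` is made up for this example.

```python
def count_params(d_model, num_layers, vocab_size=50257, ffn_mult=4):
    """Rough parameter count for a LLaMA-style decoder under the stated assumptions."""
    embedding = vocab_size * d_model              # shared with the output head (weight tying)
    attention = 4 * d_model * d_model             # Wq, Wk, Wv, Wo, no biases
    swiglu = 3 * d_model * (ffn_mult * d_model)   # gate, up and down projections
    norms = 2 * d_model                           # two RMSNorm weight vectors per block
    per_block = attention + swiglu + norms
    final_norm = d_model
    return embedding + num_layers * per_block + final_norm

tiny = count_params(d_model=256, num_layers=4)
large = count_params(d_model=768, num_layers=12)
print(f"tiny:  {tiny:,}")
print(f"large: {large:,}")
```

Under these assumptions the totals come out to 17,062,400 and 151,862,784, in line with the 17M and 151M figures quoted above. Notice that in the tiny config the embedding table alone is about three quarters of the model.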


📓 Jupyter Notebooks

Alongside the textbook, each chapter has a companion notebook you can run live. These strip away the explanations and give you pure, clean code that executes from top to bottom. If the textbook teaches you why, the notebooks let you see it happen.

We're going to run this whole project on a very small dataset so you can watch training happen in minutes rather than weeks. Every notebook is self-contained. Open it, run all cells and you'll see the model learn in real time.

```bash
# Install everything you need
# The PyTorch index only hosts PyTorch packages, so install the rest from PyPI.
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install jupyter tiktoken numpy datasets matplotlib

# Start with chapter 2 (tokenization)
jupyter notebook notebooks/02_tokenization.ipynb
```

Notebooks live in the notebooks/ directory, one per chapter. Open any of them and hit Cell → Run All.


📚 Topic Explainers

Each concept in this guide has a dedicated deep dive inside the `explanations and examples WIP/` directory. These are written in the simplest possible language. No jargon. No formulas before analogies. Every explainer covers what, where, why, when and how with a code example you can run.

The last two files are narrative walkthroughs. A Token's Journey follows one sentence through the entire model. The Complete Story covers every component across 22 parts. Read these after the chapters to see how everything connects.

| Topic | File | What It Covers |
|---|---|---|
| RoPE | rope.md | How word order is encoded through rotation |
| Attention | attention.md | Step by step with a 3-token worked example |
| BPE Tokenization | bpe_tokenization.md | How text becomes tokens |
| Embeddings | embeddings.md | How numbers become meaning |
| RMSNorm | rmsnorm.md | Simpler faster normalization |
| SwiGLU | swiglu.md | The gated activation that beat ReLU |
| Causal Masking | causal_masking.md | No peeking at the future |
| Residual Connections | residual_connections.md | The gradient highway |
| KV Cache | kv_cache.md | Making generation fast |
| Sampling | sampling.md | Temperature, top-k, top-p |
| Mixed Precision | mixed_precision.md | Speed without sacrifice |
| AdamW | adamw.md | The optimizer that trains LLMs |
| Weight Tying | weight_tying.md | Two jobs one matrix |
| Gradient Clipping | gradient_clipping.md | Preventing training explosions |
| Cosine Warmup | cosine_warmup.md | The learning rate schedule |
| Pre-Norm | pre_norm.md | Where to normalize |
| Grouped Query Attention | grouped_query_attention.md | MHA vs GQA vs MQA explained |
| Flash Attention | flash_attention.md | How Flash Attention makes training 4× faster |
| Loss Curves | how_to_read_loss.md | Diagnose training problems from the loss curve |
| Mixture of Experts | mixture_of_experts.md | How MoE scales models with sparse routing |
| Speculative Decoding | speculative_decoding.md | 2-3× faster generation with a draft model |
| Cheatsheet | cheatsheet.md | Every formula and hyperparameter in one place |
| FAQ | faq.md | Troubleshooting common problems |
| Encoder vs Decoder | encoder_decoder_architectures.md | GPT vs BERT vs T5 explained |
| 📖 A Token's Journey | a_tokens_journey.md | Follow one sentence through every layer |
| 📖 The Complete Story | the_complete_story.md | The full narrative: 22 parts, 7800 words |
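
To show the flavor of these explainers, here is a sketch of the three sampling knobs from the Sampling row: temperature, top-k and top-p. It is a plain-Python illustration with made-up logits, and the helper names `softmax` and `filter_top_k_top_p` are hypothetical. Libraries differ in small details, for example whether top-p is measured before or after the top-k renormalization. This sketch measures it before.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities. Lower temperature sharpens, higher flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                                   # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest prefix whose mass reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token_id, p in ranked:
            kept.append((token_id, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {token_id: p / total for token_id, p in ranked}   # renormalize survivors

logits = [2.0, 1.0, 0.5, -1.0]                        # made-up scores for a 4-token vocabulary
sharp = softmax(logits, temperature=0.5)
flat = softmax(logits, temperature=2.0)
candidates = filter_top_k_top_p(softmax(logits), top_k=3, top_p=0.8)

rng = random.Random(0)
next_token = rng.choices(list(candidates), weights=list(candidates.values()))[0]
print(sharp, flat, candidates, next_token)
```

With these logits only tokens 0 and 1 survive the filters, so the sampled token is always one of those two. Lowering the temperature gives token 0 a larger share than raising it does.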

📖 How to Read

Each chapter follows the same 4-step structure:

| Step | Format | Purpose |
|---|---|---|
| 1️⃣ Analogy | Plain English, 5-year-old level | Build intuition before math |
| 2️⃣ Worked Example | Real numbers traced through | See exactly what happens |
| 3️⃣ Annotated Code | Every line: WHAT + WHY | Understand every decision |
| 4️⃣ Diagram | Mermaid flowchart or ASCII | Visualize data flow |

💡 Tip: Lost in the code? Jump back to the analogy. Confused by the math? Skip to the worked example.


✨ What Makes This Different

| Aspect | 😴 Typical Tutorial | 🔥 This Guide |
|---|---|---|
| Explanation depth | "Attention helps the model focus" | 8-step worked example with real numbers + variance math + causal mask visualization |
| Code comments | Few or none | Every single line: WHAT + WHY |
| Modern techniques | GPT-2 style (2019) | LLaMA 3 style (2024): RoPE, RMSNorm, SwiGLU |
| Training | Uses HuggingFace Trainer | Full custom loop: AdamW, cosine warmup, mixed precision, grad accumulation |
| Inference | model.generate() | Temperature, top-k, top-p, beam search, KV cache explained |
| Target audience | ML engineers | Python developers with zero ML experience |
| Diagrams | None | Mermaid flowcharts + ASCII matrices + worked examples |
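
A custom training loop means writing pieces like the learning rate schedule yourself. Here is a minimal sketch of linear warmup followed by cosine decay. The function name `lr_at` and every number in it (peak rate, floor, warmup length, total steps) are placeholder assumptions for this example, not the guide's settings.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate for a given step: linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients take small steps.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 999, 1000):
    print(step, f"{lr_at(step):.2e}")
```

The rate climbs to its peak at the end of warmup, then glides down along half a cosine wave and stays at the floor once training reaches `total_steps`.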

🎯 Skills You'll Gain

  • ✅ Explain how GPT-4 tokenizes text using BPE
  • ✅ Understand why RoPE, RMSNorm and SwiGLU replaced older techniques
  • ✅ Compute attention scores manually for a 3-token sentence
  • ✅ Debug a Transformer training loop (loss spikes, flat lines, overfitting)
  • ✅ Choose sampling parameters (temperature, top_k, top_p) for different use cases
  • ✅ Understand why KV caching is critical for production inference
  • ✅ Read modern ML papers with confidence (you'll recognize every component)
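
As a preview of the "compute attention manually" skill, here is a sketch of causal scaled dot-product attention for three tokens in plain Python. The Q, K and V numbers are made up, and `causal_attention` is a hypothetical helper, not the guide's code. Instead of adding a mask of negative infinities, it simply lets token i look at positions 0 through i, which gives the same weights.

```python
import math

def causal_attention(Q, K, V):
    """Single-head attention where token i can only attend to tokens 0..i."""
    d_k = len(Q[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        # Scores against the current and earlier tokens only, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d_k) for j in range(i + 1)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Future positions get weight 0, exactly what the causal mask enforces.
        weights = [e / total for e in exps] + [0.0] * (len(Q) - i - 1)
        all_weights.append(weights)
        outputs.append([sum(w * V[j][c] for j, w in enumerate(weights))
                        for c in range(len(V[0]))])
    return outputs, all_weights

# Three tokens, two dimensions each (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only see itself, so its weight row is [1, 0, 0] and its output is just its own value vector. Every row sums to 1, and the zeros in the upper right are the "no peeking at the future" rule.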

🔮 Next Steps After Finishing

| Experiment | What to Change | What You'll Learn |
|---|---|---|
| Bigger model | num_layers 12 → 24 | How depth improves reasoning |
| More data | Add BookCorpus, C4, The Pile | Impact of data quality and diversity |
| Flash Attention | Install flash-attn, swap attention | 2-5× faster training, longer context |
| Grouped Query Attention | Set num_kv_heads < num_heads | How Mistral achieves efficient inference |
| LoRA fine-tuning | Add low-rank adapter layers | Customize models without full retraining |
| RLHF / DPO | Add reward model training | How ChatGPT learns to follow instructions |
| KV Cache | Implement persistent key-value storage | 500× faster text generation |
| Mixture of Experts | Route tokens through different FFN experts | How GPT-4 scales to trillions of params |
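
For the Grouped Query Attention experiment, the core idea is that several query heads share one key/value head. The sketch below shows the head mapping and the effect on KV cache size. The helper names and the configuration numbers (12 layers, 12 query heads, 4 KV heads, 64-dim heads, 1,024 tokens, 2 bytes per value) are assumptions chosen for this example.

```python
def kv_head_for(query_head, num_heads, num_kv_heads):
    """Which shared key/value head a given query head reads from."""
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be a multiple of num_kv_heads")
    group_size = num_heads // num_kv_heads
    return query_head // group_size

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached keys and values of one sequence (the leading 2 is K plus V)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

mapping = [kv_head_for(h, num_heads=12, num_kv_heads=4) for h in range(12)]
mha_bytes = kv_cache_bytes(num_layers=12, num_kv_heads=12, head_dim=64, seq_len=1024)
gqa_bytes = kv_cache_bytes(num_layers=12, num_kv_heads=4, head_dim=64, seq_len=1024)

print(mapping)
print(mha_bytes, gqa_bytes, mha_bytes // gqa_bytes)
```

With 12 query heads and 4 KV heads, each KV head serves a group of three query heads, and the cache shrinks by the same factor of three. Setting `num_kv_heads=1` gives multi-query attention, and `num_kv_heads=num_heads` is ordinary multi-head attention.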

📁 File Structure

📦 how-to-train-your-gpt/
├── 📄 README.md              ← You are here
├── 🐍 main.py                ← Runnable training script (clone & run)
├── 📋 requirements.txt       ← One command install
├── 📂 chapters/
│   ├── 🏠 00_overview.md     ← What is a GPT? Why build one?
│   ├── 🔧 01_setup.md        ← Install tools, GPU vs CPU, venv basics
│   ├── 🔪 02_tokenization.md ← BPE walkthrough, EOS tokens, emoji handling
│   ├── 🧊 03_embeddings.md   ← How numbers become meaning, king − man + woman
│   ├── 📐 04_positional_encoding.md ← RoPE math, numerical example, theta
│   ├── 🧠 05_attention.md    ← ⭐ THE CORE (713 lines). Q,K,V, scaling, causal mask
│   ├── 🧱 06_transformer_block.md ← RMSNorm, SwiGLU, residuals, pre-norm vs post
│   ├── 🏗️ 07_gpt_model.md    ← Complete 151M model, weight tying, logits explained
│   ├── 🏋️ 08_training.md     ← Cross-entropy, backprop, AdamW, cosine warmup
│   ├── 🎤 09_inference.md    ← KV cache, temperature, top-k/p, beam search
│   ├── 📜 10_full_script.md  ← About main.py
│   └── 📊 11_glossary.md     ← Architecture provenance, parameter breakdown
├── 📓 notebooks/             ← Jupyter notebooks (one per chapter)
│   ├── 🎨 attention_visualized.ipynb ← Watch attention weights in action
│   └── ☁️ colab_train.ipynb  ← One-click cloud training on Colab
├── 🎯 fine-tuning/           ← Fine-tuning guide: LoRA, QLoRA, data prep
│   ├── 📄 README.md
│   ├── 01_what_is_finetuning.md
│   ├── 02_lora_explained.md
│   ├── 03_qlora_explained.md
│   ├── 04_data_preparation.md
│   ├── 05_full_finetune.md
│   └── 📓 notebooks/lora_finetune.ipynb
├── 📚 explanations and examples WIP/ ← Standalone explainers (26 topics)
└── 📄 CONTRIBUTING.md

<p align="center"> <i>"Any sufficiently explained technology is indistinguishable from magic. Until you build it yourself."</i> </p> <p align="center"> <sub>⭐ Star this repo if you found it useful | 🐛 Issues & PRs welcome | 📖 Happy learning!</sub> </p>


This article is auto-generated from raiyanyahya/how-to-train-your-gpt via the GitHub API. Last fetched: 6/18/2026