Step-by-step interactive walkthrough of LLM construction, from Common Crawl filtering through BPE tokenization, Transformer training, and RLHF post-training, based on Karpathy’s lecture.
Key Takeaways
FineWeb pipeline: 2.7B Common Crawl pages reduced to 44TB / 15T tokens via URL blocklists, MinHash deduplication, language filtering, and PII removal.
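A minimal, hedged sketch of what such a stage-by-stage filter could look like. These are toy heuristics and placeholder names, not the actual FineWeb code: the English check stands in for a real language-ID model, and the exact-signature match stands in for proper MinHash-LSH bucketing.

```python
import hashlib
import re

URL_BLOCKLIST = {"spam.example.com"}                    # hypothetical blocklist entry
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")      # toy PII pattern (emails only)

def minhash_signature(text, num_perm=32, shingle_len=5):
    """Tiny from-scratch MinHash: min of salted hashes over word shingles."""
    words = text.split()
    shingles = {" ".join(words[i:i + shingle_len])
                for i in range(max(1, len(words) - shingle_len + 1))}
    return tuple(
        min(int(hashlib.md5(f"{seed}|{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_perm)
    )

def keep_page(url, text, seen_signatures):
    if any(domain in url for domain in URL_BLOCKLIST):
        return False                                    # URL blocklist filter
    if " the " not in f" {text} ":
        return False                                    # crude stand-in for a real language-ID model
    sig = minhash_signature(text)
    if sig in seen_signatures:
        return False                                    # near-duplicate (exact match stands in for LSH)
    seen_signatures.add(sig)
    return True

def scrub_pii(text):
    return EMAIL_RE.sub("<EMAIL>", text)                # PII removal

seen = set()
page = ("https://example.org/post",
        "All about the weather. Contact me at a@b.com for more about the data.")
if keep_page(*page, seen):
    print(scrub_pii(page[1]))
```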
BPE builds GPT-4’s 100,277-token vocabulary by starting from 256 byte symbols and iteratively merging the most frequent adjacent pairs.
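A minimal sketch of that training loop, in the spirit of Karpathy's minbpe rather than the actual GPT-4 tokenizer: start from raw bytes, count adjacent pairs, mint a new token id for the most frequent pair, repeat.

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int):
    ids = list(text.encode("utf-8"))             # start from the 256 raw byte values
    merges = {}                                   # (token_a, token_b) -> new token id
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))        # count every adjacent pair in the corpus
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]         # most frequent adjacent pair
        merges[pair] = next_id
        out, i = [], 0
        while i < len(ids):                       # replace each occurrence with the new id
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges                                 # vocabulary = 256 byte tokens + len(merges) merges

merges = train_bpe("low lower lowest " * 50, vocab_size=260)
print(len(merges))                                # 4 merges learned on top of the 256 byte tokens
```

Note that the vocabulary only grows: the original 256 byte tokens are never removed, which is the same point raised in the comment thread below.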
Training cost has collapsed: a GPT-2-quality model cost ~$40K to train in 2019, and an equivalent run now costs ~$100. Llama 3 trains 405B parameters on 15T tokens.
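For a sense of the compute behind that frontier run, the common 6·N·D rule of thumb (a standard approximation, not a figure quoted in the lecture) gives roughly:

```python
# Back-of-the-envelope training compute: ~6 FLOPs per parameter per token.
params = 405e9    # Llama 3, 405B parameters
tokens = 15e12    # 15T training tokens
flops = 6 * params * tokens
print(f"{flops:.2e} training FLOPs")   # ~3.6e25
```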
Base model is a stochastic internet autocomplete engine; SFT on labeled conversations and RLHF preference ranking convert it into a chat assistant.
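One piece of that post-training recipe can be made concrete. Below is a hedged sketch of the pairwise Bradley-Terry-style loss a reward model typically minimizes over human preference rankings; the names and values are illustrative, not taken from the lecture.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Push the reward of the human-preferred completion above the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

chosen = torch.tensor([1.2, 0.3])     # reward-model scores for the preferred answers
rejected = torch.tensor([0.4, -0.1])  # scores for the dispreferred answers
print(preference_loss(chosen, rejected).item())  # small positive loss, ~0.44
```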
Inference is autoregressive and stochastic; temperature 0.7-1.0 is the practical sweet spot, and tool use works by emitting special tokens that pause generation and inject results into the context window.
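A minimal sketch of the temperature-scaled sampling step at the core of that loop (toy logits, not a real model's output; the tool-use pause-and-inject mechanism is omitted):

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8, rng=None) -> int:
    rng = rng or np.random.default_rng()
    scaled = logits / temperature            # <1.0 sharpens the distribution, >1.0 flattens it
    probs = np.exp(scaled - scaled.max())    # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([2.0, 1.0, 0.2, -1.0])     # pretend scores for four candidate tokens
print(sample_next_token(logits, temperature=0.7))
```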
Hacker News Comment Review
Dominant thread concern is unproofread AI-generated copy: the claim that 44TB “roughly fits on a single hard drive” overshoots the current retail maximum (~32TB) by roughly 1.4x, eroding trust in the technical content.
The BPE diagram drew a specific technical objection: BPE is purely additive and never removes the original 256 byte-level tokens, so a diagram that depicts those tokens being replaced teaches a genuine misconception.
Commenters broadly redirected readers to Jay Alammar’s “The Illustrated GPT-2” as the established human-authored reference for the same material.
Notable Comments
@vova_hn2: BPE visualization misleads by implying old tokens are discarded; the process only adds merges, always retaining all 256 byte tokens.
@gushogg-blake: Guide skips embedding depth entirely: how the input side of the network represents N context tokens, and how embeddings handle tokens with context-dependent meanings.
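The question raised in that last comment has a fairly standard answer worth sketching (GPT-2-small-like sizes and arbitrary token ids, chosen for illustration; this is not from the guide itself): token and positional embeddings are static lookups, and the context-dependent meaning the commenter asks about only emerges in the attention layers that follow.

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_ctx = 50_257, 768, 1024    # GPT-2-small-like dimensions
token_emb = nn.Embedding(vocab_size, d_model)       # one fixed vector per token id
pos_emb = nn.Embedding(max_ctx, d_model)            # one fixed vector per position

token_ids = torch.tensor([[101, 2009, 318]])        # batch of 1 sequence, N = 3 arbitrary token ids
positions = torch.arange(token_ids.shape[1])
x = token_emb(token_ids) + pos_emb(positions)       # (1, 3, 768): the Transformer's input activations
print(x.shape)
```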