Deep Dive into LLMs like ChatGPT
Author: Andrej Karpathy
Date: February 2025
Video: https://www.youtube.com/watch?v=7xTGNNLPyMI
Source file: raw/llm-deep-dive-script.md
A comprehensive, general-audience walkthrough of the full pipeline for building and using large language models, from raw internet data to deployed assistant. It also covers LLM psychology, security vulnerabilities, and future directions.
What this talk covers
Seven major areas:
- Pretraining data and tokenisation: how internet text is collected, filtered, and converted into token sequences.
- Neural network training: how a transformer learns to predict the next token by adjusting billions of parameters.
- Base models vs assistants: what a base model does and why post-training is needed to convert it into a useful assistant.
- Post-training pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL) in verifiable domains.
- RLHF: how human preference is used to extend RL into unverifiable domains.
- LLM psychology: hallucinations, knowledge vs working memory, jagged intelligence, the need for tokens to think.
- Security: jailbreaks, prompt injection, and data poisoning.
Key concepts introduced
- Pretraining: compressing a large internet corpus into neural network parameters.
- Tokenisation: byte-pair encoding; how text becomes integer sequences.
- Transformers: the neural network architecture underlying all modern LLMs.
- Hallucination: why models confabulate, and two mitigations (refusal training and tool use).
- Reinforcement Learning from Human Feedback: using ranked human preferences to improve models in unverifiable domains.
- Jagged Intelligence: LLMs are brilliant in most places but have arbitrary holes.
- Context Window: the model’s working memory; knowledge in the window beats knowledge in parameters.
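The byte-pair encoding idea from the list above can be sketched in a few lines. This is a minimal illustration, not the tokenizer of any particular model: it starts from raw UTF-8 bytes (ids 0 to 255) and repeatedly replaces the most frequent adjacent pair with a new id. The function names and the sample string are assumptions chosen for the example.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of ids, or None if there are no pairs."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return the final ids and the merge table."""
    ids = list(text.encode("utf-8"))  # start from raw bytes: ids 0..255
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # new ids are allocated after the byte range
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


# Illustrative input only: 11 bytes shrink to a shorter integer sequence.
ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids, merges)
```

Each merge shortens the sequence and grows the vocabulary by one entry, which is the trade-off a tokenizer's vocabulary size controls.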
Key arguments
Pretraining is knowledge acquisition. The model doesn’t store facts as discrete entries — it compresses statistical patterns. Things mentioned often are recalled more reliably; things mentioned rarely are recalled vaguely or not at all.
Post-training is alignment, not new knowledge. Supervised fine-tuning on human-labelled conversations changes output format (from internet documents to Q&A assistant) while preserving the knowledge built during pretraining.
Reinforcement learning can find solutions humans can’t write. In verifiable domains (maths, code), RL lets the model discover its own reasoning paths — chains of thought that emerge from optimisation, not imitation. The AlphaGo analogy: RL found Go strategies no human had tried. The same may be happening in LLM reasoning.
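A toy sketch of the verifiable-reward idea, assuming a made-up "policy" that chooses among three hard-coded strategies for an addition problem. It is not the training algorithm used for any real LLM; it only shows how an automatic answer check can reinforce whichever behaviour happens to work, without a human writing the solution. `STRATEGIES`, `sample_attempt`, and `train` are hypothetical names.

```python
import random

STRATEGIES = ["add", "multiply", "guess"]  # hypothetical stand-ins for reasoning paths


def sample_attempt(rng, weights, a, b):
    """Sample a strategy in proportion to its current weight and produce an answer."""
    strategy = rng.choices(STRATEGIES, weights=[weights[s] for s in STRATEGIES])[0]
    if strategy == "add":
        answer = a + b
    elif strategy == "multiply":
        answer = a * b
    else:
        answer = rng.randint(0, 20)
    return strategy, answer


def verifiable_reward(answer, expected):
    """Reward is checkable automatically: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if answer == expected else 0.0


def train(a, b, n_samples=200, seed=0):
    """Reinforce strategies that produced a correct answer to a + b."""
    rng = random.Random(seed)
    weights = {s: 1.0 for s in STRATEGIES}
    expected = a + b
    for _ in range(n_samples):
        strategy, answer = sample_attempt(rng, weights, a, b)
        weights[strategy] += verifiable_reward(answer, expected)
    return weights


weights = train(3, 4)
print(weights)
```

Because rewarded strategies are sampled more often, the loop concentrates on the one that reliably passes the check. The reward depends only on the final answer, not on how it was reached.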
RLHF is limited. Because the reward model is a neural network approximation of human preference, RL will eventually find adversarial inputs that score highly without being genuinely good. RLHF must be stopped early; it polishes rather than fundamentally improves.
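The early-stopping point can be illustrated with a deliberately simple toy. Assume a hypothetical reward model that always prefers longer answers, while the "true" usefulness of an answer peaks at a moderate length. Both functions are invented for this sketch and do not model any real system.

```python
def true_quality(length):
    """Hypothetical real usefulness: rises with detail, then falls as the answer bloats."""
    return length - 0.1 * length ** 2


def proxy_reward(length):
    """Hypothetical flawed reward model: longer always scores higher."""
    return float(length)


def optimise_against_proxy(steps):
    """Greedily follow the proxy reward, recording true quality along the way."""
    length = 0
    history = []
    for step in range(steps + 1):
        history.append((step, proxy_reward(length), true_quality(length)))
        # Move to whichever neighbouring length the proxy scores higher.
        length = max([length - 1, length + 1], key=proxy_reward)
    return history


history = optimise_against_proxy(10)
best_step = max(history, key=lambda row: row[2])[0]
print(best_step, history[-1])
```

In this toy the proxy score keeps climbing while true quality peaks partway through and then declines, which is why optimisation against a learned reward is halted after a limited number of steps.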
Tokens to think. Each token gets a fixed compute budget. Forcing an answer into the first token prevents the model from distributing reasoning across many tokens. Chain-of-thought prompting works because it allocates computation over time.
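A back-of-the-envelope sketch of the compute argument. It assumes the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter per token (ignoring attention over the context), and a hypothetical 8-billion-parameter model. The numbers are illustrative, not measurements.

```python
def forward_flops_per_token(n_params):
    """Rule of thumb (an assumption here): about 2 FLOPs per parameter per token."""
    return 2 * n_params


def answer_compute(n_params, n_tokens):
    """Total forward-pass compute spent producing an answer of n_tokens tokens."""
    return forward_flops_per_token(n_params) * n_tokens


N_PARAMS = 8_000_000_000  # hypothetical model size

direct = answer_compute(N_PARAMS, n_tokens=1)      # answer forced into one token
stepwise = answer_compute(N_PARAMS, n_tokens=200)  # answer after 200 reasoning tokens

print(f"direct:   {direct:.2e} FLOPs")
print(f"stepwise: {stepwise:.2e} FLOPs ({stepwise // direct}x more)")
```

The per-token budget is the same in both cases; the only way to spend more computation on a hard problem is to spread the work over more tokens.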
Security landscape
Three attack surfaces, each with active research:
| Attack type | Mechanism | Example |
|---|---|---|
| Jailbreaks | Prompt framing tricks model into ignoring safety training | Roleplay-as-grandmother; base64 encoding |
| Prompt injection | Malicious instructions hidden in content the model reads | White-on-white text in images; attacker-controlled web pages |
| Data poisoning | Backdoor inserted during training via attacker-controlled web content | “James Bond” trigger phrase from a fine-tuning study |
Relation to other talks
- Companion to Intro to Large Language Models (an earlier, shorter overview that leans more on the operating-system analogy).
- Practical application covered in How I Use LLMs.
- Agentic implications followed up in From Vibe Coding to Agentic Engineering.