Deep Dive into LLMs like ChatGPT

Author: Andrej Karpathy Date: February 2025 Video: https://www.youtube.com/watch?v=7xTGNNLPyMI Source file: raw/llm-deep-dive-script.md

A comprehensive general-audience walk through the full pipeline for building and using large language models — from raw internet data to deployed assistant — including LLM psychology, security vulnerabilities, and future directions.

What this talk covers

Seven major areas:

Pretraining data and tokenisation — how internet text is collected, filtered, and converted into token sequences.
Neural network training — how a transformer learns to predict the next token by adjusting billions of parameters.
Base models vs assistants — what a base model does and why post-training converts it into a useful assistant.
Post-training pipeline — supervised fine-tuning (SFT) and reinforcement learning (RL) from verifiable domains.
RLHF — how human preference is used to extend RL into unverifiable domains.
LLM psychology — hallucinations, knowledge vs working memory, jagged intelligence, the need for tokens to think.
Security — jailbreaks, prompt injection, and data poisoning.

Key concepts introduced

Pretraining — compressing a large internet corpus into neural network parameters.
Tokenisation — byte-pair encoding; how text becomes integer sequences.
Transformers — the neural network architecture underlying all modern LLMs.
Hallucination — why models confabulate, and two mitigations (refusal training + tool use).
Reinforcement Learning from Human Feedback — using ranked human preferences to improve models in unverifiable domains.
Jagged Intelligence — LLMs brilliant in most places, with arbitrary holes.
Context Window — the model’s working memory; knowledge in the window beats knowledge in parameters.

Key arguments

Pretraining is knowledge acquisition. The model doesn’t store facts as discrete entries — it compresses statistical patterns. Things mentioned often are recalled more reliably; things mentioned rarely are recalled vaguely or not at all.

Post-training is alignment, not new knowledge. Supervised fine-tuning on human-labelled conversations changes output format (from internet documents to Q&A assistant) while preserving the knowledge built during pretraining.

Reinforcement learning can find solutions humans can’t write. In verifiable domains (maths, code), RL lets the model discover its own reasoning paths — chains of thought that emerge from optimisation, not imitation. The AlphaGo analogy: RL found Go strategies no human had tried. The same may be happening in LLM reasoning.

RLHF is limited. Because the reward model is a neural network approximation of human preference, RL will eventually find adversarial inputs that score highly without being genuinely good. RLHF must be stopped early; it polishes rather than fundamentally improves.

Tokens to think. Each token gets a fixed compute budget. Forcing an answer into the first token prevents the model from distributing reasoning across many tokens. Chain-of-thought prompting works because it allocates computation over time.

Security landscape

Three attack surfaces, each with active research:

Attack type	Mechanism	Example
Jailbreaks	Prompt framing tricks model into ignoring safety training	Roleplay-as-grandmother; base64 encoding
Prompt injection	Malicious instructions hidden in content the model reads	White-on-white text in images; attacker-controlled web pages
Data poisoning	Backdoor inserted during pretraining via controlled web content	”James Bond” trigger phrase from fine-tuning study

Relation to other talks

Companion to Intro to Large Language Models (earlier, shorter overview with more OS analogy).
Practical application covered in How I Use LLMs.
Agentic implications followed up in From Vibe Coding to Agentic Engineering.