Deep Dive into LLMs like ChatGPT

karpathy llm pretraining post-training rl security

Author: Andrej Karpathy
Date: February 2025
Video: https://www.youtube.com/watch?v=7xTGNNLPyMI
Source file: raw/llm-deep-dive-script.md

A comprehensive, general-audience walkthrough of the full pipeline for building and using large language models, from raw internet data to deployed assistant. It also covers LLM psychology, security vulnerabilities, and future directions.


What this talk covers

Seven major areas:

  1. Pretraining data and tokenisation — how internet text is collected, filtered, and converted into token sequences.
  2. Neural network training — how a transformer learns to predict the next token by adjusting billions of parameters.
  3. Base models vs assistants — what a base model does and why post-training converts it into a useful assistant.
  4. Post-training pipeline — supervised fine-tuning (SFT) and reinforcement learning (RL) in verifiable domains.
  5. RLHF — how human preference is used to extend RL into unverifiable domains.
  6. LLM psychology — hallucinations, knowledge vs working memory, jagged intelligence, the need for tokens to think.
  7. Security — jailbreaks, prompt injection, and data poisoning.

Key concepts introduced

  • Pretraining — compressing a large internet corpus into neural network parameters.
  • Tokenisation — byte-pair encoding; how text becomes integer sequences.
  • Transformers — the neural network architecture underlying all modern LLMs.
  • Hallucination — why models confabulate, and two mitigations (refusal training + tool use).
  • Reinforcement Learning from Human Feedback — using ranked human preferences to improve models in unverifiable domains.
  • Jagged Intelligence — LLMs brilliant in most places, with arbitrary holes.
  • Context Window — the model’s working memory; knowledge in the window beats knowledge in parameters.
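
To make the tokenisation concept concrete, here is a minimal sketch of byte-pair encoding, assuming a byte-level starting vocabulary of 256 ids. The function names (`train_bpe`, `merge`, `decode`) and the toy string are illustrative assumptions, not taken from the talk. Production tokenisers add pre-splitting rules and special tokens that are omitted here.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0..255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


def decode(ids, merges):
    """Map token ids back to text by expanding merges in the order they were learned."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

On the toy string `"aaabdaaabac"`, three merges shrink the 11 input bytes to 5 tokens, and `decode` recovers the original text. Each merge trades a longer vocabulary for a shorter sequence, which is the core trade-off in choosing a vocabulary size.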

Key arguments

Pretraining is knowledge acquisition. The model doesn’t store facts as discrete entries — it compresses statistical patterns. Things mentioned often are recalled more reliably; things mentioned rarely are recalled vaguely or not at all.
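
The frequency effect can be illustrated with a toy next-token model. The sketch below is a bigram counter, not a neural network, and the corpus and function names are invented for illustration. It only shows how continuations seen often in the training text end up with high probability, while rare ones stay weak.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Turn the counts for `prev` into a probability distribution over next tokens."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}
    return {tok: c / total for tok, c in counts[prev].items()}


# Toy corpus: one statement appears nine times, a rarer one appears once.
corpus = ("paris is in france . " * 9 + "paris is in texas . ").split()
counts = train_bigram(corpus)
print(next_token_probs(counts, "in"))  # {'france': 0.9, 'texas': 0.1}
```

A real LLM replaces the count table with billions of learned parameters and conditions on a long context rather than one previous token, but the training signal is the same: predict the next token.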

Post-training is alignment, not new knowledge. Supervised fine-tuning on human-labelled conversations changes output format (from internet documents to Q&A assistant) while preserving the knowledge built during pretraining.

Reinforcement learning can find solutions humans can’t write. In verifiable domains (maths, code), RL lets the model discover its own reasoning paths — chains of thought that emerge from optimisation, not imitation. The AlphaGo analogy: RL found Go strategies no human had tried. The same may be happening in LLM reasoning.
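
A heavily simplified sketch of RL in a verifiable domain follows. It assumes a "policy" that merely picks one of three hypothetical solution strategies for an addition problem, and a verifier that checks the final answer. The strategies, hyperparameters, and the use of a plain REINFORCE update are illustrative assumptions; the talk describes the idea (sample solutions, reinforce the ones that check out), not this code.

```python
import math
import random

# Hypothetical solution strategies the policy can choose between.
STRATEGIES = [
    lambda a, b: a + b,           # correct addition
    lambda a, b: a + b + 1,       # off by one
    lambda a, b: int(f"{a}{b}"),  # concatenates the digits
]


def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifier(a, b, answer):
    """Verifiable reward: 1.0 if the answer is correct, else 0.0."""
    return 1.0 if answer == a + b else 0.0


def train(steps=200, lr=0.5, seed=0):
    """Sample a strategy, check its answer, and reinforce strategies that pass."""
    rng = random.Random(seed)
    logits = [0.0] * len(STRATEGIES)
    for _ in range(steps):
        probs = softmax(logits)
        k = rng.choices(range(len(STRATEGIES)), weights=probs)[0]
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        reward = verifier(a, b, STRATEGIES[k](a, b))
        # REINFORCE: d log p_k / d logit_j = (1 if j == k else 0) - p_j
        for j in range(len(logits)):
            indicator = 1.0 if j == k else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return softmax(logits)
```

Calling `train()` returns the final strategy probabilities. No one tells the policy which strategy is right; probability mass shifts to the correct one purely because its answers pass the check.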

RLHF is limited. Because the reward model is a neural network approximation of human preference, RL will eventually find adversarial inputs that score highly without being genuinely good. RLHF must be stopped early; it polishes rather than fundamentally improves.
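
As a rough illustration of how ranked preferences can become a training signal for a reward model, the sketch below expands a human ranking into pairs and scores each pair with a pairwise logistic loss. This Bradley-Terry style formulation is one common choice and is assumed here, not taken from the talk. The function names and example responses are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_best_to_worst):
    """Expand a ranking into (preferred, rejected) pairs; n items give n*(n-1)/2 pairs."""
    return list(combinations(ranked_best_to_worst, 2))


def pairwise_preference_loss(r_preferred, r_rejected):
    """Compute -log(sigmoid(r_preferred - r_rejected)) in a numerically stable way."""
    diff = r_preferred - r_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranking_loss(ranked_best_to_worst, reward_fn):
    """Average the pairwise loss over every pair implied by one human ranking."""
    pairs = ranking_to_pairs(ranked_best_to_worst)
    losses = [pairwise_preference_loss(reward_fn(p), reward_fn(r)) for p, r in pairs]
    return sum(losses) / len(losses)
```

The loss is small when the reward model scores the human-preferred response higher and large when it disagrees. Because `reward_fn` is only fitted to a finite set of human rankings, a policy optimised hard against it can drift into inputs where its scores no longer track real human judgement.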

Tokens to think. Each token gets a fixed compute budget. Forcing an answer into the first token prevents the model from distributing reasoning across many tokens. Chain-of-thought prompting works because it allocates computation over time.
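
A back-of-the-envelope sketch of the compute argument, assuming the common rule of thumb that a forward pass costs roughly 2 FLOPs per parameter per generated token. That approximation, the 8-billion-parameter model size, and the function names are assumptions for illustration, not figures from the talk. The estimate also ignores attention cost over the context.

```python
def forward_flops_per_token(num_params):
    """Rule-of-thumb estimate: about 2 FLOPs per parameter for one generated token."""
    return 2 * num_params


def answer_compute(num_params, num_generated_tokens):
    """Total forward compute spent before the answer is complete."""
    return forward_flops_per_token(num_params) * num_generated_tokens


NUM_PARAMS = 8_000_000_000  # hypothetical model size

one_shot = answer_compute(NUM_PARAMS, 1)     # answer forced into the first token
worked = answer_compute(NUM_PARAMS, 200)     # answer follows 200 tokens of reasoning
print(worked // one_shot)  # 200
```

Under these assumptions, the only way to spend more computation on a hard question is to generate more tokens before committing to the answer.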


Security landscape

Three attack surfaces, each with active research:

| Attack type | Mechanism | Example |
| --- | --- | --- |
| Jailbreaks | Prompt framing tricks model into ignoring safety training | Roleplay-as-grandmother; base64 encoding |
| Prompt injection | Malicious instructions hidden in content the model reads | White-on-white text in images; attacker-controlled web pages |
| Data poisoning | Backdoor inserted during pretraining via controlled web content | "James Bond" trigger phrase from fine-tuning study |

Relation to other talks