Deep Dive into LLMs like ChatGPT

Cleaned transcript — original talk by Andrej Karpathy
Video: youtube.com/watch?v=7xTGNNLPyMI
Author: karpathy.ai
This is a cleaned version of the auto-generated YouTube transcript — punctuation added, filler removed, restructured for readability. Not a verbatim transcript. For exact quotes, please refer to the original video.

Introduction

Hi everyone. I've wanted to make this video for a while: a comprehensive but general-audience introduction to large language models like ChatGPT. My goal is to give you mental models for thinking about what this tool actually is. It's magical and amazing in some respects, really good at some things, not very good at others, and there are a lot of sharp edges to be aware of.

What's behind that text box? You can put anything in and press enter — but what should you put there? What are the words coming back? How does it work? What are you actually talking to? We'll go through the entire pipeline of how this stuff is built, keeping it accessible, and along the way I'll touch on the cognitive and psychological implications of these tools.

Pretraining Data (the Internet)

Building something like ChatGPT happens in multiple sequential stages. The first is pretraining, and its first step is to download and process the internet.

A good reference for what this looks like in practice is the FineWeb dataset from Hugging Face. Every major LLM provider (OpenAI, Anthropic, Google) has some internal equivalent. The goal: a huge quantity of high-quality, highly diverse documents from publicly available sources.

FineWeb, which is representative of production-grade data, ends up being only about 44 TB — small enough to fit on a single hard drive. The internet is large, but once you filter it down to text and apply aggressive quality filters, it becomes manageable.

The starting point is usually Common Crawl, an organization that has been scraping the internet since 2007. As of 2024, it had indexed 2.7 billion web pages. Crawlers follow links from seed pages and index everything they find.

The raw Common Crawl data goes through many stages of filtering:

URL filtering — block lists for malware, spam, marketing, adult content, etc.
Text extraction — pulling clean text out of raw HTML, removing markup, CSS, and navigation.
Language filtering — e.g., FineWeb keeps pages that are >65% English. This is a design choice: filter out Spanish and your model will be bad at Spanish.
Deduplication.
PII removal — detecting and filtering addresses, social security numbers, etc.
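
As a rough illustration of the stages listed above, here is a toy pipeline in Python. It is not FineWeb's actual code: the block list, the common-word language check, the exact-match deduplication, and the SSN-only redaction are simplified stand-ins chosen for this sketch, and every name in it is hypothetical.

```python
import hashlib
import re
from html.parser import HTMLParser

BLOCKED_DOMAINS = {"spam.example", "malware.example"}  # hypothetical block list
COMMON_ENGLISH = {"the", "and", "is", "of", "to", "in", "a", "that", "it", "for"}
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def looks_english(text, threshold=0.2):
    # Crude stand-in for a real language classifier: share of very common English words
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in COMMON_ENGLISH)
    return hits / len(words) >= threshold


def filter_pages(pages):
    """pages: iterable of (url, html). Returns cleaned, deduplicated texts."""
    seen = set()
    kept = []
    for url, html in pages:
        domain = url.split("//")[-1].split("/")[0]
        if domain in BLOCKED_DOMAINS:            # URL filtering
            continue
        text = extract_text(html)                # text extraction
        if not looks_english(text):              # language filtering
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                       # exact-match deduplication
            continue
        seen.add(digest)
        kept.append(SSN_PATTERN.sub("[REDACTED]", text))  # PII removal (redaction here)
    return kept
```

Production pipelines use trained language classifiers and fuzzy deduplication rather than these shortcuts; the point is only the order of the stages.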

The final dataset looks like a giant tapestry of raw internet text. With 200 web pages concatenated, you already have a massive amount of text. This is the starting point for the next stage.

Tokenization

Before feeding text into a neural network, we need to decide how to represent it. The network expects a one-dimensional sequence of symbols from a finite vocabulary.

The text is a 1D sequence already (left-to-right, top-to-bottom). Underneath, it's UTF-8 encoded bits: just zeros and ones. But a vocabulary of only two symbols makes the sequence extremely long, and that's bad, because sequence length is a precious resource. We want to trade off vocabulary size against sequence length.

A first compression: group 8 bits into a byte. Now we have 256 possible symbols and a sequence 8× shorter. Think of these not as numbers but as unique IDs — like 256 distinct emojis.
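
To make the trade-off concrete, here is a small Python sketch that writes the same text once as bits and once as bytes. The example string is arbitrary.

```python
text = "hello world"
raw = text.encode("utf-8")                    # bytes: symbols from a vocabulary of 256
bits = "".join(f"{b:08b}" for b in raw)       # the same data as zeros and ones

print(len(bits))   # sequence length with a 2-symbol vocabulary: 88
print(len(raw))    # sequence length with a 256-symbol vocabulary: 11
print(list(raw))   # [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
```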

State-of-the-art models go further using byte-pair encoding (BPE). Find the most frequent pair of consecutive symbols (e.g., 116 followed by 32), mint a new symbol with the next free ID (256, since 0 through 255 are taken by the bytes), and replace every occurrence of the pair with it. Iterate. Each iteration decreases sequence length and grows the vocabulary by one.
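
A minimal sketch of that merge loop in Python, starting from raw bytes. Real tokenizers add details this toy omits (for example, splitting the text into chunks before merging); the function names are made up for this example.

```python
from collections import Counter


def most_common_pair(ids):
    """Return the most frequent pair of adjacent symbols, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    ids = list(text.encode("utf-8"))   # start from the 256 byte symbols
    merges = {}
    for step in range(num_merges):
        pair = most_common_pair(ids)
        if pair is None:
            break
        new_id = 256 + step            # next free ID
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", 3)
print(len("aaabdaaabac"), "->", len(ids))   # the sequence gets shorter
print(merges)                               # the vocabulary gets bigger
```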

GPT-4 uses 100,277 tokens. The process of converting text to tokens is called tokenization.

You can explore this at tiktokenizer. Some observations:

"hello world" → 2 tokens.
"helloworld" (no space) → a different tokenization.
Two spaces → another different tokenization.
It's case-sensitive: "Hello World" tokenizes differently than "hello world".

Neural Network I/O

Re-represent the FineWeb data using the tokenizer: it becomes a 15-trillion-token sequence. These tokens are atoms — the numbers are just unique IDs.

Now the fun part: neural network training. We model the statistical relationships of how tokens follow each other.

Take a random window of tokens from the data, with a length anywhere from 0 up to some maximum (e.g., 8,000 tokens). Longer windows are more expensive, so we crop. Say we take 4 tokens as context.

Feed those 4 tokens into the network. The network outputs a probability distribution over all 100,277 possible next tokens. Initially the network is randomly initialized, so the predictions are random. But we know what actually comes next (it's in our dataset), so we have a label.

We mathematically update the network so the correct token gets slightly higher probability and the others get slightly lower. After the update, feeding the same context produces a slightly improved distribution.
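
The update can be sketched in a few lines of Python. This toy makes a big simplifying assumption: it nudges the output scores (logits) for one fixed context directly, whereas a real network adjusts its parameters and the logits change as a consequence. The 5-token vocabulary, the starting logits, and the learning rate are all made up.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def update_step(logits, target, lr=0.5):
    """One gradient-descent step on cross-entropy loss, taken directly on the logits."""
    probs = softmax(logits)
    # d(loss)/d(logit_i) is p_i - 1 for the correct token and p_i for every other token
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return [x - lr * g for x, g in zip(logits, grads)]


logits = [0.2, -0.1, 0.4, 0.0, -0.3]   # toy vocabulary of 5 tokens
target = 3                              # the token that actually came next
before = softmax(logits)
after = softmax(update_step(logits, target))

loss_before = -math.log(before[target])   # loss: lower is better
loss_after = -math.log(after[target])
print(before[target], "->", after[target])
print(loss_before, "->", loss_after)
```

The probability of the correct token goes up and the loss goes down, which is the "slightly improved distribution" described above.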

This happens in parallel across batches of windows covering the entire dataset. The network gradually becomes consistent with the statistical patterns of the data.

Neural Network Internals

Inputs (token sequences) get mixed up in a giant mathematical expression with the network's parameters (also called weights). Modern networks have billions of parameters, randomly initialized at the start.

Think of the parameters as knobs on a DJ set. Training discovers a setting of those knobs that produces predictions consistent with the training data.

The expressions themselves aren't scary — multiplication, addition, exponentiation, division. Neural network architecture research designs expressions that are expressive, optimizable, and parallelizable.

A production example: the Transformer. One visualization shows ~85,000 parameters. Tokens are first embedded into vector representations, then those vectors flow through layers — attention blocks, MLP blocks, layer norms, matrix multiplications, softmaxes — producing intermediate values you can loosely think of as "firing rates" of synthetic neurons.

But don't take that analogy too far. These are far simpler than biological neurons. There's no memory, no state — it's a fixed function from input to output. Think of it as a synthetic piece of brain tissue if it helps, but it's stateless.

Inference

After training, we can generate new data — this is inference.

Start with a prefix (e.g., a single token, ID 91). Feed it into the network, get the probability distribution, sample from it (flip a biased coin weighted by probabilities). The sampled token is, say, 860. Append it. Feed in [91, 860], sample the next token, and so on.

Because sampling is stochastic, you don't reproduce the training data verbatim. You get remixes — statistically similar to training data but not identical. Sometimes the sampled token wasn't even part of any document in the training data; it just happens to be a likely next token in this context.
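
Here is a sketch of that sampling loop in Python. The lookup table is a hypothetical stand-in for a trained network, and it conditions only on the last token, whereas a real model conditions on the whole context. Token IDs other than 91 and 860 are invented.

```python
import random

# Hypothetical next-token table standing in for a trained network:
# maps the last token to (candidate tokens, probabilities).
NEXT_TOKEN = {
    91: ([860, 287], [0.7, 0.3]),
    860: ([287, 11579], [0.5, 0.5]),
    287: ([11579, 91], [0.6, 0.4]),
    11579: ([91, 860], [0.5, 0.5]),
}


def generate(prefix, num_tokens, rng):
    tokens = list(prefix)
    for _ in range(num_tokens):
        candidates, probs = NEXT_TOKEN[tokens[-1]]
        # "Flip a biased coin": draw one token according to its probability
        next_token = rng.choices(candidates, weights=probs, k=1)[0]
        tokens.append(next_token)   # append, then feed the longer sequence back in
    return tokens


print(generate([91], 5, random.Random(0)))
print(generate([91], 5, random.Random(1)))   # a different seed may give a different sequence
```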

When you talk to ChatGPT, you're doing inference. OpenAI trained the model months ago with a fixed set of parameters, and now they're held constant. You provide tokens, the model completes them.

GPT-2: Training and Inference

A concrete example I'm fond of: GPT-2 (OpenAI, 2019). It's the first model where a recognizably modern stack came together — everything since has just gotten bigger.

Transformer neural network
1.6 billion parameters (modern models: hundreds of billions to trillions)
Max context length: 1,024 tokens (modern: hundreds of thousands)
Trained on ~100 billion tokens (FineWeb: 15 trillion)

Training GPT-2 in 2019 was estimated at ~$40,000. Today, with my llm.c reproduction, you can do it in about a day for ~$600 — and you could probably get it to ~$100 without trying hard. Why so much cheaper? Better datasets, much faster hardware, and much better software.

When you train one of these models as a researcher, you stare at a screen where each line is one update — adjusting parameters slightly to improve prediction on ~1 million tokens. The key number to watch is loss: a single number where lower is better. As updates accumulate, loss decreases and predictions improve.

Every 20 steps you might run inference to see what the model generates. At step 1, it's pure gibberish (random network). At step 420 of 32,000, you get something with local coherence but global incoherence. By the end, it produces fairly coherent English.

This runs on an 8×H100 node rented from a cloud provider like Lambda (~$3/GPU/hour). H100 GPUs slot into computers and are perfectly suited to training because of the massive parallelism in matrix multiplications. Stack 8 into a node, many nodes into a data center. This is the "gold rush" driving NVIDIA's $3.4T market cap. Elon Musk getting 100,000 GPUs in a single data center? They're all doing this — predicting the next token, faster.

Llama 3.1 Base Model Inference

Few companies release base models because by themselves they're just internet-text simulators — not assistants. But some do.

Llama 3.1 from Meta: 405B parameters, trained on 15 trillion tokens. Meta released both the base model and an instruct (assistant) model.

You can interact with the Llama 3.1 405B base on hyperbolic.xyz. A few things to notice:

It's not yet an assistant. Ask "What is 2+2?" and it doesn't answer 4 — it continues the token sequence, possibly veering into philosophy. It's a glorified, very expensive autocomplete.
It's stochastic. Same prompt, different result every time.
It still has knowledge. Prompt: "Here's my top 10 list of landmarks to see in Paris:" — and it produces a list. The 405B parameters are a lossy compression of the internet. Knowledge is recalled vaguely and probabilistically, not stored explicitly. Things mentioned often are remembered better.
Memorization happens. Paste the first sentence of the Zebra Wikipedia page and the model will regurgitate huge chunks verbatim. High-quality sources like Wikipedia get oversampled during training (a few "epochs"), so the model can recite them.
It hallucinates beyond its knowledge cutoff. Llama 3's cutoff is end of 2023. Prompt it about the 2024 election and it makes things up — guessing Mike Pence as Trump's running mate against Hillary Clinton in one sample, Ron DeSantis against Biden/Harris in another. Parallel-universe hallucinations.
It can be coaxed into useful behavior via prompt design. A few-shot prompt with 10 English:Korean pairs followed by "teacher:" will make the model continue the pattern (in-context learning). You can even fake an assistant by structuring the prompt as a "conversation between an AI assistant and a human" with example turns, then appending a user query.
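
A few-shot prompt like the one described in the last item is just a string. The sketch below builds one in Python with three example pairs instead of ten; the helper name and the specific word pairs are chosen for illustration.

```python
def few_shot_prompt(pairs, query):
    """Format example pairs one per line, then leave the final line open."""
    lines = [f"{english}: {korean}" for english, korean in pairs]
    lines.append(f"{query}:")   # the base model is expected to continue from here
    return "\n".join(lines)


examples = [("student", "학생"), ("water", "물"), ("school", "학교")]
prompt = few_shot_prompt(examples, "teacher")
print(prompt)
```

Fed to a base model, the most likely continuation of the pattern is the Korean word for "teacher", even though nothing in the prompt states the task.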

Pretraining to Post-Training

Recap: pretraining takes internet documents, breaks them into tokens, and trains a network to predict the next token. The output is a base model — an internet document simulator at the token level.

We want an assistant — something that answers questions. For that we need the second stage: post-training.

Post-training is computationally much cheaper than pretraining (hours vs. months) but critical for turning the model into something useful.

Post-Training Data (Conversations)

We want the model to handle multi-turn conversations between a human and an assistant. For example:

Human: "What is 2+2?" → Assistant: "2+2 = 4"
Human: "What if it were * instead of +?" → Assistant: explains.
Human: asks for something harmful → Assistant: refuses.

We're not programming the assistant in code — we're programming it implicitly via a dataset of example conversations. Hundreds of thousands of multi-turn conversations covering diverse topics. The model trains on these and learns to imitate the response patterns.

Where do the conversations come from? Human labelers. They're given conversational contexts and asked to write the ideal assistant response.

We take the base model, throw away the internet-text dataset, and continue training on the conversations dataset using the exact same algorithm. The model rapidly picks up the statistical pattern of how the assistant responds.

Tokenizing Conversations

Conversations have to become 1D token sequences. There are protocols for this — analogous to TCP/IP packet formats.

For GPT-4o: special tokens like <|im_start|> (imaginary monologue start), then a role identifier ("user" or "assistant"), then <|im_sep|>, then the content, then <|im_end|>. These special tokens are introduced during post-training — they're new tokens never seen in pretraining. A short conversation might become 49 tokens.

At inference time, the system constructs the conversation context up through <|im_start|>assistant<|im_sep|> and then samples to generate the assistant's reply.
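
A sketch of that serialization in Python, following the format as described in this section. The exact production format is an assumption here, and the function names are hypothetical.

```python
IM_START, IM_SEP, IM_END = "<|im_start|>", "<|im_sep|>", "<|im_end|>"


def render_turn(role, content):
    return f"{IM_START}{role}{IM_SEP}{content}{IM_END}"


def render_for_inference(messages):
    """Serialize finished turns, then open an assistant turn for the model to fill in."""
    rendered = "".join(render_turn(m["role"], m["content"]) for m in messages)
    return rendered + f"{IM_START}assistant{IM_SEP}"


conversation = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 = 4"},
    {"role": "user", "content": "What if it were * instead of +?"},
]
print(render_for_inference(conversation))
```

The resulting string is what gets tokenized; the model then samples tokens until it emits the end-of-turn token.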

What These Datasets Look Like

The seminal paper here is OpenAI's InstructGPT (2022). Human contractors (hired via Upwork, Scale AI) wrote prompts and ideal assistant responses, following detailed labeling instructions ("be helpful, truthful, harmless"). The instructions can be hundreds of pages.

OpenAI never released the InstructGPT dataset, but open-source reproductions exist (e.g., OpenAssistant). Modern datasets like UltraChat have millions of conversations, much of it now synthetically generated with LLM help and lightly human-edited.

What You're Actually Talking To

This is important. When you ask ChatGPT a question, you're not talking to a magical AI. You're talking to a statistical simulation of a human labeler — what a (typically expert) human, following labeling instructions, would most plausibly write in response. For prompts not in the dataset, the answer is emergent — combining pretraining knowledge with the assistant persona learned from labeling.

Hallucinations, Tool Use, Knowledge / Working Memory

Now: LLM psychology.

Hallucinations

LLMs make things up. In the training set, questions like "Who is [famous person]?" are confidently answered. So when asked "Who is Orson Kovats?" (a name I made up), the model doesn't say "I don't know" — it confidently fabricates, because the style of confident answers is what it learned.

Older models (Falcon 7B) hallucinate freely: "Orson Kovats is an American author... no, a TV character... no, a baseball player." Modern ChatGPT responds: "There's no well-known public figure named Orson Kovats."

Mitigation 1: Teach the model to refuse when it doesn't know.

Meta's approach for Llama 3 (described in their paper):

Take a document, pick a paragraph, use an LLM to generate factual Q&A pairs.
Interrogate the model with the questions, multiple times.
Compare its answers to the correct answer with an LLM judge.
If the model consistently gets it wrong, add a training example where the correct response is "I don't know / I don't remember."

There's probably a neuron inside the network representing uncertainty — it's just not wired up to the output. Adding refusal examples teaches the model to surface that uncertainty.
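
The procedure above can be sketched as follows. This is an illustration of the idea, not Meta's code: `ask_model` and `is_correct` are hypothetical callables standing in for the model under test and the LLM judge, "consistently wrong" is taken to mean wrong on every try, and the toy model and questions are invented.

```python
def build_refusal_examples(qa_pairs, ask_model, is_correct, num_tries=3):
    """Return training examples for questions the model consistently gets wrong."""
    examples = []
    for question, reference in qa_pairs:
        answers = [ask_model(question) for _ in range(num_tries)]
        if not any(is_correct(a, reference) for a in answers):
            examples.append({"prompt": question, "response": "I don't know."})
    return examples


# Toy stand-ins so the sketch runs end to end
KNOWN = {"What is the capital of France?": "Paris"}


def toy_model(question):
    return KNOWN.get(question, "Probably 1950")


def exact_match(answer, reference):
    return answer.strip().lower() == reference.strip().lower()


qa = [
    ("What is the capital of France?", "Paris"),
    ("In which year was the Example Bridge completed?", "1987"),  # hypothetical fact
]
print(build_refusal_examples(qa, toy_model, exact_match))
```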

Mitigation 2: Give the model tools.

Knowledge in parameters is a vague recollection (something you read a month ago). Knowledge in the context window is working memory (something in front of you right now). To refresh memory, humans use the internet — so we give models web search.

We introduce special tokens like <SEARCH_START>query<SEARCH_END>. When the model emits these, the inference program pauses generation, runs a real search, copy-pastes the results into the context window, and resumes. The model now has the information directly available.
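
A minimal sketch of that inference loop in Python. `generate` and `search` are hypothetical callables standing in for the model and a search backend, and the "[search results]" delimiter is made up for this example.

```python
import re

SEARCH_CALL = re.compile(r"<SEARCH_START>(.*?)<SEARCH_END>", re.DOTALL)


def run_with_search(prompt, generate, search, max_calls=3):
    """generate(context) returns text that ends at a search call or at the final answer."""
    context = prompt
    calls = 0
    while True:
        output = generate(context)
        context += output
        match = SEARCH_CALL.search(output)
        if match is None or calls >= max_calls:
            return context                      # no further tool call: done
        calls += 1
        results = search(match.group(1).strip())
        context += "\n[search results]\n" + results + "\n"   # paste into the context window


# Toy stand-ins so the sketch runs end to end
def toy_generate(context):
    if "[search results]" not in context:
        return "<SEARCH_START>Orson Kovats<SEARCH_END>"
    return "I could not find a well-known person by that name."


def toy_search(query):
    return f"0 results for '{query}'"


print(run_with_search("Who is Orson Kovats?\n", toy_generate, toy_search))
```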

We teach this with training examples that demonstrate when and how to use the tool. The model already has a native sense of what a good search query looks like from pretraining.

This is why ChatGPT now says "searching the web..." and produces citations.

Knowledge vs. Working Memory

Implication for users: If you want the model to use specific information, put it in the context window rather than relying on recall. "Summarize chapter 1 of Pride and Prejudice" works, but "Summarize chapter 1 of Pride and Prejudice — text attached below" works significantly better.

Knowledge of Self

People often ask LLMs "What model are you?" or "Who built you?" — but the question is somewhat nonsensical. The model has no persistent self. It boots up, processes tokens, shuts down. Every conversation is a fresh start.

If you don't program a self-identity, you get random answers. Falcon 7B will tell you it was built by OpenAI based on GPT-3 — not because it was trained on OpenAI data, but because ChatGPT is so prominent on the internet that "I am ChatGPT by OpenAI" is the model's hallucinated default label.

Two ways to override this:

Hardcoded training examples. Allen AI's OLMo dataset has 240 conversations explicitly about model identity ("What is your name?" → "I'm OLMo by Allen AI...").
System messages. Hidden tokens inserted at the start of every conversation reminding the model of its identity, training date, knowledge cutoff, etc.

Either way, it's bolted on — not deeply present.

Models Need Tokens to Think

A critical insight. Here's a math problem: "Emily buys 3 apples and 2 oranges, each orange costs $2, total cost is $13, what's the cost of an apple?"

Two possible labeled responses, both arriving at $3:

Bad: "The answer is $3. [explanation follows]"
Good: Work step by step, set up an equation, compute intermediate results, arrive at $3 at the end.

Why is one bad? Because each token gets a fixed, finite amount of computation through the network — maybe 100 layers' worth. You can't cram an arbitrary calculation into one token.

If the answer comes first ("The answer is $3"), the model has to do the entire problem in the computation budget of a single token. That doesn't work. Anything after that point is just post-hoc justification — the answer is already committed.

The good answer distributes computation across tokens. Each token does a small amount of work; by the end, the model has built up intermediate results in its context and reaching the final answer is easy.

You can test this: ask ChatGPT "Answer in a single token: Emily buys 23 apples and 177 oranges..." — for harder numbers, it fails. Let it work step by step and it succeeds.

Even better: for arithmetic-heavy problems, ask it to use code. The Python interpreter is far more reliable than the model's mental arithmetic.
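
For the original Emily problem, the kind of code a model might write looks like this (variable names are illustrative). Each intermediate result gets its own line, mirroring the step-by-step answer.

```python
num_apples, num_oranges = 3, 2
orange_price = 2
total_cost = 13

oranges_cost = num_oranges * orange_price    # 2 * 2 = 4
apples_cost = total_cost - oranges_cost      # 13 - 4 = 9
apple_price = apples_cost / num_apples       # 9 / 3 = 3
print(apple_price)                           # 3.0
```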

Counting

Same problem. "How many dots are below: ...." The model will guess in a single token and get it wrong. Ask it to use code — it copies the input into a Python string and calls .count(). Now it's accurate. Counting is a task to delegate to tools.
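
The code the model writes for this is about as short as Python gets. The dot string below is an arbitrary example input.

```python
dots = "....................................................."
print(dots.count("."))   # the interpreter does the counting, not the model
```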

Tokenization Revisited: Spelling

The model doesn't see characters — it sees tokens. "Ubiquitous" is 3 tokens to the model, not 10 letters. So tasks like "print every third character" fail.

The famous example: "How many R's in strawberry?" Models insisted on 2 for ages. It's not because they're dumb — they don't see the individual letters, and they're bad at counting. Two compounding deficits.

Fix: ask the model to use code.
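
Both spelling tasks become trivial once the text is a Python string, because code operates on characters rather than tokens. The sketch assumes "every third character" starts from the first one.

```python
word = "strawberry"
print(word.count("r"))        # 3

text = "ubiquitous"
print(len(text))              # 10
print(text[::3])              # every third character, starting with the first: uqts
```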

Jagged Intelligence

LLMs can solve Olympiad math but fail on "Which is bigger, 9.11 or 9.9?" — answering 9.11 confidently. Researchers found that internal activations associated with Bible verses light up on this prompt. In Bible verse numbering, 9.11 does come after 9.9. The model gets distracted by the spurious association.
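
The two readings of "9.11 vs. 9.9" give opposite answers, which a few lines of Python make explicit. The helper names are made up for this example.

```python
def as_decimal(text):
    return float(text)


def as_chapter_verse(text):
    chapter, verse = text.split(".")
    return int(chapter), int(verse)


print(as_decimal("9.11") > as_decimal("9.9"))                # False: 9.11 < 9.90
print(as_chapter_verse("9.11") > as_chapter_verse("9.9"))    # True: verse 11 comes after verse 9
```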

This is what I call jagged intelligence or the "Swiss cheese" model of LLM capabilities: brilliant in most places, with arbitrary holes you can fall into. Treat the model as a powerful but unreliable tool. Check its work.

Supervised Fine-Tuning → Reinforcement Learning

So far we have two stages: pretraining (knowledge acquisition) and supervised fine-tuning (imitating expert assistant responses). Now we add a third: reinforcement learning.

A useful analogy: school textbooks have three kinds of content:

Exposition — background knowledge (= pretraining).
Worked examples — expert solutions you imitate (= supervised fine-tuning).
Practice problems — you get the problem and the final answer but have to find your own solution (= reinforcement learning).

Reinforcement Learning

Why is SFT not enough? Consider that math problem again. A human labeler writes a "good" solution — but how do they know what's good for the model? Token sequences that are easy for the model differ from those easy for humans. The labeler might inject leaps the model can't make, or pad with steps that are trivial for the model. The labeler simply doesn't know the model's ideal token path.

The solution: let the model discover its own solutions.

For a given prompt, generate many candidate solutions (thousands or millions). Check which ones reach the correct answer. Train on the ones that worked. The model discovers, by trial and error, the token sequences that reliably get it to correct answers — exploiting its own knowledge, not someone else's.

This is essentially "do more of what worked." SFT initializes the model in the vicinity of correct solutions; RL dials it in.
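
As a toy illustration of "do more of what worked", the sketch below keeps one weight per solution strategy and bumps the weight whenever a rollout reaches the correct answer. Everything in it is an assumption made for illustration: a real system updates network parameters rather than two weights, and the strategies and their success rates are invented.

```python
import random


def solve(strategy, rng):
    """Hypothetical rollout: returns an answer to the Emily problem (correct answer: 3)."""
    if strategy == "step_by_step":
        return 3 if rng.random() < 0.9 else 4
    # "answer_first": guessing in one shot is right much less often
    return 3 if rng.random() < 0.2 else rng.choice([2, 4, 5])


def rl_round(weights, rng, num_rollouts=200, lr=0.1, correct=3):
    strategies = list(weights)
    for _ in range(num_rollouts):
        # Sample a strategy in proportion to its current weight
        strategy = rng.choices(strategies, weights=[weights[s] for s in strategies])[0]
        if solve(strategy, rng) == correct:
            weights[strategy] += lr          # reinforce what reached the right answer
    return weights


rng = random.Random(0)
weights = {"answer_first": 1.0, "step_by_step": 1.0}
rl_round(weights, rng)
print(weights)
```

The only feedback is whether the final answer was correct, yet the weight shifts toward the strategy that spreads the work out.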

DeepSeek-R1

The first two stages (pretraining, SFT) are standard and well-known. RL fine-tuning for LLMs has been done internally at labs like OpenAI for years but kept private. The DeepSeek-R1 paper (early 2025) was a big deal because it publicly described how to do RL fine-tuning at scale and gave enough detail to reproduce it.

Key findings:

On math problems, accuracy climbs steadily over thousands of RL steps.
Average response length grows as training progresses. The model learns that longer responses with more thinking yield better accuracy.
Chains of thought emerge — the model spontaneously starts saying things like "wait, let me re-evaluate this step by step from a different perspective." No human labeler wrote that template — it emerged from optimization because it works.

These cognitive strategies — backtracking, reframing, trying multiple approaches, verifying — emerge naturally. The only signal is "correct answer or not."

When you give the same Emily-apples problem to DeepSeek-R1 (on chat.deepseek.com with "DeepThink" enabled), you get a long internal monologue: "Let me figure this out... it must be $3. Wait, let me check. Yep, that's right. Let me try setting up an equation. Same answer. Confident." Then a clean writeup for the human.

DeepSeek-R1 is open-weights — you can download it, or use it via together.ai (an American provider).

In ChatGPT, models like o1, o3-mini, o3-mini-high are thinking models trained similarly with RL. GPT-4o is mostly an SFT model. OpenAI hides the raw chain-of-thought (showing only summaries) — likely to prevent distillation, where competitors imitate the reasoning traces.

When you encounter a hard problem, reach for a thinking model. For simple factual or conversational queries, GPT-4o is faster and usually fine. ~80–90% of my usage is GPT-4o; I reach for thinking models for hard math/code problems.

AlphaGo

RL's power isn't new. AlphaGo (DeepMind) showed this in the game of Go years ago.

In the AlphaGo paper, there's a plot comparing models trained by supervised learning (imitating human expert players) versus reinforcement learning. The SL model plateaus below the level of top human players — it can't exceed what it's imitating. The RL model surpasses top human players, including Lee Sedol.

RL discovered moves humans wouldn't play. Famously, Move 37 in the AlphaGo vs. Lee Sedol match was a move humans estimated had a 1-in-10,000 probability of being played. In retrospect: brilliant. AlphaGo found a strategy unknown to human Go.

We're seeing the beginnings of something analogous in LLMs. RL could in principle discover new ways of thinking that no human has tried — new analogies, new cognitive strategies, maybe even new languages internal to the model. The behavior is unconstrained as long as it reaches correct answers.

This requires huge, diverse sets of practice problems — essentially "game environments" for thinking. A lot of frontier LLM research is building those environments.

Reinforcement Learning from Human Feedback (RLHF)

Everything above is in verifiable domains — math, code — where solutions can be automatically scored against a known answer.

What about unverifiable domains? Write a joke. Write a poem. Summarize a paragraph. How do you score these automatically?

Naive approach: have humans score every solution. Infeasible: with roughly 1,000 updates × 1,000 prompts × 1,000 rollouts, RL would need on the order of a billion human judgments.

RLHF (OpenAI, ~2022): introduce indirection.

Collect a small amount of human preference data — show humans 5 candidate solutions, ask them to rank (ranking is easier than scoring or generating).
Train a separate neural network — the reward model — to imitate those human rankings.
Run RL against the reward model, querying it as many times as needed.

The reward model is a simulator of human preference. It's not perfect, but if it's statistically similar to human judgment, RL against it produces improvements.
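
A toy sketch of the first two steps in Python: a human ranking is turned into pairwise preferences, and a reward model is fit so that preferred items score higher. The pairwise logistic loss is one common choice, the linear model is a stand-in for the "separate neural network", and the two hand-written features and the example jokes are invented for illustration.

```python
import itertools
import math


def ranking_to_pairs(ranked):
    """ranked: items from best to worst -> list of (preferred, rejected) pairs."""
    return list(itertools.combinations(ranked, 2))


def train_reward_model(pairs, featurize, dim, steps=200, lr=0.5):
    """Fit linear reward weights so preferred items score higher than rejected ones."""
    w = [0.0] * dim
    for _ in range(steps):
        for better, worse in pairs:
            fb, fw = featurize(better), featurize(worse)
            margin = sum(wi * (b - c) for wi, b, c in zip(w, fb, fw))
            # Gradient of -log(sigmoid(margin)) with respect to the margin
            coeff = -1.0 / (1.0 + math.exp(margin))
            w = [wi - lr * coeff * (b - c) for wi, b, c in zip(w, fb, fw)]
    return w


def featurize(text):
    # Hypothetical features: mentions a pelican, ends with an exclamation mark
    return [1.0 if "pelican" in text.lower() else 0.0, 1.0 if text.endswith("!") else 0.0]


def reward(w, text):
    return sum(wi * fi for wi, fi in zip(w, featurize(text)))


ranked = [
    "Why did the pelican get kicked out of the restaurant? It had a very big bill!",
    "A pelican walks into a bar.",
    "The weather is nice today.",
]
w = train_reward_model(ranking_to_pairs(ranked), featurize, dim=2)
print([round(reward(w, joke), 2) for joke in ranked])
```

Once trained, `reward` can be called as often as RL needs, which is the whole point of the indirection.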

Upsides

Enables RL in unverifiable domains.
Empirically improves models.
Leverages the discriminator–generator gap: it's much easier for humans to compare 5 generated poems than to write the ideal poem from scratch.

Downsides

We're optimizing against a lossy simulation, not actual humans.
More devious: RL is extremely good at gaming the reward model.

Run RLHF too long and you'll find the top "joke about pelicans" becomes something nonsensical like "the the the the the" — because the reward model, a giant neural net with billions of parameters, has weird corners where nonsense inputs score 1.0. These are adversarial examples.

You can patch each one (add "the the the" to the dataset with a low score), but there are infinitely many. RL will always find new ones.

So RLHF is not RL in the magical, scalable sense. You can't run it indefinitely the way you can run AlphaGo's RL. You run it for a few hundred steps, get a small improvement, and stop before the reward model gets gamed. RLHF is more like a fine-tune than true RL.

In verifiable domains (math, code), the scoring function is rigid — either you got the right answer or you didn't, and you can't game that. So you can run RL indefinitely and get genuine emergent reasoning. In RLHF, you can't.

GPT-4o has been through RLHF — it works, but it's a polish, not a fundamental capability multiplier.

Preview of Things to Come

A few things on the horizon:

Multimodality. Audio, images, video — all tokenizable. Slice an audio spectrogram into tokens, slice an image into patches, intersperse with text. Same training paradigm. We're already seeing the early forms.

Agents. Today we hand individual tasks to models on a silver platter. Soon: agents performing long-running tasks over minutes or hours, with humans supervising. Expect "human-to-agent ratios" the way factories have human-to-robot ratios.

Pervasive integration. LLMs becoming invisible infrastructure inside tools.

Computer-using models. ChatGPT's Operator is an early example — handing the model control of keyboard and mouse.

Test-time training. Currently, parameters are fixed after training; only the context window is "learned" during inference (in-context learning). Humans actually update their parameters from experience (especially during sleep). We don't have an analog of that yet. Important because context windows are finite, and multimodal long-running tasks will blow past current sizes.

Keeping Track of LLMs

Three resources I use:

LM Arena (lmarena.ai) — leaderboard of models ranked by blind human comparisons. Recently feels somewhat gamed (Sonnet ranks lower than I'd expect from my own use), but a useful first pass.
AI News newsletter by swyx — comprehensive, near-daily summaries of what's new. Mostly LLM-assisted but with human curation.
X (Twitter) — most of the conversation happens there. Follow people you trust.

Where to Find LLMs

Proprietary models: chat.openai.com (ChatGPT), gemini.google.com or aistudio.google.com (Google), claude.ai (Anthropic).
Open-weight models (DeepSeek, Llama, etc.): inference providers like together.ai.
Base models: less common. hyperbolic.xyz hosts Llama 3.1 405B base, which I like.
Local: smaller / distilled / lower-precision versions can run on a MacBook. LM Studio is the app I use — ugly UI, lots of confusing options, but it works once you find your way.

Grand Summary

When you go to ChatGPT and hit enter, here's what's happening:

Your query is tokenized and inserted into the conversation protocol format.
The model continues the token sequence — predicting one token at a time.
The tokens it picks reflect what was learned across three stages:
- Pretraining: internalized knowledge from the internet, stored as a lossy compression in the parameters.
- Supervised fine-tuning: assistant persona learned by imitating human labelers following labeling instructions.
- Reinforcement learning (for thinking models): emergent reasoning strategies discovered through trial and error on verifiable problems.

For a non-thinking model like GPT-4o, the reply is essentially a neural network simulation of a human data labeler at OpenAI — what a labeler, given the instructions and infinite patience, would have written. Not magic, just a learned imitation, with the cognitive quirks of a finite-compute-per-token system.

For thinking models like o3-mini or DeepSeek-R1, the reply additionally reflects emergent reasoning developed through RL — strategies that weren't directly imitated from any human. In verifiable domains this is genuinely new and powerful; we're seeing early hints of "Move 37"-style discoveries in open-ended thinking.

Practical takeaways:

Use LLMs as tools, not oracles. They'll randomly fail in surprising ways even when they shine elsewhere.
Watch for hallucinations. Mitigations help but don't eliminate them.
Put information in the context window when you need it used precisely; don't rely on recall.
For arithmetic, counting, spelling, or anything requiring precision: ask for code.
For hard reasoning, use a thinking model and accept the wait.
Always verify. Own the product of your work.

It's an extraordinary time to be working with these tools. I use them dozens of times a day. They massively accelerate work — as long as you respect their sharp edges.

Thanks for watching.