Hi everyone. I've wanted to make this video for a while: a comprehensive but general-audience introduction to large language models like ChatGPT. My goal is to give you mental models for thinking about what this tool actually is. It's magical and amazing in some respects, really good at some things, not very good at others, and there are a lot of sharp edges to be aware of.
What's behind that text box? You can put anything in and press enter — but what should you put there? What are the words coming back? How does it work? What are you actually talking to? We'll go through the entire pipeline of how this stuff is built, keeping it accessible, and along the way I'll touch on the cognitive and psychological implications of these tools.
Building something like ChatGPT happens in multiple sequential stages. The first is pretraining, and its first step is to download and process the internet.
A good reference for what this looks like in practice is the FineWeb dataset from Hugging Face. Every major LLM provider (OpenAI, Anthropic, Google) has some internal equivalent. The goal: a huge quantity of high-quality, highly diverse documents from publicly available sources.
FineWeb, which is representative of production-grade data, ends up being only about 44 TB, small enough to almost fit on a single hard drive. The internet is large, but once you filter it down to text and apply aggressive quality filters, it becomes manageable.
The starting point is usually Common Crawl, an organization that has been scraping the internet since 2007. As of 2024, it had indexed 2.7 billion web pages. Crawlers follow links from seed pages and index everything they find.
The raw Common Crawl data goes through many stages of filtering:
The final dataset looks like a giant tapestry of raw internet text. With 200 web pages concatenated, you already have a massive amount of text. This is the starting point for the next stage.
Before feeding text into a neural network, we need to decide how to represent it. The network expects a one-dimensional sequence of symbols from a finite vocabulary.
The text is already a 1D sequence (left to right, top to bottom). Underneath, it's UTF-8 encoded bits, just zeros and ones. But a vocabulary of only two symbols makes the sequence extremely long, and that's bad: sequence length is a precious resource. We want to trade off vocabulary size against sequence length.
A first compression: group 8 bits into a byte. Now we have 256 possible symbols and a sequence 8× shorter. Think of these not as numbers but as unique IDs — like 256 distinct emojis.
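Here's a minimal sketch of that trade-off in Python. The example string is my own, chosen only for illustration; the point is that the same text is 8× shorter as bytes than as bits.

```python
text = "hi there"

# UTF-8 encode the text into raw bytes (each byte is a value 0..255).
raw = text.encode("utf-8")

# View the data as a sequence of bits: only 2 symbols, but 8x longer.
bits = "".join(f"{b:08b}" for b in raw)

# View it as a sequence of bytes: 256 possible symbols, 8x shorter.
byte_ids = list(raw)

print(len(bits), len(byte_ids))  # 64 8
print(byte_ids)                  # [104, 105, 32, 116, 104, 101, 114, 101]
```

The numbers in byte_ids are just labels for symbols. Nothing downstream treats 104 as "bigger" than 32.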
State-of-the-art models go further using byte-pair encoding (BPE). Find the most frequent consecutive pair of symbols (e.g., 116 followed by 32), mint a new symbol for it (the first one gets ID 256), and replace every occurrence of the pair with that symbol. Iterate. Each iteration shortens the sequence and grows the vocabulary by one.
GPT-4's vocabulary has 100,277 tokens. The process of converting text to tokens is called tokenization.
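To make the byte-pair encoding loop concrete, here's a toy sketch. It is not the actual GPT-4 tokenizer: the training text, the number of merges (3), and the helper names most_common_pair and merge are all my own assumptions for illustration.

```python
from collections import Counter

def most_common_pair(ids):
    """Return the most frequent adjacent pair of symbols in the sequence."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace each occurrence of `pair` with `new_id`, scanning left to right."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw bytes: 256 possible symbols.
ids = list("the cat sat at that mat ".encode("utf-8"))
original_length = len(ids)

vocab_size = 256
merges = {}  # maps a merged pair to the new symbol ID minted for it
for _ in range(3):
    pair = most_common_pair(ids)
    merges[pair] = vocab_size
    ids = merge(ids, pair, vocab_size)
    vocab_size += 1

print(original_length, len(ids), vocab_size)
print(merges)
```

Each pass through the loop makes the sequence shorter and the vocabulary one symbol larger, which is exactly the trade-off being tuned.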
You can explore this at tiktokenizer. Some observations:
Re-represent the FineWeb data using the tokenizer: it becomes a 15-trillion-token sequence. These tokens are atoms — the numbers are just unique IDs.
Now the fun part: neural network training. We model the statistical relationships of how tokens follow each other.
Take a random window of tokens from the data. Its length can be anywhere from 0 up to some maximum (e.g., 8,000 tokens). Longer windows are more expensive, so we crop. Say we take 4 tokens as context.
Feed those 4 tokens into the network. The network outputs a probability distribution over all 100,277 possible next tokens. Initially the network is randomly initialized, so the predictions are random. But we know what actually comes next (it's in our dataset), so we have a label.
We mathematically update the network so the correct token gets slightly higher probability and the others get slightly lower. After the update, feeding the same context produces a slightly improved distribution.
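Here's a very small sketch of what one such update does. This is a toy under loud assumptions: the vocabulary has 5 tokens instead of 100,277, and the "network" is just a list of scores (logits) for one fixed context rather than a real Transformer. The update rule is a plain gradient step on the cross-entropy loss.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy stand-in for the network's output on one context window.
logits = [0.1, -0.3, 0.2, 0.0, 0.05]
target = 3           # the token that actually came next in the data (the label)
learning_rate = 0.5  # assumed step size, chosen for illustration

probs_before = softmax(logits)
loss_before = -math.log(probs_before[target])

# Gradient of the cross-entropy loss w.r.t. the logits: probs - one_hot(target).
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs_before)]
logits = [x - learning_rate * g for x, g in zip(logits, grads)]

probs_after = softmax(logits)
loss_after = -math.log(probs_after[target])

print(probs_before[target], probs_after[target])
print(loss_before, loss_after)
```

After the step, the correct token's probability is higher and the loss is lower. A real training run repeats this over huge batches, with the gradient flowing back into billions of parameters instead of five scores.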
This happens in parallel across batches of windows covering the entire dataset. The network gradually becomes consistent with the statistical patterns of the data.
Inputs (token sequences) get mixed up in a giant mathematical expression with the network's parameters (also called weights). Modern networks have billions of parameters, randomly initialized at the start.
Think of the parameters as knobs on a DJ set. Training discovers a setting of those knobs that produces predictions consistent with the training data.
The expressions themselves aren't scary — multiplication, addition, exponentiation, division. Neural network architecture research designs expressions that are expressive, optimizable, and parallelizable.
A production example: the Transformer. One visualization shows ~85,000 parameters. Tokens are first embedded into vector representations, then those vectors flow through layers — attention blocks, MLP blocks, layer norms, matrix multiplications, softmaxes — producing intermediate values you can loosely think of as "firing rates" of synthetic neurons.
But don't take that analogy too far. These are far simpler than biological neurons. There's no memory, no state — it's a fixed function from input to output. Think of it as a synthetic piece of brain tissue if it helps, but it's stateless.
After training, we can generate new data — this is inference.
Start with a prefix (e.g., a single token, ID 91). Feed it into the network, get the probability distribution, sample from it (flip a biased coin weighted by probabilities). The sampled token is, say, 860. Append it. Feed in [91, 860], sample the next token, and so on.
Because sampling is stochastic, you don't reproduce the training data verbatim. You get remixes — statistically similar to training data but not identical. Sometimes the sampled token wasn't even part of any document in the training data; it just happens to be a likely next token in this context.
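The sampling loop itself is short. In this sketch, toy_next_token_probs is a hypothetical stand-in for a trained network (a 6-token vocabulary with made-up probabilities), so only the loop structure should be taken seriously.

```python
import random

VOCAB_SIZE = 6

def toy_next_token_probs(context):
    """Hypothetical stand-in for a trained network: returns a probability
    distribution over the vocabulary given the tokens so far."""
    last = context[-1]
    # Favor the token ID right after `last` (wrapping around),
    # but leave some probability on every other token.
    weights = [1.0] * VOCAB_SIZE
    weights[(last + 1) % VOCAB_SIZE] = 5.0
    total = sum(weights)
    return [w / total for w in weights]

def generate(prefix, num_new_tokens, rng):
    tokens = list(prefix)
    for _ in range(num_new_tokens):
        probs = toy_next_token_probs(tokens)
        # Sample one token ID weighted by its probability (the "biased coin").
        next_token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        # Append it, so the next prediction sees the longer context.
        tokens.append(next_token)
    return tokens

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = generate([2], 10, rng)
print(sample)
```

Change the seed and you get a different sequence from the same model, which is why the same prompt can produce different replies.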
When you talk to ChatGPT, you're doing inference. OpenAI trained the model months ago, and its parameters are now fixed. You provide tokens, and the model completes them.
A concrete example I'm fond of: GPT-2 (OpenAI, 2019). It's the first model where a recognizably modern stack came together — everything since has just gotten bigger.
Training GPT-2 in 2019 was estimated at ~$40,000. Today, with my llm.c reproduction, you can do it in about a day for ~$600 — and you could probably get it to ~$100 without trying hard. Why so much cheaper? Better datasets, much faster hardware, and much better software.
When you train one of these models as a researcher, you stare at a screen where each line is one update — adjusting parameters slightly to improve prediction on ~1 million tokens. The key number to watch is loss: a single number where lower is better. As updates accumulate, loss decreases and predictions improve.
Every 20 steps you might run inference to see what the model generates. At step 1, it's pure gibberish (random network). At step 420 of 32,000, you get something with local coherence but global incoherence. By the end, it produces fairly coherent English.
This runs on an 8×H100 node rented from a cloud provider like Lambda (~$3/GPU/hour). H100 GPUs slot into computers and are perfectly suited to training because of the massive parallelism in matrix multiplications. Stack 8 into a node, many nodes into a data center. This is the "gold rush" driving NVIDIA's $3.4T market cap. Elon Musk getting 100,000 GPUs in a single data center? They're all doing this — predicting the next token, faster.
Few companies release base models because by themselves they're just internet-text simulators — not assistants. But some do.
Llama 3 from Meta: 405B parameters, trained on 15 trillion tokens. Meta released both the base model and an instruct (assistant) model.
You can interact with the Llama 3.1 405B base on hyperbolic.xyz. A few things to notice:
Recap: pretraining takes internet documents, breaks them into tokens, and trains a network to predict the next token. The output is a base model — an internet document simulator at the token level.
We want an assistant — something that answers questions. For that we need the second stage: post-training.
Post-training is computationally much cheaper than pretraining (hours vs. months) but critical for turning the model into something useful.
We want the model to handle multi-turn conversations between a human and an assistant. For example:
* instead of +?" → Assistant: explains.
We're not programming the assistant in code; we're programming it implicitly via a dataset of example conversations: hundreds of thousands of multi-turn conversations covering diverse topics. The model trains on these and learns to imitate the response patterns.
Where do the conversations come from? Human labelers. They're given conversational contexts and asked to write the ideal assistant response.
We take the base model, throw away the internet-text dataset, and continue training on the conversations dataset using the exact same algorithm. The model rapidly picks up the statistical pattern of how the assistant responds.
Conversations have to become 1D token sequences. There are protocols for this — analogous to TCP/IP packet formats.
For GPT-4o: special tokens like <|im_start|> (imaginary monologue start), then a role identifier ("user" or "assistant"), then <|im_sep|>, then the content, then <|im_end|>. These special tokens are introduced during post-training — they're new tokens never seen in pretraining. A short conversation might become 49 tokens.
At inference time, the system constructs the conversation context up through <|im_start|>assistant<|im_sep|> and then samples to generate the assistant's reply.
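Here's a rough sketch of that flattening step, using the special tokens named above. The function name render_conversation and the exact concatenation (no newlines or extra whitespace) are my assumptions; real tokenizers differ in those details, and they map each special token to a single ID rather than working on strings.

```python
def render_conversation(messages, add_generation_prompt=True):
    """Flatten a list of {'role', 'content'} messages into one string."""
    parts = []
    for msg in messages:
        parts.append(
            f"<|im_start|>{msg['role']}<|im_sep|>{msg['content']}<|im_end|>"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model samples the reply.
        parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

conversation = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2 is 4."},
    {"role": "user", "content": "What if it was * instead of +?"},
]
rendered = render_conversation(conversation)
print(rendered)
```

The string ends with an open assistant turn, so whatever the model samples next is, by construction, the assistant's reply.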
The seminal paper here is OpenAI's InstructGPT (2022). Human contractors (hired via Upwork, Scale AI) wrote prompts and ideal assistant responses, following detailed labeling instructions ("be helpful, truthful, harmless"). The instructions can be hundreds of pages.
OpenAI never released the InstructGPT dataset, but open-source reproductions exist (e.g., OpenAssistant). Modern datasets like UltraChat have millions of conversations, much of it now synthetically generated with LLM help and lightly human-edited.
This is important. When you ask ChatGPT a question, you're not talking to a magical AI. You're talking to a statistical simulation of a human labeler — what a (typically expert) human, following labeling instructions, would most plausibly write in response. For prompts not in the dataset, the answer is emergent — combining pretraining knowledge with the assistant persona learned from labeling.
Now: LLM psychology.
LLMs make things up. In the training set, questions like "Who is [famous person]?" are confidently answered. So when asked "Who is Orson Kovats?" (a name I made up), the model doesn't say "I don't know" — it confidently fabricates, because the style of confident answers is what it learned.
Older models (Falcon 7B) hallucinate freely: "Orson Kovats is an American author... no, a TV character... no, a baseball player." Modern ChatGPT responds: "There's no well-known public figure named Orson Kovats."
Mitigation 1: Teach the model to refuse when it doesn't know.
Meta's approach for Llama 3 (described in their paper):
There's probably a neuron inside the network representing uncertainty — it's just not wired up to the output. Adding refusal examples teaches the model to surface that uncertainty.
Mitigation 2: Give the model tools.
Knowledge in parameters is a vague recollection (something you read a month ago). Knowledge in the context window is working memory (something in front of you right now). To refresh memory, humans use the internet — so we give models web search.
We introduce special tokens like <SEARCH_START>query<SEARCH_END>. When the model emits these, the inference program pauses generation, runs a real search, copy-pastes the results into the context window, and resumes. The model now has the information directly available.
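The pause, search, paste, resume loop can be sketched like this. Everything here is hypothetical: fake_model and fake_search are canned stand-ins I made up, and a real system works on token IDs inside the inference server rather than on strings.

```python
SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"

def fake_model(context):
    """Hypothetical model: asks for a search once, then answers from context."""
    if "[search results]" not in context:
        return f"{SEARCH_START}Orson Kovats{SEARCH_END}"
    return "I could not find a well-known person by that name."

def fake_search(query):
    """Hypothetical search backend that returns canned text."""
    return f"[search results] No notable results for '{query}'."

def run_with_tools(prompt, max_steps=4):
    context = prompt
    for _ in range(max_steps):
        output = fake_model(context)
        if SEARCH_START in output and SEARCH_END in output:
            # Pause generation, pull out the query, run the search,
            # and paste the results into the context window.
            query = output.split(SEARCH_START, 1)[1].split(SEARCH_END, 1)[0]
            context += output + "\n" + fake_search(query) + "\n"
            continue  # resume generation with the richer context
        return context + output
    return context

transcript = run_with_tools("Who is Orson Kovats?\n")
print(transcript)
```

The key design point is that the search results land in the context window, so the model reads them as working memory instead of trying to recall them from its parameters.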
We teach this with training examples that demonstrate when and how to use the tool. The model already has a native sense of what a good search query looks like from pretraining.
This is why ChatGPT now says "searching the web..." and produces citations.
Implication for users: If you want the model to use specific information, put it in the context window rather than relying on recall. "Summarize chapter 1 of Pride and Prejudice" works, but "Summarize chapter 1 of Pride and Prejudice — text attached below" works significantly better.
People often ask LLMs "What model are you?" or "Who built you?" — but the question is somewhat nonsensical. The model has no persistent self. It boots up, processes tokens, shuts down. Every conversation is a fresh start.
If you don't program a self-identity, you get random answers. Falcon 7B will tell you it was built by OpenAI based on GPT-3 — not because it was trained on OpenAI data, but because ChatGPT is so prominent on the internet that "I am ChatGPT by OpenAI" is the model's hallucinated default label.
Two ways to override this:
Either way, it's bolted on — not deeply present.
A critical insight. Here's a math problem: "Emily buys 3 apples and 2 oranges, each orange costs $2, total cost is $13, what's the cost of an apple?"
Two possible labeled responses, both arriving at $3:
Why is one bad? Because each token gets a fixed, finite amount of computation through the network — maybe 100 layers' worth. You can't cram an arbitrary calculation into one token.
If the answer comes first ("The answer is $3"), the model has to do the entire problem in the computation budget of a single token. That doesn't work. Anything after that point is just post-hoc justification — the answer is already committed.
The good answer distributes computation across tokens. Each token does a small amount of work; by the end, the model has built up intermediate results in its context and reaching the final answer is easy.
You can test this: ask ChatGPT "Answer in a single token: Emily buys 23 apples and 177 oranges..." — for harder numbers, it fails. Let it work step by step and it succeeds.
Even better: for arithmetic-heavy problems, ask it to use code. The Python interpreter is far more reliable than the model's mental arithmetic.
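For the Emily problem, the kind of code you'd hope the model writes looks something like this. The function name and parameters are mine; the numbers are the ones from the problem.

```python
def apple_price(num_apples, num_oranges, orange_price, total):
    """Solve num_apples * a + num_oranges * orange_price = total for a."""
    spent_on_oranges = num_oranges * orange_price
    return (total - spent_on_oranges) / num_apples

price = apple_price(num_apples=3, num_oranges=2, orange_price=2, total=13)
print(price)  # 3.0
```

Notice that the code has the same shape as a good step-by-step answer: compute the orange cost, subtract, divide. The difference is that the interpreter does the arithmetic, not the model.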
Same problem. "How many dots are below: ...." The model will guess in a single token and get it wrong. Ask it to use code — it copies the input into a Python string and calls .count(). Now it's accurate. Counting is a task to delegate to tools.
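The delegated version is tiny. The string of dots here is a placeholder I typed in, standing in for whatever the user pasted.

```python
# The model copies the user's input into a string literal...
pasted_input = "........................."

# ...and lets Python do the counting.
num_dots = pasted_input.count(".")
print(num_dots)
```

Copying text is something the model is good at; counting is not. This splits the task along that line.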
The model doesn't see characters — it sees tokens. "Ubiquitous" is 3 tokens to the model, not 10 letters. So tasks like "print every third character" fail.
The famous example: "How many R's in strawberry?" Models insisted on 2 for ages. It's not because they're dumb — they don't see the individual letters, and they're bad at counting. Two compounding deficits.
Fix: ask the model to use code.
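Both character-level tasks mentioned above become one-liners once they run in Python, where the string really is a sequence of characters:

```python
word = "strawberry"
print(word.count("r"))  # 3

# Every third character of "Ubiquitous", starting with the first one.
print("Ubiquitous"[::3])  # Uqts
```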
LLMs can solve Olympiad math but fail on "Which is bigger, 9.11 or 9.9?" — answering 9.11 confidently. Researchers found that internal activations associated with Bible verses light up on this prompt. In Bible verse numbering, 9.11 does come after 9.9. The model gets distracted by the spurious association.
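You can see the two readings side by side in code. The helper as_section is my own illustrative name; it treats "9.11" the way chapter-and-verse or section numbering does, as a pair of integers.

```python
# Read as decimal numbers, 9.11 is smaller than 9.9.
print(9.11 > 9.9)  # False

def as_section(label):
    """Parse '9.11' as (chapter 9, item 11), like verse or section numbers."""
    return tuple(int(part) for part in label.split("."))

# Read as section numbers, 9.11 comes after 9.9.
print(as_section("9.11") > as_section("9.9"))  # True
```

Both orderings appear all over the training data, which is one plausible way the wrong one can win out on a given prompt.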
This is what I call jagged intelligence or the "Swiss cheese" model of LLM capabilities: brilliant in most places, with arbitrary holes you can fall into. Treat the model as a powerful but unreliable tool. Check its work.
So far we have two stages: pretraining (knowledge acquisition) and supervised fine-tuning (imitating expert assistant responses). Now we add a third: reinforcement learning.
A useful analogy: school textbooks have three kinds of content:
Why is SFT not enough? Consider that math problem again. A human labeler writes a "good" solution — but how do they know what's good for the model? Token sequences that are easy for the model differ from those easy for humans. The labeler might inject leaps the model can't make, or pad with steps that are trivial for the model. The labeler simply doesn't know the model's ideal token path.
The solution: let the model discover its own solutions.
For a given prompt, generate many candidate solutions (thousands or millions). Check which ones reach the correct answer. Train on the ones that worked. The model discovers, by trial and error, the token sequences that reliably get it to correct answers — exploiting its own knowledge, not someone else's.
This is essentially "do more of what worked." SFT initializes the model in the vicinity of correct solutions; RL dials it in.
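Here's a toy sketch of "do more of what worked." It is not the algorithm any lab actually uses: the two strategies, the weight table, and the fixed step size are all assumptions I made up to show the loop of sampling attempts, checking answers, and reinforcing the attempts that were correct.

```python
import random

def solve_guess_first(rng):
    """Hypothetical strategy: blurt out an answer right away (often wrong)."""
    return rng.choice([2, 3, 4, 5])

def solve_step_by_step(rng):
    """Hypothetical strategy: work through the arithmetic of the Emily problem."""
    return (13 - 2 * 2) // 3

STRATEGIES = {"guess_first": solve_guess_first, "step_by_step": solve_step_by_step}

def rl_round(weights, correct_answer, num_samples, rng, step=0.1):
    """Sample attempts in proportion to the weights; reinforce correct ones."""
    names = list(weights)
    for _ in range(num_samples):
        name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
        answer = STRATEGIES[name](rng)
        if answer == correct_answer:
            weights[name] += step  # do more of what worked
    return weights

rng = random.Random(0)
weights = {"guess_first": 1.0, "step_by_step": 1.0}
for _ in range(20):
    weights = rl_round(weights, correct_answer=3, num_samples=50, rng=rng)

print(weights)
```

Nobody told the system that working step by step is better. The only signal was whether the final answer matched, and the reliable strategy accumulated weight on its own.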
The first two stages (pretraining, SFT) are standard and well-known. RL fine-tuning for LLMs has been done internally at labs like OpenAI for years but kept private. The DeepSeek-R1 paper (early 2025) was a big deal because it publicly described how to do RL fine-tuning at scale and gave enough detail to reproduce it.
Key findings:
These cognitive strategies — backtracking, reframing, trying multiple approaches, verifying — emerge naturally. The only signal is "correct answer or not."
When you give the same Emily-apples problem to DeepSeek-R1 (on chat.deepseek.com with "DeepThink" enabled), you get a long internal monologue: "Let me figure this out... it must be $3. Wait, let me check. Yep, that's right. Let me try setting up an equation. Same answer. Confident." Then a clean writeup for the human.
DeepSeek-R1 is open-weights — you can download it, or use it via together.ai (an American provider).
In ChatGPT, models like o1, o3-mini, o3-mini-high are thinking models trained similarly with RL. GPT-4o is mostly an SFT model. OpenAI hides the raw chain-of-thought (showing only summaries) — likely to prevent distillation, where competitors imitate the reasoning traces.
When you encounter a hard problem, reach for a thinking model. For simple factual or conversational queries, GPT-4o is faster and usually fine. ~80–90% of my usage is GPT-4o; I reach for thinking models for hard math/code problems.
RL's power isn't new. AlphaGo (DeepMind) showed this in the game of Go years ago.
In the AlphaGo paper, there's a plot comparing models trained by supervised learning (imitating human expert players) versus reinforcement learning. The SL model plateaus below the level of top human players — it can't exceed what it's imitating. The RL model surpasses top human players, including Lee Sedol.
RL discovered moves humans wouldn't play. Famously, Move 37 in the AlphaGo vs. Lee Sedol match was a move humans estimated had a 1-in-10,000 probability of being played. In retrospect: brilliant. AlphaGo found a strategy unknown to human Go.
We're seeing the beginnings of something analogous in LLMs. RL could in principle discover new ways of thinking that no human has tried — new analogies, new cognitive strategies, maybe even new languages internal to the model. The behavior is unconstrained as long as it reaches correct answers.
This requires huge, diverse sets of practice problems — essentially "game environments" for thinking. A lot of frontier LLM research is building those environments.
Everything above is in verifiable domains — math, code — where solutions can be automatically scored against a known answer.
What about unverifiable domains? Write a joke. Write a poem. Summarize a paragraph. How do you score these automatically?
Naive approach: have humans score every solution. Infeasible: RL needs on the order of 1,000 updates × 1,000 prompts × 1,000 rollouts per prompt, which comes to a billion human judgments.
RLHF (OpenAI, ~2022): introduce indirection.
The reward model is a simulator of human preference. It's not perfect, but if it's statistically similar to human judgment, RL against it produces improvements.
Run RLHF too long and you'll find the top "joke about pelicans" becomes something nonsensical like "the the the the the" — because the reward model, a giant neural net with billions of parameters, has weird corners where nonsense inputs score 1.0. These are adversarial examples.
You can patch each one (add "the the the" to the dataset with a low score), but there are infinitely many. RL will always find new ones.
So RLHF is not RL in the magical, scalable sense. You can't run it indefinitely the way you can run AlphaGo's RL. You run it for a few hundred steps, get a small improvement, and stop before the reward model gets gamed. RLHF is more like a fine-tune than true RL.
In verifiable domains (math, code), the scoring function is rigid — either you got the right answer or you didn't, and you can't game that. So you can run RL indefinitely and get genuine emergent reasoning. In RLHF, you can't.
GPT-4o has been through RLHF — it works, but it's a polish, not a fundamental capability multiplier.
A few things on the horizon:
Multimodality. Audio, images, video — all tokenizable. Slice an audio spectrogram into tokens, slice an image into patches, intersperse with text. Same training paradigm. We're already seeing the early forms.
Agents. Today we hand individual tasks to models on a silver platter. Soon: agents performing long-running tasks over minutes or hours, with humans supervising. Expect "human-to-agent ratios" the way factories have human-to-robot ratios.
Pervasive integration. LLMs becoming invisible infrastructure inside tools.
Computer-using models. ChatGPT's Operator is an early example — handing the model control of keyboard and mouse.
Test-time training. Currently, parameters are fixed after training; only the context window is "learned" during inference (in-context learning). Humans actually update their parameters from experience (especially during sleep). We don't have an analog of that yet. Important because context windows are finite, and multimodal long-running tasks will blow past current sizes.
Three resources I use:
When you go to ChatGPT and hit enter, here's what's happening:
For a non-thinking model like GPT-4o, the reply is essentially a neural network simulation of a human data labeler at OpenAI — what a labeler, given the instructions and infinite patience, would have written. Not magic, just a learned imitation, with the cognitive quirks of a finite-compute-per-token system.
For thinking models like o3-mini or DeepSeek-R1, the reply additionally reflects emergent reasoning developed through RL — strategies that weren't directly imitated from any human. In verifiable domains this is genuinely new and powerful; we're seeing early hints of "Move 37"-style discoveries in open-ended thinking.
Practical takeaways:
It's an extraordinary time to be working with these tools. I use them dozens of times a day. They massively accelerate work — as long as you respect their sharp edges.
Thanks for watching.