Intro to Large Language Models
Author: Andrej Karpathy
Date: November 2023
Video: https://www.youtube.com/watch?v=zjkBMFhNj_g
Source file: raw/llm-intro-script.md
A one-hour general-audience introduction to large language models. It is an earlier, more condensed counterpart to Deep Dive into LLMs like ChatGPT, and is strongest on the LLM-as-operating-system framing and on future directions.
What this talk covers
- LLM inference — what a model actually is (two files: parameters + run code).
- LLM training — compression of the internet into parameters; cost and compute.
- LLM dreams — base model behaviour: hallucinating internet-like documents.
- How they work — Transformer architecture; the reversal curse; mechanistic interpretability.
- Fine-tuning into an assistant — from document generator to Q&A assistant via human-labelled data.
- Comparisons, RLHF, synthetic data — stage three of fine-tuning.
- Scaling laws — model performance as a smooth function of parameters and data; no sign of topping out.
- Tool use — web search, calculator, Python interpreter, image generation.
- Multimodality — vision, speech, image generation.
- System 1 vs System 2 thinking — current LLMs are System 1 only; future directions.
- Self-improvement and AlphaGo — the RL analogy; reward criterion as the key bottleneck.
- Customisation and the GPT Store — RAG, custom instructions.
- The LLM OS — LLMs as the kernel of a new computing paradigm.
- LLM security — jailbreaks, prompt injection, data poisoning.
Key concepts introduced
- Large Language Models — two files: a parameters file (the weights) and a run file (the code).
- Pretraining — training as lossy compression of the internet; ~100× compression ratio.
- Fine-Tuning — swapping the dataset from internet text to human-labelled conversations.
- Reinforcement Learning from Human Feedback — ranking-based human preference as a scalable signal.
- Scaling Laws — performance is a predictable, smooth function of parameters N and data D.
- Tool Use — web search, calculator, Python interpreter integrated via special tokens.
- Multimodality — tokenising audio spectrograms, image patches, alongside text.
- LLM OS — LLM as kernel process; context window as RAM; tools as peripherals.
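One common way to turn the ranking-based preference signal from the RLHF entry above into a training objective is a pairwise (Bradley-Terry style) loss on a reward model. The talk does not give a formula, so the sketch below is an assumption for illustration, and the name `pairwise_preference_loss` is hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the human-preferred response wins.

    Uses the Bradley-Terry form: P(chosen beats rejected) = sigmoid(margin),
    so the loss is -log(sigmoid(margin)) = softplus(-margin).
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the reward model agrees with the human ranking
# and large when it disagrees.
agrees = pairwise_preference_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees = pairwise_preference_loss(reward_chosen=0.0, reward_rejected=2.0)
print(f"agrees: {agrees:.3f}, disagrees: {disagrees:.3f}")
```

The labeller only has to say which of two responses is better, which is usually easier than writing an ideal response from scratch.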
Key arguments
A language model is just two files. For the 70B-parameter Llama 2 model these are a 140 GB parameters file (70 billion parameters at 2 bytes each, stored as float16) and a run file of roughly 500 lines of C. It runs fully offline on a laptop. The complexity is in obtaining the parameters, not in running them.
Training is lossy internet compression. A roughly 10 TB internet crawl is compressed into 140 GB of parameters. The talk rounds this to about 100× compression (10 TB / 140 GB is closer to 70×), and it is lossy like JPEG, not lossless like ZIP. The model doesn’t store facts verbatim; it stores statistical patterns. Its knowledge is vague recollection, not lookup.
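The sizes quoted in the two arguments above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes decimal units (1 GB = 10^9 bytes) and 2 bytes per parameter; the variable names are illustrative.

```python
N_PARAMS = 70_000_000_000        # Llama 2 70B
BYTES_PER_PARAM = 2              # assumption: float16 storage
CRAWL_BYTES = 10 * 10**12        # ~10 TB of internet text

# Size of the parameters file.
param_bytes = N_PARAMS * BYTES_PER_PARAM
param_gb = param_bytes / 10**9

# How much smaller the parameters are than the training text.
compression_ratio = CRAWL_BYTES / param_bytes

print(f"parameters file: {param_gb:.0f} GB")
print(f"compression ratio: about {compression_ratio:.0f}x")
```

The ratio comes out nearer 70× than 100×, so the "roughly 100×" figure is best read as an order-of-magnitude statement.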
Scaling laws make progress predictable. Performance on next-token prediction is a smooth, reliable function of parameter count and training data. This predictability is what drove the GPU gold rush: adding compute reliably adds performance.
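To make "a smooth function of parameter count and training data" concrete, here is a minimal sketch. The talk gives no formula, so this borrows the parametric form L(N, D) = E + A / N^alpha + B / D^beta from the Chinchilla paper (Hoffmann et al., 2022). The default constants are roughly those reported there and should be treated as illustrative, not as values from the talk.

```python
def predicted_loss(
    n_params: float,
    n_tokens: float,
    E: float = 1.69,      # irreducible loss (illustrative)
    A: float = 406.4,     # parameter-count coefficient (illustrative)
    B: float = 410.7,     # data coefficient (illustrative)
    alpha: float = 0.34,
    beta: float = 0.28,
) -> float:
    """Predicted next-token loss for N parameters trained on D tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta


# Hold the data fixed and grow the model: the predicted loss falls smoothly.
for n in (1e9, 10e9, 70e9):
    print(f"N = {n:.0e} params -> loss {predicted_loss(n, 1.4e12):.3f}")
```

Because the curve is smooth and monotonic in both N and D, the outcome of a larger training run can be estimated before paying for it.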
Tool use is the key capability axis. Just as humans don’t compute in their heads — they reach for calculators, search engines, Python — LLMs with tool access far outperform LLMs without. Tool use converts the model from a memory recall engine into a problem-solving system.
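A minimal sketch of how a runtime might hand arithmetic to a calculator instead of trusting the model's recall. The `<calc>...</calc>` marker is a hypothetical format invented for this example (the talk only says tools are invoked via special tokens), and `run_tools` and `safe_eval` are illustrative names.

```python
import ast
import operator
import re

# Only plain arithmetic is allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str):
    """Evaluate a basic arithmetic expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


# Hypothetical marker the model emits when it wants the calculator.
CALC_CALL = re.compile(r"<calc>(.*?)</calc>")


def run_tools(model_output: str) -> str:
    """Replace each calculator request with the computed result."""
    return CALC_CALL.sub(lambda m: str(safe_eval(m.group(1))), model_output)


print(run_tools("The total is <calc>1234 * 5678</calc>."))
```

In a real system the result would be appended to the context so the model can continue generating from it; here it is spliced directly into the text to keep the sketch short.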
The LLM OS framing. The LLM acts as the kernel, the context window as RAM, disk and the internet as storage, and tools as peripherals. Proprietary models (GPT, Claude) and open-source models (Llama) mirror the split between Windows/macOS and Linux.
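The "context window as RAM" part of the analogy can be sketched as a fixed-capacity buffer: whatever does not fit is no longer visible to the model. This is a loose illustration only. The `ContextWindow` class is hypothetical, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import deque


class ContextWindow:
    """Fixed-size working memory; the oldest tokens are evicted first."""

    def __init__(self, capacity: int):
        # deque(maxlen=...) drops items from the left once it is full.
        self.tokens = deque(maxlen=capacity)

    def load(self, text: str) -> None:
        # Assumption: one whitespace-separated word counts as one token.
        self.tokens.extend(text.split())

    def contents(self) -> str:
        return " ".join(self.tokens)


ctx = ContextWindow(capacity=5)
ctx.load("the quick brown fox")
ctx.load("jumps over the lazy dog")
print(ctx.contents())  # only the 5 most recent tokens remain
```

Anything evicted would have to be paged back in from storage (files or the internet), which is the role retrieval plays in this framing.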
Self-improvement requires a reward function. AlphaGo succeeded at RL because “win or lose” is a cheap, automatic, reliable signal. Language in the general case has no such signal. Self-improvement is possible in narrow verifiable domains; the general case remains open.
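In a narrow verifiable domain, the reward can be computed automatically, much like "win or lose" in Go. The sketch below uses integer addition as a stand-in for such a domain; `arithmetic_reward` and the sample answers are hypothetical and not taken from the talk.

```python
def arithmetic_reward(a: int, b: int, model_answer: str) -> float:
    """Return 1.0 if the answer string parses to a + b, otherwise 0.0."""
    try:
        return 1.0 if int(model_answer.strip()) == a + b else 0.0
    except ValueError:
        # Unparseable answers earn no reward.
        return 0.0


# Score several sampled answers to "What is 17 + 25?" with no human in the loop.
samples = ["42", "forty-two", " 43 "]
rewards = [arithmetic_reward(17, 25, s) for s in samples]
print(list(zip(samples, rewards)))
```

No comparably cheap checker exists for open-ended tasks such as "write a good summary", which is why the general case remains open.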
Relation to other talks
- Extended and updated in Deep Dive into LLMs like ChatGPT (Feb 2025).
- Practical use patterns in How I Use LLMs.
- Agentic and Software 3.0 framing in From Vibe Coding to Agentic Engineering.