Intro to Large Language Models

Author: Andrej Karpathy
Date: November 2023
Video: https://www.youtube.com/watch?v=zjkBMFhNj_g
Source file: raw/llm-intro-script.md

A one-hour general-audience introduction to large language models. It is an earlier, more condensed counterpart to Deep Dive into LLMs like ChatGPT, and it is strongest on the LLM-as-operating-system framing and on future directions.

What this talk covers

LLM inference — what a model actually is (two files: parameters + run code).
LLM training — compression of the internet into parameters; cost and compute.
LLM dreams — base model behaviour: hallucinating internet-like documents.
How they work — Transformer architecture; the reversal curse; mechanistic interpretability.
Fine-tuning into an assistant — from document generator to Q&A assistant via human-labelled data.
Comparisons, RLHF, synthetic data — the optional third training stage, after pretraining and fine-tuning.
Scaling laws — model performance as a smooth function of parameters and data; no sign of topping out.
Tool use — web search, calculator, Python interpreter, image generation.
Multimodality — vision, speech, image generation.
System 1 vs System 2 thinking — current LLMs are System 1 only; future directions.
Self-improvement and AlphaGo — the RL analogy; reward criterion as the key bottleneck.
Customisation and the GPT Store — RAG, custom instructions.
The LLM OS — LLMs as the kernel of a new computing paradigm.
LLM security — jailbreaks, prompt injection, data poisoning.

Key concepts introduced

Large Language Models — two files: a parameters file (the weights) and a run file (the code).
Pretraining — training as lossy compression of the internet; ~100× compression ratio.
Fine-Tuning — swapping the dataset from internet text to human-labelled conversations.
Reinforcement Learning from Human Feedback — ranking-based human preference as a scalable signal.
Scaling Laws — performance is a predictable, smooth function of parameters N and data D.
Tool Use — web search, calculator, Python interpreter integrated via special tokens.
Multimodality — tokenising audio spectrograms, image patches, alongside text.
LLM OS — LLM as kernel process; context window as RAM; tools as peripherals.

Key arguments

A language model is just two files. The 70B Llama 2 model is a 140 GB parameters file and ~500 lines of C. It runs fully offline on a laptop. The complexity is in obtaining the parameters, not in running them.
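The 140 GB figure follows from simple arithmetic if each parameter is stored in 2 bytes (16-bit floats). The sketch below is only a back-of-the-envelope check; the function name and the 2-byte default are assumptions made for illustration.

```python
def params_file_size_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of a parameters file in decimal gigabytes.

    Assumes every parameter is stored at the same precision
    (2 bytes per parameter by default, i.e. 16-bit floats).
    """
    return n_params * bytes_per_param / 1e9


# 70 billion parameters at 2 bytes each
print(params_file_size_gb(70_000_000_000))  # 140.0
```

Storing the same weights at a lower precision would shrink the file proportionally, for example `bytes_per_param=1` halves it.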

Training is lossy internet compression. A roughly 10 TB internet crawl is compressed into about 140 GB of parameters: roughly 100× compression, lossy like JPEG rather than lossless like ZIP. The model doesn’t store facts; it stores statistical patterns. Knowledge is vague recollection, not lookup.
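As a quick sanity check on the ratio, the two figures quoted above can be divided directly. This is illustrative arithmetic only, using decimal units. It shows that "roughly 100×" is an order-of-magnitude statement rather than an exact figure.

```python
# Figures taken from the text above; decimal units (1 TB = 1e12 bytes).
crawl_bytes = 10e12    # ~10 TB of internet text
params_bytes = 140e9   # ~140 GB parameters file

ratio = crawl_bytes / params_bytes
print(f"compression ratio ~ {ratio:.1f}x")  # compression ratio ~ 71.4x
```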

Scaling laws make progress predictable. Performance on next-token prediction is a smooth, reliable function of parameter count and training data. This predictability is what drove the GPU gold rush: adding compute reliably adds performance.
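The summary does not state a formula. One commonly used parametric form in the scaling-law literature writes the loss as an irreducible term plus power-law terms in parameter count N and training tokens D. The sketch below uses that form with placeholder constants. They are invented for illustration and are not fitted values from the talk or from any paper. The ratio of 20 tokens per parameter is likewise just an assumption for the demo.

```python
def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Toy power-law loss: L(N, D) = E + A / N**alpha + B / D**beta."""
    # All constants are illustrative placeholders, not fitted values.
    E, A, B = 1.7, 400.0, 400.0
    alpha, beta = 0.3, 0.3
    return E + A / n_params**alpha + B / n_tokens**beta


# Loss falls smoothly as parameters and data grow together.
for n in (1e9, 1e10, 1e11):
    print(f"N={n:.0e}  loss={predicted_loss(n, n_tokens=20 * n):.3f}")
```

The point of the sketch is the shape: each tenfold increase in N and D lowers the predicted loss by a smooth, predictable amount, and the curve approaches but never drops below the constant term E.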

Tool use is the key capability axis. Just as humans don’t compute in their heads — they reach for calculators, search engines, Python — LLMs with tool access far outperform LLMs without. Tool use converts the model from a memory recall engine into a problem-solving system.
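A minimal sketch of the idea, assuming the model signals a calculator call by wrapping an expression in a marker. The `<calc>...</calc>` marker, `safe_eval`, and `run_tools` are hypothetical names chosen for this example; real systems define their own special tokens and tool protocols.

```python
import ast
import operator
import re

# Hypothetical marker format, standing in for a model's special tokens.
CALC_PATTERN = re.compile(r"<calc>(.*?)</calc>")

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expr: str):
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def run_tools(model_output: str) -> str:
    """Replace each calculator marker with the computed result."""
    return CALC_PATTERN.sub(lambda m: str(safe_eval(m.group(1))), model_output)


text = "The total is <calc>12 * 7 + 3</calc>."
print(run_tools(text))  # The total is 87.
```

The model only has to decide when to call the tool and what to pass it. The arithmetic itself is done by ordinary code, which is the shift from recall to problem solving described above.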

The LLM OS framing. The LLM acts as the kernel, the context window as RAM, disk and the internet as storage, and tools as peripherals. Proprietary models (GPT, Claude) and open-source models (Llama) mirror the split between Windows/macOS and Linux.
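To make the "context window as RAM" part of the analogy concrete, here is a toy sketch of a fixed-size working memory. The `ContextWindow` class is hypothetical. Dropping the oldest tokens when the buffer is full is one simple policy assumed for illustration, not a description of how any particular model manages its context.

```python
from collections import deque


class ContextWindow:
    """Hypothetical fixed-size token buffer standing in for 'RAM'."""

    def __init__(self, max_tokens: int):
        # A deque with maxlen silently discards the oldest items when full.
        self.tokens = deque(maxlen=max_tokens)

    def page_in(self, new_tokens):
        """Load tokens (e.g. from a file or web page) into working memory."""
        self.tokens.extend(new_tokens)

    def contents(self):
        return list(self.tokens)


window = ContextWindow(max_tokens=4)
window.page_in(["the", "cat", "sat"])
window.page_in(["on", "the", "mat"])
print(window.contents())  # ['sat', 'on', 'the', 'mat']
```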

Self-improvement requires a reward function. AlphaGo succeeded at RL because “win or lose” is a cheap, automatic, reliable signal. Language in the general case has no such signal. Self-improvement is possible in narrow verifiable domains; the general case remains open.
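A narrow verifiable domain is one where the reward can be computed automatically, much like "win or lose" in Go. The sketch below assumes a toy arithmetic task with a known integer answer. `verifiable_reward` is a hypothetical helper, not something from the talk.

```python
def verifiable_reward(model_answer: str, expected: int) -> float:
    """Return 1.0 if the answer parses to the expected integer, else 0.0."""
    try:
        return 1.0 if int(model_answer.strip()) == expected else 0.0
    except ValueError:
        # Answers that are not plain integers earn no reward.
        return 0.0


candidates = ["41", " 42 ", "forty-two"]
rewards = [verifiable_reward(c, expected=42) for c in candidates]
print(rewards)  # [0.0, 1.0, 0.0]
```

No comparably cheap check exists for open-ended tasks such as judging whether an essay is good, which is why the general case remains open.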

Relation to other talks

Extended and updated in Deep Dive into LLMs like ChatGPT (Feb 2025).
Practical use patterns in How I Use LLMs.
Agentic and Software 3.0 framing in From Vibe Coding to Agentic Engineering.