From Vibe Coding to Agentic Engineering

Author: Andrej Karpathy
Date: 2025
Source file: raw/llm-vibe-coding-script.md

A conversation-format talk in which Karpathy traces the shift from experimental AI use to a fundamentally new mode of programming. Introduces the Software 1.0 / 2.0 / 3.0 framework, the distinction between vibe coding and agentic engineering, and a verifiability-based theory of where AI capabilities cluster.

What this talk covers

Feeling behind as a coder — the December shift; from correcting the model to trusting it.
Software 3.0 — natural language as the programming paradigm; the LLM as interpreter.
Agents as installers — Claude Code installation as a Software 3.0 artefact.
MenuGen vs raw prompts — when an intermediary app is spurious; the “app shouldn’t exist” insight.
What’s obvious by 2026 — the neural computer; intelligence compute as the dominant FLOP share.
Verifiability and jagged skills — why capabilities cluster in maths and code; the strawberry/car-wash examples.
Founder advice — verifiable domains as the tractable frontier.
From vibe coding to agentic engineering — raising the floor vs preserving the quality bar.

Key concepts introduced

Vibe Coding — using an agent to build software by narrating intent; anyone can now build without formal training.
Agentic Engineering — coordinating agents to maintain professional quality; a higher ceiling than the 10× engineer.
Software 1.0, 2.0, 3.0 — Karpathy’s framework for the three programming paradigms.
Verifiability — the structural reason capabilities cluster in maths, code, and logic.
Jagged Intelligence — the car-wash example; capability gaps explained by what the labs chose to invest RL effort in.

Key arguments

Software 3.0 is not faster Software 1.0. In Software 3.0, the program is a prompt; the LLM is the interpreter. New things become possible that had no prior implementation path — a document collection compiled into a wiki, for instance. The MenuGen example: the app should not exist; the multimodal model does the work directly.
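The prompt-as-program idea can be sketched in Python. This is an illustration only, not code from the talk: `fake_llm` is a hypothetical stand-in for a real model call so the example runs offline, and the word-counting task and all function names are invented for the sketch.

```python
from typing import Callable


# Software 1.0: the behaviour is spelled out in explicit code.
def count_words_v1(text: str) -> int:
    return len(text.split())


# Software 3.0: the "program" is natural-language text, and the model
# is the interpreter that executes it against an input.
PROGRAM_V3 = "Count the words in the input and reply with the number only."


def run_prompt(program: str, user_input: str, llm: Callable[[str], str]) -> str:
    """Hand the prompt (the program) and its input to an interpreter (the LLM)."""
    return llm(f"{program}\n\nInput:\n{user_input}")


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call (assumption, not a real API)."""
    user_input = prompt.split("Input:\n", 1)[1]
    return str(len(user_input.split()))


print(count_words_v1("a b c"))
print(run_prompt(PROGRAM_V3, "a b c", fake_llm))
```

In a real setting, changing the behaviour means editing `PROGRAM_V3`, not the Python around it; the surrounding code is only plumbing.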

Verifiability explains jaggedness. RL requires a reward signal, which requires a verifier. Labs built RL environments around maths and code because those domains are economically valuable and technically clean. Domains where verification is hard — aesthetics, judgment, nuanced writing — progress more slowly, not because they are harder in principle, but because the RL environments weren’t built. Chess is a concrete example: capability spiked between GPT-3.5 and GPT-4 because someone added a large corpus of chess data.
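A rough sketch of why maths and code are "technically clean" for RL: in both, a reward can be computed mechanically from the model's output. The function names and scoring scheme below are assumptions made for illustration, not a description of any lab's actual RL environment.

```python
def math_verifier(answer: str, expected: int) -> float:
    """Reward 1.0 only if the answer parses to the expected integer."""
    try:
        return 1.0 if int(answer.strip()) == expected else 0.0
    except ValueError:
        return 0.0


def code_verifier(source: str, cases: list[tuple[int, int]]) -> float:
    """Reward is the fraction of (input, expected) cases passed by `f` in `source`.

    Any crash scores 0.0. Running untrusted code with exec is unsafe
    outside a sandbox; it is used here only to keep the sketch short.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        f = namespace["f"]
        passed = sum(1 for x, want in cases if f(x) == want)
    except Exception:
        return 0.0
    return passed / len(cases)


# For an essay or a design critique there is no comparably cheap,
# objective function to write, so no reward signal comes for free.
print(math_verifier(" 42 ", 42))
print(code_verifier("def f(x):\n    return x * 2", [(1, 2), (3, 6)]))
```

The point of the contrast is that the verifier, not the model, determines which domains can be trained this way.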

Agentic engineering is not vibe coding. Vibe coding raises the floor — non-engineers can build. Agentic engineering raises the ceiling — engineers coordinate stochastic, fallible agents while maintaining correctness guarantees. Security vulnerabilities from careless agent use are your vulnerabilities. The discipline is: invest in your setup, learn the tools deeply, apply the same rigour you would to vim or VS Code.
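One way to picture "maintaining correctness guarantees" over fallible agents is a gate that accepts agent output only when every check passes. The sketch below is an assumption about what such a gate might look like, not a workflow described in the talk; `flaky_agent`, the two checks, and `gated_generate` are all hypothetical.

```python
from typing import Callable, Iterable, Optional

Check = Callable[[str], bool]


def gated_generate(
    agent: Callable[[int], str],
    checks: Iterable[Check],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first agent draft that passes every check, else None."""
    checks = list(checks)
    for attempt in range(max_attempts):
        candidate = agent(attempt)
        if all(check(candidate) for check in checks):
            return candidate
    return None


def flaky_agent(attempt: int) -> str:
    """Hypothetical agent: the first draft is unsafe, later drafts are fine."""
    drafts = ["result = eval(user_input)", "result = int(user_input)"]
    return drafts[min(attempt, len(drafts) - 1)]


def no_eval(code: str) -> bool:
    """Reject drafts that call eval on input."""
    return "eval(" not in code


def compiles(code: str) -> bool:
    """Accept only drafts that are syntactically valid Python."""
    try:
        compile(code, "<agent>", "exec")
        return True
    except SyntaxError:
        return False


print(gated_generate(flaky_agent, [no_eval, compiles]))
```

The checks here are deliberately trivial; in practice they would be a test suite, a linter, or a security scan, and the engineer remains responsible for whatever the gate lets through.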

Hiring should reflect this. Puzzle-solving interviews belong to the old paradigm. Better: give a candidate a large project to build (a Twitter clone, say), deploy it, then attack it with agents. Whether the project holds up under adversarial agent attacks is the right test of agentic engineering competence.

The neural computer endpoint. Pushed to its logical extreme, the trend ends in a device that takes raw audio/video, runs it through a neural network, and uses diffusion to render a UI. No OS underneath, no app layer. CPUs become the co-processor; neural networks are the host process. Karpathy calls the progression “extremely foreign” and “TBD.”

Relation to other talks

Practical LLM use patterns in How I Use LLMs (see Cursor/vibe coding section).
Technical foundations for RL and verifiability in Deep Dive into LLMs like ChatGPT.
The “LLM OS” framing introduced earlier in Intro to Large Language Models.