From Vibe Coding to Agentic Engineering

Cleaned transcript — Andrej Karpathy in conversation
Author: karpathy.ai
This is a cleaned version of the auto-generated transcript — punctuation added, filler removed, restructured for readability. Not a verbatim transcript. For exact quotes, please refer to the original recording.

Introduction

Andrej Karpathy needs little introduction. He co-founded OpenAI, led the Autopilot program at Tesla, and has a rare gift for making the most complex technical shifts feel both accessible and inevitable. He is also the person who coined the term vibe coding — a phrase that has since taken on a life of its own. But just months after that coinage, he said something that stopped people short: that he had never felt more behind as a programmer. That’s where this conversation begins.

Feeling Behind as a Coder

The shift happened in December. Karpathy had been using agentic tools — things like Cursor and adjacent systems — for about a year, and the experience was useful but imperfect. The model would produce chunks of code, sometimes mess them up, and he would correct them. Helpful, but still very much a human-in-the-loop workflow.

December changed that. He was on a break, which meant more time to experiment, and he noticed something with the latest models: the chunks came out fine. He kept asking for more, and they still came out fine. He couldn’t remember the last time he’d corrected anything. He trusted the system more and more, and then — he was vibe coding.

He tried to stress on X that many people had experienced AI in 2024 as a ChatGPT-adjacent thing: useful, impressive, but incremental. What he wanted them to see was that things had changed fundamentally, specifically around coherent agentic workflows. The experience wasn’t just faster coding; it was a different relationship with the machine. His side-projects folder filled up. He kept building.

Software 3.0 Explained

To understand why December felt like a rupture, it helps to have Karpathy’s framework for what is actually happening. Software 1.0 is writing code: explicit rules, deterministic logic. Software 2.0 shifted programming into arranging datasets and training neural networks — the program is implicit in the weights, not written by hand. Software 3.0 takes this further: you are no longer writing code or curating training data. You are prompting. The LLM is the interpreter, and what lives in the context window is your lever over it.
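
As a rough illustration of the contrast, the sketch below writes the same small task twice: once as explicit rules (Software 1.0) and once as a prompt handed to a model (Software 3.0). Software 2.0 is skipped because it would need a training run. The task, the word list, the prompt wording, and the `complete` callable are all assumptions made for this example; `fake_complete` is a stand-in so the code runs offline, not a real model API.

```python
from typing import Callable

# Software 1.0: the behaviour is spelled out as explicit, deterministic rules.
NEGATIVE_WORDS = {"bad", "awful", "terrible"}


def is_negative_v1(review: str) -> bool:
    """Return True if any word of the review is in the hand-written word list."""
    words = {w.strip(".,!?").lower() for w in review.split()}
    return bool(words & NEGATIVE_WORDS)


# Software 3.0: the "program" is text placed in the model's context window.
PROMPT_TEMPLATE = (
    "Decide whether the following review is negative. "
    "Answer with exactly YES or NO.\n\nReview: {review}"
)


def is_negative_v3(review: str, complete: Callable[[str], str]) -> bool:
    """`complete` is a hypothetical text-in, text-out model call."""
    answer = complete(PROMPT_TEMPLATE.format(review=review))
    return answer.strip().upper().startswith("YES")


def fake_complete(prompt: str) -> str:
    """Offline stand-in for a model; it only recognises one phrase."""
    return "YES" if "never again" in prompt.lower() else "NO"
```

The rule-based version misses "Never again." because no listed word appears in it. In the prompted version, handling that case is the model's job rather than the programmer's, which is the shift the framework describes.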

In this paradigm, the LLM is not a better autocomplete; it is a programmable computer of a new kind. It multitasks because it was trained on the internet, which contains examples of nearly every kind of task. And the way you program it is natural language.

Agents as the Installer

One example that crystallized this for Karpathy was how Claude Code gets installed. Normally, you would expect a shell script: a carefully written piece of Software 1.0 that handles all the edge cases of different platforms and environments, and inevitably balloons into something complex and brittle.

But the Claude Code installation is a block of text you copy and paste to your agent. That’s it. The agent has its own intelligence: it reads your environment, figures out what needs to happen, executes the steps, and debugs in a loop. You don’t have to spell out every conditional. The agent-as-installer is a concrete instance of the Software 3.0 paradigm — and the programming artifact is no longer a script, it’s a prompt.

MenuGen vs Raw Prompts

An even more striking example comes from a side project Karpathy built himself: MenuGen. The problem: you sit down at a restaurant, the menu has no pictures, and 30–50% of the items mean nothing to you. His solution was to build an app — vibe-coded, running on Vercel — that lets you photograph the menu, OCRs the items, calls an image generator for each one, and renders a visual version of the menu.

Then he saw what he called the Software 3.0 version of the same idea: take your photo, give it to a multimodal model, and ask it to overlay images directly onto the menu. The model returns an annotated image. No OCR pipeline, no image generation orchestration, no app. The neural network does the work; the prompt is the program; the output is the thing.
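
The two shapes can be put side by side in a minimal sketch. It is not MenuGen's actual code: the `Ocr`, `ImageGen`, and `Multimodal` callables, the prompt wording, and the `fake_*` stand-ins are hypothetical and exist only so the comparison runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MenuItem:
    name: str
    image: bytes


# Hypothetical service signatures; none of these are real APIs.
Ocr = Callable[[bytes], List[str]]
ImageGen = Callable[[str], bytes]
Multimodal = Callable[[bytes, str], bytes]


def menugen_pipeline(photo: bytes, ocr: Ocr, generate: ImageGen) -> List[MenuItem]:
    """App-style version: explicit orchestration of OCR and image generation."""
    return [MenuItem(name=name, image=generate(name)) for name in ocr(photo)]


OVERLAY_PROMPT = "Overlay a small picture of each dish next to its name on this menu."


def menugen_prompt_only(photo: bytes, model: Multimodal) -> bytes:
    """Prompt-as-program version: one call, and the model returns the annotated image."""
    return model(photo, OVERLAY_PROMPT)


# Offline stand-ins so the sketch runs without any external service.
def fake_ocr(photo: bytes) -> List[str]:
    return photo.decode().split(";")


def fake_generate(name: str) -> bytes:
    return f"<image of {name}>".encode()


def fake_model(photo: bytes, prompt: str) -> bytes:
    return b"<annotated>" + photo
```

In the first function the code owns the control flow, and the rendering step would still have to be written on top of it. In the second, the only artifact left to maintain is the prompt string.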

“All of my MenuGen is spurious,” he said. “The app shouldn’t exist.”

This reframing matters because it’s tempting to think of the current moment as just a speed-up of what already existed — programming, but faster. Karpathy thinks that’s the wrong frame. New things are now possible that couldn’t exist before, not because they’re faster but because there was no code you could have written. His LLM knowledge-base project is another example: there is no traditional program that takes a collection of documents and compiles them into a wiki. But an LLM can do this natively. These are genuinely new capabilities, not accelerations of old ones.

What’s Obvious by 2026

Asked what will look completely obvious in hindsight that is still mostly unbuilt today, Karpathy pushed the extrapolation to its logical extreme — and acknowledged it gets strange quickly.

Following the MenuGen logic, a lot of the software we currently write simply shouldn’t exist in the new paradigm. The neural network does most of the heavy lifting; the intermediary code is overhead. Pushed further, you can imagine what he called a neural computer: a device that takes raw video or audio as input, runs it through a neural network, and uses diffusion to render a UI customized to that moment. No traditional operating system underneath, no app layer — the intelligence is the substrate.

He drew an analogy to the early days of computing, when it wasn’t obvious whether computers would look like calculators or neural networks. In the 1950s and 60s, both paths were live options. The calculator path won, and we built classical computing. But the diagram may flip: neural networks could become the host process, with CPUs as the co-processor. Intelligence compute is already becoming the dominant share of FLOPS. The endpoint is something “extremely foreign,” he said — and he thinks we’ll get there piece by piece, though the exact progression is TBD.

Verifiability and Jagged Skills

The question of which domains AI accelerates fastest comes down to verifiability. Traditional computers automate what you can specify in code. This generation of LLMs automates what you can verify, because these models are fundamentally trained with reinforcement learning, and RL requires a reward signal, which in turn requires a verifier.

This is why capabilities cluster in domains like math and code: those are the places where you can run a test, check an answer, execute the program. The labs built RL environments there because they’re economically valuable and technically clean. Domains where verification is hard — aesthetics, judgment, certain kinds of writing — progress more slowly, not necessarily because they’re harder in principle, but because the labs haven’t built the environments.
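
To make "verifier" concrete, here is a minimal sketch of a reward function for a coding task: it runs a candidate function against test cases and returns the fraction passed. It is an illustration of the idea only, not how any lab's RL environment is built; the names `reward`, `ADD_TESTS`, `good_add`, and `buggy_add` are invented for the example.

```python
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]  # (arguments, expected result)


def reward(candidate: Callable, tests: List[TestCase]) -> float:
    """Fraction of test cases the candidate passes; exceptions count as failures."""
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate earns nothing for this case
    return passed / len(tests)


ADD_TESTS: List[TestCase] = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]


def good_add(a, b):
    return a + b


def buggy_add(a, b):
    return a - b  # deliberately wrong; it only passes the (0, 0) case
```

Here `reward(good_add, ADD_TESTS)` is 1.0 and `reward(buggy_add, ADD_TESTS)` is one third. An essay or a design decision has no comparably cheap check, which is the asymmetry the passage describes.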

The jaggedness that results is one of Karpathy’s recurring preoccupations. The classic example was “how many R’s in strawberry?” — models famously got this wrong for a long time, and it’s a perfect illustration of the Swiss-cheese nature of LLM capability: brilliant in most places, with arbitrary gaps. His current favorite example is blunter: “I want to go to a car wash 50 meters away. Should I drive or walk?” State-of-the-art models will tell you to walk, because of course you’d walk 50 meters — except you need a clean car at the other end. The same model that can refactor a 100,000-line codebase or find zero-day vulnerabilities gets tripped up by this. “This is insane,” he said.

The implication is that jaggedness reflects a combination of what’s verifiable and what the labs have chosen to focus on. Chess improved dramatically between GPT-3.5 and GPT-4, not because of general capability gains, but because someone at OpenAI added a large corpus of chess data to the pretraining set. A decision was made, and a capability spiked.

The upshot for practitioners: you are somewhat at the mercy of what the labs put in the mix. You have to explore the tool with no manual — figure out which circuits are alive, where it flies and where it struggles. If your application lives outside the circuits that were part of the RL training, you’re going to need fine-tuning; you won’t get there from the base model out of the box.

Founder Advice and Automation

For founders trying to figure out where to build, Karpathy’s framework is useful, though he deliberately doesn’t give away the answer. The core observation: verifiability makes something tractable in the current paradigm because you can throw RL at it. That remains true even if the frontier labs aren’t focusing on it. If you’re in a domain with a verifiable structure, one where you could build RL environments and evaluation datasets, you can do your own fine-tuning and potentially unlock capability that the base models don’t have out of the box.

Asked what feels automatable only from a distance, he paused — and then said everything. Almost anything can be made verifiable to some degree. Even writing can be evaluated by a council of LLM judges and get to something reasonable. It’s a question of what’s easy or hard, not what’s possible or impossible.
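
One possible shape for such a council, sketched under stated assumptions: each judge is a callable returning a score between 0 and 1, and the council takes the median. The median is a choice made for this example and the source does not specify any aggregation rule. The three toy judges are simple heuristics standing in for LLM calls.

```python
from statistics import median
from typing import Callable, Sequence

Judge = Callable[[str], float]  # hypothetical: returns a score in [0, 1]


def council_score(text: str, judges: Sequence[Judge]) -> float:
    """Median of the judges' scores, so a single outlier judge cannot dominate."""
    return median(judge(text) for judge in judges)


# Toy stand-ins for LLM judges; real judges would be model calls with rubrics.
def brevity_judge(text: str) -> float:
    return 1.0 if len(text.split()) <= 12 else 0.3


def clarity_judge(text: str) -> float:
    return 0.2 if "synergy" in text.lower() else 0.8


def harsh_judge(text: str) -> float:
    return 0.0  # always gives the lowest score
```

With these judges, a short, plain sentence scores 1.0, 0.8, and 0.0, and the council reports the middle value. The aggregate is noisier than a unit test but still usable as a reward signal.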

From Vibe Coding to Agentic Engineering

The term vibe coding is his, but Karpathy is careful about what it does and doesn’t mean. Vibe coding is about raising the floor: anyone can now build software without being a trained engineer. That’s remarkable and real. But it’s not the whole story.

Agentic engineering is a different thing. It’s about preserving the quality bar of professional software while moving faster. You’re still responsible for your code. Vulnerabilities introduced by careless agent use are still your vulnerabilities. The question becomes: how do you coordinate these agents — which are powerful but also fallible, stochastic, and capable of surprising mistakes — without sacrificing correctness?

The 10x engineer framing from the pre-agent era is obsolete. The ceiling is much higher now. People who are very good at agentic engineering are operating at multiples that dwarf 10x.

The difference between someone who is mediocre at this and someone who is fully AI-native comes down to investment: investing in your setup, learning the tools deeply, utilizing all their features. Just as engineers previously got the most out of vim or VS Code, the same discipline now applies to Claude Code, Codex, and their successors.

Hiring needs to change to match. Puzzle-solving interviews are the old paradigm. A better test: give someone a large project — build a Twitter clone, make it secure, deploy it — and then throw attacking agents at their implementation. Can it hold? That’s the right evaluation for agentic engineering capability.

As for which human skills become more valuable rather than less: taste and judgment, for now. Agents are still intern-like entities. They are extraordinarily capable at execution but will make surprising design mistakes. Karpathy’s MenuGen agent tried to correlate user accounts across Google and Stripe by matching email addresses — a decision that is obviously wrong if you think about it for three seconds (people use different emails for different services), but the agent didn’t think about it. Someone has to catch these things.
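
The email-matching mistake is easy to reproduce in miniature. In the sketch below, the records, field names, and the `app_user_id` link are hypothetical and do not reflect any provider's schema. The second function shows one common alternative, carrying the app's own user ID on the payment record. The source does not say how the problem was actually fixed.

```python
from typing import Dict, Optional

# Hypothetical records; the field names are illustrative only.
login_accounts: Dict[str, Dict] = {"user_1": {"email": "ana@personal.example"}}
payments = [
    {"payer_email": "ana@work.example", "app_user_id": "user_1", "credits": 10},
]


def match_by_email(payment: Dict, accounts: Dict[str, Dict]) -> Optional[str]:
    """Fragile: assumes the same address is used for login and for payment."""
    for user_id, account in accounts.items():
        if account["email"] == payment["payer_email"]:
            return user_id
    return None


def match_by_stored_id(payment: Dict, accounts: Dict[str, Dict]) -> Optional[str]:
    """Sturdier: relies on an ID the app attached when the payment was created."""
    user_id = payment.get("app_user_id")
    return user_id if user_id in accounts else None
```

For the sample payment, `match_by_email` returns `None` and the purchased credits would be orphaned, while `match_by_stored_id` returns `"user_1"`.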

The pattern he sees: API details and implementation specifics get delegated to the agent; architectural decisions and fundamental understanding stay with the human. He doesn’t remember which PyTorch argument is keepdim versus keep_dims, and he doesn’t need to anymore. But he still needs to understand that tensors have an underlying storage and that views of the same storage are more efficient than copies. The conceptual layer remains yours.
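
The storage-and-views idea can be shown without PyTorch. The sketch below is an analogy using Python's standard library, where a `memoryview` slice plays the role of a tensor view and an `array` slice plays the role of a copy. The variable names are invented for the example.

```python
from array import array

storage = array("d", [1.0, 2.0, 3.0, 4.0])  # one underlying buffer of doubles

view = memoryview(storage)[1:3]  # a window onto the same memory, no copy made
copy = storage[1:3]              # slicing an array allocates a new array

view[0] = 20.0  # writes through to the shared storage
copy[1] = 30.0  # changes only the private copy

print(storage.tolist())  # [1.0, 20.0, 3.0, 4.0]
print(copy.tolist())     # [2.0, 30.0]
```

The write through `view` shows up in `storage` because both share one buffer, and no data was duplicated to create it. That is the conceptual point that stays with the human, while the exact spelling of an argument (PyTorch reductions take `keepdim`) is the kind of detail an agent can look up.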

He does think taste and judgment could improve in models; there’s nothing fundamental preventing it. The current bloat in agent-generated code (lots of copy-paste, awkward abstractions, things that work but are “really gross”) is a consequence of the labs not having built RL rewards around code quality and elegance. His micro-GPT project, which aims to simplify LLM training to its bare essentials, is a case study: current models hate it. They can’t simplify, and getting them to try is like pulling teeth. It feels like being outside the RL circuits. The capability is missing not because it’s impossible but because it hasn’t been trained.

On the “animals versus ghosts” framing — a piece he wrote about what kind of entity an LLM actually is — the practical upshot is modest but real. These are not animal intelligences with emotions, intrinsic motivations, or responses to social pressure. If you yell at them, nothing changes. They are statistical simulation circuits, with pretraining providing the substrate and RL bolted on top. Having an accurate model of what they are helps you use them more competently. He doesn’t claim it yields five obvious engineering conclusions; it’s more of a posture — being appropriately suspicious, exploring carefully, calibrating trust.

Agents Everywhere and Learning

Asked what the world looks like when agents have real permissions and take real actions on our behalf, Karpathy said everything has to be rewritten. Everything we have today is built for humans.

His pet peeve: documentation. He still encounters frameworks and libraries where the docs are written for a human to read, understand, and then go do something. But he doesn’t want to do anything. He wants to know what to copy-paste to his agent. Every time he’s told “go to this URL” or “navigate to this menu,” it grates. The infrastructure hasn’t caught up.

The vision he described: agent-native infrastructure, where workloads are decomposed into sensors and actuators over the world, data structures are legible to LLMs first, and deployment is fully automated. He used MenuGen as a benchmark: the hard part of building it wasn’t writing the code, it was deploying it on Vercel — configuring DNS, navigating settings pages, stringing together services. In the future, you should be able to give an LLM a prompt and have it build and deploy MenuGen without you touching anything. When that’s possible, you’ll know infrastructure has become genuinely agent-native.

At the individual level, he sees a world of agent representation: your agent handling logistics, your agent talking to my agent to coordinate meeting details. The friction of human-to-human coordination gets absorbed by agents underneath.

On learning and education — the closing question — Karpathy came back to a line he’d seen online that he said he thinks about every other day: “You can outsource your thinking but you can’t outsource your understanding.” He finds it precisely right.

The agents do the thinking. But someone has to direct the thinking — to decide what to build, why it’s worth doing, how to evaluate the output. That requires understanding, and understanding still has to make it into a human brain. He sees himself becoming a bottleneck not on execution but on comprehension: there’s so much being produced that the constraint is now his capacity to absorb and make sense of it.

His LLM knowledge-base project is partly a response to this. Every time he encounters a new projection of information — a different angle on something he thought he understood — he gains insight. The tools he builds are prompts for synthetic data generation over fixed knowledge. He reads an article, his wiki gets updated, and he asks questions. These are tools to enhance understanding, he said — and that enhancement is still irreducibly yours, because the LLMs certainly don’t excel at understanding. You remain uniquely in charge of that.