Simon Willison on Agentic Engineering and the Future of Code

Simon Willison on Agentic Engineering and the Future of Code

transcript agentic-engineering vibe-coding prompt-injection software-engineering dark-factory lenny-podcast

Simon Willison on Agentic Engineering and the Future of Code

Guest: Simon Willison — Co-creator of Django (powers Instagram, Pinterest, Spotify); coined the term “prompt injection”; popularised “agentic engineering” and “AI slop”; creator of Datasette (data analysis tool used in investigative journalism) and 100+ other open-source projects. 25 years as a software engineer; currently a one-person company using coding agents as his team.
Host: Lenny Rachitsky
Source: Lenny’s Podcast. Recorded ~2025 (references GPT 5.1 and Claude Opus 4.5 as recent models).


Overview

One of the most technically grounded and intellectually honest accounts of what the current AI coding moment actually feels like from the inside. Simon Willison covers the precise boundary between vibe coding and agentic engineering, the dark factory pattern as the next frontier, the Challenger disaster prediction for AI safety, human agency vs. AI agency, and practical techniques for getting professional-grade results from coding agents (test-driven development, starting templates, cognitive load management).


Key ideas

  1. Vibe coding vs. agentic engineering. Vibe coding = not looking at code, not caring how it works. Agentic engineering = using coding agents to produce professional-grade code that you are responsible for. The distinction matters: vibe coding for things only you use is fine; vibe coding for other people is irresponsible. Agentic engineering requires the full depth of 25 years of engineering experience.
  2. Dark factory pattern. The next frontier beyond agentic engineering: applying professional quality standards to code you do not directly review at all. Named after “lights-out factories” (so automated you can turn off the lights). Currently being explored by companies like StrongDM. The open question: how do you responsibly produce production-grade software you are not individually reviewing?
  3. The Challenger disaster prediction. AI is experiencing “normalisation of deviance” (the academic term from the 1986 Challenger post-mortem): people know prompt injection is unreliable, but keep shipping, and every time nothing breaks they feel more confident. Willison predicts a headline-grabbing disaster — a major AI security breach that forces reckoning. Has made this prediction every six months for three years; it has not happened yet.
  4. Human agency vs. AI agency. The most important quality for working effectively with AI is human agency: the capacity to decide what problems to take on and where to go. Agents have no agency — they have no human motivations and cannot decide what makes sense to pursue. “I would argue that the one thing AI can never have is agency.”
  5. Test-driven development with agents. Agents dramatically benefit from tests: they confirm code runs, catch regressions, and create a track record that lets you parallelise safely. Red/Green TDD (write test first → watch fail → implement → watch pass) is a useful technique to encode as a prompt shortcut for agents.

Vibe coding vs. agentic engineering

Willison’s precise framing:

Vibe coding: you do not look at the code, you do not understand how it works, you go on vibes. Legitimate for personal tools where only you are affected by bugs. Not legitimate for code used by other people.

Agentic engineering: you use coding agents to produce professional-grade code. You are still responsible for the quality. “Using coding agents well is taking every inch of my 25 years of experience as a software engineer.” By 11 AM with four agents running in parallel, he is cognitively exhausted.

The distinction matters because much public discourse conflates the two, which devalues “vibe coding” as a term (useful for signalling “I haven’t reviewed this; treat it accordingly”) and misrepresents what professional agentic engineering requires.


The dark factory pattern

Definition. Software produced by agents at professional quality standards without the engineer directly reviewing individual lines of code. Named after “lights-out factories” — manufacturing so automated the factory runs in the dark.

Current state. StrongDM is an example of a company actively experimenting with this. They simulate employee actions (requesting Jira access, Slack messages) 24 hours a day using AI agents, spending ~$10,000/day on tokens, to test their own AI-powered access management system under realistic load. They even built their own simulated versions of Slack/Jira/Okta because real SaaS platforms have rate limits.

The open question. What professional practices and quality mechanisms allow code production at scale without direct review? Automated tests are the leading candidate — they act as the verification layer that replaces direct human review. But test coverage alone is not sufficient for all production scenarios.


The Challenger disaster prediction

From the 1986 Space Shuttle Challenger post-mortem: “normalisation of deviance.” Many engineers knew the O-rings were unreliable in cold temperatures. But they kept launching. Every successful launch reinforced institutional confidence. The failure was not caused by ignorance of the risk — it was caused by repeated success despite the known risk.

Applied to AI: everyone in the industry knows prompt injection is unreliable and security is porous. But AI systems keep shipping, keep working well enough, and each uneventful deployment increases institutional confidence. Willison predicts a major, headline-grabbing AI security disaster that will force reckoning with these structural vulnerabilities.

He has made this prediction every six months for three years. It has not happened yet. The prediction stands.


Human agency vs. AI agency

The most common word in conversations about working effectively with AI is “agency” — but applied to humans, not machines.

Willison’s thesis: agents have no agency in the meaningful sense. They have no human motivations. You can tell an agent to “make money,” but it cannot self-determine what problems are worth pursuing or what matters. Human agency — deciding what problems to take on, what directions to go — is what agents fundamentally cannot provide, and what becomes more, not less, valuable in an agentic world.

Implication: invest in your own agency. The people who will do best in this environment are those who know what they want to build and why, and who use agents as instruments of that agency.


Practical techniques for agentic engineering

Test-driven development

Tests are the single most important practice for coding agents:

  • Agents confirm code runs by running tests
  • Tests catch regressions when building new features
  • A repository with a good test suite lets you parallelise safely: change one thing; everything else tests automatically
  • Agents given a test-covered repo will write code that does not break the tests

Prompt shortcut: “Use Red/Green TDD.” Agents know this means write-test → watch-fail → implement → watch-pass. Five characters vs. a paragraph of instructions.

Starting templates

Begin every project with a template file that has a single simple test and the preferred code structure. Agents will detect the pattern and conform to it throughout the project. This is the cheapest way to establish house style.

Cognitive load management

Running four coding agents in parallel is cognitively exhausting for the human. The human’s job is to hold the architecture, review outputs at high level, and decide what to parallelise — not to watch each agent step by step. Finding the right personal limit (how many parallel agents, when to review) is itself a skill to develop.

Using existing code as context

Willison maintains a research GitHub repository with 75+ public projects. He gives agents access to it and asks them to find relevant prior work before solving new problems. Agents can search a full hard drive of context using tool calls; they are not limited to the context window.


AI slop

Willison coined or popularised “AI slop”: low-quality AI-generated content that is cheap to produce and floods spaces with noise. The concern is that cheapness of code/text production incentivises quantity over quality, and the accumulated technical debt from unreviewed AI code has long-term costs that are not visible at generation time.


Simon Willison on prompt injection

Willison coined the term “prompt injection.” His framing aligns with Sander Schulhoff's: it is not solvable at the model level; Claude Opus will mostly spot unsafe instructions in an agentic context but not reliably. The Challenger disaster prediction is specifically about prompt injection becoming a systemic risk as agentic systems are deployed at scale.


See also