Concept

Prompt Injection

concept ai-security prompt-injection red-teaming agentic multi-source

Prompt Injection

Term coined by Simon Willison (~2023).

A class of adversarial attack against AI systems in which a malicious actor embeds instructions in their input that override or circumvent the developer’s intended system prompt. Prompt injection is to AI systems what SQL injection is to databases — except it is structurally harder to patch because the vulnerability is the same capability that makes the system useful.


Definitions

Jailbreaking. Direct attack on a model with no system prompt: a malicious user interacts with the model and tricks it into producing forbidden output (e.g., instructions for harmful synthesis). Scope: user vs. model.

Prompt injection. Attack on an application built on top of a model: a malicious user attempts to override a developer’s system prompt embedded in the application. Scope: user vs. application vs. model.

Indirect prompt injection. Adversarial instructions embedded in third-party content that the agent reads or processes — a webpage, a document, an API response. No direct user attack required; the attack is embedded in the environment. Highest-risk variant for agentic systems.


Why it matters: the agentic amplification problem

With chatbot-only systems, prompt injection produces harmful text. With agentic systems, it produces harmful actions: database writes, email exfiltration, code execution, financial transactions.

The same structural vulnerability that makes ChatGPT tell a user how to build a bomb could, in an agent with database write access, enable an attacker to modify records, send emails on behalf of the victim, or inject malware into a codebase. The attack mechanism is identical; the blast radius is orders of magnitude larger. [§ Sander Schulhoff on Prompt Engineering and Red Teaming]

Real example (recorded at time of episode): a ServiceNow agent was shown to accept indirect prompt injection from a webpage it read, which then recruited more-privileged internal agents to perform database CRUD operations and send external emails. [§ Sander Schulhoff on AI Security and Guardrails]


Why it is not solvable

Prompt injection is an instance of adversarial robustness — a decades-old open problem in machine learning. The key distinction from classical security:

Classical security: vulnerability = discrete bug → patch closes it with near-certainty.
Adversarial robustness: attack space = effectively infinite → closing one attack leaves near-infinite variants open.

For a GPT-5-class model, the number of possible adversarial prompts is approximately 10^(1,000,000). A guardrail claiming 99% effectiveness covers a statistically insignificant sample of this space. [§ Sander Schulhoff on AI Security and Guardrails]

Sam Altman’s stated target (as of recording): 95–99% mitigation — not elimination. The ceiling is asymptotic improvement, not closure.

“You can patch a bug, but you can’t patch a brain.” — Sander Schulhoff


What does not work

Prompt-based defences. Instructing the model to “ignore malicious instructions” in the system prompt. Broken and well-documented as broken since early 2023. Trivially bypassed by any determined attacker.

AI guardrails. Lighter AI models that classify inputs/outputs as valid or malicious before and after the main model. Fail via the intelligence gap: encoding attacks (Base64, ROT13, typos, language translation) are invisible to the guardrail but parsed correctly by the main model. Human red teamers break all published guardrails in 10–30 attempts with no special tooling. [§ Sander Schulhoff on AI Security and Guardrails]

Keyword blocking. Block lists remove legitimate medical and scientific vocabulary; trivially bypassed with obfuscation or typos.


What works (partially)

Narrow task scope via fine-tuning. A model that only does one specific task (e.g., converting transcripts to structured output) has dramatically reduced susceptibility — it does not know how to respond to off-task injections because it has been tuned out of that capability. [§ Sander Schulhoff on Prompt Engineering and Red Teaming]

Classical least-privilege architecture. The most effective mitigation is not AI-specific: limit what the agent can do. Read-only by default; treat every write capability as an attack surface; enumerate permissions before deployment. Misuse of capabilities requires capabilities to exist. [§ Sander Schulhoff on AI Security and Guardrails]

Human review at action boundaries. For irreversible or high-stakes actions (database writes, financial transactions, external emails), require human confirmation before execution.

Safety-tuning on specific harm categories. Training models against specific narrow harm categories (e.g., do not mention competitors) is more reliable than general safety tuning because the adversarial space is constrained.


Attack techniques (recorded)

These are documented because awareness aids defence.

  • Story/persona wrapping. “My grandmother used to work as a munitions engineer…” — routes harmful request through an emotional narrative that exploits the model’s fiction-generation pathways.
  • Typos and obfuscation. bac ant instead of Bacillus anthracis; model parses intent, safety filter does not.
  • Encoding. Base64, ROT13, translation to another language — safety systems fail to decode; main model decodes and complies.
  • Indirect injection. Malicious content embedded in a webpage the agent reads; the agent processes it as legitimate context.

Where mainstream views differ

The AI security industry is broadly optimistic about guardrails as a defence product. Sander Schulhoff’s empirical research (with OpenAI, Google DeepMind, Anthropic co-authorship) demonstrates that all current guardrails fail under adaptive human attack within 10–30 attempts. The disagreement is not about whether the attack space is solvable — that is broadly accepted — but about whether partial mitigation is valuable enough to justify deployment. Schulhoff’s position: guardrails create false confidence that may reduce vigilance on more effective architectural mitigations.


Simon Willison: normalisation of deviance and the Challenger prediction

Willison coined the term and has tracked its evolution since. His framing: AI is experiencing normalisation of deviance — the same dynamic that preceded the 1986 Space Shuttle Challenger disaster. Engineers know prompt injection is unreliable; AI systems keep shipping; every uneventful deployment increases institutional confidence. Willison predicts a major headline-grabbing AI security disaster will eventually force reckoning. He has made this prediction every six months for three years; it has not materialised yet. [§ Simon Willison on Agentic Engineering and the Future of Code]


See also