Sander Schulhoff on AI Security and Guardrails

Guest: Sander Schulhoff — AI researcher; prompt engineering pioneer; founder of HackAPrompt; adversarial robustness researcher; collaborates with OpenAI, Google DeepMind, and Anthropic on model defences.
Host: Lenny Rachitsky
Source: Lenny’s Podcast. Follow-up episode to Sander Schulhoff on Prompt Engineering and Red Teaming.

Overview

A focused critique of the AI security industry: automated red teaming works too well (can break anything); AI guardrails do not work (infinite attack space, intelligence gap, trivial bypass). Covers concrete advice for product teams and reframes the defence problem as a classical cybersecurity + architecture problem rather than an AI-specific one.

Key ideas

Two industry failures. Automated red teaming works too well — it can break every transformer-based system, which makes it a misleading sales tool. Guardrails work too poorly — they are bypassable at scale and create false confidence.
The intelligence gap. Guardrail models are less intelligent than the models they guard. A Base64-encoded attack bypasses the guardrail (cannot decode) but succeeds against the main model (can decode). This gap is structural and persists with capability improvements.
You can patch a bug, but you can’t patch a brain. Classical security: patch a specific bug → near-certain it is closed. AI security: train against a specific attack → model still fundamentally susceptible to variants. The search space is infinite; a 99% claim is statistically meaningless.
Architecture is the real defence. The most effective mitigation: constrain what the system can do, not what it can say. Read-only agents, narrow task scope, strict data permissions, classical least-privilege principles. Human review at action boundaries.
When to worry. Chatbot-only deployments with no tool access: limited risk, minimal investment needed. Agents with real-world actions (email, database writes, code execution): treat as an angry god — assume it will be manipulated, design accordingly.

The guardrails problem

Why guardrails fail

Attack space is infinite. For a GPT-5-class model, the number of possible prompts is ~10^(1,000,000). A guardrail claiming “99% effectiveness” is tested against a statistically insignificant sample of this space. The remaining “1%” is still effectively infinite.

Human attackers break everything. A research paper co-authored with OpenAI, Google DeepMind, and Anthropic tested state-of-the-art defences (including all major guardrails) against both automated attacks and human red teamers. Human attackers achieved 100% success across all defences in 10–30 attempts. Automated systems required orders of magnitude more attempts and achieved ~90% success. Guardrails added no meaningful friction.

Intelligence gap. Guardrail models cannot understand encoding schemes that the main model can parse. Obfuscated or encoded attacks pass the guardrail, reach the main model, and succeed.

Dissuasion effect is zero. If an attacker is motivated enough to get past GPT-5’s built-in safety, adding a guardrail adds negligible friction.

What automated red teaming actually shows

Automated red teamers always produce alarming results — they are designed to. A CISO shown these results reasonably concludes their system is dangerous and buys guardrails. The correct interpretation: every transformer-based system is breakable; your system is not unusual; the guardrails being sold to you will not change that.

Concrete advice for product teams

Scenario 1: Conversational AI, no tool access, affects only the user interacting. Risk is limited. User can only harm themselves. Invest nothing in AI-specific security; invest in standard input/output hygiene.

Scenario 2: Agent with real-world capabilities. Apply classical least-privilege design:

Read-only by default; explicit write-permissions are a security boundary
Each permission the agent has is a potential attack surface — enumerate them
Apply the “angry god” test: if this agent were controlled by a malicious actor, what is the maximum damage it could do? That is your threat model
Human review before any high-stakes irreversible action

Skill investment. Hire or develop “cybersecurity + AI” hybrid skills — people who understand both the classical security model (patch, audit, permissioning) and why AI breaks the assumptions of that model (“you can’t patch a brain”).

Do not. Deploy guardrails expecting security. Do not implement prompt-based defences (“ignore malicious instructions”). Do not rely on automated red teaming reports as proof of vulnerability — they prove nothing unique about your system.

The structural framing

The key insight for understanding AI security: it is adversarial robustness, not classical security. Classical security is discrete: vulnerabilities can be enumerated and patched. Adversarial robustness is continuous: the model’s behaviour is a function over an infinite input space; there is no patch that closes a continuous boundary.

This distinction is why the AI security industry’s framing is flawed: it is selling a patch-like solution to a problem that does not admit patches.