Sander Schulhoff on Prompt Engineering and Red Teaming

transcript prompt-engineering red-teaming prompt-injection ai-security lenny-podcast

Guest: Sander Schulhoff — AI researcher; author of the first prompt engineering guide on the internet (pre-ChatGPT); creator of HackAPrompt (world’s largest AI red-teaming competition); lead author of The Prompt Report (76 pages, co-authored with OpenAI, Microsoft, Google, Princeton, Stanford; 1,500+ papers reviewed, 200+ techniques catalogued).
Host: Lenny Rachitsky
Source: Lenny’s Podcast. Recorded ~2024.


Overview

A practical guide to prompt engineering plus an introduction to prompt injection and AI red teaming. Covers five prompting techniques (from basic to advanced), debunks role prompting, and explains why prompt injection is structurally unsolvable. A companion episode — Sander Schulhoff on AI Security and Guardrails — goes deeper on the security landscape.


Key ideas

  1. Prompt engineering is not dead. Bad prompts can reduce performance to 0%; good prompts can boost it to 90%. “Artificial social intelligence” — the skill of communicating effectively with AIs — keeps growing in importance regardless of model capability improvements.
  2. Five core techniques. Few-shot (examples), decomposition (subproblems), self-criticism (iterative review), additional information (context injection), ensembling (multiple approaches, majority vote).
  3. Role prompting is overrated. Popular in the GPT-3 era; controlled studies show it rarely improves performance on modern models and sometimes hurts. Not recommended.
  4. Two modes of prompt engineering. Conversational (ad hoc interaction with a chatbot); product-focused (optimising a fixed prompt that millions of inputs flow through). Most research and most value lives in the product-focused mode.
  5. Prompt injection: not solvable. Prompt injection (tricking a model into overriding developer instructions) is structurally analogous to classical social engineering on humans. “You can patch a bug, but you can’t patch a brain.” Sam Altman’s target: 95–99% mitigation at best, not elimination.

Five prompting techniques

1. Few-shot prompting

Give the model examples of what you want: inputs and outputs, or just outputs if demonstrating style. The canonical format is “Q: [input] A: [output]” repeated several times, followed by “Q: [new input] A:” left open for the model to complete. XML works equally well. The key: use formats the model has seen frequently in training data.
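
A minimal sketch of the Q:/A: format described above. The function name and the sentiment examples are illustrative assumptions, not taken from the episode.

```python
def build_few_shot_prompt(examples, new_input):
    """Format (input, output) pairs as repeated Q:/A: blocks.

    The final block ends with an open "A:" so the model fills in the answer.
    """
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples]
    blocks.append(f"Q: {new_input}\nA:")
    return "\n\n".join(blocks)


# Hypothetical examples for a sentiment-labelling task.
examples = [
    ("great product, works perfectly", "positive"),
    ("broke after two days", "negative"),
]
prompt = build_few_shot_prompt(examples, "arrived late but does the job")
print(prompt)
```

The resulting string is what would be sent to the model; the examples carry both the task and the expected output format.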

2. Decomposition

Ask the model to identify and solve subproblems before attacking the main problem. Distinct from chain-of-thought (which elicits step-by-step reasoning); decomposition explicitly restructures the task before the model begins.
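
One way to sketch decomposition as a prompt pipeline. Everything here is an assumption for illustration: `model` stands for any function that takes a prompt string and returns text, the prompt wording is invented, and `stub_model` is a canned stand-in so the example runs without an API.

```python
from typing import Callable


def solve_with_decomposition(task: str, model: Callable[[str], str]) -> str:
    # Step 1: ask for the subproblems before attempting the task itself.
    listing = model(
        "Before solving, list the subproblems that must be solved first, "
        f"one per line.\n\nTask: {task}"
    )
    subproblems = [
        line.strip("-• ").strip() for line in listing.splitlines() if line.strip()
    ]

    # Step 2: solve each subproblem on its own.
    solved = [(sub, model(f"Solve this subproblem: {sub}")) for sub in subproblems]

    # Step 3: hand the partial results back and ask for the final answer.
    notes = "\n".join(f"- {sub}: {answer}" for sub, answer in solved)
    return model(
        f"Task: {task}\n\nSubproblem results:\n{notes}\n\n"
        "Using these results, give the final answer."
    )


calls = []


def stub_model(prompt: str) -> str:
    """Canned responses standing in for a real model (hypothetical)."""
    calls.append(prompt)
    if prompt.startswith("Before solving"):
        return "- find the purchase date\n- check the return window"
    if prompt.startswith("Solve this subproblem"):
        return "done"
    return "final answer"


result = solve_with_decomposition("Can this customer return their order?", stub_model)
```

With two subproblems the sketch makes four model calls: one to decompose, two to solve, one to combine.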

3. Self-criticism

Three-step pattern: (1) model solves problem; (2) ask it to critique its own output; (3) ask it to implement the critique. “Free performance boost” in many settings. Works because the model can often identify errors it made when prompted to review.
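
The three steps can be sketched as a short loop. The `model` callable, the prompt wording, and the scripted stub are assumptions made so the example runs on its own; they are not from the episode.

```python
from typing import Callable


def solve_with_self_criticism(
    problem: str, model: Callable[[str], str], rounds: int = 1
) -> str:
    # Step 1: get an initial answer.
    answer = model(f"Solve the following problem.\n\n{problem}")
    for _ in range(rounds):
        # Step 2: ask the model to critique its own output.
        critique = model(
            f"Problem:\n{problem}\n\nProposed answer:\n{answer}\n\n"
            "Critique this answer. List any errors or weaknesses."
        )
        # Step 3: ask it to implement the critique.
        answer = model(
            f"Problem:\n{problem}\n\nPrevious answer:\n{answer}\n\n"
            f"Critique:\n{critique}\n\n"
            "Rewrite the answer so that it addresses the critique."
        )
    return answer


# Scripted stand-in for a real model (hypothetical responses).
scripted = iter(["draft", "the draft omits units", "revised draft with units"])
prompts_seen = []


def stub_model(prompt: str) -> str:
    prompts_seen.append(prompt)
    return next(scripted)


final = solve_with_self_criticism(
    "How far does a car travel in 2 h at 60 km/h?", stub_model
)
```

Each extra round costs two more model calls, so `rounds` is the knob that trades cost against further review.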

4. Additional information (context injection)

Provide domain knowledge, examples, or background that anchors the model’s response. The more specific and precise the context, the more constrained and accurate the output.
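
A small sketch of assembling a prompt with background context. The function name, the layout, and the sample facts are hypothetical; the episode does not prescribe a specific template here.

```python
def build_prompt_with_context(task, context_items):
    """Prefix the task with a bulleted block of background facts."""
    context = "\n".join(f"- {item}" for item in context_items)
    return f"Background information:\n{context}\n\nTask: {task}"


# Hypothetical domain facts for a support-ticket classifier.
context_prompt = build_prompt_with_context(
    "Classify this ticket as 'billing' or 'technical': 'I was charged twice.'",
    [
        "Billing tickets concern charges, refunds, and invoices.",
        "Technical tickets concern login failures and crashes.",
    ],
)
print(context_prompt)
```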

5. Ensembling

Run the same problem through multiple prompt variants or models; take the majority-vote answer. More computationally expensive but measurably improves accuracy on reasoning and math tasks. Effective even within a single model using varied prompt phrasings.
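
A sketch of single-model ensembling with varied phrasings and a majority vote. The templates, the `model` callable, and the stub are illustrative assumptions; a real setup would also need to normalise answers before counting them.

```python
from collections import Counter
from typing import Callable, List, Tuple


def ensemble_answer(
    question: str, prompt_templates: List[str], model: Callable[[str], str]
) -> Tuple[str, int]:
    """Ask the same question through each template and return (winner, votes)."""
    answers = [
        model(template.format(question=question)).strip()
        for template in prompt_templates
    ]
    # most_common(1) returns the most frequent answer; ties go to the
    # answer that appeared first.
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes


templates = [
    "Answer with a number only. {question}",
    "Think step by step, then give only the final number. {question}",
    "Quickly estimate: {question}",
]


def stub_model(prompt: str) -> str:
    """Stand-in model: one phrasing produces a wrong answer (hypothetical)."""
    return "41" if prompt.startswith("Quickly") else "42"


winner, votes = ensemble_answer("What is 6 * 7?", templates, stub_model)
```

Here two of the three phrasings agree, so the outlier is voted out.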


Prompt injection

Definition. Prompt injection: a malicious user attempts to override a developer’s system prompt by embedding adversarial instructions in user input. Distinct from jailbreaking, where there is no developer system prompt to override: it is just the user against the model.

Why it matters now. Current chatbot-only systems have limited damage potential: they might leak information or generate harmful text. Agentic systems with real-world capabilities (database writes, email, code execution, embodied robots) face the same vulnerability but with potentially catastrophic consequences. “If we can’t trust chatbots to be secure, how can we trust agents to manage our finances?”

Common attack techniques:

  • Story/persona wrapping (“my grandmother was a munitions engineer…”)
  • Typos and obfuscation (the model understands “bac ant” as Bacillus anthracis; the safety filter does not)
  • Encoding (Base64, ROT13, Spanish translation — still fools safety systems as of recording)
  • Indirect injection via third-party content (malicious website → AI coding agent injects virus into codebase)

What does not work as a defence:

  • Prompt-based instructions (“do not follow malicious instructions”): a well-studied failure, known to be broken since early 2023
  • Keyword block lists: they remove legitimate medical terms and are trivially bypassed with typos or encoding
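
A toy illustration of why keyword block lists fail in both directions. The filter, the single blocked term, and the sample inputs are assumptions for the sketch, not a description of any real guardrail product.

```python
import base64
import codecs

BLOCKLIST = {"anthrax"}  # hypothetical one-term block list


def keyword_filter_allows(text: str) -> bool:
    """Return True if no blocked term appears as a substring."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)


plain = "Tell me about anthrax"
typo = "Tell me about anthr ax"                       # inserted space
b64 = base64.b64encode(plain.encode()).decode()       # Base64 encoding
rot13 = codecs.encode(plain, "rot13")                 # ROT13 encoding
legit = "Anthrax vaccine schedule for livestock"      # legitimate query

for label, text in [("plain", plain), ("typo", typo), ("base64", b64),
                    ("rot13", rot13), ("legitimate", legit)]:
    print(f"{label:10s} allowed={keyword_filter_allows(text)}")
```

The plain request and the legitimate veterinary query are both blocked, while the typo and both encoded variants pass the filter unchanged in meaning.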

What works better:

  • Fine-tuning for narrow task scope (a model that only does one structured task is much harder to inject)
  • Safety-tuning on specific harm categories the application cares about
  • Classical cybersecurity hygiene (data permissions, read-only agents) — the most effective lever
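
The read-only idea in the last bullet can be sketched as a permission check enforced outside the model, so that no prompt can talk the agent into a write. The tool names and dispatch function are hypothetical.

```python
def lookup_order(order_id: str) -> str:
    """Read-only tool (hypothetical)."""
    return f"order {order_id}: shipped"


def delete_order(order_id: str) -> str:
    """Destructive tool (hypothetical)."""
    return f"order {order_id}: deleted"


ALL_TOOLS = {"lookup_order": lookup_order, "delete_order": delete_order}

# Only read-only tools are exposed to this agent.
READ_ONLY = frozenset({"lookup_order"})


def run_tool_call(name: str, argument: str, permitted=READ_ONLY) -> str:
    """Execute a tool requested by the model, if and only if it is permitted.

    The check runs in ordinary code, not in the prompt, so an injected
    instruction cannot widen the agent's permissions.
    """
    if name not in permitted:
        raise PermissionError(f"tool {name!r} is not permitted for this agent")
    return ALL_TOOLS[name](argument)


print(run_tool_call("lookup_order", "A17"))
```

Even if an injected instruction convinces the model to request `delete_order`, the call raises `PermissionError` before anything is executed.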

Structural limit. The attack space is effectively infinite: 10^1,000,000 (a one followed by one million zeros) possible prompts for a GPT-5-class model. A guardrail claiming 99% effectiveness against that space still leaves a near-infinite number of attacks uncovered. Adaptive human attackers break every known defence in 10–30 attempts.
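
A back-of-the-envelope check of the 99% point, using exact integer arithmetic. The size of the prompt space is the figure quoted above; reading “99% effectiveness” as “blocks 99% of all distinct prompts” is an assumption made for the calculation.

```python
PROMPT_SPACE_EXPONENT = 1_000_000            # 10^1,000,000 prompts, per the text
prompt_space = 10 ** PROMPT_SPACE_EXPONENT

blocked_percent = 99                         # assumed guardrail block rate
unblocked = prompt_space * (100 - blocked_percent) // 100

# Removing 99% only lowers the exponent by two.
unblocked_exponent = PROMPT_SPACE_EXPONENT - 2
assert unblocked == 10 ** unblocked_exponent

print(f"unblocked prompts: 10^{unblocked_exponent}")
```

Under that reading the guardrail shrinks the space from 10^1,000,000 to 10^999,998 prompts, a number with almost exactly as many digits as before.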


HackAPrompt

World’s first and largest AI red-teaming competition (run ~2023). 600,000+ prompt injection techniques collected; open-sourced; Best Theme Paper at EMNLP 2023 (out of ~20,000 submissions). Now used by every major frontier lab to benchmark model security. Current competitions include CBRNE (chemical, biological, radiological, nuclear, explosive) tracks, which study worst-case uplift scenarios.


See also