Evals

Evals (evaluations) are concrete test cases that measure whether a model or model/harness combination behaves as intended on a specific task. In the context of AI product development, they are the primary tool for quantifying model capability, tracking progress, and catching regressions.

Why evals matter for AI products

Traditional software products can be verified by unit tests and integration tests: either the function returns the expected value or it does not. AI product behaviour is probabilistic and context-dependent — the same prompt may produce different outputs; the “right” output is often a judgement call.

Evals bridge this gap: they convert a vague sense of “the model seems better at X” into a concrete, repeatable measurement. Without evals, teams rely on intuition, which decays as the product and models evolve.
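A minimal sketch of what "concrete, repeatable measurement" can mean when outputs are non-deterministic: run the same case many times and report a pass rate instead of a single pass/fail. The names pass_rate, fake_model and is_correct are hypothetical, and fake_model is a seeded random stand-in for a real model call.

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[], str],
              grader: Callable[[str], bool],
              trials: int) -> float:
    """Run the same eval case `trials` times; return the fraction that pass."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if grader(run_once()))
    return passes / trials


# Hypothetical stand-in for a model: usually right, sometimes not.
_rng = random.Random(0)


def fake_model() -> str:
    return _rng.choice(["4", "4", "4", "five"])


def is_correct(output: str) -> bool:
    """Grader for the single case 'what is 2 + 2?'."""
    return output.strip() == "4"


rate = pass_rate(fake_model, is_correct, trials=200)
print(f"pass rate over 200 trials: {rate:.2f}")
```

Tracking this number across model or harness versions is what turns "seems better at X" into a comparison.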

Cat Wu’s case for evals

“Even building 10 great evals is important for helping the team quantify what the goal is, what their progress towards it is, and what they’re missing.”

Cat Wu argues evals are underappreciated — by both PMs and engineers. Her pattern: write evals personally when a feature needs more product definition. The output is not just a pass/fail rate; it is:

A precise statement of what success looks like.
Evidence of where the model/harness fails and why.
A prompt that increases the success rate (product definition → harness improvement).

She writes evals for features like memory, where success is qualitative and hard to define. She relies on a small pod that collaborates closely with research to precisely measure Claude Code behaviours and identify the largest areas for improvement.

Evals and the harness

Evals are the feedback loop that drives harness improvements. When evals show the model is failing in a predictable way, the team investigates: is this a model issue (wait for the next model) or a harness issue (fix the system prompt, add a tool, restructure the context)? The distinction matters because harness fixes ship in days; model fixes ship on model release cycles.

Types of evals

In the Claude Code/Cowork context:

Task-completion evals — does the agent successfully complete a coding task? Does it fix all 20 call sites, not just 5?
Code review evals — does the code review agent catch the bugs that matter before they reach main?
Memory evals — are the memories the model writes accurate, durable, and useful in future sessions?
Character evals — harder to quantify; Amanda at Anthropic specialises in defining and measuring whether Claude’s character is working as intended.

Relationship to RLHF and RL

Evals at the product layer are analogous to the reward signal in Reinforcement Learning from Human Feedback (RLHF). Both convert a qualitative goal (“be helpful”) into a concrete signal. The difference: product-layer evals measure whether the deployed product achieves user goals; RLHF reward models estimate whether a model output would be preferred by human raters.

Nick Turley: evals as the PM’s lingua franca

Nick Turley on ChatGPT adds an important demystification. Nick started writing evals before he knew what an eval was — he was just “articulating very clearly specified ideal behaviour for various use cases.” When a researcher told him this was called an eval, he realised it was the common language between product and ML:

“This might be the lingua franca of how to communicate what the product should be doing to people who do AI research.”

His demystification: evals are not technical magic. They can be done in a spreadsheet. The essence is the eternal PM discipline of articulating success before you do anything else — evals are just a new mechanism for doing that in a form maximally useful for training and measuring models.
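To illustrate the "spreadsheet" point, here is a rough sketch that treats a CSV export as the eval set: each row states a use case, an input, and a phrase the ideal answer must contain. The column names, the substring grader, and fake_product are all assumptions made for illustration; a real grader would usually be less crude.

```python
import csv
import io

# Hypothetical spreadsheet exported as CSV: one row per specified behaviour.
SHEET = """use_case,input,must_contain
refund,"I want my money back",refund policy
greeting,"hi",hello
"""


def load_cases(text: str) -> list:
    """Parse the CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def grade(case: dict, output: str) -> bool:
    """Pass if the required phrase appears in the output (case-insensitive)."""
    return case["must_contain"].lower() in output.lower()


def fake_product(user_input: str) -> str:
    """Placeholder for the real product call."""
    if "money" in user_input:
        return "Here is our refund policy."
    return "Good day."


cases = load_cases(SHEET)
results = {c["use_case"]: grade(c, fake_product(c["input"])) for c in cases}
for use_case, passed in results.items():
    print(use_case, "PASS" if passed else "FAIL")
```

The value is in the rows, not the code: writing the must_contain column forces the author to state ideal behaviour before anything is built.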

Boris Cherny: three-layer safety model

Boris Cherny on Claude Code positions evals as Layer 2 of Anthropic’s three-layer safety model:

Layer 1. Alignment + Mechanistic Interpretability: training-time neuron-level monitoring.
Layer 2. Evals: laboratory/Petri dish setting with synthetic situations, controlled inputs, pass/fail.
Layer 3. In-the-wild behaviour: deployment reveals what the laboratory misses.

The important caveat: as models become more capable, Layers 1 and 2 alone are insufficient. Real-world deployment is irreplaceable. Evals are necessary but not sufficient.

Aishwarya and Kiriti: semantic diffusion and the CCCD feedback loop

Aishwarya and Kiriti on AI Products adds two important perspectives.

Semantic diffusion. The term “evals” has been overloaded: data labelling annotations, LLM judges, model benchmarks, and product evaluation datasets are all called “evals” by different practitioners. Borrowing from Martin Fowler’s concept of “semantic diffusion,” Aishwarya argues that this overloading has made the term nearly useless in cross-team conversations. The corrective: be precise about which type of evaluation you mean, and what you’re trying to measure.

Evals vs. production monitoring. Neither alone is sufficient:

Evals (pre-deployment datasets) catch known failure modes.
Production monitoring (implicit signals — regeneration rates, human edits, engagement) catches unknown/emerging failure modes that only appear in real use.

Building an actionable feedback loop requires both. The decision of which mechanism to use for a given signal type depends on volume, error type, and domain complexity. LLM judges are valuable for some use cases and impractical for others.
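As a sketch of the production-monitoring half, the implicit signals named above (regenerations, human edits) can be reduced to simple rates over logged interactions. The Interaction record and its fields are hypothetical; real logging schemas will differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    """Hypothetical log record for one model response shown to a user."""
    regenerated: bool  # user asked for another answer
    edited: bool       # user modified the answer before using it


def implicit_signal_rates(log: List[Interaction]) -> Dict[str, float]:
    """Fraction of interactions that show each implicit dissatisfaction signal."""
    n = len(log)
    if n == 0:
        return {"regeneration_rate": 0.0, "edit_rate": 0.0}
    return {
        "regeneration_rate": sum(i.regenerated for i in log) / n,
        "edit_rate": sum(i.edited for i in log) / n,
    }


log = [
    Interaction(regenerated=False, edited=False),
    Interaction(regenerated=True, edited=False),
    Interaction(regenerated=False, edited=True),
    Interaction(regenerated=True, edited=True),
]
signal_rates = implicit_signal_rates(log)
print(signal_rates)
```

A rise in either rate does not say what went wrong, only that something did; the traces behind the spike are candidates for new pre-deployment eval cases.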

Evals are not just for data labelling experts. PMs can write evals as product definitions (Nick Turley’s framing). Subject matter experts contribute domain knowledge. Data labelling teams add annotation. These are all valid — but different — activities.

Chip Huyen: evals as an ROI decision

Chip Huyen on AI Engineering adds the most pragmatic counterweight in the wiki. Chip is pro-evals in principle; her point is about opportunity cost.

The executive decision model: when an engineer proposes investing in evals, the expected gain (say, 80% → 82% accuracy) must be weighed against the opportunity cost of two engineers for several weeks. If a new feature would deliver more user value, the feature wins. This is not anti-eval; it is resource allocation.

When evals are non-negotiable:

Operating at scale where failures have catastrophic consequences.
When AI capability is a core competitive differentiator — you need precise benchmarking.
When failure modes are diverse and hard to enumerate manually.

Vibe-checking is legitimate for low-stakes features where failures are recoverable and the product team has direct visibility. The number of evals is not fixed — it depends on the coverage required for high confidence, which varies enormously by product.

Chip’s example of eval complexity: a deep-research product requires evals at every step (search query diversity, result relevance, content overlap, summary accuracy), not a single end-to-end score. Coverage, not a target count, is the right framing.
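A small sketch of why per-step coverage matters, using the four steps named above. The pass/fail results are made-up illustration data, not measurements from any product.

```python
from typing import Dict, List

# Made-up pass/fail results for four pipeline steps on four test questions.
step_results: Dict[str, List[bool]] = {
    "search_query_diversity": [True, True, True, True],
    "result_relevance": [True, True, False, True],
    "content_overlap": [True, False, False, False],
    "summary_accuracy": [True, True, True, False],
}


def per_step_pass_rates(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Pass rate for each step, computed independently."""
    return {step: sum(r) / len(r) for step, r in results.items()}


step_rates = per_step_pass_rates(step_results)
weakest_step = min(step_rates, key=step_rates.get)

# A single pooled number hides which step is responsible.
total_checks = sum(len(r) for r in step_results.values())
pooled_rate = sum(sum(r) for r in step_results.values()) / total_checks

print(step_rates)
print("weakest step:", weakest_step)
print("pooled pass rate:", pooled_rate)
```

In this toy data the pooled rate looks mediocre but unremarkable, while the per-step view points directly at content overlap as the step to fix.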

Hamel Husain and Shreya Shankar: error analysis methodology

Hamel Husain and Shreya Shankar on Evals provides the most operationally complete methodology in the wiki for building product-layer evals. Their three-step process borrows error analysis techniques that social science has used for 30+ years:

Open coding → axial coding → counting

Open coding: one domain expert (the benevolent dictator) reads traces and writes a one-line note on the first upstream error per trace. One person, not a committee. Stops at theoretical saturation — when no new failure modes are appearing.
Axial coding: an LLM clusters the open codes into named, actionable failure mode categories. Human reviews and edits the clusters.
Counting: a pivot table ranks failure modes by frequency. Most prevalent = highest priority.
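The counting step can be sketched with a frequency table in place of a spreadsheet pivot table. The trace IDs and failure mode names below are invented for illustration; in practice they would come from the open and axial coding passes.

```python
from collections import Counter

# Hypothetical result of open + axial coding: (trace id, failure mode category).
coded_traces = [
    ("t1", "missed handoff to human"),
    ("t2", "wrong date format"),
    ("t3", "missed handoff to human"),
    ("t4", "ignored user constraint"),
    ("t5", "missed handoff to human"),
    ("t6", "wrong date format"),
]

# Count traces per failure mode and rank from most to least frequent.
failure_counts = Counter(category for _, category in coded_traces)
ranked = failure_counts.most_common()

for category, count in ranked:
    share = count / len(coded_traces)
    print(f"{category}: {count} traces ({share:.0%})")
```

The top of the ranked list is the first candidate for a prompt fix or, if it persists, a dedicated judge.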

LLM-as-judge design principles:

One judge per failure mode; binary output only (true/false). Likert scales (1–7) produce uninterpretable numbers and should be rejected.
4–7 judges is enough for most products. Only build judges for failure modes that persist after prompt improvements.
Calibrate with a confusion matrix before deploying. Raw % agreement is misleading when errors are rare — a judge that always says “pass” achieves 90% agreement if the failure mode occurs 10% of the time. Examine false positive and false negative rates separately.
LLM judges work in both CI/unit tests and online monitoring of production traffic.
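The calibration point can be checked with a few lines of arithmetic. The sketch below compares human labels with judge labels for one failure mode, where True means "failure present"; the function name calibrate and the 100-trace dataset are assumptions chosen to reproduce the 10% example above.

```python
from typing import Dict, Sequence


def calibrate(human: Sequence[bool], judge: Sequence[bool]) -> Dict[str, float]:
    """Compare judge labels with human labels (True = failure mode present)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)          # failure caught
    tn = sum(1 for h, j in pairs if not h and not j)  # clean trace passed
    fp = sum(1 for h, j in pairs if not h and j)      # false alarm
    fn = sum(1 for h, j in pairs if h and not j)      # failure missed
    return {
        "agreement": (tp + tn) / len(pairs),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# 100 traces, failure present in 10 of them; the judge never flags anything.
human_labels = [True] * 10 + [False] * 90
always_pass_judge = [False] * 100

report = calibrate(human_labels, always_pass_judge)
print(report)
```

Here agreement is 0.9 while the false negative rate is 1.0: the judge misses every real failure, which raw agreement alone would not reveal.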

Criteria drift. Shreya’s research finding (paper: “Who Validates the Validators?”, 2024): people’s definition of a good output changes as they review more outputs. Failure modes only become visible after seeing unexpected examples. This makes rubrics written entirely upfront necessarily incomplete; the open coding process is the mechanism for discovering the criteria you did not know you needed.

Evals as PRDs. The LLM judge prompt encodes the product requirements: what the product should do, under what conditions, in what form. Derived from real trace data, it is a better PRD than one written without data.

The coding agent exception. Coding agents (Claude Code, Cursor) appear to rely on “vibes” rather than formal evals, but this is a special case: developers are simultaneously domain experts and power users. Most AI products (medical, customer service, real estate) lack this collapsed feedback loop and need systematic error analysis.

Kevin Weil: evals determine product design

Kevin Weil on OpenAI and the Future of AI Products reframes evals from quality gate to product architecture driver. The central claim: the model’s reliability on your specific use case determines the entire product design, not just the AI layer.

The reliability threshold framework. Three bands requiring fundamentally different product responses:

60% reliable → human-in-the-loop required; narrow, forgiving use cases only; explicit error surfacing.
95% reliable → most consumer products are viable; failures are occasional and recoverable.
99.5% reliable → mission-critical products become viable; users extend trust.

You cannot design the right product without knowing which band your use case falls into. Evals are the instrument that tells you.
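One way to make the bands operational is to map a measured eval pass rate to a band. The source gives three reference points (60%, 95%, 99.5%) but no exact cutoffs, so the thresholds below are an assumption: 95% and 99.5% are treated as lower bounds, and everything under 95% falls into the human-in-the-loop band.

```python
def reliability_band(pass_rate: float) -> str:
    """Map a measured pass rate (0.0 to 1.0) to a product-design band.

    Cutoffs are an illustrative assumption, not part of the original framework.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0.0 and 1.0")
    if pass_rate >= 0.995:
        return "mission-critical viable"
    if pass_rate >= 0.95:
        return "consumer product viable"
    return "human-in-the-loop required"


for measured in (0.60, 0.95, 0.997):
    print(f"{measured:.1%} -> {reliability_band(measured)}")
```

The measured rate should come from evals on the specific use case, since the same model can sit in different bands for different tasks.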

AI is capped by the quality of its evals. Most domain knowledge (company processes, proprietary workflows, subject-matter expertise) is absent from model training data. Kevin’s claim is that models can learn anything, but they need structured exposure. Custom evals are the mechanism: they encode what “good” looks like for a specific use case, providing the signal for fine-tuning and the benchmark for measuring progress.

Deep Research case study. Kevin’s team built Deep Research (25-minute sessions, 20-page outputs) by constructing evals in parallel with the product:

Identify hero use cases — the complex research questions users want answered.
Define what an exemplary answer looks like for each (human-written gold standard).
Encode those as evals.
Fine-tune the model against the evals.
Track eval performance as the primary signal the product is working.

This is the evals → fine-tuning loop at product scale. Evals are not a test run after the product is built; they are the specification the model learns from during building. It extends the open/axial coding methodology from Hamel Husain and Shreya Shankar on Evals to OpenAI’s product layer.

Poor man’s fine-tuning. For teams without the compute or researcher capacity to fine-tune: include problem/answer examples directly in the prompt (few-shot). More effective than a bare prompt; not as powerful as a full fine-tune. A practical first step toward the evals → fine-tuning loop.
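A minimal sketch of the few-shot approach: problem/answer pairs, ideally taken from the gold-standard eval examples, are placed in the prompt ahead of the new question. The function name, the "Problem:/Answer:" layout, and the sample pairs are assumptions for illustration, not a prescribed format.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]],
                          question: str,
                          instruction: str = "Answer in the style of the examples.") -> str:
    """Assemble an instruction, worked examples, and the new question into one prompt."""
    parts = [instruction]
    for problem, answer in examples:
        parts.append(f"Problem: {problem}\nAnswer: {answer}")
    # Leave the final answer open for the model to complete.
    parts.append(f"Problem: {question}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical gold-standard pairs reused from an eval set.
gold_examples = [
    ("Summarise the Q3 churn report.", "Churn rose in Q3, driven by pricing changes."),
    ("Summarise the onboarding survey.", "Most users stall at the integration step."),
]

prompt = build_few_shot_prompt(gold_examples, "Summarise the support ticket backlog.")
print(prompt)
```

Examples used in the prompt should be kept out of the set used to score the result; otherwise the eval measures memorisation of the prompt rather than behaviour on new inputs.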