Hamel Husain and Shreya Shankar on Evals
Guests: Hamel Husain — independent AI consultant and educator; Shreya Shankar — PhD researcher at UC Berkeley; co-instructors of the highest-grossing course on Maven.
Host: Lenny Rachitsky
Source: Lenny’s Podcast.
Overview
A live demo walkthrough of the error analysis methodology for AI products, using a real deployment (Nurture Boss, an apartment leasing AI) as the example. Hamel and Shreya argue that most AI teams fail to improve their products because they don’t look at their traces. Their process — open coding → axial coding → counting → LLM judge — is grounded in 30+ years of social science methodology and is immediately actionable. The episode is the most practical treatment of evals in the wiki.
Key ideas
- Look at your traces. The highest-ROI activity in AI product development is manually reading production traces. Teams consistently underdo this and consistently discover things they could not have anticipated.
- Open coding → axial coding → counting. The three-step error analysis process: write notes on first upstream error per trace (open coding); cluster those notes into named failure modes using an LLM (axial coding); count failure modes in a pivot table to find the most prevalent (counting).
- Benevolent dictator. One domain expert does the open coding, not a committee. A single coherent perspective makes the process tractable and the categories meaningful.
- LLM-as-judge: binary, calibrated, targeted. Each judge covers one specific failure mode, outputs true/false (not a Likert scale), and is calibrated against human labels using a confusion matrix before deployment.
- Evals as PRDs. The LLM judge prompt encodes the product requirements: exactly what the product should do, under what conditions, in what form. It is derived from real trace data, not invented upfront.
The problem: not looking at data
The most common failure mode in AI product development is not building the wrong evaluator — it is not looking at the data at all. Hamel’s consulting pattern:
“The first thing I’ll say is, ‘Let’s go look at your traces.’ You can see their eyes pop open and be like, ‘What do you mean?’”
Every single time he does this with a client, he learns something material. The traces always contain surprises. Teams who skip this step build evaluators based on hypothetical failure modes — and those hypotheticals rarely match what is actually happening.
The methodology: error analysis
Step 1 — Open coding
Read traces one at a time. For each trace, write a short note on the first upstream error you observe — the earliest point where something went wrong. One line per trace. Stop at the first error; do not catalogue everything.
Why one person, not a committee (the benevolent dictator principle):
- Keeps the process tractable.
- A committee produces incoherent categories as different annotators use different mental models.
- One domain expert’s consistent lens produces categories that are actionable.
Theoretical saturation: stop when you are no longer uncovering new failure modes. In practice, Hamel and Shreya recommend doing at least 100 traces as a habit, but the real stopping condition is conceptual saturation.
Step 2 — Axial coding
Feed the open codes into an LLM and ask it to cluster them into named, actionable failure mode categories. Human reviews and edits the output — do not accept it blindly. The result is a set of named categories (e.g., “handoff failures”, “conversational flow issues”, “tool call errors”).
The categories should be:
- Actionable — you can actually fix these.
- Named — precise enough that developers know what to work on.
- Grounded — directly traceable to what appeared in the traces.
Step 3 — Counting
Build a pivot table: failure mode vs. count. This is a simple frequency analysis. The most prevalent failure modes are the highest-priority ones to fix. Some failure modes will be fixable by editing the system prompt; those are not worth building evaluators for — just fix them.
LLM-as-judge design
An LLM judge is a prompt that evaluates one specific failure mode. Design principles:
Binary output only. True/false, pass/fail. Not a 1–7 Likert scale. A numeric score sounds precise but is uninterpretable: nobody knows what 4.2 vs. 3.7 means, and the model is forced into an opinion that obscures the underlying question. Force a decision.
One failure mode per judge. The judge has a tightly scoped task. This makes it reliable: asking an LLM to judge one specific thing in a binary way is well within its capabilities; asking it to judge overall quality is not.
Derive from real data. The judge prompt should list criteria drawn directly from the failure modes uncovered in open/axial coding — not from a priori product requirements alone. Criteria you could not dream up upfront will appear in the data.
Use cases: LLM judges run in CI/unit tests (catching regressions on labelled traces) and in online monitoring (sampling N production traces per day, measuring real-world failure rates).
Judge calibration: the confusion matrix
Before releasing an LLM judge, calibrate it against human labels.
The naive metric is misleading. If a failure mode occurs in 10% of traces, a judge that always says “pass” achieves 90% agreement — which looks good but is useless. Raw agreement % is not sufficient.
Use a confusion matrix. For each judge, measure:
- Human true + judge true (correct positives)
- Human false + judge false (correct negatives)
- Human true + judge false (missed errors)
- Human false + judge true (false alarms)
The off-diagonal cells are what matter. Iterate on the judge prompt until false positives and false negatives are acceptably low. A PM whose team reports agreement % without this matrix should ask for the full confusion matrix.
Evals as PRDs
Lenny’s observation from the demo: the LLM judge prompt — listing exactly what constitutes a handoff failure, under what conditions — is structurally identical to a product requirements document. It encodes:
- What the product is supposed to do.
- Under what conditions.
- In what form.
Shreya’s extension: the PRD derived from real traces is always better than the PRD written upfront. The traces reveal requirements the team could not have anticipated. “You’re never going to know what the failure modes are going to be upfront.”
Criteria drift: “Who validates the validated?”
Shreya’s 2024 research paper (with collaborators) studied how people validate LLM outputs and found a key structural problem: criteria drift.
People’s definition of a good output changes as they review more outputs. Failure modes only become visible after seeing outputs they would never have anticipated. This applies even to expert practitioners who have built many LLM pipelines.
Consequence: rubrics written entirely upfront are necessarily incomplete. The process must be iterative. Open coding is the mechanism for discovering criteria you did not know you needed.
The evals debate
The “vibes” controversy
One narrative: Claude Code’s team does not do formal evals — they just use the product and feel whether it’s right. Shreya’s take:
- Standing on others’ evals. Coding agents benefit from the extensive evals done on the foundation model. The coding benchmarks Claude is evaluated against are evals done by Anthropic’s research team.
- Implicit error analysis. Any team that ships and iterates is doing some form of error analysis — dogfooding, monitoring engagement, looking at traces that go wrong. It may not be called “evals” but it is functionally similar.
- The coding agent exception. Coding agents are a special case: the developer is simultaneously the domain expert and the power user. This collapses the distance between building and evaluating. This does not generalise to most AI products (medical AI, leasing AI, customer service AI), where the domain experts are not the developers.
Evals vs. A-B tests
Shreya’s synthesis: A-B tests are a form of evals — they test a metric against experimental conditions. The problem is premature A-B testing: teams run A-B tests on hypothetically important metrics before doing error analysis to find out what is actually failing. Ground your A-B test hypotheses in error analysis first.
What “evals” means
Hamel’s reframe: evals is a confusing term because it is new jargon for something old — data science applied to your product. Looking at your data, doing analytics, measuring what matters. The tactics look slightly different with LLMs, but the underlying discipline is the same.
Time investment
Shreya’s estimate for a new product:
- Initial error analysis: 3–4 days of close collaboration with the domain team; labelling, building the axial coding framework, creating 4–7 LLM judge evaluators.
- Ongoing: ~30 minutes per week, mostly looking at data and handling emerging failure modes.
The upfront cost is one-time. The evaluators, judges, and labelled datasets persist indefinitely.
Build your own interfaces
Hamel’s tip: for high-frequency data review, build a bespoke interface. Custom tools — built with AI assistance in a few hours — remove friction from the most important activity. The Nurture Boss team built an app that shows traces organised by channel (voice/email/text), hides system prompts by default, and shows the axial coding failure count in red for each thread.
The goal is not a beautiful eval suite. The goal is a better product. Make looking at data as frictionless as possible.