Speaker

Shreya Shankar

Shreya Shankar is a PhD researcher at UC Berkeley and co-instructor (with Hamel Husain) of AI Evals for Engineers and Product Managers on Maven. Her research spans ML systems and AI evaluation methodology. Her 2024 paper “Who Validates the Validators?” introduced the concept of criteria drift in LLM output validation.


Key ideas

  • Criteria drift. People’s definition of a good output changes as they review more outputs; rubrics written entirely upfront are necessarily incomplete. The implication: evaluation must be iterative.
  • Theoretical saturation. The right stopping condition for open coding is when no new failure modes are being uncovered — not a fixed sample size.
  • LLM judges for online monitoring. LLM judges are not just for CI; sample production traces and run judges daily to measure real-world failure rates.
  • 4–7 judges are enough. Most failure modes either have a simple prompt fix or require only a handful of targeted evaluators. Do not try to build a judge for every failure mode.
  • Evals are not magic. Any team iterating on an AI product is already doing some form of error analysis. The goal of formalising it is to make the process repeatable and to catch what intuition misses.
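
Two of the ideas above, theoretical saturation and online monitoring, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method described in the talk: the function names, the `patience` window (how many consecutive traces must surface no new failure mode before stopping), and the `judge` callable standing in for an LLM judge are all hypothetical.

```python
import random
from typing import Callable, Iterable, Optional, Sequence, Set


def reviews_until_saturation(
    coded_traces: Iterable[Set[str]], patience: int = 20
) -> Optional[int]:
    """Return the number of traces reviewed when saturation is reached.

    Each item is the set of failure-mode labels a reviewer assigned to one
    trace during open coding. Saturation (an assumed operational definition)
    means `patience` consecutive traces added no label not already seen.
    Returns None if the traces run out before that happens.
    """
    seen: Set[str] = set()
    since_new = 0
    for count, labels in enumerate(coded_traces, start=1):
        new_labels = labels - seen
        if new_labels:
            seen |= new_labels
            since_new = 0  # a new failure mode resets the window
        else:
            since_new += 1
        if since_new >= patience:
            return count
    return None


def sampled_failure_rate(
    traces: Sequence[str],
    judge: Callable[[str], bool],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate the failure rate from a random sample of production traces.

    `judge` is a placeholder for an LLM judge; it returns True when the
    trace exhibits the failure mode being monitored.
    """
    rng = random.Random(seed)
    sample = rng.sample(list(traces), min(sample_size, len(traces)))
    if not sample:
        return 0.0
    failures = sum(1 for trace in sample if judge(trace))
    return failures / len(sample)
```

The stopping rule depends on what reviewers keep finding rather than on a preset sample size, which is the point of the saturation criterion; the right `patience` value is a judgment call. Running `sampled_failure_rate` on each day's traces would give the kind of daily failure-rate estimate the monitoring idea describes.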

Appearances

Source | Date | Notes
Hamel Husain and Shreya Shankar on Evals | 2025 | Criteria drift; theoretical saturation; LLM judge design; online monitoring; evals debate

See also