Aishwarya and Kiriti on AI Products

transcript ai-product agents evals product-management lenny-podcast

Guests: Aishwarya Naresh Reganti (AI researcher; consulting practice; ex-Alexa, Microsoft); Kiriti Badam (Codex at OpenAI; ex-Google, Kumo)
Host: Lenny Rachitsky (Lenny’s Podcast)
Date: 2025
Source: raw/lenny/Aishwarya Naresh Reganti + Kiriti Badam.txt


Overview

Aishwarya and Kiriti have led or supported over 50 AI product deployments and teach the top-rated AI product course on Maven. The episode distils their hardest-won lessons into a framework for building AI products that succeed — and avoid the common failure modes they see repeatedly in enterprises, startups, and AI-native companies.


Key ideas

  • AI products differ from traditional software in two fundamental ways: non-determinism (the user experience and the model output are both probabilistic) and the agency-control trade-off (every increment of autonomy you grant an agent is an increment of control you relinquish). See Agency-Control Trade-off.
  • Start with low agency and high human control; graduate slowly. Jump straight to a fully autonomous agent and you will accumulate errors faster than you can fix them. Build version by version, increasing autonomy only when the current version has demonstrated reliability.
  • “Pain is the new moat.” The companies winning at AI are not the ones with the best technology — they are the ones that went through the pain of understanding their specific workflows, data, and failure modes. That accumulated knowledge cannot be bought or copied.
  • Evals are misunderstood, not overrated. “Semantic diffusion” has made “evals” mean too many different things. The core principle is sound: build an actionable feedback loop. But how you do it depends entirely on your application. LLM judges, implicit production signals, and human review are all valid tools; none is universally sufficient.
  • Problem-first, not tool-first. The biggest failure mode is getting dazzled by agent complexity and forgetting the problem. 80% of AI engineering time should be spent understanding workflows and user behaviour — not building pipelines.

Two fundamental differences from traditional software

1. Non-determinism

Traditional software has a deterministic decision engine: intent → buttons → forms → outcome. AI products replace that with a fluid natural-language interface — which means the input can take unlimited forms, and the LLM output is probabilistic. Result: “you don’t know how the user will behave with your product, and you don’t know how the LLM will respond.”

This is both a challenge (behaviour calibration is much harder) and an opportunity (the barrier to entry is lower, because users can simply talk naturally).

2. Agency-control trade-off

Every increment of autonomy granted to an agentic system comes with a corresponding reduction in human control. Autonomy is appropriate only once the system has “earned” it, that is, once it has demonstrated reliability in the use cases at hand. See Agency-Control Trade-off.


The graduation ladder: low → high agency

The recommended pattern is explicit version staging from high control to high agency:

Customer support example:

  • V1: AI routes tickets to the right department; humans review every routing decision.
  • V2: AI drafts responses based on standard operating procedures (SOPs); human support agents review and edit before sending. All edits are logged as a training signal.
  • V3: AI resolves tickets end-to-end, handling refunds and escalations autonomously.

Do not build V3 first. You will not know which failure modes to fix. You will also be logging nothing useful.

Other examples from the episode:

  • Coding assistant: inline autocomplete → generate test blocks for review → autonomously open PRs.
  • Marketing assistant: draft emails → run multi-step campaigns → auto-launch A/B-tested campaigns.
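
The customer support ladder above could be sketched roughly as follows. This is a minimal illustration, not something shown in the episode: `AgencyLevel`, `ReviewLog`, and `handle` are hypothetical names, and the human reviewer is modelled as any function that takes the AI output and returns the text that is actually used. The point it shows is that below the top level every output passes through a person, and every correction is logged.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class AgencyLevel(IntEnum):
    """Hypothetical stages mirroring V1-V3 of the customer support example."""
    ROUTE = 1    # V1: AI suggests a department, a human confirms
    DRAFT = 2    # V2: AI drafts a reply, a human edits before sending
    RESOLVE = 3  # V3: AI acts end-to-end without review


@dataclass
class ReviewLog:
    """Collects human corrections so they can serve as a feedback signal."""
    records: list = field(default_factory=list)

    def record(self, ticket_id: str, ai_output: str, final_output: str) -> None:
        self.records.append({
            "ticket_id": ticket_id,
            "ai_output": ai_output,
            "final_output": final_output,
            "edited": ai_output != final_output,
        })


def handle(ticket_id, ai_output, level, log, human_review=None):
    """Return the output that is actually used for this ticket.

    Below RESOLVE, a human reviewer (any callable str -> str) must approve
    or edit the AI output, and the result is logged.
    """
    if level < AgencyLevel.RESOLVE:
        if human_review is None:
            raise ValueError("a human reviewer is required below RESOLVE")
        final_output = human_review(ai_output)
        log.record(ticket_id, ai_output, final_output)
        return final_output
    return ai_output


# Usage: at the DRAFT level a (simulated) support agent edits the draft.
log = ReviewLog()
sent = handle(
    "T-1", "Refund approved.", AgencyLevel.DRAFT, log,
    human_review=lambda draft: draft + " Allow 5 business days.",
)
```

In this sketch, moving from V2 to V3 is a one-line change of `level`, which is only safe once the log shows that reviewers rarely need to edit.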

Continuous Calibration, Continuous Development (CCCD)

Their AI development lifecycle framework, analogous to CI/CD. Two loops running in parallel:

Continuous Development (right side):

  1. Scope capability — define what the AI should do.
  2. Curate data — build a dataset of expected inputs/outputs; this alone surfaces team misalignment on product behaviour.
  3. Set up the application.
  4. Design evaluation metrics.
  5. Deploy.
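
Steps 2 and 4 above can start very small. The sketch below assumes a ticket-routing product and uses hypothetical names throughout (`Example`, `accuracy`, `keyword_router`, and the three sample tickets are invented for illustration). It pairs a curated dataset of expected inputs and outputs with a single metric, so the team can argue about concrete rows rather than abstract behaviour.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One curated input with the behaviour the team agreed on."""
    ticket_text: str
    expected_department: str


def accuracy(predict: Callable[[str], str], dataset: list[Example]) -> float:
    """Fraction of examples where the prediction matches the expected label."""
    if not dataset:
        raise ValueError("dataset is empty")
    hits = sum(predict(ex.ticket_text) == ex.expected_department for ex in dataset)
    return hits / len(dataset)


def keyword_router(ticket_text: str) -> str:
    """Hypothetical stand-in for the real model call."""
    return "billing" if "refund" in ticket_text.lower() else "technical"


DATASET = [
    Example("I want a refund for last month", "billing"),
    Example("The app crashes on login", "technical"),
    Example("Why was I charged twice?", "billing"),
]

# The third example is misrouted, so two of three predictions match.
score = accuracy(keyword_router, DATASET)
```

The misrouted third row is the useful part: it is the kind of disagreement between expected and actual behaviour that curating the dataset is meant to surface early.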

Continuous Calibration (left side):

  1. Observe production behaviour — what distribution of inputs actually arrives?
  2. Spot error patterns — what failure modes weren’t in the evaluation dataset?
  3. Apply fixes.
  4. Design new evaluation metrics for newly discovered failure modes.
  5. Decide whether to increase agency (are we seeing diminishing new surprises?).

Key difference from CI/CD: you are calibrating behaviour, not just testing code. The feedback loop includes implicit signals (did the user regenerate the answer? did they edit the AI’s draft?) alongside explicit evals.
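
One way to turn the implicit signals mentioned above into numbers is sketched below. The event names (`regenerated`, `edited_draft`) and the event shape (a dict with `session_id` and `type`) are assumptions made for this example; a real product would use whatever its own logging emits.

```python
from collections import Counter

# Hypothetical event names; a real product would define its own.
NEGATIVE_SIGNALS = {"regenerated", "edited_draft"}


def implicit_signal_rates(events: list[dict]) -> dict[str, float]:
    """Share of sessions in which each negative implicit signal occurred.

    Each event is a dict with "session_id" and "type" keys.
    """
    sessions = {e["session_id"] for e in events}
    if not sessions:
        return {name: 0.0 for name in NEGATIVE_SIGNALS}
    # Count each signal at most once per session.
    seen = {
        (e["session_id"], e["type"])
        for e in events
        if e["type"] in NEGATIVE_SIGNALS
    }
    counts = Counter(signal for _, signal in seen)
    return {name: counts[name] / len(sessions) for name in NEGATIVE_SIGNALS}


events = [
    {"session_id": "s1", "type": "response_shown"},
    {"session_id": "s1", "type": "regenerated"},
    {"session_id": "s1", "type": "regenerated"},
    {"session_id": "s2", "type": "response_shown"},
    {"session_id": "s2", "type": "edited_draft"},
    {"session_id": "s3", "type": "response_shown"},
    {"session_id": "s4", "type": "response_shown"},
]
rates = implicit_signal_rates(events)
```

Counting per session rather than per event is a design choice in this sketch: a user who regenerates five times in one session is treated as one unhappy session, not five.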

When to graduate to the next agency level: when new calibration cycles produce diminishing new information — the system’s behaviour distribution has stabilised.
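
The "diminishing new information" criterion can be made concrete in many ways; the episode does not prescribe thresholds. The sketch below is one possible reading, with invented names and parameters (`max_new`, `stable_cycles`) and invented failure-mode labels: it counts how many previously unseen failure modes each calibration cycle surfaces, and treats the system as ready once the last few cycles have surfaced almost none.

```python
def novel_failure_counts(cycles: list[set[str]]) -> list[int]:
    """For each calibration cycle, count failure modes not seen in any earlier cycle."""
    seen: set[str] = set()
    counts = []
    for failure_modes in cycles:
        counts.append(len(failure_modes - seen))
        seen |= failure_modes
    return counts


def ready_to_graduate(
    cycles: list[set[str]], max_new: int = 1, stable_cycles: int = 2
) -> bool:
    """True when each of the last `stable_cycles` cycles surfaced at most `max_new` new failure modes."""
    counts = novel_failure_counts(cycles)
    if len(counts) < stable_cycles:
        return False
    return all(c <= max_new for c in counts[-stable_cycles:])


# Hypothetical failure-mode labels observed in four calibration cycles.
cycles = [
    {"wrong_department", "language_mix", "pii_leak"},
    {"wrong_department", "sarcasm"},
    {"sarcasm"},
    {"language_mix"},
]
counts = novel_failure_counts(cycles)
```

Here the early cycles surface several new failure modes and the later ones surface none, which is the stabilisation pattern described above.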


Success triangle

Companies that build AI products successfully share three dimensions, all of which must be present:

  1. Great leaders: hands-on, willing to rebuild their intuitions, and comfortable being “the dumbest person in the room.” The Rackspace CEO example: a daily 4–6 AM “catching up with AI” block and weekend vibe coding sessions.
  2. Good culture: empowerment, not fear. Subject matter experts must be willing to participate in calibration. FOMO-driven “you’ll be replaced” cultures cause them to disengage, and they are exactly the people needed to calibrate AI behaviour.
  3. Technical prowess: deep workflow understanding, choosing the right tool for each part of the workflow (ML model + deterministic code + LLM is more often the right architecture than pure LLM), and fast iteration loops.
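
The "ML model + deterministic code + LLM" architecture from the third point might look like the sketch below. Everything in it is an assumption for illustration: the `ORD-` order-ID format, the keyword-based `classify_intent` standing in for a trained classifier, and the `llm` parameter, which is any callable that takes a prompt string and returns text.

```python
import re
from typing import Callable, Optional

ORDER_ID = re.compile(r"\bORD-\d{6}\b")  # hypothetical order-ID format


def extract_order_id(text: str) -> Optional[str]:
    """Deterministic step: a regex is cheaper and more predictable than an LLM here."""
    match = ORDER_ID.search(text)
    return match.group(0) if match else None


def classify_intent(text: str) -> str:
    """Stand-in for a small trained classifier (the 'ML model' part)."""
    return "refund" if "refund" in text.lower() else "other"


def handle_message(text: str, llm: Callable[[str], str]) -> dict:
    """Route each sub-task to the simplest tool; call the LLM only for free-text drafting."""
    order_id = extract_order_id(text)
    intent = classify_intent(text)
    if intent == "refund" and order_id is None:
        # Fixed template, no LLM call needed.
        reply = "Please share your order ID so we can look into the refund."
    else:
        reply = llm(f"Draft a reply. intent={intent} order_id={order_id} message={text}")
    return {"intent": intent, "order_id": order_id, "reply": reply}


# Usage with a fake LLM so the example runs without any external service.
fake_llm = lambda prompt: "DRAFT"
with_id = handle_message("Refund please for ORD-123456", fake_llm)
without_id = handle_message("I want a refund", fake_llm)
```

Only one of the two messages reaches the LLM; the other is answered by a fixed template, which keeps that path fully deterministic and easy to test.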

Evals — clarifying the semantic diffusion

“Evals” now means too many different things (data labelling notes, LLM judges, model benchmarks, product evaluation datasets). Aishwarya calls this “semantic diffusion” — the term has been overloaded to the point of uselessness.

The underlying principle everyone agrees on: build an actionable feedback loop. How you do that depends on:

  • Volume: high throughput → you cannot manually review traces → implicit production signals are essential.
  • Error type: known failure modes → build evaluation datasets; unknown/emerging failure modes → production monitoring surfaces them first.
  • Domain complexity: complex domains → LLM judges may fail to generalise; spot fixes + production monitoring may be more practical.

Kiriti’s approach on Codex (OpenAI’s coding agent): evals for regression-testing core behaviours, active A/B testing for model changes, and close attention to social signals (Twitter, Reddit) for emerging issues.

See Evals.


“Pain is the new moat”

Kiriti’s framing for competitive advantage in AI: the companies winning are those that went through the pain of learning their specific domain — messy enterprise data, weird taxonomies, undocumented rules, unusual failure modes. That knowledge cannot be replicated by buying a vendor or slapping an agent on the problem. The moat is the lived experience of iterating through failure.

