Chip Huyen on AI Engineering

Guest: Chip Huyen — co-founder of Claypot AI; former NVIDIA NeMo; former Netflix AI researcher; Stanford ML lecturer; author of AI Engineering (most-read O’Reilly book since launch) and Designing Machine Learning Systems.
Host: Lenny Rachitsky
Source: Lenny’s Podcast, 2025.

Overview

Chip Huyen walks through what actually improves AI products (vs. what people think does), explains the full training stack from pretraining to post-training to RLHF, and goes deep on RAG data preparation, evals, and the emerging AI Engineering discipline. The episode doubles as a technical primer for non-technical PMs and a pragmatic practitioner’s guide for teams building AI products. Chip has worked with a wide range of enterprises and brings ground-level observations on what is and is not working.

Key ideas

Talk to users first. The viral table: what people think improves AI apps (new frameworks, best model, fine-tuning) vs. what actually does (user conversations, better data, better prompts, end-to-end workflow optimisation). Staying current with AI news is nearly always the wrong investment.
Data preparation is the real RAG work. Most RAG performance gains come from better data preparation — chunking strategy, contextual metadata, hypothetical question generation — not from vector database selection.
Evals: ROI-first. Evals are always valuable in principle; in practice, the decision is a return-on-investment question. Scale and competitive importance determine how much precision you need. Vibe-checking is legitimate for low-stakes features.
AI Engineering = using models; ML Engineering = building models. The distinction defines a new profession. AI Engineering’s entry barrier is lower; the ceiling is as high as you make it with application architecture, evals, and prompt engineering.
Test-time compute as performance lever. Allocating more inference compute (generating multiple candidates, extended chain-of-thought) can substantially improve perceived model performance without changing the base model.

What actually improves AI apps

Chip’s viral LinkedIn table — the most engagement she has received on a single post:

What people think improves AI apps	What actually improves AI apps
Staying up to date with the latest AI news	Talking to users
Adopting the newest agentic framework	Building more reliable platforms
Agonising about which vector database to use	Preparing better data
Constantly evaluating which model is smarter	Optimising end-to-end workflows
Fine-tuning a model	Writing better prompts

The asymmetry is stark. Most “AI strategy” attention goes to infrastructure and model selection — activities with limited marginal value in most contexts. The high-leverage work is user research, data quality, and prompt engineering. The framework questions to ask before adopting new technology: (1) how much improvement could come from optimal vs. non-optimal choices? (2) how hard would it be to switch out later?

The training stack: pretraining → post-training → RLHF

Pretraining — encoding statistical information about language across massive datasets. The model learns to predict the most likely next token. Pre-training is largely commoditised; most users never touch it.

Post-training — where frontier labs compete most intensively today. Two main techniques:

Supervised fine-tuning — expert demonstration data; the model learns to emulate human expert responses.
Reinforcement learning — using a signal (human preference, AI preference, or verifiable rewards) to bias the model toward better outputs.

RLHF — Reinforcement Learning from Human Feedback. Humans compare two responses and indicate which is better (comparison is easier than scoring). A reward model is trained on these preferences; the base model is then trained to produce outputs scored highly by the reward model.
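A minimal sketch of the reward-model objective described above. The episode does not name a loss function; this assumes the commonly used Bradley-Terry style pairwise loss, and the function name and example scores are illustrative only.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-probability that the preferred response wins.

    Assumption: P(chosen beats rejected) = sigmoid(score_chosen - score_rejected),
    where the scores are the reward model's scalar outputs for each response.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a
    # numerically stable way for both large positive and negative margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the reward model already ranks the pair correctly
# and large when it ranks the pair the wrong way round.
loss_agrees = pairwise_preference_loss(2.0, 0.0)
loss_disagrees = pairwise_preference_loss(0.0, 2.0)
```

Minimising this loss over many human comparisons pushes the reward model to score preferred responses higher than rejected ones, which is why comparisons are enough and absolute scores are not required.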

Verifiable rewards — a variant increasingly in use: for tasks with deterministic correct answers (maths, code), the model is rewarded when the output is verifiably correct. No human rater required.
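As a rough illustration of a verifiable reward for a maths task, the sketch below checks the model's answer against a known result. The convention that the answer sits on the final line of the output, and the function name, are assumptions made for this example.

```python
from fractions import Fraction


def verifiable_reward(model_output: str, expected: str) -> float:
    """Return 1.0 if the final line of the output equals the expected number, else 0.0."""
    lines = model_output.strip().splitlines()
    if not lines:
        return 0.0
    final_line = lines[-1].strip()
    try:
        # Fraction parses integers, decimals, and "a/b" strings exactly,
        # so "0.5" and "1/2" compare as equal.
        return 1.0 if Fraction(final_line) == Fraction(expected) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable answers earn no reward.
        return 0.0
```

The same pattern applies to code tasks, where the check would be a test suite passing rather than a numeric comparison.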

See Reinforcement Learning from Human Feedback.

RAG and data preparation

RAG (Retrieval-Augmented Generation) — providing the model with relevant context retrieved from a knowledge base before generating a response. The idea has roots in 2017 NLP research: giving models access to relevant retrieved documents dramatically improves performance on knowledge-intensive tasks.

Chunk design matters most. The key decision is how large each chunk should be. Too large: each chunk carries plenty of context, but only one (or very few) fits into the model's context budget per query, which limits breadth. Too small: many chunks can be retrieved, but each lacks sufficient context.
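To make the chunk-size trade-off concrete, here is a minimal sketch of a word-based chunker with optional overlap. Splitting on whitespace, and the parameter names chunk_size and overlap, are assumptions for illustration; production systems often split on tokens, sentences, or document structure instead.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text, so no
        # trailing chunk made up purely of overlap is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

With a hypothetical budget of 1,000 words of retrieved context, chunk_size=1000 leaves room for a single chunk per query, while chunk_size=100 leaves room for ten narrower ones.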

Contextual enrichment strategies:

Add metadata summaries to each chunk so it can be retrieved even when the query uses different terminology.
Generate hypothetical questions for each chunk — questions the chunk could answer. When a query arrives, match it against hypothetical questions rather than raw chunk content.
Rewrite document content in Q&A format. A practitioner reported a major performance gain by restructuring documentation into explicit question-answer pairs (AI-assisted).
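The hypothetical-question strategy above can be sketched as follows. Token-overlap (Jaccard) similarity stands in for the embedding similarity a real system would likely use, and the index layout (chunk id mapped to a list of generated questions) is an assumption for this example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lower-cased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: str, index: dict[str, list[str]]) -> str:
    """Return the id of the chunk whose best hypothetical question most resembles the query."""
    query_tokens = tokens(query)

    def best_score(chunk_id: str) -> float:
        return max(
            (jaccard(query_tokens, tokens(q)) for q in index[chunk_id]),
            default=0.0,
        )

    return max(index, key=best_score)


# Hypothetical questions would be generated per chunk ahead of time
# (for instance by a model); these are hand-written placeholders.
example_index = {
    "billing": ["How do I update my credit card?", "Where can I download an invoice?"],
    "auth": ["How do I reset my password?", "Why is my account locked?"],
}
```

The query is compared with questions rather than with raw chunk text, so a chunk can be found even when its own wording differs from the way users ask.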

Documentation written for humans ≠ documentation for AI. Human readers have implicit context AI lacks. Example: a function that returns a value on a scale of −1 to 1 where 1 means “high temperature” needs a human annotation layer that explains the scale to the model — otherwise the model treats it as an actual temperature reading.

Vector database choice is largely irrelevant for quality. Access patterns (read-heavy vs. write-heavy, latency requirements) matter for database selection. Pure answer quality is dominated by data preparation, not retrieval infrastructure.

Evals: pragmatic framing

Chip’s position: evals are always valuable in principle; the real question is ROI.

The executive decision model she describes: when an engineer proposes investing in evals, the cost (e.g., two engineers for several weeks) must be weighed against expected improvement and opportunity cost. If evals would improve performance from 80% to 82% and a new feature would deliver more user value, the new feature wins. This is not anti-eval; it is ROI thinking.

When evals are essential:

When operating at scale, where failures can have catastrophic consequences.
When AI is a competitive advantage, not a supporting feature — you need precise benchmarking against competitors.
When failure modes are diverse and hard to enumerate.

Number of evals is not fixed. The right number is the coverage required for high confidence in performance and visibility into failure modes. Some products need hundreds of evals (general-purpose assistants with many dimensions of quality); others need a handful targeting the core use case.

Deep research as an example of multi-dimensional eval. Evaluating deep research quality requires evals at every step: search query diversity, result relevance, content overlap, summary accuracy — not a single end-to-end metric.
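A rough sketch of what step-level metrics might look like for two of the dimensions named above, search query diversity and content overlap. The metric definitions (mean pairwise Jaccard similarity over word sets) are assumptions chosen for simplicity, not the method described in the episode.

```python
import re
from itertools import combinations


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_overlap(texts: list[str]) -> float:
    """Average Jaccard similarity over all pairs; 0.0 when there are fewer than two texts."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return sum(_jaccard(_tokens(a), _tokens(b)) for a, b in pairs) / len(pairs)


def step_report(search_queries: list[str], retrieved_pages: list[str]) -> dict[str, float]:
    """Report one score per pipeline step instead of a single end-to-end number."""
    return {
        # High when the generated queries explore different angles.
        "query_diversity": 1.0 - mean_pairwise_overlap(search_queries),
        # High when the retrieved pages largely repeat one another.
        "content_overlap": mean_pairwise_overlap(retrieved_pages),
    }
```

Result relevance and summary accuracy would need their own checks (for example human or model grading) and are left out of this sketch.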

See Evals.

The three-bucket productivity experiment

An informal randomised trial on a ~35-person engineering team: engineers were divided into three performance buckets (high, average, and low performing), then half of each bucket was given access to Cursor.

Observed result: the highest-performing engineers showed the greatest productivity gain. Hypothesis: high performers have strong problem-solving skills; AI amplifies the application of those skills. Low performers often went on “autopilot” — accepting AI output without critical review — so gains were smaller.

Caveat: a different company reported the opposite — senior engineers were most resistant to AI tools because they had higher standards for code quality. Chip notes she cannot reconcile the two reports; both are plausible.

Policy implication: some organisations are restructuring engineering teams so that senior engineers write process guidelines and review AI/junior-generated code, rather than writing code themselves.

AI Engineering vs ML Engineering

See AI Engineering.

The key distinction:

ML engineers build and train models from scratch.
AI engineers use existing pre-trained models (via API or open weights) to build applications.

AI Engineering has a lower entry barrier (no need to understand training algorithms in depth) but a high ceiling: application architecture, evals, prompt engineering, RAG design, and fine-tuning all reward deep expertise.

Test-time compute

Test-time compute (also: inference-time compute) — allocating more computational resources at inference time to improve output quality, rather than (only) training a more capable base model.

Two mechanisms:

Sampling multiple candidates — generate N outputs and select the best by a scoring function or majority vote.
Extended chain-of-thought — allow the model to generate more “thinking tokens” before producing a final answer.
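The first mechanism can be sketched in a few lines. Here sample is a hypothetical stand-in for one call to a model, and score is a hypothetical scoring function; neither is a real library API.

```python
from collections import Counter
from typing import Callable


def majority_vote(sample: Callable[[], str], n: int) -> str:
    """Draw n answers and return the most frequent one."""
    if n <= 0:
        raise ValueError("n must be positive")
    answers = [sample() for _ in range(n)]
    # most_common breaks ties in favour of the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(sample: Callable[[], str], score: Callable[[str], float], n: int) -> str:
    """Draw n candidates and return the one the scoring function rates highest."""
    if n <= 0:
        raise ValueError("n must be positive")
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)
```

Both functions cost n model calls instead of one, which is the trade at the heart of test-time compute: more inference spend for a better final answer from the same base model.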

Both improve perceived performance without changing the base model. This helps explain why products keep improving despite an apparent plateau in pretraining: base-model capability may be growing more slowly, but test-time compute strategies are producing significant quality improvements at the application layer.

Idea crisis and the tool paradox

Chip observes a paradox: AI has made it easier than ever to build things (design, code, websites all auto-generatable), yet people in hackathons and internal AI strategy sessions consistently struggle to identify what to build.

Her hypothesis: decades of specialist culture have left people without big-picture vision. The remedy she prescribes for hackathons: for one week, keep a frustration log — note every moment of friction in your daily work. When something frustrates you, ask whether it could be done differently. The frustration log surfaces problems that AI tools can address; the hard part is the act of noticing.

Org structure implications

For the near term, Chip sees organisations evolving toward:

Senior engineers as process architects. Writing guidelines, establishing code review standards, operating as reviewers rather than authors.
Junior engineers as code producers. Generating code via AI with senior review gates.
Blurred function lines. Evals require product understanding and engineering depth simultaneously; the person who designs evals needs to understand user behaviour and system architecture. This is pulling product and engineering closer together.