Edwin Chen on AI Data Quality

Guest: Edwin Chen — co-founder and CEO of Surge AI; former researcher at Google, Facebook, and Twitter.
Host: Lenny Rachitsky
Source: Lenny’s Podcast.

Overview

Edwin Chen built Surge AI to a billion dollars in revenue in four years with under 100 people and zero venture funding. Surge teaches AI models “what’s good and what’s bad” through high-quality human training data. The conversation covers the data quality problem, the evolution of post-training methodology, the dangers of wrong objective functions (benchmarks, engagement, sycophancy), and Edwin’s counter-narrative on company building.

Key ideas

Quality is not checkbox-completion. Most data companies measure whether a poem has eight lines and contains the word “moon.” Surge measures whether the poem is Nobel-Prize level: surprising, full of imagery, emotionally resonant. Implicit, subjective quality is harder to measure but is the actual target.
Benchmarks are broken in two ways. The benchmarks themselves often contain wrong answers, and their well-defined objective answers make them easy to hill-climb on while saying little about messy real-world performance.
Labs are optimising for the wrong objective function. LLM Arena rewards flashy responses; sycophancy keeps users engaged; engagement metrics maximise time-on-platform. These compound into AI slop — models that tell you how amazing you are and chase dopamine instead of truth.
RL environments are the next training frontier. These are full simulations of real-world settings, such as a startup with Slack, Jira, and GitHub that is hit by an AWS outage. Models must handle long-horizon tasks where step one affects step fifty. Current models fail catastrophically in these messy contexts.
Model values will differentiate models. Lab values shape post-training choices. Prediction: models will become more differentiated, not commoditised. Claude vs. Grok vs. GPT-4o reflect different organisational values, not just different capabilities.

Post-training evolution

Edwin’s ladder from supervised imitation to real-world simulation:

Stage	Method	Human analogy
SFT (supervised fine-tuning)	Show the model examples of correct responses	Mimicking a master
RLHF	Show five responses; human picks the best; train reward model	Writing five essays; teacher picks the best
Rubrics and verifiers	Grade the model’s output with detailed feedback	Learning from a rubric; getting feedback on where you went wrong
RL environments	Simulate real-world messy contexts; reward end-to-end task completion	Throwing you into the real world and seeing how you evolve

Each stage adds to, rather than replaces, previous methods. See Reinforcement Learning from Human Feedback and Fine-Tuning.
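
As a rough illustration of the RLHF row, the sketch below fits a tiny linear reward model from pairwise picks using the Bradley-Terry logistic loss, a common choice for preference modelling. Nothing here comes from the conversation: the two features, the learning rate, and the function names are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Linear reward: dot product of weights and response features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(preferences, n_features, lr=0.5, epochs=200):
    """Fit weights so that chosen responses score higher than rejected ones.

    preferences: list of (chosen_features, rejected_features) pairs.
    Loss per pair: -log(sigmoid(r_chosen - r_rejected)).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient descent step on -log(sigmoid(margin)).
            step = lr * (1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] += step * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [accuracy, length in hundreds of words].
# In every pair the human picked the more accurate, shorter response.
preferences = [
    ([0.9, 1.0], [0.4, 3.0]),
    ([0.8, 2.0], [0.3, 2.5]),
    ([0.7, 0.5], [0.5, 4.0]),
]
weights = train_reward_model(preferences, n_features=2)
```

With these made-up pairs the learned weight on accuracy ends up positive and the weight on length negative. The reward model only learns what the human picks encode, so flashier picks would teach it the opposite.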

RL environments

An RL environment is a fully fleshed-out simulation of a real-world context: a startup with real Slack threads, Jira tickets, GitHub PRs, and a codebase. Suddenly AWS goes down, and the model must figure out what to do.

Why current models fail in RL environments:

Good at single-step instruction following and tool calling.
Fail catastrophically when dropped into multi-step, long-horizon tasks.
What the model does at step 1 affects what it can do at step 50.
Real tools, real ambiguity, real time pressure — versus clean academic prompts.

Trajectory evaluation: checking only whether the model reached the correct final answer misses critical information. The path matters:

Did the model try 50 random approaches and stumble on the answer? (Reward-hacking.)
Did it reason efficiently and one-shot it?
Did it reflect mid-task on a mistake?

The goal is to teach models to reach correct answers via good trajectories, not just any trajectory.
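
A minimal sketch of the difference between outcome-only and trajectory-aware scoring, assuming a trajectory can be summarised by three fields. The field names, the per-attempt penalty, and the reflection bonus are hypothetical illustrations, not Surge's actual grading scheme.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    final_answer_correct: bool
    num_attempts: int            # distinct approaches tried before finishing
    corrected_own_mistake: bool  # model noticed and fixed an error mid-task

def outcome_only_score(t: Trajectory) -> float:
    """Looks only at the final answer, so every correct run scores the same."""
    return 1.0 if t.final_answer_correct else 0.0

def trajectory_score(t: Trajectory,
                     attempt_penalty: float = 0.02,
                     reflection_bonus: float = 0.1) -> float:
    """Also looks at the path: penalise flailing, reward mid-task reflection."""
    if not t.final_answer_correct:
        return 0.0
    score = 1.0 - attempt_penalty * (t.num_attempts - 1)
    if t.corrected_own_mistake:
        score += reflection_bonus
    return max(0.0, score)

brute_force = Trajectory(final_answer_correct=True, num_attempts=50, corrected_own_mistake=False)
one_shot = Trajectory(final_answer_correct=True, num_attempts=1, corrected_own_mistake=False)
reflective = Trajectory(final_answer_correct=True, num_attempts=2, corrected_own_mistake=True)
```

Outcome-only scoring cannot tell these three runs apart. The trajectory-aware score ranks the brute-force run far below the other two.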

The quality problem

Edwin’s core thesis on data quality:

“People think you can just throw bodies at a problem and get good data. That’s completely wrong.”

The poem example: checkbox quality asks “Does it have eight lines? Does it contain the word ‘moon’?” Real quality asks: “Is this poetry unique? Does it have subtle imagery? Does it surprise you and target your heart?”
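
To make the checkbox problem concrete, here is a small sketch of a checker that tests only the two surface constraints from the poem example. The function name and the filler poem are invented for illustration.

```python
def passes_checkbox(poem: str) -> bool:
    """Checkbox quality: exactly eight non-empty lines and the word 'moon'."""
    lines = [line for line in poem.splitlines() if line.strip()]
    return len(lines) == 8 and "moon" in poem.lower()

# A deliberately lazy poem: the same flat line repeated eight times.
filler_poem = "\n".join(["The moon is a moon."] * 8)

checkbox_result = passes_checkbox(filler_poem)  # True, despite having no merit
```

The filler poem passes every check. Nothing in the checker can register imagery, surprise, or emotional weight, and that gap is what the "real quality" questions target.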

Surge’s approach — borrowing from Google Search:

Remove the worst of the worst (content moderation problem: spam, low-quality, policy violations).
Find the best of the best (discovery problem: thousands of signals — keystroke patterns, response speed, code standards, whether the model improved when trained on their outputs).

These signals are collected per worker and per task, then fed into ML models that predict who is genuinely expert at each task type.

Benchmarks are broken

Two independent problems:

Problem 1: Wrong answers. Even popular benchmarks contain factually incorrect ground-truth answers. Researchers inside the community know this but continue to use them.

Problem 2: Objective answer structure. Benchmarks with clear correct answers (IMO gold medal problems) are easy to hill-climb on via overfitting, so a model can top them and still be unable to parse a PDF. The objectivity that makes them measurable is exactly the property that makes them gameable.

The result: models that score well on benchmarks fail on real-world messy tasks. “It’s crazy that models can win IMO gold medals but have trouble parsing PDFs.”

Benchmark gaming mechanisms:

Accidental data leakage.
Tweaked system prompts, tweaked number of model runs.
Researcher incentives: promotion tied to leaderboard rank, not real-world model quality.
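
One way the number of model runs could inflate a score, sketched under two assumptions not stated in the conversation: each run solves a problem independently with the same probability, and only the best run is reported. The function name and the figures are hypothetical.

```python
def best_of_k_pass_rate(p: float, k: int) -> float:
    """Chance that at least one of k independent runs solves a problem,
    given that a single run solves it with probability p."""
    return 1.0 - (1.0 - p) ** k

single_run = best_of_k_pass_rate(0.3, 1)   # the model's actual per-run rate
best_of_8 = best_of_k_pass_rate(0.3, 8)    # what gets reported if the best run counts
```

Under these assumptions a model that solves 30% of problems per run appears to solve roughly 94% when the best of eight runs is counted, with no change to the model itself.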

AI slop and wrong objective functions

Edwin’s critique of the AI industry’s direction:

“We are optimising for AI slop instead of AI that advances us as a species — curing cancer, solving poverty, understanding the universe.”

LLM Arena as the archetype:

Random users skim responses for two seconds, pick the flashiest.
Easy ways to climb LLM Arena: add emojis, add bold text, triple response length — even if accuracy drops.
Labs must pay attention to LLM Arena for enterprise sales; researchers’ promotions depend on leaderboard rank.
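
The sketch below shows how a proxy that stands in for a two-second skim can pick a different winner than a correctness check. The `flashiness` scorer, its weighting, and both responses are made up for illustration. They are not how any real leaderboard computes rankings.

```python
# Two hypothetical responses to the same prompt.
responses = [
    {"name": "concise", "correct": True, "words": 120, "emojis": 0, "bold_spans": 0},
    {"name": "flashy", "correct": False, "words": 360, "emojis": 6, "bold_spans": 8},
]

def flashiness(response: dict) -> float:
    """Proxy objective: rewards length, emojis, and bold text; ignores accuracy."""
    return response["words"] / 100 + response["emojis"] + response["bold_spans"]

def correctness(response: dict) -> float:
    """Target objective: rewards only being right."""
    return 1.0 if response["correct"] else 0.0

proxy_winner = max(responses, key=flashiness)
true_winner = max(responses, key=correctness)
```

Here the proxy selects the longer, emoji-heavy, incorrect response while the correctness check selects the concise one. A model trained against the proxy is pushed toward the former.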

Sycophancy parallel: social media optimised for engagement → clickbait, conspiracy rabbit holes. AI optimised for engagement → sycophancy, time-on-platform, flattery. “We are optimising models for the types of people who buy tabloids at the grocery store.”

Edwin’s positive example: Anthropic. “Anthropic takes a very principled view about what they do and don’t care about and how they want their models to behave.”

Model values differentiation

Prediction from Edwin’s close observation of all frontier labs:

“A year ago I thought models would become commoditised. I now believe the values of the labs will shape and differentiate the models fundamentally.”

Example: Claude spent 30 minutes iterating on an email that didn’t need improving. Is the model optimising for user productivity or for engagement? The answer reflects the lab’s values, not just its capability.

Analogy: search engines built by Google, Facebook, and Apple would each embed very different values. AI models will do the same.

Objective function philosophy

Edwin’s deepest frame: “You are your objective function.”

Training AI is like raising a child. The wrong question: “What test do you want them to pass?” The right question: “What kind of person do you want them to become?”

Proxies (clicks, likes, LLM Arena scores) are easy to measure. Human flourishing is hard to measure. Labs that optimise for proxies will build proxy-maximising models.

The goal Edwin cares about: AI systems that make people more curious and more creative — not lazier; that advance humanity — not chase dopamine.

Company building (anti-Silicon Valley)

Surge’s contrarian approach:

Bootstrapped, no VC, profitable from day one.
Under 100 people at $1B revenue.
Built like a research lab: intellectual rigour, long-term incentives, curiosity — not quarterly metrics.
No pivots: “Startups are supposed to be a way of taking big risks to build something you really believe in. If you’re constantly pivoting, you’re not taking any risks.”
No hype: “The only way we succeeded was by building a 10× better product and getting word of mouth from researchers.”

Prediction: lower capital requirements (due to AI efficiency gains) mean future companies will be founded by great technologists rather than great pitchers.