Kevin Weil on OpenAI and the Future of AI Products

transcript openai evals model-training product-strategy ai-product scaling lenny-podcast

Guest: Kevin Weil — Chief Product Officer at OpenAI; previously head of product at Instagram and Twitter.
Host: Lenny Rachitsky
Source: Lenny’s Podcast.


Overview

Kevin Weil offers a view from inside OpenAI on how the company operates, why evals are the core product-building skill, the model maximalism philosophy, the iterative deployment approach, and where AI is heading. This is one of the clearest articulations in the wiki of how building on probabilistic models differs from building deterministic software.


Key ideas

  1. The worst model you’ll ever use. The AI models available today are the worst you will ever use for the rest of your life. Models are improving at ~10x/year — a pace far steeper than Moore’s Law.
  2. Model maximalism. Don’t build scaffolding around model limitations. Build at the edge of capability and trust the model to improve. Products that barely work now will sing in two months.
  3. Evals determine the right product design. If a model gets something right 60% of the time vs 95% vs 99.5%, you build entirely different products. You must know your model’s reliability on your specific use case.
  4. Chat is a universal interface. Chat fits all intelligence levels — you can talk to a very smart person or a less smart person the same way. As models improve, chat remains the right baseline interface for anything open-ended.
  5. Fine-tuning is the future. Most AI products will be built on fine-tuned models, not baseline APIs. Every product team will need quasi-researcher/ML types to tune models for specific use cases.

Deterministic vs. probabilistic product building

“Your database works once, it works every time. And that’s not true in this world.”

Traditional software: defined inputs → defined outputs. If you run the same function three times, you get the same result. Product design follows from this certainty.

LLMs: fuzzy inputs → approximate outputs. Same question yields spiritually similar but not identical answers. Reliability varies by use case.

The product design implication: you need to know your model’s reliability for your specific use case. 60% reliable → you need a very different product (perhaps human-in-the-loop, narrow use cases, explicit error handling). 99.5% reliable → you can build confidently. The transition from “barely works” to “really works” can happen in two to three months as models improve.
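A minimal sketch of how a measured pass rate might drive the product decision described above. The 60% / 95% / 99.5% cut-offs come from the talk; the `EvalSummary` type, the function name, and the posture labels are illustrative assumptions, not anything OpenAI has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Outcome of running an eval suite on one specific use case."""
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        if self.total == 0:
            raise ValueError("eval suite is empty")
        return self.passed / self.total


def suggest_product_mode(summary: EvalSummary) -> str:
    """Map a measured pass rate to a rough product posture.

    Cut-offs reuse the figures quoted in the talk; the labels are
    hypothetical and should be tuned per use case.
    """
    rate = summary.pass_rate
    if rate >= 0.995:
        return "autonomous"           # build confidently on the output
    if rate >= 0.95:
        return "review-on-exception"  # surface errors and let users fix them
    if rate >= 0.60:
        return "human-in-the-loop"    # a person approves each result
    return "not-ready"                # narrow the use case or wait for a better model


print(suggest_product_mode(EvalSummary(passed=61, total=100)))
```

The point of the sketch is that the input is a pass rate measured on your own use case, not a general benchmark score.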


Model maximalism

“Our general mindset is in two months, there’s going to be a better model and it’s going to blow away whatever the current set of limitations are.”

Kevin’s advice to developers building at the edge of model capability: keep going. The model will catch up. Products that barely work now will work well soon.

The practical implication: don’t invest engineering time building scaffolding around current model limitations. The scaffolding will become unnecessary. Instead, invest in the product design, the evals, and the user experience — these compound.

Applied internally at OpenAI: for most model failures, the investment is not in building guardrails but in improving the model. Only for the subset of failures that would be catastrophic (safety-critical) is scaffolding justified.


Iterative deployment

“We co-evolve together with the rest of society as we learn about these things.”

OpenAI’s product philosophy: ship early, ship often, iterate in public. Don’t hold a breakthrough and bestow it on the world — release it and learn together. Society and the company evolve together.

This has produced some missteps (poor model naming being the most visible), but Kevin’s view is that moving fast and iterating is more important than getting it perfect upfront. The planning cadence is quarterly but the actual roadmap is thrown out halfway through when new things are learned.


Evals as the core product skill

Kevin’s framing extends the wiki’s existing evals discussion significantly:

AI is capped by the quality of its evals. Most of the world’s knowledge — company processes, domain expertise, proprietary data — is not in model training sets. Models can learn anything, but they need the data. Custom evals are the mechanism for teaching models company- or use case-specific behaviour.

Deep Research as a case study. As Kevin’s team built the Deep Research feature (25-minute research sessions, 20-page outputs), they designed evals simultaneously: “Here’s a question. Here’s an amazing answer for that question.” Those evals became the target for fine-tuning and the benchmark for measuring progress. When performance on evals rose, they knew they had a product.

Evals → fine-tuning loop. Evals are not just tests; they are training signals. Write evals for the use case → gather data → fine-tune model → measure → iterate. This is the tight research-product integration Kevin is building at OpenAI.
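The "question plus amazing answer" format can be sketched as a tiny eval harness. This is an assumption-laden illustration: `EvalCase`, `run_evals`, and the word-overlap grader are hypothetical names, the model is any callable from prompt to text, and a real grader would be far more careful (human review or a model-based rubric, for example).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    reference_answer: str  # the "amazing answer" written by an expert


def keyword_overlap(candidate: str, reference: str) -> float:
    """Toy grader: fraction of reference words that appear in the candidate."""
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    cand_words = set(candidate.lower().split())
    return len(ref_words & cand_words) / len(ref_words)


def run_evals(
    model: Callable[[str], str],
    cases: list[EvalCase],
    grade: Callable[[str, str], float] = keyword_overlap,
    pass_score: float = 0.8,
) -> tuple[float, list[EvalCase]]:
    """Return the pass rate and the failing cases.

    Failing cases are the natural candidates for new training data
    in the next fine-tuning round.
    """
    if not cases:
        raise ValueError("no eval cases supplied")
    failures = [
        case for case in cases
        if grade(model(case.question), case.reference_answer) < pass_score
    ]
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures
```

Tracking the returned pass rate across model versions gives the "measure → iterate" step a concrete number.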

Poor man’s fine-tuning. For teams without compute/researchers to fine-tune: include examples in the prompt (few-shot). “Here’s a problem, here’s a good answer. Here’s a problem, here’s a good answer. Now solve this.” More effective than a bare prompt; not as good as a full fine-tune.
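A minimal sketch of the few-shot pattern quoted above. The function name and the "Problem:" / "Good answer:" labels are assumptions chosen to mirror the quote; the resulting string would be sent to whichever model API the team uses.

```python
def build_few_shot_prompt(examples: list[tuple[str, str]], new_problem: str) -> str:
    """Assemble worked examples followed by the unsolved problem."""
    parts = [
        f"Problem: {problem}\nGood answer: {answer}"
        for problem, answer in examples
    ]
    # Leave the final answer slot open for the model to complete.
    parts.append(f"Problem: {new_problem}\nGood answer:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    [
        ("Name a budgeting app.", "Pennywise"),
        ("Name a hiking app.", "Trailhead"),
    ],
    "Name a recipe app.",
)
print(prompt)
```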


Fine-tuning is the future

“I really have no question that that’s the future. Models are going to be everywhere just like transistors are everywhere.”

Kevin’s prediction: every product team will eventually have quasi-researcher/ML-engineer types. Fine-tuning a model for a specific use case will become a standard part of the product development workflow, not an advanced specialisation.

OpenAI uses this internally:

  • Customer support: fine-tuned model trained on knowledge base + human corrections; handles most tickets automatically.
  • Different model sizes for different latency/cost requirements (o-series for reasoning tasks; GPT-4o mini for quick checks).
  • Ensembles of models for complex problems: break the problem into specific subtasks, use a specialised model for each, integrate the outputs.

The company-as-ensemble analogy. A company is arguably an ensemble of models that have all been fine-tuned (by their education and careers) to have different skills, combined in different configurations. The ensemble output is much better than any individual contributor.
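The ensemble idea (split the problem, give each subtask to a specialised model, integrate the outputs) can be sketched as follows. Everything here is hypothetical: `run_ensemble`, the `split` and `integrate` callables, and the specialist names are placeholders, and each "model" is just a function from prompt to text.

```python
from typing import Callable

Model = Callable[[str], str]


def run_ensemble(
    task: str,
    split: Callable[[str], dict[str, str]],
    specialists: dict[str, Model],
    integrate: Callable[[dict[str, str]], str],
) -> str:
    """Split a task, route each subtask to its specialist, then merge."""
    outputs: dict[str, str] = {}
    for name, subtask in split(task).items():
        if name not in specialists:
            raise KeyError(f"no specialist registered for subtask {name!r}")
        outputs[name] = specialists[name](subtask)
    return integrate(outputs)


# Stub specialists standing in for fine-tuned models.
result = run_ensemble(
    "pricing",
    split=lambda task: {"research": task, "summary": task},
    specialists={
        "research": lambda t: f"[facts about {t}]",
        "summary": lambda t: f"[short take on {t}]",
    },
    integrate=lambda outs: " ".join(outs[key] for key in sorted(outs)),
)
print(result)
```

In practice the integration step might itself be a model call; a plain join keeps the sketch self-contained.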


Chat as the universal interface

Kevin’s defence of chat against critics who predict something better will replace it:

The argument: chat is flexible. An unstructured communication channel can express anything. Every other interface is a restriction — you can only do certain things with it. Chat is the catch-all for everything you’d ever want to express.

For high-volume, prescribed use cases: a more structured, faster, task-specific interface is often better. Build those where appropriate.

But for general-purpose interaction: chat plus speech is the right baseline. “You still want that very open-ended, flexible communication medium.”

The LLM breakthrough made chat viable by finally being good enough at understanding “all of the complexity and nuances of human speech.”


The human analogy as product-design tool

Kevin’s counterintuitive finding: when you reason about how to design AI product experiences, thinking about how a human would behave in the same situation often produces the right answer.

Examples:

  • Reasoning model wait time: 20 seconds is too long to stare at a spinner but too short to go do something else. What would a human do? Give little updates, maybe say “that’s a good question” while thinking. OpenAI shipped exactly that: periodic one-to-two-sentence summaries of the reasoning process.
  • Model ensemble: having several models tackle a problem and then integrating their outputs resembles brainstorming, which works well for humans.
  • Framing a prompt: “You are the world’s greatest brand marketer. Now answer this naming question.” Framing shifts the model into a mindset, just as it shifts a human colleague’s mindset.

The caveat: “maybe not always the answer is to think about how a person would do it, but sometimes to gain intuition for how you might solve a problem, you think about what an equivalent human would do.”


OpenAI operating model

Team structure:

  • ~25 PMs (deliberately PM-light).
  • Bottom-up; teams have high autonomy; no launch reviews blocking shipping.
  • Engineering, design, product, and research working together as single teams on every product.
  • Product-focused, high-agency engineers who can make product decisions without PM involvement.

Why PM-light?

“Too many PMs causes problems. We’ll fill the world with decks and ideas versus execution.”

PMs at OpenAI must be high-agency, comfortable with ambiguity, decisive (able to call the answer when no one else will), and capable of leading through influence with research teams.

Quarterly roadmapping: valuable not because the plan will be followed, but because it creates a moment for reflection (what worked, what didn’t), a dependency check, and directional alignment. The roadmap will be mostly thrown out halfway through as new model capabilities arrive.


“Worst model you’ll ever use”

“The AI models that you’re using today is the worst AI model you will ever use for the rest of your life.”

Quantification: GPT-4o mini costs ~100x less than GPT-3.5 did at comparable capability. API costs have dropped two orders of magnitude in two years, while intelligence has increased dramatically. Models are improving on all dimensions simultaneously: smarter, faster, cheaper, safer (fewer hallucinations).

The rate is approximately 10x per year (100x over two years), far steeper than Moore’s Law (2x roughly every 18 months).
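To put the two rates on the same footing, both can be converted to a per-year factor. This is a small arithmetic sketch; `annual_factor` is a helper name introduced here, and the inputs are simply the figures quoted above.

```python
def annual_factor(multiple: float, period_months: float) -> float:
    """Convert 'multiple x every period_months' into a per-year factor."""
    return multiple ** (12 / period_months)


moore = annual_factor(2, 18)   # 2x every 18 months
ai = annual_factor(10, 12)     # 10x every 12 months, as quoted

print(f"Moore's Law: {moore:.2f}x per year, {moore ** 2:.2f}x over two years")
print(f"Quoted AI rate: {ai:.0f}x per year, {ai ** 2:.0f}x over two years")
```

Moore’s Law works out to about 1.59x per year, or roughly 2.5x over two years, against 100x over two years at the quoted rate. That is consistent with the two-orders-of-magnitude cost drop cited above.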

Implication: today’s limitations are temporary. Build for the product and the use case; don’t optimise around limitations that will be gone in three months.


See also