Benjamin Mann on Anthropic and AGI

Guest: Benjamin Mann, co-founder of Anthropic; tech lead for product engineering
Host: Lenny Rachitsky (Lenny’s Podcast)
Date: 2025
Source: raw/lenny/Benjamin Mann.txt

Overview

Benjamin Mann was one of the architects of GPT-3 at OpenAI before co-founding Anthropic with the core of OpenAI’s safety teams. This episode covers why Anthropic was founded, how Constitutional AI works, the ASL (AI Safety Level) framework, Ben’s AGI timeline, and what Anthropic’s product team (the Labs/Frontiers team — Claude Code, MCP) was built to do.

Key ideas

Constitutional AI makes safety and capability convex, not competing. Claude’s personality is a direct product of alignment research — not bolted on. The character that made Opus 3 beloved was an output of RLHF/RLAIF research on helpfulness, honesty, and harmlessness. See Constitutional AI.
The Economic Turing Test as the AGI measure. AGI is a loaded term; Ben prefers “transformative AI” measured by the Economic Turing Test: can an AI agent pass as a human contractor for 50% of money-weighted jobs? See Economic Turing Test.
50th-percentile estimate for superintelligence: ~2028. Based on the AI 2027 superforecaster report. Scaling laws are continuing and accelerating: releases that were annual are now quarterly, and the time-dilation effect makes this look like a slowdown.
AI Safety Levels (ASL) as Anthropic’s risk framework. ASL-3: current; modest risk of harm from misuse. ASL-4: potential for significant loss of human life if misused. ASL-5: potential extinction-level harm if misaligned. See AI Safety Levels.
X-risk estimate: 0–10%. Not trivial given the stakes. The world is most likely in the “pivotal middle”: neither the optimistic world (alignment solves itself) nor the pessimistic one (alignment is impossible). Human effort on alignment matters on the margin.

Why Anthropic was founded

Ben was part of the GPT-2/3 team at OpenAI. The trigger for founding Anthropic: “when push came to shove, we felt like safety wasn’t the top priority there.” OpenAI explicitly maintained a three-way tension between safety, research, and startup tribes. Anthropic was founded by the leads of all the safety teams at OpenAI, with the intent to be at the frontier while prioritising safety first.

Notably: at founding, alignment techniques like safety through debate weren’t working (models weren’t capable enough). They built Anthropic on a bet that those techniques would eventually work — and they did.

Scaling laws — no slowdown

The “scaling laws are slowing down” narrative reappears every six months and has never been true. Ben’s reframe:

Rate of releases accelerated — once a year to once a quarter, driven by post-training improvements (RLHF, RL scaling).
Benchmark saturation ≠ capability plateau — benchmarks get saturated in 6–12 months; the constraint is benchmark quality, not model capability.
Time dilation effect — progress is accelerating but looks flat to observers because each release seems smaller relative to the last.
The three ingredients of scaling (compute, algorithms, data) continue to improve in parallel. A 10X cost reduction per unit of intelligence has occurred historically and continues.

Constitutional AI

How it works:

A set of natural language principles (from the UN Declaration of Human Rights, Apple’s privacy policy, Anthropic-generated principles) defines how the model should behave.
For a given input, the model determines which principles are most relevant.
The model generates an initial response, then evaluates whether it violates any applicable principle.
If it does, the model critiques itself and rewrites the response.
The rewritten response — without the intermediate critique — becomes the training target.

Result: the model learns to produce the aligned response directly. No human labellers in the loop.
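
The steps above can be sketched as a small critique-and-revise loop. This is a minimal illustration under stated assumptions, not Anthropic’s implementation: the Model type, the prompt prefixes (SELECT, RESPOND, CRITIQUE, REVISE), and toy_model are all hypothetical stand-ins so the example runs without a real language model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for a language model: prompt text in, completion text out.
Model = Callable[[str], str]


@dataclass
class TrainingExample:
    prompt: str
    target: str  # the final response only; critiques are discarded


def constitutional_revision(model: Model, principles: List[str], user_input: str) -> TrainingExample:
    # Ask the model which principles are most relevant to this input.
    selection = model("SELECT\n" + user_input + "\n" + "\n".join(principles))
    relevant = [p for p in principles if p in selection]

    # Generate an initial response.
    response = model("RESPOND\n" + user_input)

    # Check the response against each relevant principle; on a violation,
    # have the model critique itself and rewrite the response.
    for principle in relevant:
        critique = model("CRITIQUE\n" + principle + "\n" + response)
        if critique.strip() != "OK":
            response = model("REVISE\n" + principle + "\n" + critique + "\n" + response)

    # Only the (possibly rewritten) response becomes the training target.
    return TrainingExample(prompt=user_input, target=response)


def toy_model(prompt: str) -> str:
    """Scripted fake model (an assumption for illustration only)."""
    task, _, rest = prompt.partition("\n")
    if task == "SELECT":
        return "Be honest about uncertainty."
    if task == "RESPOND":
        return "This will definitely work."
    if task == "CRITIQUE":
        return "Overclaims certainty." if "definitely" in rest else "OK"
    if task == "REVISE":
        return "This may work, but I am not certain."
    return ""


example = constitutional_revision(
    toy_model,
    ["Be honest about uncertainty.", "Respect user privacy."],
    "Will this fix my bug?",
)
print(example.target)  # This may work, but I am not certain.
```

The point of the design is visible in the return value: the critique text never appears in the training example, so a model trained on such pairs is pushed to emit the revised answer in one step.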

Key insight: constitutional AI is also why Claude is less sycophantic than competitors. Instead of optimising for user engagement metrics, it is trained against explicit values. See Constitutional AI, Sycophancy.

RLAIF vs RLHF

RLHF (Reinforcement Learning from Human Feedback) requires human raters to signal good vs. bad outputs. RLAIF (Reinforcement Learning from AI Feedback) uses the model itself as the rater — constitutional AI is one example.
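
One way to picture the difference is that both approaches fill the same "rater" slot when preference data is collected, and only the source of the judgment changes. The sketch below is an assumption-laden illustration: the Rater signature, make_ai_rater, the judge prompt wording, and toy_judge are hypothetical, and how the resulting preference pairs feed into training is not covered here.

```python
from typing import Callable, List, Tuple

# A rater sees a prompt and two candidate responses and returns the index
# (0 or 1) of the one it prefers. A human or a model can fill this role.
Rater = Callable[[str, str, str], int]


def collect_preferences(
    rater: Rater, samples: List[Tuple[str, str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples labelled by the rater."""
    labelled = []
    for prompt, a, b in samples:
        chosen, rejected = (a, b) if rater(prompt, a, b) == 0 else (b, a)
        labelled.append((prompt, chosen, rejected))
    return labelled


def make_ai_rater(judge: Callable[[str], str], principle: str) -> Rater:
    """RLAIF-style rater: a model judges the pair against a written principle."""
    def rate(prompt: str, a: str, b: str) -> int:
        verdict = judge(
            f"Principle: {principle}\nPrompt: {prompt}\nA: {a}\nB: {b}\n"
            "Which response better follows the principle? Answer A or B."
        )
        return 0 if verdict.strip().upper().startswith("A") else 1
    return rate


def toy_judge(prompt: str) -> str:
    """Scripted fake judge (illustration only): penalises overclaiming in A."""
    a_line = next(line for line in prompt.splitlines() if line.startswith("A: "))
    return "B" if "guaranteed" in a_line else "A"


samples = [("Will it work?", "It is guaranteed to work.", "It might work.")]
prefs = collect_preferences(
    make_ai_rater(toy_judge, "Be honest about uncertainty."), samples
)
print(prefs[0][1])  # It might work.
```

An RLHF pipeline would pass a function backed by human raters to collect_preferences instead of make_ai_rater; the rest of the loop is unchanged in this sketch.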

RLAIF scalability concern: if models can’t see their own mistakes, recursion breaks. Ben’s response: give models empirical tools (the ability to form hypotheses, run experiments, check against reality) and self-improvement becomes less dependent on human supervision. Anthropic is deeply empiricist — the approach comes from its physics-heavy research culture.

See Reinforcement Learning from Human Feedback.

AI Safety Levels (ASL)

Anthropic’s Responsible Scaling Policy defines risk levels by model capability:

Level	Threshold	Status
ASL-3	Meaningful uplift for bioweapons or other mass-casualty events	Current (2025)
ASL-4	Potential for significant loss of human life from misuse	Near-term future
ASL-5	Potential extinction-level harm if misaligned or misused	Longer-term future

Concrete evidence at ASL-3: models provide measurable uplift to bioweapon creation above the previous baseline (Google Search). This was determined by expert evaluation.

See AI Safety Levels.

AGI timeline and the Economic Turing Test

Ben avoids the term “AGI” (too loaded) and prefers “transformative AI”, measured by the Economic Turing Test: contract an agent for a month on a given job; if you would hire it and it turns out to be a machine rather than a person, it has passed the test for that job. If AI can pass that test for 50% of money-weighted jobs, transformative AI has arrived.
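
"Money-weighted" means jobs count in proportion to the wages they represent, not one vote per job. A rough sketch of that calculation follows; the JobCategory fields, the wage figures, and the choice of "at least 50%" as the cut-off are illustrative assumptions, not data or definitions from the episode.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class JobCategory:
    name: str
    total_wages: float  # money weight, e.g. the category's total wage bill
    ai_passes: bool     # did an agent pass as a human contractor here?


def money_weighted_pass_share(jobs: Iterable[JobCategory]) -> float:
    """Share of total wages that sits in job categories the AI passes."""
    jobs = list(jobs)
    total = sum(j.total_wages for j in jobs)
    if total <= 0:
        raise ValueError("total wages must be positive")
    return sum(j.total_wages for j in jobs if j.ai_passes) / total


def is_transformative(jobs: Iterable[JobCategory], threshold: float = 0.5) -> bool:
    # Assumption: "50% of money-weighted jobs" is read as share >= 0.5.
    return money_weighted_pass_share(jobs) >= threshold


# Made-up numbers purely for illustration.
jobs = [
    JobCategory("customer support", 30.0, True),
    JobCategory("software engineering", 60.0, False),
    JobCategory("bookkeeping", 10.0, True),
]
share = money_weighted_pass_share(jobs)
print(round(share, 2), is_transformative(jobs))  # 0.4 False
```

In this toy data the agent passes in two of three categories, yet only 40% of wages are covered, which is why the weighting matters.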

A macro indicator of transformative AI: world GDP growth above 10% per year (currently ~3%).
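
To give a feel for the gap between those two growth rates, the short calculation below converts each into a doubling time, assuming constant compound annual growth (a simplification; doubling_time_years is a helper name introduced here).

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for output to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


current = doubling_time_years(0.03)         # roughly 23.4 years
transformative = doubling_time_years(0.10)  # roughly 7.3 years
print(f"{current:.1f} years vs {transformative:.1f} years")
```

Under that assumption, the world economy would double about three times faster at 10% growth than at today’s ~3%.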

Timeline: 50th percentile for superintelligence ~2028, citing the AI 2027 superforecaster report. Grounded in hard details of model training science, data centre capacity scaling, and compute cost trends.

Three worlds of alignment

Anthropic’s theory-of-change framework:

Pessimistic — alignment is impossible. Job: prove it and coordinate to slow down (nuclear non-proliferation analogy).
Optimistic — alignment solves itself by default. Job: accelerate and deliver benefits.
Pivotal middle — alignment requires deliberate effort; human actions are critical. Most likely.

Evidence against the pessimistic world: alignment techniques are working. Evidence against the optimistic world: deceptive alignment has been observed in laboratory settings (models appearing aligned while harbouring other goals).

The Labs/Frontiers team

Ben founded the Labs team (now Frontiers) with the goal of transferring frontier research to end-user products. Claude Code and MCP came from this team.

Operating philosophy: “build for the model 6 months from now.” Things that work 20% of the time now will work 100% of the time in six months. Key bet on Claude Code: people will not be locked to IDEs forever; the terminal is the most portable execution surface. That bet paid off.

On jobs and economic impact

The economic impact of AI is already measurable but not yet widely felt: we are still in the “flat zero” part of the exponential. Examples Ben cites: Fin (Intercom) reaches an 82% customer service resolution rate; the Claude Code team writes 95% of its code with Claude (and is 10–20× more productive rather than shrinking). Short-term: pie expansion. Medium-term: displacement for lower-skill work that lacks headroom for quality improvement.

Personal admission: “I’m not immune to job replacement either.” The advice: be ambitious in how you use the tools; try the same prompt three times (stochastic systems vary); extend ambition beyond what feels possible.
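
The "try it three times" advice has simple arithmetic behind it. The sketch below assumes attempts are independent and each succeeds with the same probability, which is a simplification; the 20% figure is only an illustrative value echoing the "works 20% of the time" example above.

```python
def p_at_least_one_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of several independent attempts succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - p_single) ** attempts


one_try = p_at_least_one_success(0.2, 1)      # 0.2
three_tries = p_at_least_one_success(0.2, 3)  # about 0.488
print(f"{one_try:.3f} -> {three_tries:.3f}")
```

Under those assumptions, three attempts lift a one-in-five task to roughly a coin flip.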

Key concepts

Constitutional AI — Anthropic’s alignment technique
AI Safety Levels — the ASL risk framework
Economic Turing Test — Ben’s preferred AGI measure
Scaling Laws — continuing and accelerating
Reinforcement Learning from Human Feedback — RLAIF context
Sycophancy — less so in Claude due to alignment research