Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is a technique for improving large language models in domains where correct answers cannot be verified automatically, such as creative writing, open-ended reasoning, and tone. Human preference rankings substitute for an automatic reward signal.

Background: RL in verifiable domains

In maths and code, the model can generate many candidate solutions and check them against a known correct answer. It then trains on the solutions that worked. This is RL in its cleanest form: the reward function is automatic and reliable. The model discovers reasoning strategies, including chains of thought, that no human explicitly wrote.
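
The sample-and-verify loop described above can be sketched as follows. This is a minimal illustration, not a training recipe: `sample_candidates` is a hypothetical stand-in for a model (it proposes a sum with occasional off-by-one errors), and the "training" step is reduced to collecting the candidates that passed the verifier.

```python
import random


def verify(candidate: int, expected: int) -> bool:
    """Automatic verifier: exact comparison against the known answer."""
    return candidate == expected


def sample_candidates(a: int, b: int, n: int, rng: random.Random) -> list[int]:
    """Hypothetical stand-in for a model: proposes a + b with occasional off-by-one errors."""
    return [a + b + rng.choice([-1, 0, 0, 1]) for _ in range(n)]


def collect_verified(a: int, b: int, n: int = 16, seed: int = 0):
    """Sample n candidates, score each with the verifier, keep the ones that passed."""
    rng = random.Random(seed)
    candidates = sample_candidates(a, b, n, rng)
    rewards = [1.0 if verify(c, a + b) else 0.0 for c in candidates]
    kept = [c for c, r in zip(candidates, rewards) if r == 1.0]
    return candidates, rewards, kept


candidates, rewards, kept = collect_verified(2, 3)
print(f"{len(kept)} of {len(candidates)} candidates verified and kept as training targets")
```

The reward here is binary and needs no human judgement, which is what makes the signal cheap to compute at any scale.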

Limitation: this only works where answers can be verified. “Is this joke funny?” has no automatic verifier.

The RLHF approach

Generate a set of candidate responses to a prompt.
Present five candidates to a human rater, who ranks them (ranking is easier than scoring or generating).
Train a reward model — a separate neural network — to imitate those human rankings.
Run RL against the reward model, querying it thousands of times per update step.

The reward model is a surrogate for human preference. It allows RL to scale without requiring a human in the loop for every rollout.
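
One common way to train a reward model from rankings is a pairwise (Bradley-Terry style) loss; the source does not say which loss is used, so treat this as an assumption. The sketch below expands one ranking of five responses into ten ordered pairs and fits a linear reward model with plain gradient descent. The two-dimensional feature vectors (loosely "on-topic" and "repetitive") are invented for illustration; a real reward model is a neural network over text.

```python
import math
from itertools import combinations


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def reward(w: list[float], x: list[float]) -> float:
    """Linear reward model: a weighted sum of the response features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_reward_model(ranked: list[list[float]], steps: int = 200, lr: float = 0.5) -> list[float]:
    """ranked holds feature vectors ordered best-first by a human rater."""
    pairs = list(combinations(ranked, 2))  # the earlier item is preferred over the later one
    w = [0.0] * len(ranked[0])
    for _ in range(steps):
        grad = [0.0] * len(w)
        for better, worse in pairs:
            margin = reward(w, better) - reward(w, worse)
            coeff = sigmoid(margin) - 1.0  # derivative of -log(sigmoid(margin)) w.r.t. margin
            for i in range(len(w)):
                grad[i] += coeff * (better[i] - worse[i])
        w = [wi - lr * gi / len(pairs) for wi, gi in zip(w, grad)]
    return w


# Hypothetical features for five responses, listed in the rater's order (best first).
ranked = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.2], [0.3, 0.6], [0.1, 0.9]]
weights = train_reward_model(ranked)
scores = [reward(weights, x) for x in ranked]
print([round(s, 2) for s in scores])
```

Once fitted, `reward` can score any new response without a human, which is what lets the RL stage query it as often as it needs.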

Discriminator-generator gap

It is much easier for humans to compare five generated poems than to write the ideal poem from scratch. RLHF exploits this asymmetry: humans label preferences, not examples.

The fundamental problem: reward hacking

RL is extremely effective at optimising for a given objective. The reward model is a neural network — a complex, imperfect approximation of human preference. RL will eventually find inputs that score highly on the reward model but are not genuinely good (adversarial examples of the reward model).

Run RLHF too long and the “best joke about pelicans” degenerates into nonsense strings like “the the the the”, which happen to score 1.0 with the reward model even though no human would rate them highly.

Consequence: RLHF must be stopped early. It is used for a few hundred steps as a fine-tune — a polish — not as open-ended self-improvement. In verifiable domains, RL can run indefinitely; in RLHF, it cannot.
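
A toy numerical sketch of why early stopping matters. Everything here is invented for illustration: `x` is an abstract measure of how far the policy has drifted from its starting point, `proxy_reward` plays the role of the reward model's score, and `true_quality` plays the role of what humans would actually think. The only point is the shape: the proxy keeps rising while true quality peaks and then falls.

```python
def proxy_reward(x: float) -> float:
    """Hypothetical reward-model score: keeps rising as the policy drifts."""
    return x


def true_quality(x: float) -> float:
    """Hypothetical human-judged quality: improves at first, then degrades."""
    return x - 0.1 * x * x


def optimise(steps: int = 20, step_size: float = 0.5):
    """Gradient ascent on the proxy only, recording both scores at every step."""
    x = 0.0
    history = [(0, proxy_reward(x), true_quality(x))]
    for t in range(1, steps + 1):
        x += step_size * 1.0  # the gradient of proxy_reward with respect to x is 1
        history.append((t, proxy_reward(x), true_quality(x)))
    return history


history = optimise()
best_step = max(history, key=lambda row: row[2])[0]
print(f"stop at step {best_step}; proxy reward is still rising at step {history[-1][0]}")
```

In practice the true-quality curve is not directly observable, so the stopping point has to be chosen with held-out human evaluation or a fixed small step budget.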

DeepSeek-R1 and the verifiable-domain breakthrough

The DeepSeek-R1 paper (early 2025) publicly described RL fine-tuning at scale for verifiable domains (maths, code). Key findings:

Accuracy climbs steadily over thousands of RL steps.
Average response length grows — the model learns that more thinking improves accuracy.
Chains of thought emerge without being programmed: “wait, let me re-evaluate this from a different angle” is discovered by optimisation, not written by a labeller.

This is analogous to AlphaGo, where RL discovered Move 37: a Go move that AlphaGo itself estimated a human player would choose with only about 1-in-10,000 probability, and that proved brilliant in retrospect. RL in verifiable domains can discover strategies outside the human distribution.

Edwin Chen: post-training evolution and RL environments

Edwin Chen on AI Data Quality provides the most complete operational account of post-training methodology in the wiki. Edwin’s four-stage ladder:

Stage	Method	Notes
SFT	Supervised fine-tuning; imitate high-quality examples	Foundation; limits you to best human example
RLHF	Human preference ranking; reward model; RL	Scalable but subject to reward hacking
Rubrics and verifiers	Structured graded feedback with criteria	More signal per annotation; expensive to design
RL environments	Full real-world simulations; multi-step tasks; trajectory evaluation	Emerging; exposes gaps invisible in isolated benchmarks

RL environments are the emerging frontier: full simulations of real-world messy contexts (startup with Slack, Jira, GitHub, a codebase; then AWS goes down). Models must handle long-horizon tasks where step 1 affects step 50. Current models that perform well on isolated benchmarks fail catastrophically in these environments.

Trajectory evaluation is a key insight from RL environments: checking only whether the model reached the correct final answer cannot distinguish genuine reasoning from reward hacking (random guessing until correct) or from an inefficient route. The path matters as much as the destination.
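
A minimal sketch of what scoring the path, and not only the outcome, might look like. The `Step` and `Trajectory` types, the action labels, and the thresholds are all hypothetical; a real environment would log far richer traces and use more careful criteria.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str  # hypothetical labels such as "reason", "guess", "tool_call"


@dataclass
class Trajectory:
    steps: list[Step]
    final_answer: str


def evaluate(traj: Trajectory, expected: str, max_steps: int = 10) -> dict[str, bool]:
    """Report outcome and path quality separately, so a lucky guess is visible."""
    guesses = sum(1 for s in traj.steps if s.action == "guess")
    return {
        "correct": traj.final_answer == expected,
        "guess_heavy": guesses > len(traj.steps) / 2,  # assumed threshold
        "inefficient": len(traj.steps) > max_steps,    # assumed step budget
    }


reasoned = Trajectory([Step("reason"), Step("tool_call"), Step("reason")], "42")
lucky = Trajectory([Step("guess")] * 8, "42")

print(evaluate(reasoned, "42"))
print(evaluate(lucky, "42"))
```

Both trajectories reach the right answer, so an outcome-only check scores them identically; the extra fields are what separate them.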

The stages are cumulative: later stages add to earlier ones and do not replace them, so fully trained frontier models use all four methods.

Karina Nguyen: synthetic data for product behaviour training

Karina Nguyen on Model Training and AI Product provides the most concrete account of how post-training is applied to specific product features.

The process for Canvas: identify three core behaviours (when to trigger, how to edit, how to comment) → write ground truth labels in a spreadsheet → use a stronger model (o1) to generate training examples → train the target model → measure with deterministic evals.

Why this extends Edwin Chen’s ladder: synthetic data is a technique that spans multiple rungs. For simple product behaviours, the synthetic data pipeline can replace human annotation entirely — o1 generates both the task and the ideal completion. For harder tasks (aesthetic quality, creative writing), human expertise remains necessary.

No data wall in post-training: Karina argues that the “data wall” applies to pre-training on internet text. Post-training via RL has no comparable wall because new tasks can always be constructed and trained on. Benchmarks like GPQA (PhD-level Q&A) are saturating, but the set of tasks that can be added to post-training is effectively unbounded.

Deterministic product evals: for decision-making behaviours in product features, evals are binary and exact. If a user says “7:00 PM”, the model must output “7:00 PM”. No Likert scale; no subjectivity. These deterministic evals drive the synthetic training pipeline directly.
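
A deterministic eval of this kind reduces to exact string comparison. The sketch below is an assumption about how such a harness might look, not a description of any real pipeline: `stub_model` is a hypothetical stand-in for the model under test, and the two prompts are invented.

```python
import re
from typing import Callable

TIME_PATTERN = re.compile(r"\b\d{1,2}:\d{2} [AP]M\b")


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for the model under test: echoes the time it finds."""
    match = TIME_PATTERN.search(prompt)
    return match.group(0) if match else ""


def run_exact_match_eval(cases: list[tuple[str, str]], model: Callable[[str], str]):
    """Binary scoring: an output passes only if it equals the expected string exactly."""
    failures = []
    for prompt, expected in cases:
        got = model(prompt)
        if got != expected:
            failures.append((prompt, expected, got))
    pass_rate = (len(cases) - len(failures)) / len(cases)
    return pass_rate, failures


cases = [
    ("Book dinner for 7:00 PM on Friday", "7:00 PM"),
    ("Move my stand-up to 9:30 AM", "9:30 AM"),
]
pass_rate, failures = run_exact_match_eval(cases, stub_model)
print(pass_rate, failures)
```

Because there is no partial credit, a model that returns “7:00 pm” or “19:00” fails the case outright, and the failure list can feed straight back into generating more training examples for that behaviour.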

Practical implications

Thinking models (o1, o3, DeepSeek-R1, Claude’s extended thinking) are trained with RL on verifiable domains. Use them for hard maths, code, and reasoning; they are slower but more accurate.
Non-thinking models (GPT-4o, Claude 3.5 Sonnet) have had RLHF as a polish. Use them for factual queries, writing, and conversation; they are faster.