World Models

World models are AI foundation systems that can create explorable spatial environments, reason within them, and enable agents to interact with them. They are distinguished from language models by operating in the spatial, physical, and 3D domain rather than the linguistic domain.

The concept was formalised and publicly advocated by Dr. Fei-Fei Li (TED talk 2024; World Labs founded 2024). Related work and terminology have emerged independently from Elon Musk, Jensen Huang, and Google DeepMind.

The spatial intelligence thesis

Language models are powerful, but intelligence is not only linguistic. Dr. Fei-Fei Li on world models:

“Humans are embodied agents. We navigate 3D space, manipulate objects, plan paths, reason about physical causality. Language is part of that, but language cannot get you to put down the fire.”

The argument:

Humans acquire the vast majority of knowledge through perceptual and spatial experience, not language.
Large language models capture the linguistic projection of that knowledge but not the underlying spatial structure.
World models provide the missing spatial foundation — for robots, for humans using AI tools, and for any system that must act in the physical world.

Definition

A world model:

Takes prompts (text, images, multiple images) as input.
Outputs genuinely 3D, navigable, interactive spatial environments.
Supports reasoning within the world (e.g. path planning for robots).
Supports interaction within the world (camera movement, object manipulation).
Supports creation of new worlds from imagination.
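
The reasoning capability can be illustrated with a toy example. The sketch below assumes the generated world has already been reduced to a 2D occupancy grid (a simplification for illustration; the source does not say how a world model represents space internally) and finds a shortest path with breadth-first search. The names `GRID` and `plan_path` are hypothetical.

```python
from collections import deque

# Hypothetical occupancy grid: 0 = free cell, 1 = obstacle.
GRID = [
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

def plan_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}           # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None                      # goal is unreachable

path = plan_path(GRID, (0, 0), (3, 3))
print(path)
```

Breadth-first search is used here only because it is the simplest planner that guarantees a shortest path on an unweighted grid; a robot planning in a full 3D environment would need a richer state space and cost model.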

Key distinction from video generation models: video generation produces a flat 2D temporal sequence. A world model produces a persistent 3D structure that can be navigated, not just watched. You can change the viewpoint, walk through it, export it as a mesh, or use it as a simulation environment.
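
To make the distinction concrete, the sketch below stores a scene as a fixed set of 3D points and projects it through a pinhole camera placed at two positions. It is an illustration only: the point-based `SCENE`, the `project` helper, and the focal length are assumptions, not a description of how any world model represents or renders space.

```python
# Hypothetical persistent scene: three points in world coordinates (x, y, z).
SCENE = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 10.0)]

def project(points, camera_pos, focal=1.0):
    """Pinhole projection for a camera at camera_pos looking along +z (no rotation)."""
    cx, cy, cz = camera_pos
    image = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz  # point in camera coordinates
        if dz <= 0:
            continue                          # behind the camera: not visible
        image.append((focal * dx / dz, focal * dy / dz))
    return image

# The same stored scene yields a different 2D image for each camera position.
view_a = project(SCENE, (0.0, 0.0, 0.0))
view_b = project(SCENE, (1.0, 0.0, 0.0))
print(view_a)
print(view_b)
```

Both images are derived from one stored structure, so any further viewpoint can be rendered on demand. A video, by contrast, contains only the frames that were generated.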

Relationship to language models

	Language Models	World Models
Input modality	Text (tokens)	Images, text, spatial prompts
Output	Text (tokens)	3D navigable environments
Training data alignment	Perfect (tokens → tokens)	Harder (need 3D-grounded data)
Intelligence dimension	Linguistic	Spatial, physical, perceptual
Primary application	Conversation, coding, reasoning	Robotics, design, VFX, simulation

They are complementary, not competing. A robot with a world model can plan in space; a robot also needs language to take instructions. See Pretraining for the common thread (big data + neural networks + GPUs).

Applications

Domain	Use case
Robotics	Synthetic training environments with 3D-grounded action data; sim-to-real transfer
VFX / virtual production	Prompt-to-3D-world scenes for film production (described as 40× faster than a traditional pipeline)
Games	Infinitely generatable game environments
Design and architecture	Instant spatial prototyping
Scientific discovery	Spatial reasoning support (analogy: Watson and Crick used spatial imagination to derive the DNA double helix from 2D X-ray data)
Therapy / psychology	Controlled immersive environments for exposure therapy and psychiatric research

Bitter lesson in robotics

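The Bitter Lesson (Sutton) holds that general methods which scale with data and computation ultimately outperform approaches built on hand-crafted domain knowledge; in the shorthand used here, simple models with large data outperform complex models with small data. This generalisation is harder to apply to robotics because of the objective–data alignment problem: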

Language models: train on tokens, output tokens. Objective matches training data perfectly.
Robots: must output actions in 3D physical environments. Web video data does not contain 3D-grounded actions. Teleoperation data and synthetic data partially bridge the gap.

World models are one path to generating the missing synthetic 3D action data — creating diverse simulated environments in which robots can be trained without requiring human teleoperation for each scenario.
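
As a toy illustration of that idea, the sketch below generates random grid environments and records (state, action, next state) triples from a random policy, so every sample carries an explicit action label of the kind web video lacks. Everything here is hypothetical (the grid world, the `ACTIONS` table, `make_env`, `step`, `rollout`); a real pipeline would use a 3D simulator and a far richer action space.

```python
import random

# Hypothetical discrete action set: name -> (row delta, col delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def make_env(size, obstacle_prob, rng):
    """Random square occupancy grid (True = obstacle); start cell (0, 0) stays free."""
    grid = [[rng.random() < obstacle_prob for _ in range(size)] for _ in range(size)]
    grid[0][0] = False
    return grid

def step(grid, state, action):
    """Apply an action; moves into walls or obstacles leave the state unchanged."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    if 0 <= r < len(grid) and 0 <= c < len(grid) and not grid[r][c]:
        return (r, c)
    return state

def rollout(grid, steps, rng):
    """Collect (state, action, next_state) triples from a random policy."""
    state, samples = (0, 0), []
    for _ in range(steps):
        action = rng.choice(list(ACTIONS))
        next_state = step(grid, state, action)
        samples.append((state, action, next_state))
        state = next_state
    return samples

rng = random.Random(0)  # fixed seed so the dataset is reproducible
dataset = []
for _ in range(3):      # three different synthetic environments
    dataset.extend(rollout(make_env(5, 0.2, rng), steps=10, rng=rng))
print(len(dataset), dataset[0])
```

The point of the sketch is the data shape: because the environment is simulated, the action that caused each transition is known exactly and no human teleoperation is needed to label it.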

Additional robotics constraint: physical hardware matures on a different timeline from software (cf. self-driving cars: roughly 20 years from the DARPA challenge to commercial Waymo deployment, and robots are harder than cars).

World Labs and Marble

World Labs was co-founded by Fei-Fei Li, Justin Johnson, Christoph Lassner, and Ben Mildenhall (all from AI/computer vision research). Their first product, Marble, is described as the world’s first generative model outputting genuinely 3D navigable worlds.

Key capabilities at launch:

Prompt-to-world (text or image input → navigable 3D scene).
Real-time video generation on a single H100 GPU.
Mesh export for use in games, VR, and simulation tools.
The animated-dots entry animation is an intentional UX feature, not model output.
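
For the mesh export capability listed above, the sketch below writes a minimal triangle mesh in the Wavefront OBJ text format, which many game and simulation tools can import. The source does not state which formats Marble exports, so OBJ is an assumption chosen for illustration, and `mesh_to_obj` is a hypothetical helper.

```python
def mesh_to_obj(vertices, faces):
    """Serialise a triangle mesh to Wavefront OBJ text.

    vertices: list of (x, y, z) floats.
    faces: list of vertex-index triples, 0-based (OBJ indices are 1-based).
    """
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines) + "\n"

# A single quad (two triangles) standing in for a generated floor patch.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
faces = [(0, 1, 2), (0, 2, 3)]

obj_text = mesh_to_obj(vertices, faces)
print(obj_text)
```

Saving `obj_text` to a `.obj` file gives a mesh that standard 3D tools can open. Normals, texture coordinates, and materials are omitted to keep the example short.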