World Models
World models are AI foundation systems that can create explorable spatial environments, reason within them, and enable agents to interact with them. They are distinguished from language models by operating in the spatial, physical, and 3D domain rather than the linguistic domain.
The concept was formalised and publicly advocated by Dr. Fei-Fei Li (TED talk 2024; World Labs founded 2024). Related work and terminology have emerged independently from Elon Musk, Jensen Huang, and Google DeepMind.
The spatial intelligence thesis
Language models are powerful, but intelligence is not only linguistic. Dr. Fei-Fei Li on world models:
“Humans are embodied agents. We navigate 3D space, manipulate objects, plan paths, reason about physical causality. Language is part of that, but language cannot get you to put down the fire.”
The argument:
- Humans acquire the vast majority of knowledge through perceptual and spatial experience, not language.
- Large language models capture the linguistic projection of that knowledge but not the underlying spatial structure.
- World models provide the missing spatial foundation — for robots, for humans using AI tools, and for any system that must act in the physical world.
Definition
A world model:
- Takes prompts (text, images, multiple images) as input.
- Outputs genuinely 3D, navigable, interactive spatial environments.
- Supports reasoning within the world (e.g. path planning for robots).
- Supports interaction within the world (camera movement, object manipulation).
- Supports creation of new worlds from imagination.
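The "reasoning within the world" item can be illustrated with a toy path planner. This is a minimal sketch, assuming the generated world has already been reduced to a 2D occupancy grid. The grid, the `plan_path` function, and the use of breadth-first search are illustrative assumptions, not how any particular world model plans.

```python
from collections import deque


def plan_path(grid, start, goal):
    """Breadth-first search over an occupancy grid (1 = obstacle, 0 = free).

    Returns the shortest list of (row, col) cells from start to goal,
    or None if the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


# Hypothetical floor plan: a wall with a single gap in the third column.
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = plan_path(grid, (0, 0), (2, 0))
print(path)
```

The planner only works because the environment is a persistent structure that can be queried cell by cell; a sequence of rendered frames offers nothing equivalent to search over.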
Key distinction from video generation models: video generation produces a flat 2D temporal sequence. A world model produces a persistent 3D structure that can be navigated, not just watched. You can change the viewpoint, walk through it, export it as a mesh, or use it as a simulation environment.
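The difference between a persistent 3D structure and a fixed frame sequence can be made concrete with a small projection example. This is a minimal sketch, assuming a scene stored as a handful of 3D points and an idealised pinhole camera. The `project` function, the yaw-only rotation, and the sample points are assumptions for illustration only.

```python
import math


def project(points, cam_pos, yaw, focal=1.0):
    """Project 3D points into a pinhole camera.

    The camera sits at cam_pos and looks along +z when yaw is 0;
    yaw rotates it about the vertical (y) axis. Points behind the
    camera are skipped.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    pixels = []
    for x, y, z in points:
        # Express the point relative to the camera position.
        dx, dy, dz = x - cam_pos[0], y - cam_pos[1], z - cam_pos[2]
        # Rotate into the camera frame (right axis and forward axis).
        cx = cos_y * dx - sin_y * dz
        cz = sin_y * dx + cos_y * dz
        if cz <= 0:
            continue
        pixels.append((focal * cx / cz, focal * dy / cz))
    return pixels


# One stored scene: the origin plus one point along each axis.
points = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (0, 1, 0)]

# The same scene rendered from two freely chosen viewpoints.
front = project(points, cam_pos=(0, 0, -5), yaw=0.0)
side = project(points, cam_pos=(5, 0, 0), yaw=-math.pi / 2)
print(front)
print(side)
```

Because the scene is stored in 3D, any new camera pose yields a consistent image. A video model would have to generate new frames for each viewpoint, with no guarantee that they agree with one another.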
Relationship to language models
| | Language Models | World Models |
|---|---|---|
| Input modality | Text (tokens) | Images, text, spatial prompts |
| Output | Text (tokens) | 3D navigable environments |
| Training data alignment | Perfect (tokens → tokens) | Harder (need 3D-grounded data) |
| Intelligence dimension | Linguistic | Spatial, physical, perceptual |
| Primary application | Conversation, coding, reasoning | Robotics, design, VFX, simulation |
They are complementary, not competing. A robot with a world model can plan in space; a robot also needs language to take instructions. See Pretraining for the common thread (big data + neural networks + GPUs).
Applications
| Domain | Use case |
|---|---|
| Robotics | Synthetic training environments with 3D-grounded action data; sim-to-real transfer |
| VFX / virtual production | Prompt-to-3D-world scenes for film production (40× faster than a traditional pipeline) |
| Games | Infinitely generatable game environments |
| Design and architecture | Instant spatial prototyping |
| Scientific discovery | Spatial reasoning support (analogy: Watson and Crick used spatial imagination to derive the DNA double helix from 2D X-ray data) |
| Therapy / psychology | Controlled immersive environments for exposure therapy and psychiatric research |
Bitter lesson in robotics
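The Bitter Lesson (Sutton) holds that general methods that scale with computation and data tend to outperform approaches built on hand-engineered knowledge; in practice, simple models trained on large data beat complex models trained on small data. This generalisation is harder to apply to robotics because of the objective–data alignment problem: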
- Language models: train on tokens, output tokens. Objective matches training data perfectly.
- Robots: must output actions in 3D physical environments. Web video data does not contain 3D-grounded actions. Teleoperation data and synthetic data partially bridge the gap.
World models are one path to generating the missing synthetic 3D action data — creating diverse simulated environments in which robots can be trained without requiring human teleoperation for each scenario.
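The idea of producing action data from many simulated environments, without teleoperation, can be sketched as a toy data-generation loop. This is a minimal sketch under strong assumptions: `sample_world` is a hypothetical stand-in for a world model (it only randomises a start and goal on an empty grid), and the "expert" is a hand-written scripted policy rather than a learned one.

```python
import random

# Discrete actions and their (row, col) displacement.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def sample_world(rng, size=5):
    """Hypothetical stand-in for a generated environment:
    a random start cell and a different goal cell on an empty grid."""
    cells = [(r, c) for r in range(size) for c in range(size)]
    start, goal = rng.sample(cells, 2)
    return start, goal


def scripted_action(pos, goal):
    """Scripted expert: close the row gap first, then the column gap."""
    if pos[0] < goal[0]:
        return "down"
    if pos[0] > goal[0]:
        return "up"
    if pos[1] < goal[1]:
        return "right"
    return "left"


def rollout(start, goal):
    """Run the scripted expert and record one sample per step."""
    pos, episode = start, []
    while pos != goal:
        action = scripted_action(pos, goal)
        episode.append({"position": pos, "goal": goal, "action": action})
        dr, dc = ACTIONS[action]
        pos = (pos[0] + dr, pos[1] + dc)
    return episode


def generate_dataset(num_worlds, seed=0):
    """Collect action-labelled samples across many randomised worlds."""
    rng = random.Random(seed)
    dataset = []
    for _ in range(num_worlds):
        start, goal = sample_world(rng)
        dataset.extend(rollout(start, goal))
    return dataset


dataset = generate_dataset(100)
print(len(dataset), dataset[0])
```

Every sample pairs a spatial state with the action taken in it, which is the kind of 3D-grounded label that web video lacks. In a real pipeline the environments would be far richer and the actions continuous, but the loop has the same shape.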
An additional constraint in robotics: physical hardware matures on a different timeline from software (cf. self-driving cars, which took roughly 20 years from the DARPA Grand Challenge to commercial Waymo deployment, and robots are harder than cars).
World Labs and Marble
World Labs was co-founded by Fei-Fei Li, Justin Johnson, Christoph Lassner, and Ben Mildenhall (all from AI/computer vision research). Their first product, Marble, is described as the world’s first generative model outputting genuinely 3D navigable worlds.
Key capabilities at launch:
- Prompt-to-world (text or image input → navigable 3D scene).
- Real-time video generation on a single H100 GPU.
- Mesh export for use in games, VR, and simulation tools.
- The animated-dots entry animation is an intentional UX feature, not model output.