Can AI Learn Without a Body?
LLMs learn from what we say about the world. World models learn from the world itself. The difference might be everything - or nothing at all.
Drop a glass. It falls. It shatters.
You knew that before it happened. But you didn’t think it in words. No internal monologue calculating trajectories. You just… knew. A simulation running somewhere below language.
Infants have this before they can speak. Object permanence. Intuitive physics. They understand something about the world, but they can’t tell you what.
You do this constantly. Your hand shapes itself before it touches the cup. You know the box is heavy before you lift it.
Understanding that lives below language. Translated into words only when someone asks.
LLMs don’t have this.
They predict text, brilliantly. But text is already one level removed. When you describe the glass shattering, you’re translating from that pre-verbal model into language. The LLM only sees the translation.
It learns patterns in how people talk about physics. Never the physics itself.
World models are an attempt to build the other thing. The pre-verbal part. AI that learns to predict what happens, not what people say about it.
Why This Matters
Text is lossy compression of reality.
Four hours of video contains roughly the information of 12,000 books. Not because video is a better medium, but because the physical world is dense with information that language skips over.
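A rough back-of-envelope makes that comparison concrete. Every constant here is an assumption (about a novel’s worth of plain text per book, a modestly compressed video stream), not a measurement:

```python
# Back-of-envelope: raw bytes in 4 hours of video vs. 12,000 books.
# All constants are rough assumptions, not measurements.

BOOK_BYTES = 1_000_000            # ~1 MB of plain text per book (assumed)
BOOKS = 12_000

VIDEO_BITRATE_BPS = 8_000_000     # ~8 Mbit/s compressed 1080p stream (assumed)
VIDEO_SECONDS = 4 * 3600

book_bytes = BOOKS * BOOK_BYTES
video_bytes = VIDEO_BITRATE_BPS / 8 * VIDEO_SECONDS

print(f"12,000 books: {book_bytes / 1e9:.1f} GB of text")   # ~12.0 GB
print(f"4 hours video: {video_bytes / 1e9:.1f} GB of pixels")  # ~14.4 GB
# Same order of magnitude, which is the point of the comparison above.
```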
The difference between reading about a cat and the weight of one sleeping on your chest. The warmth. The vibration you feel in your ribs rather than hear.
LLMs predict what people say about the world. Not what happens in it.
There’s no feedback loop with reality. An LLM can confidently describe how a bicycle works. It has no way to check if that description matches what actually happens when you pedal.
The word that keeps coming up: grounding.
LLMs manipulate symbols brilliantly. But do those symbols connect to anything real?
This isn’t fringe skepticism. LeCun left Meta in late 2025 with €500 million to build world models instead. Sutton, the father of reinforcement learning, calls text prediction “a dead end” for true understanding.
When the people who built this start walking away, it’s worth asking what they’re seeing.
How World Models Work
The core idea: instead of predicting the next token, predict the next state of the world.
Two innovations make this possible.
Latent space prediction. When you recognize a friend from across the street, you’re not comparing millions of tiny details: the exact shade of their hair, the precise angle of their nose. Your brain has some compressed representation. The essence of that face.
You could pick them out of a crowd in a fraction of a second. But you couldn’t list the features that let you do it.
Latent space is the machine version of this. Instead of storing every pixel of a scene, the model learns to capture what matters in a much smaller description. A point in some abstract space that encodes meaning, not surface appearance.
Prediction happens in that compressed space: efficient, and forced to capture what actually matters.
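Here’s the shape of the idea as a minimal PyTorch sketch. None of this is V-JEPA’s actual architecture; the layers and sizes are toy stand-ins for much larger networks:

```python
import torch
import torch.nn as nn

# Toy latent-prediction setup: compress each frame into a small latent
# vector, then predict the *next latent*, never the next pixels.
# Sizes and layers are illustrative, not any real model's.

PIXELS, LATENT = 64 * 64, 32

encoder = nn.Sequential(nn.Linear(PIXELS, 256), nn.ReLU(), nn.Linear(256, LATENT))
predictor = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, LATENT))

frame_now = torch.rand(1, PIXELS)    # stand-ins for consecutive video frames
frame_next = torch.rand(1, PIXELS)

z_now = encoder(frame_now)           # the compressed "essence" of the scene
z_next_pred = predictor(z_now)       # prediction happens in latent space

# Real systems use a separate, slowly updated "target" encoder here to
# keep the representation from collapsing; detach() is the toy version.
z_next_true = encoder(frame_next).detach()

loss = nn.functional.mse_loss(z_next_pred, z_next_true)  # compare latents, not pixels
```

The loss never mentions pixels. That single choice is what frees the model from modeling every leaf and shadow.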
Learning from video. Video is dense with causal information. This happened, then that happened. Physics, object permanence, spatial relationships, all implicit in the footage.
No labels needed. The model learns by predicting what comes next.
A million hours of internet video becomes a physics textbook the model writes for itself.
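The “no labels needed” part is easy to see in code. A sliding window over the frames is the entire labeling pipeline; the function name and shapes here are hypothetical:

```python
def video_to_training_pairs(frames, context=4):
    """Turn raw video into (past frames, next frame) training pairs.

    No human labels: the video's own future is the supervision signal
    for its past. `frames` is any ordered sequence of frame arrays.
    """
    pairs = []
    for t in range(context, len(frames)):
        pairs.append((frames[t - context:t], frames[t]))
    return pairs

# A million hours of footage becomes a million hours of self-graded
# physics exercises: predict frame t, check the answer against frame t.
```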
The Robot That Had Never Seen a Mug
V-JEPA 2 came out of Meta in June 2025. 1.2 billion parameters. Trained on over a million hours of video.
Then they connected it to a robot arm.
The arm had never encountered this particular mug before. White ceramic, slightly chipped, sitting on a cluttered desk. Previous systems would need explicit programming for this: demonstrations of grabbing mugs, parameters tuned for this grip width, this weight distribution.
V-JEPA 2 picked it up on the first try.
The model had learned enough physics from watching video (over a million hours of objects being grasped, released, dropped, caught) that it knew how things behave before touching them. The way your hand shapes itself before it reaches the cup.
The video didn’t show this exact mug. But it showed mugs. The compressed representation transferred.
62 hours of robot-specific data. That’s it. 65-80% success rate on objects it had never seen during training.
Here’s the part that’s kinda wild: the model watched but never moved. Video is passive. Robot control is active.
How does seeing translate to doing?
The internal physics model works both ways. Watching teaches you what happens when force meets object. Acting is just running that model in reverse: what force produces this outcome?
The physics is the same whether you’re watching or reaching.
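“Running the model in reverse” usually isn’t a literal inversion. A common recipe for this family of systems (the sketch and names below are mine, not Meta’s code) is to propose candidate actions, push each through the forward model, and keep whichever one is predicted to land closest to the goal:

```python
import torch

def choose_action(world_model, z_now, z_goal, n_candidates=256, action_dim=4):
    """Pick the action whose *predicted* outcome is nearest the goal.

    Assumes `world_model(z, a)` returns the next latent state and
    accepts batches; `z_now` has shape (1, latent_dim).
    """
    actions = torch.randn(n_candidates, action_dim)      # random proposals
    z_pred = world_model(z_now.expand(n_candidates, -1), actions)
    errors = ((z_pred - z_goal) ** 2).sum(dim=-1)        # distance to goal
    return actions[errors.argmin()]                      # best proposal wins
```

Seeing and doing share one forward model. Only the direction of the query changes.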
Where We Are Now
Multiple approaches, all converging on the same insight.
Tesla’s Full Self-Driving runs on a world model. Not just where objects are, but where they’ll be. 4D prediction: space plus time. Millions of cars, every day, navigating with an internal simulation of traffic physics.
The largest-scale deployment of this idea, often overlooked in academic discussions.
DeepMind’s Genie 3 generates interactive 3D worlds from prompts. Describe an environment and it creates something you can walk around in. Real-time, playable. The model learned enough about how spaces work to generate coherent new ones.
Physical Intelligence raised $600 million (valued at $5.6 billion) to build robots that fuse vision, language, and action into continuous motor commands. World Labs, Fei-Fei Li’s company, launched Marble: downloadable 3D environments from text or images, VR-compatible.
The field went from “let’s try LLMs on robots” to a full taxonomy of architectures in eighteen months.
What This Enables
Things that are hard or impossible with text-only systems.
Physical reasoning. Will this tower of blocks fall? Can I fit this object through that gap? What happens if I push this?
These questions require simulating outcomes, not retrieving facts. The answer lives in running the physics forward.
Zero-shot robot control. The mug the V-JEPA robot had never seen. The video pre-training gave it intuitions about how things behave, transferred from watching to acting without explicit instruction.
Simulation before action. The AI can “imagine” outcomes before committing. What happens if I grasp here versus there? Play it out in the model’s head first.
Like a chess player thinking moves ahead. But for physical tasks.
Long-horizon planning. Chaining predictions: if I do A, then B happens, then I can do C. Text models struggle here because state is implicit, smeared across a context window rather than tracked explicitly. World models carry state forward across time.
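Chaining is just feeding the model its own output. A sketch, reusing the hypothetical `world_model` from above:

```python
def rollout(world_model, z_start, action_sequence):
    """Imagine an entire plan before moving: do A, then B, then C.

    Each predicted state becomes the input for the next prediction,
    so state is carried explicitly across the whole horizon.
    """
    z = z_start
    trajectory = [z]
    for action in action_sequence:
        z = world_model(z, action)   # predicted consequence of this step
        trajectory.append(z)
    return trajectory                # score it, compare plans, pick the best
```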
But Then Again
The counterargument isn’t “LLMs are fine as they are.” It’s that inference-time approaches are already blurring the line.
Recursive Language Models treat context as something to search through rather than hold in memory. Test-Time Training updates the model’s weights during inference; every input makes it, temporarily, a different model.
Both add something. Call it thinking. Call it world-model-like behavior emerging from text prediction.
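Test-Time Training is simple to state in code, even if the engineering isn’t. This is a caricature of the idea, not any specific paper’s method; `self_supervised_loss` is a hypothetical stand-in for something like masked reconstruction on the input itself:

```python
import copy
import torch

def answer_with_ttt(model, x, self_supervised_loss, steps=3, lr=1e-4):
    """Adapt a copy of the model to this one input, then predict.

    The weights that produce the answer are, temporarily, different
    from the weights that were shipped.
    """
    adapted = copy.deepcopy(model)                    # leave the original alone
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        self_supervised_loss(adapted, x).backward()   # learn from x itself
        opt.step()
    with torch.no_grad():
        return adapted(x)                             # answer as the adapted model
```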
The line between “world model” and “reasoning model” is blurrier than the pure critique suggests.
Text-based systems might be building implicit models of the world through the statistical structure of language. Or they might be doing something else entirely that just looks similar.
World models are one path. Maybe not the only one.
How Humans Build World Models
Turn the lens on ourselves.
Babies develop object permanence between 3 and 9 months. Hide a toy behind a barrier. The baby knows it’s still there.
That’s not language. That’s a world model, things persist even when you can’t see them.
Alison Gopnik calls children “scientists in the crib.” Her theory-theory: kids form intuitive theories, run experiments (that’s what play is), and update beliefs based on evidence.
Bayesian learners building world models from limited data.
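“Bayesian learner” sounds abstract; the arithmetic isn’t. A toy version of the hidden-toy experiment, with invented numbers:

```python
# Toy Bayesian update: does the toy still exist behind the barrier?
# The prior and likelihoods are invented for illustration.

prior = {"toy persists": 0.5, "toy vanishes": 0.5}

# Evidence: lift the barrier and the toy is there.
# P(see toy | persists) = 0.95, P(see toy | vanishes) = 0.05
likelihood = {"toy persists": 0.95, "toy vanishes": 0.05}

unnormalized = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}

print(posterior)  # {'toy persists': 0.95, 'toy vanishes': 0.05}
# A few peeks like this and "things persist" hardens into near-certainty.
```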
The neuroscience backs this up. Place cells, grid cells: the brain has spatial maps. And these maps generalize beyond physical space.
We use the same neural machinery for abstract reasoning that we use for navigation. Cognitive maps. Not just for finding your way home. For thinking.
Humans build world models through sensorimotor experience. But we also learn from language. From reading. From being told things.
The question isn’t whether embodiment helps. It clearly does.
The question is whether it’s necessary.
What Does Understanding Even Require?
Can symbols ever connect to meaning without sensorimotor experience?
LeCun’s position: No. You can’t know what “hot” means without touching something hot.
There’s a difference between knowing the word “cold” and standing in a 7-Eleven at 2 AM, the air conditioning raising goosebumps on your arms while you decide between two brands of instant noodles.
One is a symbol. The other is something else.
Counter-position: maybe functional grounding is enough. The statistical patterns in language encode so much structure about the world that the symbols effectively become grounded through usage. Even without direct sensory contact.
We don’t know what understanding requires. Not for AI. Not even for ourselves.
Is there something it’s like to “understand”? Or is understanding just successful prediction dressed up in philosophical language?
Two Kinds of Prediction
World models bet that you need to simulate reality to understand it.
Text models bet that language encodes enough structure to reason without simulation.
LeCun walked away from Meta with half a billion euros to pursue the first bet. Anthropic and OpenAI keep pushing on the second.
Both sides have smart people and real results.
The answer matters for AI. Which architecture wins, which problems get solved, which companies build the future.
But it also matters for something older.
What does it mean to understand something? Philosophers have asked this for centuries.
Now we’re building systems that might actually show us.