Three Ways AI Is Learning to Understand the Physical World — And Why It Matters for the Future of Robotics

Large language models can write poetry, debug code, and pass the bar exam. But ask them to predict what happens when a ball rolls off a table, and they struggle. This fundamental gap — the inability to reason about physical causality — is one of the most significant limitations holding back AI’s expansion into robotics, autonomous vehicles, and physical manufacturing. A new generation of research is tackling the problem from three distinct angles.

The Physical World Problem

LLMs excel at processing abstract knowledge through next-token prediction, but they lack grounding in physical causality: they cannot reliably predict the physical consequences of real-world actions. This is why AI systems that look brilliant on benchmarks routinely fail when deployed in physical environments.

As AI pioneer Richard Sutton put it in a recent interview, LLMs just mimic what people say instead of modeling the world, which limits their capacity to learn from experience and adjust to changes in the world. Similarly, Google DeepMind CEO Demis Hassabis has described today’s AI as suffering from “jagged intelligence”: capable of solving complex math olympiad problems while failing at basic physics.

This gap is driving a major research focus: building world models, internal simulators that let AI systems safely test hypotheses before taking physical action.
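To make the idea of an internal simulator concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional table, the constant friction value, and the names BallState, simulate_push, and choose_push are hypothetical, and the hand-written physics stands in for what a real world model would have to learn from data. The point is only the control flow: candidate actions are tried inside the model first, and anything that ends badly is rejected before the agent acts.

```python
from dataclasses import dataclass
from typing import List, Optional

TABLE_LENGTH_M = 1.0   # assumed table length in metres
FRICTION_DECEL = 2.0   # assumed constant deceleration from friction, m/s^2
DT = 0.01              # simulation time step in seconds


@dataclass
class BallState:
    position: float  # metres from the left edge of the table
    velocity: float  # metres per second, positive is toward the right edge


def simulate_push(start: BallState, push_speed: float) -> BallState:
    """Roll the ball forward inside the model until it stops or leaves the table."""
    state = BallState(start.position, push_speed)
    while state.velocity > 0 and state.position <= TABLE_LENGTH_M:
        state.position += state.velocity * DT
        state.velocity = max(0.0, state.velocity - FRICTION_DECEL * DT)
    return state


def choose_push(start: BallState, candidates: List[float], target: float) -> Optional[float]:
    """Return the push speed whose simulated stop is closest to the target.

    Pushes whose simulated outcome ends past the table edge are discarded.
    """
    safe = []
    for speed in candidates:
        end = simulate_push(start, speed)
        if end.position <= TABLE_LENGTH_M:
            safe.append((abs(end.position - target), speed))
    if not safe:
        return None
    return min(safe)[1]


# Imagine five pushes, act on none of them yet.
best = choose_push(BallState(0.2, 0.0), [0.5, 1.0, 1.5, 2.0, 2.5], target=0.9)
```

With these assumed numbers, the two fastest pushes are rejected because the simulated ball rolls off the edge, and the planner settles on the fastest push that still stops on the table. No real ball was dropped to find that out, which is the whole appeal.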

Approach 1: JEPA — Learning Latent Representations

The first major approach focuses on learning latent representations instead of trying to predict the dynamics of the world at the pixel level. This method, heavily based on the Joint Embedding Predictive Architecture (JEPA), is endorsed by AMI Labs and Yann LeCun.

JEPA models are loosely modeled on human cognition: rather than memorizing every pixel of a scene, humans track trajectories and interactions. JEPA models take a similar approach, learning abstract features rather than exact pixel predictions, discarding irrelevant details and focusing on core interaction rules.
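The difference between predicting pixels and predicting in latent space can be illustrated with a toy example. This is a sketch under heavy assumptions, not a JEPA implementation: in an actual JEPA the encoder and predictor are learned networks, whereas here encode and predict_latent are hand-written stand-ins, and the 16-pixel one-dimensional "frame" is invented for the demonstration. What it shows is where the loss is measured, and why background noise hurts one objective but not the other.

```python
import random

WIDTH = 16  # number of pixels in the toy one-dimensional frame


def render(ball_index, noise, rng):
    """Draw a frame: background noise everywhere, one bright pixel for the ball."""
    frame = [rng.uniform(0.0, noise) for _ in range(WIDTH)]
    frame[ball_index] = 1.0
    return frame


def encode(frame):
    """Stand-in encoder (hand-written, not learned): keep the ball position only."""
    return float(max(range(WIDTH), key=lambda i: frame[i]))


def predict_latent(latent, velocity):
    """Stand-in predictor (hand-written, not learned): advance the abstract state."""
    return latent + velocity


def mse(a, b):
    """Mean squared error between two equally sized frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
current = render(ball_index=4, noise=0.3, rng=rng)
actual_next = render(ball_index=5, noise=0.3, rng=rng)

# Pixel-space objective: even a prediction with the ball in the right place
# is penalised, because it cannot guess the random background.
predicted_frame = render(ball_index=5, noise=0.0, rng=rng)
pixel_loss = mse(predicted_frame, actual_next)

# Latent-space objective: prediction and target are compared after encoding,
# so the background never enters the loss.
predicted_latent = predict_latent(encode(current), velocity=1.0)
latent_loss = (predicted_latent - encode(actual_next)) ** 2
```

Here the latent loss is zero while the pixel loss stays positive, even though both predictions put the ball in the correct place. A learned encoder has to discover which details are safe to discard; the stand-in above simply assumes the answer.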

The advantages are significant:

Highly robust against background noise and small input changes
Compute and memory efficient, with fewer training examples required
Low latency, making it suitable for real-time robotics applications

AMI Labs is already partnering with healthcare company Nabla to simulate operational complexity in fast-paced healthcare settings.

Approach 2: Gaussian Splats — Building Spatial Environments

The second approach uses generative models to build complete spatial environments from scratch. Adopted by World Labs, this method takes an initial prompt (image or text) and uses a generative model to create a 3D Gaussian splat, a representation that describes a 3D scene using millions of mathematical particles (Gaussians) that define geometry and lighting.
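For readers who want a feel for what one of those particles looks like as data, here is a simplified sketch. The class name Gaussian3D and the function blend_color are hypothetical, and several simplifications are assumed: rotation is left out so each Gaussian is axis-aligned, colour is a plain RGB triple, and the blending is a crude weighted average at a single point rather than the projection and compositing that real splat renderers perform.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian3D:
    center: Vec3    # where the particle sits in the scene
    scale: Vec3     # per-axis spread (rotation omitted in this sketch)
    opacity: float  # peak contribution, between 0 and 1
    color: Vec3     # RGB, each channel between 0 and 1

    def weight_at(self, point: Vec3) -> float:
        """Contribution of this Gaussian at a point: opacity times a Gaussian falloff."""
        exponent = sum(
            ((p - c) / s) ** 2 for p, c, s in zip(point, self.center, self.scale)
        )
        return self.opacity * math.exp(-0.5 * exponent)


def blend_color(gaussians: List[Gaussian3D], point: Vec3) -> Vec3:
    """Weighted average colour at a point; a stand-in for real splat rendering."""
    weights = [g.weight_at(point) for g in gaussians]
    total = sum(weights)
    if total == 0.0:
        return (0.0, 0.0, 0.0)
    return tuple(
        sum(w * g.color[i] for w, g in zip(weights, gaussians)) / total
        for i in range(3)
    )


red = Gaussian3D(center=(0.0, 0.0, 0.0), scale=(1.0, 1.0, 1.0), opacity=0.8, color=(1.0, 0.0, 0.0))
blue = Gaussian3D(center=(2.0, 0.0, 0.0), scale=(1.0, 1.0, 1.0), opacity=0.8, color=(0.0, 0.0, 1.0))
midpoint_color = blend_color([red, blue], (1.0, 0.0, 0.0))
```

Halfway between the two particles the colours mix evenly, and far away from either one the contribution falls off toward zero. A full scene repeats this idea with millions of particles, which is what lets a viewer move the camera freely instead of replaying a fixed video.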

Unlike flat video generation, these 3D representations can be imported directly into standard physics and 3D engines like Unreal Engine, where users and AI agents can navigate and interact from any angle. This approach addresses World Labs founder Fei-Fei Li’s observation that LLMs are like “wordsmiths in the dark,” possessing flowery language but lacking spatial intelligence.

The enterprise value is already evident: Autodesk has heavily backed World Labs to integrate these models into industrial design applications.

Approach 3: End-to-End Generation — Real-Time Physics Engines

The third approach uses an end-to-end generative model that processes prompts and user actions while continuously generating the scene, physical dynamics, and reactions on the fly. Rather than exporting a static file to an external physics engine, the model itself acts as the physics engine.

DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category. These models provide a simple interface for generating infinite interactive experiences and massive volumes of synthetic data. DeepMind demonstrated Genie 3 maintaining strict object permanence and consistent physics at 24 frames per second.
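The shape of that interaction loop can be sketched in a few lines. This is an interface illustration only: ToyWorldModel and its step method are hypothetical names, the "frame" is a small dictionary rather than an image, and the hand-written update rule stands in for a learned generative network. Nothing here reflects how Genie 3 or Cosmos is actually built. The one number taken from the article is the 24 frames per second figure, which leaves roughly 41.7 milliseconds to produce each frame.

```python
FPS = 24
FRAME_BUDGET_MS = 1000.0 / FPS  # time available to generate one frame

# Hypothetical action vocabulary mapped to a change in position.
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}


class ToyWorldModel:
    """Hypothetical stand-in for a model that generates frames directly."""

    def __init__(self, prompt):
        self.prompt = prompt
        self.history = [{"x": 0, "y": 0}]  # frames generated so far

    def step(self, action):
        """Produce the next frame from the previous frame and the latest action.

        A learned model would condition on the whole history; this toy rule
        only needs the most recent frame. Unknown actions leave the scene as is.
        """
        dx, dy = MOVES.get(action, (0, 0))
        last = self.history[-1]
        frame = {"x": last["x"] + dx, "y": last["y"] + dy}
        self.history.append(frame)
        return frame


model = ToyWorldModel(prompt="a ball on a table")
frames = [model.step(action) for action in ["right", "right", "up", "wait"]]
```

Each call to step feeds the previous output back in as context, and no external engine is consulted at any point. That is also why consistency over time is the hard part: whatever the model forgets from its own history is gone from the world.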

Why This Matters Now

The race to build world models has attracted over billion in recent funding — World Labs raised billion in February 2026, and AMI Labs followed with a .03 billion seed round. This is not academic curiosity; it is industrial strategy.

Robotics, autonomous vehicles, and AI-controlled manufacturing all depend on AI systems that can reason about physical consequences. Without world models, AI systems deployed in physical spaces will continue to fail in ways that are expensive, dangerous, and embarrassing.

The three approaches represent genuine architectural diversity — JEPA for efficiency, Gaussian splats for spatial computing, and end-to-end generation for scale. Which approach wins, or whether they converge, will shape the next decade of AI deployment in the physical world.
