Beyond LLMs: The Three Architectural Approaches Teaching AI to Understand Physics

Large language models excel at writing poetry and debugging code, but ask them to predict what happens when you drop a ball and you’ll quickly discover their limitations. Despite mastering chess, generating art, and passing bar exams, today’s most powerful AI systems fundamentally don’t understand physics.

This gap is becoming increasingly apparent as companies try to deploy AI in robotics, autonomous vehicles, and manufacturing. The solution? World models: internal simulators that let AI systems safely test hypotheses before taking physical action. And investors are paying attention: AMI Labs raised a billion-dollar seed round, while World Labs secured funding from backers including Nvidia and AMD.

The Problem with Next-Token Prediction

LLMs work by predicting the next token in a sequence. This approach has been remarkably successful for text, but it has a critical flaw when applied to physical tasks: according to AI researchers, these models cannot reliably predict the physical consequences of real-world actions.

Turing Award recipient Richard Sutton has warned that LLMs merely mimic what people say rather than modeling the world, which limits their capacity to learn from experience. DeepMind CEO Demis Hassabis calls this "jagged intelligence": AI that can solve olympiad-level math problems but fails at basic physics.

The industry is responding with three distinct architectural approaches, each with different tradeoffs.

1. JEPA: Learning Abstract Representations

The Joint Embedding Predictive Architecture, endorsed by AMI Labs and pioneered by Yann LeCun, takes a fundamentally different approach. Instead of trying to predict what the next video frame will look like at the pixel level, JEPA models learn a smaller set of abstract, or latent, features.

Think about how humans actually observe the world. When you watch a car driving down a street, you track its trajectory and speed; you don’t calculate the exact reflection of light on every leaf in the background. JEPA models reproduce this cognitive shortcut.
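The shortcut can be sketched in a few lines of code. The toy below is purely illustrative (it is not JEPA’s actual implementation): a fixed linear "encoder" stands in for a learned network, and the key point is that the prediction loss is computed in a small latent space rather than on raw pixels. All names and dimensions are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a fixed linear projection from raw pixels to a small
# latent space. In a real JEPA this would be a learned network.
W_enc = rng.standard_normal((256, 16)) * 0.1

def encode(x):
    """Map a flattened 16x16 patch (256 values) to a 16-dim latent vector."""
    return x @ W_enc

# Context and target patches from the "same scene"
# (target = context plus a small amount of motion/noise).
context_patch = rng.standard_normal(256)
target_patch = context_patch + 0.01 * rng.standard_normal(256)

def predict(z_context):
    """Stand-in for the learned predictor that maps the context latent
    to the expected target latent (identity here, for illustration)."""
    return z_context

z_context = encode(context_patch)
z_target = encode(target_patch)  # in practice, an EMA "target encoder"
z_pred = predict(z_context)

# The key JEPA idea: the loss compares 16-dim latents,
# never the 256 raw pixel values.
latent_loss = np.mean((z_pred - z_target) ** 2)
print(f"latent dims compared: {z_pred.size} (vs {context_patch.size} pixels)")
```

Because errors are measured on a handful of abstract features instead of every pixel, training and inference touch far less data per step, which is where the efficiency claims come from.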

The benefits are substantial: JEPA models are compute- and memory-efficient, require fewer training examples, and run with significantly lower latency. These characteristics make the architecture suitable for applications where real-time inference is non-negotiable: robotics, self-driving cars, and high-stakes enterprise workflows.

Healthcare company Nabla is already using this architecture to simulate operational complexity in fast-paced medical settings, reducing cognitive load for healthcare workers.

2. Gaussian Splats: Building Spatial Worlds

The second approach, adopted by World Labs, the company led by AI pioneer Fei-Fei Li, uses generative models to build complete 3D spatial environments. The process takes an initial prompt, either an image or a textual description, and generates a 3D Gaussian splat from it.

A Gaussian splat represents 3D scenes using millions of tiny mathematical particles that define geometry and lighting. Unlike flat video generation, these 3D representations can be imported directly into standard physics and 3D engines like Unreal Engine, where users and AI agents can freely navigate and interact from any angle.
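To make the representation concrete, here is a minimal sketch of what one such "particle" might hold, and how its scale and rotation define an ellipsoidal footprint in space. This is an illustrative data structure, not World Labs’ actual format; the field names and layout are assumptions.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Splat:
    # One Gaussian "particle": where it sits, how it is shaped, how it looks.
    mean: np.ndarray      # 3D center position
    scale: np.ndarray     # per-axis extent of the Gaussian
    rotation: np.ndarray  # unit quaternion (w, x, y, z) orienting the ellipsoid
    color: np.ndarray     # RGB
    opacity: float        # blending weight in [0, 1]

def covariance(s: Splat) -> np.ndarray:
    """3x3 covariance from scale and rotation: Sigma = R S S^T R^T."""
    w, x, y, z = s.rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(s.scale)
    return R @ S @ S.T @ R.T

# A scene is just a large collection of these particles;
# production scenes hold millions.
scene = [
    Splat(np.array([0.0, 0.0, 5.0]), np.array([1.0, 0.5, 0.5]),
          np.array([1.0, 0.0, 0.0, 0.0]), np.array([0.8, 0.2, 0.2]), 0.9),
]
C = covariance(scene[0])
```

Because each particle is explicit geometry rather than generated pixels, the whole collection can be handed to a renderer or physics engine and viewed from any camera angle.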

Li describes LLMs as wordsmiths in the dark: possessing flowery language but lacking spatial intelligence and physical experience. The company’s Marble model aims to give AI that missing spatial awareness.

Industrial design giant Autodesk has backed World Labs heavily and plans to integrate these models into its design applications. The approach has massive potential for spatial computing, interactive entertainment, and robotics training environments.

3. End-to-End Generation: Physics Native

The third approach uses an end-to-end generative model that continuously generates the scene, physical dynamics, and reactions on the fly. Rather than exporting to an external physics engine, the model itself acts as the engine.

DeepMind’s Genie 3 and Nvidia’s Cosmos fall into this category. These models ingest an initial prompt alongside continuous user actions and generate subsequent environment frames in real time, calculating physics, lighting, and object reactions natively.
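The frame-by-frame loop can be sketched as follows. This toy stands in for the learned model with hard-coded falling-ball dynamics; a real system like Genie 3 or Cosmos would run a generative network at each step and decode its state to pixels. Every function, constant, and state layout here is invented for illustration.

```python
import numpy as np

def world_model_step(state, action):
    """Stand-in for a learned generative model: given the current state
    and a user action, produce the next state. The 'physics' below is
    hand-written; in an end-to-end model it would be learned behavior."""
    pos, vel = state
    vel = vel + np.array([0.0, -9.8]) * 0.05  # toy gravity
    vel = vel + action                         # user input nudges the object
    pos = pos + vel * 0.05
    if pos[1] < 0:                             # toy ground collision + bounce
        pos = np.array([pos[0], 0.0])
        vel = np.array([vel[0], -0.5 * vel[1]])
    return pos, vel

# Interactive loop: the model itself acts as the physics engine,
# consuming an action and emitting the next "frame" each iteration.
state = (np.array([0.0, 2.0]), np.array([0.0, 0.0]))
for step in range(40):
    action = np.array([0.1, 0.0]) if step == 10 else np.zeros(2)
    state = world_model_step(state, action)

final_pos, final_vel = state
```

The structure of the loop, not the toy physics, is the point: there is no external engine to export to, because dynamics emerge from the same model that generates the scene.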

The compute cost is substantial: continuously rendering physics and pixels at once requires significant resources. But the investment enables synthetic data factories that can generate endless interactive experiences and massive volumes of training data.

Nvidia Cosmos uses this architecture to scale synthetic data and physical AI reasoning. Waymo built its world model on Genie 3 for training self-driving cars, synthesizing rare, dangerous edge-case conditions without the cost or risk of physical testing.

The Hybrid Future

LLMs will continue serving as the reasoning and communication interface, but world models are positioning themselves as foundational infrastructure for physical and spatial data pipelines. We’re already seeing hybrid architectures emerge.

Cybersecurity startup DeepTempo recently developed LogLM, integrating LLMs with JEPA elements to detect anomalies and cyber threats from security logs. The boundary between AI that thinks and AI that understands the physical world is beginning to dissolve.

As world models mature, expect AI systems that can not only tell you how to change a tire, but actually understand what happens when you apply torque to a rusted bolt. The physical world is finally coming into focus for artificial intelligence.
