
How Video Generation Drives Progress Toward Artificial General Intelligence

Video generative models are often perceived as tools designed purely for entertainment: viral TikToks, flashy advertisements, even entire short-form video series, all of which can now be produced from a single prompt. At first glance, their primary purpose is content production. This perspective, however, barely scratches the surface of generative models' true potential.

While generative models undoubtedly drive advancements in entertainment and marketing, their foundation lies in something far more sophisticated: the World Model. This mechanism not only enables models to generate content but also allows them to simulate and predict the behavior of objects and environments. Even basic tasks require a robust World Model, making it a key step toward Artificial General Intelligence.

What Are World Models in ML?

World Models are fundamental mechanisms that let biological and artificial systems predict future states based on past information. This principle is the foundation of video generative models, which strive to predict the next frame in a sequence from the history of previous frames. It mirrors how our brains work: not just reacting to changes but actively anticipating what comes next.
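
In code, this principle reduces to a simple interface: given a history of observations, produce the next one. Here is a toy sketch; the linear extrapolation is purely illustrative, and a real model would learn this function from data:

```python
from typing import Sequence

import numpy as np


def predict_next(history: Sequence[np.ndarray]) -> np.ndarray:
    """Toy world model: guess the next observation from past ones.

    We simply extrapolate linearly from the last two observations;
    a learned model would replace this rule with a neural network.
    """
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


# A point moving right at constant speed: the prediction lands at x = 3.
frames = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
print(predict_next(frames))  # [3. 0.]
```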

Consider a baseball player hitting a ball with a bat. The time they have to decide how to swing is shorter than the time the brain needs to process the visual information. Hitting the ball is possible only because the brain predicts where it will be, allowing the player to act preemptively. This predictive ability, rooted in anticipating future outcomes, is a fundamental example of a World Model in action.

Another interesting similarity between our brain and video generative models is how they process information. Neither operates directly on images or raw data. Instead, they work in a latent space: an abstract representation of sensory inputs. In humans, the optic nerve transmits signals from the eyes, and the brain doesn't process these as full images but as encoded signals. Video models, likewise, predict the next latent state instead of directly generating the following image. Both systems, in other words, focus on predicting the sensory information they expect in the future.
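
The contrast can be made concrete in a few lines of Python. Everything below is a placeholder rather than a real API; the point is only that prediction happens on compact latent vectors, not on full images:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image):                     # stand-in for a learned encoder
    return image.reshape(-1)[:8]       # image -> small latent vector

def predict_next_latent(z_history):    # stand-in for a learned dynamics model
    return z_history[-1]               # trivially repeat the last latent

frames = [rng.random((64, 64, 3)) for _ in range(4)]
latents = [encode(f) for f in frames]
z_next = predict_next_latent(latents)  # prediction happens in latent space
print(z_next.shape)                    # (8,) -- an 8-dim vector, not a 64x64x3 image
```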

How Do World Models Work?

There is a famous example of a world model built for VizDoom, a game where you can move either left or right to dodge fireballs, and your goal is to survive as long as possible. Let's walk through it. The first step is encoding pixels into a latent representation. For that, we can use an autoencoder: a model that learns how to map an image into latent space (the encoding) and then how to recover the original image from the latent representation (the decoding). Then, we train a model that predicts the action we must take (step left, step right, or stay still).
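
A minimal autoencoder sketch in PyTorch, assuming 64x64 RGB frames; the layer sizes are illustrative and do not reproduce the original World Models architecture:

```python
import torch
import torch.nn as nn


class AutoEncoder(nn.Module):
    """Maps 64x64 RGB frames to a small latent vector and back."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14 -> 6
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 6 * 6), nn.ReLU(),
            nn.Unflatten(1, (128, 6, 6)),
            nn.ConvTranspose2d(128, 64, 4, stride=2), nn.ReLU(),   # 6 -> 14
            nn.ConvTranspose2d(64, 32, 4, stride=2), nn.ReLU(),    # 14 -> 30
            nn.ConvTranspose2d(32, 3, 6, stride=2), nn.Sigmoid(),  # 30 -> 64
        )

    def forward(self, x):
        z = self.encoder(x)           # encoding: image -> latent
        return self.decoder(z), z     # decoding: latent -> image


frames = torch.rand(8, 3, 64, 64)             # a batch of game frames
model = AutoEncoder()
recon, z = model(frames)
loss = nn.functional.mse_loss(recon, frames)  # reconstruction objective
```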

This system will be able to play the game, but it will be limited because it reacts to the world using only the current frame. To overcome this, another model (the actual world model) has to be introduced. Given the history of states and the current action, its task is to generate the next latent state. This world model is trained on real game data to predict the next state. Once it is trained, we can train a better controller that is aware of the game's history.
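
A sketch of such a world model, again in PyTorch. The original work uses a mixture-density RNN; below is a simplified deterministic variant whose sizes and names are assumptions for illustration:

```python
import torch
import torch.nn as nn


class LatentWorldModel(nn.Module):
    """Predicts the next latent state from the history of latents and actions."""

    def __init__(self, latent_dim: int = 32, action_dim: int = 3, hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(latent_dim + action_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, latent_dim)

    def forward(self, latents, actions):
        # latents: (batch, time, latent_dim), actions: (batch, time, action_dim)
        x = torch.cat([latents, actions], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)            # predicted latent for every next step


# Training on recorded gameplay: predict z[t+1] from (z[<=t], a[<=t]).
z = torch.randn(16, 100, 32)           # latents from the autoencoder
a = torch.randn(16, 100, 3)            # actions: left / right / stay
world_model = LatentWorldModel()
pred = world_model(z[:, :-1], a[:, :-1])
loss = nn.functional.mse_loss(pred, z[:, 1:])  # next-latent prediction loss
```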

After that, we can test the model's ability to act as a world model by running it in pure simulation mode. Here it works autoregressively, taking its own predictions as input without any input from the actual game (except the first frame). The model now exists in its own hallucination, with no external input.
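
A rollout in this simulation mode is just an autoregressive loop: each predicted latent is fed back as the next input. A sketch, reusing the hypothetical LatentWorldModel from the previous sketch and a toy linear controller:

```python
import torch

def dream_rollout(world_model, controller, z0, steps=100):
    """Run the world model on its own predictions, with no game in the loop."""
    z, hidden, trajectory = z0, None, [z0]
    for _ in range(steps):
        action = controller(z)                     # act inside the "dream"
        x = torch.cat([z, action], dim=-1).unsqueeze(1)
        h, hidden = world_model.rnn(x, hidden)     # keep the recurrent state
        z = world_model.head(h[:, -1])             # the model's own prediction...
        trajectory.append(z)                       # ...becomes the next input
    return trajectory

world_model = LatentWorldModel()                   # from the previous sketch
controller = torch.nn.Linear(32, 3)                # toy controller: latent -> action
traj = dream_rollout(world_model, controller, torch.randn(1, 32))
print(len(traj))  # 101 latents; only the first came from outside the model
```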

The most remarkable aspect of this approach is that we can train a controller entirely within this simulation and then apply it directly in the actual game. In many cases, it performs at least as well as a controller trained on actual in-game data. This property of World Models makes them especially valuable, as they can be used in areas where real-world data is expensive or difficult to obtain.
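
How such a controller can be trained without touching the real game, sketched under the same assumptions: sample controller weights, score each candidate on imagined rollouts only, and keep the best. The original work uses CMA-ES for this search; the random search and the reward function below are placeholders:

```python
import numpy as np
import torch

# Placeholder reward; the real setup rewards survival time in the game.
reward_fn = lambda z: float(-z.pow(2).mean())

def evaluate_in_dream(world_model, weights, z0, steps=200):
    """Score a linear controller using only the world model's imagined rollout."""
    controller = torch.nn.Linear(32, 3)
    with torch.no_grad():
        controller.weight.copy_(torch.tensor(weights, dtype=torch.float32))
        controller.bias.zero_()
        total, z, hidden = 0.0, z0, None
        for _ in range(steps):
            action = controller(z)
            x = torch.cat([z, action], dim=-1).unsqueeze(1)
            h, hidden = world_model.rnn(x, hidden)
            z = world_model.head(h[:, -1])
            total += reward_fn(z)
    return total

# Random search over controller weights (the original paper uses CMA-ES);
# world_model is the LatentWorldModel instance from the sketch above.
best_weights = max(
    (np.random.randn(3, 32) * 0.1 for _ in range(32)),
    key=lambda w: evaluate_in_dream(world_model, w, torch.randn(1, 32)),
)
```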

Why Are World Models So Important?

World Models are not limited to video generation. They extend to any system that predicts future events based on predefined rules. For instance, large language models (LLMs) incorporate aspects of a World Model to solve complex tasks by leveraging their knowledge of the physical world derived from extensive text datasets. One example used to evaluate this capability involves reasoning about interconnected gears: "Seven axles are equally spaced around a circle. A gear is placed on each axle to engage with the gear to its left and the gear to its right. If gear 3 were rotated clockwise, in which direction would gear 7 rotate?" Modern LLMs can solve such tasks by simulating physical interactions internally.
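
The reasoning such a question probes can be written out explicitly. The following sketch is my own working of the puzzle, not something taken from the article's sources: meshed gears alternate direction, so we can walk from gear 3 to gear 7 both ways around the ring and compare the answers.

```python
def direction_after(steps: int, start_dir: int) -> int:
    """Meshed gears alternate, so the direction flips on every step along a chain."""
    return start_dir if steps % 2 == 0 else -start_dir

CW, CCW = 1, -1

# Two ways to walk from gear 3 to gear 7 around a ring of 7 gears.
via_4_5_6 = direction_after(steps=4, start_dir=CW)  # 3 -> 4 -> 5 -> 6 -> 7
via_2_1   = direction_after(steps=3, start_dir=CW)  # 3 -> 2 -> 1 -> 7

print(via_4_5_6, via_2_1)  # 1 -1: the two paths disagree, so a closed ring with
# an odd number of meshed gears cannot rotate at all -- the system jams.
```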

In applied contexts, World Models enable systems to perform tasks that require foresight. In video games, for example, models simulate responses to user inputs within the game’s rules. Similarly, World Models are used in simulations to train autonomous vehicles to predict road behavior, avoid collisions, and follow traffic rules.

Why Is Video Generation Central to World Models?

When developing the Sora model, the team published a technical report, "Video generation models as world simulators," explaining the architecture of their diffusion model. They view video models as a key step toward building a comprehensive World Model, the foundation of AGI. In their view, AGI cannot exist without a coherent internal understanding of how the world works.

All the examples discussed (video games, autonomous vehicles, and text-based scenarios) share a common requirement: a clear understanding of the laws of the physical world. Extensive investment in video generation, whether for entertainment or other practical purposes, is a powerful driver toward building robust World Models, making this field a stepping stone to achieving far greater capabilities, including AGI.

Despite their potential, current video models still face significant obstacles. Even the most advanced systems, which require millions of dollars to train, struggle with basic physical principles. For example, in one Sora demonstration, a glass is lifted and tilted, spilling water before it is fully overturned; the glass then falls but shatters only after the water has already spilled. Such errors illustrate how even sophisticated models fail to capture a realistic sequence of events.

These issues are not isolated. In vehicle simulations, wheels might spin in opposite directions, indicating a lack of fundamental physical consistency. Researchers point to two key requirements for addressing these limitations: more data and increasingly complex models. The prevailing approach is to keep scaling resources (more servers, chips, and computational power) in the hope that models will eventually learn to replicate the intricacies of the physical world.

Video generation is significantly more complex than text generation because it operates in a much larger space of possibilities. Language models work with a fixed set of tokens, whereas video models must capture motion, textures, lighting, and intricate interactions. This added complexity pushes AI toward more advanced World Models: systems that can predict and simulate the real world more effectively. Video models will also become key in fields like robotics simulation, bringing us closer to major advances in artificial general intelligence as a whole.

Author

  • Aleksandr Rezanov

    ML Research Engineer at Higgsfield, a company creating artificial intelligence for video. He led the Computer Vision Research team at Rask AI, a company building a service that allows content creators and companies to quickly and efficiently translate their videos into 130+ languages.
