
Embodied AI, over the past few years, has largely followed a familiar formula: build larger models, collect more robot demonstrations, and refine policies through increasingly sophisticated post-training. Progress has been undeniable. Yet despite impressive benchmarks and viral demonstrations, a fundamental question remains unresolved:
Why does embodied intelligence still struggle to scale?
The answer may have less to do with model capability than with the assumptions underlying how the field is being built.
There are signs that some of those assumptions are beginning to be questioned. WALL-WM, X Square Robot’s open-sourced event-centric world model, recently entered the alphaXiv Top 10 trending papers, reflecting growing interest in alternative approaches to robot cognition and world representation.
Yet the significance of this moment extends beyond a single model. The company’s continued investment in embodied data infrastructure suggests that these research directions are increasingly being translated into longer-term platform development.
What makes X Square Robot’s open-source releases noteworthy is not simply that the company has released three projects. It is that each project challenges one of the field’s most deeply held assumptions.
Taken together, WALL-OSS-0.5, WALL-WM, and QUANXTA Zero-G0 represent something larger than a collection of tools. They outline an alternative vision for how embodied intelligence may need to be built if it is to move beyond laboratory demonstrations and become a scalable technology platform.
Assumption One: Useful Robot Models Require Post-Training
In large language models (LLMs), pretraining often produces general capabilities that can be applied directly to many downstream tasks. Robotics, in contrast, has traditionally required extensive post-training: models trained in one environment or on one task rarely perform well in another without additional task-specific fine-tuning and adaptation.
This reliance on post-training creates a structural challenge for embodied AI. Each new environment demands further optimization. Each deployment becomes a bespoke project. Each new task adds another layer of training overhead.
As a result, even state-of-the-art robotic models often remain confined to laboratory settings, unable to generalize widely or scale efficiently.
WALL-OSS-0.5 approaches this problem from an unusual angle. Rather than treating post-training as an unavoidable requirement, it asks a more fundamental question:
What if a robot foundation model could become useful immediately after pretraining?

The open-sourced vision-language-action (VLA) model is designed to deliver performance much closer to that of post-trained systems directly from pretraining.
This may appear to be an engineering optimization, but its implications could reach further. The results so far are early-stage and experimental, and should be interpreted cautiously.
If successful, the industry’s development cycle changes. The objective shifts from building models that can eventually become useful to building models that are useful from the outset.
In other words, WALL-OSS-0.5 is not merely reducing deployment friction. It is questioning whether deployment friction should exist in the first place.

A related line of work explores alternative action tokenization schemes for VLA systems. One example is X-Tokenizer, which proposes a semantic residual quantization approach for representing robot actions across multiple embodiments. Rather than treating tokenization as a purely compression-oriented process, it investigates how action tokens can be aligned with pretrained vision-language representations to better support grounding and control in embodied settings.
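To make the idea of residual quantization concrete, here is a minimal sketch of generic residual vector quantization applied to a toy action vector. It is not X-Tokenizer's implementation: the codebooks, the 2-D action, and the function names (`rvq_encode`, `rvq_decode`) are all hypothetical, and the sketch leaves out the semantic alignment with vision-language representations entirely. It only shows how a continuous action can become a short sequence of discrete tokens, with each stage encoding what the previous stage missed.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the codebook entry closest to v (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(action: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize stage by stage; each stage encodes the residual left so far."""
    residual = action
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = tuple(r - c for r, c in zip(residual, codebook[idx]))
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the action by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return tuple(out)


# Hypothetical toy codebooks: a coarse stage and a finer correction stage.
COARSE = [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)]
FINE = [(-0.25, 0.0), (0.25, 0.0), (0.0, -0.25), (0.0, 0.25), (0.0, 0.0)]

action = (0.9, 0.2)  # assumed 2-D action, e.g. a planar end-effector delta
tokens = rvq_encode(action, [COARSE, FINE])
recon = rvq_decode(tokens, [COARSE, FINE])
print(tokens, recon)
```

In this toy setup the action is represented by two tokens, one per stage, and adding stages would shrink the reconstruction error further. A real system would learn its codebooks from data rather than fix them by hand.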
Assumption Two: The World Is Best Understood Through Time
A second assumption runs even deeper.
Most world models are fundamentally temporal. They process sequences of frames, chunks, or tokens and attempt to predict what happens next.
This approach has produced important advances, but it carries an implicit belief:
Time is the correct unit of understanding.
Humans rarely perceive reality this way.
We do not remember thousands of frames. We remember events.
A glass falls.
A door opens.
Someone enters a room.
WALL-WM proposes that embodied AI should operate according to a similar principle.
Instead of organizing experience around temporal chunks, it introduces event-centric world modeling, treating events rather than timestamps as the fundamental unit of representation.

This is more than an architectural modification. It represents a shift in how intelligence itself is structured.
As embodied systems move from controlled demonstrations toward long-horizon real-world tasks, identifying what matters may become more important than processing everything equally.
The significance of WALL-WM is therefore not simply that it is a new world model. It is that it reframes the question of what a world model should model in the first place.
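The contrast between temporal chunks and events can be illustrated with a small sketch. It is not WALL-WM's method: the `Frame` and `Event` types, the symbolic `state` labels, and both functions are hypothetical, and the sketch assumes each frame already carries a label for what is happening, whereas a real event-centric model would have to learn where events begin and end.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    t: float     # timestamp in seconds
    state: str   # hypothetical symbolic label for what is happening


@dataclass
class Event:
    label: str
    start: float
    end: float


def fixed_chunks(frames: List[Frame], size: int) -> List[List[Frame]]:
    """Temporal view: split the stream into equal-length chunks."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def to_events(frames: List[Frame]) -> List[Event]:
    """Event view: merge consecutive frames that share the same state."""
    events: List[Event] = []
    for f in frames:
        if events and events[-1].label == f.state:
            events[-1].end = f.t  # extend the current event
        else:
            events.append(Event(f.state, f.t, f.t))  # a new event begins
    return events


frames = [
    Frame(0.0, "glass on table"),
    Frame(0.1, "glass on table"),
    Frame(0.2, "glass on table"),
    Frame(0.3, "glass falling"),
    Frame(0.4, "glass falling"),
    Frame(0.5, "glass on floor"),
]

chunks = fixed_chunks(frames, 2)
events = to_events(frames)
for e in events:
    print(e)
```

With a chunk size of two, the second chunk straddles the moment the glass starts to fall, while the event view keeps that boundary explicit and represents the same six frames as three units of varying duration.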
Assumption Three: Robot Data Must Come From Robots
Perhaps the most expensive assumption in robotics is also the most accepted. Robot intelligence requires robot experience. For years, the scaling strategy has been straightforward: more robots, more teleoperation, more demonstrations, better policies. The problem is that this equation scales costs almost as efficiently as it scales performance.
QUANXTA Zero-G0 challenges this constraint by proposing a different idea. What if most useful experience does not need to be collected through robots at all? Rather than treating robot-free data as a supplement, QUANXTA Zero-G0 and the accompanying G0-Dataset present a new vision: data as infrastructure. QUANXTA Zero-G0 provides a closed-loop pipeline of collection, inspection, training, and evaluation, enabling robot-free experiences to be transformed into high-quality, trainable data.
Built on top of this system, G0-Dataset scales the approach into a reproducible foundation, offering over 2,000 hours of validated multimodal demonstrations to support subsequent large-scale action representation learning.
Importantly, the system supports mobile data collection workflows in which the same data acquisition process can be executed across different device configurations. Data collected under these mobile setups is stored in a unified structured format.
This allows collected trajectories to be replayed and reused within the model’s training and evaluation loop, enabling consistent validation and utilization of data across the pipeline. The focus of this design is on improving the flexibility, reuse, and consistency of the data collection process.
This is more than a dataset; it is a structured data infrastructure pipeline for embodied learning. By integrating robot-free generation with automated quality inspection and real-robot validation, G0-Dataset turns what was once an expensive, piecemeal process into a systematic infrastructure. The result is a model for how embodied intelligence can grow beyond lab constraints, reducing the amount of required real-robot data under experimental conditions.
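As a rough illustration of what a unified structured format with inspection and replay might look like, the sketch below defines a toy trajectory record, a simple quality gate, and a JSON round trip. None of this reflects the actual QUANXTA Zero-G0 or G0-Dataset schema: the `Trajectory` and `Step` types, their field names, the `inspect` rule, and the device identifiers are all assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterator, List, Tuple


@dataclass
class Step:
    t: float                  # timestamp in seconds
    observation: List[float]  # assumed flat sensor reading
    action: List[float]       # assumed flat action vector


@dataclass
class Trajectory:
    device: str               # hypothetical capture-device identifier
    task: str
    steps: List[Step]


def inspect(traj: Trajectory, min_steps: int = 2) -> bool:
    """Toy quality gate: enough steps and strictly increasing timestamps."""
    times = [s.t for s in traj.steps]
    return len(times) >= min_steps and all(a < b for a, b in zip(times, times[1:]))


def dump(traj: Trajectory) -> str:
    """Serialize to one JSON layout, whichever device produced the data."""
    return json.dumps(asdict(traj))


def load(text: str) -> Trajectory:
    """Rebuild a trajectory from its stored JSON form."""
    raw = json.loads(text)
    return Trajectory(raw["device"], raw["task"], [Step(**s) for s in raw["steps"]])


def replay(traj: Trajectory) -> Iterator[Tuple[float, List[float]]]:
    """Yield (timestamp, action) pairs in order, e.g. for evaluation."""
    for s in traj.steps:
        yield s.t, s.action


good = Trajectory("handheld-A", "open door", [
    Step(0.0, [0.1, 0.2], [0.0, 1.0]),
    Step(0.1, [0.2, 0.2], [0.5, 1.0]),
])
bad = Trajectory("handheld-B", "open door", [
    Step(0.1, [0.1, 0.2], [0.0, 1.0]),
    Step(0.0, [0.2, 0.2], [0.5, 1.0]),  # timestamps go backwards
])

accepted = [t for t in (good, bad) if inspect(t)]
restored = load(dump(good))
print(len(accepted), list(replay(restored)))
```

Because every device writes the same record layout, the same inspection and replay code can run on all of it, which is the kind of reuse the unified format is meant to enable.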
Embodied AI Is Entering Its Infrastructure Era
Viewed independently, these projects address different technical challenges.
WALL-OSS-0.5 focuses on deployability.
WALL-WM focuses on representation.
QUANXTA Zero-G0 focuses on scalable data generation.
Yet together they point toward the same conclusion.
WALL-OSS-0.5 suggests useful capability should emerge during pretraining.
WALL-WM suggests the world should be organized around events rather than time.
QUANXTA Zero-G0 suggests data should be produced as infrastructure rather than collected as a scarce resource.
These ideas may seem unrelated, but they share a common objective:
Transforming embodied AI from a handcrafted discipline into a scalable industrial system.
Recent developments around X Square Robot suggest that investment interest is following a similar trajectory.
In just over two months, X Square Robot completed four consecutive funding rounds (B, B+, B++, and C), with all closings finalized, reaching a valuation exceeding RMB 20 billion and attracting over 30 top-tier institutions, including technology companies, industrial partners, and venture capital firms. IDG participated in the Series C round, while HongShan and Xiaomi have backed the company in multiple previous rounds. Combined with earlier lead investments from Meituan, Alibaba, and ByteDance, X Square Robot has become the only embodied AI company in China to secure lead-round backing at different stages from four of the country’s leading technology companies.
Rather than emphasizing individual robot products, the investment narrative appears to center on long-term capabilities in data production, model training, and embodied infrastructure. This may reflect a broader transition within the field: embodied AI could increasingly be evaluated not only by what individual models can do, but by the systems that make those models scalable.
The competition is shifting from building models to building the systems that make intelligence scalable. As capital, data infrastructure, and foundation models become increasingly intertwined, the next phase of embodied AI may depend less on isolated algorithmic breakthroughs and more on the ability to industrialize intelligence itself.
And that changes the terms on which embodied AI companies compete.
For readers interested in exploring these open-source projects and X Square Robot’s broader embodied AI research, additional information is available on the company’s official website: https://x2robot.com/en


