Future of AI

How synthetic data will create new opportunities beyond the AI data fog

By Victor Shilo, CTO of Unlimit

The AI Agent boom’s replication trap

The year 2025 has been unmistakably marked as the year of AI Agents. New startups are emerging every day, each claiming to transform workflows through intelligent automation. While some deliver meaningful value by solving real problems, many rely solely on prompt engineering layered over public models.

This has led to a growing recognition in emerging agent marketplaces: most AI agents are built on the same foundation, making them largely interchangeable. The reason is simple—nearly all of them rely on general-purpose large language models (LLMs), trained on the same publicly available internet data. These models are typically used as-is, without any additional fine-tuning or augmentation.

As a result, the intelligence behind most agents doesn’t come from exclusive insight or proprietary capabilities—it comes from creative prompting layered on top of shared infrastructure. In this environment, differentiation becomes superficial, replication is trivial, and the barrier to entry remains low. What’s missing is a meaningful moat.

This hype isn’t like the others

A lack of differentiation isn’t the only thing that sets this AI wave apart from the technology shifts that came before it. The iPhone brought mobile-first thinking and location-aware apps. Cloud computing turned servers into scalable infrastructure. In both cases, complexity increased, demanding new skill sets and new types of professionals to manage it.

AI is different.

AI’s purpose is to reduce complexity by removing friction between humans and machines. We don’t want another tool; we want instant intelligence on demand. AI doesn’t require new user behaviors—it adapts to ours.

That’s what makes this wave fundamentally more disruptive. It doesn’t need experts to unlock its value. It can remove the expert altogether.

But there’s a paradox at the heart of it: AI’s intelligence today is largely historical. These models reflect the past because they’re trained on public data rooted in the past. We’ve extracted immense value from that data, but it was never meant to answer questions no one has asked yet.

The limits of internet-trained models

Today’s leading LLMs have been trained on a staggering volume of publicly available content—books, websites, code repositories, Wikipedia articles, Reddit threads. The training data is undeniably vast, but by its nature, it’s also finite and frozen in time.

You can only predict so much about the future using historical information. These models excel at summarizing, organizing, and synthesizing what humanity has already written or shared—but they remain fundamentally backwards-looking.

Models built on Internet-scale data inherit both strengths and limitations. They are fluent, informed, and impressively versatile—but they tend to mirror what people have already done, rather than explore what’s still unexplored. They reinforce prevailing narratives instead of challenging them. And while they reflect widespread knowledge, they rarely capture the deep, operational insight that lives inside specific industries or organizations.

From a business perspective, this poses a challenge. Echoing existing ideas adds little strategic value—especially when the same underlying intelligence is available to everyone. The result is commoditization. The intelligence begins to feel generic, standardized, and easily replaceable.

The next leap in AI-driven outcomes won’t come from reusing yesterday’s knowledge. It will come from building something fundamentally different—from tapping into intelligence that others simply don’t have access to.

The untapped power of proprietary data

That difference begins with data that no one else possesses.

Many organizations have already taken the first step by using Retrieval-Augmented Generation (RAG) or embedding search to connect LLMs with internal documents and databases. These approaches work well for surfacing known facts or answering context-specific questions—but they only scratch the surface of what proprietary data can offer.
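
To make this concrete, here is a minimal sketch of the retrieval step behind such systems, assuming the open-source sentence-transformers library. The internal documents and the query are illustrative placeholders, not a production pipeline.

```python
# Minimal sketch of embedding-based retrieval over internal documents.
# Assumes the open-source sentence-transformers package; the documents and
# query below are illustrative placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

internal_docs = [
    "Refund requests over $500 require a second approver in the payments team.",
    "Chargeback disputes must be filed within 45 days of the transaction date.",
    "Merchant onboarding checklist: KYC review, risk scoring, settlement setup.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder
doc_vectors = model.encode(internal_docs, normalize_embeddings=True)

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Return the internal documents most similar to the query."""
    q = model.encode([query], normalize_embeddings=True)
    scores = doc_vectors @ q[0]            # cosine similarity (vectors are normalized)
    best = np.argsort(scores)[::-1][:top_k]
    return [internal_docs[i] for i in best]

# The retrieved passages would then be prepended to the LLM prompt.
print(retrieve("Who has to sign off on a large refund?"))
```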

Every enterprise, no matter its size, possesses internal datasets so specific, so rich in structure and context, that they could play a transformative role in building highly personalized models—models that offer insights no general-purpose system could ever deliver. Transactional histories, product telemetry, customer service transcripts, operational records—these data points are born from decades of doing business, shaped by industry nuance, and deeply aligned with an organization’s reality.

Models fine-tuned on this data don’t just pull relevant answers—they internalize how the business works. They speak your company’s language, understand product-specific workflows, and recognize subtle customer signals. They become native to the organization they serve, offering intelligent responses that no external agent could replicate.
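
As a rough illustration of what this can look like in practice, the sketch below applies parameter-efficient (LoRA) fine-tuning to an open model using the Hugging Face transformers, peft, and datasets libraries. The base model, the file name support_transcripts.jsonl, and the hyperparameters are placeholder assumptions, not a recommended recipe.

```python
# Minimal sketch of parameter-efficient (LoRA) fine-tuning on proprietary text.
# Base model, file path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # stands in for any open causal language model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the base model with small trainable LoRA adapters instead of updating all weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# "support_transcripts.jsonl" is a hypothetical internal dataset with a "text" field.
data = load_dataset("json", data_files="support_transcripts.jsonl")["train"]
data = data.map(lambda row: tokenizer(row["text"], truncation=True, max_length=512),
                remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```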

Unlike the Internet, this data isn’t publicly available. It’s not indexed. It hasn’t been used to train foundation models. That’s exactly what makes it so powerful.

Modeling the future with synthetic data

Until recently, using proprietary data to build advanced models required rare expertise and costly computing infrastructure. But that’s changing fast. With more efficient training methods and the rise of open-source model ecosystems, the ability to create bespoke AI is now within reach for a growing number of enterprises. Those that seize this opportunity will gain a competitive edge that’s hard to duplicate—and even harder to overtake.

Yet even the richest proprietary datasets can fall short. In some cases, the necessary data simply doesn’t exist. In others, it may be impossible—or unethical—to collect.

This is where synthetic data becomes indispensable. Its potential across industries like healthcare, finance, and manufacturing is vast. Consider the challenge of building models to assess patient outcomes, predict transaction flows, or crash-test products. Traditional methods often require expensive, time-consuming processes. In contrast, AI-enhanced simulations powered by synthetic data offer a more dynamic and scalable alternative—one that can generate millions of scenarios and adapt continuously as conditions evolve.
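
As a toy illustration of scenario generation, the sketch below uses nothing more than numpy to sample a large batch of synthetic daily transaction-flow scenarios, including rare outage shocks that barely appear in historical logs. The volumes, seasonality, and shock probabilities are illustrative assumptions only.

```python
# Toy Monte Carlo generator of synthetic transaction-flow scenarios.
# All numbers are illustrative assumptions, not calibrated to a real system.
import numpy as np

rng = np.random.default_rng(seed=7)
n_scenarios, n_days = 100_000, 30   # scale up as needed

# Baseline daily volume with weekly seasonality.
day = np.arange(n_days)
baseline = 50_000 * (1 + 0.2 * np.sin(2 * np.pi * day / 7))
volumes = rng.poisson(baseline, size=(n_scenarios, n_days))

# Rare "outage" shocks that slash volume on one random day in roughly 1% of scenarios.
outage = rng.random(n_scenarios) < 0.01
outage_day = rng.integers(0, n_days, size=n_scenarios)
volumes[outage, outage_day[outage]] //= 10

# Downstream models can now be stress-tested against events that are
# almost absent from historical logs.
worst_day = volumes.min(axis=1)
print("Share of scenarios with a day below 10k transactions:",
      round((worst_day < 10_000).mean(), 4))
```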

As self-learning models are embedded into workflows, the AI begins to explore a full range of possible futures: testing assumptions, optimizing for edge cases, and surfacing insights long before a human needs to intervene.

This isn’t science fiction. Unlike historical data, which is anchored in what has already happened, synthetic data enables training on what could happen. It is speculative, predictive, and fundamentally forward-looking.

Synthetic digital twins and large behavioral models

The foundation for this kind of simulation already exists in the form of digital twins. A digital twin is a virtual replica of a physical object, process, or environment, designed to mirror its real-world counterpart in real time. The concept dates back to NASA’s Apollo program, where mirrored systems on Earth were used to troubleshoot in-flight spacecraft behavior and test real-time solutions.

Since then, digital twins have evolved into powerful engines for generating synthetic data. Cities like Singapore have developed comprehensive digital replicas to simulate traffic flows, infrastructure stress, and energy usage—creating real-time labs for urban optimization. These simulations are made possible by synthetic data, and the possibilities are only expanding.

One particularly promising frontier is the rise of Large Behavioral Models (LBMs). While LLMs focus on language prediction, LBMs go a step further by modeling decision-making behavior. They aim to understand not just what people say, but what they do—how they respond to stimuli, how systems adapt over time, and how complex dynamics unfold.

These models are trained on sequences of actions, interactions, and outcomes. That data often comes from digital twins, agent-based simulations, or anonymized behavioral logs. In a smart city environment like Singapore’s, an LBM could learn how traffic responds to policy changes, construction, or population shifts. Urban planners could then simulate thousands of “what if” scenarios—without closing a single street.
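
A deliberately simplified sketch of that kind of “what if” analysis appears below: a population of synthetic commuters chooses between car and train, and the same population is replayed under a hypothetical congestion charge. Every parameter is an illustrative assumption, and a real behavioral model would be learned from logged behavior rather than hand-written, but the shape of the exercise is the same.

```python
# Toy agent-based "what if" simulation: how does a hypothetical congestion
# charge shift commuting behavior? All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n_agents = 100_000

# Each synthetic commuter gets a value of time and an idiosyncratic car preference.
value_of_time = rng.lognormal(mean=2.5, sigma=0.5, size=n_agents)   # dollars per hour
car_bias = rng.normal(0.0, 5.0, size=n_agents)                      # dollar-equivalent preference

def share_driving(congestion_charge: float) -> float:
    """Fraction of commuters who drive under a given congestion charge."""
    car_cost = 6.0 + congestion_charge + value_of_time * 0.6    # fuel/parking + 36-minute drive
    train_cost = 3.0 + value_of_time * 0.8                      # fare + 48-minute ride
    return float(np.mean(car_cost - car_bias < train_cost))

for charge in (0.0, 2.5, 5.0, 10.0):
    print(f"charge ${charge:>4.1f}: {share_driving(charge):.1%} of commuters drive")
```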

In finance, an LBM could simulate customer behavior in response to interest rate changes or new product offerings, helping teams test strategies before deploying them in the real world. Unlike static prediction tools, LBMs evolve dynamically. They help decision-makers understand current conditions while anticipating the ripple effects of future actions.
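
The same pattern translates directly to finance. The toy simulation below, with illustrative balances, sensitivities, and response curve, estimates how much of a deposit base might move if competitors pay a higher rate.

```python
# Toy simulation of deposit behavior under a hypothetical rate gap.
# Balances, sensitivities, and the response curve are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(11)
n_customers = 50_000
balances = rng.lognormal(mean=9.0, sigma=1.0, size=n_customers)       # synthetic deposit sizes
rate_sensitivity = rng.beta(2, 5, size=n_customers)                   # 0 = inert, 1 = rate chaser

def balances_at_risk(our_rate: float, competitor_rate: float = 0.045) -> float:
    """Share of total deposits that leaves when competitors pay more."""
    gap = max(competitor_rate - our_rate, 0.0)
    leave_prob = np.clip(rate_sensitivity * gap * 40, 0.0, 1.0)       # toy response curve
    leaves = rng.random(n_customers) < leave_prob
    return balances[leaves].sum() / balances.sum()

for rate in (0.030, 0.040, 0.045):
    print(f"our rate {rate:.2%}: {balances_at_risk(rate):.1%} of balances likely to move")
```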

The technology to generate synthetic data at scale—with precision and relevance—is already here. What remains is to fully embed it into the heart of AI development, where it can simulate the rare, the extreme, and the not-yet-happened to build systems that are smarter, safer, and more resilient.

When used thoughtfully, synthetic data allows us to model the future—not just replay the past.

Where to begin

If AI’s true value comes from building smarter agents, and that intelligence depends entirely on the data underneath, then organizations need to rethink where their competitive edge will come from. Some are sitting on decades of untouched operational data. Others, especially in regulated industries, may find more promise in simulation than in historical records. Both paths lead to the same outcome: intelligence that reflects the organization’s reality and anticipates its future.

For most enterprises, the journey starts with awareness. Few fully realize the depth and value of their own data. Fewer still are actively exploring its use in AI development.

The first step is recognizing that proprietary data isn’t just an operational byproduct—it’s a strategic asset. Then comes experimentation. Even small-scale fine-tuning efforts on open models can yield dramatic improvements in their relevance and accuracy. Alongside that, organizations must begin thinking creatively about where synthetic data might unlock new capabilities, especially in areas where data is sparse, risky, or unavailable.

Companies must move beyond the idea that AI is solely about automation. It’s not merely a faster way to do old things. It’s a new way to see, think, and decide. When intelligence is the product, data is the raw material. The better the data, the better the decisions.

Most of what’s on the Internet has already trained the models of today. But the models of tomorrow will be trained on data that’s never been seen before.

And that data might already be yours.
