
The AI Agent boom's replication trap
The year 2025 has been unmistakably marked as the year of AI Agents. New startups are emerging every day, each claiming to transform workflows through intelligent automation. While some deliver meaningful value by solving real problems, many rely solely on prompt engineering layered over public models.
This has led to a growing recognition in emerging agent marketplaces: most AI agents are built on the same foundation, making them largely interchangeable. The reason is simple: nearly all of them rely on general-purpose large language models (LLMs), trained on the same publicly available internet data. These models are typically used as-is, without any additional fine-tuning or augmentation.
As a result, the intelligence behind most agents doesn't come from exclusive insight or proprietary capabilities; it comes from creative prompting layered on top of shared infrastructure. In this environment, differentiation becomes superficial, replication is trivial, and the barrier to entry remains low. What's missing is a meaningful moat.
This hype isn't like the others
A lack of differentiation isn't the only thing that sets the AI revolution apart from any tech wave we've seen before. The iPhone brought mobile-first thinking and location-aware apps. Cloud computing turned servers into scalable infrastructure. In both cases, complexity increased, demanding new skill sets and new types of professionals to manage it.
AI is different.
AI's purpose is to reduce complexity by removing friction between humans and machines. We don't want another tool; we want instant intelligence on demand. AI doesn't require new user behaviors; it adapts to ours.
That's what makes this wave fundamentally more disruptive. It doesn't need experts to unlock its value. It can remove the expert altogether.
But there's a paradox at the heart of it: AI's intelligence today is largely historical. These models reflect the past because they're trained on public data rooted in the past. We've extracted immense value from that data, but it was never meant to answer questions no one has asked yet.
The limits of internet-trained models
Today's leading LLMs have been trained on a staggering volume of publicly available content: books, websites, code repositories, Wikipedia articles, Reddit threads. The training data is undeniably vast, but by its nature, it's also finite and frozen in time.
You can only predict so much about the future using historical information. These models excel at summarizing, organizing, and synthesizing what humanity has already written or shared, but they remain fundamentally backwards-looking.
Models built on internet-scale data inherit both strengths and limitations. They are fluent, informed, and impressively versatile, but they tend to mirror what people have already done, rather than explore what's still unexplored. They reinforce prevailing narratives instead of challenging them. And while they reflect widespread knowledge, they rarely capture the deep, operational insight that lives inside specific industries or organizations.
From a business perspective, this poses a challenge. Echoing existing ideas adds little strategic value, especially when the same underlying intelligence is available to everyone. The result is commoditization. The intelligence begins to feel generic, standardized, and easily replaceable.
The next leap in AI-driven outcomes won't come from reusing yesterday's knowledge. It will come from building something fundamentally different: from tapping into intelligence that others simply don't have access to.
The untapped power of proprietary data
That difference begins with data that no one else possesses.
Many organizations have already taken the first step by using Retrieval-Augmented Generation (RAG) or embedding search to connect LLMs with internal documents and databases. These approaches work well for surfacing known facts or answering context-specific questions, but they only scratch the surface of what proprietary data can offer.
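As a rough sketch of that first step, the snippet below embeds a handful of internal records, retrieves the closest matches for a question, and assembles them into a grounded prompt. The documents are invented, and the embed function is a hashing stand-in for whatever real embedding model an organization would use; only the retrieve-then-ground pattern is the point.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical internal documents; in practice these would come from wikis,
# tickets, contracts, telemetry notes, and other proprietary sources.
documents = [
    "Refund requests over $500 require approval from the regional finance lead.",
    "Telemetry field pump_delta above 0.8 indicates imminent seal failure.",
    "Enterprise customers on legacy contracts renew on a 14-month cycle.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(query: str) -> str:
    # In a real system this prompt would be sent to an LLM; returning it here
    # simply shows how retrieved context grounds the generation step.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("When does a refund need extra approval?"))
```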
Every enterprise, no matter its size, possesses internal datasets so specific, so rich in structure and context, that they could play a transformative role in building highly personalized models: models that offer insights no general-purpose system could ever deliver. Transactional histories, product telemetry, customer service transcripts, operational records: these data points are born from decades of doing business, shaped by industry nuance, and deeply aligned with an organization's reality.
Models fine-tuned on this data don't just pull relevant answers; they internalize how the business works. They speak your company's language, understand product-specific workflows, and recognize subtle customer signals. They become native to the organization they serve, offering intelligent responses that no external agent could replicate.
Unlike the internet, this data isn't publicly available. It's not indexed. It hasn't been used to train foundation models. That's exactly what makes it so powerful.
Modeling the future with synthetic data
Until recently, using proprietary data to build advanced models required rare expertise and costly computing infrastructure. But that's changing fast. With more efficient training methods and the rise of open-source model ecosystems, the ability to create bespoke AI is now within reach for a growing number of enterprises. Those that seize this opportunity will gain a competitive edge that's hard to duplicate, and even harder to overtake.
Yet even the richest proprietary datasets can fall short. In some cases, the necessary data simply doesn't exist. In others, it may be impossible, or unethical, to collect.
This is where synthetic data becomes indispensable. Its potential across industries like healthcare, finance, and manufacturing is vast. Consider the challenge of building models to assess patient outcomes, predict transaction flows, or crash-test products. Traditional methods often require expensive, time-consuming processes. In contrast, AI-enhanced simulations powered by synthetic data offer a more dynamic and scalable alternative, one that can generate millions of scenarios and adapt continuously as conditions evolve.
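As a deliberately simple sketch of that idea, the snippet below sweeps a single invented "stress" parameter through a parameterized transaction simulation to produce a large synthetic table of scenarios, including extreme conditions that historical records may never have captured. Every distribution and constant here is made up for illustration; a real pipeline would calibrate them against whatever ground-truth data exists and generate far more scenarios.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)

def simulate_day(base_volume: float, stress: float) -> dict:
    """Generate one synthetic day of transaction activity.

    stress is an invented knob (0 = normal, 1 = severe disruption) that
    pushes volume down and failure rates up.
    """
    volume = rng.poisson(base_volume * (1.0 - 0.4 * stress))
    failure_rate = min(0.5, rng.beta(2, 50) + 0.1 * stress)
    failed = rng.binomial(volume, failure_rate)
    return {"stress": stress, "volume": int(volume), "failed": int(failed)}

# Sweep the stress knob to cover rare and extreme conditions; scaled down to
# 100,000 scenarios here, but the same loop extends to millions.
scenarios = [simulate_day(base_volume=10_000, stress=s)
             for s in rng.uniform(0.0, 1.0, size=100_000)]
synthetic = pd.DataFrame(scenarios)
print(synthetic.describe())
```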
As self-learning models are embedded into workflows, the AI begins to conceptualize a full range of possible futures: testing assumptions, optimizing for edge cases, and surfacing insights long before a human even needs to intervene.
This isn't science fiction. Unlike historical data, which is anchored in what has already happened, synthetic data enables training on what could happen. It is speculative, predictive, and fundamentally forward-looking.
Synthetic digital twins and large behavioral models
The foundation for this kind of simulation already exists in the form of digital twins. A digital twin is a virtual replica of a physical object, process, or environment, designed to mirror its real-world counterpart in real time. The concept dates back to NASA's Apollo program, where mirrored systems on Earth were used to troubleshoot in-flight spacecraft behavior and test real-time solutions.
Since then, digital twins have evolved into powerful engines for generating synthetic data. Cities like Singapore have developed comprehensive digital replicas to simulate traffic flows, infrastructure stress, and energy usage, creating real-time labs for urban optimization. These simulations are made possible by synthetic data, and the possibilities are only expanding.
One particularly promising frontier is the rise of Large Behavioral Models (LBMs). While LLMs focus on language prediction, LBMs go a step further by modeling decision-making behavior. They aim to understand not just what people say, but what they do: how they respond to stimuli, how systems adapt over time, and how complex dynamics unfold.
These models are trained on sequences of actions, interactions, and outcomes. That data often comes from digital twins, agent-based simulations, or anonymized behavioral logs. In a smart city environment like Singapore's, an LBM could learn how traffic responds to policy changes, construction, or population shifts. Urban planners could then simulate thousands of "what if" scenarios without closing a single street.
In finance, an LBM could simulate customer behavior in response to interest rate changes or new product offerings, helping teams test strategies before deploying them in the real world. Unlike static prediction tools, LBMs evolve dynamically. They help decision-makers understand current conditions while anticipating the ripple effects of future actions.
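The sketch below hints at how that kind of behavioral training data can be produced: an agent-based simulation rolls a population of synthetic customers through an interest-rate scenario and logs a (month, market rate, action) event stream. The hard-coded refinance rule is an invented placeholder; the premise of an LBM is that it would learn such behavior from real and simulated event sequences rather than have it written by hand.

```python
import random

random.seed(11)

def customer_step(current_rate: float, market_rate: float) -> str:
    """Invented, oversimplified decision rule: the wider the gap between a
    customer's current loan rate and the market rate, the more likely the
    customer is to refinance."""
    gap = current_rate - market_rate
    p_refinance = max(0.0, min(0.9, 0.2 + 1.5 * gap))
    return "refinance" if random.random() < p_refinance else "hold"

def simulate_policy(market_rates: list[float], n_customers: int = 1000):
    """Roll a population of synthetic customers through a rate scenario,
    yielding (month, market_rate, action) events for behavioral training."""
    events = []
    rates = [random.uniform(0.03, 0.07) for _ in range(n_customers)]
    for month, market_rate in enumerate(market_rates):
        for i, rate in enumerate(rates):
            action = customer_step(rate, market_rate)
            if action == "refinance":
                rates[i] = market_rate
            events.append((month, market_rate, action))
    return events

# "What if" scenario: rates fall sharply over six months.
log = simulate_policy([0.06, 0.055, 0.05, 0.045, 0.04, 0.035])
refis = sum(1 for _, _, action in log if action == "refinance")
print(f"{refis} refinance events out of {len(log)} customer-months")
```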
The technology to generate synthetic data at scale, with precision and relevance, is already here. What remains is to fully embed it into the heart of AI development, where it can simulate the rare, the extreme, and the not-yet-happened to build systems that are smarter, safer, and more resilient.
When used thoughtfully, synthetic data allows us to model the future, not just replay the past.
Where to begin
If AI's true value comes from building smarter agents, and that intelligence depends entirely on the data underneath, then organizations need to rethink where their competitive edge will come from. Some are sitting on decades of untouched operational data. Others, especially in regulated industries, may find more promise in simulation than in historical records. Both paths lead to the same outcome: intelligence that reflects the organization's reality and anticipates its future.
For most enterprises, the journey starts with awareness. Few fully realize the depth and value of their own data. Fewer still are actively exploring its use in AI development.
The first step is recognizing that proprietary data isn't just an operational byproduct; it's a strategic asset. Then comes experimentation. Even small-scale fine-tuning efforts on open models can yield dramatic improvements in their relevance and accuracy. Alongside that, organizations must begin thinking creatively about where synthetic data might unlock new capabilities, especially in areas where data is sparse, risky, or unavailable.
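As one hedged illustration of how small that first experiment can be, the sketch below uses the Hugging Face transformers, datasets, and peft libraries to attach LoRA adapters to a small open model (gpt2, chosen only to keep the example cheap to run) and run a single training pass over a couple of invented proprietary records. Model choice, adapter settings, and data formatting would all look different in a serious effort; the point is that the tooling for modest, domain-specific fine-tuning is readily available.

```python
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Hypothetical proprietary records rendered as plain text; in practice these
# would be drawn from support transcripts, runbooks, telemetry notes, etc.
records = [
    "Customer reported error E-417 after the 3.2 firmware update; resolved by re-flashing.",
    "Churn risk is flagged when monthly usage drops below 20% of contract volume.",
]

base = "gpt2"  # small open model used purely to keep the sketch lightweight
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base weights and trains small adapter matrices, which is
# what makes modest, domain-specific fine-tuning affordable.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM"))
model.print_trainable_parameters()

dataset = Dataset.from_dict({"text": records}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=128))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```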
Companies must move beyond the idea that AI is solely about automation. It's not merely a faster way to do old things. It's a new way to see, think, and decide. When intelligence is the product, data is the raw material. The better the data, the better the decisions.
Most of what's on the internet has already trained the models of today. But the models of tomorrow will be trained on data that's never been seen before.
And that data might already be yours.



