
Model Collapse Is Coming: And It’s About to Break the Way AI Learns From the Web

By Tal Melenboim

The internet has become awash with AI-generated data, with tools like ChatGPT, Gemini, Grok and DALL-E spitting out millions upon millions of blog posts, images and AI videos. Such content is not only common, but in many cases it’s also fairly convincing, to the point that not everyone can tell it was created by AI.  

As impressive as it may seem, it’s causing enormous complications for AI builders, as it means their biggest data resource has become heavily polluted. The problem is that training AI systems on AI-generated data is the fastest way to bring about so-called “model collapse”.  

AI model collapse is not just a theoretical concern. It has been observed in repeated studies. One of the most widely cited is a paper first circulated as “The Curse of Recursion: Training on Generated Data Makes Models Forget” and later published in Nature in 2024. It explained how feedback loops form when large language models are trained on synthetic data, causing the model to drift away from the real-world data distribution it’s supposed to be aligned with. When this happens, its outputs become less accurate, more repetitive and, sometimes, simply wrong.

What’s Model Collapse? 

Developers have become increasingly concerned by this phenomenon because the internet is not what it used to be. When OpenAI trained the models behind the original ChatGPT, released in late 2022, it did so by scraping vast amounts of content from the internet. At that point, doing so wasn’t a problem. Virtually everything published online was created by humans, meaning it was full of “original” data, and much of it was accurate.

The result was spectacular – ChatGPT dazzled the world and everyone wanted to use it. Competing models emerged in their hundreds, and before you could even blink there were AI chatbots and image generators everywhere, integrated into everything from Gmail and Microsoft Word to Instagram and Facebook. What followed was a tidal wave of AI-generated content.

Now, AI developers are struggling to deal with it. Like a snake eating its own tail, new AI models are increasingly being trained on content created by their predecessors, and it doesn’t bode well. The paper in Nature explained how model collapse occurs: when AI systems are repeatedly trained on artificial content, they become misaligned from the true data distribution, producing increasingly distorted outputs that worsen with each new iteration of the model. This doesn’t just affect their performance on benchmarks. It undermines the reliability of AI models and seriously erodes the quality of the content or responses they generate.
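To make the mechanism concrete, here is a minimal toy sketch in Python – an illustration of the feedback loop, not code from the Nature study. The “model” is nothing more than an empirical distribution over tokens that is repeatedly refit on its own samples; rare tokens that go unsampled in one generation drop to zero probability and can never come back, which is exactly the loss of low-probability events the researchers describe.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" distribution over 1,000 tokens, with a long tail of rare ones.
true_probs = np.arange(1, 1001, dtype=float) ** -1.1
true_probs /= true_probs.sum()

# Each generation "trains" by estimating token frequencies from a finite
# sample, and the next generation trains only on what that model produces.
probs = true_probs
for generation in range(10):
    sample = rng.choice(len(probs), size=20_000, p=probs)
    counts = np.bincount(sample, minlength=len(probs))
    probs = counts / counts.sum()          # refit on purely synthetic data
    surviving = int((probs > 0).sum())     # tokens lost here never return
    print(f"gen {generation}: {surviving} of 1000 tokens still carry probability mass")
```

Run it and the number of surviving tokens can only shrink: each synthetic generation quietly erases a little more of the tail of the original distribution.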

Fixing It Requires A Human Touch  

As the internet becomes increasingly polluted with synthetic data, AI engineers working on new foundation models must take steps to mitigate model collapse, and that means reinventing the model training process.

The simplest way to do this is to ensure that humans remain in the loop throughout the model training process. By carefully curating datasets created by humans and integrating human feedback at every step, it’s easier to make sure models remain aligned with the true data distribution, reducing errors in their outputs.  

That sounds simple, but how can we do this in practice? One of the most effective techniques is RLHF, or reinforcement learning from human feedback, where models are rewarded for generating accurate, informative outputs aligned with human reasoning, and penalized for inaccurate or fabricated ones. It’s a process that anchors models in rich, meaningful data.
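To give a flavour of how this works – a hedged sketch of the idea, not any lab’s actual pipeline – the snippet below trains a toy reward model on pairwise human preferences using the Bradley–Terry-style loss commonly used in RLHF reward modelling. The feature vectors and the hidden “human preference” direction are invented stand-ins for real model outputs and real annotators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each candidate response is represented by a feature vector,
# and humans compare pairs of responses, marking which one they prefer.
dim = 8
true_w = rng.normal(size=dim)            # hidden "human preference" direction

def make_pair():
    a, b = rng.normal(size=(2, dim))
    prefer_a = float(true_w @ a > true_w @ b)   # simulated human label
    return a, b, prefer_a

# Reward model: a linear scorer fit to the pairwise preferences with the
# Bradley-Terry loss used in RLHF reward modelling.
w = np.zeros(dim)
lr = 0.1
for step in range(2_000):
    a, b, prefer_a = make_pair()
    p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))  # P(model prefers a over b)
    grad = (p - prefer_a) * (a - b)             # gradient of the log-loss
    w -= lr * grad

print("alignment with human preference:",
      (w @ true_w) / (np.linalg.norm(w) * np.linalg.norm(true_w)))
```

In a real system the learned reward model then scores the language model’s outputs during a reinforcement-learning stage, so responses humans would rate highly are reinforced and fabricated or low-quality ones are discouraged.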

We can also implement error correction loops, where humans review a model’s outputs, correct its mistakes and feed those corrections back into the training data. In doing this, it’s possible to prevent errors from accumulating over generations of models. As a final step, humans must also step in to validate low-probability events, which means checking for rare cases that may otherwise be lost. It helps models preserve information about less common events in their original training data. Such rigorous feedback can help AI systems maintain their alignment with the original data distribution and prevent the degradation of model performance as they evolve.
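A minimal sketch of what such a correction loop might look like in code is below. The Example type, the reviewer callback and the rare-case threshold are all hypothetical names used for illustration, not any particular vendor’s tooling.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    response: str
    source: str            # "human" or "human_corrected"

def review_batch(model_outputs, reviewer, training_set, rare_threshold=0.05):
    """Hypothetical human-in-the-loop pass over a batch of model outputs.

    `model_outputs` holds (prompt, response, model_confidence) tuples and
    `reviewer` stands in for the human annotation step: it returns the
    response unchanged if approved, or a corrected version if not.
    """
    for prompt, response, confidence in model_outputs:
        corrected = reviewer(prompt, response)
        if corrected != response:
            # Corrections, not raw model text, flow back into the training
            # data, so errors don't accumulate across model generations.
            training_set.append(Example(prompt, corrected, "human_corrected"))
        elif confidence < rare_threshold:
            # Rare, low-probability cases are explicitly validated and kept,
            # so information about uncommon events isn't quietly lost.
            training_set.append(Example(prompt, corrected, "human"))
    return training_set

# Toy usage with a trivial "reviewer" that fixes one known mistake.
outputs = [("capital of France?", "Lyon", 0.9), ("capital of Nauru?", "Yaren", 0.01)]
fixes = {"Lyon": "Paris"}
print(review_batch(outputs, lambda p, r: fixes.get(r, r), []))
```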

No More Scraping 

To avoid the risks of model collapse, AI training will inevitably shift to smaller, highly curated, high-quality datasets with a much narrower focus. We’ll see fewer of the “general-purpose” LLMs that can do everything from writing and image generation to coding and maths, replaced by specialized models designed to do just one thing – and be very good at it.

They’ll be trained on much more refined datasets with plenty of human input, as described above, and these datasets will become far more valuable, helping to offset the increased cost of producing them. Flourishing data marketplaces will emerge where these datasets are verified, licensed, bought and sold, with certifications and standards for sensitive industries like healthcare and finance. Similarly, we’ll see new data cooperatives form, where organizations agree to share verified data, reducing the need to do everything themselves.

Meanwhile, the old approach of scraping the internet and helping yourself to whatever’s online will become much less common. Because this data is so polluted, it’s no longer good enough.  

Unfortunately, life will be much harder for AI startups. The high-quality, human-curated datasets they need won’t come cheap, and the cost will likely eat up a huge portion of their limited budgets. To get around this, they’ll likely have to focus on narrower domains, where it’s possible to create the datasets they need themselves, or obtain data from their users, perhaps by offering incentives or free access to the services they build. Alternatively, startups can create data cooperatives to share data among themselves, or pool their resources to license content at lower costs.

It’s going to involve more thinking outside of the box, but at the same time, it creates an opportunity for startups to differentiate themselves and create something worthwhile. If their models are trained on better, richer data, they’ll create greater value for users and build something that’s worth paying for.  

The Data Gold Rush 

AI developers won’t be able to take shortcuts to succeed. AI systems will only ever be as good as the data they’re built on, and that means putting more effort into collection and curation to create datasets customized to the specific problems they’re trying to solve.  

But the more effort companies put into it, the bigger their competitive advantage will be, setting them up to create something truly transformative and sustainable. The true gold rush of the AI era won’t be in the algorithms themselves, but in the meticulously crafted data that guides them.
