
Google’s recent decision to remove the num=100 search parameter from Google Search might have seemed innocuous, but for AI developers it was a crushing blow, severely hobbling their ability to use the public internet as a source of training data.
AI systems rely heavily on indexed search results, and by slashing visibility from 100 results to just 10, Google has made deep analysis of the web almost impossible. The decision worsens an already crippling shortage of high-quality data for AI training.
Developers are struggling with limits imposed on the publicly available content they can access. This was highlighted in a 2024 research paper from the Data Provenance Initiative, which analyzed 14,000 web domains to understand the restrictions being placed on AI crawlers. The authors found a “rapid crescendo of data restrictions” over the preceding 12 months, with content creators actively limiting the ability of algorithms to access their web pages.
Much of what’s left of the public internet has already been scraped repeatedly and fed into today’s top AI models anyway, and the lack of new data is forcing developers to rethink their AI training strategies. Some have switched to condensed, domain-specific datasets and a more focused approach that involves training models to do one thing specifically, such as maths or image creation, as opposed to building large general-purpose models. It’s a logical solution: if the datasets are smaller, developers won’t need nearly as much data to train effective models.
Data Tokens Are The Answer
The question is, how can we source these narrow datasets and get them into the hands of developers? This is where blockchain comes in. It allows anyone to upload their data to a distributed network and create digital tokens that represent ownership of it. The blockchain facilitates seamless transfer of those tokens from creators to developers, and also supports “fractional” ownership, where dozens of researchers can buy fractions of tokens to access data collectively, lowering their costs.
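To make the fractional-ownership idea concrete, here is a minimal Python sketch of the bookkeeping involved. All names and numbers are hypothetical: a real system would implement this on-chain as a fungible token standard, but the model is the same, with one dataset token divided into units that several buyers can hold.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetToken:
    """Hypothetical token representing ownership of one uploaded dataset."""
    dataset_id: str
    creator: str
    total_units: int = 1_000  # one token divided into 1,000 fractional units
    holdings: dict = field(default_factory=dict)

    def __post_init__(self):
        # The creator starts out owning every fractional unit.
        self.holdings[self.creator] = self.total_units

    def transfer(self, sender: str, buyer: str, units: int) -> None:
        # Move fractional units from one holder to another.
        if self.holdings.get(sender, 0) < units:
            raise ValueError("insufficient units")
        self.holdings[sender] -= units
        self.holdings[buyer] = self.holdings.get(buyer, 0) + units

    def share(self, holder: str) -> float:
        # A holder's fractional ownership as a proportion of the whole token.
        return self.holdings.get(holder, 0) / self.total_units

token = DatasetToken("crop-images-2025", creator="farmer_coop")
token.transfer("farmer_coop", "lab_a", 250)  # lab_a buys a 25% fraction
token.transfer("farmer_coop", "lab_b", 100)  # lab_b buys a 10% fraction
print(token.share("lab_a"))  # 0.25
```

Because every transfer is recorded, the ledger always shows exactly who holds what fraction of each dataset, which is what lets dozens of researchers pool their purchasing power.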
Tokenized datasets offer significant advantages. Not only are they easily divisible and tradable, they’re transparent and publicly verifiable too. Smart contracts can be used to create and enforce revenue streams for data creators, ensuring they’re fairly rewarded. Blockchain-based interactions are driven by supply and demand, which means the richest and most unique datasets will command greater value. In other words, tokenized data can be transformed into a new, investable asset class that drives a fresh wave of innovation in AI.
The beauty of this concept is that everyone can contribute. For instance, an agricultural model that’s designed to detect crop disease could be trained on thousands of images supplied by farmers who snap photos of diseased crops using their smartphones. Alternatively, healthcare organizations can donate anonymized images of medical scans to support the development of diagnostic AI models.
Blockchain supports two key functions that decentralized data markets need in order to thrive: verification and reputation. The value of data depends on its accuracy and reliability, and blockchain’s transparency makes it possible for anyone to verify a dataset’s origins and quality. Contributors who consistently submit good data build their reputations over time, because every dataset they create can be traced back to them. Communities can play a part too, with individuals reviewing datasets for quality and earning rewards for honest reviews.
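The verification-and-reputation loop can be sketched in a few lines of Python. This is an illustrative simulation, not a production protocol: each submission is identified by a content hash (so anyone can check a dataset hasn’t been altered), provenance links that hash to its contributor, and community review scores accumulate into a reputation.

```python
import hashlib

def dataset_fingerprint(data: bytes) -> str:
    # A content hash lets anyone verify the dataset is exactly
    # what the contributor originally submitted.
    return hashlib.sha256(data).hexdigest()

class ReputationLedger:
    """Hypothetical ledger tracing datasets to contributors."""

    def __init__(self):
        self.provenance = {}   # fingerprint -> contributor
        self.reputation = {}   # contributor -> cumulative quality score

    def submit(self, contributor: str, data: bytes) -> str:
        fp = dataset_fingerprint(data)
        self.provenance[fp] = contributor
        self.reputation.setdefault(contributor, 0.0)
        return fp

    def review(self, fingerprint: str, quality: float) -> None:
        # A community reviewer scores the dataset; the score flows
        # back to whoever the fingerprint traces to.
        contributor = self.provenance[fingerprint]
        self.reputation[contributor] += quality

ledger = ReputationLedger()
fp = ledger.submit("farmer_42", b"maize leaf images, batch 1")
ledger.review(fp, quality=0.9)
print(ledger.reputation["farmer_42"])  # 0.9
```

The key property is that reputation is earned per dataset and permanently attached to a verifiable identity, so good contributors compound their standing over time.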
Cryptocurrency is borderless, which means tokenized data can be inclusive of contributors from anywhere in the world. That isn’t possible with fiat, where high fees and a lack of banking infrastructure shut many people out. So long as someone has a smartphone with an internet connection, they can send and receive micropayments instantly, without a bank account, giving everyone the opportunity to participate in the data economy. The result is more diverse datasets that originate from every corner of the globe, reducing bias in AI outputs.
Rewarding The Biggest Contributors
With crypto-based payments and foolproof verification, we have the foundation in place to create thriving decentralized data markets that will operate according to standard supply and demand principles.
Consider the example of an agricultural AI model that detects crop disease. A farmer in Malawi can play a role in its development by uploading photos of an infected maize crop. The farmer’s images will be tokenized and verified, contributing to a global AI data supply chain coordinated by cryptographic protocols and community governance. The quality of those images will determine their value, and consequently, the amount of rewards the farmer earns. When AI models access those images to process a prompt, the interaction will be recorded on the blockchain and smart contracts will automatically send a micropayment to the farmer. The more often that model is used, the more it will query that dataset, increasing the rewards the farmer can earn.
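The pay-per-query flow above can be sketched as a small Python simulation. Names and rates are illustrative assumptions, and accruals are kept in integer micro-units (thousandths of a token) to avoid rounding: every query is logged, and the dataset owner’s balance is credited automatically, as a smart contract would do on-chain.

```python
RATE_PER_QUERY = 1  # illustrative micropayment, in thousandths of a token

class DataMarket:
    """Hypothetical market crediting data owners per model query."""

    def __init__(self):
        self.owners = {}     # dataset_id -> contributor
        self.balances = {}   # contributor -> accrued micro-units
        self.log = []        # append-only record of accesses

    def register(self, dataset_id: str, contributor: str) -> None:
        self.owners[dataset_id] = contributor
        self.balances.setdefault(contributor, 0)

    def query(self, model_id: str, dataset_id: str) -> None:
        # Record the access and pay the dataset's owner automatically.
        owner = self.owners[dataset_id]
        self.log.append((model_id, dataset_id))
        self.balances[owner] += RATE_PER_QUERY

market = DataMarket()
market.register("maize-photos", "malawi_farmer")
for _ in range(500):  # the model serves 500 prompts against this dataset
    market.query("crop-disease-v1", "maize-photos")
print(market.balances["malawi_farmer"])  # 500
```

Because earnings scale linearly with usage, the farmer’s income grows with the model’s adoption, which is exactly the incentive loop the article describes.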
For AI startups, this is beneficial because they won’t have to pay enormous amounts of cash upfront to access training data. Instead, they’ll pay as they grow, once the revenue starts flowing in. It’s easy to envisage how this ecosystem might expand organically over time. The rewards for providing data will attract contributors seeking an income. They’ll compete to provide higher quality data, increasing the volume available. This will entice more developers who are hungry for data to increase the sophistication of their models. As adoption of those models increases, so does the value flowing back to the data providers, making it even more lucrative.
A Data Economy For All
The AI industry is growing at breakneck speed and the effects of the data shortage are already being felt, with websites limiting access and content creators filing a flood of lawsuits against alleged transgressors. Developers are desperate for an alternative to scraping the web, and decentralized data markets are an enticing solution.
Blockchain likely won’t be the only fix for AI’s data conundrum. There’s merit to other ideas, such as synthetic data and data consortiums, where companies share private data with their peers under strict limits on how it can be used. But decentralized data economies are perhaps the most compelling and practical, given their potential to utilize existing infrastructure and incentivize everyone to participate in the AI revolution.



