
The conversation around artificial intelligence has been dominated by models and benchmark scores. Far less attention has been given to a more important question. Where does the training data actually come from, and how does it physically reach the model in the first place?
Inside the data infrastructure side of the internet, this question has become one of the most important issues in AI development. The model is the visible layer. The collection layer underneath it is doing most of the work, and almost nobody outside the industry sees it.
This article describes that layer in detail, including the parts that the AI industry has not been forced to think about yet.
The Visible Model and the Invisible Pipeline
When a new AI system is announced, the headline is almost always about capabilities. It can write code, analyze documents, generate images, hold long conversations. The training part is mentioned briefly, usually as a number of tokens, sometimes as a list of categories.
What is rarely described is the pipeline that produced those tokens. A modern AI training set can require hundreds of terabytes of web data. That data does not appear by itself. It is collected through a lot of automated systems that touch real websites, real servers, and real people.
Each link in that chain has its own quality, ethics, and legal profile. The model only knows what reaches it. If something upstream is broken, biased, or compromised, the model inherits that problem silently.
This is the infrastructure layer. It deserves more scrutiny than it gets.
How Web-Scale Data Actually Gets Collected
The simplified version is that AI companies need a representative sample of the internet, and the internet does not hand it over easily. Most large websites actively defend against automated collection through rate limits, fingerprinting, CAPTCHA systems, and detection algorithms designed to separate real users from bots.
To collect at the scale modern AI requires, teams typically take one of two paths. The first is to build their own collection infrastructure, which means scraping at high volume from a wide range of public sources. The second is to buy datasets from companies that have already done this work.
Both paths depend on the same underlying capability. The collector needs to access websites in a way that looks like normal human traffic, from a wide range of geographic and network locations, without being blocked or rate-limited.
This is where the infrastructure question becomes interesting, because the way that traffic is generated turns out to matter quite a lot.
The IP Layer Is Where Quality Is Decided
Every request to a website carries an IP address. To the receiving server, that address is the closest thing to an identity. It signals where the request came from, what kind of network it originated on, and whether the address has a history of suspicious activity.
For AI data collection, the type of IP used at this stage shapes everything that follows. Datacenter IPs are easy and cheap to obtain but trivially detectable. Residential IPs, which originate from real consumer internet connections, look like ordinary users and pass through filters that block automated traffic.
The quality difference between these two categories is enormous. A request from a residential IP can read content that a datacenter IP cannot reach at all. For an AI training set, this is the difference between a representative sample of the internet and a sample biased toward whatever is left over after the defenses kick in.
Most AI teams underestimate how much of their training data was shaped by this filter before they ever saw it.
Why Standard Quality Metrics Are Misleading
The proxy and IP intelligence industry has settled on a few metrics that are widely used to evaluate IP quality. The most common is the fraud score, produced by services that track how often an IP appears in suspicious traffic patterns.
These scores were designed for fraud prevention, not for AI data collection, and the difference matters more than people assume. A residential IP can carry a high fraud score simply because it has been used heavily for legitimate but high-volume tasks like data collection. The IP itself is not compromised. The score is a reflection of activity, not character.
In practical field testing, residential IPs with high fraud scores often work perfectly well for accessing complex websites, while cleaner-looking IPs from mixed pools fail at much simpler tasks. The reason is that detection systems on real websites do not rely solely on fraud scores. They look at full request fingerprints, browser behavior, session patterns, and dozens of other signals.
For AI teams using these metrics as a proxy for data quality, this is a problem. They may be filtering out exactly the kind of access that produces the broadest training data and keeping the kind that produces the narrowest.
The Sourcing Question Almost Nobody Asks
There is a more uncomfortable issue further upstream, which is how the residential IPs used for large-scale data collection are obtained in the first place.
The legitimate model is straightforward. A company offers an application or service to consumers, and in exchange those consumers opt in to share a small portion of their idle internet bandwidth. They are paid, either in cash or in access to a free service.
The terms are clear, and the consumer knows what is happening. Providers that follow this consent-based sourcing model, for example services like CatProxies, build their pools through transparent SDK partnerships rather than hidden bundling.
The illegitimate version uses the same end result, but the consent is missing. Free VPN applications, ad-supported utilities, and software bundles sometimes route user traffic through a commercial proxy network without meaningfully disclosing it. In some cases, the disclosure is buried deep in terms of service. In the worst cases, there is no disclosure at all.
This is the part of the supply chain that the AI industry has largely ignored. When training data is collected through an IP that came from an unaware consumer, the legality and ethics of that data become questionable regardless of how clean the model itself is.
Why This Matters for AI, Not Just Proxies
The natural response to this is that it sounds like a proxy industry problem, not an AI problem. The opposite is closer to the truth. The AI industry has positioned itself as the most consequential technology of the decade, and the data underneath these systems is one of its most sensitive inputs.
If a portion of an AI training corpus was collected through compromised consumer connections, that fact does not stay hidden forever. It becomes a regulatory question, a reputational question, and eventually a commercial question. The companies that can document a clean supply chain will have an advantage that is hard to replicate after the fact.
There are already early signals of this shift. The Niantic story, in which the company reportedly used a free mobile game to collect detailed mapping data later sold for AI applications, was disclosed in their stated terms from the start. That arguably makes it a defensible example of consent-based collection.
Other reported cases, such as the use of large book datasets without permission, sit in murkier territory and have generated visible legal pressure. The pattern is that the upstream sourcing question is no longer a backroom concern, and companies that have not thought carefully about it are exposed.
What AI Teams Should Actually Be Asking
For teams building or buying data pipelines, the practical implications come down to a small set of questions that are worth asking out loud.
The first is where the data physically came from. If a dataset was scraped, what was the IP layer underneath the scraping, and how were those IPs sourced? If the dataset was purchased, what does the seller disclose about their own collection methods?
The second is whether the quality metrics being used reflect actual data quality or just convenient proxies for it. Fraud scores, block rates, and IP reputation databases are useful, but they were not built for AI use cases and can mislead in subtle ways.
The third is whether the collection method is sustainable under future regulation. Data privacy law is moving in a clear direction across the EU, the UK, and increasingly the US. Collection methods that depend on uninformed consent are unlikely to remain viable.
These are not technical questions. They are infrastructure questions, and they deserve the same level of attention that AI safety now receive.
A More Mature Conversation About AI Infrastructure
The AI industry is at the point where the model is no longer the only thing that matters. The pipeline behind it, the data that feeds it, and the infrastructure that produced that data are becoming first-class concerns.
The collection layer is where most of this gets decided. It is the part of the stack that determines whether a training set is broad or narrow, ethically sourced or quietly compromised, durable under scrutiny or fragile to it. The choices made here propagate forward into every output the model eventually produces.
The companies that take this seriously now will have a significant advantage as the conversation matures. The ones that treat data collection as a solved problem are going to be surprised by how unresolved it actually is.
It is a good time for the AI industry to start asking where its training data really comes from, and to expect a real answer.

