
Every day, machine learning systems summarize clinical notes, evaluate financial risk, personalize customer experiences and optimize supply chains. Much of the data behind those systems is sensitive. As AI moves from experimentation to production, that data is also moving through more pipelines, more third-party systems, more model endpoints and more operational logs.
With this in mind, organizations have built walls around data to safeguard it and to secure it as it travels. But why does so much raw data need to move in the first place?
Data exposure doesn’t have to be inevitable
There is an important difference between protecting data and minimizing its exposure. Traditional security strategies are often built around the assumption that exposure is inevitable. Encrypt the data at rest. Encrypt it in transit. Restrict access. Monitor activity. These controls matter, and they will remain essential. But in machine learning, they do not solve the whole problem.
Conventional encryption protects data until it has to be used. To do anything useful with encrypted data, it usually has to be decrypted somewhere. Homomorphic encryption can allow computation on encrypted data and is an important area of privacy-preserving machine learning, but it often introduces significant computational overhead, latency and cost. Differential privacy injects calibrated noise to reduce the risk that individual records are exposed, but it can also affect model utility if the noise is not carefully calibrated.
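To make the differential privacy trade-off concrete, here is a minimal sketch of the standard Laplace mechanism applied to a counting query. It is a textbook illustration, not part of VEIL. The toy balances, the threshold of 500 and the function names are hypothetical choices made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at zero with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values that satisfy `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(epsilon: float, trials: int = 2000, seed: int = 0) -> float:
    """Average absolute error of the noisy count over repeated releases."""
    rng = random.Random(seed)
    balances = [120, 950, 4300, 15, 7800, 640]  # hypothetical toy data
    true_count = sum(1 for b in balances if b > 500)
    errors = [
        abs(private_count(balances, lambda b: b > 500, epsilon, rng) - true_count)
        for _ in range(trials)
    ]
    return sum(errors) / trials


if __name__ == "__main__":
    # Smaller epsilon means stronger privacy and a noisier answer.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error {mean_abs_error(eps):.2f}")
```

A smaller epsilon gives stronger privacy but a larger average error, which is the calibration problem in miniature: the same dial that protects individual records also degrades the answer the model or analyst receives.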
These approaches are valuable. The issue is that they often treat raw data exposure as the starting point. Enterprise AI needs a prevention-first architecture: one that reduces or entirely removes exposure before sensitive information reaches downstream machine learning environments.
McKinsey reported in its 2025 State of AI survey that 88 percent of respondents say their organizations regularly use AI in at least one business function, up from 78 percent the previous year. PwC found that 88 percent of surveyed senior executives planned to increase AI-related budgets over the next 12 months because of agentic AI. And Risk.net ranked information security as the number-one operational risk for the fifth year running in 2026, while AI risk entered the operational risk rankings in its own right. More AI adoption means more sensitive data will be used in more places. Without a different architecture, the attack surface grows with every successful deployment.
I have seen this problem repeatedly over nearly two decades building AI systems in environments where reliability, security and scale were not optional: fraud detection at Mastercard and Equifax, large-scale optimization at Coca-Cola and VICI Partners, and digital identity systems where data protection is central to the entire business model. The models often work. The proof of concept is promising. The hard part is deploying at enterprise scale without sending raw sensitive data into places it does not need to be.
Train on vectors, not raw data
That is the problem VEIL was designed to address. VEIL stands for Vector Encoded Information Layer. It is a privacy-preserving machine learning architecture built around a concept called Informationally Compressive Anonymization, or ICA. It tackles one of the core barriers in enterprise AI today: how to use sensitive data without exposing it.
Instead of sending raw feature data into downstream model pipelines, VEIL transforms the raw data near its source into compressed, task-aligned vector representations. Those representations are what the model trains on and what the model sees at inference time. The raw record stays inside the trusted source environment.
This is not simply another wrapper around the same pipeline. It changes what crosses the trust boundary. Names, phone numbers, account attributes, financial details or other sensitive source features should not have to travel through a model-serving stack if the model only needs the predictive signal contained in a reduced representation. VEIL is designed to preserve the signal needed for the supervised learning task while discarding information that should not be exposed downstream.
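The trust boundary described above can be sketched in a few lines. This is a hedged illustration of the data flow only, not the ICA encoder from the VEIL paper: a fixed linear projection stands in for a learned, task-aligned encoder, and it makes no claim about resistance to reconstruction. Every name here (RawRecord, SourceEncoder, downstream_score), the field list and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RawRecord:
    """A sensitive record that never leaves the trusted source environment."""
    name: str
    phone: str
    amount: float
    account_age_days: int
    txn_count_30d: int


class SourceEncoder:
    """Runs inside the source environment; its weights stay there too.

    Assumption: a plain linear projection is used as a placeholder for
    whatever learned encoder a real deployment would train.
    """

    def __init__(self, weights: Sequence[Sequence[float]]):
        # One row per output dimension, one column per numeric feature.
        self._weights = weights

    def encode(self, record: RawRecord) -> List[float]:
        # Direct identifiers (name, phone) are never read into the vector.
        features = [
            record.amount,
            float(record.account_age_days),
            float(record.txn_count_30d),
        ]
        return [sum(w * x for w, x in zip(row, features)) for row in self._weights]


def downstream_score(vector: Sequence[float], coef: Sequence[float], bias: float) -> float:
    """Downstream model: sees only the vector, never the RawRecord."""
    return sum(c * v for c, v in zip(coef, vector)) + bias


if __name__ == "__main__":
    encoder = SourceEncoder([[0.01, 0.0, 0.1], [0.0, 0.001, -0.05]])
    record = RawRecord("Jane Doe", "555-0100", 250.0, 400, 12)
    vector = encoder.encode(record)  # only this value crosses the boundary
    print(vector, downstream_score(vector, coef=[0.5, -1.0], bias=0.1))
```

The design point is the function signature: the downstream code accepts a sequence of floats and has no parameter through which a name or phone number could arrive.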
In the technical white paper I published on arXiv, “Informationally Compressive Anonymization: Non-Degrading Sensitive Input Protection for Privacy-Preserving Supervised Machine Learning,” I describe the architecture and the underlying reasoning in detail. A key principle is that downstream systems should operate on reduced-risk representations, not on raw sensitive inputs. If a downstream training environment, model endpoint, log stream or storage layer is compromised, the attacker should not receive the original records the enterprise was trying to protect.
Privacy without the usual performance penalty
The most common objection to privacy-preserving machine learning is that privacy usually comes at a cost: lower accuracy, higher latency, greater complexity or all three. That trade-off is one reason so many promising AI projects stall between proof of concept and production. Businesses do not want to expose sensitive information, but they also cannot deploy a solution that is too slow, too expensive or too degraded to be useful.
Under evaluated conditions described in the VEIL research, the architecture achieved data compression rates ranging from 95 percent to 99.96 percent while maintaining predictive utility comparable to, and in some cases exceeding, baseline model performance. That matters. It suggests that privacy and utility do not always have to sit on opposite sides of the scale.
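For a sense of scale, the short sketch below converts those percentages into size ratios. It assumes that "compression rate" means the share of the original representation size that is removed; the article does not define the term, so treat this reading and the function names as assumptions. Under that reading, 95 percent corresponds to roughly a 20-to-1 reduction and 99.96 percent to roughly 2,500-to-1.

```python
def size_ratio(compression_rate_percent: float) -> float:
    """Original-to-compressed size ratio implied by a compression rate.

    Assumption: rate = (1 - compressed_size / original_size) * 100.
    """
    retained_percent = 100.0 - compression_rate_percent
    if retained_percent <= 0:
        raise ValueError("compression rate must be below 100 percent")
    return 100.0 / retained_percent


def compressed_size(original_size: float, compression_rate_percent: float) -> float:
    """Size left after compression, in the same units as `original_size`."""
    return original_size * (100.0 - compression_rate_percent) / 100.0


if __name__ == "__main__":
    # The two endpoints of the range quoted in the text.
    for rate in (95.0, 99.96):
        print(
            f"{rate}% compression: about {size_ratio(rate):,.0f}:1, "
            f"10,000 units shrink to about {compressed_size(10_000, rate):,.1f}"
        )
```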
The reason is that enterprise models rarely need all the information contained in raw records. They need the task-relevant signal. A well-designed representation layer can concentrate that signal and remove what is unnecessary for prediction. In other words, the right question is not whether we can protect everything after exporting it. The question is whether we can avoid exporting most of it at all.
From protection to prevention
This is the shift AI security now requires. For years, organizations have focused on protecting sensitive information after it enters complex environments. The next stage is reducing the amount of sensitive information that enters those environments in the first place.
A prevention-first machine learning architecture asks a different set of questions. Can the representation be generated inside the trusted source environment? Can raw inputs, encoder parameters and sensitive gradients remain there? Can downstream training and inference operate only on vectors? Can the vector be useful for prediction without being useful for reconstruction?
This approach does not replace encryption, access control, monitoring, data governance or responsible retention practices. It should work alongside them. But it reduces the burden on every downstream control because there is less raw sensitive data to defend. It also creates a security posture that is more resilient to changing computational assumptions, including the post-quantum concerns that make long-term reliance on cryptographic secrecy alone increasingly difficult.
From security control to competitive advantage
The broader opportunity is not only security. It is access. Many organizations possess highly valuable datasets that remain underused because the privacy, governance and regulatory risk is too high. Healthcare, financial services, insurance, identity, fraud prevention and supply chain operations all contain data that could improve AI systems if it could be used safely.
The enterprises that win with AI may not be the ones with the largest models. They may be the ones that can safely unlock the most valuable data. That requires a change in mindset. Sensitive data should not be treated as something that must inevitably be copied, moved and exposed so that AI can function. It should be treated as something that can be transformed, minimized and used through safer interfaces.
Ultimately, the future of AI security will not be won by building higher walls around ever-larger pools of raw data. It will be won by designing pipelines in which sensitive data travels less, reveals less and becomes less useful to attackers at every boundary.
Train on the vector. Never expose the raw data. For enterprise AI, that is not just a security principle. It is the path from proof of concept to production.


