Dodging the trap of false AI insights

By George Tziahanas, VP of Compliance, Archive360

False AI insights are becoming an increasingly urgent challenge as enterprises ramp up their use of generative tools.  

GenAI is becoming part of day-to-day operations in many organisations – but even in personal use, the risk of inaccurate answers is well known. This is particularly true of LLMs, which are trained to produce extremely high-quality writing rather than to master the nuances of any particular field. It's not uncommon for a major public LLM to deliver an answer that looks right to its language model without being, in fact, true. In the worst cases, it can be hard to decipher why the AI reached the conclusion it did, beyond the answer sounding like the sort of thing that might follow the question asked.

Poor data, poor answers 

While this sort of AI ‘hallucination’ is a familiar phenomenon, a quieter but equally serious problem is AI systems pulling from outdated, incomplete, or inaccurate data. For example, you might ask an AI model to identify symptoms of a medical condition and unknowingly receive an answer based on a 50-year-old paper instead of current research. If you don’t have complete visibility over the sources AI is drawing from – and, crucially, how it’s processing them – this kind of inaccuracy becomes not just possible, but likely. 

In these cases, the problem is more insidious – the answer might not just sound right linguistically, but also logically: outdated information was correct at some point, even if it’s not fully accurate now. And that makes it harder to tell that an error has occurred. A large part of the benefit of GenAI is its ability to relieve people of the need for long-form information processing – but if its answers can’t be trusted unless they’re subject to thorough human checks, in a sense we’re back to square one. 

In short, as more companies integrate AI into business-critical processes, the risk of drawing false conclusions from poorly governed data is only growing. Solving this issue requires real control over training and inference data – something many enterprises still lack.   

Why ‘data quality debt’ is a major risk in AI development   

The GenAI boom is accelerating at unprecedented speed, leaving many leaders scrambling to ensure their organisation doesn't fall behind. Much of the resulting investment is directed towards the creation of the AI models themselves – the processing algorithms that turn data into insights. Far less is directed towards the data foundation itself.

Organisations looking to generate effective, helpful insights from AI need to be confident their AI tools can ingest governed data from across all enterprise applications, modern communications channels, and legacy ERP systems. This not only helps organisations realise value faster – it also reduces the risk of AI inadvertently exposing regulated data or company trade secrets, or simply ingesting faulty and irrelevant data, as in the scenario above.

The importance of feeding AI systems compliant, current, and contextually relevant data cannot be overstated. Organisations able to shift from application-centric to data-centric management will bear the fruits of AI faster, extracting valuable insights that are both accurate and compliant.

Five steps to visibility and governance over training and retrieval pipelines   

To achieve the required level of data quality, there are five key areas organisations need to master.

  1. Data lineage and provenance

This means maintaining a record of each data object's source, origin, and ownership, along with any changes to its metadata (where permitted) throughout its life cycle in the managing platform. It also means retaining rich metadata and all the underlying documents or artifacts from which the data is derived. This provides the necessary transparency, especially across large volumes and long timeframes.
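By way of illustration, a lineage record along these lines could be kept for every ingested object. The following Python sketch is purely illustrative – the ProvenanceRecord class and its fields are assumptions, not the schema of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Hypothetical lineage entry kept alongside each ingested object."""
    object_id: str         # stable identifier for the stored object
    source_system: str     # where the data originated, e.g. a legacy ERP export
    owner: str             # party accountable for the data
    ingested_at: datetime  # when the object entered the platform
    metadata_changes: list = field(default_factory=list)  # audit trail of edits

    def record_metadata_change(self, field_name: str, old, new, actor: str) -> None:
        """Append a timestamped entry for each permitted metadata change."""
        self.metadata_changes.append({
            "field": field_name,
            "old": old,
            "new": new,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })
```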

  2. Data authenticity

This requires maintaining a clear chain of custody for all data, storing objects in their native forms, and hashing objects on receipt to demonstrate the data remains unchanged. In addition, organisations must maintain a full audit history for each object, as well as for every change to policies and controls. This means analytics teams can be certain the data remains in its original form.
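As a minimal sketch of the hashing step, using only Python's standard library (the function names and audit-log shape are illustrative assumptions):

```python
import hashlib
from datetime import datetime, timezone

def ingest_object(raw_bytes: bytes, audit_log: list) -> str:
    """Hash an object at ingest so later reads can prove it is unchanged."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    audit_log.append({
        "action": "ingest",
        "sha256": digest,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return digest

def verify_object(raw_bytes: bytes, expected_digest: str) -> bool:
    """Re-hash the stored bytes and compare with the digest taken at ingest."""
    return hashlib.sha256(raw_bytes).hexdigest() == expected_digest
```

Any later transformation then happens on a copy, so the original bytes and their digest remain available as evidence.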

  3. Data classification

Establishing the nature of a set or type of data is important because it speeds up AI training. Organisations need to be able to govern structured, semi-structured, and unstructured data. Giving each class its own schema lets organisations manage diverse data sets without a one-size-fits-all fixed ontology, which makes publishing data to analytics and AI solutions more effective – the data no longer has to be manipulated to force it into an inflexible structure.
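A rough sketch of what a per-class schema registry might look like; the classes and required fields below are invented for illustration:

```python
# Hypothetical registry: each data class carries its own schema rather than
# being forced into one fixed, enterprise-wide ontology.
SCHEMAS = {
    "email": {"required": ["from", "to", "sent_at", "body"]},
    "invoice": {"required": ["invoice_id", "amount", "currency", "issued_at"]},
    "chat_message": {"required": ["channel", "author", "posted_at", "text"]},
}

def validate(record: dict, data_class: str) -> bool:
    """Check a record against the schema registered for its class."""
    schema = SCHEMAS.get(data_class)
    if schema is None:
        raise ValueError(f"No schema registered for class '{data_class}'")
    return all(key in record for key in schema["required"])
```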

  4. Data normalisation

Establishing common definitions and formats for metadata is important for analytics and AI solutions. Clearly defined schemas are a key element, along with tools that can transform or map data to maintain consistent, normalised views of related data.
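To make that concrete, a simplified mapping step might look like the following; the source names, field maps, and date format are assumptions for the sake of the example:

```python
from datetime import datetime

# Hypothetical per-source mappings onto one canonical metadata schema.
FIELD_MAPS = {
    "legacy_erp": {"doc_date": "created_at", "auth": "owner"},
    "mail_archive": {"sent": "created_at", "sender": "owner"},
}

def normalise(record: dict, source: str) -> dict:
    """Rename source-specific fields and coerce dates into ISO 8601."""
    mapping = FIELD_MAPS.get(source, {})
    out = {mapping.get(key, key): value for key, value in record.items()}
    if isinstance(out.get("created_at"), str):
        # Assumes these sources emit DD/MM/YYYY dates; a real mapper would
        # be configured with each source's actual format.
        out["created_at"] = datetime.strptime(out["created_at"], "%d/%m/%Y").date().isoformat()
    return out
```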

  5. Data entitlements

Controlling access to and use of data is critical to delivering the right data for analytics and AI solutions. Enterprises need to be able to entitle data in as granular a manner as needed, including at an object or field level, based on user or system profiles.

This means the right data is available to the users and systems entitled to access it, while access is restricted or limited for those who are not – stopping, for example, a diagnostic AI from accessing files meant for a research model.
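A simplified sketch of such a profile-based, field-level entitlement check (the profiles, classes, and field names are hypothetical):

```python
from typing import Optional

# Hypothetical entitlements: each profile lists the data classes it may read
# and any fields that must be withheld from it.
ENTITLEMENTS = {
    "diagnostic_ai": {"classes": {"clinical_note"}, "deny_fields": {"research_cohort"}},
    "research_model": {"classes": {"clinical_note", "research_paper"}, "deny_fields": set()},
}

def fetch_for(profile: str, record: dict, data_class: str) -> Optional[dict]:
    """Return the record filtered to what the profile may see, or None."""
    grant = ENTITLEMENTS.get(profile)
    if grant is None or data_class not in grant["classes"]:
        return None  # the profile is not entitled to this class at all
    return {k: v for k, v in record.items() if k not in grant["deny_fields"]}
```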

Looking ahead 

As GenAI becomes more deeply embedded in enterprise operations, the stakes for data governance are rising fast. The reliability of AI insights depends not only on the sophistication of the models themselves but – crucially – on the quality, traceability, and control of the data they’re trained on and retrieve from. Without clear oversight of these pipelines, organisations risk undermining the very efficiency gains they seek. Getting AI right starts with getting data right – and that’s a challenge no enterprise can afford to overlook. 
