Artificial Intelligence and Machine Learning are key techniques for any organisation wanting to achieve the goal of being ‘best in class’. But sadly, all too many AI and ML projects fail to reach their potential.
There can be many reasons for this, including poor goal setting, budgetary constraints, and function creep during the planning, proof of concept and realisation phases. Bad data is also a key reason for project failure. Remember the old and much-worn GIGO principle – Garbage In, Garbage Out? It is as relevant today as it ever was.
Two of the four key roles and responsibilities for AI projects outlined by analyst Gartner relate to data. Alongside AI Architect and ML Engineer, the analyst lists the Data Scientist, who is responsible for identifying use cases, determining which data sets and algorithms are required, and building AI models, and the Data Engineer, who is responsible for making the appropriate data available, with a focus on data integration, modelling, optimisation, quality and self-service. Organisations ignore these data-related roles, and the wider importance of data, at their peril. They risk ending up among the 50% of IT leaders that Gartner predicts will struggle to move their AI projects beyond proof of concept through 2023.
Work backwards – with help
No AI/ML project has a chance of being successful if there is not an accompanying data strategy. A key aspect of getting this in place is to work backward from your desired outcome to figure out what ingredients it requires.
For this, the project goals should be articulated with crystal clarity. It is not enough to say “AI will help us streamline our production plant”. That needs to be broken down into greater detail, with a focus on specific process elements.
Now it is possible to see what data is required to achieve each goal. And this can’t be generalised. It is important to list all the data elements required. These might not all exist before the project begins, and working out what they are and how they will be collected will be vital.
This is often much trickier than it sounds. External data scientists and data engineers with experience of AI/ML development bring an ability to ask the right questions, look round corners at problems, keep a lid on function creep and make sure the most difficult questions are addressed rather than parked. Bringing them in early can mean an organisation doesn’t find itself having to do this work at a later stage, when it can add expense and time to a project or, worse, contribute to its failure.
By the end of the ‘working backward’ process, an organisation should know what it needs to know, what of that knowledge it already has, and what it needs to find a way to get.
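The outcome of the ‘working backward’ exercise can be captured as a simple inventory. The following is a minimal sketch only, assuming each required data element is recorded against the goal it supports along with whether it is already collected. The names DataElement and gap_analysis and the sample entries are hypothetical illustrations, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """One data element identified while working backward from a goal."""
    name: str        # hypothetical element name
    goal: str        # the specific project goal it supports
    available: bool  # True if the organisation already collects it


def gap_analysis(elements):
    """Split required elements into those already held and those still to collect."""
    already_held = sorted(e.name for e in elements if e.available)
    to_collect = sorted(e.name for e in elements if not e.available)
    return {"already_held": already_held, "to_collect": to_collect}


# Illustrative entries for a production-plant project (assumed, not real data).
elements = [
    DataElement("machine_downtime_minutes", "reduce unplanned stoppages", True),
    DataElement("vibration_sensor_readings", "reduce unplanned stoppages", False),
    DataElement("scrap_rate_per_batch", "cut material waste", True),
]

report = gap_analysis(elements)
print(report)
```

Anything listed under to_collect is a candidate for the new collection policy discussed in the next section.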
Establish a data policy
Organisations can’t assume that, because they collect data already, they can just pass it right over to the AI/ML developer and it’ll drop neatly into new applications, out the other end of which will come amazingly informative dashboards of new information. If only.
The way data is collected has changed in recent years, in part due to the EU’s General Data Protection Regulation (GDPR), so this year’s data set may not be compatible or comparable with one from four years ago. Perhaps significant amounts of historical data – even recent historical data – are missing. Perhaps the organisation needs to put in place entirely new data collection policies to start from a designated ‘Day 1’.
Working out a ‘Day 1’ data policy is one thing. But to hit the ground running with an AI/ML project as soon as it kicks in, some historical data will be useful. Here there are several decisions to be made, including which data sets are most important, how far back to go, whether to work on a subset for proof of concept and bring in more data sets later, and whether some poorly managed data can be cleaned enough for use – or not.
Making the right decisions will be vitally important, and this is another area where that trusted third party view of the AI/ML specialist will be extremely valuable.
Scrap the siloes
It is possible an entirely new data policy will be needed going forward to keep the AI/ML system fed with the right quality data. Not only might new data need to be gathered, but new working practices might also be needed, and there could be significant implications across the whole organisation. For example, it might be important to do a ‘once and for all’ purge of data siloes that can often hold data-related projects back.
IT teams are spending 40% of their time managing and maintaining data infrastructure, and only 32% of data available to enterprises is put to work, while the remaining 68% goes unleveraged. And in too many organisations, we still see different lines of business capturing the same data for their own use. This is cost-inefficient, creates data siloes that lead to data fragmentation and governance issues, and inevitably means there is variance in the accuracy and quality of data.
Which data set should the AI/ML project use? That’s the wrong question to ask. The right question to ask is “How do we ensure there is just one set of this data, shared across all lines of business?”
Ask the question, find the answer, then implement it. Repeat for all siloes. This will help the current AI/ML project immensely, should create cost-efficiencies, and should support future AI/ML and other projects going forwards. It will also be beneficial for other data management processes such as backup, restore and archive.
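A first step towards answering that question is finding out which data is being captured more than once. The sketch below is illustrative only, assuming a simple catalogue that maps each line of business to the fields it captures. The catalogue contents and the function name find_duplicated_fields are hypothetical.

```python
from collections import defaultdict


def find_duplicated_fields(catalogue):
    """Return fields captured by more than one line of business.

    `catalogue` maps a line-of-business name to the list of fields it captures.
    The result maps each duplicated field to the sorted list of its owners.
    """
    owners = defaultdict(set)
    for line_of_business, fields in catalogue.items():
        for field in fields:
            owners[field].add(line_of_business)
    return {
        field: sorted(lobs)
        for field, lobs in owners.items()
        if len(lobs) > 1
    }


# Assumed example catalogue; a real one would come from a data audit.
catalogue = {
    "sales": ["customer_email", "order_value"],
    "marketing": ["customer_email", "campaign_id"],
    "support": ["customer_email", "ticket_id", "order_value"],
}

duplicates = find_duplicated_fields(catalogue)
for field, lobs in sorted(duplicates.items()):
    print(f"{field}: captured by {', '.join(lobs)}")
```

Each field reported here is a candidate for consolidation into a single shared data set. Matching on field name alone is a simplification, since real siloes often hold the same data under different names.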
Even in the era of digital transformation, Fortune 500 companies take weeks or, most often, months to deliver clean data to their teams—often mandating a carefully coordinated effort across multiple teams. Further, this has necessitated the use of ingenious, albeit often insufficient, methods such as the use of synthetic data sets or subsets of data.
With Cohesity, there is an answer. Users can instantly provision clones of backup data, files, objects, or entire views and present those clones to support a variety of use cases. Cohesity’s zero-cost clones are extremely efficient and can be instantly created without having to move data. This is in stark contrast to the inefficiency of the traditional DevTest paradigm, in which full copies of data are created between infrastructure siloes. This is a dramatic shift to modernisation.
By decoupling data from the underlying infrastructure in this way, we enable organisations to automate data delivery and provide data mobility. Zero-cost clones can be spun up in minutes rather than weeks. As a result, customers have reduced their service level agreements (SLAs) for data delivery, accelerated application delivery and migration, and greatly simplified their data preparation.
With the right data flowing into them, AI and ML projects can provide organisations with those dashboards of insights that can be used by the organisation in the transformational ways it envisions. Focussing on the data from the start of an AI/ML project can help an organisation land on the right side of Gartner’s 50%.