Three trends that will inspire a synthetic data revolution in 2022

It’s looking like 2022 will be a defining year for synthetic data.

With companies keen to press on with AI innovation against the backdrop of a still-uncertain Covid-19 picture, and with attitudes shifting towards how big technology companies handle private information, privacy-compliant, ready-annotated data is set to grow and grow in popularity over the next 12 months.

At Mindtech Global, we’re already starting to see the trends that’ll turn this growing acceptance of synthetic data into a full-blown revolution. I’ve pinpointed three of these trends below.

1. A changing world means updating the data AI networks are built on

Our world changes constantly. Whether that’s new rules on social distancing or a new design on your favourite cereal box—we’re greeted by newness every day.

Beautiful as this all sounds, it’s a nightmare for AI networks and those in charge of training them. Every new object in a location, and every change to an everyday scenario, dates the AI operating within it very quickly.

Developers will learn, going into 2022, that creating a ‘timeless’ prototype is the easy part. Harder is the production stage, where models require constant maintenance to keep their intelligence relevant, and developers are forced into a constant mode of reacting to change.

And with 2022 no doubt set to bring with it more significant changes to everyday life, network developers will need to go out and gather and annotate the photos to plug data gaps—that’s if the photos exist. And if companies don’t mind their highly skilled and sought-after machine learning engineers spending their time on what’s really a sophisticated filing job.

Or, instead, we’ll see companies creating their own synthetic images with real-world likeness and extracting data from them to meet the demands of our perpetually forward-moving world.

2. An appreciation of quality and quantity

As our understanding of what AI networks are capable of continues to grow, so do our expectations of the quality of the work they do.

There’s a raft of legislation on the horizon to guard against so-called ‘faulty networks’—networks that are biased or limited in their necessary understanding of the diversity of the world around them.

Companies will soon be responsible for the consequences of these blind spots. There’s no more just putting an AI solution out in the wild and hoping it works. So I predict companies are going to be more focused on ensuring the robustness of their network before launch.

This will be achieved through more testing of corner and edge cases; those awkward and dangerous situations that, although rare, have the potential to compromise the safety of a location or individuals.

The rarity and danger of these cases make data on them scarce—a gap that will be plugged in years to come through simulations that produce synthetic data of the same nature.
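To make the idea concrete, here is a minimal sketch of one common technique behind this kind of simulation: deliberately oversampling rare scene conditions when generating synthetic training data, so that corner cases are well represented rather than vanishingly scarce. All names and parameters here are illustrative assumptions, not Mindtech’s actual API.

```python
import random

# Illustrative scene conditions: common ones dominate real-world data,
# while the rare ones are exactly the corner cases that compromise safety.
COMMON = ["daylight", "indoor_lit"]
RARE = ["low_light", "smoke", "partial_occlusion"]

def sample_scene(rng, rare_fraction=0.3):
    """Pick parameters for one synthetic scene, boosting rare cases.

    In real data, rare conditions might appear in well under 1% of images;
    here we deliberately generate them 30% of the time.
    """
    if rng.random() < rare_fraction:
        condition = rng.choice(RARE)
    else:
        condition = rng.choice(COMMON)
    return {
        "condition": condition,
        "camera_height_m": rng.uniform(1.0, 3.0),  # randomised camera placement
    }

rng = random.Random(0)  # seeded for reproducibility
scenes = [sample_scene(rng) for _ in range(1000)]
rare_count = sum(s["condition"] in RARE for s in scenes)
print(f"rare-case scenes: {rare_count}/1000")
```

The point of the sketch is the control it gives you: because the scenes are generated rather than collected, the mix of ordinary and dangerous scenarios is a parameter you set, not an accident of what happened to be photographed.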

That’s what our new Smart Home Application Pack aims to achieve. Released at this month’s CES, the pack lets customers developing AI vision systems use our Chameleon platform to quickly build unlimited scenes and scenarios to train Smart Home AIs, improving home security, home automation, and home safety.

3. The age of hyperspeed meets the need for innovation

We’re in the age of hyperspeed. Our expectations of technology, of life, have increased to the extent that everything needs to be done faster than ever before.

This includes inventing things that have never even existed before. The way our cities, offices, and homes changed through lockdown is an example of the speed at which technology needs to adapt to answer new and unforeseen challenges.

That’s an issue for AI networks reliant on real-world data. Innovation at hyperspeed is an impossibility when it takes a machine learning engineer 20 weeks to gather and annotate the 100,000 real-world images required to train a visual AI system to see and understand the world as a human does.

That’s even if the prerequisite images exist or are of high enough quality—see Facebook/Meta deleting its database of faceprints because of regulatory issues. Huge, real-world datasets don’t exactly lend themselves to innovation either. Multiple companies using the same datasets drives down the uniqueness of their AI networks, while large corporations restrict access to the data they own because it hands them a competitive advantage.

In 2022, I expect companies, growing tired of slow-moving projects built on real-world data and of the scarcity of that data itself, to level the playing field with a pivot to synthetic sources.

Author

  • Steve Harris

    After working in the Software and Semiconductor industries for more than twenty years, I and the early founders of Mindtech (whom I’d known for a long time) got curious. Would it be possible, we wondered, to develop and deliver AI applications as a “Neural Network platform” on a chip? Fascinating though that was, we hit an unexpected roadblock. We were working with three lead customers, and to train the networks, they each shared a huge amount of real-world visual data. What we discovered when we looked at the data astounded me.

    Firstly, the data across all three customers was very, very poor. Most of it was completely unusable: either it was not the right data, or it was not in any way privacy compliant, with no data provenance. A big red flag. Secondly, the small amount of data that was usable took an inordinate amount of time to properly annotate because the process depended on humans. Real humans, painstakingly picking out pixel after pixel, hour after hour, day after day. It just wasn’t scalable. And finally, when we looked at acquiring more data from bigger platforms, we got a hard no. Those platforms, if they had good data, knew its value, and there was no way they were going to share it.

    In that moment, we realised that before we could take the next big leap in AI, we first needed to fix the issue of providing training data for AI systems. Real-world visual data is a scarce and precious resource. It has an important role to play in training visual AI systems, but we need an additional training data source that’s unlimited, unbiased, precisely annotated and, above all, privacy compliant. The idea behind Mindtech’s Chameleon Platform was born. In 2019, we announced our end-to-end platform, where companies can create synthetic 3D worlds to train their visual AI systems. The result is a world where computers better understand the way humans interact with each other and the world around them.
