Future of AI

Engineering trust into intelligent systems

By Dr Francesca Boem, Associate Professor at UCL and IEEE Senior Member

From AI-powered hospital monitors to connected wind farms and autonomous vehicles, we are witnessing a quiet but rapid change in how technology interacts with the physical world. Systems that once passively collected and reported data can now adjust, respond and even make decisions. Unsurprisingly, this has raised fresh concerns over what happens when they get it wrong.

Cyber-physical systems are proliferating as AI and IoT technologies are embedded into environments that affect both human safety and critical infrastructure. While these systems have the potential to deliver efficiency and smarter decision-making, their risks can’t be managed by digital security tools alone.

The failure of a sensor could now lead to an insulin pump delivering the wrong dosage, a traffic light misjudging a sequence or a piece of factory equipment behaving unpredictably. In this new reality, resilience must become a design priority, not an afterthought when things go wrong.

When ‘smart’ systems become critical infrastructure 

Until recently, AI and IoT systems played a largely supporting role. However, that’s changing as they become central to operations across healthcare, transport, energy and manufacturing. Nowadays they are detecting faults, optimising responses, controlling machinery and even influencing real-time human decisions. 

The more we give these systems the power to act, the more we have to consider their reliability — not just in ideal conditions, but also in the face of failures, delays or unexpected inputs. These are engineering problems and must be treated as such. 

Why small faults escalate in connected systems 

From a control engineering perspective, small faults can have big impacts. A drifting sensor or delayed signal might seem minor, but in connected systems where components constantly respond to each other, even a slight error can throw the whole process off balance. Without the right checks in place, minor problems can quickly snowball. And as systems become increasingly interconnected, those problems can propagate from one system to another.
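
To make that concrete, here is a deliberately simple Python sketch of a feedback loop whose sensor drifts slowly. All names, gains and thresholds are invented for illustration, not taken from any real system. The controller keeps the measured value looking healthy, so the true process value quietly walks away from its setpoint until an explicit plausibility check catches the discrepancy.

```python
# Toy example: a slowly drifting sensor inside a closed loop.
# The controller hides the fault by keeping the *measured* value on target,
# while the true process value drifts away. A simple cross-check catches it.

SETPOINT = 50.0        # desired process value (hypothetical units)
GAIN = 0.5             # proportional controller gain
DRIFT_PER_STEP = 0.02  # hypothetical sensor bias added each time step

def run(steps: int, drift_check: bool = False) -> None:
    true_value, bias = SETPOINT, 0.0
    for t in range(steps):
        bias += DRIFT_PER_STEP                # the fault: bias grows slowly
        measured = true_value + bias          # what the controller "sees"
        redundant = true_value                # stand-in for an independent second sensor
        if drift_check and abs(measured - redundant) > 1.0:
            print(f"step {t}: drift detected, switching to a safe fallback")
            return
        control = GAIN * (SETPOINT - measured)
        true_value += control                 # the plant responds to the control action
    print(f"after {steps} steps the true value is {true_value:.1f}, "
          f"not {SETPOINT:.0f}: the loop quietly hid the fault")

run(200)                    # without a check, the error quietly accumulates
run(200, drift_check=True)  # with a check, the fault is flagged early
```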

Take healthcare, for example. A wearable device feeding real-time data into an AI model could help doctors detect early signs of illness. If that data is wrong or if the model misinterprets it, the recommendations could be misleading, even dangerous. And those misinterpretations could propagate further, feeding inaccuracies into population-level models.

Another good example is autonomous manufacturing, where a sensor glitch could cause a robotic arm to move out of sync on the production line. If the system isn’t designed to detect and correct that fault, the entire process could grind to a halt, or someone could be seriously injured.

Whatever the sector, the more autonomy we build in, the more critical resilience becomes. 

Designing for disruption, not perfection 

Resilience means much more than keeping systems online. Those systems must also be able to operate safely and predictably when things don’t go to plan – whether that’s a fault, delay or unexpected input. 

We need to start designing with disruption in mind, not just performance. That means building in redundancy to ensure no single point of failure can bring the system down. Systems should also be able to degrade safely, maintaining some level of functionality rather than failing outright. 

Fast, reliable recovery is another priority, allowing services to resume quickly before any real harm is done. Perhaps most importantly, though, automated systems must be able to explain their actions very clearly, so that when things do go wrong, we can understand what happened and how to put it right. 
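
As a rough illustration of three of those ideas, the hypothetical sketch below fuses readings from redundant sensors, masks a single outlier, falls back to a conservative default when no two sensors agree, and keeps an audit log so each decision can be explained afterwards. The thresholds and values are invented for the example.

```python
# A hedged sketch of three resilience patterns: redundancy (three sensors),
# graceful degradation (a safe default when readings disagree) and
# explainability (a simple audit log). Names and thresholds are hypothetical.

DISAGREEMENT_LIMIT = 2.0   # max spread tolerated between redundant sensors
SAFE_DEFAULT = 0.0         # conservative output used in degraded mode

audit_log = []             # records every decision so failures can be explained later

def fused_reading(readings: list[float]) -> float:
    """Fuse three redundant sensor readings, degrading safely if they disagree."""
    a, b, c = sorted(readings)
    if c - a <= DISAGREEMENT_LIMIT:
        audit_log.append(("all sensors agree", readings, b))
        return b                                  # median of three
    if b - a <= DISAGREEMENT_LIMIT or c - b <= DISAGREEMENT_LIMIT:
        audit_log.append(("one outlier masked", readings, b))
        return b                                  # redundancy masks a single fault
    audit_log.append(("no agreement, degraded mode", readings, SAFE_DEFAULT))
    return SAFE_DEFAULT                           # predictable, conservative output

print(fused_reading([20.1, 20.3, 20.2]))   # healthy: median wins
print(fused_reading([20.1, 35.7, 20.2]))   # one faulty sensor is masked
print(fused_reading([5.0, 20.0, 41.0]))    # no agreement: degrade safely
for entry in audit_log:
    print(entry)                           # the log explains every decision
```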

These ideas aren’t new. They come from decades of engineering in safety-critical domains like aviation and industrial control. What’s changed is the urgency with which we need to apply them, given the rate at which intelligent, connected systems are being integrated into everyday infrastructure. 

‘Reliable AI’ in the real world 

In high-stakes, real-time environments, the consistency and predictability of system behaviour are also critical factors. A 99% accurate AI model might sound impressive on paper, but in a control system that updates every second, that remaining 1% could lead to a fault every couple of minutes. When systems interact with the physical world, you can see how even a small margin of error, repeated at speed, can quickly become a problem. 
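
The arithmetic behind that claim is simple to check, assuming one decision per second and errors spread evenly over time:

```python
# Back-of-envelope check: a model that is wrong 1% of the time,
# consulted once per second, produces errors surprisingly often.
error_rate = 0.01                    # "99% accurate"
updates_per_minute = 60              # one decision per second
errors_per_minute = error_rate * updates_per_minute
print(errors_per_minute)             # 0.6 errors per minute, i.e. 36 per hour
print(1 / errors_per_minute)         # roughly 1.7 minutes between errors, on average
```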

Rather than chasing marginal improvements in efficiency or accuracy, we need to focus on stability, transparency and control. That can only happen when resilience goes beyond the engineering team to become a shared priority for everyone involved in developing and deploying intelligent systems. 

Testing failure before it happens 

Digital twins can support this shift by allowing teams to simulate how systems behave under different conditions, even before anything is deployed. This helps them move from reactive fixes to proactive resilience. 
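
As a toy illustration of that idea, the sketch below runs the same control logic against a simulated process, injects a stuck-sensor fault partway through, and checks a safety limit, all before anything would touch real hardware. The model, dynamics and limits are invented for the example.

```python
# Illustrative fault injection in a simulated process: run the control logic
# against a model, freeze the sensor mid-run, and check a safety invariant.

MAX_SAFE_TEMP = 90.0   # hypothetical safety limit

def simulate(steps: int, stuck_sensor_at: int | None = None) -> float:
    temp, reading = 20.0, 20.0
    for t in range(steps):
        if stuck_sensor_at is None or t < stuck_sensor_at:
            reading = temp                       # healthy sensor tracks the process
        # else: the reading freezes at its last value (the injected fault)
        heater = 1.0 if reading < 70.0 else 0.0  # simple bang-bang controller
        temp += 2.0 * heater - 0.5               # crude heating/cooling dynamics
    return temp

print(simulate(100))                             # nominal run settles near 70
final_temp = simulate(100, stuck_sensor_at=20)   # sensor freezes at step 20
print(final_temp, "breaches the safety limit:", final_temp > MAX_SAFE_TEMP)
```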

Formal verification methods, adapted from aerospace and high-integrity engineering, can provide another layer of protection. These methods can mathematically prove that certain types of failure won’t occur within defined operating conditions. While not always easy to implement, they offer a level of assurance that is valuable in critical environments.
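
As a small, hedged example of what such a proof can look like, the sketch below uses the Z3 solver (one widely used tool, not necessarily the one any given team would choose) to show that no sensor value inside a defined operating range can push a simple proportional controller beyond its actuator limit. The gains, ranges and limits are made up for illustration.

```python
# Miniature formal-verification example using Z3 (pip install z3-solver).
# We ask the solver for a sensor value inside the operating range that would
# exceed the actuator limit; "unsat" means no such value exists, so that
# failure is mathematically ruled out within those conditions.

from z3 import Real, Solver, And, Or, unsat

x = Real("x")                        # sensor reading
SETPOINT, GAIN, LIMIT = 50, 0.5, 10  # hypothetical controller parameters

u = GAIN * (SETPOINT - x)            # controller output

s = Solver()
s.add(And(x >= 40, x <= 60))         # the defined operating conditions
s.add(Or(u > LIMIT, u < -LIMIT))     # the failure we want to rule out

if s.check() == unsat:
    print("No in-range sensor value can exceed the actuator limit.")
else:
    print("Counterexample:", s.model())
```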

We don’t need to reinvent the wheel. But we do need to embed practices like these far earlier in the AI and IoT system design process. 

Resilience is the real measure of smart tech 

Cyber-physical systems may be marketed as smart, but their true value lies in how they hold up under pressure. As these technologies become more widespread, the conversation needs to move beyond innovation. Designing for resilience is what will keep these systems stable, trusted and safe – no matter what.
