It would seem that travel has rebounded since the pandemic, with hundreds of thousands of flights taking place across the globe each month. But more flights bring a higher chance of computer outages, and in such a complex supply chain, even a few minutes of downtime can cause knock-on effects for hours.
Airline outages are quite rare, particularly when the thousands of flights each day are taken into account, but they can significantly impact business. Depending on their visibility and scale, these outages can sometimes even make headlines, resulting in serious brand damage that can take time to repair.
British Airways and EasyJet, two of Europe’s largest airlines, both experienced outages over the summer. Each serves thousands of passengers a day, and their systems are normally robust enough that downtime is a rarity. But their outages during peak travel times this year have shown that, like every industry, the airline industry is still susceptible to outages, and that adopting practices like observability is a key way in which airlines could modernise their systems.
Not all incidents are equal
Regardless of its visibility, any outage or degradation of reliability or performance can have wide-scale effects on all parts of a business, from customer acquisition and brand reputation to regulatory and contractual compliance.
This means that all digital teams are faced with the same end goals and challenges, regardless of their size or industry. They must work to reduce the number of incidents and learn how to detect, understand and resolve any problems that do occur as quickly as possible. Because while outages may be inevitable, mitigation is still possible.
Fortunately, not all incidents have the same impact. Engineers can classify them by severity, judged on three main factors: how badly the service is affected, the percentage of customers affected, and what the ultimate consequences may be. A minor degradation on a webpage that causes mild inconvenience would be treated as low severity, while the recent airline outages would likely have qualified as the highest level of severity.
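To make the three factors concrete, here is a minimal Python sketch of how a team might turn them into a severity label. Everything in it is an assumption made for illustration: the `Level` scale, the `Incident` fields, the 5% and 50% customer thresholds, the SEV1 to SEV3 labels, and the rule of taking the worst of the three factors. Real teams define their own scales and cut-offs.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-step scale used for each factor."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Incident:
    service_impact: Level           # how badly the service is affected
    customers_affected_pct: float   # percentage of customers affected (0-100)
    consequence: Level              # what the ultimate consequences may be


def customer_level(pct: float) -> Level:
    """Map a percentage of affected customers onto the scale.

    The 5% and 50% thresholds are illustrative assumptions.
    """
    if not 0 <= pct <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if pct >= 50:
        return Level.HIGH
    if pct >= 5:
        return Level.MEDIUM
    return Level.LOW


def classify(incident: Incident) -> str:
    """Return a severity label driven by the worst of the three factors."""
    worst = max(
        incident.service_impact,
        customer_level(incident.customers_affected_pct),
        incident.consequence,
    )
    labels = {Level.HIGH: "SEV1", Level.MEDIUM: "SEV2", Level.LOW: "SEV3"}
    return labels[worst]


# A slow-loading webpage versus a system-wide outage (made-up inputs).
webpage_glitch = Incident(Level.LOW, 2.0, Level.LOW)
major_outage = Incident(Level.HIGH, 80.0, Level.HIGH)
print(classify(webpage_glitch))  # SEV3
print(classify(major_outage))    # SEV1
```

Taking the worst factor means a single serious dimension is enough to escalate an incident. A weighted score would be an equally reasonable choice.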
Why did the airline outages occur?
Most airline outages occur for the same reason: problems with their flight operation systems. These systems ensure pilots and cabin crews can prepare flight plans and do flight briefings. In essence, they allow air travel to run as smoothly and efficiently as possible.
These legacy systems are extremely complex because of the scale of the operations they run. They are also written in older programming languages that very few people still know how to use. This makes it incredibly difficult to spot and fix bugs before they cause problems.
Occasionally, the failure of one system also triggers a cascade, with a ripple effect spreading to other systems. On these rare occasions, things simply go badly wrong.
How do you solve a problem like outages?
Observability platforms could play a significant role for airlines looking to proactively reduce the likelihood of outages. With the 360-degree visibility they provide, engineers can understand not only what is happening in their systems but also why it is happening, allowing them to collect data and respond quickly to issues. This means they could reduce the number of outages that occur and, for those that still slip through the net, find the root of the problem more quickly and reduce downtime.
The New Relic Observability Report 2023 found that over a third of those who use observability reported it had improved reliability and uptime. The report also found that organisations with fully deployed observability had a mean time to detect (MTTD) of just five minutes, greatly improving their chances of resolving an incident before it escalates or becomes visible to customers. And for any incidents that do still happen, it becomes much easier to understand the root of the problem, fix it, and prevent a recurrence.
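For readers unfamiliar with the metric, MTTD is usually taken to be the average gap between when an incident begins and when it is detected. The Python sketch below illustrates that arithmetic only. The function name and the sample timestamps are invented for the example, and it makes no claim about how the report itself calculated its figures.

```python
from datetime import datetime, timedelta


def mean_time_to_detect(incidents):
    """Average the delay between start and detection.

    `incidents` is a list of (started_at, detected_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    delays = []
    for started_at, detected_at in incidents:
        if detected_at < started_at:
            raise ValueError("an incident cannot be detected before it starts")
        delays.append(detected_at - started_at)
    # sum() needs a timedelta start value because the default start is 0.
    return sum(delays, timedelta()) / len(delays)


# Made-up incidents detected after 3, 5 and 7 minutes.
sample = [
    (datetime(2023, 7, 1, 9, 0), datetime(2023, 7, 1, 9, 3)),
    (datetime(2023, 7, 8, 14, 30), datetime(2023, 7, 8, 14, 35)),
    (datetime(2023, 7, 20, 22, 10), datetime(2023, 7, 20, 22, 17)),
]
print(mean_time_to_detect(sample))  # 0:05:00
```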
Observability not only allows IT problems to be fixed more quickly, but also helps reduce the effect outages have on brand reputation. Being able to tell customers straight away why they are experiencing delays builds trust just as surely as keeping them in the dark erodes it.
However, flight operation systems are so critical, and migrations so delicate, that replacing one is akin to a heart transplant. The real challenge for airlines comes down to a balancing act based on a simple question:
Does the potential chaos that could occur from a failed migration outweigh the potential gains realised by a modernised system?