Anton Snitavets: “The damage was very significant, but it could always be worse”

What the Global Microsoft Outage Taught Us

The most significant cybersecurity event of 2024 was the global outage of Microsoft Windows systems on July 19. It caused problems across many sectors, including airlines, airports, banks, and medical systems, with affected machines displaying the notorious BSOD (“blue screen of death”) error. The outage was triggered by CrowdStrike Falcon Sensor, an endpoint protection system designed to block cyberattacks. CrowdStrike CEO George Kurtz stated on the social network X that the company was “actively working with customers affected by the defect found in a content update for Windows hosts.” He said the outage was not a cyberattack and did not affect Mac and Linux hosts. The main consequence was global financial losses, which experts are still assessing.

To comment on this significant incident, we spoke with Anton Snitavets, Cloud Security Engineer at Jabil, Fellow member of the Hackathon Raptors community, laureate of the Skolkovo Tech and Innovation Awards 2023 in the “Information Security Innovations” category, judge at the RUNET AWARDS 2023 in the “Information Security” and “Podcasts and Digital Content” categories, and holder of the Certified Information Systems Security Professional (CISSP) certification, widely regarded as one of the most difficult to obtain in cybersecurity. Anton shared recommendations on how companies using Endpoint Protection software can avoid similar issues, and how companies developing such software can reduce the risk of comparable failures.

  • Hello, Anton. Your extensive professional experience in Belarusian and transnational companies covers almost all cybersecurity domains, from Network and System Administrator and DevSecOps Engineer to Cloud Security Engineer. As an expert, how would you comment on the July 19 global outage? Why do you think this problem occurred, and what caused it?

The problem arose “at the intersection” of automated cybersecurity systems and the operating systems in which they run. Essentially, a security content update for CrowdStrike Falcon Sensor, an endpoint security tool used by many companies worldwide, led to an unforeseen conflict with Windows components, causing the BSOD. CrowdStrike Falcon integrates closely with the operating system kernel. On one hand, this lets it operate with system privileges, for example, to scan areas of memory used by other applications and processes; on the other hand, it imposes higher requirements on the quality and reliability of updates, because any error or conflict in that integration can lead to critical failures such as the BSOD. The July 19 incident was caused by a logic error made during the development of a CrowdStrike Falcon Sensor content update, combined with insufficient testing of such updates on CrowdStrike’s side before release.
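To illustrate the class of logic error involved, here is a minimal sketch in Python. CrowdStrike’s actual driver code is not public, so the field count and rule format below are purely hypothetical; the point is that an interpreter which trusts a content file to carry a fixed number of fields will read past the end of a shorter record. In user-space Python this raises a catchable exception, but in a kernel-mode driver the equivalent out-of-bounds read touches invalid memory and halts the entire operating system.

    # Hypothetical sketch: a content interpreter that trusts update data
    # too much. The field count and names are illustrative, not
    # CrowdStrike's actual format.

    EXPECTED_FIELDS = 21  # the interpreter was written against fixed-size rules

    def apply_rule(rule_fields: list[str]) -> None:
        # Logic error: indexing the last expected field without checking the
        # record length. In a kernel-mode driver the equivalent out-of-bounds
        # read hits unmapped memory and the OS halts with a BSOD instead of
        # raising a catchable exception.
        target = rule_fields[EXPECTED_FIELDS - 1]
        print(f"applying rule to {target}")

    def apply_rule_safely(rule_fields: list[str]) -> None:
        # Defensive version: validate the record before touching it.
        if len(rule_fields) != EXPECTED_FIELDS:
            raise ValueError(f"malformed rule: got {len(rule_fields)} fields")
        apply_rule(rule_fields)

    # A content update ships a rule with one field too few:
    bad_rule = [f"field{i}" for i in range(EXPECTED_FIELDS - 1)]
    try:
        apply_rule(bad_rule)  # the user-space analogue of the July 19 crash
    except IndexError as exc:
        print(f"caught in user space; fatal in kernel mode: {exc}")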

  • Looking back at the event, do you think the consequences could have been even worse? What are the potential repercussions for ordinary people?

The damage was very significant, but not the worst possible: it could always be worse. Imagine if this incident had not only temporarily halted the operation of some systems, from air traffic control to hospital systems, but also destroyed or corrupted critical data. Restoring the functionality of a “downed” system is much cheaper than recovering lost data. In this case, the result was mainly temporary unavailability of services and increased operational costs for recovery. For ordinary people, this means delays in receiving services and the inability to obtain necessary information or carry out transactions, such as banking operations, on time.

  • Your accomplishments include developing a software solution integrated into Aras Innovator that eliminated multiple vulnerabilities in the product and enhanced security not only in Aras’ development processes but also for Aras clients. Later, at Jabil, you improved the security compliance of the company’s vast cloud infrastructure, raising its overall compliance rating to the highest levels within just two years. Evaluating the work of your colleagues at CrowdStrike, do you think everything was done correctly in this crisis?

The reaction of CrowdStrike specialists was prompt: they teamed up with Microsoft specialists to quickly determine the cause of the problem and help users resolve it as soon as possible. However, the very fact that such a situation occurred points to deficiencies in the update testing process. In my opinion, CrowdStrike specialists acted as competently as the crisis circumstances allowed. Unfortunately, at that moment the only way to restore the affected systems was to manually roll back the CrowdStrike Falcon content update on each of them. Given that many organizations run thousands of such systems, the operational costs of restoring functionality were astronomical.

  • In your opinion, how systemic is this precedent? Had anything similar happened in any form before July 19? Have you encountered such issues yourself?

This is not an isolated case: conflicts between security agents and system components have happened before, although usually on a smaller scale. Security system updates have repeatedly caused failures in individual applications or OS modules, but fortunately such failures are rarely this large-scale and therefore attract far less public attention.

  • In several articles, you have shared recommendations on cybersecurity for government agencies and businesses. Based on your experience not with crisis management but with “learning from mistakes” and long-term prevention, how would you assess CrowdStrike’s solution to the problem?

Now, with a complete picture of what happened, it is easy to point out the shortcomings that preceded the incident and caused it; today they seem obvious. Had updates, including content updates, been tested more thoroughly before release, and had CrowdStrike Falcon Sensor included a mechanism for rolling back updates, this incident could have been avoided. I am sure this was a good lesson not only for CrowdStrike but for other software manufacturers as well.

  • You hold several professional certifications, including Microsoft Certified: Azure Security Engineer Associate and Certified Information Systems Security Professional (CISSP), which is considered one of the most challenging to obtain in the field of information security. As a certified specialist, and with the opportunity to evaluate the causes and consequences in retrospect, what solution do you consider appropriate?

In my opinion, the best solution for software companies is to implement or improve pre-release testing. An excellent approach is to treat content updates like the product’s source code, putting them through the same stages of review and testing. Another good practice is to use test environments that simulate user environments: any anomaly detected in such an environment should immediately lead to the update being recalled.
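As a minimal sketch of what such a pre-release gate might look like (the test matrix and helper function below are simulated placeholders, not any vendor’s actual pipeline), assume a lab of test machines mirroring common customer configurations; the update is released only if every configuration survives it:

    # A minimal sketch of a pre-release gate for content updates. Everything
    # here is a simulated stand-in, not a real vendor API.

    import random

    TEST_MATRIX = [
        {"os": "windows-10-22h2", "sensor": "7.11"},
        {"os": "windows-11-23h2", "sensor": "7.11"},
        {"os": "windows-server-2019", "sensor": "7.10"},
    ]

    def deploy_and_reboot(config: dict, update: bytes) -> bool:
        """Simulated deployment: returns False if the test machine fails to
        come back up after applying the update (e.g. crashes during boot)."""
        return random.random() > 0.01  # stand-in for a real health probe

    def gate_content_update(update: bytes) -> bool:
        """Release the update only if every test configuration survives it."""
        for config in TEST_MATRIX:
            if not deploy_and_reboot(config, update):
                print(f"update rejected: {config['os']} failed post-update check")
                return False
        print("update passed all test configurations")
        return True

    if __name__ == "__main__":
        if gate_content_update(b"\x00" * 1024):  # dummy update payload
            print("promoting update to staged rollout")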

For companies using such software, the best solution in terms of cost and quality is to implement and develop patch management systems that allow software updates, including updates to operating systems and critical software, to be tested on small groups of machines before being rolled out to the rest of the organization’s computers and systems. Such systems also allow a gradual rollout to small parts of the organization, known as a “canary release”, which makes it possible to monitor system stability and, if problems appear, cancel the update for the remaining systems.
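A minimal sketch of this staging logic, assuming an inventory of hosts and some way to verify their health after patching (the wave sizes, failure threshold, and helper below are illustrative, not a specific patch-management product):

    # A minimal sketch of a staged ("canary") rollout. Wave sizes, the
    # failure threshold, and the patch helper are illustrative placeholders.

    import random

    HOSTS = [f"host-{i:04d}" for i in range(2000)]
    WAVES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet per wave

    def patch_host(host: str) -> bool:
        """Simulated patch + health check; False means the host broke."""
        return random.random() > 0.001

    def staged_rollout(hosts: list[str]) -> bool:
        done = 0
        for fraction in WAVES:
            target = int(len(hosts) * fraction)
            wave = hosts[done:target]
            failures = [h for h in wave if not patch_host(h)]
            # Halt the rollout if this wave shows an elevated failure rate,
            # sparing every host in the later, larger waves.
            if wave and len(failures) / len(wave) > 0.02:
                print(f"rollout halted at {target}/{len(hosts)} hosts; "
                      f"{len(failures)} failures in this wave")
                return False
            done = target
            print(f"wave complete: {done}/{len(hosts)} hosts patched")
        return True

    if __name__ == "__main__":
        staged_rollout(HOSTS)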

  • As a Fellow of a professional association, the Hackathon Raptors community, you have the opportunity to track the community’s reactions to global challenges. What, from your observation, is the industry doing to prevent similar incidents in the future?

To prevent such incidents, the industry increasingly applies the “shift-left” approach, testing the security and compatibility of updates at the early stages of development. Two other concepts are also widespread: the “canary release”, which applies updates to small groups of devices and systems and evaluates their stability before applying them to the rest, and “feature flags”, which allow functions of a software product to be quickly enabled or disabled. Both help reduce the risks of deploying software updates.
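A feature flag can be as simple as a remotely controlled switch that selects between a new code path and a proven one, so a misbehaving change can be turned off without shipping a new build. A minimal sketch, with an in-memory store and flag name standing in for a real flag service:

    # A minimal feature-flag sketch; the in-memory store stands in for a
    # remotely managed flag service, and the flag name is illustrative.

    FLAG_STORE = {"new_scan_engine": False}  # flipped remotely, no redeploy

    def is_enabled(flag: str) -> bool:
        return FLAG_STORE.get(flag, False)

    def scan(data: bytes) -> str:
        if is_enabled("new_scan_engine"):
            return "scanned with new engine"    # new code path, dark-launched
        return "scanned with stable engine"     # proven fallback path

    if __name__ == "__main__":
        print(scan(b"payload"))                 # stable path by default
        FLAG_STORE["new_scan_engine"] = True    # operator flips the flag
        print(scan(b"payload"))                 # new path, instantly revertible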
