Crowdstrike: When an antivirus blazed through the news


When you work in cybersecurity, your cell phone ringing on a Friday usually means bad news. Cybercriminals know our work patterns and the days when they are most likely to break through our defenses. That is why so many cyberincidents occur on Fridays. So when, on July 19, the phone wouldn't stop buzzing, we feared the worst.

Here the good news begins. First, the phone was not ringing with a call from GMV's CERT, our 24/7 security-monitoring and incident-response service. We get those calls when a cyberincident occurs that may impact either GMV itself or one of its clients, and our CERT colleagues alert us to what is going on. So this was probably not a serious incident, which gave us time for a more considered, easier-to-implement response.

The second piece of good news was that the cell phone was buzzing with our usual internal communications about CERT activity. In other words, something was going on in the background, but we were detecting it early and could trigger a prompt response if things got serious. However, there were too many of these communications, and that was worth investigating. The data was very different from what the CERT usually analyzes: some of the equipment and services monitored by the CERT were completely down. Quite a few, in fact. None of them belonged to GMV; they belonged to some of the clients we monitor. The CERT team was already investigating to find patterns pointing to potential attacks or to failures in the information transfer systems.

Meanwhile, more news was coming in. Cybersecurity professionals always stress the value of collaborative and cooperative networks, and many common cybersecurity frameworks set specific requirements for creating, joining, and participating in them. On Friday, July 19, these networks proved their worth. Several collaborative groups set up on social networks were frantic. Problems had been detected on servers and user workstations running EDR software from the vendor Crowdstrike; these computers were unresponsive and inactive. Some participants shared that their equipment showed the dreaded Windows failure: a blue screen. Blue screens occur when a Windows computer reaches an unstable state and the operating system cannot keep running properly; the machine shuts down for safety, displaying an all-blue screen with an error message indicating the detected cause of the problem. What seemed odd was that this was happening simultaneously in many organizations worldwide and on all types of computers. Apparently, all of these computers had the same EDR software installed. The investigation had a clear lead.
Fortunately, this theory was confirmed shortly afterwards. By 07:30 Spanish time, Crowdstrike had already published a technical alert on its support portal about its product "Falcon Sensor", the agent it deploys on every computer it protects. The alert explained that the agent had received a faulty update that caused the blue screen on Windows systems.

These professional cooperation channels also began to share a possible workaround for the failure: boot the computer in "Safe Mode" and then manually delete a file called "C-00000291-*.sys". Crowdstrike acknowledged that this file came from an update it had pushed at 04:09 UTC (06:09 in Spain) and that it was the faulty content causing the failure.
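For illustration only, here is a minimal sketch of how that manual cleanup could be scripted once a machine has been booted into Safe Mode. It assumes the channel files sit in the commonly reported Crowdstrike driver folder (C:\Windows\System32\drivers\CrowdStrike); the folder path, script name, and function name are our own assumptions for this sketch, and any real remediation should follow the vendor's official guidance.

    # cleanup_sketch.py - hypothetical example, not official Crowdstrike tooling.
    # Removes files matching the faulty channel-file pattern after the machine
    # has been booted into Safe Mode. Run from an elevated (Administrator) prompt.
    import glob
    import os

    # Commonly reported default location; may differ on a given installation.
    CHANNEL_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
    PATTERN = "C-00000291-*.sys"

    def remove_faulty_channel_files(directory: str = CHANNEL_DIR) -> list[str]:
        """Delete files matching the faulty pattern and return the paths removed."""
        removed = []
        for path in glob.glob(os.path.join(directory, PATTERN)):
            try:
                os.remove(path)
                removed.append(path)
            except OSError as err:
                print(f"Could not remove {path}: {err}")
        return removed

    if __name__ == "__main__":
        deleted = remove_faulty_channel_files()
        print(f"Removed {len(deleted)} file(s): {deleted}")

In practice, many administrators applied this step by hand or through boot media, since an affected machine typically could not start normally until the file was gone.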

At this point, our CERT was already working with the impacted clients, validating potential solutions and adapting the monitoring systems to the avalanche of alerts. And there was more good news in the middle of a Friday that had started so unpromisingly:

  • The incident was due to an operational failure, not a malicious attack. Getting out of trouble meant fixing an existing bug, with no need to worry about an attacker reacting to our containment, response, and recovery actions.

  • There was a clear candidate for the source of the error: Crowdstrike had already acknowledged it and provided an explanation that matched the evidence, the data, our own analysis, and other people's. We had a trustworthy line of action to solve it.

  • All this had happened in a matter of minutes, so the incident could be solved quickly. In fact, many people were already fixing it.

  • Communication was clear and fluid in every direction, sharing as much information as possible while honoring the confidentiality commitments we are bound by.

However, it was not all good news. Other organizations were experiencing the problem intensely, with a huge impact on their operations, systems, and services. Their activity was at a standstill, and the names of major organizations and critical infrastructure operators affected were circulating. A GMV colleague working from home told his family over breakfast that this would probably make the day's news. Clearly the incident was going to have an impact on society, and not a small one. Microsoft later stated that 8.5 million computers were affected, with significant consequences: international flights were delayed, public transport connections were interrupted, some hospitals had to cancel medical tests, and even some banking and payment transactions could not be carried out...

Meanwhile, our work to fix the cyberincident was progressing. The workaround was confirmed as effective on most computers, and GMV's clients were practically fully recovered. In some cases, alternative workarounds had to be applied to devices where the initial one had failed, as some machines turned out to fail in slightly different configurations. Professionals kept analyzing and sharing what they were finding. It seemed clear that, even though the solution was known, applying it would take some companies days or weeks. Colleagues were physically going from computer to computer with USB drives to apply the workaround manually. It worked, but it took time.

The rest is history, and there is no need to repeat it here.

We at GMV drew some lessons from this cyberincident. We analyzed it: how we managed it, how we should have acted, what tools we had... It is time to improve for future incidents.

Our first conclusion is the value of continuous 24/7 security monitoring through a CERT service. It must cover all services in real time, be fast and effective, and be run by prepared, trained, and qualified people. The CERT must also be able to detect events, assess different levels of criticality, react when needed, and provide early alerts before the actual response is launched. Without a capable CERT or SOC, none of this is possible.

The second conclusion concerns our incident response plans. We did not have to execute them for this cyberincident; it was not even necessary to initiate the escalation procedures. But the response plan worked effectively as far as it needed to. Just perfect.

The third conclusion concerns our business continuity plans. What would have happened if Crowdstrike had been our default EDR? We know we would have detected the failure, but was our current Business Continuity Plan able to manage that scenario? Would it have been effective? Do we need to improve it? Fortunately, we did not have to execute the plan, but we simulated it during the incident anyway, just in case. We found that it worked very well, apart from a couple of minor issues. Addressing those two improvements makes for a quicker, more effective response and would have reduced the impact we might have experienced.

This was a Friday for the history books. Hopefully it will be remembered for the lessons we learned and for companies' increased awareness of their security monitoring and incident response needs, which will better prepare us for the next crisis.

 

Mariano J. Benito, Cybersecurity & Privacy Ambassador GMV 

Óscar Riaño, head of GMV CERT


Source URL: https://gmv.com/media/blog/cybersecurity/crowdstrike-antivirus-blazed-news