
Lessons from CrowdStrike, two months after disaster struck

Published: 2024-09-10 12:18 +02:00 by Richard Firth tag: Cloud services


The biggest lesson here is this: you can’t simply outsource everything and assume it will run perfectly, writes Richard Firth.
Companies – and their customers – expect their IT services to run constantly. While no system is completely error-free, any outages or downtime should be measured in seconds, or minutes at the most.

An outage lasting days or weeks is almost unheard of, and more than a week of downtime is not only unacceptable, but could put even the largest organisations out of business.

The recent CrowdStrike outage is a perfect illustration of this, with Delta Air Lines not only having to cancel about 7 000 flights over five days, but also facing an investigation by the US department of transportation over the disruptions.


Estimates put the airline’s loss at around US$500-million, excluding the cost of regulatory and legal action facing the company as a direct result of the outage. Delta wasn’t the only business affected, with banks and hospitals also having to deal with the repercussions of what some are calling the world’s largest IT outage.

According to Microsoft, 8.5 million Windows computers around the world crashed as a result of a bug in a CrowdStrike update, and it took 10 days for the company to fix the problem fully. It’s no wonder that the security software company is facing multiple lawsuits, one of which was launched by its own shareholders, who have accused CrowdStrike of making “false and misleading” statements about its software testing.

Delta CEO Ed Bastian has publicly faulted both CrowdStrike and Microsoft for failing to provide an “exceptional service”. Both tech companies have responded with declarations that they will be defending themselves “aggressively” and “vigorously” in the case of further legal action. Microsoft has tried to pass the responsibility back to Delta Air Lines, saying its preliminary review suggested that Delta, unlike its competitors, apparently had not modernised its IT infrastructure.

Microsoft should stay in its lane

When we make use of cloud services, we trust those providers to follow thorough testing procedures before making changes to their infrastructure. If they don’t, a CrowdStrike scenario will inevitably happen. Microsoft trusted CrowdStrike to the point that it accepted updates pushed by CrowdStrike directly into its production Azure infrastructure. While CrowdStrike was to blame for the fault, Microsoft should have had processes in place to trial those updates on “canary servers”, a small group of machines that receive a change first, before allowing them into production.

And the same should be true of any IT service. If you choose to outsource critical services to external providers, you expose yourself to the quality of their processes. If you choose to keep them in-house, you remain in control of the phases of roll-out to production. Of course, many organisations that did keep their systems in-house still suffered, because they did not implement any “canary server” testing themselves.
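To make the “canary server” idea concrete, here is a minimal sketch in Python of a staged roll-out gate. It is an illustration only, not a description of how Microsoft, CrowdStrike or anyone else deploys updates. The names `Stage`, `staged_rollout`, `apply_update` and `is_healthy` are hypothetical and were invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Stage:
    """A named group of hosts that receives an update together (hypothetical)."""
    name: str
    hosts: Sequence[str]


def staged_rollout(
    stages: Sequence[Stage],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply an update one stage at a time, halting at the first unhealthy host.

    Returns the names of the stages that completed with every host healthy.
    Stages after a failure are never touched.
    """
    completed: List[str] = []
    for stage in stages:
        for host in stage.hosts:
            apply_update(host)
            if not is_healthy(host):
                # Stop here: later stages never receive the update.
                return completed
        completed.append(stage.name)
    return completed


if __name__ == "__main__":
    stages = [
        Stage("canary", ["canary-1"]),
        Stage("production", ["prod-1", "prod-2"]),
    ]
    touched: List[str] = []
    # Simulate a bad update: every host that receives it becomes unhealthy.
    done = staged_rollout(stages, touched.append, lambda host: False)
    print("completed stages:", done)
    print("hosts that received the update:", touched)
```

In the simulated bad update, only the canary host is touched and the production stage is never reached. The health check is assumed to be supplied by the operator; in practice it would also need a soak period before the next stage begins.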

While Microsoft has been happy to play the blame game with CrowdStrike, the reality is that the software giant has been pushing Office 365 into every type of business functionality it can, including mission-critical and customer-facing operations such as billing services and call centres. A situation like the CrowdStrike outage just highlights how short-sighted a complete reliance on Microsoft products can be for organisations that require more specialised and reliable solutions.


For years, companies have been increasingly buying into the Microsoft PR that the software giant can provide everything they need, but this has resulted in organisations placing all of their proverbial eggs in one basket. This not only increases the risk of something going wrong, it also makes problems harder to solve when the fix depends on software developers in another time zone who may not understand the urgency or magnitude of an outage.

There’s no doubt that Microsoft excels in certain areas, but there is a reason that software companies like MIP exist, and that reason is the ability to design and develop solutions tailored to the specific needs of organisations. Using specialist solutions ensures not only that companies can provide uninterrupted service to their customers, but also that security and other risks are minimised.

It’s all about skills

Unfortunately, Microsoft’s success is partly the result of there being few software engineering companies with the skills and capabilities to deliver specialised solutions to organisations like Delta Air Lines. In some cases, the lack of entrepreneurial skills in building IT platforms shows up only in the ubiquity of out-of-the-box solutions that need a lot of investment before they perform properly. In others, this lack is causing difficulties in business processes, directly affecting how well companies can operate.

If more people had the development skills needed to create tailored solutions – and the skills to integrate them effectively with common programs like those offered by Microsoft – companies would have access to a broader variety of tools. This would not only give companies better recourse when dealing with tech challenges, but would also ensure that the technologies used were chosen to mitigate risks.


Microservices, for example, would have ensured that the impact of the CrowdStrike outage was limited at every organisation affected, allowing companies to continue to operate while the problem was being fixed. Microservices would also have negated Microsoft’s complaint that Delta Air Lines hadn’t modernised its IT environment, allowing for specific services to be organised around business capabilities rather than infrastructure.
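The principle behind that claim is fault isolation: when each business capability runs as its own service, a failure in one is contained rather than taking everything down with it. The short Python sketch below illustrates that principle only. The service names and the `run_isolated` helper are hypothetical, and real microservices would run as separate processes or hosts rather than as functions in a single program.

```python
from typing import Callable, Dict


def run_isolated(services: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Call each service independently and record failures instead of propagating them.

    One service raising an exception does not stop the others from being called.
    """
    results: Dict[str, str] = {}
    for name, call in services.items():
        try:
            results[name] = call()
        except Exception as exc:
            # Contain the fault: mark this service as down and carry on.
            results[name] = f"unavailable: {exc}"
    return results


def crew_scheduling() -> str:
    """Hypothetical service standing in for one hit by a crashed host."""
    raise RuntimeError("host crashed")


if __name__ == "__main__":
    status = run_isolated({
        "check_in": lambda: "ok",
        "crew_scheduling": crew_scheduling,
    })
    for name, state in status.items():
        print(f"{name}: {state}")
```

Here the hypothetical check-in service keeps answering while crew scheduling is reported as unavailable. Whether this would have limited the damage in a given organisation depends on how many of its services sat on affected machines, which the sketch does not model.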

If the CrowdStrike outage proved anything, it’s that software development skills are more important than ever. In today’s technology-driven world, everyone should have a programming or software engineering background, if only to be able to understand CrowdStrike’s explanation of what caused the outage and how it intends to ensure this type of scenario never happens again.

Maybe the biggest lesson here is this: you can’t simply outsource everything and assume it will run perfectly. Ultimately, you remain responsible for your business operations, and if you choose to trust someone else to do something for you, you may be shifting some workload, but you cannot really shift responsibility. You should still be cautious. And if you take the risk of outsourcing, don’t cry when the risk materialises.

The author, Richard Firth, is CEO of MIP Holdings
