On October 4th, what appeared to be the largest global outage in social media history had alarm bells ringing and companies’ DEFCON statuses changing as the outage’s impacts cascaded across the globe.
At 11:40 am Eastern time, Facebook users began reporting errors, and within minutes Facebook disappeared from the internet completely. All of Facebook’s services, including Instagram and WhatsApp, were severely impacted. The outage led to a lot of speculation about the cause: was it a malicious event? Human error?
Facebook later released this explanation in a statement:
"Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt."
Whatever the root cause, an event of this scale should be a wake-up call to companies across all verticals: outages are as certain as death and taxes. It truly is not ‘if’ something bad will happen, but ‘when’.
The full financial implications of this outage have yet to be calculated, as many businesses rely on Facebook’s services and will have lost revenue while they were unable to operate. Facebook’s stock price also plummeted 4.9%. If this is the wide-reaching impact of a social media service going down, what would the consequences be if it were the global financial backbone, interconnected power grids, or national emergency services systems?
The added irony of the whole event was that the outage also affected Facebook’s internal communications systems and even locked some employees out of offices, adding extra obstacles to solving the problem quickly. How do you get a service back online when your main method of doing so is also offline? These are the kinds of questions organizations need to consider when evaluating their own level of resilience.
We know from experience that a mature operational resilience posture comes from confidence that you can address whatever goes wrong in a timely manner, and that confidence comes from rigorously rehearsing your people and tooling. Having contingency plans for multiple possibilities is great, but you also need to be able to pivot dynamically to deal with new and unexpected challenges as they occur, and ensuring your lines of communication and orchestration aren’t compromised is key to this.
Facebook isn’t the first organization to face a major outage and it won’t be the last. Less than four months ago, Fastly also went down for a couple of hours, taking several major news platforms with it. Other organizations, whether they’re tech giants or financial services providers, can learn a lot from these events and will be reminded of the importance of investing time and resources into the training and technology needed to mitigate these threats and their impacts.
I’ll be speaking on a similar topic at the upcoming DRJ Fall virtual event, where I’ll join other operational resilience experts to discuss how organizations can dynamically manage their cyber events in real time.