Yesterday’s internet outage was not the first, nor will it be the last, to make headlines. Ironically, because the glitch hit several major news platforms, including CNN, the BBC, The Guardian, and The New York Times, the news may not have spread quite as quickly as usual, with journalists and commentators turning to social media and, in some cases, even Google Docs to document events as they unfolded.
Still, the news spread, and we were faced with yet another occasion where the impact of an outage reverberated across digital services around the globe, with companies including Reddit, Spotify, PayPal, and eBay left unable to operate online, greeting users with error messages and struggling to publish content to their platforms.
The Checklist Manifesto
My initial thought was to go back to The Checklist Manifesto by Atul Gawande, a surgeon and writer focused on making surgery safer globally. The point that resonates is that having an agreed process, or in technology’s case a failover, saves lives in medicine. At Cutover we often liken our technology capabilities in optimising critical business and technology operations to the live telemetry of a space mission, where mission-critical operations have a clear-cut pathway and a back-up structure. So let’s look at The Checklist Manifesto: if the setbacks and risks of a complex surgical procedure can be anticipated and planned for with an agreed process, why not the same for the commercial world?
Fastly’s major internet outage was resolved within an hour, according to TechCrunch, but the implications were very real: leading digital platforms were unable to operate as they should. Previous outages have hit critical services too, preventing end users from accessing the goods and services they needed and bringing with them serious regulatory, reputational, and financial ramifications.
Hindsight is a wonderful thing, and today’s headlines will be full of recommendations and comments on how to avoid or best manage unexpected events such as outages and downtime. We operate in an incredibly interconnected and in many cases interdependent digital ecosystem, with the internet’s infrastructure forming a prime example, so yesterday’s events serve to reiterate the importance of having a failsafe ‘plan’ or set of processes in place to put out proverbial fires (or in some cases, real ones).
It did spark some interesting talking points from a Cutover perspective, though.
Let’s just briefly look at what happened, and then we can think about why it’s important.
What caused the Fastly outage?
Mid-morning on a relatively ‘normal’ Tuesday, there was a huge internet outage affecting websites including The Guardian, Gov.uk, Amazon, and Reddit. The outage was traced to a failure in a content delivery network (CDN) run by Fastly, with visitors receiving error messages such as ‘Error 503 Service Unavailable’ and ‘Connection Failure’. Some websites were brought down altogether, while parts of other services broke. The cause took Fastly, a cloud services provider, a while to identify: the technology it runs, designed to speed up page loading and absorb traffic spikes, sits between most of its clients and their users, hence the domino effect of the failure.

Some sites restored service quite quickly by switching away from the network, presumably either to another provider or by failing over to static content served from elsewhere. Others faced more of an uphill battle, whether in being alerted, failing over, or rerouting gracefully.
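To make that switch-away idea concrete, here is a minimal sketch, in Go, of the kind of external health probe a site might run against its own CDN-fronted address to spot the 503s described above and decide it is time to invoke a failover runbook. The URL, the threshold of three consecutive failures, and the “trigger the runbook” step are placeholders for illustration, not Fastly’s or any particular provider’s tooling.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkThroughCDN probes the public, CDN-fronted URL and reports whether it
// is responding without a server error (the 503s seen during the outage).
func checkThroughCDN(url string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return false // a connection failure counts as unhealthy
	}
	defer resp.Body.Close()
	return resp.StatusCode < 500
}

func main() {
	// Hypothetical endpoint: the site as end users reach it, through the CDN.
	cdnURL := "https://www.example.com/healthz"

	failures := 0
	for {
		if checkThroughCDN(cdnURL) {
			failures = 0
		} else {
			failures++
			fmt.Printf("probe failed (%d in a row)\n", failures)
		}
		// Three consecutive failures is an arbitrary threshold for this sketch;
		// in practice this signal would feed a rehearsed runbook, not a script.
		if failures >= 3 {
			fmt.Println("CDN path looks unhealthy: trigger the failover runbook")
			failures = 0
		}
		time.Sleep(30 * time.Second)
	}
}
```

The design point is that the probe runs from outside the affected path, so it keeps working, and keeps telling you the truth, even when the CDN does not.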
The key takeaway here is that centralised internet infrastructure means single points of failure can have huge ramifications, and it’s not the first time we have seen outages sweep across geographies and industries, causing significant disruption. In other instances outages are localised, affecting only specific cities, and even yesterday’s Fastly outage was uneven: some geographies, including Berlin, seem to have emerged unscathed.
What we can learn from this is that in times of stress, during black swan events like these, companies need pre-canned routines that have been rehearsed and can (in reality) be adapted on the fly, so that they become the processes you rely on to get you through rather than the processes that go out of the window in an emergency. Many of these scenarios are never really thought of as likely to happen, but as we all saw in the early stages of the pandemic, even the most unexpected events need a plan.
With internet infrastructure this centralised, and companies relying to some extent on systems outside their control (in this case, Fastly’s CDN), they need a path that does not depend on that network, or at least the ability to recognise that it is not coming back any time soon, plus the visibility to identify the affected services and orchestrate a plan. In today’s scenario, that could be as simple as a ‘fail whale’-style static HTML holding page that can be served up and handle many requests per second, even on commodity hardware and an ordinary internet connection.
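As a rough illustration of how little such a fallback needs, here is a sketch, in Go, of a static holding page served straight from memory by a single small binary. The copy, the port, and the choice of a 503 status with a Retry-After header are assumptions for the example, not a prescription; the point is that the fallback has no dependency on the CDN, a database, or the rest of the stack, which is why even modest hardware can serve it at high request rates.

```go
package main

import (
	"log"
	"net/http"
)

// A deliberately boring holding page: no CDN, no database, no dynamic
// dependencies, so it keeps working when the usual stack does not.
const holdingPage = `<!DOCTYPE html>
<html>
  <head><title>We'll be right back</title></head>
  <body>
    <h1>We're experiencing a temporary issue</h1>
    <p>Our main site is unavailable right now. We're working on it.</p>
  </body>
</html>`

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		// 503 plus Retry-After tells well-behaved clients and crawlers
		// that the outage is temporary.
		w.Header().Set("Retry-After", "300")
		w.WriteHeader(http.StatusServiceUnavailable)
		w.Write([]byte(holdingPage))
	})
	log.Println("holding page listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```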
So, just as in The Checklist Manifesto, by having processes and failovers independent of integrated services, companies can have more confidence that if, and inevitably when, things go wrong, they have a back-up plan A, B, C, and so on. We’re living in a world where we know these shocks will keep coming, so the better we can plan for them and be accountable for our actions, the more confidently we can operate and, crucially, the more quickly we can recover.