Richard Bell and Chloe Lovatt
December 20, 2018
The Treasury Committee has announced an inquiry into the increase in operational incidents in financial services. It is an inquiry that is much needed. However, the investigation is looking in the wrong place. It fails to focus on the true root cause of most major incidents: insufficient attention to change management and execution.
Banks are under pressure to transform and to do it quickly. This pursuit of high-velocity transformation is leading to poorly executed, high-risk change events. The result is an increasing number of high-profile major incidents. The inquiry fails to address the role that poorly planned and executed change plays in causing major outages.
It’s interesting that the initial information about the inquiry doesn’t mention the quality of change as a key focus. Change management was found to be the top root cause of operational incidents in 2017, with 80% of incidents having their root cause in change. This is also evidenced by some of the letters written to Chairwoman Nicky Morgan by banks about their own recent outages. According to Barclays, “The causes of the issue we experienced on 20 September were some technological software changes which interacted in a rare and unexpected manner that did not present in extensive pre-testing.” NatWest and RBS said about their recent outage, “The cause was quickly recognized as the result of an incorrect implementation of a network firewall rule update.”
Although Agile, DevOps, Continuous Integration and Continuous Deployment have enabled many organizations to deliver IT changes more quickly, it is possible that the increased volume and velocity of change are contributing to the rise in outages. Core business processes traverse evolved technical architectures made up of vintage platforms, legacy vendor services and an array of proprietary applications and digital services, all of which add huge complexity to the implementation of change. The biggest threat to the availability of core services is the unexpected consequences of change. Both the demand for change and its pace continue to grow against a backdrop of organizations aspiring to achieve 99.999% continuous resilience. It stands to reason that more change is likely to equal more incidents.
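To put the 99.999% aspiration in perspective, the short calculation below is a minimal sketch that converts a target percentage into a yearly downtime allowance. It assumes the figure is read as an availability target measured over a 365-day year; the function name and that interpretation are ours for illustration and are not taken from any bank's stated methodology.

```python
def downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Return the minutes of downtime per period allowed by an availability target.

    `availability` is a fraction (0.99999 for "five nines").
    Assumption: the period is a plain 365-day year unless `days` says otherwise.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    minutes_in_period = days * 24 * 60
    return (1.0 - availability) * minutes_in_period


if __name__ == "__main__":
    # Five nines leaves only a few minutes of outage per year.
    print(f"{downtime_budget_minutes(0.99999):.2f} minutes per year")
```

Under that reading, a single change-related outage lasting an hour would use up many years' worth of allowance, which is why the quality of each change matters so much.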
Banks perform tens of thousands of changes per year, the majority being low-risk minor works items. The medium to large changes require highly orchestrated events involving humans and machines to be safely implemented. The continuous threat of cyber attacks drives continuous change. Barclays stated in their letter, “we typically deploy thousands of software improvements each day in order to ensure our systems are updated to keep up with rapidly evolving threats, technologies, and standards.” NatWest and RBS also cited a great volume of change in their organization, writing “We update our firewall rules around 800 times annually and it is very rare for an incident like this to occur.” Despite the rarity of errors, with a volume of change like this it’s only a matter of time before something goes wrong if the way that change is delivered is not addressed.
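The "matter of time" argument can be made concrete with some simple arithmetic. The sketch below estimates the chance of at least one failed change in a year, using the roughly 800 annual firewall rule updates quoted above. The per-change failure rate of 1 in 1,000 is a purely hypothetical figure chosen for illustration, and the calculation assumes each change fails independently of the others, which real change programs may not satisfy.

```python
def prob_at_least_one_failure(changes: int, failure_rate: float) -> float:
    """Probability that at least one of `changes` independent changes fails.

    Assumption: every change has the same `failure_rate` and failures are
    independent, so P(no failures) = (1 - failure_rate) ** changes.
    """
    if changes < 0:
        raise ValueError("changes must be non-negative")
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be a fraction between 0 and 1")
    return 1.0 - (1.0 - failure_rate) ** changes


if __name__ == "__main__":
    annual_changes = 800          # figure quoted in the NatWest and RBS letter
    hypothetical_rate = 1 / 1000  # illustrative assumption, not a reported number
    p = prob_at_least_one_failure(annual_changes, hypothetical_rate)
    print(f"Chance of at least one failed change in a year: {p:.1%}")
```

Even with a failure rate that sounds very low for any individual change, the yearly probability of at least one failure comes out above one half under these assumptions, and it climbs further as the volume of change grows.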
There are many causes of operational incidents that can lead to major outages. The initial information on the inquiry shows that banks have a lot to contend with in order to ensure that their customers are not negatively affected by disruptions and are adequately compensated when disruptions do occur. However, perhaps the banks should be focusing more on how they implement change, rather than just focusing on accelerating IT delivery. The resilience goals of “near perfection” will only be achieved when change is near perfected. If the main cause of outages is change, then improving change will enable banks to avoid incidents, rather than having to deal with the fallout after they occur. Proper change management allows banks to deliver change at pace with resilience being one of the deliverables.
The Treasury Committee inquiry is much needed in response to the increased number of outages over the past couple of years. However, in light of everything above, we would urge the committee to focus its investigation more on the most common cause of outages, which is change. We would be willing to support the review and wider discussions on how to reduce risk in change delivery.
Cutover is a work orchestration and observability platform that allows you to strategically plan, orchestrate and manage change with transparency and control. This helps you improve your change and release processes and your level of resilience against incidents. Download our new white paper Work Orchestration & Observability Become Critical for Operational Resilience in Financial Services to find out more.
Richard Bell is a cofounder and director at Cutover. He is the former IT COO & Deputy CIO for Barclays and has over 30 years of Financial Services experience, including the delivery of many international change programs across Investment, Commercial, Wealth & Retail Banking.