The evolution of Incident Management part 2: The advent of ITIL

Jim Korchak

October 28, 2021

The following blog post is the second part of my series on The Evolution of Incident Management. If you haven’t had a chance to read part one: in the beginning..., you may find it useful background for this article. I will point out that no mainframe people were harmed in the making of that blog. 

 

As society emerged from the Wild West period of computing towards the end of the millennium, the way companies looked at technology began to shift. A discipline understood by few outside of technology in the ‘90s began to seem less mystical to those outsiders (aka muggles). I argue that this shift was driven by three primary factors: 

  • Technology began to emerge as the key differentiator at this time for many industries but also a significant cost for most companies – like mosquitos drawn to a fluorescent light, senior management will always be compelled to understand and dissect material expenditures and strategic opportunities.
  • The prolific growth in the use of the internet and of desktop applications like MS Office began to make technology less scary for the layman. In particular, the spreadsheet began to bridge the gap between technologists and muggles because writing formulae and recording macros was more ‘code-like’ than the ‘word processor’ paradigm that most applications leveraged at that time (e.g. email, Word, PowerPoint, and Internet Forms). I could write an entire series on why spreadsheets are the bane of technology teams, but that’s for another time.
  • Finally, (and most crucially) as technology began to be more ubiquitously used and critically relied upon by companies, technology issues became more impactful to internal operations and therefore noticeable by management. When things went wrong, technology could now expect a major inquisition with each outage. This created a new and, at times, humorous challenge where technology now had to try to translate and explain complicated technical concepts to the non-technology community. This remains a challenge to this day - I recently had to explain a failure in “dual-phase commit” to an executive committee using an analogy involving lollipops and children.

 

As we delve into the evolution of technology and incident management during this period, it is once again good to set the scene by looking at the backdrop of technology at the time:

  • With advances in Wide Area Networking (and particularly fiber optic technology), as well as the broad adoption of HTML and JavaScript, global applications became the norm, replacing the localized systems of the 1990s. This meant that technology support teams were far less likely to be physically co-located with the user base.
  • The internet really started to properly take off - Amazon began selling more than books a little before the turn of the millennium, the more tech-savvy consumers began to do their banking exclusively online, and the ‘glossy brochure’ version of the internet began to evolve into the digital marketplace that we enjoy today. The internet shifted from fringe technology to ubiquity during this period.
  • With the non-technology leaders leaning into technology, the often-self-inflicted wounds by well-intended but chaotic technology teams were seen as problematic and avoidable - IT Security and Access Control became ‘a thing’ and support teams had most of their everyday access revoked.
  • Similarly, making changes to your IT systems became globally recognized as the single largest source of IT outages and incidents.
  • As investment in technology increased and drew the attention of senior leadership, one question started being asked: ‘Why are we using our expensive developers and engineers on support and killing their productivity?’ In response, and not surprisingly, dedicated stand-alone support teams became the go-to organizational model.

 

Given the backdrop above and the “leaning in” of management muggles into the magical world of technology, the key issue to resolve around the turn of the century was this: “How do we bring order to the relatively chaotic world of technology and support?”

At this point, ITIL popped up and said, “Somebody please hold my drink”.

In the late 1980s, the UK Government’s Central Computer and Telecommunications Agency decided to create a standard set of technology practices that they called the Information Technology Infrastructure Library or ITIL. Leveraging the supremely efficient practices of government, ITIL comprised a mere 31 different books. During the 1990s, this illustrious set of books could be found at most libraries in the Arcane Literature section or in most pharmacies in the aisle dedicated to “Sleeping Aids”. But in the early 2000s, ITIL was consolidated into a much easier-to-digest version two and, for many muggles, this became the panacea to the problem of “technology chaos”.

In essence, ITIL attempted to break down technology into a set of discrete, but interconnected processes that could describe just about anything one would want to achieve inside of technology. Throughout the 2000s, terms like Incident Management, Problem Management, Change Management, Release Management, and Service Level Management became the vernacular of the era.

Incident Management itself was then treated as a process with its own discrete steps. Interestingly, the incident management process’s only objective is to restore service to working order. Permanently fixing the root cause of the issue belongs to an entirely different process (Problem Management). During this era, teams increasingly used data to analyze the speed of the process overall, but also the speed of each step. This analysis led to investment in technology that would deliver measurable reductions in downtime: ticketing tools and monitoring tools, but also resilience designs and disaster recovery. 
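To make the idea of measuring each step concrete, here is a minimal sketch of how a team might break one incident into per-stage durations. The stage names, the `stage_durations` and `time_to_restore` helpers, and the sample timestamps are all illustrative assumptions, not part of ITIL or of any particular ticketing tool.

```python
from datetime import datetime, timedelta

# Hypothetical stage names, loosely following the steps described in this post.
STAGES = ["detected", "mobilized", "diagnosed", "restored"]


def stage_durations(timestamps: dict) -> dict:
    """Return how long it took to reach each stage from the previous one."""
    durations = {}
    previous = timestamps["started"]
    for stage in STAGES:
        durations[stage] = timestamps[stage] - previous
        previous = timestamps[stage]
    return durations


def time_to_restore(timestamps: dict) -> timedelta:
    """Total time from the start of the outage to service restoration."""
    return timestamps["restored"] - timestamps["started"]


# Made-up example incident, purely for illustration.
incident = {
    "started": datetime(2005, 3, 14, 9, 0),
    "detected": datetime(2005, 3, 14, 9, 5),
    "mobilized": datetime(2005, 3, 14, 9, 20),
    "diagnosed": datetime(2005, 3, 14, 10, 0),
    "restored": datetime(2005, 3, 14, 10, 15),
}

for stage, duration in stage_durations(incident).items():
    print(f"{stage}: {duration}")
print(f"total time to restore: {time_to_restore(incident)}")
```

Averaging these per-stage figures across many incidents is what would show a team whether the slow part is detection, mobilization, or diagnosis, and therefore where tooling investment is most likely to pay off.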

Some organizations drank the ITIL Kool-Aid so much so that they modified their technology organization structures to have teams aligned to the ITIL Processes. During the 2000s we increasingly began to see structures like the “Incident Management Team”, “Change Management Team”, and “Problem Management Team” within IT.

In keeping with the times, the components of incident management evolved to look something like the following: 

  • Issue identification - monitoring agents and alerting solutions started to become ubiquitous as they are today. The implementation of “Command Centers” to oversee the landscape of monitoring became the norm. Great programs of work were implemented to ensure every technology component everywhere was monitored in some way. Although this may seem like perfection, it began to create a signal-to-noise-ratio issue for these newly formed Command Centers as well as the challenge of the “false positive”.
  • Mobilization – newly-formed Incident Management teams began to drive mobilization as one of their key responsibilities. “On-Call” tools and SMS message broadcasts became pervasive in their efforts to get the right skills assembled as quickly as possible onto the ever-present “Incident Bridge”.
  • Diagnosis and recovery became a more structured process. With the investment into alerting and monitoring, the percentage of issues where the cause was known at the outset increased materially. Unfortunately, those few incidents where the monitoring didn’t highlight exactly what was wrong at the start began to stick out like sore thumbs. 
  • Resolution and post-incident review – with the separation of restoring service (Incident Management) from permanently addressing the issue (Problem Management) technology teams had a better mechanism for understanding repeat issues. It became common practice to do post-mortems on the larger impacting issues and modify the Incident Management process through the practice of “continual improvement” to fundamentally “do better next time”.
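The signal-to-noise problem mentioned under issue identification can be put into numbers. The sketch below is a rough illustration only: the `alert_precision` helper, the `real_issue` field, and the sample alerts are assumptions made up for this example rather than the output of any real monitoring tool.

```python
def alert_precision(alerts: list) -> float:
    """Fraction of alerts that pointed at a real issue (higher means less noise)."""
    if not alerts:
        return 0.0
    real = sum(1 for alert in alerts if alert["real_issue"])
    return real / len(alerts)


# Made-up sample: eight alerts in a shift, only two of which were real issues.
alerts = [
    {"source": "disk-monitor", "real_issue": False},
    {"source": "disk-monitor", "real_issue": False},
    {"source": "cpu-monitor", "real_issue": False},
    {"source": "db-heartbeat", "real_issue": True},
    {"source": "cpu-monitor", "real_issue": False},
    {"source": "web-ping", "real_issue": True},
    {"source": "disk-monitor", "real_issue": False},
    {"source": "cpu-monitor", "real_issue": False},
]

precision = alert_precision(alerts)
print(f"real issues: {precision:.0%}, false positives: {1 - precision:.0%}")
```

A Command Center where three out of four alerts are false positives, as in this invented sample, is exactly the environment in which a genuine alert is likely to be missed.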

 

So, in terms of that technology chaos of the previous era, mischief managed, right? Well, not so fast, my friend… a new set of challenges began to emerge. 

As with many reactionary responses, the pendulum of making everything a process or procedure may have swung a bit too far toward structure. Many of these internal processes came to be seen as onerous, slow, and heavy-handed. With the dawn of the mobile computing revolution, the desire to release features fortnightly, weekly, or daily began to bump up against this concept of structured procedure.

Additionally, the proliferation of ITIL process teams and standalone support teams not only diluted technical expertise on incident bridges but also created a greater distance between the developer/engineer and the production environment/consumer. This distance, in turn, spawned a few negative consequences, including a drop in the quality of software development because it became easy for developers to throw software “over the wall”. 

Finally, those incidents where monitoring and alerting hadn’t immediately identified a solution began to take longer to resolve without the product experts (developers/engineers) being involved. This was something that did not escape the notice of management.

In summary, many tech insiders might feel the 2000s was the time that saw the magic and mystery of technology come under attack by the shadowy forces of muggle process and procedure. But, in all fairness, the period can probably best be considered the maturation era for Incident Management. Many of the concepts and improvements implemented during these years are still in use today with great effect. 

Nonetheless, it wasn’t all smooth sailing just yet - new storm clouds were brewing on the horizon in response to the attempt to make technology formulaic. This is something I will explore in the next part of this series - The Evolution of Incident Management Part 3: Cloud City, Androids, and The Developers Strike Back.

 

Jim Korchak is a twenty-five-year technology veteran whose career has centered on automation, application service management, and the software delivery lifecycle. He has extensive experience in the financial services sector both in the UK and the USA where he has held a number of senior positions helping to shape technology strategy and execution. 

He is currently a Principal Consultant for Resilient Technology Specialists where he advises companies on how to drive improvement in the application lifecycle, from requirements management to production operations by leveraging best practices and intelligent automation.

A recognized thought leader in Application Management and Technology, Jim has been quoted by Forrester Research, has held advisory positions for several technology start-ups, and has spoken publicly as a lecturer for Hult International Business School as well as at a number of industry events as keynote speaker. 
