Well, good morning and good afternoon, everyone, and thank you for joining today's Cutover Live session. My name is Kimberly Sack, and I'll be the moderator and host for today. Cutover Live is our series of short, informative sessions on IT operations topics like IT disaster recovery, major incident management, and cloud migration. Today's session is Mastering the IT Disaster Recovery Maturity Curve. Before we kick off, a few quick housekeeping items. For everyone joining today as an attendee, at the bottom of your screen you'll find a Q&A panel. Please feel free to submit any questions you have during the webinar. If possible, we'll make it interactive and take questions during the topic; if we can't get to a question, we'll take it at the end or follow up with you after. This session is being recorded and will be accessible after the event. Now on to our agenda. In the next twenty to thirty minutes, we're going to discuss key ITDR metrics and what's changing, how to understand where you fall on the ITDR maturity curve and how to advance, and then wrap up with Q&A. And I'm very excited to introduce our presenter and speaker for today, Manish Patel. He is our SVP of Global Solutions here at Cutover and is based in the UK. Manish, I'll hand it over to you to get started.

Great, thanks, Kim. A quick background on myself: I've been with Cutover for almost nine years. I don't know where that time's gone, but it's been fun, a great journey. I look after clients both at the prospect stage and once they become customers, working with them on driving their operating model, making business changes to how they do ITDR, and delivering integrations and automations, all the sorts of things that we're going to talk about today.
Before Cutover, I spent over twenty years at a major UK financial organization running big migration projects and programs, delivering front office, middle office, and settlement solutions around the world, which was a great job as well. So, look, let's get into it. And as Kim says, please do ask questions; I'd love to answer them on the relevant slide if it's appropriate, and if not, we've got a few minutes at the end to wrap up and answer them all. I'm going to talk about metrics first and what we think is changing, and then we'll go into the maturity curve. These are the common challenges that we see. We've been speaking to lots and lots of different clients, and we ran a survey, which I'll go through in a second, with some really interesting results. But these will be common challenges that, no doubt, you see today. For example: how can we run a truly comprehensive DR test? We see lots of organizations running DR tests, but they're only running a few services at once, not quite mimicking what could happen in a real disaster. You stitch those tests together and you get a level of comfort that you have a recovery process, but if you had to recover five hundred or a thousand services all at once, could you really achieve that? That's a common concern: even when the boxes are ticked, people worry about what would happen in a real-life event. The lack of confidence is always there, because a test isn't real. It's sanitized and slightly scripted; you know what the answer is before you go through the process. So if you ever had to do it for real, there is a genuine worry about how you would react in a live scenario. Most DR tests are also manual, which is fine in itself.
You need the right people doing the right jobs, but because it's manual, there are errors and delays in the process. And what we're finding more and more, as DR becomes a bigger exercise, is a lack of visibility. Senior stakeholders don't know what's going on. No one person owns the whole thing; it's segregated out to lots of different people. So those are common problems we see across a whole range of clients and customers, which no doubt you see as well. There's a great little quote on the right-hand side from one of our customers: "We did a full-scale test. We were at breaking point using spreadsheets." They were trying to recover five hundred services in one go, all run from spreadsheets, and they couldn't do it. Unfortunately, they failed. They had to stop what they were doing, and now they're looking at how to improve their process. And that's where tooling, automation, ownership, and visibility all come together to allow you to run a really good test. Let's move on to some of the answers we got to our survey, which are quite interesting. There are certainly concerns across different types of disaster scenarios. As you can see on the right-hand side, what would normally have been an IT failure of some description, a network outage or a software failure, is now being overtaken by cyber attack. The biggest worry today is how to recover from a cyber scenario. That's not to say DR is no longer important; of course, DR actually helps in a cyber attack situation. If you're well rehearsed, your services are well documented, and you know what your recovery plans are, then across all of those different outages you have a way to recover.
But it's interesting how cyber attack has become the number one concern affecting most of our clients. Most of our clients are financial services, so that makes a lot of sense. On the left-hand side, you can see the other types of scenarios that caused outages, with cloud being one of the highest. It's interesting: as cloud has become more and more prevalent, which is a great thing in itself, people don't quite realize that it can still go down. Services can still fail, and if they're not architected in a certain way, you still have to go through a recovery process. It's not auto-detected, and it's not necessarily self-healing. So even on the cloud, it's still important to have really good DR processes, tested and rehearsed, so that in the event you need to run them for real, they are ready. Cyber attack and on-premise failures are still major concerns, and they increase every year as our architectures and our number of applications grow more complicated, which unfortunately introduces a higher risk of failure in those processes. Very quickly on the other two areas: the top three contributing factors. There is a general IT skills shortage. It's fair to say, especially in larger organizations, that you get pockets of people who know their product really well, but outside of that, it becomes a voyage of discovery. So it's important to keep the skills for recovery spread as widely as possible. We don't want to rely on one or two people; we want many people able to do that role, so in the event of a disaster you're not relying on one person who happens to be on holiday to fix the problem. Recovery should be second nature to a much wider group of people. And cyber threats, as you can imagine, are always going to be a problem.
And I think if we run this survey year after year, anything cyber-related will keep hitting the top of the lists. Remote working is an interesting one. We've been used to it now for the last four or five years; I think it's become the new norm. But it's tricky to make sure people are at their desks when you need them there. When you work from home, there's a little bit of flexibility, quite rightly, and therefore you won't necessarily have people with hands on keyboards twenty-four seven. So that does become a contributing factor in some of these issues. And very quickly on the last one: things can take longer to recover, and that is, again, down to the basics. The DR plan may not be correct, may be out of date, and that affects all the use cases you can see there. Moving on, this is an interesting one: how up to date are your recovery plans? If you look at the pie chart on the right-hand side, a third of them are more than twelve months old. That sounds like a long time to me; even a year is quite long. For our customers, some of whom, like I say, are financial services, having plans that are more than a year old is quite alarming. We need a good process to make sure plans are regularly updated. Looking at the rest of the pie, it's really nice to see around twenty percent who update plans within six months, and thirteen percent who are constantly reviewing, which is fantastic. The norm is six to twelve months, typically towards the twelve-month end. That's okay. I think regulation states that plans have to be kept up to date at least once a year, and they have to be tested at least once a year, depending on what tier of application you are.
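A plan-freshness rule like the one just described, a yearly maximum age, ideally tightened so that any application change also triggers a review, could be sketched as follows. This is a minimal illustration, assuming a simple model where each service records when its DR plan was last reviewed and when the last change was released; the function and field names are illustrative, not from any particular tool.

```python
# Sketch of a DR plan freshness check: flag a plan for review and
# recertification if the application changed after the last review,
# or if the review is older than the allowed maximum (here, one year).
from datetime import date

def plan_needs_review(last_reviewed: date, last_change: date,
                      today: date, max_age_days: int = 365) -> bool:
    """Return True when the DR plan should be reviewed again."""
    if last_change > last_reviewed:
        return True  # a release went out after the last plan review
    return (today - last_reviewed).days > max_age_days

# Example: plan reviewed in January, change released in June -> stale.
print(plan_needs_review(date(2025, 1, 10), date(2025, 6, 2),
                        date(2025, 7, 1)))  # prints True
```

Running a check like this across the whole service catalog, say on every release, is one way to move from "review once a year" towards the change-driven recertification discussed next.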
So the fact is that up to twelve months is fine, but I would like to see, through a maturity process, plans updated more often. Maybe every time you release a change to your application, that drives an update to the plan, or at least a review and recertification of it. I think that's a great maturity step for people to take. But I understand it takes time and resourcing, and that's not always available. So as an ambition, I think that's a great place to be. Coming on to this one, this is really important, and again it's a maturity step. How do we move away from plans that are executed manually? How do we move away from things that take a long time to plan, and move into some level of automation that improves the whole process? I'm moving into maturity a little bit here, and we'll talk about that in more detail in a second. But think about how long it takes to plan: you don't have that luxury in a live event. In a live event, you've probably got fifteen or twenty minutes while senior management decide whether to run a recovery process or fix forward. So you literally have only twenty or thirty minutes to work out how you're going to do your recovery. And therefore the major part of that is automation, and going forward, probably a little bit of AI as well. So that's a great intro into our maturity curve. We've built this maturity curve based on how we've interacted with our customers, where we see they are, where we see ITDR going in the future, and what we do to help with that problem. It's a very simple curve, and we kept it simple on purpose: everything from one, which is pretty much "you don't do this very well today at all," all the way to five, where it's singing and dancing.
At five, it's fully automated, or as automated as it can be, and AI is helping to drive you to the next level. Now, it's important to note that it doesn't matter what you want your ambition to be. If you think your ambition is three or four rather than five, maybe because AI hasn't been adopted within your organization, or there's some trepidation about it since it's new and fast moving, that's absolutely fine. The important thing about the maturity curve is to understand where you are today and where you want to be going forward; whether that ambition is three, four, or five doesn't matter. Let's break down these segments and talk about them individually. Number one: unstructured disaster recovery. We don't see many people in this stage anymore, which is great to see, but no doubt some will be. In financial services, I think because of the nature of the regulation that governs them, stage one doesn't really exist. But as we go wider into different verticals, into retail, say, into pharma, energy, and so on, we will see some organizations at stage one. One basically means you don't really have a DR process. When something goes wrong, you pick up the phone, call in a favor, and work out who the best person is to go and fix it. You probably don't test, or if you do, you test in isolation, whereas if it happened for real, that testing wouldn't actually help you in a real invocation. And the last point, unfortunately, is what manifests from the top three: because it's not a formal process, there are probably no dedicated resources and no single owner of the DR process. That means there's probably no budget, so you can't make improvements, hire people, or bring in technologies to help you manage that process.
Like I say, I don't see many people in stage one anymore, which is great, but it is a valid stage given different verticals and how they start the DR journey. Let's move to stage two. Stage two is where most people are, which is good, because there is some level of understanding of what DR is and the need for a good DR process. But it's very manual, typically Confluence pages or spreadsheets. Testing is done as frequently as humanly possible, but that largely means once a year, and you probably only test your top-tier applications; you probably don't have the capacity to test the lower tiers. It's pretty much all manual: you're still relying on resources who know what they're doing being assigned to activities, and it takes a long time to plan. And, again, because it's not quite doing what it should, it doesn't really help in a live invocation. What does that mean? If something went wrong and you then had to use those plans and processes to invoke, you might not hit your RTO. That may carry reputational risk, you may get fined by the regulator, and you may then trigger a whole host of other processes as a result of those failures. So stage two is a good stage to start at, but it's the foundation stage; it's where you need to move on from. Let's talk about stage three. This is where you've now got a process, and we've seen some customers moving into this stage as part of a natural maturity that they see in themselves. Everything is very well documented. You have central ownership. Service owners understand that they need to build the recovery plans, that those plans need to be accurate, and that they need to be updated. When you run a test, you are actually critically measuring yourself against the RTO. You're not guessing, not saying, "Yeah, I think it took two hours, so I'm okay."
You're looking at the activities, and you can work out fairly closely how long it took. That gives you a greater sense of confidence. But it's still pretty much manual. It still takes a long time to plan and a lot of people to be involved in the process. So you haven't improved the execution element, but you've improved the DR plans and maybe the planning side. Often we see these centrally managed, so there is a DR coordinator: they manage the process and they execute. It may not be as well funded as it needs to be, but it's okay for what they've got. And we are seeing more and more stakeholders taking an active role in what that process looks like. So this is where the fun begins. Let's go on to stage four. Stage four is where we want to see most people mature to, at a minimum. You're looking at your recovery plans, and you've kept them updated, which is really good, but they're manual. How do we move those manual steps into automation? It's fair to say the ambition for everyone should be: how can I have a one-click failover? One click meaning I've made the decision to fail over, I execute a plan, and the rest of the plan runs automatically. That's the right ambition to have. Low-touch DR, more than one click but with the majority run by an automation process, is again the right ambition. The reality, unfortunately, is that we live with very complex architectures and enterprises, and things may not be automatable in the way we want. So having that ambition is correct, but let's not be disappointed if you only reach fifty or forty percent automation, and then take the time to catch up and do the rest.
But it is important to have an automation ambition, to keep it as high as you possibly can, and to have processes for how you're going to achieve it over time. The classic example: in your plan, you ask an engineer, an SRE, or an application team to go and run a process. All you're really asking them to do is log in to Jenkins or Ansible, or open a command line and type in some keywords. Right? So you're asking a human to run an automation job. The point we're making here, and it's a terrible term, is: take the human out of the loop for that process. Let the runbook automate it. Let it run the job and manage it. It will be checking: is it running? How long until it completes? Did it pass or did it fail? We're trying to make sure that the person looking after it is not staring at a screen waiting for something to happen and then being told to go and press some buttons. We're doing all of that for them. They no longer make the changes themselves; they become exception managers, which is correct. We don't want them pressing buttons; we want them reacting to problems, which is what they get paid to do and what we want them doing on these weekends. The other piece attached to that: if you've got one person looking after lots of different applications, call them a DBA, he or she cannot run fifteen things at once. They have to stage them one after the other and make a conscious decision about the order. If the jobs are automated, we can release all fifteen at the same time. Ninety-nine percent, maybe ninety-nine point nine percent, will always pass; you get the odd failure. So rather than watching each one to see whether it worked and what to do next, they get notified about the one that may have failed while the other fourteen go through successfully. We're managing the time much better.
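The "release all fifteen at once and only page on failure" pattern can be sketched in a few lines. This is a minimal illustration of the idea, not a real runbook tool: `run_job` is a stand-in for triggering an actual automation job (a Jenkins build, an Ansible playbook, and so on) and waiting on its result, and the job names are made up.

```python
# Sketch: release all recovery jobs concurrently, and surface only the
# failures, so the on-call engineer acts as an exception manager rather
# than babysitting each job in sequence.
from concurrent.futures import ThreadPoolExecutor

def run_job(name: str) -> tuple[str, bool]:
    """Stand-in: trigger one automated recovery job and report pass/fail."""
    return (name, name != "db-07")  # simulate the odd failure among many passes

def run_failover(jobs: list[str]) -> list[str]:
    """Run every job in parallel; return only the ones needing a human."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        results = list(pool.map(run_job, jobs))
    return [name for name, ok in results if not ok]

failures = run_failover([f"db-{i:02d}" for i in range(1, 16)])
print(failures)  # only the exception needs attention: prints ['db-07']
```

The design point is the return value: the runbook tracks all fifteen jobs, but the human only ever sees the exceptions.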
What this all means is that your RTOs and your RTAs will come in, because you're not sitting in a queue waiting five minutes for something to happen. It's automated; you've saved those five minutes, so you've got greater headroom for achieving your RTAs. The other important thing here is planning. We've spoken about it already: planning takes a long time, and in a real-life event you don't have time to plan. With automation, you can very quickly stand up very large events, hundreds of services, into a measured response, and then execute it. So planning time that typically takes six, nine, or twelve weeks can come down initially to just a few weeks, and over time it should really come down to just a few minutes. That's where we want to be; that's the ambition I want people to have when they look at how they mature their DR. The last bit is really important, and I'm super passionate about this: you have to test like you want to invoke. There is no point testing in a sanitized environment, no point testing in a way where you know success is guaranteed, because in real life that doesn't happen. When you invoke, so many factors come into play that you're not used to from your tests, and so you're scrambling, chasing around trying to make things happen. Now, I know you can never make it a hundred percent real; that's absolutely understandable, these are tests after all. But where possible, your runbooks, your execution plans, your DR plans should be as lifelike as possible. "Test like you invoke" is a hugely important principle, and of course it will take time to get there. But, again, I use the word ambition a lot: that should be the ambition. So the last stage is arguably the most fun stage, but also the one with the greatest number of unknowns: AI.
AI is playing such a big role in all of our lives every single day, and it seems to be gathering pace like there's no tomorrow. Where do we see AI coming into ITDR? I think right now it's an assistant. It's helping you, not necessarily taking over, but it can certainly be used in managing your plans. If there have been changes to the infrastructure or to a product you've just updated, AI can help readjust your plans. It can learn from what you've done before; it can take a historical view of what's actually happened and, as it plans going forward, change your plans accordingly. Optimization is what we're talking about here. Every time there's a change, you should update your plan, or at least review it. We don't have the resources to do that manually, but AI can tell you if any changes are needed. It can propose changes; you still have to review them, but it's done the hard work for you. When we talk about running events, lots of agents are now being deployed. ServiceNow has them, BMC Helix has them; all sorts of agents are out there that you can actually ask questions of: "I'm about to fail over this service. Tell me what the last three changes were and whether they were material." And it will come back and give you some level of comfort that what you're about to do will be successful, versus just going ahead and finding out what happens at the other end. I think the important thing for me is that there could be a mindset shift at this stage. There could be a stage six going forward, when AI really takes off. I was presenting at an event a few weeks ago, there was a US regulator in the room, and we had this conversation. Today, the regulations typically say you need to have a recovery plan in a repository, on the shelf somewhere, ready to be executed.
Whether it's digitized or not doesn't matter, but you need to have something there. In that conversation, which was a really open discussion, we were saying that with AI and the confidence it could bring in the future, having plans sitting on the shelf that may not have been reviewed for months is actually dangerous. A plan that is created in the moment of need, whether for a DR test or a live invocation, using live data, with AI helping you create it very quickly and with a great deal of confidence, may be the way we need to go. So would the regulation change? Could it change? Or is it a question of "I have to have a plan, and I'll keep you updated every month, but I know the plan was created by AI"? It was a really interesting conversation, because with the advancements in AI and everything else that's going on, we could effectively outgrow the regulation. It may have made sense up until now, but going forward we might want to be even better than what the regulation asks for, and yet not be allowed to, because we're bound by the regulation itself. Quite an interesting discussion. And like I say, maybe if I do this same session next year, we'll have a number six, and number six is just-in-time ITDR: you build the plan as you need it, whether for a live event or for a test. So, look, I've gone through that relatively quickly, and hopefully we've got some questions on the line. But like I said, the curve is not there to tell you what to do. It's there for you to recognize where you are and, ideally, what you want to achieve, where you want to be. Ambition is a really good word.
We want the ambition to be higher than what may currently be achievable, and then you adjust where you need to be based on the infrastructure around you, the level of expertise, the budget, and so on. But, yeah, use the curve as a "where am I now, and where could I be going forward." I've sort of covered this already, I think, in my conversation. So, yes, automation: absolutely essential. We are beyond the need to have humans do everything; let's use the tools that are already within your organizations. Number two is actually disappearing as a result of number one: the more you automate, the more the chance of manual errors disappears. And the last thing: look at AI. We all have different views on it and different levels of comfort with it, but look at AI for what it is. If it's right for you and your organization is adopting it, it could be a nice value-add to what you do today.

Great, thank you so much, Manish. I know we have a couple of minutes left, and it looks like we do have a couple of questions. The first one, Manish: can you provide an example of an AI-powered runbook that identifies an improvement or automates a decision during a DR test or live recovery?

Yeah, I think that's where we want runbooks to be in general, right? When you create a runbook, there are lots of great AI tools, and I'll bring it back to Cutover just for this one question. We've got something called Cutover Assistant, which looks at your runbook as it is today and suggests improvements, based on what it sees and what it understands about your application. So you feed it the right data, and it will generally come back with improvements. Now, it's up to you whether you accept them or not; that's fair.
I think the other thing is that during execution, you can use AI to run a whole bunch of health checks or what-if scenarios, and so on. So as you're doing it for real, you're not scrambling around trying to work out what's happening; you're asking AI to help you make decisions. Again, I think the ultimate decision has to be made by a human. You don't want AI to take over fully, because there are external circumstances where you may not want to say yes to what's been proposed, but at least it gives you all of the knowledge you need to make a confident decision on how to move forward.

Awesome, thank you. And it looks like we have one more question: which specific regulations are driving these types of changes, and how does achieving higher maturity on the curve directly help an organization satisfy these compliance requirements?

Yeah, that's a great question, and I'm going to say I don't think regulations are driving the change; I think the audits from the regulators are. We have heard recently from a number of clients that when they've been audited, the auditors have gone into much more detail than before. They now have SMEs on the audit panel reviewing what you present back to the regulators. Before, it was just a group of people coming in and asking, "Have you done this? Have you done that?" and you showed evidence. Now they're digging into it. They're looking at what you've done and critically saying, "Hold on a second, that doesn't look right. Are you sure you did that?" and so on. So I don't think the regulation is changing, but the audit practice is getting far deeper, and they want to know far more than they ever have before.

Okay. I just want to call out: please stay tuned for more Cutover Live sessions coming in twenty twenty-six. This will likely be our last one in twenty twenty-five. We've had, I think, a total of fifteen.
So if you're interested in seeing some of the other sessions, go to cutover.com, visit our YouTube channel, or go to our LinkedIn page. Again, if you have any questions, feel free to go to Cutover's company page on LinkedIn, find myself or Manish, and we're happy to have a conversation or get you connected with the right people. And, of course, if you want to see the product in action and learn more about how Cutover can help you advance on your ITDR maturity curve, go to cutover.com and request a demo.

Awesome. Well, we are at time, so if you just go to the next slide. Thank you. Again, thank you, Manish, that was fantastic. Thank you to our attendees who joined. Find us at cutover.com, find us on LinkedIn, and we will see you all in twenty twenty-six. Enjoy the winter and holiday season, everyone.