Welcome to our webinar on maturing your approach to IT disaster recovery. We're really excited to talk through this topic in a little more detail today. Just to introduce myself: my name is Darren Lee, I'm the senior product manager for resilience at Cutover, and I'm joined today by Manish Patel. Manish, do you want to say a few words? Yes, sure. Hi, everyone. My name is Manish Patel, and I'm the SVP of global solutions at Cutover. What that means is I look after a couple of areas within Cutover: presales, but also our integration, engineering, and delivery teams. Just a quick background on myself. Before Cutover, I was working in a bank for over twenty years, in various parts of the organization, and I had a lot of experience in doing ITDR. And, unfortunately, I was involved in a couple of live invocations as well. So I've lived and breathed this subject, and I'm really looking forward to talking to you today. That's great. Thanks, Manish. Yeah, as Manish says, we're going to touch on some of our respective experiences, good and bad, later on in this session. Like Manish, I also have a background in financial services, working in banking and asset management, and I've seen similar events to Manish. So hopefully we'll touch on those — how we dealt with them, or how we didn't deal with them, and indeed how our customers are dealing with these sorts of events as well. So we've alluded to an approach to maturity that we're calling a maturity curve. You'll see that come up on the screen, and we're going to go through the different stages you see here. I'm going to talk through what we've seen in the market, what we continue to see with our customers, and also the conversations we have with prospects. And it's fair to say that, generally speaking, we see companies, organizations, and firms at all of these stages.
We see companies who have a very unstructured approach to disaster recovery, as well as some that are a little more mature, a little more structured. And just to qualify the disaster recovery statement: what we're talking about there is both the testing, the ensuring that you're ready, but also the actual response, should the worst case happen and an incident occur. So it's the testing and the recovery. You may see a stage five up on the screen; we'll come on to that in a little bit. We're also seeing a fair amount of activity in response to regulations: important business services from the Bank of England and the PRA, DORA in the EU, and the US regulators with similar types of regulation and requirements being placed on firms, particularly in the financial services sector. And the maturity curve we're detailing here is a reflection of the journey that our customers have gone through and, in some cases, are still going through, but also what we see in the conversations we have as prospects begin to look at our platform and see what we can do for them. Manish, could you touch on some of the conversations you're having with prospects? Does this resonate? Is this the sort of subject matter that's coming up? Yeah, absolutely. And I think it's fair to say, Darren, that not everyone wants to achieve level five, which we'll talk through. It depends on where they are and where they want to be, and our role is to support them on that journey. We're talking to clients right across the piece. I don't think there's anyone really at four and five amongst our clients at the moment, although there's certainly lots of aspiration to get to four.
Most of our clients are at one and two, and they're really looking for a solution that will take them to the next level of automation, of being able to respond to regulations — great orchestration, great tooling, great dashboards that support them on that journey. So it's great to see this. And I love talking to our clients about where they are and where they want to be, and then programming together with them to work out what we can do together. That's great. I guess the other thing we should call out here is that presenting this as one through five implies a sequence you have to follow. And I think it's fair to say that whilst typically we see customers progress through these stages, we're not saying that this is the only order or the correct order. We have some customers that will go straight in at three, in terms of integrating with their IT service management tool and their CMDB, whilst also restructuring their testing. So it's important for us to point out that this is a very typical flow, but we see customers come on at all different levels. Manish, I know we've talked about our history, our experiences good and bad. Perhaps you could just talk through the good old days, the battle boxes and so on. How did we used to do disaster recovery testing and planning? Yeah. I mean, battle boxes bring back memories as we start talking about it. A really interesting thing about the old days — and it sounds like a long time ago, but it was only ten, twelve, fifteen years ago, not that long ago, really — is that DR was an exercise we carried out, but we didn't really do it with any great thought. We just had to do it to get a tick in the box.
When I was running front office applications in my previous role, we had battle boxes. They were effectively pilot cases which were just full of documents, and it was my job to take them home every day, as well as my number two and other people too. So if there was an event, there was someone with all of the instructions on how you recover. Initially, it felt like a badge of honor to carry this box, because you were being put in a position of authority: if anything went wrong, they were going to look to you. It became a burden very, very quickly. And I think a lot of people are not doing that today, but they're certainly relying on Excel spreadsheets, which is not dissimilar to a battle box of some description. It's a file sitting on a file share somewhere. Maybe there are five versions of it, and who knows which is the latest. So things have moved on in that the data is more accessible, but that doesn't mean it's correct, and it doesn't mean it's the right way to do it. I got rid of the battle box because we then had a Confluence page, which felt like the biggest thing in the world. But that, again, is a very outdated approach. So through good orchestration, through good automation, we can get rid of some of these archaic ways of working and really move into a very mature way of doing your testing and, equally, your invocation when it comes to it. Yeah, absolutely. And that leads neatly on to the first stage. So the first stage we have here is unstructured: the approach to testing and to recovery is a little ad hoc. As Manish was saying, the days of the battle boxes, of printing out documents and storing them somewhere so that you can use them should the worst case happen — we still see some of that happening now, albeit in digital form.
So Word docs and Excel and PowerPoint and wiki pages and the like. And the challenge there is, if the approach to testing and recovery is unstructured, then it tends not to be front of mind. It tends not to be the thing you're considering when you're making changes to your production estate, whether that be app or infrastructure. It tends to be something you're doing as a tick-box exercise, something that perhaps you're doing just to meet the bare minimum. And I think what we're seeing is a real move from customers away from that tick-box, just-doing-enough mindset to actually wanting to do more, to be more resilient. Some of that is driven by regulators, some of that's driven by board pressure, and some of that's driven by just wanting to do better for their own customers. And that's certainly where we see change going on, and certainly where we see people make use of our platform and go on that journey. And with that unstructured testing, where we really see success is where there's a culture of resilience. We see organizations actually build that into their approach; it's part of their regular lifecycle of change. You make a change to your production system — what do you do? You need to make sure that your recovery plans are fit for purpose. We see organizations who were perhaps struggling to complete a data center test actually use us to complete that activity, and more than just complete it, actually schedule more of them. And we'll touch on that in a little bit.
And this happens even now — you'd imagine that by these years things would be more structured and more planned, but we still see organizations come to us wanting to improve their approach and improve their methods, using orchestration to set standards and so on, and they're using us to do that. Moving on to number two: that regular testing, that structured approach. As I mentioned, we often see customers get to the point of defining standards — whether that's standards for how they approach orchestration of large events, like a data center test or a region failure, or moving on to actually defining individual service or application based recovery plans. Then defining schedules of events, and also surrounding those events with good change management activities. So they define those dates of the test as a large, significant CR that requires the requisite attention and approval, and so on. You see those events protected by change freezes, and they'll appear on a major change calendar. And one of the areas where we have seen some difficulties is where organizations don't have that, and so their disaster recovery testing becomes a sort of side-of-desk activity. They're trying to fit it in amongst all of the other valuable stuff they're doing throughout that calendar year, but it becomes something that is not seen as vital and critical. And using Cutover on that journey is certainly one way of defining those standards for those recovery plans, orchestrating those regular events, and also getting into the weeds of actually defining the things that you want to measure, working with your business counterparts. It's important to talk about that in terms of maturity; I think Manish would probably say the same.
Certainly the business users that I supported and looked after — we worked hand in hand in terms of defining the right recovery time objectives, ensuring that the architectures that we built and supported actually supported the business needs. And that might sound like a surprising thing to do, but it was baked into us; it was drilled into us that that was the right thing to do. Technology is there to serve the business, and the business is there to serve the customers, the users of that business. Manish, maybe you could just touch on your experience in terms of working with the business, and how you came to a point of defining recovery plans — just talk through some of your experience there. Yeah. I mean, as you say, it was a very manual process. You were almost reinventing the wheel every time you did it, which in itself was a major problem as well. Regular testing doesn't just mean you do it often enough; it also means repeatability. You don't have to reinvent the wheel every time you do a test. In our days of doing ITDR back in the bank, it was very much a side-of-desk activity, like you said, an all-hands effort. You just had to get it done as quickly as possible. You'd go in at the weekend and hopefully get the tick in the box, and if there were no major issues, you'd walk away with, you know, a small pat on the back and go: right, that's done for months, and we don't have to worry about it. Again, that's not the way we should be doing it now, right? What we should be doing now is much more than that: it needs to be done for all the right reasons. We are building these plans not just to test with, but ultimately to pick off the shelf if it had to happen for real, and execute that same plan that you've already tested numerous times. Just a quick story on the old way of doing it: like I just mentioned, it was very manual and time consuming, with lots and lots of conversation to get to an event.
We have experience with clients right now who have cut their planning time from many weeks — twelve weeks — down to two or three weeks, just because it's a repeatable exercise. They're literally pulling things off the shelf, agreeing these are the services in scope, and then going away to execute them. So it becomes much more of a governance process versus a planning process starting with nothing written on a page. And I think that's what we need to get to. So regular testing, for me, is not just doing it often; it's repeatability, and reducing the effort it takes to stage one of these events. Yeah, absolutely. And I think the other piece to that is about measurement of the orchestration time, the time that you're actually spending to complete that test activity. We talked about RTOs, recovery time objectives. We're also interested in RTAs — recovery time actuals: what's the actual time that we measured when we were doing the test, but also, in the event of an incident, did we hit the mark we thought we were going to? And certainly we see that the benefit of a platform like ours is that you can measure the activities you're executing to either prove or recover your applications and services — and indeed your large facilities, your data centers, public cloud regions, and so on — and compare that back to your objective. And, you know, there are nuances with RTOs and RTAs, in terms of the time of day an incident may happen. But broadly speaking, we do see that, again, as a mark of maturity: as part of that regular testing approach, actually measure what you're doing and compare it back to your stated objective. Okay. So moving on to the third stage as we've got it defined here: integration with your IT service management tooling and CMDB.
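The RTO-versus-RTA comparison described above can be sketched in a few lines. This is a minimal illustration of the idea, not Cutover's actual data model; the service names and timings are invented for the example.

```python
# Sketch: compare measured recovery times (RTAs) against stated objectives
# (RTOs) after a test. All names and numbers below are illustrative.
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    service: str
    rto_minutes: int   # stated recovery time objective
    rta_minutes: int   # actual measured recovery time from the test

    @property
    def met_objective(self) -> bool:
        return self.rta_minutes <= self.rto_minutes


def summarize(results):
    """Return the services that missed their RTO, worst overshoot first."""
    misses = [r for r in results if not r.met_objective]
    return sorted(misses, key=lambda r: r.rta_minutes - r.rto_minutes,
                  reverse=True)


results = [
    RecoveryResult("payments-gateway", rto_minutes=60, rta_minutes=48),
    RecoveryResult("trade-capture", rto_minutes=120, rta_minutes=150),
]
for miss in summarize(results):
    print(f"{miss.service}: missed RTO by "
          f"{miss.rta_minutes - miss.rto_minutes} min")
```

Run regularly after each test, a report like this is what turns "we did the test" into the measured, comparable evidence the speakers describe.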
Manish, talk about integrations with tooling — how people are using their ITSM tools, and how perhaps they're augmenting the data that's in Cutover with the data they're holding in their golden sources. Yeah, sure. And I would probably say there's something else missing from this line — maybe BCM tooling as well, which is where a lot of the risk management data and RTOs may be held, and so on and so forth. Look, you said something really important there. We can have the greatest runbook in the world in Cutover, but it needs the associated data that makes it come alive: what the service is, maybe where the service is, what the criticality of that service is, what the RTO associated with it is, maybe even the servers supporting that service, because those are the ones you're failing over. All of that information is really important. We are being quite forceful with our clients now. We are saying that we have an integration with your typical CMDB solutions, in ServiceNow and other places as well, and there is no reason why we should not integrate with that. So when you create the runbook, create the integration that allows you to bring in all this extra information. What it does is a couple of things, really. It brings to life, like I mentioned, what you're actually doing, but secondly, it gives you the real data associated with the recovery. And you can make mistakes; you can make assumptions if you don't have that data to hand. In an event like a recovery, or even a live event, the last thing you want to do is go to three or four different solutions to try to work out what makes that service tick and what we need to do to recover it. If it's all in one place, you've got great confidence that what you're doing has the desired impact it needs to have in a recovery. The converse of that is, at the end of that recovery, we can update your CMDB with new data.
If new servers got rolled in and you want to update your CMDB to say they're the primary ones now, we can do that. If you wanted to log what your RTA was at the end of it in your BCM solution, so there's an audit log of how long it actually took, we can do that as well. So this is really important. I haven't even spoken about change requests and the like that are associated with recoveries — and again, we can integrate with that as well. You can bring your change management solution into your runbook, so when you're running your runbook, you know what change numbers are associated with it, and ultimately you can close them directly from Cutover as well. So point three is a really important maturity step that we see for all of our clients, and it's one that we are actively pushing with clients. Even if they haven't asked, we're saying: look, do you think you should be doing this? We have the technology to help you do it. What value would it be to you? And then they work with us to get that initiated. So it's a really important step to bring your runbook to life during these recoveries. That's great. And as we touched on at the beginning, quite often — whether it's us pushing clients, as you said, or whether they're coming to us — this is a conversation that happens quite early on in a customer's journey with us: how do I make best use of that data? So let's just recap a little. We talked about unstructured testing, and what's important there: building a culture of resilience. We've talked about the move to regular testing. That's, as Manish was saying, actually reducing the amount of time that you take to prepare for a test so that you can fit more in, but it's also about being structured in your approach.
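The enrich-then-write-back pattern Manish describes — pull CI data into the runbook before the event, log the RTA back afterwards — can be sketched as below. The CMDB is faked as a dictionary to keep the example self-contained; in practice this would be a ServiceNow or similar API call, and all field and service names here are invented.

```python
# Sketch: enrich a runbook from the CMDB before a test, then write the
# measured RTA back afterwards for the audit trail. The dict stands in
# for a real CMDB API; every field name here is a hypothetical example.
cmdb = {
    "trade-capture": {
        "criticality": "tier-1",
        "rto_minutes": 120,
        "servers": ["tc-app-01", "tc-db-01"],
    }
}


def enrich_runbook(service: str) -> dict:
    """Pull the live CI record so the runbook carries real recovery data."""
    ci = cmdb[service]
    return {
        "service": service,
        "criticality": ci["criticality"],
        "rto_minutes": ci["rto_minutes"],
        "failover_targets": list(ci["servers"]),
    }


def record_rta(service: str, rta_minutes: int) -> None:
    """Write the actual recovery time back so how long it took is logged."""
    cmdb[service]["last_rta_minutes"] = rta_minutes


runbook = enrich_runbook("trade-capture")
record_rta("trade-capture", 95)
```

The point of the round trip is the one Manish makes: the runbook executes against live data rather than assumptions, and the golden source ends the event more accurate than it started.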
It's also about doing things in a consistent manner across all of your services and applications, across your estate. We've talked just now about integrating with your ITSM tool, getting that source of truth — that additional set of data that you're relying on for your apps and services — bringing it into Cutover and then augmenting it with the recovery plan instructions. And it's probably a good point just to move on to automation and improved scenario coverage now. So I'm going to talk a little bit about scenario coverage and then hand off to Manish to talk about what other things we can do in the automation arena. But really, once you create capacity for yourself, you're giving yourself the opportunity to test other scenarios. We talked a little bit about the very stereotypical ones — I've lost my data center, I've lost access to my public cloud region — the sort of total loss scenarios. And it's fair to say that we see customers beginning to stretch the bounds of what they do, certainly within our platform, but also in the approaches they're taking to resilience as a whole. What that really manifests itself in is looking at different types of scenarios to actually test and prove out. Now, we know it's not going to be possible to define every possible scenario and have an appropriate plan or response for each. But we certainly see customers starting to broaden their approach — their minds, if you like — to the other types of things that can happen. So not just total loss, but loss of utilities, such as my SQL Server farm or my virtual server farm, testing down to the individual service level. What happens — and this is possibly the more likely thing — if I lose the hard disk on the database server that underpins my application? What's my response then? Can I actually recover part of the application versus the whole of it?
But more than that: not just the sort of 'lose here, recover to there', but also proving out my disaster recovery infrastructure and making sure that it is fit for purpose. So not only do we see customers doing a failover and back again, but also a fail and stay. They're actually running production load through their disaster recovery environments to ensure that they are adequately configured, and so on. It's fair to say, in the early days of my application ownership career, I was very nervous about doing that. We operated on the basis of having enough infrastructure to maintain business, but not necessarily to continue running business at the same level and pace as we did in our production environment. And things have changed now. As I said, we're seeing customers move more and more to that fail and stay: prove it there, stay there for weeks, months, whatever the right time period is, and then migrate back again. We're also seeing customers start to think about how they respond to ransomware attacks. Invariably, one option for a ransomware attack is to recreate your environment from bare metal or from your last known good backup, and we see customers use Cutover to orchestrate that activity as well. Manish, can you touch perhaps on some of the other opportunities for automation as customers get more comfortable with the planning and the recovery and the testing? What are they looking to do in terms of further automation within the platform? Yeah. And look, I think it's fair to say that RTOs are not always easy to achieve, and a good way of achieving the RTO is to automate. You know, I hate the term 'take humans out of the loop' — an awful term — but there is delay in getting one person to run a script, work out what's happened, and then report back to say 'I've done it' for someone else to do the next thing.
If you can automate that process, then you save seconds and minutes, and they all add up when it comes to RTO. I don't think there is any major organization out there that doesn't have an RTO that is quite challenging, right? Unfortunately. And sometimes that's because the application wasn't architected to be active-active fifteen years ago, and that's what makes the RTO challenging. So it is a difficult thing to do, and automation plays a big role in it. We're also hearing from some of our newer clients that they want zero-touch recoveries. What they're really saying is: an incident happens, we recognize that incident, it kicks off a runbook automatically, and that runbook has a series of automated steps that just goes away and recovers. So you're now monitoring the recovery versus doing the recovery. The reality is that's going to be quite difficult to achieve, because we as humans make really important decisions in that recovery process, and we should never forget that. But we can do a lot with automation. Many companies run typical things like Ansible and Jenkins that can run scripts to automate a process. That could be bringing up a new server and copying the last known data point to it. Why have a human kicking that off when Cutover can do it automatically? Our philosophy is that integration and automation play a huge role in recoveries and the other use cases that Cutover supports. But what we don't want is a process where it takes ages and ages to create those things. You should be able to create them at will and within your own infrastructure set. So we've created a framework: we call it the Cutover integration suite.
It's a framework which allows you to create integrations from Cutover to third-party applications, to kick off jobs, monitor them, and report back on them. But we also have an open, public API through which third-party applications can tell Cutover that something has happened, so we can progress the runbooks. So we've developed this framework, and it's being used by our clients, which is really great to see. What we are seeing is more and more automations being put into our runbooks: from very simple collaboration tools like MS Teams or Mattermost, saying 'I've started it' and 'I've finished it', to something more complicated, where an Ansible job kicks off to actually do something and report back, and once that completes, we do a couple of other things automatically as well — but still making sure the human is at the center of the process, and when difficult decisions are made, the human is still responsible for making that decision. That's great. So, moving on then to our final stage. As Manish said at the top of the webinar, we don't have customers that are there yet, but we certainly have customers who have this as a stated goal and objective, and that is testing as response. So, why test? Well, typically people test so that they know that something isn't going to go wrong, but also people test to build muscle memory, to build that sense of: I know what I'm going to need to do when the thing actually happens. Now, typically you're not given notice that an incident is going to occur. Manish touched on this earlier: there are customers that have moved from twelve weeks down to four weeks down to two weeks in terms of reducing their preparation time, and they then have a goal of reducing that even further. Now, do we think customers are going to be taking production systems down in the middle of the trading day?
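The callback pattern Manish describes — an external tool telling the orchestration platform a job has finished so the runbook can advance — can be sketched in-process like this. The class, task names, and payload shape are invented for illustration; Cutover's real API will differ.

```python
# Sketch: a runbook that only advances when a third-party tool reports a
# task's outcome via a callback. Payload shape and task names are made up
# for this example and don't reflect any real Cutover endpoint.
import json


class Runbook:
    def __init__(self, tasks):
        self.tasks = tasks          # ordered task names
        self.completed = set()

    def handle_callback(self, payload: str):
        """Mark a task finished when an external tool calls back."""
        event = json.loads(payload)
        if event["status"] == "success":
            self.completed.add(event["task"])
        return self.next_task()

    def next_task(self):
        """Return the first task not yet reported done, or None."""
        for task in self.tasks:
            if task not in self.completed:
                return task
        return None  # runbook complete


rb = Runbook(["notify-teams", "failover-db", "smoke-test"])
rb.handle_callback(json.dumps({"task": "notify-teams", "status": "success"}))
print(rb.next_task())  # → failover-db
```

The human stays in the loop exactly as described: the automation reports state, and the runbook surfaces the next decision rather than silently barreling on.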
I think where the architecture allows it, they're certainly going to start to look at that. Where the architecture doesn't allow it — your very stereotypical primary and secondary site setup — I think there's a desire to see how their own staff and their own plans actually respond to a little-or-no-notice incident. So you could imagine that at nine a.m. on a Saturday morning, the DR scenario is invoked, and those responsible for a designated set of apps, according to a particular type of incident — you know, data center one is down — are notified and then asked to spin up their recovery plans and get those activities going, in a safe, measured, and planned way, of course. But the short notice really is there to see how your staff respond, to see how your organization responds — to see how resilient your staff are as much as how resilient your plans and your infrastructure are. And in order to do that, it really does require focus on reducing planning time. As Manish touched on with some of the automation, there are opportunities even to reduce the planning activities using Cutover: so not only use Cutover to orchestrate the events and the response, but use Cutover to orchestrate the activities required to get ready for an event. And to the degree that you document those activities, capture them, and measure them, you've then got the opportunity to improve, reduce, and optimize. I think the other thing we start to see as well is that major incident management teams are now actively participating in these sorts of large-scale test events.
They certainly would have been aware on the periphery that something was going on, not least just in case something untoward happened, but we're seeing more and more that customers are looking to involve their major incident management teams — and I think ultimately to have those individuals who typically run major incidents actually be responsible for running the tests. After all, it's a simulation of what would happen in response to a particularly significant incident. And as I said, muscle memory is important here, right? You want to build that muscle memory so that people aren't scrambling around to work out what to do. As we touched on: structuring your recovery plans, tiering your recovery plans by service criticality, integrating with your CMDB, automating where you can — they're all steps on a journey to allow you to respond more quickly, to reduce impact to your end customers, and ultimately to ensure that the business continues and demonstrates resilience, that capacity to adapt in the event of an incident occurring. It's probably worth sharing a couple of war stories, either from us or those we've heard from customers. Manish, do you have one or two worth talking about — protecting the innocent, of course? Yeah. Look, we've got a couple of nice use cases here. I mean, we've got clients who needed to do a full data center recovery test but weren't able to do that. They were recovering services, and grouping services, by themselves, using Excel and the classic bridge call where someone just calls out the tasks and lots of people do things. They were told by the regulator that they now need to do a full data center recovery test. They are a major organization that provides financial services for the UK, and they weren't able to do that.
So we got involved in that process, which was great for us because it really brought to life what we can do and how we can help our clients. And I'm glad to say that within a few months they were using us to do a full data center recovery test, just by using our orchestration, our real-time collaboration, our notification of what has to happen when it needs to happen — keeping everyone abreast of absolutely what's going on without having to read pages and pages of update emails, and so on and so forth. They were able to execute that, it was a success, and the regulator was able to sign them off, which is a great success story for them, but also a great success story for us as well. I think another one — and this is probably very key to a lot of the people who will be listening to this globally — is: how can I do hundreds, if not thousands, of services at any one time? A lot of our clients are tier one clients. They have thousands of services, and to varying degrees they're all very important in the way that the bank operates and services its internal and external clients. And again, we've got clients who, before Cutover, could only do a few hundred — which sounds amazing by itself, but that was through huge effort with lots and lots of people involved. With Cutover, they're able to do fifteen, sixteen, seventeen hundred apps in one weekend, and with a smaller user base managing that process. So, one, they were able to do a lot more in the same amount of time, but secondly, do it with a smaller footprint of people — inconveniencing fewer people at the weekend. Again, a great success story for them, and it just shows what Cutover can provide when it comes to things like orchestration. Thank you, Manish. Well, that takes us to the end of our webinar today.
We are very grateful for you spending the time to listen and hear from us. If you'd like more information, please do get in contact with us via our website, cutover.com. And let me just say thank you from myself — I do appreciate your time. Manish, any final words? No, just thank you very much. Any questions, do come back through to us. Thank you.