Webinar

30 minutes

One-Click to Recovery: Slash Disaster Failover Time by 53% with Automation

Good morning and good afternoon, everyone, and thank you for joining today's Cutover Live session. This is our seventh in the series of short, informative sessions on improving IT operations, like IT disaster recovery, cloud migration, and major incident management. My name is Kimberly Sack. I'm on the product marketing team, and I'm gonna be our host and moderator for today. So today's session is titled "One-Click to Recovery: How Automation Slashed Failover Time by Fifty-Three Percent." And before we kick it off officially, I just wanna cover a few housekeeping items. So for those joining via Zoom, at the bottom of your screen, you'll find a Q&A box. Please feel free to submit any questions you have during the webinar, and our presenters will answer them at the end of the session if time allows. If not, we can absolutely follow up with you afterwards. And for those of you joining via LinkedIn Live, please comment any questions in LinkedIn, and one of our moderators there will pass them along to us and our presenters. This session is being recorded, and it will be accessible via LinkedIn after the event.

So in the next twenty to thirty minutes, we're going to discuss how an enterprise financial institution used Cutover automated runbooks combined with Ansible to achieve a one-click failover. Then instead of just talking about it, we're going to actually show a live demonstration of Cutover, a quick and dirty one, and then we're gonna wrap it up and open it up for Q&A. So we have two great speakers lined up for you today, Melissa Summer and Madan Kumar. I'm gonna hand it over to them and let them do a brief introduction to get us started. Over to you, Melissa.

Hey. Yes. My name is Melissa Summer, and I'm the technical customer success manager for this client. And I've been with Cutover for almost four years now.

Great. Hey, guys. Madan Kumar, senior sales engineer here at Cutover.
Been with Cutover for about four years and looking forward to showing you guys a bit of the platform today.

Awesome. Thank you both. So now I think we're ready, Melissa, for you to kick it off with talking about the customer challenges.

Alright. Thank you. Yeah. So let's start with the problem, which was manual and inefficient failover processes. So this company, like many large enterprises, relied on manual steps for both disaster recovery failovers and routine failover testing. They had their process in static documents. To execute each step, operators had to manually trigger Ansible scripts and follow a spreadsheet with those instructions. Each step required coordination among multiple teams. People were spread across bridge lines, chat channels, and, you know, email threads galore, as well as dealing with Ansible sprawl, which meant working with poorly organized, complex, redundant Ansible code, making it even harder to manage and maintain. So, obviously, that resulted in delays, with manual toil, human error, and massive coordination overhead. In that previous process, RTAs were calculated manually and after the fact, so there was no oversight if an application was not going to hit its recovery objective. So during high-pressure moments like real incidents, you know, this approach was risky, hard to manage, prone to error, and also just opaque, with not much visibility into what was actually happening, when, and where.

Right. So that brings us to the solution, which is automated failovers with Cutover runbooks. So the team decided to recreate their failover process by moving away from that static documentation and into automated runbooks. They did the legwork to build out repeatable templates that could be executed depending on the scope from an application recovery repository, and integrated it fully with Ansible. So with that, now just one person can initiate a failover or a test with virtually a single click.
By utilizing the Cutover API, the client had set up calls to automatically spin up runbooks from those templates for each application or service being failed over. And since those runbooks were integrated with Ansible, all those formerly manual tasks are now automated, triggered, and executed with little to no human input. And the system will actually alert you to any errors, so users have the ability to debug integration failures directly in our UI. With that, Cutover's live dashboards provide real-time visibility. You can instantly see how many runbooks are running at a given time, see recovery metrics, as well as, you know, send alerts out if anything's off track, either to regular users or up to executive leadership. And lastly, audit logs are auto-generated, so that makes compliance and retrospectives easily accessible right on the platform.

Right. So that brings us to the outcome. The results here speak volumes. Since implementing Cutover, the team has seen a fifty-three percent increase in efficiency for ITDR. The average recovery time for an application dropped from over four hours to just thirty-eight minutes, which is huge, especially for an organization managing, you know, mission-critical financial systems. Even better, the number of people required for a failover has drastically reduced. So what used to be a coordination-heavy effort now runs cleanly and confidently with minimal human oversight. And this client is currently working on scaling the model across the enterprise to even more use cases, such as change management, patch management, as well as major incident response. And they expect to drive even more efficiency gains moving forward. This success story is not just about automation. It's really about operational resiliency and being prepared when it matters most.
By replacing manual, error-prone processes with streamlined automation, this company made failovers not only faster, but they made them smarter, safer, and scalable.

Great. Thank you, Melissa. And I think, Madan, you were gonna touch on just some of the benefits specific to Ansible.

Yeah. Exactly. I think one thing we saw with this client at VCU is with Ansible, you know, a lot of folks, with the amount of Ansible playbooks that they have within their estate, tend to see what we call Ansible sprawl. So it's really multiple different Ansible playbooks, all doing various different things, without any clear, definitive way for nontechnical users to understand what those playbooks are doing. And by abstracting and allowing Cutover to kind of wrap those Ansible playbooks, that infrastructure as code, directly into the Cutover runbooks, it gave nontechnical users a way to leverage these standardized Ansible playbooks, plug them in as part of a larger, you know, overall recovery process, and also bring in additional automations. So whether it's just Ansible today, or they're moving toward things like Terraform or additional automations, they can plug in those various automation components as part of the runbook hierarchy seamlessly, and that essentially allows them to get that full resilience automation while still providing that human governance within Cutover.

Awesome. Thank you so much, Madan. So I think now we're gonna hand it over to you, and everyone gets to see Cutover in action.

Awesome. Let me share my screen. Can you guys see that?

Yes.

Cool.

Looks good.

So I wanted to kinda now take you guys into the Cutover platform and give you kind of an overview of what the platform allows us to do today. So what you're looking at here is an example of Cutover. And at its core, what Cutover provides is what we call runbooks.
Now, runbooks is a term that we use, but when we think of runbooks in the scope of recovery or failover, these could be any sort of runbook that requires you to capture the process or steps, whether it's fully automated, like we're gonna see here, or a mix of automation and human in the loop, and then be able to execute via the Cutover platform. Runbooks today in the Cutover platform are hosted in what we see here, known as workspaces. And you can think of a workspace as essentially a logical repository to host runbooks, runbook users, and the templates associated with them. Within this particular workspace for recovery, you could see a whole host of various runbooks here for various stages of the recovery kind of life cycle. And so each of these runbooks here could be an example of maybe a single application failover, following something in the cloud or on-prem environment, or, in our case, even a one-click recovery.

The process of creating these runbooks is seamless, and users could typically come into the UI, create a normal or manual runbook, and then use something like the data import functionality to dynamically import CSVs or Excel files into the runbook creation and create those. But over time, what we see folks doing is actually leveraging and then standardizing on these runbooks as runbook templates. So templates, like with the customer that Melissa was speaking about, are really how they went about standardizing these repeatable processes and introducing a level of version control that they didn't have previously. So within the template repository, you see we have our whole kind of approval workflow built into the platform, and this allows individual users or teams to come in and manage their templates as that standardization and then use those templates as a way to very quickly, you know, at the point of, let's say, recovery or failure, come in and dynamically create a runbook from a template and then get that kicked off.
Now everything that you see here in the UI is, of course, accessible through the Cutover API. So the customer that we previously alluded to actually leverages the Cutover API to dynamically create runbooks from the templates and then put those runbooks into execution with little to no human intervention.

Now if we take an example of what a runbook itself looks like, so here we have our one-click recovery or failover runbook. And you could see this runbook is made up of forty-two tasks, and those forty-two tasks are represented here throughout your screen. The first thing you'll notice is that those tasks within this runbook have this little cog icon associated with them. And what that represents within Cutover is the ability for us to trigger any level of automation that you have within your environment, whether it's things like Ansible, Teams, Jenkins, or even, you know, scripts in your on-prem environments, as part of that larger failover or recovery. The other tasks in this runbook that don't show the cog icon essentially represent those human-in-the-loop tasks that you could go ahead and introduce as stopgaps at any point in time in your runbook. And then you'll notice that each of these tasks also gives us the ability to set dependencies at the task level.

Now looking at this view, for example, it's not very easy to understand and kinda visualize those upstream and downstream dependencies. But the node map view really gives us that visibility or understanding, for this particular runbook, of how complex the execution path is gonna be. And the node map here now gives us visibility and understanding for those various tasks I have as part of my larger recovery. What are those upstream and downstream dependencies? What is the critical path, which you could see here in orange, that's being calculated for us dynamically for this runbook?
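To make the critical-path idea from the node map concrete, here is a minimal sketch in Python. It is not Cutover's implementation; the task names, durations, and the `critical_path` helper are all hypothetical. It assumes each task has a planned duration and a list of upstream tasks, and it finds the longest chain of dependent tasks, which is what bounds the overall recovery time.

```python
from graphlib import TopologicalSorter


def critical_path(durations, depends_on):
    """Return (total_minutes, path) for the longest chain of dependent tasks.

    durations:  task name -> planned duration in minutes (hypothetical data)
    depends_on: task name -> list of upstream task names
    """
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {task: set(depends_on.get(task, ())) for task in durations}
    finish = {}    # earliest possible finish time of each task
    previous = {}  # upstream task that determines that finish time
    for task in TopologicalSorter(graph).static_order():
        # A task can start only when its slowest upstream task has finished.
        latest = max(graph[task], key=lambda p: finish[p], default=None)
        start = finish[latest] if latest is not None else 0
        finish[task] = start + durations[task]
        previous[task] = latest
    # Walk backwards from the task that finishes last.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    return total, path[::-1]


# Hypothetical failover tasks, durations in minutes.
durations = {
    "isolate_network": 5,
    "stop_app": 3,
    "promote_database": 12,
    "repoint_dns": 4,
    "smoke_test": 6,
}
depends_on = {
    "stop_app": ["isolate_network"],
    "promote_database": ["stop_app"],
    "repoint_dns": ["stop_app"],
    "smoke_test": ["promote_database", "repoint_dns"],
}

total, path = critical_path(durations, depends_on)
print(total, " -> ".join(path))
```

In this made-up example the DNS change is off the critical path, so shortening it would not shorten the recovery, while any delay in the database promotion would push the finish time out directly.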
And within the runbook and node map here, we can understand those upstream and downstream dependencies and even the critical path in order to get to a particular task in the workflow. And what Cutover will do is, when I actually put this runbook into execution, it will use those dependencies we set to manage the notification process for us and do the dependency management for us as well. So this runbook is currently in a planning phase. But if I go ahead now and, let's say, schedule it for nine o'clock today and go ahead and put this runbook into execution, again, this could be done through the UI or through the API, as we mentioned. And now we can go ahead and put this in.

And you'll notice once I put this runbook into a dynamic execution, the view that we're seeing has shifted slightly. And now what Cutover is showing us here is that it's managing and kicking off those integrations or automations for us dynamically and automatically, but doing so based on dependencies that we've set ahead of time. So you could see tasks downstream below here are grayed out and essentially locked. And what that is just representing is that in order for these automation tasks to be kicked off, those upstream dependencies need to be kicked off and completed first. And within this task list view here, you could see those integration tasks have been kicked off, returning us things like build status and job IDs, and we can go ahead and modify these payloads to bring back what information is most critical to us. So if I wanna see things like job status and maybe the ID, I can go ahead and configure that in my integration payloads. In this case, if I look into the integration itself, we're passing it just a job template ID and then returning back job status. So these integrations could be fire and forget, but could also poll those endpoints for state changes and errors that arise in real time.
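The difference between fire and forget and polling can be sketched in a few lines of Python. This is an illustration only, not the Cutover or Ansible API: `wait_for_job`, the `fetch_status` callable, and the status names are assumptions made for the example. The caller supplies whatever function actually asks the automation tool for the job's current state.

```python
import time

# Assumed terminal status names; a real integration would use whatever
# states the automation tool actually reports.
TERMINAL_STATES = {"successful", "failed", "error", "canceled"}


def wait_for_job(fetch_status, interval_seconds=5.0, timeout_seconds=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job reaches a terminal state.

    fetch_status is a hypothetical callable returning the current status
    string. Returns the terminal status, or "timed_out" if the deadline
    passes before the job finishes.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            return "timed_out"
        sleep(interval_seconds)


# Usage with a stand-in for a real status endpoint.
fake_statuses = iter(["pending", "running", "running", "successful"])
result = wait_for_job(lambda: next(fake_statuses), interval_seconds=0)
print(result)
```

A fire-and-forget integration would stop after launching the job; the polling variant is what lets a failure show up against the task while the runbook is still executing.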
And what this gives customers is the ability, if there's any point of failure in this complex flow, to immediately see those failures in real time. They don't need to go into Ansible playbooks to kinda debug those errors manually. Nontechnical users and business users get that visibility and are able to mitigate in real time as well.

Now when we talk about that visibility, you know, most folks are typically not gonna be looking at this runbook list view in terms of understanding where they are in terms of, you know, progress or overall execution summaries. What they're gonna be looking at is this runbook-level dashboard, which is a real-time dashboard that gives us the ability to see where we are at any point in time in the execution and any delays that we might be incurring. So in this runbook-level dashboard, I could see that I've started about forty-three minutes late, but I'm projected to finish twenty-seven minutes ahead of schedule just by how quickly I've been completing these various tasks. I can go ahead and understand, you know, for these different integrations that I have, any summaries or failures that might arise, and the completion summary by stage. So you could see right now, we're at nineteen percent. But as those integration automation tasks complete, that completion percentage will go up. And I can even see a stream summary, which is a way that we kinda group tasks into different streams for us to understand where they are in the workflow. And tasks that are currently startable, meaning the dependencies have been met but they haven't been started yet, and tasks in progress, like these automation tasks, will show up here as well. And you'll see one task has been completed. That dashboard does get updated in real time, and this dashboard could be customized for our particular view as well.
So in this dashboard view, I can go ahead and look at it by my team, maybe my particular user, or any other set of metadata I wanna filter on, which could be added into the runbook-level dashboard and then used as part of that visibility. And throughout the execution, everything that's done in terms of automation and human-in-the-loop execution is captured intrinsically in our audit logs as well. So most of our enterprise customers, post-recovery, need to go ahead and show, you know, regulatory users or other business users how they went about doing a certain recovery or failover. And the audit log is that artifact of proof that they could use to show exactly how they went about doing a recovery or any execution within Cutover. The audit log starts capturing details for us as soon as the runbook gets created, all the way through to the completion of the runbook itself. And we could see everything from, you know, when tasks were made ready, when they were actually started, and who those various users were that started and completed those tasks.

Now from a visibility perspective, what we're looking at here is obviously an isolated recovery, a single application, for example, that's recovering. But we know that oftentimes that's not gonna be the case. In fact, within any enterprise, we might have multiple different executions or recoveries happening at any given time. And the way that we provide that level of visibility across the entire estate of recoveries or runbooks that you have is through the multi-runbook dashboard that you guys see here. So the multi-runbook dashboard, or the MRD, now gives us visibility into understanding, if we have, let's say, two hundred nineteen different runbooks that are being kicked off, where am I in terms of execution across the entire two hundred nineteen of those? I can understand the breakdown by, you know, lateness, app owners, location, RTOs.
And, again, this metadata that we're slicing and dicing by could be customized based on what's most pertinent to you. We can understand apps that are currently failing over in progress, and which ones have been completed or are yet to start. And from this multi-runbook dashboard, what we could understand is if there are delays that are happening. Let's say, out of the two hundred nineteen, there's thirteen of them that have been highlighted as running behind with a red RAG status. I can immediately click into what those thirteen runbooks are and understand why those delays might be happening in real time. So in this case, maybe there was a delay in network isolation tasks for a particular runbook, or maybe there's a late start to the recovery plan and there's issues with the database checks, which have not been resolved, and it means the recovery of the app is running later than planned. So very quickly, I can go from that high-level view down to understanding what apps might be at risk of missing those recovery time objectives. And then from Cutover directly, I can jump into that problem runbook, use something like the ad hoc comms to page out to those various folks and end users, and get this recovery back on track.

And within the runbook itself, we've also introduced additional features within the platform for us to be able to understand things like detail improvements. So with the AI assist feature, we can now start to use Cutover AI to suggest improvements as part of the runbook design. So in this case, I can choose to suggest detail improvements, and the Cutover AI will then go ahead and understand, you know, based on this particular runbook and based on what we've seen as best practices, these are maybe some of the things that you wanna do and incorporate in that next run, or even in this particular run, to be able to make this runbook better.
And so the AI assist feature has a whole plethora of various different tasks and prompts that we can give the Cutover AI to be able to process-improve and iterate on as we go forward as well.

And finally, post-event, Cutover provides metrics for us to process-improve on that next execution as well. And the way that we do that is, upon completion of a runbook, a dashboard gets generated for us automatically, which is what we call the post-implementation review, or PIR, dashboard view. And in this particular example, this runbook is doing a large-scale PC failover event, and this run has now been completed, as you could see up here. And within this post-implementation dashboard view, we now provide metrics for that particular execution. How did I actually go about doing it in terms of people that were involved and lateness of those various tasks that were in the runbook? Performance versus planned, summaries by those streams or those workflows in the runbook and by those teams that were assigned to those tasks, planned versus actual, and then wastage, which is essentially representing dead time, which means that when a task was available to start, did folks start it right away, or did they wait for x amount of time? So all those metrics we capture for you automatically and generate as part of the PIR view, and customers could go ahead and use this to process-improve for that next iteration and the next failover that they do within the platform. Cool. Kim, I think, back to you now.

Awesome. Thank you so much, Madan. That was great. So if you stop sharing, then I can share my screen. Thank you. Can you guys see my screen again?

Yeah.

Okay. So, you know, just to wrap it up in a quick summary, today we discussed that, you know, even when using automation tools to help remove repetitive manual tasks, there's still a lot of complexity with ITDR processes. When you're incorporating automation into ITDR, it's really a journey that continues to mature over time.
You can add in one tool or multiple tools, and it can provide a lot of benefits. And Cutover and our automated runbooks can work with the entire recovery stack. You know, Madan mentioned Ansible, but we also integrate with communication platforms like Slack or Zoom. We can work with monitoring tools and ITSM and CMDB databases and pull data in. So, with the API, we can work kind of across all the tools and be that central execution point for your recoveries, to give your teams the visibility and the control that they need to have more efficiency, reduce costs, and ultimately enable you to achieve a one-click failover.

So now we are going to open it up for Q&A. And while everyone puts their questions in, I'm just gonna talk about a couple of upcoming events. So our Cutover Live sessions are on a regular cadence, every two weeks, nine thirty AM Eastern on Wednesdays. The next one will be June eleventh, same time, same channel, and we're gonna be talking about application-level recovery and the importance of runtime and design-time data. Additionally, we are gonna have a live online two-hour workshop on June twenty-sixth where, if you're interested, you can build a Cutover runbook. It'll be hosted by Cutover's technical experts, and it'll be specific to incorporating AWS services, like Fault Injection Service, to do more chaos engineering and testing of recovery scenarios. But Cutover is cloud agnostic and product agnostic. So whether you're in the cloud, on premises, wherever your applications are, Cutover can work with that. But if you're interested in kind of going a level deeper, this would be a great opportunity for you. If you're interested, you can contact us at cutover.com, reach out on LinkedIn to myself, Madan, or Melissa, or visit the Cutover events page. So now let me check and see what questions we have.
So the first one is, for the case study, what kind of training was implemented to ensure that the teams could leverage the Cutover runbooks and the capabilities?

Yeah. So I'll answer that. As the customer success manager, I worked very closely with them during that time. You know, they went through our traditional onboarding process, where it was very hands-on at the beginning. They went through the Cutover certification programs for, you know, like, the super users, as we call them. But other than that, they were honestly quite self-sufficient in programmatically adopting Cutover. They had some professional service days where we helped, you know, with the integrations and sort of walked them through our API documentation and answered those questions. But, yeah, other than that, I meet with them on a weekly basis and answer ad hoc questions on how to improve. But, yeah, they kind of hit the ground running and saw the vision pretty quickly. And, yeah, that's how they did it.

Awesome. Thank you. So you mentioned templates, runbook templates. Does Cutover come out of the box with any recovery templates?

Yeah. Yeah. I can take that. So, yeah, with Cutover, out of the box, we provide what we call master templates, about forty or so templates that, basically, customers can use to start to build these recovery plans off of. And these master templates, as we call them, basically encapsulate various different scenarios in terms of recovery, whether it's for, you know, on-prem applications or cloud app templates following, you know, pilot light or warm standby. We provide those based on best practices and standardizations that we've seen our other customers use, and, you know, new customers are welcome to use them to onboard and get to scale faster.

Awesome. Thanks, Madan. So, typically, would a customer integrate Cutover with only one tool, like Ansible, or how far can that scale?

Yeah. Yeah.
So, it's entirely up to the customer in terms of what automations and systems they wanna connect to. It's essentially anything that's required for them to be able to get to that one-click failover. And so from an integration perspective, Cutover is able to integrate with any sort of system that provides a REST API endpoint for us to interact with. And even for certain tools and systems that don't, we have something called Cutover Connect, which is a proxy service that we can deploy to be able to speak to things that don't expose themselves via REST, for things like on-prem integrations and in-house kind of systems as well.

Awesome. Thank you. And, Madan, I think you did cover this already at the end of the demo, but someone asked about AI. So what kind of AI options does Cutover offer?

Yeah. So with the AI, we have what we call Cutover AI, which offers everything from kind of runbook creation, so, basically, being able to leverage those master templates to automatically create, you know, runbooks based on standardization and prompts that you give it, down to, once you actually create the runbooks, how could you optimize and then iterate on that as well? So things like dependency improvements and task descriptions, those sorts of things that allow you to process-improve for future iterations. And then, of course, we can also connect to any AI kind of agents that you have internally and use those as part of the larger runbook execution as well.

Awesome. And then it looks like the final question is, how do I learn more? So, again, I can cover that one. Reach out to us on cutover.com. Reach out to us on our Cutover LinkedIn page. Connect with Madan, Melissa, or myself, and we can absolutely connect you with the right people or have a conversation with you ourselves. And keep a lookout for future events. It's about every two weeks.
We'll probably be continuing that through the summer and then into the fall, and we hope you can join us. So let me just stop sharing and make sure I didn't miss any questions anywhere else. I think we are good to end our session. So thank you, Madan. Thank you, Melissa. Really appreciate your insight and your expertise there. Thank you to all of our attendees on Zoom and LinkedIn. We hope you found it informative, and we hope you can join us in the future. Have a great day, everyone.

Are you ready to automate your failover processes? Join our 30-minute session to discover how a major investment company achieved a near one-click failover initiation, slashing recovery time by 53% using automation.

This session will reveal how you can implement one-click initiation for failover and testing, orchestrate seamless integration with tools like Ansible, and gain real-time visibility and control with custom dashboards.

Speakers

Kimberly K. Sack

Senior Product Marketing Manager

Madan Kumar

Sales Engineer

Cutover
