I'm Ky Nichol, CEO and cofounder of Cutover. I had a background in the space industry before working in technology for a good while, and my team and I thought it was very important to put forward a platform that allows human and machine collaboration in technology operations. The primary operations we help move forward are often operational resilience and application failover, as well as things like migration. It was great to start Cutover to provide automated runbooks for those technology operations teams, to really bring humans and technology together and make the work very visible, which is often required in those resilience moments when you have outages and need to recover. So that's Cutover and me.

One important note before we go through the slides today: unfortunately, Atul has had a health issue and can't join the discussion, which is a shame. I'm sure we'll have a chance to chat with Atul another time.

Perhaps I could start with Dominic on question one, which is an important consideration for many of the clients and potential clients we talk to: how does the incorporation of cloud technologies impact your operational resilience strategy?

Thanks for a great question, Ky. At first, it didn't have any impact, because people thought it was magic: cloud is resilient, you have nothing to do, you build a solution on top of it and it just works. And by the way, some of the cloud providers were approaching us with exactly that kind of pitch. But it had, and it has, a real, huge impact. You need to approach cloud resiliency the same way you would approach any resiliency or contingency plan on prem, and it's even more complicated because now you deal with regions and with all sorts of different layers in the stack. So you need to be maniacal about how you build your process around it. For those who already have great resiliency processes because they run on-prem data centers: use the same processes, be maniacal about them, and don't try to minimize the impact the cloud can have on your resiliency. We had a huge outage three weeks ago, and I'll give a bit more detail later in the discussion, because a part that was sold to us as resilient was not, and the testing process wasn't aligned with the testing process we were running on prem. So yes, in the end it has a huge impact.

That's great, Dominic. Whatever is sold, I think it always comes back to, I forget who said it, "the cloud is just someone else's computer." You've got to make sure that it actually works for you. And it's worth going into detail, as I have with some folks, on what a data center and an availability zone actually are, how far apart the availability zones in a region really are, and what genuinely counts as cross-region. It really does differ between providers, and what they offer in those areas impacts your resilience strategy.
So I really appreciate you sharing those insights today, and I look forward to more detail later on. Saba, I wonder if you have some comments on this.

Yes, sure, I'll add a few comments here. I totally agree with both of you. One additional perspective is that cloud technologies do profoundly impact our operational resiliency. We need to learn and understand the shared responsibility model well. We need to understand the different service models we have (infrastructure as a service, platform as a service, software as a service), what our responsibility is as a customer, and what the provider's responsibility is, and make sure we have standards, policies, and guidelines defined and communicated well in the organization. The cloud has really profound capabilities: you have scalability and so much flexibility that you can use to build great resilience and ensure your businesses can recover faster, and there is so much automation capability you can capitalize on as well.

That's fantastic stuff, Saba, and we really appreciate you sharing that knowledge. With that thought in mind, let's proceed to question two, which goes to a slightly more detailed level: what recommendations would you have for testing RTOs, recovery time objectives, in the cloud? That level of detail is really important. Saba, could I start with you on this one?

Yes, sure. It is super important. The recovery time objective is the time within which you want your application to recover. For example, if your RTO is twenty-four hours, you want to make sure that after an outage you are able to recover within those twenty-four hours without causing any further business disruption. Before delving into the testing of RTOs, there is a lot of prep work I would like to highlight. As you all know, in the traditional on-prem world the setup is more static. In the cloud there is more complexity: you have different AWS services, or different services generally, that get integrated into the environment; configurations change; workloads are constantly shifting. That means your application is fluctuating over time, and we need to be prepared for that.

So what do we do? Some things to keep in mind: when you do your first-stage business impact analysis, make sure you bring up your cloud concerns in that meeting. If your application is migrating, review your impact assessment, or create a new assessment, to include cloud-specific disruptions and risks. Then, once you have set the objectives with the business, make sure they are cascaded to the technology teams. A lot of times I have noticed the technology teams go ahead and use the most advanced managed services, develop an architecture, and deploy, but the alignment with the business and the understanding of the business recovery objectives is missing. So make sure that right from the onset you know what the recovery objectives are, design your system based on those objectives, and don't simply rely on the design.
Integrate these resiliency considerations into your software development life cycle, so you are assessing them before you deploy your workloads into production. That way you are as prepared as possible. Another consideration: as you build your recovery objectives, build your observability objectives as well. Define those thresholds so you can have the right monitoring and alerting in place should there be a disruption.

Going into testing RTOs, I would say don't test just annually or semi-annually. The cloud is a dynamic environment, so test your RTOs as regularly as possible, integrated into the software development life cycle. Test those RTOs at every release and make sure your resiliency testing is continuous and adaptive. As you do this testing, if you notice something, go and update your DR plans. Don't leave them static; a lot of times I've seen DR plans that are never updated. Continuously update the DR plans and engage cross-functional teams (the technology teams, continuity and compliance folks) so you have a holistic approach and are all aligned on the objectives from a compliance standpoint as well. Over to you.

What you say is so important, Saba. It's very important to have something that's not a static plan, with the associated evidence to be able to say the RTO was really met, rather than a wiki page or a Confluence page with some rough guidance on how to recover, where it's really hard to say you actually did the test. That sort of diligence is fantastic, and testing per release is absolutely amazing. Dominic, I'm very keen to get your perspective on this.

Yeah, that one is complex for us, because we have a lot of legacy applications. We have something called a mainframe, so we have a lot of applications living in a hybrid world. Aligning RTOs across SaaS, cloud, on prem, and mainframe is a nightmare; just having the discussion and aligning all of it gets complex. You need to be, like Saba was mentioning, thorough, and test it over and over, because you need to align all those small parts together. In the past it was all sitting inside one big monolithic application, so we had almost a hundred percent of the control, and it was sitting in our data center. Now it is so spread out that it gets complex. I would say we haven't found a solution to that one yet; it's still a struggle for our team.

Yeah, as you say, that one hurts, but it is the world we live in. It's not as if everything has suddenly moved to multi-availability-zone, hot-hot applications in the cloud; that doesn't work for everything in any shape or form. So we have to have this adaptive approach to RTOs, and test against them, for a hybrid world. The failover of a mainframe and bits like that is tricky, but as you say, you're getting on with it. How do you align the RTOs of microservices in the cloud with the failover of a mainframe? Just aligning that is really something.
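Saba's point about testing RTOs at every release rather than once a year lends itself to automation. As a minimal sketch, and not any panelist's actual tooling, the snippet below shows one way a pipeline step could measure a recovery time actual against a declared RTO by triggering a failover hook and polling a health endpoint until the application answers again. The failover script, health URL, and RTO value are hypothetical placeholders.

```python
import subprocess
import time
import urllib.request

RTO_SECONDS = 15 * 60          # hypothetical RTO for this application: 15 minutes
HEALTH_URL = "https://app.example.internal/health"   # placeholder health endpoint
FAILOVER_CMD = ["./scripts/invoke_failover.sh"]      # placeholder failover hook

def is_healthy(url: str, timeout: int = 5) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def measure_rta() -> float:
    """Trigger the failover and measure how long until the app is healthy again."""
    start = time.monotonic()
    subprocess.run(FAILOVER_CMD, check=True)   # kick off the recovery procedure
    while not is_healthy(HEALTH_URL):
        if time.monotonic() - start > RTO_SECONDS:
            break                               # give up once the objective is blown
        time.sleep(10)                          # poll every 10 seconds
    return time.monotonic() - start

if __name__ == "__main__":
    rta = measure_rta()
    print(f"Recovery time actual: {rta:.0f}s (objective: {RTO_SECONDS}s)")
    # Fail the pipeline stage if the objective was missed, so the DR plan gets revisited.
    raise SystemExit(0 if rta <= RTO_SECONDS else 1)
```

The result of a run like this can be kept as evidence that the RTO was actually met, which ties back to the point above about auditable tests rather than a wiki page of rough guidance.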
And when people talk about the cloud, everyone talks about those startups that started from scratch. Yes, they built their applications in the cloud from the ground up, so those applications are designed cloud-ready. We are a financial organization that is over a hundred years old. We started with paper and pen, we moved to the mainframe, and now we're moving to the cloud. That's the reality of it, and it brings a huge challenge, because I will not rewrite the mainframe; there is no business case for us to bring all those services to the cloud right now. I'm pretty sure many in the audience feel the same pain point, unless you're at a startup. If you're at a startup, you're a happy camper because you're doing it from the ground up. But don't forget that in the future, maybe not today or tomorrow, you will be the next mainframe. Keep that in mind when you design and deploy.

Yeah. And even when somebody starts cloud native, if the proper resilience considerations and a well-architected framework for those applications aren't put forward, it's easy to accidentally shoot yourself in the foot and think you have fantastic resilience when you don't.

Exactly.

So that's a tricky one. Moving on to question three: one of the things I find fascinating is how the development of resilience considerations in financial services has really driven cutting-edge thinking on this topic that is trickling into other industries. Dominic, how have regulatory expectations impacted your cloud approach?

We're getting a lot of questions from our regulators. In Canada there's the regulator that oversees the banking system, the BSIF (OSFI), and there's an equivalent body in Europe working on this right now; I don't remember the acronym, we mentioned it earlier. They're asking a lot of questions: what's your contingency plan, what's your resiliency plan with your cloud provider? Do you have two cloud providers? Can you prove that you can switch all your workloads from one cloud provider to the other? And we were not really ready for that kind of question. You can build resiliency inside one cloud provider's solution, working with them, and reach a high level of resiliency like Saba was mentioning, through process, people, and tools. But the discussion of "if Google fails tomorrow, what's your plan? If Amazon fails tomorrow, what's your plan? If Azure fails tomorrow, if they go out of business or their data centers go down, what's your plan?" We are getting those questions, and frankly I don't have an answer. We were taken aback by them, but they're legitimate. There are three big cloud providers in Canada (Azure, AWS, and Google), and it could cause problems for the Canadian financial system if one of them were to fail. So we're getting heat from our regulators on this, but on the other side, they are not technical people, so the discussion is quite difficult right now. I don't know about your side, Saba?
We are a financial institution as well, in a regulated industry, and we have strict regulations too. We have to make sure our critical infrastructure is set up in an alternate location should there be a disruption, and we are super careful about our customers' data. There are many other controls that we define to mitigate those kinds of disruptions.

That's good. Thanks, Saba, and just a reminder to the audience that the views of the guests are their own, especially around regulatory topics. As you both raised, it is a tricky one. I have great respect for the regulators for driving operational resilience forward, but it is a very tricky thing to put a burden on technology operations teams and say all of your people have to be completely au fait, experts in managing your infrastructure on, say, AWS and Azure and maybe another cloud. Managing that, having a genuine exit capability, and having the discussion with the regulator about what an availability zone is versus a region, versus changing providers or exiting the cloud entirely: all of those things can be very costly for the institution. I do totally get the driver, which is to say systemically there is concentration risk with each of those providers and the economy has to keep going. So it makes for an interesting mix. Thank you very much for sharing those perspectives.

Saba, go ahead. I think it's a great initiative by the European act, the whole DORA initiative, because technology is what we're all dependent on right now, and it is so important to be prepared for technology-specific disruptions. There is some great guidance there. Operational resiliency is very multifaceted: it means risk management, resiliency planning, testing, incident management. DORA has those five pillars clearly defined, so we are preparing our applications to meet those regulations.

Yes, and DORA also wraps its arms around the technology providers to effectively regulate them, so we'll see how all that plays out.

It's very interesting, definitely. So, moving on to question four: how do you work alongside your cloud providers to ensure your applications are resilient in the cloud? Dominic, I thought I'd start with you on this one, and I know you have an interesting case study to share.

Exactly. Thank you, Ky. First of all, you have to adopt your cloud provider's best practices. You need to be sure that your architects and the solutions you deploy stick to the rules and criteria of those cloud providers. If you try to go outside of the box, yes, it will work, and yes, your own team may even convince you it will work, but trust me, it will bite you, whether that's next week, in a year, or ten years from now. You have to bring cloud-ready applications. Don't try to fit a circle into a square.
You also have to control access to the native services the cloud provider offers. There are many services that can be deployed in the cloud, and your developers all have access to them if you don't control that access. But some of those services are not GA; they are in beta, they are best effort, and they are not resilient, and you may not even know it. We discovered in one of our applications that a developer was using native Microsoft services that were not even GA. So you have to put processes in place to prevent your teams from even consuming those services.

You have to implement rigorous tests, and test with your cloud provider. Saba mentioned that you need to understand the shared responsibility in your contract: the provider is responsible for one part of the stack, but you are responsible for the rest, and ninety-nine percent of the time it's in the rest that things fail. You have to put practices in place, and your architects and dev teams need to know those boundaries, because frankly we all have those guys and girls who try to push the envelope, and they put themselves at risk doing so in the cloud. They did it on prem too, but it's even easier in the cloud. You have to bake SLAs into your contract. We even have what we call a bonus model, so there are penalties when those SLAs are not met, or when a resiliency test fails because of something on their side of the shared responsibility. We work with them on a monthly basis, we have reports, and we have governance in place with them. By the way, putting resiliency governance in place with your cloud provider is one thing you absolutely need to do.

And even with all that, you can get bitten. We lost one of our Azure regions three weeks ago. A full day down. Remember, we are a bank: online banking, mobile banking, all down. It was not a nice day. It happened because we consumed a service from a third party in the cloud, a firewall product that was supposed to be certified resilient. I won't name the company, but we were consuming their firewalls, and there was a bug: when Microsoft did an upgrade of the network cards, those firewalls went down, all four of them in that region, and we lost the region. No problem, we have a resiliency plan: let's switch to the other region. Guess what? We discovered that our developers had baked the switch into their pipelines, but the pipelines were not accessible because they were running in that same region. It was the dog chasing its own tail, the chicken and the egg. We couldn't launch our resiliency plan because the automation, those scripts and pipelines, was not itself resilient. So we were down a full day until we were able to bring those firewalls back up. It's sad. So you need to test. We had never talked about testing a full region going down; now we will. Testing small parts and hoping it will work as a whole when, excuse my expression, shit happens: it won't work. I'm telling you, we got bitten. Murphy exists. And by the way, in our other region we were doing an upgrade of our platform at the same time, so even though we would have been able to switch applications over, there was not enough capacity left, and we couldn't stop the process.
So Murphy struck twice: you lose your firewalls, and at the same time we were upgrading things in the other region. So now we're asking ourselves the question: do we need a third site? Because in real life that can happen, and it did happen. And it was not nice, because our clients, what we call our members, were at the grocery store buying things and couldn't pay. They were buying houses, buying cars, and it's difficult right now for a lot of folks out there, so I was really sad about that. It made us ask ourselves a lot of questions.

Dominic, I really appreciate you sharing that story. It is so tricky because, as you say, the application stack now offers fantastic capability, the ability to scale up automatically and serve customers in a fantastic way. But within that stack there are components people use as managed services, at the database layer, say, or the network, done in a different way. Each of those can fail, and then there is a Murphy's Law second-order consequence where you can't get to another component you had planned would be there. It has this sort of domino effect with all that complexity. As you say, you've got to work alongside the cloud provider and do the testing; you can't test all scenarios all the time, and you're going to hit some of those things. And so your consideration of a third site sounds crazy, but maybe that's what we have to do as this complexity increases.

Yeah. A friend who works in the airplane industry told me how a plane in twenty twenty-four can still go down even though it has multiple redundancy. It's never one big problem; it's a chain of small events that, put together, bring the plane down. It's the same thing for the cloud. You have to be aware of those small moving parts, and there are so many of them in the cloud that it can get out of hand pretty fast.

Absolutely. Go ahead, Saba, please.

Yeah, adding on to that: there are different kinds of testing we can use to ensure our application recovery plans work within these RTO timelines. I understand we don't have full control of the infrastructure, the data centers, and all of that, but that doesn't mean we can't do anything and should simply rely on the cloud provider to come back up. There are steps we can take to make sure we are prepared and have plans laid out to recover. Depending on the type of service model, it could be tabletop-exercise-style testing. You don't need to actually run the test in the environment; you create cross-functional collaboration with your business owners, application owners, and infrastructure and operations folks, and come up with a plan. You prepare a scenario, create a tabletop exercise based on different scenarios, specifically considering cloud disruptions like Dominic mentioned: a region outage, an AZ outage.
It could even be at the application level: within a region, within an AZ, some components could go down. So come up with a scenario, take the recovery objectives that you have, get into a conference room or a Zoom room, collaborate with the cross-functional teams, bring in all the stakeholders, and run the exercise: okay, folks, if this region goes out, what's our plan? How are we going to perform this particular business process? With that laid out and talked through, you can hash out a lot of things, like responsibilities; a lot of the time the right people are not assigned to the right tasks and roles. You will uncover what you are prepared for and what you are not, and then you can take those learnings and incorporate them. So that's one kind of testing.

Then there is scenario-based testing, which you can take into the tabletop exercise, and you can also do simulated failure testing. We are living in great times, with so many emerging technologies and new services out there. There are particular failures we can simulate using these tools, and we can automate them and run them continuously; we can just embed the scenario or failure testing into the SDLC cycle so we are prepared in terms of resilience readiness as well as operational readiness. And then chaos engineering, which is what simulated failure is all about: it's very methodical, you hypothesize a scenario, and very methodically you plan, you inject the failure, and you observe. In the past we have done chaos testing and conducted game days. We sat together as a team, we kept it unplanned so the team didn't know what kind of failure was being injected, and we injected the failure and observed to see how prepared our operational folks were. What about our playbooks? Do we have recovery playbooks, and the right kind of monitoring and alerting configured in our environment? There was a lot of learning. And don't point at each other's faults in this; it's a learning exercise. Develop that culture of failure testing and chaos testing, learn from it, and build a continuous feedback loop so you continuously improve and stay prepared.

That's a good way of looking at it, Saba. I quite like the framework, which I think may have come out of the Bank of England but has been adopted very widely, of considering the important business service that, if it were down for a given period, might cause intolerable harm to the consumer; working out which applications support those important business services; developing a tiering of those, so you understand which need to be recovered within which recovery time objective; and then being able to test against that, as you say. Folks are also looking across the cloud providers at global services that a lot of things rely on, over and above the individual application: a lot of my tier-zero apps rely on this global service that runs in only these regions, not everywhere. How do I plan for scenarios like that, where there can be gaps? Thank you both for sharing your insights on that question.
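Saba's description of game days and simulated failure testing can also be driven from a pipeline. As one illustrative sketch, and not a tool any panelist named, the snippet below uses AWS's Fault Injection Service via boto3 to start a pre-built experiment, for example one that takes out instances in a single availability zone, and waits for it to finish. The template ID and tag are hypothetical placeholders, and the environment is assumed to already have credentials and an experiment template defined.

```python
import time

import boto3  # assumes AWS credentials and a pre-created FIS experiment template

fis = boto3.client("fis")

# Hypothetical template that, for example, stops instances in one availability zone.
EXPERIMENT_TEMPLATE_ID = "EXT1234567890abcdef"   # placeholder, not a real template

def run_game_day_experiment() -> str:
    """Start a fault injection experiment and wait for it to reach a terminal state."""
    started = fis.start_experiment(
        experimentTemplateId=EXPERIMENT_TEMPLATE_ID,
        tags={"purpose": "game-day"},
    )
    experiment_id = started["experiment"]["id"]

    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
        if state in ("completed", "stopped", "failed"):
            return state
        time.sleep(30)   # poll every 30 seconds while the experiment runs

if __name__ == "__main__":
    outcome = run_game_day_experiment()
    print(f"Experiment finished with status: {outcome}")
    # The observation step (dashboards, alerts, playbook follow-through) stays human-led,
    # which is the point of a game day.
```

The same pattern works with other fault injection tooling; the key point from the discussion is that the injection is automated and repeatable while the observation and the no-blame retrospective remain a team exercise.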
Maybe moving on to question five, which is a bit of a continuation: Saba, if I start with you, what testing can be used to ensure these technical application recovery plans? Again, I'd like that nuance of how we move from a plan that's a wiki page or a flat file to a really executable runbook, where you really know the failover was done. How do you test these things, know how the teams conducted them, and, as you say, without pointing fingers, really ensure that the technical application recovery and failover plans work in practice?

Sure, it's a continuation of my earlier answer. There are a lot of new tools and services available out there with which you can develop very intuitive recovery plans. It doesn't have to be Excel. There is a lot of integration, where you can assign roles, create notifications, and develop multiple plans. For example, for cloud testing we want to make sure that for our critical applications and critical infrastructure we are doing high availability testing, disaster recovery testing, and sustained resiliency testing. So you can use these tools to prepare plans for those different types of tests, scenario-based and specific to cloud disruptions, shared with multiple stakeholders so everybody is on the same page. And as you're testing and learning (and I'm going with the thought of this being embedded into the SDLC, running your assessments and tests continuously), one artifact should be not only that you tested and recovered, but also that your DR playbook is updated. Then there are orchestration tools to bring all the teams together and orchestrate a DR exercise using these plans: you can integrate the tool with the plan and run the whole test.

Fantastic stuff, Saba, I really appreciate you sharing that. Dominic, I wonder if you have some thoughts on this.

Yeah. We created a resiliency program and incorporated the cloud into it. Since we are so hybrid, we had to have a holistic view of our what-if scenarios, like Saba was mentioning. You need to build what-if scenarios and runbooks, and you need to test those runbooks. For some of them you will get pushback from the management team, like testing the loss of a region. I wish we had tested that one beforehand, but it got tested in real time; next time we'll incorporate it. The one positive thing was that we had those what-if scenarios, like Saba was mentioning. It's critical. So we knew what to do: plan A, plan B, plan C, we had everything, because, like I mentioned earlier, we had a double situation where the first runbook didn't work and the second didn't either, so we had to go to the third one. But you have to do the same thing you would have done over the last twenty years: if you have a resiliency practice in your organization, involve them. The cloud is just someone else's computer. They have to be involved right from the start.
Don't start with a new team just because it's the cloud. Yes, they might not know what native services or CI/CD are, but they are very knowledgeable about resiliency and disaster recovery. So involve those people and bring them up to speed on the cloud: teach them what the cloud is, and they will bring great value. That's what we did, so we have one team, the same team, managing the resiliency of the cloud and the cloud runbooks that we also run for on prem, and even SaaS. One thing we haven't discussed: we have more and more SaaS providers. For us the cloud is not just Google or Amazon anymore; it's a lot of smaller providers out there, and that one is getting really tricky. In financial institutions there are those big names lying around, you have to work with them too, and you have to link your runbooks with those SaaS providers as well. So we have a dedicated team for that, running constantly.

You raise a good point, especially if some of those SaaS providers are involved in the control plane of how you access your applications and infrastructure. If they're not available, then the whole thing is...

You don't. It's SaaS: you consume a service, and you don't even know what's going on behind it, because you signed up for SLAs on the service. You don't know what's going on in the back. But in the last six months or so, the more SaaS providers you have, the more chance you have of an outage every day, and that's becoming the problem. Those SaaS providers, and I know it's a different topic from the cloud, sit in the cloud too. On paper they are resilient, fully cloud ready, elastic and so on and so forth, but we still get outages now and then.

Yeah. The thing I would add is that, with the cloud, often the technical application recovery plans can be sort of nuclear: I'm recovering the entire stack, from the network up through the managed services of that application stack, which takes a certain amount of time. But in reality, some of the most common outages involve just one of those components; it could be the network, it could be the database, something like that. So the plans should be adaptive, because application teams, rightly so, are sometimes reticent to say "this one component is out, I'm going to invoke the full nuclear plan." If your plans can be adaptive, so you can skip all the other layers and just recover the network, say, but still do it via a plan, it builds the muscle in a good way. And Saba, as you say, sometimes using fault injection services for individual components can really build capability in those areas.

Yeah, and to add on to that, understanding dependencies is also important. We are not only relying on the service provider's infrastructure; we are also relying on their upstream and downstream dependencies.
So understanding dependencies, and making sure we do some kind of dependency failure testing on our workloads every now and then, will go a long way.

That's very good. I think we've come to the Q&A component of the discussion. I'm really grateful to our guests for sharing these great insights today. We've had quite a lot of questions come in during the session, and I'll try to get to some of them in the short time we have left. The first question is a wide topic, but we're interested in the guests' views: how will AI affect application recovery, and what will it bring to your app recovery solutions and best practices? Dominic or Saba, would either of you like to opine on that?

You want to go ahead? Okay. Yeah, we are just starting the AI journey, so right now it doesn't have a lot of effect. The issue with AI is that you will rely even more on those cloud providers, so I expect that in the future it will be a huge part of application recovery, because all those applications are getting tied into the AI platforms at the back of the cloud providers. Right now, in the financial system in Canada, we had to halt our work on AI because of data: we couldn't get a guarantee from the cloud providers that they won't use our data to train their AI, or that our data won't end up benefiting even our competitors. So right now this question is kind of pushed to the back in my case. But if I understand the question right, in the future AI will be so integrated into those applications that the application won't work without it, and you will need a recovery plan for the AI as well. Saba?

Yeah, those are good points. I definitely agree that we can't mix the data of our financial institution with general internet data; we have to be really careful there. But there are some use cases where we can use AI to enhance our application recovery capabilities. It could be by integrating it into some of the tools, where a machine learning model is continuously checking the health of the system. We have standard controls in place for specific industries; if we could use AI to check whether we are complying with those controls as the workloads run in the environment, that could be another great use of AI. But of course we have to proceed with great caution and make sure we are not compromising user data.

That's absolutely right: make sure the data remains in your control and isn't being used for training, or going to folks you don't want it to go to. And as you say, Dominic, there's an increasing amount of complexity in providing increasingly performant applications.
Maybe AI can help us very quickly get to "this is actually the thing that's down, this is the problem to solve, and this is the response I should invoke," without it having to be many folks working for many hours over a weekend in their pajamas to recover services.

We are currently using... sorry, go ahead. I think GenAI is another use case, at least to speed up recovery plan and DR plan development. We could give it a scenario and it could help us build a basic plan, which we then customize, so things speed up, and it could help a lot in the exercise world too. So yes, there are definitely ways it can enhance the recovery and resilience space.

Yeah, and we are already using AI in Dynatrace, for those who don't know that tool; it's an APM, and it gives great results. We have deployed it across all our services: on the mainframe, on prem, in all our clouds, at the application layer and the infrastructure layer. The new set of AI tools they are bringing to the platform gives us a lot of insight and helps us with our recovery plans and runbooks. It even helps us build the runbooks, because things move so fast; it finds, month to month, what changed, so we can adjust those runbooks. The AI inside the enterprise tells us, "I've seen this change, can you look at it?", and it's really great. So that's one example of where we've started using AI with that sort of tool.

That's great stuff. That changing, dynamic nature, separating signal from noise in the alerting world in ops, is so important for getting the right data. And at Cutover we integrate with such tools to invoke and shape the runbook for the response, which is a very good use case. That leads me to the next question I see, which is: could you provide specific examples of integrations or automation to streamline recovery for cloud-based workloads? Are there any specific bits of automation that help get the RTAs, the recovery time actuals, within the RTOs, the recovery time objectives? Saba, do you have any thoughts on that?

Yeah. Especially in the testing space, we can use automation to automate some of the testing. We have tools where we can simulate a failure, so we can use those. First, I would say, build a scenario manually and test it out; make sure you have your objectives in place and you understand the steady state of your application. Then use the simulation tools to inject the failure, and use automation to validate whether you met the recovery objectives or not. That's one place we can integrate automation: once you've built that experiment, and you've analyzed the results and the configurations, you can turn it into a repeatable execution and embed it into your SDLC.
So for every release, there is a minimum set of experiments you can automate. There are some generic ones, and there can be some more advanced resilience experiments you add to them, but you can have basic experiments around CPU or memory exhaustion, some infrastructure-related, some network-related, and some application-specific experiments, embed them into your pipelines, and run them automatically on every release to make sure the new code or feature you developed has not caused any resilience issues in the infrastructure.

Dominic, do you have any thoughts on that?

The short answer is that it needs to be baked into the solution and into the CI/CD pipeline. When a team delivers a new application, one thing I consider part of the definition of done, and I always ask my team the question: can I destroy it, and can you recreate it with your pipeline in five minutes? If they go white or their eyes bleed, I know we are not ready to go to production, and it won't be resilient. You need to bake resiliency into your solution. And since the cloud is "as code", we are really maniacal about this, because as soon as you have an admin logging in and trying to modify something by hand, it will break resiliency. The cloud itself, if you respect all those as-code principles, will be elastic, it will be resilient, it will be fantastic, but you need to be really maniacal about it and bake it into your code.

One additional thing I would add, looking at the question one more time: they also ask about recovery specifically. One place you can automate recovery is to take these scenarios, say an AZ failure, and ask how you would recover; that can be automated into your runbooks, using SSM Automation in the AWS space, for example (there's a small sketch of this after this exchange). So you can automate recovery playbooks to an extent.

That's all good stuff. All I would add is that I know one particular client that is also obsessed with saying: if it can operate in the cloud, and we're operating with things as code, then I want all of my requirements, compliance, and cyber aspects defined as code as well, rather than ever having to go to a committee for an opinion. Then the ability to attest to them in a test and prove they were met becomes far faster than anything that isn't defined as code, and that's also very good for making sure the automation can go there.

I think we have time to touch quickly on one more question before we close. There's a question here about guidelines for the frequency of testing protocols for critical and important business services, and best practices for that sort of frequency. Saba, you talked about doing it at almost every release as part of CI/CD; would you like to opine on that?

Yes, sure. Be proactive and do regular testing before you deploy into production; that's one thing I would definitely recommend.
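Picking up the SSM Automation mention from a moment ago: as a minimal hedged sketch, and not anything the panelists described in detail, this is roughly how a recovery runbook could be triggered and monitored from Python with boto3. The document name and parameters are hypothetical placeholders; the environment is assumed to already contain the Automation document and AWS credentials.

```python
import time

import boto3  # assumes AWS credentials and an existing SSM Automation runbook

ssm = boto3.client("ssm")

# Hypothetical Automation document that, say, fails a workload over to a standby AZ.
RUNBOOK_NAME = "Recover-AppTier-Failover"                     # placeholder document name
RUNBOOK_PARAMS = {"TargetAvailabilityZone": ["us-east-1b"]}   # placeholder parameter

def invoke_recovery_runbook() -> str:
    """Start the Automation runbook and wait for a terminal status."""
    execution_id = ssm.start_automation_execution(
        DocumentName=RUNBOOK_NAME,
        Parameters=RUNBOOK_PARAMS,
    )["AutomationExecutionId"]

    while True:
        status = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]["AutomationExecutionStatus"]
        if status in ("Success", "Failed", "Cancelled", "TimedOut"):
            return status
        time.sleep(15)   # poll every 15 seconds while the runbook executes

if __name__ == "__main__":
    result = invoke_recovery_runbook()
    print(f"Recovery runbook finished with status: {result}")
```

The same invocation could be wired into an orchestrated runbook tool or a pipeline step, with the execution status captured as evidence that the recovery procedure itself, not just the infrastructure, was exercised.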
After you deploy, and depending on how mature you are, you could test more than once in production without causing any disruption; otherwise, at a minimum, test your production environments and your backup and recovery procedures annually. Make sure you're able to restore from those backups, do the data integrity checks, and things like that.

That's great, Saba. Before we close, Dominic, any comments on that question?

Not really, but one thing we struggled with on our side was identifying those business-critical services. Like I said, we keep discovering critical services that piggyback on the cloud, not directly but indirectly: application A consumes services from application B, which consumes application C. So your discovery system, the metadata inside your CMDB, your service tree, needs to be accurate and really dynamic. You need a system that can show you all those moving parts and where they are, and that's a bit of a struggle. The standard tools you use on prem don't work well in the cloud. We are moving to a new system, ServiceNow, and we hope it will give us the kind of visibility we need to stitch all those small parts together. So we are still on the journey to putting best practice in place; I can't fully answer the question because we're in the middle of it. But don't minimize the amount of effort it requires.

I'll close with this: I'm a manager, a director, and the cloud is sometimes sold as "it will cost less." It is not even near that; it costs more. You need more process, like Saba was mentioning. You need to be thorough, and you need your team on it all the time. It's not "I put my stuff there and I can fire my IT team." You actually need more people managing all those moving parts, so be aware of that. You don't move to the cloud to save money; you move to the cloud because you want elasticity. The cloud can be a great system for resiliency when you build it right and you have the right process, people, and tools around it. But you don't do it to save money, because you will lose a lot of money, fast.

Thank you very much, Dominic. In closing, Saba, is there anything you'd like to say before we wrap up?

I would say thank you. I am very grateful to Marcus Evans for inviting me and giving me this opportunity, and I was happy to be here and interact with such industry experts. Thank you.

Thank you very much to both of our panelists. I've really appreciated the insights you've shared today, and I'm very grateful for that. Thank you, audience, for dialing in, and I look forward to another session soon. I think there's more to go through from this debate; we barely scratched the surface, but there was some good stuff. Thank you all very much.

Thank you very much. A special thank you to our amazing panelists for being so generous with your time today and sharing your insights; it was a very interesting discussion. On behalf of myself and Marcus Evans, thank you so much for the collaboration. Turning to the audience quickly: thank you for all your amazing questions and comments.
If we didn't get to all of them, we will make sure to reach out to you after this broadcast. A reminder to look out for an email in the next two hours containing links to review today's material. We'd love to hear your feedback on today's webinar; a brief survey will pop up on your screen in just a moment, and we really appreciate any comments you might have. On behalf of Cutover and Marcus Evans, thank you so much for joining us, and we hope to see you again at future events. Thanks, everyone. Have a wonderful afternoon.

Thank you, guys. Thank you.