Preparing disaster recovery procedures for cloud workloads is complex. Cloud architecture has multiple layers - database systems, clustering technologies, data replication solutions and storage replication - and each layer requires its own integration, configuration and setup.
Depending on the resiliency built into your architecture and the disaster recovery strategy implemented (warm standby, active-passive, etc.), tracking and measuring recovery objectives can be complicated, but it is critical. Two of the most important disaster recovery metrics are:
- Recovery time objective - designates the amount of time that can pass before a disruption begins to seriously impact the flow of business operations; it is also the maximum amount of time a service can take to recover from failure.
- Recovery point objective - designates the maximum amount of data that can be lost following an incident.
The importance of recovery time objectives (RTOs) and recovery point objectives (RPOs)
Attaining recovery time objectives is critical for complying with disaster recovery regulatory requirements and keeping business operations running. The length of your RTO depends on how critical the application is to your business. A mission-critical application, for example, may have an RTO of 15 minutes, while a non-business-critical application may have one closer to 2-4 hours. Although recovery time is generally measured in minutes or hours, it needs to cover the entire duration of the recovery, from discovering the outage to bringing the services back online.
A recovery point objective is a goal for the maximum amount of data your organization can tolerate losing. It is a point-in-time measurement and one of the most important metrics to consider when building your backup and disaster recovery plan. RPOs determine your approach to data redundancy - including replication, log shipping, and backups. The interval between your backups is essentially the amount of data you could lose in a data disaster: with backups every six hours, for example, a failure just before the next backup could lose nearly six hours of data.
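To make the link between backup interval and RPO concrete, here is a minimal Python sketch. It assumes backups are the only form of data protection in place. The function names, the sample timestamps and the one-hour RPO target are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def worst_case_loss_window(backup_times):
    """Return the largest gap between consecutive backups.

    A failure just before the next backup would lose roughly this
    much data, so this gap is the worst-case recovery point.
    """
    ordered = sorted(backup_times)
    if len(ordered) < 2:
        raise ValueError("need at least two backup timestamps")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_rpo(backup_times, rpo):
    """True if the worst-case loss window does not exceed the RPO."""
    return worst_case_loss_window(backup_times) <= rpo


# Hypothetical backup schedule for one day.
backups = [
    datetime(2024, 1, 1, 0, 0),
    datetime(2024, 1, 1, 6, 0),
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 20, 0),
]

print(worst_case_loss_window(backups))          # 8:00:00
print(meets_rpo(backups, timedelta(hours=1)))   # False
```

In this example the longest gap is eight hours, so a one-hour RPO could not be met with backups alone. Replication or log shipping would be needed to close the gap.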
Highly regulated industries, like financial services, cannot afford to lose any data and therefore may measure their recovery point objective threshold in milliseconds. With advances in technology and the increasing adoption of cloud computing, organizations are demanding smaller or near-zero RTOs.
The value of measuring recovery time actuals (RTAs)
While recovery time objectives and recovery point objectives are both key performance indicators of disaster recovery procedures, RPOs are more straightforward to measure and track. To track an RTO, you need the recovery time actual (RTA): the time that actually elapsed to complete the recovery and make the application available for access. The RTO is the estimated value set as a target, and the RTA is the actual time measured against it. For good governance and compliance, recovery time actuals must be shorter than the recovery time objectives set in the IT disaster recovery or cyber resilience plan. Measuring RTAs enables you to examine the effectiveness of your backup and recovery procedures and tools. If the recovery time actual exceeds the recovery time objective, you may need to revisit your failover strategy to ensure that the switch from source to target happens faster.
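The comparison of RTA against RTO can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and timestamps are hypothetical, and the RTA is assumed to run from the moment the outage is detected to the moment the service is available again.

```python
from datetime import datetime, timedelta


def recovery_time_actual(outage_detected, service_restored):
    """Elapsed time from detecting the outage to restoring service."""
    if service_restored < outage_detected:
        raise ValueError("service_restored precedes outage_detected")
    return service_restored - outage_detected


def rto_margin(rta, rto):
    """Time to spare against the RTO; negative means the RTO was exceeded."""
    return rto - rta


# Hypothetical mission-critical application with a 15-minute RTO.
rto = timedelta(minutes=15)
rta = recovery_time_actual(
    outage_detected=datetime(2024, 3, 5, 9, 0),
    service_restored=datetime(2024, 3, 5, 9, 22),
)
margin = rto_margin(rta, rto)

print(rta)                     # 0:22:00
print(margin < timedelta(0))   # True
```

A negative margin, as in this example, is the signal to revisit the failover strategy.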
Depending on people to complete activities, or storing recovery plans in a non-executable form, can make recovery time objectives hard to meet. In general, a recovery process includes the following steps, all of which need to be accounted for when defining RTOs and calculating RTAs:
- Detect that the service is offline or that data is being lost
- Begin the recovery
- Restart any services in the correct dependency order
- Test that the recovery worked and the data is consistent
- Ensure clients are able to reconnect
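Since every step above counts toward the RTA, it can help to record a duration per step and add them up. The Python sketch below does this with hypothetical step durations. The numbers are invented for illustration and do not come from any real recovery test.

```python
from datetime import timedelta

# Hypothetical durations recorded for each recovery step during a test.
steps = [
    ("detect outage", timedelta(minutes=6)),
    ("begin recovery", timedelta(minutes=4)),
    ("restart services in dependency order", timedelta(minutes=12)),
    ("test recovery and data consistency", timedelta(minutes=9)),
    ("confirm clients can reconnect", timedelta(minutes=3)),
]


def total_rta(recorded_steps):
    """Sum all step durations to get the end-to-end recovery time actual."""
    return sum((duration for _, duration in recorded_steps), timedelta(0))


def slowest_step(recorded_steps):
    """Return the (name, duration) pair that took the longest."""
    return max(recorded_steps, key=lambda step: step[1])


print(total_rta(steps))         # 0:34:00
print(slowest_step(steps)[0])   # restart services in dependency order
```

Breaking the RTA down this way shows which step to target first when the total exceeds the RTO.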
Testing recovery procedures is how you determine whether you are close to meeting your stated recovery objectives. Organizations should test regularly, validate recovery periods, and verify whether they can recover all their data in a timely manner.
Gain efficiency with automated recovery time actual (RTA) calculation
Automating repetitive, manual recovery tasks can help you optimize disaster recovery processes and reduce recovery time actuals. Cutover, a Collaborative Automation platform, helps enterprises streamline recovery procedures with dynamic, automated runbooks. With pre-built integrations to CMDBs, like ServiceNow, you can import recovery time objectives and then instantly calculate recovery time actuals during a recovery test or actual failover event. Measure RTA compared to the defined RTO and pinpoint steps in the recovery plan that require changes to improve your RTA.
Ready to learn more?
Make sure your enterprise can accurately track and measure recovery time actuals with Cutover’s Collaborative Automation platform. With Cutover, centralize your IT disaster recovery and test plans and govern recovery and resilience procedures while reducing manual tasks and increasing efficiency.