How to Modernize Your Disaster Recovery Plan


Today’s disaster recovery (DR) techniques have come a long way since 2000, and a lot of the changes are related to the modern definition of a disaster.

Initially, a disaster was defined as the total loss of your physical site due to “acts of God” such as flood, hurricane, fire, earthquake or other natural events. Some events may not affect your site directly, instead knocking out your electricity provider or local infrastructure such as roads and bridges. With your site unavailable, DR plans usually called for keeping data backups at a secondary site called a cold site. This site had sufficient hardware to run mission-critical applications and usually included enough data storage to run your business until the primary site was available again.

Skip ahead to the mid-2000s. By then, network speeds and bandwidth had grown to the point where a company could publish its operational data daily (asynchronously) across a private or public network to an alternate site. Since this backup site needed to be connected to the network to receive regular data backups, it was usually kept running and ready.

Today, most sophisticated IT operations use high-speed networks to replicate data synchronously to their alternate site. In effect, as inserts and changes are made to your production data, those changes are almost immediately available at the alternate site. Therefore, this so-called hot site contains up-to-the-minute data and is ready to take over executing applications almost immediately. Indeed, you can now move your regular data backups (usually to tape) from your primary system to your alternate system, thus offloading CPU cycles from your primary machine.

Note that since data can now be synchronously replicated to a backup site, it is also possible to easily handle problems that fall short of full disasters. For example, suppose a recent upgrade to an application has caused some data corruption. In that case, the backups available on the secondary system can be used to restore datasets and databases locally there while your primary system remains up and running. Once the restore completes at your secondary site, the data changes can be replicated from there to the primary site, thus minimizing your production outage.
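
The difference between the asynchronous replication of the mid-2000s and today's synchronous replication can be sketched in a few lines. The toy stores below are purely illustrative (not any vendor's replication API): in synchronous mode every write blocks until the replica has applied it, so the hot site never lags; in asynchronous mode writes queue up and ship later, so the replica can fall behind.

```python
class ReplicaStore:
    """Toy key-value store standing in for a DR site's data copy."""
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class PrimaryStore:
    """Primary site that replicates each write to a hot-site replica."""
    def __init__(self, replica, synchronous=True):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped (async mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Write completes only once the replica has it: zero data lag.
            self.replica.apply(key, value)
        else:
            # Queue the write; the replica lags until the backlog ships.
            self.pending.append((key, value))

    def ship_backlog(self):
        """Async mode: periodically push queued writes to the replica."""
        while self.pending:
            self.replica.apply(*self.pending.pop(0))


replica = ReplicaStore()
primary = PrimaryStore(replica, synchronous=True)
primary.write("account:42", {"balance": 100})
# The hot site already has the change, with no extra step:
assert replica.data["account:42"] == {"balance": 100}
```

In asynchronous mode, any writes still sitting in `pending` when disaster strikes are lost, which is exactly the data-currency gap the recovery point objective (discussed later) quantifies.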

This means that synchronous data replication to your hot site is a critical part of your IT infrastructure.

All this works fine until cloud computing enters the equation. How has the availability of database-as-a-service (DBaaS), software-as-a-service (SaaS), and other cloud services changed DR planning?

Learn More: What Is Disaster Recovery? Definition, Cloud and On-Premise, Benefits and Best Practices 

What About the Cloud?

Cloud services have matured to be mission-critical in the same way that big data and data warehouses (DW) are now mission-critical.

Initially, big data applications and DW were implemented on-premise using special hybrid hardware/software. Business intelligence (BI) queries that accessed these large data stores tended to run for minutes (or hours), making them more suitable for ad hoc users or reporting. 

However, hardware and database performance soon increased to the point where these services could provide answers in seconds. Operational systems soon took advantage of this by embedding BI queries in customer-facing applications. For example, a financial application executing a customer transaction could use a BI query to access the warehouse or big data application to determine the probability of fraud. In addition to this, many BI queries that were deemed valuable became regular weekly reports, then daily reports, then finally on-demand reports.

Big data and data warehouses moved from being ad hoc and reporting-only to being integrated into operational applications. Hence, they became part of the IT disaster recovery process and must be backed up and recovered along with operational data.

The same is true for cloud services. If you store any operational or business intelligence data in the cloud, that data must be part of your DR plan. If you use cloud services, you must have plans for their replacement in case of a disaster.

Specifically, if you have a cloud provider, you must consider whether that provider is a single point of failure (SPOF): a single system component whose failure causes the entire system to fail. SPOFs increase the risk that a localized failure can cause a significant outage.

Failure of cloud services is uncommon, but it happens. Amazon’s AWS recently experienced an outage, and other providers such as Google Cloud Platform and Microsoft Azure have also failed. A failure at your cloud provider may not be your fault, but your customers see your company, your applications, and your portals, not your provider.

Learn More: When Cloud Is Not Reliable: 4 Tips to Deal With Cloud Outages 

Proper Disaster Planning Requires Testing

Today, disaster recovery planning is more than simply installing hardware at a secondary site and purchasing alternate cloud services. Many industries are subject to laws, regulations, and compliance rules that require testing of your DR plans.

These requirements (and more) mean that you must execute regular disaster recovery tests to certify that your DR plans are sufficient to bring your business back into operation after a disaster. Such business availability requirements and regulations are common and may also come with time limits. For example, some financial regulations require that your plans be robust enough to survive a disaster and have you back up and running within 24 hours.

Learn More: 10 Best Practices for Disaster Recovery Planning (DRP)

Review Your Disaster Recovery Plan

The first step is to review your current DR plan in the context of your current (and future) hardware and software configuration, including on-premise, off-premise, and cloud services. Use current industry standards to guide your review: for example, ISO/IEC 27031, the international standard for ICT readiness for business continuity.

Standard DR plans specify how long you can tolerate a data or application outage (the recovery time objective, or RTO) and how up-to-date your data must be (the recovery point objective, or RPO). It may be necessary to define these separately for different applications and data categories. For example, if a disaster strikes, you may want your customer-facing applications up within a few hours, while back-office reporting processes can wait a day or more.
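
As a sketch, per-category RTO and RPO objectives can be captured in a small table and checked against DR test results. The tier names and numbers below are hypothetical examples for illustration, not figures drawn from any regulation:

```python
# Hypothetical recovery objectives per application tier.
OBJECTIVES = {
    "customer-facing": {"rto_hours": 4,  "rpo_minutes": 5},
    "back-office":     {"rto_hours": 48, "rpo_minutes": 1440},
}

def meets_objectives(tier, measured_rto_hours, measured_rpo_minutes):
    """Check a DR test result against the stated objectives for a tier."""
    target = OBJECTIVES[tier]
    return (measured_rto_hours <= target["rto_hours"]
            and measured_rpo_minutes <= target["rpo_minutes"])

# A DR test that restored customer apps in 3 hours with 2 minutes of
# data loss meets the example objectives:
print(meets_objectives("customer-facing", 3, 2))   # True
# Restoring them in 12 hours would not:
print(meets_objectives("customer-facing", 12, 2))  # False
```

Writing the objectives down in one place like this makes it obvious which tier each DR test result should be judged against.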

Part of your application service-level agreements (SLAs) should include expected data and application availability after a disaster. SLAs are extremely important for cloud services, especially if you store operational data in the cloud or rely on cloud providers to present your applications to customers. Common metrics include transaction turnaround time (TAT), mean time to recovery (MTTR), and uptime, which commonly includes network availability and the number, duration and scheduling of maintenance windows.
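
Two of those SLA metrics, MTTR and uptime, are straightforward to compute from incident records. The sketch below assumes each incident is logged as a (start, end) pair of hour offsets within the measurement period; it is an illustration of the arithmetic, not any particular monitoring tool's API:

```python
def mttr_hours(incidents):
    """Mean time to recovery: average outage duration, in hours."""
    durations = [end - start for start, end in incidents]
    return sum(durations) / len(durations)

def uptime_pct(incidents, period_hours):
    """Percentage of the period the service was available."""
    downtime = sum(end - start for start, end in incidents)
    return 100.0 * (period_hours - downtime) / period_hours

# Two outages in a 30-day (720-hour) month, as (start, end) hour offsets:
incidents = [(100.0, 101.5), (400.0, 402.5)]
print(round(mttr_hours(incidents), 2))        # 2.0
print(round(uptime_pct(incidents, 720), 3))   # 99.444
```

Note that an SLA's uptime definition usually excludes scheduled maintenance windows, so agree with your provider on exactly which intervals count as downtime before computing the number.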

Testing your disaster recovery plan is crucial. If you must have your apps up within a few hours, but it takes half a day to restore your data, you may need to change your application requirements or your backup and recovery methods. Testing also measures the performance of your DR site: a successful data and application recovery may not help your business if DR site performance is poor. Testing must include participation by your cloud providers and should cover network outages and outages of each service.

One interesting DR planning alternative is a cloud service known as disaster recovery as a service (DRaaS). Several vendors offer this kind of service, including IBM, Microsoft Azure, and Amazon’s CloudEndure.

It is clear that a robust disaster recovery plan will make use of multiple cloud service partners. One typical approach is to designate one provider as primary for a particular category of data or service while another provider is secondary or backup. Such a multi-cloud methodology removes potential single points of failure, although having multiple partners makes managing your extended infrastructure more complicated.
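
That primary/secondary designation can be sketched as a priority list per service category, with failover when the primary is unhealthy. The provider names and the health-check callback below are hypothetical placeholders, not real services:

```python
# Hypothetical priority lists: first entry is the primary provider
# for each service category, later entries are backups.
PROVIDERS = {
    "object-storage": ["provider-a", "provider-b"],
    "dbaas":          ["provider-b", "provider-c"],
}

def pick_provider(category, is_healthy):
    """Return the first healthy provider for a category, in priority order."""
    for provider in PROVIDERS[category]:
        if is_healthy(provider):
            return provider
    raise RuntimeError(f"no healthy provider for {category}")

# Simulate an outage at provider-a: object storage fails over.
down = {"provider-a"}
print(pick_provider("object-storage", lambda p: p not in down))  # provider-b
```

The complexity the article warns about lives in `is_healthy` and in keeping data consistent across providers; the routing decision itself stays simple.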

The Cost of Not Modernizing Your DR Plans

The most important aspect of disaster recovery is understanding and measuring the costs of application and data downtime. Outage duration translates directly into lost business, and customers who cannot access your data (or their data) may not return. A poorly managed or poorly tuned secondary site can also lead to lost revenue. There are also costs associated with violating laws, regulations, or compliance rules, including fines, litigation, and other penalties.

Review your disaster recovery plans now, and be prepared to modernize them by implementing a multi-cloud methodology. Consider a DR service provider as a partner if managing a multi-cloud infrastructure seems too complex for your IT staff.

What lessons from 2020 are you applying to your disaster recovery strategies? Comment below or let us know on LinkedIn, Twitter, or Facebook. We would love to hear from you!