What IT Gets Wrong in Their Disaster Recovery Plans

essidsolutions

Two factors essential to protecting critical applications when disaster strikes are often overlooked — organizational communication and up-to-date documentation of processes and procedures for application operation before, during, and after a disaster. Here, Sandi Hamilton, Director of Customer Support at SIOS Technology, addresses the significance of regularly running a disaster recovery plan that supports business continuity in both test and QA environments.

Whether the business is a multi-billion-dollar financial services firm or a small-to-medium-sized family business, staying competitive in today’s economy requires continuously updating and refining its IT infrastructure to maintain end-user productivity, automate manual processes, remove inefficiencies, and cut costs. As a result, IT infrastructures are in a near-constant state of change.

While this continuous improvement often adds efficiency, it presents an often-overlooked threat to the health of the business itself — downtime for business-critical applications and loss of important data. Implementing and managing innovations and moving resources to the cloud can disrupt and confuse the responsible IT team. As a result, when critical systems fail or disaster strikes, recovery times increase, and costs skyrocket.

Like the well-known adage that a balanced diet and regular exercise are keys to maintaining a healthy weight, the best practices for ensuring high availability protection are practical and clear but overlooked with surprising frequency.

Teamwork and Clear Communication Ensure Efficient Recovery 

A common failure in business continuity/disaster recovery (BC/DR) is a lack of coordination and communication on the part of IT staff responsible for critical applications, databases, and ERP systems. These systems are typically mission-critical to the company and among the most complex to configure, migrate, manage, and maintain. On the technology side, it’s important to use an application-aware high availability solution to orchestrate failovers while maintaining application-specific best practices. On the IT side, there are several key elements needed to reduce risks:

  • Clearly define and communicate personnel roles and responsibilities related to the configuration, ongoing maintenance, and recovery of applications and data.
  • Ensure that all key staff are involved in any IT or critical application changes, upgrades, or new additions.
  • Document the overall complexity of environments – ensure that applications, dependent services, networking, and hardware are all considered during implementation, testing, and upgrades.
  • Collaborate to ensure documentation is accessible and accurate for everyone who needs it.
  • Test availability failover and failback as well as disaster recovery frequently and in as realistic a manner as possible.

Playing Your Position: Clearly Define Team Roles

In very large organizations, responsibility for the classic IT “silos” of operation often rests with different departments. One department may specialize in networking, for example, while another focuses on managing cloud resources. However, in many organizations, the lines of responsibility for key IT functions are not so clearly defined. This confusion often surfaces when implementing and managing a high availability solution that crosses these lines of responsibility. For example, the hierarchy of access permissions is a common point of confusion. A database administrator configuring a failover clustering environment may need to request access privileges from their colleague responsible for the network or infrastructure.

Unfortunately, these points of confusion often surface only when there is an urgent need – during a downtime event or disaster. For that reason, the specific responsibilities, permissions, and roles of all IT team members must be documented and communicated to the entire IT organization. Coordination and communication between IT members and groups are necessary to ensure smooth ongoing operation and fast, efficient response to crises. 

Consider starting from the requirements for operating critical applications and list the dependent IT requirements – supporting software services, infrastructure, network, and so forth. Document the settings, owners, and access privileges needed in the event of a change, failure, or update to these application environments. Use this documentation to inform and update all parties affected by system changes of any kind.
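The dependency-and-ownership documentation described above can take a machine-readable form, which makes it easier to keep current and query. The sketch below is a hypothetical example only — the field names, team addresses, and schema are illustrative assumptions, not a standard format:

```python
# Illustrative sketch of one runbook entry for a critical application,
# capturing its dependencies, owners, and the access privileges needed
# during a change or recovery. All names and fields are hypothetical.
runbook_entry = {
    "application": "ERP-Prod",
    "dependencies": ["SQL Server cluster", "internal DNS", "license server"],
    "owners": {
        "application": "app-team@example.com",
        "database": "dba-team@example.com",
        "network": "netops@example.com",
    },
    "access_required": {
        "failover": ["cluster admin role", "VM restart permission"],
        "network_change": ["firewall change request", "load balancer config"],
    },
    "last_reviewed": "2024-01-15",
}

def owners_to_notify(entry):
    """Return every distinct owner contact affected by a change to this app."""
    return sorted(set(entry["owners"].values()))
```

A structure like this lets you derive the notification list for any change directly from the documentation, rather than relying on memory during an outage.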

Also, ensure that changes are captured and kept current in a central document – sometimes called a runbook – that is available to the entire IT organization. This book should contain descriptions of roles and responsibilities and all critical information needed to manage and maintain the critical systems.

Learn More: How to Build a Cloud Data ‘Restore and Recovery’ Plan

Deciphering Complexity

System complexity can easily cause human error. Your teams should also be aware of typical causes of complexity that exist in your current systems, including compatibility limitations, system age, software version, and any legacy or custom applications. Clear documentation that helps decipher this complexity can significantly speed recovery.

Test Environments are Not Optional

A surprising number of companies implement significant changes to their environments directly on their production sites. While this saves the cost of maintaining a testing site, even seemingly simple changes can put critical systems at risk. Be sure to have a non-production test environment available to test changes or updates that could affect application performance or availability.

Break Down IT Silos

True collaboration among IT groups ensures high availability systems are implemented to incorporate all groups’ requirements and priorities. For example, an application team may need the help and support of the infrastructure and network teams to set up a clustering environment for HA.

Each team may have different priorities dictating how that clustering environment is provisioned, secured, accessed, and maintained. Collaboration can remove significant confusion, unnecessary complexity, and frustration. It can also identify potential issues that may not surface otherwise, such as component incompatibility, unexpected capacity requirements, and networking constraints.

As discussed earlier, collaborating among IT members on an updated runbook containing policies, procedures, and deployed environment configurations is important for continuity and efficient response. Maintain this important document in a centralized location (including your DR location), in both digital and physical formats, so that all IT members can access it even during a power outage or disaster.

Learn More: 7 Ways to Build an Effective Disaster Recovery & Business Continuity Plan

Never Take Shortcuts When it Comes to DR Testing

Because disaster recovery testing can be time-consuming and tedious, many discover the flaws in their failover or disaster recovery plan the hard way – when they cannot recover operations or data after a downtime event. Include DR in your runbook and proactively plan and continuously test your systems. If these systems have high availability clustering, testing should include switchover from the primary server to the secondary server and back again. Testing should also include simulation of various failures, including hardware, network, and software failures.
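The switchover-and-switchback test described above can be scripted so it runs routinely instead of only when someone remembers. The sketch below uses a toy two-node `Cluster` class as a stand-in — the class and its `switchover()`/`active_node()` methods are assumptions for illustration, not the API of any real clustering product:

```python
class Cluster:
    """Toy two-node cluster used only to exercise the test logic below."""
    def __init__(self, primary, secondary):
        self.nodes = [primary, secondary]
        self.active = primary

    def switchover(self):
        # Move the application role to the other node.
        self.active = [n for n in self.nodes if n != self.active][0]

    def active_node(self):
        return self.active

def run_switchover_test(cluster):
    """Fail over to the secondary and back, verifying each step succeeded."""
    original = cluster.active_node()
    cluster.switchover()
    assert cluster.active_node() != original, "failover did not move the role"
    cluster.switchover()
    assert cluster.active_node() == original, "failback did not restore the role"
    return True
```

In a real environment, the switchover calls would be replaced by your HA solution's CLI or API, and the assertions would check application health on the new node, not just which node holds the role.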

Understand the Special Requirements

There are various factors to be considered when implementing high availability protection in the cloud that differ from a similar configuration on-premises. You will likely require a load balancer in public clouds such as Azure that you would not need on-premises. Moving a shared-storage-based failover cluster to a public cloud such as AWS or Azure, where traditional shared storage spanning availability zones is not offered, will necessitate efficient replication of local storage in a SAN-less cluster configuration.
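To illustrate why the load balancer matters: in a common cloud clustering pattern, the load balancer routes client traffic to whichever node answers a TCP health probe, and only the node currently hosting the active application role listens on the probe port. The sketch below shows that idea in minimal form; the port number and function names are assumptions, and a production implementation would be driven by the cluster software itself:

```python
# Minimal sketch: only the active cluster node opens the health-probe port,
# so the cloud load balancer directs traffic to it. Standby nodes stay
# silent and are skipped by the probe. Port and names are illustrative.
import socket
import threading

PROBE_PORT = 59999  # must match the port configured on the LB health probe

def start_probe_listener(is_active):
    """Listen for load-balancer health probes only while this node is active."""
    if not is_active:
        return None  # standby node does not answer, so the LB routes elsewhere
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", PROBE_PORT))
    srv.listen(1)

    def serve():
        while True:
            try:
                conn, _ = srv.accept()
                conn.close()  # accepting the connection is the health signal
            except OSError:
                break  # socket closed, e.g. during failover to the peer node
    threading.Thread(target=serve, daemon=True).start()
    return srv
```

During a failover, the cluster software closes this listener on the failing node and opens it on the new active node, and the load balancer follows automatically on its next probe.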

Operating important applications in dynamic, virtual cloud environments means keeping track of more frequent changes and adjustments. Make sure upgrades are compatible and supported with dependent components. Unlike an on-prem environment, where you control hardware maintenance, in the cloud you need to be aware of provider-planned maintenance. Your service provider should notify you.

Ensure that your team has a way to migrate the workload from the affected systems. Remember that the key to reducing human error affecting the high availability of critical applications is collaboration, where knowledge and testing are shared with all stakeholders through ongoing communication and documentation.
