Organizations often conduct traditional disaster recovery exercises in which testing is done in silos and the scope is restricted to host-level recovery of individual systems. With ongoing technology change and globalization, applications have grown more intricate and interdependent in recent years, and major applications are spread across multiple locations and multiple servers.
In this scenario, a traditional recovery exercise focused on server (host) level recovery cannot adequately ensure complete recovery of an application without inconsistencies among its interdependent subcomponents. In a widespread disaster involving a major data-center-level outage, such a limited exercise is clearly insufficient to establish a realistic readiness status and the overall Recovery Time Objective (RTO) for multiple applications.
Therefore, organizations should increase the scope and complexity of disaster recovery exercises over time and ensure that each exercise is process-oriented and focused on “end-to-end” recovery. This article addresses some of the technical challenges faced in end-to-end disaster recovery exercises that attempt a full life cycle of transactions across disaster recovery applications and their dependencies and simulate business activities during the exercise.
Growing reliance on information technology, along with compliance and regulatory requirements, has led many organizations to focus on Business Continuity and Disaster Recovery (DR) solutions. Availability has become a major concern for business survival. It is therefore essential to take a detailed look at disaster recovery testing and the specific steps that ensure a disaster recovery plan performs as expected. An end-to-end disaster recovery exercise provides a realistic readiness status and brings out the complexities and intricacies involved in recovering multiple applications in a widespread disaster, including a data-center-level outage.
There are many more challenges in an “end-to-end” disaster recovery exercise than in a traditional one, since one needs to consider all the dependencies and take an end-to-end view to understand the full functionality of the applications.
This article illustrates some of the challenges faced in actual “end-to-end” disaster recovery exercises conducted for applications that interface with external third parties and rely heavily on middleware components and batch jobs.
Why a Disaster Recovery Exercise is Required
Disaster recovery plans represent a considerable amount of complex and interrelated technologies and tasks that are required to support an organization’s disaster recovery capability. Constant changes in personnel, technologies, and application systems demand periodic plan validation to assure that the recovery plans are functional and remain so in the future. Without this validation, an organization would not be able to demonstrate that the documented set of recovery plans support current recovery operations that will be needed to sustain critical business functions in time of disaster.
A periodic disaster recovery exercise is required to validate the documented recovery procedures, assumptions, and associated technology used in restoring the production environment.
Issues in Traditional Disaster Recovery Exercises
How many organizations attempt a full life cycle of transactions across disaster recovery applications and their dependencies, and simulate business activity, as part of disaster recovery exercises? Organizations often conduct traditional disaster recovery exercises in which testing is done in silos and the scope is restricted to host-level recovery of individual systems. In most of these exercises, the participating team comprises only information technology staff, without any business users. The primary objectives of such exercises are generally restricted to the recovery of standalone systems, without any integration with upstream or downstream dependencies.
Typical application validation in such an exercise includes login validation, form navigation, and search validation, without testing any connections to other dependent applications or any business activity. Most of the time, traditional disaster recovery test activities are limited to travel and the restoration of hosts at the recovery site, and nothing further. The major drawback of this type of testing is that, until an actual disaster, one will not know how the integration will work, what the main dependencies are, or what the impact of network-latency issues may be [3].
With growing technology changes and globalization trends, applications have become more intricate and interdependent in recent years, and major applications are spread across multiple locations and multiple servers. In this scenario, a traditional recovery exercise focusing on server (host) level recovery is not adequate to fully recover the application without inconsistencies among its various interdependent subcomponents.
This kind of limited exercise, which neither covers end-to-end recovery activities nor attempts to simulate business activity, is not sufficient to reflect preparedness for a real disaster or to assure the required overall Recovery Time Objective (RTO) for multiple applications.
Why We Need an End-to-End Disaster Recovery Exercise
A limited-scope disaster recovery exercise that does not involve end-to-end recovery activities or attempt to simulate business activity is typically based on asset-level outage scenarios rather than on widespread site-level outages.
Therefore, in order to ensure effective disaster recovery preparedness, organizations should plan for an end-to-end disaster recovery exercise. This will bring out the practical issues involved in performing the business transactions in the disaster recovery environment and verify the real effectiveness of disaster recovery procedures.
Challenges in an “End-to-End” Disaster Recovery Exercise
An end-to-end disaster recovery exercise focuses on complete recovery of applications and their dependencies across the presentation, business logic, integration, and data layers. It takes into account the required data consistency among the various interdependent subcomponents and views recovery from the business-process perspective.
Since an end-to-end disaster recovery exercise attempts a full life cycle of transactions across disaster recovery applications and their dependencies, and simulates business activity during the exercise, there are many challenges in conducting it. Typical challenges faced are:
- Isolating the DR environment
- Replacing hard coded IP addresses and host names
- Connecting to dependent systems not having a corresponding disaster recovery environment
- Proper sequencing of applications
- Thorough preparation and coordination
- Ensuring a back out plan and data replication during the exercise
This article assumes a parallel exercise scenario and highlights the common technical challenges faced in conducting an end-to-end disaster recovery parallel exercise in a warm site. In a parallel exercise, the DR environment is brought up without interrupting or shutting down the production environment.
Isolating the DR Environment
As everyone will agree, we need to perform the disaster recovery exercise without any interruption to production. This is very easily said, but is the toughest challenge for the disaster recovery coordinator, especially when it is required to do a parallel test at a warm site. Isolating the DR environment and at the same time conducting the full life cycle testing requires a lot of planning and coordination.
The key issue with a full life cycle test is the potential interruption to production systems by unintended access either by other applications or batch jobs. This may result in updating some transactions in the production environment during the test since these restored systems might have the same host names or IP addresses as production systems. Any production interruption such as any duplicate financial transaction for paying a vendor or any missing critical transaction due to a disaster recovery exercise could put your disaster recovery effort in jeopardy.
One should ensure that disaster recovery instances are not connected to the production environment at any layer, including the database and network layers. For example, at the database layer, the tnsnames.ora file or database (DB) links should be updated so that only DR instances talk to each other. At the network layer, appropriate firewall rules should be implemented to block any traffic from the disaster recovery environment to the production environment.
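As a simple illustration of the database-layer check, a script can scan a tnsnames.ora-style file and flag any connect descriptor whose HOST entry does not follow the DR naming convention. The "-dr" suffix, host names, and service names below are assumptions for this sketch; real naming standards will differ:

```python
import re

def find_non_dr_hosts(tnsnames_text):
    """Return HOST= values that do not follow the assumed '-dr' naming convention."""
    hosts = re.findall(r"HOST\s*=\s*([\w.-]+)", tnsnames_text)
    return [h for h in hosts if "-dr" not in h]

# Hypothetical tnsnames.ora content restored into the DR environment
sample = """
ORDERS =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = ordersdb-dr.example.com)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = orders)))
BILLING =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = billingdb.example.com)(PORT = 1521))
    (CONNECT_DATA = (SERVICE_NAME = billing)))
"""

print(find_non_dr_hosts(sample))  # the BILLING entry still points at production
```

Running such a scan across restored configuration files before the exercise starts helps catch connections that would otherwise leak into production.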
In an isolated DR environment, there will be challenges for desktop clients and end users in connecting to the DR environment and in verifying whether they are accessing the production or the DR environment. These challenges can be overcome by allowing access to the disaster recovery environment via DR-Citrix, DR host names, or direct DR URLs, as applicable. End-user client machines' local hosts files and configuration files need to be configured to point to DR host names instead of production host names during the DR exercise.
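As a sketch of that client-side redirection (the host names and addresses below are hypothetical), a local hosts-file override during the exercise could map the names an application expects onto the DR servers, so the client configuration itself need not change:

```
# Local hosts file during the DR exercise (hypothetical entries)
# Clients resolve the familiar names to DR addresses instead of production.
10.9.20.15   appsrv01.example.com
10.9.20.16   dbsrv01.example.com
# Remove these entries after the exercise to restore normal resolution.
```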
Replace the Hard-Coded IP Address and Host Name
In many organizations, a major issue in disaster recovery exercises is hard-coded IP addresses and host names in applications, particularly in batch jobs. Interfaces and batch jobs might fail, or might interrupt production systems, during the exercise if they contain hard-coded IP addresses or host names. Hence one needs to thoroughly analyze all the systems involved and identify any hard-coded IP address or host name. As a best practice, one should always reference alias names and avoid hard-coding host names or IP addresses. One important task for effective disaster recovery implementation is to convert every application to reference alias names, not the primary host names listed in DNS or IP addresses.
However, it can be a tough job to replace the hard-coded host names or IP addresses of applications developed several years ago. In such cases, it is suggested to use automated scripts as much as possible to replace each production host name or IP address with the respective DR host name or IP address. These DR scripts should be documented, and one should ensure they are not overwritten during storage replication to the DR environment.
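A minimal sketch of such an automated replacement is shown below. The production-to-DR host map is a hypothetical example maintained by the DR team; real names will differ, and site-standard tools may be preferable:

```python
from pathlib import Path

# Hypothetical production-to-DR mapping; real host names and IPs will differ.
HOST_MAP = {
    "appsrv01.prod.example.com": "appsrv01.dr.example.com",
    "10.1.20.15": "10.9.20.15",
}

def rewrite_hosts(text, host_map=HOST_MAP):
    """Replace each production host name or IP address with its DR equivalent."""
    # Longest keys first, so a short key never clobbers part of a longer one.
    for prod in sorted(host_map, key=len, reverse=True):
        text = text.replace(prod, host_map[prod])
    return text

def rewrite_config(path, host_map=HOST_MAP):
    """Rewrite a config file in place, keeping a .bak copy for back-out."""
    p = Path(path)
    original = p.read_text()
    p.with_suffix(p.suffix + ".bak").write_text(original)
    p.write_text(rewrite_hosts(original, host_map))

print(rewrite_hosts("db_host=appsrv01.prod.example.com ip=10.1.20.15"))
```

Keeping the .bak copy supports the back-out plan discussed later: the original configuration can be restored if the exercise must be aborted.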
Connecting to Dependent Systems Not Having a Corresponding Disaster Recovery Environment
One of the key challenges in an end-to-end disaster recovery exercise is how to test the connecting interfaces with other applications that do not have a corresponding disaster recovery environment. For instance, as represented in figure 1, let us assume an application X, which is hosted at the disaster recovery site and needs to interface with application Y, which is hosted at a third-party site. If application Y has a corresponding disaster recovery system, then we can connect both disaster recovery systems during the exercise. Otherwise, one needs to look into using the other available environments, such as the development, test, or pre-production systems of application Y, for testing.
Flowcharts, data feeds, and architecture comparisons between production and disaster recovery would help in identifying all the components required for the successful functioning of applications in a disaster scenario. An interface architecture comparison between the production and DR environments is shown in figure 1. In this DR interface architecture drawing, since Y, a vendor application, did not have a DR environment, the DR exercise was conducted by connecting the DR environment of application X to the test environment of application Y.
Proper Sequencing of Applications and Overall RTO
A crucial challenge in most disaster recovery exercises is the proper identification and sequencing of upstream and downstream dependencies. When performing a disaster recovery exercise with a full life cycle of transactions for 20 or 30 applications, the sequencing of applications becomes very critical. The sequence should be planned properly based on the dependencies and the agreed overall Recovery Time Objective (RTO) requirements for multiple applications. Documenting all the critical interfaces for a disaster recovery scenario helps ensure proper sequencing of applications. While considering an application's dependencies, its interfaces need to be analyzed for the business requirements of the data and the frequency at which they run.
Figure 2 illustrates the resulting application dependency analysis diagram. As illustrated in this diagram, D1 is the application that needs to be brought up first in the DR environment, before bringing up the DR application X. This is because D1 provides critical input data to X, without which X cannot function appropriately. In most cases, inbound interfaces that feed data to applications need to be brought up first at the DR site. In this example, the applications marked D1, D2, D3, and D4 are brought up first, after which application X is brought up as D5. Under this scenario, the RTO for application X (D5) depends on the RTOs of the four dependent applications (D1, D2, D3, and D4), and this overall RTO should also meet the business requirements. Applications for outbound interfaces are brought up subsequently. Applications can also be brought up in parallel instead of in sequence, as business requirements allow.
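This dependency-driven sequencing and the overall-RTO reasoning can be sketched with a topological sort. The dependency graph and per-application recovery times below are illustrative assumptions modeled loosely on the figure, not actual values:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Assumed dependencies: X needs feeds from D1-D4, and D2 itself needs D1.
depends_on = {"X": {"D1", "D2", "D3", "D4"}, "D2": {"D1"}}
recovery_minutes = {"D1": 60, "D2": 45, "D3": 30, "D4": 30, "X": 90}

ts = TopologicalSorter(depends_on)
ts.prepare()
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())   # apps in one batch can be recovered in parallel
    batches.append(ready)
    ts.done(*ready)

# Overall RTO for X is the longest dependency chain, not the sum of all apps.
finish = {}
for batch in batches:
    for app in batch:
        start = max((finish[d] for d in depends_on.get(app, ())), default=0)
        finish[app] = start + recovery_minutes[app]

print(batches)       # [['D1', 'D3', 'D4'], ['D2'], ['X']]
print(finish["X"])   # 195 minutes along the D1 -> D2 -> X critical path
```

The batch output makes explicit which applications can be recovered in parallel, and the finish time for X shows why its overall RTO is governed by the longest dependency chain rather than by any single host's recovery time.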
Thorough Preparation and Coordination
In disaster recovery exercises, one can tend to skip the proper sequence of exercises or overlook its importance. But on the road to an end-to-end disaster recovery exercise, it is crucial to follow the proper sequence of testing: walk-through, simulation, parallel, and then full-interruption exercises.
A walk-through and a simulation test are required first among the various participating teams (network/firewall, server, database, middleware, and various applications) to ensure that everyone knows what the scenario is, who needs to do what, and what the sequence is. These tests bring out the potential risks to the production environment during the DR exercise, as well as coordination and sequencing issues in the recovery procedures.
Thorough preparation and coordination involving a great deal of planning, involvement from all the participating teams, and “mini” tests of all the subcomponents would result in identifying most of the potential issues before they occur, and in eliminating most of the human errors.
Ensuring a Back-Out Plan and Data Replication During the Exercise
One needs to ensure that an appropriate process is in place for a solid back-out plan (a restore point prior to test start) and for aborting the exercise in the event of an anomaly or a critical business need during the exercise.
One also needs to ensure that data replication to the disaster recovery site is not stopped while testing, if a continuous replication process is in place. As shown in figure 3, if storage-array-based Storage Area Network (SAN) replication is used, point-in-time copies of data can be presented to hosts for the disaster recovery exercise, instead of attaching the SAN directly to the hosts. This way, data replication need not be stopped during the exercise.
In the figure above, there is continuous data replication from the local data center, which is the primary site, to the remote site, even during the DR exercise; testers work against the point-in-time copies of the data. Also in the figure, as a best practice, a point-in-time copy and backup are taken at the primary site, which can help in resolving any major issues due to data corruption there. In storage-array-based replication, there is a risk that when the data on the primary site SAN is corrupted, the secondary site SAN will hold the corrupted data as well, and both will become unusable. One should consider this risk and design the DR replication solution accordingly.
Simplified and Automated Recovery Procedures to Reduce Testing Team Dependencies
Traditional recovery testing teams consist of several groups, such as the operating system, database, middleware, networking, storage, and data center operations teams. Since multiple teams (about eight or nine) are involved in testing, scheduling the test becomes more complicated. Also, if a high-priority production issue arises during the test, testers may need to leave in the middle of testing, since in most cases they also support the production environment. To reduce scheduling complexity and avoid interruptions, it is recommended to reduce these dependencies to a minimum and designate a DR tester who can run all the recovery steps alone, contacting the respective database, network, storage, or OS administrators only when there is an issue. The important aspect of this arrangement is that the recovery steps should be documented in such a way that they can be understood and executed by a normal (L4) help desk level person who does not have high-level administrator (L2/L1) skills.
Beyond testing, in an actual disaster there is tremendous pressure and stress to get everything back up, running, and available to users. In manual processes, mistakes will be made for a variety of reasons. Thus, it is suggested to automate the recovery process as much as possible. Simplified and automated disaster recovery processes eliminate unnecessary delays and manual errors during recovery.
In most cases, simple scripts can help in reducing the recovery time considerably and in avoiding human error and dependency on skilled administrators during disasters. For example, scripts can be used for activities like mounting or unmounting the disk groups or changing the hard-coded host name in configuration parameters to point to the DR host name, etc.
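A sketch of such a simplified runbook, executable by a single DR tester and stopping at the first failure for escalation, might look like the following. The step names and commands are placeholders; in practice each command would invoke the site's actual mount, configuration, and service-start scripts:

```python
import subprocess
import sys

# Hypothetical ordered runbook; these commands stand in for real recovery scripts.
RUNBOOK = [
    ("Mount DR disk groups", [sys.executable, "-c", "print('disk groups mounted')"]),
    ("Repoint configs to DR hosts", [sys.executable, "-c", "print('configs updated')"]),
    ("Start application services", [sys.executable, "-c", "print('services started')"]),
]

def run_recovery(runbook=RUNBOOK):
    """Run each documented step in order; stop and escalate on the first failure."""
    for name, cmd in runbook:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED: {name} -- escalate to the relevant administrator")
            return False
        print(f"OK: {name}")
    return True

run_recovery()
```

Because each step either succeeds or halts the run with a clear escalation point, a help-desk-level tester can drive the recovery and involve specialist administrators only when a step fails.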
Growing changes in technology and business models demand business processes that rely heavily on complex, interdependent applications; an organization should therefore attempt process-oriented, “end-to-end” disaster recovery exercises for these applications instead of traditional server-centric exercises. Although there are many challenges in performing an end-to-end disaster recovery exercise, they can be overcome by thoroughly analyzing the interdependencies and by bringing up the dependent applications in the appropriate sequence. An end-to-end disaster recovery exercise is the only way to effectively build stakeholder confidence in the recoverability of the disaster recovery environment and to understand the realistic RTO in a site-level outage where multiple applications are impacted.
1. Klaus Schmidt. High Availability and Disaster Recovery: Concepts, Design, Implementation. Springer, 2006. (A complete disaster recovery strategy begins with the hardware, operating system, and network, and continues through to the applications; a structured approach with an end-to-end view is required to understand and incorporate all the interdependencies.)
2. Dana French. “Naming Standards for Business Continuity: Best Practices and Definitions.” http://www.ibm.com/developerworks/aix/library/au-name_standards/index.html
3. Stephanie Balaouras. “Enterprises Are Realistic About Site Separation.” Forrester Trends, June 5, 2006. http://www.forrester.com/rb/Research/enterprises_are_realistic_about_site_separation/q/id/39562/t/2
4. Bill Peldzus. “Real DR Testing.” Storage Magazine, July 2006. http://searchstorage.techtarget.com/magazineFeature/0,296894,sid5_gci1258301_idx3,00.html
5. Herco van Brug. “Data & System Availability.” March 20, 2009. http://www.virtuall.eu/data-a-system-availability
6. Greg Schulz. “Automated Recovery.” Storage Magazine, August 2006. http://searchstorage.techtarget.com/magazineFeature/0,296894,sid5_gci1258306,00.html
Shankar Subramaniyan, CISSP, ABCP, has over 13 years of experience as a technology consulting and project management executive in the areas of IT Governance, Risk and Compliance (GRC) and Business Continuity Planning. He is a certified professional with hands-on experience implementing disaster recovery solutions. He has implemented and managed Information Security Management Systems (ISMS) based on industry standards such as ISO 27001 and ITIL, and has worked extensively on various compliance requirements such as SOX and PCI.
As a manager at Wipro Technologies, Shankar is responsible for providing information security solutions in the Governance, Risk and Compliance (GRC) and Disaster Recovery (DR) domains to leading clients spread across different geographies. (email: firstname.lastname@example.org)