Tuesday, March 25, 2014

CISSP - Disaster Recovery and Business Continuity

Business Continuity and Disaster Recovery Overview

The goal of disaster recovery is to minimize the effects of a disaster or disruption. It means taking the necessary steps to ensure that the resources, personnel, and business processes are able to resume operation in a timely manner. This is different from continuity planning, which provides methods and procedures for dealing with longer-term outages and disasters. The goal of a disaster recovery plan is to handle the disaster and its ramifications right after the disaster hits; the disaster recovery plan is usually very information technology (IT)–focused.

A disaster recovery plan (DRP) is carried out when everything is still in emergency mode, and everyone is scrambling to get all critical systems back online. A business continuity plan (BCP) takes a broader approach to the problem. It can include getting critical systems to another environment while repair of the original facilities is under way, getting the right people to the right places during this time, and performing business in a different mode until regular conditions are back in place. It also involves dealing with customers, partners, and shareholders through different channels until everything returns to normal.

In most situations the company is purely focused on getting back up and running, thus focusing on functionality. If security is not integrated and implemented properly, the effects of the physical disaster can be amplified as hackers come in and steal sensitive information.

Business continuity and disaster recovery planning is an organization’s last line of defense. When all other controls have failed, BCP/DRP is the final control that may prevent drastic events such as injury, loss of life, or failure of an organization.

An additional benefit of BCP/DRP is that an organization that forms a business continuity team and conducts a thorough BCP/DRP process is forced to view the organization’s critical processes and assets in a different, often clarifying light. Critical assets must be identified and key business processes understood. Risk analysis conducted during a BCP/DRP plan can lead to immediate mitigating steps.

The business continuity plan is an umbrella plan that includes multiple specific plans, most importantly the disaster recovery plan.

One point that can often be overlooked when focusing on disasters and their associated recovery is to ensure that personnel safety remains the top priority.

Disruptive events and disasters that justify the preparation of a BCP and DRP can be summarized as follows:
  • Human errors and omissions
  • Natural disasters
  • Electrical and power problems
  • Temperature and humidity failure
  • Warfare, terrorism and sabotage
  • Financially motivated attackers
  • Personnel shortages and unavailabilities
  • Pandemics and diseases
  • Strikes
  • Communication failures

DRP/BCP Preparation

Steps to prepare a BCP/DRP are:

  • Project Initiation
  • Scope the Project
  • Business Impact Analysis
  • Identify Preventive Controls
  • Recovery Strategy
  • Plan Design and Development
  • Implementation, Training, and Testing
  • BCP/DRP Maintenance

Project Initiation

  1. Develop the continuity planning policy statement. Write a policy that provides the guidance to develop a BCP, and that assigns authority to the necessary roles to carry out the tasks.
  2. Conduct the business impact analysis (BIA). Identify critical functions and systems and allow the organization to prioritize them based on necessity. Identify vulnerabilities and threats, and calculate risks.
  3. Identify preventive controls. Once threats are recognized, identify and implement controls and countermeasures to reduce the organization’s risk level in an economical manner.
  4. Develop recovery strategies. Formulate methods to ensure systems and critical functions can be brought online quickly.
  5. Develop the contingency plan. Write procedures and guidelines for how the organization can still stay functional in a crippled state.
  6. Test the plan and conduct training and exercises. Test the plan to identify deficiencies in the BCP, and conduct training to properly prepare individuals on their expected tasks.
  7. Maintain the plan. Put in place steps to ensure the BCP is a living document that is updated regularly.


The most critical part of establishing and maintaining a current continuity plan is management support. Management must be convinced of the necessity of such a plan. Therefore, a business case must be made to obtain this support. The business case may include current vulnerabilities, regulatory and legal obligations, the current status of recovery plans, and recommendations. Management is mostly concerned with cost/benefit issues, so preliminary numbers need to be gathered and potential losses estimated. A cost/benefit analysis should include shareholder, stakeholder, regulatory, and legislative impacts, as well as those on products, services, and personnel.

BCP/DRP project manager

The BCP/DRP project manager is the key point of contact (POC) for ensuring that a BCP/DRP not only is completed but also is routinely tested. This person needs to have business skills, to be extremely competent, and to be knowledgeable with regard to the organization and its mission, in addition to being a good manager and leader in case there is an event that causes the BCP or DRP to be implemented. In most cases, the project manager is the POC for every person within the organization during a crisis.

BCP/DRP team

The BCP/DRP team is comprised of those personnel who will have responsibilities if or when an emergency occurs. Before identification of the BCP/DRP personnel can take place, the continuity planning project team (CPPT) must be assembled. The CPPT is comprised of stakeholders within an organization and focuses on identifying who would need to play a role if a specific emergency event were to occur. This includes people from the HR section, public relations (PR), IT staff, physical security, line managers, essential personnel for full business effectiveness, and anyone else responsible for essential functions.

The people who develop the BCP should also be the ones who execute it. (If you knew that in a time of crisis you would be expected to carry out some critical tasks, you might pay more attention during the planning and testing phases.)

The BCP policy supplies the framework for and governance of designing and building the BCP effort. The policy helps the organization understand the importance of BCP by outlining BCP’s purpose. It provides an overview of the principles of the organization and those behind BCP, and the context for how the BCP team will proceed.

Scope of the Project

There are a number of questions to be asked and answered. For instance, is the team supposed to develop a BCP for just one facility or for more than one facility? Is the plan supposed to cover just large potential threats (hurricanes, tornadoes, floods) or deal with smaller issues as well (loss of a communications line, power failure, Internet connection failure)? Should the plan address possible terrorist attacks and other manmade threats? What is the threat profile of the company? Then there’s resources—what personnel, time allocation, and funds is management willing to commit to the BCP program overall?

In short, the scope of the project is the answer to these and other questions. Senior executives, not BCP managers and planners, should make these kinds of decisions.

Conduct Business Impact Analysis (BIA)

The primary goal of the BIA is to determine the maximum tolerable downtime (MTD) for a specific IT asset. This will directly impact what disaster recovery solution is chosen.

The BIA is comprised of two processes: identification of critical assets, and comprehensive risk assessment.

Critical assets can be identified using a table such as the one below.

IT Asset | User Group Affected | Business Process Affected | Business Impact
-------- | ------------------- | ------------------------- | ---------------
E-Mail System | Office Employees | Financial group communications with executive committee | Mild impact, financial group can also use public e-mail





A typical example of a DRP-oriented risk assessment is shown in the table below.

Risk Assessment Finding | Vulnerability | BIA | Mitigation
----------------------- | ------------- | --- | ----------
Servers are hosted in an unlocked room | Access to server room by unauthorized people | Potentially bring several business services down | Install PIN code based electronic lock system (Risk reduced)
Client computers lack security patches | Malware can infect computers or DoS type attack can happen | Client cannot reach ERP applications | Update OS (Risk is eliminated)

Maximum tolerable downtime (MTD) is one of the most important terms to understand. It describes the total time a system can be inoperable before an organization is severely impacted. It is the maximum time it takes to execute the reconstitution phase. Maximum tolerable downtime is comprised of two metrics: the recovery time objective (RTO), and the work recovery time (WRT).

MTD is also known as maximum allowable downtime (MAD), maximum tolerable outage (MTO), and maximum acceptable outage (MAO).

The recovery point objective (RPO) is the amount of data loss or system inaccessibility (measured in time) that an organization can withstand.

The recovery time objective (RTO) describes the maximum time allowed to recover business or IT systems. RTO is also called the systems recovery time.

Work recovery time (WRT) describes the time required to configure a recovered system.

Downtime (MTD) consists of two elements, the systems recovery time and the work recovery time. 

Therefore, MTD = RTO + WRT
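As a rough illustration of the formula above, the following Python sketch adds an RTO and a WRT to obtain the MTD, and checks whether a recovery estimate fits inside a given MTD. The function names and the 4-hour and 2-hour figures are assumptions made up for the example, not values from any standard.

```python
from datetime import timedelta


def max_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD = RTO + WRT, as defined in the text above."""
    return rto + wrt


def recovery_fits(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True when recovering and reconfiguring the system stays within the MTD."""
    return rto + wrt <= mtd


# Hypothetical figures for a single IT asset.
rto = timedelta(hours=4)   # time to recover the system
wrt = timedelta(hours=2)   # time to configure the recovered system

print(max_tolerable_downtime(rto, wrt))               # 6:00:00
print(recovery_fits(rto, wrt, timedelta(hours=5)))    # False
```

If the estimated RTO plus WRT exceeds the MTD agreed during the BIA, either the recovery solution or the MTD has to be revisited.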

Mean time between failures (MTBF) quantifies how long a new or repaired system will run before failing. It is typically generated by a component vendor and is largely applicable to hardware as opposed to applications and software.

The mean time to repair (MTTR) describes how long it will take to recover a specific failed system. It is the best estimate for reconstituting the IT system so that business continuity may occur.

Minimum operating requirements (MOR) describe the minimum environmental and connectivity requirements in order to operate computer equipment.

Identify Preventive Controls

One of the important advantages of BCP/DRP preparation is the early detection of vulnerabilities that can be eliminated by applying simple preventive controls. Applying these controls helps the DRP team focus on the critical areas.

Recovery Strategy

Based on the parameters defined during the BIA phase, such as MTD, RTO, RPO, and MTTR, a suitable recovery strategy can be defined for the organization.

The recovery strategy must take supply chain management, telecommunication management, and utility management into account during the decision phase. In many disaster recovery efforts, procuring systems and other equipment, building a new system room from scratch, and providing connectivity to DR sites take longer than usual. A strategy that depends on these steps after a disaster, such as a cold site, can therefore be very risky.

The different types of strategies, as a function of cost and provided availability, can be seen in the scheme.

Recovery Strategies

Redundant site

A redundant site is an exact production duplicate of a system that has the capability to seamlessly operate all necessary IT operations without loss of services to the end user of the system. A redundant site receives data backups in real time so that in the event of a disaster the users of the system have no loss of data. It is a building configured exactly like the primary site and is the most expensive recovery option because it effectively more than doubles the cost of IT operations.

Hot site

A hot site is a location to which an organization may relocate following a major disruption or disaster. It is a datacenter with a raised floor, power, utilities, computer peripherals, and fully configured computers. The hot site will have all necessary hardware and critical applications data mirrored in real time. A hot site will have the capability to allow the organization to resume critical operations within a very short period of time—sometimes in less than an hour.

Warm site

A warm site has some aspects of a hot site but it will have to rely upon backup data in order to reconstitute a system after a disruption. It is a datacenter with a raised floor, power, utilities, computer peripherals, and fully configured computers.

Because of the extensive costs involved with maintaining a hot or redundant site, many organizations will elect to use a warm site recovery solution. These organizations will have to be able to withstand an MTD of at least 1 to 3 days in order to consider a warm site solution. The longer the MTD is, the less expensive the recovery solution will be.

Cold site

A cold site is the least expensive recovery solution to implement. It does not include backup copies of data, nor does it contain any immediately available hardware. After a disruptive event, a cold site will take the longest amount of time of all recovery solutions to implement and restore critical IT services for the organization. Organizations using a cold site recovery solution will have to be able to withstand a significantly long MTD—usually measured in weeks, not days.
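The trade-off between MTD and cost across the redundant, hot, warm, and cold sites can be sketched as a simple lookup that returns the least expensive site type able to meet a given MTD. The thresholds below (1 hour, 1 day, 1 week) are illustrative assumptions loosely based on the figures in the descriptions above; real thresholds depend on the organization and the provider.

```python
from datetime import timedelta

# (site type, shortest MTD it can realistically serve), ordered from the
# least to the most expensive option. Thresholds are assumptions.
SITE_OPTIONS = [
    ("cold site", timedelta(weeks=1)),
    ("warm site", timedelta(days=1)),
    ("hot site", timedelta(hours=1)),
    ("redundant site", timedelta(0)),
]


def cheapest_site_for(mtd: timedelta) -> str:
    """Return the least expensive site type whose recovery time fits the MTD."""
    for name, shortest_mtd in SITE_OPTIONS:
        if mtd >= shortest_mtd:
            return name
    raise ValueError("MTD must not be negative")


print(cheapest_site_for(timedelta(days=2)))      # warm site
print(cheapest_site_for(timedelta(minutes=30)))  # redundant site
```

Because the list is ordered by cost, a longer MTD always resolves to an equal or cheaper option.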

Reciprocal agreement

Reciprocal agreements are bidirectional agreements between two organizations in which one organization promises another organization that it can move in and share space if it experiences a disaster. It is documented in the form of a contract written to gain support from outside organizations in the event of a disaster. They are also referred to as mutual aid agreements (MAA), and they are structured so that each organization will assist the other in the event of an emergency.

Mobile sites

Mobile sites are “datacenters on wheels,” towable trailers that contain racks of computer equipment, as well as HVAC, fire suppression, and physical security. They are a good fit for disasters such as a datacenter flood, where the datacenter is damaged but the rest of the facility and surrounding property are intact.

Subscription services

Some organizations outsource their BCP/DRP planning and/or implementation by paying another company to perform those services. This effectively transfers the risk to the insurer company.

Related Plans

Continuity of operations plan (COOP)

The continuity of operations plan (COOP) describes the procedures required to maintain operations during a disaster. This includes transfer of personnel to an alternative disaster recovery site, and operations of that site.

Business recovery plan

The business recovery plan (BRP), also known as the business resumption plan, details the steps required to restore normal business operations after recovering from a disruptive event. This may include switching operations from an alternative site back to a (repaired) primary site. The business recovery plan picks up when the COOP is complete.

Continuity of support plan

The continuity of support plan focuses narrowly on support of specific IT systems and applications. It is also called the IT contingency plan.

Cyber incident response plan

The cyber incident response plan (CIRP) is designed to respond to disruptive cyber events, including network-based attacks, worms, computer viruses, Trojan horses, etc.

Occupant emergency plan

The occupant emergency plan (OEP) provides the “response procedures for occupants of a facility in the event of a situation posing a potential threat to the health and safety of personnel, the environment, or property.”

Crisis management plan

The crisis management plan (CMP) is designed to provide effective coordination among the managers of the organization in the event of an emergency or disruptive event. The CMP details the actions management must take to ensure that life and safety of personnel and property are immediately protected in case of a disaster.

Crisis communications plan

A critical component of the crisis management plan is the crisis communications plan, which governs communication with staff and the public during a disruptive event. All communication with the public should be channeled via senior management or the public relations team.

Call trees

A key tool leveraged for staff communication by the crisis communications plan is the call tree, which is used to quickly communicate news throughout an organization without overburdening any specific person. The call tree works by assigning each employee a small number of other employees they are responsible for calling in an emergency event. The call tree continues until all affected personnel have been contacted.
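A call tree can be modeled as a mapping from each employee to the few colleagues they are responsible for calling. The sketch below is one possible way to build and walk such a tree. The names, the `fanout` parameter, and the assignment by list order are assumptions for illustration only.

```python
from collections import deque


def build_call_tree(staff, fanout=3):
    """Assign each person at most `fanout` colleagues to call, in list order."""
    tree = {person: [] for person in staff}
    for i, person in enumerate(staff[1:], start=1):
        caller = staff[(i - 1) // fanout]
        tree[caller].append(person)
    return tree


def call_order(tree, root):
    """Breadth-first walk of the tree: the order in which people are reached."""
    order, queue = [], deque([root])
    while queue:
        person = queue.popleft()
        order.append(person)
        queue.extend(tree[person])
    return order


# Hypothetical staff list; the first entry starts the call chain.
staff = ["Ana", "Ben", "Cy", "Di", "Ed", "Flo", "Gus"]
tree = build_call_tree(staff, fanout=2)
print(tree["Ana"])              # ['Ben', 'Cy']
print(call_order(tree, "Ana"))  # ['Ana', 'Ben', 'Cy', 'Di', 'Ed', 'Flo', 'Gus']
```

With a fanout of 2, nobody makes more than two calls, yet all seven people are reached in three rounds.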

Automated call trees

Automated call trees automatically contact all BCP/DRP team members after a disruptive event. Third-party BCP/DRP service providers may provide this service. The automated tree is populated with team members’ primary phone, cellular phone, pager, email, and/or fax.

Executive succession planning

Organizations must ensure that there is always an executive available to make decisions during a disaster. Executive succession planning determines an organization’s line of succession. Executives may become unavailable due to a variety of disasters, ranging from injury and loss of life to strikes, travel restrictions, and medical quarantines.

Backups and Availability

Backup methods are discussed in more detail in the Operations Security domain, but some concepts deserve mention here.

Hard Copy

After evaluating the BIA, some organizations may choose to rely on hard copies, meaning that during the disaster recovery period they continue their business operations on paper.

Tape rotation methods

A common tape rotation method is first-in, first-out (FIFO). Assume you are performing full daily backups and have 14 rewritable tapes total. FIFO means that you will use each tape in order and cycle back to the first tape after the 14th is used. This ensures that 14 days of data are archived. The downside of this plan is that you only maintain 14 days of data.

Grandfather–father–son (GFS) addresses this problem. There are 3 sets of tapes: 7 daily tapes (the son), 4 weekly tapes (the father), and 12 monthly tapes (the grandfather). Once per week a son tape graduates to father. Once every 5 weeks, a father tape graduates to grandfather. After running for a year, this method ensures there are daily backup tapes available for the past 7 days, weekly tapes for the past 4 weeks, and monthly tapes for the past 12 months.
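Both rotation schemes can be sketched in a few lines of Python. The FIFO function follows the 14-tape example directly. The GFS function is a simplified, calendar-based variant: it assumes the weekly (father) tape is the Sunday backup and the monthly (grandfather) tape is the backup from the first day of each month, which is an assumption of this sketch rather than the graduation schedule described above.

```python
from datetime import date, timedelta


def fifo_tape(day_number, tape_count=14):
    """1-based tape to use on a given 1-based backup day (FIFO rotation)."""
    return (day_number - 1) % tape_count + 1


def gfs_retained(today, dailies=7, weeklies=4, monthlies=12):
    """Backup dates still held on tape under a simplified GFS scheme."""
    sons = [today - timedelta(days=n) for n in range(dailies)]

    # Assumption: the father tape is the most recent Sunday backup.
    last_sunday = today - timedelta(days=(today.weekday() + 1) % 7)
    fathers = [last_sunday - timedelta(weeks=n) for n in range(weeklies)]

    # Assumption: the grandfather tape is the backup from the 1st of the month.
    grandfathers = []
    year, month = today.year, today.month
    for _ in range(monthlies):
        grandfathers.append(date(year, month, 1))
        year, month = (year - 1, 12) if month == 1 else (year, month - 1)

    return {"son": sons, "father": fathers, "grandfather": grandfathers}


print(fifo_tape(15))                    # 1 (cycles back after the 14th tape)
kept = gfs_retained(date(2014, 3, 25))
print(kept["father"][0])                # 2014-03-23
print(kept["grandfather"][-1])          # 2013-04-01
```

The FIFO scheme never reaches back further than 14 days, while the GFS sketch with 23 tapes reaches back almost a year, at the cost of coarser granularity for older data.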

Remote journaling

A database journal contains a log of all database transactions. Journals may be used to recover from a database failure. Assume that a database checkpoint (snapshot) is saved every hour. If the database loses integrity 20 minutes after a checkpoint, it may be recovered by reverting to the checkpoint and then applying all subsequent transactions described by the database journal.
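The recovery procedure described above (revert to the checkpoint, then replay later transactions) can be sketched as follows. The in-memory dictionary standing in for the database and the `(sequence, key, value)` journal format are assumptions chosen to keep the example small. Real database journals are considerably richer.

```python
def recover(checkpoint, checkpoint_seq, journal):
    """Rebuild state from a checkpoint plus the journal entries written after it."""
    state = dict(checkpoint)  # start from the saved snapshot
    for seq, key, value in sorted(journal):
        if seq > checkpoint_seq:  # replay only transactions after the checkpoint
            state[key] = value
    return state


# Hypothetical data: the checkpoint was taken after transaction 2.
checkpoint = {"acct_a": 100, "acct_b": 50}
journal = [
    (1, "acct_a", 100),
    (2, "acct_b", 50),
    (3, "acct_a", 70),   # happened after the checkpoint
    (4, "acct_b", 80),   # happened after the checkpoint
]

print(recover(checkpoint, 2, journal))  # {'acct_a': 70, 'acct_b': 80}
```

When the journal is copied to a remote site as it is written, the same replay can be carried out there even if the primary site is lost.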

Database shadowing

Database shadowing uses two or more identical databases that are updated simultaneously. The shadow databases can exist locally, but it is best practice to host one shadow database offsite. The goal of database shadowing is to greatly reduce the recovery time for a database implementation. Database shadowing allows faster recovery when compared with remote journaling.
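The idea of updating identical databases simultaneously can be illustrated with a toy class that applies every write to all replicas and promotes a shadow when the primary fails. The class name, its methods, and the use of dictionaries as databases are assumptions for illustration. Production shadowing is handled by the database system itself.

```python
class ShadowedDatabase:
    """Toy model: every write goes to the primary and to each shadow copy."""

    def __init__(self, replica_count=2):
        # replicas[0] is the primary, the rest are shadows.
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        return self.replicas[0][key]

    def fail_over(self):
        """Discard the failed primary and promote the next shadow."""
        if len(self.replicas) < 2:
            raise RuntimeError("no shadow database left to promote")
        self.replicas.pop(0)


db = ShadowedDatabase()
db.write("order_42", "shipped")
db.fail_over()               # primary lost, shadow takes over
print(db.read("order_42"))   # shipped
```

Because the shadow already holds the current data, no journal replay is needed after failover, which is why recovery is faster than with remote journaling.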

Software escrow

Vendors who have developed products on behalf of other organizations might well have intellectual property concerns about disclosing the source code of their applications to customers. A common middle ground between these two entities is for the application development company to allow a neutral third party to hold the source code. This approach is known as software escrow. If the development organization goes out of business or otherwise violates the terms of the software escrow agreement, the third party holding the escrow will provide the source code and other information to the purchasing organization.

DRP Testing, Training and Awareness

There are some important concepts that should be known about DRP testing.

DRP review

The DRP review is the most basic form of initial DRP testing, and is focused on simply reading the DRP in its entirety to ensure completeness of coverage. This review is typically performed by the team that developed the plan.

Checklist

Checklist (also known as consistency) testing lists all necessary components required for successful recovery and ensures that they are, or will be, available if a disaster occurs. The checklist test is often performed concurrently with the structured walkthrough or tabletop testing as a solid first testing threshold.

Structured walkthrough/tabletop

Another test that is commonly completed at the same time as the checklist test is that of the structured walkthrough, which is also often referred to as a tabletop exercise. The goal is to allow individuals to thoroughly review the overall approach.

Simulation test/walkthrough drill

A simulation test, also called a walkthrough drill (not to be confused with structured walkthrough), goes beyond talking about the process and actually has teams carry out the recovery process. The team must respond to a simulated disaster as directed by the DRP.

Parallel processing

This type of test is common in environments where transactional data is a key component of the critical business processing. Typically, this test involves recovering critical components at an alternative computing facility and then restoring data from a previous backup. Note that regular production systems are not interrupted. Organizations that are highly dependent upon mainframe and midrange systems will often employ this type of test.

Partial and complete business interruption

Arguably, the highest-fidelity DRP test is business interruption testing; however, this type of test can actually be the cause of a disaster, so extreme caution should be exercised before attempting an actual interruption test. The business interruption style of testing will have the organization actually stop processing normal business at the primary location and instead leverage the alternative computing facility.

DRP/BCP Maintenance

It is recommended to repeat BCP/DRP tests at least once a year. To be able to do so, all the documents mentioned so far must be kept up to date and revised by all the DRP/BCP team members. To maintain a complete record of changes, the DRP/BCP process must be tied to the organization’s change management process.

DRP/BCP Mistakes

Common BCP/DRP mistakes include:
  • Lack of management support
  • Lack of business unit involvement
  • Improper (often narrow) scope
  • Inadequate telecommunications management
  • Inadequate supply chain management
  • Lack of testing
  • Lack of training and awareness
  • Failure to keep the BCP/DRP plan up to date

Specific DRP/BCP Frameworks

NIST 800-34

ISO/IEC 27031

BS 25999

The Business Continuity Institute (BCI) 2008 Good Practice Guidelines


