Business Continuity and Disaster Recovery Overview
The goal of disaster recovery is to minimize the effects of a disaster or
disruption. It means taking the necessary steps to ensure that the resources,
personnel, and business processes are able to resume operation in a timely
manner. This is different from continuity planning, which provides methods and procedures for dealing with
longer-term outages and disasters. The goal of a disaster recovery plan is to
handle the disaster and its ramifications right after the disaster hits; the
disaster recovery plan is usually very information technology (IT)–focused.
A disaster recovery plan (DRP) is carried out when
everything is still in emergency mode, and everyone is scrambling to get all
critical systems back online. A business
continuity plan (BCP) takes a broader approach to the problem. It can
include getting critical systems to another environment while repair of the
original facilities is under way, getting the right people to the right places
during this time, and performing business in a different mode until regular
conditions are back in place. It also involves dealing with customers,
partners, and shareholders through different channels until everything returns
to normal.
In most situations the company is purely focused on getting
back up and running, thus focusing on functionality. If security is not
integrated and implemented properly, the effects of the physical disaster can
be amplified as hackers come in and steal sensitive information.
Business continuity and disaster recovery planning is an
organization’s last line of defense. When all other controls have failed,
BCP/DRP is the final control that may prevent drastic events such as injury,
loss of life, or failure of an organization.
An additional benefit of BCP/DRP is that an organization
that forms a business continuity team and conducts a thorough BCP/DRP process
is forced to view the organization’s critical processes and assets in a
different, often clarifying light. Critical assets must be identified and key
business processes understood. Risk analysis conducted during BCP/DRP planning
can lead to immediate mitigating steps.
The business continuity plan is an umbrella plan that
includes multiple specific plans, most importantly the disaster recovery plan.
One point that is often overlooked when focusing on
disasters and their associated recovery is that personnel safety must remain the top priority.
Disruptive events and disasters that justify the preparation
of a BCP and DRP can be summarized as follows:
- Human errors and omissions
- Natural disasters
- Electrical and power problems
- Temperature and humidity failure
- Warfare, terrorism and sabotage
- Financially motivated attackers
- Personnel shortages and unavailabilities
- Pandemics and diseases
- Strikes
- Communication failures
DRP/BCP Preparation
Steps to prepare a BCP/DRP are:
- Project Initiation
- Scope the Project
- Business Impact Analysis
- Identify Preventive Controls
- Recovery Strategy
- Plan Design and Development
- Implementation, Training, and Testing
- BCP/DRP Maintenance
Project Initiation
- Develop the continuity planning policy statement. Write a policy that
  provides the guidance to develop a BCP, and that assigns authority to the
  necessary roles to carry out the tasks.
- Conduct the business impact analysis (BIA). Identify critical functions and
  systems and allow the organization to prioritize them based on necessity.
  Identify vulnerabilities and threats, and calculate risks.
- Identify preventive controls. Once threats are recognized, identify and
  implement controls and countermeasures to reduce the organization’s risk
  level in an economical manner.
- Develop recovery strategies. Formulate methods to ensure systems and
  critical functions can be brought online quickly.
- Develop the contingency plan. Write procedures and guidelines for how the
  organization can still stay functional in a crippled state.
- Test the plan and conduct training and exercises. Test the plan to identify
  deficiencies in the BCP, and conduct training to properly prepare
  individuals on their expected tasks.
- Maintain the plan. Put in place steps to ensure the BCP is a living document
  that is updated regularly.
The most critical part of establishing and maintaining a
current continuity plan is management
support. Management must be convinced of the necessity of such a plan. Therefore,
a business case must be made to obtain this support. The business case
may include current vulnerabilities, regulatory and legal obligations, the
current status of recovery plans, and recommendations. Management is mostly
concerned with cost/benefit issues, so preliminary numbers need to be gathered
and potential losses estimated. A cost/benefit analysis should include
shareholder, stakeholder, regulatory, and legislative impacts, as well as those
on products, services, and personnel.
BCP/DRP project manager
The BCP/DRP project manager is the key point of contact
(POC) for ensuring that a BCP/DRP not only is completed but also is routinely
tested. This person needs to have business skills, to be extremely
competent, and to be knowledgeable with regard to the organization and its
mission, in addition to being a good manager and leader in case there is an
event that causes the BCP or DRP to be implemented. In most cases, the project
manager is the POC for every person within the organization during a crisis.
BCP/DRP team
The BCP/DRP team is comprised of those personnel who will
have responsibilities if or when an emergency occurs. Before identification of
the BCP/DRP personnel can take place, the continuity planning project team
(CPPT) must be assembled. The CPPT is comprised of stakeholders within an
organization and focuses on identifying who would need to play a role if a
specific emergency event were to occur. This includes people from the HR section,
public relations (PR), IT staff, physical security, line managers, essential personnel
for full business effectiveness, and anyone else responsible for essential
functions.
The people who develop the BCP should also be the ones who
execute it. (If you knew that in a time of crisis you would be expected to
carry out some critical tasks, you might pay more attention during the planning
and testing phases.)
The BCP policy supplies the framework for and governance of designing
and building the BCP effort. The policy helps the organization understand the
importance of BCP by outlining BCP’s purpose. It provides an overview of the
principles of the organization and those behind BCP, and the context for how
the BCP team will proceed.
Scope of the Project
There are a number of questions to be asked and answered. For
instance, is the team supposed to develop a BCP for just one facility or for
more than one facility? Is the plan supposed to cover just large potential
threats (hurricanes, tornadoes, floods) or deal with smaller issues as well (loss
of a communications line, power failure, Internet connection failure)? Should
the plan address possible terrorist attacks and other manmade threats? What is
the threat profile of the company? Then there’s resources—what personnel, time
allocation, and funds is management willing to commit to the BCP program
overall?
In short, the scope of the project is the answer to these and
other questions. Senior executives, not
BCP managers and planners, should make these kinds of decisions.
Conduct Business Impact Analysis (BIA)
The primary goal of the
BIA is to determine the maximum tolerable downtime (MTD) for a specific IT asset.
This will directly impact what disaster recovery solution is chosen.
The BIA is comprised of
two processes: identification of
critical assets, and comprehensive
risk assessment.
Critical assets can be identified using a table such as the one below.
IT Asset | User Group Affected | Business Process Affected | Business Impact
E-Mail System | Office Employees | Financial group communications with executive committee | Mild impact, financial group can also use public e-mail
A typical example of a DRP-focused risk assessment is shown in the table below.
Risk Assessment Finding | Vulnerability | BIA | Mitigation
Servers are hosted in an unlocked room | Access to server room by unauthorized people | Potentially bring several business services down | Install PIN code based electronic lock system (Risk reduced)
Client computers lack security patches | Malware can infect computers or DoS type attack can happen | Client cannot reach ERP applications | Update OS (Risk is eliminated)
Maximum tolerable downtime (MTD) is one of the most important terms to
understand. It describes the total time a system can be inoperable before an
organization is severely impacted. It is the maximum time it takes to execute
the reconstitution phase. Maximum tolerable downtime is comprised of two
metrics: the recovery time objective (RTO), and the work recovery time (WRT).
MTD is also known as maximum allowable downtime (MAD),
maximum tolerable outage (MTO), and maximum acceptable outage (MAO).
The recovery point
objective (RPO) is the amount of data loss or system inaccessibility (measured
in time) that an organization can withstand.
The recovery time
objective (RTO) describes the maximum time allowed to recover business or
IT systems. RTO is also called the systems recovery time.
Work recovery time
(WRT) describes the time required to configure a recovered system.
MTD therefore consists of two elements, the systems recovery time (RTO)
and the work recovery time (WRT):
MTD = RTO + WRT
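As a worked example of the relationship above, the following minimal Python
sketch adds a recovery time objective and a work recovery time and compares the
total with a maximum tolerable downtime. The function names and the hour values
are illustrative assumptions, not figures from a real BIA.

```python
from datetime import timedelta


def total_recovery_time(rto: timedelta, wrt: timedelta) -> timedelta:
    """Return RTO + WRT, the total downtime the recovery is planned to take."""
    return rto + wrt


def fits_within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True if the planned recovery (RTO + WRT) does not exceed the MTD."""
    return total_recovery_time(rto, wrt) <= mtd


# Hypothetical values for one IT asset.
rto = timedelta(hours=8)    # time allowed to bring the system back up
wrt = timedelta(hours=4)    # time needed to configure the recovered system
mtd = timedelta(hours=12)   # downtime the business can tolerate

print(total_recovery_time(rto, wrt))
print(fits_within_mtd(rto, wrt, mtd))
```

If the sum of RTO and WRT exceeds the MTD, either the recovery strategy or the
stated tolerance needs to be revisited.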
Mean time between
failures (MTBF) quantifies how long a new or repaired system will run
before failing. It is typically generated by a component vendor and is largely applicable
to hardware as opposed to applications and software.
The mean time to
repair (MTTR) describes how long it will take to recover a specific failed
system. It is the best estimate for reconstituting the IT system so that
business continuity may occur.
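MTBF and MTTR are often combined into a steady-state availability estimate,
availability = MTBF / (MTBF + MTTR). This formula comes from general
reliability engineering rather than from the text above, and the figures in the
sketch below are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails on average every 10,000 hours
# and takes 8 hours to repair.
estimate = availability(10_000, 8)
print(f"{estimate:.4%}")
```

A longer MTTR lowers the estimate, which is one reason the recovery strategy
chosen later has a direct effect on availability.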
Minimum operating
requirements (MOR) describe the minimum environmental and connectivity
requirements in order to operate computer equipment.
Identify Preventive Controls
One important advantage of BCP/DRP preparation is
the early detection of vulnerabilities that can be eliminated by applying
simple preventive controls. Applying these controls helps the DRP team
focus on the critical areas.
Recovery Strategy
Based on the parameters defined during the BIA phase,
such as MTD, RTO, RPO, and MTTR, a suitable recovery strategy can be defined
for the organization.
The recovery strategy must take supply chain management, telecommunication
management, and utility management into account. In many disaster recovery
efforts, procuring systems and other equipment, building a new system room
from scratch, and providing connectivity to DR sites take longer than usual.
Strategies that depend on this work being done after the disaster, such as a
cold site, are therefore risky.
The strategies described below differ in cost and in the availability
they provide.
Recovery Strategies
Redundant site
A redundant site is an exact production duplicate of a
system that has the capability to seamlessly operate all necessary IT
operations without loss of services to the end user of the system. A redundant
site receives data backups in real time so that in the event of a disaster the
users of the system have no loss of data. It is a building configured exactly
like the primary site and is the most expensive recovery option
because it effectively more than doubles the cost of IT operations.
Hot site
A hot site is a location to which an organization may
relocate following a major disruption or disaster. It is a datacenter with a
raised floor, power, utilities, computer peripherals, and fully configured
computers. The hot site will have all necessary hardware and critical
applications data mirrored in real time. A hot site will have the capability to
allow the organization to resume critical operations within a very short
period of time—sometimes in less than an hour.
Warm site
A warm site has some aspects of a hot site but it will have
to rely upon backup data in order to reconstitute a system after a
disruption. It is a datacenter with a raised floor, power, utilities, computer
peripherals, and fully configured computers.
Because of the extensive costs involved with maintaining a
hot or redundant site, many organizations will elect to use a warm site
recovery solution. These organizations will have to be able to withstand an
MTD of at least 1 to 3 days in order to consider a warm site solution. The
longer the MTD is, the less expensive the recovery solution will be.
Cold site
A cold site is the least expensive recovery solution to
implement. It does not include backup copies of data, nor does it contain any
immediately available hardware. After a disruptive event, a cold site will take
the longest amount of time of all recovery solutions to implement and restore
critical IT services for the organization. Organizations using a cold site
recovery solution will have to be able to withstand a significantly long
MTD—usually measured in weeks, not days.
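The relationship between MTD and site type described above can be sketched as a
simple lookup. The thresholds below are rough assumptions based on the figures
in this section (about an hour for a hot site, one to three days for a warm
site, weeks for a cold site); real cut-offs depend on the organization and the
provider.

```python
from datetime import timedelta

# (smallest MTD the option can satisfy, option name), cheapest option first.
# The thresholds are illustrative assumptions, not fixed industry values.
SITE_OPTIONS = [
    (timedelta(weeks=1), "cold site"),
    (timedelta(days=1), "warm site"),
    (timedelta(hours=1), "hot site"),
]


def cheapest_site_for(mtd: timedelta) -> str:
    """Pick the least expensive site type whose recovery time fits the MTD."""
    for minimum_mtd, name in SITE_OPTIONS:
        if mtd >= minimum_mtd:
            return name
    # Anything tighter than the hot-site threshold needs seamless failover.
    return "redundant site"


print(cheapest_site_for(timedelta(days=2)))     # warm site
print(cheapest_site_for(timedelta(minutes=5)))  # redundant site
```

The list is ordered from cheapest to most expensive so that the first match is
also the lowest-cost option that still meets the MTD.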
Reciprocal agreement
A reciprocal agreement is a bidirectional agreement between
two organizations in which each organization promises the other that
it can move in and share space if it experiences a disaster. The agreement is
documented as a contract written to gain support from an outside organization in
the event of a disaster. Reciprocal agreements are also referred to as mutual aid
agreements (MAA), and they are structured so that each
organization will assist the other in the event of an emergency.
Mobile sites
Mobile sites are “datacenters on wheels,” towable trailers
that contain racks of computer equipment, as well as HVAC, fire suppression,
and physical security. They are a good fit for disasters such as a datacenter
flood, where the datacenter is damaged but the rest of the facility and
surrounding property are intact.
Subscription services
Some organizations outsource their BCP/DRP planning and/or
implementation by paying another company to perform those services. This
effectively transfers the risk to the company providing the service.
Related Plans
Continuity of operations plan (COOP)
The continuity of operations plan (COOP) describes the procedures
required to maintain operations during a disaster. This includes transfer
of personnel to an alternative disaster recovery site, and operations of that
site.
Business recovery plan
The business recovery plan (BRP), also known as the business resumption plan, details the
steps required to restore normal business operations after recovering from a disruptive
event. This may include switching operations from an alternative site back to a
(repaired) primary site. The business recovery plan picks up when the COOP is complete.
Continuity of support plan
The continuity of support plan focuses narrowly on support
of specific IT systems and applications. It is also called the IT contingency
plan.
Cyber incident response plan
The cyber incident response plan (CIRP) is designed to
respond to disruptive cyber events, including network-based attacks, worms,
computer viruses, Trojan horses, etc.
Occupant emergency plan
The occupant emergency plan (OEP) provides the “response
procedures for occupants of a facility in the event of a situation posing a
potential threat to the health and safety of personnel, the environment, or
property.”
Crisis management plan
The crisis management plan (CMP) is designed to provide effective coordination among the managers of
the organization in the event of an emergency or disruptive event. The CMP
details the actions management must take to ensure that life and safety of
personnel and property are immediately protected in case of a disaster.
Crisis communications plan
A critical component of the crisis management plan is the
crisis communications plan, which governs how the organization communicates
with staff and the public during a disruptive event. All communication with
the public should be channeled via senior management or the public relations team.
Call trees
A key tool leveraged for staff communication by the crisis
communications plan is the call tree, which is used to quickly communicate news
throughout an organization without overburdening any specific person. The call
tree works by assigning each employee a small number of other employees they
are responsible for calling in an emergency event. The call tree continues
until all affected personnel have been contacted.
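The way a call tree spreads a message can be sketched as a breadth-first
traversal. The roles and the tree layout below are hypothetical; the sketch
only shows that each person calls a small number of others until everyone has
been reached.

```python
from collections import deque

# Hypothetical call tree: each person maps to the people they must call.
CALL_TREE = {
    "crisis manager": ["it lead", "hr lead"],
    "it lead": ["sysadmin", "dba"],
    "hr lead": ["payroll"],
}


def call_order(tree: dict[str, list[str]], root: str) -> list[tuple[str, str]]:
    """Return (caller, callee) pairs in the order the calls would be made."""
    calls = []
    queue = deque([root])
    contacted = {root}
    while queue:
        caller = queue.popleft()
        for callee in tree.get(caller, []):
            if callee not in contacted:   # never call the same person twice
                contacted.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


for caller, callee in call_order(CALL_TREE, "crisis manager"):
    print(f"{caller} calls {callee}")
```

In this example nobody makes more than two calls, yet all five remaining staff
members are reached.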
Automated call trees
Automated call trees automatically contact all BCP/DRP team
members after a disruptive event. Third-party BCP/DRP service providers may
provide this service. The automated tree is populated with team members’
primary phone, cellular phone, pager, email, and/or fax.
Executive succession planning
Organizations must ensure that there is always an executive
available to make decisions during a disaster. Executive succession planning
determines an organization’s line of succession. Executives may become
unavailable due to a variety of disasters, ranging from injury and loss of life
to strikes, travel restrictions, and medical quarantines.
Backups and Availability
In addition to the methods discussed in more detail
in the Operations Security domain, a few concepts deserve mention here.
Hard Copy
After evaluating the BIA, some organizations may choose
to rely on hard copies, meaning that during the disaster recovery period they
continue their business operations on paper.
Tape rotation methods
A common tape rotation method is first-in, first-out (FIFO). Assume you are performing full daily
backups and have 14 rewritable tapes total. FIFO means that you will use each
tape in order and cycle back to the first tape after the 14th is used. This
ensures that 14 days of data are archived. The downside of this plan is that
you only maintain 14 days of data.
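The FIFO rotation above amounts to modular arithmetic on the day number. The
function name below is an assumption made for illustration.

```python
def fifo_tape_for_day(day: int, tape_count: int = 14) -> int:
    """Return the 1-based tape number used on a given 1-based backup day."""
    if day < 1 or tape_count < 1:
        raise ValueError("day and tape_count must be at least 1")
    return (day - 1) % tape_count + 1


# Days 1..14 use tapes 1..14; day 15 overwrites tape 1 again.
for day in (1, 14, 15, 30):
    print(f"day {day}: tape {fifo_tape_for_day(day)}")
```

From day 15 onward each new backup overwrites the oldest tape, which is why no
more than 14 days of history survive.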
Grandfather–father–son
(GFS) addresses this problem. There are 3 sets of tapes: 7 daily tapes (the
son), 4 weekly tapes (the father), and 12 monthly tapes (the grandfather). Once
per week a son tape graduates to father. Once every 5 weeks, a father tape
graduates to grandfather. After running for a year, this method ensures there
are daily backup tapes available for the past 7 days, weekly tapes for the past 4
weeks, and monthly tapes for the past 12 months.
Remote journaling
A database journal contains a log of all database
transactions. Journals may be used to recover from a database failure. Assume
that a database checkpoint (snapshot) is saved every hour. If the database
loses integrity 20 minutes after a checkpoint, it may be recovered by reverting
to the checkpoint and then applying all subsequent transactions described by
the database journal.
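The recovery described above, reverting to a checkpoint and replaying the
journal, can be sketched in a few lines. The data model (account balances and
timestamped deltas) is a hypothetical simplification, not the format of any
particular database.

```python
import copy


def recover(checkpoint: dict, journal: list[tuple[int, str, int]],
            checkpoint_time: int) -> dict:
    """Rebuild state from a checkpoint plus the journal entries after it.

    Each journal entry is (timestamp, account, amount_delta).
    """
    state = copy.deepcopy(checkpoint)      # never mutate the saved snapshot
    for timestamp, account, delta in sorted(journal):
        if timestamp > checkpoint_time:    # earlier entries are in the snapshot
            state[account] = state.get(account, 0) + delta
    return state


# Hypothetical data: snapshot taken at minute 60, failure at minute 80.
checkpoint = {"alice": 100, "bob": 50}
journal = [
    (55, "alice", 10),    # already reflected in the checkpoint
    (65, "alice", -30),
    (72, "bob", 30),
]
print(recover(checkpoint, journal, checkpoint_time=60))
```

Only the two transactions logged after the checkpoint are replayed, so the
recovered state reflects everything up to the moment of failure.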
Database shadowing
Database shadowing uses two or more identical databases that
are updated simultaneously. The shadow databases can exist locally, but it is
best practice to host one shadow database offsite. The goal of database
shadowing is to greatly reduce the recovery time for a database implementation.
Database shadowing allows faster recovery when compared with remote journaling.
Software escrow
Vendors who have developed products on behalf of other
organizations might well have intellectual property concerns about disclosing
the source code of their applications to customers. A common middle ground
between these two entities is for the application development company to allow
a neutral third party to hold the source code. This approach is known as
software escrow. If the development organization goes out of business or
otherwise violates the terms of the software escrow agreement, the third party
holding the escrow will provide the source code and other information to the
purchasing organization.
DRP Testing, Training and Awareness
There are some important concepts that should be known about
DRP testing.
DRP review
The DRP review is the most basic form of initial DRP
testing, and is focused on simply reading the DRP in its entirety to ensure
completeness of coverage. This review is typically performed by the team
that developed the plan.
Checklist
Checklist (also known as consistency) testing lists all
necessary components required for successful recovery and ensures that they
are, or will be, available if a disaster occurs. The checklist test is often
performed concurrently with the structured walkthrough or tabletop testing as a
solid first testing threshold.
Structured walkthrough/tabletop
Another test that is commonly completed at the same time as
the checklist test is that of the structured walkthrough, which is also often
referred to as a tabletop exercise. The goal is to allow individuals to
thoroughly review the overall approach.
Simulation test/walkthrough drill
A simulation test, also called a walkthrough drill (not to
be confused with structured walkthrough), goes beyond talking about the process
and actually has teams carry out the recovery process. The team must respond to
a simulated disaster as directed by the DRP.
Parallel processing
This type of test is common in environments where
transactional data is a key component of the critical business processing.
Typically, this test involves recovering critical components at an
alternative computing facility and then restoring data from a previous backup.
Note that regular production systems are not interrupted. Organizations that
are highly dependent upon mainframe and midrange systems will often employ this
type of test.
Partial and complete business interruption
Arguably, the highest-fidelity DRP test is
business interruption testing; however, this type of test can itself be the
cause of a disaster, so extreme caution should be exercised before attempting
an actual interruption test. The business interruption style of testing will
have the organization actually stop processing normal business at the primary
location and instead leverage the alternative computing facility.
DRP/BCP Maintenance
It is recommended to repeat BCP/DRP tests at least once a
year. To make this possible, all the documents mentioned so far must be kept up
to date and revised by the DRP/BCP team members. To keep a
complete record of changes, the DRP/BCP process must be tied to the organization’s
change management process.
DRP/BCP Mistakes
Common BCP/DRP mistakes include:
- Lack of management support
- Lack of business unit involvement
- Improper (often narrow) scope
- Inadequate telecommunications management
- Inadequate supply chain management
- Lack of testing
- Lack of training and awareness
- Failure to keep the BCP/DRP plan up to date
Specific DRP/BCP Frameworks
NIST 800-34
ISO/IEC 27031
BS 25999
The Business Continuity Institute (BCI) 2008 Good Practice Guidelines