Tuesday, March 25, 2014

CISSP - Disaster Recovery and Business Continuity

Business Continuity and Disaster Recovery Overview

The goal of disaster recovery is to minimize the effects of a disaster or disruption. It means taking the necessary steps to ensure that the resources, personnel, and business processes are able to resume operation in a timely manner. This is different from continuity planning, which provides methods and procedures for dealing with longer-term outages and disasters. The goal of a disaster recovery plan is to handle the disaster and its ramifications right after the disaster hits; the disaster recovery plan is usually very information technology (IT)–focused.

A disaster recovery plan (DRP) is carried out when everything is still in emergency mode, and everyone is scrambling to get all critical systems back online. A business continuity plan (BCP) takes a broader approach to the problem. It can include getting critical systems to another environment while repair of the original facilities is under way, getting the right people to the right places during this time, and performing business in a different mode until regular conditions are back in place. It also involves dealing with customers, partners, and shareholders through different channels until everything returns to normal.

In most situations the company is purely focused on getting back up and running, thus focusing on functionality. If security is not integrated and implemented properly, the effects of the physical disaster can be amplified as hackers come in and steal sensitive information.

Business continuity and disaster recovery planning is an organization’s last line of defense. When all other controls have failed, BCP/DRP is the final control that may prevent drastic events such as injury, loss of life, or failure of an organization.

An additional benefit of BCP/DRP is that an organization that forms a business continuity team and conducts a thorough BCP/DRP process is forced to view the organization’s critical processes and assets in a different, often clarifying light. Critical assets must be identified and key business processes understood. Risk analysis conducted during BCP/DRP planning can lead to immediate mitigating steps.

The business continuity plan is an umbrella plan that includes multiple specific plans, most importantly the disaster recovery plan.

One point that can often be overlooked when focusing on disasters and their associated recovery is to ensure that personnel safety remains the top priority.

Disruptive events and disasters that justify the preparation of a BCP and DRP can be summarized as follows:

Human errors and omissions
Natural disasters
Electrical and power problems
Temperature and humidity failure
Warfare, terrorism and sabotage
Financially motivated attackers
Personnel shortages and unavailabilities
Pandemics and diseases
Strikes
Communication failures

DRP/BCP Preparation

Steps to prepare a BCP/DRP are:

Project Initiation
Scope the Project
Business Impact Analysis
Identify Preventive Controls
Recovery Strategy
Plan Design and Development
Implementation, Training, and Testing
BCP/DRP Maintenance

Project Initiation

Develop the continuity planning policy statement. Write a policy that provides the guidance to develop a BCP, and that assigns authority to the necessary roles to carry out the tasks.
Conduct the business impact analysis (BIA). Identify critical functions and systems and allow the organization to prioritize them based on necessity. Identify vulnerabilities and threats, and calculate risks.
Identify preventive controls. Once threats are recognized, identify and implement controls and countermeasures to reduce the organization’s risk level in an economical manner.
Develop recovery strategies. Formulate methods to ensure systems and critical functions can be brought online quickly.
Develop the contingency plan. Write procedures and guidelines for how the organization can still stay functional in a crippled state.
Test the plan and conduct training and exercises. Test the plan to identify deficiencies in the BCP, and conduct training to properly prepare individuals on their expected tasks.
Maintain the plan. Put in place steps to ensure the BCP is a living document that is updated regularly.

The most critical part of establishing and maintaining a current continuity plan is management support. Management must be convinced of the necessity of such a plan. Therefore, a business case must be made to obtain this support. The business case may include current vulnerabilities, regulatory and legal obligations, the current status of recovery plans, and recommendations. Management is mostly concerned with cost/benefit issues, so preliminary numbers need to be gathered and potential losses estimated. A cost/benefit analysis should include shareholder, stakeholder, regulatory, and legislative impacts, as well as those on products, services, and personnel.

BCP/DRP project manager

The BCP/DRP project manager is the key point of contact (POC) for ensuring that a BCP/DRP not only is completed but also is routinely tested. This person needs to have business skills, to be extremely competent, and to be knowledgeable with regard to the organization and its mission, in addition to being a good manager and leader in case there is an event that causes the BCP or DRP to be implemented. In most cases, the project manager is the POC for every person within the organization during a crisis.

BCP/DRP team

The BCP/DRP team is comprised of those personnel who will have responsibilities if or when an emergency occurs. Before identification of the BCP/DRP personnel can take place, the continuity planning project team (CPPT) must be assembled. The CPPT is comprised of stakeholders within an organization and focuses on identifying who would need to play a role if a specific emergency event were to occur. This includes people from the HR section, public relations (PR), IT staff, physical security, line managers, essential personnel for full business effectiveness, and anyone else responsible for essential functions.

The people who develop the BCP should also be the ones who execute it. (If you knew that in a time of crisis you would be expected to carry out some critical tasks, you might pay more attention during the planning and testing phases.)

The BCP policy supplies the framework for and governance of designing and building the BCP effort. The policy helps the organization understand the importance of BCP by outlining BCP’s purpose. It provides an overview of the principles of the organization and those behind BCP, and the context for how the BCP team will proceed.

Scope of the Project

There are a number of questions to be asked and answered. For instance, is the team supposed to develop a BCP for just one facility or for more than one facility? Is the plan supposed to cover just large potential threats (hurricanes, tornadoes, floods) or deal with smaller issues as well (loss of a communications line, power failure, Internet connection failure)? Should the plan address possible terrorist attacks and other manmade threats? What is the threat profile of the company? Then there’s resources—what personnel, time allocation, and funds is management willing to commit to the BCP program overall?

Basically, the scope of the project is the answer to these and other questions. Senior executives, not BCP managers and planners, should make these kinds of decisions.

Conduct Business Impact Analysis (BIA)

The primary goal of the BIA is to determine the maximum tolerable downtime (MTD) for a specific IT asset. This will directly impact what disaster recovery solution is chosen.

The BIA is comprised of two processes: identification of critical assets, and comprehensive risk assessment.

Critical asset identification can be made by using a table as given below.

IT Asset	User Group Affected	Business Process Affected	Business Impact
E-Mail System	Office Employees	Financial group communications with executive committee	Mild impact, financial group can also use public e-mail

A typical example of a DRP-oriented risk assessment is shown in the table below.

Risk Assessment Finding	Vulnerability	BIA	Mitigation
Servers are hosted in an unlocked room	Access to server room by unauthorized people	Potentially bring several business services down	Install PIN code based electronic lock system (Risk reduced)
Client computers lack security patches	Malware can infect computers or DoS type attack can happen	Client cannot reach ERP applications	Update OS (Risk is eliminated)

Maximum Tolerable Downtime (MTD) is one of the most important terms to understand. It describes the total time a system can be inoperable before an organization is severely impacted. It is the maximum time it takes to execute the reconstitution phase. Maximum tolerable downtime is comprised of two metrics: the recovery time objective (RTO), and the work recovery time (WRT).

MTD is also known as maximum allowable downtime (MAD), maximum tolerable outage (MTO), and maximum acceptable outage (MAO).

The recovery point objective (RPO) is the amount of data loss or system inaccessibility (measured in time) that an organization can withstand.

The recovery time objective (RTO) describes the maximum time allowed to recover business or IT systems. RTO is also called the systems recovery time.

Work recovery time (WRT) describes the time required to configure a recovered system.

Downtime (MTD) consists of two elements, the systems recovery time and the work recovery time.

Therefore, MTD = RTO + WRT
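As a quick worked example of this relation, here is a minimal Python sketch. The figures (36 hours of RTO, 12 hours of WRT) and the function names are made up for illustration and do not come from any standard.

```python
from datetime import timedelta

def max_tolerable_downtime(rto: timedelta, wrt: timedelta) -> timedelta:
    """MTD = RTO + WRT, as defined above."""
    return rto + wrt

def fits_within_mtd(rto: timedelta, wrt: timedelta, mtd: timedelta) -> bool:
    """True if recovering and then configuring the system stays within the MTD."""
    return rto + wrt <= mtd

# Hypothetical figures for a single IT asset.
rto = timedelta(hours=36)   # time allowed to recover the system
wrt = timedelta(hours=12)   # time needed to configure the recovered system

mtd = max_tolerable_downtime(rto, wrt)
print(mtd.total_seconds() / 3600)                      # 48.0 (hours)
print(fits_within_mtd(rto, wrt, timedelta(days=1)))    # False
```

In other words, with these assumed numbers a solution that needs two days in total cannot satisfy a business that tolerates only one day of downtime.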

Mean time between failures (MTBF) quantifies how long a new or repaired system will run before failing. It is typically generated by a component vendor and is largely applicable to hardware as opposed to applications and software.

The mean time to repair (MTTR) describes how long it will take to recover a specific failed system. It is the best estimate for reconstituting the IT system so that business continuity may occur.

Minimum operating requirements (MOR) describe the minimum environmental and connectivity requirements in order to operate computer equipment.

Identify Preventive Controls

One of the important advantages of BCP/DRP preparation is the early detection of vulnerabilities that can be eliminated by applying simple preventive controls. Applying these controls helps the DRP team focus on the critical areas.

Recovery Strategy

Based on the parameters defined during the BIA phase, such as MTD, RTO, RPO, and MTTR, a suitable recovery strategy can be defined for the organization.

The recovery strategy must take supply chain management, telecommunication management, and utility management into account during the decision phase. It must be well understood that, in many disaster recovery efforts, procuring systems and other equipment, building a new system room from scratch, and providing connectivity to DR sites may take longer than usual for several reasons. This can be very risky unless the organization has deliberately opted for a cold site strategy and accepted the longer recovery time.

The different types of strategies, as a function of cost and the availability they provide, can be seen in the scheme.

Recovery Strategies

Redundant site

A redundant site is an exact production duplicate of a system that has the capability to seamlessly operate all necessary IT operations without loss of services to the end user of the system. A redundant site receives data backups in real time so that in the event of a disaster the users of the system have no loss of data. It is a building configured exactly like the primary site and is the most expensive recovery option because it effectively more than doubles the cost of IT operations.

Hot site

A hot site is a location to which an organization may relocate following a major disruption or disaster. It is a datacenter with a raised floor, power, utilities, computer peripherals, and fully configured computers. The hot site will have all necessary hardware and critical applications data mirrored in real time. A hot site will have the capability to allow the organization to resume critical operations within a very short period of time—sometimes in less than an hour.

Warm site

A warm site has some aspects of a hot site but it will have to rely upon backup data in order to reconstitute a system after a disruption. It is a datacenter with a raised floor, power, utilities, computer peripherals, and fully configured computers.

Because of the extensive costs involved with maintaining a hot or redundant site, many organizations will elect to use a warm site recovery solution. These organizations will have to be able to withstand an MTD of at least 1 to 3 days in order to consider a warm site solution. The longer the MTD is, the less expensive the recovery solution will be.

Cold site

A cold site is the least expensive recovery solution to implement. It does not include backup copies of data, nor does it contain any immediately available hardware. After a disruptive event, a cold site will take the longest amount of time of all recovery solutions to implement and restore critical IT services for the organization. Organizations using a cold site recovery solution will have to be able to withstand a significantly long MTD—usually measured in weeks, not days.

Reciprocal agreement

Reciprocal agreements are bidirectional agreements between two organizations in which one organization promises another organization that it can move in and share space if it experiences a disaster. They are documented in the form of a contract written to gain support from outside organizations in the event of a disaster. They are also referred to as mutual aid agreements (MAA), and they are structured so that each organization will assist the other in the event of an emergency.

Mobile sites

Mobile sites are “datacenters on wheels,” towable trailers that contain racks of computer equipment, as well as HVAC, fire suppression, and physical security. They are a good fit for disasters such as a datacenter flood, where the datacenter is damaged but the rest of the facility and surrounding property are intact.

Subscription services

Some organizations outsource their BCP/DRP planning and/or implementation by paying another company to perform those services. This effectively transfers the risk to the insurer company.

Related Plans

Continuity of operations plan (COOP)

The continuity of operations plan (COOP) describes the procedures required to maintain operations during a disaster. This includes transfer of personnel to an alternative disaster recovery site, and operations of that site.

Business recovery plan

The business recovery plan (BRP), also known as the business resumption plan, details the steps required to restore normal business operations after recovering from a disruptive event. This may include switching operations from an alternative site back to a (repaired) primary site. The business recovery plan picks up when the COOP is complete.

Continuity of support plan

The continuity of support plan focuses narrowly on support of specific IT systems and applications. It is also called the IT contingency plan.

Cyber incident response plan

The cyber incident response plan (CIRP) is designed to respond to disruptive cyber events, including network-based attacks, worms, computer viruses, Trojan horses, etc.

Occupant emergency plan

The occupant emergency plan (OEP) provides the “response procedures for occupants of a facility in the event of a situation posing a potential threat to the health and safety of personnel, the environment, or property.”

Crisis management plan

The crisis management plan (CMP) is designed to provide effective coordination among the managers of the organization in the event of an emergency or disruptive event. The CMP details the actions management must take to ensure that life and safety of personnel and property are immediately protected in case of a disaster.

Crisis communications plan

A critical component of the crisis management plan is the crisis communications plan which communicates to staff and the public in the event of a disruptive event. All communication with the public should be channeled via senior management or the public relations team.

Call trees

A key tool leveraged for staff communication by the crisis communications plan is the call tree, which is used to quickly communicate news throughout an organization without overburdening any specific person. The call tree works by assigning each employee a small number of other employees they are responsible for calling in an emergency event. The call tree continues until all affected personnel have been contacted.
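The mechanics of a call tree can be sketched in a few lines of Python. This is only an illustration: the staff names, the `fanout` parameter (how many people each employee calls), and the breadth-first assignment are assumptions, not part of any particular crisis communications plan.

```python
from collections import deque

def build_call_tree(staff, fanout=3):
    """Assign each person at most `fanout` colleagues to call.

    The first name in `staff` starts the tree; everyone else is assigned
    breadth-first, so no single person is overburdened.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    tree = {name: [] for name in staff}
    callers = deque(staff[:1])   # people who still have free call slots
    pending = deque(staff[1:])   # people nobody has been assigned to call yet
    while pending:
        caller = callers.popleft()
        for _ in range(fanout):
            if not pending:
                break
            callee = pending.popleft()
            tree[caller].append(callee)
            callers.append(callee)
    return tree

def call_order(tree, root):
    """Walk the tree from `root` and list people in the order they are reached."""
    reached, queue = [], deque([root])
    while queue:
        person = queue.popleft()
        reached.append(person)
        queue.extend(tree[person])
    return reached

# Hypothetical staff list.
staff = ["Ana", "Ben", "Cem", "Dee", "Eli", "Fay", "Gus"]
tree = build_call_tree(staff, fanout=2)
print(tree["Ana"])                    # ['Ben', 'Cem']
print(len(call_order(tree, "Ana")))   # 7, everyone is reached
```

A real plan would also need alternates for people who cannot be reached, which this sketch leaves out.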

Automated call trees

Automated call trees automatically contact all BCP/DRP team members after a disruptive event. Third-party BCP/DRP service providers may provide this service. The automated tree is populated with team members’ primary phone, cellular phone, pager, email, and/or fax.

Executive succession planning

Organizations must ensure that there is always an executive available to make decisions during a disaster. Executive succession planning determines an organization’s line of succession. Executives may become unavailable due to a variety of disasters, ranging from injury and loss of life to strikes, travel restrictions, and medical quarantines.

Backups and Availability

In addition to the methods that are discussed in more detail in the Operations Security domain, some concepts deserve to be mentioned here.

Hard Copy

After evaluating the BIA, some organizations may choose to rely on hard copies, which means that during the disaster recovery period the organization continues its business operations on paper.

Tape rotation methods

A common tape rotation method is first-in, first-out (FIFO). Assume you are performing full daily backups and have 14 rewritable tapes total. FIFO means that you will use each tape in order and cycle back to the first tape after the 14th is used. This ensures that 14 days of data are archived. The downside of this plan is that you only maintain 14 days of data.
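The FIFO cycle described above boils down to modular arithmetic. A minimal sketch, assuming days and tapes are numbered from 1 (the function names are made up for illustration):

```python
def fifo_tape_for_day(day, tape_count=14):
    """Return the 1-based tape number used on the given 1-based day."""
    return (day - 1) % tape_count + 1

def oldest_restorable_day(today, tape_count=14):
    """Oldest day whose full backup has not yet been overwritten."""
    return max(1, today - tape_count + 1)

print(fifo_tape_for_day(14))       # 14
print(fifo_tape_for_day(15))       # 1, the cycle starts over
print(oldest_restorable_day(30))   # 17, only 14 days are kept
```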

Grandfather–father–son (GFS) addresses this problem. There are 3 sets of tapes: 7 daily tapes (the son), 4 weekly tapes (the father), and 12 monthly tapes (the grandfather). Once per week a son tape graduates to father. Once every 5 weeks, a father tape graduates to grandfather. After running for a year, this method ensures there are daily backup tapes available for the past 7 days, weekly tapes for the past 4 weeks, and monthly tapes for the past 12 months.
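A rough simulation of the GFS scheme is given below. It is a simplified sketch under stated assumptions: backups are identified by their day number, the weekly graduation happens every 7th day, and the monthly graduation every 5th week, following the description above. Real backup software tracks physical tapes rather than day numbers.

```python
from collections import deque

def simulate_gfs(days, sons=7, fathers=4, grandfathers=12, weeks_per_month=5):
    """Return the day numbers still held on son, father, and grandfather tapes."""
    son = deque(maxlen=sons)             # oldest entry is overwritten when full
    father = deque(maxlen=fathers)
    grandfather = deque(maxlen=grandfathers)
    for day in range(1, days + 1):
        son.append(day)                  # daily backup
        if day % 7 == 0:                 # once per week: son graduates to father
            father.append(day)
            if (day // 7) % weeks_per_month == 0:
                grandfather.append(day)  # every 5th week: father graduates
    return list(son), list(father), list(grandfather)

son, father, grandfather = simulate_gfs(70)
print(son)          # [64, 65, 66, 67, 68, 69, 70]
print(father)       # [49, 56, 63, 70]
print(grandfather)  # [35, 70]
```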

Remote journaling

A database journal contains a log of all database transactions. Journals may be used to recover from a database failure. Assume that a database checkpoint (snapshot) is saved every hour. If the database loses integrity 20 minutes after a checkpoint, it may be recovered by reverting to the checkpoint and then applying all subsequent transactions described by the database journal.
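The recovery described above (revert to the checkpoint, then replay the journal) can be illustrated with a toy key-value "database". The record layout `(sequence number, key, value)` and the account names are assumptions made for this sketch; real database journals are far richer.

```python
import copy

def recover(checkpoint, checkpoint_seq, journal):
    """Rebuild the database state from a checkpoint plus the journal.

    `journal` is a list of (sequence_number, key, value) records in order.
    Only records written after the checkpoint are replayed.
    """
    state = copy.deepcopy(checkpoint)    # never modify the saved checkpoint
    for seq, key, value in journal:
        if seq > checkpoint_seq:
            state[key] = value
    return state

# Checkpoint taken after transaction 2; transactions 3 and 4 came later.
checkpoint = {"alice": 100, "bob": 50}
journal = [
    (1, "alice", 100),
    (2, "bob", 50),
    (3, "alice", 80),
    (4, "carol", 20),
]
print(recover(checkpoint, 2, journal))
# {'alice': 80, 'bob': 50, 'carol': 20}
```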

Database shadowing

Database shadowing uses two or more identical databases that are updated simultaneously. The shadow databases can exist locally, but it is best practice to host one shadow database offsite. The goal of database shadowing is to greatly reduce the recovery time for a database implementation. Database shadowing allows faster recovery when compared with remote journaling.

Software escrow

Vendors who have developed products on behalf of other organizations might well have intellectual property concerns about disclosing the source code of their applications to customers. A common middle ground between these two entities is for the application development company to allow a neutral third party to hold the source code. This approach is known as software escrow. If the development organization goes out of business or otherwise violates the terms of the software escrow agreement, the third party holding the escrow will provide the source code and other information to the purchasing organization.

DRP Testing, Training and Awareness

There are some important concepts that should be known about DRP testing.

DRP review

The DRP review is the most basic form of initial DRP testing, and is focused on simply reading the DRP in its entirety to ensure completeness of coverage. This review is typically performed by the team that developed the plan.

Checklist

Checklist (also known as consistency) testing lists all necessary components required for successful recovery and ensures that they are, or will be, available if a disaster occurs. The checklist test is often performed concurrently with the structured walkthrough or tabletop testing as a solid first testing threshold.

Structured walkthrough/tabletop

Another test that is commonly completed at the same time as the checklist test is that of the structured walkthrough, which is also often referred to as a tabletop exercise. The goal is to allow individuals to thoroughly review the overall approach.

Simulation test/walkthrough drill

A simulation test, also called a walkthrough drill (not to be confused with structured walkthrough), goes beyond talking about the process and actually has teams carry out the recovery process. The team must respond to a simulated disaster as directed by the DRP.

Parallel processing

This type of test is common in environments where transactional data is a key component of the critical business processing. Typically, this test will involve recovery of critical components at an alternative computing facility, and then restore data from a previous backup. Note that regular production systems are not interrupted. Organizations that are highly dependent upon mainframe and midrange systems will often employ this type of test.

Partial and complete business interruption

Arguably, the highest-fidelity DRP test of all is business interruption testing; however, this type of test can actually be the cause of a disaster, so extreme caution should be exercised before attempting an actual interruption test. The business interruption style of testing will have the organization actually stop processing normal business at the primary location and instead leverage the alternative computing facility.

DRP/BCP Maintenance

It is recommended to repeat BCP/DRP tests at least once a year. To be able to do so, all the documents mentioned so far must be kept up to date and revised by all the DRP/BCP team members. In order to keep a complete record of changes, the DRP/BCP process must be tied to the organization’s change management process.

DRP/BCP Mistakes

Common BCP/DRP mistakes include:

Lack of management support
Lack of business unit involvement
Improper (often narrow) scope
Inadequate telecommunications management
Inadequate supply chain management
Lack of testing
Lack of training and awareness
Failure to keep the BCP/DRP plan up to date

Specific DRP/BCP Frameworks

NIST 800-34

ISO/IEC 27031

BS 25999

The Business Continuity Institute (BCI) 2008 Good Practice Guidelines

Saturday, March 22, 2014

CISSP - Operations Security

Another important domain in the CISSP CBK is Operations Security. It is more interesting to many people than some of the other domains (or that may be only my preference), and it takes less time to understand.

This domain contains some important concepts that must be well understood, such as data backup modes and techniques (including RAID levels) and security incident response management.

Administrative Security

Least Privilege

The principle of least privilege dictates that persons have no more than the access that is strictly required for the performance of their duties.

This principle is more meaningful in environments where Discretionary Access Control is applied. An important point to remember about DAC is that in this model, Data Owner defines who can access that specific data. With DAC, the principle of least privilege suggests that a user will be given access to data if, and only if, a data owner determines that a business need exists for the user to have the access.

Need to Know

With MAC, we have a further concept that helps to inform the principle of least privilege: need to know. Though the vetting process for someone accessing highly sensitive information is stringent, clearance level alone is insufficient when dealing with the most sensitive of information. An extension to the principle of least privilege in MAC environments is the concept of compartmentalization.

Compartmentalization is a method for enforcing need to know and can be best understood by considering a highly sensitive military operation: while a large number of individuals (some of high rank) may hold the necessary clearance, only a subset needs to know specific information. The others have no need to know and therefore no access.

Separation of Duties

Separation of duties prescribes that multiple people are required to complete critical or sensitive transactions. The goal of separation of duties is to ensure that in order for someone to be able to abuse access to sensitive data or transactions, that person must convince another party to act in concert.

When several people act together to compromise the security of sensitive information, this is called collusion.

Rotation of Duties / Job Rotation

Rotation of duties simply requires that one person does not perform critical functions or responsibilities without interruption. If the operational impact of the loss of an individual would be too great, then perhaps one way to soften this impact would be to provide additional depth of coverage for this individual’s responsibilities.

Rotation of duties can also mitigate fraud. One of the best ways to detect this fraudulent behavior is to require that responsibilities that could lead to fraud be frequently rotated among multiple people. In addition to the increased detection capabilities, the fact that responsibilities are routinely rotated deters fraud.

Mandatory Leave / Forced Vacation

Discovering a lack of depth in personnel with critical skills can help organizations understand risks associated with employees unavailable for work due to unforeseen circumstances. Forcing all employees to take leave can identify areas where depth of coverage is lacking. Further, requiring employees to be away from work while it is still operating can also help reveal fraudulent or suspicious behavior.

Non-disclosure agreement (NDA)

Requiring employees to sign an NDA is a practice seen in more and more enterprises today. Special emphasis must be put on having third parties, such as consultants, contractors, and on-site outsourced staff, sign an NDA.

Background Checks

Privilege Monitoring

By their job definition, some employees require higher privileges than ordinary employees. The actions of these employees constitute a greater risk and must be regularly checked.

Furthermore, employees who change functions or gain new responsibilities may keep their old privileges while acquiring new ones. This accumulation is known as privilege creep, and it remains a difficult technical problem for today’s identity and access management systems. Privilege monitoring may also help to detect and correct such situations.

Sensitive Information and Media Security

Labeling/Marking

All information should be labeled according to the data classification policy so that it receives the appropriate level of care.

Handling

Storage

Especially sensitive data should be kept encrypted in storage media.

Retention

Retention of sensitive information should not persist beyond the period of usefulness or legal requirement (whichever is greater), as it needlessly exposes the data to threats of disclosure when the data is no longer needed.

Media sanitization or destruction of data

Data Remanence

Data remanence is data that persists beyond non-invasive means to delete it. Deleting a file from the Recycle Bin does not necessarily mean that the file is unrecoverable. Depending on the sensitivity of the data, further measures should be considered, such as wiping (overwriting the file’s location, often several times and with random bits), degaussing (exposing magnetic media that will no longer be used to a strong magnetic field), shredding, and physical destruction.

Asset Management

Configuration Management

Configuration management in this context has a different meaning than it has in various IT service management models. From a security perspective, configuration items should have a security-hardened baseline configuration that is used as the standard for all items of the same kind, to ease security management.

Baselining

Security baselining is the process of capturing a point-in-time understanding of the current system security configuration. Establishing an easy means for capturing the current system security configuration can be extremely helpful in responding to a potential security incident.

Patch Management

Patch management should be very tightly tied to the change management process. Automation and reporting are also very important.

Vulnerability Management

Vulnerability scanning is a way to discover poor configurations and missing patches in an environment. Vulnerability management is much more than discovering vulnerabilities and presenting the findings in a report. It requires a risk management approach that directs the institution’s resources, such as time and money, toward the risks that need to be addressed, and that reports on the results. The remediation or mitigation of vulnerabilities should be prioritized based on both risk and ease of application.

A vulnerability that is known before a patch exists is called a zero-day vulnerability. The best way to deal with zero-day vulnerabilities is to apply the defense-in-depth principle.

Change Management

The purpose of the change control process is to understand, communicate, and document any changes with the primary goal of being able to understand, control, and avoid direct or indirect negative impact that the changes might impose.

There should be a change control board that oversees and coordinates the change control process. The person proposing the change should attempt to supply information about any potential negative impacts that might result from the change, as well as any negative impacts that could result from not implementing the change. A rollback plan (backout plan) should be prepared to detail the procedures for reversing the change in case this is deemed necessary. The phases of the change management procedure can be summarized as follows:

Identifying a change
Proposing a change
Assessing the risk associated with the change
Testing the change
Scheduling the change
Notifying impacted parties of the change
Implementing the change
Reporting results of the change implementation

Finally, all changes must be closely tracked and auditable. A detailed change record should be kept.

Continuity of Operations

Service Level Agreements (SLAs)

Service Level Agreements have become more important in recent years as more and more IT services are outsourced or provided in an “as a service” model, as in the case of cloud services. The goal of the SLA is to stipulate all expectations regarding the behavior of the service at the beginning of the procurement process, so that they can be included in contract negotiations. Organizations often pay too much attention to availability and tend to forget other important factors.

Adequate time and effort should be spent on defining specific service levels that reflect the organization’s expectations of the service to be acquired. Contractors may demand additional fees for requirements that were not included in the contract negotiation phase.

Fault Tolerance

A full backup is simply a replica of all allocated data on a hard disk. Because it contains all of the allocated data, a full backup is simple from a recovery standpoint in the event of a failure. The amount of media required to hold full backups is obviously greater than for other backup methods. Another downside of using only full backups is the time it takes to perform the backup itself, which may be too long depending on the amount of data present on the system.

Incremental backups only archive files that have changed since the last backup of any kind was performed. Because fewer files are backed up, the time to perform the incremental backup is greatly reduced. For example, each Sunday, a full backup is performed. For Monday’s incremental backup, only those files that have been changed since Sunday’s backup will be marked for backup. On Tuesday, those files that have been changed since Monday’s incremental backup will be marked for backup.

Whereas the incremental backup only archives those files that have changed since the last backup of any kind, the differential backup method backs up any files that have been changed since the last full backup. For example, each Sunday, a full backup is performed. For Monday’s differential backup, only those files that have been changed since Sunday’s backup will be archived. On Tuesday, again those files that have been changed since Sunday’s full backup, including those backed up with Monday’s differential, will be archived.
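
The difference between the two methods is easier to see with a small sketch. The Python below is only an illustration under assumed conditions: a full backup on Sunday, one backup per day afterwards, and a made-up table of which files changed on which day. The names (CHANGES, incremental_set, differential_set, restore_chain) are hypothetical.

```python
# Hypothetical week: day -> set of files modified on that day.
CHANGES = {
    "Mon": {"a.txt"},
    "Tue": {"b.txt"},
    "Wed": {"a.txt", "c.txt"},
}
DAYS = ["Mon", "Tue", "Wed"]  # Sunday's full backup precedes these days


def incremental_set(day):
    """Files archived by that day's incremental: changes since the previous day's backup."""
    return set(CHANGES[day])


def differential_set(day):
    """Files archived by that day's differential: every change since Sunday's full backup."""
    files = set()
    for d in DAYS[: DAYS.index(day) + 1]:
        files |= CHANGES[d]
    return files


def restore_chain(day, method):
    """Backup sets needed to restore the system as of the given day."""
    if method == "incremental":
        # The full backup plus every incremental taken since then, in order.
        return ["Sun-full"] + [f"{d}-incr" for d in DAYS[: DAYS.index(day) + 1]]
    if method == "differential":
        # The full backup plus only the most recent differential.
        return ["Sun-full", f"{day}-diff"]
    raise ValueError(f"unknown method: {method}")


print(sorted(incremental_set("Wed")))
print(sorted(differential_set("Wed")))
print(restore_chain("Wed", "incremental"))
print(restore_chain("Wed", "differential"))
```

The trade-off shows up in restore_chain: incrementals are small but a restore needs every set since the full backup, while differentials grow during the week but a restore needs only two sets.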

Redundant Array of Inexpensive Disks (RAID)

The goal of RAID is to help mitigate the risk associated with hard disk failures. The various RAID levels consist of different approaches to disk array configurations. RAID configurations are not always built to mitigate hard disk failures: some, such as RAID 0, are used to improve the read and write performance of the disks.

Before going further into RAID, we should understand the basic terms of RAID operation such as mirroring, striping and parity.

Mirroring is simply used to achieve full data redundancy by writing the same data to multiple hard disks. Because mirrored data must be written to multiple disks, the write times are slower; however, performance gains can be achieved when reading mirrored data by simultaneously pulling data from multiple hard disks.

Striping is a RAID concept that is focused on increasing the read and write performance by spreading data across multiple hard disks. With data being spread among multiple disk drives, read and writes can be performed in parallel across multiple disks rather than serially on one disk. This parallelization provides a performance increase and does not aid in data redundancy.

Parity is a way to achieve data redundancy without incurring the same degree of cost as that of mirroring in terms of disk usage and write performance. The table below explains basic RAID modes.

RAID Level	Description
RAID 0	Simple Striping, no redundancy
RAID 1	Mirrored disks, usable disk capacity is half of the total disk capacity
RAID 2	Requires either 14 or 39 disks and special controller, not commercially viable
RAID 3	Byte Level Striping with Dedicated Parity Disk (A disk alone is used for parity)
RAID 4	Block Level Striping with Dedicated Parity Disk (A disk alone is used for parity)
RAID 5	Block Level Striping with Distributed Parity (Parity is distributed to disks)
RAID 6	Block Level Striping with Double Distributed Parity (Uses double parity)
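
To make the parity idea concrete, here is a minimal Python sketch of XOR parity, the general principle behind single-parity levels such as RAID 3, 4 and 5. The block contents and the helper name xor_blocks are assumptions for illustration only; real controllers work on much larger stripes and, for RAID 5, rotate the parity across disks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three data blocks, as if striped across three disks (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # would be stored on a fourth disk

# Simulate losing the second disk and rebuilding it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

One parity block protects any number of data blocks against the loss of a single disk, which is why parity costs less disk space than mirroring.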

In addition to the standard levels, there are nested RAID levels, which combine two RAID modes. RAID 10, for example, combines RAID 1 and RAID 0. RAID 01, RAID 50, RAID 60 and RAID 100 are other well-known nested RAID levels.

Besides disk-level protection, system-level redundancy should also be considered for critical systems. This can be achieved by making other critical system components redundant, such as power supplies, NICs and disk controllers. Whole systems can also be made redundant using either active-active (load balancing) or active-passive methods, which are more costly but provide better levels of availability.

Incident Response Management

In most organizations, security incident response is treated no differently from other IT incidents, which results in significant losses with regard to the confidentiality and availability of the organization’s information.

A security incident response plan should be prepared prior to incidents and must be followed. Depending on the methodology, this plan can be summarized in four or six steps, but the basis is the same.

Preparation: This stage includes training, writing incident response policies and procedures, and providing tools such as laptops with sniffing software, cables, original OS media, removable drives, etc.
Detection and analysis: Organizations should have an automated system (like SIEM) for pulling events from several systems and bringing those events into the wider organizational context. An attacker may use one attack as a diversion to mask the real attack.
Containment: A good analogy to explain containment is to compare it to emergency medical technicians arriving on the scene of an accident, as they seek only to stabilize an injured patient (stop their condition from worsening) and do not attempt to cure the patient.
Eradication: In order for an organization to be able to reliably recover from an incident, the cause of the incident must be determined. Eradication cannot be performed without a proper root cause analysis.
Recovery
Lessons learned

Friday, March 21, 2014

CISSP - Security Architectures and Design

An important domain in one's quest to get CISSP certified is Security Architecture and Design.

This domain may seem irrelevant, unnecessarily detailed and boring to those who come from a network and network security operations background, but I believe everyone will find very important and often unnoticed material here.

This domain is very strongly related to the Access Control domain, and concepts like DAC, MAC and RBAC must be thoroughly understood before starting.

I suggest paying close attention to subjects such as the Bell-LaPadula, Biba, Clark-Wilson and Chinese Wall models, as well as evaluation criteria such as TCSEC (aka the Orange Book), its European counterpart ITSEC, and Common Criteria.

I tried to summarize as much as I could to keep it readable, but there really are many small and important points to keep in mind.

Let's start.

SECURITY ARCHITECTURE AND DESIGN

Security Architecture and Design is a three-part domain. The first part covers the hardware and software required to have a secure computer system, the second part covers the logical models required to keep the system secure, and the third part covers evaluation models that quantify how secure the system really is.

SECURE SYSTEM DESIGN CONCEPTS

Layering

Layering separates hardware and software functionality into modular tiers. A generic list of security architecture layers is as follows:

Hardware
Kernel and device drivers
Operating system
Applications

Abstraction

Abstraction hides unnecessary details from the user. Complexity is the enemy of security—the more complex a process is, the less secure it is.

Security domains

A security domain is the list of objects a subject is allowed to access. More broadly defined, domains are groups of subjects and objects with similar security requirements. “Confidential,” “secret,” and “top secret” are three security domains used by the U.S. DoD, for example. With respect to kernels, two domains are user mode and kernel mode.

The Ring Model

The ring model is a form of CPU hardware layering that separates and protects domains (such as kernel mode and user mode) from each other. Many CPUs, such as the Intel x86 family, have four rings, ranging from ring 0 (kernel) to ring 3 (user). The innermost ring is the most trusted, and each successive outer ring is less trusted.

The rings are (theoretically) used as follows:

Ring 0—Kernel
Ring 1—Other OS components that do not fit into ring 0
Ring 2—Device drivers
Ring 3—User applications

Processes communicate between the rings via system calls, which allow processes to communicate with the kernel and provide a window between the rings. A user running a word processor in ring 3 presses “save,” and a system call is made into ring 0, asking the kernel to save the file. The kernel does so and reports that the file is saved. System calls are slow (compared to performing work within one ring) but provide security. The ring model also provides abstraction: The nitty-gritty details of saving the file are hidden from the user, who simply presses the “save file” button.
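
The “save” example can be approximated from user space. The Python sketch below uses os.open, os.write and os.close, which are thin wrappers over the corresponding operating system calls; the function name save_via_syscalls and the file name are made up for illustration, and how closely each wrapper maps to a single kernel system call depends on the platform.

```python
import os
import tempfile


def save_via_syscalls(path, text):
    """Write text using os-level calls that map closely to kernel system calls."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        # The user-mode process asks the kernel to do the actual write.
        return os.write(fd, text.encode("utf-8"))  # number of bytes written
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as folder:
    written = save_via_syscalls(os.path.join(folder, "report.txt"), "saved")
    print(written, "bytes written")
```

The program never touches the disk itself; it only hands a request to the kernel and receives a result, which is the window between rings described above.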

While x86 CPUs have four rings and can be used as described above, this usage is considered theoretical because most x86 operating systems, including Linux and Windows, use rings 0 and 3 only. Using our “save file” example with four rings, a call would be made from ring 3 to ring 2, then from ring 2 to ring 1, and finally from ring 1 to ring 0. This is secure, but complex and slow, so most modern operating systems opt for simplicity and speed.

A new mode called hypervisor mode (and informally called “ring -1”) allows virtual guests to operate in ring 0, controlled by the hypervisor one ring “below.”

Open and closed systems

An open system uses open hardware and standards, using standard components from a variety of vendors. An IBM-compatible PC is an open system; you may build an IBM-compatible PC by purchasing components from a multitude of vendors.

A closed system uses proprietary hardware or software, such as Apple computers.

SECURE HARDWARE ARCHITECTURE

The System Unit and Motherboard

The system unit is the computer’s case: It contains all of the internal electronic computer components, including motherboard, internal disk drives, power supply, etc. The motherboard contains hardware, including the CPU, memory slots, firmware, and peripheral slots such as PCI slots.

The computer bus

A computer bus is the primary communication channel on a computer system. Communication between the CPU, memory, and input/output devices such as keyboard, mouse, display, etc., occurs via the bus.

Northbridge and Southbridge

Some computer designs use two buses: a northbridge and southbridge. The names derive from the visual design, usually shown with the northbridge on top and the southbridge on the bottom. The northbridge, also called the Memory Controller Hub (MCH), connects the CPU to RAM and video memory. The southbridge, also called the I/O Controller Hub (ICH), connects input/output (I/O) devices, such as disk, keyboard, mouse, CD drive, USB ports, etc. The northbridge is directly connected to the CPU and is faster than the southbridge.

The CPU

The arithmetic logic unit (ALU) performs mathematical calculations—it computes. It is fed instructions by the control unit, which acts as a traffic cop, sending instructions to the ALU.

CPUs fetch machine language instructions (such as “add 1 + 1”) and execute them (add the numbers, for an answer of “2”). The fetch and execute (also called fetch–decode–execute, or FDX) process actually takes four steps:

1. Fetch Instruction 1

2. Decode Instruction 1

3. Execute Instruction 1

4. Write (save) result 1

These four steps take one clock cycle to complete.

Pipelining combines multiple steps into one combined process, allowing simultaneous fetch, decode, execute, and write steps for different instructions. Each part is called a pipeline stage; the pipeline depth is the number of simultaneous stages which may be completed at once.

Given our previous fetch and execute example of adding 1 + 1, a CPU without pipelining would have to wait an entire cycle before performing another computation. A four-stage pipeline can combine the stages of four other instructions:

1. Fetch Instruction 1

2. Fetch Instruction 2, Decode Instruction 1

3. Fetch Instruction 3, Decode Instruction 2, Execute Instruction 1

4. Fetch Instruction 4, Decode Instruction 3, Execute Instruction 2, Write (save) result 1

5. Fetch Instruction 5, Decode Instruction 4, Execute Instruction 3, Write (save) result 2
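
The five steps listed above follow a simple rule: instruction n enters the pipeline at step n and moves one stage forward per step. The short Python sketch below reproduces that schedule; the function name pipeline_schedule is hypothetical and the model ignores stalls and hazards that real CPUs must handle.

```python
STAGES = ["Fetch", "Decode", "Execute", "Write"]


def pipeline_schedule(num_instructions, cycles):
    """Return, per step, the (stage, instruction) pairs active in a four-stage pipeline."""
    schedule = []
    for cycle in range(1, cycles + 1):
        active = []
        for offset, stage in enumerate(STAGES):
            instruction = cycle - offset  # instruction n enters the pipeline at step n
            if 1 <= instruction <= num_instructions:
                active.append((stage, instruction))
        schedule.append(active)
    return schedule


for number, active in enumerate(pipeline_schedule(5, 5), start=1):
    print(number, ", ".join(f"{stage} {instr}" for stage, instr in active))
```

From step 4 onward all four stages are busy at once, which is where the throughput gain comes from.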

An interrupt indicates that an asynchronous event has occurred. CPU interrupts are a form of hardware interrupt that cause the CPU to stop processing its current task, save the state, and begin processing a new request. When the new task is complete, the CPU will complete the prior task.

A process is an executable program and its associated data loaded and running in memory. A heavy-weight process (HWP) is also called a task. A parent process may spawn additional child processes called threads. A thread is a light-weight process (LWP).

Applications run as processes in memory, comprised of executable code and data. Multitasking allows multiple tasks (heavy weight processes) to run simultaneously on one CPU. Older and simpler operating systems, such as MS-DOS, are non-multitasking; they run one process at a time. Most modern operating systems, such as Linux and Windows XP, support multitasking.

Multiprocessing has a fundamental difference from multitasking in that it runs multiple processes on multiple CPUs.

A watchdog timer is designed to recover a system by rebooting after critical processes hang or crash. The watchdog timer reboots the system when it reaches zero; critical operating system processes continually reset the timer, so it never reaches zero as long as they are running. If a critical process hangs or crashes, it no longer resets the watchdog timer, which reaches zero, and the system reboots.
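
The countdown-and-reset behavior can be modeled in a few lines. This is a toy simulation, not how a hardware watchdog is programmed; the class name WatchdogTimer and its methods kick and tick are assumptions chosen for the example, and the “reboot” is just a flag.

```python
class WatchdogTimer:
    """Toy watchdog: counts down on every tick and 'reboots' at zero."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout
        self.rebooted = False

    def kick(self):
        """Called by a healthy critical process to reset the countdown."""
        self.remaining = self.timeout

    def tick(self):
        """Advance one time unit; trigger a simulated reboot at zero."""
        if self.rebooted:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.rebooted = True


watchdog = WatchdogTimer(timeout=3)
for _ in range(10):      # healthy process: resets the timer after every tick
    watchdog.tick()
    watchdog.kick()
print(watchdog.rebooted)

for _ in range(3):       # process has hung: nobody resets the timer
    watchdog.tick()
print(watchdog.rebooted)
```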

Complex instruction set computer (CISC) and reduced instruction set computer (RISC) are two forms of CPU design. CISC uses a large set of complex machine language instructions, while RISC uses a reduced set of simpler instructions. X86 CPUs (among many others) are CISC; ARM (used in many cell phones and PDAs), PowerPC, Sparc, and others are RISC.

Cache memory is the fastest memory on the system, required to keep up with the CPU as it fetches and executes instructions. The data most frequently used by the CPU is stored in cache memory. The fastest portion of the CPU cache is the register file, which contains multiple registers. Registers are small storage locations used by the CPU to store instructions and data. The next fastest form of cache memory is Level 1 cache, located on the CPU itself. Finally, Level 2 cache is connected to (but outside) the CPU. Static random access memory (SRAM) is used for cache memory.

As a general rule, the memory closest to the CPU (cache memory) is the fastest and most expensive memory in a computer. As you move away from the CPU, from SRAM, to DRAM, to disk, to tape, etc., the memory becomes slower and less expensive.

RAM and ROM

RAM is volatile memory used to hold instructions and data of currently running programs. It loses integrity after loss of power. RAM memory modules are installed into slots on the computer motherboard. Read-only memory (ROM) is nonvolatile: Data stored in ROM maintains integrity after loss of power. The basic input/output system (BIOS) firmware is stored in ROM.

DRAM and SRAM

Static random access memory (SRAM) is expensive and fast memory that uses small latches called “flip-flops” to store bits. Dynamic random access memory (DRAM) stores bits in small capacitors (like small batteries) and is slower and cheaper.

Values may be stored in multiple locations in memory, including CPU registers and in general RAM. These values may be addressed directly (“add the value stored here”) or indirectly (“add the value stored in memory location referenced here”). Indirect addressing is like a pointer.

Register direct addressing is the same as direct addressing, except it references a CPU cache register, such as Register 1.
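
The three addressing modes can be illustrated with a toy memory model. In this Python sketch a list stands in for RAM and a dictionary for the CPU registers; the addresses, the stored values and the function names are all made up for the example.

```python
memory = [0] * 16            # toy main memory, addressed by index
registers = {"R1": 0}        # toy register file

memory[5] = 42               # a value stored at address 5
memory[9] = 5                # address 9 holds a pointer to address 5
registers["R1"] = 7          # a value held in Register 1


def load_direct(address):
    """Direct addressing: use the value stored at this address."""
    return memory[address]


def load_indirect(address):
    """Indirect addressing: the address holds another address; follow it."""
    return memory[memory[address]]


def load_register_direct(name):
    """Register direct addressing: use the value held in the named register."""
    return registers[name]


print(load_direct(5), load_indirect(9), load_register_direct("R1"))
```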

Memory protection prevents one process from affecting the confidentiality, integrity, or availability of another. This is a requirement for secure multiuser (more than one user logged in simultaneously) and multitasking (more than one process running simultaneously) systems.

Process isolation is a logical control that attempts to prevent one process from interfering with another. This is a common feature among multiuser operating systems such as Linux, UNIX, or recent Microsoft Windows operating systems. Older operating systems such as MS-DOS provide no process isolation. A lack of process isolation means a crash in any MS-DOS application could crash the entire system.

Hardware segmentation takes process isolation one step further by mapping processes to specific memory locations. This provides more security than (logical) process isolation alone.

Virtual memory provides virtual address mapping between applications and hardware memory. Virtual memory provides many functions, including multitasking (multiple tasks executing at once on one CPU), allowing multiple processes to access the same shared library in memory, swapping, and others.

Swapping uses virtual memory to copy contents in primary memory (RAM) to or from secondary memory (not directly addressable by the CPU, on disk). Swap space is often a dedicated disk partition that is used to extend the amount of available memory. If the kernel attempts to access a page (a fixed-length block of memory) stored in swap space, a page fault occurs (an error that means the page is not located in RAM), and the page is “swapped” from disk to RAM.

Swapping and paging are often used interchangeably, but there is a slight difference. Paging copies a block of memory to or from disk, while swapping copies an entire process to or from disk.

Swap is designed as a protective measure to handle occasional bursts of memory usage. Systems should not routinely use large amounts of swap; in that case, physical memory should be added or processes should be removed, moved to another system, or shortened.
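
The page fault mechanism described above can be sketched with a toy pager. Everything here is a simplification for illustration: the class name ToyPager is hypothetical, RAM holds only a couple of frames, the first use of a page is also counted as a fault, and the choice to evict the oldest page is an assumption of this sketch rather than something a particular operating system does.

```python
from collections import OrderedDict


class ToyPager:
    """Toy demand pager with a fixed number of RAM frames and unlimited swap."""

    def __init__(self, ram_frames):
        self.ram_frames = ram_frames
        self.ram = OrderedDict()   # page number -> contents, oldest first
        self.swap = {}             # pages currently held in swap space
        self.page_faults = 0

    def access(self, page):
        if page not in self.ram:
            self.page_faults += 1                       # the page is not in RAM
            if len(self.ram) >= self.ram_frames:
                victim, contents = self.ram.popitem(last=False)
                self.swap[victim] = contents            # page out the oldest page
            # Page in from swap, or create a fresh page on first use.
            self.ram[page] = self.swap.pop(page, f"page-{page}")
        return self.ram[page]


pager = ToyPager(ram_frames=2)
for page in (1, 2, 3, 1):
    pager.access(page)
print(pager.page_faults, sorted(pager.swap))
```

A workload that keeps cycling through more pages than fit in RAM faults on almost every access, which is why routine heavy swap use is a sign that memory should be added.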

Firmware

Flash (EEPROM, faster than regular EEPROM, slower than disks)

BIOS

WORM storage

Write once, read many (WORM) storage can be written to once and read many times. It is often used to support records retention for legal or regulatory compliance. The most common type of WORM media is Compact Disc–Recordable (CD-R) and Digital Versatile Disk–Recordable (DVD-R). Note that CD-RW and DVD-RW (Read/Write) are not WORM media.

SECURE OPERATING SYSTEM AND SOFTWARE ARCHITECTURE

The Kernel

The kernel is the heart of the operating system, which usually runs in ring 0. Kernels have two basic designs: monolithic and microkernel. A monolithic kernel is not modular: it is compiled into one static executable, so support for new hardware cannot be added once the system is built and running. A microkernel is modular and can load drivers on demand.

Reference Monitor

The reference monitor mediates all access between subjects and objects. It enforces the system’s security policy, for example by preventing a normal user from writing to a restricted file such as the system password file. The reference monitor is always enabled and cannot be bypassed.

Users and file permissions

Unix/Linux

Windows

Privileged Programs

Setuid is a Linux and UNIX file permission that makes an executable run with the permissions of the file’s owner, and not as the running user. Setgid (set group ID) programs run with the permissions of the file’s group.

The passwd program runs as root, allowing users to change their passwords and thus the contents of /etc/passwd and /etc/shadow.

Because they run with elevated privileges, setuid and setgid programs must be kept to a minimum and their activity closely monitored.
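
A first step in controlling privileged programs is finding them. The Python sketch below tests the setuid and setgid bits of a file mode using the standard stat module; the helper name special_bits is hypothetical, and the example modes are illustrative values rather than readings from a real system.

```python
import stat


def special_bits(mode):
    """Report whether a file mode (as found in os.stat(path).st_mode) has setuid/setgid set."""
    return {
        "setuid": bool(mode & stat.S_ISUID),
        "setgid": bool(mode & stat.S_ISGID),
    }


print(special_bits(0o104755))  # regular file, rwsr-xr-x
print(special_bits(0o100755))  # regular file, rwxr-xr-x
```

On a Linux or UNIX system the same function can be applied to a real file, for example special_bits(os.stat("/usr/bin/passwd").st_mode), assuming that path exists on the system being audited.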

VIRTUALIZATION AND DISTRIBUTED COMPUTING

Virtualization

The key to virtualization security is the hypervisor, which controls access between virtual guests and host hardware. A Type 1 hypervisor (also called bare metal) is part of an operating system that runs directly on host hardware. A Type 2 hypervisor runs as an application on a normal operating system, such as Windows 7; for example, VMWare ESX is a Type 1 hypervisor, and VMWare Workstation is Type 2.

Many network-based security tools, such as network intrusion detection systems (NIDS), can be blinded by virtualization. A traditional NIDS connected to a physical SPAN port or tap cannot see traffic passing from one guest to another on the same host. NIDS vendors are beginning to offer virtual IDS products, running in software on the host and capable of inspecting host–guest and guest–guest traffic.

Cloud Computing

A concern about cloud computing is multiple organizations’ guests running on the same host. The compromise of one cloud customer could lead to the compromise of other customers.

Also, many cloud providers offer preconfigured system images, which may introduce risks via insecure configuration.

Finally, do you know where your data is? Public clouds may potentially move data to any country, potentially beyond the jurisdiction of the organization’s home country. Some laws forbid the storage of critical information such as PII abroad.

SYSTEM VULNERABILITIES, THREATS, AND COUNTERMEASURES

Emanations are energy that escapes an electronic system, and which may be remotely monitored under certain circumstances.

A covert channel is any communication that violates security policy. The communication channel used by malware installed on a system that locates personally identifiable information (PII) such as credit card information and sends it to a malicious server is an example of a covert channel.

Buffer overflows can occur when a programmer fails to perform bounds checking.

Time of check, time of use (TOCTOU) attacks are also called race conditions. An attacker attempts to alter a condition after it has been checked by the operating system, but before it is used. TOCTOU is an example of a state attack, where the attacker capitalizes on a change in operating system state.

Here is pseudocode for a setuid root program (runs with super user privileges, regardless of the running user) called “open test file” that contains a race condition:

1. If the file “test” is readable by the user

2. Then open the file “test”

3. Else print “Error: cannot open file.”

The race condition occurs between steps 1 and 2. Remember that most modern computers are multitasking; the CPU executes multiple processes at once. Other processes are running while our “open test file” program is running. In other words, the computer may run our program like this:

1. If the file “test” is readable by the user

2. Run another process

3. Run another process

4. Then open the file “test”

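An attacker who manages to run a process between steps 1 and 4 can change what “test” refers to, for example by replacing it with a link to a sensitive file. The check in step 1 passed on the original file, but the open in step 4 then acts on the attacker’s file with root privileges.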
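
The same pattern can be written in Python for POSIX systems. The first function mirrors the pseudocode (check, then open); the second is one possible way to narrow the window, by opening the file once without following symbolic links and then inspecting the descriptor that was actually opened. The function names are hypothetical, and the ownership rule in the second function is an assumed policy chosen for the example, not a general fix.

```python
import os


def open_test_file_racy(path):
    """Check first, then open: the gap between the two calls is the race window."""
    if os.access(path, os.R_OK):          # time of check
        return open(path, "rb")           # time of use (the path may have changed)
    raise PermissionError("Error: cannot open file.")


def open_test_file_safer(path):
    """Open once, refusing symlinks, then inspect the already-open descriptor."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        info = os.fstat(fd)               # describes the file that was actually opened
        if info.st_uid != os.getuid():    # assumed policy: caller must own the file
            raise PermissionError("Error: cannot open file.")
        return os.fdopen(fd, "rb")
    except Exception:
        os.close(fd)
        raise
```

The key design point is that the check and the use refer to the same open file descriptor, so swapping the path afterwards has no effect on what was checked.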

A backdoor is a shortcut in a system that allows a user to bypass security checks (such as username/password authentication) to log in.

Malicious code (or malware) is the generic term for any type of software that attacks an application or system.

Zero-day exploits are malicious code (a threat) for which there is no vendor-supplied patch (meaning there is an unpatched vulnerability).

Computer viruses are malware that does not spread automatically; they require a carrier.

Worms are malware that self-propagates (spreads independently).

A trojan (also called a Trojan horse) is malware that performs two functions: one benign (such as a game) and one malicious.

A rootkit is malware which replaces portions of the kernel and/or operating system.

A logic bomb is a malicious program that is triggered when a logical condition is met, such as after a number of transactions have been processed or on a specific date (also called a time bomb).

Packers provide runtime compression of executables. The original exe is compressed, and a small executable decompressor is prepended to the exe. Upon execution, the decompressor unpacks the compressed executable machine code and runs it. Packers are a neutral technology that is used to shrink the size of executables.

Server-side attacks (also called service-side attacks) are launched directly from an attacker (the client) to a listening service.

Client-side attacks occur when a user downloads malicious content.

Applets are small pieces of mobile code that are embedded in other software such as Web browsers.

Java applets run in a sandbox, which segregates the code from the operating system. The sandbox is designed to prevent an attacker who is able to compromise a Java applet from accessing system files, such as the password file. Code that runs in the sandbox must be self-sufficient; it cannot rely on operating system files that exist outside the sandbox.

ActiveX controls are the functional equivalent of Java applets. They use digital certificates instead of a sandbox to provide security. ActiveX controls are tied more closely to the operating system, allowing functionality such as installing patches via Windows Update. Unlike Java, ActiveX is a Microsoft technology that works on Microsoft Windows operating systems only.

The Open Web Application Security Project represents one of the best application security resources. OWASP provides a tremendous number of free resources dedicated to improving organizations’ application security posture. One of their best-known projects is the OWASP Top 10 project, which provides consensus guidance on what are considered to be the ten most significant application security risks.

Service-Oriented Architecture (SOA) attempts to reduce application architecture down to a functional unit of a service. SOA is intended to allow multiple heterogeneous applications to be consumers of services. The service can be used and reused throughout an organization rather than built within each individual application that needs the functionality offered by the service.

Data mining searches large amounts of data to determine patterns that would otherwise get lost in the noise.

The primary countermeasure to mitigate the attacks described in the previous section is defense in depth: multiple overlapping controls spanning across multiple domains, which enhance and support each other. Any one control may fail, but defense in depth (also called layered defense) mitigates this issue.

SECURITY MODELS

Read Down, Write Up concepts apply to Mandatory Access Control models.

Bell-LaPadula Model (CONFIDENTIALITY)

Simple Security Property : No Read Up, NRU
* Security Property : No Write Down, NWD
The Strong Tranquility Property states that security labels will not change while the system is operating. The Weak Tranquility Property states that security labels will not change in a way that conflicts with defined security properties.
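
The two Bell-LaPadula rules reduce to a pair of comparisons. The Python sketch below is a minimal illustration; the numeric ranking in LEVELS and the function names are assumptions for the example, and real systems also handle categories or compartments, which are left out here.

```python
# Assumed ordering of classification levels, lowest to highest.
LEVELS = {"confidential": 1, "secret": 2, "top secret": 3}


def blp_can_read(subject_level, object_level):
    """Simple Security Property: no read up."""
    return LEVELS[subject_level] >= LEVELS[object_level]


def blp_can_write(subject_level, object_level):
    """* Security Property: no write down."""
    return LEVELS[subject_level] <= LEVELS[object_level]


print(blp_can_read("secret", "top secret"))     # reading up is refused
print(blp_can_write("secret", "confidential"))  # writing down is refused
```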

BIBA Model (INTEGRITY)

Simple Integrity Axiom : No Read Down, NRD
* Integrity Axiom : No Write Up, NWU
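
Biba is the mirror image of Bell-LaPadula, so the comparisons simply flip. Again this is only a sketch: the integrity levels in INTEGRITY and the function names are made up for the example.

```python
# Assumed ordering of integrity levels, lowest to highest.
INTEGRITY = {"low": 1, "medium": 2, "high": 3}


def biba_can_read(subject_level, object_level):
    """Simple Integrity Axiom: no read down."""
    return INTEGRITY[subject_level] <= INTEGRITY[object_level]


def biba_can_write(subject_level, object_level):
    """* Integrity Axiom: no write up."""
    return INTEGRITY[subject_level] >= INTEGRITY[object_level]


print(biba_can_read("high", "low"))   # reading lower-integrity data is refused
print(biba_can_write("low", "high"))  # writing to higher-integrity data is refused
```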

Clark-Wilson Model (INTEGRITY)

While the Bell-LaPadula and Biba models apply mostly to governmental bodies, Clark-Wilson applies better to enterprises.

Clark–Wilson effectively limits the capabilities of the subject. Clark–Wilson uses two primary concepts to ensure that security policy is enforced: well-formed transactions and separation of duties.

Subject → Transformation Procedure → Object

A transformation procedure (TP) is a well formed transaction, and a constrained data item (CDI) is data that requires integrity. Unconstrained data items (UDIs) are data that do not require integrity. For each TP, an audit record is made and entered into the access control system. This provides both detective and recovery controls in case integrity is lost.

Clark–Wilson requires that users are authorized to access and modify data. It also requires that data is modified in only authorized ways.

Chinese Wall Model (Brewer and Nash)(INTEGRITY)

The Chinese Wall model is designed to avoid conflicts of interest by prohibiting one person, such as a consultant, from accessing multiple conflict of interest categories.

Conflict of Interest (CoI) should always bring the Chinese Wall model to mind. The Chinese Wall model requires that CoIs be identified so that once a consultant gains access to one CoI, that person cannot read or write to an opposing CoI.
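
What makes this model different from the previous ones is that access depends on history. A minimal Python sketch, assuming a made-up set of companies grouped into conflict of interest classes (the names COI_CLASSES, ChineseWall and request_access are hypothetical):

```python
# Hypothetical conflict of interest classes: companies in the same class compete.
COI_CLASSES = {
    "BankA": "banking",
    "BankB": "banking",
    "OilCo": "energy",
}


class ChineseWall:
    def __init__(self):
        self.history = {}  # consultant -> set of companies already accessed

    def request_access(self, consultant, company):
        accessed = self.history.setdefault(consultant, set())
        for previous in accessed:
            same_class = COI_CLASSES[previous] == COI_CLASSES[company]
            if same_class and previous != company:
                return False  # a competitor of this company was already accessed
        accessed.add(company)
        return True


wall = ChineseWall()
print(wall.request_access("alice", "BankA"))  # first access in the banking class
print(wall.request_access("alice", "BankB"))  # competitor of BankA
print(wall.request_access("alice", "OilCo"))  # different class, no conflict
```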

Noninterference

The noninterference model ensures that data at different security domains remain separate from one another. By implementing this model, the organization can be assured that covert channel communication does not occur because the information cannot cross security boundaries.

Access control matrix

An access control matrix is a table defining what access permissions exist between specific subjects and objects.
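
In code, such a matrix is naturally a nested lookup table: rows are subjects, columns are objects, and each cell holds a set of permissions. The subjects, objects and permissions below are invented for the example, as is the helper name is_allowed.

```python
# Rows are subjects, columns are objects, cells are sets of permissions.
ACCESS_MATRIX = {
    "alice": {"payroll.db": {"read", "write"}, "audit.log": {"read"}},
    "bob": {"audit.log": {"read"}},
}


def is_allowed(subject, obj, permission):
    """An empty or missing cell means no access."""
    return permission in ACCESS_MATRIX.get(subject, {}).get(obj, set())


print(is_allowed("alice", "payroll.db", "write"))
print(is_allowed("bob", "payroll.db", "read"))
```

Reading the table by row gives a subject’s capabilities; reading it by column gives an object’s access control list.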

Zachman Framework for Enterprise Architecture

The Zachman Framework for Enterprise Architecture provides a framework for providing information security, asking what, how, where, who, when, and why and mapping those questions across roles, including planner, owner, designer, builder, programmer, and user.

Graham-Denning Model

The Graham-Denning Model has three parts: objects, subjects, and rules. It provides a more granular approach for interaction between subjects and objects. There are eight rules:

• R1. Transfer access

• R2. Grant access

• R3. Delete access

• R4. Read object

• R5. Create object

• R6. Destroy object

• R7. Create subject

• R8. Destroy subject

Harrison–Ruzzo–Ullman Model

The HRU model maps subjects, objects, and access rights to an access matrix. It is considered a variation of the Graham–Denning Model. HRU has six basic operations:

1. Create object.

2. Create subject.

3. Destroy subject.

4. Destroy object.

5. Enter right into access matrix.

6. Delete right from access matrix.
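
The six operations above map directly onto methods of a small matrix class. This Python sketch is a simplification: the class and method names are hypothetical, and subjects and objects are kept in separate sets for readability, whereas the formal model treats subjects as objects as well.

```python
class AccessMatrix:
    """Toy access matrix supporting the six HRU basic operations."""

    def __init__(self):
        self.subjects = set()
        self.objects = set()
        self.rights = {}  # (subject, object) -> set of rights

    def create_object(self, obj):
        self.objects.add(obj)

    def create_subject(self, subject):
        self.subjects.add(subject)

    def destroy_subject(self, subject):
        self.subjects.discard(subject)
        self.rights = {k: v for k, v in self.rights.items() if k[0] != subject}

    def destroy_object(self, obj):
        self.objects.discard(obj)
        self.rights = {k: v for k, v in self.rights.items() if k[1] != obj}

    def enter_right(self, subject, obj, right):
        if subject not in self.subjects or obj not in self.objects:
            raise KeyError("unknown subject or object")
        self.rights.setdefault((subject, obj), set()).add(right)

    def delete_right(self, subject, obj, right):
        self.rights.get((subject, obj), set()).discard(right)

    def has_right(self, subject, obj, right):
        return right in self.rights.get((subject, obj), set())


matrix = AccessMatrix()
matrix.create_subject("alice")
matrix.create_object("report.doc")
matrix.enter_right("alice", "report.doc", "read")
print(matrix.has_right("alice", "report.doc", "read"))
```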

EVALUATION METHODS, CERTIFICATION, AND ACCREDITATION

Trusted Computer System Evaluation Criteria (TCSEC, aka the Orange Book)

ITSEC

Additional levels to the above shown levels are:

• F-IN: High integrity requirements

• AV: High availability requirements

• DI: High integrity requirements for networks

• DC: High confidentiality requirements for networks

• DX: High integrity and confidentiality requirements for networks

Common Criteria

Target of evaluation (ToE)—the system or product that is being evaluated.

Security target (ST)—the documentation describing the ToE, including the security requirements and operational environment.

Protection profile (PP)—an independent set of security requirements and objectives for a specific category of products or systems, such as firewalls or intrusion detection systems.

Evaluation assurance level (EAL)—the evaluation score of the tested product or system.

PCI-DSS

The core principles of PCI-DSS are:

• Build and maintain a secure network.

• Protect cardholder data.

• Maintain a vulnerability management program.

• Implement strong access control measures.

• Regularly monitor and test networks.

• Maintain an information security policy.

Certification means a system has been certified to meet the security requirements of the data owner. Certification considers the system, the security measures taken to protect the system, and the residual risk represented by the system.

Accreditation is the data owner’s acceptance of the certification, and of the residual risk, required before the system is put into production.