Business Continuity and Disaster Recovery

2025.2

The CAP Index Contingency Plan establishes procedures to recover CAP Index following a disruption resulting from a disaster. This Disaster Recovery Policy is maintained by the CAP Index Security Officer and Privacy Officer.

NIST: This CAP Index Contingency Plan is created under the legislative requirements set forth in the Federal Information Security Management Act (FISMA) of 2002 and the guidelines established by the National Institute of Standards and Technology (NIST) Special Publication (SP) 800-34, titled “Contingency Planning Guide for Information Technology Systems” dated June 2002.

Policy Statements

CAP Index policy requires that:

(a) A plan and process for business continuity and disaster recovery (BCDR), including the backup and recovery of systems and data, must be defined and documented.

(b) BCDR shall be simulated and tested at least once a year. Metrics shall be measured, and identified recovery enhancements shall be documented and implemented to improve the BCDR process.

(c) Security controls and requirements must be maintained during all BCDR activities.

Controls and Procedures

BCDR Objectives and Roles

Objectives

The following objectives have been established for this plan:

  1. Maximize the effectiveness of contingency operations through an established plan that consists of the following phases:

    • Notification/Activation phase to detect and assess damage and to activate the plan;
    • Recovery phase to restore temporary IT operations and recover from damage done to the original system;
    • Reconstitution phase to restore IT system processing capabilities to normal operations.
  2. Identify the activities, resources, and procedures needed to carry out CAP Index processing requirements during prolonged interruptions to normal operations.

  3. Identify and define the impact of interruptions to CAP Index systems.

  4. Assign responsibilities to designated personnel and provide guidance for recovering CAP Index during prolonged periods of interruption to normal operations.

  5. Ensure coordination with other CAP Index staff who will participate in the contingency planning strategies.

  6. Ensure coordination with external points of contact and vendors who will participate in the contingency planning strategies.

Examples of the types of disasters that would activate this plan include natural disasters, political disturbances, man-made disasters, external human threats, and internal malicious activities.

CAP Index defines two categories of systems from a disaster recovery perspective:

  1. Critical Systems. These systems host production application servers/services and database servers/services or are required for functioning of systems that host production applications and data. These systems, if unavailable, affect the integrity of data and must be restored, or have a process begun to restore them, immediately upon becoming unavailable.
  2. Non-critical Systems. These are all systems not considered critical by the definition above. These systems, while they may affect the performance and overall security of critical systems, do not prevent critical systems from functioning and being accessed appropriately. These systems are restored at a lower priority than critical systems.
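
The two categories above drive restoration priority. As a minimal sketch of how an inventory might encode that priority (the system names and data structure below are hypothetical illustrations, not CAP Index's actual inventory):

```python
# Hypothetical sketch: record the two criticality categories defined above so
# restoration order follows the policy. System names are illustrative only.
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    CRITICAL = 1      # restore (or begin restoring) immediately upon outage
    NON_CRITICAL = 2  # restore after critical systems


@dataclass
class System:
    name: str
    criticality: Criticality


INVENTORY = [
    System("production-app-server", Criticality.CRITICAL),
    System("production-database", Criticality.CRITICAL),
    System("internal-wiki", Criticality.NON_CRITICAL),
]


def recovery_order(systems):
    """Sort systems so critical systems are restored first."""
    return sorted(systems, key=lambda s: s.criticality.value)
```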

Line of Succession

The following order of succession ensures that decision-making authority for the CAP Index Contingency Plan is uninterrupted. The Security Officer is responsible for ensuring the safety of personnel and the execution of procedures documented within this CAP Index Contingency Plan, as well as for the recovery of CAP Index technical environments. If the Security Officer is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the CEO shall function as that authority or choose an alternative delegate. Should the contingency plan need to be activated, use the contact list below to initiate contact.

  • Brian Cunningham, Security Officer: bcunningham@capindex.com
  • Steven Aurand, CEO: saurand@capindex.com

Response Teams and Responsibilities

The following teams have been developed and trained to respond to a contingency event affecting CAP Index infrastructure and systems.

  1. IT is responsible for recovery of the CAP Index hosted environment, network devices, and all servers. The team includes personnel responsible for the daily IT operations and maintenance. The team leader is the IT Manager who reports to the CEO.

  2. HR & Facilities is responsible for ensuring the physical safety of all CAP Index personnel and environmental safety at each CAP Index physical location. The team members also include site leads at each CAP Index work site. The team leader is the Facilities Manager who reports to the CEO.

  3. DevOps & Development is responsible for the availability of all applications, web services, platforms, and their supporting cloud infrastructure. The team is also responsible for testing re-deployments and assessing damage to the environment. The team leader is the Security Officer.

  4. Security is responsible for assessing and responding to all cybersecurity related incidents according to CAP Index Incident Response policy and procedures. The security team shall assist the above teams in recovery as needed in non-cybersecurity events. The team leader is the Security Officer.

Members of above teams must maintain local copies of the contact information of the BCDR succession team. Additionally, the team leads must maintain a local copy of this policy in the event Internet access is not available during a disaster scenario.

All executive leadership shall be informed of any and all contingency events. The current CAP Index leadership team consists of Executive Management.

General Disaster Recovery Procedures

Notification and Activation Phase

This phase addresses the initial actions taken to detect and assess damage inflicted by a disruption to CAP Index. Based on the assessment of the event, which may be performed according to the CAP Index Incident Response Policy, the Contingency Plan may be activated by either the CEO or the Security Officer.

The notification sequence is listed below:

  • The first responder is to notify the Security Officer. All known information must be relayed to the Security Officer.
  • The Security Officer is to contact the CEO and the Response Teams and inform them of the event. The Security Officer or delegate is responsible to begin assessment procedures.
  • The Security Officer is to notify team members and direct them to complete the assessment procedures outlined below to determine the extent of damage and estimated recovery time. If damage assessment cannot be performed locally because of unsafe conditions, the Security Officer is to follow the steps below.

    • Damage Assessment Procedures: The Security Officer is to logically assess damage, gain insight into whether the infrastructure is salvageable, and begin to formulate a plan for recovery.
    • Alternate Assessment Procedures: Upon notification, the Security Officer is to follow the procedures for damage assessment with the Response Teams.
  • The CAP Index Contingency Plan is to be activated if one or more of the following criteria are met:

    • CAP Index will be unavailable for more than 48 hours.
    • The on-premises hosting facility or cloud infrastructure service is damaged and will be unavailable for more than 24 hours.
    • Other criteria, as appropriate and as defined by CAP Index.
  • If the plan is to be activated, the Security Officer is to notify and inform team members of the details of the event and if relocation is required.

  • Upon notification from the Security Officer, group leaders and managers are to notify their respective teams. Team members are to be informed of all applicable information and prepared to respond and relocate if necessary.
  • The Security Officer is to notify the hosting facility partners that a contingency event has been declared and to ship the necessary materials (as determined by damage assessment) to the alternate site.
  • The Security Officer is to notify remaining personnel and executive leadership on the general status of the incident.
  • Notification can be made by message, email, or phone.
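
The activation criteria in the sequence above (48-hour service unavailability, 24-hour facility or cloud unavailability, or other defined criteria) can be expressed as a simple check. The sketch below is illustrative only; the outage estimates would come from the damage assessment, and the `other_criteria_met` flag stands in for any additional criteria CAP Index defines.

```python
# Hypothetical activation check based on the criteria listed above.
from datetime import timedelta

SERVICE_OUTAGE_THRESHOLD = timedelta(hours=48)   # CAP Index unavailable
FACILITY_OUTAGE_THRESHOLD = timedelta(hours=24)  # hosting facility / cloud unavailable


def should_activate_plan(estimated_service_outage: timedelta,
                         estimated_facility_outage: timedelta,
                         other_criteria_met: bool = False) -> bool:
    """Return True if one or more activation criteria are met."""
    return (
        estimated_service_outage > SERVICE_OUTAGE_THRESHOLD
        or estimated_facility_outage > FACILITY_OUTAGE_THRESHOLD
        or other_criteria_met
    )


if __name__ == "__main__":
    # Hypothetical assessment: 72-hour service outage, 4-hour facility outage.
    print(should_activate_plan(timedelta(hours=72), timedelta(hours=4)))  # True
```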

Recovery Phase

This section provides procedures for recovering CAP Index infrastructure and operations at an alternate site, while other efforts are directed at repairing damage to the original system and capabilities.

Procedures are outlined for each team as required. Each procedure should be executed in the sequence it is presented to maintain efficient operations.

Recovery Goal: The goal is to rebuild CAP Index infrastructure to a production state.

The tasks outlined below are not strictly sequential, and some can be run in parallel (see the sketch following this list).

  1. Contact Partners and Customers affected to begin initial communication - DevOps
  2. Assess damage to the environment - DevOps
  3. Create a new production environment using new environment bootstrap automation - DevOps
  4. Ensure secure access to the new environment - Security
  5. Begin code deployment and data replication using pre-established automation - DevOps
  6. Test new environment and applications using pre-written tests - DevOps
  7. Test logging, security, and alerting functionality - DevOps and Security
  8. Assure systems and applications are appropriately patched and up to date - DevOps
  9. Update DNS and other necessary records to point to new environment - DevOps
  10. Update Partners and Customers affected through established channels - DevOps
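
As referenced above, these tasks are partially parallelizable. One hypothetical way to coordinate them is to express the list as a dependency graph, as in the sketch below using Python's standard-library graphlib; the task names and dependency edges are an illustrative interpretation of the list above, not an official runbook.

```python
# Hypothetical dependency graph for the recovery tasks above. Tasks whose
# dependencies are satisfied can be handed to the owning teams in parallel.
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (illustrative edges only)
RECOVERY_TASKS = {
    "contact_partners_and_customers": set(),
    "assess_damage": set(),
    "bootstrap_new_environment": {"assess_damage"},
    "secure_access": {"bootstrap_new_environment"},
    "deploy_code_and_replicate_data": {"secure_access"},
    "test_environment_and_apps": {"deploy_code_and_replicate_data"},
    "test_logging_security_alerting": {"deploy_code_and_replicate_data"},
    "patch_systems": {"test_environment_and_apps"},
    "update_dns": {"patch_systems", "test_logging_security_alerting"},
    "update_partners_and_customers": {"update_dns"},
}

sorter = TopologicalSorter(RECOVERY_TASKS)
sorter.prepare()
while sorter.is_active():
    ready = list(sorter.get_ready())       # tasks whose dependencies are done
    print("Can run in parallel:", ready)   # hand these to the owning teams
    sorter.done(*ready)
```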

Reconstitution Phase

This section discusses activities necessary for restoring full CAP Index operations at the original or new site. The goal is to restore full operations within 24 hours of a disaster or outage. If necessary, when the hosted data center at the original or new site has been restored, CAP Index operations at the alternate site may be transitioned back. The goal is to provide a seamless transition of operations from the alternate site to the computer center.

  1. Original or New Site Restoration

    • Repeat steps 5-9 in the Recovery Phase at the original or new site / environment.
    • Restoration of the original site is unnecessary for cloud environments, except when required for forensic purposes.
  2. Plan Deactivation

    • If the CAP Index environment is moved back to the original site from the alternative site, all hardware used at the alternate site should be handled and disposed of according to the CAP Index Media Disposal Policy.

Testing and Maintenance

The Security Officer shall establish criteria for validation/testing of the Contingency Plan and an annual test schedule, and shall ensure implementation of the tests. This process also serves as training for personnel involved in the plan's execution. At a minimum, the Contingency Plan shall be tested annually (within 365 days). The types of validation/testing exercises include tabletop and technical testing. Contingency Plans for all application systems must be tested at a minimum using the tabletop testing process. However, if an application system's Contingency Plan is included in the technical testing of its respective support systems, that technical test will satisfy the annual requirement.

Tabletop Testing

Tabletop Testing is conducted in accordance with the CMS Risk Management Handbook, Volume 2. The primary objective of the tabletop test is to ensure designated personnel are knowledgeable and capable of performing the notification/activation requirements and procedures as outlined in the CP, in a timely manner. The exercises include, but are not limited to:

  • Testing to validate the ability to respond to a crisis in a coordinated, timely, and effective manner, by simulating the occurrence of a specific crisis.

Simulation and/or Technical Testing

The primary objective of the technical test is to ensure the communication processes and data storage and recovery processes can function at an alternate site to perform the functions and capabilities of the system within the designated requirements. Technical testing shall include, but is not limited to:

  • Processing from the backup system at the alternate site;
  • Restore system using backups; and
  • Switch compute and storage resources to alternate processing site.
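
A scripted restore-and-verify step is one way to exercise the "restore system using backups" item above. The sketch below is a hypothetical example using only the Python standard library; the backup path, restore directory, and expected checksum are assumptions that would come from the actual backup manifest.

```python
# Hypothetical technical-test step: restore a backup file to an alternate-site
# location and verify its integrity against a recorded checksum.
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup_file: Path, restore_dir: Path, expected_sha256: str) -> bool:
    """Copy the backup into the restore directory and confirm it is intact."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    restored = restore_dir / backup_file.name
    shutil.copy2(backup_file, restored)
    return sha256_of(restored) == expected_sha256
```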

Work Site Recovery

CAP Index personnel have the ability to work from any location with Internet access and do not require an office-provided Internet connection.

Application Service Event Recovery

CAP Index will develop a status page to provide real-time updates and inform our customers of the status of each service. The status page is updated with details about any event that may cause a service interruption or downtime.

A follow-up root-cause analysis (RCA) will be available to customers upon request after the event has concluded, providing further detail on the cause and the remediation plan going forward.

Event Service Levels

Short (hours)

  • Experience a short delay in service.
  • CAP Index will monitor the event and determine course of action. Escalation may be required.

Moderate (days)

  • Experience a modest delay in service where processes in flight may need to be restarted.
  • CAP Index will monitor the event and determine course of action. Escalation may be required.
  • CAP Index will notify customers of delay in service and provide updates on CAP Index’s status page.

Long (a week or more)

  • Experience a delay in service and processes in flight may need to be restarted.
  • CAP Index will monitor the event and determine course of action. Escalation may be required.
  • CAP Index will notify customers of delay in service and provide updates on CAP Index’s status page.
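
The duration cutoffs separating these levels are not stated precisely above, so the sketch below assumes illustrative boundaries (under a day is short, under a week is moderate, a week or more is long) and maps each level to the actions listed.

```python
# Hypothetical mapping from an event's expected duration to the service levels
# above. The exact duration boundaries are assumptions based on the level
# names ("hours", "days", "a week or more").
from datetime import timedelta


def event_service_level(expected_duration: timedelta) -> str:
    if expected_duration < timedelta(days=1):
        return "short"
    if expected_duration < timedelta(weeks=1):
        return "moderate"
    return "long"


def required_actions(level: str) -> list:
    """Actions drawn from the bullets above for each level."""
    actions = ["Monitor the event and determine a course of action; escalate if required."]
    if level in ("moderate", "long"):
        actions.append("Notify customers of the delay and post updates on the status page.")
    return actions


if __name__ == "__main__":
    level = event_service_level(timedelta(days=3))
    print(level, required_actions(level))
```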

Production Environments and Data Recovery

Production data is synchronized across multiple Azure volumes. Additionally, it is backed up to Azure Storage for long-term storage and recovery. In an event that requires data to be recovered, it will be retrieved from long-term storage.

CAP Index assumes that in the worst-case scenario, in which one of the production environments suffers a complete data loss, the environment will be reconstructed from code and the data restored from Azure Storage hosted within a different Azure resource group and geolocation.
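
As a hedged illustration of the restore path described above, the sketch below pulls backed-up blobs from Azure Storage using the azure-storage-blob SDK. The connection-string environment variable, container name, and restore directory are assumptions for illustration, not CAP Index's actual configuration or tooling.

```python
# Hypothetical restore helper: download every blob in a backup container from
# Azure Storage to a local restore directory. Requires the azure-storage-blob
# package; AZURE_STORAGE_CONNECTION_STRING and the "backups" container name
# are placeholder assumptions.
import os
from pathlib import Path

from azure.storage.blob import BlobServiceClient


def restore_from_long_term_storage(container_name: str = "backups",
                                   restore_dir: str = "./restore") -> None:
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    container = service.get_container_client(container_name)
    target = Path(restore_dir)
    target.mkdir(parents=True, exist_ok=True)

    for blob in container.list_blobs():
        destination = target / blob.name
        destination.parent.mkdir(parents=True, exist_ok=True)
        with destination.open("wb") as handle:
            container.download_blob(blob.name).readinto(handle)
```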

RTO Metrics

This RTO statement applies to all application services and associated databases that store and process sensitive data, ensuring that business operations are minimally impacted in the event of a disaster. It represents the maximum allowable downtime for services following a disaster, ensuring the recovery of the application and database within an acceptable timeframe.

  1. Application Layer Recovery Metrics

The recovery of the application is critical to the business. The following metrics define the acceptable limits for downtime:

  • RTO for Application Recovery: The application must be fully restored within 2 hours of the disaster incident.

  • Key Recovery Actions:

    • Automated application failover to a secondary region (if applicable).
    • Recovery from backups stored in Azure Blob Storage or snapshot replication.
    • Full testing of application functionality after restoration to ensure all services are available for tenant access.
  • Recovery Performance Metric:

    • Time to Initial Response: Within 15 minutes of identifying the disaster, a response team will begin executing the DR plan.
    • Time to Service Availability: Within 2 hours of disaster identification, the multi-tenant application should be operational for primary business functions.
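
A simple way to track these application-layer targets during an event is to timestamp each milestone and compare it against the stated limits, as in the hypothetical sketch below (the data structure and field names are illustrative only).

```python
# Hypothetical tracker for the application-layer recovery metrics above.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RESPONSE_TARGET = timedelta(minutes=15)   # time to initial response
AVAILABILITY_TARGET = timedelta(hours=2)  # RTO for application recovery


@dataclass
class ApplicationRecoveryTimeline:
    disaster_identified: datetime
    response_started: Optional[datetime] = None
    service_available: Optional[datetime] = None

    def meets_response_target(self) -> Optional[bool]:
        if self.response_started is None:
            return None
        return self.response_started - self.disaster_identified <= RESPONSE_TARGET

    def meets_rto(self) -> Optional[bool]:
        if self.service_available is None:
            return None
        return self.service_available - self.disaster_identified <= AVAILABILITY_TARGET
```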
  2. Database Layer Recovery Metrics (MSSQL)

The MSSQL database recovery is crucial, particularly for systems storing transactional or sensitive customer data. The following metrics define the acceptable limits for downtime:

  • RTO for Database Recovery: The MSSQL database must be fully restored within 3 hours of the disaster incident.

  • Key Recovery Actions:

    • Point-in-time restore using Azure Backup or transaction log backups.
    • Use of geo-replication or Always On Availability Groups to minimize downtime for multi-region architectures.
    • Validation of database integrity and performance post-recovery.
  • Recovery Performance Metric:

    • Time to Initial Response: Within 30 minutes of the disaster event, the database team will initiate recovery processes.
    • Time to Data Availability: The database should be accessible to the application and users within 3 hours of disaster identification.
    • Time to Full Data Integrity: Full data consistency and transaction logs should be fully restored within 6 hours.
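
For the database-layer validation step, a post-recovery check might confirm that the restored MSSQL instance answers queries and completes a consistency check. The sketch below assumes the pyodbc package and a placeholder connection string; teams should adapt it to the actual environment and confirm how DBCC output surfaces through their driver.

```python
# Hypothetical post-recovery validation for the MSSQL layer. The driver,
# server, and database names below are placeholder assumptions.
import pyodbc

CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=recovered-sql.example.internal;"
    "DATABASE=capindex;"
    "Trusted_Connection=yes;"
)


def validate_recovered_database(conn_str: str = CONN_STR) -> bool:
    """Return True if the database answers a trivial query and the consistency
    check completes without raising a database error."""
    try:
        with pyodbc.connect(conn_str, timeout=30) as conn:
            cursor = conn.cursor()
            cursor.execute("SELECT 1;")
            cursor.fetchone()
            # DBCC CHECKDB reports corruption as errors; NO_INFOMSGS keeps the
            # output quiet when the database is healthy.
            cursor.execute("DBCC CHECKDB WITH NO_INFOMSGS;")
    except pyodbc.Error:
        return False
    return True
```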
  3. Service Layer RTO Metrics (API Gateway / Web Services)

To support the multi-tenant application and MSSQL database, API services and web services must be restored promptly.

  • RTO for Service Recovery: Recovery of all exposed web APIs must occur within 3 hours to ensure the application and database can communicate effectively.

  • Key Recovery Actions:

    • Failover mechanisms for web services.
    • Restoration of API endpoints from backup configurations or codebase (IaC).
    • Verification of functional API endpoints through test cases post-recovery.
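
The endpoint verification step can be scripted as a lightweight smoke test, as in the hypothetical sketch below; the URLs are placeholders, not CAP Index's actual API endpoints.

```python
# Hypothetical post-recovery smoke test for the service layer: confirm each
# restored API endpoint responds with a 2xx status.
from urllib.request import urlopen
from urllib.error import URLError

ENDPOINTS = [
    "https://api.example.com/health",
    "https://api.example.com/v1/status",
]


def verify_endpoints(urls, timeout_seconds: float = 10.0) -> dict:
    """Return a mapping of URL -> True/False for a successful response."""
    results = {}
    for url in urls:
        try:
            with urlopen(url, timeout=timeout_seconds) as response:
                results[url] = 200 <= response.status < 300
        except (URLError, OSError):
            results[url] = False
    return results


if __name__ == "__main__":
    for url, ok in verify_endpoints(ENDPOINTS).items():
        print("PASS" if ok else "FAIL", url)
```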
  4. Backup and Storage Recovery Metrics

Efficient and fast access to backups and storage is vital to meeting the overall RTO.

  • RTO for Backup Restoration: The restoration of critical backup data from Azure Backup or other services must be completed within 3 hours.

  • Key Recovery Actions:

    • Ensure that backups are restored from the most recent and secure storage options.
    • Verify that the restored data is not corrupted and that the restoration process adheres to security protocols.

Recovery of production environments and data should follow the procedures listed above and in Data Management - Backup and Recovery.