Resilience and Business Continuity
Learning Objectives
- Explain why resilience is critical for CBDC systems
- Describe the key components of resilient architecture
- Identify common failure scenarios and mitigation strategies
- Analyze recovery time and recovery point objectives
- Evaluate the trade-offs in resilience design decisions
Imagine a major earthquake. Power is out. Cell towers are damaged. Banks are closed. In this moment, people still need to buy food, water, and supplies. Cash works. Will CBDC?
This is the resilience challenge. A CBDC isn't just another app—it's national infrastructure. It must have the availability and recovery capabilities of critical systems like power grids and telecommunications. Failure isn't just inconvenient; it's potentially catastrophic.
CBDC CRITICALITY
DEPENDENCY CHAIN:
┌────────────────────────────────────────────────┐
│ Citizens depend on CBDC for: │
│ │
│ - Daily transactions │
│ - Salary receipt │
│ - Bill payment │
│ - Emergency purchases │
│ - Government benefits │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ Businesses depend on CBDC for: │
│ │
│ - Customer payments │
│ - Supplier payments │
│ - Payroll │
│ - Cash flow │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ Economy depends on CBDC for: │
│ │
│ - Payment system function │
│ - Economic activity │
│ - Financial stability │
└────────────────────────────────────────────────┘
IF CBDC FAILS:

- People can't pay
- Businesses can't operate
- Economic activity stops
- Social instability possible

This is why CBDC is CRITICAL INFRASTRUCTURE.
CBDC AVAILABILITY TARGETS
STANDARD METRICS:

99.9% ("three nines"):
  8.76 hours downtime per year
  Insufficient for CBDC

99.99% ("four nines"):
  52.6 minutes downtime per year
  Minimum for payment systems

99.999% ("five nines"):
  5.26 minutes downtime per year
  Target for critical infrastructure
CBDC SHOULD TARGET:
99.99% minimum (four nines)
99.999% aspirational (five nines)
CONTEXT:
Visa: Claims 99.999%+ availability
Major banks: 99.95-99.99%
Stock exchanges: 99.99%+
CBDC must match or exceed
best-in-class financial infrastructure
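The downtime figures in the table above follow directly from the availability percentage. A minimal sketch of the arithmetic (the 8,760-hour year and "nines" labels match the table):

```python
# Convert an availability target into allowable downtime per year.

def downtime_per_year(availability_pct: float) -> float:
    """Return allowed downtime in minutes per year for a given availability %."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for nines, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{pct}% ({nines}): {downtime_per_year(pct):.2f} min/year")
# 99.9%  -> 525.60 min/year (8.76 hours)
# 99.99% -> 52.56 min/year
# 99.999% -> 5.26 min/year
```

Note how each extra "nine" cuts allowable downtime by a factor of ten, which is why the jump from four to five nines is so expensive.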
RECOVERY METRICS
RTO - RECOVERY TIME OBJECTIVE:
Maximum acceptable time to restore service
after disruption
RPO - RECOVERY POINT OBJECTIVE:
Maximum acceptable data loss
(time between last backup and failure)
FOR CBDC:
RTO Requirements:
┌────────────────────────────────────────────────┐
│ Minor incident: < 15 minutes │
│ Major incident: < 1 hour │
│ Disaster: < 4 hours │
│ Catastrophic: < 24 hours │
└────────────────────────────────────────────────┘
RPO Requirements:
┌────────────────────────────────────────────────┐
│ Target: Near-zero (seconds) │
│ Maximum: Minutes at most │
│ │
│ Money transactions cannot be "lost" │
│ Even minutes of lost data is serious │
└────────────────────────────────────────────────┘
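The RTO/RPO definitions above can be made concrete with a small check that compares an incident's actual recovery against the targets. All timestamps below are illustrative, and the one-hour/30-second targets are taken from the "major incident" and "near-zero" rows:

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=1)      # major-incident target from the table above
RPO = timedelta(seconds=30)   # "near-zero" data-loss target

failure_at   = datetime(2025, 1, 1, 12, 0, 0)
last_replica = datetime(2025, 1, 1, 11, 59, 55)  # last safely replicated write
restored_at  = datetime(2025, 1, 1, 12, 40, 0)

recovery_time = restored_at - failure_at   # actual RTO: 40 minutes
data_loss     = failure_at - last_replica  # actual RPO: 5 seconds of writes

print("RTO met:", recovery_time <= RTO)  # True
print("RPO met:", data_loss <= RPO)      # True
```

The RPO side is determined by replication, not restoration: synchronous replication drives `data_loss` toward zero, while backup-based recovery leaves it at the backup interval.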
---
REDUNDANCY LEVELS
N+1 REDUNDANCY:
One extra component beyond minimum needed
If one fails, system continues
Most basic level
N+2 REDUNDANCY:
Two extra components
Can survive two simultaneous failures
More robust
2N REDUNDANCY:
Complete duplicate system
Full capacity backup
Highest resilience
FOR CBDC:
Core systems: 2N (full redundancy)
Distribution: N+1 minimum
User-facing: Geographic distribution
NO SINGLE POINTS OF FAILURE:
Every critical component has backup
Every critical path has alternative
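The redundancy levels above reduce to simple arithmetic: with N units needed and k spares, the system survives any k simultaneous failures. A sketch:

```python
# N+k redundancy: capacity remains as long as failures <= spares.

def survives(needed: int, spares: int, failures: int) -> bool:
    """True if enough units remain after `failures` units are lost."""
    total = needed + spares
    return total - failures >= needed

# N+1: three units where two are needed
print(survives(needed=2, spares=1, failures=1))  # True
print(survives(needed=2, spares=1, failures=2))  # False
# 2N: a complete duplicate system
print(survives(needed=2, spares=2, failures=2))  # True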
GEOGRAPHIC RESILIENCE
ARCHITECTURE:
┌─────────────────────────────────────────────────┐
│ PRIMARY DATA CENTER │
│ (Region A) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Compute │ │ Storage │ │ Network │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└───────────────────────┬─────────────────────────┘
│ Real-time
│ Replication
┌───────────────────────┴─────────────────────────┐
│ SECONDARY DATA CENTER │
│ (Region B) │
│ (500+ km apart) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Compute │ │ Storage │ │ Network │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└───────────────────────┬─────────────────────────┘
│
│
┌───────────────────────┴─────────────────────────┐
│ TERTIARY (Disaster Recovery) │
│ (Region C) │
│ (Different country/continent) │
│ │
│ Asynchronous replication │
│ Warm standby capability │
└─────────────────────────────────────────────────┘
SITE SELECTION CRITERIA:

- Different seismic zones
- Different power grids
- Different network paths
- Different political jurisdictions (if appropriate)
DEPLOYMENT MODELS
ACTIVE-PASSIVE:
┌─────────────────────────────────────────────────┐
│ │
│ PRIMARY (Active) │
│ - Handles all traffic │
│ - Live operations │
│ │
│ SECONDARY (Passive) │
│ - Standby mode │
│ - Receives replicated data │
│ - Activated on failover │
│ │
│ Pros: Simpler, clear state │
│ Cons: Failover time, wasted capacity │
└─────────────────────────────────────────────────┘
ACTIVE-ACTIVE:
┌─────────────────────────────────────────────────┐
│ │
│ SITE A (Active) SITE B (Active) │
│ - 50% traffic - 50% traffic │
│ - Full capability - Full capability │
│ - Real-time sync - Real-time sync │
│ │
│ If one fails, other takes 100% │
│ │
│ Pros: No failover delay, uses all resources │
│ Cons: More complex, consistency challenges │
└─────────────────────────────────────────────────┘
FOR CBDC:
Active-active preferred for core systems
Eliminates failover delay
Maximizes resource utilization
But requires careful consistency management
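The active-active model above can be sketched as a traffic-share calculation: healthy sites split the load evenly, and when one is marked down the survivors absorb its share with no failover delay. Site names are illustrative:

```python
# Active-active routing sketch: only healthy sites receive traffic.

def traffic_shares(sites: dict[str, bool]) -> dict[str, float]:
    """Map site name -> traffic fraction, given each site's health."""
    healthy = [name for name, up in sites.items() if up]
    if not healthy:
        raise RuntimeError("total outage: no healthy sites")
    share = 1.0 / len(healthy)
    return {name: (share if up else 0.0) for name, up in sites.items()}

print(traffic_shares({"site-a": True, "site-b": True}))   # 50/50 split
print(traffic_shares({"site-a": False, "site-b": True}))  # site-b takes 100%
```

The hard part this sketch omits is the "careful consistency management" noted above: both sites must agree on account balances while each is accepting writes.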
INFRASTRUCTURE FAILURE SCENARIOS
SCENARIO 1: DATA CENTER LOSS
┌────────────────────────────────────────────────┐
│ Cause: Fire, flood, power failure │
│ │
│ Impact: Primary site unavailable │
│ │
│ Response: │
│ - Automatic failover to secondary │
│ - DNS redirect │
│ - Traffic routing update │
│ - Continue operations │
│ │
│ RTO: Minutes (with active-active) │
│ RPO: Seconds (with sync replication) │
└────────────────────────────────────────────────┘
SCENARIO 2: NETWORK PARTITION
┌────────────────────────────────────────────────┐
│ Cause: Cable cut, ISP failure, attack │
│ │
│ Impact: Region isolated from system │
│ │
│ Response: │
│ - Route around failure │
│ - Multiple network providers │
│ - Satellite backup (if available) │
│ - Offline capability for users │
│ │
│ Mitigation: Multi-path networking │
└────────────────────────────────────────────────┘
SCENARIO 3: DATABASE CORRUPTION
┌────────────────────────────────────────────────┐
│ Cause: Software bug, hardware failure │
│ │
│ Impact: Data integrity compromised │
│ │
│ Response: │
│ - Detect corruption │
│ - Isolate affected data │
│ - Restore from clean backup │
│ - Replay transactions if needed │
│ │
│ Prevention: Checksums, verification │
└────────────────────────────────────────────────┘
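The "checksums, verification" prevention step in Scenario 3 can be sketched as follows: store a digest alongside each record and re-verify on read, so silent corruption is caught before the data is trusted. The record bytes and the choice of SHA-256 are illustrative:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """Digest stored alongside the record at write time."""
    return hashlib.sha256(payload).hexdigest()

record = b"pay 100 from alice to bob"
stored_digest = checksum(record)  # recorded when the transaction is written

# Later, on read: even a single changed byte is detected.
corrupted = b"pay 900 from alice to bob"
print(checksum(record) == stored_digest)     # True: data intact
print(checksum(corrupted) == stored_digest)  # False: corruption detected
```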
CYBER ATTACK RESILIENCE
SCENARIO: DDoS ATTACK
┌────────────────────────────────────────────────┐
│ Cause: Massive traffic flood │
│ │
│ Defense: │
│ - DDoS mitigation services │
│ - Traffic scrubbing │
│ - Capacity headroom │
│ - Geographic distribution │
│ │
│ Goal: Absorb attack, maintain service │
└────────────────────────────────────────────────┘
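One building block of the "absorb attack" goal above is rate limiting at the edge, here sketched as a token bucket that serves legitimate bursts but sheds a flood's excess instead of letting it exhaust capacity. This is an illustrative fragment, not a full scrubbing service; the rate and burst numbers are assumptions:

```python
# Token-bucket rate limiter: tokens refill at `rate` per second up to `burst`.

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.capacity = rate, float(burst)
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now: float) -> bool:
        """`now`: seconds on a monotonic clock, supplied by the caller."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # request shed under flood

bucket = TokenBucket(rate=10, burst=5)
# A 100-request flood arriving within about a millisecond:
allowed = sum(bucket.allow(now=0.00001 * i) for i in range(100))
print(allowed)  # 5: only the burst allowance is served; the rest are shed
```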
SCENARIO: RANSOMWARE
┌────────────────────────────────────────────────┐
│ Cause: Encryption of systems │
│ │
│ Defense: │
│ - Air-gapped backups │
│ - Immutable backup storage │
│ - Rapid restore capability │
│ - Segmented systems │
│ │
│ Recovery: Restore from clean backups │
│ Never pay ransom │
└────────────────────────────────────────────────┘
SCENARIO: SYSTEM COMPROMISE
┌────────────────────────────────────────────────┐
│ Cause: Attacker gains access │
│ │
│ Response: │
│ - Detect intrusion │
│ - Isolate compromised systems │
│ - Activate clean standby │
│ - Forensic investigation │
│ │
│ Goal: Contain, recover, learn │
└────────────────────────────────────────────────┘
NATURAL DISASTER RESILIENCE
REGIONAL DISASTER:
┌────────────────────────────────────────────────┐
│ Events: Earthquake, hurricane, flood │
│ │
│ Impact: │
│ - Infrastructure damage │
│ - Power outages │
│ - Network disruption │
│ - Staff unavailable │
│ │
│ Response: │
│ - Failover to distant site │
│ - Remote operations │
│ - Extended autonomous operation │
│ - Offline capability for users │
└────────────────────────────────────────────────┘
PANDEMIC / PROLONGED EVENT:
┌────────────────────────────────────────────────┐
│ Impact: │
│ - Staff availability reduced │
│ - Physical access restricted │
│ - Extended remote operations │
│ │
│ Response: │
│ - Full remote operation capability │
│ - Automated systems │
│ - Reduced staffing mode │
│ - Extended autonomous operation │
└────────────────────────────────────────────────┘
KEY PRINCIPLE:
Design for worst plausible scenario
Not just likely scenarios
OPERATIONAL VISIBILITY
MONITORING LAYERS:

Infrastructure:
- Server health
- Network status
- Storage capacity
- Power systems

Application:
- Transaction processing
- Response times
- Error rates
- Queue depths

Business:
- Transaction volumes
- Success rates
- User activity
- Anomaly detection
ALERTING:
┌────────────────────────────────────────────────┐
│ Severity 1 (Critical): │
│ - System down │
│ - Immediate response required │
│ - 24/7 on-call activation │
│ │
│ Severity 2 (High): │
│ - Degraded performance │
│ - Response within minutes │
│ │
│ Severity 3 (Medium): │
│ - Warning condition │
│ - Response within hours │
│ │
│ Severity 4 (Low): │
│ - Informational │
│ - Scheduled attention │
└────────────────────────────────────────────────┘
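The severity tiers above are typically driven by monitored signals. A minimal sketch of that mapping; the error-rate and latency thresholds are illustrative assumptions, not standards:

```python
# Map monitored signals to the four severity tiers described above.

def severity(error_rate: float, p99_latency_ms: float, system_up: bool) -> int:
    if not system_up:
        return 1  # Critical: system down, page on-call immediately
    if error_rate > 0.05 or p99_latency_ms > 2000:
        return 2  # High: degraded performance, respond within minutes
    if error_rate > 0.01 or p99_latency_ms > 500:
        return 3  # Medium: warning condition, respond within hours
    return 4      # Low: informational, scheduled attention

print(severity(error_rate=0.001, p99_latency_ms=120, system_up=True))   # 4
print(severity(error_rate=0.08,  p99_latency_ms=300, system_up=True))   # 2
print(severity(error_rate=0.0,   p99_latency_ms=100, system_up=False))  # 1
```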
RESILIENCE TESTING
FAILOVER TESTING:
- Monthly: Automated failover
- Quarterly: Full site failover
- Annual: Disaster recovery drill

BACKUP TESTING:
- Weekly: Backup verification
- Monthly: Restore test
- Annual: Full restore exercise

CHAOS ENGINEERING:
- Controlled failure injection
- Test system response
- Find weaknesses before attackers do

TABLETOP EXERCISES:
- Walk through scenarios
- Test decision-making
- Identify gaps
- Train staff

DISASTER RECOVERY DRILLS:
- Simulate actual disaster
- Execute recovery procedures
- Time and measure
- Learn and improve
CONTROLLED CHANGE
WHY IT MATTERS:
Most outages caused by changes
Controlled change = controlled risk
CHANGE PROCESS:
┌────────────────────────────────────────────────┐
│ 1. PLAN │
│ - Document change │
│ - Risk assessment │
│ - Rollback procedure │
│ │
│ 2. REVIEW │
│ - Peer review │
│ - CAB approval (for significant changes) │
│ │
│ 3. TEST │
│ - Non-production first │
│ - Verify functionality │
│ │
│ 4. IMPLEMENT │
│ - Scheduled window │
│ - Monitored execution │
│ │
│ 5. VERIFY │
│ - Confirm success │
│ - Monitor for issues │
│ │
│ 6. ROLLBACK (if needed) │
│ - Execute rollback plan │
│ - Return to known good state │
└────────────────────────────────────────────────┘
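Steps 4 through 6 of the change process can be sketched as: capture the known-good state, apply the change, verify, and roll back automatically on failure. The config keys and verification check are illustrative:

```python
# Apply a change with an automatic rollback path (steps 4-6 above).

def apply_change(config: dict, change: dict, verify) -> dict:
    known_good = dict(config)          # rollback state captured up front
    candidate = {**config, **change}   # 4. implement
    if verify(candidate):              # 5. verify
        return candidate               #    confirm success
    return known_good                  # 6. rollback to known good state

config = {"max_tps": 10_000, "replicas": 3}
check = lambda c: c["replicas"] >= 3   # minimum-redundancy invariant

ok  = apply_change(config, {"replicas": 5}, verify=check)
bad = apply_change(config, {"replicas": 0}, verify=check)
print(ok["replicas"])   # 5: change kept
print(bad["replicas"])  # 3: rolled back
```

The key property is that the rollback plan exists before the change is applied, matching step 1's "rollback procedure" requirement.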
GRACEFUL DEGRADATION
LEVEL 0 - FULL SERVICE:
- All features available
- Normal operations

LEVEL 1 - MINOR DEGRADATION:
- Core transactions work
- Some advanced features disabled
- Slightly slower performance
- Users may not notice

LEVEL 2 - SIGNIFICANT DEGRADATION:
- Essential transactions only
- New features disabled
- Noticeable latency
- Communication to users

LEVEL 3 - SEVERE DEGRADATION:
- Basic payments only
- Strict rate limits
- Queuing for transactions
- Major user communication

LEVEL 4 - EMERGENCY MODE:
- Minimal functionality
- Offline capability activated
- Prepare for extended outage
- Crisis communication
PRINCIPLE:
Better limited service than no service
Degrade gracefully, not catastrophically
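Graceful degradation is often implemented as feature tiers switched off as system health drops, so core payments survive the longest. A minimal sketch; the feature names and health thresholds are illustrative assumptions:

```python
# Feature tiers, from full service down to basic payments only.
# Each entry: (minimum health to qualify, features enabled at that tier).
TIERS = [
    (0.90, {"payments", "transfers", "analytics", "new_signups"}),  # full service
    (0.70, {"payments", "transfers", "analytics"}),                 # advanced features off
    (0.40, {"payments", "transfers"}),                              # essentials only
    (0.00, {"payments"}),                                           # basic payments only
]

def enabled_features(health: float) -> set[str]:
    """health in [0, 1]; return the feature set for the current tier."""
    for threshold, features in TIERS:
        if health >= threshold:
            return features
    return {"payments"}  # never degrade below basic payments

print(sorted(enabled_features(0.95)))  # everything on
print(sorted(enabled_features(0.50)))  # essentials only
print(sorted(enabled_features(0.10)))  # basic payments only
```

The design choice here is that degradation is pre-planned, not improvised: each tier is defined, tested, and communicated before it is ever needed.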
COMMUNICATION DURING INCIDENTS
CHANNELS:

- In-app notifications
- Status page
- Social media
- SMS alerts
- Official announcements
COMMUNICATION TIMING:
┌────────────────────────────────────────────────┐
│ T+0: Incident detected │
│ Internal escalation │
│ │
│ T+5 min: Initial assessment │
│ Decide on user communication │
│ │
│ T+15 min: First user notification │
│ (if significant impact) │
│ │
│ T+30 min: Update with expected resolution │
│ │
│ Ongoing: Regular updates │
│ │
│ Resolution: "All clear" notification │
│ Post-incident summary │
└────────────────────────────────────────────────┘
PRINCIPLES:

- Be transparent about issues
- Be clear about impact
- Be honest about timeline
- Provide regular updates
- Don't overpromise
WHAT WORKS:

✅ Redundancy and distribution are essential—single points of failure are unacceptable.
✅ Testing is critical—untested recovery plans fail when needed.
✅ Graceful degradation is possible—systems can provide limited service during problems.

OPEN CHALLENGES:

⚠️ Offline capability at scale—limited real-world testing.
⚠️ Recovery from novel attacks—unknown scenarios can't be fully planned.
⚠️ Multi-site coordination complexity—active-active is hard.

COMMON PITFALLS:

📌 Assuming backups work—untested backups often fail.
📌 Underestimating disaster scenarios—worst cases do happen.
📌 Neglecting change management—changes cause outages.
CBDC resilience requires the same rigor as other critical infrastructure—power grids, telecommunications. This means redundancy, geographic distribution, regular testing, and graceful degradation. The cash replacement claim demands cash-equivalent availability, which is a high bar requiring significant investment.
Assignment: Design a resilience architecture for a hypothetical CBDC, including redundancy, geographic distribution, and recovery procedures.
Time Investment: 3-4 hours
End of Lesson 17
Course 58: CBDC Architecture & Design
Lesson 17 of 20
Key Takeaways

- CBDC is critical infrastructure: Failure affects citizens, businesses, and the economy—availability requirements are extreme.
- No single points of failure: Every critical component needs redundancy; every critical path needs alternatives.
- Geographic distribution is essential: Sites in different regions protect against regional disasters.
- Testing validates resilience: Untested disaster recovery plans fail when needed.
- Graceful degradation maintains service: Limited service during problems is better than complete failure.