Resilience and Business Continuity
Learning Objectives
- Explain why resilience is critical for CBDC systems
- Describe the key components of resilient architecture
- Identify common failure scenarios and mitigation strategies
- Analyze recovery time and recovery point objectives
- Evaluate the trade-offs in resilience design decisions
Imagine a major earthquake. Power is out. Cell towers are damaged. Banks are closed. In this moment, people still need to buy food, water, and supplies. Cash works. Will CBDC?
This is the resilience challenge. A CBDC isn't just another app—it's national infrastructure. It must have the availability and recovery capabilities of critical systems like power grids and telecommunications. Failure isn't just inconvenient; it's potentially catastrophic.
CBDC CRITICALITY
DEPENDENCY CHAIN:
┌────────────────────────────────────────────────┐
│ Citizens depend on CBDC for: │
│ │
│ - Daily transactions │
│ - Salary receipt │
│ - Bill payment │
│ - Emergency purchases │
│ - Government benefits │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ Businesses depend on CBDC for: │
│ │
│ - Customer payments │
│ - Supplier payments │
│ - Payroll │
│ - Cash flow │
└────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ Economy depends on CBDC for: │
│ │
│ - Payment system function │
│ - Economic activity │
│ - Financial stability │
└────────────────────────────────────────────────┘
IF CBDC FAILS:

- People can't pay
- Businesses can't operate
- Economic activity stops
- Social instability possible

This is why CBDC is CRITICAL INFRASTRUCTURE.
CBDC AVAILABILITY TARGETS
STANDARD METRICS:

99.9% ("three nines"):
  8.76 hours downtime per year
  Insufficient for CBDC

99.99% ("four nines"):
  52.6 minutes downtime per year
  Minimum for payment systems

99.999% ("five nines"):
  5.26 minutes downtime per year
  Target for critical infrastructure
CBDC SHOULD TARGET:
99.99% minimum (four nines)
99.999% aspirational (five nines)
CONTEXT:
Visa: Claims 99.999%+ availability
Major banks: 99.95-99.99%
Stock exchanges: 99.99%+
CBDC must match or exceed
best-in-class financial infrastructure
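The downtime figures in the table above follow directly from the availability percentage. A minimal sketch of the arithmetic (the 8,760-hour year and "nines" labels match the table):

```python
# Convert an availability target into allowable downtime per year.

def downtime_per_year(availability_pct: float) -> float:
    """Return allowed downtime in minutes per year for a given availability %."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_pct / 100)

for nines, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    print(f"{pct}% ({nines}): {downtime_per_year(pct):.2f} min/year")
# 99.9%  -> 525.60 min/year (8.76 hours)
# 99.99% -> 52.56 min/year
# 99.999% -> 5.26 min/year
```

Note how each extra "nine" cuts allowable downtime by a factor of ten, which is why the jump from four to five nines is so expensive.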
RECOVERY METRICS
RTO - RECOVERY TIME OBJECTIVE:
Maximum acceptable time to restore service
after disruption
RPO - RECOVERY POINT OBJECTIVE:
Maximum acceptable data loss
(time between last backup and failure)
FOR CBDC:
RTO Requirements:
┌────────────────────────────────────────────────┐
│ Minor incident: < 15 minutes │
│ Major incident: < 1 hour │
│ Disaster: < 4 hours │
│ Catastrophic: < 24 hours │
└────────────────────────────────────────────────┘
RPO Requirements:
┌────────────────────────────────────────────────┐
│ Target: Near-zero (seconds) │
│ Maximum: Minutes at most │
│ │
│ Money transactions cannot be "lost" │
│ Even minutes of lost data is serious │
└────────────────────────────────────────────────┘
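The RTO/RPO definitions above can be made concrete with a small check that compares an incident's actual recovery against the targets. All timestamps below are illustrative, and the one-hour/30-second targets are taken from the "major incident" and "near-zero" rows:

```python
from datetime import datetime, timedelta

RTO = timedelta(hours=1)      # major-incident target from the table above
RPO = timedelta(seconds=30)   # "near-zero" data-loss target

failure_at   = datetime(2025, 1, 1, 12, 0, 0)
last_replica = datetime(2025, 1, 1, 11, 59, 55)  # last safely replicated write
restored_at  = datetime(2025, 1, 1, 12, 40, 0)

recovery_time = restored_at - failure_at   # actual RTO: 40 minutes
data_loss     = failure_at - last_replica  # actual RPO: 5 seconds of writes

print("RTO met:", recovery_time <= RTO)  # True
print("RPO met:", data_loss <= RPO)      # True
```

The RPO side is determined by replication, not restoration: synchronous replication drives `data_loss` toward zero, while backup-based recovery leaves it at the backup interval.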
---
REDUNDANCY LEVELS
N+1 REDUNDANCY:
One extra component beyond minimum needed
If one fails, system continues
Most basic level
N+2 REDUNDANCY:
Two extra components
Can survive two simultaneous failures
More robust
2N REDUNDANCY:
Complete duplicate system
Full capacity backup
Highest resilience
FOR CBDC:
Core systems: 2N (full redundancy)
Distribution: N+1 minimum
User-facing: Geographic distribution
NO SINGLE POINTS OF FAILURE:
Every critical component has backup
Every critical path has alternative
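The redundancy levels above reduce to simple arithmetic: with N units needed and k spares, the system survives any k simultaneous failures. A sketch:

```python
# N+k redundancy: capacity remains as long as failures <= spares.

def survives(needed: int, spares: int, failures: int) -> bool:
    """True if enough units remain after `failures` units are lost."""
    total = needed + spares
    return total - failures >= needed

# N+1: three units where two are needed
print(survives(needed=2, spares=1, failures=1))  # True
print(survives(needed=2, spares=1, failures=2))  # False
# 2N: a complete duplicate system
print(survives(needed=2, spares=2, failures=2))  # True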
GEOGRAPHIC RESILIENCE
ARCHITECTURE:
┌─────────────────────────────────────────────────┐
│ PRIMARY DATA CENTER │
│ (Region A) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Compute │ │ Storage │ │ Network │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└───────────────────────┬─────────────────────────┘
│ Real-time
│ Replication
┌───────────────────────┴─────────────────────────┐
│ SECONDARY DATA CENTER │
│ (Region B) │
│ (500+ km apart) │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Compute │ │ Storage │ │ Network │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└───────────────────────┬─────────────────────────┘
│
│
┌───────────────────────┴─────────────────────────┐
│ TERTIARY (Disaster Recovery) │
│ (Region C) │
│ (Different country/continent) │
│ │
│ Asynchronous replication │
│ Warm standby capability │
└─────────────────────────────────────────────────┘
SITE SELECTION CRITERIA:

- Different seismic zones
- Different power grids
- Different network paths
- Different political jurisdictions (if appropriate)
DEPLOYMENT MODELS
ACTIVE-PASSIVE:
┌─────────────────────────────────────────────────┐
│ │
│ PRIMARY (Active) │
│ - Handles all traffic │
│ - Live operations │
│ │
│ SECONDARY (Passive) │
│ - Standby mode │
│ - Receives replicated data │
│ - Activated on failover │
│ │
│ Pros: Simpler, clear state │
│ Cons: Failover time, wasted capacity │
└─────────────────────────────────────────────────┘
ACTIVE-ACTIVE:
┌─────────────────────────────────────────────────┐
│ │
│ SITE A (Active) SITE B (Active) │
│ - 50% traffic - 50% traffic │
│ - Full capability - Full capability │
│ - Real-time sync - Real-time sync │
│ │
│ If one fails, other takes 100% │
│ │
│ Pros: No failover delay, uses all resources │
│ Cons: More complex, consistency challenges │
└─────────────────────────────────────────────────┘
FOR CBDC:
Active-active preferred for core systems
Eliminates failover delay
Maximizes resource utilization
But requires careful consistency management
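The active-active model above can be sketched as a traffic-share calculation: healthy sites split the load evenly, and when one is marked down the survivors absorb its share with no failover delay. Site names are illustrative:

```python
# Active-active routing sketch: only healthy sites receive traffic.

def traffic_shares(sites: dict[str, bool]) -> dict[str, float]:
    """Map site name -> traffic fraction, given each site's health."""
    healthy = [name for name, up in sites.items() if up]
    if not healthy:
        raise RuntimeError("total outage: no healthy sites")
    share = 1.0 / len(healthy)
    return {name: (share if up else 0.0) for name, up in sites.items()}

print(traffic_shares({"site-a": True, "site-b": True}))   # 50/50 split
print(traffic_shares({"site-a": False, "site-b": True}))  # site-b takes 100%
```

The hard part this sketch omits is the "careful consistency management" noted above: both sites must agree on account balances while each is accepting writes.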
INFRASTRUCTURE FAILURE SCENARIOS
SCENARIO 1: DATA CENTER LOSS
┌────────────────────────────────────────────────┐
│ Cause: Fire, flood, power failure │
│ │
│ Impact: Primary site unavailable │
│ │
│ Response: │
│ - Automatic failover to secondary │
│ - DNS redirect │
│ - Traffic routing update │
│ - Continue operations │
│ │
│ RTO: Minutes (with active-active) │
│ RPO: Seconds (with sync replication) │
└────────────────────────────────────────────────┘
SCENARIO 2: NETWORK PARTITION
┌────────────────────────────────────────────────┐
│ Cause: Cable cut, ISP failure, attack │
│ │
│ Impact: Region isolated from system │
│ │
│ Response: │
│ - Route around failure │
│ - Multiple network providers │
│ - Satellite backup (if available) │
│ - Offline capability for users │
│ │
│ Mitigation: Multi-path networking │
└────────────────────────────────────────────────┘
SCENARIO 3: DATABASE CORRUPTION
┌────────────────────────────────────────────────┐
│ Cause: Software bug, hardware failure │
│ │
│ Impact: Data integrity compromised │
│ │
│ Response: │
│ - Detect corruption │
│ - Isolate affected data │
│ - Restore from clean backup │
│ - Replay transactions if needed │
│ │
│ Prevention: Checksums, verification │
└────────────────────────────────────────────────┘
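The "checksums, verification" prevention step in Scenario 3 can be sketched as follows: store a digest alongside each record and re-verify on read, so silent corruption is caught before the data is trusted. The record bytes and the choice of SHA-256 are illustrative:

```python
import hashlib

def checksum(payload: bytes) -> str:
    """Digest stored alongside the record at write time."""
    return hashlib.sha256(payload).hexdigest()

record = b"pay 100 from alice to bob"
stored_digest = checksum(record)  # recorded when the transaction is written

# Later, on read: even a single changed byte is detected.
corrupted = b"pay 900 from alice to bob"
print(checksum(record) == stored_digest)     # True: data intact
print(checksum(corrupted) == stored_digest)  # False: corruption detected
```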
CYBER ATTACK RESILIENCE
SCENARIO: DDoS ATTACK
┌────────────────────────────────────────────────┐
│ Cause: Massive traffic flood │
│ │
│ Defense: │
│ - DDoS mitigation services │
│ - Traffic scrubbing │
│ - Capacity headroom │
│ - Geographic distribution │
│ │
│ Goal: Absorb attack, maintain service │
└────────────────────────────────────────────────┘
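One building block of the "absorb attack" goal above is rate limiting at the edge, here sketched as a token bucket that serves legitimate bursts but sheds a flood's excess instead of letting it exhaust capacity. This is an illustrative fragment, not a full scrubbing service; the rate and burst numbers are assumptions:

```python
# Token-bucket rate limiter: tokens refill at `rate` per second up to `burst`.

class TokenBucket:
    def __init__(self, rate: float, burst: int):
        self.rate, self.capacity = rate, float(burst)
        self.tokens, self.last = float(burst), 0.0

    def allow(self, now: float) -> bool:
        """`now`: seconds on a monotonic clock, supplied by the caller."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # request shed under flood

bucket = TokenBucket(rate=10, burst=5)
# A 100-request flood arriving within about a millisecond:
allowed = sum(bucket.allow(now=0.00001 * i) for i in range(100))
print(allowed)  # 5: only the burst allowance is served; the rest are shed
```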
SCENARIO: RANSOMWARE
┌────────────────────────────────────────────────┐
│ Cause: Encryption of systems │
│ │
│ Defense: │
│ - Air-gapped backups │
│ - Immutable backup storage │
│ - Rapid restore capability │
│ - Segmented systems │
│ │
│ Recovery: Restore from clean backups │
│ Never pay ransom │
└────────────────────────────────────────────────┘
SCENARIO: SYSTEM COMPROMISE
┌────────────────────────────────────────────────┐
│ Cause: Attacker gains access │
│ │
│ Response: │
│ - Detect intrusion │
│ - Isolate compromised systems │
│ - Activate clean standby │
│ - Forensic investigation │
│ │
│ Goal: Contain, recover, learn │
└────────────────────────────────────────────────┘
NATURAL DISASTER RESILIENCE
REGIONAL DISASTER:
┌────────────────────────────────────────────────┐
│ Events: Earthquake, hurricane, flood │
│ │
│ Impact: │
│ - Infrastructure damage │
│ - Power outages │
│ - Network disruption │
│ - Staff unavailable │
│ │
│ Response: │
│ - Failover to distant site │
│ - Remote operations │
│ - Extended autonomous operation │
│ - Offline capability for users │
└────────────────────────────────────────────────┘
PANDEMIC / PROLONGED EVENT:
┌────────────────────────────────────────────────┐
│ Impact: │
│ - Staff availability reduced │
│ - Physical access restricted │
│ - Extended remote operations │
│ │
│ Response: │
│ - Full remote operation capability │
│ - Automated systems │
│ - Reduced staffing mode │
│ - Extended autonomous operation │
└────────────────────────────────────────────────┘
KEY PRINCIPLE:
Design for worst plausible scenario
Not just likely scenarios
OPERATIONAL VISIBILITY
MONITORING LAYERS:

Infrastructure:
- Server health
- Network status
- Storage capacity
- Power systems

Application:
- Transaction processing
- Response times
- Error rates
- Queue depths

Business:
- Transaction volumes
- Success rates
- User activity
- Anomaly detection
ALERTING:
┌────────────────────────────────────────────────┐
│ Severity 1 (Critical): │
│ - System down │
│ - Immediate response required │
│ - 24/7 on-call activation │
│ │
│ Severity 2 (High): │
│ - Degraded performance │
│ - Response within minutes │
│ │
│ Severity 3 (Medium): │
│ - Warning condition │
│ - Response within hours │
│ │
│ Severity 4 (Low): │
│ - Informational │
│ - Scheduled attention │
└────────────────────────────────────────────────┘
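The severity tiers above are typically driven by monitored signals. A minimal sketch of that mapping; the error-rate and latency thresholds are illustrative assumptions, not standards:

```python
# Map monitored signals to the four severity tiers described above.

def severity(error_rate: float, p99_latency_ms: float, system_up: bool) -> int:
    if not system_up:
        return 1  # Critical: system down, page on-call immediately
    if error_rate > 0.05 or p99_latency_ms > 2000:
        return 2  # High: degraded performance, respond within minutes
    if error_rate > 0.01 or p99_latency_ms > 500:
        return 3  # Medium: warning condition, respond within hours
    return 4      # Low: informational, scheduled attention

print(severity(error_rate=0.001, p99_latency_ms=120, system_up=True))   # 4
print(severity(error_rate=0.08,  p99_latency_ms=300, system_up=True))   # 2
print(severity(error_rate=0.0,   p99_latency_ms=100, system_up=False))  # 1
```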
RESILIENCE TESTING
FAILOVER TESTING:
- Monthly: Automated failover
- Quarterly: Full site failover
- Annual: Disaster recovery drill

BACKUP TESTING:
- Weekly: Backup verification
- Monthly: Restore test
- Annual: Full restore exercise

CHAOS ENGINEERING:
- Controlled failure injection
- Test system response
- Find weaknesses before attackers do

TABLETOP EXERCISES:
- Walk through scenarios
- Test decision-making
- Identify gaps
- Train staff

DISASTER RECOVERY DRILLS:
- Simulate actual disaster
- Execute recovery procedures
- Time and measure
- Learn and improve
CONTROLLED CHANGE
WHY IT MATTERS:
Most outages caused by changes
Controlled change = controlled risk
CHANGE PROCESS:
┌────────────────────────────────────────────────┐
│ 1. PLAN │
│ - Document change │
│ - Risk assessment │
│ - Rollback procedure │
│ │
│ 2. REVIEW │
│ - Peer review │
│ - CAB approval (for significant changes) │
│ │
│ 3. TEST │
│ - Non-production first │
│ - Verify functionality │
│ │
│ 4. IMPLEMENT │
│ - Scheduled window │
│ - Monitored execution │
│ │
│ 5. VERIFY │
│ - Confirm success │
│ - Monitor for issues │
│ │
│ 6. ROLLBACK (if needed) │
│ - Execute rollback plan │
│ - Return to known good state │
└────────────────────────────────────────────────┘
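Steps 4 through 6 of the change process can be sketched as: capture the known-good state, apply the change, verify, and roll back automatically on failure. The config keys and verification check are illustrative:

```python
# Apply a change with an automatic rollback path (steps 4-6 above).

def apply_change(config: dict, change: dict, verify) -> dict:
    known_good = dict(config)          # rollback state captured up front
    candidate = {**config, **change}   # 4. implement
    if verify(candidate):              # 5. verify
        return candidate               #    confirm success
    return known_good                  # 6. rollback to known good state

config = {"max_tps": 10_000, "replicas": 3}
check = lambda c: c["replicas"] >= 3   # minimum-redundancy invariant

ok  = apply_change(config, {"replicas": 5}, verify=check)
bad = apply_change(config, {"replicas": 0}, verify=check)
print(ok["replicas"])   # 5: change kept
print(bad["replicas"])  # 3: rolled back
```

The key property is that the rollback plan exists before the change is applied, matching step 1's "rollback procedure" requirement.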
GRACEFUL DEGRADATION
LEVEL 0 - FULL SERVICE:
- All features available
- Normal operations

LEVEL 1 - MINOR DEGRADATION:
- Core transactions work
- Some advanced features disabled
- Slightly slower performance
- Users may not notice

LEVEL 2 - SIGNIFICANT DEGRADATION:
- Essential transactions only
- New features disabled
- Noticeable latency
- Communication to users

LEVEL 3 - SEVERE DEGRADATION:
- Basic payments only
- Strict rate limits
- Queuing for transactions
- Major user communication

LEVEL 4 - EMERGENCY MODE:
- Minimal functionality
- Offline capability activated
- Prepare for extended outage
- Crisis communication
PRINCIPLE:
Better limited service than no service
Degrade gracefully, not catastrophically
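Graceful degradation is often implemented as feature tiers switched off as system health drops, so core payments survive the longest. A minimal sketch; the feature names and health thresholds are illustrative assumptions:

```python
# Feature tiers, from full service down to basic payments only.
# Each entry: (minimum health to qualify, features enabled at that tier).
TIERS = [
    (0.90, {"payments", "transfers", "analytics", "new_signups"}),  # full service
    (0.70, {"payments", "transfers", "analytics"}),                 # advanced features off
    (0.40, {"payments", "transfers"}),                              # essentials only
    (0.00, {"payments"}),                                           # basic payments only
]

def enabled_features(health: float) -> set[str]:
    """health in [0, 1]; return the feature set for the current tier."""
    for threshold, features in TIERS:
        if health >= threshold:
            return features
    return {"payments"}  # never degrade below basic payments

print(sorted(enabled_features(0.95)))  # everything on
print(sorted(enabled_features(0.50)))  # essentials only
print(sorted(enabled_features(0.10)))  # basic payments only
```

The design choice here is that degradation is pre-planned, not improvised: each tier is defined, tested, and communicated before it is ever needed.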
COMMUNICATION DURING INCIDENTS
CHANNELS:

- In-app notifications
- Status page
- Social media
- SMS alerts
- Official announcements
COMMUNICATION TIMING:
┌────────────────────────────────────────────────┐
│ T+0: Incident detected │
│ Internal escalation │
│ │
│ T+5 min: Initial assessment │
│ Decide on user communication │
│ │
│ T+15 min: First user notification │
│ (if significant impact) │
│ │
│ T+30 min: Update with expected resolution │
│ │
│ Ongoing: Regular updates │
│ │
│ Resolution: "All clear" notification │
│ Post-incident summary │
└────────────────────────────────────────────────┘
PRINCIPLES:

- Be transparent about issues
- Be clear about impact
- Be honest about timeline
- Provide regular updates
- Don't overpromise
WHAT WORKS:

✅ Redundancy and distribution are essential—single points of failure are unacceptable.
✅ Testing is critical—untested recovery plans fail when needed.
✅ Graceful degradation is possible—systems can provide limited service during problems.

OPEN CHALLENGES:

⚠️ Offline capability at scale—limited real-world testing.
⚠️ Recovery from novel attacks—unknown scenarios can't be fully planned.
⚠️ Multi-site coordination complexity—active-active is hard.

COMMON PITFALLS:

📌 Assuming backups work—untested backups often fail.
📌 Underestimating disaster scenarios—worst cases do happen.
📌 Neglecting change management—changes cause outages.
CBDC resilience requires the same rigor as other critical infrastructure—power grids, telecommunications. This means redundancy, geographic distribution, regular testing, and graceful degradation. The cash replacement claim demands cash-equivalent availability, which is a high bar requiring significant investment.
Assignment: Design a resilience architecture for a hypothetical CBDC, including redundancy, geographic distribution, and recovery procedures.
Time Investment: 3-4 hours
End of Lesson 17
Course 58: CBDC Architecture & Design
Lesson 17 of 20
Key Takeaways

- CBDC is critical infrastructure: Failure affects citizens, businesses, and the economy—availability requirements are extreme.
- No single points of failure: Every critical component needs redundancy; every critical path needs alternatives.
- Geographic distribution is essential: Sites in different regions protect against regional disasters.
- Testing validates resilience: Untested disaster recovery plans fail when needed.
- Graceful degradation maintains service: Limited service during problems is better than complete failure.