Disaster Recovery: Planning for When Everything Goes Wrong
It is 3am. Your primary database is corrupted. Your last backup is from 6 hours ago. You have no documented recovery procedure. Your on-call engineer is trying to remember the restore command from memory. Six hours of customer data is gone. Your CEO is calling.
Disaster recovery is not about preventing disasters. It is about knowing exactly what you will do when they happen, having practiced it, and being able to execute it under pressure.
Key concepts
RTO (Recovery Time Objective): The maximum acceptable time to restore service after a failure. “We must be back online within 4 hours.”
RPO (Recovery Point Objective): The maximum acceptable data loss, measured in time. “We can lose at most 1 hour of data.”
MTTR (Mean Time To Recovery): The average time to recover from a failure. Lower is better.
RTO and RPO drive your architecture. Lower RTO requires more automation and redundancy. Lower RPO requires more frequent backups or synchronous replication. Both cost money.
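To make the relationship between the three numbers concrete, here is a minimal sketch. The incident durations, the 4-hour RTO, the 1-hour RPO, and the 6-hour backup interval are made-up values, and the function names are illustrative only.

```python
from datetime import timedelta


def mean_time_to_recovery(recovery_times: list[timedelta]) -> timedelta:
    """Average of the observed recovery durations."""
    if not recovery_times:
        raise ValueError("need at least one incident")
    return sum(recovery_times, timedelta()) / len(recovery_times)


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, a failure just before the next backup
    loses almost one full interval of data."""
    return backup_interval


# Hypothetical incident history and targets.
incidents = [timedelta(minutes=45), timedelta(hours=3), timedelta(hours=5)]
rto = timedelta(hours=4)
rpo = timedelta(hours=1)

mttr = mean_time_to_recovery(incidents)          # 2 hours 55 minutes
breaches = [d for d in incidents if d > rto]     # the 5-hour incident
meets_rpo = worst_case_data_loss(timedelta(hours=6)) <= rpo  # False
```

Note that the MTTR here is under the RTO even though one incident breached it: RTO is a maximum for every incident, while MTTR is only an average.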
DR strategies
Backup and restore
The simplest strategy. Take regular backups. When disaster strikes, restore from backup.
RTO: Hours to days (depends on backup size and restore speed). RPO: Time since last backup (up to 24 hours with daily backups, minutes if continuous). Cost: Low (just storage for backups).
Use for: Non-critical systems, development environments, systems where hours of downtime are acceptable.
Pilot light
A minimal version of the system runs in the DR region. The database is replicated. Application servers are stopped (or running at minimal capacity). When disaster strikes, scale up the application servers and switch traffic.
RTO: 30 minutes to 2 hours (time to scale up application servers). RPO: Minutes (near-real-time database replication). Cost: Low to medium (a replicated database plus minimal compute in the DR region).
Use for: Systems where 1-2 hours of downtime is acceptable.
Warm standby
A scaled-down but fully functional version of the system runs in the DR region. It can handle a fraction of production traffic. When disaster strikes, scale up to full capacity and switch traffic.
RTO: Minutes (scale up is faster than starting from scratch). RPO: Seconds to minutes (near-real-time replication). Cost: Medium (running a scaled-down environment).
Use for: Systems where minutes of downtime is acceptable.
Active-active (multi-region)
Multiple regions are active simultaneously. Traffic is distributed across regions. If one region fails, the other regions absorb the traffic.
RTO: Seconds (automatic failover). RPO: Near-zero (synchronous replication) or seconds (asynchronous). Cost: High (full capacity in multiple regions).
Use for: Mission-critical systems where any downtime is unacceptable.
graph LR
    subgraph strategies["DR Strategy Comparison"]
        BR["Backup and Restore<br/>RTO: hours<br/>RPO: hours<br/>Cost: low"]
        PL["Pilot Light<br/>RTO: 1-2 hours<br/>RPO: minutes<br/>Cost: low-medium"]
        WS["Warm Standby<br/>RTO: minutes<br/>RPO: seconds<br/>Cost: medium"]
        AA["Active-Active<br/>RTO: seconds<br/>RPO: near-zero<br/>Cost: high"]
    end
    BR -->|"more expensive<br/>better RTO/RPO"| PL
    PL --> WS
    WS --> AA
    style BR fill:#F1EFE8,stroke:#888780,color:#444441
    style PL fill:#FAEEDA,stroke:#854F0B,color:#633806
    style WS fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style AA fill:#EEEDFE,stroke:#534AB7,color:#3C3489
Backup strategy
The 3-2-1 rule
- 3 copies of data
- 2 different storage media
- 1 copy offsite
Example: primary database (copy 1), daily backup to S3 in the same region (copy 2, different media), weekly backup to S3 in a different region (copy 3, offsite).
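A small checker can make the rule testable. This is a sketch under assumptions: the `DataCopy` type, the medium labels, and the region names are hypothetical, and "offsite" is approximated as "in a different location from the primary".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    medium: str    # e.g. "block-storage", "object-storage", "tape"
    location: str  # e.g. a region or site name


def check_3_2_1(copies: list[DataCopy], primary_location: str) -> list[str]:
    """Return the parts of the 3-2-1 rule that are violated (empty if none)."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.medium for c in copies}) < 2:
        problems.append("fewer than 2 storage media")
    if not any(c.location != primary_location for c in copies):
        problems.append("no offsite copy")
    return problems


# The example from the text, with hypothetical region names.
example = [
    DataCopy("primary database", "block-storage", "us-east-1"),
    DataCopy("daily backup", "object-storage", "us-east-1"),
    DataCopy("weekly backup", "object-storage", "us-west-2"),
]
```

Calling `check_3_2_1(example, "us-east-1")` returns an empty list. Dropping the weekly cross-region backup would report both a missing third copy and a missing offsite copy.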
Backup types
Full backup: Complete copy of all data. Slow to create, fast to restore.
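Incremental backup: Only changes since the last backup of any type. Fast to create, slow to restore (must apply the last full backup and then every increment in order).
Differential backup: All changes since the last full backup. Grows larger each day until the next full backup, but faster to restore than incremental (apply the last full backup plus only the latest differential).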
Continuous backup (WAL archiving): Stream the database write-ahead log to backup storage. RPO of seconds to minutes, depending on how often WAL is shipped. Used by PostgreSQL with WAL-E or WAL-G, and by AWS RDS automated backups.
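The restore-time difference between incremental and differential backups comes down to how many files must be applied. The sketch below works that out for a given target day. The `Backup` type and day numbering are illustrative, and it assumes a schedule uses either incrementals or differentials after each full backup, not both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    day: int
    kind: str  # "full", "incremental", or "differential"


def restore_chain(backups: list[Backup], target_day: int) -> list[Backup]:
    """Return the backups to apply, in order, to restore to target_day."""
    usable = sorted((b for b in backups if b.day <= target_day),
                    key=lambda b: b.day)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target day")
    base = fulls[-1]
    later = [b for b in usable if b.day > base.day]
    differentials = [b for b in later if b.kind == "differential"]
    incrementals = [b for b in later if b.kind == "incremental"]
    if differentials and incrementals:
        raise ValueError("mixed schedules are not handled in this sketch")
    if differentials:
        # Only the most recent differential is needed on top of the full.
        return [base, differentials[-1]]
    # Every increment since the full backup must be applied in order.
    return [base] + incrementals


weekly_incremental = [Backup(0, "full")] + [Backup(d, "incremental") for d in range(1, 7)]
weekly_differential = [Backup(0, "full")] + [Backup(d, "differential") for d in range(1, 7)]
```

Restoring to day 5 needs six files under the incremental schedule (the full backup plus five increments) and two under the differential schedule.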
Testing backups
A backup you have never tested is not a backup. Regularly restore from backup to verify:
- The backup is complete and uncorrupted
- The restore procedure works
- The restore time meets your RTO
Test restores monthly. Automate the test: restore to a test environment, run smoke tests, verify data integrity.
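One way to automate the test is a small harness that times the restore and runs a set of checks against the restored copy. This is a sketch, not a complete tool: the `restore` callable and the individual checks are placeholders you would supply for your own database and test environment.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class RestoreTestResult:
    restore_seconds: float
    within_rto: bool
    failed_checks: list[str]

    @property
    def passed(self) -> bool:
        return self.within_rto and not self.failed_checks


def run_restore_test(restore: Callable[[], None],
                     checks: dict[str, Callable[[], bool]],
                     rto_seconds: float) -> RestoreTestResult:
    """Time the restore, then run each named smoke/integrity check.

    An exception from restore() is deliberately not caught: a restore
    that crashes should fail the test loudly.
    """
    started = time.monotonic()
    restore()
    elapsed = time.monotonic() - started
    failed = [name for name, check in checks.items() if not check()]
    return RestoreTestResult(elapsed, elapsed <= rto_seconds, failed)
```

A scheduled job could call `run_restore_test` with a function that restores the latest backup into a scratch instance, plus checks such as row counts or checksums, and alert whenever `passed` is false.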
Where it breaks or gets interesting
The ransomware problem
Ransomware encrypts your data, and if your backups are mounted or otherwise writable from the compromised systems, it can encrypt them too.
Mitigations:
- Immutable backups: S3 Object Lock, Azure Immutable Blob Storage. Backups cannot be modified or deleted for a specified period.
- Air-gapped backups: Backups stored on media not connected to the network.
- Separate-account backups: Backups stored in a separate cloud account with different credentials, so compromising production does not give access to them.
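As an illustration of the immutable-backup mitigation, the sketch below builds the arguments for an S3 upload with an Object Lock retention date. The helper name, bucket, and key are hypothetical. `ObjectLockMode` and `ObjectLockRetainUntilDate` are parameters of the S3 `put_object` call in boto3, and the target bucket is assumed to have been created with Object Lock enabled.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def object_lock_put_args(bucket: str, key: str, body: bytes,
                         retain_days: int,
                         now: Optional[datetime] = None) -> dict:
    """Build put_object keyword arguments for a locked backup upload."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # In compliance mode the retention cannot be shortened or removed
        # until the retain-until date has passed.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   args = object_lock_put_args("example-backup-bucket", "db/2024-05-01.dump",
#                               data, retain_days=30)
#   boto3.client("s3").put_object(**args)
```

Keeping the argument construction separate from the upload makes the retention logic easy to unit test without touching AWS.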
The “backup works, restore fails” problem
Backups are taken successfully. But the restore procedure has never been tested. When disaster strikes, the restore fails (wrong version of the restore tool, missing dependencies, corrupted backup file).
Test restores regularly. Automate the test. Include restore testing in your DR runbook.
Human error
Many disasters are caused by human error rather than hardware failure. A developer runs DROP TABLE on production. A misconfigured deployment deletes data. A script runs in the wrong environment.
Mitigations:
- Require confirmation for destructive operations
- Use separate credentials for production (no accidental production access)
- Enable soft deletes (mark as deleted, do not actually delete)
- Use point-in-time recovery (restore to any point in the last N days)
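Soft deletes are the easiest of these mitigations to show in code. The sketch below uses an in-memory SQLite database; the `customers` table and its `deleted_at` column are illustrative names, not a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, deleted_at TEXT)"
)
conn.executemany("INSERT INTO customers (name) VALUES (?)",
                 [("Ada",), ("Grace",)])


def soft_delete(conn: sqlite3.Connection, customer_id: int) -> None:
    """Mark the row as deleted instead of removing it."""
    conn.execute(
        "UPDATE customers SET deleted_at = datetime('now') "
        "WHERE id = ? AND deleted_at IS NULL",
        (customer_id,),
    )


def undelete(conn: sqlite3.Connection, customer_id: int) -> None:
    """Reverse a soft delete; this is the recovery path."""
    conn.execute("UPDATE customers SET deleted_at = NULL WHERE id = ?",
                 (customer_id,))


def active_customers(conn: sqlite3.Connection) -> list[str]:
    """Normal reads filter out soft-deleted rows."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE deleted_at IS NULL ORDER BY id"
    )
    return [name for (name,) in rows]
```

The trade-off is that every read path has to filter on `deleted_at`, and a separate job is needed to purge rows once the recovery window has passed.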
The DR runbook
A DR runbook is a documented, step-by-step procedure for recovering from specific failure scenarios. It should be:
- Written before the disaster (not during)
- Tested regularly (not just read)
- Accessible when the primary systems are down (not stored only in the system that failed)
- Specific enough to execute under pressure (not vague)
Store runbooks in a separate system (Confluence, Google Docs, printed copies) that is accessible even when your primary systems are down.
Real-world systems
AWS - Provides multiple DR tools: RDS automated backups (point-in-time recovery), S3 Cross-Region Replication, AWS Backup (centralized backup management), AWS Elastic Disaster Recovery (fast failover for on-premises and cloud workloads).
Google Cloud - Cloud SQL automated backups, Cloud Storage multi-region buckets, Persistent Disk snapshots.
Netflix - Chaos Kong regularly simulates the loss of an entire AWS region, evacuating traffic from it to test DR. Their DR is active-active, so regional failures are handled automatically.
GitLab - Publishes their DR runbooks publicly. Detailed procedures for database recovery, region failover, and data restoration.
How to apply it in practice
Define your RTO and RPO first
Before choosing a DR strategy, define your requirements:
- What is the maximum acceptable downtime? (RTO)
- What is the maximum acceptable data loss? (RPO)
- What is the cost of downtime per hour?
- What is the cost of data loss?
These answers determine which DR strategy is appropriate and how much to invest.
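The mapping from requirements to strategy can be sketched as a lookup over the four strategies described earlier, ordered from cheapest to most expensive. The achievable RTO and RPO figures below are illustrative assumptions chosen to fall inside the ranges given in this article. Replace them with numbers measured in your own drills.

```python
from datetime import timedelta

# (name, achievable RTO, achievable RPO), cheapest first.
# All numbers are assumptions for illustration.
STRATEGIES = [
    ("Backup and restore", timedelta(hours=8), timedelta(hours=24)),
    ("Pilot light", timedelta(hours=2), timedelta(minutes=5)),
    ("Warm standby", timedelta(minutes=30), timedelta(minutes=1)),
    ("Active-active", timedelta(minutes=1), timedelta(seconds=5)),
]


def choose_strategy(required_rto: timedelta, required_rpo: timedelta) -> str:
    """Return the cheapest strategy that satisfies both requirements."""
    for name, rto, rpo in STRATEGIES:
        if rto <= required_rto and rpo <= required_rpo:
            return name
    raise ValueError("no listed strategy meets these requirements")
```

With these assumed figures, an RTO of 1 hour and an RPO of 15 minutes selects warm standby, because pilot light cannot scale up fast enough.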
DR testing schedule
- Monthly: Restore a database backup to a test environment. Verify data integrity.
- Quarterly: Full DR drill. Simulate a regional failure. Execute the DR runbook. Measure actual RTO and RPO.
- Annually: Full DR exercise with all stakeholders. Test communication procedures, escalation paths, and decision-making.
The DR checklist
For each critical system:
- What are the RTO and RPO requirements?
- What backups exist? How often? Where are they stored?
- What is the restore procedure? Is it documented?
- When was the last restore test?
- What is the failover procedure?
- Who is responsible for executing DR?
- How do you communicate with users during an outage?
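If you track many systems, the checklist can be kept as structured data and audited automatically. The following is only a sketch: the field names are invented for illustration, and the 31-day threshold simply mirrors the monthly restore test suggested above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DRChecklist:
    system: str
    rto: Optional[timedelta] = None
    rpo: Optional[timedelta] = None
    backup_location: str = ""
    restore_procedure_doc: str = ""
    last_restore_test: Optional[date] = None
    failover_procedure_doc: str = ""
    owner: str = ""
    user_comms_plan: str = ""


def audit(entry: DRChecklist, today: date,
          max_test_age: timedelta = timedelta(days=31)) -> list[str]:
    """Return the checklist items that are missing or stale."""
    gaps = []
    if entry.rto is None or entry.rpo is None:
        gaps.append("RTO/RPO not defined")
    if not entry.backup_location:
        gaps.append("no backup location recorded")
    if not entry.restore_procedure_doc:
        gaps.append("restore procedure not documented")
    if entry.last_restore_test is None:
        gaps.append("restore never tested")
    elif today - entry.last_restore_test > max_test_age:
        gaps.append("restore test is overdue")
    if not entry.failover_procedure_doc:
        gaps.append("failover procedure not documented")
    if not entry.owner:
        gaps.append("no owner assigned")
    if not entry.user_comms_plan:
        gaps.append("no user communication plan")
    return gaps
```

Running `audit` over every critical system in a scheduled job turns the checklist into a report of concrete gaps.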
FAQ
Q: What is the difference between DR and high availability?
High availability (HA) prevents downtime through redundancy and automatic failover. It handles component failures (a server crashes, a database fails over). DR handles catastrophic failures (an entire region goes down, data is corrupted). HA is about uptime. DR is about recovery. You need both: HA for normal operations, DR for disasters.
Q: How often should you back up your database?
It depends on your RPO. If you can lose 1 hour of data, take hourly backups. If you can lose 5 minutes, use continuous WAL archiving. For most production databases, continuous WAL archiving (RPO of seconds) combined with daily full backups is the standard. The daily full backup provides a clean restore point. WAL archiving allows point-in-time recovery to any moment.
Q: What is point-in-time recovery (PITR)?
PITR allows you to restore a database to any point in time within a retention window. It works by combining a full backup with a continuous stream of transaction logs (WAL in PostgreSQL, binary log in MySQL). To restore to 3pm yesterday: restore the last full backup before 3pm, then replay the transaction logs up to 3pm. AWS RDS, Google Cloud SQL, and Azure SQL Database all support PITR. It is the most powerful recovery option for databases.
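The "restore the last full backup before the target, then replay logs" logic can be sketched as a small planning function. The function name and arguments are hypothetical, and real tools make this selection for you. The sketch only shows the rule being applied.

```python
from datetime import datetime


def plan_pitr(full_backups: list[datetime], target: datetime,
              oldest_archived_log: datetime) -> tuple[datetime, datetime]:
    """Return (base backup to restore, time to replay logs up to)."""
    candidates = [b for b in full_backups if b <= target]
    if not candidates:
        raise ValueError("no full backup exists before the target time")
    base = max(candidates)  # the most recent full backup before the target
    if oldest_archived_log > base:
        # Logs must cover the whole span from the base backup to the target.
        raise ValueError("log archive does not reach back to the base backup")
    return base, target


# Hypothetical daily full backups at 02:00.
backups = [datetime(2024, 5, d, 2, 0) for d in (1, 2, 3)]
base, replay_until = plan_pitr(
    backups,
    target=datetime(2024, 5, 2, 15, 0),
    oldest_archived_log=datetime(2024, 5, 1, 0, 0),
)
```

Here the 2 May 02:00 backup is chosen as the base, and about 13 hours of logs are replayed to reach 15:00.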
Interview questions
Q1: Your production PostgreSQL database is corrupted. Your last full backup is from 6 hours ago. You have WAL archiving enabled. Walk through the recovery.
Strong answer: With WAL archiving, you can recover to any point in time, not just the last backup. Steps: 1) Stop the corrupted database and keep its data directory for later analysis. 2) Restore the last full backup to a new instance. 3) Configure recovery to replay WAL from the archive up to the point just before the corruption: in recovery.conf on PostgreSQL 11 and earlier, or in postgresql.conf together with an empty recovery.signal file on PostgreSQL 12 and later. 4) Start the database in recovery mode. It replays the WAL and stops at the specified point. 5) Verify data integrity. 6) Switch application traffic to the recovered database. The RPO is the time between the last archived WAL and the corruption, typically seconds to minutes. The RTO depends on the size of the database and the amount of WAL to replay, typically 30 minutes to 2 hours for a large database.
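For PostgreSQL 12 and later, where recovery settings live in postgresql.conf and an empty recovery.signal file in the data directory requests recovery, step 3 might look like the sketch below. The helper names and the archive path are hypothetical. `restore_command`, `recovery_target_time`, and `recovery_target_action` are PostgreSQL settings.

```python
from datetime import datetime
from pathlib import Path


def _quote(value: str) -> str:
    """Quote a string for postgresql.conf (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def render_recovery_settings(restore_command: str,
                             target_time: datetime) -> str:
    """Build the settings block to add to postgresql.conf."""
    if target_time.tzinfo is None:
        raise ValueError("target_time must be timezone-aware")
    lines = [
        f"restore_command = {_quote(restore_command)}",
        f"recovery_target_time = {_quote(target_time.isoformat(sep=' '))}",
        # Pause at the target so the data can be inspected before promoting.
        "recovery_target_action = 'pause'",
    ]
    return "\n".join(lines) + "\n"


def request_recovery(data_dir: Path) -> Path:
    """Create the empty recovery.signal file in the restored data directory."""
    signal = data_dir / "recovery.signal"
    signal.touch()
    return signal
```

For example, `render_recovery_settings("cp /mnt/wal_archive/%f %p", target)` with a UTC `target` produces three lines to append to the restored instance's configuration before starting it.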
Q2: Design a DR strategy for a SaaS application with an RTO of 1 hour and RPO of 15 minutes.
Strong answer: Use a warm standby strategy. Primary region: full production environment. DR region: scaled-down environment (25% capacity) with near-real-time database replication. Database replication: asynchronous with 15-minute maximum lag (monitor lag and alert if it exceeds 10 minutes). Application servers in DR region: running but at minimal capacity. When disaster strikes: scale up DR application servers (10-15 minutes), switch DNS to DR region (5 minutes, with low TTL), verify functionality (10 minutes). Total failover time: 25-30 minutes once the decision to fail over is made, which leaves margin inside the 1-hour RTO for detection and escalation. RPO: bounded by the replication lag, which is normally seconds; alerting at 10 minutes keeps it inside the 15-minute requirement. Cost: 25-30% of production cost for the DR environment. Test quarterly: simulate a regional failure, execute the runbook, measure actual RTO and RPO.
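The timing in this answer can be checked with simple arithmetic, and the lag thresholds translate directly into a monitoring rule. The sketch below uses the step durations and thresholds quoted in the answer. The function names and status labels are illustrative.

```python
from datetime import timedelta

# (best case, worst case) per failover step, from the answer above.
FAILOVER_STEPS = {
    "scale up DR application servers": (timedelta(minutes=10), timedelta(minutes=15)),
    "switch DNS to DR region": (timedelta(minutes=5), timedelta(minutes=5)),
    "verify functionality": (timedelta(minutes=10), timedelta(minutes=10)),
}


def failover_window(steps: dict) -> tuple[timedelta, timedelta]:
    """Sum the best-case and worst-case durations of sequential steps."""
    best = sum((low for low, _ in steps.values()), timedelta())
    worst = sum((high for _, high in steps.values()), timedelta())
    return best, worst


def lag_status(lag: timedelta,
               alert_after: timedelta = timedelta(minutes=10),
               rpo: timedelta = timedelta(minutes=15)) -> str:
    """Classify replication lag against the alert threshold and the RPO."""
    if lag > rpo:
        return "rpo-breached"
    if lag > alert_after:
        return "alert"
    return "ok"
```

`failover_window(FAILOVER_STEPS)` gives 25 to 30 minutes, so roughly half of the 1-hour RTO remains for detecting the failure and deciding to fail over.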
Q3: How do you protect against ransomware encrypting your backups?
Strong answer: Multiple layers of protection. First, immutable backups: use S3 Object Lock in compliance mode. Backups cannot be modified or deleted for a specified retention period (e.g., 30 days), even by the account owner. Ransomware cannot encrypt or delete them. Second, separate backup account: store backups in a separate AWS account with different credentials. Even if the production account is compromised, the backup account is not. Third, air-gapped backups: periodically copy backups to offline media (tape, external drives) stored offsite. Fourth, test restores: regularly restore from backup to verify they are not corrupted. Fifth, monitoring: alert on unexpected backup deletions or modifications. The combination of immutable storage and a separate account provides strong protection against ransomware.