Kubernetes Disaster Recovery: Are You Really Prepared?
Most organizations have a disaster recovery plan. Few have actually tested it. Even fewer understand what happens when it's invoked at 2 AM on a Sunday during a real crisis.
For companies running Kubernetes in Switzerland or Europe, DR isn't theoretical. Swiss data protection law and EU GDPR create regulatory pressure to have reliable recovery procedures. But more importantly, downtime has real business cost.
Here's what you need to know to build a DR plan that actually works.
Defining "Disaster"
Before building recovery procedures, define what you're recovering from:
Tier 1: Recoverable (Minutes)
- Single pod crashes
- Single node failure
- Temporary network partition
- Solution: Kubernetes self-healing handles this automatically
Tier 2: Serious (Hours)
- Multi-node failure (but cluster remains operational)
- Persistent volume failure
- Database connection issue
- RTO: 2-4 hours | RPO: < 1 hour
Tier 3: Severe (Days)
- Entire Kubernetes cluster destroyed
- Entire data center down
- Widespread data corruption
- RTO: 24 hours | RPO: 24 hours
Tier 4: Catastrophic (Days to Weeks)
- All backups corrupted
- Multi-region failure
- RTO: 72+ hours | RPO: 72+ hours
This hierarchy informs your recovery strategy. You don't need to recover from Tier 4 in 30 minutes, but you should recover from Tier 2 in under 4 hours.
The Three Elements of DR
1. Backup Strategy (RPO - Recovery Point Objective)
RPO is the maximum acceptable data loss (measured in time). If you can afford to lose 1 hour of data, your RPO is 1 hour.
What to back up in Kubernetes:
- Application code (not needed; rebuild from Git)
- Configuration (YAML, Helm charts, Kustomize manifests)
- Persistent data (databases, file systems, caches)
- Secrets (API keys, certificates, credentials)
- Custom CRDs (CustomResourceDefinitions unique to your cluster)
Three-tier backup approach:
Tier 1: Configuration backup (hourly)
# Backup all Kubernetes manifests to a Git-tracked directory
# (note: "get all" skips ConfigMaps, Secrets, and CRDs — export those separately, or use Velero)
kubectl get all -A -o yaml > cluster-state/snapshot.yaml
git -C cluster-state add -A && git -C cluster-state commit -m "Cluster snapshot $(date -u +%F)"
Tool: Velero (formerly Heptio Ark)
Tier 2: Database backups (every 4 hours)
# For stateful workloads (PostgreSQL, MySQL)
BACKUP=backup_$(date +%s).sql.gz
pg_dump mydb | gzip > "$BACKUP"
# Upload to S3 with server-side encryption
aws s3 cp "$BACKUP" s3://my-backup-bucket/db/ --sse AES256
Tools: Automated database backup solutions (AWS RDS, GCP CloudSQL) or managed services
Tier 3: Persistent volume snapshots (hourly)
# Cloud provider snapshots
gcloud compute disks snapshot [disk-name] --zone=[zone]
Tool: Cloud provider native snapshots
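The snapshot command above is usually wrapped in a small cron-driven script with retention built in. A minimal sketch, assuming the gcloud CLI is authenticated; the disk name, zone, and 30-day retention window are placeholders for your own values:

```shell
#!/usr/bin/env bash
# Hypothetical hourly snapshot script for a GCE persistent disk.
# Run from cron, e.g.:  0 * * * * /opt/dr/snapshot.sh my-pv-disk europe-west6-a
set -euo pipefail

DISK="${1:?usage: snapshot.sh <disk-name> <zone>}"
ZONE="${2:?usage: snapshot.sh <disk-name> <zone>}"

# Timestamped snapshot name, e.g. my-pv-disk-20250101-0300
SNAP="${DISK}-$(date -u +%Y%m%d-%H%M)"

gcloud compute disks snapshot "$DISK" \
  --zone "$ZONE" \
  --snapshot-names "$SNAP"

# Prune snapshots of this disk older than 30 days to bound storage cost
gcloud compute snapshots list \
  --filter="name~^${DISK}- AND creationTimestamp<-P30D" \
  --format="value(name)" |
while read -r old; do
  gcloud compute snapshots delete "$old" --quiet
done
```

Embedding a UTC timestamp in the snapshot name keeps retention filtering and point-in-time selection unambiguous across time zones.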
Data location reality:
- Application configs → Git (zero risk of loss)
- Database data → Managed backup service (near-zero risk)
- Secrets → External vault with replication
- Persistent volumes → Cloud provider snapshots + off-site copies
2. Recovery Procedure (RTO - Recovery Time Objective)
RTO is the maximum acceptable downtime. If you can afford 4 hours of downtime, your RTO is 4 hours.
Recovery scenarios:
Scenario 1: Single Node Failure (RTO: < 5 minutes)
Kubernetes handles automatically. Pods reschedule to healthy nodes. No manual intervention needed.
Manual intervention needed: none — the control plane reschedules the pods for you.
Scenario 2: Cluster Failure (RTO: 4 hours)
The entire Kubernetes cluster is lost. You need to rebuild from backups.
Recovery procedure:
Step 1: Spin up new Kubernetes cluster (10-30 min)
- Terraform/Helm chart you've pre-prepared
- Same size as production
Step 2: Restore configuration (10-20 min)
- Clone Git repo with manifests
- Apply all YAML: kubectl apply -f .
- This recreates all deployments, services, configmaps
Step 3: Restore persistent data (30 min - 2 hours)
- Database: Restore from backup
- Persistent volumes: Attach snapshots
- Secrets: Restore from vault
Step 4: Verification (30 min)
- Run smoke tests
- Verify application health
- Check data integrity
Total: 1-4 hours
To hit 4-hour RTO, you need:
- Pre-templated Kubernetes cluster (Terraform, not manual)
- Backups < 1 hour old
- Tested recovery procedure
- Team trained on manual process
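Steps 1-3 of this procedure can be captured in a single runbook script so nobody improvises at 2 AM. A hedged sketch, assuming a pre-templated Terraform module, a manifests repo, and a Velero backup — the paths, repo URL, cluster name, and backup name are all illustrative:

```shell
#!/usr/bin/env bash
# DR runbook sketch: rebuild cluster, reapply config, restore data.
# All module paths, URLs, and names below are placeholders.
set -euo pipefail

# Step 1: recreate the cluster from pre-templated IaC (10-30 min)
terraform -chdir=infra/cluster init
terraform -chdir=infra/cluster apply -auto-approve

# Point kubectl at the new cluster (provider-specific; GKE shown)
gcloud container clusters get-credentials prod-dr --region europe-west6

# Step 2: reapply all manifests from Git (10-20 min)
git clone git@example.com:platform/manifests.git /tmp/manifests
kubectl apply -R -f /tmp/manifests/

# Step 3: restore persistent data (30 min - 2 h)
velero restore create --from-backup full-backup --wait

# Step 4: quick health survey before re-enabling traffic
kubectl get pods -A --field-selector=status.phase!=Running
```

Keeping this as one reviewed, version-controlled script is what turns a 24-hour ad-hoc recovery into a 4-hour procedure.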
Scenario 3: Data Corruption (RTO: 4-12 hours)
You discover that data in production is corrupted. You need to restore from a clean backup.
Recovery procedure:
Step 1: Identify corruption scope and time
- When did corruption start?
- Which data is affected?
Step 2: Find clean backup point
- Restore to point before corruption
- Verify data is uncorrupted
Step 3: Restore database and volumes
- Databases: Point-in-time recovery
- Persistent volumes: Restore from snapshot
Step 4: Validate recovery
- Run integrity checks
- Verify application functionality
Total: 4-12 hours (depends on backup frequency and corruption extent)
This requires:
- Immutable backups (can't be modified or deleted)
- Multiple backup retention points (daily snapshots for 30 days)
- Automated integrity checks on backups
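On AWS, "immutable" is typically implemented with S3 Object Lock in compliance mode, which makes objects undeletable even by an administrator until retention expires. A sketch, assuming a new bucket (the bucket name and region are placeholders) — Object Lock must be enabled at bucket creation:

```shell
# Create a bucket with Object Lock enabled (only possible at creation time)
aws s3api create-bucket \
  --bucket my-dr-backups \
  --create-bucket-configuration LocationConstraint=eu-central-2 \
  --object-lock-enabled-for-bucket

# Default retention: every uploaded object is locked for 30 days
aws s3api put-object-lock-configuration \
  --bucket my-dr-backups \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
  }'
```

Compliance mode (as opposed to governance mode) is the setting that protects backups from a compromised admin credential, which is exactly the Tier 4 scenario above.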
3. Testing and Validation
A disaster recovery plan you haven't tested is fiction, not a plan.
Testing schedule:
Monthly: Quick test
- Restore a database backup to staging
- Run smoke tests
- Verify data integrity
- Takes 2-3 hours
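The monthly drill can be scripted end to end. A sketch assuming timestamped PostgreSQL dumps in S3 and a scratch database on staging; the bucket, connection string, and the final sanity query are placeholders for your own checks:

```shell
#!/usr/bin/env bash
# Monthly restore drill: pull the latest dump, load it into staging,
# run a basic integrity check. All names are illustrative.
set -euo pipefail

BUCKET="s3://my-backup-bucket/db"
STAGING="postgresql://staging-db.internal/restore_test"

# Latest backup by key name (epoch-timestamped names sort chronologically)
LATEST=$(aws s3 ls "$BUCKET/" | awk '{print $4}' | sort | tail -n 1)
aws s3 cp "$BUCKET/$LATEST" /tmp/restore.sql.gz

# Load into the scratch database
gunzip -c /tmp/restore.sql.gz | psql "$STAGING"

# Smoke check: a fresh backup should contain recent rows
psql "$STAGING" -c \
  "SELECT count(*) FROM orders WHERE created_at > now() - interval '2 days';"
```

If the row count comes back zero, the backup pipeline has been silently failing — which is precisely what this drill exists to catch.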
Quarterly: Cluster rebuild test
- Destroy staging cluster
- Rebuild from backup
- Test all workloads
- Takes 6-8 hours
Annually: Full production DR exercise
- Actually failover to backup region/cluster (or simulate it)
- Run full workload validation
- Measure actual RTO
- Takes 1-2 business days
Game day approach:
Morning:
- Declare "Code Red: Disaster Recovery Exercise"
- Notify all teams
- Freeze all other work
Afternoon:
- Team attempts full recovery
- Document everything that goes wrong
- No pre-written automation the first time through (doing it by hand is how you find the gaps)
Evening:
- Review results
- Update procedures
- Schedule follow-up
This is uncomfortable. It should be. Better to find problems during controlled exercises than during real outages.
Specific Kubernetes DR Patterns
Multi-Region Disaster Recovery
If entire regions fail, you need resources in multiple regions.
Active-passive setup:
Region 1 (Primary):
- Full Kubernetes cluster
- All traffic routed here
- Database writes happen here
Region 2 (Standby):
- Identical cluster, scaled down (1-2 replicas per service)
- Database read replicas only
- Minimal cost
On failure:
- DNS redirects traffic to Region 2
- Database failover promotes a read replica to primary
- Existing cluster in Region 2 handles traffic
Cost: ~30-40% additional. A full-capacity second region would roughly double your spend; a scaled-down warm standby is the middle ground.
RTO: 5-10 minutes (DNS propagation + database failover)
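The DNS redirect can be pre-wired rather than executed by hand during the incident. A sketch using Route 53 failover routing (the zone ID, record name, health check ID, and IP are placeholders) — with a health check attached, Route 53 flips traffic to the standby automatically:

```shell
# PRIMARY record: points at Region 1, guarded by a health check.
# A matching record with "Failover": "SECONDARY" points at Region 2.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "00000000-primary-healthcheck-id",
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```

The low TTL (60 seconds) matters: it is what keeps DNS propagation inside the 5-10 minute RTO quoted above.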
Backup and Restore via Velero
Velero is the standard tool for Kubernetes-native backups.
Installation:
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-releases
helm install velero vmware-tanzu/velero \
--namespace velero \
--create-namespace \
--set configuration.backupStorageLocation.bucket=my-backup-bucket \
--set configuration.backupStorageLocation.provider=aws
Creating a backup:
# Backup entire cluster
velero backup create full-backup
# Or backup specific namespace
velero backup create app-backup --include-namespaces payment-system
# List backups
velero backup get
# Restore from backup
velero restore create --from-backup full-backup
Restore to different cluster:
# On new cluster:
velero restore create --from-backup full-backup
# Entire cluster state is restored
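One-off backups cover exercises; for the hourly and 4-hourly cadences described earlier, Velero schedules are the usual mechanism. The schedule names, cron expressions, and namespace here are examples:

```shell
# Hourly backup of the whole cluster, kept for 30 days (--ttl is in hours)
velero schedule create hourly-full --schedule="0 * * * *" --ttl 720h

# Namespace-scoped backup every 4 hours
velero schedule create payments-4h \
  --schedule="0 */4 * * *" \
  --include-namespaces payment-system

# Inspect what the schedules have produced
velero backup get
```

Schedules live in the cluster as Velero CRs, so they are themselves captured by your configuration backup and recreated on restore.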
Your DR Checklist
Before You Go to Production
- RPO and RTO defined and communicated to stakeholders
- Backup strategy documented (what, how often, where stored)
- Secrets backup isolated from application backups
- Immutable backup storage configured
- Tested recovery procedure documented
- Team trained on manual recovery steps
- Terraform/IaC templates for cluster recreation ready
Ongoing (Monthly)
- Incremental backup test (restore non-prod, verify data)
- Backup integrity validation
- Check backup logs for errors
Quarterly
- Full cluster rebuild test
- Time each step
- Update documentation if procedures changed
Annually
- Full production disaster recovery exercise
- Document actual RTO achieved
- Identify process improvements
Real Numbers: Cost of Disaster
Scenarios:
Scenario A: No DR plan (disaster strikes)
- Downtime: 24 hours (recovery is chaotic, ad-hoc)
- Data loss: 8 hours of transactions
- Cost to business: $1M (varies by industry)
- Regulatory fines: Unknown (GDPR, nLPD violations)
Scenario B: Good DR plan (disaster strikes)
- Downtime: 2 hours (prepared cluster, tested procedures)
- Data loss: < 30 minutes (hourly backups)
- Cost to business: $200k
- Regulatory posture: Defensible (tested procedures documented)
Cost of the plan:
- Initial setup: 2-3 months engineering effort
- Ongoing: 2-4 hours/month for testing
- Infrastructure: ~$500/month (backup storage, standby resources)
ROI: a $1M incident with a 1-in-5 annual probability is an expected loss of $200k per year. Against roughly $6k/year in infrastructure plus a few engineer-hours of testing each month, the plan pays for itself many times over.
Common Mistakes
Mistake 1: Backing up to the same data center Fix: Backups must be geographically distant.
Mistake 2: Testing only the happy path Fix: Test failure scenarios. What if the restore command fails halfway?
Mistake 3: Assuming backup jobs succeed without verification Fix: Monitor backup jobs and run automated integrity checks; silent failures are common.
Mistake 4: Never rehearsing an actual restore Fix: A backup you haven't restored is not a backup. Restore to non-prod every month.
Mistake 5: Not documenting runbooks Fix: At 3 AM, people don't improvise well. Document step-by-step procedures.
The Reality
Disaster recovery isn't fun. It's insurance. You hope you never need it. But when you do, a well-tested plan means the difference between a minor incident and a catastrophic one.
The good news: For Kubernetes in Switzerland/Europe, the infrastructure to do this well is mature and relatively straightforward. The effort is real, but manageable.
Start today. Write down your RPO and RTO. Design your backup strategy. Test it once. Then test it again.