Kubernetes Disaster Recovery: Are You Really Prepared?
Most organizations have a disaster recovery plan. Few have actually tested it. Even fewer understand what happens when it's invoked at 2 AM on a Sunday during a real crisis.
For companies running Kubernetes in Switzerland or Europe, DR isn't theoretical. Swiss data protection law and EU GDPR create regulatory pressure to have reliable recovery procedures. But more importantly, downtime has real business cost.
Here's what you need to know to build a DR plan that actually works.
Defining "Disaster"
Before building recovery procedures, define what you're recovering from:
Tier 1: Recoverable (Minutes)
- Single pod crashes
- Single node failure
- Temporary network partition
- Solution: Kubernetes self-healing handles this automatically
Tier 2: Serious (Hours)
- Multi-node failure (but cluster remains operational)
- Persistent volume failure
- Database connection issue
- RTO: 2-4 hours | RPO: < 1 hour
Tier 3: Severe (Days)
- Entire Kubernetes cluster destroyed
- Entire data center down
- Widespread data corruption
- RTO: 24 hours | RPO: 24 hours
Tier 4: Catastrophic (Days to Weeks)
- All backups corrupted
- Multi-region failure
- RTO: 72+ hours | RPO: 72+ hours
This hierarchy informs your recovery strategy. You don't need to recover from Tier 4 in 30 minutes, but you should recover from Tier 2 in under 4 hours.
The Three Elements of DR
1. Backup Strategy (RPO - Recovery Point Objective)
RPO is the maximum acceptable data loss (measured in time). If you can afford to lose 1 hour of data, your RPO is 1 hour.
What to back up in Kubernetes:
- Application code (not needed; rebuild from Git)
- Configuration (YAML, Helm charts, Kustomize manifests)
- Persistent data (databases, file systems, caches)
- Secrets (API keys, certificates, credentials)
- Custom CRDs (CustomResourceDefinitions unique to your cluster)
Three-tier backup approach:
Tier 1: Configuration backup (hourly)
# Backup all Kubernetes manifests to a Git-tracked directory
# (note: "get all" skips ConfigMaps, Secrets, and CRDs — export those separately, or use Velero)
kubectl get all -A -o yaml > cluster-state/snapshot.yaml
git -C cluster-state add -A && git -C cluster-state commit -m "Cluster snapshot $(date -u +%F)"
Tool: Velero (formerly Heptio Ark)
Tier 2: Database backups (every 4 hours)
# For stateful workloads (PostgreSQL, MySQL)
BACKUP=backup_$(date +%s).sql.gz
pg_dump mydb | gzip > "$BACKUP"
# Upload to S3 with server-side encryption
aws s3 cp "$BACKUP" s3://my-backup-bucket/db/ --sse AES256
Tools: Automated database backup solutions (AWS RDS, GCP CloudSQL) or managed services
Tier 3: Persistent volume snapshots (hourly)
# Cloud provider snapshots
gcloud compute disks snapshot [disk-name] --zone=[zone]
Tool: Cloud provider native snapshots
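The snapshot command above is usually wrapped in a small cron-driven script with retention built in. A minimal sketch, assuming the gcloud CLI is authenticated; the disk name, zone, and 30-day retention window are placeholders for your own values:

```shell
#!/usr/bin/env bash
# Hypothetical hourly snapshot script for a GCE persistent disk.
# Run from cron, e.g.:  0 * * * * /opt/dr/snapshot.sh my-pv-disk europe-west6-a
set -euo pipefail

DISK="${1:?usage: snapshot.sh <disk-name> <zone>}"
ZONE="${2:?usage: snapshot.sh <disk-name> <zone>}"

# Timestamped snapshot name, e.g. my-pv-disk-20250101-0300
SNAP="${DISK}-$(date -u +%Y%m%d-%H%M)"

gcloud compute disks snapshot "$DISK" \
  --zone "$ZONE" \
  --snapshot-names "$SNAP"

# Prune snapshots of this disk older than 30 days to bound storage cost
gcloud compute snapshots list \
  --filter="name~^${DISK}- AND creationTimestamp<-P30D" \
  --format="value(name)" |
while read -r old; do
  gcloud compute snapshots delete "$old" --quiet
done
```

Embedding a UTC timestamp in the snapshot name keeps retention filtering and point-in-time selection unambiguous across time zones.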
Data location reality:
- Application configs → Git (zero risk of loss)
- Database data → Managed backup service (near-zero risk)
- Secrets → External vault with replication
- Persistent volumes → Cloud provider snapshots + off-site copies
2. Recovery Procedure (RTO - Recovery Time Objective)
RTO is the maximum acceptable downtime. If you can afford 4 hours of downtime, your RTO is 4 hours.
Recovery scenarios:
Scenario 1: Single Node Failure (RTO: < 5 minutes)
Kubernetes handles automatically. Pods reschedule to healthy nodes. No manual intervention needed.
Manual intervention needed: none — the control plane reschedules the pods for you.
Scenario 2: Cluster Failure (RTO: 4 hours)
The entire Kubernetes cluster is lost. You need to rebuild from backups.
Recovery procedure:
Step 1: Spin up new Kubernetes cluster (10-30 min)
- Terraform/Helm chart you've pre-prepared
- Same size as production
Step 2: Restore configuration (10-20 min)
- Clone Git repo with manifests
- Apply all YAML: kubectl apply -f .
- This recreates all deployments, services, configmaps
Step 3: Restore persistent data (30 min - 2 hours)
- Database: Restore from backup
- Persistent volumes: Attach snapshots
- Secrets: Restore from vault
Step 4: Verification (30 min)
- Run smoke tests
- Verify application health
- Check data integrity
Total: 1-4 hours
To hit 4-hour RTO, you need:
- Pre-templated Kubernetes cluster (Terraform, not manual)
- Backups < 1 hour old
- Tested recovery procedure
- Team trained on manual process
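Steps 1-3 of this procedure can be captured in a single runbook script so nobody improvises at 2 AM. A hedged sketch, assuming a pre-templated Terraform module, a manifests repo, and a Velero backup — the paths, repo URL, cluster name, and backup name are all illustrative:

```shell
#!/usr/bin/env bash
# DR runbook sketch: rebuild cluster, reapply config, restore data.
# All module paths, URLs, and names below are placeholders.
set -euo pipefail

# Step 1: recreate the cluster from pre-templated IaC (10-30 min)
terraform -chdir=infra/cluster init
terraform -chdir=infra/cluster apply -auto-approve

# Point kubectl at the new cluster (provider-specific; GKE shown)
gcloud container clusters get-credentials prod-dr --region europe-west6

# Step 2: reapply all manifests from Git (10-20 min)
git clone git@example.com:platform/manifests.git /tmp/manifests
kubectl apply -R -f /tmp/manifests/

# Step 3: restore persistent data (30 min - 2 h)
velero restore create --from-backup full-backup --wait

# Step 4: quick health survey before re-enabling traffic
kubectl get pods -A --field-selector=status.phase!=Running
```

Keeping this as one reviewed, version-controlled script is what turns a 24-hour ad-hoc recovery into a 4-hour procedure.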
Scenario 3: Data Corruption (RTO: 4-12 hours)
You discover that data in production is corrupted. You need to restore from a clean backup.
Recovery procedure:
Step 1: Identify corruption scope and time
- When did corruption start?
- Which data is affected?
Step 2: Find clean backup point
- Restore to point before corruption
- Verify data is uncorrupted
Step 3: Restore database and volumes
- Databases: Point-in-time recovery
- Persistent volumes: Restore from snapshot
Step 4: Validate recovery
- Run integrity checks
- Verify application functionality
Total: 4-12 hours (depends on backup frequency and corruption extent)
This requires:
- Immutable backups (can't be modified or deleted)
- Multiple backup retention points (daily snapshots for 30 days)
- Automated integrity checks on backups
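On AWS, "immutable" is typically implemented with S3 Object Lock in compliance mode, which makes objects undeletable even by an administrator until retention expires. A sketch, assuming a new bucket (the bucket name and region are placeholders) — Object Lock must be enabled at bucket creation:

```shell
# Create a bucket with Object Lock enabled (only possible at creation time)
aws s3api create-bucket \
  --bucket my-dr-backups \
  --create-bucket-configuration LocationConstraint=eu-central-2 \
  --object-lock-enabled-for-bucket

# Default retention: every uploaded object is locked for 30 days
aws s3api put-object-lock-configuration \
  --bucket my-dr-backups \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
  }'
```

Compliance mode (as opposed to governance mode) is the setting that protects backups from a compromised admin credential, which is exactly the Tier 4 scenario above.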
3. Testing and Validation
A disaster recovery plan you haven't tested is fiction, not a plan.
Testing schedule:
Monthly: Quick test
- Restore a database backup to staging
- Run smoke tests
- Verify data integrity
- Takes 2-3 hours
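The monthly drill can be scripted end to end. A sketch assuming timestamped PostgreSQL dumps in S3 and a scratch database on staging; the bucket, connection string, and the final sanity query are placeholders for your own checks:

```shell
#!/usr/bin/env bash
# Monthly restore drill: pull the latest dump, load it into staging,
# run a basic integrity check. All names are illustrative.
set -euo pipefail

BUCKET="s3://my-backup-bucket/db"
STAGING="postgresql://staging-db.internal/restore_test"

# Latest backup by key name (epoch-timestamped names sort chronologically)
LATEST=$(aws s3 ls "$BUCKET/" | awk '{print $4}' | sort | tail -n 1)
aws s3 cp "$BUCKET/$LATEST" /tmp/restore.sql.gz

# Load into the scratch database
gunzip -c /tmp/restore.sql.gz | psql "$STAGING"

# Smoke check: a fresh backup should contain recent rows
psql "$STAGING" -c \
  "SELECT count(*) FROM orders WHERE created_at > now() - interval '2 days';"
```

If the row count comes back zero, the backup pipeline has been silently failing — which is precisely what this drill exists to catch.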
Quarterly: Cluster rebuild test
- Destroy staging cluster
- Rebuild from backup
- Test all workloads
- Takes 6-8 hours
Annually: Full production DR exercise
- Actually failover to backup region/cluster (or simulate it)
- Run full workload validation
- Measure actual RTO
- Takes 1-2 business days
Game day approach:
Morning:
- Declare "Code Red: Disaster Recovery Exercise"
- Notify all teams
- Freeze all other work
Afternoon:
- Team attempts full recovery
- Document everything that goes wrong
- No pre-written automation the first time through (doing it by hand is how you find the gaps)
Evening:
- Review results
- Update procedures
- Schedule follow-up
This is uncomfortable. It should be. Better to find problems during controlled exercises than during real outages.
Specific Kubernetes DR Patterns
Multi-Region Disaster Recovery
If entire regions fail, you need resources in multiple regions.
Active-passive setup:
Region 1 (Primary):
- Full Kubernetes cluster
- All traffic routed here
- Database writes happen here
Region 2 (Standby):
- Identical cluster, scaled down (1-2 replicas per service)
- Database read replicas only
- Minimal cost
On failure:
- DNS redirects traffic to Region 2
- Database failover promotes a read replica to primary
- Existing cluster in Region 2 handles traffic
Cost: ~30-40% additional. A full-capacity second region would roughly double your spend; a scaled-down warm standby is the middle ground.
RTO: 5-10 minutes (DNS propagation + database failover)
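The DNS redirect can be pre-wired rather than executed by hand during the incident. A sketch using Route 53 failover routing (the zone ID, record name, health check ID, and IP are placeholders) — with a health check attached, Route 53 flips traffic to the standby automatically:

```shell
# PRIMARY record: points at Region 1, guarded by a health check.
# A matching record with "Failover": "SECONDARY" points at Region 2.
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "00000000-primary-healthcheck-id",
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```

The low TTL (60 seconds) matters: it is what keeps DNS propagation inside the 5-10 minute RTO quoted above.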
Backup and Restore via Velero
Velero is the standard tool for Kubernetes-native backups.
Installation:
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-releases
helm install velero vmware-tanzu/velero \
--namespace velero \
--create-namespace \
--set configuration.backupStorageLocation.bucket=my-backup-bucket \
--set configuration.backupStorageLocation.provider=aws
Creating a backup:
# Backup entire cluster
velero backup create full-backup
# Or backup specific namespace
velero backup create app-backup --include-namespaces payment-system
# List backups
velero backup get
# Restore from backup
velero restore create --from-backup full-backup
Restore to different cluster:
# On new cluster:
velero restore create --from-backup full-backup
# Entire cluster state is restored
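One-off backups cover exercises; for the hourly and 4-hourly cadences described earlier, Velero schedules are the usual mechanism. The schedule names, cron expressions, and namespace here are examples:

```shell
# Hourly backup of the whole cluster, kept for 30 days (--ttl is in hours)
velero schedule create hourly-full --schedule="0 * * * *" --ttl 720h

# Namespace-scoped backup every 4 hours
velero schedule create payments-4h \
  --schedule="0 */4 * * *" \
  --include-namespaces payment-system

# Inspect what the schedules have produced
velero backup get
```

Schedules live in the cluster as Velero CRs, so they are themselves captured by your configuration backup and recreated on restore.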
Your DR Checklist
Before You Go to Production
- RPO and RTO defined and communicated to stakeholders
- Backup strategy documented (what, how often, where stored)
- Secrets backup isolated from application backups
- Immutable backup storage configured
- Tested recovery procedure documented
- Team trained on manual recovery steps
- Terraform/IaC templates for cluster recreation ready
Ongoing (Monthly)
- Incremental backup test (restore non-prod, verify data)
- Backup integrity validation
- Check backup logs for errors
Quarterly
- Full cluster rebuild test
- Time each step
- Update documentation if procedures changed
Annually
- Full production disaster recovery exercise
- Document actual RTO achieved
- Identify process improvements
Real Numbers: Cost of Disaster
Scenarios:
Scenario A: No DR plan (disaster strikes)
- Downtime: 24 hours (recovery is chaotic, ad-hoc)
- Data loss: 8 hours of transactions
- Cost to business: $1M (varies by industry)
- Regulatory fines: Unknown (GDPR, nLPD violations)
Scenario B: Good DR plan (disaster strikes)
- Downtime: 2 hours (prepared cluster, tested procedures)
- Data loss: < 30 minutes (hourly backups)
- Cost to business: $200k
- Regulatory posture: Defensible (tested procedures documented)
Cost of the plan:
- Initial setup: 2-3 months engineering effort
- Ongoing: 2-4 hours/month for testing
- Infrastructure: ~$500/month (backup storage, standby resources)
ROI: a $1M incident with a 1-in-5 annual probability is an expected loss of $200k per year. Against roughly $6k/year in infrastructure plus a few engineer-hours of testing each month, the plan pays for itself many times over.
Common Mistakes
Mistake 1: Backing up to the same data center Fix: Backups must be geographically distant.
Mistake 2: Testing only the happy path Fix: Test failure scenarios. What if the restore command fails halfway?
Mistake 3: Assuming backup jobs succeed without verification Fix: Monitor backup jobs and run automated integrity checks; silent failures are common.
Mistake 4: Never rehearsing an actual restore Fix: A backup you haven't restored is not a backup. Restore to non-prod every month.
Mistake 5: Not documenting runbooks Fix: At 3 AM, people don't improvise well. Document step-by-step procedures.
The Reality
Disaster recovery isn't fun. It's insurance. You hope you never need it. But when you do, a well-tested plan means the difference between a minor incident and a catastrophic one.
The good news: For Kubernetes in Switzerland/Europe, the infrastructure to do this well is mature and relatively straightforward. The effort is real, but manageable.
Start today. Write down your RPO and RTO. Design your backup strategy. Test it once. Then test it again.