Zero Downtime: The CTO's Guide to Sleeping Well
CTOs often face this tension: our business needs continuous innovation (multiple deployments daily), but zero downtime during deployments feels like an impossible luxury. Either we deploy frequently and risk incidents, or we deploy carefully and move slowly.
The good news: zero-downtime deployments are achievable for most workloads. It requires planning, but the architecture patterns are well-established.
Here's how to implement them.
What Zero Downtime Actually Means
Zero downtime != zero risk: You can deploy without user-facing downtime while still carrying deployment risk. Rollback procedures, monitoring, and incident response remain critical.
Zero downtime means:
- Users don't experience service interruption during deployment
- Database schemas evolve without blocking queries
- Infrastructure changes don't drop connections
- Rollback doesn't require downtime
This is achievable for:
- Stateless services (APIs, web applications)
- Stateful services with proper design (databases, caches)
- Background jobs and batch systems
This is hard for:
- Systems with global state
- Distributed consensus systems (etcd, Kafka during rebalancing)
- Legacy monoliths tightly coupled to infrastructure
The Three Layers of Zero-Downtime Deployment
Layer 1: Application Level
Your application must handle being replaced mid-request.
Pattern: Graceful Shutdown
// Your application should handle SIGTERM properly
package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}
    go func() {
        sigterm := make(chan os.Signal, 1)
        signal.Notify(sigterm, syscall.SIGTERM)
        <-sigterm
        // Give in-flight requests 30 seconds to complete
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        server.Shutdown(ctx)
    }()
    server.ListenAndServe() // returns http.ErrServerClosed after Shutdown
}
When Kubernetes terminates a pod:
- Pod is removed from Service endpoints, so new traffic routes elsewhere
- SIGTERM is sent to the application (this happens roughly in parallel with endpoint removal, which is why a short preStop sleep helps)
- Application stops accepting new requests
- In-flight requests complete (within the grace period)
- Application exits
No dropped requests.
Pattern: Health Checks
// Readiness probe: Can this instance handle traffic?
var isShuttingDown bool // set to true when SIGTERM is received

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    if isShuttingDown {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// Liveness probe: Is the application still alive?
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}
Kubernetes checks readiness before routing traffic. When shutdown begins, readiness fails, traffic routes away, then the pod terminates.
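The readiness flip and the SIGTERM handler above can be tied together with an atomic flag. A minimal Go sketch of that wiring — `readinessStatus` and the `shuttingDown` variable are illustrative names, not part of any framework:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// shuttingDown is flipped to true when SIGTERM arrives, before
// http.Server.Shutdown is called, so the readiness probe fails first
// and the load balancer drains traffic away from this instance.
var shuttingDown atomic.Bool

// readinessStatus returns the HTTP status the readiness probe should
// report for a given shutdown state.
func readinessStatus(isShuttingDown bool) int {
	if isShuttingDown {
		return http.StatusServiceUnavailable // 503: stop sending traffic
	}
	return http.StatusOK // 200: ready for traffic
}

func readyHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(readinessStatus(shuttingDown.Load()))
}

func main() {
	shuttingDown.Store(true) // simulate SIGTERM received
	fmt.Println(readinessStatus(shuttingDown.Load())) // 503 once shutdown begins
}
```

The order matters: flip the flag, let a probe cycle or two fail, then call Shutdown.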
Layer 2: Infrastructure Level (Kubernetes)
Your deployment strategy must handle instance replacement gracefully.
Pattern: Rolling Update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # Never drop below desired replicas
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      terminationGracePeriodSeconds: 30  # Wait 30s for graceful shutdown
      containers:
        - name: api
          image: api:v2.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]  # Extra time for load balancer to notice
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
How it works:
- New pod (v2.0) spins up alongside old pod (v1.0)
- New pod reports ready
- Load balancer starts routing traffic to new pod
- Traffic gradually shifts from old to new
- Old pod receives SIGTERM
- Old pod gracefully shuts down
- Repeat until all pods updated
No downtime because there's always capacity to handle traffic.
Avoiding connection drains:
# Pod Disruption Budget prevents too many simultaneous terminations
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # Always keep 2 pods running
  selector:
    matchLabels:
      app: api-service
This prevents accidental service degradation during maintenance.
Layer 3: Database Level
This is the tricky one. Schema changes can cause downtime if not planned.
The challenge:
Deployment v1.0: Code expects table 'users' with columns id, name
Deployment v2.0: Code expects columns id, name, email
If schema and code change in lockstep — say, you deploy v2.0 before the email column exists, or drop a column while v1.0 instances still read it:
- Instances running the mismatched version hit missing-column errors
- Result: Errors and partial downtime
Solution: Expand-Contract Pattern
Phase 1: Add column (backward compatible)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- Old code: continues ignoring email
-- New code: can read/write email
Phase 2: Deploy new code
- Deploy v2.0 (which uses email column)
- Both v1.0 and v2.0 are running
- Both work fine (v1.0 ignores email, v2.0 uses it)
Phase 3: Remove obsolete columns (after v1.0 fully sunset)
ALTER TABLE users DROP COLUMN old_field;
-- Drop only columns that no running version still reads
Timeline:
- Day 1: Add column
- Day 2: Deploy new code
- Day 7: Remove old column
This is boring, not dramatic, and completely safe.
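One detail worth handling in v2.0: during the expand window, rows written before the migration carry NULL in the new column. A minimal Go sketch of the defensive read — `displayEmail` is an illustrative helper; in real code you would scan the column into a `sql.NullString` or a `*string`:

```go
package main

import "fmt"

// displayEmail tolerates rows created before the email column existed:
// during the expand-contract window, legacy rows carry NULL (nil here).
func displayEmail(email *string) string {
	if email == nil {
		return "(no email on file)" // legacy row, not yet backfilled
	}
	return *email
}

func main() {
	addr := "ada@example.com"
	fmt.Println(displayEmail(&addr)) // ada@example.com
	fmt.Println(displayEmail(nil))   // (no email on file)
}
```

Backfill the column at your leisure; the code stays correct either way.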
Another pattern: Feature flags
if (featureFlags.isEnabled("use_new_field")) {
    // Use new logic
} else {
    // Use old logic
}
Deploy with the flag disabled, then enable it once the rollout completes. If problems appear, disable the flag again — instant rollback, no redeployment needed.
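A minimal in-process sketch of the flag check in Go — a real setup would use a flag service or config store; `Flags` and `Enabled` are illustrative names:

```go
package main

import "fmt"

// Flags is a toy in-memory flag store. Unknown flags default to off,
// which is the safe direction for a brand-new code path.
type Flags map[string]bool

func (f Flags) Enabled(name string) bool {
	return f[name] // missing key -> false
}

func main() {
	flags := Flags{"use_new_field": false} // ship dark, enable later
	if flags.Enabled("use_new_field") {
		fmt.Println("new logic")
	} else {
		fmt.Println("old logic") // printed until the flag is flipped
	}
}
```

Defaulting unknown flags to off means a typo in a flag name degrades to the old, proven path instead of the new one.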
Practical Zero-Downtime Deployment Workflow
Standard Deployment (no database changes)
1. Developer merges code to main
↓
2. CI/CD builds and tests image
↓
3. Image pushed to registry
↓
4. ArgoCD detects Git change
↓
5. ArgoCD applies the new manifest to the cluster
↓
6. Rolling update replaces pods one at a time
↓
7. No downtime (every step respects graceful shutdown)
Monitoring during deployment:
# Watch the rollout
kubectl rollout status deployment/api-service
# If something goes wrong
kubectl rollout undo deployment/api-service
Rollback is fast because Kubernetes retains the previous ReplicaSet and the old image is still cached on the nodes — pods from the last known-good version come back in seconds.
Complex Deployment (database schema changes)
Day 1:
1. Run migration: ALTER TABLE users ADD COLUMN email
2. Deploy code v1.0 (doesn't use email yet)
3. Verify: Both code and database working
Day 2:
1. Deploy code v2.0 (uses email column)
2. Rolling update begins
3. Both v1.0 and v2.0 handle traffic
4. Complete
Day 7:
1. Run cleanup: ALTER TABLE users DROP COLUMN old_field
2. Verify database state
Deployment with Traffic Shifting (Canary)
For higher risk deployments, test with traffic first:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5     # Failed metric checks before automatic rollback
    maxWeight: 50    # Route at most 50% of traffic to the canary
    stepWeight: 10   # Increase by 10% every interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
      - name: request-duration
        thresholdRange:
          max: 500
What happens:
- Deploy v2.0 (receives 0% traffic initially)
- Canary sends 10% traffic to v2.0, 90% to v1.0
- Monitor metrics for 1 minute
- If success rate > 99%, increase to 20%
- Continue until v2.0 handles 100% or metrics fail
- If metrics fail, automatically roll back
Result: If v2.0 has a bug, only 10% of users see it. Immediate rollback.
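The stepWeight/maxWeight pair fully determines the traffic schedule. A small Go sketch of that arithmetic — `canarySteps` is an illustrative helper, not Flagger code:

```go
package main

import "fmt"

// canarySteps returns the successive traffic weights the canary
// receives: stepWeight, 2*stepWeight, ... capped at maxWeight.
func canarySteps(stepWeight, maxWeight int) []int {
	var steps []int
	for w := stepWeight; w <= maxWeight; w += stepWeight {
		steps = append(steps, w)
	}
	return steps
}

func main() {
	// With stepWeight: 10 and maxWeight: 50, as in the manifest above:
	fmt.Println(canarySteps(10, 50)) // [10 20 30 40 50]
}
```

Once the canary passes its checks at maxWeight, it is promoted and takes 100% of traffic.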
Monitoring and Rollback Procedures
Zero-downtime deployment is only safe if you can see problems and rollback quickly.
Essential metrics during deployment:
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Throughput (requests per second)
- Application-specific metrics (transaction success, cart abandonment)
Automated rollback triggers:
If error_rate > 1% for 2 minutes:
- Rollback deployment
- Alert on-call engineer
- Trigger incident investigation
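The trigger above can be sketched as a pure decision function: given error-rate samples collected over the last two minutes, roll back only if every sample breaches the threshold, so one noisy scrape can't revert a deploy. `shouldRollback` is an illustrative name, not part of any monitoring product:

```go
package main

import "fmt"

// shouldRollback returns true when every error-rate sample in the
// window exceeds the threshold -- i.e. the breach is sustained,
// not a single noisy scrape.
func shouldRollback(errorRates []float64, threshold float64) bool {
	if len(errorRates) == 0 {
		return false // no data is not evidence of failure
	}
	for _, r := range errorRates {
		if r <= threshold {
			return false
		}
	}
	return true
}

func main() {
	window := []float64{1.4, 2.1, 1.8, 1.6} // % errors, sampled over 2 minutes
	fmt.Println(shouldRollback(window, 1.0)) // true: sustained breach
	spike := []float64{0.2, 3.0, 0.1, 0.2}
	fmt.Println(shouldRollback(spike, 1.0)) // false: transient spike
}
```

In production you would feed this from your metrics backend and have it call `kubectl rollout undo` (or your CD tool's equivalent) plus the alerting pipeline.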
Manual rollback:
# If you need to rollback immediately
kubectl rollout undo deployment/api-service
# Verify
kubectl get pods
kubectl logs deployment/api-service
Rollback is fast because the previous ReplicaSet definition is retained and the old image is still cached on each node, so the prior version's pods are recreated in seconds.
The 90-Day Implementation Plan
Month 1: Foundation
- Implement graceful shutdown in applications
- Configure readiness/liveness probes
- Test rolling updates in staging
- Document deployment procedures
Month 2: Database Evolution
- Audit existing schemas
- Implement expand-contract pattern
- Test schema migrations with zero downtime
- Document database change procedures
Month 3: Advanced Patterns
- Implement canary deployments for high-risk services
- Set up automated rollback based on metrics
- Document incident response procedures
- Run deployment game days
Common Gotchas
Gotcha 1: Long-running requests. If your application handles long-lived connections (WebSockets, Server-Sent Events), graceful shutdown must wait for them.
Solution:
terminationGracePeriodSeconds: 300 # 5 minutes for long connections
Gotcha 2: Persistent connections not respecting graceful shutdown. Some clients (database drivers, cache clients) hold pooled connections open and don't immediately reconnect.
Solution: Implement connection pooling with automatic reconnection. Most libraries handle this by default.
Gotcha 3: Deployment takes forever. If you have 100 replicas, a rolling update with maxUnavailable: 0 replaces them one surge at a time, which can take a long while.
Solution:
maxSurge: 25%        # Allow 25% extra pods temporarily
maxUnavailable: 10%  # Allow up to 10% of pods offline at once
Trade-off: Uses more resources during deployment, finishes faster.
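The speed/resource trade-off is easy to quantify: with maxUnavailable: 0, the rollout proceeds in waves of roughly maxSurge pods. A back-of-the-envelope Go sketch — `rolloutWaves` is an illustrative helper, and the real controller also waits on readiness between waves:

```go
package main

import "fmt"

// rolloutWaves estimates how many replacement waves a rolling update
// needs when maxUnavailable is 0: each wave surges up to maxSurge new
// pods, waits for readiness, then retires the same number of old pods.
func rolloutWaves(replicas, maxSurge int) int {
	if maxSurge < 1 {
		maxSurge = 1 // maxSurge 0 with maxUnavailable 0 is rejected by Kubernetes
	}
	return (replicas + maxSurge - 1) / maxSurge // ceiling division
}

func main() {
	fmt.Println(rolloutWaves(100, 1))  // 100 waves: maxSurge 1 is slow at scale
	fmt.Println(rolloutWaves(100, 25)) // 4 waves: maxSurge 25% finishes far sooner
}
```

Each wave still pays the readiness delay, so fewer waves is where most of the time saving comes from.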
Real Impact
Before zero-downtime deployments:
- Deploy Monday-Friday 9 AM - 5 PM only
- Deployments take 2-4 hours
- Team stress high
- Incidents during deployments are catastrophic
After implementing zero-downtime:
- Deploy anytime (even production incidents can be fixed mid-day)
- Deployments take 10-20 minutes
- Team stress low
- Frequent deployments (daily or more) become normal
The ability to deploy safely and frequently is a force multiplier for engineering velocity.
Found this helpful? See how Hidora can help: Professional Services · Managed Services · SLA Expert



