Zero Downtime: The CTO's Guide to Sleeping Well
CTOs often face this tension: our business needs continuous innovation (multiple deployments daily), but zero downtime during deployments feels like an impossible luxury. Either we deploy frequently and risk incidents, or we deploy carefully and move slowly.
The good news: zero-downtime deployments are achievable for most workloads. It requires planning, but the architecture patterns are well-established.
Here's how to implement them.
What Zero Downtime Actually Means
Zero downtime != zero risk: You can deploy without user-facing downtime while still carrying deployment risk. Rollback procedures, monitoring, and incident response remain critical.
Zero downtime means:
- Users don't experience service interruption during deployment
- Database schemas evolve without blocking queries
- Infrastructure changes don't drop connections
- Rollback doesn't require downtime
This is achievable for:
- Stateless services (APIs, web applications)
- Stateful services with proper design (databases, caches)
- Background jobs and batch systems
This is hard for:
- Systems with global state
- Distributed consensus systems (etcd, Kafka during rebalancing)
- Legacy monoliths tightly coupled to infrastructure
The Three Layers of Zero-Downtime Deployment
Layer 1: Application Level
Your application must handle being replaced mid-request.
Pattern: Graceful Shutdown
// Your application should handle SIGTERM properly
package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}
    go func() {
        sigterm := make(chan os.Signal, 1)
        signal.Notify(sigterm, syscall.SIGTERM)
        <-sigterm
        // Give in-flight requests 30 seconds to complete
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        server.Shutdown(ctx)
    }()
    server.ListenAndServe() // returns http.ErrServerClosed after Shutdown
}
When Kubernetes terminates a pod:
- Pod is removed from Service endpoints, so new traffic routes elsewhere
- SIGTERM is sent to the application (this happens roughly in parallel with endpoint removal, which is why a short preStop sleep helps)
- Application stops accepting new requests
- In-flight requests complete (within the grace period)
- Application exits
No dropped requests.
Pattern: Health Checks
// Readiness probe: Can this instance handle traffic?
var isShuttingDown bool // set to true when SIGTERM is received

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    if isShuttingDown {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// Liveness probe: Is the application still alive?
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}
Kubernetes checks readiness before routing traffic. When shutdown begins, readiness fails, traffic routes away, then the pod terminates.
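The readiness flip and the SIGTERM handler above can be tied together with an atomic flag. A minimal Go sketch of that wiring — `readinessStatus` and the `shuttingDown` variable are illustrative names, not part of any framework:

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// shuttingDown is flipped to true when SIGTERM arrives, before
// http.Server.Shutdown is called, so the readiness probe fails first
// and the load balancer drains traffic away from this instance.
var shuttingDown atomic.Bool

// readinessStatus returns the HTTP status the readiness probe should
// report for a given shutdown state.
func readinessStatus(isShuttingDown bool) int {
	if isShuttingDown {
		return http.StatusServiceUnavailable // 503: stop sending traffic
	}
	return http.StatusOK // 200: ready for traffic
}

func readyHandler(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(readinessStatus(shuttingDown.Load()))
}

func main() {
	shuttingDown.Store(true) // simulate SIGTERM received
	fmt.Println(readinessStatus(shuttingDown.Load())) // 503 once shutdown begins
}
```

The order matters: flip the flag, let a probe cycle or two fail, then call Shutdown.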
Layer 2: Infrastructure Level (Kubernetes)
Your deployment strategy must handle instance replacement gracefully.
Pattern: Rolling Update
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # Never drop below desired replicas
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      terminationGracePeriodSeconds: 30  # Wait 30s for graceful shutdown
      containers:
        - name: api
          image: api:v2.0
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]  # Extra time for load balancer to notice
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
How it works:
- New pod (v2.0) spins up alongside old pod (v1.0)
- New pod reports ready
- Load balancer starts routing traffic to new pod
- Traffic gradually shifts from old to new
- Old pod receives SIGTERM
- Old pod gracefully shuts down
- Repeat until all pods updated
No downtime because there's always capacity to handle traffic.
Avoiding connection drains:
# Pod Disruption Budget prevents too many simultaneous terminations
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # Always keep 2 pods running
  selector:
    matchLabels:
      app: api-service
This prevents accidental service degradation during maintenance.
Layer 3: Database Level
This is the tricky one. Schema changes can cause downtime if not planned.
The challenge:
Deployment v1.0: Code expects table 'users' with columns id, name
Deployment v2.0: Code expects columns id, name, email
If schema and code change in lockstep — say, you deploy v2.0 before the email column exists, or drop a column while v1.0 instances still read it:
- Instances running the mismatched version hit missing-column errors
- Result: Errors and partial downtime
Solution: Expand-Contract Pattern
Phase 1: Add column (backward compatible)
ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- Old code: continues ignoring email
-- New code: can read/write email
Phase 2: Deploy new code
- Deploy v2.0 (which uses email column)
- Both v1.0 and v2.0 are running
- Both work fine (v1.0 ignores email, v2.0 uses it)
Phase 3: Remove obsolete columns (after v1.0 fully sunset)
ALTER TABLE users DROP COLUMN old_field;
-- Drop only columns that no running version still reads
Timeline:
- Day 1: Add column
- Day 2: Deploy new code
- Day 7: Remove old column
This is boring, not dramatic, and completely safe.
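One detail worth handling in v2.0: during the expand window, rows written before the migration carry NULL in the new column. A minimal Go sketch of the defensive read — `displayEmail` is an illustrative helper; in real code you would scan the column into a `sql.NullString` or a `*string`:

```go
package main

import "fmt"

// displayEmail tolerates rows created before the email column existed:
// during the expand-contract window, legacy rows carry NULL (nil here).
func displayEmail(email *string) string {
	if email == nil {
		return "(no email on file)" // legacy row, not yet backfilled
	}
	return *email
}

func main() {
	addr := "ada@example.com"
	fmt.Println(displayEmail(&addr)) // ada@example.com
	fmt.Println(displayEmail(nil))   // (no email on file)
}
```

Backfill the column at your leisure; the code stays correct either way.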
Another pattern: Feature flags
if (featureFlags.isEnabled("use_new_field")) {
    // Use new logic
} else {
    // Use old logic
}
Deploy with the flag disabled, then enable it once the rollout completes. If problems appear, disable the flag again — instant rollback, no redeployment needed.
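A minimal in-process sketch of the flag check in Go — a real setup would use a flag service or config store; `Flags` and `Enabled` are illustrative names:

```go
package main

import "fmt"

// Flags is a toy in-memory flag store. Unknown flags default to off,
// which is the safe direction for a brand-new code path.
type Flags map[string]bool

func (f Flags) Enabled(name string) bool {
	return f[name] // missing key -> false
}

func main() {
	flags := Flags{"use_new_field": false} // ship dark, enable later
	if flags.Enabled("use_new_field") {
		fmt.Println("new logic")
	} else {
		fmt.Println("old logic") // printed until the flag is flipped
	}
}
```

Defaulting unknown flags to off means a typo in a flag name degrades to the old, proven path instead of the new one.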
Practical Zero-Downtime Deployment Workflow
Standard Deployment (no database changes)
1. Developer merges code to main
↓
2. CI/CD builds and tests image
↓
3. Image pushed to registry
↓
4. ArgoCD detects Git change
↓
5. ArgoCD applies the new manifest to the cluster
↓
6. Rolling update replaces pods one at a time
↓
7. No downtime (every step respects graceful shutdown)
Monitoring during deployment:
# Watch the rollout
kubectl rollout status deployment/api-service
# If something goes wrong
kubectl rollout undo deployment/api-service
Rollback is fast because Kubernetes retains the previous ReplicaSet and the old image is still cached on the nodes — pods from the last known-good version come back in seconds.
Complex Deployment (database schema changes)
Day 1:
1. Run migration: ALTER TABLE users ADD COLUMN email
2. Deploy code v1.0 (doesn't use email yet)
3. Verify: Both code and database working
Day 2:
1. Deploy code v2.0 (uses email column)
2. Rolling update begins
3. Both v1.0 and v2.0 handle traffic
4. Complete
Day 7:
1. Run cleanup: ALTER TABLE users DROP COLUMN old_field
2. Verify database state
Deployment with Traffic Shifting (Canary)
For higher risk deployments, test with traffic first:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5     # Failed metric checks before automatic rollback
    maxWeight: 50    # Route at most 50% of traffic to the canary
    stepWeight: 10   # Increase by 10% every interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
      - name: request-duration
        thresholdRange:
          max: 500
What happens:
- Deploy v2.0 (receives 0% traffic initially)
- Canary sends 10% traffic to v2.0, 90% to v1.0
- Monitor metrics for 1 minute
- If success rate > 99%, increase to 20%
- Continue until v2.0 handles 100% or metrics fail
- If metrics fail, automatically roll back
Result: If v2.0 has a bug, only 10% of users see it. Immediate rollback.
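The stepWeight/maxWeight pair fully determines the traffic schedule. A small Go sketch of that arithmetic — `canarySteps` is an illustrative helper, not Flagger code:

```go
package main

import "fmt"

// canarySteps returns the successive traffic weights the canary
// receives: stepWeight, 2*stepWeight, ... capped at maxWeight.
func canarySteps(stepWeight, maxWeight int) []int {
	var steps []int
	for w := stepWeight; w <= maxWeight; w += stepWeight {
		steps = append(steps, w)
	}
	return steps
}

func main() {
	// With stepWeight: 10 and maxWeight: 50, as in the manifest above:
	fmt.Println(canarySteps(10, 50)) // [10 20 30 40 50]
}
```

Once the canary passes its checks at maxWeight, it is promoted and takes 100% of traffic.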
Monitoring and Rollback Procedures
Zero-downtime deployment is only safe if you can see problems and rollback quickly.
Essential metrics during deployment:
- Request latency (p50, p95, p99)
- Error rate (4xx, 5xx)
- Throughput (requests per second)
- Application-specific metrics (transaction success, cart abandonment)
Automated rollback triggers:
If error_rate > 1% for 2 minutes:
- Rollback deployment
- Alert on-call engineer
- Trigger incident investigation
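The trigger above can be sketched as a pure decision function: given error-rate samples collected over the last two minutes, roll back only if every sample breaches the threshold, so one noisy scrape can't revert a deploy. `shouldRollback` is an illustrative name, not part of any monitoring product:

```go
package main

import "fmt"

// shouldRollback returns true when every error-rate sample in the
// window exceeds the threshold -- i.e. the breach is sustained,
// not a single noisy scrape.
func shouldRollback(errorRates []float64, threshold float64) bool {
	if len(errorRates) == 0 {
		return false // no data is not evidence of failure
	}
	for _, r := range errorRates {
		if r <= threshold {
			return false
		}
	}
	return true
}

func main() {
	window := []float64{1.4, 2.1, 1.8, 1.6} // % errors, sampled over 2 minutes
	fmt.Println(shouldRollback(window, 1.0)) // true: sustained breach
	spike := []float64{0.2, 3.0, 0.1, 0.2}
	fmt.Println(shouldRollback(spike, 1.0)) // false: transient spike
}
```

In production you would feed this from your metrics backend and have it call `kubectl rollout undo` (or your CD tool's equivalent) plus the alerting pipeline.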
Manual rollback:
# If you need to rollback immediately
kubectl rollout undo deployment/api-service
# Verify
kubectl get pods
kubectl logs deployment/api-service
Rollback is fast because the previous ReplicaSet definition is retained and the old image is still cached on each node, so the prior version's pods are recreated in seconds.
The 90-Day Implementation Plan
Month 1: Foundation
- Implement graceful shutdown in applications
- Configure readiness/liveness probes
- Test rolling updates in staging
- Document deployment procedures
Month 2: Database Evolution
- Audit existing schemas
- Implement expand-contract pattern
- Test schema migrations with zero downtime
- Document database change procedures
Month 3: Advanced Patterns
- Implement canary deployments for high-risk services
- Set up automated rollback based on metrics
- Document incident response procedures
- Run deployment game days
Common Gotchas
Gotcha 1: Long-running requests. If your application handles long-lived connections (WebSockets, Server-Sent Events), graceful shutdown must wait for them.
Solution:
terminationGracePeriodSeconds: 300 # 5 minutes for long connections
Gotcha 2: Persistent connections not respecting graceful shutdown. Some clients (database drivers, cache clients) hold pooled connections open and don't immediately reconnect.
Solution: Implement connection pooling with automatic reconnection. Most libraries handle this by default.
Gotcha 3: Deployment takes forever. If you have 100 replicas, a rolling update with maxUnavailable: 0 replaces them one surge at a time, which can take a long while.
Solution:
maxSurge: 25%        # Allow 25% extra pods temporarily
maxUnavailable: 10%  # Allow up to 10% of pods offline at once
Trade-off: Uses more resources during deployment, finishes faster.
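The speed/resource trade-off is easy to quantify: with maxUnavailable: 0, the rollout proceeds in waves of roughly maxSurge pods. A back-of-the-envelope Go sketch — `rolloutWaves` is an illustrative helper, and the real controller also waits on readiness between waves:

```go
package main

import "fmt"

// rolloutWaves estimates how many replacement waves a rolling update
// needs when maxUnavailable is 0: each wave surges up to maxSurge new
// pods, waits for readiness, then retires the same number of old pods.
func rolloutWaves(replicas, maxSurge int) int {
	if maxSurge < 1 {
		maxSurge = 1 // maxSurge 0 with maxUnavailable 0 is rejected by Kubernetes
	}
	return (replicas + maxSurge - 1) / maxSurge // ceiling division
}

func main() {
	fmt.Println(rolloutWaves(100, 1))  // 100 waves: maxSurge 1 is slow at scale
	fmt.Println(rolloutWaves(100, 25)) // 4 waves: maxSurge 25% finishes far sooner
}
```

Each wave still pays the readiness delay, so fewer waves is where most of the time saving comes from.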
Real Impact
Before zero-downtime deployments:
- Deploy Monday-Friday 9 AM - 5 PM only
- Deployments take 2-4 hours
- Team stress high
- Incidents during deployments are catastrophic
After implementing zero-downtime:
- Deploy anytime (even production incidents can be fixed mid-day)
- Deployments take 10-20 minutes
- Team stress low
- Frequent deployments (daily or more) become normal
The ability to deploy safely and frequently is a force multiplier for engineering velocity.
Found this helpful? See how Hidora can help: Professional Services · Managed Services · SLA Expert



