
Zero Downtime: The CTO's Guide to Sleeping Well

Mattia Eleuteri · 23 October 2025

CTOs often face this tension: our business needs continuous innovation (multiple deployments daily), but zero downtime during deployments feels like an impossible luxury. Either we deploy frequently and risk incidents, or we deploy carefully and move slowly.

The good news: zero-downtime deployments are achievable for most workloads. It requires planning, but the architecture patterns are well-established.

Here's how to implement them.

What Zero Downtime Actually Means

Zero downtime != zero risk: You can deploy without user-facing downtime while still carrying deployment risk. Rollback procedures, monitoring, and incident response remain critical.

Zero downtime means:

  • Users don't experience service interruption during deployment
  • Database schemas evolve without blocking queries
  • Infrastructure changes don't drop connections
  • Rollback doesn't require downtime

This is achievable for:

  • Stateless services (APIs, web applications)
  • Stateful services with proper design (databases, caches)
  • Background jobs and batch systems

This is hard for:

  • Systems with global state
  • Distributed consensus systems (etcd, Kafka during rebalancing)
  • Legacy monoliths tightly coupled to infrastructure

The Three Layers of Zero-Downtime Deployment

Layer 1: Application Level

Your application must handle being replaced mid-request.

Pattern: Graceful Shutdown

// Your application should handle SIGTERM properly

package main

import (
    "context"
    "net/http"
    "os"
    "os/signal"
    "syscall"
    "time"
)

func main() {
    server := &http.Server{Addr: ":8080"}

    // Serve in a goroutine so main can block on the signal.
    go func() {
        // Returns http.ErrServerClosed once Shutdown is called.
        server.ListenAndServe()
    }()

    sigterm := make(chan os.Signal, 1)
    signal.Notify(sigterm, syscall.SIGTERM)
    <-sigterm

    // Give in-flight requests 30 seconds to complete
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()
    server.Shutdown(ctx)
}

When Kubernetes terminates a pod:

  1. The pod is marked Terminating and removed from Service endpoints, so new traffic goes elsewhere
  2. In parallel, SIGTERM is sent to the application (this overlap is why a short preStop sleep helps)
  3. Application stops accepting new requests
  4. In-flight requests complete (within the grace period)
  5. Application exits

No dropped requests.

Pattern: Health Checks

// Readiness probe: Can this instance handle traffic?
func readinessHandler(w http.ResponseWriter, r *http.Request) {
    if isShuttingDown {
        w.WriteHeader(http.StatusServiceUnavailable)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// Liveness probe: Is the application still alive?
func livenessHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
}

Kubernetes checks readiness before routing traffic. When shutdown begins, readiness fails, traffic routes away, then the pod terminates.
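Wiring the two together can be sketched as below. This is a minimal illustration, not a full server: it assumes an in-process shuttingDown flag that the SIGTERM handler would flip just before calling server.Shutdown, and it uses httptest to stand in for the kubelet's probe.

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// shuttingDown would be flipped by the SIGTERM handler just before
// server.Shutdown, so the readiness probe fails first and Kubernetes
// stops routing new traffic to this pod.
var shuttingDown atomic.Bool

func readinessHandler(w http.ResponseWriter, r *http.Request) {
	if shuttingDown.Load() {
		w.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
	io.WriteString(w, "ready")
}

// probeStatus simulates one Kubernetes readiness check against the handler.
func probeStatus() int {
	srv := httptest.NewServer(http.HandlerFunc(readinessHandler))
	defer srv.Close()
	resp, err := http.Get(srv.URL)
	if err != nil {
		return 0
	}
	resp.Body.Close()
	return resp.StatusCode
}

func main() {
	println("healthy:", probeStatus()) // 200
	shuttingDown.Store(true)           // what the SIGTERM handler does
	println("draining:", probeStatus()) // 503
}
```

The order matters: fail readiness first, let the load balancer notice, then stop the server.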

Layer 2: Infrastructure Level (Kubernetes)

Your deployment strategy must handle instance replacement gracefully.

Pattern: Rolling Update

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # Allow 1 extra pod during update
      maxUnavailable: 0     # Never drop below desired replicas
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      terminationGracePeriodSeconds: 30  # Wait 30s for graceful shutdown
      containers:
      - name: api
        image: api:v2.0
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]  # Extra time for load balancer to notice
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

How it works:

  1. New pod (v2.0) spins up alongside old pod (v1.0)
  2. New pod reports ready
  3. Load balancer starts routing traffic to new pod
  4. Traffic gradually shifts from old to new
  5. Old pod receives SIGTERM
  6. Old pod gracefully shuts down
  7. Repeat until all pods updated

No downtime because there's always capacity to handle traffic.

Guarding against voluntary disruptions (node drains, cluster upgrades):

# Pod Disruption Budget prevents too many simultaneous terminations
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2  # Always keep 2 pods running
  selector:
    matchLabels:
      app: api-service

This prevents accidental service degradation during maintenance.

Layer 3: Database Level

This is the tricky one. Schema changes can cause downtime if not planned.

The challenge:

Deployment v1.0: Code expects table 'users' with columns id, name
Deployment v2.0: Code expects columns id, name, email

If schema and code change in one step, some instances are always mismatched during the rollout:
  - Deploy v2.0 before adding the column: new instances error on every query touching email
  - Drop a column while code that uses it still runs: those instances error instead
  - Result: Errors and partial downtime

Solution: Expand-Contract Pattern

Phase 1: Add column (backward compatible)

ALTER TABLE users ADD COLUMN email VARCHAR(255);
-- Old code: continues ignoring email
-- New code: can read/write email

Phase 2: Deploy new code

  • Deploy v2.0 (which uses email column)
  • Both v1.0 and v2.0 are running
  • Both work fine (v1.0 ignores email, v2.0 uses it)

Phase 3: Remove column (after v1.0 fully sunset)

ALTER TABLE users DROP COLUMN old_field;  -- the now-unused column that email replaced

Timeline:

  • Day 1: Add column
  • Day 2: Deploy new code
  • Day 7: Remove old column

This is boring, not dramatic, and completely safe.
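One step the three phases gloss over: if existing rows need a value in the new column, backfill between Phase 1 and Phase 2, in small batches so the migration never holds long locks. A sketch in MySQL syntax; how email is derived is hypothetical, and the batch size is arbitrary:

```sql
-- Backfill in small batches to avoid long-running locks.
UPDATE users
SET email = CONCAT(name, '@example.com')
WHERE email IS NULL
LIMIT 1000;
-- Repeat until 0 rows are affected, then proceed to Phase 2.
```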

Another pattern: Feature flags

if featureFlags.IsEnabled("use_new_field") {
    // Use new logic
} else {
    // Use old logic
}

Deploy with the new field disabled, then enable the flag. If problems appear, disable the flag again: rollback without a redeployment.

Practical Zero-Downtime Deployment Workflow

Standard Deployment (no database changes)

1. Developer merges code to main
   ↓
2. CI/CD builds and tests image
   ↓
3. Image pushed to registry
   ↓
4. ArgoCD detects Git change
   ↓
5. Rolling update begins
   ↓
6. kubectl rolls out new version
   ↓
7. No downtime (all steps respect graceful shutdown)

Monitoring during deployment:

# Watch the rollout
kubectl rollout status deployment/api-service

# If something goes wrong
kubectl rollout undo deployment/api-service

Rollback is fast because kubectl rollout undo re-applies the previous ReplicaSet: the old image is still cached on the nodes, so pods come back in seconds.

Complex Deployment (database schema changes)

Day 1:
  1. Run migration: ALTER TABLE users ADD COLUMN email
  2. Code v1.0 keeps running (doesn't use email yet)
  3. Verify: Both code and database working

Day 2:
  1. Deploy code v2.0 (uses email column)
  2. Rolling update begins
  3. Both v1.0 and v2.0 handle traffic
  4. Complete

Day 7:
  1. Run cleanup: ALTER TABLE users DROP COLUMN old_field (the column email replaced)
  2. Verify database state

Deployment with Traffic Shifting (Canary)

For higher risk deployments, test with traffic first:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5  # Max failed metric checks before rollback
    maxWeight: 50  # Route maximum 50% traffic
    stepWeight: 10 # Increase by 10% every interval
  metrics:
  - name: request-success-rate
    thresholdRange:
      min: 99
  - name: request-duration
    thresholdRange:
      max: 500

What happens:

  1. Deploy v2.0 (receives 0% traffic initially)
  2. Canary sends 10% traffic to v2.0, 90% to v1.0
  3. Monitor metrics for 1 minute
  4. If success rate > 99%, increase to 20%
  5. Continue until v2.0 handles 100% or metrics fail
  6. If metrics fail, automatically roll back

Result: If v2.0 has a bug, only 10% of users see it. Immediate rollback.

Monitoring and Rollback Procedures

Zero-downtime deployment is only safe if you can see problems and rollback quickly.

Essential metrics during deployment:

  • Request latency (p50, p95, p99)
  • Error rate (4xx, 5xx)
  • Throughput (requests per second)
  • Application-specific metrics (transaction success, cart abandonment)

Automated rollback triggers:

If error_rate > 1% for 2 minutes:
  - Rollback deployment
  - Alert on-call engineer
  - Trigger incident investigation
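The pseudocode above can be expressed as a real alert. A hedged sketch, assuming the prometheus-operator PrometheusRule CRD and a http_requests_total metric labeled by status code (both are assumptions about your stack); a webhook receiver or your CD tool would act on the alert:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: deploy-error-rate
spec:
  groups:
  - name: deployment
    rules:
    - alert: HighErrorRateDuringDeploy
      # 5xx responses above 1% of all requests, sustained for 2 minutes
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[2m]))
          / sum(rate(http_requests_total[2m])) > 0.01
      for: 2m
      labels:
        severity: page
```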

Manual rollback:

# If you need to rollback immediately
kubectl rollout undo deployment/api-service

# Verify
kubectl get pods
kubectl logs deployment/api-service

Rollback is fast because Kubernetes keeps the previous ReplicaSet definition around and the nodes still have the old image cached.

The 90-Day Implementation Plan

Month 1: Foundation

  • Implement graceful shutdown in applications
  • Configure readiness/liveness probes
  • Test rolling updates in staging
  • Document deployment procedures

Month 2: Database Evolution

  • Audit existing schemas
  • Implement expand-contract pattern
  • Test schema migrations with zero downtime
  • Document database change procedures

Month 3: Advanced Patterns

  • Implement canary deployments for high-risk services
  • Set up automated rollback based on metrics
  • Document incident response procedures
  • Run deployment game days

Common Gotchas

Gotcha 1: Long-running requests

If your application handles long-lived connections (WebSockets, Server-Sent Events), graceful shutdown must wait for them.

Solution:

terminationGracePeriodSeconds: 300  # 5 minutes for long connections

Gotcha 2: Persistent connections not respecting graceful shutdown

Some client libraries (databases, caches) pool connections and don't immediately reconnect when a server goes away.

Solution: Implement connection pooling with automatic reconnection. Most libraries handle this by default.

Gotcha 3: Deployment takes forever

If you have 100 replicas, a rolling update with maxUnavailable: 0 takes time.

Solution:

maxSurge: 25%       # Allow 25 extra pods temporarily
maxUnavailable: 10% # Allow 10 pods offline at once

Trade-off: Uses more resources during deployment, finishes faster.

Real Impact

Before zero-downtime deployments:

  • Deploy Monday-Friday 9 AM - 5 PM only
  • Deployments take 2-4 hours
  • Team stress high
  • Incidents during deployments are catastrophic

After implementing zero-downtime:

  • Deploy anytime (even production incidents can be fixed mid-day)
  • Deployments take 10-20 minutes
  • Team stress low
  • Frequent deployments (daily or more) become normal

The ability to deploy safely and frequently is a force multiplier for engineering velocity.


Found this helpful? See how Hidora can help: Professional Services · Managed Services · SLA Expert
