Observability at Scale: Beyond Grafana Dashboards
At some point, every growing company realizes their monitoring strategy is broken. They have Grafana dashboards that no one understands, alerts that fire constantly (mostly false positives), and zero ability to debug issues when something goes wrong.
"We need better observability," they declare.
Six months later, they've invested in more tools (Datadog, New Relic, more Grafana dashboards) and the situation hasn't improved much. The problem isn't the tools. It's the strategy.
Observability at scale requires a fundamental shift: from monitoring (does my system work?) to observability (why did it break?).
Monitoring vs. Observability
Monitoring: What You Know
Monitoring answers: "Is everything okay?"
✓ Server up/down
✓ CPU utilization
✓ Memory usage
✓ Disk space
Useful, but limited. If you're at 85% CPU but everything works fine, is that a problem?
Observability: What You Can Learn
Observability answers: "Why did something break?"
- Request A failed with a 500 error
- It failed because database query took 45 seconds
- Query was slow because a table lock was held
- Lock was held by table maintenance job running at 3 AM
- Solution: Reschedule maintenance job
Observability requires three ingredients: metrics, logs, and traces. Together, they let you understand system behavior and debug issues.
The Three Pillars of Observability
Pillar 1: Metrics (What Is Happening Now?)
Metrics are time-series data about your system: request latency, error rate, CPU usage, etc.
Metrics answer:
- How many requests are failing?
- What's the p95 response time?
- How much disk space do I have?
- Is this pod consuming abnormal memory?
Good metrics tools:
- Prometheus (open-source, industry standard)
- Datadog (commercial, comprehensive)
- New Relic (commercial)
- CloudWatch (AWS native)
Example Prometheus metric:
http_requests_total{method="GET",path="/api/users",status="200"} 15423
http_request_duration_seconds{method="GET",path="/api/users",quantile="0.95"} 0.234
This tells you: 15,423 successful GET requests to /api/users, with p95 latency of 234ms.
The trap: Collecting too many metrics. Prometheus instances storing millions of time-series quickly become expensive and slow.
Solution: Collect only metrics you'll actually alert on or use for debugging.
Cardinality management:
# BAD: unbounded label values create unbounded time-series
http_request_duration{user_id="123", request_id="456"...}
# GOOD: fixed, low-cardinality dimensions
http_request_duration{service="api", endpoint="/users", method="GET"}
Each unique combination of labels is a separate time-series. Too many combinations (high cardinality) destroys your metrics system.
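To make the cost of high cardinality concrete, here is a back-of-the-envelope sketch in Python (the label counts are illustrative assumptions, not measurements): the number of time-series one metric produces is roughly the product of distinct values per label.

```python
from math import prod

def series_count(distinct_values_per_label: dict) -> int:
    """Rough upper bound on time-series for one metric:
    the product of distinct values across its labels."""
    return prod(distinct_values_per_label.values())

# GOOD: fixed dimensions stay bounded
good = series_count({"service": 5, "endpoint": 40, "method": 4})
print(good)  # 800 time-series

# BAD: a user_id label multiplies everything by your user count
bad = series_count({"service": 5, "endpoint": 40, "method": 4, "user_id": 100_000})
print(bad)  # 80,000,000 time-series
```

One extra label with unbounded values turns 800 series into 80 million, which is exactly how metrics backends fall over.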
Pillar 2: Logs (What Happened?)
Logs are events: "User logged in," "Database connection failed," "Deployment completed."
Logs answer:
- What was the application doing when it crashed?
- Did any errors occur during the incident?
- Which requests were affected?
Good logging tools:
- ELK Stack (Elasticsearch, Logstash, Kibana) - open-source
- Loki (Prometheus-compatible logging)
- Splunk (commercial, industry standard)
- Datadog Logs
Example log entry:
{
  "timestamp": "2026-10-16T14:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "request_id": "abc-123-xyz",
  "message": "Payment processing failed",
  "error": "stripe_api_timeout",
  "duration_ms": 30000,
  "user_id": "user-456"
}
This single log entry contains context (service, request ID), error details, and relevant metadata.
The trap: Unstructured logs (plain text instead of JSON).
# BAD: Impossible to query or analyze at scale
"User login failed"
# GOOD: Structured, queryable
{"event": "login_failed", "user_id": "123", "reason": "invalid_password"}
Structured logs let you query across millions of entries: "Show me all login failures from today."
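That kind of query is easy to sketch locally, assuming each log line is a single JSON object (the sample lines below are made up for illustration):

```python
import json

# Hypothetical structured log lines, one JSON object per line
raw_logs = [
    '{"event": "login_failed", "user_id": "123", "reason": "invalid_password"}',
    '{"event": "login_ok", "user_id": "456"}',
    '{"event": "login_failed", "user_id": "789", "reason": "account_locked"}',
]

# Parse once, then filter -- the same shape of query a log backend runs at scale
entries = [json.loads(line) for line in raw_logs]
failures = [e for e in entries if e.get("event") == "login_failed"]
print(len(failures))  # 2 login failures
```

With plain-text logs, the equivalent query is a brittle regex; with structured logs, it is a field comparison.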
Log aggregation architecture:
Applications (generate logs)
↓
Log shipper (Filebeat, Fluentd, Vector)
↓
Aggregation service (Elasticsearch, Loki)
↓
Query interface (Kibana, Grafana)
Pillar 3: Traces (How Did It Happen?)
Traces follow a single request through your entire system, showing every service it touched and where time was spent.
Traces answer:
- Why did this request take 2 seconds when it usually takes 200ms?
- Which services were involved?
- Where did the bottleneck occur?
How tracing works:
User makes request to API
↓
API service processes: 50ms
↓
Calls database: 800ms (SLOW!)
↓
Calls cache: 10ms
↓
Returns response to user: 860ms total
This trace shows the database call was the bottleneck.
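Finding the bottleneck in a trace boils down to a max over span durations; a minimal sketch using hypothetical span data shaped like the walkthrough above:

```python
# (span_name, duration_ms) pairs from a hypothetical trace
spans = [("api", 50), ("database", 800), ("cache", 10)]

# The slowest span is the bottleneck
bottleneck = max(spans, key=lambda span: span[1])
print(bottleneck)  # ('database', 800)
```

Real tracing UIs do this visually with a waterfall view, but the underlying question is the same: which span ate the time?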
Good tracing tools:
- Jaeger (open-source, CNCF)
- Zipkin (open-source)
- Datadog APM (commercial)
- New Relic (commercial)
Implementation requires code changes:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_payment(user_id, amount):
    with tracer.start_as_current_span("payment_processing") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("amount", amount)
        result = call_payment_api(amount)  # This call is also traced
        return result
The tracing library automatically captures timing and calls to downstream services.
Building an Observability Strategy
Phase 1: Metrics Baseline (Weeks 1-4)
Deploy Prometheus:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
Key metrics to collect:
- Request count (by endpoint, method, status)
- Request latency (p50, p95, p99)
- Error rate
- Database query count and latency
- Cache hit/miss ratio
- Resource utilization (CPU, memory, disk)
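Latency percentiles like p50/p95/p99 reduce to a sort-and-index over a window of samples; a simplified nearest-rank sketch (production systems use streaming approximations such as histograms instead of sorting raw samples):

```python
def percentile(samples, q):
    """Nearest-rank percentile over a window of latency samples (q in [0, 1])."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

# Hypothetical latency samples in milliseconds
latencies_ms = [120, 95, 110, 480, 105, 130, 99, 2100, 101, 115]
print(percentile(latencies_ms, 0.95))  # 2100
```

Note how the p95 surfaces the outlier that an average would hide: the mean of these samples is well under 400ms.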
Build 5-10 critical dashboards:
- System health (CPU, memory, disk)
- API performance (latency, errors, throughput)
- Database performance (query time, connections)
- Business metrics (transactions, conversions)
Avoid: Creating 50 dashboards. Most won't be used.
Phase 2: Structured Logging (Weeks 5-8)
Require JSON logging in all applications:
import json
import logging

# The standard library has no JSONFormatter; define a minimal one
_RESERVED = set(logging.LogRecord("", 0, "", 0, "", None, None).__dict__) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        # Include any fields passed via `extra=`
        entry.update({k: v for k, v in record.__dict__.items() if k not in _RESERVED})
        return json.dumps(entry)

logger = logging.getLogger()
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("payment_failed", extra={
    "user_id": "123",
    "amount": 99.99,
    "reason": "insufficient_funds",
})
Deploy log aggregation:
helm install loki grafana/loki-stack -n monitoring
Create log alerts:
- Error rate spike (if error_rate > 1% for 5 minutes)
- Specific errors (authentication failures, timeouts)
- Performance degradation (query slow-down)
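The first rule above ("error rate > 1% for 5 minutes") can be sketched as a window check over per-minute counts; the data and threshold below are illustrative:

```python
def should_alert(errors_per_min, requests_per_min, threshold=0.01, window=5):
    """Fire only if the error rate exceeds the threshold in every one of the
    last `window` minutes -- sustained degradation, not a single spike."""
    recent = list(zip(errors_per_min, requests_per_min))[-window:]
    if len(recent) < window:
        return False
    return all(errors / max(requests, 1) > threshold for errors, requests in recent)

# Hypothetical per-minute counts: rate climbs above 1% and stays there
errors   = [0, 1, 15, 14, 18, 16, 20]
requests = [1000, 1000, 1000, 1000, 1000, 1000, 1000]
print(should_alert(errors, requests))  # True
```

Requiring the condition across the whole window is what keeps a one-minute blip from paging anyone.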
Phase 3: Distributed Tracing (Weeks 9-12)
Instrument critical paths:
- User signup flow
- Payment processing
- Data export operations
Use OpenTelemetry (vendor-agnostic standard):
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
What traces to capture:
- User-facing flows (signup, checkout, export)
- Database queries
- External API calls
- Cache operations
Avoid: Tracing every internal function call (too noisy, expensive).
Practical Alerting Strategy
Most organizations alert on too many things, creating alert fatigue.
Right-sized alerting:
| Metric | Alert Threshold | Severity |
|---|---|---|
| Error rate | > 1% for 5 min | Critical |
| p95 latency | > 2x baseline | Warning |
| CPU | > 85% | Warning |
| Memory | > 90% | Warning |
| Disk | > 85% | Warning |
| Database connections | > 80% of pool | Warning |
Alert only on:
- Things that impact users (errors, latency)
- Things that require action (disk space, certificate expiration)
- Things that predict future problems (CPU trending upward)
Don't alert on:
- Everything being okay
- Minor deviations from normal
- Things you can't or won't respond to
The Observability Runbook
When an alert fires, your team needs a playbook.
Example runbook:
Alert: Error rate > 1%
1. Check dashboard: What errors are we seeing?
2. Check logs: Any patterns? (specific user? endpoint? region?)
3. Check metrics: CPU/memory issues?
4. Check traces: Is latency high or is it actual errors?
5. Check recent deployments: Did we just release something?
6. Check external dependencies: Is Stripe API down? Is database slow?
Action:
- If recent deploy: rollback
- If dependency down: route around it (fail fast)
- If resource exhausted: scale up
- Otherwise: investigate further
Good runbooks are specific (not "check everything"), actionable (actual steps), and tested (run them during game days).
Cost Management
Observability at scale gets expensive. Manage costs intentionally:
Metrics:
- Retention: Keep 15 days of high-resolution, 1 year of low-resolution
- Cardinality: Limit label combinations
- Sampling: Don't collect every metric everywhere
Logs:
- Sampling: Collect 10-50% of logs instead of 100%
- Retention: Keep 7 days hot, 30 days cold (cheaper archive)
- Filtering: Don't log verbose debug info in production
Traces:
- Sampling: Collect 1% of traces, not 100%
- Retention: Keep 7 days
- Cardinality: Don't trace internal function calls
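For the 1% trace sampling above, a common approach is deterministic head sampling keyed on the trace ID, so every service keeps or drops the same traces; a sketch (the hash choice and bucket count are assumptions, not a specific vendor's algorithm):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministic head sampling: hash the trace ID into 10,000 buckets
    and keep the trace if its bucket falls under the sample rate."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(sample_rate * 10_000)

# The same trace ID always gets the same decision, on every service,
# so sampled traces stay complete end to end.
print(keep_trace("abc-123-xyz") == keep_trace("abc-123-xyz"))  # True
```

A cryptographic hash (rather than Python's built-in `hash`, which is salted per process) is what makes the decision reproducible across services.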
Real numbers:
- Metrics: $100-500/month (Prometheus self-hosted or cloud)
- Logs: $200-1000/month (depends on volume)
- Traces: $100-300/month (1% sampling)
- Total: $400-1800/month for comprehensive observability
The 90-Day Roadmap
Month 1: Metrics and Dashboards
- Deploy Prometheus
- Configure scrape targets
- Build 5-10 critical dashboards
- Set up basic alerting
Month 2: Structured Logging
- Migrate applications to JSON logging
- Deploy log aggregation
- Create log searches for common investigation patterns
- Set up log-based alerts
Month 3: Distributed Tracing
- Instrument critical user-facing flows
- Set up trace analysis
- Build trace-based alerts
- Test observability during incident simulation
The Reality
Observability isn't a one-time project. It's an ongoing practice. As your system grows, your observability strategy must evolve.
Start simple (metrics), add structure (logs), then add depth (traces). Don't try to do everything at once.
The payoff is significant: when issues occur, you understand them in minutes instead of hours. Your team sleeps better. Your business moves faster.