Observability at Scale: Beyond Grafana Dashboards
At some point, every growing company realizes their monitoring strategy is broken. They have Grafana dashboards that no one understands, alerts that fire constantly (mostly false positives), and zero ability to debug issues when something goes wrong.
"We need better observability," they declare.
Six months later, they've invested in more tools (Datadog, New Relic, more Grafana dashboards) and the situation hasn't improved much. The problem isn't the tools. It's the strategy.
Observability at scale requires a fundamental shift: from monitoring (does my system work?) to observability (why did it break?).
Monitoring vs. Observability
Monitoring: What You Know
Monitoring answers: "Is everything okay?"
✓ Server up/down
✓ CPU utilization
✓ Memory usage
✓ Disk space
Useful, but limited. If you're at 85% CPU but everything works fine, is that a problem?
Observability: What You Can Learn
Observability answers: "Why did something break?"
- Request A failed with a 500 error
- It failed because database query took 45 seconds
- Query was slow because a table lock was held
- Lock was held by table maintenance job running at 3 AM
- Solution: Reschedule maintenance job
Observability requires three ingredients: metrics, logs, and traces. Together, they let you understand system behavior and debug issues.
The Three Pillars of Observability
Pillar 1: Metrics (What Is Happening Now?)
Metrics are time-series data about your system: request latency, error rate, CPU usage, etc.
Metrics answer:
- How many requests are failing?
- What's the p95 response time?
- How much disk space do I have?
- Is this pod consuming abnormal memory?
Good metrics tools:
- Prometheus (open-source, industry standard)
- Datadog (commercial, comprehensive)
- New Relic (commercial)
- CloudWatch (AWS native)
Example Prometheus metric:
http_requests_total{method="GET",path="/api/users",status="200"} 15423
http_request_duration_seconds{method="GET",path="/api/users",quantile="0.95"} 0.234
This tells you: 15,423 successful GET requests to /api/users, with p95 latency of 234ms.
The trap: Collecting too many metrics. Prometheus instances storing millions of time-series quickly become expensive and slow.
Solution: Collect only metrics you'll actually alert on or use for debugging.
Cardinality management:
# BAD: unbounded label values create unbounded time-series
http_request_duration{user_id="123", request_id="456"...}
# GOOD: fixed, low-cardinality dimensions
http_request_duration{service="api", endpoint="/users", method="GET"}
Each unique combination of labels is a separate time-series. Too many combinations (high cardinality) destroys your metrics system.
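To make the cost of high cardinality concrete, here is a back-of-the-envelope sketch in Python (the label counts are illustrative assumptions, not measurements): the number of time-series one metric produces is roughly the product of distinct values per label.

```python
from math import prod

def series_count(distinct_values_per_label: dict) -> int:
    """Rough upper bound on time-series for one metric:
    the product of distinct values across its labels."""
    return prod(distinct_values_per_label.values())

# GOOD: fixed dimensions stay bounded
good = series_count({"service": 5, "endpoint": 40, "method": 4})
print(good)  # 800 time-series

# BAD: a user_id label multiplies everything by your user count
bad = series_count({"service": 5, "endpoint": 40, "method": 4, "user_id": 100_000})
print(bad)  # 80,000,000 time-series
```

One extra label with unbounded values turns 800 series into 80 million, which is exactly how metrics backends fall over.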
Pillar 2: Logs (What Happened?)
Logs are events: "User logged in," "Database connection failed," "Deployment completed."
Logs answer:
- What was the application doing when it crashed?
- Did any errors occur during the incident?
- Which requests were affected?
Good logging tools:
- ELK Stack (Elasticsearch, Logstash, Kibana) - open-source
- Loki (Prometheus-compatible logging)
- Splunk (commercial, industry standard)
- Datadog Logs
Example log entry:
{
  "timestamp": "2026-10-16T14:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "request_id": "abc-123-xyz",
  "message": "Payment processing failed",
  "error": "stripe_api_timeout",
  "duration_ms": 30000,
  "user_id": "user-456"
}
This single log entry contains context (service, request ID), error details, and relevant metadata.
The trap: Unstructured logs (plain text instead of JSON).
# BAD: Impossible to query or analyze at scale
"User login failed"
# GOOD: Structured, queryable
{"event": "login_failed", "user_id": "123", "reason": "invalid_password"}
Structured logs let you query across millions of entries: "Show me all login failures from today."
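That kind of query is easy to sketch locally, assuming each log line is a single JSON object (the sample lines below are made up for illustration):

```python
import json

# Hypothetical structured log lines, one JSON object per line
raw_logs = [
    '{"event": "login_failed", "user_id": "123", "reason": "invalid_password"}',
    '{"event": "login_ok", "user_id": "456"}',
    '{"event": "login_failed", "user_id": "789", "reason": "account_locked"}',
]

# Parse once, then filter -- the same shape of query a log backend runs at scale
entries = [json.loads(line) for line in raw_logs]
failures = [e for e in entries if e.get("event") == "login_failed"]
print(len(failures))  # 2 login failures
```

With plain-text logs, the equivalent query is a brittle regex; with structured logs, it is a field comparison.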
Log aggregation architecture:
Applications (generate logs)
↓
Log shipper (Filebeat, Fluentd, Vector)
↓
Aggregation service (Elasticsearch, Loki)
↓
Query interface (Kibana, Grafana)
Pillar 3: Traces (How Did It Happen?)
Traces follow a single request through your entire system, showing every service it touched and where time was spent.
Traces answer:
- Why did this request take 2 seconds when it usually takes 200ms?
- Which services were involved?
- Where did the bottleneck occur?
How tracing works:
User makes request to API
↓
API service processes: 50ms
↓
Calls database: 800ms (SLOW!)
↓
Calls cache: 10ms
↓
Returns response to user: 860ms total
This trace shows the database call was the bottleneck.
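Finding the bottleneck in a trace boils down to a max over span durations; a minimal sketch using hypothetical span data shaped like the walkthrough above:

```python
# (span_name, duration_ms) pairs from a hypothetical trace
spans = [("api", 50), ("database", 800), ("cache", 10)]

# The slowest span is the bottleneck
bottleneck = max(spans, key=lambda span: span[1])
print(bottleneck)  # ('database', 800)
```

Real tracing UIs do this visually with a waterfall view, but the underlying question is the same: which span ate the time?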
Good tracing tools:
- Jaeger (open-source, CNCF)
- Zipkin (open-source)
- Datadog APM (commercial)
- New Relic (commercial)
Implementation requires code changes:
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_payment(user_id, amount):
    with tracer.start_as_current_span("payment_processing") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("amount", amount)
        result = call_payment_api(amount)  # This call is also traced
        return result
The tracing library automatically captures timing and calls to downstream services.
Building an Observability Strategy
Phase 1: Metrics Baseline (Weeks 1-4)
Deploy Prometheus:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
Key metrics to collect:
- Request count (by endpoint, method, status)
- Request latency (p50, p95, p99)
- Error rate
- Database query count and latency
- Cache hit/miss ratio
- Resource utilization (CPU, memory, disk)
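Latency percentiles like p50/p95/p99 reduce to a sort-and-index over a window of samples; a simplified nearest-rank sketch (production systems use streaming approximations such as histograms instead of sorting raw samples):

```python
def percentile(samples, q):
    """Nearest-rank percentile over a window of latency samples (q in [0, 1])."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]

# Hypothetical latency samples in milliseconds
latencies_ms = [120, 95, 110, 480, 105, 130, 99, 2100, 101, 115]
print(percentile(latencies_ms, 0.95))  # 2100
```

Note how the p95 surfaces the outlier that an average would hide: the mean of these samples is well under 400ms.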
Build 5-10 critical dashboards:
- System health (CPU, memory, disk)
- API performance (latency, errors, throughput)
- Database performance (query time, connections)
- Business metrics (transactions, conversions)
Avoid: Creating 50 dashboards. Most won't be used.
Phase 2: Structured Logging (Weeks 5-8)
Require JSON logging in all applications:
import json
import logging

# The standard library has no JSONFormatter; define a minimal one
_RESERVED = set(logging.LogRecord("", 0, "", 0, "", None, None).__dict__) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        # Include any fields passed via `extra=`
        entry.update({k: v for k, v in record.__dict__.items() if k not in _RESERVED})
        return json.dumps(entry)

logger = logging.getLogger()
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("payment_failed", extra={
    "user_id": "123",
    "amount": 99.99,
    "reason": "insufficient_funds",
})
Deploy log aggregation:
helm install loki grafana/loki-stack -n monitoring
Create log alerts:
- Error rate spike (if error_rate > 1% for 5 minutes)
- Specific errors (authentication failures, timeouts)
- Performance degradation (query slow-down)
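The first rule above ("error rate > 1% for 5 minutes") can be sketched as a window check over per-minute counts; the data and threshold below are illustrative:

```python
def should_alert(errors_per_min, requests_per_min, threshold=0.01, window=5):
    """Fire only if the error rate exceeds the threshold in every one of the
    last `window` minutes -- sustained degradation, not a single spike."""
    recent = list(zip(errors_per_min, requests_per_min))[-window:]
    if len(recent) < window:
        return False
    return all(errors / max(requests, 1) > threshold for errors, requests in recent)

# Hypothetical per-minute counts: rate climbs above 1% and stays there
errors   = [0, 1, 15, 14, 18, 16, 20]
requests = [1000, 1000, 1000, 1000, 1000, 1000, 1000]
print(should_alert(errors, requests))  # True
```

Requiring the condition across the whole window is what keeps a one-minute blip from paging anyone.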
Phase 3: Distributed Tracing (Weeks 9-12)
Instrument critical paths:
- User signup flow
- Payment processing
- Data export operations
Use OpenTelemetry (vendor-agnostic standard):
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)
What traces to capture:
- User-facing flows (signup, checkout, export)
- Database queries
- External API calls
- Cache operations
Avoid: Tracing every internal function call (too noisy, expensive).
Practical Alerting Strategy
Most organizations alert on too many things, creating alert fatigue.
Right-sized alerting:
| Metric | Alert Threshold | Severity |
|---|---|---|
| Error rate | > 1% for 5 min | Critical |
| p95 latency | > 2x baseline | Warning |
| CPU | > 85% | Warning |
| Memory | > 90% | Warning |
| Disk | > 85% | Warning |
| Database connections | > 80% of pool | Warning |
Alert only on:
- Things that impact users (errors, latency)
- Things that require action (disk space, certificate expiration)
- Things that predict future problems (CPU trending upward)
Don't alert on:
- Everything being okay
- Minor deviations from normal
- Things you can't or won't respond to
The Observability Runbook
When an alert fires, your team needs a playbook.
Example runbook:
Alert: Error rate > 1%
1. Check dashboard: What errors are we seeing?
2. Check logs: Any patterns? (specific user? endpoint? region?)
3. Check metrics: CPU/memory issues?
4. Check traces: Is latency high or is it actual errors?
5. Check recent deployments: Did we just release something?
6. Check external dependencies: Is Stripe API down? Is database slow?
Action:
- If recent deploy: rollback
- If dependency down: route around it (fail fast)
- If resource exhausted: scale up
- Otherwise: investigate further
Good runbooks are specific (not "check everything"), actionable (actual steps), and tested (run them during game days).
Cost Management
Observability at scale gets expensive. Manage costs intentionally:
Metrics:
- Retention: Keep 15 days of high-resolution, 1 year of low-resolution
- Cardinality: Limit label combinations
- Sampling: Don't collect every metric everywhere
Logs:
- Sampling: Collect 10-50% of logs instead of 100%
- Retention: Keep 7 days hot, 30 days cold (cheaper archive)
- Filtering: Don't log verbose debug info in production
Traces:
- Sampling: Collect 1% of traces, not 100%
- Retention: Keep 7 days
- Cardinality: Don't trace internal function calls
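For the 1% trace sampling above, a common approach is deterministic head sampling keyed on the trace ID, so every service keeps or drops the same traces; a sketch (the hash choice and bucket count are assumptions, not a specific vendor's algorithm):

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float = 0.01) -> bool:
    """Deterministic head sampling: hash the trace ID into 10,000 buckets
    and keep the trace if its bucket falls under the sample rate."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(sample_rate * 10_000)

# The same trace ID always gets the same decision, on every service,
# so sampled traces stay complete end to end.
print(keep_trace("abc-123-xyz") == keep_trace("abc-123-xyz"))  # True
```

A cryptographic hash (rather than Python's built-in `hash`, which is salted per process) is what makes the decision reproducible across services.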
Real numbers:
- Metrics: $100-500/month (Prometheus self-hosted or cloud)
- Logs: $200-1000/month (depends on volume)
- Traces: $100-300/month (1% sampling)
- Total: $400-1800/month for comprehensive observability
The 90-Day Roadmap
Month 1: Metrics and Dashboards
- Deploy Prometheus
- Configure scrape targets
- Build 5-10 critical dashboards
- Set up basic alerting
Month 2: Structured Logging
- Migrate applications to JSON logging
- Deploy log aggregation
- Create log searches for common investigation patterns
- Set up log-based alerts
Month 3: Distributed Tracing
- Instrument critical user-facing flows
- Set up trace analysis
- Build trace-based alerts
- Test observability during incident simulation
The Reality
Observability isn't a one-time project. It's an ongoing practice. As your system grows, your observability strategy must evolve.
Start simple (metrics), add structure (logs), then add depth (traces). Don't try to do everything at once.
The payoff is significant: when issues occur, you understand them in minutes instead of hours. Your team sleeps better. Your business moves faster.