Observability at scale: beyond Grafana dashboards
An organization with 50+ microservices and 500+ engineers produces terabytes of logs per day. Adding one more Grafana dashboard doesn't solve the problem. In fact, it makes things worse: alert fatigue, a terrible signal-to-noise ratio, and an ops team drowning in data.
Modern observability means three pillars (metrics, traces, logs) integrated intelligently. Not just "display more data."
The three pillars: metrics, traces, logs
Metrics: the big picture
Aggregated, fast, low-noise.
CPU utilization: 62%
Memory used: 4.2 GB
Request latency p95: 234ms
Error rate: 0.8%
Advantages: compact, indexable, efficient for alerting.
Disadvantages: aggregation loses context.
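To make that trade-off concrete, here is a stdlib-only sketch of how a p95 is computed from raw latency samples (the `p95` helper and the sample values are illustrative, not a metrics-library API). Note how the aggregate hides the one catastrophic request:

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile value of a set of latency samples (in ms),
// using the nearest-rank method: index = ceil(0.95 * n).
func p95(samples []float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := (95*len(sorted) + 99) / 100 // integer ceil(0.95 * n)
	return sorted[idx-1]
}

func main() {
	// 20 request latencies, including one 950ms outlier.
	samples := []float64{
		10, 12, 11, 9, 14, 13, 10, 11, 12, 10,
		15, 11, 9, 13, 12, 10, 11, 14, 12, 950,
	}
	// → p95: 15ms — the single 950ms request vanishes from the aggregate;
	// only a trace or log line can tell you it happened.
	fmt.Printf("p95: %.0fms\n", p95(samples))
}
```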
Traces: the story
Following a single request end-to-end across every service.
User request → API gateway (12ms)
→ Auth service (8ms)
→ Order service (156ms)
→ Database query (143ms)
→ Cache miss (2ms)
→ Payment service (78ms)
→ Return response (22ms)
Total: 284ms
Advantages: complete context, pinpoints bottlenecks.
Disadvantages: heavy, expensive to store and analyze.
Logs: the details
Unstructured or semi-structured, very verbose.
{
  "timestamp": "2025-03-16T14:23:45Z",
  "level": "ERROR",
  "service": "order-api",
  "trace_id": "abc123def456",
  "message": "Database connection timeout",
  "context": {
    "user_id": 12345,
    "order_id": 67890,
    "retry_count": 3
  }
}
Advantages: full detail, pattern matching.
Disadvantages: enormous volume, expensive.
OpenTelemetry: the integration standard
OpenTelemetry (OTel) is the modern approach: instrument once, export anywhere.
import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

var tracer = otel.Tracer("order-api")

// Initialize the tracer provider
func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(), // local collector, no TLS
	)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter),
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL, // first argument is the schema URL, not a context
			semconv.ServiceNameKey.String("order-api"),
			semconv.ServiceVersionKey.String("2.0.0"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}

// Use in a handler (db, cache, orderID, userID, key come from application code)
func handleOrder(w http.ResponseWriter, r *http.Request) {
	ctx, span := tracer.Start(r.Context(), "handleOrder")
	defer span.End()
	// Span attributes
	span.SetAttributes(
		attribute.String("order.id", orderID),
		attribute.Int64("user.id", userID),
	)
	// Database call: the child span inherits the trace context
	dbCtx, dbSpan := tracer.Start(ctx, "database.query")
	order := db.GetOrder(dbCtx, orderID)
	dbSpan.End()
	// Cache call
	cacheCtx, cacheSpan := tracer.Start(ctx, "cache.get")
	cached := cache.Get(cacheCtx, key)
	cacheSpan.End()
	// ...
}
Instrument once, then export to:
- Jaeger (tracing)
- Prometheus (metrics)
- ELK (logs)
- DataDog, New Relic, etc.
Same code, multiple destinations.
Distributed tracing: finding the bottlenecks
With Jaeger or Zipkin, you see the complete flow.
POST /orders
├─ API Gateway (12ms)
├─ Auth Service (8ms)
├─ Order Service (156ms)
│ ├─ Parse JSON (1ms)
│ ├─ Validate (3ms)
│ ├─ Database.Query (143ms) ← BOTTLENECK
│ │ ├─ Connection pool (2ms)
│ │ ├─ Query execution (140ms) ← slow query
│ │ └─ Result fetch (1ms)
│ └─ Cache miss (2ms)
├─ Payment Service (78ms)
└─ Response (22ms)
Total: 284ms (≈50% database, ≈27% payment, the rest is fine)
Install Jaeger:
docker run -d \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  jaegertracing/all-in-one
# UI: http://localhost:16686
Now you can visualize:
- Service map (which service calls which)
- Latency per service
- Error rates
- Dependencies
Log aggregation: structured and indexed
Massive volumes of logs are incomprehensible. The solution: structured logging plus aggregation.
// Structured logging with zap
package main
import "go.uber.org/zap"
func main() {
logger, _ := zap.NewProduction() // JSON output
defer logger.Sync()
logger.Info("order created",
zap.String("order_id", "order-123"),
zap.String("user_id", "user-456"),
zap.Float64("amount", 199.99),
zap.String("currency", "CHF"),
zap.String("status", "pending"),
)
// Output:
// {"level":"info","ts":1710606000,"logger":"","msg":"order created",
// "order_id":"order-123","user_id":"user-456","amount":199.99,
// "currency":"CHF","status":"pending"}
}
Ship the JSON logs to ELK or Loki:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info
        Daemon       off
    [INPUT]
        Name              tail
        Path              /var/log/containers/*/*.log
        Parser            docker
        Tag               kube.*
        Refresh_Interval  5
    [FILTER]
        Name            kubernetes
        Match           kube.*
        Kube_URL        https://kubernetes.default.svc:443
        Kube_CA_File    /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
    [OUTPUT]
        Name   loki
        Match  *
        Host   loki.default.svc.cluster.local
        Port   3100
Then query with LogQL:
# Find errors from order service
{job="order-api"} | json | level="ERROR"
# Errors with latency > 1s (across all jobs)
{job=~".+"} | json | duration > 1000 | level="ERROR"
# Count requests by status
sum by (status) (rate({job="api"}[5m]))
Alert fatigue: the silent enemy
Too many alerts means no alerts. 70% of organizations have more than 100 active alerts per day. The result: on-call overload and burnout.
The solution: intelligent alerting
❌ Bad alert:
alert: HighCPU
expr: cpu > 80%
for: 1m
# → Fires every hour during autoscaling; useless
✓ Good alerts:
alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.1
for: 5m
# → Real issue, actionable, rare
alert: SLOBreachInProgress
expr: (rate(http_requests_total{status=~"5.."}[5m]) > 0.01)
AND (timestamp() - max by (job) (timestamp(container_up_time))) < 600
for: 2m
# → SLO breached + not a fresh deployment = investigate
Pattern: use error budgets
# Error rate over the last 30 days — i.e. error budget consumed
1 - (sum(rate(http_requests_total{status=~"2..|3.."}[30d])) /
sum(rate(http_requests_total[30d])))
If less than 10% of the error budget remains, alert. Otherwise, stay silent.
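A sketch of the budget arithmetic behind that rule, assuming a 99.9% SLO (the `budgetRemaining` helper and the numbers are illustrative):

```go
package main

import "fmt"

// budgetRemaining returns the fraction of the error budget left, given
// the SLO target and the observed error rate over the same window.
// A 99.9% SLO allows 0.1% of requests to fail: that 0.1% is the budget.
func budgetRemaining(slo, errorRate float64) float64 {
	budget := 1 - slo
	return 1 - errorRate/budget
}

func main() {
	// 0.095% of requests failed over 30 days against a 0.1% budget.
	remaining := budgetRemaining(0.999, 0.00095)
	fmt.Printf("error budget remaining: %.0f%%\n", 100*remaining)
	if remaining < 0.10 {
		fmt.Println("→ alert: budget nearly exhausted")
	}
}
```

Alerting on budget consumption rather than raw error spikes is what keeps a noisy-but-within-SLO service silent.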
Alert routing: context matters
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: critical
      rules:
        - alert: DatabaseDown
          expr: up{job="postgres"} == 0
          for: 1m
          annotations:
            severity: critical
            runbook: https://wiki/postgres-down
          labels:
            page: "true" # PagerDuty immediately
            channel: "#p1-oncall" # Critical Slack channel
    - name: warning
      rules:
        - alert: DiskSpaceLow
          expr: node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024 # 10 GB
          for: 10m
          annotations:
            severity: warning
          labels:
            page: "false" # Don't page
            channel: "#alerts" # Normal Slack channel
Runbook automation: no manual heroics
An alert fires and someone starts investigating on the spot. Better: an automated runbook.
#!/bin/bash
# runbook: high-latency-api-troubleshoot
SERVICE=$1
THRESHOLD_MS=${2:-1000}
echo "Investigating latency > ${THRESHOLD_MS}ms in ${SERVICE}..."
# 1. Check pod health
echo "Pod status:"
kubectl get pods -l app=$SERVICE -o wide
# 2. Check logs for errors
echo "Recent errors:"
kubectl logs -l app=$SERVICE --tail=100 | grep ERROR | tail -20
# 3. Check database connections
echo "Database connections:"
psql -h postgres.default -U admin -c "SELECT count(*) FROM pg_stat_activity;"
# 4. Check cache hit rate
echo "Cache metrics:"
curl -s 'http://prometheus:9090/api/v1/query?query=rate(cache_miss_total[5m])' | jq '.data.result[0].value'
# 5. Auto-remediation: restart if in bad state
RESTART_THRESHOLD=2000
if [ "$(curl -s "http://${SERVICE}/metrics" | grep 'latency' | awk '{print int($2)}')" -gt "$RESTART_THRESHOLD" ]; then
echo "Latency critical, attempting restart..."
kubectl rollout restart deployment/$SERVICE
echo "Waiting for pods to be ready..."
kubectl rollout status deployment/$SERVICE --timeout=2m
fi
echo "Investigation complete. Check dashboard at https://grafana.example.com"
Triggered automatically by the alert:
- alert: HighLatency
  annotations:
    action: "kubectl exec -n kube-system -- /scripts/latency-troubleshoot.sh api"
The cost of observability: the hidden price
Logs are expensive: one billion log lines runs roughly $5k-10k/month in the cloud.
Cost optimizations:
- Sampling: log 10%, not 100%
if rand.Intn(10) == 0 { // 10% sampling
logger.Debug("request", zap.String("path", r.URL.Path))
}
- Log level management: ERROR and WARN in prod, DEBUG in dev
// Production
logger, _ := zap.NewProduction() // JSON, INFO level and above
// Development
logger, _ := zap.NewDevelopment() // Colored console output, DEBUG level
- Loki log retention: keep detailed logs for 7 days, summaries for 30
- name: loki
  args:
    - -config.file=/etc/loki/loki-config.yaml
# loki-config.yaml
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
table_manager:
  retention_deletes_enabled: true
  retention_period: 168h # keep 7 days only
- Trace sampling: sample 1% of transactions
trace.WithSampler(trace.TraceIDRatioBased(0.01))
Observability checklist
- OpenTelemetry instrumented (metrics + traces + logs)
- Traces centralized (Jaeger/Zipkin)
- Logs structured as JSON
- Log aggregation (ELK/Loki)
- Metrics exposed (Prometheus)
- Smart alerts (not spam)
- Alert routing by severity
- Runbooks for critical alerts
- SLOs/SLIs defined
- Cost control (sampling, retention)
Observability maturity
Level 1: Grafana dashboards
- Manually created dashboards
- Lots of noise
- Alert fatigue
Level 2: Structured logging + metrics
- Logs in JSON
- Alert rules defined
- Basic SLO monitoring
Level 3: Distributed tracing + intelligent alerting
- OpenTelemetry implemented
- Alert routing (page vs notify)
- Automated runbooks
Level 4: Autonomous observability
- AIOps (anomaly detection)
- Self-healing (auto-remediation)
- Cost-aware sampling
- Predictive alerting
At Hidora, we help organizations move from Level 1 to Level 3 in six months.
Conclusion
Observability is not a nice-to-have. It's foundational for operating production with confidence.
The mindset shift:
- From: "what can I see?" (reactive)
- To: "what do I need to know?" (proactive)
The three pillars (metrics, traces, logs), combined intelligently, give you a clear view of your system's state, without alert fatigue or exploding costs.
Was this article useful? Find out how Hidora can support you: Professional Services · Managed Services · SLA Expert