SRE (Site Reliability Engineering): definition and best practices

What SRE does

Site Reliability Engineering (SRE) is a discipline formalised by Google in the early 2000s and popularised by the book of the same name in 2016. Its central idea: apply software-engineering practices (automation, measurement, abstraction) to operational problems traditionally handled by sysadmins. An SRE does not just maintain servers; they write code to make infrastructure self-managed.

SRE addresses a fundamental tension in production: development teams want to ship fast, while operations teams want stability. SRE resolves this tension through quantification: you define measurable reliability targets, and you allow fast change as long as those targets are met.

The core concepts

SLI (Service Level Indicator): a quantified metric of service behaviour, such as p95 latency, 5xx error rate, or availability measured by external probes. An SLI is always measured from the user's side; a purely internal metric (CPU load, for example) does not qualify.

SLO (Service Level Objective): the numerical target of an SLI, for example "99.9% of requests under 300 ms over a 30-day rolling window". The SLO is an internal commitment, not a commercial one.
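As an illustration of how an SLI is compared with its SLO, here is a minimal Python sketch. It assumes request latencies are already available as a list of milliseconds for the window under review; the function names and the sample data are hypothetical, and the 300 ms threshold and 99.9% target are simply the values from the example above.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 300.0) -> float:
    """Return the fraction of requests served strictly under the threshold."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """Return True when the measured SLI reaches the SLO target."""
    return sli >= slo_target


# Hypothetical window: 9,990 fast requests and 10 slow ones.
sample = [120.0] * 9990 + [450.0] * 10
sli = latency_sli(sample)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli)}")
```

In a real setup the latencies would come from a monitoring system over the 30-day rolling window rather than from an in-memory list.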

SLA (Service Level Agreement): the commercial contract with customers, usually with financial commitments (penalties). The SLA is laxer than the SLO so that there is operational headroom.

Error budget: the mathematical complement of the SLO. With a 99.9% SLO, the error budget is 0.1%, about 43 minutes of downtime over a 30-day month. As long as you are within budget, the product team can take risks; once the budget is exhausted, deployments slow down and reliability becomes the priority.
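The arithmetic behind the error budget can be sketched in a few lines of Python. This assumes the budget is expressed as downtime minutes over a 30-day window; the function names and the wording of the policy messages are hypothetical, and treating an exactly spent budget as exhausted is an assumption of this sketch.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes: (1 - SLO) times the window length."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


def release_policy(remaining: float) -> str:
    """Assumed policy: ship freely within budget, slow down once it is spent."""
    if remaining > 0:
        return "within budget: the product team can take risks"
    return "budget exhausted: slow deployments, prioritise reliability"


budget = error_budget_minutes(0.999)           # roughly 43.2 minutes
remaining = budget_remaining(0.999, downtime_minutes=10.0)
print(f"budget = {budget:.1f} min, remaining = {remaining:.0%}")
print(release_policy(remaining))
```

Real policies are usually more gradual than this two-state rule, for example by alerting on how fast the budget is being consumed.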

SRE vs DevOps

SRE is a concrete implementation of DevOps culture, not an alternative. DevOps answers the "cultural why and how" (bridging dev and ops, automation, shared ownership). SRE answers the "measurable and operational how" (which metrics, which thresholds, which trade-offs).

In Romandy (French-speaking Switzerland), we typically see organisations adopt DevOps before SRE. The move to SRE becomes relevant beyond 500 employees or when several product teams consume the same platform. For an SME under 200 employees, a DevOps team with a few SRE practices (SLOs, blameless post-mortems) is usually enough.

The SRE role in practice

An SRE ideally spends 50% of their time on automation and 50% on supporting product teams. The critical threshold is around 30% pure operations time: beyond that, the SRE drifts into permanent firefighting mode and loses their engineering value.
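One way to watch the 30% threshold is to compute the share of pure operations work from a time log. The sketch below assumes hours are tracked per category; the category names and the sample week are hypothetical.

```python
from typing import Mapping

OPERATIONS_THRESHOLD = 0.30  # critical share of pure operations time, from the text


def operations_share(hours_by_category: Mapping[str, float]) -> float:
    """Return the fraction of logged hours spent on pure operations."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    return hours_by_category.get("operations", 0.0) / total


def in_firefighting_mode(hours_by_category: Mapping[str, float]) -> bool:
    """Return True when operations time exceeds the critical threshold."""
    return operations_share(hours_by_category) > OPERATIONS_THRESHOLD


# Hypothetical 40-hour week.
week = {"automation": 20.0, "product_support": 12.0, "operations": 8.0}
print(f"operations share = {operations_share(week):.0%}, "
      f"firefighting: {in_firefighting_mode(week)}")
```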

When SRE is not the right fit

Very small organisations (<50 people), startups in product-discovery phase, and purely office IT environments do not get immediate value from SRE. Formalising SLOs assumes a stable in-production product with measurable traffic.

Related Hidora services

SLA Expert: on-call expertise contract with contractually measured SLOs.
Managed Services: operations with monthly SLOs, reliability reports, shared error budgets.
DevOps and Observability: cultural and technical prerequisites for SRE practice.
