What RTO does
The Recovery Time Objective (RTO) is the maximum acceptable time between a major incident (datacenter loss, ransomware attack, database corruption) and full service restoration. It is a business metric, not a technical one: it reflects the downtime cost the organisation is willing to accept, not the performance of its tools.
In practice, a 4-hour RTO means that if a disaster strikes at 9:00, the service must be operational again for users by 13:00 at the latest. The RTO covers everything in between: incident detection, the decision to activate the recovery plan, data restoration, application restart, functional validation and communication.
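The 9:00 to 13:00 example can be checked with a few lines of Python. This is a minimal sketch: the per-phase durations in `PHASES` are hypothetical figures chosen for illustration, and the phases are assumed to run one after another, which a real recovery plan may not do.

```python
from datetime import datetime, timedelta

def rto_deadline(incident_start: datetime, rto: timedelta) -> datetime:
    """Latest moment at which the service must be usable again."""
    return incident_start + rto

# Hypothetical time budget per phase (illustrative values, not measurements).
PHASES = {
    "incident detection": timedelta(minutes=15),
    "decision to activate the plan": timedelta(minutes=30),
    "data restoration": timedelta(minutes=105),
    "application restart": timedelta(minutes=30),
    "functional validation": timedelta(minutes=30),
    "communication": timedelta(minutes=15),
}

def total_recovery_time(phases: dict[str, timedelta]) -> timedelta:
    """Sum the phase budgets, assuming they run sequentially."""
    return sum(phases.values(), timedelta())

incident = datetime(2024, 1, 15, 9, 0)   # disaster strikes at 9:00
rto = timedelta(hours=4)

deadline = rto_deadline(incident, rto)
planned = total_recovery_time(PHASES)
slack = rto - planned                     # margin left before the RTO is breached

print("deadline:", deadline.strftime("%H:%M"))
print("planned recovery:", planned, "| slack:", slack)
print("plan fits within RTO:", planned <= rto)
```

With these assumed budgets the plan uses 3 hours 45 minutes and leaves 15 minutes of margin. Any phase that overruns by more than that breaches the RTO, even if the restoration itself was fast.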
How to set a realistic RTO
RTO is decided jointly by business leadership and the IT team. The usual method follows three steps:
- Business Impact Analysis (BIA). Quantify the hourly outage cost for each critical application. An e-commerce platform doing 50,000 CHF/hour cannot tolerate a 24-hour RTO; an internal HR tool can.
- Technical assessment. Measure the achievable RTO with the current architecture. This means real restoration drills (not just theoretical ones), ideally quarterly.
- Investment. Lowering RTO costs money: real-time replication, hot secondary sites, automation. Halving an RTO typically multiplies infrastructure costs by 1.5 to 3.
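The arithmetic behind the BIA and investment steps can be sketched as follows. The 50,000 CHF/hour figure and the 1.5 to 3 multiplier come from the text above; the 100,000 CHF infrastructure budget is a hypothetical value, and the function names are illustrative only.

```python
def outage_exposure(hourly_cost_chf: float, rto_hours: float) -> float:
    """Worst-case loss for a single incident that uses the whole RTO."""
    return hourly_cost_chf * rto_hours

def cost_after_halving_rto(current_infra_cost_chf: float) -> tuple[float, float]:
    """Rule-of-thumb range: halving the RTO multiplies cost by 1.5 to 3."""
    return current_infra_cost_chf * 1.5, current_infra_cost_chf * 3

hourly_cost = 50_000  # CHF/hour, the e-commerce example from the text

exposure_24h = outage_exposure(hourly_cost, 24)
exposure_4h = outage_exposure(hourly_cost, 4)
print(f"exposure with a 24-hour RTO: {exposure_24h:,.0f} CHF")
print(f"exposure with a 4-hour RTO:  {exposure_4h:,.0f} CHF")

# Hypothetical current infrastructure budget, for illustration only.
low, high = cost_after_halving_rto(100_000)
print(f"infrastructure cost after halving the RTO: {low:,.0f} to {high:,.0f} CHF")
```

Comparing the drop in exposure with the rise in infrastructure cost is what turns the RTO into a business decision rather than a purely technical target.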
Practical RTO tiers
Typical categorisation observed on Hidora engagements:
- RTO < 15 minutes: active-active multi-region architecture, automatic failover. High cost, justified for banks, critical e-commerce, telemedicine.
- RTO 1 to 4 hours: hot secondary sites, semi-automated restoration. Standard for most Swiss SMEs with significant online activity.
- RTO 4 to 24 hours: off-site backups, scheduled manual restoration. Suited to non-real-time applications (BI, archiving, batch).
- RTO > 24 hours: off-site backups, ad-hoc restoration process. Acceptable for non-critical internal tools.
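The tiers above can be expressed as a small lookup, for instance to label applications in an inventory. This is a sketch under two assumptions that the list does not settle: targets between 15 minutes and 1 hour are mapped to the hot-secondary-site tier, and a target of exactly 4 or 24 hours falls into the tier it closes.

```python
from datetime import timedelta

def architecture_for(rto: timedelta) -> str:
    """Map a target RTO to the tier described in the list above."""
    if rto < timedelta(minutes=15):
        return "active-active multi-region, automatic failover"
    # Assumption: the 15-minute to 1-hour gap is treated like the 1 to 4 hour tier.
    if rto <= timedelta(hours=4):
        return "hot secondary sites, semi-automated restoration"
    if rto <= timedelta(hours=24):
        return "off-site backups, scheduled manual restoration"
    return "off-site backups, ad-hoc restoration process"

# Hypothetical application inventory with target RTOs.
inventory = {
    "payment API": timedelta(minutes=10),
    "web shop": timedelta(hours=2),
    "BI warehouse": timedelta(hours=12),
    "internal wiki": timedelta(hours=48),
}

for app, target in inventory.items():
    print(f"{app}: {architecture_for(target)}")
```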
RTO and cloud-native architectures
On Kubernetes, RTO drops drastically if the application is designed to be stateless. When configuration is managed through GitOps, images sit in a replicated registry and databases use synchronous replication, restoring a full environment on a standby cluster takes 10 to 30 minutes through automated deployment.
Conversely, stateful workloads with complex dependencies (Postgres clusters with custom replication, NFS shared files, Kafka queues with long retention) keep a high RTO. The architectural work consists of progressively isolating those components and applying suitable replication strategies.
RTO vs RPO
RTO measures outage duration; RPO (Recovery Point Objective) measures acceptable data loss. An SME may, for example, tolerate 4 hours of unavailability (RTO) but no more than 5 minutes of lost data (RPO). The two indicators are independent and addressed separately.
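To make the distinction concrete, the sketch below evaluates one restoration drill against both objectives. The `DrillResult` structure and the timestamps are hypothetical; the 4-hour and 5-minute targets are the ones from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DrillResult:
    incident_at: datetime             # when the (simulated) disaster struck
    last_recovery_point_at: datetime  # newest data that could be restored
    restored_at: datetime             # when users could work again

    @property
    def downtime(self) -> timedelta:
        """Outage duration, to be compared with the RTO."""
        return self.restored_at - self.incident_at

    @property
    def data_loss_window(self) -> timedelta:
        """Span of lost data, to be compared with the RPO."""
        return self.incident_at - self.last_recovery_point_at

def meets_objectives(drill: DrillResult, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
    """Check each objective on its own; one can pass while the other fails."""
    return {
        "rto": drill.downtime <= rto,
        "rpo": drill.data_loss_window <= rpo,
    }

# Hypothetical drill: service back after 3 h 10 min, but 8 minutes of data lost.
drill = DrillResult(
    incident_at=datetime(2024, 1, 15, 9, 0),
    last_recovery_point_at=datetime(2024, 1, 15, 8, 52),
    restored_at=datetime(2024, 1, 15, 12, 10),
)

result = meets_objectives(drill, rto=timedelta(hours=4), rpo=timedelta(minutes=5))
print(result)
```

In this made-up drill the RTO is met while the RPO is missed, which illustrates why the two indicators are tracked separately: faster restoration does nothing for data that was never replicated.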
Related Hidora services
- SLA Expert: contractual RTO commitments on P1 incidents with automatic recovery-plan activation.
- Consulting: BIA audit, disaster-recovery plan design, quarterly restoration drills.
- Managed Services: operational execution of the recovery plan with monthly drill reports.
- RPO, DRP, SLA: related indicators and processes.