Reliability & Availability
How often does it break, and how long does it take to fix? The numbers that define your reputation.
The Core Definitions
Availability
The percentage of time the system is operational and accessible.
Reliability
The probability that the system will perform its function without failure over a period of time.
Note: A system can be Available but not Reliable (e.g., if it loses data but stays up).
The "Nines" of Availability
| Percent | Downtime (Year) | Downtime (Week) | Standard |
|---|---|---|---|
| 99% | 3.65 days | 1.68 hours | Casual / Non-critical |
| 99.9% | 8.76 hours | 10.1 minutes | Standard (3 Nines) |
| 99.99% | 52.56 minutes | 1.01 minutes | Highly Available (4 Nines) |
| 99.999% | 5.26 minutes | 6.05 seconds | Mission Critical (5 Nines) |
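The downtime figures in the table follow from simple arithmetic: the allowed downtime is the unavailable fraction multiplied by the length of the period. A minimal sketch, assuming a 365-day year (a 365.25-day year gives slightly larger yearly values); the function and constant names are illustrative, not from any library:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Return the downtime budget, in seconds, for one period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_seconds


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(pct, SECONDS_PER_YEAR)
    per_week = allowed_downtime_seconds(pct, SECONDS_PER_WEEK)
    print(f"{pct}%: {per_year / 3600:.2f} hours/year, {per_week / 60:.2f} minutes/week")
```

Each extra nine cuts the budget by a factor of ten, which is why moving from four to five nines is usually far more expensive than moving from two to three.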
MTBF vs MTTR
MTBF
Mean Time Between Failures
Measures Reliability. The average time a system runs between one failure and the next.
MTTR
Mean Time To Repair
Measures Maintainability. The average time it takes to restore service after a failure.
Availability Formula: Availability = MTBF / (MTBF + MTTR)
"To increase availability, you must either make it fail less often (MTBF↑) or fix it faster (MTTR↓)."
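The two levers can be compared numerically. This is a minimal sketch assuming the standard steady-state relationship Availability = MTBF / (MTBF + MTTR); the MTBF and MTTR values are made-up illustrations, not measurements:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by uptime plus repair time."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1000 hours, one hour to repair.
baseline = availability(mtbf_hours=1000, mttr_hours=1)
# Lever 1: fail half as often (MTBF doubled).
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1)
# Lever 2: repair twice as fast (MTTR halved).
faster_repair = availability(mtbf_hours=1000, mttr_hours=0.5)

print(f"baseline:       {baseline:.4%}")
print(f"fewer failures: {fewer_failures:.4%}")
print(f"faster repair:  {faster_repair:.4%}")
```

Under this formula, doubling MTBF and halving MTTR give the same availability, so the cheaper lever is usually the better investment. In practice that is often faster detection and recovery.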
SLA vs SLO vs SLI
How we measure and promise performance.
SLI (Service Level Indicator)
A specific metric (e.g., Latency, Error Rate).
"The percentage of successful HTTP requests."
SLO (Service Level Objective)
A target value for an SLI.
"99.9% of requests must succeed."
SLA (Service Level Agreement)
A legal contract involving SLOs and consequences (money).
"If we drop below 99.9%, we pay you back."
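The three layers build on each other: a measured indicator, a target for that indicator, and a contractual consequence when the target is missed. The sketch below illustrates that chain. The `RequestWindow` type, the 10% credit, and the monthly fee are hypothetical; real SLAs define their own measurement windows and credit schedules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestWindow:
    """Request counts observed over one measurement window (e.g., a month)."""
    total_requests: int
    successful_requests: int


def success_rate_sli(window: RequestWindow) -> float:
    """SLI: the fraction of requests that succeeded."""
    if window.total_requests == 0:
        return 1.0  # assumption: no traffic counts as no failures
    return window.successful_requests / window.total_requests


def meets_slo(window: RequestWindow, slo_target: float = 0.999) -> bool:
    """SLO: the SLI must be at or above the target."""
    return success_rate_sli(window) >= slo_target


def sla_credit(window: RequestWindow, monthly_fee: float,
               slo_target: float = 0.999, credit_fraction: float = 0.10) -> float:
    """SLA: money owed to the customer when the SLO is missed (hypothetical 10% credit)."""
    if meets_slo(window, slo_target):
        return 0.0
    return monthly_fee * credit_fraction


good_month = RequestWindow(total_requests=1_000_000, successful_requests=999_500)
bad_month = RequestWindow(total_requests=1_000_000, successful_requests=998_000)

print(meets_slo(good_month), sla_credit(good_month, monthly_fee=500.0))
print(meets_slo(bad_month), sla_credit(bad_month, monthly_fee=500.0))
```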
High Availability Patterns
Redundancy
Eliminating Single Points of Failure (SPOF). Multiple servers, multiple DB replicas, multiple regions.
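The benefit of redundancy can be estimated with a short calculation. The sketch assumes replicas fail independently and that the service is up whenever at least one replica is up. Both are simplifications, since shared dependencies such as a common network or deployment pipeline correlate failures.

```python
def parallel_availability(single_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas, assuming independent failures.

    The service is down only when every replica is down at the same time.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    all_down_probability = (1 - single_availability) ** replicas
    return 1 - all_down_probability


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {parallel_availability(0.99, n):.6%}")
```

Under these assumptions, two 99% replicas give roughly four nines and three give roughly six. Real systems fall short of this ideal because replica failures are rarely fully independent.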
Failover
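Automatically switching traffic away from a failed node. In Active-Passive, a standby takes over when the primary fails; in Active-Active, all nodes serve traffic and the survivors absorb the failed node's load.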
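A minimal Active-Passive sketch is shown below. The `FailoverRouter` class and the `is_healthy` callback are hypothetical names for illustration. A production setup would run health checks continuously and use leader election to avoid two nodes both believing they are primary.

```python
from typing import Callable, Sequence


class FailoverRouter:
    """Routes traffic to the active node and promotes a standby when it fails."""

    def __init__(self, nodes: Sequence[str], is_healthy: Callable[[str], bool]):
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = list(nodes)
        self.is_healthy = is_healthy  # hypothetical health-check callback
        self.active = self.nodes[0]   # the first node starts as primary

    def route(self) -> str:
        """Return the node that should receive traffic right now."""
        if self.is_healthy(self.active):
            return self.active
        # The active node failed its health check: promote the first healthy node.
        for candidate in self.nodes:
            if self.is_healthy(candidate):
                self.active = candidate
                return candidate
        raise RuntimeError("no healthy node available")


# Simulated health state standing in for real health-check probes.
health = {"primary": True, "standby": True}
router = FailoverRouter(["primary", "standby"], lambda node: health[node])

print(router.route())      # primary is healthy, so it serves traffic
health["primary"] = False  # simulate a primary outage
print(router.route())      # traffic moves to the standby automatically
```

In this sketch the router stays on the promoted standby even after the original primary recovers. That choice avoids flapping, though other designs fail back automatically.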
Circuit Breakers
Stopping requests to a failing service to prevent "cascading failures" across the entire system.
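The idea can be sketched as a small state machine: closed (calls pass through), open (calls fail fast), and half-open (one trial call after a cool-down). This is an illustrative sketch, not the API of any particular library. The class name, threshold, and timeout values are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling load onto a failing service.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would wrap each downstream request in `breaker.call(...)` and treat `CircuitOpenError` as a signal to return a fallback or a fast error, which keeps threads and connections from backing up behind a dead dependency.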
Interview Guidance
The Paradox
High availability often comes at the cost of consistency. Adding a second replica raises availability (Availability↑), but keeping the replicas in perfect sync is the hard part, so consistency tends to suffer (Consistency↓).
Interviewer Question
"Your system needs 99.999% availability. How do you design it?"
Don't just say "replicas". Talk about Auto-failover, Health Checks, and Multi-Region Active-Active setups.
Red Flag
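"Adding a manual switch for failover." - In an interview, manual is too slow for HA: at 99.999%, the whole yearly downtime budget is about 5.26 minutes, less than it takes a human to respond to a page. You must describe automated monitoring and leader election.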
Green Flag
"Discussing MTTR explicitly." - Showing that you care about observability and deployment speed as part of availability.