Reliability & Availability

How often does it break, and how long does it take to fix? The numbers that define your reputation.

The Core Definitions

Availability

The percentage of time the system is operational and accessible.

"Is it up right now?"

Reliability

The probability that the system will perform its function without failure over a period of time.

"Will it break in the next hour?"

Note: A system can be Available but not Reliable (e.g., it stays up and keeps answering requests, but loses data or returns wrong results).
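
The "Will it break in the next hour?" question can be turned into a number. The sketch below assumes failures arrive at a constant rate (an exponential model, which is an assumption and not something this page states), so the probability of surviving t hours is exp(-t / MTBF), where MTBF is the mean time between failures. The function name and the 1000-hour figure are illustrative only.

```python
import math


def reliability(hours: float, mtbf_hours: float) -> float:
    """Probability of running `hours` without failure, assuming a constant failure rate."""
    if hours < 0 or mtbf_hours <= 0:
        raise ValueError("hours must be >= 0 and mtbf_hours must be > 0")
    return math.exp(-hours / mtbf_hours)


# Hypothetical system that fails on average once every 1000 hours.
print(f"Next hour: {reliability(1, 1000):.4%}")
print(f"Next week: {reliability(168, 1000):.4%}")
```

The longer the window, the lower the probability of getting through it without a failure, even though the availability figure may look unchanged.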

The "Nines" of Availability

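| Percent | Downtime (Year) | Downtime (Week) | Standard |
|---------|-----------------|-----------------|----------|
| 99% | 3.65 days | 1.68 hours | Casual / Non-critical (2 Nines) |
| 99.9% | 8.76 hours | 10.1 minutes | Standard (3 Nines) |
| 99.99% | 52.56 minutes | 1.01 minutes | Highly Available (4 Nines) |
| 99.999% | 5.26 minutes | 6.05 seconds | Mission Critical (5 Nines) |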
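
The downtime figures are just (1 - availability) multiplied by the length of the period. A minimal sketch, assuming a 365-day year; the function and constant names are made up for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float, period_seconds: int) -> float:
    """Maximum downtime, in seconds, allowed over the period at this availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return (1 - availability_percent / 100) * period_seconds


for target in (99.0, 99.9, 99.99, 99.999):
    per_year_min = downtime_budget_seconds(target, SECONDS_PER_YEAR) / 60
    per_week_min = downtime_budget_seconds(target, SECONDS_PER_WEEK) / 60
    print(f"{target}%: {per_year_min:.2f} min/year, {per_week_min:.2f} min/week")
```

Each extra nine divides the allowed downtime by ten, which is why moving from 4 Nines to 5 Nines is so much harder than it sounds.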

MTBF vs MTTR

MTBF

Mean Time Between Failures

Measures Reliability. The average time a system runs between one failure and the next.

MTTR

Mean Time To Repair

Measures Maintainability. The average time it takes to restore service after a failure.

Availability Formula:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

"To increase availability, you must either make it fail less often (MTBF↑) or fix it faster (MTTR↓)."

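As a worked example, availability is MTBF divided by (MTBF + MTTR). The sketch below plugs in hypothetical numbers (a failure every 1000 hours, one hour to repair) and compares the two levers: failing half as often versus repairing twice as fast.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


baseline = availability(mtbf_hours=1000, mttr_hours=1)
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1)   # MTBF doubled
faster_repair = availability(mtbf_hours=1000, mttr_hours=0.5)  # MTTR halved

print(f"baseline:       {baseline:.5%}")
print(f"fewer failures: {fewer_failures:.5%}")
print(f"faster repair:  {faster_repair:.5%}")
```

Doubling MTBF and halving MTTR give the same availability here, because only the ratio of the two matters. In practice, cutting repair time is often the cheaper lever.
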
SLA vs SLO vs SLI

How we measure and promise performance.

SLI (Indicator)

A specific metric (e.g., Latency, Error Rate).

"The percentage of successful HTTP requests."

SLO (Objective)

A target value for an SLI.

"99.9% of requests must succeed."

SLA (Agreement)

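A contract with the customer that commits to one or more SLOs and spells out the consequences (usually money, such as refunds or service credits) if they are missed.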

"If we drop below 99.9%, we pay you back."

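The three terms stack on top of each other: the SLI is measured, the SLO is the target it is compared against, and the SLA decides what a miss costs. A minimal sketch, where the function name, the request counts, and the 10% credit rate are all hypothetical:

```python
def evaluate(total_requests: int, failed_requests: int,
             slo_target: float = 0.999, credit_rate: float = 0.10) -> dict:
    """Compute the SLI, check it against the SLO, and apply a made-up SLA credit."""
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("need total_requests > 0 and 0 <= failed_requests <= total_requests")
    sli = (total_requests - failed_requests) / total_requests  # SLI: success ratio
    slo_met = sli >= slo_target                                # SLO: target for the SLI
    owed = 0.0 if slo_met else credit_rate                     # SLA: consequence of a miss
    return {"sli": sli, "slo_met": slo_met, "sla_credit_rate": owed}


print(evaluate(total_requests=1_000_000, failed_requests=400))
print(evaluate(total_requests=1_000_000, failed_requests=2_000))
```

Teams commonly set the internal SLO tighter than the number promised in the SLA, so that an SLO miss acts as a warning before any money is owed.
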
High Availability Patterns

Redundancy

Eliminating Single Points of Failure (SPOF). Multiple servers, multiple DB replicas, multiple regions.
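
Redundancy pays off quickly in the math. The sketch below assumes replicas fail independently and that any single healthy replica can carry the full load; both are simplifying assumptions, and correlated failures (same rack, same bad deploy) make real numbers worse.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas when only one needs to be up."""
    if not 0 <= single <= 1 or replicas < 1:
        raise ValueError("single must be in [0, 1] and replicas must be >= 1")
    # The service is down only if every replica is down at the same time.
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {combined_availability(0.99, n):.6%}")
```

Under these assumptions, two 99% servers behave like one 99.99% server, and three like one 99.9999% server.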

Failover

Automatically switching to a standby system when the primary fails. Can be Active-Active or Active-Passive.
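
An Active-Passive failover decision can be sketched as "route to the first healthy node in priority order". The node names and the health dictionary below are hypothetical stand-ins for a real monitoring loop; a production setup would also need to prevent two nodes from acting as primary at once.

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical health state, as a health checker might report it.
health = {"db-primary": True, "db-standby": True}
priority = ["db-primary", "db-standby"]

print(pick_active(priority, lambda n: health[n]))  # db-primary
health["db-primary"] = False                       # primary fails its health check
print(pick_active(priority, lambda n: health[n]))  # db-standby
```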

Circuit Breakers

Stopping requests to a failing service to prevent "cascading failures" across the entire system.
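
A minimal circuit breaker sketch: after a number of consecutive failures it "opens" and rejects calls immediately, then lets a trial call through once a timeout has passed. The class name, thresholds, and injectable clock are illustrative choices, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast: do not touch the struggling service at all.
                raise CircuitOpenError("circuit open; request rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to return a fallback instead of waiting on a dead dependency.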

Interview Guidance

The Paradox

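High availability often comes at the cost of consistency. Adding a second replica raises availability (Availability↑), but keeping both replicas in perfect sync is the hard part: if every write must wait for both, one slow or unreachable replica blocks writes, and if it does not wait, the replicas can briefly disagree (Consistency↓).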

Interviewer Question

"Your system needs 99.999% availability. How do you design it?"

Don't just say 'replicas'. Five nines leaves about 5.26 minutes of downtime per year, so talk about automatic failover, health checks, and multi-region active-active setups.

Red Flag
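
"Adding a manual switch for failover." - In an interview, manual is too slow for HA: a person paged at night cannot reliably respond within a downtime budget measured in minutes. You must describe automated monitoring, automatic failover, and leader election.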

Green Flag

"Discussing MTTR explicitly." - Showing that you care about observability and deployment speed as part of availability.