
Reliability & Availability

How often does it break, and how long does it take to fix? The numbers that define your reputation.

The Core Definitions

Availability

The percentage of time the system is operational and accessible.

"Is it up right now?"

Reliability

The probability that the system will perform its function without failure over a period of time.

"Will it break in the next hour?"

Note: A system can be Available but not Reliable. For example, it stays up and answers requests but loses data or returns wrong results.

The "Nines" of Availability

Availability	Downtime per Year	Downtime per Week	Typical Use
99%	3.65 days	1.68 hours	Casual / Non-critical
99.9%	8.76 hours	10.1 minutes	Standard (3 Nines)
99.99%	52.56 minutes	1.01 minutes	Highly Available (4 Nines)
99.999%	5.26 minutes	6.05 seconds	Mission Critical (5 Nines)
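
The downtime figures in the table follow directly from the availability percentage. A minimal sketch of the arithmetic, assuming a 365-day year (the function and constant names are illustrative, not from any library):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, no leap days
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Return the downtime allowed in a period for a given availability target."""
    return (1 - availability_percent / 100) * period_seconds


for target in (99.0, 99.9, 99.99, 99.999):
    per_year_min = downtime_seconds(target, SECONDS_PER_YEAR) / 60
    per_week_min = downtime_seconds(target, SECONDS_PER_WEEK) / 60
    print(f"{target}%: {per_year_min:.2f} min/year, {per_week_min:.2f} min/week")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the jump from four to five nines is so expensive.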

MTBF vs MTTR

MTBF

Mean Time Between Failures

Measures Reliability. The average time a system runs between one failure and the next.

MTTR

Mean Time To Repair

Measures Maintainability. The average time it takes to restore service after a failure.

Availability Formula:

Availability = MTBF / (MTBF + MTTR)

"To increase availability, you must either make it fail less often (MTBF↑) or fix it faster (MTTR↓)."
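
To make the trade-off concrete, here is a small worked example of availability = MTBF / (MTBF + MTTR). The MTBF and MTTR values are made-up inputs chosen for illustration, not measurements:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure every 1000 hours, one hour to repair.
baseline = availability(mtbf_hours=1000, mttr_hours=1)

# Option A: fail half as often (MTBF doubled).
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1)

# Option B: repair twice as fast (MTTR halved).
faster_repair = availability(mtbf_hours=1000, mttr_hours=0.5)

print(f"baseline:       {baseline:.6f}")
print(f"fewer failures: {fewer_failures:.6f}")
print(f"faster repair:  {faster_repair:.6f}")
```

Doubling MTBF and halving MTTR yield the same availability, because only the ratio of the two matters. In practice, cutting repair time is often the cheaper of the two levers.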

SLA vs SLO vs SLI

How we measure and promise performance.

SLI (Indicator)

A specific metric (e.g., Latency, Error Rate).

"The percentage of successful HTTP requests."

SLO (Objective)

A target value for an SLI.

"99.9% of requests must succeed."

SLA (Agreement)

A contract with the customer that specifies SLOs and the consequences of missing them, usually financial (refunds or service credits).

"If we drop below 99.9%, we pay you back."
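
The relationship between the three terms can be sketched in a few lines: the SLI is measured, the SLO is the threshold it is compared against, and the SLA decides what happens when the threshold is missed. The request counts below are invented for illustration, and the function names are assumptions rather than any monitoring tool's API:

```python
SLO_TARGET = 0.999  # "99.9% of requests must succeed"


def success_rate(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of HTTP requests that succeeded in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully successful
    return successful_requests / total_requests


def meets_slo(sli: float, slo: float = SLO_TARGET) -> bool:
    """SLO check: is the measured indicator at or above the target?"""
    return sli >= slo


good_week = success_rate(999_500, 1_000_000)  # hypothetical counts
bad_week = success_rate(998_000, 1_000_000)   # hypothetical counts

for label, sli in (("good week", good_week), ("bad week", bad_week)):
    status = "within SLO" if meets_slo(sli) else "SLO missed, SLA penalty may apply"
    print(f"{label}: SLI={sli:.4f} -> {status}")
```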

High Availability Patterns

Redundancy

Eliminating Single Points of Failure (SPOF). Multiple servers, multiple DB replicas, multiple regions.
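
A rough way to see why redundancy helps: if the service is up whenever at least one replica is up, and replicas fail independently, the combined availability is 1 - (1 - a)^n. The independence assumption is a strong one (shared power, network, or a bad deploy can take out every replica at once), so treat this sketch as an upper bound:

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of N redundant replicas, assuming independent failures
    and that one healthy replica is enough to serve traffic."""
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two replicas at 99% each already reach roughly four nines.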

Failover

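Automatically shifting traffic away from a failed node. In Active-Passive, a standby takes over when the primary fails. In Active-Active, all nodes serve traffic and the survivors absorb the failed node's load.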
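
A minimal sketch of automated Active-Passive failover driven by health checks. Everything here is hypothetical: `FailoverRouter`, the node names, and the `is_healthy` probe are illustrative, and a real deployment would also need leader election or fencing to avoid two nodes both acting as primary:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverRouter:
    nodes: List[str]                    # nodes[0] starts as the primary
    is_healthy: Callable[[str], bool]   # hypothetical health probe
    failure_threshold: int = 3          # consecutive failed checks before failing over
    active_index: int = 0
    consecutive_failures: int = 0

    def check(self) -> str:
        """Run one health-check round and return the node that should serve traffic."""
        if self.is_healthy(self.nodes[self.active_index]):
            self.consecutive_failures = 0
            return self.nodes[self.active_index]

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Promote the next healthy node, if there is one.
            for offset in range(1, len(self.nodes)):
                candidate = (self.active_index + offset) % len(self.nodes)
                if self.is_healthy(self.nodes[candidate]):
                    self.active_index = candidate
                    self.consecutive_failures = 0
                    break
        return self.nodes[self.active_index]


# Usage: simulate the primary going down.
down = {"primary"}
router = FailoverRouter(nodes=["primary", "standby"],
                        is_healthy=lambda node: node not in down)
for round_number in range(1, 4):
    print(round_number, router.check())
```

Requiring several consecutive failed checks avoids failing over on a single dropped probe, at the cost of a slightly longer detection time, which counts toward MTTR.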

Circuit Breakers

Stopping requests to a failing service to prevent "cascading failures" across the entire system.
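
A minimal circuit breaker sketch, to show the mechanism rather than a production implementation. The class name, thresholds, and injectable `clock` parameter are assumptions made for illustration. After a number of consecutive failures the breaker "opens" and rejects calls immediately; once a timeout passes it lets one trial call through:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self.clock = clock                  # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Failing fast keeps callers from piling up threads and connections behind a dependency that is already struggling, which is how one slow service otherwise drags down its neighbors.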

Interview Guidance

The Paradox

High availability often comes at the cost of consistency. Adding a second replica raises availability (Availability↑), but keeping both replicas in perfect sync is the hard part, so strong consistency becomes harder to guarantee (Consistency↓).

Interviewer Question

"Your system needs 99.999% availability. How do you design it?"

Don't just say "replicas". Talk about automatic failover, health checks, and multi-region active-active setups.

Red Flag

"Adding a manual switch for failover." - Manual failover is too slow for HA: at five nines, the downtime budget for the whole year is about five minutes. Describe automated monitoring and leader election instead.

Green Flag

"Discussing MTTR explicitly." - Showing that you care about observability and deployment speed as part of availability.
