
Observability & Monitoring

Monitoring tells you *when* a system is failing. Observability tells you *why*.

Why Observability?

"If you can't measure it, you can't manage it."

In a monolith, a debugger might suffice. In a system of 50 microservices, you are trying to find a needle in an ever-shifting haystack of asynchronous events.

MTTR: Mean Time To Recovery
MTBF: Mean Time Between Failures
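As a rough illustration of how these two numbers fall out of an incident history, here is a minimal sketch. The incident timestamps are made up, and MTBF is computed under one common convention (uptime from one recovery to the next failure); other teams measure it failure-to-failure.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure detected, service restored) pairs, oldest first.
incidents = [
    (datetime(2024, 3, 1, 10, 0), datetime(2024, 3, 1, 10, 30)),
    (datetime(2024, 3, 11, 10, 0), datetime(2024, 3, 11, 11, 0)),
    (datetime(2024, 3, 21, 10, 0), datetime(2024, 3, 21, 10, 15)),
]


def mttr(incidents):
    """Mean Time To Recovery: average duration of an outage."""
    downtime = sum((end - start for start, end in incidents), timedelta())
    return downtime / len(incidents)


def mtbf(incidents):
    """Mean Time Between Failures: average uptime between a recovery and the next failure."""
    gaps = [
        next_start - prev_end
        for (_, prev_end), (next_start, _) in zip(incidents, incidents[1:])
    ]
    return sum(gaps, timedelta()) / len(gaps)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```

Good observability mostly attacks MTTR: the faster you can see *why* something broke, the faster you can fix it.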

The 3 Pillars

1. Metrics (Prometheus)

Metrics are cheap, fast, and great for alerting. Think "CPU is at 90%" or **"Success rate is 99%."**

http_requests_total{status="200"}+1
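To make the counter idea concrete, here is a minimal sketch of a labeled counter in plain Python. The `Counter` class is a toy stand-in written for this example, not the Prometheus client library, and the traffic (99 successes, 1 failure) is invented to match the "99%" figure above.

```python
from collections import defaultdict


class Counter:
    """Toy labeled counter: one monotonically increasing value per label set."""

    def __init__(self, name):
        self.name = name
        self.values = defaultdict(int)

    def inc(self, **labels):
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} share one series.
        self.values[tuple(sorted(labels.items()))] += 1

    def get(self, **labels):
        return self.values[tuple(sorted(labels.items()))]


http_requests_total = Counter("http_requests_total")

# Hypothetical traffic: 99 successful requests and 1 server error.
for status in ["200"] * 99 + ["500"]:
    http_requests_total.inc(status=status)

total = sum(http_requests_total.values.values())
success_rate = http_requests_total.get(status="200") / total
print(f"success rate: {success_rate:.0%}")
```

Note what is stored: a handful of integers, no matter how many requests arrive. That is why metrics stay cheap, and also why they cannot tell you anything about one specific request.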

2. Logs (ELK / Splunk)

Logs are expensive to store and index, but they provide the granular details. "User 123 failed to update profile due to NullPointerException."

[2024-03-21 00:20] INFO [UserService] Processing request id=abc-123
[2024-03-21 00:21] ERROR [UserService] DB timeout...
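Log lines in roughly this shape can be produced with Python's standard `logging` module. This is a sketch under assumptions: the `make_logger` helper, the format string, and the error message wording are illustrative choices, not the configuration behind the sample above.

```python
import logging
import sys


def make_logger(name, stream=sys.stderr):
    """Build a logger that writes '[timestamp] LEVEL [name] message' lines."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "[%(asctime)s] %(levelname)s [%(name)s] %(message)s",
        datefmt="%Y-%m-%d %H:%M",
    ))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


log = make_logger("UserService")

# Put the request id in every line so all entries for one request can be grouped.
request_id = "abc-123"
log.info("Processing request id=%s", request_id)
log.error("DB timeout while handling request id=%s", request_id)
```

Including the request id in every line is what lets you later search for `id=abc-123` and pull out the full story of a single request.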

3. Traces (Distributed Tracing)

Tracing uses a correlation_id passed between services to reconstruct the timeline of one request.

Gateway → AuthService → DB Query
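The hand-off of a correlation_id through those three hops can be sketched in a few lines. Everything here is a simplified assumption: the services are plain functions in one process, `record_span` is a hypothetical helper, and spans land in an in-memory list instead of a tracing backend.

```python
import time
import uuid

spans = []  # a real system would ship these to a tracing backend


def record_span(correlation_id, name, func):
    """Run one hop, timing it and tagging the result with the correlation_id."""
    start = time.perf_counter()
    try:
        return func(correlation_id)
    finally:
        spans.append({
            "correlation_id": correlation_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })


def db_query(correlation_id):
    time.sleep(0.02)  # simulated query latency
    return {"user": 123}


def auth_service(correlation_id):
    # The same id is passed along to the downstream call.
    return record_span(correlation_id, "DB Query", db_query)


def gateway(correlation_id):
    return record_span(correlation_id, "AuthService", auth_service)


def handle_request():
    correlation_id = str(uuid.uuid4())  # minted once, at the edge
    record_span(correlation_id, "Gateway", gateway)
    return correlation_id


cid = handle_request()
for span in spans:
    print(f'{span["name"]:<12} {span["duration_ms"]:.1f} ms  id={span["correlation_id"]}')
```

Spans are recorded as each hop finishes, so the innermost hop appears first. Grouping by `correlation_id` and sorting by start time is what lets a tracing tool redraw the request as a timeline.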

Interview Guidance

"A high-value user complained about slowness."

Don't stop at metrics: they only show aggregates, so one user's slow request disappears into the overall numbers. You need Distributed Tracing to see exactly which hop in *their* request was slow.
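A small sketch of why the aggregate hides the problem and the trace exposes it. The trace data is invented, and each number is assumed to be the time spent in that hop alone (not including downstream calls).

```python
from statistics import median

# Hypothetical traces: correlation_id -> time spent in each hop, in milliseconds.
traces = {
    "req-1": {"Gateway": 5, "AuthService": 20, "DB Query": 30},
    "req-2": {"Gateway": 5, "AuthService": 22, "DB Query": 28},
    "req-vip": {"Gateway": 6, "AuthService": 900, "DB Query": 31},
}


def total_latency(trace):
    """End-to-end latency of one request: the sum of its per-hop times."""
    return sum(trace.values())


def slowest_hop(trace):
    """Name of the hop that took the most time in this request."""
    return max(trace, key=trace.get)


# The metric view: one aggregate number across all requests.
median_ms = median(total_latency(t) for t in traces.values())
print("median latency (ms):", median_ms)

# The trace view: drill into the one request the user complained about.
vip = traces["req-vip"]
print("req-vip total (ms):", total_latency(vip))
print("req-vip slowest hop:", slowest_hop(vip))
```

The median looks healthy, yet one request took nearly a second, almost all of it in a single hop. Only the per-request view surfaces that.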

The "Cardiologist" Analogy

Metrics are the "Heart Rate" (Overview). Traces are the "EKG" (Details). In a system design interview, emphasize that you need both to run a reliable production service.