Observability & Monitoring
Monitoring tells you *when* a system is failing. Observability tells you *why*.
Why Observability?
"If you can't measure it, you can't manage it."
In a monolith, a debugger might suffice. In a system of 50 microservices, you are trying to find a needle in an ever-shifting haystack of asynchronous events.
The 3 Pillars
1. Metrics (Prometheus)
Metrics are cheap to store, fast to query, and well suited to alerting. Think "CPU is at 90%" or **"Success rate is 99%."**
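As a rough illustration of why metrics are cheap, the sketch below keeps only two counters no matter how many requests arrive and derives a success rate from them. The `RequestMetrics` class, the `should_alert` helper, and the 99% threshold are hypothetical names and values chosen for this example; they are not part of Prometheus or any client library.

```python
from dataclasses import dataclass


@dataclass
class RequestMetrics:
    """Hypothetical in-process counters; a real service would export these to a metrics backend."""
    total: int = 0
    failed: int = 0

    def record(self, ok: bool) -> None:
        # Each request costs two integer updates, regardless of payload or user.
        self.total += 1
        if not ok:
            self.failed += 1

    def success_rate(self) -> float:
        # Treat "no traffic yet" as healthy to avoid dividing by zero.
        if self.total == 0:
            return 1.0
        return (self.total - self.failed) / self.total


def should_alert(metrics: RequestMetrics, threshold: float = 0.99) -> bool:
    """Fire when the aggregate success rate drops below the assumed threshold."""
    return metrics.success_rate() < threshold
```

Note what is lost: the counters say that some requests failed, but not which user or which request.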
2. Logs (ELK / Splunk)
Logs are expensive to store and index, but they provide the granular details. "User 123 failed to update profile due to NullPointerException."
[2024-03-21 00:21] ERROR [UserService] DB timeout...
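One way to keep such detail searchable is to emit each log line as structured JSON. The sketch below uses Python's standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the `user_id` field are assumptions made for illustration, not a format prescribed by ELK or Splunk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            # Set via `extra={"user_id": ...}` at the call site; None if absent.
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger
```

A call such as `build_logger("UserService").error("DB timeout", extra={"user_id": 123})` would then produce a line that a log store can filter by `user_id` or `service`.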
3. Traces (Distributed Tracing)
Tracing propagates a correlation_id from service to service so that the timeline of a single request can be reconstructed.
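The propagation idea can be sketched with two in-process "services" that pass headers along. The header name `X-Correlation-ID`, the functions `api_gateway` and `profile_service`, and the list-based `trace_log` are all hypothetical stand-ins; a real system would typically rely on a tracing library rather than hand-rolled code.

```python
import uuid

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name for this sketch


def ensure_correlation_id(headers: dict) -> dict:
    """Return a copy of the headers that carries a correlation ID, reusing the caller's if present."""
    out = dict(headers)
    out.setdefault(CORRELATION_HEADER, uuid.uuid4().hex)
    return out


def profile_service(headers: dict, trace_log: list) -> None:
    """Downstream hop: records its work under the ID it received."""
    headers = ensure_correlation_id(headers)
    trace_log.append((headers[CORRELATION_HEADER], "profile_service"))


def api_gateway(headers: dict, trace_log: list) -> None:
    """Entry point: creates the ID if the client sent none, then forwards it."""
    headers = ensure_correlation_id(headers)
    trace_log.append((headers[CORRELATION_HEADER], "api_gateway"))
    profile_service(headers, trace_log)
```

Because every hop records the same ID, filtering `trace_log` by that ID yields the ordered path of one request.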
Interview Guidance
"A high-value user complained about slowness."
Don't start with metrics: they only show aggregates, so one user's slow request disappears into the average. You need Distributed Tracing to see exactly which hop in *their* request was slow.
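A small sketch of that investigation: given recorded spans, filter to the one request in question and pick the slowest hop. The `Span` type, the `slowest_hop` function, and the sample values are invented for illustration, and `duration_ms` is assumed to be the time spent inside each service itself (excluding its downstream calls).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical record of one hop in one request."""
    correlation_id: str
    service: str
    duration_ms: float  # assumed to be self-time, not including child calls


def slowest_hop(spans: list, correlation_id: str) -> Optional[Span]:
    """Return the slowest span belonging to the given request, or None if it has no spans."""
    mine = [s for s in spans if s.correlation_id == correlation_id]
    if not mine:
        return None
    return max(mine, key=lambda s: s.duration_ms)


# Invented sample data: two requests passing through the same three services.
SAMPLE_SPANS = [
    Span("req-vip", "api_gateway", 12.0),
    Span("req-vip", "user_service", 30.0),
    Span("req-vip", "database", 2400.0),
    Span("req-other", "api_gateway", 11.0),
    Span("req-other", "user_service", 28.0),
    Span("req-other", "database", 15.0),
]
```

With this data, `slowest_hop(SAMPLE_SPANS, "req-vip")` points at the `database` hop, a per-request detail that an averaged latency metric would blur.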
The "Cardiologist" Analogy
Metrics are the "Heart Rate" (Overview). Traces are the "EKG" (Details). In a system design interview, emphasize that you need both to run a reliable production service.