Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.c — Observability
Introduction
Observability answers the question: "What is happening inside my distributed system right now, and why?" Unlike traditional monitoring (which checks known failure modes), observability lets you investigate unknown unknowns -- problems you did not predict. It is essential for debugging distributed systems and a frequent discussion point in system design interviews.
1. The Three Pillars of Observability
+------------------------------------------------------------------------+
| THE THREE PILLARS |
| |
| +------------------+ +------------------+ +------------------+ |
| | LOGS | | METRICS | | TRACES | |
| | | | | | | |
| | Discrete events | | Numeric values | | Request journey | |
| | with context | | over time | | across services | |
| | | | | | | |
| | "What happened?" | | "How much / how | | "Where did the | |
| | | | fast / how many?"| | request go?" | |
| +------------------+ +------------------+ +------------------+ |
| |
| Example: Example: Example: |
| "User 42 got a 500 "p99 latency is "Request abc123 spent |
| error at 14:32:01 340ms; error rate 120ms in auth, 450ms |
| on /api/payment" is 2.3%" in DB query, 30ms |
| in serialization" |
+------------------------------------------------------------------------+
Comparison
| Aspect | Logs | Metrics | Traces |
|---|---|---|---|
| Data type | Text / structured JSON | Numeric (counters, gauges, histograms) | Spans with timing + parent-child relationships |
| Cardinality | High (one per event) | Low (aggregated) | Medium (one per request) |
| Storage cost | Expensive at scale | Cheap | Moderate |
| Best for | Debugging specific errors | Dashboards, alerting, trends | Understanding request flow |
| Query pattern | Search / filter | Aggregate / graph | Trace by ID, waterfall view |
| Retention | Days to weeks | Months to years | Days to weeks |
2. Logs
Structured vs Unstructured Logs
UNSTRUCTURED (hard to parse):
"2024-03-15 14:32:01 ERROR Payment failed for user 42, amount $99.99"
STRUCTURED (machine-parseable):
{
"timestamp": "2024-03-15T14:32:01Z",
"level": "ERROR",
"service": "payment-service",
"message": "Payment failed",
"user_id": 42,
"amount": 99.99,
"currency": "USD",
"error_code": "CARD_DECLINED",
"trace_id": "abc-123-def-456",
"span_id": "span-789"
}
Always use structured logs in distributed systems. They enable:
- Filtering by field (all errors for user 42)
- Correlation across services (by trace_id)
- Automated alerting on specific conditions
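As a sketch, a structured log line like the one above can be emitted with the Python standard library alone; the service name and field names (`trace_id`, `user_id`) follow the example and are conventions, not requirements:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logger's `extra=` argument
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed",
             extra={"fields": {"user_id": 42, "error_code": "CARD_DECLINED",
                               "trace_id": "abc-123-def-456"}})
```

Every line is now machine-parseable, so a log pipeline can filter on `user_id` or join across services on `trace_id`.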
Log Levels
FATAL --> System is unusable (process crashing)
ERROR --> Operation failed (needs attention)
WARN --> Unexpected but recoverable (watch closely)
INFO --> Normal operations (request served, job completed)
DEBUG --> Detailed diagnostic info (usually off in prod)
TRACE --> Very verbose (never in prod)
Production typically: INFO + WARN + ERROR + FATAL
Debugging / staging: Add DEBUG
3. Metrics
Metric Types
+------------------------------------------------------------------------+
| COUNTER GAUGE HISTOGRAM |
| Monotonically Current value Distribution of values |
| increasing (up or down) (bucketed) |
| |
| requests_total cpu_usage_pct request_duration_seconds |
| |
| ^ ^ Count |
| | / | /\ | ____ |
| | / | / \ | | |___ |
| | / | / \ | | | |__ |
| | / | / \ | | | | |_ |
| +--------> +--------> +-+----+---+--+--> |
| time time 10ms 50ms 100ms 500ms |
| |
| "Total requests "Current memory "p50 = 45ms, p95 = 120ms, |
| served: 1.2M" usage: 73%" p99 = 340ms" |
+------------------------------------------------------------------------+
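A minimal pure-Python sketch of the three metric types (real client libraries such as prometheus_client add labels, thread safety, and an exposition format):

```python
class Counter:
    """Monotonically increasing value, e.g. requests_total."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters only go up"
        self.value += amount

class Gauge:
    """Current value that can move up or down, e.g. cpu_usage_pct."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into cumulative buckets, like Prometheus `le`."""
    def __init__(self, bounds=(0.01, 0.05, 0.1, 0.5, float("inf"))):
        self.buckets = {b: 0 for b in sorted(bounds)}
        self.count = 0
    def observe(self, value):
        self.count += 1
        for bound in self.buckets:          # bounds iterate in ascending order
            if value <= bound:
                self.buckets[bound] += 1    # cumulative: all higher buckets too
```

The cumulative bucket layout is what lets the server estimate percentiles later without storing every observation.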
The RED Method (for request-driven services)
| Letter | Metric | What It Tells You |
|---|---|---|
| R | Rate | Requests per second (throughput) |
| E | Errors | Failed requests per second |
| D | Duration | Latency distribution (p50, p95, p99) |
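A sketch of computing the RED metrics from one window of request records; the record shape (HTTP status, latency in seconds) and the nearest-rank percentile are simplifications for illustration:

```python
def red_metrics(requests, window_seconds):
    """requests: list of (http_status, latency_seconds) seen in the window."""
    rate = len(requests) / window_seconds                       # R: throughput
    errors = sum(1 for s, _ in requests if s >= 500) / window_seconds  # E
    latencies = sorted(lat for _, lat in requests)
    def pct(p):  # nearest-rank percentile, simplified
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]
    return {"rate": rate, "error_rate": errors,                 # D: duration
            "p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}
```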
The USE Method (for infrastructure resources)
| Letter | Metric | What It Tells You |
|---|---|---|
| U | Utilization | % of resource capacity in use |
| S | Saturation | Queue depth / backlog |
| E | Errors | Hardware / resource errors |
4. Distributed Tracing
The Problem
Client --> API Gateway --> Auth Service --> Order Service --> Payment Service
|
Inventory Service --> DB
"The request took 2.3 seconds. WHERE was the time spent?"
The Solution: Correlation IDs and Spans
Trace ID: abc-123-def-456 (ONE per end-to-end request)
+-----------------------------------------------------------------------+
| Span: API Gateway [=============] 200ms |
| Span: Auth Service [====] 50ms |
| Span: Order Service [========================] 1800ms |
| Span: Inventory Check [=======] 300ms |
| Span: Payment Service [===============] 1200ms |
| Span: DB Write [==========] 800ms |
+-----------------------------------------------------------------------+
0ms 200ms 400ms 800ms 1200ms 1600ms 2000ms 2300ms
Conclusion: Payment -> DB write is the bottleneck (800ms)
How Tracing Works (Context Propagation)
Service A Service B Service C
+-----------+ +-----------+ +-----------+
| Create | | Extract | | Extract |
| trace_id | HTTP Header | trace_id | HTTP Header | trace_id |
| Create | ------------> | from | ------------> | from |
| span_a | traceparent: | header | traceparent: | header |
| | abc-123... | Create | abc-123... | Create |
| | | span_b | | span_c |
+-----------+ | (parent: | | (parent: |
| span_a) | | span_b) |
+-----------+ +-----------+
W3C Trace Context header:
traceparent: 00-<trace_id>-<span_id>-<flags>
Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
(trace_id is 16 bytes as 32 hex chars; the parent span_id is 8 bytes as 16 hex chars)
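The propagation step above can be sketched in a few lines, assuming plain dict-based HTTP headers; a real service would let an OTel SDK do this rather than hand-rolling it:

```python
import secrets

def start_trace():
    """Entry service: mint a new trace_id and root span_id."""
    return secrets.token_hex(16), secrets.token_hex(8)  # 32 + 16 hex chars

def inject(headers, trace_id, span_id):
    """Before an outgoing call: put the current context in traceparent."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract(headers):
    """Downstream service: recover trace_id and the parent span_id."""
    version, trace_id, parent_span_id, flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id

# Service A starts the trace and calls Service B
trace_id, span_a = start_trace()
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B sees the same trace_id; A's span becomes its parent
same_trace, parent = extract(outgoing)
```

The trace_id never changes across hops; only the span_id is replaced at each service, which is what produces the parent-child waterfall.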
5. Observability Stack: Prometheus + Grafana
+------------------------------------------------------------------------+
| PROMETHEUS + GRAFANA ARCHITECTURE |
| |
| +-------------+ +-------------+ +-------------+ |
| | Service A | | Service B | | Service C | |
| | /metrics | | /metrics | | /metrics | |
| +------+------+ +------+------+ +------+------+ |
| | | | |
| +------ PULL ------+------ PULL ------+ |
| | |
| +--------v--------+ |
| | PROMETHEUS | |
| | Time-series DB | |
| | + Scraper | |
| | + PromQL | |
| +--------+--------+ |
| | |
| +-------------+-------------+ |
| | | |
| +--------v--------+ +--------v--------+ |
| | GRAFANA | | ALERTMANAGER | |
| | Dashboards | | Rules + Routes | |
| | Visualization | | PagerDuty / | |
| +------------------+ | Slack / Email | |
| +------------------+ |
+------------------------------------------------------------------------+
| Component | Role | Key Feature |
|---|---|---|
| Prometheus | Metrics collection + storage | Pull-based scraping, PromQL query language |
| Grafana | Visualization | Dashboards, panels, annotations |
| Alertmanager | Alert routing | Grouping, silencing, escalation |
Example PromQL queries:
# Request rate (per second) over last 5 minutes
rate(http_requests_total[5m])
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate as a percentage (sum() so label sets match before dividing)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
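The `histogram_quantile` estimate can be sketched in Python: given cumulative bucket counts keyed by upper bound (`le`), it linearly interpolates inside the bucket where the target rank falls. This is a simplification; the real function also handles the `+Inf` bucket and operates on per-series rates:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total                      # how many observations fall below
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

This is also why bucket boundaries matter: a p99 that lands in a wide bucket is only a coarse estimate.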
6. ELK Stack (Elasticsearch + Logstash + Kibana)
+------------------------------------------------------------------------+
| ELK STACK |
| |
| +----------+ +----------+ +----------+ |
| | Service | | Service | | Service | |
| | (logs) | | (logs) | | (logs) | |
| +----+-----+ +----+-----+ +----+-----+ |
| | | | |
| +--- Filebeat / Fluentd ------+ (Log shippers) |
| | |
| +-------v--------+ |
| | LOGSTASH | Parse, transform, enrich |
| | (or Fluentd) | "Extract user_id from log line" |
| +-------+--------+ |
| | |
| +-------v--------+ |
| | ELASTICSEARCH | Index, store, search |
| | (distributed | Full-text search on logs |
| | search engine)| |
| +-------+--------+ |
| | |
| +-------v--------+ |
| | KIBANA | Visualize, explore, dashboard |
| +----------------+ |
+------------------------------------------------------------------------+
| Component | Purpose |
|---|---|
| Elasticsearch | Store and search logs (inverted index, distributed) |
| Logstash | Ingest, parse, and transform logs before storage |
| Kibana | Web UI for searching logs, building dashboards |
| Beats / Fluentd | Lightweight log shippers on each host |
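As a sketch, the "extract user_id" step shown in the diagram could be a Logstash filter; the field names and grok pattern are illustrative:

```conf
filter {
  # Structured JSON logs parse directly into fields
  json { source => "message" }

  # For unstructured lines, pull fields out with a grok pattern
  grok {
    match => { "message" => "user %{NUMBER:user_id}, amount \$%{NUMBER:amount}" }
  }
}
```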
7. OpenTelemetry (OTel)
OpenTelemetry is the emerging standard that unifies logs, metrics, and traces under one framework.
+------------------------------------------------------------------------+
| OPENTELEMETRY ARCHITECTURE |
| |
| +-------------------+ |
| | Your Application | |
| | + OTel SDK | (Auto-instrumentation or manual) |
| +--------+----------+ |
| | |
| Logs + Metrics + Traces (OTLP protocol) |
| | |
| +--------v----------+ |
| | OTel Collector | (Receives, processes, exports) |
| +---+-----+-----+---+ |
| | | | |
| v v v |
| Jaeger Prom Elastic (Any backend) |
| (Traces)(Metrics)(Logs) |
+------------------------------------------------------------------------+
Why OpenTelemetry matters:
- Vendor-neutral: Switch backends without changing application code
- Unified: One SDK for logs + metrics + traces
- Correlation: Automatically links traces to metrics to logs
- Industry standard: CNCF project, adopted by all major cloud vendors
8. Alerting Best Practices
+------------------------------------------------------------------------+
| ALERTING PYRAMID |
| |
| /\ |
| / \ PAGE (wake someone up) |
| / P99 \ - Service completely down |
| / > 2s \ - Data loss imminent |
| / errors \ - SLA breach |
| / > 5% \ |
| +-----------+ |
| / \ TICKET (fix during business hours) |
| / Disk > 80% \ - Degraded but functional |
| / Slow queries \ - Approaching limits |
| / Memory > 85% \ |
| +---------------------+ |
| / \ LOG / DASHBOARD (informational) |
| / Deployment completed \ - Awareness, no action needed |
| / Cache hit rate dropped \ |
| / New version released \ |
| +--------------------------------+ |
+------------------------------------------------------------------------+
| Principle | Explanation |
|---|---|
| Alert on symptoms, not causes | Alert on "error rate > 5%" not "disk I/O high" |
| Every alert must be actionable | If no one can act on it, it is noise |
| Reduce alert fatigue | Too many alerts = all alerts get ignored |
| Use severity levels | P1 (page) vs P2 (ticket) vs P3 (log) |
| Include runbook links | Alert should link to steps for resolution |
| Test alerts regularly | An untested alert might not fire when needed |
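The pyramid's severity tiers can be sketched as a routing rule; the thresholds and tier names mirror the diagram and are illustrative, not prescriptive:

```python
def route_alert(error_rate_pct, p99_seconds, disk_used_pct):
    """Map observed symptoms to a pyramid tier: page, ticket, or log."""
    if error_rate_pct > 5 or p99_seconds > 2:
        return "page"    # users are hurting right now: wake someone up
    if disk_used_pct > 80:
        return "ticket"  # degraded but functional: fix in business hours
    return "log"         # informational: dashboard awareness only
```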
9. Debugging Distributed Systems
Common Debugging Workflow
1. ALERT fires: "Error rate > 5% on order-service"
|
v
2. CHECK DASHBOARD: Which endpoint? Since when? Correlate with deploys.
|
v
3. SEARCH LOGS: Filter by service + time window + error level
"service:order-service AND level:ERROR AND timestamp:[now-30m TO now]"
|
v
4. FIND TRACE ID: Pick a failing request, get its trace_id from logs
|
v
5. VIEW TRACE: Open trace in Jaeger/Zipkin -- see the waterfall
"Aha, the payment-service span shows 10s timeout"
|
v
6. DRILL INTO SERVICE: Check payment-service logs for that trace_id
"Connection refused to payment-db-primary"
|
v
7. ROOT CAUSE: Database failover was in progress, connections dropped
|
v
8. FIX + POSTMORTEM: Add circuit breaker, increase connection pool retry
10. Observability in System Design Interviews
When discussing observability in an interview:
+-----------------------------------------------------------------------+
| OBSERVABILITY TALKING POINTS FOR INTERVIEWS |
| |
| 1. "Every service exposes /metrics (Prometheus) and emits |
| structured logs with trace IDs" |
| |
| 2. "We use distributed tracing to follow requests across |
| services -- critical for debugging latency issues" |
| |
| 3. "Dashboards show the RED metrics: Request rate, Error rate, |
| Duration (p50/p95/p99)" |
| |
| 4. "Alerts fire on SLO violations, not raw thresholds -- |
| e.g., error budget burn rate" |
| |
| 5. "Health check endpoints enable the load balancer to route |
| around unhealthy instances" |
| |
| DO NOT go deep unless asked -- mention it, show awareness, |
| move on to the next design point. |
+-----------------------------------------------------------------------+
Key Takeaways
- Three pillars: Logs, Metrics, Traces. Each serves a different purpose; all three are needed.
- Structured logs are non-negotiable. JSON logs with trace IDs enable cross-service correlation.
- Distributed tracing reveals bottlenecks. A waterfall view instantly shows where latency lives.
- RED method for services, USE method for infrastructure. Two simple frameworks that cover most monitoring needs.
- Prometheus + Grafana is the de facto metrics stack. Know PromQL basics.
- ELK is the de facto log stack. Elasticsearch indexes; Kibana visualizes.
- OpenTelemetry is the future. One unified SDK for all three pillars.
- Alert on symptoms, not causes. Every alert must be actionable.
Explain-It Challenge
Scenario: A user reports that "the app is slow." You have Prometheus, Grafana, Jaeger, and an ELK stack deployed.
Walk through your exact debugging steps:
- Which dashboard do you check first?
- What metrics tell you which service is slow?
- How do you find the specific slow request?
- How do you trace it to the root cause?
Next -> 9.10.d — Rate Limiting