Episode 9 — System Design / 9.10 — Advanced Distributed Systems
9.10.c — Observability
Introduction
Observability answers the question: "What is happening inside my distributed system right now, and why?" Unlike traditional monitoring (which checks known failure modes), observability lets you investigate unknown unknowns -- problems you did not predict. It is essential for debugging distributed systems and a frequent discussion point in system design interviews.
1. The Three Pillars of Observability
+------------------------------------------------------------------------+
| THE THREE PILLARS |
| |
| +------------------+ +------------------+ +------------------+ |
| | LOGS | | METRICS | | TRACES | |
| | | | | | | |
| | Discrete events | | Numeric values | | Request journey | |
| | with context | | over time | | across services | |
| | | | | | | |
| | "What happened?" | | "How much / how | | "Where did the | |
| | | | fast / how many?"| | request go?" | |
| +------------------+ +------------------+ +------------------+ |
| |
| Example: Example: Example: |
| "User 42 got a 500 "p99 latency is "Request abc123 spent |
| error at 14:32:01 340ms; error rate 120ms in auth, 450ms |
| on /api/payment" is 2.3%" in DB query, 30ms |
| in serialization" |
+------------------------------------------------------------------------+
Comparison
| Aspect | Logs | Metrics | Traces |
|---|---|---|---|
| Data type | Text / structured JSON | Numeric (counters, gauges, histograms) | Spans with timing + parent-child relationships |
| Cardinality | High (one per event) | Low (aggregated) | Medium (one per request) |
| Storage cost | Expensive at scale | Cheap | Moderate |
| Best for | Debugging specific errors | Dashboards, alerting, trends | Understanding request flow |
| Query pattern | Search / filter | Aggregate / graph | Trace by ID, waterfall view |
| Retention | Days to weeks | Months to years | Days to weeks |
2. Logs
Structured vs Unstructured Logs
UNSTRUCTURED (hard to parse):
"2024-03-15 14:32:01 ERROR Payment failed for user 42, amount $99.99"
STRUCTURED (machine-parseable):
{
"timestamp": "2024-03-15T14:32:01Z",
"level": "ERROR",
"service": "payment-service",
"message": "Payment failed",
"user_id": 42,
"amount": 99.99,
"currency": "USD",
"error_code": "CARD_DECLINED",
"trace_id": "abc-123-def-456",
"span_id": "span-789"
}
Always use structured logs in distributed systems. They enable:
- Filtering by field (all errors for user 42)
- Correlation across services (by trace_id)
- Automated alerting on specific conditions
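As a sketch, a structured log line like the one above can be emitted with the Python standard library alone; the service name and field names (`trace_id`, `user_id`) follow the example and are conventions, not requirements:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via logger's `extra=` argument
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed",
             extra={"fields": {"user_id": 42, "error_code": "CARD_DECLINED",
                               "trace_id": "abc-123-def-456"}})
```

Every line is now machine-parseable, so a log pipeline can filter on `user_id` or join across services on `trace_id`.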
Log Levels
FATAL --> System is unusable (process crashing)
ERROR --> Operation failed (needs attention)
WARN --> Unexpected but recoverable (watch closely)
INFO --> Normal operations (request served, job completed)
DEBUG --> Detailed diagnostic info (usually off in prod)
TRACE --> Very verbose (never in prod)
Production typically: INFO + WARN + ERROR + FATAL
Debugging / staging: Add DEBUG
3. Metrics
Metric Types
+------------------------------------------------------------------------+
| COUNTER GAUGE HISTOGRAM |
| Monotonically Current value Distribution of values |
| increasing (up or down) (bucketed) |
| |
| requests_total cpu_usage_pct request_duration_seconds |
| |
| ^ ^ Count |
| | / | /\ | ____ |
| | / | / \ | | |___ |
| | / | / \ | | | |__ |
| | / | / \ | | | | |_ |
| +--------> +--------> +-+----+---+--+--> |
| time time 10ms 50ms 100ms 500ms |
| |
| "Total requests "Current memory "p50 = 45ms, p95 = 120ms, |
| served: 1.2M" usage: 73%" p99 = 340ms" |
+------------------------------------------------------------------------+
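A minimal pure-Python sketch of the three metric types (real client libraries such as prometheus_client add labels, thread safety, and an exposition format):

```python
class Counter:
    """Monotonically increasing value, e.g. requests_total."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters only go up"
        self.value += amount

class Gauge:
    """Current value that can move up or down, e.g. cpu_usage_pct."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts observations into cumulative buckets, like Prometheus `le`."""
    def __init__(self, bounds=(0.01, 0.05, 0.1, 0.5, float("inf"))):
        self.buckets = {b: 0 for b in sorted(bounds)}
        self.count = 0
    def observe(self, value):
        self.count += 1
        for bound in self.buckets:          # bounds iterate in ascending order
            if value <= bound:
                self.buckets[bound] += 1    # cumulative: all higher buckets too
```

The cumulative bucket layout is what lets the server estimate percentiles later without storing every observation.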
The RED Method (for request-driven services)
| Letter | Metric | What It Tells You |
|---|---|---|
| R | Rate | Requests per second (throughput) |
| E | Errors | Failed requests per second |
| D | Duration | Latency distribution (p50, p95, p99) |
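A sketch of computing the RED metrics from one window of request records; the record shape (HTTP status, latency in seconds) and the nearest-rank percentile are simplifications for illustration:

```python
def red_metrics(requests, window_seconds):
    """requests: list of (http_status, latency_seconds) seen in the window."""
    rate = len(requests) / window_seconds                       # R: throughput
    errors = sum(1 for s, _ in requests if s >= 500) / window_seconds  # E
    latencies = sorted(lat for _, lat in requests)
    def pct(p):  # nearest-rank percentile, simplified
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]
    return {"rate": rate, "error_rate": errors,                 # D: duration
            "p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}
```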
The USE Method (for infrastructure resources)
| Letter | Metric | What It Tells You |
|---|---|---|
| U | Utilization | % of resource capacity in use |
| S | Saturation | Queue depth / backlog |
| E | Errors | Hardware / resource errors |
4. Distributed Tracing
The Problem
Client --> API Gateway --> Auth Service --> Order Service --> Payment Service
|
Inventory Service --> DB
"The request took 2.3 seconds. WHERE was the time spent?"
The Solution: Correlation IDs and Spans
Trace ID: abc-123-def-456 (ONE per end-to-end request)
+-----------------------------------------------------------------------+
| Span: API Gateway [=============] 200ms |
| Span: Auth Service [====] 50ms |
| Span: Order Service [========================] 1800ms |
| Span: Inventory Check [=======] 300ms |
| Span: Payment Service [===============] 1200ms |
| Span: DB Write [==========] 800ms |
+-----------------------------------------------------------------------+
0ms 200ms 400ms 800ms 1200ms 1600ms 2000ms 2300ms
Conclusion: Payment -> DB write is the bottleneck (800ms)
How Tracing Works (Context Propagation)
Service A Service B Service C
+-----------+ +-----------+ +-----------+
| Create | | Extract | | Extract |
| trace_id | HTTP Header | trace_id | HTTP Header | trace_id |
| Create | ------------> | from | ------------> | from |
| span_a | traceparent: | header | traceparent: | header |
| | abc-123... | Create | abc-123... | Create |
| | | span_b | | span_c |
+-----------+ | (parent: | | (parent: |
| span_a) | | span_b) |
+-----------+ +-----------+
W3C Trace Context header:
traceparent: 00-<trace_id>-<span_id>-<flags>
Example: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
(trace_id is 16 bytes as 32 hex chars; the parent span_id is 8 bytes as 16 hex chars)
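The propagation step above can be sketched in a few lines, assuming plain dict-based HTTP headers; a real service would let an OTel SDK do this rather than hand-rolling it:

```python
import secrets

def start_trace():
    """Entry service: mint a new trace_id and root span_id."""
    return secrets.token_hex(16), secrets.token_hex(8)  # 32 + 16 hex chars

def inject(headers, trace_id, span_id):
    """Before an outgoing call: put the current context in traceparent."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract(headers):
    """Downstream service: recover trace_id and the parent span_id."""
    version, trace_id, parent_span_id, flags = headers["traceparent"].split("-")
    return trace_id, parent_span_id

# Service A starts the trace and calls Service B
trace_id, span_a = start_trace()
outgoing = {}
inject(outgoing, trace_id, span_a)

# Service B sees the same trace_id; A's span becomes its parent
same_trace, parent = extract(outgoing)
```

The trace_id never changes across hops; only the span_id is replaced at each service, which is what produces the parent-child waterfall.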
5. Observability Stack: Prometheus + Grafana
+------------------------------------------------------------------------+
| PROMETHEUS + GRAFANA ARCHITECTURE |
| |
| +-------------+ +-------------+ +-------------+ |
| | Service A | | Service B | | Service C | |
| | /metrics | | /metrics | | /metrics | |
| +------+------+ +------+------+ +------+------+ |
| | | | |
| +------ PULL ------+------ PULL ------+ |
| | |
| +--------v--------+ |
| | PROMETHEUS | |
| | Time-series DB | |
| | + Scraper | |
| | + PromQL | |
| +--------+--------+ |
| | |
| +-------------+-------------+ |
| | | |
| +--------v--------+ +--------v--------+ |
| | GRAFANA | | ALERTMANAGER | |
| | Dashboards | | Rules + Routes | |
| | Visualization | | PagerDuty / | |
| +------------------+ | Slack / Email | |
| +------------------+ |
+------------------------------------------------------------------------+
| Component | Role | Key Feature |
|---|---|---|
| Prometheus | Metrics collection + storage | Pull-based scraping, PromQL query language |
| Grafana | Visualization | Dashboards, panels, annotations |
| Alertmanager | Alert routing | Grouping, silencing, escalation |
Example PromQL queries:
# Request rate (per second) over last 5 minutes
rate(http_requests_total[5m])
# 99th percentile latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate as a percentage (sum() so label sets match before dividing)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
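The `histogram_quantile` estimate can be sketched in Python: given cumulative bucket counts keyed by upper bound (`le`), it linearly interpolates inside the bucket where the target rank falls. This is a simplification; the real function also handles the `+Inf` bucket and operates on per-series rates:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total                      # how many observations fall below
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation within this bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

This is also why bucket boundaries matter: a p99 that lands in a wide bucket is only a coarse estimate.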
6. ELK Stack (Elasticsearch + Logstash + Kibana)
+------------------------------------------------------------------------+
| ELK STACK |
| |
| +----------+ +----------+ +----------+ |
| | Service | | Service | | Service | |
| | (logs) | | (logs) | | (logs) | |
| +----+-----+ +----+-----+ +----+-----+ |
| | | | |
| +--- Filebeat / Fluentd ------+ (Log shippers) |
| | |
| +-------v--------+ |
| | LOGSTASH | Parse, transform, enrich |
| | (or Fluentd) | "Extract user_id from log line" |
| +-------+--------+ |
| | |
| +-------v--------+ |
| | ELASTICSEARCH | Index, store, search |
| | (distributed | Full-text search on logs |
| | search engine)| |
| +-------+--------+ |
| | |
| +-------v--------+ |
| | KIBANA | Visualize, explore, dashboard |
| +----------------+ |
+------------------------------------------------------------------------+
| Component | Purpose |
|---|---|
| Elasticsearch | Store and search logs (inverted index, distributed) |
| Logstash | Ingest, parse, and transform logs before storage |
| Kibana | Web UI for searching logs, building dashboards |
| Beats / Fluentd | Lightweight log shippers on each host |
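As a sketch, the "extract user_id" step shown in the diagram could be a Logstash filter; the field names and grok pattern are illustrative:

```conf
filter {
  # Structured JSON logs parse directly into fields
  json { source => "message" }

  # For unstructured lines, pull fields out with a grok pattern
  grok {
    match => { "message" => "user %{NUMBER:user_id}, amount \$%{NUMBER:amount}" }
  }
}
```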
7. OpenTelemetry (OTel)
OpenTelemetry is the emerging standard that unifies logs, metrics, and traces under one framework.
+------------------------------------------------------------------------+
| OPENTELEMETRY ARCHITECTURE |
| |
| +-------------------+ |
| | Your Application | |
| | + OTel SDK | (Auto-instrumentation or manual) |
| +--------+----------+ |
| | |
| Logs + Metrics + Traces (OTLP protocol) |
| | |
| +--------v----------+ |
| | OTel Collector | (Receives, processes, exports) |
| +---+-----+-----+---+ |
| | | | |
| v v v |
| Jaeger Prom Elastic (Any backend) |
| (Traces)(Metrics)(Logs) |
+------------------------------------------------------------------------+
Why OpenTelemetry matters:
- Vendor-neutral: Switch backends without changing application code
- Unified: One SDK for logs + metrics + traces
- Correlation: Automatically links traces to metrics to logs
- Industry standard: CNCF project, adopted by all major cloud vendors
8. Alerting Best Practices
+------------------------------------------------------------------------+
| ALERTING PYRAMID |
| |
| /\ |
| / \ PAGE (wake someone up) |
| / P99 \ - Service completely down |
| / > 2s \ - Data loss imminent |
| / errors \ - SLA breach |
| / > 5% \ |
| +-----------+ |
| / \ TICKET (fix during business hours) |
| / Disk > 80% \ - Degraded but functional |
| / Slow queries \ - Approaching limits |
| / Memory > 85% \ |
| +---------------------+ |
| / \ LOG / DASHBOARD (informational) |
| / Deployment completed \ - Awareness, no action needed |
| / Cache hit rate dropped \ |
| / New version released \ |
| +--------------------------------+ |
+------------------------------------------------------------------------+
| Principle | Explanation |
|---|---|
| Alert on symptoms, not causes | Alert on "error rate > 5%" not "disk I/O high" |
| Every alert must be actionable | If no one can act on it, it is noise |
| Reduce alert fatigue | Too many alerts = all alerts get ignored |
| Use severity levels | P1 (page) vs P2 (ticket) vs P3 (log) |
| Include runbook links | Alert should link to steps for resolution |
| Test alerts regularly | An untested alert might not fire when needed |
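The pyramid's severity tiers can be sketched as a routing rule; the thresholds and tier names mirror the diagram and are illustrative, not prescriptive:

```python
def route_alert(error_rate_pct, p99_seconds, disk_used_pct):
    """Map observed symptoms to a pyramid tier: page, ticket, or log."""
    if error_rate_pct > 5 or p99_seconds > 2:
        return "page"    # users are hurting right now: wake someone up
    if disk_used_pct > 80:
        return "ticket"  # degraded but functional: fix in business hours
    return "log"         # informational: dashboard awareness only
```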
9. Debugging Distributed Systems
Common Debugging Workflow
1. ALERT fires: "Error rate > 5% on order-service"
|
v
2. CHECK DASHBOARD: Which endpoint? Since when? Correlate with deploys.
|
v
3. SEARCH LOGS: Filter by service + time window + error level
"service:order-service AND level:ERROR AND timestamp:[now-30m TO now]"
|
v
4. FIND TRACE ID: Pick a failing request, get its trace_id from logs
|
v
5. VIEW TRACE: Open trace in Jaeger/Zipkin -- see the waterfall
"Aha, the payment-service span shows 10s timeout"
|
v
6. DRILL INTO SERVICE: Check payment-service logs for that trace_id
"Connection refused to payment-db-primary"
|
v
7. ROOT CAUSE: Database failover was in progress, connections dropped
|
v
8. FIX + POSTMORTEM: Add circuit breaker, increase connection pool retry
10. Observability in System Design Interviews
When discussing observability in an interview:
+-----------------------------------------------------------------------+
| OBSERVABILITY TALKING POINTS FOR INTERVIEWS |
| |
| 1. "Every service exposes /metrics (Prometheus) and emits |
| structured logs with trace IDs" |
| |
| 2. "We use distributed tracing to follow requests across |
| services -- critical for debugging latency issues" |
| |
| 3. "Dashboards show the RED metrics: Request rate, Error rate, |
| Duration (p50/p95/p99)" |
| |
| 4. "Alerts fire on SLO violations, not raw thresholds -- |
| e.g., error budget burn rate" |
| |
| 5. "Health check endpoints enable the load balancer to route |
| around unhealthy instances" |
| |
| DO NOT go deep unless asked -- mention it, show awareness, |
| move on to the next design point. |
+-----------------------------------------------------------------------+
Key Takeaways
- Three pillars: Logs, Metrics, Traces. Each serves a different purpose; all three are needed.
- Structured logs are non-negotiable. JSON logs with trace IDs enable cross-service correlation.
- Distributed tracing reveals bottlenecks. A waterfall view instantly shows where latency lives.
- RED method for services, USE method for infrastructure. Two simple frameworks that cover most monitoring needs.
- Prometheus + Grafana is the de facto metrics stack. Know PromQL basics.
- ELK is the de facto log stack. Elasticsearch indexes; Kibana visualizes.
- OpenTelemetry is the future. One unified SDK for all three pillars.
- Alert on symptoms, not causes. Every alert must be actionable.
Explain-It Challenge
Scenario: A user reports that "the app is slow." You have Prometheus, Grafana, Jaeger, and an ELK stack deployed.
Walk through your exact debugging steps:
- Which dashboard do you check first?
- What metrics tell you which service is slow?
- How do you find the specific slow request?
- How do you trace it to the root cause?
Next -> 9.10.d — Rate Limiting