Episode 9 — System Design / 9.10 — Advanced Distributed Systems

9.10.c — Observability

Introduction

Observability answers the question: "What is happening inside my distributed system right now, and why?" Unlike traditional monitoring (which checks known failure modes), observability lets you investigate unknown unknowns -- problems you did not predict. It is essential for debugging distributed systems and a frequent discussion point in system design interviews.


1. The Three Pillars of Observability

+------------------------------------------------------------------------+
|                   THE THREE PILLARS                                     |
|                                                                        |
|   +------------------+  +------------------+  +------------------+     |
|   |      LOGS        |  |     METRICS      |  |     TRACES       |     |
|   |                  |  |                  |  |                  |     |
|   | Discrete events  |  | Numeric values   |  | Request journey  |     |
|   | with context     |  | over time        |  | across services  |     |
|   |                  |  |                  |  |                  |     |
|   | "What happened?" |  | "How much / how  |  | "Where did the   |     |
|   |                  |  |  fast / how many?"|  |  request go?"    |     |
|   +------------------+  +------------------+  +------------------+     |
|                                                                        |
|   Example:              Example:              Example:                  |
|   "User 42 got a 500    "p99 latency is       "Request abc123 spent    |
|    error at 14:32:01     340ms; error rate     120ms in auth, 450ms    |
|    on /api/payment"      is 2.3%"              in DB query, 30ms       |
|                                                in serialization"       |
+------------------------------------------------------------------------+

Comparison

  Aspect        | Logs                      | Metrics                                | Traces
  --------------+---------------------------+----------------------------------------+-----------------------------------------------
  Data type     | Text / structured JSON    | Numeric (counters, gauges, histograms) | Spans with timing + parent-child relationships
  Cardinality   | High (one per event)      | Low (aggregated)                       | Medium (one per request)
  Storage cost  | Expensive at scale        | Cheap                                  | Moderate
  Best for      | Debugging specific errors | Dashboards, alerting, trends           | Understanding request flow
  Query pattern | Search / filter           | Aggregate / graph                      | Trace by ID, waterfall view
  Retention     | Days to weeks             | Months to years                        | Days to weeks

2. Logs

Structured vs Unstructured Logs

  UNSTRUCTURED (hard to parse):
  "2024-03-15 14:32:01 ERROR Payment failed for user 42, amount $99.99"

  STRUCTURED (machine-parseable):
  {
    "timestamp": "2024-03-15T14:32:01Z",
    "level": "ERROR",
    "service": "payment-service",
    "message": "Payment failed",
    "user_id": 42,
    "amount": 99.99,
    "currency": "USD",
    "error_code": "CARD_DECLINED",
    "trace_id": "abc-123-def-456",
    "span_id": "span-789"
  }

Always use structured logs in distributed systems. They enable:

  • Filtering by field (all errors for user 42)
  • Correlation across services (by trace_id)
  • Automated alerting on specific conditions
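
The JSON shape above can be produced with a small formatter. A minimal sketch using Python's stdlib logging module (the service name and field names are taken from the example above, not from any particular logging library):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("payment")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment failed",
             extra={"fields": {"user_id": 42, "error_code": "CARD_DECLINED",
                               "trace_id": "abc-123-def-456"}})
```

Because every line is a self-contained JSON object, a log pipeline can index each field without any regex parsing.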

Log Levels

  FATAL  -->  System is unusable (process crashing)
  ERROR  -->  Operation failed (needs attention)
  WARN   -->  Unexpected but recoverable (watch closely)
  INFO   -->  Normal operations (request served, job completed)
  DEBUG  -->  Detailed diagnostic info (usually off in prod)
  TRACE  -->  Very verbose (never in prod)

  Production typically: INFO + WARN + ERROR + FATAL
  Debugging / staging:  Add DEBUG

3. Metrics

Metric Types

+------------------------------------------------------------------------+
|  COUNTER             GAUGE               HISTOGRAM                      |
|  Monotonically       Current value       Distribution of values         |
|  increasing          (up or down)        (bucketed)                     |
|                                                                        |
|  requests_total      cpu_usage_pct       request_duration_seconds       |
|                                                                        |
|     ^                   ^                  Count                        |
|     |     /             |    /\            |  ____                      |
|     |   /               |   /  \           | |    |___                  |
|     |  /                |  /    \          | |    |   |__               |
|     | /                 | /      \         | |    |   |  |_             |
|     +-------->          +-------->         +-+----+---+--+-->           |
|      time                time               10ms 50ms 100ms 500ms      |
|                                                                        |
|  "Total requests       "Current memory    "p50 = 45ms, p95 = 120ms,   |
|   served: 1.2M"        usage: 73%"         p99 = 340ms"               |
+------------------------------------------------------------------------+
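
The semantics of the three types can be modeled in a few lines. This is an illustrative stdlib-only sketch (real services would use a client library such as prometheus_client); the metric names follow the diagram above:

```python
import bisect

class Counter:
    """Monotonically increasing; resets only when the process restarts."""
    def __init__(self):
        self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters never go down"
        self.value += amount

class Gauge:
    """Current value; free to move up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value

class Histogram:
    """Counts each observation into a latency bucket (<= upper bound)."""
    def __init__(self, buckets=(0.01, 0.05, 0.1, 0.5)):
        self.buckets = list(buckets)             # upper bounds, in seconds
        self.counts = [0] * (len(buckets) + 1)   # last slot = +Inf bucket
        self.total = 0
    def observe(self, value):
        self.counts[bisect.bisect_left(self.buckets, value)] += 1
        self.total += 1

requests_total = Counter()
request_duration = Histogram()
for latency in (0.004, 0.03, 0.07, 0.3, 2.0):
    requests_total.inc()
    request_duration.observe(latency)
```

The histogram stores only bucket counts, never raw samples; that is why it stays cheap at scale, and why percentiles computed from it are estimates.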

The RED Method (for request-driven services)

  Letter | Metric   | What It Tells You
  -------+----------+--------------------------------------
  R      | Rate     | Requests per second (throughput)
  E      | Errors   | Failed requests per second
  D      | Duration | Latency distribution (p50, p95, p99)

The USE Method (for infrastructure resources)

  Letter | Metric      | What It Tells You
  -------+-------------+------------------------------
  U      | Utilization | % of resource capacity in use
  S      | Saturation  | Queue depth / backlog
  E      | Errors      | Hardware / resource errors

4. Distributed Tracing

The Problem

  Client --> API Gateway --> Auth Service --> Order Service --> Payment Service
                                                    |
                                              Inventory Service --> DB

  "The request took 2.3 seconds. WHERE was the time spent?"

The Solution: Correlation IDs and Spans

  Trace ID: abc-123-def-456  (ONE per end-to-end request)

  +-----------------------------------------------------------------------+
  |  Span: API Gateway          [=============]                  200ms    |
  |   Span: Auth Service          [====]                          50ms    |
  |   Span: Order Service           [========================]  1800ms   |
  |    Span: Inventory Check           [=======]                 300ms   |
  |    Span: Payment Service             [===============]      1200ms   |
  |     Span: DB Write                      [==========]         800ms   |
  +-----------------------------------------------------------------------+
  0ms    200ms    400ms    800ms    1200ms   1600ms   2000ms   2300ms

  Conclusion: Payment -> DB write is the bottleneck (800ms)

How Tracing Works (Context Propagation)

  Service A                    Service B                    Service C
  +-----------+               +-----------+               +-----------+
  | Create    |               | Extract   |               | Extract   |
  | trace_id  |  HTTP Header  | trace_id  |  HTTP Header  | trace_id  |
  | Create    | ------------> | from      | ------------> | from      |
  | span_a    | traceparent:  | header    | traceparent:  | header    |
  |           | abc-123...    | Create    | abc-123...    | Create    |
  |           |               | span_b    |               | span_c    |
  +-----------+               | (parent:  |               | (parent:  |
                              |  span_a)  |               |  span_b)  |
                              +-----------+               +-----------+

W3C Trace Context header:

  traceparent: 00-<trace_id>-<span_id>-<flags>
               (version: 2 hex, trace_id: 32 hex, parent span_id: 16 hex, flags: 2 hex)
  Example:     00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
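
Propagation is just parse-and-forward. A minimal stdlib sketch of the traceparent handling described above (the helper names are hypothetical, not from any tracing library):

```python
import re
import secrets

# W3C Trace Context: version-trace_id-span_id-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$")

def extract_or_start_trace(headers):
    """Continue an incoming trace, or start a new one at the first hop."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match:
        trace_id = match.group("trace_id")   # keep the caller's trace ID
    else:
        trace_id = secrets.token_hex(16)     # first hop: mint a new one
    span_id = secrets.token_hex(8)           # every hop gets a fresh span
    return trace_id, span_id

def inject(trace_id, span_id, headers):
    """Attach context to an outgoing request (flags 01 = sampled)."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers
```

This is exactly what instrumentation libraries do under the hood: the trace_id survives every hop unchanged, while each service contributes its own span_id as the parent for the next hop.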

5. Observability Stack: Prometheus + Grafana

+------------------------------------------------------------------------+
|                  PROMETHEUS + GRAFANA ARCHITECTURE                       |
|                                                                        |
|  +-------------+   +-------------+   +-------------+                   |
|  | Service A   |   | Service B   |   | Service C   |                   |
|  | /metrics    |   | /metrics    |   | /metrics    |                   |
|  +------+------+   +------+------+   +------+------+                   |
|         |                  |                  |                         |
|         +------ PULL ------+------ PULL ------+                        |
|                            |                                           |
|                   +--------v--------+                                  |
|                   |   PROMETHEUS    |                                  |
|                   |  Time-series DB |                                  |
|                   |  + Scraper      |                                  |
|                   |  + PromQL       |                                  |
|                   +--------+--------+                                  |
|                            |                                           |
|              +-------------+-------------+                             |
|              |                           |                             |
|     +--------v--------+        +--------v--------+                    |
|     |    GRAFANA       |        |  ALERTMANAGER   |                    |
|     |  Dashboards      |        |  Rules + Routes |                    |
|     |  Visualization   |        |  PagerDuty /    |                    |
|     +------------------+        |  Slack / Email  |                    |
|                                 +------------------+                   |
+------------------------------------------------------------------------+

  Component    | Role                         | Key Feature
  -------------+------------------------------+--------------------------------------------
  Prometheus   | Metrics collection + storage | Pull-based scraping, PromQL query language
  Grafana      | Visualization                | Dashboards, panels, annotations
  Alertmanager | Alert routing                | Grouping, silencing, escalation

Example PromQL queries:

  # Request rate (per second) over last 5 minutes
  rate(http_requests_total[5m])

  # 99th percentile latency
  histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

  # Error rate as a percentage (sum both sides so the label sets match)
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) * 100
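
histogram_quantile() estimates a percentile from cumulative bucket counts by linearly interpolating inside the bucket where the target rank falls. A stdlib sketch of that idea (a simplification of Prometheus's actual implementation, which handles more edge cases):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with +Inf, mirroring
    Prometheus's `le` buckets."""
    total = buckets[-1][1]             # +Inf bucket holds all observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound      # can't interpolate into +Inf:
                                       # fall back to last finite bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# 500 requests took <= 0.1s, 900 took <= 0.5s, 1000 total
cumulative = [(0.1, 500), (0.5, 900), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, cumulative)   # lands in +Inf: reports 0.5s
```

Note the accuracy trade-off: the estimate can never be more precise than the bucket boundaries, which is why choosing bucket bounds near your SLO thresholds matters.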

6. ELK Stack (Elasticsearch + Logstash + Kibana)

+------------------------------------------------------------------------+
|                         ELK STACK                                       |
|                                                                        |
|  +----------+   +----------+   +----------+                            |
|  | Service  |   | Service  |   | Service  |                            |
|  | (logs)   |   | (logs)   |   | (logs)   |                            |
|  +----+-----+   +----+-----+   +----+-----+                           |
|       |              |              |                                   |
|       +--- Filebeat / Fluentd ------+  (Log shippers)                  |
|                      |                                                 |
|              +-------v--------+                                        |
|              |   LOGSTASH     |  Parse, transform, enrich              |
|              | (or Fluentd)   |  "Extract user_id from log line"       |
|              +-------+--------+                                        |
|                      |                                                 |
|              +-------v--------+                                        |
|              | ELASTICSEARCH  |  Index, store, search                   |
|              | (distributed   |  Full-text search on logs              |
|              |  search engine)|                                        |
|              +-------+--------+                                        |
|                      |                                                 |
|              +-------v--------+                                        |
|              |    KIBANA      |  Visualize, explore, dashboard         |
|              +----------------+                                        |
+------------------------------------------------------------------------+

  Component       | Purpose
  ----------------+-----------------------------------------------------
  Elasticsearch   | Store and search logs (inverted index, distributed)
  Logstash        | Ingest, parse, and transform logs before storage
  Kibana          | Web UI for searching logs, building dashboards
  Beats / Fluentd | Lightweight log shippers on each host

7. OpenTelemetry (OTel)

OpenTelemetry is the emerging standard that unifies logs, metrics, and traces under one framework.

+------------------------------------------------------------------------+
|                    OPENTELEMETRY ARCHITECTURE                           |
|                                                                        |
|  +-------------------+                                                 |
|  |  Your Application |                                                 |
|  |  + OTel SDK       |  (Auto-instrumentation or manual)               |
|  +--------+----------+                                                 |
|           |                                                            |
|    Logs + Metrics + Traces (OTLP protocol)                             |
|           |                                                            |
|  +--------v----------+                                                 |
|  |  OTel Collector   |  (Receives, processes, exports)                 |
|  +---+-----+-----+---+                                                |
|      |     |     |                                                     |
|      v     v     v                                                     |
|  Jaeger  Prom  Elastic   (Any backend)                                 |
|  (Traces)(Metrics)(Logs)                                               |
+------------------------------------------------------------------------+

Why OpenTelemetry matters:

  • Vendor-neutral: Switch backends without changing application code
  • Unified: One SDK for logs + metrics + traces
  • Correlation: Automatically links traces to metrics to logs
  • Industry standard: CNCF project, adopted by all major cloud vendors

8. Alerting Best Practices

+------------------------------------------------------------------------+
|                    ALERTING PYRAMID                                      |
|                                                                        |
|                      /\                                                 |
|                     /  \    PAGE (wake someone up)                      |
|                    / P99 \   - Service completely down                  |
|                   / > 2s  \  - Data loss imminent                      |
|                  /  errors \  - SLA breach                              |
|                 /   > 5%    \                                           |
|                +-----------+                                           |
|               /              \   TICKET (fix during business hours)     |
|              /  Disk > 80%    \  - Degraded but functional             |
|             /   Slow queries   \  - Approaching limits                  |
|            /    Memory > 85%    \                                       |
|           +---------------------+                                      |
|          /                        \   LOG / DASHBOARD (informational)   |
|         /  Deployment completed    \  - Awareness, no action needed     |
|        /   Cache hit rate dropped   \                                   |
|       /    New version released      \                                  |
|      +--------------------------------+                                |
+------------------------------------------------------------------------+

  Principle                      | Explanation
  -------------------------------+--------------------------------------------------
  Alert on symptoms, not causes  | Alert on "error rate > 5%", not "disk I/O high"
  Every alert must be actionable | If no one can act on it, it is noise
  Reduce alert fatigue           | Too many alerts = all alerts get ignored
  Use severity levels            | P1 (page) vs P2 (ticket) vs P3 (log)
  Include runbook links          | Alert should link to steps for resolution
  Test alerts regularly          | An untested alert might not fire when needed
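
These principles map directly onto a Prometheus alerting rule. An illustrative sketch (the job name, threshold, and runbook URL are hypothetical) that alerts on a symptom, carries a severity label, and links a runbook:

```yaml
groups:
  - name: order-service-slo
    rules:
      - alert: HighErrorRate
        # Symptom, not cause: the user-visible error ratio
        expr: |
          sum(rate(http_requests_total{job="order-service",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="order-service"}[5m])) > 0.05
        for: 5m            # must persist 5 minutes; filters out blips
        labels:
          severity: page   # routed by Alertmanager to the pager
        annotations:
          summary: "order-service error rate above 5% for 5 minutes"
          runbook_url: "https://runbooks.example.com/order-service/high-error-rate"
```

The `for:` clause and the severity label are the anti-fatigue levers: transient spikes never page, and only top-of-pyramid conditions wake anyone up.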

9. Debugging Distributed Systems

Common Debugging Workflow

  1. ALERT fires: "Error rate > 5% on order-service"
         |
         v
  2. CHECK DASHBOARD: Which endpoint? Since when? Correlate with deploys.
         |
         v
  3. SEARCH LOGS: Filter by service + time window + error level
     "service:order-service AND level:ERROR AND timestamp:[now-30m TO now]"
         |
         v
  4. FIND TRACE ID: Pick a failing request, get its trace_id from logs
         |
         v
  5. VIEW TRACE: Open trace in Jaeger/Zipkin -- see the waterfall
     "Aha, the payment-service span shows 10s timeout"
         |
         v
  6. DRILL INTO SERVICE: Check payment-service logs for that trace_id
     "Connection refused to payment-db-primary"
         |
         v
  7. ROOT CAUSE: Database failover was in progress, connections dropped
         |
         v
  8. FIX + POSTMORTEM: Add circuit breaker, increase connection pool retry
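
Step 6 in miniature: this is conceptually what a log-search tool like Kibana does when you filter by trace_id. A stdlib sketch over structured log lines (the sample entries reuse the messages from the workflow above):

```python
import json

def spans_for_trace(log_lines, trace_id):
    """Collect every structured log entry belonging to one trace,
    across whichever services emitted it, ordered by time."""
    hits = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip unstructured noise
        if entry.get("trace_id") == trace_id:
            hits.append(entry)
    return sorted(hits, key=lambda e: e["timestamp"])

logs = [
    '{"timestamp": "2024-03-15T14:32:02Z", "service": "payment-service",'
    ' "level": "ERROR", "message": "Connection refused to payment-db-primary",'
    ' "trace_id": "abc-123-def-456"}',
    '{"timestamp": "2024-03-15T14:32:01Z", "service": "order-service",'
    ' "level": "ERROR", "message": "Payment failed",'
    ' "trace_id": "abc-123-def-456"}',
    '{"timestamp": "2024-03-15T14:31:59Z", "service": "auth-service",'
    ' "level": "INFO", "message": "Token ok", "trace_id": "zzz-999"}',
]
timeline = spans_for_trace(logs, "abc-123-def-456")
```

The whole workflow hinges on every service tagging its logs with the same trace_id; without that, step 6 degenerates into grepping timestamps across hosts.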

10. Observability in System Design Interviews

When discussing observability in an interview:

+-----------------------------------------------------------------------+
|            OBSERVABILITY TALKING POINTS FOR INTERVIEWS                  |
|                                                                        |
|  1. "Every service exposes /metrics (Prometheus) and emits             |
|      structured logs with trace IDs"                                   |
|                                                                        |
|  2. "We use distributed tracing to follow requests across              |
|      services -- critical for debugging latency issues"                |
|                                                                        |
|  3. "Dashboards show the RED metrics: Request rate, Error rate,        |
|      Duration (p50/p95/p99)"                                           |
|                                                                        |
|  4. "Alerts fire on SLO violations, not raw thresholds --              |
|      e.g., error budget burn rate"                                     |
|                                                                        |
|  5. "Health check endpoints enable the load balancer to route          |
|      around unhealthy instances"                                       |
|                                                                        |
|  DO NOT go deep unless asked -- mention it, show awareness,            |
|  move on to the next design point.                                     |
+-----------------------------------------------------------------------+

Key Takeaways

  1. Three pillars: Logs, Metrics, Traces. Each serves a different purpose; all three are needed.
  2. Structured logs are non-negotiable. JSON logs with trace IDs enable cross-service correlation.
  3. Distributed tracing reveals bottlenecks. A waterfall view instantly shows where latency lives.
  4. RED method for services, USE method for infrastructure. Two simple frameworks that cover most monitoring needs.
  5. Prometheus + Grafana is the de facto metrics stack. Know PromQL basics.
  6. ELK is the de facto log stack. Elasticsearch indexes; Kibana visualizes.
  7. OpenTelemetry is the future. One unified SDK for all three pillars.
  8. Alert on symptoms, not causes. Every alert must be actionable.

Explain-It Challenge

Scenario: A user reports that "the app is slow." You have Prometheus, Grafana, Jaeger, and an ELK stack deployed.

Walk through your exact debugging steps:

  • Which dashboard do you check first?
  • What metrics tell you which service is slow?
  • How do you find the specific slow request?
  • How do you trace it to the root cause?

Next -> 9.10.d — Rate Limiting