Episode 9 — System Design / 9.11 — Real World System Design Problems

9.11.1 Design a Monitoring and Alerting System (Datadog / Prometheus)

Problem Statement

Design a monitoring and alerting system like Datadog or Prometheus that collects metrics, logs, and traces from distributed services. The system must support real-time dashboards, configurable alerting, anomaly detection, and handle millions of data points per second from thousands of services.


1. Requirements

Functional Requirements

  • Collect infrastructure metrics (CPU, memory, disk, network)
  • Collect application metrics (request count, latency, error rate)
  • Collect custom business metrics (orders/sec, revenue, signups)
  • Ingest and index structured logs
  • Collect distributed traces (spans across services)
  • Real-time dashboards with custom queries and visualizations
  • Configurable alerting rules (threshold, anomaly, composite)
  • Alert routing to multiple channels (email, Slack, PagerDuty)
  • Tag-based filtering and grouping (host, service, environment)
  • Retention policies with downsampling for historical data
  • Anomaly detection with automatic baseline learning
  • API for programmatic metric submission and querying

Non-Functional Requirements

  • Ingest 10 million metric data points per second
  • Ingest 500,000 log lines per second
  • Dashboard query latency < 1 second for last-hour queries
  • Alert evaluation latency < 30 seconds
  • Support 100,000 unique metric names with high cardinality tags
  • 99.9% availability for ingestion (never lose metrics)
  • Retain full-resolution data for 15 days, downsampled for 1 year
  • Horizontally scalable as monitored fleet grows

2. Capacity Estimation

Metrics

Data points per second:   10 million
Data point size:          ~50 bytes (metric_name, tags, value, timestamp)
Daily ingestion:          10M/sec * 86,400 * 50 bytes = 43.2 TB/day

Unique time series:       50 million active series
  (metric_name + unique tag combination = 1 series)

Storage (full resolution, 15 days):
  43.2 TB/day * 15 = 648 TB raw
  With compression (~10x for time-series): ~65 TB

Storage (downsampled 1-minute, 1 year):
  50M series * 1 rollup point/min ~= 833K points/sec
  833K * 20 bytes * 86,400 * 365 ~= 525 TB raw
  Compressed: ~52 TB

Logs

Log lines per second:     500,000
Average log line:         500 bytes
Daily log volume:         500K/sec * 86,400 * 500 bytes = 21.6 TB/day
Retention (30 days):      648 TB raw (with compression ~65 TB)

Traces

Traces per second:        100,000
Average spans per trace:  8
Span size:                300 bytes
Daily trace volume:       100K * 8 * 300 * 86,400 = 20.7 TB/day
Retention (7 days):       145 TB raw (compressed ~15 TB)

Query Volume

Dashboard queries/sec:    5,000
Alert rule evaluations:   100,000 rules * 1 eval/30 sec = 3,333 evals/sec
API queries/sec:          2,000
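The daily-volume figures above are easy to sanity-check. A small sketch, pure arithmetic with nothing assumed beyond the stated inputs:

```python
# Back-of-envelope check of the daily ingestion volumes quoted above.
SECONDS_PER_DAY = 86_400

def daily_volume_tb(events_per_sec: float, bytes_per_event: float) -> float:
    """Daily ingestion volume in terabytes (1 TB = 10^12 bytes)."""
    return events_per_sec * SECONDS_PER_DAY * bytes_per_event / 1e12

metrics_tb = daily_volume_tb(10_000_000, 50)    # 10M points/sec * 50 B
logs_tb    = daily_volume_tb(500_000, 500)      # 500K lines/sec * 500 B
traces_tb  = daily_volume_tb(100_000 * 8, 300)  # 100K traces * 8 spans * 300 B

print(f"metrics: {metrics_tb:.1f} TB/day")  # 43.2
print(f"logs:    {logs_tb:.1f} TB/day")     # 21.6
print(f"traces:  {traces_tb:.1f} TB/day")   # 20.7
```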

3. High-Level Architecture

+------------------+     +------------------+     +------------------+
| Application      |     | Infrastructure   |     | Custom Metrics   |
| (instrumented)   |     | (agents on hosts)|     | (StatsD/API)     |
+--------+---------+     +--------+---------+     +--------+---------+
         |                        |                        |
         v                        v                        v
+--------+------------------------+------------------------+---------+
|                        Ingestion Gateway                           |
|                  (Load Balanced, Protocol Translation)              |
+--------+----------+----------+-----------+----------+--------------+
         |          |          |           |          |
    +----v---+ +---v----+ +---v----+ +----v---+ +---v----+
    |Metrics | |Metrics | |Log     | |Log     | |Trace   |
    |Ingestor| |Ingestor| |Ingestor| |Ingestor| |Ingestor|
    |  #1    | |  #N    | |  #1    | |  #N    | | #1..N  |
    +---+----+ +---+----+ +---+----+ +---+----+ +---+----+
        |          |          |           |          |
   +----v----------v---+ +---v-----------v-+ +-----v--------+
   | Metrics Kafka     | | Logs Kafka      | | Traces Kafka |
   | Topic             | | Topic           | | Topic        |
   +--------+----------+ +--------+--------+ +------+-------+
            |                     |                  |
   +--------v--------+  +--------v--------+ +------v--------+
   | Metrics         |  | Log Indexer     | | Trace Indexer |
   | Aggregator &    |  | (Elasticsearch  | | (Jaeger/      |
   | Writer          |  |  / ClickHouse)  | |  Tempo)       |
   +---------+-------+  +--------+--------+ +------+--------+
             |                    |                 |
   +---------v--------+  +-------v--------+ +-----v---------+
   | Time-Series DB   |  | Log Store      | | Trace Store   |
   | (custom TSDB or  |  | (compressed    | | (object store |
   |  InfluxDB/Mimir) |  |  indices)      | |  + index)     |
   +------------------+  +----------------+ +---------------+
             |                    |                 |
   +---------v--------------------v-----------------v--------+
   |                    Query Service                        |
   +---------+--------------------+----------+---------------+
             |                    |          |
   +---------v-------+  +--------v-----+ +--v--------------+
   | Dashboard       |  | Alerting     | | API             |
   | Service         |  | Engine       | | (Programmatic)  |
   +-----------------+  +------+-------+ +-----------------+
                               |
                        +------v-------+
                        | Notification |
                        | Router       |
                        +--+-----------+
                           |     |     |
                     Email Slack PagerDuty

4. API Design

Metric Submission

POST /api/v1/metrics
  Headers: Authorization: Bearer <api_key>
  Body: {
    "series": [
      {
        "metric": "http.request.duration",
        "type": "gauge",                     // gauge, counter, histogram
        "points": [
          [1681200000, 45.2],                // [timestamp, value]
          [1681200010, 38.7],
          [1681200020, 52.1]
        ],
        "tags": ["service:api-gateway", "env:production", "host:web-01"]
      },
      {
        "metric": "http.request.count",
        "type": "counter",
        "points": [[1681200000, 15234]],
        "tags": ["service:api-gateway", "env:production", "method:GET"]
      }
    ]
  }
  Response 202: { "status": "accepted" }
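A minimal client for the metric-submission endpoint above can be sketched with only the standard library. The URL and API key are placeholders, and the POST is left commented out since it needs a live gateway:

```python
import json
import time
import urllib.request

API_URL = "https://metrics.example.com/api/v1/metrics"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"                                # placeholder

def build_payload(metric, mtype, points, tags):
    """Assemble a body matching the /api/v1/metrics schema above."""
    return {"series": [{"metric": metric, "type": mtype,
                        "points": points, "tags": tags}]}

def submit(payload):
    """POST the payload; a 202 status means accepted."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
        method="POST")
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_payload(
    "http.request.duration", "gauge",
    [[int(time.time()), 45.2]],
    ["service:api-gateway", "env:production"])
# submit(payload)  # uncomment against a live gateway
```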

POST /api/v1/logs
  Headers: Authorization: Bearer <api_key>
  Body: [
    {
      "message": "Payment processed successfully for order 12345",
      "level": "info",
      "service": "payment-service",
      "host": "pay-01",
      "timestamp": 1681200000000,
      "attributes": { "order_id": "12345", "amount": 49.99, "trace_id": "abc123" }
    }
  ]
  Response 202: { "status": "accepted" }

POST /api/v1/traces
  Headers: Authorization: Bearer <api_key>
  Body: {
    "spans": [
      {
        "trace_id": "abc123",
        "span_id": "span_1",
        "parent_span_id": null,
        "operation": "POST /api/v1/payments",
        "service": "api-gateway",
        "start_time": 1681200000000,
        "duration_ms": 245,
        "status": "ok",
        "tags": { "http.method": "POST", "http.status_code": 201 }
      },
      {
        "trace_id": "abc123",
        "span_id": "span_2",
        "parent_span_id": "span_1",
        "operation": "fraud_check",
        "service": "fraud-engine",
        "start_time": 1681200000050,
        "duration_ms": 120,
        "status": "ok",
        "tags": { "risk_score": 0.1 }
      }
    ]
  }
  Response 202: { "status": "accepted" }

Metric Querying

POST /api/v1/query
  Headers: Authorization: Bearer <api_key>
  Body: {
    "query": "avg:http.request.duration{service:api-gateway,env:production} by {host}",
    "from": 1681196400,           // 1 hour ago
    "to": 1681200000,
    "interval": 60                // 60-second buckets
  }
  Response 200: {
    "series": [
      {
        "tags": { "host": "web-01" },
        "points": [[1681196400, 42.3], [1681196460, 44.1], ...]
      },
      {
        "tags": { "host": "web-02" },
        "points": [[1681196400, 39.8], [1681196460, 41.2], ...]
      }
    ]
  }

Alert Management

POST /api/v1/alerts
  Body: {
    "name": "High Error Rate",
    "query": "avg(last_5m):sum:http.errors{service:api-gateway} / sum:http.requests{service:api-gateway} > 0.05",
    "message": "Error rate exceeded 5% for API Gateway",
    "notify": ["slack:#alerts-prod", "pagerduty:api-team"],
    "thresholds": {
      "critical": 0.05,
      "warning": 0.02
    },
    "evaluation_window": "5m",
    "no_data_behavior": "alert"
  }
  Response 201: { "alert_id": "alert_002" }

GET /api/v1/alerts
  Response 200: {
    "alerts": [
      {
        "alert_id": "alert_001",
        "name": "High API Latency",
        "status": "triggered",     // ok|triggered|no_data
        "query": "avg(last_5m):avg:http.request.duration{service:api-gateway} > 500",
        "triggered_at": "2026-04-11T10:05:00Z",
        "value": 623.4
      }
    ]
  }

5. Deep Dive: Metrics Collection (Push vs Pull)

Push Model (StatsD / Datadog Agent)

Application --push--> Agent (local) --push--> Ingestion Gateway

Flow:
  1. Application code instruments metrics:
     statsd.increment("api.request.count", tags=["method:GET"])
     statsd.histogram("api.request.duration", 45.2)
  
  2. Local agent on same host:
     - Receives UDP datagrams from application
     - Aggregates locally (10-second flush interval)
     - Batches and compresses
     - Pushes to ingestion gateway over HTTPS
  
  3. Ingestion gateway:
     - Validates API key
     - Writes to Kafka topic

Advantages:
  + Application controls when to emit
  + Works behind firewalls (outbound only)
  + Local agent pre-aggregates (reduces network traffic)
  + Better for ephemeral/serverless workloads (short-lived processes)

Disadvantages:
  - Requires agent installation on every host
  - Agent failure = metric loss (mitigated by local disk buffer)
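The agent's local pre-aggregation step can be sketched as follows. This toy aggregator (all names hypothetical) sums counters and buffers histogram samples between flushes, which is what reduces network traffic in the push model; a real agent would run `flush` on a timer and POST the batch to the gateway:

```python
import time
from collections import defaultdict

class MiniAgent:
    """Toy local aggregator in the spirit of the StatsD flow above."""

    def __init__(self, flush_interval=10.0):
        self.flush_interval = flush_interval      # a timer would drive flush()
        self.counters = defaultdict(float)
        self.histograms = defaultdict(list)
        self.last_flush = time.monotonic()

    def increment(self, name, value=1, tags=()):
        """Counters are summed locally: N calls become one number."""
        self.counters[(name, tuple(sorted(tags)))] += value

    def histogram(self, name, value, tags=()):
        """Histogram samples are buffered until the next flush."""
        self.histograms[(name, tuple(sorted(tags)))].append(value)

    def flush(self):
        """Return one batched payload and reset local state."""
        batch = {"counters": dict(self.counters),
                 "histograms": dict(self.histograms)}
        self.counters.clear()
        self.histograms.clear()
        self.last_flush = time.monotonic()
        return batch  # real agent: compress + HTTPS POST to the gateway

agent = MiniAgent()
agent.increment("api.request.count", tags=["method:GET"])
agent.increment("api.request.count", tags=["method:GET"])
agent.histogram("api.request.duration", 45.2)
batch = agent.flush()
# Two increments collapse into one counter value of 2 in the batch.
```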

Pull Model (Prometheus)

Monitoring Server --pull--> Application /metrics endpoint

Flow:
  1. Application exposes metrics at HTTP endpoint:
     GET /metrics
     
     # HELP http_requests_total Total HTTP requests
     # TYPE http_requests_total counter
     http_requests_total{method="GET",status="200"} 15234
     http_requests_total{method="POST",status="201"} 892
     
     # HELP http_request_duration_seconds Request duration
     # TYPE http_request_duration_seconds histogram
     http_request_duration_seconds_bucket{le="0.05"} 8923
     http_request_duration_seconds_bucket{le="0.1"} 12345
     http_request_duration_seconds_bucket{le="0.5"} 15000

  2. Prometheus server scrapes every 15 seconds:
     - Service discovery finds all targets
     - Scrapes /metrics from each target
     - Stores in local TSDB

Advantages:
  + Simple to debug (curl the endpoint to see current metrics)
  + Central control over scrape frequency
  + Can detect if a target is down (a failed scrape sets the "up" metric to 0)
  + No agent needed

Disadvantages:
  - Requires network access TO targets (firewall issues)
  - Hard to scale past ~10M series per Prometheus instance
  - Not ideal for short-lived processes (may never be scraped)
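The exposition format shown above can be sketched as a small renderer (counters only, HELP lines omitted for brevity; a real endpoint would serve this text at /metrics):

```python
def render_exposition(counters):
    """Render counters in the Prometheus text exposition format.
    `counters` maps (metric_name, tuple of (label, value) pairs) -> value."""
    lines = []
    seen = set()
    for (name, labels), value in sorted(counters.items()):
        if name not in seen:
            lines.append(f"# TYPE {name} counter")  # one TYPE line per metric
            seen.add(name)
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

counters = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 15234,
    ("http_requests_total", (("method", "POST"), ("status", "201"))): 892,
}
print(render_exposition(counters))
```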

Our Hybrid Approach

+------------------+                    +--------------------+
| Push Path        |                    | Pull Path          |
| (agents, SDKs)   |                    | (Prometheus compat)|
+--------+---------+                    +--------+-----------+
         |                                       |
         v                                       v
+--------+----------+              +-------------+---------+
| Ingestion Gateway |              | Scrape Manager        |
| (receives pushes) |              | (discovers & scrapes) |
+--------+----------+              +-------------+---------+
         |                                       |
         +-------------------+-------------------+
                             |
                    +--------v--------+
                    | Kafka (unified) |
                    +--------+--------+
                             |
                    +--------v--------+
                    | TSDB Writer     |
                    +-----------------+

Decision: Support BOTH models. Push for custom/business metrics and
ephemeral workloads. Pull for infrastructure metrics and Prometheus
ecosystem compatibility. Both converge into the same Kafka + TSDB pipeline.

6. Deep Dive: Time-Series Database

Data Model

A time series is uniquely identified by:
  metric_name + sorted_tag_set

Example:
  http.request.duration{host=web-01,service=api,env=prod}  <- series ID
  
  Timestamp     | Value
  --------------|-------
  1681200000    | 42.3
  1681200010    | 38.7
  1681200020    | 52.1
  1681200030    | 45.0
  ...

Storage Architecture (LSM-based TSDB)

Write Path:
  1. Incoming data point -> Write-Ahead Log (WAL) for durability
  2. Buffer in in-memory table (memtable / head block)
  3. When memtable reaches threshold (e.g., 2-hour time window):
     - Flush to immutable block on disk
     - Organized by time block (2-hour blocks)

Disk Layout:
  /data/
    block_2026041100/        (2026-04-11 00:00 - 02:00)
      index                  (series_id -> chunk offsets)
      chunks/
        000001               (compressed data points)
        000002
      meta.json
    block_2026041102/        (2026-04-11 02:00 - 04:00)
      ...
    wal/
      segment-000001         (write-ahead log for current block)
      segment-000002

Compression (Critical for Cost)

Timestamps:  Delta-of-delta encoding
  Raw:    [1681200000, 1681200010, 1681200020, 1681200030]
  Delta:  [10, 10, 10]
  DoD:    [0, 0]                     -> Compresses to nearly nothing
                                        for regular intervals

Values:     XOR encoding (Gorilla compression, from Facebook's paper)
  Consecutive float values often share many bits.
  XOR with previous value; encode only the differing bits.
  
  Example:
    v1 = 42.3  (IEEE 754: 0100000001000101001100110011...)
    v2 = 38.7  (IEEE 754: 0100000001000011010110011001...)
    XOR:        0000000000001110011010101010...
    -> Only store: leading zeros count + significant bits

  Typical compression: ~1.37 bytes per data point (per the Gorilla paper)

Overall compression ratio: ~12x
  Raw:        50 bytes per point
  Compressed: ~4 bytes per point

At 43.2 TB/day raw, this means ~3.6 TB/day compressed.
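Both encodings are easy to demonstrate. A sketch of delta-of-delta on the timestamps above, plus the XOR trick on raw IEEE 754 bit patterns:

```python
import struct

def delta_of_delta(timestamps):
    """Second-order deltas: regular intervals collapse to zeros."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_bits(v1, v2):
    """XOR of two IEEE 754 doubles; similar values share leading bits,
    so only the differing tail needs to be stored."""
    b1 = struct.unpack("<Q", struct.pack("<d", v1))[0]
    b2 = struct.unpack("<Q", struct.pack("<d", v2))[0]
    return b1 ^ b2

ts = [1681200000, 1681200010, 1681200020, 1681200030]
print(delta_of_delta(ts))         # [0, 0] -- a regular 10s interval
print(xor_bits(42.3, 42.3))       # 0 -- a repeated value costs ~1 bit
print(xor_bits(42.3, 38.7) != 0)  # True -- only the changed bits are stored
```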

Label Index (Inverted Index for Fast Queries)

Label -> Set of Series IDs

service="api-gateway"  -> {1, 5, 12, 45, 89, ...}
method="GET"           -> {1, 3, 5, 8, 12, ...}
status="200"           -> {1, 2, 5, 7, 12, ...}

Query: http.request.duration{service="api-gateway", method="GET"}
  Step 1: Find series matching metric name
    metric="http.request.duration" -> {1, 5, 12, 45, 67, 89}
  Step 2: Intersect with label posting lists
    AND service="api-gateway" -> {1, 5, 12, 45, 89}
    AND method="GET"          -> {1, 3, 5, 8, 12}
    Result: {1, 5, 12}
  Step 3: Fetch chunks for series 1, 5, 12 in time range
  Step 4: Decompress and aggregate (avg, sum, max, etc.)
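The lookup steps above reduce to set intersection over posting lists. A sketch using the hypothetical series IDs from the example:

```python
# Inverted index: (label, value) -> posting list of series IDs.
index = {
    ("metric", "http.request.duration"): {1, 5, 12, 45, 67, 89},
    ("service", "api-gateway"):          {1, 5, 12, 45, 89},
    ("method", "GET"):                   {1, 3, 5, 8, 12},
}

def lookup(index, matchers):
    """Intersect posting lists for all (label, value) matchers."""
    postings = [index.get(m, set()) for m in matchers]
    if not postings:
        return set()
    postings.sort(key=len)       # intersect smallest-first: cheaper
    result = postings[0]
    for p in postings[1:]:
        result = result & p
    return result

series = lookup(index, [("metric", "http.request.duration"),
                        ("service", "api-gateway"),
                        ("method", "GET")])
print(sorted(series))  # [1, 5, 12]
```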

Query Execution

Query: "avg:http.request.duration{service:api-gateway} by {host} [last 1 hour]"

Step 1: Parse query
  metric_name = "http.request.duration"
  filters = { "service": "api-gateway" }
  groupby = ["host"]
  aggregation = "avg"
  time_range = [now - 3600, now]

Step 2: Resolve time blocks
  Blocks covering last hour: block_2026041108, block_2026041110
  Plus the current in-memory head block

Step 3: Index lookup
  For each block, find series matching filters using inverted index.
  Returns: [series_001, series_002, series_003, ...]

Step 4: Read chunks
  For each matching series, read data points in time range.
  Decompress using delta-of-delta + XOR decoders.

Step 5: Aggregate
  Group by "host" tag value.
  Compute avg within each requested interval bucket (60s).

Step 6: Return results
  ~50ms for last-hour queries (mostly in-memory head block)
  ~200ms for last-day queries (1-minute rollups from disk)
  ~500ms for last-week queries (5-minute rollups)

7. Deep Dive: Aggregation Pipeline and Downsampling

Real-Time Aggregation

Ingestion -> Pre-aggregation -> Storage

Pre-aggregation at ingest time:
  For counter metrics:
    Compute rate (per-second) from raw counter values.
  
  For histogram metrics:
    Compute percentiles (p50, p95, p99) from histogram buckets.
  
  For all metrics:
    Compute 1-minute rollups: min, max, sum, count, avg

This reduces query-time computation significantly.
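The 1-minute rollup step can be sketched as follows, assuming raw (timestamp, value) points; avg is derived from sum/count at query time rather than stored:

```python
from collections import defaultdict

def rollup_1min(points):
    """Roll raw (timestamp, value) points into 1-minute buckets with
    min/max/sum/count."""
    buckets = defaultdict(lambda: {"min": float("inf"), "max": float("-inf"),
                                   "sum": 0.0, "count": 0})
    for ts, value in points:
        b = buckets[ts - ts % 60]     # align to the minute boundary
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(buckets)

points = [(1681200000, 42.3), (1681200010, 38.7), (1681200020, 52.1),
          (1681200060, 45.0)]
rolled = rollup_1min(points)
b = rolled[1681200000]
print(b["min"], b["max"], round(b["sum"] / b["count"], 2))  # 38.7 52.1 44.37
```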

Downsampling Pipeline

Full resolution (10s intervals) -> kept for 15 days
1-minute rollups               -> kept for 3 months
5-minute rollups               -> kept for 1 year
1-hour rollups                 -> kept for 3 years

Downsampling runs as a background job:

+--------------------+        +--------------------+
| Full-Res Block     |        | 1-Min Rollup Block |
| (2-hour block)     | -----> | (24-hour block)    |
| [10s data points]  |  agg   | [min,max,avg,count]|
+--------------------+        +--------------------+
                                       |
                              +--------v-----------+
                              | 5-Min Rollup Block |
                              | (7-day block)      |
                              +--------------------+
                                       |
                              +--------v-----------+
                              | 1-Hour Rollup Block|
                              | (30-day block)     |
                              +--------------------+

Query routing (automatic based on time range):
  "last 1 hour"   -> full resolution data
  "last 24 hours" -> 1-minute rollups
  "last 7 days"   -> 5-minute rollups
  "last 30 days"  -> 1-hour rollups

This keeps dashboards fast regardless of time range.
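The routing table above amounts to a small lookup function; the tier names here are illustrative:

```python
def pick_resolution(range_seconds):
    """Route a query to the coarsest tier in the routing table above
    that still covers the requested range."""
    if range_seconds <= 3600:             # up to 1 hour
        return "full"                     # full-resolution (10s) data
    if range_seconds <= 24 * 3600:        # up to 1 day
        return "1m"
    if range_seconds <= 7 * 24 * 3600:    # up to 1 week
        return "5m"
    return "1h"                           # anything longer

print(pick_resolution(3600))            # full
print(pick_resolution(6 * 3600))        # 1m
print(pick_resolution(30 * 24 * 3600))  # 1h
```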

Retention Enforcement

TTL-based deletion:
  - Background compactor checks block age
  - Blocks older than retention period are deleted
  - Deletion is block-level (not per-series) for efficiency

Storage budget with downsampling:
  15-day full-res:      ~65 TB  (compressed)
  3-month 1-min:        ~10 TB  (compressed)
  1-year 5-min:         ~5 TB   (compressed)
  3-year 1-hour:        ~1 TB   (compressed)
  
  Total: ~81 TB (vs thousands of TB without downsampling)

8. Deep Dive: Alerting Engine

Alert Rule Model

Alert rule:
  name:          "High API Latency"
  query:         "avg(last_5m):avg:http.request.duration{service:api-gateway}"
  thresholds:
    critical:    > 500ms
    warning:     > 200ms
  evaluation_window: 5 minutes
  notify:
    critical:    ["pagerduty:api-team", "slack:#alerts-critical"]
    warning:     ["slack:#alerts-warning"]
  no_data:       "alert" after 10 minutes
  recovery:      notify when value returns below threshold
  mute_schedule: "Saturday 00:00 - Monday 06:00"

Evaluation Architecture

+-------------------+
| Alert Rule Store  |    100,000 rules
| (PostgreSQL)      |
+--------+----------+
         |
+--------v----------+
| Alert Scheduler   |    Assigns rules to evaluator workers
+--------+----------+    via consistent hashing on rule_id
         |
    +----+----+----+----+
    |         |         |
+---v---+ +--v----+ +--v----+
| Eval  | | Eval  | | Eval  |    20 evaluator workers
| Worker| | Worker| | Worker|    Each handles ~5,000 rules
| #1    | | #2    | | #N    |
+---+---+ +--+----+ +--+----+
    |         |         |
    v         v         v
  Query TSDB for each rule
    |
+---v-----------+
| State Machine |    Track alert state transitions
| (Redis)       |    Key: alert_state:{rule_id}:{label_hash}
+---+-----------+
    |
+---v-----------+
| Notification  |    Route alerts to appropriate channels
| Router        |
+--+----+-------+
   |    |    |
 Email Slack PagerDuty

Alert State Machine

          value back within threshold
          +------------------------------------+
          |                                    |
          v                                    |
   +------+------+                      +------+------+
   |     OK      | --value > warning--> |   WARNING   |
   +------+------+                      +------+------+
          ^                                    |
          |                     value > critical
          |                             +------v------+
          +---------------------------- |  CRITICAL   |
     value < warning                    +------+------+
                                               |
                       no data for 10 minutes  |
                                        +------v------+
                                        |   NO DATA   |
                                        +-------------+

Hysteresis: Require threshold to be exceeded for N consecutive
evaluation cycles before transitioning (prevents flapping).

Example: "alert if avg latency > 500ms for 3 consecutive checks"
  Check 1: 520ms -> count=1 (not yet alerting)
  Check 2: 510ms -> count=2 (not yet alerting)
  Check 3: 530ms -> count=3 -> TRIGGER ALERT
  Check 4: 480ms -> count reset -> send RECOVERY notification
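The consecutive-check logic can be sketched as a tiny state holder (names hypothetical):

```python
class HysteresisAlert:
    """Trigger only after `required` consecutive breaches; recover on
    the first in-threshold evaluation, as in the 3-check example above."""

    def __init__(self, threshold, required=3):
        self.threshold = threshold
        self.required = required
        self.breaches = 0
        self.firing = False

    def evaluate(self, value):
        """Returns 'trigger', 'recover', or None for this check."""
        if value > self.threshold:
            self.breaches += 1
            if self.breaches >= self.required and not self.firing:
                self.firing = True
                return "trigger"
        else:
            self.breaches = 0          # any in-threshold check resets the count
            if self.firing:
                self.firing = False
                return "recover"
        return None

alert = HysteresisAlert(threshold=500, required=3)
print([alert.evaluate(v) for v in (520, 510, 530, 480)])
# [None, None, 'trigger', 'recover']
```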

Alert Deduplication and Grouping

Problem: 100 hosts all have high CPU simultaneously -> 100 separate alerts

Solution: Alert grouping by configurable keys
  - Group alerts by (service, alert_name) or (cluster, alert_name)
  - Send ONE notification for the group:
    "HIGH CPU: 100/120 hosts affected in production"
    "Services: api-gateway, order-service, payment-service"
  - Include summary with worst-offender list
  
Deduplication:
  - Same alert in same state: do not re-notify
  - Re-notify only on state transitions (OK -> WARN -> CRITICAL)
  - Reminder notifications: every 4 hours while in critical state
  - Configurable per-rule

Silencing:
  - Maintenance windows: silence all alerts for host X from 2AM-4AM
  - Incident-based: silence alert Y while incident Z is open
  - One-click mute from notification with auto-expire

9. Deep Dive: Anomaly Detection

Statistical Anomaly Detection

Instead of fixed thresholds ("alert if CPU > 80%"), detect anomalies
by learning the normal pattern and alerting on deviations.

Approach: Seasonal decomposition + standard deviation

For a metric like "request count per minute":
  1. Learn seasonal pattern:
     - Hourly pattern (traffic peaks at 2 PM, dips at 4 AM)
     - Daily pattern (weekdays vs weekends)
     - Weekly pattern
  
  2. For each time window, compute:
     expected_value = seasonal_model.predict(day_of_week, hour, minute)
     std_deviation  = rolling_std(metric, window=7_days)
     
  3. Anomaly score:
     z_score = (actual_value - expected_value) / std_deviation
     
  4. Alert if |z_score| > threshold:
     - z > 3.0: "Metric is significantly higher than expected"
     - z < -3.0: "Metric is significantly lower than expected"

Example:
  Normal request rate at Tuesday 2 PM: 5,000/min (std: 300)
  Current rate: 8,500/min
  z_score = (8500 - 5000) / 300 = 11.67  -> ANOMALY ALERT
  
  Same rate at Black Friday: expected 10,000/min (std: 1500)
  z_score = (8500 - 10000) / 1500 = -1.0  -> Normal (below expected)
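The z-score check in code, using the worked numbers from the example:

```python
def z_score(actual, expected, std):
    """Deviation of the observed value from the seasonal baseline,
    measured in standard deviations."""
    return (actual - expected) / std

def is_anomaly(actual, expected, std, threshold=3.0):
    """Flag values more than `threshold` standard deviations from baseline."""
    return abs(z_score(actual, expected, std)) > threshold

# Tuesday 2 PM baseline: 5,000/min expected, std 300
print(round(z_score(8500, 5000, 300), 2))    # 11.67 -> anomaly
print(is_anomaly(8500, 5000, 300))           # True
# Black Friday baseline: 10,000/min expected, std 1,500
print(round(z_score(8500, 10000, 1500), 2))  # -1.0 -> normal
print(is_anomaly(8500, 10000, 1500))         # False
```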

Anomaly Detection Pipeline

Metrics --> Seasonal Model (trained daily) --> Expected Values
                                                    |
Metrics --> Actual Values ---+                      |
                             |                      |
                     +-------v----------------------v------+
                     | Comparison Engine                   |
                     | z_score = (actual - expected) / std |
                     +-------+-----------------------------+
                             |
                     +-------v--------+
                     | Alert if       |
                     | |z| > threshold|
                     +----------------+

Model training:
  - Retrained daily at midnight using last 30 days of data
  - Separate model per metric series (each has its own pattern)
  - Lightweight: rolling mean + std per (day_of_week, hour) bucket
  - For 50M series: ~50M * 168 buckets * 16 bytes = ~134 GB model data
  - Stored in Redis for fast lookup during evaluation

10. Deep Dive: Log Ingestion and Indexing

Log Pipeline

Application --> Agent (local) --> Kafka --> Log Processor --> Storage

Log Processor:
  1. Parse structured fields (JSON, key-value, regex)
  2. Extract and index: timestamp, level, service, host, trace_id
  3. Enrich: add geo info from IP, resolve hostname
  4. Route: error logs to hot storage, debug logs to cold storage

Storage tiers:
  Hot (last 3 days):   Elasticsearch cluster (fast full-text search)
  Warm (3-30 days):    Compressed indices, read-only, cheaper SSDs
  Cold (30-90 days):   Object storage (S3) with minimal index
  Frozen (90+ days):   Archive to Glacier (compliance only)

Log Query Language

Examples:

  # Find all errors in payment service in last hour
  service:payment-service AND level:error | last 1h

  # Find slow requests with response time > 1 second
  service:api-gateway AND response_time_ms:>1000

  # Correlate logs with a specific trace
  trace_id:abc123

  # Pattern search with wildcards
  message:"connection refused*" AND env:production

  # Aggregate: count errors by service
  level:error | count by service | last 1h | sort desc

  # Find logs around a specific timestamp (context)
  service:payment-service | around 2026-04-11T10:05:00Z | window 5m

Log-Metric Correlation

When an alert fires on a metric (e.g., "error rate > 5%"):
  1. System automatically queries logs matching:
     service:{alerting_service} AND level:error
     AND timestamp:[alert_start - 5m, now]
  2. Groups log messages by pattern (log pattern mining)
  3. Shows top error messages in alert notification:
     "Top errors during alert:
      - 'Connection refused to payment-db' (452 occurrences)
      - 'Timeout calling fraud-engine' (128 occurrences)"
  4. Links to full trace for a sample error occurrence

11. Deep Dive: Distributed Tracing

Trace Data Model

Trace:
  trace_id: "abc123" (globally unique, propagated via headers)
  
  Spans (timeline view):
  +--[ api-gateway: POST /payments  (250ms) ]-------------------+
  |                                                              |
  |  +--[ auth-service: validate_token (20ms) ]--+               |
  |                                                              |
  |  +--[ fraud-engine: check_fraud (120ms) ]------+             |
  |  |                                              |            |
  |  |  +--[ ml-model: predict (80ms) ]--+          |            |
  |  |                                              |            |
  |  +--[ payment-service: charge (90ms) ]--+       |            |
  |  |                                      |       |            |
  |  |  +--[ visa-psp: authorize (60ms) ]--+|       |            |
  +--------------------------------------------------------------+

Header propagation (W3C Trace Context):
  traceparent: 00-abc123-span_1-01
  
  Service A calls Service B:
    1. A generates span for outgoing call
    2. A injects trace_id + span_id into HTTP headers
    3. B extracts trace context from headers
    4. B creates child span with parent = A's span_id
    5. B sends completed span to collector (agent -> Kafka)
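The inject/extract steps can be sketched as follows. Note that `00-abc123-span_1-01` above is shorthand: the actual W3C format uses a 32-hex-digit trace-id and a 16-hex-digit parent-id, which this sketch generates with `secrets`:

```python
import secrets

def make_traceparent(trace_id, span_id, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def parse_traceparent(header):
    """Extract trace context from an incoming traceparent header."""
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id,
            "sampled": flags == "01"}

# Service A: start a trace, inject the context into outgoing headers.
trace_id = secrets.token_hex(16)   # 128-bit trace ID (32 hex digits)
span_a = secrets.token_hex(8)      # 64-bit span ID (16 hex digits)
headers = {"traceparent": make_traceparent(trace_id, span_a)}

# Service B: extract the context and create a child span.
ctx = parse_traceparent(headers["traceparent"])
child_span = {"trace_id": ctx["trace_id"],
              "span_id": secrets.token_hex(8),
              "parent_span_id": ctx["parent_span_id"]}
print(child_span["trace_id"] == trace_id)  # True: same trace across services
```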

Trace Sampling

At 100,000 traces/sec, storing everything is expensive (~20 TB/day).

Sampling strategies:

1. Head-based sampling (decision at trace start):
   - Sample 10% of all traces randomly
   - Always sample traces from high-priority services
   - Simple, but misses rare interesting traces
   
2. Tail-based sampling (decision after trace completes):
   - Buffer traces in memory for 30 seconds
   - After all spans arrive, decide whether to keep:
     a. Slow traces (duration > p99 for that operation)
     b. Error traces (any span has error status)
     c. High-value traces (specific tag values, e.g., premium customer)
     d. Random 1% for baseline
   - Discard the rest

   Buffer requirement: 100K traces/sec * 30 sec * 2.4 KB = ~7.2 GB
   
3. Our choice: Tail-based sampling
   - Captures the most diagnostically useful traces
   - Keeps storage manageable (5-10% of all traces stored)
   - Buffer is held in a streaming Kafka consumer group
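The tail-sampling decision on an assembled trace can be sketched as follows; the `customer_tier` tag and the assumption that the first span is the root are illustrative:

```python
import random

def keep_trace(trace, p99_by_op, baseline_rate=0.01):
    """Tail-sampling decision on a completed trace, per criteria (a)-(d):
    keep slow traces, errored traces, tagged high-value traces, plus a
    small random baseline."""
    spans = trace["spans"]
    duration = (max(s["start_time"] + s["duration_ms"] for s in spans)
                - min(s["start_time"] for s in spans))
    root_op = spans[0]["operation"]          # assumes first span is the root
    if duration > p99_by_op.get(root_op, float("inf")):
        return True                          # (a) slow trace
    if any(s.get("status") == "error" for s in spans):
        return True                          # (b) errored trace
    if any(s.get("tags", {}).get("customer_tier") == "premium" for s in spans):
        return True                          # (c) high-value (hypothetical tag)
    return random.random() < baseline_rate   # (d) random baseline

trace = {"spans": [
    {"operation": "POST /payments", "start_time": 0, "duration_ms": 245,
     "status": "ok"},
    {"operation": "fraud_check", "start_time": 50, "duration_ms": 120,
     "status": "error"},
]}
print(keep_trace(trace, p99_by_op={"POST /payments": 400}))  # True (error span)
```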

Trace Storage

Write path:
  1. Spans arrive via Kafka (from each service's agent)
  2. Trace assembler groups spans by trace_id
     (waits up to 30 seconds for all spans to arrive)
  3. Sampling decision made on complete trace
  4. Sampled traces stored:
     - Recent (7 days): Elasticsearch or Grafana Tempo
     - Long-term: Object storage (S3) with index in DynamoDB
  
Queryable by:
  - trace_id (primary lookup)
  - service name + operation name
  - duration range (find slow traces)
  - error status
  - timestamp range
  - custom tag values

12. Scaling Considerations

Metrics Ingestion Scaling

10 million data points/sec across the cluster:

Ingestion Gateway: 50 instances (200K points/sec each)
Kafka:            24 brokers, 128 partitions for metrics topic
TSDB Writers:     40 instances (250K points/sec each)

Each TSDB writer:
  - Maintains in-memory head block for recent data (~4 GB RAM)
  - Flushes 2-hour blocks to object storage
  - Handles ~1.25 million active series

Scaling trigger: when any writer exceeds 80% memory, add writers
and rebalance series assignment using consistent hashing.

High Cardinality Problem

Problem: A metric with high-cardinality tags creates millions of series.

Example:
  http.request.duration{user_id=<unique_per_user>}
  If 10M users -> this ONE metric creates 10M series.

Impact:
  - Index bloats (series lookup table grows)
  - Memory usage spikes (each active series needs ~4 KB)
  - Query performance degrades (scanning millions of series)

Mitigations:
  1. Reject metrics exceeding cardinality limit:
     If metric_name has > 100K unique tag combinations, reject with error.
     
  2. Pre-aggregation:
     Aggregate away high-cardinality tags at ingest time.
     Instead of per-user metrics, emit percentiles (p50, p95, p99).
     
  3. Separate storage:
     Route high-cardinality metrics to a dedicated cluster with
     different retention/query characteristics.
     
  4. Tag value limits:
     Cap unique values per tag key at 10,000.
     Emit a cardinality warning metric for operators.
     
  5. Usage tracking:
     Dashboard showing top-10 metrics by cardinality.
     Alert when a metric's cardinality grows > 50%/day.
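Mitigation 1 (the cardinality cap) can be sketched as an ingest-time guard; the limit and names are illustrative:

```python
class CardinalityGuard:
    """Track unique tag combinations per metric and reject points that
    would create new series past the cap."""

    def __init__(self, max_series_per_metric=100_000):
        self.limit = max_series_per_metric
        self.series = {}   # metric_name -> set of sorted tag tuples

    def admit(self, metric, tags):
        """Return True if the data point may be ingested."""
        key = tuple(sorted(tags))
        seen = self.series.setdefault(metric, set())
        if key in seen:
            return True                # existing series: always accept
        if len(seen) >= self.limit:
            return False               # would create a series past the cap
        seen.add(key)
        return True

guard = CardinalityGuard(max_series_per_metric=2)
print(guard.admit("http.request.duration", ["host:web-01"]))  # True
print(guard.admit("http.request.duration", ["host:web-02"]))  # True
print(guard.admit("http.request.duration", ["host:web-03"]))  # False (cap hit)
print(guard.admit("http.request.duration", ["host:web-01"]))  # True (existing)
```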

TSDB Sharding

Strategy: Shard by series_id (hash of metric_name + sorted tags)

Series Assignment:
  series_id = hash(metric_name + sorted_tags)
  shard = series_id % num_shards

+--------------------------------------------------+
| Series Ring (consistent hashing with vnodes)     |
|                                                  |
| Shard 0: series hash range [0, 16383]            |
| Shard 1: series hash range [16384, 32767]        |
| ...                                              |
| Shard 15: series hash range [245760, 262143]     |
|                                                  |
| Each shard: 1 writer + 2 read replicas           |
+--------------------------------------------------+

Query fanout:
  - Metric-specific queries: route to shard(s) containing that metric
  - Wildcard queries: fan out to all shards, merge results
  - Parallel execution keeps latency low (~50ms even with 16-shard fanout)

Geographic Distribution

+------------------+     +------------------+     +------------------+
|  US Region       |     |  EU Region       |     |  Asia Region     |
|  - Full stack    |     |  - Full stack    |     |  - Full stack    |
|  - Local TSDB    |     |  - Local TSDB    |     |  - Local TSDB    |
|  - Local logs    |     |  - Local logs    |     |  - Local logs    |
+--------+---------+     +--------+---------+     +--------+---------+
         |                        |                        |
         +------------------------+------------------------+
                          |
                  +-------v-------+
                  | Global Query  |     Cross-region queries
                  | Aggregator    |     for global dashboards
                  +---------------+

Metrics are ingested locally in each region.
Cross-region queries fan out to all regional TSDB clusters and merge.
Alerting runs locally per region (low latency).

13. Key Tradeoffs

Decision            Option A                   Option B                  Our Choice
-----------------------------------------------------------------------------------
Collection model    Push only                  Pull only                 Hybrid
TSDB                Prometheus (single node)   Custom distributed TSDB   Custom TSDB
Log storage         Elasticsearch (flexible)   ClickHouse (fast agg.)    Elasticsearch
Trace storage       Jaeger + Cassandra         Tempo + Object store      Tempo + S3
Sampling            Head-based (simple)        Tail-based (smart)        Tail-based
Alert evaluation    Push (stream processing)   Pull (periodic query)     Periodic query
Compression         General (LZ4)              Time-series specific      TS-specific
Retention           Keep all data forever      Downsample old data       Downsample
High cardinality    Accept all                 Reject + aggregate        Reject + limit
Anomaly detection   Fixed thresholds only      ML-based baselines        Both

14. Failure Scenarios and Mitigations

Scenario                             Mitigation
---------------------------------------------------------------------------
Ingestion gateway overload           Backpressure to agents; agents buffer to disk
Kafka broker failure                 Replication factor 3; auto-leader election
TSDB writer crash                    WAL replay on restart; replicas serve reads
Alert evaluator down                 Consistent hashing re-assigns rules in <30s;
                                     alert state stored in Redis (not local memory)
Elasticsearch cluster full           Automated index rotation; cold tier rollover
Query timeout (large time range)     Auto-downsample; query timeout with partial results
High cardinality explosion           Automatic detection + rejection; alert metric owner
Network partition (cross-region)     Each region operates independently; merge when healed
Dashboard service outage             Static HTML fallback with last-known snapshot
False positive alert storm           Alert grouping + rate limiting notifications
Agent crash on host                  Systemd auto-restart; disk buffer survives restart
Clock skew between hosts             NTP required; accept data within 5-minute window
Anomaly model stale                  Retrain daily; fall back to fixed thresholds if
                                     model is > 48 hours old

Key Takeaways

  1. Time-series databases require specialized compression (delta-of-delta for timestamps, XOR/Gorilla for values) to achieve 10x+ compression ratios -- generic databases cannot handle 43 TB/day ingestion cost-effectively.
  2. Downsampling with retention tiers is essential -- keeping full-resolution data forever is prohibitively expensive; the total storage drops from thousands of TB to ~81 TB with a tiered approach.
  3. High cardinality is the number-one operational challenge in monitoring systems -- a single bad metric with per-user tags can overwhelm the entire system; proactive detection and limits are mandatory.
  4. Alerting requires hysteresis, deduplication, and grouping to avoid alert fatigue -- raw threshold checks generate too many false positives and flapping.
  5. Anomaly detection based on seasonal baselines catches issues that fixed thresholds miss (traffic that is normal at 2 PM but abnormal at 4 AM).
  6. Push and pull collection models serve different use cases -- supporting both maximizes compatibility while a unified Kafka + TSDB backend keeps the architecture simple.
  7. Tail-based trace sampling captures the most diagnostically valuable traces (slow, errored) while keeping storage costs manageable at scale.