Episode 9 — System Design / 9.11 — Real World System Design Problems
9.11.1 Design a Monitoring and Alerting System (Datadog / Prometheus)
Problem Statement
Design a monitoring and alerting system like Datadog or Prometheus that collects metrics, logs, and traces from distributed services. The system must support real-time dashboards, configurable alerting, anomaly detection, and handle millions of data points per second from thousands of services.
1. Requirements
Functional Requirements
- Collect infrastructure metrics (CPU, memory, disk, network)
- Collect application metrics (request count, latency, error rate)
- Collect custom business metrics (orders/sec, revenue, signups)
- Ingest and index structured logs
- Collect distributed traces (spans across services)
- Real-time dashboards with custom queries and visualizations
- Configurable alerting rules (threshold, anomaly, composite)
- Alert routing to multiple channels (email, Slack, PagerDuty)
- Tag-based filtering and grouping (host, service, environment)
- Retention policies with downsampling for historical data
- Anomaly detection with automatic baseline learning
- API for programmatic metric submission and querying
Non-Functional Requirements
- Ingest 10 million metric data points per second
- Ingest 500,000 log lines per second
- Dashboard query latency < 1 second for last-hour queries
- Alert evaluation latency < 30 seconds
- Support 100,000 unique metric names with high cardinality tags
- 99.9% availability for ingestion; agents buffer at the edge so metrics are not lost during outages
- Retain full-resolution data for 15 days, downsampled for 1 year
- Horizontally scalable as monitored fleet grows
2. Capacity Estimation
Metrics
Data points per second: 10 million
Data point size: ~50 bytes (metric_name, tags, value, timestamp)
Daily ingestion: 10M/sec * 86,400 * 50 bytes = 43.2 TB/day
Unique time series: 50 million active series
(metric_name + unique tag combination = 1 series)
Storage (full resolution, 15 days):
43.2 TB/day * 15 = 648 TB raw
With compression (~10x for time-series): ~65 TB
Storage (downsampled 1-minute, 1 year):
10M/sec compressed to ~166K/sec (1-min rollup)
166K * 20 bytes * 86,400 * 365 ~= 105 TB raw
Compressed: ~10 TB
Logs
Log lines per second: 500,000
Average log line: 500 bytes
Daily log volume: 500K/sec * 86,400 * 500 bytes = 21.6 TB/day
Retention (30 days): 648 TB raw (with compression ~65 TB)
Traces
Traces per second: 100,000
Average spans per trace: 8
Span size: 300 bytes
Daily trace volume: 100K * 8 * 300 * 86,400 = 20.7 TB/day
Retention (7 days): 145 TB raw (compressed ~15 TB)
Query Volume
Dashboard queries/sec: 5,000
Alert rule evaluations: 100,000 rules * 1 eval/30 sec = 3,333 evals/sec
API queries/sec: 2,000
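These figures are direct arithmetic; a quick sketch to sanity-check them (1 TB = 10^12 bytes):

```python
def tb_per_day(items_per_sec: float, bytes_per_item: float) -> float:
    """Daily ingestion volume in terabytes."""
    return items_per_sec * 86_400 * bytes_per_item / 1e12

metrics = tb_per_day(10_000_000, 50)     # metric points
logs    = tb_per_day(500_000, 500)       # log lines
traces  = tb_per_day(100_000 * 8, 300)   # 8 spans per trace

print(round(metrics, 1), round(logs, 1), round(traces, 1))
# 43.2 21.6 20.7

# 15-day full-resolution metric retention, then ~10x compression:
print(round(metrics * 15), round(metrics * 15 / 10))
# 648 65
```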
3. High-Level Architecture
+------------------+ +------------------+ +------------------+
| Application | | Infrastructure | | Custom Metrics |
| (instrumented) | | (agents on hosts)| | (StatsD/API) |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
v v v
+--------+------------------------+------------------------+---------+
| Ingestion Gateway |
| (Load Balanced, Protocol Translation) |
+--------+----------+----------+-----------+----------+--------------+
| | | | |
+----v---+ +---v----+ +---v----+ +----v---+ +---v----+
|Metrics | |Metrics | |Log | |Log | |Trace |
|Ingestor| |Ingestor| |Ingestor| |Ingestor| |Ingestor|
| #1 | | #N | | #1 | | #N | | #1..N |
+---+----+ +---+----+ +---+----+ +---+----+ +---+----+
| | | | |
+----v----------v---+ +---v-----------v-+ +-----v--------+
| Metrics Kafka | | Logs Kafka | | Traces Kafka |
| Topic | | Topic | | Topic |
+--------+----------+ +--------+--------+ +------+-------+
| | |
+--------v--------+ +--------v--------+ +------v--------+
| Metrics | | Log Indexer | | Trace Indexer |
| Aggregator & | | (Elasticsearch | | (Jaeger/ |
| Writer | | / ClickHouse) | | Tempo) |
+---------+-------+ +--------+--------+ +------+--------+
| | |
+---------v--------+ +-------v--------+ +-----v---------+
| Time-Series DB | | Log Store | | Trace Store |
| (custom TSDB or | | (compressed | | (object store |
| InfluxDB/Mimir) | | indices) | | + index) |
+------------------+ +----------------+ +---------------+
| | |
+---------v--------------------v-----------------v--------+
| Query Service |
+---------+--------------------+----------+---------------+
| | |
+---------v-------+ +--------v-----+ +--v--------------+
| Dashboard | | Alerting | | API |
| Service | | Engine | | (Programmatic) |
+-----------------+ +------+-------+ +-----------------+
|
+------v-------+
| Notification |
| Router |
+--+-----------+
| | |
Email Slack PagerDuty
4. API Design
Metric Submission
POST /api/v1/metrics
Headers: Authorization: Bearer <api_key>
Body: {
"series": [
{
"metric": "http.request.duration",
"type": "gauge", // gauge, counter, histogram
"points": [
[1681200000, 45.2], // [timestamp, value]
[1681200010, 38.7],
[1681200020, 52.1]
],
"tags": ["service:api-gateway", "env:production", "host:web-01"]
},
{
"metric": "http.request.count",
"type": "counter",
"points": [[1681200000, 15234]],
"tags": ["service:api-gateway", "env:production", "method:GET"]
}
]
}
Response 202: { "status": "accepted" }
POST /api/v1/logs
Headers: Authorization: Bearer <api_key>
Body: [
{
"message": "Payment processed successfully for order 12345",
"level": "info",
"service": "payment-service",
"host": "pay-01",
"timestamp": 1681200000000,
"attributes": { "order_id": "12345", "amount": 49.99, "trace_id": "abc123" }
}
]
Response 202: { "status": "accepted" }
POST /api/v1/traces
Headers: Authorization: Bearer <api_key>
Body: {
"spans": [
{
"trace_id": "abc123",
"span_id": "span_1",
"parent_span_id": null,
"operation": "POST /api/v1/payments",
"service": "api-gateway",
"start_time": 1681200000000,
"duration_ms": 245,
"status": "ok",
"tags": { "http.method": "POST", "http.status_code": 201 }
},
{
"trace_id": "abc123",
"span_id": "span_2",
"parent_span_id": "span_1",
"operation": "fraud_check",
"service": "fraud-engine",
"start_time": 1681200000050,
"duration_ms": 120,
"status": "ok",
"tags": { "risk_score": 0.1 }
}
]
}
Response 202: { "status": "accepted" }
Metric Querying
POST /api/v1/query
Headers: Authorization: Bearer <api_key>
Body: {
"query": "avg:http.request.duration{service:api-gateway,env:production} by {host}",
"from": 1681196400, // 1 hour ago
"to": 1681200000,
"interval": 60 // 60-second buckets
}
Response 200: {
"series": [
{
"tags": { "host": "web-01" },
"points": [[1681196400, 42.3], [1681196460, 44.1], ...]
},
{
"tags": { "host": "web-02" },
"points": [[1681196400, 39.8], [1681196460, 41.2], ...]
}
]
}
Alert Management
POST /api/v1/alerts
Body: {
"name": "High Error Rate",
"query": "avg(last_5m):sum:http.errors{service:api-gateway} / sum:http.requests{service:api-gateway} > 0.05",
"message": "Error rate exceeded 5% for API Gateway",
"notify": ["slack:#alerts-prod", "pagerduty:api-team"],
"thresholds": {
"critical": 0.05,
"warning": 0.02
},
"evaluation_window": "5m",
"no_data_behavior": "alert"
}
Response 201: { "alert_id": "alert_002" }
GET /api/v1/alerts
Response 200: {
"alerts": [
{
"alert_id": "alert_001",
"name": "High API Latency",
"status": "triggered", // ok|triggered|no_data
"query": "avg(last_5m):avg:http.request.duration{service:api-gateway} > 500",
"triggered_at": "2026-04-11T10:05:00Z",
"value": 623.4
}
]
}
5. Deep Dive: Metrics Collection (Push vs Pull)
Push Model (StatsD / Datadog Agent)
Application --push--> Agent (local) --push--> Ingestion Gateway
Flow:
1. Application code instruments metrics:
statsd.increment("api.request.count", tags=["method:GET"])
statsd.histogram("api.request.duration", 45.2)
2. Local agent on same host:
- Receives UDP datagrams from application
- Aggregates locally (10-second flush interval)
- Batches and compresses
- Pushes to ingestion gateway over HTTPS
3. Ingestion gateway:
- Validates API key
- Writes to Kafka topic
Advantages:
+ Application controls when to emit
+ Works behind firewalls (outbound only)
+ Local agent pre-aggregates (reduces network traffic)
+ Better for ephemeral/serverless workloads (short-lived processes)
Disadvantages:
- Requires agent installation on every host
- Agent failure = metric loss (mitigated by local disk buffer)
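The local agent's pre-aggregation step can be sketched as follows. This is a simplified model of a StatsD-style agent, not the actual Datadog implementation; the `LocalAgent` class and its payload shape are illustrative:

```python
import time
from collections import defaultdict

class LocalAgent:
    """Aggregates counter increments locally and flushes one point
    per (metric, sorted tag set) each flush interval."""

    def __init__(self, flush_interval_s: int = 10):
        self.flush_interval_s = flush_interval_s
        self.counters = defaultdict(float)

    def increment(self, metric: str, value: float = 1, tags=()):
        # One time series = metric name + sorted tag combination
        self.counters[(metric, tuple(sorted(tags)))] += value

    def flush(self):
        """Build one batched payload for the ingestion gateway, then reset."""
        now = int(time.time())
        payload = [
            {"metric": m, "points": [[now, v]], "tags": list(tags)}
            for (m, tags), v in self.counters.items()
        ]
        self.counters.clear()
        return payload

agent = LocalAgent()
for _ in range(3):
    agent.increment("api.request.count", tags=["method:GET"])
batch = agent.flush()   # one aggregated point (value 3.0), not three points
```

Three UDP-style increments become a single data point on the wire, which is the network-traffic reduction claimed above.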
Pull Model (Prometheus)
Monitoring Server --pull--> Application /metrics endpoint
Flow:
1. Application exposes metrics at HTTP endpoint:
GET /metrics
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 15234
http_requests_total{method="POST",status="201"} 892
# HELP http_request_duration_seconds Request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 8923
http_request_duration_seconds_bucket{le="0.1"} 12345
http_request_duration_seconds_bucket{le="0.5"} 15000
2. Prometheus server scrapes every 15 seconds:
- Service discovery finds all targets
- Scrapes /metrics from each target
- Stores in local TSDB
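A scrape response in the Prometheus text exposition format parses with a few lines. A minimal sketch that skips HELP/TYPE metadata and does not handle escaped commas inside label values:

```python
import re

SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+([0-9.eE+-]+)$')

def parse_metrics(text: str):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        name, raw_labels, value = m.groups()
        labels = dict(kv.split("=", 1)
                      for kv in raw_labels.split(",")) if raw_labels else {}
        labels = {k: v.strip('"') for k, v in labels.items()}
        samples.append((name, labels, float(value)))
    return samples

body = '''
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 15234
http_requests_total{method="POST",status="201"} 892
'''
samples = parse_metrics(body)
# samples[0] -> ('http_requests_total', {'method': 'GET', 'status': '200'}, 15234.0)
```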
Advantages:
+ Simple to debug (curl the endpoint to see current metrics)
+ Central control over scrape frequency
+ Can detect if target is down (failed scrape = "up" metric = 0)
+ No agent needed
Disadvantages:
- Requires network access TO targets (firewall issues)
- Hard to scale past ~10M series per Prometheus instance
- Not ideal for short-lived processes (may never be scraped)
Our Hybrid Approach
+------------------+ +------------------+
| Push Path | | Pull Path |
| (agents, SDKs) | | (Prometheus compat)|
+--------+---------+ +--------+---------+
| |
v v
+--------+----------+ +-------------+---------+
| Ingestion Gateway | | Scrape Manager |
| (receives pushes) | | (discovers & scrapes) |
+--------+----------+ +-------------+---------+
| |
+-------------------+-------------------+
|
+--------v--------+
| Kafka (unified) |
+--------+--------+
|
+--------v--------+
| TSDB Writer |
+-----------------+
Decision: Support BOTH models. Push for custom/business metrics and
ephemeral workloads. Pull for infrastructure metrics and Prometheus
ecosystem compatibility. Both converge into the same Kafka + TSDB pipeline.
6. Deep Dive: Time-Series Database
Data Model
A time series is uniquely identified by:
metric_name + sorted_tag_set
Example:
http.request.duration{host=web-01,service=api,env=prod} <- series ID
Timestamp | Value
--------------|-------
1681200000 | 42.3
1681200010 | 38.7
1681200020 | 52.1
1681200030 | 45.0
...
Storage Architecture (LSM-based TSDB)
Write Path:
1. Incoming data point -> Write-Ahead Log (WAL) for durability
2. Buffer in in-memory table (memtable / head block)
3. When memtable reaches threshold (e.g., 2-hour time window):
- Flush to immutable block on disk
- Organized by time block (2-hour blocks)
Disk Layout:
/data/
block_2026041100/ (2026-04-11 00:00 - 02:00)
index (series_id -> chunk offsets)
chunks/
000001 (compressed data points)
000002
meta.json
block_2026041102/ (2026-04-11 02:00 - 04:00)
...
wal/
segment-000001 (write-ahead log for current block)
segment-000002
Compression (Critical for Cost)
Timestamps: Delta-of-delta encoding
Raw: [1681200000, 1681200010, 1681200020, 1681200030]
Delta: [10, 10, 10]
DoD: [0, 0] -> Compresses to nearly nothing
for regular intervals
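The timestamp encoding above can be sketched directly. This shows the logical transform only; the real Gorilla format bit-packs the delta-of-deltas into variable-width fields:

```python
def delta_of_delta(timestamps):
    """Encode as [first value, first delta, delta-of-deltas...].
    Regular intervals produce runs of zeros, which pack to ~1 bit each."""
    if len(timestamps) < 2:
        return list(timestamps)
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dod = [b - a for a, b in zip(deltas, deltas[1:])]
    return [timestamps[0], deltas[0]] + dod

def decode_dod(encoded):
    """Inverse of delta_of_delta."""
    if len(encoded) < 2:
        return list(encoded)
    ts = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for dod in encoded[2:]:
        delta += dod
        ts.append(ts[-1] + delta)
    return ts

ts = [1681200000, 1681200010, 1681200020, 1681200030]
enc = delta_of_delta(ts)       # [1681200000, 10, 0, 0]
assert decode_dod(enc) == ts   # lossless round trip
```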
Values: XOR encoding (Gorilla compression, from Facebook's paper)
Consecutive float values often share many bits.
XOR with previous value; encode only the differing bits.
Example:
v1 = 42.3 (IEEE 754: 0100000001000101001100110011...)
v2 = 38.7 (IEEE 754: 0100000001000011010110011001...)
XOR: 0000000000001110011010101010...
-> Only store: leading zeros count + significant bits
Typical compression: ~1.37 bytes per data point (the Gorilla paper's figure)
Overall compression ratio: ~12x
Raw: 50 bytes per point
Compressed: ~4 bytes per point
At 43.2 TB/day raw, this means ~3.6 TB/day compressed.
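The value side can be sketched the same way. This shows only the XOR and the leading/trailing-zero bookkeeping; a real Gorilla encoder then emits control bits plus only the meaningful middle bits:

```python
import struct

def float_bits(x: float) -> int:
    """64-bit IEEE 754 pattern of a double."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_info(prev: float, cur: float):
    """XOR consecutive values. Identical values XOR to 0 (one control
    bit in Gorilla); otherwise count leading/trailing zeros so only
    the differing middle bits need storing."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 0, 0, 0
    leading = 64 - x.bit_length()            # shared sign/exponent/top bits
    trailing = (x & -x).bit_length() - 1     # shared low mantissa bits
    meaningful = 64 - leading - trailing     # bits that must be stored
    return leading, trailing, meaningful

# Identical consecutive values compress to a single bit:
assert xor_info(42.3, 42.3) == (0, 0, 0)
# Nearby values share sign + exponent, so at least 12 leading bits match:
lead, trail, bits = xor_info(42.3, 38.7)
print(lead, trail, bits)
```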
Label Index (Inverted Index for Fast Queries)
Label -> Set of Series IDs
service="api-gateway" -> {1, 5, 12, 45, 89, ...}
method="GET" -> {1, 3, 5, 8, 12, ...}
status="200" -> {1, 2, 5, 7, 12, ...}
Query: http.request.duration{service="api-gateway", method="GET"}
Step 1: Find series matching metric name
metric="http.request.duration" -> {1, 5, 12, 45, 67, 89}
Step 2: Intersect with label posting lists
AND service="api-gateway" -> {1, 5, 12, 45, 89}
AND method="GET" -> {1, 3, 5, 8, 12}
Result: {1, 5, 12}
Step 3: Fetch chunks for series 1, 5, 12 in time range
Step 4: Decompress and aggregate (avg, sum, max, etc.)
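Steps 1 and 2 are set intersections over posting lists; a sketch using the posting lists from the example above:

```python
def resolve_series(index: dict, metric: str, filters: dict) -> set:
    """Intersect the metric's posting list with one list per label filter."""
    series = set(index.get(("__name__", metric), ()))
    for key, value in filters.items():
        series &= index.get((key, value), set())
    return series

index = {
    ("__name__", "http.request.duration"): {1, 5, 12, 45, 67, 89},
    ("service", "api-gateway"):            {1, 5, 12, 45, 89},
    ("method", "GET"):                     {1, 3, 5, 8, 12},
}
matched = resolve_series(index, "http.request.duration",
                         {"service": "api-gateway", "method": "GET"})
print(sorted(matched))
# [1, 5, 12]
```

Only series 1, 5, and 12 survive all three intersections, so only their chunks are fetched and decompressed.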
Query Execution
Query: "avg:http.request.duration{service:api-gateway} by {host} [last 1 hour]"
Step 1: Parse query
metric_name = "http.request.duration"
filters = { "service": "api-gateway" }
groupby = ["host"]
aggregation = "avg"
time_range = [now - 3600, now]
Step 2: Resolve time blocks
Blocks covering last hour: block_2026041108, block_2026041110
Plus the current in-memory head block
Step 3: Index lookup
For each block, find series matching filters using inverted index.
Returns: [series_001, series_002, series_003, ...]
Step 4: Read chunks
For each matching series, read data points in time range.
Decompress using delta-of-delta + XOR decoders.
Step 5: Aggregate
Group by "host" tag value.
Compute avg within each requested interval bucket (60s).
Step 6: Return results
~50ms for last-hour queries (mostly in-memory head block)
~200ms for last-day queries (1-minute rollups from disk)
~500ms for last-week queries (5-minute rollups)
7. Deep Dive: Aggregation Pipeline and Downsampling
Real-Time Aggregation
Ingestion -> Pre-aggregation -> Storage
Pre-aggregation at ingest time:
For counter metrics:
Compute rate (per-second) from raw counter values.
For histogram metrics:
Compute percentiles (p50, p95, p99) from histogram buckets.
For all metrics:
Compute 1-minute rollups: min, max, sum, count, avg
This reduces query-time computation significantly.
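The 1-minute rollup can be sketched as:

```python
from collections import defaultdict

def rollup_1m(points):
    """points: [(timestamp, value)] -> per-minute min/max/sum/count/avg."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 60].append(value)   # floor to minute boundary
    return {
        minute: {
            "min": min(vals), "max": max(vals),
            "sum": sum(vals), "count": len(vals),
            "avg": sum(vals) / len(vals),
        }
        for minute, vals in buckets.items()
    }

raw = [(1681200000, 42.3), (1681200010, 38.7), (1681200020, 52.1),
       (1681200060, 45.0)]
out = rollup_1m(raw)
# First three 10s points collapse into the 1681200000 bucket;
# the fourth lands in the next minute.
```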
Downsampling Pipeline
Full resolution (10s intervals) -> kept for 15 days
1-minute rollups -> kept for 3 months
5-minute rollups -> kept for 1 year
1-hour rollups -> kept for 3 years
Downsampling runs as a background job:
+--------------------+ +--------------------+
| Full-Res Block | | 1-Min Rollup Block |
| (2-hour block) | -----> | (24-hour block) |
| [10s data points] | agg | [min,max,avg,count]|
+--------------------+ +--------------------+
|
+--------v-----------+
| 5-Min Rollup Block |
| (7-day block) |
+--------------------+
|
+--------v-----------+
| 1-Hour Rollup Block|
| (30-day block) |
+--------------------+
Query routing (automatic based on time range):
"last 1 hour" -> full resolution data
"last 24 hours" -> 1-minute rollups
"last 7 days" -> 5-minute rollups
"last 30 days" -> 1-hour rollups
This keeps dashboards fast regardless of time range.
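The routing rule is a lookup on the query's time span; a sketch using the boundaries from the table above:

```python
def pick_resolution(range_seconds: int) -> str:
    """Choose the coarsest rollup that still covers the requested range."""
    if range_seconds <= 3_600:
        return "full"   # 10s points
    if range_seconds <= 86_400:
        return "1m"
    if range_seconds <= 7 * 86_400:
        return "5m"
    return "1h"

assert pick_resolution(3_600) == "full"        # last hour
assert pick_resolution(86_400) == "1m"         # last day
assert pick_resolution(30 * 86_400) == "1h"    # last month
```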
Retention Enforcement
TTL-based deletion:
- Background compactor checks block age
- Blocks older than retention period are deleted
- Deletion is block-level (not per-series) for efficiency
Storage budget with downsampling:
15-day full-res: ~65 TB (compressed)
3-month 1-min: ~10 TB (compressed)
1-year 5-min: ~5 TB (compressed)
3-year 1-hour: ~1 TB (compressed)
Total: ~81 TB (vs thousands of TB without downsampling)
8. Deep Dive: Alerting Engine
Alert Rule Model
Alert rule:
name: "High API Latency"
query: "avg(last_5m):avg:http.request.duration{service:api-gateway}"
thresholds:
critical: > 500ms
warning: > 200ms
evaluation_window: 5 minutes
notify:
critical: ["pagerduty:api-team", "slack:#alerts-critical"]
warning: ["slack:#alerts-warning"]
no_data: "alert" after 10 minutes
recovery: notify when value returns below threshold
mute_schedule: "Saturday 00:00 - Monday 06:00"
Evaluation Architecture
+-------------------+
| Alert Rule Store | 100,000 rules
| (PostgreSQL) |
+--------+----------+
|
+--------v----------+
| Alert Scheduler | Assigns rules to evaluator workers
+--------+----------+ via consistent hashing on rule_id
|
+----+----+----+----+
| | |
+---v---+ +--v----+ +--v----+
| Eval | | Eval | | Eval | 20 evaluator workers
| Worker| | Worker| | Worker| Each handles ~5,000 rules
| #1 | | #2 | | #N |
+---+---+ +--+----+ +--+----+
| | |
v v v
Query TSDB for each rule
|
+---v-----------+
| State Machine | Track alert state transitions
| (Redis) | Key: alert_state:{rule_id}:{label_hash}
+---+-----------+
|
+---v-----------+
| Notification | Route alerts to appropriate channels
| Router |
+--+----+-------+
| | |
Email Slack PagerDuty
Alert State Machine
query returns data
within threshold
+-----------------------------+
| |
v |
+------+------+ +-------+------+
| OK | -------> | WARNING |
+------+------+ value > +-------+------+
^ warning |
| value > critical
| |
| +--------v------+
+------------------- | CRITICAL |
value < warning +-------+-------+
|
+-------v-------+
no data for | NO DATA |
10 minutes +---------------+
Hysteresis: Require threshold to be exceeded for N consecutive
evaluation cycles before transitioning (prevents flapping).
Example: "alert if avg latency > 500ms for 3 consecutive checks"
Check 1: 520ms -> count=1 (not yet alerting)
Check 2: 510ms -> count=2 (not yet alerting)
Check 3: 530ms -> count=3 -> TRIGGER ALERT
Check 4: 480ms -> count reset -> send RECOVERY notification
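The consecutive-check rule from the example can be sketched as a small counter (recovery handling is simplified to the first in-threshold check):

```python
class HysteresisAlert:
    """Trigger only after N consecutive breaches; recover when the
    value drops back under the threshold while firing."""

    def __init__(self, threshold, consecutive=3):
        self.threshold = threshold
        self.consecutive = consecutive
        self.count = 0
        self.firing = False

    def check(self, value):
        if value > self.threshold:
            self.count += 1
            if self.count >= self.consecutive and not self.firing:
                self.firing = True
                return "TRIGGER"
        else:
            self.count = 0          # any good check resets the streak
            if self.firing:
                self.firing = False
                return "RECOVERY"
        return None

alert = HysteresisAlert(threshold=500, consecutive=3)
events = [alert.check(v) for v in (520, 510, 530, 480)]
assert events == [None, None, "TRIGGER", "RECOVERY"]
```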
Alert Deduplication and Grouping
Problem: 100 hosts all have high CPU simultaneously -> 100 separate alerts
Solution: Alert grouping by configurable keys
- Group alerts by (service, alert_name) or (cluster, alert_name)
- Send ONE notification for the group:
"HIGH CPU: 100/120 hosts affected in production"
"Services: api-gateway, order-service, payment-service"
- Include summary with worst-offender list
Deduplication:
- Same alert in same state: do not re-notify
- Re-notify only on state transitions (OK -> WARN -> CRITICAL)
- Reminder notifications: every 4 hours while in critical state
- Configurable per-rule
Silencing:
- Maintenance windows: silence all alerts for host X from 2AM-4AM
- Incident-based: silence alert Y while incident Z is open
- One-click mute from notification with auto-expire
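The grouping step can be sketched as follows; the group key and notification format here are illustrative:

```python
from collections import defaultdict

def group_alerts(alerts, group_keys=("service", "alert_name")):
    """Collapse per-host alerts into one notification per group key."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a[k] for k in group_keys)].append(a)
    notifications = []
    for key, members in groups.items():
        hosts = sorted(a["host"] for a in members)
        notifications.append(
            f"{key[1]}: {len(hosts)} hosts affected "
            f"({', '.join(hosts[:3])}, ...)"
        )
    return notifications

# 100 hosts fire the same rule -> one message, not 100
alerts = [{"service": "api-gateway", "alert_name": "HIGH CPU",
           "host": f"web-{i:02d}"} for i in range(100)]
msgs = group_alerts(alerts)
assert len(msgs) == 1
```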
9. Deep Dive: Anomaly Detection
Statistical Anomaly Detection
Instead of fixed thresholds ("alert if CPU > 80%"), detect anomalies
by learning the normal pattern and alerting on deviations.
Approach: Seasonal decomposition + standard deviation
For a metric like "request count per minute":
1. Learn seasonal pattern:
- Hourly pattern (traffic peaks at 2 PM, dips at 4 AM)
- Daily pattern (weekdays vs weekends)
- Weekly pattern
2. For each time window, compute:
expected_value = seasonal_model.predict(day_of_week, hour, minute)
std_deviation = rolling_std(metric, window=7_days)
3. Anomaly score:
z_score = (actual_value - expected_value) / std_deviation
4. Alert if |z_score| > threshold:
- z > 3.0: "Metric is significantly higher than expected"
- z < -3.0: "Metric is significantly lower than expected"
Example:
Normal request rate at Tuesday 2 PM: 5,000/min (std: 300)
Current rate: 8,500/min
z_score = (8500 - 5000) / 300 = 11.67 -> ANOMALY ALERT
Same rate at Black Friday: expected 10,000/min (std: 1500)
z_score = (8500 - 10000) / 1500 = -1.0 -> Normal (below expected)
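Steps 3 and 4, using the numbers from the example:

```python
def anomaly_score(actual, expected, std):
    """z-score: how many standard deviations from the seasonal baseline."""
    return (actual - expected) / std

def is_anomaly(actual, expected, std, threshold=3.0):
    return abs(anomaly_score(actual, expected, std)) > threshold

# Seasonal model stub: (day_of_week, hour) bucket -> (mean, std)
model = {(1, 14): (5000.0, 300.0)}          # Tuesday 2 PM
expected, std = model[(1, 14)]

z = anomaly_score(8500, expected, std)
assert round(z, 2) == 11.67
assert is_anomaly(8500, expected, std)      # well above baseline -> alert

# Black Friday baseline is higher, so the same rate is normal:
assert not is_anomaly(8500, 10000, 1500)    # z = -1.0
```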
Anomaly Detection Pipeline
Metrics --> Seasonal Model (trained daily) --> Expected Values
|
Metrics --> Actual Values ---+ |
| |
+-------v----------------------v------+
| Comparison Engine |
| z_score = (actual - expected) / std |
+-------+-----------------------------+
|
+-------v--------+
| Alert if |
| |z| > threshold|
+----------------+
Model training:
- Retrained daily at midnight using last 30 days of data
- Separate model per metric series (each has its own pattern)
- Lightweight: rolling mean + std per (day_of_week, hour) bucket
- For 50M series: ~50M * 168 buckets * 16 bytes = ~134 GB model data
- Stored in Redis for fast lookup during evaluation
10. Deep Dive: Log Ingestion and Indexing
Log Pipeline
Application --> Agent (local) --> Kafka --> Log Processor --> Storage
Log Processor:
1. Parse structured fields (JSON, key-value, regex)
2. Extract and index: timestamp, level, service, host, trace_id
3. Enrich: add geo info from IP, resolve hostname
4. Route: error logs to hot storage, debug logs to cold storage
Storage tiers:
Hot (last 3 days): Elasticsearch cluster (fast full-text search)
Warm (3-30 days): Compressed indices, read-only, cheaper SSDs
Cold (30-90 days): Object storage (S3) with minimal index
Frozen (90+ days): Archive to Glacier (compliance only)
Log Query Language
Examples:
# Find all errors in payment service in last hour
service:payment-service AND level:error | last 1h
# Find slow requests with response time > 1 second
service:api-gateway AND response_time_ms:>1000
# Correlate logs with a specific trace
trace_id:abc123
# Pattern search with wildcards
message:"connection refused*" AND env:production
# Aggregate: count errors by service
level:error | count by service | last 1h | sort desc
# Find logs around a specific timestamp (context)
service:payment-service | around 2026-04-11T10:05:00Z | window 5m
Log-Metric Correlation
When an alert fires on a metric (e.g., "error rate > 5%"):
1. System automatically queries logs matching:
service:{alerting_service} AND level:error
AND timestamp:[alert_start - 5m, now]
2. Groups log messages by pattern (log pattern mining)
3. Shows top error messages in alert notification:
"Top errors during alert:
- 'Connection refused to payment-db' (452 occurrences)
- 'Timeout calling fraud-engine' (128 occurrences)"
4. Links to full trace for a sample error occurrence
11. Deep Dive: Distributed Tracing
Trace Data Model
Trace:
trace_id: "abc123" (globally unique, propagated via headers)
Spans (timeline view):
+--[ api-gateway: POST /payments (250ms) ]-------------------+
| |
| +--[ auth-service: validate_token (20ms) ]--+ |
| |
| +--[ fraud-engine: check_fraud (120ms) ]------+ |
| | | |
| | +--[ ml-model: predict (80ms) ]--+ | |
| | | |
| +--[ payment-service: charge (90ms) ]--+ | |
| | | | |
| | +--[ visa-psp: authorize (60ms) ]--+| | |
+--------------------------------------------------------------+
Header propagation (W3C Trace Context):
traceparent: 00-abc123-span_1-01
Service A calls Service B:
1. A generates span for outgoing call
2. A injects trace_id + span_id into HTTP headers
3. B extracts trace context from headers
4. B creates child span with parent = A's span_id
5. B sends completed span to collector (agent -> Kafka)
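Steps 2-4 of the propagation flow, sketched with W3C-style traceparent headers (real trace and span IDs are 16 and 8 random bytes in hex, unlike the shortened IDs in the diagram above):

```python
import secrets

def new_span_id() -> str:
    return secrets.token_hex(8)              # 16 hex chars

def inject(headers: dict, trace_id: str, span_id: str) -> dict:
    """Service A: put trace context on the outgoing request."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers

def extract(headers: dict) -> dict:
    """Service B: read context and create a child span."""
    _version, trace_id, parent_span_id, _flags = \
        headers["traceparent"].split("-")
    return {"trace_id": trace_id,            # same trace continues
            "parent_span_id": parent_span_id,
            "span_id": new_span_id()}        # new span for B's work

headers = inject({}, trace_id=secrets.token_hex(16), span_id=new_span_id())
child = extract(headers)
assert child["trace_id"] == headers["traceparent"].split("-")[1]
```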
Trace Sampling
At 100,000 traces/sec, storing everything is expensive (~20 TB/day).
Sampling strategies:
1. Head-based sampling (decision at trace start):
- Sample 10% of all traces randomly
- Always sample traces from high-priority services
- Simple, but misses rare interesting traces
2. Tail-based sampling (decision after trace completes):
- Buffer traces in memory for 30 seconds
- After all spans arrive, decide whether to keep:
a. Slow traces (duration > p99 for that operation)
b. Error traces (any span has error status)
c. High-value traces (specific tag values, e.g., premium customer)
d. Random 1% for baseline
- Discard the rest
Buffer requirement: 100K traces/sec * 30 sec * 2.4 KB = ~7.2 GB
3. Our choice: Tail-based sampling
- Captures the most diagnostically useful traces
- Keeps storage manageable (5-10% of all traces stored)
- Buffer is held in a streaming Kafka consumer group
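The keep/discard decision on a completed trace can be sketched as follows; the `customer_tier` tag and the per-operation p99 table are assumptions standing in for rules (c) and (a):

```python
import random

def keep_trace(spans, p99_ms_by_op, baseline_rate=0.01):
    """Tail-based sampling decision on a fully assembled trace."""
    root = next(s for s in spans if s.get("parent_span_id") is None)
    if root["duration_ms"] > p99_ms_by_op.get(root["operation"],
                                              float("inf")):
        return True                                  # (a) slow trace
    if any(s["status"] == "error" for s in spans):
        return True                                  # (b) error trace
    if any(s.get("tags", {}).get("customer_tier") == "premium"
           for s in spans):
        return True                                  # (c) high-value trace
    return random.random() < baseline_rate           # (d) 1% baseline

spans = [
    {"span_id": "s1", "parent_span_id": None,
     "operation": "POST /payments", "duration_ms": 245, "status": "ok"},
    {"span_id": "s2", "parent_span_id": "s1",
     "operation": "fraud_check", "duration_ms": 120, "status": "error"},
]
assert keep_trace(spans, p99_ms_by_op={"POST /payments": 800})  # error span
```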
Trace Storage
Write path:
1. Spans arrive via Kafka (from each service's agent)
2. Trace assembler groups spans by trace_id
(waits up to 30 seconds for all spans to arrive)
3. Sampling decision made on complete trace
4. Sampled traces stored:
- Recent (7 days): Elasticsearch or Grafana Tempo
- Long-term: Object storage (S3) with index in DynamoDB
Queryable by:
- trace_id (primary lookup)
- service name + operation name
- duration range (find slow traces)
- error status
- timestamp range
- custom tag values
12. Scaling Considerations
Metrics Ingestion Scaling
10 million data points/sec across the cluster:
Ingestion Gateway: 50 instances (200K points/sec each)
Kafka: 24 brokers, 128 partitions for metrics topic
TSDB Writers: 40 instances (250K points/sec each)
Each TSDB writer:
- Maintains in-memory head block for recent data (~4 GB RAM)
- Flushes 2-hour blocks to object storage
- Handles ~1.25 million active series
Scaling trigger: when any writer exceeds 80% memory, add writers
and rebalance series assignment using consistent hashing.
High Cardinality Problem
Problem: A metric with high-cardinality tags creates millions of series.
Example:
http.request.duration{user_id=<unique_per_user>}
If 10M users -> this ONE metric creates 10M series.
Impact:
- Index bloats (series lookup table grows)
- Memory usage spikes (each active series needs ~4 KB)
- Query performance degrades (scanning millions of series)
Mitigations:
1. Reject metrics exceeding cardinality limit:
If metric_name has > 100K unique tag combinations, reject with error.
2. Pre-aggregation:
Aggregate away high-cardinality tags at ingest time.
Instead of per-user metrics, emit percentiles (p50, p95, p99).
3. Separate storage:
Route high-cardinality metrics to a dedicated cluster with
different retention/query characteristics.
4. Tag value limits:
Cap unique values per tag key at 10,000.
Emit a cardinality warning metric for operators.
5. Usage tracking:
Dashboard showing top-10 metrics by cardinality.
Alert when a metric's cardinality grows > 50%/day.
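Mitigations 1 and 4 can be enforced at the ingestion gateway with per-metric counters. A sketch using the limits quoted above (the usage example shrinks the limits so the rejection is visible):

```python
from collections import defaultdict

class CardinalityGuard:
    """Reject NEW series once a metric exceeds its series or tag budget."""

    def __init__(self, max_series_per_metric=100_000,
                 max_values_per_tag_key=10_000):
        self.max_series = max_series_per_metric
        self.max_tag_values = max_values_per_tag_key
        self.series = defaultdict(set)       # metric -> set of tag tuples
        self.tag_values = defaultdict(set)   # (metric, key) -> tag values

    def admit(self, metric: str, tags: dict) -> bool:
        key = tuple(sorted(tags.items()))
        if key in self.series[metric]:
            return True                      # existing series: always accept
        if len(self.series[metric]) >= self.max_series:
            return False                     # mitigation 1: series budget
        for k, v in tags.items():
            vals = self.tag_values[(metric, k)]
            if v not in vals and len(vals) >= self.max_tag_values:
                return False                 # mitigation 4: tag value cap
        self.series[metric].add(key)
        for k, v in tags.items():
            self.tag_values[(metric, k)].add(v)
        return True

guard = CardinalityGuard(max_series_per_metric=2, max_values_per_tag_key=2)
assert guard.admit("http.request.duration", {"user_id": "u1"})
assert guard.admit("http.request.duration", {"user_id": "u2"})
assert not guard.admit("http.request.duration", {"user_id": "u3"})  # capped
```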
TSDB Sharding
Strategy: Shard by series_id (hash of metric_name + sorted tags)
Series Assignment:
series_id = hash(metric_name + sorted_tags)
shard = series_id % num_shards
+--------------------------------------------------+
| Series Ring (consistent hashing with vnodes) |
| |
| Shard 0: series hash range [0, 16383] |
| Shard 1: series hash range [16384, 32767] |
| ... |
| Shard 15: series hash range [245760, 262143] |
| |
| Each shard: 1 writer + 2 read replicas |
+--------------------------------------------------+
Query fanout:
- Metric-specific queries: route to shard(s) containing that metric
- Wildcard queries: fan out to all shards, merge results
- Parallel execution keeps latency low (~50ms even with 16-shard fanout)
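The assignment rule is the hash-then-modulo shown above; a sketch (a production system would use the consistent-hashing ring from the diagram so that adding shards moves only a fraction of the series):

```python
import hashlib

NUM_SHARDS = 16

def series_id(metric: str, tags: dict) -> int:
    """Stable ID from metric name + sorted tag set."""
    canonical = metric + "|" + ",".join(
        f"{k}={v}" for k, v in sorted(tags.items()))
    return int(hashlib.md5(canonical.encode()).hexdigest(), 16)

def shard_for(metric: str, tags: dict) -> int:
    return series_id(metric, tags) % NUM_SHARDS

s = shard_for("http.request.duration",
              {"service": "api-gateway", "host": "web-01"})
assert 0 <= s < NUM_SHARDS
# Tag order must not matter: the same series always maps to the same shard
assert s == shard_for("http.request.duration",
                      {"host": "web-01", "service": "api-gateway"})
```

A metric-specific query hashes its filters to find the owning shard(s); a wildcard query has no usable hash, so it fans out to all 16 shards and merges.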
Geographic Distribution
+------------------+ +------------------+ +------------------+
| US Region | | EU Region | | Asia Region |
| - Full stack | | - Full stack | | - Full stack |
| - Local TSDB | | - Local TSDB | | - Local TSDB |
| - Local logs | | - Local logs | | - Local logs |
+--------+---------+ +--------+---------+ +--------+---------+
| | |
+------------------------+------------------------+
|
+-------v-------+
| Global Query | Cross-region queries
| Aggregator | for global dashboards
+---------------+
Metrics are ingested locally in each region.
Cross-region queries fan out to all regional TSDB clusters and merge.
Alerting runs locally per region (low latency).
13. Key Tradeoffs
| Decision | Option A | Option B | Our Choice |
|---|---|---|---|
| Collection model | Push only | Pull only | Hybrid |
| TSDB | Prometheus (single node) | Custom distributed TSDB | Custom TSDB |
| Log storage | Elasticsearch (flexible) | ClickHouse (fast agg.) | Elasticsearch |
| Trace storage | Jaeger + Cassandra | Tempo + Object store | Tempo + S3 |
| Sampling | Head-based (simple) | Tail-based (smart) | Tail-based |
| Alert evaluation | Push (stream processing) | Pull (periodic query) | Periodic query |
| Compression | General (LZ4) | Time-series specific | TS-specific |
| Retention | Keep all data forever | Downsample old data | Downsample |
| High cardinality | Accept all | Reject + aggregate | Reject + limit |
| Anomaly detection | Fixed thresholds only | ML-based baselines | Both |
14. Failure Scenarios and Mitigations
Scenario Mitigation
---------------------------------------------------------------------------
Ingestion gateway overload Backpressure to agents; agents buffer to disk
Kafka broker failure Replication factor 3; auto-leader election
TSDB writer crash WAL replay on restart; replicas serve reads
Alert evaluator down Consistent hashing re-assigns rules in <30s;
alert state stored in Redis (not local memory)
Elasticsearch cluster full Automated index rotation; cold tier rollover
Query timeout (large time range) Auto-downsample; query timeout with partial results
High cardinality explosion Automatic detection + rejection; alert metric owner
Network partition (cross-region) Each region operates independently; merge when healed
Dashboard service outage Static HTML fallback with last-known snapshot
False positive alert storm Alert grouping + rate limiting notifications
Agent crash on host Systemd auto-restart; disk buffer survives restart
Clock skew between hosts NTP required; accept data within 5-minute window
Anomaly model stale Retrain daily; fall back to fixed thresholds if
model is > 48 hours old
Key Takeaways
- Time-series databases require specialized compression (delta-of-delta for timestamps, XOR/Gorilla for values) to achieve 10x+ compression ratios -- generic databases cannot handle 43 TB/day ingestion cost-effectively.
- Downsampling with retention tiers is essential -- keeping full-resolution data forever is prohibitively expensive; the total storage drops from thousands of TB to ~81 TB with a tiered approach.
- High cardinality is the number-one operational challenge in monitoring systems -- a single bad metric with per-user tags can overwhelm the entire system; proactive detection and limits are mandatory.
- Alerting requires hysteresis, deduplication, and grouping to avoid alert fatigue -- raw threshold checks generate too many false positives and flapping.
- Anomaly detection based on seasonal baselines catches issues that fixed thresholds miss (traffic that is normal at 2 PM but abnormal at 4 AM).
- Push and pull collection models serve different use cases -- supporting both maximizes compatibility while a unified Kafka + TSDB backend keeps the architecture simple.
- Tail-based trace sampling captures the most diagnostically valuable traces (slow, errored) while keeping storage costs manageable at scale.