Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.7 — Logging and Observability

6.7 -- Logging and Observability: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps -- reopen README.md -> 6.7.a...6.7.c.
  3. Practice -- 6.7-Exercise-Questions.md.
  4. Polish answers -- 6.7-Interview-Questions.md.

Core Vocabulary

Term                 One-liner
Structured logging   Emitting logs as JSON with consistent fields (timestamp, level, message, requestId)
Log level            Severity category: fatal > error > warn > info > debug > trace
Request ID           Unique ID generated at the entry point, propagated through all services via headers
Correlation ID       ID that tracks a business operation across multiple requests and queues
Child logger         Logger that inherits parent context (requestId, userId) so you don't repeat fields
PII redaction        Automatically replacing sensitive fields (password, token, SSN) with [REDACTED]
Three pillars        Logs (events), Metrics (numbers over time), Traces (request journey)
RED method           Rate, Errors, Duration -- the standard for monitoring request-driven services
USE method           Utilization, Saturation, Errors -- the standard for monitoring infrastructure
SLI                  Service Level Indicator -- the actual measured metric
SLO                  Service Level Objective -- the internal target for the SLI
SLA                  Service Level Agreement -- contractual promise with financial penalties
Error budget         Allowed downtime = 100% minus SLO (e.g., 99.9% SLO = 43.2 min/month budget)
Percentile (p99)     99% of requests are faster than this value
Alert fatigue        Team ignores alerts because too many are non-actionable
Runbook              Step-by-step procedure for responding to a specific alert
Audit log            Immutable record of who did what, when, from where -- for compliance
Burn rate            How fast the error budget is being consumed relative to the sustainable rate

Log Levels

MOST SEVERE ──────────────────────────────── LEAST SEVERE

  fatal    error    warn    info    debug    trace
    │        │       │       │       │        │
    │        │       │       │       │        └─ Step-by-step granular flow
    │        │       │       │       └────────── Developer detail (off in prod)
    │        │       │       └────────────────── Normal operations (prod default)
    │        │       └────────────────────────── Unusual but recovered
    │        └────────────────────────────────── Something broke (needs investigation)
    └─────────────────────────────────────────── System crashing (immediate action)

Production default: LOG_LEVEL=info  (shows fatal, error, warn, info)
Debugging:          LOG_LEVEL=debug (temporarily, on specific service)
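The severity ordering maps to numeric thresholds. A minimal sketch using pino-style level numbers (the helper name is illustrative, not a library API):

```javascript
// Pino-style numeric levels: higher number = more severe.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

// A record is emitted only if its severity meets the configured threshold.
function shouldLog(recordLevel, configuredLevel) {
  return LEVELS[recordLevel] >= LEVELS[configuredLevel];
}

// With LOG_LEVEL=info: warn passes, debug is dropped.
```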

Structured Log Format

{
  "timestamp": "2026-04-11T14:32:01.234Z",
  "level": "info",
  "message": "Order created",
  "service": "order-service",
  "requestId": "req-abc-123",
  "userId": "user-456",
  "method": "POST",
  "path": "/api/orders",
  "statusCode": 201,
  "duration": 142,
  "environment": "production",
  "version": "2.3.1"
}

Must-have fields:   timestamp, level, message, service, requestId
Should-have fields: userId, method, path, statusCode, duration, environment, version


Winston vs Pino

                 Winston                          Pino
Speed            ~15K logs/sec                    ~75K logs/sec
API              logger.info('msg', { meta })     logger.info({ meta }, 'msg')
Pretty print     Built-in                         Requires pino-pretty
Transports       Built-in (file, console, HTTP)   Separate modules
Redaction        Custom format needed             Built-in redact option
Child loggers    Supported                        First-class feature
Choose when      Most apps; rich integrations     High-throughput services

Three Pillars of Observability

  LOGS                    METRICS                 TRACES
  ════                    ═══════                 ══════
  What happened           How much / how fast     Where did time go
  Individual events       Aggregated numbers      Request path
  High volume, detail     Low volume, fast query  Medium volume
  Expensive to store      Cheap to store          Moderate cost

  Tools:                  Tools:                  Tools:
  CloudWatch Logs         Prometheus              Jaeger
  ELK Stack               Grafana                 Zipkin
  Datadog Logs            CloudWatch Metrics      AWS X-Ray
                          Datadog Metrics         Datadog APM

RED Method (Services)

R = Rate          Requests per second        Counter + rate()
E = Errors        Error count or %           Counter filtered by status >= 500
D = Duration      Latency distribution       Histogram + histogram_quantile()
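The three RED signals can be computed from a batch of request records. A minimal illustrative sketch (in production these come from Prometheus counters and histograms, not in-memory arrays; the record field names are assumptions):

```javascript
// RED over a batch of request records.
function red(requests, windowSeconds) {
  const errors = requests.filter(r => r.statusCode >= 500).length;
  const durations = requests.map(r => r.duration).sort((a, b) => a - b);
  return {
    ratePerSec: requests.length / windowSeconds,            // R: requests/sec
    errorPct: (errors / requests.length) * 100,             // E: % of 5xx
    p99: durations[Math.ceil(0.99 * durations.length) - 1]  // D: latency tail
  };
}
```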

USE Method (Infrastructure)

U = Utilization   % resource busy            CPU 75%, Memory 80%
S = Saturation    Queue depth / backlog      50 requests queued
E = Errors        Resource error count       3 disk I/O errors

Percentile Latency

p50   = median             "typical" user experience
p95   = 1 in 20 slower     standard SLO target
p99   = 1 in 100 slower    high-reliability SLO
p99.9 = 1 in 1000 slower   critical paths (checkout, login)

RULE: Never use averages for latency. Averages hide long tails.
  Average = 50ms sounds fine
  p99 = 3200ms means 1% of users wait 3+ seconds
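Nearest-rank percentile over raw samples makes the tail visible. A simple sketch (real systems aggregate into histogram buckets rather than sorting raw values):

```javascript
// Nearest-rank percentile: the value below which p% of samples fall.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// One slow outlier among fast requests:
const latencies = [50, 52, 48, 51, 49, 50, 53, 47, 50, 3200];
// average = 365ms (misleading), p50 = 50ms, p99 = 3200ms (the tail)
```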

SLI / SLO / SLA

SLA  ────  "We promise 99.9% uptime"    (contract, penalties)
  │
SLO  ────  "We target 99.95% uptime"    (internal, stricter)
  │
SLI  ────  "Current: 99.97% uptime"     (actual measurement)

Error Budget:
  SLO = 99.9% over 30 days
  Budget = 0.1% x 43,200 min = 43.2 minutes of allowed downtime

  Budget remaining > 50%  →  Ship features aggressively
  Budget remaining < 25%  →  Freeze risky deploys, fix reliability
  Budget exhausted        →  All hands on reliability
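The budget and burn-rate arithmetic above can be sketched as (function names are illustrative):

```javascript
// Error budget in minutes for a given SLO over a window.
function errorBudgetMinutes(sloPercent, days) {
  const totalMinutes = days * 24 * 60;               // 30 days = 43,200 min
  return ((100 - sloPercent) / 100) * totalMinutes;  // allowed downtime
}

// Burn rate: observed failure fraction vs. the budgeted fraction.
// A burn rate > 1 means the budget runs out before the window ends.
function burnRate(observedErrorRate, sloPercent) {
  return observedErrorRate / ((100 - sloPercent) / 100);
}

// errorBudgetMinutes(99.9, 30) ≈ 43.2 minutes
// burnRate(0.01, 99.9) ≈ 10  -> budget gone in 1/10 of the window
```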

Metric Types (Prometheus)

Type        Behavior                  Example
Counter     Only goes up              http_requests_total
Gauge       Goes up and down          active_connections
Histogram   Values in buckets         request_duration_seconds
Summary     Client-side percentiles   request_duration_summary

Alerting Rules

FIVE RULES OF GOOD ALERTS:
  1. Actionable     →  Someone can fix it right now
  2. Urgent         →  Needs attention in minutes, not days
  3. Real           →  Indicates a genuine problem, not noise
  4. Unique         →  Not duplicated by another alert
  5. Documented     →  Links to a runbook

SEVERITY:
  P1 Critical  →  Service down          →  Page on-call (< 5 min)
  P2 High      →  Major degradation     →  Page on-call (< 30 min)
  P3 Medium    →  Minor issue           →  Slack (< 4 hours)
  P4 Low       →  Informational         →  Email/ticket (next day)

ALERT FATIGUE FIX:
  - Delete alerts that fire > 3x/week without action
  - Use sustained conditions (not single data points)
  - Route P3/P4 to Slack, only P1/P2 page
  - Review all alerts monthly
  - Use SLO burn rate instead of raw thresholds

Common Alert Thresholds

Metric            Warning       Critical      Window
Error rate        > 1%          > 5%          5 min sustained
p99 latency       > 1s          > 5s          5 min sustained
CPU               > 70%         > 90%         15 min / 5 min
Memory            > 75%         > 90%         10 min / 5 min
Disk              > 80%         > 90%         Point-in-time
Event loop lag    > 50ms        > 200ms       5 min / 2 min
DB connections    > 70% pool    > 90% pool    Current
Cert expiry       < 14 days     < 3 days      Daily
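"Sustained" evaluation means every sample in the window must breach before the alert fires, which is why single spikes don't page. An illustrative helper, not a real alerting engine:

```javascript
// Fire only when all of the last `windowSize` samples exceed the threshold.
function sustainedBreach(samples, threshold, windowSize) {
  if (samples.length < windowSize) return false;      // not enough data yet
  return samples.slice(-windowSize).every(v => v > threshold);
}

// A single 6% error-rate spike does not fire; five consecutive breaches do.
```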

Event Logging Checklist

ALWAYS LOG THESE EVENTS:

Security:
  ✓ Login success/failure (with IP, user agent)
  ✓ Account lockout
  ✓ Password change
  ✓ Permission denied / access denied
  ✓ API key creation/revocation

Business:
  ✓ User signup / account deletion
  ✓ Order created / cancelled
  ✓ Payment success / failure / refund
  ✓ Subscription change

AI:
  ✓ AI API call (provider, model, tokens, cost, latency, status)
  ✓ AI error / timeout / rate limit
  ✓ Token usage per user / feature

System:
  ✓ Deployment start / complete / rollback
  ✓ Configuration change
  ✓ Health check failure
  ✓ Rate limit exceeded

NEVER LOG:
  ✗ Passwords or hashes
  ✗ Full credit card numbers (only last 4)
  ✗ Social security numbers
  ✗ API keys or tokens
  ✗ Full request bodies (may contain PII)
  ✗ Session tokens
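A minimal redaction pass before logging might look like this (a sketch of the idea; pino's built-in redact option does this declaratively in production):

```javascript
// Replace known-sensitive top-level fields before the object is logged.
const SENSITIVE = new Set(['password', 'token', 'creditCard', 'ssn']);

function redact(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) => [k, SENSITIVE.has(k) ? '[REDACTED]' : v])
  );
}
```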

Request ID Flow

Client Request
    │
    ▼
API Gateway ──→ generates requestId: "req-abc-123"
    │             sets X-Request-Id header
    │
    ├──→ Auth Service    (logs with requestId)
    │         │
    │         ├──→ Order Service  (logs with requestId)
    │         │         │
    │         │         └──→ Payment Service (logs with requestId)
    │         │
    │         └──→ Notification Service (logs with requestId)
    │
    ▼
Response ──→ X-Request-Id: "req-abc-123" (returned to client)

To debug: filter requestId="req-abc-123" across ALL services
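Propagating the ID to a downstream call can be sketched as follows (the helper name and `req` shape are assumptions; it presumes upstream middleware already set `req.requestId`):

```javascript
// Build headers for a downstream service call, carrying the request ID forward.
function downstreamHeaders(req) {
  return {
    'X-Request-Id': req.requestId,     // same ID appears in every service's logs
    'Content-Type': 'application/json'
  };
}

// e.g. fetch('http://order-service/api/orders', { headers: downstreamHeaders(req) })
```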

Dashboard Layout

Row 1: SLI NUMBERS        Uptime | Error Rate | p95 Latency | RPS
Row 2: RED TIME SERIES    Rate graph | Error graph | Latency percentiles
Row 3: INFRASTRUCTURE     CPU | Memory | Event Loop | DB Connections
Row 4: BUSINESS           Signups | Orders/min | AI Calls | Payment Rate

Runbook Template

1. ALERT         What alert fired, threshold, severity
2. IMPACT        What users are experiencing
3. DIAGNOSIS     Step-by-step investigation commands
4. RESOLUTION    Fix steps for each root cause scenario
5. ESCALATION    Who to contact if you cannot resolve

Common Gotchas

Gotcha                               Why
Using console.log in production      Not structured, not filterable, no levels
Logging PII (passwords, tokens)      Security breach, compliance violation
Alerting on averages                 Hides long-tail latency problems
Alerting on single data points       Causes alert fatigue from brief spikes
No request ID propagation            Cannot trace requests across services
Same storage for app + audit logs    Audit logs need immutable, long-term storage
LOG_LEVEL=debug in production        Massive log volume and cost
No runbooks for alerts               On-call wastes time figuring out what to do
Error budget not tracked             No objective basis for deploy/fix decisions
Not monitoring AI costs              Token costs can spike unexpectedly

Quick Code References

Pino logger setup

const pino = require('pino');
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',      // info is the production default
  base: { service: 'my-service' },             // added to every log line
  redact: ['password', 'token', 'creditCard']  // PII -> [REDACTED]
});

Request ID middleware

const { v4: uuidv4 } = require('uuid');
app.use((req, res, next) => {
  // Reuse an upstream ID if present, otherwise generate one at this entry point
  req.requestId = req.headers['x-request-id'] || `req-${uuidv4()}`;
  res.setHeader('X-Request-Id', req.requestId);  // echo back to the client
  next();
});

Prometheus metrics

const { Counter, Histogram, collectDefaultMetrics, register } = require('prom-client');
collectDefaultMetrics();  // CPU, memory, event loop, GC metrics

const requests = new Counter({
  name: 'http_requests_total', help: 'Total requests',
  labelNames: ['method', 'route', 'status_code']
});
const duration = new Histogram({
  name: 'http_request_duration_seconds', help: 'Duration',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

End of 6.7 quick revision.