Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.7 — Logging and Observability

6.7 -- Logging and Observability: Quick Revision

Compact cheat sheet. Print-friendly.

How to use this material (instructions)

  1. Skim before labs or interviews.
  2. Drill gaps -- reopen README.md -> 6.7.a...6.7.c.
  3. Practice -- 6.7-Exercise-Questions.md.
  4. Polish answers -- 6.7-Interview-Questions.md.

Core Vocabulary

Term                 One-liner
Structured logging   Emitting logs as JSON with consistent fields (timestamp, level, message, requestId)
Log level            Severity category: fatal > error > warn > info > debug > trace
Request ID           Unique ID generated at the entry point, propagated through all services via headers
Correlation ID       ID that tracks a business operation across multiple requests and queues
Child logger         Logger that inherits parent context (requestId, userId) so you don't repeat fields
PII redaction        Automatically replacing sensitive fields (password, token, SSN) with [REDACTED]
Three pillars        Logs (events), Metrics (numbers over time), Traces (request journey)
RED method           Rate, Errors, Duration -- the standard for monitoring request-driven services
USE method           Utilization, Saturation, Errors -- the standard for monitoring infrastructure
SLI                  Service Level Indicator -- the actual measured metric
SLO                  Service Level Objective -- the internal target for the SLI
SLA                  Service Level Agreement -- contractual promise with financial penalties
Error budget         Allowed downtime = 100% minus SLO (e.g., 99.9% SLO = 43.2 min/month budget)
Percentile (p99)     99% of requests are faster than this value
Alert fatigue        Team ignores alerts because too many are non-actionable
Runbook              Step-by-step procedure for responding to a specific alert
Audit log            Immutable record of who did what, when, from where -- for compliance
Burn rate            How fast the error budget is being consumed relative to the sustainable rate

Log Levels

MOST SEVERE ──────────────────────────────── LEAST SEVERE

  fatal    error    warn    info    debug    trace
    │        │       │       │       │        │
    │        │       │       │       │        └─ Step-by-step granular flow
    │        │       │       │       └────────── Developer detail (off in prod)
    │        │       │       └────────────────── Normal operations (prod default)
    │        │       └────────────────────────── Unusual but recovered
    │        └────────────────────────────────── Something broke (needs investigation)
    └─────────────────────────────────────────── System crashing (immediate action)

Production default: LOG_LEVEL=info  (shows fatal, error, warn, info)
Debugging:          LOG_LEVEL=debug (temporarily, on specific service)
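The severity ordering maps to numeric thresholds. A minimal sketch using pino-style level numbers (the helper name is illustrative, not a library API):

```javascript
// Pino-style numeric levels: higher number = more severe.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

// A record is emitted only if its severity meets the configured threshold.
function shouldLog(recordLevel, configuredLevel) {
  return LEVELS[recordLevel] >= LEVELS[configuredLevel];
}

// With LOG_LEVEL=info: warn passes, debug is dropped.
```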

Structured Log Format

{
  "timestamp": "2026-04-11T14:32:01.234Z",
  "level": "info",
  "message": "Order created",
  "service": "order-service",
  "requestId": "req-abc-123",
  "userId": "user-456",
  "method": "POST",
  "path": "/api/orders",
  "statusCode": 201,
  "duration": 142,
  "environment": "production",
  "version": "2.3.1"
}

Must-have fields:   timestamp, level, message, service, requestId
Should-have fields: userId, method, path, statusCode, duration, environment, version


Winston vs Pino

                 Winston                          Pino
Speed            ~15K logs/sec                    ~75K logs/sec
API              logger.info('msg', { meta })     logger.info({ meta }, 'msg')
Pretty print     Built-in                         Requires pino-pretty
Transports       Built-in (file, console, HTTP)   Separate modules
Redaction        Custom format needed             Built-in redact option
Child loggers    Supported                        First-class feature
Choose when      Most apps; rich integrations     High-throughput services

Three Pillars of Observability

  LOGS                    METRICS                 TRACES
  ════                    ═══════                 ══════
  What happened           How much / how fast     Where did time go
  Individual events       Aggregated numbers      Request path
  High volume, detail     Low volume, fast query  Medium volume
  Expensive to store      Cheap to store          Moderate cost

  Tools:                  Tools:                  Tools:
  CloudWatch Logs         Prometheus              Jaeger
  ELK Stack               Grafana                 Zipkin
  Datadog Logs            CloudWatch Metrics      AWS X-Ray
                          Datadog Metrics         Datadog APM

RED Method (Services)

R = Rate          Requests per second        Counter + rate()
E = Errors        Error count or %           Counter filtered by status >= 500
D = Duration      Latency distribution       Histogram + histogram_quantile()
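The three RED signals can be computed from a batch of request records. A minimal illustrative sketch (in production these come from Prometheus counters and histograms, not in-memory arrays; the record field names are assumptions):

```javascript
// RED over a batch of request records.
function red(requests, windowSeconds) {
  const errors = requests.filter(r => r.statusCode >= 500).length;
  const durations = requests.map(r => r.duration).sort((a, b) => a - b);
  return {
    ratePerSec: requests.length / windowSeconds,            // R: requests/sec
    errorPct: (errors / requests.length) * 100,             // E: % of 5xx
    p99: durations[Math.ceil(0.99 * durations.length) - 1]  // D: latency tail
  };
}
```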

USE Method (Infrastructure)

U = Utilization   % resource busy            CPU 75%, Memory 80%
S = Saturation    Queue depth / backlog      50 requests queued
E = Errors        Resource error count       3 disk I/O errors

Percentile Latency

p50   = median             "typical" user experience
p95   = 1 in 20 slower     standard SLO target
p99   = 1 in 100 slower    high-reliability SLO
p99.9 = 1 in 1000 slower   critical paths (checkout, login)

RULE: Never use averages for latency. Averages hide long tails.
  Average = 50ms sounds fine
  p99 = 3200ms means 1% of users wait 3+ seconds
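Nearest-rank percentile over raw samples makes the tail visible. A simple sketch (real systems aggregate into histogram buckets rather than sorting raw values):

```javascript
// Nearest-rank percentile: the value below which p% of samples fall.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// One slow outlier among fast requests:
const latencies = [50, 52, 48, 51, 49, 50, 53, 47, 50, 3200];
// average = 365ms (misleading), p50 = 50ms, p99 = 3200ms (the tail)
```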

SLI / SLO / SLA

SLA  ────  "We promise 99.9% uptime"    (contract, penalties)
  │
SLO  ────  "We target 99.95% uptime"    (internal, stricter)
  │
SLI  ────  "Current: 99.97% uptime"     (actual measurement)

Error Budget:
  SLO = 99.9% over 30 days
  Budget = 0.1% x 43,200 min = 43.2 minutes of allowed downtime

  Budget remaining > 50%  →  Ship features aggressively
  Budget remaining < 25%  →  Freeze risky deploys, fix reliability
  Budget exhausted        →  All hands on reliability
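The budget and burn-rate arithmetic above can be sketched as (function names are illustrative):

```javascript
// Error budget in minutes for a given SLO over a window.
function errorBudgetMinutes(sloPercent, days) {
  const totalMinutes = days * 24 * 60;               // 30 days = 43,200 min
  return ((100 - sloPercent) / 100) * totalMinutes;  // allowed downtime
}

// Burn rate: observed failure fraction vs. the budgeted fraction.
// A burn rate > 1 means the budget runs out before the window ends.
function burnRate(observedErrorRate, sloPercent) {
  return observedErrorRate / ((100 - sloPercent) / 100);
}

// errorBudgetMinutes(99.9, 30) ≈ 43.2 minutes
// burnRate(0.01, 99.9) ≈ 10  -> budget gone in 1/10 of the window
```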

Metric Types (Prometheus)

Type        Behavior                  Example
Counter     Only goes up              http_requests_total
Gauge       Goes up and down          active_connections
Histogram   Values in buckets         request_duration_seconds
Summary     Client-side percentiles   request_duration_summary

Alerting Rules

FIVE RULES OF GOOD ALERTS:
  1. Actionable     →  Someone can fix it right now
  2. Urgent         →  Needs attention in minutes, not days
  3. Real           →  Indicates a genuine problem, not noise
  4. Unique         →  Not duplicated by another alert
  5. Documented     →  Links to a runbook

SEVERITY:
  P1 Critical  →  Service down          →  Page on-call (< 5 min)
  P2 High      →  Major degradation     →  Page on-call (< 30 min)
  P3 Medium    →  Minor issue           →  Slack (< 4 hours)
  P4 Low       →  Informational         →  Email/ticket (next day)

ALERT FATIGUE FIX:
  - Delete alerts that fire > 3x/week without action
  - Use sustained conditions (not single data points)
  - Route P3/P4 to Slack, only P1/P2 page
  - Review all alerts monthly
  - Use SLO burn rate instead of raw thresholds

Common Alert Thresholds

Metric            Warning       Critical      Window
Error rate        > 1%          > 5%          5 min sustained
p99 latency       > 1s          > 5s          5 min sustained
CPU               > 70%         > 90%         15 min / 5 min
Memory            > 75%         > 90%         10 min / 5 min
Disk              > 80%         > 90%         Point-in-time
Event loop lag    > 50ms        > 200ms       5 min / 2 min
DB connections    > 70% pool    > 90% pool    Current
Cert expiry       < 14 days     < 3 days      Daily
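"Sustained" evaluation means every sample in the window must breach before the alert fires, which is why single spikes don't page. An illustrative helper, not a real alerting engine:

```javascript
// Fire only when all of the last `windowSize` samples exceed the threshold.
function sustainedBreach(samples, threshold, windowSize) {
  if (samples.length < windowSize) return false;      // not enough data yet
  return samples.slice(-windowSize).every(v => v > threshold);
}

// A single 6% error-rate spike does not fire; five consecutive breaches do.
```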

Event Logging Checklist

ALWAYS LOG THESE EVENTS:

Security:
  ✓ Login success/failure (with IP, user agent)
  ✓ Account lockout
  ✓ Password change
  ✓ Permission denied / access denied
  ✓ API key creation/revocation

Business:
  ✓ User signup / account deletion
  ✓ Order created / cancelled
  ✓ Payment success / failure / refund
  ✓ Subscription change

AI:
  ✓ AI API call (provider, model, tokens, cost, latency, status)
  ✓ AI error / timeout / rate limit
  ✓ Token usage per user / feature

System:
  ✓ Deployment start / complete / rollback
  ✓ Configuration change
  ✓ Health check failure
  ✓ Rate limit exceeded

NEVER LOG:
  ✗ Passwords or hashes
  ✗ Full credit card numbers (only last 4)
  ✗ Social security numbers
  ✗ API keys or tokens
  ✗ Full request bodies (may contain PII)
  ✗ Session tokens
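A minimal redaction pass before logging might look like this (a sketch of the idea; pino's built-in redact option does this declaratively in production):

```javascript
// Replace known-sensitive top-level fields before the object is logged.
const SENSITIVE = new Set(['password', 'token', 'creditCard', 'ssn']);

function redact(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([k, v]) => [k, SENSITIVE.has(k) ? '[REDACTED]' : v])
  );
}
```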

Request ID Flow

Client Request
    │
    ▼
API Gateway ──→ generates requestId: "req-abc-123"
    │             sets X-Request-Id header
    │
    ├──→ Auth Service    (logs with requestId)
    │         │
    │         ├──→ Order Service  (logs with requestId)
    │         │         │
    │         │         └──→ Payment Service (logs with requestId)
    │         │
    │         └──→ Notification Service (logs with requestId)
    │
    ▼
Response ──→ X-Request-Id: "req-abc-123" (returned to client)

To debug: filter requestId="req-abc-123" across ALL services
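Propagating the ID to a downstream call can be sketched as follows (the helper name and `req` shape are assumptions; it presumes upstream middleware already set `req.requestId`):

```javascript
// Build headers for a downstream service call, carrying the request ID forward.
function downstreamHeaders(req) {
  return {
    'X-Request-Id': req.requestId,     // same ID appears in every service's logs
    'Content-Type': 'application/json'
  };
}

// e.g. fetch('http://order-service/api/orders', { headers: downstreamHeaders(req) })
```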

Dashboard Layout

Row 1: SLI NUMBERS        Uptime | Error Rate | p95 Latency | RPS
Row 2: RED TIME SERIES    Rate graph | Error graph | Latency percentiles
Row 3: INFRASTRUCTURE     CPU | Memory | Event Loop | DB Connections
Row 4: BUSINESS           Signups | Orders/min | AI Calls | Payment Rate

Runbook Template

1. ALERT         What alert fired, threshold, severity
2. IMPACT        What users are experiencing
3. DIAGNOSIS     Step-by-step investigation commands
4. RESOLUTION    Fix steps for each root cause scenario
5. ESCALATION    Who to contact if you cannot resolve

Common Gotchas

Gotcha                               Why
Using console.log in production      Not structured, not filterable, no levels
Logging PII (passwords, tokens)      Security breach, compliance violation
Alerting on averages                 Hides long-tail latency problems
Alerting on single data points       Causes alert fatigue from brief spikes
No request ID propagation            Cannot trace requests across services
Same storage for app + audit logs    Audit logs need immutable, long-term storage
LOG_LEVEL=debug in production        Massive log volume and cost
No runbooks for alerts               On-call wastes time figuring out what to do
Error budget not tracked             No objective basis for deploy/fix decisions
Not monitoring AI costs              Token costs can spike unexpectedly

Quick Code References

Pino logger setup

const pino = require('pino');
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',      // info is the production default
  base: { service: 'my-service' },             // added to every log line
  redact: ['password', 'token', 'creditCard']  // PII -> [REDACTED]
});

Request ID middleware

const { v4: uuidv4 } = require('uuid');
app.use((req, res, next) => {
  // Reuse an upstream ID if present, otherwise generate one at this entry point
  req.requestId = req.headers['x-request-id'] || `req-${uuidv4()}`;
  res.setHeader('X-Request-Id', req.requestId);  // echo back to the client
  next();
});

Prometheus metrics

const { Counter, Histogram, collectDefaultMetrics, register } = require('prom-client');
collectDefaultMetrics();  // CPU, memory, event loop, GC metrics

const requests = new Counter({
  name: 'http_requests_total', help: 'Total requests',
  labelNames: ['method', 'route', 'status_code']
});
const duration = new Histogram({
  name: 'http_request_duration_seconds', help: 'Duration',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

End of 6.7 quick revision.