6.7 -- Logging and Observability: Quick Revision
Compact cheat sheet. Print-friendly.
How to use this material (instructions)
- Skim before labs or interviews.
- Drill gaps -- reopen README.md -> 6.7.a...6.7.c.
- Practice -- 6.7-Exercise-Questions.md.
- Polish answers -- 6.7-Interview-Questions.md.
Core Vocabulary
| Term | One-liner |
|---|---|
| Structured logging | Emitting logs as JSON with consistent fields (timestamp, level, message, requestId) |
| Log level | Severity category: fatal > error > warn > info > debug > trace |
| Request ID | Unique ID generated at entry point, propagated through all services via headers |
| Correlation ID | ID that tracks a business operation across multiple requests and queues |
| Child logger | Logger that inherits parent context (requestId, userId) so you don't repeat fields |
| PII redaction | Automatically replacing sensitive fields (password, token, SSN) with [REDACTED] |
| Three pillars | Logs (events), Metrics (numbers over time), Traces (request journey) |
| RED method | Rate, Errors, Duration -- the standard for monitoring request-driven services |
| USE method | Utilization, Saturation, Errors -- the standard for monitoring infrastructure |
| SLI | Service Level Indicator -- the actual measured metric |
| SLO | Service Level Objective -- the internal target for the SLI |
| SLA | Service Level Agreement -- contractual promise with financial penalties |
| Error budget | Allowed downtime = 100% minus SLO (e.g., 99.9% SLO = 43.2 min/month budget) |
| Percentile (p99) | 99% of requests are faster than this value |
| Alert fatigue | Team ignores alerts because too many are non-actionable |
| Runbook | Step-by-step procedure for responding to a specific alert |
| Audit log | Immutable record of who did what, when, from where -- for compliance |
| Burn rate | How fast the error budget is being consumed relative to the sustainable rate |
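The "child logger" entry above is worth making concrete. A minimal sketch in plain JavaScript (a hypothetical `createLogger` helper, not a real library API) shows the core idea: a child merges its parent's bound context into every entry, so per-request fields like `requestId` are set once instead of repeated on every call.

```javascript
// Minimal child-logger sketch: a child inherits the parent's bound
// context and merges it into every entry it emits.
function createLogger(context = {}) {
  return {
    info(message, fields = {}) {
      // Parent context first; per-call fields may override it.
      return JSON.stringify({ level: 'info', message, ...context, ...fields });
    },
    child(extraContext) {
      // The child carries everything the parent was bound with.
      return createLogger({ ...context, ...extraContext });
    },
  };
}

const root = createLogger({ service: 'order-service' });
const reqLogger = root.child({ requestId: 'req-abc-123' });
console.log(reqLogger.info('Order created', { orderId: 42 }));
```

Real libraries (Winston, Pino) provide the same pattern via `logger.child({...})`.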
Log Levels
MOST SEVERE ──────────────────────────────── LEAST SEVERE
fatal error warn info debug trace
│ │ │ │ │ │
│ │ │ │ │ └─ Step-by-step granular flow
│ │ │ │ └────────── Developer detail (off in prod)
│ │ │ └────────────────── Normal operations (prod default)
│ │ └────────────────────────── Unusual but recovered
│ └────────────────────────────────── Something broke (needs investigation)
└─────────────────────────────────────────── System crashing (immediate action)
Production default: LOG_LEVEL=info (shows fatal, error, warn, info)
Debugging: LOG_LEVEL=debug (temporarily, on specific service)
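Level filtering is just a numeric comparison. A small sketch (the numeric values follow Pino's conventional 10-60 mapping) shows why `LOG_LEVEL=info` passes fatal/error/warn/info but drops debug/trace:

```javascript
// Numeric severities mirror the diagram above. A logger configured at
// `info` emits anything at or above info's severity.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60 };

function shouldLog(configuredLevel, messageLevel) {
  return LEVELS[messageLevel] >= LEVELS[configuredLevel];
}

console.log(shouldLog('info', 'warn'));  // true  -- warn passes at info
console.log(shouldLog('info', 'debug')); // false -- debug is filtered out
```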
Structured Log Format
{
"timestamp": "2026-04-11T14:32:01.234Z",
"level": "info",
"message": "Order created",
"service": "order-service",
"requestId": "req-abc-123",
"userId": "user-456",
"method": "POST",
"path": "/api/orders",
"statusCode": 201,
"duration": 142,
"environment": "production",
"version": "2.3.1"
}
Must-have fields: timestamp, level, message, service, requestId
Should-have fields: userId, method, path, statusCode, duration, environment, version
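A sketch of an emitter that guarantees the must-have fields are always present (the service name and field choices here are illustrative; a real setup would use a logging library rather than hand-rolled JSON):

```javascript
// Structured-log emitter sketch: timestamp, level, message, service,
// and requestId are always set; should-have fields ride along in `extra`.
function formatLog(level, message, requestId, extra = {}) {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    service: 'order-service', // hypothetical service name
    requestId,
    ...extra, // should-have fields: userId, method, path, statusCode, ...
  });
}

console.log(formatLog('info', 'Order created', 'req-abc-123', {
  statusCode: 201,
  duration: 142,
}));
```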
Winston vs Pino
| | Winston | Pino |
|---|---|---|
| Speed | ~15K logs/sec | ~75K logs/sec |
| API | logger.info('msg', { meta }) | logger.info({ meta }, 'msg') |
| Pretty print | Built-in | Requires pino-pretty |
| Transports | Built-in (file, console, HTTP) | Separate modules |
| Redaction | Custom format needed | Built-in redact option |
| Child loggers | Supported | First-class feature |
| Choose when | Apps needing many built-in transports/integrations | High-throughput services |
Three Pillars of Observability
LOGS METRICS TRACES
════ ═══════ ══════
What happened How much / how fast Where did time go
Individual events Aggregated numbers Request path
High volume, detail Low volume, fast query Medium volume
Expensive to store Cheap to store Moderate cost
Tools: Tools: Tools:
CloudWatch Logs Prometheus Jaeger
ELK Stack Grafana Zipkin
Datadog Logs CloudWatch Metrics AWS X-Ray
Datadog Metrics Datadog APM
RED Method (Services)
R = Rate Requests per second Counter + rate()
E = Errors Error count or % Counter filtered by status >= 500
D = Duration Latency distribution Histogram + histogram_quantile()
USE Method (Infrastructure)
U = Utilization % resource busy CPU 75%, Memory 80%
S = Saturation Queue depth / backlog 50 requests queued
E = Errors Resource error count 3 disk I/O errors
Percentile Latency
p50 = median "typical" user experience
p95 = 1 in 20 users standard SLO target
p99 = 1 in 100 users high-reliability SLO
p99.9 = 1 in 1000 critical paths (checkout, login)
RULE: Never use averages for latency. Averages hide long tails.
Average = 50ms sounds fine
p99 = 3200ms means 1% of users wait 3+ seconds
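The numbers above can be reproduced with a small nearest-rank percentile calculation. Here 98 fast requests and 2 slow ones produce a healthy-looking average while p99 exposes the tail:

```javascript
// Nearest-rank percentile: the smallest value with at least p% of
// samples at or below it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

const latencies = [...Array(98).fill(40), 3200, 3200]; // ms, 100 samples

const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log(avg);                       // 103.2 -- looks fine
console.log(percentile(latencies, 50)); // 40    -- typical user is happy
console.log(percentile(latencies, 99)); // 3200  -- 1 in 100 waits 3+ seconds
```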
SLI / SLO / SLA
SLA ──── "We promise 99.9% uptime" (contract, penalties)
│
SLO ──── "We target 99.95% uptime" (internal, stricter)
│
SLI ──── "Current: 99.97% uptime" (actual measurement)
Error Budget:
SLO = 99.9% over 30 days
Budget = 0.1% x 43,200 min = 43.2 minutes of allowed downtime
Budget remaining > 50% → Ship features aggressively
Budget remaining < 25% → Freeze risky deploys, fix reliability
Budget exhausted → All hands on reliability
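The budget arithmetic above, plus the burn-rate definition from the vocabulary table, as a sketch (the observed error rate is taken as a fraction, e.g. 0.01 = 1% of requests failing):

```javascript
// Error budget for a rolling window (defaults to 30 days = 43,200 min).
function errorBudgetMinutes(sloPercent, days = 30) {
  const totalMinutes = days * 24 * 60;
  return ((100 - sloPercent) / 100) * totalMinutes;
}

// Burn rate: observed error rate relative to the budgeted rate.
// 1 = budget exactly exhausted over the window; >1 = exhausted sooner.
function burnRate(observedErrorFraction, sloPercent) {
  return observedErrorFraction / ((100 - sloPercent) / 100);
}

console.log(errorBudgetMinutes(99.9)); // ~43.2 minutes
console.log(burnRate(0.01, 99.9));     // ~10 -- budget gone in ~3 days, not 30
```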
Metric Types (Prometheus)
| Type | Behavior | Example |
|---|---|---|
| Counter | Only goes up | http_requests_total |
| Gauge | Goes up and down | active_connections |
| Histogram | Values in buckets | request_duration_seconds |
| Summary | Client-side percentiles | request_duration_summary |
Alerting Rules
FIVE RULES OF GOOD ALERTS:
1. Actionable → Someone can fix it right now
2. Urgent → Needs attention in minutes, not days
3. Real → Indicates a genuine problem, not noise
4. Unique → Not duplicated by another alert
5. Documented → Links to a runbook
SEVERITY:
P1 Critical → Service down → Page on-call (< 5 min)
P2 High → Major degradation → Page on-call (< 30 min)
P3 Medium → Minor issue → Slack (< 4 hours)
P4 Low → Informational → Email/ticket (next day)
ALERT FATIGUE FIX:
- Delete alerts that fire > 3x/week without action
- Use sustained conditions (not single data points)
- Route P3/P4 to Slack, only P1/P2 page
- Review all alerts monthly
- Use SLO burn rate instead of raw thresholds
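The "sustained conditions" fix can be sketched as a window check: alert only when every sample in the window breaches the threshold, so a single spike never pages anyone (window and threshold values below are illustrative):

```javascript
// Fire only when the last `windowSize` samples ALL exceed the threshold.
function sustainedBreach(samples, threshold, windowSize) {
  if (samples.length < windowSize) return false; // not enough data yet
  return samples.slice(-windowSize).every((v) => v > threshold);
}

const errorRatePct = [0.2, 6.1, 0.3, 5.2, 5.8, 6.0]; // one sample per minute
console.log(sustainedBreach(errorRatePct, 5, 5));         // false -- brief spike, then recovery
console.log(sustainedBreach([5.2, 5.8, 6.0, 7.1, 5.5], 5, 5)); // true -- 5 min sustained
```

Prometheus expresses the same idea with a `for:` duration on alerting rules.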
Common Alert Thresholds
| Metric | Warning | Critical | Window |
|---|---|---|---|
| Error rate | > 1% | > 5% | 5 min sustained |
| p99 latency | > 1s | > 5s | 5 min sustained |
| CPU | > 70% | > 90% | 15 min / 5 min |
| Memory | > 75% | > 90% | 10 min / 5 min |
| Disk | > 80% | > 90% | Point-in-time |
| Event loop lag | > 50ms | > 200ms | 5 min / 2 min |
| DB connections | > 70% pool | > 90% pool | Current |
| Cert expiry | < 14 days | < 3 days | Daily |
Event Logging Checklist
ALWAYS LOG THESE EVENTS:
Security:
✓ Login success/failure (with IP, user agent)
✓ Account lockout
✓ Password change
✓ Permission denied / access denied
✓ API key creation/revocation
Business:
✓ User signup / account deletion
✓ Order created / cancelled
✓ Payment success / failure / refund
✓ Subscription change
AI:
✓ AI API call (provider, model, tokens, cost, latency, status)
✓ AI error / timeout / rate limit
✓ Token usage per user / feature
System:
✓ Deployment start / complete / rollback
✓ Configuration change
✓ Health check failure
✓ Rate limit exceeded
NEVER LOG:
✗ Passwords or hashes
✗ Full credit card numbers (only last 4)
✗ Social security numbers
✗ API keys or tokens
✗ Full request bodies (may contain PII)
✗ Session tokens
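A hand-rolled redaction pass can be sketched as a recursive key filter (the key list below is illustrative, not exhaustive; Pino's built-in `redact` option does this for you):

```javascript
// Replace sensitive keys with [REDACTED] before a log entry is serialized.
const SENSITIVE = new Set(['password', 'token', 'ssn', 'creditCard', 'sessionToken']);

function redact(obj) {
  if (Array.isArray(obj)) return obj.map(redact);
  if (obj === null || typeof obj !== 'object') return obj;
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) =>
      SENSITIVE.has(key) ? [key, '[REDACTED]'] : [key, redact(value)]
    )
  );
}

console.log(redact({ user: 'alice', password: 'hunter2', card: { creditCard: '4111' } }));
// { user: 'alice', password: '[REDACTED]', card: { creditCard: '[REDACTED]' } }
```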
Request ID Flow
Client Request
│
▼
API Gateway ──→ generates requestId: "req-abc-123"
│ sets X-Request-Id header
│
├──→ Auth Service (logs with requestId)
│ │
│ ├──→ Order Service (logs with requestId)
│ │ │
│ │ └──→ Payment Service (logs with requestId)
│ │
│ └──→ Notification Service (logs with requestId)
│
▼
Response ──→ X-Request-Id: "req-abc-123" (returned to client)
To debug: filter requestId="req-abc-123" across ALL services
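Propagation in the diagram above means each service forwards the incoming `X-Request-Id` on every downstream call. A sketch of a header builder (service URL and helper name are illustrative):

```javascript
// Build headers for a downstream call, carrying the request ID forward.
// Assumes req.requestId was set earlier by the gateway or middleware.
function downstreamHeaders(req, extra = {}) {
  return {
    'X-Request-Id': req.requestId,
    'Content-Type': 'application/json',
    ...extra,
  };
}

// Usage with Node 18+ global fetch (hypothetical service URL):
// await fetch('http://payment-service/api/charge', {
//   method: 'POST',
//   headers: downstreamHeaders(req),
//   body: JSON.stringify(order),
// });
```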
Dashboard Layout
Row 1: SLI NUMBERS Uptime | Error Rate | p95 Latency | RPS
Row 2: RED TIME SERIES Rate graph | Error graph | Latency percentiles
Row 3: INFRASTRUCTURE CPU | Memory | Event Loop | DB Connections
Row 4: BUSINESS Signups | Orders/min | AI Calls | Payment Rate
Runbook Template
1. ALERT What alert fired, threshold, severity
2. IMPACT What users are experiencing
3. DIAGNOSIS Step-by-step investigation commands
4. RESOLUTION Fix steps for each root cause scenario
5. ESCALATION Who to contact if you cannot resolve
Common Gotchas
| Gotcha | Why |
|---|---|
| Using console.log in production | Not structured, not filterable, no levels |
| Logging PII (passwords, tokens) | Security breach, compliance violation |
| Alerting on averages | Hides long-tail latency problems |
| Alerting on single data points | Causes alert fatigue from brief spikes |
| No request ID propagation | Cannot trace requests across services |
| Same storage for app + audit logs | Audit logs need immutable, long-term storage |
| LOG_LEVEL=debug in production | Massive log volume and cost |
| No runbooks for alerts | On-call wastes time figuring out what to do |
| Error budget not tracked | No objective basis for deploy/fix decisions |
| Not monitoring AI costs | Token costs can spike unexpectedly |
Quick Code References
Pino logger setup
const pino = require('pino');
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
base: { service: 'my-service' },
redact: ['password', 'token', 'creditCard']
});
Request ID middleware
const { v4: uuidv4 } = require('uuid');
app.use((req, res, next) => {
req.requestId = req.headers['x-request-id'] || `req-${uuidv4()}`;
res.setHeader('X-Request-Id', req.requestId);
next();
});
Prometheus metrics
const { Counter, Histogram, collectDefaultMetrics, register } = require('prom-client');
collectDefaultMetrics();
const requests = new Counter({
  name: 'http_requests_total',
  help: 'Total requests',
  labelNames: ['method', 'route', 'status_code']
});
const duration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
End of 6.7 quick revision.