Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.7 — Logging and Observability
Interview Questions: Logging and Observability
Model answers for structured logging, monitoring/metrics, and alerting/event logging.
How to use this material (instructions)
- Read lessons in order -- README.md, then 6.7.a -> 6.7.c.
- Practice out loud -- definition -> example -> pitfall.
- Pair with exercises -- 6.7-Exercise-Questions.md.
- Quick review -- 6.7-Quick-Revision.md.
Beginner (Q1--Q4)
Q1. What is structured logging and why does it matter?
Why interviewers ask: Tests if you understand why production systems need machine-parseable logs. Every company with more than one service uses structured logging.
Model answer:
Structured logging means emitting log entries as machine-parseable records (typically JSON) with consistent, well-defined fields -- timestamp, log level, message, service name, and request context. It replaces unstructured console.log() statements with searchable, filterable, and aggregatable data.
It matters because in production you have dozens of services, hundreds of containers, and thousands of requests per second. Unstructured text like "User login successful" is useless when you need to answer questions like "How many login failures happened from IP 10.0.0.5 in the last hour?" or "Show me every log from request req-abc-123 across all services." Structured logging makes logs queryable (filter by level, service, userId), correlatable (request ID links logs across services), and parseable by tools like CloudWatch Insights, ELK Stack, or Datadog. Without it, debugging production issues at scale is effectively impossible.
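The idea can be sketched with a hand-rolled logger. This is for illustration only (in practice you would use a library such as Pino or Winston), and the field names like `service` and `requestId` are common conventions rather than a standard:

```javascript
// Minimal structured-logging sketch: every entry is one JSON object
// with consistent, queryable fields instead of free-form text.
function createLogger(service) {
  const emit = (level, message, context = {}) =>
    JSON.stringify({
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...context, // e.g. requestId, userId -- consistent, filterable fields
    });
  return {
    info: (msg, ctx) => emit('info', msg, ctx),
    error: (msg, ctx) => emit('error', msg, ctx),
  };
}

const log = createLogger('auth-service');
const line = log.info('User login successful', {
  requestId: 'req-abc-123',
  userId: 456,
});
console.log(line);
```

Because each line is a single JSON object, tools like CloudWatch Insights or the ELK Stack can filter by `level`, `service`, or `requestId` instead of grepping free text.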
Q2. What are the three pillars of observability?
Why interviewers ask: Foundational concept -- tests whether you understand the complete observability picture beyond just "adding logging."
Model answer:
The three pillars are logs, metrics, and traces, and each serves a different purpose:
Logs are discrete event records with full context -- "User 456 got a 500 error on POST /api/orders at 14:32." They have the most detail but are expensive to store and hard to aggregate across millions of events.
Metrics are numeric aggregations over time -- "500 requests per second, 2% error rate, p99 latency 320ms." They are cheap to collect and query, ideal for dashboards and alerts, but lack individual event detail.
Traces follow a single request's journey across multiple services -- "This request spent 50ms in the API gateway, 200ms in the auth service, 80ms in the database." They reveal where time is spent in distributed systems but require sampling at high traffic volumes.
They work together: a metric tells you error rate spiked, a trace shows which service in the chain is failing, and a log from that service reveals the exact error message and stack trace. You need all three for complete observability.
Q3. What are log levels and how do you use them correctly?
Why interviewers ask: Misusing log levels is a common junior engineer mistake -- too many errors that are actually warnings, debug logs left on in production.
Model answer:
Log levels categorize severity from most to least critical: fatal (system crashing), error (something failed that should not have), warn (unusual but recovered), info (normal operations worth noting), debug (detailed developer information), and trace (extremely granular step-by-step).
The level hierarchy means setting LOG_LEVEL=warn shows only fatal, error, and warn -- suppressing the higher-volume info and debug logs. In production, you typically run at info level. When debugging a specific issue, you temporarily lower to debug on the affected service.
The most common mistakes are: logging actual errors as warn (making them invisible in error dashboards), using error for expected failures like "user not found" (which should be warn or even info), and leaving debug enabled in production which generates massive log volume and cost. The key principle: error means something broke that should not have broken; warn means something unusual that might need attention; info means normal events you want to see in production.
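The hierarchy is usually implemented as numeric severities; a sketch (the numbers follow Pino's convention, but any monotonic scale works):

```javascript
// Level hierarchy sketch: a log line is emitted only if its severity is
// at or above the configured threshold, so LOG_LEVEL=warn suppresses
// the higher-volume info/debug/trace output.
const LEVELS = { fatal: 60, error: 50, warn: 40, info: 30, debug: 20, trace: 10 };

const shouldLog = (entryLevel, configuredLevel) =>
  LEVELS[entryLevel] >= LEVELS[configuredLevel];

console.log(shouldLog('error', 'warn')); // errors always pass a warn threshold
console.log(shouldLog('debug', 'info')); // debug is suppressed at info
```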
Q4. What is a request ID and why is it critical in microservices?
Why interviewers ask: Request ID propagation is the most practical debugging technique in distributed systems. If you have built microservices, you have used this.
Model answer:
A request ID is a unique identifier (typically a UUID) generated at the system's entry point (API gateway or first service) and propagated through every downstream service via HTTP headers (commonly X-Request-Id). Every log entry includes this ID, so you can filter all logs across all services for a single user request with one query.
Implementation: a middleware generates the ID (or accepts it from an incoming header), attaches it to the request object, and passes it in the response header. When calling downstream services, the request ID is forwarded in the headers. Each service includes it in every log entry.
Without request IDs, debugging a distributed failure means searching for timestamp-correlated logs across 5-10 services -- nearly impossible when processing thousands of requests per second. With request IDs, you simply filter requestId=req-abc-123 and see the complete request journey in chronological order: API gateway received it, auth service validated the token, order service processed the order, payment service charged the card, notification service sent the email. This is the single most valuable logging pattern in microservices.
Intermediate (Q5--Q8)
Q5. Explain the RED method and how you would implement it for a Node.js API.
Why interviewers ask: Tests practical knowledge of the standard service monitoring approach. Shows you can go beyond just logging to quantitative system monitoring.
Model answer:
The RED method defines three essential metrics for any request-driven service:
Rate -- requests per second. Tracks throughput and traffic patterns. Implemented as a Prometheus Counter (http_requests_total) incremented on every request, then queried with rate() for per-second values.
Errors -- failed requests per second or error percentage. Uses the same counter filtered by status codes >= 500. Error rate = rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100.
Duration -- response time distribution, measured as percentiles (p50, p95, p99). Implemented as a Prometheus Histogram (http_request_duration_seconds) with configured buckets. You observe each request's duration, then query histogram_quantile(0.99, rate(...[5m])) for p99.
Implementation in Express: create a middleware that starts a timer on request entry and, on res.finish, records the duration in the histogram and increments the counter with method, route, and status code labels. Expose a /metrics endpoint that Prometheus scrapes. This gives you real-time visibility into whether your API is healthy -- rate tells you if traffic is normal, errors tell you if things are breaking, duration tells you if things are slow.
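To make the three metrics concrete, here is an in-memory sketch of what that middleware records per request. In a real Node.js API you would use prom-client Counters/Histograms and let Prometheus compute `rate()` and `histogram_quantile()`; this hand-rolled version only illustrates the arithmetic:

```javascript
// In-memory RED sketch (illustrative only; use prom-client in production).
const requests = []; // { status, durationMs } per completed request

function recordRequest(status, durationMs) {
  requests.push({ status, durationMs });
}

function redSnapshot(windowSeconds) {
  const total = requests.length;
  const errors = requests.filter((r) => r.status >= 500).length;
  const sorted = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const p99 = sorted[Math.ceil(0.99 * sorted.length) - 1];
  return {
    rate: total / windowSeconds,       // Rate: requests per second
    errorRate: (errors / total) * 100, // Errors: percentage of 5xx responses
    p99DurationMs: p99,                // Duration: tail latency
  };
}

// 98 fast successes and 2 slow server errors over a 10-second window:
for (let i = 0; i < 98; i++) recordRequest(200, 40);
recordRequest(500, 900);
recordRequest(503, 1200);
console.log(redSnapshot(10));
```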
Q6. Explain SLIs, SLOs, and SLAs. How do error budgets work?
Why interviewers ask: This is the framework used by Google, Amazon, and every mature engineering organization to manage reliability. It turns the reality that "100% uptime is impossible" into a practical engineering tool.
Model answer:
SLI (Service Level Indicator) is the actual metric being measured -- for example, "99.97% of requests return a successful response" or "p99 latency is 280ms." It is the raw measurement from your monitoring system.
SLO (Service Level Objective) is the internal target for that SLI -- "we aim for 99.95% availability" or "p99 latency should be under 500ms." SLOs are set by the engineering team and are intentionally stricter than SLAs to provide a safety margin.
SLA (Service Level Agreement) is the contractual promise to customers -- "99.9% uptime or we issue service credits." SLAs have financial penalties and are set by the business.
Error budgets are the practical tool that makes SLOs useful. If your SLO is 99.9% over 30 days, you have 43.2 minutes of allowed downtime (0.1% of 43,200 minutes). This becomes a budget you "spend" on deployments, experiments, and incidents. When the budget is nearly exhausted, you freeze risky deployments and focus on reliability. When there is plenty of budget remaining, you can ship features aggressively. This transforms reliability from a subjective "are we reliable enough?" into a quantitative decision: "we have 25 minutes of error budget remaining with 15 days left -- proceed cautiously."
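The budget arithmetic above is simple enough to sketch directly:

```javascript
// Error-budget arithmetic: how much downtime an SLO allows over a
// window, and how much budget remains after incidents.
function errorBudgetMinutes(sloPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;    // 30 days = 43,200 minutes
  return windowMinutes * (1 - sloPercent / 100); // allowed downtime
}

const budget = errorBudgetMinutes(99.9, 30); // ~43.2 minutes, as above
const spent = 18;                            // say incidents cost 18 minutes so far
console.log(budget, (budget - spent).toFixed(1));
// With ~25 minutes left, the team proceeds cautiously -- the quantitative
// decision described above.
```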
Q7. How do you prevent logging sensitive data (PII) in production?
Why interviewers ask: PII leakage through logs is a real compliance and security risk (GDPR, PCI-DSS). Tests whether you think about security as part of engineering.
Model answer:
Preventing PII in logs requires a multi-layered approach:
Layer 1: Redaction at the logger level. Both Pino and Winston support automatic field redaction. Configure paths like password, token, creditCard, ssn, *.password (nested) to be replaced with [REDACTED]. This is the safety net -- even if a developer accidentally passes sensitive data, it gets stripped.
Layer 2: Code review discipline. Never log raw req.body -- always log specific, known-safe fields. Instead of logger.info({ body: req.body }), use logger.info({ userId: req.body.userId, plan: req.body.plan }).
Layer 3: Structured logging standards. Define an approved set of loggable fields in your team's logging guidelines. Anything not on the list should not be logged. Common safe fields: userId, requestId, method, path, statusCode, duration.
Layer 4: Automated scanning. Run periodic scans on stored logs looking for patterns that match credit card numbers (16 digits), email addresses, or SSN formats. Tools like Amazon Macie can detect PII in S3-stored logs.
Layer 5: Access control. Restrict who can read production logs. Not every developer needs access to all log data. Use CloudWatch Insights with IAM policies or role-based access in your log aggregation tool.
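Layer 1 can be sketched as a recursive redaction pass. Pino and Winston provide this built in (e.g. Pino's `redact` option); the hand-rolled version below just shows the idea, and the blocklist is illustrative, not exhaustive:

```javascript
// Recursive redaction sketch: the "safety net" layer that strips known
// sensitive fields even if a developer accidentally logs them.
const SENSITIVE = new Set(['password', 'token', 'creditcard', 'ssn', 'authorization']);

function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== 'object') return value;
  return Object.fromEntries(
    Object.entries(value).map(([key, v]) =>
      SENSITIVE.has(key.toLowerCase()) ? [key, '[REDACTED]'] : [key, redact(v)]
    )
  );
}

const safe = redact({
  userId: 456,
  password: 'hunter2',
  profile: { ssn: '123-45-6789', plan: 'pro' },
});
console.log(JSON.stringify(safe));
// userId and plan survive; password and the nested ssn become "[REDACTED]"
```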
Q8. Design a monitoring dashboard for a microservices application.
Why interviewers ask: Tests your ability to organize observability data into something useful. Separates people who have operated production systems from those who have not.
Model answer:
A well-designed dashboard follows a top-down information hierarchy:
Row 1: SLI summary (single-stat panels). Four large numbers with color coding: uptime percentage (green if above SLO), error rate (green if below threshold), p95 latency, and requests per second. This row answers "is the system healthy right now?" at a glance.
Row 2: RED metrics (time-series graphs). Three panels showing request rate, error rate, and latency percentiles (p50, p95, p99) over time. These reveal trends -- a slowly increasing p99 signals a growing problem before it becomes critical.
Row 3: Infrastructure (gauges). CPU utilization, memory usage, event loop lag (Node.js), database connection pool usage, and cache hit rate. These identify the resource causing the problem when RED metrics show degradation.
Row 4: Business metrics. Signups, orders per minute, AI API calls, payment success rate. These are what stakeholders care about and often the earliest indicator of user-facing problems.
Each service gets its own dashboard with this layout, plus one overview dashboard showing all services side by side. Default time range is 6 hours. Every panel links to a more detailed drill-down view. Critical panels include threshold lines showing SLO targets.
Advanced (Q9--Q11)
Q9. Design an alerting strategy that avoids alert fatigue for a 10-service microservices system.
Why interviewers ask: Alert fatigue is one of the most common operational failures. This tests your ability to design a practical alerting system that actually works.
Model answer:
The core strategy is alert on user-facing symptoms, not internal metrics, using SLO burn rate rather than raw thresholds.
Tier 1 (P1 -- pages on-call): Only two types of alerts page. First, SLO burn rate alerts: if the error budget is being consumed at 14x the sustainable rate for 5+ minutes, the SLO will be breached in 2 days -- page immediately. Second, complete outage detection: if the health check endpoint returns unhealthy for 2+ minutes, page.
Tier 2 (P2 -- Slack #alerts-urgent): SLO burn rate at 6x for 30+ minutes. Individual service error rate > 5% sustained for 5 minutes. p99 latency > 2x the SLO target for 10 minutes.
Tier 3 (P3 -- Slack #alerts-info): Infrastructure warnings (CPU > 70%, memory > 75%, disk > 80%). Certificate expiry < 14 days. Dependency health warnings.
Anti-fatigue measures: (1) Monthly alert review -- every alert that fired gets categorized as actionable or noise; noise gets tuned or deleted. (2) Deduplication -- one incident generates one alert, not one per service. (3) Maintenance windows -- silence alerts during planned deployments. (4) Escalation with timeout -- if primary does not acknowledge in 5 minutes, escalate to secondary. (5) Track alert-to-action ratio as a metric itself -- target > 80% of pages requiring real action.
Every P1/P2 alert links to a runbook with specific diagnosis and resolution steps. The on-call engineer should never have to figure out what to do from scratch.
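The burn-rate multiples in the tiers come from a simple ratio, sketched here (in practice a tool like Prometheus evaluates this over multiple windows):

```javascript
// Burn rate = observed error rate / allowed error rate (1 - SLO).
// A burn rate of 1x would exactly exhaust the budget at the window's end;
// 14x exhausts a 30-day budget in roughly 30 / 14 ≈ 2 days, which is why
// the tiers above page at 14x and warn at 6x.
function burnRate(observedErrorRate, sloPercent) {
  const allowed = 1 - sloPercent / 100; // e.g. 0.001 for a 99.9% SLO
  return observedErrorRate / allowed;
}

console.log(burnRate(0.014, 99.9)); // a sustained 1.4% error rate = 14x burn
console.log(burnRate(0.006, 99.9)); // 0.6% = 6x burn -> Tier 2
```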
Q10. Your team's average API latency is 50ms but users are complaining about slow responses. Walk through your investigation process using observability data.
Why interviewers ask: Tests systematic debugging skills using all three pillars of observability. Reveals whether you understand why averages are misleading.
Model answer:
The average is almost certainly hiding a long tail. My investigation:
Step 1: Check percentiles, not averages. Pull up the Grafana dashboard showing p50, p95, and p99 latency. If p50 is 45ms but p99 is 3 seconds, then 1% of users (potentially thousands per day) are waiting 3+ seconds. The average of 50ms is mathematically correct but experientially wrong.
Step 2: Identify the slow requests. Filter metrics by route and method. Is the long tail on all endpoints or specific ones? Commonly, a /search or /ai-generate endpoint has much worse p99 than /api/users. Also segment by response status -- 5xx errors might have long latency due to timeouts.
Step 3: Pull traces for slow requests. Use distributed tracing to look at requests above the p95 threshold. The trace waterfall shows where time is spent: is it the database query? An external API call? CPU-intensive processing? This pinpoints the bottleneck.
Step 4: Check the logs for those traces. Filter structured logs by the trace/request ID of a specific slow request. Look for: slow query warnings, retry attempts, cache misses, timeout errors, or errors from downstream services.
Step 5: Check infrastructure metrics. Is event loop lag high during the slow periods? Is the database connection pool saturated? Are there garbage collection pauses correlating with latency spikes?
Step 6: Correlate with time. Do the latency spikes happen at specific times (peak traffic hours, cron jobs, deployments)? This narrows the cause to capacity, competing workloads, or code changes.
The fix depends on the root cause: add caching for slow queries, increase timeouts or add circuit breakers for external APIs, scale out if it is a capacity issue, or optimize the hot code path if it is CPU-bound.
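Step 1's "mathematically correct but experientially wrong" claim is easy to demonstrate with a synthetic latency sample (the 20ms/3s split below is illustrative):

```javascript
// A healthy-looking average hiding a long tail: ~99% of requests are
// fast, ~1% hit a 3-second tail.
const latencies = [];
for (let i = 0; i < 989; i++) latencies.push(20);  // fast majority: 20ms
for (let i = 0; i < 11; i++) latencies.push(3000); // long tail: 3s

const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;
const sorted = [...latencies].sort((a, b) => a - b);
const p99 = sorted[Math.ceil(0.99 * sorted.length) - 1]; // nearest-rank p99

console.log(mean); // ~53ms -- the dashboard looks healthy
console.log(p99);  // 3000ms -- what the unlucky tail actually experiences
```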
Q11. Design a complete observability strategy for an AI-powered feature that makes 50,000 OpenAI API calls per day.
Why interviewers ask: AI features have unique observability requirements (token costs, model latency, hallucination risk, rate limits). Tests your ability to apply observability principles to a cutting-edge use case.
Model answer:
AI features require observability across four dimensions: reliability, cost, quality, and compliance.
Structured logging for every AI call: Log: timestamp, requestId, userId, feature name, model used, prompt tokens, completion tokens, total tokens, estimated cost, latency, response status (success/error/timeout/rate-limited), temperature setting, and a hash of the prompt (for debugging, not the full prompt which may contain user data).
Metrics (Prometheus):
- `ai_calls_total{provider, model, status}` -- Counter for total calls, filtered by success/error/rate_limited
- `ai_tokens_total{provider, model, direction}` -- Counter for tokens consumed (input vs output)
- `ai_request_duration_seconds{provider, model}` -- Histogram with buckets [0.5, 1, 2, 5, 10, 30, 60]
- `ai_estimated_cost_dollars` -- Counter tracking running cost
- `ai_rate_limit_remaining{provider}` -- Gauge tracking remaining rate limit capacity
Alerts:
- P1: AI error rate > 10% for 5 min (feature is broken for users)
- P2: AI p99 latency > 30s (users are experiencing timeouts)
- P2: Daily AI spend exceeds budget threshold (cost runaway prevention)
- P3: Rate limit remaining < 20% (approaching provider limits)
- P3: Token usage trending 50% above daily baseline (unexpected cost growth)
Quality monitoring: Sample 1-5% of AI responses for human review. Log response metadata (not full responses, for privacy) that allows quality scoring. Track user feedback signals: thumbs up/down, regeneration requests, manual edits to AI output. Alert if the regeneration rate exceeds a threshold (may indicate model quality degradation).
Cost dashboard: Real-time running cost per model, per feature, per day. Cost-per-user calculation. Projected monthly spend based on current rate. Anomaly detection for sudden cost spikes.
Compliance: Audit log every AI call with user ID for data access reviews. Retain logs per your data retention policy. Do not log full prompts or responses that may contain user PII -- log hashes or metadata instead.
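The estimated-cost counter and the daily-budget alert both rest on per-call cost arithmetic, sketched here. The per-1K-token prices and the model name are placeholders, not real provider pricing; in practice they should come from configuration since prices vary by model and change over time:

```javascript
// Per-call cost estimation feeding the ai_estimated_cost_dollars counter.
const PRICE_PER_1K_TOKENS = {
  // PLACEHOLDER rates for illustration only:
  'example-model': { input: 0.0005, output: 0.0015 },
};

let estimatedCostDollars = 0; // running counter, as in the metrics above

function recordAiCall(model, promptTokens, completionTokens) {
  const price = PRICE_PER_1K_TOKENS[model];
  const cost =
    (promptTokens / 1000) * price.input +
    (completionTokens / 1000) * price.output;
  estimatedCostDollars += cost;
  return cost;
}

// One call with 2,000 prompt tokens and 500 completion tokens:
const cost = recordAiCall('example-model', 2000, 500);
console.log(cost); // 0.001 + 0.00075 = $0.00175 per call
// At 50,000 calls/day this call shape projects to ~$87.50/day -- the
// number the daily budget alert and cost dashboard watch.
```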
Quick-fire
| # | Question | One-line answer |
|---|---|---|
| 1 | Structured logging output format? | JSON -- machine-parseable with consistent fields |
| 2 | Most important log field for microservices? | requestId -- correlates logs across services |
| 3 | Winston vs Pino speed? | Pino is ~5x faster (~75K vs ~15K logs/sec) |
| 4 | What does the RED method measure? | Rate, Errors, Duration for request-driven services |
| 5 | p99 latency means what? | 99% of requests are faster than this value |
| 6 | SLO vs SLA? | SLO is internal target; SLA is external contract with penalties |
| 7 | What causes alert fatigue? | Too many non-actionable alerts that train the team to ignore pages |
| 8 | What is a runbook? | Step-by-step procedure for responding to a specific alert |
| 9 | Why separate audit logs from app logs? | Audit logs need immutable storage and longer retention for compliance |
| 10 | Error budget for 99.9% SLO over 30 days? | 43.2 minutes of allowed downtime |
<- Back to 6.7 -- Logging and Observability (README)