Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.7 — Logging and Observability
6.7 -- Exercise Questions: Logging and Observability
Practice questions for all three subtopics in Section 6.7. Mix of conceptual, code-oriented, and design tasks.
How to use this material

- Read lessons in order -- README.md, then 6.7.a -> 6.7.c.
- Answer closed-book first -- then compare to the matching lesson.
- Try the code exercises -- build a real logging setup in a Node.js project.
- Interview prep -- 6.7-Interview-Questions.md.
- Quick review -- 6.7-Quick-Revision.md.
6.7.a -- Structured Logging (Q1--Q12)
Q1. What is structured logging? Explain why console.log('User signed up') is insufficient in production.
Q2. List six core fields that every structured log entry should include. Explain why each field matters.
Q3. Name all six standard log levels in order from most severe to least severe. Give one concrete example of when to use each level.
Q4. If you set LOG_LEVEL=warn in production, which log levels will be visible and which will be suppressed? Why is this a good default for most production systems?
Q5. Compare Winston and Pino across three dimensions: speed, API style, and ecosystem. When would you choose one over the other?
Q6. Code exercise: Write a Winston logger configuration that: (a) outputs JSON in production, (b) outputs colorized text in development, (c) includes service name and version in every log, (d) writes errors to a separate file.
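One possible shape for this exercise, assuming Winston v3. The service name, version, and file name are placeholders for illustration; substitute your own.

```javascript
const winston = require('winston');

const isProd = process.env.NODE_ENV === 'production';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || (isProd ? 'warn' : 'debug'),
  // (c) stamped onto every log entry
  defaultMeta: { service: 'checkout-service', version: '1.4.2' },
  // (a) machine-readable JSON in production; (b) colorized text in development
  format: isProd
    ? winston.format.combine(winston.format.timestamp(), winston.format.json())
    : winston.format.combine(winston.format.colorize(), winston.format.simple()),
  transports: [
    new winston.transports.Console(),
    // (d) error-and-above entries also land in their own file
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
  ],
});
```

Note the transport-level `level` option: the file transport filters independently of the logger-level setting, which is how a single logger can fan out errors separately.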
Q7. What is a request ID and why is it the most important field for debugging microservices? Draw a diagram showing how it flows through 3 services.
Q8. Code exercise: Write an Express middleware that: (a) generates a request ID if not present in headers, (b) attaches it to req.requestId, (c) sets it as a response header.
Q9. Explain the difference between a request ID and a correlation ID. When would you use each?
Q10. What is a child logger? Write code showing how a child logger in Pino reduces repetitive logging code in a request handler.
Q11. Your application accidentally logged user passwords to CloudWatch. Name three strategies to prevent this from happening again.
Q12. Code exercise: Write a PII sanitizer function that replaces values for keys containing "password", "token", "creditCard", or "ssn" with [REDACTED], including nested objects.
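A possible sanitizer sketch. The key list and `[REDACTED]` marker come from the exercise; matching is case-insensitive and substring-based so `accessToken` or `creditCardNumber` are caught, and recursion covers nested objects and arrays. It returns a copy rather than mutating the input.

```javascript
const SENSITIVE = ['password', 'token', 'creditcard', 'ssn'];

function sanitize(value) {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value === null || typeof value !== 'object') return value;
  const out = {};
  for (const [key, val] of Object.entries(value)) {
    const lower = key.toLowerCase();
    if (SENSITIVE.some((s) => lower.includes(s))) {
      out[key] = '[REDACTED]';      // drop the value entirely, keep the key
    } else {
      out[key] = sanitize(val);     // recurse into nested structures
    }
  }
  return out;
}
```

In practice you would wire this into the logger's serializer/format hook so it runs on every entry, not at call sites.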
6.7.b -- Monitoring and Metrics (Q13--Q24)
Q13. Name the three pillars of observability. For each pillar, state: (a) what it measures, (b) its primary strength, and (c) its primary weakness.
Q14. What is the RED method? Spell out what each letter stands for and give the Prometheus metric type you would use for each.
Q15. What is the USE method? How does it differ from RED, and when do you use each?
Q16. Explain the four Prometheus metric types: Counter, Gauge, Histogram, and Summary. Give one real-world example of each.
Q17. Calculation: Your API received 10,000 requests in the last 5 minutes. 150 returned 5xx status codes. What is the error rate? If your SLO is "error rate < 1%", is the SLO being met?
Q18. Why do averages lie about latency? Given these 10 request durations: [42, 45, 48, 50, 51, 55, 58, 62, 95, 4200] (in ms), calculate the average, p50, p90, and p99. Which tells the real story?
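A worked check for the numbers above, using the nearest-rank percentile method (one common definition; monitoring systems often interpolate instead, which can shift values slightly):

```javascript
// Nearest-rank percentile: the value at position ceil(p/100 * n) in sorted data.
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const durations = [42, 45, 48, 50, 51, 55, 58, 62, 95, 4200].sort((a, b) => a - b);
const average = durations.reduce((sum, d) => sum + d, 0) / durations.length;

console.log(average);                   // 470.6 -- dragged up by the single 4200ms outlier
console.log(percentile(durations, 50)); // 51   -- what the typical request experienced
console.log(percentile(durations, 90)); // 95
console.log(percentile(durations, 99)); // 4200 -- the tail the average hides
```

Nine of the ten requests finished under 100ms, yet the average claims ~471ms: the percentiles tell the real story.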
Q19. Explain the difference between SLI, SLO, and SLA. Who cares about each one?
Q20. Calculation: Your SLO is 99.9% availability over 30 days. How many minutes of downtime is your error budget? If you have already had 25 minutes of downtime and there are 15 days left, what action should the team take?
Q21. Code exercise: Write an Express middleware that uses prom-client to track: (a) total request count by method, route, and status code, (b) request duration as a histogram.
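A dependency-free sketch of the mechanics this exercise asks for. In a real answer you would use prom-client's `Counter` and `Histogram`; here a hand-rolled in-memory registry stands in so the flow (label by method/route/status, observe duration when the response finishes) is visible without installing anything.

```javascript
const metrics = {
  requestsTotal: new Map(), // "GET /users 200" -> count (Counter stand-in)
  durationsMs: [],          // raw observations (a Histogram would bucket these)
};

function trackMetrics(req, res, next) {
  const start = process.hrtime.bigint();
  // 'finish' fires after the response is sent, so the status code is final.
  res.on('finish', () => {
    const route = req.route ? req.route.path : req.url;
    const key = `${req.method} ${route} ${res.statusCode}`;
    metrics.requestsTotal.set(key, (metrics.requestsTotal.get(key) || 0) + 1);
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    metrics.durationsMs.push(elapsedMs);
  });
  next();
}
```

The key design point carries over to prom-client directly: label by *route pattern* (`/users/:id`), never raw URL, or cardinality explodes.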
Q22. What Node.js-specific metrics should you monitor? Name at least four and explain what unhealthy values look like.
Q23. Describe the ideal layout for a production monitoring dashboard. What goes in each of the four rows?
Q24. Your manager asks: "Why do we need Prometheus and Grafana when we already log everything?" Explain the difference between logs and metrics with a cost and performance argument.
6.7.c -- Alerting and Event Logging (Q25--Q37)
Q25. What is the purpose of alerting? Describe the timeline difference between an incident discovered by a user report vs. an automated alert.
Q26. State the five rules of a good alert. For each rule, give one example of a bad alert that violates it.
Q27. Define the four alert severity levels (P1--P4). For each, specify the expected response time and notification method.
Q28. What is alert fatigue? List three signs that a team is suffering from it and three strategies to fix it.
Q29. Explain SLO burn rate alerting. Why is it better than alerting on raw metric thresholds?
Q30. Design exercise: Design an alerting strategy for an e-commerce checkout flow. Specify: (a) what metrics to monitor, (b) alert thresholds for P1 and P2, (c) who gets notified, (d) what the runbook should contain.
Q31. List five security events that MUST be logged in any production application. For each, list the fields to include in the log entry.
Q32. Code exercise: Write a logAuditEvent() function that logs who performed an action, what they did, what resource was affected, the outcome, and the IP address.
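One possible shape for the function. The field names (`actor`, `action`, `resource`, `outcome`, `ip`) are this sketch's choice, mapped one-to-one onto the five things the exercise requires; align them with your own audit-log schema.

```javascript
function logAuditEvent({ actor, action, resource, outcome, ip }) {
  const entry = {
    type: 'audit',
    timestamp: new Date().toISOString(),
    actor,    // who performed the action (stable user ID, not display name)
    action,   // what they did, e.g. 'user.delete'
    resource, // what was affected, e.g. 'user:42'
    outcome,  // 'success' | 'failure' -- failures matter as much as successes
    ip,       // client IP from the request
  };
  // One JSON line per event. In production this should go to a separate,
  // append-only audit sink, not the general application log stream.
  console.log(JSON.stringify(entry));
  return entry;
}
```

Returning the entry keeps the function testable and lets callers forward it to a second sink if needed.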
Q33. What is the difference between application logs and audit logs? Why should they be stored separately?
Q34. Name three compliance regulations (GDPR, HIPAA, PCI-DSS) and for each, describe one logging requirement.
Q35. What is a runbook? List the five sections every runbook should contain.
Q36. Design exercise: Your AI-powered feature calls the OpenAI API 50,000 times per day. Design: (a) what events to log for each call, (b) what metrics to track, (c) what alerts to set up with thresholds.
Q37. An on-call engineer is getting paged 15 times per week. Only 3 of those pages require action. Write a step-by-step plan to fix this problem, explaining what you would analyze and what changes you would make.
Answer Hints
| Q | Hint |
|---|---|
| Q4 | fatal, error, warn are visible; info, debug, trace are suppressed |
| Q5 | Pino: ~75K logs/sec; Winston: ~15K logs/sec; Pino: logger.info({ meta }, 'msg'); Winston: logger.info('msg', { meta }) |
| Q9 | Request ID: single HTTP chain; Correlation ID: business operation spanning multiple requests/queues |
| Q17 | 150/10,000 = 1.5% error rate; SLO is NOT being met (1.5% > 1%) |
| Q18 | Average = 470.6ms (misleading!); p50 = ~51ms; p90 = ~95ms; p99 = ~4200ms |
| Q19 | SLI = the measurement; SLO = the target; SLA = the contract with penalties |
| Q20 | 43,200 min x 0.1% = 43.2 min budget; 43.2 - 25 = 18.2 min remaining with 15 days left -- freeze risky deploys |
| Q26 | Actionable, urgent, real, unique, documented |
| Q29 | Raw thresholds alert on brief spikes; burn rate alerts on sustained SLO consumption |
| Q34 | GDPR: log data access; HIPAA: log health record access for 6 years; PCI-DSS: log auth events for 1 year |
| Q35 | Alert description, impact, diagnosis steps, resolution steps, escalation contacts |
<- Back to 6.7 -- Logging and Observability (README)