Episode 6 — Scaling Reliability Microservices Web3 / 6.7 — Logging and Observability
6.7.b -- Monitoring and Metrics
In one sentence: Metrics are numeric measurements collected over time that tell you how your system is performing right now -- combining the RED method (Rate, Errors, Duration) for services with infrastructure gauges and business KPIs gives you a complete, real-time dashboard of system health.
Navigation: <- 6.7.a Structured Logging | 6.7.c -- Alerting and Event Logging ->
1. Three Pillars of Observability
Observability is built on three complementary pillars. Each answers a different question.
┌──────────────────────────────────────────────────────────────────┐
│ THE THREE PILLARS │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ LOGS │ │ METRICS │ │ TRACES │ │
│ ├─────────────┤ ├─────────────┤ ├─────────────┤ │
│ │ Discrete │ │ Aggregated │ │ End-to-end │ │
│ │ events │ │ numbers │ │ request │ │
│ │ │ │ over time │ │ path │ │
│ ├─────────────┤ ├─────────────┤ ├─────────────┤ │
│ │ "User 456 │ │ "500 req/s" │ │ "API -> │ │
│ │ got a 500 │ │ "2% error │ │ Auth -> │ │
│ │ error on │ │ rate" │ │ DB -> │ │
│ │ /api/pay" │ │ "p99=320ms" │ │ Cache -> │ │
│ │ │ │ │ │ Response" │ │
│ ├─────────────┤ ├─────────────┤ ├─────────────┤ │
│ │ HIGH volume │ │ LOW volume │ │ MEDIUM vol │ │
│ │ HIGH detail │ │ HIGH speed │ │ HIGH detail │ │
│ │ Expensive │ │ Cheap │ │ Moderate │ │
│ │ to store │ │ to store │ │ cost │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ USE TOGETHER: │
│ Metric says "error rate spiked" → Trace shows which service │
│ is slow → Log shows the exact error message and stack trace │
└──────────────────────────────────────────────────────────────────┘
| Pillar | What It Is | Strength | Weakness |
|---|---|---|---|
| Logs | Individual event records with full context | Complete detail of any single event | Expensive at high volume; hard to aggregate |
| Metrics | Numeric values aggregated over time intervals | Cheap, fast to query, good for dashboards/alerts | No individual event detail |
| Traces | Distributed request path across services | Shows where time is spent across service boundaries | Sampling needed at high traffic; setup complexity |
2. Types of Metrics
Application Metrics (RED Method)
The RED method is the gold standard for monitoring any request-driven service:
| Letter | Metric | What It Measures | Example |
|---|---|---|---|
| R | Rate | Requests per second | 500 req/s |
| E | Errors | Failed requests per second (or error %) | 2% error rate |
| D | Duration | Response time (latency distribution) | p50=45ms, p95=180ms, p99=320ms |
// Collecting RED metrics in Node.js
const promClient = require('prom-client');
// R - Rate: Counter (always goes up)
const httpRequestsTotal = new promClient.Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// E - Errors: Counter filtered by status code
// (Use the same counter above, filter where status_code >= 500)
// D - Duration: Histogram (distribution of values)
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
Infrastructure Metrics (USE Method)
The USE method is for monitoring infrastructure resources:
| Letter | Metric | What It Measures | Example |
|---|---|---|---|
| U | Utilization | % of resource busy | CPU at 75% |
| S | Saturation | Queue depth / backlog | 50 requests queued |
| E | Errors | Resource error count | 3 disk I/O errors |
Common infrastructure metrics to track:
| Resource | Metrics | Warning Threshold |
|---|---|---|
| CPU | Utilization %, load average | > 70% sustained |
| Memory | Used %, heap size (Node.js) | > 80% used |
| Disk | Usage %, I/O wait, IOPS | > 85% full |
| Network | Bandwidth, packet loss, connections | Loss > 0.1% |
| Node.js Heap | process.memoryUsage().heapUsed | > 80% of --max-old-space-size |
| Event Loop | Lag in ms | > 100ms |
Business Metrics
These track what matters to the business -- they are the ultimate measure of system health.
| Category | Metrics |
|---|---|
| User Activity | Signups/day, active users, login failures |
| Revenue | Orders/minute, payment success rate, cart abandonment |
| AI Usage | AI API calls, tokens consumed, AI error rate, AI latency |
| Engagement | Page views, feature adoption, session duration |
// Business metrics example
const signupsTotal = new promClient.Counter({
name: 'business_signups_total',
help: 'Total user signups',
labelNames: ['plan', 'source']
});
const aiApiCalls = new promClient.Counter({
name: 'ai_api_calls_total',
help: 'Total AI API calls',
labelNames: ['provider', 'model', 'status']
});
const aiTokensUsed = new promClient.Counter({
name: 'ai_tokens_used_total',
help: 'Total AI tokens consumed',
labelNames: ['provider', 'model', 'direction'] // direction: input/output
});
const aiRequestDuration = new promClient.Histogram({
name: 'ai_request_duration_seconds',
help: 'AI API request duration',
labelNames: ['provider', 'model'],
buckets: [0.5, 1, 2, 5, 10, 30, 60]
});
// Usage
signupsTotal.inc({ plan: 'free', source: 'organic' });
aiApiCalls.inc({ provider: 'openai', model: 'gpt-4o', status: 'success' });
aiTokensUsed.inc({ provider: 'openai', model: 'gpt-4o', direction: 'input' }, 1500);
3. Metric Types Explained
Understanding the four fundamental metric types is essential for choosing the right one.
| Type | Behavior | Use Case | Example |
|---|---|---|---|
| Counter | Only goes up (resets on restart) | Totals: requests, errors, bytes | http_requests_total |
| Gauge | Goes up and down | Current values: temp, queue size, active connections | node_memory_heap_used_bytes |
| Histogram | Counts values in configurable buckets | Distributions: latency, response size | http_request_duration_seconds |
| Summary | Like histogram but calculates percentiles client-side | Pre-calculated percentiles | http_request_duration_summary |
const promClient = require('prom-client');
// Counter -- total requests (monotonically increasing)
const counter = new promClient.Counter({
name: 'api_requests_total',
help: 'Total API requests'
});
counter.inc(); // +1
counter.inc(5); // +5
// Gauge -- current active connections (goes up and down)
const gauge = new promClient.Gauge({
name: 'active_connections',
help: 'Current active WebSocket connections'
});
gauge.set(42); // Set to 42
gauge.inc(); // Now 43
gauge.dec(); // Now 42
// Histogram -- request duration distribution
const histogram = new promClient.Histogram({
name: 'request_duration_seconds',
help: 'Request latency distribution',
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
histogram.observe(0.234); // Record a 234ms request
// Timing helper
const end = histogram.startTimer();
await handleRequest();
end(); // Automatically records elapsed time
4. Prometheus and Grafana Overview
Prometheus collects and stores metrics. Grafana visualizes them as dashboards. Together they are the most common open-source monitoring stack.
┌──────────────┐ scrape /metrics ┌──────────────┐
│ Your App │ ◄────────────────────────│ Prometheus │
│ (Node.js) │ every 15 seconds │ (Time- │
│ │ │ Series DB) │
│ Exposes │ │ │
│ /metrics │ │ Stores all │
│ endpoint │ │ metric data │
└──────────────┘ └──────┬───────┘
│
│ PromQL queries
▼
┌──────────────┐
│ Grafana │
│ │
│ Dashboards │
│ Alerts │
│ Panels │
└──────────────┘
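The scrape arrow in the diagram is driven by Prometheus's own configuration. A minimal sketch (the job name and target port are placeholders for your setup):

```yaml
# prometheus.yml -- poll the app's /metrics endpoint every 15 seconds
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:3000']  # host:port where the Node.js app listens
```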
Exposing a /metrics endpoint in Express
const express = require('express');
const promClient = require('prom-client');
const logger = require('./config/logger');
const app = express();
// Collect default Node.js metrics (CPU, memory, event loop, GC)
promClient.collectDefaultMetrics({
prefix: 'myapp_',
gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
});
// Custom RED metrics
const httpRequestsTotal = new promClient.Counter({
name: 'myapp_http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const httpRequestDuration = new promClient.Histogram({
name: 'myapp_http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
// Metrics middleware -- instruments every request
app.use((req, res, next) => {
const end = httpRequestDuration.startTimer();
res.on('finish', () => {
const route = req.route?.path || req.path; // prefer the route pattern; raw paths can explode label cardinality
const labels = {
method: req.method,
route: route,
status_code: res.statusCode
};
httpRequestsTotal.inc(labels);
end(labels); // Records duration with labels
});
next();
});
// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
try {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
} catch (err) {
logger.error({ err }, 'Failed to collect metrics');
res.status(500).end();
}
});
Sample /metrics output:
# HELP myapp_http_requests_total Total HTTP requests
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",route="/api/users",status_code="200"} 15234
myapp_http_requests_total{method="POST",route="/api/orders",status_code="201"} 3421
myapp_http_requests_total{method="POST",route="/api/orders",status_code="500"} 12
# HELP myapp_http_request_duration_seconds HTTP request duration in seconds
# TYPE myapp_http_request_duration_seconds histogram
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.05"} 12000
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.1"} 14500
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.25"} 15100
5. CloudWatch Custom Metrics (AWS)
For AWS deployments, CloudWatch is often used instead of (or alongside) Prometheus.
// CloudWatch custom metrics with the AWS SDK for JavaScript v2.
// (v2 is in maintenance mode; for new code prefer the v3 client,
// @aws-sdk/client-cloudwatch, which accepts the same PutMetricData params.)
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });
async function publishMetric(metricName, value, unit, dimensions = []) {
const params = {
Namespace: 'MyApp/Production',
MetricData: [{
MetricName: metricName,
Value: value,
Unit: unit,
Timestamp: new Date(),
Dimensions: dimensions
}]
};
await cloudwatch.putMetricData(params).promise();
}
// Example calls (run from inside an async function -- CommonJS modules have no top-level await)
await publishMetric('OrdersCreated', 1, 'Count', [
{ Name: 'Service', Value: 'order-service' },
{ Name: 'Environment', Value: 'production' }
]);
await publishMetric('PaymentLatency', 342, 'Milliseconds', [
{ Name: 'Service', Value: 'payment-service' },
{ Name: 'Provider', Value: 'stripe' }
]);
await publishMetric('AITokensUsed', 1500, 'Count', [
{ Name: 'Provider', Value: 'openai' },
{ Name: 'Model', Value: 'gpt-4o' }
]);
6. Percentile Latency (p50, p95, p99)
Averages lie. If 99 requests take 50ms and 1 request takes 5000ms, the average is 99.5ms -- which hides the fact that 1% of users wait 5 seconds. Percentiles reveal the truth.
Example: 100 requests sorted by latency
p50 (median): 50% of requests are faster than this → 45ms
p90: 90% of requests are faster than this → 120ms
p95: 95% of requests are faster than this → 200ms
p99: 99% of requests are faster than this → 850ms
p99.9: 99.9% of requests are faster than this → 3200ms
Average: 98ms ← HIDES the long tail!
┌────────────────────────────────────────────────────────────┐
│ Latency Distribution │
│ │
│ ████████████████████████ ← Most requests: 30-60ms │
│ ████████████████ │
│ ████████████ │
│ ████████ ← p95: 200ms │
│ ████ │
│ ██ ← p99: 850ms │
│ █ │
│ ░ ← p99.9: 3200ms (the "long tail") │
│ ░ │
│ └──────────────────────────────────────────────────────── │
│ 0ms 100ms 500ms 1s 2s 3s 5s │
└────────────────────────────────────────────────────────────┘
Which percentiles to track
| Percentile | Who It Represents | When to Use |
|---|---|---|
| p50 | The typical user experience | General health check |
| p95 | 1 in 20 users sees this or worse | Standard SLO target |
| p99 | 1 in 100 users sees this or worse | High-reliability SLO |
| p99.9 | 1 in 1000 users | For critical paths (checkout, login) |
Querying percentiles with Prometheus (PromQL)
# p50 latency over last 5 minutes
histogram_quantile(0.50, rate(myapp_http_request_duration_seconds_bucket[5m]))
# p95 latency over last 5 minutes
histogram_quantile(0.95, rate(myapp_http_request_duration_seconds_bucket[5m]))
# p99 latency over last 5 minutes
histogram_quantile(0.99, rate(myapp_http_request_duration_seconds_bucket[5m]))
# Error rate percentage
sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m])) * 100
# Request rate (requests per second)
sum(rate(myapp_http_requests_total[5m]))
7. SLIs, SLOs, and SLAs Explained
These three concepts form the reliability framework used by every serious engineering team.
SLA (Agreement) "We promise 99.9% uptime"
│ (External, contractual, with penalties)
│
▼
SLO (Objective) "We target 99.95% uptime internally"
│ (Internal, stricter than SLA, engineering target)
│
▼
SLI (Indicator) "Current uptime: 99.97%"
(The actual measurement from your metrics)
| Term | Full Name | What It Is | Who Cares | Example |
|---|---|---|---|---|
| SLI | Service Level Indicator | The actual metric being measured | Engineers | 99.97% of requests return 2xx |
| SLO | Service Level Objective | The target value for the SLI | Engineering team | We target p99 latency < 500ms |
| SLA | Service Level Agreement | Contractual promise with penalties | Business, customers | 99.9% uptime or credits issued |
Common SLIs for a web service
// SLI: Availability
// "What percentage of requests succeed?"
const availability = successfulRequests / totalRequests;
// SLO: availability > 99.95% over 30 days
// SLI: Latency
// "What is the p99 response time?"
const p99Latency = getPercentile(requestDurations, 0.99);
// SLO: p99 < 500ms
// SLI: Error rate
// "What percentage of requests return 5xx?"
const errorRate = serverErrors / totalRequests;
// SLO: error rate < 0.1%
// SLI: Throughput
// "How many requests can we handle per second?"
const throughput = totalRequests / timeWindowSeconds;
// SLO: sustain > 1000 req/s without degradation
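The getPercentile helper above is left undefined; one way to sketch it is the nearest-rank method on a sorted copy of the samples (a simplification -- Prometheus histograms interpolate within buckets instead):

```javascript
// Sketch implementation of the getPercentile helper from the SLI
// examples, using the nearest-rank method on a sorted copy.
function getPercentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(p * sorted.length));
  return sorted[rank - 1];
}

// 99 fast requests plus one slow outlier, as in the averages example
const durations = [...Array(99).fill(50), 5000];
console.log(getPercentile(durations, 0.5));   // 50 (median)
console.log(getPercentile(durations, 0.999)); // 5000 (the long tail)
```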
Error budgets
An error budget is the inverse of your SLO. If your SLO is 99.9% uptime in 30 days:
30 days = 43,200 minutes
0.1% error budget = 43.2 minutes of allowed downtime per month
If you've used 20 minutes of downtime:
Remaining budget: 23.2 minutes
→ You can deploy with confidence
If you've used 40 minutes:
Remaining budget: 3.2 minutes
→ STOP deploying. Focus on reliability.
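The arithmetic above is simple enough to automate. A sketch (the 25% "freeze deploys" cutoff is an illustrative policy choice, not a standard):

```javascript
// Sketch: error-budget arithmetic for the example above
// (99.9% SLO over a 30-day window). The freeze threshold is illustrative.
function errorBudget(sloTarget, windowMinutes, downtimeUsedMinutes) {
  const budgetMinutes = windowMinutes * (1 - sloTarget);
  const remainingMinutes = budgetMinutes - downtimeUsedMinutes;
  return {
    budgetMinutes,
    remainingMinutes,
    shouldFreezeDeploys: remainingMinutes <= 0.25 * budgetMinutes
  };
}

const windowMinutes = 30 * 24 * 60; // 43,200 minutes
console.log(errorBudget(0.999, windowMinutes, 20)); // ~23.2 min left -> keep shipping
console.log(errorBudget(0.999, windowMinutes, 40)); // ~3.2 min left -> freeze deploys
```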
8. Node.js Performance Monitoring
Node.js has specific metrics that are critical to track due to its single-threaded, event-driven architecture.
const promClient = require('prom-client');
// 1. Event loop lag -- how backed up is the event loop?
const eventLoopLag = new promClient.Gauge({
name: 'nodejs_eventloop_lag_seconds',
help: 'Event loop lag in seconds'
});
// Measure event loop lag
function measureEventLoopLag() {
const start = process.hrtime.bigint();
setImmediate(() => {
const lag = Number(process.hrtime.bigint() - start) / 1e9;
eventLoopLag.set(lag);
});
}
setInterval(measureEventLoopLag, 1000);
// 2. Memory usage
const memoryUsage = new promClient.Gauge({
name: 'nodejs_memory_usage_bytes',
help: 'Node.js memory usage',
labelNames: ['type']
});
function collectMemoryMetrics() {
const mem = process.memoryUsage();
memoryUsage.set({ type: 'heapUsed' }, mem.heapUsed);
memoryUsage.set({ type: 'heapTotal' }, mem.heapTotal);
memoryUsage.set({ type: 'rss' }, mem.rss);
memoryUsage.set({ type: 'external' }, mem.external);
}
setInterval(collectMemoryMetrics, 5000);
// 3. Active handles and requests (detect leaks)
// NOTE: process._getActiveHandles()/_getActiveRequests() are undocumented
// internals that may change between Node versions; on Node 17.3+,
// process.getActiveResourcesInfo() is the supported alternative.
const activeHandles = new promClient.Gauge({
name: 'nodejs_active_handles',
help: 'Number of active handles'
});
const activeRequests = new promClient.Gauge({
name: 'nodejs_active_requests',
help: 'Number of active requests'
});
setInterval(() => {
activeHandles.set(process._getActiveHandles().length);
activeRequests.set(process._getActiveRequests().length);
}, 5000);
Key Node.js metrics to watch
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Event loop lag | < 10ms | 10-100ms | > 100ms |
| Heap used | < 60% of limit | 60-80% | > 80% |
| RSS memory | Stable | Slowly growing | Continuously growing (leak) |
| Active handles | Stable count | Growing slowly | Growing fast (leak) |
| GC pause duration | < 50ms | 50-200ms | > 200ms |
9. Creating Dashboards
A good monitoring dashboard answers key questions at a glance.
Dashboard layout pattern
┌──────────────────────────────────────────────────────────────┐
│ SERVICE HEALTH DASHBOARD │
│ │
│ Row 1: THE BIG NUMBERS (SLIs) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Uptime │ │ Error │ │ p95 │ │ Requests │ │
│ │ 99.97% │ │ Rate │ │ Latency │ │ Per Sec │ │
│ │ ✓ │ │ 0.3% ✓ │ │ 180ms ✓ │ │ 523 rps │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Row 2: RED METRICS (time series graphs) │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Request Rate │ │ Error Rate │ │ Latency │ │
│ │ ~~~~~~~~~~~ │ │ ___________ │ │ p50 ------ │ │
│ │ ~~~~~~~~~~~~ │ │ ___________ │ │ p95 ------- │ │
│ │ ~~~~~~~~~~~~~│ │ ___________ │ │ p99 -------- │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │
│ Row 3: INFRASTRUCTURE │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ CPU Usage │ │ Memory │ │ Event Loop │ │
│ │ ████░░░ 45% │ │ ██████░ 62% │ │ Lag: 3ms │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
│ │
│ Row 4: BUSINESS METRICS │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Signups Today │ │ Orders/min │ │ AI API Calls │ │
│ │ 1,234 │ │ 45 │ │ 12,500 today │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Dashboard best practices
- Top row = SLIs with color coding (green/yellow/red against SLO thresholds)
- Second row = RED metrics as time-series graphs (spot trends and anomalies)
- Third row = Infrastructure gauges (CPU, memory, disk, event loop)
- Bottom row = Business metrics (the things stakeholders care about)
- Time range selector -- default to last 6 hours, allow drill-down
- One dashboard per service plus one overview dashboard for all services
10. Complete Metrics Collection Example
A full metrics module for a Node.js microservice.
// src/metrics/index.js
const promClient = require('prom-client');
// Collect default Node.js metrics
promClient.collectDefaultMetrics({ prefix: 'myapp_' });
// ---- RED Metrics ----
const httpRequestsTotal = new promClient.Counter({
name: 'myapp_http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
const httpRequestDuration = new promClient.Histogram({
name: 'myapp_http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
// ---- Business Metrics ----
const businessEvents = new promClient.Counter({
name: 'myapp_business_events_total',
help: 'Business event counter',
labelNames: ['event_type', 'status']
});
const aiMetrics = {
calls: new promClient.Counter({
name: 'myapp_ai_calls_total',
help: 'AI API calls',
labelNames: ['provider', 'model', 'status']
}),
tokens: new promClient.Counter({
name: 'myapp_ai_tokens_total',
help: 'AI tokens consumed',
labelNames: ['provider', 'model', 'direction']
}),
duration: new promClient.Histogram({
name: 'myapp_ai_duration_seconds',
help: 'AI API call duration',
labelNames: ['provider', 'model'],
buckets: [0.5, 1, 2, 5, 10, 30, 60]
})
};
// ---- Infrastructure Metrics ----
const dbQueryDuration = new promClient.Histogram({
name: 'myapp_db_query_duration_seconds',
help: 'Database query duration',
labelNames: ['operation', 'collection'],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
});
const cacheHitRate = new promClient.Counter({
name: 'myapp_cache_operations_total',
help: 'Cache hit/miss counter',
labelNames: ['operation'] // hit, miss, set, delete
});
const activeConnections = new promClient.Gauge({
name: 'myapp_active_connections',
help: 'Current active connections',
labelNames: ['type'] // http, websocket, db
});
// ---- Express Middleware ----
function metricsMiddleware(req, res, next) {
const end = httpRequestDuration.startTimer();
res.on('finish', () => {
const route = req.route?.path || req.path; // prefer the route pattern; raw paths can explode label cardinality
const labels = {
method: req.method,
route,
status_code: res.statusCode
};
httpRequestsTotal.inc(labels);
end(labels);
});
next();
}
// ---- Metrics Endpoint ----
async function metricsHandler(req, res) {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
}
module.exports = {
httpRequestsTotal,
httpRequestDuration,
businessEvents,
aiMetrics,
dbQueryDuration,
cacheHitRate,
activeConnections,
metricsMiddleware,
metricsHandler
};
// src/app.js -- Using the metrics module
const express = require('express');
const {
metricsMiddleware,
metricsHandler,
businessEvents,
aiMetrics
} = require('./metrics');
const app = express();
app.use(metricsMiddleware);
app.get('/metrics', metricsHandler);
// In a route handler
app.post('/api/orders', async (req, res) => {
try {
const order = await createOrder(req.body);
businessEvents.inc({ event_type: 'order_created', status: 'success' });
res.status(201).json(order);
} catch (error) {
businessEvents.inc({ event_type: 'order_created', status: 'failure' });
res.status(500).json({ error: 'Failed to create order' });
}
});
// Tracking AI calls (assumes an `openai` client has been initialized elsewhere)
async function callAI(prompt) {
const end = aiMetrics.duration.startTimer({ provider: 'openai', model: 'gpt-4o' });
try {
const response = await openai.chat.completions.create({ model: 'gpt-4o', messages: [{ role: 'user', content: prompt }] });
aiMetrics.calls.inc({ provider: 'openai', model: 'gpt-4o', status: 'success' });
aiMetrics.tokens.inc({ provider: 'openai', model: 'gpt-4o', direction: 'input' }, response.usage.prompt_tokens);
aiMetrics.tokens.inc({ provider: 'openai', model: 'gpt-4o', direction: 'output' }, response.usage.completion_tokens);
end();
return response;
} catch (error) {
aiMetrics.calls.inc({ provider: 'openai', model: 'gpt-4o', status: 'error' });
end();
throw error;
}
}
11. Key Takeaways
- Three pillars work together -- logs for detail, metrics for aggregation, traces for request flow. You need all three.
- RED method for services -- Rate, Errors, Duration cover 90% of what you need to monitor.
- Percentiles over averages -- p95 and p99 reveal the experience of your worst-off users.
- SLIs measure, SLOs target, SLAs promise -- your error budget tells you when to ship features vs fix reliability.
- Prometheus + Grafana is the standard open-source stack; CloudWatch is the AWS-native alternative.
- Node.js-specific metrics matter -- event loop lag, heap usage, and active handles catch problems unique to Node.
Explain-It Challenge
- Your average latency is 50ms but users are complaining about slow responses. Explain what metric you should look at instead and why.
- The error budget for your SLO has been 80% consumed with 20 days left in the month. What action do you recommend to the team?
- A manager asks "Why do we need Prometheus if we already have logs?" Explain the difference between logs and metrics with a concrete example.
Navigation: <- 6.7.a Structured Logging | 6.7.c -- Alerting and Event Logging ->