Episode 6 — Scaling Reliability Microservices Web3 / 6.7 — Logging and Observability

6.7.b -- Monitoring and Metrics

In one sentence: Metrics are numeric measurements collected over time that tell you how your system is performing right now -- combining the RED method (Rate, Errors, Duration) for services with infrastructure gauges and business KPIs gives you a complete, real-time dashboard of system health.

Navigation: <- 6.7.a Structured Logging | 6.7.c -- Alerting and Event Logging ->


1. Three Pillars of Observability

Observability is built on three complementary pillars. Each answers a different question.

┌──────────────────────────────────────────────────────────────────┐
│                  THE THREE PILLARS                                │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │    LOGS     │    │   METRICS   │    │   TRACES    │          │
│  ├─────────────┤    ├─────────────┤    ├─────────────┤          │
│  │ Discrete    │    │ Aggregated  │    │ End-to-end  │          │
│  │ events      │    │ numbers     │    │ request     │          │
│  │             │    │ over time   │    │ path        │          │
│  ├─────────────┤    ├─────────────┤    ├─────────────┤          │
│  │ "User 456   │    │ "500 req/s" │    │ "API ->     │          │
│  │  got a 500  │    │ "2% error   │    │  Auth ->    │          │
│  │  error on   │    │  rate"      │    │  DB ->      │          │
│  │  /api/pay"  │    │ "p99=320ms" │    │  Cache ->   │          │
│  │             │    │             │    │  Response"  │          │
│  ├─────────────┤    ├─────────────┤    ├─────────────┤          │
│  │ HIGH volume │    │ LOW volume  │    │ MEDIUM vol  │          │
│  │ HIGH detail │    │ HIGH speed  │    │ HIGH detail │          │
│  │ Expensive   │    │ Cheap       │    │ Moderate    │          │
│  │ to store    │    │ to store    │    │ cost        │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                  │
│  USE TOGETHER:                                                   │
│  Metric says "error rate spiked" → Trace shows which service    │
│  is slow → Log shows the exact error message and stack trace    │
└──────────────────────────────────────────────────────────────────┘

Pillar  | What It Is                                     | Strength                                             | Weakness
--------|------------------------------------------------|------------------------------------------------------|----------------------------------------------------
Logs    | Individual event records with full context     | Complete detail of any single event                  | Expensive at high volume; hard to aggregate
Metrics | Numeric values aggregated over time intervals  | Cheap, fast to query, good for dashboards/alerts     | No individual event detail
Traces  | Distributed request path across services       | Shows where time is spent across service boundaries  | Sampling needed at high traffic; setup complexity

2. Types of Metrics

Application Metrics (RED Method)

The RED method is the gold standard for monitoring any request-driven service:

Letter | Metric   | What It Measures                        | Example
-------|----------|-----------------------------------------|-------------------------------
R      | Rate     | Requests per second                     | 500 req/s
E      | Errors   | Failed requests per second (or error %) | 2% error rate
D      | Duration | Response time (latency distribution)    | p50=45ms, p95=180ms, p99=320ms

// Collecting RED metrics in Node.js
const promClient = require('prom-client');

// R - Rate: Counter (always goes up)
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// E - Errors: Counter filtered by status code
// (Use the same counter above, filter where status_code >= 500)

// D - Duration: Histogram (distribution of values)
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

Infrastructure Metrics (USE Method)

The USE method is for monitoring infrastructure resources:

Letter | Metric      | What It Measures      | Example
-------|-------------|-----------------------|--------------------
U      | Utilization | % of resource busy    | CPU at 75%
S      | Saturation  | Queue depth / backlog | 50 requests queued
E      | Errors      | Resource error count  | 3 disk I/O errors

Common infrastructure metrics to track:

Resource     | Metrics                             | Warning Threshold
-------------|-------------------------------------|-------------------------------
CPU          | Utilization %, load average         | > 70% sustained
Memory       | Used %, heap size (Node.js)         | > 80% used
Disk         | Usage %, I/O wait, IOPS             | > 85% full
Network      | Bandwidth, packet loss, connections | Loss > 0.1%
Node.js Heap | process.memoryUsage().heapUsed      | > 80% of --max-old-space-size
Event Loop   | Lag in ms                           | > 100ms

Business Metrics

These track what matters to the business -- they are the ultimate measure of system health.

Category      | Metrics
--------------|----------------------------------------------------------
User Activity | Signups/day, active users, login failures
Revenue       | Orders/minute, payment success rate, cart abandonment
AI Usage      | AI API calls, tokens consumed, AI error rate, AI latency
Engagement    | Page views, feature adoption, session duration

// Business metrics example
const signupsTotal = new promClient.Counter({
  name: 'business_signups_total',
  help: 'Total user signups',
  labelNames: ['plan', 'source']
});

const aiApiCalls = new promClient.Counter({
  name: 'ai_api_calls_total',
  help: 'Total AI API calls',
  labelNames: ['provider', 'model', 'status']
});

const aiTokensUsed = new promClient.Counter({
  name: 'ai_tokens_used_total',
  help: 'Total AI tokens consumed',
  labelNames: ['provider', 'model', 'direction']  // direction: input/output
});

const aiRequestDuration = new promClient.Histogram({
  name: 'ai_request_duration_seconds',
  help: 'AI API request duration',
  labelNames: ['provider', 'model'],
  buckets: [0.5, 1, 2, 5, 10, 30, 60]
});

// Usage
signupsTotal.inc({ plan: 'free', source: 'organic' });
aiApiCalls.inc({ provider: 'openai', model: 'gpt-4o', status: 'success' });
aiTokensUsed.inc({ provider: 'openai', model: 'gpt-4o', direction: 'input' }, 1500);

3. Metric Types Explained

Understanding the four fundamental metric types is essential for choosing the right one.

Type      | Behavior                                            | Use Case                                          | Example
----------|-----------------------------------------------------|---------------------------------------------------|-------------------------------
Counter   | Only goes up (resets on restart)                    | Totals: requests, errors, bytes                   | http_requests_total
Gauge     | Goes up and down                                    | Current values: temp, queue size, active connections | node_memory_heap_used_bytes
Histogram | Counts values in configurable buckets               | Distributions: latency, response size             | http_request_duration_seconds
Summary   | Like histogram but calculates percentiles client-side | Pre-calculated percentiles                      | http_request_duration_summary

const promClient = require('prom-client');

// Counter -- total requests (monotonically increasing)
const counter = new promClient.Counter({
  name: 'api_requests_total',
  help: 'Total API requests'
});
counter.inc();       // +1
counter.inc(5);      // +5

// Gauge -- current active connections (goes up and down)
const gauge = new promClient.Gauge({
  name: 'active_connections',
  help: 'Current active WebSocket connections'
});
gauge.set(42);       // Set to 42
gauge.inc();         // Now 43
gauge.dec();         // Now 42

// Histogram -- request duration distribution
const histogram = new promClient.Histogram({
  name: 'request_duration_seconds',
  help: 'Request latency distribution',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
histogram.observe(0.234);  // Record a 234ms request

// Timing helper (call from an async context; handleRequest is your own handler)
const end = histogram.startTimer();
await handleRequest();
end();  // Automatically records elapsed time

4. Prometheus and Grafana Overview

Prometheus collects and stores metrics. Grafana visualizes them as dashboards. Together they are the most common open-source monitoring stack.

┌──────────────┐     scrape /metrics      ┌──────────────┐
│  Your App    │ ◄────────────────────────│  Prometheus  │
│  (Node.js)   │     every 15 seconds     │  (Time-      │
│              │                           │   Series DB) │
│  Exposes     │                           │              │
│  /metrics    │                           │  Stores all  │
│  endpoint    │                           │  metric data │
└──────────────┘                           └──────┬───────┘
                                                  │
                                                  │ PromQL queries
                                                  ▼
                                           ┌──────────────┐
                                           │   Grafana    │
                                           │              │
                                           │  Dashboards  │
                                           │  Alerts      │
                                           │  Panels      │
                                           └──────────────┘

Exposing a /metrics endpoint in Express

const express = require('express');
const promClient = require('prom-client');
const logger = require('./config/logger');

const app = express();

// Collect default Node.js metrics (CPU, memory, event loop, GC)
promClient.collectDefaultMetrics({
  prefix: 'myapp_',
  gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
});

// Custom RED metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'myapp_http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestDuration = new promClient.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

// Metrics middleware -- instruments every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route: route,
      status_code: res.statusCode
    };

    httpRequestsTotal.inc(labels);
    end(labels);  // Records duration with labels
  });

  next();
});

// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
  try {
    res.set('Content-Type', promClient.register.contentType);
    res.end(await promClient.register.metrics());
  } catch (err) {
    logger.error({ err }, 'Failed to collect metrics');
    res.status(500).end();
  }
});

Sample /metrics output:

# HELP myapp_http_requests_total Total HTTP requests
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",route="/api/users",status_code="200"} 15234
myapp_http_requests_total{method="POST",route="/api/orders",status_code="201"} 3421
myapp_http_requests_total{method="POST",route="/api/orders",status_code="500"} 12

# HELP myapp_http_request_duration_seconds HTTP request duration in seconds
# TYPE myapp_http_request_duration_seconds histogram
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.05"} 12000
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.1"} 14500
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.25"} 15100

5. CloudWatch Custom Metrics (AWS)

For AWS deployments, CloudWatch is often used instead of (or alongside) Prometheus.

// AWS SDK v2 shown; v2 is in maintenance mode. In SDK v3, use
// @aws-sdk/client-cloudwatch (CloudWatchClient + PutMetricDataCommand).
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

async function publishMetric(metricName, value, unit, dimensions = []) {
  const params = {
    Namespace: 'MyApp/Production',
    MetricData: [{
      MetricName: metricName,
      Value: value,
      Unit: unit,
      Timestamp: new Date(),
      Dimensions: dimensions
    }]
  };

  await cloudwatch.putMetricData(params).promise();
}

// Examples (call from an async context -- top-level await is not available in CommonJS)
await publishMetric('OrdersCreated', 1, 'Count', [
  { Name: 'Service', Value: 'order-service' },
  { Name: 'Environment', Value: 'production' }
]);

await publishMetric('PaymentLatency', 342, 'Milliseconds', [
  { Name: 'Service', Value: 'payment-service' },
  { Name: 'Provider', Value: 'stripe' }
]);

await publishMetric('AITokensUsed', 1500, 'Count', [
  { Name: 'Provider', Value: 'openai' },
  { Name: 'Model', Value: 'gpt-4o' }
]);

6. Percentile Latency (p50, p95, p99)

Averages lie. If 99 requests take 50ms and 1 request takes 5000ms, the average is 99.5ms -- which hides the fact that 1% of users wait 5 seconds. Percentiles reveal the truth.

Example: 100 requests sorted by latency

  p50 (median):  50% of requests are faster than this   → 45ms
  p90:           90% of requests are faster than this   → 120ms
  p95:           95% of requests are faster than this   → 200ms
  p99:           99% of requests are faster than this   → 850ms
  p99.9:         99.9% of requests are faster than this → 3200ms

  Average:       98ms  ← HIDES the long tail!

  ┌────────────────────────────────────────────────────────────┐
  │  Latency Distribution                                      │
  │                                                            │
  │  ████████████████████████  ← Most requests: 30-60ms       │
  │  ████████████████                                          │
  │  ████████████                                              │
  │  ████████         ← p95: 200ms                             │
  │  ████                                                      │
  │  ██               ← p99: 850ms                             │
  │  █                                                         │
  │  ░                ← p99.9: 3200ms (the "long tail")       │
  │  ░                                                         │
  │  └──────────────────────────────────────────────────────── │
  │  0ms    100ms    500ms    1s      2s      3s      5s       │
  └────────────────────────────────────────────────────────────┘
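
The percentile arithmetic above can be reproduced in a few lines. This sketch uses the nearest-rank method, one common convention; monitoring backends may interpolate slightly differently.

```javascript
// Percentile from a sorted sample, nearest-rank method.
function percentile(sortedValues, p) {
  const rank = Math.ceil(p * sortedValues.length);
  return sortedValues[Math.max(0, rank - 1)];
}

// 99 fast requests (50ms) and one 5000ms outlier, as in the text
const latencies = [...Array(99).fill(50), 5000].sort((a, b) => a - b);

const avg = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;
console.log(avg);                          // 99.5 -- looks fine
console.log(percentile(latencies, 0.50));  // 50
console.log(percentile(latencies, 0.99));  // 50 -- a 1-in-100 outlier sits
console.log(percentile(latencies, 0.999)); // 5000   exactly at the p99 edge
```

Note that with a single outlier in 100 requests, nearest-rank p99 still lands on the fast side; only p99.9 (and the max) expose this tail, which is why critical paths track p99.9.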

Which percentiles to track

Percentile | Who It Represents                  | When to Use
-----------|------------------------------------|-------------------------------------
p50        | The typical user experience        | General health check
p95        | 1 in 20 users sees this or worse   | Standard SLO target
p99        | 1 in 100 users sees this or worse  | High-reliability SLO
p99.9      | 1 in 1000 users                    | For critical paths (checkout, login)

Querying percentiles with Prometheus (PromQL)

# p50 latency over last 5 minutes
histogram_quantile(0.50, rate(myapp_http_request_duration_seconds_bucket[5m]))

# p95 latency over last 5 minutes
histogram_quantile(0.95, rate(myapp_http_request_duration_seconds_bucket[5m]))

# p99 latency over last 5 minutes
histogram_quantile(0.99, rate(myapp_http_request_duration_seconds_bucket[5m]))

# Error rate percentage
sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m])) * 100

# Request rate (requests per second)
sum(rate(myapp_http_requests_total[5m]))
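
histogram_quantile never sees raw latencies -- it estimates the quantile from cumulative bucket counts by linearly interpolating inside the bucket that contains the target rank. This is a simplified sketch of that idea (real Prometheus also handles the +Inf bucket and various edge cases):

```javascript
// Estimate a quantile from cumulative histogram buckets, the way
// histogram_quantile does (simplified: no +Inf bucket, no edge cases).
function histogramQuantile(q, buckets) {
  // buckets: [{ le: upperBound, count: cumulativeCount }, ...] sorted by le
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      const lower = i > 0 ? buckets[i - 1].le : 0;
      const prevCount = i > 0 ? buckets[i - 1].count : 0;
      // Linear interpolation within the matching bucket
      const fraction = (rank - prevCount) / (buckets[i].count - prevCount);
      return lower + (buckets[i].le - lower) * fraction;
    }
  }
  return buckets[buckets.length - 1].le;
}

// Cumulative buckets matching the sample /metrics output in section 4
const buckets = [
  { le: 0.05, count: 12000 },
  { le: 0.1,  count: 14500 },
  { le: 0.25, count: 15100 }
];
console.log(histogramQuantile(0.95, buckets).toFixed(3));  // 0.097 (p95 ≈ 97ms)
```

This is why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile falls into.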

7. SLIs, SLOs, and SLAs Explained

These three concepts form the reliability framework used by every serious engineering team.

  SLA (Agreement)     "We promise 99.9% uptime"
      │                    (External, contractual, with penalties)
      │
      ▼
  SLO (Objective)     "We target 99.95% uptime internally"
      │                    (Internal, stricter than SLA, engineering target)
      │
      ▼
  SLI (Indicator)     "Current uptime: 99.97%"
                           (The actual measurement from your metrics)

Term | Full Name               | What It Is                         | Who Cares           | Example
-----|-------------------------|------------------------------------|---------------------|-------------------------------
SLI  | Service Level Indicator | The actual metric being measured   | Engineers           | 99.97% of requests return 2xx
SLO  | Service Level Objective | The target value for the SLI       | Engineering team    | We target p99 latency < 500ms
SLA  | Service Level Agreement | Contractual promise with penalties | Business, customers | 99.9% uptime or credits issued

Common SLIs for a web service

// SLI: Availability
// "What percentage of requests succeed?"
const availability = successfulRequests / totalRequests;
// SLO: availability > 99.95% over 30 days

// SLI: Latency
// "What is the p99 response time?"
const p99Latency = getPercentile(requestDurations, 0.99);
// SLO: p99 < 500ms

// SLI: Error rate
// "What percentage of requests return 5xx?"
const errorRate = serverErrors / totalRequests;
// SLO: error rate < 0.1%

// SLI: Throughput
// "How many requests can we handle per second?"
const throughput = totalRequests / timeWindowSeconds;
// SLO: sustain > 1000 req/s without degradation

Error budgets

An error budget is the inverse of your SLO. If your SLO is 99.9% uptime in 30 days:

30 days = 43,200 minutes
0.1% error budget = 43.2 minutes of allowed downtime per month

If you've used 20 minutes of downtime:
  Remaining budget: 23.2 minutes
  → You can deploy with confidence

If you've used 40 minutes:
  Remaining budget: 3.2 minutes
  → STOP deploying. Focus on reliability.
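
The budget arithmetic above is simple enough to encode directly. This is an illustrative helper, not a standard API:

```javascript
// Error-budget arithmetic for an availability SLO over a rolling window.
// (errorBudget is an illustrative helper, not a library function.)
function errorBudget(sloPercent, windowDays, downtimeMinutesUsed) {
  const windowMinutes = windowDays * 24 * 60;
  // The budget is everything the SLO does NOT promise
  const budgetMinutes = windowMinutes * (1 - sloPercent / 100);
  return {
    budgetMinutes,
    remainingMinutes: budgetMinutes - downtimeMinutesUsed,
    consumedPercent: (downtimeMinutesUsed / budgetMinutes) * 100
  };
}

console.log(errorBudget(99.9, 30, 20));
// budget ≈ 43.2 min, remaining ≈ 23.2 min, ≈46% consumed -- safe to deploy
```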

8. Node.js Performance Monitoring

Node.js has specific metrics that are critical to track due to its single-threaded, event-driven architecture.

const promClient = require('prom-client');

// 1. Event loop lag -- how backed up is the event loop?
const eventLoopLag = new promClient.Gauge({
  name: 'nodejs_eventloop_lag_seconds',
  help: 'Event loop lag in seconds'
});

// Measure event loop lag
function measureEventLoopLag() {
  const start = process.hrtime.bigint();
  setImmediate(() => {
    const lag = Number(process.hrtime.bigint() - start) / 1e9;
    eventLoopLag.set(lag);
  });
}
setInterval(measureEventLoopLag, 1000);

// 2. Memory usage
const memoryUsage = new promClient.Gauge({
  name: 'nodejs_memory_usage_bytes',
  help: 'Node.js memory usage',
  labelNames: ['type']
});

function collectMemoryMetrics() {
  const mem = process.memoryUsage();
  memoryUsage.set({ type: 'heapUsed' }, mem.heapUsed);
  memoryUsage.set({ type: 'heapTotal' }, mem.heapTotal);
  memoryUsage.set({ type: 'rss' }, mem.rss);
  memoryUsage.set({ type: 'external' }, mem.external);
}
setInterval(collectMemoryMetrics, 5000);

// 3. Active handles and requests (detect leaks)
// Note: _getActiveHandles/_getActiveRequests are undocumented internals and
// may change; on newer Node versions, prefer process.getActiveResourcesInfo()
const activeHandles = new promClient.Gauge({
  name: 'nodejs_active_handles',
  help: 'Number of active handles'
});
const activeRequests = new promClient.Gauge({
  name: 'nodejs_active_requests',
  help: 'Number of active requests'
});

setInterval(() => {
  activeHandles.set(process._getActiveHandles().length);
  activeRequests.set(process._getActiveRequests().length);
}, 5000);

Key Node.js metrics to watch

Metric            | Healthy        | Warning        | Critical
------------------|----------------|----------------|-----------------------------
Event loop lag    | < 10ms         | 10-100ms       | > 100ms
Heap used         | < 60% of limit | 60-80%         | > 80%
RSS memory        | Stable         | Slowly growing | Continuously growing (leak)
Active handles    | Stable count   | Growing slowly | Growing fast (leak)
GC pause duration | < 50ms         | 50-200ms       | > 200ms

9. Creating Dashboards

A good monitoring dashboard answers key questions at a glance.

Dashboard layout pattern

┌──────────────────────────────────────────────────────────────┐
│  SERVICE HEALTH DASHBOARD                                    │
│                                                              │
│  Row 1: THE BIG NUMBERS (SLIs)                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐      │
│  │ Uptime   │ │ Error    │ │ p95      │ │ Requests │      │
│  │ 99.97%   │ │ Rate     │ │ Latency  │ │ Per Sec  │      │
│  │   ✓      │ │ 0.3%  ✓ │ │ 180ms ✓  │ │ 523 rps  │      │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘      │
│                                                              │
│  Row 2: RED METRICS (time series graphs)                    │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │  Request Rate  │ │  Error Rate    │ │  Latency       │  │
│  │  ~~~~~~~~~~~   │ │  ___________   │ │  p50 ------    │  │
│  │  ~~~~~~~~~~~~  │ │  ___________   │ │  p95 -------   │  │
│  │  ~~~~~~~~~~~~~│ │  ___________   │ │  p99 --------  │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
│                                                              │
│  Row 3: INFRASTRUCTURE                                      │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │  CPU Usage     │ │  Memory        │ │  Event Loop    │  │
│  │  ████░░░ 45%   │ │  ██████░ 62%   │ │  Lag: 3ms      │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
│                                                              │
│  Row 4: BUSINESS METRICS                                    │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │  Signups Today │ │  Orders/min    │ │  AI API Calls  │  │
│  │  1,234         │ │  45            │ │  12,500 today  │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Dashboard best practices

  1. Top row = SLIs with color coding (green/yellow/red against SLO thresholds)
  2. Second row = RED metrics as time-series graphs (spot trends and anomalies)
  3. Third row = Infrastructure gauges (CPU, memory, disk, event loop)
  4. Bottom row = Business metrics (the things stakeholders care about)
  5. Time range selector -- default to last 6 hours, allow drill-down
  6. One dashboard per service plus one overview dashboard for all services

10. Complete Metrics Collection Example

A full metrics module for a Node.js microservice.

// src/metrics/index.js
const promClient = require('prom-client');

// Collect default Node.js metrics
promClient.collectDefaultMetrics({ prefix: 'myapp_' });

// ---- RED Metrics ----

const httpRequestsTotal = new promClient.Counter({
  name: 'myapp_http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestDuration = new promClient.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

// ---- Business Metrics ----

const businessEvents = new promClient.Counter({
  name: 'myapp_business_events_total',
  help: 'Business event counter',
  labelNames: ['event_type', 'status']
});

const aiMetrics = {
  calls: new promClient.Counter({
    name: 'myapp_ai_calls_total',
    help: 'AI API calls',
    labelNames: ['provider', 'model', 'status']
  }),
  tokens: new promClient.Counter({
    name: 'myapp_ai_tokens_total',
    help: 'AI tokens consumed',
    labelNames: ['provider', 'model', 'direction']
  }),
  duration: new promClient.Histogram({
    name: 'myapp_ai_duration_seconds',
    help: 'AI API call duration',
    labelNames: ['provider', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60]
  })
};

// ---- Infrastructure Metrics ----

const dbQueryDuration = new promClient.Histogram({
  name: 'myapp_db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['operation', 'collection'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
});

const cacheHitRate = new promClient.Counter({
  name: 'myapp_cache_operations_total',
  help: 'Cache hit/miss counter',
  labelNames: ['operation']  // hit, miss, set, delete
});

const activeConnections = new promClient.Gauge({
  name: 'myapp_active_connections',
  help: 'Current active connections',
  labelNames: ['type']  // http, websocket, db
});

// ---- Express Middleware ----

function metricsMiddleware(req, res, next) {
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode
    };
    httpRequestsTotal.inc(labels);
    end(labels);
  });

  next();
}

// ---- Metrics Endpoint ----

async function metricsHandler(req, res) {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
}

module.exports = {
  httpRequestsTotal,
  httpRequestDuration,
  businessEvents,
  aiMetrics,
  dbQueryDuration,
  cacheHitRate,
  activeConnections,
  metricsMiddleware,
  metricsHandler
};
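
The cacheHitRate counter records raw hit/miss events; the ratio itself is computed at query time. For example, in PromQL (using the metric name defined above):

```promql
# Cache hit ratio over the last 5 minutes
sum(rate(myapp_cache_operations_total{operation="hit"}[5m]))
/ sum(rate(myapp_cache_operations_total{operation=~"hit|miss"}[5m]))
```

Storing the raw counts and dividing at query time keeps the metric cheap and lets you recompute the ratio over any window.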
// src/app.js -- Using the metrics module
const express = require('express');
const {
  metricsMiddleware,
  metricsHandler,
  businessEvents,
  aiMetrics
} = require('./metrics');

const app = express();

app.use(metricsMiddleware);
app.get('/metrics', metricsHandler);

// In a route handler
app.post('/api/orders', async (req, res) => {
  try {
    const order = await createOrder(req.body);
    businessEvents.inc({ event_type: 'order_created', status: 'success' });
    res.status(201).json(order);
  } catch (error) {
    businessEvents.inc({ event_type: 'order_created', status: 'failure' });
    res.status(500).json({ error: 'Failed to create order' });
  }
});

// Tracking AI calls (assumes an already-initialized OpenAI client `openai`)
async function callAI(prompt) {
  const end = aiMetrics.duration.startTimer({ provider: 'openai', model: 'gpt-4o' });
  try {
    const response = await openai.chat.completions.create({ model: 'gpt-4o', messages: [{ role: 'user', content: prompt }] });
    aiMetrics.calls.inc({ provider: 'openai', model: 'gpt-4o', status: 'success' });
    aiMetrics.tokens.inc({ provider: 'openai', model: 'gpt-4o', direction: 'input' }, response.usage.prompt_tokens);
    aiMetrics.tokens.inc({ provider: 'openai', model: 'gpt-4o', direction: 'output' }, response.usage.completion_tokens);
    end();
    return response;
  } catch (error) {
    aiMetrics.calls.inc({ provider: 'openai', model: 'gpt-4o', status: 'error' });
    end();
    throw error;
  }
}

11. Key Takeaways

  1. Three pillars work together -- logs for detail, metrics for aggregation, traces for request flow. You need all three.
  2. RED method for services -- Rate, Errors, Duration cover 90% of what you need to monitor.
  3. Percentiles over averages -- p95 and p99 reveal the experience of your worst-off users.
  4. SLIs measure, SLOs target, SLAs promise -- your error budget tells you when to ship features vs fix reliability.
  5. Prometheus + Grafana is the standard open-source stack; CloudWatch is the AWS-native alternative.
  6. Node.js-specific metrics matter -- event loop lag, heap usage, and active handles catch problems unique to Node.

Explain-It Challenge

  1. Your average latency is 50ms but users are complaining about slow responses. Explain what metric you should look at instead and why.
  2. The error budget for your SLO has been 80% consumed with 20 days left in the month. What action do you recommend to the team?
  3. A manager asks "Why do we need Prometheus if we already have logs?" Explain the difference between logs and metrics with a concrete example.

Navigation: <- 6.7.a Structured Logging | 6.7.c -- Alerting and Event Logging ->