Episode 6 — Scaling Reliability Microservices Web3 / 6.7 — Logging and Observability

6.7.b -- Monitoring and Metrics

In one sentence: Metrics are numeric measurements collected over time that tell you how your system is performing right now -- combining the RED method (Rate, Errors, Duration) for services with infrastructure gauges and business KPIs gives you a complete, real-time dashboard of system health.

Navigation: <- 6.7.a Structured Logging | 6.7.c -- Alerting and Event Logging ->


1. Three Pillars of Observability

Observability is built on three complementary pillars. Each answers a different question.

┌──────────────────────────────────────────────────────────────────┐
│                  THE THREE PILLARS                                │
│                                                                  │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │    LOGS     │    │   METRICS   │    │   TRACES    │          │
│  ├─────────────┤    ├─────────────┤    ├─────────────┤          │
│  │ Discrete    │    │ Aggregated  │    │ End-to-end  │          │
│  │ events      │    │ numbers     │    │ request     │          │
│  │             │    │ over time   │    │ path        │          │
│  ├─────────────┤    ├─────────────┤    ├─────────────┤          │
│  │ "User 456   │    │ "500 req/s" │    │ "API ->     │          │
│  │  got a 500  │    │ "2% error   │    │  Auth ->    │          │
│  │  error on   │    │  rate"      │    │  DB ->      │          │
│  │  /api/pay"  │    │ "p99=320ms" │    │  Cache ->   │          │
│  │             │    │             │    │  Response"  │          │
│  ├─────────────┤    ├─────────────┤    ├─────────────┤          │
│  │ HIGH volume │    │ LOW volume  │    │ MEDIUM vol  │          │
│  │ HIGH detail │    │ HIGH speed  │    │ HIGH detail │          │
│  │ Expensive   │    │ Cheap       │    │ Moderate    │          │
│  │ to store    │    │ to store    │    │ cost        │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                  │
│  USE TOGETHER:                                                   │
│  Metric says "error rate spiked" → Trace shows which service    │
│  is slow → Log shows the exact error message and stack trace    │
└──────────────────────────────────────────────────────────────────┘

Pillar  | What It Is                                     | Strength                                             | Weakness
--------|------------------------------------------------|------------------------------------------------------|----------------------------------------------------
Logs    | Individual event records with full context     | Complete detail of any single event                  | Expensive at high volume; hard to aggregate
Metrics | Numeric values aggregated over time intervals  | Cheap, fast to query, good for dashboards/alerts     | No individual event detail
Traces  | Distributed request path across services       | Shows where time is spent across service boundaries  | Sampling needed at high traffic; setup complexity

2. Types of Metrics

Application Metrics (RED Method)

The RED method is the gold standard for monitoring any request-driven service:

Letter | Metric   | What It Measures                        | Example
-------|----------|-----------------------------------------|-------------------------------
R      | Rate     | Requests per second                     | 500 req/s
E      | Errors   | Failed requests per second (or error %) | 2% error rate
D      | Duration | Response time (latency distribution)    | p50=45ms, p95=180ms, p99=320ms

// Collecting RED metrics in Node.js
const promClient = require('prom-client');

// R - Rate: Counter (always goes up)
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

// E - Errors: Counter filtered by status code
// (Use the same counter above, filter where status_code >= 500)

// D - Duration: Histogram (distribution of values)
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

Infrastructure Metrics (USE Method)

The USE method is for monitoring infrastructure resources:

Letter | Metric      | What It Measures      | Example
-------|-------------|-----------------------|--------------------
U      | Utilization | % of resource busy    | CPU at 75%
S      | Saturation  | Queue depth / backlog | 50 requests queued
E      | Errors      | Resource error count  | 3 disk I/O errors

Common infrastructure metrics to track:

Resource     | Metrics                             | Warning Threshold
-------------|-------------------------------------|-------------------------------
CPU          | Utilization %, load average         | > 70% sustained
Memory       | Used %, heap size (Node.js)         | > 80% used
Disk         | Usage %, I/O wait, IOPS             | > 85% full
Network      | Bandwidth, packet loss, connections | Loss > 0.1%
Node.js Heap | process.memoryUsage().heapUsed      | > 80% of --max-old-space-size
Event Loop   | Lag in ms                           | > 100ms

Business Metrics

These track what matters to the business -- they are the ultimate measure of system health.

Category      | Metrics
--------------|----------------------------------------------------------
User Activity | Signups/day, active users, login failures
Revenue       | Orders/minute, payment success rate, cart abandonment
AI Usage      | AI API calls, tokens consumed, AI error rate, AI latency
Engagement    | Page views, feature adoption, session duration

// Business metrics example
const signupsTotal = new promClient.Counter({
  name: 'business_signups_total',
  help: 'Total user signups',
  labelNames: ['plan', 'source']
});

const aiApiCalls = new promClient.Counter({
  name: 'ai_api_calls_total',
  help: 'Total AI API calls',
  labelNames: ['provider', 'model', 'status']
});

const aiTokensUsed = new promClient.Counter({
  name: 'ai_tokens_used_total',
  help: 'Total AI tokens consumed',
  labelNames: ['provider', 'model', 'direction']  // direction: input/output
});

const aiRequestDuration = new promClient.Histogram({
  name: 'ai_request_duration_seconds',
  help: 'AI API request duration',
  labelNames: ['provider', 'model'],
  buckets: [0.5, 1, 2, 5, 10, 30, 60]
});

// Usage
signupsTotal.inc({ plan: 'free', source: 'organic' });
aiApiCalls.inc({ provider: 'openai', model: 'gpt-4o', status: 'success' });
aiTokensUsed.inc({ provider: 'openai', model: 'gpt-4o', direction: 'input' }, 1500);

3. Metric Types Explained

Understanding the four fundamental metric types is essential for choosing the right one.

Type      | Behavior                                            | Use Case                                          | Example
----------|-----------------------------------------------------|---------------------------------------------------|-------------------------------
Counter   | Only goes up (resets on restart)                    | Totals: requests, errors, bytes                   | http_requests_total
Gauge     | Goes up and down                                    | Current values: temp, queue size, active connections | node_memory_heap_used_bytes
Histogram | Counts values in configurable buckets               | Distributions: latency, response size             | http_request_duration_seconds
Summary   | Like histogram but calculates percentiles client-side | Pre-calculated percentiles                      | http_request_duration_summary

const promClient = require('prom-client');

// Counter -- total requests (monotonically increasing)
const counter = new promClient.Counter({
  name: 'api_requests_total',
  help: 'Total API requests'
});
counter.inc();       // +1
counter.inc(5);      // +5

// Gauge -- current active connections (goes up and down)
const gauge = new promClient.Gauge({
  name: 'active_connections',
  help: 'Current active WebSocket connections'
});
gauge.set(42);       // Set to 42
gauge.inc();         // Now 43
gauge.dec();         // Now 42

// Histogram -- request duration distribution
const histogram = new promClient.Histogram({
  name: 'request_duration_seconds',
  help: 'Request latency distribution',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
histogram.observe(0.234);  // Record a 234ms request

// Timing helper (call from an async context; handleRequest is your own handler)
const end = histogram.startTimer();
await handleRequest();
end();  // Automatically records elapsed time

4. Prometheus and Grafana Overview

Prometheus collects and stores metrics. Grafana visualizes them as dashboards. Together they are the most common open-source monitoring stack.

┌──────────────┐     scrape /metrics      ┌──────────────┐
│  Your App    │ ◄────────────────────────│  Prometheus  │
│  (Node.js)   │     every 15 seconds     │  (Time-      │
│              │                           │   Series DB) │
│  Exposes     │                           │              │
│  /metrics    │                           │  Stores all  │
│  endpoint    │                           │  metric data │
└──────────────┘                           └──────┬───────┘
                                                  │
                                                  │ PromQL queries
                                                  ▼
                                           ┌──────────────┐
                                           │   Grafana    │
                                           │              │
                                           │  Dashboards  │
                                           │  Alerts      │
                                           │  Panels      │
                                           └──────────────┘

Exposing a /metrics endpoint in Express

const express = require('express');
const promClient = require('prom-client');
const logger = require('./config/logger');

const app = express();

// Collect default Node.js metrics (CPU, memory, event loop, GC)
promClient.collectDefaultMetrics({
  prefix: 'myapp_',
  gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5]
});

// Custom RED metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'myapp_http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestDuration = new promClient.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

// Metrics middleware -- instruments every request
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route: route,
      status_code: res.statusCode
    };

    httpRequestsTotal.inc(labels);
    end(labels);  // Records duration with labels
  });

  next();
});

// Prometheus scrape endpoint
app.get('/metrics', async (req, res) => {
  try {
    res.set('Content-Type', promClient.register.contentType);
    res.end(await promClient.register.metrics());
  } catch (err) {
    logger.error({ err }, 'Failed to collect metrics');
    res.status(500).end();
  }
});

Sample /metrics output:

# HELP myapp_http_requests_total Total HTTP requests
# TYPE myapp_http_requests_total counter
myapp_http_requests_total{method="GET",route="/api/users",status_code="200"} 15234
myapp_http_requests_total{method="POST",route="/api/orders",status_code="201"} 3421
myapp_http_requests_total{method="POST",route="/api/orders",status_code="500"} 12

# HELP myapp_http_request_duration_seconds HTTP request duration in seconds
# TYPE myapp_http_request_duration_seconds histogram
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.05"} 12000
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.1"} 14500
myapp_http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.25"} 15100

5. CloudWatch Custom Metrics (AWS)

For AWS deployments, CloudWatch is often used instead of (or alongside) Prometheus.

// AWS SDK v2 shown; v2 is in maintenance mode. In SDK v3, use
// @aws-sdk/client-cloudwatch (CloudWatchClient + PutMetricDataCommand).
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

async function publishMetric(metricName, value, unit, dimensions = []) {
  const params = {
    Namespace: 'MyApp/Production',
    MetricData: [{
      MetricName: metricName,
      Value: value,
      Unit: unit,
      Timestamp: new Date(),
      Dimensions: dimensions
    }]
  };

  await cloudwatch.putMetricData(params).promise();
}

// Examples (call from an async context -- top-level await is not available in CommonJS)
await publishMetric('OrdersCreated', 1, 'Count', [
  { Name: 'Service', Value: 'order-service' },
  { Name: 'Environment', Value: 'production' }
]);

await publishMetric('PaymentLatency', 342, 'Milliseconds', [
  { Name: 'Service', Value: 'payment-service' },
  { Name: 'Provider', Value: 'stripe' }
]);

await publishMetric('AITokensUsed', 1500, 'Count', [
  { Name: 'Provider', Value: 'openai' },
  { Name: 'Model', Value: 'gpt-4o' }
]);

6. Percentile Latency (p50, p95, p99)

Averages lie. If 99 requests take 50ms and 1 request takes 5000ms, the average is 99.5ms -- which hides the fact that 1% of users wait 5 seconds. Percentiles reveal the truth.

Example: 100 requests sorted by latency

  p50 (median):  50% of requests are faster than this   → 45ms
  p90:           90% of requests are faster than this   → 120ms
  p95:           95% of requests are faster than this   → 200ms
  p99:           99% of requests are faster than this   → 850ms
  p99.9:         99.9% of requests are faster than this → 3200ms

  Average:       98ms  ← HIDES the long tail!

  ┌────────────────────────────────────────────────────────────┐
  │  Latency Distribution                                      │
  │                                                            │
  │  ████████████████████████  ← Most requests: 30-60ms       │
  │  ████████████████                                          │
  │  ████████████                                              │
  │  ████████         ← p95: 200ms                             │
  │  ████                                                      │
  │  ██               ← p99: 850ms                             │
  │  █                                                         │
  │  ░                ← p99.9: 3200ms (the "long tail")       │
  │  ░                                                         │
  │  └──────────────────────────────────────────────────────── │
  │  0ms    100ms    500ms    1s      2s      3s      5s       │
  └────────────────────────────────────────────────────────────┘
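
The percentile arithmetic above can be reproduced in a few lines. This sketch uses the nearest-rank method, one common convention; monitoring backends may interpolate slightly differently.

```javascript
// Percentile from a sorted sample, nearest-rank method.
function percentile(sortedValues, p) {
  const rank = Math.ceil(p * sortedValues.length);
  return sortedValues[Math.max(0, rank - 1)];
}

// 99 fast requests (50ms) and one 5000ms outlier, as in the text
const latencies = [...Array(99).fill(50), 5000].sort((a, b) => a - b);

const avg = latencies.reduce((sum, v) => sum + v, 0) / latencies.length;
console.log(avg);                          // 99.5 -- looks fine
console.log(percentile(latencies, 0.50));  // 50
console.log(percentile(latencies, 0.99));  // 50 -- a 1-in-100 outlier sits
console.log(percentile(latencies, 0.999)); // 5000   exactly at the p99 edge
```

Note that with a single outlier in 100 requests, nearest-rank p99 still lands on the fast side; only p99.9 (and the max) expose this tail, which is why critical paths track p99.9.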

Which percentiles to track

Percentile | Who It Represents                  | When to Use
-----------|------------------------------------|-------------------------------------
p50        | The typical user experience        | General health check
p95        | 1 in 20 users sees this or worse   | Standard SLO target
p99        | 1 in 100 users sees this or worse  | High-reliability SLO
p99.9      | 1 in 1000 users                    | For critical paths (checkout, login)

Querying percentiles with Prometheus (PromQL)

# p50 latency over last 5 minutes
histogram_quantile(0.50, rate(myapp_http_request_duration_seconds_bucket[5m]))

# p95 latency over last 5 minutes
histogram_quantile(0.95, rate(myapp_http_request_duration_seconds_bucket[5m]))

# p99 latency over last 5 minutes
histogram_quantile(0.99, rate(myapp_http_request_duration_seconds_bucket[5m]))

# Error rate percentage
sum(rate(myapp_http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(myapp_http_requests_total[5m])) * 100

# Request rate (requests per second)
sum(rate(myapp_http_requests_total[5m]))
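
histogram_quantile never sees raw latencies -- it estimates the quantile from cumulative bucket counts by linearly interpolating inside the bucket that contains the target rank. This is a simplified sketch of that idea (real Prometheus also handles the +Inf bucket and various edge cases):

```javascript
// Estimate a quantile from cumulative histogram buckets, the way
// histogram_quantile does (simplified: no +Inf bucket, no edge cases).
function histogramQuantile(q, buckets) {
  // buckets: [{ le: upperBound, count: cumulativeCount }, ...] sorted by le
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      const lower = i > 0 ? buckets[i - 1].le : 0;
      const prevCount = i > 0 ? buckets[i - 1].count : 0;
      // Linear interpolation within the matching bucket
      const fraction = (rank - prevCount) / (buckets[i].count - prevCount);
      return lower + (buckets[i].le - lower) * fraction;
    }
  }
  return buckets[buckets.length - 1].le;
}

// Cumulative buckets matching the sample /metrics output in section 4
const buckets = [
  { le: 0.05, count: 12000 },
  { le: 0.1,  count: 14500 },
  { le: 0.25, count: 15100 }
];
console.log(histogramQuantile(0.95, buckets).toFixed(3));  // 0.097 (p95 ≈ 97ms)
```

This is why bucket boundaries matter: the estimate can never be more precise than the width of the bucket the quantile falls into.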

7. SLIs, SLOs, and SLAs Explained

These three concepts form the reliability framework used by every serious engineering team.

  SLA (Agreement)     "We promise 99.9% uptime"
      │                    (External, contractual, with penalties)
      │
      ▼
  SLO (Objective)     "We target 99.95% uptime internally"
      │                    (Internal, stricter than SLA, engineering target)
      │
      ▼
  SLI (Indicator)     "Current uptime: 99.97%"
                           (The actual measurement from your metrics)

Term | Full Name               | What It Is                         | Who Cares           | Example
-----|-------------------------|------------------------------------|---------------------|-------------------------------
SLI  | Service Level Indicator | The actual metric being measured   | Engineers           | 99.97% of requests return 2xx
SLO  | Service Level Objective | The target value for the SLI       | Engineering team    | We target p99 latency < 500ms
SLA  | Service Level Agreement | Contractual promise with penalties | Business, customers | 99.9% uptime or credits issued

Common SLIs for a web service

// SLI: Availability
// "What percentage of requests succeed?"
const availability = successfulRequests / totalRequests;
// SLO: availability > 99.95% over 30 days

// SLI: Latency
// "What is the p99 response time?"
const p99Latency = getPercentile(requestDurations, 0.99);
// SLO: p99 < 500ms

// SLI: Error rate
// "What percentage of requests return 5xx?"
const errorRate = serverErrors / totalRequests;
// SLO: error rate < 0.1%

// SLI: Throughput
// "How many requests can we handle per second?"
const throughput = totalRequests / timeWindowSeconds;
// SLO: sustain > 1000 req/s without degradation

Error budgets

An error budget is the inverse of your SLO. If your SLO is 99.9% uptime in 30 days:

30 days = 43,200 minutes
0.1% error budget = 43.2 minutes of allowed downtime per month

If you've used 20 minutes of downtime:
  Remaining budget: 23.2 minutes
  → You can deploy with confidence

If you've used 40 minutes:
  Remaining budget: 3.2 minutes
  → STOP deploying. Focus on reliability.
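
The budget arithmetic above is simple enough to encode directly. This is an illustrative helper, not a standard API:

```javascript
// Error-budget arithmetic for an availability SLO over a rolling window.
// (errorBudget is an illustrative helper, not a library function.)
function errorBudget(sloPercent, windowDays, downtimeMinutesUsed) {
  const windowMinutes = windowDays * 24 * 60;
  // The budget is everything the SLO does NOT promise
  const budgetMinutes = windowMinutes * (1 - sloPercent / 100);
  return {
    budgetMinutes,
    remainingMinutes: budgetMinutes - downtimeMinutesUsed,
    consumedPercent: (downtimeMinutesUsed / budgetMinutes) * 100
  };
}

console.log(errorBudget(99.9, 30, 20));
// budget ≈ 43.2 min, remaining ≈ 23.2 min, ≈46% consumed -- safe to deploy
```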

8. Node.js Performance Monitoring

Node.js has specific metrics that are critical to track due to its single-threaded, event-driven architecture.

const promClient = require('prom-client');

// 1. Event loop lag -- how backed up is the event loop?
const eventLoopLag = new promClient.Gauge({
  name: 'nodejs_eventloop_lag_seconds',
  help: 'Event loop lag in seconds'
});

// Measure event loop lag
function measureEventLoopLag() {
  const start = process.hrtime.bigint();
  setImmediate(() => {
    const lag = Number(process.hrtime.bigint() - start) / 1e9;
    eventLoopLag.set(lag);
  });
}
setInterval(measureEventLoopLag, 1000);

// 2. Memory usage
const memoryUsage = new promClient.Gauge({
  name: 'nodejs_memory_usage_bytes',
  help: 'Node.js memory usage',
  labelNames: ['type']
});

function collectMemoryMetrics() {
  const mem = process.memoryUsage();
  memoryUsage.set({ type: 'heapUsed' }, mem.heapUsed);
  memoryUsage.set({ type: 'heapTotal' }, mem.heapTotal);
  memoryUsage.set({ type: 'rss' }, mem.rss);
  memoryUsage.set({ type: 'external' }, mem.external);
}
setInterval(collectMemoryMetrics, 5000);

// 3. Active handles and requests (detect leaks)
// Note: _getActiveHandles/_getActiveRequests are undocumented internals and
// may change; on newer Node versions, prefer process.getActiveResourcesInfo()
const activeHandles = new promClient.Gauge({
  name: 'nodejs_active_handles',
  help: 'Number of active handles'
});
const activeRequests = new promClient.Gauge({
  name: 'nodejs_active_requests',
  help: 'Number of active requests'
});

setInterval(() => {
  activeHandles.set(process._getActiveHandles().length);
  activeRequests.set(process._getActiveRequests().length);
}, 5000);

Key Node.js metrics to watch

Metric            | Healthy        | Warning        | Critical
------------------|----------------|----------------|-----------------------------
Event loop lag    | < 10ms         | 10-100ms       | > 100ms
Heap used         | < 60% of limit | 60-80%         | > 80%
RSS memory        | Stable         | Slowly growing | Continuously growing (leak)
Active handles    | Stable count   | Growing slowly | Growing fast (leak)
GC pause duration | < 50ms         | 50-200ms       | > 200ms

9. Creating Dashboards

A good monitoring dashboard answers key questions at a glance.

Dashboard layout pattern

┌──────────────────────────────────────────────────────────────┐
│  SERVICE HEALTH DASHBOARD                                    │
│                                                              │
│  Row 1: THE BIG NUMBERS (SLIs)                              │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐      │
│  │ Uptime   │ │ Error    │ │ p95      │ │ Requests │      │
│  │ 99.97%   │ │ Rate     │ │ Latency  │ │ Per Sec  │      │
│  │   ✓      │ │ 0.3%  ✓ │ │ 180ms ✓  │ │ 523 rps  │      │
│  └──────────┘ └──────────┘ └──────────┘ └──────────┘      │
│                                                              │
│  Row 2: RED METRICS (time series graphs)                    │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │  Request Rate  │ │  Error Rate    │ │  Latency       │  │
│  │  ~~~~~~~~~~~   │ │  ___________   │ │  p50 ------    │  │
│  │  ~~~~~~~~~~~~  │ │  ___________   │ │  p95 -------   │  │
│  │  ~~~~~~~~~~~~~│ │  ___________   │ │  p99 --------  │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
│                                                              │
│  Row 3: INFRASTRUCTURE                                      │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │  CPU Usage     │ │  Memory        │ │  Event Loop    │  │
│  │  ████░░░ 45%   │ │  ██████░ 62%   │ │  Lag: 3ms      │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
│                                                              │
│  Row 4: BUSINESS METRICS                                    │
│  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐  │
│  │  Signups Today │ │  Orders/min    │ │  AI API Calls  │  │
│  │  1,234         │ │  45            │ │  12,500 today  │  │
│  └────────────────┘ └────────────────┘ └────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Dashboard best practices

  1. Top row = SLIs with color coding (green/yellow/red against SLO thresholds)
  2. Second row = RED metrics as time-series graphs (spot trends and anomalies)
  3. Third row = Infrastructure gauges (CPU, memory, disk, event loop)
  4. Bottom row = Business metrics (the things stakeholders care about)
  5. Time range selector -- default to last 6 hours, allow drill-down
  6. One dashboard per service plus one overview dashboard for all services

10. Complete Metrics Collection Example

A full metrics module for a Node.js microservice.

// src/metrics/index.js
const promClient = require('prom-client');

// Collect default Node.js metrics
promClient.collectDefaultMetrics({ prefix: 'myapp_' });

// ---- RED Metrics ----

const httpRequestsTotal = new promClient.Counter({
  name: 'myapp_http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const httpRequestDuration = new promClient.Histogram({
  name: 'myapp_http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});

// ---- Business Metrics ----

const businessEvents = new promClient.Counter({
  name: 'myapp_business_events_total',
  help: 'Business event counter',
  labelNames: ['event_type', 'status']
});

const aiMetrics = {
  calls: new promClient.Counter({
    name: 'myapp_ai_calls_total',
    help: 'AI API calls',
    labelNames: ['provider', 'model', 'status']
  }),
  tokens: new promClient.Counter({
    name: 'myapp_ai_tokens_total',
    help: 'AI tokens consumed',
    labelNames: ['provider', 'model', 'direction']
  }),
  duration: new promClient.Histogram({
    name: 'myapp_ai_duration_seconds',
    help: 'AI API call duration',
    labelNames: ['provider', 'model'],
    buckets: [0.5, 1, 2, 5, 10, 30, 60]
  })
};

// ---- Infrastructure Metrics ----

const dbQueryDuration = new promClient.Histogram({
  name: 'myapp_db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['operation', 'collection'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
});

const cacheHitRate = new promClient.Counter({
  name: 'myapp_cache_operations_total',
  help: 'Cache hit/miss counter',
  labelNames: ['operation']  // hit, miss, set, delete
});

const activeConnections = new promClient.Gauge({
  name: 'myapp_active_connections',
  help: 'Current active connections',
  labelNames: ['type']  // http, websocket, db
});

// ---- Express Middleware ----

function metricsMiddleware(req, res, next) {
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route,
      status_code: res.statusCode
    };
    httpRequestsTotal.inc(labels);
    end(labels);
  });

  next();
}

// ---- Metrics Endpoint ----

async function metricsHandler(req, res) {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
}

module.exports = {
  httpRequestsTotal,
  httpRequestDuration,
  businessEvents,
  aiMetrics,
  dbQueryDuration,
  cacheHitRate,
  activeConnections,
  metricsMiddleware,
  metricsHandler
};
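
The cacheHitRate counter records raw hit/miss events; the ratio itself is computed at query time. For example, in PromQL (using the metric name defined above):

```promql
# Cache hit ratio over the last 5 minutes
sum(rate(myapp_cache_operations_total{operation="hit"}[5m]))
/ sum(rate(myapp_cache_operations_total{operation=~"hit|miss"}[5m]))
```

Storing the raw counts and dividing at query time keeps the metric cheap and lets you recompute the ratio over any window.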
// src/app.js -- Using the metrics module
const express = require('express');
const {
  metricsMiddleware,
  metricsHandler,
  businessEvents,
  aiMetrics
} = require('./metrics');

const app = express();

app.use(metricsMiddleware);
app.get('/metrics', metricsHandler);

// In a route handler
app.post('/api/orders', async (req, res) => {
  try {
    const order = await createOrder(req.body);
    businessEvents.inc({ event_type: 'order_created', status: 'success' });
    res.status(201).json(order);
  } catch (error) {
    businessEvents.inc({ event_type: 'order_created', status: 'failure' });
    res.status(500).json({ error: 'Failed to create order' });
  }
});

// Tracking AI calls (assumes an already-initialized OpenAI client `openai`)
async function callAI(prompt) {
  const end = aiMetrics.duration.startTimer({ provider: 'openai', model: 'gpt-4o' });
  try {
    const response = await openai.chat.completions.create({ model: 'gpt-4o', messages: [{ role: 'user', content: prompt }] });
    aiMetrics.calls.inc({ provider: 'openai', model: 'gpt-4o', status: 'success' });
    aiMetrics.tokens.inc({ provider: 'openai', model: 'gpt-4o', direction: 'input' }, response.usage.prompt_tokens);
    aiMetrics.tokens.inc({ provider: 'openai', model: 'gpt-4o', direction: 'output' }, response.usage.completion_tokens);
    end();
    return response;
  } catch (error) {
    aiMetrics.calls.inc({ provider: 'openai', model: 'gpt-4o', status: 'error' });
    end();
    throw error;
  }
}

11. Key Takeaways

  1. Three pillars work together -- logs for detail, metrics for aggregation, traces for request flow. You need all three.
  2. RED method for services -- Rate, Errors, Duration cover 90% of what you need to monitor.
  3. Percentiles over averages -- p95 and p99 reveal the experience of your worst-off users.
  4. SLIs measure, SLOs target, SLAs promise -- your error budget tells you when to ship features vs fix reliability.
  5. Prometheus + Grafana is the standard open-source stack; CloudWatch is the AWS-native alternative.
  6. Node.js-specific metrics matter -- event loop lag, heap usage, and active handles catch problems unique to Node.

Explain-It Challenge

  1. Your average latency is 50ms but users are complaining about slow responses. Explain what metric you should look at instead and why.
  2. The error budget for your SLO has been 80% consumed with 20 days left in the month. What action do you recommend to the team?
  3. A manager asks "Why do we need Prometheus if we already have logs?" Explain the difference between logs and metrics with a concrete example.

Navigation: <- 6.7.a Structured Logging | 6.7.c -- Alerting and Event Logging ->