Episode 6 — Scaling Reliability Microservices Web3 / 6.7 — Logging and Observability

6.7.c -- Alerting and Event Logging

In one sentence: Alerting turns your metrics and logs into actionable notifications that wake you up before users notice problems, while event logging creates an auditable record of every important thing that happens in your system -- authentication attempts, payments, AI usage, and rate limit violations.

Navigation: <- 6.7.b Monitoring and Metrics | 6.7 Overview


1. Why Alerting Matters

Without alerting, you discover problems in one of three ways:

  1. A user complains -- the worst way. They are already frustrated.
  2. Someone checks the dashboard -- unreliable. Nobody watches dashboards 24/7.
  3. An automated alert fires -- the best way. You know before users do.

Timeline of an incident WITHOUT alerting:

  14:00  Database connection pool saturated
  14:05  API responses slow down (p99 → 5s)
  14:15  Users start seeing timeouts
  14:30  First user emails support
  14:45  Support escalates to engineering
  15:00  Engineer opens laptop and starts investigating
  15:30  Root cause found and fixed
  ────────────────────────────────────────────────
  Impact: 90 minutes of degraded service

Timeline of the same incident WITH alerting:

  14:00  Database connection pool saturated
  14:01  Alert fires: "DB connection pool > 90%"
  14:03  On-call engineer's phone buzzes
  14:05  Engineer opens laptop, sees the metric
  14:15  Root cause found and fixed
  ────────────────────────────────────────────────
  Impact: 15 minutes of degraded service

Key insight: Alerting shrinks the detection time from 30-60 minutes to 1-3 minutes. The fix takes the same time either way -- detection is what you can automate.


2. Alert Design Principles

Bad alerts are worse than no alerts because they train the team to ignore them (alert fatigue). Every alert must pass these five tests:

The five rules of good alerts

| Rule | Description | Bad Example | Good Example |
|---|---|---|---|
| Actionable | Someone can do something about it right now | "Disk usage: 45%" | "Disk usage > 85% -- expand volume or clean up" |
| Urgent | It needs attention within minutes, not days | "SSL cert expires in 60 days" | "SSL cert expires in 3 days" |
| Real | It indicates a genuine problem, not noise | "CPU spike to 70% for 2 seconds" | "CPU > 80% sustained for 10 minutes" |
| Unique | Not duplicated by another alert | 5 alerts for the same outage | 1 alert with context about the failure |
| Documented | Links to a runbook explaining what to do | "Error rate high" | "Error rate > 5% -- see runbook.md#error-rate" |

The three questions every alert must answer

1. WHAT is happening?     → "Payment service error rate is 12%"
2. WHY does it matter?    → "Users cannot complete purchases"
3. WHAT should I do?      → "Check payment-service logs, verify Stripe status"
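The three questions can be enforced in code. This is a minimal sketch -- `formatAlert` and its field names are hypothetical, not part of any alerting library:

```javascript
// Hypothetical helper: refuse to build an alert message unless it
// answers WHAT, WHY, and WHAT TO DO.
function formatAlert({ what, why, action, runbookUrl }) {
  if (!what || !why || !action) {
    throw new Error('An alert must answer WHAT, WHY, and WHAT TO DO');
  }
  return [
    `WHAT: ${what}`,
    `WHY: ${why}`,
    `ACTION: ${action}`,
    runbookUrl ? `RUNBOOK: ${runbookUrl}` : null
  ].filter(Boolean).join('\n');
}

const message = formatAlert({
  what: 'Payment service error rate is 12%',
  why: 'Users cannot complete purchases',
  action: 'Check payment-service logs, verify Stripe status',
  runbookUrl: 'https://wiki/runbooks/payment-error-rate'
});
```

Making the three answers required fields means an undocumented alert cannot even be created.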

3. Alert Thresholds and Severity Levels

Severity levels

| Severity | Meaning | Response Time | Notification | Example |
|---|---|---|---|---|
| Critical (P1) | Service down, revenue impact | Immediate (< 5 min) | Page on-call, phone call | API returning 5xx for all users |
| High (P2) | Major degradation, some users affected | Within 30 minutes | Page on-call, Slack | p99 latency > 5s, error rate > 5% |
| Medium (P3) | Minor issue, needs attention today | Within 4 hours | Slack channel | Disk usage > 80%, cert expires in 7 days |
| Low (P4) | Informational, plan for later | Next business day | Email, ticket | Memory usage trending up, dependency deprecated |

Setting thresholds: the multi-window approach

Do not alert on a single data point. Use sustained conditions with appropriate time windows.

// BAD: Alerts on a single spike (noisy)
if (cpuUsage > 80) alert('CPU high!');

// GOOD: Alerts on sustained condition
// "CPU > 80% for at least 5 consecutive minutes"
// This ignores brief spikes during deployments or GC pauses

// BETTER: Multi-window approach
// Short window (5 min): detects sudden failures fast
// Long window (1 hour): detects slow degradation
// Alert when BOTH windows exceed thresholds
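The multi-window idea can be sketched as a pure function (the names and sample shapes here are illustrative, not from any specific monitoring tool):

```javascript
// Each window is an array of recent error-rate or CPU samples (percent).
function windowAverage(samples) {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Fire only when BOTH windows exceed the threshold: the short window
// catches sudden failures fast, the long window filters out brief
// spikes (deploys, GC pauses) that recover on their own.
function shouldAlert(shortWindow, longWindow, threshold) {
  return windowAverage(shortWindow) > threshold &&
         windowAverage(longWindow) > threshold;
}

// A brief spike in the short window alone does not fire:
shouldAlert([95, 95, 2, 2, 2], [2, 2, 2, 2, 2, 2], 80);            // false
// Sustained degradation across both windows fires:
shouldAlert([85, 90, 88, 86, 91], [84, 85, 86, 88, 87, 90], 80);   // true
```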

Recommended alert thresholds

| Metric | Warning (P3) | Critical (P1) | Window |
|---|---|---|---|
| Error rate | > 1% for 5 min | > 5% for 2 min | 5-minute rate |
| p99 latency | > 1s for 5 min | > 5s for 2 min | 5-minute window |
| CPU usage | > 70% for 15 min | > 90% for 5 min | Sustained |
| Memory usage | > 75% for 10 min | > 90% for 5 min | Sustained |
| Disk usage | > 80% | > 90% | Point-in-time |
| Event loop lag | > 50ms for 5 min | > 200ms for 2 min | Node.js specific |
| DB connections | > 70% pool used | > 90% pool used | Current gauge |
| Queue depth | > 1000 messages | > 10000 messages | Current gauge |
| Certificate expiry | < 14 days | < 3 days | Daily check |
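As a concrete example, the certificate-expiry row in the table can be turned into a daily check. This is a sketch; the function names are illustrative:

```javascript
// Classify days-until-expiry into a severity, per the thresholds above.
function certExpirySeverity(daysLeft) {
  if (daysLeft < 3) return 'P1';   // Critical: page on-call
  if (daysLeft < 14) return 'P3';  // Warning: Slack channel
  return 'ok';
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
function daysUntil(notAfter, now = new Date()) {
  return Math.floor((new Date(notAfter) - now) / MS_PER_DAY);
}

certExpirySeverity(daysUntil('2025-01-10', new Date('2025-01-01')));  // 'P3'
```

A cron job (or scheduled Lambda) would run this once a day against each certificate's `notAfter` date.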

4. On-Call Basics

On-call rotation ensures someone is always responsible for responding to alerts.

On-call responsibilities

┌─────────────────────────────────────────────────┐
│              ON-CALL ROTATION                    │
│                                                 │
│  Week 1: Alice (primary), Bob (secondary)       │
│  Week 2: Bob (primary), Charlie (secondary)     │
│  Week 3: Charlie (primary), Alice (secondary)   │
│                                                 │
│  When an alert fires:                           │
│  1. Primary on-call is paged                    │
│  2. Acknowledge within 5 minutes                │
│  3. If no ack → escalate to secondary           │
│  4. If no ack → escalate to engineering manager │
│                                                 │
│  After resolution:                              │
│  1. Write incident summary                      │
│  2. Update runbook if needed                    │
│  3. Blameless post-mortem for P1/P2             │
└─────────────────────────────────────────────────┘
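The escalation chain in the box above reduces to a small function. A minimal sketch, assuming a 5-minute acknowledgement timeout per step:

```javascript
// Who to page, given minutes elapsed since the alert with no acknowledgement.
const ESCALATION_CHAIN = ['primary', 'secondary', 'engineering-manager'];
const ACK_TIMEOUT_MINUTES = 5;

function whoToPage(minutesSincePage) {
  // Each unacknowledged 5-minute interval moves one step up the chain.
  const step = Math.floor(minutesSincePage / ACK_TIMEOUT_MINUTES);
  return ESCALATION_CHAIN[Math.min(step, ESCALATION_CHAIN.length - 1)];
}

whoToPage(0);   // 'primary'
whoToPage(6);   // 'secondary' -- no ack within 5 minutes
whoToPage(12);  // 'engineering-manager'
```

In practice tools like PagerDuty implement exactly this logic as configurable escalation policies, so you rarely write it yourself.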

On-call best practices

  1. Rotate weekly -- no one should be on-call more than 1 week in 4
  2. Compensate -- on-call work is real work, acknowledge it
  3. Provide runbooks -- the on-call person should never have to guess what to do
  4. Limit alerts -- if on-call gets more than 2 pages per shift, fix the noise
  5. Blameless post-mortems -- focus on what broke, not who broke it

Tools for on-call management

| Tool | Purpose |
|---|---|
| PagerDuty | Incident management, on-call scheduling, escalation policies |
| OpsGenie (Atlassian) | Similar to PagerDuty, integrates with Jira |
| Grafana OnCall | Open-source on-call management |
| CloudWatch Alarms | AWS-native alerting, integrates with SNS |

5. CloudWatch Alarms (AWS)

For AWS deployments, CloudWatch Alarms are the standard alerting mechanism.

// Creating a CloudWatch Alarm with AWS SDK
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

// Alert: Error rate > 5% for 2 consecutive periods
async function createErrorRateAlarm() {
  await cloudwatch.putMetricAlarm({
    AlarmName: 'HighErrorRate-OrderService',
    AlarmDescription: 'Error rate exceeds 5% for 2 minutes. Check order-service logs. Runbook: https://wiki/runbook/error-rate',
    Namespace: 'MyApp/Production',
    MetricName: 'ErrorRate',
    Dimensions: [{ Name: 'Service', Value: 'order-service' }],
    Statistic: 'Average',
    Period: 60,                    // Check every 60 seconds
    EvaluationPeriods: 2,          // Must breach for 2 consecutive periods
    Threshold: 5,                  // 5% error rate
    ComparisonOperator: 'GreaterThanThreshold',
    TreatMissingData: 'breaching', // Missing data = assume bad
    AlarmActions: [
      'arn:aws:sns:us-east-1:123456789:PagerDutyAlerts'  // SNS topic
    ],
    OKActions: [
      'arn:aws:sns:us-east-1:123456789:AlertResolved'
    ]
  }).promise();
}

// Alert: p99 latency > 2 seconds
async function createLatencyAlarm() {
  await cloudwatch.putMetricAlarm({
    AlarmName: 'HighLatency-OrderService',
    AlarmDescription: 'p99 latency exceeds 2s. Check for slow queries or downstream timeouts.',
    Namespace: 'MyApp/Production',
    MetricName: 'RequestLatency',
    Dimensions: [{ Name: 'Service', Value: 'order-service' }],
    ExtendedStatistic: 'p99',      // Use p99 percentile
    Period: 300,                    // 5-minute window
    EvaluationPeriods: 2,
    Threshold: 2000,                // 2000ms
    ComparisonOperator: 'GreaterThanThreshold',
    AlarmActions: [
      'arn:aws:sns:us-east-1:123456789:SlackAlerts'
    ]
  }).promise();
}

// Alert: Disk usage > 85%
async function createDiskAlarm() {
  await cloudwatch.putMetricAlarm({
    AlarmName: 'HighDiskUsage-OrderService',
    AlarmDescription: 'Disk usage exceeds 85%. Clean up logs or expand volume.',
    Namespace: 'CWAgent',             // Disk metrics require the CloudWatch agent;
    MetricName: 'disk_used_percent',  // EC2 does not publish disk usage by default
    Statistic: 'Maximum',
    Period: 300,
    EvaluationPeriods: 1,
    Threshold: 85,
    ComparisonOperator: 'GreaterThanThreshold',
    AlarmActions: [
      'arn:aws:sns:us-east-1:123456789:SlackAlerts'
    ]
  }).promise();
}

6. Alert Fatigue and How to Prevent It

Alert fatigue is when the team receives so many alerts that they start ignoring them -- including the real ones. It is the number one reason alerting systems fail.

Signs of alert fatigue

  • On-call acknowledges alerts without investigating
  • Alerts are muted or snoozed habitually
  • Multiple alerts fire for the same incident
  • Alerts fire during deployments and are always "resolved automatically"
  • Team says "that alert always fires, just ignore it"

How to prevent alert fatigue

| Strategy | Implementation |
|---|---|
| Delete noisy alerts | If an alert fires > 3 times/week without action, remove it |
| Increase thresholds | Move from "CPU > 60%" to "CPU > 80% sustained 10 min" |
| Add time windows | Never alert on a single data point; require sustained conditions |
| Deduplicate | Group related alerts into a single notification |
| Route correctly | P3/P4 to Slack, only P1/P2 page on-call |
| Review monthly | Audit all alerts: keep, tune, or delete each one |
| Use error budgets | Alert on SLO burn rate, not individual metric spikes |

SLO burn rate alerting (the modern approach)

Instead of alerting on raw metrics, alert when your error budget is being consumed too fast.

SLO: 99.9% availability over 30 days
Error budget: 43.2 minutes of downtime

Normal burn rate: 1x  (consuming budget at the expected rate)

Alert thresholds:
  - 14.4x burn rate for 5 min  → P1 (will exhaust budget in 2 days)
  - 6x burn rate for 30 min    → P2 (will exhaust budget in 5 days)
  - 3x burn rate for 6 hours   → P3 (will exhaust budget in 10 days)

This approach:
  ✓ Ignores brief spikes (1 minute of errors)
  ✓ Catches sustained degradation early
  ✓ Directly tied to what matters (the SLO)
  ✓ Fewer, more meaningful alerts
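The burn-rate arithmetic above is simple enough to sketch directly (the helper names are illustrative):

```javascript
// SLO 99.9% over 30 days => the error budget is 0.1% of the window.
const SLO = 0.999;
const WINDOW_DAYS = 30;

// Error budget in minutes: 30 days * 24h * 60min * (1 - SLO) = 43.2
const budgetMinutes = WINDOW_DAYS * 24 * 60 * (1 - SLO);

// Burn rate = observed error ratio / allowed error ratio.
function burnRate(observedErrorRatio) {
  return observedErrorRatio / (1 - SLO);
}

// At burn rate B, the budget is exhausted in WINDOW_DAYS / B days.
function daysToExhaust(rate) {
  return WINDOW_DAYS / rate;
}

burnRate(0.0144);     // ≈ 14.4 -- the P1 threshold above
daysToExhaust(14.4);  // ≈ 2.08 days, matching "will exhaust budget in 2 days"
```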

7. Important Events to Log

Beyond request/response logging, specific events should be logged for operational awareness, security, and compliance.

API errors and failures

// Log every error with full context
app.use((err, req, res, next) => {
  const statusCode = err.status || 500;  // Default to 500 so unclassified errors are never dropped
  const logData = {
    requestId: req.requestId,
    userId: req.user?.id,
    method: req.method,
    path: req.originalUrl,
    statusCode,
    errorName: err.name,
    errorMessage: err.message,
    errorCode: err.code,        // Custom error codes
    stack: err.stack,
    // Downstream service info (if applicable)
    downstreamService: err.service,
    downstreamStatusCode: err.downstreamStatus
  };

  if (statusCode >= 500) {
    logger.error(logData, 'Server error');
  } else {
    logger.warn(logData, 'Client error');
  }

  res.status(statusCode).json({
    error: err.message,
    requestId: req.requestId
  });
});

Authentication and security events

// ALWAYS log authentication events -- these are security-critical
const securityLogger = logger.child({ category: 'security' });

// Successful login
securityLogger.info({
  event: 'auth.login.success',
  userId: user.id,
  method: 'password',        // password, oauth, magic-link
  ip: req.ip,
  userAgent: req.get('User-Agent'),
  mfaUsed: true
}, 'User logged in');

// Failed login
securityLogger.warn({
  event: 'auth.login.failure',
  email: maskEmail(email),    // "j***@example.com"
  reason: 'invalid_password',
  ip: req.ip,
  userAgent: req.get('User-Agent'),
  failedAttempts: loginAttempts
}, 'Login failed');

// Account locked
securityLogger.error({
  event: 'auth.account.locked',
  userId: user.id,
  reason: 'too_many_failed_attempts',
  failedAttempts: 5,
  lockDuration: '30 minutes',
  ip: req.ip
}, 'Account locked due to failed login attempts');

// Password changed
securityLogger.info({
  event: 'auth.password.changed',
  userId: user.id,
  ip: req.ip,
  method: 'user_initiated'
}, 'Password changed');

// Permission denied
securityLogger.warn({
  event: 'auth.access.denied',
  userId: req.user.id,
  resource: req.originalUrl,
  requiredRole: 'admin',
  actualRole: req.user.role,
  ip: req.ip
}, 'Access denied - insufficient permissions');
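The failed-login example above calls a `maskEmail` helper. One possible implementation (a sketch; keep the first character of the local part and mask the rest, preserving the domain for investigation):

```javascript
// Mask an email for logging: "john@example.com" -> "j***@example.com".
function maskEmail(email) {
  const [local, domain] = String(email).split('@');
  if (!domain) return '***';               // Not a valid email: mask fully
  return `${local ? local[0] : ''}***@${domain}`;
}

maskEmail('john@example.com');  // 'j***@example.com'
```

Keeping the domain lets you spot patterns (e.g. a credential-stuffing run against one provider) without writing full PII into the logs.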

AI usage events

const aiLogger = logger.child({ category: 'ai' });

// Log every AI API call
aiLogger.info({
  event: 'ai.call.completed',
  provider: 'openai',
  model: 'gpt-4o',
  promptTokens: response.usage.prompt_tokens,
  completionTokens: response.usage.completion_tokens,
  totalTokens: response.usage.total_tokens,
  duration: endTime - startTime,
  temperature: 0.7,
  userId: req.user.id,
  feature: 'profile-generation',    // Which feature triggered this
  estimatedCost: calculateCost(response.usage)
}, 'AI API call completed');

// AI error
aiLogger.error({
  event: 'ai.call.failed',
  provider: 'openai',
  model: 'gpt-4o',
  error: error.message,
  statusCode: error.status,
  retryCount: attempt,
  userId: req.user.id,
  feature: 'profile-generation'
}, 'AI API call failed');
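The logging call above references a `calculateCost` helper. A minimal sketch -- the per-token prices below are ASSUMPTIONS for illustration; look up your provider's current pricing and keep it in config, not code:

```javascript
// Assumed example prices per 1K tokens -- NOT real, current pricing.
const PRICE_PER_1K = {
  'gpt-4o': { prompt: 0.005, completion: 0.015 }
};

// usage has the OpenAI response shape: { prompt_tokens, completion_tokens }.
function calculateCost(usage, model = 'gpt-4o') {
  const price = PRICE_PER_1K[model];
  if (!price) return null;  // Unknown model: do not guess a cost
  return (usage.prompt_tokens / 1000) * price.prompt +
         (usage.completion_tokens / 1000) * price.completion;
}

calculateCost({ prompt_tokens: 1000, completion_tokens: 500 });  // ≈ $0.0125
```

Logging an estimated cost per call makes it trivial to build per-user and per-feature spend dashboards later.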

Rate limit events

// Log when users hit rate limits
logger.warn({
  event: 'ratelimit.exceeded',
  userId: req.user?.id || 'anonymous',
  ip: req.ip,
  path: req.originalUrl,
  limit: rateLimit.max,
  windowMs: rateLimit.windowMs,
  currentCount: currentRequestCount
}, 'Rate limit exceeded');

// Log when approaching rate limit (early warning)
if (currentRequestCount > rateLimit.max * 0.8) {
  logger.info({
    event: 'ratelimit.approaching',
    userId: req.user?.id,
    currentCount: currentRequestCount,
    limit: rateLimit.max,
    percentUsed: Math.round((currentRequestCount / rateLimit.max) * 100)
  }, 'Approaching rate limit');
}

Payment and financial events

const paymentLogger = logger.child({ category: 'payment' });

// Payment succeeded
paymentLogger.info({
  event: 'payment.charge.success',
  userId: user.id,
  orderId: order.id,
  amount: order.total,
  currency: 'USD',
  provider: 'stripe',
  paymentIntentId: paymentIntent.id,
  cardLast4: paymentMethod.card.last4,  // Only last 4 digits!
  cardBrand: paymentMethod.card.brand
}, 'Payment successful');

// Payment failed
paymentLogger.error({
  event: 'payment.charge.failure',
  userId: user.id,
  orderId: order.id,
  amount: order.total,
  currency: 'USD',
  provider: 'stripe',
  declineCode: error.decline_code,
  errorMessage: error.message,
  errorType: error.type
}, 'Payment failed');

// Refund processed
paymentLogger.info({
  event: 'payment.refund.processed',
  userId: user.id,
  orderId: order.id,
  refundAmount: refund.amount,
  refundId: refund.id,
  reason: refund.reason,
  initiatedBy: adminUser.id
}, 'Refund processed');

8. Audit Logging

Audit logs are a special category of event logs designed for compliance, security review, and forensic analysis. They answer: "Who did what, when, and from where?"

// src/utils/auditLogger.js
const logger = require('../config/logger');

const auditLogger = logger.child({ category: 'audit' });

function logAuditEvent({
  actor,            // Who performed the action
  action,           // What they did
  resource,         // What they did it to
  resourceId,       // Specific resource ID
  details,          // Additional context
  ip,               // IP address
  userAgent,        // Browser/client info
  outcome           // success or failure
}) {
  auditLogger.info({
    event: 'audit',
    actor: {
      id: actor.id,
      email: actor.email,
      role: actor.role
    },
    action,
    resource,
    resourceId,
    outcome,
    details,
    ip,
    userAgent,
    timestamp: new Date().toISOString()
  }, `Audit: ${actor.email} ${action} ${resource}/${resourceId}`);
}

module.exports = { logAuditEvent };

Usage in route handlers

const { logAuditEvent } = require('../utils/auditLogger');

// Admin deletes a user
app.delete('/api/admin/users/:id', requireAdmin, async (req, res) => {
  const user = await User.findById(req.params.id);
  if (!user) return res.status(404).json({ error: 'User not found' });

  await user.deleteOne();

  logAuditEvent({
    actor: req.user,
    action: 'DELETE',
    resource: 'user',
    resourceId: req.params.id,
    details: { deletedEmail: user.email, deletedRole: user.role },
    ip: req.ip,
    userAgent: req.get('User-Agent'),
    outcome: 'success'
  });

  res.json({ message: 'User deleted' });
});

// User updates their profile
app.patch('/api/users/me', authenticate, async (req, res) => {
  const before = { ...req.user.toObject() };
  const updated = await User.findByIdAndUpdate(req.user.id, req.body, { new: true });

  logAuditEvent({
    actor: req.user,
    action: 'UPDATE',
    resource: 'user_profile',
    resourceId: req.user.id,
    details: {
      fieldsChanged: Object.keys(req.body),
      before: { name: before.name, email: before.email },
      after: { name: updated.name, email: updated.email }
    },
    ip: req.ip,
    userAgent: req.get('User-Agent'),
    outcome: 'success'
  });

  res.json(updated);
});

What to audit

| Category | Events to Log |
|---|---|
| Authentication | Login, logout, password change, MFA enable/disable, token refresh |
| Authorization | Permission changes, role assignments, access denials |
| Data access | View, create, update, delete of sensitive records |
| Admin actions | User management, configuration changes, feature toggles |
| Financial | Payments, refunds, subscription changes, pricing updates |
| System | Deployments, configuration changes, secret rotation |

9. Event Logging for Compliance

Different regulations require different retention and logging standards.

| Regulation | What It Requires | Retention |
|---|---|---|
| GDPR | Log access to personal data, data deletion events, consent changes | Varies (document your policy) |
| HIPAA | Log all access to health records, who accessed what and when | 6 years |
| PCI-DSS | Log access to cardholder data, authentication events, admin actions | 1 year (3 months immediately available) |
| SOC 2 | Log security events, access controls, system changes | 1 year |

Compliance logging best practices

  1. Immutable storage -- audit logs should be written to append-only storage (S3 with Object Lock, CloudWatch Logs)
  2. Separate from application logs -- audit logs go to a different destination that developers cannot delete
  3. Include all context -- who, what, when, where (IP), outcome
  4. Retention policies -- configure automatic retention matching your compliance requirements
  5. Encrypt at rest -- audit logs often contain references to sensitive operations

// Writing audit logs to a separate immutable destination
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');

const s3 = new S3Client({ region: 'us-east-1' });

async function writeAuditToS3(auditEvent) {
  const date = new Date();
  const key = `audit-logs/${date.getFullYear()}/${date.getMonth() + 1}/${date.getDate()}/${date.getTime()}-${auditEvent.resourceId}.json`;

  await s3.send(new PutObjectCommand({
    Bucket: 'myapp-audit-logs',  // Bucket with Object Lock enabled
    Key: key,
    Body: JSON.stringify(auditEvent),
    ContentType: 'application/json',
    ServerSideEncryption: 'aws:kms'
  }));
}
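Retention (best practice 4 above) can be enforced on the same bucket with an S3 lifecycle rule. A sketch matching the PCI-DSS 1-year row -- the bucket name and prefix are assumptions; the rule is applied with the AWS SDK v3 `PutBucketLifecycleConfigurationCommand` from `@aws-sdk/client-s3`:

```javascript
// Lifecycle rule: expire audit-log objects after the 1-year retention window.
// Object Lock handles immutability during the window; lifecycle handles expiry.
const retentionParams = {
  Bucket: 'myapp-audit-logs',
  LifecycleConfiguration: {
    Rules: [{
      ID: 'expire-audit-logs-after-1-year',
      Status: 'Enabled',
      Filter: { Prefix: 'audit-logs/' },
      Expiration: { Days: 365 }
    }]
  }
};

// e.g. await s3.send(new PutBucketLifecycleConfigurationCommand(retentionParams));
```

Note that lifecycle expiration cannot delete an object before its Object Lock retention date passes, which is exactly the behavior you want for compliance.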

10. Runbook Basics

A runbook is a documented procedure for responding to a specific alert. Every P1/P2 alert should link to a runbook.

Runbook template

# Runbook: High Error Rate on Order Service

## Alert
- Name: HighErrorRate-OrderService
- Threshold: Error rate > 5% for 2 consecutive minutes
- Severity: P1

## Impact
- Users cannot place orders
- Revenue is directly affected

## Diagnosis Steps
1. Check order-service logs for the most common error:
   `filter: service=order-service AND level=error | top 10 errorMessage`

2. Check if the error is in our code or a downstream dependency:
   `filter: service=order-service AND level=error | top 10 downstreamService`

3. Check recent deployments:
   `git log --oneline -5 --since="1 hour ago"`

4. Check database health:
   - MongoDB Atlas dashboard: connections, operations/sec, replication lag
   - Redis dashboard: memory usage, connected clients, evictions

## Resolution Steps

### If the error is a bad deployment:
1. Roll back to the previous version:
   `aws ecs update-service --cluster prod --service order-service --task-definition order-service:PREVIOUS`
2. Verify error rate is decreasing
3. Investigate the bad deploy in a non-production environment

### If the error is a downstream service:
1. Check the downstream service's health dashboard
2. Enable circuit breaker if not already active
3. Escalate to the downstream service team

### If the error is database-related:
1. Check connection pool usage: should be < 80%
2. Check for slow queries: MongoDB profiler or slow query log
3. Check replica set health: `rs.status()`

## Escalation
- Primary: On-call engineer (PagerDuty)
- Secondary: Order Service tech lead
- Manager: Engineering Manager

## Post-Incident
- Write incident summary in #incidents Slack channel
- Schedule blameless post-mortem for P1 incidents
- Update this runbook with any new findings

11. Complete Alerting Configuration Example

Bringing together metrics, alerts, and event logging in a unified module.

// src/monitoring/alerts.js
const logger = require('../config/logger');
const { businessEvents, aiMetrics } = require('../metrics');

// ---- Alert Evaluation (for self-hosted monitoring) ----

class AlertManager {
  constructor() {
    this.alertStates = new Map();    // Track current alert states
    this.silencedAlerts = new Set(); // Temporarily silenced alerts
  }

  evaluate(alertName, currentValue, threshold, severity, context = {}) {
    const isTriggered = currentValue > threshold;
    const wasTriggered = this.alertStates.get(alertName)?.triggered || false;

    if (isTriggered && !wasTriggered) {
      // Alert is firing (new)
      this.alertStates.set(alertName, {
        triggered: true,
        triggeredAt: new Date(),
        value: currentValue,
        threshold
      });

      logger.error({
        event: 'alert.fired',
        alertName,
        severity,
        currentValue,
        threshold,
        ...context
      }, `ALERT FIRED: ${alertName}`);

      this.notify(alertName, severity, currentValue, threshold, context);

    } else if (!isTriggered && wasTriggered) {
      // Alert resolved
      const alertState = this.alertStates.get(alertName);
      const duration = Date.now() - alertState.triggeredAt.getTime();

      this.alertStates.set(alertName, { triggered: false });

      logger.info({
        event: 'alert.resolved',
        alertName,
        severity,
        currentValue,
        threshold,
        durationMs: duration,
        ...context
      }, `ALERT RESOLVED: ${alertName} (was firing for ${Math.round(duration / 1000)}s)`);

      this.notifyResolved(alertName, severity, duration);
    }
  }

  async notify(alertName, severity, value, threshold, context) {
    if (this.silencedAlerts.has(alertName)) return;

    // Send to Slack
    if (severity === 'critical' || severity === 'high') {
      await this.sendSlackAlert({
        channel: '#alerts-critical',
        text: `*${severity.toUpperCase()}: ${alertName}*\nValue: ${value} (threshold: ${threshold})\nRunbook: https://wiki/runbooks/${alertName}`
      });
    }

    // Page on-call for critical
    if (severity === 'critical') {
      await this.pageDutyAlert(alertName, value, threshold, context);
    }
  }

  async notifyResolved(alertName, severity, durationMs) {
    await this.sendSlackAlert({
      channel: '#alerts-critical',
      text: `*RESOLVED: ${alertName}* (was firing for ${Math.round(durationMs / 60000)} minutes)`
    });
  }

  silence(alertName, durationMs) {
    this.silencedAlerts.add(alertName);
    setTimeout(() => this.silencedAlerts.delete(alertName), durationMs);
    logger.warn({ alertName, durationMs }, 'Alert silenced');
  }

  // Placeholder methods -- integrate with actual services
  async sendSlackAlert(payload) { /* Slack webhook integration */ }
  async pageDutyAlert(name, value, threshold, context) { /* PagerDuty API */ }
}

module.exports = new AlertManager();

// src/monitoring/healthChecks.js
const logger = require('../config/logger');
const mongoose = require('mongoose');
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);  // Client used by the Redis check below

async function checkHealth() {
  const health = {
    status: 'healthy',
    timestamp: new Date().toISOString(),
    checks: {}
  };

  // Database check
  try {
    const start = Date.now();
    await mongoose.connection.db.admin().ping();
    health.checks.database = {
      status: 'healthy',
      latency: Date.now() - start
    };
  } catch (error) {
    health.checks.database = { status: 'unhealthy', error: error.message };
    health.status = 'unhealthy';
  }

  // Redis check
  try {
    const start = Date.now();
    await redis.ping();
    health.checks.redis = {
      status: 'healthy',
      latency: Date.now() - start
    };
  } catch (error) {
    health.checks.redis = { status: 'unhealthy', error: error.message };
    health.status = 'unhealthy';
  }

  // Log health check results
  if (health.status === 'unhealthy') {
    logger.error({ health }, 'Health check failed');
  } else {
    logger.debug({ health }, 'Health check passed');
  }

  return health;
}

module.exports = { checkHealth };

// Express health endpoint
app.get('/health', async (req, res) => {
  const health = await checkHealth();
  const statusCode = health.status === 'healthy' ? 200 : 503;
  res.status(statusCode).json(health);
});

12. Key Takeaways

  1. Alert on symptoms, not causes -- alert when user experience degrades (error rate, latency), not on internal metrics (CPU) unless they directly impact users.
  2. Every alert must be actionable -- if you cannot do anything about it right now, it should not wake you up.
  3. Alert fatigue kills -- ruthlessly delete or tune noisy alerts. Use SLO burn rate alerting as a modern alternative.
  4. Log security events -- authentication, authorization, and data access events are non-negotiable for security and compliance.
  5. Runbooks save lives -- every P1/P2 alert should link to a step-by-step guide. The 3 AM on-call engineer should not have to think.
  6. Audit logs are separate -- they go to immutable storage, are retained per compliance requirements, and cannot be deleted by application developers.

Explain-It Challenge

  1. The on-call engineer got paged 12 times last week, but only 2 pages required real action. Diagnose the problem and propose three specific fixes.
  2. Your payment service processes credit cards. What events MUST you log for PCI-DSS compliance, and what must you NEVER log?
  3. Design an alerting strategy for a new AI-powered feature that calls the OpenAI API. What metrics would you alert on and at what thresholds?

Navigation: <- 6.7.b Monitoring and Metrics | 6.7 Overview