Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.7 — Logging and Observability
6.7.c -- Alerting and Event Logging
In one sentence: Alerting turns your metrics and logs into actionable notifications that wake you up before users notice problems, while event logging creates an auditable record of every important thing that happens in your system -- authentication attempts, payments, AI usage, and rate limit violations.
Navigation: <- 6.7.b Monitoring and Metrics | 6.7 Overview
1. Why Alerting Matters
Without alerting, you discover problems in one of three ways:
- A user complains -- the worst way. They are already frustrated.
- Someone checks the dashboard -- unreliable. Nobody watches dashboards 24/7.
- An automated alert fires -- the best way. You know before users do.
Timeline of an incident WITHOUT alerting:
14:00 Database connection pool saturated
14:05 API responses slow down (p99 → 5s)
14:15 Users start seeing timeouts
14:30 First user emails support
14:45 Support escalates to engineering
15:00 Engineer opens laptop and starts investigating
15:30 Root cause found and fixed
────────────────────────────────────────────────
Impact: 90 minutes of degraded service
Timeline of the same incident WITH alerting:
14:00 Database connection pool saturated
14:01 Alert fires: "DB connection pool > 90%"
14:03 On-call engineer's phone buzzes
14:05 Engineer opens laptop, sees the metric
14:15 Root cause found and fixed
────────────────────────────────────────────────
Impact: 15 minutes of degraded service
Key insight: Alerting shrinks the detection time from 30-60 minutes to 1-3 minutes. The fix takes the same time either way -- detection is what you can automate.
2. Alert Design Principles
Bad alerts are worse than no alerts because they train the team to ignore them (alert fatigue). Every alert must pass these five tests:
The five rules of good alerts
| Rule | Description | Bad Example | Good Example |
|---|---|---|---|
| Actionable | Someone can do something about it right now | "Disk usage: 45%" | "Disk usage > 85% -- expand volume or clean up" |
| Urgent | It needs attention within minutes, not days | "SSL cert expires in 60 days" | "SSL cert expires in 3 days" |
| Real | It indicates a genuine problem, not noise | "CPU spike to 70% for 2 seconds" | "CPU > 80% sustained for 10 minutes" |
| Unique | Not duplicated by another alert | 5 alerts for the same outage | 1 alert with context about the failure |
| Documented | Links to a runbook explaining what to do | "Error rate high" | "Error rate > 5% -- see runbook.md#error-rate" |
The three questions every alert must answer
1. WHAT is happening? → "Payment service error rate is 12%"
2. WHY does it matter? → "Users cannot complete purchases"
3. WHAT should I do? → "Check payment-service logs, verify Stripe status"
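One lightweight way to enforce this is to make the notification template itself demand all three answers. A minimal sketch (the field names and the `formatAlert` helper are illustrative, not from any particular alerting tool):

```javascript
// Compose an alert message that answers: what, why, and what to do.
// Throws if any of the three required answers is missing.
function formatAlert({ what, why, action, runbookUrl }) {
  if (!what || !why || !action) {
    throw new Error('Every alert must answer: what, why, and what to do');
  }
  return [
    `WHAT: ${what}`,
    `WHY: ${why}`,
    `ACTION: ${action}`,
    runbookUrl ? `RUNBOOK: ${runbookUrl}` : null
  ].filter(Boolean).join('\n');
}

const message = formatAlert({
  what: 'Payment service error rate is 12%',
  why: 'Users cannot complete purchases',
  action: 'Check payment-service logs, verify Stripe status',
  runbookUrl: 'https://wiki/runbook/error-rate'
});
```

Because the helper throws on incomplete alerts, a half-written alert definition fails in review rather than at 3 AM.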
3. Alert Thresholds and Severity Levels
Severity levels
| Severity | Meaning | Response Time | Notification | Example |
|---|---|---|---|---|
| Critical (P1) | Service down, revenue impact | Immediate (< 5 min) | Page on-call, phone call | API returning 5xx for all users |
| High (P2) | Major degradation, some users affected | Within 30 minutes | Page on-call, Slack | p99 latency > 5s, error rate > 5% |
| Medium (P3) | Minor issue, needs attention today | Within 4 hours | Slack channel | Disk usage > 80%, cert expires in 7 days |
| Low (P4) | Informational, plan for later | Next business day | Email, ticket | Memory usage trending up, dependency deprecated |
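The response-time and notification columns can live in one config object so routing logic is never scattered across handlers. A sketch with placeholder channel names:

```javascript
// Map severity to notification channels and response SLAs.
// Mirrors the table above; channel names are placeholders for
// your real integrations (PagerDuty, Slack webhook, etc.).
const SEVERITY_ROUTES = {
  P1: { channels: ['pager', 'phone'], responseMinutes: 5 },
  P2: { channels: ['pager', 'slack'], responseMinutes: 30 },
  P3: { channels: ['slack'], responseMinutes: 240 },
  P4: { channels: ['email', 'ticket'], responseMinutes: 24 * 60 }
};

function routeAlert(severity) {
  const route = SEVERITY_ROUTES[severity];
  if (!route) throw new Error(`Unknown severity: ${severity}`);
  return route;
}
```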
Setting thresholds: the multi-window approach
Do not alert on a single data point. Use sustained conditions with appropriate time windows.
// BAD: Alerts on a single spike (noisy)
if (cpuUsage > 80) alert('CPU high!');
// GOOD: Alerts on sustained condition
// "CPU > 80% for at least 5 consecutive minutes"
// This ignores brief spikes during deployments or GC pauses
// BETTER: Multi-window approach
// Short window (5 min): detects sudden failures fast
// Long window (1 hour): detects slow degradation
// Alert when BOTH windows exceed thresholds
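The multi-window idea can be sketched as a pure function over recent samples (assuming one sample per minute and an in-memory array; a real system would query its metrics store):

```javascript
// Multi-window alert check: fire only when BOTH the short and long
// window averages exceed the threshold. Assumes one sample per minute,
// newest sample last.
function shouldAlert(samples, threshold, shortWindow = 5, longWindow = 60) {
  if (samples.length < longWindow) return false; // not enough history yet
  const avg = (arr) => arr.reduce((a, b) => a + b, 0) / arr.length;
  const shortAvg = avg(samples.slice(-shortWindow));
  const longAvg = avg(samples.slice(-longWindow));
  return shortAvg > threshold && longAvg > threshold;
}

// A brief 2-minute spike alone does not fire; sustained load does.
const calm = Array(60).fill(40);
const spiky = [...Array(58).fill(40), 95, 95];
const sustained = Array(60).fill(85);
```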
Recommended alert thresholds
| Metric | Warning (P3) | Critical (P1) | Window |
|---|---|---|---|
| Error rate | > 1% for 5 min | > 5% for 2 min | 5-minute rate |
| p99 latency | > 1s for 5 min | > 5s for 2 min | 5-minute window |
| CPU usage | > 70% for 15 min | > 90% for 5 min | Sustained |
| Memory usage | > 75% for 10 min | > 90% for 5 min | Sustained |
| Disk usage | > 80% | > 90% | Point-in-time |
| Event loop lag | > 50ms for 5 min | > 200ms for 2 min | Node.js specific |
| DB connections | > 70% pool used | > 90% pool used | Current gauge |
| Queue depth | > 1000 messages | > 10000 messages | Current gauge |
| Certificate expiry | < 14 days | < 3 days | Daily check |
4. On-Call Basics
On-call rotation ensures someone is always responsible for responding to alerts.
On-call responsibilities
┌─────────────────────────────────────────────────┐
│ ON-CALL ROTATION │
│ │
│ Week 1: Alice (primary), Bob (secondary) │
│ Week 2: Bob (primary), Charlie (secondary) │
│ Week 3: Charlie (primary), Alice (secondary) │
│ │
│ When an alert fires: │
│ 1. Primary on-call is paged │
│ 2. Acknowledge within 5 minutes │
│ 3. If no ack → escalate to secondary │
│ 4. If no ack → escalate to engineering manager │
│ │
│ After resolution: │
│ 1. Write incident summary │
│ 2. Update runbook if needed │
│ 3. Blameless post-mortem for P1/P2 │
└─────────────────────────────────────────────────┘
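The escalation chain in the box reduces to a pure function of elapsed time and acknowledgement state. A sketch using the 5-minute acknowledgement window from above (the second 5-minute step before paging the manager is an assumption, not specified in the box):

```javascript
// Who should be paged, given minutes since the alert fired and
// whether anyone has acknowledged it. Chain: primary -> secondary
// -> engineering manager.
function escalationTarget(minutesSincePage, acknowledged) {
  if (acknowledged) return null; // someone owns it, stop escalating
  if (minutesSincePage < 5) return 'primary';
  if (minutesSincePage < 10) return 'secondary';
  return 'engineering-manager';
}
```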
On-call best practices
- Rotate weekly -- no one should be on-call more than 1 week in 4
- Compensate -- on-call work is real work, acknowledge it
- Provide runbooks -- the on-call person should never have to guess what to do
- Limit alerts -- if on-call gets more than 2 pages per shift, fix the noise
- Blameless post-mortems -- focus on what broke, not who broke it
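The weekly rotation in the box above is simple enough to generate rather than maintain by hand. A sketch:

```javascript
// Generate the (primary, secondary) pair for a given week index,
// matching the rotation in the box: each week the chain shifts by one.
function rotation(engineers, weekIndex) {
  const n = engineers.length;
  return {
    primary: engineers[weekIndex % n],
    secondary: engineers[(weekIndex + 1) % n]
  };
}

const team = ['Alice', 'Bob', 'Charlie'];
```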
Tools for on-call management
| Tool | Purpose |
|---|---|
| PagerDuty | Incident management, on-call scheduling, escalation policies |
| OpsGenie (Atlassian) | Similar to PagerDuty, integrates with Jira |
| Grafana OnCall | Open-source on-call management |
| CloudWatch Alarms | AWS-native alerting, integrates with SNS |
5. CloudWatch Alarms (AWS)
For AWS deployments, CloudWatch Alarms are the standard alerting mechanism.
// Creating a CloudWatch Alarm with the AWS SDK
// (v2 shown here; the v3 equivalent lives in @aws-sdk/client-cloudwatch)
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });
// Alert: Error rate > 5% for 2 consecutive periods
async function createErrorRateAlarm() {
await cloudwatch.putMetricAlarm({
AlarmName: 'HighErrorRate-OrderService',
AlarmDescription: 'Error rate exceeds 5% for 2 minutes. Check order-service logs. Runbook: https://wiki/runbook/error-rate',
Namespace: 'MyApp/Production',
MetricName: 'ErrorRate',
Dimensions: [{ Name: 'Service', Value: 'order-service' }],
Statistic: 'Average',
Period: 60, // Check every 60 seconds
EvaluationPeriods: 2, // Must breach for 2 consecutive periods
Threshold: 5, // 5% error rate
ComparisonOperator: 'GreaterThanThreshold',
TreatMissingData: 'breaching', // Missing data = assume bad
AlarmActions: [
'arn:aws:sns:us-east-1:123456789:PagerDutyAlerts' // SNS topic
],
OKActions: [
'arn:aws:sns:us-east-1:123456789:AlertResolved'
]
}).promise();
}
// Alert: p99 latency > 2 seconds
async function createLatencyAlarm() {
await cloudwatch.putMetricAlarm({
AlarmName: 'HighLatency-OrderService',
AlarmDescription: 'p99 latency exceeds 2s. Check for slow queries or downstream timeouts.',
Namespace: 'MyApp/Production',
MetricName: 'RequestLatency',
Dimensions: [{ Name: 'Service', Value: 'order-service' }],
ExtendedStatistic: 'p99', // Use p99 percentile
Period: 300, // 5-minute window
EvaluationPeriods: 2,
Threshold: 2000, // 2000ms
ComparisonOperator: 'GreaterThanThreshold',
AlarmActions: [
'arn:aws:sns:us-east-1:123456789:SlackAlerts'
]
}).promise();
}
// Alert: Disk usage > 85%
async function createDiskAlarm() {
await cloudwatch.putMetricAlarm({
AlarmName: 'HighDiskUsage-OrderService',
AlarmDescription: 'Disk usage exceeds 85%. Clean up logs or expand volume.',
// NOTE: EC2 publishes no disk metrics by default -- install the CloudWatch
// agent, which reports disk_used_percent under the CWAgent namespace
Namespace: 'CWAgent',
MetricName: 'disk_used_percent',
Statistic: 'Maximum',
Period: 300,
EvaluationPeriods: 1,
Threshold: 85,
ComparisonOperator: 'GreaterThanThreshold',
AlarmActions: [
'arn:aws:sns:us-east-1:123456789:SlackAlerts'
]
}).promise();
}
6. Alert Fatigue and How to Prevent It
Alert fatigue is when the team receives so many alerts that they start ignoring them -- including the real ones. It is the number one reason alerting systems fail.
Signs of alert fatigue
- On-call acknowledges alerts without investigating
- Alerts are muted or snoozed habitually
- Multiple alerts fire for the same incident
- Alerts fire during deployments and are always "resolved automatically"
- Team says "that alert always fires, just ignore it"
How to prevent alert fatigue
| Strategy | Implementation |
|---|---|
| Delete noisy alerts | If an alert fires > 3 times/week without action, remove it |
| Increase thresholds | Move from "CPU > 60%" to "CPU > 80% sustained 10 min" |
| Add time windows | Never alert on a single data point; require sustained conditions |
| Deduplicate | Group related alerts into a single notification |
| Route correctly | P3/P4 to Slack, only P1/P2 page on-call |
| Review monthly | Audit all alerts: keep, tune, or delete each one |
| Use error budgets | Alert on SLO burn rate, not individual metric spikes |
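The deduplication row deserves a concrete shape: most alerting tools fingerprint alerts and suppress repeats within a window. A standalone sketch of that pattern:

```javascript
// Deduplicate alerts: identical (name, service) pairs within the
// suppression window collapse into a single notification.
class AlertDeduper {
  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
    this.lastSent = new Map(); // fingerprint -> timestamp of last send
  }
  shouldSend(alertName, service, now = Date.now()) {
    const fingerprint = `${alertName}:${service}`;
    const last = this.lastSent.get(fingerprint);
    if (last !== undefined && now - last < this.windowMs) {
      return false; // duplicate within the window -- suppress it
    }
    this.lastSent.set(fingerprint, now);
    return true;
  }
}
```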
SLO burn rate alerting (the modern approach)
Instead of alerting on raw metrics, alert when your error budget is being consumed too fast.
SLO: 99.9% availability over 30 days
Error budget: 43.2 minutes of downtime
Normal burn rate: 1x (consuming budget at the expected rate)
Alert thresholds:
- 14.4x burn rate for 5 min → P1 (will exhaust budget in 2 days)
- 6x burn rate for 30 min → P2 (will exhaust budget in 5 days)
- 3x burn rate for 6 hours → P3 (will exhaust budget in 10 days)
This approach:
✓ Ignores brief spikes (1 minute of errors)
✓ Catches sustained degradation early
✓ Directly tied to what matters (the SLO)
✓ Fewer, more meaningful alerts
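The numbers above follow directly from the SLO arithmetic: with a 99.9% SLO the allowed error rate is 0.1%, and burn rate is simply the observed error rate divided by that. A sketch of the math:

```javascript
// Burn rate = observed error rate / allowed error rate.
// For a 99.9% SLO the allowed error rate is 1 - 0.999 = 0.001.
function burnRate(observedErrorRate, slo) {
  return observedErrorRate / (1 - slo);
}

// Total error budget for the window, in minutes.
// 99.9% over 30 days: 30 * 24 * 60 * 0.001 = 43.2 minutes.
function errorBudgetMinutes(slo, windowDays = 30) {
  return windowDays * 24 * 60 * (1 - slo);
}

// Hours until the budget is exhausted at a given burn rate.
function hoursToExhaustion(rate, windowDays = 30) {
  return (windowDays * 24) / rate;
}
```

At a 14.4x burn rate, a 30-day budget is gone in 720 / 14.4 = 50 hours, which is the "2 days" quoted in the thresholds above.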
7. Important Events to Log
Beyond request/response logging, specific events should be logged for operational awareness, security, and compliance.
API errors and failures
// Log every error with full context
app.use((err, req, res, next) => {
const logData = {
requestId: req.requestId,
userId: req.user?.id,
method: req.method,
path: req.originalUrl,
statusCode: err.status || 500,
errorName: err.name,
errorMessage: err.message,
errorCode: err.code, // Custom error codes
stack: err.stack,
// Downstream service info (if applicable)
downstreamService: err.service,
downstreamStatusCode: err.downstreamStatus
};
const status = err.status || 500; // errors without a status code default to 500
if (status >= 500) {
logger.error(logData, 'Server error');
} else {
logger.warn(logData, 'Client error');
}
res.status(status).json({
error: err.message,
requestId: req.requestId
});
});
Authentication and security events
// ALWAYS log authentication events -- these are security-critical
const securityLogger = logger.child({ category: 'security' });
// Successful login
securityLogger.info({
event: 'auth.login.success',
userId: user.id,
method: 'password', // password, oauth, magic-link
ip: req.ip,
userAgent: req.get('User-Agent'),
mfaUsed: true
}, 'User logged in');
// Failed login
securityLogger.warn({
event: 'auth.login.failure',
email: maskEmail(email), // "j***@example.com"
reason: 'invalid_password',
ip: req.ip,
userAgent: req.get('User-Agent'),
failedAttempts: loginAttempts
}, 'Login failed');
// Account locked
securityLogger.error({
event: 'auth.account.locked',
userId: user.id,
reason: 'too_many_failed_attempts',
failedAttempts: 5,
lockDuration: '30 minutes',
ip: req.ip
}, 'Account locked due to failed login attempts');
// Password changed
securityLogger.info({
event: 'auth.password.changed',
userId: user.id,
ip: req.ip,
method: 'user_initiated'
}, 'Password changed');
// Permission denied
securityLogger.warn({
event: 'auth.access.denied',
userId: req.user.id,
resource: req.originalUrl,
requiredRole: 'admin',
actualRole: req.user.role,
ip: req.ip
}, 'Access denied - insufficient permissions');
AI usage events
const aiLogger = logger.child({ category: 'ai' });
// Log every AI API call
aiLogger.info({
event: 'ai.call.completed',
provider: 'openai',
model: 'gpt-4o',
promptTokens: response.usage.prompt_tokens,
completionTokens: response.usage.completion_tokens,
totalTokens: response.usage.total_tokens,
duration: endTime - startTime,
temperature: 0.7,
userId: req.user.id,
feature: 'profile-generation', // Which feature triggered this
estimatedCost: calculateCost(response.usage)
}, 'AI API call completed');
// AI error
aiLogger.error({
event: 'ai.call.failed',
provider: 'openai',
model: 'gpt-4o',
error: error.message,
statusCode: error.status,
retryCount: attempt,
userId: req.user.id,
feature: 'profile-generation'
}, 'AI API call failed');
Rate limit events
// Log when users hit rate limits
logger.warn({
event: 'ratelimit.exceeded',
userId: req.user?.id || 'anonymous',
ip: req.ip,
path: req.originalUrl,
limit: rateLimit.max,
windowMs: rateLimit.windowMs,
currentCount: currentRequestCount
}, 'Rate limit exceeded');
// Log when approaching rate limit (early warning)
if (currentRequestCount > rateLimit.max * 0.8) {
logger.info({
event: 'ratelimit.approaching',
userId: req.user?.id,
currentCount: currentRequestCount,
limit: rateLimit.max,
percentUsed: Math.round((currentRequestCount / rateLimit.max) * 100)
}, 'Approaching rate limit');
}
Payment and financial events
const paymentLogger = logger.child({ category: 'payment' });
// Payment succeeded
paymentLogger.info({
event: 'payment.charge.success',
userId: user.id,
orderId: order.id,
amount: order.total,
currency: 'USD',
provider: 'stripe',
paymentIntentId: paymentIntent.id,
cardLast4: paymentMethod.card.last4, // Only last 4 digits!
cardBrand: paymentMethod.card.brand
}, 'Payment successful');
// Payment failed
paymentLogger.error({
event: 'payment.charge.failure',
userId: user.id,
orderId: order.id,
amount: order.total,
currency: 'USD',
provider: 'stripe',
declineCode: error.decline_code,
errorMessage: error.message,
errorType: error.type
}, 'Payment failed');
// Refund processed
paymentLogger.info({
event: 'payment.refund.processed',
userId: user.id,
orderId: order.id,
refundAmount: refund.amount,
refundId: refund.id,
reason: refund.reason,
initiatedBy: adminUser.id
}, 'Refund processed');
8. Audit Logging
Audit logs are a special category of event logs designed for compliance, security review, and forensic analysis. They answer: "Who did what, when, and from where?"
// src/utils/auditLogger.js
const logger = require('../config/logger');
const auditLogger = logger.child({ category: 'audit' });
function logAuditEvent({
actor, // Who performed the action
action, // What they did
resource, // What they did it to
resourceId, // Specific resource ID
details, // Additional context
ip, // IP address
userAgent, // Browser/client info
outcome // success or failure
}) {
auditLogger.info({
event: 'audit',
actor: {
id: actor.id,
email: actor.email,
role: actor.role
},
action,
resource,
resourceId,
outcome,
details,
ip,
userAgent,
timestamp: new Date().toISOString()
}, `Audit: ${actor.email} ${action} ${resource}/${resourceId}`);
}
module.exports = { logAuditEvent };
Usage in route handlers
const { logAuditEvent } = require('../utils/auditLogger');
// Admin deletes a user
app.delete('/api/admin/users/:id', requireAdmin, async (req, res) => {
const user = await User.findById(req.params.id);
if (!user) return res.status(404).json({ error: 'User not found' });
await user.deleteOne();
logAuditEvent({
actor: req.user,
action: 'DELETE',
resource: 'user',
resourceId: req.params.id,
details: { deletedEmail: user.email, deletedRole: user.role },
ip: req.ip,
userAgent: req.get('User-Agent'),
outcome: 'success'
});
res.json({ message: 'User deleted' });
});
// User updates their profile
app.patch('/api/users/me', authenticate, async (req, res) => {
const before = { ...req.user.toObject() };
const updated = await User.findByIdAndUpdate(req.user.id, req.body, { new: true });
logAuditEvent({
actor: req.user,
action: 'UPDATE',
resource: 'user_profile',
resourceId: req.user.id,
details: {
fieldsChanged: Object.keys(req.body),
before: { name: before.name, email: before.email },
after: { name: updated.name, email: updated.email }
},
ip: req.ip,
userAgent: req.get('User-Agent'),
outcome: 'success'
});
res.json(updated);
});
What to audit
| Category | Events to Log |
|---|---|
| Authentication | Login, logout, password change, MFA enable/disable, token refresh |
| Authorization | Permission changes, role assignments, access denials |
| Data access | View, create, update, delete of sensitive records |
| Admin actions | User management, configuration changes, feature toggles |
| Financial | Payments, refunds, subscription changes, pricing updates |
| System | Deployments, configuration changes, secret rotation |
9. Event Logging for Compliance
Different regulations require different retention and logging standards.
| Regulation | What It Requires | Retention |
|---|---|---|
| GDPR | Log access to personal data, data deletion events, consent changes | Varies (document your policy) |
| HIPAA | Log all access to health records, who accessed what and when | 6 years |
| PCI-DSS | Log access to cardholder data, authentication events, admin actions | 1 year (3 months immediately available) |
| SOC 2 | Log security events, access controls, system changes | 1 year |
Compliance logging best practices
- Immutable storage -- audit logs should be written to append-only storage (S3 with Object Lock, CloudWatch Logs)
- Separate from application logs -- audit logs go to a different destination that developers cannot delete
- Include all context -- who, what, when, where (IP), outcome
- Retention policies -- configure automatic retention matching your compliance requirements
- Encrypt at rest -- audit logs often contain references to sensitive operations
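The retention requirements from the table above can be encoded as config so an archival job can enforce them (values are the table's minimums in days; GDPR is omitted because its retention is whatever your documented policy says):

```javascript
// Minimum retention per regulation, in days, from the compliance table.
const RETENTION_DAYS = {
  hipaa: 6 * 365,
  'pci-dss': 365,
  soc2: 365
};

function retentionDays(regulations) {
  // Keep logs for the longest period any applicable regulation demands.
  const days = regulations.map((r) => {
    if (!(r in RETENTION_DAYS)) throw new Error(`No retention rule for ${r}`);
    return RETENTION_DAYS[r];
  });
  return Math.max(...days);
}
```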
// Writing audit logs to a separate immutable destination
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const s3 = new S3Client({ region: 'us-east-1' });
async function writeAuditToS3(auditEvent) {
const date = new Date();
const pad = (n) => String(n).padStart(2, '0'); // zero-pad so keys sort lexicographically
const key = `audit-logs/${date.getFullYear()}/${pad(date.getMonth() + 1)}/${pad(date.getDate())}/${date.getTime()}-${auditEvent.resourceId}.json`;
await s3.send(new PutObjectCommand({
Bucket: 'myapp-audit-logs', // Bucket with Object Lock enabled
Key: key,
Body: JSON.stringify(auditEvent),
ContentType: 'application/json',
ServerSideEncryption: 'aws:kms',
// Enforce per-object retention -- requires Object Lock enabled on the bucket
ObjectLockMode: 'COMPLIANCE',
ObjectLockRetainUntilDate: new Date(Date.now() + 365 * 24 * 60 * 60 * 1000) // e.g. PCI-DSS: 1 year
}));
}
10. Runbook Basics
A runbook is a documented procedure for responding to a specific alert. Every P1/P2 alert should link to a runbook.
Runbook template
# Runbook: High Error Rate on Order Service
## Alert
- Name: HighErrorRate-OrderService
- Threshold: Error rate > 5% for 2 consecutive minutes
- Severity: P1
## Impact
- Users cannot place orders
- Revenue is directly affected
## Diagnosis Steps
1. Check order-service logs for the most common error:
`filter: service=order-service AND level=error | top 10 errorMessage`
2. Check if the error is in our code or a downstream dependency:
`filter: service=order-service AND level=error | top 10 downstreamService`
3. Check recent deployments:
`git log --oneline -5 --since="1 hour ago"`
4. Check database health:
- MongoDB Atlas dashboard: connections, operations/sec, replication lag
- Redis dashboard: memory usage, connected clients, evictions
## Resolution Steps
### If the error is a bad deployment:
1. Roll back to the previous version:
`aws ecs update-service --cluster prod --service order-service --task-definition order-service:PREVIOUS`
2. Verify error rate is decreasing
3. Investigate the bad deploy in a non-production environment
### If the error is a downstream service:
1. Check the downstream service's health dashboard
2. Enable circuit breaker if not already active
3. Escalate to the downstream service team
### If the error is database-related:
1. Check connection pool usage: should be < 80%
2. Check for slow queries: MongoDB profiler or slow query log
3. Check replica set health: `rs.status()`
## Escalation
- Primary: On-call engineer (PagerDuty)
- Secondary: Order Service tech lead
- Manager: Engineering Manager
## Post-Incident
- Write incident summary in #incidents Slack channel
- Schedule blameless post-mortem for P1 incidents
- Update this runbook with any new findings
11. Complete Alerting Configuration Example
Bringing together metrics, alerts, and event logging in a unified module.
// src/monitoring/alerts.js
const logger = require('../config/logger');
const { businessEvents, aiMetrics } = require('../metrics');
// ---- Alert Evaluation (for self-hosted monitoring) ----
class AlertManager {
constructor() {
this.alertStates = new Map(); // Track current alert states
this.silencedAlerts = new Set(); // Temporarily silenced alerts
}
evaluate(alertName, currentValue, threshold, severity, context = {}) {
const isTriggered = currentValue > threshold;
const wasTriggered = this.alertStates.get(alertName)?.triggered || false;
if (isTriggered && !wasTriggered) {
// Alert is firing (new)
this.alertStates.set(alertName, {
triggered: true,
triggeredAt: new Date(),
value: currentValue,
threshold
});
logger.error({
event: 'alert.fired',
alertName,
severity,
currentValue,
threshold,
...context
}, `ALERT FIRED: ${alertName}`);
this.notify(alertName, severity, currentValue, threshold, context);
} else if (!isTriggered && wasTriggered) {
// Alert resolved
const alertState = this.alertStates.get(alertName);
const duration = Date.now() - alertState.triggeredAt.getTime();
this.alertStates.set(alertName, { triggered: false });
logger.info({
event: 'alert.resolved',
alertName,
severity,
currentValue,
threshold,
durationMs: duration,
...context
}, `ALERT RESOLVED: ${alertName} (was firing for ${Math.round(duration / 1000)}s)`);
this.notifyResolved(alertName, severity, duration);
}
}
async notify(alertName, severity, value, threshold, context) {
if (this.silencedAlerts.has(alertName)) return;
// Send to Slack
if (severity === 'critical' || severity === 'high') {
await this.sendSlackAlert({
channel: '#alerts-critical',
text: `*${severity.toUpperCase()}: ${alertName}*\nValue: ${value} (threshold: ${threshold})\nRunbook: https://wiki/runbooks/${alertName}`
});
}
// Page on-call for critical
if (severity === 'critical') {
await this.pageDutyAlert(alertName, value, threshold, context);
}
}
async notifyResolved(alertName, severity, durationMs) {
await this.sendSlackAlert({
channel: '#alerts-critical',
text: `*RESOLVED: ${alertName}* (was firing for ${Math.round(durationMs / 60000)} minutes)`
});
}
silence(alertName, durationMs) {
this.silencedAlerts.add(alertName);
setTimeout(() => this.silencedAlerts.delete(alertName), durationMs);
logger.warn({ alertName, durationMs }, 'Alert silenced');
}
// Placeholder methods -- integrate with actual services
async sendSlackAlert(payload) { /* Slack webhook integration */ }
async pageDutyAlert(name, value, threshold, context) { /* PagerDuty API */ }
}
module.exports = new AlertManager();
// src/monitoring/healthChecks.js
const logger = require('../config/logger');
const mongoose = require('mongoose');
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL); // instantiate the client used below
async function checkHealth() {
const health = {
status: 'healthy',
timestamp: new Date().toISOString(),
checks: {}
};
// Database check
try {
const start = Date.now();
await mongoose.connection.db.admin().ping();
health.checks.database = {
status: 'healthy',
latency: Date.now() - start
};
} catch (error) {
health.checks.database = { status: 'unhealthy', error: error.message };
health.status = 'unhealthy';
}
// Redis check
try {
const start = Date.now();
await redis.ping();
health.checks.redis = {
status: 'healthy',
latency: Date.now() - start
};
} catch (error) {
health.checks.redis = { status: 'unhealthy', error: error.message };
health.status = 'unhealthy';
}
// Log health check results
if (health.status === 'unhealthy') {
logger.error({ health }, 'Health check failed');
} else {
logger.debug({ health }, 'Health check passed');
}
return health;
}
module.exports = { checkHealth };
// Express health endpoint
app.get('/health', async (req, res) => {
const health = await checkHealth();
const statusCode = health.status === 'healthy' ? 200 : 503;
res.status(statusCode).json(health);
});
12. Key Takeaways
- Alert on symptoms, not causes -- alert when user experience degrades (error rate, latency), not on internal metrics (CPU) unless they directly impact users.
- Every alert must be actionable -- if you cannot do anything about it right now, it should not wake you up.
- Alert fatigue kills -- ruthlessly delete or tune noisy alerts. Use SLO burn rate alerting as a modern alternative.
- Log security events -- authentication, authorization, and data access events are non-negotiable for security and compliance.
- Runbooks save lives -- every P1/P2 alert should link to a step-by-step guide. The 3 AM on-call engineer should not have to think.
- Audit logs are separate -- they go to immutable storage, are retained per compliance requirements, and cannot be deleted by application developers.
Explain-It Challenge
- The on-call engineer got paged 12 times last week, but only 2 pages required real action. Diagnose the problem and propose three specific fixes.
- Your payment service processes credit cards. What events MUST you log for PCI-DSS compliance, and what must you NEVER log?
- Design an alerting strategy for a new AI-powered feature that calls the OpenAI API. What metrics would you alert on and at what thresholds?
Navigation: <- 6.7.b Monitoring and Metrics | 6.7 Overview