Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.2 — Building and Orchestrating Microservices
6.2.c — Retry, Timeout & Circuit Breaker
In one sentence: Distributed systems fail constantly — retries handle transient errors, timeouts prevent infinite waits, and circuit breakers stop cascading failures by cutting off calls to unhealthy services.
Navigation: <- 6.2.b — API Gateway Pattern | 6.2.d — Event-Driven Architecture ->
1. Why Distributed Calls Fail
In a monolith, a function call either works or throws an exception. In microservices, every inter-service call is a network call — and the network is unreliable.
Things that go wrong between services:
Service A ──── network ────→ Service B
1. Network timeout (packet lost, congestion)
2. Service B is down (crashed, deploying, out of memory)
3. Service B is slow (database overloaded, GC pause)
4. DNS resolution fails
5. Connection refused (port not listening)
6. Partial response (connection drops mid-response)
7. 503 Service Unavailable (B is overloaded)
8. 429 Too Many Requests (B is rate-limiting you)
Without resilience patterns, one failing service takes down everything.
Cascade failure:
Client → Gateway → Order Service → User Service (down!)
│
├── Waiting... (30 sec default timeout)
├── Thread pool exhausted
├── Order Service stops responding
└── Gateway times out → Client gets error
Result: User Service outage → Order Service outage → Total outage
2. Retry Strategies
2.1 Simple Retry
Retry the same request a fixed number of times:
async function simpleRetry(fn, maxRetries = 3) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
console.log(`Attempt ${attempt}/${maxRetries} failed: ${err.message}`);
}
}
throw lastError;
}
// Usage
const user = await simpleRetry(() =>
axios.get('http://user-service:4001/users/1')
);
Problem: All retries happen immediately, which can overwhelm an already struggling service.
2.2 Exponential Backoff
Wait longer between each retry:
async function retryWithBackoff(fn, maxRetries = 3, baseDelay = 1000) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
if (attempt === maxRetries) break;
// Exponential: 1s, 2s, 4s, 8s...
const delay = baseDelay * Math.pow(2, attempt - 1);
console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError;
}
With maxRetries = 4 and the default 1000ms baseDelay, the schedule looks like:
Attempt 1: immediate
Attempt 2: after waiting 1000ms (1 second)
Attempt 3: after waiting 2000ms (2 seconds)
Attempt 4: after waiting 4000ms (4 seconds)
2.3 Exponential Backoff with Jitter
Jitter adds randomness to prevent the "thundering herd" problem — where hundreds of clients all retry at the exact same time.
async function retryWithJitter(fn, maxRetries = 3, baseDelay = 1000) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
if (attempt === maxRetries) break;
// Exponential backoff + random jitter
const exponentialDelay = baseDelay * Math.pow(2, attempt - 1);
const jitter = Math.random() * exponentialDelay;
const delay = Math.floor(exponentialDelay + jitter);
console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError;
}
Without jitter (thundering herd):
Client A retries at: 0s, 1s, 2s, 4s
Client B retries at: 0s, 1s, 2s, 4s ← all hit at same time!
Client C retries at: 0s, 1s, 2s, 4s
With jitter (spread out):
Client A retries at: 0s, 1.3s, 2.7s, 5.1s
Client B retries at: 0s, 0.8s, 3.2s, 4.4s ← distributed!
Client C retries at: 0s, 1.6s, 2.1s, 6.8s
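The code above adds jitter on top of the exponential delay, so each wait lands somewhere in [exponential, 2 × exponential). A well-known alternative is "full jitter", which draws the entire delay uniformly from [0, exponential] — a minimal sketch (the function name and cap parameter are illustrative, not part of the code above):

```javascript
// Full jitter: the whole delay is drawn uniformly from
// [0, min(maxDelay, baseDelay * 2^(attempt-1))), spreading retries
// across the entire window instead of clustering near the top.
function fullJitterDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  const exponential = Math.min(maxDelay, baseDelay * Math.pow(2, attempt - 1));
  return Math.floor(Math.random() * exponential);
}

// Three clients retrying attempt 3 land anywhere in [0, 4000) ms:
for (const client of ['A', 'B', 'C']) {
  console.log(`Client ${client} waits ${fullJitterDelay(3)}ms`);
}
```

The trade-off: full jitter can retry almost immediately (delay near 0), but across many clients it produces the flattest load on the recovering service.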
2.4 When NOT to Retry
| Status Code | Retry? | Why |
|---|---|---|
| 500 Internal Server Error | Maybe | Could be transient (memory spike) or permanent (bug) |
| 502 Bad Gateway | Yes | Usually transient |
| 503 Service Unavailable | Yes | Service overloaded, may recover |
| 429 Too Many Requests | Yes | But respect Retry-After header |
| 408 Request Timeout | Yes | Transient network issue |
| 400 Bad Request | No | Your request is wrong — retrying won't fix it |
| 401 Unauthorized | No | Auth issue — retrying won't fix it |
| 404 Not Found | No | Resource doesn't exist |
| 409 Conflict | No | Business logic conflict |
function isRetryable(error) {
if (!error.response) return true; // Network error, no response at all
const status = error.response.status;
return [408, 429, 500, 502, 503, 504].includes(status);
}
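The 429 row deserves one refinement: when the server sends a Retry-After header, honoring it beats any locally computed backoff. A hedged sketch of a helper that could feed the delay into the retry loops above (the function name is an assumption; it expects axios-style lowercase header keys):

```javascript
// Prefer the server's Retry-After hint over our own backoff guess.
// Per HTTP, Retry-After is either delay-seconds ("120") or an HTTP-date.
function retryAfterMs(error, fallbackMs) {
  const header = error.response?.headers?.['retry-after'];
  if (!header) return fallbackMs;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000; // delay-seconds form
  const date = Date.parse(header);                   // HTTP-date form
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  return fallbackMs; // unparseable header: fall back to computed backoff
}
```

Plugged into retryWithBackoff, the computed exponential delay becomes the fallback argument, so a 429 waits exactly as long as the server asked.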
3. Setting Timeouts
3.1 Axios Timeout
const axios = require('axios');
// Per-request timeout
const response = await axios.get('http://user-service:4001/users/1', {
timeout: 3000, // 3 seconds — includes connection + response time
});
// Default timeout for all requests via an instance
const httpClient = axios.create({
timeout: 5000, // 5-second default
headers: { 'Content-Type': 'application/json' },
});
3.2 AbortController (Native Node.js)
async function fetchWithTimeout(url, timeoutMs = 3000) {
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
try {
const response = await fetch(url, { signal: controller.signal });
clearTimeout(timeoutId);
return await response.json();
} catch (err) {
clearTimeout(timeoutId);
if (err.name === 'AbortError') {
throw new Error(`Request to ${url} timed out after ${timeoutMs}ms`);
}
throw err;
}
}
3.3 Timeout Guidelines
Rule of thumb:
Database queries: 1-3 seconds
Internal service calls: 3-5 seconds
External API calls: 5-10 seconds
File uploads: 30-60 seconds
NEVER rely on the default:
- axios applies no timeout by default (timeout: 0 means wait forever), and older Node.js HTTP servers only dropped idle connections after 120 seconds
- Every hung request holds a socket, memory, and pending callbacks while it waits — under load these pile up until the service stops responding
- Always set explicit timeouts on every outgoing call
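The guidelines above also cover database queries and other async work that has no built-in timeout option. A generic deadline wrapper is one way to enforce them (a sketch; note the caveat that Promise.race rejects the caller but does not cancel the underlying operation):

```javascript
// Reject if the wrapped promise does not settle within `ms`.
// Caveat: the losing promise keeps running — this bounds how long
// the caller waits, not the work itself.
function withDeadline(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: bound a (hypothetical) database query to 2 seconds
// const rows = await withDeadline(db.query('SELECT ...'), 2000, 'db.query');
```

For HTTP calls, prefer the client's native timeout (axios `timeout`, AbortController) since those actually abort the request; this wrapper is for everything else.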
4. Circuit Breaker Pattern
The circuit breaker prevents your service from repeatedly calling a service that is down, giving the failing service time to recover.
4.1 State Machine
┌─────────────────────────────────────────────────────────────────┐
│ CIRCUIT BREAKER STATES │
│ │
│ ┌──────────┐ failures >= threshold ┌────────┐│
│ │ CLOSED │ ─────────────────────────────────────→ │ OPEN ││
│ │ (normal) │ │(reject)││
│ │ │ ←─── probe succeeds ──── ┌───────────┐ │ ││
│ └──────────┘ │ HALF-OPEN │ └────┬───┘│
│ ▲ │ (probe) │ │ │
│ │ └───────────┘ │ │
│ │ ▲ │ │
│ │ │ timeout │ │
│ └─── probe succeeds ──────────────────┘ expires ───┘ │
│ │
│ CLOSED: All calls pass through normally │
│ Track consecutive failures │
│ │
│ OPEN: All calls rejected immediately (fail fast) │
│ Return fallback or error │
│ After timeout, transition to HALF-OPEN │
│ │
│ HALF-OPEN: Allow ONE probe request through │
│ If it succeeds → CLOSED (reset failure count) │
│ If it fails → OPEN (reset timeout) │
└─────────────────────────────────────────────────────────────────┘
4.2 Circuit Breaker Implementation
// shared/utils/circuit-breaker.js
class CircuitBreaker {
constructor(options = {}) {
this.failureThreshold = options.failureThreshold || 5;
this.resetTimeout = options.resetTimeout || 30000; // 30 seconds
this.halfOpenMaxCalls = options.halfOpenMaxCalls || 1;
this.state = 'CLOSED';
this.failureCount = 0;
this.successCount = 0;
this.lastFailureTime = null;
this.halfOpenCalls = 0;
// Metrics
this.metrics = {
totalCalls: 0,
totalFailures: 0,
totalSuccesses: 0,
totalRejections: 0,
};
}
async call(fn) {
this.metrics.totalCalls++;
// ─── OPEN state: reject immediately ───
if (this.state === 'OPEN') {
if (this._shouldAttemptReset()) {
this.state = 'HALF_OPEN';
this.halfOpenCalls = 0;
console.log('[circuit-breaker] Transitioning to HALF_OPEN');
} else {
this.metrics.totalRejections++;
throw new Error(
`Circuit breaker is OPEN. Retry after ${this._timeUntilReset()}ms`
);
}
}
// ─── HALF_OPEN state: allow limited calls ───
if (this.state === 'HALF_OPEN') {
if (this.halfOpenCalls >= this.halfOpenMaxCalls) {
this.metrics.totalRejections++;
throw new Error('Circuit breaker is HALF_OPEN. Probe in progress.');
}
this.halfOpenCalls++;
}
// ─── Execute the call ───
try {
const result = await fn();
this._onSuccess();
return result;
} catch (err) {
this._onFailure();
throw err;
}
}
_onSuccess() {
this.metrics.totalSuccesses++;
this.failureCount = 0;
if (this.state === 'HALF_OPEN') {
this.state = 'CLOSED';
console.log('[circuit-breaker] Probe succeeded. Transitioning to CLOSED');
}
}
_onFailure() {
this.metrics.totalFailures++;
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
console.log('[circuit-breaker] Probe failed. Back to OPEN');
return;
}
if (this.failureCount >= this.failureThreshold) {
this.state = 'OPEN';
console.log(
`[circuit-breaker] Failure threshold reached (${this.failureCount}). Transitioning to OPEN`
);
}
}
_shouldAttemptReset() {
return Date.now() - this.lastFailureTime >= this.resetTimeout;
}
_timeUntilReset() {
const elapsed = Date.now() - this.lastFailureTime;
return Math.max(0, this.resetTimeout - elapsed);
}
getState() {
return {
state: this.state,
failureCount: this.failureCount,
metrics: { ...this.metrics },
};
}
}
module.exports = { CircuitBreaker };
4.3 Using the Circuit Breaker
const axios = require('axios');
const { CircuitBreaker } = require('../../shared/utils/circuit-breaker');
// One circuit breaker per downstream service
const userServiceBreaker = new CircuitBreaker({
failureThreshold: 5, // Open after 5 consecutive failures
resetTimeout: 30000, // Try again after 30 seconds
});
async function getUser(userId) {
try {
const response = await userServiceBreaker.call(() =>
axios.get(`http://user-service:4001/users/${userId}`, {
timeout: 3000,
})
);
return response.data;
} catch (err) {
if (err.message.includes('Circuit breaker is OPEN')) {
console.log('User service circuit is open, using fallback');
return { data: { id: userId, name: 'Unknown User', cached: true } };
}
throw err;
}
}
5. Bulkhead Pattern
The bulkhead pattern isolates resources so that a failure in one area cannot consume everything. In Node.js there is no thread pool to partition — the same idea applies to the number of concurrent in-flight calls per dependency.
WITHOUT bulkhead:
Thread Pool (10 threads)
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call (slow!)
└── User Service call (slow!)
ALL threads blocked → nothing works

WITH bulkhead:
Pool A: User Service (5 threads)
Pool B: Order Service (3 threads)
Pool C: Payment Service (2 threads)

If User Service is slow:
Pool A: all 5 threads blocked
Pool B: 3 threads still free → Orders work!
Pool C: 2 threads still free → Payments work!
// Simple bulkhead using a semaphore pattern
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.currentCalls = 0;
  }
async call(fn) {
if (this.currentCalls >= this.maxConcurrent) {
throw new Error(
`Bulkhead limit reached (${this.maxConcurrent} concurrent calls)`
);
}
this.currentCalls++;
try {
return await fn();
} finally {
this.currentCalls--;
}
}
}
// Limit concurrent calls to user service
const userServiceBulkhead = new Bulkhead(10);
async function getUser(userId) {
return userServiceBulkhead.call(() =>
axios.get(`http://user-service:4001/users/${userId}`, { timeout: 3000 })
);
}
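The Bulkhead above rejects immediately at the limit. A common variant queues a bounded number of waiters instead, trading a little latency for fewer errors. A sketch (the class name and maxQueue parameter are assumptions, not from the code above):

```javascript
// Bulkhead variant: up to `maxConcurrent` calls run at once; up to
// `maxQueue` extra callers wait in FIFO order; beyond that, reject.
class QueuingBulkhead {
  constructor(maxConcurrent, maxQueue = 10) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.currentCalls = 0;
    this.queue = [];
  }

  async call(fn) {
    if (this.currentCalls >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error(`Bulkhead queue full (${this.maxQueue} waiting)`);
      }
      // Wait for a running call to hand over its slot.
      await new Promise((resolve) => this.queue.push(resolve));
    } else {
      this.currentCalls++;
    }
    try {
      return await fn();
    } finally {
      const next = this.queue.shift();
      if (next) next(); // slot passes directly to the oldest waiter
      else this.currentCalls--;
    }
  }
}
```

Handing the slot directly to the next waiter (instead of decrementing and re-incrementing the counter) avoids a race where a new caller sneaks in ahead of a queued one.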
6. Fallback Strategies
When a service call fails, you need a plan B.
async function getUserWithFallback(userId) {
try {
// Primary: call user service
const response = await userServiceBreaker.call(() =>
axios.get(`http://user-service:4001/users/${userId}`, { timeout: 3000 })
);
return response.data.data;
} catch (err) {
// Fallback 1: Try cache
const cached = await cache.get(`user:${userId}`);
if (cached) {
console.log(`Using cached data for user ${userId}`);
return { ...cached, source: 'cache' };
}
// Fallback 2: Return degraded response
console.log(`Returning degraded response for user ${userId}`);
return {
id: userId,
name: 'Unknown User',
source: 'fallback',
};
}
}
| Strategy | When to Use | Example |
|---|---|---|
| Cached data | When stale data is acceptable | Show last-known user profile |
| Default value | When partial data is acceptable | Show "Unknown User" instead of error |
| Degraded service | When feature can work without dependency | Place order without user details, reconcile later |
| Queue for later | When operation can be deferred | Queue notification, send when service recovers |
| Error response | When no fallback makes sense | Return 503 with clear error message |
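The "queue for later" row can be sketched with an in-memory retry queue. This is a minimal illustration only — a production system would use a durable queue (Redis, RabbitMQ, etc.), and the `deliver` callback stands in for a hypothetical notification call:

```javascript
// Defer failed deliveries and replay them when the dependency recovers.
// An in-memory array loses entries on restart — use a durable queue
// in production.
class DeferredQueue {
  constructor(deliver) {
    this.deliver = deliver; // async (item) => void — may throw
    this.pending = [];
  }

  async send(item) {
    try {
      await this.deliver(item);
      return { delivered: true };
    } catch {
      this.pending.push(item); // fallback: queue for later
      return { delivered: false, queued: true };
    }
  }

  // Call periodically, or when a health check signals recovery.
  async drain() {
    const batch = this.pending.splice(0);
    for (const item of batch) await this.send(item); // re-queues on failure
  }
}
```

Pairing drain() with the circuit breaker's HALF_OPEN → CLOSED transition is a natural trigger: when the probe succeeds, flush the backlog.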
7. Production-Ready Resilient HTTP Client
Combining all patterns into a single reusable client:
// shared/utils/resilient-client.js
const axios = require('axios');
const { CircuitBreaker } = require('./circuit-breaker');
class ResilientHttpClient {
constructor(serviceName, baseURL, options = {}) {
this.serviceName = serviceName;
this.baseURL = baseURL;
this.timeout = options.timeout || 5000;
this.maxRetries = options.maxRetries || 3;
this.retryBaseDelay = options.retryBaseDelay || 1000;
this.breaker = new CircuitBreaker({
failureThreshold: options.failureThreshold || 5,
resetTimeout: options.resetTimeout || 30000,
});
this.client = axios.create({
baseURL,
timeout: this.timeout,
headers: { 'Content-Type': 'application/json' },
});
}
async request(method, path, data = null, options = {}) {
const requestFn = () =>
this.client.request({
method,
url: path,
data,
headers: options.headers || {},
});
// Circuit breaker wraps retry logic
return this.breaker.call(() =>
this._retryWithBackoff(requestFn, this.maxRetries)
);
}
async _retryWithBackoff(fn, maxRetries) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const response = await fn();
return response.data;
} catch (err) {
lastError = err;
if (!this._isRetryable(err)) {
throw err; // Don't retry 400s, 401s, 404s
}
if (attempt === maxRetries) break;
const delay = this.retryBaseDelay * Math.pow(2, attempt - 1);
const jitter = Math.random() * delay;
const waitTime = Math.floor(delay + jitter);
console.log(
`[${this.serviceName}] Attempt ${attempt} failed. ` +
`Retrying in ${waitTime}ms... (${err.message})`
);
await new Promise((resolve) => setTimeout(resolve, waitTime));
}
}
throw lastError;
}
_isRetryable(error) {
if (!error.response) return true; // Network error
return [408, 429, 500, 502, 503, 504].includes(error.response.status);
}
// Convenience methods
async get(path, options) {
return this.request('GET', path, null, options);
}
async post(path, data, options) {
return this.request('POST', path, data, options);
}
async put(path, data, options) {
return this.request('PUT', path, data, options);
}
async delete(path, options) {
return this.request('DELETE', path, null, options);
}
getStatus() {
return {
service: this.serviceName,
baseURL: this.baseURL,
circuitBreaker: this.breaker.getState(),
};
}
}
module.exports = { ResilientHttpClient };
Using the Client
const { ResilientHttpClient } = require('../../shared/utils/resilient-client');
const userClient = new ResilientHttpClient('user-service', 'http://user-service:4001', {
timeout: 3000,
maxRetries: 3,
failureThreshold: 5,
resetTimeout: 30000,
});
const orderClient = new ResilientHttpClient('order-service', 'http://order-service:4002', {
timeout: 5000,
maxRetries: 2,
failureThreshold: 3,
resetTimeout: 60000,
});
// Usage in route handler
app.post('/checkout', async (req, res) => {
try {
const user = await userClient.get(`/users/${req.body.userId}`);
const order = await orderClient.post('/orders', {
userId: user.data.id,
items: req.body.items,
});
res.json({ data: order.data });
} catch (err) {
console.error('Checkout failed:', err.message);
res.status(503).json({ error: 'Service temporarily unavailable' });
}
});
// Expose circuit breaker status for monitoring
app.get('/internal/status', (req, res) => {
res.json({
dependencies: [
userClient.getStatus(),
orderClient.getStatus(),
],
});
});
8. Key Takeaways
- Every network call can fail — always wrap inter-service calls with retry, timeout, and circuit-breaker logic.
- Exponential backoff with jitter prevents thundering herds when retrying.
- Set explicit timeouts on every outgoing call — never rely on the default (often 120 seconds).
- Circuit breakers prevent cascade failures — when a service is down, fail fast instead of blocking.
- Bulkheads isolate failures — one slow dependency should not consume all your resources.
- Always have a fallback — cache, default value, degraded response, or clear error message.
- Only retry on retryable errors — never retry 400 Bad Request or 401 Unauthorized.
Explain-It Challenge
- Your order service has no timeout on calls to the payment service. The payment service starts responding in 45 seconds instead of 200ms. What happens to your order service? Walk through the failure cascade.
- You set maxRetries: 10 with no backoff. The downstream service is overwhelmed. How do your retries make the problem worse?
- Explain the circuit breaker to a non-technical product manager. Why does "failing fast" actually improve the user experience?
Navigation: <- 6.2.b — API Gateway Pattern | 6.2.d — Event-Driven Architecture ->