Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.2 — Building and Orchestrating Microservices
6.2.c — Retry, Timeout & Circuit Breaker
In one sentence: Distributed systems fail constantly — retries handle transient errors, timeouts prevent infinite waits, and circuit breakers stop cascading failures by cutting off calls to unhealthy services.
Navigation: <- 6.2.b — API Gateway Pattern | 6.2.d — Event-Driven Architecture ->
1. Why Distributed Calls Fail
In a monolith, a function call either works or throws an exception. In microservices, every inter-service call is a network call — and the network is unreliable.
Things that go wrong between services:
Service A ──── network ────→ Service B
1. Network timeout (packet lost, congestion)
2. Service B is down (crashed, deploying, out of memory)
3. Service B is slow (database overloaded, GC pause)
4. DNS resolution fails
5. Connection refused (port not listening)
6. Partial response (connection drops mid-response)
7. 503 Service Unavailable (B is overloaded)
8. 429 Too Many Requests (B is rate-limiting you)
Without resilience patterns, one failing service takes down everything.
Cascade failure:
Client → Gateway → Order Service → User Service (down!)
│
├── Waiting... (30 sec default timeout)
├── Thread pool exhausted
├── Order Service stops responding
└── Gateway times out → Client gets error
Result: User Service outage → Order Service outage → Total outage
2. Retry Strategies
2.1 Simple Retry
Retry the same request a fixed number of times:
async function simpleRetry(fn, maxRetries = 3) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
console.log(`Attempt ${attempt}/${maxRetries} failed: ${err.message}`);
}
}
throw lastError;
}
// Usage
const user = await simpleRetry(() =>
axios.get('http://user-service:4001/users/1')
);
Problem: All retries happen immediately, which can overwhelm an already struggling service.
2.2 Exponential Backoff
Wait longer between each retry:
async function retryWithBackoff(fn, maxRetries = 3, baseDelay = 1000) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
if (attempt === maxRetries) break;
// Exponential: 1s, 2s, 4s, 8s...
const delay = baseDelay * Math.pow(2, attempt - 1);
console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError;
}
With maxRetries = 4 and the default 1000ms baseDelay, the schedule looks like:
Attempt 1: immediate
Attempt 2: after waiting 1000ms (1 second)
Attempt 3: after waiting 2000ms (2 seconds)
Attempt 4: after waiting 4000ms (4 seconds)
2.3 Exponential Backoff with Jitter
Jitter adds randomness to prevent the "thundering herd" problem — where hundreds of clients all retry at the exact same time.
async function retryWithJitter(fn, maxRetries = 3, baseDelay = 1000) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (err) {
lastError = err;
if (attempt === maxRetries) break;
// Exponential backoff + random jitter
const exponentialDelay = baseDelay * Math.pow(2, attempt - 1);
const jitter = Math.random() * exponentialDelay;
const delay = Math.floor(exponentialDelay + jitter);
console.log(`Attempt ${attempt} failed. Retrying in ${delay}ms...`);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
throw lastError;
}
Without jitter (thundering herd):
Client A retries at: 0s, 1s, 2s, 4s
Client B retries at: 0s, 1s, 2s, 4s ← all hit at same time!
Client C retries at: 0s, 1s, 2s, 4s
With jitter (spread out):
Client A retries at: 0s, 1.3s, 2.7s, 5.1s
Client B retries at: 0s, 0.8s, 3.2s, 4.4s ← distributed!
Client C retries at: 0s, 1.6s, 2.1s, 6.8s
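The code above adds jitter on top of the exponential delay, so each wait lands somewhere in [exponential, 2 × exponential). A well-known alternative is "full jitter", which draws the entire delay uniformly from [0, exponential] — a minimal sketch (the function name and cap parameter are illustrative, not part of the code above):

```javascript
// Full jitter: the whole delay is drawn uniformly from
// [0, min(maxDelay, baseDelay * 2^(attempt-1))), spreading retries
// across the entire window instead of clustering near the top.
function fullJitterDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  const exponential = Math.min(maxDelay, baseDelay * Math.pow(2, attempt - 1));
  return Math.floor(Math.random() * exponential);
}

// Three clients retrying attempt 3 land anywhere in [0, 4000) ms:
for (const client of ['A', 'B', 'C']) {
  console.log(`Client ${client} waits ${fullJitterDelay(3)}ms`);
}
```

The trade-off: full jitter can retry almost immediately (delay near 0), but across many clients it produces the flattest load on the recovering service.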
2.4 When NOT to Retry
| Status Code | Retry? | Why |
|---|---|---|
| 500 Internal Server Error | Maybe | Could be transient (memory spike) or permanent (bug) |
| 502 Bad Gateway | Yes | Usually transient |
| 503 Service Unavailable | Yes | Service overloaded, may recover |
| 429 Too Many Requests | Yes | But respect Retry-After header |
| 408 Request Timeout | Yes | Transient network issue |
| 400 Bad Request | No | Your request is wrong — retrying won't fix it |
| 401 Unauthorized | No | Auth issue — retrying won't fix it |
| 404 Not Found | No | Resource doesn't exist |
| 409 Conflict | No | Business logic conflict |
function isRetryable(error) {
if (!error.response) return true; // Network error, no response at all
const status = error.response.status;
return [408, 429, 500, 502, 503, 504].includes(status);
}
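The 429 row deserves one refinement: when the server sends a Retry-After header, honoring it beats any locally computed backoff. A hedged sketch of a helper that could feed the delay into the retry loops above (the function name is an assumption; it expects axios-style lowercase header keys):

```javascript
// Prefer the server's Retry-After hint over our own backoff guess.
// Per HTTP, Retry-After is either delay-seconds ("120") or an HTTP-date.
function retryAfterMs(error, fallbackMs) {
  const header = error.response?.headers?.['retry-after'];
  if (!header) return fallbackMs;
  const seconds = Number(header);
  if (!Number.isNaN(seconds)) return seconds * 1000; // delay-seconds form
  const date = Date.parse(header);                   // HTTP-date form
  if (!Number.isNaN(date)) return Math.max(0, date - Date.now());
  return fallbackMs; // unparseable header: fall back to computed backoff
}
```

Plugged into retryWithBackoff, the computed exponential delay becomes the fallback argument, so a 429 waits exactly as long as the server asked.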
3. Setting Timeouts
3.1 Axios Timeout
const axios = require('axios');
// Per-request timeout
const response = await axios.get('http://user-service:4001/users/1', {
timeout: 3000, // 3 seconds — includes connection + response time
});
// Default timeout for all requests via an instance
const httpClient = axios.create({
timeout: 5000, // 5-second default
headers: { 'Content-Type': 'application/json' },
});
3.2 AbortController (Native Node.js)
async function fetchWithTimeout(url, timeoutMs = 3000) {
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
try {
const response = await fetch(url, { signal: controller.signal });
clearTimeout(timeoutId);
return await response.json();
} catch (err) {
clearTimeout(timeoutId);
if (err.name === 'AbortError') {
throw new Error(`Request to ${url} timed out after ${timeoutMs}ms`);
}
throw err;
}
}
3.3 Timeout Guidelines
Rule of thumb:
Database queries: 1-3 seconds
Internal service calls: 3-5 seconds
External API calls: 5-10 seconds
File uploads: 30-60 seconds
NEVER rely on the default:
- axios applies no timeout by default (timeout: 0 means wait forever), and older Node.js HTTP servers only dropped idle connections after 120 seconds
- Every hung request holds a socket, memory, and pending callbacks while it waits — under load these pile up until the service stops responding
- Always set explicit timeouts on every outgoing call
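The guidelines above also cover database queries and other async work that has no built-in timeout option. A generic deadline wrapper is one way to enforce them (a sketch; note the caveat that Promise.race rejects the caller but does not cancel the underlying operation):

```javascript
// Reject if the wrapped promise does not settle within `ms`.
// Caveat: the losing promise keeps running — this bounds how long
// the caller waits, not the work itself.
function withDeadline(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage: bound a (hypothetical) database query to 2 seconds
// const rows = await withDeadline(db.query('SELECT ...'), 2000, 'db.query');
```

For HTTP calls, prefer the client's native timeout (axios `timeout`, AbortController) since those actually abort the request; this wrapper is for everything else.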
4. Circuit Breaker Pattern
The circuit breaker prevents your service from repeatedly calling a service that is down, giving the failing service time to recover.
4.1 State Machine
┌─────────────────────────────────────────────────────────────────┐
│ CIRCUIT BREAKER STATES │
│ │
│ ┌──────────┐ failures >= threshold ┌────────┐│
│ │ CLOSED │ ─────────────────────────────────────→ │ OPEN ││
│ │ (normal) │ │(reject)││
│ │ │ ←─── probe succeeds ──── ┌───────────┐ │ ││
│ └──────────┘ │ HALF-OPEN │ └────┬───┘│
│ ▲ │ (probe) │ │ │
│ │ └───────────┘ │ │
│ │ ▲ │ │
│ │ │ timeout │ │
│ └─── probe succeeds ──────────────────┘ expires ───┘ │
│ │
│ CLOSED: All calls pass through normally │
│ Track consecutive failures │
│ │
│ OPEN: All calls rejected immediately (fail fast) │
│ Return fallback or error │
│ After timeout, transition to HALF-OPEN │
│ │
│ HALF-OPEN: Allow ONE probe request through │
│ If it succeeds → CLOSED (reset failure count) │
│ If it fails → OPEN (reset timeout) │
└─────────────────────────────────────────────────────────────────┘
4.2 Circuit Breaker Implementation
// shared/utils/circuit-breaker.js
class CircuitBreaker {
constructor(options = {}) {
this.failureThreshold = options.failureThreshold || 5;
this.resetTimeout = options.resetTimeout || 30000; // 30 seconds
this.halfOpenMaxCalls = options.halfOpenMaxCalls || 1;
this.state = 'CLOSED';
this.failureCount = 0;
this.successCount = 0;
this.lastFailureTime = null;
this.halfOpenCalls = 0;
// Metrics
this.metrics = {
totalCalls: 0,
totalFailures: 0,
totalSuccesses: 0,
totalRejections: 0,
};
}
async call(fn) {
this.metrics.totalCalls++;
// ─── OPEN state: reject immediately ───
if (this.state === 'OPEN') {
if (this._shouldAttemptReset()) {
this.state = 'HALF_OPEN';
this.halfOpenCalls = 0;
console.log('[circuit-breaker] Transitioning to HALF_OPEN');
} else {
this.metrics.totalRejections++;
throw new Error(
`Circuit breaker is OPEN. Retry after ${this._timeUntilReset()}ms`
);
}
}
// ─── HALF_OPEN state: allow limited calls ───
if (this.state === 'HALF_OPEN') {
if (this.halfOpenCalls >= this.halfOpenMaxCalls) {
this.metrics.totalRejections++;
throw new Error('Circuit breaker is HALF_OPEN. Probe in progress.');
}
this.halfOpenCalls++;
}
// ─── Execute the call ───
try {
const result = await fn();
this._onSuccess();
return result;
} catch (err) {
this._onFailure();
throw err;
}
}
_onSuccess() {
this.metrics.totalSuccesses++;
this.failureCount = 0;
if (this.state === 'HALF_OPEN') {
this.state = 'CLOSED';
console.log('[circuit-breaker] Probe succeeded. Transitioning to CLOSED');
}
}
_onFailure() {
this.metrics.totalFailures++;
this.failureCount++;
this.lastFailureTime = Date.now();
if (this.state === 'HALF_OPEN') {
this.state = 'OPEN';
console.log('[circuit-breaker] Probe failed. Back to OPEN');
return;
}
if (this.failureCount >= this.failureThreshold) {
this.state = 'OPEN';
console.log(
`[circuit-breaker] Failure threshold reached (${this.failureCount}). Transitioning to OPEN`
);
}
}
_shouldAttemptReset() {
return Date.now() - this.lastFailureTime >= this.resetTimeout;
}
_timeUntilReset() {
const elapsed = Date.now() - this.lastFailureTime;
return Math.max(0, this.resetTimeout - elapsed);
}
getState() {
return {
state: this.state,
failureCount: this.failureCount,
metrics: { ...this.metrics },
};
}
}
module.exports = { CircuitBreaker };
4.3 Using the Circuit Breaker
const axios = require('axios');
const { CircuitBreaker } = require('../../shared/utils/circuit-breaker');
// One circuit breaker per downstream service
const userServiceBreaker = new CircuitBreaker({
failureThreshold: 5, // Open after 5 consecutive failures
resetTimeout: 30000, // Try again after 30 seconds
});
async function getUser(userId) {
try {
const response = await userServiceBreaker.call(() =>
axios.get(`http://user-service:4001/users/${userId}`, {
timeout: 3000,
})
);
return response.data;
} catch (err) {
if (err.message.includes('Circuit breaker is OPEN')) {
console.log('User service circuit is open, using fallback');
return { data: { id: userId, name: 'Unknown User', cached: true } };
}
throw err;
}
}
5. Bulkhead Pattern
The bulkhead pattern isolates resources so that a failure in one area cannot consume everything. In Node.js there is no thread pool to partition — the same idea applies to the number of concurrent in-flight calls per dependency.
WITHOUT bulkhead:
Thread Pool (10 threads)
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call
├── User Service call (slow!)
└── User Service call (slow!)
ALL threads blocked → nothing works

WITH bulkhead:
Pool A: User Service (5 threads)
Pool B: Order Service (3 threads)
Pool C: Payment Service (2 threads)

If User Service is slow:
Pool A: all 5 threads blocked
Pool B: 3 threads still free → Orders work!
Pool C: 2 threads still free → Payments work!
// Simple bulkhead using a semaphore pattern
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.currentCalls = 0;
  }
async call(fn) {
if (this.currentCalls >= this.maxConcurrent) {
throw new Error(
`Bulkhead limit reached (${this.maxConcurrent} concurrent calls)`
);
}
this.currentCalls++;
try {
return await fn();
} finally {
this.currentCalls--;
}
}
}
// Limit concurrent calls to user service
const userServiceBulkhead = new Bulkhead(10);
async function getUser(userId) {
return userServiceBulkhead.call(() =>
axios.get(`http://user-service:4001/users/${userId}`, { timeout: 3000 })
);
}
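The Bulkhead above rejects immediately at the limit. A common variant queues a bounded number of waiters instead, trading a little latency for fewer errors. A sketch (the class name and maxQueue parameter are assumptions, not from the code above):

```javascript
// Bulkhead variant: up to `maxConcurrent` calls run at once; up to
// `maxQueue` extra callers wait in FIFO order; beyond that, reject.
class QueuingBulkhead {
  constructor(maxConcurrent, maxQueue = 10) {
    this.maxConcurrent = maxConcurrent;
    this.maxQueue = maxQueue;
    this.currentCalls = 0;
    this.queue = [];
  }

  async call(fn) {
    if (this.currentCalls >= this.maxConcurrent) {
      if (this.queue.length >= this.maxQueue) {
        throw new Error(`Bulkhead queue full (${this.maxQueue} waiting)`);
      }
      // Wait for a running call to hand over its slot.
      await new Promise((resolve) => this.queue.push(resolve));
    } else {
      this.currentCalls++;
    }
    try {
      return await fn();
    } finally {
      const next = this.queue.shift();
      if (next) next(); // slot passes directly to the oldest waiter
      else this.currentCalls--;
    }
  }
}
```

Handing the slot directly to the next waiter (instead of decrementing and re-incrementing the counter) avoids a race where a new caller sneaks in ahead of a queued one.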
6. Fallback Strategies
When a service call fails, you need a plan B.
async function getUserWithFallback(userId) {
try {
// Primary: call user service
const response = await userServiceBreaker.call(() =>
axios.get(`http://user-service:4001/users/${userId}`, { timeout: 3000 })
);
return response.data.data;
} catch (err) {
// Fallback 1: Try cache
const cached = await cache.get(`user:${userId}`);
if (cached) {
console.log(`Using cached data for user ${userId}`);
return { ...cached, source: 'cache' };
}
// Fallback 2: Return degraded response
console.log(`Returning degraded response for user ${userId}`);
return {
id: userId,
name: 'Unknown User',
source: 'fallback',
};
}
}
| Strategy | When to Use | Example |
|---|---|---|
| Cached data | When stale data is acceptable | Show last-known user profile |
| Default value | When partial data is acceptable | Show "Unknown User" instead of error |
| Degraded service | When feature can work without dependency | Place order without user details, reconcile later |
| Queue for later | When operation can be deferred | Queue notification, send when service recovers |
| Error response | When no fallback makes sense | Return 503 with clear error message |
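The "queue for later" row can be sketched with an in-memory retry queue. This is a minimal illustration only — a production system would use a durable queue (Redis, RabbitMQ, etc.), and the `deliver` callback stands in for a hypothetical notification call:

```javascript
// Defer failed deliveries and replay them when the dependency recovers.
// An in-memory array loses entries on restart — use a durable queue
// in production.
class DeferredQueue {
  constructor(deliver) {
    this.deliver = deliver; // async (item) => void — may throw
    this.pending = [];
  }

  async send(item) {
    try {
      await this.deliver(item);
      return { delivered: true };
    } catch {
      this.pending.push(item); // fallback: queue for later
      return { delivered: false, queued: true };
    }
  }

  // Call periodically, or when a health check signals recovery.
  async drain() {
    const batch = this.pending.splice(0);
    for (const item of batch) await this.send(item); // re-queues on failure
  }
}
```

Pairing drain() with the circuit breaker's HALF_OPEN → CLOSED transition is a natural trigger: when the probe succeeds, flush the backlog.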
7. Production-Ready Resilient HTTP Client
Combining all patterns into a single reusable client:
// shared/utils/resilient-client.js
const axios = require('axios');
const { CircuitBreaker } = require('./circuit-breaker');
class ResilientHttpClient {
constructor(serviceName, baseURL, options = {}) {
this.serviceName = serviceName;
this.baseURL = baseURL;
this.timeout = options.timeout || 5000;
this.maxRetries = options.maxRetries || 3;
this.retryBaseDelay = options.retryBaseDelay || 1000;
this.breaker = new CircuitBreaker({
failureThreshold: options.failureThreshold || 5,
resetTimeout: options.resetTimeout || 30000,
});
this.client = axios.create({
baseURL,
timeout: this.timeout,
headers: { 'Content-Type': 'application/json' },
});
}
async request(method, path, data = null, options = {}) {
const requestFn = () =>
this.client.request({
method,
url: path,
data,
headers: options.headers || {},
});
// Circuit breaker wraps retry logic
return this.breaker.call(() =>
this._retryWithBackoff(requestFn, this.maxRetries)
);
}
async _retryWithBackoff(fn, maxRetries) {
let lastError;
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const response = await fn();
return response.data;
} catch (err) {
lastError = err;
if (!this._isRetryable(err)) {
throw err; // Don't retry 400s, 401s, 404s
}
if (attempt === maxRetries) break;
const delay = this.retryBaseDelay * Math.pow(2, attempt - 1);
const jitter = Math.random() * delay;
const waitTime = Math.floor(delay + jitter);
console.log(
`[${this.serviceName}] Attempt ${attempt} failed. ` +
`Retrying in ${waitTime}ms... (${err.message})`
);
await new Promise((resolve) => setTimeout(resolve, waitTime));
}
}
throw lastError;
}
_isRetryable(error) {
if (!error.response) return true; // Network error
return [408, 429, 500, 502, 503, 504].includes(error.response.status);
}
// Convenience methods
async get(path, options) {
return this.request('GET', path, null, options);
}
async post(path, data, options) {
return this.request('POST', path, data, options);
}
async put(path, data, options) {
return this.request('PUT', path, data, options);
}
async delete(path, options) {
return this.request('DELETE', path, null, options);
}
getStatus() {
return {
service: this.serviceName,
baseURL: this.baseURL,
circuitBreaker: this.breaker.getState(),
};
}
}
module.exports = { ResilientHttpClient };
Using the Client
const { ResilientHttpClient } = require('../../shared/utils/resilient-client');
const userClient = new ResilientHttpClient('user-service', 'http://user-service:4001', {
timeout: 3000,
maxRetries: 3,
failureThreshold: 5,
resetTimeout: 30000,
});
const orderClient = new ResilientHttpClient('order-service', 'http://order-service:4002', {
timeout: 5000,
maxRetries: 2,
failureThreshold: 3,
resetTimeout: 60000,
});
// Usage in route handler
app.post('/checkout', async (req, res) => {
try {
const user = await userClient.get(`/users/${req.body.userId}`);
const order = await orderClient.post('/orders', {
userId: user.data.id,
items: req.body.items,
});
res.json({ data: order.data });
} catch (err) {
console.error('Checkout failed:', err.message);
res.status(503).json({ error: 'Service temporarily unavailable' });
}
});
// Expose circuit breaker status for monitoring
app.get('/internal/status', (req, res) => {
res.json({
dependencies: [
userClient.getStatus(),
orderClient.getStatus(),
],
});
});
8. Key Takeaways
- Every network call can fail — always wrap inter-service calls with retry, timeout, and circuit-breaker logic.
- Exponential backoff with jitter prevents thundering herds when retrying.
- Set explicit timeouts on every outgoing call — never rely on the default (often 120 seconds).
- Circuit breakers prevent cascade failures — when a service is down, fail fast instead of blocking.
- Bulkheads isolate failures — one slow dependency should not consume all your resources.
- Always have a fallback — cache, default value, degraded response, or clear error message.
- Only retry on retryable errors — never retry 400 Bad Request or 401 Unauthorized.
Explain-It Challenge
- Your order service has no timeout on calls to the payment service. The payment service starts responding in 45 seconds instead of 200ms. What happens to your order service? Walk through the failure cascade.
- You set maxRetries: 10 with no backoff. The downstream service is overwhelmed. How do your retries make the problem worse?
- Explain the circuit breaker to a non-technical product manager. Why does "failing fast" actually improve the user experience?
Navigation: <- 6.2.b — API Gateway Pattern | 6.2.d — Event-Driven Architecture ->