Episode 6 — Scaling, Reliability, Microservices & Web3 / 6.2 — Building and Orchestrating Microservices
Interview Questions: Building & Orchestrating Microservices
Model answers for independent services, API gateway, retry/timeout/circuit breaker, event-driven architecture, and event payloads/idempotency.
How to use this material (instructions)
- Read lessons in order — README.md, then 6.2.a through 6.2.e.
- Practice out loud — definition, then example, then pitfall.
- Pair with exercises — 6.2-Exercise-Questions.md.
- Quick review — 6.2-Quick-Revision.md.
Beginner (Q1–Q4)
Q1. What makes a microservice different from a module in a monolith?
Why interviewers ask: Tests whether you understand the core value proposition of microservices beyond the buzzword.
Model answer:
A module in a monolith is a logical boundary — separate code files or folders, but they share the same process, the same deployment, and the same database. A microservice is a physical boundary — it runs as its own process, has its own codebase and package.json, listens on its own port, uses its own database, and is deployed independently.
The practical difference: in a monolith, changing the notification module requires redeploying the entire application. In microservices, you redeploy only the notification service. In a monolith, a memory leak in one module crashes everything. In microservices, it only crashes that one service.
| Monolith module | Microservice |
|---|---|
| Same process | Separate process |
| Shared memory / function calls | Network calls (HTTP, events) |
| One database | Own database |
| Deploy together | Deploy independently |
| Scale together | Scale independently |
The trade-off is that microservices introduce network complexity — calls that were in-memory function calls now go over the network and can fail, be slow, or arrive out of order.
Q2. What is an API gateway and why do you need one?
Why interviewers ask: Evaluates understanding of cross-cutting concerns and system architecture.
Model answer:
An API gateway is a single entry point that sits between clients and your microservices. Instead of clients knowing the URL and port of every service, they talk to one endpoint — the gateway — which routes requests to the correct service.
Beyond routing, the gateway handles cross-cutting concerns: authentication (validate JWT once, not in every service), rate limiting (prevent abuse at the edge), logging (centralized request logging with correlation IDs), and header manipulation (inject X-User-Id from the verified token so services trust it).
// Without gateway: client knows every service
fetch('http://user-service:4001/users/1');
fetch('http://order-service:4002/orders');
fetch('http://notif-service:4003/notifications');
// With gateway: client knows one endpoint
fetch('http://api.myapp.com/api/users/1');
fetch('http://api.myapp.com/api/orders');
fetch('http://api.myapp.com/api/notifications');
Key rule: the gateway is for external traffic only. Internal service-to-service calls should go directly or through a message queue, not through the gateway — routing them through the gateway would add an unnecessary network hop.
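The routing step itself can be sketched as a small lookup table. This is a minimal illustration, not a production proxy — the service hosts and ports follow the example above, and resolveRoute is a hypothetical helper:

```javascript
// Hypothetical routing table for the gateway (hosts/ports from the example above)
const routes = {
  '/api/users': 'http://user-service:4001',
  '/api/orders': 'http://order-service:4002',
  '/api/notifications': 'http://notif-service:4003',
};

// Resolve an incoming client path to the backend URL it should be proxied to.
// Returns null for unknown paths, which the gateway would turn into a 404.
function resolveRoute(path) {
  const prefix = Object.keys(routes).find(
    (p) => path === p || path.startsWith(p + '/')
  );
  if (!prefix) return null;
  // Strip the public '/api' prefix before forwarding to the internal service
  return routes[prefix] + path.slice('/api'.length);
}
```

A real gateway (nginx, Kong, or an Express app with a proxy middleware) does the same prefix-to-upstream mapping, plus the cross-cutting concerns described above.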
Q3. Why do distributed systems need retries and timeouts?
Why interviewers ask: Tests awareness that microservices operate over unreliable networks, not in-memory.
Model answer:
In a monolith, calling a function either succeeds or throws — it never "times out" or experiences a "network partition." In microservices, every inter-service call goes over the network, introducing failure modes that don't exist in monoliths: network timeouts, DNS failures, the target service being down, the target service being overloaded, or partial responses from dropped connections.
Timeouts prevent your service from waiting indefinitely. Without a timeout, a call to a slow service blocks the calling thread. If many requests pile up, your thread pool is exhausted and your service becomes unresponsive — even though nothing is wrong with YOUR code. Always set explicit timeouts (3-5 seconds for internal calls).
Retries handle transient failures. If the target service had a brief GC pause or a network blip, retrying after a short delay often succeeds. But retries must be done carefully: use exponential backoff (wait longer between each retry) and jitter (randomize the delay) to avoid a retry storm that overwhelms the already-struggling service.
// Production pattern: exponential backoff with jitter
const delay = baseDelay * Math.pow(2, attempt - 1);
const jitter = Math.random() * delay;
await sleep(delay + jitter);
Only retry on transient errors (500, 502, 503, 429). Never retry on client errors (400, 401, 404) — the request is wrong and retrying won't help.
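Putting the backoff, jitter, and "only retry transient errors" rules together, one possible retry wrapper looks like this (retryWithBackoff and isTransient are illustrative names, not from a specific library):

```javascript
// Treat 5xx and 429 as transient, per the rule above; err.status is assumed
// to carry the HTTP status code.
function isTransient(err) {
  return [429, 500, 502, 503].includes(err.status);
}

// Sketch: retry a call with exponential backoff and jitter.
async function retryWithBackoff(fn, { maxRetries = 3, baseDelay = 200 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Client errors (400, 401, 404) or exhausted retries: give up immediately
      if (!isTransient(err) || attempt > maxRetries) throw err;
      const delay = baseDelay * Math.pow(2, attempt - 1);
      const jitter = Math.random() * delay; // randomize to avoid retry storms
      await new Promise((resolve) => setTimeout(resolve, delay + jitter));
    }
  }
}
```

In production you would combine this with a per-call timeout so a hung request fails fast enough for the retry to matter.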
Q4. What is a message queue and when would you use one?
Why interviewers ask: Tests understanding of asynchronous communication patterns.
Model answer:
A message queue (like RabbitMQ or Kafka) is a broker that sits between services. Instead of Service A calling Service B directly (synchronous, blocking), Service A publishes a message to the queue and moves on immediately. Service B picks up the message and processes it at its own pace.
You use a message queue when the downstream action does not need to happen immediately:
- Notifications: The order should succeed even if the email service is down.
- Analytics: Dashboard updates can be delayed a few seconds.
- Inventory updates: Reserve stock asynchronously after order creation.
The key benefits: decoupling (publisher doesn't know or care who consumes), resilience (if the consumer is down, messages queue up and are processed when it recovers), and load leveling (the queue absorbs traffic spikes).
Synchronous: Order → [HTTP call] → Notification
If Notification is down, Order FAILS
Asynchronous: Order → [publish event] → Queue → Notification
If Notification is down, message waits in queue
Order succeeds regardless
Use HTTP when you need the response to continue (validate user exists). Use events when the downstream action can happen eventually.
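The buffering behavior in the diagram can be shown with a toy in-memory queue — this is purely illustrative, not a real broker like RabbitMQ or Kafka:

```javascript
// Toy queue: publishers return immediately; messages wait if no consumer exists.
class InMemoryQueue {
  constructor() {
    this.messages = [];
    this.consumer = null;
  }

  publish(msg) {
    // Publisher does not block on the consumer — this is the decoupling
    if (this.consumer) this.consumer(msg);
    else this.messages.push(msg); // consumer down → message waits in queue
  }

  subscribe(fn) {
    this.consumer = fn;
    // Drain any backlog that accumulated while the consumer was down
    while (this.messages.length) fn(this.messages.shift());
  }
}
```

A real broker adds durability, acknowledgments, and redelivery on failure, but the core contract — publish and move on, consume at your own pace — is the same.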
Intermediate (Q5–Q8)
Q5. Explain the circuit breaker pattern. Walk through all three states.
Why interviewers ask: This is a core distributed systems pattern that differentiates experienced engineers from beginners.
Model answer:
The circuit breaker prevents your service from repeatedly calling a dependency that is down, avoiding cascade failures and giving the failing service time to recover.
It has three states:
CLOSED (normal operation): All requests pass through. The breaker tracks consecutive failures. If failures reach the threshold (e.g., 5), it transitions to OPEN.
OPEN (fail fast): All requests are immediately rejected without making the call. This protects both your service (no blocked threads) and the failing service (no additional load). After a timeout (e.g., 30 seconds), it transitions to HALF_OPEN.
HALF_OPEN (probe): One request is allowed through as a test. If it succeeds, the breaker transitions back to CLOSED (service has recovered). If it fails, it goes back to OPEN (service still down).
// Usage pattern
const breaker = new CircuitBreaker({
failureThreshold: 5, // Open after 5 failures
resetTimeout: 30000, // Try again after 30s
});
try {
const result = await breaker.call(() =>
axios.get('http://user-service:4001/users/1', { timeout: 3000 })
);
} catch (err) {
if (err.message.includes('Circuit breaker is OPEN')) {
return fallbackResponse; // Use cache or default
}
}
Real-world analogy: it is like a fuse in an electrical circuit. When overloaded, the fuse blows (OPEN) to protect the rest of the system. You replace the fuse (HALF_OPEN test) and if the problem is fixed, power flows again (CLOSED).
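The three-state machine above fits in a few dozen lines. This is a minimal sketch matching the usage pattern shown earlier (constructor options and the OPEN error message are assumptions, not a specific library's API):

```javascript
// Minimal circuit breaker: CLOSED → OPEN after N failures,
// OPEN → HALF_OPEN after resetTimeout, HALF_OPEN → CLOSED/OPEN on probe result.
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeout = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit breaker is OPEN'); // fail fast, no call made
      }
      this.state = 'HALF_OPEN'; // timeout elapsed: allow one probe through
    }
    try {
      const result = await fn();
      this.state = 'CLOSED'; // success (normal call or probe) resets the breaker
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Production libraries (e.g. opossum for Node.js) add rolling failure windows, metrics, and fallbacks, but the state machine is the same.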
Q6. How does RabbitMQ routing work? Explain exchanges, bindings, and queues.
Why interviewers ask: Tests depth of understanding beyond "we use RabbitMQ."
Model answer:
RabbitMQ routing has three components:
Exchange: The mailroom. Receives messages from publishers but does not store them. Decides which queues should receive each message based on its type and the routing key.
Queue: The mailbox. Stores messages until a consumer picks them up. Durable queues survive broker restarts.
Binding: The routing rule. Connects an exchange to a queue with a binding key pattern.
Three exchange types matter:
Fanout: Sends every message to ALL bound queues. Like a broadcast. The routing key is ignored. Use for: "every service needs to know about this event."
Direct: Sends messages only to queues where the binding key exactly matches the routing key. Use for: point-to-point delivery to a specific queue.
Topic: Sends messages to queues where the binding key pattern matches the routing key. Uses * (one word) and # (zero or more words). Use for: flexible routing like order.* matching order.placed, order.shipped, etc.
Publisher sends: exchange="events", routing_key="order.placed"
Topic exchange "events" checks bindings:
queue="notifications" bound with "order.*" → MATCH ✓
queue="audit" bound with "#" → MATCH ✓
queue="shipping" bound with "order.shipped" → NO MATCH ✗
In practice, topic exchanges are the most useful for microservices because they allow services to subscribe to exactly the event patterns they care about.
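The topic-matching rules (* = exactly one word, # = zero or more words) can be implemented directly on the dot-separated words. This is a sketch of the matching semantics, not RabbitMQ's actual implementation:

```javascript
// Does a binding key pattern match a routing key, per topic-exchange rules?
// '*' matches exactly one word; '#' matches zero or more words.
function topicMatches(binding, routingKey) {
  const pat = binding.split('.');
  const key = routingKey.split('.');

  function match(p, k) {
    if (p === pat.length) return k === key.length; // both exhausted → match
    if (pat[p] === '#') {
      // '#' may consume 0..remaining words; try each possibility
      for (let skip = 0; skip <= key.length - k; skip++) {
        if (match(p + 1, k + skip)) return true;
      }
      return false;
    }
    if (k === key.length) return false; // pattern left, key exhausted
    if (pat[p] === '*' || pat[p] === key[k]) return match(p + 1, k + 1);
    return false;
  }

  return match(0, 0);
}
```

Running the example from the diagram: order.* matches order.placed, # matches everything, and order.shipped does not match order.placed.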
Q7. What is idempotency and why is it critical for event consumers?
Why interviewers ask: A fundamental concept that separates developers who have dealt with real distributed systems from those who haven't.
Model answer:
Idempotency means processing the same event once or multiple times produces the same result. It is critical because in distributed systems, at-least-once delivery is the default — messages can and will be delivered more than once.
Three scenarios cause duplicates: (1) Consumer processes the message, then crashes before acknowledging it — the queue redelivers. (2) The acknowledgment is lost due to a network issue — the queue redelivers. (3) The publisher retries after a timeout, creating duplicate messages.
Without idempotency, a duplicate order.placed event could charge a customer twice or send two confirmation emails.
Implementation strategies, from simplest to strongest:
// Strategy 1: Event ID check with Redis (fast, simple)
if (await redis.exists(`processed:${eventId}`)) return; // Skip
await processEvent(event);
await redis.set(`processed:${eventId}`, '1', 'EX', 86400);
// Strategy 2: Database unique constraint (strongest)
await db.query(
`INSERT INTO payments (order_id, amount, event_id)
VALUES ($1, $2, $3)`,
[orderId, amount, eventId]
); // Throws on duplicate event_id (unique constraint)
// Strategy 3: Conditional update (natural idempotency)
await db.query(
`UPDATE orders SET status = 'shipped'
WHERE id = $1 AND status = 'placed'`,
[orderId]
); // Does nothing if already shipped
The database unique constraint is the strongest because it is atomic and durable — it survives crashes and restarts. Redis is faster but a TTL expiration could allow a very late duplicate to slip through.
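The strategies above can be wrapped generically around any handler. Here is an illustrative wrapper using an in-memory Set in place of Redis or a database (makeIdempotent is a hypothetical helper name):

```javascript
// Wrap a handler so duplicate deliveries of the same event id are skipped.
// The in-memory Set stands in for Redis / a DB unique constraint shown above.
function makeIdempotent(handler, seen = new Set()) {
  return async (event) => {
    if (seen.has(event.id)) return { skipped: true }; // duplicate → no-op
    const result = await handler(event);
    // Mark as processed only AFTER the handler succeeds: if we crash
    // mid-handler, the redelivered event is processed again (at-least-once)
    seen.add(event.id);
    return result;
  };
}
```

Note the ordering: marking the event processed before handling it would turn a crash into a lost event, which is usually worse than a duplicate.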
Q8. How do you handle authentication in a microservices architecture?
Why interviewers ask: Tests practical architecture thinking — a very common real-world decision.
Model answer:
Authentication should happen once at the API gateway, not in every service. The gateway validates the JWT (or API key), extracts the user identity, and injects trusted headers like X-User-Id and X-User-Role before forwarding the request to the backend service.
Client sends:
Authorization: Bearer eyJhbGci...
│
Gateway: verify JWT → extract userId=42, role=admin
│
Forward to service with headers:
X-User-Id: 42
X-User-Role: admin
(Original Authorization header stripped)
Backend services trust these headers because they are only reachable from the gateway — not from the public internet. This is enforced through Docker networks (services use expose not ports), Kubernetes NetworkPolicies, or VPC security groups.
// Inside a backend service — no JWT validation needed
app.post('/orders', (req, res) => {
const userId = req.headers['x-user-id']; // Trusted from gateway
// Create order for userId
});
This approach has three benefits: (1) Auth logic is in one place, not duplicated across 15 services. (2) Token validation happens once, not N times per request. (3) Services don't need the JWT secret — reducing the attack surface.
The critical security requirement: you must strip any X-User-Id or X-User-Role headers that clients send directly to the gateway. Otherwise, a malicious client could impersonate any user by setting those headers.
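That stripping step can be sketched as a small function the gateway runs before forwarding (buildUpstreamHeaders is an illustrative name; verifiedUser is assumed to come from the gateway's JWT validation):

```javascript
// Build the header set forwarded to backend services:
// drop anything the client sent that could impersonate a user,
// then inject identity extracted from the verified JWT.
function buildUpstreamHeaders(clientHeaders, verifiedUser) {
  const headers = { ...clientHeaders };
  // Never trust identity headers supplied by the client
  delete headers['x-user-id'];
  delete headers['x-user-role'];
  // The token stays at the gateway; services never see it
  delete headers['authorization'];
  headers['x-user-id'] = String(verifiedUser.id);
  headers['x-user-role'] = verifiedUser.role;
  return headers;
}
```

Even with this in place, the network-level isolation described above still matters — stripping headers at the gateway is useless if clients can reach the services directly.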
Advanced (Q9–Q11)
Q9. Design a resilient inter-service communication strategy for an e-commerce platform.
Why interviewers ask: Tests system design thinking — combining multiple patterns into a cohesive architecture.
Model answer:
I would categorize inter-service communication by whether the caller needs an immediate response:
Synchronous (need response now):
- Order Service validates user exists → HTTP call to User Service
- Payment processing → HTTP call to Payment Service
- Inventory check → HTTP call to Inventory Service
Every synchronous call is wrapped in a ResilientHttpClient with:
- Timeout: 3-5 seconds per call
- Retries: 2-3 with exponential backoff and jitter
- Circuit breaker: failureThreshold=5, resetTimeout=30s
- Fallback: cached data or degraded response
const userClient = new ResilientHttpClient('user-service', 'http://user-service:4001', {
timeout: 3000,
maxRetries: 3,
failureThreshold: 5,
resetTimeout: 30000,
});
Asynchronous (eventual is fine):
- Send order confirmation email → event via RabbitMQ
- Update analytics dashboard → event via RabbitMQ
- Sync data to search index → event via RabbitMQ
- Reserve inventory (if latency acceptable) → event
All events go through a topic exchange (platform-events). Each consuming service has its own queue with a dead letter queue. All consumers are idempotent using database unique constraints on event_id.
order.placed → payment-queue (direct)
→ notification-queue (topic: order.*)
→ analytics-queue (topic: #)
→ inventory-queue (topic: order.placed)
Monitoring: Every circuit breaker exposes its state via /internal/status. Dead letter queues trigger alerts. All events include correlationId for distributed tracing.
Q10. You discover that events are occasionally processed out of order. How do you handle this?
Why interviewers ask: Tests ability to handle a real distributed systems problem with practical solutions.
Model answer:
Out-of-order delivery happens when messages take different paths through the broker, or when one consumer instance processes faster than another. For example, order.shipped might arrive before order.payment_received.
I use two strategies depending on the context:
Strategy 1: State machine validation. Define valid state transitions and reject invalid ones:
const validTransitions = {
'placed': ['payment_received', 'cancelled'],
'payment_received': ['shipped', 'cancelled'],
'shipped': ['delivered'],
};
async function handleEvent(event) {
const currentStatus = await getOrderStatus(event.data.orderId);
const newStatus = event.type.split('.')[1];
if (!validTransitions[currentStatus]?.includes(newStatus)) {
// Requeue with delay — the prerequisite event might arrive shortly
throw new RetryLaterError(`Cannot transition ${currentStatus} → ${newStatus}`);
}
await updateOrderStatus(event.data.orderId, newStatus);
}
This works well for entities with clear lifecycles (orders, payments, shipments).
Strategy 2: Timestamp comparison. Each event carries a timestamp. The consumer stores the latest timestamp per entity and ignores events older than the stored timestamp:
async function handleEvent(event) {
const eventTime = new Date(event.metadata.timestamp);
const lastProcessed = await getLastEventTime(event.data.orderId);
if (lastProcessed && eventTime <= lastProcessed) {
console.log('Stale event, skipping');
return;
}
await processEvent(event);
await setLastEventTime(event.data.orderId, eventTime);
}
For the requeue approach, I set a maximum retry count (3-5) with increasing delays. If the prerequisite event never arrives, the message goes to the dead letter queue for manual investigation.
Q11. How would you migrate a synchronous monolith to event-driven microservices without downtime?
Why interviewers ask: Tests real-world migration experience and incremental delivery thinking.
Model answer:
This is never a big-bang rewrite. I would use the Strangler Fig pattern — gradually extracting services while the monolith continues to run.
Phase 1: Introduce the message broker. Set up RabbitMQ alongside the monolith. The monolith starts publishing events at key points (user created, order placed) but also continues its synchronous flow. No consumers yet — this validates the event infrastructure.
Phase 2: Extract the first service. Choose a service with low coupling and clear boundaries — typically notifications. Build the notification service as a consumer of events the monolith already publishes. Run both the monolith's notification code and the new service in parallel. Compare outputs. When confident, disable the monolith's notification code.
Phase 3: Add the gateway. Deploy an API gateway that routes most traffic to the monolith. Route /api/notifications to the new service. Clients see no difference — same URL, same response format.
Phase 4: Extract more services. Repeat for each bounded context. Each extraction:
- The monolith publishes relevant events
- The new service consumes those events
- The gateway routes the new service's endpoints
- The monolith code is disabled, then removed
Phase 1: [Monolith + events] → Queue (no consumers)
Phase 2: [Monolith] → Queue → [Notification Service]
Phase 3: Client → [Gateway] → [Monolith]
→ [Notification Service]
Phase 4+: Client → [Gateway] → [User Service]
→ [Order Service]
→ [Notification Service]
→ [Monolith (shrinking)]
Key practices: Feature flags control which path is active (monolith or microservice). Dual-write during transition — the monolith handles requests AND publishes events. Comprehensive logging on both paths to compare behavior. Automated tests verify the new service produces the same results as the monolith.
The migration takes months or years for a large system. The goal is not to finish quickly — it is zero downtime for users, because the system works at every intermediate step.
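The feature-flag switch between the two paths can be sketched as follows (the flag name and the monolithNotify/publishEvent callbacks are hypothetical, standing in for the monolith's code and the event publisher):

```javascript
// During extraction, a flag decides whether notifications go through the
// monolith's in-process code or the new event-driven path.
const flags = { useNotificationService: false };

async function sendOrderConfirmation(order, { monolithNotify, publishEvent }) {
  if (flags.useNotificationService) {
    // New path: publish the event; the extracted service consumes it
    await publishEvent('order.placed', { orderId: order.id });
  } else {
    // Old path: monolith's notification code, still running in-process
    await monolithNotify(order);
  }
}
```

Flipping the flag back is the rollback plan: if the new service misbehaves, no deployment is needed to restore the old behavior.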
Quick-fire
| # | Question | One-line answer |
|---|---|---|
| 1 | Should services share databases? | No — each service owns its data store |
| 2 | Gateway handles auth for... | External traffic only — internal calls trust gateway headers |
| 3 | Default retry strategy? | Exponential backoff with jitter |
| 4 | Circuit breaker OPEN means... | All calls rejected immediately (fail fast) |
| 5 | Fanout exchange does what? | Sends to ALL bound queues regardless of routing key |
| 6 | What is idempotency? | Same operation applied N times = same result as 1 time |
| 7 | Events named in past tense? | Yes — order.placed not createOrder |
| 8 | Dead letter queue purpose? | Catches messages that cannot be processed after max retries |
| 9 | Eventual consistency means? | Data across services becomes consistent eventually, not immediately |
| 10 | Publish event before or after DB write? | After — never announce something that did not happen |
| 11 | correlationId is for? | Tracing a request across multiple services and events |
<- Back to 6.2 — Building & Orchestrating Microservices (README)