Episode 9 — System Design / 9.9 — Core Infrastructure

9.9 Interview Questions

Question 1: Explain the cache-aside pattern and when you would use it.

Model Answer

Cache-aside (also called lazy loading) is the most common caching pattern. The application is responsible for managing the cache directly.

How it works:

  1. When a read request comes in, the application first checks the cache.
  2. If the data is found (cache hit), it is returned immediately.
  3. If the data is not found (cache miss), the application queries the database, returns the result to the caller, and writes it to the cache for future reads.

When to use it:

  • Read-heavy workloads where the same data is accessed repeatedly.
  • When stale data is tolerable for short periods (the cache may be slightly behind the database between writes).
  • When you want the simplest caching implementation.

Trade-offs:

  • First request for any data is always a cache miss (cold start problem).
  • Data can become stale if the database is updated outside the cache-aware code path.
  • The application code is responsible for both cache and database logic.

Compared to write-through: Cache-aside is more flexible because only data that is actually requested gets cached. Write-through caches everything that is written, which may waste memory on data nobody reads.

Cache invalidation strategy: Typically, on write, the application deletes the cache entry (rather than updating it). The next read repopulates the cache. This avoids race conditions where concurrent writes and cache updates could leave the cache with stale data.
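The read path and delete-on-write invalidation described above can be sketched in a few lines. This is a minimal in-process sketch: the dictionaries stand in for Redis and a real database, and the key names are illustrative.

```python
# Minimal cache-aside sketch. `cache` and `db` are in-memory stand-ins
# for Redis and the database of record.
cache = {}
db = {"user:1": {"name": "Ada"}}

def read(key):
    if key in cache:              # cache hit: return immediately
        return cache[key]
    value = db.get(key)           # cache miss: fall through to the DB
    if value is not None:
        cache[key] = value        # populate for future reads
    return value

def write(key, value):
    db[key] = value               # write to the source of truth first
    cache.pop(key, None)          # delete, don't update; next read repopulates

read("user:1")                    # miss: loads from DB and caches
read("user:1")                    # hit: served from cache
write("user:1", {"name": "Grace"})
read("user:1")                    # miss again after invalidation
```

Note that `write` deletes rather than updates the cache entry, matching the race-avoidance strategy above.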


Question 2: What is the difference between Layer 4 and Layer 7 load balancing?

Model Answer

The difference comes down to what the load balancer can see and act on.

Layer 4 (Transport Layer):

  • Operates at the TCP/UDP level.
  • Sees only IP addresses, ports, and protocol type.
  • Routes entire TCP connections to a backend server without inspecting the content.
  • Extremely fast because it does minimal processing.
  • Cannot make decisions based on HTTP URLs, headers, or cookies.
  • Example: AWS NLB, HAProxy in TCP mode.

Layer 7 (Application Layer):

  • Operates at the HTTP/HTTPS level.
  • Can inspect URLs, HTTP headers, cookies, and even the request body.
  • Can terminate SSL/TLS and establish separate connections to backends.
  • Supports content-based routing (e.g., /api/* goes to API servers, /static/* goes to file servers).
  • Can modify requests and responses (add headers, compress, cache).
  • Example: AWS ALB, Nginx, Envoy.
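The content-based routing a Layer 7 balancer performs can be sketched as a longest-prefix match on the URL path. The pool names below are illustrative, not tied to any real deployment.

```python
# Layer 7 path-based routing sketch: first matching prefix wins,
# with "/" as the catch-all default pool.
ROUTES = [
    ("/api/", "api-servers"),
    ("/static/", "file-servers"),
    ("/", "web-servers"),          # default pool
]

def route(path):
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return None

route("/api/users/42")    # -> "api-servers"
route("/static/app.css")  # -> "file-servers"
route("/checkout")        # -> "web-servers"
```

A Layer 4 balancer cannot implement this function at all: it never sees the path, only the TCP connection.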

When to choose Layer 4:

  • Maximum throughput and minimum latency are required.
  • Traffic is not HTTP (e.g., database connections, gaming, VoIP).
  • You do not need content-based routing.

When to choose Layer 7:

  • You need URL-based or header-based routing.
  • SSL termination is required at the load balancer.
  • You need sticky sessions based on cookies.
  • You are routing to microservices and need path-based routing.

In practice: Many architectures use both. A Layer 4 load balancer at the edge for raw performance, distributing traffic to multiple Layer 7 load balancers that handle content-based routing to application servers.


Question 3: How would you prevent a cache stampede?

Model Answer

A cache stampede (thundering herd) occurs when a popular cache entry expires and many concurrent requests miss the cache simultaneously, all hitting the database at once.

Three main prevention strategies:

1. Locking (Mutex): When a cache miss occurs, only one request acquires a lock and is allowed to rebuild the cache. All other requests either wait for the lock to be released or serve stale data.

Implementation: Use Redis SET lock:key token NX EX 10 (set only if the key does not exist, with a 10-second expiry; SET requires a value, so store a unique token) as a distributed lock. The lock winner rebuilds the cache; losers retry after a short delay.

Downside: Adds complexity. If the lock holder crashes, you need the lock to auto-expire (which is why we set EX). Other requests are delayed.
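A minimal sketch of the mutex strategy, with a local threading.Lock standing in for the Redis SET NX EX distributed lock (the structure is the same: try to acquire without blocking, rebuild on success, otherwise wait and retry):

```python
import threading
import time

cache = {}
rebuild_lock = threading.Lock()   # stand-in for a Redis SET ... NX EX lock

def get(key, load_from_db):
    if key in cache:
        return cache[key]
    # Cache miss: only the lock winner rebuilds; losers wait and retry.
    if rebuild_lock.acquire(blocking=False):
        try:
            cache[key] = load_from_db(key)   # a single DB hit
        finally:
            rebuild_lock.release()
        return cache[key]
    while key not in cache:                  # losers poll until populated
        time.sleep(0.01)
    return cache[key]
```

With a real distributed lock, the auto-expiry (EX) replaces the try/finally release as the crash-safety mechanism.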

2. Early Refresh (Stale-While-Revalidate): Set two TTLs: a "soft" TTL and a "hard" TTL. When the soft TTL expires but hard TTL has not, one request triggers a background refresh while all requests continue to receive the slightly stale data.

This is the best experience for users because nobody waits. The data is briefly stale (seconds) but availability is maintained.

3. Probabilistic Early Expiration (XFetch): Each request has a small probability of refreshing the cache before the TTL expires. As the TTL approaches, the probability increases. This spreads out cache rebuilds over time instead of having a single cliff edge.
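One way to implement the XFetch decision is the published formula: refresh when now - delta * beta * ln(rand) >= expiry, where delta is the observed rebuild cost and beta tunes aggressiveness. A sketch, assuming delta and the expiry timestamp are tracked alongside each cache entry:

```python
import math
import random
import time

def should_refresh_early(expiry_ts, delta, beta=1.0, now=None):
    """XFetch-style probabilistic early expiration.
    expiry_ts: absolute expiry time of the entry.
    delta: how long the last rebuild took (seconds).
    beta: aggressiveness; > 1 refreshes earlier on average."""
    now = time.time() if now is None else now
    # 1 - random() lies in (0, 1], so log() is defined; -log(r) is an
    # exponential random variable that pulls the effective expiry earlier.
    return now - delta * beta * math.log(1.0 - random.random()) >= expiry_ts
```

Far from expiry the probability of refreshing is negligible; it rises smoothly as the TTL approaches, so rebuilds are spread out rather than clustered at the cliff.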

Which I would choose: For most production systems, I use early refresh (stale-while-revalidate) because it prioritizes availability and user experience. The data is stale for only a few seconds, and no user ever waits for a cache rebuild.


Question 4: Compare RabbitMQ and Kafka. When would you use each?

Model Answer

RabbitMQ and Kafka solve related but different problems.

RabbitMQ is a traditional message broker. It excels at:

  • Complex routing: direct, topic, fanout, and headers exchanges let you route messages based on patterns, keys, or headers.
  • Per-message acknowledgment: each message is individually acknowledged and deleted after consumption.
  • Push-based delivery: the broker pushes messages to consumers.
  • Low-latency message delivery for smaller volumes.
  • Priority queues: messages can have priorities.
  • Best for: task queues, RPC-style communication, workflows with complex routing rules.

Kafka is a distributed event streaming platform. It excels at:

  • Extremely high throughput: millions of messages per second.
  • Message retention: messages are kept for a configurable period (days or weeks), not deleted after consumption.
  • Replay capability: consumers can rewind their offset and reprocess old messages.
  • Ordered delivery: strict ordering within a partition.
  • Consumer groups: multiple consumer groups can independently read the same topic.
  • Best for: event sourcing, log aggregation, stream processing, analytics pipelines, data integration.

Key differences:

Aspect            | RabbitMQ                                       | Kafka
------------------|------------------------------------------------|------------------------------------------------
Model             | Message broker (smart broker, simple consumer) | Distributed log (simple broker, smart consumer)
After consumption | Message deleted                                | Message retained
Ordering          | Per-queue                                      | Per-partition
Throughput        | Tens of thousands/sec                          | Millions/sec
Replay            | Not possible                                   | Consumer rewinds offset

Decision rule:

  • Need to replay events or have multiple independent consumers reading the same data? Kafka.
  • Need complex routing rules, priorities, or traditional task distribution? RabbitMQ.
  • Need high-throughput analytics or event sourcing? Kafka.
  • Need a simple job queue with retries? RabbitMQ (or SQS if on AWS).
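The replay difference above can be illustrated with a toy in-memory model (not the Kafka API): the broker retains every message, and each consumer group merely tracks an offset it can rewind.

```python
# Toy model of Kafka's retained-log semantics: messages are never deleted
# on consumption; each consumer group only advances (or rewinds) an offset.
class RetainedLog:
    def __init__(self):
        self.messages = []          # retained, not deleted after reads
        self.offsets = {}           # consumer group -> next offset

    def produce(self, msg):
        self.messages.append(msg)

    def poll(self, group):
        off = self.offsets.get(group, 0)
        batch = self.messages[off:]
        self.offsets[group] = len(self.messages)
        return batch

    def rewind(self, group, offset=0):
        self.offsets[group] = offset   # replay from an earlier point

log = RetainedLog()
for e in ["order.created", "payment.success", "order.shipped"]:
    log.produce(e)

log.poll("analytics")   # all three events
log.poll("billing")     # an independent group also sees all three
log.rewind("analytics")
log.poll("analytics")   # replayed from offset 0
```

In the RabbitMQ model, the broker would have deleted each message after the first acknowledgment, so neither the second consumer group nor the replay would be possible.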

Question 5: What does an API gateway do, and how is it different from a load balancer?

Model Answer

An API gateway is the single entry point for all client requests into a backend system. It handles cross-cutting concerns that would otherwise be duplicated across every microservice.

Core responsibilities:

  1. Routing: Maps external URLs to internal services (e.g., /api/users/* to user-service).
  2. Authentication/Authorization: Validates JWT tokens, API keys, or OAuth tokens before requests reach services.
  3. Rate limiting: Controls request rates per user, per IP, or per endpoint.
  4. Request/response transformation: Adds headers, removes internal fields, converts formats.
  5. API composition: Aggregates responses from multiple services into a single response (reducing round trips for mobile clients).
  6. SSL termination, logging, caching, CORS, circuit breaking.
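Responsibility 3 (rate limiting) is commonly implemented as a token bucket per user. A minimal sketch, with an in-process dict standing in for the shared store (a real gateway would keep bucket state in Redis so all gateway instances agree):

```python
import time

class TokenBucket:
    """Per-user token bucket: `rate` tokens/second, bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}   # user id -> bucket; illustrative in-process stand-in

def gateway_allows(user, rate=5, capacity=5):
    bucket = buckets.setdefault(user, TokenBucket(rate, capacity))
    return bucket.allow()
```

The same per-key structure works for per-IP or per-endpoint limits; only the dictionary key changes.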

How it differs from a load balancer:

A load balancer distributes traffic across instances of a service. It does not understand API semantics -- it does not know about authentication, rate limiting, or request transformation. Its job is to ensure even distribution and remove unhealthy servers.

An API gateway manages API-level concerns. It understands HTTP semantics, can inspect and modify requests, and handles authentication and authorization.

In practice, they work together:

Client --> API Gateway --> Load Balancer --> Service Instances

The gateway handles auth, rate limiting, and routing. The load balancer distributes traffic across instances of the target service.

A load balancer is infrastructure-level, answering "which server should handle this connection?" An API gateway is application-level, answering "should this request be allowed, and which service should handle it?"


Question 6: How do microservices handle distributed transactions?

Model Answer

In a monolith, you can use a single database transaction to ensure atomicity. In microservices, each service has its own database, so traditional ACID transactions do not span services. There are two main approaches.

1. Two-Phase Commit (2PC): A coordinator asks all participants to prepare (vote), then tells them all to commit or rollback.

Problems: The coordinator is a single point of failure. Participants hold locks during the prepare phase, reducing throughput. It creates tight coupling between services. For these reasons, 2PC is generally avoided in microservices.

2. Saga Pattern (preferred): A saga is a sequence of local transactions, each in its own service. If one step fails, compensating transactions undo the previous steps.

There are two flavors:

Choreography (event-driven): Each service publishes events after its local transaction. Other services listen and act. For example: Order Service creates an order and publishes "order.created." Payment Service hears this, charges the card, and publishes "payment.success." If Payment fails, it publishes "payment.failed," and Order Service listens and cancels the order.

Pros: No central coordinator, loose coupling. Cons: Hard to understand the overall flow, especially with many steps.

Orchestration (centralized): A Saga Orchestrator explicitly tells each service what to do in sequence. It tracks the state of the saga and handles compensation.

Pros: Easy to understand and debug the flow. Cons: Orchestrator is a single point of failure and creates coupling.

My recommendation: Use choreography for simple sagas (3-4 steps) where the flow is straightforward. Use orchestration for complex sagas (5+ steps) where the flow is hard to follow through events alone.

Key principle: Design each step to be idempotent, so retries do not cause duplicate effects.
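An orchestration-style saga can be sketched as a loop that runs each local transaction in order and, on failure, runs the compensations for the completed steps in reverse. Step and event names below are illustrative, not from any specific framework.

```python
# Saga orchestrator sketch: forward steps, then reverse compensation on failure.
def run_saga(steps):
    """steps: list of (name, action, compensation) callables."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):   # compensating transactions
                undo()
            return f"rolled back at {name}"
    return "committed"

log = []
def charge_card():
    raise RuntimeError("card declined")        # simulated payment failure

result = run_saga([
    ("create_order", lambda: log.append("order created"),
                     lambda: log.append("order cancelled")),
    ("charge_card",  charge_card,
                     lambda: log.append("payment refunded")),
])
# result == "rolled back at charge_card"
# log    == ["order created", "order cancelled"]
```

Note the failing step itself is not compensated: only steps that committed locally need undoing. Idempotency of both actions and compensations is what makes retries of this loop safe.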


Question 7: When would you choose a monolith over microservices?

Model Answer

I would choose a monolith in most early-stage scenarios, and I would only move to microservices when specific problems make the monolith untenable.

Choose a monolith when:

  1. Small team (under 15-20 engineers): Microservices add operational overhead (service discovery, distributed tracing, deployment pipelines per service). A small team cannot afford this overhead and moves faster with a monolith.

  2. New product or domain: When you do not yet understand the domain well enough to draw correct service boundaries, microservices lock you into premature boundaries that are expensive to change. A monolith lets you refactor freely.

  3. Data consistency is critical: Financial applications, inventory systems -- anywhere you need strong ACID transactions across multiple entities. This is trivial in a monolith and extremely complex across microservices.

  4. Simple scaling needs: If your system can scale vertically (bigger machine) or horizontally (multiple instances of the monolith behind a load balancer), microservices add complexity without proportional benefit.

  5. Speed of development matters most: Deploying one artifact, debugging with a single stack trace, running one test suite -- all faster with a monolith.

Move to microservices when:

  • Teams are stepping on each other (merge conflicts, deployment coordination for 30+ engineers).
  • Different components have vastly different scaling needs (e.g., video encoding vs. user profiles).
  • Teams need technology heterogeneity (ML team wants Python, API team wants Go).
  • Independent deployment is critical (cannot afford 4-hour deployment windows).
  • Fault isolation is needed (a bug in recommendations should not take down payments).

The pragmatic path: Start with a well-structured modular monolith. When specific pain points emerge, extract those modules into services using the Strangler Fig pattern. Never rewrite everything at once.


Question 8: Explain how a CDN works and when you would use one in a system design.

Model Answer

A CDN is a geographically distributed network of edge servers that cache content close to users, reducing latency and offloading the origin server.

How it works:

  1. A user requests static.example.com/image.jpg.
  2. DNS resolves to the nearest CDN edge server (using GeoDNS or Anycast).
  3. The edge server checks its cache. If the content is cached (hit), it returns it immediately -- the origin server is never contacted.
  4. If it is a cache miss, the edge fetches from the origin, caches the response, and returns it to the user.
  5. Subsequent requests from users near that edge are served from cache.

When to use a CDN:

  • Always for static assets (images, CSS, JavaScript, videos, fonts). There is almost no reason NOT to use a CDN for these.
  • API responses that are read-heavy and not user-specific (e.g., product catalog, trending content) -- cache with a short TTL (30-300 seconds).
  • Global user base where latency matters. A CDN can reduce latency from 200ms to 10ms by serving from a nearby edge.
  • DDoS protection: CDNs absorb volumetric attacks before they reach your origin.
  • Reducing origin load: At scale, the CDN serves 90%+ of requests without touching your servers.

When NOT to use:

  • Highly personalized content (low cache hit rate).
  • All users in one region (a local load balancer suffices).
  • Real-time data where any staleness is unacceptable.

Cache invalidation: The best practice is versioned URLs (e.g., app.abc123.js). When the file changes, the URL changes, and the new version is a fresh cache entry. This avoids the need to purge CDN caches, which can be slow and unreliable.
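The versioned-URL scheme can be sketched with a hypothetical build-time helper that embeds a content hash in the filename, so a content change automatically produces a new URL (and a fresh CDN cache entry):

```python
import hashlib

def versioned_name(filename, content, digits=8):
    """Hypothetical asset-naming helper: app.js -> app.<hash>.js.
    When the content changes, the hash (and therefore the URL) changes,
    so the new file is a fresh cache entry; no CDN purge is needed."""
    digest = hashlib.sha256(content).hexdigest()[:digits]
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}.{digest}.{ext}" if dot else f"{filename}.{digest}"

versioned_name("app.js", b"console.log('v1')")   # e.g. app.<8 hex chars>.js
```

Because the old URL never changes meaning, previously cached copies remain valid for users on the old HTML, which also makes rollbacks safe.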


Question 9: What are dead letter queues and why are they important?

Model Answer

A dead letter queue (DLQ) is a special queue that receives messages that could not be processed successfully after a configured number of retries.

Why they matter:

Without a DLQ, a "poison message" -- a message that always fails processing -- can block the entire queue. The consumer picks it up, fails, the message goes back to the queue, the consumer picks it up again, fails again, and this repeats forever. The queue is stuck.

How DLQs work:

  1. Consumer receives a message and tries to process it.
  2. Processing fails (exception, timeout, validation error).
  3. The message is retried (typically 3-5 times, with exponential backoff).
  4. After all retries are exhausted, the message is moved to the DLQ instead of being retried again.
  5. The DLQ preserves the original message, the error reason, and metadata.
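The five steps above can be sketched as a consumer loop: bounded retries with exponential backoff, then the message is parked on the DLQ with its error metadata. A minimal in-process sketch (a list stands in for the real DLQ):

```python
import time

def consume(message, handler, dlq, max_retries=3, base_delay=0.01):
    """Process with bounded retries; exhausted messages go to the DLQ."""
    for attempt in range(1, max_retries + 1):
        try:
            return handler(message)
        except Exception as exc:
            if attempt == max_retries:
                dlq.append({"message": message,      # preserve the original
                            "error": str(exc),       # ...and why it failed
                            "retries": attempt,
                            "failed_at": time.time()})
                return None
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff

dlq = []
def handler(msg):
    raise ValueError("bad payload")   # a poison message: always fails

consume({"id": 1}, handler, dlq)
# dlq now holds the message, the error string, and the retry count
```

In managed queues (SQS, RabbitMQ) the broker performs the move itself via a redrive/DLX policy; the consumer only needs to fail honestly.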

What you do with DLQ messages:

  • Alert: Monitor DLQ depth. If messages are arriving in the DLQ, something is wrong.
  • Investigate: Examine the message and error to understand why processing failed.
  • Fix and replay: After fixing the bug, replay messages from the DLQ back into the main queue.
  • Audit: Keep a record of all messages that failed processing.

Best practices:

  • Set DLQ retention time longer than the main queue (give yourself time to investigate).
  • Include the original message body, error details, retry count, and timestamp in the DLQ entry.
  • Set up alerting on DLQ depth -- do not let failed messages silently accumulate.
  • Build a tool or script to replay DLQ messages back to the main queue.
  • Monitor the DLQ processing rate after deploying fixes to confirm the bug is resolved.

Example configuration in SQS: Main queue: 3 max receives, then redirect to DLQ. DLQ retention: 14 days. CloudWatch alarm when DLQ message count > 0.


Question 10: Describe consistent hashing and why it is used in load balancing and caching.

Model Answer

Consistent hashing is a distributed hashing scheme that minimizes the number of keys that need to be remapped when the number of servers changes.

The problem it solves: With simple modular hashing (hash(key) % N), adding or removing a server changes N, causing almost all keys to be remapped to different servers. In a cache, this means nearly every cached item is suddenly on the wrong server -- a massive cache miss storm.

How consistent hashing works:

  1. Arrange a hash space as a circular ring (0 to 2^32-1).
  2. Hash each server's identifier to a point on the ring.
  3. Hash each key to a point on the ring.
  4. Each key is assigned to the first server found clockwise from the key's position.

When a server is removed: Only the keys assigned to that server are redistributed -- they move to the next server clockwise. All other keys stay on their current servers.

When a server is added: It takes over some keys from the next server clockwise. All other keys are unaffected.

Virtual nodes: In practice, servers are mapped to multiple points on the ring (virtual nodes). This ensures more even distribution. Without virtual nodes, servers may get unequal key ranges. With 100-200 virtual nodes per server, distribution becomes nearly uniform.

Quantitative impact: With simple hashing and 10 servers, adding 1 server remaps ~90% of keys. With consistent hashing, adding 1 server remaps only ~10% of keys (roughly 1/N).
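The ring, virtual nodes, and remap fraction can all be demonstrated in a short sketch (MD5 is used here purely for illustration; any uniform hash works):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes."""
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes, self.ring, self.keys = vnodes, {}, []
        for n in nodes:
            self.add(n)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):           # place vnodes on the ring
            h = self._hash(f"{node}#{i}")
            self.ring[h] = node
            bisect.insort(self.keys, h)

    def get(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self.keys, h) % len(self.keys)  # first point clockwise
        return self.ring[self.keys[idx]]

ring = HashRing([f"s{i}" for i in range(10)])
before = {k: ring.get(k) for k in (f"key{i}" for i in range(1000))}
ring.add("s10")                                # grow from 10 to 11 servers
moved = sum(1 for k, v in before.items() if ring.get(k) != v)
# moved / 1000 is roughly 1/11, versus ~90% under hash(key) % N
```

Without the vnode loop in add(), the same code still works but key ranges become lumpy; virtual nodes are what make the distribution approach uniform.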

Where it is used:

  • Caching (Redis Cluster, Memcached): Determines which cache node stores each key.
  • Load balancing: Routes requests for the same user/session to the same server.
  • Distributed databases (Cassandra, DynamoDB): Determines which node owns each partition.
  • CDN: Routes content to specific edge servers.

Question 11: Design the infrastructure for a system that handles 100,000 requests per second with 99.99% uptime.

Model Answer

This is a comprehensive design question. I will walk through each infrastructure layer.

Traffic profile: 100K RPS requires multiple servers. At 1,000 RPS per server, we need ~100+ application servers, plus headroom.
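The sizing arithmetic, and the uptime budget quoted later in this answer, can be checked with a quick back-of-envelope calculation. The 1.5x headroom factor is an assumption (running at roughly 66% utilization):

```python
# Back-of-envelope capacity and availability math for this design.
target_rps = 100_000
per_server_rps = 1_000
headroom = 1.5                      # assumed: run at ~66% utilization

servers = target_rps / per_server_rps * headroom
# servers -> 150.0

# 99.99% uptime expressed as a yearly downtime budget:
downtime_min_per_year = (1 - 0.9999) * 365 * 24 * 60
# downtime_min_per_year -> ~52.6 minutes
```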

Layer 1: DNS and Global Load Balancing

  • Use DNS with health checks (Route 53 or Cloudflare) to route users to the nearest healthy region.
  • Deploy across at least 2 regions (e.g., US-East and EU-West) for disaster recovery.
  • Use latency-based routing so users reach the closest region.

Layer 2: CDN

  • Place a CDN (CloudFront or Cloudflare) in front of all static assets and cacheable API responses.
  • This offloads 60-80% of traffic before it reaches our servers.
  • CDN also provides DDoS protection.

Layer 3: Load Balancing

  • Layer 4 NLB at the edge for TLS termination and raw TCP performance.
  • Layer 7 ALB behind it for content-based routing to different service pools.
  • Health checks every 10 seconds, marking a target unhealthy after 3 consecutive failures.
  • Across at least 3 availability zones.

Layer 4: API Gateway

  • Rate limiting per user and per IP to prevent abuse.
  • Authentication at the gateway so backend services are shielded.
  • Circuit breakers to prevent cascading failures.

Layer 5: Application Servers

  • Stateless servers behind the load balancer (sessions in Redis).
  • Auto-scaling based on CPU and request count.
  • Minimum 100 instances; scale up to 200+ during peaks.
  • Blue-green or canary deployments for zero-downtime updates.

Layer 6: Caching

  • Redis Cluster with 6+ nodes (3 primary, 3 replica).
  • Cache-aside pattern for database reads.
  • Cache the hottest 20% of data that serves 80% of reads.
  • TTLs of 60-300 seconds depending on data type.
  • Cache stampede prevention via locking.

Layer 7: Message Queues

  • Kafka for event streaming (order events, analytics).
  • SQS for async task processing (emails, notifications).
  • Dead letter queues configured for all queues.

Layer 8: Database

  • Primary-replica setup with read replicas.
  • Sharding if single primary cannot handle write load.
  • Automated failover (RDS Multi-AZ or equivalent).

Achieving 99.99% uptime (52 minutes of downtime per year):

  • Multi-AZ deployment (survive AZ failure).
  • Multi-region with failover (survive region failure).
  • No single points of failure at any layer.
  • Automated health checks and failover at every layer.
  • Canary deployments to catch bugs before full rollout.
  • Chaos engineering to test failure scenarios.
  • On-call team with runbooks for incident response.

Monitoring: Prometheus + Grafana for metrics. ELK or Datadog for logs. Jaeger for distributed tracing. PagerDuty for alerts.