Episode 9 — System Design / 9.8 — Communication and Data Layer
9.8 — Interview Questions
How to use this material (instructions)
- Simulate real conditions — Set a timer for 5-8 minutes per question. Answer out loud or on a whiteboard.
- Answer first, then compare — Write your answer before reading the model answer.
- Focus on structure — Interviewers value organized thinking over perfect recall.
- Draw diagrams — Use ASCII diagrams to support your explanations.
- State tradeoffs — The best answers acknowledge what you are giving up.
Question 1: How Does a REST API Differ from GraphQL? (Beginner)
What the interviewer is testing: Understanding of API paradigms and their tradeoffs.
Model Answer
REST and GraphQL are two approaches to building APIs with fundamentally different philosophies.
REST is resource-oriented. Each resource has a URL, and you use HTTP methods (GET, POST, PUT, DELETE) to operate on it. The server determines the response shape.
GraphQL is query-oriented. There is a single endpoint, and the client specifies exactly what data it needs in the query.
Key differences:
| Aspect | REST | GraphQL |
|---|---|---|
| Endpoints | Multiple (/users, /posts) | Single (/graphql) |
| Data shape | Server decides | Client decides |
| Over-fetching | Common | Eliminated |
| Under-fetching | Requires multiple requests | Single query |
| Caching | HTTP caching (simple) | Complex (all POST requests) |
| Error handling | HTTP status codes | Always 200, errors in response body |
| Learning curve | Lower | Higher |
When to choose REST: Public APIs, simple CRUD, strong caching needs, small teams.
When to choose GraphQL: Complex data relationships, mobile apps needing minimal payloads, multiple frontend clients with different data needs.
Neither is universally better. The choice depends on the use case.
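The over-fetching difference can be sketched in a few lines of Python. This is a toy illustration with made-up data, not a real API: a REST-style handler returns the whole resource, while a GraphQL-style resolver returns only the fields the client asked for.

```python
# Illustrative user record (hypothetical data, not from any real API).
USER = {"id": 1, "name": "Ada", "email": "ada@example.com",
        "bio": "Pioneer", "avatar_url": "/a.png", "created_at": "2024-01-01"}

def rest_get_user(user_id):
    """REST style: the server decides the shape, so the client gets everything."""
    return dict(USER)

def graphql_get_user(user_id, fields):
    """GraphQL style: the client names the fields it needs, nothing more."""
    return {f: USER[f] for f in fields}

# A mobile client that only renders a name avoids shipping the full record:
assert graphql_get_user(1, ["id", "name"]) == {"id": 1, "name": "Ada"}
assert set(rest_get_user(1)) == set(USER)  # REST returns all six fields
```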
Question 2: Explain the CAP Theorem with Examples (Beginner)
What the interviewer is testing: Understanding of fundamental distributed systems tradeoffs.
Model Answer
The CAP theorem states that a distributed data store can only guarantee two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write
- Availability (A): Every request receives a response
- Partition Tolerance (P): The system works despite network partitions
Since network partitions are unavoidable in distributed systems, the real choice is between CP and AP.
Network Partition Occurs:
=========================
[Node A] ----X---- [Node B] (network link broken)
CP Choice:
- Node A stops accepting writes to prevent inconsistency.
- Returns an error or waits.
- Example: MongoDB (default)
- Use case: bank account balance

AP Choice:
- Both nodes continue working independently. They may diverge.
- Returns potentially stale data.
- Example: Cassandra, DynamoDB
- Use case: shopping cart
Practical examples:
- CP (MongoDB): During a partition, a primary that cannot reach a majority of the replica set steps down and stops accepting writes until the partition heals. Consistency is preserved at the cost of availability.
- AP (Cassandra): During a partition, all nodes continue serving reads and writes. When the partition heals, conflicting writes are resolved (e.g., last-write-wins). Availability preserved at the cost of temporary inconsistency.
Key interview point: Most modern databases offer tunable consistency. DynamoDB lets you choose strong or eventual consistency per read. Cassandra lets you set consistency level per query (ONE, QUORUM, ALL).
Question 3: When Would You Choose SQL vs NoSQL? (Intermediate)
What the interviewer is testing: Ability to make justified database decisions based on requirements.
Model Answer
The choice depends on the data characteristics, access patterns, and consistency requirements.
Choose SQL (PostgreSQL, MySQL) when:
- Data has clear relationships (users -> orders -> items)
- You need ACID transactions (financial data, inventory)
- Query patterns are complex or unpredictable (ad-hoc reporting)
- Schema is well-defined and relatively stable
- Strong consistency is non-negotiable
Choose NoSQL when:
- Schema is flexible or varies per record (product catalogs with different attributes)
- Access patterns are known and simple (get user by ID)
- You need horizontal scaling for massive data volumes
- Eventual consistency is acceptable
- High write throughput is critical (IoT, logging)
Specific database recommendations:
| Scenario | Database | Why |
|---|---|---|
| E-commerce orders | PostgreSQL | ACID for payments, complex joins |
| User sessions | Redis | Sub-ms latency, TTL expiration |
| Product catalog | MongoDB | Flexible schema, varying attributes |
| Social graph | Neo4j | Relationship traversal is core query |
| IoT time-series | Cassandra | Massive write throughput, time-ordered |
| Real-time cache | Redis | In-memory, data structures |
The mature answer: Most production systems use polyglot persistence. For example, an e-commerce platform might use PostgreSQL for orders, Redis for caching and sessions, Elasticsearch for search, and MongoDB for the product catalog.
Question 4: How Would You Design Database Sharding for a Chat Application? (Intermediate)
What the interviewer is testing: Practical sharding design, shard key selection, handling of cross-shard queries.
Model Answer
Understanding the domain: A chat application's primary entity is a message. The core query patterns are:
- Get all messages in a conversation (most frequent)
- Get recent conversations for a user
- Search messages
Shard key selection:
| Candidate Key | Pros | Cons | Verdict |
|---|---|---|---|
| message_id | Even distribution | Fetching a conversation requires ALL shards | Bad |
| sender_id | User's sent messages on one shard | Conversation split across shards | Bad |
| conversation_id | All messages in a conversation on one shard | Heavy conversations become hotspots | Best choice |
I would choose conversation_id as the shard key with hash-based sharding.
Shard routing: shard = hash(conversation_id) % num_shards
Shard 0 Shard 1 Shard 2
+-------------+ +-------------+ +-------------+
| conv_A msgs | | conv_C msgs | | conv_B msgs |
| conv_D msgs | | conv_F msgs | | conv_E msgs |
+-------------+ +-------------+ +-------------+
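The routing rule above can be sketched in a few lines of Python. One practical detail worth noting: Python's built-in hash() is randomized per process, so a stable digest (MD5 here, purely as an illustration) is needed for routing to agree across application servers. The shard count is an assumption for the example.

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size for illustration

def shard_for(conversation_id: str) -> int:
    """Deterministically map a conversation to a shard.

    A stable digest is used instead of the built-in hash(), which is
    randomized per process and would route differently on each node.
    """
    digest = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every message in the same conversation lands on the same shard,
# so "get all messages in a conversation" is a single-shard query.
assert shard_for("conv_A") == shard_for("conv_A")
assert 0 <= shard_for("conv_B") < NUM_SHARDS
```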
Handling the secondary access pattern ("recent conversations for user X"):
- Maintain a separate user_conversations table (or cache in Redis) that maps user_id -> [conversation_ids]
- This table can be sharded by user_id
- First query this table, then fetch messages from the appropriate shard
Handling hotspots (group chats with 10,000 members):
- Monitor shard sizes
- For extremely active conversations, sub-shard by time range
- Use read replicas on hot shards
Handling search:
- Sharded database search is expensive (scatter-gather across all shards)
- Use a dedicated search service (Elasticsearch) that indexes messages asynchronously
Question 5: Explain the Saga Pattern for Distributed Transactions (Intermediate)
What the interviewer is testing: Understanding of distributed transaction patterns and compensation logic.
Model Answer
A saga is a sequence of local transactions where each step has a compensating transaction to undo its effect if a later step fails. It is the standard alternative to two-phase commit (2PC) in microservice architectures.
Example: E-commerce order placement
Happy Path:
===========
1. Order Service: Create order (status: pending)
2. Payment Service: Charge credit card
3. Inventory Service: Reserve items
4. Shipping Service: Schedule delivery
5. Order Service: Update order (status: confirmed)
Failure at Step 3 (items out of stock):
========================================
1. Order Service: Create order --> Done
2. Payment Service: Charge credit card --> Done
3. Inventory Service: Reserve items --> FAILED
Compensate (reverse order):
2c. Payment Service: Refund credit card
1c. Order Service: Cancel order (status: cancelled)
Two implementation approaches:
Choreography (event-driven):
Order -> emits "OrderCreated"
-> Payment listens, charges, emits "PaymentCharged"
-> Inventory listens, reserves, emits "ItemsReserved"
-> Shipping listens, schedules, emits "ShipmentScheduled"
Good for simple flows (3-4 steps). No central coordinator.
Orchestration (central coordinator):
OrderSaga orchestrator:
call OrderService.create()
call PaymentService.charge()
call InventoryService.reserve()
if any fails -> call compensation in reverse
Good for complex flows (5+ steps, branching logic, timeouts).
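The orchestrator pseudocode can be made concrete with a minimal Python sketch. The service calls are stand-in lambdas, not real service clients; the point is the control flow: each step pairs an action with a compensation, and on failure the completed steps are compensated in reverse order.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.

    Runs actions in order; on any failure, runs the compensations of
    the already-completed steps in reverse order (2c before 1c).
    """
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        return "cancelled"
    return "confirmed"

def reserve_items():
    raise RuntimeError("out of stock")  # step 3 fails, as in the example

log = []
steps = [
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("card charged"),  lambda: log.append("card refunded")),
    (reserve_items,                       lambda: log.append("never runs")),
]
status = run_saga(steps)
# status == "cancelled"; the refund runs before the order cancellation.
```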
Why not 2PC? 2PC is blocking: all participants hold locks while waiting for the coordinator. If the coordinator fails, participants are stuck. Sagas are non-blocking: each step commits immediately, and compensations handle failures asynchronously.
Tradeoff: Sagas provide eventual consistency, not strong consistency. Between steps, the system is in a partially committed state. Design your UI to handle this (e.g., show "processing" status).
Question 6: How Do Read Replicas Work and What Problems Do They Introduce? (Intermediate)
What the interviewer is testing: Understanding of replication mechanics, lag, and consistency implications.
Model Answer
Read replicas are copies of a primary database that serve read traffic, distributing load across multiple nodes.
Architecture:
=============
Writes -> [Primary DB] ---replication---> [Replica 1]
---replication---> [Replica 2]
---replication---> [Replica 3]
^
Reads (distributed) --------------------------/
How replication works:
- Primary writes data and records the change in its Write-Ahead Log (WAL)
- WAL entries are streamed to replicas (async by default)
- Replicas apply the WAL entries to their local copy
- Read queries are load-balanced across replicas
The main problem: Replication Lag
With asynchronous replication, there is a delay (typically 10-100ms) between a write to the primary and that write appearing on replicas.
Concrete bug scenario:
- User submits a form (POST to primary)
- Page redirects and loads data (GET from replica)
- Replica has not received the write yet
- User sees stale data ("Where is my submission?!")
Solutions:
| Approach | How | Cost |
|---|---|---|
| Read-your-writes | Route user's reads to primary for N seconds after a write | More load on primary |
| Sticky sessions | Pin user to a specific replica | Uneven load |
| Synchronous replication | Primary waits for at least one replica to ACK | Slower writes |
| Version tokens | Client sends a token from the write; replica checks it has that version | Implementation complexity |
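The version-token approach from the table can be sketched with toy in-memory classes. The class and method names are hypothetical, and real systems would use log sequence numbers rather than a simple counter, but the mechanism is the same: the write returns a token, and a read falls back to the primary if the replica has not yet applied that version.

```python
class Primary:
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version  # token the client carries to its next read

class Replica:
    def __init__(self):
        self.applied_version = 0
        self.data = {}

    def catch_up(self, primary):  # replication step (normally async)
        self.data = dict(primary.data)
        self.applied_version = primary.version

def read(key, token, replica, primary):
    if replica.applied_version >= token:
        return replica.data.get(key)  # replica has the client's write
    return primary.data.get(key)      # lagging: fall back to primary

primary, replica = Primary(), Replica()
token = primary.write("name", "Ada")
# The replica is lagging, yet the user still reads their own write:
assert read("name", token, replica, primary) == "Ada"
```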
Other problems:
- Failover complexity: If primary fails, promote a replica, but it may be behind
- Write bottleneck: All writes still go to one node
- Schema changes: Must be rolled out to primary and all replicas carefully
When to add read replicas: When your database CPU is above 70% and the workload is read-heavy (80%+ reads). If the workload is write-heavy, replicas will not help much.
Question 7: Design the API and Data Layer for a URL Shortener (Intermediate)
What the interviewer is testing: End-to-end thinking from API to database choice to scaling.
Model Answer
API Design:
POST /api/v1/urls
Body: { "long_url": "https://example.com/very/long/path" }
Response: 201 Created
{ "data": { "short_code": "abc123", "short_url": "https://sho.rt/abc123", "long_url": "..." } }
GET /api/v1/urls/{short_code}
Response: 301 Moved Permanently
Location: https://example.com/very/long/path
GET /api/v1/urls/{short_code}/stats
Response: 200 OK
{ "data": { "clicks": 15234, "created_at": "...", "top_countries": [...] } }
Database choice:
The core operation is a key-value lookup: short_code -> long_url. This is a perfect fit for a key-value store.
| Layer | Database | Why |
|---|---|---|
| Primary store | DynamoDB or PostgreSQL | Durable storage of URL mappings |
| Cache | Redis | Sub-ms reads for hot URLs (80/20 rule: 20% of URLs get 80% of traffic) |
| Analytics | Cassandra or ClickHouse | High-volume write of click events |
Data model (DynamoDB):
Table: urls
Partition Key: short_code (string)
Attributes: long_url, user_id, created_at, click_count
Table: analytics
Partition Key: short_code
Sort Key: timestamp
Attributes: ip, country, referrer, device
Scaling considerations:
- 100 million URLs: ~10 GB storage (small, single node works)
- 10,000 redirects/second: Redis cache handles this easily
- Short code generation: Base62 encoding of an auto-increment ID or hash. Use a distributed ID generator (Snowflake) to avoid collisions across nodes.
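Base62 encoding of a numeric ID is short enough to show in full. A sketch, assuming the IDs come from an auto-increment column or a Snowflake-style generator:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Encode a non-negative integer ID as a compact short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

# 62^7 is about 3.5 trillion, so 7 characters cover far more than
# the 100 million URLs estimated above.
assert to_base62(61) == "Z"
assert to_base62(62) == "10"
```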
Consistency: Eventual consistency is fine. If a new short URL takes 100ms to propagate, that is acceptable. The redirect must be fast (cache-first).
Question 8: What Is the Difference Between TCP and UDP, and When Would You Use Each? (Beginner)
What the interviewer is testing: Fundamental networking knowledge applied to real scenarios.
Model Answer
TCP and UDP are transport-layer protocols with fundamentally different guarantees.
TCP (Transmission Control Protocol):
- Connection-oriented (three-way handshake: SYN, SYN-ACK, ACK)
- Guarantees reliable, ordered delivery
- Has flow control and congestion control
- Higher overhead (20-60 byte header)
UDP (User Datagram Protocol):
- Connectionless (fire and forget)
- No delivery guarantee, no ordering
- No flow control
- Minimal overhead (8 byte header)
| Use Case | Protocol | Reasoning |
|---|---|---|
| Web pages (HTTP) | TCP | Every byte of HTML/CSS/JS must arrive correctly |
| REST APIs | TCP | Request-response must be reliable |
| Video streaming (live) | UDP | A dropped frame is better than a delayed stream |
| Online gaming | UDP | Stale position data is worse than missing one update |
| DNS queries | UDP | Small, single request-response; fast |
| File transfer (FTP) | TCP | Every byte of the file must arrive |
| Voice calls (VoIP) | UDP | Low latency matters more than perfect audio |
| Database connections | TCP | Queries and results must be reliable |
Key insight for interviews: The choice comes down to this: Is it worse to have incorrect/incomplete data (choose TCP) or delayed data (choose UDP)? For real-time applications where old data is useless, UDP wins. For everything else, TCP is safer.
Modern protocols blur this line: HTTP/3 uses QUIC (built on UDP) but adds reliability on top, getting the best of both worlds.
Question 9: How Would You Handle Data Consistency in a Microservice Architecture? (Advanced)
What the interviewer is testing: Deep understanding of distributed consistency, practical pattern selection.
Model Answer
In a microservice architecture, each service owns its own database. This means a business operation spanning multiple services cannot use a traditional database transaction. We need distributed consistency patterns.
The spectrum of approaches:
Simplest Most Complex
<==========================================================>
Idempotent Outbox Choreography Orchestration 2PC
retries Pattern Saga Saga (avoid)
Level 1: Idempotent Operations + Retries
For simple cases, make operations idempotent and retry on failure.
POST /api/payments
Idempotency-Key: req_abc123
If this request is sent twice, the server checks the key
and returns the same result without charging twice.
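A minimal server-side sketch of that check, using an in-memory dict as the idempotency store (a production system would use Redis or a database column with a unique constraint):

```python
processed = {}  # idempotency_key -> stored response

def charge(idempotency_key: str, amount: int) -> dict:
    """Charge at most once per idempotency key.

    A replayed request returns the stored response instead of
    charging the card a second time.
    """
    if idempotency_key in processed:
        return processed[idempotency_key]
    response = {"charged": amount, "status": "ok"}  # pretend charge
    processed[idempotency_key] = response
    return response

first = charge("req_abc123", 500)
second = charge("req_abc123", 500)  # client retry of the same request
assert first is second              # same stored result, one charge
```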
Level 2: Transactional Outbox Pattern
Write the event and the data change in the same database transaction.
BEGIN TRANSACTION;
INSERT INTO orders (id, ...) VALUES (...);
INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '{...}');
COMMIT;
-- A separate process reads the outbox and publishes events
-- Guarantees: if the order exists, the event will be published
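The same idea runs end to end with SQLite standing in for the service's database. Table and column names mirror the SQL above but are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

# One transaction: the order row and its event commit (or roll back)
# together, so there is no window where one exists without the other.
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?)", ("ord_1", 4200))
    conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                 ("OrderCreated", '{"order_id": "ord_1"}'))

# A separate relay process would poll unpublished rows and publish them:
rows = conn.execute(
    "SELECT event_type FROM outbox WHERE published = 0").fetchall()
assert rows == [("OrderCreated",)]
```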
Level 3: Saga (Choreography)
For 2-4 step workflows where services are loosely coupled. Each service listens for events, performs its action, and emits the next event.
Level 4: Saga (Orchestration)
For 5+ step workflows with complex branching, timeouts, and retry logic. A central orchestrator manages the flow and compensations.
My recommendation for most systems:
- Use the outbox pattern for reliable event publishing
- Use choreography sagas for simple flows
- Use orchestration sagas for complex flows
- Never use 2PC across service boundaries
- Accept eventual consistency where possible; use strong consistency only where required (payments, inventory)
Practical tradeoff: Between steps in a saga, the system is in a partially committed state. Design your UI to show intermediate states ("Payment processing...") and handle retries gracefully.
Question 10: Compare Sharding Strategies and Explain Consistent Hashing (Advanced)
What the interviewer is testing: Deep understanding of horizontal scaling and data distribution.
Model Answer
Sharding strategies:
| Strategy | How Data Is Distributed | Strengths | Weaknesses |
|---|---|---|---|
| Range | By value ranges (e.g., A-H, I-P, Q-Z) | Efficient range queries within a shard | Hotspots if distribution is uneven |
| Hash | hash(key) % N | Even distribution | Range queries hit all shards; resharding moves all data |
| Geographic | By region (US, EU, APAC) | Data locality, regulatory compliance | Cross-region queries expensive |
| Directory | Lookup table maps keys to shards | Flexible, can rebalance | Directory is a bottleneck/SPOF |
The resharding problem with hash sharding:
Before: 3 shards, hash(key) % 3
key=7: 7 % 3 = 1 -> Shard 1
key=8: 8 % 3 = 2 -> Shard 2
After: 4 shards, hash(key) % 4
key=7: 7 % 4 = 3 -> Shard 3 (MOVED!)
key=8: 8 % 4 = 0 -> Shard 0 (MOVED!)
Almost ALL keys move! This is catastrophic for large datasets.
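The scale of the problem is easy to verify with a quick count over synthetic integer keys:

```python
# How many keys change shards when a modulo-sharded cluster grows 3 -> 4?
keys = range(10_000)
moved = sum(1 for k in keys if k % 3 != k % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # prints "75% of keys moved"
```

Roughly three quarters of all keys relocate for a single added shard, which is why plain modulo sharding is avoided for clusters that need to grow.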
Consistent hashing solves this:
Hash Ring (0 to 2^32), flattened to a line:
--------+--------+--------+--------+--->
0     1000     3000     5000     7000   (wraps back to 0)
        |        |        |
     Node A   Node B   Node C
Key "alice" hashes to 2500 -> walks clockwise -> Node B
Key "bob" hashes to 6000 -> walks clockwise -> Node A (wraps around)
Add Node D at position 4000:
- Only keys between 3000-4000 move from Node C to Node D
- Everything else stays! (~1/N data moves instead of ~all)
Virtual nodes improve balance: Instead of each physical node having one position on the ring, assign 100-200 virtual positions. This prevents uneven distribution when nodes are few.
Physical Node A -> Virtual: A1(500), A2(2000), A3(5500), A4(8000)
Physical Node B -> Virtual: B1(1000), B2(3500), B3(6000), B4(9000)
Keys are more evenly distributed across the ring.
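A minimal consistent-hash ring with virtual nodes fits in a short Python class. The digest choice (MD5) and the virtual-node count are illustrative, not prescriptive:

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring: each node gets many virtual positions."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        pos = self._hash(key)
        # Walk "clockwise" to the next position, wrapping past the end.
        idx = bisect.bisect(self.positions, pos) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["A", "B", "C"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}
bigger = Ring(["A", "B", "C", "D"])  # add one node
moved = sum(1 for k in before if bigger.node_for(k) != before[k])
# Only roughly 1/4 of keys move, versus ~75% with modulo sharding.
assert moved < 500
```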
Used by: Amazon DynamoDB, Apache Cassandra, Memcached, content delivery networks.
Question 11: Design the Communication and Data Layer for a Real-Time Collaborative Editor (Advanced)
What the interviewer is testing: Ability to integrate multiple concepts: protocols, databases, consistency, and conflict resolution.
Model Answer
A collaborative editor like Google Docs requires real-time synchronization of edits from multiple users on the same document.
Communication layer:
+--------+ WebSocket +------------------+
| User A | <================> | Collaboration |
+--------+ | Server |
| |
+--------+ WebSocket | - Receives edits |
| User B | <================> | - Broadcasts to |
+--------+ | all editors |
| - Resolves |
+--------+ WebSocket | conflicts |
| User C | <================> | |
+--------+ +--------+---------+
|
+-----v------+
| Document |
| Database |
+------------+
Why WebSockets: Bidirectional, low-latency communication. Every keystroke from any user must be broadcast to all other users in real time (<100 ms).
Conflict resolution: OT or CRDTs
When two users edit the same area simultaneously:
| Approach | How | Used By |
|---|---|---|
| OT (Operational Transformation) | Transform operations against concurrent operations. Central server is the source of truth. | Google Docs |
| CRDTs | Data structures that merge automatically without conflicts. No central coordinator needed. | Figma, Apple Notes |
For this design, I would use CRDTs because they work well in distributed settings and do not require a single coordinator (better for scaling).
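The convergence property that makes CRDTs attractive can be shown with the simplest CRDT, a grow-only counter (G-Counter). This is an illustrative stand-in: the sequence CRDTs used for text editing are far more involved, but they converge by the same merge-is-commutative principle.

```python
# G-Counter: each replica increments only its own slot; merging takes
# the per-slot maximum, so merges commute and converge automatically.

def increment(state, replica):
    state = dict(state)
    state[replica] = state.get(replica, 0) + 1
    return state

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two replicas update concurrently, then merge in either order:
a = increment({}, "A")                   # replica A saw one edit
b = increment(increment({}, "B"), "B")   # replica B saw two edits
assert merge(a, b) == merge(b, a)        # order-independent
assert value(merge(a, b)) == 3           # every edit counted exactly once
```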
Database layer:
| Data | Store | Why |
|---|---|---|
| Document content (current state) | PostgreSQL or MongoDB | Durable, queryable, supports versioning |
| Real-time operations buffer | Redis | Fast pub/sub for broadcasting edits, in-memory operation log |
| Edit history / audit trail | Cassandra or S3 | Append-only, massive write throughput |
| User presence ("who is online") | Redis | TTL-based presence tracking |
Consistency model:
- Within a document session: Strong consistency via CRDT convergence (all users converge to the same state)
- Document persistence: Periodic snapshots written to PostgreSQL (eventual, a few seconds behind real-time state in Redis)
- Between documents: No cross-document consistency needed
Scaling:
- Partition by document ID: each document's collaboration session lives on one server
- Use consistent hashing to route document sessions to servers
- If a single document has too many editors (>100), shard the document into sections
Architecture summary:
Client (CRDT) <-- WebSocket --> Collaboration Server (CRDT merge)
|
Redis (pub/sub, operation log)
|
PostgreSQL (periodic snapshots)
|
S3/Cassandra (full edit history)
After reviewing these model answers, attempt the Exercise Questions without looking back, then use the Quick Revision sheet for final review.