Episode 9 — System Design / 9.8 — Communication and Data Layer
9.8 — Interview Questions
How to use this material (instructions)
- Simulate real conditions — Set a timer for 5-8 minutes per question. Answer out loud or on a whiteboard.
- Answer first, then compare — Write your answer before reading the model answer.
- Focus on structure — Interviewers value organized thinking over perfect recall.
- Draw diagrams — Use ASCII diagrams to support your explanations.
- State tradeoffs — The best answers acknowledge what you are giving up.
Question 1: How Does a REST API Differ from GraphQL? (Beginner)
What the interviewer is testing: Understanding of API paradigms and their tradeoffs.
Model Answer
REST and GraphQL are two approaches to building APIs with fundamentally different philosophies.
REST is resource-oriented. Each resource has a URL, and you use HTTP methods (GET, POST, PUT, DELETE) to operate on it. The server determines the response shape.
GraphQL is query-oriented. There is a single endpoint, and the client specifies exactly what data it needs in the query.
Key differences:
| Aspect | REST | GraphQL |
|---|---|---|
| Endpoints | Multiple (/users, /posts) | Single (/graphql) |
| Data shape | Server decides | Client decides |
| Over-fetching | Common | Eliminated |
| Under-fetching | Requires multiple requests | Single query |
| Caching | HTTP caching (simple) | Complex (all POST requests) |
| Error handling | HTTP status codes | Always 200, errors in response body |
| Learning curve | Lower | Higher |
When to choose REST: Public APIs, simple CRUD, strong caching needs, small teams.
When to choose GraphQL: Complex data relationships, mobile apps needing minimal payloads, multiple frontend clients with different data needs.
Neither is universally better. The choice depends on the use case.
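The over-fetching difference can be sketched in a few lines of Python. This is a toy illustration with made-up data, not a real API: a REST-style handler returns the whole resource, while a GraphQL-style resolver returns only the fields the client asked for.

```python
# Illustrative user record (hypothetical data, not from any real API).
USER = {"id": 1, "name": "Ada", "email": "ada@example.com",
        "bio": "Pioneer", "avatar_url": "/a.png", "created_at": "2024-01-01"}

def rest_get_user(user_id):
    """REST style: the server decides the shape, so the client gets everything."""
    return dict(USER)

def graphql_get_user(user_id, fields):
    """GraphQL style: the client names the fields it needs, nothing more."""
    return {f: USER[f] for f in fields}

# A mobile client that only renders a name avoids shipping the full record:
assert graphql_get_user(1, ["id", "name"]) == {"id": 1, "name": "Ada"}
assert set(rest_get_user(1)) == set(USER)  # REST returns all six fields
```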
Question 2: Explain the CAP Theorem with Examples (Beginner)
What the interviewer is testing: Understanding of fundamental distributed systems tradeoffs.
Model Answer
The CAP theorem states that a distributed data store can only guarantee two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write
- Availability (A): Every request receives a response
- Partition Tolerance (P): The system works despite network partitions
Since network partitions are unavoidable in distributed systems, the real choice is between CP and AP.
Network Partition Occurs:
=========================
[Node A] ----X---- [Node B] (network link broken)
CP Choice:
- Node A stops accepting writes to prevent inconsistency.
- Returns an error or waits.
- Example: MongoDB (default)
- Use case: bank account balance

AP Choice:
- Both nodes continue working independently. They may diverge.
- Returns potentially stale data.
- Example: Cassandra, DynamoDB
- Use case: shopping cart
Practical examples:
- CP (MongoDB): During a partition, a primary that cannot reach a majority of the replica set steps down and stops accepting writes until the partition heals. Consistency is preserved at the cost of availability.
- AP (Cassandra): During a partition, all nodes continue serving reads and writes. When the partition heals, conflicting writes are resolved (e.g., last-write-wins). Availability preserved at the cost of temporary inconsistency.
Key interview point: Most modern databases offer tunable consistency. DynamoDB lets you choose strong or eventual consistency per read. Cassandra lets you set consistency level per query (ONE, QUORUM, ALL).
Question 3: When Would You Choose SQL vs NoSQL? (Intermediate)
What the interviewer is testing: Ability to make justified database decisions based on requirements.
Model Answer
The choice depends on the data characteristics, access patterns, and consistency requirements.
Choose SQL (PostgreSQL, MySQL) when:
- Data has clear relationships (users -> orders -> items)
- You need ACID transactions (financial data, inventory)
- Query patterns are complex or unpredictable (ad-hoc reporting)
- Schema is well-defined and relatively stable
- Strong consistency is non-negotiable
Choose NoSQL when:
- Schema is flexible or varies per record (product catalogs with different attributes)
- Access patterns are known and simple (get user by ID)
- You need horizontal scaling for massive data volumes
- Eventual consistency is acceptable
- High write throughput is critical (IoT, logging)
Specific database recommendations:
| Scenario | Database | Why |
|---|---|---|
| E-commerce orders | PostgreSQL | ACID for payments, complex joins |
| User sessions | Redis | Sub-ms latency, TTL expiration |
| Product catalog | MongoDB | Flexible schema, varying attributes |
| Social graph | Neo4j | Relationship traversal is core query |
| IoT time-series | Cassandra | Massive write throughput, time-ordered |
| Real-time cache | Redis | In-memory, data structures |
The mature answer: Most production systems use polyglot persistence. For example, an e-commerce platform might use PostgreSQL for orders, Redis for caching and sessions, Elasticsearch for search, and MongoDB for the product catalog.
Question 4: How Would You Design Database Sharding for a Chat Application? (Intermediate)
What the interviewer is testing: Practical sharding design, shard key selection, handling of cross-shard queries.
Model Answer
Understanding the domain: A chat application's primary entity is a message. The core query patterns are:
- Get all messages in a conversation (most frequent)
- Get recent conversations for a user
- Search messages
Shard key selection:
| Candidate Key | Pros | Cons | Verdict |
|---|---|---|---|
| message_id | Even distribution | Fetching a conversation requires ALL shards | Bad |
| sender_id | User's sent messages on one shard | Conversation split across shards | Bad |
| conversation_id | All messages in a conversation on one shard | Heavy conversations become hotspots | Best choice |
I would choose conversation_id as the shard key with hash-based sharding.
Shard routing: shard = hash(conversation_id) % num_shards
Shard 0 Shard 1 Shard 2
+-------------+ +-------------+ +-------------+
| conv_A msgs | | conv_C msgs | | conv_B msgs |
| conv_D msgs | | conv_F msgs | | conv_E msgs |
+-------------+ +-------------+ +-------------+
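The routing rule above can be sketched in a few lines of Python. One practical detail worth noting: Python's built-in hash() is randomized per process, so a stable digest (MD5 here, purely as an illustration) is needed for routing to agree across application servers. The shard count is an assumption for the example.

```python
import hashlib

NUM_SHARDS = 3  # assumed cluster size for illustration

def shard_for(conversation_id: str) -> int:
    """Deterministically map a conversation to a shard.

    A stable digest is used instead of the built-in hash(), which is
    randomized per process and would route differently on each node.
    """
    digest = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every message in the same conversation lands on the same shard,
# so "get all messages in a conversation" is a single-shard query.
assert shard_for("conv_A") == shard_for("conv_A")
assert 0 <= shard_for("conv_B") < NUM_SHARDS
```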
Handling the secondary access pattern ("recent conversations for user X"):
- Maintain a separate user_conversations table (or cache in Redis) that maps user_id -> [conversation_ids]
- This table can be sharded by user_id
- First query this table, then fetch messages from the appropriate shard
Handling hotspots (group chats with 10,000 members):
- Monitor shard sizes
- For extremely active conversations, sub-shard by time range
- Use read replicas on hot shards
Handling search:
- Sharded database search is expensive (scatter-gather across all shards)
- Use a dedicated search service (Elasticsearch) that indexes messages asynchronously
Question 5: Explain the Saga Pattern for Distributed Transactions (Intermediate)
What the interviewer is testing: Understanding of distributed transaction patterns and compensation logic.
Model Answer
A saga is a sequence of local transactions where each step has a compensating transaction to undo its effect if a later step fails. It is the standard alternative to two-phase commit (2PC) in microservice architectures.
Example: E-commerce order placement
Happy Path:
===========
1. Order Service: Create order (status: pending)
2. Payment Service: Charge credit card
3. Inventory Service: Reserve items
4. Shipping Service: Schedule delivery
5. Order Service: Update order (status: confirmed)
Failure at Step 3 (items out of stock):
========================================
1. Order Service: Create order --> Done
2. Payment Service: Charge credit card --> Done
3. Inventory Service: Reserve items --> FAILED
Compensate (reverse order):
2c. Payment Service: Refund credit card
1c. Order Service: Cancel order (status: cancelled)
Two implementation approaches:
Choreography (event-driven):
Order -> emits "OrderCreated"
-> Payment listens, charges, emits "PaymentCharged"
-> Inventory listens, reserves, emits "ItemsReserved"
-> Shipping listens, schedules, emits "ShipmentScheduled"
Good for simple flows (3-4 steps). No central coordinator.
Orchestration (central coordinator):
OrderSaga orchestrator:
call OrderService.create()
call PaymentService.charge()
call InventoryService.reserve()
if any fails -> call compensation in reverse
Good for complex flows (5+ steps, branching logic, timeouts).
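The orchestrator pseudocode can be made concrete with a minimal Python sketch. The service calls are stand-in lambdas, not real service clients; the point is the control flow: each step pairs an action with a compensation, and on failure the completed steps are compensated in reverse order.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables.

    Runs actions in order; on any failure, runs the compensations of
    the already-completed steps in reverse order (2c before 1c).
    """
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        return "cancelled"
    return "confirmed"

def reserve_items():
    raise RuntimeError("out of stock")  # step 3 fails, as in the example

log = []
steps = [
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("card charged"),  lambda: log.append("card refunded")),
    (reserve_items,                       lambda: log.append("never runs")),
]
status = run_saga(steps)
# status == "cancelled"; the refund runs before the order cancellation.
```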
Why not 2PC? 2PC is blocking: all participants hold locks while waiting for the coordinator. If the coordinator fails, participants are stuck. Sagas are non-blocking: each step commits immediately, and compensations handle failures asynchronously.
Tradeoff: Sagas provide eventual consistency, not strong consistency. Between steps, the system is in a partially committed state. Design your UI to handle this (e.g., show "processing" status).
Question 6: How Do Read Replicas Work and What Problems Do They Introduce? (Intermediate)
What the interviewer is testing: Understanding of replication mechanics, lag, and consistency implications.
Model Answer
Read replicas are copies of a primary database that serve read traffic, distributing load across multiple nodes.
Architecture:
=============
Writes -> [Primary DB] ---replication---> [Replica 1]
---replication---> [Replica 2]
---replication---> [Replica 3]
^
Reads (distributed) --------------------------/
How replication works:
- Primary writes data and records the change in its Write-Ahead Log (WAL)
- WAL entries are streamed to replicas (async by default)
- Replicas apply the WAL entries to their local copy
- Read queries are load-balanced across replicas
The main problem: Replication Lag
With asynchronous replication, there is a delay (typically 10-100ms) between a write to the primary and that write appearing on replicas.
Concrete bug scenario:
- User submits a form (POST to primary)
- Page redirects and loads data (GET from replica)
- Replica has not received the write yet
- User sees stale data ("Where is my submission?!")
Solutions:
| Approach | How | Cost |
|---|---|---|
| Read-your-writes | Route user's reads to primary for N seconds after a write | More load on primary |
| Sticky sessions | Pin user to a specific replica | Uneven load |
| Synchronous replication | Primary waits for at least one replica to ACK | Slower writes |
| Version tokens | Client sends a token from the write; replica checks it has that version | Implementation complexity |
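The version-token approach from the table can be sketched with toy in-memory classes. The class and method names are hypothetical, and real systems would use log sequence numbers rather than a simple counter, but the mechanism is the same: the write returns a token, and a read falls back to the primary if the replica has not yet applied that version.

```python
class Primary:
    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.version += 1
        self.data[key] = value
        return self.version  # token the client carries to its next read

class Replica:
    def __init__(self):
        self.applied_version = 0
        self.data = {}

    def catch_up(self, primary):  # replication step (normally async)
        self.data = dict(primary.data)
        self.applied_version = primary.version

def read(key, token, replica, primary):
    if replica.applied_version >= token:
        return replica.data.get(key)  # replica has the client's write
    return primary.data.get(key)      # lagging: fall back to primary

primary, replica = Primary(), Replica()
token = primary.write("name", "Ada")
# The replica is lagging, yet the user still reads their own write:
assert read("name", token, replica, primary) == "Ada"
```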
Other problems:
- Failover complexity: If primary fails, promote a replica, but it may be behind
- Write bottleneck: All writes still go to one node
- Schema changes: Must be rolled out to primary and all replicas carefully
When to add read replicas: When your database CPU is above 70% and the workload is read-heavy (80%+ reads). If the workload is write-heavy, replicas will not help much.
Question 7: Design the API and Data Layer for a URL Shortener (Intermediate)
What the interviewer is testing: End-to-end thinking from API to database choice to scaling.
Model Answer
API Design:
POST /api/v1/urls
Body: { "long_url": "https://example.com/very/long/path" }
Response: 201 Created
{ "data": { "short_code": "abc123", "short_url": "https://sho.rt/abc123", "long_url": "..." } }
GET /api/v1/urls/{short_code}
Response: 301 Moved Permanently
Location: https://example.com/very/long/path
GET /api/v1/urls/{short_code}/stats
Response: 200 OK
{ "data": { "clicks": 15234, "created_at": "...", "top_countries": [...] } }
Database choice:
The core operation is a key-value lookup: short_code -> long_url. This is a perfect fit for a key-value store.
| Layer | Database | Why |
|---|---|---|
| Primary store | DynamoDB or PostgreSQL | Durable storage of URL mappings |
| Cache | Redis | Sub-ms reads for hot URLs (80/20 rule: 20% of URLs get 80% of traffic) |
| Analytics | Cassandra or ClickHouse | High-volume write of click events |
Data model (DynamoDB):
Table: urls
Partition Key: short_code (string)
Attributes: long_url, user_id, created_at, click_count
Table: analytics
Partition Key: short_code
Sort Key: timestamp
Attributes: ip, country, referrer, device
Scaling considerations:
- 100 million URLs: ~10 GB storage (small, single node works)
- 10,000 redirects/second: Redis cache handles this easily
- Short code generation: Base62 encoding of an auto-increment ID or hash. Use a distributed ID generator (Snowflake) to avoid collisions across nodes.
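Base62 encoding of a numeric ID is short enough to show in full. A sketch, assuming the IDs come from an auto-increment column or a Snowflake-style generator:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(n: int) -> str:
    """Encode a non-negative integer ID as a compact short code."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

# 62^7 is about 3.5 trillion, so 7 characters cover far more than
# the 100 million URLs estimated above.
assert to_base62(61) == "Z"
assert to_base62(62) == "10"
```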
Consistency: Eventual consistency is fine. If a new short URL takes 100ms to propagate, that is acceptable. The redirect must be fast (cache-first).
Question 8: What Is the Difference Between TCP and UDP, and When Would You Use Each? (Beginner)
What the interviewer is testing: Fundamental networking knowledge applied to real scenarios.
Model Answer
TCP and UDP are transport-layer protocols with fundamentally different guarantees.
TCP (Transmission Control Protocol):
- Connection-oriented (three-way handshake: SYN, SYN-ACK, ACK)
- Guarantees reliable, ordered delivery
- Has flow control and congestion control
- Higher overhead (20-60 byte header)
UDP (User Datagram Protocol):
- Connectionless (fire and forget)
- No delivery guarantee, no ordering
- No flow control
- Minimal overhead (8 byte header)
| Use Case | Protocol | Reasoning |
|---|---|---|
| Web pages (HTTP) | TCP | Every byte of HTML/CSS/JS must arrive correctly |
| REST APIs | TCP | Request-response must be reliable |
| Video streaming (live) | UDP | A dropped frame is better than a delayed stream |
| Online gaming | UDP | Stale position data is worse than missing one update |
| DNS queries | UDP | Small, single request-response; fast |
| File transfer (FTP) | TCP | Every byte of the file must arrive |
| Voice calls (VoIP) | UDP | Low latency matters more than perfect audio |
| Database connections | TCP | Queries and results must be reliable |
Key insight for interviews: The choice comes down to this: Is it worse to have incorrect/incomplete data (choose TCP) or delayed data (choose UDP)? For real-time applications where old data is useless, UDP wins. For everything else, TCP is safer.
Modern protocols blur this line: HTTP/3 uses QUIC (built on UDP) but adds reliability on top, getting the best of both worlds.
Question 9: How Would You Handle Data Consistency in a Microservice Architecture? (Advanced)
What the interviewer is testing: Deep understanding of distributed consistency, practical pattern selection.
Model Answer
In a microservice architecture, each service owns its own database. This means a business operation spanning multiple services cannot use a traditional database transaction. We need distributed consistency patterns.
The spectrum of approaches:
Simplest Most Complex
<==========================================================>
Idempotent Outbox Choreography Orchestration 2PC
retries Pattern Saga Saga (avoid)
Level 1: Idempotent Operations + Retries
For simple cases, make operations idempotent and retry on failure.
POST /api/payments
Idempotency-Key: req_abc123
If this request is sent twice, the server checks the key
and returns the same result without charging twice.
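A minimal server-side sketch of that check, using an in-memory dict as the idempotency store (a production system would use Redis or a database column with a unique constraint):

```python
processed = {}  # idempotency_key -> stored response

def charge(idempotency_key: str, amount: int) -> dict:
    """Charge at most once per idempotency key.

    A replayed request returns the stored response instead of
    charging the card a second time.
    """
    if idempotency_key in processed:
        return processed[idempotency_key]
    response = {"charged": amount, "status": "ok"}  # pretend charge
    processed[idempotency_key] = response
    return response

first = charge("req_abc123", 500)
second = charge("req_abc123", 500)  # client retry of the same request
assert first is second              # same stored result, one charge
```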
Level 2: Transactional Outbox Pattern
Write the event and the data change in the same database transaction.
BEGIN TRANSACTION;
INSERT INTO orders (id, ...) VALUES (...);
INSERT INTO outbox (event_type, payload) VALUES ('OrderCreated', '{...}');
COMMIT;
-- A separate process reads the outbox and publishes events
-- Guarantees: if the order exists, the event will be published
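The same idea runs end to end with SQLite standing in for the service's database. Table and column names mirror the SQL above but are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         event_type TEXT, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

# One transaction: the order row and its event commit (or roll back)
# together, so there is no window where one exists without the other.
with conn:
    conn.execute("INSERT INTO orders VALUES (?, ?)", ("ord_1", 4200))
    conn.execute("INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
                 ("OrderCreated", '{"order_id": "ord_1"}'))

# A separate relay process would poll unpublished rows and publish them:
rows = conn.execute(
    "SELECT event_type FROM outbox WHERE published = 0").fetchall()
assert rows == [("OrderCreated",)]
```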
Level 3: Saga (Choreography)
For 2-4 step workflows where services are loosely coupled. Each service listens for events, performs its action, and emits the next event.
Level 4: Saga (Orchestration)
For 5+ step workflows with complex branching, timeouts, and retry logic. A central orchestrator manages the flow and compensations.
My recommendation for most systems:
- Use the outbox pattern for reliable event publishing
- Use choreography sagas for simple flows
- Use orchestration sagas for complex flows
- Never use 2PC across service boundaries
- Accept eventual consistency where possible; use strong consistency only where required (payments, inventory)
Practical tradeoff: Between steps in a saga, the system is in a partially committed state. Design your UI to show intermediate states ("Payment processing...") and handle retries gracefully.
Question 10: Compare Sharding Strategies and Explain Consistent Hashing (Advanced)
What the interviewer is testing: Deep understanding of horizontal scaling and data distribution.
Model Answer
Sharding strategies:
| Strategy | How Data Is Distributed | Strengths | Weaknesses |
|---|---|---|---|
| Range | By value ranges (e.g., A-H, I-P, Q-Z) | Efficient range queries within a shard | Hotspots if distribution is uneven |
| Hash | hash(key) % N | Even distribution | Range queries hit all shards; resharding moves all data |
| Geographic | By region (US, EU, APAC) | Data locality, regulatory compliance | Cross-region queries expensive |
| Directory | Lookup table maps keys to shards | Flexible, can rebalance | Directory is a bottleneck/SPOF |
The resharding problem with hash sharding:
Before: 3 shards, hash(key) % 3
key=7: 7 % 3 = 1 -> Shard 1
key=8: 8 % 3 = 2 -> Shard 2
After: 4 shards, hash(key) % 4
key=7: 7 % 4 = 3 -> Shard 3 (MOVED!)
key=8: 8 % 4 = 0 -> Shard 0 (MOVED!)
Almost ALL keys move! This is catastrophic for large datasets.
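The scale of the problem is easy to verify with a quick count over synthetic integer keys:

```python
# How many keys change shards when a modulo-sharded cluster grows 3 -> 4?
keys = range(10_000)
moved = sum(1 for k in keys if k % 3 != k % 4)
print(f"{moved / len(keys):.0%} of keys moved")  # prints "75% of keys moved"
```

Roughly three quarters of all keys relocate for a single added shard, which is why plain modulo sharding is avoided for clusters that need to grow.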
Consistent hashing solves this:
Hash Ring (0 to 2^32), flattened to a line:
--------+--------+--------+--------+--->
0     1000     3000     5000     7000   (wraps back to 0)
        |        |        |
     Node A   Node B   Node C
Key "alice" hashes to 2500 -> walks clockwise -> Node B
Key "bob" hashes to 6000 -> walks clockwise -> Node A (wraps around)
Add Node D at position 4000:
- Only keys between 3000-4000 move from Node C to Node D
- Everything else stays! (~1/N data moves instead of ~all)
Virtual nodes improve balance: Instead of each physical node having one position on the ring, assign 100-200 virtual positions. This prevents uneven distribution when nodes are few.
Physical Node A -> Virtual: A1(500), A2(2000), A3(5500), A4(8000)
Physical Node B -> Virtual: B1(1000), B2(3500), B3(6000), B4(9000)
Keys are more evenly distributed across the ring.
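A minimal consistent-hash ring with virtual nodes fits in a short Python class. The digest choice (MD5) and the virtual-node count are illustrative, not prescriptive:

```python
import bisect
import hashlib

class Ring:
    """Consistent-hash ring: each node gets many virtual positions."""

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        pos = self._hash(key)
        # Walk "clockwise" to the next position, wrapping past the end.
        idx = bisect.bisect(self.positions, pos) % len(self.ring)
        return self.ring[idx][1]

ring = Ring(["A", "B", "C"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}
bigger = Ring(["A", "B", "C", "D"])  # add one node
moved = sum(1 for k in before if bigger.node_for(k) != before[k])
# Only roughly 1/4 of keys move, versus ~75% with modulo sharding.
assert moved < 500
```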
Used by: Amazon DynamoDB, Apache Cassandra, Memcached, content delivery networks.
Question 11: Design the Communication and Data Layer for a Real-Time Collaborative Editor (Advanced)
What the interviewer is testing: Ability to integrate multiple concepts: protocols, databases, consistency, and conflict resolution.
Model Answer
A collaborative editor like Google Docs requires real-time synchronization of edits from multiple users on the same document.
Communication layer:
+--------+ WebSocket +------------------+
| User A | <================> | Collaboration |
+--------+ | Server |
| |
+--------+ WebSocket | - Receives edits |
| User B | <================> | - Broadcasts to |
+--------+ | all editors |
| - Resolves |
+--------+ WebSocket | conflicts |
| User C | <================> | |
+--------+ +--------+---------+
|
+-----v------+
| Document |
| Database |
+------------+
Why WebSockets: Bidirectional, low-latency communication. Every keystroke from any user must be broadcast to all other users in real time (<100 ms).
Conflict resolution: OT or CRDTs
When two users edit the same area simultaneously:
| Approach | How | Used By |
|---|---|---|
| OT (Operational Transformation) | Transform operations against concurrent operations. Central server is the source of truth. | Google Docs |
| CRDTs | Data structures that merge automatically without conflicts. No central coordinator needed. | Figma, Apple Notes |
For this design, I would use CRDTs because they work well in distributed settings and do not require a single coordinator (better for scaling).
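The convergence property that makes CRDTs attractive can be shown with the simplest CRDT, a grow-only counter (G-Counter). This is an illustrative stand-in: the sequence CRDTs used for text editing are far more involved, but they converge by the same merge-is-commutative principle.

```python
# G-Counter: each replica increments only its own slot; merging takes
# the per-slot maximum, so merges commute and converge automatically.

def increment(state, replica):
    state = dict(state)
    state[replica] = state.get(replica, 0) + 1
    return state

def merge(a, b):
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(state):
    return sum(state.values())

# Two replicas update concurrently, then merge in either order:
a = increment({}, "A")                   # replica A saw one edit
b = increment(increment({}, "B"), "B")   # replica B saw two edits
assert merge(a, b) == merge(b, a)        # order-independent
assert value(merge(a, b)) == 3           # every edit counted exactly once
```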
Database layer:
| Data | Store | Why |
|---|---|---|
| Document content (current state) | PostgreSQL or MongoDB | Durable, queryable, supports versioning |
| Real-time operations buffer | Redis | Fast pub/sub for broadcasting edits, in-memory operation log |
| Edit history / audit trail | Cassandra or S3 | Append-only, massive write throughput |
| User presence ("who is online") | Redis | TTL-based presence tracking |
Consistency model:
- Within a document session: Strong consistency via CRDT convergence (all users converge to the same state)
- Document persistence: Periodic snapshots written to PostgreSQL (eventual, a few seconds behind real-time state in Redis)
- Between documents: No cross-document consistency needed
Scaling:
- Partition by document ID: each document's collaboration session lives on one server
- Use consistent hashing to route document sessions to servers
- If a single document has too many editors (>100), shard the document into sections
Architecture summary:
Client (CRDT) <-- WebSocket --> Collaboration Server (CRDT merge)
|
Redis (pub/sub, operation log)
|
PostgreSQL (periodic snapshots)
|
S3/Cassandra (full edit history)
After reviewing these model answers, attempt the Exercise Questions without looking back, then use the Quick Revision sheet for final review.