Episode 9 — System Design / 9.8 — Communication and Data Layer
9.8.c — SQL vs NoSQL
Big Picture: Choosing the right database is one of the most consequential decisions in system design. Get it right, and your system scales gracefully. Get it wrong, and you are either re-architecting under pressure or fighting your database at every step. This chapter gives you the knowledge to make that decision confidently.
Table of Contents
- Relational Databases (SQL)
- ACID Properties
- Normalization
- NoSQL Database Types
- CAP Theorem
- SQL vs NoSQL Decision Table
- Database Deep Dives
- Key Takeaways
- Explain-It Challenge
Relational Databases (SQL)
Relational databases store data in tables (rows and columns) with relationships between tables enforced by foreign keys.
USERS TABLE ORDERS TABLE
============== ===============================
| id | name | email | | id | user_id | total | status |
|----|---------|----------------| |----|---------|-------|-----------|
| 1 | Alice | alice@mail.com | | 10 | 1 | 99.99 | delivered |
| 2 | Bob | bob@mail.com | | 11 | 1 | 45.50 | shipped |
| 3 | Charlie | charlie@m.com | | 12 | 2 | 120.0 | pending |
+----+---------+----------------+ +----+---------+-------+-----------+
^ |
| FOREIGN KEY |
+-------------------------------------+
users.id <-- orders.user_id
SQL Joins
-- INNER JOIN: Only matching rows from both tables
SELECT u.name, o.total, o.status
FROM users u
INNER JOIN orders o ON u.id = o.user_id;
-- Result: Alice (2 orders), Bob (1 order). Charlie excluded (no orders).
-- LEFT JOIN: All rows from left table + matching from right
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;
-- Result: Alice, Alice, Bob, Charlie (NULL). All users included.
Join Types Visualized
INNER JOIN LEFT JOIN RIGHT JOIN FULL OUTER JOIN
========== ========= ========== ===============
+---+ +---+ +---+ +---+
/ \ / \ / \ / \
| A B | |[A] B | | A [B]| |[A] [B]|
| [AB] | | [AB] | | [AB] | | [AB] |
\ / \ / \ / \ /
+---+ +---+ +---+ +---+
Only where All A + All B + All A +
both match matching B matching A All B
Schema Definition
CREATE TABLE users (
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
user_id INTEGER NOT NULL REFERENCES users(id),
total DECIMAL(10, 2) NOT NULL,
status VARCHAR(20) DEFAULT 'pending',
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);
ACID Properties
ACID guarantees data integrity in relational databases.
| Property | Meaning | Example |
|---|---|---|
| Atomicity | All operations in a transaction succeed or all fail | Bank transfer: debit AND credit both happen, or neither |
| Consistency | Database moves from one valid state to another | Foreign key constraints, unique constraints always enforced |
| Isolation | Concurrent transactions do not interfere | Two users buying the last item: only one succeeds |
| Durability | Committed data survives crashes | After "commit," data is on disk even if power goes out |
Transaction Example
-- Transfer $100 from Alice to Bob
BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE user_id = 1; -- Alice
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2; -- Bob
-- If anything fails, ROLLBACK undoes everything
COMMIT;
Isolation Levels
READ UNCOMMITTED READ COMMITTED REPEATABLE READ SERIALIZABLE
================ ============== =============== ============
Dirty reads: YES Dirty reads: NO Dirty reads: NO Dirty reads: NO
Non-repeatable Non-repeatable Non-repeatable Non-repeatable
reads: YES reads: YES reads: NO reads: NO
Phantom Phantom Phantom Phantom
reads: YES reads: YES reads: YES reads: NO
<--- Faster, less safe Slower, more safe --->
<--- More concurrency Less concurrency --->
| Isolation Level | Performance | Safety | Common Use |
|---|---|---|---|
| Read Uncommitted | Fastest | Lowest | Almost never used |
| Read Committed | Fast | Good | PostgreSQL default |
| Repeatable Read | Medium | High | MySQL/InnoDB default |
| Serializable | Slowest | Highest | Financial systems |
Normalization
Normalization eliminates data redundancy by organizing tables properly.
Unnormalized (Bad)
ORDERS (Denormalized)
======================================================================
| order_id | customer_name | customer_email | product | price | qty |
|----------|---------------|-------------------|---------|-------|-----|
| 1 | Alice | alice@mail.com | Laptop | 999 | 1 |
| 2 | Alice | alice@mail.com | Mouse | 29 | 2 |
| 3 | Bob | bob@mail.com | Laptop | 999 | 1 |
Problems:
- Alice's email stored 2x (redundancy)
- Update Alice's email? Must update every row (update anomaly)
- Delete order 3? Lose Bob's info entirely (delete anomaly)
Normalized (Good)
CUSTOMERS PRODUCTS ORDER_ITEMS
============== =========== =========================
| id | name | email | | id | name | price | | order_id | product_id | qty |
|----|-------|-------------| |----|--------|-------| |----------|------------|-----|
| 1 | Alice | alice@m.com | | 1 | Laptop | 999 | | 1 | 1 | 1 |
| 2 | Bob | bob@m.com | | 2 | Mouse | 29 | | 2 | 2 | 2 |
+----+-------+-------------+ +----+--------+-------+ | 3 | 1 | 1 |
+----------+------------+-----+
ORDERS
=====================
| id | customer_id |
|----|------------- |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
+----+--------------+
Normal Forms (Quick Reference)
| Normal Form | Rule | Violation Example |
|---|---|---|
| 1NF | No repeating groups, atomic values | tags: "red,blue,green" in one cell |
| 2NF | 1NF + no partial dependencies | Non-key column depends on part of composite key |
| 3NF | 2NF + no transitive dependencies | zip_code -> city (city depends on zip, not PK) |
Interview tip: In practice, most production databases aim for 3NF. Higher normal forms (BCNF, 4NF, 5NF) exist but are rarely discussed in interviews.
NoSQL Database Types
NoSQL FAMILY
============
+------------------+ +------------------+ +------------------+ +------------------+
| DOCUMENT | | KEY-VALUE | | COLUMN-FAMILY | | GRAPH |
| ================ | | ================ | | ================ | | ================ |
| | | | | | | |
| { "name": "A", | | key1 -> value1 | | Row1: | | (A)--[:KNOWS]->(B)|
| "age": 30, | | key2 -> value2 | | cf1: {a,b,c} | | | |
| "orders": [ | | key3 -> value3 | | cf2: {x,y} | | [:LIKES] |
| {...} | | | | Row2: | | v |
| ] | | | | cf1: {a,c} | | (C) |
| } | | | | cf3: {m,n} | | |
+------------------+ +------------------+ +------------------+ +------------------+
MongoDB Redis Cassandra Neo4j
CouchDB Memcached HBase Amazon Neptune
DynamoDB (doc mode) DynamoDB (KV mode) ScyllaDB ArangoDB
1. Document Databases
Store data as flexible JSON/BSON documents. No fixed schema.
// MongoDB document
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "Alice",
"email": "alice@mail.com",
"address": {
"street": "123 Main St",
"city": "Springfield",
"state": "IL"
},
"orders": [
{ "product": "Laptop", "price": 999, "date": "2025-01-15" },
{ "product": "Mouse", "price": 29, "date": "2025-01-20" }
]
}
Strengths: Flexible schema, natural mapping to application objects, horizontal scaling. Weaknesses: No joins (data duplication), weaker consistency guarantees, harder multi-document transactions.
2. Key-Value Stores
Simplest model: a key maps to a value. Think of it as a giant hash map.
SET user:42 '{"name":"Alice","email":"alice@mail.com"}'
SET session:abc123 '{"user_id":42,"expires":"2025-02-01"}'
SET cart:42 '{"items":[{"id":1,"qty":2}]}'
GET user:42 --> '{"name":"Alice","email":"alice@mail.com"}'
DEL session:abc123
Strengths: Extremely fast (O(1) lookups), simple, horizontally scalable. Weaknesses: No complex queries, no relationships, limited value inspection.
3. Column-Family Stores
Organize data into rows and column families. Optimized for writes and wide rows.
ROW KEY: user:alice
+---------------------------+---------------------------+
| Column Family: profile | Column Family: activity |
|---------------------------|---------------------------|
| name: "Alice" | last_login: "2025-01-15" |
| email: "alice@mail.com" | page_views: 1523 |
| age: 30 | last_action: "purchase" |
+---------------------------+---------------------------+
ROW KEY: user:bob
+---------------------------+---------------------------+
| Column Family: profile | Column Family: activity |
|---------------------------|---------------------------|
| name: "Bob" | last_login: "2025-01-14" |
| email: "bob@mail.com" | (no page_views column) |
+---------------------------+---------------------------+
Note: Rows can have different columns (sparse)
Strengths: Massive write throughput, excellent for time-series data, scales to petabytes. Weaknesses: Limited query patterns, no joins, eventual consistency typically.
4. Graph Databases
Store data as nodes and edges. Optimized for traversing relationships.
(Alice)--[:FRIENDS_WITH]-->(Bob)
| |
[:PURCHASED] [:PURCHASED]
| |
v v
(Laptop) (Laptop)
^ ^
| |
[:SIMILAR_TO] [:SIMILAR_TO]
| |
(Tablet) (Desktop)
Query: "Find products purchased by Alice's friends"
MATCH (alice {name:'Alice'})-[:FRIENDS_WITH]->(friend)-[:PURCHASED]->(product)
RETURN product.name
Strengths: Relationship traversal in O(1) per hop, natural for connected data, powerful pattern matching. Weaknesses: Not great for aggregate queries, smaller ecosystem, scaling can be harder.
CAP Theorem
In a distributed system, you can only guarantee two out of three properties:
Consistency
/\
/ \
/ \
/ CA \
/ (single \
/ node \
/ systems) \
/_______________\
/\ /\
/ \ CP / \
/ AP \ / \
/ \ / \
/________\______/________\
Availability Partition
Tolerance
| Property | Meaning |
|---|---|
| Consistency (C) | Every read receives the most recent write (all nodes see the same data at the same time) |
| Availability (A) | Every request receives a response (even if not the latest data) |
| Partition Tolerance (P) | System continues operating despite network partitions between nodes |
The Real-World Interpretation
In distributed systems, network partitions WILL happen. So the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance).
| Category | Behavior During Partition | Examples | Use Cases |
|---|---|---|---|
| CP | Refuses to serve stale data (may return error) | MongoDB (default), HBase, Redis Cluster, Zookeeper | Banking, inventory, leader election |
| AP | Serves potentially stale data (stays available) | Cassandra, DynamoDB, CouchDB, Riak | Social feeds, shopping carts, analytics |
| CA | Not possible in distributed systems | Single-node PostgreSQL, single-node MySQL | Only when running on one machine |
CAP in Practice: Tunable Consistency
Many modern databases let you tune consistency per query:
DynamoDB:
=========
Read: Eventually Consistent (default) -> AP behavior
Read: Strongly Consistent (option) -> CP behavior
Cassandra:
==========
Consistency Level: ONE -> fast, AP behavior
Consistency Level: QUORUM -> balanced
Consistency Level: ALL -> slow, CP behavior
SQL vs NoSQL Decision Table
| Factor | Choose SQL | Choose NoSQL |
|---|---|---|
| Data structure | Well-defined, relational | Flexible, evolving, nested |
| Relationships | Complex joins needed | Few or no relationships |
| Consistency | ACID required (money, inventory) | Eventual consistency acceptable |
| Query patterns | Ad-hoc, complex queries | Known, simple access patterns |
| Scale | Vertical (bigger machine) | Horizontal (more machines) |
| Schema | Stable, rarely changes | Evolves frequently |
| Transactions | Multi-table transactions | Single-document transactions |
| Team expertise | SQL knowledge is universal | Varies by NoSQL type |
| Examples | E-commerce orders, banking, ERP | Social feeds, IoT data, caching |
Decision Flowchart
START
|
v
Do you need multi-table ACID transactions?
| |
YES NO
| |
v v
Use SQL Is your data highly connected (graph-like)?
(PostgreSQL, | |
MySQL) YES NO
| |
v v
Use Graph DB Do you need sub-millisecond latency?
(Neo4j) | |
YES NO
| |
v v
Use Key-Value Is your schema flexible/nested?
(Redis, | |
DynamoDB) YES NO
| |
v v
Use Document DB Do you have massive
(MongoDB) write throughput needs?
| |
YES NO
| |
v v
Use Column- Use SQL
Family (safe default)
(Cassandra)
Database Deep Dives
PostgreSQL (SQL)
Type: Relational (SQL)
CAP: CA (single node) / CP (with replication)
Best for: General purpose, complex queries, ACID compliance
Key Features:
- Advanced data types (JSON, arrays, hstore)
- Full-text search
- Materialized views
- Extensions (PostGIS for geospatial, pgvector for AI)
- MVCC (Multi-Version Concurrency Control)
Used by: Instagram, Spotify, Reddit, Uber (for trip data)
Sweet spot: "When in doubt, start with PostgreSQL"
MongoDB (Document)
Type: Document (NoSQL)
CAP: CP (default) / AP (configurable)
Best for: Flexible schemas, rapid prototyping, content management
Key Features:
- Schema-less BSON documents
- Rich query language
- Aggregation pipeline
- Change streams (real-time)
- Multi-document transactions (since v4.0)
- Atlas Search (built-in full-text search)
Used by: eBay, Adobe, Forbes, Toyota
Sweet spot: Catalogs, content, user profiles, event logging
DynamoDB (Key-Value / Document)
Type: Key-Value / Document (NoSQL)
CAP: AP (default) / CP (strongly consistent reads)
Best for: Serverless architectures, predictable performance at any scale
Key Features:
- Single-digit millisecond latency at any scale
- Fully managed (no servers to provision)
- Auto-scaling
- DynamoDB Streams (change data capture)
- Global tables (multi-region replication)
- On-demand or provisioned capacity
Used by: Amazon.com, Lyft, Airbnb, Samsung
Sweet spot: Shopping carts, session stores, gaming leaderboards
Access pattern: Must design around partition key + sort key
Table Design:
+--------------+----------+------------------+
| Partition Key| Sort Key | Attributes |
|--------------|----------|------------------|
| USER#alice | ORDER#1 | {total: 99.99} |
| USER#alice | ORDER#2 | {total: 45.50} |
| USER#alice | PROFILE | {email: "a@m.c"} |
+--------------+----------+------------------+
Redis (Key-Value / In-Memory)
Type: Key-Value (In-Memory NoSQL)
CAP: CP (Redis Cluster) / Single-node: CA
Best for: Caching, session storage, real-time features
Key Features:
- Sub-millisecond latency (in-memory)
- Rich data structures (strings, hashes, lists, sets, sorted sets, streams)
- Pub/Sub messaging
- Lua scripting
- TTL (automatic expiration)
- Persistence options (RDB snapshots, AOF)
Used by: Twitter, GitHub, Stack Overflow, Pinterest
Sweet spot: Cache, session store, rate limiter, leaderboard, pub/sub
Common patterns:
- Cache: SET cache:user:42 '{"name":"Alice"}' EX 3600
- Counter: INCR page:views:homepage
- Rate limit: INCR api:rate:user:42 + EXPIRE
- Leaderboard: ZADD leaderboard 1500 "alice" 1200 "bob"
Cassandra (Column-Family)
Type: Column-Family (NoSQL)
CAP: AP (tunable consistency)
Best for: Massive write-heavy workloads, time-series data
Key Features:
- Linear scalability (add nodes = add capacity)
- No single point of failure (masterless)
- Multi-datacenter replication
- Tunable consistency (ONE, QUORUM, ALL)
- CQL (Cassandra Query Language, SQL-like)
- Optimized for writes
Used by: Netflix, Apple, Instagram (messages), Discord
Sweet spot: IoT sensor data, messaging, activity feeds, time-series
Write path (why writes are fast):
Client -> Commit Log (append-only, sequential) -> Memtable (memory) -> SSTable (disk, periodic flush)
Neo4j (Graph)
Type: Graph (NoSQL)
CAP: CA (single node) / CP (cluster)
Best for: Highly connected data, relationship-heavy queries
Key Features:
- Native graph storage (index-free adjacency)
- Cypher query language
- ACID compliant
- Graph algorithms library
- Visualization tools
Used by: NASA, eBay, Walmart, Airbnb
Sweet spot: Social networks, recommendation engines, fraud detection, knowledge graphs
Query (Cypher):
MATCH (user:User {name: 'Alice'})-[:FRIENDS*1..3]->(fof)
WHERE fof <> user
RETURN DISTINCT fof.name
// Find friends-of-friends up to 3 hops away
Side-by-Side Summary
| Database | Type | Consistency | Scale Model | Latency | Best For |
|---|---|---|---|---|---|
| PostgreSQL | SQL | Strong (ACID) | Vertical + read replicas | Low | General purpose, complex queries |
| MongoDB | Document | Tunable | Horizontal (sharding) | Low | Flexible schema, content |
| DynamoDB | Key-Value/Doc | Tunable | Horizontal (managed) | Very low | Serverless, predictable perf |
| Redis | Key-Value | Strong (single) | Vertical + Cluster | Sub-ms | Caching, real-time |
| Cassandra | Column-Family | Tunable (AP) | Horizontal (masterless) | Low | Write-heavy, time-series |
| Neo4j | Graph | Strong (ACID) | Vertical + read replicas | Varies | Relationships, graph queries |
Key Takeaways
- SQL is not dead — Relational databases handle the majority of production workloads. PostgreSQL is the safe default.
- NoSQL is not "better" — It is a different set of tradeoffs. Choose based on your access patterns, not hype.
- ACID matters for money — Any system involving financial transactions needs strong consistency guarantees.
- CAP theorem forces a choice — In a distributed system, you choose between consistency and availability during network partitions.
- Polyglot persistence is normal — Most large systems use multiple databases (e.g., PostgreSQL for orders + Redis for cache + Elasticsearch for search).
- Schema design drives performance — In SQL, normalize. In NoSQL, design around your query patterns (denormalize).
- DynamoDB and Cassandra require you to know access patterns upfront — You cannot do ad-hoc queries like SQL. Design your keys carefully.
Explain-It Challenge
Scenario: You are designing the database layer for a social media platform. You need to store:
- User profiles (name, email, bio, settings)
- Posts (text, images, timestamps)
- Social graph (who follows whom)
- News feed (personalized for each user)
- Direct messages
- Notifications
For each data type, recommend a specific database and explain why. Would you use a single database or multiple? Justify your polyglot persistence strategy.
Next -> 9.8.d — Database Scaling