Episode 9 — System Design / 9.8 — Communication and Data Layer

9.8.c — SQL vs NoSQL

Big Picture: Choosing the right database is one of the most consequential decisions in system design. Get it right, and your system scales gracefully. Get it wrong, and you are either re-architecting under pressure or fighting your database at every step. This chapter gives you the knowledge to make that decision confidently.

Relational Databases (SQL)
ACID Properties
Normalization
NoSQL Database Types
CAP Theorem
SQL vs NoSQL Decision Table
Database Deep Dives
Key Takeaways
Explain-It Challenge

Relational Databases (SQL)

Relational databases store data in tables (rows and columns) with relationships between tables enforced by foreign keys.

  USERS TABLE                          ORDERS TABLE
  ==============                       ===============================
  | id | name    | email          |   | id | user_id | total | status    |
  |----|---------|----------------|   |----|---------|-------|-----------|
  | 1  | Alice   | alice@mail.com |   | 10 | 1       | 99.99 | delivered |
  | 2  | Bob     | bob@mail.com   |   | 11 | 1       | 45.50 | shipped   |
  | 3  | Charlie | charlie@m.com  |   | 12 | 2       | 120.0 | pending   |
  +----+---------+----------------+   +----+---------+-------+-----------+
         ^                                     |
         |           FOREIGN KEY               |
         +-------------------------------------+
         users.id  <--  orders.user_id

SQL Joins

-- INNER JOIN: Only matching rows from both tables
SELECT u.name, o.total, o.status
FROM users u
INNER JOIN orders o ON u.id = o.user_id;

-- Result: Alice (2 orders), Bob (1 order). Charlie excluded (no orders).

-- LEFT JOIN: All rows from left table + matching from right
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Result: Alice, Alice, Bob, Charlie (NULL). All users included.

Join Types Visualized

  INNER JOIN          LEFT JOIN           RIGHT JOIN          FULL OUTER JOIN
  ==========          =========           ==========          ===============
    +---+               +---+               +---+               +---+
   /     \             /     \             /     \             /     \
  | A   B |           |[A]  B |           | A  [B]|           |[A] [B]|
  |  [AB]  |          | [AB]  |           |  [AB] |           | [AB]  |
   \     /             \     /             \     /             \     /
    +---+               +---+               +---+               +---+

  Only where           All A +             All B +             All A +
  both match           matching B           matching A          All B

Schema Definition

CREATE TABLE users (
    id          SERIAL PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(255) UNIQUE NOT NULL,
    created_at  TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    total       DECIMAL(10, 2) NOT NULL,
    status      VARCHAR(20) DEFAULT 'pending',
    created_at  TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);

ACID Properties

ACID guarantees data integrity in relational databases.

Property	Meaning	Example
Atomicity	All operations in a transaction succeed or all fail	Bank transfer: debit AND credit both happen, or neither
Consistency	Database moves from one valid state to another	Foreign key constraints, unique constraints always enforced
Isolation	Concurrent transactions do not interfere	Two users buying the last item: only one succeeds
Durability	Committed data survives crashes	After "commit," data is on disk even if power goes out

Transaction Example

-- Transfer $100 from Alice to Bob
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;  -- Alice
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;  -- Bob

-- If anything fails, ROLLBACK undoes everything
COMMIT;

Isolation Levels

  READ UNCOMMITTED   READ COMMITTED   REPEATABLE READ   SERIALIZABLE
  ================   ==============   ===============   ============
  Dirty reads: YES   Dirty reads: NO  Dirty reads: NO   Dirty reads: NO
  Non-repeatable     Non-repeatable   Non-repeatable    Non-repeatable
   reads: YES         reads: YES       reads: NO         reads: NO
  Phantom            Phantom          Phantom           Phantom
   reads: YES         reads: YES       reads: YES        reads: NO

  <--- Faster, less safe                    Slower, more safe --->
  <--- More concurrency                     Less concurrency  --->

Isolation Level	Performance	Safety	Common Use
Read Uncommitted	Fastest	Lowest	Almost never used
Read Committed	Fast	Good	PostgreSQL default
Repeatable Read	Medium	High	MySQL/InnoDB default
Serializable	Slowest	Highest	Financial systems

Normalization

Normalization eliminates data redundancy by organizing tables properly.

Unnormalized (Bad)

  ORDERS (Denormalized)
  ======================================================================
  | order_id | customer_name | customer_email    | product | price | qty |
  |----------|---------------|-------------------|---------|-------|-----|
  | 1        | Alice         | alice@mail.com    | Laptop  | 999   | 1   |
  | 2        | Alice         | alice@mail.com    | Mouse   | 29    | 2   |
  | 3        | Bob           | bob@mail.com      | Laptop  | 999   | 1   |
  
  Problems:
  - Alice's email stored 2x (redundancy)
  - Update Alice's email? Must update every row (update anomaly)
  - Delete order 3? Lose Bob's info entirely (delete anomaly)

Normalized (Good)

  CUSTOMERS                    PRODUCTS                  ORDER_ITEMS
  ==============               ===========               =========================
  | id | name  | email       | | id | name   | price | | order_id | product_id | qty |
  |----|-------|-------------|  |----|--------|-------|  |----------|------------|-----|
  | 1  | Alice | alice@m.com |  | 1  | Laptop | 999  |  | 1        | 1          | 1   |
  | 2  | Bob   | bob@m.com   |  | 2  | Mouse  | 29   |  | 2        | 2          | 2   |
  +----+-------+-------------+  +----+--------+-------+  | 3        | 1          | 1   |
                                                          +----------+------------+-----+
  
  ORDERS
  =====================
  | id | customer_id  |
  |----|------------- |
  | 1  | 1            |
  | 2  | 1            |
  | 3  | 2            |
  +----+--------------+

Normal Forms (Quick Reference)

Normal Form	Rule	Violation Example
1NF	No repeating groups, atomic values	`tags: "red,blue,green"` in one cell
2NF	1NF + no partial dependencies	Non-key column depends on part of composite key
3NF	2NF + no transitive dependencies	`zip_code -> city` (city depends on zip, not PK)

Interview tip: In practice, most production databases aim for 3NF. Higher normal forms (BCNF, 4NF, 5NF) exist but are rarely discussed in interviews.

NoSQL Database Types

  NoSQL FAMILY
  ============

  +------------------+    +------------------+    +------------------+    +------------------+
  | DOCUMENT         |    | KEY-VALUE        |    | COLUMN-FAMILY    |    | GRAPH            |
  | ================ |    | ================ |    | ================ |    | ================ |
  |                  |    |                  |    |                  |    |                  |
  | { "name": "A",  |    | key1 -> value1   |    | Row1:            |    |  (A)--[:KNOWS]->(B)|
  |   "age": 30,    |    | key2 -> value2   |    |   cf1: {a,b,c}  |    |   |               |
  |   "orders": [   |    | key3 -> value3   |    |   cf2: {x,y}    |    |   [:LIKES]       |
  |     {...}        |    |                  |    | Row2:            |    |   v              |
  |   ]              |    |                  |    |   cf1: {a,c}     |    |  (C)             |
  | }                |    |                  |    |   cf3: {m,n}     |    |                  |
  +------------------+    +------------------+    +------------------+    +------------------+
  
  MongoDB               Redis                  Cassandra             Neo4j
  CouchDB               Memcached              HBase                 Amazon Neptune
  DynamoDB (doc mode)    DynamoDB (KV mode)     ScyllaDB              ArangoDB

1. Document Databases

Store data as flexible JSON/BSON documents. No fixed schema.

// MongoDB document
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Alice",
  "email": "alice@mail.com",
  "address": {
    "street": "123 Main St",
    "city": "Springfield",
    "state": "IL"
  },
  "orders": [
    { "product": "Laptop", "price": 999, "date": "2025-01-15" },
    { "product": "Mouse", "price": 29, "date": "2025-01-20" }
  ]
}

Strengths: Flexible schema, natural mapping to application objects, horizontal scaling. Weaknesses: No joins (data duplication), weaker consistency guarantees, harder multi-document transactions.

2. Key-Value Stores

Simplest model: a key maps to a value. Think of it as a giant hash map.

  SET user:42         '{"name":"Alice","email":"alice@mail.com"}'
  SET session:abc123  '{"user_id":42,"expires":"2025-02-01"}'
  SET cart:42         '{"items":[{"id":1,"qty":2}]}'

  GET user:42    -->  '{"name":"Alice","email":"alice@mail.com"}'
  DEL session:abc123

Strengths: Extremely fast (O(1) lookups), simple, horizontally scalable. Weaknesses: No complex queries, no relationships, limited value inspection.

3. Column-Family Stores

Organize data into rows and column families. Optimized for writes and wide rows.

  ROW KEY: user:alice
  +---------------------------+---------------------------+
  | Column Family: profile    | Column Family: activity   |
  |---------------------------|---------------------------|
  | name: "Alice"             | last_login: "2025-01-15"  |
  | email: "alice@mail.com"   | page_views: 1523          |
  | age: 30                   | last_action: "purchase"   |
  +---------------------------+---------------------------+

  ROW KEY: user:bob
  +---------------------------+---------------------------+
  | Column Family: profile    | Column Family: activity   |
  |---------------------------|---------------------------|
  | name: "Bob"               | last_login: "2025-01-14"  |
  | email: "bob@mail.com"     | (no page_views column)    |
  +---------------------------+---------------------------+
  
  Note: Rows can have different columns (sparse)

Strengths: Massive write throughput, excellent for time-series data, scales to petabytes. Weaknesses: Limited query patterns, no joins, eventual consistency typically.

4. Graph Databases

Store data as nodes and edges. Optimized for traversing relationships.

  (Alice)--[:FRIENDS_WITH]-->(Bob)
     |                         |
     [:PURCHASED]          [:PURCHASED]
     |                         |
     v                         v
  (Laptop)                 (Laptop)
     ^                         ^
     |                         |
     [:SIMILAR_TO]             [:SIMILAR_TO]
     |                         |
  (Tablet)                 (Desktop)

  Query: "Find products purchased by Alice's friends"
  MATCH (alice {name:'Alice'})-[:FRIENDS_WITH]->(friend)-[:PURCHASED]->(product)
  RETURN product.name

Strengths: Relationship traversal in O(1) per hop, natural for connected data, powerful pattern matching. Weaknesses: Not great for aggregate queries, smaller ecosystem, scaling can be harder.

CAP Theorem

In a distributed system, you can only guarantee two out of three properties:

                    Consistency
                       /\
                      /  \
                     /    \
                    /  CA  \
                   / (single \
                  /  node     \
                 /   systems)  \
                /_______________\
               /\              /\
              /  \    CP      /  \
             / AP \          /    \
            /      \        /      \
           /________\______/________\
       Availability              Partition
                                 Tolerance

Property	Meaning
Consistency (C)	Every read receives the most recent write (all nodes see the same data at the same time)
Availability (A)	Every request receives a response (even if not the latest data)
Partition Tolerance (P)	System continues operating despite network partitions between nodes

The Real-World Interpretation

In distributed systems, network partitions WILL happen. So the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance).

Category	Behavior During Partition	Examples	Use Cases
CP	Refuses to serve stale data (may return error)	MongoDB (default), HBase, Redis Cluster, Zookeeper	Banking, inventory, leader election
AP	Serves potentially stale data (stays available)	Cassandra, DynamoDB, CouchDB, Riak	Social feeds, shopping carts, analytics
CA	Not possible in distributed systems	Single-node PostgreSQL, single-node MySQL	Only when running on one machine

CAP in Practice: Tunable Consistency

Many modern databases let you tune consistency per query:

  DynamoDB:
  =========
  Read: Eventually Consistent (default) -> AP behavior
  Read: Strongly Consistent (option)    -> CP behavior

  Cassandra:
  ==========
  Consistency Level: ONE     -> fast, AP behavior
  Consistency Level: QUORUM  -> balanced
  Consistency Level: ALL     -> slow, CP behavior

SQL vs NoSQL Decision Table

Factor	Choose SQL	Choose NoSQL
Data structure	Well-defined, relational	Flexible, evolving, nested
Relationships	Complex joins needed	Few or no relationships
Consistency	ACID required (money, inventory)	Eventual consistency acceptable
Query patterns	Ad-hoc, complex queries	Known, simple access patterns
Scale	Vertical (bigger machine)	Horizontal (more machines)
Schema	Stable, rarely changes	Evolves frequently
Transactions	Multi-table transactions	Single-document transactions
Team expertise	SQL knowledge is universal	Varies by NoSQL type
Examples	E-commerce orders, banking, ERP	Social feeds, IoT data, caching

Decision Flowchart

  START
    |
    v
  Do you need multi-table ACID transactions?
    |                    |
   YES                  NO
    |                    |
    v                    v
  Use SQL            Is your data highly connected (graph-like)?
  (PostgreSQL,         |                    |
   MySQL)             YES                  NO
                       |                    |
                       v                    v
                   Use Graph DB         Do you need sub-millisecond latency?
                   (Neo4j)                |                    |
                                         YES                  NO
                                          |                    |
                                          v                    v
                                      Use Key-Value        Is your schema flexible/nested?
                                      (Redis,                |                    |
                                       DynamoDB)            YES                  NO
                                                             |                    |
                                                             v                    v
                                                         Use Document DB      Do you have massive
                                                         (MongoDB)            write throughput needs?
                                                                                |              |
                                                                               YES            NO
                                                                                |              |
                                                                                v              v
                                                                            Use Column-     Use SQL
                                                                            Family          (safe default)
                                                                            (Cassandra)

Database Deep Dives

PostgreSQL (SQL)

  Type: Relational (SQL)
  CAP: CA (single node) / CP (with replication)
  Best for: General purpose, complex queries, ACID compliance
  
  Key Features:
  - Advanced data types (JSON, arrays, hstore)
  - Full-text search
  - Materialized views
  - Extensions (PostGIS for geospatial, pgvector for AI)
  - MVCC (Multi-Version Concurrency Control)
  
  Used by: Instagram, Spotify, Reddit, Uber (for trip data)
  
  Sweet spot: "When in doubt, start with PostgreSQL"

MongoDB (Document)

  Type: Document (NoSQL)
  CAP: CP (default) / AP (configurable)
  Best for: Flexible schemas, rapid prototyping, content management
  
  Key Features:
  - Schema-less BSON documents
  - Rich query language
  - Aggregation pipeline
  - Change streams (real-time)
  - Multi-document transactions (since v4.0)
  - Atlas Search (built-in full-text search)
  
  Used by: eBay, Adobe, Forbes, Toyota
  
  Sweet spot: Catalogs, content, user profiles, event logging

DynamoDB (Key-Value / Document)

  Type: Key-Value / Document (NoSQL)
  CAP: AP (default) / CP (strongly consistent reads)
  Best for: Serverless architectures, predictable performance at any scale
  
  Key Features:
  - Single-digit millisecond latency at any scale
  - Fully managed (no servers to provision)
  - Auto-scaling
  - DynamoDB Streams (change data capture)
  - Global tables (multi-region replication)
  - On-demand or provisioned capacity
  
  Used by: Amazon.com, Lyft, Airbnb, Samsung
  
  Sweet spot: Shopping carts, session stores, gaming leaderboards
  
  Access pattern: Must design around partition key + sort key
  
  Table Design:
  +--------------+----------+------------------+
  | Partition Key| Sort Key | Attributes       |
  |--------------|----------|------------------|
  | USER#alice   | ORDER#1  | {total: 99.99}   |
  | USER#alice   | ORDER#2  | {total: 45.50}   |
  | USER#alice   | PROFILE  | {email: "a@m.c"} |
  +--------------+----------+------------------+

Redis (Key-Value / In-Memory)

  Type: Key-Value (In-Memory NoSQL)
  CAP: CP (Redis Cluster) / Single-node: CA
  Best for: Caching, session storage, real-time features
  
  Key Features:
  - Sub-millisecond latency (in-memory)
  - Rich data structures (strings, hashes, lists, sets, sorted sets, streams)
  - Pub/Sub messaging
  - Lua scripting
  - TTL (automatic expiration)
  - Persistence options (RDB snapshots, AOF)
  
  Used by: Twitter, GitHub, Stack Overflow, Pinterest
  
  Sweet spot: Cache, session store, rate limiter, leaderboard, pub/sub
  
  Common patterns:
  - Cache: SET cache:user:42 '{"name":"Alice"}' EX 3600
  - Counter: INCR page:views:homepage
  - Rate limit: INCR api:rate:user:42  + EXPIRE
  - Leaderboard: ZADD leaderboard 1500 "alice" 1200 "bob"

Cassandra (Column-Family)

  Type: Column-Family (NoSQL)
  CAP: AP (tunable consistency)
  Best for: Massive write-heavy workloads, time-series data
  
  Key Features:
  - Linear scalability (add nodes = add capacity)
  - No single point of failure (masterless)
  - Multi-datacenter replication
  - Tunable consistency (ONE, QUORUM, ALL)
  - CQL (Cassandra Query Language, SQL-like)
  - Optimized for writes
  
  Used by: Netflix, Apple, Instagram (messages), Discord
  
  Sweet spot: IoT sensor data, messaging, activity feeds, time-series
  
  Write path (why writes are fast):
  Client -> Commit Log (append-only, sequential) -> Memtable (memory) -> SSTable (disk, periodic flush)

Neo4j (Graph)

  Type: Graph (NoSQL)
  CAP: CA (single node) / CP (cluster)
  Best for: Highly connected data, relationship-heavy queries
  
  Key Features:
  - Native graph storage (index-free adjacency)
  - Cypher query language
  - ACID compliant
  - Graph algorithms library
  - Visualization tools
  
  Used by: NASA, eBay, Walmart, Airbnb
  
  Sweet spot: Social networks, recommendation engines, fraud detection, knowledge graphs
  
  Query (Cypher):
  MATCH (user:User {name: 'Alice'})-[:FRIENDS*1..3]->(fof)
  WHERE fof <> user
  RETURN DISTINCT fof.name
  // Find friends-of-friends up to 3 hops away

Side-by-Side Summary

Database	Type	Consistency	Scale Model	Latency	Best For
PostgreSQL	SQL	Strong (ACID)	Vertical + read replicas	Low	General purpose, complex queries
MongoDB	Document	Tunable	Horizontal (sharding)	Low	Flexible schema, content
DynamoDB	Key-Value/Doc	Tunable	Horizontal (managed)	Very low	Serverless, predictable perf
Redis	Key-Value	Strong (single)	Vertical + Cluster	Sub-ms	Caching, real-time
Cassandra	Column-Family	Tunable (AP)	Horizontal (masterless)	Low	Write-heavy, time-series
Neo4j	Graph	Strong (ACID)	Vertical + read replicas	Varies	Relationships, graph queries

Key Takeaways

SQL is not dead — Relational databases handle the majority of production workloads. PostgreSQL is the safe default.
NoSQL is not "better" — It is a different set of tradeoffs. Choose based on your access patterns, not hype.
ACID matters for money — Any system involving financial transactions needs strong consistency guarantees.
CAP theorem forces a choice — In a distributed system, you choose between consistency and availability during network partitions.
Polyglot persistence is normal — Most large systems use multiple databases (e.g., PostgreSQL for orders + Redis for cache + Elasticsearch for search).
Schema design drives performance — In SQL, normalize. In NoSQL, design around your query patterns (denormalize).
DynamoDB and Cassandra require you to know access patterns upfront — You cannot do ad-hoc queries like SQL. Design your keys carefully.

Explain-It Challenge

Scenario: You are designing the database layer for a social media platform. You need to store:

User profiles (name, email, bio, settings)

Posts (text, images, timestamps)

Social graph (who follows whom)

News feed (personalized for each user)

Direct messages

Notifications

For each data type, recommend a specific database and explain why. Would you use a single database or multiple? Justify your polyglot persistence strategy.

Next -> 9.8.d — Database Scaling

Cohart