Episode 9 — System Design / 9.8 — Communication and Data Layer

9.8.c — SQL vs NoSQL

Big Picture: Choosing the right database is one of the most consequential decisions in system design. Get it right, and your system scales gracefully. Get it wrong, and you are either re-architecting under pressure or fighting your database at every step. This chapter gives you the knowledge to make that decision confidently.


Table of Contents

  1. Relational Databases (SQL)
  2. ACID Properties
  3. Normalization
  4. NoSQL Database Types
  5. CAP Theorem
  6. SQL vs NoSQL Decision Table
  7. Database Deep Dives
  8. Key Takeaways
  9. Explain-It Challenge

Relational Databases (SQL)

Relational databases store data in tables (rows and columns) with relationships between tables enforced by foreign keys.

  USERS TABLE                          ORDERS TABLE
  ==============                       ===============================
  | id | name    | email          |   | id | user_id | total | status    |
  |----|---------|----------------|   |----|---------|-------|-----------|
  | 1  | Alice   | alice@mail.com |   | 10 | 1       | 99.99 | delivered |
  | 2  | Bob     | bob@mail.com   |   | 11 | 1       | 45.50 | shipped   |
  | 3  | Charlie | charlie@m.com  |   | 12 | 2       | 120.0 | pending   |
  +----+---------+----------------+   +----+---------+-------+-----------+
         ^                                     |
         |           FOREIGN KEY               |
         +-------------------------------------+
         users.id  <--  orders.user_id

SQL Joins

-- INNER JOIN: Only matching rows from both tables
SELECT u.name, o.total, o.status
FROM users u
INNER JOIN orders o ON u.id = o.user_id;

-- Result: Alice (2 orders), Bob (1 order). Charlie excluded (no orders).

-- LEFT JOIN: All rows from left table + matching from right
SELECT u.name, o.total
FROM users u
LEFT JOIN orders o ON u.id = o.user_id;

-- Result: Alice, Alice, Bob, Charlie (NULL). All users included.

Join Types Visualized

  INNER JOIN          LEFT JOIN           RIGHT JOIN          FULL OUTER JOIN
  ==========          =========           ==========          ===============
    +---+               +---+               +---+               +---+
   /     \             /     \             /     \             /     \
  | A   B |           |[A]  B |           | A  [B]|           |[A] [B]|
  |  [AB]  |          | [AB]  |           |  [AB] |           | [AB]  |
   \     /             \     /             \     /             \     /
    +---+               +---+               +---+               +---+

  Only where           All A +             All B +             All A +
  both match           matching B           matching A          All B

Schema Definition

CREATE TABLE users (
    id          SERIAL PRIMARY KEY,
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(255) UNIQUE NOT NULL,
    created_at  TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(id),
    total       DECIMAL(10, 2) NOT NULL,
    status      VARCHAR(20) DEFAULT 'pending',
    created_at  TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_status ON orders(status);

ACID Properties

ACID guarantees data integrity in relational databases.

PropertyMeaningExample
AtomicityAll operations in a transaction succeed or all failBank transfer: debit AND credit both happen, or neither
ConsistencyDatabase moves from one valid state to anotherForeign key constraints, unique constraints always enforced
IsolationConcurrent transactions do not interfereTwo users buying the last item: only one succeeds
DurabilityCommitted data survives crashesAfter "commit," data is on disk even if power goes out

Transaction Example

-- Transfer $100 from Alice to Bob
BEGIN TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE user_id = 1;  -- Alice
UPDATE accounts SET balance = balance + 100 WHERE user_id = 2;  -- Bob

-- If anything fails, ROLLBACK undoes everything
COMMIT;

Isolation Levels

  READ UNCOMMITTED   READ COMMITTED   REPEATABLE READ   SERIALIZABLE
  ================   ==============   ===============   ============
  Dirty reads: YES   Dirty reads: NO  Dirty reads: NO   Dirty reads: NO
  Non-repeatable     Non-repeatable   Non-repeatable    Non-repeatable
   reads: YES         reads: YES       reads: NO         reads: NO
  Phantom            Phantom          Phantom           Phantom
   reads: YES         reads: YES       reads: YES        reads: NO

  <--- Faster, less safe                    Slower, more safe --->
  <--- More concurrency                     Less concurrency  --->
Isolation LevelPerformanceSafetyCommon Use
Read UncommittedFastestLowestAlmost never used
Read CommittedFastGoodPostgreSQL default
Repeatable ReadMediumHighMySQL/InnoDB default
SerializableSlowestHighestFinancial systems

Normalization

Normalization eliminates data redundancy by organizing tables properly.

Unnormalized (Bad)

  ORDERS (Denormalized)
  ======================================================================
  | order_id | customer_name | customer_email    | product | price | qty |
  |----------|---------------|-------------------|---------|-------|-----|
  | 1        | Alice         | alice@mail.com    | Laptop  | 999   | 1   |
  | 2        | Alice         | alice@mail.com    | Mouse   | 29    | 2   |
  | 3        | Bob           | bob@mail.com      | Laptop  | 999   | 1   |
  
  Problems:
  - Alice's email stored 2x (redundancy)
  - Update Alice's email? Must update every row (update anomaly)
  - Delete order 3? Lose Bob's info entirely (delete anomaly)

Normalized (Good)

  CUSTOMERS                    PRODUCTS                  ORDER_ITEMS
  ==============               ===========               =========================
  | id | name  | email       | | id | name   | price | | order_id | product_id | qty |
  |----|-------|-------------|  |----|--------|-------|  |----------|------------|-----|
  | 1  | Alice | alice@m.com |  | 1  | Laptop | 999  |  | 1        | 1          | 1   |
  | 2  | Bob   | bob@m.com   |  | 2  | Mouse  | 29   |  | 2        | 2          | 2   |
  +----+-------+-------------+  +----+--------+-------+  | 3        | 1          | 1   |
                                                          +----------+------------+-----+
  
  ORDERS
  =====================
  | id | customer_id  |
  |----|------------- |
  | 1  | 1            |
  | 2  | 1            |
  | 3  | 2            |
  +----+--------------+

Normal Forms (Quick Reference)

Normal FormRuleViolation Example
1NFNo repeating groups, atomic valuestags: "red,blue,green" in one cell
2NF1NF + no partial dependenciesNon-key column depends on part of composite key
3NF2NF + no transitive dependencieszip_code -> city (city depends on zip, not PK)

Interview tip: In practice, most production databases aim for 3NF. Higher normal forms (BCNF, 4NF, 5NF) exist but are rarely discussed in interviews.


NoSQL Database Types

  NoSQL FAMILY
  ============

  +------------------+    +------------------+    +------------------+    +------------------+
  | DOCUMENT         |    | KEY-VALUE        |    | COLUMN-FAMILY    |    | GRAPH            |
  | ================ |    | ================ |    | ================ |    | ================ |
  |                  |    |                  |    |                  |    |                  |
  | { "name": "A",  |    | key1 -> value1   |    | Row1:            |    |  (A)--[:KNOWS]->(B)|
  |   "age": 30,    |    | key2 -> value2   |    |   cf1: {a,b,c}  |    |   |               |
  |   "orders": [   |    | key3 -> value3   |    |   cf2: {x,y}    |    |   [:LIKES]       |
  |     {...}        |    |                  |    | Row2:            |    |   v              |
  |   ]              |    |                  |    |   cf1: {a,c}     |    |  (C)             |
  | }                |    |                  |    |   cf3: {m,n}     |    |                  |
  +------------------+    +------------------+    +------------------+    +------------------+
  
  MongoDB               Redis                  Cassandra             Neo4j
  CouchDB               Memcached              HBase                 Amazon Neptune
  DynamoDB (doc mode)    DynamoDB (KV mode)     ScyllaDB              ArangoDB

1. Document Databases

Store data as flexible JSON/BSON documents. No fixed schema.

// MongoDB document
{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Alice",
  "email": "alice@mail.com",
  "address": {
    "street": "123 Main St",
    "city": "Springfield",
    "state": "IL"
  },
  "orders": [
    { "product": "Laptop", "price": 999, "date": "2025-01-15" },
    { "product": "Mouse", "price": 29, "date": "2025-01-20" }
  ]
}

Strengths: Flexible schema, natural mapping to application objects, horizontal scaling. Weaknesses: No joins (data duplication), weaker consistency guarantees, harder multi-document transactions.

2. Key-Value Stores

Simplest model: a key maps to a value. Think of it as a giant hash map.

  SET user:42         '{"name":"Alice","email":"alice@mail.com"}'
  SET session:abc123  '{"user_id":42,"expires":"2025-02-01"}'
  SET cart:42         '{"items":[{"id":1,"qty":2}]}'

  GET user:42    -->  '{"name":"Alice","email":"alice@mail.com"}'
  DEL session:abc123

Strengths: Extremely fast (O(1) lookups), simple, horizontally scalable. Weaknesses: No complex queries, no relationships, limited value inspection.

3. Column-Family Stores

Organize data into rows and column families. Optimized for writes and wide rows.

  ROW KEY: user:alice
  +---------------------------+---------------------------+
  | Column Family: profile    | Column Family: activity   |
  |---------------------------|---------------------------|
  | name: "Alice"             | last_login: "2025-01-15"  |
  | email: "alice@mail.com"   | page_views: 1523          |
  | age: 30                   | last_action: "purchase"   |
  +---------------------------+---------------------------+

  ROW KEY: user:bob
  +---------------------------+---------------------------+
  | Column Family: profile    | Column Family: activity   |
  |---------------------------|---------------------------|
  | name: "Bob"               | last_login: "2025-01-14"  |
  | email: "bob@mail.com"     | (no page_views column)    |
  +---------------------------+---------------------------+
  
  Note: Rows can have different columns (sparse)

Strengths: Massive write throughput, excellent for time-series data, scales to petabytes. Weaknesses: Limited query patterns, no joins, eventual consistency typically.

4. Graph Databases

Store data as nodes and edges. Optimized for traversing relationships.

  (Alice)--[:FRIENDS_WITH]-->(Bob)
     |                         |
     [:PURCHASED]          [:PURCHASED]
     |                         |
     v                         v
  (Laptop)                 (Laptop)
     ^                         ^
     |                         |
     [:SIMILAR_TO]             [:SIMILAR_TO]
     |                         |
  (Tablet)                 (Desktop)

  Query: "Find products purchased by Alice's friends"
  MATCH (alice {name:'Alice'})-[:FRIENDS_WITH]->(friend)-[:PURCHASED]->(product)
  RETURN product.name

Strengths: Relationship traversal in O(1) per hop, natural for connected data, powerful pattern matching. Weaknesses: Not great for aggregate queries, smaller ecosystem, scaling can be harder.


CAP Theorem

In a distributed system, you can only guarantee two out of three properties:

                    Consistency
                       /\
                      /  \
                     /    \
                    /  CA  \
                   / (single \
                  /  node     \
                 /   systems)  \
                /_______________\
               /\              /\
              /  \    CP      /  \
             / AP \          /    \
            /      \        /      \
           /________\______/________\
       Availability              Partition
                                 Tolerance
PropertyMeaning
Consistency (C)Every read receives the most recent write (all nodes see the same data at the same time)
Availability (A)Every request receives a response (even if not the latest data)
Partition Tolerance (P)System continues operating despite network partitions between nodes

The Real-World Interpretation

In distributed systems, network partitions WILL happen. So the real choice is between CP (consistency + partition tolerance) and AP (availability + partition tolerance).

CategoryBehavior During PartitionExamplesUse Cases
CPRefuses to serve stale data (may return error)MongoDB (default), HBase, Redis Cluster, ZookeeperBanking, inventory, leader election
APServes potentially stale data (stays available)Cassandra, DynamoDB, CouchDB, RiakSocial feeds, shopping carts, analytics
CANot possible in distributed systemsSingle-node PostgreSQL, single-node MySQLOnly when running on one machine

CAP in Practice: Tunable Consistency

Many modern databases let you tune consistency per query:

  DynamoDB:
  =========
  Read: Eventually Consistent (default) -> AP behavior
  Read: Strongly Consistent (option)    -> CP behavior

  Cassandra:
  ==========
  Consistency Level: ONE     -> fast, AP behavior
  Consistency Level: QUORUM  -> balanced
  Consistency Level: ALL     -> slow, CP behavior

SQL vs NoSQL Decision Table

FactorChoose SQLChoose NoSQL
Data structureWell-defined, relationalFlexible, evolving, nested
RelationshipsComplex joins neededFew or no relationships
ConsistencyACID required (money, inventory)Eventual consistency acceptable
Query patternsAd-hoc, complex queriesKnown, simple access patterns
ScaleVertical (bigger machine)Horizontal (more machines)
SchemaStable, rarely changesEvolves frequently
TransactionsMulti-table transactionsSingle-document transactions
Team expertiseSQL knowledge is universalVaries by NoSQL type
ExamplesE-commerce orders, banking, ERPSocial feeds, IoT data, caching

Decision Flowchart

  START
    |
    v
  Do you need multi-table ACID transactions?
    |                    |
   YES                  NO
    |                    |
    v                    v
  Use SQL            Is your data highly connected (graph-like)?
  (PostgreSQL,         |                    |
   MySQL)             YES                  NO
                       |                    |
                       v                    v
                   Use Graph DB         Do you need sub-millisecond latency?
                   (Neo4j)                |                    |
                                         YES                  NO
                                          |                    |
                                          v                    v
                                      Use Key-Value        Is your schema flexible/nested?
                                      (Redis,                |                    |
                                       DynamoDB)            YES                  NO
                                                             |                    |
                                                             v                    v
                                                         Use Document DB      Do you have massive
                                                         (MongoDB)            write throughput needs?
                                                                                |              |
                                                                               YES            NO
                                                                                |              |
                                                                                v              v
                                                                            Use Column-     Use SQL
                                                                            Family          (safe default)
                                                                            (Cassandra)

Database Deep Dives

PostgreSQL (SQL)

  Type: Relational (SQL)
  CAP: CA (single node) / CP (with replication)
  Best for: General purpose, complex queries, ACID compliance
  
  Key Features:
  - Advanced data types (JSON, arrays, hstore)
  - Full-text search
  - Materialized views
  - Extensions (PostGIS for geospatial, pgvector for AI)
  - MVCC (Multi-Version Concurrency Control)
  
  Used by: Instagram, Spotify, Reddit, Uber (for trip data)
  
  Sweet spot: "When in doubt, start with PostgreSQL"

MongoDB (Document)

  Type: Document (NoSQL)
  CAP: CP (default) / AP (configurable)
  Best for: Flexible schemas, rapid prototyping, content management
  
  Key Features:
  - Schema-less BSON documents
  - Rich query language
  - Aggregation pipeline
  - Change streams (real-time)
  - Multi-document transactions (since v4.0)
  - Atlas Search (built-in full-text search)
  
  Used by: eBay, Adobe, Forbes, Toyota
  
  Sweet spot: Catalogs, content, user profiles, event logging

DynamoDB (Key-Value / Document)

  Type: Key-Value / Document (NoSQL)
  CAP: AP (default) / CP (strongly consistent reads)
  Best for: Serverless architectures, predictable performance at any scale
  
  Key Features:
  - Single-digit millisecond latency at any scale
  - Fully managed (no servers to provision)
  - Auto-scaling
  - DynamoDB Streams (change data capture)
  - Global tables (multi-region replication)
  - On-demand or provisioned capacity
  
  Used by: Amazon.com, Lyft, Airbnb, Samsung
  
  Sweet spot: Shopping carts, session stores, gaming leaderboards
  
  Access pattern: Must design around partition key + sort key
  
  Table Design:
  +--------------+----------+------------------+
  | Partition Key| Sort Key | Attributes       |
  |--------------|----------|------------------|
  | USER#alice   | ORDER#1  | {total: 99.99}   |
  | USER#alice   | ORDER#2  | {total: 45.50}   |
  | USER#alice   | PROFILE  | {email: "a@m.c"} |
  +--------------+----------+------------------+

Redis (Key-Value / In-Memory)

  Type: Key-Value (In-Memory NoSQL)
  CAP: CP (Redis Cluster) / Single-node: CA
  Best for: Caching, session storage, real-time features
  
  Key Features:
  - Sub-millisecond latency (in-memory)
  - Rich data structures (strings, hashes, lists, sets, sorted sets, streams)
  - Pub/Sub messaging
  - Lua scripting
  - TTL (automatic expiration)
  - Persistence options (RDB snapshots, AOF)
  
  Used by: Twitter, GitHub, Stack Overflow, Pinterest
  
  Sweet spot: Cache, session store, rate limiter, leaderboard, pub/sub
  
  Common patterns:
  - Cache: SET cache:user:42 '{"name":"Alice"}' EX 3600
  - Counter: INCR page:views:homepage
  - Rate limit: INCR api:rate:user:42  + EXPIRE
  - Leaderboard: ZADD leaderboard 1500 "alice" 1200 "bob"

Cassandra (Column-Family)

  Type: Column-Family (NoSQL)
  CAP: AP (tunable consistency)
  Best for: Massive write-heavy workloads, time-series data
  
  Key Features:
  - Linear scalability (add nodes = add capacity)
  - No single point of failure (masterless)
  - Multi-datacenter replication
  - Tunable consistency (ONE, QUORUM, ALL)
  - CQL (Cassandra Query Language, SQL-like)
  - Optimized for writes
  
  Used by: Netflix, Apple, Instagram (messages), Discord
  
  Sweet spot: IoT sensor data, messaging, activity feeds, time-series
  
  Write path (why writes are fast):
  Client -> Commit Log (append-only, sequential) -> Memtable (memory) -> SSTable (disk, periodic flush)

Neo4j (Graph)

  Type: Graph (NoSQL)
  CAP: CA (single node) / CP (cluster)
  Best for: Highly connected data, relationship-heavy queries
  
  Key Features:
  - Native graph storage (index-free adjacency)
  - Cypher query language
  - ACID compliant
  - Graph algorithms library
  - Visualization tools
  
  Used by: NASA, eBay, Walmart, Airbnb
  
  Sweet spot: Social networks, recommendation engines, fraud detection, knowledge graphs
  
  Query (Cypher):
  MATCH (user:User {name: 'Alice'})-[:FRIENDS*1..3]->(fof)
  WHERE fof <> user
  RETURN DISTINCT fof.name
  // Find friends-of-friends up to 3 hops away

Side-by-Side Summary

DatabaseTypeConsistencyScale ModelLatencyBest For
PostgreSQLSQLStrong (ACID)Vertical + read replicasLowGeneral purpose, complex queries
MongoDBDocumentTunableHorizontal (sharding)LowFlexible schema, content
DynamoDBKey-Value/DocTunableHorizontal (managed)Very lowServerless, predictable perf
RedisKey-ValueStrong (single)Vertical + ClusterSub-msCaching, real-time
CassandraColumn-FamilyTunable (AP)Horizontal (masterless)LowWrite-heavy, time-series
Neo4jGraphStrong (ACID)Vertical + read replicasVariesRelationships, graph queries

Key Takeaways

  1. SQL is not dead — Relational databases handle the majority of production workloads. PostgreSQL is the safe default.
  2. NoSQL is not "better" — It is a different set of tradeoffs. Choose based on your access patterns, not hype.
  3. ACID matters for money — Any system involving financial transactions needs strong consistency guarantees.
  4. CAP theorem forces a choice — In a distributed system, you choose between consistency and availability during network partitions.
  5. Polyglot persistence is normal — Most large systems use multiple databases (e.g., PostgreSQL for orders + Redis for cache + Elasticsearch for search).
  6. Schema design drives performance — In SQL, normalize. In NoSQL, design around your query patterns (denormalize).
  7. DynamoDB and Cassandra require you to know access patterns upfront — You cannot do ad-hoc queries like SQL. Design your keys carefully.

Explain-It Challenge

Scenario: You are designing the database layer for a social media platform. You need to store:

  1. User profiles (name, email, bio, settings)
  2. Posts (text, images, timestamps)
  3. Social graph (who follows whom)
  4. News feed (personalized for each user)
  5. Direct messages
  6. Notifications

For each data type, recommend a specific database and explain why. Would you use a single database or multiple? Justify your polyglot persistence strategy.


Next -> 9.8.d — Database Scaling