Episode 6 — Scaling, Reliability, Microservices, Web3 / 6.6 — Caching in Production

6.6.b — Cache Invalidation

In one sentence: Cache invalidation — deciding when and how to remove or update stale data from the cache — is famously one of the two hardest problems in computer science, because getting it wrong means users see outdated data, and getting it right requires careful coordination between writes, reads, and distributed systems.

Navigation: ← 6.6.a Redis Caching · 6.6.c — TTL Strategies →


1. "The Two Hard Problems in Computer Science"

"There are only two hard things in Computer Science: cache invalidation and naming things." — Phil Karlton

This quote is famous because it is painfully true. Caching data is easy. Knowing when that cached data is no longer valid is where every production caching system gets complicated.

The fundamental tension:

Too aggressive invalidation → Cache is always empty → No performance benefit
Too lazy invalidation       → Users see stale/wrong data → Trust is broken
Perfect invalidation        → Requires omniscient knowledge of every data change

Every invalidation strategy is a tradeoff between freshness (how current the data is) and performance (how often you avoid hitting the database).


2. Why Cache Invalidation Is Hard

Problem 1: Distributed state

The cache and the database are two separate systems. There is no atomic "update both" operation. Between updating the DB and invalidating the cache, there is a window where they disagree.

Timeline of a race condition:

  t=0   Request A reads user (cache MISS) → starts DB query
  t=1   Request B updates user in DB → invalidates cache
  t=2   Request A's DB query returns (OLD data) → stores in cache
  t=3   Cache now has STALE data that was just invalidated

  Result: Cache contains old data even though we "invalidated" it
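
The window is easiest to see in code. Below is a minimal cache-aside read (a sketch; the `redis` and `db` clients are passed in as parameters, and `db.findUser` is an illustrative helper, not from the source) with comments marking where the timeline above can interleave:

```javascript
// Cache-aside read. Between the DB query starting and the SET completing,
// a concurrent write + invalidation can slip in (t=1 in the timeline above),
// leaving stale data in the cache.
async function getUser(redis, db, userId) {
  const key = `user:${userId}`;

  const cached = await redis.get(key);     // t=0: cache MISS
  if (cached) return JSON.parse(cached);

  const user = await db.findUser(userId);  // t=0..2: DB query in flight
  // t=3: this SET may store data that was invalidated at t=1.
  // The TTL bounds how long that stale copy can survive.
  await redis.set(key, JSON.stringify(user), 'EX', 300);
  return user;
}
```

This is why even "correct" invalidation code pairs with a TTL safety net: the race cannot be eliminated from cache-aside, only bounded.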

Problem 2: Derived data

When one piece of data changes, you need to invalidate everything that was derived from it:

User changes their name:
  - Invalidate user:123
  - Invalidate user:123:profile
  - Invalidate team:456 (contains user's name in member list)
  - Invalidate leaderboard (contains user's display name)
  - Invalidate search index (contains user's name)
  - Invalidate every page that renders this user's name

How many cache keys depend on one user's name? In a complex app: dozens.
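
In code form, even this one field change fans out (a sketch; the key names are illustrative, and a real app would look up the team IDs and rendered pages per user):

```javascript
// All cache keys touched by a single name change. The list mirrors the
// bullets above; 'leaderboard' and the search key are illustrative names.
function keysInvalidatedByNameChange(userId, teamIds) {
  return [
    `user:${userId}`,
    `user:${userId}:profile`,
    ...teamIds.map((id) => `team:${id}`),  // member lists embed the name
    'leaderboard',                         // display names on the board
    `search:user:${userId}`,               // denormalized search entries
  ];
}
```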

Problem 3: Timing

Even with correct invalidation logic, network delays, retries, and async processing create windows where stale data is served.


3. Invalidation Strategies

Strategy 1: Time-Based (TTL)

The simplest approach. Every cached value has a Time To Live. After the TTL expires, the key is automatically deleted and the next read triggers a fresh database query.

// Cache user profile for 5 minutes
await redis.set('user:123', JSON.stringify(user), 'EX', 300);

// After 300 seconds, Redis automatically deletes the key
// Next GET returns null → triggers DB read → re-caches

Pros:

  • Dead simple to implement
  • Self-healing — stale data is guaranteed to expire eventually
  • No need to track what changed

Cons:

  • Data can be stale for up to the full TTL duration
  • Fixed TTL doesn't match all data change frequencies
  • If TTL is too short, cache hit rate drops (defeating the purpose)

Strategy 2: Event-Based (Invalidate on Write)

Whenever data changes, the application explicitly deletes or updates the cache.

async function updateUserEmail(userId, newEmail) {
  // 1. Update the database (source of truth)
  await db.collection('users').updateOne(
    { _id: userId },
    { $set: { email: newEmail, updatedAt: new Date() } }
  );

  // 2. Invalidate all cache keys affected by this change
  await redis.del(`user:${userId}`);
  await redis.del(`user:${userId}:profile`);
  await redis.del(`user:${userId}:settings`);

  // 3. Invalidate derived caches
  const teams = await db.collection('teams').find({ members: userId }).toArray();
  for (const team of teams) {
    await redis.del(`team:${team._id}:members`);
  }
}

Pros:

  • Cache is fresh almost immediately after a write
  • Minimal stale-data window (just the gap between the DB write and the invalidation)

Cons:

  • You must identify every cache key affected by a change
  • Easy to miss a cache key (especially derived data)
  • Tightly couples write logic to cache logic

Strategy 3: Version-Based (Cache Key with Version)

Include a version number or timestamp in the cache key. When data changes, increment the version — old keys are effectively orphaned and will expire via TTL.

// Store the current version for each entity
await redis.set('user:123:version', '7');

// Cache key includes the version
const version = await redis.get('user:123:version');
const cacheKey = `user:123:v${version}`;
const cached = await redis.get(cacheKey);

if (!cached) {
  const user = await db.collection('users').findOne({ _id: '123' });
  await redis.set(cacheKey, JSON.stringify(user), 'EX', 3600);
}

// On update — just bump the version
async function onUserUpdate(userId) {
  await redis.incr(`user:${userId}:version`);
  // Old cache key (v7) still exists but nobody will look it up
  // It will expire naturally via TTL
}

Pros:

  • No need to find and delete old keys
  • Atomic — incrementing a version is a single Redis command
  • Works well with CDNs (versioned URLs)

Cons:

  • Orphaned keys waste memory until TTL expires
  • Extra Redis call to look up the version before the data
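
Putting the pieces together, the read path can be wrapped in one helper (a sketch; the key naming follows the snippet above, and the `redis` client is injected so it can be swapped or stubbed):

```javascript
// Versioned cache-aside read. One extra GET resolves the current version,
// then the versioned key is read as usual. A missing version counts as 0.
async function getVersioned(redis, entity, id, fetchFn, ttl = 3600) {
  const version = (await redis.get(`${entity}:${id}:version`)) || '0';
  const cacheKey = `${entity}:${id}:v${version}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const data = await fetchFn();
  await redis.set(cacheKey, JSON.stringify(data), 'EX', ttl);
  return data;
}

// On any write: a single atomic INCR orphans every old versioned key.
async function bumpVersion(redis, entity, id) {
  await redis.incr(`${entity}:${id}:version`);
}
```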

4. Manual Invalidation

Sometimes you need to clear cache manually — during deployments, data migrations, or emergency fixes.

// Clear all cache keys matching a pattern
async function clearCacheByPattern(pattern) {
  let cursor = '0';
  let totalDeleted = 0;

  do {
    const [nextCursor, keys] = await redis.scan(
      cursor, 'MATCH', pattern, 'COUNT', 200
    );
    cursor = nextCursor;

    if (keys.length > 0) {
      const pipeline = redis.pipeline();
      keys.forEach((key) => pipeline.del(key));
      await pipeline.exec();
      totalDeleted += keys.length;
    }
  } while (cursor !== '0');

  console.log(`Cleared ${totalDeleted} keys matching "${pattern}"`);
  return totalDeleted;
}

// Examples
await clearCacheByPattern('user:*');        // All user caches
await clearCacheByPattern('cache:products:*'); // All product list caches
await clearCacheByPattern('*');             // Nuclear option: clear everything

Important: Never use the KEYS command in production — it blocks Redis while it walks the entire keyspace. Always use SCAN with a cursor, which iterates in small, non-blocking batches.
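
A related detail: DEL is also synchronous, so deleting a key that holds a very large value blocks Redis while the memory is freed. Since Redis 4.0, UNLINK removes the key immediately and reclaims the memory on a background thread. A sketch of the deletion step using UNLINK (assuming an ioredis-style client, which exposes .unlink() and .pipeline()):

```javascript
// Drop-in replacement for the pipeline.del() step above: UNLINK removes
// the keys right away but frees their memory on a background thread
// (Redis >= 4.0). The client is injected so the helper is easy to test.
async function unlinkKeys(redis, keys) {
  if (keys.length === 0) return 0;
  const pipeline = redis.pipeline();
  keys.forEach((key) => pipeline.unlink(key));
  await pipeline.exec();
  return keys.length;
}
```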


5. Pub/Sub for Distributed Invalidation

In a microservices architecture with multiple services caching data, a write in one service needs to invalidate caches in all services. Redis pub/sub handles this:

// ──────────────────────────────────────────────────────────
// Publisher: The service that writes data
// ──────────────────────────────────────────────────────────
async function updateProduct(productId, updates) {
  await db.collection('products').updateOne(
    { _id: productId },
    { $set: updates }
  );

  // Publish invalidation event to all subscribers
  await redis.publish('cache:invalidation', JSON.stringify({
    type: 'product:updated',
    keys: [
      `product:${productId}`,
      `product:${productId}:details`,
      'products:featured',
      'products:latest',
    ],
    timestamp: Date.now(),
  }));
}

// ──────────────────────────────────────────────────────────
// Subscriber: Every service that might have cached this data
// ──────────────────────────────────────────────────────────
// IMPORTANT: Use a SEPARATE Redis connection for subscriptions
const subscriber = new Redis({ host: '127.0.0.1', port: 6379 });

subscriber.subscribe('cache:invalidation');

subscriber.on('message', async (channel, message) => {
  const event = JSON.parse(message);
  console.log(`Invalidation event: ${event.type}`);

  const pipeline = redis.pipeline();
  for (const key of event.keys) {
    pipeline.del(key);
  }
  await pipeline.exec();

  console.log(`Invalidated ${event.keys.length} keys`);
});

┌───────────────┐                        ┌───────────────┐
│ Product       │  PUBLISH               │ Search        │
│ Service       │ ───────────┐           │ Service       │
│ (writes data) │            │           │ (has cached   │
└───────────────┘            │           │  product data)│
                             ▼           └───────┬───────┘
                      ┌─────────────┐            │
                      │    Redis    │  SUBSCRIBE │
                      │   Pub/Sub   │ ◄──────────┘
                      └──────┬──────┘
                             │  SUBSCRIBE
                             ▼
                      ┌───────────────┐
                      │ Order Service │
                      │ (has cached   │
                      │  product data)│
                      └───────────────┘

6. The Cache Stampede Problem

Cache stampede (also called thundering herd) happens when a popular cache key expires and hundreds of concurrent requests all miss the cache at the same time, all hitting the database simultaneously.

Normal:
  1 request misses → 1 DB query → cache repopulated → next 999 requests hit cache

Stampede:
  Key expires →
    Request 1 misses → DB query
    Request 2 misses → DB query
    Request 3 misses → DB query
    ...
    Request 1000 misses → DB query
  = 1000 simultaneous DB queries for the SAME data
  = Database overload or crash

Prevention 1: Locking (Mutex)

Only one request is allowed to repopulate the cache. Others wait.

async function getWithLock(key, fetchFn, ttl = 300, retries = 50) {
  // Try cache first
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);

  // Try to acquire a lock
  const lockKey = `lock:${key}`;
  const lockAcquired = await redis.set(lockKey, '1', 'NX', 'EX', 10);

  if (lockAcquired) {
    try {
      // We got the lock — fetch from DB and populate cache
      const data = await fetchFn();
      await redis.set(key, JSON.stringify(data), 'EX', ttl);
      return data;
    } finally {
      // Release the lock
      await redis.del(lockKey);
    }
  } else {
    // Another request is fetching — wait briefly, then retry (bounded,
    // so a stuck fetcher cannot cause unbounded recursion)
    if (retries <= 0) throw new Error(`Lock wait timed out for ${key}`);
    await new Promise((resolve) => setTimeout(resolve, 100));
    return getWithLock(key, fetchFn, ttl, retries - 1);
  }
}

Prevention 2: Probabilistic Early Expiration

Randomly refresh the cache before it actually expires. Each request has a small chance of refreshing, spreading the load.

async function getWithEarlyExpiration(key, fetchFn, ttl = 300) {
  const cached = await redis.get(key);
  const remainingTTL = await redis.ttl(key);

  if (cached) {
    // Probabilistically refresh if TTL is getting low
    // Higher chance of refresh as TTL approaches 0
    const expirationWindow = ttl * 0.1; // Last 10% of TTL
    if (remainingTTL > 0 && remainingTTL < expirationWindow) {
      const refreshProbability = 1 - (remainingTTL / expirationWindow);
      if (Math.random() < refreshProbability) {
        // Refresh in background (don't wait)
        fetchFn().then((data) => {
          redis.set(key, JSON.stringify(data), 'EX', ttl);
        }).catch(console.error);
      }
    }
    return JSON.parse(cached);
  }

  // Cache miss — fetch and store
  const data = await fetchFn();
  await redis.set(key, JSON.stringify(data), 'EX', ttl);
  return data;
}

Prevention 3: Background Refresh

Never let hot keys expire. A background process refreshes them before TTL runs out.

// Track keys that should be kept warm
const hotKeys = new Map(); // key → { fetchFn, ttl }

function registerHotKey(key, fetchFn, ttl) {
  hotKeys.set(key, { fetchFn, ttl });
}

// Background refresh loop
setInterval(async () => {
  for (const [key, { fetchFn, ttl }] of hotKeys) {
    const remainingTTL = await redis.ttl(key);
    // Refresh when less than 20% of TTL remains
    if (remainingTTL > 0 && remainingTTL < ttl * 0.2) {
      try {
        const data = await fetchFn();
        await redis.set(key, JSON.stringify(data), 'EX', ttl);
        console.log(`Background refreshed: ${key}`);
      } catch (err) {
        console.error(`Failed to refresh ${key}:`, err.message);
      }
    }
  }
}, 10_000); // Check every 10 seconds

// Register popular endpoints
registerHotKey('products:featured', () => db.collection('products').find({ featured: true }).toArray(), 600);
registerHotKey('stats:homepage', () => computeHomepageStats(), 300);

7. Stale-While-Revalidate Pattern

Serve stale (expired) data immediately while fetching fresh data in the background. The user gets a fast response, and the cache is updated for the next request.

async function staleWhileRevalidate(key, fetchFn, ttl = 300, staleTTL = 600) {
  // Use two keys: one for fresh data, one for stale data
  const freshKey = `fresh:${key}`;
  const staleKey = `stale:${key}`;

  // Check fresh cache first
  const fresh = await redis.get(freshKey);
  if (fresh) return JSON.parse(fresh);

  // Fresh cache expired — check stale cache
  const stale = await redis.get(staleKey);

  // Trigger background revalidation
  const revalidate = async () => {
    try {
      const data = await fetchFn();
      const serialized = JSON.stringify(data);
      await redis.set(freshKey, serialized, 'EX', ttl);
      await redis.set(staleKey, serialized, 'EX', staleTTL);
    } catch (err) {
      console.error(`Revalidation failed for ${key}:`, err.message);
    }
  };

  if (stale) {
    // Return stale data immediately, revalidate in background
    revalidate(); // fire-and-forget (no await)
    return JSON.parse(stale);
  }

  // No data at all — must wait for fresh fetch
  const data = await fetchFn();
  const serialized = JSON.stringify(data);
  await redis.set(freshKey, serialized, 'EX', ttl);
  await redis.set(staleKey, serialized, 'EX', staleTTL);
  return data;
}

This pattern is inspired by the HTTP stale-while-revalidate cache directive (covered in 6.6.c). It prioritizes perceived performance — the user never waits for a database query if any version of the data exists.


8. Common Invalidation Bugs and How to Avoid Them

Bug 1: Delete-before-write race condition

// WRONG ORDER — creates a race condition
async function updateUser(userId, data) {
  await redis.del(`user:${userId}`);    // Delete cache
  await db.updateOne({ _id: userId }, { $set: data }); // Then update DB
  // Problem: Another request reads from DB BEFORE the update completes
  //          → caches the OLD data
}

// CORRECT ORDER — always write DB first, then invalidate
async function updateUser(userId, data) {
  await db.updateOne({ _id: userId }, { $set: data }); // Update DB first
  await redis.del(`user:${userId}`);    // Then invalidate cache
  // Even if another request reads between these lines,
  // it gets the (slightly) old cached data, which will expire via TTL
}

Bug 2: Forgetting derived cache keys

// INCOMPLETE — only invalidates the direct key
async function updateProductPrice(productId, newPrice) {
  await db.updateOne({ _id: productId }, { $set: { price: newPrice } });
  await redis.del(`product:${productId}`);
  // BUG: Forgot to invalidate:
  //   - category page that shows this product's price
  //   - search results that include this product
  //   - cart page that displays line item prices
  //   - "similar products" widgets
}

// BETTER — use a registry of dependent keys
const CACHE_DEPENDENCIES = {
  'product': (id) => [
    `product:${id}`,
    `product:${id}:details`,
    `product:${id}:reviews`,
    `category:*`,           // Could contain this product
    `search:*`,             // Could contain this product
    `cart:*`,               // Could reference this product's price
  ],
};

async function invalidateEntity(entity, id) {
  const patterns = CACHE_DEPENDENCIES[entity]?.(id) || [];
  for (const pattern of patterns) {
    if (pattern.includes('*')) {
      await clearCacheByPattern(pattern);
    } else {
      await redis.del(pattern);
    }
  }
}

Bug 3: Not handling Redis failures

// DANGEROUS — if Redis.del fails, cache has stale data forever
async function updateUser(userId, data) {
  await db.updateOne({ _id: userId }, { $set: data });
  await redis.del(`user:${userId}`); // What if this throws?
}

// SAFE — catch and log, rely on TTL as safety net
async function updateUser(userId, data) {
  await db.updateOne({ _id: userId }, { $set: data });
  try {
    await redis.del(`user:${userId}`);
  } catch (err) {
    console.error(`Cache invalidation failed for user:${userId}`, err);
    // TTL will eventually expire the stale data
    // Optionally: queue a retry or alert
  }
}

Bug 4: Cache stampede after bulk invalidation

// DANGEROUS — deleting 10,000 keys at once causes 10,000 cache misses
async function onDailyPriceUpdate() {
  await clearCacheByPattern('product:*'); // 10K products
  // Next 10K requests all miss → 10K DB queries simultaneously
}

// BETTER — stagger invalidation or warm the cache
async function onDailyPriceUpdate() {
  const products = await db.collection('products').find().toArray();
  const pipeline = redis.pipeline();

  for (const product of products) {
    // Re-warm instead of invalidating
    pipeline.set(
      `product:${product._id}`,
      JSON.stringify(product),
      'EX',
      3600
    );
  }

  await pipeline.exec();
  console.log(`Re-warmed ${products.length} product caches`);
}

9. Invalidation Strategy Decision Guide

Is the data critical (financial, medical)?
  YES → Event-based invalidation + short TTL backup
  NO  ↓

How often does the data change?
  Rarely     (< 1x per hour)    → TTL of 30-60 minutes is fine
  Sometimes  (1-10x per hour)   → Event-based + moderate TTL (5-15 min)
  Frequently (> 10x per minute) → Don't cache, or use write-through
  ↓

Is the data accessed by many users simultaneously?
  YES → Add stampede prevention (locking or early expiration)
  NO  → Basic TTL is sufficient
  ↓

Is the data expensive to regenerate?
  YES → Use stale-while-revalidate (serve stale while rebuilding)
  NO  → Simple cache-aside with TTL
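
The same tree can be written down as a plain function (a sketch only; the field names and thresholds are illustrative, not from any library):

```javascript
// Encodes the decision guide above. Inputs describe the data's
// characteristics; the output names a strategy from this chapter.
function chooseStrategy({ critical, changesPerHour, manyConcurrentReaders, expensiveToRebuild }) {
  if (critical) return { strategy: 'event-based', backupTtl: 'short' };
  if (changesPerHour > 600) return { strategy: 'no-cache-or-write-through' }; // > 10x/min

  const base = changesPerHour < 1
    ? { strategy: 'ttl', ttl: '30-60 min' }
    : { strategy: 'event-based', ttl: '5-15 min' };

  return {
    ...base,
    stampedePrevention: Boolean(manyConcurrentReaders), // locking / early expiration
    staleWhileRevalidate: Boolean(expensiveToRebuild),
  };
}
```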

10. Key Takeaways

  1. Cache invalidation is hard because the cache and database are separate systems with no atomic cross-update operation.
  2. TTL is your safety net — even with event-based invalidation, always set a TTL so stale data eventually expires.
  3. Delete on write, not update on write — deleting the cache key is simpler and avoids race conditions.
  4. Always update DB first, then invalidate cache — the reverse order creates a stale-data race condition.
  5. Cache stampede kills databases — use locking, probabilistic early expiration, or background refresh for popular keys.
  6. Stale-while-revalidate gives the best user experience — serve old data fast while refreshing in the background.
  7. Track your cache dependencies — when entity X changes, know every cache key that contains data from X.

Explain-It Challenge

  1. A user updates their profile picture, but the old picture still shows for 5 minutes. Explain what happened and propose two different solutions with different tradeoffs.
  2. Your e-commerce site has a flash sale. The sale price is updated in the database, but 30% of users still see the old price. Diagnose the problem and fix it.
  3. Explain to a junior developer why "just delete the cache whenever anything changes" is not a scalable strategy.

Navigation: ← 6.6.a Redis Caching · 6.6.c — TTL Strategies →