
[BEE-9005] Cache Stampede and Thundering Herd

INFO

When a cache key expires under heavy load, many concurrent requests can race to regenerate it, overwhelming the database. This article defines the problem and covers the mitigation strategies every backend engineer should know.

Context

A cache is only useful when it absorbs traffic before it reaches the database. But the moment a hot cache key expires, the cache momentarily becomes a liability: the first request to arrive finds nothing and goes to the database, and before that query returns, hundreds or thousands of identical requests do the same. This is a cache stampede (also called cache miss storm).

The thundering herd is the broader family of problems where a large cohort of clients or workers simultaneously try to acquire the same scarce resource -- examples include a service restart that boots with cold caches, a mass connection retry after a network blip, or a fleet of workers woken up by the same event.

Both problems share the same structure: a single triggering event causes N independent actors to compete for the same backend resource at the same time, with N often being large enough to cause database overload or service outage.

The Stampede Scenario

A popular product page is cached under a single hot key with a five-minute TTL. The moment that key expires, the first request misses and queries the database; before that query returns, every other concurrent request for the same product also misses and issues the identical query. The database receives hundreds or thousands of copies of a query it was meant to see once per expiry window.
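
The vulnerable pattern is the plain cache-aside read. A minimal sketch of it, using the same hypothetical cache and db clients as the examples below -- note that nothing between the miss and the set serializes regeneration:

python
def get_product_unprotected(product_id: str) -> dict:
    key = f"product:{product_id}"

    value = cache.get(key)
    if value:
        return value  # cache hit -- fast path

    # Cache miss: every concurrent request that reaches this point issues
    # the same query, because nothing serializes regeneration of the key.
    value = db.query("SELECT * FROM products WHERE id = ?", product_id)
    cache.set(key, value, ex=300)
    return value
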
The Fix: Only One Request Regenerates

Principle

Protect every hot cache key against concurrent regeneration. Choose a strategy proportional to your traffic: mutex locking for simplicity, singleflight/coalescing for in-process concurrency, probabilistic early expiration for lock-free resilience, and stale-while-revalidate for zero-latency tolerance.

Mitigation Strategies

1. Mutex / Distributed Lock

The simplest protection: only the request that acquires the lock regenerates the value. All others wait, then read from cache once the lock is released.

python
import time

LOCK_TTL = 10  # seconds -- must exceed worst-case DB query time

def get_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    lock_key = f"lock:{key}"

    # Fast path: cache hit
    value = cache.get(key)
    if value:
        return value

    # Slow path: acquire lock to regenerate
    acquired = cache.set(lock_key, "1", nx=True, ex=LOCK_TTL)  # atomic SETNX

    if acquired:
        try:
            value = db.query("SELECT * FROM products WHERE id = ?", product_id)
            cache.set(key, value, ex=300)
            return value
        finally:
            cache.delete(lock_key)  # always release
    else:
        # Did not win the lock -- wait for the winner to populate cache
        for _ in range(20):
            time.sleep(0.1)
            value = cache.get(key)
            if value:
                return value
        # Fallback: return stale data or raise a graceful error
        return get_stale_or_default(product_id)

Trade-offs: Simple to reason about. Under very high concurrency, waiters may retry several times. If the lock holder crashes before releasing, other requests are blocked until the TTL expires -- which is why the lock TTL is mandatory.

2. Request Coalescing (Singleflight)

Collapse all in-flight requests for the same key into a single backend call. The first caller executes the fetch; every subsequent caller for the same key blocks and shares the result when it arrives.

This is the Go singleflight pattern, but the concept applies in any language.

go
// Go -- using golang.org/x/sync/singleflight
var group singleflight.Group

func GetProduct(productID string) (*Product, error) {
    key := "product:" + productID

    // All concurrent calls with the same key share one DB query
    result, err, _ := group.Do(key, func() (interface{}, error) {
        cached, err := cache.Get(key)
        if err == nil {
            return cached, nil
        }
        product, err := db.QueryProduct(productID)
        if err != nil {
            return nil, err
        }
        cache.Set(key, product, 5*time.Minute)
        return product, nil
    })
    if err != nil {
        return nil, err
    }
    return result.(*Product), nil
}
typescript
// TypeScript -- manual coalescing map
const inFlight = new Map<string, Promise<Product>>();

async function getProduct(productId: string): Promise<Product> {
  const key = `product:${productId}`;

  if (inFlight.has(key)) {
    return inFlight.get(key)!; // join the existing in-flight request
  }

  const promise = (async () => {
    const cached = await cache.get(key);
    if (cached) return cached;

    const product = await db.queryProduct(productId);
    await cache.set(key, product, { ttl: 300 });
    return product;
  })().finally(() => inFlight.delete(key));

  inFlight.set(key, promise);
  return promise;
}

Trade-offs: Works within a single process. In a distributed system with many nodes, each node coalesces its own traffic but still sends one query per node. Combine with a distributed lock if cross-node coalescing is needed.
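
The two approaches compose: coalesce within each process so a node issues at most one regeneration attempt, and let that single attempt go through the lock-protected getter from strategy 1. A minimal Python sketch under that assumption (get_product is the mutex-protected function defined above; the coalescing map itself is illustrative):

python
import threading
from concurrent.futures import Future

_in_flight: dict[str, Future] = {}
_in_flight_lock = threading.Lock()

def get_product_coalesced(product_id: str) -> dict:
    key = f"product:{product_id}"

    value = cache.get(key)
    if value:
        return value  # fast path: fresh cache hit

    # Per-process coalescing: only one thread per node regenerates the key.
    with _in_flight_lock:
        future = _in_flight.get(key)
        owner = future is None
        if owner:
            future = Future()
            _in_flight[key] = future

    if not owner:
        return future.result()  # join the in-flight regeneration

    try:
        # Cross-node protection: the one regenerating thread on this node
        # still goes through the distributed lock from strategy 1.
        value = get_product(product_id)
        future.set_result(value)
        return value
    except Exception as exc:
        future.set_exception(exc)
        raise
    finally:
        with _in_flight_lock:
            _in_flight.pop(key, None)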

3. Probabilistic Early Expiration (XFetch)

Recompute the cache value before it expires with a probability that grows as the expiry approaches. No lock needed -- each process independently decides whether to refresh based on a random draw.

The algorithm from Vattani et al. 2015:

recompute = time.now() - (delta * beta * ln(random(0,1))) >= expiry
  • delta -- time the last recomputation took (in seconds)
  • beta -- tuning parameter (default 1.0; raise to recompute earlier)
  • expiry -- absolute Unix timestamp when the key expires
python
import math
import random
import time

BETA = 1.0  # 1.0 is optimal for most workloads; increase to be more aggressive

def get_product_xfetch(product_id: str) -> dict:
    key = f"product:{product_id}"
    entry = cache.get_with_metadata(key)  # returns {value, expiry, delta}

    if entry:
        # Decide probabilistically whether to recompute now
        early_recompute = (
            time.time() - entry["delta"] * BETA * math.log(random.random())
            >= entry["expiry"]
        )
        if not early_recompute:
            return entry["value"]

    # Cache miss or early recompute triggered
    start = time.time()
    value = db.query("SELECT * FROM products WHERE id = ?", product_id)
    delta = time.time() - start  # measure actual fetch time

    expiry = time.time() + 300
    cache.set_with_metadata(key, value, expiry=expiry, delta=delta)
    return value

Trade-offs: Lock-free and highly scalable. Multiple processes may occasionally recompute simultaneously (that is intentional -- it avoids a single point of failure). Works best for keys with measurable fetch time. See Cache stampede -- Wikipedia for the formal derivation.

4. Stale-While-Revalidate

Serve the stale cached value immediately while refreshing in the background. From the user's perspective, there is no latency spike. As long as a stale copy exists, a stampede cannot occur: the nx= refresh flag below ensures only one background job refreshes each key, and only a truly cold key falls back to a synchronous fetch.

python
import threading

def get_product_swr(product_id: str) -> dict:
    key = f"product:{product_id}"
    stale_key = f"stale:{key}"

    value = cache.get(key)
    if value:
        return value  # fresh cache -- done

    stale = cache.get(stale_key)
    if stale:
        # Return stale value, trigger background refresh once
        refresh_key = f"refresh:{key}"
        if cache.set(refresh_key, "1", nx=True, ex=30):
            threading.Thread(
                target=refresh_product, args=(product_id,), daemon=True
            ).start()
        return stale

    # Truly cold -- must synchronously fetch
    return refresh_product(product_id)

def refresh_product(product_id: str) -> dict:
    key = f"product:{product_id}"
    value = db.query("SELECT * FROM products WHERE id = ?", product_id)
    cache.set(key, value, ex=300)             # fresh TTL
    cache.set(f"stale:{key}", value, ex=600)  # stale copy lives longer
    return value

Trade-offs: Best user experience -- zero added latency. Requires storing a stale copy with a longer TTL. Not suitable for data that must never be served stale (e.g., account balances, inventory counts).

5. Jittered TTL

Prevent synchronized mass expiry by adding random jitter to every TTL. Keys written at the same moment (e.g., by a cache warm-up job or after a deploy) would otherwise all expire in the same second and miss together; jitter spreads those expirations out.

python
import random

BASE_TTL = 300   # 5 minutes
MAX_JITTER = 60  # ±60 seconds

def set_product_cache(product_id: str, value: dict) -> None:
    ttl = BASE_TTL + random.randint(-MAX_JITTER, MAX_JITTER)
    cache.set(f"product:{product_id}", value, ex=ttl)

Use jittered TTL as a baseline practice for all caches, not as a primary stampede prevention mechanism.

6. Cache Warming on Deploy

A service restart boots with all caches cold. Before shifting traffic to the new instance, pre-populate the cache for the most critical keys.

bash
# Deployment script: warm cache before switching load balancer
./scripts/cache-warm.sh --keys product_catalog,homepage,featured_products
# Wait for warm-up confirmation, then proceed with traffic shift
./scripts/lb-switch.sh

Or at application startup:

python
def on_application_startup():
    top_products = db.query("SELECT id FROM products ORDER BY views DESC LIMIT 1000")
    for product in top_products:
        value = db.query("SELECT * FROM products WHERE id = ?", product.id)
        cache.set(f"product:{product.id}", value, ex=300)
    logger.info("Cache warm-up complete. Ready for traffic.")

Common Mistakes

1. No stampede protection on high-traffic keys

The most frequent mistake. Any key that receives concurrent traffic at expiry time is vulnerable. Audit your hottest keys and ensure at least one protection mechanism is in place.
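
One lightweight way to find those keys is to count reads per key at the application layer and periodically log the hottest ones. A rough sketch, assuming reads go through a shared wrapper (the names here are illustrative, not part of any library):

python
import threading
from collections import Counter

_read_counts = Counter()
_counts_lock = threading.Lock()

def cache_get_counted(key: str):
    # Wrapper around cache.get that tallies reads per key.
    with _counts_lock:
        _read_counts[key] += 1
    return cache.get(key)

def log_hottest_keys(top_n: int = 20) -> None:
    # Run periodically (e.g., from a background timer), then reset the window.
    with _counts_lock:
        hottest = _read_counts.most_common(top_n)
        _read_counts.clear()
    for key, reads in hottest:
        logger.info("hot cache key: %s (%d reads this window)", key, reads)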

2. Lock without timeout (deadlock on holder crash)

python
# WRONG -- lock never expires, holder crash causes permanent block
cache.set(lock_key, "1", nx=True)  # no ex= argument

# CORRECT -- lock expires after worst-case recomputation time
cache.set(lock_key, "1", nx=True, ex=10)

3. Uniform TTL across all keys (synchronized mass expiry)

If a cache warm-up job writes 50,000 keys all with TTL=300, they all expire at the same second five minutes later. Always add jitter.

4. Not warming cache after deploy (cold start stampede)

After every deploy, your caches are cold. If you shift traffic immediately, every request on every node is a miss. Warm before switching traffic.

5. Mutex without fallback

If lock acquisition fails repeatedly (e.g., the lock holder is extremely slow), requests will either queue indefinitely or error. Always implement a fallback: return stale data, a default response, or a graceful degradation rather than hanging or crashing.

python
# Fallback hierarchy
def get_product_safe(product_id: str) -> dict:
    try:
        # Assumes a variant of the strategy-1 getter that raises
        # LockTimeoutError when its lock wait is exhausted.
        return get_product_with_lock(product_id)
    except LockTimeoutError:
        stale = cache.get(f"stale:product:{product_id}")
        if stale:
            return stale  # serve stale rather than fail
        return {"id": product_id, "error": "temporarily unavailable"}

Strategy Comparison

Strategy                   | Complexity | Cross-Node       | User Latency                 | Use When
Mutex / distributed lock   | Low        | Yes              | Added wait for non-winners   | Simplicity is priority; DB queries are slow
Singleflight / coalescing  | Low        | No (per-process) | Shared wait for first result | High concurrency within one process/node
XFetch (probabilistic)     | Medium     | Yes (lock-free)  | None                         | High read throughput; can tolerate occasional duplicate fetch
Stale-while-revalidate     | Medium     | Yes              | None                         | Staleness is acceptable; user-facing latency is critical
Jittered TTL               | Very low   | Yes              | None                         | Always apply as baseline
Cache warming              | Medium     | Yes              | None                         | Deploys and service restarts

For most production systems, apply jittered TTL universally and add singleflight at the application layer. Reserve distributed locks or XFetch for the small set of extremely hot keys that drive disproportionate DB load.

References