Testing Caching Strategies in Real Conditions
Comparing cache-aside, write-through, and read-through strategies with measured hit rates, latency, and consistency trade-offs under production traffic patterns.
Context
I tested three caching strategies on a product catalog API serving 15,000 requests/minute. The API reads from Postgres and returns product details. The question was not whether to cache, but which caching pattern would deliver the best hit rate without introducing stale data issues.
Problem
Caching strategies are well-documented in theory. In practice, cache hit rates depend on traffic patterns (Zipfian vs uniform), write frequency, TTL configuration, and cache size constraints. I needed measurements against real traffic to choose correctly.
Constraints
- Cache: Redis 7, single node, 256MB memory limit
- Origin: Postgres 15, product catalog with 50,000 SKUs
- Traffic pattern: 80% of requests hit 20% of products (Zipfian distribution confirmed via access logs)
- Write frequency: 200 product updates/hour (price changes, inventory updates)
- Acceptable staleness: 60 seconds maximum for price data
- Cache key space: product ID (50,000 possible keys)
- Read/write ratio: 99:1
Design
Strategy 1: Cache-Aside (Lazy Loading)
Application checks cache first. On miss, reads from database and populates cache.
```typescript
async function getProduct(id: string) {
  const cached = await redis.get(`product:${id}`);
  if (cached) return JSON.parse(cached);
  const product = await db.query('SELECT * FROM products WHERE id = $1', [id]);
  await redis.set(`product:${id}`, JSON.stringify(product), 'EX', 60);
  return product;
}
```
Strategy 2: Write-Through
Every write updates both the database and the cache in the same code path. (The two operations are not atomic; the failure modes section covers what happens when one succeeds and the other fails.)
```typescript
async function updateProduct(id: string, data: ProductUpdate) {
  await db.query('UPDATE products SET ... WHERE id = $1', [id, ...]);
  const updated = await db.query('SELECT * FROM products WHERE id = $1', [id]);
  await redis.set(`product:${id}`, JSON.stringify(updated), 'EX', 300);
}

async function getProduct(id: string) {
  const cached = await redis.get(`product:${id}`);
  if (cached) return JSON.parse(cached);
  const product = await db.query('SELECT * FROM products WHERE id = $1', [id]);
  await redis.set(`product:${id}`, JSON.stringify(product), 'EX', 300);
  return product;
}
```
Strategy 3: Read-Through (Cache as primary read path)
A caching proxy sits between the application and database. The application only talks to the cache.
```typescript
class ReadThroughCache {
  async get(id: string): Promise<Product> {
    const cached = await redis.get(`product:${id}`);
    if (cached) return JSON.parse(cached);
    const product = await this.loader(id);
    await redis.set(`product:${id}`, JSON.stringify(product), 'EX', 60);
    return product;
  }

  private async loader(id: string) {
    return db.query('SELECT * FROM products WHERE id = $1', [id]);
  }
}
```
On writes, the cache entry is invalidated (deleted), and the next read triggers a reload.
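The write-side invalidation can be modeled end to end with in-memory stand-ins: a minimal sketch, where a Map stands in for Redis and another for the Postgres origin (both hypothetical simplifications, not the production clients):

```typescript
// Read-through with write-side invalidation, modeled with in-memory
// stand-ins: one Map for the cache (Redis) and one for the origin (Postgres).
type Product = { id: string; price: number };

const origin = new Map<string, Product>([['p1', { id: 'p1', price: 10 }]]);
const cache = new Map<string, Product>();

async function readProduct(id: string): Promise<Product> {
  const cached = cache.get(id);
  if (cached) return cached;        // hit: serve from cache
  const product = origin.get(id)!;  // miss: loader reads from the origin
  cache.set(id, product);           // populate for subsequent reads
  return product;
}

async function writeProduct(id: string, price: number): Promise<void> {
  origin.set(id, { id, price });    // write to the origin first
  cache.delete(id);                 // invalidate; the next read repopulates,
                                    // so the cache never holds a value the
                                    // loader did not produce
}
```

Deleting rather than updating on write is the key property: the cache can only ever contain values produced by the loader, which keeps the invalidation logic independent of the write payload.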
Trade-offs
Performance Results (7-day production traffic)
| Metric | Cache-Aside | Write-Through | Read-Through |
|---|---|---|---|
| Hit rate (overall) | 82% | 91% | 84% |
| Hit rate (top 20% products) | 94% | 98% | 95% |
| p50 read latency | 2ms (hit) / 18ms (miss) | 1.5ms (hit) / 18ms (miss) | 2ms (hit) / 18ms (miss) |
| p95 read latency | 5ms / 35ms | 4ms / 35ms | 5ms / 35ms |
| Write latency overhead | 0ms | +3ms (cache write) | +1ms (cache delete) |
| Max staleness observed | 60s (TTL bound) | 0s (write-through) | 60s (TTL bound) |
| Database load reduction | 78% | 88% | 80% |
Why Write-Through Won on Hit Rate
Write-through maintains higher hit rates because:
- Updated products are immediately available in cache (no miss-then-reload cycle)
- Longer TTLs are safe because the cache is always updated on writes
- No "thundering herd" on popular products after TTL expiry
The 9% hit rate advantage over cache-aside translates to 1,350 fewer database queries per minute at 15,000 req/min.
Staleness Analysis
| Strategy | Staleness Window | Stale Reads During Test |
|---|---|---|
| Cache-Aside (60s TTL) | 0-60s after write | ~3,200 over 7 days |
| Write-Through (300s TTL) | 0s (write updates cache) | 0 for written products |
| Read-Through (60s TTL) | 0-60s after invalidation | ~2,800 over 7 days |
Write-through eliminates staleness for products that are actively updated. The remaining staleness risk is for products updated through a different code path that bypasses the cache update logic.
Memory Usage
| Strategy | Peak Memory | Evictions/Hour |
|---|---|---|
| Cache-Aside (60s TTL) | 45MB | 0 |
| Write-Through (300s TTL) | 120MB | 12 |
| Read-Through (60s TTL) | 48MB | 0 |
Write-through with longer TTLs uses more memory because entries live longer. The 256MB limit was not reached, but at 200,000 SKUs with 300s TTL, it would be.
Failure Modes
Cache-aside: thundering herd on TTL expiry. When a popular product's cache entry expires, multiple concurrent requests all miss the cache and all query the database simultaneously. At 15,000 req/min with the top product receiving 5% of traffic, that is 750 req/min. If the TTL expires, 12+ requests may hit the database in the same second. Mitigation: stale-while-revalidate pattern or distributed locks on cache population.
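The lock-based mitigation can be sketched as single-flight cache population: concurrent misses for the same key share one in-flight load instead of each querying the database. This is a minimal in-process sketch; `loadFromDb` is a hypothetical stand-in for the real query (a cross-instance version would need a distributed lock):

```typescript
// Single-flight: the first miss starts the load; concurrent misses for
// the same key await the same in-flight promise instead of querying.
const inflight = new Map<string, Promise<string>>();
let dbQueries = 0; // counter to demonstrate the effect

async function loadFromDb(id: string): Promise<string> {
  dbQueries++;
  await new Promise((resolve) => setTimeout(resolve, 10)); // simulated query latency
  return `product-${id}`;
}

async function getSingleFlight(id: string): Promise<string> {
  const pending = inflight.get(id);
  if (pending) return pending; // piggyback on the in-flight load
  const load = loadFromDb(id).finally(() => inflight.delete(id));
  inflight.set(id, load);
  return load;
}
```

With this in place, the 12+ concurrent misses after a popular key expires collapse into a single database query.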
Write-through: cache-database inconsistency on partial failure. If the database write succeeds but the cache write fails, the cache holds stale data. With a 300s TTL, staleness lasts up to 5 minutes. Mitigation: wrap both operations in a try-catch, and on cache write failure, delete the cache entry instead.
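That fallback can be sketched as follows, with a hypothetical `Cache` interface standing in for the Redis client; the database write is assumed to have already succeeded by this point:

```typescript
// Hypothetical minimal cache interface (stand-in for the Redis client).
type Cache = {
  set(key: string, value: string): Promise<void>;
  del(key: string): Promise<void>;
};

// Write-through with a fallback: if the cache write fails after the
// database write succeeded, delete the entry so the next read misses
// and reloads fresh data, instead of serving a stale value for up to
// the full TTL.
async function writeThrough(cache: Cache, key: string, value: string): Promise<void> {
  try {
    await cache.set(key, value);
  } catch {
    // Best-effort delete; if this also fails, the TTL is the backstop.
    await cache.del(key).catch(() => {});
  }
}
```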
Read-through: cache stampede after invalidation. Invalidating a popular product's cache entry causes the same thundering herd problem as cache-aside TTL expiry. The mitigation is the same: probabilistic early expiration or lock-based single-flight cache population.
All strategies: Redis failure. If Redis is unavailable, cache-aside and read-through degrade to direct database reads (acceptable). Write-through may block writes if the cache update is in the critical path. Mitigation: make the cache write fire-and-forget with a timeout.
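One way to bound the cache write, sketched with a hypothetical helper where `write` stands in for the actual Redis `SET` call:

```typescript
// Bound the cache write: wait at most `ms`, then move on. The write may
// still complete in the background; errors are swallowed so a Redis
// outage cannot fail the caller's request.
async function cacheWriteWithTimeout(
  write: () => Promise<unknown>,
  ms: number,
): Promise<void> {
  await Promise.race([
    write().catch(() => {}), // ignore cache errors entirely
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
  ]);
}
```

The trade-off is that a timed-out or failed cache write silently leaves a stale entry behind, which is why pairing this with the delete-on-failure fallback above matters for write-through.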
Scaling Considerations
- At 100,000 req/min, the write-through strategy's database load reduction (88%) means 12,000 database queries/min instead of 100,000. This is the difference between needing a read replica and not.
- Redis cluster mode supports horizontal scaling, but adds complexity for cache invalidation across shards.
- For multi-region deployments, each region needs its own cache. Cross-region cache invalidation adds latency. Consider region-local caches with shorter TTLs instead.
- Cache warming on deployment: pre-populate the top 1,000 products on application startup to avoid a cold-cache thundering herd.
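The warming step in the last bullet can be sketched as follows; `topIds`, `load`, and the Map-based cache are hypothetical stand-ins for an access-log ranking, the product loader, and Redis:

```typescript
// Cache warming on startup: pre-populate the hottest products in small
// batches so the first wave of traffic after a deploy hits a warm cache
// without hammering the database all at once.
async function warmCache(
  topIds: string[],
  load: (id: string) => Promise<string>,
  cache: Map<string, string>,
  concurrency = 10,
): Promise<void> {
  for (let i = 0; i < topIds.length; i += concurrency) {
    const batch = topIds.slice(i, i + concurrency);
    const values = await Promise.all(batch.map(load)); // bounded parallelism
    batch.forEach((id, j) => cache.set(id, values[j]));
  }
}
```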
Observability
- Track cache hit rate per product tier (top 100, top 1000, long tail)
- Monitor cache memory usage and eviction rate
- Log cache misses that result in database queries exceeding 50ms (these are candidates for cache warming)
- Alert on hit rate dropping below 80% (indicates a configuration or traffic pattern change)
- Measure end-to-end latency including cache lookup, not just database query time
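The per-tier tracking from the first bullet can be sketched like this, with a hypothetical tier-assignment function derived from access-log rankings:

```typescript
// Per-tier hit-rate tracking: bucket each lookup by popularity tier and
// keep separate hit/miss counters, so a healthy overall rate cannot
// hide a cold long tail.
class TieredHitRate {
  private stats = new Map<string, { hits: number; misses: number }>();

  constructor(private tierOf: (id: string) => string) {}

  record(id: string, hit: boolean): void {
    const tier = this.tierOf(id);
    const s = this.stats.get(tier) ?? { hits: 0, misses: 0 };
    if (hit) s.hits++; else s.misses++;
    this.stats.set(tier, s);
  }

  hitRate(tier: string): number {
    const s = this.stats.get(tier);
    return s ? s.hits / (s.hits + s.misses) : 0;
  }
}
```

In production these counters would feed a metrics backend rather than live in memory, but the bucketing logic is the part that matters.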
Key Takeaways
- Write-through caching delivered the highest hit rate (91%) and eliminated staleness for actively updated products. The cost is a 3ms write latency overhead.
- Cache-aside is the simplest strategy but suffers from thundering herd problems on TTL expiry and lower hit rates.
- TTL configuration is the most impactful tuning parameter. Too short reduces hit rate; too long increases staleness.
- Cache failure must degrade gracefully. Never let a cache outage cascade into a full system outage.
- Measure hit rates by access pattern (popular vs long-tail), not just overall. A 90% overall hit rate can mask a 30% hit rate on the long tail.
Final Thoughts
I deployed write-through caching with a 300-second TTL and a fallback to direct database reads on Redis failure. The 91% hit rate reduced database load by 88%, deferring the need for a read replica by an estimated 6 months at current growth rates. The 3ms write latency overhead was invisible to users. The primary ongoing cost is maintaining cache update logic in every write path, which is a code organization challenge, not a performance one.