A cache is the highest-leverage component in system design: one box can absorb 90% of your traffic for a tenth of the cost of scaling what's behind it. It's also where designs lie to themselves, because the load-bearing number — the hit ratio — is a claim about your workload, not a property of the cache.
What a hit ratio really is
When you write "Redis, 90% hit ratio" on a design, you are claiming: nine out of ten reads ask for a key that was already asked for recently enough to still be resident. Whether that's true depends entirely on the workload:
- Popularity skew. Real-world access tends to follow Zipf-like distributions: a small set of hot keys gets most of the traffic. The more skewed the distribution, the higher the hit ratio a small cache can achieve.
- Mutability. Immutable data (a shortened URL, a published photo) can be cached forever; every repeat is a hit. Frequently-updated data forces short TTLs, which slash hits.
- Personalization. A feed that's unique per user per refresh has almost no repeated reads. Caching fragments (the post objects, the friend lists) works; caching the assembled feed mostly doesn't.
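To see how much the hit ratio depends on the workload rather than on the cache, here is a minimal sketch that replays the same key catalog through the same LRU cache and changes only the popularity skew. It assumes LRU eviction and independent draws from a Zipf-like distribution (no temporal locality, no writes); `lru_hit_ratio`, `zipf_requests`, and the skew parameter `s` are hypothetical names chosen for illustration, not part of any library.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(requests, capacity):
    """Replay a key stream through an LRU cache of `capacity` entries; return hits / requests."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(requests)


def zipf_requests(n_keys, n_requests, s, seed=0):
    """Hypothetical workload generator: key of rank r is drawn with weight 1 / r**s.

    s = 0 is uniform; larger s means a more skewed workload.
    """
    rng = random.Random(seed)
    weights = [1 / (rank ** s) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_requests)


if __name__ == "__main__":
    # Same catalog (100,000 keys), same cache (1% of the keys); only the skew changes.
    for s in (0.0, 0.8, 1.2):
        reqs = zipf_requests(n_keys=100_000, n_requests=200_000, s=s)
        print(f"skew s={s}: hit ratio {lru_hit_ratio(reqs, capacity=1_000):.1%}")
```

Under these assumptions the uniform workload barely hits at all, while the steepest skew lets a cache holding 1% of the keys serve most reads. Real traffic adds temporal locality and invalidations, so treat the numbers as a way to reason about a claimed hit ratio, not as a prediction.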
In beprodready, the hit ratio on cache components is a slider you set — and a claim the PRR grades. Setting 99% on a personalized-feed challenge without an argument is a fast way to lose the justification axis.
Misses are the design
The cache's job is to make the common case cheap. The system's job is to survive the uncommon case:
- Steady misses. At 90% hits and 10,000 rps, your origin sees 1,000 rps. Size the origin for that miss traffic, not for zero.
- The cold cache. Deploys, restarts, evictions, a new region: hit ratio is 0% until the cache warms. Your database briefly faces all the traffic. If it can't take even a few seconds of that, you need request coalescing (collapse identical concurrent misses into one origin fetch) or a warmup step — and your PRR retro should say which.
- The hot key. One viral object can exceed a single cache node's capacity all by itself. Skewed enough traffic turns "the cache layer" into "that one cache shard" — the sharding lesson picks this up.
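Request coalescing is easiest to see in code. The sketch below is a minimal, single-process version in the spirit of Go's `singleflight` package: the first caller to miss on a key becomes the leader and fetches from the origin, and callers that miss on the same key while that fetch is in flight wait and share its result. `Coalescer` and `slow_origin` are hypothetical names; a multi-node deployment would need a distributed lock or lease instead of a `threading.Lock`.

```python
import threading
import time


class Coalescer:
    """Collapse concurrent misses for the same key into one origin fetch (hypothetical helper)."""

    def __init__(self, fetch):
        self._fetch = fetch           # function: key -> value (the origin read)
        self._lock = threading.Lock()
        self._inflight = {}           # key -> (done Event, shared result dict)

    def get(self, key):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, result = entry
        if leader:
            try:
                result["value"] = self._fetch(key)
            except Exception as exc:
                result["error"] = exc  # waiters see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]  # later misses start a fresh fetch
                done.set()
        else:
            done.wait()  # follower: wait for the leader's fetch
        if "error" in result:
            raise result["error"]
        return result["value"]


if __name__ == "__main__":
    origin_calls = []

    def slow_origin(key):
        origin_calls.append(key)  # record every real origin fetch
        time.sleep(0.2)           # stand-in for a slow database read
        return f"value-for-{key}"

    coalescer = Coalescer(slow_origin)
    start = threading.Barrier(50)

    def client():
        start.wait()  # release 50 cold-cache misses at once
        coalescer.get("hot-key")

    threads = [threading.Thread(target=client) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"50 concurrent misses -> {len(origin_calls)} origin fetch(es)")
```

The design choice to delete the in-flight entry once the fetch finishes keeps the coalescer from becoming a second cache: it only deduplicates work that overlaps in time, and storing the result remains the cache's job.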
Staleness is a budget, not a bug
Every caching strategy is an answer to one question: how stale is acceptable?
| Staleness budget | Mechanism |
|---|---|
| Forever (immutable) | Cache on write, never invalidate |
| Minutes | TTL — simplest, self-healing, slightly stale |
| Seconds | Short TTL + event-driven invalidation |
| Zero | Write-through, or don't cache it |
The classic failure is choosing the mechanism before the budget — a TTL on data with a zero-staleness requirement is a correctness bug wearing a performance costume.
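As a sketch of the middle rows of the table, here is a hypothetical read-through `TTLCache` whose TTL bounds staleness on its own, plus an `invalidate` hook meant to be called from a change event (for example, after a database write). The TTL then acts as a backstop if an invalidation event is lost. The class, its methods, and the injectable `clock` are assumptions for illustration; the zero-staleness row still needs write-through or no cache at all.

```python
import time


class TTLCache:
    """Hypothetical read-through cache: entries expire after `ttl` seconds or on invalidate()."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock    # injectable so tests can control time
        self._entries = {}    # key -> (value, expires_at)

    def get(self, key, load):
        entry = self._entries.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]   # hit: still inside the staleness budget
        value = load(key)     # miss or expired: read from the origin
        self._entries[key] = (value, now + self.ttl)
        return value

    def invalidate(self, key):
        """Call from a change event to drop the cached copy before its TTL runs out."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    db = {"price": 10}
    cache = TTLCache(ttl=60)
    print(cache.get("price", db.get))  # miss: loads from db
    db["price"] = 12
    print(cache.get("price", db.get))  # stale hit until the TTL expires...
    cache.invalidate("price")          # ...unless a write event invalidates it
    print(cache.get("price", db.get))
```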
Where the cache sits
Layered caches multiply: a CDN at 70% in front of an application cache at 85% means the database sees (1−0.70) × (1−0.85) = 0.045, or 4.5%, of read traffic. That's how a $1,000/month stack serves traffic that would otherwise need $10,000/month of database. Stack the layers yourself and watch the multiplication work:
The database sees 4.5% of read traffic — layered misses multiply: (1−0.70) × (1−0.85) = 0.045.
Notice what happens when either layer's ratio slips: the database's traffic share multiplies back up. A CDN config change that drops its hit ratio from 70% to 40% doubles your database load — that's a production incident that begins life as a one-line CDN settings PR.
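The layered-miss arithmetic fits in a few lines. This sketch assumes each layer's hit ratio is independent of the layers in front of it, which is the same simplification the formula above makes; `origin_share` and `origin_rps` are hypothetical helper names.

```python
from math import prod


def origin_share(hit_ratios):
    """Fraction of reads that miss every layer and reach the origin: product of (1 - h)."""
    return prod(1 - h for h in hit_ratios)


def origin_rps(total_rps, hit_ratios):
    """Read requests per second the origin must absorb behind the given cache layers."""
    return total_rps * origin_share(hit_ratios)


if __name__ == "__main__":
    print(f"{origin_share([0.70, 0.85]):.1%}")        # 4.5%: CDN 70%, app cache 85%
    print(f"{origin_share([0.40, 0.85]):.1%}")        # 9.0%: CDN slips to 40%, load doubles
    print(f"{origin_rps(10_000, [0.90]):,.0f} rps")   # 1,000 rps: single cache at 90%
```

The last line is the steady-miss case from earlier: one cache layer at 90% still leaves the origin 1,000 rps to serve.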
The full simulation below adds the dimension the multiplier hides: time. Drag the hit ratio down and watch the origin light up, then flush the cache and meet the thundering herd.