
From One Box to Millions: How Photo Feeds Actually Scale

Every photo feed in history has walked the same road: one server, then a panicked database split, then caches, then shards. Walk the four versions, see exactly why each one fell over, and meet the bottleneck before it meets you.

v1: One box (0 → 10k users)


Every successful product starts here, and it is the correct architecture for this stage: one app server doing everything (web serving, image handling, business logic), with the database living on the same machine and sharing its deploys and restarts.

At 10k users (~50 rps peak), this box loafs along under 5% utilization. The monthly bill is a few hundred dollars. Anyone proposing microservices at this stage is solving a problem they don't have with money they shouldn't spend — exactly what a Production Readiness Review penalizes on the cost axis.

The crack that's coming: the app and the database compete for the same machine, and the database is stateful. Every deploy risks the data; a traffic spike on the app starves the DB of CPU. The first 2 a.m. page is already scheduled — it just doesn't have a date yet.

↓ then the crack opened, and the architecture answered

v2: Split tiers (10k → 200k users)


Growth arrives (~1,000 rps peak) and the single box starts swapping. The first real architectural act: separate compute from state. App servers become stateless and horizontally scalable; kill one and nobody notices. The database gets its own properly sized machine, and a load balancer fronts the app tier.
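To make "kill one, nobody notices" concrete, here is a minimal sketch of a round-robin load balancer in front of stateless app servers. The class name and the server names are hypothetical, and a real load balancer would add health checks and connection draining; this only shows that requests keep flowing once a box is removed, because no server holds state the others lack.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy load balancer: rotates requests across whichever app servers are alive."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._counter = count()

    def remove(self, server):
        # Simulates a box being killed or failing a health check.
        self.servers.remove(server)

    def route(self, request_id):
        if not self.servers:
            raise RuntimeError("no app servers available")
        # Any server can take any request because the app tier is stateless.
        server = self.servers[next(self._counter) % len(self.servers)]
        return server, request_id


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
first = [balancer.route(i)[0] for i in range(3)]

balancer.remove("app-2")  # kill one box
after = [balancer.route(i)[0] for i in range(4)]

print(first)  # ['app-1', 'app-2', 'app-3']
print(after)  # only app-1 and app-3 appear
```

The database cannot be treated this way, which is the asymmetry the next paragraph describes: removing it is not a non-event.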

This version has a long, comfortable life. It's also where teams learn the discipline that pays forever: the app tier scales by adding boxes; the database scales by getting bigger — and only one of those curves is sustainable.

The crack that's coming: feeds are ~90% reads, and every single read hits the database. You can buy a bigger DB instance two, maybe three times. Then you run out of "bigger."

↓ then the crack opened, and the architecture answered

v3: The cache era (200k → 2M users)


At ~5,000 rps the database is the bottleneck, and the fix is the highest-leverage move in systems engineering: stop asking the database questions it already answered. A CDN absorbs image and static traffic at the edge; an application cache (Redis/Memcached) absorbs the hot read path — profiles, post metadata, precomputed feed fragments.
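One common way to "stop asking the database questions it already answered" is the cache-aside pattern: check the cache first, fall through to the database on a miss, then store the result. The sketch below is an illustration under assumptions, not the article's implementation: `TTLCache` is an in-process stand-in for Redis or Memcached, and `load_profile_from_db`, the key format, and the 60-second TTL are all hypothetical.

```python
import time


class TTLCache:
    """In-process stand-in for Redis/Memcached: key -> (value, expiry time)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


db_calls = {"count": 0}  # counts how often the "database" is actually queried


def load_profile_from_db(user_id):
    # Hypothetical database read; a real one would run a query.
    db_calls["count"] += 1
    return {"user_id": user_id, "name": f"user-{user_id}"}


def get_profile(cache, user_id):
    key = f"profile:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached  # cache hit: the database is never touched
    profile = load_profile_from_db(user_id)  # cache miss: ask once
    cache.set(key, profile)  # remember the answer for later readers
    return profile


cache = TTLCache(ttl_seconds=60)
for _ in range(100):
    get_profile(cache, 42)

print(db_calls["count"])  # 1: one miss, then 99 hits
```

The TTL is the design choice to watch: a longer one raises the hit ratio but serves staler profiles.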

With a 70% CDN hit ratio and 85% application-cache hits, the database sees under 5% of read traffic: only requests that miss both layers reach it, and 0.30 × 0.15 = 4.5%. The same database that was dying now yawns. This is why the caching lesson insists that hit ratios are claims to be verified, not facts. This entire architecture stands on those two numbers being real.
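A few lines of arithmetic show how sensitive the database is to those two numbers. This sketch assumes the hit ratios compound, meaning a read reaches the database only if it misses both layers, which is the reading that yields the "under 5%" figure. The degraded 70% application-cache ratio is a hypothetical scenario, and the 90% read share is carried over from the v2 discussion.

```python
def db_read_fraction(cdn_hit_ratio, app_cache_hit_ratio):
    """Fraction of reads that miss both layers and reach the database."""
    return (1 - cdn_hit_ratio) * (1 - app_cache_hit_ratio)


def db_reads_per_second(peak_rps, read_share, cdn_hit_ratio, app_cache_hit_ratio):
    """Read queries per second that land on the database at peak."""
    return peak_rps * read_share * db_read_fraction(cdn_hit_ratio, app_cache_hit_ratio)


planned = db_read_fraction(0.70, 0.85)   # the article's numbers
degraded = db_read_fraction(0.70, 0.70)  # hypothetical: app cache slips to 70%

print(f"planned:  {planned:.1%}")   # 4.5%
print(f"degraded: {degraded:.1%}")  # 9.0%
print(f"db reads/s at 5,000 rps: {db_reads_per_second(5000, 0.90, 0.70, 0.85):.1f}")
```

A 15-point drop in the application-cache hit ratio doubles the read load on the database, which is why the ratios need to be measured rather than assumed.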

The crack that's coming: two, actually. If the caches come back cold after a restart, every read falls through and the database faces roughly 20x its steady load, because it normally sees under 5% of reads (the thundering herd). And write volume, which no cache absorbs, keeps growing with users. The database's days as a single machine are numbered.
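The article does not say here how the thundering herd gets tamed. One widely used mitigation is request coalescing (often called "single-flight"): when many requests miss on the same key at once, one of them loads from the database and the rest wait for its result. The sketch below is an assumption-laden illustration; the class names, the key, and the simulated 0.5-second load are hypothetical.

```python
import threading
import time


class _Call:
    """Bookkeeping for one in-progress load."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent loads of the same key into a single loader call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> _Call

    def do(self, key, loader):
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._in_flight[key] = call
        if leader:
            try:
                call.result = loader()  # only the leader hits the database
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                call.done.set()  # wake every follower
        else:
            call.done.wait()  # followers block until the leader finishes
        if call.error is not None:
            raise call.error
        return call.result


db_calls = {"count": 0}


def slow_feed_query():
    # Hypothetical expensive query against a cold cache.
    db_calls["count"] += 1
    time.sleep(0.5)
    return ["post-1", "post-2"]


flight = SingleFlight()
barrier = threading.Barrier(20)
results = []


def worker():
    barrier.wait()  # release all 20 "requests" at the same moment
    results.append(flight.do("feed:celebrity", slow_feed_query))


threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results), db_calls["count"])
```

In this run all 20 callers receive the same feed, and the query is expected to execute once, since every follower arrives while the leader's load is still in progress. Coalescing handles one hot key at a time; a fully cold cache usually also calls for gradual warm-up.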

↓ then the crack opened, and the architecture answered

v4: Shards and regions (2M+ users)


Past a few million users two separate walls arrive at once. Write volume exceeds what one primary can absorb — so the data shards by user ID across multiple databases. And the business now loses real money per minute of downtime — so storage spreads across regions, with the cache tier replicated and the feed read path able to survive an entire region's databases going dark.
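Both moves in this stage can be sketched briefly: routing a user to a shard by hashing the user ID, and a read path that falls back to another region when the first one's databases are dark. Everything named here is an assumption for illustration: the shard count of 8, the SHA-256 routing, the region names, and the `fetch` callables are hypothetical, and the article does not specify its hashing scheme.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Map a user ID to a shard with a stable hash (same answer on every host)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class RegionDown(Exception):
    """Raised when a region's databases cannot be reached."""


def read_feed(user_id, regions):
    """Try each (region_name, fetch) pair in order; return the first that answers."""
    shard = shard_for(user_id)
    for region_name, fetch in regions:
        try:
            return region_name, fetch(shard, user_id)
        except RegionDown:
            continue  # this region is dark: try the next one
    raise RegionDown("no region could serve the feed")


def dark_region(shard, user_id):
    raise RegionDown("region databases are unreachable")


def healthy_region(shard, user_id):
    return [f"post-for-{user_id}-from-shard-{shard}"]


# Users spread roughly evenly across shards.
users_per_shard = Counter(shard_for(uid) for uid in range(10_000))
print(sorted(users_per_shard.items()))

# The read path survives the first region going dark.
served_by, feed = read_feed(42, [("region-a", dark_region), ("region-b", healthy_region)])
print(served_by)  # region-b
```

Plain modulo routing keeps the sketch short, but it remaps most users whenever the shard count changes; schemes such as consistent hashing or a lookup directory are the usual ways around that.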

Notice what did not change: the v3 shape is intact. Each evolution added a layer to the same skeleton. That is what good architecture buys: the next stage is an extension, not a rewrite.

The crack that's coming (there's always one): the hot-key problem. Sharding spreads users; it does nothing for one celebrity whose every post melts a single shard. That story — and its fixes — is the sharding lesson, and it's where the linked challenge picks up.
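The imbalance is easy to see with made-up numbers. The sketch below assumes 10,000 ordinary users whose posts each draw 10 reads, plus one celebrity whose posts draw 500,000; the figures, the shard count, and the hash-based routing are all hypothetical and chosen only to show the shape of the problem.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(user_id, num_shards=NUM_SHARDS):
    """Stable hash routing from user ID to shard."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


reads = Counter()

# Ordinary users: 10 reads each, spread evenly by the hash.
for user_id in range(10_000):
    reads[shard_for(user_id)] += 10

# One celebrity: every read of their posts lands on a single shard.
CELEBRITY_ID = 7
hot_shard = shard_for(CELEBRITY_ID)
reads[hot_shard] += 500_000

hottest, hottest_reads = reads.most_common(1)[0]
busiest_other = max(n for shard, n in reads.items() if shard != hot_shard)

print(f"hot shard {hottest}: {hottest_reads} reads")
print(f"busiest other shard: {busiest_other} reads")
```

Adding shards does not help: the celebrity's reads still map to exactly one of them, however many exist.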

You've seen every version and every crack. Now it's your turn — same problem, your design, graded against real traffic and failures.

Design it yourself →