When one machine can't hold the data or take the load, you split — by key. Sharding is the standard answer to scale, and the standard source of 2 a.m. incidents, because the split is only as good as the key.
The mechanics
Pick a shard key (user ID, post ID), apply a function, get a shard:
- Hash sharding —
shard = hash(key) % N. Spreads keys uniformly; kills range queries ("all posts from June" now touches every shard). - Range sharding — keys A–F here, G–M there. Preserves range scans; concentrates load wherever the workload is concentrated — which for time-ordered keys means all new writes hit the last shard.
- Directory sharding — a lookup service maps keys to shards. Maximum flexibility, plus a new component that can fail.
Uniform load is the promise. The interactive sim below shows the promise being kept — set skew to 0% and watch shards share perfectly as you add them.
The hot key problem
Now drag the skew slider. A celebrity profile, a viral post, a flash-sale SKU: real traffic concentrates on a few keys, and a key lives on exactly one shard. The brutal consequence, which you can verify in the playground: adding shards does nothing for a hot key. Sixteen shards, and shard 1 still melts while fifteen idle — you're paying 16x for 1x of relief.
Real fixes attack the key, not the count:
- Cache the hot object. The cheapest and usually right answer — a hot key is by definition wildly cacheable (same key, repeated reads). This is why the caching lesson precedes this one.
- Salt the key. Split
celebrity_42intocelebrity_42#1..8spread across shards; readers fan out and merge. Costs read complexity, buys write/read spread. - Special-case hot entities. Detect them (your metrics should make hot keys visible) and route them to dedicated capacity.
A design that answers "hot partition?" with "add more shards" fails the availability axis of any serious review — now you can see why.
Resharding: the bill arrives later
Shard count looks like a free parameter on day one. It isn't — changing it later means moving live data under traffic:
hash(key) % Nreassigns nearly every key when N changes. Consistent hashing fixes this: only ~1/N of keys move.- Plan shard count with growth headroom (e.g., 4x current need) and use logical shards (many small virtual shards mapped onto few physical nodes) so "resharding" becomes "remapping," not "rewriting."
See the difference, key by key — every red square below is live data that has to migrate when you add one shard:
interactive
Adding one shard moves 85/100 keys — hash % N reassigns almost everything. Every red square is live data migrating under traffic.
This is the whole argument for consistent hashing in one picture: with hash % N, adding capacity is a near-total data migration; with a ring, it's a small, bounded handover to the new shard.
What to say in a design review
The sharding paragraph of a strong justification names four things: the shard key and why it distributes well for this workload, the lookup pattern it preserves (point reads? ranges?), the hot-key story, and the resharding plan. Miss the hot-key story and the next viral moment writes it for you.