Replication is how systems survive the death of a machine: keep copies. Every interesting property — and every interesting failure — comes from one question: when does a write become "real" on the copies?
Sync vs async: the only knob that matters
- Synchronous — the write isn't acknowledged until replicas confirm. Lose the primary and the replica is exactly current: zero data loss. The bill: every write waits for the slowest replica, and a partitioned replica can stall all writes. Your durability is now chained to your worst network path.
- Asynchronous — acknowledge immediately, ship to replicas in the background. Fast writes, happy users — and a lag window. If the primary dies with 2 seconds of unshipped writes, those writes are gone. The lag window is the data-loss window.
- Semi-sync (one replica confirms, others trail) buys most of the durability for some of the latency, which is why it's a production favorite.
There is no setting that gives you fast writes, zero loss, and tolerance of slow replicas. Pick which failure you can explain to your users — or their lawyers.
The failover window
Replicas don't help until something notices the primary died and promotes one. That sequence — detect (seconds of failed health checks), elect/promote, repoint clients — is the failover window, and during it writes fail or queue. A realistic budget is several seconds to a minute. Two consequences for your designs:
- "We have a replica" ≠ "zero downtime." Your PRR retro should state the expected window and what clients do during it (retry with backoff? queue? degrade to read-only?).
- Split brain is worse than downtime. If the old primary was merely partitioned — not dead — and keeps accepting writes while a new primary is promoted, you now have two divergent histories. This is why promotion needs consensus (quorums), not optimism.
Read replicas: scaling with stale data
The same copies that provide durability can serve reads — that's how a 5,000-rps primary supports 50,000 rps of reads. The price is eventual consistency: a replica is always slightly behind. The canonical symptom is a user editing their profile and reloading to see the old value, because the read hit a lagging replica. Standard mitigations:
- Read-your-writes routing — pin a user's reads to the primary for a short window after their writes.
- Freshness tokens — the client carries the position of its last write; reads go to any replica at least that fresh.
- Just don't — some reads (balances, inventory at checkout) belong on the primary, period.
Quorums: who must answer?
Leaderless systems (Dynamo-style stores, etcd's reads, Cassandra) generalize the sync/async knob into two numbers: a write counts when W of N replicas confirm it; a read consults R replicas. The magic inequality is R + W > N — it guarantees every read overlaps every write on at least one replica, so you read your own writes without a primary at all. The trade-offs live in how you split the budget:
- W=N, R=1 — durable, fast reads… and one dead replica blocks all writes.
- W=1, R=1 — fast everything, consistency nowhere.
- W=2, R=2, N=3 — the workhorse: strong reads, and both reads and writes survive one node failure.
Tune the three numbers and watch what each configuration buys and costs:
interactive
consistency
strong (R+W=4 > 3)
writes survive
1 node failure
reads survive
1 node failure
Every read overlaps every write on at least one replica, so reads always see the latest acknowledged write.
When a design review asks "what happens when a replica is down?", quorum math is the answer — not a vibe about replication being "pretty reliable."
Placement is half the design
Replicas in the same rack share the rack's fate. The reason beprodready challenges kill regions — not single nodes — is that real outages take out blast radii, not boxes. A replica in region B turns "the region-A database outage" from an incident into a log line. The playground below lets you set the lag, kill the primary, and watch the failover window and loss window play out — numbers, not vibes.