beprodready
System Design Fundamentals

Replication: Copies, Lag, and the Failover Window

lesson 4 of 4 · 9 min

Replication is how systems survive the death of a machine: keep copies. Every interesting property — and every interesting failure — comes from one question: when does a write become "real" on the copies?

Sync vs async: the only knob that matters

  • Synchronous — the write isn't acknowledged until replicas confirm. Lose the primary and the replica is exactly current: zero data loss. The bill: every write waits for the slowest replica, and a partitioned replica can stall all writes. Your durability is now chained to your worst network path.
  • Asynchronous — acknowledge immediately, ship to replicas in the background. Fast writes, happy users — and a lag window. If the primary dies with 2 seconds of unshipped writes, those writes are gone. The lag window is the data-loss window.
  • Semi-sync (one replica confirms, others trail) buys most of the durability for some of the latency, which is why it's a production favorite.

There is no setting that gives you fast writes, zero loss, and tolerance of slow replicas. Pick which failure you can explain to your users — or their lawyers.

The failover window

Replicas don't help until something notices the primary died and promotes one. That sequence — detect (seconds of failed health checks), elect/promote, repoint clients — is the failover window, and during it writes fail or queue. A realistic budget is several seconds to a minute. Two consequences for your designs:

  1. "We have a replica" ≠ "zero downtime." Your PRR retro should state the expected window and what clients do during it (retry with backoff? queue? degrade to read-only?).
  2. Split brain is worse than downtime. If the old primary was merely partitioned — not dead — and keeps accepting writes while a new primary is promoted, you now have two divergent histories. This is why promotion needs consensus (quorums), not optimism.

Read replicas: scaling with stale data

The same copies that provide durability can serve reads — that's how a 5,000-rps primary supports 50,000 rps of reads. The price is eventual consistency: a replica is always slightly behind. The canonical symptom is a user editing their profile and reloading to see the old value, because the read hit a lagging replica. Standard mitigations:

  • Read-your-writes routing — pin a user's reads to the primary for a short window after their writes.
  • Freshness tokens — the client carries the position of its last write; reads go to any replica at least that fresh.
  • Just don't — some reads (balances, inventory at checkout) belong on the primary, period.

Quorums: who must answer?

Leaderless systems (Dynamo-style stores, etcd's reads, Cassandra) generalize the sync/async knob into two numbers: a write counts when W of N replicas confirm it; a read consults R replicas. The magic inequality is R + W > N — it guarantees every read overlaps every write on at least one replica, so you read your own writes without a primary at all. The trade-offs live in how you split the budget:

  • W=N, R=1 — durable, fast reads… and one dead replica blocks all writes.
  • W=1, R=1 — fast everything, consistency nowhere.
  • W=2, R=2, N=3 — the workhorse: strong reads, and both reads and writes survive one node failure.

Tune the three numbers and watch what each configuration buys and costs:

interactive

consistency

strong (R+W=4 > 3)

writes survive

1 node failure

reads survive

1 node failure

Every read overlaps every write on at least one replica, so reads always see the latest acknowledged write.

When a design review asks "what happens when a replica is down?", quorum math is the answer — not a vibe about replication being "pretty reliable."

Placement is half the design

Replicas in the same rack share the rack's fate. The reason beprodready challenges kill regions — not single nodes — is that real outages take out blast radii, not boxes. A replica in region B turns "the region-A database outage" from an incident into a log line. The playground below lets you set the lag, kill the primary, and watch the failover window and loss window play out — numbers, not vibes.

try it — full simulation

replication guidestep 1/3

On async mode, set lag to ~2000ms and writes to ~1000/s, then kill the primary.

waiting for you to try it…

write p50 (healthy)

5ms

failover window

~8s

writes lost on failover

1,000

primary

async · 2000ms behind

replica

✓ healthy — writes acknowledged in 5ms

Check yourself

Q1 · With async replication and 2 seconds of lag, the primary dies. What happens to the last 2 seconds of writes?

Q2 · Why not make all replication synchronous and never lose a write?

Q3 · A user updates their profile, then immediately reloads the page and sees the old value. What's the likely cause?