Replication: Copies, Lag, and the Failover Window

Replication is how systems survive the death of a machine: keep copies. Every interesting property — and every interesting failure — comes from one question: when does a write become "real" on the copies?

Sync vs async: the only knob that matters

Synchronous — the write isn't acknowledged until replicas confirm. Lose the primary and the replica is exactly current: zero data loss. The bill: every write waits for the slowest replica, and a partitioned replica can stall all writes. Your durability is now chained to your worst network path.
Asynchronous — acknowledge immediately, ship to replicas in the background. Fast writes, happy users — and a lag window. If the primary dies with 2 seconds of unshipped writes, those writes are gone. The lag window is the data-loss window.
Semi-sync (one replica confirms, others trail) buys most of the durability for some of the latency, which is why it's a production favorite.

There is no setting that gives you fast writes, zero loss, and tolerance of slow replicas. Pick which failure you can explain to your users — or their lawyers.

The failover window

Replicas don't help until something notices the primary died and promotes one. That sequence — detect (seconds of failed health checks), elect/promote, repoint clients — is the failover window, and during it writes fail or queue. A realistic budget is several seconds to a minute. Two consequences for your designs:

"We have a replica" ≠ "zero downtime." Your PRR retro should state the expected window and what clients do during it (retry with backoff? queue? degrade to read-only?).
Split brain is worse than downtime. If the old primary was merely partitioned — not dead — and keeps accepting writes while a new primary is promoted, you now have two divergent histories. This is why promotion needs consensus (quorums), not optimism.

Read replicas: scaling with stale data

The same copies that provide durability can serve reads — that's how a 5,000-rps primary supports 50,000 rps of reads. The price is eventual consistency: a replica is always slightly behind. The canonical symptom is a user editing their profile and reloading to see the old value, because the read hit a lagging replica. Standard mitigations:

Read-your-writes routing — pin a user's reads to the primary for a short window after their writes.
Freshness tokens — the client carries the position of its last write; reads go to any replica at least that fresh.
Just don't — some reads (balances, inventory at checkout) belong on the primary, period.

Quorums: who must answer?

Leaderless systems (Dynamo-style stores, etcd's reads, Cassandra) generalize the sync/async knob into two numbers: a write counts when W of N replicas confirm it; a read consults R replicas. The magic inequality is R + W > N — it guarantees every read overlaps every write on at least one replica, so you read your own writes without a primary at all. The trade-offs live in how you split the budget:

W=N, R=1 — durable, fast reads… and one dead replica blocks all writes.
W=1, R=1 — fast everything, consistency nowhere.
W=2, R=2, N=3 — the workhorse: strong reads, and both reads and writes survive one node failure.

Tune the three numbers and watch what each configuration buys and costs:

interactive

Replicas (N)3Write quorum (W)2Read quorum (R)2

consistency

strong (R+W=4 > 3)

writes survive

1 node failure

reads survive

1 node failure

Every read overlaps every write on at least one replica, so reads always see the latest acknowledged write.

When a design review asks "what happens when a replica is down?", quorum math is the answer — not a vibe about replication being "pretty reliable."

Placement is half the design

Replicas in the same rack share the rack's fate. The reason beprodready challenges kill regions — not single nodes — is that real outages take out blast radii, not boxes. A replica in region B turns "the region-A database outage" from an incident into a log line. The playground below lets you set the lag, kill the primary, and watch the failover window and loss window play out — numbers, not vibes.

Sync vs async: the only knob that matters

Synchronous — the write isn't acknowledged until replicas confirm. Lose the primary and the replica is exactly current: zero data loss. The bill: every write waits for the slowest replica, and a partitioned replica can stall all writes. Your durability is now chained to your worst network path.
Asynchronous — acknowledge immediately, ship to replicas in the background. Fast writes, happy users — and a lag window. If the primary dies with 2 seconds of unshipped writes, those writes are gone. The lag window is the data-loss window.
Semi-sync (one replica confirms, others trail) buys most of the durability for some of the latency, which is why it's a production favorite.

There is no setting that gives you fast writes, zero loss, and tolerance of slow replicas. Pick which failure you can explain to your users — or their lawyers.

The failover window

"We have a replica" ≠ "zero downtime." Your PRR retro should state the expected window and what clients do during it (retry with backoff? queue? degrade to read-only?).
Split brain is worse than downtime. If the old primary was merely partitioned — not dead — and keeps accepting writes while a new primary is promoted, you now have two divergent histories. This is why promotion needs consensus (quorums), not optimism.

Read replicas: scaling with stale data

Read-your-writes routing — pin a user's reads to the primary for a short window after their writes.
Freshness tokens — the client carries the position of its last write; reads go to any replica at least that fresh.
Just don't — some reads (balances, inventory at checkout) belong on the primary, period.

Quorums: who must answer?

W=N, R=1 — durable, fast reads… and one dead replica blocks all writes.
W=1, R=1 — fast everything, consistency nowhere.
W=2, R=2, N=3 — the workhorse: strong reads, and both reads and writes survive one node failure.

Tune the three numbers and watch what each configuration buys and costs:

interactive

Replicas (N)3Write quorum (W)2Read quorum (R)2

consistency

strong (R+W=4 > 3)

writes survive

1 node failure

reads survive

1 node failure

Every read overlaps every write on at least one replica, so reads always see the latest acknowledged write.

When a design review asks "what happens when a replica is down?", quorum math is the answer — not a vibe about replication being "pretty reliable."

Replication: Copies, Lag, and the Failover Window

Sync vs async: the only knob that matters

The failover window

Read replicas: scaling with stale data

Quorums: who must answer?

Placement is half the design

Check yourself

Replication: Copies, Lag, and the Failover Window

Sync vs async: the only knob that matters

The failover window

Read replicas: scaling with stale data

Quorums: who must answer?

Placement is half the design

Check yourself