Incident: The Retry Storm That Took Down Payments

An 8-second network hiccup that failed 30% of payment requests turned into a 45-minute outage because clients retried immediately with no backoff, multiplying load by 4x. Diagnose the retry amplification cascade, fix the retry logic with exponential backoff and jitter, add a circuit breaker on payment-db, and propose system-level defenses that prevent this failure mode from being possible.

incident-response · retry-logic · circuit-breaker · exponential-backoff · cascading-failures

Reference walkthrough

The retry storm is a textbook positive feedback loop: a brief network hiccup causes failures, failures trigger retries, retries amplify load, and amplified load causes more failures, which cause more retries. The math is simple but the consequence is severe: retry-count=3 and retry-delay=0ms mean each failed request generates 4 simultaneous attempts. At a steady-state 2,000 rps with 100% failures during saturation, the effective load is 8,000 rps, 60% above the 5,000 rps capacity. The feedback loop is self-sustaining because the amplified load keeps payment-api saturated, which keeps generating failures, which keeps generating retries.

The primary fix is exponential backoff with full jitter on the client side. Exponential backoff (base=100ms, multiplier=2) spreads retries over 700ms instead of 0ms. Full jitter (random(0, min(cap, base × 2^attempt))) prevents synchronized retry spikes, the thundering herd. With jitter, the 9,600 simultaneous retries are spread over ~700ms, reducing peak retry load from 8,000 rps to roughly 4,000-5,000 rps, which is within capacity.

The circuit breaker is the second line of defense: when the payment-db error rate exceeds 50% in a 10-second window, payment-api fails fast with 503 instead of queuing threads that will time out. This prevents connection pool exhaustion from cascading API-wide. Circuit state lives in Redis so all payment-api instances see the same state.

System-level prevention requires three independent defenses. First, per-user rate limiting (rate:payment:{user_id} with INCR + EXPIRE 60s) caps the retry amplification per user: even with retry-count=3, once a user exceeds the limit for the window, further retries return 429 immediately instead of reaching the DB. Second, a retry budget at the LB (if > 20% of requests in 10s carry an idempotency_key seen before, shed the excess with 429) caps the system-wide retry multiplier regardless of client configuration. Third, the circuit breaker on payment-db (circuit:payment-db in Redis, open at 50% error rate, half-open probe after 30s) breaks the API-DB cascade before connection pool exhaustion occurs.

Key architecture decisions

The choices that separate a passing design from full credit.

1

Retry amplification math: the feedback loop mechanism

retry-count=3 and retry-delay=0ms mean each failed request generates 3 additional immediate retries: 4 total attempts at effectively the same instant. If a fraction failure_rate of requests is failing, and a failing request also fails on each of its retries, the effective load is baseRps × (1 + retry_count × failure_rate). At 100% failure rate (full saturation): 2,000 × 4 = 8,000 rps. This is a sustained level, not a one-off spike: as long as the API returns errors quickly, clients retry quickly enough to hold the load at 4x indefinitely. The feedback loop exits only when something absorbs the excess (backoff spreads the load, a rate limiter sheds it, or capacity is added).
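The amplification formula can be checked with a few lines of Python. This is a minimal sketch: the function name effective_load and the constant CAPACITY_RPS are illustrative, and it assumes (as the formula does) that a request which fails also fails on every retry.

```python
def effective_load(base_rps: float, retry_count: int, failure_rate: float) -> float:
    """Offered load when every failed request is retried `retry_count` times.

    Assumption: a request that fails also fails on each of its retries,
    so it costs 1 + retry_count attempts in total.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return base_rps * (1 + retry_count * failure_rate)


CAPACITY_RPS = 5_000  # payment-api capacity quoted in the incident

load = effective_load(base_rps=2_000, retry_count=3, failure_rate=1.0)
overload = load / CAPACITY_RPS - 1  # fraction above capacity
print(f"{load:.0f} rps, {overload:.0%} over capacity")  # 8000 rps, 60% over capacity
```

Under the same assumption, the initial 30% failure rate gives 2,000 × 1.9 = 3,800 rps, which is still under capacity; the overload appears only once the failure rate is pushed well beyond that.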

2

Jitter is not optional — constant backoff creates thundering herd

Exponential backoff without jitter: all clients that failed at T=0 retry at exactly T=100ms (simultaneously), then T=300ms (simultaneously), then T=700ms (simultaneously). Three synchronized spikes, each potentially at 8,000 rps. With full jitter: each client samples independently from random(0, base×2^attempt), spreading retries uniformly over the window. No two clients retry at exactly the same instant. The synchronized spike becomes a smooth curve. Jitter transforms a thundering herd into a gradual recovery.
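The difference between synchronized and jittered retries can be illustrated with a small simulation. This is a sketch under stated assumptions: every client fails at T=0, each makes three retries, and load is measured as the busiest 10 ms bucket. The function names and the bucket size are illustrative choices, not part of the incident data.

```python
import random
from collections import Counter

BASE_MS = 100
CAP_MS = 30_000


def backoff_ms(attempt: int) -> int:
    """Deterministic exponential delay for retry number `attempt` (0-based)."""
    return min(CAP_MS, BASE_MS * 2 ** attempt)


def full_jitter_ms(attempt: int, rng: random.Random) -> float:
    """Full jitter: uniform sample from [0, min(cap, base * 2^attempt)]."""
    return rng.uniform(0, backoff_ms(attempt))


def peak_per_10ms(clients: int, jitter: bool, seed: int = 1) -> int:
    """Largest number of retries landing in any single 10 ms bucket."""
    rng = random.Random(seed)
    buckets = Counter()
    for _ in range(clients):
        t = 0.0  # assumption: every client failed at T=0
        for attempt in range(3):
            t += full_jitter_ms(attempt, rng) if jitter else backoff_ms(attempt)
            buckets[int(t // 10)] += 1
    return max(buckets.values())


print("no jitter  :", peak_per_10ms(9_600, jitter=False))
print("full jitter:", peak_per_10ms(9_600, jitter=True))
```

Without jitter, the busiest bucket contains every client, because all of them land on T=100ms, T=300ms, and T=700ms together. With full jitter the busiest bucket holds only a fraction of the clients.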

3

Circuit breaker half-open state enables self-healing

A circuit breaker that can only open (and requires a human to close) is an operational burden. The half-open state enables self-healing: after the open window expires, one probe request is sent to the DB. If the probe succeeds (DB recovered), the circuit closes and normal traffic resumes — no human intervention. If the probe fails, the circuit re-opens for another window. This is the mechanism that lets the system recover from a transient DB overload without on-call intervention, typically within 60-90 seconds of DB recovery.

4

Defense in depth: three independent layers prevent any single layer from being the only safeguard

Rate limiting (per-user), retry budget (system-wide at the LB), and circuit breaker (per downstream service) serve different threat models. Rate limiting catches abusive or misconfigured clients. The retry budget catches storms where many well-behaved clients all retry simultaneously (the coordinated failure case). The circuit breaker catches downstream saturation before it cascades to the API tier. Any one of these alone would have prevented the 45-minute outage. All three together make a retry storm an operational non-event.
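The per-user layer can be sketched as a fixed-window counter. The class below is a minimal in-memory stand-in that mirrors the INCR + EXPIRE pattern on a rate:payment:{user_id} key; the class name, the injectable clock, and the dict-backed store are assumptions made for illustration, and a real deployment would keep the counters in Redis so all replicas share them.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Per-user fixed-window counter, mirroring Redis INCR + EXPIRE semantics."""

    def __init__(self, limit: int, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # key -> (count, expiry time); stands in for the Redis keyspace
        self._counters: Dict[str, Tuple[int, float]] = {}

    def allow(self, user_id: str) -> bool:
        """Count one attempt; return False when the caller should answer 429."""
        key = f"rate:payment:{user_id}"
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if count == 0 or now >= expires_at:
            # First hit in a new window: INCR creates the key, EXPIRE starts the TTL.
            count, expires_at = 0, now + self.window_s
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self.limit


limiter = FixedWindowLimiter(limit=10)  # e.g. 10 payments per minute per user
print(limiter.allow("user-42"))
```

A fixed window is the simplest option; it allows short bursts at window boundaries, which is usually acceptable for capping retry amplification.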

Common mistakes

What most candidates get wrong on this challenge.

Adding a constant retry delay (retry-delay: 500ms) without exponential backoff — constant delays still produce synchronized retry spikes every 500ms. The load profile becomes a square wave instead of a spike, but the peaks still exceed capacity. Exponential growth of the delay is required to drain the queue before the next retry wave arrives.

Building the circuit breaker with per-instance in-memory state — if payment-api has 4 replicas and each has its own circuit state, 3 replicas may be unaware that instance 1 tripped the circuit and keep hammering the DB. Shared state in Redis ensures all instances open and close together. The cost is one Redis GET per DB call — entirely acceptable.

Using a request queue to absorb retry storms without bounding queue depth — an unbounded queue accumulates millions of payment requests during a sustained storm. When the queue drains (hours later), it generates a second traffic spike. Bounded queues with shedding (reject when queue depth > threshold) are the right pattern for latency-sensitive services. Queuing is appropriate for async workloads, not synchronous payment processing.
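A bounded queue with shedding can be sketched in a few lines. The class name, the max_depth parameter, and the shed counter are illustrative; the point is only that a full queue rejects new work (so the caller can return 429) instead of growing.

```python
from collections import deque
from typing import Deque, Optional


class BoundedQueue:
    """FIFO queue that rejects new work instead of growing without bound."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._items: Deque[str] = deque()
        self.shed = 0  # count of rejected requests, useful as a metric

    def offer(self, request_id: str) -> bool:
        """Enqueue and return True, or return False when the caller should send 429."""
        if len(self._items) >= self.max_depth:
            self.shed += 1
            return False
        self._items.append(request_id)
        return True

    def poll(self) -> Optional[str]:
        """Dequeue the oldest request, or None when the queue is empty."""
        return self._items.popleft() if self._items else None


q = BoundedQueue(max_depth=2)
print([q.offer(r) for r in ("a", "b", "c")])  # [True, True, False]
```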

What full-credit looks like

The retry amplification cascade is correctly diagnosed
scalability · weight 3×

Full credit requires the complete cascade chain argued with numbers: (1) retry multiplier computed — retry-count=3 means 4 total attempts per failed request = 4x load multiplier; (2) retried load vs. capacity stated — 8,000 rps vs. 5,000 rps capacity = 60% over; (3) feedback loop mechanism explained — amplified load keeps API saturated, which keeps generating failures, which keeps generating retries, which keeps load at 4x. Partial credit if retries are identified as the cause but the math is missing.

Full credit: Retry multiplier computed (4x = 1 + 3 retries), load vs. capacity stated (8k vs 5k), feedback loop mechanism described (failed retries cause more failures cause more retries).

Retry logic fixed with exponential backoff and jitter
operability · weight 2×

Full credit requires: exponential backoff formula (base=100ms, multiplier=2, max_delay=30s), full jitter described with the mechanism (each client independently samples from random(0, backoff_interval) so no two clients retry at exactly the same instant), and a comparison of peak load with vs. without jitter. Must explicitly state that constant backoff (e.g., retry-delay: 500ms) does NOT solve the thundering herd — it creates a square wave of spikes instead of one spike. Partial credit if backoff is added but jitter is omitted or described without the thundering herd explanation.

Full credit: Exponential backoff formula stated, jitter mechanism explained (random spread prevents synchronized spikes), peak load comparison with and without jitter shown.

Circuit breaker protects payment-db from cascading failures
availability · weight 2×

Full credit requires all four circuit breaker elements: (1) open threshold — 50% error rate in a 10-second window (or p99 latency threshold); (2) open duration — 30 seconds of fast-fail 503; (3) half-open state — one probe request after the open window, circuit closes on success, re-opens on failure; (4) shared Redis state — circuit:payment-db key visible to all payment-api replicas simultaneously. Missing any one of these loses partial credit. Per-instance in-memory state is explicitly wrong — score zero for that.

Full credit: All four elements present — open threshold (error rate %), open duration (seconds), half-open probe state, shared Redis state across instances.

Defense-in-depth prevention argued from numbers
justification-quality · weight 2×

Full credit requires three defenses each with a named mechanism and what-it-prevents: (1) per-user rate limiting with threshold (e.g., 10 payments/min/user) — prevents individual user retry amplification; (2) retry budget at LB with percentage threshold (e.g., reject retries when > 20% of traffic) — prevents system-wide load multiplier above 1.25x; (3) circuit breaker on payment-db — prevents DB saturation from cascading to API connection pool exhaustion. Monitoring section should name at least two signals that fire within 60 seconds of the incident starting. Partial credit if defenses are named but mechanisms and numbers are absent.

Full credit: Three defenses named with mechanism and what-it-prevents for each (rate limiter caps per-user retry amplification, retry budget caps system-wide multiplier, circuit breaker breaks DB cascade). At least one number per defense (e.g., rate limit threshold, retry budget percentage, circuit breaker threshold).

How to approach this challenge

The same phase-by-phase guide shown during solving — with answers.

1

Phase 1 — Diagnose the Cascade

Before proposing a fix, understand exactly what happened and why the load didn't drop after the network recovered. The retry storm is a positive feedback loop: failed requests cause retries, retries increase load, increased load causes more failures, more failures cause more retries. The key question is: what broke the loop?

10 min · 3 questions

Q1

The network recovered after 8 seconds. Why was the payment-api still overloaded 45 minutes later?

▸ answer

The retries arrived immediately (retry-delay=0ms). At the moment the network recovered, the 9,600 failed requests were all being retried simultaneously — generating 4× the normal load (8,000 rps). This 8,000 rps is above payment-api's 5,000 rps capacity. So the payment-api was now overloaded by the retries, causing new failures, which triggered new retries, keeping the load at 8,000 rps. The network hiccup set a self-sustaining load amplification loop in motion.

Q2

What is the retry load multiplier? Show the math.

▸ answer

retry-count=3, retry-delay=0ms. Each failed request generates 3 retries fired immediately. Total attempts per failed request = 1 (original) + 3 (retries) = 4. All 4 arrive at the same instant. During the outage, 100% of 2,000 rps is failing (because the API is saturated). So the effective load = 2,000 × 4 = 8,000 rps — 60% above the 5,000 rps capacity.

Q3

What would have broken the feedback loop?

▸ answer

Any mechanism that spreads the retries over time: (1) exponential backoff — each retry fires later than the last, spreading 9,600 simultaneous retries over 700ms; (2) rate limiting — caps retries per user so the total load can't exceed capacity; (3) a circuit breaker on the client — after N consecutive failures, stop sending until the circuit resets. Without any of these, the system has no mechanism to self-regulate.

Deliverable

One paragraph in your overall defense: the retry amplification math (4x multiplier), why the feedback loop was self-sustaining (each cycle of failures caused a new cycle of retries), and which specific mechanism was absent that would have prevented it.

2

Phase 2 — Fix the Retry Logic

Replace the immediate retry (retry-delay=0ms) with exponential backoff and jitter. Understand why jitter is not optional: without it, all clients that failed at the same moment retry at the same moment — a synchronized thundering herd that recreates the saturation spike.

10 min · 3 questions

Q1

What is exponential backoff and why does it help?

▸ answer

Exponential backoff: retry 1 waits 100ms, retry 2 waits 200ms, retry 3 waits 400ms, so the third retry fires about 700ms after the original failure. Instead of 9,600 retries arriving at the same instant, they arrive spread over roughly 700ms. Spreading alone is not the whole benefit: 9,600 retries averaged over 0.7s is still about 13,700 retries/s. The key insight is that the retries arrive in successive waves, each smaller than the last, because requests that succeed on an earlier retry drop out, and the growing gaps give payment-api time to drain in-flight work before the next wave arrives.

Q2

Why is jitter required? What breaks without it?

▸ answer

Without jitter, every client that failed at T=0 retries at T=100ms (all simultaneously), then at T=300ms (all simultaneously), then at T=700ms (all simultaneously). You get three synchronized spikes instead of one continuous spike: the thundering herd just has a different shape. With full jitter, each client replaces the fixed delay with a random one drawn from the whole backoff window (e.g., retry 1 waits random(0, 100ms) rather than exactly 100ms), spreading retries uniformly over the window. No synchronized spikes. Full jitter formula: sleep = random(0, min(cap, base × 2^attempt)), with attempt counted from 0.

Q3

What retry-count is appropriate, and when should a client give up entirely?

▸ answer

3 retries with exponential backoff is reasonable for transient network errors. However, payment failures require user feedback: retrying indefinitely without surfacing the failure to the user is wrong. After 3 retries, return a failure to the user with a message like "payment is taking longer than expected; check your payment history." The idempotency_key means a successful eventual delivery won't double-charge. Time budget: total time spent retrying should stay under the user's patience threshold (~30s).
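A client-side retry wrapper with these limits might look like the sketch below. TransientError and call_with_retries are hypothetical names, not part of any payment SDK; the sleep and rng parameters are injectable only so the behaviour can be tested without waiting. The wrapped operation is assumed to reuse the same idempotency_key on every attempt.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s)."""


def call_with_retries(
    op: Callable[[], T],
    max_retries: int = 3,
    base_s: float = 0.1,
    cap_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> T:
    """Run `op`, retrying transient failures with full-jitter exponential backoff."""
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the user
            # Full jitter: wait a uniform random time in [0, min(cap, base * 2^attempt)].
            sleep(rng.uniform(0, min(cap_s, base_s * 2 ** attempt)))
    raise AssertionError("unreachable")
```

With the defaults, the three waits are drawn from at most 100ms, 200ms, and 400ms, so the wrapper gives up well inside the ~30s patience threshold.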

Deliverable

Client retry config in your overall defense: base=100ms, multiplier=2, max_delay=30s, jitter=full, max_retries=3. Show the timing of retries 1-3 for a request that failed at T=0. Without jitter: T+100ms, T+300ms, T+700ms. With full jitter, the three waits are drawn from random(0, 100ms), random(0, 200ms), and random(0, 400ms) in turn, so retry 3 lands no later than T+700ms. Contrast the peak load with this config vs. the original 0ms delay.

3

Phase 3 — Circuit Breaker for payment-db

When payment-db is overloaded, payment-api threads block waiting for DB responses that never come (or come after 5s timeouts). Blocked threads hold DB connections. Connection pool exhaustion cascades API-wide. A circuit breaker that opens at 50% error rate fails fast and lets the DB recover — breaking the cascade at the right layer.

15 min · 3 questions

Q1

What triggers the circuit to open?

▸ answer

Error rate threshold: if more than 50% of DB calls in the last 10s failed (timeout or error), open the circuit. Latency threshold: if p99 DB latency exceeds 2s, open. Once open, all DB calls fail immediately with 503 — no waiting, no connection pool usage. The threshold window (10s) must be short enough to react to an incident quickly but long enough to avoid flapping on transient errors.

Q2

What is the half-open state and why is it needed?

▸ answer

After the circuit has been open for 30 seconds, transition to half-open: allow exactly one probe request to pass through to the DB. If the probe succeeds (DB responded in < 500ms), close the circuit and resume normal operation. If the probe fails, re-open for another 30 seconds. Without half-open, you need a human to manually close the circuit. With half-open, the circuit self-heals once the DB recovers — no pager needed if the DB recovers within its normal self-healing window.

Q3

Where does the circuit breaker state live?

▸ answer

In Redis (circuit:payment-db key). Why not in-memory? If each payment-api instance maintains its own in-memory circuit state, they may be in different states — one instance trips the circuit, the others don't know and keep hammering the DB. Shared state in Redis means all payment-api instances see the same circuit state. Use Redis SET EX for the open state (auto-expires after 30s) and SET for closed/half-open.

Deliverable

Circuit breaker state machine in your overall defense: closed → open (on 50% error rate in 10s window) → half-open (after 30s) → closed (on successful probe) or open (on failed probe). The circuit:payment-db key in Redis stores the state. All payment-api instances read this key before each DB call.
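The state machine can be sketched as a small class. This version keeps state in process memory purely so it runs standalone; in the design described here, the state would live under the circuit:payment-db key in Redis so all replicas share it. The min_calls guard and the injectable clock are assumptions added for the sketch, not part of the original design.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """closed -> open -> half-open -> closed (probe ok) or open (probe failed)."""

    def __init__(self, threshold: float = 0.5, window_s: float = 10.0,
                 open_s: float = 30.0, min_calls: int = 10,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold, self.window_s, self.open_s = threshold, window_s, open_s
        self.min_calls = min_calls  # assumption: avoid tripping on tiny samples
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_in_flight = False
        self._results: Deque[Tuple[float, bool]] = deque()  # (time, succeeded)

    def allow(self) -> bool:
        """Return True if the DB call may proceed, False to fail fast with 503."""
        now = self.clock()
        if self.state == "open" and now - self.opened_at >= self.open_s:
            self.state, self.probe_in_flight = "half-open", False
        if self.state == "half-open":
            if self.probe_in_flight:
                return False
            self.probe_in_flight = True  # exactly one probe gets through
            return True
        return self.state == "closed"

    def record(self, succeeded: bool) -> None:
        """Report the outcome of a call that allow() let through."""
        now = self.clock()
        if self.state == "open":
            return  # late results from before the trip are ignored
        if self.state == "half-open":
            self.state = "closed" if succeeded else "open"
            self.opened_at = now
            self.probe_in_flight = False
            self._results.clear()
            return
        self._results.append((now, succeeded))
        while self._results and now - self._results[0][0] > self.window_s:
            self._results.popleft()
        failures = sum(1 for _, ok in self._results if not ok)
        if (len(self._results) >= self.min_calls
                and failures / len(self._results) > self.threshold):
            self.state, self.opened_at = "open", now
```

Callers wrap each DB call: check allow() first, return 503 immediately when it is False, and otherwise report the outcome with record().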

4

Phase 4 — System-Level Prevention

What architectural changes make it impossible for this failure mode to recur? Individual fixes (better retry logic, circuit breaker) help, but the root issue is that the system has no mechanism to shed excess load. Defense in depth: rate limiting, request queuing with shedding, and retry budgets at the API gateway level prevent any single component from being overwhelmed by amplified retries.

10 min · 3 questions

Q1

What is a retry budget and how does it prevent storms at scale?

▸ answer

A retry budget is a cap on the percentage of traffic that is retries, measured at the API gateway level. Example: if more than 20% of requests in the last 10s are retries (identified by idempotency_key seen before), the gateway rejects them with 429. This caps the retry load multiplier at 1.25x regardless of client retry configuration — even if clients retry 10 times, only 25% extra load reaches the service. Implement at the LB or API gateway, not in individual services.
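A gateway-side retry budget could be sketched as follows. The class and method names are illustrative, and the sketch makes two simplifying assumptions: a request counts as a retry only if its idempotency_key was admitted earlier in the same window, and rejected requests are not counted toward the window total.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class RetryBudget:
    """Shed retries once they exceed `ratio` of requests admitted in the window."""

    def __init__(self, ratio: float = 0.2, window_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ratio, self.window_s, self.clock = ratio, window_s, clock
        self._admitted: Deque[Tuple[float, str, bool]] = deque()  # (time, key, is_retry)
        self._key_counts: Dict[str, int] = {}
        self._retries = 0

    def _expire(self, now: float) -> None:
        """Drop entries older than the window and update the counters."""
        while self._admitted and now - self._admitted[0][0] > self.window_s:
            _, key, was_retry = self._admitted.popleft()
            self._retries -= was_retry
            self._key_counts[key] -= 1
            if self._key_counts[key] == 0:
                del self._key_counts[key]

    def admit(self, idempotency_key: str) -> bool:
        """Return True to forward the request, False to answer 429."""
        now = self.clock()
        self._expire(now)
        is_retry = idempotency_key in self._key_counts
        total_after = len(self._admitted) + 1
        if is_retry and (self._retries + 1) / total_after > self.ratio:
            return False
        self._admitted.append((now, idempotency_key, is_retry))
        self._key_counts[idempotency_key] = self._key_counts.get(idempotency_key, 0) + 1
        self._retries += is_retry
        return True
```

First attempts are always admitted; only repeats are shed, so well-behaved traffic is unaffected until retries pass 20% of the window.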

Q2

Should you use a request queue with shedding, and when?

▸ answer

A request queue (SQS/in-memory) absorbs bursts but adds latency. For payments (latency-sensitive), a queue is only appropriate if: (1) the burst is temporary (< 60s) and (2) users are informed that processing is async. For the retry storm scenario, the burst lasted 45 minutes: a queue would have accumulated millions of backlogged payments and taken hours to drain. Better to reject excess with 429 + Retry-After and let clients self-regulate with proper backoff. Queuing is not a substitute for backoff; it only delays the problem.

Q3

What monitoring would have caught this before 45 minutes elapsed?

▸ answer

Three signals that fire within the first 60 seconds: (1) payment-api error rate alert (> 1% for > 30s), (2) payment-api request rate alert (> 120% of baseline), (3) payment-db connection pool utilization alert (> 80%). If any of these alert to an on-call engineer within 60 seconds of the hiccup, they can manually open the circuit breaker or add capacity before the feedback loop stabilizes. The 45-minute outage is a monitoring failure as much as an architecture failure.
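The three thresholds can be written down as a simple check. This is a sketch only: the Sample fields and the function name are illustrative, and the "for > 30s" duration condition on the error-rate alert is omitted, since it would normally be handled by the alerting system evaluating the rule over time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One monitoring snapshot; field names are illustrative."""
    error_rate: float           # fraction of payment-api requests failing
    request_rps: float          # current payment-api request rate
    baseline_rps: float         # expected rate for this time of day
    db_pool_utilization: float  # fraction of payment-db connections in use


def firing_alerts(s: Sample) -> List[str]:
    """Return the names of the alerts whose thresholds `s` crosses."""
    alerts = []
    if s.error_rate > 0.01:
        alerts.append("payment-api error rate > 1%")
    if s.request_rps > 1.2 * s.baseline_rps:
        alerts.append("payment-api request rate > 120% of baseline")
    if s.db_pool_utilization > 0.80:
        alerts.append("payment-db connection pool > 80%")
    return alerts


storm = Sample(error_rate=0.30, request_rps=8_000, baseline_rps=2_000,
               db_pool_utilization=0.95)
print(firing_alerts(storm))  # all three alerts fire
```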

Deliverable

Three systemic defenses in your overall defense: (1) per-user rate limiting (rate:payment key in Redis) — caps retry amplification per user, (2) retry budget at the LB (> 20% retries = 429 on excess) — caps system-wide retry multiplier, (3) circuit breaker on payment-db (circuit:payment-db key in Redis) — breaks the API-DB cascade. Name what each defense specifically prevents, not just what it is.
