Incident: The Retry Storm That Took Down Payments
An 8-second network hiccup that failed 30% of payment requests turned into a 45-minute outage because clients retried immediately with no backoff, multiplying load by 4x. Diagnose the retry amplification cascade, fix the retry logic with exponential backoff and jitter, add a circuit breaker on payment-db, and propose system-level defenses that prevent this failure mode from being possible.
Reference walkthrough
The retry storm is a textbook positive feedback loop: a brief network hiccup causes failures, failures trigger retries, retries amplify load, amplified load causes more failures, and those failures cause more retries. The math is simple but the consequence is severe: retry-count=3 and retry-delay=0ms mean each failed request generates 4 simultaneous attempts. At a steady-state 2,000 rps with 100% of requests failing during saturation, the effective load is 8,000 rps, 60% above the 5,000 rps capacity. The feedback loop is self-sustaining because the amplified load keeps payment-api saturated, which keeps generating failures, which keeps generating retries.
The primary fix is exponential backoff with full jitter on the client side. Exponential backoff (base=100ms, multiplier=2) spreads the three retries over 700ms instead of 0ms. Full jitter (random(0, min(cap, base × 2^attempt))) prevents synchronized retry spikes, the thundering herd. With jitter, the 9,600 simultaneous retries are spread over ~700ms, reducing peak retry load from 8,000 rps to roughly 4,000-5,000 rps, which is within capacity.
The circuit breaker is the second line of defense: when the payment-db error rate exceeds 50% in a 10-second window, payment-api fails fast with 503 instead of queuing threads that will time out. This prevents connection pool exhaustion from cascading API-wide. Circuit state lives in Redis so all payment-api instances see the same state.
System-level prevention requires three independent defenses. First, per-user rate limiting (rate:payment:{user_id} with INCR + EXPIRE 60s) caps the retry amplification per user: even with retry-count=3, once a user exceeds the per-window limit, further retries return 429 immediately instead of reaching the DB. Second, a retry budget at the LB (if > 20% of requests in 10s carry an idempotency_key seen before, shed the excess with 429) caps the system-wide retry multiplier regardless of client configuration. Third, the circuit breaker on payment-db (circuit:payment-db in Redis, open at 50% error rate, half-open probe after 30s) breaks the API-DB cascade before connection pool exhaustion occurs.
Key architecture decisions
The choices that separate a passing design from full credit.
Retry amplification math: the feedback loop mechanism
retry-count=3 and retry-delay=0ms mean each failed request generates 3 additional immediate retries, for 4 total attempts at the same instant. If a fraction failure_rate of requests is failing, the effective load is approximately baseRps × (1 + retry_count × failure_rate). At 100% failure rate (full saturation): 2,000 × 4 = 8,000 rps. This is a floor, not a ceiling: if the API returns errors fast enough, clients retry fast enough to keep the load at 4x indefinitely. The feedback loop exits only when something absorbs the excess (backoff spreads the load, a rate limiter sheds it, or capacity is added).
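The amplification arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the linear formula from the text; the function names effective_load and overload_ratio are illustrative, not part of any existing tooling.

```python
def effective_load(base_rps: float, retry_count: int, failure_rate: float) -> float:
    """Offered load under immediate retries, using the formula from the text:
    base_rps * (1 + retry_count * failure_rate)."""
    return base_rps * (1 + retry_count * failure_rate)


def overload_ratio(load_rps: float, capacity_rps: float) -> float:
    """How far the offered load sits relative to capacity (1.0 means exactly at capacity)."""
    return load_rps / capacity_rps


# Numbers from the incident: 2,000 rps baseline, retry-count=3, full saturation.
storm_load = effective_load(2000, 3, 1.0)
print(storm_load)                        # 8000.0
print(overload_ratio(storm_load, 5000))  # 1.6, i.e. 60% over capacity
```

Any failure rate above 50% already pushes this model past 5,000 rps, since 2,000 × (1 + 3 × 0.5) = 5,000.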
Jitter is not optional: backoff without jitter still creates a thundering herd
Exponential backoff without jitter: all clients that failed at T=0 retry at exactly T=100ms (simultaneously), then T=300ms (simultaneously), then T=700ms (simultaneously). Three synchronized spikes, each potentially at 8,000 rps. With full jitter: each client samples independently from random(0, base×2^attempt), spreading retries uniformly over the window. No two clients retry at exactly the same instant. The synchronized spike becomes a smooth curve. Jitter transforms a thundering herd into a gradual recovery.
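To make the difference concrete, the sketch below counts how many first retries land in the same 10ms bucket with and without full jitter. The client count (1,000), the bucket size, and the helper names are assumptions chosen for illustration; only the base=100ms, multiplier=2, and 30s cap come from the text.

```python
import random
from collections import Counter

BASE_MS = 100      # base delay from the text
CAP_MS = 30_000    # max_delay from the text


def backoff_ms(attempt: int) -> int:
    """Deterministic exponential backoff: 100ms, 200ms, 400ms for attempts 0, 1, 2."""
    return min(CAP_MS, BASE_MS * 2 ** attempt)


def full_jitter_ms(attempt: int, rng: random.Random) -> float:
    """Full jitter: sample uniformly between 0 and the backoff interval."""
    return rng.uniform(0, backoff_ms(attempt))


def peak_per_bucket(retry_times_ms, bucket_ms=10):
    """Largest number of retries that land in any single time bucket."""
    buckets = Counter(int(t // bucket_ms) for t in retry_times_ms)
    return max(buckets.values())


def first_retry_times(clients: int, jitter: bool, seed: int = 0):
    """Times (ms after the failure) at which each client fires its first retry."""
    rng = random.Random(seed)
    if jitter:
        return [full_jitter_ms(0, rng) for _ in range(clients)]
    return [float(backoff_ms(0))] * clients


no_jitter_peak = peak_per_bucket(first_retry_times(1000, jitter=False))
jitter_peak = peak_per_bucket(first_retry_times(1000, jitter=True))
print(no_jitter_peak, jitter_peak)
```

Without jitter every client lands in the same bucket, so the peak equals the client count. With full jitter the same retries are scattered across the 0-100ms window and the per-bucket peak is a small fraction of that.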
Circuit breaker half-open state enables self-healing
A circuit breaker that can only open (and requires a human to close) is an operational burden. The half-open state enables self-healing: after the open window expires, one probe request is sent to the DB. If the probe succeeds (DB recovered), the circuit closes and normal traffic resumes — no human intervention. If the probe fails, the circuit re-opens for another window. This is the mechanism that lets the system recover from a transient DB overload without on-call intervention, typically within 60-90 seconds of DB recovery.
Defense in depth: three independent layers prevent any single layer from being the only safeguard
Rate limiting (per-user), retry budget (system-wide at the LB), and circuit breaker (per downstream service) serve different threat models. Rate limiting catches abusive or misconfigured clients. The retry budget catches storms where many well-behaved clients all retry simultaneously (the coordinated failure case). The circuit breaker catches downstream saturation before it cascades to the API tier. Any one of these alone would have prevented the 45-minute outage. All three together make a retry storm an operational non-event.
Common mistakes
What most candidates get wrong on this challenge.
Adding a constant retry delay (retry-delay: 500ms) without exponential backoff — constant delays still produce synchronized retry spikes every 500ms. The load profile becomes a square wave instead of a spike, but the peaks still exceed capacity. Exponential growth of the delay is required to drain the queue before the next retry wave arrives.
Building the circuit breaker with per-instance in-memory state — if payment-api has 4 replicas and each has its own circuit state, 3 replicas may be unaware that instance 1 tripped the circuit and keep hammering the DB. Shared state in Redis ensures all instances open and close together. The cost is one Redis GET per DB call — entirely acceptable.
Using a request queue to absorb retry storms without bounding queue depth — an unbounded queue accumulates millions of payment requests during a sustained storm. When the queue drains (hours later), it generates a second traffic spike. Bounded queues with shedding (reject when queue depth > threshold) are the right pattern for latency-sensitive services. Queuing is appropriate for async workloads, not synchronous payment processing.
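The bounded-queue pattern from the last point can be sketched in a few lines. The class name BoundedQueue and the depth of 3 are hypothetical; the point is only that an offer beyond the depth limit is rejected immediately (the caller would answer 429) rather than accumulating.

```python
from collections import deque


class BoundedQueue:
    """FIFO queue that sheds new work once max_depth is reached."""

    def __init__(self, max_depth: int):
        self.max_depth = max_depth
        self.items = deque()
        self.shed = 0  # count of rejected offers, useful as a metric

    def offer(self, item) -> bool:
        """Enqueue item, or reject it when the queue is already full."""
        if len(self.items) >= self.max_depth:
            self.shed += 1
            return False  # caller should respond 429 + Retry-After
        self.items.append(item)
        return True

    def take(self):
        """Dequeue the oldest item, or None when the queue is empty."""
        return self.items.popleft() if self.items else None


queue = BoundedQueue(max_depth=3)
results = [queue.offer(f"payment-{i}") for i in range(5)]
print(results, queue.shed)  # [True, True, True, False, False] 2
```

Because the depth is capped, the worst-case drain time is bounded too, which avoids the second traffic spike described above.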
What full-credit looks like
The retry amplification cascade is correctly diagnosed (scalability, weight 3×)
Full credit requires the complete cascade chain argued with numbers: (1) retry multiplier computed — retry-count=3 means 4 total attempts per failed request = 4x load multiplier; (2) retried load vs. capacity stated — 8,000 rps vs. 5,000 rps capacity = 60% over; (3) feedback loop mechanism explained — amplified load keeps API saturated, which keeps generating failures, which keeps generating retries, which keeps load at 4x. Partial credit if retries are identified as the cause but the math is missing.
Retry logic fixed with exponential backoff and jitter (operability, weight 2×)
Full credit requires: exponential backoff formula (base=100ms, multiplier=2, max_delay=30s), full jitter described with the mechanism (each client independently samples from random(0, backoff_interval) so no two clients retry at exactly the same instant), and a comparison of peak load with vs. without jitter. Must explicitly state that constant backoff (e.g., retry-delay: 500ms) does NOT solve the thundering herd — it creates a square wave of spikes instead of one spike. Partial credit if backoff is added but jitter is omitted or described without the thundering herd explanation.
Circuit breaker protects payment-db from cascading failures (availability, weight 2×)
Full credit requires all four circuit breaker elements: (1) open threshold: 50% error rate in a 10-second window (or a p99 latency threshold); (2) open duration: 30 seconds of fast-fail 503; (3) half-open state: one probe request after the open window, with the circuit closing on success and re-opening on failure; (4) shared Redis state: the circuit:payment-db key is visible to all payment-api replicas. Missing any one of these earns only partial credit. Per-instance in-memory state is explicitly wrong and scores zero.
Defense-in-depth prevention argued from numbers (justification-quality, weight 2×)
Full credit requires three defenses each with a named mechanism and what-it-prevents: (1) per-user rate limiting with threshold (e.g., 10 payments/min/user) — prevents individual user retry amplification; (2) retry budget at LB with percentage threshold (e.g., reject retries when > 20% of traffic) — prevents system-wide load multiplier above 1.25x; (3) circuit breaker on payment-db — prevents DB saturation from cascading to API connection pool exhaustion. Monitoring section should name at least two signals that fire within 60 seconds of the incident starting. Partial credit if defenses are named but mechanisms and numbers are absent.
How to approach this challenge
The same phase-by-phase guide shown during solving — with answers.
Phase 1: Diagnose the Cascade
Before proposing a fix, understand exactly what happened and why the load didn't drop after the network recovered. The retry storm is a positive feedback loop: failed requests cause retries, retries increase load, increased load causes more failures, more failures cause more retries. The key question is: what would have broken the loop?
10 min · 3 questions
Q1. The network recovered after 8 seconds. Why was the payment-api still overloaded 45 minutes later?
The retries arrived immediately (retry-delay=0ms). At the moment the network recovered, the 9,600 failed requests were all being retried simultaneously — generating 4× the normal load (8,000 rps). This 8,000 rps is above payment-api's 5,000 rps capacity. So the payment-api was now overloaded by the retries, causing new failures, which triggered new retries, keeping the load at 8,000 rps. The network hiccup set a self-sustaining load amplification loop in motion.
Q2. What is the retry load multiplier? Show the math.
retry-count=3, retry-delay=0ms. Each failed request generates 3 retries fired immediately. Total attempts per failed request = 1 (original) + 3 (retries) = 4. All 4 arrive at the same instant. During the outage, 100% of 2,000 rps is failing (because the API is saturated). So the effective load = 2,000 × 4 = 8,000 rps — 60% above the 5,000 rps capacity.
Q3. What would have broken the feedback loop?
Any mechanism that spreads the retries over time or sheds them: (1) exponential backoff, where each retry fires later than the last, spreading 9,600 simultaneous retries over 700ms; (2) rate limiting, which caps retries per user so the total load can't exceed capacity; (3) a circuit breaker on the client, which stops sending after N consecutive failures until the circuit resets. Without any of these, the system has no mechanism to self-regulate.
Deliverable
One paragraph in your overall defense: the retry amplification math (4x multiplier), why the feedback loop was self-sustaining (each cycle of failures caused a new cycle of retries), and which specific mechanism was absent that would have prevented it.
Phase 2: Fix the Retry Logic
Replace the immediate retry (retry-delay=0ms) with exponential backoff and jitter. Understand why jitter is not optional: without it, all clients that failed at the same moment retry at the same moment — a synchronized thundering herd that recreates the saturation spike.
10 min · 3 questions
Q1. What is exponential backoff and why does it help?
Exponential backoff: retry 1 waits 100ms, retry 2 waits 200ms, retry 3 waits 400ms, for a total spread of about 700ms. Instead of 9,600 retries arriving in the same instant, they arrive over 700ms. That alone does not bring the burst under capacity: 9,600 retries averaged over 0.7s is still roughly 13,700 retries/s for that window. The key insight is that each retry wave arrives later than the one before it, so the retries are spread over multiple cycles and the load decreases each cycle as successful retries drain the queue.
Q2. Why is jitter required? What breaks without it?
Without jitter, every client that failed at T=0 retries at T=100ms (all simultaneously), then at T=300ms (all simultaneously), then at T=700ms (all simultaneously). You get three synchronized spikes instead of one continuous spike — the thundering herd just has a different shape. With jitter, each client adds a random delay to its backoff (e.g., retry 1 at 100ms ± random(0, 100ms)), spreading retries uniformly over the backoff window. No synchronized spikes. Full jitter formula: sleep = random(0, min(cap, base × 2^attempt)).
Q3. What retry-count is appropriate, and when should a client give up entirely?
3 retries with exponential backoff is reasonable for transient network errors. However, payment failures require user feedback: infinite retry without surfacing the failure to the user is wrong. After 3 retries, return a failure to the user with a message like "payment is taking longer than expected; check your payment history." The idempotency_key means a successful eventual delivery won't double-charge. Time budget: total retry time should stay under the user's patience threshold (~30s).
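A bounded retry loop along these lines might look like the sketch below. TransientError and call_with_retries are hypothetical names, and the sleep and rng parameters are injected only so the behavior can be exercised without real waiting; the defaults (3 retries, 100ms base, 30s cap, full jitter) follow the text.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def call_with_retries(operation, max_retries=3, base_s=0.1, cap_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(); on TransientError retry up to max_retries times.

    Each wait uses full jitter: a uniform sample between 0 and
    min(cap_s, base_s * 2**attempt). When the final retry also fails,
    the error propagates so the caller can surface it to the user
    instead of retrying forever.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # give up: 1 original + max_retries retries all failed
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))
```

The operation passed in should send the same idempotency_key on every attempt, so that a retry of a payment that actually succeeded does not charge twice.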
Deliverable
Client retry config in your overall defense: base=100ms, multiplier=2, max_delay=30s, jitter=full, max_retries=3. Show the timing of retries 1-3 for a request that failed at T=0: without jitter they fire at T+100ms, T+300ms, and T+700ms; with full jitter each wait is drawn from random(0, 100ms), random(0, 200ms), and random(0, 400ms), so those times become upper bounds. Contrast the peak load with this config vs. the original 0ms delay.
Phase 3: Circuit Breaker for payment-db
When payment-db is overloaded, payment-api threads block waiting for DB responses that never come (or come after 5s timeouts). Blocked threads hold DB connections. Connection pool exhaustion cascades API-wide. A circuit breaker that opens at 50% error rate fails fast and lets the DB recover — breaking the cascade at the right layer.
15 min · 3 questions
Q1. What triggers the circuit to open?
Error rate threshold: if more than 50% of DB calls in the last 10s failed (timeout or error), open the circuit. Latency threshold: if p99 DB latency exceeds 2s, open. Once open, all DB calls fail immediately with 503 — no waiting, no connection pool usage. The threshold window (10s) must be short enough to react to an incident quickly but long enough to avoid flapping on transient errors.
Q2. What is the half-open state and why is it needed?
After the circuit has been open for 30 seconds, transition to half-open: allow exactly one probe request to pass through to the DB. If the probe succeeds (DB responded in < 500ms), close the circuit and resume normal operation. If the probe fails, re-open for another 30 seconds. Without half-open, you need a human to manually close the circuit. With half-open, the circuit self-heals once the DB recovers — no pager needed if the DB recovers within its normal self-healing window.
Q3. Where does the circuit breaker state live?
In Redis (circuit:payment-db key). Why not in-memory? If each payment-api instance maintains its own in-memory circuit state, they may be in different states — one instance trips the circuit, the others don't know and keep hammering the DB. Shared state in Redis means all payment-api instances see the same circuit state. Use Redis SET EX for the open state (auto-expires after 30s) and SET for closed/half-open.
Deliverable
Circuit breaker state machine in your overall defense: closed → open (on 50% error rate in 10s window) → half-open (after 30s) → closed (on successful probe) or open (on failed probe). The circuit:payment-db key in Redis stores the state. All payment-api instances read this key before each DB call.
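The state machine can be sketched as follows. This is an illustration under stated assumptions, not a production breaker: a plain dict stands in for Redis so the example runs anywhere, the "probing" marker and the min_calls guard are additions not named in the text, and the error window is kept per instance for brevity. A real deployment would need an atomic claim for the probe slot (for example Redis SET with NX) and a shared error window.

```python
import time


class CircuitBreaker:
    """closed -> open -> half-open -> closed (probe ok) or open (probe failed)."""

    KEY = "circuit:payment-db"  # key name from the text

    def __init__(self, store, clock=time.monotonic, threshold=0.5,
                 window_s=10.0, open_s=30.0, min_calls=10):
        self.store = store          # dict stand-in for the shared Redis state
        self.clock = clock
        self.threshold = threshold  # open above 50% errors
        self.window_s = window_s    # measured over the last 10s
        self.open_s = open_s        # stay open for 30s
        self.min_calls = min_calls  # assumption: ignore windows with few calls
        self.events = []            # (timestamp, ok) pairs, local to this instance

    def state(self):
        entry = self.store.get(self.KEY)
        if entry is None:
            return "closed"
        name, since = entry
        if name == "open" and self.clock() - since >= self.open_s:
            return "half-open"
        return name  # "open" or "probing"

    def allow(self):
        """Call before each DB request; False means fail fast with 503."""
        current = self.state()
        if current == "closed":
            return True
        if current == "half-open":
            # Claim the single probe slot (not atomic in this sketch).
            self.store[self.KEY] = ("probing", self.clock())
            return True
        return False

    def record(self, ok):
        """Call after each DB request that was allowed through."""
        now = self.clock()
        if self.state() == "probing":
            if ok:
                self.store.pop(self.KEY, None)        # probe succeeded: close
                self.events.clear()
            else:
                self.store[self.KEY] = ("open", now)  # probe failed: re-open
            return
        self.events = [(t, r) for t, r in self.events if now - t < self.window_s]
        self.events.append((now, ok))
        failures = sum(1 for _, r in self.events if not r)
        if (len(self.events) >= self.min_calls
                and failures / len(self.events) > self.threshold):
            self.store[self.KEY] = ("open", now)
```

Two breaker objects constructed over the same store behave like two payment-api replicas: once one of them trips the circuit, the other fails fast on its next allow() call.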
Phase 4: System-Level Prevention
What architectural changes make it impossible for this failure mode to recur? Individual fixes (better retry logic, circuit breaker) help, but the root issue is that the system has no mechanism to shed excess load. Defense in depth: rate limiting, request queuing with shedding, and retry budgets at the API gateway level prevent any single component from being overwhelmed by amplified retries.
10 min · 3 questions
Q1. What is a retry budget and how does it prevent storms at scale?
A retry budget is a cap on the percentage of traffic that is retries, measured at the API gateway level. Example: if more than 20% of requests in the last 10s are retries (identified by an idempotency_key seen before), the gateway rejects the excess with 429. If retries are at most 20% of total traffic, they are at most 0.2 / 0.8 = 25% of original requests, so the retry load multiplier is capped at 1.25x regardless of client retry configuration: even if clients retry 10 times, only 25% extra load reaches the service. Implement this at the LB or API gateway, not in individual services.
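A gateway-side budget of this kind could be sketched as below. The class RetryBudget and its in-memory bookkeeping are assumptions for illustration; a real gateway would keep the seen-key set and counters in shared storage and expire old keys.

```python
import time
from collections import deque


class RetryBudget:
    """Admit retries only while they stay within a fraction of admitted traffic."""

    def __init__(self, max_retry_fraction=0.2, window_s=10.0, clock=time.monotonic):
        self.max_retry_fraction = max_retry_fraction
        self.window_s = window_s
        self.clock = clock
        self.admitted = deque()  # (timestamp, is_retry) for admitted requests
        self.seen_keys = set()   # idempotency keys seen so far (never expired here)

    def admit(self, idempotency_key: str) -> bool:
        """Return True to forward the request, False to shed it with 429."""
        now = self.clock()
        # Drop admitted requests that have aged out of the window.
        while self.admitted and now - self.admitted[0][0] >= self.window_s:
            self.admitted.popleft()
        is_retry = idempotency_key in self.seen_keys
        self.seen_keys.add(idempotency_key)
        if is_retry:
            retries = sum(1 for _, r in self.admitted if r)
            total = len(self.admitted)
            # Would admitting this retry push retries above the budget?
            if retries + 1 > self.max_retry_fraction * (total + 1):
                return False
        self.admitted.append((now, is_retry))
        return True
```

A strict fraction like this also rejects retries when the window is almost empty, so an implementation would probably want a small minimum allowance on top of the percentage.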
Q2. Should you use a request queue with shedding, and when?
A request queue (SQS or in-memory) absorbs bursts but adds latency. For payments (latency-sensitive), a queue is only appropriate if: (1) the burst is temporary (< 60s) and (2) users are informed that processing is async. For the retry storm scenario, the burst lasted 45 minutes: a queue would have accumulated millions of backlogged payments and taken hours to drain. It is better to reject excess with 429 + Retry-After and let clients self-regulate with proper backoff. Queuing is not a substitute for backoff; it only delays the problem.
Q3. What monitoring would have caught this before 45 minutes elapsed?
Three signals that fire within the first 60 seconds: (1) payment-api error rate alert (> 1% for > 30s), (2) payment-api request rate alert (> 120% of baseline), (3) payment-db connection pool utilization alert (> 80%). If any of these reaches an on-call engineer within 60 seconds of the hiccup, they can manually open the circuit breaker or add capacity before the feedback loop stabilizes. The 45-minute outage is a monitoring failure as much as an architecture failure.
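The three thresholds can be written down as a small rule check. The Snapshot fields and the function firing_alerts are hypothetical names for whatever the metrics pipeline exposes; only the threshold values come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Snapshot:
    """Hypothetical point-in-time metrics for payment-api and payment-db."""
    error_rate: float           # fraction of failed requests, 0.0-1.0
    error_rate_breach_s: float  # how long error_rate has been above 1%
    request_rps: float
    baseline_rps: float
    db_pool_utilization: float  # fraction of DB connections in use, 0.0-1.0


def firing_alerts(s: Snapshot) -> List[str]:
    """Return the names of the alerts from the text that would fire."""
    alerts = []
    if s.error_rate > 0.01 and s.error_rate_breach_s > 30:
        alerts.append("payment-api error rate")
    if s.request_rps > 1.2 * s.baseline_rps:
        alerts.append("payment-api request rate")
    if s.db_pool_utilization > 0.80:
        alerts.append("payment-db connection pool")
    return alerts


storm = Snapshot(error_rate=1.0, error_rate_breach_s=45,
                 request_rps=8000, baseline_rps=2000, db_pool_utilization=0.95)
print(firing_alerts(storm))
```

During the storm the request-rate rule is the fastest of the three, since 8,000 rps against a 2,000 rps baseline breaches the 120% threshold as soon as the retries arrive.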
Deliverable
Three systemic defenses in your overall defense: (1) per-user rate limiting (rate:payment:{user_id} key in Redis), which caps retry amplification per user; (2) a retry budget at the LB (> 20% retries = 429 on excess), which caps the system-wide retry multiplier; (3) a circuit breaker on payment-db (circuit:payment-db key in Redis), which breaks the API-DB cascade. Name what each defense specifically prevents, not just what it is.
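For the first defense, the INCR + EXPIRE pattern can be sketched with an in-memory stand-in so it runs without a Redis server. FixedWindowLimiter is a hypothetical name, the limit of 10 per 60s follows the rubric's example threshold, and the dict of (count, expires_at) pairs only imitates what the Redis counter and its TTL would do.

```python
import time


class FixedWindowLimiter:
    """Per-user fixed-window counter imitating INCR + EXPIRE on rate:payment:{user_id}."""

    def __init__(self, limit=10, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self.counters = {}  # key -> (count, expires_at); stand-in for Redis

    def allow(self, user_id: str) -> bool:
        """Count this attempt; False means respond 429 without touching the DB."""
        key = f"rate:payment:{user_id}"
        now = self.clock()
        entry = self.counters.get(key)
        if entry is None or now >= entry[1]:
            # Key missing or expired: start a new window (EXPIRE 60s).
            entry = (0, now + self.window_s)
        count = entry[0] + 1  # INCR
        self.counters[key] = (count, entry[1])
        return count <= self.limit
```

Every attempt counts against the window, including retries, so a client that retries aggressively exhausts its own allowance without affecting other users.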