Incident: The Retry Storm That Took Down Payments
An 8-second network hiccup that failed 30% of payment requests turned into a 45-minute outage because clients retried immediately with no backoff, multiplying load by 4x. Diagnose the retry amplification cascade, fix the retry logic with exponential backoff and jitter, add a circuit breaker on payment-db, and propose system-level defenses that prevent this failure mode from being possible.
- Steady traffic: 2,000 rps
- Spike multiplier: 4×
- Budget: $1,500/mo
- Read ratio: 30:70
Load profile
The Scenario
It's 2:17 PM on a Tuesday. A brief network hiccup caused 30% of payment requests to fail for 8 seconds. The clients are configured to retry immediately — retry-count: 3, retry-delay: 0ms. That 8-second hiccup turned into a 45-minute outage because the retries multiplied the load by 4x, which kept the payment service saturated long after the network recovered. You are now debugging the post-mortem. The starting system shows the current state: payment-api overwhelmed by 8,000 rps when the base load is only 2,000 rps. Your job is to explain what happened, fix the retry logic, add a circuit breaker, and prevent a repeat.
Know before you start
- Retry logic: how retry-count and retry-delay interact to amplify load under failure
- Exponential backoff: why doubling the delay between retries spreads load over time instead of concentrating it
- Jitter: why randomizing the backoff delay prevents synchronized retry storms (the thundering herd problem)
- Circuit breakers: how the closed/open/half-open state machine protects a downstream service that is struggling
Requirements
Functional
- Payment requests must complete with < 300ms p99 latency
- Failed payments must be retried with appropriate backoff
- Duplicate payment prevention on retry (idempotency)
Non-functional
- Restore steady-state error rate to ~0% at 2,000 rps
- A 30-second network hiccup must not cause a cascading outage lasting more than 5 minutes
- Circuit breaker must prevent payment-db saturation from cascading to payment-api
- Total infrastructure cost under $1,500/month
API Contract
The endpoints your system must implement. The hot path is the one the SLO is measured on.
POST /payments · Process a payment · hot path · auth: required
Request body
| field | type | notes |
|---|---|---|
| idempotency_key | string | Client-generated UUID; prevents duplicate charges on retry, e.g. 550e8400-e29b-41d4-a716-446655440000 |
| amount_cents | integer | Amount to charge in integer cents, e.g. 9999 |
| payment_method_id | string | Tokenized payment method |
Response body
| field | type | notes |
|---|---|---|
| payment_id | string | UUID of the created payment |
| status | string | One of: pending, completed, failed |
| amount_cents | integer | |
Status codes
HOT PATH: 2,000 rps at steady state. Every payment requires a DB write. The idempotency_key MUST be checked BEFORE charging the payment processor; charging first and then failing on the DB write creates duplicate charges. Under the retry storm, the 503 responses triggered by saturation caused clients to retry immediately, amplifying load by 4x. The 503 response should carry a Retry-After: 1 header (a response header, not part of the body) to signal the expected backoff, but clients were not respecting it.
GET /payments/{payment_id} · Get payment status · auth: required
Path params
| field | type | notes |
|---|---|---|
| payment_id | string | UUID of the payment |
Response body
| field | type | notes |
|---|---|---|
| payment_id | string | |
| status | string | One of: pending, completed, failed |
| amount_cents | integer | |
| created_at | string | |
Status codes
Data Model
table · payments · Payment · 7 columns
Every payment creates a row here. The idempotency_key UNIQUE constraint is the safety net against duplicate charges — if a retry arrives for a key that already has a row, the INSERT fails and the server returns the existing payment (200). At 2,000 rps this is 2,000 inserts/s — within payment-db's 3,000 rps capacity at steady state.
| column | type | constraints | notes |
|---|---|---|---|
| id | uuid | PK | DEFAULT gen_random_uuid() |
| idempotency_key | uuid | IDX | UNIQUE — the safety net against double-charges on retry |
| user_id | uuid | IDX | FK to users table |
| amount_cents | bigint | — | Integer cents — never float for money |
| status | varchar(20) | — | One of: pending, completed, failed |
| payment_method_id | varchar(64) | — | Tokenized — never raw card data |
| created_at | timestamptz | — | DEFAULT now() |
capacity · 2,000 inserts/s × 86,400s = ~173M rows/day. Each row ~200 bytes. 173M × 200B = 34GB/day — archive or partition by created_at after 30 days. The incident is about write throughput, not storage.
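The safety-net behaviour of the UNIQUE constraint on idempotency_key can be tried out with an in-memory SQLite database standing in for payment-db. This is a minimal sketch under assumptions: the `create_payment` helper, the reduced column set, and the 201 status for a first-time insert are illustrative and not part of the contract above; only the "return the existing payment with 200 on replay" behaviour comes from the description.

```python
import sqlite3
import uuid
from typing import Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payments (
        id              TEXT PRIMARY KEY,
        idempotency_key TEXT NOT NULL UNIQUE,
        amount_cents    INTEGER NOT NULL,
        status          TEXT NOT NULL
    )
    """
)


def create_payment(idempotency_key: str, amount_cents: int) -> Tuple[int, str]:
    """Return (http_status, payment_id); a replayed key yields the original row."""
    payment_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (id, idempotency_key, amount_cents, status) "
                "VALUES (?, ?, ?, 'pending')",
                (payment_id, idempotency_key, amount_cents),
            )
        return 201, payment_id  # assumed status for a newly created payment
    except sqlite3.IntegrityError:
        # UNIQUE violation: this key was already used, so return the existing payment.
        row = conn.execute(
            "SELECT id FROM payments WHERE idempotency_key = ?", (idempotency_key,)
        ).fetchone()
        return 200, row[0]


key = "550e8400-e29b-41d4-a716-446655440000"
first = create_payment(key, 9999)
retry = create_payment(key, 9999)
print(first[0], retry[0], first[1] == retry[1])
```

In a real handler the charge to the payment processor would happen only after the insert succeeds, so a replayed key never reaches the processor a second time.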
Redis / Cache Contracts
Key patterns, TTLs, and commands. Your design must justify the hotness-critical keys.
rate:payment:{user_id} · STRING · TTL: rolling 60s · high
Per-user rate limit on payment submissions. Prevents a single user (or compromised account) from amplifying retries beyond their fair share. With a limit of 10 payments/minute/user, a user submitting at the limit with retry-count=3 would generate up to 40 attempts/minute; the limiter admits the first 10 attempts in the window and rejects the rest with 429, absorbing the excess retries before they reach the DB.
INCR rate:payment:{user_id}
EXPIRE rate:payment:{user_id} 60
rationale · During the retry storm, all 2,000 rps worth of users retried simultaneously 3 extra times. A per-user rate limiter caps the retry amplification per user: even with retry-count=3, once a user has used up the allowance for the current window, further retries return 429 immediately without touching the DB. This sheds retry load before it reaches payment-api.
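The counter logic can be sketched in plain Python with a dictionary standing in for Redis. This is an illustration under assumptions: the `FixedWindowLimiter` class and its injectable `clock` are hypothetical, and the sketch starts the 60s expiry only when a window opens. Calling EXPIRE after every INCR, as in the commands listed above, restarts the TTL on each request, which is a different (sliding-expiry) behaviour.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """In-memory stand-in for the counter kept at rate:payment:{user_id}."""

    def __init__(self, limit: int = 10, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        self._counters: Dict[str, Tuple[int, float]] = {}  # key -> (count, expires_at)

    def allow(self, user_id: str) -> bool:
        """Return True to process the request, False to answer 429."""
        key = f"rate:payment:{user_id}"
        now = self.clock()
        count, expires_at = self._counters.get(key, (0, now))
        if now >= expires_at:
            # Key is missing or expired: open a fresh window.
            count, expires_at = 0, now + self.window_s
        count += 1  # equivalent of INCR
        self._counters[key] = (count, expires_at)
        return count <= self.limit
```

With the default limit, a user who fires 40 attempts inside one window gets 10 through and 30 rejections, which is the shedding described above.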
circuit:payment-db · STRING · TTL: none (managed by circuit breaker logic) · critical
Circuit breaker state for payment-db. Values: "closed" (normal operation), "open" (fail fast), "half-open" (testing recovery). When payment-db latency spikes above threshold or error rate exceeds 50%, the circuit opens and payment-api fails fast with 503 instead of queuing requests that will time out anyway. This prevents connection pool exhaustion cascading from DB saturation to API saturation.
GET circuit:payment-db # check state before each DB call
SET circuit:payment-db open EX 30 # open circuit for 30s on threshold breach
SET circuit:payment-db half-open # after 30s, allow one probe request
SET circuit:payment-db closed # on successful probe, close circuit
rationale · Without a circuit breaker, payment-api threads block waiting for an overloaded payment-db to respond. Each blocked thread holds a DB connection. Connection pool exhaustion cascades to the API tier (all threads blocked), which cascades to the LB (queue fills), which causes 503 storms that trigger client retries. The circuit breaker breaks this cascade: when the DB is struggling, fail fast with 503 immediately rather than waiting for timeout. The 30s open window lets the DB drain its queue and recover before accepting new requests.
Capacity Math
Pre-computed numbers to anchor your justifications. Use these — the grader checks your claims against them.
traffic
2,000 rps
Original steady-state load
traffic
4,800 failed requests
Failed requests during 8s hiccup
= 2,000 rps × 8s × 30% failure rate
capacity
4x load
Retry load multiplier (retry-count=3, retry-delay=0ms)
= 1 original request + 3 immediate retries = 4 attempts per failed request
traffic
8,000 rps
Retried load during outage
= 2,000 base rps × 4x retry multiplier
capacity
5,000 rps
payment-api total capacity
= 2 replicas × 2,500 maxRps per replica
capacity
3,000 rps over capacity
Capacity gap (retried load vs capacity)
= 8,000 rps retried load - 5,000 rps capacity = 60% over
latency
~700ms spread per request across 3 retries
Backoff spread with exponential backoff (base 100ms)
= retry 1 at 100ms + retry 2 at 200ms + retry 3 at 400ms = ~700ms total spread
traffic
~3,800 rps peak
Peak rps with backoff (3,800 rps vs 8,000 rps with no backoff)
= 8,000 failures spread over 700ms window instead of 0ms = ~3,800 rps peak vs 8,000 rps
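The amplification arithmetic can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative, and the inputs are the figures stated in this section (2,000 rps base load, retry-count=3, 2 replicas at 2,500 maxRps, every request failing while the API is saturated).

```python
def attempts_per_failed_request(retry_count: int) -> int:
    """One original attempt plus every configured retry."""
    return 1 + retry_count


def offered_load_rps(base_rps: float, retry_count: int, failure_fraction: float) -> float:
    """Load the service sees when `failure_fraction` of requests fail on every attempt.

    Requests that succeed cost one attempt; requests that keep failing cost
    1 + retry_count attempts.
    """
    failing = base_rps * failure_fraction
    succeeding = base_rps - failing
    return succeeding + failing * attempts_per_failed_request(retry_count)


base_rps = 2_000
capacity_rps = 2 * 2_500  # 2 replicas x 2,500 maxRps each

# During the outage the API is saturated, so every request fails and is retried.
storm_rps = offered_load_rps(base_rps, retry_count=3, failure_fraction=1.0)
overload_rps = storm_rps - capacity_rps

print(f"storm load: {storm_rps:.0f} rps")  # storm load: 8000 rps
print(f"over capacity: {overload_rps:.0f} rps ({overload_rps / capacity_rps:.0%})")  # 3000 rps (60%)
```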
How to Approach This
Work through these phases in order before submitting. Each phase builds on the last.
Phase 1 — Diagnose the Cascade
10 min
Before proposing a fix, understand exactly what happened and why the load didn't drop after the network recovered. The retry storm is a positive feedback loop: failed requests cause retries, retries increase load, increased load causes more failures, more failures cause more retries. The key question is: what broke the loop?
The network recovered after 8 seconds. Why was the payment-api still overloaded 45 minutes later?
hint
The retries arrived immediately (retry-delay=0ms). At the moment the network recovered, the requests that had failed during the hiccup were all being retried simultaneously, and with 4 attempts per failing request the offered load climbed toward 4× normal (8,000 rps). That is above payment-api's 5,000 rps capacity, so payment-api was now overloaded by the retries themselves, causing new failures, which triggered new retries, keeping the load at 8,000 rps. The network hiccup set a self-sustaining load amplification loop in motion.
What is the retry load multiplier? Show the math.
hint
retry-count=3, retry-delay=0ms. Each failed request generates 3 retries fired immediately. Total attempts per failed request = 1 (original) + 3 (retries) = 4. All 4 arrive at the same instant. During the outage, 100% of 2,000 rps is failing (because the API is saturated). So the effective load = 2,000 × 4 = 8,000 rps — 60% above the 5,000 rps capacity.
What would have broken the feedback loop?
hint
Any mechanism that spreads the retries over time or caps them: (1) exponential backoff, where each retry fires later than the last, spreading the retries over 700ms instead of firing them all at once; (2) rate limiting, which caps retries per user so the total load can't exceed capacity; (3) a circuit breaker on the client, which stops sending after N consecutive failures until the circuit resets. Without any of these, the system has no mechanism to self-regulate.
Deliverable
One paragraph in your overall defense: the retry amplification math (4x multiplier), why the feedback loop was self-sustaining (each cycle of failures caused a new cycle of retries), and which specific mechanism was absent that would have prevented it.
Phase 2 — Fix the Retry Logic
10 min
Replace the immediate retry (retry-delay=0ms) with exponential backoff and jitter. Understand why jitter is not optional: without it, all clients that failed at the same moment retry at the same moment — a synchronized thundering herd that recreates the saturation spike.
What is exponential backoff and why does it help?
hint
Exponential backoff: retry 1 waits 100ms, retry 2 waits 200ms, retry 3 waits 400ms. Total spread: ~700ms. Instead of all of a request's retries arriving in 0ms, they arrive spread over 700ms. Backoff does not reduce the total number of attempts; it lowers the instantaneous peak by stretching the same attempts over a longer window (the capacity math puts the peak at ~3,800 rps with backoff vs 8,000 rps without). The key insight: the retries are spread over multiple cycles, and the load decreases each cycle as successful retries drain the queue.
Why is jitter required? What breaks without it?
hint
Without jitter, every client that failed at T=0 retries at T=100ms (all simultaneously), then at T=300ms (all simultaneously), then at T=700ms (all simultaneously). You get three synchronized spikes instead of one continuous spike — the thundering herd just has a different shape. With jitter, each client adds a random delay to its backoff (e.g., retry 1 at 100ms ± random(0, 100ms)), spreading retries uniformly over the backoff window. No synchronized spikes. Full jitter formula: sleep = random(0, min(cap, base × 2^attempt)).
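The difference between synchronized and jittered retries can be illustrated with a small simulation. This is a sketch under assumptions: 1,000 clients that all failed at T=0, a 100ms first backoff, and 10ms buckets for measuring the peak are arbitrary illustrative choices, and `peak_per_bucket` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import Iterable


def peak_per_bucket(retry_times_ms: Iterable[float], bucket_ms: float = 10.0) -> int:
    """Largest number of retries landing in any single time bucket."""
    buckets = Counter(int(t // bucket_ms) for t in retry_times_ms)
    return max(buckets.values())


clients = 1_000
rng = random.Random(42)  # seeded so the run is repeatable

# No jitter: every client that failed at T=0 retries at exactly T=100ms.
synchronized = [100.0 for _ in range(clients)]

# Full jitter: each client waits random(0, 100ms) before its first retry.
jittered = [rng.uniform(0.0, 100.0) for _ in range(clients)]

print("peak without jitter:", peak_per_bucket(synchronized))
print("peak with jitter:   ", peak_per_bucket(jittered))
```

Without jitter the whole cohort lands in one bucket; with jitter the same number of retries is spread across the backoff window, so no single bucket sees more than a fraction of them.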
What retry-count is appropriate, and when should a client give up entirely?
hint
3 retries with exponential backoff is reasonable for transient network errors. However, payment failures require user feedback — infinite retry without surfacing the failure to the user is wrong. After 3 retries, return a failure to the user with a message like "payment is taking longer than expected — check your payment history." The idempotency_key means a successful eventual delivery won't double-charge. Retry budget: total retry time should stay under the user's patience threshold (~30s).
Deliverable
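Client retry config in your overall defense: base=100ms, multiplier=2, max_delay=30s, jitter=full, max_retries=3. Show the timing of retries 1-3 for a request that failed at T=0. Without jitter the retries land at T+100ms, T+300ms, and T+700ms; with full jitter each wait is drawn uniformly from 0 up to 100ms, 200ms, and 400ms respectively, so the retries land anywhere up to T+100ms, T+300ms, and T+700ms. Contrast the peak load with this config vs. the original 0ms delay.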
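The retry config in this deliverable could be implemented along the following lines. This is a minimal sketch, not a definitive client: `TransientError`, `full_jitter_delay`, and `call_with_retries` are hypothetical names, `attempt` is 0-based so that the first retry uses the 100ms base, and the delay follows the full jitter formula quoted in the hint, sleep = random(0, min(cap, base × 2^attempt)).

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a 503)."""


def full_jitter_delay(attempt: int, base_s: float = 0.1, cap_s: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Wait before retry number attempt + 1: random(0, min(cap, base * 2**attempt))."""
    source = rng if rng is not None else random
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def call_with_retries(operation: Callable[[], T], max_retries: int = 3,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Run `operation`, retrying TransientError up to `max_retries` times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure to the user
            sleep(full_jitter_delay(attempt, rng=rng))
    raise AssertionError("unreachable")
```

Passing `sleep` and `rng` as parameters keeps the helper testable without real waiting; a production client would also reuse the same idempotency_key on every attempt so that a late success cannot double-charge.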
Phase 3 — Circuit Breaker for payment-db
15 min
When payment-db is overloaded, payment-api threads block waiting for DB responses that never come (or come after 5s timeouts). Blocked threads hold DB connections. Connection pool exhaustion cascades API-wide. A circuit breaker that opens at 50% error rate fails fast and lets the DB recover — breaking the cascade at the right layer.
What triggers the circuit to open?
hint
Error rate threshold: if more than 50% of DB calls in the last 10s failed (timeout or error), open the circuit. Latency threshold: if p99 DB latency exceeds 2s, open. Once open, all DB calls fail immediately with 503 — no waiting, no connection pool usage. The threshold window (10s) must be short enough to react to an incident quickly but long enough to avoid flapping on transient errors.
What is the half-open state and why is it needed?
hint
After the circuit has been open for 30 seconds, transition to half-open: allow exactly one probe request to pass through to the DB. If the probe succeeds (DB responded in < 500ms), close the circuit and resume normal operation. If the probe fails, re-open for another 30 seconds. Without half-open, you need a human to manually close the circuit. With half-open, the circuit self-heals once the DB recovers — no pager needed if the DB recovers within its normal self-healing window.
Where does the circuit breaker state live?
hint
In Redis (circuit:payment-db key). Why not in-memory? If each payment-api instance maintains its own in-memory circuit state, they may be in different states — one instance trips the circuit, the others don't know and keep hammering the DB. Shared state in Redis means all payment-api instances see the same circuit state. Use Redis SET EX for the open state (auto-expires after 30s) and SET for closed/half-open.
Deliverable
Circuit breaker state machine in your overall defense: closed → open (on 50% error rate in 10s window) → half-open (after 30s) → closed (on successful probe) or open (on failed probe). The circuit:payment-db key in Redis stores the state. All payment-api instances read this key before each DB call.
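The state machine in this deliverable can be sketched as a small Python class. This is an illustration under assumptions: state is kept in memory so the example runs on its own, whereas the design above shares it through the circuit:payment-db key in Redis; the `CircuitBreaker` and `CircuitOpenError` names and the `min_calls` guard against tripping on a handful of samples are hypothetical additions.

```python
import time
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised instead of calling the DB while the circuit is open (maps to a 503)."""


class CircuitBreaker:
    def __init__(self, error_threshold: float = 0.5, window_s: float = 10.0,
                 open_s: float = 30.0, min_calls: int = 10,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: ignore windows with too few samples
        self.clock = clock
        self.state = "closed"
        self._opened_at = 0.0
        self._results: List[Tuple[float, bool]] = []  # (timestamp, succeeded)

    def call(self, operation: Callable[[], T]) -> T:
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.open_s:
                raise CircuitOpenError("payment-db circuit is open; failing fast")
            self.state = "half-open"  # open period elapsed: let one probe through
        try:
            result = operation()
        except Exception:
            self._record(now, succeeded=False)
            raise
        self._record(now, succeeded=True)
        return result

    def _record(self, now: float, succeeded: bool) -> None:
        if self.state == "half-open":
            # The probe alone decides: close on success, re-open on failure.
            if succeeded:
                self.state = "closed"
                self._results.clear()
            else:
                self.state = "open"
                self._opened_at = now
            return
        # Closed: keep a rolling window of outcomes and trip above the threshold.
        self._results = [(t, ok) for t, ok in self._results if now - t <= self.window_s]
        self._results.append((now, succeeded))
        if len(self._results) < self.min_calls:
            return
        failures = sum(1 for _, ok in self._results if not ok)
        if failures / len(self._results) > self.error_threshold:
            self.state = "open"
            self._opened_at = now
```

The sketch is single-threaded, so "exactly one probe" holds trivially; with several payment-api instances sharing Redis state, claiming the probe would need an atomic operation so that only one instance sends it.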
Common pitfall
Setting the error rate threshold too low (e.g., 10%) — transient errors (single failed DB call) will trip the circuit, causing unnecessary outages for healthy traffic. The threshold should be high enough to ignore brief transient errors but low enough to catch sustained degradation. 50% over a 10-second window is a reasonable starting point.
Phase 4 — System-Level Prevention
10 min
What architectural changes make it impossible for this failure mode to recur? Individual fixes (better retry logic, circuit breaker) help, but the root issue is that the system has no mechanism to shed excess load. Defense in depth: rate limiting, request queuing with shedding, and retry budgets at the API gateway level prevent any single component from being overwhelmed by amplified retries.
What is a retry budget and how does it prevent storms at scale?
hint
A retry budget is a cap on the percentage of traffic that is retries, measured at the API gateway level. Example: if more than 20% of requests in the last 10s are retries (identified by idempotency_key seen before), the gateway rejects the excess retries with 429. This caps the retry load multiplier at 1.25x regardless of client retry configuration: with retries held to 20% of total traffic, at most 0.20 / 0.80 = 25% extra load reaches the service, even if clients retry 10 times. Implement at the LB or API gateway, not in individual services.
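A gateway-side retry budget could look roughly like this. It is a sketch under assumptions: the `RetryBudget` class is hypothetical, the caller is expected to decide `is_retry` (for example by checking whether the idempotency_key has been seen before), and only admitted requests count toward the rolling window.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class RetryBudget:
    """Admit retries only while they stay within a fraction of recent traffic."""

    def __init__(self, max_retry_fraction: float = 0.20, window_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_retry_fraction = max_retry_fraction
        self.window_s = window_s
        self.clock = clock
        self._admitted: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_retry)

    def admit(self, is_retry: bool) -> bool:
        """Return True to forward the request, False to answer 429."""
        now = self.clock()
        # Drop entries that have aged out of the rolling window.
        while self._admitted and now - self._admitted[0][0] > self.window_s:
            self._admitted.popleft()
        if is_retry:
            retries = sum(1 for _, r in self._admitted if r) + 1
            total = len(self._admitted) + 1
            if retries / total > self.max_retry_fraction:
                return False  # over budget: shed this retry
        self._admitted.append((now, is_retry))
        return True
```

With 80 first attempts in the window, the budget admits 20 retries (20 of 100 requests) and sheds the rest, which is the 1.25x cap worked out above.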
Should you use a request queue with shedding, and when?
hint
A request queue (SQS/in-memory) absorbs bursts but adds latency. For payments (latency-sensitive), a queue is only appropriate if: (1) the burst is temporary (< 60s) and (2) users are informed that processing is async. For the retry storm scenario, the burst lasted 45 minutes; a queue would have accumulated millions of backlogged payments and taken hours to drain. Better to reject excess with 429 + Retry-After and let clients self-regulate with proper backoff. Queuing is not a substitute for backoff; it only delays the problem.
What monitoring would have caught this before 45 minutes elapsed?
hint
Three signals that fire within the first 60 seconds: (1) payment-api error rate alert (> 1% for > 30s), (2) payment-api request rate alert (> 120% of baseline), (3) payment-db connection pool utilization alert (> 80%). If any of these reaches an on-call engineer within 60 seconds of the hiccup, they can manually open the circuit breaker or add capacity before the feedback loop stabilizes. The 45-minute outage is a monitoring failure as much as an architecture failure.
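The three alert conditions can be written down as a simple rule check. This is only a sketch: `MetricsSnapshot` and `firing_alerts` are hypothetical names, and the storm values other than the 8,000 rps and 2,000 rps figures are made up for illustration; a real setup would express these thresholds in the monitoring system's own rule language.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MetricsSnapshot:
    error_rate: float           # fraction of payment-api requests failing
    error_rate_breach_s: float  # seconds the error rate has stayed above 1%
    request_rps: float          # current payment-api request rate
    baseline_rps: float         # expected steady-state request rate
    db_pool_utilization: float  # fraction of payment-db connections in use


def firing_alerts(m: MetricsSnapshot) -> List[str]:
    """Return the names of the alerts whose thresholds are currently breached."""
    alerts = []
    if m.error_rate > 0.01 and m.error_rate_breach_s > 30:
        alerts.append("payment-api error rate")
    if m.request_rps > 1.2 * m.baseline_rps:
        alerts.append("payment-api request rate")
    if m.db_pool_utilization > 0.80:
        alerts.append("payment-db connection pool")
    return alerts


storm = MetricsSnapshot(error_rate=1.0, error_rate_breach_s=45,
                        request_rps=8_000, baseline_rps=2_000,
                        db_pool_utilization=1.0)
steady = MetricsSnapshot(error_rate=0.0, error_rate_breach_s=0,
                         request_rps=2_000, baseline_rps=2_000,
                         db_pool_utilization=0.4)

print(firing_alerts(storm))
print(firing_alerts(steady))
```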
Deliverable
Three systemic defenses in your overall defense: (1) per-user rate limiting (rate:payment key in Redis) — caps retry amplification per user, (2) retry budget at the LB (> 20% retries = 429 on excess) — caps system-wide retry multiplier, (3) circuit breaker on payment-db (circuit:payment-db key in Redis) — breaks the API-DB cascade. Name what each defense specifically prevents, not just what it is.
How You'll Be Graded
The retry amplification cascade is correctly diagnosed · scalability
The root cause must be stated with the retry math: 4x multiplier, 8,000 rps vs 5,000 rps capacity, and why the feedback loop was self-sustaining.
Full credit
Retry multiplier computed (4x = 1 + 3 retries), load vs. capacity stated (8k vs 5k), feedback loop mechanism described (failed retries cause more failures cause more retries).
Partial
Retries identified as the cause but multiplier not computed or load math missing.
Zero
Root cause stated as "network hiccup" without identifying the retry amplification mechanism.
Retry logic fixed with exponential backoff and jitter · operability
The fix must replace immediate retry with exponential backoff (base=100ms, multiplier=2) and full jitter. The answer must explain WHY jitter prevents the thundering herd, not just assert that it should be added.
Full credit
Exponential backoff formula stated, jitter mechanism explained (random spread prevents synchronized spikes), peak load comparison with and without jitter shown.
Partial
Backoff added but jitter omitted or mentioned without explaining the thundering herd mechanism.
Zero
Setting retry-delay to a constant 100ms without exponential backoff — constant delays still create synchronized spikes.
Circuit breaker protects payment-db from cascading failures · availability
A circuit breaker on the payment-db connection that opens at 50% error rate, stays open 30 seconds, and uses a half-open probe to self-heal. State stored in Redis for shared visibility across payment-api instances.
Full credit
All four elements present — open threshold (error rate %), open duration (seconds), half-open probe state, shared Redis state across instances.
Partial
Circuit breaker described but missing half-open state or shared state mechanism.
Zero
No circuit breaker, or circuit breaker described in terms that would not actually break the cascade (e.g., per-instance in-memory state).
Defense-in-depth prevention argued from numbers · justification-quality
The systemic prevention measures must be tied to the specific failure mode with numbers — not generic advice. Each measure must name what it specifically prevents.
Full credit
Three defenses named with mechanism and what-it-prevents for each (rate limiter caps per-user retry amplification, retry budget caps system-wide multiplier, circuit breaker breaks DB cascade). At least one number per defense (e.g., rate limit threshold, retry budget percentage, circuit breaker threshold).
Partial
Defenses listed but mechanisms and numbers missing — "add rate limiting" without specifying what it limits or at what threshold.
Zero
Only one defense named or no numerical reasoning.
Failure Scenarios the Sim Will Inject
Each scenario fires automatically during your simulation run. Your design must survive all of them.
The retry storm (current state: 8k rps)
payment-db slow degradation