Incident · mid · free · ~45 min

Incident: The Retry Storm That Took Down Payments

An 8-second network hiccup that failed 30% of payment requests turned into a 45-minute outage because clients retried immediately with no backoff, multiplying load by 4x. Diagnose the retry amplification cascade, fix the retry logic with exponential backoff and jitter, add a circuit breaker on payment-db, and propose system-level defenses that prevent this failure mode from being possible.

incident-response · retry-logic · circuit-breaker · exponential-backoff · cascading-failures

Steady traffic: 2,000 rps
Spike multiplier: 4×
Budget: $1,500/mo
Read ratio: 30:70

The Scenario

It's 2:17 PM on a Tuesday. A brief network hiccup caused 30% of payment requests to fail for 8 seconds. The clients are configured to retry immediately: retry-count: 3, retry-delay: 0ms. That 8-second hiccup turned into a 45-minute outage because the retries multiplied the load by 4x, which kept the payment service saturated long after the network recovered. You are now writing the post-mortem. The starting system shows the current state: payment-api overwhelmed by 8,000 rps when the base load is only 2,000 rps. Your job is to explain what happened, fix the retry logic, add a circuit breaker, and prevent a repeat.

Know before you start

○ Retry logic: how retry-count and retry-delay interact to amplify load under failure
○ Exponential backoff: why doubling the delay between retries spreads load over time instead of concentrating it
○ Jitter: why randomizing the backoff delay prevents synchronized retry storms (the thundering herd problem)
○ Circuit breakers: how the closed/open/half-open state machine protects a downstream service that is struggling

Requirements

Functional

Payment requests must complete with < 300ms p99 latency
Failed payments must be retried with appropriate backoff
Duplicate payment prevention on retry (idempotency)

Non-functional

Restore steady-state error rate to ~0% at 2,000 rps
A 30-second network hiccup must not cause a cascading outage lasting more than 5 minutes
Circuit breaker must prevent payment-db saturation from cascading to payment-api
Total infrastructure cost under $1,500/month

API Contract

The endpoints your system must implement. The hot path is the one the SLO is measured on.

POST /payments · Process a payment · hot path · auth: required

Request body

field	type	notes
idempotency_key	string	Client-generated UUID; prevents duplicate charges on retry (e.g. 550e8400-e29b-41d4-a716-446655440000)
amount_cents	integer	Amount to charge in integer cents (e.g. 9999)
payment_method_id	string	Tokenized payment method

Response body

field	type	notes
payment_id	string	UUID of the created payment
status	string	pending | completed | failed
amount_cents	integer

Status codes

201 Payment created
200 Idempotent replay: payment already exists, returns existing payment
402 Payment declined by processor
429 Rate limit hit
503 Service temporarily unavailable; retry with backoff

HOT PATH: 2,000 rps at steady state. Every payment requires a DB write. The idempotency_key MUST be checked BEFORE charging the payment processor; charging first and then failing on the DB write means the retry charges the customer again. Under the retry storm, the 503 responses triggered by saturation caused clients to retry immediately, amplifying load by 4x. The 503 response should include a Retry-After: 1 header to signal the expected backoff, but clients were not respecting it.
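As a rough illustration of a client that does respect the server's hint, the sketch below picks the wait before the next attempt. The function name `retry_delay_seconds`, the choice to treat only 429 and 503 as retryable, and the 100ms base and 30s cap are assumptions made for this example, not part of the contract above. Retry-After is parsed only in its delta-seconds form.

```python
import random
from typing import Mapping, Optional


def retry_delay_seconds(status: int, headers: Mapping[str, str], attempt: int,
                        base: float = 0.1, cap: float = 30.0) -> Optional[float]:
    """Return seconds to wait before the next attempt, or None to stop retrying."""
    if status in (200, 201):
        return None  # created, or idempotent replay: nothing left to do
    if status == 402:
        return None  # declined by the processor: retrying will not help
    if status not in (429, 503):
        return None  # assumption: this sketch retries only on 429 and 503
    # Full-jitter backoff for this attempt (attempt is counted from 0).
    backoff = random.uniform(0.0, min(cap, base * 2 ** attempt))
    # Retry-After may also be an HTTP date; only whole seconds are handled here.
    raw = headers.get("Retry-After", "")
    server_hint = float(raw) if raw.isdigit() else 0.0
    # Never retry sooner than the server asked for.
    return max(server_hint, backoff)
```

With `Retry-After: 1` and a first-retry backoff of at most 100ms, the server's one-second hint wins, so the client waits a full second instead of retrying at 0ms.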

GET /payments/{payment_id} · Get payment status · auth: required

Path params

field	type	notes
payment_id	string	UUID of the payment

Response body

field	type	notes
payment_id	string
status	string	pending | completed | failed
amount_cents	integer
created_at	string

Status codes

200 Payment found
404 Payment not found

Data Model

table · payments · Payment · 7 columns

Every payment creates a row here. The idempotency_key UNIQUE constraint is the safety net against duplicate charges — if a retry arrives for a key that already has a row, the INSERT fails and the server returns the existing payment (200). At 2,000 rps this is 2,000 inserts/s — within payment-db's 3,000 rps capacity at steady state.

column	type	constraints	notes
id	uuid	PK	DEFAULT gen_random_uuid()
idempotency_key	uuid	IDX	UNIQUE — the safety net against double-charges on retry
user_id	uuid	IDX	FK to users table
amount_cents	bigint	—	Integer cents — never float for money
status	varchar(20)	—	pending | completed | failed
payment_method_id	varchar(64)	—	Tokenized — never raw card data
created_at	timestamptz	—	DEFAULT now()

capacity · 2,000 inserts/s × 86,400s = ~173M rows/day. Each row ~200 bytes. 173M × 200B ≈ 34.6GB/day, so archive or partition by created_at after 30 days. The incident is about write throughput, not storage.
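The UNIQUE constraint on idempotency_key described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: SQLite stands in for payment-db, the table is trimmed to four columns, and `create_payment` is a hypothetical helper name.

```python
import sqlite3
import uuid
from typing import Tuple

conn = sqlite3.connect(":memory:")  # stand-in for payment-db
conn.execute(
    """CREATE TABLE payments (
           id TEXT PRIMARY KEY,
           idempotency_key TEXT NOT NULL UNIQUE,
           amount_cents INTEGER NOT NULL,
           status TEXT NOT NULL DEFAULT 'pending'
       )"""
)


def create_payment(idempotency_key: str, amount_cents: int) -> Tuple[int, str]:
    """Return (http_status, payment_id): 201 for a new row, 200 for a replay."""
    payment_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back if the INSERT raises
            conn.execute(
                "INSERT INTO payments (id, idempotency_key, amount_cents) "
                "VALUES (?, ?, ?)",
                (payment_id, idempotency_key, amount_cents),
            )
    except sqlite3.IntegrityError:
        # The key already has a row: return the existing payment, charge nothing.
        row = conn.execute(
            "SELECT id FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return 200, row[0]
    # Only the request that won the INSERT would go on to call the processor.
    return 201, payment_id
```

Calling `create_payment` twice with the same key yields 201 and then 200 with the same payment id, and the table still holds one row.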

Redis / Cache Contracts

Key patterns, TTLs, and commands. Your design must justify the hotness-critical keys.

rate:payment:{user_id} · STRING · TTL: rolling 60s · high

Per-user rate limit on payment submissions. Prevents a single user (or compromised account) from amplifying retries beyond their fair share. With a limit of 10 payments/minute/user, a single user with retry-count=3 can generate up to 40 attempts/minute (10 payments × 4 attempts); the rate limiter admits the first 10 and rejects the rest with 429 before they reach the DB.

INCR rate:payment:{user_id}

EXPIRE rate:payment:{user_id} 60

rationale · During the retry storm, all 2,000 rps worth of users retried simultaneously 3 extra times. A per-user rate limiter caps the retry amplification per user: even with retry-count=3, only attempts within the per-user limit for the current window pass; any further retries in that window return 429 immediately without touching the DB. This sheds retry load before it reaches payment-api.
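The counter behaviour can be sketched in-process as below. This is a stand-in for the INCR and EXPIRE commands, not a Redis client: the class name `FixedWindowLimiter` and the injectable clock are assumptions for the example, and the window here is fixed from the first attempt rather than extended on every call.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """In-memory stand-in for INCR + EXPIRE on rate:payment:{user_id}."""

    def __init__(self, limit: int = 10, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window_s = window_s
        self.clock = clock
        # user_id -> (attempts in the current window, time the window expires)
        self._counters: Dict[str, Tuple[int, float]] = {}

    def allow(self, user_id: str) -> bool:
        """True if the attempt may proceed; False maps to an immediate 429."""
        now = self.clock()
        count, expires_at = self._counters.get(user_id, (0, 0.0))
        if now >= expires_at:
            # Key expired or never existed: start a fresh 60s window.
            count, expires_at = 0, now + self.window_s
        count += 1  # the INCR step
        self._counters[user_id] = (count, expires_at)
        return count <= self.limit
```

With the default limit, a user who fires 40 attempts inside one window gets 10 through and 30 rejected, while other users are unaffected.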

circuit:payment-db · STRING · TTL: none (managed by circuit breaker logic) · critical

Circuit breaker state for payment-db. Values: "closed" (normal operation), "open" (fail fast), "half-open" (testing recovery). When payment-db latency spikes above threshold or error rate exceeds 50%, the circuit opens and payment-api fails fast with 503 instead of queuing requests that will timeout anyway. This prevents connection pool exhaustion cascading from DB saturation to API saturation.

GET circuit:payment-db # check state before each DB call

SET circuit:payment-db open EX 30 # open circuit for 30s on threshold breach

SET circuit:payment-db half-open # after 30s, allow one probe request

SET circuit:payment-db closed # on successful probe, close circuit

rationale · Without a circuit breaker, payment-api threads block waiting for an overloaded payment-db to respond. Each blocked thread holds a DB connection. Connection pool exhaustion cascades to the API tier (all threads blocked), which cascades to the LB (queue fills), which causes 503 storms that trigger client retries. The circuit breaker breaks this cascade: when the DB is struggling, fail fast with 503 immediately rather than waiting for timeout. The 30s open window lets the DB drain its queue and recover before accepting new requests.

Capacity Math

Pre-computed numbers to anchor your justifications. Use these — the grader checks your claims against them.

category	value	what it measures	derivation
traffic	2,000 rps	Original steady-state load	given
traffic	4,800 failed requests	Failed requests during 8s hiccup	2,000 rps × 8s × 30% failure rate
capacity	4x load	Retry load multiplier (retry-count=3, retry-delay=0ms)	1 original request + 3 immediate retries = 4 attempts per failed request
traffic	8,000 rps	Retried load during outage	2,000 base rps × 4x retry multiplier
capacity	5,000 rps	payment-api total capacity	2 replicas × 2,500 maxRps per replica
capacity	3,000 rps over capacity	Capacity gap (retried load vs capacity)	8,000 rps retried load - 5,000 rps capacity = 3,000 rps, i.e. 60% over
latency	~700ms spread per request across 3 retries	Backoff spread with exponential backoff (base 100ms)	retry 1 after 100ms + retry 2 after 200ms + retry 3 after 400ms = ~700ms total spread
traffic	~3,800 rps peak	Peak rps with backoff (vs 8,000 rps with no backoff)	8,000 failures spread over 700ms window instead of 0ms
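For readers who want to re-derive the figures, the arithmetic can be checked in a few lines. The constant names are made up for this sketch, and the failure rate is the 30% stated in the scenario.

```python
BASE_RPS = 2_000
HICCUP_S = 8
FAILURE_RATE = 0.30          # from the scenario: 30% of requests failed
RETRY_COUNT = 3
REPLICAS = 2
MAX_RPS_PER_REPLICA = 2_500
BACKOFF_BASE_MS = 100

failed_requests = round(BASE_RPS * HICCUP_S * FAILURE_RATE)   # 4800
multiplier = 1 + RETRY_COUNT                                  # 4 attempts per failed request
storm_rps = BASE_RPS * multiplier                             # 8000
capacity_rps = REPLICAS * MAX_RPS_PER_REPLICA                 # 5000
gap_rps = storm_rps - capacity_rps                            # 3000
over_pct = 100 * gap_rps / capacity_rps                       # 60.0
# Delays of 100ms, 200ms, 400ms before retries 1, 2 and 3.
backoff_spread_ms = sum(BACKOFF_BASE_MS * 2 ** n for n in range(RETRY_COUNT))  # 700

print(failed_requests, multiplier, storm_rps, capacity_rps,
      gap_rps, over_pct, backoff_spread_ms)
```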

How to Approach This

Work through these phases in order before submitting. Each phase builds on the last.

Phase 1 — Diagnose the Cascade

10 min · 3 questions

Before proposing a fix, understand exactly what happened and why the load didn't drop after the network recovered. The retry storm is a positive feedback loop: failed requests cause retries, retries increase load, increased load causes more failures, more failures cause more retries. The key question is: what broke the loop?

The network recovered after 8 seconds. Why was the payment-api still overloaded 45 minutes later?

hint

The retries arrived immediately (retry-delay=0ms). At the moment the network recovered, the 4,800 requests that failed during the hiccup (2,000 rps × 8s × 30%) were all being retried simultaneously, pushing load toward 4× normal (8,000 rps). That is above payment-api's 5,000 rps capacity, so payment-api was now overloaded by the retries, causing new failures, which triggered new retries, keeping the load at 8,000 rps. The network hiccup set a self-sustaining load amplification loop in motion.

What is the retry load multiplier? Show the math.

hint

retry-count=3, retry-delay=0ms. Each failed request generates 3 retries fired immediately. Total attempts per failed request = 1 (original) + 3 (retries) = 4. All 4 arrive at the same instant. During the outage, 100% of 2,000 rps is failing (because the API is saturated). So the effective load = 2,000 × 4 = 8,000 rps — 60% above the 5,000 rps capacity.

What would have broken the feedback loop?

hint

Any mechanism that spreads the retries over time or sheds them: (1) exponential backoff, where each retry fires later than the last, spreading a request's retries over ~700ms instead of firing them at the same instant; (2) rate limiting, which caps retries per user so the total load can't exceed capacity; (3) a circuit breaker on the client, which stops sending after N consecutive failures until the circuit resets. Without any of these, the system has no mechanism to self-regulate.

Deliverable

One paragraph in your overall defense: the retry amplification math (4x multiplier), why the feedback loop was self-sustaining (each cycle of failures caused a new cycle of retries), and which specific mechanism was absent that would have prevented it.
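One way to show the multiplier math is to compute the expected number of attempts per request as a function of the failure probability. The sketch below assumes each attempt fails independently with the same probability, which is a simplification; the function names are illustrative.

```python
def expected_attempts(p_fail: float, retry_count: int = 3) -> float:
    """Expected attempts per request if each attempt fails independently."""
    # Attempt k (counted from 0) is sent only if the previous k attempts failed.
    return sum(p_fail ** k for k in range(retry_count + 1))


def offered_rps(base_rps: float, p_fail: float, retry_count: int = 3) -> float:
    """Load reaching the service once retries are included."""
    return base_rps * expected_attempts(p_fail, retry_count)


if __name__ == "__main__":
    for p in (0.0, 0.3, 1.0):
        print(p, offered_rps(2_000, p))
```

At `p_fail = 1.0` every request makes all 4 attempts, which is the saturated state the hints describe (2,000 × 4 = 8,000 rps against 5,000 rps of capacity). The full 4x applies only while every attempt fails.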

Phase 2 — Fix the Retry Logic

10 min · 3 questions

Replace the immediate retry (retry-delay=0ms) with exponential backoff and jitter. Understand why jitter is not optional: without it, all clients that failed at the same moment retry at the same moment — a synchronized thundering herd that recreates the saturation spike.

What is exponential backoff and why does it help?

hint

Exponential backoff: retry 1 waits 100ms, retry 2 waits 200ms, retry 3 waits 400ms, so a request's three retries span ~700ms instead of landing in the same instant. Spreading alone is not the whole story: dividing the hiccup's failed requests by a 0.7s window still yields a rate above the 5,000 rps capacity. The key insight is that with a delay between attempts, each retry wave arrives after the service has had time to complete some work. Retries that succeed drop out, so each successive wave is smaller and the backlog drains instead of being re-amplified. The capacity math puts the resulting peak at ~3,800 rps versus 8,000 rps with no backoff.

Why is jitter required? What breaks without it?

hint

Without jitter, every client that failed at T=0 retries at T=100ms (all simultaneously), then at T=300ms (all simultaneously), then at T=700ms (all simultaneously). You get three synchronized spikes instead of one continuous spike — the thundering herd just has a different shape. With jitter, each client adds a random delay to its backoff (e.g., retry 1 at 100ms ± random(0, 100ms)), spreading retries uniformly over the backoff window. No synchronized spikes. Full jitter formula: sleep = random(0, min(cap, base × 2^attempt)).

What retry-count is appropriate, and when should a client give up entirely?

hint

3 retries with exponential backoff is reasonable for transient network errors. However, payment failures require user feedback; retrying indefinitely without surfacing the failure to the user is wrong. After 3 retries, return a failure to the user with a message like "payment is taking longer than expected; check your payment history." The idempotency_key means a successful eventual delivery won't double-charge. Time budget: total retry time should stay under the user's patience threshold (~30s).

Deliverable

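Client retry config in your overall defense: base=100ms, multiplier=2, max_delay=30s, jitter=full, max_retries=3. Show the timing of retries 1-3 for a request that failed at T=0. Without jitter they fire at T+100ms, T+300ms, and T+700ms; with full jitter each delay is drawn uniformly from [0, 100ms], [0, 200ms], and [0, 400ms], so the retries land at random points no later than those times. Contrast the peak load with this config vs. the original 0ms delay.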
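The config above can be sketched as a small delay generator using the full jitter formula sleep = random(0, min(cap, base × 2^attempt)), with attempt counted from 0. The function names and the optional seeded `rng` parameter are assumptions for the example.

```python
import random
from typing import List, Optional


def full_jitter_delays(max_retries: int = 3, base_ms: float = 100.0,
                       multiplier: float = 2.0, max_delay_ms: float = 30_000.0,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Delay in ms before each retry, drawn uniformly below the backoff ceiling."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):  # attempt 0 is the first retry
        ceiling = min(max_delay_ms, base_ms * multiplier ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays


def retry_times(delays_ms: List[float]) -> List[float]:
    """Cumulative time after the original failure at which each retry fires."""
    times, elapsed = [], 0.0
    for delay in delays_ms:
        elapsed += delay
        times.append(elapsed)
    return times


if __name__ == "__main__":
    print("no jitter:  ", retry_times([100.0, 200.0, 400.0]))
    print("full jitter:", retry_times(full_jitter_delays()))
```

Without jitter every client that failed together produces the same three timestamps; with full jitter each client draws its own, so the retries are spread across the window rather than stacked on three instants.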

Phase 3 — Circuit Breaker for payment-db

15 min · 3 questions

When payment-db is overloaded, payment-api threads block waiting for DB responses that never come (or come after 5s timeouts). Blocked threads hold DB connections. Connection pool exhaustion cascades API-wide. A circuit breaker that opens at 50% error rate fails fast and lets the DB recover — breaking the cascade at the right layer.

What triggers the circuit to open?

hint

Error rate threshold: if more than 50% of DB calls in the last 10s failed (timeout or error), open the circuit. Latency threshold: if p99 DB latency exceeds 2s, open. Once open, all DB calls fail immediately with 503 — no waiting, no connection pool usage. The threshold window (10s) must be short enough to react to an incident quickly but long enough to avoid flapping on transient errors.

What is the half-open state and why is it needed?

hint

After the circuit has been open for 30 seconds, transition to half-open: allow exactly one probe request to pass through to the DB. If the probe succeeds (DB responded in < 500ms), close the circuit and resume normal operation. If the probe fails, re-open for another 30 seconds. Without half-open, you need a human to manually close the circuit. With half-open, the circuit self-heals once the DB recovers — no pager needed if the DB recovers within its normal self-healing window.

Where does the circuit breaker state live?

hint

In Redis (circuit:payment-db key). Why not in-memory? If each payment-api instance maintains its own in-memory circuit state, they may be in different states — one instance trips the circuit, the others don't know and keep hammering the DB. Shared state in Redis means all payment-api instances see the same circuit state. Use Redis SET EX for the open state (auto-expires after 30s) and SET for closed/half-open.

Deliverable

Circuit breaker state machine in your overall defense: closed → open (on 50% error rate in 10s window) → half-open (after 30s) → closed (on successful probe) or open (on failed probe). The circuit:payment-db key in Redis stores the state. All payment-api instances read this key before each DB call.
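The transitions can be sketched as a small in-process class. This illustrates the transition logic only: in the design described here the state lives in the shared circuit:payment-db key, not in each instance's memory. The class name, the `min_calls` guard, and the injectable clock are assumptions for the example.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class CircuitBreaker:
    """Transition logic only: closed -> open -> half-open -> closed or open."""

    def __init__(self, error_threshold: float = 0.5, window_s: float = 10.0,
                 open_s: float = 30.0, min_calls: int = 20,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.open_s = open_s
        self.min_calls = min_calls  # assumption: do not trip on a tiny sample
        self.clock = clock
        self.state = "closed"
        self._opened_at = 0.0
        self._probe_in_flight = False
        self._outcomes: Deque[Tuple[float, bool]] = deque()  # (time, succeeded)

    def allow_request(self) -> bool:
        """True if the DB call may proceed; False means fail fast with 503."""
        if self.state == "open" and self.clock() - self._opened_at >= self.open_s:
            self.state = "half-open"
            self._probe_in_flight = False
        if self.state == "closed":
            return True
        if self.state == "half-open" and not self._probe_in_flight:
            self._probe_in_flight = True  # exactly one probe at a time
            return True
        return False

    def record(self, succeeded: bool) -> None:
        """Report the outcome of a DB call that allow_request() let through."""
        now = self.clock()
        if self.state == "half-open":
            if succeeded:
                self.state = "closed"
                self._outcomes.clear()
            else:
                self._trip(now)
            return
        if self.state == "open":
            return  # late result from a call that started before the trip
        self._outcomes.append((now, succeeded))
        while self._outcomes and now - self._outcomes[0][0] > self.window_s:
            self._outcomes.popleft()
        failures = sum(1 for _, ok in self._outcomes if not ok)
        total = len(self._outcomes)
        if total >= self.min_calls and failures / total > self.error_threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "open"
        self._opened_at = now
        self._outcomes.clear()
```

A caller would check `allow_request()` before each DB call and pass the outcome to `record()`; a shared version would replace the instance fields with reads and writes of the Redis key.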

Common pitfall

Setting the error rate threshold too low (e.g., 10%). A few transient errors (even a single failed DB call in a quiet window) will then trip the circuit, causing unnecessary outages for healthy traffic. The threshold should be high enough to ignore brief transient errors but low enough to catch sustained degradation. 50% over a 10-second window is a reasonable starting point.

Phase 4 — System-Level Prevention

10 min · 3 questions

What architectural changes make it impossible for this failure mode to recur? Individual fixes (better retry logic, circuit breaker) help, but the root issue is that the system has no mechanism to shed excess load. Defense in depth: rate limiting, request queuing with shedding, and retry budgets at the API gateway level prevent any single component from being overwhelmed by amplified retries.

What is a retry budget and how does it prevent storms at scale?

hint

A retry budget is a cap on the percentage of traffic that is retries, measured at the API gateway level. Example: if more than 20% of requests in the last 10s are retries (identified by an idempotency_key seen before), the gateway rejects the excess retries with 429. This caps the retry load multiplier at 1.25x regardless of client retry configuration: if retries are at most 20% of total traffic, originals are at least 80%, so total load is at most 1 / 0.8 = 1.25× the original load. Even if clients retry 10 times, only 25% extra load reaches the service. Implement at the LB or API gateway, not in individual services.

Should you use a request queue with shedding, and when?

hint

A request queue (SQS/in-memory) absorbs bursts but adds latency. For payments (latency-sensitive), a queue is only appropriate if: (1) the burst is temporary (< 60s) and (2) users are informed that processing is async. For the retry storm scenario, the burst lasted 45 minutes; a queue would have accumulated millions of backlogged payments and taken hours to drain. Better to reject excess with 429 + Retry-After and let clients self-regulate with proper backoff. Queuing is not a substitute for backoff: it only delays the problem.

What monitoring would have caught this before 45 minutes elapsed?

hint

Three signals that fire within the first 60 seconds: (1) payment-api error rate alert (> 1% for > 30s), (2) payment-api request rate alert (> 120% of baseline), (3) payment-db connection pool utilization alert (> 80%). If any of these pages an on-call engineer within 60 seconds of the hiccup, they can manually open the circuit breaker or add capacity before the feedback loop stabilizes. The 45-minute outage is a monitoring failure as much as an architecture failure.
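The three thresholds can be written down as a simple check. This is a sketch only: the `Snapshot` fields and alert names are hypothetical, and a real setup would express these as rules in whatever monitoring system is in use.

```python
from dataclasses import dataclass
from typing import List

BASELINE_RPS = 2_000


@dataclass
class Snapshot:
    error_rate: float        # fraction of payment-api responses that are errors
    error_duration_s: float  # how long the error rate has stayed above 1%
    request_rps: float       # current payment-api request rate
    pool_utilization: float  # fraction of payment-db connections in use


def firing_alerts(s: Snapshot) -> List[str]:
    """Names of the alerts that would fire for this snapshot."""
    alerts = []
    if s.error_rate > 0.01 and s.error_duration_s > 30:
        alerts.append("payment-api error rate")
    if s.request_rps > 1.20 * BASELINE_RPS:
        alerts.append("payment-api request rate")
    if s.pool_utilization > 0.80:
        alerts.append("payment-db connection pool")
    return alerts
```

For a storm snapshot such as `Snapshot(1.0, 45, 8_000, 1.0)` all three alerts fire; at steady state none do.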

Deliverable

Three systemic defenses in your overall defense: (1) per-user rate limiting (rate:payment key in Redis) — caps retry amplification per user, (2) retry budget at the LB (> 20% retries = 429 on excess) — caps system-wide retry multiplier, (3) circuit breaker on payment-db (circuit:payment-db key in Redis) — breaks the API-DB cascade. Name what each defense specifically prevents, not just what it is.
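The retry budget in defense (2) can be sketched as an admission check. To stay short, the example keeps lifetime counters instead of a 10s sliding window and an unbounded set of seen keys, both of which a real gateway would need to bound; `RetryBudget` and `admit` are illustrative names.

```python
from typing import Set


class RetryBudget:
    """Admit a retry only while retries stay within a percentage of admitted traffic."""

    def __init__(self, max_retry_percent: int = 20) -> None:
        self.max_retry_percent = max_retry_percent
        self.total = 0      # admitted requests (originals + retries)
        self.retries = 0    # admitted retries
        self._seen: Set[str] = set()

    def admit(self, idempotency_key: str) -> bool:
        """True to forward the request; False maps to 429."""
        is_retry = idempotency_key in self._seen
        self._seen.add(idempotency_key)
        if is_retry:
            # Integer arithmetic: would this retry push retries past the budget?
            if (self.retries + 1) * 100 > self.max_retry_percent * (self.total + 1):
                return False
            self.retries += 1
        self.total += 1
        return True


if __name__ == "__main__":
    budget = RetryBudget()
    admitted = 0
    for n in range(1_000):
        key = f"key-{n}"
        for _ in range(4):  # 1 original + 3 immediate retries
            admitted += budget.admit(key)
    print(admitted, "of 4000 attempts admitted")
```

Even though clients send 4 attempts per payment, admitted retries never exceed 20% of admitted traffic, so the load forwarded to payment-api stays at or below 1.25× the 1,000 originals.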

How You'll Be Graded

The retry amplification cascade is correctly diagnosed · 33% · scalability
Retry logic fixed with exponential backoff and jitter · 22% · operability
Circuit breaker protects payment-db from cascading failures · 22% · availability
Defense-in-depth prevention argued from numbers · 22% · justification-quality

The retry amplification cascade is correctly diagnosed · scalability

The root cause must be stated with the retry math: 4x multiplier, 8,000 rps vs 5,000 rps capacity, and why the feedback loop was self-sustaining.

Full credit

Retry multiplier computed (4x = 1 + 3 retries), load vs. capacity stated (8k vs 5k), feedback loop mechanism described (failed retries cause more failures cause more retries).

Partial

Retries identified as the cause but multiplier not computed or load math missing.

Zero

Root cause stated as "network hiccup" without identifying the retry amplification mechanism.

Retry logic fixed with exponential backoff and jitter · operability

The fix must replace immediate retry with exponential backoff (base=100ms, multiplier=2) and full jitter. The answer must explain WHY jitter prevents the thundering herd, not just assert that it should be added.

Full credit

Exponential backoff formula stated, jitter mechanism explained (random spread prevents synchronized spikes), peak load comparison with and without jitter shown.

Partial

Backoff added but jitter omitted or mentioned without explaining the thundering herd mechanism.

Zero

Setting retry-delay to a constant 100ms without exponential backoff — constant delays still create synchronized spikes.

Circuit breaker protects payment-db from cascading failures · availability

A circuit breaker on the payment-db connection that opens at 50% error rate, stays open 30 seconds, and uses a half-open probe to self-heal. State stored in Redis for shared visibility across payment-api instances.

Full credit

All four elements present — open threshold (error rate %), open duration (seconds), half-open probe state, shared Redis state across instances.

Partial

Circuit breaker described but missing half-open state or shared state mechanism.

Zero

No circuit breaker, or circuit breaker described in terms that would not actually break the cascade (e.g., per-instance in-memory state).

Defense-in-depth prevention argued from numbers · justification-quality

The systemic prevention measures must be tied to the specific failure mode with numbers — not generic advice. Each measure must name what it specifically prevents.

Full credit

Three defenses named with mechanism and what-it-prevents for each (rate limiter caps per-user retry amplification, retry budget caps system-wide multiplier, circuit breaker breaks DB cascade). At least one number per defense (e.g., rate limit threshold, retry budget percentage, circuit breaker threshold).

Partial

Defenses listed but mechanisms and numbers missing — "add rate limiting" without specifying what it limits or at what threshold.

Zero

Only one defense named or no numerical reasoning.

Failure Scenarios the Sim Will Inject

Each scenario fires automatically during your simulation run. Your design must survive all of them.

traffic spike · The retry storm (current state: 8k rps)

slow degradation · t=30s · payment-db slow degradation · postgres
