Incident · mid · free · ~60 min

Incident: Checkout Is Melting Down

You're on call. The flash sale started 10 minutes ago and checkout error rates are climbing past 50%. You inherit the system as it runs — diagnose the bottleneck, fix the architecture under a $2,000/month budget, and be ready for the next sale and a region outage.

incident-response · capacity · replication · write-heavy · database-scaling

Steady traffic

4,000 rps

Spike multiplier

2×

Budget

$2,000/mo

Read ratio

30:70


The Scenario

It's 9:02 AM. The flash sale went live at 8:50 AM. PagerDuty fired at 8:53 AM. Your CEO is in the Slack channel. Checkout error rate: 52%. Revenue per minute: $0. The starting system is what's in production right now: you cannot tear it down, you can only modify it. You have 30 minutes before the engineering meeting to present a diagnosis and a fix. The incident retro is in 2 hours. What broke, what changed, and what prevents a repeat?

Know before you start

○ Database connection pool exhaustion and what a 'too many connections' error means
○ Horizontal scaling of stateless services vs. vertical scaling of databases
○ Read replicas: what they can absorb and what they cannot (hint: writes)
○ Idempotency keys: why retries on a checkout endpoint need them
○ The difference between a write-heavy and a read-heavy workload

Requirements

Functional

Checkout requests must complete with < 200ms p99 latency
Order writes must reach a durable relational store
The fix must keep the existing components' roles intact (this is a live system)
Idempotent checkout — retried requests must not create duplicate orders

Non-functional

Restore steady-state error rate to ~0% at 4,000 rps
Survive the next flash sale (2x traffic) without dropping requests
Survive a region-A database outage with < 1% error rate
Total infrastructure cost under $2,000/month

API Contract

The endpoints your system must implement. The hot path is the one the SLO is measured on.

POST /checkout · Submit a checkout and create an order · hot path · auth: required

Request body

field	type	notes
idempotency_key	string	Client-generated UUID; prevents duplicate orders on retry (e.g. 550e8400-e29b-41d4-a716-446655440000)
cart_id	string	ID of the cart being checked out
payment_method_id	string	Tokenized payment method (Stripe token, etc.)
shipping_address_id	string

Response body

field	type	notes
order_id	string	UUID of the created order
status	string	Initial order status: pending | confirmed
total_cents	integer
estimated_delivery	string

Status codes

201 Order created

200 Idempotent replay: order already exists, returns existing order

402 Payment declined

409 Cart conflict (item sold out since cart was built)

429 Rate limit hit (anti-abuse, not capacity)

503 Service degraded: checkout queue full or DB unavailable

HOT PATH — 2,800 write rps at steady state. Every checkout requires a DB write. The idempotency_key must be checked first (Redis or DB unique constraint) BEFORE charging the payment method. Charging first → retries = double charges. The starting system has neither idempotency nor write capacity headroom.

GET /orders/{order_id} · Get order status and details · auth: required

Path params

field	type	notes
order_id	string	UUID of the order

Response body

field	type	notes
order_id	string
status	string	pending | confirmed | processing | shipped | delivered | cancelled
items	array	List of line items
total_cents	integer
created_at	string

Status codes

200 Order found

403 Order belongs to a different user

404 Order not found

Read path — can be served from a read replica once you add one. During the incident, this is less critical than the write path.

GET /orders · List current user's orders · auth: required

Query params

field	type	notes
status?	string	Filter by status
limit?	integer	Max results (default: 20, max: 100)
cursor?	string	Pagination cursor

Response body

field	type	notes
orders	array	List of order summaries
next_cursor	string

Status codes

200 Orders returned

Low volume (users checking their order history). Can be served from a read replica. Index on user_id + created_at DESC is critical.
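The cursor parameter and the user_id + created_at DESC index pair naturally with keyset pagination. The sketch below shows one way that could look, using an in-memory list in place of the orders table. The cursor encoding (base64 JSON of created_at and id) and the function names are assumptions for illustration, not part of the API contract.

```python
import base64
import json
from datetime import datetime

def encode_cursor(created_at, order_id):
    # Opaque cursor = sort position of the last row the client has seen.
    raw = json.dumps([created_at.isoformat(), order_id]).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor):
    created_at, order_id = json.loads(base64.urlsafe_b64decode(cursor))
    return datetime.fromisoformat(created_at), order_id

def list_orders(rows, user_id, limit=20, cursor=None):
    limit = max(1, min(limit, 100))  # default 20, max 100 per the contract
    # Stand-in for: WHERE user_id = ? ORDER BY created_at DESC, id DESC
    mine = sorted(
        (r for r in rows if r["user_id"] == user_id),
        key=lambda r: (r["created_at"], r["id"]),
        reverse=True,
    )
    if cursor is not None:
        boundary = decode_cursor(cursor)
        # Keyset step: keep only rows strictly older than the cursor row.
        mine = [r for r in mine if (r["created_at"], r["id"]) < boundary]
    page = mine[:limit]
    more = len(mine) > limit
    next_cursor = encode_cursor(page[-1]["created_at"], page[-1]["id"]) if more else None
    return {"orders": page, "next_cursor": next_cursor}
```

Because each page starts from a position in the index rather than an offset, page cost stays roughly flat as a user's history grows, which is why the composite index matters for this endpoint.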

Data Model

table · orders · Order · 9 columns

The central write target — every checkout creates a row here. This is what's saturating the orders-db in the starting system. 2,800 inserts/second against a maxRps=2000 database = backpressure cascade.

column	type	constraints	notes
id	uuid	PK	DEFAULT gen_random_uuid()
user_id	uuid	IDX · FK→users.id	Index for listing orders by user
idempotency_key	uuid	IDX	UNIQUE — the safety net against duplicate checkouts
status	varchar(20)	IDX	pending | confirmed | processing | shipped | delivered | cancelled
total_cents	bigint	—	Store money as integer cents — never float
payment_method_id	varchar(64)	—	Tokenized, never store raw card data
shipping_address_id	uuid	—	FK to addresses table (not modeled here)
created_at	timestamptz	IDX	DEFAULT now(); index for time-range queries and partitioning
updated_at	timestamptz	—	DEFAULT now()

capacity · 2,800 writes/s × 600B/row ≈ 1.7MB/s write throughput. At 86,400 s/day ≈ 240M rows/day. Partition by created_at date after month 1, or archive orders older than 30 days to cold storage. The incident fix is not about storage — it's about write IOPS and connection pool limits.

table · order_items · Order Item · 5 columns

Line items for each order. Each checkout inserts N rows here (one per product). At 2,800 checkouts/s × avg 3 items = 8,400 item-row inserts/s, on top of the 2,800 order rows: roughly 11,200 row inserts/s across both tables. This amplifies the DB write pressure.

column	type	constraints	notes
id	uuid	PK
order_id	uuid	IDX · FK→orders.id	ON DELETE CASCADE
product_id	uuid	IDX	FK to products table
quantity	integer	—	Must be > 0
unit_price_cents	bigint	—	Snapshot at order time — not a live FK to current price

capacity · ~8,400 inserts/s at avg 3 items/order. Amplifies the DB write load significantly. Consider batching item inserts in a single transaction with the order insert — both commit or neither does.
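A minimal sketch of the "both commit or neither does" pattern, with the UNIQUE idempotency_key acting as the database-level safety net. SQLite stands in for Postgres so the example runs anywhere; the trimmed column set and the helper names open_db and create_order are assumptions for illustration.

```python
import sqlite3
import uuid

def open_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE orders (
            id TEXT PRIMARY KEY,
            idempotency_key TEXT NOT NULL UNIQUE,
            total_cents INTEGER NOT NULL
        );
        CREATE TABLE order_items (
            id TEXT PRIMARY KEY,
            order_id TEXT NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
            product_id TEXT NOT NULL,
            quantity INTEGER NOT NULL CHECK (quantity > 0),
            unit_price_cents INTEGER NOT NULL
        );
    """)
    return conn

def create_order(conn, idempotency_key, items):
    """items: list of (product_id, quantity, unit_price_cents).
    Returns (order_id, created)."""
    order_id = str(uuid.uuid4())
    total = sum(qty * price for _, qty, price in items)
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute("INSERT INTO orders VALUES (?, ?, ?)",
                         (order_id, idempotency_key, total))
            conn.executemany(
                "INSERT INTO order_items VALUES (?, ?, ?, ?, ?)",
                [(str(uuid.uuid4()), order_id, pid, qty, price)
                 for pid, qty, price in items],
            )
    except sqlite3.IntegrityError:
        row = conn.execute("SELECT id FROM orders WHERE idempotency_key = ?",
                           (idempotency_key,)).fetchone()
        if row is None:
            raise  # some other constraint failed, e.g. quantity <= 0
        return row[0], False  # duplicate key: replay the existing order
    return order_id, True
```

A duplicate key maps to the 200 replay response, while a bad line item rolls back the order row along with the items, so no partial order is ever visible.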

Redis / Cache Contracts

Key patterns, TTLs, and commands. Your design must justify the hotness-critical keys.

idempotency:{idempotency_key} · STRING · TTL: 86400s (24h, long enough to cover retry windows) · critical

Guards against duplicate order creation when clients retry on 5xx errors. Check this BEFORE the DB write and BEFORE payment authorization.

SET idempotency:{key} {order_id} EX 86400 NX # NX = set only if not exists

GET idempotency:{key} # exists? return cached order_id

rationale · Without this, a 503 response causes the client to retry, potentially creating two orders and two payment charges. Redis SET NX is atomic — only one of N concurrent retries will win. The DB unique constraint is the second line of defense (belt-and-suspenders for durability).
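To make the ordering concrete, here is a small sketch of a checkout handler that claims the idempotency key before charging. FakeRedis is an in-memory stand-in for the two Redis commands above (it ignores TTLs), and checkout, charge, and PaymentDeclined are hypothetical names, not part of the starting system.

```python
import uuid

class PaymentDeclined(Exception):
    pass

class FakeRedis:
    """In-memory stand-in for SET ... NX, GET, and DEL (no TTL handling)."""
    def __init__(self):
        self._data = {}

    def set_nx(self, key, value):
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

def checkout(store, charge, idempotency_key, total_cents):
    key = f"idempotency:{idempotency_key}"
    order_id = str(uuid.uuid4())
    # Claim the key BEFORE charging: only one of N retries can win this step.
    if not store.set_nx(key, order_id):
        return 200, store.get(key)  # idempotent replay, no second charge
    try:
        charge(order_id, total_cents)
    except PaymentDeclined:
        store.delete(key)  # release the key so the user can retry
        return 402, None
    return 201, order_id
```

One gap this sketch leaves open: a retry that arrives while the first request is still charging gets the order_id back before the order is final. A real handler would likely store a status next to the order_id, and still rely on the DB unique constraint behind it.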

cart:{cart_id} · HASH · TTL: 3600s (1h, refreshed on cart updates) · high

Active cart contents for logged-in users. Reduces read load on the product catalog DB during checkout validation.

HGETALL cart:{cart_id} # fetch cart items

HSET cart:{cart_id} product:{id} {quantity} # update item

DEL cart:{cart_id} # on checkout complete

rationale · During a flash sale, thousands of users validate their carts simultaneously. Without caching, each checkout reads product availability from Postgres (~3-5 items/checkout × 4k rps = up to 20k read rps on the catalog DB). Redis serves these reads at sub-ms latency, keeping catalog DB idle.

rate:checkout:{user_id} · STRING · TTL: rolling 60s · medium

Per-user rate limit to prevent a single user from hammering checkout (bot abuse, accidental retry storms).

INCR rate:checkout:{user_id}

EXPIRE rate:checkout:{user_id} 60

rationale · During flash sales, aggressive bots can generate thousands of checkout requests per user-token. 10 checkouts/minute/user is a generous limit for legitimate users. This sheds bot load before it hits the DB.
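A sketch of the counter logic behind the INCR / EXPIRE pair, written against an in-memory dict with an injected clock so it can run without Redis. One detail worth noting: calling EXPIRE after every INCR pushes the expiry forward each time, so a user who checks out once every 50 seconds would never see the counter reset. The sketch therefore sets the expiry only when the counter is created. The class name and the 10-per-60s defaults are assumptions taken from the rationale above.

```python
class FixedWindowLimiter:
    def __init__(self, limit=10, window_s=60):
        self.limit = limit
        self.window_s = window_s
        self._counters = {}  # key -> (count, expires_at)

    def allow(self, user_id, now):
        key = f"rate:checkout:{user_id}"
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Key is missing or expired: start a fresh window.
            # Equivalent to calling EXPIRE only when INCR returns 1.
            count, expires_at = 0, now + self.window_s
        count += 1  # the INCR step
        self._counters[key] = (count, expires_at)
        return count <= self.limit  # False maps to a 429 response
```

This is a fixed window rather than a true rolling one; a sliding-window variant is smoother at the window boundary but costs more per request.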

Capacity Math

Pre-computed numbers to anchor your justifications. Use these — the grader checks your claims against them.

traffic

4,000 rps

Total request load (steady state)

traffic

2,800 rps writes

Write load (70% checkout = writes)

= 4,000 × 70% write ratio

traffic

1,200 rps reads

Read load (order status, product lookups)

= 4,000 × 30% read ratio

capacity

4,000 rps

Starting system total checkout-api capacity

= 2 replicas × maxRps 2000 = exactly at 100% utilization

capacity

2,000 rps

Starting orders-db write capacity

= Single instance at maxRps=2000 — less than the 2,800 write rps hitting it

capacity

NONE

DB throughput headroom at 4k rps

= 2,800 write rps > 2,000 maxRps → orders-db is the bottleneck

traffic

5,600 write rps

Flash sale write load

= 8,000 rps × 70% — needs DB that can handle this

storage

~600 bytes

Order row size estimate

= UUID + user_id + status + total_cents + 5 items avg + timestamps

storage

~240M rows/day

Orders per day (steady state)

= 2,800 write rps × 86,400 s — use partitioning or archival

cost

~$1,400/mo

Reference fix cost

= DB upgrade $500 + replica $400 + 2 more checkout-api $400 + LB $100
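The numbers above can be reproduced with a few lines of arithmetic, which is handy when you change one input and want to see what moves. This is only a sketch; the function and dictionary names are made up for illustration, and the inputs are the figures stated in this brief.

```python
SECONDS_PER_DAY = 86_400
ROW_BYTES = 600  # order row size estimate from this brief

def capacity_summary(total_rps=4_000, write_pct=70, spike=2,
                     api_replicas=2, api_max_rps=2_000, db_max_rps=2_000):
    write_rps = total_rps * write_pct // 100
    return {
        "write_rps": write_rps,
        "read_rps": total_rps - write_rps,
        "api_utilization_pct": 100 * total_rps / (api_replicas * api_max_rps),
        "db_write_utilization_pct": 100 * write_rps / db_max_rps,
        "flash_sale_write_rps": write_rps * spike,
        "rows_per_day": write_rps * SECONDS_PER_DAY,
        "write_mb_per_s": write_rps * ROW_BYTES / 1_000_000,
    }

# Reference fix line items in USD per month, as listed above.
REFERENCE_FIX_USD = {"db_upgrade": 500, "replica": 400,
                     "checkout_api_x2": 400, "lb": 100}

if __name__ == "__main__":
    for name, value in capacity_summary().items():
        print(f"{name}: {value}")
    print("reference_fix_usd:", sum(REFERENCE_FIX_USD.values()))
```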

How to Approach This

Work through these phases in order before submitting. Each phase builds on the last.

Phase 1 — Triage: What Is Saturated?

5 min · 3 questions

Before touching anything, identify the exact component that's causing failures. In the starting system, only one thing can fail at 4k rps — and the sim will show you which component is at 100%+ utilization. Name it before proposing a fix.

Run the sim on the starting system. Which component shows saturation?

hint

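The checkout-api has 2 replicas × 2,000 maxRps = 4,000 total rps capacity. That's exactly at 100% utilization, with no headroom. The orders-db has maxRps=2,000 but takes 2,800 write rps (70% of 4,000, since every checkout writes), and with no replica in the starting system the 1,200 read rps land on it as well. That puts orders-db at 140% utilization on writes alone and up to 200% if every request reaches it. Either way, this is the primary bottleneck.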

Why is a 50% error rate happening NOW, when traffic has been flat since the sale launched?

hint

Flash sales create a correlated burst: everyone tries to checkout at the same moment. The system was fine at 2k rps (50% utilization), got slammed to 4k rps at sale start, and the orders-db — already the weakest link — immediately saturated. Orders-db backpressure cascades to checkout-api (connection pool full), which cascades to the LB (queue backed up), which starts returning 503s.

Is the problem write capacity, read capacity, or connection limits?

hint

Write capacity. 70% of 4k rps = 2,800 writes/s to the orders-db, but orders-db.maxRps = 2,000. You're asking a database to absorb 40% more writes than it's sized for. The fix is write capacity, not read replicas (read replicas help reads; they don't absorb writes).

Deliverable

One sentence in your overall defense: "The bottleneck is orders-db at [maxRps] with [write_rps] write rps — [%] over capacity."

Phase 2 — Root Cause: The Capacity Math

10 min · 3 questions

Quantify exactly what's broken. The incident retro will be graded on whether you can explain the root cause with numbers, not just "the database was overloaded."

What write rps is the orders-db receiving vs. its capacity?

hint

4,000 total rps × 70% write ratio = 2,800 write rps. orders-db maxRps = 2,000. That's 800 rps of unsatisfied write demand, queuing into backpressure, exhausting the checkout-api's DB connection pool. Result: checkout-api blocks, LB queues fill, 503s cascade.

Why didn't the checkout-api scaling (2 replicas) prevent this?

hint

Adding more app server replicas doesn't help when the bottleneck is DOWNSTREAM. More checkout-api instances just create more concurrent writers hammering the same saturated DB. You can scale checkout-api to 100 replicas — it won't matter if orders-db can only absorb 2,000 writes/s. This is the most common incident misdiagnosis: "scale the API tier" when the DB is the wall.
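A toy model makes the point: served writes are capped by the smallest stage in the path, so growing the API tier changes nothing while the DB cap stays put. This is a simplification under stated assumptions (steady load, no queuing or retries), and write_path is a hypothetical helper, not how the sim computes results.

```python
def write_path(offered_rps, api_replicas, api_max_rps=2_000,
               db_max_write_rps=2_000, write_pct=70):
    # Stage 1: the API tier admits at most its combined capacity.
    admitted = min(offered_rps, api_replicas * api_max_rps)
    # Stage 2: the write share of admitted traffic hits the primary DB.
    write_demand = admitted * write_pct // 100
    served = min(write_demand, db_max_write_rps)
    return {"write_demand": write_demand,
            "served": served,
            "unserved": write_demand - served}

if __name__ == "__main__":
    print(write_path(4_000, api_replicas=2))
    print(write_path(4_000, api_replicas=100))  # same result: DB is the wall
    print(write_path(4_000, api_replicas=2, db_max_write_rps=6_000))
```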

What would the flash sale do to a fixed system?

hint

Flash sale = 2× traffic = 8,000 rps. Write load = 5,600 write rps. Your fix must handle this, not just 4,000 rps. Size orders-db for 6,000+ write rps with headroom, or add write sharding/queue buffering.

Deliverable

A written root cause in your overall defense: component, utilization percentage, the math showing why it failed, and the cascade chain.

Phase 3 — Fix the Write Path

15 min · 3 questions

Fix the orders-db bottleneck without breaking the live system. The constraint: keep existing components — you're modifying, not replacing. Every option has a trade-off you must name.

What are your options to increase DB write throughput?

hint

Option A: Vertical scale. Upgrade orders-db to a larger instance (higher maxRps). Fastest fix, no code change.
Option B: Write queue. Put an async queue between checkout-api and orders-db (absorbs bursts but adds latency + complexity).
Option C: Write sharding. Partition orders-db by user_id or created_at (complex, not needed here).
Option D: Connection pooling (PgBouncer). Doesn't increase DB capacity, but reduces connection overhead.
For this incident, Option A is the surgical fix.

If you choose vertical scale, what instance size do you need?

hint

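Flash sale peak = 5,600 write rps; 2× headroom on that would mean 11,200 rps of capacity. A db.r6g.2xlarge handles ~10,000 rps at $480/mo, which is about 1.8× the peak (56% utilization) and well within the $2,000 budget. Size for the flash sale, not steady state. Alternatively: db.r6g.xlarge at ~6,000 rps ($300/mo) + 2× checkout-api headroom gets you through the flash sale with only a ~7% buffer (about 93% utilization), which sits above the 90% utilization line the grader uses for full credit on the flash-sale axis.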

Do you need to scale checkout-api too?

hint

At 8,000 rps (flash sale), 2 replicas × 2,000 = 4,000 capacity — only 50% of needed throughput. You need 4 replicas × 2,000 = 8,000 rps capacity. Or increase maxRps per instance to 4,000 and keep 2. The DB fix is required; the API fix is also required for the flash sale scenario.
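The sizing questions in this phase reduce to two small calculations: utilization at peak for a given capacity, and the replica count needed to stay under a utilization ceiling. The sketch below uses the capacities quoted in the hints; the function names are assumptions, and whether the 90% line from the grading rubric applies to the API tier as well as the DB is something to verify in the sim.

```python
def utilization_pct(load_rps, capacity_rps):
    return 100 * load_rps / capacity_rps

def replicas_needed(peak_rps, per_replica_rps, max_util_pct=100):
    # Usable capacity per replica, scaled by 100 to stay in integers.
    usable = per_replica_rps * max_util_pct
    return -(-peak_rps * 100 // usable)  # integer ceiling division

if __name__ == "__main__":
    flash_writes = 5_600
    for name, capacity in [("~6,000 rps option", 6_000),
                           ("~10,000 rps option", 10_000)]:
        print(name, round(utilization_pct(flash_writes, capacity), 1), "%")
    print("replicas at 100% ceiling:", replicas_needed(8_000, 2_000))
    print("replicas at 90% ceiling:", replicas_needed(8_000, 2_000, 90))
```

If the 90% ceiling is applied literally, 4 replicas at 8,000 rps run at exactly 100%, so a fifth replica (or a higher per-instance maxRps) may be needed for full credit.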

Deliverable

Updated canvas: upgraded orders-db (higher maxRps) + 2 additional checkout-api replicas (or increased per-instance maxRps). Run sim and verify both scenarios (steady-state and flash sale) show 0% errors.

Common pitfall

Adding a read replica to fix a write-heavy bottleneck. Read replicas help with reads. This system's problem is 2,800 write rps against a 2,000 rps DB. A read replica doesn't accept writes. The fix is write capacity on the PRIMARY — either bigger instance or write queue.

Phase 4 — Availability: Region-A DB Outage

15 min · 3 questions

Design for the region-A outage scenario: orders-db goes down. For a WRITE-HEAVY checkout system, this is harder than the URL shortener's outage (where cache absorbed 95% of reads). Here, writes must land somewhere durable.

What happens to checkouts when orders-db (region-A) is down?

hint

Every POST /checkout requires a write. If the primary DB is down, you CANNOT complete a checkout without losing order data. Options: (A) Accept errors for the 60s outage — < 1% of daily orders (the NFR allows < 1% errors), (B) queue writes to a secondary DB in region-B, (C) failover to a hot standby in region-B. Option A sounds bad but IS < 1% errors across a day. The NFR says < 1% error rate, not zero.
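The "< 1% across a day" claim for Option A is easy to check with arithmetic. The sketch assumes traffic is flat across the day, which is the optimistic case: an outage that lands on a flash-sale peak fails a larger share of the day's requests. The function names are hypothetical.

```python
DAY_S = 86_400

def daily_error_pct(outage_s, failing_share_pct=100):
    """Percent of a day's requests that fail, assuming flat traffic.
    failing_share_pct is the share of requests that fail during the outage
    (100 if everything fails, 70 if only the write path fails)."""
    return outage_s * failing_share_pct / DAY_S

def max_outage_s(budget_pct=1, failing_share_pct=100):
    """Longest outage that still fits the daily error budget."""
    return budget_pct * DAY_S / failing_share_pct

if __name__ == "__main__":
    print(round(daily_error_pct(60), 4))       # all requests fail for 60s
    print(round(daily_error_pct(60, 70), 4))   # reads stay up on a replica
    print(max_outage_s())                      # seconds of total outage allowed
```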

Can a read replica in region-B absorb the outage?

hint

A read replica accepts reads, not writes. For the < 1% error rate NFR to be met via a replica, you'd need to: (A) accept reads from the replica during outage (GET /orders works), (B) reject writes with a queued retry (POST /checkout returns 503 with Retry-After, queues to a durable message queue in region-B for replay when primary recovers). This is the safest write-heavy availability pattern.

What is 'durable' for checkout writes during an outage?

hint

If you queue checkouts to Kafka/SQS in region-B and replay them when the primary comes back, you must handle: (A) the customer was charged but no order appeared yet — need idempotency on replay, (B) inventory may have changed during the 60s — need conflict resolution. The idempotency_key in the DB schema handles (A).
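A rough sketch of what replay could look like once the primary is back, covering both concerns: duplicates are skipped by idempotency key, and orders whose stock ran out during the outage are set aside for a conflict path instead of being written. The message shape, the in-memory dicts standing in for the orders table and inventory, and the function name replay are all assumptions for illustration.

```python
def replay(queue, orders_by_key, stock):
    """queue: list of dicts with idempotency_key, product_id, quantity.
    orders_by_key: stand-in for the orders table, keyed by idempotency_key.
    stock: stand-in for inventory, product_id -> units available."""
    result = {"applied": [], "skipped_duplicate": [], "conflict": []}
    for msg in queue:
        key = msg["idempotency_key"]
        if key in orders_by_key:
            # (A) Already written, e.g. the message was delivered twice.
            result["skipped_duplicate"].append(key)
            continue
        if stock.get(msg["product_id"], 0) < msg["quantity"]:
            # (B) Inventory changed during the outage: needs a refund
            # or notification path, not a silent write.
            result["conflict"].append(key)
            continue
        stock[msg["product_id"]] -= msg["quantity"]
        orders_by_key[key] = msg
        result["applied"].append(key)
    return result
```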

Deliverable

A Postgres read replica in region-B on the canvas. Justification must explain: GET /orders reads via replica (stays up during outage), POST /checkout writes either queue to region-B or fail with graceful 503 (< 1% error rate across the day). Both are valid — name which you chose and why.

Phase 5 — Incident Retro

15 min · 3 questions

Write the incident retro in your overall defense. A PRR retro has three mandatory sections: what happened (root cause + impact), what changed (the fix, with numbers), and what prevents a repeat (the systemic change). Vague retros lose points.

What is the precise root cause, in one sentence with numbers?

hint

Model answer: "orders-db was provisioned at 2,000 rps write capacity but received 2,800 rps at the flash sale peak (40% over capacity), causing connection pool exhaustion that cascaded to checkout-api (100% utilization) and LB backpressure, resulting in 52% error rate."

What systemic change prevents a repeat?

hint

Options: load testing before each flash sale (detects the capacity gap before production), autoscale rules on orders-db CPU/connections that trigger a vertical scale or read-offload, a circuit breaker on checkout-api that queues rather than errors when DB is saturated, or a write queue (SQS) that absorbs burst writes and smooths DB load. Name one and argue why it specifically prevents THIS incident.

What monitoring would have caught this BEFORE the sale?

hint

Postgres connection pool usage (alert at 80% — that's the signal that a DB is approaching saturation), orders-db write IOPS (alert at 60% of max — gives time to act), checkout-api error rate (alert at 1% — too late for this incident, should have been caught pre-sale). Pre-sale load test simulating 2× base traffic is the real answer.
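The thresholds above can be written down as a small rule table, and a pre-sale check can then scale current readings by the expected multiplier to see what would fire. The metric names and both helpers are hypothetical, and the linear scaling is a deliberately crude assumption; a real load test is what gives the actual curve.

```python
# Alert thresholds in percent, taken from the hint above.
THRESHOLDS_PCT = {
    "db_connection_pool_used": 80,
    "db_write_iops_of_max": 60,
    "checkout_error_rate": 1,
}

def firing_alerts(metrics_pct):
    """Return the names of alerts whose threshold is met or exceeded."""
    return sorted(name for name, limit in THRESHOLDS_PCT.items()
                  if metrics_pct.get(name, 0) >= limit)

def projected(metrics_pct, multiplier, scale_with_load):
    """Scale load-dependent metrics linearly by a traffic multiplier."""
    return {name: value * multiplier if name in scale_with_load else value
            for name, value in metrics_pct.items()}

if __name__ == "__main__":
    pre_sale = {"db_connection_pool_used": 45,
                "db_write_iops_of_max": 50,
                "checkout_error_rate": 0.1}
    load_metrics = {"db_connection_pool_used", "db_write_iops_of_max"}
    print(firing_alerts(pre_sale))
    print(firing_alerts(projected(pre_sale, 2, load_metrics)))
```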

Deliverable

Overall defense that includes: root cause with utilization number, fix description with target metrics, and one systemic prevention measure. This is what gets graded on the justification-quality axis.

How You'll Be Graded

Service is restored at current traffic · 25% · scalability

Ready for the next flash sale · 17% · scalability

Survives the region-A database outage · 25% · availability

Fixed within the incident budget · 17% · cost-efficiency

The incident retro holds up · 17% · justification-quality

Service is restored at current traffic · scalability

The system you hand back must serve 4,000 rps with ~0% errors — that is the incident mission.

Full credit

Steady-state error rate is 0% and no component is saturated.

Partial

Error rate under 2% with identified residual risk named in the justification.

Zero

Errors persist above 2% at steady state.

Ready for the next flash sale · scalability

A 2x spike must not drop requests; checkout is write-heavy, so a cache alone cannot save you — write capacity is the constraint.

Full credit

0% errors during the spike with no component above 90% utilization.

Partial

Brief saturation, errors under 2%.

Zero

Sustained drops during the spike.

Survives the region-A database outage · availability

All region-A databases die for 60s; order writes must continue somewhere durable, OR fail gracefully (< 1% error rate across the day).

Full credit

Error rate under 1% — via out-of-region write path, queue buffering, or justified graceful degradation.

Partial

Bounded errors (1-5%) with a credible recovery story.

Zero

Checkout hard-fails for the full outage window.

Fixed within the incident budget · cost-efficiency

$2,000/month total. Throwing oversized instances at the problem loses points — the diagnosis should be surgical.

Full credit

Under budget; every capacity increase maps to an identified bottleneck with the math.

Partial

Under budget with unjustified over-provisioning, or marginally over.

Zero

More than 30% over budget.

The incident retro holds up · justification-quality

Root cause named precisely with utilization numbers, fix justified against the capacity math, and a prevention step proposed.

Full credit

Root cause + utilization % + cascade chain + fix + prevention — all argued from numbers.

Partial

Correct fix but vague root-cause ("database was slow") or missing prevention.

Zero

No real diagnosis — "added more servers" without explaining why.

Failure Scenarios the Sim Will Inject

Each scenario fires automatically during your simulation run. Your design must survive all of them.

traffic spike

Right now (the incident)

traffic spike · t=30s · 60s

Next flash sale (2x traffic)

crash · t=30s · 60s

Region A database outage

region-a · postgres
