incident · mid
Incident: Checkout Is Melting Down
You're on call. The flash sale started 10 minutes ago and checkout error rates are climbing past 50%. You inherit the system as it runs — diagnose the bottleneck, fix the architecture under a $2,000/month budget, and be ready for the next sale and a region outage.
Steady traffic: 4,000 rps
Spike: 2×
Budget: $2,000/mo
Functional requirements
- Checkout requests must complete with < 200ms p99 latency
- Order writes must reach a durable relational store
- The fix must keep the existing components' roles intact (this is a live system)
Non-functional requirements
- Restore steady-state error rate to ~0% at 4,000 rps
- Survive the next flash sale (2x traffic) without dropping requests
- Survive a region-A database outage with < 1% error rate
- Total infrastructure cost under $2,000/month
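The requirements above can be turned into concrete targets with a little arithmetic. The sketch below uses only figures stated in this brief (4,000 rps, a 2x spike, a 60-second outage, a 1% error ceiling). It assumes the outage occurs at steady traffic, which the brief does not specify, and all variable names are illustrative.

```python
# Figures taken from the brief.
STEADY_RPS = 4_000
SPIKE_MULTIPLIER = 2
OUTAGE_SECONDS = 60          # region-A databases are down for 60s
MAX_OUTAGE_ERROR_PCT = 1     # error rate must stay below 1%

# Peak load the system must absorb during the next flash sale.
peak_rps = STEADY_RPS * SPIKE_MULTIPLIER

# Assumption: the outage happens at steady traffic, not during a spike.
requests_during_outage = STEADY_RPS * OUTAGE_SECONDS

# Failed requests must stay strictly below this count over the outage window.
failure_ceiling = requests_during_outage * MAX_OUTAGE_ERROR_PCT // 100

print(f"peak load to design for: {peak_rps} rps")
print(f"requests during outage:  {requests_during_outage}")
print(f"failures must stay under: {failure_ceiling}")
```

If the outage were to coincide with a flash sale, the same calculation with `peak_rps` in place of `STEADY_RPS` would double both the request count and the failure ceiling.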
Failure scenarios the sim will run
- ⚡ Right now (the incident)
- ⚡ Next flash sale (2x traffic)
- ⚡ Region A database outage
How you'll be graded
- Service is restored at current traffic (25% · scalability)
The system you hand back must serve 4,000 rps with ~0% errors — that is the incident.
- Ready for the next flash sale (17% · scalability)
A 2x spike must not drop requests; checkout is write-heavy, so a cache alone cannot save you.
- Survives the region-A database outage (25% · availability)
All region-A databases die for 60s; order writes must continue somewhere durable.
- Fixed within the incident budget (17% · cost-efficiency)
$2,000/month total. Throwing oversized instances at the problem loses points — the diagnosis should be surgical.
- The incident retro holds up (17% · justification-quality)
Root cause named precisely, fix justified with numbers, and a prevention step proposed.
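One way to justify a fix with numbers is a small sizing and budget check. The sketch below is illustrative only: the helper names, the 70% headroom default, and every per-instance capacity and price in the example are hypothetical placeholders, not values given by this brief or by the simulator.

```python
import math


def instances_needed(peak_rps, per_instance_rps, headroom=0.7):
    """Instances required so each runs at no more than `headroom` of its rated capacity."""
    return math.ceil(peak_rps / (per_instance_rps * headroom))


def monthly_cost(line_items):
    """Sum count * unit_cost over a dict of {name: (count, unit_cost_per_month)}."""
    return sum(count * unit_cost for count, unit_cost in line_items.values())


BUDGET_PER_MONTH = 2_000   # from the brief
PEAK_RPS = 4_000 * 2       # steady traffic times the 2x spike, from the brief

# Hypothetical capacity and prices; replace with the real component specs.
app_count = instances_needed(PEAK_RPS, per_instance_rps=1_500)
plan = {
    "app_servers": (app_count, 60),
    "db_primary_region_a": (1, 400),
    "db_replica_region_b": (1, 400),
}

total = monthly_cost(plan)
print(f"{app_count} app servers, ${total}/month, within budget: {total <= BUDGET_PER_MONTH}")
```

The headroom factor keeps each instance below its rated capacity so that a spike does not push it straight into saturation. The right value depends on the components actually available.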