beprodready

incident · mid

Incident: Checkout Is Melting Down

You're on call. The flash sale started 10 minutes ago and checkout error rates are climbing past 50%. You inherit the system as it runs — diagnose the bottleneck, fix the architecture under a $2,000/month budget, and be ready for the next sale and a region outage.

Steady traffic

4,000 rps

Spike

Budget

$2,000/mo

Functional requirements

  • Checkout requests must complete with < 200ms p99 latency
  • Order writes must reach a durable relational store
  • The fix must keep the existing components' roles intact (this is a live system)

Non-functional requirements

  • Restore steady-state error rate to ~0% at 4,000 rps
  • Survive the next flash sale (2x traffic) without dropping requests
  • Survive a region-A database outage with < 1% error rate
  • Total infrastructure cost under $2,000/month
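
The targets above reduce to a few lines of arithmetic. The sketch below is a rough sanity check, not part of the simulator: the 4,000 rps, 2x, and $2,000/month figures come from this brief, while the `Tier` type, the per-instance throughput and price inputs, and the assumption that every request passes through every tier in series are hypothetical.

```python
from dataclasses import dataclass

STEADY_RPS = 4_000            # from the brief
SPIKE_MULTIPLIER = 2          # "2x traffic" from the brief
BUDGET_USD_PER_MONTH = 2_000  # from the brief


@dataclass
class Tier:
    """One tier of a candidate design. All field values are hypothetical inputs."""
    name: str
    instances: int
    rps_per_instance: float
    usd_per_instance_month: float

    @property
    def capacity_rps(self) -> float:
        return self.instances * self.rps_per_instance

    @property
    def monthly_cost(self) -> float:
        return self.instances * self.usd_per_instance_month


def check_design(tiers: list[Tier]) -> dict:
    """Compare a design against the brief's traffic and budget targets.

    Assumes tiers are in series, so the slowest tier caps total throughput.
    """
    spike_rps = STEADY_RPS * SPIKE_MULTIPLIER
    bottleneck = min(tiers, key=lambda t: t.capacity_rps)
    total_cost = sum(t.monthly_cost for t in tiers)
    return {
        "spike_rps": spike_rps,
        "bottleneck": bottleneck.name,
        "handles_steady": bottleneck.capacity_rps >= STEADY_RPS,
        "handles_spike": bottleneck.capacity_rps >= spike_rps,
        "total_cost": total_cost,
        "within_budget": total_cost < BUDGET_USD_PER_MONTH,
    }


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    design = [Tier("app", 4, 2500, 150), Tier("db", 2, 3000, 400)]
    print(check_design(design))
```

With these made-up inputs the database tier is the bottleneck: it covers steady traffic but not the 2x spike, even though the design stays within budget.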

Failure scenarios the sim will run

  • Right now (the incident)
  • Next flash sale (2x traffic)
  • Region A database outage

How you'll be graded

  • Service is restored at current traffic (25% · scalability)

    The system you hand back must serve 4,000 rps with ~0% errors — that is the incident.

  • Ready for the next flash sale (17% · scalability)

    A 2x spike must not drop requests; checkout is write-heavy, so a cache alone cannot save you.

  • Survives the region-A database outage (25% · availability)

    All region-A databases die for 60s; order writes must continue somewhere durable.

  • Fixed within the incident budget (17% · cost-efficiency)

    $2,000/month total. Throwing oversized instances at the problem loses points — the diagnosis should be surgical.

  • The incident retro holds up (17% · justification-quality)

    Root cause named precisely, fix justified with numbers, and a prevention step proposed.
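
The page does not say how the five weights combine into a final score. One plausible reading, shown here as an assumption only, is a weighted average of per-criterion scores between 0 and 1. The listed weights sum to 101 (presumably rounding), so the sketch normalizes by the actual sum; the dictionary keys are hypothetical labels for the criteria above.

```python
# Weights as listed on the page, in percent. Key names are hypothetical labels.
WEIGHTS = {
    "restored_at_current_traffic": 25,
    "ready_for_flash_sale": 17,
    "survives_region_a_outage": 25,
    "within_budget": 17,
    "retro_holds_up": 17,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Weighted average of per-criterion scores, each in [0, 1].

    Divides by the sum of the weights (101 here) rather than assuming 100.
    """
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


if __name__ == "__main__":
    # Example: everything passes except the flash-sale criterion.
    scores = {name: 1.0 for name in WEIGHTS}
    scores["ready_for_flash_sale"] = 0.0
    print(round(weighted_score(scores), 3))
```

Under this reading, the two 25% criteria (restoring service and surviving the region-A outage) together account for about half of the result.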
