Functional
- Payment requests must complete with < 300ms p99 latency
- Failed payments must be retried with appropriate backoff
- Duplicate payment prevention on retry (idempotency)
Non-functional
- Restore steady-state error rate to ~0% at 2,000 rps
- A 30-second network hiccup must not cause a cascading outage lasting more than 5 minutes
- Circuit breaker must prevent payment-db saturation from cascading to payment-api
- Total infrastructure cost under $1,500/month
Failure scenarios
- ⚡ The retry storm (current state: 8k rps)
- ⚡ payment-db slow degradation
Approach guide
- 1. Phase 1 — Diagnose the Cascade (10m)
- 2. Phase 2 — Fix the Retry Logic (10m)
- 3. Phase 3 — Circuit Breaker for payment-db (15m)
- 4. Phase 4 — System-Level Prevention (10m)
Full guide on the brief page.
Key numbers
- Original steady-state load: 2,000 rps
- Failed requests during 8s hiccup: 9,600 failed requests
- Retry load multiplier (retry-count=3, retry-delay=0ms): 4x load
- Retried load during outage: 8,000 rps
- payment-api total capacity: 5,000 rps