Build Your Own Rate Limiter
Every public API in production has a rate limiter. Build three algorithms -- token bucket, fixed window, sliding window -- in Python, understand the trade-offs, and know exactly what Stripe's 429 means.
Why build this
You've hit a 429 Too Many Requests. You've seen "rate limit: 1000 req/min" in API docs. But do you know why Stripe uses a token bucket instead of a fixed window? Why some APIs allow burst traffic up to 10x the sustained rate? Why a fixed window counter creates a 2x burst at the window boundary that no one intended? Rate limiters are one of the first things added to any production API, and they're the first thing broken when engineers implement them naively. This track teaches the three standard algorithms by having you implement them -- including writing the test that demonstrates the fixed window's famous boundary burst vulnerability. After this track, when you configure rate_limit_per_minute=60 in your API gateway, you'll know which of three fundamentally different behaviors you're actually getting.
What you'll build
Three complete, tested rate limiter implementations in Python: Stage 1 -- Token Bucket: The most common production algorithm. A bucket fills with tokens at a fixed rate (capacity capped). Each request consumes one token; requests with no tokens are rejected. Allows burst traffic up to the bucket capacity while enforcing a sustained rate. This is what Stripe, GitHub Actions, and Twilio use. Stage 2 -- Fixed Window: The simplest possible counter -- count requests per time window, reset at the boundary. Easy to implement, easy to understand, and has a famous flaw: requests can cluster at window boundaries to burst at 2x the intended limit. You'll write the test that proves this. Stage 3 -- Sliding Window Log: Fixes the fixed window's boundary burst by keeping a log of every request timestamp and counting only requests in the last N seconds. Precise but memory-intensive -- each user's log grows linearly with their request rate. This is the algorithm Cloudflare's Redis-based implementations use.
How it works internally
Why rate limiting is hard: The algorithm matters less than where the state lives. A single-server counter in Python is trivial; a distributed counter across 50 API servers where each server only sees 1/50 of the traffic is fundamentally different. Every algorithm you build here has a distributed variant that requires Redis (centralized counter) or a gossip protocol (approximate). Understanding single-server algorithms is the prerequisite to reasoning about the distributed ones.
Token Bucket internals: The key insight is lazy refilling -- you don't need a background thread. On each request, compute elapsed = now - last_refill, add elapsed * rate tokens to the bucket (capped at capacity), update last_refill, then check if there are enough tokens. This calculation is atomic in a Redis Lua script, which is exactly how production systems implement it. Burst behavior: a user with a bucket capacity of 100 and rate of 10/s can send 100 requests instantly (burst), then is limited to 10/s. This is intentional -- batch jobs need burst headroom.
Fixed Window internals: One counter per user per time window. The window is computed as floor(now / window_size) * window_size -- every request in the same window slot increments the same counter. When the slot changes, reset to zero. In Redis: INCR key, EXPIRE key window_seconds (set TTL only on the first INCR). The key is named with the window slot so it auto-expires. The boundary burst: a user sends limit requests at t=59.9s (end of window N), then limit more at t=60.1s (start of window N+1). Both windows count them separately -- 2x the limit in 0.2 seconds. GitHub, Twitter, and most simple APIs accept this trade-off.
Sliding Window Log internals: A deque of timestamps per user. On each request: remove timestamps older than now - window_seconds from the front (the deque stays sorted by arrival time). If len(deque) < limit, append the new timestamp and allow. In Redis: ZADD key now request_id, ZREMRANGEBYSCORE key 0 (now-window), ZCARD key. The ZSET score is the timestamp; trimming is ZREMRANGEBYSCORE. Memory cost: the deque holds up to limit entries per user. At 1000 req/min limit, 10M users, that's 10B entries if all users are active at the limit. Production systems use the sliding window approximation (two fixed windows + interpolation) to reduce this to O(1) per user.
The 3 stages
Build a token bucket that refills at a fixed rate and allows bursts up to the bucket capacity. Implement lazy refilling -- compute elapsed tokens on each request, no background thread needed. This is Stripe's algorithm, AWS API Gateway's throttling model, and the most common production approach.
Implement a per-user counter that resets at each time window boundary. Then write the test that demonstrates the 2x boundary burst -- two windows of requests hitting the limit in 0.2 seconds. Understanding the flaw is the point: you'll see why some systems accept it and others don't.
Keep a deque of every request timestamp per user. Count only requests in the last N seconds. No boundary burst -- the window always looks back from now. Learn the memory trade-off (O(limit) vs O(1) for fixed window) and why Cloudflare approximates this instead of running it exactly.
You'll see this in production
Stripe uses token bucket for API rate limiting. The bucket refills at a sustained rate but allows bursts (useful for batch operations). When you get a 429, the Retry-After header tells you when tokens next become available -- that's the refill calculation you implement in Stage 1.
GitHub uses a fixed window (per-hour counters). The X-RateLimit-Remaining and X-RateLimit-Reset headers expose exactly the fixed window state you build in Stage 2. The boundary burst vulnerability exists but GitHub accepts it as a trade-off for implementation simplicity.
Cloudflare's edge rate limiting uses a sliding window approximation -- a variant of Stage 3 that stores two fixed-window buckets (current and previous) and linearly interpolates the count. Accurate to ~1% and uses O(1) space per user per edge node.
API Gateway's default throttling uses token bucket (burst limit + rate limit). The burst limit is the bucket capacity; the rate limit is the refill rate. Your Stage 1 implementation IS the conceptual model behind the AWS throttling documentation.
A popular Redis module implementing GCRA (Generic Cell Rate Algorithm), equivalent to a token bucket. Redis sorted sets (ZADD + ZREMRANGEBYSCORE) are the production primitive for sliding window log -- the exact data structure behind Stage 3.
Before you start
- ✓Python closures: returning a dict with functions from a factory function
- ✓Python's time module and floating-point Unix timestamps
- ✓What a deque is (collections.deque) and why it's O(1) at both ends
- ✓What an API rate limit is from a user perspective (you've hit a 429 before)
Stages
Ready to build?
Your code runs in the browser — no setup needed.