reference solutionsenior · greenfield

Greenfield: Design a Real-Time Chat System

Design a real-time chat system supporting 100k concurrent WebSocket connections, channel fan-out to 1,000 members, message history, and TTL-based presence — all within a $5,000/month budget. The hard problems are cross-gateway fan-out (a message sent to gateway-1 must reach users on gateway-2) and surviving a gateway crash without losing messages in flight.

Steady traffic

5k rps

Spike

3×

Budget

$5k/mo

Ref. components

6 links

websocketspub-submessage-queuefan-outpersistencepresence

Brief Canvas Solution Solution Canvas

On this pageWalkthrough Decisions Mistakes Full credit Approach

Reference walkthrough

How the reference architecture is reasoned out, end to end.

A real-time chat system at 100k concurrent connections requires two independent layers that serve different purposes: WebSocket gateways for connection state and last-mile delivery, and Redis pub/sub for cross-gateway routing. No single gateway can hold all 100k connections — at 64KB RAM per WebSocket, 100k connections require 6.4GB of RAM for connection state alone, spread across a gateway fleet. When a message arrives at gateway-1, it writes to Postgres (durable), then publishes to Redis (real-time).

Every gateway subscribed to that channel's pub/sub topic receives the message and forwards it over its local WebSocket connections. This architecture means any gateway can handle any message — no sticky sessions required for senders, only for receivers (which is implicit in the persistent connection). Message ordering is solved by Redis INCR: before inserting a message into Postgres, the handler atomically increments channel:seq:{channel_id} and uses the returned value as the sequence_number.

This is monotonically increasing, has no clock dependency, and is consistent even when multiple app servers process messages concurrently. Clients use sequence_number for deduplication (drop any message with a sequence_number already seen) and for reconnect replay (request messages after their last-seen sequence_number). The critical write ordering rule is "Postgres before Redis PUBLISH" — this ensures that even if the Redis publish fails or the gateway crashes mid-delivery, the message is safely in Postgres and will be found by any reconnecting client's history request.

Presence uses TTL-based state: SET presence:{user_id} 1 EX 60 on each heartbeat (every 30s), with the TTL providing a 2x safety margin. No explicit offline signaling is needed — crashes, network partitions, and clean disconnects all result in the presence key expiring after 60s. Per-channel presence queries use SMEMBERS channel:members:{channel_id} to get the member list, then MGET presence:{u1} ... to check each member's status in a single Redis round trip.

For a 1,000-member channel, this is one SMEMBERS + one MGET — typically < 5ms total. Gateway crashes are handled entirely by the reconnect + replay mechanism. When a gateway dies, its 10,000 WebSocket connections are terminated.

Clients detect this via heartbeat timeout (no pong within 10s) and reconnect to any available gateway with exponential backoff and jitter. After reconnect, the client requests missed messages using after_sequence={last_seen_sequence_number} — Postgres returns the gap. No messages are lost because they were written to Postgres before any Redis publish or WebSocket delivery attempt.

The gateway is a pure routing layer — it holds no message state that isn't already in Postgres.

Overall defense

Two api-gateway WebSocket clusters (one per region) each handle 30,000 concurrent connections. Clients connect to their nearest gateway; during a region-A gateway crash (targeting app-server kind), the api-gateway nodes are unaffected — 0% error rate. Redis Pub/Sub brokers cross-gateway message fan-out for real-time delivery with microsecond latency. Kafka durably persists the full message log (2 TB, 80,000 rps) for history and replay. Four postgres replicas (20,000 rps combined) store messages with room to scale. Placing the client in region-b ensures it survives a region-A outage by connecting to the region-b gateway and postgres replica seamlessly.

Key architecture decisions

The choices that separate a passing design from full credit.

Redis pub/sub is the cross-gateway delivery mechanism

Without Redis pub/sub, a message sent to gateway-1 cannot reach a user connected to gateway-2 without direct gateway-to-gateway communication (a distributed systems nightmare) or a polling loop (high latency). Redis pub/sub decouples senders from receivers: the sender publishes once, Redis delivers to all subscribed gateways, each gateway forwards to its local WebSocket connections. Redis pub/sub delivers each message to each subscriber in < 1ms and supports ~1M messages/s per instance — easily sufficient for this workload.

sequence_number solves ordering without clocks

Timestamps from distributed servers have clock skew (NTP drift up to ~100ms) and millisecond granularity — two messages sent within 1ms have the same timestamp and cannot be unambiguously ordered. Redis INCR on a per-channel counter returns a monotonically increasing integer with no clock dependency. Gaps in the sequence (from failed inserts or burned increments) are tolerated — clients must handle gaps and cannot assume sequence_numbers are contiguous. The key property is that sequence_numbers are totally ordered within a channel.

TTL presence tolerates all failure modes uniformly

Disconnect events are unreliable: TCP connections can appear alive to the server for minutes after the client disappears (especially behind NAT or on mobile networks). Using TTL-based presence (heartbeat every 30s, EXPIRE 60s) means all failure modes — client crash, server crash, network partition, clean disconnect — are handled identically: the key expires after 60s. This simplicity is the point. The cost is 60-second staleness in the worst case, which is acceptable for a presence indicator in a chat system.

Fan-out bottleneck is the gateway, not Redis

Redis PUBLISH delivers one message per subscribed gateway — N gateways receive one message each. The gateway then fans it out to its local WebSocket connections. Redis sees message_rate × num_gateways publishes/s; the gateway sees message_rate × local_subscribers writes/s. At 5,000 messages/s with 10 gateways and avg 100 local subscribers per message: Redis sees 50k publishes/s (trivial), each gateway sees 500k WebSocket writes/s (this is the bottleneck). Monitor gateway CPU and egress bandwidth, not Redis PUBLISH throughput.

Common mistakes

What most candidates get wrong on this challenge.

✗

Using HTTP long-polling instead of WebSockets — at 100k concurrent users, long-polling generates 100k+ HTTP requests/s (each poll is a new request). WebSockets maintain persistent connections with ~3,333 heartbeat frames/s for the same user count. The connection overhead alone makes polling untenable at this scale.

✗

Using the Postgres messages table as the real-time fan-out mechanism (polling every 100ms for new messages) — at 100k users × 10 polls/s = 1M DB reads/s. This saturates any reasonable Postgres instance and adds at least 100ms of delivery latency. Redis pub/sub delivers in < 1ms per gateway hop.

✗

Sticky sessions at the load balancer for senders — receivers must be on persistent WebSocket connections (inherently sticky), but senders (POST /messages) are stateless HTTP requests that can go to any app server. Using sticky sessions for senders adds unneeded coupling and creates hot-spots if one user sends a lot of messages.

✗

Counting connections per server to determine "how many gateways you need" without accounting for RAM — 64KB per WebSocket × 10,000 connections per gateway = 640MB just for connection state, plus application memory. A gateway serving 10k connections needs at least 2-4GB RAM total. Servers are sized by RAM first, CPU second, for WebSocket workloads.

What full credit looks like

Expand each criterion to see the exact bar.

WebSocket gateways + Redis pub/sub delivers in < 500msscalability

weight 3×

Full credit requires: WebSocket gateway fleet on the canvas, Redis pub/sub as the cross- gateway fan-out bus (not DB polling), the flow from POST /messages → Postgres write → Redis PUBLISH → gateway SUBSCRIBE → WebSocket frame described in the justification, and a latency budget showing < 500ms is achievable (Postgres write ~5ms + Redis publish ~1ms + WebSocket frame ~1ms = well under 500ms). Partial credit if WebSockets are present but the cross- gateway delivery mechanism is unspecified or uses DB polling.

Full credit: WebSocket gateways present, Redis pub/sub as the fan-out bus, cross-gateway delivery path described, < 500ms argued with latency budget.

Per-channel sequence_number ordering with idempotent clientsscalability

weight 2×

Full credit requires: Redis INCR for sequence_number generation (before Postgres INSERT), client dedup by sequence_number described, and the reconnect replay flow (request messages after_sequence={last_seen}) stated. Must acknowledge that sequence_number gaps are possible (from burned increments on failed inserts) and that clients must tolerate them. Partial credit if ordering is addressed but via timestamps (clock skew problem unaddressed).

Full credit: Redis INCR for sequence_number generation described, client dedup by sequence_number stated, replay-on-reconnect flow described.

Gateway crash causes reconnect + replay, not message lossavailability

weight 2×

Full credit requires all three elements: (1) write order stated — Postgres before Redis PUBLISH, (2) client reconnect flow described — detect disconnect via heartbeat timeout, exponential backoff reconnect, (3) sequence_number replay on reconnect — request messages after last-seen sequence_number to fill the gap. Must explain why no messages are lost even if the gateway had a Redis publish in flight when it crashed (Postgres is the source of truth). Partial credit if reconnect is mentioned but replay mechanism or write ordering is absent.

Full credit: Write order stated (Postgres before Redis publish), client reconnect flow described, sequence_number replay on reconnect described.

TTL-based presence with heartbeat is correct and queryable per channelscalability

weight 1×

Full credit requires: SET presence:{user_id} 1 EX 60 on heartbeat (every 30s), MGET for per-channel presence query (SMEMBERS to get member list, then MGET to check each), and the justification that TTL handles all failure modes (crash, clean disconnect, network partition) uniformly without explicit offline signaling. Partial credit if presence is described but heartbeat interval, TTL value, or the MGET query pattern is unspecified.

Full credit: SET EX 60 on heartbeat described, MGET for per-channel presence query described, crash/disconnect handled by TTL expiry.

Fan-out math and WebSocket sizing argued from numbersjustification-quality

weight 2×

Full credit requires at least three capacity numbers argued from first principles: (1) 64KB/connection × 100k connections = 6.4GB RAM — determines gateway fleet sizing; (2) fan-out math: message_rate × avg_channel_size = WebSocket writes/s per gateway — shows where the bottleneck is; (3) Redis pub/sub throughput: N messages/s vs 1M/s capacity — shows Redis is not the bottleneck. Claims must not contradict each other (e.g., claiming both "10 gateways" and "each gateway handles 50k connections" would contradict the 100k total). Partial credit if correct design but justifications are qualitative only.

Full credit: 64KB/connection × 100k = 6.4GB RAM stated, fan-out math (N members × message rate = WebSocket writes/s), Redis pub/sub headroom vs. 1M/s limit cited.

How to approach this challenge

The same phase-by-phase guide shown during solving — with answers.

Phase 1 — Real-Time Delivery Architecture

Design how WebSocket gateways receive messages and push them to clients. The key insight is that gateway servers are stateful (they hold WebSocket connections) but the message routing must be stateless (any gateway can receive any message). Redis pub/sub is the layer that decouples them. Get this architecture right before touching persistence or presence.

15 min · 3 questions

A message is sent to gateway-1. The recipient is connected to gateway-2. How does it get there?

show answer

Gateway-1 receives the POST /messages request, writes to Postgres, then PUBLISHes to Redis on the channel's topic. Gateway-2 is SUBSCRIBEd to that topic (because it has a user connected who is a member of that channel). Redis delivers the published message to gateway-2, which forwards it over the recipient's WebSocket. No gateway-to-gateway communication needed — Redis is the message bus.

Why not use long-polling or Server-Sent Events instead of WebSockets?

show answer

Long-polling: each poll is a new HTTP request. At 100k concurrent users × avg 1 poll/s = 100k HTTP requests/s. WebSocket: 100k persistent connections, ~3,333 frames/s for presence heartbeats. WebSockets have ~30x lower connection overhead at this scale. Server-Sent Events are unidirectional (server to client only) — you still need HTTP for the send path. WebSocket is the right primitive for bidirectional real-time at this scale.

What does a gateway server need to do on each WebSocket connect?

show answer

(1) Validate auth token. (2) Load user's channel memberships from Redis (SMEMBERS for each channel, or a user->channels index). (3) SUBSCRIBE to each channel's pub/sub topic. (4) SET presence:{user_id} 1 EX 60. This means a gateway restart triggers a burst of subscriptions as clients reconnect — design the reconnect flow to rate-limit handshakes.

Deliverable

Canvas: clients → gateway (multiple replicas) → Redis pub/sub → gateway (fan-out). The message write path: gateway → Postgres (write) → Redis PUBLISH. The receive path: Redis SUBSCRIBE → gateway → WebSocket frame to client. The separation of write durability (Postgres first) and real-time delivery (Redis after) must be explicit.

Phase 2 — Message Persistence and Ordering

Design the ordering guarantee and persistence layer. Timestamps alone are insufficient for ordering (clock skew between servers, same-millisecond messages). The sequence_number is the ordering primitive — a monotonically increasing integer per channel, generated atomically before the message is inserted.

10 min · 3 questions

Why not use the message's created_at timestamp for ordering?

show answer

Two messages sent in the same millisecond have the same timestamp. NTP drift means different app servers may disagree on "now" by up to ~100ms. Two messages sent from different app servers within 100ms of each other could have inverted timestamps. A per-channel sequence_number from Redis INCR is monotonically increasing and has no clock dependency — it gives a total order within a channel with zero ambiguity.

How do you generate the sequence_number atomically before inserting?

show answer

INCR channel:seq:{channel_id} in Redis returns the next integer atomically. This runs before the Postgres INSERT. The sequence_number is then included in the INSERT. If the INSERT fails (e.g., idempotency_key conflict), the sequence_number is "burned" (the increment already happened). This is acceptable — clients will see a gap in the sequence (e.g., 42, 43, 45) and must tolerate gaps. Gaps mean a retry, not a missing message.

What does 'at-least-once delivery' require from the client?

show answer

At-least-once means clients may receive the same message twice (e.g., if the gateway crashes after delivery but before the client acks). Clients must deduplicate by sequence_number: if you receive a message with a sequence_number you have already processed, discard it. This is the idempotent client pattern — it moves the dedup burden from the server to the client, which is the right trade-off for throughput.

Deliverable

The write flow: Redis INCR → Postgres INSERT (with sequence_number) → Redis PUBLISH. Idempotency_key UNIQUE constraint on the messages table. Client reconnect flow: after reconnect, request messages WHERE sequence_number > last_seen_sequence to replay missed messages.

Phase 3 — Fan-Out at Scale

A message to a 1,000-member channel requires 1,000 WebSocket frame writes. Understand where the bottleneck is and verify your design can sustain it. The Redis pub/sub layer serializes the fan-out across all gateways — each gateway only writes to its connected subset of the 1,000 members.

10 min · 3 questions

At 5,000 messages/s to 1,000-member channels, how many WebSocket writes/s does each gateway handle?

show answer

5,000 messages/s × 1,000 members = 5M WebSocket writes/s total across all gateways. If 10 gateway servers each hold 10,000 connections (100k total / 10 gateways), and members are uniformly distributed, each gateway handles 5M/10 = 500k writes/s. At 1KB per message frame, that's 500MB/s of WebSocket throughput per gateway — budget accordingly. In practice, channels are sparse — not all 1,000 members of every channel are connected simultaneously.

Is the gateway or Redis the fan-out bottleneck?

show answer

Redis pub/sub delivers each message ONCE to each subscribed gateway, regardless of how many clients are connected to that gateway. Redis sees 5,000 PUBLISH/s — well within its 1M/s capacity. The gateway sees 500k WebSocket writes/s — this is the likely bottleneck. Monitor gateway CPU and send buffer backpressure, not Redis throughput.

How does a gateway know which local connections to forward a pub/sub message to?

show answer

The gateway maintains an in-memory map: channel_id → [WebSocket connections subscribed to that channel]. When a pub/sub message arrives for channel X, the gateway looks up channel X in its local map and writes the frame to each connection in the list. This is a pure in-memory fan-out — O(N) where N is the number of local connections subscribed to that channel. No Redis lookups needed for the fan-out itself.

Deliverable

Fan-out math in your justification: total WebSocket writes/s = message rate × avg channel size. Per-gateway writes/s = total / number of gateways. Each gateway must be sized to handle this write rate. The Redis pub/sub layer is not the bottleneck — the gateway's network egress and CPU are.

Phase 4 — Presence System

Show 100k users' online/offline state to channel members without polling. TTL-based presence with heartbeat is the standard pattern — simple, scalable, and tolerant of crashes. The design question is: how does a client know who in a channel is online?

10 min · 3 questions

Why not use WebSocket disconnect events to set presence to offline?

show answer

Disconnect events are unreliable: network partitions, client crashes, and gateway crashes all result in the server not knowing the client is gone until a timeout. A TCP connection can appear alive to the server for minutes after the client goes offline. TTL-based presence (EXPIRE 60s, heartbeat every 30s) handles all failure modes uniformly: if the heartbeat stops for any reason, the key expires and the user appears offline after at most 60 seconds. This is more reliable than disconnect events.

How does a client learn who in a channel is currently online?

show answer

Two steps: (1) SMEMBERS channel:members:{channel_id} to get the member list, (2) MGET presence:{u1} presence:{u2} ... for all member IDs in a single round trip. Non-nil results = online. This is O(N) for an N-member channel. For large channels (1,000 members), MGET for 1,000 keys returns in < 5ms. Online presence is computed on-demand when a user opens a channel, not pushed proactively — this avoids presence update storms.

At 100k concurrent users, what is the heartbeat RPS?

show answer

100k users × 1 heartbeat/30s = 3,333 heartbeats/s. Each heartbeat is a POST /presence that executes SET key 1 EX 60. At 3,333/s this is trivial for Redis. The concern is not Redis throughput — it is HTTP overhead. Consider using WebSocket frames for heartbeats (client sends a ping frame every 30s) to avoid an HTTP round trip per user.

Deliverable

Presence flow: WebSocket connect → SET presence:{user_id} 1 EX 60, client sends heartbeat frame (or HTTP POST) every 30s → re-SET with EX 60. Presence query: SMEMBERS + MGET. Presence update events (user came online/offline) are pushed via the channel's pub/sub topic as a presence_update message type, so all connected members see real-time status.

Phase 5 — Availability: Gateway Server Crash

A gateway server crashes. All WebSocket connections to that gateway die. Design what happens: how clients detect the crash, reconnect, and recover missed messages. The key invariant is "no message loss if published to Redis BEFORE sending to WebSocket" — but a crashed gateway may have received a Redis publish but not forwarded it before crashing. The sequence_number replay mechanism is the safety net.

15 min · 3 questions

What happens to the 10,000 WebSocket connections on the crashed gateway?

show answer

They are immediately terminated. The OS closes the sockets. Clients on mobile or behind NAT may not receive a TCP RST and must detect the crash via a heartbeat timeout (e.g., no pong response within 10s). Client reconnect logic: exponential backoff with jitter (100ms base, up to 30s), reconnect to any available gateway (via DNS load balancing or a sticky-session-free LB). After reconnect, request history replay.

How does a reconnected client know what messages it missed?

show answer

The client stores the last sequence_number it received per channel. After reconnect, it calls GET /channels/{channel_id}/messages?before_sequence=... wait, it needs messages AFTER its last seen sequence. The correct API: GET /channels/{id}/messages?after_sequence= {last_seen}&limit=100. The client replays messages in sequence_number order and deduplicates any it already processed. This is the "replay on reconnect" pattern — it requires the client to persist its last-seen sequence_number across reconnects (local storage on web, SQLite on mobile).

A message was published to Redis but the gateway crashed before forwarding it to WebSocket. Is the message lost?

show answer

Not lost — it is in Postgres. The write order is: Postgres first, then Redis PUBLISH. If the gateway crashes after receiving the Redis publish but before forwarding, the message is in Postgres. When the client reconnects to another gateway, the history replay fetches it. The Redis publish is a best-effort delivery mechanism; Postgres is the durable fallback. This is why "Postgres before Redis" is the critical ordering rule.

Deliverable

Client reconnect flow: detect disconnect → exponential backoff → reconnect to gateway → fetch per-channel missed messages (after_sequence) → process in order → resume. Gateway crash is a non-event for message durability because Postgres is the source of truth. Presence recovery: after reconnect, re-SET presence key and re-subscribe to channel topics.

Ready to test your design?

Open the canvas, place your components, and run the failure scenarios to get graded.

View diagram Open canvas →

Greenfield: Design a Real-Time Chat System

Steady traffic

5k rps

Spike

3×

Budget

$5k/mo

Ref. components

6 links

websocketspub-submessage-queuefan-outpersistencepresence