Greenfield: Design a Real-Time Chat System
Design a real-time chat system supporting 100k concurrent WebSocket connections, channel fan-out to 1,000 members, message history, and TTL-based presence — all within a $5,000/month budget. The hard problems are cross-gateway fan-out (a message sent to gateway-1 must reach users on gateway-2) and surviving a gateway crash without losing messages in flight.
Load profile
- Steady traffic: 5,000 rps
- Spike multiplier: 3×
- Budget: $5,000/mo
- Read ratio: 70:30
The Scenario
You're joining a new startup that just got funded. The CTO wants a chat system that works like Slack's basic model: channels with multiple members, real-time delivery, message history, and online presence. It starts at 1k concurrent users but must be designed to scale to 100k concurrent connections. You have 60 minutes to design the architecture before the engineering review.
Know before you start
- WebSockets: how a persistent bidirectional connection differs from HTTP request-response
- Pub/sub: how Redis PUBLISH and SUBSCRIBE work, and why they solve the cross-gateway delivery problem
- Message ordering: why a per-channel sequence_number is the correct ordering primitive (not timestamps)
- Fan-out: what it means to write one message to N subscribers, and where N=1000 becomes a bottleneck
- Presence signals: TTL-based online/offline state vs. polling, and why heartbeat + EXPIRE is the standard pattern
Requirements
Functional
- Users can send messages to channels they are members of
- Messages delivered in < 500ms p99 end-to-end
- Message history up to 10,000 messages per channel
- Online/offline presence visible to channel members
- Users can be in up to 50 channels simultaneously
Non-functional
- Support 100k concurrent WebSocket connections
- Message delivery at least once (idempotent clients via sequence_number dedup)
- < 500ms delivery p99 end-to-end
- Survive a gateway server crash without losing messages in flight
- Message ordering guaranteed per channel (not globally)
API Contract
The endpoints your system must implement. The hot path is the one the SLO is measured on.
POST /channels/{channel_id}/messages · Send a message to a channel · hot path · auth: required
Request body
| field | type | notes |
|---|---|---|
| content | string | Message text content (max 4096 chars), e.g. Hey team, the deploy is done! |
| idempotency_key | string | Client-generated UUID; prevents duplicate messages on retry, e.g. 550e8400-e29b-41d4-a716-446655440000 |
Path params
| field | type | notes |
|---|---|---|
| channel_id | string | UUID of the channel to post to |
Response body
| field | type | notes |
|---|---|---|
| message_id | string | UUID of the created message |
| channel_id | string | |
| user_id | string | |
| content | string | |
| created_at | string | |
| sequence_number | integer | Monotonically increasing integer per channel; clients use this for ordering and dedup |
HOT PATH: every message triggers a Redis PUBLISH to the channel topic, which fans out to all gateway servers subscribed to that channel. The sequence_number is generated via Redis INCR channel:seq:{channel_id}, which is atomic and monotonically increasing per channel. Clients use sequence_number for ordering and dedup (drop messages with a sequence_number they have already seen). The message is written to Postgres BEFORE publishing to Redis, which ensures durability even if the Redis publish fails.
GET /channels/{channel_id}/messages · Fetch channel message history · auth: required
Path params
| field | type | notes |
|---|---|---|
| channel_id | string | UUID of the channel |
Query params
| field | type | notes |
|---|---|---|
| before_sequence? | integer | Return messages with sequence_number less than this value (pagination cursor), e.g. 500 |
| limit? | integer | Max messages to return (default: 50, max: 100), e.g. 50 |
Response body
| field | type | notes |
|---|---|---|
| messages | array | List of message objects in descending sequence_number order |
| has_more | boolean | True if there are more messages before the earliest returned |
GET ws://gateway/connect · WebSocket: establish persistent real-time connection · hot path · auth: required
The client connects via WebSocket, passing an auth token as a URL parameter (ws://gateway/connect?token=...). The gateway server validates the token, fetches the user's channel memberships from Redis (channel:members sets), and subscribes to each channel's pub/sub topic (pubsub:channel:{channel_id}). When a message is published to any of those topics, the gateway forwards it as a JSON frame over the WebSocket. Connection state is in-memory on the gateway — user's channel memberships are in Redis. On disconnect, the gateway unsubscribes from all pub/sub topics. Clients must reconnect on gateway crash and replay missed messages using their last seen sequence_number.
POST /channels/{channel_id}/presence · Heartbeat to signal online presence · auth: required
Path params
| field | type | notes |
|---|---|---|
| channel_id | string | Channel to signal presence in |
Sets Redis key presence:{user_id} with EXPIRE 60s. No heartbeat received within 60 seconds means the user is considered offline. Clients should heartbeat every 30 seconds to maintain presence with a 2x safety margin. The presence key is global (not per channel) — the user is either online or offline across all channels. To show who in a channel is online, query SMEMBERS channel:members:{channel_id} then GET presence:{user_id} for each member.
Data Model
table · messages · Message · 7 columns
Durable message store — the source of truth for all chat messages. Written before publishing to Redis pub/sub to guarantee durability. The primary read pattern is history retrieval (paginated by sequence_number), not point-lookup. Partition by channel_id or created_at after the table reaches ~100M rows.
| column | type | constraints | notes |
|---|---|---|---|
| id | uuid | PK | DEFAULT gen_random_uuid() |
| channel_id | uuid | IDX | Index for history queries — SELECT * FROM messages WHERE channel_id = $1 ORDER BY sequence_number DESC |
| user_id | uuid | IDX | FK to users table; indexed for audit queries |
| content | text | — | Max 4096 chars enforced at application layer |
| sequence_number | bigint | IDX | Monotonically increasing per channel — generated via Redis INCR before insert |
| idempotency_key | uuid | UNIQUE | Prevents duplicate messages on client retry |
| created_at | timestamptz | IDX | DEFAULT now(); index for time-range history queries and table partitioning |
capacity · 10M messages/day × 1KB/message = 10GB/day. At 90-day retention: ~900GB. Partition by created_at month using Postgres declarative partitioning. The sequence_number index per channel is the critical read path for history and reconnect replay.
table · channel_members · Channel Member · 3 columns
Membership records mapping users to channels. Used at connection time to determine which pub/sub topics to subscribe to, and at presence query time to list who is in a channel. The Redis SET (channel:members:{channel_id}) caches this data for hot channels.
| column | type | constraints | notes |
|---|---|---|---|
| channel_id | uuid | IDX | Part of primary key |
| user_id | uuid | IDX | Part of primary key; indexed to answer "which channels is this user in?" |
| joined_at | timestamptz | — | DEFAULT now() |
capacity · 100k users × avg 20 channels each = 2M rows. Tiny. The Redis SET cache is populated at channel creation and updated on join/leave; the DB is only consulted on cold start.
table · channels · Channel · 4 columns
Channel metadata. The member_count is a denormalized counter for display purposes — the authoritative member list is in channel_members and its Redis SET mirror.
| column | type | constraints | notes |
|---|---|---|---|
| id | uuid | PK | DEFAULT gen_random_uuid() |
| name | varchar(255) | — | Channel display name; unique within a workspace (not modeled here) |
| member_count | integer | — | Denormalized counter; incremented on join, decremented on leave |
| created_at | timestamptz | — | DEFAULT now() |
capacity · 10k channels × negligible row size. This table is not a performance concern — it is a metadata store. Channel member_count is updated on every join/leave via UPDATE ... SET member_count = member_count + 1 — atomic at the row level.
Redis / Cache Contracts
Key patterns, TTLs, and commands. Your design must justify the hotness-critical keys.
channel:messages:{channel_id} · LIST · TTL: 3600s (1h, refreshed on new message) · critical
Last 100 messages in a channel in newest-first order. Serves recent history requests without hitting Postgres for the common case (last page of messages). LPUSH on new message, LTRIM to 100, LRANGE for history reads.
LPUSH channel:messages:{channel_id} {message_json} # push on new message; LPUSH maintains newest-first order
LTRIM channel:messages:{channel_id} 0 99 # keep only 100 most recent
LRANGE channel:messages:{channel_id} 0 49 # fetch first page of history
rationale · Most history requests are for the last 50-100 messages (first page). Serving these from Redis (LRANGE from the head of the list, O(N) in the page size) avoids a Postgres query with an ORDER BY on a large table. Deep history pagination (a before_sequence cursor older than the oldest cached message) falls back to Postgres. The 100-message cap keeps the per-channel Redis footprint bounded at ~100KB per active channel (100 messages × ~1KB).
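A minimal sketch of the cache-or-fallback decision described above. It assumes a hypothetical RecentMessages class that mimics LPUSH + LTRIM with an in-memory deque instead of a real Redis client. Returning None stands for "fall back to Postgres"; the conservative rule used here (serve a page only when the cache can fill it completely) is an assumption, not something the contract mandates.

```python
from collections import deque


class RecentMessages:
    """In-memory stand-in for the channel:messages:{channel_id} list (hypothetical)."""

    CAP = 100  # mirrors LTRIM ... 0 99

    def __init__(self):
        # Index 0 is the newest message, like a list built with LPUSH.
        self._items = deque(maxlen=self.CAP)

    def push(self, message):
        # appendleft on a full deque drops the oldest entry (LPUSH + LTRIM).
        self._items.appendleft(message)

    def page(self, limit=50, before_sequence=None):
        """Return a newest-first page, or None if Postgres must serve it."""
        candidates = [
            m for m in self._items
            if before_sequence is None or m["sequence_number"] < before_sequence
        ]
        if len(candidates) < limit:
            # The cache cannot prove the page is complete (deep cursor or cold cache).
            return None
        return candidates[:limit]


cache = RecentMessages()
for seq in range(1, 151):
    cache.push({"sequence_number": seq})

first_page = cache.page(limit=50)
deep_page = cache.page(limit=50, before_sequence=60)
```

Here first_page holds sequence numbers 150 down to 101, while deep_page is None because only nine cached messages are older than the cursor.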
presence:{user_id} · STRING · TTL: 60s (refreshed by heartbeat every 30s) · high
Online presence flag. Value "1" means the user is online. Set with EXPIRE 60s on each heartbeat. Missing key (expired or never set) means offline. Gateway servers also set this key on WebSocket connect and let it expire naturally on disconnect (no explicit DEL — a crash-disconnected client would erroneously appear offline immediately if we DEL; instead let TTL handle it).
SET presence:{user_id} 1 EX 60 # heartbeat — resets TTL; NX not used (always refresh)
GET presence:{user_id} # 1 = online, nil = offline
MGET presence:{u1} presence:{u2} ... # batch check for all members of a channel
rationale · TTL-based presence is the standard pattern: no explicit disconnect signaling required, crashes are handled naturally (TTL expires), and the heartbeat interval (30s) gives a 2x safety margin before the 60s TTL. Checking presence for a 1000-member channel means 1000 key lookups; MGET batches them into a single round trip instead of 1000 separate GETs.
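The TTL semantics can be illustrated without a Redis server. The sketch below is a hypothetical PresenceStore that keeps an expiry time per key and takes an injectable clock so the 30s heartbeat and 60s TTL can be exercised deterministically. In a real deployment, SET ... EX 60 and MGET would replace the dictionary.

```python
class PresenceStore:
    """Simulates presence:{user_id} keys with a 60s TTL (illustrative only)."""

    TTL_SECONDS = 60

    def __init__(self, clock):
        self._clock = clock          # callable returning the current time in seconds
        self._expires_at = {}        # key -> absolute expiry time

    def heartbeat(self, user_id):
        # Equivalent in spirit to: SET presence:{user_id} 1 EX 60
        self._expires_at[f"presence:{user_id}"] = self._clock() + self.TTL_SECONDS

    def online_members(self, member_ids):
        # Equivalent in spirit to one MGET over all member keys.
        now = self._clock()
        return [
            user_id for user_id in member_ids
            if self._expires_at.get(f"presence:{user_id}", 0) > now
        ]


now = [0]
store = PresenceStore(clock=lambda: now[0])
store.heartbeat("alice")
store.heartbeat("bob")

now[0] = 30
store.heartbeat("alice")     # alice keeps heartbeating, bob goes silent

now[0] = 61
online = store.online_members(["alice", "bob", "carol"])
```

At t=61 only alice is still online: bob's key expired at t=60 and carol never sent a heartbeat.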
channel:members:{channel_id} · SET · TTL: 3600s · critical
Set of user IDs who are members of this channel. Used by the gateway at connection time (to verify membership) and by the presence system (SMEMBERS to list who to check). Updated on join (SADD) and leave (SREM). Populated from Postgres on cold start.
SMEMBERS channel:members:{channel_id} # list all member user IDs
SISMEMBER channel:members:{channel_id} {user_id} # check membership
SADD channel:members:{channel_id} {user_id} # on join
SREM channel:members:{channel_id} {user_id} # on leave
rationale · Membership lookups happen on every WebSocket connect (fetch all channels the user belongs to) and on every presence query. Redis SET provides O(1) membership check (SISMEMBER) and O(N) full enumeration (SMEMBERS). The DB is the source of truth; Redis is the fast path. Cold start: load from channel_members WHERE user_id = $1 on connection and populate.
pubsub:channel:{channel_id} · PUB/SUB channel · critical
Redis pub/sub topic for real-time message fan-out. When a new message is sent, the message handler PUBLISHes to this topic. All gateway servers that have subscribers in this channel are subscribed to this topic and receive the message for forwarding to their connected WebSocket clients. This is the mechanism that enables cross-gateway delivery.
PUBLISH pubsub:channel:{channel_id} {message_json} # called by message handler on new message
SUBSCRIBE pubsub:channel:{channel_id} # called by gateway on user connection
UNSUBSCRIBE pubsub:channel:{channel_id} # called by gateway on user disconnect
rationale · Without pub/sub, a message sent through gateway-1 cannot reach a user connected to gateway-2. Each gateway server subscribes to all channels its connected users belong to. When a message is published, all subscribed gateways receive it and forward it to their respective WebSocket clients. This is the core cross-gateway delivery mechanism. Redis pub/sub is not durable: if a gateway is down when a message is published, the message is lost from that gateway's perspective. This is acceptable because the message is already durably written to Postgres before publishing; reconnecting clients replay missed messages via sequence_number.
Capacity Math
Pre-computed numbers to anchor your justifications. Use these — the grader checks your claims against them.
| category | metric | value | derivation |
|---|---|---|---|
| capacity | RAM per WebSocket connection | 64KB per connection | kernel socket buffer (32KB) + app frame buffers; 100k connections = 6.4GB RAM |
| traffic | Messages per day | 10M messages/day | 1k messages/hour per active channel × 10k channels × 1 active hour/day avg |
| storage | Storage per day | 10GB/day | 10M messages/day × 1KB avg message size |
| capacity | Fan-out write amplification | 1,000 WebSocket writes per message | 1 message to a 1000-member channel = 1 PUBLISH + 1000 WebSocket frame writes |
| traffic | Presence heartbeat load | 3,333 heartbeats/s | 100k concurrent users × 1 heartbeat every 30s |
| capacity | Redis pub/sub throughput | ~1M messages/s per Redis instance | Redis PUBLISH is single-threaded but at < 1KB messages saturates at ~1M/s |
| traffic | WebSocket handshake burst at startup | 1,666 handshakes/s | 100k connections established over 60s at startup burst |
| storage | Message history storage at 90-day retention | ~900GB | 10GB/day × 90 days; partition messages table by created_at month |
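The headline figures can be re-derived in a few lines, which is handy when adapting them to a different connection count. This is a back-of-envelope sketch that assumes decimal units (1KB = 1,000 bytes, 1GB = 10^9 bytes), which is what makes 64KB × 100k come out to exactly 6.4GB.

```python
KB = 1_000
GB = 1_000_000_000

CONNECTIONS = 100_000
MESSAGES_PER_DAY = 10_000_000

# RAM for WebSocket connections: 64KB each.
ram_gb = CONNECTIONS * 64 * KB / GB

# Storage: 1KB average message size, 90-day retention.
storage_gb_per_day = MESSAGES_PER_DAY * 1 * KB / GB
retention_gb = storage_gb_per_day * 90

# Presence: one heartbeat per user every 30 seconds.
heartbeats_per_s = CONNECTIONS / 30

# Startup burst: every connection re-established within 60 seconds.
handshakes_per_s = CONNECTIONS / 60

print(f"RAM: {ram_gb:.1f} GB")
print(f"Storage: {storage_gb_per_day:.0f} GB/day, {retention_gb:.0f} GB at 90 days")
print(f"Heartbeats: {heartbeats_per_s:.0f}/s, handshakes: {handshakes_per_s:.0f}/s")
```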
How to Approach This
Work through these phases in order before submitting. Each phase builds on the last.
Phase 1 — Real-Time Delivery Architecture
15 min · 3 questions
Design how WebSocket gateways receive messages and push them to clients. The key insight is that gateway servers are stateful (they hold WebSocket connections) but the message routing must be stateless (any gateway can receive any message). Redis pub/sub is the layer that decouples them. Get this architecture right before touching persistence or presence.
A message is sent to gateway-1. The recipient is connected to gateway-2. How does it get there?
hint
Gateway-1 receives the POST /messages request, writes to Postgres, then PUBLISHes to Redis on the channel's topic. Gateway-2 is SUBSCRIBEd to that topic (because it has a user connected who is a member of that channel). Redis delivers the published message to gateway-2, which forwards it over the recipient's WebSocket. No gateway-to-gateway communication needed — Redis is the message bus.
Why not use long-polling or Server-Sent Events instead of WebSockets?
hint
Long-polling: each poll is a new HTTP request. At 100k concurrent users × avg 1 poll/s = 100k HTTP requests/s. WebSocket: 100k persistent connections, ~3,333 frames/s for presence heartbeats. WebSockets have ~30x lower connection overhead at this scale. Server-Sent Events are unidirectional (server to client only) — you still need HTTP for the send path. WebSocket is the right primitive for bidirectional real-time at this scale.
What does a gateway server need to do on each WebSocket connect?
hint
(1) Validate auth token. (2) Load user's channel memberships from Redis (SMEMBERS for each channel, or a user->channels index). (3) SUBSCRIBE to each channel's pub/sub topic. (4) SET presence:{user_id} 1 EX 60. This means a gateway restart triggers a burst of subscriptions as clients reconnect — design the reconnect flow to rate-limit handshakes.
Deliverable
Canvas: clients → gateway (multiple replicas) → Redis pub/sub → gateway (fan-out). The message write path: gateway → Postgres (write) → Redis PUBLISH. The receive path: Redis SUBSCRIBE → gateway → WebSocket frame to client. The separation of write durability (Postgres first) and real-time delivery (Redis after) must be explicit.
Common pitfall
Publishing to Redis BEFORE writing to Postgres. If Postgres write fails after the Redis publish, the message is delivered to all connected clients but never persisted — it vanishes from history. Always write to Postgres first. If the Redis publish fails after a successful Postgres write, the message is safe — reconnecting clients will find it via sequence_number history replay.
Phase 2 — Message Persistence and Ordering
10 min · 3 questions
Design the ordering guarantee and persistence layer. Timestamps alone are insufficient for ordering (clock skew between servers, same-millisecond messages). The sequence_number is the ordering primitive — a monotonically increasing integer per channel, generated atomically before the message is inserted.
Why not use the message's created_at timestamp for ordering?
hint
Two messages sent in the same millisecond have the same timestamp. NTP drift means different app servers may disagree on "now" by up to ~100ms. Two messages sent from different app servers within 100ms of each other could have inverted timestamps. A per-channel sequence_number from Redis INCR is monotonically increasing and has no clock dependency — it gives a total order within a channel with zero ambiguity.
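A tiny illustration of the inversion, using made-up values: the first message is stamped by a server whose clock is assumed to run 80 ms fast, so sorting by timestamp puts it after a message that was actually sent later.

```python
# Hypothetical data: "first" was sent before "second", but the server that
# handled it has a clock running 80 ms ahead.
messages = [
    {"text": "first", "sequence_number": 1, "created_at_ms": 1_000_080},
    {"text": "second", "sequence_number": 2, "created_at_ms": 1_000_010},
]

by_timestamp = [m["text"] for m in sorted(messages, key=lambda m: m["created_at_ms"])]
by_sequence = [m["text"] for m in sorted(messages, key=lambda m: m["sequence_number"])]

print(by_timestamp)  # order distorted by clock skew
print(by_sequence)   # order assigned by the per-channel counter
```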
How do you generate the sequence_number atomically before inserting?
hint
INCR channel:seq:{channel_id} in Redis returns the next integer atomically. This runs before the Postgres INSERT. The sequence_number is then included in the INSERT. If the INSERT fails (e.g., idempotency_key conflict), the sequence_number is "burned" (the increment already happened). This is acceptable — clients will see a gap in the sequence (e.g., 42, 43, 45) and must tolerate gaps. Gaps mean a retry, not a missing message.
What does 'at-least-once delivery' require from the client?
hint
At-least-once means clients may receive the same message twice (e.g., if the gateway crashes after delivery but before the client acks). Clients must deduplicate by sequence_number: if you receive a message with a sequence_number you have already processed, discard it. This is the idempotent client pattern — it moves the dedup burden from the server to the client, which is the right trade-off for throughput.
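A minimal client-side dedup sketch. ChannelDeduper is a hypothetical helper; it remembers seen sequence numbers per channel in a set so that a late-arriving lower number is still accepted once. A production client would likely bound this memory (for example by keeping only a recent window), which is left out here.

```python
from collections import defaultdict


class ChannelDeduper:
    """Tracks which sequence numbers a client has already processed."""

    def __init__(self):
        self._seen = defaultdict(set)   # channel_id -> set of sequence numbers

    def accept(self, channel_id, sequence_number):
        """Return True the first time a message is seen, False for duplicates."""
        seen = self._seen[channel_id]
        if sequence_number in seen:
            return False
        seen.add(sequence_number)
        return True

    def last_seen(self, channel_id):
        """Highest sequence number processed so far (0 if none)."""
        return max(self._seen[channel_id], default=0)


dedup = ChannelDeduper()
results = [dedup.accept("general", seq) for seq in (1, 2, 2, 4, 3)]
```

The duplicate 2 is rejected, the gap at 3 is tolerated, and the late 3 is still accepted when it arrives.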
Deliverable
The write flow: Redis INCR → Postgres INSERT (with sequence_number) → Redis PUBLISH. Idempotency_key UNIQUE constraint on the messages table. Client reconnect flow: after reconnect, request messages WHERE sequence_number > last_seen_sequence to replay missed messages.
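The write flow can be sketched end to end with in-memory stand-ins. InMemoryBackends and send_message are hypothetical names; the class fakes Redis INCR, the UNIQUE idempotency_key constraint, and Redis PUBLISH so the ordering and failure behavior can be observed without any servers.

```python
import json
from collections import defaultdict


class DuplicateMessage(Exception):
    """Stands in for a UNIQUE(idempotency_key) violation."""


class InMemoryBackends:
    """Fake Redis counter, messages table, and pub/sub (illustrative only)."""

    def __init__(self):
        self.seq = defaultdict(int)   # channel:seq:{channel_id}
        self.rows = {}                # idempotency_key -> stored row
        self.published = []           # (topic, payload) pairs
        self.fail_publish = False

    def incr(self, channel_id):
        self.seq[channel_id] += 1
        return self.seq[channel_id]

    def insert(self, row):
        if row["idempotency_key"] in self.rows:
            raise DuplicateMessage(row["idempotency_key"])
        self.rows[row["idempotency_key"]] = row

    def publish(self, topic, payload):
        if self.fail_publish:
            raise ConnectionError("publish failed")
        self.published.append((topic, payload))


def send_message(backends, channel_id, user_id, content, idempotency_key):
    # Step 1: allocate the per-channel sequence number.
    seq = backends.incr(channel_id)
    row = {"channel_id": channel_id, "user_id": user_id, "content": content,
           "sequence_number": seq, "idempotency_key": idempotency_key}
    # Step 2: durable write first.
    try:
        backends.insert(row)
    except DuplicateMessage:
        # Retry of an already stored message: seq is burned, return the original.
        return backends.rows[idempotency_key]
    # Step 3: best-effort real-time delivery.
    try:
        backends.publish(f"pubsub:channel:{channel_id}", json.dumps(row))
    except ConnectionError:
        pass  # message is safe in the store; clients recover it via replay
    return row
```

A retried request with the same idempotency_key returns the stored message and leaves a gap in the sequence, which matches the "burned" number behavior described in the hint above.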
Phase 3 — Fan-Out at Scale
10 min · 3 questions
A message to a 1,000-member channel requires 1,000 WebSocket frame writes. Understand where the bottleneck is and verify your design can sustain it. The Redis pub/sub layer spreads the fan-out across all gateways: each gateway only writes to its connected subset of the 1,000 members.
At 5,000 messages/s to 1,000-member channels, how many WebSocket writes/s does each gateway handle?
hint
5,000 messages/s × 1,000 members = 5M WebSocket writes/s total across all gateways. If 10 gateway servers each hold 10,000 connections (100k total / 10 gateways), and members are uniformly distributed, each gateway handles 5M/10 = 500k writes/s. At 1KB per message frame, that's 500MB/s of WebSocket throughput per gateway — budget accordingly. In practice, channels are sparse — not all 1,000 members of every channel are connected simultaneously.
Is the gateway or Redis the fan-out bottleneck?
hint
Redis pub/sub delivers each message ONCE to each subscribed gateway, regardless of how many clients are connected to that gateway. Redis sees 5,000 PUBLISH/s — well within its 1M/s capacity. The gateway sees 500k WebSocket writes/s — this is the likely bottleneck. Monitor gateway CPU and send buffer backpressure, not Redis throughput.
How does a gateway know which local connections to forward a pub/sub message to?
hint
The gateway maintains an in-memory map: channel_id → [WebSocket connections subscribed to that channel]. When a pub/sub message arrives for channel X, the gateway looks up channel X in its local map and writes the frame to each connection in the list. This is a pure in-memory fan-out — O(N) where N is the number of local connections subscribed to that channel. No Redis lookups needed for the fan-out itself.
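One way to sketch that in-memory map. LocalFanout is a hypothetical class, and connections are assumed to be any objects exposing a send(frame) method. The add and remove methods report when the first local subscriber joins or the last one leaves, which is a design choice made here so the gateway issues SUBSCRIBE and UNSUBSCRIBE once per channel rather than once per user.

```python
from collections import defaultdict


class LocalFanout:
    """channel_id -> set of local connections subscribed to that channel."""

    def __init__(self):
        self._connections = defaultdict(set)

    def add(self, channel_id, conn):
        """Register a connection; True means the gateway should SUBSCRIBE."""
        first_local_subscriber = channel_id not in self._connections
        self._connections[channel_id].add(conn)
        return first_local_subscriber

    def remove(self, channel_id, conn):
        """Unregister a connection; True means the gateway should UNSUBSCRIBE."""
        conns = self._connections.get(channel_id)
        if not conns:
            return False
        conns.discard(conn)
        if conns:
            return False
        del self._connections[channel_id]
        return True

    def deliver(self, channel_id, frame):
        """Write one pub/sub message to every local subscriber; returns the count."""
        conns = self._connections.get(channel_id, ())
        for conn in list(conns):
            conn.send(frame)
        return len(conns)
```

Delivery touches only local memory, so its cost is O(N) in the number of local connections for that channel, as the hint states.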
Deliverable
Fan-out math in your justification: total WebSocket writes/s = message rate × avg channel size. Per-gateway writes/s = total / number of gateways. Each gateway must be sized to handle this write rate. The Redis pub/sub layer is not the bottleneck — the gateway's network egress and CPU are.
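The same arithmetic as a small helper, useful for trying other gateway counts. per_gateway_load is a hypothetical function; it assumes members are spread uniformly across gateways and that every member is connected, which is the worst case described above.

```python
def per_gateway_load(messages_per_s, members_per_channel, gateways, frame_bytes=1_000):
    """Return (total writes/s, writes/s per gateway, egress MB/s per gateway)."""
    total_writes = messages_per_s * members_per_channel
    writes_per_gateway = total_writes / gateways
    egress_mb_per_s = writes_per_gateway * frame_bytes / 1_000_000
    return total_writes, writes_per_gateway, egress_mb_per_s


total, per_gateway, egress = per_gateway_load(
    messages_per_s=5_000, members_per_channel=1_000, gateways=10
)
print(f"{total:,} writes/s total, {per_gateway:,.0f} per gateway, {egress:.0f} MB/s egress")
```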
Phase 4 — Presence System
10 min · 3 questions
Show 100k users' online/offline state to channel members without polling. TTL-based presence with heartbeat is the standard pattern — simple, scalable, and tolerant of crashes. The design question is: how does a client know who in a channel is online?
Why not use WebSocket disconnect events to set presence to offline?
hint
Disconnect events are unreliable: network partitions, client crashes, and gateway crashes all result in the server not knowing the client is gone until a timeout. A TCP connection can appear alive to the server for minutes after the client goes offline. TTL-based presence (EXPIRE 60s, heartbeat every 30s) handles all failure modes uniformly: if the heartbeat stops for any reason, the key expires and the user appears offline after at most 60 seconds. This is more reliable than disconnect events.
How does a client learn who in a channel is currently online?
hint
Two steps: (1) SMEMBERS channel:members:{channel_id} to get the member list, (2) MGET presence:{u1} presence:{u2} ... for all member IDs in a single round trip. Non-nil results = online. This is O(N) for an N-member channel. For large channels (1,000 members), MGET for 1,000 keys returns in < 5ms. Online presence is computed on-demand when a user opens a channel, not pushed proactively — this avoids presence update storms.
At 100k concurrent users, what is the heartbeat RPS?
hint
100k users × 1 heartbeat/30s = 3,333 heartbeats/s. Each heartbeat is a POST /presence that executes SET key 1 EX 60. At 3,333/s this is trivial for Redis. The concern is not Redis throughput — it is HTTP overhead. Consider using WebSocket frames for heartbeats (client sends a ping frame every 30s) to avoid an HTTP round trip per user.
Deliverable
Presence flow: WebSocket connect → SET presence:{user_id} 1 EX 60, client sends heartbeat frame (or HTTP POST) every 30s → re-SET with EX 60. Presence query: SMEMBERS + MGET, computed on demand when a user opens a channel. If live status changes are also required, presence update events (user came online/offline) can be pushed via the channel's pub/sub topic as a presence_update message type, at the cost of extra fan-out traffic.
Phase 5 — Availability: Gateway Server Crash
15 min · 3 questions
A gateway server crashes. All WebSocket connections to that gateway die. Design what happens: how clients detect the crash, reconnect, and recover missed messages. The key invariant is that every message is written to Postgres BEFORE it is published to Redis and forwarded over a WebSocket, so a gateway that received a Redis publish but crashed before forwarding it loses only a delivery attempt, not the message. The sequence_number replay mechanism is the safety net.
What happens to the 10,000 WebSocket connections on the crashed gateway?
hint
They are immediately terminated. The OS closes the sockets. Clients on mobile or behind NAT may not receive a TCP RST and must detect the crash via a heartbeat timeout (e.g., no pong response within 10s). Client reconnect logic: exponential backoff with jitter (100ms base, up to 30s), reconnect to any available gateway (via DNS load balancing or a sticky-session-free LB). After reconnect, request history replay.
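A minimal sketch of the reconnect delay, using the 100ms base and 30s cap mentioned above. The "full jitter" variant (a uniform draw between zero and the exponential ceiling) is one common choice and is an assumption here; reconnect_delay is a hypothetical helper name.

```python
import random

BASE_SECONDS = 0.1   # 100ms base
CAP_SECONDS = 30.0   # never wait longer than 30s


def reconnect_delay(attempt, rng=random):
    """Delay before reconnect attempt number `attempt` (0-based), with full jitter."""
    ceiling = min(CAP_SECONDS, BASE_SECONDS * (2 ** attempt))
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only to make the demo repeatable
delays = [reconnect_delay(attempt, rng) for attempt in range(12)]
```

Randomizing the delay spreads the 10,000 reconnecting clients over time, which keeps them from hitting the remaining gateways in synchronized waves.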
How does a reconnected client know what messages it missed?
hint
The client stores the last sequence_number it received per channel. The history endpoint's before_sequence cursor pages backwards, but replay needs messages AFTER the last seen sequence, so the endpoint needs a forward cursor as well: GET /channels/{channel_id}/messages?after_sequence={last_seen}&limit=100. The client replays messages in sequence_number order and deduplicates any it already processed. This is the "replay on reconnect" pattern. It requires the client to persist its last-seen sequence_number across reconnects (local storage on web, SQLite on mobile).
A message was published to Redis but the gateway crashed before forwarding it to WebSocket. Is the message lost?
hint
Not lost — it is in Postgres. The write order is: Postgres first, then Redis PUBLISH. If the gateway crashes after receiving the Redis publish but before forwarding, the message is in Postgres. When the client reconnects to another gateway, the history replay fetches it. The Redis publish is a best-effort delivery mechanism; Postgres is the durable fallback. This is why "Postgres before Redis" is the critical ordering rule.
Deliverable
Client reconnect flow: detect disconnect → exponential backoff → reconnect to gateway → fetch per-channel missed messages (after_sequence) → process in order → resume. Gateway crash is a non-event for message durability because Postgres is the source of truth. Presence recovery: after reconnect, re-SET presence key and re-subscribe to channel topics.
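The replay step of that flow can be sketched as a paging loop. replay_missed is a hypothetical client helper, and fetch_page stands in for a call to the history endpoint with an after_sequence cursor; its response shape (messages plus has_more) is assumed to mirror the documented history response.

```python
def replay_missed(fetch_page, last_seen, limit=100):
    """Fetch every message newer than last_seen, in ascending sequence order."""
    cursor = last_seen
    replayed = []
    while True:
        page = fetch_page(after_sequence=cursor, limit=limit)
        # Sort defensively: the server may return pages in either order.
        messages = sorted(page["messages"], key=lambda m: m["sequence_number"])
        for message in messages:
            if message["sequence_number"] > cursor:   # skip anything already seen
                replayed.append(message)
                cursor = message["sequence_number"]
        if not page["has_more"] or not messages:
            break
    return replayed, cursor


# Fake server for demonstration; note the gaps at 3 and 6 (burned sequence numbers).
STORE = [{"sequence_number": s} for s in (1, 2, 4, 5, 7)]


def fake_fetch_page(after_sequence, limit):
    newer = [m for m in STORE if m["sequence_number"] > after_sequence]
    return {"messages": newer[:limit], "has_more": len(newer) > limit}


missed, new_cursor = replay_missed(fake_fetch_page, last_seen=2, limit=2)
```

With last_seen=2 the client recovers messages 4, 5 and 7 across two pages and advances its cursor to 7.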
Common pitfall
Storing in-flight message state only in the gateway's memory. If the gateway has received a message from the POST handler but hasn't forwarded it yet (e.g., in a send buffer), a crash loses that delivery attempt. The fix: write to Postgres before even placing on the send buffer. The client's sequence_number replay is the universal recovery mechanism — design for it from the start, not as an afterthought.
How You'll Be Graded
WebSocket gateways + Redis pub/sub delivers in < 500ms · scalability
The design must use WebSocket gateways with Redis pub/sub for cross-gateway fan-out. A message sent to any gateway must reach all connected members within 500ms p99.
Full credit
WebSocket gateways present, Redis pub/sub as the fan-out bus, cross-gateway delivery path described, < 500ms argued with latency budget.
Partial
WebSockets present but cross-gateway delivery mechanism is unclear or uses DB polling.
Zero
No WebSockets (polling only) or no cross-gateway delivery mechanism.
Per-channel sequence_number ordering with idempotent clients · scalability
Messages must be totally ordered within a channel. The ordering mechanism must be argued from first principles — not timestamps.
Full credit
Redis INCR for sequence_number generation described, client dedup by sequence_number stated, replay-on-reconnect flow described.
Partial
Ordering mentioned but mechanism is timestamps (clock skew problem unaddressed) or sequence_number without client dedup.
Zero
No ordering mechanism or "Postgres auto-increment" without addressing the gap in the distributed setting.
Gateway crash causes reconnect + replay, not message loss · availability
When a gateway crashes, clients reconnect to another gateway and replay missed messages via sequence_number. No message loss because Postgres is written before Redis publish.
Full credit
Write order stated (Postgres before Redis publish), client reconnect flow described, sequence_number replay on reconnect described.
Partial
Reconnect mentioned but write ordering or replay mechanism not described.
Zero
No reconnect story, or design relies on in-memory delivery state that would be lost on crash.
TTL-based presence with heartbeat is correct and queryable per channel · scalability
Online/offline presence for 100k users via Redis TTL keys. Heartbeat every 30s, TTL 60s. Per-channel presence query via SMEMBERS + MGET.
Full credit
SET EX 60 on heartbeat described, MGET for per-channel presence query described, crash/disconnect handled by TTL expiry.
Partial
Presence system present but heartbeat interval or TTL not specified.
Zero
Polling-based presence or WebSocket disconnect events as the sole offline signal.
Fan-out math and WebSocket sizing argued from numbers · justification-quality
Justifications must include the key capacity numbers: WebSocket RAM (64KB × N connections), fan-out write amplification (N members × message rate), and Redis pub/sub throughput headroom.
Full credit
64KB/connection × 100k = 6.4GB RAM stated, fan-out math (N members × message rate = WebSocket writes/s), Redis pub/sub headroom vs. 1M/s limit cited.
Partial
Correct design but justifications are qualitative ("Redis is fast enough").
Zero
No capacity math at all — design floated without numbers.
Failure Scenarios the Sim Will Inject
Each scenario fires automatically during your simulation run. Your design must survive all of them.
Steady-state chat load
Celebrity goes live — 3x connection burst
Gateway server crash
Region A outage