beprodready
greenfield · senior · ~60 min

Greenfield: Design a Real-Time Chat System

Design a real-time chat system supporting 100k concurrent WebSocket connections, channel fan-out to 1,000 members, message history, and TTL-based presence — all within a $5,000/month budget. The hard problems are cross-gateway fan-out (a message sent to gateway-1 must reach users on gateway-2) and surviving a gateway crash without losing messages in flight.

websockets · pub-sub · message-queue · fan-out · persistence · presence

Steady traffic

5,000 rps

Spike multiplier

Budget

$5,000/mo

Read ratio

70:30

Load profile

3× spike · 5k rps

The Scenario

You're joining a new startup that just got funded. The CTO wants a chat system that works like Slack's basic model: channels with multiple members, real-time delivery, message history, and online presence. It starts at 1k concurrent users but must be designed to scale to 100k concurrent connections. You have 60 minutes to design the architecture before the engineering review.

Know before you start

  • WebSockets: how a persistent bidirectional connection differs from HTTP request-response
  • Pub/sub: how Redis PUBLISH and SUBSCRIBE work, and why they solve the cross-gateway delivery problem
  • Message ordering: why a per-channel sequence_number is the correct ordering primitive (not timestamps)
  • Fan-out: what it means to write one message to N subscribers, and where N=1000 becomes a bottleneck
  • Presence signals: TTL-based online/offline state vs. polling, and why heartbeat + EXPIRE is the standard pattern

Requirements

Functional

  • Users can send messages to channels they are members of
  • Messages delivered in < 500ms p99 end-to-end
  • Message history up to 10,000 messages per channel
  • Online/offline presence visible to channel members
  • Users can be in up to 50 channels simultaneously

Non-functional

  • Support 100k concurrent WebSocket connections
  • Message delivery at least once (idempotent clients via sequence_number dedup)
  • < 500ms delivery p99 end-to-end
  • Survive a gateway server crash without losing messages in flight
  • Message ordering guaranteed per channel (not globally)

API Contract

The endpoints your system must implement. The hot path is the one the SLO is measured on.

POST /channels/{channel_id}/messages · Send a message to a channel · hot path · auth: required

Request body

field | type | notes
content | string | Message text content (max 4096 chars), e.g. Hey team, the deploy is done!
idempotency_key | string | Client-generated UUID; prevents duplicate messages on retry, e.g. 550e8400-e29b-41d4-a716-446655440000

Path params

field | type | notes
channel_id | string | UUID of the channel to post to

Response body

field | type | notes
message_id | string | UUID of the created message
channel_id | string |
user_id | string |
content | string |
created_at | string |
sequence_number | integer | Monotonically increasing integer per channel; clients use this for ordering and dedup

Status codes

201 | Message created and published
400 | Invalid content or missing idempotency_key
403 | User is not a member of this channel

HOT PATH: every message triggers a Redis PUBLISH to the channel topic, which fans out to all gateway servers subscribed to that channel. The sequence_number is generated via Redis INCR channel:seq:{channel_id}, which is atomic and monotonically increasing per channel. Clients use sequence_number for ordering and dedup (drop messages with a sequence_number they have already seen). The message is written to Postgres BEFORE publishing to Redis, which ensures durability even if the Redis publish fails.
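The ordering of the three hot-path steps can be sketched with in-memory stand-ins. This is a minimal illustration, not the reference implementation: `FakeStores` and `send_message` are hypothetical names, the dictionaries stand in for Redis and Postgres, and the conflict handling (return the originally stored row, do not republish) is an assumption about how a retry would be answered.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeStores:
    """In-memory stand-ins for Redis and Postgres (hypothetical, illustration only)."""
    seq: dict = field(default_factory=dict)        # channel:seq:{channel_id} counters
    rows: dict = field(default_factory=dict)       # idempotency_key -> stored row
    published: list = field(default_factory=list)  # messages handed to PUBLISH


def send_message(stores, channel_id, user_id, content, idempotency_key):
    """Hot path: INCR, then durable insert, then best-effort publish."""
    # Step 1: INCR channel:seq:{channel_id} (atomic in real Redis).
    seq = stores.seq.get(channel_id, 0) + 1
    stores.seq[channel_id] = seq

    # Step 2: INSERT. A duplicate idempotency_key violates the UNIQUE
    # constraint; the sequence number just taken is "burned" and the
    # originally stored row is returned without publishing again.
    if idempotency_key in stores.rows:
        return stores.rows[idempotency_key]
    row = {
        "message_id": str(uuid.uuid4()),
        "channel_id": channel_id,
        "user_id": user_id,
        "content": content,
        "sequence_number": seq,
    }
    stores.rows[idempotency_key] = row

    # Step 3: PUBLISH only after the row is durable.
    stores.published.append(row)
    return row


stores = FakeStores()
first = send_message(stores, "c1", "u1", "deploy is done", "key-1")
retry = send_message(stores, "c1", "u1", "deploy is done", "key-1")
second = send_message(stores, "c1", "u1", "thanks", "key-2")
```

In this run the retry burns sequence number 2, so the channel history contains 1 and 3. Clients must treat that gap as normal.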

GET /channels/{channel_id}/messages · Fetch channel message history · auth: required

Path params

field | type | notes
channel_id | string | UUID of the channel

Query params

field | type | notes
before_sequence? | integer | Return messages with sequence_number less than this value (pagination cursor), e.g. 500
limit? | integer | Max messages to return (default: 50, max: 100), e.g. 50

Response body

field | type | notes
messages | array | List of message objects in descending sequence_number order
has_more | boolean | True if there are more messages before the earliest returned

Status codes

200 | Messages returned
403 | User is not a member of this channel

GET ws://gateway/connect · WebSocket: establish persistent real-time connection · hot path · auth: required

The client connects via WebSocket, passing an auth token as a URL parameter (ws://gateway/connect?token=...). The gateway server validates the token, fetches the user's channel memberships from Redis (channel:members sets), and subscribes to each channel's pub/sub topic (pubsub:channel:{channel_id}). When a message is published to any of those topics, the gateway forwards it as a JSON frame over the WebSocket. Connection state is in-memory on the gateway — user's channel memberships are in Redis. On disconnect, the gateway unsubscribes from all pub/sub topics. Clients must reconnect on gateway crash and replay missed messages using their last seen sequence_number.

POST /channels/{channel_id}/presence · Heartbeat to signal online presence · auth: required

Path params

field | type | notes
channel_id | string | Channel to signal presence in

Status codes

204 | Presence recorded

Sets Redis key presence:{user_id} with EXPIRE 60s. No heartbeat received within 60 seconds means the user is considered offline. Clients should heartbeat every 30 seconds to maintain presence with a 2x safety margin. The presence key is global (not per channel) — the user is either online or offline across all channels. To show who in a channel is online, query SMEMBERS channel:members:{channel_id} then GET presence:{user_id} for each member.
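The heartbeat and TTL behaviour described above can be sketched without a Redis server. `PresenceStore` is a hypothetical in-memory stand-in, and the clock is passed in explicitly as `now` (seconds) so the expiry logic is easy to follow; a real deployment would rely on Redis key expiry instead.

```python
class PresenceStore:
    """In-memory stand-in for Redis string keys with a TTL (hypothetical helper)."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._expires_at = {}  # user_id -> absolute expiry time in seconds

    def heartbeat(self, user_id, now):
        # Equivalent of: SET presence:{user_id} 1 EX 60 (always refreshes the TTL)
        self._expires_at[user_id] = now + self.ttl

    def is_online(self, user_id, now):
        # Equivalent of: GET presence:{user_id}; an expired or missing key is offline
        return self._expires_at.get(user_id, 0) > now

    def online_members(self, member_ids, now):
        # Equivalent of: SMEMBERS channel:members:{channel_id}, then one MGET
        return [u for u in member_ids if self.is_online(u, now)]


presence = PresenceStore()
presence.heartbeat("alice", now=0)
presence.heartbeat("bob", now=0)
presence.heartbeat("alice", now=30)  # alice keeps heartbeating, bob goes silent
online_at_75 = presence.online_members(["alice", "bob", "carol"], now=75)
```

Bob's key expires 60 seconds after his last heartbeat with no disconnect signal involved, which is the property the TTL pattern is chosen for.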

Data Model

table · messages · Message · 7 columns

Durable message store — the source of truth for all chat messages. Written before publishing to Redis pub/sub to guarantee durability. The primary read pattern is history retrieval (paginated by sequence_number), not point-lookup. Partition by channel_id or created_at after the table reaches ~100M rows.

column | type | constraints | notes
id | uuid | PK | DEFAULT gen_random_uuid()
channel_id | uuid | IDX | Index for history queries: SELECT * FROM messages WHERE channel_id = $1 ORDER BY sequence_number DESC
user_id | uuid | IDX | FK to users table; indexed for audit queries
content | text | | Max 4096 chars enforced at application layer
sequence_number | bigint | IDX | Monotonically increasing per channel; generated via Redis INCR before insert
idempotency_key | uuid | | UNIQUE constraint; prevents duplicate messages on client retry
created_at | timestamptz | IDX | DEFAULT now(); index for time-range history queries and table partitioning

capacity · 10M messages/day × 1KB/message = 10GB/day. At 90-day retention: ~900GB. Partition by created_at month using Postgres declarative partitioning. The sequence_number index per channel is the critical read path for history and reconnect replay.

table · channel_members · Channel Member · 3 columns

Membership records mapping users to channels. Used at connection time to determine which pub/sub topics to subscribe to, and at presence query time to list who is in a channel. The Redis SET (channel:members:{channel_id}) caches this data for hot channels.

column | type | constraints | notes
channel_id | uuid | IDX | Part of primary key
user_id | uuid | IDX | Part of primary key; index for "which channels is this user in?" lookups
joined_at | timestamptz | | DEFAULT now()

capacity · 100k users × avg 20 channels each = 2M rows, which is tiny. The Redis SET cache is populated at channel creation and updated on join/leave, so the DB is only consulted on cold start.

table · channels · Channel · 4 columns

Channel metadata. The member_count is a denormalized counter for display purposes — the authoritative member list is in channel_members and its Redis SET mirror.

column | type | constraints | notes
id | uuid | PK | DEFAULT gen_random_uuid()
name | varchar(255) | | Channel display name; unique within a workspace (not modeled here)
member_count | integer | | Denormalized counter; incremented on join, decremented on leave
created_at | timestamptz | | DEFAULT now()

capacity · 10k channels × negligible row size. This table is not a performance concern — it is a metadata store. Channel member_count is updated on every join/leave via UPDATE ... SET member_count = member_count + 1 — atomic at the row level.

Redis / Cache Contracts

Key patterns, TTLs, and commands. Your design must justify the hotness-critical keys.

channel:messages:{channel_id} · LIST · TTL: 3600s (1h, refreshed on new message) · critical

Last 100 messages in a channel in newest-first order. Serves recent history requests without hitting Postgres for the common case (last page of messages). LPUSH on new message, LTRIM to 100, LRANGE for history reads.

LPUSH channel:messages:{channel_id} {message_json} # push on new message; LPUSH maintains newest-first order

LTRIM channel:messages:{channel_id} 0 99 # keep only 100 most recent

LRANGE channel:messages:{channel_id} 0 49 # fetch first page of history

rationale · Most history requests are for the last 50-100 messages (first page). Serving these from Redis (LRANGE, O(N) for a read from the head of the list) avoids a Postgres query with an ORDER BY on a large table. Deep history pagination (a before_sequence cursor older than the oldest cached message) falls back to Postgres. The 100-message cap keeps the per-channel Redis footprint bounded at ~100KB per active channel (100 messages × ~1KB).
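A small sketch of the cache-or-fallback decision for history reads. `RecentMessages` is a hypothetical in-memory stand-in for the Redis LIST, and the fallback rule (go to Postgres whenever the cache cannot fill the requested page) is a conservative assumption rather than something the brief prescribes.

```python
from collections import deque

CACHE_SIZE = 100  # matches LTRIM channel:messages:{channel_id} 0 99


class RecentMessages:
    """Newest-first window over one channel's most recent messages."""

    def __init__(self):
        # appendleft on a full deque drops the oldest item from the right,
        # which mirrors LPUSH followed by LTRIM 0 99.
        self._items = deque(maxlen=CACHE_SIZE)

    def push(self, message):
        self._items.appendleft(message)

    def page(self, limit=50, before_sequence=None):
        """Return a newest-first page, or None when Postgres must serve it."""
        items = list(self._items)
        if before_sequence is not None:
            items = [m for m in items if m["sequence_number"] < before_sequence]
        if len(items) < limit:
            return None  # cursor reaches past the cached window
        return items[:limit]


cache = RecentMessages()
for seq in range(1, 251):
    cache.push({"sequence_number": seq})

first_page = cache.page(limit=50)                        # 250 down to 201
second_page = cache.page(limit=50, before_sequence=201)  # 200 down to 151
deep_page = cache.page(limit=50, before_sequence=151)    # not cached
```

With 250 messages sent, the cache holds sequence numbers 151 to 250, so the first two pages are served from memory and the third needs the database.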

presence:{user_id} · STRING · TTL: 60s (refreshed by heartbeat every 30s) · high

Online presence flag. Value "1" means the user is online. Set with EXPIRE 60s on each heartbeat. Missing key (expired or never set) means offline. Gateway servers also set this key on WebSocket connect and let it expire naturally on disconnect. There is no explicit DEL: a client that is dropped by a crash and reconnects within the TTL window would wrongly appear offline in between if the key were deleted on disconnect, so the TTL handles it instead.

SET presence:{user_id} 1 EX 60 # heartbeat — resets TTL; NX not used (always refresh)

GET presence:{user_id} # 1 = online, nil = offline

MGET presence:{u1} presence:{u2} ... # batch check for all members of a channel

rationale · TTL-based presence is the standard pattern: no explicit disconnect signaling is required, crashes are handled naturally (the TTL expires), and the heartbeat interval (30s) gives a 2x safety margin before the 60s TTL. Checking presence for a 1000-member channel with individual GETs would take 1,000 round trips; MGET batches them into a single round trip.

channel:members:{channel_id} · SET · TTL: 3600s · critical

Set of user IDs who are members of this channel. Used by the gateway at connection time (to verify membership) and by the presence system (SMEMBERS to list who to check). Updated on join (SADD) and leave (SREM). Populated from Postgres on cold start.

SMEMBERS channel:members:{channel_id} # list all member user IDs

SISMEMBER channel:members:{channel_id} {user_id} # check membership

SADD channel:members:{channel_id} {user_id} # on join

SREM channel:members:{channel_id} {user_id} # on leave

rationale · Membership lookups happen on every WebSocket connect (fetch all channels the user belongs to) and on every presence query. Redis SET provides O(1) membership check (SISMEMBER) and O(N) full enumeration (SMEMBERS). The DB is the source of truth; Redis is the fast path. Cold start: load from channel_members WHERE user_id = $1 on connection and populate.

pubsub:channel:{channel_id} · PUB/SUB channel (not a Redis Stream) · critical

Redis pub/sub topic for real-time message fan-out. When a new message is sent, the message handler PUBLISHes to this topic. All gateway servers that have subscribers in this channel are subscribed to this topic and receive the message for forwarding to their connected WebSocket clients. This is the mechanism that enables cross-gateway delivery.

PUBLISH pubsub:channel:{channel_id} {message_json} # called by message handler on new message

SUBSCRIBE pubsub:channel:{channel_id} # called by gateway on user connection

UNSUBSCRIBE pubsub:channel:{channel_id} # called by gateway on user disconnect

rationale · Without pub/sub, a message sent to a user on gateway-1 cannot reach a user on gateway-2. The gateway server subscribes to all channels its connected users belong to. When Redis publishes, all subscribed gateways receive the message and forward it to their respective WebSocket clients. This is the core cross-gateway delivery mechanism. Redis pub/sub is not durable — if a gateway is down when a message is published, the message is lost from the gateway's perspective. This is acceptable because the message is already durably written to Postgres before publishing; reconnecting clients replay missed messages via sequence_number.
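The two properties argued above (one delivery per subscribed gateway, and no replay for a gateway that is down) can be seen in a toy bus. `Bus` is a hypothetical in-memory stand-in for Redis pub/sub, and the gateway inboxes are plain lists; none of this is a real Redis client API.

```python
from collections import defaultdict


class Bus:
    """Toy stand-in for Redis pub/sub: fire-and-forget, nothing is stored."""

    def __init__(self):
        self._subs = defaultdict(dict)  # topic -> {gateway_id: callback}

    def subscribe(self, topic, gateway_id, callback):
        self._subs[topic][gateway_id] = callback

    def unsubscribe(self, topic, gateway_id):
        self._subs[topic].pop(gateway_id, None)

    def publish(self, topic, message):
        # Like PUBLISH, report how many subscribers received the message.
        callbacks = list(self._subs[topic].values())
        for deliver in callbacks:
            deliver(message)
        return len(callbacks)


bus = Bus()
topic = "pubsub:channel:c1"
gateway_1_inbox, gateway_2_inbox = [], []
bus.subscribe(topic, "gateway-1", gateway_1_inbox.append)
bus.subscribe(topic, "gateway-2", gateway_2_inbox.append)

receivers_before = bus.publish(topic, {"sequence_number": 1})

# gateway-2 crashes: its subscription disappears and nothing is buffered for it.
bus.unsubscribe(topic, "gateway-2")
receivers_after = bus.publish(topic, {"sequence_number": 2})
```

After the simulated crash, gateway-2 never sees sequence number 2. Its clients can only recover that message from Postgres via replay.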

Capacity Math

Pre-computed numbers to anchor your justifications. Use these — the grader checks your claims against them.

capacity

64KB per connection

RAM per WebSocket connection

= kernel socket buffer (32KB) + app frame buffers — 100k connections = 6.4GB RAM

traffic

10M messages/day

Messages per day

= 1k messages/hour per active channel × 10k channels × 1 active hour/day avg

storage

10GB/day

Storage per day

= 10M messages/day × 1KB avg message size

capacity

1,000 WebSocket writes per message

Fan-out write amplification

= 1 message to a 1000-member channel = 1 PUBLISH + 1000 WebSocket frame writes

traffic

3,333 heartbeats/s

Presence heartbeat load

= 100k concurrent users × 1 heartbeat every 30s

capacity

~1M messages/s per Redis instance

Redis pub/sub throughput

= Redis PUBLISH is single-threaded but at < 1KB messages saturates at ~1M/s

traffic

1,666 handshakes/s

WebSocket handshake burst at startup

= 100k connections established over 60s at startup burst

storage

~900GB

Message history storage at 90-day retention

= 10GB/day × 90 days — partition messages table by created_at month
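The headline figures above are simple enough to re-derive in a few lines, which is a useful sanity check before quoting them in a justification. The script below only restates the arithmetic from this section in decimal units (1KB = 1,000 bytes); the variable names are illustrative.

```python
KB = 1_000
GB = 1_000_000_000

connections = 100_000
ram_gb = connections * 64 * KB / GB             # 64KB per WebSocket connection

messages_per_day = 10_000_000
storage_gb_per_day = messages_per_day * KB / GB  # 1KB average message
retention_gb = storage_gb_per_day * 90           # 90-day retention

heartbeats_per_s = connections / 30              # one heartbeat every 30s
handshakes_per_s = connections / 60              # all clients connect within 60s

print(f"WebSocket RAM:      {ram_gb:.1f} GB")
print(f"Storage per day:    {storage_gb_per_day:.0f} GB")
print(f"90-day retention:   {retention_gb:.0f} GB")
print(f"Heartbeats/s:       {heartbeats_per_s:.0f}")
print(f"Handshake burst/s:  {handshakes_per_s:.0f}")
```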

How to Approach This

Work through these phases in order before submitting. Each phase builds on the last.

1

Phase 1 — Real-Time Delivery Architecture

15 min · 3 questions

Design how WebSocket gateways receive messages and push them to clients. The key insight is that gateway servers are stateful (they hold WebSocket connections) but the message routing must be stateless (any gateway can receive any message). Redis pub/sub is the layer that decouples them. Get this architecture right before touching persistence or presence.

A message is sent to gateway-1. The recipient is connected to gateway-2. How does it get there?

hint

Gateway-1 receives the POST /messages request, writes to Postgres, then PUBLISHes to Redis on the channel's topic. Gateway-2 is SUBSCRIBEd to that topic (because it has a user connected who is a member of that channel). Redis delivers the published message to gateway-2, which forwards it over the recipient's WebSocket. No gateway-to-gateway communication needed — Redis is the message bus.

Why not use long-polling or Server-Sent Events instead of WebSockets?

hint

Long-polling: each poll is a new HTTP request. At 100k concurrent users × avg 1 poll/s = 100k HTTP requests/s. WebSocket: 100k persistent connections, ~3,333 frames/s for presence heartbeats. WebSockets have ~30x lower connection overhead at this scale. Server-Sent Events are unidirectional (server to client only) — you still need HTTP for the send path. WebSocket is the right primitive for bidirectional real-time at this scale.

What does a gateway server need to do on each WebSocket connect?

hint

(1) Validate auth token. (2) Load user's channel memberships from Redis (SMEMBERS for each channel, or a user->channels index). (3) SUBSCRIBE to each channel's pub/sub topic. (4) SET presence:{user_id} 1 EX 60. This means a gateway restart triggers a burst of subscriptions as clients reconnect — design the reconnect flow to rate-limit handshakes.

Deliverable

Canvas: clients → gateway (multiple replicas) → Redis pub/sub → gateway (fan-out). The message write path: gateway → Postgres (write) → Redis PUBLISH. The receive path: Redis SUBSCRIBE → gateway → WebSocket frame to client. The separation of write durability (Postgres first) and real-time delivery (Redis after) must be explicit.

Common pitfall

Publishing to Redis BEFORE writing to Postgres. If Postgres write fails after the Redis publish, the message is delivered to all connected clients but never persisted — it vanishes from history. Always write to Postgres first. If the Redis publish fails after a successful Postgres write, the message is safe — reconnecting clients will find it via sequence_number history replay.

2

Phase 2 — Message Persistence and Ordering

10 min · 3 questions

Design the ordering guarantee and persistence layer. Timestamps alone are insufficient for ordering (clock skew between servers, same-millisecond messages). The sequence_number is the ordering primitive — a monotonically increasing integer per channel, generated atomically before the message is inserted.

Why not use the message's created_at timestamp for ordering?

hint

Two messages sent in the same millisecond have the same timestamp. NTP drift means different app servers may disagree on "now" by up to ~100ms. Two messages sent from different app servers within 100ms of each other could have inverted timestamps. A per-channel sequence_number from Redis INCR is monotonically increasing and has no clock dependency — it gives a total order within a channel with zero ambiguity.

How do you generate the sequence_number atomically before inserting?

hint

INCR channel:seq:{channel_id} in Redis returns the next integer atomically. This runs before the Postgres INSERT. The sequence_number is then included in the INSERT. If the INSERT fails (e.g., idempotency_key conflict), the sequence_number is "burned" (the increment already happened). This is acceptable — clients will see a gap in the sequence (e.g., 42, 43, 45) and must tolerate gaps. Gaps mean a retry, not a missing message.

What does 'at-least-once delivery' require from the client?

hint

At-least-once means clients may receive the same message twice (e.g., if the gateway crashes after delivery but before the client acks). Clients must deduplicate by sequence_number: if you receive a message with a sequence_number you have already processed, discard it. This is the idempotent client pattern — it moves the dedup burden from the server to the client, which is the right trade-off for throughput.
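A minimal sketch of the client-side dedup described above. `ChannelInbox` is a hypothetical helper; it tracks every sequence_number it has processed in a set, which is the most literal reading of "discard what you have already seen". A client that can rely on in-order delivery could keep only the highest value instead.

```python
class ChannelInbox:
    """Client-side state for one channel (hypothetical helper)."""

    def __init__(self):
        self.seen = set()    # sequence numbers already processed
        self.last_seen = 0   # highest sequence number processed, used for replay
        self.messages = []

    def receive(self, message):
        """Apply a message once; return False if it is a duplicate."""
        seq = message["sequence_number"]
        if seq in self.seen:
            return False
        self.seen.add(seq)
        self.last_seen = max(self.last_seen, seq)
        self.messages.append(message)
        return True


inbox = ChannelInbox()
results = [
    inbox.receive({"sequence_number": s, "content": f"msg {s}"})
    for s in (1, 2, 2, 4)  # 2 is redelivered, 3 was burned by a retry
]
```

The duplicate delivery of sequence number 2 is dropped, and the gap at 3 is accepted without treating it as a missing message.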

Deliverable

The write flow: Redis INCR → Postgres INSERT (with sequence_number) → Redis PUBLISH. Idempotency_key UNIQUE constraint on the messages table. Client reconnect flow: after reconnect, request messages WHERE sequence_number > last_seen_sequence to replay missed messages.

3

Phase 3 — Fan-Out at Scale

10 min · 3 questions

A message to a 1,000-member channel requires 1,000 WebSocket frame writes. Understand where the bottleneck is and verify your design can sustain it. The Redis pub/sub layer distributes the fan-out across all gateways: each gateway only writes to its connected subset of the 1,000 members.

At 5,000 messages/s to 1,000-member channels, how many WebSocket writes/s does each gateway handle?

hint

5,000 messages/s × 1,000 members = 5M WebSocket writes/s total across all gateways. If 10 gateway servers each hold 10,000 connections (100k total / 10 gateways), and members are uniformly distributed, each gateway handles 5M/10 = 500k writes/s. At 1KB per message frame, that's 500MB/s of WebSocket throughput per gateway — budget accordingly. In practice, channels are sparse — not all 1,000 members of every channel are connected simultaneously.
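The fan-out arithmetic in this hint can be wrapped in a small function so other message rates or gateway counts are easy to try. `fanout_load` is an illustrative name; it assumes members are spread evenly over gateways and, for the Redis figure, the worst case where every gateway is subscribed to every active channel.

```python
def fanout_load(messages_per_s, members_per_channel, gateways, frame_bytes):
    """Return fan-out load figures under a uniform-distribution assumption."""
    total_writes = messages_per_s * members_per_channel
    per_gateway_writes = total_writes / gateways
    return {
        "total_ws_writes_per_s": total_writes,
        "per_gateway_ws_writes_per_s": per_gateway_writes,
        "per_gateway_egress_mb_per_s": per_gateway_writes * frame_bytes / 1_000_000,
        # Redis delivers each published message once per subscribed gateway.
        "redis_deliveries_per_s": messages_per_s * gateways,
    }


load = fanout_load(
    messages_per_s=5_000,
    members_per_channel=1_000,
    gateways=10,
    frame_bytes=1_000,
)
for name, value in load.items():
    print(f"{name}: {value:,.0f}")
```

Doubling the gateway count halves the per-gateway write rate but doubles the number of Redis deliveries, which is still far below the pub/sub ceiling quoted in the capacity math.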

Is the gateway or Redis the fan-out bottleneck?

hint

Redis pub/sub delivers each message ONCE to each subscribed gateway, regardless of how many clients are connected to that gateway. Redis sees 5,000 PUBLISH/s — well within its 1M/s capacity. The gateway sees 500k WebSocket writes/s — this is the likely bottleneck. Monitor gateway CPU and send buffer backpressure, not Redis throughput.

How does a gateway know which local connections to forward a pub/sub message to?

hint

The gateway maintains an in-memory map: channel_id → [WebSocket connections subscribed to that channel]. When a pub/sub message arrives for channel X, the gateway looks up channel X in its local map and writes the frame to each connection in the list. This is a pure in-memory fan-out — O(N) where N is the number of local connections subscribed to that channel. No Redis lookups needed for the fan-out itself.
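The in-memory map described in this hint might look like the following sketch. `Gateway` and `FakeConn` are hypothetical classes; a real gateway would hold WebSocket objects and would issue SUBSCRIBE and UNSUBSCRIBE to Redis when a channel gains its first or loses its last local member.

```python
from collections import defaultdict


class FakeConn:
    """Stand-in for one WebSocket connection; records the frames it is sent."""

    def __init__(self):
        self.sent = []

    def send(self, frame):
        self.sent.append(frame)


class Gateway:
    def __init__(self):
        self._by_channel = defaultdict(set)  # channel_id -> local connections

    def attach(self, conn, channel_ids):
        for channel_id in channel_ids:
            self._by_channel[channel_id].add(conn)

    def detach(self, conn):
        for channel_id in list(self._by_channel):
            self._by_channel[channel_id].discard(conn)
            if not self._by_channel[channel_id]:
                # Last local member left: this is where UNSUBSCRIBE would go.
                del self._by_channel[channel_id]

    def on_pubsub_message(self, channel_id, frame):
        """Forward one pub/sub message to local members; return the write count."""
        conns = self._by_channel.get(channel_id, ())
        for conn in conns:
            conn.send(frame)
        return len(conns)


gateway = Gateway()
alice, bob, carol = FakeConn(), FakeConn(), FakeConn()
gateway.attach(alice, ["c1", "c2"])
gateway.attach(bob, ["c1"])
gateway.attach(carol, ["c2"])

writes_c1 = gateway.on_pubsub_message("c1", "hello c1")
gateway.detach(bob)
writes_c1_after = gateway.on_pubsub_message("c1", "bob left")
writes_unknown = gateway.on_pubsub_message("c9", "nobody here")
```

The fan-out cost per message is the size of the local set for that channel, not the channel's total membership.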

Deliverable

Fan-out math in your justification: total WebSocket writes/s = message rate × avg channel size. Per-gateway writes/s = total / number of gateways. Each gateway must be sized to handle this write rate. The Redis pub/sub layer is not the bottleneck — the gateway's network egress and CPU are.

4

Phase 4 — Presence System

10 min · 3 questions

Show 100k users' online/offline state to channel members without polling. TTL-based presence with heartbeat is the standard pattern — simple, scalable, and tolerant of crashes. The design question is: how does a client know who in a channel is online?

Why not use WebSocket disconnect events to set presence to offline?

hint

Disconnect events are unreliable: network partitions, client crashes, and gateway crashes all result in the server not knowing the client is gone until a timeout. A TCP connection can appear alive to the server for minutes after the client goes offline. TTL-based presence (EXPIRE 60s, heartbeat every 30s) handles all failure modes uniformly: if the heartbeat stops for any reason, the key expires and the user appears offline after at most 60 seconds. This is more reliable than disconnect events.

How does a client learn who in a channel is currently online?

hint

Two steps: (1) SMEMBERS channel:members:{channel_id} to get the member list, (2) MGET presence:{u1} presence:{u2} ... for all member IDs in a single round trip. Non-nil results = online. This is O(N) for an N-member channel. For large channels (1,000 members), MGET for 1,000 keys returns in < 5ms. Online presence is computed on-demand when a user opens a channel, not pushed proactively — this avoids presence update storms.

At 100k concurrent users, what is the heartbeat RPS?

hint

100k users × 1 heartbeat/30s = 3,333 heartbeats/s. Each heartbeat is a POST /presence that executes SET key 1 EX 60. At 3,333/s this is trivial for Redis. The concern is not Redis throughput — it is HTTP overhead. Consider using WebSocket frames for heartbeats (client sends a ping frame every 30s) to avoid an HTTP round trip per user.

Deliverable

Presence flow: WebSocket connect → SET presence:{user_id} 1 EX 60, client sends heartbeat frame (or HTTP POST) every 30s → re-SET with EX 60. Presence query: SMEMBERS + MGET, computed on demand when a user opens a channel. If live status changes are also needed, presence update events (user came online/offline) can be pushed via the channel's pub/sub topic as a presence_update message type, at the cost of extra fan-out traffic.

5

Phase 5 — Availability: Gateway Server Crash

15 min · 3 questions

A gateway server crashes. All WebSocket connections to that gateway die. Design what happens: how clients detect the crash, reconnect, and recover missed messages. The key invariant is "no message loss if the message is written to Postgres BEFORE it is published to Redis", but a crashed gateway may have received a Redis publish and not forwarded it before crashing. The sequence_number replay mechanism is the safety net.

What happens to the 10,000 WebSocket connections on the crashed gateway?

hint

They are immediately terminated. The OS closes the sockets. Clients on mobile or behind NAT may not receive a TCP RST and must detect the crash via a heartbeat timeout (e.g., no pong response within 10s). Client reconnect logic: exponential backoff with jitter (100ms base, up to 30s), reconnect to any available gateway (via DNS load balancing or a sticky-session-free LB). After reconnect, request history replay.
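One way to implement the reconnect schedule is "full jitter" backoff, sketched below. The hint only specifies exponential backoff with jitter, a 100ms base, and a 30s cap; the choice of the full-jitter variant and the name `backoff_delays` are assumptions made for illustration.

```python
import random


def backoff_delays(attempts, base=0.1, cap=30.0, rng=random.random):
    """Full-jitter backoff: attempt n waits a random time up to min(cap, base * 2**n)."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays


# Upper bounds per attempt (rng fixed at 1.0) show the exponential growth and the cap.
ceilings = backoff_delays(10, rng=lambda: 1.0)

# A real client would use the default random source so that the connections
# dropped by one crashed gateway do not all reconnect at the same instant.
jittered = backoff_delays(10)
```

The jitter matters more than the exact base: without it, every client dropped by the same crash would retry in lockstep and recreate the handshake burst on each attempt.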

How does a reconnected client know what messages it missed?

hint

The client stores the last sequence_number it received per channel. The before_sequence cursor on the history endpoint pages backward, but replay needs messages AFTER the last seen sequence, so the endpoint needs an after_sequence parameter: GET /channels/{channel_id}/messages?after_sequence={last_seen}&limit=100. The client replays messages in sequence_number order and deduplicates any it already processed. This is the "replay on reconnect" pattern; it requires the client to persist its last-seen sequence_number across reconnects (local storage on web, SQLite on mobile).
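The replay loop can be sketched as forward pagination from the last seen sequence number. `fetch_after` is a hypothetical stand-in for the after_sequence history request (here it reads from a dictionary instead of making an HTTP call), and returning pages in ascending order is an assumption about how that parameter would behave.

```python
def fetch_after(store, channel_id, after_sequence, limit=100):
    """Stand-in for GET /channels/{channel_id}/messages?after_sequence=...&limit=..."""
    rows = [m for m in store[channel_id] if m["sequence_number"] > after_sequence]
    rows.sort(key=lambda m: m["sequence_number"])
    return rows[:limit]


def replay_missed(store, channel_id, last_seen, limit=100):
    """Page forward until caught up; return the missed messages and the new cursor."""
    replayed = []
    while True:
        page = fetch_after(store, channel_id, last_seen, limit)
        replayed.extend(page)
        if page:
            last_seen = page[-1]["sequence_number"]
        if len(page) < limit:
            return replayed, last_seen  # short page: nothing newer to fetch


# Channel history 1..250 with a burned sequence number at 100.
store = {"c1": [{"sequence_number": s} for s in range(1, 251) if s != 100]}
missed, cursor = replay_missed(store, "c1", last_seen=40)
```

The client last saw sequence number 40, so replay takes three requests (100, 100, and 9 messages) and skips over the gap at 100 without stalling.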

A message was published to Redis but the gateway crashed before forwarding it to WebSocket. Is the message lost?

hint

Not lost — it is in Postgres. The write order is: Postgres first, then Redis PUBLISH. If the gateway crashes after receiving the Redis publish but before forwarding, the message is in Postgres. When the client reconnects to another gateway, the history replay fetches it. The Redis publish is a best-effort delivery mechanism; Postgres is the durable fallback. This is why "Postgres before Redis" is the critical ordering rule.

Deliverable

Client reconnect flow: detect disconnect → exponential backoff → reconnect to gateway → fetch per-channel missed messages (after_sequence) → process in order → resume. Gateway crash is a non-event for message durability because Postgres is the source of truth. Presence recovery: after reconnect, re-SET presence key and re-subscribe to channel topics.

Common pitfall

Storing in-flight message state only in the gateway's memory. If the gateway has received a message from the POST handler but hasn't forwarded it yet (e.g., in a send buffer), a crash loses that delivery attempt. The fix: write to Postgres before even placing on the send buffer. The client's sequence_number replay is the universal recovery mechanism — design for it from the start, not as an afterthought.

How You'll Be Graded

PRR | score
WebSocket gateways + Redis pub/sub delivers in < 500ms | 30%
Per-channel sequence_number ordering with idempotent clients | 20%
Gateway crash causes reconnect + replay, not message loss | 20%
TTL-based presence with heartbeat is correct and queryable per channel | 10%
Fan-out math and WebSocket sizing argued from numbers | 20%

WebSocket gateways + Redis pub/sub delivers in < 500ms · scalability

The design must use WebSocket gateways with Redis pub/sub for cross-gateway fan-out. A message sent to any gateway must reach all connected members within 500ms p99.

Full credit

WebSocket gateways present, Redis pub/sub as the fan-out bus, cross-gateway delivery path described, < 500ms argued with latency budget.

Partial

WebSockets present but cross-gateway delivery mechanism is unclear or uses DB polling.

Zero

No WebSockets (polling only) or no cross-gateway delivery mechanism.

Per-channel sequence_number ordering with idempotent clients · scalability

Messages must be totally ordered within a channel. The ordering mechanism must be argued from first principles — not timestamps.

Full credit

Redis INCR for sequence_number generation described, client dedup by sequence_number stated, replay-on-reconnect flow described.

Partial

Ordering mentioned but mechanism is timestamps (clock skew problem unaddressed) or sequence_number without client dedup.

Zero

No ordering mechanism or "Postgres auto-increment" without addressing the gap in the distributed setting.

Gateway crash causes reconnect + replay, not message loss · availability

When a gateway crashes, clients reconnect to another gateway and replay missed messages via sequence_number. No message loss because Postgres is written before Redis publish.

Full credit

Write order stated (Postgres before Redis publish), client reconnect flow described, sequence_number replay on reconnect described.

Partial

Reconnect mentioned but write ordering or replay mechanism not described.

Zero

No reconnect story, or design relies on in-memory delivery state that would be lost on crash.

TTL-based presence with heartbeat is correct and queryable per channel · scalability

Online/offline presence for 100k users via Redis TTL keys. Heartbeat every 30s, TTL 60s. Per-channel presence query via SMEMBERS + MGET.

Full credit

SET EX 60 on heartbeat described, MGET for per-channel presence query described, crash/disconnect handled by TTL expiry.

Partial

Presence system present but heartbeat interval or TTL not specified.

Zero

Polling-based presence or WebSocket disconnect events as the sole offline signal.

Fan-out math and WebSocket sizing argued from numbers · justification-quality

Justifications must include the key capacity numbers: WebSocket RAM (64KB × N connections), fan-out write amplification (N members × message rate), and Redis pub/sub throughput headroom.

Full credit

64KB/connection × 100k = 6.4GB RAM stated, fan-out math (N members × message rate = WebSocket writes/s), Redis pub/sub headroom vs. 1M/s limit cited.

Partial

Correct design but justifications are qualitative ("Redis is fast enough").

Zero

No capacity math at all — design floated without numbers.

Failure Scenarios the Sim Will Inject

Each scenario fires automatically during your simulation run. Your design must survive all of them.

Steady-state chat load · traffic spike

Celebrity goes live: 3x connection burst · traffic spike · t=30s

Gateway server crash · crash · t=30s · for 60s · app-server

Region A outage · region outage · t=60s · for 60s · region-a
