classic system design
Design a message queue
Move work between producers and consumers with leases, ordering boundaries, and backpressure.
delivery semantics, partitioning, backpressure, dead-letter queues
Prompt
Design a managed message queue used by internal services to process background work reliably.
Clarifying questions
- Is ordering required globally, per key, or not at all?
- Are messages small payloads or references to object storage?
- How long should unconsumed messages be retained?
Functional requirements
- Publish messages to named queues.
- Let consumers lease, acknowledge, and retry messages.
- Expose dead-letter queues and replay controls.
Nonfunctional requirements
- Provide at-least-once delivery, with duplicates bounded by lease timeouts and a producer dedupe window.
- Keep publish latency predictable during consumer outages.
- Prevent one queue from starving other queues.
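One possible way to keep a noisy queue from starving the others is a per-queue token bucket on the publish path. The sketch below is an assumption, not part of the prompt: `TokenBucket`, `PerQueueLimiter`, and the rate and burst values are hypothetical, and the clock is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Refills at rate_per_s tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.updated_at = 0.0

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at burst.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_s)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerQueueLimiter:
    """Gives every queue its own bucket so one queue cannot spend another's budget."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate_per_s = rate_per_s
        self.burst = burst
        self._buckets = {}

    def allow_publish(self, queue: str, now: float) -> bool:
        bucket = self._buckets.get(queue)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_s, self.burst)
            self._buckets[queue] = bucket
        return bucket.try_acquire(now)
```

A rejected publish would map to a retryable "slow down" response, which pushes backpressure to the producer instead of letting the broker absorb the spike.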
Scale assumptions
- One million messages per minute at peak.
- Most messages are under 16 KB.
- Some queues are idle for days and then spike.
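The assumptions above can be turned into rough sizing numbers. This is a back-of-the-envelope sketch: it treats 16 KB as 16 × 1024 bytes and uses it as an upper bound for every message, which overstates real traffic since most bodies are smaller.

```python
PEAK_MESSAGES_PER_MINUTE = 1_000_000   # from the scale assumptions
MAX_TYPICAL_BODY_BYTES = 16 * 1024     # assumption: 16 KB read as 16 KiB


def peak_messages_per_second(per_minute: int) -> float:
    return per_minute / 60


def peak_ingest_bytes_per_second(per_minute: int, body_bytes: int) -> float:
    # Upper bound: every message is assumed to be at the typical maximum size.
    return peak_messages_per_second(per_minute) * body_bytes


def retained_bytes_during_outage(per_minute: int, body_bytes: int,
                                 outage_minutes: int) -> int:
    # Bytes that pile up if consumers stop while producers stay at peak.
    return per_minute * body_bytes * outage_minutes
```

With these inputs the peak is about 16,667 messages per second, the ingest ceiling is about 273 MB/s, and a one-hour consumer outage at peak retains roughly 983 GB if bodies are stored inline. That last number is one argument for storing references instead of payloads.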
API sketch
- POST /v1/queues/{name}/messages { bodyRef, orderingKey?, dedupeKey? }
- POST /v1/queues/{name}/lease { maxMessages, leaseMs } -> messages[]
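A minimal sketch of the request shapes for the two endpoints above, assuming JSON bodies with the field names shown in the sketch. The class names, the validation rules, and the URL-quoting of the queue name are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote


def publish_path(queue_name: str) -> str:
    # Quote the name so spaces or slashes cannot change the route.
    return f"/v1/queues/{quote(queue_name, safe='')}/messages"


def lease_path(queue_name: str) -> str:
    return f"/v1/queues/{quote(queue_name, safe='')}/lease"


@dataclass(frozen=True)
class PublishRequest:
    body_ref: str
    ordering_key: Optional[str] = None
    dedupe_key: Optional[str] = None

    def to_json_dict(self) -> dict:
        # Optional fields are omitted rather than sent as null.
        payload = {"bodyRef": self.body_ref}
        if self.ordering_key is not None:
            payload["orderingKey"] = self.ordering_key
        if self.dedupe_key is not None:
            payload["dedupeKey"] = self.dedupe_key
        return payload


@dataclass(frozen=True)
class LeaseRequest:
    max_messages: int
    lease_ms: int

    def __post_init__(self):
        # Assumed validation: both values must be positive.
        if self.max_messages < 1:
            raise ValueError("maxMessages must be at least 1")
        if self.lease_ms < 1:
            raise ValueError("leaseMs must be at least 1")

    def to_json_dict(self) -> dict:
        return {"maxMessages": self.max_messages, "leaseMs": self.lease_ms}
```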
Data model
- messages(queue, partition, offset, body_ref, status, available_at, lease_until).
- consumer_offsets(queue, consumer_group, partition, committed_offset).
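The two tables could be modeled roughly as follows. The column names come from the data model above; the `Status` values, the use of epoch seconds for timestamps, and the `is_leasable` rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    # Assumed status values; the data model only names a `status` column.
    AVAILABLE = "available"
    LEASED = "leased"
    ACKED = "acked"
    DEAD_LETTERED = "dead_lettered"


@dataclass
class MessageRow:
    queue: str
    partition: int
    offset: int
    body_ref: str
    status: Status = Status.AVAILABLE
    available_at: float = 0.0              # epoch seconds (assumption)
    lease_until: Optional[float] = None    # set only while leased

    @property
    def key(self) -> tuple:
        # (queue, partition, offset) identifies a message uniquely.
        return (self.queue, self.partition, self.offset)


@dataclass
class ConsumerOffsetRow:
    queue: str
    consumer_group: str
    partition: int
    committed_offset: int = 0


def is_leasable(row: MessageRow, now: float) -> bool:
    """A message can be leased if it is available, or if its lease has expired."""
    if row.status is Status.AVAILABLE:
        return row.available_at <= now
    if row.status is Status.LEASED:
        return row.lease_until is not None and row.lease_until <= now
    return False
```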
Architecture components
- Producers write to partitioned append logs.
- Consumers lease available messages and acknowledge completion.
- A dead-letter policy moves repeatedly failing messages aside.
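The write path above can be sketched as an in-memory partitioned log. This assumes messages with an ordering key are routed by a stable hash of that key and unkeyed messages are spread round-robin; `partition_for` and `PartitionedLog` are hypothetical names, and durability and replication are out of scope here.

```python
import hashlib


def partition_for(ordering_key: str, num_partitions: int) -> int:
    # A stable hash (unlike the built-in hash()) keeps a key on the same
    # partition across processes and restarts.
    digest = hashlib.sha256(ordering_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class PartitionedLog:
    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self.partitions = [[] for _ in range(num_partitions)]
        self._round_robin = 0

    def append(self, body_ref: str, ordering_key=None) -> tuple:
        """Append a message and return its (partition, offset)."""
        if ordering_key is None:
            partition = self._round_robin % self.num_partitions
            self._round_robin += 1
        else:
            partition = partition_for(ordering_key, self.num_partitions)
        log = self.partitions[partition]
        offset = len(log)  # offsets are dense and increase per partition
        log.append(body_ref)
        return partition, offset
```

Ordering is only guaranteed within a partition, so two messages with the same ordering key stay in order while different keys can be processed in parallel.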
Bottlenecks
- Hot ordering keys limit partition parallelism.
- Slow consumers cause retention growth and replay lag.
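Both bottlenecks reduce to simple arithmetic. The sketch below assumes messages for one ordering key are processed strictly one at a time and that publish and consume rates are steady; the function names are illustrative.

```python
import math


def per_key_ceiling_per_s(processing_ms: float) -> float:
    # With strict per-key ordering, one key is handled serially, so its
    # throughput is capped no matter how many partitions exist.
    return 1000.0 / processing_ms


def drain_seconds(backlog: int, consume_per_s: float, publish_per_s: float) -> float:
    # A backlog only shrinks by the surplus of consumption over publishing.
    surplus = consume_per_s - publish_per_s
    if surplus <= 0:
        return math.inf
    return backlog / surplus
```

For example, a 20 ms handler caps a single hot key at 50 messages per second, and a backlog of 60,000 messages drains in 120 seconds when consumers run at 1,500/s against 1,000/s of new publishes.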
Failure modes
- Consumer crash: lease expires and message becomes visible again.
- Producer retry: dedupeKey prevents duplicate logical messages for a short window.
- Poison message: dead-letter after max attempts with failure reason.
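The three failure modes above can be exercised with a small in-memory model. This is a sketch under stated assumptions: `LeaseQueue`, its default limits, and the `nack` method are hypothetical, and time is passed in as `now` (seconds) instead of being read from a clock.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    message_id: int
    body_ref: str
    attempts: int = 0
    lease_until: float = 0.0            # leased while now < lease_until
    last_error: Optional[str] = None


class LeaseQueue:
    def __init__(self, max_attempts: int = 3, dedupe_window_s: float = 300.0):
        self.max_attempts = max_attempts
        self.dedupe_window_s = dedupe_window_s
        self.pending = {}        # message_id -> Message
        self.dead_letters = []   # messages that exhausted their attempts
        self._dedupe_seen = {}   # dedupe_key -> (publish time, message_id)
        self._next_id = 0

    def publish(self, body_ref: str, now: float, dedupe_key: Optional[str] = None) -> int:
        if dedupe_key is not None:
            seen = self._dedupe_seen.get(dedupe_key)
            if seen is not None and now - seen[0] < self.dedupe_window_s:
                return seen[1]  # producer retry: return the original id
        message_id = self._next_id
        self._next_id += 1
        self.pending[message_id] = Message(message_id, body_ref)
        if dedupe_key is not None:
            self._dedupe_seen[dedupe_key] = (now, message_id)
        return message_id

    def _dead_letter(self, message: Message, reason: str) -> None:
        message.last_error = message.last_error or reason
        self.dead_letters.append(self.pending.pop(message.message_id))

    def lease(self, max_messages: int, lease_s: float, now: float) -> list:
        leased = []
        for message in list(self.pending.values()):
            if len(leased) >= max_messages:
                break
            if message.lease_until > now:
                continue  # another consumer still holds the lease
            if message.attempts >= self.max_attempts:
                self._dead_letter(message, "lease expired")  # poison message
                continue
            message.attempts += 1
            message.lease_until = now + lease_s
            leased.append(message)
        return leased

    def ack(self, message_id: int) -> None:
        self.pending.pop(message_id, None)

    def nack(self, message_id: int, reason: str, now: float) -> None:
        message = self.pending.get(message_id)
        if message is None:
            return
        message.last_error = reason
        message.lease_until = now  # visible again immediately
        if message.attempts >= self.max_attempts:
            self._dead_letter(message, reason)
```

A consumer crash needs no special handling in this model: the message simply becomes leasable again once `now` passes `lease_until`, which is also why consumers must tolerate duplicates.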
Observability
- Queue depth, oldest visible age, consumer lag, retry count, dead-letter rate.
- Publish and lease latency by queue tier.
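A few of these metrics can be derived directly from the data model. The sketch assumes `available_at` is in seconds, that a log end offset is the next offset to be written, and that a committed offset is the next offset to be consumed; the function names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisibleMessage:
    available_at: float  # time at which the message became leasable


def visible_depth(messages: list, now: float) -> int:
    # Count only messages that a consumer could lease right now.
    return sum(1 for m in messages if m.available_at <= now)


def oldest_visible_age(messages: list, now: float) -> float:
    # Age of the longest-waiting visible message; 0.0 when nothing is waiting.
    visible = [m.available_at for m in messages if m.available_at <= now]
    return now - min(visible) if visible else 0.0


def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> int:
    # Sum over partitions of messages written but not yet committed.
    return sum(
        max(0, end - committed_offsets.get(partition, 0))
        for partition, end in log_end_offsets.items()
    )
```

Oldest visible age is usually the better alerting signal, since depth alone cannot distinguish a healthy busy queue from a stalled one.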
Security / privacy
- Authorize producers and consumers per queue.
- Avoid raw PII in message bodies; prefer encrypted payload references.
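Per-queue authorization can be sketched as a default-deny grant table. `QueueAcl`, the `Action` values, and the principal names are hypothetical; a real deployment would likely delegate this to an existing identity system.

```python
from enum import Enum


class Action(Enum):
    PUBLISH = "publish"
    CONSUME = "consume"


class QueueAcl:
    """Default-deny access control keyed by (principal, queue, action)."""

    def __init__(self):
        self._grants = set()

    def grant(self, principal: str, queue: str, action: Action) -> None:
        self._grants.add((principal, queue, action))

    def is_allowed(self, principal: str, queue: str, action: Action) -> bool:
        # Anything not explicitly granted is denied.
        return (principal, queue, action) in self._grants
```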
Cost considerations
- Retention cost grows with consumer lag and body size.
- High fanout may need topic semantics instead of many duplicate queue writes.
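The fanout point can be illustrated with a single shared log and one cursor per subscription, so a publish costs one write regardless of subscriber count. `Topic` and its methods are hypothetical names, and the sketch assumes new subscriptions start at the tail of the log.

```python
class Topic:
    def __init__(self):
        self.log = []       # one stored copy of each message
        self.cursors = {}   # subscription name -> next offset to read

    def subscribe(self, name: str) -> None:
        # Assumption: a new subscription only sees messages published later.
        self.cursors.setdefault(name, len(self.log))

    def publish(self, body_ref: str) -> int:
        self.log.append(body_ref)
        return len(self.log) - 1

    def read(self, name: str, max_messages: int) -> list:
        # Reading does not move the cursor, so an uncommitted batch is
        # redelivered on the next read (at-least-once).
        start = self.cursors[name]
        end = min(len(self.log), start + max_messages)
        return [(offset, self.log[offset]) for offset in range(start, end)]

    def commit(self, name: str, next_offset: int) -> None:
        # Cursors only move forward.
        self.cursors[name] = max(self.cursors[name], next_offset)
```

With N subscribers, duplicate queue writes store N copies per message, while this layout stores one copy plus N small cursors. The cost is that retention is now driven by the slowest subscription.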
Tradeoffs
- Strict ordering simplifies consumers but limits throughput.
- Push delivery reduces polling waste but complicates backpressure.
Rubric
| Criterion | Category | Weight | Evidence |
|---|---|---|---|
| Separates product behavior from infrastructure assumptions before drawing boxes. | clarification | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Turns traffic and data assumptions into concrete sizing constraints. | scale | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Draws clear service, cache, queue, and storage boundaries with reasons for each split. | architecture | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Defines durable state, indexes, keys, and idempotency records. | data | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Names failure modes and the recovery behavior users see. | failure | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Defines the small set of metrics and traces needed to debug the design. | observability | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Explains what is being sacrificed and why that sacrifice fits the prompt. | tradeoffs | 15 | Compares at least two viable designs and names the losing design's advantage. |