classic system design
Design a notification system
Send email, push, and in-app notifications while respecting user preferences, deduplicating retried requests, and tolerating provider failures.
Tags: fanout, template versioning, preferences, provider failover
Prompt
Design a notification platform that product teams can use to send transactional and lifecycle notifications across email, push, and in-app channels.
Clarifying questions
- Which notifications are transactional versus marketing?
- Can users configure channel-level and topic-level preferences?
- Do sends need audit trails for compliance?
Functional requirements
- Accept notification requests from product services.
- Resolve templates, preferences, and channel eligibility.
- Track delivery, bounce, open, and in-app read states.
Nonfunctional requirements
- Do not send duplicates when upstream services retry.
- Support provider failover without losing auditability.
- Keep transactional sends ahead of bulk lifecycle sends.
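One way to keep transactional sends ahead of bulk sends is a priority queue in front of the channel workers. The sketch below is a minimal in-process illustration using the standard `heapq` module. The two priority classes and the names `PriorityJobQueue`, `TRANSACTIONAL`, and `BULK` are assumptions for illustration, not part of the prompt.

```python
import heapq
import itertools
from dataclasses import dataclass, field

# ASSUMPTION: two priority classes; the lower number is served first.
TRANSACTIONAL = 0
BULK = 1


@dataclass(order=True)
class Job:
    priority: int
    seq: int  # tie-breaker that preserves FIFO order within a priority class
    request_id: str = field(compare=False)


class PriorityJobQueue:
    """Strict-priority queue: all transactional jobs drain before any bulk job."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def push(self, request_id, priority):
        heapq.heappush(self._heap, Job(priority, next(self._counter), request_id))

    def pop(self):
        return heapq.heappop(self._heap).request_id

    def __len__(self):
        return len(self._heap)


queue = PriorityJobQueue()
queue.push("bulk-1", BULK)
queue.push("bulk-2", BULK)
queue.push("txn-1", TRANSACTIONAL)

drained = [queue.pop() for _ in range(len(queue))]
print(drained)  # ['txn-1', 'bulk-1', 'bulk-2']
```

Strict priority can starve bulk traffic under sustained transactional load. Separate queues per priority with weighted consumption are a common alternative when that matters.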
Scale assumptions
- 20 million notifications per day.
- Peak fanout occurs after product-wide events.
- Email providers impose per-minute and reputation constraints.
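As a rough sizing sketch, the daily volume above converts to an average request rate as shown here. The 10x peak multiplier is an assumption chosen for illustration; the prompt does not state one.

```python
# Back-of-envelope sizing from the stated 20 million notifications per day.
NOTIFICATIONS_PER_DAY = 20_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# ASSUMPTION: traffic after a product-wide event peaks at 10x the daily average.
ASSUMED_PEAK_MULTIPLIER = 10

average_rps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
peak_rps = average_rps * ASSUMED_PEAK_MULTIPLIER

print(f"average: {average_rps:.0f}/s, assumed peak: {peak_rps:.0f}/s")
# prints: average: 231/s, assumed peak: 2315/s
```

The average of roughly 231 notifications per second is modest. The queue depth and provider rate limits during fanout bursts are the numbers that drive the design.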
API sketch
- POST /v1/notifications { userId, templateKey, channelHints, idempotencyKey, data }
- GET /v1/users/{userId}/notifications for in-app feed reads.
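The POST body above could be validated along these lines before the intent is stored. This is a sketch only: which fields are required, the allowed channel names, and the helper `parse_request` are assumptions, since the API sketch does not specify them.

```python
from dataclasses import dataclass, field
from typing import Any

# ASSUMPTION: channel identifiers accepted in channelHints.
ALLOWED_CHANNELS = {"email", "push", "in_app"}


@dataclass(frozen=True)
class NotificationRequest:
    user_id: str
    template_key: str
    idempotency_key: str
    channel_hints: tuple[str, ...] = ()
    data: dict[str, Any] = field(default_factory=dict)


def parse_request(body: dict[str, Any]) -> NotificationRequest:
    """Validate a POST /v1/notifications body and return a typed request."""
    # ASSUMPTION: these three fields are mandatory; channelHints and data are optional.
    missing = [k for k in ("userId", "templateKey", "idempotencyKey") if not body.get(k)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")

    hints = tuple(body.get("channelHints", ()))
    unknown = [c for c in hints if c not in ALLOWED_CHANNELS]
    if unknown:
        raise ValueError(f"unknown channels: {', '.join(unknown)}")

    return NotificationRequest(
        user_id=body["userId"],
        template_key=body["templateKey"],
        idempotency_key=body["idempotencyKey"],
        channel_hints=hints,
        data=dict(body.get("data", {})),
    )


request = parse_request({
    "userId": "u1",
    "templateKey": "order_shipped",
    "channelHints": ["email"],
    "idempotencyKey": "key-123",
    "data": {"order_id": "A-1"},
})
```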
Data model
- notification_requests(id, user_id, template_key, status, idempotency_key).
- notification_deliveries(request_id, channel, provider, status, provider_message_id).
- user_preferences(user_id, topic, email_enabled, push_enabled, in_app_enabled).
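The three tables above might be declared as follows, with a unique constraint on `idempotency_key` doing the deduplication. This sketch uses an in-memory SQLite database purely so it runs anywhere. The column types, defaults, primary keys, and the helper `create_request` are assumptions, and a production design may scope the key per calling service or keep one delivery row per attempt rather than per channel.

```python
import sqlite3

SCHEMA = """
CREATE TABLE notification_requests (
    id              INTEGER PRIMARY KEY,
    user_id         TEXT NOT NULL,
    template_key    TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'accepted',
    idempotency_key TEXT NOT NULL UNIQUE  -- ASSUMPTION: globally unique key
);
CREATE TABLE notification_deliveries (
    request_id          INTEGER NOT NULL REFERENCES notification_requests(id),
    channel             TEXT NOT NULL,
    provider            TEXT,
    status              TEXT NOT NULL DEFAULT 'pending',
    provider_message_id TEXT,
    PRIMARY KEY (request_id, channel)     -- ASSUMPTION: one row per channel
);
CREATE TABLE user_preferences (
    user_id        TEXT NOT NULL,
    topic          TEXT NOT NULL,
    email_enabled  INTEGER NOT NULL DEFAULT 1,
    push_enabled   INTEGER NOT NULL DEFAULT 1,
    in_app_enabled INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, topic)
);
"""


def create_request(conn, user_id, template_key, idempotency_key):
    """Insert a request, or return the existing id if the key was already seen."""
    conn.execute(
        "INSERT OR IGNORE INTO notification_requests "
        "(user_id, template_key, idempotency_key) VALUES (?, ?, ?)",
        (user_id, template_key, idempotency_key),
    )
    row = conn.execute(
        "SELECT id FROM notification_requests WHERE idempotency_key = ?",
        (idempotency_key,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first_id = create_request(conn, "u1", "order_shipped", "key-123")
retry_id = create_request(conn, "u1", "order_shipped", "key-123")  # upstream retry
```

Both calls return the same id, and only one row exists, so the retry cannot trigger a second send.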
Architecture components
- Request API validates and stores notification intent.
- Planner resolves preferences and creates channel delivery jobs.
- Channel workers send through providers and update delivery state.
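The planner step can be sketched as a pure function that takes a request's channel hints and a preference record and returns the delivery jobs to enqueue. The names `Preferences`, `DeliveryJob`, and `plan_deliveries` are hypothetical, and treating empty hints as "all channels" is an assumption.

```python
from dataclasses import dataclass

CHANNELS = ("email", "push", "in_app")


@dataclass(frozen=True)
class Preferences:
    """Mirrors the per-topic flags in user_preferences."""
    email_enabled: bool = True
    push_enabled: bool = True
    in_app_enabled: bool = True

    def allows(self, channel: str) -> bool:
        return getattr(self, f"{channel}_enabled")


@dataclass(frozen=True)
class DeliveryJob:
    request_id: str
    channel: str


def plan_deliveries(request_id, channel_hints, prefs):
    """Return (jobs, suppressed_channels) for one notification request."""
    # ASSUMPTION: hints restrict the candidate channels; no hints means all channels.
    candidates = channel_hints or CHANNELS
    jobs, suppressed = [], []
    for channel in candidates:
        if prefs.allows(channel):
            jobs.append(DeliveryJob(request_id, channel))
        else:
            suppressed.append(channel)  # worth counting for suppression metrics
    return jobs, suppressed


prefs = Preferences(push_enabled=False)
jobs, suppressed = plan_deliveries("req-1", ("email", "push"), prefs)
print([j.channel for j in jobs], suppressed)  # ['email'] ['push']
```

Keeping the planner free of I/O makes the eligibility rules easy to unit test separately from queueing and provider calls.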
Bottlenecks
- Template rendering can become CPU-heavy during fanout.
- Provider rate limits can create long queues for one channel.
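A token bucket is one common way to keep a channel worker under a provider's rate limit without dropping jobs. The sketch below is a single-process illustration; the class name, parameters, and injected clock are assumptions, and a fleet of workers would need a shared limiter instead.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock  # injectable so tests can control time
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self, cost=1.0):
        """Return True and consume tokens if the send may proceed now."""
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Usage with a fake clock: 2 sends per second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=2, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(3)]  # third send exceeds the burst
fake_now[0] = 0.5                                 # half a second later: one token back
after_wait = bucket.try_acquire()
```

When `try_acquire` returns False, the worker can leave the job on the queue and retry later, which is what produces the long per-channel queues noted above.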
Failure modes
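- Provider outage: pause the affected channel, then retry with jittered backoff or fail over to a secondary provider.
- Preference race: resolve channels against the preference snapshot stored on the request record, so a concurrent preference change cannot produce inconsistent decisions for one send.
- Duplicate upstream retry: a repeated idempotencyKey returns the existing request instead of creating a new one.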
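The provider-outage handling can be sketched as retry with jittered exponential backoff on each provider, then failover to the next one, with every attempt appended to an audit log. All names here (`ProviderError`, `backoff_delay`, `send_with_failover`) and the default limits are hypothetical.

```python
import random
import time


class ProviderError(Exception):
    """Raised by a provider callable when a send fails (hypothetical)."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def send_with_failover(message, providers, audit, max_attempts=3, sleep=time.sleep):
    """Try each (name, send) pair in order; record every attempt in `audit`.

    Returns (provider_name, provider_message_id) on success and raises
    ProviderError once every provider has exhausted its attempts.
    """
    for name, send in providers:
        for attempt in range(max_attempts):
            try:
                message_id = send(message)
            except ProviderError as exc:
                audit.append((name, attempt, f"failed: {exc}"))
                if attempt < max_attempts - 1:
                    sleep(backoff_delay(attempt))
            else:
                audit.append((name, attempt, "sent"))
                return name, message_id
    raise ProviderError(f"all providers failed after {len(audit)} attempts")


def primary_down(message):
    raise ProviderError("503 from primary")


def secondary_up(message):
    return "msg-1"


audit_log = []
used, message_id = send_with_failover(
    "hello",
    [("primary", primary_down), ("secondary", secondary_up)],
    audit_log,
    max_attempts=2,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
```

Failing over after an ambiguous error such as a timeout can deliver the same message twice, so a real system would likely fail over only on errors known to mean "not sent".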
Observability
- Delivery latency by channel and priority.
- Bounce, complaint, retry, and provider error rates.
- Preference suppression counts by topic.
Security / privacy
- Restrict template data to declared fields.
- Avoid logging rendered notification bodies with private user content.
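Restricting template data to declared fields can be as simple as an allowlist applied before rendering. The registry `DECLARED_FIELDS`, its example template, and the helper `restrict_template_data` are hypothetical names used only for this sketch.

```python
# ASSUMPTION: each template declares the fields it may read; this registry is illustrative.
DECLARED_FIELDS = {
    "order_shipped": {"first_name", "order_id", "tracking_url"},
}


def restrict_template_data(template_key, data):
    """Return (allowed_data, dropped_field_names) for a template render."""
    declared = DECLARED_FIELDS.get(template_key)
    if declared is None:
        raise KeyError(f"unknown template: {template_key}")
    allowed = {k: v for k, v in data.items() if k in declared}
    dropped = sorted(set(data) - declared)
    return allowed, dropped


allowed, dropped = restrict_template_data(
    "order_shipped",
    {"first_name": "Ada", "order_id": "A-1", "ssn": "000-00-0000"},
)
print(allowed)  # {'first_name': 'Ada', 'order_id': 'A-1'}
print(dropped)  # ['ssn']
```

Logging only the dropped field names, never their values, keeps this check compatible with the rule against logging private user content.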
Cost considerations
- Provider costs scale with messages sent, not notifications requested.
- In-app retention can dominate storage if unread messages never expire.
Tradeoffs
- Central preference resolution improves compliance but adds latency.
- Provider abstraction helps failover but can hide channel-specific errors.
Rubric
| Criterion | Weight | Evidence |
|---|---|---|
| Clarification: Separates product behavior from infrastructure assumptions before drawing boxes. | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Scale: Turns traffic and data assumptions into concrete sizing constraints. | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Architecture: Draws clear service, cache, queue, and storage boundaries with reasons for each split. | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Data: Defines durable state, indexes, keys, and idempotency records. | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Failure: Names failure modes and the recovery behavior users see. | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Observability: Defines the small set of metrics and traces needed to debug the design. | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Tradeoffs: Explains what is being sacrificed and why that sacrifice fits the prompt. | 15 | Compares at least two viable designs and names the losing design's advantage. |