classic system design

Design a notification system

Send email, push, and in-app notifications while respecting user preferences, deduplicating retried requests, and tolerating provider failures.

fanout · template versioning · preferences · provider failover

Prompt

Design a notification platform that product teams can use to send transactional and lifecycle notifications across email, push, and in-app channels.

Clarifying questions

  • Which notifications are transactional versus marketing?
  • Can users configure channel-level and topic-level preferences?
  • Do sends need audit trails for compliance?

Functional requirements

  • Accept notification requests from product services.
  • Resolve templates, preferences, and channel eligibility.
  • Track delivery, bounce, open, and in-app read states.

Nonfunctional requirements

  • Do not send duplicates when upstream services retry.
  • Support provider failover without losing auditability.
  • Keep transactional sends ahead of bulk lifecycle sends.
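
One way to keep transactional sends ahead of bulk lifecycle sends is a priority queue in front of the channel workers. The sketch below is illustrative only: the `SendQueue` class, the `PRIORITY` map, and the job names are assumptions, not part of the prompt.

```python
import heapq
import itertools

# Hypothetical priority levels; the lower number is dequeued first.
PRIORITY = {"transactional": 0, "lifecycle": 1}


class SendQueue:
    """In-memory priority queue: transactional jobs always pop before lifecycle jobs."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so jobs of equal priority stay FIFO.
        self._seq = itertools.count()

    def push(self, kind, job):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), job))

    def pop(self):
        _, _, job = heapq.heappop(self._heap)
        return job

    def __len__(self):
        return len(self._heap)


queue = SendQueue()
queue.push("lifecycle", "weekly-digest")
queue.push("transactional", "password-reset")
queue.push("lifecycle", "feature-announcement")

drained = [queue.pop() for _ in range(len(queue))]
print(drained)  # ['password-reset', 'weekly-digest', 'feature-announcement']
```

Strict priority like this can starve lifecycle sends under sustained transactional load, so a real design might use separate queues with weighted consumption instead.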

Scale assumptions

  • 20 million notifications per day.
  • Peak fanout occurs after product-wide events.
  • Email providers impose per-minute and reputation constraints.
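
The daily volume can be turned into rough sizing numbers. In the sketch below only the 20 million per day figure comes from the assumptions above; the peak multiplier, the audience size for a product-wide event, and the provider limit are made-up inputs chosen to show the arithmetic.

```python
NOTIFICATIONS_PER_DAY = 20_000_000      # from the scale assumptions
SECONDS_PER_DAY = 24 * 60 * 60

# Everything below is an assumed input for illustration, not a given.
PEAK_MULTIPLIER = 10                    # peak-to-average ratio during fanout
EVENT_AUDIENCE = 5_000_000              # users hit by one product-wide event
EMAIL_LIMIT_PER_MINUTE = 60_000         # provider's per-minute send limit

average_rps = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
peak_rps = average_rps * PEAK_MULTIPLIER

# Time to drain one product-wide email fanout at the provider's limit.
drain_minutes = EVENT_AUDIENCE / EMAIL_LIMIT_PER_MINUTE

print(f"average: {average_rps:.0f}/s")     # average: 231/s
print(f"peak: {peak_rps:.0f}/s")           # peak: 2315/s
print(f"drain: {drain_minutes:.0f} min")   # drain: 83 min
```

Under these assumed inputs the provider limit, not internal throughput, sets the email delivery time for a large fanout, which is why bulk sends need their own queue and priority.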

API sketch

  • POST /v1/notifications { userId, templateKey, channelHints, idempotencyKey, data }
  • GET /v1/users/{userId}/notifications for in-app feed reads.
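
A minimal sketch of how the Request API might validate the POST body. The field names follow the API sketch above; the `parse_request` helper, the `NotificationRequest` type, the choice of required fields, and the channel spellings in `ALLOWED_CHANNELS` are assumptions.

```python
from dataclasses import dataclass, field

# Assumed channel identifiers; the prompt only names email, push, and in-app.
ALLOWED_CHANNELS = {"email", "push", "in_app"}
REQUIRED_FIELDS = ("userId", "templateKey", "idempotencyKey")


@dataclass(frozen=True)
class NotificationRequest:
    user_id: str
    template_key: str
    idempotency_key: str
    channel_hints: tuple = ()
    data: dict = field(default_factory=dict)


def parse_request(body: dict) -> NotificationRequest:
    """Validate a decoded JSON body and return a typed request, or raise ValueError."""
    missing = [name for name in REQUIRED_FIELDS if not body.get(name)]
    if missing:
        raise ValueError(f"missing required fields: {', '.join(missing)}")

    hints = tuple(body.get("channelHints", ()))
    unknown = [channel for channel in hints if channel not in ALLOWED_CHANNELS]
    if unknown:
        raise ValueError(f"unknown channels: {', '.join(unknown)}")

    return NotificationRequest(
        user_id=body["userId"],
        template_key=body["templateKey"],
        idempotency_key=body["idempotencyKey"],
        channel_hints=hints,
        data=dict(body.get("data", {})),
    )


request = parse_request({
    "userId": "u-123",
    "templateKey": "order_shipped",
    "channelHints": ["email", "push"],
    "idempotencyKey": "order-987-shipped",
    "data": {"order_id": "987"},
})
```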

Data model

  • notification_requests(id, user_id, template_key, status, idempotency_key).
  • notification_deliveries(request_id, channel, provider, status, provider_message_id).
  • user_preferences(user_id, topic, email_enabled, push_enabled, in_app_enabled).
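
The three tables could be sketched with Python's built-in `sqlite3` module as below. The table and column names come from the list above; the column types, defaults, key choices, and the `accept_request` helper are assumptions. In particular, a unique constraint on `idempotency_key` is one common way to make upstream retries safe, and one delivery row per request and channel is a simplification (a design that audits every provider attempt would likely need a separate attempts table).

```python
import sqlite3

SCHEMA = """
CREATE TABLE notification_requests (
    id              INTEGER PRIMARY KEY,
    user_id         TEXT NOT NULL,
    template_key    TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'accepted',
    idempotency_key TEXT NOT NULL UNIQUE
);
CREATE TABLE notification_deliveries (
    request_id          INTEGER NOT NULL REFERENCES notification_requests(id),
    channel             TEXT NOT NULL,
    provider            TEXT,
    status              TEXT NOT NULL DEFAULT 'pending',
    provider_message_id TEXT,
    PRIMARY KEY (request_id, channel)
);
CREATE TABLE user_preferences (
    user_id        TEXT NOT NULL,
    topic          TEXT NOT NULL,
    email_enabled  INTEGER NOT NULL DEFAULT 1,
    push_enabled   INTEGER NOT NULL DEFAULT 1,
    in_app_enabled INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (user_id, topic)
);
"""


def accept_request(conn, user_id, template_key, idempotency_key):
    """Insert a request; on a repeated key, return the existing row's id instead.

    Returns (request_id, created).
    """
    try:
        with conn:  # commits on success, rolls back on error
            cursor = conn.execute(
                "INSERT INTO notification_requests (user_id, template_key, idempotency_key) "
                "VALUES (?, ?, ?)",
                (user_id, template_key, idempotency_key),
            )
        return cursor.lastrowid, True
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT id FROM notification_requests WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

first_id, first_created = accept_request(conn, "u-123", "order_shipped", "order-987-shipped")
retry_id, retry_created = accept_request(conn, "u-123", "order_shipped", "order-987-shipped")
```

Here the retry gets back the same request id with `created` set to false, so the caller can respond as if the original request had succeeded.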

Architecture components

  • Request API validates and stores notification intent.
  • Planner resolves preferences and creates channel delivery jobs.
  • Channel workers send through providers and update delivery state.
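
The Planner step can be sketched as a pure function from a stored request to a list of channel jobs. The names `DeliveryJob` and `plan_deliveries`, the shape of the preference snapshot, and the choice to skip channels with no recorded preference are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumed channel identifiers, in the order the planner considers them.
CHANNELS = ("email", "push", "in_app")


@dataclass(frozen=True)
class DeliveryJob:
    request_id: int
    channel: str


def plan_deliveries(request_id, channel_hints, preference_snapshot):
    """Create one job per channel that is both requested and enabled.

    preference_snapshot maps channel name to a bool and is assumed to have been
    captured when the request was stored. Channels missing from the snapshot
    are skipped (fail closed).
    """
    candidates = channel_hints or CHANNELS  # no hints means "consider every channel"
    return [
        DeliveryJob(request_id, channel)
        for channel in candidates
        if preference_snapshot.get(channel, False)
    ]


jobs = plan_deliveries(
    request_id=42,
    channel_hints=("email", "push"),
    preference_snapshot={"email": True, "push": False, "in_app": True},
)
print(jobs)  # [DeliveryJob(request_id=42, channel='email')]
```

Keeping the planner free of provider calls means it can be retried safely; each job it emits would then be enqueued for the matching channel worker.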

Bottlenecks

  • Template rendering can become CPU-heavy during fanout.
  • Provider rate limits can create long queues for one channel.
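
A token bucket is one common way for a channel worker to stay under a provider's rate limit. The sketch below is an assumption about how that could look, not a prescribed design; the `TokenBucket` class is hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allows bursts up to `capacity` sends, refilled at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.updated_at = now

    def try_acquire(self, now):
        """Return True and consume a token if a send is allowed at time `now`."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = max(now, self.updated_at)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate_per_sec=2, capacity=2)
results = [
    bucket.try_acquire(now=0.0),  # burst token 1
    bucket.try_acquire(now=0.0),  # burst token 2
    bucket.try_acquire(now=0.0),  # bucket empty, caller should requeue the job
    bucket.try_acquire(now=0.5),  # half a second refills exactly one token
]
print(results)  # [True, True, False, True]
```

When `try_acquire` returns false the worker would leave the job on the queue, which is where the long per-channel queues noted above come from.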

Failure modes

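  • Provider outage: pause the affected channel, then retry with jittered backoff or fail over to a secondary provider.
  • Preference race: snapshot the user's preferences onto the request record so a concurrent preference change cannot alter an in-flight send.
  • Duplicate upstream retry: a repeated idempotencyKey returns the existing request instead of creating a new one.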
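
The provider outage case can be sketched as a retry loop with jittered exponential backoff that moves to the next provider once attempts are exhausted. Everything named here (`ProviderError`, `backoff_delay`, `send_with_failover`, the provider names, and the default limits) is hypothetical; the attempt log stands in for the audit trail.

```python
import random
import time


class ProviderError(Exception):
    """Raised by a provider client when a send fails (assumed to be retryable)."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def send_with_failover(providers, message, max_attempts=3, sleep=time.sleep):
    """Try each (name, send_fn) in order; return (name, message_id, attempt_log).

    Every attempt is appended to the log so the failover stays auditable.
    """
    attempt_log = []
    for name, send in providers:
        for attempt in range(max_attempts):
            try:
                message_id = send(message)
            except ProviderError:
                attempt_log.append((name, attempt, "failed"))
                sleep(backoff_delay(attempt))
            else:
                attempt_log.append((name, attempt, "sent"))
                return name, message_id, attempt_log
    raise ProviderError("all providers failed")


def failing_send(message):
    raise ProviderError("primary is down")


def working_send(message):
    return "msg-1"


# A no-op sleep keeps the demonstration instant; a worker would really wait or requeue.
provider, message_id, log = send_with_failover(
    [("primary", failing_send), ("secondary", working_send)],
    {"to": "user@example.com"},
    sleep=lambda seconds: None,
)
print(provider, message_id, len(log))  # secondary msg-1 4
```

A timeout does not prove the first provider dropped the message, so failing over can still produce a duplicate unless the provider call itself is deduplicated.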

Observability

  • Delivery latency by channel and priority.
  • Bounce, complaint, retry, and provider error rates.
  • Preference suppression counts by topic.

Security / privacy

  • Restrict template data to declared fields.
  • Avoid logging rendered notification bodies with private user content.
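
Both points can be sketched together: filter caller-supplied data down to the fields a template declares, and log field names rather than values or rendered bodies. The `DECLARED_FIELDS` registry, the helper names, and the sample template key are assumptions.

```python
# Hypothetical registry of the fields each template is allowed to receive.
DECLARED_FIELDS = {
    "order_shipped": {"first_name", "order_id"},
}


def filter_template_data(template_key, data):
    """Drop any field the template did not declare; unknown templates get nothing."""
    allowed = DECLARED_FIELDS.get(template_key, set())
    return {key: value for key, value in data.items() if key in allowed}


def log_safe_summary(template_key, data):
    """Build a log line that records which fields were present, never their values."""
    return f"template={template_key} fields={sorted(data)}"


raw = {"first_name": "Ada", "order_id": "987", "card_number": "4111-0000"}
safe = filter_template_data("order_shipped", raw)

print(safe)                                     # {'first_name': 'Ada', 'order_id': '987'}
print(log_safe_summary("order_shipped", safe))  # template=order_shipped fields=['first_name', 'order_id']
```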

Cost considerations

  • Provider costs scale by sent message, not requested notification.
  • In-app retention can dominate storage if unread messages never expire.

Tradeoffs

  • Central preference resolution improves compliance but adds latency.
  • Provider abstraction helps failover but can hide channel-specific errors.

Rubric

| Criterion | Weight | Evidence |
| --- | --- | --- |
| clarification: Separates product behavior from infrastructure assumptions before drawing boxes. | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| scale: Turns traffic and data assumptions into concrete sizing constraints. | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| architecture: Draws clear service, cache, queue, and storage boundaries with reasons for each split. | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| data: Defines durable state, indexes, keys, and idempotency records. | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| failure: Names failure modes and the recovery behavior users see. | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| observability: Defines the small set of metrics and traces needed to debug the design. | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| tradeoffs: Explains what is being sacrificed and why that sacrifice fits the prompt. | 15 | Compares at least two viable designs and names the losing design's advantage. |