classic system design

Design a durable job scheduler

Run delayed and recurring work with retries, backoff, leases, and a clean story for duplicate execution.

leases, retry queues, cron semantics, idempotency

Prompt

Design a scheduler that can run delayed jobs, cron jobs, and retryable background work for a multi-tenant SaaS product.

Clarifying questions

  • Do users expect exactly-once side effects or exactly-once scheduling attempts?
  • What is the acceptable clock skew between regions?
  • Are jobs CPU-heavy, IO-heavy, or calls to external services?

Functional requirements

  • Create one-time, delayed, and recurring jobs.
  • Dispatch due jobs to workers with retry and backoff.
  • Expose job status, attempts, next run, and failure reason.

Nonfunctional requirements

  • Dispatch due jobs within 30 seconds of their target time under normal load.
  • Avoid unbounded retry storms when an external dependency fails.
  • Make every worker handoff idempotent.

Scale assumptions

  • 10 million scheduled jobs stored.
  • 100,000 jobs due per minute during peak batch windows.
  • Recurring jobs range from every minute to monthly.
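
The peak figure converts to roughly 1,667 dispatches per second. A rough sizing sketch using Little's law follows; the 2-second average job duration and the 70% target utilization are assumptions for illustration, not numbers from the prompt.

```python
import math

PEAK_DUE_PER_MINUTE = 100_000  # from the scale assumptions


def peak_dispatch_rate(due_per_minute: int = PEAK_DUE_PER_MINUTE) -> float:
    """Convert the per-minute peak into dispatches per second."""
    return due_per_minute / 60


def workers_needed(jobs_per_second: float,
                   avg_job_seconds: float,
                   target_utilization: float = 0.7) -> int:
    """Little's law: concurrent jobs = arrival rate x service time.

    Dividing by the target utilization leaves headroom for retries
    and uneven partitions. Both defaults here are assumed values.
    """
    return math.ceil(jobs_per_second * avg_job_seconds / target_utilization)


if __name__ == "__main__":
    rate = peak_dispatch_rate()
    print(f"peak rate: {rate:.0f} jobs/s")
    print(f"worker slots: {workers_needed(rate, avg_job_seconds=2.0)}")
```

Under those assumptions the peak window needs a few thousand concurrent worker slots, which is why the interview answer should ask whether jobs are CPU-heavy or mostly waiting on IO.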

API sketch

  • POST /v1/jobs { runAt, cron, payloadRef, idempotencyKey } -> { jobId }
  • POST /v1/jobs/{jobId}/cancel -> { status }
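
A minimal in-memory sketch of the two endpoints, assuming the idempotency key is scoped per tenant and that jobs move between hypothetical `scheduled` and `cancelled` states. The class and method names are illustrative and not part of the API above.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Job:
    job_id: str
    tenant_id: str
    run_at: Optional[float]
    cron: Optional[str]
    payload_ref: str
    status: str = "scheduled"  # assumed initial state


class JobApi:
    """In-memory stand-in for POST /v1/jobs and POST /v1/jobs/{jobId}/cancel."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}
        self._by_key: Dict[Tuple[str, str], str] = {}

    def create_job(self, tenant_id: str, payload_ref: str, idempotency_key: str,
                   run_at: Optional[float] = None,
                   cron: Optional[str] = None) -> dict:
        if run_at is None and cron is None:
            raise ValueError("either run_at or cron is required")
        key = (tenant_id, idempotency_key)
        if key in self._by_key:
            # A retried request returns the original job instead of a duplicate.
            return {"jobId": self._by_key[key]}
        job = Job(str(uuid.uuid4()), tenant_id, run_at, cron, payload_ref)
        self._jobs[job.job_id] = job
        self._by_key[key] = job.job_id
        return {"jobId": job.job_id}

    def cancel_job(self, tenant_id: str, job_id: str) -> dict:
        job = self._jobs.get(job_id)
        # Missing jobs and other tenants' jobs fail the same way,
        # so the response does not leak which job ids exist.
        if job is None or job.tenant_id != tenant_id:
            raise KeyError(job_id)
        if job.status == "scheduled":
            job.status = "cancelled"
        return {"status": job.status}
```

Calling `create_job` twice with the same tenant and key returns the same `jobId`, which lets clients retry a timed-out request safely.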

Data model

  • jobs(id, tenant_id, run_at, cron_expr, status, attempts, payload_ref, idempotency_key).
  • job_attempts(job_id, attempt, lease_until, worker_id, started_at, finished_at).
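
One way to make the two tables concrete, sketched with SQLite for brevity. The column types, the `(status, run_at)` index, the unique constraint on `(tenant_id, idempotency_key)`, and the `scheduled` status value are assumptions added for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    id              TEXT PRIMARY KEY,
    tenant_id       TEXT NOT NULL,
    run_at          REAL,
    cron_expr       TEXT,
    status          TEXT NOT NULL DEFAULT 'scheduled',
    attempts        INTEGER NOT NULL DEFAULT 0,
    payload_ref     TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    UNIQUE (tenant_id, idempotency_key)
);
-- Lookup path for the scanner: due rows by status and time.
CREATE INDEX jobs_due ON jobs (status, run_at);

CREATE TABLE job_attempts (
    job_id      TEXT NOT NULL REFERENCES jobs(id),
    attempt     INTEGER NOT NULL,
    lease_until REAL,
    worker_id   TEXT,
    started_at  REAL,
    finished_at REAL,
    PRIMARY KEY (job_id, attempt)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def due_jobs(conn: sqlite3.Connection, now: float, limit: int = 100) -> list:
    """Return ids of scheduled jobs whose run_at has passed, oldest first."""
    rows = conn.execute(
        "SELECT id FROM jobs WHERE status = 'scheduled' AND run_at <= ? "
        "ORDER BY run_at LIMIT ?",
        (now, limit),
    )
    return [row[0] for row in rows]
```

The unique constraint lets the database reject a duplicate create request even when two API servers race on the same key.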

Architecture components

  • A scheduler scanner moves due rows into partitioned ready queues.
  • Workers lease queue items, execute, then mark completion or schedule retry.
  • A reaper returns expired leases to the ready queue.
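
The lease and reaper interaction can be sketched with an in-memory queue. This is a single-process illustration with a hypothetical `LeaseQueue` class and an assumed 30-second lease; a real deployment would keep this state in the database or broker.

```python
import collections
from typing import Deque, Dict, List, Optional, Tuple


class LeaseQueue:
    """Ready queue where each leased item has an expiry time."""

    def __init__(self, lease_seconds: float = 30.0) -> None:
        self.lease_seconds = lease_seconds
        self.ready: Deque[str] = collections.deque()
        self.leases: Dict[str, Tuple[str, float]] = {}  # job_id -> (worker, until)

    def enqueue(self, job_id: str) -> None:
        self.ready.append(job_id)

    def lease(self, worker_id: str, now: float) -> Optional[str]:
        """Hand the next ready job to a worker, or None if the queue is empty."""
        if not self.ready:
            return None
        job_id = self.ready.popleft()
        self.leases[job_id] = (worker_id, now + self.lease_seconds)
        return job_id

    def complete(self, job_id: str, worker_id: str, now: float) -> bool:
        """Accept completion only from the current, unexpired lease holder."""
        holder = self.leases.get(job_id)
        if holder is None or holder[0] != worker_id or holder[1] < now:
            return False  # lease was lost; another worker may own the job
        del self.leases[job_id]
        return True

    def reap(self, now: float) -> List[str]:
        """Return expired leases to the ready queue."""
        expired = [j for j, (_, until) in self.leases.items() if until < now]
        for job_id in expired:
            del self.leases[job_id]
            self.ready.append(job_id)
        return expired
```

A slow worker whose lease was reaped gets `False` from `complete`, but its side effects may already have happened, which is why the handler itself still has to be idempotent.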

Bottlenecks

  • Scanning due jobs by timestamp can overload the primary database.
  • Large cron fanout can create synchronized spikes.
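
One common mitigation for both bottlenecks is to derive a stable hash from the job id, then use it to pick a scan partition and a small per-job delay. The sketch below assumes a 20-second jitter window and 16 partitions; both values are illustrative, and the window has to fit inside the dispatch latency target.

```python
import hashlib


def _stable_hash(job_id: str) -> int:
    """Hash that is identical across processes and restarts (unlike hash())."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def partition_for(job_id: str, partition_count: int = 16) -> int:
    """Assign a job to one scan partition so scanners do not overlap."""
    return _stable_hash(job_id) % partition_count


def jittered_run_at(job_id: str, nominal_run_at: float,
                    window_seconds: int = 20) -> float:
    """Spread jobs that share a cron tick across a fixed window.

    The offset is deterministic per job, so a given job runs at the
    same offset on every tick instead of drifting randomly.
    """
    return nominal_run_at + _stable_hash(job_id) % window_seconds
```

With this scheme a tenant that schedules thousands of jobs at the top of the hour sees them dispatched across the window rather than in a single burst.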

Failure modes

  • Worker crash: lease expires and the job is retried.
  • Scheduler crash: another scheduler instance resumes partition ownership.
  • External API outage: the retry policy reschedules failed jobs with exponential backoff.
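
A minimal sketch of a capped exponential backoff with randomized delay (often called full jitter), plus an attempt limit that stops retry storms. The 2-second base, 15-minute cap, 8-attempt limit, and the `dead_letter` outcome are assumptions for illustration.

```python
import random
from typing import Optional, Tuple


def backoff_delay(attempt: int, base_seconds: float = 2.0,
                  cap_seconds: float = 900.0, rng=random) -> float:
    """Pick a random delay in [0, min(cap, base * 2^(attempt-1))]."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap_seconds, base_seconds * 2 ** min(attempt - 1, 32))
    return rng.uniform(0.0, ceiling)


def plan_retry(attempt: int, now: float, max_attempts: int = 8,
               rng=random) -> Tuple[str, Optional[float]]:
    """Decide whether a failed attempt is retried or parked for inspection."""
    if attempt >= max_attempts:
        return ("dead_letter", None)
    return ("retry", now + backoff_delay(attempt, rng=rng))
```

Randomizing the delay keeps jobs that failed together from retrying together, and the attempt limit bounds the total load a single failing dependency can generate.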

Observability

  • Due-lag p95, queue depth, retry rate, lease expiration count.
  • Attempts by failure class and tenant tier.
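
Due-lag is the gap between a job's target time and its actual dispatch time. A small sketch of the p95 calculation, assuming dispatch records arrive as hypothetical `(run_at, dispatched_at)` pairs and using the nearest-rank percentile method:

```python
import math
from typing import Iterable, List, Sequence, Tuple


def due_lags(dispatches: Iterable[Tuple[float, float]]) -> List[float]:
    """Seconds between target time and dispatch time, floored at zero."""
    return [max(0.0, dispatched_at - run_at)
            for run_at, dispatched_at in dispatches]


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value covering pct% of the samples."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

An alert on this value against the 30-second dispatch target ties the metric directly to the user-visible promise.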

Security / privacy

  • Store payloads by reference with scoped access, not as raw secrets in queue messages.
  • Enforce tenant isolation on job listing and cancellation.

Cost considerations

  • Partition count trades off scan parallelism against coordination overhead.
  • Long payload retention increases storage and privacy review costs.

Tradeoffs

  • Database-backed scheduling is simpler to reason about; a broker-first design can absorb spikes better.
  • At-least-once execution is realistic; exactly-once side effects belong in the job handler.
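
What "exactly-once side effects belong in the job handler" can look like in practice: the handler records each idempotency key it has processed and returns the stored result on a repeat delivery. This is a single-process sketch with a hypothetical `IdempotentHandler` class; a production version would need the record and the side effect to commit atomically, or a downstream API that accepts the key itself.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Wrap a side-effecting function so repeat deliveries are no-ops."""

    def __init__(self, side_effect: Callable[[Any], Any]) -> None:
        self._side_effect = side_effect
        self._results: Dict[str, Any] = {}  # idempotency key -> stored result

    def handle(self, idempotency_key: str, payload: Any) -> Any:
        if idempotency_key in self._results:
            # Duplicate delivery: skip the side effect, return the first result.
            return self._results[idempotency_key]
        result = self._side_effect(payload)
        self._results[idempotency_key] = result
        return result


if __name__ == "__main__":
    sent = []
    handler = IdempotentHandler(lambda payload: sent.append(payload) or len(sent))
    handler.handle("invoice-42", "send receipt")
    handler.handle("invoice-42", "send receipt")  # redelivered after a lease expiry
    print(sent)
```

The scheduler can then stay at-least-once, which keeps leases and retries simple, while the user still sees each side effect once.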

Rubric

  • clarification (weight 10)
    Criterion: Separates product behavior from infrastructure assumptions before drawing boxes.
    Evidence: The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
  • scale (weight 15)
    Criterion: Turns traffic and data assumptions into concrete sizing constraints.
    Evidence: Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
  • architecture (weight 20)
    Criterion: Draws clear service, cache, queue, and storage boundaries with reasons for each split.
    Evidence: The component diagram has one owner per responsibility and names the synchronous path.
  • data (weight 15)
    Criterion: Defines durable state, indexes, keys, and idempotency records.
    Evidence: Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
  • failure (weight 15)
    Criterion: Names failure modes and the recovery behavior users see.
    Evidence: Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
  • observability (weight 10)
    Criterion: Defines the small set of metrics and traces needed to debug the design.
    Evidence: Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
  • tradeoffs (weight 15)
    Criterion: Explains what is being sacrificed and why that sacrifice fits the prompt.
    Evidence: Compares at least two viable designs and names the losing design's advantage.