classic system design
Design a durable job scheduler
Run delayed and recurring work with retries, backoff, leases, and a clean story for duplicate execution.
leases, retry queues, cron semantics, idempotency
Prompt
Design a scheduler that can run delayed jobs, cron jobs, and retryable background work for a multi-tenant SaaS product.
Clarifying questions
- Do users expect exactly-once side effects or exactly-once scheduling attempts?
- What is the acceptable clock skew between regions?
- Are jobs CPU-heavy, IO-heavy, or calls to external services?
Functional requirements
- Create one-time, delayed, and recurring jobs.
- Dispatch due jobs to workers with retry and backoff.
- Expose job status, attempts, next run, and failure reason.
Nonfunctional requirements
- Dispatch due jobs within 30 seconds of target time under normal load.
- Avoid unbounded retry storms when an external dependency fails.
- Make every worker handoff idempotent.
Scale assumptions
- 10 million scheduled jobs stored.
- 100,000 jobs due per minute during peak batch windows.
- Recurring jobs range from every minute to monthly.
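The scale assumptions above can be turned into rough sizing numbers. The sketch below only does the arithmetic; the row size and per-worker throughput are assumed values for illustration, not figures from the prompt.

```python
import math

# Figures taken from the scale assumptions and the dispatch target.
STORED_JOBS = 10_000_000
PEAK_DUE_PER_MINUTE = 100_000
DISPATCH_TARGET_SECONDS = 30

# Assumptions for illustration only; measure these for a real system.
ASSUMED_ROW_BYTES = 500
ASSUMED_JOBS_PER_WORKER_PER_SECOND = 20

peak_due_per_second = PEAK_DUE_PER_MINUTE / 60
jobs_table_gb = STORED_JOBS * ASSUMED_ROW_BYTES / 1e9
workers_at_peak = math.ceil(
    peak_due_per_second / ASSUMED_JOBS_PER_WORKER_PER_SECOND
)
# Jobs that become due during one dispatch-target window at peak rate.
due_per_target_window = peak_due_per_second * DISPATCH_TARGET_SECONDS

print(f"peak dispatch rate: {peak_due_per_second:.0f} jobs/s")
print(f"jobs table size:    {jobs_table_gb:.1f} GB before indexes")
print(f"workers at peak:    {workers_at_peak}")
print(f"due per 30 s window: {due_per_target_window:.0f}")
```

Under these assumptions the peak rate is about 1,667 jobs per second, the jobs table is about 5 GB before indexes, and roughly 84 workers keep up at peak. The storage number is small; the dispatch rate during batch windows is the constraint that shapes the design.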
API sketch
- POST /v1/jobs { runAt, cron, payloadRef, idempotencyKey } -> { jobId }
- POST /v1/jobs/{jobId}/cancel -> { status }
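A minimal in-memory sketch of the two endpoints above, to show how `idempotencyKey` makes a client retry safe. The `JobApi` class, the status names, the rule that `runAt` or `cron` must be present, and passing `tenant_id` from the auth context are all assumptions for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class JobApi:
    """In-memory stand-in for POST /v1/jobs and POST /v1/jobs/{jobId}/cancel."""

    jobs: Dict[str, dict] = field(default_factory=dict)
    by_key: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def create(
        self,
        tenant_id: str,
        payload_ref: str,
        idempotency_key: str,
        run_at: Optional[float] = None,
        cron: Optional[str] = None,
    ) -> dict:
        # Assumption: a job needs a runAt, a cron expression, or both.
        if run_at is None and cron is None:
            raise ValueError("runAt or cron is required")
        key = (tenant_id, idempotency_key)
        if key in self.by_key:
            # Client retry: return the original job instead of creating a twin.
            return {"jobId": self.by_key[key]}
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "tenant_id": tenant_id,
            "run_at": run_at,
            "cron": cron,
            "payload_ref": payload_ref,
            "status": "scheduled",
        }
        self.by_key[key] = job_id
        return {"jobId": job_id}

    def cancel(self, tenant_id: str, job_id: str) -> dict:
        job = self.jobs.get(job_id)
        # Treat another tenant's job as not found to preserve tenant isolation.
        if job is None or job["tenant_id"] != tenant_id:
            raise KeyError(job_id)
        if job["status"] == "scheduled":
            job["status"] = "cancelled"
        return {"status": job["status"]}
```

Scoping the key by tenant means two tenants can reuse the same key string without colliding.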
Data model
- jobs(id, tenant_id, run_at, cron_expr, status, attempts, payload_ref, idempotency_key).
- job_attempts(job_id, attempt, lease_until, worker_id, started_at, finished_at).
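One way to write the two tables down, using SQLite through Python's `sqlite3` module so the sketch runs anywhere. The column types, the status value `'scheduled'`, the index name `jobs_due`, and the helper names are assumptions; a production system would likely use a server database with the same shape.

```python
import sqlite3

SCHEMA = """
CREATE TABLE jobs (
    id              TEXT PRIMARY KEY,
    tenant_id       TEXT NOT NULL,
    run_at          INTEGER NOT NULL,          -- epoch seconds, UTC
    cron_expr       TEXT,                      -- NULL for one-time jobs
    status          TEXT NOT NULL DEFAULT 'scheduled',
    attempts        INTEGER NOT NULL DEFAULT 0,
    payload_ref     TEXT NOT NULL,
    idempotency_key TEXT NOT NULL,
    UNIQUE (tenant_id, idempotency_key)
);
-- Partial index keeps the due scan small: only rows still waiting to run.
CREATE INDEX jobs_due ON jobs (run_at) WHERE status = 'scheduled';

CREATE TABLE job_attempts (
    job_id      TEXT NOT NULL REFERENCES jobs (id),
    attempt     INTEGER NOT NULL,
    lease_until INTEGER NOT NULL,
    worker_id   TEXT NOT NULL,
    started_at  INTEGER NOT NULL,
    finished_at INTEGER,                       -- NULL while the attempt runs
    PRIMARY KEY (job_id, attempt)
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def due_job_ids(conn: sqlite3.Connection, now: int, limit: int = 100) -> list:
    """Return ids of scheduled jobs whose run_at has passed, oldest first."""
    rows = conn.execute(
        "SELECT id FROM jobs "
        "WHERE status = 'scheduled' AND run_at <= ? "
        "ORDER BY run_at LIMIT ?",
        (now, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

The `LIMIT` bounds each scan batch, which matters when a batch window makes many rows due at once.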
Architecture components
- A scheduler scanner moves due rows into partitioned ready queues.
- Workers lease queue items, execute, then mark completion or schedule retry.
- A reaper returns expired leases to the ready queue.
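The lease and reaper interaction above can be sketched with an in-memory queue. `LeaseQueue` and its method names are hypothetical, and time is passed in explicitly so the behavior is easy to follow; a real system would keep this state in the database or broker.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class LeaseQueue:
    lease_seconds: float
    ready: Deque[str] = field(default_factory=deque)
    # job_id -> (worker_id, lease_until)
    leased: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def enqueue(self, job_id: str) -> None:
        self.ready.append(job_id)

    def lease(self, worker_id: str, now: float) -> Optional[str]:
        """Hand the next ready job to a worker for lease_seconds."""
        if not self.ready:
            return None
        job_id = self.ready.popleft()
        self.leased[job_id] = (worker_id, now + self.lease_seconds)
        return job_id

    def complete(self, job_id: str, worker_id: str, now: float) -> bool:
        """Accept completion only from the current, unexpired lease holder."""
        holder = self.leased.get(job_id)
        if holder is None or holder[0] != worker_id or holder[1] <= now:
            return False
        del self.leased[job_id]
        return True

    def reap(self, now: float) -> List[str]:
        """Return jobs with expired leases to the ready queue."""
        expired = [j for j, (_, until) in self.leased.items() if until <= now]
        for job_id in expired:
            del self.leased[job_id]
            self.ready.append(job_id)
        return expired
```

A worker that stalls past its lease gets `False` from `complete`, but its side effects may already have happened, which is why the handler still has to tolerate duplicate execution.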
Bottlenecks
- Scanning due jobs by timestamp can overload the primary database.
- Large cron fanout can create synchronized spikes.
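One common mitigation for synchronized cron spikes is a stable per-job jitter, so jobs that share a cron expression spread across a window instead of firing in the same second. This is a sketch of that idea, not something the prompt prescribes; `jitter_offset_seconds` and the window size are assumptions.

```python
import hashlib


def jitter_offset_seconds(job_id: str, window_seconds: int) -> int:
    """Map a job id to a stable offset in [0, window_seconds)."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Hashing the id gives the same offset on every run, so a job keeps
    # its slot across scheduler restarts.
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


# Usage: shift each job's nominal fire time by its offset.
nominal_run_at = 1_700_000_000  # hypothetical epoch second for a cron tick
effective_run_at = nominal_run_at + jitter_offset_seconds("job-123", 20)
```

The window has to fit inside the dispatch latency target, since jitter spends part of the 30-second budget on purpose.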
Failure modes
- Worker crash: lease expires and the job is retried.
- Scheduler crash: another scheduler instance resumes partition ownership.
- External API outage: the retry policy reschedules failed jobs with exponential backoff.
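The backoff behavior can be sketched as capped exponential backoff with full jitter plus an attempt limit, which also addresses the retry-storm requirement. The base delay, cap, attempt limit, and function names are assumed values for illustration.

```python
import random


def next_retry_delay(
    attempt: int,
    base_seconds: float = 5.0,
    cap_seconds: float = 3600.0,
    rng=random,
) -> float:
    """Delay before the next try, given the number of the attempt that failed.

    The ceiling doubles per attempt up to cap_seconds; the actual delay is
    drawn uniformly below the ceiling so retries do not line up.
    """
    if attempt < 1:
        raise ValueError("attempt starts at 1")
    # Bound the exponent so very high attempt counts cannot overflow.
    exponent = min(attempt - 1, 30)
    ceiling = min(cap_seconds, base_seconds * (2 ** exponent))
    return rng.uniform(0.0, ceiling)


def should_retry(attempt: int, max_attempts: int = 8) -> bool:
    """Stop retrying once the attempt budget is spent."""
    return attempt < max_attempts
```

Jobs that exhaust the budget should land in a terminal failed state with the failure reason exposed, rather than retrying forever.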
Observability
- Due-lag p95, queue depth, retry rate, lease expiration count.
- Attempts by failure class and tenant tier.
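Due-lag p95 is the metric that maps most directly to the 30-second dispatch target. A minimal sketch of computing it from raw samples, using the nearest-rank percentile definition; the function names are hypothetical, and a real deployment would more likely use histogram buckets in its metrics system.

```python
import math
from typing import Sequence


def due_lag_seconds(run_at: float, dispatched_at: float) -> float:
    """How late a job was handed to a worker; early dispatch counts as zero."""
    return max(0.0, dispatched_at - run_at)


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Usage: (run_at, dispatched_at) pairs collected over one reporting window.
samples = [(100.0, 101.0), (100.0, 104.5), (160.0, 162.0), (160.0, 201.0)]
lags = [due_lag_seconds(run_at, dispatched_at) for run_at, dispatched_at in samples]
breaches_target = percentile(lags, 95) > 30
```

Slicing the same metric by tenant tier shows whether one noisy tenant is consuming the latency budget for everyone.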
Security / privacy
- Store payloads by reference with scoped access, not as raw secrets in queue messages.
- Enforce tenant isolation on job listing and cancellation.
Cost considerations
- Partition count trades off scan parallelism against coordination overhead.
- Long payload retention increases storage and privacy review costs.
Tradeoffs
- Database-backed scheduling is simpler to reason about; a broker-first design can absorb spikes better.
- At-least-once execution is realistic; exactly-once side effects belong in the job handler.
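What "exactly-once side effects belong in the job handler" can look like in practice: the handler records a result per idempotency key and skips the side effect on a repeat delivery. This is a single-process sketch with a hypothetical `IdempotentHandler` class; in a real system the record must live in durable storage and be written atomically with the side effect, or the downstream service must accept the key itself.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Wrap a side-effecting function so repeated deliveries run it once."""

    def __init__(self, side_effect: Callable[[Any], Any]) -> None:
        self._side_effect = side_effect
        self._results: Dict[str, Any] = {}

    def handle(self, idempotency_key: str, payload: Any) -> Any:
        if idempotency_key in self._results:
            # Duplicate delivery: return the recorded result, do no work.
            return self._results[idempotency_key]
        result = self._side_effect(payload)
        self._results[idempotency_key] = result
        return result


# Usage: the scheduler delivers the same job twice after a lease expiry.
sent = []
handler = IdempotentHandler(lambda payload: sent.append(payload) or len(sent))
first = handler.handle("job-42:attempt-key", "welcome-email")
second = handler.handle("job-42:attempt-key", "welcome-email")
```

After both calls `sent` holds one entry, so the scheduler can stay at-least-once without the user seeing two emails.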
Rubric
| Criterion | Weight | Evidence |
|---|---|---|
| Clarification: separates product behavior from infrastructure assumptions before drawing boxes. | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Scale: turns traffic and data assumptions into concrete sizing constraints. | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Architecture: draws clear service, cache, queue, and storage boundaries with reasons for each split. | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Data: defines durable state, indexes, keys, and idempotency records. | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Failure: names failure modes and the recovery behavior users see. | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Observability: defines the small set of metrics and traces needed to debug the design. | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Tradeoffs: explains what is being sacrificed and why that sacrifice fits the prompt. | 15 | Compares at least two viable designs and names the losing design's advantage. |