ML system design

Design a distributed training pipeline

Coordinate data, checkpoints, accelerators, and recovery for large training jobs.

distributed trainingcheckpointingdata loadingcluster scheduling

Prompt

Design a distributed training pipeline for large models. The platform should schedule jobs, feed data, checkpoint progress, and recover from worker failures.

Clarifying questions

  • Are jobs data-parallel, model-parallel, or mixed?
  • What is the expected checkpoint size and frequency?
  • Are training datasets immutable snapshots or continuously updated?

Functional requirements

  • Submit versioned training jobs with data and code references.
  • Allocate compute, launch workers, and monitor job progress.
  • Checkpoint and resume failed jobs.

Nonfunctional requirements

  • Minimize wasted accelerator time after failures.
  • Make completed models reproducible by input snapshot and code version.
  • Keep data loading from starving accelerators.

Scale assumptions

  • Jobs range from one GPU to hundreds of GPUs.
  • Datasets range from terabytes to petabytes.
  • Checkpoints can be tens or hundreds of gigabytes.

API sketch

  • POST /v1/training-jobs { codeRef, dataSnapshot, resources, checkpointPolicy }
  • GET /v1/training-jobs/{jobId} -> state, metrics, checkpoint refs.

Data model

  • training_jobs(job_id, code_ref, data_snapshot, resource_shape, status).
  • checkpoints(job_id, step, object_ref, metrics, created_at).
  • workers(job_id, rank, host, state, last_heartbeat).

Architecture components

  • Control plane validates jobs and reserves accelerator pools.
  • Launcher starts workers with dataset snapshot and checkpoint policy.
  • Coordinator records heartbeats, metrics, checkpoints, and recovery state.

Bottlenecks

  • Data loader throughput can underutilize expensive accelerators.
  • Checkpoint writes can saturate object storage or network links.

Failure modes

  • Worker failure: coordinator restarts from last healthy checkpoint.
  • Object storage slowdown: lengthen checkpoint interval and alert on risk.
  • Bad data shard: quarantine shard and mark job as needing review.

Observability

  • Accelerator utilization, data-loader wait time, step time, checkpoint duration.
  • Failure count by host, job, and data shard.

Security / privacy

  • Restrict dataset snapshots by project and approved training purpose.
  • Keep secrets out of job specs and worker logs.

Cost considerations

  • Idle accelerator time is the main cost leak.
  • Frequent checkpoints reduce lost work but increase storage and network cost.

Tradeoffs

  • Large checkpoints recover more state but slow the training loop.
  • Spot capacity cuts cost but increases recovery complexity.

ML-specific concerns

  • Training lineage must include data snapshot, code, hyperparameters, and checkpoint path.
  • Offline evaluation should run automatically before registry promotion.
  • Serving constraints should feed back into training artifact format.

Rubric

CriterionWeightEvidence
Separates product behavior from infrastructure assumptions before drawing boxes.
clarification
10The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
Turns traffic and data assumptions into concrete sizing constraints.
scale
15Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
Draws clear service, cache, queue, and storage boundaries with reasons for each split.
architecture
20The component diagram has one owner per responsibility and names the synchronous path.
Defines durable state, indexes, keys, and idempotency records.
data
15Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
Names failure modes and the recovery behavior users see.
failure
15Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
Defines the small set of metrics and traces needed to debug the design.
observability
10Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
Explains what is being sacrificed and why that sacrifice fits the prompt.
tradeoffs
15Compares at least two viable designs and names the losing design's advantage.
Covers the model, data, evaluation, deployment, and monitoring loop as one system.
ml-specific
20The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.