ML system design
Design a distributed training pipeline
Coordinate data, checkpoints, accelerators, and recovery for large training jobs.
distributed trainingcheckpointingdata loadingcluster scheduling
Prompt
Design a distributed training pipeline for large models. The platform should schedule jobs, feed data, checkpoint progress, and recover from worker failures.
Clarifying questions
- Are jobs data-parallel, model-parallel, or mixed?
- What is the expected checkpoint size and frequency?
- Are training datasets immutable snapshots or continuously updated?
Functional requirements
- Submit versioned training jobs with data and code references.
- Allocate compute, launch workers, and monitor job progress.
- Checkpoint and resume failed jobs.
Nonfunctional requirements
- Minimize wasted accelerator time after failures.
- Make completed models reproducible by input snapshot and code version.
- Keep data loading from starving accelerators.
Scale assumptions
- Jobs range from one GPU to hundreds of GPUs.
- Datasets range from terabytes to petabytes.
- Checkpoints can be tens or hundreds of gigabytes.
API sketch
- POST /v1/training-jobs { codeRef, dataSnapshot, resources, checkpointPolicy }
- GET /v1/training-jobs/{jobId} -> state, metrics, checkpoint refs.
Data model
- training_jobs(job_id, code_ref, data_snapshot, resource_shape, status).
- checkpoints(job_id, step, object_ref, metrics, created_at).
- workers(job_id, rank, host, state, last_heartbeat).
Architecture components
- Control plane validates jobs and reserves accelerator pools.
- Launcher starts workers with dataset snapshot and checkpoint policy.
- Coordinator records heartbeats, metrics, checkpoints, and recovery state.
Bottlenecks
- Data loader throughput can underutilize expensive accelerators.
- Checkpoint writes can saturate object storage or network links.
Failure modes
- Worker failure: coordinator restarts from last healthy checkpoint.
- Object storage slowdown: lengthen checkpoint interval and alert on risk.
- Bad data shard: quarantine shard and mark job as needing review.
Observability
- Accelerator utilization, data-loader wait time, step time, checkpoint duration.
- Failure count by host, job, and data shard.
Security / privacy
- Restrict dataset snapshots by project and approved training purpose.
- Keep secrets out of job specs and worker logs.
Cost considerations
- Idle accelerator time is the main cost leak.
- Frequent checkpoints reduce lost work but increase storage and network cost.
Tradeoffs
- Large checkpoints recover more state but slow the training loop.
- Spot capacity cuts cost but increases recovery complexity.
ML-specific concerns
- Training lineage must include data snapshot, code, hyperparameters, and checkpoint path.
- Offline evaluation should run automatically before registry promotion.
- Serving constraints should feed back into training artifact format.
Rubric
| Criterion | Weight | Evidence |
|---|---|---|
Separates product behavior from infrastructure assumptions before drawing boxes. clarification | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
Turns traffic and data assumptions into concrete sizing constraints. scale | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
Draws clear service, cache, queue, and storage boundaries with reasons for each split. architecture | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
Defines durable state, indexes, keys, and idempotency records. data | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
Names failure modes and the recovery behavior users see. failure | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
Defines the small set of metrics and traces needed to debug the design. observability | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
Explains what is being sacrificed and why that sacrifice fits the prompt. tradeoffs | 15 | Compares at least two viable designs and names the losing design's advantage. |
Covers the model, data, evaluation, deployment, and monitoring loop as one system. ml-specific | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling. |