ML system design
Design a model serving platform
Serve models with versioning, autoscaling, canaries, and rollback paths that operators can trust.
online inference, autoscaling, canary rollout, model registry
Prompt
Design a platform for serving multiple ML models behind online APIs. Product teams need versioned deployments, canary traffic, autoscaling, and metrics.
Clarifying questions
- Are models CPU, GPU, or mixed?
- Do callers require synchronous responses or can some requests be async?
- What are the latency and availability targets by model tier?
Functional requirements
- Register models and deploy versioned endpoints.
- Route traffic by model, version, tenant, and canary percentage.
- Expose latency, error, and prediction-quality metrics.
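Routing by canary percentage can be sketched as a deterministic hash split. This is a minimal illustration, not a prescribed router: the function name `pick_version`, the choice of SHA-256, and the 10,000-bucket resolution are all assumptions made for the example.

```python
import hashlib


def pick_version(request_key: str, stable: str, canary: str, canary_percent: float) -> str:
    """Deterministically map a request key to the stable or canary version."""
    # Hash the key so the same caller (or tenant) always lands in the same bucket.
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the first 8 bytes to a bucket in [0, 10000), i.e. 0.01% resolution.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # canary_percent is expressed in percent, so 5.0 covers buckets 0..499.
    return canary if bucket < canary_percent * 100 else stable
```

Keying on a stable identifier such as a tenant or user ID keeps each caller on one version for the whole rollout, which makes canary-vs-control comparisons cleaner than per-request random sampling.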
Nonfunctional requirements
- Autoscale without cold-start spikes for hot models.
- Roll back a bad model version within minutes.
- Isolate expensive models from ordinary API traffic.
Scale assumptions
- Hundreds of models, dozens of active high-QPS endpoints.
- Peak 20,000 inference requests per second.
- GPU-backed models have 30 to 90 second warmup times.
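These assumptions can be turned into rough sizing numbers. The sketch below is back-of-envelope arithmetic only: the per-replica throughput (250 RPS), the target utilization (0.7), and the traffic growth rate (50 RPS per second) are hypothetical inputs, not figures from the prompt.

```python
import math


def replicas_needed(peak_rps: float, rps_per_replica: float, target_utilization: float) -> int:
    """Replicas required to serve peak_rps while staying under target utilization."""
    return math.ceil(peak_rps / (rps_per_replica * target_utilization))


def warmup_headroom_rps(growth_rps_per_s: float, warmup_s: float) -> float:
    """Extra capacity needed to absorb traffic growth while new replicas warm up."""
    return growth_rps_per_s * warmup_s


# Aggregate peak from the assumptions above; per-replica numbers are hypothetical.
print(replicas_needed(20_000, 250, 0.7))
# Worst-case 90 second warmup from the assumptions above; growth rate is hypothetical.
print(warmup_headroom_rps(50, 90))
```

The second function is the reason warmup time matters for autoscaling: any growth that arrives during the 30 to 90 second warmup window must be served by capacity that is already warm.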
API sketch
- POST /v1/deployments { modelId, version, resources, rolloutPolicy }
- POST /v1/models/{modelId}:predict { instances } -> predictions.
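A request body for the deployment endpoint might look like the following. Only the four top-level field names come from the sketch above; the nested keys (`accelerator`, `minReplicas`, `maxReplicas`, `canaryPercent`, `autoRollback`), the example values, and the validator itself are assumptions for illustration.

```python
REQUIRED_FIELDS = ("modelId", "version", "resources", "rolloutPolicy")


def validate_deployment_request(body: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the body is acceptable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in body]
    # canaryPercent is a hypothetical field; default to 0 when it is absent.
    policy = body.get("rolloutPolicy") or {}
    percent = policy.get("canaryPercent", 0)
    if not 0 <= percent <= 100:
        errors.append("rolloutPolicy.canaryPercent must be between 0 and 100")
    return errors


example_request = {
    "modelId": "fraud-scorer",  # hypothetical model name
    "version": "12",
    "resources": {"accelerator": "gpu", "minReplicas": 2, "maxReplicas": 20},
    "rolloutPolicy": {"canaryPercent": 5, "autoRollback": True},
}
```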
Data model
- models(model_id, owner, task_type, approved_versions).
- deployments(deployment_id, model_id, version, resource_shape, rollout_state).
- prediction_logs(request_id, model_id, version, latency_ms, feature_hash, output_summary).
Architecture components
- Registry stores model artifacts, metadata, and approval state.
- Control plane creates serving deployments and traffic policies.
- Data plane routes requests to warm model replicas with batching where safe.
Bottlenecks
- GPU memory limits concurrent model replicas.
- Dynamic batching improves throughput but can hurt p99 latency.
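The batching tradeoff can be made concrete with a small simulation. This is a simplified sketch, assuming a single queue where a batch dispatches either when it is full or when its oldest request has waited `max_wait_ms`; the function names and parameters are hypothetical and do not model any particular serving framework.

```python
def form_batches(arrivals_ms: list[float], max_batch_size: int, max_wait_ms: float) -> list[list[float]]:
    """Group sorted arrival times into batches under a size cap and a wait cap."""
    batches: list[list[float]] = []
    current: list[float] = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived: flush it first.
        if current and t - current[0] > max_wait_ms:
            batches.append(current)
            current = []
        current.append(t)
        # A full batch dispatches immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


def worst_queue_delay_ms(batches: list[list[float]], max_batch_size: int, max_wait_ms: float) -> float:
    """Longest time any request waits in the queue before its batch dispatches."""
    worst = 0.0
    for batch in batches:
        # Full batches dispatch on their last arrival; partial ones at the timeout.
        dispatch = batch[-1] if len(batch) == max_batch_size else batch[0] + max_wait_ms
        worst = max(worst, dispatch - batch[0])
    return worst
```

Under light traffic most batches are partial, so the oldest request in each pays the full `max_wait_ms`. That added queueing delay is what shows up in p99 latency, even though larger batches raise throughput per GPU.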
Failure modes
- Bad canary metrics: the traffic router shifts all traffic back to the previous version.
- Replica cold start: keep a minimum warm pool for top models.
- Feature schema mismatch: reject deployment before traffic shift.
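The schema check that gates a deployment can be as simple as a dictionary comparison. The sketch below assumes schemas are available as feature-name-to-type mappings; that representation, the type strings, and the function name are illustrative assumptions.

```python
def schema_mismatches(training: dict[str, str], serving: dict[str, str]) -> list[str]:
    """Compare a training schema with a serving schema and list every difference."""
    problems = []
    for name, dtype in training.items():
        if name not in serving:
            problems.append(f"missing feature: {name}")
        elif serving[name] != dtype:
            problems.append(
                f"type mismatch for {name}: trained on {dtype}, serving sends {serving[name]}"
            )
    # Features the model never saw in training are also suspicious.
    for name in serving:
        if name not in training:
            problems.append(f"unexpected feature: {name}")
    return problems
```

A control plane would run a check like this before any traffic shift and reject the deployment when the returned list is non-empty.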
Observability
- Latency, error rate, saturation, queueing delay, batch size, cold-start count.
- Prediction drift and canary-vs-control metric deltas.
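Prediction drift can be tracked with many statistics. The sketch below uses the population stability index (PSI) over binned prediction scores, which is a choice made for this example and not one named in the design above. Inputs are assumed to be per-bin proportions that each sum to 1.

```python
import math


def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population stability index between two binned distributions of proportions."""
    if len(expected) != len(actual):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        # Clamp empty bins so the logarithm stays defined.
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

The same function can compare canary against control score distributions, or the current serving window against a training-time baseline. A value of zero means the binned distributions are identical, and larger values mean more shift.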
Security / privacy
- Restrict model artifact access by owner and environment.
- Scrub or summarize prediction logs when inputs can contain private data.
Cost considerations
- GPU idle time is the largest standing cost.
- Warm pools reduce latency at the cost of utilization.
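The standing cost of a warm pool can be estimated with simple arithmetic. The replica count, utilization, and hourly price below are hypothetical inputs chosen only to show the calculation.

```python
def monthly_idle_cost(warm_replicas: int, utilization: float, hourly_price: float, hours: float = 720) -> float:
    """Cost of the unused fraction of a warm pool over a month of roughly 720 hours."""
    idle_fraction = 1.0 - utilization
    return warm_replicas * idle_fraction * hourly_price * hours


# 4 warm GPU replicas at 25% utilization and a hypothetical 2.00 per GPU-hour.
print(monthly_idle_cost(4, 0.25, 2.0))
```

Comparing this figure with the cost of cold-start latency violations for a given model tier is one way to decide which models deserve a warm pool at all.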
Tradeoffs
- Shared serving clusters improve utilization but make noisy-neighbor isolation harder.
- Per-model deployments isolate risk but increase operations overhead.
ML-specific concerns
- Training/serving skew: deployed feature schemas must match training schemas.
- Canary policy needs offline eval gates and online guardrail metrics.
- Model lineage connects artifact, dataset, feature code, and serving image.
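An online guardrail for the canary policy can be expressed as a small decision function. This is a minimal sketch: the `WindowStats` type, the three outcomes, and every threshold (minimum sample size, error-rate delta, latency ratio) are assumptions for illustration and would be tuned per model tier.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Metrics for one version over one evaluation window."""
    requests: int
    errors: int
    p99_latency_ms: float


def canary_decision(
    control: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,            # hypothetical minimum sample size
    max_error_rate_delta: float = 0.005,  # hypothetical absolute error-rate budget
    max_latency_ratio: float = 1.2,       # hypothetical p99 regression budget
) -> str:
    """Return 'hold', 'rollback', or 'promote' for the current window."""
    # Not enough canary traffic yet to judge either way.
    if canary.requests < min_requests:
        return "hold"
    control_error_rate = control.errors / max(control.requests, 1)
    canary_error_rate = canary.errors / canary.requests
    if canary_error_rate - control_error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > control.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

Offline eval gates run before this function is ever called. The online check only decides whether a version that already passed offline evaluation keeps or loses its traffic share.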
Rubric
| Criterion | Tag | Weight | Evidence |
|---|---|---|---|
| Separates product behavior from infrastructure assumptions before drawing boxes. | clarification | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Turns traffic and data assumptions into concrete sizing constraints. | scale | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Draws clear service, cache, queue, and storage boundaries with reasons for each split. | architecture | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Defines durable state, indexes, keys, and idempotency records. | data | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Names failure modes and the recovery behavior users see. | failure | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Defines the small set of metrics and traces needed to debug the design. | observability | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Explains what is being sacrificed and why that sacrifice fits the prompt. | tradeoffs | 15 | Compares at least two viable designs and names the losing design's advantage. |
| Covers the model, data, evaluation, deployment, and monitoring loop as one system. | ml-specific | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling. |