ML system design

Design a model serving platform

Serve models with versioning, autoscaling, canaries, and rollback paths that operators can trust.

online inference, autoscaling, canary rollout, model registry

Prompt

Design a platform for serving multiple ML models behind online APIs. Product teams need versioned deployments, canary traffic, autoscaling, and metrics.

Clarifying questions

Are models CPU, GPU, or mixed?
Do callers require synchronous responses or can some requests be async?
What are the latency and availability targets by model tier?

Functional requirements

Register models and deploy versioned endpoints.
Route traffic by model, version, tenant, and canary percentage.
Expose latency, error, and prediction-quality metrics.
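
One common way to implement percentage-based canary routing is to hash a sticky request key into a fixed number of buckets, so the same caller keeps hitting the same version while the rollout percentage is unchanged. The sketch below assumes that approach; the function name `pick_version`, the choice of SHA-256, and the 10,000-bucket granularity are illustrative assumptions, not part of the prompt.

```python
import hashlib


def pick_version(request_key: str, stable: str, canary: str, canary_percent: float) -> str:
    """Return the version that should serve this request.

    request_key is a sticky identifier (tenant ID, user ID, or session ID);
    which key to use is a design choice and is assumed here.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets so percentages down to 0.01 work.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return canary if bucket < canary_percent * 100 else stable
```

Hashing instead of drawing a random number per request keeps a caller's experience consistent during the canary and makes a bad response reproducible with the same key.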

Nonfunctional requirements

Autoscale without cold-start spikes for hot models.
Roll back a bad model version within minutes.
Isolate expensive models from ordinary API traffic.

Scale assumptions

Hundreds of models, dozens of active high-QPS endpoints.
Peak 20,000 inference requests per second.
GPU-backed models have 30 to 90 second warmup times.
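
These assumptions can be turned into rough replica counts. The sketch below is back-of-the-envelope arithmetic only: the per-replica throughput (250 requests per second), the 70% target utilization, and the 50 requests-per-second-per-second traffic ramp are invented for illustration, and only the 20,000 RPS peak and the 90 second warmup come from the list above.

```python
import math


def replicas_for_peak(peak_rps: float, per_replica_rps: float, target_utilization: float) -> int:
    """Replicas needed to serve peak traffic while staying under the utilization target."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))


def warm_headroom_replicas(ramp_rps_per_s: float, warmup_s: float,
                           per_replica_rps: float, target_utilization: float) -> int:
    """Spare warm replicas needed to absorb traffic growth while new replicas warm up.

    A replica requested now is useful only after warmup_s seconds, so the pool
    must already cover the traffic that arrives during that window.
    """
    extra_rps = ramp_rps_per_s * warmup_s
    return math.ceil(extra_rps / (per_replica_rps * target_utilization))


# Assumed numbers: 250 RPS per replica, 70% target utilization, ramp of 50 RPS per second.
peak_replicas = replicas_for_peak(20_000, 250, 0.7)   # 115
headroom = warm_headroom_replicas(50, 90, 250, 0.7)   # 26
```

With these assumed inputs, the warmup time alone accounts for roughly a fifth of the peak fleet being held as headroom, which is why the warm pool shows up again under cost.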

API sketch

POST /v1/deployments { modelId, version, resources, rolloutPolicy }
POST /v1/models/{modelId}:predict { instances } -> { predictions }
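
To make the deployment call concrete, here is a minimal sketch of a request body and a few checks the control plane might run before accepting it. The top-level field names come from the API sketch; everything inside `resources` and `rolloutPolicy` (`accelerator`, `minReplicas`, `canaryPercent`, `autoRollback`) and the model name are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DeploymentRequest:
    modelId: str
    version: str
    resources: dict = field(default_factory=dict)      # nested keys are assumed
    rolloutPolicy: dict = field(default_factory=dict)  # nested keys are assumed

    def problems(self) -> list[str]:
        """Return human-readable validation problems; an empty list means acceptable."""
        found = []
        if not self.modelId or not self.version:
            found.append("modelId and version are required")
        if not 0 <= self.rolloutPolicy.get("canaryPercent", 0) <= 100:
            found.append("rolloutPolicy.canaryPercent must be between 0 and 100")
        if self.resources.get("minReplicas", 1) < 1:
            found.append("resources.minReplicas must be at least 1")
        return found


request = DeploymentRequest(
    modelId="fraud-scorer",
    version="7",
    resources={"accelerator": "gpu", "minReplicas": 2},
    rolloutPolicy={"canaryPercent": 5, "autoRollback": True},
)
body = json.dumps(asdict(request))  # JSON body for POST /v1/deployments
```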

Data model

models(model_id, owner, task_type, approved_versions).
deployments(deployment_id, model_id, version, resource_shape, rollout_state).
prediction_logs(request_id, model_id, version, latency_ms, feature_hash, output_summary).
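
The three tables can be written out as a schema to check that the lookup paths work. The sketch below uses SQLite from the Python standard library purely as a stand-in; column types, keys, the index, and the JSON encoding of `approved_versions` are assumptions, and a production design would likely keep prediction logs in a separate high-volume store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE models (
    model_id          TEXT PRIMARY KEY,
    owner             TEXT NOT NULL,
    task_type         TEXT NOT NULL,
    approved_versions TEXT NOT NULL  -- assumed: JSON array of version strings
);
CREATE TABLE deployments (
    deployment_id  TEXT PRIMARY KEY,
    model_id       TEXT NOT NULL REFERENCES models(model_id),
    version        TEXT NOT NULL,
    resource_shape TEXT NOT NULL,
    rollout_state  TEXT NOT NULL
);
CREATE TABLE prediction_logs (
    request_id     TEXT PRIMARY KEY,
    model_id       TEXT NOT NULL,
    version        TEXT NOT NULL,
    latency_ms     REAL NOT NULL,
    feature_hash   TEXT NOT NULL,
    output_summary TEXT
);
CREATE INDEX prediction_logs_by_version ON prediction_logs (model_id, version);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables in a fresh SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def latency_by_version(conn: sqlite3.Connection, model_id: str) -> dict:
    """Return {version: (request_count, mean_latency_ms)} for one model."""
    rows = conn.execute(
        "SELECT version, COUNT(*), AVG(latency_ms) FROM prediction_logs "
        "WHERE model_id = ? GROUP BY version ORDER BY version",
        (model_id,),
    )
    return {version: (count, mean) for version, count, mean in rows}
```

Grouping logs by `(model_id, version)` is the lookup that canary comparison needs, which is the reason for the assumed index.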

Architecture components

Registry stores model artifacts, metadata, and approval state.
Control plane creates serving deployments and traffic policies.
Data plane routes requests to warm model replicas with batching where safe.

Bottlenecks

GPU memory limits concurrent model replicas.
Dynamic batching improves throughput but can hurt p99 latency.
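
The batching tradeoff is easier to see in a small simulation. The sketch below assumes a common policy: a batch is dispatched when it reaches a maximum size or when its oldest request has waited a maximum time, whichever comes first. The function names and the sample arrival times are illustrative.

```python
def form_batches(arrivals_ms: list[float], max_batch_size: int, max_wait_ms: float) -> list[tuple]:
    """Group sorted arrival times into (dispatch_time_ms, [arrival times]) batches."""
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived: dispatch it first.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # A full batch is dispatched immediately, with no extra waiting.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + max_wait_ms, current))
    return batches


def worst_added_wait_ms(batches: list[tuple]) -> float:
    """Longest time any request sat in the queue waiting for its batch to dispatch."""
    return max(dispatch - batch[0] for dispatch, batch in batches)


batches = form_batches([0, 1, 2, 3, 20, 21, 50], max_batch_size=4, max_wait_ms=5)
# batches == [(3, [0, 1, 2, 3]), (25, [20, 21]), (55, [50])]
```

Under bursty load the first batch fills quickly and adds little delay, but sparse traffic pays the full `max_wait_ms` on every request. That is the p99 cost, and it argues for setting the wait from the latency budget of each model tier.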

Failure modes

Bad canary metrics: the traffic router shifts all traffic back to the previous version.
Replica cold start: keep a minimum warm pool for top models.
Feature schema mismatch: reject deployment before traffic shift.
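
The canary rollback path needs an explicit decision rule. The following is a minimal sketch of one, assuming the router compares a canary window against a control window on error rate and p99 latency; the thresholds (500 requests minimum, 0.5 percentage points of error-rate delta, 1.2x p99 ratio) and all names are placeholders to be tuned per model tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    requests: int
    errors: int
    p99_latency_ms: float


def canary_verdict(control: WindowStats, canary: WindowStats,
                   min_requests: int = 500,
                   max_error_rate_delta: float = 0.005,
                   max_p99_ratio: float = 1.2) -> str:
    """Return "hold", "rollback", or "promote" for the current comparison window."""
    # Too little canary traffic to judge: keep the current split and wait.
    if canary.requests < min_requests:
        return "hold"
    control_rate = control.errors / max(control.requests, 1)
    canary_rate = canary.errors / max(canary.requests, 1)
    if canary_rate - control_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > control.p99_latency_ms * max_p99_ratio:
        return "rollback"
    return "promote"
```

Returning "hold" when the sample is small avoids rolling back, or promoting, on noise from the first few requests.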

Observability

Latency, error rate, saturation, queueing delay, batch size, cold-start count.
Prediction drift and canary-vs-control metric deltas.
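
As an illustration of how the latency and error SLIs could be derived from raw samples, the sketch below computes nearest-rank percentiles and an error rate for one version's window. The nearest-rank definition is one of several percentile conventions, and the function and key names are assumptions.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 0 and 100."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    # Multiply before dividing so integer inputs avoid float rounding in the rank.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_window(latencies_ms: list[float], error_count: int) -> dict:
    """Summarize one version's window into the SLIs an alert would read."""
    total = len(latencies_ms)
    return {
        "requests": total,
        "error_rate": error_count / total if total else 0.0,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

Computing the same summary separately for the canary and control versions gives the metric deltas listed above without any extra instrumentation.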

Security / privacy

Restrict model artifact access by owner and environment.
Scrub or summarize prediction logs when inputs can contain private data.

Cost considerations

GPU idle time is the largest standing cost.
Warm pools reduce latency at the cost of utilization.

Tradeoffs

Shared serving clusters improve utilization but make noisy-neighbor isolation harder.
Per-model deployments isolate risk but increase operations overhead.

ML-specific concerns

Training/serving skew: deployed feature schemas must match training schemas.
Canary policy needs offline eval gates and online guardrail metrics.
Model lineage connects artifact, dataset, feature code, and serving image.
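
The schema check behind the training/serving skew point can be sketched as a pre-deployment gate. This assumes schemas are available as simple name-to-type mappings, which is a simplification; real feature stores carry richer metadata such as shapes, nullability, and vocabularies. The function name and the sample feature names are hypothetical.

```python
def schema_mismatches(training_schema: dict[str, str], serving_schema: dict[str, str]) -> list[str]:
    """List the differences that should block a deployment; empty means compatible."""
    problems = []
    for name, dtype in sorted(training_schema.items()):
        if name not in serving_schema:
            problems.append(f"missing feature: {name}")
        elif serving_schema[name] != dtype:
            problems.append(
                f"type mismatch for {name}: trained on {dtype}, serving sends {serving_schema[name]}"
            )
    # Extra serving features are reported too, since they often signal a stale model version.
    for name in sorted(set(serving_schema) - set(training_schema)):
        problems.append(f"unexpected feature: {name}")
    return problems


training = {"amount": "float", "country": "string", "age": "int"}
serving = {"amount": "float", "country": "int", "device": "string"}
issues = schema_mismatches(training, serving)
# issues lists a missing "age", a type mismatch on "country", and an unexpected "device".
```

Running this gate in the control plane, before any traffic policy is created, is what lets the platform reject the deployment instead of discovering the mismatch through canary errors.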

Rubric

Criterion	Weight	Evidence
clarification: Separates product behavior from infrastructure assumptions before drawing boxes.	10	The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
scale: Turns traffic and data assumptions into concrete sizing constraints.	15	Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
architecture: Draws clear service, cache, queue, and storage boundaries with reasons for each split.	20	The component diagram has one owner per responsibility and names the synchronous path.
data: Defines durable state, indexes, keys, and idempotency records.	15	Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
failure: Names failure modes and the recovery behavior users see.	15	Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
observability: Defines the small set of metrics and traces needed to debug the design.	10	Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
tradeoffs: Explains what is being sacrificed and why that sacrifice fits the prompt.	15	Compares at least two viable designs and names the losing design's advantage.
ml-specific: Covers the model, data, evaluation, deployment, and monitoring loop as one system.	20	The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.