ML system design

Design a model serving platform

Serve models with versioning, autoscaling, canaries, and rollback paths that operators can trust.

online inference, autoscaling, canary rollout, model registry

Prompt

Design a platform for serving multiple ML models behind online APIs. Product teams need versioned deployments, canary traffic, autoscaling, and metrics.

Clarifying questions

  • Are models CPU, GPU, or mixed?
  • Do callers require synchronous responses or can some requests be async?
  • What are the latency and availability targets by model tier?

Functional requirements

  • Register models and deploy versioned endpoints.
  • Route traffic by model, version, tenant, and canary percentage.
  • Expose latency, error, and prediction-quality metrics.

Nonfunctional requirements

  • Autoscale without cold-start spikes for hot models.
  • Roll back a bad model version within minutes.
  • Isolate expensive models from ordinary API traffic.

Scale assumptions

  • Hundreds of models, dozens of active high-QPS endpoints.
  • Peak 20,000 inference requests per second.
  • GPU-backed models have 30 to 90 second warmup times.
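
The warmup figure turns into a warm-capacity requirement: a replica requested now only starts serving 30 to 90 seconds later, so the pool must already cover whatever traffic growth can happen inside that window. A rough sizing sketch follows. Per-replica throughput, target utilization, and the traffic ramp rate are hypothetical inputs that the prompt does not give.

```python
import math

def replicas_needed(peak_rps: float, per_replica_rps: float,
                    target_utilization: float) -> int:
    """Replicas required to serve peak_rps at or below target_utilization."""
    return math.ceil(peak_rps / (per_replica_rps * target_utilization))

def warm_headroom(ramp_rps_per_s: float, warmup_s: float,
                  per_replica_rps: float) -> int:
    """Extra warm replicas needed to absorb growth during one warmup window."""
    return math.ceil(ramp_rps_per_s * warmup_s / per_replica_rps)

# Assumed numbers for illustration only: 200 RPS per replica, 70% utilization,
# traffic ramping by 50 RPS every second, worst-case 90 s warmup.
steady = replicas_needed(20_000, 200, 0.7)   # 143
headroom = warm_headroom(50, 90, 200)        # 23
```

Under these assumed inputs the fleet needs 143 replicas at peak plus 23 warm spares. The useful part is the shape of the calculation: headroom grows linearly with warmup time, which is why the 90-second case drives the design.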

API sketch

  • POST /v1/deployments { modelId, version, resources, rolloutPolicy }
  • POST /v1/models/{modelId}:predict { instances } -> predictions.
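
To make the deployment request concrete, here is one possible shape for its body. The top-level field names come from the sketch above. The fields inside rolloutPolicy and resources are assumptions added for illustration, not part of the prompt.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RolloutPolicy:
    canary_percent: int      # hypothetical: share of traffic sent to the new version
    max_error_rate: float    # hypothetical: guardrail that would trigger rollback

    def __post_init__(self):
        if not 0 <= self.canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")
        if not 0.0 <= self.max_error_rate <= 1.0:
            raise ValueError("max_error_rate must be between 0 and 1")

@dataclass(frozen=True)
class DeploymentRequest:
    modelId: str
    version: str
    resources: dict          # hypothetical shape, e.g. {"gpu": 1, "minReplicas": 2}
    rolloutPolicy: RolloutPolicy

def to_json(request: DeploymentRequest) -> str:
    """Serialize the request body for POST /v1/deployments."""
    return json.dumps(asdict(request), sort_keys=True)

request = DeploymentRequest(
    modelId="fraud-scorer",  # made-up model name
    version="3",
    resources={"gpu": 1, "minReplicas": 2},
    rolloutPolicy=RolloutPolicy(canary_percent=5, max_error_rate=0.01),
)
body = to_json(request)
```

Validating the rollout policy when the request is built means a malformed canary percentage is rejected before the control plane creates anything.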

Data model

  • models(model_id, owner, task_type, approved_versions).
  • deployments(deployment_id, model_id, version, resource_shape, rollout_state).
  • prediction_logs(request_id, model_id, version, latency_ms, feature_hash, output_summary).
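
The three tables can be tried out with an in-memory SQLite schema, as in the sketch below. The column types, the indexes, the JSON encoding of approved_versions, and the rollout_state values 'canary' and 'stable' are all assumptions. A production system would likely keep prediction logs in a separate log or columnar store.

```python
import sqlite3

SCHEMA = """
CREATE TABLE models (
    model_id          TEXT PRIMARY KEY,
    owner             TEXT NOT NULL,
    task_type         TEXT NOT NULL,
    approved_versions TEXT NOT NULL   -- assumed: JSON array of version strings
);
CREATE TABLE deployments (
    deployment_id  TEXT PRIMARY KEY,
    model_id       TEXT NOT NULL REFERENCES models(model_id),
    version        TEXT NOT NULL,
    resource_shape TEXT NOT NULL,
    rollout_state  TEXT NOT NULL      -- assumed values: 'canary', 'stable', 'retired'
);
CREATE INDEX deployments_by_model ON deployments(model_id, rollout_state);
CREATE TABLE prediction_logs (
    request_id     TEXT PRIMARY KEY,
    model_id       TEXT NOT NULL,
    version        TEXT NOT NULL,
    latency_ms     REAL NOT NULL,
    feature_hash   TEXT NOT NULL,
    output_summary TEXT
);
CREATE INDEX logs_by_model_version ON prediction_logs(model_id, version);
"""

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables in a fresh SQLite database."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def live_versions(conn: sqlite3.Connection, model_id: str) -> list:
    """Versions currently receiving traffic for a model (the router's lookup path)."""
    rows = conn.execute(
        "SELECT version FROM deployments "
        "WHERE model_id = ? AND rollout_state IN ('canary', 'stable') "
        "ORDER BY version",
        (model_id,),
    )
    return [row[0] for row in rows]
```

The index on (model_id, rollout_state) matches the router's lookup. The index on (model_id, version) in prediction_logs supports per-version canary comparisons.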

Architecture components

  • Registry stores model artifacts, metadata, and approval state.
  • Control plane creates serving deployments and traffic policies.
  • Data plane routes requests to warm model replicas with batching where safe.

Bottlenecks

  • GPU memory limits concurrent model replicas.
  • Dynamic batching improves throughput but can hurt p99 latency.
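
The batching tradeoff is easier to see with a small simulation. In this sketch a batch is dispatched when it is full or when its oldest request has waited max_wait_ms, whichever comes first. Both parameter names are assumptions, and inference time itself is ignored.

```python
def form_batches(arrivals_ms, max_batch_size, max_wait_ms):
    """Group sorted arrival times into (dispatch_ms, [arrival_ms, ...]) batches."""
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch timed out before this request arrived: dispatch it.
        if current and t > current[0] + max_wait_ms:
            batches.append((current[0] + max_wait_ms, current))
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # The trailing partial batch waits out its full deadline.
        batches.append((current[0] + max_wait_ms, current))
    return batches

def worst_queue_delay_ms(batches):
    """Longest time any request sat in the queue before dispatch."""
    return max(dispatch - min(items) for dispatch, items in batches)

# Four requests in a burst, then one straggler 50 ms later.
batches = form_batches([0, 1, 2, 3, 50], max_batch_size=4, max_wait_ms=10)
```

Here the burst fills a batch and leaves after 3 ms, while the lone straggler waits the full 10 ms. Under low traffic every request pays max_wait_ms, so that setting effectively adds a fixed amount to tail latency.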

Failure modes

  • Bad canary metrics: the traffic router shifts all traffic back to the previous version.
  • Replica cold start: keep a minimum warm pool for top models.
  • Feature schema mismatch: reject deployment before traffic shift.
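
The "bad canary metrics" path needs an explicit decision rule. The sketch below compares one window of canary traffic with the control window and returns hold, rollback, or promote. The thresholds and the WindowStats shape are hypothetical defaults that a real platform would tune per model tier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WindowStats:
    requests: int
    errors: int
    p99_latency_ms: float

def canary_decision(control: WindowStats, canary: WindowStats,
                    min_requests: int = 500,
                    max_error_delta: float = 0.005,
                    max_p99_ratio: float = 1.2) -> str:
    """Return 'hold', 'rollback', or 'promote' for one evaluation window."""
    # Too little canary traffic to judge: keep the current split.
    if canary.requests < min_requests:
        return "hold"
    control_error_rate = control.errors / max(control.requests, 1)
    canary_error_rate = canary.errors / max(canary.requests, 1)
    # Guardrail 1: absolute increase in error rate.
    if canary_error_rate - control_error_rate > max_error_delta:
        return "rollback"
    # Guardrail 2: relative p99 latency regression.
    if canary.p99_latency_ms > control.p99_latency_ms * max_p99_ratio:
        return "rollback"
    return "promote"

control = WindowStats(requests=20_000, errors=40, p99_latency_ms=120.0)
canary = WindowStats(requests=1_000, errors=12, p99_latency_ms=125.0)
decision = canary_decision(control, canary)
```

In this example the canary's error rate is 1.2% against 0.2% for control, so the rule returns "rollback". The "hold" branch matters as much as the other two, because deciding on a handful of requests produces noisy rollbacks.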

Observability

  • Latency, error rate, saturation, queueing delay, batch size, cold-start count.
  • Prediction drift and canary-vs-control metric deltas.
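
Prediction drift needs a concrete statistic. One widely used option is the Population Stability Index (PSI) over binned prediction scores, sketched below. The choice of PSI, the binning, and the epsilon floor are assumptions; the design above does not prescribe a drift metric. Both histograms are assumed to be non-empty.

```python
import math

def psi(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms with identical bins.

    PSI = sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i
    are the bin fractions of the baseline and the live window.
    """
    if len(expected_counts) != len(actual_counts):
        raise ValueError("histograms must use the same bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each fraction so empty bins do not produce log(0).
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score

baseline = [50, 50]   # e.g. score histogram from the offline eval set
live = [90, 10]       # same bins, computed from recent prediction logs
drift = psi(baseline, live)
```

Identical histograms give a PSI of exactly 0, and every term is non-negative, so the score only grows as the live distribution moves away from the baseline. The same function can compare canary and control output distributions.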

Security / privacy

  • Restrict model artifact access by owner and environment.
  • Scrub or summarize prediction logs when inputs can contain private data.

Cost considerations

  • GPU idle time is the largest standing cost.
  • Warm pools reduce latency at the cost of utilization.

Tradeoffs

  • Shared serving clusters improve utilization but make noisy-neighbor isolation harder.
  • Per-model deployments isolate risk but increase operations overhead.

ML-specific concerns

  • Training/serving skew: deployed feature schemas must match training schemas.
  • Canary policy needs offline eval gates and online guardrail metrics.
  • Model lineage connects artifact, dataset, feature code, and serving image.
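
The schema-matching requirement can be enforced as a pre-deployment gate. This minimal sketch assumes each schema is a mapping from feature name to a type string, which is a simplification of what a real feature store would record.

```python
def schema_mismatches(training: dict, serving: dict) -> list:
    """List the differences between training and serving feature schemas.

    An empty list means the deployment may proceed to traffic shifting.
    """
    problems = []
    for name, dtype in sorted(training.items()):
        if name not in serving:
            problems.append(f"missing feature: {name}")
        elif serving[name] != dtype:
            problems.append(
                f"type mismatch for {name}: trained as {dtype}, served as {serving[name]}"
            )
    for name in sorted(set(serving) - set(training)):
        problems.append(f"unexpected feature: {name}")
    return problems

# Hypothetical schemas for illustration.
training_schema = {"amount": "float32", "country": "string", "age_days": "int64"}
serving_schema = {"amount": "float64", "country": "string", "device": "string"}
issues = schema_mismatches(training_schema, serving_schema)
```

For these example schemas the gate reports a missing feature (age_days), a type mismatch (amount), and an unexpected feature (device). The control plane would reject the deployment before any canary traffic is assigned.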

Rubric

Criterion | Weight | Evidence
Separates product behavior from infrastructure assumptions before drawing boxes. (clarification) | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
Turns traffic and data assumptions into concrete sizing constraints. (scale) | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
Draws clear service, cache, queue, and storage boundaries with reasons for each split. (architecture) | 20 | The component diagram has one owner per responsibility and names the synchronous path.
Defines durable state, indexes, keys, and idempotency records. (data) | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
Names failure modes and the recovery behavior users see. (failure) | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
Defines the small set of metrics and traces needed to debug the design. (observability) | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
Explains what is being sacrificed and why that sacrifice fits the prompt. (tradeoffs) | 15 | Compares at least two viable designs and names the losing design's advantage.
Covers the model, data, evaluation, deployment, and monitoring loop as one system. (ml-specific) | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.