ML system design

Design an ML evaluation platform

Run offline, online, and regression evaluations with lineage strong enough to stop bad releases.

eval datasets, metrics, experiment tracking, release gates

Prompt

Design an evaluation platform for ML models and LLM features. Teams need repeatable offline evals, online experiment tracking, and release gates.

Clarifying questions

  • Which task types are in scope: classification, ranking, generation, or retrieval?
  • Are eval labels human-reviewed, synthetic, implicit, or mixed?
  • Who can approve a metric gate override?

Functional requirements

  • Register eval datasets, metric definitions, and runs.
  • Compare candidate models against baselines.
  • Gate releases on configured metrics and guardrails.

Nonfunctional requirements

  • Make eval runs reproducible by pinning dataset, code, and model versions.
  • Separate exploratory evals from release-blocking evals.
  • Keep sensitive eval data access-controlled.
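
One way to make the reproducibility requirement checkable is to derive a fingerprint from the pinned versions, so two runs with the same fingerprint are expected to be comparable. This is a minimal sketch: the function name `run_fingerprint` and the choice of SHA-256 over a canonical JSON payload are assumptions for illustration, not part of the prompt.

```python
import hashlib
import json


def run_fingerprint(dataset_version: str, code_version: str, model_version: str) -> str:
    """Return a stable hex digest for the inputs that define an eval run.

    Hypothetical helper: the three arguments mirror the versions named in
    the requirement; the hashing scheme itself is an assumption.
    """
    # Canonical JSON (sorted keys, fixed separators) keeps the digest stable
    # regardless of how the caller orders or formats the fields.
    payload = json.dumps(
        {"dataset": dataset_version, "code": code_version, "model": model_version},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A run whose fingerprint matches an earlier run can be treated as a rerun of that run; any version change produces a different digest.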

Scale assumptions

  • Thousands of eval runs per week.
  • Datasets range from 100 examples to millions of examples.
  • Some generation evals require costly model-as-judge calls.

API sketch

  • POST /v1/eval-runs { modelVersion, datasetId, metricSuiteId } -> { runId }
  • GET /v1/eval-runs/{runId}/comparison?baseline=...

Data model

  • eval_datasets(dataset_id, version, task_type, label_policy, access_policy).
  • metric_suites(suite_id, version, metric_defs).
  • eval_runs(run_id, model_version, dataset_version, metric_suite_version, status, results).
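
The three tables can be mirrored as typed records. The field names follow the data model above; the field types, the default `"queued"` status, and the shape of `results` (metric name to score) are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class EvalDataset:
    dataset_id: str
    version: str
    task_type: str
    label_policy: str
    access_policy: str


@dataclass(frozen=True)
class MetricSuite:
    suite_id: str
    version: str
    metric_defs: Tuple[str, ...]  # assumed: metric names only


@dataclass
class EvalRun:
    run_id: str
    model_version: str
    dataset_version: str
    metric_suite_version: str
    status: str = "queued"  # assumed initial status
    results: Dict[str, float] = field(default_factory=dict)  # assumed shape
```

Datasets and metric suites are frozen here because a published version should not change; runs stay mutable so that status and results can be filled in as the job progresses.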

Architecture components

  • Eval runner executes jobs against versioned datasets and model endpoints.
  • Metric service computes deterministic metrics and records judge-model metadata where used.
  • Release gate service compares candidate runs to configured thresholds.
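
The release gate comparison can be sketched as a pure function over a run's results. The `Threshold` type, the `higher_is_better` flag, and the choice to fail when a metric is missing are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    higher_is_better: bool = True  # False for guardrails such as error rates


def evaluate_gate(results: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return human-readable failures; an empty list means the gate passes."""
    failures = []
    for t in thresholds:
        value = results.get(t.metric)
        if value is None:
            # Fail closed: a metric that was never computed cannot pass a gate.
            failures.append(f"{t.metric}: missing from run results")
        elif t.higher_is_better and value < t.limit:
            failures.append(f"{t.metric}: {value} is below the minimum {t.limit}")
        elif not t.higher_is_better and value > t.limit:
            failures.append(f"{t.metric}: {value} is above the maximum {t.limit}")
    return failures
```

Returning the list of failures rather than a boolean gives an override approver the specific metrics that blocked the release.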

Bottlenecks

  • Large generation evals are expensive and slow.
  • Non-deterministic judge prompts can hide regressions.
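
To keep judge noise from hiding a regression, one rough heuristic is to score the same outputs several times and compare the candidate-versus-baseline delta against the spread of the repeated scores. The function name, the `noise_multiplier` default, and the rule itself are assumptions for illustration; this is not a formal significance test.

```python
from statistics import mean, stdev
from typing import List, Tuple


def judge_delta_is_conclusive(
    baseline_scores: List[float],
    candidate_scores: List[float],
    noise_multiplier: float = 2.0,
) -> Tuple[bool, float]:
    """Compare repeated suite-level judge scores for two models.

    Each list holds scores from re-running the judge on the same outputs
    (at least two repeats per model). Returns (conclusive, delta), where
    delta is candidate mean minus baseline mean.
    """
    # Use the larger of the two sample standard deviations as the noise level.
    noise = max(stdev(baseline_scores), stdev(candidate_scores))
    delta = mean(candidate_scores) - mean(baseline_scores)
    return abs(delta) > noise_multiplier * noise, delta
```

When the result is inconclusive, the options include more judge repeats, a larger dataset, or a deterministic metric for that gate.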

Failure modes

  • Metric code bug: freeze release gates and recompute affected runs.
  • Dataset contamination: deprecate dataset version and mark dependent runs invalid.
  • Judge drift: pin judge model and prompt version for release gates.
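
The dataset contamination response can be sketched as a sweep over run records. This assumes each run is a dict with `run_id`, `dataset_version`, and `status` keys, that `dataset_version` uniquely identifies a dataset and its version (for example a value like `"qa-set@v3"`), and that `"invalid"` is the status used for affected runs; all of these are hypothetical.

```python
from typing import Dict, List


def invalidate_dependent_runs(runs: List[Dict[str, str]], dataset_version: str) -> List[str]:
    """Mark every run that used the deprecated dataset version as invalid.

    Mutates the matching run records in place and returns their run ids so
    that the caller can notify owners or recheck release gates.
    """
    invalidated = []
    for run in runs:
        if run["dataset_version"] == dataset_version:
            run["status"] = "invalid"
            invalidated.append(run["run_id"])
    return invalidated
```

Any release gate decision that relied on one of the returned run ids would then need to be re-evaluated against a clean dataset version.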

Observability

  • Eval queue time, run failure rate, metric computation time, cost per run.
  • Gate override count and stale-baseline count.

Security / privacy

  • Restrict eval examples with sensitive labels to approved teams.
  • Store model outputs with retention policy tied to dataset classification.

Cost considerations

  • Model-as-judge and candidate inference costs need per-suite budgets.
  • Caching deterministic predictions can reduce rerun cost.
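
A prediction cache is only safe when the same inputs always give the same output, for example greedy decoding with fixed parameters. The sketch below assumes that condition holds; the class name `PredictionCache`, the key layout, and the in-memory dict are hypothetical.

```python
import hashlib
import json
from typing import Callable, Dict


class PredictionCache:
    """In-memory cache keyed by model version, example id, and decoding params."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.misses = 0  # number of times the model actually had to be called

    @staticmethod
    def key(model_version: str, example_id: str, decoding_params: dict) -> str:
        # sort_keys makes the key independent of parameter ordering.
        payload = json.dumps([model_version, example_id, decoding_params], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        model_version: str,
        example_id: str,
        decoding_params: dict,
        compute: Callable[[], str],
    ) -> str:
        k = self.key(model_version, example_id, decoding_params)
        if k not in self._store:
            self.misses += 1
            self._store[k] = compute()
        return self._store[k]
```

Including the model version and decoding parameters in the key means a new model or a changed sampling setup never reuses stale predictions.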

Tradeoffs

  • Broad eval suites catch more regressions but slow release cadence.
  • Small smoke evals are fast but cannot stand in for launch decisions.
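
The limit of small smoke evals can be made concrete with the normal-approximation margin of error for an accuracy estimate. This is a sketch that assumes independent examples and a binary correct/incorrect metric; the function name is hypothetical.

```python
import math


def accuracy_margin(accuracy: float, n_examples: int, z: float = 1.96) -> float:
    """Approximate half-width of a 95% confidence interval for accuracy.

    Uses the normal approximation to the binomial: z * sqrt(p * (1 - p) / n).
    """
    if n_examples <= 0:
        raise ValueError("n_examples must be positive")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_examples)
```

At 90% accuracy, 100 examples give a margin of roughly 0.059, while 10,000 examples give roughly 0.006. Under these assumptions, a 100-example smoke eval cannot reliably distinguish a one-point regression from noise.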

ML-specific concerns

  • Evaluation lineage is the product: dataset, labels, model, prompt, metric, and judge version.
  • Offline evals need online guardrail counterparts when user behavior matters.
  • Regression suites must include known failures, not only average-case examples.
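
The point about known failures can be sketched as a report that tracks previously fixed cases separately from the average. It assumes per-example pass/fail results keyed by example id and a maintained set of known-failure ids; both shapes and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Union


def regression_report(
    results: Dict[str, bool],
    known_failure_ids: Iterable[str],
) -> Dict[str, Union[float, List[str]]]:
    """Summarize a run by average pass rate and by reopened known failures.

    A known failure that is absent from the results counts as reopened,
    so a dropped example cannot silently pass.
    """
    if not results:
        raise ValueError("results must not be empty")
    average = sum(results.values()) / len(results)
    reopened = sorted(eid for eid in known_failure_ids if not results.get(eid, False))
    return {"average_pass_rate": average, "reopened_failures": reopened}
```

A gate could then block on any non-empty `reopened_failures` list even when the average pass rate looks healthy.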

Rubric

| Criterion | Weight | Evidence |
|---|---|---|
| Separates product behavior from infrastructure assumptions before drawing boxes. (clarification) | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Turns traffic and data assumptions into concrete sizing constraints. (scale) | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Draws clear service, cache, queue, and storage boundaries with reasons for each split. (architecture) | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Defines durable state, indexes, keys, and idempotency records. (data) | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Names failure modes and the recovery behavior users see. (failure) | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Defines the small set of metrics and traces needed to debug the design. (observability) | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Explains what is being sacrificed and why that sacrifice fits the prompt. (tradeoffs) | 15 | Compares at least two viable designs and names the losing design's advantage. |
| Covers the model, data, evaluation, deployment, and monitoring loop as one system. (ml-specific) | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling. |