ML system design

Design a feature store

Keep online and offline features aligned enough that model scores mean what training said they meant.

point-in-time correctness · online/offline parity · feature lineage · freshness

Prompt

Design a feature store for recommendation and risk models. Teams need reusable features for training, batch scoring, and online inference.

Clarifying questions

  • Which features require real-time updates and which are batch computed?
  • What point-in-time correctness guarantee is required for training data?
  • Who owns feature definitions and review?

Functional requirements

  • Register feature definitions, owners, and schemas.
  • Materialize features to offline and online stores.
  • Serve online feature vectors for inference with freshness metadata.

Nonfunctional requirements

  • Prevent training data from seeing future information.
  • Keep online read latency within the serving model's feature-fetch latency budget.
  • Detect freshness and parity regressions before model rollout.

Scale assumptions

  • 10,000 features across 200 entities.
  • 100,000 online feature reads per second.
  • Some features update hourly; others update within seconds.
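The scale assumptions can be turned into rough sizing numbers. This is a minimal sketch, and three of its inputs are assumptions that do not come from the prompt:

- An average read fetches 50 features. This reads "200 entities" as 200 entity types, giving about 50 features per type.
- Each encoded value costs about 32 bytes, including key and timestamp overhead.
- One online-store node serves 20,000 reads per second.

Treat all three as placeholders for measured values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadSizing:
    reads_per_sec: int
    features_per_read: int   # assumed, not from the prompt
    bytes_per_value: int     # assumed: encoded value plus key/timestamp overhead
    node_reads_per_sec: int  # assumed per-node capacity of the online store
    headroom: float = 0.5    # keep half of each node spare for spikes and node loss

    def values_per_sec(self) -> int:
        return self.reads_per_sec * self.features_per_read

    def egress_bytes_per_sec(self) -> int:
        return self.values_per_sec() * self.bytes_per_value

    def nodes_needed(self) -> int:
        # Divide peak load by the usable fraction of each node's capacity.
        usable = self.node_reads_per_sec * (1 - self.headroom)
        return math.ceil(self.reads_per_sec / usable)


sizing = ReadSizing(reads_per_sec=100_000, features_per_read=50,
                    bytes_per_value=32, node_reads_per_sec=20_000)
print(sizing.values_per_sec())        # 5000000
print(sizing.egress_bytes_per_sec())  # 160000000 (about 160 MB/s)
print(sizing.nodes_needed())          # 10
```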

API sketch

  • GET /v1/features/{entityType}/{entityId}?names=... -> feature vector.
  • POST /v1/feature-definitions { name, entity, schema, transformRef, freshnessSlo }
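The registry should reject malformed definitions before anything is materialized. Below is a minimal sketch of server-side validation for the POST payload. It applies hypothetical rules:

- Feature names are snake_case.
- Value types come from a small fixed set.
- freshnessSlo is given in whole seconds. The prompt does not state its unit.

```python
import re
from typing import Any

REQUIRED_FIELDS = ("name", "entity", "schema", "transformRef", "freshnessSlo")
# Hypothetical naming rule: lowercase snake_case feature names.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9_]*")
# Hypothetical set of value types the offline and online stores can encode.
ALLOWED_TYPES = {"int64", "float64", "string", "bool", "float_list"}


def validate_definition(body: dict[str, Any]) -> list[str]:
    """Return validation errors; an empty list means the payload is acceptable."""
    errors = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in body]
    if errors:
        return errors
    if not isinstance(body["name"], str) or not NAME_PATTERN.fullmatch(body["name"]):
        errors.append("name must be lowercase snake_case")
    schema = body["schema"] if isinstance(body["schema"], dict) else {}
    if schema.get("type") not in ALLOWED_TYPES:
        errors.append(f"unsupported schema type: {schema.get('type')!r}")
    # Assumption: freshnessSlo is a positive integer number of seconds.
    slo = body["freshnessSlo"]
    if isinstance(slo, bool) or not isinstance(slo, int) or slo <= 0:
        errors.append("freshnessSlo must be a positive integer number of seconds")
    return errors


definition = {
    "name": "user_click_rate_7d",                   # hypothetical feature
    "entity": "user",
    "schema": {"type": "float64"},
    "transformRef": "transforms/clicks.py@abc123",  # hypothetical reference format
    "freshnessSlo": 3600,
}
print(validate_definition(definition))  # []
```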

Data model

  • feature_definitions(name, entity_type, version, owner, schema, transform_ref).
  • offline_feature_values(entity_id, feature_name, event_time, value).
  • online_feature_values(entity_id, feature_name, value, feature_timestamp).
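Point-in-time correctness reduces to an as-of lookup against offline_feature_values. For each training label, take the latest value whose event_time is at or before the label time. Below is a minimal in-memory sketch using binary search, with these assumptions:

- Times are epoch seconds.
- A value whose event_time equals the label time is treated as visible.
- max_age is a hypothetical cutoff that mirrors an online TTL.

A production system would express the same logic as an as-of join in its batch engine.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(rows):
    """Group (entity_id, feature_name, event_time, value) rows into sorted time series."""
    grouped = defaultdict(list)
    for entity_id, feature_name, event_time, value in rows:
        grouped[(entity_id, feature_name)].append((event_time, value))
    index = {}
    for key, series in grouped.items():
        series.sort(key=lambda pair: pair[0])
        index[key] = ([t for t, _ in series], [v for _, v in series])
    return index


def as_of(index, entity_id, feature_name, label_time, max_age=None):
    """Latest value with event_time <= label_time, or None if absent or too old."""
    times, values = index.get((entity_id, feature_name), ([], []))
    pos = bisect_right(times, label_time)  # first position strictly after label_time
    if pos == 0:
        return None  # nothing was observed at or before the label time
    if max_age is not None and label_time - times[pos - 1] > max_age:
        return None  # older than the hypothetical TTL; treat as missing
    return values[pos - 1]


rows = [
    ("u1", "clicks_7d", 100, 3),
    ("u1", "clicks_7d", 200, 5),
    ("u1", "clicks_7d", 300, 9),
]
index = build_index(rows)
print(as_of(index, "u1", "clicks_7d", label_time=250))  # 5: the t=300 value is in the future
print(as_of(index, "u1", "clicks_7d", label_time=50))   # None: nothing observed yet
```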

Architecture components

  • Registry stores definitions, schemas, and ownership.
  • Batch and streaming materializers write to offline and online stores.
  • Serving clients fetch feature vectors through a low-latency API.
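The serving path can attach freshness metadata to every value it returns, so callers can tell stale values from fresh ones. This minimal sketch assumes:

- A plain dict stands in for the online key-value store. It is keyed by (entity ID, feature name) and holds (value, feature_timestamp).
- Each feature's freshnessSlo is known in seconds.

The FeatureResult shape is hypothetical, not a defined response format.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class FeatureResult:
    value: Any
    feature_timestamp: Optional[float]
    stale: bool    # age exceeds the feature's freshness SLO
    missing: bool  # no value stored for this entity


def get_feature_vector(online_store, freshness_slos, entity_id, names, now=None):
    """Assemble a feature vector with per-feature freshness flags."""
    now = time.time() if now is None else now
    vector = {}
    for name in names:
        row = online_store.get((entity_id, name))
        if row is None:
            # Report the gap; leave any default value to the caller.
            vector[name] = FeatureResult(None, None, stale=False, missing=True)
            continue
        value, feature_timestamp = row
        age = now - feature_timestamp
        vector[name] = FeatureResult(value, feature_timestamp,
                                     stale=age > freshness_slos[name], missing=False)
    return vector


store = {("u1", "clicks_5m"): (4, 1_000.0), ("u1", "ctr_7d"): (0.09, 0.0)}
slos = {"clicks_5m": 60, "ctr_7d": 3_600}
vec = get_feature_vector(store, slos, "u1", ["clicks_5m", "ctr_7d", "age_days"], now=1_100.0)
```

Here clicks_5m is 100 seconds old against a 60-second SLO, so it comes back flagged stale. ctr_7d is within its SLO. age_days is reported as missing rather than silently filled in.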

Bottlenecks

  • High-cardinality entities can create hot partitions in the online store.
  • Backfills can overwrite online values if event time and processing time are confused.
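One guard against the backfill-overwrite problem is to make online writes conditional on event time. A write lands only if its feature_timestamp is newer than the stored one, regardless of when it was processed. In this minimal sketch a dict stands in for the online store. A real store would need an atomic conditional write, such as compare-and-set or a server-side version check, to stay safe under concurrency. This single-threaded sketch does not show that.

```python
def upsert_if_newer(online_store, entity_id, feature_name, value, feature_timestamp):
    """Write only if this value's event time is newer than the stored one.

    Returns True if the write was applied. Equal timestamps are rejected,
    so replaying the same event is a no-op.
    """
    key = (entity_id, feature_name)
    current = online_store.get(key)
    if current is not None and current[1] >= feature_timestamp:
        return False
    online_store[key] = (value, feature_timestamp)
    return True


store = {}
upsert_if_newer(store, "u1", "ctr_7d", 0.11, feature_timestamp=200)            # streaming write
applied = upsert_if_newer(store, "u1", "ctr_7d", 0.08, feature_timestamp=100)  # late backfill
print(applied, store[("u1", "ctr_7d")])  # False (0.11, 200)
```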

Failure modes

  • Streaming materializer lag: serving returns values with a stale flag, and a freshness alert fires.
  • Schema change: block deployment of any model that depends on an incompatible feature version.
  • Backfill error: replay into a new feature version rather than mutating the active version.
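The schema-change guard can run as a deployment gate. For each feature, it compares the type recorded at training time with the type of the feature version the model will read online. This minimal sketch assumes:

- The registry maps (name, version) to a type string.
- The training dataset manifest records each feature's type.

Exact type equality is a deliberately strict stand-in for whatever compatibility rules the registry actually enforces.

```python
def deployment_blockers(registry, trained_schema, serving_versions):
    """List reasons to block a model deployment; an empty list means it may proceed.

    registry:         (feature name, version) -> type string
    trained_schema:   feature name -> type string recorded in the training manifest
    serving_versions: feature name -> version the model will read online
    """
    blockers = []
    for name, trained_type in trained_schema.items():
        version = serving_versions.get(name)
        if version is None:
            blockers.append(f"{name}: no serving version pinned")
            continue
        served_type = registry.get((name, version))
        if served_type is None:
            blockers.append(f"{name} v{version}: not registered")
        elif served_type != trained_type:
            blockers.append(f"{name} v{version}: serves {served_type}, model trained on {trained_type}")
    return blockers


registry = {("ctr_7d", 1): "float64", ("ctr_7d", 2): "string"}
trained = {"ctr_7d": "float64"}
print(deployment_blockers(registry, trained, {"ctr_7d": 1}))  # []
print(deployment_blockers(registry, trained, {"ctr_7d": 2}))
```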

Observability

  • Freshness lag by feature, online read p99, null rate, parity checks.
  • Training-serving skew metrics sampled from live requests.
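A parity check compares values logged at serving time with values recomputed offline for the same entity and request time. The minimal sketch below computes a per-feature mismatch rate over such samples. It assumes:

- Floats are compared with a relative tolerance.
- A value missing on both sides counts as agreement.
- The 1% failure threshold is a hypothetical default.

```python
import math
from collections import defaultdict


def parity_report(samples, rel_tol=1e-6):
    """samples: (feature_name, served_value, offline_value) triples -> mismatch rate per feature."""
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for name, served, offline in samples:
        totals[name] += 1
        if served is None or offline is None:
            same = served is offline  # agreement only if both sides are missing
        elif isinstance(served, float) or isinstance(offline, float):
            same = math.isclose(served, offline, rel_tol=rel_tol)
        else:
            same = served == offline
        if not same:
            mismatches[name] += 1
    return {name: mismatches[name] / totals[name] for name in totals}


def failing_features(report, max_mismatch_rate=0.01):
    """Features whose mismatch rate exceeds the (hypothetical) threshold."""
    return sorted(name for name, rate in report.items() if rate > max_mismatch_rate)


samples = [
    ("ctr_7d", 0.5, 0.5),
    ("ctr_7d", 0.5, 0.5000000001),  # within tolerance
    ("ctr_7d", 0.5, 0.7),           # skew
    ("ctr_7d", 0.5, 0.5),
    ("country", "US", "US"),
    ("country", None, None),
]
report = parity_report(samples)
print(report)                    # {'ctr_7d': 0.25, 'country': 0.0}
print(failing_features(report))  # ['ctr_7d']
```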

Security / privacy

  • Classify features by sensitivity and restrict cross-team reuse.
  • Record retention and deletion behavior for user-derived features.
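Sensitivity classification can drive a read-time access check. The minimal sketch below uses hypothetical rules:

- The owning team can always read its own features.
- Other teams need a clearance at or above the feature's level.
- Restricted features also need an explicit per-feature grant.

The sensitivity levels and rules are illustrative, not a standard policy model.

```python
from dataclasses import dataclass
from enum import IntEnum


class Sensitivity(IntEnum):
    # Hypothetical levels; higher means more sensitive.
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class FeatureMeta:
    name: str
    owner_team: str
    sensitivity: Sensitivity


def can_read(feature, consumer_team, clearance, grants):
    """grants: set of (feature_name, team) pairs approved for cross-team reuse."""
    if consumer_team == feature.owner_team:
        return True
    if clearance < feature.sensitivity:
        return False
    if feature.sensitivity == Sensitivity.RESTRICTED:
        # Cross-team reuse of restricted features also needs an explicit grant.
        return (feature.name, consumer_team) in grants
    return True


income = FeatureMeta("user_income_band", "risk", Sensitivity.RESTRICTED)
print(can_read(income, "recs", Sensitivity.RESTRICTED, grants=set()))  # False
```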

Cost considerations

  • Online store cost follows hot entity-feature pairs and replication.
  • Offline backfills can dominate compute if feature definitions churn.
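A back-of-envelope estimate shows how online store size follows the number of stored entity-feature pairs and the replica count. All inputs are hypothetical: 50 million entities, 50 features each, 32 bytes per stored value, and three replicas. The second figure assumes that only the most active 20% of entities are kept online. The estimate ignores index overhead and compression.

```python
def online_store_bytes(entities, features_per_entity, bytes_per_value, replicas):
    """Raw online footprint: every entity-feature pair, stored on every replica."""
    return entities * features_per_entity * bytes_per_value * replicas


all_entities = online_store_bytes(50_000_000, 50, 32, replicas=3)
hot_only = online_store_bytes(10_000_000, 50, 32, replicas=3)  # only the hot 20% kept online
print(all_entities // 10**9, "GB")  # 240 GB
print(hot_only // 10**9, "GB")      # 48 GB
```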

Tradeoffs

  • Central feature registry improves reuse but adds governance overhead.
  • Streaming features improve freshness but make point-in-time replay harder.

ML-specific concerns

  • Training/serving skew is the central failure mode and needs automated parity checks.
  • Feature lineage must connect transforms, datasets, and model versions.
  • Feature freshness should be part of model guardrails, not only data-team dashboards.
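If lineage is stored as edges, impact analysis becomes a graph walk. The edges run from transforms to feature versions, from feature versions to training datasets, and from datasets to model versions. The walk answers "which models were trained on data from this transform?" This is a minimal breadth-first sketch. The node naming scheme (kind:name@version) is hypothetical.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every node reachable from start (breadth-first), excluding start."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


edges = [
    ("transform:clicks.py@abc123", "feature:ctr_7d@v2"),
    ("feature:ctr_7d@v2", "dataset:recs_train_w23"),
    ("dataset:recs_train_w23", "model:ranker@v14"),
    ("feature:account_age@v1", "dataset:risk_train_w23"),
    ("dataset:risk_train_w23", "model:risk_scorer@v3"),
]
affected = {n for n in downstream(edges, "transform:clicks.py@abc123") if n.startswith("model:")}
print(affected)  # {'model:ranker@v14'}
```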

Rubric

  • clarification (weight 10): Separates product behavior from infrastructure assumptions before drawing boxes. Evidence: the answer names users, write paths, read paths, retention, and what is explicitly out of scope.
  • scale (weight 15): Turns traffic and data assumptions into concrete sizing constraints. Evidence: uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
  • architecture (weight 20): Draws clear service, cache, queue, and storage boundaries with reasons for each split. Evidence: the component diagram has one owner per responsibility and names the synchronous path.
  • data (weight 15): Defines durable state, indexes, keys, and idempotency records. Evidence: tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
  • failure (weight 15): Names failure modes and the recovery behavior users see. Evidence: covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
  • observability (weight 10): Defines the small set of metrics and traces needed to debug the design. Evidence: includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
  • tradeoffs (weight 15): Explains what is being sacrificed and why that sacrifice fits the prompt. Evidence: compares at least two viable designs and names the losing design's advantage.
  • ml-specific (weight 20): Covers the model, data, evaluation, deployment, and monitoring loop as one system. Evidence: the answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.