ML system design
Design a feature store
Keep online and offline features aligned enough that model scores mean what training said they meant.
point-in-time correctness, online/offline parity, feature lineage, freshness
Prompt
Design a feature store for recommendation and risk models. Teams need reusable features for training, batch scoring, and online inference.
Clarifying questions
- Which features require real-time updates and which are batch computed?
- What point-in-time correctness guarantee is required for training data?
- Who owns feature definitions and review?
Functional requirements
- Register feature definitions, owners, and schemas.
- Materialize features to offline and online stores.
- Serve online feature vectors for inference with freshness metadata.
Nonfunctional requirements
- Prevent training data from seeing future information.
- Keep online read latency below the serving model's feature budget.
- Detect freshness and parity regressions before model rollout.
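The first requirement above amounts to an "as of" lookup when building training rows. A minimal sketch, assuming each feature's history is a list of (event_time, value) tuples sorted by event time; the function name and the sample data are hypothetical, and whether the boundary is inclusive is a policy choice:

```python
from bisect import bisect_right

def value_as_of(history, as_of):
    """Return the latest value whose event_time <= as_of, or None.

    history: list of (event_time, value) tuples sorted by event_time.
    """
    times = [t for t, _ in history]
    idx = bisect_right(times, as_of)
    return history[idx - 1][1] if idx > 0 else None

# Hypothetical history for one entity and one feature.
history = [(100, 0.2), (200, 0.5), (300, 0.9)]

# A label observed at t=250 may only see the value written at t=200.
training_value = value_as_of(history, 250)
```

A lookup before the first event returns None, which keeps the training row honest about the missing value instead of borrowing a future one.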
Scale assumptions
- 10,000 features across 200 entity types.
- 100,000 online feature reads per second.
- Some features update hourly; others update within seconds.
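These assumptions can be turned into a rough read-path budget. A back-of-envelope sketch: the read rate comes from the list above, while the features-per-read and bytes-per-value figures are illustrative assumptions, not part of the prompt.

```python
READS_PER_SECOND = 100_000   # from the scale assumptions
FEATURES_PER_READ = 50       # assumption: average feature vector width
BYTES_PER_VALUE = 64         # assumption: value plus timestamp and key overhead

# Individual feature values the online store must return each second.
values_per_second = READS_PER_SECOND * FEATURES_PER_READ

# Approximate payload leaving the online store, in MB per second.
egress_mb_per_second = values_per_second * BYTES_PER_VALUE / 1_000_000
```

Under these assumptions the store returns 5 million values and about 320 MB per second, which is the kind of number that decides whether a feature vector is stored as one row per entity or one row per feature.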
API sketch
- GET /v1/features/{entityType}/{entityId}?names=... -> feature vector.
- POST /v1/feature-definitions { name, entity, schema, transformRef, freshnessSlo }
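One possible shape for the GET response is a list of values that each carry their own timestamp and a stale flag derived from the definition's freshnessSlo. A sketch only: the class, the function, and the feature names are hypothetical, and freshnessSlo is assumed to be a maximum age in seconds.

```python
from dataclasses import dataclass

@dataclass
class FeatureValue:
    name: str
    value: object
    feature_timestamp: float  # event time of the value, epoch seconds
    stale: bool               # True when the value is older than its freshness SLO

def build_vector(rows, slos, now):
    """rows: {name: (value, feature_timestamp)}; slos: {name: max_age_seconds}."""
    return [
        FeatureValue(name, value, ts, stale=(now - ts) > slos[name])
        for name, (value, ts) in rows.items()
    ]

rows = {"txn_count_1h": (7, 1_000.0), "avg_basket_30d": (42.5, 400.0)}
slos = {"txn_count_1h": 60.0, "avg_basket_30d": 3_600.0}
vector = build_vector(rows, slos, now=1_100.0)
```

Here the hourly count is 100 seconds old against a 60 second SLO and is flagged stale, while the 30 day average is well inside its budget.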
Data model
- feature_definitions(name, entity_type, version, owner, schema, transform_ref).
- offline_feature_values(entity_id, feature_name, event_time, value).
- online_feature_values(entity_id, feature_name, value, feature_timestamp).
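The three tables differ mainly in their keys. A sketch using an in-memory SQLite database: the primary keys and column types are assumptions layered on the column lists above, chosen so that the offline table keeps history and the online table keeps only the latest value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature_definitions (
    name TEXT, entity_type TEXT, version INTEGER, owner TEXT,
    "schema" TEXT, transform_ref TEXT,
    PRIMARY KEY (name, version)
);
CREATE TABLE offline_feature_values (
    entity_id TEXT, feature_name TEXT, event_time INTEGER, value TEXT,
    PRIMARY KEY (entity_id, feature_name, event_time)  -- one row per event
);
CREATE TABLE online_feature_values (
    entity_id TEXT, feature_name TEXT, value TEXT, feature_timestamp INTEGER,
    PRIMARY KEY (entity_id, feature_name)              -- latest value only
);
""")

# Offline: both events are kept so training can replay history.
conn.executemany(
    "INSERT INTO offline_feature_values VALUES (?, ?, ?, ?)",
    [("user:42", "txn_count_1h", 100, "3"), ("user:42", "txn_count_1h", 200, "7")],
)
# Online: the same key is overwritten on every update.
for value, ts in [("3", 100), ("7", 200)]:
    conn.execute(
        "INSERT OR REPLACE INTO online_feature_values VALUES (?, ?, ?, ?)",
        ("user:42", "txn_count_1h", value, ts),
    )

offline_rows = conn.execute("SELECT COUNT(*) FROM offline_feature_values").fetchone()[0]
online_rows = conn.execute("SELECT COUNT(*) FROM online_feature_values").fetchone()[0]
```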
Architecture components
- Registry stores definitions, schemas, and ownership.
- Batch and streaming materializers write to offline and online stores.
- Serving clients fetch feature vectors through a low-latency API.
Bottlenecks
- High-cardinality entities can create hot partitions in the online store.
- Backfills can overwrite fresher online values with older ones if event time and processing time are confused.
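One way to guard against the backfill problem is to make every online write conditional on event time. A minimal sketch, assuming the online store can be modeled as a dict keyed by (entity_id, feature_name); the function name and keys are hypothetical, and a real store would need an atomic compare-and-set to do the same thing.

```python
def guarded_write(online, key, value, event_time):
    """Apply the write only if event_time is newer than the stored timestamp.

    online: dict mapping (entity_id, feature_name) -> (value, feature_timestamp).
    Returns True if the write was applied.
    """
    current = online.get(key)
    if current is not None and current[1] >= event_time:
        return False  # late or backfilled row; keep the fresher value
    online[key] = (value, event_time)
    return True

online = {}
key = ("user:42", "txn_count_1h")
guarded_write(online, key, 7, event_time=500)            # streaming write
applied = guarded_write(online, key, 3, event_time=200)  # backfill replaying old data
```

The backfill row is rejected because it is ordered by when the event happened, not by when the job ran.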
Failure modes
- Streaming materializer lag: serving returns values with a stale flag and an alert fires.
- Schema change: block incompatible version from model deployment.
- Backfill error: replay into a new feature version rather than mutating the active version.
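The schema-change gate can be sketched as a diff between two definition versions. This assumes schemas are flat maps of field name to type name and that added fields are compatible while removed or retyped fields are not; both are assumptions, and the field names are hypothetical.

```python
def incompatible_fields(old_schema, new_schema):
    """Return fields that were removed or changed type between versions.

    Schemas are dicts of field name -> type name. Added fields are ignored.
    """
    return sorted(
        name
        for name, type_name in old_schema.items()
        if new_schema.get(name) != type_name
    )

v1 = {"txn_count_1h": "int64", "avg_basket_30d": "float64"}
v2 = {"txn_count_1h": "float64", "avg_basket_30d": "float64", "device_age_days": "int64"}

problems = incompatible_fields(v1, v2)
deploy_allowed = not problems  # block the rollout if anything the model reads changed
```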
Observability
- Freshness lag by feature, online read p99, null rate, parity checks.
- Training-serving skew metrics sampled from live requests.
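A parity check can be as simple as a mismatch rate over sampled pairs. A sketch, assuming each sample pairs the value served online with the value the offline pipeline computes for the same entity, feature, and timestamp; the function names and the tolerance are illustrative.

```python
import math

def _same(a, b, rel_tol):
    # A null on one side only counts as a mismatch.
    if a is None or b is None:
        return a is b
    return math.isclose(a, b, rel_tol=rel_tol)

def parity_mismatch_rate(samples, rel_tol=1e-6):
    """samples: list of (online_value, offline_value) pairs."""
    if not samples:
        return 0.0
    mismatches = sum(
        1 for online, offline in samples if not _same(online, offline, rel_tol)
    )
    return mismatches / len(samples)

samples = [(1.0, 1.0), (0.5, 0.5000000001), (3.0, 2.0), (None, 4.0)]
rate = parity_mismatch_rate(samples)
```

Tracking this rate per feature, rather than one global number, makes it possible to tie an alert to the specific transform that drifted.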
Security / privacy
- Classify features by sensitivity and restrict cross-team reuse.
- Record retention and deletion behavior for user-derived features.
Cost considerations
- Online store cost follows hot entity-feature pairs and replication.
- Offline backfills can dominate compute if feature definitions churn.
Tradeoffs
- Central feature registry improves reuse but adds governance overhead.
- Streaming features improve freshness but make point-in-time replay harder.
ML-specific concerns
- Training/serving skew is the central failure mode and needs automated parity checks.
- Feature lineage must connect transforms, datasets, and model versions.
- Feature freshness should be part of model guardrails, not only data-team dashboards.
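Lineage is useful when it can answer "what does this change affect?". A minimal sketch, assuming lineage is stored as a map from each artifact to the artifacts directly downstream of it; every identifier in the example is hypothetical.

```python
# Hypothetical lineage edges: each key is upstream of the items in its list.
lineage = {
    "transform:txn_agg@v3": ["feature:txn_count_1h@v2"],
    "feature:txn_count_1h@v2": ["dataset:risk_train_2024w10", "model:risk@v7"],
    "dataset:risk_train_2024w10": ["model:risk@v7"],
}

def downstream(node, edges):
    """Return every artifact reachable from node, i.e. what a change to it can affect."""
    seen, stack = set(), [node]
    while stack:
        for child in edges.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

affected = downstream("transform:txn_agg@v3", lineage)
```

The same traversal run from a feature version tells a model owner which deployed models need re-evaluation after a backfill.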
Rubric
| Criterion | Category | Weight | Evidence |
|---|---|---|---|
| Separates product behavior from infrastructure assumptions before drawing boxes. | clarification | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Turns traffic and data assumptions into concrete sizing constraints. | scale | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Draws clear service, cache, queue, and storage boundaries with reasons for each split. | architecture | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Defines durable state, indexes, keys, and idempotency records. | data | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Names failure modes and the recovery behavior users see. | failure | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Defines the small set of metrics and traces needed to debug the design. | observability | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Explains what is being sacrificed and why that sacrifice fits the prompt. | tradeoffs | 15 | Compares at least two viable designs and names the losing design's advantage. |
| Covers the model, data, evaluation, deployment, and monitoring loop as one system. | ml-specific | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling. |