ML system design
Design a vector search service
Serve nearest-neighbor retrieval with filtering, index refresh, recall measurement, and cost controls.
nearest neighborsmetadata filteringindex refreshrecall evaluation
Prompt
Design a vector search service used by search, recommendations, and RAG products. It must support metadata filters and frequent index updates.
Clarifying questions
- What vector dimensions and distance metrics are required?
- Are filters selective, broad, or both?
- How fresh must newly embedded items be in search results?
Functional requirements
- Upsert vectors with metadata and version identifiers.
- Query top-k nearest neighbors with optional filters.
- Measure recall and latency for each index version.
Nonfunctional requirements
- Keep p95 vector query latency below 100 ms for ordinary filters.
- Support index rebuilds without taking queries offline.
- Bound recall loss introduced by approximate search.
Scale assumptions
- One billion vectors, 768 to 1,536 dimensions.
- 10,000 vector queries per second at peak.
- Millions of vector upserts per day.
API sketch
- POST /v1/indexes/{name}/vectors:upsert { id, vector, metadata, version }
- POST /v1/indexes/{name}:query { vector, topK, filter } -> neighbors.
Data model
- vectors(id, index_name, embedding_version, metadata, deleted_at).
- index_versions(index_name, version, algorithm, build_state, recall_report).
- query_logs(query_id, index_version, filter_hash, latency_ms, result_ids).
Architecture components
- Write path stores vector records and schedules index updates.
- Serving fleet keeps active index shards in memory with metadata filter side indexes.
- Evaluation jobs compare approximate results to exact-search samples.
Bottlenecks
- High-selectivity filters can defeat approximate nearest-neighbor assumptions.
- Large vectors increase memory bandwidth and network transfer cost.
Failure modes
- Bad index build: keep previous index version active.
- Shard unavailable: route to replica and mark recall-risk if fallback is partial.
- Embedding version mismatch: reject mixed-version queries unless explicitly configured.
Observability
- Latency by topK and filter selectivity, recall@k samples, index freshness lag.
- Memory per shard, rebuild duration, and query error classes.
Security / privacy
- Enforce metadata filters for tenant and ACL boundaries before returning ids.
- Treat vectors as potentially sensitive derived data.
Cost considerations
- RAM cost dominates for hot indexes; compression trades recall against memory.
- Frequent rebuilds consume CPU/GPU budget and can compete with serving.
Tradeoffs
- Pre-filtering protects permissions but can hurt recall for sparse filters.
- Post-filtering is fast for broad filters but can leak candidates if implemented carelessly.
ML-specific concerns
- Embedding version lineage must match query vectors and indexed vectors.
- Recall evaluation needs held-out queries and exact-search baselines.
- Retrieval drift can occur after corpus changes even when the model is unchanged.
Rubric
| Criterion | Weight | Evidence |
|---|---|---|
Separates product behavior from infrastructure assumptions before drawing boxes. clarification | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
Turns traffic and data assumptions into concrete sizing constraints. scale | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
Draws clear service, cache, queue, and storage boundaries with reasons for each split. architecture | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
Defines durable state, indexes, keys, and idempotency records. data | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
Names failure modes and the recovery behavior users see. failure | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
Defines the small set of metrics and traces needed to debug the design. observability | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
Explains what is being sacrificed and why that sacrifice fits the prompt. tradeoffs | 15 | Compares at least two viable designs and names the losing design's advantage. |
Covers the model, data, evaluation, deployment, and monitoring loop as one system. ml-specific | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling. |