ML system design

Design a vector search service

Serve nearest-neighbor retrieval with filtering, index refresh, recall measurement, and cost controls.

nearest neighborsmetadata filteringindex refreshrecall evaluation

Prompt

Design a vector search service used by search, recommendations, and RAG products. It must support metadata filters and frequent index updates.

Clarifying questions

  • What vector dimensions and distance metrics are required?
  • Are filters selective, broad, or both?
  • How fresh must newly embedded items be in search results?

Functional requirements

  • Upsert vectors with metadata and version identifiers.
  • Query top-k nearest neighbors with optional filters.
  • Measure recall and latency for each index version.

Nonfunctional requirements

  • Keep p95 vector query latency below 100 ms for ordinary filters.
  • Support index rebuilds without taking queries offline.
  • Bound recall loss introduced by approximate search.

Scale assumptions

  • One billion vectors, 768 to 1,536 dimensions.
  • 10,000 vector queries per second at peak.
  • Millions of vector upserts per day.

API sketch

  • POST /v1/indexes/{name}/vectors:upsert { id, vector, metadata, version }
  • POST /v1/indexes/{name}:query { vector, topK, filter } -> neighbors.

Data model

  • vectors(id, index_name, embedding_version, metadata, deleted_at).
  • index_versions(index_name, version, algorithm, build_state, recall_report).
  • query_logs(query_id, index_version, filter_hash, latency_ms, result_ids).

Architecture components

  • Write path stores vector records and schedules index updates.
  • Serving fleet keeps active index shards in memory with metadata filter side indexes.
  • Evaluation jobs compare approximate results to exact-search samples.

Bottlenecks

  • High-selectivity filters can defeat approximate nearest-neighbor assumptions.
  • Large vectors increase memory bandwidth and network transfer cost.

Failure modes

  • Bad index build: keep previous index version active.
  • Shard unavailable: route to replica and mark recall-risk if fallback is partial.
  • Embedding version mismatch: reject mixed-version queries unless explicitly configured.

Observability

  • Latency by topK and filter selectivity, recall@k samples, index freshness lag.
  • Memory per shard, rebuild duration, and query error classes.

Security / privacy

  • Enforce metadata filters for tenant and ACL boundaries before returning ids.
  • Treat vectors as potentially sensitive derived data.

Cost considerations

  • RAM cost dominates for hot indexes; compression trades recall against memory.
  • Frequent rebuilds consume CPU/GPU budget and can compete with serving.

Tradeoffs

  • Pre-filtering protects permissions but can hurt recall for sparse filters.
  • Post-filtering is fast for broad filters but can leak candidates if implemented carelessly.

ML-specific concerns

  • Embedding version lineage must match query vectors and indexed vectors.
  • Recall evaluation needs held-out queries and exact-search baselines.
  • Retrieval drift can occur after corpus changes even when the model is unchanged.

Rubric

CriterionWeightEvidence
Separates product behavior from infrastructure assumptions before drawing boxes.
clarification
10The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
Turns traffic and data assumptions into concrete sizing constraints.
scale
15Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
Draws clear service, cache, queue, and storage boundaries with reasons for each split.
architecture
20The component diagram has one owner per responsibility and names the synchronous path.
Defines durable state, indexes, keys, and idempotency records.
data
15Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
Names failure modes and the recovery behavior users see.
failure
15Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
Defines the small set of metrics and traces needed to debug the design.
observability
10Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
Explains what is being sacrificed and why that sacrifice fits the prompt.
tradeoffs
15Compares at least two viable designs and names the losing design's advantage.
Covers the model, data, evaluation, deployment, and monitoring loop as one system.
ml-specific
20The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.