ML system design

Design a RAG system

Build retrieval, generation, citation, and evaluation loops that do not collapse into a demo prompt.

chunking, embedding retrieval, reranking, evaluation

Prompt

Design a retrieval-augmented generation system for internal technical documentation. Users ask natural-language questions and expect cited answers with low hallucination risk.

Clarifying questions

  • Which document sources are authoritative and how often do they change?
  • Do answers need exact quotations, summaries, or both?
  • What is the acceptable behavior when retrieval confidence is low?

Functional requirements

  • Ingest and chunk documents with provenance.
  • Retrieve candidate passages and generate cited answers.
  • Collect feedback and route failed answers into evaluation sets.
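
As a minimal sketch of "chunk documents with provenance", the snippet below splits text into fixed-size overlapping windows and records the character offsets each chunk came from. The `Chunk` type, the `chunk_document` helper, and the default window sizes are illustrative assumptions, not part of the prompt; a real system might well chunk on headings or sentences instead.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Chunk:
    document_id: str
    version: int
    ordinal: int      # position of the chunk within the document
    start: int        # character offset into the source text (inclusive)
    end: int          # character offset into the source text (exclusive)
    text: str
    text_hash: str


def chunk_document(document_id, version, text, size=200, overlap=50):
    """Split text into overlapping windows, keeping offsets as provenance."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    step = size - overlap
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        piece = text[start:end]
        digest = hashlib.sha256(piece.encode("utf-8")).hexdigest()
        chunks.append(Chunk(document_id, version, len(chunks), start, end, piece, digest))
        if end == len(text):
            break  # the last window reached the end of the document
        start += step
    return chunks
```

Because every chunk carries `start` and `end`, a citation can point back to the exact span of a specific document version.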

Nonfunctional requirements

  • Keep answer latency under 5 seconds for ordinary queries.
  • Do not answer from documents the user cannot access.
  • Make low-confidence behavior explicit instead of inventing an answer.

Scale assumptions

  • Five million documents, 200 million chunks.
  • 1,000 peak queries per minute.
  • Documents update continuously from multiple source systems.
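
The scale assumptions above can be turned into rough sizing numbers with a few lines of arithmetic. The embedding dimension (768) and float32 storage are assumptions made only for this back-of-envelope sketch; the prompt does not specify them.

```python
# Figures taken from the scale assumptions.
DOCUMENTS = 5_000_000
CHUNKS = 200_000_000
PEAK_QUERIES_PER_MINUTE = 1_000

# Assumptions for illustration only.
EMBEDDING_DIM = 768       # hypothetical embedding size
BYTES_PER_FLOAT = 4       # float32, no quantization

peak_qps = PEAK_QUERIES_PER_MINUTE / 60
chunks_per_document = CHUNKS / DOCUMENTS
raw_vector_bytes = CHUNKS * EMBEDDING_DIM * BYTES_PER_FLOAT
raw_vector_gib = raw_vector_bytes / 2**30

print(f"peak QPS: {peak_qps:.1f}")
print(f"average chunks per document: {chunks_per_document:.0f}")
print(f"raw vector storage: {raw_vector_gib:.0f} GiB before index overhead")
```

Under these assumptions the query rate is modest (under 20 QPS), while raw vectors alone run to hundreds of GiB, which suggests that index memory and backfill throughput, not query throughput, are the sizing constraints to discuss first.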

API sketch

  • POST /v1/answer { query, userId, corpusIds } -> { answer, citations, confidence }
  • POST /internal/ingest/document { sourceId, version, acl, bodyRef }
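
One way to pin down the `/v1/answer` contract is to write the request and response as typed records. The sketch below mirrors the field names in the API sketch; the `Citation` fields, the validation rules, and the `parse_answer_request` helper are assumptions added for illustration.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Citation:
    chunk_id: str
    document_id: str
    version: int


@dataclass
class AnswerRequest:
    query: str
    user_id: str
    corpus_ids: List[str]


@dataclass
class AnswerResponse:
    answer: str
    citations: List[Citation]
    confidence: float


def parse_answer_request(payload: dict) -> AnswerRequest:
    """Validate the JSON body of POST /v1/answer (hypothetical rules)."""
    query = payload.get("query")
    user_id = payload.get("userId")
    corpus_ids = payload.get("corpusIds")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    if not isinstance(user_id, str) or not user_id:
        raise ValueError("userId must be a non-empty string")
    if not isinstance(corpus_ids, list) or not all(isinstance(c, str) for c in corpus_ids):
        raise ValueError("corpusIds must be a list of strings")
    return AnswerRequest(query.strip(), user_id, corpus_ids)


def response_body(response: AnswerResponse) -> dict:
    """Serialize the response, including nested citations, to plain dicts."""
    return asdict(response)
```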

Data model

  • documents(id, source, version, acl_hash, updated_at).
  • chunks(id, document_id, ordinal, text_hash, embedding_id, citation_span).
  • answer_events(query_id, retrieved_chunk_ids, model_version, feedback).
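
The data model above could be sketched as a relational schema. The example uses SQLite through Python's standard `sqlite3` module purely for illustration; the column types, the uniqueness constraint, and storing `retrieved_chunk_ids` as a JSON string are assumptions, not choices stated in the data model.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    id          TEXT PRIMARY KEY,
    source      TEXT NOT NULL,
    version     INTEGER NOT NULL,
    acl_hash    TEXT NOT NULL,
    updated_at  TEXT NOT NULL
);
CREATE TABLE chunks (
    id             TEXT PRIMARY KEY,
    document_id    TEXT NOT NULL REFERENCES documents(id),
    ordinal        INTEGER NOT NULL,
    text_hash      TEXT NOT NULL,
    embedding_id   TEXT,              -- NULL until the chunk has been embedded
    citation_span  TEXT NOT NULL,
    UNIQUE (document_id, ordinal)
);
CREATE TABLE answer_events (
    query_id             TEXT PRIMARY KEY,
    retrieved_chunk_ids  TEXT NOT NULL,   -- JSON-encoded list (assumption)
    model_version        TEXT NOT NULL,
    feedback             TEXT
);
"""


def open_store(path=":memory:"):
    """Create the three tables and enforce the chunk -> document reference."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

A nullable `embedding_id` makes embedding backlog queryable: chunks that exist but have no embedding yet are exactly the rows where it is NULL.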

Architecture components

  • Ingestion service extracts text, chunks it, embeds the chunks, and writes them to a vector index.
  • Query path performs lexical retrieval, vector retrieval, reranking, and answer generation.
  • Evaluation jobs replay labeled queries against retriever and generator versions.
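
The query path needs some way to merge the lexical and vector candidate lists before reranking. The architecture above does not say how; one common option is reciprocal rank fusion (RRF), sketched here as an assumption. The function name and the constant `k=60` (a conventional default) are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list.

    Each chunk scores 1 / (k + rank) per list it appears in, with rank
    starting at 1. Ties are broken by chunk id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


lexical_hits = ["c1", "c2", "c3"]   # e.g. from a keyword index
vector_hits = ["c2", "c4", "c1"]    # e.g. from a nearest-neighbor index
candidates = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

RRF uses only ranks, so lexical and vector scores never need to be calibrated against each other; the fused list would then go to the reranker.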

Bottlenecks

  • Embedding backfills can lag behind document updates.
  • Vector search can return plausible but unauthorized chunks if ACL filters are bolted on late.
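
To illustrate why ACL filtering belongs inside the search rather than after it, the sketch below filters candidates by group membership before scoring. It uses a brute-force cosine search over an in-memory list; the entry layout (`chunk_id`, `vector`, `allowed_groups`) and the group-based ACL model are assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_with_acl(query_vec, index, user_groups, top_k=3):
    """Return the top_k chunk ids the user is allowed to read.

    The ACL check runs before ranking, so unauthorized chunks never
    occupy a slot in the result list.
    """
    readable = [entry for entry in index if entry["allowed_groups"] & user_groups]
    ranked = sorted(readable, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return [entry["chunk_id"] for entry in ranked[:top_k]]


index = [
    {"chunk_id": "secret-1", "vector": [1.0, 0.0], "allowed_groups": {"hr"}},
    {"chunk_id": "public-1", "vector": [0.9, 0.1], "allowed_groups": {"eng", "hr"}},
    {"chunk_id": "public-2", "vector": [0.5, 0.5], "allowed_groups": {"eng"}},
    {"chunk_id": "public-3", "vector": [0.0, 1.0], "allowed_groups": {"eng"}},
]
eng_results = search_with_acl([1.0, 0.0], index, {"eng"}, top_k=2)
```

If the filter ran after a top-2 search instead, the engineer would receive only one usable chunk, because the best-scoring slot would have been spent on a chunk they cannot read.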

Failure modes

  • Low retriever confidence: refuse to answer and suggest the closest sources instead.
  • Index update lag: show document freshness metadata in citations.
  • Model regression: roll back the model version and keep retriever logs for replay.
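
The low-confidence behavior can be sketched as a gate in front of the generator. The threshold value, the use of the top retrieval score as the confidence signal, and the response shape are all assumptions for illustration; `generate` stands in for whatever model call the system uses.

```python
def answer_or_refuse(query, scored_chunks, generate, min_score=0.5, max_suggestions=3):
    """Generate only when retrieval looks trustworthy; otherwise refuse explicitly.

    scored_chunks: list of (chunk_id, score) pairs sorted by descending score.
    generate: callable taking (query, chunk_ids) and returning answer text.
    """
    top_score = scored_chunks[0][1] if scored_chunks else 0.0
    if top_score < min_score:
        return {
            "refused": True,
            "answer": None,
            "confidence": top_score,
            # Closest sources are offered as reading suggestions, not as evidence.
            "suggestions": [chunk_id for chunk_id, _ in scored_chunks[:max_suggestions]],
        }
    supporting = [chunk_id for chunk_id, score in scored_chunks if score >= min_score]
    return {
        "refused": False,
        "answer": generate(query, supporting),
        "confidence": top_score,
        "citations": supporting,
    }
```

Keeping the refusal as a structured result, rather than a specially worded answer string, lets the refusal rate be counted directly.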

Observability

  • Retrieval recall on golden queries, citation coverage, refusal rate, answer latency.
  • Chunk freshness lag and unauthorized-hit prevention counters.
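
Several of these metrics are simple ratios over logged events. The definitions below are one reasonable reading, stated as assumptions: recall is measured against labeled relevant chunks per golden query, and citation coverage is the fraction of answer sentences that carry at least one citation.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of labeled relevant chunks that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("a golden query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def citation_coverage(sentences):
    """sentences: list of (text, cited_chunk_ids). Fraction with >= 1 citation."""
    if not sentences:
        return 0.0
    return sum(1 for _, cited in sentences if cited) / len(sentences)


def refusal_rate(events):
    """events: list of dicts with a boolean 'refused' field."""
    if not events:
        return 0.0
    return sum(1 for event in events if event["refused"]) / len(events)
```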

Security / privacy

  • Apply ACL filters before generation and recheck citations before response.
  • Avoid logging raw private queries where retention has not been reviewed.
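
Both points can be sketched briefly: a final citation recheck against the user's groups, and a keyed fingerprint that lets logs correlate repeated queries without storing their text. The group-based ACL model, the helper names, and the choice of HMAC-SHA256 are assumptions for illustration.

```python
import hashlib
import hmac


def recheck_citations(cited_chunk_ids, chunk_groups, user_groups):
    """Split citations into (kept, dropped) just before the response is sent.

    chunk_groups maps chunk id -> set of groups allowed to read it.
    Unknown chunks are treated as unreadable.
    """
    kept, dropped = [], []
    for chunk_id in cited_chunk_ids:
        allowed = chunk_groups.get(chunk_id, set())
        (kept if allowed & user_groups else dropped).append(chunk_id)
    return kept, dropped


def query_fingerprint(query, secret_key):
    """Keyed hash of the query text, for logs that must not hold raw queries."""
    return hmac.new(secret_key, query.encode("utf-8"), hashlib.sha256).hexdigest()
```

A non-empty `dropped` list at this stage means an earlier filter failed, so it is worth counting as an unauthorized-hit prevention event rather than silently discarding.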

Cost considerations

  • Generation dominates per-query cost; reranking and long context add second-order costs.
  • Embedding cost follows changed chunks, not only changed documents.
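
The point about changed chunks can be made concrete with the `text_hash` column from the data model: only chunks whose hash has not been seen need a new embedding. The helper names below are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def text_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_embedding_work(stored_hashes, new_chunk_texts):
    """Return (to_embed, reused) for a new document version.

    stored_hashes: set of text hashes that already have embeddings.
    """
    to_embed, reused = [], []
    for text in new_chunk_texts:
        (reused if text_hash(text) in stored_hashes else to_embed).append(text)
    return to_embed, reused


# A document update that touches one of three chunks.
stored = {text_hash("intro"), text_hash("setup v1"), text_hash("faq")}
to_embed, reused = plan_embedding_work(stored, ["intro", "setup v2", "faq"])
```

Note that this saving depends on stable chunk boundaries: with fixed-offset windows, a small insertion near the top of a document shifts every later chunk and changes every hash.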

Tradeoffs

  • Smaller chunks improve pinpoint citations but can lose context.
  • Hybrid retrieval is more complex than vector-only retrieval but protects exact identifiers.

ML-specific concerns

  • Training/serving skew: the chunking applied at ingestion time must match the spans used to resolve citations at query time.
  • Evaluation must separate retriever recall, citation support, and answer helpfulness.
  • Prompt, model, retriever, reranker, and corpus versions need lineage on every answer.
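
Lineage on every answer can be as simple as an immutable record of the five versions, plus a short stable identifier for grouping answers in evaluation and rollback analysis. The `AnswerLineage` type and `lineage_id` helper are illustrative assumptions.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class AnswerLineage:
    prompt_version: str
    model_version: str
    retriever_version: str
    reranker_version: str
    corpus_version: str


def lineage_id(lineage):
    """Stable short id: identical version tuples always map to the same id."""
    canonical = json.dumps(asdict(lineage), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Storing the id alongside each answer event lets a regression be traced to the single component whose version changed between two lineage records.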

Rubric

Criterion | Weight | Evidence
clarification: Separates product behavior from infrastructure assumptions before drawing boxes. | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
scale: Turns traffic and data assumptions into concrete sizing constraints. | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
architecture: Draws clear service, cache, queue, and storage boundaries with reasons for each split. | 20 | The component diagram has one owner per responsibility and names the synchronous path.
data: Defines durable state, indexes, keys, and idempotency records. | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
failure: Names failure modes and the recovery behavior users see. | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
observability: Defines the small set of metrics and traces needed to debug the design. | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
tradeoffs: Explains what is being sacrificed and why that sacrifice fits the prompt. | 15 | Compares at least two viable designs and names the losing design's advantage.
ml-specific: Covers the model, data, evaluation, deployment, and monitoring loop as one system. | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.