ML system design

Design a RAG system

Build retrieval, generation, citation, and evaluation loops that do not collapse into a demo prompt.

chunking, embedding retrieval, reranking, evaluation

Prompt

Design a retrieval-augmented generation system for internal technical documentation. Users ask natural-language questions and expect cited answers with low hallucination risk.

Clarifying questions

Which document sources are authoritative and how often do they change?
Do answers need exact quotations, summaries, or both?
What is the acceptable behavior when retrieval confidence is low?

Functional requirements

Ingest and chunk documents with provenance.
Retrieve candidate passages and generate cited answers.
Collect feedback and route failed answers into evaluation sets.

Nonfunctional requirements

Keep answer latency under 5 seconds for ordinary queries.
Do not answer from documents the user cannot access.
Make low-confidence behavior explicit instead of inventing an answer.

Scale assumptions

Five million documents, 200 million chunks.
1,000 peak queries per minute.
Documents update continuously from multiple source systems.
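
The assumptions above can be turned into rough sizing numbers. The sketch below is back-of-envelope arithmetic only; the embedding dimension (768) and float32 storage are assumptions that the prompt does not state, so the vector-storage figure should be read as an order of magnitude.

```python
# Back-of-envelope sizing from the stated scale assumptions.
DOCUMENTS = 5_000_000
CHUNKS = 200_000_000
PEAK_QPM = 1_000

EMBED_DIM = 768        # assumption: not given in the prompt
BYTES_PER_VALUE = 4    # assumption: float32 vectors, no quantization


def raw_vector_bytes(chunks: int, dim: int, bytes_per_value: int) -> int:
    """Size of the raw embedding matrix, excluding index overhead."""
    return chunks * dim * bytes_per_value


chunks_per_doc = CHUNKS // DOCUMENTS
peak_qps = PEAK_QPM / 60
vector_gb = raw_vector_bytes(CHUNKS, EMBED_DIM, BYTES_PER_VALUE) / 1e9

print(f"chunks per document ~ {chunks_per_doc}")   # 40
print(f"peak QPS ~ {peak_qps:.1f}")                # 16.7
print(f"raw vectors ~ {vector_gb:.1f} GB")         # 614.4 GB
```

Under these assumptions the query rate is modest (about 17 QPS at peak), while the raw vectors alone are several hundred gigabytes, which suggests the index, not request throughput, is the main capacity constraint.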

API sketch

POST /v1/answer { query, userId, corpusIds } -> { answer, citations, confidence }
POST /internal/ingest/document { sourceId, version, acl, bodyRef }
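
One way to pin down the /v1/answer contract is with typed request and response records. This is a minimal sketch: the field names follow the API sketch above, while the Citation fields, the parse_answer_request helper, and its validation rules are hypothetical illustrations. An answer of None is used here to represent an explicit refusal.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AnswerRequest:
    query: str
    user_id: str
    corpus_ids: Tuple[str, ...]


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    document_id: str
    document_version: str  # hypothetical: lets clients show freshness


@dataclass(frozen=True)
class AnswerResponse:
    answer: Optional[str]          # None models an explicit refusal
    citations: Tuple[Citation, ...]
    confidence: float


def parse_answer_request(payload: dict) -> AnswerRequest:
    """Validate a decoded JSON body for POST /v1/answer (hypothetical helper)."""
    query = str(payload.get("query", "")).strip()
    user_id = str(payload.get("userId", "")).strip()
    corpus_ids = tuple(payload.get("corpusIds") or ())
    if not query:
        raise ValueError("query must be non-empty")
    if not user_id:
        # Without a user identity there is nothing to evaluate ACLs against.
        raise ValueError("userId is required")
    return AnswerRequest(query=query, user_id=user_id, corpus_ids=corpus_ids)
```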

Data model

documents(id, source, version, acl_hash, updated_at).
chunks(id, document_id, ordinal, text_hash, embedding_id, citation_span).
answer_events(query_id, retrieved_chunk_ids, model_version, feedback).
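
To illustrate how the chunks fields might be populated, here is a sketch of a fixed-size chunker that records ordinal, text_hash, and citation_span for each chunk. The character-offset span, the window size, and the overlap are assumptions; a real system might chunk on headings or tokens instead.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    document_id: str
    ordinal: int
    text: str
    text_hash: str
    citation_span: Tuple[int, int]  # assumption: [start, end) character offsets


def chunk_document(document_id: str, body: str,
                   size: int = 200, overlap: int = 50) -> List[Chunk]:
    """Split body into overlapping windows, keeping provenance per chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    chunks: List[Chunk] = []
    start = 0
    while start < len(body):
        end = min(start + size, len(body))
        text = body[start:end]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        chunks.append(Chunk(document_id, len(chunks), text, digest, (start, end)))
        if end == len(body):
            break
        start += size - overlap
    return chunks
```

Because each span indexes back into the exact document version that was chunked, a citation can be resolved to the original text as long as the version is stored alongside it.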

Architecture components

Ingestion service extracts text, chunks, embeds, and writes a vector index.
Query path performs lexical retrieval, vector retrieval, reranking, and answer generation.
Evaluation jobs replay labeled queries against retriever and generator versions.
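
The query path needs some way to merge the lexical and vector candidate lists before reranking. The source does not name a fusion method; the sketch below uses reciprocal rank fusion as one common choice, with the conventional constant k = 60 as an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[str]:
    """Merge ranked lists of chunk ids; each list contributes 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


lexical_hits = ["c1", "c2", "c3"]   # e.g. from a keyword index
vector_hits = ["c3", "c1", "c4"]    # e.g. from nearest-neighbor search
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
print(fused)  # ['c1', 'c3', 'c2', 'c4']
```

Rank-based fusion avoids having to calibrate lexical scores against cosine similarities, at the cost of discarding score magnitudes; the reranker then operates on the fused candidates.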

Bottlenecks

Embedding backfills can lag behind document updates.
Vector search can return plausible but unauthorized chunks if ACL filters are bolted on late.
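
The ACL point can be made concrete with a toy search that filters before scoring, so an unauthorized chunk is never a candidate. This is a brute-force sketch over an in-memory dictionary; the group-set ACL representation and the search_with_acl name are assumptions, and a production vector store would need to push the same filter into its index.

```python
import math
from typing import Dict, FrozenSet, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_with_acl(query_vec: Sequence[float],
                    index: Dict[str, Tuple[Sequence[float], FrozenSet[str]]],
                    user_groups: FrozenSet[str],
                    top_k: int = 5) -> List[Tuple[str, float]]:
    """Score only chunks whose ACL shares at least one group with the user."""
    allowed = [(cid, vec) for cid, (vec, acl) in index.items()
               if acl & user_groups]
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in allowed]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]
```

Filtering after taking the top k is the failure the bottleneck describes: the best-matching unauthorized chunks can crowd out every chunk the user is allowed to see.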

Failure modes

Retriever confidence low: return a refusal with closest source suggestions.
Index update lag: show document freshness metadata in citations.
Model regression: roll back the model version and keep retriever logs for replay.
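
The low-confidence path might look like the following sketch. The threshold value, the use of the top retrieval score as the confidence signal, and the answer_or_refuse name are all assumptions; the generator is passed in as a plain callable so the example stays self-contained.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def answer_or_refuse(scored_chunks: Sequence[Tuple[str, float]],
                     generate: Callable[[List[str]], str],
                     min_score: float = 0.5,        # hypothetical threshold
                     max_suggestions: int = 3) -> Dict[str, object]:
    """Generate only when retrieval is confident; otherwise refuse explicitly."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    top_score = ranked[0][1] if ranked else 0.0
    if top_score < min_score:
        return {
            "answer": None,
            "refused": True,
            "citations": [],
            # Closest sources are offered as pointers, not as support.
            "suggestions": [cid for cid, _ in ranked[:max_suggestions]],
            "confidence": top_score,
        }
    supporting = [cid for cid, score in ranked if score >= min_score]
    return {
        "answer": generate(supporting),
        "refused": False,
        "citations": supporting,
        "suggestions": [],
        "confidence": top_score,
    }
```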

Observability

Retrieval recall on golden queries, citation coverage, refusal rate, answer latency.
Chunk freshness lag and unauthorized-hit prevention counters.
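
A few of these metrics are simple enough to sketch directly. The definitions below are assumptions made for illustration: recall is measured against a labeled set of relevant chunks per golden query, and citation coverage is taken as the share of non-refused answers that carry at least one citation.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of labeled relevant chunks found in the top k results."""
    if not relevant:
        raise ValueError("golden query needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def citation_coverage(answers: Sequence[dict]) -> float:
    """Share of non-refused answers with at least one citation."""
    answered = [a for a in answers if not a["refused"]]
    if not answered:
        return 0.0
    return sum(1 for a in answered if a["citations"]) / len(answered)


def refusal_rate(answers: Sequence[dict]) -> float:
    if not answers:
        return 0.0
    return sum(1 for a in answers if a["refused"]) / len(answers)
```

Tracking refusal rate next to coverage matters because either number can be improved trivially by degrading the other.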

Security / privacy

Apply ACL filters before generation and recheck citations before response.
Avoid logging raw private queries where retention has not been reviewed.
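
Both points can be sketched in a few lines. The recheck below withholds the response if any cited chunk is no longer readable by the user at response time, and the logging helper records a keyed digest instead of the raw query. The function names, the group-set ACL model, and the choice of HMAC-SHA256 are assumptions, not a prescribed design.

```python
import hashlib
import hmac
from typing import Callable, FrozenSet, List, Optional, Sequence


def recheck_citations(cited_chunk_ids: Sequence[str],
                      current_acl: Callable[[str], Optional[FrozenSet[str]]],
                      user_groups: FrozenSet[str]) -> Optional[List[str]]:
    """Return the citations if all are still readable, else None (withhold)."""
    for chunk_id in cited_chunk_ids:
        acl = current_acl(chunk_id)
        # A missing ACL is treated as deny: the chunk may have been deleted.
        if acl is None or not (acl & user_groups):
            return None
    return list(cited_chunk_ids)


def loggable_query(query: str, key: bytes) -> str:
    """Keyed digest that allows grouping repeated queries without raw text."""
    return hmac.new(key, query.encode("utf-8"), hashlib.sha256).hexdigest()
```

Withholding the whole answer, not only the failing citation, is the conservative choice here, since the generated text may already paraphrase the chunk that became inaccessible.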

Cost considerations

Generation dominates per-query cost; reranking and long context add second-order costs.
Embedding cost follows changed chunks, not only changed documents.
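
The embedding-cost point can be illustrated by diffing chunk hashes between document versions. This sketch assumes embeddings are keyed by text_hash, as the data model suggests, so a chunk whose text is unchanged can reuse its existing embedding_id; the function name is hypothetical.

```python
from typing import Dict, List


def chunks_needing_embedding(stored: Dict[str, str],
                             new_hashes: List[str]) -> List[int]:
    """Return ordinals in the new version whose text hash has no embedding.

    stored maps text_hash -> embedding_id for chunks already embedded.
    """
    return [ordinal for ordinal, text_hash in enumerate(new_hashes)
            if text_hash not in stored]


stored = {"h1": "e1", "h2": "e2", "h3": "e3"}
new_version = ["h1", "h2-edited", "h3", "h4"]
print(chunks_needing_embedding(stored, new_version))  # [1, 3]
```

The chunking strategy drives this cost: with fixed-offset windows, a one-line insertion near the top of a document shifts every later chunk boundary and changes most hashes, whereas structure-aware chunking tends to keep edits local.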

Tradeoffs

Smaller chunks improve pinpoint citations but can lose context.
Hybrid retrieval is more complex than vector-only retrieval but protects exact identifiers.

ML-specific concerns

Training/serving skew: ingestion-time chunking must match query-time citation spans.
Evaluation must separate retriever recall, citation support, and answer helpfulness.
Prompt, model, retriever, reranker, and corpus versions need lineage on every answer.
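
Lineage on every answer could be captured as a small immutable record stored with each answer event. The AnswerLineage class, its field names, and the short fingerprint are hypothetical; the point is only that all five versions travel together and can be compared cheaply during replay.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnswerLineage:
    prompt_version: str
    model_version: str
    retriever_version: str
    reranker_version: str
    corpus_version: str

    def fingerprint(self) -> str:
        """Stable short id for grouping answers produced by the same stack."""
        canonical = json.dumps(asdict(self), sort_keys=True,
                               separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Grouping evaluation results by fingerprint makes it possible to attribute a regression to the single component whose version changed between two runs.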

Rubric

Criterion	Tag	Weight	Evidence
Separates product behavior from infrastructure assumptions before drawing boxes.	clarification	10	The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
Turns traffic and data assumptions into concrete sizing constraints.	scale	15	Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
Draws clear service, cache, queue, and storage boundaries with reasons for each split.	architecture	20	The component diagram has one owner per responsibility and names the synchronous path.
Defines durable state, indexes, keys, and idempotency records.	data	15	Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
Names failure modes and the recovery behavior users see.	failure	15	Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
Defines the small set of metrics and traces needed to debug the design.	observability	10	Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
Explains what is being sacrificed and why that sacrifice fits the prompt.	tradeoffs	15	Compares at least two viable designs and names the losing design's advantage.
Covers the model, data, evaluation, deployment, and monitoring loop as one system.	ml-specific	20	The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling.