ML system design
Design a RAG system
Build retrieval, generation, citation, and evaluation loops that do not collapse into a demo prompt.
chunking, embedding retrieval, reranking, evaluation
Prompt
Design a retrieval-augmented generation system for internal technical documentation. Users ask natural-language questions and expect cited answers with low hallucination risk.
Clarifying questions
- Which document sources are authoritative and how often do they change?
- Do answers need exact quotations, summaries, or both?
- What is the acceptable behavior when retrieval confidence is low?
Functional requirements
- Ingest and chunk documents with provenance.
- Retrieve candidate passages and generate cited answers.
- Collect feedback and route failed answers into evaluation sets.
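A minimal sketch of chunking with provenance, assuming fixed-size character windows with overlap. The `Chunk` fields, the `chunk_document` helper, and the default sizes are hypothetical; they only illustrate that each chunk carries enough origin data (document, version, offsets) to be cited later. A real pipeline might split on headings or tokens instead.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    document_id: str
    version: str
    ordinal: int     # position of the chunk within the document
    start: int       # inclusive character offset into the source body
    end: int         # exclusive character offset into the source body
    text: str
    text_hash: str   # content hash, useful for spotting unchanged chunks


def chunk_document(document_id, version, body, size=200, overlap=50):
    """Split body into overlapping windows that remember where they came from."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(body):
        end = min(start + size, len(body))
        text = body[start:end]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        chunks.append(Chunk(document_id, version, len(chunks), start, end, text, digest))
        if end == len(body):
            break
        start += size - overlap  # step forward, keeping `overlap` characters
    return chunks
```

Because every chunk stores its offsets, a citation can be resolved back to the exact span with `body[chunk.start:chunk.end]`, provided the cited document version is still available.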
Nonfunctional requirements
- Keep answer latency under 5 seconds for ordinary queries.
- Do not answer from documents the user cannot access.
- Make low-confidence behavior explicit instead of inventing an answer.
Scale assumptions
- Five million documents, 200 million chunks.
- 1,000 peak queries per minute.
- Documents update continuously from multiple source systems.
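These assumptions can be turned into rough sizing numbers. The sketch below uses only the figures stated above plus the 5-second latency target; the embedding width (768) and float32 storage are assumptions made for illustration, not values from the prompt.

```python
# Figures taken from the scale assumptions and the latency requirement.
DOCUMENTS = 5_000_000
CHUNKS = 200_000_000
PEAK_QUERIES_PER_MINUTE = 1_000
LATENCY_TARGET_SECONDS = 5

# Assumptions, not from the prompt: embedding width and storage precision.
EMBEDDING_DIM = 768
BYTES_PER_VALUE = 4  # float32

chunks_per_document = CHUNKS / DOCUMENTS
peak_qps = PEAK_QUERIES_PER_MINUTE / 60

# Little's law: requests in flight = arrival rate * time spent in the system.
max_in_flight = peak_qps * LATENCY_TARGET_SECONDS

# Raw vector payload only; index structures and replicas add to this.
raw_vector_gb = CHUNKS * EMBEDDING_DIM * BYTES_PER_VALUE / 1e9

print(f"{chunks_per_document:.0f} chunks per document on average")
print(f"{peak_qps:.1f} queries per second at peak")
print(f"about {max_in_flight:.0f} requests in flight at the latency ceiling")
print(f"{raw_vector_gb:.1f} GB of raw vectors")
```

Under these assumptions the query rate is modest (about 17 per second), while raw vectors alone come to roughly 614 GB, which suggests the index footprint and update path deserve more attention than query throughput.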
API sketch
- POST /v1/answer { query, userId, corpusIds } -> { answer, citations, confidence }
- POST /internal/ingest/document { sourceId, version, acl, bodyRef }
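One way to pin down the `/v1/answer` contract is with typed request and response shapes. This is a sketch, not a defined schema: the `Citation` fields and the `parse_answer_request` helper are hypothetical, and a production service would likely derive the user identity from the authenticated session rather than trust a `userId` in the body.

```python
from dataclasses import dataclass


@dataclass
class AnswerRequest:
    query: str
    user_id: str
    corpus_ids: list


@dataclass
class Citation:
    # Hypothetical shape; the sketch above only says "citations".
    chunk_id: str
    document_id: str
    quote: str


@dataclass
class AnswerResponse:
    answer: str
    citations: list
    confidence: float


def parse_answer_request(payload):
    """Validate a decoded JSON body and map it onto AnswerRequest."""
    missing = [k for k in ("query", "userId", "corpusIds") if k not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    query = payload["query"]
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    if not isinstance(payload["corpusIds"], list):
        raise ValueError("corpusIds must be a list")
    return AnswerRequest(query.strip(), str(payload["userId"]), list(payload["corpusIds"]))
```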
Data model
- documents(id, source, version, acl_hash, updated_at).
- chunks(id, document_id, ordinal, text_hash, embedding_id, citation_span).
- answer_events(query_id, retrieved_chunk_ids, model_version, feedback).
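The three tables can be sketched as a schema using Python's built-in `sqlite3` module. Column names come from the list above; the column types, the uniqueness constraint on `(document_id, ordinal)`, and the choice to store `retrieved_chunk_ids` as a JSON string are assumptions for illustration, and SQLite stands in for whatever store the real system uses.

```python
import sqlite3

SCHEMA = """
CREATE TABLE documents (
    id          TEXT PRIMARY KEY,
    source      TEXT NOT NULL,
    version     TEXT NOT NULL,
    acl_hash    TEXT NOT NULL,
    updated_at  TEXT NOT NULL
);
CREATE TABLE chunks (
    id             TEXT PRIMARY KEY,
    document_id    TEXT NOT NULL REFERENCES documents(id),
    ordinal        INTEGER NOT NULL,
    text_hash      TEXT NOT NULL,
    embedding_id   TEXT,              -- NULL until the embedding is written
    citation_span  TEXT NOT NULL,
    UNIQUE (document_id, ordinal)
);
CREATE TABLE answer_events (
    query_id             TEXT PRIMARY KEY,
    retrieved_chunk_ids  TEXT NOT NULL,  -- JSON array of chunk ids
    model_version        TEXT NOT NULL,
    feedback             TEXT
);
"""


def open_store(path=":memory:"):
    """Create the schema and enforce the chunk -> document reference."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

A nullable `embedding_id` makes backfill lag visible: chunks that exist but have no embedding yet can be counted directly.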
Architecture components
- Ingestion service extracts text, chunks, embeds, and writes a vector index.
- Query path performs lexical retrieval, vector retrieval, reranking, and answer generation.
- Evaluation jobs replay labeled queries against retriever and generator versions.
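The query path needs some way to merge lexical and vector candidates before reranking. The design above does not say which; reciprocal rank fusion is one common option, swapped in here as an assumption. The function name is hypothetical, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one ranked list.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1,
    so chunks ranked well by more than one retriever rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


lexical_hits = ["c1", "c2", "c3"]
vector_hits = ["c2", "c4", "c1"]
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
# fused == ["c2", "c1", "c4", "c3"]
```

Rank-based fusion sidesteps the problem that lexical and vector scores live on different scales; the fused list is then what a reranker would reorder.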
Bottlenecks
- Embedding backfills can lag behind document updates.
- Vector search can return plausible but unauthorized chunks if ACL filters are bolted on late.
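The ACL bottleneck is easier to see in code. The sketch below filters candidates by access before scoring them, using a tiny in-memory index; the entry shape (`chunk_id`, `vector`, `allowed_groups`) and the group-based access model are assumptions, since the data model above only records an `acl_hash`.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_authorized(query_vec, index, user_groups, top_k=3):
    """Return the top_k chunk ids the user may read, most similar first."""
    # Filter first: unauthorized chunks never compete for a top_k slot.
    allowed = [e for e in index if e["allowed_groups"] & user_groups]
    ranked = sorted(allowed, key=lambda e: cosine(query_vec, e["vector"]), reverse=True)
    return [e["chunk_id"] for e in ranked[:top_k]]


index = [
    {"chunk_id": "public-1", "vector": [1.0, 0.0], "allowed_groups": {"eng"}},
    {"chunk_id": "secret-1", "vector": [0.9, 0.1], "allowed_groups": {"security"}},
    {"chunk_id": "public-2", "vector": [0.0, 1.0], "allowed_groups": {"eng", "security"}},
]
```

If the filter ran after the search instead, a query of `[0.9, 0.1]` with `top_k=1` would select `secret-1` and then drop it, leaving an engineer with no results even though `public-1` is a close match.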
Failure modes
- Retriever confidence low: return a refusal with closest source suggestions.
- Index update lag: show document freshness metadata in citations.
- Model regression: roll back the model version and keep retriever logs for replay.
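The low-confidence behavior can be sketched as an explicit branch ahead of generation. The function name, the response shape, and the `0.5` threshold are hypothetical; in practice the threshold would be calibrated against labeled queries rather than picked by hand.

```python
def build_response(scored_chunks, min_score=0.5, max_suggestions=3):
    """Decide between answering and refusing.

    scored_chunks is a list of (chunk_id, score) pairs in any order.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    supporting = [chunk_id for chunk_id, score in ranked if score >= min_score]
    if not supporting:
        # Nothing clears the bar: refuse, but point at the nearest sources.
        return {
            "mode": "refusal",
            "suggested_sources": [chunk_id for chunk_id, _ in ranked[:max_suggestions]],
        }
    # Only chunks above the threshold are passed on to generation.
    return {"mode": "answer", "context_chunk_ids": supporting}
```

Returning the mode as data, instead of burying the refusal in generated text, also makes the refusal rate straightforward to count.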
Observability
- Retrieval recall on golden queries, citation coverage, refusal rate, answer latency.
- Chunk freshness lag and unauthorized-hit prevention counters.
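Two of these metrics are simple enough to define in a few lines. The definitions below are one reasonable reading, not the only one: recall is measured against a labeled set of relevant chunks, and citation coverage is taken to mean the fraction of answer sentences that carry at least one citation. Function names are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the labeled relevant chunks found in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("a golden query needs at least one relevant chunk")
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def citation_coverage(sentences):
    """Fraction of answer sentences that cite at least one chunk.

    sentences is a list of (text, cited_chunk_ids) pairs.
    """
    if not sentences:
        return 0.0
    cited = sum(1 for _, chunk_ids in sentences if chunk_ids)
    return cited / len(sentences)
```

Coverage only checks that a citation is present; whether the cited chunk actually supports the sentence is a separate judgment that needs labels or a grader.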
Security / privacy
- Apply ACL filters before generation and recheck citations before response.
- Avoid logging raw private queries where retention has not been reviewed.
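One way to keep query logs useful without storing raw text is to log a keyed hash of the query. This sketch uses Python's standard `hmac` and `hashlib` modules; the record fields and the `make_log_record` name are hypothetical, and whether a keyed hash satisfies a given retention policy is a question for that review, not something the code settles.

```python
import hashlib
import hmac


def make_log_record(query, retrieved_chunk_ids, key):
    """Build a log entry that can be grouped by query without holding its text."""
    digest = hmac.new(key, query.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "query_hmac": digest,         # stable per (key, query), not reversible by lookup alone
        "query_length": len(query),   # coarse signal that is still useful for debugging
        "retrieved_chunk_ids": list(retrieved_chunk_ids),
    }
```

Identical queries hash to the same value under one key, so repeated failures can still be counted; rotating the key breaks that linkage for older records.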
Cost considerations
- Generation dominates per-query cost; reranking and long context add second-order costs.
- Embedding cost follows changed chunks, not only changed documents.
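The point about changed chunks can be sketched with the `text_hash` column from the data model: only chunks whose hash is new need an embedding call. The helper names are hypothetical, and SHA-256 is an assumed choice of hash.

```python
import hashlib


def text_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_embedding_work(previous_hashes, new_chunk_texts):
    """Split the new version's chunk ordinals into (to_embed, reused).

    previous_hashes is the set of text hashes from the prior document version.
    """
    to_embed, reused = [], []
    for ordinal, text in enumerate(new_chunk_texts):
        if text_hash(text) in previous_hashes:
            reused.append(ordinal)     # identical text: keep the old embedding
        else:
            to_embed.append(ordinal)   # new or edited text: pay for an embedding
    return to_embed, reused


old = {text_hash(t) for t in ["intro", "setup", "faq"]}
to_embed, reused = plan_embedding_work(old, ["intro", "setup v2", "faq", "appendix"])
# to_embed == [1, 3], reused == [0, 2]
```

How much this saves depends on the chunker. With fixed-offset windows, a one-line insertion near the top of a document shifts every later boundary, so most hashes change; structure-aware chunking tends to keep the changed set small.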
Tradeoffs
- Smaller chunks improve pinpoint citations but can lose context.
- Hybrid retrieval is more complex than vector-only retrieval but protects exact identifiers.
ML-specific concerns
- Training/serving skew: ingestion-time chunking must match the spans that citations resolve against at query time.
- Evaluation must separate retriever recall, citation support, and answer helpfulness.
- Prompt, model, retriever, reranker, and corpus versions need lineage on every answer.
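Lineage on every answer can be as small as one immutable record stored with each answer event. The field names mirror the five components listed above; the `AnswerLineage` class and its `fingerprint` method are a hypothetical sketch, and the 16-character truncation is an arbitrary choice for readability.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnswerLineage:
    prompt_version: str
    model_version: str
    retriever_version: str
    reranker_version: str
    corpus_version: str

    def fingerprint(self):
        """Short, stable id for this exact combination of versions."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:16]
```

Grouping evaluation results and user feedback by fingerprint shows which combination of versions produced a regression, which is what makes a targeted rollback possible.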
Rubric
| Criterion | Weight | Evidence |
|---|---|---|
| Clarification: Separates product behavior from infrastructure assumptions before drawing boxes. | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Scale: Turns traffic and data assumptions into concrete sizing constraints. | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Architecture: Draws clear service, cache, queue, and storage boundaries with reasons for each split. | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Data: Defines durable state, indexes, keys, and idempotency records. | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Failure: Names failure modes and the recovery behavior users see. | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Observability: Defines the small set of metrics and traces needed to debug the design. | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Tradeoffs: Explains what is being sacrificed and why that sacrifice fits the prompt. | 15 | Compares at least two viable designs and names the losing design's advantage. |
| ML-specific: Covers the model, data, evaluation, deployment, and monitoring loop as one system. | 20 | The answer includes lineage, offline eval, online eval, rollback, freshness, and drift handling. |