classic system design
Design an observability pipeline
Collect logs, metrics, and traces without making every outage an ingestion-cost problem.
telemetry ingestion, sampling, cardinality control, retention tiers
Prompt
Design an observability pipeline for a multi-service platform. Engineers need metrics, logs, traces, and alerts during incidents without unbounded data costs.
Clarifying questions
- Which telemetry is needed for alerting versus debugging?
- How long must raw logs and traces be retained?
- Are tenants allowed to query their own telemetry?
Functional requirements
- Ingest service metrics, structured logs, and distributed traces.
- Store telemetry with query APIs and dashboards.
- Trigger alerts from service-level indicators.
Nonfunctional requirements
- Keep ingestion available during partial regional outages.
- Control high-cardinality labels before they reach expensive storage.
- Make alert evaluation independent of ad hoc log search.
Scale assumptions
- 10 TB of logs per day before filtering.
- One million metric samples per second.
- Trace sampling varies by service and error class.
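The first two assumptions convert directly into sustained ingest rates. A minimal sketch of that arithmetic, assuming decimal terabytes and perfectly even traffic (real traffic peaks well above the average):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

LOG_BYTES_PER_DAY = 10 * 10**12        # 10 TB/day, decimal terabytes (assumption)
METRIC_SAMPLES_PER_SECOND = 1_000_000  # from the scale assumptions


def bytes_per_second(bytes_per_day: float) -> float:
    """Average sustained rate if the daily volume arrived evenly."""
    return bytes_per_day / SECONDS_PER_DAY


log_rate = bytes_per_second(LOG_BYTES_PER_DAY)
samples_per_day = METRIC_SAMPLES_PER_SECOND * SECONDS_PER_DAY

print(f"average log ingest: {log_rate / 1e6:.1f} MB/s")   # 115.7 MB/s
print(f"metric samples per day: {samples_per_day:,}")      # 86,400,000,000
```

These are averages only; the gateway and collectors need headroom for incident spikes on top of them.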
API sketch
- POST /v1/telemetry/logs: batched structured log ingestion.
- POST /v1/telemetry/traces: OpenTelemetry-compatible trace ingestion.
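Batching on the client side might look like the sketch below. The request body shape (`{"logs": [...]}`) and the `max_records` limit are assumptions for illustration; the prompt does not define a wire format.

```python
import json
from typing import Iterable, Iterator, List


def build_log_batches(records: Iterable[dict], max_records: int = 500) -> Iterator[bytes]:
    """Group log records into JSON request bodies of at most max_records each.

    Each yielded body would be sent as one POST /v1/telemetry/logs request.
    The {"logs": [...]} envelope is hypothetical.
    """
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) >= max_records:
            yield json.dumps({"logs": batch}).encode("utf-8")
            batch = []
    if batch:  # flush the final partial batch
        yield json.dumps({"logs": batch}).encode("utf-8")
```

A real collector would typically also flush on a timer, so that a quiet service does not hold records indefinitely.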
Data model
- metrics(series_id, labels_hash, timestamp, value).
- logs(service, severity, timestamp, trace_id, body_ref, labels).
- traces(trace_id, span_id, parent_span_id, service, duration_ms, status).
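One way to read the metrics and traces rows is as typed records, with `labels_hash` derived from the label set. The hashing scheme below (SHA-256 over sorted `key=value` pairs, truncated to 16 hex characters) is an assumption, not something the data model specifies.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


def labels_hash(labels: Dict[str, str]) -> str:
    """Order-independent hash of a label set (hypothetical scheme)."""
    # Sort so that {"a": "1", "b": "2"} and {"b": "2", "a": "1"} hash identically.
    canonical = "\x1f".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class MetricSample:
    series_id: str
    labels_hash: str
    timestamp: float
    value: float


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    service: str
    duration_ms: float
    status: str
```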
Architecture components
- Collectors batch and sample telemetry near services.
- Ingestion gateway validates labels and routes by telemetry type.
- Hot storage serves recent incident queries; cold storage holds compressed archives.
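The gateway's two jobs, validating labels and routing by telemetry type, can be sketched as a single function. The backend names and the label allowlist here are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# Hypothetical backend names and label allowlist, for illustration only.
ROUTES = {"metrics": "metrics-store", "logs": "log-store", "traces": "trace-store"}
ALLOWED_LABELS = frozenset({"service", "region", "env", "status_code"})


def route(telemetry_type: str, labels: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (destination, rejected_labels) for one incoming record."""
    if telemetry_type not in ROUTES:
        raise ValueError(f"unknown telemetry type: {telemetry_type!r}")
    rejected = sorted(set(labels) - ALLOWED_LABELS)
    if rejected:
        # Unknown labels never reach storage; the caller can count and report them.
        return ("rejected", rejected)
    return (ROUTES[telemetry_type], [])
```

Rejecting the whole record is one possible policy; stripping only the unknown labels and accepting the rest is a softer alternative that loses less data.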
Bottlenecks
- Unbounded labels can explode metric series count.
- Incident spikes can overload the same pipeline engineers need to debug.
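The series explosion follows from multiplication: the worst-case number of series for one metric is the product of the distinct values of each label. The label names and counts below are made up to show the effect of adding a single unbounded label.

```python
import math
from typing import Dict


def max_series(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on series count: every combination of label values occurs."""
    return math.prod(distinct_values_per_label.values())


# Hypothetical label cardinalities.
bounded = {"service": 200, "region": 5, "status_code": 10}
unbounded = {**bounded, "user_id": 100_000}

print(max_series(bounded))    # 10000
print(max_series(unbounded))  # 1000000000
```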
Failure modes
- Ingestion overload: drop debug logs before metrics and error traces.
- Storage outage: buffer telemetry at collectors, with a bounded buffer size.
- Bad label rollout: enforce allowlists and reject unknown high-cardinality labels.
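The first two failure modes combine naturally into a bounded buffer that sheds low-value telemetry first. This is a sketch under assumptions: the kind names and their priority numbers are hypothetical, and the linear scan for a victim would need a better structure at real volumes.

```python
from collections import Counter
from typing import Any, List, Tuple

# Lower number = dropped first. The ordering is an assumption that mirrors
# "drop debug logs before metrics and error traces".
PRIORITY = {"debug_log": 0, "info_log": 1, "trace": 2, "metric": 3, "error_trace": 3}


class BoundedBuffer:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[Tuple[str, Any]] = []
        self.dropped: Counter = Counter()  # drops by telemetry kind

    def offer(self, kind: str, payload: Any) -> bool:
        """Try to buffer an item; return False if the item itself was dropped."""
        if len(self.items) < self.capacity:
            self.items.append((kind, payload))
            return True
        # Buffer is full: find the oldest item with the lowest priority.
        victim = min(range(len(self.items)), key=lambda i: PRIORITY[self.items[i][0]])
        victim_kind = self.items[victim][0]
        if PRIORITY[victim_kind] < PRIORITY[kind]:
            self.items.pop(victim)  # evict to make room for more valuable data
            self.dropped[victim_kind] += 1
            self.items.append((kind, payload))
            return True
        self.dropped[kind] += 1
        return False
```

The `dropped` counter doubles as a source for a "dropped telemetry by kind" metric, so shedding stays visible rather than silent.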
Observability
- Ingestion lag, dropped telemetry by reason, query latency, alert evaluation lag.
- Cardinality growth by service and label.
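Cardinality growth by service and label can be tracked with a small counter at the gateway. The sketch below keeps exact sets of observed values, which is only reasonable for illustration; a production tracker would more likely use an approximate distinct-count structure such as HyperLogLog. The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class CardinalityTracker:
    """Counts distinct label values per (service, label) pair."""

    def __init__(self) -> None:
        self._values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def observe(self, service: str, labels: Dict[str, str]) -> None:
        for name, value in labels.items():
            self._values[(service, name)].add(value)

    def top(self, n: int = 5) -> List[Tuple[Tuple[str, str], int]]:
        """Return the n (service, label) pairs with the most distinct values."""
        counts = {key: len(values) for key, values in self._values.items()}
        return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Comparing `top()` snapshots over time gives the growth signal: a label whose count climbs steadily after a deploy is a candidate for the allowlist check.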
Security / privacy
- Redact secrets before storage and block common token patterns.
- Partition tenant-visible telemetry from platform-internal telemetry.
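Redaction before storage can start as a small set of regular expressions applied to each log body. The two patterns below are illustrative assumptions, not a complete secret-detection ruleset.

```python
import re
from typing import List, Pattern, Tuple

# (pattern, replacement) pairs; illustrative only, not an exhaustive list.
SECRET_PATTERNS: List[Tuple[Pattern[str], str]] = [
    # HTTP bearer tokens, e.g. "Bearer abc123..."
    (re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]{8,}"), "Bearer [REDACTED]"),
    # key=value or key: value pairs with a sensitive-looking key
    (re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key|token)\s*[=:]\s*\S+"),
     r"\1=[REDACTED]"),
]


def redact(text: str) -> str:
    """Replace anything matching a known secret pattern before the text is stored."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("login ok password=hunter2 user=bob"))
# login ok password=[REDACTED] user=bob
```

Pattern matching catches only known shapes, so it complements rather than replaces keeping secrets out of log statements in the first place.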
Cost considerations
- Logs dominate storage cost; sampling and retention tiers need explicit budgets.
- Tail-sampling traces increases collector memory, because spans must be buffered until the keep-or-drop decision for each trace.
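An explicit retention budget can be checked with simple steady-state arithmetic: once a tier is full, it holds the daily kept volume times the days retained, divided by the compression ratio. The keep fraction, retention windows, and compression ratio below are hypothetical values chosen only to make the calculation concrete.

```python
TB = 10**12  # decimal terabyte


def steady_state_bytes(daily_bytes: float, keep_fraction: float,
                       retention_days: int, compression_ratio: float = 1.0) -> float:
    """Bytes held once the tier is full."""
    return daily_bytes * keep_fraction * retention_days / compression_ratio


DAILY_LOG_BYTES = 10 * TB  # from the scale assumptions, before filtering

# Hypothetical policy: keep 30% after filtering, 7 days hot, 90 days cold at 10x compression.
hot = steady_state_bytes(DAILY_LOG_BYTES, keep_fraction=0.3, retention_days=7)
cold = steady_state_bytes(DAILY_LOG_BYTES, keep_fraction=0.3, retention_days=90,
                          compression_ratio=10)

print(f"hot tier:  {hot / TB:.1f} TB")   # 21.0 TB
print(f"cold tier: {cold / TB:.1f} TB")  # 27.0 TB
```

Changing any one input shows which lever matters most for a given budget: filtering, retention length, or compression.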
Tradeoffs
- Head sampling is cheap but misses rare failures.
- Tail sampling captures failures better but delays trace export.
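The difference between the two strategies is when the decision is made and what it can see. In this sketch, head sampling decides from the trace ID alone, while tail sampling waits for the buffered spans and always keeps errors and slow traces. The span dictionary keys follow the data model; the `slow_ms` and `base_rate` parameters are assumptions.

```python
import hashlib
from typing import Dict, List


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start using only the trace ID.

    Hashing the ID makes the decision deterministic, so every service
    reaches the same answer for the same trace.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(spans: List[Dict], slow_ms: float, base_rate: float) -> bool:
    """Decide after the trace is buffered: keep errors and slow traces."""
    if not spans:
        return False
    if any(span["status"] == "error" for span in spans):
        return True
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True
    # Healthy, fast traces fall back to a low base sampling rate.
    return head_sample(spans[0]["trace_id"], base_rate)
```

The tail sampler needs the full span list, which is exactly the buffering that delays export and costs collector memory.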
Rubric
| Criterion | Weight | Evidence |
|---|---|---|
| Clarification: Separates product behavior from infrastructure assumptions before drawing boxes. | 10 | The answer names users, write paths, read paths, retention, and what is explicitly out of scope. |
| Scale: Turns traffic and data assumptions into concrete sizing constraints. | 15 | Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant. |
| Architecture: Draws clear service, cache, queue, and storage boundaries with reasons for each split. | 20 | The component diagram has one owner per responsibility and names the synchronous path. |
| Data: Defines durable state, indexes, keys, and idempotency records. | 15 | Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations. |
| Failure: Names failure modes and the recovery behavior users see. | 15 | Covers partial outages, retries, duplicate work, stale reads, overload, and backfill. |
| Observability: Defines the small set of metrics and traces needed to debug the design. | 10 | Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm. |
| Tradeoffs: Explains what is being sacrificed and why that sacrifice fits the prompt. | 15 | Compares at least two viable designs and names the losing design's advantage. |