classic system design

Design an observability pipeline

Collect logs, metrics, and traces without making every outage an ingestion-cost problem.

telemetry ingestion, sampling, cardinality control, retention tiers

Prompt

Design an observability pipeline for a multi-service platform. Engineers need metrics, logs, traces, and alerts during incidents without unbounded data costs.

Clarifying questions

Which telemetry is needed for alerting versus debugging?
How long must raw logs and traces be retained?
Are tenants allowed to query their own telemetry?

Functional requirements

Ingest service metrics, structured logs, and distributed traces.
Store telemetry with query APIs and dashboards.
Trigger alerts from service-level indicators.

Nonfunctional requirements

Keep ingestion available during partial regional outages.
Control high-cardinality labels before they reach expensive storage.
Make alert evaluation independent of ad hoc log search.

Scale assumptions

10 TB of logs per day before filtering.
One million metric samples per second.
Trace sampling varies by service and error class.
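
The first two assumptions convert directly into sustained rates. The arithmetic below is a rough sketch: it treats 10 TB as decimal terabytes, and the 3x incident peak factor is an illustrative assumption, not a figure from the prompt.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

log_bytes_per_day = 10 * 10**12      # 10 TB/day before filtering (decimal TB)
metric_samples_per_sec = 1_000_000   # one million samples per second

avg_log_mb_per_sec = log_bytes_per_day / SECONDS_PER_DAY / 10**6
samples_per_day = metric_samples_per_sec * SECONDS_PER_DAY

# Assumption for illustration only: incidents push log volume to 3x the average.
PEAK_FACTOR = 3
peak_log_mb_per_sec = avg_log_mb_per_sec * PEAK_FACTOR

print(f"average log ingest: {avg_log_mb_per_sec:.1f} MB/s")
print(f"peak log ingest (assumed {PEAK_FACTOR}x): {peak_log_mb_per_sec:.1f} MB/s")
print(f"metric samples per day: {samples_per_day:,}")
```

The average works out to roughly 116 MB/s of logs and 86.4 billion metric samples per day, which is the baseline any filtering, sampling, or retention budget has to be measured against.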

API sketch

POST /v1/telemetry/logs: batched structured log ingestion.
POST /v1/telemetry/traces: OpenTelemetry-compatible trace ingestion.
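
One way a collector might build the body for the batched log endpoint is sketched below. The wire format (gzip-compressed newline-delimited JSON), the header values, and the record fields are assumptions for illustration; the prompt only says the endpoint accepts batched structured logs.

```python
import gzip
import json

def build_log_batch(records):
    """Serialize records as gzip-compressed newline-delimited JSON (assumed format)."""
    lines = [json.dumps(r, separators=(",", ":"), sort_keys=True) for r in records]
    return gzip.compress("\n".join(lines).encode("utf-8"))

# Hypothetical records; field names follow the logs row in the data model.
records = [
    {"service": "checkout", "severity": "ERROR", "timestamp": 1700000000.0,
     "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
     "body": "payment declined", "labels": {"region": "eu-west-1"}},
    {"service": "checkout", "severity": "DEBUG", "timestamp": 1700000000.5,
     "trace_id": None, "body": "cache miss", "labels": {"region": "eu-west-1"}},
]

body = build_log_batch(records)
# Assumed headers for the POST to /v1/telemetry/logs.
headers = {"Content-Type": "application/x-ndjson", "Content-Encoding": "gzip"}
```

Batching and compressing at the collector keeps request counts and bytes on the wire low, at the cost of losing a whole batch if one request fails without retry.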

Data model

metrics(series_id, labels_hash, timestamp, value).
logs(service, severity, timestamp, trace_id, body_ref, labels).
traces(trace_id, span_id, parent_span_id, service, duration_ms, status).
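
The metrics row can be sketched as a small record type. How series_id and labels_hash are derived is not specified above; the version below assumes labels_hash is a truncated SHA-256 over the sorted label pairs and that series_id combines the metric name with that hash. Both choices are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping

def labels_hash(labels: Mapping[str, str]) -> str:
    """Hash labels in sorted order so key order never changes the result."""
    canonical = "\x1f".join(f"{k}\x1e{v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

def series_id(metric_name: str, labels: Mapping[str, str]) -> str:
    """Assumed convention: one series per (metric name, label set)."""
    return f"{metric_name}:{labels_hash(labels)}"

@dataclass(frozen=True)
class MetricSample:
    series_id: str
    labels_hash: str
    timestamp: float
    value: float

labels = {"service": "checkout", "region": "eu-west-1"}
sample = MetricSample(
    series_id=series_id("http_requests_total", labels),
    labels_hash=labels_hash(labels),
    timestamp=1700000000.0,
    value=42.0,
)
```

Because every distinct label set produces a new series_id, this model makes the cost of an unbounded label visible: each new value is a new series.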

Architecture components

Collectors batch and sample telemetry near services.
Ingestion gateway validates labels and routes by telemetry type.
Hot storage serves recent incident queries; cold storage holds compressed archives.
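
The gateway's validate-then-route step might look like the sketch below. The allowlist contents and the writer names are hypothetical; a real deployment would load both from configuration.

```python
# Hypothetical allowlist and writer names, for illustration only.
ALLOWED_LABELS = frozenset({"service", "region", "severity", "status_code"})
ROUTES = {"metrics": "metrics-writer", "logs": "log-writer", "traces": "trace-writer"}

def route(telemetry_type, labels):
    """Return (destination, rejected_labels) for one incoming item."""
    if telemetry_type not in ROUTES:
        raise ValueError(f"unknown telemetry type: {telemetry_type!r}")
    rejected = sorted(set(labels) - ALLOWED_LABELS)
    if rejected:
        # Reject before any write so the bad label never creates a series.
        return None, rejected
    return ROUTES[telemetry_type], []

print(route("metrics", {"service": "api", "region": "eu-west-1"}))
print(route("logs", {"service": "api", "user_id": "42"}))
```

Rejecting at the gateway keeps the decision in one place; the alternative of stripping the unknown label and accepting the item loses less data but can silently merge series that the sender meant to keep apart.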

Bottlenecks

Unbounded labels can explode metric series count.
Incident spikes can overload the same pipeline engineers need to debug.
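
The series-count explosion is multiplicative: the upper bound on series for one metric is the product of the distinct values of each label. The label names and value counts below are made-up numbers chosen to show the effect.

```python
import math

# Hypothetical distinct-value counts for one metric's labels.
bounded_labels = {"method": 5, "status_code": 10, "region": 4}
base_series = math.prod(bounded_labels.values())

# Adding a single unbounded label, here an assumed 100,000 user IDs.
user_ids = 100_000
series_with_user_id = base_series * user_ids

print(f"bounded labels only: {base_series:,} series")
print(f"after adding user_id: {series_with_user_id:,} series")
```

With these numbers the metric goes from 200 possible series to 20 million after one label is added, which is why the check belongs in front of storage rather than behind it.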

Failure modes

Ingestion overload: drop debug logs before metrics and error traces.
Storage outage: buffer bounded telemetry at collectors.
Bad label rollout: enforce allowlists and reject unknown high-cardinality labels.
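
The first two behaviors can share one mechanism: a bounded collector buffer that sheds the lowest-priority item when full. The sketch below is one possible policy. The class name, the info_log kind, and the exact priority numbers are assumptions; the only ordering taken from the list above is that debug logs go before metrics and error traces.

```python
import heapq
import itertools

# Lower number = shed first. Only "debug logs first" comes from the design.
PRIORITY = {"debug_log": 0, "info_log": 1, "error_trace": 2, "metric": 2}

class BoundedBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                # min-heap: root is the next item to shed
        self._seq = itertools.count()  # tie-breaker so payloads are never compared
        self.dropped = {}              # kind -> count, for "dropped by reason" metrics

    def _count_drop(self, kind):
        self.dropped[kind] = self.dropped.get(kind, 0) + 1

    def offer(self, kind, payload):
        """Buffer the item if there is room or it outranks the lowest item."""
        entry = (PRIORITY[kind], next(self._seq), kind, payload)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if entry[0] > self._heap[0][0]:
            evicted = heapq.heapreplace(self._heap, entry)
            self._count_drop(evicted[2])
            return True
        self._count_drop(kind)
        return False

    def kinds(self):
        return sorted(entry[2] for entry in self._heap)

buf = BoundedBuffer(capacity=2)
buf.offer("debug_log", "a")
buf.offer("debug_log", "b")
accepted_metric = buf.offer("metric", "m")    # evicts the oldest debug log
accepted_debug = buf.offer("debug_log", "c")  # full, and it does not outrank anything
print(buf.kinds(), buf.dropped)
```

The capacity bound is what keeps a long storage outage from turning into collector memory exhaustion; the drop counters feed the "dropped telemetry by reason" signal.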

Observability

Track ingestion lag, dropped telemetry by reason, query latency, and alert evaluation lag.
Track cardinality growth by service and label.
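
Cardinality growth by service and label can be measured by counting distinct values per (service, label) pair and comparing snapshots. The sketch below uses exact sets for clarity; at the scale assumed above, an approximate counter such as HyperLogLog would likely be needed instead. All names are hypothetical.

```python
from collections import defaultdict

class CardinalityTracker:
    def __init__(self):
        self._values = defaultdict(set)  # (service, label) -> distinct values seen

    def observe(self, service, labels):
        for label, value in labels.items():
            self._values[(service, label)].add(value)

    def snapshot(self):
        return {key: len(values) for key, values in self._values.items()}

def growth(previous, current):
    """Distinct-value increase per (service, label) between two snapshots."""
    return {key: count - previous.get(key, 0) for key, count in current.items()}

tracker = CardinalityTracker()
tracker.observe("checkout", {"region": "eu-west-1"})
before = tracker.snapshot()

for user in ("u1", "u2", "u3"):
    tracker.observe("checkout", {"region": "eu-west-1", "user_id": user})
after = tracker.snapshot()

print(growth(before, after))
```

A label whose growth keeps rising between snapshots, like user_id here, is the one to alert on or block.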

Security / privacy

Redact secrets before storage and block common token patterns.
Partition tenant-visible telemetry from platform-internal telemetry.
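
Blocking common token patterns can start as a small regex pass over log bodies before they are written. The three patterns below are illustrative assumptions, not a complete list: an HTTP bearer token, a string shaped like an AWS access key ID, and key=value pairs with sensitive-looking keys.

```python
import re

# Illustrative patterns only; a real list would be broader and tested for false positives.
PATTERNS = [
    re.compile(r"(?i)\bBearer\s+[A-Za-z0-9\-._~+/]+=*"),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    re.compile(r"(?i)\b(password|secret|token|api_key)=\S+"),
]

def redact(text):
    """Replace every match of every pattern with a fixed placeholder."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact("Authorization: Bearer abc.def-123"))
print(redact("login password=hunter2 ok"))
print(redact("key AKIAIOSFODNN7EXAMPLE used"))
```

Running this at the collector or gateway means the secret never reaches hot or cold storage, which is simpler than trying to purge it from archives later.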

Cost considerations

Logs dominate storage cost; sampling and retention tiers need explicit budgets.
Tail sampling of traces increases collector memory, because spans must be buffered until the keep-or-drop decision is made.
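
Both cost lines can be turned into budget arithmetic. Every input below except the 10 TB/day log volume is an assumed number for illustration: span rate, span size, decision wait, the fraction of logs kept, retention windows, and the cold-tier compression ratio.

```python
# Tail-sampling buffer: spans are held until the sampling decision is made.
spans_per_sec = 50_000        # assumed
bytes_per_span = 1_000        # assumed
decision_wait_sec = 30        # assumed
tail_buffer_gb = spans_per_sec * bytes_per_span * decision_wait_sec / 10**9

# Log retention tiers.
raw_tb_per_day = 10           # from the scale assumptions
keep_percent = 30             # assumed share of logs kept after filtering
hot_days = 7                  # assumed hot retention
cold_days = 90                # assumed cold retention
cold_compression = 10         # assumed compression ratio in cold storage

kept_tb_per_day = raw_tb_per_day * keep_percent / 100
hot_tb = kept_tb_per_day * hot_days
cold_tb = kept_tb_per_day * cold_days / cold_compression

print(f"tail-sampling buffer across collectors: {tail_buffer_gb} GB")
print(f"hot tier: {hot_tb} TB, cold tier: {cold_tb} TB")
```

Under these assumptions the collectors need about 1.5 GB of span buffer in total, the hot tier holds 21 TB, and the cold tier holds 27 TB. Changing any one input shows which lever moves the budget most.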

Tradeoffs

Head sampling is cheap but misses rare failures.
Tail sampling captures failures better but delays trace export.

Rubric

Category	Criterion	Weight	Evidence
clarification	Separates product behavior from infrastructure assumptions before drawing boxes.	10	The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
scale	Turns traffic and data assumptions into concrete sizing constraints.	15	Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
architecture	Draws clear service, cache, queue, and storage boundaries with reasons for each split.	20	The component diagram has one owner per responsibility and names the synchronous path.
data	Defines durable state, indexes, keys, and idempotency records.	15	Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
failure	Names failure modes and the recovery behavior users see.	15	Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
observability	Defines the small set of metrics and traces needed to debug the design.	10	Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
tradeoffs	Explains what is being sacrificed and why that sacrifice fits the prompt.	15	Compares at least two viable designs and names the losing design's advantage.