classic system design

Design an observability pipeline

Collect logs, metrics, and traces without making every outage an ingestion-cost problem.

telemetry ingestion, sampling, cardinality control, retention tiers

Prompt

Design an observability pipeline for a multi-service platform. Engineers need metrics, logs, traces, and alerts during incidents without unbounded data costs.

Clarifying questions

  • Which telemetry is needed for alerting versus debugging?
  • How long must raw logs and traces be retained?
  • Are tenants allowed to query their own telemetry?

Functional requirements

  • Ingest service metrics, structured logs, and distributed traces.
  • Store telemetry with query APIs and dashboards.
  • Trigger alerts from service-level indicators.

Nonfunctional requirements

  • Keep ingestion available during partial regional outages.
  • Control high-cardinality labels before they reach expensive storage.
  • Make alert evaluation independent of ad hoc log search.

Scale assumptions

  • 10 TB of logs per day before filtering.
  • One million metric samples per second.
  • Trace sampling varies by service and error class.
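
The first two assumptions can be turned into per-second and per-day rates with a little arithmetic. The sketch below is only a back-of-envelope conversion: the decimal-terabyte convention and the peak multiplier are assumptions for illustration, not figures from the prompt.

```python
# Back-of-envelope sizing from the stated scale assumptions.
SECONDS_PER_DAY = 24 * 60 * 60  # 86_400

LOG_BYTES_PER_DAY = 10 * 10**12     # 10 TB/day, assuming decimal terabytes
METRIC_SAMPLES_PER_SEC = 1_000_000  # from the scale assumptions

# Hypothetical multiplier: incident traffic relative to the daily average.
ASSUMED_PEAK_FACTOR = 3

avg_log_bytes_per_sec = LOG_BYTES_PER_DAY / SECONDS_PER_DAY
peak_log_bytes_per_sec = avg_log_bytes_per_sec * ASSUMED_PEAK_FACTOR
metric_samples_per_day = METRIC_SAMPLES_PER_SEC * SECONDS_PER_DAY

print(f"average log ingest: {avg_log_bytes_per_sec / 10**6:.1f} MB/s")
print(f"assumed peak log ingest: {peak_log_bytes_per_sec / 10**6:.1f} MB/s")
print(f"metric samples per day: {metric_samples_per_day:,}")
```

Under these assumptions the average log rate is roughly 116 MB/s before filtering, and the metric path sees 86.4 billion samples per day.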

API sketch

  • POST /v1/telemetry/logs: batched structured log ingestion.
  • POST /v1/telemetry/traces: OpenTelemetry-compatible trace ingestion.
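
A client-side view of the batched log endpoint might look like the sketch below. The request body shape, the required field names, the gzip encoding, and the `MAX_BATCH_RECORDS` limit are all hypothetical; the prompt only fixes the path and the fact that logs are batched and structured.

```python
import gzip
import json

MAX_BATCH_RECORDS = 500  # hypothetical gateway limit

# Hypothetical minimum set of fields per record.
REQUIRED_FIELDS = {"service", "severity", "timestamp", "body"}

def build_log_batch(records):
    """Validate records and return a gzip-compressed JSON body
    suitable for POST /v1/telemetry/logs."""
    if not records:
        raise ValueError("batch must contain at least one record")
    if len(records) > MAX_BATCH_RECORDS:
        raise ValueError("batch exceeds MAX_BATCH_RECORDS")
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"record missing fields: {sorted(missing)}")
    payload = json.dumps({"records": records}, separators=(",", ":"))
    return gzip.compress(payload.encode("utf-8"))

body = build_log_batch([
    {
        "service": "checkout",
        "severity": "ERROR",
        "timestamp": "2024-01-01T00:00:00Z",
        "body": "payment timeout",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    },
])
```

Validating on the client keeps malformed batches from consuming gateway capacity, although the gateway would still need to repeat the checks.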

Data model

  • metrics(series_id, labels_hash, timestamp, value).
  • logs(service, severity, timestamp, trace_id, body_ref, labels).
  • traces(trace_id, span_id, parent_span_id, service, duration_ms, status).
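
The three records above could be expressed as typed structures, as in the sketch below. The field names come from the data model; the field types, the timestamp unit, and the `labels_hash` construction (a truncated SHA-256 over sorted label pairs) are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

def labels_hash(labels: dict[str, str]) -> str:
    """Hash labels in sorted key order so equal label sets hash identically."""
    canonical = "\x1f".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

@dataclass(frozen=True)
class MetricSample:
    series_id: str
    labels_hash: str
    timestamp: float  # Unix seconds; the unit is an assumption
    value: float

@dataclass(frozen=True)
class LogRecord:
    service: str
    severity: str
    timestamp: float
    trace_id: Optional[str]  # None when the log line is not part of a trace
    body_ref: str            # assumed to point at the stored body
    labels: dict[str, str]

@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for a root span
    service: str
    duration_ms: float
    status: str

sample = MetricSample(
    series_id="http_requests_total",
    labels_hash=labels_hash({"service": "checkout", "region": "eu"}),
    timestamp=1_700_000_000.0,
    value=1.0,
)
```

Sorting the labels before hashing means the same label set always maps to the same series, regardless of the order in which a service emits them.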

Architecture components

  • Collectors batch and sample telemetry near services.
  • Ingestion gateway validates labels and routes by telemetry type.
  • Hot storage serves recent incident queries; cold storage holds compressed archives.
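
The gateway's two jobs, validating labels and routing by telemetry type, can be sketched in a few lines. The allowlist contents, the queue names, and the `admit` function are hypothetical; the sketch only shows where the check sits relative to routing.

```python
# Hypothetical allowlist of label names the platform accepts.
ALLOWED_LABELS = {"service", "region", "env", "status_code"}

# Hypothetical downstream queues, one per telemetry type.
ROUTES = {
    "metrics": "metrics-queue",
    "logs": "logs-queue",
    "traces": "traces-queue",
}

def admit(telemetry_type, labels):
    """Return (destination, unknown_labels).

    Records carrying label names outside the allowlist are rejected
    before they are routed to any storage-facing queue.
    """
    if telemetry_type not in ROUTES:
        raise ValueError(f"unknown telemetry type: {telemetry_type}")
    unknown = sorted(set(labels) - ALLOWED_LABELS)
    if unknown:
        return ("rejected", unknown)
    return (ROUTES[telemetry_type], [])

print(admit("metrics", {"service": "checkout", "region": "eu"}))
print(admit("logs", {"service": "checkout", "user_id": "u-123"}))
```

Returning the offending label names makes it possible to report rejections by reason rather than silently dropping data.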

Bottlenecks

  • Unbounded labels can explode metric series count.
  • Incident spikes can overload the same pipeline engineers need to debug.
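
The series explosion in the first bullet follows from multiplication: the upper bound on series for one metric is the product of the distinct values of each label. The label names and cardinalities below are made up for illustration.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on series for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_cardinalities.values())

# Hypothetical cardinalities for a single metric.
bounded = {"service": 50, "region": 4, "status_code": 10}
with_user_id = {**bounded, "user_id": 100_000}

print(series_count(bounded))       # 2000
print(series_count(with_user_id))  # 200000000
```

One unbounded label turns a two-thousand-series metric into a two-hundred-million-series metric, which is why the check belongs before expensive storage.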

Failure modes

  • Ingestion overload: drop debug logs before metrics and error traces.
  • Storage outage: buffer telemetry at collectors, with a bounded buffer size.
  • Bad label rollout: enforce allowlists and reject unknown high-cardinality labels.
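
The overload and buffering rules above can be combined into one bounded buffer that sheds the least important telemetry first. This is a minimal sketch: the class name, the telemetry kinds, and the numeric priorities are assumptions, with only the ordering (debug logs go before metrics and error traces) taken from the list.

```python
import heapq
import itertools

# Lower number is shed first. Metrics and error traces share the top
# priority because the list does not rank them against each other.
PRIORITY = {"debug_log": 0, "info_log": 1, "error_trace": 2, "metric": 2}

class SheddingBuffer:
    """Fixed-capacity buffer that evicts the lowest-priority item when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []                # min-heap of (priority, seq, kind, item)
        self._seq = itertools.count()  # tie-breaker: older items evict first
        self.dropped = {}              # drop counts by telemetry kind

    def offer(self, kind, item):
        """Try to buffer an item; return False if the item itself was shed."""
        entry = (PRIORITY[kind], next(self._seq), kind, item)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        lowest = self._heap[0]
        if entry[0] <= lowest[0]:
            # The newcomer is no more important than anything buffered.
            self._count_drop(kind)
            return False
        heapq.heapreplace(self._heap, entry)
        self._count_drop(lowest[2])
        return True

    def kinds(self):
        return sorted(entry[2] for entry in self._heap)

    def _count_drop(self, kind):
        self.dropped[kind] = self.dropped.get(kind, 0) + 1

buf = SheddingBuffer(capacity=2)
buf.offer("debug_log", "cache miss")
buf.offer("debug_log", "retrying")
buf.offer("metric", ("http_errors", 1))         # evicts the oldest debug log
buf.offer("error_trace", "trace with failure")  # evicts the other debug log
```

Counting drops by kind feeds the "dropped telemetry by reason" signal, so shedding stays visible instead of silent.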

Observability

  • Ingestion lag, dropped telemetry by reason, query latency, alert evaluation lag.
  • Cardinality growth by service and label.
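
Cardinality growth by service and label can be tracked by counting the distinct values seen for each (service, label) pair. The sketch below uses exact sets for clarity; that choice is an assumption, and an approximate counter would likely replace the sets at the scale described here.

```python
from collections import defaultdict

class CardinalityTracker:
    """Counts distinct values seen per (service, label name) pair."""

    def __init__(self):
        self._values = defaultdict(set)

    def observe(self, service, labels):
        for name, value in labels.items():
            self._values[(service, name)].add(value)

    def top(self, n):
        """Return the n pairs with the most distinct values, largest first."""
        counts = {key: len(vals) for key, vals in self._values.items()}
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

tracker = CardinalityTracker()
for request_id in range(100):
    tracker.observe("checkout", {"region": "eu", "request_id": str(request_id)})
tracker.observe("search", {"region": "us"})

print(tracker.top(2))
```

Comparing successive snapshots of `top` would show which service and label are growing, which is the signal needed to catch a bad label rollout early.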

Security / privacy

  • Redact secrets before storage and block common token patterns.
  • Partition tenant-visible telemetry from platform-internal telemetry.
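
Redaction before storage can be sketched as a pass of regular expressions over each log body. The patterns below are illustrative assumptions covering a few common token shapes, not a complete or vetted list.

```python
import re

# Illustrative patterns only; a real deployment would maintain a reviewed list.
TOKEN_PATTERNS = [
    # HTTP bearer tokens, e.g. "Bearer abc.def-123"
    re.compile(r"(?i)\bbearer\s+[A-Za-z0-9\-._~+/]+=*"),
    # Strings shaped like AWS access key IDs
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # key=value pairs with sensitive-looking keys
    re.compile(r"(?i)\b(password|secret|api_key)=\S+"),
]

def redact(text: str) -> str:
    """Replace every match of a known token pattern with a placeholder."""
    for pattern in TOKEN_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

print(redact("Authorization: Bearer abc.def-123"))
print(redact("login failed password=hunter2 for user 42"))
```

Running the pass at the collector or gateway, rather than at query time, keeps the raw secret out of every storage tier.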

Cost considerations

  • Logs dominate storage cost; sampling and retention tiers need explicit budgets.
  • Tail-sampling traces increases collector memory, because spans must be buffered until the sampling decision is made.
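
Both cost drivers can be budgeted with simple formulas. In the sketch below only the 10 TB/day figure comes from the scale assumptions; the keep fraction, retention periods, compression ratio, span rate, decision wait, and span size are hypothetical inputs chosen to show the calculation.

```python
def retained_bytes(daily_bytes, keep_fraction, retention_days, compression_ratio=1.0):
    """Steady-state bytes held in a tier after filtering and compression."""
    return daily_bytes * keep_fraction * retention_days / compression_ratio

def tail_sampling_buffer_bytes(spans_per_sec, decision_wait_s, avg_span_bytes):
    """Approximate collector memory for spans held until the sampling decision."""
    return spans_per_sec * decision_wait_s * avg_span_bytes

DAILY_LOG_BYTES = 10 * 10**12  # 10 TB/day before filtering

# Hypothetical budget: keep 40% of logs, 7 days hot, 90 days cold at 5x compression.
hot = retained_bytes(DAILY_LOG_BYTES, keep_fraction=0.4, retention_days=7)
cold = retained_bytes(DAILY_LOG_BYTES, keep_fraction=0.4, retention_days=90,
                      compression_ratio=5.0)

# Hypothetical trace load: 50k spans/s, 30 s decision wait, 1 kB per span.
buffer_bytes = tail_sampling_buffer_bytes(50_000, 30, 1_000)

print(f"hot tier: {hot / 10**12:.1f} TB")
print(f"cold tier: {cold / 10**12:.1f} TB")
print(f"tail-sampling buffer: {buffer_bytes / 10**9:.1f} GB")
```

With these inputs the hot tier holds about 28 TB, the cold tier about 72 TB, and tail sampling needs about 1.5 GB of memory across the collector fleet.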

Tradeoffs

  • Head sampling is cheap but misses rare failures.
  • Tail sampling captures failures better but delays trace export.

Rubric

Each entry lists the criterion, its weight, and the evidence expected.

  • clarification (weight 10): Separates product behavior from infrastructure assumptions before drawing boxes.
    Evidence: The answer names users, write paths, read paths, retention, and what is explicitly out of scope.
  • scale (weight 15): Turns traffic and data assumptions into concrete sizing constraints.
    Evidence: Uses RPS, storage growth, hot-key risk, fanout, latency budget, or memory budget where relevant.
  • architecture (weight 20): Draws clear service, cache, queue, and storage boundaries with reasons for each split.
    Evidence: The component diagram has one owner per responsibility and names the synchronous path.
  • data (weight 15): Defines durable state, indexes, keys, and idempotency records.
    Evidence: Tables or collections include primary keys, lookup paths, TTLs, and consistency expectations.
  • failure (weight 15): Names failure modes and the recovery behavior users see.
    Evidence: Covers partial outages, retries, duplicate work, stale reads, overload, and backfill.
  • observability (weight 10): Defines the small set of metrics and traces needed to debug the design.
    Evidence: Includes SLIs, saturation metrics, queue lag, error classes, and an alert tied to user harm.
  • tradeoffs (weight 15): Explains what is being sacrificed and why that sacrifice fits the prompt.
    Evidence: Compares at least two viable designs and names the losing design's advantage.