Designing for Observability From Day One

Dhruval Dhameliya·July 5, 2025·7 min read

How to build observability into system architecture from the start, covering the three pillars, instrumentation patterns, and common pitfalls.

Observability added after the fact is always incomplete. The most critical paths are the hardest to instrument retroactively because they were built without instrumentation in mind. Starting with observability as a design requirement changes how you structure code, handle errors, and expose internal state.

The Three Pillars in Practice

Logs, metrics, and traces are the standard pillars. But listing them is not the same as using them effectively.

Logs answer "what happened." They are the narrative record of system behavior. Useful for debugging specific incidents but expensive to query at scale.

Metrics answer "how is the system performing." They are aggregated, time-series data. Useful for dashboards, alerting, and capacity planning. Cheap to store and query.

Traces answer "how did this request flow through the system." They connect the dots between services for a single request. Useful for diagnosing latency issues and understanding dependencies.

The mistake I see most often: teams invest heavily in one pillar and neglect the others. Logs without metrics means you discover problems from user complaints rather than dashboards. Metrics without traces means you know something is slow but cannot determine where. Traces without logs means you can see the request path but not the details of what went wrong at each step.

All three are required. They serve different purposes and complement each other.

Instrumentation Patterns

The RED Method for Services

For every service, instrument three things:

  • Rate: requests per second
  • Errors: errors per second (segmented by error type)
  • Duration: latency histogram (P50, P95, P99)
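A minimal sketch of RED instrumentation in Python, using an in-memory store in place of a real metrics backend (the `RedMetrics` class and its method names are illustrative, not from any particular library):

```python
import time
from collections import defaultdict

class RedMetrics:
    """In-memory RED metrics: request counts, error counts, latency samples."""

    def __init__(self):
        self.requests = defaultdict(int)    # rate numerator, per endpoint
        self.errors = defaultdict(int)      # errors, per (endpoint, error type)
        self.durations = defaultdict(list)  # raw latencies; a real backend uses histograms

    def observe(self, endpoint):
        """Decorator that records rate, errors, and duration for one endpoint."""
        def decorator(fn):
            def wrapper(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Segment errors by type, as the Errors bullet suggests.
                    self.errors[(endpoint, type(exc).__name__)] += 1
                    raise
                finally:
                    # Counted even on failure: errors are part of the rate.
                    self.requests[endpoint] += 1
                    self.durations[endpoint].append(time.perf_counter() - start)
            return wrapper
        return decorator

    def percentile(self, endpoint, p):
        """Naive percentile over raw samples (P50/P95/P99 in production come from histograms)."""
        samples = sorted(self.durations[endpoint])
        return samples[min(len(samples) - 1, int(len(samples) * p / 100))]

metrics = RedMetrics()

@metrics.observe("checkout")
def checkout(ok=True):
    if not ok:
        raise ValueError("payment declined")
    return "done"
```

In production the raw latency list would be a fixed-bucket histogram, which is what makes P50/P95/P99 cheap to compute on the backend.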

These three metrics give you a complete picture of service health. If the rate drops, something upstream is broken. If errors increase, something in this service or downstream is broken. If duration increases, the service is degrading.

The USE Method for Resources

For every resource (CPU, memory, disk, network, connection pools), instrument:

  • Utilization: percentage of resource capacity in use
  • Saturation: work queued because the resource is fully utilized
  • Errors: errors caused by resource exhaustion

These metrics predict problems before they become incidents. A connection pool at 80% utilization is not an incident yet, but it will be if traffic increases.
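Applied to that connection pool, the three USE numbers reduce to a small snapshot. A sketch, with an illustrative class standing in for a real pool:

```python
class PoolUseMetrics:
    """USE metrics for a fixed-size connection pool (illustrative; no real pool behind it)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0   # utilization numerator
        self.waiting = 0  # saturation: callers queued for a connection
        self.errors = 0   # exhaustion errors, e.g. acquire timeouts

    def utilization(self):
        return self.in_use / self.capacity

    def snapshot(self):
        return {
            "utilization": self.utilization(),
            "saturation": self.waiting,
            "errors": self.errors,
        }

pool = PoolUseMetrics(capacity=10)
pool.in_use = 8  # 80% utilized, nothing queued: not an incident yet, but close
snapshot = pool.snapshot()
```

Utilization rising while saturation stays at zero is the early-warning window; once `waiting` climbs, the incident has effectively started.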

Business Metrics

Technical metrics alone are insufficient. Instrument business-relevant metrics:

  • Orders processed per minute
  • Payment success rate
  • User sign-up completion rate
  • Search result relevance scores

These metrics catch problems that technical metrics miss. A deployment that introduces a bug in the pricing calculation may show perfect latency and error rates while producing wrong prices.
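Payment success rate, for example, reduces to a pair of counters whose ratio is computed at query time. The class below is an illustrative sketch, not a real metrics client:

```python
class SuccessRate:
    """A business metric kept as two counters; the ratio is derived at query
    time so it can be windowed and alerted on like any technical metric."""

    def __init__(self):
        self.attempts = 0
        self.successes = 0

    def record(self, ok):
        self.attempts += 1
        if ok:
            self.successes += 1

    def rate(self):
        # None rather than a fake 0% or 100% when there is no traffic yet.
        return self.successes / self.attempts if self.attempts else None

payments = SuccessRate()
for ok in [True, True, True, False]:
    payments.record(ok)
```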

Instrumentation at the Architecture Level

Observability is not just about adding metric calls to existing code. It is about structuring the architecture to be observable.

Explicit boundaries. Every system boundary (API gateway, service-to-service call, database query, queue publish/consume) should be instrumented. These are the points where problems manifest. If you instrument nothing else, instrument every boundary.

Structured context propagation. The correlation ID pattern extends to observability: every request carries a context object that includes trace ID, span ID, user ID, tenant ID, and feature flag state. This context is attached to every log line, metric emission, and trace span.
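One way to implement this in Python is the stdlib `contextvars` module, which carries request-scoped state through sync and async call chains. The field names and `log` helper below are illustrative:

```python
import contextvars
import json
import uuid

# Request-scoped context, set once at the boundary and read everywhere below it.
request_context = contextvars.ContextVar("request_context", default={})

def start_request(user_id, tenant_id):
    """Called at the system boundary; establishes the context for this request."""
    ctx = {
        "trace_id": uuid.uuid4().hex,
        "user_id": user_id,
        "tenant_id": tenant_id,
    }
    request_context.set(ctx)
    return ctx

def log(message, **fields):
    """Every log line automatically carries the request context."""
    record = {**request_context.get(), "message": message, **fields}
    return json.dumps(record)

start_request(user_id="u-42", tenant_id="acme")
line = log("order placed", order_id="o-123")
```

Because callers never pass the context explicitly, no code path can forget to include the trace ID in its logs.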

Health check depth. A health check that returns 200 OK when the process is running is useless. A useful health check verifies:

  • Database connectivity (can execute a simple query)
  • Downstream service reachability (can ping critical dependencies)
  • Queue connectivity (can publish and consume a test message)
  • Disk space and memory availability (above minimum thresholds)

Report the status of each component individually so operators can identify which dependency is failing.
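A per-component health check along these lines might look like the sketch below; the check functions are hypothetical stand-ins for real dependency probes:

```python
def deep_health_check(checks):
    """Run each dependency check and report per-component status plus an overall verdict."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # One failing dependency should not hide the status of the others.
            results[name] = f"failing: {exc}"
    overall = "ok" if all(v == "ok" for v in results.values()) else "degraded"
    return {"status": overall, "components": results}

def check_database():
    pass  # would execute a trivial query (e.g. SELECT 1) against the real database

def check_queue():
    raise ConnectionError("broker unreachable")  # simulated failure for illustration

report = deep_health_check({"database": check_database, "queue": check_queue})
```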

Alert Design

Observability without alerting is monitoring that nobody looks at. Alert design is where most teams go wrong.

Alert on symptoms, not causes. Alert when the error rate exceeds the threshold, not when a specific exception occurs. Symptom-based alerts catch unknown failure modes. Cause-based alerts only catch the failures you anticipated.
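A symptom-based alert can be sketched as a predicate over the error rate in a sliding window, indifferent to which exception produced the errors (thresholds and window size below are illustrative):

```python
from collections import deque

class ErrorRateAlert:
    """Fires when the error rate over the last `window` requests exceeds
    `threshold`, regardless of the underlying failure mode."""

    def __init__(self, threshold, window=100):
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True = error

    def record(self, is_error):
        self.outcomes.append(is_error)

    def firing(self):
        if not self.outcomes:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.threshold

alert = ErrorRateAlert(threshold=0.05, window=100)
for _ in range(97):
    alert.record(False)
for _ in range(3):
    alert.record(True)
# 3% error rate: below the 5% threshold, so the alert stays quiet
```

A cause-based version would match on `ValueError` or a specific log message; this one also catches the exception type nobody anticipated.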

Every alert must have an actionable response. If the on-call engineer receives an alert and the appropriate response is "wait and see if it resolves," that alert should not exist. Either make it actionable or remove it.

Use multiple severity levels with clear definitions:

  • Critical: user-facing functionality is broken. Respond immediately; wake people up.
  • Warning: degradation detected, may escalate. Investigate within 30 minutes.
  • Info: anomaly detected, no user impact. Review during business hours.

Tune aggressively. An alert that fires falsely more than 10% of the time trains the team to ignore it. Either fix the threshold or delete the alert.

Dashboard Design

Dashboards should answer specific questions, not display every metric available.

The dashboards I create for every service:

  1. Service overview: RED metrics, dependency health, deployment markers. This is the first dashboard an on-call engineer opens during an incident.

  2. Resource utilization: CPU, memory, disk, connection pools, thread pools. This is for capacity planning and saturation detection.

  3. Business metrics: domain-specific metrics that reflect the service's purpose. This is for product teams and business stakeholders.

  4. Debug dashboard: detailed breakdowns of latency by endpoint, error rates by type, and dependency latency. This is for deep investigation during incidents.

Each dashboard has a clear audience and purpose. A dashboard that tries to serve everyone serves no one.

Common Pitfalls

Sampling too aggressively. Sampling traces at 1% means a problem must occur roughly 100 times before you can expect to capture even one trace of it. For rare but critical errors, sample at 100% or use tail-based sampling that captures traces for slow or failed requests.
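The tail-based decision can be sketched as a predicate evaluated after the trace completes, once its outcome and duration are known (field names and thresholds are illustrative):

```python
import random

def tail_sample(trace, slow_threshold_ms=500, baseline_rate=0.01):
    """Keep decision made at trace completion: always keep failed or slow
    traces, and sample the healthy remainder at a low baseline rate."""
    if trace["error"]:
        return True  # failures are always worth a trace
    if trace["duration_ms"] > slow_threshold_ms:
        return True  # so are outliers on latency
    return random.random() < baseline_rate  # thin baseline for normal traffic

failed = tail_sample({"error": True, "duration_ms": 20})
slow = tail_sample({"error": False, "duration_ms": 900})
```

The trade-off is buffering: spans must be held until the trace finishes before the keep/drop decision can be made, which is why tail-based sampling usually lives in a collector rather than in the application.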

High-cardinality labels. Adding user ID as a metric label creates a metric explosion that overwhelms your metrics backend. Use high-cardinality data in logs and traces, not in metrics.
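The arithmetic behind the explosion is simple: the number of time series is the product of each label's cardinality, so a single unbounded label multiplies everything (the numbers below are illustrative):

```python
# Series count = product of label cardinalities.
endpoints, statuses, methods = 20, 5, 4
bounded_series = endpoints * statuses * methods  # 400 series: fine

# Add user_id as a label with a million users and every existing
# combination is multiplied by a million.
users = 1_000_000
with_user_id = bounded_series * users  # 400 million series: backend meltdown
```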

Alert fatigue from noisy baselines. If the system has regular spikes (batch jobs, cron tasks, traffic patterns), alerts must account for them. Otherwise, every spike generates a false alert that erodes trust in the alerting system.

Neglecting the write path. Instrumentation adds overhead. A tracing library that adds 5ms of latency to every request may be acceptable for a web service but unacceptable for a low-latency trading system. Measure the cost of instrumentation itself.
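One way to measure that cost is to benchmark a bare call against its instrumented wrapper. A rough sketch; a real benchmark would also control for warm-up, GC, and run-to-run variance:

```python
import time

def mean_call_time(fn, iterations=10_000):
    """Average wall-clock time per call over many iterations."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations

def measure_overhead(bare_fn, instrumented_fn, iterations=10_000):
    """Per-call cost added by the instrumentation wrapper, in seconds."""
    return mean_call_time(instrumented_fn, iterations) - mean_call_time(bare_fn, iterations)
```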

Key Takeaways

  • All three pillars (logs, metrics, traces) are required. They serve different purposes and complement each other.
  • Instrument the RED metrics (Rate, Errors, Duration) for every service and the USE metrics (Utilization, Saturation, Errors) for every resource.
  • Business metrics catch problems that technical metrics miss. Instrument them from day one.
  • Alert on symptoms, not causes. Every alert must have an actionable response.
  • High-cardinality data belongs in logs and traces, not in metrics.
  • Dashboards should answer specific questions for specific audiences, not display every available metric.

Final Thoughts

Observability is not a tool you install. It is a design discipline that shapes how you build, deploy, and operate systems. The investment pays off every time an incident occurs and the on-call engineer can diagnose it from a dashboard instead of guessing. Build it in from day one because retrofitting it is expensive, incomplete, and always lower priority than the next feature.
