Designing for Observability From Day One
How to build observability into system architecture from the start, covering the three pillars, instrumentation patterns, and common pitfalls.
Observability added after the fact is always incomplete. The most critical paths are the hardest to instrument retroactively because they were built without instrumentation in mind. Starting with observability as a design requirement changes how you structure code, handle errors, and expose internal state.
The Three Pillars in Practice
Logs, metrics, and traces are the standard pillars. But listing them is not the same as using them effectively.
Logs answer "what happened." They are the narrative record of system behavior. Useful for debugging specific incidents but expensive to query at scale.
Metrics answer "how is the system performing." They are aggregated, time-series data. Useful for dashboards, alerting, and capacity planning. Cheap to store and query.
Traces answer "how did this request flow through the system." They connect the dots between services for a single request. Useful for diagnosing latency issues and understanding dependencies.
The mistake I see most often: teams invest heavily in one pillar and neglect the others. Logs without metrics means you discover problems from user complaints rather than dashboards. Metrics without traces means you know something is slow but cannot determine where. Traces without logs means you can see the request path but not the details of what went wrong at each step.
All three are required. They serve different purposes and complement each other.
Instrumentation Patterns
The RED Method for Services
For every service, instrument three things:
- Rate: requests per second
- Errors: errors per second (segmented by error type)
- Duration: latency histogram (P50, P95, P99)
These three metrics give you a complete picture of service health. If the rate drops, something upstream is broken. If errors increase, something in this service or downstream is broken. If duration increases, the service is degrading.
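As a sketch, the RED method can be captured with a minimal in-process recorder. A real service would use a metrics library (e.g. a Prometheus client); the class and method names here are illustrative:

```python
from collections import Counter

class RedMetrics:
    """Minimal in-process RED recorder (illustrative; a production
    service would emit these through a metrics library instead)."""

    def __init__(self):
        self.requests = 0         # Rate: total requests (divide by window for req/s)
        self.errors = Counter()   # Errors: count segmented by error type
        self.durations_ms = []    # Duration: raw samples for percentile queries

    def record(self, duration_ms, error_type=None):
        self.requests += 1
        if error_type:
            self.errors[error_type] += 1
        self.durations_ms.append(duration_ms)

    def percentile(self, p):
        # Nearest-rank percentile over the recorded samples
        s = sorted(self.durations_ms)
        if not s:
            return 0.0
        idx = max(0, int(round(p / 100 * len(s))) - 1)
        return s[idx]

red = RedMetrics()
for ms in [12, 15, 11, 240, 14]:
    red.record(ms)
red.record(500, error_type="timeout")

print(red.requests)           # 6
print(red.errors["timeout"])  # 1
print(red.percentile(95))     # 500
```

In production the raw sample list would be replaced by a histogram with fixed buckets, which is what makes Duration cheap to store and aggregate.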
The USE Method for Resources
For every resource (CPU, memory, disk, network, connection pools), instrument:
- Utilization: percentage of resource capacity in use
- Saturation: work queued because the resource is fully utilized
- Errors: errors caused by resource exhaustion
These metrics predict problems before they become incidents. A connection pool at 80% utilization is not an incident yet, but it will be if traffic increases.
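A sketch of USE bookkeeping for a connection pool makes the three signals concrete. The pool itself is hypothetical; only the accounting matters:

```python
class PoolUseMetrics:
    """Illustrative USE instrumentation for a connection pool."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0
        self.waiting = 0            # Saturation: callers queued for a connection
        self.exhaustion_errors = 0  # Errors: failures caused by exhaustion

    def utilization(self):
        # Utilization: fraction of pool capacity currently in use
        return self.in_use / self.capacity

    def acquire(self, timeout_expired=False):
        if self.in_use >= self.capacity:
            self.waiting += 1
            if timeout_expired:
                self.exhaustion_errors += 1
            return False
        self.in_use += 1
        return True

pool = PoolUseMetrics(capacity=10)
for _ in range(10):
    pool.acquire()
pool.acquire()              # pool is saturated: this caller queues
print(pool.utilization())   # 1.0
print(pool.waiting)         # 1
```

Alerting on saturation (waiting > 0) rather than utilization alone is what gives you the early warning before exhaustion errors appear.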
Business Metrics
Technical metrics alone are insufficient. Instrument business-relevant metrics:
- Orders processed per minute
- Payment success rate
- User sign-up completion rate
- Search result relevance scores
These metrics catch problems that technical metrics miss. A deployment that introduces a bug in the pricing calculation may show perfect latency and error rates while producing wrong prices.
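For example, a payment success rate can be tracked and checked against a baseline in a few lines (the 0.95 baseline is an illustrative assumption, not a recommendation):

```python
from collections import Counter

payments = Counter()

def record_payment(succeeded):
    # Business metric: count attempts and successes
    payments["attempted"] += 1
    if succeeded:
        payments["succeeded"] += 1

for ok in [True, True, True, False]:
    record_payment(ok)

success_rate = payments["succeeded"] / payments["attempted"]
print(success_rate)  # 0.75

# Alert on the business symptom: success rate dropped below baseline
BASELINE = 0.95  # assumed historical norm for illustration
alert = success_rate < BASELINE
print(alert)  # True
```

A pricing bug like the one described above would never trip a latency alert, but it would move a metric like this immediately.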
Instrumentation at the Architecture Level
Observability is not just about adding metric calls to existing code. It is about structuring the architecture to be observable.
Explicit boundaries. Every system boundary (API gateway, service-to-service call, database query, queue publish/consume) should be instrumented. These are the points where problems manifest. If you instrument nothing else, instrument every boundary.
Structured context propagation. The correlation ID pattern extends to observability: every request carries a context object that includes trace ID, span ID, user ID, tenant ID, and feature flag state. This context is attached to every log line, metric emission, and trace span.
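One way to sketch this pattern in Python is with `contextvars`, which carries request-scoped state across function calls and async tasks. The field names here are illustrative, not a fixed schema:

```python
import json
import contextvars
from dataclasses import dataclass, field, asdict

# Request-scoped context; contextvars also propagates across async tasks.
_request_ctx = contextvars.ContextVar("request_ctx", default=None)

@dataclass
class RequestContext:
    trace_id: str
    span_id: str
    user_id: str
    tenant_id: str
    feature_flags: dict = field(default_factory=dict)

def log(message):
    """Emit a structured log line with the current request context attached."""
    ctx = _request_ctx.get()
    record = {"message": message, **(asdict(ctx) if ctx else {})}
    print(json.dumps(record))
    return record

_request_ctx.set(RequestContext("t-1", "s-9", "u-42", "acme", {"new_pricing": True}))
entry = log("order created")
```

Because every log line carries the trace ID, you can pivot from a log entry to its trace (and back) without guessing at timestamps.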
Health check depth. A health check that returns 200 OK when the process is running is useless. A useful health check verifies:
- Database connectivity (can execute a simple query)
- Downstream service reachability (can ping critical dependencies)
- Queue connectivity (can publish and consume a test message)
- Disk space and memory availability (above minimum thresholds)
Report the status of each component individually so operators can identify which dependency is failing.
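A minimal sketch of such a component-level health check, where each check is a zero-argument callable that raises on failure (the check functions here are stand-ins for real connectivity probes):

```python
def deep_health_check(checks):
    """Run each dependency check and report per-component status."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
    healthy = all(status == "ok" for status in results.values())
    return {"status": "ok" if healthy else "degraded", "components": results}

def check_database():
    pass  # stand-in: would execute a simple query against the primary

def check_queue():
    raise ConnectionError("broker unreachable")  # simulated failure

report = deep_health_check({"database": check_database, "queue": check_queue})
print(report["status"])               # degraded
print(report["components"]["queue"])  # failed: broker unreachable
```

The overall status degrades, but the per-component map tells the operator exactly which dependency is the problem.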
Alert Design
Observability without alerting is monitoring that nobody looks at. Alert design is where most teams go wrong.
Alert on symptoms, not causes. Alert when the error rate exceeds a threshold, not when a specific exception occurs. Symptom-based alerts catch unknown failure modes; cause-based alerts only catch the failures you anticipated.
Every alert must have an actionable response. If the on-call engineer receives an alert and the appropriate response is "wait and see if it resolves," that alert should not exist. Either make it actionable or remove it.
Use multiple severity levels with clear definitions:
| Severity | Definition | Response |
|---|---|---|
| Critical | User-facing functionality is broken | Immediate response, wake people up |
| Warning | Degradation detected, may escalate | Investigate within 30 minutes |
| Info | Anomaly detected, no user impact | Review during business hours |
Tune aggressively. An alert that fires falsely more than 10% of the time trains the team to ignore it. Either fix the threshold or delete the alert.
Dashboard Design
Dashboards should answer specific questions, not display every metric available.
The dashboards I create for every service:
- Service overview: RED metrics, dependency health, deployment markers. This is the first dashboard an on-call engineer opens during an incident.
- Resource utilization: CPU, memory, disk, connection pools, thread pools. This is for capacity planning and saturation detection.
- Business metrics: domain-specific metrics that reflect the service's purpose. This is for product teams and business stakeholders.
- Debug dashboard: detailed breakdowns of latency by endpoint, error rates by type, and dependency latency. This is for deep investigation during incidents.
Each dashboard has a clear audience and purpose. A dashboard that tries to serve everyone serves no one.
Common Pitfalls
Sampling too aggressively. Sampling traces at 1% means you need, on average, 100 occurrences of a problem before you capture a single trace of it. For rare but critical errors, sample at 100% or use tail-based sampling that keeps traces for slow or failed requests.
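The tail-based sampling decision can be sketched as a simple predicate applied after the request completes, when its outcome and duration are known (the thresholds are illustrative):

```python
import random

def keep_trace(duration_ms, failed, slow_threshold_ms=1000, base_rate=0.01):
    """Tail-based sampling sketch: always keep slow or failed traces,
    otherwise fall back to a low base sampling rate."""
    if failed or duration_ms >= slow_threshold_ms:
        return True
    return random.random() < base_rate

print(keep_trace(1500, failed=False))  # True: slow request always kept
print(keep_trace(20, failed=True))     # True: failed request always kept
```

The trade-off is buffering: the collector must hold every span until the request finishes before it can decide, which is why tail-based sampling usually runs in a dedicated collector tier.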
High-cardinality labels. Adding user ID as a metric label creates a metric explosion that overwhelms your metrics backend. Use high-cardinality data in logs and traces, not in metrics.
Alert fatigue from noisy baselines. If the system has regular spikes (batch jobs, cron tasks, traffic patterns), alerts must account for them. Otherwise, every spike generates a false alert that erodes trust in the alerting system.
Neglecting the write path. Instrumentation adds overhead. A tracing library that adds 5ms of latency to every request may be acceptable for a web service but unacceptable for a low-latency trading system. Measure the cost of instrumentation itself.
Key Takeaways
- All three pillars (logs, metrics, traces) are required. They serve different purposes and complement each other.
- Instrument the RED metrics (Rate, Errors, Duration) for every service and the USE metrics (Utilization, Saturation, Errors) for every resource.
- Business metrics catch problems that technical metrics miss. Instrument them from day one.
- Alert on symptoms, not causes. Every alert must have an actionable response.
- High-cardinality data belongs in logs and traces, not in metrics.
- Dashboards should answer specific questions for specific audiences, not display every available metric.
Further Reading
- Designing Mobile Systems for Poor Network Conditions: Architecture patterns for mobile apps that function reliably on slow, intermittent, and lossy networks, covering request prioritization, ...
- Designing Systems That Degrade Gracefully: How to build systems that continue providing value when components fail, covering load shedding, fallback strategies, and partial availab...
- Why Observability Beats Optimization: The case for investing in observability before optimization, and why the ability to understand system behavior is more valuable than maki...
Final Thoughts
Observability is not a tool you install. It is a design discipline that shapes how you build, deploy, and operate systems. The investment pays off every time an incident occurs and the on-call engineer can diagnose it from a dashboard instead of guessing. Build it in from day one because retrofitting it is expensive, incomplete, and always lower priority than the next feature.