Why Observability Beats Optimization

Dhruval Dhameliya·July 26, 2025·6 min read

The case for investing in observability before optimization, and why the ability to understand system behavior is more valuable than making it faster.

Context

Early in my career, I spent a lot of time optimizing. Shaving milliseconds off database queries, reducing memory allocations, tuning thread pools. It felt productive. The numbers went down, the graphs looked better.


What I did not realize was that I was optimizing systems I did not fully understand. I was making local improvements without seeing the global picture. The system would get faster in one place and break in another, and I would not know why because I could not see the whole system's behavior.

Now I follow a different sequence: observe first, understand second, optimize third. The optimization is often the easiest part.

The Optimization Trap

Optimization without observability has three failure modes:

1. Optimizing the wrong thing. You profile locally and find a hot function. You optimize it from 10ms to 1ms. But the request takes 500ms total, and the hot function was called once per request. You saved 9ms on a 500ms request. The actual bottleneck was a network call you did not measure.
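The arithmetic behind failure mode 1 is worth making explicit: only the optimized component's share of end-to-end latency shrinks (Amdahl's law applied to latency). A minimal sketch using the numbers from the example above; the network-call figure is an assumption for illustration:

```python
def latency_after_fix(total_ms: float, component_before_ms: float,
                      component_after_ms: float) -> float:
    """End-to-end latency after optimizing a single component:
    only that component's share of the total shrinks."""
    return total_ms - component_before_ms + component_after_ms

# The hot function from the text: 10 ms -> 1 ms inside a 500 ms request.
print(latency_after_fix(500, 10, 1))       # 491 ms -- a 1.8% improvement

# The unmeasured network call: suppose (hypothetically) it accounts for
# 400 of the 500 ms. Merely halving it beats the hot-function fix by far.
print(latency_after_fix(500, 400, 200))    # 300 ms
```

This is why profiling a single process in isolation misleads: the fraction of total request time matters more than the local speedup factor.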

2. Optimizing in a way that hurts other things. You add an aggressive cache to reduce database load. Database CPU drops by 40%. But the cache consumes memory that was previously available for the application, and GC pauses increase. Net effect: p99 latency gets worse.

3. Optimizing for the wrong workload. You optimize for the load pattern you see today. Traffic shifts. The optimization becomes irrelevant or harmful. Without observability, you do not notice the shift until something breaks.

What Observability Actually Means

Observability is not dashboards. Dashboards are a presentation layer. Observability is the property of a system that allows you to understand its internal state by examining its external outputs.

A system is observable when you can answer any question about its behavior without deploying new code. This means:

  • You can explain any latency. For any slow request, you can attribute the time to specific operations: this much for the database, this much for the downstream service, this much for serialization.
  • You can explain any error. For any failed request, you can trace the causal chain from the user-visible error back to the root cause.
  • You can explain any anomaly. When a metric moves unexpectedly, you can determine whether it was caused by a code change, a traffic change, a dependency change, or an infrastructure change.
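The first bullet, explaining any latency by attributing time to specific operations, becomes mechanical once traces exist. A minimal sketch, with invented span names and durations:

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str        # operation, e.g. "db.query" or "serialize"
    start_ms: float  # offset from the start of the request
    end_ms: float

def attribute_latency(spans: list[Span]) -> dict[str, float]:
    """Sum time per operation so any slow request can be explained,
    largest contributor first."""
    breakdown: dict[str, float] = {}
    for span in spans:
        breakdown[span.name] = (
            breakdown.get(span.name, 0.0) + (span.end_ms - span.start_ms)
        )
    return dict(sorted(breakdown.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical trace for one slow request:
trace = [
    Span("db.query", 0, 40),
    Span("downstream.fraud_check", 40, 390),
    Span("serialize", 390, 410),
]
print(attribute_latency(trace))  # bottleneck appears first
```

Real tracing systems (OpenTelemetry and the like) handle nesting and concurrency, but the core idea is exactly this: every millisecond of a request is assigned to a named operation.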

The Observability Investment

| Investment | What It Enables | Typical Setup Time |
| --- | --- | --- |
| Structured logging with trace IDs | Correlate events across services | 1-2 weeks |
| RED metrics (rate, errors, duration) at every boundary | Instant anomaly detection | 1-2 days per service |
| Distributed tracing | Request lifecycle reconstruction | 2-4 weeks |
| Business metrics | Validate that the system does what it exists to do | 1 week |
| Alerting on SLOs, not thresholds | Reduce alert noise, focus on user impact | 1-2 weeks |
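The last row, alerting on SLOs rather than static thresholds, is commonly implemented as error-budget burn-rate alerts. A minimal sketch; the 99.9% SLO target and the 14.4x fast-burn threshold are illustrative values borrowed from common SRE practice, not requirements:

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed: 1.0 means the budget
    is spent exactly over the SLO window; >1.0 means it runs out early."""
    budget = 1.0 - slo_target          # e.g. 0.001 for a 99.9% SLO
    return observed_error_rate / budget

def should_page(observed_error_rate: float, slo_target: float = 0.999,
                threshold: float = 14.4) -> bool:
    """Page only when the burn rate actually threatens the error budget,
    instead of on every blip past a fixed error-count threshold."""
    return burn_rate(observed_error_rate, slo_target) >= threshold

print(should_page(0.02))    # 2% errors burns a 0.1% budget 20x too fast -> True
print(should_page(0.005))   # 5x burn: worth watching, not worth paging -> False
```

The noise reduction comes from the denominator: the same absolute error count pages or stays silent depending on how much budget the service actually has left.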

The total investment is significant but front-loaded. Once the infrastructure is in place, adding observability to new services is incremental.
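The first row of the table, structured logging with trace IDs, can be sketched with the standard library alone. The JSON field names and the `checkout` logger are illustrative, not a standard; in a real system the trace ID would arrive in a request header (e.g. W3C `traceparent`) rather than be generated locally:

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Current request's trace id, propagated implicitly down the call stack.
trace_id: ContextVar[str] = ContextVar("trace_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": trace_id.get(),  # correlates events across services
        })

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request() -> None:
    # Hypothetical handler: every log line it emits carries the same id.
    trace_id.set(uuid.uuid4().hex)
    logger.info("reserving inventory")
    logger.info("inventory reserved")

handle_request()
```

Once every service stamps its logs with the same propagated ID, "show me everything that happened to this request" becomes a single query.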


Observability-First Workflow

When I encounter a performance problem now, the workflow is:

  1. Look at metrics. Is the problem widespread or isolated? Constant or intermittent? Correlated with a deployment or traffic change?
  2. Pull traces. For affected requests, what does the latency distribution look like? Where is the time being spent?
  3. Examine logs in context. For specific slow traces, what were the decision points? What was the resource state?
  4. Form a hypothesis. Based on the data, not intuition.
  5. Validate the hypothesis. Can I reproduce the issue? Can I explain why the fix works?
  6. Optimize. Now that I understand the problem, the fix is usually straightforward.
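Steps 1 and 2 usually reduce to a couple of queries over metric and trace data. A toy sketch of two of those questions, "did the metric move near a deployment?" and "what does the latency distribution look like?"; the 15-minute correlation window and the data are invented for illustration:

```python
from datetime import datetime, timedelta

def correlated_with_deploy(anomaly_start: datetime, deploys: list[datetime],
                           window: timedelta = timedelta(minutes=15)) -> bool:
    """Step 1: did the metric move within `window` of any deployment?"""
    return any(abs(anomaly_start - d) <= window for d in deploys)

def p99(durations_ms: list[float]) -> float:
    """Step 2: approximate 99th percentile of affected request durations."""
    ordered = sorted(durations_ms)
    return ordered[int(0.99 * (len(ordered) - 1))]

deploys = [datetime(2025, 7, 26, 14, 0)]
print(correlated_with_deploy(datetime(2025, 7, 26, 14, 10), deploys))  # True
print(p99([float(i) for i in range(100)]))
```

The point is not these particular helpers but that each workflow step is a concrete, answerable query rather than an intuition.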

Compare this to the optimization-first workflow: guess what is slow, change it, deploy, hope the metrics improve. The observability-first approach takes longer on the first iteration but converges faster because you are not guessing.

Real Example

A checkout service had p99 latency of 4.2 seconds. The target was under 2 seconds. The team's instinct was to optimize the database queries, which they suspected were slow.

With tracing enabled, the actual breakdown was:

  • Database queries: 180ms (not the bottleneck)
  • Fraud detection service call: 350ms (not the bottleneck)
  • Inventory reservation service: 3,200ms (the bottleneck)

The inventory service was slow because it was making a synchronous call to a warehouse management system with a 3-second timeout, and the WMS was responding in 2.8 seconds on average because it was running a full table scan for inventory lookups.

Without tracing, the team would have spent weeks optimizing the wrong service. With tracing, they identified the root cause in 30 minutes and the fix (adding an index on the WMS table) took 15 minutes.

When Optimization Comes First

There are cases where optimization should precede observability:

  • The system is so slow it cannot process the observability data. If adding metrics causes the system to fall further behind, you need to optimize just enough to create headroom for instrumentation.
  • The bottleneck is obvious and the fix is trivial. If a missing database index is clearly the issue, just add it.
  • Regulatory or cost pressure demands immediate improvement. When the system costs significantly more than it should and the cause is well understood.

These are exceptions. The default should be observe first.

Key Takeaways

  • Optimization without observability means optimizing systems you do not fully understand. You will often optimize the wrong thing.
  • Observability is the property that lets you answer any question about system behavior without deploying new code.
  • The observability-first workflow (observe, understand, optimize) converges faster than the optimization-first workflow (guess, change, hope).
  • Distributed tracing is the highest-leverage observability tool for multi-service architectures. It turns hours of guessing into minutes of analysis.
  • The observability investment is front-loaded. Once the infrastructure exists, adding instrumentation to new services is incremental.

Final Thoughts

The ability to see what your system is doing is more valuable than the ability to make it faster. Speed without understanding is fragile. Understanding without speed is an optimization away from being solved. Invest in the ability to understand first, and the optimizations will follow naturally.
