What Logs Didn't Tell Me
Exploring the blind spots in traditional logging approaches and the incidents where logs were present but useless, along with what I now build instead.
Context
I have spent hundreds of hours reading logs during incidents. Scrolling through walls of text, grepping for patterns, correlating timestamps across services. And in too many of those incidents, the logs were there but they told me nothing useful. The problem was not missing logs. It was the wrong logs.
This post covers the specific ways logging failed me and what I build instead.
The Five Blind Spots
1. Logs Showed What Happened, Not Why
A payment service logged every transaction. Timestamps, amounts, user IDs, status codes. During an incident where transactions were failing at a 15% rate, I could see the failures. I could count them. I could chart them over time. But the logs did not tell me why those specific transactions were failing while others succeeded.
The root cause was a thread pool exhaustion issue that only manifested under a specific load pattern. The logs showed the symptom (failed transactions) but not the mechanism (thread pool saturation). I needed metrics for that: thread pool active count, queue depth, rejection count.
What I build now: For every critical path, I log not just the outcome but the resource state at decision points. Thread pool utilization at the time of submission. Queue depth at the time of enqueue. Connection pool availability at the time of checkout.
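A minimal sketch of the idea in Python, assuming a `concurrent.futures` thread pool: wrap submission so every task records the pool's saturation state at the moment it is enqueued. Note that `_work_queue` and `_threads` are CPython implementation details, used here only to illustrate the point; a production version would track these counters itself or export them as metrics.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("payments")

class InstrumentedExecutor(ThreadPoolExecutor):
    """Thread pool that logs its own saturation state at submit time."""

    def submit(self, fn, *args, **kwargs):
        # Record resource state at the decision point, not just the outcome.
        # _work_queue and _threads are CPython internals (illustration only).
        log.info(
            "submit fn=%s queue_depth=%d started_threads=%d max_workers=%d",
            getattr(fn, "__name__", repr(fn)),
            self._work_queue.qsize(),
            len(self._threads),
            self._max_workers,
        )
        return super().submit(fn, *args, **kwargs)
```

With this in place, a spike in `queue_depth` alongside failed transactions points straight at pool saturation instead of leaving only the symptom in the logs.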
2. Logs Were Too Verbose to Be Useful
A search service logged every query, every result set, every ranking score. In development, this was invaluable. In production, at 5,000 queries per second, the log volume was 2 GB per minute. Searching for a specific event was like finding a needle in a haystack that was actively growing.
The team's response was to reduce log levels in production. That cut volume, but it also eliminated the detailed logs needed during incidents, exactly when they mattered most.
What I build now: Dynamic log levels that can be increased for specific users, request IDs, or traffic segments without affecting the rest. Feature-flag-controlled debug logging that activates for sampled traffic.
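One way to sketch this with the standard `logging` module: a filter that always passes INFO and above, but only passes DEBUG records for flagged users or a sampled fraction of traffic. The `user_id` record attribute and the flag source are assumptions; in practice the user set would come from a feature flag service rather than a hardcoded set.

```python
import logging
import random

class DynamicDebugFilter(logging.Filter):
    """Pass DEBUG records only for flagged users or a sampled slice of traffic."""

    def __init__(self, debug_users=frozenset(), sample_rate=0.0):
        super().__init__()
        self.debug_users = set(debug_users)  # e.g. refreshed from a flag service
        self.sample_rate = sample_rate       # fraction of traffic to sample

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO and above always pass
        if getattr(record, "user_id", None) in self.debug_users:
            return True  # targeted debug for specific users
        return random.random() < self.sample_rate  # sampled debug traffic

# Usage: attach to a handler, then pass user context via `extra`:
#   handler.addFilter(DynamicDebugFilter(debug_users={"user-42"}, sample_rate=0.01))
#   log.debug("ranking scores ...", extra={"user_id": user_id})
```

The logger stays at DEBUG; the filter decides per record, so turning detail up for one user never changes volume for everyone else.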
3. Logs Lacked Correlation
Three services were involved in a checkout flow: cart, payment, and fulfillment. Each logged independently. Each used its own request ID format. Correlating a single user's checkout across all three required manual timestamp matching, which was error-prone and slow.
During an incident where checkouts were silently dropping items, it took over an hour just to establish which service was dropping the item. With proper correlation, that would have been a single query.
What I build now: A correlation ID generated at the edge and propagated through every service in the chain. Every log line, every metric, every trace span carries this ID. One query reconstructs the entire request lifecycle.
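In-process, this pattern can be sketched with `contextvars` plus a logging filter, so every log line on a request's path carries the ID without threading it through every function signature. The `X-Request-ID` header name is an assumption; use whatever your edge generates.

```python
import contextvars
import logging
import uuid

# One context variable per request context; async- and thread-safe.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current request's correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

def handle_request(headers: dict) -> str:
    # At the edge: accept an inbound ID or mint one, then set it for
    # everything downstream in this request's context.
    cid = headers.get("X-Request-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

# Formatter example: "%(correlation_id)s %(levelname)s %(message)s"
```

Cross-service, the same ID is forwarded on every outbound call (header, message attribute, trace baggage), which is what makes the one-query reconstruction possible.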
4. Logs Did Not Capture What Did Not Happen
A notification service was supposed to send emails after order completion. It failed silently for a segment of users because a conditional check was wrong. The logs showed every email that was sent. They did not show the emails that should have been sent but were not.
This is the fundamental limitation of logging: it records what code executes, not what code should have executed. Missing behavior is invisible unless you explicitly log it.
What I build now: Reconciliation jobs that compare expected outcomes with actual outcomes. If 1,000 orders completed but only 950 emails were sent, the reconciliation job flags the gap within minutes.
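The core of such a job is a set difference between expected and observed outcomes. A minimal sketch, assuming the caller fetches completed order IDs from the order store and sent-email order IDs from the provider's send log (those queries are yours to supply):

```python
def reconcile(completed_order_ids: set, emailed_order_ids: set,
              threshold: int = 0) -> list:
    """Return order IDs that completed but never produced an email.

    An empty list means expected and actual outcomes match (within the
    tolerated threshold). A non-empty list is the alert payload: in
    production this would emit a metric or page, not just return.
    """
    missing = completed_order_ids - emailed_order_ids
    if len(missing) > threshold:
        return sorted(missing)
    return []
```

Run on a schedule over a recent, settled time window (e.g. orders completed 10-15 minutes ago), the job makes the silent failure loud within minutes instead of days.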
5. Logs Were Structured but Not Queryable
A team adopted structured JSON logging. Every log line was a well-formed JSON object with consistent fields. Progress. But the logs were stored in flat files on each node. To query them, someone had to SSH into each node, parse the JSON, and aggregate manually. The structured data was there but not accessible.
What I build now: Logs ship to a centralized system (ELK, Datadog, Loki) where they can be queried, aggregated, and visualized. The structure is only useful if it is queryable at scale.
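The emission side is simple: one JSON object per line, which is the shape log shippers for ELK, Datadog, and Loki ingest directly. A minimal formatter sketch with Python's standard `logging` (field names here are my own convention, not a required schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line for a log shipper to pick up."""

    def format(self, record):
        return json.dumps({
            "ts": round(record.created, 3),   # epoch seconds
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),       # message with args interpolated
        })

# Usage: handler.setFormatter(JsonFormatter()) on the handler that writes
# to stdout or a file the shipper tails.
```

The shipping agent (Filebeat, the Datadog agent, Promtail) does the centralization; the formatter's only job is to make every line machine-parseable.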
Related: Refactoring a System Without Breaking Users.
The Observability Stack I Actually Use
Logs alone are insufficient. A complete observability stack has three pillars, and each fills gaps the others cannot:
| Pillar | What It Tells You | What It Cannot Tell You |
|---|---|---|
| Logs | What happened at a specific point in code | Aggregate behavior, trends, correlations across services |
| Metrics | Aggregate behavior, trends, rates, distributions | Why a specific request failed |
| Traces | The full lifecycle of a request across services | Aggregate system health |
The real power comes from linking all three. A metric alert fires (high error rate). You drill into traces for the affected time window. You find a trace that shows elevated latency in one service. You pull the logs for that trace ID and find the specific error.
Practical Patterns
- Log at boundaries, not in loops. Log when you enter a service, when you exit, and when you make an external call. Do not log inside hot loops.
- Include decision context. When code takes a branch, log which branch and why. "Routed to fallback: primary cache miss, TTL expired 30s ago."
- Log the contract, not the payload. Log the shape of the data (field count, size, presence of required fields) rather than the full payload. This avoids PII issues and reduces volume.
- Make "nothing happened" visible. If a scheduled job should run every 5 minutes, log when it runs and alert when it does not.
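As one concrete instance of "log the contract, not the payload", a small helper can summarize a payload's shape without ever touching its contents. The function name and fields are my own sketch, not an established API:

```python
import json

def payload_shape(payload: dict, required: tuple = ()) -> dict:
    """Describe a payload without logging its contents.

    Captures field count, serialized size, and which required fields are
    absent -- enough to debug malformed requests without emitting PII or
    ballooning log volume.
    """
    return {
        "fields": len(payload),
        "bytes": len(json.dumps(payload)),
        "missing_required": [f for f in required if f not in payload],
    }

# Usage at a service boundary:
#   log.info("inbound order: %s", payload_shape(body, required=("user_id", "amount")))
```

A log line saying `missing_required=['amount']` diagnoses the bad request as fast as the full payload would, at a fraction of the volume and risk.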
Key Takeaways
- Logs that show what happened without showing why are insufficient for diagnosis.
- Log volume in production is an adversary. Dynamic log levels and sampled debug logging solve the verbosity problem.
- Correlation IDs are non-negotiable for multi-service architectures.
- The absence of expected behavior is invisible to logs. Reconciliation jobs fill this gap.
- Structured logs are only useful if they are centralized and queryable.
- Logs, metrics, and traces each fill gaps the others cannot. Link all three.
See also: Building a Minimal Feature Flag Service.
Further Reading
- Designing for Observability From Day One: How to build observability into system architecture from the start, covering the three pillars, instrumentation patterns, and common pitfalls.
- Designing Systems That Degrade Gracefully: How to build systems that continue providing value when components fail, covering load shedding, fallback strategies, and partial availability.
- How Small Decisions Cause Big Outages: Examining how seemingly minor technical decisions compound into major production incidents, with real patterns and the organizational dynamics behind them.
Final Thoughts
Logging is necessary but not sufficient. The incidents that cost me the most time were not the ones with missing logs. They were the ones where the logs gave me confidence that I understood the problem, but that understanding was wrong. Building systems that are truly observable means going beyond "did we log it?" to "can we answer any question about this system's behavior, at any point in time, within minutes?"