What Logs Didn't Tell Me
Exploring the blind spots in traditional logging approaches and the incidents where logs were present but useless, along with what I now build instead.
Context
I have spent hundreds of hours reading logs during incidents. Scrolling through walls of text, grepping for patterns, correlating timestamps across services. And in too many of those incidents, the logs were there but they told me nothing useful. The problem was not missing logs. It was the wrong logs.
This post covers the specific ways logging failed me and what I build instead.
The Five Blind Spots
1. Logs Showed What Happened, Not Why
A payment service logged every transaction. Timestamps, amounts, user IDs, status codes. During an incident where transactions were failing at a 15% rate, I could see the failures. I could count them. I could chart them over time. But the logs did not tell me why those specific transactions were failing while others succeeded.
The root cause was a thread pool exhaustion issue that only manifested under a specific load pattern. The logs showed the symptom (failed transactions) but not the mechanism (thread pool saturation). I needed metrics for that: thread pool active count, queue depth, rejection count.
What I build now: For every critical path, I log not just the outcome but the resource state at decision points. Thread pool utilization at the time of submission. Queue depth at the time of enqueue. Connection pool availability at the time of checkout.
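A minimal sketch of the idea in Python, assuming a `concurrent.futures` thread pool: wrap submission so every task records the pool's saturation state at the moment it is enqueued. Note that `_work_queue` and `_threads` are CPython implementation details, used here only to illustrate the point; a production version would track these counters itself or export them as metrics.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("payments")

class InstrumentedExecutor(ThreadPoolExecutor):
    """Thread pool that logs its own saturation state at submit time."""

    def submit(self, fn, *args, **kwargs):
        # Record resource state at the decision point, not just the outcome.
        # _work_queue and _threads are CPython internals (illustration only).
        log.info(
            "submit fn=%s queue_depth=%d started_threads=%d max_workers=%d",
            getattr(fn, "__name__", repr(fn)),
            self._work_queue.qsize(),
            len(self._threads),
            self._max_workers,
        )
        return super().submit(fn, *args, **kwargs)
```

With this in place, a spike in `queue_depth` alongside failed transactions points straight at pool saturation instead of leaving only the symptom in the logs.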
2. Logs Were Too Verbose to Be Useful
A search service logged every query, every result set, every ranking score. In development, this was invaluable. In production, at 5,000 queries per second, the log volume was 2 GB per minute. Searching for a specific event was like finding a needle in a haystack that was actively growing.
The team's response was to reduce log levels in production. That cut volume, but it also eliminated the detailed logs needed during incidents, exactly when they mattered most.
What I build now: Dynamic log levels that can be increased for specific users, request IDs, or traffic segments without affecting the rest. Feature-flag-controlled debug logging that activates for sampled traffic.
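One way to sketch this with the standard `logging` module: a filter that always passes INFO and above, but only passes DEBUG records for flagged users or a sampled fraction of traffic. The `user_id` record attribute and the flag source are assumptions; in practice the user set would come from a feature flag service rather than a hardcoded set.

```python
import logging
import random

class DynamicDebugFilter(logging.Filter):
    """Pass DEBUG records only for flagged users or a sampled slice of traffic."""

    def __init__(self, debug_users=frozenset(), sample_rate=0.0):
        super().__init__()
        self.debug_users = set(debug_users)  # e.g. refreshed from a flag service
        self.sample_rate = sample_rate       # fraction of traffic to sample

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO and above always pass
        if getattr(record, "user_id", None) in self.debug_users:
            return True  # targeted debug for specific users
        return random.random() < self.sample_rate  # sampled debug traffic

# Usage: attach to a handler, then pass user context via `extra`:
#   handler.addFilter(DynamicDebugFilter(debug_users={"user-42"}, sample_rate=0.01))
#   log.debug("ranking scores ...", extra={"user_id": user_id})
```

The logger stays at DEBUG; the filter decides per record, so turning detail up for one user never changes volume for everyone else.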
3. Logs Lacked Correlation
Three services were involved in a checkout flow: cart, payment, and fulfillment. Each logged independently. Each used its own request ID format. Correlating a single user's checkout across all three required manual timestamp matching, which was error-prone and slow.
During an incident where checkouts were silently dropping items, it took over an hour just to establish which service was dropping the item. With proper correlation, that would have been a single query.
What I build now: A correlation ID generated at the edge and propagated through every service in the chain. Every log line, every metric, every trace span carries this ID. One query reconstructs the entire request lifecycle.
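In-process, this pattern can be sketched with `contextvars` plus a logging filter, so every log line on a request's path carries the ID without threading it through every function signature. The `X-Request-ID` header name is an assumption; use whatever your edge generates.

```python
import contextvars
import logging
import uuid

# One context variable per request context; async- and thread-safe.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current request's correlation ID."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True

def handle_request(headers: dict) -> str:
    # At the edge: accept an inbound ID or mint one, then set it for
    # everything downstream in this request's context.
    cid = headers.get("X-Request-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    return cid

# Formatter example: "%(correlation_id)s %(levelname)s %(message)s"
```

Cross-service, the same ID is forwarded on every outbound call (header, message attribute, trace baggage), which is what makes the one-query reconstruction possible.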
4. Logs Did Not Capture What Did Not Happen
A notification service was supposed to send emails after order completion. It failed silently for a segment of users because a conditional check was wrong. The logs showed every email that was sent. They did not show the emails that should have been sent but were not.
This is the fundamental limitation of logging: it records what code executes, not what code should have executed. Missing behavior is invisible unless you explicitly log it.
What I build now: Reconciliation jobs that compare expected outcomes with actual outcomes. If 1,000 orders completed but only 950 emails were sent, the reconciliation job flags the gap within minutes.
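The core of such a job is a set difference between expected and observed outcomes. A minimal sketch, assuming the caller fetches completed order IDs from the order store and sent-email order IDs from the provider's send log (those queries are yours to supply):

```python
def reconcile(completed_order_ids: set, emailed_order_ids: set,
              threshold: int = 0) -> list:
    """Return order IDs that completed but never produced an email.

    An empty list means expected and actual outcomes match (within the
    tolerated threshold). A non-empty list is the alert payload: in
    production this would emit a metric or page, not just return.
    """
    missing = completed_order_ids - emailed_order_ids
    if len(missing) > threshold:
        return sorted(missing)
    return []
```

Run on a schedule over a recent, settled time window (e.g. orders completed 10-15 minutes ago), the job makes the silent failure loud within minutes instead of days.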
5. Logs Were Structured but Not Queryable
A team adopted structured JSON logging. Every log line was a well-formed JSON object with consistent fields. Progress. But the logs were stored in flat files on each node. To query them, someone had to SSH into each node, parse the JSON, and aggregate manually. The structured data was there but not accessible.
What I build now: Logs ship to a centralized system (ELK, Datadog, Loki) where they can be queried, aggregated, and visualized. The structure is only useful if it is queryable at scale.
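The emission side is simple: one JSON object per line, which is the shape log shippers for ELK, Datadog, and Loki ingest directly. A minimal formatter sketch with Python's standard `logging` (field names here are my own convention, not a required schema):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line for a log shipper to pick up."""

    def format(self, record):
        return json.dumps({
            "ts": round(record.created, 3),   # epoch seconds
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),       # message with args interpolated
        })

# Usage: handler.setFormatter(JsonFormatter()) on the handler that writes
# to stdout or a file the shipper tails.
```

The shipping agent (Filebeat, the Datadog agent, Promtail) does the centralization; the formatter's only job is to make every line machine-parseable.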
Related: Refactoring a System Without Breaking Users.
The Observability Stack I Actually Use
Logs alone are insufficient. A complete observability stack has three pillars, and each fills gaps the others cannot:
| Pillar | What It Tells You | What It Cannot Tell You |
|---|---|---|
| Logs | What happened at a specific point in code | Aggregate behavior, trends, correlations across services |
| Metrics | Aggregate behavior, trends, rates, distributions | Why a specific request failed |
| Traces | The full lifecycle of a request across services | Aggregate system health |
The real power comes from linking all three. A metric alert fires (high error rate). You drill into traces for the affected time window. You find a trace that shows elevated latency in one service. You pull the logs for that trace ID and find the specific error.
Practical Patterns
- Log at boundaries, not in loops. Log when you enter a service, when you exit, and when you make an external call. Do not log inside hot loops.
- Include decision context. When code takes a branch, log which branch and why. "Routed to fallback: primary cache miss, TTL expired 30s ago."
- Log the contract, not the payload. Log the shape of the data (field count, size, presence of required fields) rather than the full payload. This avoids PII issues and reduces volume.
- Make "nothing happened" visible. If a scheduled job should run every 5 minutes, log when it runs and alert when it does not.
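As one concrete instance of "log the contract, not the payload", a small helper can summarize a payload's shape without ever touching its contents. The function name and fields are my own sketch, not an established API:

```python
import json

def payload_shape(payload: dict, required: tuple = ()) -> dict:
    """Describe a payload without logging its contents.

    Captures field count, serialized size, and which required fields are
    absent -- enough to debug malformed requests without emitting PII or
    ballooning log volume.
    """
    return {
        "fields": len(payload),
        "bytes": len(json.dumps(payload)),
        "missing_required": [f for f in required if f not in payload],
    }

# Usage at a service boundary:
#   log.info("inbound order: %s", payload_shape(body, required=("user_id", "amount")))
```

A log line saying `missing_required=['amount']` diagnoses the bad request as fast as the full payload would, at a fraction of the volume and risk.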
Key Takeaways
- Logs that show what happened without showing why are insufficient for diagnosis.
- Log volume in production is an adversary. Dynamic log levels and sampled debug logging solve the verbosity problem.
- Correlation IDs are non-negotiable for multi-service architectures.
- The absence of expected behavior is invisible to logs. Reconciliation jobs fill this gap.
- Structured logs are only useful if they are centralized and queryable.
- Logs, metrics, and traces each fill gaps the others cannot. Link all three.
See also: Building a Minimal Feature Flag Service.
Further Reading
- Designing for Observability From Day One: How to build observability into system architecture from the start, covering the three pillars, instrumentation patterns, and common pitfalls.
- Designing Systems That Degrade Gracefully: How to build systems that continue providing value when components fail, covering load shedding, fallback strategies, and partial availability.
- How Small Decisions Cause Big Outages: Examining how seemingly minor technical decisions compound into major production incidents, with real patterns and the organizational dynamics behind them.
Final Thoughts
Logging is necessary but not sufficient. The incidents that cost me the most time were not the ones with missing logs. They were the ones where the logs gave me confidence that I understood the problem, but that understanding was wrong. Building systems that are truly observable means going beyond "did we log it?" to "can we answer any question about this system's behavior, at any point in time, within minutes?"