How Small Decisions Cause Big Outages

Dhruval Dhameliya·July 17, 2025·7 min read

Examining how seemingly minor technical decisions compound into major production incidents, with real patterns and the organizational dynamics that allow them.

Context

Major outages rarely have major causes. They are almost always the result of small, reasonable decisions that interact in unexpected ways. A default timeout value that nobody questioned. A retry policy that seemed conservative. A log rotation config that was copy-pasted from another service. Each decision was locally rational. Together, they created the conditions for a significant failure.

This post examines the patterns I have seen repeatedly and the organizational dynamics that allow small decisions to compound into large incidents.

Pattern 1: The Default That Nobody Questioned

A team deployed a new service with a default connection pool size of 10. The default came from the framework's documentation and was reasonable for development workloads. In production, under peak load, the service needed 50 concurrent connections. The pool was exhausted, requests queued, timeouts cascaded.

The root cause was not the pool size. It was the process that allowed a production service to launch with untested defaults. Nobody asked: "What is the expected concurrent connection count under peak load?"

The small decision: Accepting the framework default without load testing. The big outage: 45 minutes of degraded checkout experience during a traffic spike.
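The question the team never asked can be answered with a back-of-envelope application of Little's law: expected concurrency is roughly arrival rate times time in system. A minimal sketch (the request rate, query latency, and headroom factor below are hypothetical, not taken from the incident):

```python
import math

def required_pool_size(peak_rps: float, avg_query_ms: float,
                       headroom: float = 1.5) -> int:
    """Little's law: in-flight work ~= arrival rate x time in system.
    headroom pads the estimate for bursts above the measured peak."""
    in_flight = peak_rps * avg_query_ms / 1000.0
    return math.ceil(in_flight * headroom)

# 500 req/s at 60 ms per query needs ~45 connections, not the default 10.
print(required_pool_size(peak_rps=500, avg_query_ms=60))
```

Even a rough estimate like this makes the framework default of 10 obviously wrong before the service ever sees production traffic.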

Pattern 2: The Retry That Amplified

Service A called Service B with a retry count of 3. Service B called Service C with a retry count of 3. When Service C became slow, Service B retried 3 times per request. Service A retried each of those 3 times. One user request generated up to 9 calls to Service C. Multiplied across thousands of concurrent users, that was enough to push Service C from slow to completely overwhelmed.

Retry Depth                 Calls per User Request   Calls at 1,000 Concurrent Users
No retries                  1                        1,000
1 layer, 3 retries          3                        3,000
2 layers, 3 retries each    9                        9,000
3 layers, 3 retries each    27                       27,000

The small decision: Each team independently chose a retry count of 3, which is perfectly reasonable in isolation. The big outage: A 27x amplification factor that turned a minor slowdown into a total service failure.
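The worst-case amplification is simply the product of each layer's retry count, which is why it must be mapped across the whole call chain rather than per service. A one-line sketch:

```python
from math import prod

def amplification(retries_per_layer: list[int]) -> int:
    """Worst-case calls reaching the deepest service:
    the product of per-layer retry counts along the chain."""
    return prod(retries_per_layer)

# Two layers of "retry 3 times" turn one user request into up to 9 calls;
# three layers turn it into 27.
print(amplification([3, 3]), amplification([3, 3, 3]))
```

The point of making this explicit is that no single team's number looks dangerous; only the product does.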

Pattern 3: The Config Drift

Two instances of the same service ran with different configurations. One was deployed three months ago with a memory limit of 2 GB. The other was deployed last week with a memory limit of 4 GB (the new default). Under load, the 2 GB instance hit its memory limit and was OOM-killed. The load balancer shifted traffic to the remaining instance, which could not handle the full load alone.

The small decision: Not enforcing configuration consistency across instances of the same service. The big outage: A cascading failure triggered by an OOM kill that would not have happened with consistent configuration.
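Drift of this kind is mechanically detectable: diff the effective configuration of every instance of a service and flag any key whose values disagree. A minimal sketch (the instance names and config keys are illustrative, not from the incident):

```python
def find_drift(instances: dict[str, dict]) -> dict:
    """Return config keys whose values differ across instances
    of the same service, with the per-instance values."""
    keys = set().union(*(cfg.keys() for cfg in instances.values()))
    drift = {}
    for key in sorted(keys):
        values = {name: cfg.get(key) for name, cfg in instances.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift

configs = {
    "instance-a": {"memory_limit_gb": 2, "pool_size": 10},
    "instance-b": {"memory_limit_gb": 4, "pool_size": 10},
}
print(find_drift(configs))  # flags memory_limit_gb: 2 vs 4
```

Run as a periodic check, this turns silent drift into an alert long before the smaller instance gets OOM-killed under load.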

Pattern 4: The Missing Timeout

An internal service called an external API without a timeout. The external API, which had been reliable for two years, started hanging. The internal service's threads blocked indefinitely, waiting for a response that would never come. The thread pool filled. New requests could not be processed. The service appeared healthy (no errors, no crashes) but was completely unresponsive.

The small decision: Not setting a timeout because the external API "always responds quickly." The big outage: A service that appeared healthy but processed zero requests for 20 minutes.
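At the lowest level, the fix is an explicit deadline on the call itself, so a hung peer raises an error instead of pinning a thread forever. A raw-socket sketch to make the mechanism concrete (the PING protocol is hypothetical; an HTTP client's timeout parameter plays the same role):

```python
import socket

def probe_with_timeout(host: str, port: int, timeout_seconds: float = 2.0) -> bytes:
    """Call a remote service with an explicit deadline. A hung peer
    raises socket.timeout instead of blocking this thread forever."""
    # create_connection bounds the connect; settimeout also bounds each recv.
    with socket.create_connection((host, port), timeout=timeout_seconds) as sock:
        sock.settimeout(timeout_seconds)
        sock.sendall(b"PING\r\n")
        return sock.recv(1024)
```

With a timeout in place, the failure mode changes from "healthy-looking but frozen" to a visible stream of timeout errors that monitoring can catch.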

Pattern 5: The Test That Was Disabled

A critical integration test was flaky. It failed about 5% of the time due to a timing issue in the test setup, not a real bug. The team disabled it with a TODO comment to fix later. Six months passed. A code change introduced a real integration bug in the exact code path the test covered. The bug shipped to production and caused data corruption for 0.3% of users over two days.

The small decision: Disabling a flaky test instead of fixing it. The big outage: Two days of silent data corruption in a critical path.
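Timing flakiness in test setup usually has a cheap, permanent fix: poll for the condition you actually care about instead of sleeping a fixed amount. A sketch of the helper (the queue-depth example in the comment is hypothetical):

```python
import time

def wait_until(predicate, timeout_seconds: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll until predicate() is true, up to a deadline. Replaces
    fixed sleeps, which are the usual source of timing-flaky tests."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# In a test: instead of time.sleep(1) and hoping the worker finished,
#   assert wait_until(lambda: queue_depth() == 0)
```

A fix like this costs an hour; the disabled test cost two days of data corruption.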

Why Small Decisions Compound

Three organizational dynamics allow small decisions to become big problems:

1. Local Rationality, Global Irrationality

Each team makes decisions that are rational from their perspective. Team A's retry policy is fine. Team B's retry policy is fine. But the combination of Team A's and Team B's retry policies creates a multiplicative amplification that neither team anticipated.

This is a coordination problem. It cannot be solved by making better local decisions. It requires system-level thinking, where someone looks at the behavior of the entire call chain, not just individual services.

2. The Normalization of Deviance

When a small problem exists but does not cause an incident, teams accept it as normal. The default connection pool size works fine at current load. The missing timeout has not been a problem yet. The disabled test has not caught anything.

Each of these is a latent risk. The system operates within its tolerances, and the risk is invisible. Until something changes (traffic spikes, a dependency degrades, a code change hits the untested path) and the latent risk becomes an active failure.

3. Configuration as an Afterthought

Most engineering effort goes into the code. Configuration, the values that determine how the code behaves in production, receives far less attention. Timeouts, pool sizes, retry counts, memory limits, queue depths: these are often set once during initial development and never revisited. But they determine how the system behaves under stress, which is exactly when correct behavior matters most.

What I Do Differently

  • Review configuration with the same rigor as code. Every timeout, pool size, and retry count should be justified by a calculation or a load test, not by a default value.
  • Map retry chains end to end. If your request path is A -> B -> C, the total retry amplification is the product of each layer's retry count. Make this explicit.
  • Enforce configuration consistency. All instances of a service should run the same configuration. Drift is a time bomb.
  • Set timeouts on everything. Every network call, every database query, every lock acquisition. The timeout value should be based on measured p99 latency with a reasonable margin.
  • Fix or delete flaky tests. A disabled test is worse than no test because it gives the illusion of coverage.
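The "measured p99 with a reasonable margin" rule can be made mechanical rather than a judgment call. A sketch, assuming latency samples are already collected in milliseconds and a 2x margin (both assumptions, not prescriptions):

```python
import statistics

def timeout_from_latency(samples_ms: list[float], margin: float = 2.0) -> float:
    """Derive a timeout from measured latency: p99 times a safety
    margin, rather than a guessed round number."""
    p99 = statistics.quantiles(samples_ms, n=100)[98]  # 99th percentile cut
    return p99 * margin
```

Recomputing this periodically also catches the case where a dependency's latency profile drifts and the old timeout quietly becomes either too tight or uselessly loose.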

Key Takeaways

  • Major outages are caused by the interaction of small, locally rational decisions, not by single dramatic failures.
  • Retry amplification across service layers is one of the most common patterns. Map the total amplification factor for every critical path.
  • Framework defaults are tuned for development, not production. Every production configuration value needs justification.
  • Missing timeouts turn dependency failures into self-inflicted outages.
  • Disabled tests create coverage gaps that silently widen over time.
  • System-level thinking (looking at behavior across the entire call chain) is required to catch the interactions that local thinking misses.

See also: Designing Event Schemas That Survive Product Changes.

Related: Designing Mobile Systems for Poor Network Conditions.


Final Thoughts

The most effective reliability engineering I have done was not building sophisticated failure detection systems. It was reviewing the small decisions: the timeouts, the retry policies, the connection pool sizes, the test coverage gaps. These are not glamorous. They do not make for exciting conference talks. But they are where most outages begin, and where most outages can be prevented.
