What Production Failures Have Taught Me
A collection of hard-won lessons from years of production incidents, covering the patterns that repeat across systems and the mental models that help prevent them.
Context
Every production failure I have been part of left a mark. Not the emotional kind, though those exist too. The kind that rewires how you think about building systems. After enough incidents, patterns emerge. The same categories of failure repeat across different stacks, different teams, different decades.
This is not a post-mortem. It is a distillation of what I carry forward from all of them.
The Failures That Taught the Most
1. The Silent Data Corruption
A billing system was writing incorrect amounts for a subset of users. No errors in logs. No alerts. The bug lived in a timezone conversion that only triggered for users in a specific offset range during DST transitions. It ran for six weeks before a user complaint surfaced it.
Lesson: The absence of errors is not evidence of correctness. Systems need positive confirmation of expected behavior, not just negative detection of unexpected behavior.
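The "positive confirmation" idea can be made concrete as a periodic invariant check over recent writes. A minimal sketch in Python, with a hypothetical charge schema and illustrative bounds (none of these names come from the actual incident):

```python
from datetime import datetime, timedelta, timezone

def check_billing_invariants(charges):
    """Return a list of violations instead of trusting the absence of errors.

    Each charge is assumed to be a dict with 'user_id', 'amount_cents',
    and a 'billed_at' timestamp (hypothetical schema).
    """
    violations = []
    for charge in charges:
        # Positive assertion: amounts must fall in a sane range,
        # rather than merely not raising on write.
        if not (0 <= charge["amount_cents"] <= 1_000_000):
            violations.append((charge["user_id"], "amount out of range"))
        # Positive assertion: timestamps must be timezone-aware UTC, so a
        # bad DST conversion upstream cannot slip through silently.
        ts = charge["billed_at"]
        if ts.tzinfo is None or ts.utcoffset() != timedelta(0):
            violations.append((charge["user_id"], "naive or non-UTC timestamp"))
    return violations
```

Run on a schedule against recent data, a check like this turns "no errors" into an actively verified claim.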
2. The Cascading Retry Storm
A downstream service started returning 503s at a rate of about 2%. Every upstream caller retried three times with no backoff. The downstream service went from 2% failure to 100% failure in under ninety seconds.
Lesson: Retries without backoff and circuit breakers are a distributed denial-of-service attack you launch against yourself. Every retry policy needs three things: exponential backoff, jitter, and a circuit breaker.
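All three ingredients fit in a small wrapper. A sketch with illustrative thresholds (the class and parameter names are mine, not from any particular library):

```python
import random
import time

class CircuitOpen(Exception):
    pass

class ResilientCaller:
    """Retry wrapper with exponential backoff, full jitter, and a simple
    failure-count circuit breaker. Thresholds are illustrative."""

    def __init__(self, max_attempts=3, base_delay=0.1, failure_threshold=5,
                 reset_after=30.0, sleep=time.sleep, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None
        self.sleep = sleep    # injectable for testing
        self.clock = clock

    def call(self, fn):
        # Circuit breaker: fail fast while the circuit is open, so a
        # struggling downstream gets breathing room instead of more load.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpen("circuit open; failing fast")
            self.opened_at = None  # half-open: allow a probe through
            self.failures = 0
        last_exc = None
        for attempt in range(self.max_attempts):
            try:
                result = fn()
                self.failures = 0
                return result
            except Exception as exc:
                last_exc = exc
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = self.clock()
                    raise CircuitOpen("failure threshold hit") from exc
                # Exponential backoff with full jitter: a random delay in
                # [0, base * 2^attempt], which decorrelates retry waves
                # across callers instead of synchronizing them.
                self.sleep(random.uniform(0, self.base_delay * 2 ** attempt))
        raise last_exc
```

Full jitter matters as much as the backoff itself: without it, every caller that failed at the same moment retries at the same moment, recreating the storm on schedule.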
3. The Config Change That Wasn't Tested
A feature flag change went out that modified the default timeout for a critical path from 5 seconds to 500 milliseconds. It was a typo. The config system had no validation, no diff review, and no canary rollout. The change affected 100% of traffic instantly.
Lesson: Config changes are deployments. They deserve the same rigor: review, validation, gradual rollout, and rollback capability.
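A validation layer that treats config like a deployment artifact does not have to be large. A sketch of the kind of check that would have flagged a 5 s to 500 ms typo, with hypothetical field names and bounds:

```python
def validate_config_change(old, new):
    """Guardrails for a config push: an absolute-range check plus a
    relative-change check. Field names and limits are illustrative."""
    errors = []
    # Schema check: the timeout must be an int inside a sane absolute
    # range (milliseconds), regardless of what the old value was.
    timeout = new.get("critical_path_timeout_ms")
    if not isinstance(timeout, int) or not (1_000 <= timeout <= 30_000):
        errors.append(f"timeout {timeout!r} outside allowed range 1000-30000 ms")
    # Diff check: flag any numeric value that moves by more than 50% in a
    # single push, forcing human confirmation for large swings.
    for key, new_val in new.items():
        old_val = old.get(key)
        if isinstance(old_val, (int, float)) and isinstance(new_val, (int, float)) and old_val:
            if abs(new_val - old_val) / abs(old_val) > 0.5:
                errors.append(f"{key} changed by more than 50% ({old_val} -> {new_val})")
    return errors
```

The second check is the important one: a typo rarely produces a plausible value, so "how far did this move" catches a class of errors that static range checks alone miss.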
4. The Migration That Locked the Table
A schema migration added an index to a table with 400 million rows. The migration tool acquired a write lock on the table. The migration took 45 minutes. For 45 minutes, no writes could land. The queue backed up, timeouts cascaded, and the entire write path went down.
Lesson: Schema changes on large tables need online DDL tools (pt-online-schema-change, gh-ost). Never assume your ORM's migration tool handles locking correctly at scale.
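One way to enforce that lesson mechanically is a pre-flight gate in the migration pipeline that refuses direct DDL on large tables and routes it to an online tool instead. A sketch, with an illustrative row threshold and return shape:

```python
LARGE_TABLE_ROWS = 1_000_000  # illustrative cutoff for "safe to run directly"

def plan_migration(table, row_count, ddl):
    """Pre-flight gate for schema changes: run direct DDL only on small
    tables; route anything large to an online schema-change tool."""
    if row_count <= LARGE_TABLE_ROWS:
        return {"strategy": "direct", "statement": ddl}
    # Large table: hand off to gh-ost or pt-online-schema-change, which
    # build the new schema alongside the old table and cut over briefly,
    # instead of holding a write lock for the whole operation.
    return {
        "strategy": "online",
        "tool": "gh-ost",
        "reason": f"{table} has {row_count} rows; direct DDL would block writes",
    }
```

The point is not the threshold itself but that the decision is made by tooling before the migration runs, not by an engineer's memory at 2 a.m.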
5. The Dependency Nobody Tracked
A microservice depended on an internal library that depended on an external SaaS API for geocoding. When that API changed its rate limiting policy, the library started throwing exceptions. The microservice team had no idea this dependency existed.
Lesson: Your dependency tree is your failure tree. If you cannot enumerate every external call your service makes, transitively, you cannot reason about its failure modes.
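Enumerating the transitive closure is the easy part once direct edges are recorded somewhere, whether in a service registry or mined from egress logs. A sketch over a hypothetical dependency graph:

```python
def transitive_dependencies(graph, service):
    """Walk a declared dependency graph (service -> list of direct deps)
    and return every transitive dependency. Keeping the graph current is
    the hard part; names here are illustrative."""
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Running this per service and diffing the result over time also surfaces the moment a library quietly picks up a new external call.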
Patterns Across Failures
After enough incidents, the taxonomy becomes clear:
| Category | Root Cause Pattern | Prevention |
|---|---|---|
| Silent failures | Missing assertions on expected behavior | Invariant checks, data quality monitors |
| Cascading failures | Tight coupling without backpressure | Circuit breakers, bulkheads, load shedding |
| Config failures | Treating config as low-risk | Validation schemas, canary rollouts, rollback |
| Data failures | Untested edge cases in transformations | Property-based testing, shadow pipelines |
| Dependency failures | Invisible transitive dependencies | Dependency mapping, failure injection |
What Changed in My Approach
These failures reshaped how I build:
- I write the alert before I write the feature. If I cannot define what "healthy" looks like for a feature, I do not understand it well enough to ship it.
- I treat every external call as hostile. Timeouts, retries with backoff, circuit breakers, fallback behavior. Every single one.
- I deploy config changes like code. Version controlled, reviewed, validated, gradually rolled out.
- I map failure domains early. Before building, I draw the blast radius of every dependency. If a single dependency can take down the whole system, that is a design flaw.
- I build for observability first. Not logging everything, but ensuring that every significant state transition is visible and queryable.
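The last point, making every significant state transition visible and queryable, can start as small as emitting one structured event per transition instead of free-text log lines. A sketch with illustrative field names:

```python
import json
import sys
import time

def emit_transition(entity, entity_id, old_state, new_state, **context):
    """Record a state transition as a single structured, queryable event.
    Field names are illustrative, not a standard schema."""
    event = {
        "ts": time.time(),
        "event": "state_transition",
        "entity": entity,
        "id": entity_id,
        "from": old_state,
        "to": new_state,
        **context,  # arbitrary extra fields, e.g. plan="pro"
    }
    # One JSON object per line is trivially ingested by most log pipelines.
    sys.stdout.write(json.dumps(event) + "\n")
    return event
```

Because every transition shares one shape, questions like "which subscriptions went straight from trial to cancelled this week" become queries rather than grep expeditions.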
What I Tell Junior Engineers
Three things I wish someone had told me earlier:
- Production will surprise you. Your mental model of the system is always incomplete. Design for that incompleteness.
- The most dangerous failures are the quiet ones. A crashing service gets fixed in minutes. A service that silently returns wrong data can run for months.
Incidents are investments. Every incident you handle well makes the system and the team more resilient. Every incident you rush past without learning from is an investment wasted.
Key Takeaways
- The absence of errors is not the presence of correctness. Build positive health signals, not just negative failure signals.
- Retries without backoff and circuit breakers amplify failures instead of handling them.
- Config changes deserve the same deployment rigor as code changes.
- Your transitive dependency tree is your failure surface. Map it.
- Write the alert before you write the feature.
- Silent failures cause more cumulative damage than loud ones.
Final Thoughts
Production failures are not random. They cluster around a small number of patterns: missing observability, uncontrolled blast radius, untested edge cases, and invisible dependencies. Once you learn to see these patterns, you start designing against them instinctively. The goal is not to prevent all failures. That is impossible. The goal is to make failures visible, contained, and recoverable.