What Production Failures Have Taught Me

Dhruval Dhameliya·December 14, 2025·5 min read

A collection of hard-won lessons from years of production incidents, covering the patterns that repeat across systems and the mental models that help prevent them.

Context

Every production failure I have been part of left a mark. Not the emotional kind, though those exist too. The kind that rewires how you think about building systems. After enough incidents, patterns emerge. The same categories of failure repeat across different stacks, different teams, different decades.

This is not a post-mortem. It is a distillation of what I carry forward from all of them.

The Failures That Taught the Most

1. The Silent Data Corruption

A billing system was writing incorrect amounts for a subset of users. No errors in logs. No alerts. The bug lived in a timezone conversion that only triggered for users in a specific offset range during DST transitions. It ran for six weeks before a user complaint surfaced it.

Lesson: The absence of errors is not evidence of correctness. Systems need positive confirmation of expected behavior, not just negative detection of unexpected behavior.
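
One way to get positive confirmation is an invariant check that independently recomputes what the system should have produced and flags any divergence, even when nothing threw. This is a minimal sketch, not the billing system from the incident; the function name and the proration formula are illustrative.

```python
def check_billing_invariant(
    charge_cents: int,
    plan_price_cents: int,
    days_active: int,
    days_in_period: int,
) -> bool:
    """Positive confirmation of expected behavior: recompute the expected
    charge from first principles and compare it to what was written.

    Returns False on divergence. A real monitor would emit a metric or
    page someone rather than just return a boolean.
    """
    expected = round(plan_price_cents * days_active / days_in_period)
    return charge_cents == expected
```

Run periodically over recent writes, a check like this would have surfaced the timezone bug in hours instead of six weeks, because it asserts what correct output looks like instead of waiting for an error.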

2. The Cascading Retry Storm

A downstream service started returning 503s at a rate of about 2%. Every upstream caller retried three times with no backoff. The downstream service went from 2% failure to 100% failure in under ninety seconds.

Lesson: Retries without backoff and circuit breakers are a distributed denial-of-service attack you launch against yourself. Every retry policy needs three things: exponential backoff, jitter, and a circuit breaker.
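
The three ingredients fit together in a small wrapper. This is a sketch under simplifying assumptions (single-threaded, one failure counter, full jitter); the class and parameter names are illustrative, not a specific library's API.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are failing fast."""


class Retrier:
    def __init__(self, max_attempts=3, base_delay=0.1, max_delay=5.0,
                 failure_threshold=5, cooldown=30.0):
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.consecutive_failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        # Circuit breaker: while open, fail fast instead of hammering
        # an already-struggling downstream service.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: allow a trial call

        last_exc = None
        for attempt in range(self.max_attempts):
            try:
                result = fn()
                self.consecutive_failures = 0
                return result
            except Exception as exc:
                last_exc = exc
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()
                    raise CircuitOpenError("failure threshold reached") from exc
                # Exponential backoff with full jitter: spreads retries
                # out in time so callers don't stampede in lockstep.
                delay = min(self.max_delay, self.base_delay * 2 ** attempt)
                time.sleep(random.uniform(0, delay))
        raise last_exc
```

The jitter is what prevents the storm described above: without it, every caller retries at the same instants, and the downstream service sees synchronized waves of traffic exactly when it can least absorb them.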

3. The Config Change That Wasn't Tested

A feature flag change went out that modified the default timeout for a critical path from 5 seconds to 500 milliseconds. It was a typo. The config system had no validation, no diff review, and no canary rollout. The change affected 100% of traffic instantly.

Lesson: Config changes are deployments. They deserve the same rigor: review, validation, gradual rollout, and rollback capability.
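
A validation step in the config pipeline is the cheapest of those four safeguards. Here is a minimal sketch of a pre-deploy validator that enforces bounds and flags large relative changes for human review; the key name and the limits are illustrative, not from the incident's actual system.

```python
def validate_config(old: dict, new: dict) -> list[str]:
    """Return a list of validation errors for a proposed config change.

    Two illustrative checks: an absolute bounds check, and a diff check
    that escalates any change larger than 50% for explicit approval.
    An empty list means the change may proceed to canary rollout.
    """
    errors = []
    timeout_ms = new.get("critical_path_timeout_ms")
    if timeout_ms is not None:
        if not 1000 <= timeout_ms <= 30000:
            errors.append(
                f"critical_path_timeout_ms={timeout_ms} outside [1000, 30000]"
            )
        old_timeout = old.get("critical_path_timeout_ms")
        if old_timeout and abs(timeout_ms - old_timeout) / old_timeout > 0.5:
            errors.append(
                "critical_path_timeout_ms changed by more than 50%; "
                "requires explicit approval"
            )
    return errors
```

Either check alone would have caught the 5 s to 500 ms typo before it reached any traffic.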

4. The Migration That Locked the Table

A schema migration added an index to a table with 400 million rows. The migration tool acquired a write lock on the table. The migration took 45 minutes. For 45 minutes, no writes could land. The queue backed up, timeouts cascaded, and the entire write path went down.

Lesson: Schema changes on large tables need online DDL tools (pt-online-schema-change, gh-ost). Never assume your ORM's migration tool handles locking correctly at scale.

5. The Dependency Nobody Tracked

A microservice depended on an internal library that depended on an external SaaS API for geocoding. When that API changed its rate limiting policy, the library started throwing. The microservice team had no idea this dependency existed.

Lesson: Your dependency tree is your failure tree. If you cannot enumerate every external call your service makes, transitively, you cannot reason about its failure modes.
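
One lightweight way to make that enumeration possible is to force every outbound call through a registration point, so the service can list its own external surface. This is a sketch of the idea, not an established library; the decorator, registry, and dependency names are all illustrative.

```python
import functools

# Process-wide registry of every known outbound dependency.
OUTBOUND_CALLS: set[str] = set()


def tracked_external_call(name: str):
    """Decorator: any function that leaves the process must register the
    dependency it talks to. The registry can then be exported, diffed in
    CI, or fed into failure-injection tests."""
    def wrap(fn):
        OUTBOUND_CALLS.add(name)

        @functools.wraps(fn)
        def inner(*args, **kwargs):
            return fn(*args, **kwargs)
        return inner
    return wrap


@tracked_external_call("geocoding-saas")
def geocode(address: str):
    ...  # real network call elided


@tracked_external_call("billing-db")
def fetch_invoice(user_id: int):
    ...  # real network call elided
```

Had the internal library registered its geocoding dependency this way, the microservice team could have seen it in the exported registry instead of discovering it mid-incident.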

Patterns Across Failures

After enough incidents, the taxonomy becomes clear:

| Category | Root Cause Pattern | Prevention |
| --- | --- | --- |
| Silent failures | Missing assertions on expected behavior | Invariant checks, data quality monitors |
| Cascading failures | Tight coupling without backpressure | Circuit breakers, bulkheads, load shedding |
| Config failures | Treating config as low-risk | Validation schemas, canary rollouts, rollback |
| Data failures | Untested edge cases in transformations | Property-based testing, shadow pipelines |
| Dependency failures | Invisible transitive dependencies | Dependency mapping, failure injection |

What Changed in My Approach

These failures reshaped how I build:

  • I write the alert before I write the feature. If I cannot define what "healthy" looks like for a feature, I do not understand it well enough to ship it.
  • I treat every external call as hostile. Timeouts, retries with backoff, circuit breakers, fallback behavior. Every single one.
  • I deploy config changes like code. Version controlled, reviewed, validated, gradually rolled out.
  • I map failure domains early. Before building, I draw the blast radius of every dependency. If a single dependency can take down the whole system, that is a design flaw.
  • I build for observability first. Not logging everything, but ensuring that every significant state transition is visible and queryable.
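
"Visible and queryable" can be as simple as one structured JSON line per significant state transition. This is a minimal sketch; the event and field names are illustrative.

```python
import json
import time


def emit_event(event: str, **fields) -> dict:
    """Emit one structured log line per significant state transition.

    Structured fields (rather than free-text messages) are what make the
    transition queryable later: you can filter and aggregate on them.
    """
    record = {"ts": time.time(), "event": event, **fields}
    print(json.dumps(record))
    return record


emit_event("subscription_renewed", user_id=42, amount_cents=500)
```

The discipline is choosing which transitions matter, not logging everything: renewal, cancellation, and payment-failure events, not every function entry.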

What I Tell Junior Engineers

Three things I wish someone had told me earlier:

  1. Production will surprise you. Your mental model of the system is always incomplete. Design for that incompleteness.
  2. The most dangerous failures are the quiet ones. A crashing service gets fixed in minutes. A service that silently returns wrong data can run for months.
  3. Incidents are investments. Every incident you handle well makes the system and the team more resilient. Every incident you rush past without learning from is a missed investment.

Key Takeaways

  • The absence of errors is not the presence of correctness. Build positive health signals, not just negative failure signals.
  • Retries without backoff and circuit breakers amplify failures instead of handling them.
  • Config changes deserve the same deployment rigor as code changes.
  • Your transitive dependency tree is your failure surface. Map it.
  • Write the alert before you write the feature.
  • Silent failures cause more cumulative damage than loud ones.

See also: Designing Event Schemas That Survive Product Changes.

Related: Designing a Feature Flag and Remote Config System.


Final Thoughts

Production failures are not random. They cluster around a small number of patterns: missing observability, uncontrolled blast radius, untested edge cases, and invisible dependencies. Once you learn to see these patterns, you start designing against them instinctively. The goal is not to prevent all failures. That is impossible. The goal is to make failures visible, contained, and recoverable.
