Failure Modes I Actively Design For
A catalog of failure modes that experienced engineers anticipate and design around, from cascading failures to data corruption to clock skew.
Systems do not fail in the ways you expect. They fail in the ways you did not think about. Over the years, I have built a mental catalog of failure modes that I now design for proactively, before the first line of production code.
Cascading Failures
A single service slowing down can take out an entire platform. Service A calls Service B, which is slow. A's thread pool fills up. A stops responding. Service C, which depends on A, also fills its thread pool. The cascade propagates until everything is down.
Defenses I always implement:
- Circuit breakers with explicit open, half-open, and closed states
- Timeout budgets that propagate across service boundaries (if the top-level timeout is 5 seconds and 3 seconds have elapsed, downstream calls get 2 seconds, not 5)
- Bulkheads to isolate critical paths from non-critical ones (separate thread pools, separate connection pools)
- Load shedding that drops low-priority requests before the system saturates
The circuit breaker alone is not enough. Without timeout budgets, a slow downstream service still consumes resources until the circuit opens.
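To make the timeout-budget idea concrete, here is a minimal sketch in Python. The names (`remaining_budget`, `call_downstream`) and the 50 ms floor are illustrative, not from any particular library; the point is that each downstream call receives the time left on the top-level deadline, never a fresh fixed timeout.

```python
import time

def remaining_budget(deadline: float) -> float:
    """Seconds left before the top-level deadline, never negative."""
    return max(0.0, deadline - time.monotonic())

def call_downstream(deadline: float, floor: float = 0.05) -> float:
    """Refuse the call outright if the remaining budget is below a small
    floor, rather than issuing a request that cannot complete in time."""
    budget = remaining_budget(deadline)
    if budget < floor:
        raise TimeoutError("budget exhausted before downstream call")
    # e.g. http_get(url, timeout=budget)  # pass the budget, not a fixed 5s
    return budget

# A request arrives with a 5-second top-level budget:
deadline = time.monotonic() + 5.0
# ...if 3 seconds of work happen before the downstream call,
# remaining_budget(deadline) is roughly 2 seconds, not 5.
```

Rejecting calls below the floor is itself a form of load shedding: a request that cannot finish in time only wastes downstream capacity.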
Retry Storms
A transient failure causes clients to retry. If every client retries immediately, the recovering service receives a burst of traffic that is a multiple of normal load. This pushes it back into failure.
The standard mitigation is exponential backoff with jitter. But that only works if all clients implement it correctly. The server-side defense is equally important:
- Return `Retry-After` headers with explicit backoff durations
- Implement adaptive rate limiting that tightens under load
- Use distinct response codes for "retry later" (503) versus "do not retry" (400)
I also design for the case where a client ignores the retry guidance entirely. You cannot control client behavior, so the server must protect itself regardless.
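The client-side mitigation above can be sketched in a few lines. This is the "full jitter" variant of exponential backoff; the function names and the base/cap values are illustrative defaults, not a specific library's API.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: sleep a uniform random amount up to an exponentially
    growing ceiling, so a recovering server sees a spread of retries
    instead of synchronized waves."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)

def should_retry(status: int) -> bool:
    """Retry only on 'try later' signals; a 400 means the request
    itself is bad and will never succeed."""
    return status in (429, 503)
```

The jitter is the important part: plain exponential backoff still synchronizes all clients that failed at the same instant.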
Partial Failures in Distributed Transactions
See also: Handling Partial Failures in Distributed Mobile Systems.
When a multi-step operation succeeds at step 3 of 5, you have a partially completed transaction. This is one of the hardest failure modes to handle correctly.
Approaches I use depending on the context:
| Pattern | When to use | Trade-off |
|---|---|---|
| Saga with compensating transactions | Business operations across services | Requires every step to have an inverse |
| Outbox pattern | Ensuring event publication with local transactions | Adds operational complexity of polling |
| Idempotency keys | Safely retrying failed operations | Requires careful key design and storage |
| Two-phase commit | Strong consistency required across databases | Performance cost, coordinator is SPOF |
In practice, I reach for the saga pattern most often. The key is making compensating transactions truly idempotent, because they will be retried too.
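A saga runner can be surprisingly small. The sketch below is a generic illustration, assuming each step is paired with its compensating action; in a real system the steps would be service calls and the compensation log would be persisted, not an in-memory list.

```python
def run_saga(steps):
    """steps: list of (action, compensate) callables. On failure, run the
    compensations for the steps that completed, in reverse order.
    Compensations must be idempotent: a crash mid-rollback means they
    may be executed again on recovery."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise
```

Note that the failed step's own compensation never runs; only steps that fully completed are compensated.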
Related: Designing Event Schemas That Survive Product Changes.
Data Corruption
Not all data corruption is dramatic. The most dangerous kind is subtle: a wrong value written to one field, a timezone conversion that silently shifts timestamps by hours, a Unicode normalization issue that makes string comparisons fail.
Defenses:
- Checksums on critical data paths. If you write a financial amount, verify it on read.
- Audit logs that capture before and after states. When corruption is detected, you need to know the last known good value.
- Validation at every system boundary. Do not trust that upstream validated the data correctly.
- Immutable append-only stores for critical state. Never update in place if you can append a new version.
I once spent three weeks debugging a currency conversion issue where a float-to-decimal cast silently lost precision. The total financial impact was small, but the trust impact was significant. Since then, I store all monetary values as integers (cents) with explicit currency codes.
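The integer-cents representation looks roughly like this sketch. The `Money` type and `parse_amount` helper are hypothetical names; the essential choices are that amounts never pass through binary floats and that the currency code travels with the value.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import NamedTuple

class Money(NamedTuple):
    """A monetary amount as integer minor units plus an explicit currency."""
    minor_units: int   # e.g. cents
    currency: str      # ISO 4217 code, e.g. "USD"

def parse_amount(text: str, currency: str) -> Money:
    """Parse a decimal string without ever constructing a float."""
    cents = (Decimal(text) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return Money(int(cents), currency)
```

Integer arithmetic on minor units is exact, which is precisely what the float-to-decimal cast in the incident above was not.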
Clock Skew
Distributed systems rely on timestamps for ordering, expiration, and coordination. Clock skew between machines can be seconds or even minutes if NTP is misconfigured.
Design decisions I make to handle this:
- Never use wall clock time for ordering. Use logical clocks or sequence numbers.
- Build expiration logic with a tolerance window. If a token expires at T, accept it until T + skew_tolerance.
- Log timestamps from the machine that generates the event, but use server-side timestamps for ordering decisions.
- Monitor NTP drift as a system health metric.
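The tolerance-window rule from the list above reduces to a one-line check. The 30-second default is an illustrative assumption; the right value depends on how much drift your NTP monitoring actually observes.

```python
def token_valid(expires_at: float, now: float, skew_tolerance: float = 30.0) -> bool:
    """Accept a token until expires_at + skew_tolerance, so a validator
    whose clock runs slightly ahead of the issuer's does not reject
    tokens that are still logically valid."""
    return now <= expires_at + skew_tolerance
```

The asymmetry is deliberate: accepting a recently expired token briefly is usually harmless, while rejecting a valid one breaks users.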
Resource Exhaustion
Every resource has a limit: file descriptors, disk space, memory, connection pool slots, thread pool capacity. Systems rarely fail at 100% utilization. They start degrading much earlier.
I set alerts at 70% utilization for all critical resources. Not because 70% is dangerous, but because the degradation curve is non-linear. Performance at 90% utilization is dramatically worse than at 70%.
Specific resources I always monitor:
- Database connection pool usage (alert at 70% of max connections)
- Disk space on machines that write logs or temporary files
- File descriptor counts on systems with many open connections
- JVM heap usage and garbage collection pause times
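The 70% rule is simple enough to express as a check over measured usage versus hard limits. This is an illustrative sketch, not a monitoring tool's API; real systems would feed these ratios into an alerting pipeline.

```python
def utilization_alerts(usage: dict, limits: dict, threshold: float = 0.70) -> dict:
    """Return the resources whose usage has crossed the threshold fraction
    of their hard limit. Alerting well before saturation matters because
    degradation near the limit is non-linear."""
    return {
        name: usage[name] / limits[name]
        for name in usage
        if usage[name] / limits[name] >= threshold
    }
```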
Poison Messages
A malformed message enters a queue. The consumer attempts to process it, fails, and the message returns to the queue. The consumer picks it up again. This loop continues forever, blocking all other messages.
Every queue consumer I build includes:
- A maximum retry count per message
- A dead letter queue for messages that exceed the retry count
- Alerting on dead letter queue depth
- A mechanism to inspect and replay dead letter messages after fixing the bug
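The first two items combine into a small consumer loop. The sketch below uses an in-memory deque standing in for a real broker, and a retry cap of 5 as an illustrative default; the mechanics are the same whether the queue is SQS, RabbitMQ, or Kafka with a retry topic.

```python
from collections import deque

MAX_ATTEMPTS = 5

def drain(queue: deque, handle, dead_letters: list) -> None:
    """Process (message, attempts) pairs; requeue failures until the retry
    cap is hit, then divert the message to the dead letter list so the
    rest of the queue keeps moving."""
    while queue:
        message, attempts = queue.popleft()
        try:
            handle(message)
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.append(message)
            else:
                queue.append((message, attempts + 1))
```

Without the attempt counter, the poison message cycles forever and the loop never terminates, which is exactly the failure mode described above.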
Deployment Failures
A deployment that partially succeeds is worse than one that completely fails. Half your fleet runs v2 while the other half runs v1. If there are incompatible changes, the system behaves unpredictably.
My deployment requirements:
- Rolling deployments with health checks between batches
- Automatic rollback if health checks fail
- Backward-compatible changes deployed before forward-only changes
- The ability to run mixed versions during the rollout window
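The first two requirements can be sketched as a batched rollout loop. The function and callback names are hypothetical; real orchestrators (Kubernetes rolling updates, Spinnaker, and the like) implement the same shape with far more machinery.

```python
def rolling_deploy(hosts, deploy, healthy, batch_size=2):
    """Deploy in batches; if any host in a batch fails its health check,
    stop immediately and report every host deployed so far, so the
    caller can roll them all back."""
    deployed = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy(host)
            deployed.append(host)
        if not all(healthy(h) for h in batch):
            return ("rollback", deployed)
    return ("done", deployed)
```

Stopping at the first unhealthy batch limits how far the mixed-version window spreads before rollback begins.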
Key Takeaways
- Cascading failures require defense in depth: circuit breakers, timeout budgets, bulkheads, and load shedding together.
- Retry storms are a server-side problem even when caused by client behavior. Protect the server regardless of client cooperation.
- Data corruption defenses include checksums, audit logs, boundary validation, and immutable storage.
- Clock skew affects every distributed system. Use logical clocks for ordering and build tolerance windows for expiration.
- Poison messages need dead letter queues with alerting. Never allow a single bad message to block an entire queue.
Further Reading
- Implementing Webhooks and Observing Failure Modes: Designing a webhook delivery system with retries, dead letter queues, signature verification, and measured reliability under various fail...
- Lessons From Debugging Distributed Systems: Practical lessons from years of debugging distributed systems, covering the unique challenges of partial failures, clock skew, message or...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
Final Thoughts
The difference between a system that handles failure gracefully and one that collapses is not the presence of failure. It is whether the engineers anticipated the failure mode and built a response into the architecture. I design for failure not because I expect things to break, but because I know they will.