Failure Modes I Actively Design For
A catalog of failure modes that experienced engineers anticipate and design around, from cascading failures to data corruption to clock skew.
Systems do not fail in the ways you expect. They fail in the ways you did not think about. Over the years, I have built a mental catalog of failure modes that I now design for proactively, before the first line of production code.
Cascading Failures
A single service slowing down can take out an entire platform. Service A calls Service B, which is slow. A's thread pool fills up. A stops responding. Service C, which depends on A, also fills its thread pool. The cascade propagates until everything is down.
Defenses I always implement:
- Circuit breakers with explicit open, half-open, and closed states
- Timeout budgets that propagate across service boundaries (if the top-level timeout is 5 seconds and 3 seconds have elapsed, downstream calls get 2 seconds, not 5)
- Bulkheads to isolate critical paths from non-critical ones (separate thread pools, separate connection pools)
- Load shedding that drops low-priority requests before the system saturates
The circuit breaker alone is not enough. Without timeout budgets, a slow downstream service still consumes resources until the circuit opens.
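To make the timeout-budget idea concrete, here is a minimal sketch in Python. The names (`remaining_budget`, `call_downstream`) and the 50 ms floor are illustrative, not from any particular library; the point is that each downstream call receives the time left on the top-level deadline, never a fresh fixed timeout.

```python
import time

def remaining_budget(deadline: float) -> float:
    """Seconds left before the top-level deadline, never negative."""
    return max(0.0, deadline - time.monotonic())

def call_downstream(deadline: float, floor: float = 0.05) -> float:
    """Refuse the call outright if the remaining budget is below a small
    floor, rather than issuing a request that cannot complete in time."""
    budget = remaining_budget(deadline)
    if budget < floor:
        raise TimeoutError("budget exhausted before downstream call")
    # e.g. http_get(url, timeout=budget)  # pass the budget, not a fixed 5s
    return budget

# A request arrives with a 5-second top-level budget:
deadline = time.monotonic() + 5.0
# ...if 3 seconds of work happen before the downstream call,
# remaining_budget(deadline) is roughly 2 seconds, not 5.
```

Rejecting calls below the floor is itself a form of load shedding: a request that cannot finish in time only wastes downstream capacity.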
Retry Storms
A transient failure causes clients to retry. If every client retries immediately, the recovering service receives a burst of traffic that is a multiple of normal load. This pushes it back into failure.
The standard mitigation is exponential backoff with jitter. But that only works if all clients implement it correctly. The server-side defense is equally important:
- Return `Retry-After` headers with explicit backoff durations
- Implement adaptive rate limiting that tightens under load
- Use distinct response codes for "retry later" (503) versus "do not retry" (400)
I also design for the case where a client ignores the retry guidance entirely. You cannot control client behavior, so the server must protect itself regardless.
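The client-side mitigation above can be sketched in a few lines. This is the "full jitter" variant of exponential backoff; the function names and the base/cap values are illustrative defaults, not a specific library's API.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full jitter: sleep a uniform random amount up to an exponentially
    growing ceiling, so a recovering server sees a spread of retries
    instead of synchronized waves."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)

def should_retry(status: int) -> bool:
    """Retry only on 'try later' signals; a 400 means the request
    itself is bad and will never succeed."""
    return status in (429, 503)
```

The jitter is the important part: plain exponential backoff still synchronizes all clients that failed at the same instant.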
Partial Failures in Distributed Transactions
See also: Handling Partial Failures in Distributed Mobile Systems.
When a multi-step operation succeeds at step 3 of 5, you have a partially completed transaction. This is one of the hardest failure modes to handle correctly.
Approaches I use depending on the context:
| Pattern | When to use | Trade-off |
|---|---|---|
| Saga with compensating transactions | Business operations across services | Requires every step to have an inverse |
| Outbox pattern | Ensuring event publication with local transactions | Adds operational complexity of polling |
| Idempotency keys | Safely retrying failed operations | Requires careful key design and storage |
| Two-phase commit | Strong consistency required across databases | Performance cost, coordinator is SPOF |
In practice, I reach for the saga pattern most often. The key is making compensating transactions truly idempotent, because they will be retried too.
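A saga runner can be surprisingly small. The sketch below is a generic illustration, assuming each step is paired with its compensating action; in a real system the steps would be service calls and the compensation log would be persisted, not an in-memory list.

```python
def run_saga(steps):
    """steps: list of (action, compensate) callables. On failure, run the
    compensations for the steps that completed, in reverse order.
    Compensations must be idempotent: a crash mid-rollback means they
    may be executed again on recovery."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        raise
```

Note that the failed step's own compensation never runs; only steps that fully completed are compensated.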
Related: Designing Event Schemas That Survive Product Changes.
Data Corruption
Not all data corruption is dramatic. The most dangerous kind is subtle: a wrong value written to one field, a timezone conversion that silently shifts timestamps by hours, a Unicode normalization issue that makes string comparisons fail.
Defenses:
- Checksums on critical data paths. If you write a financial amount, verify it on read.
- Audit logs that capture before and after states. When corruption is detected, you need to know the last known good value.
- Validation at every system boundary. Do not trust that upstream validated the data correctly.
- Immutable append-only stores for critical state. Never update in place if you can append a new version.
I once spent three weeks debugging a currency conversion issue where a float-to-decimal cast silently lost precision. The total financial impact was small, but the trust impact was significant. Since then, I store all monetary values as integers (cents) with explicit currency codes.
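The integer-cents representation looks roughly like this sketch. The `Money` type and `parse_amount` helper are hypothetical names; the essential choices are that amounts never pass through binary floats and that the currency code travels with the value.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import NamedTuple

class Money(NamedTuple):
    """A monetary amount as integer minor units plus an explicit currency."""
    minor_units: int   # e.g. cents
    currency: str      # ISO 4217 code, e.g. "USD"

def parse_amount(text: str, currency: str) -> Money:
    """Parse a decimal string without ever constructing a float."""
    cents = (Decimal(text) * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    return Money(int(cents), currency)
```

Integer arithmetic on minor units is exact, which is precisely what the float-to-decimal cast in the incident above was not.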
Clock Skew
Distributed systems rely on timestamps for ordering, expiration, and coordination. Clock skew between machines can be seconds or even minutes if NTP is misconfigured.
Design decisions I make to handle this:
- Never use wall clock time for ordering. Use logical clocks or sequence numbers.
- Build expiration logic with a tolerance window. If a token expires at T, accept it until T + skew_tolerance.
- Log timestamps from the machine that generates the event, but use server-side timestamps for ordering decisions.
- Monitor NTP drift as a system health metric.
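The tolerance-window rule from the list above reduces to a one-line check. The 30-second default is an illustrative assumption; the right value depends on how much drift your NTP monitoring actually observes.

```python
def token_valid(expires_at: float, now: float, skew_tolerance: float = 30.0) -> bool:
    """Accept a token until expires_at + skew_tolerance, so a validator
    whose clock runs slightly ahead of the issuer's does not reject
    tokens that are still logically valid."""
    return now <= expires_at + skew_tolerance
```

The asymmetry is deliberate: accepting a recently expired token briefly is usually harmless, while rejecting a valid one breaks users.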
Resource Exhaustion
Every resource has a limit: file descriptors, disk space, memory, connection pool slots, thread pool capacity. Systems rarely fail at 100% utilization. They start degrading much earlier.
I set alerts at 70% utilization for all critical resources. Not because 70% is dangerous, but because the degradation curve is non-linear. Performance at 90% utilization is dramatically worse than at 70%.
Specific resources I always monitor:
- Database connection pool usage (alert at 70% of max connections)
- Disk space on machines that write logs or temporary files
- File descriptor counts on systems with many open connections
- JVM heap usage and garbage collection pause times
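The 70% rule is simple enough to express as a check over measured usage versus hard limits. This is an illustrative sketch, not a monitoring tool's API; real systems would feed these ratios into an alerting pipeline.

```python
def utilization_alerts(usage: dict, limits: dict, threshold: float = 0.70) -> dict:
    """Return the resources whose usage has crossed the threshold fraction
    of their hard limit. Alerting well before saturation matters because
    degradation near the limit is non-linear."""
    return {
        name: usage[name] / limits[name]
        for name in usage
        if usage[name] / limits[name] >= threshold
    }
```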
Poison Messages
A malformed message enters a queue. The consumer attempts to process it, fails, and the message returns to the queue. The consumer picks it up again. This loop continues forever, blocking all other messages.
Every queue consumer I build includes:
- A maximum retry count per message
- A dead letter queue for messages that exceed the retry count
- Alerting on dead letter queue depth
- A mechanism to inspect and replay dead letter messages after fixing the bug
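The first two items combine into a small consumer loop. The sketch below uses an in-memory deque standing in for a real broker, and a retry cap of 5 as an illustrative default; the mechanics are the same whether the queue is SQS, RabbitMQ, or Kafka with a retry topic.

```python
from collections import deque

MAX_ATTEMPTS = 5

def drain(queue: deque, handle, dead_letters: list) -> None:
    """Process (message, attempts) pairs; requeue failures until the retry
    cap is hit, then divert the message to the dead letter list so the
    rest of the queue keeps moving."""
    while queue:
        message, attempts = queue.popleft()
        try:
            handle(message)
        except Exception:
            if attempts + 1 >= MAX_ATTEMPTS:
                dead_letters.append(message)
            else:
                queue.append((message, attempts + 1))
```

Without the attempt counter, the poison message cycles forever and the loop never terminates, which is exactly the failure mode described above.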
Deployment Failures
A deployment that partially succeeds is worse than one that completely fails. Half your fleet runs v2 while the other half runs v1. If there are incompatible changes, the system behaves unpredictably.
My deployment requirements:
- Rolling deployments with health checks between batches
- Automatic rollback if health checks fail
- Backward-compatible changes deployed before forward-only changes
- The ability to run mixed versions during the rollout window
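The first two requirements can be sketched as a batched rollout loop. The function and callback names are hypothetical; real orchestrators (Kubernetes rolling updates, Spinnaker, and the like) implement the same shape with far more machinery.

```python
def rolling_deploy(hosts, deploy, healthy, batch_size=2):
    """Deploy in batches; if any host in a batch fails its health check,
    stop immediately and report every host deployed so far, so the
    caller can roll them all back."""
    deployed = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            deploy(host)
            deployed.append(host)
        if not all(healthy(h) for h in batch):
            return ("rollback", deployed)
    return ("done", deployed)
```

Stopping at the first unhealthy batch limits how far the mixed-version window spreads before rollback begins.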
Key Takeaways
- Cascading failures require defense in depth: circuit breakers, timeout budgets, bulkheads, and load shedding together.
- Retry storms are a server-side problem even when caused by client behavior. Protect the server regardless of client cooperation.
- Data corruption defenses include checksums, audit logs, boundary validation, and immutable storage.
- Clock skew affects every distributed system. Use logical clocks for ordering and build tolerance windows for expiration.
- Poison messages need dead letter queues with alerting. Never allow a single bad message to block an entire queue.
Further Reading
- Implementing Webhooks and Observing Failure Modes: Designing a webhook delivery system with retries, dead letter queues, signature verification, and measured reliability under various fail...
- Lessons From Debugging Distributed Systems: Practical lessons from years of debugging distributed systems, covering the unique challenges of partial failures, clock skew, message or...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
Final Thoughts
The difference between a system that handles failure gracefully and one that collapses is not the presence of failure. It is whether the engineers anticipated the failure mode and built a response into the architecture. I design for failure not because I expect things to break, but because I know they will.