Designing Systems That Degrade Gracefully
How to build systems that continue providing value when components fail, covering load shedding, fallback strategies, and partial availability patterns.
A system that works perfectly or not at all is a system that will eventually not work at all. Graceful degradation means the system continues providing value, possibly reduced value, when components fail. This is a design property, not an operational afterthought.
The Degradation Hierarchy
Not all functionality is equally important. The first step in designing for graceful degradation is establishing a clear hierarchy:
Tier 1: Core functionality that must always work. For an e-commerce site, this is search, product pages, and checkout. These paths get the most redundancy, the strictest SLOs, and the most investment in fallback strategies.
Tier 2: Important functionality that can be temporarily reduced. Personalized recommendations, real-time inventory counts, dynamic pricing. These enhance the experience but the system remains useful without them.
Tier 3: Non-critical functionality that can be disabled entirely. Analytics tracking, A/B test assignments, promotional banners. Disabling these has no impact on the user's ability to complete their task.
Define this hierarchy explicitly and document it. When a component fails, the team needs to know immediately which tier it affects and what the degraded behavior should be.
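One lightweight way to make the hierarchy explicit is a machine-readable feature-to-tier map that load shedding and fallback logic can consult. A minimal sketch, with illustrative tier assignments and feature names:

```python
from enum import IntEnum

class Tier(IntEnum):
    CORE = 1       # must always work
    REDUCIBLE = 2  # can be temporarily degraded
    OPTIONAL = 3   # can be disabled entirely

# Illustrative assignments; each team documents its own.
FEATURE_TIERS = {
    "search": Tier.CORE,
    "checkout": Tier.CORE,
    "recommendations": Tier.REDUCIBLE,
    "realtime_inventory": Tier.REDUCIBLE,
    "analytics": Tier.OPTIONAL,
    "promo_banners": Tier.OPTIONAL,
}

def features_to_disable(max_tier: Tier) -> list[str]:
    """Everything above max_tier is shed during an incident."""
    return [f for f, t in FEATURE_TIERS.items() if t > max_tier]
```

With a map like this, "degrade to Tier 2" becomes a single operational action rather than a scramble through documentation.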
Load Shedding
When the system is overwhelmed, serving all requests poorly is worse than serving some requests well and rejecting the rest.
Load shedding strategies, ordered from coarsest to finest:
- Drop all non-essential traffic. Disable Tier 3 functionality entirely. This frees resources for Tier 1 and Tier 2 paths.
- Rate-limit by client priority. Authenticated users get priority over anonymous users. Paying customers get priority over free-tier users. Internal services get priority over external API consumers.
- Reject expensive requests. Complex search queries, large batch operations, and report generation consume disproportionate resources. Reject them first.
- Queue overflow with backpressure. When queues fill, signal upstream to slow down rather than accepting messages that will expire before processing.
The implementation must be fast. If the load shedding logic itself consumes significant resources, it defeats the purpose. I typically implement it at the load balancer or API gateway level where the overhead is minimal.
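A priority-based admission check can be kept this cheap. A sketch, assuming an illustrative priority ordering and a hypothetical `shed_level` knob driven by overload signals:

```python
# Illustrative priority order: lower number = shed first.
PRIORITY = {"external_api": 0, "anonymous": 1, "free": 2, "paying": 3, "internal": 4}

class LoadShedder:
    """Sheds low-priority requests first as load rises.

    shed_level 0 accepts everything; each increment rejects the
    next-lowest priority class. Deliberately cheap: one dict
    lookup and one comparison per request.
    """
    def __init__(self):
        self.shed_level = 0

    def admit(self, client_class: str) -> bool:
        return PRIORITY[client_class] >= self.shed_level

shedder = LoadShedder()
shedder.shed_level = 2  # overload: drop external and anonymous traffic
```

The overload detector (queue depth, CPU, latency) adjusts `shed_level`; the per-request hot path stays constant-time.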
Fallback Strategies
When a dependency fails, the system needs a predetermined response:
| Dependency failure | Fallback strategy |
|---|---|
| Recommendation engine down | Show popular items instead of personalized ones |
| Real-time pricing unavailable | Use last-known cached price with staleness indicator |
| Search index degraded | Fall back to database query with reduced result quality |
| Authentication service slow | Extend session TTL, defer re-authentication |
| Payment processor timeout | Queue the charge for retry, confirm order provisionally |
| CDN unavailable | Serve from origin with reduced performance |
Each fallback has a trigger condition (how the failure is detected), a fallback behavior (what the system does instead), and a recovery condition (when normal behavior resumes).
The fallback must be tested regularly. An untested fallback is not a fallback. It is a hope. I run monthly "dependency failure drills" where we simulate each dependency going down and verify the fallback behaves correctly.
Circuit Breakers With Fallback Integration
Circuit breakers prevent cascading failures by cutting off calls to failing dependencies. But a circuit breaker that returns an error to the user is only half the solution. The other half is returning a degraded response instead.
The integration pattern:
```
try {
    result = callDependencyThroughCircuitBreaker()
} catch (CircuitOpenException e) {
    result = getFallbackResponse()
    markResponseAsDegraded()
}
```
The markResponseAsDegraded step is important. Downstream consumers and monitoring systems need to know that this response came from a fallback, not from the primary path. This prevents stale or approximate data from being treated as authoritative.
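A minimal, self-contained sketch of the integrated pattern in Python. The class shape, thresholds, and the tuple-flag convention are illustrative, not a specific library's API:

```python
import time

class CircuitBreaker:
    """Opens after `threshold` consecutive failures; retries the
    primary after `reset_after` seconds (half-open probe)."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback(), True          # circuit open: degraded response
            self.opened_at = None                # half-open: try primary again
        try:
            result = primary()
            self.failures = 0
            return result, False                 # primary path, not degraded
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback(), True
```

The second element of the returned tuple plays the role of markResponseAsDegraded: callers and monitoring can distinguish fallback responses from primary ones.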
Partial Availability
Some failures affect only a subset of users or data. A database shard going down makes data on that shard unavailable but should not affect data on other shards.
Design for partial availability:
- Shard-aware error handling. If shard 3 is down, requests for data on shard 3 return a clear "temporarily unavailable" response. Requests for data on other shards succeed normally.
- Feature-level isolation. If the notification service is down, the rest of the application continues working. Notifications are queued and delivered when the service recovers.
- Region-level isolation. If one data center is degraded, traffic shifts to healthy data centers. Users may experience higher latency but not an outage.
The key principle: a failure in one component should not propagate to unrelated components. This requires explicit isolation boundaries, not just logical separation.
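Shard-aware error handling can be sketched as follows, assuming modulo-based routing and a hypothetical `ShardUnavailable` error type mapped to a "temporarily unavailable" response at the API layer:

```python
# Illustrative shard routing: user_id maps to one of NUM_SHARDS.
NUM_SHARDS = 4
down_shards: set[int] = set()

class ShardUnavailable(Exception):
    """Signals 'temporarily unavailable' for one shard only."""

def shard_for(user_id: int) -> int:
    return user_id % NUM_SHARDS

def fetch_user(user_id: int) -> dict:
    shard = shard_for(user_id)
    if shard in down_shards:
        # Fail only requests that land on the broken shard;
        # everything else proceeds normally.
        raise ShardUnavailable(f"shard {shard} is temporarily unavailable")
    return {"id": user_id, "shard": shard}
```

The point is the blast radius: marking shard 3 down fails only the fraction of requests routed there, instead of turning a partial failure into a global error rate.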
Caching as a Degradation Strategy
Caches serve double duty: they improve performance under normal conditions and provide stale data during failures.
The cache degradation pattern:
- Under normal conditions: cache with a short TTL (minutes), refresh on expiry
- When the source is slow: extend the TTL, serve slightly stale data
- When the source is down: serve from cache indefinitely with a staleness warning
This requires the cache to store metadata alongside the data: when it was cached, what the original TTL was, and whether the source was healthy at the time of caching.
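One way to carry that metadata, as a minimal sketch (the `CacheEntry` shape and the `(value, stale_flag)` contract are illustrative):

```python
import time

class CacheEntry:
    def __init__(self, value, ttl: float):
        self.value = value
        self.cached_at = time.monotonic()
        self.ttl = ttl  # original TTL, kept as metadata

def lookup(entry: CacheEntry, source_healthy: bool):
    """Return (value, stale_flag) per the degradation pattern above."""
    age = time.monotonic() - entry.cached_at
    if age <= entry.ttl:
        return entry.value, False   # fresh: serve normally
    if not source_healthy:
        return entry.value, True    # source down: serve stale, flag it
    return None, False              # expired and source healthy: refetch
```

The stale flag propagates to responses (e.g. a staleness indicator in the UI) and to metrics, so degraded reads are visible rather than silent.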
Not all data is safe to serve stale. Financial balances, inventory counts, and security permissions must not be served from a stale cache. But product descriptions, user profiles, and configuration values can tolerate minutes or hours of staleness with minimal impact.
Client-Side Degradation
The client can participate in graceful degradation:
- Retry with exponential backoff. The client handles transient failures without user intervention.
- Offline mode. For mobile applications, cache critical data locally and allow the user to continue working. Sync when connectivity returns.
- Progressive loading. Load critical content first. If non-critical content fails to load, the page is still usable.
- Error boundaries. A failure in one component (a widget, a panel) should not crash the entire page.
Client-side degradation must be designed, not accidental. The default behavior when a request fails should be a considered design decision, not an unhandled exception.
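The first item above, retry with exponential backoff, can be sketched like this (parameter defaults, full jitter, and the injectable `sleep` are illustrative choices, not a prescribed policy):

```python
import random
import time

def retry_with_backoff(fn, attempts=4, base=0.5, cap=8.0, sleep=time.sleep):
    """Retry transient failures with capped, jittered exponential backoff.

    `sleep` is injectable so tests do not actually wait.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter avoids retry storms
```

Jitter matters: without it, many clients that failed together retry together, re-creating the overload that caused the failure.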
Testing Degradation
Graceful degradation that is not tested will not work when needed. Testing approaches:
- Chaos engineering. Randomly kill dependencies in a staging environment and verify the system degrades correctly.
- Dependency injection for failure. Inject failures into specific dependencies in a controlled manner. Verify the fallback activates and the system recovers when the failure is removed.
- Load testing beyond capacity. Push the system past its limits and verify that load shedding activates correctly and critical paths remain functional.
- Game days. Simulate major outage scenarios with the on-call team. Practice the decision-making process, not just the technical mechanisms.
Test the recovery path as well. A system that degrades gracefully but does not recover automatically when the failure resolves is still problematic.
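A controlled failure-injection test, including the recovery path, might look like this. The `FlakyDependency` test double and `get_data` helper are hypothetical:

```python
class FlakyDependency:
    """Test double that lets a drill inject and remove a failure."""
    def __init__(self):
        self.down = False

    def fetch(self):
        if self.down:
            raise ConnectionError("injected failure")
        return "live data"

def get_data(dep, cache):
    """Primary path with cached fallback; returns (value, degraded_flag)."""
    try:
        value = dep.fetch()
        cache["last"] = value
        return value, False
    except ConnectionError:
        return cache["last"], True
```

The drill asserts three things in sequence: the fallback activates under failure, the response is flagged as degraded, and the primary path resumes automatically once the injected failure is removed.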
Key Takeaways
- Establish a clear functionality hierarchy (Tier 1, 2, 3) and document what degrades versus what breaks.
- Load shedding should prioritize critical traffic and reject expensive operations first.
- Every dependency failure needs a predetermined fallback, a trigger condition, and a recovery condition.
- Test degradation paths monthly. An untested fallback is not a real fallback.
- Caches serve as degradation infrastructure: extend TTLs when sources are unavailable, with staleness metadata.
- Partial availability means a failure in one component does not propagate to unrelated components.
Final Thoughts
Users tolerate degraded experiences far better than they tolerate outages. A recommendation engine that shows popular items instead of personalized ones is barely noticed. A checkout page that returns a 500 error loses revenue and trust. Designing for graceful degradation is not about perfection under failure. It is about maintaining the core value proposition when the system is under stress.