Designing Retry and Backoff Strategies for Mobile Networks
A detailed look at retry strategies for mobile clients, covering exponential backoff, jitter, circuit breakers, and adaptive retry policies for unreliable networks.
Retrying failed requests sounds trivial. In practice, naive retries amplify failures, drain batteries, and cause thundering herds that take down backends. This post covers how to design retry strategies that help rather than harm.
Related: How I'd Design a Mobile Configuration System at Scale.
Context
Mobile clients face transient failures constantly: DNS resolution timeouts, TCP connection resets, HTTP 5xx responses, and network switches between WiFi and cellular. A well-designed retry strategy recovers from transient failures transparently while protecting both the client (battery, data usage) and the server (load, availability).
Problem
Design a retry and backoff system that:
- Recovers from transient network failures automatically
- Avoids amplifying server load during partial outages
- Respects device constraints (battery, data plan)
- Adapts to network conditions in real time
See also: Handling Partial Failures in Distributed Mobile Systems.
Constraints
| Constraint | Detail |
|---|---|
| Battery | Each network request has a measurable energy cost because the radio must wake into a high-power state; unnecessary retries measurably drain battery |
| Data usage | Metered connections require conservative retry behavior |
| Server load | N clients retrying simultaneously creates N-fold load amplification |
| User perception | Retries should be invisible for background requests, fast for user-initiated ones |
| Timeout budget | Total retry window should not exceed the user's patience (typically 15-30 seconds for foreground requests) |
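The timeout budget can be sanity-checked up front: for exponential backoff, the worst-case total backoff is a simple sum. A hypothetical helper (the defaults mirror the policy in the Design section):

```kotlin
// Worst-case total backoff for exponential delays with jitter:
// each attempt waits at most cap(base * 2^attempt) * (1 + jitterFactor).
fun worstCaseRetryWindowMs(
    maxRetries: Int = 3,
    baseDelayMs: Long = 1_000,
    maxDelayMs: Long = 30_000,
    jitterFactor: Double = 0.5
): Long = (0 until maxRetries).sumOf { attempt ->
    val exponential = baseDelayMs * (1L shl attempt)  // base * 2^attempt
    val capped = minOf(exponential, maxDelayMs)
    (capped * (1 + jitterFactor)).toLong()
}
```

With the defaults this comes to 10.5 seconds of backoff (plus per-attempt request timeouts), comfortably inside a 15-30 second foreground budget.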
Design
Retry Classification
Not all failures are retryable:
| Status/Error | Retryable | Rationale |
|---|---|---|
| HTTP 500, 502, 503 | Yes | Server-side transient errors |
| HTTP 429 | Yes (with Retry-After) | Rate limited, explicitly told to retry |
| HTTP 400, 401, 403, 404 | No | Client errors, retrying will not help |
| HTTP 409 | Conditional | Conflict may resolve after state refresh |
| IOException (timeout) | Yes | Network transient |
| UnknownHostException | Yes (limited) | DNS may recover, but limit to 2 retries |
| SSLException | No | Certificate issues will not self-resolve |
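The table above can be sketched as a classification function. Names here are illustrative; note that `UnknownHostException` and `SSLException` are both `IOException` subclasses, so the order of the type checks matters:

```kotlin
import java.io.IOException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

enum class RetryDecision { RETRY, RETRY_LIMITED, NO_RETRY, CONDITIONAL }

// Mirrors the classification table: HTTP status first, then error type.
fun classify(httpCode: Int?, error: Throwable?): RetryDecision = when (httpCode) {
    500, 502, 503 -> RetryDecision.RETRY
    429 -> RetryDecision.RETRY              // honor Retry-After
    400, 401, 403, 404 -> RetryDecision.NO_RETRY
    409 -> RetryDecision.CONDITIONAL        // retry only after a state refresh
    else -> when (error) {
        // Checked before IOException: both are IOException subclasses.
        is UnknownHostException -> RetryDecision.RETRY_LIMITED  // cap at 2
        is SSLException -> RetryDecision.NO_RETRY
        is IOException -> RetryDecision.RETRY                   // timeouts, resets
        else -> RetryDecision.NO_RETRY
    }
}
```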
Exponential Backoff with Jitter
import java.io.IOException
import kotlin.math.pow
import kotlin.random.Random
import okhttp3.Response

// App-level wrapper: retrofit2.HttpException cannot wrap an okhttp3.Response,
// so HTTP status failures get their own exception type for shouldRetry.
class HttpException(private val response: Response) : RuntimeException("HTTP ${response.code}") {
    fun code(): Int = response.code
}

class RetryPolicy(
    val maxRetries: Int = 3,  // public: the interceptor drives its loop off this
    private val baseDelayMs: Long = 1000,
    private val maxDelayMs: Long = 30_000,
    private val jitterFactor: Double = 0.5
) {
    // Exponential backoff capped at maxDelayMs, plus random jitter on top.
    fun getDelay(attempt: Int): Long {
        val exponentialDelay = (baseDelayMs * 2.0.pow(attempt)).toLong()
        val cappedDelay = minOf(exponentialDelay, maxDelayMs)
        val jitter = (Random.nextDouble() * jitterFactor * cappedDelay).toLong()
        return cappedDelay + jitter
    }

    fun shouldRetry(attempt: Int, error: Throwable): Boolean {
        if (attempt >= maxRetries) return false
        return when (error) {
            is HttpException -> error.code() in listOf(500, 502, 503, 429)
            is IOException -> true
            else -> false
        }
    }
}

Why jitter matters: without jitter, all clients that failed at the same time retry at the same time (1s, 2s, 4s). Jitter decorrelates retry timing, spreading load across the retry window. Full jitter (randomizing the entire delay) is more effective than equal jitter (randomizing only half).
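The two schemes can be sketched directly (illustrative helpers, not part of the RetryPolicy above): full jitter draws the sleep from [0, delay], equal jitter from [delay/2, delay].

```kotlin
import kotlin.random.Random

// Full jitter: sleep anywhere in [0, cappedDelay]. Best decorrelation.
fun fullJitter(cappedDelayMs: Long, rng: Random = Random.Default): Long =
    (rng.nextDouble() * cappedDelayMs).toLong()

// Equal jitter: sleep in [cappedDelay/2, cappedDelay] -- half fixed, half random.
fun equalJitter(cappedDelayMs: Long, rng: Random = Random.Default): Long =
    cappedDelayMs / 2 + (rng.nextDouble() * (cappedDelayMs / 2)).toLong()
```

With a thousand failed clients and a 4-second cap, full jitter spreads retries across the entire 4-second window, while equal jitter bunches them all into the upper 2 seconds.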
Retry Interceptor
import java.io.IOException
import okhttp3.Interceptor
import okhttp3.Response

// Subclassing IOException lets OkHttp propagate this as a call failure.
class CircuitOpenException(message: String) : IOException(message)

class RetryInterceptor(
    private val retryPolicy: RetryPolicy,
    private val circuitBreaker: CircuitBreaker
) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        var lastException: Exception? = null
        for (attempt in 0..retryPolicy.maxRetries) {
            if (circuitBreaker.isOpen()) {
                throw CircuitOpenException("Circuit breaker is open for ${request.url.host}")
            }
            try {
                val response = chain.proceed(request)
                if (response.isSuccessful) {
                    circuitBreaker.recordSuccess()
                    return response
                }
                if (!retryPolicy.shouldRetry(attempt, HttpException(response))) {
                    return response
                }
                response.close()  // release the connection before retrying
                circuitBreaker.recordFailure()
                if (response.code == 429) {
                    // Retry-After is in seconds; fall back to our own backoff
                    val retryAfter = response.header("Retry-After")?.toLongOrNull()
                    Thread.sleep(retryAfter?.times(1000) ?: retryPolicy.getDelay(attempt))
                } else {
                    Thread.sleep(retryPolicy.getDelay(attempt))
                }
            } catch (e: IOException) {
                lastException = e
                circuitBreaker.recordFailure()
                if (!retryPolicy.shouldRetry(attempt, e)) throw e
                Thread.sleep(retryPolicy.getDelay(attempt))
            }
        }
        throw lastException ?: IOException("Retry exhausted")
    }
}

Circuit Breaker
Stops retrying when a service is clearly down, preventing wasted resources:
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.atomic.AtomicReference

class CircuitBreaker(
    private val failureThreshold: Int = 5,
    private val resetTimeoutMs: Long = 60_000
) {
    private val failureCount = AtomicInteger(0)
    private val lastFailureTime = AtomicLong(0)
    private val state = AtomicReference(State.CLOSED)

    enum class State { CLOSED, OPEN, HALF_OPEN }

    fun isOpen(): Boolean {
        if (state.get() == State.OPEN) {
            if (System.currentTimeMillis() - lastFailureTime.get() > resetTimeoutMs) {
                // Let a probe request through; success closes the circuit
                state.set(State.HALF_OPEN)
                return false
            }
            return true
        }
        return false
    }

    fun recordSuccess() {
        failureCount.set(0)
        state.set(State.CLOSED)
    }

    fun recordFailure() {
        lastFailureTime.set(System.currentTimeMillis())
        if (failureCount.incrementAndGet() >= failureThreshold) {
            state.set(State.OPEN)
        }
    }
}

Adaptive Retry Policy
Adjust retry behavior based on network quality:
class AdaptiveRetryPolicy(
    private val networkMonitor: NetworkMonitor
) {
    fun getPolicy(): RetryPolicy {
        return when (networkMonitor.getConnectionQuality()) {
            ConnectionQuality.EXCELLENT -> RetryPolicy(maxRetries = 2, baseDelayMs = 500)
            ConnectionQuality.GOOD -> RetryPolicy(maxRetries = 3, baseDelayMs = 1000)
            ConnectionQuality.MODERATE -> RetryPolicy(maxRetries = 3, baseDelayMs = 2000)
            ConnectionQuality.POOR -> RetryPolicy(maxRetries = 4, baseDelayMs = 3000)
            ConnectionQuality.NONE -> RetryPolicy(maxRetries = 0) // Don't retry, queue for later
        }
    }
}

Request Priority and Retry Budgets
Not all requests deserve the same retry investment:
| Priority | Max Retries | Total Timeout | Example |
|---|---|---|---|
| Critical | 5 | 60s | Payment submission |
| High | 3 | 30s | User-initiated data fetch |
| Normal | 2 | 15s | Background sync |
| Low | 1 | 5s | Analytics, prefetch |
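The budget table maps naturally onto code; this sketch uses a hypothetical `Priority` enum and `RetryBudget` type that are not defined elsewhere in the post:

```kotlin
enum class Priority { CRITICAL, HIGH, NORMAL, LOW }

data class RetryBudget(val maxRetries: Int, val totalTimeoutMs: Long)

// Mirrors the table: more retries and a longer window for higher priorities.
fun budgetFor(priority: Priority): RetryBudget = when (priority) {
    Priority.CRITICAL -> RetryBudget(maxRetries = 5, totalTimeoutMs = 60_000)
    Priority.HIGH -> RetryBudget(maxRetries = 3, totalTimeoutMs = 30_000)
    Priority.NORMAL -> RetryBudget(maxRetries = 2, totalTimeoutMs = 15_000)
    Priority.LOW -> RetryBudget(maxRetries = 1, totalTimeoutMs = 5_000)
}
```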
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Full jitter | Best load distribution | Potentially longer delays for individual requests |
| Circuit breaker | Protects server during outages | Requests fail fast without attempting, potential false opens |
| Adaptive policy | Optimized for current conditions | Requires network quality estimation, added complexity |
| Per-request priority | Resources allocated proportionally | Priority assignment adds API surface complexity |
| Retry-After header respect | Server-controlled backoff | Malicious or buggy servers can set extreme values |
Failure Modes
- Thundering herd after outage recovery: All circuit breakers enter HALF_OPEN simultaneously. Mitigation: add jitter to the reset timeout.
- Retry storm on partial outage: One degraded endpoint causes retries across all clients. Mitigation: per-endpoint circuit breakers, not global.
- Battery drain from aggressive retries: Background retries with short delays. Mitigation: use WorkManager for background requests, which respects Doze mode.
- Infinite retry on non-transient error: A 500 caused by a persistent bug will never succeed. Mitigation: after max retries, surface the error to the user or dead-letter the request.
- Metered connection waste: Retries consuming user's data plan. Mitigation: reduce max retries on metered connections, skip low-priority retries entirely.
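The first mitigation above, jittering the circuit breaker's reset timeout, could look like this sketch (an illustrative helper; the CircuitBreaker in the Design section would use it in place of its fixed `resetTimeoutMs`):

```kotlin
import kotlin.random.Random

// Spread HALF_OPEN transitions over [resetTimeoutMs, resetTimeoutMs * (1 + spread)]
// so recovering clients do not all probe the backend at the same instant.
fun jitteredResetTimeout(
    resetTimeoutMs: Long = 60_000,
    spread: Double = 0.25,
    rng: Random = Random.Default
): Long = resetTimeoutMs + (rng.nextDouble() * spread * resetTimeoutMs).toLong()
```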
Scaling Considerations
- At scale, even 1% of clients retrying simultaneously creates significant load. Full jitter and circuit breakers are not optional.
- Server-side rate limiting must complement client-side backoff. The server is the authoritative source of "back off" signals.
- For globally distributed APIs, retry policies should be region-aware. A US client hitting an EU endpoint with high latency should have longer timeouts, not more retries.
Observability
- Track: retry rate per endpoint, retry success rate (do retries actually help?), circuit breaker state transitions, total request duration including retries.
- Alert on: retry rate exceeding 10% (indicates systemic issue), circuit breaker staying open for more than 5 minutes, retry success rate below 50%.
- Log: each retry attempt with the attempt number, delay, and error type. This is essential for debugging retry-related issues.
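A minimal counter for the "do retries actually help?" question might look like this (names are illustrative, not from an existing metrics library):

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Tracks how often a request that needed at least one retry eventually succeeded.
class RetryMetrics {
    private val retriedRequests = AtomicLong(0)
    private val retriedSuccesses = AtomicLong(0)

    fun onRetriedRequest(succeeded: Boolean) {
        retriedRequests.incrementAndGet()
        if (succeeded) retriedSuccesses.incrementAndGet()
    }

    // Fraction of retried requests that ultimately succeeded; alert below 0.5.
    fun retrySuccessRate(): Double {
        val total = retriedRequests.get()
        return if (total == 0L) 1.0 else retriedSuccesses.get().toDouble() / total
    }
}
```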
Key Takeaways
- Classify errors before retrying. Retrying a 401 is always wrong.
- Exponential backoff without jitter is almost as bad as no backoff. Always add jitter.
- Circuit breakers are the server's safety valve. Without them, retries become a DDoS from your own clients.
- Adapt retry behavior to network conditions. A retry strategy for WiFi should not be the same as one for 2G.
- Budget retries per request priority. Not every request deserves 5 retries and 60 seconds of patience.
Further Reading
- Designing Mobile Systems for Poor Network Conditions: Architecture patterns for mobile apps that function reliably on slow, intermittent, and lossy networks, covering request prioritization, ...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
- Designing Rate Limiting for Mobile APIs: Rate limiting strategies for APIs consumed by mobile clients, covering token bucket algorithms, client identification, degradation modes,...
Final Thoughts
A retry strategy is a contract between client and server. The client promises not to overwhelm the server during failures. The server promises to communicate backoff requirements via status codes and headers. When both sides honor this contract, transient failures become invisible to users.