Designing Retry and Backoff Strategies for Mobile Networks
A detailed look at retry strategies for mobile clients, covering exponential backoff, jitter, circuit breakers, and adaptive retry policies for unreliable networks.
Retrying failed requests sounds trivial. In practice, naive retries amplify failures, drain batteries, and cause thundering herds that take down backends. This post covers how to design retry strategies that help rather than harm.
Related: How I'd Design a Mobile Configuration System at Scale.
Context
Mobile clients face transient failures constantly: DNS resolution timeouts, TCP connection resets, HTTP 5xx responses, and network switches between WiFi and cellular. A well-designed retry strategy recovers from transient failures transparently while protecting both the client (battery, data usage) and the server (load, availability).
Problem
Design a retry and backoff system that:
- Recovers from transient network failures automatically
- Avoids amplifying server load during partial outages
- Respects device constraints (battery, data plan)
- Adapts to network conditions in real time
See also: Handling Partial Failures in Distributed Mobile Systems.
Constraints
| Constraint | Detail |
|---|---|
| Battery | Each network request has a measurable energy cost because the radio must wake into a high-power state; unnecessary retries measurably drain battery |
| Data usage | Metered connections require conservative retry behavior |
| Server load | N clients retrying simultaneously creates N-fold load amplification |
| User perception | Retries should be invisible for background requests, fast for user-initiated ones |
| Timeout budget | Total retry window should not exceed the user's patience (typically 15-30 seconds for foreground requests) |
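The timeout budget can be sanity-checked up front: for exponential backoff, the worst-case total backoff is a simple sum. A hypothetical helper (the defaults mirror the policy in the Design section):

```kotlin
// Worst-case total backoff for exponential delays with jitter:
// each attempt waits at most cap(base * 2^attempt) * (1 + jitterFactor).
fun worstCaseRetryWindowMs(
    maxRetries: Int = 3,
    baseDelayMs: Long = 1_000,
    maxDelayMs: Long = 30_000,
    jitterFactor: Double = 0.5
): Long = (0 until maxRetries).sumOf { attempt ->
    val exponential = baseDelayMs * (1L shl attempt)  // base * 2^attempt
    val capped = minOf(exponential, maxDelayMs)
    (capped * (1 + jitterFactor)).toLong()
}
```

With the defaults this comes to 10.5 seconds of backoff (plus per-attempt request timeouts), comfortably inside a 15-30 second foreground budget.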
Design
Retry Classification
Not all failures are retryable:
| Status/Error | Retryable | Rationale |
|---|---|---|
| HTTP 500, 502, 503 | Yes | Server-side transient errors |
| HTTP 429 | Yes (with Retry-After) | Rate limited, explicitly told to retry |
| HTTP 400, 401, 403, 404 | No | Client errors, retrying will not help |
| HTTP 409 | Conditional | Conflict may resolve after state refresh |
| IOException (timeout) | Yes | Network transient |
| UnknownHostException | Yes (limited) | DNS may recover, but limit to 2 retries |
| SSLException | No | Certificate issues will not self-resolve |
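The table above can be sketched as a classification function. Names here are illustrative; note that `UnknownHostException` and `SSLException` are both `IOException` subclasses, so the order of the type checks matters:

```kotlin
import java.io.IOException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

enum class RetryDecision { RETRY, RETRY_LIMITED, NO_RETRY, CONDITIONAL }

// Mirrors the classification table: HTTP status first, then error type.
fun classify(httpCode: Int?, error: Throwable?): RetryDecision = when (httpCode) {
    500, 502, 503 -> RetryDecision.RETRY
    429 -> RetryDecision.RETRY              // honor Retry-After
    400, 401, 403, 404 -> RetryDecision.NO_RETRY
    409 -> RetryDecision.CONDITIONAL        // retry only after a state refresh
    else -> when (error) {
        // Checked before IOException: both are IOException subclasses.
        is UnknownHostException -> RetryDecision.RETRY_LIMITED  // cap at 2
        is SSLException -> RetryDecision.NO_RETRY
        is IOException -> RetryDecision.RETRY                   // timeouts, resets
        else -> RetryDecision.NO_RETRY
    }
}
```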
Exponential Backoff with Jitter
import java.io.IOException
import kotlin.math.pow
import kotlin.random.Random
import okhttp3.Response

// App-level wrapper: retrofit2.HttpException cannot wrap an okhttp3.Response,
// so HTTP status failures get their own exception type for shouldRetry.
class HttpException(private val response: Response) : RuntimeException("HTTP ${response.code}") {
    fun code(): Int = response.code
}

class RetryPolicy(
    val maxRetries: Int = 3,  // public: the interceptor drives its loop off this
    private val baseDelayMs: Long = 1000,
    private val maxDelayMs: Long = 30_000,
    private val jitterFactor: Double = 0.5
) {
    // Exponential backoff capped at maxDelayMs, plus random jitter on top.
    fun getDelay(attempt: Int): Long {
        val exponentialDelay = (baseDelayMs * 2.0.pow(attempt)).toLong()
        val cappedDelay = minOf(exponentialDelay, maxDelayMs)
        val jitter = (Random.nextDouble() * jitterFactor * cappedDelay).toLong()
        return cappedDelay + jitter
    }

    fun shouldRetry(attempt: Int, error: Throwable): Boolean {
        if (attempt >= maxRetries) return false
        return when (error) {
            is HttpException -> error.code() in listOf(500, 502, 503, 429)
            is IOException -> true
            else -> false
        }
    }
}

Why jitter matters: without jitter, all clients that failed at the same time retry at the same time (1s, 2s, 4s). Jitter decorrelates retry timing, spreading load across the retry window. Full jitter (randomizing the entire delay) is more effective than equal jitter (randomizing only half).
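The two schemes can be sketched directly (illustrative helpers, not part of the RetryPolicy above): full jitter draws the sleep from [0, delay], equal jitter from [delay/2, delay].

```kotlin
import kotlin.random.Random

// Full jitter: sleep anywhere in [0, cappedDelay]. Best decorrelation.
fun fullJitter(cappedDelayMs: Long, rng: Random = Random.Default): Long =
    (rng.nextDouble() * cappedDelayMs).toLong()

// Equal jitter: sleep in [cappedDelay/2, cappedDelay] -- half fixed, half random.
fun equalJitter(cappedDelayMs: Long, rng: Random = Random.Default): Long =
    cappedDelayMs / 2 + (rng.nextDouble() * (cappedDelayMs / 2)).toLong()
```

With a thousand failed clients and a 4-second cap, full jitter spreads retries across the entire 4-second window, while equal jitter bunches them all into the upper 2 seconds.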
Retry Interceptor
import java.io.IOException
import okhttp3.Interceptor
import okhttp3.Response

// Subclassing IOException lets OkHttp propagate this as a call failure.
class CircuitOpenException(message: String) : IOException(message)

class RetryInterceptor(
    private val retryPolicy: RetryPolicy,
    private val circuitBreaker: CircuitBreaker
) : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        var lastException: Exception? = null
        for (attempt in 0..retryPolicy.maxRetries) {
            if (circuitBreaker.isOpen()) {
                throw CircuitOpenException("Circuit breaker is open for ${request.url.host}")
            }
            try {
                val response = chain.proceed(request)
                if (response.isSuccessful) {
                    circuitBreaker.recordSuccess()
                    return response
                }
                if (!retryPolicy.shouldRetry(attempt, HttpException(response))) {
                    return response
                }
                response.close()  // release the connection before retrying
                circuitBreaker.recordFailure()
                if (response.code == 429) {
                    // Retry-After is in seconds; fall back to our own backoff
                    val retryAfter = response.header("Retry-After")?.toLongOrNull()
                    Thread.sleep(retryAfter?.times(1000) ?: retryPolicy.getDelay(attempt))
                } else {
                    Thread.sleep(retryPolicy.getDelay(attempt))
                }
            } catch (e: IOException) {
                lastException = e
                circuitBreaker.recordFailure()
                if (!retryPolicy.shouldRetry(attempt, e)) throw e
                Thread.sleep(retryPolicy.getDelay(attempt))
            }
        }
        throw lastException ?: IOException("Retry exhausted")
    }
}

Circuit Breaker
Stops retrying when a service is clearly down, preventing wasted resources:
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.atomic.AtomicReference

class CircuitBreaker(
    private val failureThreshold: Int = 5,
    private val resetTimeoutMs: Long = 60_000
) {
    private val failureCount = AtomicInteger(0)
    private val lastFailureTime = AtomicLong(0)
    private val state = AtomicReference(State.CLOSED)

    enum class State { CLOSED, OPEN, HALF_OPEN }

    fun isOpen(): Boolean {
        if (state.get() == State.OPEN) {
            if (System.currentTimeMillis() - lastFailureTime.get() > resetTimeoutMs) {
                // Let a probe request through; success closes the circuit
                state.set(State.HALF_OPEN)
                return false
            }
            return true
        }
        return false
    }

    fun recordSuccess() {
        failureCount.set(0)
        state.set(State.CLOSED)
    }

    fun recordFailure() {
        lastFailureTime.set(System.currentTimeMillis())
        if (failureCount.incrementAndGet() >= failureThreshold) {
            state.set(State.OPEN)
        }
    }
}

Adaptive Retry Policy
Adjust retry behavior based on network quality:
class AdaptiveRetryPolicy(
    private val networkMonitor: NetworkMonitor
) {
    fun getPolicy(): RetryPolicy {
        return when (networkMonitor.getConnectionQuality()) {
            ConnectionQuality.EXCELLENT -> RetryPolicy(maxRetries = 2, baseDelayMs = 500)
            ConnectionQuality.GOOD -> RetryPolicy(maxRetries = 3, baseDelayMs = 1000)
            ConnectionQuality.MODERATE -> RetryPolicy(maxRetries = 3, baseDelayMs = 2000)
            ConnectionQuality.POOR -> RetryPolicy(maxRetries = 4, baseDelayMs = 3000)
            ConnectionQuality.NONE -> RetryPolicy(maxRetries = 0) // Don't retry, queue for later
        }
    }
}

Request Priority and Retry Budgets
Not all requests deserve the same retry investment:
| Priority | Max Retries | Total Timeout | Example |
|---|---|---|---|
| Critical | 5 | 60s | Payment submission |
| High | 3 | 30s | User-initiated data fetch |
| Normal | 2 | 15s | Background sync |
| Low | 1 | 5s | Analytics, prefetch |
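The budget table maps naturally onto code; this sketch uses a hypothetical `Priority` enum and `RetryBudget` type that are not defined elsewhere in the post:

```kotlin
enum class Priority { CRITICAL, HIGH, NORMAL, LOW }

data class RetryBudget(val maxRetries: Int, val totalTimeoutMs: Long)

// Mirrors the table: more retries and a longer window for higher priorities.
fun budgetFor(priority: Priority): RetryBudget = when (priority) {
    Priority.CRITICAL -> RetryBudget(maxRetries = 5, totalTimeoutMs = 60_000)
    Priority.HIGH -> RetryBudget(maxRetries = 3, totalTimeoutMs = 30_000)
    Priority.NORMAL -> RetryBudget(maxRetries = 2, totalTimeoutMs = 15_000)
    Priority.LOW -> RetryBudget(maxRetries = 1, totalTimeoutMs = 5_000)
}
```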
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Full jitter | Best load distribution | Potentially longer delays for individual requests |
| Circuit breaker | Protects server during outages | Requests fail fast without attempting, potential false opens |
| Adaptive policy | Optimized for current conditions | Requires network quality estimation, added complexity |
| Per-request priority | Resources allocated proportionally | Priority assignment adds API surface complexity |
| Retry-After header respect | Server-controlled backoff | Malicious or buggy servers can set extreme values |
Failure Modes
- Thundering herd after outage recovery: All circuit breakers enter HALF_OPEN simultaneously. Mitigation: add jitter to the reset timeout.
- Retry storm on partial outage: One degraded endpoint causes retries across all clients. Mitigation: per-endpoint circuit breakers, not global.
- Battery drain from aggressive retries: Background retries with short delays. Mitigation: use WorkManager for background requests, which respects Doze mode.
- Infinite retry on non-transient error: A 500 caused by a persistent bug will never succeed. Mitigation: after max retries, surface the error to the user or dead-letter the request.
- Metered connection waste: Retries consuming user's data plan. Mitigation: reduce max retries on metered connections, skip low-priority retries entirely.
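The first mitigation above, jittering the circuit breaker's reset timeout, could look like this sketch (an illustrative helper; the CircuitBreaker in the Design section would use it in place of its fixed `resetTimeoutMs`):

```kotlin
import kotlin.random.Random

// Spread HALF_OPEN transitions over [resetTimeoutMs, resetTimeoutMs * (1 + spread)]
// so recovering clients do not all probe the backend at the same instant.
fun jitteredResetTimeout(
    resetTimeoutMs: Long = 60_000,
    spread: Double = 0.25,
    rng: Random = Random.Default
): Long = resetTimeoutMs + (rng.nextDouble() * spread * resetTimeoutMs).toLong()
```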
Scaling Considerations
- At scale, even 1% of clients retrying simultaneously creates significant load. Full jitter and circuit breakers are not optional.
- Server-side rate limiting must complement client-side backoff. The server is the authoritative source of "back off" signals.
- For globally distributed APIs, retry policies should be region-aware. A US client hitting an EU endpoint with high latency should have longer timeouts, not more retries.
Observability
- Track: retry rate per endpoint, retry success rate (do retries actually help?), circuit breaker state transitions, total request duration including retries.
- Alert on: retry rate exceeding 10% (indicates systemic issue), circuit breaker staying open for more than 5 minutes, retry success rate below 50%.
- Log: each retry attempt with the attempt number, delay, and error type. This is essential for debugging retry-related issues.
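A minimal counter for the "do retries actually help?" question might look like this (names are illustrative, not from an existing metrics library):

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Tracks how often a request that needed at least one retry eventually succeeded.
class RetryMetrics {
    private val retriedRequests = AtomicLong(0)
    private val retriedSuccesses = AtomicLong(0)

    fun onRetriedRequest(succeeded: Boolean) {
        retriedRequests.incrementAndGet()
        if (succeeded) retriedSuccesses.incrementAndGet()
    }

    // Fraction of retried requests that ultimately succeeded; alert below 0.5.
    fun retrySuccessRate(): Double {
        val total = retriedRequests.get()
        return if (total == 0L) 1.0 else retriedSuccesses.get().toDouble() / total
    }
}
```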
Key Takeaways
- Classify errors before retrying. Retrying a 401 is always wrong.
- Exponential backoff without jitter is almost as bad as no backoff. Always add jitter.
- Circuit breakers are the server's safety valve. Without them, retries become a DDoS from your own clients.
- Adapt retry behavior to network conditions. A retry strategy for WiFi should not be the same as one for 2G.
- Budget retries per request priority. Not every request deserves 5 retries and 60 seconds of patience.
Further Reading
- Designing Mobile Systems for Poor Network Conditions: Architecture patterns for mobile apps that function reliably on slow, intermittent, and lossy networks, covering request prioritization, ...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
- Designing Rate Limiting for Mobile APIs: Rate limiting strategies for APIs consumed by mobile clients, covering token bucket algorithms, client identification, degradation modes,...
Final Thoughts
A retry strategy is a contract between client and server. The client promises not to overwhelm the server during failures. The server promises to communicate backoff requirements via status codes and headers. When both sides honor this contract, transient failures become invisible to users.