Designing Retry and Backoff Strategies for Mobile Networks

Dhruval Dhameliya·December 17, 2025·7 min read

A detailed look at retry strategies for mobile clients, covering exponential backoff, jitter, circuit breakers, and adaptive retry policies for unreliable networks.

Retrying failed requests sounds trivial. In practice, naive retries amplify failures, drain batteries, and cause thundering herds that take down backends. This post covers how to design retry strategies that help rather than harm.

Related: How I'd Design a Mobile Configuration System at Scale.

Context

Mobile clients face transient failures constantly: DNS resolution timeouts, TCP connection resets, HTTP 5xx responses, and network switches between WiFi and cellular. A well-designed retry strategy recovers from transient failures transparently while protecting both the client (battery, data usage) and the server (load, availability).

Problem

Design a retry and backoff system that:

  • Recovers from transient network failures automatically
  • Avoids amplifying server load during partial outages
  • Respects device constraints (battery, data plan)
  • Adapts to network conditions in real time

See also: Handling Partial Failures in Distributed Mobile Systems.

Constraints

Constraint      | Detail
--------------- | ------
Battery         | Each network request costs ~1-2 mW; unnecessary retries measurably impact battery
Data usage      | Metered connections require conservative retry behavior
Server load     | N clients retrying simultaneously creates N-fold load amplification
User perception | Retries should be invisible for background requests, fast for user-initiated ones
Timeout budget  | Total retry window should not exceed the user's patience (typically 15-30 seconds for foreground requests)

Design

Retry Classification

Not all failures are retryable:

Status/Error            | Retryable              | Rationale
----------------------- | ---------------------- | ---------
HTTP 500, 502, 503      | Yes                    | Server-side transient errors
HTTP 429                | Yes (with Retry-After) | Rate limited, explicitly told to retry
HTTP 400, 401, 403, 404 | No                     | Client errors, retrying will not help
HTTP 409                | Conditional            | Conflict may resolve after state refresh
IOException (timeout)   | Yes                    | Network transient
UnknownHostException    | Yes (limited)          | DNS may recover, but limit to 2 retries
SSLException            | No                     | Certificate issues will not self-resolve
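
The table can be collapsed into a single classification function. This is a sketch; the `RetryDecision` type and `classify` name are illustrative, not from any library. Note that `UnknownHostException` and `SSLException` both extend `IOException`, so they must be matched first.

```kotlin
import java.io.IOException
import java.net.UnknownHostException
import javax.net.ssl.SSLException

enum class RetryDecision { RETRY, RETRY_LIMITED, CONDITIONAL, NO_RETRY }

fun classify(statusCode: Int?, error: Throwable?): RetryDecision = when {
    statusCode == 500 || statusCode == 502 || statusCode == 503 -> RetryDecision.RETRY
    statusCode == 429 -> RetryDecision.RETRY                 // honor Retry-After
    statusCode == 409 -> RetryDecision.CONDITIONAL           // refresh state, then maybe retry
    statusCode != null -> RetryDecision.NO_RETRY             // 400/401/403/404 and friends
    error is UnknownHostException -> RetryDecision.RETRY_LIMITED  // DNS; cap at 2 retries
    error is SSLException -> RetryDecision.NO_RETRY          // won't self-resolve
    error is IOException -> RetryDecision.RETRY              // timeouts, connection resets
    else -> RetryDecision.NO_RETRY
}
```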

Exponential Backoff with Jitter

import java.io.IOException
import kotlin.math.pow
import kotlin.random.Random
import retrofit2.HttpException  // for callers retrying at the Retrofit API layer

class RetryPolicy(
    val maxRetries: Int = 3,  // public: RetryInterceptor reads it to bound its loop
    private val baseDelayMs: Long = 1000,
    private val maxDelayMs: Long = 30_000,
    private val jitterFactor: Double = 0.5
) {
    // Exponential backoff capped at maxDelayMs, with additive jitter of up to
    // jitterFactor * delay on top of the capped value.
    fun getDelay(attempt: Int): Long {
        val exponentialDelay = (baseDelayMs * 2.0.pow(attempt)).toLong()
        val cappedDelay = minOf(exponentialDelay, maxDelayMs)
        val jitter = (Random.nextDouble() * jitterFactor * cappedDelay).toLong()
        return cappedDelay + jitter
    }

    fun shouldRetry(attempt: Int, error: Throwable): Boolean {
        if (attempt >= maxRetries) return false
        return when (error) {
            is IOException -> true
            is HttpException -> error.code() in listOf(500, 502, 503, 429)
            else -> false
        }
    }
}

Why jitter matters: Without jitter, all clients that failed at the same time retry at the same time (1s, 2s, 4s). Jitter decorrelates retry timing, spreading load across the retry window. Full jitter (randomizing the entire delay) is more effective than equal jitter (randomizing only half).
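
The two variants can be written down directly. This is a sketch; the function names are mine, and `cappedMs` stands for the exponential delay after clamping to the maximum:

```kotlin
import kotlin.random.Random

// Full jitter: randomize the entire delay, uniform in [0, capped).
fun fullJitter(cappedMs: Long, rng: Random = Random.Default): Long =
    (rng.nextDouble() * cappedMs).toLong()

// Equal jitter: keep half the delay fixed, randomize the other half,
// uniform in [capped/2, capped).
fun equalJitter(cappedMs: Long, rng: Random = Random.Default): Long =
    cappedMs / 2 + (rng.nextDouble() * (cappedMs / 2)).toLong()
```

Full jitter spreads retries across the whole window, which is why it decorrelates a fleet of clients better; the cost is that an unlucky request can get a near-zero delay and retry almost immediately.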

Retry Interceptor

import java.io.IOException
import okhttp3.Interceptor
import okhttp3.Response

class RetryInterceptor(
    private val retryPolicy: RetryPolicy,
    private val circuitBreaker: CircuitBreaker
) : Interceptor {

    // Status codes worth retrying (see the classification table above).
    private val retryableCodes = setOf(429, 500, 502, 503)

    override fun intercept(chain: Interceptor.Chain): Response {
        val request = chain.request()
        var lastException: Exception? = null

        for (attempt in 0..retryPolicy.maxRetries) {
            if (circuitBreaker.isOpen()) {
                // CircuitOpenException is app-defined; it should extend IOException
                // so OkHttp propagates it to the caller.
                throw CircuitOpenException("Circuit breaker is open for ${request.url.host}")
            }

            try {
                val response = chain.proceed(request)

                if (response.isSuccessful) {
                    circuitBreaker.recordSuccess()
                    return response
                }

                // Budget exhausted or non-retryable status: hand the response
                // back unchanged so the caller sees the real error.
                if (attempt >= retryPolicy.maxRetries || response.code !in retryableCodes) {
                    return response
                }

                response.close()
                circuitBreaker.recordFailure()

                if (response.code == 429) {
                    // Retry-After is in seconds; cap the sleep so a buggy or
                    // malicious server cannot stall the client indefinitely.
                    val retryAfterMs = response.header("Retry-After")?.toLongOrNull()?.times(1000)
                    Thread.sleep((retryAfterMs ?: retryPolicy.getDelay(attempt)).coerceAtMost(60_000L))
                } else {
                    Thread.sleep(retryPolicy.getDelay(attempt))
                }
            } catch (e: IOException) {
                lastException = e
                circuitBreaker.recordFailure()
                if (!retryPolicy.shouldRetry(attempt, e)) throw e
                Thread.sleep(retryPolicy.getDelay(attempt))
            }
        }

        throw lastException ?: IOException("Retry exhausted")
    }
}

Circuit Breaker

Stops retrying when a service is clearly down, preventing wasted resources:

import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicLong
import java.util.concurrent.atomic.AtomicReference

class CircuitBreaker(
    private val failureThreshold: Int = 5,
    private val resetTimeoutMs: Long = 60_000
) {
    private val failureCount = AtomicInteger(0)
    private val lastFailureTime = AtomicLong(0)
    private val state = AtomicReference(State.CLOSED)

    enum class State { CLOSED, OPEN, HALF_OPEN }

    fun isOpen(): Boolean {
        if (state.get() == State.OPEN) {
            if (System.currentTimeMillis() - lastFailureTime.get() > resetTimeoutMs) {
                // Let a trial request through; compareAndSet prevents racing
                // threads from both flipping the state.
                state.compareAndSet(State.OPEN, State.HALF_OPEN)
                return false
            }
            return true
        }
        return false
    }

    fun recordSuccess() {
        failureCount.set(0)
        state.set(State.CLOSED)
    }

    fun recordFailure() {
        lastFailureTime.set(System.currentTimeMillis())
        // In HALF_OPEN the count is still at threshold, so a failed trial
        // request re-opens the circuit immediately.
        if (failureCount.incrementAndGet() >= failureThreshold) {
            state.set(State.OPEN)
        }
    }
}

Adaptive Retry Policy

Adjust retry behavior based on network quality:

// NetworkMonitor and ConnectionQuality are assumed to come from the app's
// connectivity layer (e.g. estimated from recent throughput and RTT).
class AdaptiveRetryPolicy(
    private val networkMonitor: NetworkMonitor
) {
    fun getPolicy(): RetryPolicy {
        return when (networkMonitor.getConnectionQuality()) {
            ConnectionQuality.EXCELLENT -> RetryPolicy(maxRetries = 2, baseDelayMs = 500)
            ConnectionQuality.GOOD -> RetryPolicy(maxRetries = 3, baseDelayMs = 1000)
            ConnectionQuality.MODERATE -> RetryPolicy(maxRetries = 3, baseDelayMs = 2000)
            ConnectionQuality.POOR -> RetryPolicy(maxRetries = 4, baseDelayMs = 3000)
            ConnectionQuality.NONE -> RetryPolicy(maxRetries = 0) // Don't retry, queue for later
        }
    }
}

Request Priority and Retry Budgets

Not all requests deserve the same retry investment:

Priority | Max Retries | Total Timeout | Example
-------- | ----------- | ------------- | -------
Critical | 5           | 60s           | Payment submission
High     | 3           | 30s           | User-initiated data fetch
Normal   | 2           | 15s           | Background sync
Low      | 1           | 5s            | Analytics, prefetch
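
The table maps directly to a lookup; the `Priority` and `RetryBudget` names are illustrative:

```kotlin
enum class Priority { CRITICAL, HIGH, NORMAL, LOW }

data class RetryBudget(val maxRetries: Int, val totalTimeoutMs: Long)

fun budgetFor(priority: Priority): RetryBudget = when (priority) {
    Priority.CRITICAL -> RetryBudget(maxRetries = 5, totalTimeoutMs = 60_000L)  // payment submission
    Priority.HIGH     -> RetryBudget(maxRetries = 3, totalTimeoutMs = 30_000L)  // user-initiated fetch
    Priority.NORMAL   -> RetryBudget(maxRetries = 2, totalTimeoutMs = 15_000L)  // background sync
    Priority.LOW      -> RetryBudget(maxRetries = 1, totalTimeoutMs = 5_000L)   // analytics, prefetch
}
```

The total timeout matters as much as the retry count: a budget of 3 retries is meaningless if exponential delays push the last attempt past the window the user will tolerate.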

Trade-offs

Decision                   | Upside                            | Downside
-------------------------- | --------------------------------- | --------
Full jitter                | Best load distribution            | Potentially longer delays for individual requests
Circuit breaker            | Protects server during outages    | Requests fail fast without attempting; potential false opens
Adaptive policy            | Optimized for current conditions  | Requires network quality estimation; added complexity
Per-request priority       | Resources allocated proportionally | Priority assignment adds API surface complexity
Retry-After header respect | Server-controlled backoff         | Malicious or buggy servers can set extreme values

Failure Modes

  • Thundering herd after outage recovery: All circuit breakers enter HALF_OPEN simultaneously. Mitigation: add jitter to the reset timeout.
  • Retry storm on partial outage: One degraded endpoint causes retries across all clients. Mitigation: per-endpoint circuit breakers, not global.
  • Battery drain from aggressive retries: Background retries with short delays. Mitigation: use WorkManager for background requests, which respects Doze mode.
  • Infinite retry on non-transient error: A 500 caused by a persistent bug will never succeed. Mitigation: after max retries, surface the error to the user or dead-letter the request.
  • Metered connection waste: Retries consuming user's data plan. Mitigation: reduce max retries on metered connections, skip low-priority retries entirely.
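
The first mitigation above can be sketched directly: jitter each client's reset timeout so HALF_OPEN probes don't align across the fleet. The ±20% spread here is an illustrative choice, not a recommendation from a specific source:

```kotlin
import kotlin.random.Random

// Spread the circuit breaker's reset timeout by up to ±spread so clients
// that tripped at the same moment probe the recovered backend at
// different times.
fun jitteredResetTimeout(baseMs: Long, spread: Double = 0.2, rng: Random = Random.Default): Long {
    val delta = (rng.nextDouble() * 2 - 1) * spread * baseMs  // uniform in [-spread, +spread] * base
    return (baseMs + delta).toLong()
}
```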

Scaling Considerations

  • At scale, even 1% of clients retrying simultaneously creates significant load. Full jitter and circuit breakers are not optional.
  • Server-side rate limiting must complement client-side backoff. The server is the authoritative source of "back off" signals.
  • For globally distributed APIs, retry policies should be region-aware. A US client hitting an EU endpoint with high latency should have longer timeouts, not more retries.

Observability

  • Track: retry rate per endpoint, retry success rate (do retries actually help?), circuit breaker state transitions, total request duration including retries.
  • Alert on: retry rate exceeding 10% (indicates systemic issue), circuit breaker staying open for more than 5 minutes, retry success rate below 50%.
  • Log: each retry attempt with the attempt number, delay, and error type. This is essential for debugging retry-related issues.
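
The retry-success-rate signal is worth pinning down, since it answers "do retries actually help?". A minimal sketch, with hypothetical types:

```kotlin
// Among requests that needed at least one retry, the fraction that
// eventually succeeded. A low value means retries are mostly wasted work.
data class RequestOutcome(val retryCount: Int, val succeeded: Boolean)

fun retrySuccessRate(outcomes: List<RequestOutcome>): Double {
    val retried = outcomes.filter { it.retryCount > 0 }
    if (retried.isEmpty()) return 1.0  // nothing retried: vacuously healthy
    return retried.count { it.succeeded }.toDouble() / retried.size
}
```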

Key Takeaways

  • Classify errors before retrying. Retrying a 401 is always wrong.
  • Exponential backoff without jitter is almost as bad as no backoff. Always add jitter.
  • Circuit breakers are the server's safety valve. Without them, retries become a DDoS from your own clients.
  • Adapt retry behavior to network conditions. A retry strategy for WiFi should not be the same as one for 2G.
  • Budget retries per request priority. Not every request deserves 5 retries and 60 seconds of patience.

Final Thoughts

A retry strategy is a contract between client and server. The client promises not to overwhelm the server during failures. The server promises to communicate backoff requirements via status codes and headers. When both sides honor this contract, transient failures become invisible to users.
