Designing Rate Limiting for Mobile APIs

Dhruval Dhameliya·October 21, 2025·7 min read

Rate limiting strategies for APIs consumed by mobile clients, covering token bucket algorithms, client identification, degradation modes, and mobile-specific considerations.

Rate limiting for mobile APIs differs from web rate limiting in important ways. Mobile clients cannot be trivially identified by IP (NAT, carrier-grade NAT). They retry aggressively on failure. They operate on unreliable networks where legitimate requests can look like bursts. This post covers how to design rate limiting that protects your backend without punishing your users.

See also: Designing Retry and Backoff Strategies for Mobile Networks.

Context

Rate limiting prevents abuse, protects backend resources, and ensures fair access across users. For mobile APIs, the rate limiter must distinguish between legitimate burst traffic (app foregrounding, network recovery) and actual abuse (compromised clients, scraping), while communicating limits clearly to the client.

Problem

Design a rate limiting system that:

  • Protects backend services from overload
  • Identifies and throttles abusive clients without affecting legitimate users
  • Communicates rate limit status to mobile clients for adaptive behavior
  • Handles the unique traffic patterns of mobile apps (burst on foreground, silence on background)

Constraints

| Constraint | Detail |
| --- | --- |
| Client identification | IP-based identification unreliable (CGNAT, VPN, WiFi networks) |
| Burst patterns | App foreground triggers 5-10 concurrent requests legitimately |
| Clock reliability | Device clocks cannot be trusted for client-side rate limiting |
| Error handling | Clients must handle 429 responses gracefully without retry storms |
| Latency | Rate limit check must add less than 5ms per request |

Design

Client Identification

IP address is insufficient for mobile. Use a composite identifier:

| Identifier | Reliability | Granularity |
| --- | --- | --- |
| Authenticated user ID | High (after login) | Per-user |
| Device ID (Android ID / IDFV) | Medium | Per-device |
| API key | High | Per-app / per-partner |
| IP address | Low (shared IPs) | Per-IP (fallback only) |

identify_client(request):
    if request.has_auth_token:
        return ("user", extract_user_id(request.auth_token))
    if request.has_device_id_header:
        return ("device", request.header("X-Device-Id"))
    if request.has_api_key:
        return ("apikey", request.header("X-API-Key"))
    return ("ip", request.remote_ip)  // Least preferred

Algorithm: Token Bucket

Token bucket is the best fit for mobile APIs because it naturally accommodates bursts:

TokenBucket {
    capacity: Int        // Max tokens (burst allowance)
    refill_rate: Float   // Tokens added per second
    tokens: Float        // Current token count
    last_refill: Timestamp
}

check_rate_limit(bucket, cost=1):
    refill(bucket)
    if bucket.tokens >= cost:
        bucket.tokens -= cost
        return ALLOWED
    return REJECTED

refill(bucket):
    now = current_time()
    elapsed = now - bucket.last_refill
    bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.refill_rate)
    bucket.last_refill = now
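
The pseudocode above maps directly to a runnable implementation. Here is a minimal Python sketch; the injectable `clock` parameter is an addition for testability and is not part of the design above:

```python
import time

class TokenBucket:
    """Token bucket: `capacity` bounds burst size, `refill_rate` sets the sustained rate."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # Start full: a fresh client gets its burst allowance
        self.clock = clock
        self.last_refill = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self, cost=1):
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with capacity 5 and refill rate 1/sec admits a 5-request burst immediately, then sustains one request per second: exactly the foreground-burst pattern mobile apps produce.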

Rate Limit Tiers

Different limits for different contexts:

| Tier | Capacity | Refill Rate | Applied To |
| --- | --- | --- | --- |
| Authenticated user | 100 tokens | 10/sec | Logged-in users |
| Anonymous device | 30 tokens | 3/sec | Pre-login, browsing |
| API partner | 1000 tokens | 100/sec | Third-party integrations |
| IP fallback | 20 tokens | 2/sec | Unidentified clients |
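
Under the identify_client scheme above, tier selection reduces to a lookup on the identifier kind. A sketch with values taken from the table (the `TIER_LIMITS` name is illustrative, not from a real config system):

```python
# Map the (kind, id) pair returned by identify_client to bucket parameters.
# Values mirror the tier table: (capacity in tokens, refill rate per second).
TIER_LIMITS = {
    "user":   (100, 10.0),
    "device": (30, 3.0),
    "apikey": (1000, 100.0),
    "ip":     (20, 2.0),
}

def limits_for(client):
    """Return (capacity, refill_rate) for a (kind, id) client tuple."""
    kind, _ = client
    return TIER_LIMITS[kind]
```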

Endpoint-Specific Limits

High-cost endpoints get additional per-endpoint limits:

endpoint_limits = {
    "POST /orders":      {capacity: 5, rate: 1/sec},    // Expensive
    "POST /auth/login":  {capacity: 5, rate: 1/min},    // Abuse target
    "GET /feed":         {capacity: 30, rate: 5/sec},    // High traffic
    "GET /search":       {capacity: 20, rate: 3/sec},    // Expensive queries
}

A request must pass both the global user limit and the endpoint-specific limit.
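
One way to enforce that rule cleanly is to check both buckets for headroom before debiting either, so a rejection by the endpoint limiter does not burn tokens in the global budget. A sketch, where `Bucket` is a pared-down token bucket with an explicit `now` parameter (not a specific library API):

```python
class Bucket:
    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens, self.last = float(capacity), 0.0

    def peek(self, now):
        # Tokens that would be available at `now`, without consuming any
        return min(self.capacity, self.tokens + (now - self.last) * self.rate)

    def take(self, now, cost=1):
        self.tokens = self.peek(now) - cost
        self.last = now

def check_request(global_bucket, endpoint_bucket, now, cost=1):
    # Both limits must have headroom before either is debited, so a
    # rejection by one limiter does not consume tokens in the other.
    if global_bucket.peek(now) < cost or endpoint_bucket.peek(now) < cost:
        return False
    global_bucket.take(now, cost)
    endpoint_bucket.take(now, cost)
    return True
```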

Response Headers

Every response includes rate limit information:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1701234567
Retry-After: 30  // Only on 429 responses
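
These header values can be derived directly from token-bucket state. A sketch, under the assumption that X-RateLimit-Reset reports when the bucket would be full again (some APIs instead report when the next single token becomes available):

```python
import math

def rate_limit_headers(limit, tokens, refill_rate, now):
    """Build rate-limit response headers from bucket state.

    `now` is a Unix timestamp in seconds; Reset is the projected
    time at which the bucket refills back to capacity.
    """
    seconds_to_full = (limit - tokens) / refill_rate
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(int(tokens)),
        "X-RateLimit-Reset": str(math.ceil(now + seconds_to_full)),
    }
```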

Client-Side Handling

class RateLimitAwareClient(private val httpClient: OkHttpClient) {
 
    private val retryAfterMap = ConcurrentHashMap<String, Long>() // endpoint -> resumeTime
 
    fun execute(request: Request): Response {
        val endpoint = request.url.encodedPath
        val resumeTime = retryAfterMap[endpoint]
 
        if (resumeTime != null && System.currentTimeMillis() < resumeTime) {
            throw RateLimitedException(
                retryAfterMs = resumeTime - System.currentTimeMillis()
            )
        }
 
        val response = httpClient.newCall(request).execute()
 
        if (response.code == 429) {
            val retryAfter = response.header("Retry-After")?.toLongOrNull() ?: 30L
            retryAfterMap[endpoint] = System.currentTimeMillis() + retryAfter * 1000
            response.close() // Release the connection before surfacing the error
            throw RateLimitedException(retryAfterMs = retryAfter * 1000)
        }
 
        return response
    }
}

Distributed Rate Limiting

For multi-instance backends, rate limit state must be shared:

Client -> API Gateway -> Rate Limiter (Redis) -> Backend Service

Redis implementation using a sliding window (in production, run the remove/count/add steps as a single Lua script so concurrent requests cannot race past the limit):

sliding_window_rate_limit(key, limit, window_seconds):
    now = current_time_ms()
    window_start = now - (window_seconds * 1000)

    // Remove expired entries
    redis.zremrangebyscore(key, 0, window_start)

    // Count requests in window
    count = redis.zcard(key)

    if count >= limit:
        return REJECTED

    // Add current request
    redis.zadd(key, now, unique_request_id)
    redis.expire(key, window_seconds)
    return ALLOWED
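
The same window semantics can be exercised without Redis: a deque of timestamps plays the role of the sorted set. This is an in-memory sketch to make the logic concrete, not a substitute for shared state across instances:

```python
from collections import deque

class SlidingWindowLimiter:
    """In-memory analogue of the Redis sorted-set logic above.

    The deque holds request timestamps (ms), oldest first, mirroring
    the ZSET keyed by score.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_ms = window_seconds * 1000
        self.requests = deque()

    def allow(self, now_ms):
        window_start = now_ms - self.window_ms
        # zremrangebyscore equivalent: drop entries outside the window
        while self.requests and self.requests[0] <= window_start:
            self.requests.popleft()
        if len(self.requests) >= self.limit:  # zcard check
            return False
        self.requests.append(now_ms)  # zadd
        return True
```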

Graceful Degradation Under Load

When the backend is under pressure, progressively tighten rate limits:

| Backend Load | Action |
| --- | --- |
| Normal (< 70% CPU) | Standard rate limits |
| Elevated (70-85%) | Reduce limits by 30% for anonymous clients |
| High (85-95%) | Reduce limits by 50%, reject low-priority requests |
| Critical (> 95%) | Allow only authenticated, critical-path requests |
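
The table can be expressed as a multiplier applied to each client's normal limits. A sketch: the thresholds mirror the table, but the exact treatment of authenticated clients at each tier is an interpretation, not a prescription:

```python
def load_multiplier(cpu, authenticated):
    """Scale a client's rate limit by backend CPU load (0.0-1.0)."""
    if cpu < 0.70:
        return 1.0                              # Normal: standard limits
    if cpu < 0.85:
        return 1.0 if authenticated else 0.7    # Elevated: cut anonymous by 30%
    if cpu < 0.95:
        return 0.5                              # High: cut everyone by 50%
    return 1.0 if authenticated else 0.0        # Critical: authenticated only
```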

Trade-offs

| Decision | Upside | Downside |
| --- | --- | --- |
| Token bucket | Natural burst tolerance | Slightly more complex than fixed window |
| Composite client ID | Accurate per-user limiting | Requires multiple identification strategies |
| Redis for state | Fast, shared across instances | Redis failure disables rate limiting |
| Per-endpoint limits | Fine-grained protection | More configuration to maintain |
| Adaptive limits under load | Protects backend dynamically | Can throttle legitimate users during spikes |

Failure Modes

  • Redis unavailable: Two options. (a) Fail open: allow all requests (risky during attack). (b) Fail closed with generous in-memory limits per instance (safer). Choose based on the cost of over-serving vs. under-serving.
  • Clock skew across instances: Sliding window calculations diverge. Use Redis server time (TIME command) instead of instance-local clocks.
  • Legitimate burst after offline: User comes online after hours offline, app fires 20 requests simultaneously. Token bucket's burst capacity handles this if sized correctly. If not, the first few requests succeed, and the client backs off using Retry-After.
  • CGNAT false positives: Thousands of users share one IP, hitting the IP-based limit. Mitigate by preferring user/device identification over IP, and setting IP limits generously.
  • Client ignoring 429: A buggy or malicious client retries immediately. Server-side mitigation: escalate from 429 to temporary ban (403) after repeated violations.
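
The escalation in the last bullet can be sketched as a per-client violation counter; the threshold and ban duration here are illustrative:

```python
from collections import defaultdict

class ViolationTracker:
    """Escalate repeat rate-limit offenders from 429 to a temporary 403 ban."""

    def __init__(self, threshold=10, ban_seconds=300):
        self.threshold = threshold
        self.ban_seconds = ban_seconds
        self.violations = defaultdict(int)
        self.banned_until = {}

    def record_rejection(self, client_id, now):
        # Called each time the rate limiter rejects this client
        self.violations[client_id] += 1
        if self.violations[client_id] >= self.threshold:
            self.banned_until[client_id] = now + self.ban_seconds

    def status_code(self, client_id, now):
        # 403 while the temporary ban is active, 429 otherwise
        if self.banned_until.get(client_id, 0) > now:
            return 403
        return 429
```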

Scaling Considerations

  • Redis sharding: shard by client identifier hash to distribute load.
  • For extremely high throughput (100K+ RPS), use local rate limiting per instance (approximate) combined with centralized rate limiting for accuracy on aggregates.
  • Rate limit rules should be configurable at runtime (via the config system) without redeployment.

Observability

  • Track: rate limit hit rate per endpoint, per client tier; 429 response rate; Redis latency for rate limit checks; client retry patterns after 429.
  • Alert on: 429 rate exceeding 5% of total traffic (indicates limits too tight or an attack), Redis latency exceeding 10ms, single client generating more than 1000 requests/minute.
  • Dashboard: real-time view of top rate-limited clients, endpoint heat map, rate limit headroom by tier.

Key Takeaways

  • Do not rely on IP addresses for mobile client identification. Use authenticated user IDs or device IDs.
  • Token bucket is the right algorithm for mobile APIs. Fixed windows penalize legitimate burst patterns.
  • Communicate rate limit status in every response, not just 429s. Clients can proactively back off.
  • Layer rate limits: global per-user and per-endpoint. Some endpoints need tighter limits regardless of global budget.
  • Plan for Redis failure. Rate limiting disappearing under load is worse than no rate limiting at all.

Final Thoughts

Rate limiting is the immune system of your API. Too aggressive, and it attacks healthy traffic. Too permissive, and it lets threats through. The key is making it adaptive: responsive to backend health, fair to legitimate users, and strict with abusers.
