Designing Rate Limiting for Mobile APIs

Dhruval Dhameliya·October 21, 2025·7 min read

Rate limiting strategies for APIs consumed by mobile clients, covering token bucket algorithms, client identification, degradation modes, and mobile-specific considerations.

Rate limiting for mobile APIs differs from web rate limiting in important ways. Mobile clients cannot be trivially identified by IP (NAT, carrier-grade NAT). They retry aggressively on failure. They operate on unreliable networks where legitimate requests can look like bursts. This post covers how to design rate limiting that protects your backend without punishing your users.

See also: Designing Retry and Backoff Strategies for Mobile Networks.

Context

Rate limiting prevents abuse, protects backend resources, and ensures fair access across users. For mobile APIs, the rate limiter must distinguish between legitimate burst traffic (app foregrounding, network recovery) and actual abuse (compromised clients, scraping), while communicating limits clearly to the client.

Problem

Design a rate limiting system that:

  • Protects backend services from overload
  • Identifies and throttles abusive clients without affecting legitimate users
  • Communicates rate limit status to mobile clients for adaptive behavior
  • Handles the unique traffic patterns of mobile apps (burst on foreground, silence on background)

Constraints

| Constraint | Detail |
| --- | --- |
| Client identification | IP-based identification unreliable (CGNAT, VPN, WiFi networks) |
| Burst patterns | App foreground triggers 5-10 concurrent requests legitimately |
| Clock reliability | Device clocks cannot be trusted for client-side rate limiting |
| Error handling | Clients must handle 429 responses gracefully without retry storms |
| Latency | Rate limit check must add less than 5ms per request |

Design

Client Identification

IP address is insufficient for mobile. Use a composite identifier:

| Identifier | Reliability | Granularity |
| --- | --- | --- |
| Authenticated user ID | High (after login) | Per-user |
| Device ID (Android ID / IDFV) | Medium | Per-device |
| API key | High | Per-app / per-partner |
| IP address | Low (shared IPs) | Per-IP (fallback only) |

identify_client(request):
    if request.has_auth_token:
        return ("user", extract_user_id(request.auth_token))
    if request.has_device_id_header:
        return ("device", request.header("X-Device-Id"))
    if request.has_api_key:
        return ("apikey", request.header("X-API-Key"))
    return ("ip", request.remote_ip)  // Least preferred

Algorithm: Token Bucket

Token bucket is the best fit for mobile APIs because it naturally accommodates bursts:

TokenBucket {
    capacity: Int        // Max tokens (burst allowance)
    refill_rate: Float   // Tokens added per second
    tokens: Float        // Current token count
    last_refill: Timestamp
}

check_rate_limit(bucket, cost=1):
    refill(bucket)
    if bucket.tokens >= cost:
        bucket.tokens -= cost
        return ALLOWED
    return REJECTED

refill(bucket):
    now = current_time()
    elapsed = now - bucket.last_refill
    bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.refill_rate)
    bucket.last_refill = now
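
The pseudocode above maps directly to a runnable implementation. Here is a minimal Python sketch; the injectable `clock` parameter is an addition for testability and is not part of the design above:

```python
import time

class TokenBucket:
    """Token bucket: `capacity` bounds burst size, `refill_rate` sets the sustained rate."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)  # Start full: a fresh client gets its burst allowance
        self.clock = clock
        self.last_refill = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

    def allow(self, cost=1):
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with capacity 5 and refill rate 1/sec admits a 5-request burst immediately, then sustains one request per second: exactly the foreground-burst pattern mobile apps produce.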

Rate Limit Tiers

Different limits for different contexts:

| Tier | Capacity | Refill Rate | Applied To |
| --- | --- | --- | --- |
| Authenticated user | 100 tokens | 10/sec | Logged-in users |
| Anonymous device | 30 tokens | 3/sec | Pre-login, browsing |
| API partner | 1000 tokens | 100/sec | Third-party integrations |
| IP fallback | 20 tokens | 2/sec | Unidentified clients |
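
Under the identify_client scheme above, tier selection reduces to a lookup on the identifier kind. A sketch with values taken from the table (the `TIER_LIMITS` name is illustrative, not from a real config system):

```python
# Map the (kind, id) pair returned by identify_client to bucket parameters.
# Values mirror the tier table: (capacity in tokens, refill rate per second).
TIER_LIMITS = {
    "user":   (100, 10.0),
    "device": (30, 3.0),
    "apikey": (1000, 100.0),
    "ip":     (20, 2.0),
}

def limits_for(client):
    """Return (capacity, refill_rate) for a (kind, id) client tuple."""
    kind, _ = client
    return TIER_LIMITS[kind]
```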

Endpoint-Specific Limits

High-cost endpoints get additional per-endpoint limits:

endpoint_limits = {
    "POST /orders":      {capacity: 5, rate: 1/sec},    // Expensive
    "POST /auth/login":  {capacity: 5, rate: 1/min},    // Abuse target
    "GET /feed":         {capacity: 30, rate: 5/sec},    // High traffic
    "GET /search":       {capacity: 20, rate: 3/sec},    // Expensive queries
}

A request must pass both the global user limit and the endpoint-specific limit.
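
One way to enforce that rule cleanly is to check both buckets for headroom before debiting either, so a rejection by the endpoint limiter does not burn tokens in the global budget. A sketch, where `Bucket` is a pared-down token bucket with an explicit `now` parameter (not a specific library API):

```python
class Bucket:
    def __init__(self, capacity, rate):
        self.capacity, self.rate = capacity, rate
        self.tokens, self.last = float(capacity), 0.0

    def peek(self, now):
        # Tokens that would be available at `now`, without consuming any
        return min(self.capacity, self.tokens + (now - self.last) * self.rate)

    def take(self, now, cost=1):
        self.tokens = self.peek(now) - cost
        self.last = now

def check_request(global_bucket, endpoint_bucket, now, cost=1):
    # Both limits must have headroom before either is debited, so a
    # rejection by one limiter does not consume tokens in the other.
    if global_bucket.peek(now) < cost or endpoint_bucket.peek(now) < cost:
        return False
    global_bucket.take(now, cost)
    endpoint_bucket.take(now, cost)
    return True
```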

Response Headers

Every response includes rate limit information:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1701234567
Retry-After: 30  // Only on 429 responses
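
These header values can be derived directly from token-bucket state. A sketch, under the assumption that X-RateLimit-Reset reports when the bucket would be full again (some APIs instead report when the next single token becomes available):

```python
import math

def rate_limit_headers(limit, tokens, refill_rate, now):
    """Build rate-limit response headers from bucket state.

    `now` is a Unix timestamp in seconds; Reset is the projected
    time at which the bucket refills back to capacity.
    """
    seconds_to_full = (limit - tokens) / refill_rate
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(int(tokens)),
        "X-RateLimit-Reset": str(math.ceil(now + seconds_to_full)),
    }
```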

Client-Side Handling

class RateLimitAwareClient(private val httpClient: OkHttpClient) {
 
    private val retryAfterMap = ConcurrentHashMap<String, Long>() // endpoint -> resumeTime
 
    fun execute(request: Request): Response {
        val endpoint = request.url.encodedPath
        val resumeTime = retryAfterMap[endpoint]
 
        if (resumeTime != null && System.currentTimeMillis() < resumeTime) {
            throw RateLimitedException(
                retryAfterMs = resumeTime - System.currentTimeMillis()
            )
        }
 
        val response = httpClient.newCall(request).execute()
 
        if (response.code == 429) {
            val retryAfter = response.header("Retry-After")?.toLongOrNull() ?: 30L
            retryAfterMap[endpoint] = System.currentTimeMillis() + retryAfter * 1000
            response.close() // Release the connection before surfacing the error
            throw RateLimitedException(retryAfterMs = retryAfter * 1000)
        }
 
        return response
    }
}

Distributed Rate Limiting

For multi-instance backends, rate limit state must be shared:

Client -> API Gateway -> Rate Limiter (Redis) -> Backend Service

Redis implementation using a sliding window (in production, run the remove/count/add steps as a single Lua script so concurrent requests cannot race past the limit):

sliding_window_rate_limit(key, limit, window_seconds):
    now = current_time_ms()
    window_start = now - (window_seconds * 1000)

    // Remove expired entries
    redis.zremrangebyscore(key, 0, window_start)

    // Count requests in window
    count = redis.zcard(key)

    if count >= limit:
        return REJECTED

    // Add current request
    redis.zadd(key, now, unique_request_id)
    redis.expire(key, window_seconds)
    return ALLOWED
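
The same window semantics can be exercised without Redis: a deque of timestamps plays the role of the sorted set. This is an in-memory sketch to make the logic concrete, not a substitute for shared state across instances:

```python
from collections import deque

class SlidingWindowLimiter:
    """In-memory analogue of the Redis sorted-set logic above.

    The deque holds request timestamps (ms), oldest first, mirroring
    the ZSET keyed by score.
    """

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_ms = window_seconds * 1000
        self.requests = deque()

    def allow(self, now_ms):
        window_start = now_ms - self.window_ms
        # zremrangebyscore equivalent: drop entries outside the window
        while self.requests and self.requests[0] <= window_start:
            self.requests.popleft()
        if len(self.requests) >= self.limit:  # zcard check
            return False
        self.requests.append(now_ms)  # zadd
        return True
```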

Graceful Degradation Under Load

When the backend is under pressure, progressively tighten rate limits:

| Backend Load | Action |
| --- | --- |
| Normal (< 70% CPU) | Standard rate limits |
| Elevated (70-85%) | Reduce limits by 30% for anonymous clients |
| High (85-95%) | Reduce limits by 50%, reject low-priority requests |
| Critical (> 95%) | Allow only authenticated, critical-path requests |
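
The table can be expressed as a multiplier applied to each client's normal limits. A sketch: the thresholds mirror the table, but the exact treatment of authenticated clients at each tier is an interpretation, not a prescription:

```python
def load_multiplier(cpu, authenticated):
    """Scale a client's rate limit by backend CPU load (0.0-1.0)."""
    if cpu < 0.70:
        return 1.0                              # Normal: standard limits
    if cpu < 0.85:
        return 1.0 if authenticated else 0.7    # Elevated: cut anonymous by 30%
    if cpu < 0.95:
        return 0.5                              # High: cut everyone by 50%
    return 1.0 if authenticated else 0.0        # Critical: authenticated only
```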

Trade-offs

| Decision | Upside | Downside |
| --- | --- | --- |
| Token bucket | Natural burst tolerance | Slightly more complex than fixed window |
| Composite client ID | Accurate per-user limiting | Requires multiple identification strategies |
| Redis for state | Fast, shared across instances | Redis failure disables rate limiting |
| Per-endpoint limits | Fine-grained protection | More configuration to maintain |
| Adaptive limits under load | Protects backend dynamically | Can throttle legitimate users during spikes |

Failure Modes

  • Redis unavailable: Two options. (a) Fail open: allow all requests (risky during attack). (b) Fail closed with generous in-memory limits per instance (safer). Choose based on the cost of over-serving vs. under-serving.
  • Clock skew across instances: Sliding window calculations diverge. Use Redis server time (TIME command) instead of instance-local clocks.
  • Legitimate burst after offline: User comes online after hours offline, app fires 20 requests simultaneously. Token bucket's burst capacity handles this if sized correctly. If not, the first few requests succeed, and the client backs off using Retry-After.
  • CGNAT false positives: Thousands of users share one IP, hitting the IP-based limit. Mitigate by preferring user/device identification over IP, and setting IP limits generously.
  • Client ignoring 429: A buggy or malicious client retries immediately. Server-side mitigation: escalate from 429 to temporary ban (403) after repeated violations.
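
The escalation in the last bullet can be sketched as a per-client violation counter; the threshold and ban duration here are illustrative:

```python
from collections import defaultdict

class ViolationTracker:
    """Escalate repeat rate-limit offenders from 429 to a temporary 403 ban."""

    def __init__(self, threshold=10, ban_seconds=300):
        self.threshold = threshold
        self.ban_seconds = ban_seconds
        self.violations = defaultdict(int)
        self.banned_until = {}

    def record_rejection(self, client_id, now):
        # Called each time the rate limiter rejects this client
        self.violations[client_id] += 1
        if self.violations[client_id] >= self.threshold:
            self.banned_until[client_id] = now + self.ban_seconds

    def status_code(self, client_id, now):
        # 403 while the temporary ban is active, 429 otherwise
        if self.banned_until.get(client_id, 0) > now:
            return 403
        return 429
```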

Scaling Considerations

  • Redis sharding: shard by client identifier hash to distribute load.
  • For extremely high throughput (100K+ RPS), use local rate limiting per instance (approximate) combined with centralized rate limiting for accuracy on aggregates.
  • Rate limit rules should be configurable at runtime (via the config system) without redeployment.

Observability

  • Track: rate limit hit rate per endpoint, per client tier; 429 response rate; Redis latency for rate limit checks; client retry patterns after 429.
  • Alert on: 429 rate exceeding 5% of total traffic (indicates limits too tight or an attack), Redis latency exceeding 10ms, single client generating more than 1000 requests/minute.
  • Dashboard: real-time view of top rate-limited clients, endpoint heat map, rate limit headroom by tier.

Key Takeaways

  • Do not rely on IP addresses for mobile client identification. Use authenticated user IDs or device IDs.
  • Token bucket is the right algorithm for mobile APIs. Fixed windows penalize legitimate burst patterns.
  • Communicate rate limit status in every response, not just 429s. Clients can proactively back off.
  • Layer rate limits: global per-user and per-endpoint. Some endpoints need tighter limits regardless of global budget.
  • Plan for Redis failure. Rate limiting disappearing under load is worse than no rate limiting at all.

Final Thoughts

Rate limiting is the immune system of your API. Too aggressive, and it attacks healthy traffic. Too permissive, and it lets threats through. The key is making it adaptive: responsive to backend health, fair to legitimate users, and strict with abusers.
