Designing Rate Limiting for Mobile APIs
Rate limiting strategies for APIs consumed by mobile clients, covering token bucket algorithms, client identification, degradation modes, and mobile-specific considerations.
Rate limiting for mobile APIs differs from web rate limiting in important ways. Mobile clients cannot be trivially identified by IP (NAT, carrier-grade NAT). They retry aggressively on failure. They operate on unreliable networks where legitimate requests can look like bursts. This post covers how to design rate limiting that protects your backend without punishing your users.
See also: Designing Retry and Backoff Strategies for Mobile Networks.
Context
Related: Event Tracking System Design for Android Applications.
Rate limiting prevents abuse, protects backend resources, and ensures fair access across users. For mobile APIs, the rate limiter must distinguish between legitimate burst traffic (app foregrounding, network recovery) and actual abuse (compromised clients, scraping), while communicating limits clearly to the client.
Problem
Design a rate limiting system that:
- Protects backend services from overload
- Identifies and throttles abusive clients without affecting legitimate users
- Communicates rate limit status to mobile clients for adaptive behavior
- Handles the unique traffic patterns of mobile apps (burst on foreground, silence on background)
Constraints
| Constraint | Detail |
|---|---|
| Client identification | IP-based identification unreliable (CGNAT, VPN, WiFi networks) |
| Burst patterns | App foreground triggers 5-10 concurrent requests legitimately |
| Clock reliability | Device clocks cannot be trusted for client-side rate limiting |
| Error handling | Clients must handle 429 responses gracefully without retry storms |
| Latency | Rate limit check must add less than 5ms per request |
Design
Client Identification
IP address is insufficient for mobile. Use a composite identifier:
| Identifier | Reliability | Granularity |
|---|---|---|
| Authenticated user ID | High (after login) | Per-user |
| Device ID (Android ID / IDFV) | Medium | Per-device |
| API key | High | Per-app / per-partner |
| IP address | Low (shared IPs) | Per-IP (fallback only) |
identify_client(request):
    if request.has_auth_token:
        return ("user", extract_user_id(request.auth_token))
    if request.has_device_id_header:
        return ("device", request.header("X-Device-Id"))
    if request.has_api_key:
        return ("apikey", request.header("X-API-Key"))
    return ("ip", request.remote_ip)  // Least preferred
Algorithm: Token Bucket
Token bucket is the best fit for mobile APIs because it naturally accommodates bursts:
TokenBucket {
    capacity: Int       // Max tokens (burst allowance)
    refill_rate: Float  // Tokens added per second
    tokens: Float       // Current token count
    last_refill: Timestamp
}

check_rate_limit(bucket, cost=1):
    refill(bucket)
    if bucket.tokens >= cost:
        bucket.tokens -= cost
        return ALLOWED
    return REJECTED

refill(bucket):
    now = current_time()
    elapsed = now - bucket.last_refill
    bucket.tokens = min(bucket.capacity, bucket.tokens + elapsed * bucket.refill_rate)
    bucket.last_refill = now
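A minimal thread-safe Kotlin version of the same bucket, sketched under the assumption that the server refills from a monotonic clock (`System.nanoTime`) so wall-clock adjustments cannot distort the rate; the class and method names are illustrative:

class TokenBucket(
    private val capacity: Double,          // max tokens (burst allowance)
    private val refillRatePerSec: Double   // tokens added per second
) {
    private var tokens = capacity
    private var lastRefillNanos = System.nanoTime()

    // Returns true if the request is allowed and `cost` tokens were consumed.
    @Synchronized
    fun tryConsume(cost: Double = 1.0): Boolean {
        refill()
        return if (tokens >= cost) {
            tokens -= cost
            true
        } else {
            false
        }
    }

    @Synchronized
    private fun refill() {
        val now = System.nanoTime()
        val elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0
        tokens = minOf(capacity, tokens + elapsedSec * refillRatePerSec)
        lastRefillNanos = now
    }
}

A per-user bucket for the authenticated tier described below would be constructed as `TokenBucket(100.0, 10.0)`.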
Rate Limit Tiers
Different limits for different contexts:
| Tier | Capacity | Refill Rate | Applied To |
|---|---|---|---|
| Authenticated user | 100 tokens | 10/sec | Logged-in users |
| Anonymous device | 30 tokens | 3/sec | Pre-login, browsing |
| API partner | 1000 tokens | 100/sec | Third-party integrations |
| IP fallback | 20 tokens | 2/sec | Unidentified clients |
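One way to wire those tiers to bucket parameters is a simple lookup keyed by tier and client. The enum, map, and factory below continue the `TokenBucket` sketch above and are assumptions made for illustration:

enum class Tier(val capacity: Double, val refillRatePerSec: Double) {
    AUTHENTICATED_USER(100.0, 10.0),
    ANONYMOUS_DEVICE(30.0, 3.0),
    API_PARTNER(1000.0, 100.0),
    IP_FALLBACK(20.0, 2.0)
}

// One bucket per (tier, client) pair, created lazily on first use.
val buckets = java.util.concurrent.ConcurrentHashMap<String, TokenBucket>()

fun bucketFor(tier: Tier, clientId: String): TokenBucket =
    buckets.computeIfAbsent("${tier.name}:$clientId") {
        TokenBucket(tier.capacity, tier.refillRatePerSec)
    }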
Endpoint-Specific Limits
High-cost endpoints get additional per-endpoint limits:
endpoint_limits = {
    "POST /orders":     {capacity: 5,  rate: 1/sec},  // Expensive
    "POST /auth/login": {capacity: 5,  rate: 1/min},  // Abuse target
    "GET /feed":        {capacity: 30, rate: 5/sec},  // High traffic
    "GET /search":      {capacity: 20, rate: 3/sec},  // Expensive queries
}
A request must pass both the global user limit and the endpoint-specific limit.
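Continuing the same sketch, a combined check consults the global per-client bucket first and then the per-endpoint bucket. The `EndpointLimit` type and `allowRequest` helper are assumptions made for illustration:

data class EndpointLimit(val capacity: Double, val refillRatePerSec: Double)

val endpointLimits = mapOf(
    "POST /orders"     to EndpointLimit(5.0, 1.0),         // 1/sec
    "POST /auth/login" to EndpointLimit(5.0, 1.0 / 60.0),  // 1/min
    "GET /feed"        to EndpointLimit(30.0, 5.0),
    "GET /search"      to EndpointLimit(20.0, 3.0)
)

fun allowRequest(tier: Tier, clientId: String, endpoint: String): Boolean {
    // Global per-client bucket must have tokens. (Simplification: this consumes a
    // global token even if the endpoint bucket rejects afterwards.)
    if (!bucketFor(tier, clientId).tryConsume()) return false
    // The per-endpoint bucket, if one is configured, must also have tokens.
    val limit = endpointLimits[endpoint] ?: return true
    return buckets
        .computeIfAbsent("$endpoint:$clientId") { TokenBucket(limit.capacity, limit.refillRatePerSec) }
        .tryConsume()
}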
Response Headers
Every response includes rate limit information:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 73
X-RateLimit-Reset: 1701234567
Retry-After: 30 // Only on 429 responses
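A small sketch of how a gateway or response filter could assemble these headers from limiter state. The function and parameter names are assumptions; `Retry-After` is attached only when the request was rejected:

// Builds the rate limit headers for a response. `remaining` and `resetEpochSeconds`
// come from the limiter state; `retryAfterSeconds` is non-null only on a 429.
fun rateLimitHeaders(
    limit: Int,
    remaining: Int,
    resetEpochSeconds: Long,
    retryAfterSeconds: Long? = null
): Map<String, String> {
    val headers = mutableMapOf(
        "X-RateLimit-Limit" to limit.toString(),
        "X-RateLimit-Remaining" to remaining.toString(),
        "X-RateLimit-Reset" to resetEpochSeconds.toString()
    )
    if (retryAfterSeconds != null) headers["Retry-After"] = retryAfterSeconds.toString()
    return headers
}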
Client-Side Handling
class RateLimitAwareClient(private val httpClient: OkHttpClient) {
    // Tracks, per endpoint, the earliest time a new request may be attempted.
    private val retryAfterMap = ConcurrentHashMap<String, Long>() // endpoint -> resumeTime

    fun execute(request: Request): Response {
        val endpoint = request.url.encodedPath
        val resumeTime = retryAfterMap[endpoint]
        if (resumeTime != null && System.currentTimeMillis() < resumeTime) {
            throw RateLimitedException(
                retryAfterMs = resumeTime - System.currentTimeMillis()
            )
        }
        val response = httpClient.newCall(request).execute()
        if (response.code == 429) {
            val retryAfter = response.header("Retry-After")?.toLongOrNull() ?: 30L
            retryAfterMap[endpoint] = System.currentTimeMillis() + retryAfter * 1000
            throw RateLimitedException(retryAfterMs = retryAfter * 1000)
        }
        return response
    }
}

Distributed Rate Limiting
For multi-instance backends, rate limit state must be shared:
Client -> API Gateway -> Rate Limiter (Redis) -> Backend Service
Redis implementation using a sliding window:
sliding_window_rate_limit(key, limit, window_seconds):
    now = current_time_ms()
    window_start = now - (window_seconds * 1000)

    // Remove expired entries
    redis.zremrangebyscore(key, 0, window_start)

    // Count requests in window
    count = redis.zcard(key)
    if count >= limit:
        return REJECTED

    // Add current request
    redis.zadd(key, now, unique_request_id)
    redis.expire(key, window_seconds)
    return ALLOWED
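Sketched in Kotlin against the Jedis client (assuming a recent Jedis 4.x API), the same sliding window looks like the following. Like the pseudocode, this is a non-atomic read-then-write; production implementations usually wrap the steps in a Lua script so concurrent requests cannot slip past the limit:

import redis.clients.jedis.Jedis
import java.util.UUID

fun slidingWindowAllow(jedis: Jedis, key: String, limit: Long, windowSeconds: Long): Boolean {
    val now = System.currentTimeMillis()
    val windowStart = now - windowSeconds * 1000

    // Drop entries that have fallen out of the window.
    jedis.zremrangeByScore(key, 0.0, windowStart.toDouble())

    // Reject if the window is already full.
    if (jedis.zcard(key) >= limit) return false

    // Record this request and keep the key from growing without bound.
    jedis.zadd(key, now.toDouble(), UUID.randomUUID().toString())
    jedis.expire(key, windowSeconds)
    return true
}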
Graceful Degradation Under Load
When the backend is under pressure, progressively tighten rate limits:
| Backend Load | Action |
|---|---|
| Normal (< 70% CPU) | Standard rate limits |
| Elevated (70-85%) | Reduce limits by 30% for anonymous clients |
| High (85-95%) | Reduce limits by 50%, reject low-priority requests |
| Critical (> 95%) | Allow only authenticated, critical-path requests |
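One possible mapping of that table onto the token bucket parameters is a load-dependent multiplier on the refill rate. The thresholds mirror the table; the `cpuUtilization` signal and the exact multipliers are assumptions:

// Multiplier applied to a client's base refill rate given current backend load.
// Anonymous traffic is squeezed first; authenticated traffic is protected longest.
// Request-priority filtering (low-priority, critical-path) is handled separately.
fun limitMultiplier(cpuUtilization: Double, authenticated: Boolean): Double = when {
    cpuUtilization < 0.70 -> 1.0                              // normal: standard limits
    cpuUtilization < 0.85 -> if (authenticated) 1.0 else 0.7  // elevated: -30% for anonymous
    cpuUtilization < 0.95 -> 0.5                              // high: -50% across the board
    else                  -> if (authenticated) 0.5 else 0.0  // critical: authenticated only
}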
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Token bucket | Natural burst tolerance | Slightly more complex than fixed window |
| Composite client ID | Accurate per-user limiting | Requires multiple identification strategies |
| Redis for state | Fast, shared across instances | Redis failure disables rate limiting |
| Per-endpoint limits | Fine-grained protection | More configuration to maintain |
| Adaptive limits under load | Protects backend dynamically | Can throttle legitimate users during spikes |
Failure Modes
- Redis unavailable: Two options. (a) Fail open: allow all requests (risky during attack). (b) Fail closed with generous in-memory limits per instance (safer). Choose based on the cost of over-serving vs. under-serving. A fallback sketch follows this list.
- Clock skew across instances: Sliding window calculations diverge. Use Redis server time (the TIME command) instead of instance-local clocks.
- Legitimate burst after offline: User comes online after hours offline, app fires 20 requests simultaneously. Token bucket's burst capacity handles this if sized correctly. If not, the first few requests succeed, and the client backs off using Retry-After.
- CGNAT false positives: Thousands of users share one IP, hitting the IP-based limit. Mitigate by preferring user/device identification over IP, and setting IP limits generously.
- Client ignoring 429: A buggy or malicious client retries immediately. Server-side mitigation: escalate from 429 to temporary ban (403) after repeated violations.
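For the Redis-unavailable mode, the fail-closed option can be sketched as a wrapper that falls back to a generous per-instance bucket whenever the distributed check throws. The class below reuses the earlier `TokenBucket` sketch; its numbers are placeholders, not recommendations:

class FallbackRateLimiter(
    private val redisCheck: (clientKey: String) -> Boolean,  // distributed check, may throw
    private val localBuckets: java.util.concurrent.ConcurrentHashMap<String, TokenBucket> =
        java.util.concurrent.ConcurrentHashMap()
) {
    fun allow(clientKey: String): Boolean =
        try {
            redisCheck(clientKey)
        } catch (e: Exception) {
            // Redis is unreachable: fail closed, but with a generous per-instance bucket
            // so legitimate traffic still flows while shared state is down.
            localBuckets
                .computeIfAbsent(clientKey) { TokenBucket(capacity = 200.0, refillRatePerSec = 20.0) }
                .tryConsume()
        }
}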
Scaling Considerations
- Redis sharding: shard by client identifier hash to distribute load.
- For extremely high throughput (100K+ RPS), use local rate limiting per instance (approximate) combined with centralized rate limiting for accuracy on aggregates.
- Rate limit rules should be configurable at runtime (via the config system) without redeployment.
Observability
- Track: rate limit hit rate per endpoint, per client tier; 429 response rate; Redis latency for rate limit checks; client retry patterns after 429.
- Alert on: 429 rate exceeding 5% of total traffic (indicates limits too tight or an attack), Redis latency exceeding 10ms, single client generating more than 1000 requests/minute.
- Dashboard: real-time view of top rate-limited clients, endpoint heat map, rate limit headroom by tier.
Key Takeaways
- Do not rely on IP addresses for mobile client identification. Use authenticated user IDs or device IDs.
- Token bucket is the right algorithm for mobile APIs. Fixed windows penalize legitimate burst patterns.
- Communicate rate limit status in every response, not just 429s. Clients can proactively back off.
- Layer rate limits: global per-user and per-endpoint. Some endpoints need tighter limits regardless of global budget.
- Plan for Redis failure. Rate limiting disappearing under load is worse than no rate limiting at all.
Further Reading
- Designing APIs With Mobile Constraints in Mind: How to design backend APIs that account for mobile-specific constraints: bandwidth, latency, battery, intermittent connectivity, and long...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
- Versioning APIs Without Breaking Old Mobile Apps: Strategies for API versioning that keep old mobile app versions functional, covering URL versioning, header versioning, additive changes,...
Final Thoughts
Rate limiting is the immune system of your API. Too aggressive, and it attacks healthy traffic. Too permissive, and it lets threats through. The key is making it adaptive: responsive to backend health, fair to legitimate users, and strict with abusers.