Building a Rate-Limited API From Scratch
Implementing token bucket and sliding window rate limiting with Redis, including burst handling, multi-tier limits, and measured overhead.
Context
An API serving 20,000 requests/minute needed rate limiting per API key. Requirements: 100 requests/minute per key with burst allowance, separate tiers for free and paid users, and sub-millisecond overhead per request. I implemented two algorithms and compared them.
Problem
Rate limiting appears simple (count requests, reject when over limit) but the implementation details matter: clock synchronization, distributed counting, burst behavior, and the fairness of the rejection strategy. The wrong algorithm can reject legitimate traffic or allow abuse.
Constraints
- Storage: Redis 7, single node
- Rate limit: 100 req/min (free tier), 1,000 req/min (paid tier)
- Burst: allow 20% burst above limit for up to 5 seconds
- Latency budget: rate limit check must add less than 1ms to request processing
- Distributed: 8 application instances share the rate limit state
- Must return standard rate limit headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`)
Design
Algorithm 1: Token Bucket
The token bucket allows burst traffic by accumulating tokens over time. Each request consumes one token. Tokens refill at a fixed rate.
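As a single-process illustration of the refill arithmetic (a sketch of mine, not the production implementation, which is distributed via Redis), a minimal in-memory bucket looks like this:

```typescript
// Minimal in-memory token bucket: illustrates the refill math only.
// It is per-process and not suitable for the distributed setup described here.
class InMemoryTokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,
    private refillRate: number, // tokens per second
    now: number = Date.now(),
  ) {
    this.tokens = maxTokens;
    this.lastRefill = now;
  }

  tryConsume(now: number = Date.now()): boolean {
    // Refill based on elapsed time, capped at maxTokens
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A full bucket drains one token per request; once empty, requests are rejected until elapsed time restores at least one token.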
```typescript
async function tokenBucket(
  key: string,
  maxTokens: number,
  refillRate: number, // tokens per second
): Promise<{ allowed: boolean; remaining: number; resetAt: number }> {
  const now = Date.now();
  const result = await redis.eval(`
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or max_tokens
    local last_refill = tonumber(bucket[2]) or now

    -- Refill tokens based on elapsed time
    local elapsed = (now - last_refill) / 1000
    tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))

    local allowed = tokens >= 1
    if allowed then
      tokens = tokens - 1
    end

    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) + 1)

    return {allowed and 1 or 0, math.floor(tokens), math.ceil((max_tokens - tokens) / refill_rate)}
  `, 1, key, maxTokens, refillRate, now);

  return {
    allowed: result[0] === 1,
    remaining: result[1],
    resetAt: now + result[2] * 1000,
  };
}
```

Configuration for the free tier: maxTokens = 120 (100 plus 20% burst), refillRate ≈ 1.67 (100 tokens per 60 seconds).
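The return value maps directly onto the required response headers. A hypothetical helper (the name and shape are mine, not from the production code) might look like:

```typescript
// Hypothetical helper: turns a rate-limit check result into the standard
// response headers. Retry-After is only meaningful on rejection.
interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  resetAt: number; // epoch milliseconds
}

function buildRateLimitHeaders(
  result: RateLimitResult,
  limit: number,
  now: number = Date.now(),
): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, result.remaining)),
    "X-RateLimit-Reset": String(Math.ceil(result.resetAt / 1000)), // epoch seconds
  };
  if (!result.allowed) {
    headers["Retry-After"] = String(Math.max(1, Math.ceil((result.resetAt - now) / 1000)));
  }
  return headers;
}
```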
Algorithm 2: Sliding Window Counter
The sliding window divides time into fixed intervals and weights the previous interval based on how far into the current interval the request falls.
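The weighting step can be illustrated in isolation (a pure-function sketch, assuming the two window counts are already known):

```typescript
// Sliding window estimate: weight the previous window's count by the fraction
// of it that still overlaps the sliding window, then add the current count.
function weightedCount(
  prevCount: number,
  currCount: number,
  windowProgress: number, // 0..1, how far into the current window we are
): number {
  return Math.floor(prevCount * (1 - windowProgress)) + currCount;
}
```

For example, 60 requests in the previous window and 50 so far in the current one, measured 25% of the way into the current window, estimate floor(60 × 0.75) + 50 = 95 requests over the last full window.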
```typescript
async function slidingWindowCounter(
  key: string,
  limit: number,
  windowMs: number,
): Promise<{ allowed: boolean; remaining: number; resetAt: number }> {
  const now = Date.now();
  const currentWindow = Math.floor(now / windowMs);
  const previousWindow = currentWindow - 1;
  const windowProgress = (now % windowMs) / windowMs;

  const result = await redis.eval(`
    local curr_key = KEYS[1] .. ':' .. ARGV[1]
    local prev_key = KEYS[1] .. ':' .. ARGV[2]
    local limit = tonumber(ARGV[3])
    local weight = tonumber(ARGV[4])

    local prev_count = tonumber(redis.call('GET', prev_key) or '0')
    local curr_count = tonumber(redis.call('GET', curr_key) or '0')

    -- Weight the previous window by how much of it still overlaps the sliding window
    local weighted_count = math.floor(prev_count * (1 - weight)) + curr_count

    if weighted_count >= limit then
      return {0, 0}
    end

    redis.call('INCR', curr_key)
    redis.call('EXPIRE', curr_key, math.ceil(ARGV[5] / 1000) * 2)

    return {1, limit - weighted_count - 1}
  `, 1, key, currentWindow, previousWindow, limit, windowProgress, windowMs);

  return {
    allowed: result[0] === 1,
    remaining: result[1],
    resetAt: (currentWindow + 1) * windowMs,
  };
}
```

Trade-offs
Algorithm Comparison
| Property | Token Bucket | Sliding Window |
|---|---|---|
| Burst handling | Natural (accumulated tokens) | None (hard limit per window) |
| Memory per key | 2 fields (tokens, last_refill) | 2 keys (current + previous window) |
| Accuracy | Exact | Approximate (weighted average) |
| Latency (p50) | 0.3ms | 0.25ms |
| Latency (p99) | 0.8ms | 0.7ms |
| Implementation complexity | Medium (Lua script) | Medium (Lua script) |
| Fairness under sustained load | Smooth distribution | Can cluster at window boundaries |
Burst Behavior Test
I sent 150 requests in 1 second to a key with a 100/minute limit:
| Algorithm | Allowed in 1st second | Behavior |
|---|---|---|
| Token Bucket (120 max) | 120 | Allowed burst up to max tokens, then rejected |
| Sliding Window (100 limit) | 100 | Hard cutoff at limit, no burst |
Token bucket is better for APIs where clients send batches (mobile sync, webhook delivery). Sliding window is better for strict rate enforcement (billing, abuse prevention).
Multi-Tier Configuration
| Tier | Token Bucket Config | Sliding Window Config |
|---|---|---|
| Free | 120 tokens, 1.67/s refill | 100/minute window |
| Paid | 1,200 tokens, 16.67/s refill | 1,000/minute window |
| Internal | 12,000 tokens, 166.7/s refill | 10,000/minute window |
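The tiers above map naturally onto a lookup table. A sketch (the tier names match the table; the shape of the config object is mine):

```typescript
// Per-tier token bucket configuration derived from the table above.
// refillRate is tokens per second: the per-minute limit divided by 60.
type Tier = "free" | "paid" | "internal";

interface BucketConfig {
  maxTokens: number;  // per-minute limit plus 20% burst allowance
  refillRate: number; // tokens per second
}

const TIER_CONFIG: Record<Tier, BucketConfig> = {
  free: { maxTokens: 120, refillRate: 100 / 60 },
  paid: { maxTokens: 1_200, refillRate: 1_000 / 60 },
  internal: { maxTokens: 12_000, refillRate: 10_000 / 60 },
};

function configFor(tier: Tier): BucketConfig {
  return TIER_CONFIG[tier];
}
```

Keeping the tier table in one place means the same config drives both the token bucket parameters and the `X-RateLimit-Limit` header value.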
Redis Memory Usage
At 10,000 active API keys:
| Algorithm | Memory |
|---|---|
| Token bucket | ~1.2MB (2 hash fields per key) |
| Sliding window | ~0.8MB (2 string keys per key) |
Both are negligible. Memory is not the constraint at this scale.
Failure Modes
Redis unavailable: If Redis is down, the rate limiter cannot check limits. Two options: fail open (allow all requests, risking abuse) or fail closed (reject all requests, causing an outage). I chose fail open with a local in-memory fallback that applies a conservative per-instance limit.
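The fail-open path can be sketched as a wrapper around any check function (the names and the fallback's fixed-window shape are illustrative, not the production code):

```typescript
// Fail-open wrapper: if the distributed check throws (e.g. Redis is down),
// fall back to a conservative per-instance fixed-window counter instead of
// rejecting everything. The fallback limit is illustrative.
type CheckFn = (key: string) => Promise<boolean>;

function withFailOpen(check: CheckFn, fallbackLimitPerMin: number): CheckFn {
  const localCounts = new Map<string, { count: number; windowStart: number }>();

  return async (key: string): Promise<boolean> => {
    try {
      return await check(key);
    } catch {
      // Redis unavailable: apply a conservative per-instance limit
      const now = Date.now();
      const entry = localCounts.get(key);
      if (!entry || now - entry.windowStart >= 60_000) {
        localCounts.set(key, { count: 1, windowStart: now });
        return true;
      }
      entry.count += 1;
      return entry.count <= fallbackLimitPerMin;
    }
  };
}
```

Because each of the 8 instances counts independently during an outage, the fallback limit should be a fraction of the real limit (e.g. limit / instance count) to stay conservative.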
Clock skew across instances: Token bucket uses the current timestamp to calculate refills. If two instances have clocks skewed by 1 second, the refill calculation diverges. Using Redis server time (`redis.call('TIME')`) instead of client time eliminates this.
Race condition on concurrent checks: Without atomic Lua scripts, two concurrent requests could read the same token count, both see tokens available, and both decrement, resulting in over-admission. The Lua script executes atomically in Redis, preventing this.
Key expiration during burst: If a token bucket key expires (TTL reached) during a sustained burst, the next request creates a fresh bucket with full tokens, effectively resetting the limit. The TTL must be longer than the maximum burst duration.
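The TTL arithmetic mirrors the `EXPIRE` call in the Lua script; a sketch (the explicit margin parameter is my illustration):

```typescript
// TTL must outlive the time to fully refill an empty bucket, plus a margin,
// so a sustained burst never sees a freshly reset bucket mid-stream.
function bucketTtlSeconds(
  maxTokens: number,
  refillRate: number, // tokens per second
  marginSec: number = 1,
): number {
  // Time to refill an empty bucket completely, rounded up
  const fullRefillSec = Math.ceil(maxTokens / refillRate);
  return fullRefillSec + marginSec;
}
```

Since the key is touched (and the TTL reset) on every request, expiry can only happen after a full idle refill period anyway, which is exactly when a fresh full bucket is the correct state.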
Scaling Considerations
- Redis handles 100,000+ operations/second. The API's 20,000 requests/minute translate to roughly 333 rate-limit checks/second, about 0.3% of Redis capacity. Even if all 10,000 keys saturated their 100 req/min limit simultaneously (~16,700 operations/second), Redis would still have headroom.
- For multi-region deployments, each region can have its own Redis instance with region-local rate limits. Global rate limiting across regions requires a centralized Redis or a distributed counter (significantly more complex).
- At 1M+ API keys, consider using Redis Cluster with hash tags to co-locate related keys on the same shard.
- Rate limit headers should include `Retry-After` for rejected requests to help well-behaved clients back off appropriately.
Observability
- Track rate limit hit rate per tier (percentage of requests rejected)
- Monitor Redis latency for rate limit operations (should be under 1ms at p99)
- Log the top 10 most-throttled API keys daily (identifies potential abuse or misconfigured clients)
- Alert on Redis failure with the fail-open fallback activated
- Dashboard showing request distribution across tiers and burst patterns
Key Takeaways
- Token bucket is the better default for most APIs. It handles burst traffic naturally and provides smooth rate limiting.
- Sliding window is preferable when strict per-window enforcement is required (billing, compliance).
- Lua scripts in Redis provide atomic rate limit checks. Never implement distributed rate limiting with separate read-check-write operations.
- Fail-open with a local fallback is the correct default for rate limiters. A rate limiter outage should not cause an application outage.
- Rate limit headers are not optional. Clients need `X-RateLimit-Remaining` and `Retry-After` to behave correctly.
Final Thoughts
I deployed the token bucket algorithm with per-tier configuration. The 0.3ms p50 overhead was invisible in API latency. The burst allowance reduced false rejections for mobile clients that batch requests on network reconnection. The total implementation was 150 lines of TypeScript and a 20-line Lua script. The operational overhead is monitoring Redis availability and reviewing the top-throttled-keys report weekly.