Building a Rate-Limited API From Scratch
Implementing token bucket and sliding window rate limiting with Redis, including burst handling, multi-tier limits, and measured overhead.
Context
An API serving 20,000 requests/minute needed rate limiting per API key. Requirements: 100 requests/minute per key with burst allowance, separate tiers for free and paid users, and sub-millisecond overhead per request. I implemented two algorithms and compared them.
Problem
Rate limiting appears simple (count requests, reject when over limit) but the implementation details matter: clock synchronization, distributed counting, burst behavior, and the fairness of the rejection strategy. The wrong algorithm can reject legitimate traffic or allow abuse.
Constraints
- Storage: Redis 7, single node
- Rate limit: 100 req/min (free tier), 1,000 req/min (paid tier)
- Burst: allow 20% burst above limit for up to 5 seconds
- Latency budget: rate limit check must add less than 1ms to request processing
- Distributed: 8 application instances share the rate limit state
- Must return standard rate limit headers (`X-RateLimit-Limit`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`)
Design
Algorithm 1: Token Bucket
The token bucket allows burst traffic by accumulating tokens over time. Each request consumes one token. Tokens refill at a fixed rate.
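As a single-process illustration of the refill arithmetic (a sketch of mine, not the production implementation, which is distributed via Redis), a minimal in-memory bucket looks like this:

```typescript
// Minimal in-memory token bucket: illustrates the refill math only.
// It is per-process and not suitable for the distributed setup described here.
class InMemoryTokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,
    private refillRate: number, // tokens per second
    now: number = Date.now(),
  ) {
    this.tokens = maxTokens;
    this.lastRefill = now;
  }

  tryConsume(now: number = Date.now()): boolean {
    // Refill based on elapsed time, capped at maxTokens
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsedSec * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A full bucket drains one token per request; once empty, requests are rejected until elapsed time restores at least one token.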
```typescript
async function tokenBucket(
  key: string,
  maxTokens: number,
  refillRate: number, // tokens per second
): Promise<{ allowed: boolean; remaining: number; resetAt: number }> {
  const now = Date.now();
  const result = await redis.eval(`
    local key = KEYS[1]
    local max_tokens = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])
    local now = tonumber(ARGV[3])

    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or max_tokens
    local last_refill = tonumber(bucket[2]) or now

    -- Refill tokens based on elapsed time
    local elapsed = (now - last_refill) / 1000
    tokens = math.min(max_tokens, tokens + (elapsed * refill_rate))

    local allowed = tokens >= 1
    if allowed then
      tokens = tokens - 1
    end

    redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, math.ceil(max_tokens / refill_rate) + 1)

    return {allowed and 1 or 0, math.floor(tokens), math.ceil((max_tokens - tokens) / refill_rate)}
  `, 1, key, maxTokens, refillRate, now);

  return {
    allowed: result[0] === 1,
    remaining: result[1],
    resetAt: now + result[2] * 1000,
  };
}
```

Configuration for the free tier: maxTokens = 120 (100 plus 20% burst), refillRate ≈ 1.67 (100 tokens per 60 seconds).
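The return value maps directly onto the required response headers. A hypothetical helper (the name and shape are mine, not from the production code) might look like:

```typescript
// Hypothetical helper: turns a rate-limit check result into the standard
// response headers. Retry-After is only meaningful on rejection.
interface RateLimitResult {
  allowed: boolean;
  remaining: number;
  resetAt: number; // epoch milliseconds
}

function buildRateLimitHeaders(
  result: RateLimitResult,
  limit: number,
  now: number = Date.now(),
): Record<string, string> {
  const headers: Record<string, string> = {
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, result.remaining)),
    "X-RateLimit-Reset": String(Math.ceil(result.resetAt / 1000)), // epoch seconds
  };
  if (!result.allowed) {
    headers["Retry-After"] = String(Math.max(1, Math.ceil((result.resetAt - now) / 1000)));
  }
  return headers;
}
```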
Algorithm 2: Sliding Window Counter
The sliding window divides time into fixed intervals and weights the previous interval based on how far into the current interval the request falls.
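The weighting step can be illustrated in isolation (a pure-function sketch, assuming the two window counts are already known):

```typescript
// Sliding window estimate: weight the previous window's count by the fraction
// of it that still overlaps the sliding window, then add the current count.
function weightedCount(
  prevCount: number,
  currCount: number,
  windowProgress: number, // 0..1, how far into the current window we are
): number {
  return Math.floor(prevCount * (1 - windowProgress)) + currCount;
}
```

For example, 60 requests in the previous window and 50 so far in the current one, measured 25% of the way into the current window, estimate floor(60 × 0.75) + 50 = 95 requests over the last full window.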
```typescript
async function slidingWindowCounter(
  key: string,
  limit: number,
  windowMs: number,
): Promise<{ allowed: boolean; remaining: number; resetAt: number }> {
  const now = Date.now();
  const currentWindow = Math.floor(now / windowMs);
  const previousWindow = currentWindow - 1;
  const windowProgress = (now % windowMs) / windowMs;

  const result = await redis.eval(`
    local curr_key = KEYS[1] .. ':' .. ARGV[1]
    local prev_key = KEYS[1] .. ':' .. ARGV[2]
    local limit = tonumber(ARGV[3])
    local weight = tonumber(ARGV[4])

    local prev_count = tonumber(redis.call('GET', prev_key) or '0')
    local curr_count = tonumber(redis.call('GET', curr_key) or '0')

    -- Weight the previous window by how much of it still overlaps the sliding window
    local weighted_count = math.floor(prev_count * (1 - weight)) + curr_count

    if weighted_count >= limit then
      return {0, 0}
    end

    redis.call('INCR', curr_key)
    redis.call('EXPIRE', curr_key, math.ceil(ARGV[5] / 1000) * 2)

    return {1, limit - weighted_count - 1}
  `, 1, key, currentWindow, previousWindow, limit, windowProgress, windowMs);

  return {
    allowed: result[0] === 1,
    remaining: result[1],
    resetAt: (currentWindow + 1) * windowMs,
  };
}
```

Trade-offs
Algorithm Comparison
| Property | Token Bucket | Sliding Window |
|---|---|---|
| Burst handling | Natural (accumulated tokens) | None (hard limit per window) |
| Memory per key | 2 fields (tokens, last_refill) | 2 keys (current + previous window) |
| Accuracy | Exact | Approximate (weighted average) |
| Latency (p50) | 0.3ms | 0.25ms |
| Latency (p99) | 0.8ms | 0.7ms |
| Implementation complexity | Medium (Lua script) | Medium (Lua script) |
| Fairness under sustained load | Smooth distribution | Can cluster at window boundaries |
Burst Behavior Test
I sent 150 requests in 1 second to a key with a 100/minute limit:
| Algorithm | Allowed in 1st second | Behavior |
|---|---|---|
| Token Bucket (120 max) | 120 | Allowed burst up to max tokens, then rejected |
| Sliding Window (100 limit) | 100 | Hard cutoff at limit, no burst |
Token bucket is better for APIs where clients send batches (mobile sync, webhook delivery). Sliding window is better for strict rate enforcement (billing, abuse prevention).
Multi-Tier Configuration
| Tier | Token Bucket Config | Sliding Window Config |
|---|---|---|
| Free | 120 tokens, 1.67/s refill | 100/minute window |
| Paid | 1,200 tokens, 16.67/s refill | 1,000/minute window |
| Internal | 12,000 tokens, 166.7/s refill | 10,000/minute window |
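The tiers above map naturally onto a lookup table. A sketch (the tier names match the table; the shape of the config object is mine):

```typescript
// Per-tier token bucket configuration derived from the table above.
// refillRate is tokens per second: the per-minute limit divided by 60.
type Tier = "free" | "paid" | "internal";

interface BucketConfig {
  maxTokens: number;  // per-minute limit plus 20% burst allowance
  refillRate: number; // tokens per second
}

const TIER_CONFIG: Record<Tier, BucketConfig> = {
  free: { maxTokens: 120, refillRate: 100 / 60 },
  paid: { maxTokens: 1_200, refillRate: 1_000 / 60 },
  internal: { maxTokens: 12_000, refillRate: 10_000 / 60 },
};

function configFor(tier: Tier): BucketConfig {
  return TIER_CONFIG[tier];
}
```

Keeping the tier table in one place means the same config drives both the token bucket parameters and the `X-RateLimit-Limit` header value.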
Redis Memory Usage
At 10,000 active API keys:
| Algorithm | Memory |
|---|---|
| Token bucket | ~1.2MB (2 hash fields per key) |
| Sliding window | ~0.8MB (2 string keys per key) |
Both are negligible. Memory is not the constraint at this scale.
Failure Modes
Redis unavailable: If Redis is down, the rate limiter cannot check limits. Two options: fail open (allow all requests, risking abuse) or fail closed (reject all requests, causing an outage). I chose fail open with a local in-memory fallback that applies a conservative per-instance limit.
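The fail-open path can be sketched as a wrapper around any check function (the names and the fallback's fixed-window shape are illustrative, not the production code):

```typescript
// Fail-open wrapper: if the distributed check throws (e.g. Redis is down),
// fall back to a conservative per-instance fixed-window counter instead of
// rejecting everything. The fallback limit is illustrative.
type CheckFn = (key: string) => Promise<boolean>;

function withFailOpen(check: CheckFn, fallbackLimitPerMin: number): CheckFn {
  const localCounts = new Map<string, { count: number; windowStart: number }>();

  return async (key: string): Promise<boolean> => {
    try {
      return await check(key);
    } catch {
      // Redis unavailable: apply a conservative per-instance limit
      const now = Date.now();
      const entry = localCounts.get(key);
      if (!entry || now - entry.windowStart >= 60_000) {
        localCounts.set(key, { count: 1, windowStart: now });
        return true;
      }
      entry.count += 1;
      return entry.count <= fallbackLimitPerMin;
    }
  };
}
```

Because each of the 8 instances counts independently during an outage, the fallback limit should be a fraction of the real limit (e.g. limit / instance count) to stay conservative.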
Clock skew across instances: Token bucket uses the current timestamp to calculate refills. If two instances have clocks skewed by 1 second, the refill calculation diverges. Using Redis server time (`redis.call('TIME')`) instead of client time eliminates this.
Race condition on concurrent checks: Without atomic Lua scripts, two concurrent requests could read the same token count, both see tokens available, and both decrement, resulting in over-admission. The Lua script executes atomically in Redis, preventing this.
Key expiration during burst: If a token bucket key expires (TTL reached) during a sustained burst, the next request creates a fresh bucket with full tokens, effectively resetting the limit. The TTL must be longer than the maximum burst duration.
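The TTL arithmetic mirrors the `EXPIRE` call in the Lua script; a sketch (the explicit margin parameter is my illustration):

```typescript
// TTL must outlive the time to fully refill an empty bucket, plus a margin,
// so a sustained burst never sees a freshly reset bucket mid-stream.
function bucketTtlSeconds(
  maxTokens: number,
  refillRate: number, // tokens per second
  marginSec: number = 1,
): number {
  // Time to refill an empty bucket completely, rounded up
  const fullRefillSec = Math.ceil(maxTokens / refillRate);
  return fullRefillSec + marginSec;
}
```

Since the key is touched (and the TTL reset) on every request, expiry can only happen after a full idle refill period anyway, which is exactly when a fresh full bucket is the correct state.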
Scaling Considerations
- Redis handles 100,000+ operations/second. The API's 20,000 requests/minute translate to roughly 333 rate-limit checks/second, about 0.3% of Redis capacity. Even if all 10,000 keys saturated their 100 req/min limit simultaneously (~16,700 operations/second), Redis would still have headroom.
- For multi-region deployments, each region can have its own Redis instance with region-local rate limits. Global rate limiting across regions requires a centralized Redis or a distributed counter (significantly more complex).
- At 1M+ API keys, consider using Redis Cluster with hash tags to co-locate related keys on the same shard.
- Rate limit headers should include `Retry-After` for rejected requests to help well-behaved clients back off appropriately.
Observability
- Track rate limit hit rate per tier (percentage of requests rejected)
- Monitor Redis latency for rate limit operations (should be under 1ms at p99)
- Log the top 10 most-throttled API keys daily (identifies potential abuse or misconfigured clients)
- Alert on Redis failure with the fail-open fallback activated
- Dashboard showing request distribution across tiers and burst patterns
Key Takeaways
- Token bucket is the better default for most APIs. It handles burst traffic naturally and provides smooth rate limiting.
- Sliding window is preferable when strict per-window enforcement is required (billing, compliance).
- Lua scripts in Redis provide atomic rate limit checks. Never implement distributed rate limiting with separate read-check-write operations.
- Fail-open with a local fallback is the correct default for rate limiters. A rate limiter outage should not cause an application outage.
- Rate limit headers are not optional. Clients need `X-RateLimit-Remaining` and `Retry-After` to behave correctly.
Final Thoughts
I deployed the token bucket algorithm with per-tier configuration. The 0.3ms p50 overhead was invisible in API latency. The burst allowance reduced false rejections for mobile clients that batch requests on network reconnection. The total implementation was 150 lines of TypeScript and a 20-line Lua script. The operational overhead is monitoring Redis availability and reviewing the top-throttled-keys report weekly.