Designing a Simple Authentication Service
Architecture for a session-based authentication service with JWT access tokens, refresh token rotation, and measured security trade-offs.
Context
A web application needed authentication for 50,000 users. Requirements: email/password login, session management, JWT access tokens for API authorization, refresh token rotation, and account lockout after failed attempts. No third-party auth provider (Auth0, Clerk) due to cost constraints at scale.
Problem
Authentication is a solved problem with unsolved trade-offs. JWTs are stateless but irrevocable. Sessions are revocable but require server-side state. Refresh tokens extend session duration but add complexity. The design must balance security, performance, and operational simplicity.
Constraints
- Users: 50,000 registered, 15,000 DAU
- Login frequency: average 1.2 logins per user per day (across devices)
- Token verification: every API request (18,000 req/min at peak)
- Storage: Postgres for user accounts, Redis for active sessions
- Access token lifetime: 15 minutes
- Refresh token lifetime: 7 days
- Must support multiple concurrent sessions (mobile + web)
- Password storage: bcrypt with cost factor 12
Design
Schema
CREATE TABLE users (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email TEXT UNIQUE NOT NULL,
  password_hash TEXT NOT NULL,
  email_verified BOOLEAN NOT NULL DEFAULT false,
  failed_login_attempts INTEGER NOT NULL DEFAULT 0,
  locked_until TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE refresh_tokens (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id UUID NOT NULL REFERENCES users(id),
  token_hash TEXT NOT NULL,
  family_id UUID NOT NULL, -- for rotation detection
  expires_at TIMESTAMPTZ NOT NULL,
  revoked_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_refresh_tokens_user ON refresh_tokens (user_id);
CREATE INDEX idx_refresh_tokens_family ON refresh_tokens (family_id);
Token Architecture
Login
-> Verify credentials
-> Generate access token (JWT, 15 min, signed with RS256)
-> Generate refresh token (opaque, 7 days, stored in DB)
-> Return both to client
API Request
-> Verify access token (JWT signature check, no DB lookup)
-> If expired, client uses refresh token to get new pair
Token Refresh
-> Validate refresh token against DB
-> Rotate: issue new refresh token, revoke old one
-> Issue new access token
-> Return both to client
Access Token (JWT)
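// Assumes the jsonwebtoken package, whose sign() signature matches this call
import jwt from 'jsonwebtoken';

// iat and exp are expressed in seconds since the Unix epoch
const now = Math.floor(Date.now() / 1000);
const accessToken = jwt.sign(
  {
    sub: user.id,
    email: user.email,
    roles: user.roles,
    iat: now,
    exp: now + 900, // 15 minutes
  },
  privateKey,
  { algorithm: 'RS256' }
);
RS256 (RSA signature with SHA-256) is asymmetric: tokens are signed with the private key and verified with the public key only. API servers do not need access to the signing key.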
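To make the public-key-only property concrete, here is a minimal sketch that signs and verifies a compact RS256 token using only Node's built-in crypto module. It is not the library call used above: the helper names (encodeSegment, signRs256, verifyRs256), the throwaway key pair, and the fixed timestamps are assumptions for illustration, and a production service would normally rely on a maintained JWT library.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

// Base64url-encode a JSON value, as used for JWT header and payload segments.
function encodeSegment(value: object): string {
  return Buffer.from(JSON.stringify(value)).toString("base64url");
}

// Auth server side: needs the private key.
function signRs256(payload: object, privateKey: KeyObject): string {
  const signingInput =
    encodeSegment({ alg: "RS256", typ: "JWT" }) + "." + encodeSegment(payload);
  const signature = sign("RSA-SHA256", Buffer.from(signingInput), privateKey);
  return signingInput + "." + signature.toString("base64url");
}

// API server side: needs only the public key.
// Returns the claims, or null if the token is malformed, tampered with, or expired.
function verifyRs256(
  token: string,
  publicKey: KeyObject,
  nowSec: number
): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, payload, signature] = parts;
  try {
    // Never trust the algorithm named by the token: require RS256 explicitly.
    const parsedHeader = JSON.parse(Buffer.from(header, "base64url").toString("utf8"));
    if (parsedHeader.alg !== "RS256") return null;
    const signatureOk = verify(
      "RSA-SHA256",
      Buffer.from(header + "." + payload),
      publicKey,
      Buffer.from(signature, "base64url")
    );
    if (!signatureOk) return null;
    const claims = JSON.parse(Buffer.from(payload, "base64url").toString("utf8"));
    if (typeof claims.exp !== "number" || claims.exp <= nowSec) return null;
    return claims;
  } catch {
    return null; // segments were not valid JSON
  }
}

// Throwaway key pair for the example; real keys would be loaded from secure storage.
const { publicKey, privateKey } = generateKeyPairSync("rsa", { modulusLength: 2048 });
const issuedAt = 1700000000;
const token = signRs256({ sub: "user-1", iat: issuedAt, exp: issuedAt + 900 }, privateKey);
const validClaims = verifyRs256(token, publicKey, issuedAt + 60);
const expiredClaims = verifyRs256(token, publicKey, issuedAt + 901);
```

Only signRs256 touches the private key, so the verification path can be deployed to every API server without distributing signing material.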
Refresh Token Rotation
import crypto from 'crypto';
// addDays is assumed to be a date helper such as the one provided by date-fns

// Hex-encoded SHA-256, so raw token values are never stored
const sha256 = (value: string) =>
  crypto.createHash('sha256').update(value).digest('hex');

async function rotateRefreshToken(oldTokenValue: string) {
  // Look up the token even if it is revoked: reuse detection needs its family_id
  const result = await db.query(
    'SELECT * FROM refresh_tokens WHERE token_hash = $1',
    [sha256(oldTokenValue)]
  );
  const oldToken = result.rows[0];
  if (!oldToken) {
    // Unknown token: there is no family to revoke
    throw new Error('Invalid refresh token');
  }
  if (oldToken.revoked_at) {
    // A revoked token was presented again
    // Possible token reuse attack: revoke entire family
    await db.query(
      'UPDATE refresh_tokens SET revoked_at = now() WHERE family_id = $1 AND revoked_at IS NULL',
      [oldToken.family_id]
    );
    throw new Error('Refresh token reuse detected');
  }
  if (oldToken.expires_at < new Date()) {
    throw new Error('Refresh token expired');
  }
  // Revoke old token
  await db.query(
    'UPDATE refresh_tokens SET revoked_at = now() WHERE id = $1',
    [oldToken.id]
  );
  // Issue new refresh token in the same family
  const newTokenValue = crypto.randomBytes(32).toString('hex');
  await db.query(
    `INSERT INTO refresh_tokens (user_id, token_hash, family_id, expires_at)
     VALUES ($1, $2, $3, $4)`,
    [oldToken.user_id, sha256(newTokenValue), oldToken.family_id, addDays(new Date(), 7)]
  );
  return newTokenValue;
}
The family_id groups all refresh tokens descended from a single login. If a revoked token is reused (indicating possible theft), the entire family is revoked, forcing re-authentication for every client holding a token from that family. Sessions created by other logins belong to other families and are unaffected.
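The reuse-detection behaviour can be exercised without a database. The following is a minimal in-memory sketch, assuming a Map as a stand-in for the refresh_tokens table; the names store, issueInitialToken, and rotate are hypothetical and expiry handling is omitted for brevity.

```typescript
import { createHash, randomBytes, randomUUID } from "node:crypto";

interface StoredToken {
  familyId: string;
  revoked: boolean;
}

// In-memory stand-in for the refresh_tokens table, keyed by token hash.
const store = new Map<string, StoredToken>();

const hashToken = (value: string): string =>
  createHash("sha256").update(value).digest("hex");

// Called at login: starts a new family and returns its first refresh token.
function issueInitialToken(): string {
  const value = randomBytes(32).toString("hex");
  store.set(hashToken(value), { familyId: randomUUID(), revoked: false });
  return value;
}

// Rotates a token. Presenting an already-revoked token revokes its whole family.
function rotate(oldValue: string): string {
  const old = store.get(hashToken(oldValue));
  if (!old) throw new Error("Invalid refresh token");
  if (old.revoked) {
    store.forEach((candidate) => {
      if (candidate.familyId === old.familyId) candidate.revoked = true;
    });
    throw new Error("Refresh token reuse detected");
  }
  old.revoked = true;
  const value = randomBytes(32).toString("hex");
  store.set(hashToken(value), { familyId: old.familyId, revoked: false });
  return value;
}
```

If a client rotates token A into token B and an attacker later replays A, the replay revokes B as well, so neither party can continue without logging in again. Tokens issued by a different login keep working because they carry a different family ID.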
Account Lockout
async function handleFailedLogin(userId: string) {
  // The CASE expression sees the pre-update count, so ">= 4" fires on the 5th failure
  const result = await db.query(
    `UPDATE users
     SET failed_login_attempts = failed_login_attempts + 1,
         locked_until = CASE
           WHEN failed_login_attempts >= 4 THEN now() + interval '15 minutes'
           ELSE locked_until
         END
     WHERE id = $1
     RETURNING failed_login_attempts`,
    [userId]
  );
  return result.rows[0].failed_login_attempts;
}
After 5 failed attempts, the account locks for 15 minutes. The failed attempt count resets on successful login.
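The lockout rule, including the reset on success, can be sketched as pure functions that are easy to unit test. This is an illustrative model, not the service's code: LockoutState, recordFailure, recordSuccess, and isLocked are hypothetical names, and the constants mirror the 5-attempt, 15-minute policy described above.

```typescript
interface LockoutState {
  failedAttempts: number;
  lockedUntil: Date | null;
}

const MAX_ATTEMPTS = 5;
const LOCK_MS = 15 * 60 * 1000;

// Mirrors the SQL UPDATE: increment the counter and lock once it reaches 5.
function recordFailure(state: LockoutState, now: Date): LockoutState {
  const failedAttempts = state.failedAttempts + 1;
  const lockedUntil =
    failedAttempts >= MAX_ATTEMPTS
      ? new Date(now.getTime() + LOCK_MS)
      : state.lockedUntil;
  return { failedAttempts, lockedUntil };
}

// A successful login clears the counter and any lock.
function recordSuccess(): LockoutState {
  return { failedAttempts: 0, lockedUntil: null };
}

// The login handler would check this before verifying the password.
function isLocked(state: LockoutState, now: Date): boolean {
  return state.lockedUntil !== null && state.lockedUntil.getTime() > now.getTime();
}
```

Checking isLocked before running bcrypt also avoids spending a hash computation on an account that cannot log in anyway.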
Trade-offs
Token Strategy Comparison
| Property | JWT Only (no refresh) | JWT + Refresh (this design) | Session Cookie Only |
|---|---|---|---|
| Stateless verification | Yes | Access: yes, Refresh: no | No |
| Revocability | No (until expiry) | Access: no, Refresh: yes | Yes (immediate) |
| DB lookups per API request | 0 | 0 | 1 |
| Token theft impact | Full access until expiry | 15 min max (access), detectable (refresh) | Until session invalidated |
| Multi-device support | Manual | Native (separate refresh tokens) | Native (separate sessions) |
| Complexity | Low | Medium | Low |
Performance Measurements
| Operation | Latency (p50) | Latency (p95) |
|---|---|---|
| JWT verification (RS256) | 0.3ms | 0.5ms |
| Login (bcrypt + DB + token gen) | 280ms | 420ms |
| Token refresh (DB lookup + token gen) | 8ms | 22ms |
| Session check (Redis) | 0.5ms | 1.2ms |
JWT verification at 0.3ms per request adds 5.4 seconds of cumulative CPU time per minute at 18,000 req/min. This is negligible.
bcrypt at cost factor 12 takes ~250ms per hash. This is intentionally slow to resist brute-force attacks. At 18,000 logins/hour (peak), this requires 4,500 CPU-seconds/hour of bcrypt computation, or about 1.25 cores on average. At ~250ms per hash a single core handles ~4 bcrypt operations/second, so a sustained 4 logins/second saturates one core.
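The capacity figures follow from simple arithmetic on the measured latencies. A small sketch of that calculation, using the request rates and timings quoted in this section (the function and constant names are hypothetical):

```typescript
// CPU-seconds consumed by `requests` operations that each take `msPerOp` milliseconds.
function cpuSeconds(requests: number, msPerOp: number): number {
  return (requests * msPerOp) / 1000;
}

// JWT verification: 18,000 requests/minute at 0.3 ms each.
const jwtCpuSecondsPerMinute = cpuSeconds(18000, 0.3); // about 5.4

// bcrypt: 18,000 logins/hour at ~250 ms each.
const bcryptCpuSecondsPerHour = cpuSeconds(18000, 250); // 4500
const bcryptAverageCores = bcryptCpuSecondsPerHour / 3600; // 1.25

// Throughput of one core that does nothing but hash.
const hashesPerCorePerSecond = 1000 / 250; // 4
```

JWT verification uses roughly 9% of one core at peak (5.4 CPU-seconds per 60 seconds), while bcrypt needs more than a full core at the stated login peak, which is why the login path is the one to size deliberately.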
Security Properties
| Attack Vector | Mitigation |
|---|---|
| Password brute force | bcrypt (250ms/attempt) + account lockout (5 attempts) |
| Token theft (access) | 15-minute expiry limits exposure window |
| Token theft (refresh) | Rotation detection via family_id, entire family revoked |
| Token replay | Short-lived access tokens, single-use refresh tokens |
| Credential stuffing | Rate limiting on login endpoint (10 attempts/IP/minute) |
| Password database leak | bcrypt with cost factor 12 (estimated crack time: years per hash) |
Failure Modes
Redis down for session storage: If Redis is unavailable, active session checks fail. JWT access tokens continue to work (stateless verification), but refresh token operations fail because they query Postgres through the session layer. Mitigation: separate the refresh token flow from session storage. Use Postgres directly for refresh token operations.
Clock skew on JWT verification: If the API server clock is behind the auth server clock, newly issued JWTs may appear to be "not yet valid." If it is ahead, tokens appear to expire early. Mitigation: add a 30-second clock skew tolerance to the JWT verification library.
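A tolerance check of this kind can be sketched as follows. This assumes tokens carry an nbf ("not before") claim alongside exp, which the access token shown earlier does not set; TimeClaims and checkTokenTime are hypothetical names, and most JWT libraries expose an equivalent setting rather than requiring hand-written checks.

```typescript
interface TimeClaims {
  nbf?: number; // not valid before this time, seconds since epoch (assumed claim)
  exp: number; // expires at this time, seconds since epoch
}

type TimeCheck = "valid" | "not_yet_valid" | "expired";

// Accepts tokens whose time bounds are violated by at most `toleranceSec` seconds.
function checkTokenTime(
  claims: TimeClaims,
  nowSec: number,
  toleranceSec: number = 30
): TimeCheck {
  if (claims.nbf !== undefined && nowSec + toleranceSec < claims.nbf) {
    return "not_yet_valid";
  }
  if (nowSec - toleranceSec >= claims.exp) {
    return "expired";
  }
  return "valid";
}
```

With a verifier clock running 20 seconds behind the issuer, a freshly issued token is rejected at zero tolerance and accepted at 30 seconds. The trade-off is that the tolerance also extends the effective lifetime of every token by the same amount.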
Refresh token family false positive: If a client retries a refresh request (due to network timeout) and the first request succeeded, the retry uses a revoked token. This triggers the token reuse detection, revoking the entire family and forcing re-authentication. Mitigation: add a 10-second grace period for recently revoked tokens (allow a single reuse within the grace window).
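One way to express the grace-window decision is a small pure function that the rotation code consults when it sees a revoked token. This is a sketch under assumptions: it presumes the revocation time is available (the schema's revoked_at) and that some flag records whether the token was already reused once, which the schema shown earlier does not include. ReuseDecision and classifyReuse are hypothetical names.

```typescript
type ReuseDecision = "allow_retry" | "revoke_family";

// Decides how to treat a refresh token that is already revoked.
// A single reuse shortly after revocation is treated as a client retry.
function classifyReuse(
  revokedAt: Date,
  now: Date,
  alreadyReused: boolean,
  graceMs: number = 10000
): ReuseDecision {
  const withinGrace = now.getTime() - revokedAt.getTime() <= graceMs;
  return withinGrace && !alreadyReused ? "allow_retry" : "revoke_family";
}
```

Because only token hashes are stored, the server cannot hand back the token it issued on the first attempt; on "allow_retry" it would presumably issue another token in the same family. The grace window also gives an attacker a brief opportunity to replay a stolen token undetected, so it should stay as short as client retry behaviour allows.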
bcrypt DoS: An attacker sending thousands of login requests with random passwords forces the server to compute bcrypt hashes for each one, consuming CPU. Mitigation: rate limit the login endpoint aggressively (10 req/IP/minute) and add a proof-of-work challenge (e.g., hashcash) for IPs exceeding the limit.
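The per-IP limit can be sketched as a fixed-window counter that runs before any bcrypt work. This is a single-process illustration with hypothetical names (FixedWindowLimiter, allow); a multi-instance deployment would need shared state, for example in the Redis instance already in use, and the proof-of-work step is not shown.

```typescript
interface WindowState {
  start: number; // window start, milliseconds
  count: number; // requests seen in this window
}

class FixedWindowLimiter {
  private windows: Map<string, WindowState> = new Map();
  private limit: number;
  private windowMs: number;

  constructor(limit: number = 10, windowMs: number = 60000) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed to password verification.
  allow(ip: string, nowMs: number): boolean {
    const current = this.windows.get(ip);
    if (!current || nowMs - current.start >= this.windowMs) {
      // First request from this IP, or the previous window has ended.
      this.windows.set(ip, { start: nowMs, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

At 10 requests per IP per minute, a single address can force at most about 2.5 CPU-seconds of bcrypt per minute (10 hashes at ~250ms), so the remaining exposure comes from attackers spread across many addresses.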
Scaling Considerations
- JWT verification scales linearly with CPU. No database dependency for API authorization.
- Login endpoint is CPU-bound (bcrypt). Scale horizontally with more instances, or use a dedicated login service.
- Refresh token rotation requires a database write per refresh. At 15-minute access token lifetimes and 15,000 DAU, that is ~60,000 refresh operations/day (about 4 refreshes per active user per day). Postgres handles this easily.
- For millions of users, partition the refresh_tokens table by user_id and set up automatic cleanup of expired tokens.
Observability
- Track login success/failure rate per IP and per account
- Monitor JWT verification errors (expired, invalid signature, malformed)
- Alert on refresh token family revocations (potential token theft)
- Dashboard: active sessions per user, login frequency, lockout events
- Log (but do not expose) the reason for every authentication failure
Key Takeaways
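- JWT access tokens (15 minutes) combined with refresh token rotation provide a workable balance of stateless verification and revocability: API requests need no database lookup, and a session can be cut off within one access token lifetime.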
- Refresh token family tracking detects token theft by identifying reuse of revoked tokens.
- bcrypt at cost factor 12 is slow by design. Plan CPU capacity for the login endpoint accordingly.
- Account lockout is a blunt instrument. Combine it with IP-based rate limiting for defense in depth.
- Add a grace period for refresh token reuse to handle client retry scenarios without false-positive family revocations.
Final Thoughts
This authentication service handles 15,000 DAU with zero additional infrastructure cost beyond existing Postgres and Redis. The JWT + refresh token pattern eliminates database lookups on every API request (saving 18,000 queries/minute) while maintaining the ability to revoke sessions within 15 minutes. The total implementation is approximately 500 lines of TypeScript. The primary ongoing operational task is monitoring for credential stuffing attacks via the failed login rate dashboard.