What Breaks First When Traffic Scales
A catalog of components that fail first under increasing traffic, ordered by how commonly they become bottlenecks in web applications.
Context
Over the past three years, I have been involved in scaling four different web applications from hundreds to hundreds of thousands of requests per minute. In every case, the failure order was predictable. The same components break first, in roughly the same sequence, regardless of the tech stack.
Related: Failure Modes I Actively Design For.
Problem
Teams invest in scaling the wrong things. They add application server capacity when the database is the bottleneck. They optimize queries when the connection pool is exhausted. They cache aggressively when the problem is DNS resolution. Understanding the typical failure order saves weeks of misdirected effort.
Constraints
- This analysis covers typical web applications: a load balancer, application servers, a relational database, and a CDN
- Traffic pattern: gradual ramp from 100 to 100,000 requests per minute
- Application type: read-heavy (80/20 read/write ratio)
- Infrastructure: cloud-hosted (AWS or similar), not on-premise
Design
The failure order I have observed, ranked by frequency:
The Failure Cascade
| Order | Component | Typical Breaking Point | Symptom |
|---|---|---|---|
| 1 | Database connections | 50-100 concurrent connections | Connection timeout errors, 5xx responses |
| 2 | Database query performance | Varies by query complexity | Slow responses, growing queue depth |
| 3 | Application memory | Depends on instance size | OOM kills, pod restarts |
| 4 | External API rate limits | Vendor-specific | 429 responses, degraded features |
| 5 | DNS resolution | 1,000+ req/s per resolver | Intermittent connection failures |
| 6 | Load balancer configuration | Varies by provider | Uneven distribution, hot instances |
| 7 | Disk I/O (logging, temp files) | High write volume | Increased latency, disk full errors |
| 8 | TLS handshake overhead | 5,000+ new connections/s | Increased TTFB, CPU saturation |
1. Database Connection Pool Exhaustion
This is almost always the first failure. Postgres ships with a default max_connections of 100. A Node.js application server with a pool size of 10, running 12 instances, needs 120 connections. Add the overhead from migrations, monitoring, and admin tools, and you are over the limit before traffic even spikes.
Detection: `pg_stat_activity` shows connections in "idle" or "idle in transaction" state near `max_connections`.
Fix hierarchy:
- Add a connection pooler (PgBouncer) in transaction mode
- Reduce per-instance pool size
- Increase `max_connections` (last resort; each connection costs additional memory)
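The first step, PgBouncer in transaction mode, can be sketched with a minimal config like the one below. The database name, addresses, and pool sizes are illustrative assumptions; tune `default_pool_size` to your workload.

```ini
; Hypothetical pgbouncer.ini sketch -- names and sizes are illustrative.
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; transaction mode: a server connection is held only for the duration
; of a transaction, so many clients share few Postgres connections
pool_mode = transaction
; server-side connections PgBouncer keeps open to Postgres
default_pool_size = 20
; client-side connections the application may open to PgBouncer
max_client_conn = 500
```

The application then connects to port 6432 instead of 5432; per-instance pool sizes can stay generous because they no longer map one-to-one onto Postgres backends.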
2. Slow Queries Under Load
Queries that run in 5ms at low traffic take 500ms under load. This happens because of lock contention, buffer cache misses, and increased I/O wait. The usual suspects:
- Missing indexes on WHERE and JOIN columns
- Sequential scans on tables over 100K rows
- N+1 query patterns amplified by concurrent requests
- `SELECT *` fetching columns that are never used
Detection: `pg_stat_statements` sorted by `total_exec_time` or `mean_exec_time`.
Fix hierarchy:
- Add missing indexes (covers 60% of cases)
- Rewrite N+1 patterns as JOINs or batch queries
- Add read replicas for read-heavy queries
- Implement query-level caching
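The N+1 rewrite in the second step can be sketched as follows. The `query` function is a hypothetical stub standing in for a real Postgres client (e.g. pg's `client.query`); here it only counts round trips and fabricates rows, which is enough to show the difference.

```typescript
// Sketch: why N+1 hurts and what the batched rewrite looks like.
type Row = { id: number; name: string };

let queryCount = 0;

// Hypothetical db client stub: each call is one network round trip
// in a real client.
async function query(sql: string, ids: number[]): Promise<Row[]> {
  queryCount++;
  return ids.map((id) => ({ id, name: `user-${id}` }));
}

// N+1 pattern: one query per id, so N round trips that all contend
// for pool connections under concurrent load.
async function fetchUsersNPlusOne(ids: number[]): Promise<Row[]> {
  const rows: Row[] = [];
  for (const id of ids) {
    rows.push(...(await query("SELECT id, name FROM users WHERE id = $1", [id])));
  }
  return rows;
}

// Batched rewrite: a single round trip, and Postgres can satisfy it
// with one index scan over the id list.
async function fetchUsersBatched(ids: number[]): Promise<Row[]> {
  return query("SELECT id, name FROM users WHERE id = ANY($1)", ids);
}
```

At low traffic the two versions look identical; under concurrency, the N+1 version multiplies both latency and connection pressure by the list size.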
3. Application Memory Pressure
Node.js historically capped its heap at roughly 1.5 GB by default (tunable with `--max-old-space-size`); Java applications have JVM heap configuration. In both cases, memory pressure manifests first as longer GC pauses, then as OOM kills.
Common causes:
- Unbounded in-memory caches
- Large JSON payloads held in memory during processing
- Memory leaks from event listener accumulation
- Logging libraries buffering in memory
Detection: container memory metrics approaching limit, GC pause duration increasing.
Fix hierarchy:
- Profile memory allocation (Node.js: `--inspect` with Chrome DevTools)
- Set explicit limits on in-memory caches
- Stream large payloads instead of buffering
- Increase instance memory (temporary, not a fix)
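"Set explicit limits on in-memory caches" is the step teams most often skip. A minimal bounded LRU cache can be sketched with a plain `Map`, which preserves insertion order, so deleting and re-inserting a key moves it to the most-recently-used end. This is an illustrative sketch, not a replacement for a production cache library.

```typescript
// Minimal bounded LRU cache: Map preserves insertion order, so the
// first key is always the least recently used.
class BoundedCache<K, V> {
  private map = new Map<K, V>();

  constructor(private maxEntries: number) {}

  get(key: K): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      this.map.delete(key);
      this.map.set(key, value); // refresh recency on access
    }
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least-recently-used entry (first in insertion order).
      const oldest = this.map.keys().next().value as K;
      this.map.delete(oldest);
    }
  }

  get size(): number {
    return this.map.size;
  }
}
```

An unbounded cache is a memory leak with a flattering name: it grows with traffic, which is exactly when you can least afford it.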
4. External API Rate Limits
Third-party APIs (payment processors, email services, geocoding) impose rate limits. At low traffic, you never hit them. At scale, a payment flow processing 100 orders/minute can exhaust a 60 req/min API limit.
Detection: 429 HTTP responses from external services, feature-specific error spikes.
Fix hierarchy:
- Implement client-side rate limiting with a token bucket
- Add request queuing with backpressure
- Negotiate higher limits with the vendor
- Cache API responses where possible
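The first step, client-side token-bucket limiting, can be sketched like this. Capacity and refill numbers are illustrative, and the injectable clock exists only to make the sketch deterministic in tests; a real implementation would use `Date.now()` directly.

```typescript
// Token-bucket rate limiter sketch: tokens refill continuously at a
// fixed rate, and a request proceeds only if a token is available.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerSecond: number,
    private now: () => number = () => Date.now(), // injectable for testing
  ) {
    this.tokens = capacity;
    this.lastRefill = this.now();
  }

  tryRemove(count = 1): boolean {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    // Refill proportionally to elapsed time, never above capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}
```

For a vendor limit of 60 req/min, a bucket with capacity 60 and a refill rate of 1 token per second keeps you under the limit while still absorbing short bursts.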
5. DNS Resolution Bottlenecks
Every outbound HTTP request requires DNS resolution. At high request rates, the default system resolver becomes a bottleneck. This manifests as intermittent 1-3 second delays on random requests.
Detection: inconsistent latency spikes that do not correlate with application or database metrics.
Fix hierarchy:
- Enable DNS caching in the HTTP client (Node.js: the `lookup` option backed by a `dns.resolve` cache)
- Use a local DNS cache (dnsmasq, systemd-resolved)
- Increase resolver concurrency limits
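The caching idea in the first step can be sketched as a TTL-bounded wrapper around a resolver function. The resolver is injectable so the sketch is testable without network access; in a real application it would wrap `dns.promises.resolve4`, and the wrapped function would feed the HTTP client's `lookup` option.

```typescript
// TTL-bounded DNS cache sketch in front of an arbitrary resolver.
type Resolver = (hostname: string) => Promise<string[]>;

function cachedResolver(
  resolve: Resolver,
  ttlMs: number,
  now: () => number = Date.now, // injectable for testing
): Resolver {
  const cache = new Map<string, { addresses: string[]; expires: number }>();
  return async (hostname) => {
    const hit = cache.get(hostname);
    if (hit && hit.expires > now()) return hit.addresses; // fresh: skip resolution
    const addresses = await resolve(hostname);
    cache.set(hostname, { addresses, expires: now() + ttlMs });
    return addresses;
  };
}
```

Even a short TTL (a few seconds) collapses thousands of identical lookups per second into one, which is usually enough to take the resolver off the critical path.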
Trade-offs
| Mitigation | Cost | Complexity | Effectiveness |
|---|---|---|---|
| Connection pooler (PgBouncer) | Low ($0-10/month) | Medium | High |
| Read replicas | Medium ($50-200/month) | High | High |
| Application caching (Redis) | Medium ($20-50/month) | Medium | Medium-High |
| CDN for static assets | Low ($5-20/month) | Low | High for static |
| Vertical scaling (bigger instances) | High | Low | Temporary |
Vertical scaling buys time. Horizontal scaling buys capacity. Architectural changes (caching, read replicas, async processing) buy durability.
Failure Modes
Cascading failures: database slowness causes application request queues to grow, which exhausts memory, which causes OOM kills, which reduces capacity, which increases load on surviving instances. This cascade can take a healthy system to zero in under 60 seconds.
Hidden coupling: an application that calls three external APIs sequentially has a latency floor equal to the sum of the slowest API responses. Under load, each API slows independently, and the combined latency exceeds timeout thresholds.
Retry storms: when a service returns errors, clients retry. If retry logic is aggressive (immediate retry, no backoff), a temporary failure becomes a permanent one because retries double the load on the already-struggling service.
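The standard antidote to retry storms is capped exponential backoff with jitter: the delay doubles per attempt up to a cap, and randomization spreads retries out so clients do not synchronize. A sketch, with an injectable sleep function so the retry loop is testable:

```typescript
// Full-jitter exponential backoff: uniform delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 100,
  capMs = 10_000,
  random: () => number = Math.random, // injectable for testing
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * exp;
}

// Retry wrapper: rethrows the last error once attempts are exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt + 1 >= maxAttempts) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

The cap matters as much as the doubling: without it, late retries arrive so far apart that recovery looks like an outage, and without jitter, every client retries at the same instant and re-creates the spike that caused the failure.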
Scaling Considerations
- Profile under load, not at rest. A system that looks healthy at 100 req/min may be critically bottlenecked at 1,000 req/min.
- Scale the bottleneck, not the symptom. If the database is slow, adding more application servers makes it worse by increasing connection pressure.
- Implement circuit breakers for external dependencies. A failing third-party API should degrade gracefully, not bring down the entire application.
- Load test with realistic traffic patterns. Synthetic benchmarks with uniform requests miss the hot-path bottlenecks that real traffic exposes.
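The circuit-breaker point above can be sketched minimally: after a threshold of consecutive failures the circuit opens and calls fail fast, and after a cooldown one trial call is allowed through. Thresholds and cooldowns here are illustrative; production breakers (and libraries like opossum) track more state, such as rolling error rates.

```typescript
// Minimal consecutive-failure circuit breaker sketch.
class CircuitBreaker {
  private failures = 0;
  private openedAt = -Infinity;

  constructor(
    private threshold: number,
    private cooldownMs: number,
    private now: () => number = () => Date.now(), // injectable for testing
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // Open circuit: fail fast without touching the dependency.
    if (this.failures >= this.threshold && this.now() - this.openedAt < this.cooldownMs) {
      throw new Error("circuit open");
    }
    try {
      const result = await fn(); // closed or half-open: try the call
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.threshold) this.openedAt = this.now();
      throw err;
    }
  }
}
```

The fail-fast path is the point: a call that errors in microseconds frees the request thread, the connection, and the memory that a 30-second timeout would have held hostage.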
Observability
The minimum monitoring stack for scaling:
- Database: connection count, query duration (p50/p95/p99), lock wait time, replication lag
- Application: request duration, error rate, memory usage, GC pause time, event loop lag (Node.js)
- Infrastructure: CPU utilization, network I/O, disk I/O, DNS resolution time
- External: response time and error rate per third-party dependency
Alert on the leading indicators (connection count approaching limit, memory usage above 80%) rather than lagging indicators (error rate above 5%).
Key Takeaways
- Database connections break first in almost every scaling scenario. Add PgBouncer before you need it.
- Slow queries hide at low traffic and dominate at high traffic. Use `pg_stat_statements` proactively.
- Memory pressure causes cascading failures. Set explicit limits on everything that allocates memory.
- External API rate limits are a hard wall. Design for them with client-side rate limiting and queuing.
- Scale the bottleneck, not the symptom. Adding application servers when the database is the constraint makes things worse.
Further Reading
- Load Testing Mobile Backends With Realistic Traffic
- Scaling Isn't the Hard Part, Debugging Is
- Benchmarking Database Writes Under Load
Final Thoughts
The failure order is predictable: connections, queries, memory, external limits, DNS. Knowing this sequence means you can instrument and mitigate proactively instead of reacting to each failure in production. Every scaling effort I have been part of started with PgBouncer and ended with caching. The path between those two steps is where the real engineering happens.