Engineering Decisions That Reduce Pager Fatigue
Architectural and operational decisions that reduce the frequency and severity of production pages, based on patterns from years of on-call experience.
Pager fatigue is not an operational problem. It is an engineering problem. Every page represents a system behavior that was not anticipated, handled, or automated. Reducing pager fatigue requires architectural decisions, not just better runbooks.
Categorizing Pages by Root Cause
Before fixing anything, I categorize the last 90 days of pages into root causes:
| Category | Example | Engineering response |
|---|---|---|
| Transient dependency failure | Downstream service 503 for 30 seconds | Add circuit breaker, retry with backoff |
| Resource exhaustion | Database connection pool full | Auto-scaling, better connection management |
| Deployment regression | New code introduces error spike | Canary deployment, automated rollback |
| Capacity surprise | Traffic spike exceeds provisioned capacity | Auto-scaling, load shedding |
| Configuration error | Wrong feature flag value in production | Config validation, environment guardrails |
| Data quality issue | Malformed input causes processing failure | Schema validation, dead letter queues |
| Clock/timing issue | Cron job overlap, lease expiration race | Distributed locking, idempotent jobs |
Most on-call teams will find that 70-80% of pages fall into two or three categories. Fix those categories systematically and the page volume drops dramatically.
Self-Healing Systems
The highest-leverage engineering decision for reducing pages: make the system fix itself for known failure modes.
Automatic retry with circuit breaking. A transient downstream failure should not page anyone. The system should retry with exponential backoff, open the circuit if failures persist, and only page when the circuit has been open for longer than the expected recovery time.
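To make the circuit behavior concrete, here is a minimal Python sketch of that state machine. The class and method names, thresholds, and the "page only after the circuit has been open longer than expected recovery" check are illustrative assumptions, not a real library API.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: opens after N consecutive failures,
    half-opens after a recovery timeout, and pages only if the circuit
    stays open past the expected recovery time."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0, page_after=120.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout   # seconds before a probe is allowed
        self.page_after = page_after               # seconds open before paging a human
        self.failures = 0
        self.opened_at = None                      # None means the circuit is closed

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold and self.opened_at is None:
            self.opened_at = now

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request after the recovery timeout.
        return now - self.opened_at >= self.recovery_timeout

    def should_page(self, now=None):
        now = time.monotonic() if now is None else now
        return self.opened_at is not None and now - self.opened_at >= self.page_after
```

The key design point is the gap between `recovery_timeout` and `page_after`: the system gets a window to heal itself before any human is interrupted.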
Auto-scaling on utilization thresholds. A traffic spike that pushes CPU to 80% should trigger scaling, not a page. The page should fire only if auto-scaling fails or the system reaches a hard capacity limit.
Automatic rollback on health check failure. A deployment that increases error rates should roll back without human intervention. The page should be informational ("deployment was rolled back") rather than actionable ("error rate is elevated, please investigate").
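A rollback decision like this can be reduced to a small comparison between canary and baseline. The function below is a hypothetical sketch; the parameter names, the traffic gate, and the tolerance value are assumptions for illustration.

```python
def rollback_decision(baseline_error_rate, canary_error_rate,
                      min_requests, canary_requests, tolerance=0.005):
    """Decide the canary's fate without human intervention.
    Roll back when the canary's error rate exceeds baseline by more
    than `tolerance`, once enough traffic has flowed for a real signal."""
    if canary_requests < min_requests:
        return "wait"        # not enough data to judge yet
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"    # automatic; the page is informational only
    return "promote"
```

With this in place, the human-facing message is "deployment was rolled back," exactly the informational page described above.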
Queue consumer restart on crash. A consumer that crashes on a poison message should restart automatically, move the message to the dead letter queue, and continue processing. The page should be about dead letter queue depth exceeding a threshold, not about individual consumer crashes.
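The poison-message pattern can be sketched in a few lines. This is an illustrative toy (queues are plain lists, `handle` is a stand-in for real processing), not a real broker API:

```python
def consume(queue, dead_letters, handle, max_attempts=3):
    """Process messages; park poison messages in the dead letter queue
    instead of letting one bad message halt the consumer."""
    for message in queue:
        for attempt in range(1, max_attempts + 1):
            try:
                handle(message)
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letters.append(message)  # park it, keep processing
    # The signal worth alerting on is dead-letter depth, not each crash:
    return len(dead_letters)
```

A single bad message ends up in the dead letter queue while the rest of the stream keeps flowing; the alert fires on queue depth, matching the threshold-based page described above.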
The pattern: engineer the first response into the system. Page humans only when the automated response is insufficient.
Reducing Alert Noise
Alert noise is the primary cause of pager fatigue. When legitimate pages are buried in false positives, on-call engineers either ignore all pages or burn out responding to every one.
Require a minimum duration before alerting. A metric that spikes for 10 seconds and recovers is not an incident. Require the condition to persist for a meaningful window (typically 5-15 minutes for non-critical alerts) before paging.
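A duration gate of this kind is simple to implement. The sketch below assumes a monitoring loop that evaluates the condition periodically; the class name and `hold` parameter are illustrative.

```python
class DurationGate:
    """Fire only if the condition has been continuously true for `hold` seconds.
    A brief spike that recovers resets the clock and never pages."""

    def __init__(self, hold):
        self.hold = hold
        self.since = None  # timestamp when the condition first became true

    def update(self, condition, now):
        if not condition:
            self.since = None   # condition cleared: reset
            return False
        if self.since is None:
            self.since = now    # condition just became true: start the clock
        return now - self.since >= self.hold
```

Most monitoring systems offer this natively (for example, a minimum "for" duration on an alert rule), so in practice this is configuration rather than code.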
Use composite alerts. Instead of alerting when any single metric crosses a threshold, alert when multiple correlated signals indicate a problem. Error rate above 1% AND latency P99 above 2 seconds AND success rate below 99% together indicate a real problem. Any one alone might be noise.
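Expressed as code, the composite condition from that example is just a conjunction. The thresholds below come from the text; the function name is illustrative.

```python
def composite_page(error_rate, p99_latency_s, success_rate):
    """Page only when multiple correlated signals agree that something
    is wrong; any single threshold breach alone is treated as noise."""
    return (error_rate > 0.01        # error rate above 1%
            and p99_latency_s > 2.0  # P99 latency above 2 seconds
            and success_rate < 0.99) # success rate below 99%
```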
Separate informational alerts from actionable pages. Not everything that is anomalous requires an immediate response. Use a tiered system:
- Page: requires immediate human response, interrupts sleep
- Ticket: requires response within business hours, auto-creates a task
- Notification: informational, reviewed during regular triage
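The tiers above amount to a small routing function. This is a sketch, not a real pager integration; the `actionable` and `urgent` fields are assumed attributes of an alert.

```python
def route(alert):
    """Map an alert to the three tiers: page, ticket, or notification.
    Field names are illustrative assumptions."""
    if alert["actionable"] and alert["urgent"]:
        return "page"          # wake a human now
    if alert["actionable"]:
        return "ticket"        # response within business hours
    return "notification"      # reviewed during regular triage
```

Forcing every alert definition through a function like this makes the "does this really need to wake someone?" question explicit at creation time.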
If more than 30% of pages result in "no action taken," the page threshold is wrong.
Designing Out Common Page Sources
The Noisy Neighbor Problem
Shared infrastructure where one tenant's behavior affects others generates pages. A single customer sending 100x their normal traffic can degrade the service for everyone.
Engineering response: per-tenant rate limiting, resource isolation, and traffic shaping. The system should handle noisy neighbors automatically rather than requiring an engineer to identify and throttle them manually.
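Per-tenant rate limiting is commonly built as one token bucket per tenant, so one tenant draining their bucket cannot touch anyone else's. A minimal sketch, with illustrative names and in-memory state:

```python
class TenantLimiter:
    """Per-tenant token bucket: a noisy tenant is throttled automatically
    while every other tenant keeps their own, untouched bucket."""

    def __init__(self, rate, burst):
        self.rate = rate       # tokens refilled per second
        self.burst = burst     # maximum bucket size
        self.buckets = {}      # tenant -> (tokens, last_refill_time)

    def allow(self, tenant, now):
        tokens, last = self.buckets.get(tenant, (self.burst, now))
        # Refill proportionally to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[tenant] = (tokens - 1.0, now)
            return True
        self.buckets[tenant] = (tokens, now)
        return False
```

The customer sending 100x traffic simply gets `False` from `allow` and is shaped down, with no engineer in the loop.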
The Batch Job Collision
Two batch jobs run at the same time and compete for the same database resources, creating load spikes, timeouts, and pages.
Engineering response: batch job scheduling with resource awareness. Jobs declare their resource requirements. The scheduler ensures conflicting jobs do not overlap. If a job runs longer than expected, it yields resources rather than competing for them.
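The core of resource-aware scheduling can be sketched greedily: jobs declare a demand, and anything that would push the shared resource past capacity is deferred rather than run concurrently. Job names, the `demand` field, and the capacity units below are all illustrative.

```python
def schedule(jobs, capacity):
    """Greedy resource-aware scheduling sketch: each job declares its
    database-load demand; jobs that would exceed capacity are deferred
    to a later window instead of competing."""
    running, deferred, used = [], [], 0
    for job in sorted(jobs, key=lambda j: -j["demand"]):  # biggest first
        if used + job["demand"] <= capacity:
            running.append(job["name"])
            used += job["demand"]
        else:
            deferred.append(job["name"])
    return running, deferred
```

A real scheduler also needs preemption and time windows, but even this simple admission check eliminates the "two heavy jobs at 2 a.m." collision.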
The Schema Migration Page
A database migration locks a table and causes timeouts for application queries. The on-call engineer must either wait for the migration to complete or intervene manually.
Engineering response: online schema migration tools that avoid long-running locks. Migration plans reviewed and tested against production-scale data before execution. Migrations run during low-traffic periods with automatic abort if they exceed time or lock budgets.
The Certificate Expiration
TLS certificates expire and cause connection failures. This is entirely preventable.
Engineering response: automated certificate rotation with monitoring that alerts 30 days before expiration, not after the certificate has expired.
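The monitoring half of that response is a one-line date comparison against the certificate's expiry timestamp. A sketch, assuming you already have the parsed `notAfter` value as a timezone-aware datetime:

```python
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days

def should_alert(not_after, now=None, threshold_days=30):
    """Alert while there is still time to rotate, not after failure."""
    return days_until_expiry(not_after, now) <= threshold_days
```

Run this daily against every certificate in the fleet and the "expired cert" page category disappears entirely.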
On-Call Burden as a Design Input
I treat on-call burden as a system metric, not just an operational concern. During design reviews, I ask:
- What new failure modes does this design introduce?
- What new alerts are needed?
- What is the expected page frequency from this component?
- Can the on-call engineer diagnose and resolve issues without the original developer?
If a design adds a component that is expected to generate even one page per week, that is a significant cost that should be weighed against the design's benefits.
Runbooks That Actually Help
When a page does fire, the runbook determines whether resolution takes 5 minutes or 50 minutes.
Effective runbook structure:
- What this alert means (one sentence)
- Likely causes (ranked by probability)
- Diagnostic steps (specific commands, not generic advice)
- Resolution steps (for each likely cause)
- Escalation criteria (when to wake up someone else)
The worst runbooks say "investigate and resolve." The best runbooks walk the on-call engineer through a decision tree that covers 90% of scenarios.
I also enforce a rule: every page that takes more than 15 minutes to resolve gets a runbook update as part of the incident follow-up. The runbook is a living document that improves with every incident.
Measuring Progress
Track these metrics to evaluate whether pager fatigue is improving:
- Pages per on-call shift: should trend downward
- Time to acknowledge: if increasing, engineers are becoming desensitized
- Time to resolve: should decrease as automation and runbooks improve
- Pages with no action taken: should be near zero (a high rate means the alerts need tuning)
- Repeat pages (same alert within 24 hours): indicates the root cause was not addressed
Review these metrics monthly with the team. Make pager fatigue reduction an explicit engineering goal with allocated time, not a side project.
Key Takeaways
- Categorize pages by root cause. Fix the top two or three categories for maximum impact.
- Self-healing systems handle known failure modes automatically. Page humans only when automation is insufficient.
- Reduce alert noise through minimum duration thresholds, composite alerts, and tiered severity levels.
- Design out common page sources: noisy neighbors, batch job collisions, schema migrations, and certificate expirations.
- Treat on-call burden as a design input during architecture reviews.
- Every page that takes more than 15 minutes to resolve should result in a runbook update.
Final Thoughts
The goal is not zero pages. The goal is that every page represents a genuinely novel situation that requires human judgment. If the system can predict the failure and the response, it should handle both automatically. The on-call engineer's time is too valuable to spend on problems the system could solve itself.