Implementing Webhooks and Observing Failure Modes
Designing a webhook delivery system with retries, dead letter queues, signature verification, and measured reliability under various failure conditions.
Context
An API platform needed to deliver event notifications to customer endpoints via webhooks. Requirements: at-least-once delivery, configurable retry policies, payload signature verification, and delivery observability. The system needed to handle 50,000 webhook deliveries per hour across 500 registered endpoints.
Related: Refactoring a System Without Breaking Users.
Problem
Webhook delivery is deceptively complex. Customer endpoints go down, respond slowly, return unexpected status codes, or silently drop requests. The sender must handle all of these cases without losing events, overwhelming failing endpoints, or consuming unbounded resources on retries.
Constraints
- Delivery volume: 50,000 events/hour, bursty (event storms during batch operations)
- Registered endpoints: 500, with varying reliability (some are dev environments)
- Retry policy: exponential backoff, up to 8 delivery attempts within a 24-hour window
- Delivery guarantee: at-least-once (duplicates are acceptable, losses are not)
- Latency: webhook should be dispatched within 5 seconds of the triggering event
- Payload signing: HMAC-SHA256 for integrity verification
- Storage: Postgres for delivery state, Redis for the dispatch queue
Design
See also: How I'd Design a Scalable Notification System.
Architecture
Event Source -> Event Table (Postgres)
  -> Dispatcher Worker (polls every 1 second)
  -> Redis Queue (per-endpoint delivery queue)
  -> Delivery Worker Pool (10 workers)
  -> HTTP POST to customer endpoint
       -> Success: mark delivered
       -> Failure: schedule retry with backoff
       -> Max retries exceeded: move to dead letter queue
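The dispatcher's one-second poll can be sketched as a single claim query against the webhook_deliveries table defined in the next section. `FOR UPDATE SKIP LOCKED` lets multiple dispatcher instances poll concurrently without claiming the same rows; the `dispatching` status used here is illustrative (the schema below only defines a `pending` default).

```sql
-- Claim up to 100 due deliveries in one round trip. The partial index
-- on pending rows (defined below) keeps the inner SELECT cheap.
UPDATE webhook_deliveries
SET status = 'dispatching'
WHERE id IN (
  SELECT id
  FROM webhook_deliveries
  WHERE status = 'pending'
    AND (next_attempt_at IS NULL OR next_attempt_at <= now())
  ORDER BY next_attempt_at NULLS FIRST
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
RETURNING id, endpoint_id, event_type, payload;
```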
Schema
CREATE TABLE webhook_endpoints (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL,
secret TEXT NOT NULL, -- HMAC signing key
events TEXT[] NOT NULL, -- subscribed event types
active BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE webhook_deliveries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
attempts INTEGER NOT NULL DEFAULT 0,
next_attempt_at TIMESTAMPTZ,
last_response_code INTEGER,
last_response_body TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
delivered_at TIMESTAMPTZ
);
CREATE INDEX idx_deliveries_pending
ON webhook_deliveries (next_attempt_at)
WHERE status = 'pending';

Payload Signing
import crypto from 'node:crypto';

// Sign the raw request body so receivers can verify integrity and origin
function signPayload(payload: string, secret: string): string {
return crypto
.createHmac('sha256', secret)
.update(payload)
.digest('hex');
}
// Delivery
const body = JSON.stringify(event.payload);
const signature = signPayload(body, endpoint.secret);
await fetch(endpoint.url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'X-Webhook-Signature': `sha256=${signature}`,
'X-Webhook-ID': delivery.id,
'X-Webhook-Timestamp': Date.now().toString(),
},
body,
signal: AbortSignal.timeout(10000), // 10-second timeout
});

Retry Schedule
| Attempt | Delay After Failure | Cumulative Time |
|---|---|---|
| 1 | Immediate | 0 |
| 2 | 30 seconds | 30s |
| 3 | 2 minutes | 2.5min |
| 4 | 10 minutes | 12.5min |
| 5 | 30 minutes | 42.5min |
| 6 | 1 hour | 1h 42min |
| 7 | 4 hours | 5h 42min |
| 8 | 8 hours | 13h 42min |
| Dead letter | N/A | 13h 42min (within the 24-hour budget) |
Formula: roughly delay = min(baseDelay * 2^attempt, maxDelay) with baseDelay = 30s and maxDelay = 8h. The schedule above deviates from the strict curve at the middle steps, which are stretched to give transient outages more time to clear before the long tail of hourly retries begins.
Delivery Worker
async function deliverWebhook(delivery: WebhookDelivery) {
const endpoint = await getEndpoint(delivery.endpoint_id);
if (!endpoint.active) {
await markDelivery(delivery.id, 'skipped');
return;
}
try {
const response = await fetch(endpoint.url, {
method: 'POST',
headers: buildHeaders(delivery, endpoint),
body: JSON.stringify(delivery.payload),
signal: AbortSignal.timeout(10000),
});
if (response.status >= 200 && response.status < 300) {
await markDelivery(delivery.id, 'delivered', {
responseCode: response.status,
deliveredAt: new Date(),
});
} else {
await handleFailure(delivery, response.status, await response.text());
}
} catch (error) {
// Network failures and timeouts land here; code 0 means "no HTTP response"
const message = error instanceof Error ? error.message : String(error);
await handleFailure(delivery, 0, message);
}
}
async function handleFailure(
delivery: WebhookDelivery,
statusCode: number,
responseBody: string,
) {
const nextAttempt = delivery.attempts + 1;
if (nextAttempt > 8) {
await markDelivery(delivery.id, 'dead_letter', {
responseCode: statusCode,
lastResponseBody: responseBody.slice(0, 1000),
});
return;
}
// Delay (seconds) before the next attempt, taken from the retry schedule
// table; nextAttempt is 2-8 here (attempt 1 is the immediate first delivery)
const RETRY_DELAYS = [30, 120, 600, 1800, 3600, 14400, 28800];
const delay = RETRY_DELAYS[Math.max(0, nextAttempt - 2)];
await scheduleRetry(delivery.id, nextAttempt, delay, statusCode, responseBody);
}

Trade-offs
| Design Choice | Benefit | Cost |
|---|---|---|
| At-least-once (vs exactly-once) | Simpler, more reliable | Receivers must handle duplicates |
| 10-second timeout | Prevents slow endpoints from blocking workers | May miss slow-but-valid responses |
| 24-hour retry window | Covers transient outages | Delays failure notification to users |
| Per-endpoint queuing | Failing endpoint does not block others | More complex queue management |
| HMAC signing | Payload integrity and authenticity | Receiver must implement verification |
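Because delivery is at-least-once, receivers must deduplicate. A minimal sketch using the X-Webhook-ID header; in production the seen-ID set would live in a shared store with a TTL rather than in process memory, and `handleOnce` is a hypothetical name.

```typescript
// Receiver-side dedup sketch: process each X-Webhook-ID at most once.
const seenIds = new Set<string>();

function handleOnce(webhookId: string, handler: () => void): boolean {
  if (seenIds.has(webhookId)) return false; // duplicate redelivery; skip
  seenIds.add(webhookId);
  handler();
  return true;
}
```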
Delivery Success Rate by Endpoint Quality
During 30 days of production operation:
| Endpoint Category | Count | First-Attempt Success | Final Success (after retries) | Dead Letter Rate |
|---|---|---|---|---|
| Production (stable) | 350 | 97.2% | 99.8% | 0.2% |
| Staging (intermittent) | 100 | 72.4% | 91.3% | 8.7% |
| Dev (unreliable) | 50 | 41.8% | 65.2% | 34.8% |
Retries recovered 2.6% of deliveries to production endpoints and 18.9% to staging endpoints. The retry mechanism is essential for production reliability.
Failure Modes
Endpoint consistently returning 500: After the final attempt fails (roughly 14 hours after the first), the delivery moves to the dead letter queue. If the endpoint returns 500 for every event, the dead letter queue grows without bound. Mitigation: auto-disable endpoints with a dead letter rate above 50% over 24 hours, and notify the endpoint owner.
Slow endpoint consuming all workers: An endpoint responding in 9.5 seconds (just under the 10-second timeout) ties up a delivery worker for 9.5 seconds per event. With 10 workers and 100 events/minute for that endpoint, the queue grows. Mitigation: per-endpoint concurrency limits (max 2 workers per endpoint) and adaptive timeouts (reduce timeout for consistently slow endpoints).
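The per-endpoint concurrency cap can be sketched as a small in-memory counter keyed by endpoint ID; the limit of 2 comes from the mitigation above, and the function names are illustrative. When acquire returns false, the delivery simply stays queued for a later worker.

```typescript
// Per-endpoint concurrency cap: at most 2 workers per endpoint.
const MAX_PER_ENDPOINT = 2;
const inFlight = new Map<string, number>();

function acquire(endpointId: string): boolean {
  const n = inFlight.get(endpointId) ?? 0;
  if (n >= MAX_PER_ENDPOINT) return false; // endpoint at its limit
  inFlight.set(endpointId, n + 1);
  return true;
}

function release(endpointId: string): void {
  const n = inFlight.get(endpointId) ?? 0;
  inFlight.set(endpointId, Math.max(0, n - 1));
}
```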
Event storm overwhelming the queue: A batch operation generating 10,000 events in 1 second creates 10,000 webhook deliveries simultaneously. Redis queue depth spikes, and workers cannot keep up. Mitigation: event debouncing (aggregate events within a 5-second window) and queue depth alerting.
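The debouncing mitigation can be sketched as coalescing, within one 5-second window, events of the same type for the same endpoint into a single delivery. The Event shape and function name are illustrative; the real implementation would buffer incoming events and flush each window on a timer.

```typescript
type Event = { endpointId: string; type: string; payload: unknown };

// Keep only the latest event per (endpoint, type) within one window.
function coalesce(events: Event[]): Event[] {
  const byKey = new Map<string, Event>();
  for (const e of events) {
    byKey.set(`${e.endpointId}:${e.type}`, e);
  }
  return [...byKey.values()];
}
```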
Replay attacks on signed payloads: An attacker who intercepts a webhook payload can replay it. The X-Webhook-Timestamp header allows receivers to reject payloads older than 5 minutes, but this requires receiver-side implementation. The sender cannot enforce it.
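A receiver-side verification sketch covering both checks: recompute the HMAC over the raw body, compare in constant time, then enforce the 5-minute timestamp window. The function name and tolerance constant are illustrative; only the header names come from the delivery code above.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

const TOLERANCE_MS = 5 * 60 * 1000; // reject payloads older than 5 minutes

function verifyWebhook(
  body: string,
  signatureHeader: string, // X-Webhook-Signature, e.g. "sha256=ab12..."
  timestampHeader: string, // X-Webhook-Timestamp (ms since epoch)
  secret: string,
  now: number = Date.now(),
): boolean {
  const expected = createHmac('sha256', secret).update(body).digest('hex');
  const received = signatureHeader.replace(/^sha256=/, '');
  const a = Buffer.from(received, 'hex');
  const b = Buffer.from(expected, 'hex');
  if (a.length !== b.length) return false; // malformed or truncated signature
  const sigOk = timingSafeEqual(a, b); // constant-time comparison
  const age = now - Number(timestampHeader);
  return sigOk && age >= 0 && age <= TOLERANCE_MS;
}
```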
Database growth from delivery records: At 50,000 deliveries/hour, the webhook_deliveries table grows by 1.2M rows/day. Without pruning, query performance degrades within weeks. Mitigation: partition by month, retain detailed records for 90 days, archive to cold storage.
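A sketch of the partitioning mitigation (partition names and dates are illustrative). Postgres requires the partition key in the primary key, so the id-only primary key above would become (id, created_at), and migrating an existing table requires a copy step.

```sql
-- Declare the table range-partitioned by created_at, with one partition
-- per month (columns as in the schema above).
CREATE TABLE webhook_deliveries (
  -- ... same columns as above, PRIMARY KEY (id, created_at) ...
) PARTITION BY RANGE (created_at);

CREATE TABLE webhook_deliveries_2024_01
  PARTITION OF webhook_deliveries
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Archiving a month becomes a metadata operation instead of mass deletes:
ALTER TABLE webhook_deliveries DETACH PARTITION webhook_deliveries_2024_01;
```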
Scaling Considerations
- At 500,000 deliveries/hour, the single Redis queue becomes a bottleneck. Shard queues by endpoint ID across multiple Redis instances.
- Delivery workers scale horizontally. Add more workers to increase throughput, but respect per-endpoint concurrency limits.
- For global delivery, deploy workers in multiple regions to reduce latency to geographically distributed endpoints.
- Consider using a managed queue (SQS, Cloud Tasks) instead of Redis for delivery dispatch. Managed queues handle visibility timeouts, dead letter routing, and scaling automatically.
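The queue-sharding idea can be sketched as a stable hash of the endpoint ID: the same endpoint always maps to the same Redis instance, which preserves per-endpoint ordering within a single queue. The function name and shard count are illustrative.

```typescript
import { createHash } from 'node:crypto';

// Map an endpoint to one of N Redis shards, stably.
function shardFor(endpointId: string, shardCount: number): number {
  const digest = createHash('sha256').update(endpointId).digest();
  return digest.readUInt32BE(0) % shardCount;
}
```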
Observability
- Delivery latency: time from event creation to successful delivery (p50, p95, p99)
- Success rate: per endpoint, per event type, overall
- Retry distribution: how many deliveries require 1, 2, 3+ retries
- Dead letter rate: per endpoint (for auto-disable decisions)
- Queue depth: per endpoint and overall (for capacity planning)
- Endpoint response time: p50 and p95 per endpoint (for timeout tuning)
Dashboard priority: queue depth and dead letter rate. These are the two metrics that predict problems before they become outages.
Key Takeaways
- At-least-once delivery with idempotency keys is simpler and more reliable than exactly-once. Push the deduplication responsibility to the receiver.
- Exponential backoff over a 24-hour retry window lifted final success rates by 2.6 to 18.9 percentage points over first-attempt rates, depending on endpoint stability.
- Per-endpoint queuing prevents a single failing endpoint from blocking deliveries to healthy endpoints.
- Auto-disable endpoints with consistently high failure rates. Continuing to deliver to a dead endpoint wastes resources and obscures metrics.
- Payload signing with timestamps prevents both tampering and replay attacks (when receivers validate the timestamp).
Further Reading
- Failure Modes I Actively Design For
- Event Tracking System Design for Android Applications
- How I'd Design a Mobile Configuration System at Scale
Final Thoughts
The system delivers 50,000 webhooks/hour with a 99.8% success rate to production endpoints. The primary operational task is reviewing the dead letter queue weekly and contacting endpoint owners whose integrations are failing. The retry mechanism is the most valuable component: without it, the success rate would drop to 97.2%. That 2.6% gap represents 1,300 lost events per hour, which is unacceptable for an integration platform.