Implementing Webhooks and Observing Failure Modes
Designing a webhook delivery system with retries, dead letter queues, signature verification, and measured reliability under various failure conditions.
Context
An API platform needed to deliver event notifications to customer endpoints via webhooks. Requirements: at-least-once delivery, configurable retry policies, payload signature verification, and delivery observability. The system needed to handle 50,000 webhook deliveries per hour across 500 registered endpoints.
Related: Refactoring a System Without Breaking Users.
Problem
Webhook delivery is deceptively complex. Customer endpoints go down, respond slowly, return unexpected status codes, or silently drop requests. The sender must handle all of these cases without losing events, overwhelming failing endpoints, or consuming unbounded resources on retries.
Constraints
- Delivery volume: 50,000 events/hour, bursty (event storms during batch operations)
- Registered endpoints: 500, with varying reliability (some are dev environments)
- Retry policy: exponential backoff, up to 8 delivery attempts within a 24-hour window
- Delivery guarantee: at-least-once (duplicates are acceptable, losses are not)
- Latency: webhook should be dispatched within 5 seconds of the triggering event
- Payload signing: HMAC-SHA256 for integrity verification
- Storage: Postgres for delivery state, Redis for the dispatch queue
Design
See also: How I'd Design a Scalable Notification System.
Architecture
Event Source -> Event Table (Postgres)
  -> Dispatcher Worker (polls every 1 second)
  -> Redis Queue (per-endpoint delivery queue)
  -> Delivery Worker Pool (10 workers)
  -> HTTP POST to customer endpoint
       -> Success: mark delivered
       -> Failure: schedule retry with backoff
       -> Max retries exceeded: move to dead letter queue
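The dispatcher's one-second poll can be sketched as a single claim query against the webhook_deliveries table defined in the next section. `FOR UPDATE SKIP LOCKED` lets multiple dispatcher instances poll concurrently without claiming the same rows; the `dispatching` status used here is illustrative (the schema below only defines a `pending` default).

```sql
-- Claim up to 100 due deliveries in one round trip. The partial index
-- on pending rows (defined below) keeps the inner SELECT cheap.
UPDATE webhook_deliveries
SET status = 'dispatching'
WHERE id IN (
  SELECT id
  FROM webhook_deliveries
  WHERE status = 'pending'
    AND (next_attempt_at IS NULL OR next_attempt_at <= now())
  ORDER BY next_attempt_at NULLS FIRST
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
RETURNING id, endpoint_id, event_type, payload;
```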
Schema
CREATE TABLE webhook_endpoints (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
url TEXT NOT NULL,
secret TEXT NOT NULL, -- HMAC signing key
events TEXT[] NOT NULL, -- subscribed event types
active BOOLEAN NOT NULL DEFAULT true,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE webhook_deliveries (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
event_type TEXT NOT NULL,
payload JSONB NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
attempts INTEGER NOT NULL DEFAULT 0,
next_attempt_at TIMESTAMPTZ,
last_response_code INTEGER,
last_response_body TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
delivered_at TIMESTAMPTZ
);
CREATE INDEX idx_deliveries_pending
ON webhook_deliveries (next_attempt_at)
WHERE status = 'pending';

Payload Signing
import crypto from 'node:crypto';

// Sign the raw request body so receivers can verify integrity and origin
function signPayload(payload: string, secret: string): string {
return crypto
.createHmac('sha256', secret)
.update(payload)
.digest('hex');
}
// Delivery
const body = JSON.stringify(event.payload);
const signature = signPayload(body, endpoint.secret);
await fetch(endpoint.url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'X-Webhook-Signature': `sha256=${signature}`,
'X-Webhook-ID': delivery.id,
'X-Webhook-Timestamp': Date.now().toString(),
},
body,
signal: AbortSignal.timeout(10000), // 10-second timeout
});

Retry Schedule
| Attempt | Delay After Failure | Cumulative Time |
|---|---|---|
| 1 | Immediate | 0 |
| 2 | 30 seconds | 30s |
| 3 | 2 minutes | 2.5min |
| 4 | 10 minutes | 12.5min |
| 5 | 30 minutes | 42.5min |
| 6 | 1 hour | 1h 42min |
| 7 | 4 hours | 5h 42min |
| 8 | 8 hours | 13h 42min |
| Dead letter | N/A | 13h 42min (within the 24-hour budget) |
Formula: roughly delay = min(baseDelay * 2^attempt, maxDelay) with baseDelay = 30s and maxDelay = 8h. The schedule above deviates from the strict curve at the middle steps, which are stretched to give transient outages more time to clear before the long tail of hourly retries begins.
Delivery Worker
async function deliverWebhook(delivery: WebhookDelivery) {
const endpoint = await getEndpoint(delivery.endpoint_id);
if (!endpoint.active) {
await markDelivery(delivery.id, 'skipped');
return;
}
try {
const response = await fetch(endpoint.url, {
method: 'POST',
headers: buildHeaders(delivery, endpoint),
body: JSON.stringify(delivery.payload),
signal: AbortSignal.timeout(10000),
});
if (response.status >= 200 && response.status < 300) {
await markDelivery(delivery.id, 'delivered', {
responseCode: response.status,
deliveredAt: new Date(),
});
} else {
await handleFailure(delivery, response.status, await response.text());
}
} catch (error) {
// Network failures and timeouts land here; code 0 means "no HTTP response"
const message = error instanceof Error ? error.message : String(error);
await handleFailure(delivery, 0, message);
}
}
async function handleFailure(
delivery: WebhookDelivery,
statusCode: number,
responseBody: string,
) {
const nextAttempt = delivery.attempts + 1;
if (nextAttempt > 8) {
await markDelivery(delivery.id, 'dead_letter', {
responseCode: statusCode,
lastResponseBody: responseBody.slice(0, 1000),
});
return;
}
// Delay (seconds) before the next attempt, taken from the retry schedule
// table; nextAttempt is 2-8 here (attempt 1 is the immediate first delivery)
const RETRY_DELAYS = [30, 120, 600, 1800, 3600, 14400, 28800];
const delay = RETRY_DELAYS[Math.max(0, nextAttempt - 2)];
await scheduleRetry(delivery.id, nextAttempt, delay, statusCode, responseBody);
}

Trade-offs
| Design Choice | Benefit | Cost |
|---|---|---|
| At-least-once (vs exactly-once) | Simpler, more reliable | Receivers must handle duplicates |
| 10-second timeout | Prevents slow endpoints from blocking workers | May miss slow-but-valid responses |
| 24-hour retry window | Covers transient outages | Delays failure notification to users |
| Per-endpoint queuing | Failing endpoint does not block others | More complex queue management |
| HMAC signing | Payload integrity and authenticity | Receiver must implement verification |
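Because delivery is at-least-once, receivers must deduplicate. A minimal sketch using the X-Webhook-ID header; in production the seen-ID set would live in a shared store with a TTL rather than in process memory, and `handleOnce` is a hypothetical name.

```typescript
// Receiver-side dedup sketch: process each X-Webhook-ID at most once.
const seenIds = new Set<string>();

function handleOnce(webhookId: string, handler: () => void): boolean {
  if (seenIds.has(webhookId)) return false; // duplicate redelivery; skip
  seenIds.add(webhookId);
  handler();
  return true;
}
```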
Delivery Success Rate by Endpoint Quality
During 30 days of production operation:
| Endpoint Category | Count | First-Attempt Success | Final Success (after retries) | Dead Letter Rate |
|---|---|---|---|---|
| Production (stable) | 350 | 97.2% | 99.8% | 0.2% |
| Staging (intermittent) | 100 | 72.4% | 91.3% | 8.7% |
| Dev (unreliable) | 50 | 41.8% | 65.2% | 34.8% |
Retries recovered 2.6% of deliveries to production endpoints and 18.9% to staging endpoints. The retry mechanism is essential for production reliability.
Failure Modes
Endpoint consistently returning 500: After the final attempt fails (roughly 14 hours after the first), the delivery moves to the dead letter queue. If the endpoint returns 500 for every event, the dead letter queue grows without bound. Mitigation: auto-disable endpoints with a dead letter rate above 50% over 24 hours, and notify the endpoint owner.
Slow endpoint consuming all workers: An endpoint responding in 9.5 seconds (just under the 10-second timeout) ties up a delivery worker for 9.5 seconds per event. With 10 workers and 100 events/minute for that endpoint, the queue grows. Mitigation: per-endpoint concurrency limits (max 2 workers per endpoint) and adaptive timeouts (reduce timeout for consistently slow endpoints).
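The per-endpoint concurrency cap can be sketched as a small in-memory counter keyed by endpoint ID; the limit of 2 comes from the mitigation above, and the function names are illustrative. When acquire returns false, the delivery simply stays queued for a later worker.

```typescript
// Per-endpoint concurrency cap: at most 2 workers per endpoint.
const MAX_PER_ENDPOINT = 2;
const inFlight = new Map<string, number>();

function acquire(endpointId: string): boolean {
  const n = inFlight.get(endpointId) ?? 0;
  if (n >= MAX_PER_ENDPOINT) return false; // endpoint at its limit
  inFlight.set(endpointId, n + 1);
  return true;
}

function release(endpointId: string): void {
  const n = inFlight.get(endpointId) ?? 0;
  inFlight.set(endpointId, Math.max(0, n - 1));
}
```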
Event storm overwhelming the queue: A batch operation generating 10,000 events in 1 second creates 10,000 webhook deliveries simultaneously. Redis queue depth spikes, and workers cannot keep up. Mitigation: event debouncing (aggregate events within a 5-second window) and queue depth alerting.
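The debouncing mitigation can be sketched as coalescing, within one 5-second window, events of the same type for the same endpoint into a single delivery. The Event shape and function name are illustrative; the real implementation would buffer incoming events and flush each window on a timer.

```typescript
type Event = { endpointId: string; type: string; payload: unknown };

// Keep only the latest event per (endpoint, type) within one window.
function coalesce(events: Event[]): Event[] {
  const byKey = new Map<string, Event>();
  for (const e of events) {
    byKey.set(`${e.endpointId}:${e.type}`, e);
  }
  return [...byKey.values()];
}
```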
Replay attacks on signed payloads: An attacker who intercepts a webhook payload can replay it. The X-Webhook-Timestamp header allows receivers to reject payloads older than 5 minutes, but this requires receiver-side implementation. The sender cannot enforce it.
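A receiver-side verification sketch covering both checks: recompute the HMAC over the raw body, compare in constant time, then enforce the 5-minute timestamp window. The function name and tolerance constant are illustrative; only the header names come from the delivery code above.

```typescript
import { createHmac, timingSafeEqual } from 'node:crypto';

const TOLERANCE_MS = 5 * 60 * 1000; // reject payloads older than 5 minutes

function verifyWebhook(
  body: string,
  signatureHeader: string, // X-Webhook-Signature, e.g. "sha256=ab12..."
  timestampHeader: string, // X-Webhook-Timestamp (ms since epoch)
  secret: string,
  now: number = Date.now(),
): boolean {
  const expected = createHmac('sha256', secret).update(body).digest('hex');
  const received = signatureHeader.replace(/^sha256=/, '');
  const a = Buffer.from(received, 'hex');
  const b = Buffer.from(expected, 'hex');
  if (a.length !== b.length) return false; // malformed or truncated signature
  const sigOk = timingSafeEqual(a, b); // constant-time comparison
  const age = now - Number(timestampHeader);
  return sigOk && age >= 0 && age <= TOLERANCE_MS;
}
```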
Database growth from delivery records: At 50,000 deliveries/hour, the webhook_deliveries table grows by 1.2M rows/day. Without pruning, query performance degrades within weeks. Mitigation: partition by month, retain detailed records for 90 days, archive to cold storage.
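A sketch of the partitioning mitigation (partition names and dates are illustrative). Postgres requires the partition key in the primary key, so the id-only primary key above would become (id, created_at), and migrating an existing table requires a copy step.

```sql
-- Declare the table range-partitioned by created_at, with one partition
-- per month (columns as in the schema above).
CREATE TABLE webhook_deliveries (
  -- ... same columns as above, PRIMARY KEY (id, created_at) ...
) PARTITION BY RANGE (created_at);

CREATE TABLE webhook_deliveries_2024_01
  PARTITION OF webhook_deliveries
  FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Archiving a month becomes a metadata operation instead of mass deletes:
ALTER TABLE webhook_deliveries DETACH PARTITION webhook_deliveries_2024_01;
```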
Scaling Considerations
- At 500,000 deliveries/hour, the single Redis queue becomes a bottleneck. Shard queues by endpoint ID across multiple Redis instances.
- Delivery workers scale horizontally. Add more workers to increase throughput, but respect per-endpoint concurrency limits.
- For global delivery, deploy workers in multiple regions to reduce latency to geographically distributed endpoints.
- Consider using a managed queue (SQS, Cloud Tasks) instead of Redis for delivery dispatch. Managed queues handle visibility timeouts, dead letter routing, and scaling automatically.
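The queue-sharding idea can be sketched as a stable hash of the endpoint ID: the same endpoint always maps to the same Redis instance, which preserves per-endpoint ordering within a single queue. The function name and shard count are illustrative.

```typescript
import { createHash } from 'node:crypto';

// Map an endpoint to one of N Redis shards, stably.
function shardFor(endpointId: string, shardCount: number): number {
  const digest = createHash('sha256').update(endpointId).digest();
  return digest.readUInt32BE(0) % shardCount;
}
```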
Observability
- Delivery latency: time from event creation to successful delivery (p50, p95, p99)
- Success rate: per endpoint, per event type, overall
- Retry distribution: how many deliveries require 1, 2, 3+ retries
- Dead letter rate: per endpoint (for auto-disable decisions)
- Queue depth: per endpoint and overall (for capacity planning)
- Endpoint response time: p50 and p95 per endpoint (for timeout tuning)
Dashboard priority: queue depth and dead letter rate. These are the two metrics that predict problems before they become outages.
Key Takeaways
- At-least-once delivery with idempotency keys is simpler and more reliable than exactly-once. Push the deduplication responsibility to the receiver.
- Exponential backoff over a 24-hour retry window lifted final success rates by 2.6 to 18.9 percentage points over first-attempt rates, depending on endpoint stability.
- Per-endpoint queuing prevents a single failing endpoint from blocking deliveries to healthy endpoints.
- Auto-disable endpoints with consistently high failure rates. Continuing to deliver to a dead endpoint wastes resources and obscures metrics.
- Payload signing with timestamps prevents both tampering and replay attacks (when receivers validate the timestamp).
Further Reading
- Failure Modes I Actively Design For
- Event Tracking System Design for Android Applications
- How I'd Design a Mobile Configuration System at Scale
Final Thoughts
The system delivers 50,000 webhooks/hour with a 99.8% success rate to production endpoints. The primary operational task is reviewing the dead letter queue weekly and contacting endpoint owners whose integrations are failing. The retry mechanism is the most valuable component: without it, the success rate would drop to 97.2%. That 2.6% gap represents 1,300 lost events per hour, which is unacceptable for an integration platform.