Implementing Webhooks and Observing Failure Modes

Dhruval Dhameliya·July 20, 2025·8 min read

Designing a webhook delivery system with retries, dead letter queues, signature verification, and measured reliability under various failure conditions.

Context

An API platform needed to deliver event notifications to customer endpoints via webhooks. Requirements: at-least-once delivery, configurable retry policies, payload signature verification, and delivery observability. The system needed to handle 50,000 webhook deliveries per hour across 500 registered endpoints.


Problem

Webhook delivery is deceptively complex. Customer endpoints go down, respond slowly, return unexpected status codes, or silently drop requests. The sender must handle all of these cases without losing events, overwhelming failing endpoints, or consuming unbounded resources on retries.

Constraints

  • Delivery volume: 50,000 events/hour, bursty (event storms during batch operations)
  • Registered endpoints: 500, with varying reliability (some are dev environments)
  • Retry policy: exponential backoff, max 8 retries over 24 hours
  • Delivery guarantee: at-least-once (duplicates are acceptable, losses are not)
  • Latency: webhook should be dispatched within 5 seconds of the triggering event
  • Payload signing: HMAC-SHA256 for integrity verification
  • Storage: Postgres for delivery state, Redis for the dispatch queue

Design


Architecture

Event Source -> Event Table (Postgres)
  -> Dispatcher Worker (polls every 1 second)
    -> Redis Queue (per-endpoint delivery queue)
      -> Delivery Worker Pool (10 workers)
        -> HTTP POST to customer endpoint
          -> Success: mark delivered
          -> Failure: schedule retry with backoff
          -> Max retries exceeded: move to dead letter queue
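
The dispatcher's routing step can be reduced to a pure function that groups claimed deliveries into per-endpoint queues. This is a sketch, not the system's actual code: the `claimPendingDeliveries` helper, Redis client, and queue key format are illustrative stand-ins.

```typescript
interface PendingDelivery {
  id: string;
  endpoint_id: string;
}

// Pure routing step: group claimed deliveries into per-endpoint queues so a
// slow endpoint's backlog stays isolated from everyone else's.
function groupByEndpoint(deliveries: PendingDelivery[]): Map<string, string[]> {
  const queues = new Map<string, string[]>();
  for (const d of deliveries) {
    const key = `webhook:queue:${d.endpoint_id}`; // illustrative key format
    const ids = queues.get(key) ?? [];
    ids.push(d.id);
    queues.set(key, ids);
  }
  return queues;
}

// Dispatcher loop (illustrative): claim pending rows from Postgres, push
// delivery IDs onto per-endpoint Redis lists, repeat every second.
// async function dispatchLoop() {
//   const pending = await claimPendingDeliveries(1000); // SELECT ... FOR UPDATE SKIP LOCKED
//   for (const [queueKey, ids] of groupByEndpoint(pending)) {
//     await redis.lpush(queueKey, ...ids);
//   }
// }
```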

Schema

CREATE TABLE webhook_endpoints (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  url TEXT NOT NULL,
  secret TEXT NOT NULL, -- HMAC signing key
  events TEXT[] NOT NULL, -- subscribed event types
  active BOOLEAN NOT NULL DEFAULT true,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
CREATE TABLE webhook_deliveries (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  endpoint_id UUID NOT NULL REFERENCES webhook_endpoints(id),
  event_type TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  attempts INTEGER NOT NULL DEFAULT 0,
  next_attempt_at TIMESTAMPTZ,
  last_response_code INTEGER,
  last_response_body TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  delivered_at TIMESTAMPTZ
);
 
CREATE INDEX idx_deliveries_pending
  ON webhook_deliveries (next_attempt_at)
  WHERE status = 'pending';

Payload Signing

import crypto from 'node:crypto';

function signPayload(payload: string, secret: string): string {
  return crypto
    .createHmac('sha256', secret)
    .update(payload)
    .digest('hex');
}
 
// Delivery
const body = JSON.stringify(event.payload);
const signature = signPayload(body, endpoint.secret);
 
await fetch(endpoint.url, {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'X-Webhook-Signature': `sha256=${signature}`,
    'X-Webhook-ID': delivery.id,
    'X-Webhook-Timestamp': Date.now().toString(),
  },
  body,
  signal: AbortSignal.timeout(10000), // 10-second timeout
});
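
On the receiving side, verification is the mirror image: recompute the HMAC over the raw body and compare in constant time. The following is a sketch of what a receiver might implement; the 5-minute tolerance default and function shape are suggestions, not part of the system's spec. (Signing the timestamp together with the body would be stronger still; this system signs the body only.)

```typescript
import crypto from 'node:crypto';

// Receiver-side counterpart to signPayload: recompute the HMAC and compare in
// constant time, rejecting stale timestamps to blunt replay attacks.
function verifyWebhook(
  rawBody: string,
  signatureHeader: string, // e.g. "sha256=ab12..."
  timestampHeader: string, // milliseconds since epoch, as sent
  secret: string,
  toleranceMs = 5 * 60 * 1000, // suggested tolerance, not from the spec
  now = Date.now(),
): boolean {
  const age = now - Number(timestampHeader);
  if (!Number.isFinite(age) || age > toleranceMs || age < -toleranceMs) {
    return false; // too old, too far in the future, or unparseable
  }
  const expected = crypto
    .createHmac('sha256', secret)
    .update(rawBody)
    .digest('hex');
  const received = signatureHeader.replace(/^sha256=/, '');
  const a = Buffer.from(received, 'hex');
  const b = Buffer.from(expected, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first
  if (a.length !== b.length) return false;
  return crypto.timingSafeEqual(a, b);
}
```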

Retry Schedule

Attempt      Delay After Failure   Cumulative Time
1            Immediate             0
2            30 seconds            30s
3            2 minutes             2.5min
4            10 minutes            12.5min
5            30 minutes            42.5min
6            1 hour                1h 42min
7            4 hours               5h 42min
8            8 hours               13h 42min
Dead letter  N/A                   ~13h 42min (within the 24-hour window)

The schedule approximates exponential backoff from a 30-second base up to an 8-hour cap, but the middle steps are hand-tuned rather than pure doubling, so it is stored as a fixed lookup table instead of being computed from a formula. The full schedule completes in roughly 13.7 hours, comfortably inside the 24-hour retry window.
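
The schedule from the table can be encoded directly; a fixed table is easier to audit than a formula. A minimal sketch (function name and the null-means-dead-letter convention are illustrative):

```typescript
// Delays (seconds) before attempts 2 through 8, from the retry schedule table.
const RETRY_DELAYS_SEC = [30, 120, 600, 1800, 3600, 14400, 28800];

// Delay to wait before the given attempt number (2-8). Returns null once the
// schedule is exhausted, signalling dead-letter.
function retryDelaySec(attempt: number): number | null {
  if (attempt < 2 || attempt > RETRY_DELAYS_SEC.length + 1) return null;
  return RETRY_DELAYS_SEC[attempt - 2];
}
```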

Delivery Worker

async function deliverWebhook(delivery: WebhookDelivery) {
  const endpoint = await getEndpoint(delivery.endpoint_id);
  if (!endpoint.active) {
    await markDelivery(delivery.id, 'skipped');
    return;
  }
 
  try {
    const response = await fetch(endpoint.url, {
      method: 'POST',
      headers: buildHeaders(delivery, endpoint),
      body: JSON.stringify(delivery.payload),
      signal: AbortSignal.timeout(10000),
    });
 
    if (response.status >= 200 && response.status < 300) {
      await markDelivery(delivery.id, 'delivered', {
        responseCode: response.status,
        deliveredAt: new Date(),
      });
    } else {
      await handleFailure(delivery, response.status, await response.text());
    }
  } catch (error) {
    // In strict TypeScript, a caught value is `unknown`, not `Error`
    const message = error instanceof Error ? error.message : String(error);
    await handleFailure(delivery, 0, message);
  }
}
 
async function handleFailure(
  delivery: WebhookDelivery,
  statusCode: number,
  responseBody: string,
) {
  const nextAttempt = delivery.attempts + 1;
  if (nextAttempt >= 8) {
    // Attempt 8 has failed: the retry schedule is exhausted
    await markDelivery(delivery.id, 'dead_letter', {
      responseCode: statusCode,
      lastResponseBody: responseBody.slice(0, 1000),
    });
    return;
  }

  // Delay before the next attempt, from the retry schedule table (seconds)
  const RETRY_DELAYS_SEC = [30, 120, 600, 1800, 3600, 14400, 28800];
  const delay = RETRY_DELAYS_SEC[nextAttempt - 1];
  await scheduleRetry(delivery.id, nextAttempt, delay, statusCode, responseBody);
}

Trade-offs

Design Choice                      Benefit                                         Cost
At-least-once (vs exactly-once)    Simpler, more reliable                          Receivers must handle duplicates
10-second timeout                  Prevents slow endpoints from blocking workers   May miss slow-but-valid responses
24-hour retry window               Covers transient outages                        Delays failure notification to users
Per-endpoint queuing               Failing endpoint does not block others          More complex queue management
HMAC signing                       Payload integrity and authenticity              Receiver must implement verification

Delivery Success Rate by Endpoint Quality

During 30 days of production operation:

Endpoint Category        Count   First-Attempt Success   Final Success (after retries)   Dead Letter Rate
Production (stable)      350     97.2%                   99.8%                           0.2%
Staging (intermittent)   100     72.4%                   91.3%                           8.7%
Dev (unreliable)         50      41.8%                   65.2%                           34.8%

Retries recovered 2.6% of deliveries to production endpoints and 18.9% to staging endpoints. The retry mechanism is essential for production reliability.

Failure Modes

Endpoint consistently returning 500: After 8 retries over 24 hours, the delivery moves to the dead letter queue. If the endpoint returns 500 for all events, the dead letter queue grows unboundedly. Mitigation: auto-disable endpoints with a dead letter rate above 50% over 24 hours, and notify the endpoint owner.
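
The auto-disable decision reduces to a small pure function over per-endpoint delivery stats. A sketch (the stats shape and function name are illustrative; the 50% threshold is the one described above):

```typescript
// Trailing-24-hour delivery outcomes for one endpoint (illustrative shape).
interface EndpointStats {
  delivered: number;
  deadLettered: number;
}

// Disable an endpoint whose dead-letter rate exceeds the threshold.
function shouldDisable(stats: EndpointStats, threshold = 0.5): boolean {
  const total = stats.delivered + stats.deadLettered;
  if (total === 0) return false; // no traffic, nothing to judge
  return stats.deadLettered / total > threshold;
}
```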

Slow endpoint consuming all workers: An endpoint responding in 9.5 seconds (just under the 10-second timeout) ties up a delivery worker for 9.5 seconds per event. With 10 workers and 100 events/minute for that endpoint, the queue grows. Mitigation: per-endpoint concurrency limits (max 2 workers per endpoint) and adaptive timeouts (reduce timeout for consistently slow endpoints).
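
The per-endpoint concurrency cap can be sketched as an in-memory counter: a worker that cannot acquire a slot requeues the delivery instead of blocking. Class and method names here are illustrative, assuming a single-process worker pool.

```typescript
// Caps in-flight deliveries per endpoint (max 2, per the mitigation above).
class EndpointLimiter {
  private inFlight = new Map<string, number>();
  constructor(private readonly maxPerEndpoint = 2) {}

  // Returns false when the endpoint is at its cap; the caller should requeue.
  tryAcquire(endpointId: string): boolean {
    const n = this.inFlight.get(endpointId) ?? 0;
    if (n >= this.maxPerEndpoint) return false;
    this.inFlight.set(endpointId, n + 1);
    return true;
  }

  // Must be called after every delivery attempt, success or failure.
  release(endpointId: string): void {
    const n = this.inFlight.get(endpointId) ?? 0;
    this.inFlight.set(endpointId, Math.max(0, n - 1));
  }
}
```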

Event storm overwhelming the queue: A batch operation generating 10,000 events in 1 second creates 10,000 webhook deliveries simultaneously. Redis queue depth spikes, and workers cannot keep up. Mitigation: event debouncing (aggregate events within a 5-second window) and queue depth alerting.
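
Debouncing can be sketched as a keyed buffer that a timer flushes once per window. The flush is invoked explicitly here to keep the sketch self-contained; in the real system a 5-second interval timer would drive it. Names and the (endpoint, event type) key are illustrative.

```typescript
interface InboundEvent {
  endpointId: string;
  eventType: string;
  payload: unknown;
}

// Coalesces events arriving within one window into a single batch per
// (endpoint, event type), so a 10,000-event storm yields few deliveries.
class Debouncer {
  private buckets = new Map<string, InboundEvent[]>();

  add(event: InboundEvent): void {
    const key = `${event.endpointId}:${event.eventType}`;
    const bucket = this.buckets.get(key) ?? [];
    bucket.push(event);
    this.buckets.set(key, bucket);
  }

  // Called once per aggregation window (e.g. every 5 seconds) by a timer.
  flush(): { key: string; events: InboundEvent[] }[] {
    const out = [...this.buckets.entries()].map(([key, events]) => ({ key, events }));
    this.buckets.clear();
    return out;
  }
}
```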

Replay attacks on signed payloads: An attacker who intercepts a webhook payload can replay it. The X-Webhook-Timestamp header allows receivers to reject payloads older than 5 minutes, but this requires receiver-side implementation. The sender cannot enforce it.

Database growth from delivery records: At 50,000 deliveries/hour, the webhook_deliveries table grows by 1.2M rows/day. Without pruning, query performance degrades within weeks. Mitigation: partition by month, retain detailed records for 90 days, archive to cold storage.

Scaling Considerations

  • At 500,000 deliveries/hour, the single Redis queue becomes a bottleneck. Shard queues by endpoint ID across multiple Redis instances.
  • Delivery workers scale horizontally. Add more workers to increase throughput, but respect per-endpoint concurrency limits.
  • For global delivery, deploy workers in multiple regions to reduce latency to geographically distributed endpoints.
  • Consider using a managed queue (SQS, Cloud Tasks) instead of Redis for delivery dispatch. Managed queues handle visibility timeouts, dead letter routing, and scaling automatically.

Observability

  • Delivery latency: time from event creation to successful delivery (p50, p95, p99)
  • Success rate: per endpoint, per event type, overall
  • Retry distribution: how many deliveries require 1, 2, 3+ retries
  • Dead letter rate: per endpoint (for auto-disable decisions)
  • Queue depth: per endpoint and overall (for capacity planning)
  • Endpoint response time: p50 and p95 per endpoint (for timeout tuning)

Dashboard priority: queue depth and dead letter rate. These are the two metrics that predict problems before they become outages.

Key Takeaways

  • At-least-once delivery with idempotency keys is simpler and more reliable than exactly-once. Push the deduplication responsibility to the receiver.
  • Exponential backoff with a 24-hour retry window recovers 2-19% of failed deliveries depending on endpoint stability.
  • Per-endpoint queuing prevents a single failing endpoint from blocking deliveries to healthy endpoints.
  • Auto-disable endpoints with consistently high failure rates. Continuing to deliver to a dead endpoint wastes resources and obscures metrics.
  • Payload signing with timestamps prevents both tampering and replay attacks (when receivers validate the timestamp).

Final Thoughts

The system delivers 50,000 webhooks/hour with a 99.8% success rate to production endpoints. The primary operational task is reviewing the dead letter queue weekly and contacting endpoint owners whose integrations are failing. The retry mechanism is the most valuable component: without it, the success rate would drop to 97.2%. That 2.6% gap represents 1,300 lost events per hour, which is unacceptable for an integration platform.
