Experimenting With Background Workers at Scale

Dhruval Dhameliya·June 5, 2025·8 min read

Testing job queue architectures with BullMQ, Postgres-based queues, and SQS under increasing job volumes, with failure handling and scaling measurements.

Context

A platform processing 500,000 background jobs per day needed a reliable job queue. Jobs ranged from sending emails (low priority, high volume) to processing payments (high priority, low tolerance for failure). I tested three queue implementations to find the right architecture for each job category.

Problem

Background job processing introduces three categories of problems: delivery guarantees (at-least-once vs exactly-once), ordering (FIFO vs priority), and failure handling (retries, dead letters, timeouts). Different queue backends make different trade-offs on each axis.

Constraints

  • Job volume: 500,000 jobs/day, with peaks of 2,000 jobs/minute
  • Job types: email (60%), webhook delivery (25%), payment processing (10%), report generation (5%)
  • Processing latency: emails within 60 seconds, payments within 5 seconds, reports within 5 minutes
  • Failure tolerance: payments must not be lost or double-processed; emails can tolerate occasional duplicates
  • Infrastructure: existing Redis and Postgres; AWS account available for SQS
  • Worker count: 10 workers across 5 instances

Design

Queue Implementation 1: BullMQ (Redis-backed)

import { Queue, Worker } from 'bullmq';
 
const emailQueue = new Queue('emails', {
  connection: { host: 'redis', port: 6379 },
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: { age: 86400 }, // 24 hours
    removeOnFail: { age: 604800 },    // 7 days
  },
});
 
const emailWorker = new Worker('emails', async (job) => {
  await sendEmail(job.data.to, job.data.subject, job.data.body);
}, {
  connection: { host: 'redis', port: 6379 },
  concurrency: 5,
  limiter: { max: 100, duration: 60000 }, // 100 emails/minute rate limit
});

Queue Implementation 2: Postgres-based (SKIP LOCKED)

CREATE TABLE job_queue (
  id BIGSERIAL PRIMARY KEY,
  queue_name TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  priority INTEGER NOT NULL DEFAULT 0,
  attempts INTEGER NOT NULL DEFAULT 0,
  max_attempts INTEGER NOT NULL DEFAULT 3,
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_until TIMESTAMPTZ,
  locked_by TEXT,
  completed_at TIMESTAMPTZ,
  failed_at TIMESTAMPTZ,
  error_message TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
CREATE INDEX idx_job_queue_pending
  ON job_queue (queue_name, priority DESC, scheduled_at)
  WHERE status = 'pending';

Worker polling query:

UPDATE job_queue
SET status = 'processing',
    locked_until = now() + interval '5 minutes',
    locked_by = $1,
    attempts = attempts + 1
WHERE id = (
  SELECT id FROM job_queue
  WHERE queue_name = $2
    AND status = 'pending'
    AND scheduled_at <= now()
  ORDER BY priority DESC, scheduled_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;

SKIP LOCKED ensures multiple workers do not contend on the same row.
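
The claim query above locks a job and increments its attempt counter, but the article does not show the other half of the lifecycle: marking the job completed, or rescheduling it with backoff until max_attempts is exhausted. A minimal sketch, assuming the job_queue schema above; the backoff policy (5s doubling per attempt, capped at 10 minutes) is an illustrative choice, not something the benchmarks prescribe:

```typescript
// Completion/retry side of the Postgres queue. Table and column names
// match the DDL above; the backoff curve is an assumed policy.

// Delay before the next retry, in seconds: 5s * 2^attempts, capped at 600s.
function backoffSeconds(attempts: number): number {
  return Math.min(5 * 2 ** attempts, 600);
}

// Mark a job as done and release its lock.
const completeSql = `
  UPDATE job_queue
  SET status = 'completed', completed_at = now(), locked_until = NULL
  WHERE id = $1`;

// On failure: reschedule with backoff, or dead-letter once attempts
// (already incremented by the claim query) reach max_attempts.
const failSql = `
  UPDATE job_queue
  SET status = CASE WHEN attempts >= max_attempts THEN 'failed' ELSE 'pending' END,
      failed_at = CASE WHEN attempts >= max_attempts THEN now() ELSE NULL END,
      scheduled_at = now() + make_interval(secs => $2),
      error_message = $3,
      locked_until = NULL
  WHERE id = $1`;
```

Because the claim query already incremented attempts, the failure path only has to compare against max_attempts; jobs that exceed it keep their row (status 'failed') as a built-in dead-letter record.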

Queue Implementation 3: AWS SQS

import { SQSClient, SendMessageCommand, ReceiveMessageCommand, DeleteMessageCommand } from '@aws-sdk/client-sqs';
 
const sqs = new SQSClient({ region: 'us-east-1' });
 
// Enqueue
await sqs.send(new SendMessageCommand({
  QueueUrl: QUEUE_URL,
  MessageBody: JSON.stringify(jobData),
  MessageGroupId: jobData.type,       // FIFO queues only
  MessageDeduplicationId: jobData.id, // FIFO queues only
}));
 
// Worker
async function pollAndProcess() {
  const response = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20, // Long polling
    VisibilityTimeout: 300,
  }));
 
  for (const message of response.Messages || []) {
    try {
      await processJob(JSON.parse(message.Body));
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle,
      }));
    } catch (error) {
      // Do not delete: the message becomes visible again after VisibilityTimeout
    }
  }
}

Trade-offs

Throughput Benchmarks

| Metric | BullMQ (Redis) | Postgres Queue | SQS |
|---|---|---|---|
| Enqueue rate (sustained) | 15,000 jobs/s | 3,000 jobs/s | 1,000 jobs/s (per queue) |
| Dequeue rate (10 workers) | 8,000 jobs/s | 1,500 jobs/s | 800 jobs/s |
| Enqueue latency (p50) | 0.5ms | 2ms | 15ms |
| Enqueue latency (p95) | 1.2ms | 8ms | 45ms |
| End-to-end latency (p50) | 12ms | 50ms | 200ms |
| End-to-end latency (p95) | 45ms | 200ms | 800ms |

Feature Comparison

| Feature | BullMQ | Postgres Queue | SQS |
|---|---|---|---|
| Priority queues | Yes | Yes (ORDER BY priority) | No (separate queues) |
| Delayed jobs | Yes | Yes (scheduled_at) | Yes (DelaySeconds) |
| FIFO ordering | Optional | Yes (natural) | Yes (FIFO queues) |
| Rate limiting | Built-in | Manual | No (manual) |
| Job progress tracking | Yes | Manual | No |
| Dead letter queue | Manual | Manual | Built-in |
| Persistence | Redis persistence (AOF/RDB) | Full ACID | Fully managed |
| Exactly-once delivery | No (at-least-once) | Yes (with transactions) | Yes (FIFO + dedup) |
| Operational overhead | Medium (Redis) | Low (existing DB) | Low (managed) |

Cost at 500,000 Jobs/Day

| Queue | Infrastructure Cost | Notes |
|---|---|---|
| BullMQ | $15/month (Redis instance) | Existing Redis may suffice |
| Postgres | $0 (existing DB) | Uses DB capacity |
| SQS | $0.20/month | First 1M requests free |

Recommended Queue per Job Type

| Job Type | Best Queue | Reason |
|---|---|---|
| Email sending | BullMQ | High volume, rate limiting needed, duplicates acceptable |
| Webhook delivery | BullMQ | Retry with backoff, per-endpoint concurrency control |
| Payment processing | Postgres | Exactly-once with transactions, ACID guarantees |
| Report generation | SQS | Long-running, managed dead letters, no Redis dependency |

Failure Modes

BullMQ: Redis data loss: If Redis is configured without persistence (or with RDB-only snapshots), a crash loses all queued jobs since the last snapshot. For email jobs, this is tolerable. For payment jobs, it is not. Mitigation: use AOF persistence with appendfsync everysec, or do not use Redis for critical jobs.

Postgres queue: table bloat: High job throughput creates and deletes millions of rows. Without aggressive vacuuming, the table bloats, and the index on pending jobs becomes inefficient. Mitigation: partition by date, run aggressive autovacuum settings on the queue table, and archive completed jobs to a separate table.

SQS: visibility timeout misconfiguration: If a job takes longer than the visibility timeout, SQS makes it visible again, and another worker picks it up. This causes double processing. Mitigation: set the visibility timeout to 2x the maximum expected processing time, and implement idempotency in the job handler.
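
The idempotency mitigation above can be sketched as a small guard around the job handler. The in-memory Set here is purely illustrative; in production the seen-set would live in Redis or a database so it survives restarts and is shared across workers (and there is still a race if two workers start the same job concurrently, which a shared store with an atomic set-if-absent would close):

```typescript
// Idempotency guard, assuming each job carries a stable unique id.
// Hypothetical names; the storage choice (a local Set) is a stand-in.
const processed = new Set<string>();

async function processOnce(
  jobId: string,
  handler: () => Promise<void>,
): Promise<boolean> {
  if (processed.has(jobId)) {
    return false; // duplicate delivery: skip without treating it as an error
  }
  await handler();
  processed.add(jobId); // record only after success, so failures still retry
  return true;
}
```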

Worker starvation on priority queues: High-priority jobs can starve low-priority jobs indefinitely. If payment jobs arrive continuously, email jobs never process. Mitigation: use separate queues with dedicated workers, or implement weighted fair queuing (process 1 low-priority job for every 10 high-priority jobs).
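
The weighted fair queuing mitigation can be sketched as a picker that grants the low-priority queue one turn for every ten high-priority turns (mirroring the 10:1 ratio in the text; the ratio itself is a tunable assumption):

```typescript
// Weighted fair queuing: for every `weight` high-priority jobs processed,
// serve one low-priority job so the low-priority queue cannot starve.
function makeFairPicker(weight = 10) {
  let highSinceLow = 0;
  return (highDepth: number, lowDepth: number): 'high' | 'low' | null => {
    if (highDepth === 0 && lowDepth === 0) return null; // nothing to do
    // Serve low when high is empty, or when high has used up its turns.
    if (lowDepth > 0 && (highDepth === 0 || highSinceLow >= weight)) {
      highSinceLow = 0;
      return 'low';
    }
    highSinceLow += 1;
    return 'high';
  };
}
```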

Poison pill jobs: A job that consistently fails (bad payload, unrecoverable error) retries indefinitely and blocks the queue. Mitigation: set max_attempts and move permanently failed jobs to a dead letter queue after exhausting retries.

Scaling Considerations

  • BullMQ scales to 100,000+ jobs/second with Redis Cluster. Each queue can be sharded across cluster nodes.
  • Postgres queues hit a ceiling at approximately 5,000-10,000 jobs/second due to row-level lock contention and WAL generation. Beyond this, dedicated queue infrastructure is necessary.
  • SQS scales automatically with no configuration. Throughput increases are handled transparently by AWS.
  • For 5M+ jobs/day, consider a dedicated job processing framework (Temporal, Inngest) that provides workflow orchestration, not just queueing.
  • Worker auto-scaling based on queue depth: spin up workers when queue depth exceeds a threshold, spin down when idle. BullMQ provides queue event listeners for this.
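
The queue-depth auto-scaling from the last bullet reduces to a small decision function. BullMQ's queue API (e.g. getWaitingCount) can supply the depth; the drain-time target and per-worker throughput figures below are illustrative assumptions, not measured values:

```typescript
// Decide how many workers to run so the current backlog drains within
// `targetDrainSec`, clamped to a min/max fleet size.
function desiredWorkers(
  queueDepth: number,
  jobsPerWorkerPerSec: number,
  targetDrainSec = 60,
  min = 1,
  max = 10,
): number {
  const needed = Math.ceil(queueDepth / (jobsPerWorkerPerSec * targetDrainSec));
  return Math.min(max, Math.max(min, needed));
}
```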

Observability

  • Queue depth: the most important metric. Growing queue depth indicates workers cannot keep up.
  • Processing duration per job type: p50, p95, p99. Detect slow jobs before they cause timeouts.
  • Failure rate per job type: alert on failure rate exceeding 5% (indicates a systemic issue, not transient failures).
  • Dead letter queue depth: should be near zero. Growing DLQ indicates unhandled failure modes.
  • Worker utilization: percentage of time workers are processing vs idle. Below 50% means over-provisioned; above 90% means under-provisioned.
  • End-to-end latency: time from job creation to completion. SLA violations are detected here.
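
The worker-utilization thresholds above translate directly into an alerting rule. A sketch using those same cutoffs (below 50% busy means over-provisioned, above 90% means under-provisioned):

```typescript
// Classify worker utilization from busy time over a measurement window,
// using the 50%/90% thresholds from the list above.
function utilizationStatus(
  busyMs: number,
  totalMs: number,
): 'over-provisioned' | 'ok' | 'under-provisioned' {
  const utilization = totalMs === 0 ? 0 : busyMs / totalMs;
  if (utilization < 0.5) return 'over-provisioned';
  if (utilization > 0.9) return 'under-provisioned';
  return 'ok';
}
```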

Key Takeaways

  • Use BullMQ for high-throughput, latency-sensitive jobs where at-least-once delivery is acceptable. Its rate limiting and priority features are mature.
  • Use Postgres queues for jobs that require transactional guarantees (enqueue job in the same transaction as the business operation).
  • Use SQS for long-running jobs where managed infrastructure and built-in dead letter queues reduce operational burden.
  • Never use the same queue for all job types. Different jobs have different SLAs, failure tolerances, and throughput requirements.
  • Monitor queue depth and dead letter queue depth as the primary health indicators. Everything else is secondary.
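
The transactional-enqueue takeaway can be sketched as a single transaction: the business row and its job are committed atomically, so a job exists if and only if the payment does. The payments table and its columns are hypothetical; job_queue matches the schema defined earlier:

```typescript
// Statements executed as one transaction by the application's Postgres
// client. `payments` and its columns are assumed for illustration.
const transactionalEnqueue = [
  `BEGIN`,
  `INSERT INTO payments (id, amount_cents, status)
     VALUES ($1, $2, 'pending')`,
  `INSERT INTO job_queue (queue_name, payload, priority)
     VALUES ('payments', jsonb_build_object('payment_id', $1), 10)`,
  `COMMIT`,
];
```

If the transaction rolls back, no orphan job is ever picked up by a worker; if it commits, the job cannot be lost, which is exactly the guarantee the payment jobs need.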

Final Thoughts

The final architecture uses all three queue backends: BullMQ for emails and webhooks (85% of volume), Postgres for payment jobs (10% of volume, transactional safety), and SQS for report generation (5% of volume, long-running). This is more complex than a single queue, but each job type gets the delivery guarantees and performance characteristics it requires. The total infrastructure cost is under $20/month. The operational overhead is monitoring three dashboards instead of one, which is a reasonable price for the reliability improvement.
