Experimenting With Background Workers at Scale
Testing job queue architectures with BullMQ, Postgres-based queues, and SQS under increasing job volumes, with failure handling and scaling measurements.
Context
A platform processing 500,000 background jobs per day needed a reliable job queue. Jobs ranged from sending emails (low priority, high volume) to processing payments (high priority, low tolerance for failure). I tested three queue implementations to find the right architecture for each job category.
Problem
Background job processing introduces three categories of problems: delivery guarantees (at-least-once vs exactly-once), ordering (FIFO vs priority), and failure handling (retries, dead letters, timeouts). Different queue backends make different trade-offs on each axis.
Constraints
- Job volume: 500,000 jobs/day, with peaks of 2,000 jobs/minute
- Job types: email (60%), webhook delivery (25%), payment processing (10%), report generation (5%)
- Processing latency: emails within 60 seconds, payments within 5 seconds, reports within 5 minutes
- Failure tolerance: payments must not be lost or double-processed; emails can tolerate occasional duplicates
- Infrastructure: existing Redis and Postgres; AWS account available for SQS
- Worker count: 10 workers across 5 instances
Design
Queue Implementation 1: BullMQ (Redis-backed)
```typescript
import { Queue, Worker } from 'bullmq';

const emailQueue = new Queue('emails', {
  connection: { host: 'redis', port: 6379 },
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: { age: 86400 }, // 24 hours
    removeOnFail: { age: 604800 }, // 7 days
  },
});

const emailWorker = new Worker('emails', async (job) => {
  await sendEmail(job.data.to, job.data.subject, job.data.body);
}, {
  connection: { host: 'redis', port: 6379 },
  concurrency: 5,
  limiter: { max: 100, duration: 60000 }, // 100 emails/minute rate limit
});
```

Queue Implementation 2: Postgres-based (SKIP LOCKED)
```sql
CREATE TABLE job_queue (
  id BIGSERIAL PRIMARY KEY,
  queue_name TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  priority INTEGER NOT NULL DEFAULT 0,
  attempts INTEGER NOT NULL DEFAULT 0,
  max_attempts INTEGER NOT NULL DEFAULT 3,
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_until TIMESTAMPTZ,
  locked_by TEXT,
  completed_at TIMESTAMPTZ,
  failed_at TIMESTAMPTZ,
  error_message TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_job_queue_pending
  ON job_queue (queue_name, priority DESC, scheduled_at)
  WHERE status = 'pending';
```

Worker polling query:
```sql
UPDATE job_queue
SET status = 'processing',
    locked_until = now() + interval '5 minutes',
    locked_by = $1,
    attempts = attempts + 1
WHERE id = (
  SELECT id FROM job_queue
  WHERE queue_name = $2
    AND status = 'pending'
    AND scheduled_at <= now()
  ORDER BY priority DESC, scheduled_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

SKIP LOCKED ensures multiple workers do not contend on the same row.
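After the handler runs, the worker must resolve the claimed row: mark it completed, reschedule it with backoff, or mark it failed once attempts reach max_attempts. The rescheduling decision can be kept as a pure function the worker consults before issuing its UPDATE. A sketch, assuming exponential backoff with a 5-second base (matching the BullMQ configuration above) and an illustrative cap; `onFailure` and both delay parameters are hypothetical names, not part of the schema:

```typescript
// Decide what to do with a claimed job whose handler threw.
// 'retry' maps to: status='pending', scheduled_at = now() + delayMs.
// 'dead-letter' maps to: status='failed', error_message recorded.
type FailureAction =
  | { kind: 'retry'; delayMs: number }
  | { kind: 'dead-letter' };

function onFailure(
  attempts: number,
  maxAttempts: number,
  baseDelayMs = 5000,      // illustrative: mirrors the BullMQ backoff delay
  maxDelayMs = 300_000,    // illustrative cap: 5 minutes
): FailureAction {
  if (attempts >= maxAttempts) return { kind: 'dead-letter' };
  // Exponential backoff: 5s, 10s, 20s, ... capped at maxDelayMs.
  const delayMs = Math.min(baseDelayMs * 2 ** (attempts - 1), maxDelayMs);
  return { kind: 'retry', delayMs };
}
```

Keeping this logic out of SQL means the retry policy can differ per queue_name without schema changes.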
Queue Implementation 3: AWS SQS
```typescript
import {
  SQSClient,
  SendMessageCommand,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

// Enqueue
await sqs.send(new SendMessageCommand({
  QueueUrl: QUEUE_URL,
  MessageBody: JSON.stringify(jobData),
  MessageGroupId: jobData.type, // FIFO queue
  MessageDeduplicationId: jobData.id,
}));

// Worker
async function pollAndProcess() {
  const response = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20, // Long polling
    VisibilityTimeout: 300,
  }));
  for (const message of response.Messages || []) {
    try {
      await processJob(JSON.parse(message.Body!));
      // Delete only after successful processing
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle,
      }));
    } catch (error) {
      // Leave the message: it becomes visible again after VisibilityTimeout
    }
  }
}
```

Trade-offs
Throughput Benchmarks
| Metric | BullMQ (Redis) | Postgres Queue | SQS |
|---|---|---|---|
| Enqueue rate (sustained) | 15,000 jobs/s | 3,000 jobs/s | 1,000 jobs/s (per queue) |
| Dequeue rate (10 workers) | 8,000 jobs/s | 1,500 jobs/s | 800 jobs/s |
| Enqueue latency (p50) | 0.5ms | 2ms | 15ms |
| Enqueue latency (p95) | 1.2ms | 8ms | 45ms |
| End-to-end latency (p50) | 12ms | 50ms | 200ms |
| End-to-end latency (p95) | 45ms | 200ms | 800ms |
Feature Comparison
| Feature | BullMQ | Postgres Queue | SQS |
|---|---|---|---|
| Priority queues | Yes | Yes (ORDER BY priority) | No (separate queues) |
| Delayed jobs | Yes | Yes (scheduled_at) | Yes (DelaySeconds) |
| FIFO ordering | Optional | Yes (natural) | Yes (FIFO queues) |
| Rate limiting | Built-in | Manual | No (manual) |
| Job progress tracking | Yes | Manual | No |
| Dead letter queue | Manual | Manual | Built-in |
| Persistence | Redis persistence (AOF/RDB) | Full ACID | Fully managed |
| Exactly-once delivery | No (at-least-once) | Yes (with transactions) | Yes (FIFO + dedup) |
| Operational overhead | Medium (Redis) | Low (existing DB) | Low (managed) |
Cost at 500,000 Jobs/Day
| Queue | Infrastructure Cost | Notes |
|---|---|---|
| BullMQ | $15/month (Redis instance) | Existing Redis may suffice |
| Postgres | $0 (existing DB) | Uses DB capacity |
| SQS | $0.20/month | First 1M requests free |
Recommended Queue per Job Type
| Job Type | Best Queue | Reason |
|---|---|---|
| Email sending | BullMQ | High volume, rate limiting needed, duplicates acceptable |
| Webhook delivery | BullMQ | Retry with backoff, per-endpoint concurrency control |
| Payment processing | Postgres | Exactly-once with transactions, ACID guarantees |
| Report generation | SQS | Long-running, managed dead letters, no Redis dependency |
Failure Modes
BullMQ: Redis data loss: If Redis is configured without persistence (or with RDB-only snapshots), a crash loses all queued jobs since the last snapshot. For email jobs, this is tolerable. For payment jobs, it is not. Mitigation: use AOF persistence with appendfsync everysec, or do not use Redis for critical jobs.
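The AOF mitigation maps to two directives in redis.conf; with everysec, a crash loses at most roughly the last second of acknowledged writes. A sketch of the relevant lines, not a full configuration:

```conf
# Persist every write to the append-only file so queued jobs survive a restart
appendonly yes
# fsync the AOF once per second: bounded loss with modest write overhead
appendfsync everysec
```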
Postgres queue: table bloat: High job throughput creates and deletes millions of rows. Without aggressive vacuuming, the table bloats and the partial index on pending jobs becomes inefficient. Mitigation: partition by date, set aggressive autovacuum parameters on the queue table, and archive completed jobs to a separate table.
SQS: visibility timeout misconfiguration: If a job takes longer than the visibility timeout, SQS makes it visible again, and another worker picks it up. This causes double processing. Mitigation: set the visibility timeout to 2x the maximum expected processing time, and implement idempotency in the job handler.
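The idempotency half of that mitigation amounts to recording processed job IDs and skipping repeats. A minimal sketch with an in-memory store; a real payment handler would use a unique constraint in a durable database instead, and `makeIdempotent` is an illustrative name, not an SQS or BullMQ API:

```typescript
// Wrap a handler so a redelivered message (e.g. after a visibility
// timeout expiry) runs at most once per job ID.
function makeIdempotent<T>(
  handler: (payload: T) => Promise<void>,
  processed: Set<string> = new Set(),  // stand-in for a durable store
) {
  return async (jobId: string, payload: T): Promise<'processed' | 'skipped'> => {
    if (processed.has(jobId)) return 'skipped';  // duplicate delivery
    await handler(payload);
    processed.add(jobId);  // record only after the handler succeeds
    return 'processed';
  };
}
```

Recording the ID only after success means a crash mid-handler leads to a retry rather than a silently dropped job, which matches at-least-once semantics.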
Worker starvation on priority queues: High-priority jobs can starve low-priority jobs indefinitely. If payment jobs arrive continuously, email jobs never process. Mitigation: use separate queues with dedicated workers, or implement weighted fair queuing (process 1 low-priority job for every 10 high-priority jobs).
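The weighted fair queuing mitigation can be reduced to a counter: after every N high-priority picks, take one low-priority job even if high-priority work is still waiting. A sketch under that scheme, using the 10:1 ratio from the example above (`makeScheduler` is a hypothetical helper):

```typescript
// Pick which queue a worker should poll next. Guarantees low-priority
// jobs make progress even under a continuous high-priority stream.
function makeScheduler(highPerLow = 10) {
  let highServed = 0;
  return (highDepth: number, lowDepth: number): 'high' | 'low' | 'idle' => {
    if (highDepth === 0 && lowDepth === 0) return 'idle';
    // Force a low-priority pick once the ratio is reached, or when
    // there is no high-priority work at all.
    if (lowDepth > 0 && (highServed >= highPerLow || highDepth === 0)) {
      highServed = 0;
      return 'low';
    }
    highServed++;
    return 'high';
  };
}
```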
Poison pill jobs: A job that consistently fails (bad payload, unrecoverable error) retries indefinitely and blocks the queue. Mitigation: set max_attempts and move permanently failed jobs to a dead letter queue after exhausting retries.
Scaling Considerations
- BullMQ scales to 100,000+ jobs/second with Redis Cluster. Each queue can be sharded across cluster nodes.
- Postgres queues hit a ceiling at approximately 5,000-10,000 jobs/second due to row-level lock contention and WAL generation. Beyond this, dedicated queue infrastructure is necessary.
- SQS scales automatically with no configuration. Throughput increases are handled transparently by AWS.
- For 5M+ jobs/day, consider a dedicated job processing framework (Temporal, Inngest) that provides workflow orchestration, not just queueing.
- Worker auto-scaling based on queue depth: spin up workers when queue depth exceeds a threshold, spin down when idle. BullMQ provides queue event listeners for this.
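The auto-scaling point above can be isolated into one pure function mapping queue depth to a desired worker count, which an orchestrator (or a BullMQ queue-event listener) then acts on. The thresholds here are illustrative, not measured values from this experiment:

```typescript
// Desired worker count from queue depth: one worker per `jobsPerWorker`
// backlogged jobs, clamped to [min, max] to avoid thrashing.
function desiredWorkers(
  queueDepth: number,
  opts = { min: 1, max: 10, jobsPerWorker: 500 },  // illustrative defaults
): number {
  const wanted = Math.ceil(queueDepth / opts.jobsPerWorker);
  return Math.max(opts.min, Math.min(opts.max, wanted));
}
```

Keeping the decision pure makes the scaling policy trivially unit-testable, independent of the queue backend.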
Observability
- Queue depth: the most important metric. Growing queue depth indicates workers cannot keep up.
- Processing duration per job type: p50, p95, p99. Detect slow jobs before they cause timeouts.
- Failure rate per job type: alert on failure rate exceeding 5% (indicates a systemic issue, not transient failures).
- Dead letter queue depth: should be near zero. Growing DLQ indicates unhandled failure modes.
- Worker utilization: percentage of time workers are processing vs idle. Below 50% means over-provisioned; above 90% means under-provisioned.
- End-to-end latency: time from job creation to completion. SLA violations are detected here.
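The failure-rate alert above needs one guard the bullet implies but does not state: with a tiny sample, a single failure exceeds 5%, so the check should require a minimum sample size. A sketch (`shouldAlert` and `minSample` are illustrative names):

```typescript
// Fire the failure-rate alert only on a systemic signal: more than 5%
// failures across a meaningful sample, not a transient blip.
function shouldAlert(failed: number, completed: number, minSample = 100): boolean {
  const total = failed + completed;
  if (total < minSample) return false;  // too few jobs to judge
  return failed / total > 0.05;
}
```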
Key Takeaways
- Use BullMQ for high-throughput, latency-sensitive jobs where at-least-once delivery is acceptable. Its rate limiting and priority features are mature.
- Use Postgres queues for jobs that require transactional guarantees (enqueue job in the same transaction as the business operation).
- Use SQS for long-running jobs where managed infrastructure and built-in dead letter queues reduce operational burden.
- Never use the same queue for all job types. Different jobs have different SLAs, failure tolerances, and throughput requirements.
- Monitor queue depth and dead letter queue depth as the primary health indicators. Everything else is secondary.
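The transactional-enqueue pattern from the Postgres takeaway can be sketched against any client exposing a query method; here a minimal interface stands in for a real pg client. The `payments` table, `TxClient`, and `recordPaymentWithJob` are illustrative, while `job_queue` and its columns match the schema above:

```typescript
// Enqueue the job inside the same transaction as the business write,
// so the job exists if and only if the payment row was committed.
interface TxClient {
  query(sql: string, params?: unknown[]): Promise<void>;
}

async function recordPaymentWithJob(
  client: TxClient,
  paymentId: string,
  amountCents: number,
): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query(
      'INSERT INTO payments (id, amount_cents) VALUES ($1, $2)',
      [paymentId, amountCents],
    );
    await client.query(
      "INSERT INTO job_queue (queue_name, payload) VALUES ('payments', $1)",
      [JSON.stringify({ paymentId })],
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK'); // neither the payment nor the job exists
    throw err;
  }
}
```

This is the property none of the other backends offer: there is no window in which the payment committed but the job was lost, or the job enqueued but the payment rolled back.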
Final Thoughts
The final architecture uses all three queue backends: BullMQ for emails and webhooks (85% of volume), Postgres for payment jobs (10% of volume, transactional safety), and SQS for report generation (5% of volume, long-running). This is more complex than a single queue, but each job type gets the delivery guarantees and performance characteristics it requires. The total infrastructure cost is under $20/month. The operational overhead is monitoring three dashboards instead of one, which is a reasonable price for the reliability improvement.