Experimenting With Background Workers at Scale
Testing job queue architectures with BullMQ, Postgres-based queues, and SQS under increasing job volumes, with failure handling and scaling measurements.
Context
A platform processing 500,000 background jobs per day needed a reliable job queue. Jobs ranged from sending emails (low priority, high volume) to processing payments (high priority, low tolerance for failure). I tested three queue implementations to find the right architecture for each job category.
Problem
Background job processing introduces three categories of problems: delivery guarantees (at-least-once vs exactly-once), ordering (FIFO vs priority), and failure handling (retries, dead letters, timeouts). Different queue backends make different trade-offs on each axis.
Constraints
- Job volume: 500,000 jobs/day, with peaks of 2,000 jobs/minute
- Job types: email (60%), webhook delivery (25%), payment processing (10%), report generation (5%)
- Processing latency: emails within 60 seconds, payments within 5 seconds, reports within 5 minutes
- Failure tolerance: payments must not be lost or double-processed; emails can tolerate occasional duplicates
- Infrastructure: existing Redis and Postgres; AWS account available for SQS
- Worker count: 10 workers across 5 instances
Design
Queue Implementation 1: BullMQ (Redis-backed)
```typescript
import { Queue, Worker } from 'bullmq';

const emailQueue = new Queue('emails', {
  connection: { host: 'redis', port: 6379 },
  defaultJobOptions: {
    attempts: 3,
    backoff: { type: 'exponential', delay: 5000 },
    removeOnComplete: { age: 86400 }, // 24 hours
    removeOnFail: { age: 604800 }, // 7 days
  },
});

const emailWorker = new Worker('emails', async (job) => {
  await sendEmail(job.data.to, job.data.subject, job.data.body);
}, {
  connection: { host: 'redis', port: 6379 },
  concurrency: 5,
  limiter: { max: 100, duration: 60000 }, // 100 emails/minute rate limit
});
```

Queue Implementation 2: Postgres-based (SKIP LOCKED)
```sql
CREATE TABLE job_queue (
  id BIGSERIAL PRIMARY KEY,
  queue_name TEXT NOT NULL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  priority INTEGER NOT NULL DEFAULT 0,
  attempts INTEGER NOT NULL DEFAULT 0,
  max_attempts INTEGER NOT NULL DEFAULT 3,
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  locked_until TIMESTAMPTZ,
  locked_by TEXT,
  completed_at TIMESTAMPTZ,
  failed_at TIMESTAMPTZ,
  error_message TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_job_queue_pending
  ON job_queue (queue_name, priority DESC, scheduled_at)
  WHERE status = 'pending';
```

Worker polling query:
```sql
UPDATE job_queue
SET status = 'processing',
    locked_until = now() + interval '5 minutes',
    locked_by = $1,
    attempts = attempts + 1
WHERE id = (
  SELECT id FROM job_queue
  WHERE queue_name = $2
    AND status = 'pending'
    AND scheduled_at <= now()
  ORDER BY priority DESC, scheduled_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

SKIP LOCKED ensures multiple workers do not contend on the same row.
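After the handler runs, the worker must resolve the claimed row: mark it completed, reschedule it with backoff, or mark it failed once attempts reach max_attempts. The rescheduling decision can be kept as a pure function the worker consults before issuing its UPDATE. A sketch, assuming exponential backoff with a 5-second base (matching the BullMQ configuration above) and an illustrative cap; `onFailure` and both delay parameters are hypothetical names, not part of the schema:

```typescript
// Decide what to do with a claimed job whose handler threw.
// 'retry' maps to: status='pending', scheduled_at = now() + delayMs.
// 'dead-letter' maps to: status='failed', error_message recorded.
type FailureAction =
  | { kind: 'retry'; delayMs: number }
  | { kind: 'dead-letter' };

function onFailure(
  attempts: number,
  maxAttempts: number,
  baseDelayMs = 5000,      // illustrative: mirrors the BullMQ backoff delay
  maxDelayMs = 300_000,    // illustrative cap: 5 minutes
): FailureAction {
  if (attempts >= maxAttempts) return { kind: 'dead-letter' };
  // Exponential backoff: 5s, 10s, 20s, ... capped at maxDelayMs.
  const delayMs = Math.min(baseDelayMs * 2 ** (attempts - 1), maxDelayMs);
  return { kind: 'retry', delayMs };
}
```

Keeping this logic out of SQL means the retry policy can differ per queue_name without schema changes.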
Queue Implementation 3: AWS SQS
```typescript
import {
  SQSClient,
  SendMessageCommand,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

// Enqueue
await sqs.send(new SendMessageCommand({
  QueueUrl: QUEUE_URL,
  MessageBody: JSON.stringify(jobData),
  MessageGroupId: jobData.type, // FIFO queue
  MessageDeduplicationId: jobData.id,
}));

// Worker
async function pollAndProcess() {
  const response = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 20, // Long polling
    VisibilityTimeout: 300,
  }));
  for (const message of response.Messages || []) {
    try {
      await processJob(JSON.parse(message.Body!));
      // Delete only after successful processing
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle,
      }));
    } catch (error) {
      // Leave the message: it becomes visible again after VisibilityTimeout
    }
  }
}
```

Trade-offs
Throughput Benchmarks
| Metric | BullMQ (Redis) | Postgres Queue | SQS |
|---|---|---|---|
| Enqueue rate (sustained) | 15,000 jobs/s | 3,000 jobs/s | 1,000 jobs/s (per queue) |
| Dequeue rate (10 workers) | 8,000 jobs/s | 1,500 jobs/s | 800 jobs/s |
| Enqueue latency (p50) | 0.5ms | 2ms | 15ms |
| Enqueue latency (p95) | 1.2ms | 8ms | 45ms |
| End-to-end latency (p50) | 12ms | 50ms | 200ms |
| End-to-end latency (p95) | 45ms | 200ms | 800ms |
Feature Comparison
| Feature | BullMQ | Postgres Queue | SQS |
|---|---|---|---|
| Priority queues | Yes | Yes (ORDER BY priority) | No (separate queues) |
| Delayed jobs | Yes | Yes (scheduled_at) | Yes (DelaySeconds) |
| FIFO ordering | Optional | Yes (natural) | Yes (FIFO queues) |
| Rate limiting | Built-in | Manual | No (manual) |
| Job progress tracking | Yes | Manual | No |
| Dead letter queue | Manual | Manual | Built-in |
| Persistence | Redis persistence (AOF/RDB) | Full ACID | Fully managed |
| Exactly-once delivery | No (at-least-once) | Yes (with transactions) | Yes (FIFO + dedup) |
| Operational overhead | Medium (Redis) | Low (existing DB) | Low (managed) |
Cost at 500,000 Jobs/Day
| Queue | Infrastructure Cost | Notes |
|---|---|---|
| BullMQ | $15/month (Redis instance) | Existing Redis may suffice |
| Postgres | $0 (existing DB) | Uses DB capacity |
| SQS | $0.20/month | First 1M requests free |
Recommended Queue per Job Type
| Job Type | Best Queue | Reason |
|---|---|---|
| Email sending | BullMQ | High volume, rate limiting needed, duplicates acceptable |
| Webhook delivery | BullMQ | Retry with backoff, per-endpoint concurrency control |
| Payment processing | Postgres | Exactly-once with transactions, ACID guarantees |
| Report generation | SQS | Long-running, managed dead letters, no Redis dependency |
Failure Modes
BullMQ: Redis data loss: If Redis is configured without persistence (or with RDB-only snapshots), a crash loses all queued jobs since the last snapshot. For email jobs, this is tolerable. For payment jobs, it is not. Mitigation: use AOF persistence with appendfsync everysec, or do not use Redis for critical jobs.
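The AOF mitigation maps to two directives in redis.conf; with everysec, a crash loses at most roughly the last second of acknowledged writes. A sketch of the relevant lines, not a full configuration:

```conf
# Persist every write to the append-only file so queued jobs survive a restart
appendonly yes
# fsync the AOF once per second: bounded loss with modest write overhead
appendfsync everysec
```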
Postgres queue: table bloat: High job throughput creates and deletes millions of rows. Without aggressive vacuuming, the table bloats and the partial index on pending jobs becomes inefficient. Mitigation: partition by date, set aggressive autovacuum parameters on the queue table, and archive completed jobs to a separate table.
SQS: visibility timeout misconfiguration: If a job takes longer than the visibility timeout, SQS makes it visible again, and another worker picks it up. This causes double processing. Mitigation: set the visibility timeout to 2x the maximum expected processing time, and implement idempotency in the job handler.
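The idempotency half of that mitigation amounts to recording processed job IDs and skipping repeats. A minimal sketch with an in-memory store; a real payment handler would use a unique constraint in a durable database instead, and `makeIdempotent` is an illustrative name, not an SQS or BullMQ API:

```typescript
// Wrap a handler so a redelivered message (e.g. after a visibility
// timeout expiry) runs at most once per job ID.
function makeIdempotent<T>(
  handler: (payload: T) => Promise<void>,
  processed: Set<string> = new Set(),  // stand-in for a durable store
) {
  return async (jobId: string, payload: T): Promise<'processed' | 'skipped'> => {
    if (processed.has(jobId)) return 'skipped';  // duplicate delivery
    await handler(payload);
    processed.add(jobId);  // record only after the handler succeeds
    return 'processed';
  };
}
```

Recording the ID only after success means a crash mid-handler leads to a retry rather than a silently dropped job, which matches at-least-once semantics.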
Worker starvation on priority queues: High-priority jobs can starve low-priority jobs indefinitely. If payment jobs arrive continuously, email jobs never process. Mitigation: use separate queues with dedicated workers, or implement weighted fair queuing (process 1 low-priority job for every 10 high-priority jobs).
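The weighted fair queuing mitigation can be reduced to a counter: after every N high-priority picks, take one low-priority job even if high-priority work is still waiting. A sketch under that scheme, using the 10:1 ratio from the example above (`makeScheduler` is a hypothetical helper):

```typescript
// Pick which queue a worker should poll next. Guarantees low-priority
// jobs make progress even under a continuous high-priority stream.
function makeScheduler(highPerLow = 10) {
  let highServed = 0;
  return (highDepth: number, lowDepth: number): 'high' | 'low' | 'idle' => {
    if (highDepth === 0 && lowDepth === 0) return 'idle';
    // Force a low-priority pick once the ratio is reached, or when
    // there is no high-priority work at all.
    if (lowDepth > 0 && (highServed >= highPerLow || highDepth === 0)) {
      highServed = 0;
      return 'low';
    }
    highServed++;
    return 'high';
  };
}
```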
Poison pill jobs: A job that consistently fails (bad payload, unrecoverable error) retries indefinitely and blocks the queue. Mitigation: set max_attempts and move permanently failed jobs to a dead letter queue after exhausting retries.
Scaling Considerations
- BullMQ scales to 100,000+ jobs/second with Redis Cluster. Each queue can be sharded across cluster nodes.
- Postgres queues hit a ceiling at approximately 5,000-10,000 jobs/second due to row-level lock contention and WAL generation. Beyond this, dedicated queue infrastructure is necessary.
- SQS scales automatically with no configuration. Throughput increases are handled transparently by AWS.
- For 5M+ jobs/day, consider a dedicated job processing framework (Temporal, Inngest) that provides workflow orchestration, not just queueing.
- Worker auto-scaling based on queue depth: spin up workers when queue depth exceeds a threshold, spin down when idle. BullMQ provides queue event listeners for this.
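The auto-scaling point above can be isolated into one pure function mapping queue depth to a desired worker count, which an orchestrator (or a BullMQ queue-event listener) then acts on. The thresholds here are illustrative, not measured values from this experiment:

```typescript
// Desired worker count from queue depth: one worker per `jobsPerWorker`
// backlogged jobs, clamped to [min, max] to avoid thrashing.
function desiredWorkers(
  queueDepth: number,
  opts = { min: 1, max: 10, jobsPerWorker: 500 },  // illustrative defaults
): number {
  const wanted = Math.ceil(queueDepth / opts.jobsPerWorker);
  return Math.max(opts.min, Math.min(opts.max, wanted));
}
```

Keeping the decision pure makes the scaling policy trivially unit-testable, independent of the queue backend.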
Observability
- Queue depth: the most important metric. Growing queue depth indicates workers cannot keep up.
- Processing duration per job type: p50, p95, p99. Detect slow jobs before they cause timeouts.
- Failure rate per job type: alert on failure rate exceeding 5% (indicates a systemic issue, not transient failures).
- Dead letter queue depth: should be near zero. Growing DLQ indicates unhandled failure modes.
- Worker utilization: percentage of time workers are processing vs idle. Below 50% means over-provisioned; above 90% means under-provisioned.
- End-to-end latency: time from job creation to completion. SLA violations are detected here.
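The failure-rate alert above needs one guard the bullet implies but does not state: with a tiny sample, a single failure exceeds 5%, so the check should require a minimum sample size. A sketch (`shouldAlert` and `minSample` are illustrative names):

```typescript
// Fire the failure-rate alert only on a systemic signal: more than 5%
// failures across a meaningful sample, not a transient blip.
function shouldAlert(failed: number, completed: number, minSample = 100): boolean {
  const total = failed + completed;
  if (total < minSample) return false;  // too few jobs to judge
  return failed / total > 0.05;
}
```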
Key Takeaways
- Use BullMQ for high-throughput, latency-sensitive jobs where at-least-once delivery is acceptable. Its rate limiting and priority features are mature.
- Use Postgres queues for jobs that require transactional guarantees (enqueue job in the same transaction as the business operation).
- Use SQS for long-running jobs where managed infrastructure and built-in dead letter queues reduce operational burden.
- Never use the same queue for all job types. Different jobs have different SLAs, failure tolerances, and throughput requirements.
- Monitor queue depth and dead letter queue depth as the primary health indicators. Everything else is secondary.
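The transactional-enqueue pattern from the Postgres takeaway can be sketched against any client exposing a query method; here a minimal interface stands in for a real pg client. The `payments` table, `TxClient`, and `recordPaymentWithJob` are illustrative, while `job_queue` and its columns match the schema above:

```typescript
// Enqueue the job inside the same transaction as the business write,
// so the job exists if and only if the payment row was committed.
interface TxClient {
  query(sql: string, params?: unknown[]): Promise<void>;
}

async function recordPaymentWithJob(
  client: TxClient,
  paymentId: string,
  amountCents: number,
): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query(
      'INSERT INTO payments (id, amount_cents) VALUES ($1, $2)',
      [paymentId, amountCents],
    );
    await client.query(
      "INSERT INTO job_queue (queue_name, payload) VALUES ('payments', $1)",
      [JSON.stringify({ paymentId })],
    );
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK'); // neither the payment nor the job exists
    throw err;
  }
}
```

This is the property none of the other backends offer: there is no window in which the payment committed but the job was lost, or the job enqueued but the payment rolled back.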
Final Thoughts
The final architecture uses all three queue backends: BullMQ for emails and webhooks (85% of volume), Postgres for payment jobs (10% of volume, transactional safety), and SQS for report generation (5% of volume, long-running). This is more complex than a single queue, but each job type gets the delivery guarantees and performance characteristics it requires. The total infrastructure cost is under $20/month. The operational overhead is monitoring three dashboards instead of one, which is a reasonable price for the reliability improvement.