Designing Background Job Systems for Mobile Apps

Dhruval Dhameliya·September 15, 2025·7 min read

Architecture for reliable background job execution on Android, covering WorkManager, job prioritization, constraint handling, and failure recovery.

Background work on Android is a constrained problem. The OS actively kills background processes, restricts network access in Doze mode, and limits execution time. Designing a reliable background job system requires working within these constraints, not fighting them.

See also: Designing Retry and Backoff Strategies for Mobile Networks.

Context

Mobile apps need background work for sync, data upload, cache cleanup, notification processing, and periodic maintenance. Android's background execution limits (introduced in Android 8.0 Oreo and tightened in later releases) mean that naive approaches (plain threads, services, AlarmManager) are unreliable on modern devices.

Problem

Design a background job system that:

  • Executes reliably despite OS-imposed restrictions
  • Handles job prioritization, dependencies, and retries
  • Respects device constraints (battery, network, storage)
  • Provides observability into job execution and failure

Constraints

| Constraint | Detail |
| --- | --- |
| Execution window | OS may defer jobs by minutes to hours depending on Doze state |
| Execution time | 10-minute max execution per WorkManager job |
| Network access | Restricted in Doze; available in maintenance windows |
| Battery | Background work budget is finite; excessive use causes battery warnings |
| Process death | The app process can be killed at any time; state must be persisted |

Design

Job Categories

| Category | Urgency | Deferrable | Example | Mechanism |
| --- | --- | --- | --- | --- |
| Immediate | High | No | Processing a received message | Foreground Service |
| Expedited | High | Slightly | Completing a user-initiated upload | WorkManager (expedited) |
| Deferrable | Low | Yes | Syncing read receipts, analytics flush | WorkManager (periodic/one-time) |
| Exact-time | Varies | No | Scheduled reminder | AlarmManager (exact) |

Architecture

// Central job coordinator: turns an app-defined JobSpec into a WorkManager request
class JobCoordinator(
    private val workManager: WorkManager,
    private val jobRegistry: JobRegistry
) {
    fun schedule(jobSpec: JobSpec) {
        val constraints = Constraints.Builder()
            .setRequiredNetworkType(jobSpec.networkRequirement)
            .setRequiresBatteryNotLow(jobSpec.requiresBattery)
            .setRequiresStorageNotLow(jobSpec.requiresStorage)
            .build()
        val input = workDataOf("job_type" to jobSpec.type)

        // One-time and periodic requests are different types with different
        // enqueue calls, so each branch enqueues its own request.
        when (val schedule = jobSpec.schedule) {
            is OneTime -> {
                val request = OneTimeWorkRequestBuilder<DelegatingWorker>()
                    .setConstraints(constraints)
                    .setBackoffCriteria(
                        BackoffPolicy.EXPONENTIAL,
                        jobSpec.initialBackoffSeconds,
                        TimeUnit.SECONDS
                    )
                    .setInputData(input)
                    .addTag(jobSpec.tag)
                    .build()
                workManager.enqueueUniqueWork(
                    jobSpec.uniqueName,
                    jobSpec.existingWorkPolicy,
                    request
                )
            }

            is Periodic -> {
                // WorkManager enforces a 15-minute minimum repeat interval
                val request = PeriodicWorkRequestBuilder<DelegatingWorker>(
                    schedule.intervalMinutes, TimeUnit.MINUTES
                )
                    .setConstraints(constraints)
                    .setInputData(input)
                    .addTag(jobSpec.tag)
                    .build()
                // Periodic work takes an ExistingPeriodicWorkPolicy; KEEP is an
                // assumption here, since JobSpec only carries an ExistingWorkPolicy.
                workManager.enqueueUniquePeriodicWork(
                    jobSpec.uniqueName,
                    ExistingPeriodicWorkPolicy.KEEP,
                    request
                )
            }
        }
    }
}
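
To make the effect of initialBackoffSeconds concrete, the sketch below approximates how retries are spaced under BackoffPolicy.EXPONENTIAL: the delay doubles with each attempt, the initial delay is clamped to at least 10 seconds, and no delay exceeds 5 hours. The function name is hypothetical, and the arithmetic is a model for reasoning about retry schedules, not WorkManager's own code.

```kotlin
// Bounds WorkManager documents for backoff delays
const val MIN_BACKOFF_MILLIS = 10_000L                // 10 seconds
const val MAX_BACKOFF_MILLIS = 5L * 60 * 60 * 1000    // 5 hours

/**
 * Approximate delay before the next run after [runAttemptCount] failed attempts
 * (1 = the first retry). Hypothetical helper for planning, not a WorkManager API.
 */
fun exponentialBackoffMillis(initialMillis: Long, runAttemptCount: Int): Long {
    require(runAttemptCount >= 1) { "runAttemptCount must be at least 1" }
    val base = initialMillis.coerceIn(MIN_BACKOFF_MILLIS, MAX_BACKOFF_MILLIS)
    // Cap the shift so the multiplication cannot overflow; the result is capped anyway
    val doublings = (runAttemptCount - 1).coerceAtMost(20)
    return (base shl doublings).coerceAtMost(MAX_BACKOFF_MILLIS)
}
```

With a 30-second initial backoff, the third retry waits about two minutes, and the delay reaches the 5-hour ceiling after roughly ten failed attempts. That is worth keeping in mind when choosing maxRetries for a job that users are waiting on.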

Delegating Worker Pattern

A single Worker class delegates to registered job handlers, avoiding the need for dozens of Worker subclasses:

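class DelegatingWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val jobType = inputData.getString("job_type")
            ?: return Result.failure()

        val handler = JobRegistry.getHandler(jobType)
            ?: return Result.failure()

        return try {
            val outcome = handler.execute(inputData)
            when (outcome) {
                JobOutcome.SUCCESS -> Result.success(outcome.outputData)
                // Handler-requested retries are capped the same way as exceptions
                JobOutcome.RETRY -> retryOrFail(handler.maxRetries, "retry limit reached")
                JobOutcome.FAILURE -> Result.failure(outcome.errorData)
            }
        } catch (e: CancellationException) {
            // WorkManager stopped the job; cancellation must propagate, not be retried
            throw e
        } catch (e: Exception) {
            JobLogger.logFailure(jobType, e, runAttemptCount)
            retryOrFail(handler.maxRetries, e.message)
        }
    }

    // runAttemptCount is 0 on the first run and increases with each retry
    private fun retryOrFail(maxRetries: Int, error: String?): Result =
        if (runAttemptCount < maxRetries) Result.retry()
        else Result.failure(workDataOf("error" to error))
}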
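
A retry cap bounds the cost of a failure, but some failures are not worth retrying at all. The sketch below separates transient from permanent failures for HTTP-backed jobs, so a handler can fail fast on a deleted resource instead of spending its whole retry budget. The names and the status-code classification are assumptions chosen for illustration, not part of WorkManager.

```kotlin
/** What the worker should report back to the scheduler. */
enum class RetryDecision { RETRY, GIVE_UP }

/**
 * Hypothetical helper: decides whether a failed attempt is worth retrying.
 * [attempt] is zero-based, matching runAttemptCount.
 * [httpStatus] is null when no response was received at all.
 */
fun decideRetry(httpStatus: Int?, attempt: Int, maxRetries: Int): RetryDecision {
    // Out of budget: stop regardless of the cause
    if (attempt >= maxRetries) return RetryDecision.GIVE_UP
    // No response usually means a connectivity problem, which is transient
    if (httpStatus == null) return RetryDecision.RETRY
    return when (httpStatus) {
        408, 429 -> RetryDecision.RETRY      // timeout or rate limiting
        in 500..599 -> RetryDecision.RETRY   // server-side error
        else -> RetryDecision.GIVE_UP        // e.g. 404 or 410: retrying cannot help
    }
}
```

A handler could map GIVE_UP to a failure outcome and record the job in whatever dead letter store the app uses.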

Job Dependencies and Chaining

For multi-step workflows (e.g., compress then upload then cleanup):

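fun scheduleUploadPipeline(fileId: String) {
    val compress = OneTimeWorkRequestBuilder<DelegatingWorker>()
        .setInputData(workDataOf("job_type" to "compress", "file_id" to fileId))
        .build()

    // upload sets no file_id of its own: WorkManager merges the output Data of
    // compress into upload's input, so the compress handler must return whatever
    // the upload handler needs (for example, the compressed file path).
    val upload = OneTimeWorkRequestBuilder<DelegatingWorker>()
        .setInputData(workDataOf("job_type" to "upload"))
        .setConstraints(Constraints.Builder()
            .setRequiredNetworkType(NetworkType.CONNECTED)
            .build())
        .build()

    // If compress or upload fails, cleanup is marked failed as well and never runs
    val cleanup = OneTimeWorkRequestBuilder<DelegatingWorker>()
        .setInputData(workDataOf("job_type" to "cleanup"))
        .build()

    workManager.beginWith(compress)
        .then(upload)
        .then(cleanup)
        .enqueue()
}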

Priority Management

WorkManager does not support explicit priority levels, but you can simulate them:

| Priority | Implementation |
| --- | --- |
| Critical | Expedited work request with foreground service fallback |
| High | One-time work with no initial delay |
| Normal | One-time work with standard constraints |
| Low | One-time work with requiresBatteryNotLow and requiresCharging |
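
One way to keep this mapping in a single place is a small lookup that a coordinator consults when building requests. The sketch below is plain Kotlin with hypothetical names; treating "standard constraints" as battery-not-low is an assumption, and the resulting flags would still need to be applied to a real Constraints.Builder and work request.

```kotlin
enum class JobPriority { CRITICAL, HIGH, NORMAL, LOW }

/** Hypothetical description of how a request should be built for a priority. */
data class RequestProfile(
    val expedited: Boolean,
    val requiresBatteryNotLow: Boolean,
    val requiresCharging: Boolean
)

fun profileFor(priority: JobPriority): RequestProfile = when (priority) {
    // Expedited requests accept only network and storage constraints,
    // so no battery or charging requirement is set here
    JobPriority.CRITICAL -> RequestProfile(
        expedited = true, requiresBatteryNotLow = false, requiresCharging = false
    )
    // Unconstrained and undelayed: runs as soon as the scheduler allows
    JobPriority.HIGH -> RequestProfile(
        expedited = false, requiresBatteryNotLow = false, requiresCharging = false
    )
    // "Standard constraints" is interpreted here as battery-not-low (assumption)
    JobPriority.NORMAL -> RequestProfile(
        expedited = false, requiresBatteryNotLow = true, requiresCharging = false
    )
    // Waits for the cheapest moment: battery healthy and device charging
    JobPriority.LOW -> RequestProfile(
        expedited = false, requiresBatteryNotLow = true, requiresCharging = true
    )
}
```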

Long-Running Jobs

For jobs exceeding the 10-minute limit (large file uploads, database migrations):

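class LongRunningUploadWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    // inputData does not change between retries, so the resume index is persisted
    // separately. The "upload_progress" preferences file is illustrative.
    private val progress =
        applicationContext.getSharedPreferences("upload_progress", Context.MODE_PRIVATE)

    override suspend fun doWork(): Result {
        setForeground(createForegroundInfo())

        val chunks = inputData.getStringArray("chunk_ids") ?: return Result.failure()
        val resumeKey = id.toString() // the work ID stays the same across retries
        val startIndex = progress.getInt(resumeKey, inputData.getInt("resume_from", 0))

        for (i in startIndex until chunks.size) {
            val success = uploadChunk(chunks[i])
            if (!success) {
                // Chunks before i are already recorded, so the retry resumes at i
                return Result.retry()
            }
            // Persist after every chunk so process death costs at most one chunk
            progress.edit().putInt(resumeKey, i + 1).commit()
            setProgress(workDataOf("progress" to (i + 1).toFloat() / chunks.size))
        }

        progress.edit().remove(resumeKey).commit()
        return Result.success()
    }

    private fun createForegroundInfo(): ForegroundInfo {
        val notification = NotificationCompat.Builder(applicationContext, CHANNEL_ID)
            .setContentTitle("Uploading...")
            .setSmallIcon(R.drawable.ic_upload)
            .setOngoing(true)
            .build()

        // Apps targeting Android 14 or later must also declare a foreground service
        // type and pass it as the third ForegroundInfo argument.
        return ForegroundInfo(NOTIFICATION_ID, notification)
    }
}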
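
The resume logic is easier to test when it is separated from WorkManager. The sketch below, with hypothetical names throughout, keeps the "next chunk to send" index behind a small store interface; the in-memory store stands in for whatever disk-backed persistence the app uses.

```kotlin
/** Hypothetical persistence seam; a real app would back this with disk storage. */
interface ProgressStore {
    fun load(jobId: String): Int
    fun save(jobId: String, nextIndex: Int)
    fun clear(jobId: String)
}

/** In-memory stand-in, suitable for tests only. */
class InMemoryProgressStore : ProgressStore {
    private val next = mutableMapOf<String, Int>()
    override fun load(jobId: String): Int = next[jobId] ?: 0
    override fun save(jobId: String, nextIndex: Int) { next[jobId] = nextIndex }
    override fun clear(jobId: String) { next.remove(jobId) }
}

/**
 * Uploads chunks starting from the stored index.
 * Returns true when every chunk is done, false when the caller should retry later.
 */
fun uploadResumable(
    jobId: String,
    chunks: List<String>,
    store: ProgressStore,
    uploadChunk: (String) -> Boolean
): Boolean {
    for (i in store.load(jobId) until chunks.size) {
        if (!uploadChunk(chunks[i])) return false
        // Record progress only after the chunk is confirmed uploaded
        store.save(jobId, i + 1)
    }
    store.clear(jobId)
    return true
}
```

A worker would translate the boolean into Result.success() or Result.retry(). Because progress is recorded only after a confirmed upload, a crash mid-chunk re-sends that chunk, so the server side should tolerate duplicate chunks.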

Job Deduplication

Prevent duplicate jobs for the same logical operation:

// ExistingWorkPolicy.KEEP: if a job with this name exists, keep the existing one
workManager.enqueueUniqueWork(
    "sync_user_data",
    ExistingWorkPolicy.KEEP,
    syncWorkRequest
)
 
// ExistingWorkPolicy.REPLACE: cancel existing and start new
workManager.enqueueUniqueWork(
    "upload_profile_photo",
    ExistingWorkPolicy.REPLACE,
    uploadWorkRequest
)
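
The unique name decides what counts as "the same logical operation". A name per operation deduplicates globally, while a name per operation and entity lets uploads for different files proceed independently. A minimal sketch of a naming helper follows; the function and its name format are hypothetical.

```kotlin
/**
 * Builds a stable unique work name. Pass an [entityId] when separate entities
 * (files, conversations, users) should be deduplicated independently.
 */
fun uniqueWorkName(operation: String, entityId: String? = null): String {
    require(operation.isNotBlank()) { "operation must not be blank" }
    return if (entityId == null) operation else "$operation:$entityId"
}
```

Keeping the format in one function also makes it harder to change names by accident between app versions.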

Trade-offs

| Decision | Upside | Downside |
| --- | --- | --- |
| WorkManager as sole scheduler | Reliable, handles Doze, survives reboots | Limited control over exact execution timing |
| Delegating worker pattern | Single worker class, easy registration | Indirection makes stack traces less clear |
| Job chaining | Clean multi-step workflows | Failure in one step requires careful recovery |
| Foreground service for long jobs | Execution can continue past the 10-minute limit | Requires a visible notification, which the user may dismiss |
| Chunked uploads with resume | Survives interruption | Complex state management per chunk |

Failure Modes

  • WorkManager database corruption: WorkManager uses an internal SQLite database. Corruption causes all scheduled jobs to be lost. Mitigation: critical jobs should have a secondary scheduling mechanism (e.g., check and re-enqueue on app start).
  • Job starvation: Low-priority jobs never execute because high-priority jobs keep the queue busy. Mitigation: use unique work names to prevent accumulation, set TTLs on deferrable jobs.
  • Infinite retry loop: A permanent failure (e.g., deleted resource on server) causes indefinite retries. Mitigation: cap retries and move to a dead letter queue after max attempts.
  • Constraint never met: A job requiring WiFi and charging may never execute for some users. Mitigation: fall back to relaxed constraints after a timeout period (e.g., after 48 hours, allow cellular).
  • OEM battery optimization: Some manufacturers aggressively kill background processes beyond stock Android behavior. Mitigation: guide users to disable battery optimization for critical apps, detect and report OEM restrictions.
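
The "constraint never met" mitigation comes down to a time-based decision that can be written as a pure function. The sketch below uses hypothetical names and a local enum rather than WorkManager's NetworkType; an app might evaluate it on app start and re-enqueue the job with the relaxed requirement.

```kotlin
import java.time.Duration
import java.time.Instant

/** Local stand-in for the network constraint a job is scheduled with. */
enum class NetworkRequirement { UNMETERED, CONNECTED }

/**
 * Prefers an unmetered network, but falls back to any connection once the job
 * has waited [relaxAfter] (48 hours by default, as in the example above).
 */
fun effectiveNetworkRequirement(
    enqueuedAt: Instant,
    now: Instant,
    relaxAfter: Duration = Duration.ofHours(48)
): NetworkRequirement {
    val waited = Duration.between(enqueuedAt, now)
    return if (waited >= relaxAfter) NetworkRequirement.CONNECTED
    else NetworkRequirement.UNMETERED
}
```

This requires persisting the original enqueue time alongside the job, since re-enqueueing would otherwise reset the clock.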

Scaling Considerations

  • At high job volumes (100+ pending jobs), WorkManager's scheduling overhead becomes noticeable. Batch small jobs into fewer, larger work units.
  • For jobs that must execute across app updates, use stable unique work names that do not change between versions.
  • Test background job behavior on real devices from multiple manufacturers (Samsung, Xiaomi, Huawei). Emulator behavior does not reflect real-world OEM restrictions.
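
Batching can be as simple as grouping pending item IDs and enqueueing one work request per group. The helper below is a hypothetical sketch; passing IDs rather than payloads also helps stay within the roughly 10 KB size limit on a work request's input Data.

```kotlin
/**
 * Groups pending item IDs into batches of at most [maxPerBatch],
 * so each batch can be handled by a single work request.
 */
fun batchIds(ids: List<String>, maxPerBatch: Int): List<List<String>> {
    require(maxPerBatch > 0) { "maxPerBatch must be positive" }
    return ids.chunked(maxPerBatch)
}
```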

Observability

  • Track: job execution count by type, success/failure/retry rates, execution duration (p50/p95), time from enqueue to execution start, retry count distribution.
  • Alert on: failure rate exceeding 10% for any job type, jobs pending for more than 24 hours, retry count exceeding max for more than 1% of executions.
  • Log: each job execution with type, duration, attempt number, constraints met, and outcome. Use a structured logging format for easy querying.
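
As a rough illustration of the tracking and alerting rules above, the sketch below aggregates execution records into a failure rate and p95 duration per job type, then flags the types above a threshold. The record shape and function names are assumptions; a production app would more likely ship raw events to an analytics backend and aggregate there.

```kotlin
import kotlin.math.ceil

/** One logged job execution (hypothetical record shape). */
data class JobRun(val type: String, val durationMs: Long, val succeeded: Boolean)

data class JobStats(val runs: Int, val failureRate: Double, val p95DurationMs: Long)

/** Aggregates runs per job type. */
fun summarize(runs: List<JobRun>): Map<String, JobStats> =
    runs.groupBy { it.type }.mapValues { (_, group) ->
        val sorted = group.map { it.durationMs }.sorted()
        // Nearest-rank percentile: smallest value with at least 95% of runs at or below it
        val rank = ceil(0.95 * sorted.size).toInt().coerceAtLeast(1)
        JobStats(
            runs = group.size,
            failureRate = group.count { !it.succeeded }.toDouble() / group.size,
            p95DurationMs = sorted[rank - 1]
        )
    }

/** Job types whose failure rate exceeds the threshold (10% by default). */
fun typesToAlert(stats: Map<String, JobStats>, threshold: Double = 0.10): List<String> =
    stats.filterValues { it.failureRate > threshold }.keys.sorted()
```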

Key Takeaways

  • Use WorkManager for all deferrable background work. It handles Doze, app standby, and boot persistence.
  • Use the delegating worker pattern to avoid Worker class proliferation.
  • Cap retries and handle permanent failures. Infinite retry loops waste battery and create noise.
  • Test on real devices from multiple OEMs. Background execution behavior varies dramatically across manufacturers.
  • Break long-running jobs into resumable chunks. Process death is not exceptional; it is expected.

Final Thoughts

Background job systems on Android are fundamentally about working within constraints, not around them. The OS restricts background execution for good reason: battery life. A well-designed job system respects these constraints while still delivering reliable background processing. Fight the OS, and your app will be killed. Work with it, and your jobs will run when they matter most.