Handling Partial Failures in Distributed Mobile Systems
Strategies for handling partial failures in systems where mobile clients interact with multiple backend services, covering compensation, saga patterns, and client-side resilience.
A mobile app that calls three backend services to complete one user action has three independent failure points. When one succeeds and another fails, the system is in an inconsistent state. This post covers how to design for partial failures as an expected condition, not an exceptional one.
Context
Modern mobile apps do not talk to a single monolithic backend. They interact with multiple services: auth, payments, inventory, notifications, analytics. A user action like "place an order" may involve 4-5 service calls. Any one of these can fail independently, leaving the system in a partially completed state.
Problem
Design a system that:
- Handles partial failures without corrupting state
- Provides clear user feedback about what succeeded and what failed
- Recovers automatically where possible
- Maintains data consistency across services
Constraints
| Constraint | Detail |
|---|---|
| Atomicity | No distributed transactions across services; each service has its own database |
| Network | Client may lose connectivity mid-flow |
| Latency | Multi-step flows must complete within user patience (10-15 seconds) |
| Idempotency | Recovery actions must be safe to retry |
| Visibility | Users must understand what happened when a flow partially fails |
Design
Failure Taxonomy
| Failure Type | Description | Recovery |
|---|---|---|
| Transient | Timeout, 503 | Retry with backoff |
| Permanent | 400, 404, business rule violation | Do not retry, compensate |
| Ambiguous | Timeout with no response | Check status, then retry or compensate |
| Cascade | Downstream service failure | Circuit breaker, fallback |
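One way to make this taxonomy executable is a small classifier that maps a failure to a recovery action. A sketch under stated assumptions: the `Failure` record, the status-code groupings, and the `Recovery` names are illustrative, not an API from this post.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Recovery(Enum):
    RETRY_WITH_BACKOFF = auto()
    COMPENSATE = auto()
    CHECK_STATUS_THEN_DECIDE = auto()
    CIRCUIT_BREAK = auto()

@dataclass
class Failure:
    status_code: Optional[int]   # None when no response came back at all
    timed_out: bool = False

def classify(failure: Failure) -> Recovery:
    if failure.timed_out and failure.status_code is None:
        # Ambiguous: the server may or may not have processed the request
        return Recovery.CHECK_STATUS_THEN_DECIDE
    if failure.status_code in (502, 503, 504):
        # Transient: temporarily unavailable, safe to retry with backoff
        return Recovery.RETRY_WITH_BACKOFF
    if failure.status_code is not None and 400 <= failure.status_code < 500:
        # Permanent: a 4xx or business-rule violation will not heal on retry
        return Recovery.COMPENSATE
    # Anything else: treat as a downstream (cascade) failure
    return Recovery.CIRCUIT_BREAK
```

The key distinction is ambiguity: a timeout with no response must never be blindly retried or compensated before the true outcome is known.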
Pattern 1: Orchestrator with Compensation (Saga)
A server-side orchestrator manages the multi-step flow and applies compensating actions on failure:
Client -> Order Orchestrator
              |
              +-> 1. Reserve Inventory (success)
              +-> 2. Charge Payment    (success)
              +-> 3. Create Shipment   (FAILURE)
              |
              +-> Compensate: Refund Payment
              +-> Compensate: Release Inventory
              +-> Return error to client
def saga_place_order(order):
    steps = [
        Step("reserve_inventory", reserve, compensate=release_inventory),
        Step("charge_payment", charge, compensate=refund_payment),
        Step("create_shipment", create_shipment, compensate=cancel_shipment),
        Step("send_confirmation", send_email, compensate=noop),
    ]
    completed = []
    for step in steps:
        result = execute_with_retry(step.action, max_retries=2)
        if result.failed:
            # Compensate in reverse order
            for completed_step in reversed(completed):
                execute_with_retry(completed_step.compensate)
            return OrderResult.FAILED(step.name, result.error)
        completed.append(step)
    return OrderResult.SUCCESS
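The saga leans on `execute_with_retry`, which the pseudocode leaves undefined. A minimal sketch with exponential backoff and full jitter; it assumes steps signal retryable failures via a `TransientError` exception and wraps outcomes in a `Result` with a `failed` flag to match the saga's checks (all of these names are assumptions):

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

class TransientError(Exception):
    """Raised by a step for failures worth retrying (timeouts, 503s)."""

@dataclass
class Result:
    failed: bool
    value: Any = None
    error: Optional[Exception] = None

def execute_with_retry(action: Callable[[], Any], max_retries: int = 2,
                       base_delay: float = 0.5) -> Result:
    """Run action(); retry transient failures with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return Result(failed=False, value=action())
        except TransientError as e:
            if attempt == max_retries:
                # Retries exhausted: report failure so the saga can compensate
                return Result(failed=True, error=e)
            # Full jitter: sleep anywhere in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
        except Exception as e:
            # Permanent failure: retrying will not help
            return Result(failed=True, error=e)
    return Result(failed=True)  # unreachable; keeps the return type total
```

Jitter matters here: if thousands of clients retry on the same fixed schedule after an outage, the synchronized retry wave can knock the recovering service back over.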
Pattern 2: Client-Side Orchestration
When the client drives the flow (e.g., no backend orchestrator exists):
class OrderFlow(
    private val inventoryApi: InventoryApi,
    private val paymentApi: PaymentApi,
    private val orderApi: OrderApi
) {
    suspend fun placeOrder(order: Order): OrderResult {
        // Step 1: Reserve inventory
        val reservation = try {
            inventoryApi.reserve(order.items)
        } catch (e: Exception) {
            return OrderResult.Failed("Could not reserve items", recoverable = true)
        }

        // Step 2: Charge payment
        val payment = try {
            paymentApi.charge(order.paymentMethod, order.total)
        } catch (e: Exception) {
            // Compensate step 1
            safeExecute { inventoryApi.release(reservation.id) }
            return OrderResult.Failed("Payment failed", recoverable = true)
        }

        // Step 3: Create order
        return try {
            val created = orderApi.create(order, reservation.id, payment.id)
            OrderResult.Success(created.orderId)
        } catch (e: Exception) {
            // Compensate steps 1 and 2, in reverse order
            safeExecute { paymentApi.refund(payment.id) }
            safeExecute { inventoryApi.release(reservation.id) }
            OrderResult.Failed("Order creation failed", recoverable = true)
        }
    }

    private suspend fun safeExecute(block: suspend () -> Unit) {
        try {
            block()
        } catch (e: Exception) {
            // Log compensation failure for manual intervention
            ErrorReporter.reportCompensationFailure(e)
        }
    }
}

Pattern 3: Status Polling for Ambiguous Failures
When the client does not know if a request succeeded:
class AmbiguousFailureHandler(
    private val orderApi: OrderApi,
    private val pollingInterval: Long = 2000,
    private val maxPolls: Int = 5
) {
    suspend fun submitWithConfirmation(order: Order): OrderResult {
        val idempotencyKey = UUID.randomUUID().toString()
        return try {
            orderApi.submit(order, idempotencyKey)
        } catch (e: IOException) {
            // Ambiguous: did the server receive and process it?
            pollForResult(idempotencyKey)
        }
    }

    private suspend fun pollForResult(idempotencyKey: String): OrderResult {
        repeat(maxPolls) {
            delay(pollingInterval)
            try {
                val status = orderApi.getStatusByIdempotencyKey(idempotencyKey)
                when (status.state) {
                    "completed" -> return OrderResult.Success(status.orderId)
                    "failed" -> return OrderResult.Failed(status.reason)
                    // "processing", "not_found", or anything else: keep polling
                }
            } catch (e: Exception) {
                // Network still unstable; poll again after the next delay
            }
        }
        return OrderResult.Unknown("Please check your orders page")
    }
}

Pattern 4: Outbox Pattern for Reliable Publishing
When a service must update its database and publish an event (or call another service) atomically:
Service writes to its database AND to an outbox table in the same transaction.
A separate process reads the outbox and publishes events/calls downstream services.
Transaction:
1. INSERT INTO orders (id, ...) VALUES (...)
2. INSERT INTO outbox (event_type, payload) VALUES ('order_created', {...})
COMMIT
Outbox Processor (background):
1. SELECT * FROM outbox WHERE published = false ORDER BY created_at
2. For each: publish event, mark as published
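A runnable sketch of both halves, using SQLite for illustration; the table shapes, `place_order`, and the `publish` callback are assumptions, not a prescribed schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def place_order(order_id: str, total_cents: int) -> None:
    # Both inserts commit atomically: either the order row and its event
    # exist together, or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"order_id": order_id})),
        )

def drain_outbox(publish) -> int:
    # Background processor: publish unpublished events in order, then mark them.
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # downstream must be idempotent
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Note the delivery guarantee this buys: at-least-once, not exactly-once. If the processor crashes between publishing and marking a row, the event is published again on the next sweep, which is why downstream consumers must deduplicate.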
Client-Side Resilience
class ResilientFlow(
    private val stateStore: FlowStateStore // Persisted to disk
) {
    suspend fun executeWithRecovery(flowId: String, steps: List<FlowStep>): FlowResult {
        // Resume from the last successful step if recovering from a crash
        val savedState = stateStore.get(flowId)
        val startFrom = savedState?.lastCompletedStep?.plus(1) ?: 0
        // Accumulate results locally so each save reflects every completed
        // step, not just the ones restored from disk
        val results = savedState?.completedResults.orEmpty().toMutableList()

        for (i in startFrom until steps.size) {
            val step = steps[i]
            val result = step.execute()
            if (result.failed) {
                stateStore.save(flowId, FlowState(
                    lastCompletedStep = i - 1,
                    completedResults = results,
                    failedStep = i,
                    error = result.error
                ))
                return compensate(steps, i - 1, results)
            }
            results += result
            stateStore.save(flowId, FlowState(
                lastCompletedStep = i,
                completedResults = results
            ))
        }
        stateStore.clear(flowId)
        return FlowResult.Success
    }
}

Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Server-side saga | Central control, easier to reason about | Single point of failure (orchestrator) |
| Client-side orchestration | No orchestrator dependency | Client process death interrupts flow |
| Status polling | Resolves ambiguous failures | Adds latency, extra server load |
| Outbox pattern | Atomic local write + downstream propagation | Outbox table overhead, eventual consistency |
| Compensation | Recovers from partial failures | Compensating actions can themselves fail |
Failure Modes
- Compensation failure: A refund API call fails after a failed order. Mitigation: retry compensations with exponential backoff. After max retries, create a manual intervention ticket.
- Client crash mid-flow: The app is killed between step 2 and step 3. Mitigation: persist flow state to disk. On next launch, detect incomplete flows and resume or compensate.
- Phantom completion: Server processes the request but the client never receives the response. The client retries, creating a duplicate. Mitigation: idempotency keys on every mutating request.
- Compensation ordering: Compensations must run in reverse order. If step 2 compensation depends on step 1 still being active, running them out of order fails. Mitigation: design compensations to be order-independent where possible.
- Stuck in-progress state: A flow remains in "processing" state indefinitely. Mitigation: set a maximum age for in-progress flows. After expiry, trigger compensation automatically.
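The last failure mode can be handled with a periodic sweep. A minimal sketch, assuming flows live in a store keyed by id with a `state` and a `started_at` timestamp (the store shape and function names are illustrative):

```python
import time

MAX_FLOW_AGE_SECONDS = 15 * 60  # flows older than this are considered stuck

def sweep_stuck_flows(flows: dict, compensate, now: float = None) -> list:
    """Expire flows stuck in 'processing' and trigger their compensation."""
    now = now if now is not None else time.time()
    expired = [
        flow_id for flow_id, flow in flows.items()
        if flow["state"] == "processing"
        and now - flow["started_at"] > MAX_FLOW_AGE_SECONDS
    ]
    for flow_id in expired:
        # Mark before compensating so a crash mid-sweep does not re-expire it;
        # compensate must still be idempotent in case the sweep runs twice.
        flows[flow_id]["state"] = "expired"
        compensate(flow_id)
    return expired
```

The expiry threshold should match the alerting threshold for stuck flows, so automatic cleanup fires before a human is paged.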
Scaling Considerations
- Server-side sagas centralize coordination but add a service dependency. At high throughput, the orchestrator becomes a bottleneck. Scale it horizontally by partitioning by order ID.
- Client-side orchestration distributes load but is less reliable. Use it for non-critical flows; reserve server-side sagas for financial transactions.
- The outbox pattern requires periodic cleanup. Archive or delete processed outbox entries to prevent table bloat.
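Partitioning the orchestrator by order ID can be as simple as a stable hash, so retries of the same order always route to the same worker (the function name and worker count are illustrative):

```python
import hashlib

def worker_for(order_id: str, num_workers: int) -> int:
    # Use a cryptographic hash so the mapping is stable across processes
    # and restarts (Python's built-in hash() is salted per process)
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Stability is the point: it serializes all steps of one order onto one worker, which avoids two workers racing to compensate and complete the same saga.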
Observability
- Track: partial failure rate per flow type, compensation success rate, time-to-recovery, stuck flow count, idempotent retry rate.
- Alert on: compensation failure rate exceeding 1%, flows stuck in processing for more than 15 minutes, manual intervention queue depth exceeding threshold.
- Dashboard: flow completion funnel showing drop-off at each step, with failure reasons.
Key Takeaways
- Treat partial failures as a normal operating condition. Every multi-service flow will experience them.
- Design every step with a compensating action. If you cannot undo a step, you cannot safely attempt it as part of a multi-step flow.
- Persist flow state to survive process death. On mobile, the process can die at any moment.
- Use idempotency keys on every mutating request. Ambiguous failures demand safe retries.
- Prefer server-side orchestration for critical flows (payments, orders). Client-side orchestration is acceptable for non-critical flows where eventual consistency is tolerable.
Final Thoughts
Distributed systems fail partially by nature. The question is not whether your multi-step flow will encounter a partial failure, but how gracefully it handles one. Design compensation into every flow from the start. Retrofitting it after the first production incident is significantly more painful.