Handling Partial Failures in Distributed Mobile Systems

Dhruval Dhameliya · September 3, 2025 · 8 min read

Strategies for handling partial failures in systems where mobile clients interact with multiple backend services, covering compensation, saga patterns, and client-side resilience.

A mobile app that calls three backend services to complete one user action has three independent failure points. When one succeeds and another fails, the system is in an inconsistent state. This post covers how to design for partial failures as an expected condition, not an exceptional one.

Related: Mobile Analytics Pipeline: From App Event to Dashboard.

See also: How I'd Design a Mobile Configuration System at Scale.

Context

Modern mobile apps do not talk to a single monolithic backend. They interact with multiple services: auth, payments, inventory, notifications, analytics. A user action like "place an order" may involve 4-5 service calls. Any one of these can fail independently, leaving the system in a partially completed state.

Problem

Design a system that:

  • Handles partial failures without corrupting state
  • Provides clear user feedback about what succeeded and what failed
  • Recovers automatically where possible
  • Maintains data consistency across services

Constraints

Constraint | Detail
Atomicity | No distributed transactions across services; each service has its own database
Network | Client may lose connectivity mid-flow
Latency | Multi-step flows must complete within user patience (10-15 seconds)
Idempotency | Recovery actions must be safe to retry
Visibility | Users must understand what happened when a flow partially fails

Design

Failure Taxonomy

Failure Type | Description | Recovery
Transient | Timeout, 503 | Retry with backoff
Permanent | 400, 404, business rule violation | Do not retry, compensate
Ambiguous | Timeout with no response | Check status, then retry or compensate
Cascade | Downstream service failure | Circuit breaker, fallback

Pattern 1: Orchestrator with Compensation (Saga)

A server-side orchestrator manages the multi-step flow and applies compensating actions on failure:

Client -> Order Orchestrator
              |
              +-> 1. Reserve Inventory (success)
              +-> 2. Charge Payment (success)
              +-> 3. Create Shipment (FAILURE)
              |
              +-> Compensate: Refund Payment
              +-> Compensate: Release Inventory
              +-> Return error to client

def saga_place_order(order):
    steps = [
        Step("reserve_inventory", reserve, compensate=release_inventory),
        Step("charge_payment", charge, compensate=refund_payment),
        Step("create_shipment", create_shipment, compensate=cancel_shipment),
        Step("send_confirmation", send_email, compensate=noop)
    ]

    completed = []
    for step in steps:
        result = execute_with_retry(step.action, max_retries=2)
        if result.failed:
            # Compensate in reverse order
            for completed_step in reversed(completed):
                execute_with_retry(completed_step.compensate)
            return OrderResult.FAILED(step.name, result.error)
        completed.append(step)

    return OrderResult.SUCCESS

Pattern 2: Client-Side Orchestration

When the client drives the flow (e.g., no backend orchestrator exists):

class OrderFlow(
    private val inventoryApi: InventoryApi,
    private val paymentApi: PaymentApi,
    private val orderApi: OrderApi
) {
    suspend fun placeOrder(order: Order): OrderResult {
        // Step 1: Reserve inventory
        val reservation = try {
            inventoryApi.reserve(order.items)
        } catch (e: Exception) {
            return OrderResult.Failed("Could not reserve items", recoverable = true)
        }
 
        // Step 2: Charge payment
        val payment = try {
            paymentApi.charge(order.paymentMethod, order.total)
        } catch (e: Exception) {
            // Compensate step 1
            safeExecute { inventoryApi.release(reservation.id) }
            return OrderResult.Failed("Payment failed", recoverable = true)
        }
 
        // Step 3: Create order
        return try {
            val created = orderApi.create(order, reservation.id, payment.id)
            OrderResult.Success(created.orderId)
        } catch (e: Exception) {
            // Compensate steps 1 and 2
            safeExecute { paymentApi.refund(payment.id) }
            safeExecute { inventoryApi.release(reservation.id) }
            OrderResult.Failed("Order creation failed", recoverable = true)
        }
    }
 
    private suspend fun safeExecute(block: suspend () -> Unit) {
        try {
            block()
        } catch (e: Exception) {
            // Log compensation failure for manual intervention
            ErrorReporter.reportCompensationFailure(e)
        }
    }
}

Pattern 3: Status Polling for Ambiguous Failures

When the client does not know if a request succeeded:

class AmbiguousFailureHandler(
    private val orderApi: OrderApi,
    private val pollingInterval: Long = 2000,
    private val maxPolls: Int = 5
) {
    suspend fun submitWithConfirmation(order: Order): OrderResult {
        val idempotencyKey = UUID.randomUUID().toString()
 
        return try {
            orderApi.submit(order, idempotencyKey)
        } catch (e: IOException) {
            // Ambiguous: did the server receive and process it?
            pollForResult(idempotencyKey)
        }
    }
 
    private suspend fun pollForResult(idempotencyKey: String): OrderResult {
        repeat(maxPolls) {
            delay(pollingInterval)
            val status = try {
                orderApi.getStatusByIdempotencyKey(idempotencyKey)
            } catch (e: Exception) {
                return@repeat // Network still unstable; try the next poll
            }
            when (status.state) {
                "completed" -> return OrderResult.Success(status.orderId)
                "failed" -> return OrderResult.Failed(status.reason)
                // "processing" or "not_found": not resolved yet, keep polling
            }
        }
        return OrderResult.Unknown("Please check your orders page")
    }
}

Pattern 4: Outbox Pattern for Reliable Publishing

When a service must update its database and publish an event (or call another service) atomically:

Service writes to its database AND to an outbox table in the same transaction.
A separate process reads the outbox and publishes events/calls downstream services.

Transaction:
    1. INSERT INTO orders (id, ...) VALUES (...)
    2. INSERT INTO outbox (event_type, payload) VALUES ('order_created', {...})
    COMMIT

Outbox Processor (background):
    1. SELECT * FROM outbox WHERE published = false ORDER BY created_at
    2. For each: publish event, mark as published
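A minimal runnable sketch of the pattern using SQLite. The table and column names are illustrative; the point is that the order row and the outbox row commit in a single local transaction, and a separate pass drains unpublished entries.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT, payload TEXT, published INTEGER DEFAULT 0
    );
""")

def place_order(order_id: str, total: float) -> None:
    # One local transaction: the order and its outbox entry
    # commit together or roll back together.
    with db:
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"order_id": order_id, "total": total})),
        )

def drain_outbox(publish) -> int:
    # Background processor: publish each pending event, then mark it published.
    rows = db.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Note that the processor can crash after publishing but before marking the row, so delivery is at-least-once: downstream consumers must deduplicate, typically by event ID.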

Client-Side Resilience

class ResilientFlow(
    private val stateStore: FlowStateStore // Persisted to disk
) {
    suspend fun executeWithRecovery(flowId: String, steps: List<FlowStep>): FlowResult {
        // Resume from the last successful step if recovering from a crash
        val savedState = stateStore.get(flowId)
        val startFrom = savedState?.lastCompletedStep?.plus(1) ?: 0
        var results = savedState?.completedResults.orEmpty()

        for (i in startFrom until steps.size) {
            val result = steps[i].execute()

            if (result.failed) {
                stateStore.save(flowId, FlowState(
                    lastCompletedStep = i - 1,
                    completedResults = results,
                    failedStep = i,
                    error = result.error
                ))
                return compensate(steps, i - 1, results)
            }

            // Accumulate results locally; savedState is a stale snapshot
            results = results + result
            stateStore.save(flowId, FlowState(
                lastCompletedStep = i,
                completedResults = results
            ))
        }

        stateStore.clear(flowId)
        return FlowResult.Success
    }
}

Trade-offs

Decision | Upside | Downside
Server-side saga | Central control, easier to reason about | Single point of failure (orchestrator)
Client-side orchestration | No orchestrator dependency | Client process death interrupts flow
Status polling | Resolves ambiguous failures | Adds latency, extra server load
Outbox pattern | Atomic local write + downstream propagation | Outbox table overhead, eventual consistency
Compensation | Recovers from partial failures | Compensating actions can themselves fail

Failure Modes

  • Compensation failure: A refund API call fails after a failed order. Mitigation: retry compensations with exponential backoff. After max retries, create a manual intervention ticket.
  • Client crash mid-flow: The app is killed between step 2 and step 3. Mitigation: persist flow state to disk. On next launch, detect incomplete flows and resume or compensate.
  • Phantom completion: Server processes the request but the client never receives the response. The client retries, creating a duplicate. Mitigation: idempotency keys on every mutating request.
  • Compensation ordering: Compensations must run in reverse order. If step 2 compensation depends on step 1 still being active, running them out of order fails. Mitigation: design compensations to be order-independent where possible.
  • Stuck in-progress state: A flow remains in "processing" state indefinitely. Mitigation: set a maximum age for in-progress flows. After expiry, trigger compensation automatically.
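The first mitigation above (retry compensations with exponential backoff, then escalate) might look like the following sketch. The `enqueue_manual_ticket` hook and the delay parameters are illustrative assumptions; `sleep` is injectable so the backoff is testable.

```python
import time

def run_compensation(compensate, max_retries=4, base_delay=0.5,
                     enqueue_manual_ticket=print, sleep=time.sleep):
    """Retry a compensating action with exponential backoff; after
    max_retries failures, escalate to a manual intervention queue."""
    last_error = None
    for attempt in range(max_retries):
        try:
            compensate()
            return True
        except Exception as err:
            last_error = err
            sleep(base_delay * (2 ** attempt))   # 0.5s, 1s, 2s, 4s, ...
    enqueue_manual_ticket(
        f"compensation failed after {max_retries} tries: {last_error}"
    )
    return False
```

Escalation is the essential part: a compensation that silently keeps failing is exactly how money gets stuck between a charged payment and a cancelled order.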

Scaling Considerations

  • Server-side sagas centralize coordination but add a service dependency. At high throughput, the orchestrator becomes a bottleneck. Scale it horizontally by partitioning by order ID.
  • Client-side orchestration distributes load but is less reliable. Use it for non-critical flows; reserve server-side sagas for financial transactions.
  • The outbox pattern requires periodic cleanup. Archive or delete processed outbox entries to prevent table bloat.
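Partitioning orchestrator traffic by order ID can be as simple as a stable hash. This is a sketch; real deployments often use consistent hashing so partitions can be added without reshuffling every key.

```python
import hashlib

def partition_for(order_id: str, num_partitions: int) -> int:
    """Stable partition assignment: the same order always routes to the
    same orchestrator instance, so its saga state lives in one place."""
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keying on order ID (rather than, say, user ID) keeps each saga's steps and compensations serialized on one instance without creating hot partitions for heavy users.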

Observability

  • Track: partial failure rate per flow type, compensation success rate, time-to-recovery, stuck flow count, idempotent retry rate.
  • Alert on: compensation failure rate exceeding 1%, flows stuck in processing for more than 15 minutes, manual intervention queue depth exceeding threshold.
  • Dashboard: flow completion funnel showing drop-off at each step, with failure reasons.

Key Takeaways

  • Treat partial failures as a normal operating condition. Every multi-service flow will experience them.
  • Design every step with a compensating action. If you cannot undo a step, you cannot safely attempt it as part of a multi-step flow.
  • Persist flow state to survive process death. On mobile, the process can die at any moment.
  • Use idempotency keys on every mutating request. Ambiguous failures demand safe retries.
  • Prefer server-side orchestration for critical flows (payments, orders). Client-side orchestration is acceptable for non-critical flows where eventual consistency is tolerable.

Final Thoughts

Distributed systems fail partially by nature. The question is not whether your multi-step flow will encounter a partial failure, but how gracefully it handles one. Design compensation into every flow from the start. Retrofitting it after the first production incident is significantly more painful.
