Handling Partial Failures in Distributed Mobile Systems
Strategies for handling partial failures in systems where mobile clients interact with multiple backend services, covering compensation, saga patterns, and client-side resilience.
A mobile app that calls three backend services to complete one user action has three independent failure points. When one succeeds and another fails, the system is in an inconsistent state. This post covers how to design for partial failures as an expected condition, not an exceptional one.
Context
Modern mobile apps do not talk to a single monolithic backend. They interact with multiple services: auth, payments, inventory, notifications, analytics. A user action like "place an order" may involve 4-5 service calls. Any one of these can fail independently, leaving the system in a partially completed state.
Problem
Design a system that:
- Handles partial failures without corrupting state
- Provides clear user feedback about what succeeded and what failed
- Recovers automatically where possible
- Maintains data consistency across services
Constraints
| Constraint | Detail |
|---|---|
| Atomicity | No distributed transactions across services; each service has its own database |
| Network | Client may lose connectivity mid-flow |
| Latency | Multi-step flows must complete within user patience (10-15 seconds) |
| Idempotency | Recovery actions must be safe to retry |
| Visibility | Users must understand what happened when a flow partially fails |
Design
Failure Taxonomy
| Failure Type | Description | Recovery |
|---|---|---|
| Transient | Timeout, 503 | Retry with backoff |
| Permanent | 400, 404, business rule violation | Do not retry, compensate |
| Ambiguous | Timeout with no response | Check status, then retry or compensate |
| Cascade | Downstream service failure | Circuit breaker, fallback |
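One way to make this taxonomy executable is a small classifier that maps a failure to a recovery action. A sketch under stated assumptions: the `Failure` record, the status-code groupings, and the `Recovery` names are illustrative, not an API from this post.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Recovery(Enum):
    RETRY_WITH_BACKOFF = auto()
    COMPENSATE = auto()
    CHECK_STATUS_THEN_DECIDE = auto()
    CIRCUIT_BREAK = auto()

@dataclass
class Failure:
    status_code: Optional[int]   # None when no response came back at all
    timed_out: bool = False

def classify(failure: Failure) -> Recovery:
    if failure.timed_out and failure.status_code is None:
        # Ambiguous: the server may or may not have processed the request
        return Recovery.CHECK_STATUS_THEN_DECIDE
    if failure.status_code in (502, 503, 504):
        # Transient: temporarily unavailable, safe to retry with backoff
        return Recovery.RETRY_WITH_BACKOFF
    if failure.status_code is not None and 400 <= failure.status_code < 500:
        # Permanent: a 4xx or business-rule violation will not heal on retry
        return Recovery.COMPENSATE
    # Anything else: treat as a downstream (cascade) failure
    return Recovery.CIRCUIT_BREAK
```

The key distinction is ambiguity: a timeout with no response must never be blindly retried or compensated before the true outcome is known.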
Pattern 1: Orchestrator with Compensation (Saga)
A server-side orchestrator manages the multi-step flow and applies compensating actions on failure:
Client -> Order Orchestrator
              |
              +-> 1. Reserve Inventory (success)
              +-> 2. Charge Payment    (success)
              +-> 3. Create Shipment   (FAILURE)
              |
              +-> Compensate: Refund Payment
              +-> Compensate: Release Inventory
              +-> Return error to client
def saga_place_order(order):
    steps = [
        Step("reserve_inventory", reserve, compensate=release_inventory),
        Step("charge_payment", charge, compensate=refund_payment),
        Step("create_shipment", create_shipment, compensate=cancel_shipment),
        Step("send_confirmation", send_email, compensate=noop),
    ]
    completed = []
    for step in steps:
        result = execute_with_retry(step.action, max_retries=2)
        if result.failed:
            # Compensate in reverse order
            for completed_step in reversed(completed):
                execute_with_retry(completed_step.compensate)
            return OrderResult.FAILED(step.name, result.error)
        completed.append(step)
    return OrderResult.SUCCESS
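The saga leans on `execute_with_retry`, which the pseudocode leaves undefined. A minimal sketch with exponential backoff and full jitter; it assumes steps signal retryable failures via a `TransientError` exception and wraps outcomes in a `Result` with a `failed` flag to match the saga's checks (all of these names are assumptions):

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional

class TransientError(Exception):
    """Raised by a step for failures worth retrying (timeouts, 503s)."""

@dataclass
class Result:
    failed: bool
    value: Any = None
    error: Optional[Exception] = None

def execute_with_retry(action: Callable[[], Any], max_retries: int = 2,
                       base_delay: float = 0.5) -> Result:
    """Run action(); retry transient failures with exponential backoff + jitter."""
    for attempt in range(max_retries + 1):
        try:
            return Result(failed=False, value=action())
        except TransientError as e:
            if attempt == max_retries:
                # Retries exhausted: report failure so the saga can compensate
                return Result(failed=True, error=e)
            # Full jitter: sleep anywhere in [0, base_delay * 2^attempt]
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
        except Exception as e:
            # Permanent failure: retrying will not help
            return Result(failed=True, error=e)
    return Result(failed=True)  # unreachable; keeps the return type total
```

Jitter matters here: if thousands of clients retry on the same fixed schedule after an outage, the synchronized retry wave can knock the recovering service back over.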
Pattern 2: Client-Side Orchestration
When the client drives the flow (e.g., no backend orchestrator exists):
class OrderFlow(
    private val inventoryApi: InventoryApi,
    private val paymentApi: PaymentApi,
    private val orderApi: OrderApi
) {
    suspend fun placeOrder(order: Order): OrderResult {
        // Step 1: Reserve inventory
        val reservation = try {
            inventoryApi.reserve(order.items)
        } catch (e: Exception) {
            return OrderResult.Failed("Could not reserve items", recoverable = true)
        }

        // Step 2: Charge payment
        val payment = try {
            paymentApi.charge(order.paymentMethod, order.total)
        } catch (e: Exception) {
            // Compensate step 1
            safeExecute { inventoryApi.release(reservation.id) }
            return OrderResult.Failed("Payment failed", recoverable = true)
        }

        // Step 3: Create order
        return try {
            val created = orderApi.create(order, reservation.id, payment.id)
            OrderResult.Success(created.orderId)
        } catch (e: Exception) {
            // Compensate steps 1 and 2, in reverse order
            safeExecute { paymentApi.refund(payment.id) }
            safeExecute { inventoryApi.release(reservation.id) }
            OrderResult.Failed("Order creation failed", recoverable = true)
        }
    }

    private suspend fun safeExecute(block: suspend () -> Unit) {
        try {
            block()
        } catch (e: Exception) {
            // Log compensation failure for manual intervention
            ErrorReporter.reportCompensationFailure(e)
        }
    }
}

Pattern 3: Status Polling for Ambiguous Failures
When the client does not know if a request succeeded:
class AmbiguousFailureHandler(
    private val orderApi: OrderApi,
    private val pollingInterval: Long = 2000,
    private val maxPolls: Int = 5
) {
    suspend fun submitWithConfirmation(order: Order): OrderResult {
        val idempotencyKey = UUID.randomUUID().toString()
        return try {
            orderApi.submit(order, idempotencyKey)
        } catch (e: IOException) {
            // Ambiguous: did the server receive and process it?
            pollForResult(idempotencyKey)
        }
    }

    private suspend fun pollForResult(idempotencyKey: String): OrderResult {
        repeat(maxPolls) {
            delay(pollingInterval)
            try {
                val status = orderApi.getStatusByIdempotencyKey(idempotencyKey)
                when (status.state) {
                    "completed" -> return OrderResult.Success(status.orderId)
                    "failed" -> return OrderResult.Failed(status.reason)
                    // "processing", "not_found", or anything else: keep polling
                }
            } catch (e: Exception) {
                // Network still unstable; poll again after the next delay
            }
        }
        return OrderResult.Unknown("Please check your orders page")
    }
}

Pattern 4: Outbox Pattern for Reliable Publishing
When a service must update its database and publish an event (or call another service) atomically:
Service writes to its database AND to an outbox table in the same transaction.
A separate process reads the outbox and publishes events/calls downstream services.
Transaction:
1. INSERT INTO orders (id, ...) VALUES (...)
2. INSERT INTO outbox (event_type, payload) VALUES ('order_created', {...})
COMMIT
Outbox Processor (background):
1. SELECT * FROM outbox WHERE published = false ORDER BY created_at
2. For each: publish event, mark as published
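A runnable sketch of both halves, using SQLite for illustration; the table shapes, `place_order`, and the `publish` callback are assumptions, not a prescribed schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def place_order(order_id: str, total_cents: int) -> None:
    # Both inserts commit atomically: either the order row and its event
    # exist together, or neither does.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("order_created", json.dumps({"order_id": order_id})),
        )

def drain_outbox(publish) -> int:
    # Background processor: publish unpublished events in order, then mark them.
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # downstream must be idempotent
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

Note the delivery guarantee this buys: at-least-once, not exactly-once. If the processor crashes between publishing and marking a row, the event is published again on the next sweep, which is why downstream consumers must deduplicate.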
Client-Side Resilience
class ResilientFlow(
    private val stateStore: FlowStateStore // Persisted to disk
) {
    suspend fun executeWithRecovery(flowId: String, steps: List<FlowStep>): FlowResult {
        // Resume from the last successful step if recovering from a crash
        val savedState = stateStore.get(flowId)
        val startFrom = savedState?.lastCompletedStep?.plus(1) ?: 0
        // Accumulate results locally so each save reflects every completed
        // step, not just the ones restored from disk
        val results = savedState?.completedResults.orEmpty().toMutableList()

        for (i in startFrom until steps.size) {
            val step = steps[i]
            val result = step.execute()
            if (result.failed) {
                stateStore.save(flowId, FlowState(
                    lastCompletedStep = i - 1,
                    completedResults = results,
                    failedStep = i,
                    error = result.error
                ))
                return compensate(steps, i - 1, results)
            }
            results += result
            stateStore.save(flowId, FlowState(
                lastCompletedStep = i,
                completedResults = results
            ))
        }
        stateStore.clear(flowId)
        return FlowResult.Success
    }
}

Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Server-side saga | Central control, easier to reason about | Single point of failure (orchestrator) |
| Client-side orchestration | No orchestrator dependency | Client process death interrupts flow |
| Status polling | Resolves ambiguous failures | Adds latency, extra server load |
| Outbox pattern | Atomic local write + downstream propagation | Outbox table overhead, eventual consistency |
| Compensation | Recovers from partial failures | Compensating actions can themselves fail |
Failure Modes
- Compensation failure: A refund API call fails after a failed order. Mitigation: retry compensations with exponential backoff. After max retries, create a manual intervention ticket.
- Client crash mid-flow: The app is killed between step 2 and step 3. Mitigation: persist flow state to disk. On next launch, detect incomplete flows and resume or compensate.
- Phantom completion: Server processes the request but the client never receives the response. The client retries, creating a duplicate. Mitigation: idempotency keys on every mutating request.
- Compensation ordering: Compensations must run in reverse order. If step 2 compensation depends on step 1 still being active, running them out of order fails. Mitigation: design compensations to be order-independent where possible.
- Stuck in-progress state: A flow remains in "processing" state indefinitely. Mitigation: set a maximum age for in-progress flows. After expiry, trigger compensation automatically.
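The last failure mode can be handled with a periodic sweep. A minimal sketch, assuming flows live in a store keyed by id with a `state` and a `started_at` timestamp (the store shape and function names are illustrative):

```python
import time

MAX_FLOW_AGE_SECONDS = 15 * 60  # flows older than this are considered stuck

def sweep_stuck_flows(flows: dict, compensate, now: float = None) -> list:
    """Expire flows stuck in 'processing' and trigger their compensation."""
    now = now if now is not None else time.time()
    expired = [
        flow_id for flow_id, flow in flows.items()
        if flow["state"] == "processing"
        and now - flow["started_at"] > MAX_FLOW_AGE_SECONDS
    ]
    for flow_id in expired:
        # Mark before compensating so a crash mid-sweep does not re-expire it;
        # compensate must still be idempotent in case the sweep runs twice.
        flows[flow_id]["state"] = "expired"
        compensate(flow_id)
    return expired
```

The expiry threshold should match the alerting threshold for stuck flows, so automatic cleanup fires before a human is paged.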
Scaling Considerations
- Server-side sagas centralize coordination but add a service dependency. At high throughput, the orchestrator becomes a bottleneck. Scale it horizontally by partitioning by order ID.
- Client-side orchestration distributes load but is less reliable. Use it for non-critical flows; reserve server-side sagas for financial transactions.
- The outbox pattern requires periodic cleanup. Archive or delete processed outbox entries to prevent table bloat.
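Partitioning the orchestrator by order ID can be as simple as a stable hash, so retries of the same order always route to the same worker (the function name and worker count are illustrative):

```python
import hashlib

def worker_for(order_id: str, num_workers: int) -> int:
    # Use a cryptographic hash so the mapping is stable across processes
    # and restarts (Python's built-in hash() is salted per process)
    digest = hashlib.sha256(order_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers
```

Stability is the point: it serializes all steps of one order onto one worker, which avoids two workers racing to compensate and complete the same saga.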
Observability
- Track: partial failure rate per flow type, compensation success rate, time-to-recovery, stuck flow count, idempotent retry rate.
- Alert on: compensation failure rate exceeding 1%, flows stuck in processing for more than 15 minutes, manual intervention queue depth exceeding threshold.
- Dashboard: flow completion funnel showing drop-off at each step, with failure reasons.
Key Takeaways
- Treat partial failures as a normal operating condition. Every multi-service flow will experience them.
- Design every step with a compensating action. If you cannot undo a step, you cannot safely attempt it as part of a multi-step flow.
- Persist flow state to survive process death. On mobile, the process can die at any moment.
- Use idempotency keys on every mutating request. Ambiguous failures demand safe retries.
- Prefer server-side orchestration for critical flows (payments, orders). Client-side orchestration is acceptable for non-critical flows where eventual consistency is tolerable.
Final Thoughts
Distributed systems fail partially by nature. The question is not whether your multi-step flow will encounter a partial failure, but how gracefully it handles one. Design compensation into every flow from the start. Retrofitting it after the first production incident is significantly more painful.