Designing an Experimentation Platform for Mobile Apps
System design for a mobile experimentation platform covering assignment, exposure tracking, metric collection, statistical analysis, and guardrail metrics.
Running experiments on mobile is harder than on the web. You cannot change assignments mid-session without invalidating data. You cannot guarantee immediate metric collection. You cannot revert a treatment instantly. This post covers how to design an experimentation platform that accounts for these constraints.
Related: Mobile Analytics Pipeline: From App Event to Dashboard.
Context
An experimentation platform enables product teams to run A/B tests, measure impact, and make data-driven decisions. On mobile, the platform must handle assignment consistency, delayed metric collection (offline users), app version fragmentation, and long experiment durations due to slow update cycles.
Problem
Design an experimentation platform that:
- Assigns users to experiment variants deterministically
- Tracks exposure accurately (not just assignment, but actual treatment display)
- Collects metrics with eventual consistency, handling late-arriving data
- Provides statistically valid results with guardrail metrics
- Supports mutual exclusion and layered experiments
See also: Event Tracking System Design for Android Applications.
Constraints
| Constraint | Detail |
|---|---|
| Assignment latency | Must resolve in under 5ms on-device |
| Consistency | Assignment must not change within a session or across sessions for the same experiment |
| Metric delay | Events may arrive 72+ hours late from offline devices |
| Sample size | Experiments need sufficient power; mobile DAU determines minimum experiment duration |
| Interaction effects | Multiple concurrent experiments must not interfere with each other |
Design
Assignment Architecture
App Start -> Fetch Experiment Config (cached) -> Local Evaluator -> Assignment Map
The experiment config is a JSON payload fetched from the server (via CDN), containing all active experiments, their variants, targeting rules, and traffic allocations. The evaluator runs locally.
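A minimal sketch of the cached-config pattern (class and function names here are illustrative, not from a real SDK): the client always serves the last successfully fetched payload and refreshes it opportunistically, so assignment never blocks on the network and keeps working offline.

```kotlin
// Hypothetical sketch: serve the last good config, refresh opportunistically.
// `fetchRemote` stands in for the CDN call and may fail or return null.
class ExperimentConfigStore(private val fetchRemote: () -> String?) {
    @Volatile private var cached: String? = null

    // Called at app start; a failed fetch keeps the stale payload.
    fun refresh() {
        val fresh = try { fetchRemote() } catch (_: Exception) { null }
        if (fresh != null) cached = fresh
    }

    // The local evaluator reads whatever config is currently cached.
    fun current(): String? = cached
}
```

On fetch failure the evaluator keeps running against the stale payload; the trade-off is that new experiments cannot start until a refresh succeeds.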
data class Experiment(
val id: String,
val key: String,
val variants: List<Variant>,
val trafficAllocation: Int, // 0-100, percentage of eligible users in experiment
val targetingRules: List<TargetingRule>,
val layer: String, // Mutual exclusion layer
val status: ExperimentStatus // RUNNING, PAUSED, COMPLETED
)
data class Variant(
val key: String, // "control", "treatment_a", "treatment_b"
val weight: Int // Relative weight for assignment
)Deterministic Assignment
Assignment uses a hash function, not random number generation. This ensures the same user always gets the same variant without server-side state.
object ExperimentAssigner {
fun assign(userId: String, experiment: Experiment): Variant? {
if (experiment.status != ExperimentStatus.RUNNING) return null
        // Traffic allocation: is this user in the experiment?
        // murmurHash is assumed to return a non-negative value; Kotlin's %
        // yields a negative result for a negative operand, which would break
        // these bucket checks.
        val trafficHash = murmurHash("$userId:${experiment.key}:traffic") % 100
if (trafficHash >= experiment.trafficAllocation) return null
// Variant assignment: which variant?
val variantHash = murmurHash("$userId:${experiment.key}:variant") % 100
var cumulative = 0
for (variant in experiment.variants) {
cumulative += variant.weight
if (variantHash < cumulative) return variant
}
return experiment.variants.last()
}
}
Two separate hashes: one for traffic allocation, one for variant assignment. This ensures that changing traffic allocation (e.g., ramping from 10% to 50%) does not reassign existing users to different variants.
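To make the ramp-up property concrete, here is a toy, self-contained demonstration. It uses Kotlin's `hashCode().mod(100)` as a stand-in for MurmurHash (a real platform should use MurmurHash3 for cross-platform uniformity), and a simplified two-variant assigner; all names here are illustrative, not part of the platform API.

```kotlin
// Stand-in bucket function; mod() keeps the result non-negative.
fun bucket(input: String): Int = input.hashCode().mod(100)

// Simplified assigner: 50/50 control vs. treatment, with separate hashes
// for traffic and variant, mirroring the structure of ExperimentAssigner.
fun toyAssign(userId: String, experimentKey: String, traffic: Int): String? {
    if (bucket("$userId:$experimentKey:traffic") >= traffic) return null // not in experiment
    return if (bucket("$userId:$experimentKey:variant") < 50) "control" else "treatment"
}

fun main() {
    val users = (1..10_000).map { "user_$it" }
    // Assign at 10% traffic, then ramp to 50%.
    val atTen = users.mapNotNull { u -> toyAssign(u, "exp", 10)?.let { u to it } }.toMap()
    val stable = atTen.all { (u, v) -> toyAssign(u, "exp", 50) == v }
    println(stable) // true: ramping only admits new users, never reassigns
}
```

Because the variant hash is independent of the traffic threshold, raising the threshold only admits new users; everyone already in the experiment keeps their variant.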
Exposure Tracking
Assignment is not exposure. A user is assigned at config evaluation time, but exposed only when they actually see the feature controlled by the experiment.
class ExperimentTracker(private val analytics: AnalyticsClient) {
private val exposedExperiments = mutableSetOf<String>()
fun trackExposure(experiment: Experiment, variant: Variant) {
val key = "${experiment.key}:${variant.key}"
if (key in exposedExperiments) return // Deduplicate within session
exposedExperiments.add(key)
analytics.track(
"experiment_exposure",
mapOf(
"experiment_key" to experiment.key,
"variant_key" to variant.key,
"experiment_id" to experiment.id,
"timestamp" to System.currentTimeMillis()
)
)
}
}
Exposure tracking happens at the point of feature rendering, not at assignment time. This prevents dilution from users who are assigned but never encounter the feature.
Mutual Exclusion Layers
Experiments that could interact are placed in the same layer. Each user is assigned to at most one experiment per layer.
Layer: "checkout_flow"
- Experiment A: checkout button color (10% traffic)
- Experiment B: checkout flow redesign (10% traffic)
User can be in A or B, but not both.
fun assignWithinLayer(userId: String, layer: Layer): Pair<Experiment, Variant>? {
val layerHash = murmurHash("$userId:${layer.key}") % 100
var cumulative = 0
for (experiment in layer.experiments) {
cumulative += experiment.trafficAllocation
        if (layerHash < cumulative) {
            // Note: assign() re-applies the experiment-level traffic check with
            // its own hash, so effective traffic is the layer slice times the
            // experiment's own allocation. Set trafficAllocation to 100 for
            // layered experiments, or bypass the traffic check here.
            val variant = ExperimentAssigner.assign(userId, experiment)
            return if (variant != null) experiment to variant else null
        }
    }
    return null // User not in any experiment in this layer
}
Metric Collection Pipeline
Client (exposure + metric events) -> Analytics Pipeline -> Experiment Metrics Store -> Statistical Analysis Engine -> Experiment Dashboard
Metrics are collected through the standard analytics pipeline (see the analytics pipeline post). The experiment metrics store joins exposure events with outcome metrics (conversion, revenue, engagement) by user ID and experiment.
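As a sketch of the join the metrics store performs (the types and function below are illustrative; the real store would be warehouse tables, not in-memory lists): exposure events map each user to a variant, and outcome events are matched by user ID to produce per-variant aggregates.

```kotlin
// Illustrative event shapes for the join.
data class Exposure(val userId: String, val experimentKey: String, val variantKey: String)
data class Outcome(val userId: String, val metric: String, val value: Double)

// Join exposures with outcomes by user ID, then average the metric per variant.
// Users with outcomes but no exposure are excluded, avoiding dilution.
fun perVariantMean(
    exposures: List<Exposure>,
    outcomes: List<Outcome>,
    experimentKey: String,
    metric: String
): Map<String, Double> {
    val variantByUser = exposures
        .filter { it.experimentKey == experimentKey }
        .associate { it.userId to it.variantKey }
    return outcomes
        .filter { it.metric == metric && it.userId in variantByUser }
        .groupBy { variantByUser.getValue(it.userId) }
        .mapValues { (_, rows) -> rows.map { it.value }.average() }
}
```

Note that only exposed users enter the aggregate, which is exactly the dilution-avoidance argument from the exposure-tracking section.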
Statistical Analysis
For each experiment, compute per-variant metrics and run hypothesis tests:
| Component | Approach |
|---|---|
| Test type | Two-sample t-test for continuous metrics, chi-squared for proportions |
| Correction | Bonferroni correction for multiple variants |
| Confidence level | 95% (configurable per experiment) |
| Power | 80% minimum; pre-experiment power analysis determines required sample size |
| Sequential testing | Use group sequential design to allow early stopping |
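The pre-experiment power analysis in the table can be sketched with the standard two-sample formula, n = 2(z_alpha/2 + z_beta)^2 * sigma^2 / delta^2 per variant. The z values below correspond to 95% confidence and 80% power; the function name is illustrative.

```kotlin
import kotlin.math.ceil
import kotlin.math.pow

// Required sample size per variant for a two-sample test:
//   n = 2 * (zAlpha + zBeta)^2 * sigma^2 / delta^2
// zAlpha = 1.96 (95% confidence, two-sided), zBeta = 0.8416 (80% power).
fun requiredSamplePerVariant(sigma: Double, minDetectableEffect: Double): Int {
    val zAlpha = 1.96
    val zBeta = 0.8416
    return ceil(2 * (zAlpha + zBeta).pow(2) * sigma.pow(2) / minDetectableEffect.pow(2)).toInt()
}

fun main() {
    // Detecting a 0.1-unit shift on a metric with standard deviation 1.0:
    println(requiredSamplePerVariant(1.0, 0.1)) // 1570 users per variant
}
```

Dividing the required sample by the eligible DAU entering the experiment gives the minimum duration noted in the constraints table.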
Guardrail Metrics
Every experiment monitors guardrail metrics regardless of its primary metric:
- App crash rate
- API error rate
- Session duration
- Revenue per user
If any guardrail metric degrades beyond its threshold (e.g., crash rate rises more than 0.1% relative to control), the experiment is automatically paused and the team is alerted.
fun guardrailCheck(experiment: Experiment, metrics: Map<String, Map<String, Double>>) {
    for (guardrail in GLOBAL_GUARDRAILS) {
        val controlValue = metrics.getValue("control").getValue(guardrail.key)
        val treatmentValue = metrics.getValue("treatment").getValue(guardrail.key)
        val delta = (treatmentValue - controlValue) / controlValue
        // This comparison assumes "higher is worse" (crash rate, error rate);
        // invert it for metrics where a drop is the regression (session
        // duration, revenue per user).
        if (delta > guardrail.threshold) {
            pauseExperiment(experiment)
            alertTeam(experiment, guardrail, delta)
        }
    }
}
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Client-side assignment | No network call, works offline | Assignment changes reach devices only after a config refresh |
| Deterministic hashing | Stable, reproducible assignments | Hash function quality affects uniformity |
| Exposure tracking at render | Accurate exposure, reduced dilution | Requires instrumenting every feature point |
| Mutual exclusion layers | Prevents interaction effects | Reduces available traffic per experiment |
| Guardrail auto-pause | Prevents harm to users | False positives can pause valid experiments |
Failure Modes
- Config fetch failure: Client uses cached config. Experiments continue with potentially stale definitions. New experiments will not start until config refreshes.
- Hash non-uniformity: A bad hash function causes uneven variant distribution. Validate distribution in pre-launch checks (chi-squared test on assignment counts).
- Late metric arrival: Offline users' metrics arrive days late. The analysis engine must support recomputation as late data arrives, using event timestamps rather than arrival timestamps.
- Sample ratio mismatch (SRM): Variant sizes diverge from expected ratios. This indicates a bug in assignment or exposure tracking. Run SRM checks automatically and halt analysis if detected.
- Novelty effects: Short experiments capture novelty, not sustained behavior change. Run experiments for at least 2 full weeks to account for novelty decay.
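The SRM check from the list above can be run as a one-degree-of-freedom chi-squared goodness-of-fit test on assignment counts. A minimal sketch (3.841 is the df = 1 critical value at p = 0.05; the function name is illustrative):

```kotlin
// Chi-squared goodness-of-fit for two variants against the expected split.
// Flags sample ratio mismatch when the statistic exceeds the df = 1
// critical value at alpha = 0.05.
fun hasSampleRatioMismatch(
    controlCount: Long,
    treatmentCount: Long,
    expectedControlShare: Double = 0.5
): Boolean {
    val total = (controlCount + treatmentCount).toDouble()
    val expectedControl = total * expectedControlShare
    val expectedTreatment = total * (1 - expectedControlShare)
    val stat = (controlCount - expectedControl).let { it * it / expectedControl } +
        (treatmentCount - expectedTreatment).let { it * it / expectedTreatment }
    return stat > 3.841
}

fun main() {
    println(hasSampleRatioMismatch(5000, 5000)) // false: perfectly balanced
    println(hasSampleRatioMismatch(5200, 4800)) // true: stat = 16.0 > 3.841
}
```

In practice, platforms often use a much stricter alpha for SRM (e.g., 0.001), since the check runs continuously and false alarms halt analysis.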
Scaling Considerations
- Experiment volume: Support 50-100 concurrent experiments. The config payload must remain compact (under 50KB compressed).
- Analysis latency: Pre-aggregate metrics per experiment per variant daily. Full recomputation is too expensive at scale.
- Multi-platform: Ensure identical hash implementations across Android, iOS, and web. A single discrepancy breaks experiment integrity.
- Long-running experiments: Mobile experiments run longer due to slow adoption. Support experiment durations of 4-8 weeks without data infrastructure strain.
Observability
- Track: assignment distribution per experiment (detect SRM), exposure rate vs. assignment rate, metric computation latency, guardrail trigger frequency.
- Dashboard: per-experiment view showing variant metrics, confidence intervals, cumulative sample size, estimated time to significance.
- Alert on: SRM detected, guardrail breach, experiment running beyond planned duration without decision.
Key Takeaways
- Separate assignment from exposure. Analyzing assigned-but-not-exposed users dilutes treatment effects and produces misleading results.
- Use deterministic hashing with separate hashes for traffic allocation and variant assignment. This allows safe ramp-ups.
- Mutual exclusion layers prevent interaction effects but cost available traffic. Use them selectively for experiments that could plausibly interact.
- Guardrail metrics are non-negotiable. Every experiment must be monitored for regressions in core health metrics.
- Account for late-arriving data in your analysis pipeline. Mobile users go offline, and their metrics arrive late.
Further Reading
- Designing a Feature Flag and Remote Config System: Architecture and trade-offs for building a feature flag and remote configuration system that handles targeting, rollout, and consistency.
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios.
- Designing an Offline-First Sync Engine for Mobile Apps: A deep dive into building a reliable sync engine that keeps mobile apps functional without connectivity, covering conflict resolution, queue management, and real-world trade-offs.
Final Thoughts
An experimentation platform is a decision-making system, not just an A/B testing tool. The quality of decisions depends on assignment integrity, accurate exposure tracking, and statistically sound analysis. Cutting corners on any of these produces numbers that look precise but are fundamentally unreliable.