Designing an Experimentation Platform for Mobile Apps
System design for a mobile experimentation platform covering assignment, exposure tracking, metric collection, statistical analysis, and guardrail metrics.
Running experiments on mobile is harder than on the web. You cannot change assignments mid-session without invalidating data. You cannot guarantee immediate metric collection. You cannot revert a treatment instantly. This post covers how to design an experimentation platform that accounts for these constraints.
Related: Mobile Analytics Pipeline: From App Event to Dashboard.
Context
An experimentation platform enables product teams to run A/B tests, measure impact, and make data-driven decisions. On mobile, the platform must handle assignment consistency, delayed metric collection (offline users), app version fragmentation, and long experiment durations due to slow update cycles.
Problem
Design an experimentation platform that:
- Assigns users to experiment variants deterministically
- Tracks exposure accurately (not just assignment, but actual treatment display)
- Collects metrics with eventual consistency, handling late-arriving data
- Provides statistically valid results with guardrail metrics
- Supports mutual exclusion and layered experiments
See also: Event Tracking System Design for Android Applications.
Constraints
| Constraint | Detail |
|---|---|
| Assignment latency | Must resolve in under 5ms on-device |
| Consistency | Assignment must not change within a session or across sessions for the same experiment |
| Metric delay | Events may arrive 72+ hours late from offline devices |
| Sample size | Experiments need sufficient power; mobile DAU determines minimum experiment duration |
| Interaction effects | Multiple concurrent experiments must not interfere with each other |
Design
Assignment Architecture
App Start -> Fetch Experiment Config (cached) -> Local Evaluator -> Assignment Map
The experiment config is a JSON payload fetched from the server (via CDN), containing all active experiments, their variants, targeting rules, and traffic allocations. The evaluator runs locally.
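A minimal sketch of the cached-config pattern (class and function names here are illustrative, not from a real SDK): the client always serves the last successfully fetched payload and refreshes it opportunistically, so assignment never blocks on the network and keeps working offline.

```kotlin
// Hypothetical sketch: serve the last good config, refresh opportunistically.
// `fetchRemote` stands in for the CDN call and may fail or return null.
class ExperimentConfigStore(private val fetchRemote: () -> String?) {
    @Volatile private var cached: String? = null

    // Called at app start; a failed fetch keeps the stale payload.
    fun refresh() {
        val fresh = try { fetchRemote() } catch (_: Exception) { null }
        if (fresh != null) cached = fresh
    }

    // The local evaluator reads whatever config is currently cached.
    fun current(): String? = cached
}
```

On fetch failure the evaluator keeps running against the stale payload; the trade-off is that new experiments cannot start until a refresh succeeds.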
data class Experiment(
val id: String,
val key: String,
val variants: List<Variant>,
val trafficAllocation: Int, // 0-100, percentage of eligible users in experiment
val targetingRules: List<TargetingRule>,
val layer: String, // Mutual exclusion layer
val status: ExperimentStatus // RUNNING, PAUSED, COMPLETED
)
data class Variant(
val key: String, // "control", "treatment_a", "treatment_b"
val weight: Int // Relative weight for assignment
)Deterministic Assignment
Assignment uses a hash function, not random number generation. This ensures the same user always gets the same variant without server-side state.
object ExperimentAssigner {
fun assign(userId: String, experiment: Experiment): Variant? {
if (experiment.status != ExperimentStatus.RUNNING) return null
        // Traffic allocation: is this user in the experiment?
        // murmurHash is assumed to return a non-negative value; Kotlin's %
        // yields a negative result for a negative operand, which would break
        // these bucket checks.
        val trafficHash = murmurHash("$userId:${experiment.key}:traffic") % 100
if (trafficHash >= experiment.trafficAllocation) return null
// Variant assignment: which variant?
val variantHash = murmurHash("$userId:${experiment.key}:variant") % 100
var cumulative = 0
for (variant in experiment.variants) {
cumulative += variant.weight
if (variantHash < cumulative) return variant
}
return experiment.variants.last()
}
}
Two separate hashes: one for traffic allocation, one for variant assignment. This ensures that changing traffic allocation (e.g., ramping from 10% to 50%) does not reassign existing users to different variants.
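To make the ramp-up property concrete, here is a toy, self-contained demonstration. It uses Kotlin's `hashCode().mod(100)` as a stand-in for MurmurHash (a real platform should use MurmurHash3 for cross-platform uniformity), and a simplified two-variant assigner; all names here are illustrative, not part of the platform API.

```kotlin
// Stand-in bucket function; mod() keeps the result non-negative.
fun bucket(input: String): Int = input.hashCode().mod(100)

// Simplified assigner: 50/50 control vs. treatment, with separate hashes
// for traffic and variant, mirroring the structure of ExperimentAssigner.
fun toyAssign(userId: String, experimentKey: String, traffic: Int): String? {
    if (bucket("$userId:$experimentKey:traffic") >= traffic) return null // not in experiment
    return if (bucket("$userId:$experimentKey:variant") < 50) "control" else "treatment"
}

fun main() {
    val users = (1..10_000).map { "user_$it" }
    // Assign at 10% traffic, then ramp to 50%.
    val atTen = users.mapNotNull { u -> toyAssign(u, "exp", 10)?.let { u to it } }.toMap()
    val stable = atTen.all { (u, v) -> toyAssign(u, "exp", 50) == v }
    println(stable) // true: ramping only admits new users, never reassigns
}
```

Because the variant hash is independent of the traffic threshold, raising the threshold only admits new users; everyone already in the experiment keeps their variant.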
Exposure Tracking
Assignment is not exposure. A user is assigned at config evaluation time, but exposed only when they actually see the feature controlled by the experiment.
class ExperimentTracker(private val analytics: AnalyticsClient) {
private val exposedExperiments = mutableSetOf<String>()
fun trackExposure(experiment: Experiment, variant: Variant) {
val key = "${experiment.key}:${variant.key}"
if (key in exposedExperiments) return // Deduplicate within session
exposedExperiments.add(key)
analytics.track(
"experiment_exposure",
mapOf(
"experiment_key" to experiment.key,
"variant_key" to variant.key,
"experiment_id" to experiment.id,
"timestamp" to System.currentTimeMillis()
)
)
}
}
Exposure tracking happens at the point of feature rendering, not at assignment time. This prevents dilution from users who are assigned but never encounter the feature.
Mutual Exclusion Layers
Experiments that could interact are placed in the same layer. Each user is assigned to at most one experiment per layer.
Layer: "checkout_flow"
- Experiment A: checkout button color (10% traffic)
- Experiment B: checkout flow redesign (10% traffic)
User can be in A or B, but not both.
fun assignWithinLayer(userId: String, layer: Layer): Pair<Experiment, Variant>? {
val layerHash = murmurHash("$userId:${layer.key}") % 100
var cumulative = 0
for (experiment in layer.experiments) {
cumulative += experiment.trafficAllocation
        if (layerHash < cumulative) {
            // Note: assign() re-applies the experiment-level traffic check with
            // its own hash, so effective traffic is the layer slice times the
            // experiment's own allocation. Set trafficAllocation to 100 for
            // layered experiments, or bypass the traffic check here.
            val variant = ExperimentAssigner.assign(userId, experiment)
            return if (variant != null) experiment to variant else null
        }
    }
    return null // User not in any experiment in this layer
}
Metric Collection Pipeline
Client (exposure + metric events) -> Analytics Pipeline -> Experiment Metrics Store -> Statistical Analysis Engine -> Experiment Dashboard
Metrics are collected through the standard analytics pipeline (see the analytics pipeline post). The experiment metrics store joins exposure events with outcome metrics (conversion, revenue, engagement) by user ID and experiment.
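As a sketch of the join the metrics store performs (the types and function below are illustrative; the real store would be warehouse tables, not in-memory lists): exposure events map each user to a variant, and outcome events are matched by user ID to produce per-variant aggregates.

```kotlin
// Illustrative event shapes for the join.
data class Exposure(val userId: String, val experimentKey: String, val variantKey: String)
data class Outcome(val userId: String, val metric: String, val value: Double)

// Join exposures with outcomes by user ID, then average the metric per variant.
// Users with outcomes but no exposure are excluded, avoiding dilution.
fun perVariantMean(
    exposures: List<Exposure>,
    outcomes: List<Outcome>,
    experimentKey: String,
    metric: String
): Map<String, Double> {
    val variantByUser = exposures
        .filter { it.experimentKey == experimentKey }
        .associate { it.userId to it.variantKey }
    return outcomes
        .filter { it.metric == metric && it.userId in variantByUser }
        .groupBy { variantByUser.getValue(it.userId) }
        .mapValues { (_, rows) -> rows.map { it.value }.average() }
}
```

Note that only exposed users enter the aggregate, which is exactly the dilution-avoidance argument from the exposure-tracking section.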
Statistical Analysis
For each experiment, compute per-variant metrics and run hypothesis tests:
| Component | Approach |
|---|---|
| Test type | Two-sample t-test for continuous metrics, chi-squared for proportions |
| Correction | Bonferroni correction for multiple variants |
| Confidence level | 95% (configurable per experiment) |
| Power | 80% minimum; pre-experiment power analysis determines required sample size |
| Sequential testing | Use group sequential design to allow early stopping |
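The pre-experiment power analysis in the table can be sketched with the standard two-sample formula, n = 2(z_alpha/2 + z_beta)^2 * sigma^2 / delta^2 per variant. The z values below correspond to 95% confidence and 80% power; the function name is illustrative.

```kotlin
import kotlin.math.ceil
import kotlin.math.pow

// Required sample size per variant for a two-sample test:
//   n = 2 * (zAlpha + zBeta)^2 * sigma^2 / delta^2
// zAlpha = 1.96 (95% confidence, two-sided), zBeta = 0.8416 (80% power).
fun requiredSamplePerVariant(sigma: Double, minDetectableEffect: Double): Int {
    val zAlpha = 1.96
    val zBeta = 0.8416
    return ceil(2 * (zAlpha + zBeta).pow(2) * sigma.pow(2) / minDetectableEffect.pow(2)).toInt()
}

fun main() {
    // Detecting a 0.1-unit shift on a metric with standard deviation 1.0:
    println(requiredSamplePerVariant(1.0, 0.1)) // 1570 users per variant
}
```

Dividing the required sample by the eligible DAU entering the experiment gives the minimum duration noted in the constraints table.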
Guardrail Metrics
Every experiment monitors guardrail metrics regardless of its primary metric:
- App crash rate
- API error rate
- Session duration
- Revenue per user
If any guardrail metric degrades beyond its threshold (e.g., crash rate rises more than 0.1% relative to control), the experiment is automatically paused and the team is alerted.
fun guardrailCheck(experiment: Experiment, metrics: Map<String, Map<String, Double>>) {
    for (guardrail in GLOBAL_GUARDRAILS) {
        val controlValue = metrics.getValue("control").getValue(guardrail.key)
        val treatmentValue = metrics.getValue("treatment").getValue(guardrail.key)
        val delta = (treatmentValue - controlValue) / controlValue
        // This comparison assumes "higher is worse" (crash rate, error rate);
        // invert it for metrics where a drop is the regression (session
        // duration, revenue per user).
        if (delta > guardrail.threshold) {
            pauseExperiment(experiment)
            alertTeam(experiment, guardrail, delta)
        }
    }
}
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Client-side assignment | No network call, works offline | Assignment changes reach devices only after a config refresh |
| Deterministic hashing | Stable, reproducible assignments | Hash function quality affects uniformity |
| Exposure tracking at render | Accurate exposure, reduced dilution | Requires instrumenting every feature point |
| Mutual exclusion layers | Prevents interaction effects | Reduces available traffic per experiment |
| Guardrail auto-pause | Prevents harm to users | False positives can pause valid experiments |
Failure Modes
- Config fetch failure: Client uses cached config. Experiments continue with potentially stale definitions. New experiments will not start until config refreshes.
- Hash non-uniformity: A bad hash function causes uneven variant distribution. Validate distribution in pre-launch checks (chi-squared test on assignment counts).
- Late metric arrival: Offline users' metrics arrive days late. The analysis engine must support recomputation as late data arrives, using event timestamps rather than arrival timestamps.
- Sample ratio mismatch (SRM): Variant sizes diverge from expected ratios. This indicates a bug in assignment or exposure tracking. Run SRM checks automatically and halt analysis if detected.
- Novelty effects: Short experiments capture novelty, not sustained behavior change. Run experiments for at least 2 full weeks to account for novelty decay.
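The SRM check from the list above can be run as a one-degree-of-freedom chi-squared goodness-of-fit test on assignment counts. A minimal sketch (3.841 is the df = 1 critical value at p = 0.05; the function name is illustrative):

```kotlin
// Chi-squared goodness-of-fit for two variants against the expected split.
// Flags sample ratio mismatch when the statistic exceeds the df = 1
// critical value at alpha = 0.05.
fun hasSampleRatioMismatch(
    controlCount: Long,
    treatmentCount: Long,
    expectedControlShare: Double = 0.5
): Boolean {
    val total = (controlCount + treatmentCount).toDouble()
    val expectedControl = total * expectedControlShare
    val expectedTreatment = total * (1 - expectedControlShare)
    val stat = (controlCount - expectedControl).let { it * it / expectedControl } +
        (treatmentCount - expectedTreatment).let { it * it / expectedTreatment }
    return stat > 3.841
}

fun main() {
    println(hasSampleRatioMismatch(5000, 5000)) // false: perfectly balanced
    println(hasSampleRatioMismatch(5200, 4800)) // true: stat = 16.0 > 3.841
}
```

In practice, platforms often use a much stricter alpha for SRM (e.g., 0.001), since the check runs continuously and false alarms halt analysis.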
Scaling Considerations
- Experiment volume: Support 50-100 concurrent experiments. The config payload must remain compact (under 50KB compressed).
- Analysis latency: Pre-aggregate metrics per experiment per variant daily. Full recomputation is too expensive at scale.
- Multi-platform: Ensure identical hash implementations across Android, iOS, and web. A single discrepancy breaks experiment integrity.
- Long-running experiments: Mobile experiments run longer due to slow adoption. Support experiment durations of 4-8 weeks without data infrastructure strain.
Observability
- Track: assignment distribution per experiment (detect SRM), exposure rate vs. assignment rate, metric computation latency, guardrail trigger frequency.
- Dashboard: per-experiment view showing variant metrics, confidence intervals, cumulative sample size, estimated time to significance.
- Alert on: SRM detected, guardrail breach, experiment running beyond planned duration without decision.
Key Takeaways
- Separate assignment from exposure. Analyzing assigned-but-not-exposed users dilutes treatment effects and produces misleading results.
- Use deterministic hashing with separate hashes for traffic allocation and variant assignment. This allows safe ramp-ups.
- Mutual exclusion layers prevent interaction effects but cost available traffic. Use them selectively for experiments that could plausibly interact.
- Guardrail metrics are non-negotiable. Every experiment must be monitored for regressions in core health metrics.
- Account for late-arriving data in your analysis pipeline. Mobile users go offline, and their metrics arrive late.
Further Reading
- Designing a Feature Flag and Remote Config System: Architecture and trade-offs for building a feature flag and remote configuration system that handles targeting, rollout, and consistency.
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios.
- Designing an Offline-First Sync Engine for Mobile Apps: A deep dive into building a reliable sync engine that keeps mobile apps functional without connectivity, covering conflict resolution, queue management, and real-world trade-offs.
Final Thoughts
An experimentation platform is a decision-making system, not just an A/B testing tool. The quality of decisions depends on assignment integrity, accurate exposure tracking, and statistically sound analysis. Cutting corners on any of these produces numbers that look precise but are fundamentally unreliable.