Mobile Analytics Pipeline: From App Event to Dashboard
End-to-end design of a mobile analytics pipeline covering ingestion, processing, storage, and querying, with emphasis on reliability and latency trade-offs.
The gap between "we track events" and "we have reliable analytics" is an entire distributed system. This post traces the path of a mobile event from the moment it fires on a device to the moment it appears on a dashboard, covering each stage's design decisions and failure modes.
Related: Failure Modes I Actively Design For.
See also: Event Tracking System Design for Android Applications.
Context
Mobile analytics pipelines ingest billions of events daily across heterogeneous devices, networks, and app versions. The pipeline must handle late-arriving data, duplicate events, schema evolution, and query patterns ranging from real-time counters to complex cohort analysis.
Problem
Design an end-to-end pipeline that:
- Ingests events from mobile clients with at-least-once delivery
- Processes and enriches events in near-real-time
- Stores data for both real-time dashboards and historical analysis
- Handles schema changes without breaking downstream consumers
Constraints
| Constraint | Detail |
|---|---|
| Ingestion rate | 100K-1M events/second at peak |
| Latency target | Events visible on dashboards within 5 minutes |
| Retention | Raw events: 90 days. Aggregates: 2 years |
| Late arrivals | Events can arrive up to 72 hours late (offline devices) |
| Cost | Storage and compute costs must scale sub-linearly with event volume |
Design
Stage 1: Client-Side Collection
Events are batched on-device (covered in detail in the event tracking post). The client sends compressed JSON payloads to the ingestion endpoint. Each event carries a client-generated UUID for deduplication.
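The event tracking post covers batching in depth; as a minimal sketch of the payload shape described here (field and function names are illustrative, not the actual client SDK):

```python
import gzip
import json
import uuid

def build_batch(events):
    """Attach a client-generated UUID to each event for server-side
    deduplication, then serialize and gzip the batch for upload."""
    stamped = [{**e, "event_id": str(uuid.uuid4())} for e in events]
    payload = json.dumps({"events": stamped}).encode("utf-8")
    return gzip.compress(payload)

def read_batch(blob):
    """Server-side counterpart: decompress and parse the batch."""
    return json.loads(gzip.decompress(blob))["events"]
```

The key property is that `event_id` is minted on the device, before the first send attempt, so every retry of the same event carries the same ID.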
Stage 2: Ingestion Layer
Mobile Client -> Load Balancer -> Ingestion Service -> Kafka
The ingestion service is stateless. It performs:
- Request validation (auth token, payload size limits)
- Basic schema validation (required fields present)
- Timestamp normalization (client timestamp vs server receive timestamp)
- Kafka publication, with the event name as the partition key
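Timestamp normalization deserves a concrete illustration. One common approach, sketched below with illustrative field names, estimates client clock skew from the upload handshake and shifts event timestamps onto the server clock:

```python
MAX_FUTURE_DRIFT_MS = 0  # events cannot legitimately occur after server receipt

def normalize_timestamps(events, client_sent_at_ms, server_received_at_ms):
    """Estimate client clock skew from the batch upload handshake and
    shift each event's client timestamp onto the server clock."""
    skew_ms = server_received_at_ms - client_sent_at_ms
    normalized = []
    for e in events:
        ts = e["client_ts_ms"] + skew_ms
        # Clamp anything that still lands in the future.
        ts = min(ts, server_received_at_ms + MAX_FUTURE_DRIFT_MS)
        normalized.append({
            **e,
            "event_ts_ms": ts,
            "server_received_at_ms": server_received_at_ms,
        })
    return normalized
```

Keeping both the normalized timestamp and the raw server receipt time lets downstream jobs measure late-arrival lag without re-deriving it.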
Why Kafka: Decouples ingestion from processing. Provides durability, replay capability, and natural backpressure. Partitioning by event name enables parallel processing per event type.
Stage 3: Stream Processing
Kafka -> Flink/Spark Streaming -> Enrichment -> Kafka (enriched topic)
Processing steps:
- Deduplication: Bloom filter on event UUIDs with a 24-hour window. Accepts ~0.1% false positive rate to keep memory bounded.
- Enrichment: Join with user dimension tables (segment, country, subscription tier) from a Redis cache.
- Schema validation: Validate against the schema registry. Route invalid events to a dead letter topic.
- Sessionization: Assign events to sessions using a 30-minute inactivity timeout.
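The sessionization step can be sketched as follows (a simplified in-memory version; a real Flink job would keep `last_seen` in keyed state and handle out-of-order events with watermarks):

```python
SESSION_TIMEOUT_MS = 30 * 60 * 1000  # 30-minute inactivity window

def sessionize(events):
    """Assign a session index per user: a new session starts whenever the
    gap since the user's previous event exceeds the inactivity timeout."""
    last_seen = {}    # user_id -> timestamp of previous event
    session_num = {}  # user_id -> current session index
    out = []
    for e in sorted(events, key=lambda e: (e["user_id"], e["ts_ms"])):
        uid, ts = e["user_id"], e["ts_ms"]
        if uid not in last_seen or ts - last_seen[uid] > SESSION_TIMEOUT_MS:
            session_num[uid] = session_num.get(uid, -1) + 1
        last_seen[uid] = ts
        out.append({**e, "session_id": f"{uid}:{session_num[uid]}"})
    return out
```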
Stage 4: Storage
Two storage paths serve different query patterns:
| Storage | Purpose | Technology | Query Latency |
|---|---|---|---|
| Real-time store | Live dashboards, alerting | Druid or ClickHouse | < 1 second |
| Analytical store | Ad-hoc queries, cohort analysis | Parquet on S3 + Presto/Trino | 5-30 seconds |
The real-time store ingests from the enriched Kafka topic. The analytical store is populated by a batch job that compacts hourly Kafka segments into Parquet files, partitioned by date and event name.
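The partitioning scheme the compaction job writes might look like this (a sketch with an assumed bucket name and Hive-style key names; the real layout depends on your table format):

```python
from datetime import datetime, timezone

def partition_path(event_name, event_ts_ms, bucket="s3://analytics-events"):
    """Build the Hive-style partition prefix for a compacted Parquet file:
    partitioned by date, then event name, so Presto/Trino can prune
    partitions for time-bounded, per-event queries."""
    day = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{bucket}/dt={day}/event_name={event_name}/"
```

Putting the date key first means a typical dashboard query ("event X over the last 7 days") prunes to 7 × 1 partitions instead of scanning the table.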
Stage 5: Query and Dashboard Layer
Dashboard UI -> Query Service -> Real-time Store (recent data)
                              -> Analytical Store (historical data)
The query service routes based on time range: queries within the last 24 hours hit the real-time store, older queries hit the analytical store. A query planner merges results when a query spans both.
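The routing logic reduces to splitting the requested time range at the 24-hour boundary. A minimal sketch of that planner step:

```python
def route_query(start_ms, end_ms, now_ms, cutoff_ms=24 * 60 * 60 * 1000):
    """Split a query's time range across the two stores. Ranges newer than
    the cutoff go to the real-time store, older ranges to the analytical
    store; a range spanning the boundary yields one scan of each, whose
    results the planner merges."""
    boundary = now_ms - cutoff_ms
    plan = []
    if start_ms < boundary:
        plan.append(("analytical", start_ms, min(end_ms, boundary)))
    if end_ms > boundary:
        plan.append(("realtime", max(start_ms, boundary), end_ms))
    return plan
```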
Late Arrival Handling
Late events (from devices that were offline) are processed normally through the pipeline. The real-time store uses a mutable data model that can update existing time buckets. The analytical store handles late arrivals through periodic recompaction jobs that merge late data into existing Parquet partitions.
Late event arrives -> Ingestion -> Processing (same pipeline)
                                        |
                      Real-time store update (upsert into existing bucket)
                                        |
                      Batch recompaction (merge into Parquet partition)
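The reason the real-time store must be mutable is visible in a toy version of its counters: a late event increments the bucket its event timestamp belongs to, not a bucket at arrival time (a sketch, not the Druid/ClickHouse ingestion model itself):

```python
from collections import defaultdict

class RealtimeCounters:
    """Mutable per-minute counters, so a late event simply increments the
    time bucket it belongs to rather than one at arrival time."""
    BUCKET_MS = 60_000

    def __init__(self):
        self.buckets = defaultdict(int)  # (event_name, bucket_start_ms) -> count

    def upsert(self, event_name, event_ts_ms):
        bucket = event_ts_ms - event_ts_ms % self.BUCKET_MS
        self.buckets[(event_name, bucket)] += 1
```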
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Kafka as central bus | Durability, replay, decoupling | Operational complexity, Kafka cluster management |
| Bloom filter dedup | Memory-efficient, fast | False positives (0.1%), requires windowing |
| Dual storage (real-time + analytical) | Optimized for both query patterns | Data consistency between stores, higher cost |
| Parquet on S3 | Cheap long-term storage, columnar queries | Higher query latency, requires compute cluster |
| Client-side UUIDs for dedup | No server state for ID generation | Relies on client correctness; buggy or time-based ID generation can collide (extremely rare with random UUIDs) |
Failure Modes
- Kafka broker failure: Replication factor of 3 ensures no data loss. Producers retry with idempotent writes enabled.
- Processing lag: If Flink falls behind, consumer lag increases. Alert when lag exceeds 5 minutes. Scale out processing parallelism (more task slots consuming more Kafka partitions).
- Enrichment cache miss: If Redis is unavailable, enrich with "unknown" values and flag events for re-enrichment in the next batch cycle.
- Schema registry outage: Fall back to permissive mode, accept all events, validate retroactively when the registry recovers.
- Dashboard query timeout: Implement query cost estimation. Reject queries that would scan more than 1 billion rows without aggregation.
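The enrichment-cache fallback above can be sketched concretely (illustrative field names; `cache_lookup` stands in for the Redis client):

```python
UNKNOWN = {"segment": "unknown", "country": "unknown", "tier": "unknown"}

def enrich(event, cache_lookup):
    """Enrich from the user-dimension cache; on a miss or cache outage,
    fall back to 'unknown' values and flag the event so a later batch
    pass can re-enrich it."""
    try:
        dims = cache_lookup(event["user_id"])
    except Exception:
        dims = None  # cache unavailable: degrade rather than drop the event
    if dims is None:
        return {**event, **UNKNOWN, "needs_reenrichment": True}
    return {**event, **dims, "needs_reenrichment": False}
```

The flag matters more than the fallback values: it gives the batch layer a cheap predicate for finding every event that needs a second pass.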
Scaling Considerations
- Ingestion: Scale horizontally behind the load balancer. Each instance is stateless.
- Kafka: Add partitions per topic as throughput grows. Rebalance consumers accordingly.
- Processing: Flink scales by adding task slots. Ensure checkpointing interval is tuned (too frequent wastes I/O, too infrequent risks reprocessing).
- Storage: ClickHouse scales with sharding. S3 scales inherently. Presto scales by adding worker nodes.
- Cost control: Implement tiered retention. Move data older than 30 days to cold storage. Downsample high-frequency events (e.g., scroll events aggregated to 1-minute buckets after 7 days).
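Downsampling is the simplest of these levers; a minimal sketch of collapsing raw high-frequency events into per-minute aggregate rows:

```python
from collections import defaultdict

def downsample(events, bucket_ms=60_000):
    """Collapse high-frequency events into per-minute aggregates: one row
    per (event_name, bucket) with a count, replacing the raw rows."""
    agg = defaultdict(int)
    for e in events:
        bucket = e["ts_ms"] - e["ts_ms"] % bucket_ms
        agg[(e["name"], bucket)] += 1
    return [
        {"name": name, "bucket_start_ms": bucket, "count": n}
        for (name, bucket), n in sorted(agg.items())
    ]
```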
Observability
- Pipeline health: Kafka consumer lag, processing throughput (events/sec), enrichment cache hit rate, dead letter queue depth.
- Data quality: Schema validation failure rate, duplicate rate, late arrival rate, null field rate per event type.
- Query performance: p50/p95/p99 query latency, query error rate, cache hit rate on the query layer.
- End-to-end latency: Measure the time from client event timestamp to dashboard visibility. Track p50 and p99.
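For the end-to-end latency metric, nearest-rank percentiles over a sample of (dashboard visibility time minus client event timestamp) values are enough; a small sketch:

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile over a pre-sorted sample."""
    if not sorted_values:
        raise ValueError("empty sample")
    k = max(0, -(-len(sorted_values) * p // 100) - 1)  # ceil(n*p/100) - 1
    return sorted_values[k]

def e2e_latency_report(samples_ms):
    """samples_ms: dashboard-visibility time minus client event timestamp,
    one value per event."""
    s = sorted(samples_ms)
    return {"p50": percentile(s, 50), "p99": percentile(s, 99)}
```

Note that client clock skew leaks into this metric; using the normalized event timestamp from the ingestion layer keeps the measurement honest.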
Key Takeaways
- Decouple ingestion from processing. A queue between them provides durability and backpressure naturally.
- Use dual storage paths: one optimized for real-time, one for cost-efficient historical analysis.
- Handle late arrivals as a first-class concern, not an afterthought. Mobile devices go offline regularly.
- Deduplication must happen in the pipeline, not as a downstream cleanup. Downstream consumers should trust the data.
- Observe the pipeline itself with the same rigor as the data it carries.
Further Reading
- Designing an Experimentation Platform for Mobile Apps: System design for a mobile experimentation platform covering assignment, exposure tracking, metric collection, statistical analysis, and ...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
- Client-Heavy vs Server-Heavy Architectures for Mobile: Comparing client-heavy and server-heavy mobile architectures across performance, maintainability, update velocity, and user experience tr...
Final Thoughts
An analytics pipeline is only as good as its weakest stage. Invest in schema enforcement early, build late-arrival handling from day one, and treat pipeline observability as a product, not a side project.