Mobile Analytics Pipeline: From App Event to Dashboard

Dhruval Dhameliya·January 25, 2026·6 min read

End-to-end design of a mobile analytics pipeline covering ingestion, processing, storage, and querying, with emphasis on reliability and latency trade-offs.

The gap between "we track events" and "we have reliable analytics" is an entire distributed system. This post traces the path of a mobile event from the moment it fires on a device to the moment it appears on a dashboard, covering each stage's design decisions and failure modes.

Related: Failure Modes I Actively Design For.

See also: Event Tracking System Design for Android Applications.

Context

Mobile analytics pipelines ingest billions of events daily across heterogeneous devices, networks, and app versions. The pipeline must handle late-arriving data, duplicate events, schema evolution, and query patterns ranging from real-time counters to complex cohort analysis.

Problem

Design an end-to-end pipeline that:

  • Ingests events from mobile clients with at-least-once delivery
  • Processes and enriches events in near-real-time
  • Stores data for both real-time dashboards and historical analysis
  • Handles schema changes without breaking downstream consumers

Constraints

Constraint | Detail
Ingestion rate | 100K-1M events/second at peak
Latency target | Events visible on dashboards within 5 minutes
Retention | Raw events: 90 days. Aggregates: 2 years
Late arrivals | Events can arrive up to 72 hours late (offline devices)
Cost | Storage and compute costs must scale sub-linearly with event volume

Design

Stage 1: Client-Side Collection

Events are batched on-device (covered in detail in the event tracking post). The client sends compressed JSON payloads to the ingestion endpoint. Each event carries a client-generated UUID for deduplication.
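A minimal sketch of what such a batch might look like on the client. The envelope format here is hypothetical (field names like `sent_at` and `props` are illustrative, not from the post); the load-bearing detail is the per-event UUID:

```python
import gzip
import json
import time
import uuid


def build_batch(events, app_version="1.0.0"):
    """Wrap raw events in an upload envelope; each event gets a
    client-generated UUID so the server can deduplicate retries."""
    envelope = {
        "app_version": app_version,
        "sent_at": int(time.time() * 1000),  # client send time (ms)
        "events": [
            {
                "event_id": str(uuid.uuid4()),  # dedup key
                "name": e["name"],
                "client_ts": e["client_ts"],
                "props": e.get("props", {}),
            }
            for e in events
        ],
    }
    # Compress the JSON payload before upload to save mobile bandwidth.
    return gzip.compress(json.dumps(envelope).encode("utf-8"))
```

Because the UUID is attached before the first send attempt, a retried upload carries the same IDs and the server can safely drop the duplicates.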

Stage 2: Ingestion Layer

Mobile Client -> Load Balancer -> Ingestion Service -> Kafka

The ingestion service is stateless. It performs:

  • Request validation (auth token, payload size limits)
  • Basic schema validation (required fields present)
  • Timestamp normalization (client timestamp vs server receive timestamp)
  • Kafka writes, using the event name as the partition key

Why Kafka: Decouples ingestion from processing. Provides durability, replay capability, and natural backpressure. Partitioning by event name enables parallel processing per event type.
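The validation steps above can be sketched as a stateless handler. Here `produce` stands in for a Kafka producer send, and the payload-size limit is an assumed value, not one from the post:

```python
import json
import time

MAX_PAYLOAD_BYTES = 1_000_000  # assumed limit for illustration
REQUIRED_FIELDS = {"event_id", "name", "client_ts"}


def ingest(payload, auth_ok, produce):
    """Validate a decoded batch and hand each event to `produce`
    (a stand-in for a Kafka producer send). Returns events accepted."""
    if not auth_ok:
        raise PermissionError("invalid auth token")
    if len(payload) > MAX_PAYLOAD_BYTES:
        raise ValueError("payload too large")
    batch = json.loads(payload)
    server_ts = int(time.time() * 1000)
    accepted = 0
    for event in batch["events"]:
        if not REQUIRED_FIELDS <= event.keys():
            continue  # basic schema check: skip malformed events
        # Keep both timestamps; downstream logic can reason about skew.
        event["server_ts"] = server_ts
        # Event name as the partition key groups each event type together.
        produce(key=event["name"], value=event)
        accepted += 1
    return accepted
```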

Stage 3: Stream Processing

Kafka -> Flink/Spark Streaming -> Enrichment -> Kafka (enriched topic)

Processing steps:

  1. Deduplication: Bloom filter on event UUIDs over a 24-hour window. Accepts a ~0.1% false-positive rate to keep memory bounded; a false positive silently drops a genuinely new event, which is an acceptable loss for analytics data.
  2. Enrichment: Join with user dimension tables (segment, country, subscription tier) from a Redis cache.
  3. Schema validation: Validate against the schema registry. Route invalid events to a dead letter topic.
  4. Sessionization: Assign events to sessions using a 30-minute inactivity timeout.
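Step 1 can be illustrated with a toy Bloom filter. The sizing constants below are placeholders; a production deployment would rotate one filter per 24-hour window and size the bit array and hash count for the target ~0.1% false-positive rate:

```python
import hashlib


class WindowedBloom:
    """Minimal Bloom filter to illustrate memory-bounded dedup."""

    def __init__(self, m_bits=1 << 20, k=7):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from independent hashes of the item.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def seen_or_add(self, event_id):
        """Return True if event_id was (probably) seen before;
        otherwise record it and return False."""
        seen = True
        for pos in self._positions(event_id):
            byte, bit = divmod(pos, 8)
            if not (self.bits[byte] >> bit) & 1:
                seen = False
                self.bits[byte] |= 1 << bit
        return seen
```

The filter never reports a true duplicate as new, so duplicates are reliably dropped; the cost is the small chance of discarding a new event.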

Stage 4: Storage

Two storage paths serve different query patterns:

Storage | Purpose | Technology | Query Latency
Real-time store | Live dashboards, alerting | Druid or ClickHouse | < 1 second
Analytical store | Ad-hoc queries, cohort analysis | Parquet on S3 + Presto/Trino | 5-30 seconds

The real-time store ingests from the enriched Kafka topic. The analytical store is populated by a batch job that compacts hourly Kafka segments into Parquet files, partitioned by date and event name.
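A sketch of the compaction job's grouping step, assuming a partition path layout of `dt=YYYY-MM-DD/event=<name>` (the exact layout is an assumption). A real job would then write each group to S3 as a Parquet file:

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event):
    """Derive the assumed partition path dt=YYYY-MM-DD/event=<name>
    from an event's client timestamp (milliseconds)."""
    day = datetime.fromtimestamp(event["client_ts"] / 1000, tz=timezone.utc)
    return f"dt={day:%Y-%m-%d}/event={event['name']}"


def compact(events):
    """Group one hour of Kafka events by target Parquet partition."""
    groups = defaultdict(list)
    for e in events:
        groups[partition_key(e)].append(e)
    return groups
```

Partitioning by date and event name keeps queries that filter on either dimension from scanning unrelated files.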

Stage 5: Query and Dashboard Layer

Dashboard UI -> Query Service -> Real-time Store (recent data)
                              -> Analytical Store (historical data)

The query service routes based on time range: queries within the last 24 hours hit the real-time store, older queries hit the analytical store. A query planner merges results when a query spans both.
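The routing rule can be sketched as follows, using the 24-hour boundary described above:

```python
import time

REALTIME_WINDOW_S = 24 * 3600  # last 24 hours live in the real-time store


def route(start_s, end_s, now_s=None):
    """Return which stores a dashboard query for [start_s, end_s]
    must hit, based on the real-time retention boundary."""
    now_s = time.time() if now_s is None else now_s
    cutoff = now_s - REALTIME_WINDOW_S
    if start_s >= cutoff:
        return ["realtime"]
    if end_s <= cutoff:
        return ["analytical"]
    # The query spans the boundary: fan out and merge the results.
    return ["analytical", "realtime"]
```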

Late Arrival Handling

Late events (from devices that were offline) are processed normally through the pipeline. The real-time store uses a mutable data model that can update existing time buckets. The analytical store handles late arrivals through periodic recompaction jobs that merge late data into existing Parquet partitions.

Late event arrives -> Ingestion -> Processing (same pipeline)
                                       |
                   Real-time store update (upsert into existing bucket)
                                       |
                   Batch recompaction (merge into Parquet partition)
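The real-time upsert path can be sketched with per-minute counters standing in for the store; the one-minute bucket granularity is an assumption. A late event simply increments whichever older bucket it belongs to:

```python
from collections import defaultdict

BUCKET_S = 60  # assumed dashboard bucket granularity (seconds)


class MutableCounters:
    """Mutable per-bucket counters standing in for the real-time store."""

    def __init__(self):
        self.buckets = defaultdict(int)

    def upsert(self, event):
        # Bucket by the client-side event time, not arrival time, so a
        # late event lands in the time bucket where it actually occurred.
        bucket = (event["client_ts"] // 1000 // BUCKET_S) * BUCKET_S
        self.buckets[(event["name"], bucket)] += 1
```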

Trade-offs

Decision | Upside | Downside
Kafka as central bus | Durability, replay, decoupling | Operational complexity, Kafka cluster management
Bloom filter dedup | Memory-efficient, fast | False positives (~0.1%), requires windowing
Dual storage (real-time + analytical) | Optimized for both query patterns | Data consistency between stores, higher cost
Parquet on S3 | Cheap long-term storage, columnar queries | Higher query latency, requires compute cluster
Client-side UUIDs for dedup | No server state for ID generation | Random-UUID collisions possible in principle (vanishingly rare)

Failure Modes

  • Kafka broker failure: Replication factor of 3 ensures no data loss. Producers retry with idempotent writes enabled.
  • Processing lag: If Flink falls behind, consumer lag increases. Alert when lag exceeds 5 minutes. Scale processing partitions horizontally.
  • Enrichment cache miss: If Redis is unavailable, enrich with "unknown" values and flag events for re-enrichment in the next batch cycle.
  • Schema registry outage: Fall back to permissive mode, accept all events, validate retroactively when the registry recovers.
  • Dashboard query timeout: Implement query cost estimation. Reject queries that would scan more than 1 billion rows without aggregation.
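The enrichment-cache fallback can be sketched as follows; the dimension names `segment` and `country` follow the enrichment step above, and `needs_reenrichment` is a hypothetical flag:

```python
def enrich(event, cache_get):
    """Enrich an event with user dimensions via `cache_get` (a stand-in
    for a Redis lookup); on any cache failure, fall back to 'unknown'
    values and flag the event for re-enrichment in the next batch cycle."""
    try:
        dims = cache_get(event["user_id"])
    except Exception:
        dims = None  # treat cache outage the same as a miss
    if dims is None:
        event.update(segment="unknown", country="unknown",
                     needs_reenrichment=True)
    else:
        event.update(dims, needs_reenrichment=False)
    return event
```

Degrading to "unknown" keeps the pipeline flowing during a Redis outage at the cost of temporarily less useful dimensions, which the re-enrichment pass later repairs.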

Scaling Considerations

  • Ingestion: Scale horizontally behind the load balancer. Each instance is stateless.
  • Kafka: Add partitions per topic as throughput grows. Rebalance consumers accordingly.
  • Processing: Flink scales by adding task slots. Ensure checkpointing interval is tuned (too frequent wastes I/O, too infrequent risks reprocessing).
  • Storage: ClickHouse scales with sharding. S3 scales inherently. Presto scales by adding worker nodes.
  • Cost control: Implement tiered retention. Move data older than 30 days to cold storage. Downsample high-frequency events (e.g., scroll events aggregated to 1-minute buckets after 7 days).

Observability

  • Pipeline health: Kafka consumer lag, processing throughput (events/sec), enrichment cache hit rate, dead letter queue depth.
  • Data quality: Schema validation failure rate, duplicate rate, late arrival rate, null field rate per event type.
  • Query performance: p50/p95/p99 query latency, query error rate, cache hit rate on the query layer.
  • End-to-end latency: Measure the time from client event timestamp to dashboard visibility. Track p50 and p99.
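End-to-end latency percentiles can be computed with a simple nearest-rank formula over observed samples, where each sample is dashboard visibility time minus client event time:

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile (p in 0..100) of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Tracking p99 alongside p50 matters here because late-arriving events from offline devices make the tail of this distribution far heavier than the median suggests.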

Key Takeaways

  • Decouple ingestion from processing. A queue between them provides durability and backpressure naturally.
  • Use dual storage paths: one optimized for real-time, one for cost-efficient historical analysis.
  • Handle late arrivals as a first-class concern, not an afterthought. Mobile devices go offline regularly.
  • Deduplication must happen in the pipeline, not as a downstream cleanup. Downstream consumers should trust the data.
  • Observe the pipeline itself with the same rigor as the data it carries.

Final Thoughts

An analytics pipeline is only as good as its weakest stage. Invest in schema enforcement early, build late-arrival handling from day one, and treat pipeline observability as a product, not a side project.
