Mobile Analytics Pipeline: From App Event to Dashboard
End-to-end design of a mobile analytics pipeline covering ingestion, processing, storage, and querying, with emphasis on reliability and latency trade-offs.
The gap between "we track events" and "we have reliable analytics" is an entire distributed system. This post traces the path of a mobile event from the moment it fires on a device to the moment it appears on a dashboard, covering each stage's design decisions and failure modes.
Related: Failure Modes I Actively Design For.
See also: Event Tracking System Design for Android Applications.
Context
Mobile analytics pipelines ingest billions of events daily across heterogeneous devices, networks, and app versions. The pipeline must handle late-arriving data, duplicate events, schema evolution, and query patterns ranging from real-time counters to complex cohort analysis.
Problem
Design an end-to-end pipeline that:
- Ingests events from mobile clients with at-least-once delivery
- Processes and enriches events in near-real-time
- Stores data for both real-time dashboards and historical analysis
- Handles schema changes without breaking downstream consumers
Constraints
| Constraint | Detail |
|---|---|
| Ingestion rate | 100K-1M events/second at peak |
| Latency target | Events visible on dashboards within 5 minutes |
| Retention | Raw events: 90 days. Aggregates: 2 years |
| Late arrivals | Events can arrive up to 72 hours late (offline devices) |
| Cost | Storage and compute costs must scale sub-linearly with event volume |
Design
Stage 1: Client-Side Collection
Events are batched on-device (covered in detail in the event tracking post). The client sends compressed JSON payloads to the ingestion endpoint. Each event carries a client-generated UUID for deduplication.
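The event tracking post covers batching in depth; as a minimal sketch of the payload shape described here (field and function names are illustrative, not the actual client SDK):

```python
import gzip
import json
import uuid

def build_batch(events):
    """Attach a client-generated UUID to each event for server-side
    deduplication, then serialize and gzip the batch for upload."""
    stamped = [{**e, "event_id": str(uuid.uuid4())} for e in events]
    payload = json.dumps({"events": stamped}).encode("utf-8")
    return gzip.compress(payload)

def read_batch(blob):
    """Server-side counterpart: decompress and parse the batch."""
    return json.loads(gzip.decompress(blob))["events"]
```

The key property is that `event_id` is minted on the device, before the first send attempt, so every retry of the same event carries the same ID.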
Stage 2: Ingestion Layer
Mobile Client -> Load Balancer -> Ingestion Service -> Kafka
The ingestion service is stateless. It performs:
- Request validation (auth token, payload size limits)
- Basic schema validation (required fields present)
- Timestamp normalization (client timestamp vs server receive timestamp)
- Kafka publication, with the event name as the partition key
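Timestamp normalization deserves a concrete illustration. One common approach, sketched below with illustrative field names, estimates client clock skew from the upload handshake and shifts event timestamps onto the server clock:

```python
MAX_FUTURE_DRIFT_MS = 0  # events cannot legitimately occur after server receipt

def normalize_timestamps(events, client_sent_at_ms, server_received_at_ms):
    """Estimate client clock skew from the batch upload handshake and
    shift each event's client timestamp onto the server clock."""
    skew_ms = server_received_at_ms - client_sent_at_ms
    normalized = []
    for e in events:
        ts = e["client_ts_ms"] + skew_ms
        # Clamp anything that still lands in the future.
        ts = min(ts, server_received_at_ms + MAX_FUTURE_DRIFT_MS)
        normalized.append({
            **e,
            "event_ts_ms": ts,
            "server_received_at_ms": server_received_at_ms,
        })
    return normalized
```

Keeping both the normalized timestamp and the raw server receipt time lets downstream jobs measure late-arrival lag without re-deriving it.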
Why Kafka: Decouples ingestion from processing. Provides durability, replay capability, and natural backpressure. Partitioning by event name enables parallel processing per event type.
Stage 3: Stream Processing
Kafka -> Flink/Spark Streaming -> Enrichment -> Kafka (enriched topic)
Processing steps:
- Deduplication: Bloom filter on event UUIDs with a 24-hour window. Accepts ~0.1% false positive rate to keep memory bounded.
- Enrichment: Join with user dimension tables (segment, country, subscription tier) from a Redis cache.
- Schema validation: Validate against the schema registry. Route invalid events to a dead letter topic.
- Sessionization: Assign events to sessions using a 30-minute inactivity timeout.
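The sessionization step can be sketched as follows (a simplified in-memory version; a real Flink job would keep `last_seen` in keyed state and handle out-of-order events with watermarks):

```python
SESSION_TIMEOUT_MS = 30 * 60 * 1000  # 30-minute inactivity window

def sessionize(events):
    """Assign a session index per user: a new session starts whenever the
    gap since the user's previous event exceeds the inactivity timeout."""
    last_seen = {}    # user_id -> timestamp of previous event
    session_num = {}  # user_id -> current session index
    out = []
    for e in sorted(events, key=lambda e: (e["user_id"], e["ts_ms"])):
        uid, ts = e["user_id"], e["ts_ms"]
        if uid not in last_seen or ts - last_seen[uid] > SESSION_TIMEOUT_MS:
            session_num[uid] = session_num.get(uid, -1) + 1
        last_seen[uid] = ts
        out.append({**e, "session_id": f"{uid}:{session_num[uid]}"})
    return out
```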
Stage 4: Storage
Two storage paths serve different query patterns:
| Storage | Purpose | Technology | Query Latency |
|---|---|---|---|
| Real-time store | Live dashboards, alerting | Druid or ClickHouse | < 1 second |
| Analytical store | Ad-hoc queries, cohort analysis | Parquet on S3 + Presto/Trino | 5-30 seconds |
The real-time store ingests from the enriched Kafka topic. The analytical store is populated by a batch job that compacts hourly Kafka segments into Parquet files, partitioned by date and event name.
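The partitioning scheme the compaction job writes might look like this (a sketch with an assumed bucket name and Hive-style key names; the real layout depends on your table format):

```python
from datetime import datetime, timezone

def partition_path(event_name, event_ts_ms, bucket="s3://analytics-events"):
    """Build the Hive-style partition prefix for a compacted Parquet file:
    partitioned by date, then event name, so Presto/Trino can prune
    partitions for time-bounded, per-event queries."""
    day = datetime.fromtimestamp(event_ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{bucket}/dt={day}/event_name={event_name}/"
```

Putting the date key first means a typical dashboard query ("event X over the last 7 days") prunes to 7 × 1 partitions instead of scanning the table.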
Stage 5: Query and Dashboard Layer
Dashboard UI -> Query Service -> Real-time Store (recent data)
                              -> Analytical Store (historical data)
The query service routes based on time range: queries within the last 24 hours hit the real-time store, older queries hit the analytical store. A query planner merges results when a query spans both.
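The routing logic reduces to splitting the requested time range at the 24-hour boundary. A minimal sketch of that planner step:

```python
def route_query(start_ms, end_ms, now_ms, cutoff_ms=24 * 60 * 60 * 1000):
    """Split a query's time range across the two stores. Ranges newer than
    the cutoff go to the real-time store, older ranges to the analytical
    store; a range spanning the boundary yields one scan of each, whose
    results the planner merges."""
    boundary = now_ms - cutoff_ms
    plan = []
    if start_ms < boundary:
        plan.append(("analytical", start_ms, min(end_ms, boundary)))
    if end_ms > boundary:
        plan.append(("realtime", max(start_ms, boundary), end_ms))
    return plan
```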
Late Arrival Handling
Late events (from devices that were offline) are processed normally through the pipeline. The real-time store uses a mutable data model that can update existing time buckets. The analytical store handles late arrivals through periodic recompaction jobs that merge late data into existing Parquet partitions.
Late event arrives -> Ingestion -> Processing (same pipeline)
                                        |
                      Real-time store update (upsert into existing bucket)
                                        |
                      Batch recompaction (merge into Parquet partition)
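The reason the real-time store must be mutable is visible in a toy version of its counters: a late event increments the bucket its event timestamp belongs to, not a bucket at arrival time (a sketch, not the Druid/ClickHouse ingestion model itself):

```python
from collections import defaultdict

class RealtimeCounters:
    """Mutable per-minute counters, so a late event simply increments the
    time bucket it belongs to rather than one at arrival time."""
    BUCKET_MS = 60_000

    def __init__(self):
        self.buckets = defaultdict(int)  # (event_name, bucket_start_ms) -> count

    def upsert(self, event_name, event_ts_ms):
        bucket = event_ts_ms - event_ts_ms % self.BUCKET_MS
        self.buckets[(event_name, bucket)] += 1
```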
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Kafka as central bus | Durability, replay, decoupling | Operational complexity, Kafka cluster management |
| Bloom filter dedup | Memory-efficient, fast | False positives (0.1%), requires windowing |
| Dual storage (real-time + analytical) | Optimized for both query patterns | Data consistency between stores, higher cost |
| Parquet on S3 | Cheap long-term storage, columnar queries | Higher query latency, requires compute cluster |
| Client-side UUIDs for dedup | No server state for ID generation | Relies on client correctness; buggy or time-based ID generation can collide (extremely rare with random UUIDs) |
Failure Modes
- Kafka broker failure: Replication factor of 3 ensures no data loss. Producers retry with idempotent writes enabled.
- Processing lag: If Flink falls behind, consumer lag increases. Alert when lag exceeds 5 minutes. Scale out processing parallelism (more task slots consuming more Kafka partitions).
- Enrichment cache miss: If Redis is unavailable, enrich with "unknown" values and flag events for re-enrichment in the next batch cycle.
- Schema registry outage: Fall back to permissive mode, accept all events, validate retroactively when the registry recovers.
- Dashboard query timeout: Implement query cost estimation. Reject queries that would scan more than 1 billion rows without aggregation.
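The enrichment-cache fallback above can be sketched concretely (illustrative field names; `cache_lookup` stands in for the Redis client):

```python
UNKNOWN = {"segment": "unknown", "country": "unknown", "tier": "unknown"}

def enrich(event, cache_lookup):
    """Enrich from the user-dimension cache; on a miss or cache outage,
    fall back to 'unknown' values and flag the event so a later batch
    pass can re-enrich it."""
    try:
        dims = cache_lookup(event["user_id"])
    except Exception:
        dims = None  # cache unavailable: degrade rather than drop the event
    if dims is None:
        return {**event, **UNKNOWN, "needs_reenrichment": True}
    return {**event, **dims, "needs_reenrichment": False}
```

The flag matters more than the fallback values: it gives the batch layer a cheap predicate for finding every event that needs a second pass.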
Scaling Considerations
- Ingestion: Scale horizontally behind the load balancer. Each instance is stateless.
- Kafka: Add partitions per topic as throughput grows. Rebalance consumers accordingly.
- Processing: Flink scales by adding task slots. Ensure checkpointing interval is tuned (too frequent wastes I/O, too infrequent risks reprocessing).
- Storage: ClickHouse scales with sharding. S3 scales inherently. Presto scales by adding worker nodes.
- Cost control: Implement tiered retention. Move data older than 30 days to cold storage. Downsample high-frequency events (e.g., scroll events aggregated to 1-minute buckets after 7 days).
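Downsampling is the simplest of these levers; a minimal sketch of collapsing raw high-frequency events into per-minute aggregate rows:

```python
from collections import defaultdict

def downsample(events, bucket_ms=60_000):
    """Collapse high-frequency events into per-minute aggregates: one row
    per (event_name, bucket) with a count, replacing the raw rows."""
    agg = defaultdict(int)
    for e in events:
        bucket = e["ts_ms"] - e["ts_ms"] % bucket_ms
        agg[(e["name"], bucket)] += 1
    return [
        {"name": name, "bucket_start_ms": bucket, "count": n}
        for (name, bucket), n in sorted(agg.items())
    ]
```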
Observability
- Pipeline health: Kafka consumer lag, processing throughput (events/sec), enrichment cache hit rate, dead letter queue depth.
- Data quality: Schema validation failure rate, duplicate rate, late arrival rate, null field rate per event type.
- Query performance: p50/p95/p99 query latency, query error rate, cache hit rate on the query layer.
- End-to-end latency: Measure the time from client event timestamp to dashboard visibility. Track p50 and p99.
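For the end-to-end latency metric, nearest-rank percentiles over a sample of (dashboard visibility time minus client event timestamp) values are enough; a small sketch:

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile over a pre-sorted sample."""
    if not sorted_values:
        raise ValueError("empty sample")
    k = max(0, -(-len(sorted_values) * p // 100) - 1)  # ceil(n*p/100) - 1
    return sorted_values[k]

def e2e_latency_report(samples_ms):
    """samples_ms: dashboard-visibility time minus client event timestamp,
    one value per event."""
    s = sorted(samples_ms)
    return {"p50": percentile(s, 50), "p99": percentile(s, 99)}
```

Note that client clock skew leaks into this metric; using the normalized event timestamp from the ingestion layer keeps the measurement honest.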
Key Takeaways
- Decouple ingestion from processing. A queue between them provides durability and backpressure naturally.
- Use dual storage paths: one optimized for real-time, one for cost-efficient historical analysis.
- Handle late arrivals as a first-class concern, not an afterthought. Mobile devices go offline regularly.
- Deduplication must happen in the pipeline, not as a downstream cleanup. Downstream consumers should trust the data.
- Observe the pipeline itself with the same rigor as the data it carries.
Further Reading
- Designing an Experimentation Platform for Mobile Apps: System design for a mobile experimentation platform covering assignment, exposure tracking, metric collection, statistical analysis, and ...
- Designing Idempotent APIs for Mobile Clients: How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios spe...
- Client-Heavy vs Server-Heavy Architectures for Mobile: Comparing client-heavy and server-heavy mobile architectures across performance, maintainability, update velocity, and user experience tr...
Final Thoughts
An analytics pipeline is only as good as its weakest stage. Invest in schema enforcement early, build late-arrival handling from day one, and treat pipeline observability as a product, not a side project.