Event Tracking System Design for Android Applications
A systems-level breakdown of designing an event tracking system for Android, covering batching, schema enforcement, local persistence, and delivery guarantees.
Most analytics SDKs are treated as fire-and-forget. That works until you need guaranteed delivery, schema validation, or cost control on ingestion volume. This post covers how to design an event tracking system for Android that handles all three.
Context
Any non-trivial Android app generates thousands of events per session: screen views, taps, scroll depth, API latencies, crashes. These events drive product decisions, A/B test evaluation, and anomaly detection. The tracking system is the pipeline entry point, and its reliability directly affects downstream data quality.
Problem
Design an on-device event tracking system that:
- Captures structured events with minimal performance overhead
- Guarantees delivery even across app restarts and network failures
- Enforces schema consistency before events leave the device
- Controls volume to avoid backend ingestion cost blowouts
Constraints
| Constraint | Detail |
|---|---|
| Battery | Batching and network calls must respect Doze mode and battery optimization |
| Memory | Event queue must be bounded; unbounded queues cause OOMs on low-end devices |
| Disk | SQLite or file-based persistence; limited storage on budget devices |
| Network | Must handle offline, flaky, and metered connections |
| Thread safety | Events fire from UI thread, background workers, and broadcast receivers |
Design
Event Model
```kotlin
data class TrackingEvent(
    val name: String,
    val properties: Map<String, Any>,
    val timestamp: Long = System.currentTimeMillis(),
    val sessionId: String,
    val sequenceNumber: Long,
    val schemaVersion: Int = 1
)
```

The `sequenceNumber` enables server-side deduplication and ordering. The `schemaVersion` field lets the backend route events to the correct deserialization path.
Architecture Layers
- Capture Layer: A singleton `EventTracker` exposes a thread-safe `track(event)` method. Events are validated against a local schema registry before acceptance.
- Persistence Layer: Accepted events are written to a Room database table, which guarantees survival across process death (entity and DAO sketched after this list).
- Batching Layer: A `CoroutineWorker` runs on a periodic schedule (e.g., every 60 seconds) or when the batch size hits a threshold (e.g., 50 events). It reads persisted events, serializes them, and hands them off to the transport layer.
- Transport Layer: An HTTP client sends batched payloads with gzip compression. On success, events are deleted from the local database. On failure, they remain for the next cycle.
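A minimal sketch of what the persistence layer could look like with Room. The `PersistedEvent` entity, the `EventDao` method names, and the choice to store properties as a JSON string are illustrative assumptions, not a prescribed schema.

```kotlin
import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.PrimaryKey
import androidx.room.Query

// Illustrative Room entity; properties are stored as a JSON string for simplicity.
@Entity(tableName = "tracking_events")
data class PersistedEvent(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val name: String,
    val propertiesJson: String,
    val timestamp: Long,
    val sessionId: String,
    val sequenceNumber: Long,
    val schemaVersion: Int
)

@Dao
interface EventDao {
    @Insert
    suspend fun insert(event: PersistedEvent)

    // Oldest-first so the batcher drains the queue in arrival order.
    @Query("SELECT * FROM tracking_events ORDER BY id ASC LIMIT :limit")
    suspend fun getOldestEvents(limit: Int): List<PersistedEvent>

    @Query("DELETE FROM tracking_events WHERE id IN (:ids)")
    suspend fun deleteByIds(ids: List<Long>)

    // Used for the max-row-count and TTL policies described later.
    @Query("SELECT COUNT(*) FROM tracking_events")
    suspend fun count(): Int

    @Query("DELETE FROM tracking_events WHERE timestamp < :cutoff")
    suspend fun deleteOlderThan(cutoff: Long)
}
```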
Schema Enforcement
```kotlin
object SchemaRegistry {
    private val schemas: Map<String, Set<String>> = mapOf(
        "screen_view" to setOf("screen_name", "referrer"),
        "button_tap" to setOf("button_id", "screen_name"),
        "api_call" to setOf("endpoint", "status_code", "latency_ms")
    )

    fun validate(event: TrackingEvent): Boolean {
        val required = schemas[event.name] ?: return false
        return required.all { it in event.properties }
    }
}
```

Events that fail validation are dropped and logged to a separate debug stream. This prevents garbage data from polluting the pipeline.
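To make the flow concrete, here is a sketch of how the capture layer might tie validation and persistence together. It reuses the illustrative `PersistedEvent` and `EventDao` from the persistence sketch above; the `debugLogger` hook and the JSON serialization of properties are assumptions, and the tracker is shown as an injectable class rather than a singleton for brevity.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch
import org.json.JSONObject

// Illustrative capture layer: validate on the caller's thread, persist off it.
class EventTracker(
    private val dao: EventDao,
    private val debugLogger: (TrackingEvent) -> Unit = {}  // hypothetical debug stream hook
) {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

    fun track(event: TrackingEvent) {
        if (!SchemaRegistry.validate(event)) {
            // Invalid events never reach the main pipeline.
            debugLogger(event)
            return
        }
        scope.launch {
            dao.insert(
                PersistedEvent(
                    name = event.name,
                    propertiesJson = JSONObject(event.properties).toString(),
                    timestamp = event.timestamp,
                    sessionId = event.sessionId,
                    sequenceNumber = event.sequenceNumber,
                    schemaVersion = event.schemaVersion
                )
            )
        }
    }
}
```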
Batching Strategy
```kotlin
class EventBatcher(
    private val dao: EventDao,
    private val transport: EventTransport,
    private val maxBatchSize: Int = 50,
    private val maxAgeMs: Long = 60_000
) {
    suspend fun flush() {
        val events = dao.getOldestEvents(maxBatchSize)
        if (events.isEmpty()) return
        val payload = EventPayload(
            events = events,
            deviceId = DeviceIdProvider.getId(),
            appVersion = BuildConfig.VERSION_NAME
        )
        val result = transport.send(payload)
        if (result.isSuccess) {
            dao.deleteByIds(events.map { it.id })
        }
    }
}
```
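The batching layer calls for a `CoroutineWorker` on a periodic schedule. A minimal sketch of that scheduling with WorkManager follows; `FlushWorker`, the `TrackingGraph` service locator, and the unique work name are assumptions. Note that WorkManager enforces a 15-minute minimum on periodic work, so a 60-second flush interval also needs an in-process trigger (e.g., flushing when the batch-size threshold is hit).

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.CoroutineWorker
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.NetworkType
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.WorkerParameters
import java.util.concurrent.TimeUnit

// Hypothetical service locator for the tracker's dependencies (DI in a real app).
object TrackingGraph {
    lateinit var eventBatcher: EventBatcher
}

// Illustrative worker that drives EventBatcher.flush() on a schedule.
class FlushWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        return try {
            TrackingGraph.eventBatcher.flush()
            Result.success()
        } catch (t: Throwable) {
            // Let WorkManager apply its backoff policy and retry later.
            Result.retry()
        }
    }

    companion object {
        fun schedule(context: Context) {
            // 15 minutes is the minimum interval WorkManager allows for periodic work.
            val request = PeriodicWorkRequestBuilder<FlushWorker>(15, TimeUnit.MINUTES)
                .setConstraints(
                    Constraints.Builder()
                        .setRequiredNetworkType(NetworkType.CONNECTED)
                        .build()
                )
                .build()
            WorkManager.getInstance(context).enqueueUniquePeriodicWork(
                "event_flush", ExistingPeriodicWorkPolicy.KEEP, request
            )
        }
    }
}
```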
Volume Control
Three mechanisms:
- Sampling: High-frequency events (e.g., scroll events) are sampled at configurable rates (1%, 10%) using a deterministic hash on the session ID (see the sketch after this list).
- Deduplication: Events with the same name and properties within a 500ms window are collapsed.
- TTL: Events older than 72 hours in the local database are purged. Stale data has diminishing analytical value.
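A minimal sketch of the deterministic sampling described above. The per-event rate table and the use of `String.hashCode()` as the bucketing hash are illustrative assumptions; a production system might prefer a stronger hash for a more even distribution.

```kotlin
// Illustrative deterministic sampler: a given session ID always lands in the same
// bucket, so a session's high-frequency events are either all kept or all dropped.
object EventSampler {
    // Hypothetical per-event sampling rates; events not listed are always kept.
    private val rates: Map<String, Double> = mapOf(
        "scroll" to 0.01,      // 1%
        "frame_drop" to 0.10   // 10%
    )

    fun shouldKeep(eventName: String, sessionId: String): Boolean {
        val rate = rates[eventName] ?: return true
        // Map the session ID to a stable bucket in [0, 10_000).
        // Int.mod keeps the result non-negative for a positive divisor.
        val bucket = sessionId.hashCode().mod(10_000)
        return bucket < (rate * 10_000).toInt()
    }
}
```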
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Room for persistence | Survives process death, queryable | Adds ~1MB to APK, slightly slower than file I/O |
| Periodic batching | Reduces network calls, battery-friendly | Events delayed up to batch interval |
| Client-side schema validation | Prevents bad data at source | Requires schema updates shipped with app releases |
| Gzip compression | Reduces bandwidth 60-80% | CPU cost on low-end devices |
| Deterministic sampling | Consistent per-user experience | Cannot adjust sampling for already-emitted events |
Failure Modes
- Database full: Set a max row count (e.g., 10,000). When exceeded, drop oldest events first. Alert via a meta-event.
- Serialization failure: Catch exceptions per-event, not per-batch. One malformed event should not block the batch.
- Transport timeout: Exponential backoff with jitter, capped at 5 minutes (sketched after this list). After 3 consecutive failures, back off to the next WorkManager cycle.
- Schema mismatch on server: Server returns partial acceptance. Client deletes accepted events and retains rejected ones with a `rejected` flag to prevent infinite retries.
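A sketch of the backoff policy from the transport-timeout item above, using full jitter. The base delay and the constant names are assumptions; only the 5-minute cap and the 3-failure cutoff come from the list itself.

```kotlin
import kotlin.random.Random

// Illustrative retry policy: exponential growth with full jitter, capped at 5 minutes.
object BackoffPolicy {
    private const val BASE_DELAY_MS = 2_000L
    private const val MAX_DELAY_MS = 5 * 60 * 1_000L

    // After this many consecutive failures, stop retrying inline and wait for
    // the next WorkManager cycle instead.
    const val MAX_INLINE_RETRIES = 3

    fun delayForAttempt(attempt: Int): Long {
        // 2s, 4s, 8s, ... capped at 5 minutes; the shift is clamped to avoid overflow.
        val exponential = BASE_DELAY_MS shl (attempt - 1).coerceIn(0, 20)
        val capped = exponential.coerceAtMost(MAX_DELAY_MS)
        // Full jitter: pick a uniform delay in [0, capped] to spread retries across clients.
        return Random.nextLong(capped + 1)
    }
}
```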
Scaling Considerations
- Event volume growth: Move from per-event rows to batch-serialized blobs (Protocol Buffers) to reduce database write amplification.
- Multi-process apps: If the app uses multiple processes, use a ContentProvider or a dedicated tracking process to avoid database locking issues.
- Backend ingestion: The client should respect HTTP 429 responses and `Retry-After` headers (see the sketch after this list). Server-driven throttling is essential at scale.
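A sketch of how the transport layer might surface server throttling, assuming OkHttp. The `SendResult` type, the `HttpEventTransport` name, the endpoint parameter, and the 60-second fallback when `Retry-After` is missing or non-numeric are all assumptions.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

// Illustrative transport result: the batcher only needs to know whether to
// delete the batch, retry later, or honour a server-requested delay.
sealed class SendResult {
    object Success : SendResult()
    data class Throttled(val retryAfterMs: Long) : SendResult()
    object RetriableFailure : SendResult()
}

class HttpEventTransport(
    private val client: OkHttpClient,
    private val endpoint: String  // hypothetical ingestion URL
) {
    fun send(gzippedBody: ByteArray): SendResult {
        val request = Request.Builder()
            .url(endpoint)
            .header("Content-Encoding", "gzip")
            .post(gzippedBody.toRequestBody("application/json".toMediaType()))
            .build()
        client.newCall(request).execute().use { response ->
            return when {
                response.isSuccessful -> SendResult.Success
                response.code == 429 -> {
                    // Retry-After may be absent or an HTTP date; fall back to 60s.
                    val seconds = response.header("Retry-After")?.toLongOrNull() ?: 60L
                    SendResult.Throttled(seconds * 1_000)
                }
                else -> SendResult.RetriableFailure
            }
        }
    }
}
```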
Observability
- Track a meta-event `tracking_health` that reports: events queued, events sent, events dropped, batch size, flush latency, and error counts (sketched after this list).
- Expose a debug mode that logs all events to Logcat with a `TRACKING` tag, gated behind a build config flag.
- Dashboard metrics: delivery rate, p95 flush latency, schema validation failure rate, compression ratio.
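A sketch of the `tracking_health` meta-event as a plain data holder that feeds back into the normal pipeline. The field names are illustrative, and it assumes `tracking_health` is registered in the schema registry so it is not dropped by validation.

```kotlin
// Illustrative snapshot of the tracker's own health, reported as a meta-event.
data class TrackingHealth(
    val eventsQueued: Int,
    val eventsSent: Long,
    val eventsDropped: Long,
    val lastBatchSize: Int,
    val lastFlushLatencyMs: Long,
    val errorCount: Long
) {
    // The health snapshot travels through the same pipeline as any other event.
    fun toEvent(sessionId: String, sequenceNumber: Long) = TrackingEvent(
        name = "tracking_health",
        properties = mapOf(
            "events_queued" to eventsQueued,
            "events_sent" to eventsSent,
            "events_dropped" to eventsDropped,
            "batch_size" to lastBatchSize,
            "flush_latency_ms" to lastFlushLatencyMs,
            "error_count" to errorCount
        ),
        sessionId = sessionId,
        sequenceNumber = sequenceNumber
    )
}
```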
Key Takeaways
- Persist before sending. In-memory-only queues lose events on process death.
- Validate schemas on the client. Bad data caught at the source saves hours of pipeline debugging.
- Batch aggressively. Per-event HTTP calls are wasteful on mobile networks.
- Bound everything: queue size, batch size, retry count, event TTL. Unbounded systems fail unpredictably.
- Ship observability for the tracking system itself. You cannot trust a system you cannot measure.
Related: Debugging Performance Issues in Large Android Apps.
See also: Mobile Analytics Pipeline: From App Event to Dashboard.
Final Thoughts
An event tracking system is infrastructure, not a feature. Treat it with the same rigor as your networking layer or database. The cost of getting it wrong is not a crash; it is months of decisions made on unreliable data.