Event Tracking System Design for Android Applications
A systems-level breakdown of designing an event tracking system for Android, covering batching, schema enforcement, local persistence, and delivery guarantees.
Most analytics SDKs are treated as fire-and-forget. That works until you need guaranteed delivery, schema validation, or cost control on ingestion volume. This post covers how to design an event tracking system for Android that handles all three.
Context
Any non-trivial Android app generates thousands of events per session: screen views, taps, scroll depth, API latencies, crashes. These events drive product decisions, A/B test evaluation, and anomaly detection. The tracking system is the pipeline entry point, and its reliability directly affects downstream data quality.
Problem
Design an on-device event tracking system that:
- Captures structured events with minimal performance overhead
- Guarantees delivery even across app restarts and network failures
- Enforces schema consistency before events leave the device
- Controls volume to avoid backend ingestion cost blowouts
Constraints
| Constraint | Detail |
|---|---|
| Battery | Batching and network calls must respect Doze mode and battery optimization |
| Memory | Event queue must be bounded; unbounded queues cause OOMs on low-end devices |
| Disk | SQLite or file-based persistence; limited storage on budget devices |
| Network | Must handle offline, flaky, and metered connections |
| Thread safety | Events fire from UI thread, background workers, and broadcast receivers |
Design
Event Model
```kotlin
data class TrackingEvent(
    val name: String,
    val properties: Map<String, Any>,
    val timestamp: Long = System.currentTimeMillis(),
    val sessionId: String,
    val sequenceNumber: Long,
    val schemaVersion: Int = 1
)
```

The `sequenceNumber` enables server-side deduplication and ordering. The `schemaVersion` field lets the backend route events to the correct deserialization path.
Architecture Layers
- Capture Layer: A singleton `EventTracker` exposes a thread-safe `track(event)` method. Events are validated against a local schema registry before acceptance.
- Persistence Layer: Accepted events are written to a Room database table, which guarantees survival across process death (entity and DAO sketched after this list).
- Batching Layer: A `CoroutineWorker` runs on a periodic schedule (e.g., every 60 seconds) or when the batch size hits a threshold (e.g., 50 events). It reads persisted events, serializes them, and hands them off to the transport layer.
- Transport Layer: An HTTP client sends batched payloads with gzip compression. On success, events are deleted from the local database. On failure, they remain for the next cycle.
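A minimal sketch of what the persistence layer could look like with Room. The `PersistedEvent` entity, the `EventDao` method names, and the choice to store properties as a JSON string are illustrative assumptions, not a prescribed schema.

```kotlin
import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.PrimaryKey
import androidx.room.Query

// Illustrative Room entity; properties are stored as a JSON string for simplicity.
@Entity(tableName = "tracking_events")
data class PersistedEvent(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val name: String,
    val propertiesJson: String,
    val timestamp: Long,
    val sessionId: String,
    val sequenceNumber: Long,
    val schemaVersion: Int
)

@Dao
interface EventDao {
    @Insert
    suspend fun insert(event: PersistedEvent)

    // Oldest-first so the batcher drains the queue in arrival order.
    @Query("SELECT * FROM tracking_events ORDER BY id ASC LIMIT :limit")
    suspend fun getOldestEvents(limit: Int): List<PersistedEvent>

    @Query("DELETE FROM tracking_events WHERE id IN (:ids)")
    suspend fun deleteByIds(ids: List<Long>)

    // Used for the max-row-count and TTL policies described later.
    @Query("SELECT COUNT(*) FROM tracking_events")
    suspend fun count(): Int

    @Query("DELETE FROM tracking_events WHERE timestamp < :cutoff")
    suspend fun deleteOlderThan(cutoff: Long)
}
```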
Schema Enforcement
```kotlin
object SchemaRegistry {
    private val schemas: Map<String, Set<String>> = mapOf(
        "screen_view" to setOf("screen_name", "referrer"),
        "button_tap" to setOf("button_id", "screen_name"),
        "api_call" to setOf("endpoint", "status_code", "latency_ms")
    )

    fun validate(event: TrackingEvent): Boolean {
        val required = schemas[event.name] ?: return false
        return required.all { it in event.properties }
    }
}
```

Events that fail validation are dropped and logged to a separate debug stream. This prevents garbage data from polluting the pipeline.
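To make the flow concrete, here is a sketch of how the capture layer might tie validation and persistence together. It reuses the illustrative `PersistedEvent` and `EventDao` from the persistence sketch above; the `debugLogger` hook and the JSON serialization of properties are assumptions, and the tracker is shown as an injectable class rather than a singleton for brevity.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch
import org.json.JSONObject

// Illustrative capture layer: validate on the caller's thread, persist off it.
class EventTracker(
    private val dao: EventDao,
    private val debugLogger: (TrackingEvent) -> Unit = {}  // hypothetical debug stream hook
) {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

    fun track(event: TrackingEvent) {
        if (!SchemaRegistry.validate(event)) {
            // Invalid events never reach the main pipeline.
            debugLogger(event)
            return
        }
        scope.launch {
            dao.insert(
                PersistedEvent(
                    name = event.name,
                    propertiesJson = JSONObject(event.properties).toString(),
                    timestamp = event.timestamp,
                    sessionId = event.sessionId,
                    sequenceNumber = event.sequenceNumber,
                    schemaVersion = event.schemaVersion
                )
            )
        }
    }
}
```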
Batching Strategy
```kotlin
class EventBatcher(
    private val dao: EventDao,
    private val transport: EventTransport,
    private val maxBatchSize: Int = 50,
    private val maxAgeMs: Long = 60_000
) {
    suspend fun flush() {
        val events = dao.getOldestEvents(maxBatchSize)
        if (events.isEmpty()) return
        val payload = EventPayload(
            events = events,
            deviceId = DeviceIdProvider.getId(),
            appVersion = BuildConfig.VERSION_NAME
        )
        val result = transport.send(payload)
        if (result.isSuccess) {
            dao.deleteByIds(events.map { it.id })
        }
    }
}
```
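The batching layer calls for a `CoroutineWorker` on a periodic schedule. A minimal sketch of that scheduling with WorkManager follows; `FlushWorker`, the `TrackingGraph` service locator, and the unique work name are assumptions. Note that WorkManager enforces a 15-minute minimum on periodic work, so a 60-second flush interval also needs an in-process trigger (e.g., flushing when the batch-size threshold is hit).

```kotlin
import android.content.Context
import androidx.work.Constraints
import androidx.work.CoroutineWorker
import androidx.work.ExistingPeriodicWorkPolicy
import androidx.work.NetworkType
import androidx.work.PeriodicWorkRequestBuilder
import androidx.work.WorkManager
import androidx.work.WorkerParameters
import java.util.concurrent.TimeUnit

// Hypothetical service locator for the tracker's dependencies (DI in a real app).
object TrackingGraph {
    lateinit var eventBatcher: EventBatcher
}

// Illustrative worker that drives EventBatcher.flush() on a schedule.
class FlushWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        return try {
            TrackingGraph.eventBatcher.flush()
            Result.success()
        } catch (t: Throwable) {
            // Let WorkManager apply its backoff policy and retry later.
            Result.retry()
        }
    }

    companion object {
        fun schedule(context: Context) {
            // 15 minutes is the minimum interval WorkManager allows for periodic work.
            val request = PeriodicWorkRequestBuilder<FlushWorker>(15, TimeUnit.MINUTES)
                .setConstraints(
                    Constraints.Builder()
                        .setRequiredNetworkType(NetworkType.CONNECTED)
                        .build()
                )
                .build()
            WorkManager.getInstance(context).enqueueUniquePeriodicWork(
                "event_flush", ExistingPeriodicWorkPolicy.KEEP, request
            )
        }
    }
}
```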
Volume Control
Three mechanisms:
- Sampling: High-frequency events (e.g., scroll events) are sampled at configurable rates (1%, 10%) using a deterministic hash on the session ID (see the sketch after this list).
- Deduplication: Events with the same name and properties within a 500ms window are collapsed.
- TTL: Events older than 72 hours in the local database are purged. Stale data has diminishing analytical value.
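A minimal sketch of the deterministic sampling described above. The per-event rate table and the use of `String.hashCode()` as the bucketing hash are illustrative assumptions; a production system might prefer a stronger hash for a more even distribution.

```kotlin
// Illustrative deterministic sampler: a given session ID always lands in the same
// bucket, so a session's high-frequency events are either all kept or all dropped.
object EventSampler {
    // Hypothetical per-event sampling rates; events not listed are always kept.
    private val rates: Map<String, Double> = mapOf(
        "scroll" to 0.01,      // 1%
        "frame_drop" to 0.10   // 10%
    )

    fun shouldKeep(eventName: String, sessionId: String): Boolean {
        val rate = rates[eventName] ?: return true
        // Map the session ID to a stable bucket in [0, 10_000).
        // Int.mod keeps the result non-negative for a positive divisor.
        val bucket = sessionId.hashCode().mod(10_000)
        return bucket < (rate * 10_000).toInt()
    }
}
```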
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| Room for persistence | Survives process death, queryable | Adds ~1MB to APK, slightly slower than file I/O |
| Periodic batching | Reduces network calls, battery-friendly | Events delayed up to batch interval |
| Client-side schema validation | Prevents bad data at source | Requires schema updates shipped with app releases |
| Gzip compression | Reduces bandwidth 60-80% | CPU cost on low-end devices |
| Deterministic sampling | Consistent per-user experience | Cannot adjust sampling for already-emitted events |
Failure Modes
- Database full: Set a max row count (e.g., 10,000). When exceeded, drop oldest events first. Alert via a meta-event.
- Serialization failure: Catch exceptions per-event, not per-batch. One malformed event should not block the batch.
- Transport timeout: Exponential backoff with jitter, capped at 5 minutes (sketched after this list). After 3 consecutive failures, back off to the next WorkManager cycle.
- Schema mismatch on server: Server returns partial acceptance. Client deletes accepted events and retains rejected ones with a `rejected` flag to prevent infinite retries.
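A sketch of the backoff policy from the transport-timeout item above, using full jitter. The base delay and the constant names are assumptions; only the 5-minute cap and the 3-failure cutoff come from the list itself.

```kotlin
import kotlin.random.Random

// Illustrative retry policy: exponential growth with full jitter, capped at 5 minutes.
object BackoffPolicy {
    private const val BASE_DELAY_MS = 2_000L
    private const val MAX_DELAY_MS = 5 * 60 * 1_000L

    // After this many consecutive failures, stop retrying inline and wait for
    // the next WorkManager cycle instead.
    const val MAX_INLINE_RETRIES = 3

    fun delayForAttempt(attempt: Int): Long {
        // 2s, 4s, 8s, ... capped at 5 minutes; the shift is clamped to avoid overflow.
        val exponential = BASE_DELAY_MS shl (attempt - 1).coerceIn(0, 20)
        val capped = exponential.coerceAtMost(MAX_DELAY_MS)
        // Full jitter: pick a uniform delay in [0, capped] to spread retries across clients.
        return Random.nextLong(capped + 1)
    }
}
```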
Scaling Considerations
- Event volume growth: Move from per-event rows to batch-serialized blobs (Protocol Buffers) to reduce database write amplification.
- Multi-process apps: If the app uses multiple processes, use a ContentProvider or a dedicated tracking process to avoid database locking issues.
- Backend ingestion: The client should respect HTTP 429 responses and `Retry-After` headers (see the sketch after this list). Server-driven throttling is essential at scale.
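A sketch of how the transport layer might surface server throttling, assuming OkHttp. The `SendResult` type, the `HttpEventTransport` name, the endpoint parameter, and the 60-second fallback when `Retry-After` is missing or non-numeric are all assumptions.

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

// Illustrative transport result: the batcher only needs to know whether to
// delete the batch, retry later, or honour a server-requested delay.
sealed class SendResult {
    object Success : SendResult()
    data class Throttled(val retryAfterMs: Long) : SendResult()
    object RetriableFailure : SendResult()
}

class HttpEventTransport(
    private val client: OkHttpClient,
    private val endpoint: String  // hypothetical ingestion URL
) {
    fun send(gzippedBody: ByteArray): SendResult {
        val request = Request.Builder()
            .url(endpoint)
            .header("Content-Encoding", "gzip")
            .post(gzippedBody.toRequestBody("application/json".toMediaType()))
            .build()
        client.newCall(request).execute().use { response ->
            return when {
                response.isSuccessful -> SendResult.Success
                response.code == 429 -> {
                    // Retry-After may be absent or an HTTP date; fall back to 60s.
                    val seconds = response.header("Retry-After")?.toLongOrNull() ?: 60L
                    SendResult.Throttled(seconds * 1_000)
                }
                else -> SendResult.RetriableFailure
            }
        }
    }
}
```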
Observability
- Track a meta-event `tracking_health` that reports: events queued, events sent, events dropped, batch size, flush latency, and error counts (sketched after this list).
- Expose a debug mode that logs all events to Logcat with a `TRACKING` tag, gated behind a build config flag.
- Dashboard metrics: delivery rate, p95 flush latency, schema validation failure rate, compression ratio.
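A sketch of the `tracking_health` meta-event as a plain data holder that feeds back into the normal pipeline. The field names are illustrative, and it assumes `tracking_health` is registered in the schema registry so it is not dropped by validation.

```kotlin
// Illustrative snapshot of the tracker's own health, reported as a meta-event.
data class TrackingHealth(
    val eventsQueued: Int,
    val eventsSent: Long,
    val eventsDropped: Long,
    val lastBatchSize: Int,
    val lastFlushLatencyMs: Long,
    val errorCount: Long
) {
    // The health snapshot travels through the same pipeline as any other event.
    fun toEvent(sessionId: String, sequenceNumber: Long) = TrackingEvent(
        name = "tracking_health",
        properties = mapOf(
            "events_queued" to eventsQueued,
            "events_sent" to eventsSent,
            "events_dropped" to eventsDropped,
            "batch_size" to lastBatchSize,
            "flush_latency_ms" to lastFlushLatencyMs,
            "error_count" to errorCount
        ),
        sessionId = sessionId,
        sequenceNumber = sequenceNumber
    )
}
```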
Key Takeaways
- Persist before sending. In-memory-only queues lose events on process death.
- Validate schemas on the client. Bad data caught at the source saves hours of pipeline debugging.
- Batch aggressively. Per-event HTTP calls are wasteful on mobile networks.
- Bound everything: queue size, batch size, retry count, event TTL. Unbounded systems fail unpredictably.
- Ship observability for the tracking system itself. You cannot trust a system you cannot measure.
Related: Debugging Performance Issues in Large Android Apps.
See also: Mobile Analytics Pipeline: From App Event to Dashboard.
Final Thoughts
An event tracking system is infrastructure, not a feature. Treat it with the same rigor as your networking layer or database. The cost of getting it wrong is not a crash; it is months of decisions made on unreliable data.