Event Tracking System Design for Android Applications

Dhruval Dhameliya · February 12, 2026 · 6 min read

A systems-level breakdown of designing an event tracking system for Android, covering batching, schema enforcement, local persistence, and delivery guarantees.

Most analytics SDKs are treated as fire-and-forget. That works until you need guaranteed delivery, schema validation, or cost control on ingestion volume. This post covers how to design an event tracking system for Android that handles all three.

Context

Any non-trivial Android app generates thousands of events per session: screen views, taps, scroll depth, API latencies, crashes. These events drive product decisions, A/B test evaluation, and anomaly detection. The tracking system is the pipeline entry point, and its reliability directly affects downstream data quality.

Problem

Design an on-device event tracking system that:

  • Captures structured events with minimal performance overhead
  • Guarantees delivery even across app restarts and network failures
  • Enforces schema consistency before events leave the device
  • Controls volume to avoid backend ingestion cost blowouts

Constraints

  • Battery: Batching and network calls must respect Doze mode and battery optimization.
  • Memory: The event queue must be bounded; unbounded queues cause OOMs on low-end devices.
  • Disk: SQLite or file-based persistence; storage is limited on budget devices.
  • Network: Must handle offline, flaky, and metered connections.
  • Thread safety: Events fire from the UI thread, background workers, and broadcast receivers.

Design

Event Model

data class TrackingEvent(
    val name: String,
    val properties: Map<String, Any>,
    val timestamp: Long = System.currentTimeMillis(),
    val sessionId: String,
    val sequenceNumber: Long,
    val schemaVersion: Int = 1
)

The sequenceNumber enables server-side deduplication and ordering. The schemaVersion field lets the backend route events to the correct deserialization path.
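
As a concrete illustration of how these two fields get populated, the sketch below pairs a per-session ID with a monotonic counter. SessionSequence is a hypothetical helper, not part of any SDK; the only requirement is that (sessionId, sequenceNumber) pairs are unique and monotonic within a session.

import java.util.UUID
import java.util.concurrent.atomic.AtomicLong

// Hypothetical helper: a per-process session id plus a monotonic counter.
// Resetting the counter when a new session starts keeps (sessionId, sequenceNumber)
// unique, which is what server-side deduplication and ordering rely on.
object SessionSequence {
    @Volatile var sessionId: String = UUID.randomUUID().toString()
        private set
    private val counter = AtomicLong(0)

    fun startNewSession() {
        sessionId = UUID.randomUUID().toString()
        counter.set(0)
    }

    fun next(): Long = counter.incrementAndGet()
}

// Usage when constructing an event:
// TrackingEvent(
//     name = "screen_view",
//     properties = mapOf("screen_name" to "home", "referrer" to "deeplink"),
//     sessionId = SessionSequence.sessionId,
//     sequenceNumber = SessionSequence.next()
// )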

Architecture Layers

  1. Capture Layer: A singleton EventTracker exposes a thread-safe track(event) method. Events are validated against a local schema registry before acceptance.
  2. Persistence Layer: Accepted events are written to a Room database table, which guarantees survival across process death (a minimal entity and DAO sketch follows this list).
  3. Batching Layer: A flush runs when the batch size hits a threshold (e.g., 50 events) or on a timed cadence (e.g., every 60 seconds in-process); a WorkManager CoroutineWorker serves as the deferred backstop, since periodic work cannot run more often than every 15 minutes. Each flush reads persisted events, serializes them, and hands them to the transport layer.
  4. Transport Layer: An HTTP client sends batched payloads with gzip compression. On success, events are deleted from the local database. On failure, they remain for the next cycle.
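
For reference, the persistence layer could look roughly like the sketch below. PersistedEvent and EventDao are assumed names (chosen so they line up with the EventBatcher shown later), and the properties map is assumed to be serialized to JSON before it reaches Room.

import androidx.room.Dao
import androidx.room.Entity
import androidx.room.Insert
import androidx.room.PrimaryKey
import androidx.room.Query

// Hypothetical Room entity: one row per accepted event. The properties map is
// assumed to be serialized to a JSON string before it reaches this layer.
@Entity(tableName = "tracking_events")
data class PersistedEvent(
    @PrimaryKey(autoGenerate = true) val id: Long = 0,
    val name: String,
    val propertiesJson: String,
    val timestamp: Long,
    val sessionId: String,
    val sequenceNumber: Long,
    val schemaVersion: Int
)

@Dao
interface EventDao {
    @Insert
    suspend fun insert(event: PersistedEvent)

    // Oldest-first so flushes preserve on-device ordering.
    @Query("SELECT * FROM tracking_events ORDER BY timestamp ASC LIMIT :limit")
    suspend fun getOldestEvents(limit: Int): List<PersistedEvent>

    @Query("DELETE FROM tracking_events WHERE id IN (:ids)")
    suspend fun deleteByIds(ids: List<Long>)

    // Used to enforce the bounded-queue constraint (see Failure Modes).
    @Query("SELECT COUNT(*) FROM tracking_events")
    suspend fun count(): Int
}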

Schema Enforcement

object SchemaRegistry {
    // Event name -> required property keys. Updated schemas ship with app releases.
    private val schemas: Map<String, Set<String>> = mapOf(
        "screen_view" to setOf("screen_name", "referrer"),
        "button_tap" to setOf("button_id", "screen_name"),
        "api_call" to setOf("endpoint", "status_code", "latency_ms")
    )

    // Unknown event names fail validation, so unregistered events never leave the device.
    fun validate(event: TrackingEvent): Boolean {
        val required = schemas[event.name] ?: return false
        return required.all { it in event.properties }
    }
}

Events that fail validation are dropped and logged to a separate debug stream. This prevents garbage data from polluting the pipeline.
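
Putting the capture and persistence layers together, a track() entry point might look like the following. This is a minimal sketch under the assumptions above: it reuses the hypothetical PersistedEvent and EventDao from the persistence sketch, and the Logcat call stands in for the separate debug stream.

import android.util.Log
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.launch
import org.json.JSONObject

// Minimal sketch of the capture layer: validate, then persist off the caller's thread.
class EventTracker(private val dao: EventDao) {
    private val scope = CoroutineScope(SupervisorJob() + Dispatchers.IO)

    fun track(event: TrackingEvent) {
        if (!SchemaRegistry.validate(event)) {
            // Dropped events go to a debug stream, never to the pipeline.
            Log.d("TRACKING", "Dropped invalid event: ${event.name}")
            return
        }
        scope.launch {
            dao.insert(
                PersistedEvent(
                    name = event.name,
                    propertiesJson = JSONObject(event.properties).toString(),
                    timestamp = event.timestamp,
                    sessionId = event.sessionId,
                    sequenceNumber = event.sequenceNumber,
                    schemaVersion = event.schemaVersion
                )
            )
        }
    }
}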

Batching Strategy

class EventBatcher(
    private val dao: EventDao,
    private val transport: EventTransport,
    private val maxBatchSize: Int = 50,
    // Target flush cadence; enforced by the scheduler that calls flush(), not here.
    private val maxAgeMs: Long = 60_000
) {
    suspend fun flush() {
        // Oldest-first keeps delivery roughly ordered across flush cycles.
        val events = dao.getOldestEvents(maxBatchSize)
        if (events.isEmpty()) return

        val payload = EventPayload(
            events = events,
            deviceId = DeviceIdProvider.getId(),
            appVersion = BuildConfig.VERSION_NAME
        )

        val result = transport.send(payload)
        if (result.isSuccess) {
            dao.deleteByIds(events.map { it.id })
        }
        // On failure the rows stay in the database and are retried on the next cycle.
    }
}
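
Scheduling the flush is the worker's job. The sketch below shows one plausible wiring; FlushWorker and ServiceLocator are assumed names, and the enqueueing comment uses WorkManager's 15-minute periodic floor rather than the 60-second in-process cadence.

import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters

// Placeholder wiring; real apps would use their existing DI setup (Hilt, manual DI, etc.).
object ServiceLocator {
    lateinit var eventBatcher: EventBatcher
}

// Backstop worker: drains whatever is persisted, even if the app was killed in between.
class FlushWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        return try {
            ServiceLocator.eventBatcher.flush()
            Result.success()
        } catch (t: Throwable) {
            // Let WorkManager retry with its backoff policy.
            Result.retry()
        }
    }
}

// Enqueueing (e.g., from Application.onCreate):
// val request = PeriodicWorkRequestBuilder<FlushWorker>(15, TimeUnit.MINUTES)
//     .setConstraints(Constraints.Builder().setRequiredNetworkType(NetworkType.CONNECTED).build())
//     .build()
// WorkManager.getInstance(context)
//     .enqueueUniquePeriodicWork("event_flush", ExistingPeriodicWorkPolicy.KEEP, request)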

Volume Control

Three mechanisms:

  • Sampling: High-frequency events (e.g., scroll events) are sampled at configurable rates (1%, 10%) using a deterministic hash on the session ID (see the sketch after this list).
  • Deduplication: Events with the same name and properties within a 500ms window are collapsed.
  • TTL: Events older than 72 hours in the local database are purged. Stale data has diminishing analytical value.
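
To make the sampling mechanism concrete, here is a minimal sketch of deterministic, hash-based sampling. EventSampler and its per-event-name rate table are assumptions for illustration; any stable hash of the session ID and event name works, as long as every client computes it the same way.

// Hypothetical sampling gate: the decision depends only on (sessionId, event name),
// so a given session either always or never emits a sampled event type.
object EventSampler {
    // Assumed per-event-name sampling rates; unlisted events pass through at 100%.
    private val rates: Map<String, Double> = mapOf(
        "scroll" to 0.01,     // 1%
        "frame_drop" to 0.10  // 10%
    )

    fun shouldSample(eventName: String, sessionId: String): Boolean {
        val rate = rates[eventName] ?: return true
        // Stable bucket in [0, 10000) derived from the session id and event name.
        val bucket = ((sessionId + eventName).hashCode() and Int.MAX_VALUE) % 10_000
        return bucket < (rate * 10_000).toInt()
    }
}

// In EventTracker.track(), before validation:
// if (!EventSampler.shouldSample(event.name, event.sessionId)) return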

Trade-offs

  • Room for persistence. Upside: survives process death and is queryable. Downside: adds ~1MB to the APK and is slightly slower than file I/O.
  • Periodic batching. Upside: fewer network calls, battery-friendly. Downside: events are delayed by up to the batch interval.
  • Client-side schema validation. Upside: prevents bad data at the source. Downside: schema updates must ship with app releases.
  • Gzip compression. Upside: reduces bandwidth by 60-80%. Downside: CPU cost on low-end devices.
  • Deterministic sampling. Upside: consistent per-user experience. Downside: sampling rates cannot be adjusted for already-emitted events.

Failure Modes

  • Database full: Set a max row count (e.g., 10,000). When exceeded, drop oldest events first. Alert via a meta-event.
  • Serialization failure: Catch exceptions per-event, not per-batch. One malformed event should not block the batch.
  • Transport timeout: Exponential backoff with jitter, capped at 5 minutes (sketched after this list). After 3 consecutive failures, back off until the next WorkManager cycle.
  • Schema mismatch on server: Server returns partial acceptance. Client deletes accepted events, retains rejected ones with a rejected flag to prevent infinite retries.
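
The backoff arithmetic behind the transport-timeout rule fits in a few lines. nextDelayMs below is a hypothetical helper; the base delay and full-jitter strategy are illustrative choices, with only the 5-minute cap and 3-failure cutoff coming from the design above.

import kotlin.math.min
import kotlin.random.Random

// Exponential backoff with full jitter, capped at 5 minutes.
// attempt = 1 for the first retry, 2 for the second, and so on.
fun nextDelayMs(
    attempt: Int,
    baseMs: Long = 1_000,
    capMs: Long = 5 * 60_000
): Long {
    val exponential = baseMs * (1L shl (attempt - 1).coerceAtMost(20))
    val capped = min(exponential, capMs)
    // Full jitter: pick uniformly in [0, capped] to avoid synchronized retries.
    return Random.nextLong(capped + 1)
}

// After 3 consecutive failures, stop retrying in-process and wait for the next
// WorkManager cycle, e.g.:
// if (consecutiveFailures >= 3) return Result.retry()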

Scaling Considerations

  • Event volume growth: Move from per-event rows to batch-serialized blobs (Protocol Buffers) to reduce database write amplification.
  • Multi-process apps: If the app uses multiple processes, use a ContentProvider or a dedicated tracking process to avoid database locking issues.
  • Backend ingestion: The client should respect HTTP 429 responses and Retry-After headers (a sketch follows below). Server-driven throttling is essential at scale.
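
As a sketch of honoring server throttling: the HttpResult type below is a hypothetical stand-in for whatever the transport layer returns. The point is that a 429 pauses flushing for the Retry-After window instead of retrying immediately.

// Hypothetical transport result; real code would map this from the HTTP client's response.
sealed class HttpResult {
    object Success : HttpResult()
    data class Throttled(val retryAfterSeconds: Long?) : HttpResult() // HTTP 429
    data class Failure(val code: Int) : HttpResult()
}

// Sketch of how the batcher could react: delete on success, pause on throttle,
// leave events in place otherwise so exponential backoff handles the retry.
suspend fun handleResult(
    result: HttpResult,
    dao: EventDao,
    sentIds: List<Long>,
    pauseFlushes: suspend (Long) -> Unit
) {
    when (result) {
        is HttpResult.Success -> dao.deleteByIds(sentIds)
        is HttpResult.Throttled -> pauseFlushes((result.retryAfterSeconds ?: 60) * 1_000)
        is HttpResult.Failure -> Unit
    }
}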

Observability

  • Track a meta-event tracking_health that reports events queued, events sent, events dropped, batch size, flush latency, and error counts (sketched after this list).
  • Expose a debug mode that logs all events to Logcat with a TRACKING tag, gated behind a build config flag.
  • Dashboard metrics: delivery rate, p95 flush latency, schema validation failure rate, compression ratio.
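
A possible shape for the tracking_health meta-event is sketched below; the counter names are assumptions, and in practice each layer (tracker, batcher, transport) would increment its own counter.

import java.util.concurrent.atomic.AtomicLong

// Hypothetical counters for the tracking_health meta-event. The snapshot is emitted
// periodically as a normal tracked event (the schema registry needs a matching entry).
object TrackingHealth {
    val queued = AtomicLong()
    val sent = AtomicLong()
    val dropped = AtomicLong()
    val errors = AtomicLong()
    @Volatile var lastBatchSize: Int = 0
    @Volatile var lastFlushLatencyMs: Long = 0

    fun snapshot(): Map<String, Any> = mapOf(
        "events_queued" to queued.get(),
        "events_sent" to sent.get(),
        "events_dropped" to dropped.get(),
        "error_count" to errors.get(),
        "batch_size" to lastBatchSize,
        "flush_latency_ms" to lastFlushLatencyMs
    )
}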

Key Takeaways

  • Persist before sending. In-memory-only queues lose events on process death.
  • Validate schemas on the client. Bad data caught at the source saves hours of pipeline debugging.
  • Batch aggressively. Per-event HTTP calls are wasteful on mobile networks.
  • Bound everything: queue size, batch size, retry count, event TTL. Unbounded systems fail unpredictably.
  • Ship observability for the tracking system itself. You cannot trust a system you cannot measure.

Related: Debugging Performance Issues in Large Android Apps.

See also: Mobile Analytics Pipeline: From App Event to Dashboard.


Final Thoughts

An event tracking system is infrastructure, not a feature. Treat it with the same rigor as your networking layer or database. The cost of getting it wrong is not a crash; it is months of decisions made on unreliable data.
