How I'd Design a Mobile Configuration System at Scale
Designing a configuration system for mobile apps at scale, covering config delivery, caching layers, override hierarchies, and safe rollout of config changes.
Configuration systems control timeouts, feature thresholds, UI copy, endpoint URLs, and hundreds of other runtime parameters. At scale, a config change can affect millions of devices simultaneously. This post covers how to design a system that makes config changes safe, fast, and observable.
Context
A mobile configuration system delivers key-value pairs (or structured config objects) to devices at runtime. It replaces hardcoded values with dynamically controllable ones. At scale (10M+ devices), the system must handle high read throughput, fast propagation of changes, safe rollout, and graceful failure.
Problem
Design a configuration system that:
- Delivers config to millions of devices with low latency
- Supports hierarchical overrides (global, platform, app version, user segment)
- Rolls out config changes safely with canary and percentage-based releases
- Fails safely when the config service is unreachable
Constraints
| Constraint | Detail |
|---|---|
| Fetch latency | Config available within 200ms of app cold start |
| Propagation | Config changes reach 95% of active devices within 15 minutes |
| Payload size | Full config payload under 100KB compressed |
| Reliability | Must function with no network, stale cache, or corrupted cache |
| Consistency | Config should not change mid-session (pin per session) |
Design
Config Data Model
ConfigEntry {
    key: String          // "api_timeout_ms", "max_retry_count"
    value: Any           // 5000, 3
    type: ValueType      // INT, STRING, BOOLEAN, JSON
    metadata: {
        description: String
        owner: String    // Team or individual
        last_modified: Timestamp
        version: Int
    }
    overrides: List<Override>
}

Override {
    condition: Condition // platform=android, app_version>=5.0, country=IN
    value: Any
    priority: Int        // Higher priority overrides win
}
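The Condition type above is left abstract. One way it could be modeled in Kotlin is a small sealed hierarchy with a matches() check, sketched below. The DeviceContext fields and the PlatformIs, CountryIs, MinAppVersion, and AllOf classes are assumptions made for illustration, not part of the schema above.

```kotlin
// Hypothetical device context; the field names are assumptions for illustration.
data class DeviceContext(
    val platform: String,       // e.g. "android", "ios"
    val appVersion: List<Int>,  // e.g. [5, 1] for version 5.1
    val country: String         // ISO country code, e.g. "IN"
)

// One possible shape for Condition: each variant knows how to test a device.
sealed interface Condition {
    fun matches(ctx: DeviceContext): Boolean
}

data class PlatformIs(val platform: String) : Condition {
    override fun matches(ctx: DeviceContext) = ctx.platform == platform
}

data class CountryIs(val country: String) : Condition {
    override fun matches(ctx: DeviceContext) = ctx.country == country
}

data class MinAppVersion(val min: List<Int>) : Condition {
    // Compare components numerically so that 5.10 is newer than 5.2.
    override fun matches(ctx: DeviceContext) = compareVersions(ctx.appVersion, min) >= 0
}

// All sub-conditions must hold (logical AND).
data class AllOf(val conditions: List<Condition>) : Condition {
    override fun matches(ctx: DeviceContext) = conditions.all { it.matches(ctx) }
}

// Returns a negative, zero, or positive number, treating missing components as 0.
fun compareVersions(a: List<Int>, b: List<Int>): Int {
    for (i in 0 until maxOf(a.size, b.size)) {
        val diff = a.getOrElse(i) { 0 }.compareTo(b.getOrElse(i) { 0 })
        if (diff != 0) return diff
    }
    return 0
}
```

Comparing versions component by component avoids the classic string-comparison bug where "5.10" sorts before "5.2".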
Override Hierarchy
Overrides are evaluated in priority order, and the highest-priority matching override wins. Priorities are assigned so that more specific levels rank higher:
| Priority | Level | Example |
|---|---|---|
| 1 (lowest) | Global default | api_timeout_ms = 5000 |
| 2 | Platform | Android: api_timeout_ms = 7000 |
| 3 | App version range | v5.0-5.2: api_timeout_ms = 10000 |
| 4 | Country/Region | India: api_timeout_ms = 15000 |
| 5 | User segment | Beta users: api_timeout_ms = 3000 |
| 6 (highest) | Individual user | User 12345: api_timeout_ms = 2000 |
class ConfigResolver(private val context: DeviceContext) {
    fun resolve(entry: ConfigEntry): Any {
        // Keep only overrides that apply to this device, highest priority first
        val applicableOverrides = entry.overrides
            .filter { it.condition.matches(context) }
            .sortedByDescending { it.priority }
        // Fall back to the entry's global default when nothing matches
        return applicableOverrides.firstOrNull()?.value ?: entry.value
    }
}
Server Architecture
Admin UI -> Config Service -> PostgreSQL (source of truth)
                 |
                 v
          Config Compiler -> CDN (compiled config JSON per platform)
                              |
                        Mobile Client
The Config Compiler runs on every config change. It evaluates all entries and overrides, produces platform-specific JSON payloads, and pushes them to the CDN. In the simple cases, where overrides depend only on attributes known at compile time such as platform, the client receives already-resolved values and evaluates nothing at runtime. For user-segment or individual overrides, the payload carries the override rules and the client evaluates them locally.
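As a rough sketch of the compile step, the function below flattens entries into one resolved key-value map per platform. The Entry and PlatformOverride types are simplified stand-ins invented for this example (only platform-level overrides are modeled), and serialization and the CDN upload are left out.

```kotlin
// Simplified, hypothetical types: only platform-level overrides are modeled here.
data class PlatformOverride(val platform: String, val value: Any, val priority: Int)

data class Entry(
    val key: String,
    val default: Any,
    val overrides: List<PlatformOverride> = emptyList()
)

// Resolve every entry for one platform into a flat key -> value map,
// which is the shape a client would download from the CDN.
fun compileForPlatform(entries: List<Entry>, platform: String): Map<String, Any> =
    entries.associate { entry ->
        val winner = entry.overrides
            .filter { it.platform == platform }
            .maxByOrNull { it.priority }
        entry.key to (winner?.value ?: entry.default)
    }

// Compile one payload per platform, keyed by platform name.
fun compileAll(entries: List<Entry>, platforms: List<String>): Map<String, Map<String, Any>> =
    platforms.associateWith { compileForPlatform(entries, it) }
```

Resolving at compile time keeps the per-platform payload flat, so the client only needs a map lookup for these keys.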
Client-Side Architecture
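import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.launch

class ConfigManager(
    private val diskCache: ConfigDiskCache,
    private val fetcher: ConfigFetcher,
    private val defaults: Map<String, Any>,
    private val scope: CoroutineScope // Assumed application-scoped so the fetch outlives any screen
) {
    // Pinned once per session; @Volatile so readers on other threads see the pinned map
    @Volatile private var sessionConfig: Map<String, Any>? = null

    suspend fun initialize() {
        // 1. Load from disk cache (fast, survives process death)
        val cachedConfig = diskCache.load()
        // 2. Pin for this session
        sessionConfig = cachedConfig ?: defaults
        // 3. Fetch latest without blocking startup (for next session)
        scope.launch { fetchAndCache() }
    }

    fun getString(key: String, default: String): String {
        return (sessionConfig?.get(key) as? String) ?: (defaults[key] as? String) ?: default
    }

    fun getInt(key: String, default: Int): Int {
        // JSON parsers often decode numbers as Long or Double, so coerce via Number
        return (sessionConfig?.get(key) as? Number)?.toInt()
            ?: (defaults[key] as? Number)?.toInt()
            ?: default
    }

    private suspend fun fetchAndCache() {
        try {
            val latest = fetcher.fetch()
            diskCache.save(latest) // Used from the next session onward
        } catch (e: CancellationException) {
            throw e // Never swallow coroutine cancellation
        } catch (e: Exception) {
            // Non-fatal: the current session keeps its pinned config
        }
    }
}
Safe Rollout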
Config changes are rolled out progressively:
- Internal: Deploy to internal employees only (user segment override).
- Canary (1%): Deploy to 1% of users via percentage-based targeting.
- Gradual ramp: 5% -> 25% -> 50% -> 100%, with 24 hours between each stage.
- Full rollout: Remove targeting, make the new value the global default.
rollout_config_change(key, new_value, stages):
    for stage in stages:
        apply_override(key, new_value, targeting=stage.targeting)
        invalidate_cdn_cache()
        wait(stage.bake_time)
        if guardrails_breached(key):
            rollback(key)
            alert_owner(key)
            return ROLLED_BACK
    promote_to_default(key, new_value)
    return SUCCESS
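Percentage-based targeting for the canary and ramp stages is commonly implemented with deterministic hash bucketing. The sketch below is one way to do it, not necessarily how any particular system does. The function names, the use of SHA-256, and the 10,000-bucket granularity are all assumptions; any stable hash would serve.

```kotlin
import java.security.MessageDigest

// Map (config key, user id) to a stable bucket in 0..9999.
// Hashing the key together with the user id means different rollouts
// select different subsets of users.
fun rolloutBucket(configKey: String, userId: String): Int {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest("$configKey:$userId".toByteArray(Charsets.UTF_8))
    // Build a non-negative integer from the first 4 bytes (top bit masked off).
    val n = ((digest[0].toInt() and 0x7F) shl 24) or
            ((digest[1].toInt() and 0xFF) shl 16) or
            ((digest[2].toInt() and 0xFF) shl 8) or
            (digest[3].toInt() and 0xFF)
    return n % 10_000
}

// A user is in the rollout when their bucket falls below the stage percentage.
// Buckets are stable, so raising the percentage only ever adds users.
fun isInRollout(configKey: String, userId: String, percent: Int): Boolean =
    rolloutBucket(configKey, userId) < percent * 100
```

Because a user's bucket never changes, anyone included at 5% is still included at 25%, which keeps a user's experience consistent as the ramp progresses.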
Validation
Every config change passes through validation before deployment:
- Type check: New value matches declared type.
- Range check: Numeric values within declared min/max bounds.
- Dependency check: If config A depends on config B, validate consistency.
- Diff review: Changes to critical configs (endpoint URLs, auth params) require two-person approval.
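The type and range checks above could look roughly like the following. This is a minimal sketch: ConfigSchema and validate are hypothetical names, the JSON type is omitted for brevity, and dependency checks and approval workflows are out of scope here.

```kotlin
enum class ValueType { INT, STRING, BOOLEAN }

// Hypothetical schema: the declared type plus optional numeric bounds.
data class ConfigSchema(val type: ValueType, val min: Long? = null, val max: Long? = null)

// Returns a list of human-readable problems; an empty list means the value is acceptable.
fun validate(key: String, value: Any, schema: ConfigSchema): List<String> {
    val errors = mutableListOf<String>()
    when (schema.type) {
        ValueType.INT -> {
            val n = (value as? Number)?.toLong()
            if (n == null || value is Float || value is Double) {
                // Type check: reject non-numbers and fractional types.
                errors += "$key: expected INT"
            } else {
                // Range check: enforce declared bounds when present.
                if (schema.min != null && n < schema.min) errors += "$key: $n is below min ${schema.min}"
                if (schema.max != null && n > schema.max) errors += "$key: $n is above max ${schema.max}"
            }
        }
        ValueType.STRING -> if (value !is String) errors += "$key: expected STRING"
        ValueType.BOOLEAN -> if (value !is Boolean) errors += "$key: expected BOOLEAN"
    }
    return errors
}
```

Returning every problem at once, rather than failing on the first, gives the person making the change a complete picture in a single pass.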
Trade-offs
| Decision | Upside | Downside |
|---|---|---|
| CDN delivery | Low latency, high availability, scales to any device count | Propagation delay (CDN TTL) |
| Session pinning | Consistent behavior within a session | Urgent changes delayed until next session |
| Server-side compilation | Simple client logic | Compiler must run on every change, adds latency to admin workflow |
| Hierarchical overrides | Flexible targeting | Complex evaluation, harder to reason about effective value |
| Percentage-based rollout | Safe, progressive | Slower time-to-full-rollout |
Failure Modes
- CDN outage: Client falls back to disk cache. If disk cache is corrupted, falls back to compiled-in defaults.
- Config compiler crash: CDN continues serving the last successfully compiled config. Alert the on-call team.
- Invalid config deployed: A string value set for an integer config causes parse errors. Mitigation: type validation at write time, and client-side type coercion with fallback to default.
- Config key collision: Two teams use the same key for different purposes. Mitigation: namespace keys by team (e.g., payments.api_timeout_ms).
- Stale disk cache: App installed months ago has ancient config. Mitigation: include a min_config_version check; if the cached version is too old, block on a network fetch with a timeout before falling back to defaults.
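The corrupted-cache and stale-cache cases above amount to a small decision made at startup. Here is a sketch of that decision, assuming the cache stores a CRC32 checksum and a version alongside the payload. CachedConfig, CacheDecision, and decide are names invented for this example.

```kotlin
import java.util.zip.CRC32

// Hypothetical on-disk record: the serialized config plus integrity and version metadata.
data class CachedConfig(val payload: String, val checksum: Long, val version: Int)

sealed interface CacheDecision
object UseCache : CacheDecision
object FetchBeforeStart : CacheDecision // Cache too old: block briefly on the network
object UseDefaults : CacheDecision      // Cache missing or corrupted

fun crc32Of(text: String): Long =
    CRC32().apply { update(text.toByteArray(Charsets.UTF_8)) }.value

fun decide(cached: CachedConfig?, minConfigVersion: Int): CacheDecision = when {
    cached == null -> UseDefaults
    // A checksum mismatch means the file was truncated or corrupted on disk.
    crc32Of(cached.payload) != cached.checksum -> UseDefaults
    cached.version < minConfigVersion -> FetchBeforeStart
    else -> UseCache
}
```

In the FetchBeforeStart case the caller would still apply a timeout and fall back to defaults if the network fetch does not complete in time.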
Scaling Considerations
- Payload size: At 1,000+ config keys, the payload can exceed the 100KB budget. Split it into config groups: core configs are fetched at startup, and feature-specific configs are fetched when the feature is first accessed.
- CDN invalidation at scale: Invalidating CDN cache across all edge nodes takes seconds to minutes. For urgent changes (kill switches), maintain a lightweight sidecar endpoint that bypasses the CDN.
- Multi-region: Config service deployed per region. Changes propagate via async replication. Accept eventual consistency across regions (config changes take up to 5 minutes to propagate globally).
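Splitting the payload into groups fetched on demand might look like this on the client. It is a sketch under simplifying assumptions: GroupedConfig and the fetchGroup loader are hypothetical, the loader is synchronous, and the class is not thread-safe.

```kotlin
// Loads config groups lazily and remembers them for the rest of the session.
// `fetchGroup` is a hypothetical loader (network or disk) supplied by the caller.
class GroupedConfig(
    private val coreGroup: Map<String, Any>,
    private val fetchGroup: (String) -> Map<String, Any>
) {
    private val loaded = mutableMapOf<String, Map<String, Any>>()

    // Core keys are always available; feature groups load on first access.
    fun get(group: String, key: String): Any? {
        if (group == "core") return coreGroup[key]
        val values = loaded.getOrPut(group) { fetchGroup(group) }
        return values[key]
    }
}
```

Caching each group after the first load also preserves session pinning for feature-specific configs, since a group is never refetched mid-session.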
Observability
- Track: config fetch success rate, cache hit rate, config version distribution across devices, time-to-propagation for each config change.
- Alert on: fetch failure rate exceeding 5%, config version lagging by more than 2 versions for more than 10% of devices, critical config rollback triggered.
- Audit log: every config change with who, what, when, and the rollout stage.
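The two numeric alert conditions above can be expressed as a pure function over a metrics snapshot. The FleetMetrics shape and the alert names below are assumptions for illustration; a real deployment would more likely encode these thresholds in its monitoring system's own rule language.

```kotlin
// Hypothetical snapshot of fleet metrics for one evaluation window.
data class FleetMetrics(
    val fetchAttempts: Long,
    val fetchFailures: Long,
    val latestVersion: Int,
    val devicesByVersion: Map<Int, Long> // Config version -> device count
)

fun alertsFor(m: FleetMetrics): List<String> {
    val alerts = mutableListOf<String>()
    // Fetch failure rate above 5%.
    if (m.fetchAttempts > 0 && m.fetchFailures.toDouble() / m.fetchAttempts > 0.05) {
        alerts += "fetch_failure_rate"
    }
    // More than 10% of devices lagging by more than 2 versions.
    val total = m.devicesByVersion.values.sum()
    val lagging = m.devicesByVersion.filterKeys { m.latestVersion - it > 2 }.values.sum()
    if (total > 0 && lagging.toDouble() / total > 0.10) {
        alerts += "config_version_lag"
    }
    return alerts
}
```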
Key Takeaways
- Session-pin config values. Mid-session changes cause inconsistent behavior and are nearly impossible to debug.
- Use hierarchical overrides for flexibility, but namespace keys and document ownership to prevent chaos.
- Roll out config changes progressively with guardrail checks between stages. A bad config change is as dangerous as a bad code deploy.
- Always have three config sources, tried in order: network fetch, disk cache, compiled-in defaults. The app must function even if the config system is completely unreachable.
- Validate every config change before deployment. Type checks, range checks, and dependency checks catch the majority of config-related incidents.
Further Reading
- How I'd Design a Scalable Notification System: System design for a multi-channel notification system covering delivery guarantees, rate limiting, user preferences, and failure handling...
- Event Tracking System Design for Android Applications: A systems-level breakdown of designing an event tracking system for Android, covering batching, schema enforcement, local persistence, an...
- Designing a Feature Flag and Remote Config System: Architecture and trade-offs for building a feature flag and remote configuration system that handles targeting, rollout, and consistency ...
Final Thoughts
A configuration system is a remote control for your application. It is powerful and dangerous in equal measure. The guardrails around how config changes are proposed, validated, rolled out, and monitored matter more than the delivery mechanism itself.