Understanding ANRs: Detection, Root Causes, and Fixes
8 min read
A systematic look at Application Not Responding errors on Android, covering the detection mechanism, common root causes in production, and concrete strategies to fix and prevent them.
7 min read
How to design APIs that handle duplicate requests safely, covering idempotency keys, server-side deduplication, and failure scenarios specific to mobile networks.
6 min read
Strategies for large-scale refactors that keep production stable, covering parallel runs, feature flags, gradual migrations, and verification techniques.
7 min read
Strategies for detecting and resolving data conflicts in offline-first mobile systems, covering CRDTs, last-write-wins, operational transforms, and manual resolution.
7 min read
A detailed look at retry strategies for mobile clients, covering exponential backoff, jitter, circuit breakers, and adaptive retry policies for unreliable networks.
5 min read
A collection of hard-won lessons from years of production incidents, covering the patterns that repeat across systems and the mental models that help prevent them.
7 min read
A catalog of components that fail first under increasing traffic, ordered by how commonly they become bottlenecks in web applications.
6 min read
A catalog of failure modes that experienced engineers anticipate and design around, from cascading failures to data corruption to clock skew.
7 min read
Architecture patterns for mobile apps that function reliably on slow, intermittent, and lossy networks, covering request prioritization, adaptive quality, and graceful degradation.
6 min read
Why the real challenge of operating at scale is not handling load but diagnosing problems in systems too large and too fast for any one person to fully understand.
6 min read
Exploring the blind spots in traditional logging approaches and the incidents where logs were present but useless, along with what I now build instead.
6 min read
How to design APIs, configurations, and system interfaces that guide users toward correct usage and make dangerous operations difficult to perform accidentally.
8 min read
A structured methodology for identifying and fixing battery drain in Android apps, covering wake locks, location updates, background work, and network polling patterns.
6 min read
Why silent failures are more dangerous than crashes, and how to design systems that surface problems immediately rather than hiding them behind fallback behavior.
7 min read
Concrete design decisions that make production systems debuggable, from structured logging and correlation IDs to deterministic behavior and state inspection.
9 min read
Strategies for diagnosing and fixing bugs that disappear when observed, covering race conditions, timing-dependent failures, and non-deterministic behavior in Android applications.
8 min read
Strategies for handling partial failures in systems where mobile clients interact with multiple backend services, covering compensation, saga patterns, and client-side resilience.
6 min read
Practical lessons from years of debugging distributed systems, covering the unique challenges of partial failures, clock skew, message ordering, and the tools and mental models that actually help.
6 min read
The case for investing in observability before optimization, and why the ability to understand system behavior is more valuable than making it faster.
8 min read
Designing a webhook delivery system with retries, dead letter queues, signature verification, and measured reliability under various failure conditions.
7 min read
Examining how seemingly minor technical decisions compound into major production incidents, with real patterns and the organizational dynamics that allow them.
7 min read
How to build observability into system architecture from the start, covering the three pillars (logs, metrics, and traces), instrumentation patterns, and common pitfalls.
7 min read
Architectural and operational decisions that reduce the frequency and severity of production pages, based on patterns from years of on-call experience.
7 min read
How to build systems that continue providing value when components fail, covering load shedding, fallback strategies, and partial availability patterns.
7 min read
Lessons that I knew intellectually but had to learn through painful experience before they truly changed my behavior, covering testing, deployments, dependencies, and team dynamics.