Engineering Decisions I Don't Compromise On
Non-negotiable engineering principles I enforce regardless of deadline pressure, team size, or scope, and the reasoning behind each one.
Over the years I have developed a short list of engineering decisions where I refuse to take shortcuts. These are not preferences or style choices. They are load-bearing principles that, when compromised, lead to incidents, technical debt, and eroded trust. Every one of them exists because I learned the cost of the alternative the hard way.
Idempotency on Write Paths
Every write operation must be idempotent. If a client retries a request due to a network timeout, the system must produce the same result as if the request executed once. This is non-negotiable because retries are not optional in distributed systems. They are inevitable.
Implementation requirements:
- Every mutation endpoint accepts an idempotency key
- The server stores the result of the first execution and returns it for subsequent requests with the same key
- Key storage has a TTL that exceeds the maximum retry window
- The idempotency check happens before any side effects
I have seen double charges, duplicate records, and corrupted state from non-idempotent writes. The cost of adding idempotency after the fact is an order of magnitude higher than building it in from the start.
Automated Rollback Capability
Every deployment must be reversible within minutes without manual intervention. If the new version causes errors, the system rolls back automatically based on health check failures.
This means:
- Database migrations must be backward-compatible (the previous application version must work with the new schema)
- Configuration changes are versioned and reversible
- Feature flags control new behavior, allowing instant disable without a deploy
- No "one-way door" deployments without an explicit review and approval process
The moment you deploy something that cannot be rolled back, you have committed to debugging in production under pressure. That is not a position I accept voluntarily.
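The health-check-driven rollback reduces to a single monitored deploy step. In this sketch, `deploy_version`, `rollback_to`, and `health_check` are hypothetical stand-ins for your deploy tooling, stubbed here for illustration.

```python
# Sketch of a deploy step that reverts itself on health-check failures.
# The deploy/rollback functions below are stubs standing in for real
# deploy tooling; only the control flow is the point.
state = {"live": "v1"}

def deploy_version(version: str) -> None:
    state["live"] = version

def rollback_to(version: str) -> None:
    state["live"] = version

def monitored_deploy(new_version: str, previous_version: str,
                     health_check, checks: int = 5,
                     failure_threshold: float = 0.4) -> str:
    deploy_version(new_version)
    failures = sum(1 for _ in range(checks) if not health_check())
    if failures / checks >= failure_threshold:
        rollback_to(previous_version)  # automatic, no manual intervention
        return "rolled_back"
    return "healthy"

# A release whose health checks fail is reverted within one monitoring pass.
result = monitored_deploy("v2", "v1", health_check=lambda: False)
assert result == "rolled_back" and state["live"] == "v1"
```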
Schema Validation at System Boundaries
Every piece of data that crosses a system boundary gets validated against a schema. API requests, queue messages, file uploads, configuration values. All of it.
The validation is strict:
- Unknown fields are rejected, not silently ignored
- Type coercion does not happen implicitly
- Required fields are enforced, not assumed
- String formats (dates, UUIDs, emails) are validated against patterns
Lenient input handling creates a class of bugs that are nearly impossible to diagnose. The data "looks right" but has subtle formatting differences that cause downstream failures hours or days later.
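A strict boundary validator needs nothing exotic. This sketch uses only the standard library; a schema library (for example pydantic with extra fields forbidden) expresses the same rules declaratively, but the checks themselves are what matters.

```python
import re
import uuid

# Strict validation at a system boundary: unknown fields rejected, required
# fields enforced, no implicit coercion, string formats pattern-checked.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
ALLOWED_FIELDS = {"user_id", "email"}

def validate_signup(data: dict) -> dict:
    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields rejected: {sorted(unknown)}")
    for field in ALLOWED_FIELDS:
        if field not in data:
            raise ValueError(f"missing required field: {field}")
    if not isinstance(data["user_id"], str) or not isinstance(data["email"], str):
        raise TypeError("wrong type; no implicit coercion")
    uuid.UUID(data["user_id"])          # raises ValueError on a malformed UUID
    if not EMAIL_RE.match(data["email"]):
        raise ValueError("invalid email format")
    return data

valid = {"user_id": "123e4567-e89b-12d3-a456-426614174000",
         "email": "a@example.com"}
assert validate_signup(valid) == valid
```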

Logging With Context
Every log line includes enough context to trace it back to the originating request without consulting another system. At minimum:
- Request ID or correlation ID
- User or tenant identifier
- Operation being performed
- Relevant entity IDs
Logs without context are noise. When an on-call engineer is investigating a production issue at 2 AM, they should not need to join log lines from three different systems to understand what happened.
I also enforce structured logging (JSON format) over unstructured log messages. Structured logs can be queried. Unstructured logs can only be grepped, and grep does not scale.
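Propagating the context with `contextvars` keeps call sites clean: the request handler sets the IDs once and every log line below it picks them up. A minimal sketch:

```python
import contextvars
import json
import sys

# Structured JSON logging where every line carries request context, carried
# by contextvars so call sites don't thread correlation IDs by hand.
request_id = contextvars.ContextVar("request_id", default="unknown")
tenant_id = contextvars.ContextVar("tenant_id", default="unknown")

def log(operation: str, **entity_ids) -> str:
    line = json.dumps({
        "request_id": request_id.get(),
        "tenant_id": tenant_id.get(),
        "operation": operation,
        **entity_ids,             # relevant entity IDs for this line
    })
    print(line, file=sys.stderr)  # one queryable JSON object per line
    return line

# The request handler sets the IDs once; everything beneath it inherits them.
request_id.set("req-8f2a")
tenant_id.set("tenant-42")
entry = json.loads(log("order.cancel", order_id="ord-991"))
assert entry["request_id"] == "req-8f2a" and entry["order_id"] == "ord-991"
```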
Zero Trust Between Services
Internal services authenticate and authorize every request. The network boundary is not a security boundary. This applies to:
- Service-to-service communication (mutual TLS or signed tokens)
- Database connections (per-service credentials with least-privilege access)
- Queue consumers (authenticated connections with topic-level authorization)
- Internal APIs (authentication required even for "internal only" endpoints)
The argument against this is always performance or complexity. The argument for it is that a single compromised service should not grant access to every other service in the system.
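As one illustration, signed service-to-service tokens can be as small as an HMAC over the caller's identity plus an expiry. Production systems usually reach for mTLS or standard JWTs instead; the shared secret and service names here are placeholders.

```python
import hashlib
import hmac
import time

# Sketch of per-request service authentication with HMAC-signed tokens.
# SHARED_SECRET is a placeholder; real deployments distribute and rotate
# secrets via a secret manager (or use mTLS / JWTs outright).
SHARED_SECRET = b"rotate-me"

def issue_token(service: str, ttl_seconds: int = 60) -> str:
    expires = str(int(time.time()) + ttl_seconds)
    payload = f"{service}.{expires}"
    sig = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_token(token: str) -> str:
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SHARED_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    service, _, expires = payload.rpartition(".")
    if int(expires) < time.time():
        raise PermissionError("token expired")
    return service  # verified caller identity, used for authorization

assert verify_token(issue_token("billing")) == "billing"
```

Every internal endpoint calls `verify_token` before doing work, so a compromised neighbor still has to present a valid, unexpired credential.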
Separate Read and Write Models When Complexity Demands It
When a system has significantly different read and write patterns, I separate the models rather than forcing one schema to serve both. This is not CQRS dogma. It is a practical response to the observation that read-optimized and write-optimized schemas have different shapes.
The trigger for separation:
- Read queries require joins across more than three tables
- Write operations need strong consistency while reads can tolerate staleness
- Read and write traffic scale at different rates
- Reporting queries cause contention with transactional writes
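In practice the separated read model is often just a projection maintained on the write path: writes go to the normalized store, and each write also updates a flat, read-optimized row so read queries need no joins. A sketch with hypothetical order data:

```python
# Sketch of a write model plus a denormalized read-model projection.
# Dicts stand in for the transactional store and the read store.
orders = {}           # write model: normalized, strongly consistent
order_summaries = {}  # read model: denormalized, rebuildable, staleness-tolerant

def place_order(order_id: str, customer: dict, items: list) -> None:
    # Write path: normalized record, referencing the customer by ID.
    orders[order_id] = {"customer_id": customer["id"], "items": items}
    # Projection: one flat row per order, shaped for reads and reporting.
    order_summaries[order_id] = {
        "customer_name": customer["name"],
        "item_count": len(items),
        "total": sum(i["price"] * i["qty"] for i in items),
    }

place_order("o1", {"id": "c1", "name": "Ada"},
            [{"price": 500, "qty": 2}, {"price": 120, "qty": 1}])
assert order_summaries["o1"]["total"] == 1120  # read needs no join
```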
No Silent Failures
If an operation fails, the failure must be visible. No swallowed exceptions, no empty catch blocks, no "log and continue" for errors that affect correctness.
The rules:
- Errors that affect user-visible behavior must result in an error response, not a partial success
- Background job failures must generate alerts, not just log entries
- Data inconsistencies must be detected and surfaced, not silently tolerated
- Health checks must verify actual functionality, not just process liveness
Silent failures are technical debt that accrues interest in the form of data corruption and user trust erosion.
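For background jobs, a thin wrapper is enough to enforce the alert-and-re-raise rule. In this sketch `send_alert` is a hypothetical hook into your paging system, stubbed with a list for illustration.

```python
import functools

# Background-job wrapper that turns failures into alerts rather than
# log-and-continue. The alert sink is a stub standing in for a real pager.
alerts = []

def send_alert(message: str) -> None:
    alerts.append(message)

def alert_on_failure(job):
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        try:
            return job(*args, **kwargs)
        except Exception as exc:
            send_alert(f"{job.__name__} failed: {exc}")
            raise  # re-raise: the failure stays visible, never swallowed
    return wrapper

@alert_on_failure
def reconcile_payments():
    raise RuntimeError("ledger mismatch")

try:
    reconcile_payments()
except RuntimeError:
    pass
assert alerts == ["reconcile_payments failed: ledger mismatch"]
```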
Capacity Planning Before Launch
Every new system or major feature gets a capacity analysis before production deployment. Not a rough estimate. A documented analysis that includes:
- Expected traffic patterns (peak, average, growth rate)
- Resource requirements per request (CPU, memory, I/O, network)
- Scaling limits of each component (database connections, queue throughput, API rate limits)
- Cost projections at 1x, 5x, and 10x current traffic
This has saved me from embarrassing launch failures more times than I can count. The analysis does not need to be perfect. It needs to be done.
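Most of the analysis is arithmetic. A toy version with illustrative numbers (assumptions, not measurements):

```python
# Back-of-envelope capacity check of the shape described above.
# Every number here is an illustrative assumption.
peak_rps = 800                 # expected peak traffic
cpu_ms_per_request = 12        # CPU cost per request
cores_available = 16           # current fleet capacity
db_connection_limit = 500      # scaling limit of one component
connections_per_instance = 20  # pool size per app instance

cores_needed = peak_rps * cpu_ms_per_request / 1000  # each core = 1000 CPU-ms/s
max_instances = db_connection_limit // connections_per_instance

for multiplier in (1, 5, 10):
    print(f"{multiplier}x traffic -> {cores_needed * multiplier:.1f} cores needed")

assert cores_needed < cores_available      # fits at 1x traffic...
assert cores_needed * 5 > cores_available  # ...but 5x exceeds the fleet
assert max_instances == 25                 # connection pool caps instance count
```

Even this crude a model surfaces the two findings that matter: the headroom at current traffic, and which limit breaks first as traffic grows.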
Key Takeaways
- Idempotency on write paths prevents data corruption from inevitable retries.
- Automated rollback capability means you never have to debug under deployment pressure.
- Schema validation at boundaries catches subtle data issues before they propagate.
- Structured logging with request context makes production debugging tractable.
- Zero trust between services limits the blast radius of a security incident.
- Silent failures compound into data corruption and eroded user trust.
- Capacity planning before launch prevents avoidable outages.
Final Thoughts
These are not aspirational goals. They are minimum requirements. The cost of implementing them upfront is a fraction of the cost of not having them when things go wrong. Every compromise on this list has, in my experience, resulted in a production incident that cost more to resolve than the original implementation would have.