Engineering Decisions I Don't Compromise On

Dhruval Dhameliya·October 15, 2025·6 min read

Non-negotiable engineering principles I enforce regardless of deadline pressure, team size, or scope, and the reasoning behind each one.

Over the years I have developed a short list of engineering decisions where I refuse to take shortcuts. These are not preferences or style choices. They are load-bearing principles that, when compromised, lead to incidents, technical debt, and eroded trust. Every one of them exists because I learned the cost of the alternative the hard way.

Idempotency on Write Paths

Every write operation must be idempotent. If a client retries a request due to a network timeout, the system must produce the same result as if the request executed once. This is non-negotiable because retries are not optional in distributed systems. They are inevitable.

Implementation requirements:

  • Every mutation endpoint accepts an idempotency key
  • The server stores the result of the first execution and returns it for subsequent requests with the same key
  • Key storage has a TTL that exceeds the maximum retry window
  • The idempotency check happens before any side effects
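
The requirements above can be sketched in a few lines. This is an illustrative in-memory version (a real system would back the key store with a database or cache and make the check-and-set atomic to handle concurrent retries); the names `IdempotencyStore` and `charge` are hypothetical:

```python
import time

class IdempotencyStore:
    """In-memory idempotency-key store with a TTL. In production this
    lives in a database or cache, and the TTL exceeds the retry window."""

    def __init__(self, ttl_seconds=86400):
        self.ttl = ttl_seconds
        self._results = {}  # key -> (stored_at, result)

    def get(self, key):
        entry = self._results.get(key)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, key, result):
        self._results[key] = (time.time(), result)

store = IdempotencyStore()
charges = []  # stand-in for the real side effect (e.g. a payment)

def charge(idempotency_key, amount):
    # The idempotency check runs BEFORE any side effect.
    cached = store.get(idempotency_key)
    if cached is not None:
        return cached
    charges.append(amount)  # the one-time side effect
    result = {"charge_id": len(charges), "amount": amount}
    store.put(idempotency_key, result)
    return result

first = charge("key-123", 500)
retry = charge("key-123", 500)  # a network retry with the same key
assert first == retry and len(charges) == 1
```

The retry returns the stored result of the first execution; the side effect happens exactly once.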

I have seen double charges, duplicate records, and corrupted state from non-idempotent writes. The cost of adding idempotency after the fact is an order of magnitude higher than building it in from the start.

Automated Rollback Capability

Every deployment must be reversible within minutes without manual intervention. If the new version causes errors, the system rolls back automatically based on health check failures.

This means:

  • Database migrations must be backward-compatible (the previous application version must work with the new schema)
  • Configuration changes are versioned and can be rolled back
  • Feature flags control new behavior, allowing instant disable without a deploy
  • No "one-way door" deployments without an explicit review and approval process
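
The health-check-driven rollback loop reduces to something like the following sketch. All names (`deploy_with_auto_rollback`, `activate`, `health_check`, `rollback`) are illustrative, not a real deployment tool's API:

```python
def deploy_with_auto_rollback(activate, health_check, rollback, checks=3):
    """Activate the new version, then watch the health check; if it
    fails, roll back automatically with no human in the loop."""
    activate()
    for _ in range(checks):
        if not health_check():
            rollback()
            return "rolled-back"
    return "deployed"

# Simulate a bad deploy: v2 fails its health check.
state = {"version": "v1"}

def activate():
    state["version"] = "v2"

def rollback():
    state["version"] = "v1"

def health_check():
    return state["version"] != "v2"  # v2 is unhealthy in this scenario

result = deploy_with_auto_rollback(activate, health_check, rollback)
assert result == "rolled-back" and state["version"] == "v1"
```

In a real pipeline the health check would probe actual functionality over several minutes, not three synchronous calls, but the control flow is the same.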

The moment you deploy something that cannot be rolled back, you have committed to debugging in production under pressure. That is not a position I accept voluntarily.

Schema Validation at System Boundaries

Every piece of data that crosses a system boundary gets validated against a schema. API requests, queue messages, file uploads, configuration values. All of it.

The validation is strict:

  • Unknown fields are rejected, not silently ignored
  • Type coercion does not happen implicitly
  • Required fields are enforced, not assumed
  • String formats (dates, UUIDs, emails) are validated against patterns
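
All four rules fit in a small validator. This is a hand-rolled sketch for illustration (a real service would use a schema library, but it must enforce the same strictness); `validate_strict` and `SCHEMA` are hypothetical names:

```python
import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_strict(payload, schema):
    """Strict boundary validation: unknown fields rejected, required
    fields enforced, no implicit coercion, formats pattern-checked."""
    unknown = set(payload) - set(schema)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    for field, (expected_type, pattern) in schema.items():
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
        value = payload[field]
        # isinstance, never coercion: the string "42" is not an int here.
        if not isinstance(value, expected_type):
            raise TypeError(f"{field}: expected {expected_type.__name__}")
        if pattern and not pattern.match(value):
            raise ValueError(f"{field}: malformed value")
    return payload

SCHEMA = {"email": (str, EMAIL), "age": (int, None)}

validate_strict({"email": "a@b.com", "age": 30}, SCHEMA)  # accepted
```

A payload with an extra field, a missing field, a string `"30"` for `age`, or a malformed email is rejected at the boundary instead of propagating downstream.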

Lenient input handling creates a class of bugs that are nearly impossible to diagnose. The data "looks right" but has subtle formatting differences that cause downstream failures hours or days later.

See also: Handling Partial Failures in Distributed Mobile Systems.

Logging With Context

Every log line includes enough context to trace it back to the originating request without consulting another system. At minimum:

  • Request ID or correlation ID
  • User or tenant identifier
  • Operation being performed
  • Relevant entity IDs

Logs without context are noise. When an on-call engineer is investigating a production issue at 2 AM, they should not need to join log lines from three different systems to understand what happened.

I also enforce structured logging (JSON format) over unstructured log messages. Structured logs can be queried. Unstructured logs can only be grepped, and grep does not scale.
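
A minimal version of both rules, context plus JSON structure, looks like this sketch (the `log` helper and field names are illustrative; real systems use a logging library configured with a JSON formatter):

```python
import json
import sys
import uuid

def log(event, request_id, user_id, **entities):
    """Emit one structured log line carrying the minimum context:
    correlation ID, user/tenant ID, operation, and entity IDs."""
    record = {"event": event, "request_id": request_id,
              "user_id": user_id, **entities}
    sys.stdout.write(json.dumps(record) + "\n")
    return record

rid = str(uuid.uuid4())
record = log("order.created", rid, user_id="u-42", order_id="o-9")
assert record["request_id"] == rid
```

Every line this emits can be filtered by `request_id` in a log query, which is exactly what an on-call engineer needs at 2 AM.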

Zero Trust Between Services

Internal services authenticate and authorize every request. The network boundary is not a security boundary. This applies to:

  • Service-to-service communication (mutual TLS or signed tokens)
  • Database connections (per-service credentials with least-privilege access)
  • Queue consumers (authenticated connections with topic-level authorization)
  • Internal APIs (authentication required even for "internal only" endpoints)
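
As one concrete flavor of the signed-token option, here is a sketch using HMAC over a service identity. This is deliberately simplified (no expiry, no replay protection, and it assumes service names contain no dots); production systems would use mTLS or short-lived signed tokens instead:

```python
import hashlib
import hmac

# Per service pair in practice, stored in a secrets manager and rotated.
SHARED_SECRET = b"example-secret"

def sign(service_name: str) -> str:
    """Issue a token asserting the caller's service identity."""
    sig = hmac.new(SHARED_SECRET, service_name.encode(), hashlib.sha256)
    return f"{service_name}.{sig.hexdigest()}"

def verify(token: str) -> str:
    """Verify a token; every internal request passes through this."""
    service_name, _, received = token.partition(".")
    expected = hmac.new(SHARED_SECRET, service_name.encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    if not hmac.compare_digest(received, expected):
        raise PermissionError("invalid service token")
    return service_name

token = sign("billing-service")
assert verify(token) == "billing-service"
```

The point is not this particular scheme; it is that the receiving service rejects unauthenticated callers even when the request originates inside the network.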

The argument against this is always performance or complexity. The argument for it is that a single compromised service should not grant access to every other service in the system.

Separate Read and Write Models When Complexity Demands It

When a system has significantly different read and write patterns, I separate the models rather than forcing one schema to serve both. This is not CQRS dogma. It is a practical response to the observation that read-optimized and write-optimized schemas have different shapes.

The trigger for separation:

  • Read queries require joins across more than three tables
  • Write operations need strong consistency while reads can tolerate staleness
  • Read and write traffic scale at different rates
  • Reporting queries cause contention with transactional writes
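
The shape of the separation, in miniature: a normalized write model and a denormalized read projection that serves queries without joins. The projection is updated inline here for brevity; in a real system it is typically rebuilt asynchronously from the write path (all names are illustrative):

```python
# Write model: normalized, strongly consistent.
users = {"u-1": {"name": "Ada"}}
orders = {}  # order_id -> {"user_id", "amount"}

# Read model: flat rows ready for display, no joins at query time.
order_summaries = []

def place_order(order_id, user_id, amount):
    """Write to the normalized model, then project into the read model."""
    orders[order_id] = {"user_id": user_id, "amount": amount}
    order_summaries.append({"order_id": order_id,
                            "user_name": users[user_id]["name"],
                            "amount": amount})

place_order("o-1", "u-1", 250)
assert order_summaries[0]["user_name"] == "Ada"
```

The write model stays optimized for consistency; the read model stays optimized for query shape, and each can scale on its own terms.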

No Silent Failures

If an operation fails, the failure must be visible. No swallowed exceptions, no empty catch blocks, no "log and continue" for errors that affect correctness.

The rules:

  • Errors that affect user-visible behavior must result in an error response, not a partial success
  • Background job failures must generate alerts, not just log entries
  • Data inconsistencies must be detected and surfaced, not silently tolerated
  • Health checks must verify actual functionality, not just process liveness
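
The background-job rule, sketched: a failed job raises an alert and re-raises instead of being swallowed into a log line nobody reads. The `alert` and `run_job` helpers are hypothetical stand-ins for a real alerting integration:

```python
alerts = []

def alert(message):
    """Stand-in for a paging/alerting integration."""
    alerts.append(message)

def run_job(job, name):
    """Run a background job; failure alerts AND propagates.
    Never 'log and continue' for errors that affect correctness."""
    try:
        job()
    except Exception as exc:
        alert(f"job {name} failed: {exc}")
        raise

def reconcile():
    raise RuntimeError("ledger mismatch")

try:
    run_job(reconcile, "nightly-reconciliation")
except RuntimeError:
    pass
assert alerts == ["job nightly-reconciliation failed: ledger mismatch"]
```

The anti-pattern this replaces is an empty `except` block around the job body, which turns a ledger mismatch into silence.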

Silent failures are technical debt that accrues interest in the form of data corruption and user trust erosion.

Capacity Planning Before Launch

Every new system or major feature gets a capacity analysis before production deployment. Not a rough estimate. A documented analysis that includes:

  • Expected traffic patterns (peak, average, growth rate)
  • Resource requirements per request (CPU, memory, I/O, network)
  • Scaling limits of each component (database connections, queue throughput, API rate limits)
  • Cost projections at 1x, 5x, and 10x current traffic
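
The core of the analysis is back-of-envelope arithmetic like the following sketch, which sizes a fleet from peak traffic and per-request CPU cost. The function name and the 60% utilization headroom are my illustrative assumptions, not fixed rules:

```python
import math

def capacity_plan(peak_rps, cpu_ms_per_req, cores_per_node, headroom=0.6):
    """Nodes needed to serve peak traffic while keeping each node
    below the `headroom` utilization target."""
    cpu_seconds_per_sec = peak_rps * cpu_ms_per_req / 1000
    usable_cores = cores_per_node * headroom
    return math.ceil(cpu_seconds_per_sec / usable_cores)

# 2,000 req/s peak at 15 ms CPU each = 30 core-seconds per second.
# On 4-core nodes targeting 60% utilization: 30 / 2.4 = 12.5 -> 13 nodes.
assert capacity_plan(2000, 15, 4) == 13

# The same arithmetic at 5x traffic answers the growth question early.
assert capacity_plan(10000, 15, 4) == 63
```

Memory, I/O, and downstream limits (database connections, queue throughput) each get the same treatment; the binding constraint is whichever resource runs out first.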

This has saved me from embarrassing launch failures more times than I can count. The analysis does not need to be perfect. It needs to be done.

Key Takeaways

  • Idempotency on write paths prevents data corruption from inevitable retries.
  • Automated rollback capability means you never have to debug under deployment pressure.
  • Schema validation at boundaries catches subtle data issues before they propagate.
  • Structured logging with request context makes production debugging tractable.
  • Zero trust between services limits the blast radius of a security incident.
  • Silent failures compound into data corruption and eroded user trust.
  • Capacity planning before launch prevents avoidable outages.

Final Thoughts

These are not aspirational goals. They are minimum requirements. The cost of implementing them upfront is a fraction of the cost of not having them when things go wrong. Every compromise on this list has, in my experience, resulted in a production incident that cost more to resolve than the original implementation would have.
