Engineering Lessons I Relearned the Hard Way
Lessons that I knew intellectually but had to learn through painful experience before they truly changed my behavior, covering testing, deployments, dependencies, and team dynamics.
Context
There are lessons you know and lessons you have internalized. The gap between the two is measured in production incidents. I knew "always have a rollback plan" as an abstract principle for years before a failed migration with no rollback path made it a visceral truth.
This post covers the lessons I had to relearn through experience. Not because I was ignorant of them, but because knowing something intellectually and acting on it consistently are very different things.
Lesson 1: "It Works on My Machine" Is Not Testing
I knew integration testing was important. But when deadlines pressed, I would convince myself that thorough local testing was sufficient. The unit tests passed. The service started up locally. Manual testing against a local database worked. Ship it.
The production environment had a different database version, different network latency characteristics, different connection pool settings, and a different data distribution. The bug manifested only under production conditions.
What I relearned: Local testing validates logic. Integration testing validates behavior. Production testing validates the system. You need all three, and skipping any of them is borrowing against future incidents.
What changed in my behavior: I now refuse to sign off on a deployment that has not been validated in a staging environment that mirrors production topology. Not "similar to production." Mirrors it.
Lesson 2: Database Migrations Are the Riskiest Deployments
I knew this. Everyone knows this. And yet I approved a migration that altered a column type on a table with 200 million rows during business hours because "it should be fast."
It was not fast. The table lock held for 40 minutes. Every read and write to that table failed. The cascade took out three dependent services.
What I relearned: Schema migrations on large tables are the highest-risk operations you can perform. They deserve dedicated planning, off-peak execution, and tested rollback procedures.
What changed in my behavior:
- All migrations on tables larger than 10 million rows use online DDL tools
- All migrations have a tested rollback script
- All migrations run during low-traffic windows
- All migrations have explicit go/no-go criteria and a designated decision-maker
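The checklist above can be sketched as a single go/no-go gate. This is an illustrative sketch, not a real migration framework: the 10-million-row threshold comes from the list, but the low-traffic window hours and the function names are assumptions.

```python
# Sketch of a go/no-go check for schema migrations, following the checklist
# above. The low-traffic window (01:00-05:00 UTC) is an assumed example.

ONLINE_DDL_THRESHOLD = 10_000_000  # rows: above this, require an online DDL tool

def migration_plan(row_count: int, has_rollback_script: bool, hour_utc: int) -> str:
    """Return 'go', 'go-online-ddl', or a 'no-go' reason for a proposed migration."""
    if not has_rollback_script:
        return "no-go: tested rollback script is required"
    if not 1 <= hour_utc <= 5:  # assumed low-traffic window
        return "no-go: run during the low-traffic window"
    if row_count > ONLINE_DDL_THRESHOLD:
        return "go-online-ddl"  # e.g. an online schema-change tool, never a direct ALTER
    return "go"
```

The point of encoding the gate is that "it should be fast" stops being a judgment call made under deadline pressure.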
Lesson 3: Third-Party Dependencies Break
I integrated with a third-party payment processor that had 99.99% uptime for two years. I did not build a fallback path because the dependency "never goes down."
It went down. For four hours. During our busiest period. We had no fallback, no queue for retry, and no way to process payments. We lost revenue and customer trust.
What I relearned: Every external dependency will fail. The question is not "if" but "when" and "for how long." Your system must have a plan for every dependency failure.
See also: Failure Modes I Actively Design For.
What changed in my behavior: Every external dependency now has:
| Dependency Aspect | Requirement |
|---|---|
| Timeout | Set to 2x the measured p99 latency |
| Retry | Exponential backoff with jitter, max 3 attempts |
| Circuit breaker | Opens after 5 consecutive failures, half-opens after 30 seconds |
| Fallback | Defined behavior when the dependency is unavailable |
| Monitoring | Latency and error rate dashboards with alerts |
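The retry and circuit-breaker rows of the table can be sketched together. The thresholds match the table; the wrapped call, the failure detection, and the backoff base are illustrative, and a real implementation would also enforce the 2x-p99 timeout on `fn()`.

```python
import random
import time

MAX_ATTEMPTS = 3          # from the table: max 3 attempts
FAILURE_THRESHOLD = 5     # from the table: opens after 5 consecutive failures
HALF_OPEN_AFTER = 30.0    # from the table: half-opens after 30 seconds

class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # half-open: let a probe through once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= HALF_OPEN_AFTER

    def record(self, ok: bool):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = time.monotonic()

def call_with_policy(breaker, fn, fallback):
    if not breaker.allow():
        return fallback()  # defined degraded behavior, never an unhandled error
    for attempt in range(MAX_ATTEMPTS):
        try:
            result = fn()  # real code would wrap this with a timeout of 2x p99
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt < MAX_ATTEMPTS - 1:
                # exponential backoff with full jitter (base delay is illustrative)
                time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
    return fallback()
```

Once the breaker opens, the dependency is not called at all: the fallback path runs immediately, which is exactly the behavior that was missing during the four-hour outage.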
Lesson 4: Distributed Transactions Are Harder Than You Think
I built a workflow that needed to update data in two services atomically. I used a two-phase commit pattern that I had read about. In theory, it works. In practice, the coordinator crashed between phase one and phase two, leaving the system in an inconsistent state that took three hours to resolve manually.
What I relearned: Distributed transactions are fragile. The failure modes are numerous and the recovery procedures are complex. In most cases, eventual consistency with compensating transactions is more reliable than distributed atomicity.
What changed in my behavior: I avoid distributed transactions entirely. Instead:
- Use sagas with compensating actions for multi-service workflows
- Accept eventual consistency where the business allows it
- Use idempotent operations so that retries are safe
- Build reconciliation to detect and fix inconsistencies
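The first three bullets can be sketched as a minimal saga runner: each step pairs an action with a compensating action, and an idempotency key makes retries safe. The step functions and the key are illustrative placeholders, not a real workflow engine.

```python
# Minimal saga sketch: run each step's action; if any step fails, run the
# compensating actions for the completed steps in reverse order.

def run_saga(steps, key):
    """steps: list of (action, compensate) pairs; key: idempotency key
    passed to every call so that retries and compensations are safe."""
    done = []
    try:
        for action, compensate in steps:
            action(key)          # each action must be idempotent on `key`
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate(key)
        return "compensated"
    return "committed"
```

Unlike the two-phase commit that left the system inconsistent, a crash here leaves only completed, compensatable steps, and the reconciliation job from the last bullet catches anything the compensation path itself misses.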
Lesson 5: Monitoring Without Alerting Is Just Logging
I built comprehensive dashboards for a new service. CPU usage, memory, request rates, error rates, latency percentiles. Beautiful graphs. Nobody looked at them.
When the service started degrading, the dashboards showed it clearly. But nobody was looking because there were no alerts. The degradation was reported by a user three hours after it started.
What I relearned: Dashboards are for investigation. Alerts are for detection. Without alerts, dashboards are passive documentation of system state that nobody consults proactively.
What changed in my behavior: Every service gets alerts before dashboards. The alert triggers investigation. The dashboard supports investigation. This order matters.
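A detection rule can be as small as a threshold check that pages before anyone has to look at a graph. The window and threshold here are illustrative defaults, not a recommendation for every service.

```python
# Sketch of an "alerts first" rule: page when the error rate over the
# evaluation window exceeds a threshold. 5% is an assumed example value.

ERROR_RATE_THRESHOLD = 0.05

def should_alert(total_requests: int, failed_requests: int) -> bool:
    if total_requests == 0:
        return False  # no traffic; a separate absence-of-data alert covers this
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD
```

In practice this lives in an alerting system rather than application code, but the principle is the same: the rule fires, the runbook link in the alert starts the investigation, and the dashboard supports it.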
Lesson 6: Documentation Rots
I wrote thorough documentation for a system's architecture, operational procedures, and design decisions. Six months later, the system had evolved significantly. The documentation had not. New team members followed the outdated documentation and made mistakes.
What I relearned: Documentation that is not maintained is worse than no documentation because it is confidently wrong. People trust documentation. When it is outdated, it actively misleads.
What changed in my behavior:
- Architecture Decision Records (ADRs) over comprehensive design documents. ADRs capture a decision and its context at a point in time, so they do not become outdated; they become historical.
- Runbooks attached to alerts. When an alert fires, the runbook link is in the alert. If the runbook is wrong, the on-call engineer updates it immediately.
- Tests as documentation. Tests that demonstrate expected behavior do not go stale because they break when the behavior changes.
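As a sketch of the last bullet: the test names and assertions below state the expected behavior in prose-like form, and they break the moment that behavior changes. The discount rules are a made-up example, not from a real system.

```python
# "Tests as documentation": each test reads as a statement about behavior.

def apply_discount(price: float, code: str) -> float:
    codes = {"WELCOME10": 0.10, "VIP20": 0.20}  # illustrative discount codes
    return round(price * (1 - codes.get(code, 0.0)), 2)

def test_known_code_reduces_price():
    assert apply_discount(100.0, "WELCOME10") == 90.0

def test_unknown_code_is_ignored_not_an_error():
    assert apply_discount(100.0, "BOGUS") == 100.0
```

A design document describing these rules could drift; the tests cannot, because CI runs them against the current code.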
Lesson 7: Code Reviews Are Not Quality Gates
I relied on code reviews to catch bugs, design issues, and operational gaps. Reviews caught some things, but they missed others consistently: race conditions, performance issues under load, configuration errors, and missing error handling for unlikely scenarios.
What I relearned: Code reviews catch readability issues and obvious logic errors. They do not reliably catch concurrency bugs, performance issues, or operational gaps. These require different tools: static analysis, load testing, chaos engineering, and operational checklists.
What changed in my behavior: Code reviews focus on readability, design alignment, and obvious correctness. Automated tools handle linting, static analysis, and known vulnerability detection. Load tests and failure injection tests handle performance and resilience validation. Each tool does what it does best.
Key Takeaways
- Local testing validates logic. Integration testing validates behavior. Production testing validates the system. You need all three.
- Database migrations on large tables are the highest-risk deployments. Treat them accordingly.
- Every external dependency will fail. Build timeout, retry, circuit breaker, and fallback for each one.
- Avoid distributed transactions. Sagas with compensating actions are more reliable in practice.
- Alerts before dashboards. Without alerts, dashboards are passive and unobserved.
- Prefer ADRs and tests over comprehensive documentation. They age better.
- Code reviews catch readability and logic issues. Automated tools and testing catch the rest.
Further Reading
- Lessons From Debugging Distributed Systems: Practical lessons from years of debugging distributed systems, covering partial failures and clock skew.
- Engineering Decisions That Reduce Pager Fatigue: Architectural and operational decisions that reduce the frequency and severity of production pages.
- How I Think About Engineering Risk: A framework for identifying, categorizing, and managing engineering risk across system design, team dynamics, and operational decisions.
Final Thoughts
The gap between knowing a lesson and acting on it consistently is where most engineering mistakes happen. I knew every one of these lessons before I relearned them. But knowledge without the visceral experience of failure did not change my behavior. The cost of relearning was significant: production incidents, lost revenue, and accumulated technical debt. If this post helps even one engineer skip the relearning and go straight to the behavior change, it will have been worth writing.