Engineering Lessons I Relearned the Hard Way
Lessons that I knew intellectually but had to learn through painful experience before they truly changed my behavior, covering testing, deployments, dependencies, and team dynamics.
Context
There are lessons you know and lessons you have internalized. The gap between the two is measured in production incidents. I knew "always have a rollback plan" as an abstract principle for years before a failed migration with no rollback path made it a visceral truth.
This post covers the lessons I had to relearn through experience. Not because I was ignorant of them, but because knowing something intellectually and acting on it consistently are very different things.
Lesson 1: "It Works on My Machine" Is Not Testing
I knew integration testing was important. But when deadlines pressed, I would convince myself that thorough local testing was sufficient. The unit tests passed. The service started up locally. Manual testing against a local database worked. Ship it.
The production environment had a different database version, different network latency characteristics, different connection pool settings, and a different data distribution. The bug manifested only under production conditions.
What I relearned: Local testing validates logic. Integration testing validates behavior. Production testing validates the system. You need all three, and skipping any of them is borrowing against future incidents.
What changed in my behavior: I now refuse to sign off on a deployment that has not been validated in a staging environment that mirrors production topology. Not "similar to production." Mirrors it.
Lesson 2: Database Migrations Are the Riskiest Deployments
I knew this. Everyone knows this. And yet I approved a migration that altered a column type on a table with 200 million rows during business hours because "it should be fast."
It was not fast. The table lock held for 40 minutes. Every read and write to that table failed. The cascade took out three dependent services.
What I relearned: Schema migrations on large tables are the highest-risk operations you can perform. They deserve dedicated planning, off-peak execution, and tested rollback procedures.
What changed in my behavior:
- All migrations on tables larger than 10 million rows use online DDL tools
- All migrations have a tested rollback script
- All migrations run during low-traffic windows
- All migrations have explicit go/no-go criteria and a designated decision-maker
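The checklist above can be sketched as a single go/no-go gate. This is an illustrative sketch, not a real migration framework: the 10-million-row threshold comes from the list, but the low-traffic window hours and the function names are assumptions.

```python
# Sketch of a go/no-go check for schema migrations, following the checklist
# above. The low-traffic window (01:00-05:00 UTC) is an assumed example.

ONLINE_DDL_THRESHOLD = 10_000_000  # rows: above this, require an online DDL tool

def migration_plan(row_count: int, has_rollback_script: bool, hour_utc: int) -> str:
    """Return 'go', 'go-online-ddl', or a 'no-go' reason for a proposed migration."""
    if not has_rollback_script:
        return "no-go: tested rollback script is required"
    if not 1 <= hour_utc <= 5:  # assumed low-traffic window
        return "no-go: run during the low-traffic window"
    if row_count > ONLINE_DDL_THRESHOLD:
        return "go-online-ddl"  # e.g. an online schema-change tool, never a direct ALTER
    return "go"
```

The point of encoding the gate is that "it should be fast" stops being a judgment call made under deadline pressure.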
Lesson 3: Third-Party Dependencies Break
I integrated with a third-party payment processor that had 99.99% uptime for two years. I did not build a fallback path because the dependency "never goes down."
It went down. For four hours. During our busiest period. We had no fallback, no queue for retry, and no way to process payments. We lost revenue and customer trust.
What I relearned: Every external dependency will fail. The question is not "if" but "when" and "for how long." Your system must have a plan for every dependency failure.
See also: Failure Modes I Actively Design For.
What changed in my behavior: Every external dependency now has:
| Dependency Aspect | Requirement |
|---|---|
| Timeout | Set to 2x the measured p99 latency |
| Retry | Exponential backoff with jitter, max 3 attempts |
| Circuit breaker | Opens after 5 consecutive failures, half-opens after 30 seconds |
| Fallback | Defined behavior when the dependency is unavailable |
| Monitoring | Latency and error rate dashboards with alerts |
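The retry and circuit-breaker rows of the table can be sketched together. The thresholds match the table; the wrapped call, the failure detection, and the backoff base are illustrative, and a real implementation would also enforce the 2x-p99 timeout on `fn()`.

```python
import random
import time

MAX_ATTEMPTS = 3          # from the table: max 3 attempts
FAILURE_THRESHOLD = 5     # from the table: opens after 5 consecutive failures
HALF_OPEN_AFTER = 30.0    # from the table: half-opens after 30 seconds

class CircuitBreaker:
    def __init__(self):
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # half-open: let a probe through once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= HALF_OPEN_AFTER

    def record(self, ok: bool):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= FAILURE_THRESHOLD:
                self.opened_at = time.monotonic()

def call_with_policy(breaker, fn, fallback):
    if not breaker.allow():
        return fallback()  # defined degraded behavior, never an unhandled error
    for attempt in range(MAX_ATTEMPTS):
        try:
            result = fn()  # real code would wrap this with a timeout of 2x p99
            breaker.record(ok=True)
            return result
        except Exception:
            breaker.record(ok=False)
            if attempt < MAX_ATTEMPTS - 1:
                # exponential backoff with full jitter (base delay is illustrative)
                time.sleep(random.uniform(0, 0.1 * 2 ** attempt))
    return fallback()
```

Once the breaker opens, the dependency is not called at all: the fallback path runs immediately, which is exactly the behavior that was missing during the four-hour outage.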
Lesson 4: Distributed Transactions Are Harder Than You Think
I built a workflow that needed to update data in two services atomically. I used a two-phase commit pattern that I had read about. In theory, it works. In practice, the coordinator crashed between phase one and phase two, leaving the system in an inconsistent state that took three hours to resolve manually.
What I relearned: Distributed transactions are fragile. The failure modes are numerous and the recovery procedures are complex. In most cases, eventual consistency with compensating transactions is more reliable than distributed atomicity.
What changed in my behavior: I avoid distributed transactions entirely. Instead:
- Use sagas with compensating actions for multi-service workflows
- Accept eventual consistency where the business allows it
- Use idempotent operations so that retries are safe
- Build reconciliation to detect and fix inconsistencies
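The first three bullets can be sketched as a minimal saga runner: each step pairs an action with a compensating action, and an idempotency key makes retries safe. The step functions and the key are illustrative placeholders, not a real workflow engine.

```python
# Minimal saga sketch: run each step's action; if any step fails, run the
# compensating actions for the completed steps in reverse order.

def run_saga(steps, key):
    """steps: list of (action, compensate) pairs; key: idempotency key
    passed to every call so that retries and compensations are safe."""
    done = []
    try:
        for action, compensate in steps:
            action(key)          # each action must be idempotent on `key`
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate(key)
        return "compensated"
    return "committed"
```

Unlike the two-phase commit that left the system inconsistent, a crash here leaves only completed, compensatable steps, and the reconciliation job from the last bullet catches anything the compensation path itself misses.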
Lesson 5: Monitoring Without Alerting Is Just Logging
I built comprehensive dashboards for a new service. CPU usage, memory, request rates, error rates, latency percentiles. Beautiful graphs. Nobody looked at them.
When the service started degrading, the dashboards showed it clearly. But nobody was looking because there were no alerts. The degradation was reported by a user three hours after it started.
What I relearned: Dashboards are for investigation. Alerts are for detection. Without alerts, dashboards are passive documentation of system state that nobody consults proactively.
What changed in my behavior: Every service gets alerts before dashboards. The alert triggers investigation. The dashboard supports investigation. This order matters.
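A detection rule can be as small as a threshold check that pages before anyone has to look at a graph. The window and threshold here are illustrative defaults, not a recommendation for every service.

```python
# Sketch of an "alerts first" rule: page when the error rate over the
# evaluation window exceeds a threshold. 5% is an assumed example value.

ERROR_RATE_THRESHOLD = 0.05

def should_alert(total_requests: int, failed_requests: int) -> bool:
    if total_requests == 0:
        return False  # no traffic; a separate absence-of-data alert covers this
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD
```

In practice this lives in an alerting system rather than application code, but the principle is the same: the rule fires, the runbook link in the alert starts the investigation, and the dashboard supports it.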
Lesson 6: Documentation Rots
I wrote thorough documentation for a system's architecture, operational procedures, and design decisions. Six months later, the system had evolved significantly. The documentation had not. New team members followed the outdated documentation and made mistakes.
What I relearned: Documentation that is not maintained is worse than no documentation because it is confidently wrong. People trust documentation. When it is outdated, it actively misleads.
What changed in my behavior:
- Architecture Decision Records (ADRs) over comprehensive design documents. ADRs capture a decision and its context at a point in time, so they do not become outdated; they become historical.
- Runbooks attached to alerts. When an alert fires, the runbook link is in the alert. If the runbook is wrong, the on-call engineer updates it immediately.
- Tests as documentation. Tests that demonstrate expected behavior do not go stale because they break when the behavior changes.
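As a sketch of the last bullet: the test names and assertions below state the expected behavior in prose-like form, and they break the moment that behavior changes. The discount rules are a made-up example, not from a real system.

```python
# "Tests as documentation": each test reads as a statement about behavior.

def apply_discount(price: float, code: str) -> float:
    codes = {"WELCOME10": 0.10, "VIP20": 0.20}  # illustrative discount codes
    return round(price * (1 - codes.get(code, 0.0)), 2)

def test_known_code_reduces_price():
    assert apply_discount(100.0, "WELCOME10") == 90.0

def test_unknown_code_is_ignored_not_an_error():
    assert apply_discount(100.0, "BOGUS") == 100.0
```

A design document describing these rules could drift; the tests cannot, because CI runs them against the current code.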
Lesson 7: Code Reviews Are Not Quality Gates
I relied on code reviews to catch bugs, design issues, and operational gaps. Reviews caught some things, but they missed others consistently: race conditions, performance issues under load, configuration errors, and missing error handling for unlikely scenarios.
What I relearned: Code reviews catch readability issues and obvious logic errors. They do not reliably catch concurrency bugs, performance issues, or operational gaps. These require different tools: static analysis, load testing, chaos engineering, and operational checklists.
What changed in my behavior: Code reviews focus on readability, design alignment, and obvious correctness. Automated tools handle linting, static analysis, and known vulnerability detection. Load tests and failure injection tests handle performance and resilience validation. Each tool does what it does best.
Key Takeaways
- Local testing validates logic. Integration testing validates behavior. Production testing validates the system. You need all three.
- Database migrations on large tables are the highest-risk deployments. Treat them accordingly.
- Every external dependency will fail. Build timeout, retry, circuit breaker, and fallback for each one.
- Avoid distributed transactions. Sagas with compensating actions are more reliable in practice.
- Alerts before dashboards. Without alerts, dashboards are passive and unobserved.
- Prefer ADRs and tests over comprehensive documentation. They age better.
- Code reviews catch readability and logic issues. Automated tools and testing catch the rest.
Further Reading
- Lessons From Debugging Distributed Systems: Practical lessons from years of debugging distributed systems, covering partial failures and clock skew.
- Engineering Decisions That Reduce Pager Fatigue: Architectural and operational decisions that reduce the frequency and severity of production pages.
- How I Think About Engineering Risk: A framework for identifying, categorizing, and managing engineering risk across system design, team dynamics, and operational decisions.
Final Thoughts
The gap between knowing a lesson and acting on it consistently is where most engineering mistakes happen. I knew every one of these lessons before I relearned them. But knowledge without the visceral experience of failure did not change my behavior. The cost of relearning was significant: production incidents, lost revenue, and accumulated technical debt. If this post helps even one engineer skip the relearning and go straight to the behavior change, it will have been worth writing.