Refactoring a System Without Breaking Users
Strategies for large-scale refactors that keep production stable, covering parallel runs, feature flags, gradual migrations, and verification techniques.
The hardest refactors are the ones where the system cannot go down. No maintenance window. No "please refresh your browser." Users keep hitting the old paths while you rebuild underneath them. This is how I approach those situations.
The Parallel Run Pattern
Before cutting over any critical path, I run the old and new implementations side by side. Both execute. Only the old one returns results to the user. The new one logs its output for comparison.
Request --> Old Path (returns response)
        \-> New Path (logs result, discarded)
Compare outputs offline. Fix divergences. Repeat.
This is expensive in compute but cheap in risk. I have caught subtle bugs in "equivalent" reimplementations that would have been production incidents. Date formatting differences, null handling edge cases, floating point rounding in financial calculations. The parallel run catches all of it.
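The shape of a parallel run can be sketched in a few lines. This is a minimal, synchronous sketch (names like `parallel_run` are illustrative, not from any library); a production version would typically sample requests and run the new path asynchronously so it cannot add latency.

```python
import json
import logging

logger = logging.getLogger("parallel_run")

def parallel_run(request, old_path, new_path):
    """Serve the old implementation; execute the new one for comparison only."""
    old_result = old_path(request)
    try:
        new_result = new_path(request)
        if new_result != old_result:
            # Log the divergence for offline analysis; never affect the response.
            logger.warning("divergence for %r: old=%s new=%s",
                           request, json.dumps(old_result), json.dumps(new_result))
    except Exception:
        # A crash in the new path must not break the user-facing response.
        logger.exception("new path failed for %r", request)
    return old_result
```

Note that the new path is wrapped in its own exception handler: a parallel run only reduces risk if the shadow implementation can fail without the user ever noticing.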
Migration Phases
Every large refactor follows the same phased approach:
Phase 1: Introduce the new path behind a flag. No traffic goes to it. It exists, it compiles, it has tests. That is all.
Phase 2: Shadow traffic. A percentage of requests also execute against the new path. Results are compared but not served.
Phase 3: Canary. A small percentage of real traffic (1-5%) uses the new path for actual responses. Metrics are compared against the old path.
Phase 4: Gradual rollout. Increase the percentage in stages: 10%, 25%, 50%, 100%. Each stage runs for at least a full business cycle (typically one week).
Phase 5: Remove the old path. Only after the new path has been at 100% for a sufficient bake period.
The temptation is always to compress these phases. Resist it. The bake time between phases is where you discover the long-tail issues.
Database Migration Without Downtime
Schema changes during a refactor are the highest-risk operations. The approach I follow:
- Expand: Add new columns or tables alongside existing ones. Application writes to both.
- Migrate: Backfill historical data. Verify counts and checksums.
- Switch reads: Point queries at the new schema. Old schema still receives writes.
- Contract: Remove writes to old schema. Drop old columns after a retention period.
Each step is independently deployable and independently reversible. If step 3 reveals a problem, you revert reads without touching writes.
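The expand phase's dual writes can be sketched as a thin repository wrapper. This is a simplified illustration (the class and the dict-backed stores are stand-ins for real tables), but it shows the key property: reads are switchable independently of writes.

```python
class DualWriteRepository:
    """Expand-phase wrapper: every write goes to both schemas."""

    def __init__(self, old_store, new_store, read_from_new=False):
        self.old_store = old_store
        self.new_store = new_store
        self.read_from_new = read_from_new  # flipped in the switch-reads phase

    def save(self, key, record):
        self.old_store[key] = record  # old schema remains the source of truth
        self.new_store[key] = record  # new schema receives the same write

    def load(self, key):
        # Reverting reads is a one-line config change; writes are untouched.
        return self.new_store[key] if self.read_from_new else self.old_store[key]
```

Because both stores keep receiving every write, flipping `read_from_new` back is safe at any point: no data written during the switch-reads phase is lost.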
API Versioning During Refactors
When a refactor changes API behavior, even subtly, I version the endpoint. The old version stays exactly as-is. The new version carries the refactored behavior.
Common mistakes I have seen:
- Changing response field types (string to integer) without versioning
- Removing fields that "nobody uses" (someone always uses them)
- Changing error response formats (clients parse error bodies more than you think)
The rule: if any client could observe a behavioral difference, it is a breaking change regardless of whether you think it matters.
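As a concrete illustration of the first mistake above, here is what versioning a field-type change looks like in sketch form (the handler names and routes are hypothetical):

```python
def get_order_v1(order_id):
    # Legacy contract: total serialized as a string. Frozen forever.
    return {"id": order_id, "total": "19.99"}

def get_order_v2(order_id):
    # Refactored contract: total is numeric. Only clients that opt in see it.
    return {"id": order_id, "total": 19.99}

# Both versions are served side by side; v1 is never modified in place.
ROUTES = {
    "/v1/orders": get_order_v1,
    "/v2/orders": get_order_v2,
}
```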
Verification Strategies
Running both paths is not useful without a way to compare them systematically.
| Verification method | Best for | Limitation |
|---|---|---|
| Output diffing | Deterministic transformations | Fails for time-dependent or random outputs |
| Metric comparison | Latency, error rates, throughput | Does not catch logical correctness issues |
| Checksum validation | Data migration correctness | Expensive for large datasets |
| Integration test replay | End-to-end behavior | Test suite may not cover edge cases |
| Production traffic replay | Real-world coverage | Requires sanitization of sensitive data |
I typically combine at least two of these. Metric comparison alone is insufficient because you can have identical error rates with completely different errors.
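Output diffing in particular needs one refinement to be usable: fields that legitimately differ between runs (timestamps, request IDs) must be stripped before comparison, or every request diverges. A minimal sketch, with an assumed list of volatile field names:

```python
def normalize(response, volatile_fields=("timestamp", "request_id")):
    """Strip fields that legitimately differ between runs before diffing."""
    return {k: v for k, v in response.items() if k not in volatile_fields}

def outputs_match(old_response, new_response):
    # Compare only the fields both paths are expected to agree on.
    return normalize(old_response) == normalize(new_response)
```

The volatile-field list is itself worth reviewing: every field you exclude is a field the parallel run can no longer verify.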
Handling State During Cutover
Stateless services are straightforward to refactor. Stateful systems are where things get dangerous. The key question: what happens to in-flight state during the cutover?
For queue-based systems, I drain the old consumer, verify the queue is empty, then start the new consumer. For systems with long-lived connections (WebSockets, streaming), I implement graceful handoff where new connections go to the new path while existing connections complete on the old path.
The worst case is shared mutable state (caches, session stores). Here I ensure the new path can read state written by the old path and vice versa. This usually means keeping the serialization format stable even if the internal representation changes.
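Keeping the wire format stable while the internal representation evolves can be sketched like this (the `Session` type and field names are illustrative): new internal fields simply never reach the shared store until the old path is gone.

```python
import json
from dataclasses import dataclass, field

WIRE_FORMAT_VERSION = 1  # both old and new paths read and write this format

@dataclass
class Session:
    user_id: str
    cart_items: list
    # New internal field added by the refactor; it must not leak into the
    # shared wire format while the old path still reads these records.
    experiments: dict = field(default_factory=dict)

    def to_wire(self):
        # Serialize only the fields the old path understands.
        return json.dumps({"v": WIRE_FORMAT_VERSION,
                           "user_id": self.user_id,
                           "cart_items": self.cart_items})

    @classmethod
    def from_wire(cls, raw):
        data = json.loads(raw)
        return cls(user_id=data["user_id"], cart_items=data["cart_items"])
```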
Feature Flags as Safety Valves
Every refactored path gets a kill switch. Not just an on/off toggle, but a percentage-based rollout with the ability to target specific user segments.
The flag evaluation must be fast and must not itself become a point of failure. I have seen systems where the feature flag service going down caused all flags to evaluate to their default (old path), which then overwhelmed the old path because it had been scaled down.
Design the flag fallback behavior explicitly. Document what happens when the flag service is unreachable.
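A percentage-based flag with an explicit, documented fallback can be sketched as follows. The class name and the `fetch_percentage` callable are assumptions standing in for a real flag-service client; the two properties that matter are deterministic user bucketing and deliberate behavior when the service is unreachable.

```python
import hashlib

class KillSwitchFlag:
    """Percentage rollout with deterministic bucketing and an explicit
    fallback when the flag service cannot be reached."""

    def __init__(self, name, fallback_percentage=0):
        self.name = name
        # Chosen deliberately and documented, not left implicit.
        self.fallback_percentage = fallback_percentage

    def _bucket(self, user_id):
        # Deterministic: the same user always lands in the same bucket,
        # so users do not flip between paths across requests.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def use_new_path(self, user_id, fetch_percentage):
        try:
            pct = fetch_percentage()  # e.g. a call to the flag service
        except Exception:
            pct = self.fallback_percentage  # service down: explicit behavior
        return self._bucket(user_id) < pct
```

Hashing on `name` as well as `user_id` keeps bucket assignments independent across flags, so the same users are not always the canary group for every experiment.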
When to Stop and Revert
Having clear revert criteria before starting is essential:
- Error rate exceeds baseline by more than 0.1% for 15 minutes
- P99 latency exceeds baseline by more than 20%
- Any data inconsistency detected between old and new paths
- On-call engineer cannot explain an anomaly within 30 minutes
These are not suggestions. They are hard thresholds written into the rollout plan before the first line of code is deployed.
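Because the thresholds are hard, they can be encoded rather than debated. A minimal sketch (the function and metric names are illustrative, and the 15-minute and 30-minute windows are assumed to be tracked by the caller):

```python
def should_revert(baseline, current, anomaly_minutes_unexplained=0):
    """Hard revert thresholds, checked mechanically during rollout."""
    # Error rate more than 0.1 percentage points above baseline.
    if current["error_rate"] > baseline["error_rate"] + 0.001:
        return True
    # P99 latency more than 20% above baseline.
    if current["p99_ms"] > baseline["p99_ms"] * 1.20:
        return True
    # Any data inconsistency between old and new paths.
    if current.get("data_inconsistencies", 0) > 0:
        return True
    # An anomaly the on-call engineer cannot explain within 30 minutes.
    if anomaly_minutes_unexplained >= 30:
        return True
    return False
```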
Key Takeaways
- Parallel runs catch bugs that unit tests and integration tests miss, especially around edge cases in data formatting and null handling.
- Every refactor phase should be independently deployable and independently reversible.
- Database schema changes follow the expand-migrate-switch-contract pattern to avoid downtime.
- Define revert criteria before deployment, not during an incident.
- The bake time between rollout phases is where long-tail bugs surface. Do not compress it.
Final Thoughts
A refactor that breaks users is not a refactor. It is an incident that happens to include new code. The overhead of parallel runs, phased rollouts, and verification may feel excessive, but it is a fraction of the cost of a production incident on a critical path. Build the safety infrastructure first, then refactor with confidence.