Refactoring a System Without Breaking Users

Dhruval Dhameliya·January 1, 2026·6 min read

Strategies for large-scale refactors that keep production stable, covering parallel runs, feature flags, gradual migrations, and verification techniques.

The hardest refactors are the ones where the system cannot go down. No maintenance window. No "please refresh your browser." Users keep hitting the old paths while you rebuild underneath them. This is how I approach those situations.

The Parallel Run Pattern

Before cutting over any critical path, I run the old and new implementations side by side. Both execute. Only the old one returns results to the user. The new one logs its output for comparison.

Request --> Old Path (returns response)
        \-> New Path (logs result, discarded)

Compare outputs offline. Fix divergences. Repeat.

This is expensive in compute but cheap in risk. I have caught subtle bugs in "equivalent" reimplementations that would have been production incidents. Date formatting differences, null handling edge cases, floating point rounding in financial calculations. The parallel run catches all of it.
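The pattern above can be sketched in a few lines. This is a minimal illustration, not a production harness; `old_path` and `new_path` are hypothetical callables standing in for the two implementations, and divergences are simply logged for offline review.

```python
import logging

logger = logging.getLogger("parallel_run")

def handle_request(request, old_path, new_path):
    """Serve from the old path; shadow-execute the new path for comparison."""
    response = old_path(request)       # users only ever see this result
    try:
        shadow = new_path(request)     # executed, never served
        if shadow != response:
            logger.warning("divergence for %r: old=%r new=%r",
                           request, response, shadow)
    except Exception:
        # A crash in the new path must never affect the user.
        logger.exception("new path failed for %r", request)
    return response
```

The one non-negotiable property is visible in the `except` block: the shadow path can be arbitrarily broken and the user still gets the old path's answer.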

Migration Phases

Every large refactor follows the same phased approach:

Phase 1: Introduce the new path behind a flag. No traffic goes to it. It exists, it compiles, it has tests. That is all.

Phase 2: Shadow traffic. A percentage of requests also execute against the new path. Results are compared but not served.

Phase 3: Canary. A small percentage of real traffic (1-5%) uses the new path for actual responses. Metrics are compared against the old path.

Phase 4: Gradual rollout. Increase the percentage in stages: 10%, 25%, 50%, 100%. Each stage runs for at least a full business cycle (typically one week).

Phase 5: Remove the old path. Only after the new path has been at 100% for a sufficient bake period.

The temptation is always to compress these phases. Resist it. The bake time between phases is where you discover the long-tail issues.
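One way to make the phases concrete is to encode them as data and derive per-request routing from a stable user-ID hash, so a given user stays in the same cohort across requests. The phase table and percentages below are illustrative, not a prescribed config format.

```python
import hashlib

# Hypothetical encoding of the phases: (serve_new_pct, shadow_pct).
PHASES = {
    1: (0, 0),     # exists behind a flag, no traffic
    2: (0, 10),    # shadow traffic: 10% of requests also hit the new path
    3: (5, 0),     # canary: 5% of users served from the new path
    4: (50, 0),    # one stage of the gradual rollout
    5: (100, 0),   # fully cut over; old path awaits removal
}

def bucket(user_id: str) -> int:
    """Stable 0-99 bucket so cohort membership survives across requests."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def routing(user_id: str, phase: int) -> str:
    serve_pct, shadow_pct = PHASES[phase]
    b = bucket(user_id)
    if b < serve_pct:
        return "serve_new"
    if b < serve_pct + shadow_pct:
        return "shadow_new"
    return "old_only"
```

Hashing rather than random sampling matters: a user who saw the new path at 10% still sees it at 25%, so behavioral changes are consistent per user.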

Database Migration Without Downtime

Schema changes during a refactor are the highest-risk operations. The approach I follow:

  1. Expand: Add new columns or tables alongside existing ones. Application writes to both.
  2. Migrate: Backfill historical data. Verify counts and checksums.
  3. Switch reads: Point queries at the new schema. Old schema still receives writes.
  4. Contract: Remove writes to old schema. Drop old columns after a retention period.

Each step is independently deployable and independently reversible. If step 3 reveals a problem, you revert reads without touching writes.
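A minimal sketch of the expand and switch-reads steps, with in-memory dicts standing in for the two schemas (the `email_verified` column is a made-up example of the expanded shape):

```python
class UserStore:
    """Hypothetical dual-write layer used during the expand phase."""

    def __init__(self, read_from_new: bool = False):
        self.old = {}                      # stands in for the legacy table
        self.new = {}                      # stands in for the expanded table
        self.read_from_new = read_from_new

    def write(self, user_id, email):
        # Expand phase: every write lands in both schemas.
        self.old[user_id] = {"email": email}
        self.new[user_id] = {"email": email, "email_verified": False}

    def read(self, user_id):
        # Switch-reads phase: flip `read_from_new`. Writes are untouched,
        # so reverting reads is a one-line rollback.
        source = self.new if self.read_from_new else self.old
        return source[user_id]
```

Because writes always hit both schemas, flipping `read_from_new` back and forth never loses data; that is what makes step 3 independently reversible.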

API Versioning During Refactors

When a refactor changes API behavior, even subtly, I version the endpoint. The old version stays exactly as-is. The new version carries the refactored behavior.

Common mistakes I have seen:

  • Changing response field types (string to integer) without versioning
  • Removing fields that "nobody uses" (someone always uses them)
  • Changing error response formats (clients parse error bodies more than you think)

The rule: if any client could observe a behavioral difference, it is a breaking change regardless of whether you think it matters.
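As a toy illustration of the versioning rule, here are two handlers for a hypothetical orders endpoint: v1 keeps the legacy string-typed `total` exactly as-is, while v2 carries the refactored integer representation. Neither handler touches the other.

```python
def get_order_v1(order_id):
    # Legacy behavior, frozen: total is a string, as clients already parse it.
    return {"id": order_id, "total": "19.99"}

def get_order_v2(order_id):
    # Refactored behavior lives only behind the new version.
    return {"id": order_id, "total_cents": 1999}

ROUTES = {
    "/v1/orders": get_order_v1,
    "/v2/orders": get_order_v2,
}

def dispatch(path, order_id):
    return ROUTES[path](order_id)
```

The string-to-integer change here is exactly the first "common mistake" above; versioning turns it from a silent breakage into an opt-in migration.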

Verification Strategies

Running both paths is not useful without a way to compare them systematically.

Verification method       | Best for                         | Limitation
--------------------------|----------------------------------|-------------------------------------------
Output diffing            | Deterministic transformations    | Fails for time-dependent or random outputs
Metric comparison         | Latency, error rates, throughput | Does not catch logical correctness issues
Checksum validation       | Data migration correctness       | Expensive for large datasets
Integration test replay   | End-to-end behavior              | Test suite may not cover edge cases
Production traffic replay | Real-world coverage              | Requires sanitization of sensitive data

I typically combine at least two of these. Metric comparison alone is insufficient because you can have identical error rates with completely different errors.
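Output diffing, the first row of the table, has one practical wrinkle worth showing: fields that legitimately vary (timestamps, request IDs) must be masked before comparison or every request diverges. A sketch, with the volatile field names chosen for illustration:

```python
# Fields known to legitimately differ between the two paths.
VOLATILE_FIELDS = {"timestamp", "request_id"}

def normalize(payload: dict) -> dict:
    return {k: v for k, v in payload.items() if k not in VOLATILE_FIELDS}

def diff(old: dict, new: dict) -> list:
    """Return the keys whose normalized values disagree."""
    o, n = normalize(old), normalize(new)
    return sorted(k for k in set(o) | set(n) if o.get(k) != n.get(k))
```

An empty diff means the paths agree on everything that matters; a non-empty one names exactly which fields to investigate.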

Handling State During Cutover

Stateless services are straightforward to refactor. Stateful systems are where things get dangerous. The key question: what happens to in-flight state during the cutover?

For queue-based systems, I drain the old consumer, verify the queue is empty, then start the new consumer. For systems with long-lived connections (WebSockets, streaming), I implement graceful handoff where new connections go to the new path while existing connections complete on the old path.

The worst case is shared mutable state (caches, session stores). Here I ensure the new path can read state written by the old path and vice versa. This usually means keeping the serialization format stable even if the internal representation changes.
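The queue-drain cutover described above reduces to a few lines once producers are paused. This sketch assumes a single-process `queue.Queue` for clarity; a real broker would need its own drain-verification step.

```python
import queue

def drain_and_switch(q: queue.Queue, old_consumer, new_consumer):
    """Cut over a queue-based system: producers are already paused, the
    old consumer drains everything in flight, and only then does the
    new consumer take over. Nothing is processed twice or dropped."""
    while not q.empty():
        old_consumer(q.get())      # old path finishes all in-flight work
    # Queue verified empty; it is now safe to start the new consumer.
    return new_consumer
```

The ordering guarantee is the whole point: no message is ever visible to both consumers, so the two implementations never race on the same work item.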

Feature Flags as Safety Valves

Every refactored path gets a kill switch. Not just an on/off toggle, but a percentage-based rollout with the ability to target specific user segments.

The flag evaluation must be fast and must not itself become a point of failure. I have seen systems where the feature flag service going down caused all flags to evaluate to their default (old path), which then overwhelmed the old path because it had been scaled down.

Design the flag fallback behavior explicitly. Document what happens when the flag service is unreachable.
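One explicit fallback design, sketched below: cache the last-known-good rollout percentage so an unreachable flag service holds traffic where it is, instead of stampeding everyone back to a scaled-down old path. `FlagClient` and its `fetch` callable are hypothetical names, not a real flag SDK.

```python
class FlagClient:
    """Flag evaluation with a documented, explicit fallback."""

    def __init__(self, fetch, fallback_pct: int):
        self.fetch = fetch                 # callable hitting the flag service
        self.cached_pct = fallback_pct     # last-known-good rollout percentage

    def rollout_pct(self) -> int:
        try:
            self.cached_pct = self.fetch()
        except Exception:
            # Flag service down: hold at the cached value rather than
            # snapping back to 0% and overwhelming the old path.
            pass
        return self.cached_pct
```

The design choice is the `except` branch: failure keeps the world as it was, and the fallback behavior is written down in code rather than discovered during an outage.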

See also: How I'd Design a Mobile Configuration System at Scale.

When to Stop and Revert

Having clear revert criteria before starting is essential:

  • Error rate exceeds baseline by more than 0.1 percentage points for 15 minutes
  • P99 latency exceeds baseline by more than 20%
  • Any data inconsistency detected between old and new paths
  • On-call engineer cannot explain an anomaly within 30 minutes

These are not suggestions. They are hard thresholds written into the rollout plan before the first line of code is deployed.
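Hard thresholds belong in tooling, not in an engineer's judgment at 3 a.m. A sketch encoding the first three criteria (the 15-minute sustain window and the on-call criterion are omitted for brevity; the `metrics` dict is a hypothetical feed from the monitoring system):

```python
def should_revert(metrics: dict, baseline: dict) -> bool:
    """Return True if any hard revert threshold is breached."""
    if metrics["error_rate"] > baseline["error_rate"] + 0.001:   # +0.1 pct points
        return True
    if metrics["p99_ms"] > baseline["p99_ms"] * 1.20:            # +20% latency
        return True
    if metrics.get("data_inconsistencies", 0) > 0:               # any at all
        return True
    return False
```

Note that the data-inconsistency threshold is zero: a single divergence between old and new paths is grounds to stop.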

Key Takeaways

  • Parallel runs catch bugs that unit tests and integration tests miss, especially around edge cases in data formatting and null handling.
  • Every refactor phase should be independently deployable and independently reversible.
  • Database schema changes follow the expand-migrate-switch-contract pattern to avoid downtime.
  • Define revert criteria before deployment, not during an incident.
  • The bake time between rollout phases is where long-tail bugs surface. Do not compress it.




Final Thoughts

A refactor that breaks users is not a refactor. It is an incident that happens to include new code. The overhead of parallel runs, phased rollouts, and verification may feel excessive, but it is a fraction of the cost of a production incident on a critical path. Build the safety infrastructure first, then refactor with confidence.
