Designing Systems I'd Be Proud to Maintain
The design principles I follow to build systems that are not just functional but genuinely pleasant to maintain, debug, and evolve over time.
Context
Most systems are designed to work. Fewer are designed to be maintained. The difference shows up over time: the system that was elegant at launch becomes a burden within a year because it was optimized for the initial build, not for ongoing maintenance.
I have started asking a different question during design: "Would I be proud to maintain this in two years?" Not "would I be proud to present this at a conference" or "would I be proud of the architecture diagram." Would I be proud to be the person who debugs it at 2 AM, who adds a feature to it under deadline pressure, who onboards a new team member into it?
See also: Designing a Feature Flag and Remote Config System.
This question changes the design in specific, concrete ways.
Principle 1: Fewer Moving Parts
Every component in the system is a maintenance obligation. It needs monitoring, alerting, capacity planning, security patching, dependency updates, and documentation. Reducing the number of components reduces the maintenance surface area.
Before adding a component, I ask:
- Can an existing component absorb this responsibility?
- Is the problem this component solves real, or hypothetical?
- Is the operational cost of this component justified by its benefit?
A message queue that decouples two services that are deployed by the same team on the same schedule is a component that adds operational cost without meaningful benefit. Removing it simplifies the system without losing anything.
Principle 2: Consistent Patterns
A system with consistent patterns is dramatically easier to maintain than one where each component does things differently. When every service handles errors the same way, uses the same logging format, follows the same deployment process, and organizes code in the same structure, an engineer who understands one service can maintain any service.
Patterns I enforce consistently:
| Concern | Consistent Pattern |
|---|---|
| Error handling | Errors are caught at boundaries, logged with context, and returned as structured responses |
| Logging | JSON structured logs with trace ID, service name, operation, and duration |
| Configuration | Environment variables loaded at startup, validated against a schema |
| Health checks | /health endpoint that checks all dependencies |
| Metrics | RED metrics (rate, errors, duration) at every service boundary |
| Testing | Unit tests for logic, integration tests for boundaries, contract tests for APIs |
The initial investment in establishing patterns is moderate. The ongoing benefit is enormous: every new service, every new engineer, every new feature starts from a known foundation.
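As a concrete illustration of the error-handling and logging rows above, here is a minimal sketch of the "errors caught at boundaries, logged with context, returned as structured responses" pattern. All names here (LogEntry, ApiResult, atBoundary, "example-service") are illustrative, not a specific framework's API:

```kotlin
// Hypothetical names throughout — a sketch of the boundary pattern, not a library API.

data class LogEntry(
    val traceId: String,
    val service: String,
    val operation: String,
    val level: String,
    val message: String
) {
    // One JSON line per event: the same shape in every service.
    fun toJson(): String =
        """{"traceId":"$traceId","service":"$service","operation":"$operation",""" +
            """"level":"$level","message":"$message"}"""
}

data class ErrorResponse(val code: String, val message: String, val correlationId: String)

sealed interface ApiResult<out T>
data class Success<T>(val value: T) : ApiResult<T>
data class Failure(val error: ErrorResponse) : ApiResult<Nothing>

// The boundary: business logic below may throw freely; callers above only ever
// see a structured Success or Failure, and every failure is logged with context.
fun <T> atBoundary(traceId: String, operation: String, block: () -> T): ApiResult<T> =
    try {
        Success(block())
    } catch (e: Exception) {
        println(LogEntry(traceId, "example-service", operation, "ERROR", e.message ?: "unknown").toJson())
        Failure(ErrorResponse("INTERNAL_ERROR", e.message ?: "unexpected error", traceId))
    }
```

Because every service converts exceptions into the same Failure shape at the same place, an engineer debugging any service knows exactly where errors surface and what the log line looks like.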
Principle 3: No Surprises
A maintainable system behaves predictably. There are no hidden side effects, no undocumented behaviors, no implicit dependencies.
What "no surprises" means in practice:
- Functions do what their name says. sendEmail sends an email. It does not also update a database record and fire an analytics event.
- Dependencies are explicit. A service declares its dependencies in its configuration rather than discovering them at runtime.
- Configuration has sane defaults. If a configuration value is missing, the system either uses a safe default or fails to start. It does not silently use a zero value.
- State transitions are logged. When an order moves from "pending" to "processing," that transition is recorded with a timestamp and the reason.
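The "sane defaults or fail to start" rule can be sketched in a few lines. This is a hedged example, not a prescribed implementation; the variable names (DB_URL, DB_POOL_SIZE, REQUEST_TIMEOUT_MS) and the specific defaults are assumptions for illustration:

```kotlin
// Sketch: missing config either uses an explicit, documented default or stops
// startup with a clear message — never a silent zero value.

data class AppConfig(val dbUrl: String, val poolSize: Int, val requestTimeoutMs: Long)

fun loadConfig(env: Map<String, String>): AppConfig {
    // Required value: refuse to start if it is missing, and say why.
    val dbUrl = env["DB_URL"]
        ?: throw IllegalStateException("DB_URL is required but not set; refusing to start")
    // Optional values: deliberate defaults, written down in code, not accidental zeros.
    val poolSize = env["DB_POOL_SIZE"]?.toIntOrNull() ?: 20
    val timeoutMs = env["REQUEST_TIMEOUT_MS"]?.toLongOrNull() ?: 5_000L
    // Validate against a sane range so a typo fails loudly at startup, not at 2 AM.
    require(poolSize in 1..200) { "DB_POOL_SIZE=$poolSize outside sane range 1..200" }
    return AppConfig(dbUrl, poolSize, timeoutMs)
}
```

The point is that every configuration decision is visible in one place, and a misconfigured deployment dies immediately instead of limping along with surprising behavior.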
Principle 4: Self-Diagnosing
A system I would be proud to maintain can tell me what is wrong with it. Not through a wall of logs that I need to parse, but through specific, actionable signals.
Self-diagnosing capabilities:
- Health endpoints that explain their status. Not just "unhealthy" but "unhealthy: database connection pool exhausted, 0 of 20 connections available."
- Metrics that capture business outcomes, not just technical indicators. "Orders per minute" is more useful than "requests per second" for understanding whether the system is working correctly.
- Alerts with runbook links. Every alert includes a link to a runbook that tells the on-call engineer what to check and what to do.
- Structured error responses. Errors include an error code, a human-readable message, and a correlation ID for tracing.
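A health endpoint that explains its status, in the spirit of the "0 of 20 connections available" example above, can be sketched like this. DependencyCheck and Health are hypothetical types, not a monitoring library's API:

```kotlin
// Sketch: the health payload names each failing dependency and says why it is
// failing, so the on-call engineer starts with a diagnosis, not a log hunt.

data class DependencyCheck(val name: String, val healthy: Boolean, val detail: String)
data class Health(val status: String, val checks: List<DependencyCheck>)

fun health(checks: List<DependencyCheck>): Health {
    val failing = checks.filter { !it.healthy }
    return if (failing.isEmpty()) {
        Health("healthy", checks)
    } else {
        // "unhealthy: db-pool: 0 of 20 connections available" — specific and actionable.
        Health("unhealthy: " + failing.joinToString("; ") { "${it.name}: ${it.detail}" }, checks)
    }
}
```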
Principle 5: Graceful Degradation
A system I would be proud to maintain does not collapse when a dependency fails. It degrades gracefully, continuing to serve what it can while clearly communicating what it cannot.
Degradation strategies:
- Serve stale data when fresh data is unavailable. With a clear indicator that the data may not be current.
- Disable non-critical features when their dependencies are down. The checkout flow works even if the recommendation engine is offline.
- Queue work for later when a downstream service is unavailable. Process the backlog when the dependency recovers.
- Return partial results with a clear indication. "Showing 8 of 10 results. Some sources are temporarily unavailable."
The key is that degradation is designed, not accidental. The system knows its degradation modes and handles them explicitly.
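The first strategy, serving stale data with a clear indicator, can be sketched as follows. The function shape and the staleness budget are assumptions for illustration, not a specific caching library:

```kotlin
// Sketch: designed degradation for reads. Fall back to the last known good
// value, but only within an explicit staleness budget, and tell the caller
// the data is stale so the UI can show an indicator.

data class CachedValue<T>(val value: T, val ageMs: Long)
data class ReadResult<T>(val value: T, val stale: Boolean)

fun <T> readWithFallback(
    fetchFresh: () -> T,          // may throw when the dependency is down
    cached: CachedValue<T>?,      // last known good value, if any
    maxStaleMs: Long = 60_000
): ReadResult<T>? =
    try {
        ReadResult(fetchFresh(), stale = false)
    } catch (e: Exception) {
        if (cached != null && cached.ageMs <= maxStaleMs) {
            ReadResult(cached.value, stale = true)
        } else {
            null // nothing safe to serve; the caller degrades further (e.g. hides the widget)
        }
    }
```

Returning null here is a deliberate signal, not a failure mode: the caller knows this degradation path exists and decides what "degrade further" means for its feature.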
Principle 6: Incremental Evolution
A system I would be proud to maintain can be changed incrementally. No big-bang rewrites. No multi-month migration projects. Small, safe, reversible changes that can be deployed and validated independently.
Design choices that enable incremental evolution:
- Feature flags for gradual rollout of new behavior
- Backward-compatible APIs that allow old and new clients to coexist
- Database schema changes using expand-then-contract (add the new column, migrate data, remove the old column)
- Strangler fig pattern for replacing legacy components: route traffic to the new implementation gradually while the old one still runs
- Versioned data formats that can be read by both old and new code
Principle 7: Clear Ownership
A system I would be proud to maintain has clear ownership at every level: who owns this service, who owns this data, who owns this operational runbook, who gets paged when this breaks.
Unclear ownership leads to:
- Nobody updating the dependency versions because it is "not my responsibility"
- Alert fatigue because nobody knows which team should respond
- Technical debt accumulating because there is no owner to prioritize its repayment
- Knowledge silos where only one person knows how a component works
Clear ownership means every component has a team name on it, every alert has a routing rule, and every runbook has a last-reviewed date.
The Maintenance Litmus Test
Before finalizing a design, I run through this checklist:
- Can a new engineer understand this component in a day?
- Can an on-call engineer diagnose an issue in this component within 15 minutes?
- Can a developer add a typical feature to this component in less than a week?
- Can the system lose any single dependency and continue serving users (with degradation)?
- Can a deployment be rolled back in under a minute?
- Is every component owned by a specific team?
If the answer to any of these is "no," the design has maintenance problems that will compound over time.
Key Takeaways
- Design for maintenance, not just for launch. The initial build is a small fraction of the system's lifetime.
- Fewer components means fewer maintenance obligations. Remove any component whose operational cost exceeds its benefit.
- Consistent patterns across services make the entire system learnable from understanding one service.
- Self-diagnosing systems reduce incident resolution time. Health checks should explain what is wrong, not just that something is wrong.
- Graceful degradation is designed, not accidental. Know your degradation modes and handle them explicitly.
- Enable incremental evolution. Big-bang changes are risky. Small, reversible changes are safe.
- Clear ownership prevents the accumulation of orphaned components and unaddressed technical debt.
Related: Engineering Decisions That Reduce Pager Fatigue.
Further Reading
- Designing Systems That Degrade Gracefully: How to build systems that continue providing value when components fail, covering load shedding, fallback strategies, and partial availability.
- Designing Systems That Fail Loudly: Why silent failures are more dangerous than crashes, and how to design systems that surface problems immediately rather than hiding them.
- Building Systems That Can Be Explained Simply: Why the ability to explain a system in simple terms is a design constraint, not a communication skill.
Final Thoughts
Pride in maintenance is a different kind of engineering pride than pride in creation. Creating something new is exciting. Maintaining something well, keeping it healthy, evolving it safely, operating it reliably, is quieter work. But it is the work that determines whether the system serves its users well over its lifetime. The systems I am most proud of are not the most architecturally ambitious ones. They are the ones that are still running, still maintainable, and still a pleasure to work on, years after they were built.