What I Look for in System Designs
The specific qualities and patterns I look for when reviewing system designs, from data flow clarity to failure mode analysis, and the common mistakes that signal deeper problems.
Context
System design is where architecture meets reality. A good system design answers not just "how will this work?" but "how will this fail?", "how will this evolve?", and "how will we operate this?" I have reviewed system designs for new services, major refactors, and greenfield platforms. The patterns that distinguish strong designs from weak ones are consistent.
What I Look For
1. Clear Data Flow
The most important diagram in any system design is the data flow diagram. Not the architecture diagram showing boxes and arrows between services, but the diagram showing how data moves through the system, where it is transformed, where it is stored, and where it is read.
Questions I ask about data flow:
- Where is data written? Where is it read? Are those the same store?
- How does data get from the write path to the read path? Is it synchronous or asynchronous?
- What happens to in-flight data when a component fails?
- Where are the points of potential data loss?
A design that cannot clearly trace data from ingestion to consumption has a fundamental clarity problem.
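One way to make these questions concrete is to write the data flow down as a list of hops and check each hop mechanically. The sketch below is illustrative only: the `Hop` type, its fields, and the component names are hypothetical, and the rule it applies (an asynchronous hop without a durable buffer can lose in-flight data) is a simplifying assumption rather than a complete analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One step in a data flow: data moves from `source` to `sink`."""
    source: str
    sink: str
    synchronous: bool      # does the writer wait for the sink to acknowledge?
    durable_buffer: bool   # is in-flight data persisted if a component dies?


def potential_loss_points(hops):
    """Return hops where in-flight data can vanish when a component fails.

    Assumption for this sketch: a synchronous hop surfaces the failure to
    the caller, who can retry, so only asynchronous hops without a durable
    buffer are flagged.
    """
    return [h for h in hops if not h.synchronous and not h.durable_buffer]


# Hypothetical flow: write path into a database, read path via a search index.
flow = [
    Hop("api", "orders_db", synchronous=True, durable_buffer=False),
    Hop("orders_db", "event_bus", synchronous=False, durable_buffer=True),
    Hop("event_bus", "search_indexer", synchronous=False, durable_buffer=False),
]

risky = potential_loss_points(flow)
```

Here the only flagged hop is the one from `event_bus` to `search_indexer`, which is exactly the kind of spot a reviewer would ask about.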
2. Explicit Consistency Model
Every system that involves more than one data store has a consistency model, whether the designers acknowledge it or not. I want to see it acknowledged explicitly.
| Consistency Model | Trade-off | Appropriate When |
|---|---|---|
| Strong consistency | Higher latency, lower availability | Financial transactions, inventory counts |
| Eventual consistency | Lower latency, higher availability | Social feeds, analytics, recommendation updates |
| Causal consistency | Moderate latency; causally related writes are seen in order, concurrent writes may not be | Collaborative editing, messaging |
A design that says "data is replicated across regions" without specifying the consistency model is incomplete. How stale can a read be? What happens during a partition? These questions must have explicit answers.
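An explicit answer to "how stale can a read be?" can be as small as one number that the read path enforces. The following is a minimal sketch under stated assumptions: `Replica`, `read_with_staleness_bound`, and the region names are hypothetical, replicas are assumed to report when they last applied a write, and all timestamps are assumed to come from one clock.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """A hypothetical read replica that tracks when it last applied a write."""
    name: str
    last_applied_at: float  # seconds, on the same clock as `now`
    data: dict


def read_with_staleness_bound(key, replicas, primary, max_staleness_s, now):
    """Return (value, source), using the first replica within the bound.

    If no replica is fresh enough, fall back to the primary. That trades
    latency for freshness instead of silently serving stale data.
    """
    for replica in replicas:
        lag = now - replica.last_applied_at
        if lag <= max_staleness_s:
            return replica.data.get(key), replica.name
    return primary.get(key), "primary"


primary = {"stock": 7}
replicas = [
    Replica("eu-west", last_applied_at=100.0, data={"stock": 9}),  # 60 s behind
    Replica("us-east", last_applied_at=158.0, data={"stock": 7}),  # 2 s behind
]

value, source = read_with_staleness_bound("stock", replicas, primary, 5.0, now=160.0)
```

With a 5 second bound the read is served by `us-east`. Tightening the bound to 1 second sends it to the primary. What the read path should do when the primary is unreachable during a partition is the second question the design still has to answer.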
3. Failure Mode Analysis
See also: Failure Modes I Actively Design For.
Strong designs include a section that answers: "What happens when X fails?" for every significant component.
The minimum set:
- What happens when the database is unreachable?
- What happens when a downstream service is slow?
- What happens when a downstream service returns errors?
- What happens when the message queue is full?
- What happens when the system runs out of memory or disk?
- What happens during a network partition between data centers?
Each answer should specify the user-visible impact and the recovery mechanism. "The system degrades gracefully" is not an answer. "Read requests are served from cache (data may be up to 5 minutes stale), write requests return an error with a retry-after header" is an answer.
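The concrete answer quoted above translates almost directly into code. This is a sketch, not a production handler: `handle_read`, `handle_write`, `DatabaseUnavailable`, and the `X-Data-Age-Seconds` header are hypothetical names, the 30 second retry hint is an assumed value, and a plain dict stands in for the cache. `Retry-After` is the standard HTTP header.

```python
import time
from dataclasses import dataclass, field

MAX_STALENESS_S = 300  # "up to 5 minutes stale", from the example above
RETRY_AFTER_S = 30     # assumed value, chosen only for illustration


class DatabaseUnavailable(Exception):
    """Raised by the (hypothetical) database client when it cannot connect."""


@dataclass
class Response:
    status: int
    body: object = None
    headers: dict = field(default_factory=dict)


def handle_read(key, db_get, cache, now=None):
    """Serve from the database; on outage, serve bounded-stale cached data."""
    now = time.time() if now is None else now
    try:
        value = db_get(key)
        cache[key] = (value, now)  # remember the value and when we saw it
        return Response(200, value)
    except DatabaseUnavailable:
        cached = cache.get(key)
        if cached is not None and now - cached[1] <= MAX_STALENESS_S:
            value, stored_at = cached
            age = str(int(now - stored_at))
            return Response(200, value, {"X-Data-Age-Seconds": age})
        return Response(503, headers={"Retry-After": str(RETRY_AFTER_S)})


def handle_write(key, value, db_put):
    """Writes are never faked: on outage, fail fast and say when to retry."""
    try:
        db_put(key, value)
        return Response(204)
    except DatabaseUnavailable:
        return Response(503, headers={"Retry-After": str(RETRY_AFTER_S)})
```

Both the user-visible impact (stale reads, rejected writes) and the recovery mechanism (client retry after a stated delay) are visible in the code, which is what the written answer in a design document should also make visible.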
4. Capacity Planning
A system design should include rough capacity estimates that justify the architectural choices.
What I want to see:
- Expected request rate (average and peak)
- Expected data volume (current and projected growth)
- Expected storage requirements (with retention policy)
- Expected compute requirements (CPU, memory per node)
- Scaling strategy (horizontal, vertical, or both)
These do not need to be precise. Order-of-magnitude estimates are sufficient. The point is not the numbers themselves but the reasoning that connects the expected load to the architectural decisions.
If the design proposes a distributed cache but the expected data set fits in a single node's memory, that is a mismatch worth questioning.
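The arithmetic behind such an estimate fits in a few lines. The sketch below uses made-up inputs purely for illustration. The function names, the 50% memory headroom, and every number in the example are assumptions, not recommendations.

```python
SECONDS_PER_DAY = 86_400


def estimate_capacity(daily_requests, peak_to_avg, bytes_per_record,
                      records_per_day, retention_days, replication_factor):
    """Order-of-magnitude request rate and storage for a service."""
    avg_rps = daily_requests / SECONDS_PER_DAY
    peak_rps = avg_rps * peak_to_avg
    storage_bytes = (bytes_per_record * records_per_day
                     * retention_days * replication_factor)
    return {
        "avg_rps": avg_rps,
        "peak_rps": peak_rps,
        "storage_gb": storage_bytes / 1e9,
    }


def fits_on_one_node(working_set_gb, node_memory_gb, headroom=0.5):
    """True if the working set fits in the usable share of one node's memory."""
    return working_set_gb <= node_memory_gb * headroom


# Illustrative inputs only: 86.4M requests/day, 3x peak, 1 KB records,
# 10M records/day kept for 90 days, replicated 3 times.
estimate = estimate_capacity(
    daily_requests=86_400_000,
    peak_to_avg=3,
    bytes_per_record=1_000,
    records_per_day=10_000_000,
    retention_days=90,
    replication_factor=3,
)
cache_fits = fits_on_one_node(working_set_gb=20, node_memory_gb=64)
```

With these inputs the estimate is 1,000 requests per second on average, 3,000 at peak, and 2,700 GB of storage, and a 20 GB working set fits on a single 64 GB node. A design proposing a distributed cache for that working set would need to justify it on other grounds, such as availability.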
5. Operational Model
How will this system be operated in production? The best system designs I have reviewed include an operational model:
- Deployment: How is the system deployed? How long does a deployment take? Can it be rolled back?
- Monitoring: What metrics will be collected? What alerts will be configured? What SLOs will be defined?
- Debugging: How will engineers investigate issues? What tools and access will they need?
- Scaling: How is capacity added or removed? Is it automatic or manual?
- Data management: How is data backed up? How is it restored? What is the retention policy?
A design without an operational model is a design that will be painful to run.
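For the SLO question in particular, it helps to convert the target into the amount of failure it permits, because that number drives alerting and rollback decisions. A minimal sketch of that arithmetic, with hypothetical function names and an availability SLO assumed to be measured in wall-clock minutes:

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget to spend")
    return (budget - downtime_minutes) / budget


monthly_budget = error_budget_minutes(slo_percent=99.9, window_days=30)
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime. If a single deployment can take longer than that to roll back, the deployment and SLO answers in the operational model contradict each other.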
6. API Design at Boundaries
The APIs between components are the contracts that determine how tightly coupled the system is. I look for:
- Versioning strategy: How will the API evolve without breaking consumers?
- Error handling contract: What error codes are possible? What should the caller do for each?
- Idempotency: Can the caller safely retry a failed request?
- Pagination: For list endpoints, how is pagination handled?
- Rate limiting: How is the API protected from abuse or accidental overload?
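Of these, idempotency is the easiest to show in a few lines. The sketch below assumes the caller supplies an idempotency key with each request. `PaymentService` and its fields are hypothetical, and the in-memory dict stands in for a durable store with an atomic check-and-insert, which a real implementation would need.

```python
class PaymentService:
    """Hypothetical service illustrating idempotent retries."""

    def __init__(self):
        self._results = {}  # idempotency key -> response already returned
        self.charges = []   # side effects actually performed

    def charge(self, idempotency_key, account, amount_cents):
        """Charge once per key; a retry with the same key replays the result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        # Not thread-safe: this check-then-act is only safe in a single
        # thread. It is kept simple for illustration.
        self.charges.append((account, amount_cents))
        result = {"status": "charged", "charge_id": len(self.charges)}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("req-123", account="acct-1", amount_cents=500)
retry = service.charge("req-123", account="acct-1", amount_cents=500)
```

The retry returns the same response as the first call and the account is charged once, so a caller that timed out can safely resend the request.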
Red Flags in System Designs
The Happy Path Design
The design describes how the system works when everything is working. No discussion of failure modes, error handling, or degraded operation. This is a design for a demo, not for production.
The Technology-First Design
The design is organized around technologies ("the Kafka layer", "the Redis layer") rather than around capabilities or data flows. This usually means the technology choices were made before the problem was fully understood.
The Missing Numbers
No capacity estimates, no latency targets, no throughput requirements. Without numbers, architectural decisions are gut feelings rather than engineering judgments.
The Single Point of Failure
A component that, if it fails, takes down the entire system. Common culprits: a single database with no replication, a single load balancer, a single configuration service.
The Distributed Monolith
Multiple services that must be deployed together, that share a database, or that cannot function independently. This has all the complexity of a distributed system with none of the benefits.
A Design Review Checklist
For my own reference and for teams I work with:
- Problem statement is clear and quantified
- Data flow is traceable from ingestion to consumption
- Consistency model is explicit
- Failure modes are enumerated with user impact and recovery
- Capacity estimates justify the architectural choices
- API contracts are versioned and specify error handling
- Operational model covers deployment, monitoring, and debugging
- Migration plan exists (if replacing an existing system)
- Success criteria are defined and measurable
Key Takeaways
- Data flow clarity is the most important quality of a system design. If you cannot trace data from ingestion to consumption, the design is incomplete.
- Consistency models must be explicit. A design that claims eventual consistency must also say how eventual: the maximum staleness a read can observe.
- Failure mode analysis is mandatory. Every significant component needs a "what happens when this fails?" answer.
- Capacity estimates justify (or invalidate) architectural choices. Without numbers, decisions are opinions.
- An operational model is part of the design, not an afterthought.
- Red flags include happy-path-only designs, technology-first thinking, missing numbers, single points of failure, and distributed monoliths.
Related: How I Think About System Boundaries.
Further Reading
- How I'd Design a Scalable Notification System: System design for a multi-channel notification system covering delivery guarantees, rate limiting, user preferences, and failure handling...
- Designing a Feature Flag and Remote Config System: Architecture and trade-offs for building a feature flag and remote configuration system that handles targeting, rollout, and consistency ...
- Event Tracking System Design for Android Applications: A systems-level breakdown of designing an event tracking system for Android, covering batching, schema enforcement, local persistence, an...
Final Thoughts
A system design is a communication tool. Its purpose is to create shared understanding among the people who will build, operate, and depend on the system. The qualities I look for (data flow clarity, explicit consistency, failure analysis, capacity planning, and operational modeling) all serve this purpose. They force the designer to make assumptions visible and trade-offs explicit, which is the foundation for making good engineering decisions as a team.