What I Look for in System Designs

Context

System design is where architecture meets reality. A good system design answers not just "how will this work?" but "how will this fail?", "how will this evolve?", and "how will we operate this?" I have reviewed system designs for new services, major refactors, and greenfield platforms. The patterns that distinguish strong designs from weak ones are consistent across all three.

What I Look For

1. Clear Data Flow

The most important diagram in any system design is the data flow diagram. Not the architecture diagram showing boxes and arrows between services, but the diagram showing how data moves through the system, where it is transformed, where it is stored, and where it is read.

Questions I ask about data flow:

Where is data written? Where is it read? Are those the same store?
How does data get from the write path to the read path? Is it synchronous or asynchronous?
What happens to in-flight data when a component fails?
Where are the points of potential data loss?

A design that cannot clearly trace data from ingestion to consumption has a fundamental clarity problem.
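
As a rough illustration, the questions above can be answered by writing the data path down as a list of hops and marking each one as synchronous or asynchronous, durable or not. The sketch below is only one way to do that; the `Hop` type, the stage names, and the rule used to flag loss points are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    """One step on the path from ingestion to consumption (hypothetical model)."""
    source: str
    target: str
    synchronous: bool  # does the sender wait for the target to acknowledge?
    durable: bool      # does the target persist data before it is processed?


def potential_loss_points(path: list[Hop]) -> list[str]:
    """Flag hops where in-flight data can vanish: asynchronous and not durable."""
    return [
        f"{hop.source} -> {hop.target}"
        for hop in path
        if not hop.synchronous and not hop.durable
    ]


# Hypothetical pipeline, used only for illustration.
path = [
    Hop("api", "primary_db", synchronous=True, durable=True),
    Hop("primary_db", "in_memory_buffer", synchronous=False, durable=False),
    Hop("in_memory_buffer", "search_index", synchronous=False, durable=True),
]

print(potential_loss_points(path))
```

Here the only flagged hop is the one into the in-memory buffer, which is exactly the kind of spot a reviewer would ask about.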

2. Explicit Consistency Model

Every system that involves more than one data store has a consistency model, whether the designers acknowledge it or not. I want to see it acknowledged explicitly.

Consistency Model	Trade-off	Appropriate When
Strong consistency	Higher latency, reduced availability during partitions	Financial transactions, inventory counts
Eventual consistency	Lower latency, higher availability	Social feeds, analytics, recommendation updates
Causal consistency	Moderate latency, preserves the order of causally related operations	Collaborative editing, messaging

A design that says "data is replicated across regions" without specifying the consistency model is incomplete. How stale can a read be? What happens during a partition? These questions must have explicit answers.
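
One way to give "how stale can a read be?" an explicit answer is a staleness bound that the read path enforces. The following is a minimal sketch under stated assumptions: replication lag is measurable per replica, and `Replica`, `choose_read_target`, and the `"primary"` name are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    name: str
    lag_seconds: float  # last measured replication lag (assumed to be available)


def choose_read_target(replicas, max_staleness_seconds, primary="primary"):
    """Return the least-lagged replica within the bound, else fall back to the primary."""
    fresh_enough = [r for r in replicas if r.lag_seconds <= max_staleness_seconds]
    if not fresh_enough:
        return primary
    return min(fresh_enough, key=lambda r: r.lag_seconds).name


replicas = [Replica("eu-replica", 2.0), Replica("us-replica", 0.5)]
print(choose_read_target(replicas, max_staleness_seconds=1.0))
print(choose_read_target(replicas, max_staleness_seconds=0.1))
```

The sketch deliberately leaves the partition question open: if the primary is unreachable and no replica is within the bound, the design still has to say whether the read fails or returns stale data.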

3. Failure Mode Analysis

Strong designs include a section that answers: "What happens when X fails?" for every significant component.

The minimum set:

What happens when the database is unreachable?
What happens when a downstream service is slow?
What happens when a downstream service returns errors?
What happens when the message queue is full?
What happens when the system runs out of memory or disk?
What happens during a network partition between data centers?

Each answer should specify the user-visible impact and the recovery mechanism. "The system degrades gracefully" is not an answer. "Read requests are served from cache (data may be up to 5 minutes stale); write requests return an error with a Retry-After header" is an answer.
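
The example answer above is specific enough to be written as code, which is a useful test of whether a failure-mode answer is really an answer. The sketch below assumes a hypothetical `Service` wrapper, an `InMemoryDb` stand-in with a switch to simulate an outage, and a 30-second retry hint; only the 5-minute staleness limit comes from the text.

```python
import time

MAX_STALENESS_SECONDS = 300  # 5 minutes, taken from the example answer
RETRY_AFTER_SECONDS = 30     # assumed value, not from the text


class DatabaseUnavailable(Exception):
    """Raised by the stand-in database while it is 'down'."""


class InMemoryDb:
    """Stand-in database with a switch to simulate an outage."""

    def __init__(self):
        self.rows = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise DatabaseUnavailable()
        return self.rows.get(key)

    def put(self, key, value):
        if not self.up:
            raise DatabaseUnavailable()
        self.rows[key] = value


class Service:
    def __init__(self, db, clock=time.monotonic):
        self.db = db
        self.clock = clock
        self.cache = {}  # key -> (value, time the value was cached)

    def read(self, key):
        try:
            value = self.db.get(key)
            self.cache[key] = (value, self.clock())
            return {"status": 200, "body": value, "stale": False}
        except DatabaseUnavailable:
            cached = self.cache.get(key)
            if cached is not None and self.clock() - cached[1] <= MAX_STALENESS_SECONDS:
                # Degraded mode: serve cached data that is within the staleness limit.
                return {"status": 200, "body": cached[0], "stale": True}
            return self._unavailable()

    def write(self, key, value):
        try:
            self.db.put(key, value)
            return {"status": 200}
        except DatabaseUnavailable:
            # Writes are never buffered; the caller is told when to retry.
            return self._unavailable()

    def _unavailable(self):
        return {"status": 503, "headers": {"Retry-After": str(RETRY_AFTER_SECONDS)}}
```

Both the user-visible impact (stale reads, rejected writes) and the recovery mechanism (retry after a delay) are visible in the return values, which is what the written answer should convey as well.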

4. Capacity Planning

A system design should include rough capacity estimates that justify the architectural choices.

What I want to see:

Expected request rate (average and peak)
Expected data volume (current and projected growth)
Expected storage requirements (with retention policy)
Expected compute requirements (CPU, memory per node)
Scaling strategy (horizontal, vertical, or both)

These do not need to be precise. Order-of-magnitude estimates are sufficient. The point is not the numbers themselves but the reasoning that connects the expected load to the architectural decisions.

If the design proposes a distributed cache but the expected data set fits in a single node's memory, that is a mismatch worth questioning.
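
The arithmetic behind such a check is short. The sketch below is a back-of-the-envelope estimate with made-up inputs; the function names, the 50% memory headroom, and every number in the example are assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def estimate(avg_rps, peak_factor, bytes_per_record, records_per_day, retention_days):
    """Return (peak requests per second, bytes stored over the retention window)."""
    peak_rps = avg_rps * peak_factor
    storage_bytes = bytes_per_record * records_per_day * retention_days
    return peak_rps, storage_bytes


def fits_on_one_node(working_set_bytes, node_memory_bytes, headroom=0.5):
    """True if the hot data set fits in the usable share of one node's memory."""
    return working_set_bytes <= node_memory_bytes * headroom


# Hypothetical workload: 200 req/s on average, 5x peaks, 1 KiB records,
# one million records per day, kept for 90 days.
peak_rps, storage_bytes = estimate(200, 5, 1024, 1_000_000, 90)
print(peak_rps, storage_bytes / GIB)

# A hypothetical 8 GiB hot set on a 64 GiB node: a distributed cache is hard to justify.
print(fits_on_one_node(8 * GIB, 64 * GIB))
```

The value is in the reasoning the numbers enable: if `fits_on_one_node` is true at projected growth, the design should explain why it still needs a distributed cache.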

5. Operational Model

How will this system be operated in production? The best system designs I have reviewed include an operational model:

Deployment: How is the system deployed? How long does a deployment take? Can it be rolled back?
Monitoring: What metrics will be collected? What alerts will be configured? What SLOs will be defined?
Debugging: How will engineers investigate issues? What tools and access will they need?
Scaling: How is capacity added or removed? Is it automatic or manual?
Data management: How is data backed up? How is it restored? What is the retention policy?

A design without an operational model is a design that will be painful to run.
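
For the monitoring question, an SLO becomes actionable once it is translated into an error budget. The sketch below shows that arithmetic for an availability target; the function names and the 30-day window are assumptions, not a prescribed method.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of unavailability allowed by an availability SLO over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Error budget left after the downtime already incurred in the window."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# A 99.9% target over 30 days allows roughly 43 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
print(round(budget_remaining(0.999, downtime_minutes=20), 1))
```

A budget of this size also puts a bound on the deployment questions: a rollback that takes longer than the remaining budget cannot be the only recovery mechanism.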

6. API Design at Boundaries

The APIs between components are the contracts that determine how tightly coupled the system is. I look for:

Versioning strategy: How will the API evolve without breaking consumers?
Error handling contract: What error codes are possible? What should the caller do for each?
Idempotency: Can the caller safely retry a failed request?
Pagination: For list endpoints, how is pagination handled?
Rate limiting: How is the API protected from abuse or accidental overload?
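
Of these, idempotency is the one most often left vague, so here is a minimal sketch of one common approach: the caller supplies an idempotency key and the server replays the stored response on a retry. `PaymentApi`, `create_charge`, and the in-memory key store are hypothetical; a production version would also need to persist and expire keys, handle concurrent requests, and reject a reused key with a different payload.

```python
class PaymentApi:
    """Hypothetical endpoint that deduplicates retries by idempotency key."""

    def __init__(self):
        self._responses = {}  # idempotency key -> response already returned
        self.charges = []     # side effects that were actually performed

    def create_charge(self, idempotency_key, amount_cents):
        # A retry with a known key returns the original response
        # without performing the side effect a second time.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]

        self.charges.append(amount_cents)
        response = {"status": 201, "charge_id": len(self.charges)}
        self._responses[idempotency_key] = response
        return response


api = PaymentApi()
first = api.create_charge("key-1", 500)
retry = api.create_charge("key-1", 500)  # e.g. the client timed out and retried
print(first == retry, len(api.charges))
```

With a contract like this, the answer to "can the caller safely retry?" is yes, provided the caller reuses the same key.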

Red Flags in System Designs

The Happy Path Design

The design describes how the system works when everything is working. No discussion of failure modes, error handling, or degraded operation. This is a design for a demo, not for production.

The Technology-First Design

The design is organized around technologies ("the Kafka layer", "the Redis layer") rather than around capabilities or data flows. This usually means the technology choices were made before the problem was fully understood.

The Missing Numbers

No capacity estimates, no latency targets, no throughput requirements. Without numbers, architectural decisions are gut feelings rather than engineering judgments.

The Single Point of Failure

A component that, if it fails, takes down the entire system. Common culprits: a single database with no replication, a single load balancer, a single configuration service.
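
The cost of a single point of failure can be estimated with simple availability arithmetic. The sketch below assumes component failures are independent, which real systems often violate (shared power, shared deploys, shared configuration), so treat the results as an upper bound; the function names are made up for the example.

```python
def series_availability(*components):
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """Availability of N interchangeable copies where any one suffices."""
    return 1 - (1 - availability) ** copies


# Two 99% components in series are worse than either one alone.
print(series_availability(0.99, 0.99))

# Two independent copies of a 99% component do much better.
print(redundant_availability(0.99, copies=2))
```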

The Distributed Monolith

Multiple services that must be deployed together, that share a database, or that cannot function independently. This has all the complexity of a distributed system with none of the benefits.

A Design Review Checklist

For my own reference and for teams I work with:

Problem statement is clear and quantified
Data flow is traceable from ingestion to consumption
Consistency model is explicit
Failure modes are enumerated with user impact and recovery
Capacity estimates justify the architectural choices
API contracts are versioned and specify error handling
Operational model covers deployment, monitoring, and debugging
Migration plan exists (if replacing an existing system)
Success criteria are defined and measurable

Key Takeaways

Data flow clarity is the most important quality of a system design. If you cannot trace data from ingestion to consumption, the design is incomplete.
Consistency models must be explicit. A design that claims "eventually consistent" must say how eventual: how stale a read can be, and what happens during a partition.
Failure mode analysis is mandatory. Every significant component needs a "what happens when this fails?" answer.
Capacity estimates justify (or invalidate) architectural choices. Without numbers, decisions are opinions.
An operational model is part of the design, not an afterthought.
Red flags include happy-path-only designs, technology-first thinking, missing numbers, single points of failure, and distributed monoliths.

Final Thoughts

A system design is a communication tool. Its purpose is to create shared understanding among the people who will build, operate, and depend on the system. The qualities I look for (data flow clarity, explicit consistency, failure analysis, capacity planning, operational modeling, and well-defined API boundaries) all serve this purpose. They force designers to make their assumptions visible and their trade-offs explicit, which is the foundation for making good engineering decisions as a team.