Designing Systems for Humans, Not Just Machines
Why the human factors in system design, including cognitive load, operational ergonomics, and team structure, matter as much as the technical architecture.
Context
Most system design discussions focus on machines: how many requests per second the system can handle, how data is partitioned, how services communicate. These are important. But the system also has human operators, developers who extend it, on-call engineers who debug it, and product managers who need to understand its capabilities.
The best-architected system in the world is a liability if the team cannot operate it, cannot debug it, or cannot extend it without introducing regressions. Human factors are not soft concerns. They are engineering constraints as real as throughput and latency.
Human Constraint 1: Cognitive Load
Every system has a cognitive load: the amount of context an engineer must hold in their head to work effectively. This includes:
- The data model and its invariants
- The service interactions and their failure modes
- The deployment pipeline and its quirks
- The monitoring and alerting configuration
- The historical context (why decisions were made)
As a system grows, its cognitive load grows. At some point, no single person can hold the entire system in their head. This is not a failure of the engineers. It is a failure of the design.
Design responses to cognitive load:
- Clear boundaries: Each component should be understandable in isolation, without requiring knowledge of the entire system.
- Local reasoning: An engineer should be able to understand a component's behavior by reading its code, without needing to trace through five other services.
- Consistent patterns: When every service handles errors the same way, uses the same logging format, and follows the same deployment process, an engineer who understands one service can work on any service.
- Documentation at the right level: Not comprehensive documentation of everything, but clear documentation of the non-obvious: why a particular approach was chosen, what trade-offs were made, and where the dragons live.
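The "consistent patterns" point can be made concrete. A minimal sketch (names and error shapes are illustrative, not from the original): if every service wraps its handlers in the same error-handling and structured-logging pattern, an engineer who has read one service can reason locally about any of them.

```python
import json
import logging
import traceback

logger = logging.getLogger("service")

def handle_request(handler, request):
    """Run a request handler with the error-handling pattern every
    service shares: log a structured record, return a uniform shape."""
    try:
        return {"ok": True, "data": handler(request)}
    except KeyError as exc:
        # Client error: the request was missing a required field.
        logger.warning(json.dumps({"event": "bad_request", "missing": str(exc)}))
        return {"ok": False, "error": "bad_request", "detail": f"missing field {exc}"}
    except Exception:
        # Server error: log the traceback, return an opaque error to the caller.
        logger.error(json.dumps({"event": "internal_error", "trace": traceback.format_exc()}))
        return {"ok": False, "error": "internal_error"}
```

Because the shape of success and failure is identical everywhere, dashboards, log queries, and on-call intuition transfer between services for free.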
Human Constraint 2: Operational Ergonomics
The experience of operating a system (deploying it, monitoring it, debugging it, and recovering from failures) is a design surface that deserves as much attention as the API surface.
Good operational ergonomics:
| Aspect | Poor Ergonomics | Good Ergonomics |
|---|---|---|
| Deployment | Manual steps, wiki-based procedures | One-command deployment with automatic rollback |
| Monitoring | 50 dashboards, 200 alert rules | 5 dashboards, SLO-based alerts |
| Debugging | SSH into nodes, grep logs manually | Centralized search, trace-based investigation |
| Recovery | Undocumented, improvised | Runbooks with decision trees, automated remediation |
| Configuration | Scattered across files, databases, and environment variables | Centralized, versioned, validated |
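The configuration row rewards a closer look. One way to get "centralized, versioned, validated" is a single typed config object that refuses to construct from bad input, so misconfiguration fails loudly at startup rather than mysteriously in production. A sketch under assumed field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ServiceConfig:
    """One typed, validated config object instead of values scattered
    across files, databases, and environment variables."""
    db_url: str
    max_connections: int
    request_timeout_s: float

    def __post_init__(self):
        if not self.db_url.startswith(("postgres://", "mysql://")):
            raise ValueError(f"unsupported db_url scheme: {self.db_url}")
        if self.max_connections < 1:
            raise ValueError("max_connections must be positive")
        if self.request_timeout_s <= 0:
            raise ValueError("request_timeout_s must be positive")

def load_config(raw: dict) -> ServiceConfig:
    # Validate once at startup with a clear error,
    # not at 3 AM with a confusing one.
    return ServiceConfig(
        db_url=raw["db_url"],
        max_connections=int(raw["max_connections"]),
        request_timeout_s=float(raw["request_timeout_s"]),
    )
```

Checking the raw values into version control (minus secrets) gives you the "versioned" part: every config change has an author, a diff, and a rollback path.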
The team that spends 30% of its time on operational toil is not doing engineering. It is fighting the system. Every hour spent on operational ergonomics pays back in reduced toil and faster incident resolution.
Human Constraint 3: Team Structure (Conway's Law)
Your system architecture will mirror your team structure. This is not a suggestion. It is an observed law of organizational behavior that has proven remarkably durable.
If you have three teams and they need to build a compiler, you will get a three-pass compiler. If you have two teams and they need to build a distributed system, you will get two services.
Implications for design:
- Align service boundaries with team boundaries. A service owned by two teams is a service owned by nobody.
- If you want a different architecture, change the team structure first.
- Cross-team dependencies are architectural dependencies. Minimize them.
- The interface between two teams' services is a communication bottleneck. Design it with the same care you would design a public API.
Human Constraint 4: On-Call Experience
The on-call experience is a direct output of system design decisions. Every architectural shortcut, every missing health check, every undocumented failure mode shows up as a 3 AM page.
Designing for on-call:
- Actionable alerts: Every alert should answer three questions: What is broken? Who is affected? What should I do?
- Graduated severity: Not everything is a page. Use severity levels that match the actual user impact.
- Self-healing where possible: If the system can automatically restart a failed component, it should. The on-call engineer should be notified, not woken up.
- Clear escalation paths: When the on-call engineer cannot resolve an issue, the path to someone who can should be documented and fast.
- Post-incident learning: Every incident should produce a blameless post-mortem that leads to concrete improvements. The goal is to make each type of incident happen at most once.
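The first two bullets can be encoded rather than left to convention. A minimal sketch (the alert fields, severity levels, and runbook URL are illustrative assumptions): make "what, who, do" required fields of every alert, and make severity an explicit gate on whether anyone gets woken up.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    PAGE = "page"      # wake someone up now
    TICKET = "ticket"  # fix during business hours
    INFO = "info"      # notify, no action required

@dataclass
class Alert:
    """An alert is only actionable if it answers: what is broken,
    who is affected, and what should the responder do."""
    what: str
    who_is_affected: str
    action: str        # link to a runbook step, not free-form prose
    severity: Severity

checkout_alert = Alert(
    what="Checkout error rate above 5% for 10 minutes",
    who_is_affected="All users attempting to purchase",
    action="https://runbooks.example.com/checkout-error-rate",  # hypothetical URL
    severity=Severity.PAGE,
)

def should_page(alert: Alert) -> bool:
    # Graduated severity: only user-impacting breakage interrupts sleep.
    return alert.severity is Severity.PAGE
```

Because the dataclass has no optional fields, an alert without an owner-facing impact statement or a runbook link simply cannot be defined.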
Human Constraint 5: Onboarding
How long does it take a new engineer to make their first meaningful contribution? This is a measure of system complexity as experienced by humans.
Systems that onboard well share these traits:
- A working local development environment that can be set up in under an hour
- A "hello world" path that walks through the system's core flow
- Consistent patterns that reduce the surface area a new engineer must learn
- Tests that serve as executable documentation of expected behavior
- A culture where asking "why?" about any design decision is encouraged
Systems that onboard poorly have:
- Tribal knowledge that is not written down
- Special incantations to set up the development environment
- Inconsistent patterns across services that require learning each one individually
- Insufficient test coverage that forces new engineers to be cautious about changes
Practical Patterns
Pattern: The Decision Record
For every significant design decision, write a short record: What was decided, what alternatives were considered, and why this option was chosen. This is not documentation for documentation's sake. It is context preservation for future engineers who will ask "why was it done this way?"
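A decision record is ultimately just structured text; the fields matter more than the tooling. As an illustration only (the field names are one reasonable choice, not a standard), a record can be modeled and rendered like this:

```python
from dataclasses import dataclass, field

@dataclass
class DecisionRecord:
    """The minimum a future engineer needs: what was decided,
    what else was on the table, and why this option won."""
    title: str
    decision: str
    alternatives: list = field(default_factory=list)
    rationale: str = ""

    def render(self) -> str:
        lines = [f"# {self.title}", "", f"**Decision:** {self.decision}", ""]
        lines.append("**Alternatives considered:**")
        lines += [f"- {alt}" for alt in self.alternatives]
        lines += ["", f"**Why:** {self.rationale}"]
        return "\n".join(lines)
```

A plain markdown file per decision, kept next to the code it affects, works just as well; the point is that the three sections are always present.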
Pattern: The Error Budget
Define an error budget for each service: the amount of unreliability that is acceptable. When the error budget is spent, the team shifts from feature work to reliability work. This makes the trade-off between features and reliability explicit and data-driven.
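The arithmetic is simple enough to sketch. With a 99.9% SLO, 0.1% of requests are allowed to fail; the budget is the fraction of that allowance still unspent (function name and interface are illustrative):

```python
def error_budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo: target success rate, e.g. 0.999 allows 0.1% of requests to fail.
    """
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO has no budget to spend
    return max(0.0, 1.0 - failed_requests / allowed_failures)

# With a 99.9% SLO over 1,000,000 requests, 1,000 failures are allowed.
# 250 failures spend a quarter of the budget, leaving 0.75.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

When `remaining` hits zero, the policy kicks in: feature work pauses and reliability work begins, with no debate needed about whether "things feel flaky".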
Pattern: The Operational Review
Regularly review the operational experience of running each service. How many pages did it generate? How long did incidents take to resolve? What toil could be automated? This review treats operational experience as a first-class quality metric.
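The review questions above reduce to a few numbers that can be computed from incident records. A minimal sketch, assuming each incident is recorded with whether it paged someone and how long it took to resolve:

```python
from statistics import mean

def operational_review(incidents):
    """Summarize a service's operational load for a review period.

    incidents: list of dicts with 'paged' (bool) and
    'minutes_to_resolve' (number). The schema is illustrative.
    """
    pages = sum(1 for i in incidents if i["paged"])
    mttr = mean(i["minutes_to_resolve"] for i in incidents) if incidents else 0.0
    return {"incidents": len(incidents), "pages": pages, "mttr_minutes": mttr}

review = operational_review([
    {"paged": True, "minutes_to_resolve": 45},
    {"paged": False, "minutes_to_resolve": 15},
    {"paged": True, "minutes_to_resolve": 90},
])
# review == {"incidents": 3, "pages": 2, "mttr_minutes": 50}
```

Trending these numbers per service over time is what turns "operational experience" from a vague complaint into a first-class quality metric the team can act on.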
Key Takeaways
- Human factors, including cognitive load, operational ergonomics, and team structure, are engineering constraints as real as throughput and latency.
- Design for local reasoning. Engineers should be able to understand a component without knowing the entire system.
- Operational ergonomics (deployment, monitoring, debugging, recovery) deserve as much design attention as the API surface.
- Align service boundaries with team boundaries. Conway's Law is not optional.
- Design the on-call experience intentionally. Every architectural shortcut shows up as a page.
- Onboarding time is a measure of system complexity. Optimize it.
Further Reading
- Designing Systems That Are Hard to Misuse: How to design APIs, configurations, and system interfaces that guide users toward correct usage.
- Designing Background Job Systems for Mobile Apps: Architecture for reliable background job execution on Android, covering WorkManager, job prioritization, and constraint handling.
- Designing Mobile Systems for Poor Network Conditions: Architecture patterns for mobile apps that function reliably on slow, intermittent, and lossy networks.
Final Thoughts
A system exists to serve its users, but it is built and operated by humans. Ignoring the human factors in system design produces systems that are technically sound but operationally painful. The engineers who maintain it burn out. The on-call rotation becomes dreaded. New team members take months to become productive. These are not people problems. They are design problems. The best systems I have worked on were designed with as much care for the humans operating them as for the machines running them.