How I Think About Engineering Risk
A framework for identifying, categorizing, and managing engineering risk across system design, team dynamics, and operational decisions.
Engineering risk is not the same as engineering uncertainty. Uncertainty means you do not know what will happen. Risk means you have identified what could go wrong and estimated the impact. Converting uncertainty into risk is the first step toward managing it.
A Risk Model for Engineering
I categorize engineering risk along two dimensions: probability of occurrence and severity of impact.
| | Low severity | High severity |
|---|---|---|
| High probability | Accept and monitor | Fix immediately |
| Low probability | Accept and ignore | Design mitigation |
The interesting quadrant is low probability, high severity. These are the risks that do not happen often but cause significant damage when they do. Database corruption, security breaches, data center failures, cascading outages. You cannot fix them by making them less probable (they are already infrequent). You manage them by reducing the severity through mitigation strategies.
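The two-by-two matrix can be sketched as a tiny classifier. This is an illustrative sketch only: the 0.5 cutoffs and normalized inputs are my assumptions here, not part of any standard scale.

```python
def classify_risk(probability: float, severity: float) -> str:
    """Map a (probability, severity) pair to a response strategy.

    Both inputs are normalized to [0, 1]; 0.5 is an arbitrary cutoff
    chosen purely for illustration.
    """
    high_prob = probability >= 0.5
    high_sev = severity >= 0.5
    if high_prob and high_sev:
        return "fix immediately"
    if high_prob:
        return "accept and monitor"
    if high_sev:
        return "design mitigation"
    return "accept and ignore"


# A database-corruption scenario: rare but devastating.
print(classify_risk(probability=0.05, severity=0.95))  # design mitigation
```

Note that the interesting low-probability, high-severity case falls out of the last branch before the default: you cannot move it left on the probability axis, so the only lever is severity.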
Types of Engineering Risk
Technical Risk
The system does not behave as designed. Bugs, performance issues, integration failures, scaling limits.
Mitigation: testing at multiple levels (unit, integration, load), canary deployments, automated rollback, monitoring with actionable alerts.
The most dangerous technical risk is the one you believe you have mitigated but have not. A test suite that covers 90% of code but not the 10% that handles error paths gives false confidence. I focus testing effort on the paths that matter most: error handling, edge cases, and recovery logic.
Operational Risk
The system behaves as designed but cannot be operated safely. Deployments are risky, debugging is difficult, scaling requires manual intervention, on-call burden is unsustainable.
Mitigation: infrastructure as code, automated deployment pipelines, comprehensive runbooks, observability investment, on-call rotation that distributes burden fairly.
Operational risk is insidious because it does not manifest as a single incident. It manifests as a slow accumulation of toil, engineer burnout, and increasing incident response times. By the time it is visible in metrics, it has been eroding the team for months.
Dependency Risk
A critical dependency changes behavior, degrades, or becomes unavailable. This includes both technical dependencies (APIs, libraries, infrastructure services) and organizational dependencies (a team that owns a service you depend on).
Mitigation: abstraction layers over external dependencies (so you can swap providers), health checks and circuit breakers, SLA agreements with internal teams, fallback strategies for critical paths.
The most overlooked dependency risk is organizational. If your critical path depends on a service owned by a team that is understaffed, reorganizing, or has different priorities, you have a risk that no amount of technical mitigation can address. Surface it early and escalate if necessary.
Knowledge Risk
The system cannot be understood, modified, or operated by the current or future team. Key knowledge is concentrated in one or two engineers. Documentation is outdated or missing. The codebase has areas that "nobody touches."
Mitigation: architecture decision records, code review as knowledge transfer, rotation of on-call and project ownership, deliberate pair programming on critical areas.
Knowledge risk increases silently. It only becomes visible when the knowledgeable engineer is unavailable (vacation, departure, illness) and the team discovers they cannot operate the system effectively.
Compliance and Security Risk
The system does not meet regulatory requirements, exposes sensitive data, or has vulnerabilities that could be exploited.
Mitigation: security reviews as part of the design process (not after the fact), automated dependency scanning for known vulnerabilities, data classification and access controls, audit logging for sensitive operations.
Compliance risk has a unique property: the probability of an audit or breach may be low, but the severity can be existential for the business. This puts it firmly in the "design mitigation" quadrant.
Risk Budgets
Every project has an implicit risk budget. The question is whether you manage it deliberately or discover it after an incident.
I make risk budgets explicit:
- Per-quarter risk allocation. The team can take on X amount of technical risk this quarter. Major refactors, new technology adoption, and architectural changes consume from this budget.
- Per-deployment risk assessment. Each deployment gets a risk score based on the scope of changes, the criticality of affected paths, and the availability of rollback mechanisms. High-risk deployments get additional review, canary periods, and monitoring.
- Cumulative risk tracking. Risk accumulates. Three medium-risk changes deployed in the same week create a compound risk that exceeds the sum of the individual risks. Space risky changes apart.
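A per-deployment risk score can be as simple as a weighted sum. The weights, the 0-10 scales, and the thresholds below are illustrative assumptions; a real team would calibrate them against its own incident history.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    scope: int          # 0-10: how much of the system the change touches
    criticality: int    # 0-10: how important the affected paths are
    has_rollback: bool  # is automated rollback available?


def risk_score(d: Deployment) -> float:
    """Weighted sum of scope and criticality, inflated without rollback."""
    score = 0.5 * d.scope + 0.5 * d.criticality
    if not d.has_rollback:
        score *= 1.5  # no safety net: treat the deployment as riskier
    return score


def review_level(score: float) -> str:
    if score >= 8:
        return "high risk: extra review + canary + monitoring"
    if score >= 4:
        return "medium risk: canary period"
    return "low risk: standard pipeline"


d = Deployment(scope=7, criticality=8, has_rollback=False)
print(review_level(risk_score(d)))  # high risk: extra review + canary + monitoring
```

The exact formula matters less than the practice: scoring forces the conversation about scope, criticality, and rollback to happen before the deploy, not during the incident review.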
Risk Communication
Technical risk must be communicated to non-technical stakeholders in terms they understand. "The database might run out of connections under peak load" is meaningless to a product manager. "There is a 10% chance that the checkout page goes down during Black Friday traffic" is actionable.
I frame risk communication as:
- What could happen (in business terms, not technical terms)
- How likely it is (rough probability, not a precise number)
- What the impact would be (revenue, users affected, reputation)
- What mitigation costs (engineering time, infrastructure spend)
- What accepting the risk means (explicit acknowledgment of the potential impact)
This framing makes risk a business decision, not a technical one. The engineering team quantifies the risk and proposes mitigations. The business decides whether to invest in mitigation or accept the risk.
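The five-part framing above can be captured as a structured record so every risk is communicated the same way. A minimal sketch; the field names mirror the list, and the example risk and its numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskReport:
    what: str              # what could happen, in business terms
    likelihood: str        # rough probability, not a precise number
    impact: str            # revenue, users affected, reputation
    mitigation_cost: str   # engineering time, infrastructure spend
    acceptance_means: str  # explicit acknowledgment of the potential impact

    def summary(self) -> str:
        return (
            f"Risk: {self.what}\n"
            f"Likelihood: {self.likelihood}\n"
            f"Impact: {self.impact}\n"
            f"Mitigation cost: {self.mitigation_cost}\n"
            f"If accepted: {self.acceptance_means}"
        )


report = RiskReport(
    what="Checkout page goes down during Black Friday traffic",
    likelihood="roughly 10%",
    impact="lost peak-hour revenue and abandoned carts",
    mitigation_cost="two engineer-weeks for connection pooling and load tests",
    acceptance_means="we knowingly tolerate a possible Black Friday outage",
)
print(report.summary())
```

Filling in the template is the point: a report with an empty `acceptance_means` field is a risk nobody has actually decided to accept.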
Pre-Mortems
A pre-mortem is a risk identification exercise conducted before a project launches. The team imagines that the project has failed and works backward to identify what went wrong.
The format:
- "It is 6 months from now and this project has failed. What happened?"
- Each team member writes down their answers independently.
- The team discusses and categorizes the risks.
- For each high-impact risk, assign a mitigation owner and a mitigation plan.
Pre-mortems surface risks that optimism suppresses. In a standard planning session, the team focuses on how to succeed. In a pre-mortem, the team focuses on how to fail, which reveals blind spots.
I run pre-mortems for every project that involves architectural changes, new technology adoption, or changes to critical user-facing paths.
Accepting Risk Explicitly
Not all risks should be mitigated. Mitigation has a cost, and sometimes the cost exceeds the expected loss.
The key is explicit acceptance. When the team decides not to mitigate a risk, document:
- The risk and its estimated probability and impact
- The cost of mitigation
- The decision to accept and the reasoning
- The conditions under which the decision should be revisited
Implicit risk acceptance ("we did not think about it") is dangerous. Explicit risk acceptance ("we evaluated it and decided the mitigation cost exceeds the expected loss") is professional engineering.
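The "mitigation cost exceeds the expected loss" test is back-of-the-envelope arithmetic, assuming you can put rough numbers on probability and impact. All figures in this sketch are hypothetical.

```python
def expected_loss(probability: float, impact_cost: float) -> float:
    """Expected loss = probability of occurrence x cost if it occurs."""
    return probability * impact_cost


def should_mitigate(probability: float, impact_cost: float,
                    mitigation_cost: float) -> bool:
    """Mitigate only when it is cheaper than the expected loss."""
    return mitigation_cost < expected_loss(probability, impact_cost)


# Hypothetical: a 2% yearly chance of a $500k incident vs $30k to mitigate.
# Expected loss is $10k, so explicit acceptance is the rational call.
print(should_mitigate(0.02, 500_000, 30_000))  # False
```

The numbers are rough by design; the value of the exercise is that the comparison, and the decision to accept, gets written down.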
Key Takeaways
- Convert uncertainty into risk by identifying what could go wrong and estimating the impact.
- Low probability, high severity risks are managed through mitigation (reducing impact), not prevention (reducing probability).
- Five categories of engineering risk: technical, operational, dependency, knowledge, and compliance/security.
- Make risk budgets explicit per quarter and per deployment. Cumulative risk from multiple changes exceeds the sum of individual risks.
- Communicate risk to stakeholders in business terms: what, how likely, what impact, what mitigation costs.
- Pre-mortems surface risks that optimism suppresses. Run them for every significant project.
- Accept risk explicitly and document the decision.
Final Thoughts
Risk management is not about avoiding all risk. It is about choosing which risks to take deliberately, mitigating the ones with the highest expected loss, and accepting the rest with full awareness. The engineering teams I trust most are not the ones that never have incidents. They are the ones that can explain exactly which risks they accepted and why.