Failure Is An Opportunity: The Reality of Incident Postmortem

Making the best out of system outages and service disruptions

One thing all software engineers can agree on is that every system has failures. No matter how great our system architecture is, or how stable the cloud platform our system runs on, our system will fail. It’s not a question of if, it’s a question of when.

Some failures are major outages in which the entire system is down and none of our users can access it. Other failures are minor service disruptions that impact only some features or only some users of our product. Both are system failure events.

The way we handle system failure events is important. We want to minimize the business impact, which means both decreasing the frequency of failures and shortening the time from detection to remediation once a failure occurs.

Incident postmortem is a process used by many teams to achieve this goal. Some teams call it an “incident retrospective”, probably to avoid the negative connotation, but both terms refer to the same process.

In this article, I will share my perspective on performing incident postmortems. I will cover the tactical aspects of how to actually run the process but, more importantly, I will try to convince you that an incident postmortem is a unique opportunity to gain strategic insights into our product and our team. Ready?

Simply Answer 3 Questions

At the core of a postmortem process, we should answer the 3 questions listed below. Each of the 3 questions is built around “What could we have done to…”. This phrasing keeps the discussion concrete and eventually helps us derive action items. The order of the 3 questions is important. The order I am suggesting is a bit counter-intuitive for some people, but I find that going backward rather than forward is more efficient in the case of a postmortem. Here are the 3 questions:

1. Remediate The Incident Faster

“Once the issue was detected, what could we have done to remediate it faster?”

2. Detect The Incident Earlier

“Once the issue was introduced, what could we have done to detect it earlier?”

3. Prevent The Incident

“What could we have done to prevent the incident from happening?”
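The first two questions each map to a measurable interval: the time from introduction to detection, and the time from detection to remediation. As a minimal sketch, assuming we can pin down three timestamps per incident, the intervals could be computed as below. The class name, field names, and sample timestamps are all hypothetical and only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    """Three hypothetical timestamps that bound an incident."""

    introduced_at: datetime   # when the faulty change or condition first appeared
    detected_at: datetime     # when the team became aware of a possible issue
    remediated_at: datetime   # when the issue was declared resolved

    def time_to_detect(self) -> timedelta:
        # Question 2 is about shrinking this interval.
        return self.detected_at - self.introduced_at

    def time_to_remediate(self) -> timedelta:
        # Question 1 is about shrinking this interval.
        return self.remediated_at - self.detected_at


# Made-up timestamps, purely illustrative.
timeline = IncidentTimeline(
    introduced_at=datetime(2024, 1, 10, 9, 0),
    detected_at=datetime(2024, 1, 10, 9, 40),
    remediated_at=datetime(2024, 1, 10, 10, 25),
)
print(timeline.time_to_detect())     # 0:40:00
print(timeline.time_to_remediate())  # 0:45:00
```

Tracking these two numbers across incidents is one way to see whether the action items from previous postmortems are paying off.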

Best Practices:

Now that we know the questions that need to be answered, let’s discuss the mechanism for getting it done. Usually, I am not a fan of meeting structure recipes, but incident postmortem meetings are an exception. The inherent sensitivity and the risk of getting into a blame game are good enough reasons to stick to a well-defined meeting structure that is almost like a protocol. Here are five best practices for a postmortem meeting:

1. Timing

ASAP. The postmortem meeting should take place no more than 3 days after the incident was resolved. Not only because people will forget what actually happened during the event (facts) if we delay it further, but mostly because we want people to come to this meeting with the same sense of urgency (feelings) they had when they were handling the incident.

2. Pre-requisites

The last thing you want to do in a postmortem meeting is to fight over the facts. The meeting must start with the postmortem owner presenting an event log. This log is simply a chronological list of events: information that became available, decisions that were made, and actions that were taken. The log starts from the moment we became aware that there might be an issue and ends at the moment we declared that the issue is resolved. The log must be facts only, with no interpretation and no justification for the actions that were taken.
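To make the shape of such a log concrete, here is a minimal sketch of one possible representation. The three entry kinds mirror the categories above (information, decisions, actions); the type names, function names, and sample entries are assumptions made up for illustration, not a prescribed format.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventKind(Enum):
    INFORMATION = "information"  # information that became available
    DECISION = "decision"        # a decision that was made
    ACTION = "action"            # an action that was taken


@dataclass(frozen=True)
class LogEntry:
    timestamp: datetime
    kind: EventKind
    description: str  # facts only: no interpretation, no justification


def build_event_log(entries):
    """Return the entries in chronological order."""
    return sorted(entries, key=lambda entry: entry.timestamp)


def format_event_log(entries):
    """Render the log as plain text, one line per event."""
    return "\n".join(
        f"{e.timestamp:%Y-%m-%d %H:%M} [{e.kind.value}] {e.description}"
        for e in build_event_log(entries)
    )


# Illustrative entries only; this incident is made up.
entries = [
    LogEntry(datetime(2024, 1, 10, 9, 52), EventKind.ACTION,
             "Rolled back the latest deployment"),
    LogEntry(datetime(2024, 1, 10, 9, 40), EventKind.INFORMATION,
             "Alert fired: elevated error rate on checkout"),
    LogEntry(datetime(2024, 1, 10, 9, 47), EventKind.DECISION,
             "Decided to roll back the latest deployment"),
]
print(format_event_log(entries))
```

Keeping the description as a single factual sentence per entry makes it easier to present the log without sliding into discussion.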

3. Participants

Everyone who may be able to contribute. This is not a “managers only” meeting. There is nothing confidential (otherwise you have a bigger problem), and in most cases the most valuable people are the developers and DevOps/SRE engineers who were actually fighting the flames and operating the system to resolve the issue.

4. Agenda

My favorite postmortem meeting structure is:

Present the event log (mentioned above). From start to finish. No discussion, just facts alignment.
Present the overall business impact. How long was the service down? How many users were impacted? Which reactions did we receive during and after the incident?
Discuss and answer each of the 3 questions above. One by one, in the order they are listed (remediation, detection, prevention). Each question must get a simple answer and concrete action items. Don’t allow the discussion to creep from one question to another.

5. Outcome

For each of the action items you have listed while answering the 3 questions, you must create an issue in your project management tool (Jira, Asana, Monday, etc.). Otherwise, it will be forgotten. Some people also schedule a specific follow-up meeting to track the progress of these issues, but this really depends on how you do tracking in general.

In addition to creating an issue per action item, I also recommend having a page in your knowledge management system (say Confluence) with a list of all incident postmortems and a linked page per incident containing a summary of the meeting, the event log, and the answers to the 3 questions.

The Opportunity:

So we had the meeting, we investigated what happened, agreed on what we could have done better, and even came up with action items. That’s great. Most teams stop here. When they do so, they miss an opportunity. The opportunity to get a reflection of the core strengths and weaknesses of their team, their product, and their processes.

There could be many aspects for this reflection, and they may differ according to the nature of the failure. Here are the 6 aspects which I find most valuable to analyze right after a postmortem meeting is completed:

1. Root Cause Analysis

When there is a symptom of a system malfunction, our ability to determine what is causing this symptom and whether this symptom is an indication of a real problem is critical. It’s not only about being right or wrong. It’s also about the amount of time and effort required to perform the investigation.

A situation in which there was an early symptom but nobody picked it up and investigated it, or someone did pick it up but wasn’t able to find the root cause until it became a bigger issue, should concern you. In such cases, I would look deep into our observability tools and processes. Do we have the right logs, metrics & traces? Can all relevant team members access them, and do they know how to use them?

2. Impact Analysis

After the root cause of the issue is identified, the next step is to map the impact. Which services are impacted? Which features in the product? Is it impacting all our users or only a subset? Once again, it’s not only about being able to come up with the right answer, but also about how long it takes us and how many people need to be involved to get a clear impact map.

A situation in which the root cause was already isolated, but we had to call 3 or more people (usually from different teams) and it took us more than 15 minutes to get an impact analysis, should be a red flag. In such cases, I would look deep into how good an understanding we have of the broader system architecture, and specifically of the cross-service dependencies.
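One way to shorten impact analysis is to keep the cross-service dependencies in a machine-readable form, so that the impact map can be derived rather than reconstructed from memory. The sketch below assumes a hand-maintained dependency map and walks it backward from the failed service. The service names and the `impacted_services` helper are hypothetical.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "web-frontend": ["checkout", "catalog"],
    "checkout": ["payments", "inventory"],
    "catalog": ["inventory"],
    "payments": [],
    "inventory": [],
}


def impacted_services(depends_on, failed_service):
    """Return every service that directly or transitively depends on failed_service."""
    # Invert the map so we can walk from a service to its callers.
    callers = {service: [] for service in depends_on}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            callers.setdefault(dependency, []).append(service)

    # Breadth-first walk over the callers, starting from the failed service.
    impacted = set()
    queue = deque([failed_service])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(sorted(impacted_services(DEPENDS_ON, "inventory")))
# ['catalog', 'checkout', 'web-frontend']
```

In practice the map would more likely come from service discovery or tracing data than from a hand-written dictionary, but the traversal stays the same.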

3. Blast Radius

The most common cause of outages and service disruptions is the changes that we deploy. They can be either code changes or configuration changes. In second place we have infrastructure issues, and only then come all other causes, like abnormal user behavior, security attacks, etc.

Under the reasonable assumption that the change we deployed or the infra that failed directly impacted only a specific module (microservice), we should ask ourselves whether the actual damage could have been reduced by having better isolation mechanisms. I would start with a deep look at our high availability architecture and how well it served us with this incident, and continue with an examination of circuit breakers, back pressure controls & other mechanisms that could prevent cascading errors and reduce the blast radius of the failure.
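For readers who have not met the pattern, here is a minimal, single-threaded sketch of a circuit breaker. It illustrates the idea only and is not a production implementation or any specific library's API. The class name, thresholds, and the `flaky` function are assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling more load onto a failing dependency.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let a trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result


# Usage: after two consecutive failures, the third call is rejected
# without touching the (hypothetical) failing dependency.
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=30.0)


def flaky():
    raise ConnectionError("dependency unavailable")


for _ in range(3):
    try:
        breaker.call(flaky)
    except ConnectionError:
        print("call failed")
    except CircuitOpenError:
        print("rejected without calling the dependency")
```

The point for blast radius is the fail-fast branch: callers get an immediate error they can handle (for example, by degrading a feature) instead of waiting on a dependency that is already down.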

4. Recoverability

Once we have identified the code change or the infrastructure issue that caused the incident, we want to fix it as quickly as possible.

A situation in which the root cause was already isolated, but it took us more than 30 minutes to deploy a fix, indicates we have a problem. We must be able to roll back changes we recently deployed, and this shouldn’t take more than a few minutes. Not being able to do so usually indicates a gap in our CD pipeline. If the root cause of the incident was related to an infrastructure failure, not being able to fix the issue quickly indicates that we are either using the wrong infrastructure or (more often) lack the expertise required for operating this infrastructure at scale.

5. Technical Debt

This topic is close to my heart, and I wrote a dedicated article about tech debt:

Technical Air Pollution: The Reality of Technical Debt

Deficiencies that initially reduce your velocity might eventually kill your product

In the context of incident postmortems, we get a precious opportunity to re-evaluate our tech debt management decisions. In some cases, an incident can be viewed as the interest we are paying on the tech debt. Given the business impact of the incident, we may realize that the interest has become high enough to justify investing in paying down some of the debt.

6. Culture

Last but not least, we can learn a lot about our org’s culture from the way we handled failure. Unlike the other aspects mentioned above where I was able to point you to specific areas that need to be inspected, culture is a broader and “softer” aspect, meaning it’s a bit harder to define and quantify.

Still, there are some common areas that I usually look into:

Ownership & Attitude: Some people “jump into the fire” immediately, while others only get involved when there is solid proof that the issue is related to their area of responsibility.
Communication Patterns: Some teams are easier to communicate with than others. In some teams, different members of the team can assist in issue resolution, while in others you always see the same person handling all issues.
Willingness to Improve: The way people behave during a postmortem process is an indicator of their ability to learn from mistakes, their willingness to be coached, and, in general, their ability to grow.

Conclusion:

Think about system failures as a disease. Postmortem meetings in their basic form are pills that help our body fight the specific disease. But when a postmortem meeting is followed by a deeper analysis, we have a chance to develop a vaccine. A vaccine that will not only help us prevent similar production incidents from happening in the future but one that can improve the way our product is built and the way our team is working. Failure is an opportunity. Use it wisely!
