Towards More Effective Incident Postmortems

Anu-angie - Jun 3 '20 - - Dev Community

An incident postmortem is not only an essential document for reference, but also necessary as a process by which teams can collaboratively learn from failure, and communicate independent learnings across the organization.

As our systems grow in scale and complexity, outages are inevitable, no matter how hard we try to provide uninterrupted services. When an outage occurs, the most important and immediate step is, of course, fixing the underlying issue and keeping the relevant stakeholders and customers informed. Many incidents can be rectified quickly with tools like infrastructure automation, runbooks, feature flags, version control and continuous delivery, and people can be kept in the loop with chatops and status pages. These actions, though they help fix the situation at hand, do not really help us understand what failed and why. And understanding what failed and why is a crucial step towards preventing similar occurrences going forward.

This is where incident postmortems come in - the next logical step after any incident is to dissect and analyze the why, how and the what of the incident. And ideally, this should really be done for every single incident, and not just the high severity or high impact ones.

An incident postmortem is a report that records the details of an incident, the impact it had on the service, the team that was assembled to address the event, the immediate steps taken to mitigate the damage, the actions taken to resolve the incident and the lessons learnt that can help the team minimize the impact of future incidents. These lessons can in turn affect how you think about a particular component of your system, or sometimes just how mitigation steps could be done faster in specific cases, which is a big deal, to say the least.

Importance of Incident Postmortems

There are several reasons why doing an incident postmortem is incredibly important:

  • Serves as a documentation tool: It gives team members a way to record the finer details of an incident, ensuring that they won’t be forgotten. A well-documented incident becomes invaluable to a team, since it includes not only a description of what happened but also details of the actions that were taken, which serve as a reference point for remediation and mitigation of future incidents.

  • Helps build trust and transparency with customers and relevant stakeholders of a particular service or application when posted publicly. This also helps build confidence amongst users that the necessary steps are being taken to prevent future disruptions to the services provided.

  • Instills a culture of learning. As rightly said, “The cost of failure is education”. It also helps shift the focus from the immediate now to the future. This is why conducting blameless postmortems becomes crucial. More on being blameless is covered later in this blog.

  • Serves as an opportunity to gain insights that drive infrastructure improvement: when services and applications fail in new and interesting ways, we get to see which areas need work.

Incident Postmortem - What does it consist of?

Incident Postmortems are also called RCAs (Root Cause Analyses) or incident reviews. At Squadcast, we prefer the term Incident Reviews, but to keep this easier to digest, we are going to refer to them by the more popular “Incident Postmortem” for the rest of this article. When it comes to an incident postmortem, there is no one-size-fits-all approach or even a universally accepted standard for doing different kinds of postmortems. The postmortem process varies across organizations, and sometimes even within companies, from casual to highly formal, depending on the size and culture of the teams, the nature of the product or the severity of the incident.

Regardless of the name and the approach, the end goal remains the same - to keep relevant stakeholders informed and to use the incident as a learning opportunity, not only to fix a weakness but to make systems more resilient as a whole. Gathering the information for a postmortem can take considerable time and effort, and the postmortem meeting (if needed) might occur days, or even weeks, after the actual incident, depending on its severity.

A typical postmortem process covers the below-outlined aspects, in no particular order:

  • A high-level outline or summary: This covers the ‘what’ and ‘why’ of the incident, the severity and business impact on customers or users, people involved in the response process and the resolution of the incident. This is particularly beneficial to managers and application owners who need to communicate details of the incident to the top management and relevant outside stakeholders.

  • Causes: This part of the postmortem addresses the more technical and operational aspects of the incident, starting with the causes and triggers. It explains the origins of the failure and highlights the underlying cause - what made the system break. A popular method for getting to the root cause is the 5 Whys process, which was first made popular by Toyota.

  • Effects: After analyzing the causes of the incident in detail, the team is tasked with measuring and analyzing the effects on business, services, and users. This step of the postmortem process also analyses the extent and severity of the incident - for instance, the impact on the business when a payment service was down on an e-commerce website, affecting its customers’ purchase experience.

  • Resolution: This step starts with a detailed dissection of the incident timeline, covering the time of failure, the time when the incident was recognized and handled, the team involved in the process and the procedures followed to remediate the problem. This part can also include a review of failed attempts, which can serve as a reference for the team when a similar incident occurs, saving valuable time.

  • Conclusion: Outlines the key takeaways, recommendations and next steps to ensure prevention of the same or similar incident in the future.
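The five aspects above map naturally onto a simple record type. The following is a minimal sketch in Python, assuming a hypothetical `Postmortem` class; the class and field names are ours for illustration and do not come from any standard or tool. It only shows how a team might check a draft for empty sections before publishing.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical container mirroring the five aspects listed above."""
    summary: str                                          # the 'what' and 'why', severity, impact
    causes: List[str] = field(default_factory=list)       # triggers and underlying causes
    effects: List[str] = field(default_factory=list)      # impact on business, services, users
    resolution: List[str] = field(default_factory=list)   # steps taken, including failed attempts
    conclusion: List[str] = field(default_factory=list)   # takeaways and next steps

    def missing_sections(self) -> List[str]:
        """Return the names of the sections that are still empty."""
        sections = {
            "summary": self.summary.strip(),
            "causes": self.causes,
            "effects": self.effects,
            "resolution": self.resolution,
            "conclusion": self.conclusion,
        }
        return [name for name, value in sections.items() if not value]

    def render(self) -> str:
        """Render the postmortem as plain text, one heading per section."""
        lines = ["Summary", f"  {self.summary.strip() or '(to be written)'}"]
        for title, items in (
            ("Causes", self.causes),
            ("Effects", self.effects),
            ("Resolution", self.resolution),
            ("Conclusion", self.conclusion),
        ):
            lines.append(title)
            if items:
                lines.extend(f"  - {item}" for item in items)
            else:
                lines.append("  (to be written)")
        return "\n".join(lines)


# Illustrative draft with made-up content.
draft = Postmortem(
    summary="Payment service was unavailable on the checkout page",
    causes=["Expired TLS certificate on the payment gateway"],
)
print(draft.render())
print("Still missing:", draft.missing_sections())
```

A check like `missing_sections()` could run before the report is circulated, so that an incomplete draft is caught early rather than in the review meeting.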

Successful postmortems are blameless

“Blameless postmortems are a tenet of SRE culture. You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems”

A critical factor in a successful incident postmortem is that it is blameless. A culture that seeks to point fingers at the person who may have caused an outage through error or omission is unlikely to get truthful answers during a review, thus negating the intent behind having an incident postmortem in the first place.

Through blameless postmortems, the aim is to have a nurturing environment where every “mistake” is seen as an opportunity to strengthen the system. Blameless postmortems shift the focus from allocating blame to investigating the underlying reasons why an individual or team faced an outage, and to the prevention plans that can be put in place.

Many teams, including Google and us here at Squadcast, have adopted the culture of the blameless postmortem, which paves the way for building resilience in teams and systems.

Blameless postmortems can be challenging to write, since the postmortem format clearly identifies the actions that led to the incident. However, removing blame from a postmortem gives the team the confidence to escalate issues without fear. The next section outlines the steps that can be taken to conduct effective blameless postmortems.

To help teams develop a culture around blameless incident postmortem reviews, it also helps to give them an easy and automated way to capture incident information and publish the final report with reusable checklists and templates, which could make incident postmortem meetings less dreadful. In fact, having an automated timeline and templates that are auto-populated with incident metrics and other details as part of your incident management tool can help the process be more consistent and productive for every incident that occurs.

Conducting effective Incident Postmortems - The process

For postmortems to be blameless and effective at reducing recurring incidents, the review process should incentivize teams to identify root causes and fix them. A well-run postmortem allows your team to come together in a less stressful environment to achieve several goals. The exact method can depend on team culture.

Here are a few best practices that can ensure the effectiveness of postmortems:

1. Start with an incident timeline

Before a postmortem meeting, build a timeline of significant activity - from chat conversations, incident details and more - so that the meeting can be centered around it. You can streamline the entire postmortem process with automated incident timeline building, collaborative editing and actionable insights, and formalize your own postmortem process to make it as easy as possible for your team to respond to issues.

The goal is to understand all contributing root causes and to document the incident for pattern discovery, which allows you to set better context during the postmortem meeting. This step also plays a key role in enacting effective preventative actions to reduce the likelihood or impact of recurrence.
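As a rough illustration of what building a timeline involves, the sketch below merges events from several sources into one chronological list. The `TimelineEvent` type, the `build_timeline` function and the source labels ("alert", "chat", "deploy") are assumptions made for this example, not the format of any particular incident management tool, and the events themselves are made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime
    source: str  # illustrative label, e.g. "alert", "chat", "deploy"
    note: str


def build_timeline(*sources: Iterable[TimelineEvent]) -> List[TimelineEvent]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Made-up events for a single incident.
alerts = [
    TimelineEvent(datetime(2020, 6, 1, 10, 2), "alert", "Checkout error rate above threshold"),
]
chat = [
    TimelineEvent(datetime(2020, 6, 1, 10, 5), "chat", "On-call engineer acknowledges the alert"),
    TimelineEvent(datetime(2020, 6, 1, 10, 40), "chat", "Error rate back to normal"),
]
deploys = [
    TimelineEvent(datetime(2020, 6, 1, 9, 58), "deploy", "New release rolled out"),
    TimelineEvent(datetime(2020, 6, 1, 10, 30), "deploy", "Rollback started"),
]

timeline = build_timeline(alerts, chat, deploys)
for event in timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.note}")
```

Keeping the source on every event makes it easy to see, during the meeting, whether a given entry came from monitoring, from a human in chat or from the deployment pipeline.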

2. Conduct a postmortem meeting with anyone internal to the team who was affected by the incident

A structured and collaborative approach that brings together the people affected by an incident allows for a more cohesive contribution to the postmortem meeting in terms of what they learnt from the incident. This also helps build trust and resiliency within teams. The formal incident postmortem document that records the details of the incident, along with how the team remedied it, can help teams in handling future incidents.

At this step, a formal template can help you record all key details and build consistency across all your incident postmortems.

At Squadcast we use our own incident postmortem feature that helps build an insightful timeline in a matter of minutes. This is especially useful as automation ensures that you can quickly have a system-generated postmortem for pretty much any incident, big or small. There are also a few predefined postmortem templates available from the likes of Google, Azure, and others. You can also choose to create new templates or modify existing ones. What’s more, these are available to download in MD and PDF formats!

3. Define roles and owners along with having a moderator

Another key aspect to keep in mind during a postmortem meeting is to have well-defined roles and owners, along with a moderator who can ensure the meeting stays on track and avoids any hint of a “blamestorming” session. It helps to have guidelines for the owners of the postmortem process on how the meetings should be run.

The owner of the review is tasked with managing the meeting and chronicling the subsequent report. It is advisable that the owner be someone who has sufficient understanding of the technical details, familiarity with the incident, and an understanding of the business impact. In most cases, the moderator is the owner of the incident review and is responsible for maintaining order and giving every participant the chance to speak.

4. Determine the urgency of an incident by setting the right thresholds

Not all incidents are equal. Each incident in an organisation should be associated with a measurable severity level based on the impact it has on the business and its customers. Associating incidents with the right severity level can help you prioritize your postmortem process. For instance, incidents of Sev 1 or higher severity definitely require a postmortem, while for less severe incidents, postmortems can be automated with a tool like Squadcast.

That said, if need be, teams should also be provided with an option to request a postmortem for any incident that doesn't meet the threshold.
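The threshold rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: severity is an integer where 1 is the most severe level (schemes differ between organisations), and `review_kind`, `ReviewKind` and `full_review_threshold` are hypothetical names, not part of any tool's API.

```python
from enum import Enum


class ReviewKind(Enum):
    FULL = "full postmortem with a review meeting"
    AUTOMATED = "automated postmortem report"


def review_kind(
    severity: int,
    requested_by_team: bool = False,
    full_review_threshold: int = 1,
) -> ReviewKind:
    """Decide which kind of postmortem an incident gets.

    Assumes lower numbers mean higher severity (Sev 1 is the most severe).
    A team request always upgrades the incident to a full postmortem.
    """
    if severity < 1:
        raise ValueError("severity must be 1 or greater in this sketch")
    if severity <= full_review_threshold or requested_by_team:
        return ReviewKind.FULL
    return ReviewKind.AUTOMATED


print(review_kind(1).value)
print(review_kind(3).value)
print(review_kind(3, requested_by_team=True).value)
```

Making the threshold a parameter keeps the policy adjustable, for example if a team later decides that Sev 2 incidents should also get a full review.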

5. Devil’s in the Details - incident metrics and other key information captured

Capturing as many details as possible about what happened and what was done during the incident helps teams avoid ambiguity. Details such as links to tickets, status updates, and incident state documents like monitoring charts, along with screenshots and relevant graphics or dashboards, become a powerful data set that captures the fine details of an incident.

It is also crucial that, along with summarizing key details, you capture important incident-related metrics that attach hard numbers to incidents and their impact. Metrics such as Mean Time to Resolution (MTTR), the SLO and the extent of SLO breach, error budget consumed, severity of the incident and number of minutes of downtime can be considered for postmortem tracking. With consistent measurement of these metrics, you can analyze incident trends over time.
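To make two of these metrics concrete, here is a minimal sketch of how MTTR and error budget consumption could be computed. It assumes MTTR is measured from the start of an incident to its resolution and that the SLO is an availability target over a fixed window; the function names and sample numbers are illustrative only.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def mean_time_to_resolution(incidents: List[Tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - started) over a list of (started, resolved) pairs."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def error_budget_consumed(downtime_minutes: float, slo_target: float, window_days: int) -> float:
    """Fraction of the error budget used; values above 1.0 mean the SLO was breached."""
    return downtime_minutes / error_budget_minutes(slo_target, window_days)


# Made-up incidents: one lasting 40 minutes and one lasting 20 minutes.
incidents = [
    (datetime(2020, 6, 1, 10, 0), datetime(2020, 6, 1, 10, 40)),
    (datetime(2020, 6, 8, 14, 0), datetime(2020, 6, 8, 14, 20)),
]
mttr = mean_time_to_resolution(incidents)
downtime = sum((end - start).total_seconds() / 60 for start, end in incidents)

print("MTTR:", mttr)
print("Budget (min):", round(error_budget_minutes(0.999, 30), 1))
print("Budget consumed:", round(error_budget_consumed(downtime, 0.999, 30), 2))
```

With a 99.9% target over 30 days the budget works out to about 43 minutes, so the 60 minutes of downtime in this made-up example would exceed it.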

The key to conducting effective incident postmortems that can help you improve your team and systems is to have a process and stick to it. And making sure it is effective requires commitment at all levels in the organization.

6. Publish and track postmortems promptly

Once the postmortem review meeting is completed, the final but important step is to publish the postmortem promptly and distribute it as an internal communication, typically via email, to all relevant stakeholders, describing the results and key learnings along with a link to the full report.

Google states that “A prompt postmortem tends to be more accurate because information is fresh in the contributors’ minds. The people who were affected by the outage are waiting for an explanation and some demonstration that you have things under control. The longer you wait, the more they will fill the gap with the products of their imagination. That seldom works in your favor!”

Regular application of these practices results in better system design, less downtime, and more effective and happier engineers.

Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.
