These are my notes on Chapter 15, "Postmortem Culture: Learning from Failure", from the book Site Reliability Engineering: How Google Runs Production Systems.
This post is part of a series. The previous post can be found here:
SRE book notes: Managing Incidents
Hercules Lemke Merscher ・ Jan 31 ・ 1 min read
Unless we have some formalized process of learning from these incidents in place, they may recur ad infinitum. Left unchecked, incidents can multiply in complexity or even cascade, overwhelming a system and its operators and ultimately impacting our users.
It is important to define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary.
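To make "define the criteria beforehand" concrete, here is a minimal sketch of how a team might write its triggers down as code. The triggers are modeled on the kinds of examples the chapter gives (user-visible downtime, data loss, on-call intervention, slow resolution, monitoring failure). The `Incident` fields, the function names, and both threshold values are my own assumptions for illustration, not something the book prescribes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical summary of an incident; field names are assumptions."""
    user_visible_downtime: timedelta
    data_lost: bool
    oncall_intervened: bool       # e.g. a rollback or traffic rerouting
    time_to_resolve: timedelta
    detected_by_monitoring: bool  # False means someone found it manually


# Illustrative thresholds only; each team should agree on its own values.
MAX_DOWNTIME = timedelta(minutes=10)
MAX_RESOLUTION_TIME = timedelta(hours=1)


def postmortem_reasons(incident: Incident) -> list[str]:
    """Return every agreed-upon trigger that this incident hit."""
    reasons = []
    if incident.user_visible_downtime > MAX_DOWNTIME:
        reasons.append("user-visible downtime above threshold")
    if incident.data_lost:
        reasons.append("data loss")
    if incident.oncall_intervened:
        reasons.append("on-call intervention")
    if incident.time_to_resolve > MAX_RESOLUTION_TIME:
        reasons.append("resolution time above threshold")
    if not incident.detected_by_monitoring:
        reasons.append("monitoring failure")
    return reasons


def needs_postmortem(incident: Incident) -> bool:
    """A postmortem is required as soon as any single trigger fires."""
    return bool(postmortem_reasons(incident))
```

The point of writing it down this way is that the decision stops depending on who happens to be on call: if any trigger fires, a postmortem is written, and the returned reasons can go straight into the document's summary.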
A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. If a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment.
You can’t "fix" people, but you can fix systems and processes to better support people making the right choices when designing and maintaining complex systems.
Avoid Blame and Keep It Constructive
No Postmortem Left Unreviewed
Photo by Kenny Eliason on Unsplash