An Engineer’s Rite of Passage

Molly Struve (she/her) - Jan 12 '19 - Dev Community

It is a rite of passage for every engineer to take down production. Whether it is a full-blown 500 page served to all users or broken background processing, at some point in your career, you will take down production. When it happens, especially for the first time, it can be ROUGH! At least, I know it was for me.

My First Production Outage

Shortly after being hired by Kenna, I was working on a ticket that required me to add a column to a table in the database. We were using simple Rails at the time, so all I did was write a migration for the new column and issue a PR. The PR was approved and I immediately merged it. Every migration I had ever run in my hefty 2+ years of experience took a few seconds. Why should this be any different? I’m sure some of you can see where this is headed 😉

Unfortunately, the migration did not take a few seconds. You see, the table I added the column to was the BIGGEST table in our database! Hundreds of millions of rows big. As soon as that migration started, the entire table locked itself and remained locked for 3 hours. I should also mention, this table gets A LOT of writes. Jobs meant to update the table started blowing up left and right. It was a disaster. Luckily, we were pretty small at the time, so we waited it out. Once the migration finished, we retried the jobs that had failed and came out the other side just fine.

Now the business might have been fine, but I was devastated. A couple of days later my boss wanted to talk to me. I thought for sure this was it: I was going to get fired. But instead, my boss asked how I was doing after the outage. I responded that I was hanging in there. He went on to assure me that it wasn’t just my fault and to remember that someone else had approved the PR. He also said he could have done more to prevent it. He explained that this is just what happens sometimes and that I was still learning, so I shouldn’t beat myself up about it.

Hearing that was exactly what I needed. It has been 3 years since my first outage and I have taken down production, in varying degrees, more than once in that time. Does that make me a bad engineer? NO WAY! Every time it happens, I learn from the experience and I take steps to prevent it from ever happening again. I also remind myself that it’s not the end of the world when it does happen. Luckily, I work with an amazing group of people, and usually after a few months, we look back on our mistakes and laugh about them.

What is your production outage story?

How did you deal with it at the time? What did you learn from it? Let's share some war stories, so the next time an engineer takes down production, they can read this thread and be reminded that they are not alone!!! 🤗
