The Slack Outage is a Wake Up Call For All IT Orgs

OverOps - Aug 7 '19 - Dev Community

Slack took quite a beating during and after the service outage that happened last Monday morning. Here’s a small sample of the headlines (before they were updated) that come up on a simple news search of Slack:

  • Slack is Experiencing Worldwide Outage, Degraded Performance
  • Yes, Slack is Down.
  • Happy Monday, Slack is Down
  • How Microsoft Teams May Have Caused Today’s Slack Outage

The last one stands out from the others - it was the only one that I actually clicked on. What do they mean Microsoft caused the outage? What’s the connection? Upon opening the article, I realized they weren’t talking about Microsoft teams, they were talking about Microsoft Teams.

Here’s the timeline they lay out to explain what they’re talking about:

  • July 11 - Microsoft announces they hit 13 million active daily users in June (Slack reported 10 million users in January)
  • July 22 - Slack relaunches their desktop app to load faster and use less memory
  • July 29 - Slack goes down

Their theory, then, is that at some point Slack started to work on a major upgrade to their backend system. Feeling the pressure after Microsoft’s announcement, Slack pushed the new update to production before it was ready and all hell broke loose. Seems to make sense more or less.

Of course, engineers have probably been working on this update for several months at least, and it’s hard to say whether that announcement would be enough to interrupt the product roadmap. Still, it’s definitely possible that engineering managers started applying additional pressure on teams to build, test and deploy faster than originally expected.

One thing that is clear is that this outage was caused by changes made to the application's code. Similar outages reported a few days prior, on the 26th, were determined to be a result of changes made to the code the night before. The engineering team rolled back those changes and started to deploy intermittent fixes.

The latest outage on Monday morning follows the same story. The engineering team reported that on July 29th, they “made a change that inadvertently caused some performance issues, including messages failing to send.” Roughly an hour after the service went down, Slack announced that users once again had the “all clear” to use their Slack channels without issue.

What caused the recent “Slack-out”?

In the space of a week, Slack’s engineering team deployed a massive update to their entire desktop application plus at least 2 other (presumably) smaller code changes. And they aren’t the only ones pushing code at a breakneck pace.

Over the last half decade or so, the average release cycle in enterprise IT organizations has shrunk from around 12 months to just 3 weeks. Many organizations, like Slack, are deploying new code to production weekly or even daily in an effort to out-innovate competitors and to please customers that are jonesing for new features.

Unfortunately, in many cases the increased pressure and the overall sentiment that “fast isn’t fast enough” come at the expense of code quality and reliability. New automated testing frameworks and additional tooling have helped to limit the impact, but there’s still no way to account for every possible scenario.

How does this impact the company?

The most obvious way that application failures affect the business is through negative customer experiences. Unlike a less critical error that affects only a handful of users, an application outage has the potential to unite the public against the company, as evidenced by this headline from CNN regarding the Slack outage:
Breaking: Slack Is Down, Twitter Goes Berserk

#SlackOutage, #SlackDown and others were trending last Monday, similar to recent outages from Facebook, Google and, ironically, Twitter (#TwitterDown was trending once the system was back up). Not only does this help to sway public opinion, it can begin to form a sort of herd mentality against the company. Just think about public opinion of major US airlines… We won’t name names.

In the short term, negative customer experiences on such a large scale hurt the brand’s reputation. In the long term, the damage to the brand from such events hurts the company’s bottom line.

Poor customer experience isn’t the only way these issues impact a company’s bottom line. Debugging and troubleshooting time means shifting developer and operations resources away from product innovation (which was our original goal…). Contractual SLAs may also be breached, which could lead to additional financial repercussions.

Plus, errors in general can contribute to higher log ingestion and storage costs and infrastructure overhead. Here’s a calculator you can use to find out how much your error volume is costing your company each year: https://calculator.overops.com/

How can you stop this from happening to you?

Companies like Slack are facing a paradoxical need to build and deploy faster than before while simultaneously improving the quality and reliability of their applications. With less time to write and test the code, improving--or even maintaining--code quality is no easy task.

To succeed, it’s important to track metrics that signal risk to the application and to create automated quality gates based on those metrics. To help ensure a new release won’t impact customers once deployed to production, here are 4 crucial metrics to track:

  • New errors
  • Increasing errors
  • Resurfaced errors
  • Slowdowns

In addition to data and metrics, accountability for engineers is incredibly important for ensuring code quality and application health. You can find more information about building a culture of accountability here.
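
To make the idea of an automated quality gate a bit more concrete, here is a minimal sketch in Python of how the four metrics above (new, increasing, resurfaced errors and slowdowns) could be checked before a release goes out. Everything in it is an assumption made for illustration: the `ErrorStat` shape, the `evaluate_gate` and `gate_passes` helpers, the 1.5x error-rate and 1.2x latency thresholds, and the sample data are all hypothetical, and none of it reflects how Slack or any particular tool implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorStat:
    """Counts for one error type in one release (hypothetical shape)."""
    count: int              # times the error occurred
    calls: int              # times the surrounding code path ran
    resolved: bool = False  # True if the team had marked it as fixed

    @property
    def rate(self) -> float:
        # Errors per call; 0.0 when the code path never ran.
        return self.count / self.calls if self.calls else 0.0


def evaluate_gate(baseline, candidate, baseline_ms, candidate_ms,
                  rate_factor=1.5, slowdown_factor=1.2):
    """Compare a candidate release against the baseline release.

    baseline / candidate: dict of error name -> ErrorStat
    baseline_ms / candidate_ms: dict of transaction name -> average latency (ms)
    The two thresholds are illustrative defaults, not recommendations.
    """
    findings = {"new": [], "increasing": [], "resurfaced": [], "slowdowns": []}

    for name, stat in sorted(candidate.items()):
        if stat.count == 0:
            continue  # the error did not occur in the candidate release
        previous = baseline.get(name)
        if previous is None:
            findings["new"].append(name)         # never seen before
        elif previous.resolved:
            findings["resurfaced"].append(name)  # was fixed, came back
        elif stat.rate > previous.rate * rate_factor:
            findings["increasing"].append(name)  # error rate jumped

    for name, ms in sorted(candidate_ms.items()):
        before = baseline_ms.get(name)
        if before is not None and ms > before * slowdown_factor:
            findings["slowdowns"].append(name)   # transaction got slower

    return findings


def gate_passes(findings) -> bool:
    """The gate passes only when no category has any findings."""
    return not any(findings.values())


# Made-up sample data for two releases.
baseline = {
    "NullPointerException": ErrorStat(count=10, calls=1000),
    "TimeoutError": ErrorStat(count=0, calls=1000, resolved=True),
    "ParseError": ErrorStat(count=5, calls=1000),
}
candidate = {
    "NullPointerException": ErrorStat(count=30, calls=1000),
    "TimeoutError": ErrorStat(count=2, calls=1000),
    "ParseError": ErrorStat(count=5, calls=1000),
    "KeyError": ErrorStat(count=1, calls=1000),
}
baseline_ms = {"send_message": 100.0, "load_channel": 200.0}
candidate_ms = {"send_message": 150.0, "load_channel": 210.0}

findings = evaluate_gate(baseline, candidate, baseline_ms, candidate_ms)
print(findings)
print("deploy" if gate_passes(findings) else "block release")
```

With this sample data, the gate flags one finding in each category and blocks the release. In a real pipeline, the counts and latencies would come from whatever monitoring or logging system you already use, and the thresholds would need tuning to your own tolerance for risk.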