We looked into the most time-consuming tasks for developers. It turns out that, on average, more than 25% of their time is spent troubleshooting.
More than a full day of a developer’s work week can be spent troubleshooting production errors, and in many cases it’s even more than that. We hear all the time from engineering teams that their developers are spending at least 25% of their time, on average, solving (or trying to solve) production issues.
Where does all of this time come from? And how does it add up so quickly?
1. Identifying there was an error
The first step in solving any problem is admitting you have one in the first place. Then, you should probably figure out what the problem is, if you don’t already know. Surprisingly (or not), this is actually one of the parts of the debugging process that gives developers the most trouble.
How do you first figure out that something isn’t working right?
From the teams that we spoke with, we learned that, on average, well over half of production errors are reported by end users. Many of those companies rely on users to provide feedback for up to 80-90% of their errors.
Right away, we know that it isn’t good practice to rely on your users to tell you when something isn’t working right. Even worse are the problems that AREN’T being reported by users at all, because those translate directly into low customer satisfaction and customer churn.
The problem with relying on end users to report errors is that, more often than not, their reports are missing critical information about the error. The dialogue, if you’re lucky enough to be able to have one, involves a lot of back and forth between engineering, support, and QA, and retrieving the information needed for troubleshooting sucks up a tremendous amount of time.
2. Determining the severity of the error
Whether your information is coming from user feedback or is showing up in the logs, determining the error rate is essential to recognizing if you’re dealing with a critical issue or not. If the error happens infrequently and doesn’t have a large impact on users, dedicating time to reproducing and resolving the problem may not be worth the cost when there are much more critical errors happening.
When relying on logs and end users, your means for understanding error rates and severity are limited. When looking at exceptions and logged warnings or errors, the error rate is a key metric for triaging the health of the system as a whole. To identify the most critical issues that require our developers’ attention, we use OverOps to sort through new errors and their frequency before zooming in on their root cause. Some of our most successful users implement an “Inbox Zero” approach to exceptions, treating them like emails that require handling. Here’s how they do it.
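To make that triage concrete, here’s a minimal sketch in Python of what sorting errors by frequency might look like. The log format and file name are hypothetical; the point is simply that counting occurrences per error type surfaces the noisiest issues first.

```python
import re
from collections import Counter

# Hypothetical log format, e.g.:
# "2019-03-07 12:00:01 ERROR c.e.Checkout - NullPointerException: cart was null"
ERROR_LINE = re.compile(r"\bERROR\b.*?-\s*(?P<type>\w+(?:Exception|Error))")

def triage(log_lines):
    """Count errors by exception type and rank them by frequency."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[match.group("type")] += 1
    # The most frequent errors bubble to the top of the "inbox".
    return counts.most_common()

if __name__ == "__main__":
    with open("app.log") as log_file:  # hypothetical log file
        for error_type, count in triage(log_file):
            print(f"{count:>6}  {error_type}")
```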
3. Locating the affected code (or, Sifting through log files)
Once the error has been identified, it’s time to find its actual location in the code and make sure it’s assigned to the right developer. There are a couple of ways to do this, but the most common practice is to spend hours, or even days, sifting through log files and looking for clues. Aside from this being comparable to looking for a needle in a haystack, the developer who is tasked with resolving the issue may not have a clear idea of what they’re looking for. Plus, the information they need may not have been written to the logs at all.
Using log management tools, like Splunk or the ELK Stack, can help cut through the noise, but they can’t help when the piece of information that you’re looking for was never written to the logs to begin with. In that case, the only way to get the information you need is to add logging verbosity and hope that when (if!) the error happens again, you can see what’s going on.
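As a rough illustration of what “adding verbosity” means in practice, here’s a sketch using Python’s standard logging module. The apply_discount function and the order and coupon fields are all hypothetical; the idea is to record the surrounding state up front so the next occurrence of the error leaves behind the clues you were missing.

```python
import logging

logger = logging.getLogger("checkout")  # hypothetical module name

def apply_discount(order, coupon):
    # Log the inputs we wished we had the last time this failed, at DEBUG
    # level so the extra verbosity can be switched on only while hunting.
    logger.debug("apply_discount: order_id=%s total=%s coupon=%s",
                 order.get("id"), order.get("total"), coupon)
    try:
        return order["total"] * (1 - coupon["percent_off"] / 100)
    except (KeyError, TypeError):
        # logger.exception records the stack trace *and* the state around it.
        logger.exception("discount failed: order=%r coupon=%r", order, coupon)
        raise
```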
4. Reproducing the error
If your logs don’t give you a clear answer as to why the error happened (and this is highly likely), trying to reproduce it is the next step. Plus, reproducing an error before you attempt to fix it is good practice regardless. After all, if you don’t first reproduce the problem, then how can you be sure that you’ve fixed it when you’re done debugging?
Unfortunately, with vague reports coming from users and unclear logging statements surrounding the error, finding the exact flow of events that caused it takes a lot of time and may even be impossible.
If you’re lucky, the problem might have occurred in a part of the code that you recently worked on, and you can easily deduce the steps that led to it. Otherwise, the most common way of finding the event flow is to try different things and observe the results, send the ticket to QA for more testing, look through the logs or maybe add additional logging statements. More often than not, failure in this step is what causes tickets to be closed with the infamous… “could not reproduce”.
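One way to make “reproduce first” concrete is to turn the suspected flow of events into a failing test before touching any code. Here’s a minimal sketch; the checkout module and the coupon scenario are hypothetical stand-ins for whatever your ticket describes.

```python
import unittest

from checkout import apply_discount  # hypothetical module under suspicion

class ReproduceCouponBug(unittest.TestCase):
    def test_order_with_malformed_coupon(self):
        # Encode the flow of events pieced together from the user report and
        # the logs. This test should fail the same way production does; once
        # it does, the fix is verified the moment it turns green.
        order = {"id": 42, "total": 100.0}
        coupon = {"percent_off": None}  # the state we suspect triggers the bug
        self.assertEqual(apply_discount(order, coupon), 100.0)

if __name__ == "__main__":
    unittest.main()
```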
5. Entering “war room mode”
It’s hard to dispute the benefits of working in a “war room-like” set-up. It can bring with it project clarity, stronger communication and increased productivity. So, maybe you’re thinking about rearranging the furniture in your office to boost innovation and progress, or maybe your application crashed for some unknown reason and you need to figure out why RIGHT NOW.
If you’re lucky, you might only have to do a war room once or twice a year when something catastrophic happens. For others, war room situations occur on a weekly, or even daily, basis. With so much time being dedicated to resolving issues, there’s hardly any time left to advance the product roadmap.
Some of the developers that we’ve spoken with recently described gathering for a war room situation and still not having the information they needed to move forward with a solution. Others described war room situations that lasted for 5 or 6 days. That’s bad. Not only do these situations take time away from the rest of your work, they can hurt the reputation and revenue of the company.
The Cure
The most effective way that we’ve found to cut down on the time your team spends debugging is to automate the error resolution process. That means automating not only the identification of errors but, more importantly, the root cause analysis itself, with access to the complete source code and variable state across the entire call stack of every error.
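To give a feel for what “variable state across the entire call stack” means (this isn’t how OverOps works under the hood; it’s just a simplified Python sketch of the concept), here’s one way to capture the local variables of every frame when an error occurs:

```python
import sys
import traceback

def snapshot_on_error(func, *args, **kwargs):
    """Run func and, on failure, capture the locals of every stack frame."""
    try:
        return func(*args, **kwargs)
    except Exception:
        _, _, tb = sys.exc_info()
        frames = []
        while tb is not None:
            frame = tb.tb_frame
            frames.append({
                "function": frame.f_code.co_name,
                "file": frame.f_code.co_filename,
                "line": tb.tb_lineno,
                "locals": {name: repr(value) for name, value in frame.f_locals.items()},
            })
            tb = tb.tb_next
        # A real tool would ship this snapshot to an error store; here we
        # just print it next to the stack trace before re-raising.
        traceback.print_exc()
        for f in frames:
            print(f"{f['file']}:{f['line']} in {f['function']} -> {f['locals']}")
        raise
```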
OverOps created a tool that does just that. Not only does it provide you with the source and full call stack of any exception or error, it also reveals the exact variable state at the time of the error, so teams no longer need to spend endless hours trying to reproduce it. Check out how it works here.
Final Thoughts
Production debugging is absolutely necessary, but letting it take up so much time ISN’T! That 25% adds up quickly. It adds up to a full-time salaried developer for every 3 working on new features and projects. That’s crazy!
It’s time to start automating the production debugging process so that you can get out of the war room and back to your backlog.
Written by Tali Soroker, originally published on the OverOps Blog.