This blog post is adapted from a talk given by Amy Unger at RailsConf 2018 titled "Knobs, buttons & switches: Operating your application at scale."
We've all seen applications that keel over when a single, upstream service goes down. Despite our best intentions, sometimes an unexpected outage has us scrambling to make repairs. In this blog post, we'll take a look at some tools you can integrate into your application before disaster strikes. We'll talk about seven strategies that can help you shed load, fail gracefully, and protect struggling services. We'll also talk about the technical implementations of these techniques—particularly in Ruby, though the lessons are applicable to any language!
1. Maintenance mode
Transforming your application from a live, active site into a single downtime page is too large a hammer for many applications. However, it can be the only choice when you're not certain what the actual problem is, and it can be the perfect tool for a smaller application or microservice. Either way, it should be one of the first safeguards you build.
When you've gone into maintenance mode, your site should have a clear message with a link to a status page, so that your users know what to expect. More importantly, maintenance mode should be really easy to toggle. At Heroku, we have this implemented as an environment variable:
MAINTENANCE_MODE=on
While there are many ways to implement this mode, your implementation should be easy for your application operators to toggle, whether it's 2pm or 2am!
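For instance, here's a minimal sketch of honoring that variable in Rack middleware, assuming a Rack-based Ruby app (the status page URL is illustrative):
class MaintenancePage
  STATUS_URL = "https://status.example.com".freeze # illustrative

  def initialize(app)
    @app = app
  end

  def call(env)
    return @app.call(env) unless ENV["MAINTENANCE_MODE"] == "on"

    body = "We're down for maintenance. Check #{STATUS_URL} for updates."
    [503, { "Content-Type" => "text/plain" }, [body]]
  end
end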
2. Read-only mode
For applications that modify any data, read-only mode helps preserve a minimum amount of functionality for your users and gives them confidence that you haven't lost all their information. What this looks like depends on what functions your application provides. Perhaps it stores data in the database, uploads files to a document store, or changes user preferences held in a cache. Once you know what your application is modifying, you can plan for what happens when a user can't modify it.
Suppose a bad actor causes a sharp increase in replication lag, and your users can no longer make any changes on your site. Rather than take the whole application down, it may be better to enter a read-only mode. This can also be set as a simple environment variable:
READONLY_MODE=on
Customers often appreciate a read-only mode over a full-blown maintenance mode, since they can still retrieve data and confirm which of their most recent changes went through. Because your site is still serving visitors, your read-only mode should indicate via a banner (or some other UI element) that certain features are temporarily disabled while an ongoing problem is being resolved.
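Enforcing this in a Rails app might look something like the following sketch (the before_action approach, flash message, and fallback route are all illustrative):
class ApplicationController < ActionController::Base
  before_action :enforce_readonly_mode

  private

  def enforce_readonly_mode
    return unless ENV["READONLY_MODE"] == "on"
    return if request.get? || request.head? # reads are still allowed

    flash[:alert] = "We're temporarily read-only while we resolve an issue."
    redirect_back fallback_location: root_path
  end
end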
3. Feature flags
Often, feature flags are introduced as a means of A/B testing new site functionality, but they can also be used for handling incidents as they occur.
There are three different types of feature flags to consider:
- User-level: these flags are toggled on a per-user basis. During an outage, they're probably not very useful, due to their narrow effect.
- Application-level: these flags affect all users of your site. These might behave more like the maintenance mode and read-only mode toggles listed above.
- Group-level: these flags only affect a subset of users that you have previously identified.
When it comes to incident handling, group-level feature flags are the most useful of the three. You'll want to think about what groupings are meaningful for your application; these end up being a combination of what you want to control and who your application’s users are.
Suppose your application has started selling products to a limited number of users. One evening, there's a critical issue, and the feature needs to be disabled. We implement this at Heroku within the code itself. A single class can answer questions about the current application state and toggled features:
ApplicationSetting.get('billing-enabled')
=> true
This ApplicationSetting model could be backed by a database, by Redis, or by whatever provides the most resiliency to make sure that this check doesn't fail.
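As one minimal sketch, here's a database-backed version that fails closed if the settings store is unreachable (the table and column names are assumptions, and failing closed versus open is a per-feature design choice):
class ApplicationSetting < ActiveRecord::Base
  def self.get(name)
    find_by(name: name)&.value
  rescue ActiveRecord::ActiveRecordError
    false # if the settings store is down, treat the feature as disabled
  end
end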
Depending on your company’s need for stability, it may make sense to further subdivide into smaller segments. For example, perhaps your EU users have an entirely different feature flag for billing:
ApplicationSetting.get('billing-enabled-eu')
=> false
For earlier-stage companies, it may be silly to have so many levels of refinement, but if your directive is to shave customer impact down by tenths of percentages, you'll be grateful for the confidence about which segment of the application is being affected!
4. Rate limits
Rate limits are intended to protect you from disrespectful and malicious traffic, but they can also help you shed load. When you are receiving a mixture of normal and malicious traffic, you may need to artificially slow down everything while getting to the problem's source.
If you need to drop half your traffic, drop half your traffic. Your legitimate users may need to try two or three times to get a particular request handled, but if you make it clear to them that your service is unexpectedly (but intentionally!) rejecting a fair number of requests because it's under some sort of load, they will understand and adjust their expectations.
Rate limits can also protect access to your application from other parts of your business that rely on your service. Often, the single application that a user sees is actually a mesh of different services all acting together to create a single user experience. While you absolutely can make that internal system function when other services are down, it can be easier to just prioritize internal requests over external ones.
At Heroku, we implement rate limits as a combination of two different kinds of levers: a single default for every account, plus additional modifiers for different users. We find that this gives us the flexibility to provide certain users the rate limits that they need, while at the same time retaining a single control for how much traffic we are able to handle at any one point.
We set this value as an application setting with a global rate limit default:
ApplicationSetting.set('default-rate', 100)
Here, we're assuming it's 100 requests per minute—hopefully your site can handle much more than that! Next, we assign all the users a default modifier:
user.rate_limit_modifier
=> 1.0
Every user starts with a modifier of one. To determine a customer's rate limit, we multiply the application default by their modifier:
user.rate_limit
=> 100.0 # requests per minute
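Under the hood, that multiplication might look something like this sketch, assuming rate_limit_modifier is a column on users:
class User < ActiveRecord::Base
  def rate_limit
    ApplicationSetting.get('default-rate') * rate_limit_modifier
  end
end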
Suppose a power user writes in to support and provides legitimate reasons for needing twice the rate limit. In that case, we can set their modifier to two:
power_user.rate_limit_modifier
=> 2.0
This will grant them a rate limit of 200 requests per minute.
At some point, we might need to cut down our traffic. In that case, we can cut the rate limit in half:
ApplicationSetting.set('default-rate', 50)
Every user now has their default rate limit halved, including the power user above. But their value of 100 is still above everyone else's default of 50, so they can continue on with their important work.
Setting limits like this allows us to rapidly adjust traffic coming in without having to run a script over every single user to adjust their rate limit. It's important to note that, depending on your application, you may want to consider doing cost-based rate limiting. With a cost-based rate limiting system, you "charge" a user a number of tokens depending on the length of their request, such that they can't call your really slow endpoints as frequently as your blazing fast endpoints.
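As a hypothetical sketch of cost-based limiting, backed by Redis, slow endpoints could charge more tokens per call (the costs, key scheme, and RateLimited error are all made up):
require 'redis'

class RateLimited < StandardError; end

# Slow endpoints cost more tokens than fast ones.
ENDPOINT_COSTS = Hash.new(1).merge('/reports' => 10, '/search' => 5)

def charge!(redis, user, path)
  key = "tokens:#{user.id}:#{Time.now.to_i / 60}" # one bucket per minute
  spent = redis.incrby(key, ENDPOINT_COSTS[path])
  redis.expire(key, 60)
  raise RateLimited if spent > user.rate_limit
end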
Finally, it may seem counter-intuitive, but the more complex the algorithm for rate limiting, the worse it will be during denial of service attacks. The more computation time it takes for you to say that you can't process a request, the worse off you are when you're dealing with a flood of them. This is no reason to not implement sophisticated rate limiting if you need it, but it is a reason to make sure that you have other layers in place to handle distributed denial of service attacks.
5. Stop non-critical work
If your application is consistently pushing up against the limits of its infrastructure, you should be able to pull the plug on anything that isn't urgent. For example, if there are any jobs or processes that don't need to be fulfilled immediately, you should just be able to turn them off.
Let's take a look at how that can be accomplished in the context of a function which generates a monthly user report:
class MonthlyUserReport
  def run
    do_something
  end
end
do_something has a decent chance of being very computationally expensive. We can instead rework this class to first check that report generation is enabled:
class MonthlyUserReport
  def run
    return unless enabled?
    do_something
  end

  def enabled?
    ReportSetting.get("monthly_user_report")
  end
end
Now, before we do any work, we check to make sure report generation is enabled. Just like the application settings above, we have ReportSetting defined as a model here:
ReportSetting.get("monthly_user_report")
=> false
We can also generalize this implementation. Let's make the monthly user report inherit from a parent Report class:
class MonthlyUserReport < Report
  def build
    do_something
  end
end
Now, the monthly user report is only responsible for performing the build, and the parent class is responsible for figuring out whether or not the job ought to run:
class Report
  def run
    return unless enabled?
    build
  end

  def enabled?
    ReportSetting.get(self.class.name.underscore)
  end
end
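With that in place, pausing any report becomes a settings change rather than a deploy. Assuming a corresponding ReportSetting.set helper:
ReportSetting.set("monthly_user_report", false)
MonthlyUserReport.new.run # returns immediately without doing any work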
6. Known unknowns
Sometimes, observing the effects of a new change will be beyond the scope of a feature flag. Even if you believe that all your tests are flawless, you still carry doubt knowing that a disastrous outcome is looming in the shadows. In these cases, you can use a control/candidate system such as Scientist to monitor the behavior of both the new and the old code paths.
Scientist allows you to gradually roll out changes and refactors. It also allows you to enable or disable new or experimental code immediately if there are any problems. Being able to turn suspicious code paths off one-by-one is a really great way to get closer to the real problem faster.
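Here's a sketch using Scientist's use/try API; the class and both code paths are illustrative:
require 'scientist'

class PermissionChecker
  include Scientist

  def allowed?(user)
    science 'new-permission-check' do |experiment|
      experiment.use { legacy_allowed?(user) } # control: the current behavior
      experiment.try { policy_allows?(user) }  # candidate: the new code path
    end # science returns the control's result
  end
end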
7. Circuit breakers
Circuit breakers allow you to play nice with the services that you depend on. These are typically responsive shut-offs that safeguard interactions between services in dire situations. For example, if the number of 500 errors you've seen from a service in the last 60 seconds passes a threshold, a responsive shut-off can automatically step in and halt any calls to that struggling service. This gives the struggling service time to recover, and it also frees up your web processes from spending time calling a service that is most likely failing.
A responsive shut-off works far faster than any monitoring service. A monitoring service may page your on-call engineer, which prompts them to go to their computer, search for the right playbook, and finally take action. By the time the original page reaches a human, your responsive shut-off has already kicked in and you're in a better failure mode.
A circuit breaker could work in a similar way to the reports above. Just as the monthly user report inherited from a parent Report class, a billing service client could inherit from a Client class that sets up circuit breakers by default for any of its children.
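A hand-rolled sketch of that shape might look like this (the threshold, window, and post helper are all made up; in practice you might reach for an existing circuit breaker library):
class Client
  FAILURE_THRESHOLD = 10 # recent failures before the circuit opens
  WINDOW = 60            # seconds

  class CircuitOpen < StandardError; end

  def with_breaker
    failures.reject! { |t| t < Time.now - WINDOW } # keep only recent failures
    raise CircuitOpen if failures.size >= FAILURE_THRESHOLD

    yield
  rescue CircuitOpen
    raise # don't count an open circuit as a new service failure
  rescue StandardError
    failures << Time.now
    raise
  end

  private

  def failures
    @failures ||= []
  end
end

class BillingClient < Client
  def charge(invoice)
    with_breaker { post('/charges', invoice) } # post is illustrative
  end
end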
Further considerations
There are a number of additional caveats you may want to investigate.
The first one is around visibility. Whether it's a dashboard full of pretty graphs or just some command line output, it's really important for incident operations to gather the state of all your buttons and switches, wherever that state is stored, into one comprehensible place. Really consider how much work it takes to figure out whether a switch is flipped, because in general, the fancier and more sophisticated your switch is, the more likely it is to become part of the problem!
You should also routinely test that these switches actually work. You can't be confident that a switch will work in an emergency until you've tried it.
With the variety of techniques listed above, you will want to carefully consider how you build these safeguards and where you store their state. There are a number of options: a relational database, a caching layer, environment variables, and so on. As a last resort, you can even keep configuration in your code, if you're prepared to treat a fresh deploy as a way to control for failure.
Consider whether flipping a switch would require access to a component that could be down. If that switch requires access to a running production server and you can't communicate to that server, what happens? How might you change the behavior of your application if you can't deploy changes? If you have an immutable infrastructure, that might mean environment variables are totally out of the question for handling certain failure cases. One of the reasons why we rely so heavily on databases for storing our application state is because we have high confidence that we can retain access to the database in order to manually run SQL statements to toggle those safeguards.
What this boils down to is this: the more configurable you make your application at runtime, the less confident you can be that it will work in predictable ways. Have you tested how a certain user, when flagged into three features, interacts with all of your services? As you implement these knobs and buttons, keep in mind that you are trading knowledge for control. However, it's still a better deal at the end of the day. More control over mitigating issues in the app is better than confidently knowing the exact and particular way an app is down, but having no way to do anything about it.