You've built a service: you call it, it does something, and it returns a result. But how long does it take, and why does it sometimes take longer than your users would like? In this post I'll start with the basics and gradually introduce standardized terminology and the things that make the answer to this question more complicated, while highlighting key points to know.
To start with, we need a way to measure how long it takes, and to understand two fundamentally different points of view. If we measure the experience as an outside user calling the service, we measure how long it takes to respond. If we instrument our code to measure the requests from start to finish as they run, we're only measuring the service. This leads to the first key point: people get sloppy with terminology and often aren't clear about where they got their measurements.
Be careful to measure Response Time at the user, and Service Time at the service itself.
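As a minimal sketch of the difference (the function names and timings are made up for illustration), the only thing that changes is where the clock runs:

```python
import time

def do_the_work(request):
    """Stand-in for the real business logic (hypothetical)."""
    time.sleep(0.010)
    return "ok"

def handle_request(request):
    # Service Time: measured inside the service, covering only the work it does.
    start = time.monotonic()
    result = do_the_work(request)
    print(f"service time: {(time.monotonic() - start) * 1000:.1f} ms")
    return result

# Response Time: measured at the caller, including network hops and queueing
# on the way in and out (simulated here with extra sleeps).
start = time.monotonic()
time.sleep(0.005)   # stand-in for the request's network and queueing delay
handle_request("login")
time.sleep(0.005)   # stand-in for the response's network and queueing delay
print(f"response time: {(time.monotonic() - start) * 1000:.1f} ms")
```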
In real world systems there are many steps in a process, and each step takes time. The time for each step is the Residence Time, and it consists of some Wait Time and some Service Time. As an example, a user launches an app on their iPhone, and it calls a web service to authenticate the user. Why might that be slow sometimes? The amount of actual service time work required to generate the request in the phone, transmit it to a web service, look up the user, return the result and display the next step should be pretty much the same every time. The variation in response time is driven by waiting in line (Queueing) for a resource that is also processing other requests. Network transmission from the iPhone to the authentication server goes over many hops, and before each hop is a queue of packets waiting in line to be sent. If the queue is empty or short, the response will be quick, and if the queue is long, the response will be slow. When the request arrives at the server, it's also in a queue waiting for the CPU to start processing that request, and if a database lookup is needed, there's another queue to get that done.
Waiting in a Queue is the main reason why Response Time increases.
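To make the decomposition concrete, here's a small sketch of the authentication example with purely made-up per-step numbers; each step's Residence Time is its Wait Time plus its Service Time, and the user sees roughly the sum of them:

```python
# Hypothetical per-step timings (in seconds) for the authentication example.
steps = {
    "phone to server network": {"wait": 0.040, "service": 0.010},
    "web service CPU":         {"wait": 0.005, "service": 0.003},
    "database lookup":         {"wait": 0.015, "service": 0.002},
    "server to phone network": {"wait": 0.040, "service": 0.010},
}

for name, t in steps.items():
    residence = t["wait"] + t["service"]   # Residence Time = Wait Time + Service Time
    print(f"{name}: {residence * 1000:.0f} ms residence")

# End to end, the user-visible response time is roughly the sum of the residence
# times; most of it is waiting in queues, not doing work.
total = sum(t["wait"] + t["service"] for t in steps.values())
print(f"approximate response time: {total * 1000:.0f} ms")
```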
Most of the instrumentation we get from monitoring tools measures how often something completed, which we call Throughput. In some cases we also have a measure of incoming work as Arrival Rate. For a simple case like a web service with a steady state workload where one request results in one response, both are the same. However, retries and errors will increase arrivals without increasing throughput; rapidly changing workloads or very long requests like batch jobs will see temporary imbalances between arrivals and completed throughput; and it's possible to construct more complex request patterns.
Throughput is the number of completed successful requests. Check whether the Arrival Rate is different, and be sure you know what is actually being measured.
While it's possible to measure a single request flowing through the system using a tracing mechanism like Zipkin or AWS X-Ray, this discussion is about how to think about the effect of large numbers of requests and how they interact with each other. The average behavior is measured over a fixed time interval, which could be a second, minute, hour or day. There needs to be enough data to average together, and without going into the theory, a rule of thumb is that there should be at least 20 data points in an average.
For infrequent requests, pick a time period that averages at least 20 requests together to get useful measurements.
If the time period is too coarse, it hides the variation in workloads. For example, for a video conferencing system, measuring hourly call rates will miss the fact that most calls start around the first minute of the hour; it's easy to get a peak that overloads the system, so per-second measurements are more appropriate.
For spiky workloads, use high resolution one-second average measurements.
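Putting those two rules of thumb together, here's a hedged sketch of how you might pick a measurement interval (the helper function is hypothetical):

```python
def averaging_interval_seconds(request_rate_per_second, min_samples=20):
    """Hypothetical helper: pick a measurement interval that is at least one
    second (high resolution for spiky workloads) and long enough to average
    at least min_samples requests together."""
    return max(1.0, min_samples / request_rate_per_second)

print(averaging_interval_seconds(0.5))  # one request every 2s -> 40 second intervals
print(averaging_interval_seconds(500))  # busy service -> one second intervals
```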
Monitoring tools vary; however, it's rare to get direct measurements of how long the lines are in the various queues. It's also not always obvious how much Concurrency is available to process the queues. For most networks, one packet at a time is in transit, but for CPUs, each core or vCPU works in parallel to process the run queue. For databases, there is often a fixed maximum number of connections to clients, which limits concurrency.
For each step in the processing of a request, record or estimate the Concurrency being used to process it.
If we think about a system running in a steady state, with a stable average throughput and response time, then we can estimate the queue length simply by multiplying the throughput by the residence time. This is known as Little's Law. It's very simple, and it's often used by monitoring tools to generate queue length estimates, but it's only true for steady state averages of randomly arriving work.
Little's Law: Average Queue Length = Average Throughput * Average Residence Time
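As a quick sketch (the numbers are illustrative), the estimate is just a multiplication:

```python
def littles_law_queue_length(throughput_per_second, residence_time_seconds):
    """Average number of requests in the system (waiting plus in service).
    Only valid for steady state averages of randomly arriving work."""
    return throughput_per_second * residence_time_seconds

# Illustrative numbers: 200 requests/second with a 50 ms average residence time
# means about 10 requests are in the system on average.
print(littles_law_queue_length(200, 0.050))
```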
To understand why this works, and when it doesn't, it's important to understand how work arrives at a service and what determines the gap between requests. If you are running a very simple performance test in a loop, then the gap between requests is constant, Little's Law doesn't apply, queues will be short, and your test isn't very realistic unless you are trying to simulate a conveyor belt like situation. It's a common mistake to run this kind of test successfully, then put the service in production and watch it get slow or fall over at a much lower throughput than the test workload.
Constant rate loop tests don't generate queues, they simulate a conveyor belt.
For real world Internet traffic, with many independent users who aren't coordinating and who make single requests, the intervals between requests are random. So you want a load generator that randomizes the wait time between each request. Most tools do this with a uniformly distributed random wait, which is better than a conveyor belt but still isn't right. To simulate web traffic, and for Little's Law to apply, you need to use a negative exponential distribution, as described in this blog post by Dr Neil Gunther.
The proper random think time calculation is needed to generate more realistic queues.
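Here's a small sketch of the three arrival patterns; all three produce the same average rate, but only the exponential gaps give the clustering of arrivals that builds realistic queues:

```python
import random

requests_per_second = 100
mean_gap = 1.0 / requests_per_second
n = 10_000

# Conveyor belt: constant gaps, no queueing to speak of.
constant_gaps = [mean_gap] * n

# Common load generator default: uniformly distributed gaps, still too smooth.
uniform_gaps = [random.uniform(0, 2 * mean_gap) for _ in range(n)]

# Independent random arrivals (a Poisson process): negative exponential gaps,
# the assumption behind Little's Law.
exponential_gaps = [random.expovariate(requests_per_second) for _ in range(n)]

for name, gaps in [("constant", constant_gaps),
                   ("uniform", uniform_gaps),
                   ("exponential", exponential_gaps)]:
    print(f"{name:12s} mean gap {sum(gaps) / n:.4f}s  longest gap {max(gaps):.4f}s")
```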
However, it gets worse. It turns out that network traffic is not randomly distributed: it comes in bursts, and those bursts come in clumps. Think of what actually happens when a user starts an iPhone app. It doesn't make one request, it makes a burst of requests. In addition, users taking part in a flash sale will be synchronized to visit their app at around the same time, causing a clump of bursts of traffic. The distribution is known as Pareto or hyperbolic. Also, when networks reconfigure themselves, traffic is delayed for a while, a queue builds up, and then it floods the downstream systems with a thundering herd. Update: there's some useful work by Jim Brady and Neil Gunther on how to configure load testing tools to be more realistic, and a CMG2019 paper by Jim Brady on measuring how well your test loads are behaving.
Truly real world workloads are more bursty and will have higher queues and long tail response times than common load test tools generate by default.
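To get a feel for how different a heavy tailed distribution is, here's a sketch comparing exponential and Pareto samples; the shape parameter is made up for illustration, and real traffic needs to be measured rather than assumed:

```python
import random

n = 100_000
# Shape parameter chosen purely for illustration; real traffic has to be measured.
alpha = 1.5

exponential_gaps = sorted(random.expovariate(1.0) for _ in range(n))
pareto_gaps = sorted(random.paretovariate(alpha) for _ in range(n))

for name, gaps in [("exponential", exponential_gaps), ("pareto", pareto_gaps)]:
    print(f"{name:12s} median {gaps[n // 2]:.2f}  "
          f"p99.9 {gaps[int(0.999 * n)]:.2f}  max {gaps[-1]:.2f}")
```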
You should expect queues and response times to vary and to have a long tail of a few extremely slow requests even at the best of times when average utilization is low. So what happens as one of the steps in the process starts to get busy? For processing steps which don't have available concurrency (like a network transmission), as the utilization increases, the probability that requests contend with each other increases, and so does the residence time. The rule of thumb for networks is that they gradually start to get slow around 50-70% utilization.
Plan to keep network utilization below 50% for good latency.
Utilization is also problematic, and its measurements can be misleading, but it's defined as the proportion of time something is busy. For CPUs, where more executions happen in parallel, the slowdown happens at higher utilization, but it kicks in harder and can surprise you. This makes intuitive sense if you think about the last available CPU as the point of contention. For example, if there are 16 vCPUs, the last CPU is the last 6.25% of capacity, so residence time kicks up sharply around 93.75% utilization. For a 100 vCPU system, it kicks up around 99% utilization. The formula that approximates this behavior for randomly arriving requests in steady state (the same conditions that apply for Little's Law) is R = S / (1 - U^N), where R is the residence time, S is the service time, U is the utilization, and N is the number of processors.
Inflation of average residence time as utilization increases is reduced in multi-processor systems but "hits the wall" harder.
Unpicking this: take the average utilization as a proportion, not a percentage, and raise it to the power of the number of processors. Subtract that from 1, and divide it into the average service time to get an estimate of the average residence time. If average utilization is low, dividing by a number near 1 means that average residence time is nearly the same as the average service time. For a network that has a concurrency of N=1, 70% average utilization means that we are dividing by 0.3, and average residence time is about three times higher than at low utilization.
Rule of thumb is to keep inflation of average residence time below 2-3x throughout the system to maintain a good average user visible response time.
For a 16 vCPU system at 95% average utilization, 0.95^16 = 0.44 and we are dividing by 0.56, which roughly doubles average residence time. At 98% average utilization 0.98^16=0.72 and we are dividing by 0.28, so average residence time goes from acceptable to slow for only a 3% increase in average utilization.
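These numbers are easy to reproduce; a minimal sketch of the formula:

```python
def average_residence_time(service_time, utilization, processors):
    """R = S / (1 - U^N) for randomly arriving requests in steady state."""
    return service_time / (1.0 - utilization ** processors)

service_time = 1.0  # express residence time as a multiple of service time

for n, u in [(1, 0.70), (16, 0.95), (16, 0.98), (100, 0.99)]:
    r = average_residence_time(service_time, u, n)
    print(f"N={n:3d} at {u:.0%} utilization -> residence time {r:.1f}x service time")
```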
The problem with running a multiprocessor system at high average utilization is that small changes in workload levels have increasingly large impacts.
There is a standard Unix/Linux metric called Load Average which is poorly understood and has several problems. For Unix systems, including Solaris, AIX and HP-UX, it records the number of operating system threads that are running or waiting to run on a CPU. Linux also includes operating system threads that are blocked waiting for disk I/O. It then maintains three time-decayed values over 1 minute, 5 minutes and 15 minutes. The first thing to understand is that the metric dates back to single CPU systems in the 1960s, and I would always divide the load average value by the number of vCPUs to get a measure that is comparable across systems. The second is that it's not measured across a fixed time interval like other metrics, so it's not the same kind of average, and it builds in a delayed response. The third is that the Linux implementation is a bug that has become institutionalized as a feature, and it inflates the result. It's just a terrible metric to monitor with an alert or to feed into an autoscaling algorithm.
Load Average doesn't measure load, and isn't an average. Best ignored.
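If you do look at it anyway, at least normalize it by the vCPU count as described above; a quick sketch using Python's standard library:

```python
import os

# Divide the 1, 5 and 15 minute load averages by the vCPU count so the values
# are comparable across differently sized systems. On Linux the numbers still
# include threads blocked on disk I/O, and they are time decayed rather than
# fixed-interval averages, so treat them with caution.
vcpus = os.cpu_count() or 1
one_min, five_min, fifteen_min = (load / vcpus for load in os.getloadavg())
print(f"load average per vCPU: {one_min:.2f} {five_min:.2f} {fifteen_min:.2f}")
```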
If a system is over-run, and more work arrives than can be processed, we reach 100% utilization, and the formula has a divide by zero which predicts infinite residence time. In practice it's worse than this, because when the system becomes slow, the first thing that happens is that upstream users of the system retry their request, which magnifies the amount of work to be done and causes a retry storm. The system will have a long queue of work, and go into a catatonic state where it's not responding.
Systems that hit a sustained average utilization of 100% will become unresponsive, and build up large queues of work.
When I look at how timeouts and retries are configured, I often find too many retries and timeouts that are far too long. This increases work amplification and makes a retry storm more likely. I've talked about this in depth in the past, and I've added a new post on this subject. The best strategy is a short timeout with a single retry, if possible over a different connection that goes to a different instance of the service.
Timeouts should never be set the same across a system, they need to be much longer at the front end edge, and much shorter deep inside the system.
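As a hedged sketch of the short-timeout, single-retry strategy (assuming the Python requests library; the endpoints and timeout value are illustrative, not a recommendation for your system):

```python
import requests  # assuming the requests library; any HTTP client works the same way

def call_with_single_retry(urls, timeout_seconds=0.5):
    """Short timeout with a single retry, preferably against a different
    instance of the service. The urls list should contain at least one
    endpoint; the timeout value here is illustrative, not a recommendation."""
    last_error = None
    for url in urls[:2]:  # the first attempt plus at most one retry
        try:
            return requests.get(url, timeout=timeout_seconds)
        except requests.exceptions.RequestException as err:
            last_error = err  # try the next instance instead of the same one
    raise last_error

# Usage sketch with two hypothetical instances of an authentication service:
# call_with_single_retry(["https://auth-1.example.com/login",
#                         "https://auth-2.example.com/login"])
```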
The usual operator response is to clear an overloaded queue by rebooting the system, but a well designed system will limit its queues and shed work by silently dropping requests or generating fast-fail responses to incoming requests. Databases and other services with a fixed maximum number of connections behave like this: when you can't get another connection to make a request, you get a fast-fail response. If the connection limit is set too low, the database will reject work that it has capacity to process, and if it's set too high, the database will slow down too much before it rejects more incoming work.
Think about how to shed incoming work if the system reaches 100% average utilization, and what configuration limits you should set.
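One way to shed work is a simple concurrency limit with a fast-fail response, which behaves much like a fixed size database connection pool; a minimal sketch (the limit and the response format are made up):

```python
import threading

class LoadShedder:
    """Fast-fail when more than max_concurrent requests are already in flight.
    The limit and the response format are made up; set the limit from measured
    capacity, well before the point where the system slows down too much."""

    def __init__(self, max_concurrent=100):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def handle(self, do_work, request):
        if not self._slots.acquire(blocking=False):
            # Shed the request immediately rather than letting a queue build up.
            return {"status": 503, "body": "overloaded, retry later"}
        try:
            return {"status": 200, "body": do_work(request)}
        finally:
            self._slots.release()

# Usage sketch with a trivial worker function.
shedder = LoadShedder(max_concurrent=2)
print(shedder.handle(lambda req: f"processed {req}", "req-1"))
```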
The best way to maintain a good response time under extreme conditions is to fail fast and shed incoming load. Even under normal conditions, most real world systems have a long tail of slow response times. However, with the right measurements we can manage and anticipate problems, and with careful design and testing it's possible to build systems that manage their maximum response time to user requests.
To dig deeper into this topic, there's a lot of good information in Neil Gunther's blog, books, and training classes. I worked with Neil to develop a summer school performance training class that was hosted at Stanford in the late 1990s, and he's been running events ever since. For me, co-presenting with Neil was a formative experience, a deep dive into queueing theory that really solidified my mental models of how systems behave.
Picture of lines of waiting barrels taken by Adrian at a wine cellar in Bordeaux.