TODO: Why would someone want to use this? What problem does it solve, and how?
- How do we get maximum goodput (successful throughput) in an overloaded system?
- Overload, goodput vs throughput
- Need to throttle or shed load in order to protect systems from overload
- Ideally, (1) we send the amount of traffic that the system can handle, but no more, and (2) the system can shed any excess load with a minimum of work.
- How do we prevent upstream systems from overloading downstream systems?
- Backpressure
- E.g.
External API -> Internal service -> Database
- Need to be able to propagate throttling
- Reject, drop or pause (slow down)
- Systems need a way to communicate backpressure
- How do we detect load or overload?
- Increases in latency
- Load-based request failures e.g.
- Explicit: HTTP 429, gRPC RESOURCE_EXHAUSTED
- Implicit: timeouts, gRPC DEADLINE_EXCEEDED
TODO: When shouldn't you use it?
All systems have hard limits on the amount of concurrency they can support. The available concurrency is limited by resources such as CPU, memory, disk or network bandwidth, thread pools or connection pools. For most of these (memory being one exception), when they saturate, queues start to build up.
Little's Law can be applied to steady state systems (that is, systems where the queues are not growing due to overload):
L = λW

where:
- L = the number of jobs in a stationary system
- λ = the long-term average effective arrival rate
- W = the average time that the system takes to process a job

E.g. for requests to a server: concurrency = RPS * average latency.
A single CPU has a natural concurrency limit of 1 before jobs start queueing and latency increases. If we know that jobs take 10ms on average, then the CPU can handle 100 RPS on average.
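As a quick check of the arithmetic, here is the same calculation as a small Go sketch, using the numbers from these notes (100 RPS at 10ms, and the 80 RPS at 1s overload example further down):

```go
package main

import (
	"fmt"
	"time"
)

// concurrency applies Little's Law (L = λW): arrival rate in requests per
// second multiplied by the average time each request takes.
func concurrency(arrivalRate float64, avgLatency time.Duration) float64 {
	return arrivalRate * avgLatency.Seconds()
}

func main() {
	// The single-CPU example: 100 RPS at 10ms average latency.
	fmt.Println(concurrency(100, 10*time.Millisecond)) // 1
	// The overload example further down: 80 RPS at 1s average latency.
	fmt.Println(concurrency(80, time.Second)) // 80
}
```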
For complex systems, we generally do not know the concurrency limits.
TODO:
- Rate is one dimensional
- Doesn't consider how long things take (how expensive the work is for each request)
- Concurrency is a better measure of throughput, considering both rate and latency
- Reacts to changes in both rate and latency, that is: more jobs, or more expensive jobs
- This is the difference between:
- How many people should we let into the nightclub per minute?
- How many people should we allow into the nightclub at once?
- Example:
- Load tests revealed that at rates much higher than 100 RPS, the system starts to get overloaded.
- Expected average latency = 100ms
- Arrival rate = 80 RPS
- Rate limit = 100 RPS
- Imagine an unexpected situation causes the server to get overloaded. Perhaps some expensive requests are being made, or capacity has been reduced because of a crash.
- Latency goes up 10x to 1s. 80 RPS is still allowed through. Rate limiting doesn't protect the system.
- Concurrency limit = 10 (100 RPS * 0.1s expected latency)
- Imagine the same scenario.
L = λW = 80 * 1 = 80, which is much higher than the limit of 10. At this RPS and latency, 7/8 of requests would be rejected, reducing load on the server. As load reduces, latency goes down, allowing more requests per second through (see the sketch after this list).
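A small sketch of the comparison using the numbers above (hypothetical values taken from the example; needs Go 1.21+ for the built-in min):

```go
package main

import "fmt"

func main() {
	const (
		rateLimit        = 100.0 // RPS limit found by load testing (hypothetical)
		concurrencyLimit = 10.0  // rateLimit * expected 0.1s latency
		arrivalRate      = 80.0  // incoming RPS
	)

	// Compare normal latency (0.1s) with a 10x degradation (1s).
	for _, latency := range []float64{0.1, 1.0} {
		// A rate limiter admits everything under its limit, regardless of latency.
		admittedByRate := min(arrivalRate, rateLimit)

		// A concurrency limiter admits at most concurrencyLimit/latency RPS,
		// because admitted RPS * latency <= concurrencyLimit (Little's Law).
		admittedByConcurrency := min(arrivalRate, concurrencyLimit/latency)

		fmt.Printf("latency=%.1fs: rate limiter admits %.0f RPS, concurrency limiter admits %.0f RPS\n",
			latency, admittedByRate, admittedByConcurrency)
	}
}
```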
TODO:
- Static
- How do you know what to set it to? Even if you can work out a good number, what if that changes? E.g. a downstream system increases/decreases capacity, or the workload changes?
- Dynamic
- Can automatically detect and respond to overload
- Circuit breakers
- All or nothing - if a circuit breaker sees that a downstream system is overloaded, it stops all traffic to that system. This is OK for a complete outage, but many cases of overload are likely to be "brownouts" where some traffic could still be processed.
- Throttling
- More responsive to partial outages. Traffic can be reduced to a level the downstream system can handle. Overall availability during a partial outage can be much higher.
Example of a circuit breaker causing a complete outage for a particular API route:

availability = 0%          overloaded
   v                    v              v
Client -> API -> internal system -> database
           ^
 circuit breaker trips
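One way to see the difference in code: a circuit breaker's decision is all or nothing (shed 0% or 100%), while throttling can shed only the excess. A minimal, illustrative sketch:

```go
package main

import "fmt"

// shedFraction returns the fraction of traffic to reject, given measured
// concurrency and a limit. A circuit breaker's answer is 0% or 100%;
// throttling sheds only the excess, so the rest keeps being served.
func shedFraction(measured, limit float64) float64 {
	if measured <= limit {
		return 0
	}
	return (measured - limit) / measured
}

func main() {
	fmt.Println(shedFraction(8, 10))  // 0    – under the limit, shed nothing
	fmt.Println(shedFraction(40, 10)) // 0.75 – shed the excess, keep serving ~10
}
```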
- TCP congestion control
- Delay-based (RTT in TCP, latency here)
- Loss-based (packet loss in TCP, errors caused by load here)
- Delay-based - latency
- TODO:
- Loss-based - request failures
- TODO:
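The two TCP-style strategies could look something like the sketch below. These are illustrative, not any particular library's algorithm; real limiters add smoothing, sampling windows and periodic probing, and the constants here (0.9 decrease factor, 2x growth cap) are arbitrary assumptions:

```go
package main

import "fmt"

// aimdLimit is a loss-based limit in the spirit of TCP AIMD: grow the limit
// additively while requests succeed, cut it multiplicatively on load-related
// failures (HTTP 429, gRPC RESOURCE_EXHAUSTED, timeouts).
type aimdLimit struct {
	limit, minLimit, maxLimit float64
}

func (l *aimdLimit) onSample(loadFailure bool) {
	if loadFailure {
		l.limit = max(l.minLimit, l.limit*0.9) // multiplicative decrease
	} else {
		l.limit = min(l.maxLimit, l.limit+1) // additive increase
	}
}

// gradientLimit is a delay-based limit: compare a long-term baseline latency
// with the latest measurement and scale the limit by their ratio.
type gradientLimit struct {
	limit       float64
	baselineRTT float64 // e.g. a long-window minimum or moving average
}

func (l *gradientLimit) onSample(rtt float64) {
	gradient := min(l.baselineRTT/rtt, 2.0) // < 1 when latency is rising; cap growth
	l.limit = max(1, l.limit*gradient)
}

func main() {
	a := &aimdLimit{limit: 20, minLimit: 1, maxLimit: 100}
	a.onSample(false) // success: limit grows to 21
	a.onSample(true)  // load-based failure: limit shrinks to ~18.9
	fmt.Println(a.limit)

	g := &gradientLimit{limit: 20, baselineRTT: 0.1}
	g.onSample(0.2) // latency doubled: limit halves to 10
	fmt.Println(g.limit)
}
```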
- We need a way to detect overload.
- Causes: Resources such as CPU, memory, threads, connections, bandwidth are the underlying bottlenecks
- It can be hard to predict what the bottleneck will be and what effect it will have, especially in large, complex systems.
- Symptoms: Instead, we can measure symptoms - increased latency (our own, or from other systems) or failures from overloaded systems
- In the spirit of alerting on symptoms, not causes
- Push vs pull
- Push – e.g. clients sending requests to servers
- Pull – e.g. consumers pulling messages off a queue
- For backpressure to work in a push-based system, upstream systems need to know when to stop sending traffic.
- A couple of ways to implement this:
- Explicit: downstream systems sending data to the client indicating it is congested or overloaded
- "Please don't give me any more work, I'm very busy"
- e.g. ECN in TCP, HTTP 429 or 503, gRPC RESOURCE_EXHAUSTED
- Requires more coupling between services
- Implicit: upstream systems inferring congestion or overload
- "My colleague looks very overworked, perhaps I won't give them any more tasks to do"
- e.g. connection timeouts, refused connections, request timeouts, increasing latency
- Requires some guesswork, symptoms are not always caused by congestion
- E.g. connection timeouts can happen when firewalls drop packets
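A rough sketch of the explicit/implicit distinction in code, using standard HTTP statuses and a network timeout as the signals. The classification rules are illustrative and not exhaustive, and the timeoutErr type is a stand-in for a real network error:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/http"
)

// backpressureSignal classifies a response as a backpressure signal.
// Explicit signals come from the server saying it is overloaded; implicit
// signals are symptoms we infer, which may have other causes.
func backpressureSignal(status int, err error) string {
	var netErr net.Error
	switch {
	case status == http.StatusTooManyRequests || status == http.StatusServiceUnavailable:
		return "explicit" // the server told us to back off
	case errors.As(err, &netErr) && netErr.Timeout():
		return "implicit" // could be overload, could be a dropped packet
	default:
		return "none"
	}
}

func main() {
	fmt.Println(backpressureSignal(http.StatusTooManyRequests, nil)) // explicit
	fmt.Println(backpressureSignal(0, &timeoutErr{}))                // implicit
	fmt.Println(backpressureSignal(http.StatusOK, nil))              // none
}

// timeoutErr is a stand-in for a real network timeout error.
type timeoutErr struct{}

func (timeoutErr) Error() string   { return "timeout" }
func (timeoutErr) Timeout() bool   { return true }
func (timeoutErr) Temporary() bool { return true }
```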
- Client-side
- Compete with each other – algorithm needs to be fair
- Loss-based can "muscle out" delay-based algorithms
- Load balanced across multiple servers
- Assume load balancers are working well
- Prefer loss-based? Why?
- Can use rejected requests as a backpressure signal, especially for offline/batch applications.
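On the client side, enforcement might be a local in-flight gate like the sketch below; the limit itself would be adjusted by a loss-based algorithm (like the AIMD sketch above), using rejections as the signal. Names and structure are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// limiter enforces a concurrency limit: at most `limit` requests in flight.
// The limit would be adjusted over time from backpressure signals such as
// HTTP 429 responses.
type limiter struct {
	mu       sync.Mutex
	limit    int
	inFlight int
}

// tryAcquire reserves a slot if one is free; the caller must call release
// when the request finishes. Returning false means shed or delay the request.
func (l *limiter) tryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight >= l.limit {
		return false
	}
	l.inFlight++
	return true
}

func (l *limiter) release() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.inFlight--
}

func main() {
	l := &limiter{limit: 2}
	fmt.Println(l.tryAcquire()) // true
	fmt.Println(l.tryAcquire()) // true
	fmt.Println(l.tryAcquire()) // false – limit reached, reject or wait
	l.release()
	fmt.Println(l.tryAcquire()) // true
}
```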
- Server-side
- Prefer delay-based? Why?
- Can proactively shed load and protect themselves from becoming overloaded by detecting early signs of congestion.
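On the server side, a sketch of proactive shedding as HTTP middleware: reject cheaply once in-flight requests exceed the current limit. The limit here is a plain atomic value with an arbitrary starting point; in practice it would be driven by a delay-based algorithm watching the server's own latency:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// shedMiddleware rejects requests once the number in flight exceeds the
// current limit, so the server sheds excess load cheaply instead of queueing.
func shedMiddleware(limit *atomic.Int64, next http.Handler) http.Handler {
	var inFlight atomic.Int64
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if inFlight.Add(1) > limit.Load() {
			inFlight.Add(-1)
			w.Header().Set("Retry-After", "1")
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
			return
		}
		defer inFlight.Add(-1)
		next.ServeHTTP(w, r)
	})
}

func main() {
	var limit atomic.Int64
	limit.Store(10) // hypothetical starting limit

	handler := shedMiddleware(&limit, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	}))
	http.ListenAndServe(":8080", handler)
}
```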
- Example: allocating 75% of capacity to client-serving traffic, 25% for batch jobs, but allowing either to use excess capacity.
- Clients identified using e.g. headers (must be trusted)
- Generally static
- How might dynamic partitioning work?
- Using an unspecified set of values, e.g. by customer ID?
- Would probably want to keep the cardinality low, using a naturally limited identifier
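A minimal sketch of the static 75%/25% partitioning described above, with hypothetical partition names. Each partition is always admitted up to its guaranteed share and can borrow idle capacity beyond it; the tradeoff in this simple version is that reclaiming a borrowed share can briefly push the total over the limit until borrowed requests finish:

```go
package main

import (
	"fmt"
	"sync"
)

// partitionedLimiter splits a total concurrency limit between named partitions
// (e.g. 75% interactive, 25% batch).
type partitionedLimiter struct {
	mu       sync.Mutex
	total    int
	shares   map[string]int // guaranteed in-flight slots per partition
	inFlight map[string]int
}

func (p *partitionedLimiter) tryAcquire(partition string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()

	totalUsed := 0
	for _, n := range p.inFlight {
		totalUsed += n
	}
	// Admit within the guaranteed share, or borrow while total capacity is free.
	if p.inFlight[partition] < p.shares[partition] || totalUsed < p.total {
		p.inFlight[partition]++
		return true
	}
	return false
}

func (p *partitionedLimiter) release(partition string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.inFlight[partition]--
}

func main() {
	l := &partitionedLimiter{
		total:    20,
		shares:   map[string]int{"interactive": 15, "batch": 5}, // 75% / 25%
		inFlight: map[string]int{},
	}
	// With interactive idle, batch can borrow up to the full limit...
	for i := 0; i < 20; i++ {
		l.tryAcquire("batch")
	}
	fmt.Println(l.tryAcquire("batch")) // false – nothing left to borrow
	// ...but interactive is still admitted up to its guaranteed share.
	fmt.Println(l.tryAcquire("interactive")) // true
}
```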
- Imagine a server with three operations:
- Read resource by ID
- Full text search for resources
- Write new resource
- One limiter per service is simple, and protects the whole service if it is overloaded. If all operations have similar latency characteristics it can work well, but a wide range of potential latencies could make a delay-based limiter unpredictable.
- Using separate limiters per operation could work more predictably, but capacity would need to be distributed fairly between them. E.g. it would be bad if one operation could take all the capacity and not leave enough for the others.
- The algorithms are generally fair, so could work?
- Alternatively, use a partitioned limiter?
- TODO: Need to think about this one
- End-to-end principle, same as for retries
- What about:
- Every server protects itself with a limiter
- Backpressure from servers is propagated upwards
- Client-side limiting is done at the top level?
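A sketch of propagating backpressure upwards in the push model: an intermediate service that surfaces a downstream 429/503 to its own caller instead of absorbing it, so a client-side limiter at the top level can react. The downstream URL and route are hypothetical:

```go
package main

import (
	"io"
	"net/http"
)

// handle calls a downstream service and propagates its backpressure: when the
// downstream says it is overloaded (429/503), tell our own caller to back off
// rather than retrying or queueing.
func handle(w http.ResponseWriter, r *http.Request) {
	resp, err := http.Get("http://internal-service.local/resource") // hypothetical downstream
	if err != nil {
		http.Error(w, "downstream unavailable", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode == http.StatusServiceUnavailable {
		w.Header().Set("Retry-After", resp.Header.Get("Retry-After"))
		http.Error(w, "overloaded downstream", http.StatusTooManyRequests)
		return
	}
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}

func main() {
	http.HandleFunc("/api/resource", handle)
	http.ListenAndServe(":8080", nil)
}
```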