Make latency impact of decision logs predictable #5724
@mjungsbluth currently OPA assumes an unlimited buffer size, and users can optionally set one. If there was a ring buffer with a fixed size, how would that size be determined? Or do we make this a mandatory config option? It's not clear how we would select the "right" value.
Can you share more details about the setup, policy, etc.? I am asking because I ran a performance test and p99 with and without decision logging was pretty similar, so I wanted to understand a bit more here.
You're right. What if we allowed the user to specify a memory threshold beyond which OPA starts dropping old logs until memory usage goes back below the threshold? There are a few unknowns here, but the idea is that OPA keeps around more logs than the bounded approach without losing them all in an OOM scenario.
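The memory-threshold idea above could be sketched roughly as follows. This is illustrative only (not OPA code): a FIFO of decision chunks with a tracked byte total, evicting the oldest entries whenever a user-supplied limit is exceeded.

```go
package main

import (
	"container/list"
	"fmt"
)

// boundedLog is a hypothetical sketch of the threshold idea:
// keep appending decision chunks, and once the tracked byte
// total exceeds the limit, evict the oldest entries.
type boundedLog struct {
	entries  *list.List // FIFO of []byte chunks
	bytes    int        // current tracked memory usage
	maxBytes int        // user-supplied threshold
}

func newBoundedLog(maxBytes int) *boundedLog {
	return &boundedLog{entries: list.New(), maxBytes: maxBytes}
}

// add appends a chunk, then drops the oldest logs until usage is
// back under the threshold. It reports how many chunks were evicted.
func (b *boundedLog) add(chunk []byte) (dropped int) {
	b.entries.PushBack(chunk)
	b.bytes += len(chunk)
	for b.bytes > b.maxBytes && b.entries.Len() > 1 {
		oldest := b.entries.Remove(b.entries.Front()).([]byte)
		b.bytes -= len(oldest)
		dropped++
	}
	return dropped
}

func main() {
	log := newBoundedLog(10)
	log.add(make([]byte, 4))
	log.add(make([]byte, 4))
	// Third chunk pushes usage to 12 bytes, over the 10-byte limit,
	// so the oldest chunk is evicted.
	fmt.Println(log.add(make([]byte, 4)))
	fmt.Println(log.bytes)
}
```

Unlike a fixed entry count, a byte threshold adapts to variable decision sizes, at the cost of having to track (or estimate) per-chunk memory usage.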
Hi @ashutosh-narkar, in either case I would suggest going with something user-supplied first. As an example, if a buffer size is specified, an additional flag (something like overflow_strategy) could be "drop" (the current default behavior) or "backpressure", meaning you would rather keep all logs and delay the decision. Another variant would be to return a 500/error if a maximum latency is reached (this would typically be more monitoring-friendly).

The setup has been shared as part of our relationship with Styra, but essentially we saturated 50 OPA instances with 30,000 rps against a static "allow" rule. I have to check, but it is possible that only the 99.5 percentile was off, which at our scale would already be an issue for clients.

As for the ring buffer, even a standard Go channel with a maximum size is better than the mutex. It just looked suspicious that serialization and other caching happen inside the lock, rather than the lock only protecting the shared memory. The latency issue might also originate outside the decision log plugin, but it was easy to reproduce by just turning the plugin off and on.
We did a POC that implements a ring buffer based decision logger. Below are some end-to-end latency results. The tests use a very simple policy. The code is on this branch if y'all want to try it out. Any feedback would be great!
@mefarazath can you try to incorporate the mentioned branch in the Skipper-based benchmarks you're currently running? Especially if you start measuring the p99.5 and p99.9 percentiles.
@ashutosh-narkar this looks very promising, especially the latency stability! In your tests, could you capture the p99.5 and p99.9 percentiles as well? This is where we could see the biggest difference, especially when pushing the instances to CPU saturation.
I tried the PoC using the Skipper OPA filter benchmarks outlined here, with both OPA v0.68.0 (the current version used in Skipper) and the PoC branch, across the 4 scenarios in the benchmark.
I noticed a significant improvement in p99, p99.5, and p99.9 (in addition to the average/mean) compared to v0.68.0.
Dumping my benchstat results here in case anyone is interested. Note: percentile measurement was introduced in the Skipper benchmarks with a PR (yet to be merged).
@ashutosh-narkar I wanted to say that I noticed a significant improvement in latencies and decision logging when using the ring-buffer PoC. I ran the system at 2.5k RPS (requests per second) over a 2-minute test.

Without the ring buffer implementation:
Requests [total, rate, throughput] 359308, 2847.43, 82.54

Using the ring buffer implementation:
Requests [total, rate, throughput] 360001, 3000.01, 2999.96
What is the underlying problem you're trying to solve?
The current decision log plugin runs in two variants: unbounded (all decisions are kept) and bounded (decisions are discarded if the buffer overflows). The actual trade-off is between auditability and availability. However, in the unbounded case, the likelihood of an OOM kill grows if the decision log API gets overloaded, still losing decisions.
On top of that, there are quite heavy locks in the decision log plugin that, for example, serialize the encoding of decisions. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we measured a p99 latency one order of magnitude higher with decision logs turned on than without.
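The general remedy for "encoding inside the lock" is to do the expensive serialization before acquiring the mutex, so the critical section is only the cheap shared-state mutation. A minimal illustrative sketch (hypothetical types, not OPA's actual plugin code):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// logger is a toy stand-in for a decision log plugin buffer.
type logger struct {
	mu  sync.Mutex
	buf [][]byte
}

// logBad encodes while holding the lock: every concurrent caller
// serializes behind the (comparatively slow) JSON encoding.
func (l *logger) logBad(decision any) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	b, err := json.Marshal(decision)
	if err != nil {
		return err
	}
	l.buf = append(l.buf, b)
	return nil
}

// logGood encodes outside the lock; the critical section shrinks to
// a single append, so contention under high rps drops sharply.
func (l *logger) logGood(decision any) error {
	b, err := json.Marshal(decision)
	if err != nil {
		return err
	}
	l.mu.Lock()
	l.buf = append(l.buf, b)
	l.mu.Unlock()
	return nil
}

func main() {
	l := &logger{}
	_ = l.logGood(map[string]string{"result": "allow"})
	fmt.Println(len(l.buf))
}
```

Both functions produce the same buffer contents; they differ only in how long the mutex is held per decision.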
Describe the ideal solution
If we change the trade-off to auditability vs. latency guarantees, a lock-free ring buffer with a fixed size could be used as an alternative to the existing solution. This would limit the memory used in both cases.
If auditability is favoured, offered chunks would be retried until they can be placed in the buffer (this creates back pressure and increases latency). If low latency is favoured, offered chunks that cannot be placed in the buffer can be discarded.
In both cases, this can be achieved without holding any locks.
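A minimal sketch of the two policies over a fixed-size buffer, using a buffered Go channel as the ring storage. Note this is an approximation of the proposal, not its implementation: Go channels do take an internal lock, so a truly lock-free ring would instead manage head/tail indices with atomic operations, but the calling code here holds no explicit locks and the memory bound is the same.

```go
package main

import "fmt"

// ring is a fixed-capacity buffer of decision chunks.
type ring struct{ ch chan []byte }

func newRing(size int) *ring { return &ring{ch: make(chan []byte, size)} }

// offerBlocking favours auditability: the caller waits (back
// pressure, added latency) until the chunk fits.
func (r *ring) offerBlocking(chunk []byte) {
	r.ch <- chunk
}

// offerDrop favours low latency: a chunk that does not fit is
// discarded immediately and the request path never stalls.
func (r *ring) offerDrop(chunk []byte) bool {
	select {
	case r.ch <- chunk:
		return true
	default:
		return false
	}
}

func main() {
	r := newRing(1)
	fmt.Println(r.offerDrop([]byte("d1"))) // fits in the empty ring
	fmt.Println(r.offerDrop([]byte("d2"))) // ring full: discarded
}
```

Either way, memory usage is capped at the ring size, so the OOM risk of the unbounded variant disappears; the config choice between the two offer modes is exactly the overflow_strategy flag discussed earlier in the thread.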