
Make latency impact of decision logs predictable #5724

Open
mjungsbluth opened this issue Mar 4, 2023 · 8 comments
Labels: design, feature-request, investigating, monitoring, optimization

@mjungsbluth
Contributor

What is the underlying problem you're trying to solve?

The current decision log plugin runs in two variants: unbounded (all decisions are kept) and bounded (decisions are discarded if the buffer overflows). The actual trade-off is between auditability and availability. However, in the unbounded case, the likelihood of an OOM kill grows if the decision log API gets overloaded, so decisions are still lost.

On top of that, there are quite heavy locks in the decision log plugin that force, for example, the encoding of decisions to happen in single file. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we saw a p99 latency that was one order of magnitude higher with decision logs turned on than without.

Describe the ideal solution

If we reframe the trade-off as auditability vs. latency guarantees, a lock-free ring buffer with a fixed size could be used as an alternative to the existing solution. This would limit the memory used in both cases.
If auditability is favoured, an offered chunk is retried until it can be placed in the buffer (this creates back pressure and increases latency). If low latency is favoured, offered chunks that cannot be placed in the buffer are discarded.
In both cases, this can be achieved without holding any locks.
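For illustration, here is a minimal Go sketch of the two strategies, using a buffered channel as a stand-in for the proposed fixed-size ring buffer (all names are hypothetical, not OPA's actual plugin API):

```go
package decisionbuffer

type Strategy int

const (
	Drop         Strategy = iota // favour low latency: discard chunks that do not fit
	Backpressure                 // favour auditability: block until there is room
)

type Buffer struct {
	ch       chan []byte // fixed capacity bounds memory in both modes
	strategy Strategy
}

func New(size int, s Strategy) *Buffer {
	return &Buffer{ch: make(chan []byte, size), strategy: s}
}

// Offer hands an encoded chunk to the buffer. With Drop it never blocks;
// with Backpressure it waits for space, trading latency for completeness.
func (b *Buffer) Offer(chunk []byte) bool {
	if b.strategy == Backpressure {
		b.ch <- chunk
		return true
	}
	select {
	case b.ch <- chunk:
		return true
	default:
		return false // buffer full, chunk is discarded
	}
}
```

In both modes the channel capacity bounds memory; the strategy only decides what happens when the buffer is full.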

@ashutosh-narkar added the design, distribution, optimization and runtime labels and removed the design and distribution labels on Mar 7, 2023
@ashutosh-narkar added the monitoring label and removed the runtime label on Mar 8, 2023
@ashutosh-narkar self-assigned this on Mar 9, 2023
@ashutosh-narkar added the requires-investigation label on Mar 9, 2023
@ashutosh-narkar added the investigating label and removed the requires-investigation label on Apr 20, 2023
@ashutosh-narkar
Member

@mjungsbluth currently, OPA defaults to an unlimited buffer size, and users can optionally set a limit. If there were a ring buffer with a fixed size, how would that size be determined? Or do we make this a mandatory config option? It's not clear how we would select the "right" value.

On top of that, there are quite heavy locks in the decision log plugin that force, for example, the encoding of decisions to happen in single file. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we saw a p99 latency that was one order of magnitude higher with decision logs turned on than without.

Can you share more details about the setup, policy, etc.? I'm asking because I ran a performance test and p99 with and without decision logging was pretty similar, so I wanted to understand this a bit more.

However, in the unbounded case, the likelihood of an OOM kill grows if the decision log API gets overloaded, so decisions are still lost.

You're right. What if we allowed the user to specify a memory threshold beyond which OPA starts dropping old logs until memory usage falls below the threshold? There are a few unknowns here, but the idea is that OPA keeps around more logs than the bounded approach without losing them all due to an OOM scenario.
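For illustration, a rough Go sketch of that memory-threshold idea (hypothetical names, not OPA code): track the approximate buffered size in bytes and evict the oldest entries once the user-supplied threshold is exceeded.

```go
package decisionbuffer

// thresholdBuffer keeps decision log entries until a user-supplied byte
// threshold is exceeded, then evicts the oldest entries first.
type thresholdBuffer struct {
	entries   [][]byte
	usedBytes int
	maxBytes  int // user-supplied memory threshold
}

func (b *thresholdBuffer) Push(entry []byte) {
	b.entries = append(b.entries, entry)
	b.usedBytes += len(entry)
	// Drop the oldest entries until usage is back under the threshold,
	// always keeping at least the newest entry.
	for b.usedBytes > b.maxBytes && len(b.entries) > 1 {
		b.usedBytes -= len(b.entries[0])
		b.entries[0] = nil // allow the dropped chunk to be garbage collected
		b.entries = b.entries[1:]
	}
}
```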

@mjungsbluth
Contributor Author

Hi @ashutosh-narkar,

In either case, I would suggest going with something user-supplied first. As an example, if a buffer size is specified, an additional flag (something like overflow_strategy) could be set to "drop" (the current default behavior) or "backpressure" to indicate that you would rather keep all logs and delay the decision instead. Another variant would be to return a 500/error if a maximum latency is reached (this would typically be more monitoring-friendly); see the sketch below.
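A rough sketch of that last variant, building on the hypothetical Buffer from the sketch in the issue description (illustrative only, not OPA's API): apply back pressure for at most a maximum wait, then return an error the caller can surface as a 500.

```go
package decisionbuffer

import (
	"context"
	"errors"
	"time"
)

// OfferWithDeadline applies back pressure for at most maxWait, then gives up
// with an error the caller can map to a 500 response.
func (b *Buffer) OfferWithDeadline(ctx context.Context, chunk []byte, maxWait time.Duration) error {
	t := time.NewTimer(maxWait)
	defer t.Stop()
	select {
	case b.ch <- chunk: // accepted within the latency budget
		return nil
	case <-t.C:
		return errors.New("decision log buffer full: maximum latency exceeded")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```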

The setup has been shared as part of our relationship with Styra, but essentially we saturated 50 OPA instances with 30,000 rps against a static "allow" rule. I have to check, but it is possible that only the 99.5th percentile was off, which at our scale would already be an issue for clients.

As for the ring buffer, even a standard Go channel with a maximum size is better than the mutex. It just looked suspicious that serialization and other caching happen inside the lock, rather than the lock only protecting the shared memory. The latency issue might also be outside the decision log plugin, but it was easy to reproduce by just turning the plugin off and on.
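To illustrate the point about the critical section, a minimal sketch (not OPA's actual logger code) in which serialization and compression happen before the lock is taken, so the lock only protects the shared slice:

```go
package decisionbuffer

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"sync"
)

type logger struct {
	mu     sync.Mutex
	chunks [][]byte
}

// Log serializes and compresses the event with no lock held, so the mutex
// only guards the cheap append to the shared slice.
func (l *logger) Log(event interface{}) error {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if err := json.NewEncoder(zw).Encode(event); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}

	l.mu.Lock()
	l.chunks = append(l.chunks, buf.Bytes())
	l.mu.Unlock()
	return nil
}
```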

@Pushpalanka
Contributor

Pushpalanka commented Aug 27, 2024

On a related note, sharing this additional information in case it helps.
This is from a 60s profile capture during a load test with a simple allow rule. We observed that decision logging takes more than 50% of evaluation time.

[profile screenshot]

@ashutosh-narkar
Member

We did a POC that implements a ring-buffer-based decision logger. Below are some end-to-end latency results; the tests use a very simple policy. The code is on this branch if y'all want to try it out. Any feedback would be great!

OPA v0.69.0

$ echo 'POST http://localhost:8181/v1/data/foo/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 267.833µs
Latencies     [min, mean, 50, 90, 95, 99, max]  104.541µs, 689.407µs, 341.12µs, 504.51µs, 605.328µs, 14.288ms, 77.812ms
Bytes In      [total, mean]                     23700000, 158.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150000
Error Set:

dev

$ echo 'POST http://localhost:8181/v1/data/foo/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 435.583µs
Latencies     [min, mean, 50, 90, 95, 99, max]  115.75µs, 378.295µs, 352.412µs, 483.206µs, 524.698µs, 784.635µs, 20.485ms
Bytes In      [total, mean]                     23700000, 158.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150000
Error Set:

@mjungsbluth
Contributor Author

@mefarazath can you try to incorporate the mentioned branch into the Skipper-based benchmarks you're currently running? Especially if you start measuring the p99.5 and p99.9 percentiles.

@mjungsbluth
Contributor Author

@ashutosh-narkar this looks very promising, especially the latency stability! In your tests, could you capture the p99.5 and p99.9 percentiles as well? That is where we would expect the biggest difference, especially when pushing the instances to CPU saturation…

@mefarazath

mefarazath commented Nov 1, 2024

I tried the PoC using the Skipper OPA filter benchmarks outlined here, with both OPA v0.68.0 (the current version used in Skipper) and the PoC branch.

The 4 scenarios in the benchmark were:

  1. minimal -> Simple allow := true policy
  2. with-body -> allow with request body parameter
  3. jwt-validation -> allow with JWT validation
  4. policy-with-data-lookup -> allow with a datasource lookup (a PR is pending to add this to skipper repo)

Noticed a significant improvement in p99, p99.5, p99.9 (in addition to average/mean) compared to v0.68.0.
So this looks promising indeed :)

➜ decision_logs_improvements benchstat v0.68.0_1s.txt fix_1s.txt
goos: darwin
goarch: arm64
pkg: github.com/zalando/skipper/filters/openpolicyagent/opaauthorizerequest
cpu: Apple M1 Max
                                                                     │ v0.68.0_1s.txt │              fix_1s.txt              │
                                                                     │     sec/op     │    sec/op     vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                           54.83µ ±  3%   12.26µ ±  2%  -77.64% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P50-10                       118.9µ ±  3%   102.1µ ±  3%  -14.15% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99-10                     15321.6µ ±  1%   455.7µ ±  1%  -97.03% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99.5-10                   15716.0µ ±  2%   547.4µ ±  1%  -96.52% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99.9-10                   16828.7µ ± 15%   891.5µ ±  6%  -94.70% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                         56.94µ ±  2%   25.92µ ±  9%  -54.48% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P50-10                     166.4µ ±  1%   235.6µ ±  5%  +41.63% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99-10                   14108.5µ ±  2%   767.0µ ± 13%  -94.56% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99.5-10                 14611.9µ ±  2%   937.6µ ± 15%  -93.58% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99.9-10                  15.808m ±  3%   1.629m ± 17%  -89.70% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10                    72.04µ ±  3%   22.39µ ±  3%  -68.92% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P50-10                215.6µ ±  3%   191.8µ ±  6%  -11.05% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99-10              15629.4µ ±  1%   661.1µ ±  4%  -95.77% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99.5-10            16062.0µ ±  2%   787.9µ ±  5%  -95.09% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99.9-10             20.174m ±  4%   1.336m ±  8%  -93.38% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10           72.18µ ±  4%   30.76µ ±  2%  -57.38% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P50-10       274.1µ ±  7%   274.3µ ±  2%        ~ (p=0.565 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99-10     14347.6µ ±  1%   809.3µ ±  4%  -94.36% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99.5-10   15226.4µ ±  2%   944.0µ ±  4%  -93.80% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99.9-10    18.926m ±  4%   1.452m ± 12%  -92.33% (p=0.000 n=10)
geomean                                                                  2.168m         307.1µ        -85.83%

                                                               │ v0.68.0_1s.txt │              fix_1s.txt              │
                                                               │      B/op      │     B/op      vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                     79.42Ki ± 0%   54.52Ki ± 0%  -31.35% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                   88.54Ki ± 0%   62.43Ki ± 0%  -29.49% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10             119.18Ki ± 0%   89.60Ki ± 0%  -24.82% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10    123.89Ki ± 0%   96.63Ki ± 0%  -22.00% (p=0.000 n=10)
geomean                                                            100.9Ki        73.68Ki       -27.01%

                                                               │ v0.68.0_1s.txt │             fix_1s.txt              │
                                                               │   allocs/op    │  allocs/op   vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                      1.339k ± 0%   1.055k ± 0%  -21.18% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                    1.399k ± 0%   1.096k ± 0%  -21.66% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10               1.806k ± 0%   1.520k ± 0%  -15.86% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10      2.529k ± 0%   2.217k ± 0%  -12.34% (p=0.000 n=10)
geomean                                                             1.710k        1.405k       -17.85%

Dumping my benchstat results here in case anyone is interested.

Note: Percentile measurement was introduced in the Skipper benchmarks with a PR (yet to be merged).

@adetunjiakintundeakinde

adetunjiakintundeakinde commented Nov 29, 2024

@ashutosh-narkar I wanted to say that I noticed a significant improvement in latencies and decision logging when using the ring-buffer PoC. I used a policy and ran the system at 2.5k RPS (requests per second) over a 2-minute test.

Without ring buffer implementation

Requests [total, rate, throughput] 359308, 2847.43, 82.54
Duration [total, attack, wait] 2m34s, 2m6s, 27.404s
Latencies [min, mean, 50, 90, 95, 99, max] 122.375µs, 37.885s, 30.575s, 1m0s, 1m1s, 1m17s, 2m2s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 3.53%
Status Codes [code:count] 0:346631 200:12677

Using the ring buffer implementation

Requests [total, rate, throughput] 360001, 3000.01, 2999.96
Duration [total, attack, wait] 2m0s, 2m0s, 1.542ms
Latencies [min, mean, 50, 90, 95, 99, max] 226.833µs, 13.918ms, 1.353ms, 7.088ms, 18.205ms, 362.479ms, 3.015s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 100.00%
Status Codes [code:count] 0:1 200:360000
