Use maxConnection=-1 in router-perf test to increase tps and reduce error connections #451
/cc @sjug
Hi @qiliRedHat, we tested the new HAProxy auto maxConn functionality with the Network Edge team, and the decision was made not to set it as the default for the reasons you've mentioned. It's not a config that we'd normally want to test on a regular basis. There should be (or already is) a note in the router docs for customers that want the extra throughput at the cost of higher resource consumption.
Here we have two things:
First, we should reduce the number of router connections in our e2e mb config file to 20K (500 routes, 40 clients) for http and passthrough and 10K (500 routes, 20 clients) for edge and re-encrypt. In the comment https://bugzilla.redhat.com/show_bug.cgi?id=1983751#c7 we can see a higher number of non-200 responses than 200 responses with 80 clients. Unfortunately we are counting these non-200 response times in the latency calculation. In our local testing we also tuned the haproxy "maxconn" by adding oc set env -n openshift-ingress deployment router-default ROUTER_MAX_CONNECTIONS=80000 to our e2e script in configure_ingress_images() at https://github.com/cloud-bulldozer/e2e-benchmarking/blob/master/workloads/router-perf-v2/common.sh#L47. The table below shows our results with 20K, 80K and 120K haproxy "maxconn" on ROSA and self-managed AWS clusters. For this testing we used 500 routes, 80 clients, "edge" termination, 50 keep-alive requests and a 60-second mb duration.
With 80K haproxy "maxconn", we reached 1258K successful "200" status requests, and total requests were 1265K (i.e. 1258K + 7K). Requests per second in this case was 20970. With 20K haproxy "maxconn", total requests were only 525K (409K + 118K), and the number of non-200 status requests (409K) was about 3.5 times higher than the number of 200 status requests (118K). If we stay with the default haproxy "maxconn" of 20K, we should reduce the number of router connections in our e2e mb config file to the values suggested above.
Second, non-200 status failures happen at two different stages.
Because the errors happen at different stages, including the time taken by these non-200 requests in the latency calculation is not ideal. Our result.csv parser in e2e should be enhanced to use only 200-status requests for the latency calculation; a rough sketch of such a filter is below.
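A rough sketch of that parser change, filtering to 200-status samples before computing latency. The column names "code" and "latency" are assumptions for illustration, not the actual results.csv layout:

```python
# Sketch only: compute latency stats from 200-status samples alone.
# Assumed columns: "code" (HTTP status) and "latency" (per-request latency).
import csv
import statistics

def latency_200_only(results_csv):
    with open(results_csv, newline="") as f:
        rows = list(csv.DictReader(f))

    ok = sorted(float(r["latency"]) for r in rows if r["code"] == "200")
    if not ok:
        return None  # no successful requests to report on

    return {
        "samples": len(ok),
        "avg": statistics.mean(ok),
        "p95": ok[int(0.95 * (len(ok) - 1))],
    }
```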
@venkataanil I also noticed the way latency is calculated. The latency result is not accurate, especially when the number of non-200 responses is relatively high. I think the ideal way could be to calculate latency separately for each type of 'result_codes' (see e2e-benchmarking/workloads/router-perf-v2/workload.py, lines 62 to 71 at f3991ff).
And the way TPS (RPS) is calculated is the number of 200 response codes divided by the runtime, which means that if the number of non-200 responses is relatively big, the TPS (RPS) will be smaller. That's another reason to avoid non-200 responses; a sketch of such a per-code breakdown follows.
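A sketch of that per-result_code breakdown, with TPS derived from 200 responses only. Again, the column names and the runtime argument are assumptions for illustration, not the real workload.py code:

```python
# Sketch only: group latencies by result code and compute TPS from 200s.
import csv
from collections import defaultdict

def summarize(results_csv, runtime_seconds):
    by_code = defaultdict(list)
    with open(results_csv, newline="") as f:
        for row in csv.DictReader(f):
            by_code[row["code"]].append(float(row["latency"]))

    for code, lat in sorted(by_code.items()):
        lat.sort()
        p95 = lat[int(0.95 * (len(lat) - 1))]
        print(f"code={code} requests={len(lat)} "
              f"avg={sum(lat) / len(lat):.3f} p95={p95:.3f}")

    # TPS/RPS counts only successful requests, so a large non-200 share
    # drags the reported throughput down.
    return len(by_code.get("200", [])) / runtime_seconds
```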
Thanks for sharing this, I will pay attention to the doc. So far I have only seen in the 4.11 docs that there is a new maxConnections configuration under tuningOptions: https://docs.openshift.com/container-platform/4.11/networking/ingress-operator.html#nw-ingress-controller-configuration-parameters_configuring-ingress
@qiliRedHat thanks for reporting, the latency problem has been patched already - #453 |
…narios
Fixes: cloud-bulldozer#451
Signed-off-by: Raul Sevilla <[email protected]>
@rsevilla87 Hi Raul, in this bug comment https://bugzilla.redhat.com/show_bug.cgi?id=1983751#c7 I verified that configuring maxConnection=-1 greatly reduces the number of '0' responses, increases the number of '200' responses, and also increases TPS and reduces latency.
More test data and charts can be found here
https://docs.google.com/spreadsheets/d/1jNYCdTu2XvSs4xARk8PwQGoPZVgra0jOQORIlUKAdKg/edit#gid=1789221797
Please let me know what you think about adding this configuration to the router-perf test as an ENV var and setting it as the default; a rough sketch of how it could be applied is at the end of this comment.
Please note that maxConnection=-1 will cause the router pod to consume more CPU and memory.
In my test results above, I used:
INFRA_NODE_INSTANCE_TYPE=m5.12xlarge (48 vCPU x 192 GiB)
WORKLOAD_NODE_INSTANCE_TYPE=m5.8xlarge (32 vCPU x 128 GiB)
I saw that cloud-bulldozer/airflow-kubernetes#190 tried to shift the infrastructure nodes from 48x192 to 16x64; that may need to be re-evaluated if this configuration is used.
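For what it's worth, here is a minimal sketch of how the proposed knob could be applied from the test harness, assuming a hypothetical MAX_CONNECTIONS ENV var and the Kubernetes Python client; this is not the actual e2e-benchmarking implementation:

```python
# Sketch only: patch spec.tuningOptions.maxConnections on the default
# IngressController when a (hypothetical) MAX_CONNECTIONS env var is set.
# maxConnections=-1 lets the router size HAProxy's maxconn automatically,
# at the cost of higher router CPU/memory usage.
import os
from kubernetes import client, config

def apply_router_max_connections():
    value = os.getenv("MAX_CONNECTIONS")
    if value is None:
        return  # keep the cluster default

    config.load_kube_config()
    client.CustomObjectsApi().patch_namespaced_custom_object(
        group="operator.openshift.io",
        version="v1",
        namespace="openshift-ingress-operator",
        plural="ingresscontrollers",
        name="default",
        body={"spec": {"tuningOptions": {"maxConnections": int(value)}}},
    )
```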