[Flake] Etcd timeout -> leader election failure -> webhook down #1743

Open
lentzi90 opened this issue May 22, 2024 · 5 comments
Labels
kind/flake: Categorizes issue or PR as related to a flaky test.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
triage/accepted: Indicates an issue is ready to be actively worked on.

Comments

@lentzi90
Member

This issue is mostly to document and keep track of the test failures. The problem is not in BMO itself but in the performance of the CI system.

Which jobs are flaking

Possibly all jobs running on Jenkins workers in Xerces.
It has been observed in the BMO e2e tests at least.

Reason for failure (if possible):

Occasionally we see tests fail with a "failed to call webhook" error (see logs below), even though the webhook was working just before and no changes were made to it. Checking the BMO logs reveals that the issue is with etcd: BMO is unable to renew its lease or perform leader election. As a result, it stops and then restarts. While it is down, connections to the webhook are refused.
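Since this issue is meant for tracking recurrences, the failure chain described above can be checked mechanically when triaging a failed run. The following is a minimal sketch, not part of BMO or its test suite: the function name `looksLikeEtcdFlake` is hypothetical, and the two signature strings are copied from the BMO log excerpt in this issue on the assumption that they stay stable across versions.

```go
package main

import (
	"fmt"
	"strings"
)

// Signatures copied from the BMO log excerpt in this issue.
// Assumption: both must be present for a run to match this flake.
var etcdFlakeSignatures = []string{
	"etcdserver: request timed out",
	"failed to renew lease",
}

// looksLikeEtcdFlake (hypothetical helper) reports whether the given BMO
// logs contain every signature of the etcd timeout -> lost lease chain.
func looksLikeEtcdFlake(bmoLogs string) bool {
	for _, sig := range etcdFlakeSignatures {
		if !strings.Contains(bmoLogs, sig) {
			return false
		}
	}
	return true
}

func main() {
	logs := `E0521 02:09:39.324811       1 leaderelection.go:369] Failed to update lock: etcdserver: request timed out
I0521 02:09:42.299218       1 leaderelection.go:285] failed to renew lease baremetal-operator-system/baremetal-operator: timed out waiting for the condition`
	fmt.Println(looksLikeEtcdFlake(logs)) // prints true
}
```

A match only suggests that a failure belongs to this flake; a "connection refused" from the webhook without these lines in the BMO logs would point to a different cause.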

Test logs:

[2024-05-21T02:06:02.489Z] • [FAILED] [7.212 seconds]
[2024-05-21T02:06:02.489Z] Inspection [It] should inspect a newly created BMH [required, inspection]
[2024-05-21T02:06:02.489Z] /home/metal3ci/workspace/metal3-bmo-e2e-test-periodic-release-0.6/test/e2e/inspection_test.go:85
[2024-05-21T02:06:02.489Z] 
[2024-05-21T02:06:02.489Z]   Timeline >>
[2024-05-21T02:06:02.489Z]   INFO: Creating namespace inspection-wcmx49
[2024-05-21T02:06:02.489Z]   INFO: Creating event watcher for namespace "inspection-wcmx49"
[2024-05-21T02:06:02.489Z]   STEP: Creating a secret with BMH credentials @ 05/21/24 02:05:55.385
[2024-05-21T02:06:02.489Z]   STEP: creating a BMH @ 05/21/24 02:05:55.761
[2024-05-21T02:06:02.489Z]   [FAILED] in [It] - /home/metal3ci/workspace/metal3-bmo-e2e-test-periodic-release-0.6/test/e2e/inspection_test.go:110 @ 05/21/24 02:05:55.79
[2024-05-21T02:06:02.489Z]   INFO: Deleting namespace inspection-wcmx49
[2024-05-21T02:06:02.489Z]   << Timeline
[2024-05-21T02:06:02.489Z] 
[2024-05-21T02:06:02.489Z]   [FAILED] Unexpected error:
[2024-05-21T02:06:02.489Z]       <*errors.StatusError | 0xc000383180>: 
[2024-05-21T02:06:02.489Z]       Internal error occurred: failed calling webhook "baremetalhost.metal3.io": failed to call webhook: Post "https://baremetal-operator-webhook-service.baremetal-operator-system.svc:443/validate-metal3-io-v1alpha1-baremetalhost?timeout=10s": dial tcp 10.99.193.108:443: connect: connection refused

BMO logs:

E0521 02:09:39.324811       1 leaderelection.go:369] Failed to update lock: etcdserver: request timed out
E0521 02:09:42.298968       1 leaderelection.go:332] error retrieving resource lock baremetal-operator-system/baremetal-operator: Get "https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/baremetal-operator-system/leases/baremetal-operator": context deadline exceeded
I0521 02:09:42.299218       1 leaderelection.go:285] failed to renew lease baremetal-operator-system/baremetal-operator: timed out waiting for the condition
{"level":"info","ts":1716257382.3178487,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1716257382.3179276,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1716257382.3179662,"msg":"Stopping and waiting for caches"}
{"level":"info","ts":1716257382.3181348,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1716257382.3182237,"msg":"Stopping and waiting for HTTP servers"}
{"level":"info","ts":1716257382.3182437,"msg":"Wait completed, proceeding to shutdown the manager"}
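The BMO excerpt mixes two timestamp styles: the leader election lines carry wall-clock times, while the JSON lines carry a numeric "ts" field. Assuming that field is Unix epoch seconds with a fractional part (which the values here are consistent with), a small conversion shows that the manager shutdown lines follow the failed lease renewal within the same second. The helper name `tsToUTC` is made up for this sketch.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// tsToUTC (hypothetical helper) converts a "ts" value from the JSON log
// lines to a UTC time. Assumption: ts is Unix epoch seconds with fractions.
func tsToUTC(ts float64) time.Time {
	sec, frac := math.Modf(ts)
	return time.Unix(int64(sec), int64(frac*1e9)).UTC()
}

func main() {
	// "ts" of the first shutdown line in the excerpt above.
	fmt.Println(tsToUTC(1716257382.3178487).Format("2006-01-02 15:04:05")) // prints 2024-05-21 02:09:42
}
```

This lines up with the "failed to renew lease" entry at 02:09:42, provided the container clock behind the leader election lines is in UTC.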

Anything else you would like to add:

We could possibly work around this, or at least mitigate it, by disabling leader election. I don't think this is a good idea though, since we may just push the problem elsewhere and make it even harder to see why tests fail.
The only real solution is to ensure that the CI environment is performant enough to avoid these flakes.

/kind flake

@metal3-io-bot metal3-io-bot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels May 22, 2024
@Rozzii
Member

Rozzii commented May 29, 2024

/triage accepted

@metal3-io-bot metal3-io-bot added triage/accepted Indicates an issue is ready to be actively worked on. and removed needs-triage Indicates an issue lacks a `triage/foo` label and requires one. labels May 29, 2024
@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2024
@Rozzii
Member

Rozzii commented Sep 11, 2024

/remove-lifecycle stale

@metal3-io-bot metal3-io-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024
@Rozzii
Member

Rozzii commented Sep 11, 2024

@Sunnatillo

@metal3-io-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues will close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@metal3-io-bot metal3-io-bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 10, 2024