e2e flake: Disable after relocate times out randomly #1659
Comments
The VRG is stuck here:
Ramen did not remove its finalizer from the PVC because the VR was never deleted, and the VR did not remove the second finalizer.
And finally, the VR is stuck because of this:
It seems likely that the primary resource was deleted before the secondary was cleaned up. The proper sequence is to clean up the secondary first, then the primary. I believe this is the root cause, but we need to confirm it by reviewing the code. If the deletions happen in parallel, premature deletion of the primary would explain the issue.
And sure enough, the deletion is not ordered secondary first, then primary. It is done in random order. Here it is:
We need to sort first, then delete...
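A minimal sketch of the "sort, then delete" idea, not the actual Ramen code. The `resource` type, the `replicationState` values, and the names `vr-dr1`/`vr-dr2` are hypothetical stand-ins; the only point is that secondaries are ordered before primaries before any delete is issued.

```go
package main

import (
	"fmt"
	"sort"
)

// replicationState is a hypothetical stand-in for the replication state
// of a resource; it is not a Ramen type.
type replicationState string

const (
	primary   replicationState = "primary"
	secondary replicationState = "secondary"
)

// resource is a hypothetical, simplified view of something we need to delete.
type resource struct {
	Name  string
	State replicationState
}

// deletionOrder returns a copy of resources sorted so that every secondary
// comes before any primary. The sort is stable, so resources with the same
// state keep their relative order.
func deletionOrder(resources []resource) []resource {
	ordered := make([]resource, len(resources))
	copy(ordered, resources)
	sort.SliceStable(ordered, func(i, j int) bool {
		return ordered[i].State == secondary && ordered[j].State != secondary
	})
	return ordered
}

func main() {
	resources := []resource{
		{Name: "vr-dr1", State: primary},
		{Name: "vr-dr2", State: secondary},
	}
	// Delete in order; a real implementation would also wait for each
	// secondary to be fully cleaned up before touching the primary.
	for _, r := range deletionOrder(resources) {
		fmt.Printf("delete %s (%s)\n", r.Name, r.State)
	}
}
```

Sorting alone only fixes the order in which deletes are issued. If the secondary cleanup is asynchronous, the caller would still need to wait for it to finish before deleting the primary.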
I'll send a PR for e2e first, and handle the ordered deletion in ramen later.
Related to #716, also worked on in DFBUGS-601?
Right, I had a delete test that reproduced this issue, maybe we can revive it later in e2e.
I see this error randomly both locally and in CI. In the last few days we have seen a lot of failures, so this may be a regression in ramen.
Looking at the cluster we see:
On the hub, the drpc is in deleting state (good):
On dr1, the application is running (good), and the vrg/vr were deleted (good):
On dr2, we see that the application did not complete the cleanup after relocate:
I think the issue is that we do not wait for a stable state before disabling dr. It may be a bug in ramen that we cannot handle disabling dr when relocate did not complete, but there is no reason to test this edge case. We need to test the normal case, which is:
This is the flow in basic-test, and it works reliably:
ramen/test/basic-test/relocate
Line 21 in 8c376ff
In e2e we wait until phase is relocated:
ramen/e2e/dractions/actions.go
Line 154 in 8c376ff
This is very early in the relocate flow. The relocate has not succeeded yet.
Logs from the failing run:
disable-timeout.gather.tar.gz
Failed builds: