
e2e flake: Disable after relocate times out randomly #1659

Open
nirs opened this issue Nov 20, 2024 · 5 comments
Assignees: nirs
Labels: bug (Something isn't working) · high (Issue is of high priority and needs attention) · test (Testing related issue)

Comments

@nirs (Member) commented Nov 20, 2024

I see this error randomly, both locally and in the CI. In the last few days we have seen a lot of failures, so this may be a regression in ramen.

2024-11-20T19:29:09.997+0200	INFO	appset-deploy-rbd-busybox	dractions/retry.go:115	drpc phase is Relocated
2024-11-20T19:29:09.999+0200	INFO	appset-deploy-rbd-busybox	dractions/retry.go:58	drpc is ready
=== RUN   TestSuites/Exhaustive/appset-deploy-rbd-busybox/Disable
2024-11-20T19:29:09.999+0200	INFO	appset-deploy-rbd-busybox	dractions/actions.go:102	Unprotecting workload
2024-11-20T19:29:09.999+0200	INFO	appset-deploy-rbd-busybox	dractions/actions.go:107	Deleting drpc
    actions_test.go:64: drpc "appset-deploy-rbd-busybox" is not deleted yet before timeout, fail
=== NAME  TestSuites/Exhaustive/appset-deploy-rbd-busybox
    exhaustive_suite_test.go:91: Disable failed
...
--- FAIL: TestSuites (0.05s)
    ...
    --- FAIL: TestSuites/Exhaustive (6.06s)
        ...
        --- FAIL: TestSuites/Exhaustive/appset-deploy-rbd-busybox (1231.36s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Deploy (0.17s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Enable (90.15s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Failover (270.25s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Relocate (270.29s)
            --- FAIL: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Disable (600.49s)

Looking at the clusters, we see:

On the hub, the drpc is in Deleting state (good):

% kubectl get drpc -A  --context hub                                          
NAMESPACE   NAME                        AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
argocd      appset-deploy-rbd-busybox   31m   dr1                dr2               Relocate       Deleting

On dr1, the application is running (good), and the vrg/vr were deleted (good):

% kubectl get deploy,pod,pvc,vrg,vr -n appset-deploy-rbd-busybox --context dr1
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/busybox   1/1     1            1           13m

NAME                           READY   STATUS    RESTARTS   AGE
pod/busybox-7d5747dcf9-47b5t   1/1     Running   0          13m

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Bound    pvc-ff343fdd-0a8d-4b2a-90bc-f05e831a11e2   1Gi        RWO            rook-ceph-block   <unset>                 15m

On dr2, we see that the application did not complete the cleanup after relocate:

% kubectl get deploy,pod,pvc,vrg,vr -n appset-deploy-rbd-busybox --context dr2
NAME                                STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Terminating   pvc-ff343fdd-0a8d-4b2a-90bc-f05e831a11e2   1Gi        RWO            rook-ceph-block   <unset>                 24m

NAME                                                                    DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-deploy-rbd-busybox   secondary      Secondary

NAME                                                             AGE   VOLUMEREPLICATIONCLASS   PVCNAME       DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc   24m   vrc-sample               busybox-pvc   secondary      Secondary

I think the issue is that we do not wait for a stable state before disabling DR. It may be a bug in ramen that we cannot handle disabling DR when relocate did not complete, but there is no reason to test this edge case. We need to test the normal case, which is (see the sketch after the list):

  1. Wait until the application is relocated
  2. Wait for the first replication, meaning that the application is protected on the new cluster
  3. Delete the drpc
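
A minimal sketch of this flow for the e2e code, assuming Go and the controller-runtime client; the helpers waitDRPCPhase, waitFirstReplication, and deleteDRPC are hypothetical names, not existing e2e functions (a possible waitFirstReplication is sketched further below):

// Sketch only; imports and helper definitions omitted. The helper names
// are hypothetical and do not exist in the e2e package today.
func disableProtection(ctx context.Context, c client.Client, namespace, name string) error {
	// 1. Wait until the application is relocated.
	if err := waitDRPCPhase(ctx, c, namespace, name, ramen.Relocated); err != nil {
		return err
	}
	// 2. Wait for the first replication, meaning the application is
	// protected on the new cluster.
	if err := waitFirstReplication(ctx, c, namespace, name); err != nil {
		return err
	}
	// 3. Delete the drpc and wait until it is gone.
	return deleteDRPC(ctx, c, namespace, name)
}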

basic-test already follows this flow, and it works reliably:

test.wait_until_drpc_is_stable()

In e2e we only wait until the phase is Relocated:

return waitDRPC(client, namespace, name, ramen.Relocated)

This happens very early in the relocate flow; the relocate has not actually completed yet.
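
One way the missing step 2 could be implemented is to poll the DRPC status until it reports a successful sync after relocate. A minimal sketch, assuming the DRPC status exposes a PeerReady condition and a LastGroupSyncTime field (verify these names against the ramen API before relying on this):

package dractions

import (
	"context"
	"fmt"
	"time"

	ramen "github.com/ramendr/ramen/api/v1alpha1"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitFirstReplication is a hypothetical helper: it waits until the drpc
// reports PeerReady and a first group sync, which we treat as "protected
// on the new cluster". The status field names are assumptions.
func waitFirstReplication(ctx context.Context, c client.Client, namespace, name string) error {
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		drpc := &ramen.DRPlacementControl{}
		key := types.NamespacedName{Namespace: namespace, Name: name}
		if err := c.Get(ctx, key, drpc); err != nil {
			return err
		}
		if meta.IsStatusConditionTrue(drpc.Status.Conditions, "PeerReady") &&
			drpc.Status.LastGroupSyncTime != nil {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("drpc %q not stable before timeout", name)
}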

Logs from the failing run:
disable-timeout.gather.tar.gz

Failed builds:

nirs added the bug and test labels Nov 20, 2024
nirs added the high label Nov 25, 2024
@BenamarMk (Member) commented:

The VRG is stuck here:
Requeuing due to processing a deleted VR
It looks like the VR was deleted, but maybe the finalizer is still there. Let's look at vr.yaml:
...
The finalizer was not removed from the VR:

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  creationTimestamp: "2024-11-20T17:20:09Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-11-20T17:28:31Z"
  finalizers:
  - replication.storage.openshift.io

As a result, ramen did not remove its finalizer from the PVC because the VR was not deleted, and the second finalizer was not removed by the VR controller either:

finalizers:
  - volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection
  - replication.storage.openshift.io/pvc-protection

Finally, the reason the VR is stuck is this error from the csi-addons controller:

2024-11-20T17:39:27.538Z        ERROR   failed to disable volume replication    {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication", "VolumeReplication": {"name":"busybox-pvc","namespace":"appset-deploy-rbd-busybox"}, "namespace": "appset-deploy-rbd-busybox", "name": "busybox-pvc", "reconcileID": "5569378f-a276-409e-8a49-3cf9f649edfb", "Request.Name": "busybox-pvc", "Request.Namespace": "appset-deploy-rbd-busybox", "error": "rpc error: code = InvalidArgument desc = invalid arguments provided: secondary image status is up=true and state=error"}
github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication%2estorage.(*VolumeReplicationReconciler).disableVolumeReplication
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication.storage/volumereplication_controller.go:728
github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication%2estorage.(*VolumeReplicationReconciler).Reconcile
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication.storage/volumereplication_controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224

It seems likely that the primary resource was deleted before the secondary was cleaned up. The proper sequence should be to clean up the secondary first, then the primary.

I believe this is the root cause in theory, but we need to confirm it by reviewing the code. If the deletions are happening in parallel, then premature deletion of the primary would explain the issue.

@BenamarMk (Member) commented:

And sure enough, the deletion is not ordered (secondary first, then primary); it is done in random order. Here it is:

We need to sort first, then delete...
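
A minimal sketch of the sorting idea, using illustrative types rather than the actual ramen structures: order the per-cluster VRGs so that secondaries are deleted before the primary.

package main

import (
	"fmt"
	"sort"
)

// clusterVRG is an illustrative stand-in for ramen's per-cluster VRG view.
type clusterVRG struct {
	Cluster string
	State   string // "primary" or "secondary"
}

// deletionOrder returns the VRGs with secondaries first, so the secondary
// cleanup completes before the primary VRG is deleted.
func deletionOrder(vrgs []clusterVRG) []clusterVRG {
	sorted := append([]clusterVRG(nil), vrgs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].State == "secondary" && sorted[j].State != "secondary"
	})
	return sorted
}

func main() {
	vrgs := []clusterVRG{{Cluster: "dr1", State: "primary"}, {Cluster: "dr2", State: "secondary"}}
	fmt.Println(deletionOrder(vrgs)) // [{dr2 secondary} {dr1 primary}]
}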

nirs assigned nirs and unassigned BenamarMk Nov 25, 2024
@nirs (Member, Author) commented Nov 25, 2024

I'll send a PR for e2e first, and handle the ordered deletion in ramen later.

@ShyamsundarR (Member) commented:

Related to #716, also worked on in DFBUGS-601?

@nirs (Member, Author) commented Nov 25, 2024

> Related to #716, also worked on in DFBUGS-601?

Right, I had a delete test that reproduced this issue; maybe we can revive it later in e2e.
