
e2e flake: Disable after relocate times out randomly #1659

Open
nirs opened this issue Nov 20, 2024 · 5 comments
Assignees: nirs
Labels: bug (Something isn't working) · high (Issue is of high priority and needs attention) · test (Testing related issue)

Comments

@nirs (Member) commented Nov 20, 2024

I see this error randomly, both locally and in the CI. In the last few days we have seen a lot of failures, so this may be a regression in ramen.

2024-11-20T19:29:09.997+0200	INFO	appset-deploy-rbd-busybox	dractions/retry.go:115	drpc phase is Relocated
2024-11-20T19:29:09.999+0200	INFO	appset-deploy-rbd-busybox	dractions/retry.go:58	drpc is ready
=== RUN   TestSuites/Exhaustive/appset-deploy-rbd-busybox/Disable
2024-11-20T19:29:09.999+0200	INFO	appset-deploy-rbd-busybox	dractions/actions.go:102	Unprotecting workload
2024-11-20T19:29:09.999+0200	INFO	appset-deploy-rbd-busybox	dractions/actions.go:107	Deleting drpc
    actions_test.go:64: drpc "appset-deploy-rbd-busybox" is not deleted yet before timeout, fail
=== NAME  TestSuites/Exhaustive/appset-deploy-rbd-busybox
    exhaustive_suite_test.go:91: Disable failed
...
--- FAIL: TestSuites (0.05s)
    ...
    --- FAIL: TestSuites/Exhaustive (6.06s)
        ...
        --- FAIL: TestSuites/Exhaustive/appset-deploy-rbd-busybox (1231.36s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Deploy (0.17s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Enable (90.15s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Failover (270.25s)
            --- PASS: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Relocate (270.29s)
            --- FAIL: TestSuites/Exhaustive/appset-deploy-rbd-busybox/Disable (600.49s)

Looking at the clusters, we see:

On the hub, the drpc is in Deleting state (good):

% kubectl get drpc -A  --context hub                                          
NAMESPACE   NAME                        AGE   PREFERREDCLUSTER   FAILOVERCLUSTER   DESIREDSTATE   CURRENTSTATE
argocd      appset-deploy-rbd-busybox   31m   dr1                dr2               Relocate       Deleting

On dr1, the application is running (good), and the vrg/vr were deleted (good):

% kubectl get deploy,pod,pvc,vrg,vr -n appset-deploy-rbd-busybox --context dr1
NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/busybox   1/1     1            1           13m

NAME                           READY   STATUS    RESTARTS   AGE
pod/busybox-7d5747dcf9-47b5t   1/1     Running   0          13m

NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Bound    pvc-ff343fdd-0a8d-4b2a-90bc-f05e831a11e2   1Gi        RWO            rook-ceph-block   <unset>                 15m

On dr2, we see that the application did not complete the cleanup after relocate:

% kubectl get deploy,pod,pvc,vrg,vr -n appset-deploy-rbd-busybox --context dr2
NAME                                STATUS        VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/busybox-pvc   Terminating   pvc-ff343fdd-0a8d-4b2a-90bc-f05e831a11e2   1Gi        RWO            rook-ceph-block   <unset>                 24m

NAME                                                                    DESIREDSTATE   CURRENTSTATE
volumereplicationgroup.ramendr.openshift.io/appset-deploy-rbd-busybox   secondary      Secondary

NAME                                                             AGE   VOLUMEREPLICATIONCLASS   PVCNAME       DESIREDSTATE   CURRENTSTATE
volumereplication.replication.storage.openshift.io/busybox-pvc   24m   vrc-sample               busybox-pvc   secondary      Secondary

I think the issue is that we do not wait for a stable state before disabling DR. It may be a bug in ramen that we cannot handle disabling DR when relocate did not complete, but there is no reason to test this edge case. We need to test the normal case, which is (see the sketch after the list):

  1. Wait until the application is relocated
  2. Wait for the first replication, meaning that the application is protected on the new cluster
  3. Delete the drpc
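
A minimal sketch of this flow for the e2e code, assuming Go and the controller-runtime client; the helpers waitDRPCPhase, waitFirstReplication, and deleteDRPC are hypothetical names, not existing e2e functions (a possible waitFirstReplication is sketched further below):

// Sketch only; imports and helper definitions omitted. The helper names
// are hypothetical and do not exist in the e2e package today.
func disableProtection(ctx context.Context, c client.Client, namespace, name string) error {
	// 1. Wait until the application is relocated.
	if err := waitDRPCPhase(ctx, c, namespace, name, ramen.Relocated); err != nil {
		return err
	}
	// 2. Wait for the first replication, meaning the application is
	// protected on the new cluster.
	if err := waitFirstReplication(ctx, c, namespace, name); err != nil {
		return err
	}
	// 3. Delete the drpc and wait until it is gone.
	return deleteDRPC(ctx, c, namespace, name)
}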

basic-test already follows this flow, and it works reliably:

test.wait_until_drpc_is_stable()

In e2e we only wait until the phase is Relocated:

return waitDRPC(client, namespace, name, ramen.Relocated)

This happens very early in the relocate flow; the relocate has not actually completed yet.
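
One way the missing step 2 could be implemented is to poll the DRPC status until it reports a successful sync after relocate. A minimal sketch, assuming the DRPC status exposes a PeerReady condition and a LastGroupSyncTime field (verify these names against the ramen API before relying on this):

package dractions

import (
	"context"
	"fmt"
	"time"

	ramen "github.com/ramendr/ramen/api/v1alpha1"
	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitFirstReplication is a hypothetical helper: it waits until the drpc
// reports PeerReady and a first group sync, which we treat as "protected
// on the new cluster". The status field names are assumptions.
func waitFirstReplication(ctx context.Context, c client.Client, namespace, name string) error {
	deadline := time.Now().Add(10 * time.Minute)
	for time.Now().Before(deadline) {
		drpc := &ramen.DRPlacementControl{}
		key := types.NamespacedName{Namespace: namespace, Name: name}
		if err := c.Get(ctx, key, drpc); err != nil {
			return err
		}
		if meta.IsStatusConditionTrue(drpc.Status.Conditions, "PeerReady") &&
			drpc.Status.LastGroupSyncTime != nil {
			return nil
		}
		time.Sleep(10 * time.Second)
	}
	return fmt.Errorf("drpc %q not stable before timeout", name)
}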

Logs from the failing run:
disable-timeout.gather.tar.gz

Failed builds:

nirs added the bug and test labels Nov 20, 2024
nirs added the high label Nov 25, 2024
@BenamarMk (Member) commented:

The VRG is stuck here:
Requeuing due to processing a deleted VR
It looks like the VR was deleted, but maybe the finalizer is still there. Let's look at vr.yaml:
...
The finalizer was not removed from the VR:

apiVersion: replication.storage.openshift.io/v1alpha1
kind: VolumeReplication
metadata:
  creationTimestamp: "2024-11-20T17:20:09Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-11-20T17:28:31Z"
  finalizers:
  - replication.storage.openshift.io

As a result, ramen did not remove its finalizer from the PVC because the VR was not deleted, and the second finalizer was not removed by the VR controller either:

finalizers:
  - volumereplicationgroups.ramendr.openshift.io/pvc-vr-protection
  - replication.storage.openshift.io/pvc-protection

Finally, the reason the VR is stuck is this error from the csi-addons controller:

2024-11-20T17:39:27.538Z        ERROR   failed to disable volume replication    {"controller": "volumereplication", "controllerGroup": "replication.storage.openshift.io", "controllerKind": "VolumeReplication", "VolumeReplication": {"name":"busybox-pvc","namespace":"appset-deploy-rbd-busybox"}, "namespace": "appset-deploy-rbd-busybox", "name": "busybox-pvc", "reconcileID": "5569378f-a276-409e-8a49-3cf9f649edfb", "Request.Name": "busybox-pvc", "Request.Namespace": "appset-deploy-rbd-busybox", "error": "rpc error: code = InvalidArgument desc = invalid arguments provided: secondary image status is up=true and state=error"}
github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication%2estorage.(*VolumeReplicationReconciler).disableVolumeReplication
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication.storage/volumereplication_controller.go:728
github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication%2estorage.(*VolumeReplicationReconciler).Reconcile
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/internal/controller/replication.storage/volumereplication_controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:303
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
        /workspace/go/src/github.com/csi-addons/kubernetes-csi-addons/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:224

It seems likely that the primary resource was deleted before the secondary was cleaned up. The proper sequence should be to clean up the secondary first, then the primary.

I believe this is the root cause in theory, but we need to confirm it by reviewing the code. If the deletions are happening in parallel, then premature deletion of the primary would explain the issue.

@BenamarMk (Member) commented:

And sure enough, the deletion is not ordered (secondary first, then primary); it is done in random order. Here it is:

We need to sort first, then delete...
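
A minimal sketch of the sorting idea, using illustrative types rather than the actual ramen structures: order the per-cluster VRGs so that secondaries are deleted before the primary.

package main

import (
	"fmt"
	"sort"
)

// clusterVRG is an illustrative stand-in for ramen's per-cluster VRG view.
type clusterVRG struct {
	Cluster string
	State   string // "primary" or "secondary"
}

// deletionOrder returns the VRGs with secondaries first, so the secondary
// cleanup completes before the primary VRG is deleted.
func deletionOrder(vrgs []clusterVRG) []clusterVRG {
	sorted := append([]clusterVRG(nil), vrgs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].State == "secondary" && sorted[j].State != "secondary"
	})
	return sorted
}

func main() {
	vrgs := []clusterVRG{{Cluster: "dr1", State: "primary"}, {Cluster: "dr2", State: "secondary"}}
	fmt.Println(deletionOrder(vrgs)) // [{dr2 secondary} {dr1 primary}]
}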

nirs assigned nirs and unassigned BenamarMk Nov 25, 2024
@nirs (Member, Author) commented Nov 25, 2024

I'll send a PR for e2e first, and handle the ordered deletion in ramen later.

@ShyamsundarR (Member) commented:

Related to #716, also worked on in DFBUGS-601?

@nirs (Member, Author) commented Nov 25, 2024

> Related to #716, also worked on in DFBUGS-601?

Right, I had a delete test that reproduced this issue; maybe we can revive it later in e2e.
