DBZ-8051: Recreate Kubernetes Strategy #68

Merged
1 commit merged into debezium:main on Jul 11, 2024

Conversation

@ryanvanhuuksloot (Contributor) commented on Jul 11, 2024

Part of Issue: https://issues.redhat.com/browse/DBZ-8051

The default RollingUpdate strategy makes it seem like we can do zero-downtime deployments, but it can cause serious problems due to multiple race conditions.

We can work on zero downtime in a DDD, but for now we should have a safe default.
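For context, the change amounts to setting spec.strategy.type to Recreate on the generated Deployment. A minimal fabric8 sketch of such an object (the class, method and name below are illustrative, not the operator's actual builder code):

```java
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;

public class RecreateStrategySketch {

    // Illustrative only: a Deployment whose strategy is Recreate, i.e. the old pod
    // is fully terminated before the replacement is started, so they never overlap.
    static Deployment deploymentWithRecreate(String name) {
        return new DeploymentBuilder()
                .withNewMetadata()
                    .withName(name)            // placeholder name
                .endMetadata()
                .withNewSpec()
                    .withNewStrategy()
                        .withType("Recreate")  // no rollingUpdate block may be set with this type
                    .endStrategy()
                .endSpec()
                .build();
    }
}
```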

Snippet from a conversation with @jcechace:

That's partially true -- when a record batch is finished, the connector commits offsets, but only when the OffsetCommitPolicy determines it should: with the default PeriodicCommitOffsetPolicy that is when the configured period has elapsed; with AlwaysCommitOffsetPolicy it is at the end of each batch.

The data integrity issue is possible -- as I described -- in case of a crash at the wrong time. The problem is that there is an overlap where the old pod P1 is still running when P2 announces ready (so they work concurrently for a brief moment). In this "overlap" period P1 receives a signal to shut down (which means it commits offsets) and P2 crashes. Now, if P1 was ahead (and it likely was), P3 (which was spawned due to the crash of P2) will pick up the offset committed by P1 (assuming P1 was the last to commit, which it could have been). This means the following happens: P1 emitted the C, U and D events, P2 emitted duplicates of C and U, and P3 now skips the D because it is prior to the committed offset. So for the same record your sink now contains CUDCU, and the last event for that record is an Update even though it should be a Delete.

In a scenario where the entire redeployment is triggered by the operator, we could implement a signal which would tell DS to commit its offsets and refrain from committing any future offsets (sort of a "shut-down-prepare" state). Unfortunately, in reality the situation is even worse, as the Deployment and ReplicaSetController sit in the way, and a similar situation can happen without the operator being involved (e.g. when K8s decides to relocate the pod to a different node).
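To make the commit-policy difference above concrete, here is a simplified sketch of the decision each policy makes; the real implementations are Debezium's OffsetCommitPolicy SPI classes, and this is not their exact code:

```java
import java.time.Duration;

// Simplified sketch of the two commit policies discussed above; the real
// implementations are Debezium's OffsetCommitPolicy SPI, not this code.
interface CommitPolicySketch {

    // Called after each finished batch; returning true triggers an offset commit.
    boolean performCommit(long messagesSinceLastCommit, Duration timeSinceLastCommit);

    // AlwaysCommitOffsetPolicy-style behaviour: commit at the end of every batch.
    static CommitPolicySketch always() {
        return (messages, elapsed) -> true;
    }

    // PeriodicCommitOffsetPolicy-style behaviour: commit only once the configured
    // interval has elapsed, so a batch can finish without its offsets being persisted.
    static CommitPolicySketch periodic(Duration interval) {
        return (messages, elapsed) -> elapsed.compareTo(interval) >= 0;
    }
}
```

And a purely hypothetical sketch of the "shut-down-prepare" idea mentioned above (all names are invented; nothing like this exists in the codebase today):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical "shut-down-prepare" gate: once the operator signals an upcoming
// redeployment, the old pod commits what it has and then refrains from any
// further commits, so a concurrently starting replacement cannot be leapfrogged.
class ShutdownPrepareGate {

    private final AtomicBoolean shutdownPrepared = new AtomicBoolean(false);

    // Invoked when the operator (or e.g. a preStop hook) announces the redeployment.
    void prepareForShutdown(Runnable commitCurrentOffsets) {
        if (shutdownPrepared.compareAndSet(false, true)) {
            commitCurrentOffsets.run(); // one final commit
        }
    }

    // Consulted by the commit path before every subsequent commit attempt.
    boolean mayCommit() {
        return !shutdownPrepared.get();
    }
}
```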

Note: this does cause error logs in the operator when you deploy over existing Deployments that still use the RollingUpdate strategy.

Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PATCH at: https://250.131.152.1:443/apis/apps/v1/namespaces/cdc-core-staging-sharded-core-unrestricted-hs7i/deployments/core-shard-88?fieldManager=debeziumserver&force=true. Message: Deployment.apps "core-shard-88" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate'. Received status: Status(apiVersion=v1, code=422, details=StatusDetails(causes=[StatusCause(field=spec.strategy.rollingUpdate, message=Forbidden: may not be specified when strategy `type` is 'Recreate', reason=FieldValueForbidden, additionalProperties={})], group=apps, kind=Deployment, name=core-shard-88, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Deployment.apps "core-shard-88" is invalid: spec.strategy.rollingUpdate: Forbidden: may not be specified when strategy `type` is 'Recreate', metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Invalid, status=Failure, additionalProperties={}).

It doesn't cause any issues with your deployments, but the error logs can look more menacing than they are.
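For illustration, roughly how an in-place update could avoid that 422 with the fabric8 client (a sketch assuming fabric8 6.x; the namespace and deployment name are placeholders, and this is not what the operator does internally). Replacing the whole strategy object means no stale rollingUpdate field is sent along with type 'Recreate':

```java
import io.fabric8.kubernetes.api.model.apps.DeploymentBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class RecreatePatchSketch {
    public static void main(String[] args) {
        // Placeholder namespace and name, not taken from this PR.
        String namespace = "my-namespace";
        String name = "my-deployment";

        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            client.apps().deployments()
                  .inNamespace(namespace)
                  .withName(name)
                  .edit(d -> new DeploymentBuilder(d)
                          .editSpec()
                              // Setting a fresh strategy drops any leftover
                              // rollingUpdate block, which the API server forbids
                              // together with type 'Recreate'.
                              .withNewStrategy()
                                  .withType("Recreate")
                              .endStrategy()
                          .endSpec()
                          .build());
        }
    }
}
```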

@jcechace merged commit a112c5a into debezium:main on Jul 11, 2024
3 checks passed
@ryanvanhuuksloot deleted the DBZ-8051 branch on July 11, 2024 at 19:25