HCP Cluster Resource Deletion Cascading Subscription Delete #920

Open
wants to merge 10 commits into
base: main

@mbarnes mbarnes commented Dec 3, 2024

What this PR does

When a subscription state changes to "Deleted", the RP now triggers a deletion of all HCP clusters under the subscription as per the Resource Provider Contract.

Behind the scenes, this introduces the concept of "implicit" and "explicit" async operations:

  • An "implicit" async operation has an "Operation" item in Cosmos DB, but no status endpoint for ARM to poll.
  • An "explicit" async operation starts as an "implicit" operation. The Frontend.ExposeOperation method enriches the "Operation" item with information necessary to make the status endpoint accessible to ARM, and adds appropriate async headers to an http.ResponseWriter.

Importantly, the backend pod does not distinguish between implicit and explicit async operations. At the moment, implicit async operations are used only for deletions, and their sole purpose is to let the backend delete the "Resource" item in Cosmos DB after the actual resource is deleted.
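
To illustrate the point about the backend, here is a minimal Go sketch. It is not the RP's actual code: the `operationDoc` and `store` types, their field names, and `completeDeletion` are all hypothetical, and an in-memory map stands in for Cosmos DB. It only shows that the completion path is identical whether or not the operation item carries a status endpoint.

```go
package main

import "fmt"

// operationDoc is a hypothetical, simplified stand-in for the "Operation"
// item. Field names are assumptions made for this sketch.
type operationDoc struct {
	ID         string
	ExternalID string // ID of the resource the operation acts on
	Status     string
	// StatusEndpoint is empty for an implicit operation and set for an
	// explicit one. The backend path below never reads it.
	StatusEndpoint string
}

// store is a toy in-memory replacement for the database.
type store struct {
	resources  map[string]bool          // resource ID -> "Resource" item exists
	operations map[string]*operationDoc // operation ID -> "Operation" item
}

// completeDeletion models what the backend does once the actual resource is
// gone: finish the operation and delete the "Resource" item.
func (s *store) completeDeletion(opID string) error {
	op, ok := s.operations[opID]
	if !ok {
		return fmt.Errorf("operation %q not found", opID)
	}
	op.Status = "Succeeded"
	delete(s.resources, op.ExternalID)
	return nil
}

// newDemoStore builds one implicit and one explicit delete operation.
func newDemoStore() *store {
	return &store{
		resources: map[string]bool{"cluster-a": true, "cluster-b": true},
		operations: map[string]*operationDoc{
			"op-implicit": {ID: "op-implicit", ExternalID: "cluster-a", Status: "Deleting"},
			"op-explicit": {
				ID: "op-explicit", ExternalID: "cluster-b", Status: "Deleting",
				StatusEndpoint: "https://rp.example.invalid/operationstatuses/op-explicit",
			},
		},
	}
}

func main() {
	s := newDemoStore()
	for _, id := range []string{"op-implicit", "op-explicit"} {
		if err := s.completeDeletion(id); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s finished, resource items left: %d\n", id, len(s.resources))
	}
}
```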

Jira: ARO-13321 - Implement Cascading Subscription Deletion

Special notes for your reviewer

  • This duplicates a few database iterator commits from #883, which is still outstanding.
  • I'll add unit tests for this in a follow-up PR after I convert our existing unit tests to use gomock for Cosmos DB operations. Adding them now would just create extra work for me during that conversion.
  • I still need to document asynchronous operation mechanics in general and this "implicit" vs "explicit" concept will be part of it. I've been holding off on writing documentation until I can make some (imo) necessary changes to our database design. This is just to say I haven't forgotten about it.

Matthew Barnes added 5 commits December 3, 2024 09:20
Works for any document type now, and just stores a slice rather
than the entire cache for that document type.
A new function NewQueryItemsSinglePageIterator alters the behavior
of QueryItemsIterator to stop after the first "page" of items and
record the Cosmos DB continuation token. The continuation token
can be retrieved from the iterator with GetContinuationToken.

A QueryItemsIterator created with NewQueryItemsIterator will never
have a continuation token because it iterates until the last item.

The in-memory cache also adds a GetContinuationToken method to its
iterator implementation to fulfill the interface contract, but it
always returns an empty string.
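
The single-page behavior described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: `pagedSource`, `queryIterator`, and the offset-based token format are invented for the example (real Cosmos DB continuation tokens are opaque strings). Only the method name `GetContinuationToken` comes from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
)

// pagedSource is a hypothetical stand-in for a paged database query.
type pagedSource struct {
	items    []string
	pageSize int
}

// fetchPage returns one page starting at token, plus the token for the next
// page ("" once the last item has been returned).
func (p pagedSource) fetchPage(token string) ([]string, string, error) {
	if p.pageSize <= 0 {
		return nil, "", fmt.Errorf("page size must be positive, got %d", p.pageSize)
	}
	start := 0
	if token != "" {
		n, err := strconv.Atoi(token)
		if err != nil || n < 0 || n > len(p.items) {
			return nil, "", fmt.Errorf("invalid continuation token %q", token)
		}
		start = n
	}
	end := start + p.pageSize
	if end > len(p.items) {
		end = len(p.items)
	}
	next := ""
	if end < len(p.items) {
		next = strconv.Itoa(end)
	}
	return p.items[start:end], next, nil
}

// queryIterator either drains every page or stops after the first one.
type queryIterator struct {
	src        pagedSource
	singlePage bool
	start      string // token to resume from, "" for the beginning
	token      string // recorded only in single-page mode
}

// Items collects the items this iterator is allowed to see.
func (it *queryIterator) Items() ([]string, error) {
	var out []string
	token := it.start
	it.token = ""
	for {
		page, next, err := it.src.fetchPage(token)
		if err != nil {
			return nil, err
		}
		out = append(out, page...)
		if it.singlePage {
			it.token = next // remember where the caller can resume
			return out, nil
		}
		if next == "" {
			return out, nil // full iteration never leaves a token behind
		}
		token = next
	}
}

// GetContinuationToken returns "" unless a single page left items unread.
func (it *queryIterator) GetContinuationToken() string { return it.token }

func main() {
	src := pagedSource{items: []string{"a", "b", "c", "d", "e"}, pageSize: 2}
	single := &queryIterator{src: src, singlePage: true}
	page, err := single.Items()
	fmt.Println(page, single.GetContinuationToken(), err)
}
```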
For CosmosDBClient, the maxItems argument controls the type
of iterator returned. A positive maxItems returns a single-
page iterator with a possible continuation token, otherwise
the iterator continues until the last item.

Since the in-memory cache does not have continuation tokens,
the maxItems argument is ignored.

This also drops the resourceType argument. Callers first need to
parse the iterator items into resource documents before checking
the resource type.
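
With the resourceType argument gone, the filtering moves to the caller. A rough sketch of what that might look like, assuming the iterator yields raw JSON items: the `resourceDoc` shape, its JSON field names, and `filterByType` are hypothetical, and the resource type string is used only as sample data. The comparison is case-insensitive because ARM resource types are.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// clusterType is sample data for this sketch.
const clusterType = "Microsoft.RedHatOpenShift/hcpOpenShiftClusters"

// resourceDoc is a hypothetical, minimal resource document.
type resourceDoc struct {
	ResourceID   string `json:"resourceId"`
	ResourceType string `json:"resourceType"`
}

// filterByType parses each raw item into a resource document first and only
// then checks its resource type.
func filterByType(rawItems [][]byte, wantType string) ([]resourceDoc, error) {
	var out []resourceDoc
	for i, raw := range rawItems {
		var doc resourceDoc
		if err := json.Unmarshal(raw, &doc); err != nil {
			return nil, fmt.Errorf("item %d: %w", i, err)
		}
		if strings.EqualFold(doc.ResourceType, wantType) {
			out = append(out, doc)
		}
	}
	return out, nil
}

// demoItems returns two raw items as an iterator might yield them.
func demoItems() [][]byte {
	return [][]byte{
		[]byte(`{"resourceId":"cluster-a","resourceType":"Microsoft.RedHatOpenShift/hcpOpenShiftClusters"}`),
		[]byte(`{"resourceId":"pool-1","resourceType":"Microsoft.RedHatOpenShift/hcpOpenShiftClusters/nodePools"}`),
	}
}

func main() {
	clusters, err := filterByType(demoItems(), clusterType)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range clusters {
		fmt.Println("cluster:", c.ResourceID)
	}
}
```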
Add "externalID" and "internalID" parameters so the returned
document is a minimum valid OperationDocument for writing.
The operation item must now be created in the database prior to
calling ExposeOperation. ExposeOperation does all its processing
in a database update callback.

This is because there is an increasing number of cases where we
create an implicit async operation with no visible status endpoint.
Calling ExposeOperation makes an implicit async operation explicit,
with a status endpoint for ARM to poll. Hence the rename.

The tradeoff is that explicit asynchronous operations now require
two database operations (create and update), but it helps make the
RP logic cleaner. This could possibly be mitigated in the future by
using Cosmos DB's transactional batch operations, but it's gonna
take some serious refactoring to get there.
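
The create-then-update flow might look something like the sketch below. Everything here is an assumption for illustration: the `opStore` type, the shape of its update callback, the lowercase `exposeOperation` signature, and the status URL layout are not taken from this PR. The `Azure-AsyncOperation` header name follows the usual ARM convention for async polling. The `writes` counter makes the two-database-operation tradeoff visible.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// operationDoc is a hypothetical, simplified operation item.
type operationDoc struct {
	ID             string
	ExternalID     string
	StatusEndpoint string // empty while the operation is implicit
}

// opStore is an in-memory stand-in for the database that counts writes.
type opStore struct {
	ops    map[string]*operationDoc
	writes int
}

// create stores a new (implicit) operation item.
func (s *opStore) create(doc operationDoc) {
	s.ops[doc.ID] = &doc
	s.writes++
}

// update runs fn on a copy of the item and persists it only if fn reports
// that something changed.
func (s *opStore) update(id string, fn func(*operationDoc) bool) error {
	doc, ok := s.ops[id]
	if !ok {
		return fmt.Errorf("operation %q not found", id)
	}
	candidate := *doc
	if fn(&candidate) {
		*doc = candidate
		s.writes++
	}
	return nil
}

// exposeOperation makes an existing implicit operation explicit: it records
// a status endpoint inside the update callback, then sets the async header.
func exposeOperation(w http.ResponseWriter, s *opStore, id, baseURL string) error {
	var endpoint string
	err := s.update(id, func(doc *operationDoc) bool {
		endpoint = baseURL + "/operationstatuses/" + doc.ID // hypothetical layout
		if doc.StatusEndpoint == endpoint {
			return false // already exposed, nothing to write
		}
		doc.StatusEndpoint = endpoint
		return true
	})
	if err != nil {
		return err
	}
	w.Header().Set("Azure-AsyncOperation", endpoint)
	return nil
}

// demo creates an operation, exposes it, and reports the header and the
// number of database writes.
func demo() (header string, writes int, err error) {
	s := &opStore{ops: map[string]*operationDoc{}}
	s.create(operationDoc{ID: "op-1", ExternalID: "cluster-a"})
	rec := httptest.NewRecorder()
	err = exposeOperation(rec, s, "op-1", "https://rp.example.invalid")
	return rec.Header().Get("Azure-AsyncOperation"), s.writes, err
}

func main() {
	header, writes, err := demo()
	fmt.Println(header, writes, err)
}
```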
Matthew Barnes added 5 commits December 4, 2024 12:55
CancelActiveOperation marks the status of any active operation on
the resource as canceled.
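
A minimal sketch of the cancellation idea, with hypothetical types and a hypothetical `cancelActiveOperations` helper (the real CancelActiveOperation signature is not shown in this PR description). It assumes "active" means "not yet in one of the ARM terminal states" (Succeeded, Failed, Canceled).

```go
package main

import (
	"fmt"
	"strings"
)

// operation is a hypothetical, simplified operation item.
type operation struct {
	ID         string
	ExternalID string // ID of the resource the operation acts on
	Status     string
}

// isTerminal reports whether a status is one of the ARM terminal states.
func isTerminal(status string) bool {
	switch status {
	case "Succeeded", "Failed", "Canceled":
		return true
	}
	return false
}

// cancelActiveOperations marks every non-terminal operation on the given
// resource as canceled and returns how many it changed.
func cancelActiveOperations(ops []*operation, resourceID string) int {
	canceled := 0
	for _, op := range ops {
		if strings.EqualFold(op.ExternalID, resourceID) && !isTerminal(op.Status) {
			op.Status = "Canceled"
			canceled++
		}
	}
	return canceled
}

// demoOperations returns a mix of finished and in-flight operations.
func demoOperations() []*operation {
	return []*operation{
		{ID: "op-1", ExternalID: "cluster-a", Status: "Succeeded"},
		{ID: "op-2", ExternalID: "cluster-a", Status: "Updating"},
		{ID: "op-3", ExternalID: "cluster-b", Status: "Updating"},
	}
}

func main() {
	ops := demoOperations()
	n := cancelActiveOperations(ops, "cluster-a")
	fmt.Println("canceled:", n)
	for _, op := range ops {
		fmt.Println(op.ID, op.Status)
	}
}
```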
Will be reusing DeleteResource for subscription deletion.

Add database bookkeeping for the resource and any child resources.
This includes creating implicit operations for each resource being
deleted. The caller may then expose the returned operation ID.
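
The bookkeeping described here could be sketched as below. This is an assumption-laden illustration, not the PR's DeleteResource: the `db` type, the "Deleting" state string, the generated operation IDs, and the abbreviated resource IDs are all invented. Children are found by treating the parent ID plus "/" as a prefix, which keeps a sibling such as ".../clusters/ab" from matching ".../clusters/a".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resource and operation are hypothetical, simplified database items.
type resource struct {
	ID    string
	State string
}

type operation struct {
	ID         string
	ExternalID string
	Status     string
}

// db is an in-memory stand-in for the resource and operation containers.
type db struct {
	resources  map[string]*resource
	operations []*operation
	nextOp     int
}

// markDeleting records an implicit delete operation for one resource.
func (d *db) markDeleting(r *resource) string {
	d.nextOp++
	op := &operation{ID: fmt.Sprintf("op-%d", d.nextOp), ExternalID: r.ID, Status: "Deleting"}
	d.operations = append(d.operations, op)
	r.State = "Deleting"
	return op.ID
}

// deleteResource does the bookkeeping for a resource and its children and
// returns the parent's operation ID, which a caller may then expose.
func (d *db) deleteResource(id string) (string, error) {
	parent, ok := d.resources[id]
	if !ok {
		return "", fmt.Errorf("resource %q not found", id)
	}
	prefix := strings.ToLower(id) + "/"
	var childIDs []string
	for rid := range d.resources {
		if strings.HasPrefix(strings.ToLower(rid), prefix) {
			childIDs = append(childIDs, rid)
		}
	}
	sort.Strings(childIDs) // deterministic order for the sketch
	for _, cid := range childIDs {
		d.markDeleting(d.resources[cid])
	}
	return d.markDeleting(parent), nil
}

// demoDB holds one cluster with two node pools and an unrelated cluster.
func demoDB() *db {
	d := &db{resources: map[string]*resource{}}
	for _, id := range []string{
		"/subscriptions/sub-1/clusters/a",
		"/subscriptions/sub-1/clusters/a/nodePools/np1",
		"/subscriptions/sub-1/clusters/a/nodePools/np2",
		"/subscriptions/sub-1/clusters/ab",
	} {
		d.resources[id] = &resource{ID: id}
	}
	return d
}

func main() {
	d := demoDB()
	opID, err := d.deleteResource("/subscriptions/sub-1/clusters/a")
	fmt.Println(opID, len(d.operations), err)
}
```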
By my read of the Subscription Lifecycle API Reference [1], we
should favor 200 OK over 201 Created when creating or updating
a subscription.

[1]
https://github.com/cloud-and-ai-microsoft/resource-provider-contract/blob/master/v1.0/subscription-lifecycle-api-reference.md#response
Called when a subscription is deleted. The method is idempotent in
case of multiple subscription PUT requests.
Don't count on OperationID being set in OperationDocuments.
Implicit async operations will not have this field set. Get
the subscription ID from ExternalID instead.
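
As a rough illustration of reading the subscription ID from ExternalID, assuming ExternalID holds an ARM resource ID of the usual "/subscriptions/{id}/..." form: the helper name below is hypothetical, and real code would more likely rely on a proper resource ID parser than on string splitting.

```go
package main

import (
	"fmt"
	"strings"
)

// subscriptionIDFromResourceID extracts the subscription ID from an ARM
// resource ID. It is a simplified sketch, not a full resource ID parser.
func subscriptionIDFromResourceID(resourceID string) (string, error) {
	parts := strings.Split(strings.Trim(resourceID, "/"), "/")
	if len(parts) < 2 || !strings.EqualFold(parts[0], "subscriptions") || parts[1] == "" {
		return "", fmt.Errorf("not a subscription-scoped resource ID: %q", resourceID)
	}
	return parts[1], nil
}

func main() {
	// Sample ExternalID value; the IDs are made up for the example.
	externalID := "/subscriptions/00000000-0000-0000-0000-000000000000" +
		"/resourceGroups/rg/providers/Microsoft.RedHatOpenShift/hcpOpenShiftClusters/demo"
	sub, err := subscriptionIDFromResourceID(externalID)
	fmt.Println(sub, err)
}
```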
@mbarnes mbarnes force-pushed the subscription-deletion branch from d76b6ac to 05f8144 Compare December 4, 2024 17:56
@mbarnes mbarnes enabled auto-merge December 4, 2024 21:26