Update docs on CI/CD

sgibson91 · Dec 19, 2024 · 60f442a
1 parent 3ad9265

Showing 2 changed files with 22 additions and 35 deletions.
diff --git a/docs/hub-deployment-guide/runbooks/phase3/initial-hub-setup.md b/docs/hub-deployment-guide/runbooks/phase3/initial-hub-setup.md
@@ -162,13 +162,13 @@ All of the following steps must be followed in order to consider phase 3.1 compl
    If Dask gateway will be needed, then choose a `basehub`, and follow the guide on
    [how to enable dask-gateway on an existing hub](howto:features:daskhub).
 
-1. **Add the new cluster to CI/CD**
+1. **Add the new cluster and staging hub to CI/CD**
 
    ```{important}
-   This step is only applicable if the hub is the first hub being deployed to a cluster.
+   This step is only applicable if the hub is the first hub being deployed to a cluster **or** has `staging` in its name.
    ```
 
-   To ensure the new cluster and its hubs are appropriately handled by our CI/CD system, please add it as an entry in the following places:
+   To ensure the new cluster and its hubs are appropriately handled by our CI/CD system, please add it as an entry in the following places in the [`deploy-hubs.yaml`](https://github.com/2i2c-org/infrastructure/blob/HEAD/.github/workflows/deploy-hubs.yaml) GitHub Actions workflow file:
 
       - The [`deploy-hubs.yaml`](https://github.com/2i2c-org/infrastructure/blob/008ae2c1deb3f5b97d0c334ed124fa090df1f0c6/.github/workflows/deploy-hubs.yaml#L121) GitHub workflow has a job named [`upgrade-support-and-staging`](https://github.com/2i2c-org/infrastructure/blob/18f5a4f8f39ed98c2f5c99091ae9f19a1075c988/.github/workflows/deploy-hubs.yaml#L128-L166) that needs to list the clusters being automatically deployed by our CI/CD system. Add an entry for the new cluster here.
 

diff --git a/docs/reference/ci-cd/hub-deploy.md b/docs/reference/ci-cd/hub-deploy.md
@@ -7,62 +7,50 @@ You can learn more about this workflow in our blog post [Multiple JupyterHubs, m
 
 The best place to learn about the latest state of our *automatic* hub deployment
 is to look at [the `deploy-hubs.yaml` GitHub Actions workflow file](https://github.com/2i2c-org/infrastructure/tree/HEAD/.github/workflows/deploy-hubs.yaml).
-This workflow file depends on a locally defined action that [sets up access to a given cluster](https://github.com/2i2c-org/infrastructure/blob/main/.github/actions/setup-deploy/action.yaml) and itself contains four main jobs, detailed below.
+This workflow file depends on a locally defined action that [sets up access to a given cluster](https://github.com/2i2c-org/infrastructure/blob/main/.github/actions/setup-deploy/action.yaml) and itself contains a range of jobs, the most relevant of which are detailed below.
+There are also some filtering/optimisation jobs which are not discussed here.
 
 ## Main hub deployment workflow
 
 (cicd/hub/generate-jobs)=
 ### 1. `generate-jobs`: Generate Helm upgrade jobs
 
 The first job takes a list of files that have been added/modified as part of a Pull Request and pipes them into the [`generate-helm-upgrade-jobs` sub-command](https://github.com/2i2c-org/infrastructure/blob/main/deployer/helm_upgrade_decision.py) of the [deployer module](https://github.com/2i2c-org/infrastructure/tree/main/deployer).
-This sub-command uses a set of functions to calculate which hubs on which clusters require a helm upgrade, alongside whether the support chart and staging hub on that cluster should also be upgraded.
-If any production hubs require an upgrade, the upgrade of the staging hub is a requirement.
+This sub-command uses a set of functions to calculate which hubs on which clusters require a helm upgrade, alongside whether the support chart and staging hub(s) on that cluster should also be upgraded.
+If any production hubs require an upgrade, the upgrade of the staging hub(s) is a requirement.
 
 This job provides the following outputs:
 
-- Two JSON objects that can be read by later GitHub Actions jobs to define matrix jobs.
-  These JSON objects detail: which clusters require their support chart and/or staging hub to be upgraded, and which production hubs require an upgrade.
+- Three JSON objects that can be read by later GitHub Actions jobs to define matrix jobs.
+  These JSON objects detail: which clusters require their support chart to be upgraded, which staging hub(s) require an upgrade, and which production hubs require an upgrade.
 - The above JSON objects are also rendered as human-readable tables using [`rich`](https://github.com/Textualize/rich).
 
-````{admonition} Some special cased filepaths
+```{admonition} Some special cased filepaths
 While the aim of this workflow is to only upgrade the pieces of the infrastructure that require it with every change, some changes do require us to redeploy everything.
 
 - If a cluster's `cluster.yaml` file has been modified, we upgrade the support chart and **all** hubs on **that** cluster. This is because we cannot tell what has been changed without inspecting the diff of the file.
 - If any of the `basehub` or `daskhub` Helm charts have additions/modifications in their paths, we redeploy **all** hubs across **all** clusters.
-- If the support Helm chart has additions/modifications in its path, we redeploy the support chart on **all** clusters.
-- If the deployer module has additions/modifications in its path, then we redeploy **all** hubs on **all** clusters.
-
-```{attention}
-Right now, we redeploy everything when the deployer changes since the deployer undertakes some tasks that generates config related to authentication.
-This may change in the future as we move towards the deployer becoming a separable, stand-alone package.
+- If the `support` Helm chart has additions/modifications in its path, we redeploy the support chart on **all** clusters.
+- If the `deployer` module has additions/modifications in its path, then we redeploy **all** hubs on **all** clusters.
 ```
-````
 
-### 2. `upgrade-support-and-staging`: Upgrade support and staging hub Helm charts on clusters that require it
+### 2. `upgrade-support`: Upgrade support Helm chart on clusters that require it
 
-The next job reads in one of the JSON objects detailed above that defines which clusters need their support chart and/or staging hub upgrading.
-*Note that it is not a requirement for both the support chart and staging hub to be upgraded during this job.*
+The next job reads in one of the JSON objects detailed above that defines which clusters need their support chart upgrading.
 A matrix job is set up that parallelises over all the clusters defined in the JSON object.
-For each cluster, the support chart is first upgraded (if required) followed by the staging hub (if required).
+For each cluster, the support chart is upgraded (if required).
+We set an output variable from this job to determine if any support chart upgrades fail for a cluster.
+We then use these outputs to filter out the failed clusters and prevent further deployments to them, without impairing deployments to unrelated clusters.
 
-```{note}
-The 2i2c cluster is a special case here as it has three staging hubs: one running the `basehub` Helm chart and another running the `daskhub` Helm chart.
-We therefore run extra steps for the 2i2c cluster to upgrade these hubs (if required).
-```
+### 3. `upgrade-staging`: Upgrade Helm chart for staging hub(s) in parallel
 
+Next we deploy the staging hub(s) on a cluster.
 We use staging hubs as [canary deployments](https://sre.google/workbook/canarying-releases/) and prevent deploying production hubs if a staging deployment fails.
-Hence, the last step of this job is to set an output variable that stores if the job completed successfully or failed.
-
-### 3. `filter-generate-jobs`: Filter out jobs for clusters whose support/staging job failed
+Similarly to `upgrade-support`, the last step of this job is to set an output variable that stores whether the job completed successfully or failed.
 
-This job is an optimisation job.
-While we do want to prevent all production hubs on Cluster X from being upgraded if its support/staging job fails, we **don't** want to prevent the production hubs on Cluster Y from being upgraded because the support/staging job for Cluster X failed.
+### 4. `upgrade-prod`: Upgrade Helm chart for production hubs in parallel
 
-This job reads in the production hub job definitions generated in job 1 and the support/staging success/failure variables set in job 2, then proceeds to filter out the productions hub upgrade jobs that were due to be run on a cluster whose support/staging job failed.
-
-### 4. `upgrade-prod-hubs`: Upgrade Helm chart for production hubs in parallel
-
-This last job deploys all production hubs that require it in parallel to the clusters that successfully completed job 2.
+This last job deploys all production hubs that require it in parallel to the clusters that successfully completed a staging upgrade.
 
 (cicd/hub/pr-comment)=
 ## Posting the deployment plan as a comment on a Pull Request
@@ -82,7 +70,6 @@ This workflow downloads the artifacts uploaded by `generate-jobs` and then uses
 - Either update an existing comment or create a new comment on the PR posting the Markdown tables downloaded as an artifact.
 
 ```{admonition} Why we're using artifacts and separate workflow files
-
 Any secrets used by GitHub Actions are not available to Pull Requests that come from forks by default to protect against malicious code being executed with privileged access. `generate-jobs` needs to run in the PR context in order to establish which files are added/modified, but the required secrets would not be available for the rest of the workflow that would post a comment to the PR.
 
 To overcome this in a secure manner, we upload the required information (the body of the comment to be posted and the number of the PR the comment should be posted to) as artifacts.
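
Illustrative note (not part of the commit): the updated docs describe how the `upgrade-support` and `upgrade-staging` jobs record a success/failure output per cluster, and how failed clusters are then filtered out so that unrelated clusters still deploy. A minimal Python sketch of that filtering idea follows. The function name `filter_jobs`, the `cluster_name`/`hub_name` keys, the outcome strings, and the example cluster names are all hypothetical, chosen for illustration only; they are not taken from the deployer or the workflow file.

```python
def filter_jobs(jobs, upgrade_outcomes):
    """Drop jobs that target a cluster whose earlier upgrade failed.

    jobs: list of dicts, each with a (hypothetical) "cluster_name" key.
    upgrade_outcomes: dict mapping cluster name -> "success" or "failure".
    """
    failed_clusters = {
        cluster
        for cluster, outcome in upgrade_outcomes.items()
        if outcome == "failure"
    }
    # Jobs on clusters with no recorded failure pass through unchanged.
    return [job for job in jobs if job["cluster_name"] not in failed_clusters]


# Hypothetical production hub jobs spread over two clusters.
prod_jobs = [
    {"cluster_name": "cluster-x", "hub_name": "prod-a"},
    {"cluster_name": "cluster-x", "hub_name": "prod-b"},
    {"cluster_name": "cluster-y", "hub_name": "prod-c"},
]

# Hypothetical outcomes recorded by the staging upgrade step.
staging_outcomes = {"cluster-x": "failure", "cluster-y": "success"}

remaining = filter_jobs(prod_jobs, staging_outcomes)
print(remaining)
```

In this sketch, a failed staging upgrade on `cluster-x` removes both of its production jobs, while the job on `cluster-y` is kept. How the real workflow passes these outcomes between jobs is not shown in this diff.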