From 353ff952684d196f2be47f37b91f5418aa423445 Mon Sep 17 00:00:00 2001
From: Sarah Gibson
Date: Fri, 13 Dec 2024 12:41:25 +0000
Subject: [PATCH 1/7] Centralise docs on transferring data

Includes lessons learned from the AWI-CIROH migration and using pods to do the
transfer
---
 docs/howto/features/storage-quota.md          |  87 +++++------
 .../filesystem-management/data-transfer.md    | 138 +++++++++++++++++
 .../decrease-size-gcp-filestore.md            | 140 +-----------------
 docs/howto/filesystem-management/index.md     |   1 +
 4 files changed, 182 insertions(+), 184 deletions(-)
 create mode 100644 docs/howto/filesystem-management/data-transfer.md

diff --git a/docs/howto/features/storage-quota.md b/docs/howto/features/storage-quota.md
index d9b0da3dc7..03c5609ea6 100644
--- a/docs/howto/features/storage-quota.md
+++ b/docs/howto/features/storage-quota.md
@@ -1,7 +1,7 @@
(howto:configure-storage-quota)=
# Configure per-user storage quotas

-This guide explains how to enable and configure per-user storage quotas.
+This guide explains how to enable and configure per-user storage quotas using the `jupyterhub-home-nfs` Helm chart.

```{note}
Nest all config examples under a `basehub` key if deploying this for a daskhub.
@@ -11,35 +11,49 @@ Nest all config examples under a `basehub` key if deploying this for a daskhub.

The in-cluster NFS server uses a pre-provisioned disk to store the users' home directories. We don't use a dynamically provisioned volume because we want to be able to reuse the same disk even when the Kubernetes cluster is deleted and recreated. So the first step is to create a disk that will be used to store the users' home directories.

-For infrastructure running on AWS, we can create a disk through Terraform by adding a block like the following to the [tfvars file of the hub](https://github.com/2i2c-org/infrastructure/tree/main/terraform/aws/projects):
+For infrastructure running on AWS, we can create a disk through Terraform by adding a block like the following to the [`tfvars` file of the cluster](https://github.com/2i2c-org/infrastructure/tree/main/terraform/aws/projects):

```hcl
ebs_volumes = {
  "staging" = {
-    size        = 100
+    size        = 100 # in GB
    type        = "gp3"
    name_suffix = "staging"
-    tags        = {}
+    tags        = { "2i2c:hub-name": "staging" }
  }
}
```

This will create a disk with a size of 100GB for the `staging` hub that we can reference when configuring the NFS server.

-## Enabling jupyterhub-home-nfs
+Apply these changes with:

-To be able to configure per-user storage quotas, we need to run an in-cluster NFS server using [`jupyterhub-home-nfs`](https://github.com/sunu/jupyterhub-home-nfs). This can be enabled by setting `jupyterhub-home-nfs.enabled` to `true` in the hub's values file.
+```bash
+terraform plan -var-file=projects/$CLUSTER_NAME.tfvars
+terraform apply -var-file=projects/$CLUSTER_NAME.tfvars
+```
+
+## Enabling `jupyterhub-home-nfs`
+
+To be able to configure per-user storage quotas, we need to run an in-cluster NFS server using [`jupyterhub-home-nfs`](https://github.com/sunu/jupyterhub-home-nfs). This can be enabled by setting `jupyterhub-home-nfs.enabled = true` in the hub's values file (or in the cluster's common values file, if all hubs on the cluster will use it).
+
+`jupyterhub-home-nfs` expects a reference to a pre-provisioned disk.
+You can retrieve the `volumeId` by checking the Terraform outputs:
+
+```bash
+terraform output
+```

-jupyterhub-home-nfs expects a reference to an pre-provisioned disk. Here's an example of how to configure that on AWS and GCP.
+Here's an example of how to connect the volume to `jupyterhub-home-nfs` on AWS and GCP in the hub values file.

`````{tab-set} ````{tab-item} AWS :sync: aws-key ```yaml jupyterhub-home-nfs: - enabled: true + enabled: true # can be migrated to common values file eks: - enabled: true + enabled: true # can be migrated to common values file volumeId: vol-0a1246ee2e07372d0 ``` ```` @@ -48,9 +62,9 @@ jupyterhub-home-nfs: :sync: gcp-key ```yaml jupyterhub-home-nfs: - enabled: true + enabled: true # can be migrated to common values file gke: - enabled: true + enabled: true # can be migrated to common values file volumeId: projects/jupyter-nfs/zones/us-central1-f/disks/jupyter-nfs-home-directories ``` ```` @@ -59,63 +73,34 @@ jupyterhub-home-nfs: These changes can be deployed by running the following command: ```bash -deployer deploy +deployer deploy $CLUSTER_NAME $HUB_NAME ``` Once these changes are deployed, we should have a new NFS server running in our cluster through the `jupyterhub-home-nfs` Helm chart. We can get the IP address of the NFS server by running the following commands: ```bash # Authenticate with the cluster -deployer use-cluster-credentials +deployer use-cluster-credentials $CLUSTER_NAME # Retrieve the service IP -kubectl -n get svc -nfs-service +kubectl -n $HUB_NAME get svc ${HUB_NAME}-nfs-service ``` To check whether the NFS server is running properly, see the [Troubleshooting](#troubleshooting) section. -## Migrating existing home directories - -If there are existing home directories, we need to migrate them to the new NFS server. For this, we will create a throwaway pod with both the existing home directories and the new NFS server mounted, and we will copy the contents from the existing home directories to the new NFS server. 
- -Here's an example of how to do this: - -```bash -# Create a throwaway pod with both the existing home directories and the new NFS server mounted -deployer exec root-homes --extra-nfs-server= --extra-nfs-base-path=/ --extra-nfs-mount-path=/new-nfs-volume - -# Copy the existing home directories to the new NFS server while keeping the original permissions -rsync -av --info=progress2 /root-homes/ /new-nfs-volume/ -``` - -Make sure the path structure of the existing home directories matches the path structure of the home directories in the new NFS storage volume. For example, if an existing home directory is located at `/root-homes/staging/user`, we should expect to find it at `/new-nfs-volume/staging/user`. - -## Switching to the new NFS server - -Now that we have a new NFS server running in our cluster, and we have migrated the existing home directories to the new NFS server, we can switch the hub to use the new NFS server. This can be done by updating the `basehub.nfs.pv.serverIP` field in the hub's values file. - -```yaml -nfs: - pv: - serverIP: -``` +## Migrating existing home directories and switching to the new NFS server -Note that Kubernetes doesn't allow changing an existing PersistentVolume. So we need to delete the existing PersistentVolume first. +See [](data-transfer) for instructions on performing these steps. 
-```bash -kubectl delete pv ${HUB_NAME}-home-nfs --wait=false -kubectl --namespace $HUB_NAME delete pvc home-nfs --wait=false -kubectl --namespace $HUB_NAME delete pod -l component=shared-dirsize-metrics -kubectl --namespace $HUB_NAME delete pod -l component=shared-volume-metrics -``` +## Enforcing storage quotas -After this, we should be able to deploy the hub and have it use the new NFS server: +````{warning} +If you attempt to enforce quotas before having performed the migration, you may see the following error: ```bash -deployer deploy +FileNotFoundError: [Errno 2] No such file or directory: '/export/$HUB_NAME' ``` - -## Enforcing storage quotas +```` Now we can set quotas for each user and configure the path to monitor for storage quota enforcement. @@ -133,7 +118,7 @@ The `path` field is the path to the parent directory of the user's home director To deploy the changes, we need to run the following command: ```bash -deployer deploy +deployer deploy $CLUSTER_NAME $HUB_NAME ``` Once this is deployed, the hub will automatically enforce the storage quota for each user. If a user's home directory exceeds the quota, the user's pod may not be able to start successfully. diff --git a/docs/howto/filesystem-management/data-transfer.md b/docs/howto/filesystem-management/data-transfer.md new file mode 100644 index 0000000000..90fcac8695 --- /dev/null +++ b/docs/howto/filesystem-management/data-transfer.md @@ -0,0 +1,138 @@ +(data-transfer)= +# Transfer data across filestores + +This documentation covers how to transfer data between filestores. + +This process should be repeated for as many hubs as there are on the cluster, remembering to update the value of `$HUB_NAME`. + +## The initial copy process + +```bash +export CLUSTER_NAME= +export HUB_NAME= +``` + +1. **Create a pod on the cluster and mount the source and destination filestores.** + + We can use the following deployer command to create a pod in the cluster with the two filestores mounted. 
+ The current default filestore will be mounted automatically and we use `--extra-nfs-*` flags to mount the second filestore. + + ```bash + deployer exec root-homes $CLUSTER_NAME $HUB_NAME \ + --extra-nfs-server=$SERVER_IP \ + --extra-nfs-base-path="/" \ + --extra-nfs-mount-path="dest-fs" \ + --persist + ``` + + - `$SERVER_IP` can be found either through the relevant Cloud provider console, or by running `kubectl --namespace $HUB_NAME get svc` if the second filestore is running `jupyterhub-home-nfs`. + - The `--persist` flag will prevent the pod from terminating when you exit it, so you can leave the transfer process running. + +1. **Install some tools into the pod.** + + We'll need a few extra tools for this transfer so let's install them. + + ```bash + apt-get update && \ + apt-get install -y rsync parallel screen + ``` + + - `rsync` is what will actually perform the copy + - `parallel` will help speed up the process by parallelising it + - `screen` will help the process continue to live in the pod and be protected from network disruptions that would normally kill it + +1. **Start a screen session and begin the initial copy process.** + + ```bash + ls /root-homes/${HUB_NAME}/ | parallel -j4 rsync -ah --progress /root-homes/${HUB_NAME}/{}/ /dest-fs/${HUB_NAME}/{}/ + ``` + + ```{admonition} Monitoring tips + :class: tip + + Start with `-j4`, monitor for an hour or so, and increase the number of threads until you reach high CPU utilisation (low `idle`, high `iowait` from the `top` command). + ``` + + ```{admonition} screen tips + :class: tip + + To disconnect your `screen` session, you can input {kbd}`Ctrl` + {kbd}`A`, then {kbd}`D` (for "detach"). + + To reconnect to a running `screen` session, run `screen -r`. + + Once you have finished with your `screen` session, you can kill it by inputting {kbd}`Ctrl` + {kbd}`A`, then {kbd}`K` and confirming. + ``` + + Once you have detached from `screen`, can now `exit` the pod and let the copy run. 
+ +(data-transfer:reattach-pod)= +## Reattaching to the data transfer pod + +You can regain access to the pod created for the data transfer using: + +```bash +kubectl --namespace $HUB_NAME attach -i ${CLUSTER_NAME}-root-home-shell +``` + +## Switching the NFS servers over + +Once the files have been migrated to the new NFS filestore, we can update the hub(s) to use the new filestore IP address. + +At this point, it is useful to have a few terminal windows open: + +- One terminal with `deployer use-cluster-credentials $CLUSTER_NAME` running to run `kubectl` commands in +- Another terminal to run `deployer deploy $CLUSTER_NAME $HUB_NAME` in +- A terminal that is attached to the data transfer pod to re-run the file transfer (see [](data-transfer:reattach-pod)) + +1. **Check there are no active users on the hub.** + You can check this by running: + + ```bash + kubectl --namespace $HUB_NAME get pods -l "component=singleuser-server" + ``` + + If no resources are found, you can proceed to the next step. + +1. **Make the hub unavailable by deleting the `proxy-public` service.** + + ```bash + kubectl --namespace $HUB_NAME delete svc proxy-public + ``` + +1. **Re-run the `rsync` command in the data transfer pod.** + This process should take much less time now that the initial copy has completed. + +1. **Delete the `PersistentVolume` and all dependent objects.** + `PersistentVolumes` are _not_ editable, so we need to delete and recreate them to allow the deploy with the new IP address to succeed. + Below is the sequence of objects _dependent_ on the pv, and we need to delete all of them for the deploy to finish. + + ```bash + kubectl delete pv ${HUB_NAME}-home-nfs --wait=false + kubectl --namespace $HUB_NAME delete pvc home-nfs --wait=false + kubectl --namespace $HUB_NAME delete pod -l component=shared-dirsize-metrics + kubectl --namespace $HUB_NAME delete pod -l component=shared-volume-metrics + ``` + +1. 
**Update `nfs.pv.serverIP` values in the `.values.yaml` file.** + + ```yaml + nfs: + pv: + serverIP: + ``` + +1. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.** + This should also bring back the `proxy-public` service and restore access. + You can monitor progress by running: + + ```bash + kubectl --namespace $HUB_NAME get svc --watch + ``` + +Open and merge a PR with these changes so that other engineers cannot accidentally overwrite them. + +We can now delete the pod we created to mount the filestores: + +```bash +kubectl --namespace $HUB_NAME delete pod ${CLUSTER_NAME}-root-home-shell +``` diff --git a/docs/howto/filesystem-management/decrease-size-gcp-filestore.md b/docs/howto/filesystem-management/decrease-size-gcp-filestore.md index de4879b538..039f878744 100644 --- a/docs/howto/filesystem-management/decrease-size-gcp-filestore.md +++ b/docs/howto/filesystem-management/decrease-size-gcp-filestore.md @@ -1,5 +1,5 @@ (howto:decrease-size-gcp-filestore)= -# Decrease the size of a GCP Filestore +# Resize a GCP Filestore down Filestores deployed using the `BASIC_HDD` tier (which we do by default) support _increasing_ their size, but **not** _decreasing_ it. Therefore when we talk about "decreasing the size of a GCP filestore", we are actually referring to creating a brand new filestore of the desired smaller size, copying all the files across from the larger filestore, and then deleting the larger filestore. @@ -11,13 +11,13 @@ export CLUSTER_NAME="" export HUB_NAME="" ``` -## 1. Create a new filestore +## Create a new filestore Navigate to the `terraform/gcp` folder in the `infrastructure` repository and open the relevant `projects/.tfvars` file. 
Add another filestore definition to the file with config that looks like this: -``` +```hcl filestores = { "filestore" : { # This first filestore instance should already be present capacity_gb: @@ -46,137 +46,11 @@ of the new filestore to the cluster's support values file, following Open a PR and merge these changes so that other engineers cannot accidentally overwrite them. -## 2. Create a VM - -In the GCP console of the project you are working in, [create a VM](https://console.cloud.google.com/compute/instances) by clicking the "Create instance" button at the top of the page. - -- It is helpful to give the VM a name, such as `nfs-copy-vm`, so you can identify it -- Make sure you create the VM in the same region and/or zone as the cluster (you can find this info in the `tfvars` file) -- Choose an instance like an `e2-standard-8` which has 8 CPUs and 32GB memory -- Under the "Boot disk" section, increase the disk size to 500GB (this can always be changed later) and swap the operating system to Ubuntu - -Once the VM has been created, click on it from the list of instances, and then ssh into it by clicking the ssh button at the top of the window. -This will open a new browser window. - -## 3. Attach source and destination filestores to the VM[^1] - -[^1]: - -First we need to install the NFS software: - -```bash -sudo apt-get -y update && -sudo apt-get install nfs-common -``` - -````{note} -If this fails, you may also need to install `zip` to extract the archive. - -```bash -sudo apt-get install zip -``` -```` - -We then make two folders which will serve as the mount points for the filestores: - -```bash -sudo mkdir -p src-fs -sudo mkdir -p dest-fs -``` - -Mount the two filestores using the `mount command` - -```bash -sudo mount -o rw,intr :/ -``` - -`` should always be `homes` and the `` for both filestores can be found on the [filestore instances page](https://console.cloud.google.com/filestore/instances). 
- -You can confirm that the filestores were mounted successfully by running: - -```bash -df -h --type=nfs -``` - -And the output should contain something similar to the following: - -```bash -Filesystem Size Used Avail Use% Mounted on -10.0.1.2:/share1 1018G 76M 966G 1% /mnt/render -10.0.2.2:/vol3 1018G 76M 966G 1% /mnt/filestore3 -``` - -## 4. Copy the files from the source to the destination filestore - -First of all, start a [screen session](https://linuxize.com/post/how-to-use-linux-screen/) by running `screen`. -This will allow you to close the browser window containing your ssh connection to the VM without stopping the copy process. +## Migrating the data and switching to the new filestore -Begin copying the files from the source to the destination filestore with the following `rclone` command: +See [](data-transfer) for instructions on how to perform these steps. -```bash -sudo rclone sync --multi-thread-streams=12 --progress --links src-fs dest-fs -``` - -Depending on the size of the filestore, this could take anywhere from hours to days! - -```{admonition} screen tips -:class: tip - -To disconnect your `screen` session, you can input {kbd}`Ctrl` + {kbd}`A`, then {kbd}`D` (for "detach"). - -To reconnect to a running `screen` session, run `screen -r`. - -Once you have finished with your `screen` session, you can kill it by inputting {kbd}`Ctrl` + {kbd}`A`, then {kbd}`K` and confirming. -``` - -## 5. Use the new filestore IP address in all relevant hub config files - -Once the initial copy of the files has completed, we can begin the process of updating the hubs to use the new filestore IP address. -It is best practice to begin with the `staging` hub before moving onto any production hubs. 
- -At this point it is useful to set up two terminal windows: - -- One terminal with `deployer use-cluster-credentials $CLUSTER_NAME` executed to run `kubectl commands -- Another terminal to run `deployer deploy $CLUSTER_NAME $HUB_NAME` - -You should also have the browser window with the ssh connection to the VM handy to re-run the file copy command. - -1. **Check there are no active users on the hub.** - You can check this by running: - ```bash - kubectl --namespace $HUB_NAME get pods -l "component=singleuser-server" - ``` - If no resources are found, you can proceed to the next step. -1. **Make the hub unavailable by deleting the `proxy-public` service.** - ```bash - kubectl --namespace $HUB_NAME delete svc proxy-public - ``` -1. **Re-run the `rclone` command on the VM.** - This process should take much less time now that the initial copy has completed. -1. **Delete the `PersistentVolume` and all dependent objects.** - `PersistentVolumes` are _not_ editable, so we need to delete and recreate them to allow the deploy with the new IP address to succeed. - Below is the sequence of objects _dependent_ on the pv, and we need to delete all of them for the deploy to finish. - ```bash - kubectl delete pv ${HUB_NAME}-home-nfs --wait=false - kubectl --namespace $HUB_NAME delete pvc home-nfs --wait=false - kubectl --namespace $HUB_NAME delete pod -l component=shared-dirsize-metrics - kubectl --namespace $HUB_NAME delete pod -l component=shared-volume-metrics - ``` -1. **Update `nfs.pv.serverIP` values in the `.values.yaml` file.** -1. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.** - This should also bring back the `proxy-public` service and restore access. - You can monitor progress by running: - ```bash - kubectl --namespace $HUB_NAME get svc --watch - ``` - -Repeat this process for as many hubs as there are on the cluster, remembering to update the value of `$HUB_NAME`. 
- -Open and merge a PR with these changes so that other engineers cannot accidentally overwrite them. - -We can now delete the VM we created to mount the filestores. - -## 6. Decommission the previous filestore +## Decommission the previous filestore Back in the `terraform/gcp` folder and `.tfvars` file, we can delete the definition of the original filestore. @@ -184,7 +58,7 @@ You also need to temporarily comment out the [`lifecycle` rule in the `storage.t Plan and apply these changes, ensuring only the old filestore will be destroyed: -``` +```bash terraform plan -var-file=projects/$CLUSTER_NAME.tfvars terraform apply -var-file=projects/$CLUSTER_NAME.tfvars ``` diff --git a/docs/howto/filesystem-management/index.md b/docs/howto/filesystem-management/index.md index 49ac69d6e6..1cdeec3e9a 100644 --- a/docs/howto/filesystem-management/index.md +++ b/docs/howto/filesystem-management/index.md @@ -5,6 +5,7 @@ This documentation covers tasks related to managing filesystems. ```{toctree} :maxdepth: 2 +data-transfer filesystem-backups/index decrease-size-gcp-filestore increase-size-aws-ebs From b0776e43237fa1acbf91cf1e2ad0d2944b57d7e5 Mon Sep 17 00:00:00 2001 From: Sarah Gibson Date: Fri, 13 Dec 2024 12:47:10 +0000 Subject: [PATCH 2/7] fix code block declaration --- docs/howto/features/storage-quota.md | 2 +- docs/howto/filesystem-management/decrease-size-gcp-filestore.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/howto/features/storage-quota.md b/docs/howto/features/storage-quota.md index 03c5609ea6..2644c49378 100644 --- a/docs/howto/features/storage-quota.md +++ b/docs/howto/features/storage-quota.md @@ -13,7 +13,7 @@ The in-cluster NFS server uses a pre-provisioned disk to store the users' home d For infrastructure running on AWS, we can create a disk through Terraform by adding a block like the following to the [`tfvars` file of the cluster](https://github.com/2i2c-org/infrastructure/tree/main/terraform/aws/projects): -```hcl +``` 
ebs_volumes = { "staging" = { size = 100 # in GB diff --git a/docs/howto/filesystem-management/decrease-size-gcp-filestore.md b/docs/howto/filesystem-management/decrease-size-gcp-filestore.md index 039f878744..654ba32c7d 100644 --- a/docs/howto/filesystem-management/decrease-size-gcp-filestore.md +++ b/docs/howto/filesystem-management/decrease-size-gcp-filestore.md @@ -17,7 +17,7 @@ Navigate to the `terraform/gcp` folder in the `infrastructure` repository and op Add another filestore definition to the file with config that looks like this: -```hcl +``` filestores = { "filestore" : { # This first filestore instance should already be present capacity_gb: From 096eac740259d77dfb55b07d0fb5b5f91fd0a8ef Mon Sep 17 00:00:00 2001 From: Sarah Gibson Date: Fri, 13 Dec 2024 12:50:29 +0000 Subject: [PATCH 3/7] fix a clashing myst ref --- docs/howto/features/storage-quota.md | 2 +- docs/howto/filesystem-management/data-transfer.md | 6 +++--- .../filesystem-management/decrease-size-gcp-filestore.md | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/howto/features/storage-quota.md b/docs/howto/features/storage-quota.md index 2644c49378..76407fd2b2 100644 --- a/docs/howto/features/storage-quota.md +++ b/docs/howto/features/storage-quota.md @@ -90,7 +90,7 @@ To check whether the NFS server is running properly, see the [Troubleshooting](# ## Migrating existing home directories and switching to the new NFS server -See [](data-transfer) for instructions on performing these steps. +See [](migrate-data) for instructions on performing these steps. 
## Enforcing storage quotas diff --git a/docs/howto/filesystem-management/data-transfer.md b/docs/howto/filesystem-management/data-transfer.md index 90fcac8695..6af0979175 100644 --- a/docs/howto/filesystem-management/data-transfer.md +++ b/docs/howto/filesystem-management/data-transfer.md @@ -1,4 +1,4 @@ -(data-transfer)= +(migrate-data)= # Transfer data across filestores This documentation covers how to transfer data between filestores. @@ -65,7 +65,7 @@ export HUB_NAME= Once you have detached from `screen`, can now `exit` the pod and let the copy run. -(data-transfer:reattach-pod)= +(migrate-data:reattach-pod)= ## Reattaching to the data transfer pod You can regain access to the pod created for the data transfer using: @@ -82,7 +82,7 @@ At this point, it is useful to have a few terminal windows open: - One terminal with `deployer use-cluster-credentials $CLUSTER_NAME` running to run `kubectl` commands in - Another terminal to run `deployer deploy $CLUSTER_NAME $HUB_NAME` in -- A terminal that is attached to the data transfer pod to re-run the file transfer (see [](data-transfer:reattach-pod)) +- A terminal that is attached to the data transfer pod to re-run the file transfer (see [](migrate-data:reattach-pod)) 1. **Check there are no active users on the hub.** You can check this by running: diff --git a/docs/howto/filesystem-management/decrease-size-gcp-filestore.md b/docs/howto/filesystem-management/decrease-size-gcp-filestore.md index 654ba32c7d..e882cf13d2 100644 --- a/docs/howto/filesystem-management/decrease-size-gcp-filestore.md +++ b/docs/howto/filesystem-management/decrease-size-gcp-filestore.md @@ -48,7 +48,7 @@ Open a PR and merge these changes so that other engineers cannot accidentally ov ## Migrating the data and switching to the new filestore -See [](data-transfer) for instructions on how to perform these steps. +See [](migrate-data) for instructions on how to perform these steps. 
## Decommission the previous filestore From b870772fc8c7ef5c785873bed04cb061c84b1c1c Mon Sep 17 00:00:00 2001 From: Sarah Gibson Date: Mon, 16 Dec 2024 10:21:20 +0000 Subject: [PATCH 4/7] Use more generic language to show that this is cloud-agnostic --- .../filesystem-management/data-transfer.md | 26 +++++++++---------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/docs/howto/filesystem-management/data-transfer.md b/docs/howto/filesystem-management/data-transfer.md index 6af0979175..072cb98b8e 100644 --- a/docs/howto/filesystem-management/data-transfer.md +++ b/docs/howto/filesystem-management/data-transfer.md @@ -1,7 +1,7 @@ (migrate-data)= -# Transfer data across filestores +# Migrate data across NFS servers -This documentation covers how to transfer data between filestores. +This documentation covers how to transfer data between NFS servers in a cloud-agnostic way. This process should be repeated for as many hubs as there are on the cluster, remembering to update the value of `$HUB_NAME`. @@ -12,10 +12,10 @@ export CLUSTER_NAME= export HUB_NAME= ``` -1. **Create a pod on the cluster and mount the source and destination filestores.** +1. **Create a pod on the cluster and mount the source and destination NFS servers.** - We can use the following deployer command to create a pod in the cluster with the two filestores mounted. - The current default filestore will be mounted automatically and we use `--extra-nfs-*` flags to mount the second filestore. + We can use the following deployer command to create a pod in the cluster with the two NFS servers mounted. + The current default NFS will be mounted automatically and we use `--extra-nfs-*` flags to mount the second NFS. 
```bash deployer exec root-homes $CLUSTER_NAME $HUB_NAME \ @@ -25,7 +25,7 @@ export HUB_NAME= --persist ``` - - `$SERVER_IP` can be found either through the relevant Cloud provider console, or by running `kubectl --namespace $HUB_NAME get svc` if the second filestore is running `jupyterhub-home-nfs`. + - `$SERVER_IP` can be found either through the relevant Cloud provider console, or by running `kubectl --namespace $HUB_NAME get svc` if the second NFS is running `jupyterhub-home-nfs`. - The `--persist` flag will prevent the pod from terminating when you exit it, so you can leave the transfer process running. 1. **Install some tools into the pod.** @@ -76,7 +76,7 @@ kubectl --namespace $HUB_NAME attach -i ${CLUSTER_NAME}-root-home-shell ## Switching the NFS servers over -Once the files have been migrated to the new NFS filestore, we can update the hub(s) to use the new filestore IP address. +Once the files have been migrated, we can update the hub(s) to use the new NFS server IP address. At this point, it is useful to have a few terminal windows open: @@ -93,16 +93,16 @@ At this point, it is useful to have a few terminal windows open: If no resources are found, you can proceed to the next step. -1. **Make the hub unavailable by deleting the `proxy-public` service.** +2. **Make the hub unavailable by deleting the `proxy-public` service.** ```bash kubectl --namespace $HUB_NAME delete svc proxy-public ``` -1. **Re-run the `rsync` command in the data transfer pod.** +3. **Re-run the `rsync` command in the data transfer pod.** This process should take much less time now that the initial copy has completed. -1. **Delete the `PersistentVolume` and all dependent objects.** +4. **Delete the `PersistentVolume` and all dependent objects.** `PersistentVolumes` are _not_ editable, so we need to delete and recreate them to allow the deploy with the new IP address to succeed. 
Below is the sequence of objects _dependent_ on the pv, and we need to delete all of them for the deploy to finish. @@ -113,7 +113,7 @@ At this point, it is useful to have a few terminal windows open: kubectl --namespace $HUB_NAME delete pod -l component=shared-volume-metrics ``` -1. **Update `nfs.pv.serverIP` values in the `.values.yaml` file.** +5. **Update `nfs.pv.serverIP` values in the `.values.yaml` file.** ```yaml nfs: @@ -121,7 +121,7 @@ At this point, it is useful to have a few terminal windows open: serverIP: ``` -1. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.** +6. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.** This should also bring back the `proxy-public` service and restore access. You can monitor progress by running: @@ -131,7 +131,7 @@ At this point, it is useful to have a few terminal windows open: Open and merge a PR with these changes so that other engineers cannot accidentally overwrite them. -We can now delete the pod we created to mount the filestores: +We can now delete the pod we created to mount the NFS servers: ```bash kubectl --namespace $HUB_NAME delete pod ${CLUSTER_NAME}-root-home-shell From f7b7cd997af9f3c05879a345b9b6452899e40d29 Mon Sep 17 00:00:00 2001 From: Sarah Gibson Date: Mon, 16 Dec 2024 10:24:42 +0000 Subject: [PATCH 5/7] Note what to do when user servers are running during the NFS server switchover process --- docs/howto/filesystem-management/data-transfer.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/howto/filesystem-management/data-transfer.md b/docs/howto/filesystem-management/data-transfer.md index 072cb98b8e..32a09bb0f3 100644 --- a/docs/howto/filesystem-management/data-transfer.md +++ b/docs/howto/filesystem-management/data-transfer.md @@ -92,6 +92,7 @@ At this point, it is useful to have a few terminal windows open: ``` If no resources are found, you can proceed to the next step. 
+ If there are resources, you may wish to wait until these servers have stopped, or coordinate a maintenance window with the community when disruption and potential data loss should be expected. 2. **Make the hub unavailable by deleting the `proxy-public` service.** From d9dd92886bdd532330e07bb75abd47f5c1e43af5 Mon Sep 17 00:00:00 2001 From: Sarah Gibson Date: Mon, 16 Dec 2024 10:43:26 +0000 Subject: [PATCH 6/7] Add a step to check the reclaim policy of the PV to ensure we don't lose data --- .../filesystem-management/data-transfer.md | 27 +++++++++++++++---- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/docs/howto/filesystem-management/data-transfer.md b/docs/howto/filesystem-management/data-transfer.md index 32a09bb0f3..150348ac60 100644 --- a/docs/howto/filesystem-management/data-transfer.md +++ b/docs/howto/filesystem-management/data-transfer.md @@ -94,16 +94,33 @@ At this point, it is useful to have a few terminal windows open: If no resources are found, you can proceed to the next step. If there are resources, you may wish to wait until these servers have stopped, or coordinate a maintenance window with the community when disruption and potential data loss should be expected. -2. **Make the hub unavailable by deleting the `proxy-public` service.** +1. **Make the hub unavailable by deleting the `proxy-public` service.** ```bash kubectl --namespace $HUB_NAME delete svc proxy-public ``` -3. **Re-run the `rsync` command in the data transfer pod.** +1. **Re-run the `rsync` command in the data transfer pod.** This process should take much less time now that the initial copy has completed. -4. **Delete the `PersistentVolume` and all dependent objects.** +1. **Check the Reclaim Policy of the `Persistent Volume`.** + + We should first verify the [reclaim policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) of the persistent volume to ensure we will not lose any data. 
+ + The reclaim policy can be checked by running: + + ```bash + kubectl get pv ${HUB_NAME}-home-nfs + ``` + + If the reclaim policy is `Retain`, we are safe to delete the pv without data loss. + Otherwise, you may need to patch the reclaim policy to change it to `Retain` with: + + ```bash + kubectl patch pv ${HUB_NAME}-home-nfs -p '{"spec": {"persistentVolumeReclaimPolicy": "Retain"}}' + ``` + +1. **Delete the `PersistentVolume` and all dependent objects.** `PersistentVolumes` are _not_ editable, so we need to delete and recreate them to allow the deploy with the new IP address to succeed. Below is the sequence of objects _dependent_ on the pv, and we need to delete all of them for the deploy to finish. @@ -114,7 +131,7 @@ At this point, it is useful to have a few terminal windows open: kubectl --namespace $HUB_NAME delete pod -l component=shared-volume-metrics ``` -5. **Update `nfs.pv.serverIP` values in the `.values.yaml` file.** +1. **Update `nfs.pv.serverIP` values in the `.values.yaml` file.** ```yaml nfs: @@ -122,7 +139,7 @@ At this point, it is useful to have a few terminal windows open: serverIP: ``` -6. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.** +1. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.** This should also bring back the `proxy-public` service and restore access. 
You can monitor progress by running: From e892e82fb5f55dc0c77feed51a36a06624d7035d Mon Sep 17 00:00:00 2001 From: Sarah Gibson Date: Fri, 20 Dec 2024 10:20:19 +0000 Subject: [PATCH 7/7] Clarify pod reattchment steps --- docs/howto/filesystem-management/data-transfer.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/howto/filesystem-management/data-transfer.md b/docs/howto/filesystem-management/data-transfer.md index 150348ac60..9c4fab6ea7 100644 --- a/docs/howto/filesystem-management/data-transfer.md +++ b/docs/howto/filesystem-management/data-transfer.md @@ -71,7 +71,11 @@ export HUB_NAME= You can regain access to the pod created for the data transfer using: ```bash -kubectl --namespace $HUB_NAME attach -i ${CLUSTER_NAME}-root-home-shell +# Creates a new bash process within the pod +kubectl --namespace $HUB_NAME exec -it ${CLUSTER_NAME}-root-home-shell -- /bin/bash + +# Reattaches to the running screen process which is running the rsync process +screen -r ``` ## Switching the NFS servers over