Merge pull request 2i2c-org#5284 from sgibson91/file-transfer-docs
Centralise docs on transferring data
sgibson91 authored Dec 20, 2024
2 parents 819c419 + e892e82 commit f30d566
Showing 4 changed files with 204 additions and 184 deletions.
89 changes: 37 additions & 52 deletions docs/howto/features/storage-quota.md
(howto:configure-storage-quota)=
# Configure per-user storage quotas

This guide explains how to enable and configure per-user storage quotas using the `jupyterhub-home-nfs` Helm chart.

```{note}
Nest all config examples under a `basehub` key if deploying this for a daskhub.
```

The in-cluster NFS server uses a pre-provisioned disk to store the users' home directories. We don't use a dynamically provisioned volume because we want to be able to reuse the same disk even when the Kubernetes cluster is deleted and recreated. So the first step is to create a disk that will be used to store the users' home directories.

For infrastructure running on AWS, we can create a disk through Terraform by adding a block like the following to the [`tfvars` file of the cluster](https://github.com/2i2c-org/infrastructure/tree/main/terraform/aws/projects):

```hcl
ebs_volumes = {
  "staging" = {
    size        = 100 # in GB
    type        = "gp3"
    name_suffix = "staging"
    tags        = { "2i2c:hub-name": "staging" }
  }
}
```

This will create a disk with a size of 100GB for the `staging` hub that we can reference when configuring the NFS server.

Apply these changes with:

```bash
terraform plan -var-file=projects/$CLUSTER_NAME.tfvars
terraform apply -var-file=projects/$CLUSTER_NAME.tfvars
```

## Enabling `jupyterhub-home-nfs`

To be able to configure per-user storage quotas, we need to run an in-cluster NFS server using [`jupyterhub-home-nfs`](https://github.com/sunu/jupyterhub-home-nfs). This can be enabled by setting `jupyterhub-home-nfs.enabled` to `true` in the hub's values file (or in the cluster's common values file if all hubs on the cluster will use it).

`jupyterhub-home-nfs` expects a reference to a pre-provisioned disk.
You can retrieve the `volumeId` by checking the Terraform outputs:

```bash
terraform output
```
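
If the Terraform outputs don't surface the volume ID, you can also look it up directly from the cloud provider. For example, on AWS (an illustrative command, assuming the AWS CLI is configured for the cluster's account and the `2i2c:hub-name` tag from the `tfvars` example above):

```bash
# Find the EBS volume by the tag set in the tfvars block
aws ec2 describe-volumes \
  --filters "Name=tag:2i2c:hub-name,Values=staging" \
  --query "Volumes[].VolumeId" --output text
```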

Here's an example of how to connect the volume to `jupyterhub-home-nfs` on AWS and GCP in the hub values file.

`````{tab-set}
````{tab-item} AWS
:sync: aws-key
```yaml
jupyterhub-home-nfs:
  enabled: true # can be migrated to common values file
  eks:
    enabled: true # can be migrated to common values file
    volumeId: vol-0a1246ee2e07372d0
```
````
````{tab-item} GCP
:sync: gcp-key
```yaml
jupyterhub-home-nfs:
  enabled: true # can be migrated to common values file
  gke:
    enabled: true # can be migrated to common values file
    volumeId: projects/jupyter-nfs/zones/us-central1-f/disks/jupyter-nfs-home-directories
```
````
`````
These changes can be deployed by running the following command:

```bash
deployer deploy $CLUSTER_NAME $HUB_NAME
```

Once these changes are deployed, we should have a new NFS server running in our cluster through the `jupyterhub-home-nfs` Helm chart. We can get the IP address of the NFS server by running the following commands:

```bash
# Authenticate with the cluster
deployer use-cluster-credentials $CLUSTER_NAME

# Retrieve the service IP
kubectl -n $HUB_NAME get svc ${HUB_NAME}-nfs-service
```

To check whether the NFS server is running properly, see the [Troubleshooting](#troubleshooting) section.

## Migrating existing home directories and switching to the new NFS server

See [](migrate-data) for instructions on performing these steps.

## Enforcing storage quotas

````{warning}
If you attempt to enforce quotas before having performed the migration, you may see the following error:
```bash
FileNotFoundError: [Errno 2] No such file or directory: '/export/$HUB_NAME'
```

````

Now we can set quotas for each user and configure the path to monitor for storage quota enforcement.
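
Quotas are configured under the `jupyterhub-home-nfs` key in the hub's values file. A minimal sketch, assuming the chart exposes a `quotaEnforcer` block with a `hardQuota` size (in GB) and the `path` to monitor:

```yaml
jupyterhub-home-nfs:
  quotaEnforcer:
    hardQuota: "10" # in GB
    path: /export/staging
```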

The `path` field is the path to the parent directory of the user's home directories.
To deploy the changes, we need to run the following command:

```bash
deployer deploy $CLUSTER_NAME $HUB_NAME
```

Once this is deployed, the hub will automatically enforce the storage quota for each user. If a user's home directory exceeds the quota, the user's pod may not be able to start successfully.
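
Under the hood, `jupyterhub-home-nfs` uses XFS project quotas, so one hedged way to spot-check usage and limits (assuming the exported filesystem is mounted at `/export` inside the server pod; `<nfs-server-pod>` is a placeholder for the actual pod name) is:

```bash
# Report per-directory usage and quota limits on the exported filesystem
kubectl -n $HUB_NAME exec -it <nfs-server-pod> -- xfs_quota -x -c 'report -h' /export
```
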
160 changes: 160 additions & 0 deletions docs/howto/filesystem-management/data-transfer.md
(migrate-data)=
# Migrate data across NFS servers

This documentation covers how to transfer data between NFS servers in a cloud-agnostic way.

This process should be repeated for as many hubs as there are on the cluster, remembering to update the value of `$HUB_NAME`.

## The initial copy process

First, set the environment variables that the commands below rely on:

```bash
export CLUSTER_NAME=<cluster_name>
export HUB_NAME=<hub_name>
```

1. **Create a pod on the cluster and mount the source and destination NFS servers.**

We can use the following deployer command to create a pod in the cluster with the two NFS servers mounted.
The current default NFS will be mounted automatically and we use `--extra-nfs-*` flags to mount the second NFS.

```bash
deployer exec root-homes $CLUSTER_NAME $HUB_NAME \
--extra-nfs-server=$SERVER_IP \
--extra-nfs-base-path="/" \
--extra-nfs-mount-path="dest-fs" \
--persist
```

- `$SERVER_IP` can be found either through the relevant cloud provider console, or by running `kubectl --namespace $HUB_NAME get svc` if the second NFS server is running `jupyterhub-home-nfs`.
- The `--persist` flag will prevent the pod from terminating when you exit it, so you can leave the transfer process running.
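
For instance, if the destination NFS is a `jupyterhub-home-nfs` server set up as in the storage quota guide, a hedged lookup of its cluster IP (assuming the `${HUB_NAME}-nfs-service` service name used there) is:

```bash
# Capture the ClusterIP of the destination NFS service
export SERVER_IP=$(kubectl --namespace $HUB_NAME get svc ${HUB_NAME}-nfs-service \
  --output jsonpath='{.spec.clusterIP}')
```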

1. **Install some tools into the pod.**

We'll need a few extra tools for this transfer so let's install them.

```bash
apt-get update && \
apt-get install -y rsync parallel screen
```

- `rsync` is what will actually perform the copy
- `parallel` will help speed up the process by parallelising it
- `screen` will help the process continue to live in the pod and be protected from network disruptions that would normally kill it

1. **Start a screen session and begin the initial copy process.**

```bash
# Run the copy inside screen so it survives disconnections
screen
ls /root-homes/${HUB_NAME}/ | parallel -j4 rsync -ah --progress /root-homes/${HUB_NAME}/{}/ /dest-fs/${HUB_NAME}/{}/
```

Here `parallel` substitutes each top-level directory name for `{}`, running up to four `rsync` processes at once.

```{admonition} Monitoring tips
:class: tip
Start with `-j4`, monitor for an hour or so, and increase the number of parallel jobs until you reach high CPU utilisation (low `idle`, high `iowait` from the `top` command).
```

```{admonition} screen tips
:class: tip
To disconnect your `screen` session, you can input {kbd}`Ctrl` + {kbd}`A`, then {kbd}`D` (for "detach").
To reconnect to a running `screen` session, run `screen -r`.
Once you have finished with your `screen` session, you can kill it by inputting {kbd}`Ctrl` + {kbd}`A`, then {kbd}`K` and confirming.
```

Once you have detached from `screen`, you can `exit` the pod and let the copy run.

(migrate-data:reattach-pod)=
## Reattaching to the data transfer pod

You can regain access to the pod created for the data transfer using:

```bash
# Creates a new bash process within the pod
kubectl --namespace $HUB_NAME exec -it ${CLUSTER_NAME}-root-home-shell -- /bin/bash

# Reattaches to the running screen process which is running the rsync process
screen -r
```

## Switching the NFS servers over

Once the files have been migrated, we can update the hub(s) to use the new NFS server IP address.

At this point, it is useful to have a few terminal windows open:

- One terminal with `deployer use-cluster-credentials $CLUSTER_NAME` active, for running `kubectl` commands in
- Another terminal to run `deployer deploy $CLUSTER_NAME $HUB_NAME` in
- A terminal that is attached to the data transfer pod to re-run the file transfer (see [](migrate-data:reattach-pod))

1. **Check there are no active users on the hub.**
You can check this by running:

```bash
kubectl --namespace $HUB_NAME get pods -l "component=singleuser-server"
```

If no resources are found, you can proceed to the next step.
If there are resources, you may wish to wait until these servers have stopped, or coordinate a maintenance window with the community, during which disruption and potential data loss should be expected.

1. **Make the hub unavailable by deleting the `proxy-public` service.**

```bash
kubectl --namespace $HUB_NAME delete svc proxy-public
```

1. **Re-run the `rsync` command in the data transfer pod.**
This process should take much less time now that the initial copy has completed.

1. **Check the reclaim policy of the `PersistentVolume`.**

We should first verify the [reclaim policy](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#reclaiming) of the persistent volume to ensure we will not lose any data.

The reclaim policy can be checked by running:

```bash
kubectl get pv ${HUB_NAME}-home-nfs
```

If the reclaim policy is `Retain`, we are safe to delete the PV without data loss.
Otherwise, you may need to patch the reclaim policy to change it to `Retain` with:

```bash
kubectl patch pv ${HUB_NAME}-home-nfs -p '{"spec": {"persistentVolumeReclaimPolicy": "Retain"}}'
```
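
You can confirm the patch took effect by reading the policy back with `kubectl`'s JSONPath output:

```bash
kubectl get pv ${HUB_NAME}-home-nfs \
  --output jsonpath='{.spec.persistentVolumeReclaimPolicy}'
```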

1. **Delete the `PersistentVolume` and all dependent objects.**
`PersistentVolumes` are _not_ editable, so we need to delete and recreate them to allow the deploy with the new IP address to succeed.
Below is the sequence of objects that _depend_ on the PV; we need to delete all of them for the deploy to finish.

```bash
kubectl delete pv ${HUB_NAME}-home-nfs --wait=false
kubectl --namespace $HUB_NAME delete pvc home-nfs --wait=false
kubectl --namespace $HUB_NAME delete pod -l component=shared-dirsize-metrics
kubectl --namespace $HUB_NAME delete pod -l component=shared-volume-metrics
```

1. **Update the `nfs.pv.serverIP` value in the `<hub-name>.values.yaml` file.**

```yaml
nfs:
pv:
serverIP: <nfs_service_ip>
```

1. **Run `deployer deploy $CLUSTER_NAME $HUB_NAME`.**
This should also bring back the `proxy-public` service and restore access.
You can monitor progress by running:

```bash
kubectl --namespace $HUB_NAME get svc --watch
```

Once the `proxy-public` service has been recreated, the hub should be reachable again.

Open and merge a PR with these changes so that other engineers cannot accidentally overwrite them.

We can now delete the pod we created to mount the NFS servers:

```bash
kubectl --namespace $HUB_NAME delete pod ${CLUSTER_NAME}-root-home-shell
```
