Merge pull request 2i2c-org#4628 from sgibson91/manual-gcp-backup-check
Create a process for manually verifying GCP filestore backups have happened and old backups have been cleaned up
sgibson91 authored Aug 15, 2024
2 parents ccb4d08 + 4911325 commit 2ca9e2f
Showing 7 changed files with 198 additions and 7 deletions.
13 changes: 8 additions & 5 deletions deployer/README.md
@@ -107,9 +107,10 @@ The `deployer.py` file is the main file, that contains all of the commands regis
│   │   ├── deploy_dashboards.py
│   │   ├── tokens.py
│   │   └── utils.py
-│   └── validate
-│       ├── cluster.schema.yaml
-│       └── config.py
+│   ├── validate
+│   │   ├── cluster.schema.yaml
+│   │   └── config.py
+│   └── verify_backups.py
```

### The `health_check_tests` directory
@@ -135,15 +136,16 @@ This section describes some of the subcommands the `deployer` can carry out.
**Command line usage:**

```bash
Usage: deployer [OPTIONS] COMMAND [ARGS]...

Usage: deployer [OPTIONS] COMMAND [ARGS]...
╭─ Options ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --install-completion Install completion for the current shell. │
│ --show-completion Show completion for the current shell, to copy it or customize the installation. │
│ --help Show this message and exit. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ cilogon-client Manage cilogon clients for hubs' authentication. │
│ config Get refined information from the config folder. │
│ debug Debug issues by accessing different components and their logs │
│ decrypt-age Decrypt secrets sent to `[email protected]` via `age` │
│ deploy Deploy one or more hubs in a given cluster │
@@ -156,6 +158,7 @@ This section describes some of the subcommands the `deployer` can carry out.
│ transform Programmatically transform datasets, such as cost tables for billing purposes. │
│ use-cluster-credentials Pop a new shell or execute a command after authenticating to the given cluster using the deployer's credentials │
│ validate Validate configuration files such as helm chart values and cluster.yaml files. │
│ verify-backups Verify backups of home directories have been successfully created, and old backups have been cleared out. │
╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
```

1 change: 1 addition & 0 deletions deployer/__main__.py
@@ -22,6 +22,7 @@
import deployer.commands.grafana.tokens # noqa: F401
import deployer.commands.transform.cost_table # noqa: F401
import deployer.commands.validate.config # noqa: F401
import deployer.commands.verify_backups # noqa: F401
import deployer.keys.decrypt_age # noqa: F401

from .cli_app import app
6 changes: 6 additions & 0 deletions deployer/cli_app.py
@@ -19,6 +19,7 @@
grafana_app = typer.Typer(pretty_exceptions_show_locals=False)
validate_app = typer.Typer(pretty_exceptions_show_locals=False)
transform_app = typer.Typer(pretty_exceptions_show_locals=False)
verify_backups_app = typer.Typer(pretty_exceptions_show_locals=False)

app.add_typer(
    generate_app,
@@ -57,3 +58,8 @@
    name="transform",
    help="Programmatically transform datasets, such as cost tables for billing purposes.",
)
app.add_typer(
    verify_backups_app,
    name="verify-backups",
    help="Verify backups of home directories have been successfully created, and old backups have been cleared out.",
)
163 changes: 163 additions & 0 deletions deployer/commands/verify_backups.py
@@ -0,0 +1,163 @@
"""
Helper script to verify home directories are being backed up correctly

GCP
---
Wraps a gcloud command to list existing backups of a Fileshare
"""

import json
import subprocess
from datetime import datetime, timedelta

import jmespath
import typer

from deployer.cli_app import verify_backups_app
from deployer.utils.rendering import print_colour


def get_existing_gcp_backups(
    project: str, region: str, filestore_name: str, filestore_share_name: str
):
    """List existing backups of a share on a filestore using the gcloud CLI.

    We filter the backups based on:
    - GCP project
    - GCP region
    - Filestore name
    - Filestore share name

    Args:
        project (str): The GCP project the filestore is located in
        region (str): The region the filestore is located in, e.g., us-central1
        filestore_name (str): The name of the filestore instance
        filestore_share_name (str): The name of the share on the filestore instance

    Returns:
        list(dict): A JSON-like object, where each dict-entry in the list describes
            an existing backup of the filestore
    """
    # Get all existing backups in the selected project and region
    backups = subprocess.check_output(
        [
            "gcloud",
            "filestore",
            "backups",
            "list",
            "--format=json",
            f"--project={project}",
            f"--region={region}",
        ],
        text=True,
    )
    backups = json.loads(backups)

    # Filter returned backups by filestore and share names
    backups = jmespath.search(
        f"[?sourceFileShare == '{filestore_share_name}' && contains(sourceInstance, '{filestore_name}')]",
        backups,
    )

    # Parse `createTime` property into a datetime object for comparison
    backups = [
        {
            k: (
                datetime.strptime(v.split(".")[0], "%Y-%m-%dT%H:%M:%S")
                if k == "createTime"
                else v
            )
            for k, v in backup.items()
        }
        for backup in backups
    ]

    return backups


def filter_gcp_backups_into_recent_and_old(
    backups: list, backup_freq_days: int, retention_days: int
):
    """Filter the list of backups into two groups:
    - Recent backups that were created within our backup window, defined by
      backup_freq_days
    - Out-of-date backups that are older than our retention window, defined by
      retention_days

    Args:
        backups (list(dict)): A JSON-like object defining the existing backups
            for the filestore and share we care about
        backup_freq_days (int): How often, in days, we create a backup
        retention_days (int): The number of days above which a backup is considered
            to be out of date

    Returns:
        recent_backups (list(dict)): A JSON-like object containing all existing
            backups with a `createTime` within our backup window
        old_backups (list(dict)): A JSON-like object containing all existing
            backups with a `createTime` older than our retention window
    """
    # Generate a list of filestore backups that are younger than our backup window
    recent_backups = [
        backup
        for backup in backups
        if datetime.now() - backup["createTime"] < timedelta(days=backup_freq_days)
    ]

    # Generate a list of filestore backups that are older than our set retention period
    old_backups = [
        backup
        for backup in backups
        if datetime.now() - backup["createTime"] > timedelta(days=retention_days)
    ]

    return recent_backups, old_backups


@verify_backups_app.command()
def gcp(
    project: str = typer.Argument(
        ..., help="The GCP project the filestore is located in"
    ),
    region: str = typer.Argument(
        ..., help="The GCP region the filestore is located in, e.g., us-central1"
    ),
    filestore_name: str = typer.Argument(
        ..., help="The name of the filestore instance to verify backups of"
    ),
    filestore_share_name: str = typer.Option(
        "homes", help="The name of the share on the filestore"
    ),
    backup_freq_days: int = typer.Option(
        1, help="How often, in days, backups should be created"
    ),
    retention_days: int = typer.Option(
        5, help="How old, in days, backups are allowed to become before being deleted"
    ),
):
    filestore_backups = get_existing_gcp_backups(
        project, region, filestore_name, filestore_share_name
    )
    recent_filestore_backups, old_filestore_backups = (
        filter_gcp_backups_into_recent_and_old(
            filestore_backups, backup_freq_days, retention_days
        )
    )

    if len(recent_filestore_backups) > 0:
        print_colour(
            f"A backup has been made within the last {backup_freq_days} day(s)!"
        )
    else:
        print_colour(
            f"No backups have been made in the last {backup_freq_days} day(s)!",
            colour="red",
        )

    if len(old_filestore_backups) > 0:
        print_colour(
            f"Filestore backups older than {retention_days} day(s) have been found!",
            colour="red",
        )
    else:
        print_colour("No out-dated backups have been found!")
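
The `createTime` handling in `get_existing_gcp_backups` can be looked at in isolation. The following is a minimal sketch, not part of this commit: it assumes the timestamps are RFC 3339 strings in UTC with optional fractional seconds and a trailing `Z`, and the helper names `parse_create_time` and `backup_age_days` are hypothetical. Unlike the committed code, it attaches the UTC timezone so that ages do not depend on the local clock's timezone.

```python
from datetime import datetime, timezone


def parse_create_time(value: str) -> datetime:
    """Parse an RFC 3339 style timestamp into a timezone-aware UTC datetime."""
    # Drop a trailing "Z" (assumed here to mean UTC), then any fractional seconds
    seconds_only = value.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds_only, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def backup_age_days(create_time: datetime, now: datetime) -> float:
    """Return the age of a backup in days, relative to `now`."""
    return (now - create_time).total_seconds() / 86400


# Illustrative values only; both datetimes are timezone-aware UTC
created = parse_create_time("2024-08-14T02:00:00.123456789Z")
now = datetime(2024, 8, 15, 2, 0, 0, tzinfo=timezone.utc)
print(backup_age_days(created, now))
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the helper, keeps the calculation easy to test.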
15 changes: 15 additions & 0 deletions docs/howto/filesystem-backups/enable-backups.md
@@ -63,3 +63,18 @@ export CLUSTER_NAME=<cluster-name>

This will have successfully enabled automatic backups of GCP Filestores for this
cluster.

### Verify successful backups on GCP

We manually verify that backups are being successfully created and cleaned up on a regular schedule.

To verify that a backup has been recently created, and that no backups older than the retention period exist, we can use the following deployer command:

```bash
deployer verify-backups gcp <project-name> <region> <filestore-name>
```

where:
- `<project-name>` is the name of the GCP project the Filestore is located in
- `<region>` is the GCP region the Filestore is located in, e.g., `us-central1`
- `<filestore-name>` is the name of the Filestore instance
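
To illustrate what the command reports, here is a minimal sketch of the two checks it performs, using the default values of 1 day for the backup frequency and 5 days for the retention period. The function name `classify_backups`, the fixed `now`, and the sample backups are assumptions made for illustration; the deployer's own logic lives in `deployer/commands/verify_backups.py`.

```python
from datetime import datetime, timedelta


def classify_backups(backups, now, backup_freq_days=1, retention_days=5):
    """Split backups into those inside the backup window and those past retention."""
    # Backups created within the last `backup_freq_days` days
    recent = [
        b for b in backups
        if now - b["createTime"] < timedelta(days=backup_freq_days)
    ]
    # Backups older than `retention_days` days, which should have been cleaned up
    old = [
        b for b in backups
        if now - b["createTime"] > timedelta(days=retention_days)
    ]
    return recent, old


# Hypothetical sample data: one fresh backup, one mid-age, one past retention
now = datetime(2024, 8, 15, 12, 0, 0)
backups = [
    {"name": "backup-a", "createTime": datetime(2024, 8, 15, 2, 0, 0)},
    {"name": "backup-b", "createTime": datetime(2024, 8, 12, 2, 0, 0)},
    {"name": "backup-c", "createTime": datetime(2024, 8, 8, 2, 0, 0)},
]

recent, old = classify_backups(backups, now)
print([b["name"] for b in recent])  # ['backup-a']
print([b["name"] for b in old])     # ['backup-c']
```

A healthy Filestore should produce a non-empty `recent` list and an empty `old` list. In this sample, `backup-c` would trigger the warning about backups older than the retention period.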
2 changes: 1 addition & 1 deletion docs/howto/upgrade-cluster/aws.md
@@ -69,7 +69,7 @@ kubectl get pod -A -l "app.kubernetes.io/component in (dask-scheduler, dask-work
Notify others in 2i2c that you are starting this cluster upgrade in the
`#maintenance-notices` Slack channel.

-### 4. Upgrade the k8s control plane
+### 4. Upgrade the k8s control plane[^2]

#### 4.1. Upgrade the k8s control plane one minor version

5 changes: 4 additions & 1 deletion requirements.txt
@@ -42,4 +42,7 @@ requests==2.*
GitPython==3.*

# Used to parse units that kubernetes understands (like GiB)
-kubernetes
+kubernetes
+
+# Used to query JSON objects with JMESPath expressions
+jmespath
