document how to enable prometheus alerts
sunu committed Dec 19, 2024
1 parent 2ae2999 commit 1a081db
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions docs/howto/features/storage-quota.md
@@ -138,6 +138,59 @@ deployer deploy <cluster_name> <hub_name>

Once this is deployed, the hub will automatically enforce the storage quota for each user. If a user's home directory exceeds the quota, the user's pod may not be able to start successfully.

## Enabling alerting through Prometheus Alertmanager

Once we have enabled storage quotas, we want to be alerted when the disk of the NFS server is close to full so that we can take appropriate action.

To do this, we need to create a Prometheus alerting rule that fires when the disk usage of the NFS server exceeds a threshold, and use Alertmanager to deliver the alert.

First, we need to enable Alertmanager in the cluster's support values file (for example, [here's the one for the `nasa-veda` cluster](https://github.com/2i2c-org/infrastructure/blob/main/config/clusters/nasa-veda/support.values.yaml)).

```yaml
prometheus:
  alertmanager:
    enabled: true
```

Then, we need to add the alerting rule itself. For example, the following rule fires when less than 10% of the NFS server's disk has been available for 15 minutes, that is, when disk usage has stayed above 90% of the total disk size. We would add it to the cluster's support values file:

```yaml
prometheus:
  serverFiles:
    alerting_rules.yml:
      groups:
        - name: <cluster_name> jupyterhub-home-nfs EBS volume full
          rules:
            - alert: jupyterhub-home-nfs-ebs-full
              expr: node_filesystem_avail_bytes{mountpoint="/shared-volume", component="shared-volume-metrics"} / node_filesystem_size_bytes{mountpoint="/shared-volume", component="shared-volume-metrics"} < 0.1
              for: 15m
              labels:
                severity: critical
                channel: pagerduty
                cluster: <cluster_name>
              annotations:
                summary: "jupyterhub-home-nfs EBS volume full in namespace {{ $labels.namespace }}"
```
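Before deploying, the rule can be sanity-checked locally with Prometheus's rule unit tests (`promtool test rules <test_file>`). The following is only a sketch under several assumptions: the contents of `alerting_rules.yml` (the `groups:` block above) have been copied into a standalone file of that name next to the test file, `<cluster_name>` has been replaced with the hypothetical value `example-cluster`, and the `staging` namespace and byte values are made up for illustration.

```yaml
# Hypothetical test file, e.g. alerting_rules_test.yml (not part of the repository).
rule_files:
  - alerting_rules.yml # assumed: a standalone copy of the `groups:` block above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 50 of 1000 bytes available (5% free) for 30 minutes: below the 10% threshold.
      - series: 'node_filesystem_avail_bytes{mountpoint="/shared-volume", component="shared-volume-metrics", namespace="staging"}'
        values: '50+0x30'
      - series: 'node_filesystem_size_bytes{mountpoint="/shared-volume", component="shared-volume-metrics", namespace="staging"}'
        values: '1000+0x30'
    alert_rule_test:
      # At 20m the condition has held for longer than `for: 15m`, so the alert should be firing.
      - eval_time: 20m
        alertname: jupyterhub-home-nfs-ebs-full
        exp_alerts:
          - exp_labels:
              severity: critical
              channel: pagerduty
              cluster: example-cluster # assumed replacement for <cluster_name>
              mountpoint: /shared-volume
              component: shared-volume-metrics
              namespace: staging
            exp_annotations:
              summary: "jupyterhub-home-nfs EBS volume full in namespace staging"
```

The expected labels combine the labels set on the rule with those carried by the input series, which is why `mountpoint`, `component`, and `namespace` appear alongside `severity`, `channel`, and `cluster`.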

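Finally, we need to configure Alertmanager to route alerts that carry the `channel: pagerduty` label (set on the rule above) to the `pagerduty` receiver, which sends them to PagerDuty.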

```yaml
prometheus:
  alertmanager:
    enabled: true
    config:
      route:
        group_wait: 10s
        group_interval: 5m
        receiver: pagerduty
        repeat_interval: 3h
        routes:
          - receiver: pagerduty
            match:
              channel: pagerduty
```
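The route above refers to a receiver named `pagerduty`, which must be defined somewhere in the Alertmanager configuration for the route to resolve. If the cluster does not already define one, a minimal sketch might look like the following. The `<pagerduty_integration_key>` placeholder is hypothetical, and where the key is stored is an assumption: because it is a secret, it would normally live in an encrypted values file rather than in the plain-text support values file.

```yaml
prometheus:
  alertmanager:
    config:
      receivers:
        # The name must match the `receiver` used in the route above.
        - name: pagerduty
          pagerduty_configs:
            # Placeholder for a PagerDuty Events API v2 integration key.
            # Assumed to be supplied from an encrypted values file.
            - routing_key: <pagerduty_integration_key>
```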


## Troubleshooting

### Checking the NFS server is running properly