Enabling per hub storage quotas #764
Comments
I like that idea!
@damianavila I wonder if we can bring this issue into our backlog following https://2i2c.freshdesk.com/a/tickets/171. Disk space filling up is most often felt by hubs on shared clusters, and it feels most unfair when the hub community reporting it may not even have been the one to cause it.
I added it to the backlog and raised the priority! Thanks for bringing this one to attention, @sgibson91.
My current worry is that the nfs external provisioner seems a bit abandoned (kubernetes-sigs/nfs-ganesha-server-and-external-provisioner#106). @consideRatio got some patches into it, so he might have an idea of how active it is. The other problem I have with it is that it generates directory names with randomized characters, so if we lose the PVC objects (by recreating the k8s cluster, for example) we can no longer map users to their home directories! Currently that is not the case - we can recreate the k8s cluster and not lose any user data. This complicates backup quite a bit.
Here's a different approach to try.
For shared clusters
The advantages of this approach are that it's much simpler than the external provisioner, doesn't require us to maintain an NFS server by hand, and doesn't keep state in the kubernetes cluster to associate users with their home directories. Given that the provisioner doesn't seem to be maintained upstream, I think we have a much better chance of maintaining this than of maintaining the external provisioner itself.
For non-shared clusters
I think using cloud based filestores (EFS, etc.) is still more appropriate when we aren't operating shared clusters. For those, disk space work should focus on reporting. I'd suggest we write a prometheus exporter that basically runs …
There is a lot of precedent for running this kind of 'nfs server in a pod' - https://github.com/appscode/third-party-tools/blob/master/storage/nfs/artifacts/nfs-server.yaml for example. It'll have to run as a privileged pod, but it's totally doable. It'll also allow us to monitor free space on the disk easily.
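For illustration only, here is a minimal sketch of the kind of free-space reporting exporter discussed above, assuming Python's prometheus_client library and a hypothetical mount path of /export; the exact command or mechanism suggested in the comment above is not specified here.

```python
# Illustrative sketch (not the exporter proposed above): expose total, used and
# free bytes of an NFS-backed mount as Prometheus gauges.
import shutil
import time

from prometheus_client import Gauge, start_http_server

MOUNT_PATH = "/export"  # hypothetical mount point of the per-hub NFS disk

total_bytes = Gauge("homedir_disk_total_bytes", "Total size of the home directory disk")
used_bytes = Gauge("homedir_disk_used_bytes", "Used space on the home directory disk")
free_bytes = Gauge("homedir_disk_free_bytes", "Free space on the home directory disk")

if __name__ == "__main__":
    start_http_server(9000)  # serve metrics on :9000/metrics for Prometheus to scrape
    while True:
        usage = shutil.disk_usage(MOUNT_PATH)
        total_bytes.set(usage.total)
        used_bytes.set(usage.used)
        free_bytes.set(usage.free)
        time.sleep(60)  # refresh once a minute; Prometheus scrapes at its own interval
```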
Thank you for the proposal @yuvipanda! I like it for three reasons:
I believe we should try and work towards implementing an MVP of this sooner rather than later!
@yuvipanda @consideRatio
From https://2i2c.freshdesk.com/a/tickets/271: I think Yuvi has already lined up a great proposal above, and we just need to assign someone to try to implement it.
@yuvipanda @sgibson91 @consideRatio
This deploys https://github.com/yuvipanda/prometheus-dirsize-exporter, an *efficient* per-user homedirectory stats (size, no. of files, last modified date, etc) collector. It is capped at performing no more than 250 IO operations per second, to not overwhelm NFS servers. Metrics are refreshed every 2h after completion, although on large servers (like LEAP), they can take many many hours to complete with just 250 IO operations per second. This is perfectly fine though, as we do not need 'up to date' information. Trading off metric latency for minimal resource usage is pretty good here. Ref 2i2c-org#764
From #764 (comment):
I wrote this today! https://github.com/yuvipanda/prometheus-dirsize-exporter. It has some performance optimizations as well, although it could do more. #2621 deploys it to our clusters.
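As an illustration of the throttling idea described above (not the actual prometheus-dirsize-exporter implementation), here is a minimal sketch of a home directory size collector that caps filesystem operations per second; the `directory_size` helper and the 250 ops/sec default are assumptions for the example.

```python
# Illustrative sketch: walk a home directory tree while capping stat() calls
# per second, so a slow collection pass never overwhelms the NFS server.
import os
import time


def directory_size(path: str, max_ops_per_second: int = 250) -> int:
    """Return the total size in bytes of files under `path`, throttled to
    roughly `max_ops_per_second` stat() calls per second."""
    total = 0
    ops = 0
    window_start = time.monotonic()
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.stat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished mid-walk; skip it
            ops += 1
            if ops >= max_ops_per_second:
                # Sleep off the rest of the one-second window, then reset it.
                elapsed = time.monotonic() - window_start
                if elapsed < 1:
                    time.sleep(1 - elapsed)
                ops = 0
                window_start = time.monotonic()
    return total
```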
Just wanted to raise the priority of building an alerting and possibly a notification system on top of the really great and useful Grafana dashboard. This is motivated by @jbusecke's ticket https://2i2c.freshdesk.com/a/tickets/995 about the challenges of manually monitoring usage and notifying users one by one as the user base grows. I was thinking that one step in this direction is to try and enable Grafana alerting on the usage dashboard, by setting a per-user limit above which it would notify Julius. Anyway, I believe this is something to have on our radar for next quarter. cc @damianavila, who I remember mentioning that improving our monitoring and alerting systems might be a goal for the coming quarter.
@yuvipanda I recall we transitioned away from having a few in-cluster NFS servers. What is your take currently on the implementation previously proposed in #764 (comment)?
@consideRatio that's still the only possible path forward for per-hub quotas. It is probably also a quarter's worth of work, and not high priority right now. It's also just per-hub quotas; I think per-user quotas should be handled separately (and perhaps be more 'alerts' than actual quotas). I think this issue can be scoped down to only discussing per-hub quotas, and left to be prioritized in the future.
@GeorgianaElena do you want to open a dedicated issue to represent the ideas in #764 (comment)? I don't think we will manage to track that effectively as part of this already complicated issue.
This is being handled via NASA-IMPACT/veda-jupyterhub#41
October 2023 Update
Current proposals:
Work already done:
Description
Right now on our NFS servers, user home directories expand until the whole NFS storage is consumed, and then we either expand the storage further or delete stuff. This isn't necessarily a problem, except that the storage may not be equally distributed between hubs or even users! A slightly nicer solution may be to implement storage quotas on a per-user or per-hub basis; then we can at least communicate to Community Representatives that this is the amount of storage they are allocated and they can work from there.
This is a topic that has already had a lot of discussion in Pangeo pangeo-data/pangeo-cloud-federation#654
I already did some work to bring the `nfs-server-provisioner` helm chart into our infrastructure. We time-boxed the effort to get that working and instead fell back on using Google Cloud Filestore for the Pangeo deployment. I believe achieving such quotas will involve more work to make the `nfs-server-provisioner` chart functional.
Value / benefit
Implementation details
No response
Tasks to complete
No response
Updates
No response