Enabling per hub storage quotas #764
Comments
I like that idea!
@damianavila I wonder if we can bring this issue into our backlog following https://2i2c.freshdesk.com/a/tickets/171. Disk space filling up is most often felt by hubs on shared clusters, and it feels most unfair when the hub community reporting it may not even have been the one to cause it.
I added it to the backlog and raised the priority! Thanks for bringing this one to attention, @sgibson91.
My current worry is that the nfs external provisioner seems a bit abandoned (kubernetes-sigs/nfs-ganesha-server-and-external-provisioner#106). @consideRatio got some patches into it, so he might have an idea of how active it is. The other problem I have with it is that it generates directory names with randomized characters, so if we lose the PVC objects (by recreating the k8s cluster, for example) we can no longer map users to their home directories! Currently that is not the case - we can recreate the k8s cluster and not lose any user data. This complicates backup quite a bit.
Here's a different approach to try.
For shared clusters
The advantages of this approach are that it's much simpler than the external provisioner, doesn't require us to maintain an NFS server by hand, and doesn't keep state in the kubernetes cluster to associate users with their home directories. Given that the provisioner doesn't seem to be maintained upstream, I think we have a much better chance of maintaining this than of maintaining the external provisioner itself.
For non-shared clusters
I think using cloud based filestores (EFS, etc.) is still more appropriate when we aren't operating shared clusters. For those, disk space work should focus on reporting. I'd suggest we write a prometheus exporter that basically runs …
There is a lot of precedent for running this kind of 'nfs server in a pod' - https://github.com/appscode/third-party-tools/blob/master/storage/nfs/artifacts/nfs-server.yaml for example. It'll have to run as a privileged pod, but it's totally doable. It'll also allow us to monitor free space on the disk easily.
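For illustration only, here is a minimal sketch of the kind of free-space reporting exporter discussed above, assuming Python's prometheus_client library and a hypothetical mount path of /export; the exact command or mechanism suggested in the comment above is not specified here.

```python
# Illustrative sketch (not the exporter proposed above): expose total, used and
# free bytes of an NFS-backed mount as Prometheus gauges.
import shutil
import time

from prometheus_client import Gauge, start_http_server

MOUNT_PATH = "/export"  # hypothetical mount point of the per-hub NFS disk

total_bytes = Gauge("homedir_disk_total_bytes", "Total size of the home directory disk")
used_bytes = Gauge("homedir_disk_used_bytes", "Used space on the home directory disk")
free_bytes = Gauge("homedir_disk_free_bytes", "Free space on the home directory disk")

if __name__ == "__main__":
    start_http_server(9000)  # serve metrics on :9000/metrics for Prometheus to scrape
    while True:
        usage = shutil.disk_usage(MOUNT_PATH)
        total_bytes.set(usage.total)
        used_bytes.set(usage.used)
        free_bytes.set(usage.free)
        time.sleep(60)  # refresh once a minute; Prometheus scrapes at its own interval
```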
Thank you for the proposal @yuvipanda! I like it for three reasons:
I believe we should try and work towards implementing an MVP of this sooner rather than later!
@yuvipanda @consideRatio
From https://2i2c.freshdesk.com/a/tickets/271: I think Yuvi has already lined up a great proposal above, and we just need to assign someone to try to implement it.
@yuvipanda @sgibson91 @consideRatio
This deploys https://github.com/yuvipanda/prometheus-dirsize-exporter, an *efficient* per-user homedirectory stats (size, no. of files, last modified date, etc) collector. It is capped at performing no more than 250 IO operations per second, to not overwhelm NFS servers. Metrics are refreshed every 2h after completion, although on large servers (like LEAP), they can take many many hours to complete with just 250 IO operations per second. This is perfectly fine though, as we do not need 'up to date' information. Trading off metric latency for minimal resource usage is pretty good here. Ref 2i2c-org#764
From #764 (comment):
I wrote this today! https://github.com/yuvipanda/prometheus-dirsize-exporter. It has some performance optimizations as well, although it could do more. #2621 deploys it to our clusters.
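As an illustration of the throttling idea described above (not the actual prometheus-dirsize-exporter implementation), here is a minimal sketch of a home directory size collector that caps filesystem operations per second; the `directory_size` helper and the 250 ops/sec default are assumptions for the example.

```python
# Illustrative sketch: walk a home directory tree while capping stat() calls
# per second, so a slow collection pass never overwhelms the NFS server.
import os
import time


def directory_size(path: str, max_ops_per_second: int = 250) -> int:
    """Return the total size in bytes of files under `path`, throttled to
    roughly `max_ops_per_second` stat() calls per second."""
    total = 0
    ops = 0
    window_start = time.monotonic()
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                total += os.stat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # file vanished mid-walk; skip it
            ops += 1
            if ops >= max_ops_per_second:
                # Sleep off the rest of the one-second window, then reset it.
                elapsed = time.monotonic() - window_start
                if elapsed < 1:
                    time.sleep(1 - elapsed)
                ops = 0
                window_start = time.monotonic()
    return total
```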
Just wanted to raise the priority of building an alerting and possibly a notification system on top of the really great and useful Grafana dashboard. This is motivated by @jbusecke's ticket https://2i2c.freshdesk.com/a/tickets/995 about the challenges of manually monitoring usage and notifying users one by one as the user base grows. I was thinking that one step in this direction is to try and enable Grafana alerting on the usage dashboard, by setting a per-user limit above which it would notify Julius. Anyway, I believe this is something to have on our radar for next quarter. cc @damianavila, who I remember mentioning that improving our monitoring and alerting systems might be a goal for the coming quarter.
@yuvipanda I recall we transitioned away from having a few in-cluster NFS servers. What is your take currently on the implementation previously proposed in #764 (comment)?
@consideRatio that's still the only possible path forward for per-hub quotas. It is probably also a quarter's worth of work, and not high priority right now. It's also just per-hub quotas; I think per-user quotas should be handled separately (and perhaps be more 'alerts' than actual quotas). I think this issue can be scoped down to only discussing per-hub quotas, and left to be prioritized in the future.
@GeorgianaElena do you want to open a dedicated issue to represent the ideas in #764 (comment)? I don't think we will manage to track that effectively as part of this already complicated issue.
This is being handled via NASA-IMPACT/veda-jupyterhub#41
October 2023 Update
Current proposals:
Work already done:
Description
Right now on our NFS servers, user home directories expand until the whole NFS storage is consumed, and then we either expand the storage further or delete stuff. This isn't necessarily a problem, except that the storage may not be equally distributed between hubs or even users! A slightly nicer solution may be to implement storage quotas on a per-user or per-hub basis; then we can at least communicate to Community Representatives that this is the amount of storage they are allocated and they can work from there.
This is a topic that has already had a lot of discussion in Pangeo pangeo-data/pangeo-cloud-federation#654
I already did some work to bring the `nfs-server-provisioner` helm chart into our infrastructure. We time-boxed the effort to get that working and instead fell back on using Google Cloud Filestore for the Pangeo deployment. I believe achieving such quotas will involve more work to make the `nfs-server-provisioner` chart functional.
Value / benefit
Implementation details
No response
Tasks to complete
No response
Updates
No response