Use tasks_per_node to split sweep across tasks #2633
base: main
Conversation
Force-pushed from e119fcd to 016dc28
This would be a very useful feature to have! Will this be merged into the main branch at some point? In my particular case, slurm is configured to allocate a full node per job, where each node comes with 4 GPUs. My models are quite small, though, and easily fit on a single GPU. Having the hydra sweeper submit a new job per hyperparameter value (which seems to be the default at the moment) is hence very wasteful for me, whereas parallelizing across tasks within a slurm job (and hence a single node) sounds like exactly what I am looking for. If there is a different solution to this, I would of course also be interested in that. Thank you very much!
I wouldn't count on it -- I currently don't need it anymore and, as I mentioned in the PR description, I think the current implementation may break some existing use cases. So it would require someone to rework it a bit (for instance with one of my suggestions, but maybe there's a better way too). Note, however, that I've used it successfully, so you should be able to cherry-pick this commit and use it if it's helpful to you.
This is a very useful feature. Can we try to work on this and get it merged? Is anyone else interested in this?
Can I implement option 1), and can we then hope this gets merged into mainline?
@Jasha10 @odelalleau What do you think? Is it possible to get this in as discussed above?
Is this perhaps too specific a use case within hydra?
Yes, I agree it seems like a specific use case within hydra. But it is a wonderful use case when we want to run 5-6 jobs within one node without worrying too much. In particular, my use case is a large multirun, but with a quick arg I can just run N tasks on a node (this translates to sharing GPU resources when the individual tasks are not GPU intensive).
Yes, I totally agree and would need the same thing. What I wanted to ask is what the effect of `tasks_per_node` is at the moment.
Well, to me it seems like something which cannot technically work within the hydra framework. But it's a broader question whether to make sure that the `tasks_per_node` setting keeps working for multiprocess jobs.
Can we try to merge this? |
Motivation
When running a sweep, someone may want to be able to use the same GPU for multiple jobs in a sweep. This PR makes it possible by leveraging the `tasks_per_node` argument (if set to 2, for instance, then 2 jobs may share the same GPU).
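For illustration, a multirun could then be launched along these lines (a hypothetical sketch: `my_app.py` and the `lr` sweep are made-up, while `hydra/launcher=submitit_slurm` and `hydra.launcher.tasks_per_node` are existing Hydra submitit launcher settings):

```bash
python my_app.py --multirun hydra/launcher=submitit_slurm \
    hydra.launcher.tasks_per_node=2 \
    lr=0.001,0.01,0.1,1.0
```

With this PR's behavior, the 4 sweep jobs could then be packed two per SLURM job (sharing a node and its GPUs) instead of each getting its own job.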
Discussion
This is currently a draft, open for feedback. I don't think it's actually a good idea to systematically use `tasks_per_node` for this, because some users may be using this setting for multiprocess jobs. Two options could be:
- a new boolean config option (e.g. `split_sweep_over_tasks`, default=`False`) to enable this behavior (my preferred solution at this time)
- a new option (e.g. `jobs_group_size`, default=1) so that it can be combined with multi-task jobs (this would be more complex to implement: it would need to spawn multiple processes from each SLURM job, instead of just relying on SLURM's tasks mechanism as implemented here)

Feedback and other ideas welcome! A rough sketch of the task-splitting idea is shown below.
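For concreteness, here is a minimal sketch of how each SLURM task could pick its share of the sweep. This is not the PR's actual code: `run_assigned_jobs` and `run_one` are hypothetical names, but `submitit.JobEnvironment` is the real submitit API for querying a task's rank inside a running SLURM job.

```python
import submitit

def run_assigned_jobs(sweep_overrides, run_one):
    """Run this task's share of the sweep (hypothetical sketch).

    `sweep_overrides` is the full list of per-job override lists, and
    `run_one` is whatever callable executes a single Hydra job.
    """
    # Inside a running SLURM job, JobEnvironment exposes this task's
    # global rank and the total number of tasks in the job.
    env = submitit.JobEnvironment()
    # Stride over the sweep so that each task executes a disjoint subset.
    for overrides in sweep_overrides[env.global_rank :: env.num_tasks]:
        run_one(overrides)
```

With `tasks_per_node=2`, two such tasks run per node and split the overrides between them without any extra process management.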
The current implementation also has a small hack when we end up launching a single job. I'm not sure there's a better way to deal with this situation: basically, I would like to force submitit to create a job array even for a single-job array.
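For context on that hack, job arrays in submitit come from `map_array`, roughly as below (`train` and its argument list are made-up; `AutoExecutor`, `update_parameters`, and `map_array` are standard submitit APIs). The awkward case is when the input list has a single element:

```python
import submitit

def train(lr):  # made-up sweep function
    return lr * 2

executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(tasks_per_node=2, timeout_min=60)

# map_array submits one SLURM job array covering all inputs; the hack
# mentioned above concerns the degenerate case of a 1-element list.
jobs = executor.map_array(train, [0.1, 0.01])
```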
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
TBD
Related Issues and PRs
Fixes #2632