Feature request: recognize requeued jobs in SLURM #3573

Hoeze · 2023-01-24T18:56:33Z

Hoeze
Jan 24, 2023

On our SLURM cluster we have a preemptible lowprio partition.
It seems like Nextflow does not recognize job preemptions. What is the correct way to configure Nextflow here?

clusterOptions = --requeue: Can this cause duplicated job submissions?
clusterOptions = --no-requeue: The whole pipeline breaks all the time due to many job preemptions

Hoeze · 2023-01-24T19:44:24Z

Hoeze
Jan 24, 2023
Author

I just checked the source code: Nextflow just assumes preempted jobs to have errored and also always adds --no-requeue to the sbatch command.

However, it would make more sense to add --requeue by default to all submissions and check whether the job is being cancelled or just restarted later.
Here is a Python snippet showing how to do that:

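import re
import shlex
import subprocess as sp


def run_scontrol_on_preempted_job(cluster, jobid):
    # `cluster` is inserted verbatim, so it must already be a complete
    # scontrol option string (or an empty string).
    sctrl_res = sp.check_output(
        shlex.split(f"scontrol {cluster} -o show job {jobid}")
    )
    # The job record contains "Requeue=1" when SLURM may requeue the job.
    m = re.search(r"Requeue=(\w+)", sctrl_res.decode())
    # Treat a missing field as "not requeueable" instead of crashing.
    requeueable = m.group(1) if m else "0"

    if requeueable == "1":
        job_state = "REQUEUED"
    else:
        job_state = "CANCELLED"

    return job_state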

This would need to be put here:

nextflow/modules/nextflow/src/main/groovy/nextflow/executor/SlurmExecutor.groovy

Line 181 in 827ee97

result.put( cols[0], STATUS_MAP.get(cols[1]) )

Unfortunately I have no idea how to translate that snippet to Groovy.
Could someone from the Nextflow devs do that?
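
A minimal sketch of how that check could slot into the status-mapping step, kept in Python to match the snippet above rather than translated to Groovy. The names `parse_requeue_flag` and `resolve_state` are hypothetical and not part of Nextflow. The sketch assumes that squeue reports a preempted job with the compact state code `PR` and that the `scontrol show job` output contains a `Requeue=<0|1>` field, which is what the snippet above relies on.

```python
import re


def parse_requeue_flag(scontrol_output: str) -> bool:
    """Return True if `scontrol show job` output marks the job as requeueable."""
    m = re.search(r"\bRequeue=(\w+)", scontrol_output)
    return bool(m) and m.group(1) == "1"


def resolve_state(squeue_code: str, scontrol_output: str) -> str:
    """Map a squeue state code to a coarse job state (hypothetical helper).

    Only a preempted job ("PR") needs the extra scontrol lookup: if SLURM is
    allowed to requeue it, report it as requeued instead of failed.
    """
    if squeue_code != "PR":
        # Every other code is passed through to the existing status map.
        return squeue_code
    return "REQUEUED" if parse_requeue_flag(scontrol_output) else "CANCELLED"
```

Keeping the parsing separate from the `scontrol` call means the decision logic can be tested without a SLURM cluster; only the subprocess call would need a real environment.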

1 reply

bentsherman Mar 24, 2023
Maintainer

Hi @Hoeze , we probably aren't gonna get around to trying this anytime soon. Since you already have a code snippet in Python and you have a SLURM environment where you can test the behavior, your best bet is to try to implement this change in Nextflow and create a PR, demonstrating that it works in your environment.

I know you said you aren't familiar with Groovy, but you already found the correct spot in the Nextflow codebase to implement the change, and there is plenty of Groovy code in Nextflow that you can use as a reference.

To build and test Nextflow, all you need is Java: clone the Nextflow repo, make your changes, run make compile to build it, and use launch.sh the same way you would use nextflow.

pditommaso · 2023-01-25T15:22:14Z

pditommaso
Jan 25, 2023
Maintainer

Nextflow adds --no-requeue by default; I don't think it's possible to override it.

nextflow/modules/nextflow/src/main/groovy/nextflow/executor/SlurmExecutor.groovy

Line 55 in 827ee97

result << '--no-requeue' << '' // note: directive need to be returned as pairs

1 reply

bentsherman Jan 25, 2023
Maintainer

I think he is saying that we should change the SLURM executor to use --requeue and either fail or restart a job based on its queue status from scontrol.
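
For illustration, the submission side of that proposal could look roughly like the Python sketch below. `build_directives` and `to_header` are hypothetical names, not Nextflow functions, and the rendering into `#SBATCH` lines is an assumption; the only thing taken from the quoted Groovy line is the convention that directives are returned as pairs, with an empty string standing in for a missing value.

```python
def build_directives(requeue: bool) -> list[str]:
    """Return sbatch directives as (flag, value) pairs flattened into one list.

    Flags without a value are paired with an empty string, mirroring the
    "returned as pairs" note in the quoted SlurmExecutor line.
    """
    flag = "--requeue" if requeue else "--no-requeue"
    return [flag, ""]


def to_header(directives: list[str]) -> list[str]:
    """Render flattened pairs as `#SBATCH` lines for a job script."""
    pairs = zip(directives[0::2], directives[1::2])
    # rstrip() drops the trailing space left by an empty value.
    return [f"#SBATCH {flag} {value}".rstrip() for flag, value in pairs]
```

With a switch like this, `--requeue` would only be safe once the status check can tell a requeued job apart from a failed one; otherwise a preempted job would still be reported as an error.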

bentsherman · 2023-01-25T15:33:14Z

bentsherman
Jan 25, 2023
Maintainer

This issue has a long history: #226, #234, #3422

But to be honest, I'm having a hard time following these discussions. They talk about jobs that are suspended by SLURM due to preemption or node failure, so it seems like you would want to use --requeue instead.

0 replies

Hoeze · 2023-03-15T21:59:46Z

Hoeze
Mar 15, 2023
Author

@bentsherman is it possible to have some sidecar process similar to Snakemake's --cluster-sidecar?
https://github.com/gagneurlab/snakemake-slurm-template/blob/master/slurm-sidecar.py

This script is extremely helpful on our cluster for reducing the load on the SLURM master while polling thousands of jobs in parallel.
It also correctly recognizes requeued jobs.

3 replies

bentsherman Mar 16, 2023
Maintainer

Looks like it is doing something similar to what you proposed. But how does it help reduce the load on the SLURM scheduler? Is it just a higher polling interval? By default Nextflow checks the status of jobs once per minute.

Hoeze Mar 16, 2023
Author

Ah yes, indeed Nextflow also polls all jobs at once.
So the only difference is the recognition of requeued jobs.

Hoeze Mar 16, 2023
Author

Snakemake runs a separate command for every job to check its state, so I had to implement this cluster sidecar, which one can query with a simple cURL command as often as Snakemake likes.
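
The idea behind such a sidecar can be sketched as a small cache: one batched scheduler query refreshes a snapshot, and per-job lookups are answered from that snapshot until it goes stale. This is not the linked script's implementation; `JobStateCache`, `fetch_all`, and the 60-second default are hypothetical, and the HTTP layer that would make it reachable via cURL is left out.

```python
import time
from typing import Callable, Dict, Optional


class JobStateCache:
    """Serve per-job state lookups from one batched scheduler query."""

    def __init__(self, fetch_all: Callable[[], Dict[str, str]],
                 max_age: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_all = fetch_all   # e.g. a single squeue call for all jobs
        self._max_age = max_age       # seconds before the snapshot is stale
        self._clock = clock           # injectable so tests need no sleeping
        self._snapshot: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def state(self, job_id: str) -> Optional[str]:
        """Return the cached state of `job_id`, refreshing the snapshot if stale."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._max_age:
            self._snapshot = self._fetch_all()
            self._fetched_at = now
        return self._snapshot.get(job_id)
```

However often the workflow engine asks, the scheduler sees at most one query per `max_age` window, which is the load reduction described above.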

sidevshiy · 2024-08-18T21:06:25Z

sidevshiy
Aug 18, 2024

@Hoeze Hi, Hoeze!
Did you manage to make your changes work? I would be really glad if we could figure this out somehow! Thank you!

1 reply

Hoeze Aug 19, 2024
Author

Hi @sidevshiy, I did not investigate this further as I currently do not have a direct use case.
I might have a use case in the next few months that I can test this on, but if you need this now, you should be fine implementing the changes I described here:
#3573 (comment)
