
Fix slurm multinode example #3229

Open

ffrancesco94 wants to merge 2 commits into main

Conversation


@ffrancesco94 commented Nov 8, 2024

What does this PR do?

The SLURM multinode submit script doesn't work for multiple reasons (see also #1239, which is still open). This PR aims to solve some of those issues, namely:

  • Typo in the $CMD command
  • $SLURM_NNODES was recently deprecated in favour of $SLURM_JOB_NUM_NODES
  • If the multinode setup involves multiple GPUs, this has to be enforced with --multi_gpu, and the rank of each process has to be set with --machine_rank (see the sketch after this list)
  • If multiple GPUs are present, the complete_nlp_example doesn't handle distributed evaluation correctly. Either the evaluate GLUE MRPC metric has to be loaded with the num_process and process_id arguments (which then fails because of breakages between recent datasets and evaluate releases, see evaluate#542 and related issues), or the metric has to be computed only on the main process. This PR goes for the second option.
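
For concreteness, the resulting launch line inside the batch script ends up roughly like the sketch below. This is an illustrative sketch under assumed names, not the literal diff of this PR: GPUS_PER_NODE, MASTER_ADDR and MASTER_PORT are placeholders assumed to be set earlier in the script.

    # Illustrative sketch, not the exact script from this PR.
    # Assumes one srun task per node and that GPUS_PER_NODE, MASTER_ADDR and
    # MASTER_PORT have been exported earlier in the batch script.
    srun --ntasks-per-node=1 bash -c "accelerate launch \
        --multi_gpu \
        --num_machines $SLURM_JOB_NUM_NODES \
        --num_processes $(( SLURM_JOB_NUM_NODES * GPUS_PER_NODE )) \
        --machine_rank \$SLURM_PROCID \
        --main_process_ip $MASTER_ADDR \
        --main_process_port $MASTER_PORT \
        complete_nlp_example.py"

Note that $SLURM_PROCID is escaped so that each srun task expands its own rank, while the remaining variables can be expanded once by the submitting shell since they are identical on every node.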

Fixes #3206

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings. NA
  • Did you write any new necessary tests? NA

Who can review?

@muellerzr or @SunMarc

Commits

Compute metric with evaluate from the main process only, avoiding a bug in multinode evaluate.
Enforce --multi_gpu on multiple nodes. Moreover, make sure that each rank gets correctly addressed based on the $SLURM_PROCID. $SLURM_NNODES has now been deprecated and replaced by $SLURM_JOB_NUM_NODES. Fixed typo in $CMD as well.
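
For reference, the head-node address and port that the launcher needs are typically derived near the top of the batch script along these lines (an illustrative sketch, not taken verbatim from this PR):

    # Illustrative only: use the first hostname in the allocation as the rendezvous host.
    export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
    export MASTER_PORT=29500  # any free port; 29500 is a common default
    echo "Rendezvous on $MASTER_ADDR:$MASTER_PORT across $SLURM_JOB_NUM_NODES node(s)"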
Comment on lines -249 to +256
- eval_metric = metric.compute()
- # Use accelerator.print to print only on the main process.
- accelerator.print(f"epoch {epoch}:", eval_metric)
+ if accelerator.is_main_process:
+     # Computing metrics in a distributed manner requires calling evaluate.load() with the
+     # num_process and process_id arguments. However, the metric.add_batch() step will fail
+     # due to a bug with datasets and evaluate (see https://github.com/huggingface/evaluate/issues/542)
+     # and related issues.
+     eval_metric = metric.compute()
+     # Use accelerator.print to print only on the main process.
+     accelerator.print(f"epoch {epoch}:", eval_metric)
Collaborator

This breaking is a first for me, let me try to repr this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Development

Successfully merging this pull request may close these issues.

Multinode, multigpu example fails