
Slurm multi-node works fine but multi-GPU doesn't #20438

Open
atifkhanncl opened this issue Nov 22, 2024 · 1 comment
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x


@atifkhanncl

Bug description

I am training a sample model that works on multiple GPUs as long as they are spread across nodes. But as soon as I allocate more than one GPU on a node, it returns:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Python training script:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100, devices=2, num_nodes=4)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule)


if __name__ == "__main__":
    main()
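
To see what each task is actually bound to, here is a minimal per-rank diagnostic, assuming the same srun launch as below. It is not part of the original script, and the variables it reads (SLURM_LOCALID, HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES) are only set if Slurm/ROCm are configured to set them. If two tasks on the same node report overlapping devices, the crash would point at the task-to-GPU mapping rather than at the collective itself:

import os
import torch

# Hypothetical check (not from the report): prints one line per srun task.
local_rank = int(os.environ.get("SLURM_LOCALID", 0))
print(
    f"node={os.uname().nodename} "
    f"global_rank={os.environ.get('SLURM_PROCID')} "
    f"local_rank={local_rank} "
    f"visible_gpus={torch.cuda.device_count()} "
    f"HIP_VISIBLE_DEVICES={os.environ.get('HIP_VISIBLE_DEVICES')} "
    f"ROCR_VISIBLE_DEVICES={os.environ.get('ROCR_VISIBLE_DEVICES')}"
)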


Slurm sbatch.sh file:

#!/bin/bash 
#SBATCH --job-name=rocm_DDP_lightining
#SBATCH --nodes=4
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=96g
#SBATCH --output=/mnt/jobOutput/sample.out
#SBATCH --error=/mnt/jobErrors/sample.err
#SBATCH --time=0-02:00:00
#SBATCH --cpus-per-task 10
#SBATCH --partition rocm
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
srun python /mnt/sample_lightning.py
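
To narrow down where the failure lives (a suggestion, not part of the original report), the collective that crashes inside trainer.fit can be run directly with torch.distributed under the identical allocation; if this also raises ncclUnhandledCudaError with two tasks per node, the problem sits below Lightning, in PyTorch/RCCL. The file name and the assumption that MASTER_ADDR/MASTER_PORT are exported in the sbatch file are mine:

import os
import torch
import torch.distributed as dist


def main():
    # Rank and world size come straight from Slurm. MASTER_ADDR and MASTER_PORT
    # are assumed to be exported in the sbatch file (e.g. the first host from
    # `scontrol show hostnames "$SLURM_JOB_NODELIST"`).
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Same collective Lightning uses to broadcast the log dir from rank 0.
    obj = ["log_dir"] if rank == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    print(f"rank {rank} received: {obj[0]}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launch it with srun in place of the training script (the script path is a placeholder).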

Error messages and logs

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
You are using a CUDA device ('AMD Instinct MI50/MI60') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 109, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 18, in main
[rank0]:     trainer.fit(model, dm)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 938, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1071, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid device pointer'

.
.
.
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
srun: error: clust1-rocm-6: task 5: Exited with exit code 1
srun: error: clust1-rocm-3: task 1: Exited with exit code 1
srun: error: clust1-rocm-4: task 3: Exited with exit code 1
srun: error: clust1-rocm-8: task 7: Exited with exit code 1
srun: error: clust1-rocm-6: task 4: Exited with exit code 1
srun: error: clust1-rocm-4: task 2: Exited with exit code 1
srun: error: clust1-rocm-8: task 6: Exited with exit code 1
srun: error: clust1-rocm-3: task 0: Exited with exit code 1

Environment

Current environment
- PyTorch Lightning Version (e.g., 2.4.0): 2.4.0
- PyTorch Version (e.g., 2.4): 2.3.1+rocm5.7
- Python version (e.g., 3.12): 3.11.0
- OS (e.g., Linux): Linux 4.18.0-372.32.1.el8_6.x86_64 (RHEL)
- CUDA/cuDNN version: rocm5.7
- GPU models and configuration: AMD Instinct MI50/MI60
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

@atifkhanncl added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Nov 22, 2024
@Anivader

Update your Trainer to include the accelerator and strategy arguments:

trainer = Trainer(max_epochs=100, devices=2, num_nodes=4, accelerator="gpu", strategy=DDPStrategy(find_unused_parameters=True))

Also try setting devices='auto'. This will use all available GPUs on the selected node.
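
For reference, a runnable sketch of that suggestion, something like the following; DDPStrategy is imported from pytorch_lightning.strategies in 2.x, and whether find_unused_parameters is actually needed for BoringModel is left open:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    max_epochs=100,
    accelerator="gpu",
    devices=2,              # or devices="auto" to use every GPU visible on each node
    num_nodes=4,
    strategy=DDPStrategy(find_unused_parameters=True),
)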
