
Slurm multi-node works fine but multi-GPU doesn't #20438

Open
atifkhanncl opened this issue Nov 22, 2024 · 1 comment
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x


@atifkhanncl

Bug description

I am training a sample model that works on multiple GPUs as long as they are spread across nodes. But as soon as I allocate more than one GPU on a node, it returns:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Python training script:

from pytorch_lightning.demos.boring_classes import BoringModel, BoringDataModule
from pytorch_lightning import Trainer
import os


def main():
    print(
        f"LOCAL_RANK={os.environ.get('LOCAL_RANK', 0)}, SLURM_NTASKS={os.environ.get('SLURM_NTASKS')}, SLURM_NTASKS_PER_NODE={os.environ.get('SLURM_NTASKS_PER_NODE')}"
    )
    model = BoringModel()
    datamodule = BoringDataModule()
    trainer = Trainer(max_epochs=100, devices=2, num_nodes=4)
    print(f"trainer.num_devices: {trainer.num_devices}")
    trainer.fit(model, datamodule)


if __name__ == "__main__":
    main()
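
To see what each task is actually bound to, here is a minimal per-rank diagnostic, assuming the same srun launch as below. It is not part of the original script, and the variables it reads (SLURM_LOCALID, HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES) are only set if Slurm/ROCm are configured to set them. If two tasks on the same node report overlapping devices, the crash would point at the task-to-GPU mapping rather than at the collective itself:

import os
import torch

# Hypothetical check (not from the report): prints one line per srun task.
local_rank = int(os.environ.get("SLURM_LOCALID", 0))
print(
    f"node={os.uname().nodename} "
    f"global_rank={os.environ.get('SLURM_PROCID')} "
    f"local_rank={local_rank} "
    f"visible_gpus={torch.cuda.device_count()} "
    f"HIP_VISIBLE_DEVICES={os.environ.get('HIP_VISIBLE_DEVICES')} "
    f"ROCR_VISIBLE_DEVICES={os.environ.get('ROCR_VISIBLE_DEVICES')}"
)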


Slurm sbatch.sh file:

#!/bin/bash 
#SBATCH --job-name=rocm_DDP_lightining
#SBATCH --nodes=4
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --mem=96g
#SBATCH --output=/mnt/jobOutput/sample.out
#SBATCH --error=/mnt/jobErrors/sample.err
#SBATCH --time=0-02:00:00
#SBATCH --cpus-per-task 10
#SBATCH --partition rocm
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
srun python /mnt/sample_lightning.py
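
To narrow down where the failure lives (a suggestion, not part of the original report), the collective that crashes inside trainer.fit can be run directly with torch.distributed under the identical allocation; if this also raises ncclUnhandledCudaError with two tasks per node, the problem sits below Lightning, in PyTorch/RCCL. The file name and the assumption that MASTER_ADDR/MASTER_PORT are exported in the sbatch file are mine:

import os
import torch
import torch.distributed as dist


def main():
    # Rank and world size come straight from Slurm. MASTER_ADDR and MASTER_PORT
    # are assumed to be exported in the sbatch file (e.g. the first host from
    # `scontrol show hostnames "$SLURM_JOB_NODELIST"`).
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Same collective Lightning uses to broadcast the log dir from rank 0.
    obj = ["log_dir"] if rank == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    print(f"rank {rank} received: {obj[0]}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Launch it with srun in place of the training script (the script path is a placeholder).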

Error messages and logs

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
You are using a CUDA device ('AMD Instinct MI50/MI60') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 109, in <module>
[rank0]:     main()
[rank0]:   File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 18, in main
[rank0]:     trainer.fit(model, dm)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]:     call._call_and_handle_interrupt(
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]:     return function(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]:     self._run(model, ckpt_path=ckpt_path)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 938, in _run
[rank0]:     self.__setup_profiler()
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1071, in __setup_profiler
[rank0]:     self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]:                                                                             ^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
[rank0]:     dirpath = self.strategy.broadcast(dirpath)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank0]:     torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]:     broadcast(object_sizes_tensor, src=src, group=group)
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]:     work = default_pg.broadcast([tensor], opts)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid device pointer'

.
.
.
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
srun: error: clust1-rocm-6: task 5: Exited with exit code 1
srun: error: clust1-rocm-3: task 1: Exited with exit code 1
srun: error: clust1-rocm-4: task 3: Exited with exit code 1
srun: error: clust1-rocm-8: task 7: Exited with exit code 1
srun: error: clust1-rocm-6: task 4: Exited with exit code 1
srun: error: clust1-rocm-4: task 2: Exited with exit code 1
srun: error: clust1-rocm-8: task 6: Exited with exit code 1
srun: error: clust1-rocm-3: task 0: Exited with exit code 1

Environment

Current environment
- PyTorch Lightning Version (e.g., 2.4.0): 2.4.0
- PyTorch Version (e.g., 2.4): 2.3.1+rocm5.7
- Python version (e.g., 3.12): 3.11.0
- OS (e.g., Linux): Linux 4.18.0-372.32.1.el8_6.x86_64 (RHEL)
- CUDA/cuDNN version: rocm5.7
- GPU models and configuration: AMD Instinct MI50/MI60
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

@atifkhanncl added the bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) labels on Nov 22, 2024
@Anivader

Update your Trainer to include the accelerator and strategy arguments:

trainer = Trainer(max_epochs=100, devices=2, num_nodes=4, accelerator="gpu", strategy=DDPStrategy(find_unused_parameters=True))

Also try setting devices='auto'. This will use all available GPUs on the selected node.
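
For reference, a runnable sketch of that suggestion, something like the following; DDPStrategy is imported from pytorch_lightning.strategies in 2.x, and whether find_unused_parameters is actually needed for BoringModel is left open:

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    max_epochs=100,
    accelerator="gpu",
    devices=2,              # or devices="auto" to use every GPU visible on each node
    num_nodes=4,
    strategy=DDPStrategy(find_unused_parameters=True),
)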
