Bug description

I am training a sample model which works on multiple GPUs as long as they are spread across nodes. But as soon as I allocate more than one GPU on a single node, it fails with:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'

What version are you seeing the problem on?

v2.4

How to reproduce the bug
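This section of the report is empty, so below is a hypothetical minimal sketch of the kind of script the traceback points to (`rocm_DDP.py` with a `main()` that calls `trainer.fit(model, dm)`). The model, datamodule, and `Trainer` arguments are placeholders, not the reporter's actual code; `devices=2, num_nodes=4` is an assumption inferred from the srun output below, which shows 8 tasks spread two per node over four nodes.

```python
# Hypothetical repro sketch, NOT the reporter's rocm_DDP.py: a toy LightningModule
# trained with DDP over 8 processes (assumed 2 GPUs per node x 4 nodes), launched
# under Slurm with one task per GPU, e.g. `srun python rocm_DDP.py`.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import lightning as L


class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


class ToyDataModule(L.LightningDataModule):
    def train_dataloader(self):
        data = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
        return DataLoader(data, batch_size=32)


def main():
    model = ToyModel()
    dm = ToyDataModule()
    # Works when each node contributes a single GPU; fails during DDP setup
    # (the broadcast in the traceback) once devices > 1 on a node.
    trainer = L.Trainer(
        accelerator="gpu",
        devices=2,
        num_nodes=4,
        strategy="ddp",
        max_epochs=1,
    )
    trainer.fit(model, dm)


if __name__ == "__main__":
    main()
```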
Error messages and logs

Using the `SDPA` attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.
(the warning above is emitted once by each of the 8 ranks; the repeats are omitted here)
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:75: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
You are using a CUDA device ('AMD Instinct MI50/MI60') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 109, in <module>
[rank0]: main()
[rank0]: File "/mnt/scratchc/ralab/atif/SRResNet_SRGAN/rocm_DDP.py", line 18, in main
[rank0]: trainer.fit(model, dm)
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
[rank0]: call._call_and_handle_interrupt(
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
[rank0]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank0]: return function(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
[rank0]: self._run(model, ckpt_path=ckpt_path)
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 938, in _run
[rank0]: self.__setup_profiler()
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1071, in __setup_profiler
[rank0]: self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
[rank0]: ^^^^^^^^^^^^
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1233, in log_dir
[rank0]: dirpath = self.strategy.broadcast(dirpath)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/lightning/pytorch/strategies/ddp.py", line 307, in broadcast
[rank0]: torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
[rank0]: broadcast(object_sizes_tensor, src=src, group=group)
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/Users/khan01/miniconda3/envs/rocm_test/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2136, in broadcast
[rank0]: work = default_pg.broadcast([tensor], opts)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid device pointer'
.
.
.
[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled cuda error, NCCL version 2.17.1
[rank7]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank7]: Last error:
[rank7]: Cuda failure 'invalid device pointer'
srun: error: clust1-rocm-6: task 5: Exited with exit code 1
srun: error: clust1-rocm-3: task 1: Exited with exit code 1
srun: error: clust1-rocm-4: task 3: Exited with exit code 1
srun: error: clust1-rocm-8: task 7: Exited with exit code 1
srun: error: clust1-rocm-6: task 4: Exited with exit code 1
srun: error: clust1-rocm-4: task 2: Exited with exit code 1
srun: error: clust1-rocm-8: task 6: Exited with exit code 1
srun: error: clust1-rocm-3: task 0: Exited with exit code 1
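For reference, the traceback shows the failure happens inside `torch.distributed.broadcast_object_list` while Lightning broadcasts the log directory during setup, before any model code runs. The following minimal sketch (hypothetical script name and launcher, not taken from the report) exercises the same collective without Lightning; if it fails the same way with two GPUs on one node, the problem likely sits in the ROCm/RCCL stack rather than in Lightning itself.

```python
# Hypothetical standalone check (not part of the original report): runs the same
# collective that fails in the traceback, without Lightning. Launch on a single
# node with e.g. `torchrun --nproc_per_node=2 check_broadcast.py`.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports LOCAL_RANK; bind each process to its own GPU first.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # "nccl" resolves to RCCL on ROCm builds of PyTorch.
    dist.init_process_group(backend="nccl")

    # Same call chain as Lightning's strategy.broadcast of the log dir.
    objects = ["some_log_dir"] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(objects, src=0)
    print(f"rank {dist.get_rank()}: received {objects[0]!r}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```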
Environment
Current environment
- PyTorch Lightning Version: 2.4.0
- PyTorch Version: 2.3.1+rocm5.7
- Python version: 3.11.0
- OS: Linux 4.18.0-372.32.1.el8_6.x86_64 (RHEL)
- CUDA/cuDNN version: rocm5.7
- GPU models and configuration: AMD Instinct MI50/MI60
- How you installed Lightning (`conda`, `pip`, source): pip
More info
No response