What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
4.1.7a1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Taken from NVIDIA's HPC SDK (more details in the logs)
Please describe the system on which you are running
Operating system/version:
Linux, Ubuntu 22.04
Computer hardware:
2 A100_40GB_PCIE
Network type:
Not sure (any tips on how to extract this information?)
Details of the problem
A simple reproducer calls cudaMalloc, MPI_Bcast, and cudaFree in a loop and checks the free device memory via cudaMemGetInfo() at each iteration. The expectation is that the same amount of device memory is available at each iteration. In reality the amount decreases, and if the loop runs long enough, cudaMalloc eventually fails because the device runs out of memory (which is why I call it a memory leak).
Note: the same reproducer also fails with UCX (which is turned off explicitly in the command above), but there I know UCX-specific workarounds and the issue is likely the same as #12849. For the non-UCX case I am not sure whether that issue is relevant at all.
Update: I've checked that adding --mca btl_smcuda_use_cuda_ipc 0 fixes the issue. As I understand it, this is a workaround rather than a solution.
So I think it would be helpful if someone could comment on when and how this will get fixed (hopefully). I am a bit surprised that such a simple-looking (at least to me) reproducer does not work. Maybe it should be added to a test suite or something similar.
It does seem related to #12849. Basically, a pointer used for communication between two GPUs gets registered for IPC, and the IPC handle is never released, which prevents the memory from being freed and results in a leak. So disabling IPC does fix the issue.
In particular, the issue never occurs if you always use the same buffer for communication. However, if you keep changing it (as you and I did), the number of IPC handles keeps increasing until you run out of memory.
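To make the mechanism above concrete, here is a small toy model in plain Python. It is only a sketch and does not use CUDA or MPI: `ToyDevice`, `run_loop`, and all of their parameters are hypothetical names invented for illustration, and the assumption that every allocation gets a fresh pointer and that a cached IPC registration pins the whole allocation forever is a simplification of what the library actually does.

```python
class ToyDevice:
    """Hypothetical stand-in for a GPU that only tracks byte counts."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes
        self.allocations = {}    # buffer id -> size, still owned by the application
        self.registered = set()  # buffer ids with a cached, never released IPC registration
        self.leaked_bytes = 0    # freed by the application but still pinned by a registration
        self._next_id = 0

    def free_bytes(self):
        # Rough analogue of querying free device memory.
        return self.total_bytes - sum(self.allocations.values()) - self.leaked_bytes

    def malloc(self, size):
        if size > self.free_bytes():
            raise MemoryError("toy device out of memory")
        buf = self._next_id      # assumption: every allocation gets a fresh id
        self._next_id += 1
        self.allocations[buf] = size
        return buf

    def bcast(self, buf, use_ipc=True):
        # Assumption: communicating over IPC registers the buffer once and caches it.
        if use_ipc:
            self.registered.add(buf)

    def free(self, buf):
        size = self.allocations.pop(buf)
        if buf in self.registered:
            # The cached registration keeps the memory pinned, so it is not returned.
            self.leaked_bytes += size


def run_loop(iterations, size, total_bytes, use_ipc=True, reuse_buffer=False):
    """Run the malloc / bcast / free loop and record free bytes after each iteration."""
    dev = ToyDevice(total_bytes)
    history = []
    shared = dev.malloc(size) if reuse_buffer else None
    for _ in range(iterations):
        buf = shared if reuse_buffer else dev.malloc(size)
        dev.bcast(buf, use_ipc=use_ipc)
        if not reuse_buffer:
            dev.free(buf)
        history.append(dev.free_bytes())
    return history


if __name__ == "__main__":
    print("new buffer each time:", run_loop(3, 10, 100))
    print("IPC disabled:        ", run_loop(3, 10, 100, use_ipc=False))
    print("same buffer reused:  ", run_loop(3, 10, 100, reuse_buffer=True))
```

In this model the free memory shrinks only when a new buffer is allocated on every iteration with IPC enabled; it stays flat when IPC is disabled or when a single buffer is reused, which matches the behaviour described in this thread.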
Additionally, this mechanism from my understanding is indeed independent of UCX.
Note also that a very similar issue occurs with MPICH in my case.
Hello!
Reproducer is compiled and run with
Reproducer as *.txt:
repro_test.txt
Output example (also contains the output of ompi_info --parsable --config): log_non_ucx2.log
Any suggestions?
Thanks,
Kirill