I get "CUDA Error: misaligned address" when running the TP comm overlap unit test with a recent PyTorch container.
I think the error comes from the cuBLAS versions that enable nvjet.
[rank1]: Traceback (most recent call last):
[rank1]: File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 922, in <module>
[rank1]: sys.exit(_main(_parse_args()))
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank1]: return f(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^
[rank1]: File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 721, in _main
[rank1]: all_outputs = _fp8_gemm()
[rank1]: ^^^^^^^^^^^
[rank1]: File "/lustre/fsw/coreai_mlperf_training/slym/module_tests/tp_overlap/te.tp_tests/tests/pytorch/distributed/run_gemm_with_overlap.py", line 602, in _fp8_gemm
[rank1]: return tex.fp8_gemm(
[rank1]: ^^^^^^^^^^^^^
[rank1]: File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/cpp_extensions/gemm.py", line 180, in fp8_gemm
[rank1]: _ = fn(*args)
[rank1]: ^^^^^^^^^
[rank1]: RuntimeError: /workspace/TransformerEngine/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp:802 in function split_overlap_ag: CUDA Error: misaligned address
/workspace/TransformerEngine/transformer_engine/common/comm_gemm_overlap/comm_gemm_overlap.cpp:802 is a cudaEventRecord call. It seems weird that this would trigger a misaligned address error, so I'm guessing the error actually originates from nvte_cublas_gemm just a few lines above that?
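CUDA reports kernel faults asynchronously, so an error raised at a later API call such as cudaEventRecord can belong to an earlier launch. One way to test the guess above is to rerun the test with CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous. The following is a minimal sketch, not part of the test suite: the helper names are hypothetical, and the command passed in stands in for whatever launch command the test is normally run with.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Copy an environment mapping and enable synchronous CUDA kernel launches."""
    env = dict(os.environ if base is None else base)
    # With CUDA_LAUNCH_BLOCKING=1 the host waits for each kernel to finish,
    # so a fault should surface near the launch that caused it.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_blocking(cmd, base=None):
    """Run `cmd` (a list of arguments) with blocking launches; return its exit code."""
    return subprocess.run(cmd, env=blocking_env(base), check=False).returncode


if __name__ == "__main__":
    # Placeholder command: substitute the real test launch command here.
    sys.exit(rerun_blocking([sys.executable, "-c", "print('replace with the test command')"]))
```

Because the overlap code depends on kernels and communication running concurrently, blocking launches may change its behavior, so this is only useful for locating the faulting call.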
I'm not familiar with nvjet. Does cuBLAS have an environment variable that lets us at least temporarily disable this for debugging?
I got the above error with the old container when setting LD_LIBRARY_PATH to use the recent cuBLAS build. Without the recent cuBLAS build, the unit test runs fine in that container.
I also got the above error with the latest PyTorch container.
The model e2e job with the latest cuBLAS build runs fine.
So I think the problem is specific to the unit test code.
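When comparing cuBLAS builds through LD_LIBRARY_PATH, it can help to confirm which shared library the process actually loaded. The sketch below assumes Linux and reads /proc/self/maps; the function names are hypothetical and nothing here comes from Transformer Engine itself.

```python
def loaded_libs(maps_text, needle="libcublas"):
    """Return sorted unique mapped file paths whose name contains `needle`."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (pathname may be absent).
        fields = line.split(None, 5)
        if len(fields) == 6 and needle in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def loaded_libs_in_this_process(needle="libcublas"):
    """Inspect the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return loaded_libs(f.read(), needle)
```

Call loaded_libs_in_this_process() after the first GEMM has run, since the library may be loaded lazily. The default needle also matches libcublasLt.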
I started seeing the same misaligned address error with the new TE/JAX API in PR #1337. I wonder if these are related somehow. I will try again with an older container to see if it goes away. If so, I probably need to reach out to the cuBLAS team because it's not clear to me why the unit tests fail when e2e works.