[Issue]: Roctracer returns correlation_id of 0 for all communication kernels #100
Comments
cc @mwootton
@mwootton I added a debug print in this PR: pytorch/kineto#982, and saw the following prints:
It seems like the callback is populating the id with 0 before we even process it.
Summary: Roctracer does not give the grid/block alongside device activities; however, it does have the information in the launch event. Using the correlation, we can then stitch these properties using a map from correlation to grid or block. Currently this won't work for RCCL events until ROCm/roctracer#100 is resolved.
Differential Revision: D61743013
Summary: Pull Request resolved: pytorch#983
Roctracer does not give the grid/block alongside device activities; however, it does have the information in the launch event. Using the correlation, we can then stitch these properties using a map from correlation to grid or block. Currently this won't work for RCCL events until ROCm/roctracer#100 is resolved.
Reviewed By: leitian, aaronenyeshi
Differential Revision: D61743013
fbshipit-source-id: 1205c62f45e8982b88f7a664857090d981f2cb3c
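For readers following along, the stitching idea in that summary can be sketched roughly as below. This is only an illustration under stated assumptions: `Dim3`, `LaunchRecord`, `DeviceActivity`, `LaunchDims`, and `stitchLaunchDims` are hypothetical names, not the actual roctracer or kineto types. It also shows why a device activity with a correlation id of 0 would be left without grid/block.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the launch (CPU) and device (GPU) records.
// These are NOT the real roctracer or kineto structures.
struct Dim3 {
  uint32_t x, y, z;
};

struct LaunchRecord {
  uint64_t correlationId;
  Dim3 grid;
  Dim3 block;
};

struct DeviceActivity {
  uint64_t correlationId;
  Dim3 grid;
  Dim3 block;
  bool hasDims;  // set once grid/block were copied from a launch record
};

struct LaunchDims {
  Dim3 grid;
  Dim3 block;
};

// Copy grid/block from launch records onto device activities that share a
// correlation id. Returns the number of device activities that were stitched.
inline std::size_t stitchLaunchDims(const std::vector<LaunchRecord>& launches,
                                    std::vector<DeviceActivity>& activities) {
  // Map from correlation id to the dimensions seen at launch time.
  std::unordered_map<uint64_t, LaunchDims> dimsByCorrelation;
  for (const auto& launch : launches) {
    dimsByCorrelation[launch.correlationId] = LaunchDims{launch.grid, launch.block};
  }

  std::size_t stitched = 0;
  for (auto& activity : activities) {
    auto it = dimsByCorrelation.find(activity.correlationId);
    if (it == dimsByCorrelation.end()) {
      // No launch with this id, e.g. a device record reporting correlation 0.
      continue;
    }
    activity.grid = it->second.grid;
    activity.block = it->second.block;
    activity.hasDims = true;
    ++stitched;
  }
  return stitched;
}
```

With a launch at correlation 29170 and a device record reporting 0, the lookup misses and the device record keeps no grid/block, which matches the RCCL limitation noted in the summary.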
I was able to find an internal issue where this was addressed. It is fixed in ROCm 6.2.
Confirmed this was fixed in ROCm 6.2.0.
Problem Description
When profiling, we observe that the activity_record_t/roctracer_record_t objects for communication kernels all have a correlation_id of 0. For example, we see the CPU event hipExtLaunchKernel with correlation 29170; however, its corresponding GPU kernel, ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*), has a correlation of 0. For non-CCL events, the correlation_ids of the CPU and GPU events do match, even though they are obtained the same way as for CCL events.
We obtain the correlation_ids for all async roctracer activities in kineto within this callback: https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerLogger.cpp#L295
Thanks in advance!
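A quick way to spot the mismatch described above in collected trace data might look like the following sketch. It assumes the CPU and GPU events have already been gathered into simple lists; `TraceEvent` and `findUncorrelated` are hypothetical helpers for illustration and are not part of roctracer or kineto.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified view of a traced event (not a roctracer type).
struct TraceEvent {
  uint64_t correlationId;
  std::string name;
};

// Return the names of GPU events whose correlation id matches no CPU event.
inline std::vector<std::string> findUncorrelated(
    const std::vector<TraceEvent>& cpuEvents,
    const std::vector<TraceEvent>& gpuEvents) {
  // Collect every correlation id seen on the CPU side.
  std::unordered_set<uint64_t> cpuIds;
  for (const auto& event : cpuEvents) {
    cpuIds.insert(event.correlationId);
  }

  // A GPU event is "uncorrelated" if its id never appeared on the CPU side.
  std::vector<std::string> uncorrelated;
  for (const auto& event : gpuEvents) {
    if (cpuIds.count(event.correlationId) == 0) {
      uncorrelated.push_back(event.name);
    }
  }
  return uncorrelated;
}
```

Fed the example from this report (a CPU hipExtLaunchKernel event at 29170 and a GPU ncclDevKernel_Generic record at 0), the GPU kernel would be listed as uncorrelated.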
Operating System
CentOS Stream 9
CPU
AMD EPYC 7713
GPU
AMD Instinct MI300X
ROCm Version
6.1.0.60100-82
ROCm Component
roctracer
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response