[Issue]: Roctracer returns correlation_id of 0 for all communication kernels #100

Closed
sraikund16 opened this issue Aug 21, 2024 · 4 comments

Comments

@sraikund16

sraikund16 commented Aug 21, 2024

Problem Description

When profiling, we observe that the activity_record_t/roctracer_record_t objects for communication kernels all have a correlation_id of 0. For example, the CPU event hipExtLaunchKernel has correlation_id 29170, but its corresponding GPU kernel, ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*), has a correlation_id of 0. For non-CCL events, the correlation_ids of the CPU and GPU events do match, even though we obtain them the same way as for CCL events.

We obtain the correlation_ids for all async roctracer activities in kineto within this callback: https://github.com/pytorch/kineto/blob/main/libkineto/src/RoctracerLogger.cpp#L295
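To make the symptom concrete, here is a minimal sketch of the CPU/GPU join that breaks when the GPU-side correlation_id is 0. The `CpuLaunch`, `GpuKernel`, and `MatchResult` structs and the `matchByCorrelation` function are hypothetical simplifications for illustration; they are not the real roctracer_record_t layout and not kineto's code.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for what a profiler sees.
// These are NOT the real roctracer record types.
struct CpuLaunch {
  uint64_t correlation_id;
  std::string api_name;     // e.g. "hipExtLaunchKernel"
};

struct GpuKernel {
  uint64_t correlation_id;
  std::string kernel_name;  // e.g. "ncclDevKernel_Generic(...)"
};

struct MatchResult {
  // (kernel name, API name of the launch that produced it)
  std::vector<std::pair<std::string, std::string>> matched;
  // Kernels with correlation_id 0 or with no matching launch.
  std::vector<std::string> unmatched;
};

// Join GPU kernel records to CPU launch records on correlation_id.
MatchResult matchByCorrelation(const std::vector<CpuLaunch>& launches,
                               const std::vector<GpuKernel>& kernels) {
  std::unordered_map<uint64_t, std::string> launchById;
  for (const auto& launch : launches) {
    launchById[launch.correlation_id] = launch.api_name;
  }

  MatchResult result;
  for (const auto& kernel : kernels) {
    auto it = launchById.find(kernel.correlation_id);
    if (kernel.correlation_id == 0 || it == launchById.end()) {
      // This is the case reported here for communication kernels.
      result.unmatched.push_back(kernel.kernel_name);
    } else {
      result.matched.emplace_back(kernel.kernel_name, it->second);
    }
  }
  return result;
}
```

With a launch carrying correlation_id 29170 and a kernel record carrying 0, the kernel ends up in `unmatched`, which is what we see for the CCL kernels.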

Thanks in advance!

Operating System

CentOS Stream 9

CPU

AMD EPYC 7713

GPU

AMD Instinct MI300X

ROCm Version

6.1.0.60100-82

ROCm Component

roctracer

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

jithunnair-amd assigned mwootton and unassigned mwootton Aug 27, 2024
@jithunnair-amd

cc @mwootton

@sraikund16
Author

@mwootton I added a debug print in this PR: pytorch/kineto#982

and saw the following output:
WARNING:2024-08-28 15:29:41 3273561:3284767 RoctracerLogger.cpp:307] Correlation id is 0 for record: ncclDevKernel_Generic(ncclDevComm*, channelMasks, ncclWork*)

It seems the callback is populating the id with 0 before we even process it.

sraikund16 added a commit to sraikund16/kineto that referenced this issue Aug 29, 2024
Summary: Roctracer does not give the grid/block alongside device activities; however, they do have the information in the launch event. Using the correlation we can then stitch these properties using a map from correlation to grid or block. Currently this won't work for RCCL events until ROCm/roctracer#100 is resolved

Differential Revision: D61743013
facebook-github-bot pushed a commit to pytorch/kineto that referenced this issue Aug 30, 2024
Summary:
Pull Request resolved: #983

Roctracer does not give the grid/block alongside device activities; however, they do have the information in the launch event. Using the correlation we can then stitch these properties using a map from correlation to grid or block. Currently this won't work for RCCL events until ROCm/roctracer#100 is resolved

Reviewed By: leitian, aaronenyeshi

Differential Revision: D61743013

fbshipit-source-id: 1205c62f45e8982b88f7a664857090d981f2cb3c
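
The stitching described in the commit summary above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from pytorch/kineto#983: `LaunchDims` and `LaunchDimsTable` are hypothetical names, and the real change may store or look up the data differently.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical container for the grid/block sizes seen in a launch event.
struct LaunchDims {
  uint32_t gridX, gridY, gridZ;
  uint32_t blockX, blockY, blockZ;
};

// Hypothetical helper: remembers launch dimensions by correlation id so
// they can be attached to the matching device activity later.
class LaunchDimsTable {
 public:
  // Called when the CPU-side launch event is seen.
  void recordLaunch(uint64_t correlationId, const LaunchDims& dims) {
    dims_[correlationId] = dims;
  }

  // Called when the device activity arrives. Returns false when the
  // activity cannot be stitched, including the correlation id 0 case
  // that affects RCCL kernels before the fix.
  bool lookup(uint64_t correlationId, LaunchDims* out) const {
    if (correlationId == 0) {
      return false;
    }
    auto it = dims_.find(correlationId);
    if (it == dims_.end()) {
      return false;
    }
    *out = it->second;
    return true;
  }

 private:
  std::unordered_map<uint64_t, LaunchDims> dims_;
};
```

A device activity whose correlation id is 0 never finds its launch entry, which is why the stitching cannot work for RCCL events until the ids are populated.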
@mwootton

mwootton commented Sep 3, 2024

I was able to find an internal issue where this was addressed. It is fixed in rocm6.2.

@sraikund16
Author

Confirmed this was fixed in 6.2.0
