We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
main branch, latest commit fbac0fd building with make -DENABLE_NPKIT_NET_CHECK_LATENCY reproduces the error:
make -DENABLE_NPKIT_NET_CHECK_LATENCY
transport/net.cc: In function ‘ncclResult_t sendProxyProgress(ncclProxyState*, ncclProxyArgs*)’: transport/net.cc:1280:21: error: ‘comm’ was not declared in this scope 1280 | comm->rank, sub->peer, sub->channelId, buffSlot, sub->npKitSizesFifo[buffSlot], npKitSendDuration, sub->npKitLastPollInterval[buffSlot], sub->npKitMaxPollInterval[buffSlot], sub->npKitPollIntervalSum[buffSlot], sub->npKitPollCnt[buffSlot], sub->npKitSta rtTime[buffSlot], sub->npKitLastPollTime[buffSlot]); | ^~~~ transport/net.cc: In function ‘ncclResult_t recvProxyProgress(ncclProxyState*, ncclProxyArgs*)’: transport/net.cc:1509:23: error: ‘comm’ was not declared in this scope 1509 | comm->rank, sub->peer, sub->channelId, step%NCCL_STEPS, sizes[i], npKitRecvDuration, sub->npKitLastPollInterval[step%NCCL_STEPS], sub->npKitMaxPollInterval[step%NCCL_STEPS], sub->npKitPollIntervalSum[step%NCCL_STEPS], sub->npKitPollCnt[step%NCCL_STEPS ], sub->npKitStartTime[step%NCCL_STEPS], sub->npKitLastPollTime[step%NCCL_STEPS]); | ^~~~ make[1]: *** [Makefile:121: msccl-executor-nccl/build/obj/transport/net.o] Error 1 make[1]: *** Waiting for unfinished jobs....
relevant code section in src/transport/net.cc
src/transport/net.cc
#if defined(ENABLE_NPKIT_NET_CHECK_LATENCY) uint64_t npKitSendDuration = sub->npKitLastPollTime[buffSlot] - sub->npKitStartTime[buffSlot]; if (g_npkit_num_warmup_ops > 0) { g_npkit_num_warmup_ops--; } if (g_npkit_num_warmup_ops == 0 && npKitSendDuration > g_npkit_net_check_latency_threshold_us) { fprintf(stdout, "NPKIT LONG SEND (R:%d,P:%d,C:%d,S:%d): %d took %lu us, last/max/sum poll interval %lu/%lu/%lu us, cnt: %lu, ts: %lu/%lu\n", comm->rank, sub->peer, sub->channelId, buffSlot, sub->npKitSizesFifo[buffSlot], npKitSendDuration, sub->npKitLastPollInterval[buffSlot], sub->npKitMaxPollInterval[buffSlot], sub->npKitPollIntervalSum[buffSlot], sub->npKitPollCnt[buffSlot], sub->npKitStartTime[buffSlot], sub->npKitLastPollTime[buffSlot]); sub->npKitStartTime[buffSlot] = sub->npKitLastPollTime[buffSlot] = npKitGetTsInUs(); sub->npKitMaxPollInterval[buffSlot] = sub->npKitPollIntervalSum[buffSlot] = sub->npKitPollCnt[buffSlot] = 0; } #endif #if defined(ENABLE_NPKIT_NET_CHECK_LATENCY) if (g_npkit_num_warmup_ops > 0) { g_npkit_num_warmup_ops--; } uint64_t npKitRecvDuration = sub->npKitLastPollTime[step%NCCL_STEPS] - sub->npKitStartTime[step%NCCL_STEPS]; if (g_npkit_num_warmup_ops == 0 && npKitRecvDuration > g_npkit_net_check_latency_threshold_us) { fprintf(stdout, "NPKIT LONG RECV (R:%d,P:%d,C:%d,S:%lu): %d took %lu us, last/max/sum poll interval %lu/%lu/%lu us, cnt: %lu, ts: %lu/%lu\n", comm->rank, sub->peer, sub->channelId, step%NCCL_STEPS, sizes[i], npKitRecvDuration, sub->npKitLastPollInterval[step%NCCL_STEPS], sub->npKitMaxPollInterval[step%NCCL_STEPS], sub->npKitPollIntervalSum[step%NCCL_STEPS], sub->npKitPollCnt[step%NCCL_STEPS], sub->npKitStartTime[step%NCCL_STEPS], sub->npKitLastPollTime[step%NCCL_STEPS]); sub->npKitStartTime[step%NCCL_STEPS] = sub->npKitLastPollTime[step%NCCL_STEPS] = npKitGetTsInUs(); sub->npKitMaxPollInterval[step%NCCL_STEPS] = sub->npKitPollIntervalSum[step%NCCL_STEPS] = sub->npKitPollCnt[step%NCCL_STEPS] = 0; } #endif
The text was updated successfully, but these errors were encountered:
Hi,
Can you try replacing comm->rank to proxyState->tpRank and try again? According to https://github.com/Azure/msccl-executor-nccl/blob/main/src/proxy.cc#L1776, comm->rank can be referenced using proxyState->tpRank now.
comm->rank
proxyState->tpRank
Sorry, something went wrong.
It compiles, but I wasn't sure they're functionally equivalent. Should I submit a PR?
We're still building automated CI for this repo so it cannot be verified immediately.
If you verify that this fix works in your env, it would be very helpful for you to submit a PR for this. Really appreciate your contribution!
cc @seagater
No branches or pull requests
main branch, latest commit fbac0fd
building with
make -DENABLE_NPKIT_NET_CHECK_LATENCY
reproduces the error:relevant code section in
src/transport/net.cc
The text was updated successfully, but these errors were encountered: