Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compile error: -DENABLE_NPKIT_NET_CHECK_LATENCY #50

Open
ChenYuHo opened this issue Nov 27, 2024 · 4 comments
Open

Compile error: -DENABLE_NPKIT_NET_CHECK_LATENCY #50

ChenYuHo opened this issue Nov 27, 2024 · 4 comments

Comments

@ChenYuHo
Copy link

ChenYuHo commented Nov 27, 2024

main branch, latest commit fbac0fd
building with make -DENABLE_NPKIT_NET_CHECK_LATENCY reproduces the error:

transport/net.cc: In function ‘ncclResult_t sendProxyProgress(ncclProxyState*, ncclProxyArgs*)’:
transport/net.cc:1280:21: error: ‘comm’ was not declared in this scope
 1280 |                     comm->rank, sub->peer, sub->channelId, buffSlot, sub->npKitSizesFifo[buffSlot], npKitSendDuration, sub->npKitLastPollInterval[buffSlot], sub->npKitMaxPollInterval[buffSlot], sub->npKitPollIntervalSum[buffSlot], sub->npKitPollCnt[buffSlot], sub->npKitSta
rtTime[buffSlot], sub->npKitLastPollTime[buffSlot]);      
      |                     ^~~~                              
transport/net.cc: In function ‘ncclResult_t recvProxyProgress(ncclProxyState*, ncclProxyArgs*)’:
transport/net.cc:1509:23: error: ‘comm’ was not declared in this scope
 1509 |                       comm->rank, sub->peer, sub->channelId, step%NCCL_STEPS, sizes[i], npKitRecvDuration, sub->npKitLastPollInterval[step%NCCL_STEPS], sub->npKitMaxPollInterval[step%NCCL_STEPS], sub->npKitPollIntervalSum[step%NCCL_STEPS], sub->npKitPollCnt[step%NCCL_STEPS
], sub->npKitStartTime[step%NCCL_STEPS], sub->npKitLastPollTime[step%NCCL_STEPS]);
      |                       ^~~~
make[1]: *** [Makefile:121: msccl-executor-nccl/build/obj/transport/net.o] Error 1
make[1]: *** Waiting for unfinished jobs....

relevant code section in src/transport/net.cc

#if defined(ENABLE_NPKIT_NET_CHECK_LATENCY)
          uint64_t npKitSendDuration = sub->npKitLastPollTime[buffSlot] - sub->npKitStartTime[buffSlot];
          if (g_npkit_num_warmup_ops > 0) {
            g_npkit_num_warmup_ops--;
          }
          if (g_npkit_num_warmup_ops == 0 && npKitSendDuration > g_npkit_net_check_latency_threshold_us) {
            fprintf(stdout, "NPKIT LONG SEND (R:%d,P:%d,C:%d,S:%d): %d took %lu us, last/max/sum poll interval %lu/%lu/%lu us, cnt: %lu, ts: %lu/%lu\n",
                    comm->rank, sub->peer, sub->channelId, buffSlot, sub->npKitSizesFifo[buffSlot], npKitSendDuration, sub->npKitLastPollInterval[buffSlot], sub->npKitMaxPollInterval[buffSlot], sub->npKitPollIntervalSum[buffSlot], sub->npKitPollCnt[buffSlot], sub->npKitStartTime[buffSlot], sub->npKitLastPollTime[buffSlot]);
            sub->npKitStartTime[buffSlot] = sub->npKitLastPollTime[buffSlot] = npKitGetTsInUs();
            sub->npKitMaxPollInterval[buffSlot] = sub->npKitPollIntervalSum[buffSlot] = sub->npKitPollCnt[buffSlot] = 0;
          }
#endif


#if defined(ENABLE_NPKIT_NET_CHECK_LATENCY)
            if (g_npkit_num_warmup_ops > 0) {
              g_npkit_num_warmup_ops--;
            }
            uint64_t npKitRecvDuration = sub->npKitLastPollTime[step%NCCL_STEPS] - sub->npKitStartTime[step%NCCL_STEPS];
            if (g_npkit_num_warmup_ops == 0 && npKitRecvDuration > g_npkit_net_check_latency_threshold_us) {
              fprintf(stdout, "NPKIT LONG RECV (R:%d,P:%d,C:%d,S:%lu): %d took %lu us, last/max/sum poll interval %lu/%lu/%lu us, cnt: %lu, ts: %lu/%lu\n",
                      comm->rank, sub->peer, sub->channelId, step%NCCL_STEPS, sizes[i], npKitRecvDuration, sub->npKitLastPollInterval[step%NCCL_STEPS], sub->npKitMaxPollInterval[step%NCCL_STEPS], sub->npKitPollIntervalSum[step%NCCL_STEPS], sub->npKitPollCnt[step%NCCL_STEPS], sub->npKitStartTime[step%NCCL_STEPS], sub->npKitLastPollTime[step%NCCL_STEPS]);
              sub->npKitStartTime[step%NCCL_STEPS] = sub->npKitLastPollTime[step%NCCL_STEPS] = npKitGetTsInUs();
              sub->npKitMaxPollInterval[step%NCCL_STEPS] = sub->npKitPollIntervalSum[step%NCCL_STEPS] = sub->npKitPollCnt[step%NCCL_STEPS] = 0;
            }
#endif
@yzygitzh
Copy link
Collaborator

Hi,

Can you try replacing comm->rank to proxyState->tpRank and try again? According to https://github.com/Azure/msccl-executor-nccl/blob/main/src/proxy.cc#L1776, comm->rank can be referenced using proxyState->tpRank now.

@ChenYuHo
Copy link
Author

It compiles, but I wasn't sure they're functionally equivalent. Should I submit a PR?

@yzygitzh
Copy link
Collaborator

We're still building automated CI for this repo so it cannot be verified immediately.

If you verify that this fix works in your env, it would be very helpful for you to submit a PR for this. Really appreciate your contribution!

@yzygitzh
Copy link
Collaborator

cc @seagater

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants