PyTorch 2.0: with autoparallelize(), the broadcast in distributed_c10d never returns during initialize_device_mesh and a core dump occurs #3797
wangbluo started this conversation in Community | General
Replies: 1 comment
-
OK, I see that ColossalAI does not yet support PyTorch 2.0, and in the PyTorch 2.0 c10 library, dynamo_unsupported_distributed_c10d_ops includes broadcast, so I suspect that is the cause. Refer: https://github.com/pytorch/pytorch/blob/4f2c007a1b5170c2aa0d47e388ff9e07c7a7d354/torch/distributed/distributed_c10d.py#L4241 By the way, when is the release date for ColossalAI's PyTorch 2.0 support?
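For anyone who wants to confirm this on their own install, a minimal sketch is below (assuming PyTorch 2.0.x; the attribute is looked up with getattr because its presence may differ between builds):

```python
# Minimal sketch: check whether broadcast is listed as dynamo-unsupported
# in this PyTorch build. Assumption: PyTorch 2.0.x; guarded with getattr
# in case the attribute is absent.
import torch
import torch.distributed.distributed_c10d as c10d

print(torch.__version__)
unsupported = getattr(c10d, "dynamo_unsupported_distributed_c10d_ops", None)
if unsupported is None:
    print("dynamo_unsupported_distributed_c10d_ops not present in this build")
else:
    print("broadcast unsupported under dynamo:", c10d.broadcast in unsupported)
```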
-
Following the example examples/language/gpt/experiments/auto_parallel/auto_parallel_with_gpt.py, I used the autoparallelize API to shard GPT with PyTorch 2.0, and found that during initialize_device_mesh the broadcast call in distributed_c10d never returns and a core dump occurs.
I tried analyzing the core file with gdb, but there is no meaningful error stack; the c10 library was probably built without debug information.
Call chain:
autoparallelize -> initialize_device_mesh -> profile_ab -> broadcast_object_list -> broadcast. broadcast is asynchronous; execution gets stuck at this step, and a breakpoint shows it stays in a running state until it times out.
The same code runs fine on PyTorch 1.12, so I would like to ask whether this is a PyTorch version issue.
Since the failure happens during asynchronous execution and the core dump reveals nothing, any hints would be appreciated. Thanks.
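To help narrow things down, here is a minimal standalone sketch (my own assumption, not the ColossalAI code path) that exercises the same broadcast_object_list -> broadcast chain outside of autoparallelize; if this also hangs under PyTorch 2.0, the problem is in the PyTorch/NCCL layer rather than in initialize_device_mesh. The script name and the NCCL backend are illustrative choices:

```python
# Standalone repro sketch for broadcast_object_list -> broadcast.
# Launch with e.g.: torchrun --nproc_per_node=2 repro_broadcast.py
# Assumptions: single node, one GPU per rank, NCCL backend.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Mirrors the failing chain: broadcast_object_list internally calls broadcast.
    objects = [{"payload": "hello from rank 0"}] if rank == 0 else [None]
    dist.broadcast_object_list(objects, src=0)
    print(f"rank {rank} received: {objects[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```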