PyTorch 2.0: with autoparallelize(), the broadcast in distributed_c10d never returns during initialize_device_mesh and a core dump occurs #3797
wangbluo started this conversation in Community | General
Replies: 1 comment
-
OK, I see that ColossalAI does not yet support PyTorch 2.0, and in the PyTorch 2.0 c10 library, dynamo_unsupported_distributed_c10d_ops includes broadcast, so I suspect that is the cause. Refer: https://github.com/pytorch/pytorch/blob/4f2c007a1b5170c2aa0d47e388ff9e07c7a7d354/torch/distributed/distributed_c10d.py#L4241 By the way, when is the release date for ColossalAI's PyTorch 2.0 support?
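For anyone who wants to confirm this on their own install, a minimal sketch is below (assuming PyTorch 2.0.x; the attribute is looked up with getattr because its presence may differ between builds):

```python
# Minimal sketch: check whether broadcast is listed as dynamo-unsupported
# in this PyTorch build. Assumption: PyTorch 2.0.x; guarded with getattr
# in case the attribute is absent.
import torch
import torch.distributed.distributed_c10d as c10d

print(torch.__version__)
unsupported = getattr(c10d, "dynamo_unsupported_distributed_c10d_ops", None)
if unsupported is None:
    print("dynamo_unsupported_distributed_c10d_ops not present in this build")
else:
    print("broadcast unsupported under dynamo:", c10d.broadcast in unsupported)
```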
-
Following the example examples/language/gpt/experiments/auto_parallel/auto_parallel_with_gpt.py, I used the autoparallelize API to shard GPT with PyTorch 2.0, and found that during initialize_device_mesh the broadcast call in distributed_c10d never returns and a core dump occurs.
I tried analyzing the core file with gdb, but there is no meaningful error stack; the c10 library was probably built without debug information.
Call chain:
autoparallelize -> initialize_device_mesh -> profile_ab -> broadcast_object_list -> broadcast. broadcast is asynchronous; execution gets stuck at this step, and a breakpoint shows it stays in a running state until it times out.
The same code runs fine on PyTorch 1.12, so I would like to ask whether this is a PyTorch version issue.
Since the failure happens during asynchronous execution and the core dump reveals nothing, any hints would be appreciated. Thanks.
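To help narrow things down, here is a minimal standalone sketch (my own assumption, not the ColossalAI code path) that exercises the same broadcast_object_list -> broadcast chain outside of autoparallelize; if this also hangs under PyTorch 2.0, the problem is in the PyTorch/NCCL layer rather than in initialize_device_mesh. The script name and the NCCL backend are illustrative choices:

```python
# Standalone repro sketch for broadcast_object_list -> broadcast.
# Launch with e.g.: torchrun --nproc_per_node=2 repro_broadcast.py
# Assumptions: single node, one GPU per rank, NCCL backend.
import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Mirrors the failing chain: broadcast_object_list internally calls broadcast.
    objects = [{"payload": "hello from rank 0"}] if rank == 0 else [None]
    dist.broadcast_object_list(objects, src=0)
    print(f"rank {rank} received: {objects[0]}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```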