Fix multi-GPU selection bug #134

Open. Wants to merge 1 commit into base: master


@flydsc commented Mar 8, 2020

There is a bug: when the device argument does not start at 0 (for example 2,3 or 1,2), the program fails with: AssertionError: Invalid device id.

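  • The cause is that line 75 restricts which GPUs are visible to the whole process:
    os.environ["CUDA_VISIBLE_DEVICES"] = args.device # selects which GPUs the program may use

  • while the multi-GPU parallel model is built like this:
    model = DataParallel(model, device_ids=[int(i) for i in args.device.split(',')])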

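The problem is that device_ids is given the physical GPU ids, but once CUDA_VISIBLE_DEVICES is set, the visible GPUs are renumbered from 0 inside the process, so the two sets of ids no longer match.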

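Take two GPU ids as an example:

  1. If the argument is 0,1, the environment variable exposes two GPUs whose ids are 0,1, and everything works.
  2. If the argument does not start at 0, for example 1,2, the environment variable still exposes two GPUs, but inside the process their ids are 0,1. Passing 1,2 as device_ids to DataParallel then raises AssertionError: Invalid device id.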
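
The renumbering in the two cases above can be sketched in plain Python, without touching a GPU. The helper names logical_to_physical and invalid_device_ids are hypothetical and not part of this repository, and the sketch assumes every GPU listed in the argument physically exists on the machine.

```python
def logical_to_physical(device_arg):
    """Map the ids the process sees (0, 1, ...) to the physical GPU ids.

    Hypothetical helper: device_arg is the comma-separated string that
    would be written to CUDA_VISIBLE_DEVICES, e.g. "1,2".
    """
    physical_ids = [int(i) for i in device_arg.split(',')]
    return dict(enumerate(physical_ids))


def invalid_device_ids(device_arg):
    """Return the physical ids that are out of range inside the process.

    With N visible GPUs, only ids 0..N-1 are valid, so any physical id
    >= N is rejected when passed as a device id.
    """
    physical_ids = [int(i) for i in device_arg.split(',')]
    visible_count = len(physical_ids)
    return [i for i in physical_ids if i >= visible_count]
```

For 0,1 no id is out of range, while for 1,2 the id 2 is out of range because only ids 0 and 1 exist inside the process.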

Fix:
Change the multi-GPU parallel model code from
model = DataParallel(model, device_ids=[int(i) for i in args.device.split(',')])
to
model = DataParallel(model, device_ids=list(range(len(args.device.split(',')))))
That is, build the id list from the number of GPUs, starting at 0.
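
As a minimal sketch, the same idea can be pulled into a small helper. The name logical_device_ids is hypothetical; ignoring empty entries such as a trailing comma is an added assumption that goes slightly beyond the one-line change in this PR.

```python
def logical_device_ids(device_arg):
    """Build device ids 0..N-1 for the N GPUs listed in device_arg.

    Hypothetical helper: device_arg is the same comma-separated string
    that is written to CUDA_VISIBLE_DEVICES, e.g. "1,2".
    Empty entries (e.g. from a trailing comma) are ignored, which is an
    assumption not made by the original one-liner.
    """
    entries = [d for d in device_arg.split(',') if d.strip()]
    return list(range(len(entries)))


# Intended use, mirroring the changed line in this PR:
# model = DataParallel(model, device_ids=logical_device_ids(args.device))
```

Since DataParallel defaults to all visible devices when device_ids is omitted, dropping the argument entirely may behave the same way here, but that variant is not part of this change.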
