multiple gpu training issue #2
Hi @ShixuanGu, that's a good question which is unfortunately a bit difficult for me to check, since I have graduated and no longer have access to the infrastructure necessary to run this code... 10_000 is an arbitrary number of steps per "epoch", by which I mean how often I want to run validation. The default batch size for shapenet_conditional is 48, so with 4 GPUs it should be split 12 per GPU but still do the same number of steps (10_000). Why it does fewer, I'm not sure; I doubt it's an outright bug in PyTorch Lightning, but it may be my misuse of it or a misunderstanding of its conventions. Could you check the length of the actual dataset (available as …
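Since the reply above is cut off, here is a minimal sketch of the kind of length check being asked for. The `TensorDataset` stand-in and its shape are hypothetical placeholders, not the repo's actual dataset class; only the sizes (480000 samples, batch size 48) come from the log further down.

```python
# Hypothetical sketch: compare the raw dataset length with the number of
# batches the dataloader yields per epoch (what the progress bar counts).
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the real ShapeNet dataset (not the repo's class); size taken
# from the "epoch size: 480000" line in the log below.
dataset = TensorDataset(torch.zeros(480_000, 3))

loader = DataLoader(dataset, batch_size=48, shuffle=True)

print(f"len(dataset)    = {len(dataset):,}")  # samples:           480,000
print(f"len(dataloader) = {len(loader):,}")   # batches per epoch:  10,000
```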
Many thanks for the rapid reply! There might be a compatibility issue between Lightning's multi-GPU training and the sample generator in the dataloader. Meanwhile, I tested the training code with different numbers of GPUs and got these batch counts per epoch:
Data - 10000 (len(train_loader))
1 GPU - 10000
2 GPUs - 7666
3 GPUs - 5111
4 GPUs - 3833
It seems the "epoch size" is the number of samples used for training, i.e., epoch size N and batch size B give N training samples and N/B batches per epoch. When training on multiple GPUs, the number of batches per epoch should then be N/(B * number of GPUs). This works as expected when the epoch size is set to None (which uses the whole dataset). That is the issue I encountered: I'm posting the problem here and will meanwhile look into the bug related to the distributed sampler in Lightning.
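To make the arithmetic above concrete, here is a small sketch using the numbers from the log further down (epoch size 480000, batch size 48). The formula is the one stated in the comment above, not something taken from the repo.

```python
# Expected batches per epoch under the interpretation above:
# N samples per epoch, batch size B, G GPUs -> N / (B * G) batches per GPU.
N = 480_000  # epoch size (samples), from the log below
B = 48       # batch size, from the log below

for G in (1, 2, 3, 4):
    print(f"{G} GPU(s): expected {N // (B * G):>5} batches per epoch")

# Prints 10000, 5000, 3333, 2500 -- whereas the measured values above are
# 10000, 7666, 5111, 3833, which is the discrepancy being reported.
```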
It seems to be an issue where PyTorch Lightning automatically replaces the sampler in the dataloader with a DistributedSampler, which leads to the unexpected number of batches per epoch. The issue can be fixed by adding "use_distributed_sampler=False" to the Trainer initialization. Beyond that, could you kindly explain how to determine the hyperparameters used in gecco, e.g., the scheduler, epoch size, and step size, with respect to dataset scale?
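For reference, a minimal sketch of the workaround described above. The import path and the other Trainer arguments are illustrative, not taken from the repo; `use_distributed_sampler` is the PyTorch Lightning >= 2.0 name for this flag (older releases call it `replace_sampler_ddp`).

```python
import lightning.pytorch as pl

# Disable Lightning's automatic replacement of the dataloader's sampler with a
# DistributedSampler, so the dataloader's own fixed-length sampler is kept.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,
    strategy="ddp",
    use_distributed_sampler=False,  # workaround discussed in this thread
)
```

With the automatic replacement disabled, each rank iterates its dataloader's own sampler, so the per-epoch length matches the single-GPU case.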
When running the taskonomy_conditional script with the ShapeNet dataset, using 4 A100s vs 1 A100 leads to different epoch sizes under the same settings. Is this a PyTorch Lightning issue or an implementation issue?
num steps: 1000000, batch size: 48, save_every: 10000, epoch size: 480000, num epochs: 100
4 GPU:
Epoch 0: 0%| | 1/3833 [00:04<4:32:03, 0.23it/s]
Epoch 0: 0%| | 1/3833 [00:04<4:32:06, 0.23it/s, v_num=2.3e+7, train_loss=152.0]
Epoch 0: 0%| | 2/3833 [00:04<2:25:26, 0.44it/s, v_num=2.3e+7, train_loss=152.0]
Epoch 0: 0%| | 2/3833 [00:04<2:25:28, 0.44it/s, v_num=2.3e+7, train_loss=149.0]
Epoch 0: 0%| | 3/3833 [00:04<1:41:29, 0.63it/s, v_num=2.3e+7, train_loss=149.0]
Epoch 0: 0%| | 3/3833 [00:04<1:41:30, 0.63it/s, v_num=2.3e+7, train_loss=148.0]
Epoch 0: 0%| | 4/3833 [00:04<1:19:29, 0.80it/s, v_num=2.3e+7, train_loss=148.0]
1 GPU:
Epoch 0: 0%| | 0/10000 [00:00<?, ?it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:45, 0.11it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:50, 0.11it/s, v_num=2.3e+7, train_loss=181.0]
Epoch 0: 0%| | 2/10000 [00:09<12:36:12, 0.22it/s, v_num=2.3e+7, train_loss=181.0]
Epoch 0: 0%| | 2/10000 [00:09<12:36:14, 0.22it/s, v_num=2.3e+7, train_loss=189.0]
Epoch 0: 0%| | 3/10000 [00:09<8:36:41, 0.32it/s, v_num=2.3e+7, train_loss=189.0]