multiple gpu training issue #2

Open
ShixuanGu opened this issue Mar 10, 2024 · 4 comments

@ShixuanGu

When running the taskonomy_conditional script with the ShapeNet dataset, training on 4 A100s vs. 1 A100 leads to different epoch sizes under the same settings. Is this a PyTorch Lightning issue or an implementation issue?

num steps: 1000000, batch size: 48, save_every: 10000, epoch size: 480000, num epochs: 100
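For context, a quick arithmetic check of these settings (plain Python; the numbers are taken from the line above):

```python
# Sanity check of the configured values (1-GPU case).
epoch_size = 480_000   # samples per epoch
batch_size = 48
num_steps = 1_000_000  # total training steps

steps_per_epoch = epoch_size // batch_size  # 10_000, matching the 1-GPU progress bar
num_epochs = num_steps // steps_per_epoch   # 100, matching "num epochs: 100"
print(steps_per_epoch, num_epochs)

# The 4-GPU run instead shows 3833 steps per epoch, which is neither
# 10_000 (same steps, smaller per-GPU batch) nor 10_000 / 4 = 2500.
```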

4 GPU:
Epoch 0: 0%| | 1/3833 [00:04<4:32:03, 0.23it/s]
Epoch 0: 0%| | 1/3833 [00:04<4:32:06, 0.23it/s, v_num=2.3e+7, train_loss=152.0]
Epoch 0: 0%| | 2/3833 [00:04<2:25:26, 0.44it/s, v_num=2.3e+7, train_loss=152.0]
Epoch 0: 0%| | 2/3833 [00:04<2:25:28, 0.44it/s, v_num=2.3e+7, train_loss=149.0]
Epoch 0: 0%| | 3/3833 [00:04<1:41:29, 0.63it/s, v_num=2.3e+7, train_loss=149.0]
Epoch 0: 0%| | 3/3833 [00:04<1:41:30, 0.63it/s, v_num=2.3e+7, train_loss=148.0]
Epoch 0: 0%| | 4/3833 [00:04<1:19:29, 0.80it/s, v_num=2.3e+7, train_loss=148.0]

1 GPU:
Epoch 0: 0%| | 0/10000 [00:00<?, ?it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:45, 0.11it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:50, 0.11it/s, v_num=2.3e+7, train_loss=181.0]
Epoch 0: 0%| | 2/10000 [00:09<12:36:12, 0.22it/s, v_num=2.3e+7, train_loss=181.0]
Epoch 0: 0%| | 2/10000 [00:09<12:36:14, 0.22it/s, v_num=2.3e+7, train_loss=189.0]
Epoch 0: 0%| | 3/10000 [00:09<8:36:41, 0.32it/s, v_num=2.3e+7, train_loss=189.0]

@jatentaki
Collaborator

Hi @ShixuanGu, that's a good question which is unfortunately a bit difficult for me to check, since I have graduated and no longer have access to the infrastructure necessary to run this code... 10_000 is an arbitrary number of steps per "epoch", by which I mean the period at which I want to run validation. The default batch size for shapenet_conditional is 48, so with 4 GPUs it should be split to 12 per GPU but still do the same number of steps (10_000). Why it does fewer, I'm not sure; I doubt it's an outright bug in PyTorch Lightning, but it may be my misuse of it / misunderstanding of its conventions.

Could you check the length of the actual dataset (available as len(dataloader.dataset)) and what happens with 2 or 3 GPUs? The number 3833 is a mystery to me, but I imagine it could be the size of the dataset divided by 48 or something like that, which would shed some light on what's happening.
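A minimal sketch of that check, assuming you can get hold of the training dataloader object (how the loader is constructed below is a placeholder, not this repo's actual API):

```python
# Hypothetical inspection snippet -- replace the loader construction with
# however the training dataloader is actually built in this codebase.
train_loader = build_train_dataloader()     # placeholder

print(len(train_loader.dataset))            # size of the underlying dataset
print(len(train_loader))                    # batches per epoch seen by this process
print(len(train_loader.dataset) // 48)      # dataset size / batch size, to compare with 3833
```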

@ShixuanGu
Author

Many thanks for the rapid reply! There might be a compatibility issue between Lightning's multi-GPU training and the sample generator in the dataloader. Meanwhile, I tested the training code with different numbers of GPUs.

Data - 10000 (len(train_loader))

1 GPU - 10000
Epoch 0: 0%| | 0/10000 [00:00<?, ?it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:45, 0.11it/s]

2 GPUs - 7666
Epoch 0: 0%| | 1/7666 [00:01<3:47:22, 0.56it/s]
Epoch 0: 0%| | 1/7666 [00:01<3:47:26, 0.56it/s, v_num=2.31e+7, train_loss=147.0]

3 GPUs - 5111
Epoch 0: 0%| | 1/5111 [00:04<6:36:35, 0.21it/s]
Epoch 0: 0%| | 1/5111 [00:04<6:36:40, 0.21it/s, v_num=2.3e+7, train_loss=172.0]

4 GPUs - 3833
Epoch 0: 0%| | 1/3833 [00:04<4:32:03, 0.23it/s]
Epoch 0: 0%| | 1/3833 [00:04<4:32:06, 0.23it/s, v_num=2.3e+7, train_loss=152.0]

@ShixuanGu
Author

It seems the "epoch size" is the sample number used for training, e.g., epoch size N and batch size B will lead to N training samples and N/B batch number per epoch.

When using multiple GPUS for training, the batch size per epoch should be N/(B*GPU number). This works as expected when set epoch size as None (which will use the whole dataset).

The issue I encountered: when the "epoch size" parameter is not None, e.g., 10_000 (which is smaller than the dataset size; len(train_loader) returns 10_000 as expected), the number of batches per epoch during training is not 10_000/(B * number of GPUs) but (total number of samples in the dataset)/(B * number of GPUs).

I'm posting this problem here and am meanwhile looking into the bug related to the distributed sampler in Lightning.

@ShixuanGu
Author

It seems to be an issue where PyTorch Lightning automatically replaces the sampler in the dataloader with a DistributedSampler, which leads to the unexpected number of batches per epoch. The issue can be fixed by adding "use_distributed_sampler=False" to the Trainer initialization.
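A minimal sketch of where that flag goes, assuming a Lightning 2.x Trainer (the import path depends on whether the project uses pytorch_lightning or lightning.pytorch, and the other arguments are placeholders rather than this repo's actual configuration):

```python
import pytorch_lightning as pl

# Prevent Lightning from silently replacing the dataloader's sampler with a
# DistributedSampler over the full dataset (flag available in Lightning >= 2.0).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                      # placeholder device count
    max_steps=1_000_000,            # placeholder, matching the settings above
    use_distributed_sampler=False,  # the fix described above
)
```

Note that with the injected sampler disabled, it is up to the dataloader's own sampler/generator to ensure that different ranks do not draw identical batches.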

Beyond that, could you kindly explain how the hyperparameters used in gecco were determined, e.g., the scheduler, epoch size, and step size, with respect to dataset scale?
