multiple gpu training issue #2

Open
ShixuanGu opened this issue Mar 10, 2024 · 4 comments

@ShixuanGu

When running the taskonomy_conditional script with the ShapeNet dataset, training on 4 A100s vs. 1 A100 leads to different epoch sizes under the same settings. Is this a PyTorch Lightning issue or an implementation issue?

num steps: 1000000, batch size: 48, save_every: 10000, epoch size: 480000, num epochs: 100
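For context, a quick arithmetic check of these settings (plain Python; the numbers are taken from the line above):

```python
# Sanity check of the configured values (1-GPU case).
epoch_size = 480_000   # samples per epoch
batch_size = 48
num_steps = 1_000_000  # total training steps

steps_per_epoch = epoch_size // batch_size  # 10_000, matching the 1-GPU progress bar
num_epochs = num_steps // steps_per_epoch   # 100, matching "num epochs: 100"
print(steps_per_epoch, num_epochs)

# The 4-GPU run instead shows 3833 steps per epoch, which is neither
# 10_000 (same steps, smaller per-GPU batch) nor 10_000 / 4 = 2500.
```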

4 GPU:
Epoch 0: 0%| | 1/3833 [00:04<4:32:03, 0.23it/s]
Epoch 0: 0%| | 1/3833 [00:04<4:32:06, 0.23it/s, v_num=2.3e+7, train_loss=152.0]
Epoch 0: 0%| | 2/3833 [00:04<2:25:26, 0.44it/s, v_num=2.3e+7, train_loss=152.0]
Epoch 0: 0%| | 2/3833 [00:04<2:25:28, 0.44it/s, v_num=2.3e+7, train_loss=149.0]
Epoch 0: 0%| | 3/3833 [00:04<1:41:29, 0.63it/s, v_num=2.3e+7, train_loss=149.0]
Epoch 0: 0%| | 3/3833 [00:04<1:41:30, 0.63it/s, v_num=2.3e+7, train_loss=148.0]
Epoch 0: 0%| | 4/3833 [00:04<1:19:29, 0.80it/s, v_num=2.3e+7, train_loss=148.0]

1 GPU:
Epoch 0: 0%| | 0/10000 [00:00<?, ?it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:45, 0.11it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:50, 0.11it/s, v_num=2.3e+7, train_loss=181.0]
Epoch 0: 0%| | 2/10000 [00:09<12:36:12, 0.22it/s, v_num=2.3e+7, train_loss=181.0]
Epoch 0: 0%| | 2/10000 [00:09<12:36:14, 0.22it/s, v_num=2.3e+7, train_loss=189.0]
Epoch 0: 0%| | 3/10000 [00:09<8:36:41, 0.32it/s, v_num=2.3e+7, train_loss=189.0]

@jatentaki
Collaborator

Hi @ShixuanGu, that's a good question which is unfortunately a bit difficult for me to check, since I have graduated and no longer have access to the infrastructure necessary to run this code... 10_000 is an arbitrary number of steps per "epoch", by which I mean the period at which I want to run validation. The default batch size for shapenet_conditional is 48, so with 4 GPUs it should be split to 12 per GPU but still do the same number of steps (10_000). Why it does fewer, I'm not sure; I doubt it's an outright bug in PyTorch Lightning, but it may be my misuse of it / misunderstanding of its conventions.

Could you check the length of the actual dataset (available as len(dataloader.dataset)) and what happens with 2 or 3 GPUs? The number 3833 is a mystery to me, but I imagine it could be the size of the dataset divided by 48 or something like that, which would shed some light on what's happening.
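A minimal sketch of that check, assuming you can get hold of the training dataloader object (how the loader is constructed below is a placeholder, not this repo's actual API):

```python
# Hypothetical inspection snippet -- replace the loader construction with
# however the training dataloader is actually built in this codebase.
train_loader = build_train_dataloader()     # placeholder

print(len(train_loader.dataset))            # size of the underlying dataset
print(len(train_loader))                    # batches per epoch seen by this process
print(len(train_loader.dataset) // 48)      # dataset size / batch size, to compare with 3833
```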

@ShixuanGu
Author

Many thanks for the rapid reply! There might be a compatibility issue between Lightning's multi-GPU training and the sample generator in the dataloader. Meanwhile, I tested the training code with different numbers of GPUs.

Data - 10000 (len(train_loader))

1 GPU - 10000
Epoch 0: 0%| | 0/10000 [00:00<?, ?it/s]
Epoch 0: 0%| | 1/10000 [00:08<24:29:45, 0.11it/s]

2 GPUs - 7666
Epoch 0: 0%| | 1/7666 [00:01<3:47:22, 0.56it/s]
Epoch 0: 0%| | 1/7666 [00:01<3:47:26, 0.56it/s, v_num=2.31e+7, train_loss=147.0]

3 GPUs - 5111
Epoch 0: 0%| | 1/5111 [00:04<6:36:35, 0.21it/s]
Epoch 0: 0%| | 1/5111 [00:04<6:36:40, 0.21it/s, v_num=2.3e+7, train_loss=172.0]

4 GPUs - 3833
Epoch 0: 0%| | 1/3833 [00:04<4:32:03, 0.23it/s]
Epoch 0: 0%| | 1/3833 [00:04<4:32:06, 0.23it/s, v_num=2.3e+7, train_loss=152.0]

@ShixuanGu
Author

It seems the "epoch size" is the sample number used for training, e.g., epoch size N and batch size B will lead to N training samples and N/B batch number per epoch.

When using multiple GPUS for training, the batch size per epoch should be N/(B*GPU number). This works as expected when set epoch size as None (which will use the whole dataset).

The issue I encountered: when the "epoch size" parameter is not None, e.g., 10_000 (which is smaller than the dataset size; len(train_loader) returns 10_000 as expected), the number of batches per epoch during training is not 10_000/(B * number of GPUs) but (total number of samples in the dataset)/(B * number of GPUs).

I'm posting this problem here and am meanwhile looking into the bug related to the distributed sampler in Lightning.

@ShixuanGu
Author

It seems to be an issue where PyTorch Lightning automatically replaces the sampler in the dataloader with a DistributedSampler, which leads to the unexpected number of batches per epoch. The issue can be fixed by adding "use_distributed_sampler=False" to the Trainer initialization.
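A minimal sketch of where that flag goes, assuming a Lightning 2.x Trainer (the import path depends on whether the project uses pytorch_lightning or lightning.pytorch, and the other arguments are placeholders rather than this repo's actual configuration):

```python
import pytorch_lightning as pl

# Prevent Lightning from silently replacing the dataloader's sampler with a
# DistributedSampler over the full dataset (flag available in Lightning >= 2.0).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                      # placeholder device count
    max_steps=1_000_000,            # placeholder, matching the settings above
    use_distributed_sampler=False,  # the fix described above
)
```

Note that with the injected sampler disabled, it is up to the dataloader's own sampler/generator to ensure that different ranks do not draw identical batches.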

Beyond that, could you kindly explain how the hyperparameters used in gecco were determined, e.g., the scheduler, epoch size, and step size, with respect to dataset scale?
