Optimizing Batch-size and Learning-rate schedules for distributed computing #1611
Replies: 2 comments
-
The problem of needing to tune hyperparameters in machine learning is basically as old as machine learning :). Some people have success with autotuning software... but I think it can be pretty hit or miss. They aren't tools I use myself very often, though I can't speak for others.
It might be better to try a smaller learning rate rather than a smaller batch size. So put your learning rate on a warmup, followed by the regular decay schedule; that is quite common for transformer optimization and avoids needing to schedule two parameters. I am curious about the smaller batch size in the beginning: if you use the larger batch size from the start, does it reach the same loss and is just slower in wall-clock time (since each step processes a bigger batch), or does it actually converge more slowly as a function of the step count (that would be surprising)?
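For reference, here is a minimal sketch of that kind of warmup-then-decay learning-rate schedule, assuming the `mlx.optimizers` schedule helpers (`linear_schedule`, `cosine_decay`, `join_schedules`) are available in your MLX version; the step counts, peak learning rate, and choice of AdamW are illustrative, not a recommendation:

```python
import mlx.optimizers as optim

# Linear warmup for the first 1000 steps, then cosine decay (numbers are illustrative).
warmup = optim.linear_schedule(0.0, 1e-3, steps=1000)
decay = optim.cosine_decay(1e-3, decay_steps=100_000)
lr_schedule = optim.join_schedules([warmup, decay], [1000])

# Passing the schedule as the learning rate lets the optimizer
# evaluate it at its internal step count on every update.
optimizer = optim.AdamW(learning_rate=lr_schedule)
```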
-
What I am describing applies to training with more than 2 nodes over Thunderbolt. Basically, the challenge I had not been able to solve until I used a batch-size schedule with Lamb is the following:
-
I am finding, at least in the case of my model, that to get good scaling of distributed training performance when the number of nodes is 3 or more, I have to use schedules for both the batch size and the learning rate. Starting with a smaller batch size during the first few epochs, when the gradients are changing faster, helps reduce the loss quickly. I then double the batch size every n epochs up to a predefined maximum and keep it fixed from that point on.
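For concreteness, a minimal sketch of the batch-size schedule described above; the base size, doubling interval, and cap are hypothetical placeholders:

```python
def batch_size_at(epoch, base_batch_size=32, double_every=5, max_batch_size=256):
    """Double the batch size every `double_every` epochs, capped at `max_batch_size`.

    All default values are illustrative placeholders.
    """
    return min(base_batch_size * (2 ** (epoch // double_every)), max_batch_size)
```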
This seems to work well, but I am looking for ways to avoid a trial-and-error approach. If the model or the dataset changes, I would likely have to repeat the process all over again, and the time spent finding near-optimal schedules for batch size and learning rate kind of negates the benefit of distributed training.
Thus, I was wondering whether tools based on Bayesian optimization, for example Optuna, could be used to streamline this process. Before investing time in moving in this direction, I would be happy to hear the advice of @awni and @angeloskath. Is this worthwhile? Are there better solutions to this issue?
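To make the idea concrete, here is a sketch of what an Optuna study over these hyperparameters might look like. `train_and_eval` is a hypothetical helper that trains for a few epochs with the given schedule and returns the validation loss; the search ranges and trial count are placeholders:

```python
import optuna

def objective(trial):
    # Search over the knobs of the batch-size and learning-rate schedules (ranges are illustrative).
    base_batch_size = trial.suggest_categorical("base_batch_size", [16, 32, 64, 128])
    double_every = trial.suggest_int("double_every", 1, 10)  # epochs between doublings
    max_batch_size = trial.suggest_categorical("max_batch_size", [256, 512, 1024])
    peak_lr = trial.suggest_float("peak_lr", 1e-5, 1e-2, log=True)
    warmup_steps = trial.suggest_int("warmup_steps", 100, 5000)

    # Hypothetical helper: run a short training with these settings and
    # return the validation loss to be minimized.
    return train_and_eval(base_batch_size, double_every, max_batch_size, peak_lr, warmup_steps)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```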