Optimizing Batch-size and Learning-rate schedules for distributed computing #1611
Replies: 2 comments
-
The problem of needing to tune hyperparameters in machine learning is basically as old as machine learning :). Some people have success with autotuning software... but I think it can be pretty hit or miss. They aren't tools I use myself very often, though I can't speak for others.
It might be better to try a smaller learning rate rather than a smaller batch size. So put your learning rate on a warmup, followed by the regular decay schedule; that is quite common for transformer optimization and avoids needing to schedule two parameters. I am curious about the smaller batch size in the beginning: if you use the larger batch size from the start, does it reach the same loss and is just slower in wall-clock time (since each step processes a bigger batch), or does it actually converge more slowly as a function of the step count (that would be surprising)?
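For reference, here is a minimal sketch of that kind of warmup-then-decay learning-rate schedule, assuming the `mlx.optimizers` schedule helpers (`linear_schedule`, `cosine_decay`, `join_schedules`) are available in your MLX version; the step counts, peak learning rate, and choice of AdamW are illustrative, not a recommendation:

```python
import mlx.optimizers as optim

# Linear warmup for the first 1000 steps, then cosine decay (numbers are illustrative).
warmup = optim.linear_schedule(0.0, 1e-3, steps=1000)
decay = optim.cosine_decay(1e-3, decay_steps=100_000)
lr_schedule = optim.join_schedules([warmup, decay], [1000])

# Passing the schedule as the learning rate lets the optimizer
# evaluate it at its internal step count on every update.
optimizer = optim.AdamW(learning_rate=lr_schedule)
```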
-
What I am describing applies to training with more than 2 nodes over Thunderbolt. Basically, the challenge I had not been able to solve until I used a batch-size schedule with Lamb is the following:
-
I am finding, at least in the case of my model, that to get good scaling of distributed training performance when the number of nodes is 3 or more, I have to use schedules for both the batch size and the learning rate. Starting with a smaller batch size during the first few epochs, when the gradients are changing faster, helps reduce the loss quickly. I then double the batch size every n epochs up to a predefined maximum and keep it fixed from that point on.
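For concreteness, a minimal sketch of the batch-size schedule described above; the base size, doubling interval, and cap are hypothetical placeholders:

```python
def batch_size_at(epoch, base_batch_size=32, double_every=5, max_batch_size=256):
    """Double the batch size every `double_every` epochs, capped at `max_batch_size`.

    All default values are illustrative placeholders.
    """
    return min(base_batch_size * (2 ** (epoch // double_every)), max_batch_size)
```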
This seems to work well, but I am looking for ways to avoid a trial-and-error approach. If the model or the dataset changes, I would likely have to repeat the process all over again, and the time spent finding near-optimal schedules for batch size and learning rate kind of negates the benefit of distributed training.
Thus, I was wondering whether tools based on Bayesian optimization, for example Optuna, could be used to streamline this process. Before investing time in moving in this direction, I would be happy to hear the advice of @awni and @angeloskath. Is this worthwhile? Are there better solutions to this issue?
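To make the idea concrete, here is a sketch of what an Optuna study over these hyperparameters might look like. `train_and_eval` is a hypothetical helper that trains for a few epochs with the given schedule and returns the validation loss; the search ranges and trial count are placeholders:

```python
import optuna

def objective(trial):
    # Search over the knobs of the batch-size and learning-rate schedules (ranges are illustrative).
    base_batch_size = trial.suggest_categorical("base_batch_size", [16, 32, 64, 128])
    double_every = trial.suggest_int("double_every", 1, 10)  # epochs between doublings
    max_batch_size = trial.suggest_categorical("max_batch_size", [256, 512, 1024])
    peak_lr = trial.suggest_float("peak_lr", 1e-5, 1e-2, log=True)
    warmup_steps = trial.suggest_int("warmup_steps", 100, 5000)

    # Hypothetical helper: run a short training with these settings and
    # return the validation loss to be minimized.
    return train_and_eval(base_batch_size, double_every, max_batch_size, peak_lr, warmup_steps)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```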