RAdam for pytorch official #62

Open
brando90 opened this issue Jul 30, 2021 · 6 comments
@brando90

I am curious, why hasn't RAdam been officially included in PyTorch?

pytorch/pytorch#24892

@Tony-Y
Contributor

Tony-Y commented Jul 31, 2021

The paper "On the Adequacy of Untuned Warmup for Adaptive Optimization" shows that "the Rectified Adam (RAdam) algorithm can be characterized as four steps of momentum SGD, followed by Adam with a fixed warmup schedule." So we can use Adam with a warmup schedule instead whenever we need RAdam.

My implementation: https://github.com/Tony-Y/pytorch_warmup

@LiyuanLucasLiu
Owner

Hi @Tony-Y, I'm curious why you prefer to use Adam with a warmup instead of RAdam.

I think the basic fact both papers agree on is that warmup is necessary to handle the variance of the adaptive learning rate. Their linear warmup schedule of 2/(1−β2) steps is just a further approximation of our derived first-order approximation.

Thanks for the question @brando90, and thanks for pointing me to the PR. Seeing that PR is really encouraging to me.

From my perspective, I don't think being included as an official module in PyTorch matters that much. The motivation of our study is to show that the adaptive learning rate may cause problems (the strongest evidence is the controlled experiments, i.e., Adam-2k vs. Adam without warmup). RAdam serves to further verify our intuition on this matter. Although I'm very happy to see that our optimizer has helped and inspired many researchers, it is still experimental. It takes a lot of efforts to take the optimizer really to the next level.

We have been working on something new for the past two years, stay tuned :-)

@brando90
Author

brando90 commented Aug 2, 2021

Hi Liyuan,

Great to hear from you!

I am curious, what do you mean by "It takes a lot of efforts to take the optimizer really to the next level."? There aren't many hyperparameters to tune so I am curious what that means.

Looking forward to your next optimizer!

@brando90
Author

brando90 commented Aug 2, 2021

@Tony-Y I am also curious to know why you prefer warmup over RAdam, especially since RAdam seems quite robust and removes hyperparameters (which are the ML researcher's nightmare!)

@Tony-Y
Contributor

Tony-Y commented Aug 3, 2021

I think that the only new approach introduced by RAdam is a nonlinear warmup. Such nonlinear warmups may sometimes outperform the untuned linear warmup.
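To make the "nonlinear warmup" reading concrete, here is a rough sketch that computes RAdam's rectification term next to a linear warmup factor over 2/(1−β2) steps. It follows the formulas as I understand them from the RAdam paper, with a cutoff of ρ_t > 4 (implementations may use a different cutoff). The function names are hypothetical, steps are assumed to be 1-based, and β2 = 0.999 is assumed as the default.

```python
import math
from typing import Optional


def radam_rectification(step: int, beta2: float = 0.999) -> Optional[float]:
    """Return RAdam's rectification term r_t, or None when it is undefined.

    None marks the early steps where rho_t <= 4, i.e. where the update
    falls back to momentum SGD without the adaptive learning rate.
    """
    if step < 1:
        raise ValueError("step must be a 1-based positive integer")
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None
    numerator = (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
    denominator = (rho_inf - 4.0) * (rho_inf - 2.0) * rho_t
    return math.sqrt(numerator / denominator)


def linear_warmup_factor(step: int, beta2: float = 0.999) -> float:
    """Linear ramp that reaches 1 after 2 / (1 - beta2) steps."""
    return min(1.0, step * (1.0 - beta2) / 2.0)


if __name__ == "__main__":
    # Compare the two schedules side by side at a few steps.
    for step in (1, 4, 5, 100, 1000, 2000, 10000):
        print(step, radam_rectification(step), linear_warmup_factor(step))
```

With β2 = 0.999 the rectification term is undefined for the first four steps, which matches the "four steps of momentum SGD" characterization quoted earlier in the thread. After that it rises toward 1 along a curve rather than a straight line.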

@brando90
Author

brando90 commented Aug 4, 2021

@Tony-Y the paper you cited, "On the Adequacy of Untuned Warmup for Adaptive Optimization", claims that RAdam is just equivalent to Adam + warmup. From that perspective, it makes no difference which of the two I use. Isn't it simpler to just fork the RAdam repo, git clone it, and use RAdam? RAdam is just a standard PyTorch optimizer, so using it is trivial.

(My guess is that) the other alternative is to use the Hugging Face warmup (which I've never used) https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup and then use the linear schedule suggested by the paper you linked.

In the end, given the claim that they are "equivalent", either algorithm is fine. I will go with RAdam for now since it's already downloaded in my code and it's just as simple to use as the other, unless of course you have code that makes warmup trivial to plug in or a convincing case beyond the claim that they are equivalent.

If you think warmup is better, perhaps a tutorial on how to use your warmup version would be great, to make it just as simple to plug in as RAdam. :)

I am looking forward to seeing how this debate on optimizers for transformers progresses.


3 participants