Hi @Liuhong99,

I am a big fan of Sophia; I have used it and cited it every time. I just thought of suggesting a new and less resource-intensive experiment:
a) Karpathy updated the nano_gpt2 training code with a tokens-without-replacement dataloader and a new dataset, finewebedu-10B. I am curious how Sophia would do in this new setting (a rough sketch of such a dataloader follows this list).
b) The inverse layer idx used here is pretty good, but recently many works have used QK normalization instead (see the sketch after this list).
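To make (a) concrete, here is a rough sketch of what I mean by a tokens-without-replacement dataloader, assuming the corpus has already been pre-tokenized into a flat array of uint16 token ids on disk. The class, file name, and shapes are mine for illustration, not Karpathy's actual code:

```python
import numpy as np
import torch

class EpochShuffleLoader:
    """Sketch: visit each fixed-length chunk of a pre-tokenized corpus
    exactly once per epoch, in shuffled order (no replacement)."""

    def __init__(self, token_file, batch_size, block_size, seed=0):
        # placeholder: a flat uint16 array of token ids on disk
        self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.B, self.T = batch_size, block_size
        self.n_chunks = (len(self.tokens) - 1) // self.T
        self.rng = np.random.default_rng(seed)
        self._new_epoch()

    def _new_epoch(self):
        # reshuffle chunk order once per epoch; each chunk appears once
        self.order = self.rng.permutation(self.n_chunks)
        self.pos = 0

    def next_batch(self):
        # remainder chunks at the end of an epoch are dropped for simplicity
        if self.pos + self.B > self.n_chunks:
            self._new_epoch()
        idx = self.order[self.pos:self.pos + self.B]
        self.pos += self.B
        x = torch.stack([torch.from_numpy(
            self.tokens[i * self.T : i * self.T + self.T].astype(np.int64)) for i in idx])
        y = torch.stack([torch.from_numpy(
            self.tokens[i * self.T + 1 : i * self.T + self.T + 1].astype(np.int64)) for i in idx])
        return x, y
```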
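And for (b), a minimal sketch of QK normalization as I understand it: normalize the query and key vectors per head before the attention dot product. This is illustrative only, not code from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Illustrative causal self-attention with QK normalization:
    queries and keys are normalized per head before the dot product."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)
        # one norm over the head dimension for queries and one for keys
        # (RMSNorm is also common here in recent work)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # QK normalization keeps the attention logits at a bounded scale
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))
```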
Thank you for such a good repo. Feel free to disregard this suggestion.
I trained GPT-2 on finewebedu-10B with Sophia under several settings (varying the learning rate and which layers get weight decay; a sketch of the setup is below), and these are the initial results: the loss goes up early in training. You can see the difference from the baseline (AdamW configured as in the nanoGPT repo).
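For reference, this is roughly how I split the parameters into a weight-decayed group and a no-decay group before handing them to SophiaG. The hyperparameter values and the embedding name check are placeholders, not my exact settings, and the SophiaG signature should be double-checked against the repo's README:

```python
from sophia import SophiaG  # optimizer from this repo

def build_sophia(model, lr, weight_decay, decay_embeddings=False):
    """Illustrative param-group split: 2D weight matrices get weight decay,
    biases/norms (and optionally embeddings) do not."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_embedding = "wte" in name or "wpe" in name  # placeholder name check
        if p.dim() >= 2 and (decay_embeddings or not is_embedding):
            decay.append(p)
        else:
            no_decay.append(p)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # betas/rho are the suggested defaults as I remember them --
    # please verify against the Sophia README before relying on them
    return SophiaG(groups, lr=lr, betas=(0.965, 0.99), rho=0.04)
```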
Some have suggested a batch-size ramp-up to fix this bad behaviour (a sketch of the idea is below); I haven't tried it yet.
I'm open to discussion.
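For concreteness, a batch-size ramp-up could be done in a nanoGPT-style loop by growing the number of gradient-accumulation micro-steps over the first part of training. This is just a sketch of the idea, not something I have validated; the loop names and the `(logits, loss)` model output are placeholders:

```python
def grad_accum_steps(step: int, final_accum_steps: int, ramp_iters: int) -> int:
    """Linearly ramp the effective batch size by growing the number of
    gradient-accumulation micro-steps from 1 to final_accum_steps."""
    if step >= ramp_iters:
        return final_accum_steps
    frac = step / ramp_iters
    return max(1, int(round(frac * final_accum_steps)))

# hypothetical usage inside a nanoGPT-style training loop:
# for step in range(max_iters):
#     accum = grad_accum_steps(step, final_accum_steps=32, ramp_iters=2000)
#     for _ in range(accum):
#         x, y = loader.next_batch()
#         _, loss = model(x, y)
#         (loss / accum).backward()
#     optimizer.step()
#     optimizer.zero_grad(set_to_none=True)
```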