Hi @Liuhong99,

I am a big fan of Sophia; I have used it and cited it every time. I just thought of suggesting a new and less resource-intensive experiment:
a) Karpathy updated the nano_gpt2 training code with a tokens-without-replacement dataloader and a new dataset, finewebedu-10B. I am curious how Sophia would do in this new setting (a rough sketch of such a dataloader follows this list).
b) The inverse layer idx used here is pretty good, but recently many works have used QK normalization instead (see the sketch after this list).
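To make (a) concrete, here is a rough sketch of what I mean by a tokens-without-replacement dataloader, assuming the corpus has already been pre-tokenized into a flat array of uint16 token ids on disk. The class, file name, and shapes are mine for illustration, not Karpathy's actual code:

```python
import numpy as np
import torch

class EpochShuffleLoader:
    """Sketch: visit each fixed-length chunk of a pre-tokenized corpus
    exactly once per epoch, in shuffled order (no replacement)."""

    def __init__(self, token_file, batch_size, block_size, seed=0):
        # placeholder: a flat uint16 array of token ids on disk
        self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
        self.B, self.T = batch_size, block_size
        self.n_chunks = (len(self.tokens) - 1) // self.T
        self.rng = np.random.default_rng(seed)
        self._new_epoch()

    def _new_epoch(self):
        # reshuffle chunk order once per epoch; each chunk appears once
        self.order = self.rng.permutation(self.n_chunks)
        self.pos = 0

    def next_batch(self):
        # remainder chunks at the end of an epoch are dropped for simplicity
        if self.pos + self.B > self.n_chunks:
            self._new_epoch()
        idx = self.order[self.pos:self.pos + self.B]
        self.pos += self.B
        x = torch.stack([torch.from_numpy(
            self.tokens[i * self.T : i * self.T + self.T].astype(np.int64)) for i in idx])
        y = torch.stack([torch.from_numpy(
            self.tokens[i * self.T + 1 : i * self.T + self.T + 1].astype(np.int64)) for i in idx])
        return x, y
```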
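And for (b), a minimal sketch of QK normalization as I understand it: normalize the query and key vectors per head before the attention dot product. This is illustrative only, not code from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Illustrative causal self-attention with QK normalization:
    queries and keys are normalized per head before the dot product."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head, self.head_dim = n_head, n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)
        # one norm over the head dimension for queries and one for keys
        # (RMSNorm is also common here in recent work)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # QK normalization keeps the attention logits at a bounded scale
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).contiguous().view(B, T, C))
```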
Thank you for such a good repo. Feel free to disregard this suggestion.
I trained GPT-2 on finewebedu-10B with Sophia under several settings (varying the learning rate and which layers get weight decay; a sketch of the setup is below), and these are the initial results: the loss goes up early in training. You can see the difference from the baseline (AdamW configured as in the nanoGPT repo).
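For reference, this is roughly how I split the parameters into a weight-decayed group and a no-decay group before handing them to SophiaG. The hyperparameter values and the embedding name check are placeholders, not my exact settings, and the SophiaG signature should be double-checked against the repo's README:

```python
from sophia import SophiaG  # optimizer from this repo

def build_sophia(model, lr, weight_decay, decay_embeddings=False):
    """Illustrative param-group split: 2D weight matrices get weight decay,
    biases/norms (and optionally embeddings) do not."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_embedding = "wte" in name or "wpe" in name  # placeholder name check
        if p.dim() >= 2 and (decay_embeddings or not is_embedding):
            decay.append(p)
        else:
            no_decay.append(p)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    # betas/rho are the suggested defaults as I remember them --
    # please verify against the Sophia README before relying on them
    return SophiaG(groups, lr=lr, betas=(0.965, 0.99), rho=0.04)
```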
Some have suggested a batch-size ramp-up to fix this bad behaviour (a sketch of the idea is below); I haven't tried it yet.
I'm open to discussion.
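For concreteness, a batch-size ramp-up could be done in a nanoGPT-style loop by growing the number of gradient-accumulation micro-steps over the first part of training. This is just a sketch of the idea, not something I have validated; the loop names and the `(logits, loss)` model output are placeholders:

```python
def grad_accum_steps(step: int, final_accum_steps: int, ramp_iters: int) -> int:
    """Linearly ramp the effective batch size by growing the number of
    gradient-accumulation micro-steps from 1 to final_accum_steps."""
    if step >= ramp_iters:
        return final_accum_steps
    frac = step / ramp_iters
    return max(1, int(round(frac * final_accum_steps)))

# hypothetical usage inside a nanoGPT-style training loop:
# for step in range(max_iters):
#     accum = grad_accum_steps(step, final_accum_steps=32, ramp_iters=2000)
#     for _ in range(accum):
#         x, y = loader.next_batch()
#         _, loss = model(x, y)
#         (loss / accum).backward()
#     optimizer.step()
#     optimizer.zero_grad(set_to_none=True)
```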