
Suggested new experiments: GPT-2-small w/ Sophia on FineWeb-10B data #51

sanyalsunny111 opened this issue Aug 9, 2024 · 1 comment


sanyalsunny111 commented Aug 9, 2024

Hi @Liuhong99,

I am a big fan of Sophia; I have used it and cited it every time. I just thought I would suggest a new and less resource-intensive experiment.

a) Karpathy updated the nano_gpt2 training code with a tokens-without-replacement dataloader and the new finewebedu-10B data. I am curious how Sophia would do in this new setting.

b) The inverse layer index used here is pretty good, but recently many works have used QK normalization instead (rough sketch below).
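To be concrete, here is a minimal sketch of what I mean by QK normalization: normalizing queries and keys per head before computing attention scores. This is not code from this repo; the class and argument names are placeholders, and LayerNorm could be swapped for RMSNorm as some recent works do.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormCausalAttention(nn.Module):
    """Causal self-attention with QK normalization (illustrative placeholder)."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        # Normalize Q and K over the per-head channel dimension
        # (LayerNorm here; RMSNorm is another common choice).
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, head_dim)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        # QK normalization in place of tricks like inverse-layer-index scaling
        q, k = self.q_norm(q), self.k_norm(k)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)
```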

Thank you for such a good repo. Feel free to disregard this suggestion.


dhia680 commented Aug 27, 2024

I trained GPT-2 on finewebedu-10B with Sophia (with different settings: varying learning rates and which layers get weight decay) and got the initial results below. The loss goes up early in training.
You can see the difference from the baseline (AdamW configured as in the nanoGPT repo).
[Figure: training loss of Sophia on FineWeb-Edu vs. the AdamW baseline]
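For reference, this is roughly how I chose which layers get weight decay, following the nanoGPT convention of decaying only the >=2-D (matmul/embedding) parameters. It is only a sketch: the hyperparameter values are placeholders, I am assuming SophiaG from this repo's sophia.py accepts standard torch parameter groups, and the periodic Hessian-estimate update from the repo's training script is still needed on top of this.

```python
from sophia import SophiaG  # assuming the repo's sophia.py is importable

def build_sophia(model, lr=3e-4, weight_decay=0.1, rho=0.05):
    """Build SophiaG with weight decay applied only to >=2-D parameters."""
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # matmul/embedding weights get decayed; biases and norm params do not
        (decay if p.dim() >= 2 else no_decay).append(p)
    param_groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return SophiaG(param_groups, lr=lr, betas=(0.965, 0.99), rho=rho)
```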
Some have suggested a batch_size ramp-up to fix this bad behaviour (see the sketch below)...
I haven't tried this yet.
I'm open to discussion.
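For what it's worth, this is the kind of ramp-up I had in mind: growing the effective batch size over the first iterations via gradient accumulation. It is just a sketch, nothing Sophia-specific, and the function name and values are made up.

```python
def grad_accum_steps(it: int, warmup_iters: int = 700,
                     min_steps: int = 4, max_steps: int = 32) -> int:
    """Linearly ramp gradient-accumulation steps (i.e. effective batch size)."""
    if it >= warmup_iters:
        return max_steps
    frac = it / warmup_iters
    return min_steps + round(frac * (max_steps - min_steps))

# Inside the training loop one would then accumulate grad_accum_steps(iter_num)
# micro-batches before each optimizer.step(), scaling the loss by 1 / steps.
```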
