Epoch counter does not resume when resuming from start checkpoint. #26

Open
dillfrescott opened this issue May 5, 2024 · 7 comments

Comments

@dillfrescott
Contributor

dillfrescott commented May 5, 2024

The epoch counter seems to reset to 0 every time training is resumed from a checkpoint.

@ZFTurbo
Owner

ZFTurbo commented May 5, 2024

Yes, the epoch isn't saved anywhere in the model data. Actually, the config could be saved inside the checkpoint too... I need to think about what to save.

It's not actually a bug, just missing functionality.

@dillfrescott
Contributor Author

Gotcha. I've been manually adjusting the "for epoch in range" values in train.py on every resume, which works, I guess.
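
One way to avoid editing the range by hand is to record the last finished epoch next to the checkpoint and start the loop from there. The following is only a minimal sketch: the `train_state.json` sidecar file and the helper names `load_start_epoch`, `save_last_epoch` and `train` are hypothetical and not part of the repository, and the real loop in train.py may be structured differently.

```python
import json
import os


def load_start_epoch(state_path):
    """Return the epoch to resume from, or 0 if no state file exists yet."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path, "r") as f:
        # The file stores the last *completed* epoch, so resume at the next one.
        return json.load(f)["last_epoch"] + 1


def save_last_epoch(state_path, epoch):
    """Record the most recently completed epoch (hypothetical sidecar file)."""
    with open(state_path, "w") as f:
        json.dump({"last_epoch": epoch}, f)


def train(state_path, num_epochs, train_one_epoch):
    """Run epochs from the saved position up to num_epochs."""
    start_epoch = load_start_epoch(state_path)
    for epoch in range(start_epoch, num_epochs):
        train_one_epoch(epoch)  # placeholder for the real training step
        save_last_epoch(state_path, epoch)
```

With this in place, a second call to `train` on the same state file would continue at the epoch after the last one that finished, instead of starting again at 0.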

@jarredou

jarredou commented May 6, 2024

A while ago I started working on a more "resume-friendly" fork that adds a --resume CLI argument and saves the optimizer and scheduler states, plus the epoch, best_sdr and last training loss values, inside the "last_xxx.ckpt" saved model (+ wandb logging here).
main...jarredou:Music-Source-Separation-Training:wandb+resume

Code is not bulletproof.
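
The idea described above, keeping everything needed to resume in a single checkpoint file, can be sketched with PyTorch's `state_dict` mechanism. This is an illustration under assumptions, not the code from the fork: the function names and dictionary keys below are hypothetical, and the checkpoint layout actually used there may differ.

```python
import torch


def save_resume_checkpoint(path, model, optimizer, scheduler,
                           epoch, best_sdr, last_loss):
    """Save model weights together with the training state needed to resume."""
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": epoch,          # last completed epoch
            "best_sdr": best_sdr,    # best validation SDR so far
            "last_loss": last_loss,  # last training loss value
        },
        path,
    )


def load_resume_checkpoint(path, model, optimizer, scheduler):
    """Restore all states in place and return (start_epoch, best_sdr, last_loss)."""
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    # Resume at the epoch after the last completed one.
    return ckpt["epoch"] + 1, ckpt["best_sdr"], ckpt["last_loss"]
```

The returned start epoch could then be used as the first argument of the `range(...)` in the training loop, so the counter continues where it left off.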

@dillfrescott
Contributor Author

Ah, thank you! @jarredou

@dillfrescott
Contributor Author

You should do a PR for that

@jarredou

jarredou commented May 7, 2024

It would require more work before it could become a PR. Like I said, it's not bulletproof in its current state and can lead to some errors, and unfortunately I haven't had free time to spend on this for the past few months.

@dillfrescott
Contributor Author

Ah, gotcha
