
Merging the code to wespeaker #3

Open
wsstriving opened this issue Jul 26, 2024 · 11 comments

@wsstriving

Thank you for the excellent work! I would like to ask whether you would mind us adapting this code into the official WeSpeaker models. We will definitely include the original paper link, authorship, etc. I just want to check whether you are okay with the open-source license of WeSpeaker.

Best regards,
Shuai

@vanIvan
Collaborator

vanIvan commented Jul 26, 2024

@wsstriving Sure, we'd be happy if you provide a working way to train it. I'm sharing an archive with the configs and training logs for all model sizes. The configs might differ a bit from the ones used in wespeaker, but the main hyperparameters have the same structure. Also, a few models were trained with AAM-Softmax instead of SphereFace2; they would have better quality if retrained with SphereFace2.
redimnet_vox2_configs.tar.gz

@vanIvan vanIvan closed this as completed Jul 26, 2024
@wsstriving
Author

Hi, I would like to ask whether the configs you provided here are the ones used in the paper, because for some of them I cannot get the same number of parameters (as shown in Table 4).

Other questions:

  • Some of them use the arc_margin loss and the others SphereFace2; were there any specific considerations?
  • Do you have any additional tricks for large-margin fine-tuning (LM)? I have only tested B2 so far; I can get comparable results before LM, but the results after LM are not as good.

@vanIvan
Collaborator

vanIvan commented Aug 6, 2024

Hi @wsstriving, thank you for your attempt at retraining the models!

Answering your questions:

because for some of them I cannot get the same number of parameters (as shown in Table 4)

  1. There might be a small difference for the B1+ model sizes, on the order of 100-200k parameters, due to how the number of parameters is calculated. Could you also clarify what the size difference is for your models? There might be a difference between a model with the classification head and one without: for pretraining it would have an additional 192 x 5994 ≈ 1.15M parameters. In our paper we reported sizes without the classification head and the feature extractor (as they are frozen during training / inference).
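For reference, a minimal sketch of where that ~1.15M figure comes from, assuming a plain bias-free linear margin head of shape (num_speakers, embed_dim); the exact head implementation may differ:

```python
# Hedged sketch: parameter-count gap introduced by the classification head,
# assuming a plain weight matrix of shape (num_speakers, embed_dim) as used
# by typical margin-based heads. Values are illustrative.
import torch.nn as nn

embed_dim, num_speakers = 192, 5994   # vox2 dev set has 5994 speakers
head = nn.Linear(embed_dim, num_speakers, bias=False)

head_params = sum(p.numel() for p in head.parameters())
print(head_params)  # 192 * 5994 = 1,150,848, i.e. ~1.15M extra parameters
```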

Some of them use the arc_margin loss and the others SphereFace2; were there any specific considerations?

  1. For all model sizes we found that SphereFace2 works better, and we advise switching to it by default (the B2 config should be the default for all models). Most of the models were trained with the same set of hyperparameters: weight decay, max LR, and scheduler setup, copied from one of the ResNet configs in the original wespeaker pipeline. This holds for all model sizes except probably the smallest B0 and the largest B6 ReDimNets, where we might have slightly changed the weight decay.

Do you have any additional tricks for large-margin fine-tuning (LM)? I have only tested B2 so far; I can get comparable results before LM, but the results after LM are not as good.

  1. Actually we don't; it should work without any changes. Could you share the configs you are using for LM and the metrics you are getting?

@vanIvan vanIvan reopened this Aug 6, 2024
@wsstriving
Author

wsstriving commented Aug 7, 2024

I've created a draft pull request for wespeaker (https://github.com/wenet-e2e/wespeaker/pull/346/files) that you can check. Basically, I've adapted your code to align with the wespeaker style and removed the preprocessing part (feature extraction) so it uses wespeaker's existing implementation. You can find the default configurations for the B0-B6 models in the model.py file, along with a comparison of model sizes.

Unfortunately, I don't have the resources to run all the experiments right now. However, I can share some preliminary results for the B2 model with the arc_margin loss. I initially had global_context_att set to False (different from your setup).

Before LM (no score norm):
O: 0.744
E: 0.932
H: 1.761
After LM (no score norm):
O: 0.712
E: 0.894
H: 1.621

@vanIvan
Collaborator

vanIvan commented Aug 7, 2024

Thank you for sharing; there is some mismatch in the feature setup:

  • We are using a hop size larger than 10ms, namely 15ms. It doesn't degrade quality much, but it increases the maximum batch size that fits on a GPU. The feature setup in our configs is a dummy (the one originally from wespeaker), because we use the features integrated into the model; we forgot to remove it.

I found our internal results for the ReDimNet-B2 LM model trained with the AAM loss and global_context_att set to True:

After LM (no score norm):
O: 0.675
E: 0.826
H: 1.457

There might be some improvement when setting global_context_att to True.
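For context, here is a minimal sketch of attentive statistics pooling with a global context branch, illustrating what global_context_att changes; it is an illustration only, not the exact WeSpeaker/ReDimNet implementation:

```python
# Hedged sketch of attentive statistics pooling (ASTP). With
# global_context_att=True, the utterance-level mean and std are
# concatenated to each frame before computing the attention weights.
import torch
import torch.nn as nn

class ASTP(nn.Module):
    def __init__(self, in_dim, bottleneck_dim=128, global_context_att=True):
        super().__init__()
        self.global_context_att = global_context_att
        att_in = in_dim * 3 if global_context_att else in_dim
        self.attention = nn.Sequential(
            nn.Conv1d(att_in, bottleneck_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck_dim, in_dim, kernel_size=1),
        )

    def forward(self, x):  # x: (batch, channels, frames)
        if self.global_context_att:
            mean = x.mean(dim=2, keepdim=True).expand_as(x)
            std = x.std(dim=2, keepdim=True).expand_as(x)
            att_in = torch.cat([x, mean, std], dim=1)
        else:
            att_in = x
        alpha = torch.softmax(self.attention(att_in), dim=2)
        mu = (alpha * x).sum(dim=2)
        var = (alpha * x * x).sum(dim=2) - mu * mu
        sigma = var.clamp(min=1e-8).sqrt()
        return torch.cat([mu, sigma], dim=1)  # (batch, 2 * channels)
```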

You should get the best results (matching ours) by using, for all models:

  • SphereFace2 loss
  • ASTP pooling with global_context_att=True
  • Hop length: the best results are achieved with 10ms, but the improvement over 15ms is very small, and training is significantly faster with 15ms features, so we would advise using them (see the feature sketch below).
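As a rough illustration of the 10ms vs 15ms trade-off, here is a hedged sketch using torchaudio's Kaldi-style fbank; the mel-bin count and frame length are assumed defaults, not necessarily the exact config values:

```python
# Hedged sketch: mel filterbank features at 16 kHz with 10ms vs 15ms hop.
import torch
import torchaudio

wav = torch.randn(1, 16000 * 3)  # 3 s of dummy 16 kHz audio

fbank_10ms = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, frame_shift=10.0, frame_length=25.0,
    sample_frequency=16000)
fbank_15ms = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, frame_shift=15.0, frame_length=25.0,
    sample_frequency=16000)

# ~1/3 fewer frames with the 15 ms hop, so larger batches fit on the GPU.
print(fbank_10ms.shape, fbank_15ms.shape)
```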

@vanIvan vanIvan self-assigned this Aug 20, 2024
@wsstriving
Author

Hi @vanIvan, we have merged the initial version into wespeaker (wenet-e2e/wespeaker#346), but there is still some performance gap. It would be great if you could try the current implementation and give some suggestions! BTW, if you will be at Interspeech, I'm looking forward to talking with you face to face.

@vanIvan
Collaborator

vanIvan commented Aug 27, 2024

Hi @wsstriving, thank you for the integration; we'll try to look at it soon. Yes, a few colleagues from our team and I are going to attend Interspeech and present ReDimNet there. It would be nice to meet, let's keep in touch!

@vanIvan
Collaborator

vanIvan commented Aug 31, 2024

Hello @wsstriving! I have realized that the wespeaker pipeline has no separate weight decay for the projection head, distinct from the backbone network; currently a single weight_decay is used for the whole network. So I've added a separate weight_decay for the projection head in the forked wespeaker pipeline. Could you please check it and, if you have time, retrain a model to see whether it improves results (especially for the SF2 loss)? I also made the model more lightweight during training by increasing the hop length of the mel filterbanks in its config; it should now train faster, and one can fit a bigger batch on the same GPU setup.
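For illustration, a minimal sketch of how a separate weight decay for the projection head can be set up with PyTorch optimizer parameter groups; the attribute names model.backbone and model.projection and the values are placeholders, not the actual WeSpeaker module names or settings:

```python
# Hedged sketch: per-module weight decay via optimizer parameter groups.
import torch

def build_optimizer(model, lr=0.1, backbone_wd=1e-4, head_wd=1e-3):
    # One group for the backbone, one for the projection head, each with
    # its own weight_decay; the shared lr/momentum come from the defaults.
    param_groups = [
        {"params": model.backbone.parameters(), "weight_decay": backbone_wd},
        {"params": model.projection.parameters(), "weight_decay": head_wd},
    ]
    return torch.optim.SGD(param_groups, lr=lr, momentum=0.9)
```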

@MonolithFoundation

Hi, I would like to ask whether there will be a ReDimNet model with multilingual support, for instance trained on mixed Chinese and English data for speaker verification?

@vanIvan
Collaborator

vanIvan commented Nov 13, 2024

Yes, new models will be pretrained on VoxBlink2 and finetuned on VoxBlink2 + Vox2 + CNCeleb. They will perform way better on Chinese.

@vanIvan
Collaborator

vanIvan commented Nov 13, 2024

Happy to share good news: we have released the first models pretrained on VoxBlink2; you can find them on the evaluation page. The first example there uses the best VoxBlink2-finetuned model.
