
Merging the code to wespeaker #3

Open
wsstriving opened this issue Jul 26, 2024 · 11 comments

@wsstriving

Thank you for the excellent work! I would like to ask whether you would mind us adapting this code into the official WeSpeaker models. We will definitely include the original paper link, authorship, etc. I just want to check whether you are okay with the open-source license of WeSpeaker.

Best regards,
Shuai

@vanIvan
Collaborator

vanIvan commented Jul 26, 2024

@wsstriving Sure, we'd be happy if you provide a working way to train it. I'm sharing an archive with the configs and training logs for all model sizes. The configs might differ a bit from the ones used in wespeaker, but the main hyperparameters have the same structure. Also, a few models were trained with AAM-Softmax instead of SphereFace2; they would have better quality if retrained with SphereFace2.
redimnet_vox2_configs.tar.gz

@vanIvan vanIvan closed this as completed Jul 26, 2024
@wsstriving
Author

Hi, I would like to ask whether the configs you provided here are the ones used in the paper, because for some of them I cannot get the same number of parameters (as shown in Table 4).

Other questions:

  • Some of them use the arc_margin loss and the others SphereFace2; were there any specific considerations?
  • Do you have any additional tricks for large-margin fine-tuning (LM)? I have only tested B2 so far; I can get comparable results before LM, but the results after LM are not as good.

@vanIvan
Collaborator

vanIvan commented Aug 6, 2024

Hi @wsstriving, thank you for your attempt at retraining the models!

Answering your questions:

because for some of them I cannot get the same number of parameters (as shown in Table 4)

  1. There might be a small difference for the B1+ model sizes, on the order of 100-200k parameters, due to how the number of parameters is calculated. Could you also clarify what the size difference is for your models? There might be a difference between a model with the classification head and one without: for pretraining it would have an additional 192 x 5994 ≈ 1.15M parameters. In our paper we reported sizes without the classification head and the feature extractor (as they are frozen during training / inference).
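For reference, a minimal sketch of where that ~1.15M figure comes from, assuming a plain bias-free linear margin head of shape (num_speakers, embed_dim); the exact head implementation may differ:

```python
# Hedged sketch: parameter-count gap introduced by the classification head,
# assuming a plain weight matrix of shape (num_speakers, embed_dim) as used
# by typical margin-based heads. Values are illustrative.
import torch.nn as nn

embed_dim, num_speakers = 192, 5994   # vox2 dev set has 5994 speakers
head = nn.Linear(embed_dim, num_speakers, bias=False)

head_params = sum(p.numel() for p in head.parameters())
print(head_params)  # 192 * 5994 = 1,150,848, i.e. ~1.15M extra parameters
```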

Some of them use the arc_margin loss and the others SphereFace2; were there any specific considerations?

  1. For all model sizes we found that SphereFace2 works better, and we advise switching to it by default (the B2 config should be the default for all models). Most of the models were trained with the same set of hyperparameters: weight decay, max LR, and scheduler setup, copied from one of the ResNet configs in the original wespeaker pipeline. This holds for all model sizes except probably the smallest B0 and the largest B6 ReDimNets, where we might have slightly changed the weight decay.

Do you have any additional tricks for large-margin fine-tuning (LM)? I have only tested B2 so far; I can get comparable results before LM, but the results after LM are not as good.

  1. Actually we don't; it should work without any changes. Could you share the configs you are using for LM and the metrics you are getting?

@vanIvan vanIvan reopened this Aug 6, 2024
@wsstriving
Author

wsstriving commented Aug 7, 2024

I've created a draft pull request for wespeaker (https://github.com/wenet-e2e/wespeaker/pull/346/files) that you can check. Basically, I've adapted your code to align with the wespeaker style and removed the preprocessing part (feature extraction) so it uses wespeaker's existing implementation. You can find the default configurations for the B0-B6 models in the model.py file, along with a comparison of model sizes.

Unfortunately, I don't have the resources to run all the experiments right now. However, I can share some preliminary results for the B2 model with the arc_margin loss. I initially had global_context_att set to False (different from your setup).

Before LM (no score norm):
O: 0.744
E: 0.932
H: 1.761
After LM (no score norm):
O: 0.712
E: 0.894
H: 1.621

@vanIvan
Collaborator

vanIvan commented Aug 7, 2024

Thank you for sharing; there is some mismatch in the feature setup:

  • We are using a hop size larger than 10ms, namely 15ms. It doesn't degrade quality much, but it increases the maximum batch size that fits on a GPU. The feature setup in our configs is a dummy (the one originally from wespeaker), because we use the features integrated into the model; we forgot to remove it.

I found our internal results for the ReDimNet-B2 LM model trained with the AAM loss and global_context_att set to True:

After LM (no score norm):
O: 0.675
E: 0.826
H: 1.457

There might be some improvement when setting global_context_att to True.
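For context, here is a minimal sketch of attentive statistics pooling with a global context branch, illustrating what global_context_att changes; it is an illustration only, not the exact WeSpeaker/ReDimNet implementation:

```python
# Hedged sketch of attentive statistics pooling (ASTP). With
# global_context_att=True, the utterance-level mean and std are
# concatenated to each frame before computing the attention weights.
import torch
import torch.nn as nn

class ASTP(nn.Module):
    def __init__(self, in_dim, bottleneck_dim=128, global_context_att=True):
        super().__init__()
        self.global_context_att = global_context_att
        att_in = in_dim * 3 if global_context_att else in_dim
        self.attention = nn.Sequential(
            nn.Conv1d(att_in, bottleneck_dim, kernel_size=1),
            nn.Tanh(),
            nn.Conv1d(bottleneck_dim, in_dim, kernel_size=1),
        )

    def forward(self, x):  # x: (batch, channels, frames)
        if self.global_context_att:
            mean = x.mean(dim=2, keepdim=True).expand_as(x)
            std = x.std(dim=2, keepdim=True).expand_as(x)
            att_in = torch.cat([x, mean, std], dim=1)
        else:
            att_in = x
        alpha = torch.softmax(self.attention(att_in), dim=2)
        mu = (alpha * x).sum(dim=2)
        var = (alpha * x * x).sum(dim=2) - mu * mu
        sigma = var.clamp(min=1e-8).sqrt()
        return torch.cat([mu, sigma], dim=1)  # (batch, 2 * channels)
```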

You should get the best results (matching ours) by using, for all models:

  • SphereFace2 loss
  • ASTP pooling with global_context_att=True
  • Hop length: the best results are achieved with 10ms, but the improvement over 15ms is very small, and training is significantly faster with 15ms features, so we would advise using them (see the feature sketch below).
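As a rough illustration of the 10ms vs 15ms trade-off, here is a hedged sketch using torchaudio's Kaldi-style fbank; the mel-bin count and frame length are assumed defaults, not necessarily the exact config values:

```python
# Hedged sketch: mel filterbank features at 16 kHz with 10ms vs 15ms hop.
import torch
import torchaudio

wav = torch.randn(1, 16000 * 3)  # 3 s of dummy 16 kHz audio

fbank_10ms = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, frame_shift=10.0, frame_length=25.0,
    sample_frequency=16000)
fbank_15ms = torchaudio.compliance.kaldi.fbank(
    wav, num_mel_bins=80, frame_shift=15.0, frame_length=25.0,
    sample_frequency=16000)

# ~1/3 fewer frames with the 15 ms hop, so larger batches fit on the GPU.
print(fbank_10ms.shape, fbank_15ms.shape)
```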

@vanIvan vanIvan self-assigned this Aug 20, 2024
@wsstriving
Author

Hi @vanIvan, we have merged the initial version into wespeaker (wenet-e2e/wespeaker#346), but there is still some performance gap. It would be great if you could try the current implementation and give some suggestions! BTW, if you will be at Interspeech, I'm looking forward to talking with you face to face.

@vanIvan
Collaborator

vanIvan commented Aug 27, 2024

Hi @wsstriving, thank you for the integration; we'll try to look at it soon. Yes, a few colleagues from our team and I are going to attend Interspeech and present ReDimNet there. It would be nice to meet, let's keep in touch!

@vanIvan
Collaborator

vanIvan commented Aug 31, 2024

Hello @wsstriving! I have realized that the wespeaker pipeline has no separate weight decay for the projection head, distinct from the backbone network; currently a single weight_decay is used for the whole network. So I've added a separate weight_decay for the projection head in the forked wespeaker pipeline. Could you please check it and, if you have time, retrain a model to see whether it improves results (especially for the SF2 loss)? I also made the model more lightweight during training by increasing the hop length of the mel filterbanks in its config; it should now train faster, and one can fit a bigger batch on the same GPU setup.
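For illustration, a minimal sketch of how a separate weight decay for the projection head can be set up with PyTorch optimizer parameter groups; the attribute names model.backbone and model.projection and the values are placeholders, not the actual WeSpeaker module names or settings:

```python
# Hedged sketch: per-module weight decay via optimizer parameter groups.
import torch

def build_optimizer(model, lr=0.1, backbone_wd=1e-4, head_wd=1e-3):
    # One group for the backbone, one for the projection head, each with
    # its own weight_decay; the shared lr/momentum come from the defaults.
    param_groups = [
        {"params": model.backbone.parameters(), "weight_decay": backbone_wd},
        {"params": model.projection.parameters(), "weight_decay": head_wd},
    ]
    return torch.optim.SGD(param_groups, lr=lr, momentum=0.9)
```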

@MonolithFoundation

Hi, I would like to ask whether there will be a ReDimNet model with multilingual support, for instance trained on mixed Chinese and English data for speaker verification?

@vanIvan
Collaborator

vanIvan commented Nov 13, 2024

Yes, new models will be pretrained on VoxBlink2 and finetuned on VoxBlink2 + Vox2 + CNCeleb. They will perform way better on Chinese.

@vanIvan
Collaborator

vanIvan commented Nov 13, 2024

Happy to share good news: we have released the first models pretrained on VoxBlink2; you can find them on the evaluation page. The first example there uses the best VoxBlink2-finetuned model.
