Firstly, I really appreciate this repo. It has helped me a lot in learning about TTS.
But I think I have run into some problems at the inference stage.
I trained the model on LibriTTS with configs adapted from the FastSpeech2 repo, just removing the language options.
(If you wish, I can make a pull request for them. It would be helpful for others training the model.)
While the training loss looked like what you showed, I cannot get proper duration predictions during inference.
I checked the training stage, where the `synth_one_sample` function operates, by saving wavs, and I saw that the predicted and reconstructed speech were of fairly good quality (with a bit of error in the mel prediction, though).
So I guess there could be some issue with the mel embedding for the conditional normalization layer or with the speaker embedding.
Maybe there is some conflict between them?
In that sense, it would be helpful for me and others to have some inference examples, such as speaker embedding samples and inferred outputs.
I attach some samples, configs, and commands here: tested_data.zip
I'm getting a very similar problem, and it has been troubling me for days. Have you solved it?
Here are my tensorboard logs. They look pretty strange: the losses stop decreasing after a very short time (a few thousand steps) and then start to blow up. This happens even before phone-level embedding prediction (which could also be a problem!).
Luckily, I found that my problem originated not from the model or the code itself, but from the values of the x-vectors I was using. I used x-vectors extracted with the SpeechBrain library instead of the speaker embedding table, and the values in these x-vectors can range from -100 to +100. This caused numerical instability in the conditional layer norms, so the loss could not decrease. After normalizing these embeddings, my training ran correctly.
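For anyone hitting the same thing, here is a minimal sketch of the kind of rescaling I mean, assuming the x-vectors are plain PyTorch tensors. The helper names are just illustrative, not part of SpeechBrain or this repo; either L2 normalization or per-embedding standardization brings the values into a range the conditional layer norm can handle:

```python
import torch

def l2_normalize_xvector(xvec: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Rescale a raw x-vector to unit L2 norm.

    Raw SpeechBrain x-vectors can take values roughly in [-100, 100],
    which destabilizes the conditional layer norms during training.
    """
    return xvec / (xvec.norm(p=2, dim=-1, keepdim=True) + eps)

def standardize_xvector(xvec: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Alternative: shift each embedding to zero mean and unit variance."""
    return (xvec - xvec.mean(dim=-1, keepdim=True)) / (
        xvec.std(dim=-1, keepdim=True) + eps
    )
```

Whichever variant you pick, apply it consistently to the x-vectors used at training time and at inference time, otherwise the conditioning statistics won't match.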