The use of "reference_audio" at inference #9
Comments
@tuanh123789 Hi, can you give me some help?
Hi, I have the same question. Did you solve it?
Maybe it's because AdaSpeech only has a phoneme-level predictor but no utterance-level one, so at inference you still need to feed in a reference mel to get an utterance-level vector. I am not sure.
The original AdaSpeech paper says: "In the inference process, the utterance-level acoustic conditions are extracted from another reference speech of the speaker, and the phoneme-level acoustic conditions are predicted from phoneme-level acoustic predictor."
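To make the "extracted from another reference speech" step concrete, here is a minimal sketch of how an utterance-level acoustic-condition vector can be obtained from a reference mel-spectrogram: project each frame and mean-pool over time. The function name `utterance_level_vector` and the single linear projection `proj` are illustrative assumptions; the actual AdaSpeech encoder uses convolutional layers, but the pooling idea is the same.

```python
import numpy as np

def utterance_level_vector(ref_mel: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Extract one utterance-level acoustic-condition vector from a
    reference mel-spectrogram by projecting each frame and mean-pooling
    over time. ref_mel: (T, n_mels), proj: (n_mels, d). Returns (d,)."""
    frame_features = ref_mel @ proj     # (T, d) per-frame features
    return frame_features.mean(axis=0)  # (d,) time-pooled vector

# Toy usage: an 80-bin mel with 120 frames, pooled to a 16-dim vector.
rng = np.random.default_rng(0)
ref_mel = rng.standard_normal((120, 80))
proj = rng.standard_normal((80, 16)) * 0.1
vec = utterance_level_vector(ref_mel, proj)
print(vec.shape)  # (16,)
```

Because the pooling collapses the time axis, the content (word sequence) of the reference audio is largely averaged out, which is consistent with the discussion below that an arbitrary utterance of the target speaker can serve as the reference.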
Thank you for your answer. I'd like to know whether I can just pick any reference audio without caring about its content, or whether the text of the reference audio must match the content of the synthesized audio.
I don't think the reference audio needs to have exactly the same content (otherwise text-to-speech would be useless). In my understanding, the reference audio only provides some information about the acoustic condition (and maybe also speaker information), so an arbitrary utterance of the target speaker should be sufficient.
I used an utterance-level encoder during training, but I removed the reference audio when synthesizing the speech. Does the utterance-level encoder still affect the final audio?
Doesn't including an utterance-level encoder further enrich the modeling information?
@freshwindy When you removed the reference audio, do you mean you replaced the utterance-level vector with all zeros? There still needs to be a vector filling that slot. I haven't done any corresponding experiments yet. As for the utterance-level encoder, I can't come up with a reason why it wouldn't enrich the modeling information; the enrichment just may not be obvious to perceive. I'm not sure :)
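The "replace the utterance-level vector with all zeros" idea above can be sketched as follows. This is a toy illustration, not the repository's actual code: `condition_hidden` and the additive conditioning are assumptions, chosen because adding the utterance-level vector to every phoneme hidden state is a common conditioning scheme, and a zero vector then degenerates to a no-op.

```python
import numpy as np

def condition_hidden(hidden: np.ndarray, utt_vec=None) -> np.ndarray:
    """Add an utterance-level vector to every phoneme hidden state.
    With no reference audio, fall back to a zero vector, which leaves
    the hidden states unchanged (the 'removed reference audio' case)."""
    d = hidden.shape[-1]
    if utt_vec is None:
        utt_vec = np.zeros(d)
    return hidden + utt_vec[None, :]

hidden = np.ones((5, 16))               # 5 phonemes, 16-dim hidden states
with_ref = condition_hidden(hidden, np.full(16, 0.5))
without_ref = condition_hidden(hidden)  # zero fallback
print(np.allclose(without_ref, hidden))  # True
```

Under this scheme, dropping the reference audio at inference does not break synthesis, but the model still spent capacity on the utterance-level pathway during training, so the output may differ from a model trained without that encoder.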
Hi, I want to know: what is the use of "reference_audio" at inference?