
Refactoring expressivity/predict into ExpressiveTranslator. #292

Draft · wants to merge 3 commits into base: main

Conversation

@kauterry (Contributor)

Almost done with the implementation; saving this as a draft since I have yet to test it.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Dec 20, 2023
@yilinyang7 commented Dec 20, 2023

I'm not super sure of the intention of this PR; can I ask a few questions?

  • Why have both inference/ and predict/ subdirectories? They seem to serve the same purpose, which could be confusing.
  • This new ExpressiveTranslator only supports single-audio translation; isn't its use case too narrow? Single-audio translation is already well served by our two public demos (internal & HF) plus a gradio app.py script. I'm not convinced we need this.

@kauterry (Contributor, Author)

> I'm not super sure of the intention of this PR; can I ask a few questions?
>
>   • Why have both inference/ and predict/ subdirectories? They seem to serve the same purpose, which could be confusing.
>   • This new ExpressiveTranslator only supports single-audio translation; isn't its use case too narrow? Single-audio translation is already well served by our two public demos (internal & HF) plus a gradio app.py script. I'm not convinced we need this.

There is absolutely no harm in having this extra way of doing expressivity inference. Developers aren't going to use the demo; they need the code. For M4T we have m4t_predict, which is a thin layer around the Translator class, and we want to stick to the same format for expressivity_predict as well. Besides, the more ways we offer, the easier we make our models to adopt; that's the whole point of OSS. In this case, an ExpressiveTranslator strips all the unnecessary boilerplate away from the user, who can then call expressive inference easily, in the same format as M4T inference.

I don't think it's confusing: the CLI directory holds what the user can leverage directly, and inference acts as a bridge between the CLI and the actual models. If we have a Translator, we definitely need an ExpressiveTranslator, so I strongly disagree with you on this. In fact, this discrepancy between m4t_predict's design and expressivity_predict's design was pointed out by @elbayadm. I feel we should have paid more attention to the design of these interfaces, which expose our models to the users.
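For concreteness, a minimal sketch of that thin-CLI-around-translator pattern; the import path and the predict() signature here are assumptions for illustration, not the PR's final API:

```python
# Minimal sketch of the thin-CLI pattern: the script only parses arguments and
# delegates everything else to the translator class. The import path and the
# predict() signature are assumptions for illustration, not the PR's final API.
import argparse

from seamless_communication.inference import ExpressiveTranslator  # hypothetical path


def main() -> None:
    parser = argparse.ArgumentParser(description="Expressive speech-to-speech translation")
    parser.add_argument("input", help="path to the input audio file")
    parser.add_argument("tgt_lang", help="target language code, e.g. 'fra'")
    parser.add_argument("--model-name", default="seamless_expressivity")
    args = parser.parse_args()

    # Model loading, feature extraction, and vocoding all live inside the class.
    translator = ExpressiveTranslator(args.model_name)
    speech_output, text_output = translator.predict(args.input, args.tgt_lang)
    print(text_output)


if __name__ == "__main__":
    main()
```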

@kauterry (Contributor, Author)

predict falls under the CLI directory; inference is a directory of its own. Feel free to suggest a better name to make it clearer.

@yilinyang7 commented Dec 21, 2023

Thanks for the reply.
I agree it doesn't hurt to implement this extra single-audio inference script.
Yet my point is: when I looked at the ExpressiveTranslator class, it only supports single-audio inference. With such a general class name, isn't that too limited, and couldn't it confuse future users? I believe it should support batching, as Translator and PretsselGenerator do.
For the directory, I don't understand why you want both cli/expressivity/predict and cli/expressivity/inference. Could we instead put the script files under one directory?

@yilinyang7

For expressivity inference, batched inference plus the automatic evaluation metrics that run after it should be the default point of entry.
I don't disagree with an additional single-audio inference script; let's just align on what the main point of entry is.
If this ExpressiveTranslator class doesn't support batched inference, it will be confusing to users.

@kauterry (Contributor, Author)

This is going to support batched inference, and I will refactor expressivity_evaluate to use it, exactly like how Translator is used in m4t_evaluate and m4t_predict. How does that sound?
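As a sketch of what that could look like, with one class backing both entry points; the import path and the batched predict() overload below are assumptions, not the implemented API:

```python
# Hypothetical usage of one class serving both entry points, mirroring how
# Translator backs m4t_predict and m4t_evaluate. The batched predict() overload
# (list of paths in, per-item outputs out) is an assumption for illustration.
from seamless_communication.inference import ExpressiveTranslator  # hypothetical path

translator = ExpressiveTranslator("seamless_expressivity")

# expressivity_predict: a single audio file.
speech_out, text_out = translator.predict("sample.wav", tgt_lang="fra")

# expressivity_evaluate: a batch of audio files, e.g. read from a manifest.
batch = ["utt0001.wav", "utt0002.wav", "utt0003.wav"]
speech_outs, text_outs = translator.predict(batch, tgt_lang="fra")
```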

@yilinyang7

> This is going to support batched inference, and I will refactor expressivity_evaluate to use it, exactly like how Translator is used in m4t_evaluate and m4t_predict. How does that sound?

yep, that sounds good, thanks!

@elbayadm (Contributor)

Thank you @kauterry for drafting this PR.
@yilinyang7, the main thing I thought was missing in SC is this: `speech_output, text_output = expressive_translator.predict(input)`, where a user wouldn't need to compute fbank and src_gcmvn for prosody outside of the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers. Not gradio, not a demo, and not the command line.
Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.
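A minimal sketch of that entry point, assuming fairseq2's WaveformToFbankConverter for feature extraction; the GCMVN attributes and the generate() call are placeholders, not the actual implementation:

```python
# Sketch of the entry point described above: predict() takes a raw audio path
# and computes fbank and src_gcmvn internally, so the caller never touches
# feature extraction. WaveformToFbankConverter is fairseq2's real helper;
# gcmvn_mean/gcmvn_std and model.generate() are illustrative placeholders.
import torchaudio
from fairseq2.data.audio import WaveformToFbankConverter


class ExpressiveTranslator:
    def __init__(self, model, gcmvn_mean, gcmvn_std):
        self.model = model
        self.gcmvn_mean = gcmvn_mean
        self.gcmvn_std = gcmvn_std
        self.convert_to_fbank = WaveformToFbankConverter(
            num_mel_bins=80,
            waveform_scale=2**15,
            channel_last=True,
            standardize=False,
        )

    def predict(self, audio_path, tgt_lang):
        # Load audio; channel_last=True expects (frames, channels).
        waveform, sample_rate = torchaudio.load(audio_path)
        decoded = {
            "waveform": waveform.transpose(0, 1),
            "sample_rate": float(sample_rate),
            "format": -1,
        }
        fbank = self.convert_to_fbank(decoded)["fbank"]

        # GCMVN-normalize the features for the prosody encoder, out of the
        # caller's sight.
        src_gcmvn = (fbank - self.gcmvn_mean) / self.gcmvn_std

        # Placeholder: run the translation model + vocoder and return
        # (speech_output, text_output).
        return self.model.generate(fbank, src_gcmvn, tgt_lang)
```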

@@ -161,7 +161,7 @@ Please check out above [section](#seamlessexpressive-models) on how to acquire `
 ### W2v-BERT 2.0 speech encoder
 | Model Name | #params | checkpoint |
 | ---------- | ------- | ---------- |
-| W2v-BERT 2.0 | 600M | [🤗 Model card](https://huggingface.co/facebook/conformer-shaw) - [checkpoint](https://huggingface.co/facebook/conformer-shaw/resolve/main/conformer_shaw.pt)
+| W2v-BERT 2.0 | 600M | [🤗 Model card](https://huggingface.co/facebook/w2v-bert-2.0) - [checkpoint](https://huggingface.co/facebook/w2v-bert-2.0/resolve/main/conformer_shaw.pt)
Contributor

Can we fix this in another PR?

Contributor Author

Sure

@yilinyang7

> predict falls under the CLI directory; inference is a directory of its own. Feel free to suggest a better name to make it clearer.

Sorry, I missed this; I just realized I misread the directory layers.

@yilinyang7

> Thank you @kauterry for drafting this PR. @yilinyang7, the main thing I thought was missing in SC is this: `speech_output, text_output = expressive_translator.predict(input)`, where a user wouldn't need to compute fbank and src_gcmvn for prosody outside of the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers. Not gradio, not a demo, and not the command line. Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.

Thanks for the clarification, that makes sense!
I think the current discrepancy is that expressivity_evaluate does batched inference, while the current ExpressiveTranslator class only supports single-audio inference.
If we want a clearer API call, we might need to refactor every script to use this new ExpressiveTranslator class, which should support batched inference for efficiency.
If I misunderstood anything, let me know!

@elbayadm (Contributor)

> > Thank you @kauterry for drafting this PR. @yilinyang7, the main thing I thought was missing in SC is this: `speech_output, text_output = expressive_translator.predict(input)`, where a user wouldn't need to compute fbank and src_gcmvn for prosody outside of the call to a predict function. The point is not to add new functionality (e.g. batching) but to provide a better entry point in Python, à la HF's transformers. Not gradio, not a demo, and not the command line. Functionality can be added later, and refactoring expressivity_evaluate is definitely necessary.
>
> Thanks for the clarification, that makes sense! I think the current discrepancy is that expressivity_evaluate does batched inference, while the current ExpressiveTranslator class only supports single-audio inference. If we want a clearer API call, we might need to refactor every script to use this new ExpressiveTranslator class, which should support batched inference for efficiency. If I misunderstood anything, let me know!

That's reasonable; if batched inference is already implemented, then the new ExpressiveTranslator should use it as well.

@yilinyang7 left a comment

After the bug fix, batched ExpressiveTranslator inference is done and replicates the previous results on all metrics.
