17 implement cli based interaction #34
base: main
Conversation
Force-pushed from 4e95724 to 1bded51.
@daavoo - Did an early test of this and it's working well - tested on Codespaces and it worked end-to-end with the output saving. Thanks for also updating the docs.
Force-pushed from bd56198 to caae572.
- Refactor speakers part of the config.
Force-pushed from fe55a78 to bf5698b.
@daavoo - Thanks for this, a lot of work here!
Taking a look 👀
@stefanfrench I think this is fixed now |
- Tested CLI works
- Tested updated streamlit app works
- Tested mkdocs serve
All working as expected, approved for merge. Thanks for the big effort on this. I think we will need to update the customization guide docs based on the changes, but we can do that in a separate issue.
A few notes from just looking at it on GH. I'll test it locally in a bit.
"Speaker 1": "Sure! Imagine it like this...", | ||
"Speaker 2": "Oh, that's cool! But how does..." | ||
} | ||
sampling_rate: 44100 |
This actually depends on the TTS model we use. Parler uses 44,100 Hz, but Oute models use 24,000 Hz. So the sampling rate should be set internally by us, something like this.
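A minimal sketch of that idea, assuming a hypothetical helper (the mapping keys and function name are made up, not the repo's actual code):

```python
# Hypothetical sketch: derive the sampling rate from the chosen TTS model
# instead of exposing it as a user-facing config field.
TTS_SAMPLING_RATES = {
    "parler": 44_100,  # Parler models generate 44.1 kHz audio
    "oute": 24_000,    # Oute models generate 24 kHz audio
}

def get_sampling_rate(text_to_speech_model: str) -> int:
    """Resolve the sampling rate internally from the model identifier."""
    for family, rate in TTS_SAMPLING_RATES.items():
        if family in text_to_speech_model.lower():
            return rate
    raise ValueError(f"Unsupported TTS model: {text_to_speech_model}")
```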
Got it! I don't know why I thought it was a user-facing param.
output_folder: str
text_to_text_model: Annotated[str, AfterValidator(validate_text_to_text_model)]
text_to_text_prompt: str
text_to_speech_model: Literal[
This does look cleaner than before; however, it no longer makes it possible for each speaker to use a different TTS model (e.g. speaker 1 is from Parler, speaker 2 is from Oute). I had seen this feature in another open-source NotebookLM implementation and thought it was cool. Do we think this is something we want to enable, or is it a bit of an overreach (who can locally load an LLM + >=2 TTS models anyway)? I am okay with basically removing this feature, but maybe we can update the PR description to make it clear that this won't be possible anymore.
> Do we think this is something we want to enable, or is it a bit of an overreach (who can locally load an LLM + >=2 TTS models anyway)?

I think it is an interesting scenario. Probably not worth making the default CLI more complex, but it feels like it should be doable using the "low level" API directly. Maybe it is a good example to add to some "advanced customization/usage" page in the docs or something.
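For illustration, the two-model scenario on top of a "low level" API might read roughly like this; the import, function names, and model ids below are assumptions, not the actual project API:

```python
# Assumed API: load_tts_model / generate_speech are placeholders for whatever
# the low-level interface actually exposes; the model ids are illustrative too.
from document_to_podcast.inference import load_tts_model, generate_speech

script = [
    ("Speaker 1", "Sure! Imagine it like this..."),
    ("Speaker 2", "Oh, that's cool! But how does..."),
]
# One TTS model per speaker: speaker 1 from Parler, speaker 2 from Oute.
models = {
    "Speaker 1": load_tts_model("parler-tts/parler-tts-mini-v1"),
    "Speaker 2": load_tts_model("OuteAI/OuteTTS-0.2-500M"),
}
for speaker, line in script:
    audio = generate_speech(models[speaker], line)
```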
I am not sure about this change, since with #49 I will need to reintroduce this function together with `_speech_generation_oute`. And the speaker profile could either be an Oute id (`female_1`) or a natural-language description for Parler. I do feel a bit weird, though, that we would be using one variable in two slightly different ways. Do you have any other suggestion?
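Purely as a sketch of one alternative (the class names below are made up, not code from this PR), the two meanings could be tagged explicitly instead of overloading a single string:

```python
from dataclasses import dataclass

# Hypothetical types separating the two uses of "speaker profile".
@dataclass
class OuteSpeakerId:
    value: str  # a predefined Oute speaker id, e.g. "female_1"

@dataclass
class ParlerVoiceDescription:
    value: str  # a free-text voice description consumed by Parler

SpeakerProfile = OuteSpeakerId | ParlerVoiceDescription
```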
🆒
document-to-podcast \
  --input_file "example_data/Mozilla-Trustworthy_AI.pdf" \
  --output_folder "example_data" \
  --text_to_text_model "Qwen/Qwen2.5-1.5B-Instruct-GGUF/qwen2.5-1.5b-instruct-q8_0.gguf"
If you don't provide an argument, does it take the value from the `config.yaml`?
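For context, the common pattern being asked about looks roughly like this (sketched with argparse and PyYAML; whether `cli.py` actually does this is exactly the question):

```python
import argparse
import yaml  # requires PyYAML

# Sketch only: values from config.yaml act as defaults,
# and explicitly passed CLI flags override them.
parser = argparse.ArgumentParser()
parser.add_argument("--input_file")
parser.add_argument("--output_folder")
parser.add_argument("--text_to_text_model")
args = parser.parse_args()

with open("config.yaml") as f:
    config = yaml.safe_load(f)

for key, value in vars(args).items():
    if value is not None:  # only flags the user actually passed win
        config[key] = value
```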
What's changing

- Add `cli.py` and expose it as entrypoint.
- Update `config` and refactor how speaker configuration is handled.
- Add `mock` tests.
- Add `e2e` tests that can be run through `workflow_dispatch` (takes too much time to be enabled by default).

How to test it