Whisper WebUI VAD segmenting #287
Hi @drohack, just saw this. Would you be able to share your modifications to replace their faster_whisper transcription model with stable-ts? Thanks.
I kind of hacked it together by downloading the Whisper-WebUI source code and importing it into my project, then updating/replacing the fasterWhisperContainer.py file to call stable-ts instead. (That was faster than reverse engineering how they segmented/batched the audio file and passed it into faster_whisper, though now my project is just a snapshot of WebUI...) Here's my code calling WebUI. It looks pretty similar to how you'd call it normally for faster_whisper. The one main difference is that you have to pass the …
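Roughly this shape (a sketch only; the names FasterWhisperContainer, create_callback, VadSileroTranscription, and TranscriptionConfig are assumptions from my snapshot of the Whisper-WebUI source and may differ in other versions):

```python
# NOTE: all class/method names below are assumptions based on one snapshot
# of the Whisper-WebUI source tree; check them against your own checkout.
from src.whisper.fasterWhisperContainer import FasterWhisperContainer
from src.vad import VadSileroTranscription, TranscriptionConfig

# Build the model container, same as you would for plain faster_whisper.
model = FasterWhisperContainer(
    model_name="large-v2", device="cuda", compute_type="float16"
)

# The per-segment callback is what the VAD transcription invokes on each
# chunk of detected speech.
callback = model.create_callback(
    language="en",
    task="transcribe",
    initial_prompt=None,
)

# Silero VAD splits the audio; each speech segment is transcribed via the
# callback and the partial results are merged back together at the end.
vad = VadSileroTranscription()
result = vad.transcribe("input.wav", callback, TranscriptionConfig())
print(result["text"])  # merged transcript (dict layout per my snapshot)
```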
Here are the relevant lines that I edited from WebUI/src/whisper/fasterWhisperContainer.py so that it calls stable-ts instead (a sketch of the shape of these edits follows the list):
- Update the …
- Update …
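A sketch of the kind of change involved, assuming stable-ts's load_faster_whisper loader (the patched method was named transcribe_stable in the stable-ts 2.x releases of the time; newer releases expose it as transcribe):

```python
import stable_whisper

# Before (plain faster_whisper), roughly:
#   from faster_whisper import WhisperModel
#   model = WhisperModel(model_size_or_path, device=device, compute_type=compute_type)
#   segments, info = model.transcribe(audio, **decode_options)

# After: load the same CTranslate2 weights through stable-ts instead.
model = stable_whisper.load_faster_whisper(
    "large-v2", device="cuda", compute_type="float16"
)

# stable-ts 2.x attached the patched method as transcribe_stable();
# newer versions expose the same thing as transcribe().
result = model.transcribe_stable("audio_segment.wav", language="en")

# The WhisperResult keeps segment timestamps, so the surrounding WebUI
# code can still read start/end/text off each segment as before.
for seg in result.segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```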
It really was just a few lines of code changed. I do have my full project posted here: https://github.com/drohack/AutoSubVideos
Thanks so much @drohack!
Would it be possible to add an option to stable-ts similar to the one Whisper-WebUI (https://gitlab.com/aadnk/whisper-webui) uses to split up the audio into segments of detected speech using VAD?
The main benefit is that for longer audio files you can leave `condition_on_previous_text` set to `True` and have minimal looping errors (where the transcription gets into an infinite loop of producing the same response). This is because each audio segment is much shorter, so the previous text is only conditioned on within that short section.

I've been able to download the Whisper-WebUI code and replace their faster_whisper transcription model with stable-ts, so I know it works. I wouldn't expect it to change the transcription itself, since it doesn't touch the Whisper code; it just handles splitting up the audio and merging the transcripts back together at the end.
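For the flavor of it, here is a minimal sketch of the split/transcribe/merge loop using the Silero VAD utilities from torch.hub plus stable-ts (the model names and parameters are illustrative, not Whisper-WebUI's actual defaults):

```python
import torch
import stable_whisper

SR = 16000  # Silero VAD expects 8 or 16 kHz audio

# Load Silero VAD and its helper utilities from torch.hub.
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("input.wav", sampling_rate=SR)
# Each entry is {'start': sample, 'end': sample} for one stretch of speech.
speech_ts = get_speech_timestamps(wav, vad_model, sampling_rate=SR)

model = stable_whisper.load_model("base")

merged = []
for ts in speech_ts:
    chunk = wav[ts["start"]:ts["end"]].numpy()
    offset = ts["start"] / SR
    # condition_on_previous_text can stay True: the "previous text" only
    # spans this short chunk, so a repetition loop cannot snowball.
    result = model.transcribe(chunk, condition_on_previous_text=True)
    # Shift segment times from chunk-relative back to file-relative.
    for seg in result.segments:
        merged.append((seg.start + offset, seg.end + offset, seg.text))

for start, end, text in merged:
    print(f"[{start:.2f} -> {end:.2f}] {text}")
```

Because each chunk is only a short stretch of speech, the conditioning context effectively resets at every VAD boundary, which is exactly what keeps the looping errors from compounding across a long file.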
I know this is a pretty big feature request, but I do think it would be beneficial.