Always a bit delay and a bit early stops #343

Open
terryops opened this issue Apr 11, 2024 · 4 comments
Comments

@terryops

I'm utilizing stable-ts alongside faster-whisper's integrated VAD parameters, and I've noticed that when executing the following code snippet:
result = model.transcribe_stable(filename, regroup=False, k_size=9, vad_filter=True),
the outcomes generally exhibit a slight delay and cease prematurely compared to the original faster-whisper performance. Despite tweaking several parameters within stable-ts, I haven't found a successful adjustment yet.
In my previous workflow, all my audio files are pre-processed with demucs before being fed into faster-whisper, which typically yields satisfactory results.
However, in scenarios where the audio contains considerable noise, particularly coughs and other disruptions, the timestamps are excessively extended, spanning from the cough to the actual content.
This issue led me to experiment with stable-ts, though it hasn't met my expectations so far.
Could you offer any advice on this matter? I've experimented with the k_size and q_levels settings without finding a viable solution.
Thanks in advance.

@jianfch
Owner

jianfch commented Apr 11, 2024

If faster-whisper was yielding satisfactory results with vad_filter=True, you might get better results with vad=True instead of k_size and q_levels, which could be causing the "slight delay and cease prematurely" behavior, especially with audio preprocessed with demucs. Since vad_filter=True already filters the result, completely disabling silence suppression with suppress_silence=False is an option to consider if the issue persists even with vad=True.
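A minimal sketch of the two suggestions, assuming the model is loaded with stable_whisper.load_faster_whisper() and that transcribe_stable() forwards vad_filter to faster-whisper as in the original snippet. The helper names suggested_options and transcribe, and the "base" model name, are hypothetical and only for illustration.

```python
def suggested_options(use_silero_vad: bool = True, suppress_silence: bool = True) -> dict:
    """Build keyword arguments for transcribe_stable() per the suggestions above."""
    options = {
        "regroup": False,
        "vad_filter": True,  # faster-whisper's own VAD filter, as in the original call
        "suppress_silence": suppress_silence,
    }
    if suppress_silence and use_silero_vad:
        # Use Silero VAD for silence suppression instead of tuning k_size / q_levels.
        options["vad"] = True
    return options


def transcribe(filename: str, options: dict, model_name: str = "base"):
    """Run the transcription (requires stable-ts and faster-whisper installed)."""
    import stable_whisper  # imported lazily so the helper above works without it

    model = stable_whisper.load_faster_whisper(model_name)
    return model.transcribe_stable(filename, **options)


first_try = suggested_options()                       # vad=True
fallback = suggested_options(suppress_silence=False)  # no silence suppression at all
```

Calling transcribe(filename, first_try) and, if the timings are still off, transcribe(filename, fallback) mirrors the order of the suggestions.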

@terryops
Author

terryops commented Apr 12, 2024 via email

@terryops
Author

I've noticed that setting vad=True doesn't improve outcomes compared to the built-in VAD filter in faster-whisper. Could it be that the inference process of the Silero VAD has been modified in your implementation? My review of your code revealed the absence of the min_silence_duration_ms feature, which might result in frequent brief silences interspersed between speech segments.
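For context, the effect of a minimum-silence threshold can be sketched in plain Python. This is only an illustration of the idea, not Silero VAD's or faster-whisper's actual code; the function name drop_short_silences and the 500 ms default are hypothetical.

```python
def drop_short_silences(silences, min_silence_duration_ms=500):
    """Keep only silences that last at least min_silence_duration_ms.

    silences is a list of (start_seconds, end_seconds) tuples. Shorter gaps are
    discarded, so the speech on either side is treated as one continuous segment.
    """
    min_seconds = min_silence_duration_ms / 1000.0
    return [(start, end) for start, end in silences if end - start >= min_seconds]


detected = [(0.0, 1.5), (3.2, 3.4), (5.0, 7.5)]
kept = drop_short_silences(detected, min_silence_duration_ms=500)
# The 0.2 s gap at 3.2 s is dropped; the two longer silences remain.
```

In faster-whisper itself, a threshold of this kind is passed through vad_parameters, for example vad_parameters=dict(min_silence_duration_ms=500) alongside vad_filter=True.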

@jianfch
Owner

jianfch commented Apr 13, 2024

> I've noticed that setting vad=True doesn't improve outcomes compared to the built-in VAD filter in faster-whisper.

Likely due to the different approaches. Faster-Whisper uses the VAD predictions to trim the audio into chunks that meet the threshold and transcribes only those chunks. Stable-ts uses the VAD predictions to trim the timings after the transcription is completed (see https://github.com/jianfch/stable-ts?#silence-suppression).
You can check whether the latter is working as intended by inspecting the nonspeech timings in the nonspeech_sections attribute of the transcription result object returned by transcribe_stable(). Any nonspeech_sections that do not satisfy the required conditions (determined by parameters) are ignored.
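A rough sketch of such a check, assuming each entry of nonspeech_sections is a dict with "start" and "end" keys in seconds (verify this against your installed version). The helper names summarize_nonspeech and overlapping_sections are hypothetical, and the sample data stands in for a real result.

```python
def summarize_nonspeech(sections, min_duration=0.0):
    """Return (start, end, duration) tuples for sections lasting at least min_duration seconds."""
    rows = []
    for section in sections:
        start, end = float(section["start"]), float(section["end"])
        duration = round(end - start, 3)
        if duration >= min_duration:
            rows.append((start, end, duration))
    return rows


def overlapping_sections(word_start, word_end, sections):
    """Return the nonspeech sections that overlap a word's [start, end] span."""
    return [s for s in sections if s["start"] < word_end and s["end"] > word_start]


# Stand-in for result.nonspeech_sections; the format is an assumption.
example_sections = [{"start": 0.0, "end": 1.2}, {"start": 4.5, "end": 4.6}]
long_sections = summarize_nonspeech(example_sections, min_duration=0.5)
near_first_word = overlapping_sections(1.0, 2.0, example_sections)
```

With a real result, passing result.nonspeech_sections to summarize_nonspeech() shows whether the cough or noise region was detected as nonspeech at all, and overlapping_sections() shows which detected sections touch a given word's timestamps.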
