Optimizing the 'Align' Feature for Accurate Audio-Text Synchronization #331
See #296 (comment). Another tip is to try smaller models.
@jianfch It would be great if the 'Align' feature could be further optimized.
@jianfch To address this, could we use two models to align separately and then merge the results to eliminate the errors? Since alignment is fast, taking only a few seconds, aligning twice is worth trying.
This could work in theory, but it requires a reliable way to autodetect the errors. |
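One way to approach that autodetection might look like the sketch below: run alignment twice (e.g. with two different models), then flag words whose timings disagree by more than a tolerance. This is purely illustrative and uses plain tuples rather than the stable-ts API; the function name and data shape are assumptions, not part of the library.

```python
# Hypothetical error-autodetection sketch: compare two alignment passes
# over the same text and flag words whose timings disagree.
# Each pass is assumed to be a list of (word, start_sec, end_sec) tuples.

def flag_disagreements(pass_a, pass_b, tol=0.5):
    """Return indices of words whose start or end times differ by more
    than `tol` seconds between the two passes."""
    flagged = []
    for i, ((wa, sa, ea), (wb, sb, eb)) in enumerate(zip(pass_a, pass_b)):
        assert wa == wb, "both passes must align the same text"
        if abs(sa - sb) > tol or abs(ea - eb) > tol:
            flagged.append(i)
    return flagged

a = [("hello", 0.0, 0.4), ("world", 0.5, 0.9)]
b = [("hello", 0.0, 0.4), ("world", 2.1, 2.6)]
print(flag_disagreements(a, b))  # → [1]: the second word's timings disagree
```

Flagged words could then be re-aligned with different settings or reviewed manually; the open question is choosing a tolerance that separates real errors from ordinary model-to-model jitter.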
- Improved word timing by making `gap_padding` more effective; now only the timestamps of padded words with duration below `min_word_dur` are reset, instead of removing `gap_padding` and re-aligning the entire segment when any word falls below `min_word_dur`.
- Added parameters `presplit` and `gap_padding` to `align()` to increase its chance of detecting speech gaps within and before each segment.
- Added parameter `extra_models` to `align()` and `transcribe()` to average the timings produced by multiple different Whisper models, for cases such as #331.
- `suppress_attention` is deprecated and will be removed in future versions; it prevents `gap_padding` from working properly.
- For the `demucs` and `dfnet` denoisers, the audio is now denoised in 2 channels instead of mono when denoised directly or with `stream=False`.
- Fixed an error caused by `AudioLoader` trying to enable the progress bar for the `noisereduce` denoiser with `progress=True`.
- Fixed warnings from yt-dlp being incorrectly parsed as titles when downloading audio with it.
- Added parameter `engine` to `load_model()` to specify the quantization engine (#341); the first available engine is auto-assigned when `engine=None`, and the default engine is 'none', which can cause issues such as #338.
- `ts_num` and `ts_noise` are deprecated and will be removed in future versions.
- Fixed the `dfnet` denoiser not setting its model to the specified `device`.
- Corrected the default value of `word_level` in the docstring of `WhisperResult.adjust_by_silence()` from False to True.
```python
model = stable_whisper.load_model('base')
extra_models = [stable_whisper.load_model(name) for name in ['base.en', 'small', 'small.en', 'tiny', 'tiny.en']]
result = model.transcribe('audio.wav', extra_models=extra_models)
```
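Conceptually, averaging timings across models might look like the sketch below: given each model's word timings for the same text, take the mean start and end per word. This is an illustration of the idea only, using plain tuples; it is not how stable-ts implements `extra_models` internally.

```python
# Conceptual sketch of averaging word timings across models.
# per_model is assumed to be a list (one entry per model) of lists of
# (word, start_sec, end_sec) tuples over the same word sequence.

def average_timings(per_model):
    """Average each word's start/end times across all models."""
    words = [w for w, _, _ in per_model[0]]
    averaged = []
    for i, w in enumerate(words):
        starts = [m[i][1] for m in per_model]
        ends = [m[i][2] for m in per_model]
        averaged.append((w, sum(starts) / len(starts), sum(ends) / len(ends)))
    return averaged

timings = [
    [("hi", 0.0, 0.4), ("there", 0.5, 1.0)],  # model A
    [("hi", 0.2, 0.6), ("there", 0.7, 1.2)],  # model B
]
print(average_timings(timings))
```

Averaging assumes every model produces timings for the same word sequence; an outlier model can still pull the mean off, which is why combining it with a disagreement check may help.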
I ultimately found the cause of the issue: the initial text input contained extra sentences. When stable-ts performs alignment, it can sometimes correct such errors and sometimes it cannot. However, as long as the source problem is resolved, the alignment is always correct.
How can I ensure that the "Align" feature, which aligns plain text or tokens with audio at the word level, avoids outputting accidental errors? This feature is great because it can output final results based on edited line breaks, and I can import recognition results without word timestamps from various models, which is very flexible.
In practical tests, sometimes the output is perfect. However, sometimes a few sentences are missed or the timestamps come out of order.
What input conditions should I pay attention to in order to ensure perfect output?
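Since the failures above traced back to extra sentences in the input text, one practical pre-check is to compare the text you plan to align against a rough transcript (e.g. from a small model) and flag input words that never appear in it. The sketch below is purely illustrative; the function names and the whitespace-and-apostrophe normalization are assumptions, not stable-ts features.

```python
# Hypothetical pre-check: flag words in the alignment text that do not
# appear anywhere in a rough transcript of the audio. Words flagged here
# may belong to extra sentences that were never spoken.
import re

def normalize(text):
    """Lowercase and keep only word characters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def extra_words(align_text, rough_transcript):
    """Return words from align_text absent from the transcript's vocabulary."""
    transcript_vocab = set(normalize(rough_transcript))
    return [w for w in normalize(align_text) if w not in transcript_vocab]

print(extra_words("hello world this line was never spoken",
                  "hello world"))
# → ['this', 'line', 'was', 'never', 'spoken']
```

A bag-of-words check like this is deliberately crude (it ignores word order and repeats), but a long run of flagged words is a strong hint that the input text contains material not present in the audio.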