
Optimizing the 'Align' Feature for Accurate Audio-Text Synchronization #331

zxl777 opened this issue Apr 4, 2024 · 8 comments

@zxl777 commented Apr 4, 2024

How can I ensure that the "Align" feature, which aligns plain text or tokens with audio at the word level, avoids occasional errors in its output? This feature is great because it produces final results that follow the edited line breaks, and it lets me import recognition results without word timestamps from various models, which is very flexible.

In practical tests, the output is sometimes perfect. However, sometimes a few sentences are missed, or the timestamps are out of order.

What input conditions should I pay attention to in order to ensure perfect output?

@jianfch (Owner) commented Apr 4, 2024

See #296 (comment).

Another tip is to try smaller models such as base; you might find better results with those.

@zxl777 (Author) commented Apr 5, 2024

@jianfch
I recently attempted to use load_model('small.en'), and the results were flawless. However, when I tried load_model('medium.en'), it consistently missed some sentences.

import stable_whisper

# model = stable_whisper.load_model('medium.en')  # consistently missed sentences
model = stable_whisper.load_model('small.en')
result = model.align(audio, text, language='en', original_split=True)

@jianfch (Owner) commented Apr 8, 2024

The chosen alignment heads of medium.en might simply not be as reliable as those of small.en.
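
For context, a minimal sketch of what the alignment heads are, assuming the openai-whisper backend where they are stored as a sparse buffer on the model (this inspection is illustrative, not part of the stable-ts API):

import stable_whisper

# Each Whisper model ships with a fixed set of cross-attention heads used for
# word timing; the set differs per model size, which may explain the difference.
model = stable_whisper.load_model('medium.en')
print(model.alignment_heads.to_dense().nonzero())  # (layer, head) index pairs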

@zxl777 (Author) commented Apr 8, 2024

@jianfch
My recent finding is that small.en sometimes duplicates an occasional sentence, whereas medium.en can miss sentences entirely. The issues with small.en are therefore relatively minor.

It would be great if the 'Align' feature could be further optimized.

@zxl777 (Author) commented Apr 9, 2024

@jianfch
I suspect that the extra or missing sentences in the aligned results are due to the characteristics of each model and occasional recognition errors.

To address this, could we use two models to align separately and then integrate the results to eliminate the errors? Since alignment is fast, taking only a few seconds, it's worth aligning twice.

@jianfch (Owner) commented Apr 10, 2024

> To address this, could we use two models to align separately and then integrate the results to eliminate the errors?

This could work in theory, but it requires a reliable way to autodetect the errors.
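
A rough sketch of what such a cross-check might look like (with original_split=True both results should follow the same line breaks, so the segments can be zipped one-to-one; the 1.0 s threshold is only a guess):

# Hypothetical error detector: align with two models and flag segments
# whose timestamps disagree by more than a threshold.
model_a = stable_whisper.load_model('small.en')
model_b = stable_whisper.load_model('medium.en')
result_a = model_a.align(audio, text, language='en', original_split=True)
result_b = model_b.align(audio, text, language='en', original_split=True)
for seg_a, seg_b in zip(result_a.segments, result_b.segments):
    if abs(seg_a.start - seg_b.start) > 1.0 or abs(seg_a.end - seg_b.end) > 1.0:
        print(f'possible misalignment near {seg_a.start:.2f}s: {seg_a.text!r}')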

jianfch added a commit that referenced this issue Apr 14, 2024
-improved word timing by making `gap_padding` more effective; now only resets the timestamps for padded words with duration below `min_word_dur` instead of removing `gap_padding` and re-aligning the entire segment when any word falls below `min_word_dur`
-added parameters, `presplit` and `gap_padding`, to `align()` to increase its chance of detecting speech gaps within and before each segment
-added parameter, `extra_models`, to `align()` and `transcribe()` to average the timings produced by multiple different Whisper models for cases such as #331
-`suppress_attention` is deprecated and will be removed in future versions; `suppress_attention` prevents `gap_padding` from working properly
-for `demucs` and `dfnet` denoisers, the audio is now denoised in 2 channels instead of mono when denoised directly or with `stream=False`
-fixed error caused by `AudioLoader` trying to enable the progress bar for the `noisereduce` denoiser with `progress=True`
-fixed incorrectly parsing warnings from yt-dlp as titles when downloading audio with it
-added parameter, `engine`, to `load_model()` to specify the quantization engine (#341); the first available `engine` is auto-assigned when `engine=None`, since the default engine 'none' can cause issues such as (#338)
-`ts_num` and `ts_noise` are deprecated and will be removed in future versions
-fixed `dfnet` denoiser not setting its model to the specified `device`
-corrected the default value of `word_level` in the docstring of `WhisperResult.adjust_by_silence()` from False to True
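
Based on these notes, align() can presumably be called with the new parameters like this (a sketch; the gap_padding value shown is an assumption, not confirmed as the default):

result = model.align(
    audio, text, language='en',
    original_split=True,
    presplit=True,       # detect speech gaps within and before each segment
    gap_padding=' ...',  # padding text used for gap detection (value assumed)
)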
@jianfch (Owner) commented Apr 14, 2024

`extra_models`, introduced in 5513609, will compute the timestamps from the average of all the `extra_models` and the main model.

import stable_whisper

model = stable_whisper.load_model('base')
extra_models = [stable_whisper.load_model(name) for name in ['base.en', 'small', 'small.en', 'tiny', 'tiny.en']]
result = model.transcribe('audio.wav', extra_models=extra_models)
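
Since `extra_models` was also added to `align()`, presumably the same approach applies to the alignment case from this issue (a sketch, assuming audio and text are already defined):

# Hypothetical: average alignment timings across models for the align() case.
model = stable_whisper.load_model('small.en')
extra_models = [stable_whisper.load_model('medium.en')]
result = model.align(audio, text, language='en', original_split=True,
                     extra_models=extra_models)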

@zxl777 (Author) commented Apr 14, 2024

I ultimately found the cause of the issue: it was a problem with the original text input, which contained extra sentences. When stable-ts performs alignment, it can sometimes correct such errors and sometimes it cannot.

However, as long as the source problem is resolved, the alignment is always correct.
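
One hedged way to catch such input problems before aligning is to transcribe first and compare the texts (the use of difflib and the 0.9 threshold are assumptions, not part of stable-ts):

import difflib

# Hypothetical pre-check: compare the text to be aligned against a quick
# transcription to catch extra or missing sentences in the input.
transcribed = model.transcribe(audio).text
similarity = difflib.SequenceMatcher(
    None, transcribed.lower().split(), text.lower().split()).ratio()
if similarity < 0.9:  # threshold is a guess
    print(f'warning: input text diverges from audio (similarity {similarity:.2f})')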
