
Optimizing the 'Align' Feature for Accurate Audio-Text Synchronization #331

zxl777 opened this issue Apr 4, 2024 · 8 comments

@zxl777 commented Apr 4, 2024

How can I ensure that the "Align" feature, which aligns plain text or tokens with audio at the word level, avoids occasional errors in its output? This feature is great because it produces final results that follow the edited line breaks, and it lets me import recognition results without word timestamps from various models, which is very flexible.

In practical tests, the output is sometimes perfect. However, sometimes a few sentences are missed, or the timestamps are out of order.

What input conditions should I pay attention to in order to ensure perfect output?

@jianfch (Owner) commented Apr 4, 2024

See #296 (comment).

Another tip is to try smaller models such as base; you might find better results with those.

@zxl777 (Author) commented Apr 5, 2024

@jianfch
I recently attempted to use load_model('small.en'), and the results were flawless. However, when I tried load_model('medium.en'), it consistently missed some sentences.

import stable_whisper

# model = stable_whisper.load_model('medium.en')  # consistently missed sentences
model = stable_whisper.load_model('small.en')
result = model.align(audio, text, language='en', original_split=True)

@jianfch (Owner) commented Apr 8, 2024

The chosen alignment heads of medium.en might simply not be as reliable as those of small.en.
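
For context, a minimal sketch of what the alignment heads are, assuming the openai-whisper backend where they are stored as a sparse buffer on the model (this inspection is illustrative, not part of the stable-ts API):

import stable_whisper

# Each Whisper model ships with a fixed set of cross-attention heads used for
# word timing; the set differs per model size, which may explain the difference.
model = stable_whisper.load_model('medium.en')
print(model.alignment_heads.to_dense().nonzero())  # (layer, head) index pairs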

@zxl777 (Author) commented Apr 8, 2024

@jianfch
My recent finding is that small.en sometimes duplicates an occasional sentence, whereas medium.en can miss sentences entirely. The issues with small.en are therefore relatively minor.

It would be great if the 'Align' feature could be further optimized.

@zxl777 (Author) commented Apr 9, 2024

@jianfch
I suspect that the extra or missing sentences in the aligned results are due to the characteristics of each model and occasional recognition errors.

To address this, could we use two models to align separately and then integrate the results to eliminate the errors? Since alignment is fast, taking only a few seconds, it's worth aligning twice.

@jianfch (Owner) commented Apr 10, 2024

> To address this, could we use two models to align separately and then integrate the results to eliminate the errors?

This could work in theory, but it requires a reliable way to autodetect the errors.
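
A rough sketch of what such a cross-check might look like (with original_split=True both results should follow the same line breaks, so the segments can be zipped one-to-one; the 1.0 s threshold is only a guess):

# Hypothetical error detector: align with two models and flag segments
# whose timestamps disagree by more than a threshold.
model_a = stable_whisper.load_model('small.en')
model_b = stable_whisper.load_model('medium.en')
result_a = model_a.align(audio, text, language='en', original_split=True)
result_b = model_b.align(audio, text, language='en', original_split=True)
for seg_a, seg_b in zip(result_a.segments, result_b.segments):
    if abs(seg_a.start - seg_b.start) > 1.0 or abs(seg_a.end - seg_b.end) > 1.0:
        print(f'possible misalignment near {seg_a.start:.2f}s: {seg_a.text!r}')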

jianfch added a commit that referenced this issue Apr 14, 2024
-improved word timing by making `gap_padding` more effective; now only resets the timestamps for padded words with duration below `min_word_dur` instead of removing `gap_padding` and re-aligning the entire segment when any word falls below `min_word_dur`
-added parameters, `presplit` and `gap_padding`, to `align()` to increase its chance of detecting speech gaps within and before each segment
-added parameter, `extra_models`, to `align()` and `transcribe()` to average the timings produced by multiple different Whisper models for cases such as #331
-`suppress_attention` is deprecated and will be removed in future versions; `suppress_attention` prevents `gap_padding` from working properly
-for `demucs` and `dfnet` denoisers, the audio is now denoised in 2 channels instead of mono when denoised directly or with `stream=False`
-fixed error caused by `AudioLoader` trying to enable the progress bar for the `noisereduce` denoiser with `progress=True`
-fixed incorrectly parsing warnings from yt-dlp as titles when downloading audio with it
-added parameter, `engine`, to `load_model()` to specify the quantization engine (#341); the first available `engine` is auto-assigned when `engine=None`, since the default engine 'none' can cause issues such as (#338)
-`ts_num` and `ts_noise` are deprecated and will be removed in future versions
-fixed `dfnet` denoiser not setting its model to the specified `device`
-corrected the default value of `word_level` in the docstring of `WhisperResult.adjust_by_silence()` from False to True
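
Based on these notes, align() can presumably be called with the new parameters like this (a sketch; the gap_padding value shown is an assumption, not confirmed as the default):

result = model.align(
    audio, text, language='en',
    original_split=True,
    presplit=True,       # detect speech gaps within and before each segment
    gap_padding=' ...',  # padding text used for gap detection (value assumed)
)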
@jianfch (Owner) commented Apr 14, 2024

`extra_models`, introduced in 5513609, will compute the timestamps from the average of all the `extra_models` and the main model.

import stable_whisper

model = stable_whisper.load_model('base')
extra_models = [stable_whisper.load_model(name) for name in ['base.en', 'small', 'small.en', 'tiny', 'tiny.en']]
result = model.transcribe('audio.wav', extra_models=extra_models)
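
Since `extra_models` was also added to `align()`, presumably the same approach applies to the alignment case from this issue (a sketch, assuming audio and text are already defined):

# Hypothetical: average alignment timings across models for the align() case.
model = stable_whisper.load_model('small.en')
extra_models = [stable_whisper.load_model('medium.en')]
result = model.align(audio, text, language='en', original_split=True,
                     extra_models=extra_models)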

@zxl777 (Author) commented Apr 14, 2024

I ultimately found the cause of the issue: it was a problem with the original text input, which contained extra sentences. When stable-ts performs alignment, it can sometimes correct such errors and sometimes it cannot.

However, as long as the source problem is resolved, the alignment is always correct.
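
One hedged way to catch such input problems before aligning is to transcribe first and compare the texts (the use of difflib and the 0.9 threshold are assumptions, not part of stable-ts):

import difflib

# Hypothetical pre-check: compare the text to be aligned against a quick
# transcription to catch extra or missing sentences in the input.
transcribed = model.transcribe(audio).text
similarity = difflib.SequenceMatcher(
    None, transcribed.lower().split(), text.lower().split()).ratio()
if similarity < 0.9:  # threshold is a guess
    print(f'warning: input text diverges from audio (similarity {similarity:.2f})')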
