noisereduce leads to "ValueError: Expected parameter logits" error #421

Chevolier opened this issue Dec 5, 2024 · 8 comments

@Chevolier

When I used 'noisereduce' as the denoiser, as shown below, where sample_path points to a roughly 20-minute audio file:
result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser='noisereduce')
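
For reference, a minimal self-contained version of what I'm running (the model size and file name here are placeholders, not necessarily the exact ones I used):

import stable_whisper

model = stable_whisper.load_model('large-v3')  # placeholder model size
sample_path = 'audio_20min.wav'                # placeholder path to the ~20 min audio
result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser='noisereduce')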

The following error is reported:


ValueError Traceback (most recent call last)
Cell In[14], line 5
3 denoiser = 'noisereduce'
4 decode_options = {'language': 'en'}
----> 5 result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser=denoiser, language='en') # , condition_on_previous_text=False) # lower_quantile=0.05, lower_threshold=0.1)
6 duration = time.time() - start_time
7 print(f"Duration stable-ts: {duration:.3f} s.")

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/whisper_word_level/original_whisper.py:484, in transcribe_stable(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, regroup, ts_num, ts_noise, suppress_silence, suppress_word_ts, suppress_attention, use_word_position, q_levels, k_size, time_scale, denoiser, denoiser_options, demucs, demucs_options, vad, vad_threshold, vad_onnx, min_word_dur, min_silence_dur, nonspeech_error, only_voice_freq, prepend_punctuations, append_punctuations, stream, mel_first, split_callback, suppress_ts_tokens, gap_padding, only_ffmpeg, max_instant_words, avg_prob_threshold, nonspeech_skip, progress_callback, ignore_compatibility, extra_models, dynamic_heads, **decode_options)
482 detect_language()
483 decode_options["prompt"] = all_tokens[prompt_reset_since:]
--> 484 result: DecodingResult = decode_with_fallback(mel_segment, ts_token_mask=ts_token_mask)
485 tokens = torch.tensor(result.tokens)
487 if no_speech_threshold is not None:
488 # no voice activity check

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/whisper_word_level/original_whisper.py:345, in transcribe_stable.<locals>.decode_with_fallback(seg, ts_token_mask)
342 kwargs.pop("best_of", None)
344 options = DecodingOptions(**kwargs, temperature=t)
--> 345 decode_result, audio_features = decode_stable(model,
346 seg,
347 options,
348 ts_token_mask=ts_token_mask if suppress_ts_tokens else None,
349 audio_features=audio_features)
351 needs_fallback = False
352 if (
353 compression_ratio_threshold is not None
354 and decode_result.compression_ratio > compression_ratio_threshold
355 ):

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/decode.py:110, in decode_stable(model, mel, options, ts_token_mask, audio_features, **kwargs)
107 options = replace(options, **kwargs)
109 task = DecodingTaskStable(model, options, ts_token_mask=ts_token_mask, audio_features=audio_features)
--> 110 result = task.run(mel)
112 return result[0] if single else result, task.audio_features

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/whisper/decoding.py:737, in DecodingTask.run(self, mel)
734 tokens = tokens.repeat_interleave(self.n_group, dim=0).to(audio_features.device)
736 # call the main sampling loop
--> 737 tokens, sum_logprobs, no_speech_probs = self._main_loop(audio_features, tokens)
739 # reshape the tensors to have (n_audio, n_group) as the first two dimensions
740 audio_features = audio_features[:: self.n_group]

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/decode.py:60, in DecodingTaskStable._main_loop(self, audio_features, tokens)
58 logits.nan_to_num_(-np.inf)
59 # expand the tokens tensor with the selected next tokens
---> 60 tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)
62 if completed or tokens.shape[-1] > self.n_ctx:
63 break

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/whisper/decoding.py:283, in GreedyDecoder.update(self, tokens, logits, sum_logprobs)
281 next_tokens = logits.argmax(dim=-1)
282 else:
--> 283 next_tokens = Categorical(logits=logits / self.temperature).sample()
285 logprobs = F.log_softmax(logits.float(), dim=-1)
286 current_logprobs = logprobs[torch.arange(logprobs.shape[0]), next_tokens]

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/distributions/categorical.py:70, in Categorical.__init__(self, probs, logits, validate_args)
66 self._num_events = self._param.size()[-1]
67 batch_shape = (
68 self._param.size()[:-1] if self._param.ndimension() > 1 else torch.Size()
69 )
---> 70 super().__init__(batch_shape, validate_args=validate_args)

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/distributions/distribution.py:68, in Distribution.__init__(self, batch_shape, event_shape, validate_args)
66 valid = constraint.check(value)
67 if not valid.all():
---> 68 raise ValueError(
69 f"Expected parameter {param} "
70 f"({type(value).__name__} of shape {tuple(value.shape)}) "
71 f"of distribution {repr(self)} "
72 f"to satisfy the constraint {repr(constraint)}, "
73 f"but found invalid values:\n{value}"
74 )
75 super().__init__()

ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

@jianfch
Owner

jianfch commented Dec 6, 2024

A quick fix would be model.transcribe(..., temperature=0), but this may affect the quality of the outputs.
Are you able to share the audio that causes this issue?
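
Concretely, something like this (just a sketch; per the traceback, temperature=0 makes GreedyDecoder take the argmax branch instead of sampling from Categorical, which is where the nan logits raise the error):

result = model.transcribe(
    sample_path,
    suppress_silence=True,
    vad=True,
    denoiser='noisereduce',
    temperature=0  # greedy decoding; avoids sampling from the nan logits
)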

@Chevolier
Author

Thanks for the quick reply. The following is the link to the audio; it's around 220 MB.

Audio

Sorry, the link only has a 12-hour expiration; if it expires, please let me know.

jianfch added a commit that referenced this issue Dec 6, 2024
-fixed  `denoiser='noisereduce'` producing audio tensor with nan value causing #421
@jianfch
Owner

jianfch commented Dec 6, 2024

Should be fixed after 852b39c.
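
If anyone hits this before the next release, the commit should be picked up by installing straight from the repository (adjust for your own environment):

pip install -U git+https://github.com/jianfch/stable-ts.git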

@Chevolier
Author

Thanks a lot for the quick reply. It works after testing.

By the way, for the audio I provided, I found that:
1. The segment timestamps seem to be a bit early, by around 300~400 ms.
For instance, the timestamp obtained using stable-ts is
00:04:24,370 --> 00:04:26,220
Let me think about it a little bit

while a slightly better timestamp obtained from a commercial service is:
00:04:24,769 --> 00:04:26,510
Let me think about it a little bit.

2. Some sentences appear too early or last too long. For instance,

00:05:36,520 --> 00:05:46,016
I miss Matthew

This subtitle lasts about 10 s, but in the audio the sentence only lasts around 1 s.

The command I use is
result = model.transcribe(
sample_path,
suppress_silence=True,
vad=True,
vad_threshold=0.35
)

Do you have any suggestions on overcoming the above issues? Thanks!

@jianfch
Owner

jianfch commented Dec 6, 2024

Which model are you using?

Try to increase vad_threshold.
model.refine(result) might also help.
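
For example (a rough sketch; the threshold value is just a starting point to tune against your audio):

result = model.transcribe(
    sample_path,
    suppress_silence=True,
    vad=True,
    vad_threshold=0.5  # higher than 0.35; more low-confidence audio is treated as non-speech
)
model.refine(result)  # refines the word timestamps in place
result.to_srt_vtt('output.srt')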

@Chevolier
Author

Thanks for the response. I tried increasing vad_threshold to 0.5 and also model.refine(result), but the issues still exist. At first I suspected it was related to that particular audio, so I switched to a different one:
https://www.youtube.com/watch?v=0O2Rq4HJBxw
It is not a movie but an open course, so it does not contain much background music. I generated the SRT file using Whisper's original transcribe, the pipeline, and faster-whisper. Then for stable-ts, I tried the original Whisper and faster-whisper backends. After comparing them, I still found that the stable-ts timestamps are slightly earlier, by around 300~400 ms, than those from the original Whisper, the pipeline, or faster-whisper. Does this have anything to do with the silence handling part?

@jianfch
Owner

jianfch commented Dec 7, 2024

Does this have anything to do with the silence handling part?

With the exception of result.adjust_gaps(), all the changes made to the timestamps will only make a segment's start later and its end earlier. So the default silence handling should not cause timestamps to be earlier than the default Whisper outputs.

What I suspect is happening is that the regrouping exposes the early timestamps that would normally be hidden once the result is converted to segment-level SRT.
For example:
0.0 -> 2.0 : " This is a test."
This segment has nearly perfect timing. But it can be regrouped into:
0.0 -> 0.7 : " This is"
0.7 -> 2.0 : " a test."
Now the second segment, " a test.", starts earlier than it should because each word uses the end of the previous word as its own start (stable-ts tries to eliminate this by adding the gaps from the silence detection in between the words).
Since Whisper or Faster-Whisper outputs on their own are not regrouped, these imperfect timestamps would not be exposed.

I'd suggest disabling regrouping: model.transcribe(regroup='cm').
Or try the latest commit.
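
In code form, the regrouping suggestion would look roughly like this (a sketch, reusing sample_path from your earlier snippet):

# apply only the 'cm' step instead of the full default regrouping chain
result = model.transcribe(sample_path, suppress_silence=True, vad=True, regroup='cm')

# or skip regrouping entirely and keep Whisper's own segmentation
result = model.transcribe(sample_path, suppress_silence=True, vad=True, regroup=False)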

@Chevolier
Author

Thanks. After some more tries, I found that when I add word_timestamps=True, the segment timestamps are a bit shorter than with word_timestamps=False, i.e., they start later and end earlier by around ~100 ms. This is true for the original Whisper and faster-whisper as well. The more accurate word-level timestamps shorten the segment timestamps after time alignment. That may be more accurate from the audio signal's perspective, but it is not very friendly to humans when used for subtitles. So I decided to add several hundred ms to the segment timestamps manually as a workaround. Thank you very much for all the help again. This issue can be closed.
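
In case it helps others, here is roughly what I mean by adding several hundred ms manually (a rough sketch; the file names and the 300 ms padding are arbitrary, and it simply shifts every timestamp in an already-exported SRT file later by a fixed amount):

import re

PAD_MS = 300  # arbitrary padding added to every timestamp
TS = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3})')

def shift(match):
    # parse HH:MM:SS,mmm, add the padding, and format it back
    h, m, s, ms = (int(g) for g in match.groups())
    total_ms = ((h * 60 + m) * 60 + s) * 1000 + ms + PAD_MS
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

with open('output.srt', encoding='utf-8') as f:
    srt_text = f.read()

with open('output_shifted.srt', 'w', encoding='utf-8') as f:
    f.write(TS.sub(shift, srt_text))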
