noisereduce leads to "ValueError: Expected parameter logits" error #421

Chevolier opened this issue Dec 5, 2024 · 8 comments

@Chevolier

When I used 'noisereduce' as the denoiser, as shown below, where sample_path points to a roughly 20-minute audio file:
result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser='noisereduce')
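
For reference, a minimal self-contained version of what I'm running (the model size and file name here are placeholders, not necessarily the exact ones I used):

import stable_whisper

model = stable_whisper.load_model('large-v3')  # placeholder model size
sample_path = 'audio_20min.wav'                # placeholder path to the ~20 min audio
result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser='noisereduce')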

The following error is reported:


ValueError Traceback (most recent call last)
Cell In[14], line 5
3 denoiser = 'noisereduce'
4 decode_options = {'language': 'en'}
----> 5 result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser=denoiser, language='en') # , condition_on_previous_text=False) # lower_quantile=0.05, lower_threshold=0.1)
6 duration = time.time() - start_time
7 print(f"Duration stable-ts: {duration:.3f} s.")

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/whisper_word_level/original_whisper.py:484, in transcribe_stable(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, regroup, ts_num, ts_noise, suppress_silence, suppress_word_ts, suppress_attention, use_word_position, q_levels, k_size, time_scale, denoiser, denoiser_options, demucs, demucs_options, vad, vad_threshold, vad_onnx, min_word_dur, min_silence_dur, nonspeech_error, only_voice_freq, prepend_punctuations, append_punctuations, stream, mel_first, split_callback, suppress_ts_tokens, gap_padding, only_ffmpeg, max_instant_words, avg_prob_threshold, nonspeech_skip, progress_callback, ignore_compatibility, extra_models, dynamic_heads, **decode_options)
482 detect_language()
483 decode_options["prompt"] = all_tokens[prompt_reset_since:]
--> 484 result: DecodingResult = decode_with_fallback(mel_segment, ts_token_mask=ts_token_mask)
485 tokens = torch.tensor(result.tokens)
487 if no_speech_threshold is not None:
488 # no voice activity check

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/whisper_word_level/original_whisper.py:345, in transcribe_stable.<locals>.decode_with_fallback(seg, ts_token_mask)
342 kwargs.pop("best_of", None)
344 options = DecodingOptions(**kwargs, temperature=t)
--> 345 decode_result, audio_features = decode_stable(model,
346 seg,
347 options,
348 ts_token_mask=ts_token_mask if suppress_ts_tokens else None,
349 audio_features=audio_features)
351 needs_fallback = False
352 if (
353 compression_ratio_threshold is not None
354 and decode_result.compression_ratio > compression_ratio_threshold
355 ):

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/decode.py:110, in decode_stable(model, mel, options, ts_token_mask, audio_features, **kwargs)
107 options = replace(options, **kwargs)
109 task = DecodingTaskStable(model, options, ts_token_mask=ts_token_mask, audio_features=audio_features)
--> 110 result = task.run(mel)
112 return result[0] if single else result, task.audio_features

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/whisper/decoding.py:737, in DecodingTask.run(self, mel)
734 tokens = tokens.repeat_interleave(self.n_group, dim=0).to(audio_features.device)
736 # call the main sampling loop
--> 737 tokens, sum_logprobs, no_speech_probs = self._main_loop(audio_features, tokens)
739 # reshape the tensors to have (n_audio, n_group) as the first two dimensions
740 audio_features = audio_features[:: self.n_group]

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/decode.py:60, in DecodingTaskStable._main_loop(self, audio_features, tokens)
58 logits.nan_to_num_(-np.inf)
59 # expand the tokens tensor with the selected next tokens
---> 60 tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)
62 if completed or tokens.shape[-1] > self.n_ctx:
63 break

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/whisper/decoding.py:283, in GreedyDecoder.update(self, tokens, logits, sum_logprobs)
281 next_tokens = logits.argmax(dim=-1)
282 else:
--> 283 next_tokens = Categorical(logits=logits / self.temperature).sample()
285 logprobs = F.log_softmax(logits.float(), dim=-1)
286 current_logprobs = logprobs[torch.arange(logprobs.shape[0]), next_tokens]

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/distributions/categorical.py:70, in Categorical.__init__(self, probs, logits, validate_args)
66 self._num_events = self._param.size()[-1]
67 batch_shape = (
68 self._param.size()[:-1] if self._param.ndimension() > 1 else torch.Size()
69 )
---> 70 super().__init__(batch_shape, validate_args=validate_args)

File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/distributions/distribution.py:68, in Distribution.__init__(self, batch_shape, event_shape, validate_args)
66 valid = constraint.check(value)
67 if not valid.all():
---> 68 raise ValueError(
69 f"Expected parameter {param} "
70 f"({type(value).__name__} of shape {tuple(value.shape)}) "
71 f"of distribution {repr(self)} "
72 f"to satisfy the constraint {repr(constraint)}, "
73 f"but found invalid values:\n{value}"
74 )
75 super().__init__()

ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')

@jianfch
Owner

jianfch commented Dec 6, 2024

A quick fix would be model.transcribe(..., temperature=0), but this may affect the quality of the outputs.
Are you able to share the audio that causes this issue?
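
Concretely, something like this (just a sketch; per the traceback, temperature=0 makes GreedyDecoder take the argmax branch instead of sampling from Categorical, which is where the nan logits raise the error):

result = model.transcribe(
    sample_path,
    suppress_silence=True,
    vad=True,
    denoiser='noisereduce',
    temperature=0  # greedy decoding; avoids sampling from the nan logits
)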

@Chevolier
Author

Thanks for the quick reply. The following is the link to the audio; it's around 220 MB.

Audio

Sorry, the link only has a 12-hour expiration; if it expires, please let me know.

jianfch added a commit that referenced this issue Dec 6, 2024
-fixed  `denoiser='noisereduce'` producing audio tensor with nan value causing #421
@jianfch
Owner

jianfch commented Dec 6, 2024

Should be fixed after 852b39c.
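
If anyone hits this before the next release, the commit should be picked up by installing straight from the repository (adjust for your own environment):

pip install -U git+https://github.com/jianfch/stable-ts.git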

@Chevolier
Author

Thanks a lot for the quick reply. It works after testing.

By the way, for the audio I provided, I found that:
1. The segment timestamps seem to be a bit early, by around 300~400 ms.
For instance, the timestamp obtained using stable-ts is
00:04:24,370 --> 00:04:26,220
Let me think about it a little bit

while a slightly better timestamp obtained from a commercial service is:
00:04:24,769 --> 00:04:26,510
Let me think about it a little bit.

2. Some sentences appear too early or last too long. For instance,

00:05:36,520 --> 00:05:46,016
I miss Matthew

This subtitle lasts about 10 s, but in the audio the sentence only lasts around 1 s.

The command I use is
result = model.transcribe(
sample_path,
suppress_silence=True,
vad=True,
vad_threshold=0.35
)

Do you have any suggestions on overcoming the above issues? Thanks!

@jianfch
Owner

jianfch commented Dec 6, 2024

Which model are you using?

Try to increase vad_threshold.
model.refine(result) might also help.
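
For example (a rough sketch; the threshold value is just a starting point to tune against your audio):

result = model.transcribe(
    sample_path,
    suppress_silence=True,
    vad=True,
    vad_threshold=0.5  # higher than 0.35; more low-confidence audio is treated as non-speech
)
model.refine(result)  # refines the word timestamps in place
result.to_srt_vtt('output.srt')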

@Chevolier
Author

Thanks for the response. I tried increasing vad_threshold to 0.5 and also model.refine(result), but the issues still exist. At first I suspected it was related to that particular audio, so I switched to a different one:
https://www.youtube.com/watch?v=0O2Rq4HJBxw
It is not a movie but an open course, so it does not contain much background music. I generated the SRT file using Whisper's original transcribe, the pipeline, and faster-whisper. Then for stable-ts, I tried the original Whisper and faster-whisper backends. After comparing them, I still found that the stable-ts timestamps are slightly earlier, by around 300~400 ms, than those from the original Whisper, the pipeline, or faster-whisper. Does this have anything to do with the silence handling part?

@jianfch
Owner

jianfch commented Dec 7, 2024

Does this have anything to do with the silence handling part?

With the exception of result.adjust_gaps(), all the changes made to the timestamps will only make a segment's start later and its end earlier. So the default silence handling should not cause timestamps to be earlier than the default Whisper outputs.

What I suspect is happening is that the regrouping exposes the early timestamps that would normally be hidden once the result is converted to segment-level SRT.
For example:
0.0 -> 2.0 : " This is a test."
This segment has nearly perfect timing. But it can be regrouped into:
0.0 -> 0.7 : " This is"
0.7 -> 2.0 : " a test."
Now the second segment, " a test.", starts earlier than it should because each word uses the end of the previous word as its own start (stable-ts tries to eliminate this by adding the gaps from the silence detection in between the words).
Since Whisper or Faster-Whisper outputs on their own are not regrouped, these imperfect timestamps would not be exposed.

I'd suggest disabling regrouping: model.transcribe(regroup='cm').
Or try the latest commit.
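
In code form, the regrouping suggestion would look roughly like this (a sketch, reusing sample_path from your earlier snippet):

# apply only the 'cm' step instead of the full default regrouping chain
result = model.transcribe(sample_path, suppress_silence=True, vad=True, regroup='cm')

# or skip regrouping entirely and keep Whisper's own segmentation
result = model.transcribe(sample_path, suppress_silence=True, vad=True, regroup=False)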

@Chevolier
Author

Thanks. After some more tries, I found that when I add word_timestamps=True, the segment timestamps are a bit shorter than with word_timestamps=False, i.e., they start later and end earlier by around ~100 ms. This is true for the original Whisper and faster-whisper as well. The more accurate word-level timestamps shorten the segment timestamps after time alignment. That may be more accurate from the audio signal's perspective, but it is not very friendly to humans when used for subtitles. So I decided to add several hundred ms to the segment timestamps manually as a workaround. Thank you very much for all the help again. This issue can be closed.
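
In case it helps others, here is roughly what I mean by adding several hundred ms manually (a rough sketch; the file names and the 300 ms padding are arbitrary, and it simply shifts every timestamp in an already-exported SRT file later by a fixed amount):

import re

PAD_MS = 300  # arbitrary padding added to every timestamp
TS = re.compile(r'(\d{2}):(\d{2}):(\d{2}),(\d{3})')

def shift(match):
    # parse HH:MM:SS,mmm, add the padding, and format it back
    h, m, s, ms = (int(g) for g in match.groups())
    total_ms = ((h * 60 + m) * 60 + s) * 1000 + ms + PAD_MS
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

with open('output.srt', encoding='utf-8') as f:
    srt_text = f.read()

with open('output_shifted.srt', 'w', encoding='utf-8') as f:
    f.write(TS.sub(shift, srt_text))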
