noisereduce leads to "ValueError: Expected parameter logits" error #421
Comments
A quick fix would be
Thanks for the quick reply. The following is the link to the audio; it's around 220 MB. Sorry, it only has a 12-hour expiration; if it expires, please let me know.
Should be fixed after 852b39c.
Thanks a lot for the quick reply. It works after testing. By the way, for the audio I provided, I found that
And a slightly better timestamp obtained using a commercial service is:
2. Some sentences appear too early or last too long. For instance,
00:05:36,520 --> 00:05:46,016
This sentence lasts for 10 s in the subtitles; however, in the audio, it only lasts for around 1 s.
The command I use is
Do you have any suggestions on overcoming the above issues? Thanks!
Which model are you using? Try to increase vad_threshold.
Thanks for the response. I tried increasing vad_threshold to 0.5 and also model.refine(result), but the issues persist. At first, I suspected it might be related to the audio, so I switched to a different audio, like the following one
With the exception of
What I suspect is happening is that the regrouping exposes the early timestamps that would normally be hidden once converted to segment-level SRT. I'd suggest disabling regrouping,
Thanks. After some tries, I found that when I add word_timestamps=True, the segment timestamps are a little shorter than with word_timestamps=False, i.e., they start later and end earlier by around 100 ms. This is true for the original whisper and faster-whisper as well. The more accurate word-level timestamps shorten the segment timestamps after time alignment. That may be more accurate from the audio signal's perspective, but it is not very friendly to humans when used for subtitles. So I decided to add several hundred ms to the segment timestamps manually as a workaround. Thank you very much for all the help again. This issue can be closed.
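The manual padding workaround described above can be sketched as follows. This is a hypothetical helper, not part of stable-ts; it assumes segments are dicts with "start" and "end" keys in seconds, and it clamps the padding so neighboring segments don't overlap:

```python
def pad_segments(segments, lead_ms=200, trail_ms=200):
    """Expand each segment's start/end by a fixed margin (in ms),
    clamping against neighbors so padded segments never overlap."""
    padded = []
    for i, seg in enumerate(segments):
        start = seg["start"] - lead_ms / 1000
        end = seg["end"] + trail_ms / 1000
        # do not pad past the previous segment's end or the next one's start
        if i > 0:
            start = max(start, segments[i - 1]["end"])
        if i + 1 < len(segments):
            end = min(end, segments[i + 1]["start"])
        padded.append({**seg, "start": max(start, 0.0), "end": end})
    return padded
```

A fixed 200 ms on each side roughly compensates for the ~100 ms shortening reported above while staying imperceptible for subtitle display.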
When I use 'noisereduce' as the denoiser, as shown below (sample_path points to a 20-minute audio file):
result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser='noisereduce')
the following error is reported:
ValueError Traceback (most recent call last)
Cell In[14], line 5
3 denoiser = 'noisereduce'
4 decode_options = {'language': 'en'}
----> 5 result = model.transcribe(sample_path, suppress_silence=True, vad=True, denoiser=denoiser, language='en') # , condition_on_previous_text=False) # lower_quantile=0.05, lower_threshold=0.1)
6 duration = time.time() - start_time
7 print(f"Duration stable-ts: {duration:.3f} s.")
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/whisper_word_level/original_whisper.py:484, in transcribe_stable(model, audio, verbose, temperature, compression_ratio_threshold, logprob_threshold, no_speech_threshold, condition_on_previous_text, initial_prompt, word_timestamps, regroup, ts_num, ts_noise, suppress_silence, suppress_word_ts, suppress_attention, use_word_position, q_levels, k_size, time_scale, denoiser, denoiser_options, demucs, demucs_options, vad, vad_threshold, vad_onnx, min_word_dur, min_silence_dur, nonspeech_error, only_voice_freq, prepend_punctuations, append_punctuations, stream, mel_first, split_callback, suppress_ts_tokens, gap_padding, only_ffmpeg, max_instant_words, avg_prob_threshold, nonspeech_skip, progress_callback, ignore_compatibility, extra_models, dynamic_heads, **decode_options)
482 detect_language()
483 decode_options["prompt"] = all_tokens[prompt_reset_since:]
--> 484 result: DecodingResult = decode_with_fallback(mel_segment, ts_token_mask=ts_token_mask)
485 tokens = torch.tensor(result.tokens)
487 if no_speech_threshold is not None:
488 # no voice activity check
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/whisper_word_level/original_whisper.py:345, in transcribe_stable.&lt;locals&gt;.decode_with_fallback(seg, ts_token_mask)
342 kwargs.pop("best_of", None)
344 options = DecodingOptions(**kwargs, temperature=t)
--> 345 decode_result, audio_features = decode_stable(model,
346 seg,
347 options,
348 ts_token_mask=ts_token_mask if suppress_ts_tokens else None,
349 audio_features=audio_features)
351 needs_fallback = False
352 if (
353 compression_ratio_threshold is not None
354 and decode_result.compression_ratio > compression_ratio_threshold
355 ):
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.&lt;locals&gt;.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/decode.py:110, in decode_stable(model, mel, options, ts_token_mask, audio_features, **kwargs)
107 options = replace(options, **kwargs)
109 task = DecodingTaskStable(model, options, ts_token_mask=ts_token_mask, audio_features=audio_features)
--> 110 result = task.run(mel)
112 return result[0] if single else result, task.audio_features
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.&lt;locals&gt;.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/whisper/decoding.py:737, in DecodingTask.run(self, mel)
734 tokens = tokens.repeat_interleave(self.n_group, dim=0).to(audio_features.device)
736 # call the main sampling loop
--> 737 tokens, sum_logprobs, no_speech_probs = self._main_loop(audio_features, tokens)
739 # reshape the tensors to have (n_audio, n_group) as the first two dimensions
740 audio_features = audio_features[:: self.n_group]
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/stable_whisper/decode.py:60, in DecodingTaskStable._main_loop(self, audio_features, tokens)
58 logits.nan_to_num(-np.inf)
59 # expand the tokens tensor with the selected next tokens
---> 60 tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)
62 if completed or tokens.shape[-1] > self.n_ctx:
63 break
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/whisper/decoding.py:283, in GreedyDecoder.update(self, tokens, logits, sum_logprobs)
281 next_tokens = logits.argmax(dim=-1)
282 else:
--> 283 next_tokens = Categorical(logits=logits / self.temperature).sample()
285 logprobs = F.log_softmax(logits.float(), dim=-1)
286 current_logprobs = logprobs[torch.arange(logprobs.shape[0]), next_tokens]
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/distributions/categorical.py:70, in Categorical.__init__(self, probs, logits, validate_args)
66 self._num_events = self._param.size()[-1]
67 batch_shape = (
68 self._param.size()[:-1] if self._param.ndimension() > 1 else torch.Size()
69 )
---> 70 super().__init__(batch_shape, validate_args=validate_args)
File /home/ec2-user/SageMaker/efs/conda_envs/whisper/lib/python3.10/site-packages/torch/distributions/distribution.py:68, in Distribution.__init__(self, batch_shape, event_shape, validate_args)
66 valid = constraint.check(value)
67 if not valid.all():
---> 68 raise ValueError(
69 f"Expected parameter {param} "
70 f"({type(value).__name__} of shape {tuple(value.shape)}) "
71 f"of distribution {repr(self)} "
72 f"to satisfy the constraint {repr(constraint)}, "
73 f"but found invalid values:\n{value}"
74 )
75 super().__init__()
ValueError: Expected parameter logits (Tensor of shape (1, 51866)) of distribution Categorical(logits: torch.Size([1, 51866])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0')
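The all-NaN logits in the last line suggest that non-finite values entered the model's forward pass, which can happen when a denoiser emits NaN or Inf samples. As a diagnostic sketch only (the actual fix landed in commit 852b39c; sanitize_audio is a hypothetical helper, not part of stable_whisper), the denoised waveform can be checked and repaired before transcription:

```python
import numpy as np

def sanitize_audio(audio):
    """Return audio with non-finite samples (NaN/Inf) replaced.
    Non-finite values in the waveform propagate through the model
    and can surface as all-NaN logits like those in the traceback."""
    audio = np.asarray(audio, dtype=np.float32)
    if not np.isfinite(audio).all():
        # clamp Inf to the nominal [-1, 1] waveform range, zero out NaN
        audio = np.nan_to_num(audio, nan=0.0, posinf=1.0, neginf=-1.0)
    return audio
```

Running the denoised array through such a check before calling model.transcribe would rule out the denoiser's output as the source of the invalid logits.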