[FR] Intel GPU support #309
It should work as long as the pipeline itself works. You can wrap it by subclassing WhisperHF:

    from stable_whisper.whisper_word_level.hf_whisper import WhisperHF

    class WhisperHFOV(WhisperHF):
        def __init__(self):
            self._model_name = model_id
            self._pipe = pipe

Then run the model as usual:

    model = WhisperHFOV()
    result = model.transcribe('audio.mp3')
Thanks! That got things going; it gets passed to stable-ts for transcription:

    g@void /Dev $ python ./drain.py
    Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
    INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
    Avaiable devices: ['CPU', 'GPU']
    Compiling the encoder to GPU ...
    Compiling the decoder to GPU ...
    Compiling the decoder to GPU ...
    device must be of type <class 'str'> but got <class 'torch.device'> instead
    Transcribing with Hugging Face Whisper (distil-whisper/distil-small.en)...

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?
Try to use the same arguments as when transcribing without stable-ts.

stable-ts/stable_whisper/whisper_word_level/hf_whisper.py Lines 97 to 102 in 53272cb

The batch_size is 24 by default, so try setting a lower value.
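For illustration, a minimal sketch of overriding that default through stable-ts, assuming the WhisperHFOV wrapper and pipe from the first comment; the value 8 is an arbitrary lower choice:

    # Sketch: lowering batch_size below the stable-ts default of 24.
    # Assumes the WhisperHFOV wrapper and pipe defined earlier in the thread.
    model = WhisperHFOV()
    result = model.transcribe('audio.mp3', batch_size=8)  # 8 is an arbitrary lower value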
Yeah, I'm using the same. Where would I pass the batch_size?

    pipe = pipeline(
        "automatic-speech-recognition",
        model=ov_model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=15,
        batch_size=16
    )
I think that I've moved forward; now I'm getting this:

      File "/home/g/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 861, in generate
        raise ValueError(
    ValueError: Cannot specify `task` or `language` for an English-only model. If the model is intended to be multilingual, pass `is_multilingual=True` to generate, or update the generation config.

After switching to a multilingual model I get this:

        raise ValueError(
    ValueError: Make sure to set `return_segments=True` to return generation outputs as part of the `'segments' key.`

Managed to get further; now it doesn't die but fails at:

      File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
        outputs["token_timestamps"] = self._extract_token_timestamps(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'

It seems to support token timestamps (from modeling_seq2seq.py):

    def generate(
        self,
        input_features: Optional[torch.Tensor] = None,
        generation_config=None,
        logits_processor=None,
        stopping_criteria=None,
        prefix_allowed_tokens_fn=None,
        synced_gpus=False,
        return_timestamps=None,
        task=None,
        language=None,
        is_multilingual=None,
        prompt_ids: Optional[torch.Tensor] = None,
        num_segment_frames: Optional[int] = None,
        return_token_timestamps: Optional[bool] = None,
        return_segments: bool = False,
        attention_mask: Optional[torch.Tensor] = None,
        time_precision: int = 0.02,
        return_dict_in_generate: Optional[bool] = None,
        **kwargs,
    )
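For context, a hypothetical minimal reproduction of the AttributeError above, reusing the audio, processor, and ov_model objects from the scripts later in this thread; the failing path is the token-timestamp extraction:

    # Hypothetical repro sketch: requesting token timestamps from the OV model
    # directly, which is what stable-ts does for word timestamps.
    features = processor(audio, return_tensors="pt").input_features
    out = ov_model.generate(
        features,
        return_token_timestamps=True,  # hits self._extract_token_timestamps in optimum-intel
        return_segments=True,
    )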
Use load_hf_whisper with the pipeline passed in:

    model = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)

since by default it builds its own pipeline.
Now it results in:

    Traceback (most recent call last):
      File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 57, in <module>
        result = model.transcribe(audio)
                 ^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/whisper_word_level/hf_whisper.py", line 236, in transcribe
        return transcribe_any(
               ^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/non_whisper.py", line 340, in transcribe_any
        result = inference_func(**inference_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/whisper_word_level/hf_whisper.py", line 101, in _inner_transcribe
        if self.model_name.endswith('en'):
           ^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: '_OVModelForWhisper' object has no attribute 'endswith'

I'm trying to go as simple as possible:

    from transformers import GenerationConfig, WhisperProcessor, WhisperForConditionalGeneration
    import whisper
    from transformers import pipeline
    from pathlib import Path
    import numpy as np
    from optimum.intel.openvino import OVModelForSpeechSeq2Seq
    import openvino as ov

    model_id = "openai/whisper-small"
    d = "cpu"
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    processor = WhisperProcessor.from_pretrained(model_id)
    generation_config = GenerationConfig.from_pretrained(model_id)
    audio = whisper.load_audio("./4.wav")
    i = processor(audio, return_tensors="pt").input_features.to(d)

    model_path = Path(model_id.replace('/', '_'))
    ov_config = {"CACHE_DIR": ""}
    if not model_path.exists():
        ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
            model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
        )
        ov_model.half()
        ov_model.save_pretrained(model_path)
    else:
        ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
            model_path, ov_config=ov_config, compile=False
        )
    ov_model.generation_config = generation_config

    core = ov.Core()
    options = str(core.available_devices)
    print("Available devices: " + options)
    device = 'GPU'
    ov_model.to(device)
    ov_model.compile()

    pipe = pipeline(
        "automatic-speech-recognition",
        model=ov_model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        batch_size=10,
        device=d,
    )

    from stable_whisper.whisper_word_level.hf_whisper import WhisperHF

    class WhisperHFOV(WhisperHF):
        def __init__(self):
            self._model_name = ov_model
            self._pipe = pipe

    model = WhisperHFOV()
    result = model.transcribe(audio)
    #result = pipe(audio)

    import json
    with open("sample.json", "w") as outfile:
        json.dump(result, outfile)
    print(result["text"])

It works in standalone mode (it doesn't output token timestamps though, does it matter?) but fails when piped to stable-ts.
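For reference, the endswith failure above comes from _model_name being set to the OV model object instead of a string; a sketch of the wrapper with a string name, mirroring the first suggestion in this thread:

    # Sketch: _model_name should be the model id string, not the OV model object,
    # since stable-ts calls .endswith() on it.
    class WhisperHFOV(WhisperHF):
        def __init__(self):
            self._model_name = model_id  # e.g. "openai/whisper-small"
            self._pipe = pipe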
Changed the pipeline initialization:

    from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig
    import whisper
    from pathlib import Path
    from optimum.intel.openvino import OVModelForSpeechSeq2Seq
    from transformers import pipeline
    import openvino as ov
    import json
    import stable_whisper

    # Load the Whisper model
    model_id = "openai/whisper-small"
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    processor = WhisperProcessor.from_pretrained(model_id)
    generation_config = GenerationConfig.from_pretrained(model_id)

    # Load audio
    audio = whisper.load_audio("./4.wav")
    input_features = processor(audio, return_tensors="pt").input_features

    # Configure OpenVINO model
    ov_config = {"CACHE_DIR": ""}
    model_path = Path(model_id.replace('/', '_'))
    if not model_path.exists():
        ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
            model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
        )
        ov_model.half()
        ov_model.save_pretrained(model_path)
    else:
        ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
            model_path, ov_config=ov_config, compile=False
        )
    ov_model.generation_config = generation_config

    # Choose device
    device = 'GPU'  # Change this to 'GPU' if GPU is preferred
    ov_model.to(device)
    ov_model.compile()

    # Configure pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model=ov_model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30,
        batch_size=20,
    )

    # Initialize the model instance with pipeline
    model_instance = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)

    # Transcribe the audio
    result = model_instance.transcribe(audio)

    # Save result to JSON
    with open("sample.json", "w") as outfile:
        json.dump(result, outfile)
    print(result["text"])

And it seems to be churning through, judging by GPU activity, but at the end it's back to the previous error:

        item = next(self.iterator)
               ^^^^^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
        processed = self.infer(next(self.iterator), **self.params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
        model_outputs = self._forward(model_inputs, **forward_params)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
        tokens = self.model.generate(
                 ^^^^^^^^^^^^^^^^^^^^
      File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
        outputs["token_timestamps"] = self._extract_token_timestamps(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'

Some problem with optimum?
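As a quick, hypothetical sanity check that the method is simply missing on the OpenVINO wrapper rather than being a configuration issue, one could compare it against the torch model loaded earlier in the script:

    # Sketch: _extract_token_timestamps exists on the torch Whisper model but
    # not on optimum-intel's _OVModelForWhisper wrapper.
    print(hasattr(model, "_extract_token_timestamps"))     # expected: True
    print(hasattr(ov_model, "_extract_token_timestamps"))  # expected: False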
The token timestamps might not have been implemented for the OV model. You can disable them with:

    # Configure pipeline
    pipe = pipeline(
        "automatic-speech-recognition",
        model=ov_model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        chunk_length_s=30
    )

    # Initialize the model instance with pipeline
    model_instance = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
    result = model_instance.transcribe(audio, word_timestamps=False, batch_size=20)
I think that the model is fine; I can see in its config that it has alignment heads for that.
It works with word_timestamps=False. So it's an upstream bug; I'll report it later.
I've been testing HF models with the OpenVINO GPU backend (Intel GPUs and CPUs) and they're blazing fast, even on an integrated GPU. I tried to integrate it into hf_whisper.py, but for some reason it defaults to torch CPU. It's possible to detect the OpenVINO device via:
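A minimal sketch of such a device query, mirroring the ov.Core() usage in the scripts above:

    import openvino as ov

    core = ov.Core()
    print(core.available_devices)  # e.g. ['CPU', 'GPU'] when an Intel GPU is visible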
I've been trying to integrate this (unsuccessfully):

Would it be possible? The code looks compatible.