
[FR] Intel GPU support #309

Open
barolo opened this issue Feb 11, 2024 · 11 comments

Comments

barolo commented Feb 11, 2024

I've been testing HF models with the OpenVINO GPU backend [Intel GPUs and CPUs] and they're blazing fast, even on an integrated GPU. I tried to integrate it into hf_whisper.py, but for some reason it defaults to torch CPU. It's possible to detect the OpenVINO device via:

import openvino as ov

core = ov.Core()
options = str(core.available_devices)
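
For example, device selection could fall back to CPU when no GPU plugin is reported (just a sketch; available_devices is a list of plugin names such as ['CPU', 'GPU']):

# Sketch: prefer the OpenVINO GPU plugin when present, otherwise use CPU.
device = 'GPU' if 'GPU' in core.available_devices else 'CPU'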

I've been trying to integrate this (unsuccessfully):

import torch
from pathlib import Path
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
import openvino as ov

model_id = "openai/whisper-small"  # placeholder; any HF Whisper checkpoint
model_path = Path(model_id.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}

processor = AutoProcessor.from_pretrained(model_id)
pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
pt_model.eval();

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )


core = ov.Core()
options=str(core.available_devices)
print("Avaiable devices: "+options)
device = 'GPU'

ov_model.to(device)
ov_model.compile()

ov_model.generation_config = pt_model.generation_config

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
)

Would it be possible? The code looks compatible.

jianfch (Owner) commented Feb 12, 2024

It should work as long as the pipe works on its own. WhisperHF uses non_whisper.transcribe_any(), which is designed to work with any ASR model. You can simply inherit WhisperHF and assign the working pipe to self._pipe like this:

from stable_whisper.whisper_word_level.hf_whisper import WhisperHF
class WhisperHFOV(WhisperHF):
    def __init__(self):
        self._model_name = model_id
        self._pipe = pipe 

Then run the model as usual.

model = WhisperHFOV()
result = model.transcribe('audio.mp3')

barolo (Author) commented Feb 12, 2024

Thanks! That got things going; it gets passed to stable-ts for transcription:

g@void /Dev $ python ./drain.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
Avaiable devices: ['CPU', 'GPU']
Compiling the encoder to GPU ...
Compiling the decoder to GPU ...
Compiling the decoder to GPU ...
device must be of type <class 'str'> but got <class 'torch.device'> instead
Transcribing with Hugging Face Whisper (distil-whisper/distil-small.en)...

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?

jianfch (Owner) commented Feb 12, 2024

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?

Try to use the same arguments as when transcribing without stable-ts.

result = self._pipe(
    audio,
    batch_size=batch_size,
    generate_kwargs=generate_kwargs,
    return_timestamps='word' if word_timestamps else True,
)['chunks']

The batch_size is 24 by default, so try setting a lower value.
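
For example (just a sketch; the right value depends on how much the GPU can handle):

# Sketch: pass a smaller batch_size through transcribe() so it overrides the default of 24.
result = model.transcribe('audio.mp3', batch_size=8)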

barolo (Author) commented Feb 12, 2024

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?

Try to use the same arguments as when transcribing without stable-ts.

result = self._pipe(
    audio,
    batch_size=batch_size,
    generate_kwargs=generate_kwargs,
    return_timestamps='word' if word_timestamps else True,
)['chunks']

The batch_size is 24 by default, so try setting a lower value.

Yeah, I'm using the same. Where would I pass the batch_size?
edit: it's already passed, at 16

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16
)

barolo (Author) commented Feb 12, 2024

I think I've moved forward; now I'm getting this:

 File "/home/g/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 861, in generate
    raise ValueError(
ValueError: Cannot specify `task` or `language` for an English-only model. If the model is intended to be multilingual, pass `is_multilingual=True` to generate, or update the generation config.

After switching to a multilingual model, I get this:

  raise ValueError(
ValueError: Make sure to set `return_segments=True` to return generation outputs as part of the `'segments' key.`

Managed to get further; now it doesn't die but fails at:

  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'

It seems to support token timestamps [from modeling_seq2seq.py]:

    def generate(
        self,
        input_features: Optional[torch.Tensor] = None,
        generation_config=None,
        logits_processor=None,
        stopping_criteria=None,
        prefix_allowed_tokens_fn=None,
        synced_gpus=False,
        return_timestamps=None,
        task=None,
        language=None,
        is_multilingual=None,
        prompt_ids: Optional[torch.Tensor] = None,
        num_segment_frames: Optional[int] = None,
        return_token_timestamps: Optional[bool] = None,
        return_segments: bool = False,
        attention_mask: Optional[torch.Tensor] = None,
        time_precision: int = 0.02,
        return_dict_in_generate: Optional[bool] = None,
        **kwargs,
    )

jianfch (Owner) commented Feb 12, 2024

edit: it's already passed, at 16

batch_size gets reassigned in WhisperHF.transcribe(), where the default is 24.
The English-only model issue should be fixed in c356491.
You can also initialize WhisperHF with the pipeline instead of using the inheritance method, as of c356491.

model = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
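
So, assuming pipe is the OpenVINO pipeline from your script, usage would roughly look like this (sketch):

# Sketch: wrap the existing OpenVINO pipeline, then transcribe and export as usual.
model = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
result = model.transcribe('audio.mp3', batch_size=8)
result.to_srt_vtt('audio.srt')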

it seems to support token timestamps

Since by default, WhisperHF.transcribe() expects word timestamps (which it already handles by passing return_timestamps='word' into the pipeline), I'd suggest not using return_token_timestamps.
What were all the arguments you passed into the pipeline?

barolo (Author) commented Feb 12, 2024

Now it results in:

Traceback (most recent call last):
  File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 57, in <module>
    result = model.transcribe(audio)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/whisper_word_level/hf_whisper.py", line 236, in transcribe
    return transcribe_any(
           ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/non_whisper.py", line 340, in transcribe_any
    result = inference_func(**inference_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/whisper_word_level/hf_whisper.py", line 101, in _inner_transcribe
    if self.model_name.endswith('en'):
       ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute 'endswith'

I'm trying to go as simple as possible:

from transformers import GenerationConfig, WhisperProcessor,WhisperForConditionalGeneration
import whisper
from transformers import pipeline
from pathlib import Path
import numpy as np
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
import openvino as ov

model_id="openai/whisper-small"
d="cpu"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

audio = whisper.load_audio("./4.wav")
i = processor(audio,return_tensors="pt").input_features.to(d)


model_path = Path(model_id.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

core = ov.Core()
options=str(core.available_devices)
print("Avaiable devices: "+options)
device = 'GPU'

ov_model.to(device)
ov_model.compile()


pipe = pipeline(
  "automatic-speech-recognition",
  model=ov_model,
  tokenizer=processor.tokenizer,
  feature_extractor=processor.feature_extractor,
  chunk_length_s=30,
  batch_size=10,
  device=d,
)
from stable_whisper.whisper_word_level.hf_whisper import WhisperHF
class WhisperHFOV(WhisperHF):
    def __init__(self):
        self._model_name = ov_model
        self._pipe = pipe

model = WhisperHFOV()
result = model.transcribe(audio)
#result = pipe(audio)

import json
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)
print(result["text"])

It works in standalone mode [it doesn't output token timestamps though, does that matter?], but it fails when piped through stable-ts.
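
Looking at the traceback again, self.model_name ends up being the _OVModelForWhisper object, so .endswith() fails; presumably _model_name needs to stay the model id string as in the earlier snippet (a guess, not verified):

# Hedged guess: keep _model_name as the model id string, not the OV model object,
# so hf_whisper.py can call .endswith('en') on it.
class WhisperHFOV(WhisperHF):
    def __init__(self):
        self._model_name = model_id  # e.g. "openai/whisper-small"
        self._pipe = pipe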

barolo (Author) commented Feb 12, 2024

Changed the pipeline initialization:

from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig
import whisper
from pathlib import Path
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from transformers import pipeline
import openvino as ov
import json
import stable_whisper

# Load the Whisper model
model_id = "openai/whisper-small"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

# Load audio
audio = whisper.load_audio("./4.wav")
input_features = processor(audio, return_tensors="pt").input_features

# Configure OpenVINO model
ov_config = {"CACHE_DIR": ""}
model_path = Path(model_id.replace('/', '_'))

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

# Choose device
device = 'GPU'  # or 'CPU' if no GPU is available
ov_model.to(device)
ov_model.compile()

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=20,
)

# Initialize the model instance with pipeline
model_instance = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)

# Transcribe the audio
result = model_instance.transcribe(audio)

# Save result to JSON
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)

print(result["text"])

And it seems to be churning through, judging by GPU activity, but at the end it's back to the previous error:

ext__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'

Some problem with optimum?

jianfch (Owner) commented Feb 13, 2024

The token timestamps might not have been implemented for the OV model. You can disable them with word_timestamps=False, but that prevents stable-ts from regrouping the result or adjusting the word timestamps based on the detected non-speech. Note that batch_size needs to be specified in transcribe() for it to be useful, or else it will default to 24.

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30
)

# Initialize the model instance with pipeline
model_instance = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
result = model_instance.transcribe(audio, word_timestamps=False, batch_size=20)

barolo (Author) commented Feb 13, 2024

I think the model is fine; I can see in its config that it has alignment heads for that.
That's the thing: word-level timestamps [for karaoke-style subs] are the reason for doing this and for using stable-ts.
I'll fiddle some more, then I'll report a bug to hf-optimum, since I can't get token-level timestamps even in standalone mode, without piping through stable-ts.

barolo (Author) commented Feb 13, 2024

It works with word_timestamps=False. So it's an upstream bug; I'll report it later.
