
[FR] Intel GPU support #309

Open
barolo opened this issue Feb 11, 2024 · 11 comments

Comments

barolo commented Feb 11, 2024

I've been testing HF models with the OpenVINO GPU backend [Intel GPUs and CPUs] and they're blazing fast, even on an integrated GPU. I tried to integrate it into hf_whisper.py, but for some reason it defaults to torch CPU. It's possible to detect the OpenVINO device via:

import openvino as ov

core = ov.Core()
options = str(core.available_devices)
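
For example, device selection could fall back to CPU when no GPU plugin is reported (just a sketch; available_devices is a list of plugin names such as ['CPU', 'GPU']):

# Sketch: prefer the OpenVINO GPU plugin when present, otherwise use CPU.
device = 'GPU' if 'GPU' in core.available_devices else 'CPU'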

I've been trying to integrate this (unsuccessfully):

import torch
from pathlib import Path
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
import openvino as ov

model_id = "openai/whisper-small"  # placeholder; any HF Whisper checkpoint
model_path = Path(model_id.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}

processor = AutoProcessor.from_pretrained(model_id)
pt_model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
pt_model.eval();

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )


core = ov.Core()
options=str(core.available_devices)
print("Avaiable devices: "+options)
device = 'GPU'

ov_model.to(device)
ov_model.compile()

ov_model.generation_config = pt_model.generation_config

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
)

Would it be possible? The code looks compatible.

jianfch (Owner) commented Feb 12, 2024

It should work as long as the pipe works on its own. WhisperHF uses non_whisper.transcribe_any(), which is designed to work with any ASR model. You can simply inherit WhisperHF and assign the working pipe to self._pipe like this:

from stable_whisper.whisper_word_level.hf_whisper import WhisperHF
class WhisperHFOV(WhisperHF):
    def __init__(self):
        self._model_name = model_id
        self._pipe = pipe 

Then run the model as usual.

model = WhisperHFOV()
result = model.transcribe('audio.mp3')

barolo (Author) commented Feb 12, 2024

Thanks! That got things going; it gets passed to stable-ts for transcription:

g@void /Dev $ python ./drain.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
Avaiable devices: ['CPU', 'GPU']
Compiling the encoder to GPU ...
Compiling the decoder to GPU ...
Compiling the decoder to GPU ...
device must be of type <class 'str'> but got <class 'torch.device'> instead
Transcribing with Hugging Face Whisper (distil-whisper/distil-small.en)...

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?

jianfch (Owner) commented Feb 12, 2024

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?

Try to use the same arguments as when transcribing without stable-ts.

result = self._pipe(
    audio,
    batch_size=batch_size,
    generate_kwargs=generate_kwargs,
    return_timestamps='word' if word_timestamps else True,
)['chunks']

The batch_size is 24 by default, so try setting a lower value.
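
For example (just a sketch; the right value depends on how much the GPU can handle):

# Sketch: pass a smaller batch_size through transcribe() so it overrides the default of 24.
result = model.transcribe('audio.mp3', batch_size=8)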

barolo (Author) commented Feb 12, 2024

Unfortunately, after 10 seconds or so all GPU processes die. I have no idea why; it doesn't happen when transcribing without stable-ts. Any ideas?

Try to use the same arguments as when transcribing without stable-ts.

result = self._pipe(
    audio,
    batch_size=batch_size,
    generate_kwargs=generate_kwargs,
    return_timestamps='word' if word_timestamps else True,
)['chunks']

The batch_size is 24 by default, so try setting a lower value.

Yeah, I'm using the same. Where would I pass the batch_size?
edit: it's already passed, at 16

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16
)

barolo (Author) commented Feb 12, 2024

I think I've moved forward; now I'm getting this:

 File "/home/g/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 861, in generate
    raise ValueError(
ValueError: Cannot specify `task` or `language` for an English-only model. If the model is intended to be multilingual, pass `is_multilingual=True` to generate, or update the generation config.

After switching to a multilingual model, I get this:

  raise ValueError(
ValueError: Make sure to set `return_segments=True` to return generation outputs as part of the `'segments' key.`

Managed to get further; now it doesn't die but fails at:

  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'

It seems to support token timestamps [from modeling_seq2seq.py]:

    def generate(
        self,
        input_features: Optional[torch.Tensor] = None,
        generation_config=None,
        logits_processor=None,
        stopping_criteria=None,
        prefix_allowed_tokens_fn=None,
        synced_gpus=False,
        return_timestamps=None,
        task=None,
        language=None,
        is_multilingual=None,
        prompt_ids: Optional[torch.Tensor] = None,
        num_segment_frames: Optional[int] = None,
        return_token_timestamps: Optional[bool] = None,
        return_segments: bool = False,
        attention_mask: Optional[torch.Tensor] = None,
        time_precision: int = 0.02,
        return_dict_in_generate: Optional[bool] = None,
        **kwargs,
    )

jianfch (Owner) commented Feb 12, 2024

edit: it's already passed, at 16

batch_size gets reassigned in WhisperHF.transcribe(), where the default is 24.
The English-only model issue should be fixed in c356491.
You can also initialize WhisperHF with the pipeline instead of using the inheritance method, as of c356491.

model = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
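
So, assuming pipe is the OpenVINO pipeline from your script, usage would roughly look like this (sketch):

# Sketch: wrap the existing OpenVINO pipeline, then transcribe and export as usual.
model = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
result = model.transcribe('audio.mp3', batch_size=8)
result.to_srt_vtt('audio.srt')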

it seems to support token timestamps

Since by default, WhisperHF.transcribe() expects word timestamps (which it already handles by passing return_timestamps='word' into the pipeline), I'd suggest not using return_token_timestamps.
What were all the arguments you passed into the pipeline?

barolo (Author) commented Feb 12, 2024

Now it results in:

Traceback (most recent call last):
  File "/run/media/greggy/1a4fd6d7-1f9d-42c6-9324-661804695013/D/owisp/./n2.py", line 57, in <module>
    result = model.transcribe(audio)
             ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/whisper_word_level/hf_whisper.py", line 236, in transcribe
    return transcribe_any(
           ^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/non_whisper.py", line 340, in transcribe_any
    result = inference_func(**inference_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/stable_whisper/whisper_word_level/hf_whisper.py", line 101, in _inner_transcribe
    if self.model_name.endswith('en'):
       ^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute 'endswith'

I'm trying to go as simple as possible:

from transformers import GenerationConfig, WhisperProcessor,WhisperForConditionalGeneration
import whisper
from transformers import pipeline
from pathlib import Path
import numpy as np
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
import openvino as ov

model_id="openai/whisper-small"
d="cpu"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

audio = whisper.load_audio("./4.wav")
i = processor(audio,return_tensors="pt").input_features.to(d)


model_path = Path(model_id.replace('/', '_'))
ov_config = {"CACHE_DIR": ""}

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

core = ov.Core()
options=str(core.available_devices)
print("Avaiable devices: "+options)
device = 'GPU'

ov_model.to(device)
ov_model.compile()


pipe = pipeline(
  "automatic-speech-recognition",
  model=ov_model,
  tokenizer=processor.tokenizer,
  feature_extractor=processor.feature_extractor,
  chunk_length_s=30,
  batch_size=10,
  device=d,
)
from stable_whisper.whisper_word_level.hf_whisper import WhisperHF
class WhisperHFOV(WhisperHF):
    def __init__(self):
        self._model_name = ov_model
        self._pipe = pipe

model = WhisperHFOV()
result = model.transcribe(audio)
#result = pipe(audio)

import json
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)
print(result["text"])

It works in standalone mode [it doesn't output token timestamps though, does that matter?], but it fails when piped through stable-ts.
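
Looking at the traceback again, self.model_name ends up being the _OVModelForWhisper object, so .endswith() fails; presumably _model_name needs to stay the model id string as in the earlier snippet (a guess, not verified):

# Hedged guess: keep _model_name as the model id string, not the OV model object,
# so hf_whisper.py can call .endswith('en') on it.
class WhisperHFOV(WhisperHF):
    def __init__(self):
        self._model_name = model_id  # e.g. "openai/whisper-small"
        self._pipe = pipe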

barolo (Author) commented Feb 12, 2024

Changed the pipeline initialization:

from transformers import WhisperProcessor, WhisperForConditionalGeneration, GenerationConfig
import whisper
from pathlib import Path
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from transformers import pipeline
import openvino as ov
import json
import stable_whisper

# Load the Whisper model
model_id = "openai/whisper-small"
model = WhisperForConditionalGeneration.from_pretrained(model_id)
processor = WhisperProcessor.from_pretrained(model_id)
generation_config = GenerationConfig.from_pretrained(model_id)

# Load audio
audio = whisper.load_audio("./4.wav")
input_features = processor(audio, return_tensors="pt").input_features

# Configure OpenVINO model
ov_config = {"CACHE_DIR": ""}
model_path = Path(model_id.replace('/', '_'))

if not model_path.exists():
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_id, ov_config=ov_config, export=True, compile=False, load_in_8bit=False
    )
    ov_model.half()
    ov_model.save_pretrained(model_path)
else:
    ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
        model_path, ov_config=ov_config, compile=False
    )

ov_model.generation_config = generation_config

# Choose device
device = 'GPU'  # or 'CPU' if no GPU is available
ov_model.to(device)
ov_model.compile()

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=20,
)

# Initialize the model instance with pipeline
model_instance = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)

# Transcribe the audio
result = model_instance.transcribe(audio)

# Save result to JSON
with open("sample.json", "w") as outfile:
    json.dump(result, outfile)

print(result["text"])

And it seems to be churning through, judging by GPU activity, but at the end it's back to the previous error:

ext__
    item = next(self.iterator)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/pt_utils.py", line 266, in __next__
    processed = self.infer(next(self.iterator), **self.params)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/base.py", line 1068, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/transformers/pipelines/automatic_speech_recognition.py", line 507, in _forward
    tokens = self.model.generate(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/greggy/.local/lib/python3.11/site-packages/optimum/intel/openvino/modeling_seq2seq.py", line 1018, in generate
    outputs["token_timestamps"] = self._extract_token_timestamps(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: '_OVModelForWhisper' object has no attribute '_extract_token_timestamps'

Some problem with optimum?

jianfch (Owner) commented Feb 13, 2024

The token timestamps might not have been implemented for the OV model. You can disable them with word_timestamps=False, but that prevents stable-ts from regrouping the result or adjusting the word timestamps based on the detected non-speech. Note that batch_size needs to be specified in transcribe() for it to be useful, or else it will default to 24.

# Configure pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30
)

# Initialize the model instance with pipeline
model_instance = stable_whisper.load_hf_whisper(model_id, pipeline=pipe)
result = model_instance.transcribe(audio, word_timestamps=False, batch_size=20)

barolo (Author) commented Feb 13, 2024

I think the model is fine; I can see in its config that it has alignment heads for that.
That's the thing: word-level timestamps [for karaoke-style subs] are the reason for doing this and for using stable-ts.
I'll fiddle some more, then I'll report a bug to hf-optimum, since I can't get token-level timestamps even in standalone mode, without piping through stable-ts.

barolo (Author) commented Feb 13, 2024

It works with word_timestamps=False. So it's an upstream bug; I'll report it later.
