
Huggingface's Fine Tuned model that can be used? #378

Open
Patrick10731 opened this issue Jul 6, 2024 · 10 comments

@Patrick10731

Patrick10731 commented Jul 6, 2024

I tried to use distil-whisper-v3 in stable-ts and it works.
However, it doesn't work when I try "distil-large-v2".
Other models can't be used either (e.g. kotoba-whisper, "kotoba-tech/kotoba-whisper-v1.0").
What kinds of models can be used in stable-ts besides OpenAI's models?

import stable_whisper

model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v3', device='cpu')
result = model.transcribe('audio.mp3')

result.to_srt_vtt('audio.srt', word_level=False)

@jianfch
Owner

jianfch commented Jul 6, 2024

Models with preconfigured alignment heads, or ones compatible with the original heads, will work.
For the ones compatible with the original heads, you can configure them manually by assigning the head indices to model._pipe.model.generation_config.alignment_heads.

Technically, even models without alignment heads, such as distil-large-v2, will work if you disable word timestamps with model.transcribe('audio.mp3', word_timestamps=False). However, many features, such as regrouping and word-level timestamp adjustment, will be unavailable.
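
For example, a minimal sketch of both approaches (the head indices in the commented-out line are placeholders for illustration, not real values for distil-large-v2):

import stable_whisper

model = stable_whisper.load_hf_whisper('distil-whisper/distil-large-v2', device='cpu')

# Option 1: assign alignment heads manually so word-level timestamps keep working.
# Each entry is a [decoder_layer, attention_head] pair; the correct indices depend on the model.
# model._pipe.model.generation_config.alignment_heads = [[1, 0], [1, 2]]

# Option 2: disable word timestamps and settle for segment-level timestamps only.
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)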

@dgoryeo

dgoryeo commented Sep 27, 2024

Hi @Patrick10731, did you get any of the kotoba-whisper models to work with stable-ts? I am trying their kotoba-tech/kotoba-whisper-v2.1 model, but I keep getting an out-of-memory error.

@jianfch, I'm not sure if you have already come across the kotoba-tech models on Huggingface. Their latest model uses stable-ts for accurate timestamps and regrouping. I thought you might be interested.

@Patrick10731
Author

@jianfch
Thanks, it worked.

@dgoryeo
I confirmed that this code works; give it a try.


import stable_whisper

model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v1.1', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)

I also found that many models that won't work directly will still work if you convert them into faster-whisper format.

For example, this model won't work:

import stable_whisper

model = stable_whisper.load_hf_whisper('Scrya/whisper-large-v2-cantonese', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)

But the following code will work:

import stable_whisper

model = stable_whisper.load_faster_whisper('XA9/faster-whisper-large-v2-cantonese-2', device='cpu', compute_type='default')
result = model.transcribe_stable('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)

The converted model is from here (https://huggingface.co/XA9/faster-whisper-large-v2-cantonese-2),
and it was converted with the following command:

ct2-transformers-converter --model Scrya/whisper-large-v2-cantonese --output_dir faster-whisper-large-v2-cantonese-2 --copy_files preprocessor_config.json --quantization float16

So I recommend trying to convert a model to faster-whisper format if it doesn't work directly.
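
For reference, a sketch of loading the locally converted directory produced by the command above (assuming it sits in the working directory):

import stable_whisper

# 'faster-whisper-large-v2-cantonese-2' is the local --output_dir from the conversion step
model = stable_whisper.load_faster_whisper('faster-whisper-large-v2-cantonese-2', device='cpu', compute_type='default')
result = model.transcribe_stable('audio.mp3')
result.to_srt_vtt('audio.srt', word_level=False)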

@dgoryeo

dgoryeo commented Sep 28, 2024

Thank you @Patrick10731. By any chance, have you tried Kotoba's v2.1 (which is a distilled Whisper)?

I will try to follow your recommendation. At the moment I am running out of memory with v2.1, but I haven't tried CPU only; I've only tried device='cuda' so far.

@Patrick10731
Author

Patrick10731 commented Sep 29, 2024

@dgoryeo
I tried with this code and it worked.
How about trying device='cpu'?
The out-of-memory error is probably because your video card doesn't have enough memory.

import stable_whisper

model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cpu')
result = model.transcribe('audio.mp3', word_timestamps=False)

result.to_srt_vtt('audio.srt', word_level=False)

@dgoryeo

dgoryeo commented Sep 29, 2024

Thanks @Patrick10731, I will test it on CPU. I have 12GB of GPU VRAM, so I didn't expect to run out of memory. I'll test and report back.

@jianfch
Owner

jianfch commented Sep 29, 2024

@dgoryeo 12GB might be too low for the default batch_size=24. Try a smaller batch_size.
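
A rough sketch of that, assuming batch_size can be passed to transcribe() (which is what the default of 24 refers to):

import stable_whisper

model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda')
# lower batch_size from the default of 24 to reduce VRAM usage on a 12GB card
result = model.transcribe('audio.mp3', word_timestamps=False, batch_size=8)
result.to_srt_vtt('audio.srt', word_level=False)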

@dgoryeo

dgoryeo commented Sep 29, 2024

@jianfch, that must be it. I'll change the batch_size accordingly.

When I use the model directly with transformers, I use batch_size 16 with no problem:

from transformers import pipeline

# model_id, torch_dtype, device and model_kwargs are defined earlier in my script
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    stable_ts=True,
    punctuator=True
)

Thanks

@jianfch
Owner

jianfch commented Sep 30, 2024

@dgoryeo You can pass this pipe directly to the pipeline parameter of stable_whisper.load_hf_whisper().
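
A rough sketch of that, reusing the pipe built in the previous comment (whether the model name still needs to be passed alongside pipeline is an assumption here):

import stable_whisper

# reuse the existing transformers pipeline instead of letting stable-ts build its own
model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', pipeline=pipe)
result = model.transcribe('audio.mp3', word_timestamps=False)
result.to_srt_vtt('audio.srt', word_level=False)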

@dgoryeo

dgoryeo commented Oct 7, 2024

Reporting back that it worked.

I tested both options:
(a) directly calling model = stable_whisper.load_hf_whisper('kotoba-tech/kotoba-whisper-v2.1', device='cuda'), and
(b) passing the pipe to the pipeline parameter of stable_whisper.load_hf_whisper(), with device='cuda'.

Both worked, though I was happier with the results of (a).
