faster-whisper is a reimplementation of OpenAI's Whisper model using CTranslate2, which is a fast inference engine for Transformer models.
This implementation is up to 4 times faster than openai/whisper for the same accuracy while using less memory. The efficiency can be further improved with 8-bit quantization on both CPU and GPU.
For reference, here are the time and memory usage required to transcribe 13 minutes of audio with different implementations:
Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
---|---|---|---|---|---|
openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |
Executed with CUDA 11.7.1 on an NVIDIA Tesla V100S.
Implementation | Precision | Beam size | Time | Max. memory |
---|---|---|---|---|
openai/whisper | fp32 | 5 | 10m31s | 3101MB |
whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
whisper.cpp | fp16 | 5 | 12m39s | 873MB |
faster-whisper | fp32 | 5 | 2m44s | 1675MB |
faster-whisper | int8 | 5 | 2m04s | 995MB |
Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.
The module can be installed from PyPI:
pip install faster-whisper
Other installation methods:
# Install the master branch:
pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
# Install a specific commit:
pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
GPU execution requires the NVIDIA libraries cuBLAS 11.x and cuDNN 8.x to be installed on the system. Please refer to the CTranslate2 documentation.
from faster_whisper import WhisperModel
model_size = "large-v2"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
# or run on GPU with INT8
# model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
# or run on CPU with INT8
# model = WhisperModel(model_size, device="cpu", compute_type="int8")
segments, info = model.transcribe("audio.mp3", beam_size=5)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
Warning: segments is a generator, so the transcription only starts when you iterate over it. It can be run to completion by gathering the segments in a list or with a for loop:
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
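For example, a minimal sketch that forces the whole file to be transcribed and joins the segment texts into a single string (the variable names are illustrative):

segments, _ = model.transcribe("audio.mp3")
full_text = "".join(segment.text for segment in segments)  # iterating here runs the transcription
print(full_text)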
Word-level timestamps are enabled with the word_timestamps option:
segments, _ = model.transcribe("audio.mp3", word_timestamps=True)
for segment in segments:
    for word in segment.words:
        print("[%.2fs -> %.2fs] %s" % (word.start, word.end, word.word))
The library integrates the Silero VAD model to filter out parts of the audio without speech:
segments, _ = model.transcribe("audio.mp3", vad_filter=True)
The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the source code. They can be customized with the dictionary argument vad_parameters:
segments, _ = model.transcribe(
    "audio.mp3",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
The library logging level can be configured like this:
import logging
logging.basicConfig()
logging.getLogger("faster_whisper").setLevel(logging.DEBUG)
See more model and transcription options in the WhisperModel class implementation.
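For instance, a few commonly combined options (a sketch; the parameter values are illustrative, and the class implementation lists the full set and its defaults):

segments, info = model.transcribe(
    "audio.mp3",
    language="en",     # skip language detection by forcing a language
    task="translate",  # translate the speech to English instead of transcribing
    beam_size=5,
    initial_prompt="Vocabulary hint: CTranslate2, quantization.",  # bias decoding toward domain terms
)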
Here is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!
- whisper-ctranslate2 is a command-line client based on faster-whisper and compatible with the original client from openai/whisper.
- whisper-diarize is a speaker diarization tool based on faster-whisper and NVIDIA NeMo.
- whisper-standalone-win contains portable, ready-to-run binaries of faster-whisper for Windows.
- asr-sd-pipeline provides a scalable, modular, end-to-end multi-speaker speech-to-text solution implemented with AzureML pipelines.
- Open-Lyrics is a Python library that transcribes voice files using faster-whisper and translates/polishes the resulting text into .lrc files in the desired language using OpenAI-GPT.
- wscribe is a flexible transcript generation tool supporting faster-whisper; it can export word-level transcripts, and the exported transcript can then be edited with wscribe-editor.
When loading a model from its size, such as WhisperModel("large-v2"), the corresponding CTranslate2 model is automatically downloaded from the Hugging Face Hub.
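The download location can be controlled with the download_root and local_files_only constructor arguments; a minimal sketch (the directory value is illustrative):

model = WhisperModel(
    "large-v2",
    download_root="./models",  # cache the converted model in a local directory
    local_files_only=False,    # set to True to fail instead of downloading when the model is missing
)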
We also provide a script to convert any Whisper models compatible with the Transformers library. They could be the original OpenAI models or user fine-tuned models.
For example, the command below converts the original "large-v2" Whisper model and saves the weights in FP16:
pip install "transformers[torch]>=4.23"
ct2-transformers-converter --model openai/whisper-large-v2 --output_dir whisper-large-v2-ct2 \
--copy_files tokenizer.json --quantization float16
- The option --model accepts a model name on the Hub or a path to a model directory.
- If the option --copy_files tokenizer.json is not used, the tokenizer configuration is automatically downloaded when the model is loaded later.
Models can also be converted from the code. See the conversion API.
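A minimal sketch of a conversion from Python, assuming the TransformersConverter class from the CTranslate2 package (check the conversion API documentation for the exact signature):

import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-large-v2",
    copy_files=["tokenizer.json"],  # bundle the tokenizer with the converted model
)
converter.convert("whisper-large-v2-ct2", quantization="float16")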
- Directly load the model from a local directory:
model = faster_whisper.WhisperModel("whisper-large-v2-ct2")
- Upload your model to the Hugging Face Hub and load it from its name (a sketch of the upload step is shown after this list):
model = faster_whisper.WhisperModel("username/whisper-large-v2-ct2")
If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:
- Verify that the same transcription options are used, especially the same beam size. For example, in openai/whisper, model.transcribe uses a default beam size of 1, but here we use a default beam size of 5.
- When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable OMP_NUM_THREADS, which can be set when running your script:
OMP_NUM_THREADS=4 python3 my_script.py
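Alternatively, the number of CPU threads can be set directly on the model with the cpu_threads constructor argument (the value 4 is illustrative):

model = WhisperModel("large-v2", device="cpu", compute_type="int8", cpu_threads=4)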