BigDL-LLM is a low-bit LLM library for Intel XPU (Xeon/Core/Flex/Arc/PVC), featuring broad model support, low latency, and a small memory footprint. It is released as part of the open source BigDL project under the Apache 2.0 License.
You can use BigDL-LLM to run any PyTorch model (e.g., Hugging Face Transformers models). It automatically optimizes and accelerates LLMs using low-bit quantization, modern hardware acceleration, and the latest software optimizations.

Using BigDL-LLM is easy. With just a one-line code change, you can immediately observe a significant speedup.[^1]
```python
from bigdl.llm import optimize_model
from transformers import LlamaForCausalLM, LlamaTokenizer

model = LlamaForCausalLM.from_pretrained(model_path, ...)

# Apply BigDL-LLM low-bit optimization (INT4 by default)
model = optimize_model(model)
...
```
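The optimized model is used exactly like the original one. Below is a minimal, illustrative sketch of running generation with it; the prompt and generation parameters are placeholders, not part of the BigDL-LLM API.

```python
# Illustrative inference sketch (prompt and parameters are placeholders)
tokenizer = LlamaTokenizer.from_pretrained(model_path)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids

# Generate with the low-bit optimized model, using the standard
# Hugging Face generate() API
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```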
BigDL-LLM provides a variety of low-bit optimizations (e.g., INT3/NF3/INT4/NF4/INT5/INT8), and allows you to run LLMs on low-cost PCs (CPU-only), on PCs with GPUs, or in the cloud.
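As a sketch of how a non-default precision can be selected, `optimize_model` takes a `low_bit` argument; the value names below (e.g., `"nf4"`) are assumptions based on the supported formats listed above, so check the API documentation for the exact accepted strings.

```python
from bigdl.llm import optimize_model

# Sketch: pick a different low-bit precision via the low_bit argument
# ("nf4" shown here is an assumed value name; see the API docs for the
# exact set of supported values)
model = optimize_model(model, low_bit="nf4")
```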
The demos below show the experience of running 7B and 13B models on a laptop with 16 GB of memory.
The following chapters of this tutorial explain in more detail how to use BigDL-LLM to build LLM applications, covering, e.g., best practices for setting up your environment, the APIs, Chinese language support, GPU acceleration, and application development guides with case studies. Most chapters provide runnable notebooks using popular open source models. Read along to learn more, and run the code on your laptop.
You can also check out our GitHub repo for more information and the latest news.
We have verified many models on BigDL-LLM and provide ready-to-run examples, such as Llama2, Vicuna, ChatGLM, ChatGLM2, Baichuan, MOSS, Falcon, Dolly-v1, Dolly-v2, StarCoder, Mistral, RedPajama, Whisper, etc. You can find more model examples here.
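For Hugging Face Transformers models, BigDL-LLM also offers a drop-in `from_pretrained`-style API that loads a model directly in low-bit format. The sketch below is illustrative; the model ID shown is just an example, not a requirement.

```python
# Sketch: load a verified model directly in 4-bit with the BigDL-LLM
# transformers-style API (model ID shown is an example)
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    load_in_4bit=True,
)
```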
[^1]: Performance varies by use, configuration and other factors. `bigdl-llm` may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.