Quantize the model

This document provides different strategies for quantizing the various models available in Lit-GPT to reduce GPU memory usage, which is useful for running larger models on certain GPU hardware.

All the examples below were run on an A100 40GB GPU with CUDA 12.1.

Note

Quantization also supports finetuning via QLoRA

Baseline

It's useful to start with a baseline to have a reference point for memory savings via the various quantization methods.

python generate/base.py --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision 32-true --max_new_tokens 256
...
Time for inference 1: 6.93 sec total, 36.96 tokens/sec.
Memory used: 28.95 GB

First, using a lower precision compared to 32-bit float can result in two times reduced memory consumption. You can either try setting --precision 16-true for regular 16-bit precision or --precision bf16-true if your GPU supports brain-float 16-bit precision. (This brief video explains the difference between regular 16-bit and bf16-bit precision.)

In short, when --precision bf16-true or --precision 16-true is used, the model weights will automatically be converted and consume less memory. However, this might not be enough for large models or when using GPUs with limited memory.

python generate/base.py --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true --max_new_tokens 256
...
Time for inference 1: 5.37 sec total, 47.66 tokens/sec.
Memory used: 14.50 GB

To reduce the memory requirements further, Lit-GPT supports several quantization techniques, which are shown below.

Note

Most quantization examples below also use the --precision bf16-true setting explained above. If your GPU does not support bfloat16, you can change it to --precision 16-true.

`bnb.nf4`

Enabled with bitsandbytes. Check out the paper to learn more about how it works.

Note

bitsandbytes only supports CUDA devices and the Linux operating system. Windows users should use WSL2.

Uses the normalized float 4 (nf4) data type. This is recommended over "fp4" based on the paper's experimental results and theoretical analysis.

pip install scipy bitsandbytes  # scipy is required until https://github.com/TimDettmers/bitsandbytes/pull/525 is released

python generate/base.py --quantize bnb.nf4 --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true --max_new_tokens 256
...
Time for inference 1: 6.80 sec total, 37.62 tokens/sec
Memory used: 5.72 GB

`bnb.nf4-dq`

Enabled with bitsandbytes. Check out the paper to learn more about how it works.

"dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. In average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).

pip install scipy bitsandbytes  # scipy is required until https://github.com/TimDettmers/bitsandbytes/pull/525 is released

python generate/base.py --quantize bnb.nf4-dq --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true --max_new_tokens 256
...
Time for inference 1: 8.09 sec total, 30.87 tokens/sec
Memory used: 5.38 GB

`bnb.fp4`

Enabled with bitsandbytes. Check out the paper to learn more about how it works.

Uses pure FP4 quantization.

pip install scipy bitsandbytes  # scipy is required until https://github.com/TimDettmers/bitsandbytes/pull/525 is released

python generate/base.py --quantize bnb.fp4 --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true --max_new_tokens 256
...
Time for inference 1: 6.92 sec total, 36.98 tokens/sec
Memory used: 5.72 GB

`bnb.fp4-dq`

Enabled with bitsandbytes. Check out the paper to learn more about how it works.

"dq" stands for "Double Quantization" which reduces the average memory footprint by quantizing the quantization constants. In average, this amounts to about 0.37 bits per parameter (approximately 3 GB for a 65B model).

pip install scipy bitsandbytes  # scipy is required until https://github.com/TimDettmers/bitsandbytes/pull/525 is released

python generate/base.py --quantize bnb.fp4-dq --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision bf16-true --max_new_tokens 256
...
Time for inference 1: 10.02 sec total, 25.54 tokens/sec
Memory used: 5.38 GB

`bnb.int8`

Enabled with bitsandbytes. Check out the paper to learn more about how it works.

pip install scipy bitsandbytes  # scipy is required until https://github.com/TimDettmers/bitsandbytes/pull/525 is released

python generate/base.py --quantize bnb.int8 --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision 16-true --max_new_tokens 256
...
Time for inference 1: 20.22 sec total, 12.66 tokens/sec
Memory used: 8.70 GB

`gptq.int4`

Check out the paper to learn more about how it works.

This technique needs a conversion of the weights first:

pip install datasets

python quantize/gptq.py --precision bf16-true --checkpoint_dir checkpoints/tiiuae/falcon-7b
...
Time for quantization: 850.25 sec total
Memory used: 23.68 GB

It is important to note that this conversion step required a considerable amount of memory (higher than regular inference) and may take a long time, depending on the size of the model.

generation then works as usual with --quantize gptq.int4 which will load the newly quantized checkpoint file:

python generate/base.py --quantize gptq.int4 --checkpoint_dir checkpoints/tiiuae/falcon-7b --precision 32-true --max_new_tokens 256
...
Time for inference 1: 34.35 sec total, 7.45 tokens/sec
Memory used: 5.05 GB