bump version to v0.6.0a0 (#2371)
* bump version to v0.6.0a0

* miss one doc

* update w4a16.md
lvhan028 authored Aug 26, 2024
1 parent 91f6cdf commit 97b880b
Showing 7 changed files with 21 additions and 27 deletions.
2 changes: 1 addition & 1 deletion docs/en/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy by:

```shell
export LMDEPLOY_VERSION=0.5.3
export LMDEPLOY_VERSION=0.6.0a0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
4 changes: 2 additions & 2 deletions docs/en/multi_modal/minicpmv.md
@@ -153,7 +153,7 @@ docker run --runtime nvidia --gpus all \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:v0.5.3-cu12 \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server openbmb/MiniCPM-V-2_6
```
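
Once the container is running, the server exposes an OpenAI-compatible route on the mapped port. A minimal sketch of a request is shown below; the prompt text and image URL are illustrative placeholders rather than values from the docs:

```shell
# Query the api_server through its OpenAI-compatible chat completions endpoint.
# The prompt text and image URL are placeholders for illustration only.
curl http://localhost:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openbmb/MiniCPM-V-2_6",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}}
          ]
        }]
      }'
```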

@@ -165,7 +165,7 @@ version: '3.5'
services:
lmdeploy:
container_name: lmdeploy
image: openmmlab/lmdeploy:v0.5.3-cu12
image: openmmlab/lmdeploy:latest
ports:
- "23333:23333"
environment:
17 changes: 6 additions & 11 deletions docs/en/quantization/w4a16.md
@@ -1,22 +1,17 @@
# AWQ
# AWQ/GPTQ

LMDeploy adopts the [AWQ](https://arxiv.org/abs/2306.00978) algorithm for 4-bit weight-only quantization. With a purpose-built high-performance CUDA kernel, inference of the 4-bit quantized model runs up to 2.4x faster than FP16.
The LMDeploy TurboMind engine supports inference of 4-bit models quantized by either [AWQ](https://arxiv.org/abs/2306.00978) or [GPTQ](https://github.com/AutoGPTQ/AutoGPTQ), but its quantization module only supports the AWQ quantization algorithm.
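
As a rough sketch of the workflow described above, quantizing a model with the AWQ module and then running it on TurboMind might look like the following; the model name, output directory, and flag values are illustrative placeholders based on the commonly documented options:

```shell
# Quantize the weights to 4 bit with AWQ (the only algorithm the quantization module supports).
# Model name, work directory, and flag values are illustrative placeholders.
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir ./internlm2_5-7b-chat-4bit

# Run the quantized model with the TurboMind engine; --model-format awq declares the weight layout.
lmdeploy chat ./internlm2_5-7b-chat-4bit --model-format awq
```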

LMDeploy supports the following NVIDIA GPUs for W4A16 inference:
The following NVIDIA GPUs are available for AWQ/GPTQ INT4 inference:

- V100(sm70): V100
- Turing(sm75): 20 series, T4

- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100

- Ada Lovelace(sm89): 40 series

Before proceeding with the quantization and inference, please ensure that lmdeploy is installed.

```shell
pip install lmdeploy[all]
```
Before proceeding with the quantization and inference, please ensure that lmdeploy is installed by following the [installation guide](../installation.md).

This article comprises the following sections:
The remainder of this article is structured into the following sections:

<!-- toc -->

2 changes: 1 addition & 1 deletion docs/zh_cn/installation.md
@@ -23,7 +23,7 @@ pip install lmdeploy
The default prebuilt package is compiled on **CUDA 12**. If CUDA 11+ (>=11.3) is required, you can install lmdeploy with the following commands:

```shell
export LMDEPLOY_VERSION=0.5.3
export LMDEPLOY_VERSION=0.6.0a0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
```
4 changes: 2 additions & 2 deletions docs/zh_cn/multi_modal/minicpmv.md
@@ -153,7 +153,7 @@ docker run --runtime nvidia --gpus all \
--env "HUGGING_FACE_HUB_TOKEN=<secret>" \
-p 23333:23333 \
--ipc=host \
openmmlab/lmdeploy:v0.5.3-cu12 \
openmmlab/lmdeploy:latest \
lmdeploy serve api_server openbmb/MiniCPM-V-2_6
```

@@ -165,7 +165,7 @@ version: '3.5'
services:
lmdeploy:
container_name: lmdeploy
image: openmmlab/lmdeploy:v0.5.3-cu12
image: openmmlab/lmdeploy:latest
ports:
- "23333:23333"
environment:
17 changes: 8 additions & 9 deletions docs/zh_cn/quantization/w4a16.md
@@ -1,18 +1,17 @@
# INT4 Model Quantization and Deployment

LMDeploy uses the AWQ algorithm to realize 4-bit weight quantization of models. The TurboMind inference engine provides a highly efficient 4-bit inference CUDA kernel, with performance more than 2.4x that of FP16. It supports the following NVIDIA GPUs:
The LMDeploy TurboMind engine supports inference of 4-bit models quantized by either [AWQ](https://arxiv.org/abs/2306.00978) or [GPTQ](https://github.com/AutoGPTQ/AutoGPTQ). However, the LMDeploy quantization module currently only supports the AWQ quantization algorithm.

- Turing (sm75): 20 series, T4
- Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace (sm89): 40 series
NVIDIA GPUs that can be used for AWQ/GPTQ INT4 inference include:

Before quantization and deployment, please make sure lmdeploy is installed.
- V100(sm70): V100
- Turing(sm75): 20 series, T4
- Ampere(sm80,sm86): 30 series, A10, A16, A30, A100
- Ada Lovelace(sm89): 40 series

```shell
pip install lmdeploy[all]
```
Before proceeding with quantization and inference, please make sure lmdeploy is installed by following the [installation guide](../installation.md).

This article comprises the following sections:
The remainder of this article is organized into the following sections:

<!-- toc -->

2 changes: 1 addition & 1 deletion lmdeploy/version.py
@@ -1,7 +1,7 @@
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Tuple

__version__ = '0.5.3'
__version__ = '0.6.0a0'
short_version = __version__


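As a quick sanity check after upgrading, the installed version can be printed from Python; this assumes the package re-exports `__version__` at the top level:

```shell
# Print the installed lmdeploy version (assumes lmdeploy re-exports __version__).
python -c "import lmdeploy; print(lmdeploy.__version__)"
# Expected to print 0.6.0a0 after this bump.
```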

