Releases · InternLM/lmdeploy
LMDeploy Release V0.0.10
What's Changed
💥 Improvements
- [feature] Graceful termination of background threads in LlamaV2 by @akhoroshev in #458
- Expose stop words and filter `eoa` by @AllentDan in #352
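As a usage note, exposed stop words can be tried from Python. A minimal sketch using the present-day lmdeploy API, which postdates v0.0.10 and may differ from its interface; the model path and stop word are placeholders:

```python
from lmdeploy import pipeline, GenerationConfig

# Placeholder model; <eoa> is InternLM's end-of-answer token.
pipe = pipeline('internlm/internlm-chat-7b')
gen_config = GenerationConfig(stop_words=['<eoa>'])

# Generation halts as soon as a stop word is produced, and the
# stop word itself is filtered from the returned text.
response = pipe(['Hello, introduce yourself.'], gen_config=gen_config)
print(response[0].text)
```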
🐞 Bug fixes
- Fix side effect brought by supporting codellama: `sequence_start` is always true when calling `model.get_prompt` by @lvhan028 in #466
- Miss meta instruction of internlm-chat model by @lvhan028 in #470
- [bug] Fix race condition by @akhoroshev in #460
- Fix compatibility issues with Pydantic 2 by @aisensiy in #465 (see the sketch after this list)
- Fix benchmark serving failing to use the Qwen tokenizer by @AllentDan in #443
- Fix memory leak by @lvhan028 in #488
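For context on the Pydantic 2 fix above: Pydantic 2 renamed several v1 methods, a common source of breakage when upgrading. A generic illustration of the renames, with a hypothetical model, not lmdeploy code:

```python
from pydantic import BaseModel

class ChatRequest(BaseModel):  # hypothetical model for illustration
    prompt: str
    temperature: float = 0.8

req = ChatRequest(prompt='hi')
# Pydantic 1 spelled these .dict() and .parse_obj();
# Pydantic 2 renames them, breaking code written against v1:
data = req.model_dump()
req2 = ChatRequest.model_validate(data)
```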
📚 Documentations
- Fix typo in README.md by @eltociear in #462
New Contributors
- @eltociear made their first contribution in #462
- @akhoroshev made their first contribution in #458
- @aisensiy made their first contribution in #465
Full Changelog: v0.0.9...v0.0.10
LMDeploy Release V0.0.9
Highlight
- Support InternLM 20B, including FP16, W4A16, and W4KV8
What's Changed
💥 Improvements
- Reduce gil switching by @irexyc in #407
- Profile token generation with more settings by @AllentDan in #364
🐞 Bug fixes
- Fix disk space limit for building docker image by @RunningLeon in #404
- More general PyPI CI by @irexyc in #412
- Fix `build.md` by @pangsg in #411
- Fix memory leak by @irexyc in #415
- Fix token count bug by @AllentDan in #416
- [Fix] Support actual seqlen in flash-attention2 by @grimoire in #418 (see the sketch after this list)
- [Fix] output[-1] when output is empty by @wangruohui in #405
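The "actual seqlen" fix concerns variable-length batches. A sketch of how flash-attention 2's public varlen entry point takes real sequence lengths via cumulative offsets; the shapes are illustrative, and this is not turbomind's internal code:

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two packed sequences of lengths 3 and 5: 8 tokens total,
# 8 heads, head_dim 64, packed without padding.
q = torch.randn(8, 8, 64, dtype=torch.float16, device='cuda')
k, v = torch.randn_like(q), torch.randn_like(q)

# Cumulative sequence lengths mark where each sequence starts and
# ends, so attention uses actual lengths instead of a padded maximum.
cu_seqlens = torch.tensor([0, 3, 8], dtype=torch.int32, device='cuda')
out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens,
                             max_seqlen_q=5, max_seqlen_k=5, causal=True)
```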
🌐 Other
- rename readthedocs config file by @RunningLeon in #429
- bump version to v0.0.9 by @lvhan028 in #428
Full Changelog: v0.0.8...v0.0.9
LMDeploy Release V0.0.8
Highlights
- Support Baichuan2-7B-Base and Baichuan2-7B-Chat
- Support all features of Code Llama: code completion, infilling, chat / instruct, and python specialist
What's Changed
🚀 Features
- Support baichuan2-chat chat template by @wangruohui in #378
- Support codellama by @lvhan028 in #359
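The infilling capability rearranges the prompt around the insertion point. A sketch of Code Llama's fill-in-the-middle prompt layout as described in the Code Llama paper; lmdeploy assembles this internally, and the snippet only illustrates the format:

```python
# The model generates the code between prefix and suffix,
# emitting an <EOT> token when the middle is complete.
prefix = "def quicksort(arr):\n    if len(arr) <= 1:\n        return arr\n"
suffix = "\n    return quicksort(lo) + mid + quicksort(hi)"
infill_prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"
```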
🐞 Bug fixes
- [Fix] Continuous batching doesn't work when `stream` is False by @sleepwalker2017 in #346
- [Fix] Set max dynamic smem size for decoder MHA to support context length > 8k by @lvhan028 in #377
- Fix exceed session len core dump for chat and generate by @AllentDan in #366
- [Fix] update puyu model by @Harold-lkk in #399
📚 Documentations
- [Docs] Fix quantization docs link by @LZHgrla in #367
- [Docs] Simplify `build.md` by @pppppM in #370
- [Docs] Update lmdeploy logo by @lvhan028 in #372
New Contributors
- @sleepwalker2017 made their first contribution in #346
Full Changelog: v0.0.7...v0.0.8
LMDeploy Release V0.0.7
Highlights
- Flash attention 2 is supported, boosting context decoding speed by approximately 45%
- Token_id decoding has been optimized for better efficiency
- The gemm tuning script has been packed into the PyPI package
What's Changed
💥 Improvements
- Add `llama_gemm` to the wheel by @irexyc in #320
- Decode generated token_ids incrementally by @AllentDan in #309
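Incremental decoding emits only newly completed text and holds back half-decoded multi-byte characters. A simplified sketch of the idea, not lmdeploy's actual implementation, using a HuggingFace tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')  # any tokenizer works

def stream_decode(token_ids):
    """Yield only the newly completed text as token_ids grows."""
    emitted = 0
    for n in range(1, len(token_ids) + 1):
        text = tokenizer.decode(token_ids[:n], skip_special_tokens=True)
        # U+FFFD means the last token ends mid-character; wait for more.
        if not text.endswith('\ufffd') and len(text) > emitted:
            yield text[emitted:]
            emitted = len(text)

print(''.join(stream_decode(tokenizer.encode('Hello, world!'))))
```

Re-decoding the whole prefix each step is quadratic in sequence length; a production implementation would decode only a trailing window of tokens, but the hold-back invariant is the same.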
🐞 Bug fixes
- Fix turbomind import error on windows by @irexyc in #316
- Fix profile_serving hung issue by @lvhan028 in #344
📚 Documentations
- Fix readthedocs building by @RunningLeon in #321
- fix(kvint8): update doc by @tpoisonooo in #315
- Update FAQ for restful api by @AllentDan in #319
Full Changelog: v0.0.6...v0.0.7
LMDeploy Release V0.0.6
Highlights
- Support Qwen-7B with dynamic NTK scaling and logN scaling in turbomind (see the sketch after this list)
- Support tensor parallelism for W4A16
- Add OpenAI-like RESTful API
- Support Llama-2 70B 4-bit quantization
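For background on dynamic NTK scaling: the commonly used NTK-aware trick enlarges the RoPE base so longer contexts stay in-distribution; turbomind's dynamic variant, which adapts to the running sequence length, may differ in detail. A sketch of the formula:

```python
def ntk_scaled_base(base: float, head_dim: int, scale: float) -> float:
    # NTK-aware RoPE: raising the base slows the low-frequency
    # rotations so `scale`x longer contexts stay in-distribution.
    return base * scale ** (head_dim / (head_dim - 2))

# e.g. stretching a 2k-context model with head_dim=128 to 4k:
print(ntk_scaled_base(10000.0, 128, 2.0))  # ~20221
```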
What's Changed
🚀 Features
- Profiling tool for huggingface and deepspeed models by @wangruohui in #161
- Support windows platform by @irexyc in #209
- Qwen-7B, dynamic NTK scaling and logN scaling support in turbomind by @lzhangzz in #230
- Add Restful API by @AllentDan in #223 (see the example after this list)
- Support context decoding with DP in pytorch by @wangruohui in #193
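A hedged example of calling the OpenAI-like server. The route and payload follow the present-day api_server; v0.0.6 launched the server differently and its paths may have differed:

```python
import requests

# Assumes an lmdeploy api_server is listening on the default port
# and exposes the OpenAI-compatible chat completion route.
resp = requests.post(
    'http://0.0.0.0:23333/v1/chat/completions',
    json={
        'model': 'internlm-chat-7b',
        'messages': [{'role': 'user', 'content': 'Hello!'}],
    },
)
print(resp.json()['choices'][0]['message']['content'])
```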
💥 Improvements
- Support TP for W4A16 by @lzhangzz in #262
- Pass chat template args including meta_prompt to model (7785142) by @AllentDan in #225
- Enable the Gradio server to call inference services through the RESTful API by @AllentDan in #287
🐞 Bug fixes
- Adjust dependency of gradio server by @AllentDan in #236
- Implement `movmatrix` using warp shuffling for CUDA < 11.8 by @lzhangzz in #267
- Add 'accelerate' to requirement list by @lvhan028 in #261
- Fix building with CUDA 11.3 by @lzhangzz in #280
- Pad tok_embedding and output weights to make their shape divisible by TP by @lvhan028 in #285
- Fix llama2 70b & qwen quantization error by @pppppM in #273
- Import turbomind in gradio server only when it is needed by @AllentDan in #303
📚 Documentations
- Remove specified version in user guide by @lvhan028 in #241
- docs(quantization): update description by @tpoisonooo in #253 and #272
- Check-in FAQ by @lvhan028 in #256
- Add readthedocs by @RunningLeon in #208
🌐 Other
- Update workflow for building docker image by @RunningLeon in #282
- Change to github-hosted runner for building docker image by @RunningLeon in #291
Known issues
- 4-bit Qwen-7B model inference fails. #307 is addressing this issue.
Full Changelog: v0.0.5...v0.0.6
LMDeploy Release V0.0.5
LMDeploy Release V0.0.4
Highlight
- Support 4-bit LLM quantization and inference. Check this guide for detailed information.
What's Changed
🚀 Features
- Blazing fast W4A16 inference by @lzhangzz in #202
- Support AWQ by @pppppM in #108 and @AllentDan in #228
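For intuition on W4A16: weights are stored as 4-bit integers plus per-group fp16 scales and zero points, and dequantized to fp16 at compute time. A NumPy sketch of group-wise dequantization; the layouts are illustrative, not turbomind's packed format:

```python
import numpy as np

def dequant_w4(qweight, scales, zeros, group_size=128):
    """qweight: uint8 values in [0, 15], shape (out, in);
    scales/zeros: fp16, shape (out, in // group_size)."""
    w = qweight.astype(np.float16)
    n_groups = w.shape[1] // group_size
    for g in range(n_groups):
        cols = slice(g * group_size, (g + 1) * group_size)
        # Map each 4-bit code back to fp16 with its group's params.
        w[:, cols] = (w[:, cols] - zeros[:, g:g + 1]) * scales[:, g:g + 1]
    return w  # fp16 weights, ready for a regular fp16 GEMM
```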
💥 Improvements
- Add release note template by @lvhan028 in #211
- feat(quantization): use asymmetric quantization for kv cache by @tpoisonooo in #218
📚 Documentations
- Update W4A16 News by @pppppM in #227
- Check-in user guide for w4a16 LLM deployment by @lvhan028 in #224
Full Changelog: v0.0.3...v0.0.4
LMDeploy Release V0.0.3
What's Changed
🚀 Features
- Support tensor parallelism without offline splitting model weights by @grimoire in #158
- Add script to split HuggingFace model to the smallest sharded checkpoints by @LZHgrla in #199 (see the sketch after this list)
- Add non-stream inference api for chatbot by @lvhan028 in #200
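For reference, stock transformers can already reshard a checkpoint; the script from #199 may work differently. A sketch with a placeholder model name and shard size:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'internlm/internlm-chat-7b', trust_remote_code=True)
# save_pretrained reshards the weights into files of at most 2GB each.
model.save_pretrained('./sharded-ckpt', max_shard_size='2GB')
```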
💥 Improvements
- Add issue/pr templates by @lvhan028 in #184
- Remove unused code to reduce binary size by @lzhangzz in #181
- Support serving with gradio without communicating to TIS by @AllentDan in #162
- Improve postprocessing in TIS serving by applying Incremental de-tokenizing by @lvhan028 in #197
- Support multi-session chat by @wangruohui in #178
🐞 Bug fixes
- Fix build test error and move turbomind csrc test cases to `tests/csrc` by @lvhan028 in #188
- Fix launching client error by moving lmdeploy/turbomind/utils.py to lmdeploy/utils.py by @lvhan028 in #191
📚 Documentations
- Update README.md by @tpoisonooo in #187
- Translate turbomind.md by @xin-li-67 in #173
Full Changelog: v0.0.2...v0.0.3
LMDeploy Release V0.0.2
What's Changed
🚀 Features
- Add lmdeploy python package build scripts and CI workflow by @irexyc in #163, #164, #170
- Support LLama-2 with GQA by @lzhangzz in #147 and @grimoire in #160
- Add Llama-2 chat template by @grimoire in #140 (see the sketch after this list)
- Add decode-only forward pass by @lzhangzz in #153
- Support tensor parallelism in turbomind's python API by @grimoire in #82
- Support packed qkv weights (`w_pack`) by @tpoisonooo in #83
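For reference, the single-turn Llama-2 chat format such a template must reproduce, per Meta's reference implementation (multi-turn conversations wrap each exchange in `<s>`...`</s>`):

```python
B_INST, E_INST = '[INST]', '[/INST]'
B_SYS, E_SYS = '<<SYS>>\n', '\n<</SYS>>\n\n'

def llama2_prompt(system: str, user: str) -> str:
    # The system prompt is folded into the first user turn.
    return f'{B_INST} {B_SYS}{system}{E_SYS}{user} {E_INST}'

print(llama2_prompt('You are a helpful assistant.', 'Hi!'))
```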
💥 Improvements
- Refactor the chat template of supported models using factory pattern by @lvhan028 in #144 and @streamsunshine in #174
- Add profile throughput benchmark by @grimoire in #146
- Remove slicing response and add resume API by @streamsunshine in #154
- Support DeepSpeed on autoTP and kernel injection by @KevinNuNu and @wangruohui in #138
- Add github action for publishing docker image by @RunningLeon in #148
🐞 Bug fixes
- Fix getting package root path error in python3.9 by @lvhan028 in #157
- Carriage return caused overwriting on the same line by @wangruohui in #143
- Fix the offset during streaming chat by @lvhan028 in #142
- Fix concatenate bug in benchmark serving script by @rollroll90 in #134
- Fix attempted_relative_import by @KevinNuNu in #125
📚 Documentations
- Translate `en/quantization.md` into Chinese by @xin-li-67 in #166
- Check-in benchmark on real conversation data by @lvhan028 in #156
- Fix typo and missing dependent packages in README and requirements.txt by @vansin in #123, @APX103 in #109, @AllentDan in #119 and @del-zhenwu in #124
- Add turbomind's architecture documentation by @lzhangzz in #101
New Contributors
@streamsunshine, @del-zhenwu, @APX103, @xin-li-67, @KevinNuNu, @rollroll90