# LMDeploy Release V0.0.7
## Highlights
- Flash attention 2 is supported, boosting context decoding speed by approximately 45%
- Token-id decoding is now performed incrementally, improving decoding efficiency
- The GEMM tuning script is now included in the PyPI package
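The incremental token-id decoding mentioned above can be illustrated with a minimal sketch. This is not LMDeploy's implementation; the toy vocabulary and class names here are assumptions made purely for illustration. The idea is to track an offset so that after each generation step only the newly produced ids are detokenized, instead of re-decoding the whole sequence every time:

```python
# Toy stand-in for a real tokenizer's vocabulary (assumption for illustration).
VOCAB = {0: "Hello", 1: ",", 2: " world", 3: "!"}


def decode(token_ids):
    """Decode a list of token ids into text (toy stand-in)."""
    return "".join(VOCAB[t] for t in token_ids)


class IncrementalDecoder:
    """Remembers how many ids were already decoded; decodes only the new suffix."""

    def __init__(self):
        self.offset = 0

    def step(self, token_ids):
        # Decode only the ids produced since the previous call.
        new_text = decode(token_ids[self.offset:])
        self.offset = len(token_ids)
        return new_text


dec = IncrementalDecoder()
stream = [0, 1, 2, 3]
pieces = []
for i in range(1, len(stream) + 1):
    # Simulate a generation loop: the id list grows by one each step.
    pieces.append(dec.step(stream[:i]))
print("".join(pieces))  # Hello, world!
```

In a real streaming setup the per-step cost stays constant as the sequence grows, whereas re-decoding the full id list makes each step more expensive than the last.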
## What's Changed
### 🚀 Features
### 💥 Improvements
- add llama_gemm to wheel by @irexyc in #320
- Decode generated token_ids incrementally by @AllentDan in #309
### 🐞 Bug fixes
- Fix turbomind import error on windows by @irexyc in #316
- Fix profile_serving hung issue by @lvhan028 in #344
### 📚 Documentation
- Fix readthedocs building by @RunningLeon in #321
- fix(kvint8): update doc by @tpoisonooo in #315
- Update FAQ for restful api by @AllentDan in #319
**Full Changelog**: v0.0.6...v0.0.7