# LMDeploy Release V0.0.7
## Highlights
- Flash attention 2 is supported, boosting context decoding speed by approximately 45%
- Token-id decoding is now performed incrementally, improving decoding efficiency
- The GEMM tuning script is now included in the PyPI package
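The incremental token-id decoding mentioned above can be illustrated with a minimal sketch. This is not LMDeploy's implementation; the toy vocabulary and class names here are assumptions made purely for illustration. The idea is to track an offset so that after each generation step only the newly produced ids are detokenized, instead of re-decoding the whole sequence every time:

```python
# Toy stand-in for a real tokenizer's vocabulary (assumption for illustration).
VOCAB = {0: "Hello", 1: ",", 2: " world", 3: "!"}


def decode(token_ids):
    """Decode a list of token ids into text (toy stand-in)."""
    return "".join(VOCAB[t] for t in token_ids)


class IncrementalDecoder:
    """Remembers how many ids were already decoded; decodes only the new suffix."""

    def __init__(self):
        self.offset = 0

    def step(self, token_ids):
        # Decode only the ids produced since the previous call.
        new_text = decode(token_ids[self.offset:])
        self.offset = len(token_ids)
        return new_text


dec = IncrementalDecoder()
stream = [0, 1, 2, 3]
pieces = []
for i in range(1, len(stream) + 1):
    # Simulate a generation loop: the id list grows by one each step.
    pieces.append(dec.step(stream[:i]))
print("".join(pieces))  # Hello, world!
```

In a real streaming setup the per-step cost stays constant as the sequence grows, whereas re-decoding the full id list makes each step more expensive than the last.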
## What's Changed
### 🚀 Features
### 💥 Improvements
- add llama_gemm to wheel by @irexyc in #320
- Decode generated token_ids incrementally by @AllentDan in #309
### 🐞 Bug fixes
- Fix turbomind import error on windows by @irexyc in #316
- Fix profile_serving hung issue by @lvhan028 in #344
### 📚 Documentation
- Fix readthedocs building by @RunningLeon in #321
- fix(kvint8): update doc by @tpoisonooo in #315
- Update FAQ for restful api by @AllentDan in #319
**Full Changelog**: v0.0.6...v0.0.7