profile throughput without new threads #2826

grimoire · 2024-11-28T02:18:42Z

Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and ask the maintainers for help.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, like docstring or example tutorials.

lvhan028 · 2024-12-02T03:41:28Z

lmdeploy/pytorch/engine/model_agent.py

@@ -742,8 +688,9 @@ async def async_forward(self, inputs: ModelInputs, swap_in_map: SwapMap,
        output = self._forward_impl(inputs,
                                    swap_in_map=swap_in_map,
                                    swap_out_map=swap_out_map)
-        await asyncio.get_event_loop().run_in_executor(None,
-                                                       self.stream.synchronize)
+        await asyncio.sleep(0)


How does this method outperform the previous one?

`asyncio.sleep(0)` yields control to the event loop without creating a new thread.
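To illustrate the answer above, here is a minimal sketch using only the standard library. It is not the lmdeploy code: `worker` and `main` are hypothetical names. It shows two coroutines taking turns on a single thread through `await asyncio.sleep(0)`, whereas `run_in_executor(None, ...)` hands its callable to a worker thread from the default thread pool.

```python
import asyncio
import threading


async def worker(log, name, steps):
    """Record each step, then hand control back to the event loop."""
    for i in range(steps):
        log.append((name, i, threading.get_ident()))
        # Yield to the event loop so other ready tasks can run.
        # No thread is created here.
        await asyncio.sleep(0)


async def main():
    log = []
    # Both workers share the event loop thread and take turns.
    await asyncio.gather(worker(log, "a", 3), worker(log, "b", 3))
    # For contrast: the default executor runs the callable in a pool thread.
    loop = asyncio.get_running_loop()
    executor_thread = await loop.run_in_executor(None, threading.get_ident)
    return log, executor_thread


log, executor_thread = asyncio.run(main())
order = [name for name, _, _ in log]
worker_threads = {thread_id for _, _, thread_id in log}
```

Note that `asyncio.sleep(0)` only gives other tasks a chance to run; by itself it does not wait for any GPU work to finish.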

AllentDan · 2024-12-02T11:00:37Z

lmdeploy/pytorch/engine/engine.py

@@ -741,6 +748,8 @@ def __update_inputs(next_token_ids):
        logger.debug('<ForwardTask>: '
                     f'batch_size={inputs.seq_length.size(0)} '
                     f'num_tokens={inputs.input_ids.size(-1)}')
+        if self.gpu_count == 1:


Why is this needed now when it was not before?

Logits processing can be streamed when the inputs are CUDA tensors.
Distributing a container of CUDA tensors is slower than distributing CPU tensors.

AllentDan · 2024-12-02T11:07:56Z

lmdeploy/pytorch/kernels/cuda/apply_rotary_pos_emb.py

@@ -60,8 +34,8 @@ def apply_rotary_pos_emb_qk_kernel(
    BLOCK_N: tl.constexpr,
 ):
    """apply rotary on key AND query kernel."""
-    seq_block_id = tl.program_id(0)
-    head_id = tl.program_id(1)
+    seq_block_id = tl.program_id(1)


What is the benefit of this change?

It increases the cos/sin cache hit rate.
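The reasoning behind this answer can be sketched with a toy model in plain Python. It is not the Triton kernel and makes several assumptions: programs run one at a time in launch order, grid axis 0 varies fastest, `head_id` takes axis 0 after the change (the hunk does not show that line), and the cos/sin rows for a sequence block occupy one slot of a small LRU cache. `launch_order`, `cos_sin_hits`, and `cache_slots` are hypothetical names.

```python
from collections import OrderedDict


def launch_order(grid, seq_axis):
    """Yield (seq_block_id, head_id) pairs with grid axis 0 varying fastest."""
    n0, n1 = grid
    for pid1 in range(n1):
        for pid0 in range(n0):
            if seq_axis == 0:
                yield pid0, pid1
            else:
                yield pid1, pid0


def cos_sin_hits(num_seq_blocks, num_heads, seq_axis, cache_slots=2):
    """Count LRU hits on the cos/sin rows, keyed by sequence block."""
    if seq_axis == 0:
        grid = (num_seq_blocks, num_heads)
    else:
        grid = (num_heads, num_seq_blocks)
    cache = OrderedDict()
    hits = 0
    for seq_block_id, _head_id in launch_order(grid, seq_axis):
        if seq_block_id in cache:
            hits += 1
            cache.move_to_end(seq_block_id)
        else:
            cache[seq_block_id] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits


# Old layout: sequence block on axis 0, so neighbours use different cos/sin rows.
old_hits = cos_sin_hits(num_seq_blocks=8, num_heads=4, seq_axis=0)
# New layout: sequence block on axis 1, so neighbours reuse the same cos/sin rows.
new_hits = cos_sin_hits(num_seq_blocks=8, num_heads=4, seq_axis=1)
```

In this model the new layout reuses each block's cos/sin rows across all heads before moving on. Real GPUs schedule many programs concurrently, so the actual gain depends on the hardware and the problem shape.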

AllentDan · 2024-12-02T11:10:31Z

lmdeploy/pytorch/kernels/cuda/fill_kv_cache.py

@@ -79,7 +47,7 @@ def _fill_kv_cache_kernel(
    QSeqLens,
    KVSeqLens,
    BlockOffsets,
-    num_heads: tl.constexpr,
+    is_decoding: tl.constexpr,


Does it mean that the quant kernel of fill_kv_cache can also be optimized?

The quant kernel will be optimized in the future.

grimoire and others added 7 commits November 27, 2024 21:13

profile throughput without threads

4a18be7

optimize main loop

31afcf2

fix torch.event

88ad4dc

fix python>3.11

9585aef

optimize tp

3ea4aa8

reduce cudagraph copy

549c6c6

optimize fill kv cache

3df2e49

lvhan028 self-requested a review November 29, 2024 04:58

grimoire added 3 commits November 29, 2024 13:43

optimize silu and mul

037cac6

optimize apply rotary

4247295

remove executor

aa25512

lvhan028 added the improvement label Nov 29, 2024

lvhan028 requested a review from AllentDan December 1, 2024 01:48

remove kernel

4ea3127

lvhan028 reviewed Dec 2, 2024

remove num_heads==1

b269793

lvhan028 approved these changes Dec 2, 2024

AllentDan reviewed Dec 2, 2024
