[Misc]: nsys profile can not show CUDA HW on all devices #10708

Open
1 task done
irasin opened this issue Nov 27, 2024 · 1 comment
irasin (Contributor) commented Nov 27, 2024

Anything you want to discuss about vllm.

I want to use nsys profile to check the performance of vLLM.
I tested vLLM with a Llama-2-7B model using TP=4 on four NVIDIA A10 GPUs. Here is my test script, which does not use CUDA graphs:

import time
from typing import List, Tuple

import torch
import vllm
from vllm import LLM, SamplingParams

print(vllm.__file__)


def generate_fixed_shape_requests(tokenizer_or_model_path,
                                  batch_size: int = 1,
                                  input_len: int = 2048,
                                  output_len: int = 2048):
    # Build batch_size identical requests whose prompt re-encodes to exactly input_len tokens.
    if isinstance(tokenizer_or_model_path, str):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_or_model_path, trust_remote_code=True)
    else:
        tokenizer = tokenizer_or_model_path

    prompt = None
    vocab_size = tokenizer.vocab_size
    token_id = int(vocab_size / 5)
    pass_num = 5
    while True:
        start_len = input_len - pass_num if (input_len - pass_num) > 0 else 0
        end_len = input_len + pass_num
        for i in range(start_len, end_len):
            prompt_token = [token_id] * i
            prompt = tokenizer.decode(prompt_token, skip_special_tokens=True)
            if len(tokenizer.encode(prompt)) == input_len:
                break

        if len(tokenizer.encode(prompt)) == input_len:
            break

        token_id += 1

    requests: List[Tuple[str, int, int]] = []
    for _ in range(batch_size):
        requests.append((prompt, input_len, output_len))

    return requests


def test(llm, batch_size, prompt_len, output_len):
    print(f"batch_size = {batch_size}, prompt_len = {prompt_len}, output_len = {output_len}")
    tokenizer = llm.get_tokenizer()
    res = generate_fixed_shape_requests(tokenizer, batch_size, prompt_len, output_len)
    prompts = [i[0] for i in res]

    # Warm up once with a single output token before the timed runs.
    warmup_sampling_params = SamplingParams(max_tokens=1, ignore_eos=True, temperature=0)
    sampling_params = SamplingParams(max_tokens=output_len, ignore_eos=True, temperature=0)
    llm.generate(prompts, warmup_sampling_params)

    avg_time = 0
    run_cnt = 1
    for cnt in range(run_cnt):

        torch.cuda.synchronize()
        start_time = time.perf_counter()

        outputs = llm.generate(prompts, sampling_params)

        torch.cuda.synchronize()
        stop_time = time.perf_counter()
        last_duration = (stop_time - start_time) * 1e3
        avg_time += last_duration
        print(f"output_len {output_len}, {cnt}-th time {last_duration} ms")

    output_tokens = 0
    avg_time = avg_time / run_cnt

    for idx, output in enumerate(outputs):
        prompt = output.prompt
        generated_text = output.outputs[0].text
        if idx <= 0:
            # print(f"Prompt: {prompt}")
            print(f"Generated text: {generated_text}\n\n")

        output_tokens += len(output.outputs[0].token_ids)
    assert output_tokens == batch_size * output_len, f"{output_tokens} != {batch_size} * {output_len}"

    output_token_per_sec = output_tokens / (avg_time / 1e3)
    print(f"output_tokens = {output_tokens}, avg time = {avg_time} ms, output_token_per_sec = {output_token_per_sec}\n")


if __name__ == "__main__":

    gpu_memory_utilization = 0.9

    model_path = "models/Llama-2-7b-hf"
    tp_size = 4

    # load_format = "auto"
    load_format = "dummy"

    quantization = None
    if "awq" in model_path.lower():
        quantization = "awq"

    llm = LLM(
        model=model_path,
        tensor_parallel_size=tp_size,
        load_format=load_format,
        trust_remote_code=True,
        distributed_executor_backend="ray",
        gpu_memory_utilization=gpu_memory_utilization,
        quantization=quantization,
        enforce_eager=True,
    )

    test(llm, 1, 1, 4)

And I profile it with Nsight Systems:

nsys profile --stats=true -o nv_7B_prof_eager  --force-overwrite=true --trace-fork-before-exec=true  --gpu-metrics-device=all    python3 test_fixed.py
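A variant I may also try, sketched here under the assumption that this Nsight Systems version supports --capture-range=cudaProfilerApi, is to scope the capture around the timed generate call instead of tracing from process start. The torch.cuda.profiler.start()/stop() lines are illustrative additions, not part of the script above, and whether this changes what the worker processes show is exactly the question here.

# Sketch only: limit the capture window via the CUDA profiler API.
nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi --capture-range-end=stop \
    --trace-fork-before-exec=true --force-overwrite=true -o nv_7B_prof_scoped \
    python3 test_fixed.py

# Illustrative additions inside test(), around the timed generate call:
#     torch.cuda.profiler.start()   # cudaProfilerStart()
#     outputs = llm.generate(prompts, sampling_params)
#     torch.cuda.profiler.stop()    # cudaProfilerStop()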

With the first nsys command above and TP=4, only the driver worker's process shows a CUDA HW row containing the CUDA kernels:
[screenshot]

I wonder why the CUDA HW rows for the other three worker processes are missing here.
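As a sanity check (a sketch; report names vary by Nsight Systems version, newer releases use cuda_gpu_kern_sum while older ones call it gpukernsum), dumping the kernel summary from the report can tell whether kernels from the worker ranks were recorded at all, or just not rendered as CUDA HW rows in the GUI:

nsys stats --report cuda_gpu_kern_sum nv_7B_prof_eager.nsys-rep
# older nsys releases: nsys stats --report gpukernsum nv_7B_prof_eager.nsys-rep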

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
irasin added the misc label Nov 27, 2024
irasin (Contributor, Author) commented Nov 27, 2024

However, if I profile with enforce_eager=False, I can see four CUDA HW rows in different processes:
[screenshot]

But the final CUDA graph GraphExec only appears on the first GPU:
[screenshot]

Is this a problem with Ray or with Nsight itself?
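One thing I still want to try on the CUDA-graph run (a sketch, assuming the installed nsys is recent enough to support --cuda-graph-trace): with --cuda-graph-trace=node, the kernels inside each graph launch are traced individually instead of as a single GraphExec block, which might make the per-GPU activity visible:

nsys profile --stats=true -o nv_7B_prof_graph --force-overwrite=true \
    --trace-fork-before-exec=true --cuda-graph-trace=node \
    python3 test_fixed.py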
