[Misc]: nsys profile can not show CUDA HW on all devices #10708

Open
1 task done
irasin opened this issue Nov 27, 2024 · 1 comment
irasin (Contributor) commented Nov 27, 2024

Anything you want to discuss about vllm.

I want to use nsys profile to check the performance of vLLM.
I tested vLLM with a Llama-2-7B model using TP=4 on four NVIDIA A10 GPUs. Here is my test script, which does not use CUDA graphs:

import time
from typing import List, Tuple

import torch
import vllm
from vllm import LLM, SamplingParams

print(vllm.__file__)


def generate_fixed_shape_requests(tokenizer_or_model_path,
                                  batch_size: int = 1,
                                  input_len: int = 2048,
                                  output_len: int = 2048):
    # Build batch_size identical requests whose prompt re-encodes to exactly input_len tokens.
    if isinstance(tokenizer_or_model_path, str):
        from transformers import AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained(tokenizer_or_model_path, trust_remote_code=True)
    else:
        tokenizer = tokenizer_or_model_path

    prompt = None
    vocab_size = tokenizer.vocab_size
    token_id = int(vocab_size / 5)
    pass_num = 5
    while True:
        start_len = input_len - pass_num if (input_len - pass_num) > 0 else 0
        end_len = input_len + pass_num
        for i in range(start_len, end_len):
            prompt_token = [token_id] * i
            prompt = tokenizer.decode(prompt_token, skip_special_tokens=True)
            if len(tokenizer.encode(prompt)) == input_len:
                break

        if len(tokenizer.encode(prompt)) == input_len:
            break

        token_id += 1

    requests: List[Tuple[str, int, int]] = []
    for _ in range(batch_size):
        requests.append((prompt, input_len, output_len))

    return requests


def test(llm, batch_size, prompt_len, output_len):
    print(f"batch_size = {batch_size}, prompt_len = {prompt_len}, output_len = {output_len}")
    tokenizer = llm.get_tokenizer()
    res = generate_fixed_shape_requests(tokenizer, batch_size, prompt_len, output_len)
    prompts = [i[0] for i in res]

    # Warm up once with a single output token before the timed runs.
    warmup_sampling_params = SamplingParams(max_tokens=1, ignore_eos=True, temperature=0)
    sampling_params = SamplingParams(max_tokens=output_len, ignore_eos=True, temperature=0)
    llm.generate(prompts, warmup_sampling_params)

    avg_time = 0
    run_cnt = 1
    for cnt in range(run_cnt):

        torch.cuda.synchronize()
        start_time = time.perf_counter()

        outputs = llm.generate(prompts, sampling_params)

        torch.cuda.synchronize()
        stop_time = time.perf_counter()
        last_duration = (stop_time - start_time) * 1e3
        avg_time += last_duration
        print(f"output_len {output_len}, {cnt}-th time {last_duration} ms")

    output_tokens = 0
    avg_time = avg_time / run_cnt

    for idx, output in enumerate(outputs):
        prompt = output.prompt
        generated_text = output.outputs[0].text
        if idx <= 0:
            # print(f"Prompt: {prompt}")
            print(f"Generated text: {generated_text}\n\n")

        output_tokens += len(output.outputs[0].token_ids)
    assert output_tokens == batch_size * output_len, f"{output_tokens} != {batch_size} * {output_len}"

    output_token_per_sec = output_tokens / (avg_time / 1e3)
    print(f"output_tokens = {output_tokens}, avg time = {avg_time} ms, output_token_per_sec = {output_token_per_sec}\n")


if __name__ == "__main__":

    gpu_memory_utilization = 0.9

    model_path = "models/Llama-2-7b-hf"
    tp_size = 4

    # load_format = "auto"
    load_format = "dummy"

    quantization = None
    if "awq" in model_path.lower():
        quantization = "awq"

    llm = LLM(
        model=model_path,
        tensor_parallel_size=tp_size,
        load_format=load_format,
        trust_remote_code=True,
        distributed_executor_backend="ray",
        gpu_memory_utilization=gpu_memory_utilization,
        quantization=quantization,
        enforce_eager=True,
    )

    test(llm, 1, 1, 4)

And I profile it with Nsight Systems:

nsys profile --stats=true -o nv_7B_prof_eager  --force-overwrite=true --trace-fork-before-exec=true  --gpu-metrics-device=all    python3 test_fixed.py
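A variant I may also try, sketched here under the assumption that this Nsight Systems version supports --capture-range=cudaProfilerApi, is to scope the capture around the timed generate call instead of tracing from process start. The torch.cuda.profiler.start()/stop() lines are illustrative additions, not part of the script above, and whether this changes what the worker processes show is exactly the question here.

# Sketch only: limit the capture window via the CUDA profiler API.
nsys profile -t cuda,nvtx --capture-range=cudaProfilerApi --capture-range-end=stop \
    --trace-fork-before-exec=true --force-overwrite=true -o nv_7B_prof_scoped \
    python3 test_fixed.py

# Illustrative additions inside test(), around the timed generate call:
#     torch.cuda.profiler.start()   # cudaProfilerStart()
#     outputs = llm.generate(prompts, sampling_params)
#     torch.cuda.profiler.stop()    # cudaProfilerStop()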

With the first nsys command above and TP=4, only the driver worker's process shows a CUDA HW row containing the CUDA kernels:
[screenshot]

I wonder why the CUDA HW rows for the other three worker processes are missing here.
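As a sanity check (a sketch; report names vary by Nsight Systems version, newer releases use cuda_gpu_kern_sum while older ones call it gpukernsum), dumping the kernel summary from the report can tell whether kernels from the worker ranks were recorded at all, or just not rendered as CUDA HW rows in the GUI:

nsys stats --report cuda_gpu_kern_sum nv_7B_prof_eager.nsys-rep
# older nsys releases: nsys stats --report gpukernsum nv_7B_prof_eager.nsys-rep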

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
irasin added the misc label Nov 27, 2024
irasin (Contributor, Author) commented Nov 27, 2024

However, if I profile with enforce_eager=False, I can see four CUDA HW rows in different processes:
[screenshot]

But the final CUDA graph GraphExec only appears on the first GPU:
[screenshot]

Is this a problem with Ray or with Nsight itself?
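One thing I still want to try on the CUDA-graph run (a sketch, assuming the installed nsys is recent enough to support --cuda-graph-trace): with --cuda-graph-trace=node, the kernels inside each graph launch are traced individually instead of as a single GraphExec block, which might make the per-GPU activity visible:

nsys profile --stats=true -o nv_7B_prof_graph --force-overwrite=true \
    --trace-fork-before-exec=true --cuda-graph-trace=node \
    python3 test_fixed.py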
