rope_benchmark #3550

jjsjann123 · 2024-12-10T00:12:30Z

Rope benchmark extracted from lightning trace.

jjsjann123 · 2024-12-10T00:14:10Z

benchmarks/python/test_rope.py

+}
+
+
+@pytest.mark.parametrize(


This is the only part that's worth reviewing.

The code above was dumped directly from Kevin's rope example script. (Note that I had to update the script to pass nv_enable_matmul to thunder.jit; otherwise we see segmentation at the nvFuser definition level.)

jjsjann123 · 2024-12-10T00:15:26Z

I also want to add another toy example that sweeps over the batch size, but I'll do that in a separate PR.

naoyam · 2024-12-10T02:55:02Z

@Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

Priya2698 · 2024-12-10T03:39:16Z

> @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?

We will also benchmark backward pass with Thunder backend.

naoyam · 2024-12-10T03:46:27Z

> > @Priya2698 is adding the Thunder backend #3394. Does it mean we can just have the forward functions?
>
> We will also benchmark backward pass with Thunder backend.

Yes, so, we don't need to have the backward implementations explicitly, right?

jjsjann123 · 2024-12-16T05:36:23Z

Looking at the Thunder-nvFuser timing.

Strangely, the benchmark numbers don't match the ones from Kevin's example.
These are the measurements from pytest:

Name (time in us)                                                                                       Min                   Max                  Mean            StdDev                Median               IQR            Outliers         OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              204.8290 (1.0)        212.5130 (1.0)        207.1972 (1.0)      2.5573 (2.49)       206.0485 (1.0)      4.0260 (4.17)          2;0  4,826.3200 (1.0)          10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       320.3510 (1.56)       324.3850 (1.53)       322.8819 (1.56)     1.3519 (1.32)       322.8555 (1.57)     1.8470 (1.91)          3;0  3,097.1076 (0.64)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_qwen2_rope']              356.9320 (1.74)       360.3840 (1.70)       357.8536 (1.73)     1.0271 (1.0)        357.7265 (1.74)     0.9920 (1.03)          1;1  2,794.4388 (0.58)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_mistral_nemo_rope']       428.8940 (2.09)       432.8350 (2.04)       430.9671 (2.08)     1.1889 (1.16)       431.0560 (2.09)     1.8540 (1.92)          3;0  2,320.3627 (0.48)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']               548.0630 (2.68)       554.1090 (2.61)       552.0020 (2.66)     1.6203 (1.58)       552.3545 (2.68)     0.9650 (1.0)           2;2  1,811.5876 (0.38)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']         621.6160 (3.03)       626.1340 (2.95)       623.5093 (3.01)     1.6043 (1.56)       623.0065 (3.02)     2.3690 (2.45)          4;0  1,603.8253 (0.33)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='hf_phi3_rope']             1,022.0870 (4.99)     1,028.2720 (4.84)     1,024.4110 (4.94)     2.0313 (1.98)     1,024.3360 (4.97)     3.5130 (3.64)          2;0    976.1707 (0.20)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_2_7b_hf_rope']       1,308.1660 (6.39)     1,313.6600 (6.18)     1,310.4751 (6.32)     2.0083 (1.96)     1,310.5750 (6.36)     3.5940 (3.72)          5;0    763.0820 (0.16)         10           1
test_rope_variations_fwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,373.1600 (6.70)     1,382.4350 (6.51)     1,377.5739 (6.65)     2.3928 (2.33)     1,377.8270 (6.69)     2.2130 (2.29)          2;1    725.9139 (0.15)         10           1
test_rope_variations_bwd_benchmark[executor='thunder'-rope_variation='llama_3_8B_rope']          1,925.9490 (9.40)     1,936.4170 (9.11)     1,931.5364 (9.32)     2.8123 (2.74)     1,931.2535 (9.37)     2.3720 (2.46)          3;1    517.7226 (0.11)         10           1

But if I run the manual rope_example scripts, I get these numbers:

root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_phi3.py --execs Thunder-nvFuser
                             Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  microsoft/Phi-3.5-mini-instruct           1  ...             0.597             0.739
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_qwen2.py --execs Thunder-nvFuser
                      Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  Qwen/Qwen2.5-7B-Instruct           1  ...             0.397             0.507
root@a9fb56dcd91f:/volume/rope/rope_examples# python hf_mistral_nemo.py --execs Thunder-nvFuser
                              Model  Batch-Size  ...  Forward-Time(ms) Backward-Time(ms)
0  mistralai/Mistral-Nemo-Base-2407           1  ...             0.593             0.322
root@a9fb56dcd91f:/volume/rope/rope_examples# python lit_gpt_models.py --execs Thunder-nvFuser
           Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-2-7b-hf           2             4096  Thunder-nvFuser             0.629              0.960
        Model  Batch-Size  Sequence-Length         Executor  Forward-Time(ms)  Backward-Time(ms)
0  Llama-3-8B           2             8192  Thunder-nvFuser             1.383              1.567

I'll double-check the measurement script, as well as the compile options (i.e., the Thunder trace options).

We need to do the same sanity check for torch.compile later.
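
To make the mismatch easier to see, here is a rough sketch that puts the two sets of numbers side by side. The values are copied from the pytest medians (in us) and the script output (in ms) above. The pairing of each script row with a `rope_variation` name is my assumption based on the model names, and `compare` is just a hypothetical helper for illustration.

```python
# Median forward/backward times from the pytest-benchmark table, in microseconds.
pytest_median_us = {
    "hf_qwen2_rope": (206.0485, 357.7265),
    "hf_mistral_nemo_rope": (322.8555, 431.0560),
    "hf_phi3_rope": (552.3545, 1024.3360),
    "llama_2_7b_hf_rope": (623.0065, 1310.5750),
    "llama_3_8B_rope": (1377.8270, 1931.2535),
}

# Forward/backward times printed by the rope_examples scripts, in milliseconds.
# Assumption: each script row corresponds to the variation with the matching model name.
script_ms = {
    "hf_qwen2_rope": (0.397, 0.507),
    "hf_mistral_nemo_rope": (0.593, 0.322),
    "hf_phi3_rope": (0.597, 0.739),
    "llama_2_7b_hf_rope": (0.629, 0.960),
    "llama_3_8B_rope": (1.383, 1.567),
}


def compare(pytest_us, script):
    """Return {variation: (fwd_ratio, bwd_ratio)}, where ratio = pytest time / script time."""
    ratios = {}
    for name, (fwd_us, bwd_us) in pytest_us.items():
        fwd_ms, bwd_ms = script[name]
        # Convert the pytest numbers from us to ms before dividing.
        ratios[name] = (fwd_us / 1000.0 / fwd_ms, bwd_us / 1000.0 / bwd_ms)
    return ratios


ratios = compare(pytest_median_us, script_ms)
for name, (fwd, bwd) in ratios.items():
    print(f"{name:24s} fwd pytest/script = {fwd:.2f}   bwd pytest/script = {bwd:.2f}")
```

Under that pairing, the forward numbers for the two Llama variations agree to within about 1%, while the qwen2 and mistral_nemo forward numbers from pytest are roughly half of what the scripts report. The backward ratios are also inconsistent across variations, so the gap does not look like a single constant offset.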

jjsjann123 added 6 commits December 9, 2024 14:14

benchmark added

f516886

adding other benchmarks from Kevin's example

1d6920b

hf_mistral_nemo added

800cbd8

fixing strided inputs

0c81c54

typo

40c554f

oops, missed an input

5a07055

jjsjann123 requested review from naoyam, kevinstephano, xwang233 and Priya2698 December 10, 2024 00:12

jjsjann123 marked this pull request as draft December 10, 2024 21:28

jjsjann123 added 13 commits December 15, 2024 19:03

WIP

2074ae5

Merge remote-tracking branch 'origin/main' into HEAD

d090f3c

WIP

d9f06f3

WIP

dc2211b

WIP

8882d06

WIP

44d8b55

WAR

cb4db6b

adding qwen2

83bbc7f

fixing qwen2

602c516

hf_phi3 added

3f28aeb

wip

d7cbf20

add hf_mistral_nemo

156a7a5

keep forgetting json

b8752bf
