NEON: properly implement _high intrinsics #1030

easyaspi314 · 2023-05-31T02:45:51Z

The _high intrinsics merely apply an implicit vget_high or vcombine around most of the widening or narrowing instructions, since AArch64 can no longer address the upper halves of registers directly. There is no need to complicate them further.
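As a rough illustration of the vcombine case, here is a minimal scalar sketch in plain C. The `demo_` types and functions are hypothetical stand-ins invented for this example, not SIMDe's actual definitions; they only model the lane semantics of `vmovn_s16`, `vcombine_s8`, and `vmovn_high_s16`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for 64-bit and 128-bit NEON vector types. */
typedef struct { int8_t  values[8];  } demo_int8x8;
typedef struct { int8_t  values[16]; } demo_int8x16;
typedef struct { int16_t values[8];  } demo_int16x8;

/* Model of vmovn_s16: keep the low 8 bits of each 16-bit lane.
 * The final conversion to int8_t is implementation-defined for values
 * above 127, but wraps on the usual two's-complement targets. */
static demo_int8x8 demo_vmovn_s16(demo_int16x8 a) {
  demo_int8x8 r;
  for (size_t i = 0; i < 8; i++) {
    r.values[i] = (int8_t)(uint8_t)a.values[i];
  }
  return r;
}

/* Model of vcombine_s8: lo fills lanes 0..7, hi fills lanes 8..15. */
static demo_int8x16 demo_vcombine_s8(demo_int8x8 lo, demo_int8x8 hi) {
  demo_int8x16 r;
  memcpy(&r.values[0], lo.values, sizeof(lo.values));
  memcpy(&r.values[8], hi.values, sizeof(hi.values));
  return r;
}

/* Model of vmovn_high_s16: the narrowed lanes of a land in the upper
 * half, above the existing lower half r. Nothing beyond the implicit
 * vcombine is needed. */
static demo_int8x16 demo_vmovn_high_s16(demo_int8x8 r, demo_int16x8 a) {
  return demo_vcombine_s8(r, demo_vmovn_s16(a));
}
```

The widening intrinsics follow the same pattern in the other direction, with vget_high applied to the 128-bit operand before the ordinary widening operation.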

mr-c · 2023-05-31T06:55:37Z

simde/arm/neon/addw_high.h

-    simde_int16x8_private r_;
-    simde_int16x8_private a_ = simde_int16x8_to_private(a);
-    simde_int8x16_private b_ = simde_int8x16_to_private(b);
-
-    SIMDE_VECTORIZE
-    for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
-      r_.values[i] = a_.values[i] + b_.values[i + ((sizeof(b_.values) / sizeof(b_.values[0])) / 2)];
-    }
-
-    return simde_int16x8_from_private(r_);


Hmm. So you think there is no architecture/compiler combination that would produce better code from this SIMDE_VECTORIZE loop than from the fallback of simde_vaddw_s8(a, simde_vget_high_s8(b))?

I am mostly going for ease of implementation on this PR.

A reasonably intelligent compiler should be able to detect the redundant assignment/shuffle and eliminate it. However, I haven't tested the codegen.

GCC and Clang both generate identical code on a downscaled version, eliding the copy.
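For readers who want to try a similar comparison, a downscaled test could look roughly like the sketch below. This is not the code behind the Compiler Explorer links in this thread; the `demo_` names are hypothetical and the types are simplified stand-ins for SIMDe's private vector structs. It puts the direct indexed loop (the form removed in this PR) next to the reuse form that copies the high half and calls the plain widening add.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the private vector types. */
typedef struct { int8_t  values[8];  } demo_int8x8;
typedef struct { int8_t  values[16]; } demo_int8x16;
typedef struct { int16_t values[8];  } demo_int16x8;

/* Direct form: index the upper half of b inside the loop. */
static demo_int16x8 demo_addw_high_direct(demo_int16x8 a, demo_int8x16 b) {
  demo_int16x8 r;
  const size_t n = sizeof(r.values) / sizeof(r.values[0]);
  for (size_t i = 0; i < n; i++) {
    r.values[i] = (int16_t)(a.values[i] + b.values[i + n]);
  }
  return r;
}

/* Model of vget_high_s8: copy lanes 8..15 into a 64-bit vector. */
static demo_int8x8 demo_get_high(demo_int8x16 b) {
  demo_int8x8 r;
  memcpy(r.values, &b.values[8], sizeof(r.values));
  return r;
}

/* Model of vaddw_s8: widen each lane of b and add it to a. */
static demo_int16x8 demo_addw(demo_int16x8 a, demo_int8x8 b) {
  demo_int16x8 r;
  const size_t n = sizeof(r.values) / sizeof(r.values[0]);
  for (size_t i = 0; i < n; i++) {
    r.values[i] = (int16_t)(a.values[i] + b.values[i]);
  }
  return r;
}

/* Reuse form: the copy here is the one the compiler is expected
 * to elide when optimizing. */
static demo_int16x8 demo_addw_high_reuse(demo_int16x8 a, demo_int8x16 b) {
  return demo_addw(a, demo_get_high(b));
}
```

Both functions compute the same result, so the assembly for the two can be compared directly at a given optimization level.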

MSVC x86 emits a few extra instructions with /arch:IA32 whether I use a copy loop or memcpy, but it isn't terrible. https://godbolt.org/z/Y3v4vjz46

Here is /arch:SSE2: https://godbolt.org/z/nWTKMfh7K

However, MSVC uses SSE2 by default 99% of the time; /arch:IA32 is opt-in.

GCC and Clang are the compilers where the scalar path matters, and they emit identical code.

Long story short, 99% free code reuse.

Hold up, the story changes with uint16_t... GCC vomits.

With which version does GCC vomit when compiling the uint16_t functions: the vectorized or the downscaled version?

It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.

> It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.

So you're seeing better code from this PR for GCC?

No. Rather, vaddw_u16 itself has mediocre codegen, and reusing it passes those codegen issues on to vaddw_high_u16. This is because GCC vectorizes it internally, which is better when SIMD is available.

Okay. Is this PR ready, or do you want to make other changes?

mr-c · 2023-09-29T07:41:47Z

@easyaspi314 hey-o, does this PR need more work or should I rebase and merge?

mr-c reviewed May 31, 2023

mr-c force-pushed the neon_simplify_high branch from 45bec97 to 096ef18 on June 1, 2023 10:53

NEON: properly implement _high intrinsics

c3aa2c6

High intrinsics merely have an implicit vget_high or vcombine. There is no need to complicate them further.

mr-c force-pushed the neon_simplify_high branch from bbb79cd to c3aa2c6 on June 11, 2023 08:51

Merge branch 'master' into neon_simplify_high

074896f
