NEON: properly implement _high intrinsics #1030
simde_int16x8_private r_;
simde_int16x8_private a_ = simde_int16x8_to_private(a);
simde_int8x16_private b_ = simde_int8x16_to_private(b);

SIMDE_VECTORIZE
for (size_t i = 0 ; i < (sizeof(r_.values) / sizeof(r_.values[0])) ; i++) {
  r_.values[i] = a_.values[i] + b_.values[i + ((sizeof(b_.values) / sizeof(b_.values[0])) / 2)];
}

return simde_int16x8_from_private(r_);
Hmm. So you think there is no architecture/compiler combination that would produce better code from this vectorized loop than the fallback of simde_vaddw_s8(a, simde_vget_high_s8(b))?
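To make the two options in this question concrete, here is a minimal scalar sketch of both: the direct loop over the upper half of b, and the fallback that reuses a widening add on top of a "get high" helper. The model_* types and functions are hypothetical stand-ins written for illustration, not SIMDe's actual definitions, and the sketch says nothing about which variant produces better codegen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for SIMDe's private vector types. */
typedef struct { int8_t  values[16]; } model_i8x16;
typedef struct { int8_t  values[8];  } model_i8x8;
typedef struct { int16_t values[8];  } model_i16x8;

/* Model of vget_high_s8: copy lanes 8..15 into a 64-bit vector. */
static model_i8x8 model_vget_high_s8(model_i8x16 b) {
  model_i8x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = b.values[i + 8];
  return r;
}

/* Model of vaddw_s8: widen each 8-bit lane of b and add it to a.
 * Inputs here are assumed small enough that the sum fits in int16_t. */
static model_i16x8 model_vaddw_s8(model_i16x8 a, model_i8x8 b) {
  model_i16x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = (int16_t)(a.values[i] + b.values[i]);
  return r;
}

/* Variant 1: index the upper half of b directly, as in the loop above. */
static model_i16x8 model_vaddw_high_s8_loop(model_i16x8 a, model_i8x16 b) {
  model_i16x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = (int16_t)(a.values[i] + b.values[i + 8]);
  return r;
}

/* Variant 2: reuse vaddw on top of vget_high (the fallback). */
static model_i16x8 model_vaddw_high_s8_reuse(model_i16x8 a, model_i8x16 b) {
  return model_vaddw_s8(a, model_vget_high_s8(b));
}

/* Returns 1 if both variants agree lane by lane on a sample input. */
static int model_variants_agree(void) {
  model_i16x8 a;
  model_i8x16 b;
  for (size_t i = 0; i < 8; i++)  a.values[i] = (int16_t)(1000 * (int)i - 3000);
  for (size_t i = 0; i < 16; i++) b.values[i] = (int8_t)((int)i * 17 - 128);

  model_i16x8 x = model_vaddw_high_s8_loop(a, b);
  model_i16x8 y = model_vaddw_high_s8_reuse(a, b);
  for (size_t i = 0; i < 8; i++) {
    if (x.values[i] != y.values[i]) return 0;
  }
  return 1;
}
```

Both variants compute the same lanes; the open question in the thread is only whether a given compiler elides the intermediate copy in the second one.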
I am mostly going for ease of implementation on this PR.
A reasonably intelligent compiler should be able to detect the redundant assignment/shuffle and eliminate it. However, I haven't tested codegen.
GCC and Clang both generate identical code on a downscaled version, eliding the copy.
MSVC x86 emits a few extra instructions with /arch:IA32 whether I use a copy loop or memcpy, but it isn't terrible: https://godbolt.org/z/Y3v4vjz46
Here is /arch:SSE2: https://godbolt.org/z/nWTKMfh7K
However, 99% of the time MSVC will use SSE2 by default; /arch:IA32 is opt-in.
GCC and Clang are the ones where scalar counts, and they emit identical code.
Long story short, 99% free code reuse.
Hold up, the story changes with uint16_t... GCC vomits.
With which version does GCC vomit when compiling the uint16_t functions: the vectorized or the downscaled version?
It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.
> It actually seems to be the opposite problem. The autovec codegen is actually bad on vaddw_u16. GCC couldn't autovec the one-shot one.
So you're seeing better code from this PR for GCC?
No. Rather, vaddw_u16 has mediocre codegen, and reusing it passes those codegen issues on to vaddw_high_u16. This is because GCC vectorizes it internally, which is better when SIMD is available.
Okay. Is this PR ready, or do you want to make other changes?
High intrinsics merely have an implicit vget_high or vcombine. There is no need to complicate them further.
@easyaspi314 hey-o, does this PR need more work or should I rebase and merge?
High intrinsics merely have an implicit vget_high or vcombine, added as a helper for most of the widening and narrowing instructions because 64-bit ARM can no longer address the upper halves of registers directly. There is no need to complicate them further.
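The vcombine half of that statement can be sketched the same way for a narrowing intrinsic. The example below models vmovn_high_s16(r, a) as vcombine_s8(r, vmovn_s16(a)) in plain scalar C. The model_* names and struct types are hypothetical illustrations rather than SIMDe's own code, and the sketch assumes the usual NEON semantics where the narrowing move keeps the low 8 bits of each lane.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for SIMDe's private vector types. */
typedef struct { int8_t  values[8];  } model_s8x8;
typedef struct { int8_t  values[16]; } model_s8x16;
typedef struct { int16_t values[8];  } model_s16x8;

/* Keep the low 8 bits of x, reinterpreted as a signed value,
 * without relying on implementation-defined conversions. */
static int8_t model_truncate_s16(int16_t x) {
  uint8_t u = (uint8_t)x;  /* well-defined modulo-256 reduction */
  return (u >= 128) ? (int8_t)((int)u - 256) : (int8_t)u;
}

/* Model of vmovn_s16: narrow each 16-bit lane to 8 bits. */
static model_s8x8 model_vmovn_s16(model_s16x8 a) {
  model_s8x8 r;
  for (size_t i = 0; i < 8; i++) r.values[i] = model_truncate_s16(a.values[i]);
  return r;
}

/* Model of vcombine_s8: low goes to lanes 0..7, high to lanes 8..15. */
static model_s8x16 model_vcombine_s8(model_s8x8 low, model_s8x8 high) {
  model_s8x16 r;
  for (size_t i = 0; i < 8; i++) {
    r.values[i]     = low.values[i];
    r.values[i + 8] = high.values[i];
  }
  return r;
}

/* The "high" narrowing form is the plain form plus an implicit vcombine. */
static model_s8x16 model_vmovn_high_s16(model_s8x8 r, model_s16x8 a) {
  return model_vcombine_s8(r, model_vmovn_s16(a));
}

/* Small demo input: lower half 1..8, upper half narrowed from a. */
static model_s8x16 model_narrow_high_demo(void) {
  model_s8x8  r = {{1, 2, 3, 4, 5, 6, 7, 8}};
  model_s16x8 a = {{0x0102, -1, 0x7F80, 0, 300, -300, 127, -128}};
  return model_vmovn_high_s16(r, a);
}
```

Written this way, each _high variant is one line on top of an existing intrinsic, which is the code reuse the PR description argues for.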