Adding new HWY_AVX10_2 target #2348

johnplatts · 2024-10-09T14:00:27Z

The upcoming Intel AVX10.2 instruction set (specification: https://www.intel.com/content/www/us/en/content-details/828965/intel-advanced-vector-extensions-10-2-intel-avx10-2-architecture-specification.html) adds the following operations:

BF16 Add/Sub/Mul/Div/Sqrt/[Neg]MulAdd/[Neg]MulSub/ApproximateReciprocal[Sqrt]
BF16 Eq/Ne/Le/Lt/Ge/Gt/Min/Max
IEEE 754-2019 Min/Max for BF16/F16/F32/F64 vectors
BF16/F16/F32/F64 MinMagnitude (equivalent to IfThenElse(Lt(Abs(a), Abs(b)), a, b) if both a[i] and b[i] are non-NaN)
BF16/F16/F32/F64 MaxMagnitude (equivalent to IfThenElse(Lt(Abs(a), Abs(b)), b, a) if both a[i] and b[i] are non-NaN)
F16/BF16/F32->I8/U8 DemoteTo (there is already a use case for F16->I8/U8 DemoteTo in the implementation of I8/U8 Div on AVX3_SPR/AVX10_2/NEON_BF16)
F32->F16 OrderedDemote2To
New floating-point to integer PromoteTo/ConvertTo/DemoteTo instructions that saturate out-of-range non-NaN values to be within the range of the target integer type and convert NaNs to 0
F16->F32 WidenMulPairwiseAdd
U16xU16->U32 WidenMulPairwiseAdd/SatWidenMulPairwiseAccumulate/ReorderWidenMulAccumulate (originally introduced in AVX-VNNI-INT16, but extended to 512-bit vectors on AVX10.2 CPUs that support them)
I8xI8->I32 and U8xU8->I32 SumOfMulQuadAccumulate (originally introduced in AVX-VNNI-INT8, but extended to 512-bit vectors on AVX10.2 CPUs that support them)
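
To make the MinMagnitude/MaxMagnitude equivalences in the list above concrete, here is a minimal scalar sketch of one lane. The function names `MinMagnitudeScalar` and `MaxMagnitudeScalar` are hypothetical and chosen only for illustration; this is not the Highway implementation, and it assumes neither input is NaN.

```cpp
#include <cmath>

// Hypothetical one-lane reference, assuming a and b are both non-NaN.
// Mirrors IfThenElse(Lt(Abs(a), Abs(b)), a, b).
float MinMagnitudeScalar(float a, float b) {
  return std::fabs(a) < std::fabs(b) ? a : b;
}

// Mirrors IfThenElse(Lt(Abs(a), Abs(b)), b, a).
float MaxMagnitudeScalar(float a, float b) {
  return std::fabs(a) < std::fabs(b) ? b : a;
}
```

For example, with a = -3 and b = 2, `MinMagnitudeScalar` returns 2 and `MaxMagnitudeScalar` returns -3. When the magnitudes are equal, the comparison is false, so this sketch returns b for the minimum and a for the maximum.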

GCC 15 and Clang 20, both currently under development and scheduled for release in spring 2025, will support the new AVX10.2 intrinsics.

The new _mm*_cvttsp[h,s,d]_epi* intrinsics available on AVX10.2 should also fix the undefined behavior that occurs with GCC when out-of-range floating-point vectors are converted to integer vectors (this issue was described in #2183).
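
As a rough illustration of the conversion semantics described above (out-of-range non-NaN values saturate, NaN becomes 0), here is a scalar F32->I32 sketch. The name `SaturatingF32ToI32` is hypothetical, truncation toward zero for in-range values is an assumption, and this models the described behavior rather than any actual intrinsic.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical scalar model of a saturating F32->I32 conversion.
// Assumption: in-range values are truncated toward zero.
int32_t SaturatingF32ToI32(float v) {
  if (std::isnan(v)) return 0;  // NaN converts to 0
  // 2^31 is exactly representable as float, whereas INT32_MAX is not.
  if (v >= 2147483648.0f) return std::numeric_limits<int32_t>::max();
  if (v <= -2147483648.0f) return std::numeric_limits<int32_t>::min();
  return static_cast<int32_t>(v);  // in range, so the cast is well defined
}
```

The range checks run before the cast, so the `static_cast` only ever sees values that fit in `int32_t`. That avoids the undefined behavior a plain cast has for out-of-range inputs in C++.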

We also need to move some of the ops for 256-bit or smaller vectors that are currently implemented in the hwy/ops/x86_512-inl.h header on AVX3 targets into a separate header, because support for 512-bit vectors is optional on AVX10.2.


jan-wassenberg · 2024-10-10T14:25:38Z

Thanks for starting the discussion! Looks like GNR has also just been introduced/launched, but that supports 10.1, I think.

Min/MaxNumber (Min with proper NaN handling per IEEE754:2019) and Min/MaxMagnitude look useful, as does F16 WidenMulPairwiseAdd. Would be very happy to see those added :)
I don't see a burning need for bf16 ops. This target is AFAIK the only platform that has them, and just about the only demand I see for bf16 is mul/add, which is mostly covered by the existing WidenMul.
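
For reference, the NaN handling meant by MinNumber above can be sketched per lane as follows. This is a simplified scalar model of the IEEE 754-2019 minimumNumber operation, the name `MinNumberScalar` is hypothetical, and it ignores signaling-NaN details.

```cpp
#include <cmath>

// Hypothetical one-lane model of IEEE 754-2019 minimumNumber:
// a NaN operand is treated as missing data, and -0 orders below +0.
float MinNumberScalar(float a, float b) {
  if (std::isnan(a)) return b;  // also yields NaN if both are NaN
  if (std::isnan(b)) return a;
  if (a == b) return std::signbit(a) ? a : b;  // prefer -0 over +0
  return a < b ? a : b;
}
```

Unlike a plain `a < b ? a : b`, this returns the numeric operand when exactly one input is NaN, regardless of argument order.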

I agree we'd want to split the "AVX3" and "512-bit" aspects of x86_512-inl.h.

How about I make a TODO for around 2025-03 to lay the groundwork by creating the HWY_AVX10_2 (or HWY_AVX102?) target/boilerplate? Would you later like to add some of its functionality?

johnplatts · 2024-10-11T02:31:34Z

MinMagnitude/MaxMagnitude ops are implemented in pull request #2353.

johnplatts · 2024-11-30T18:28:46Z

It is possible to go ahead and implement the HWY_AVX10_2 target now: GCC 14, Clang 18, and Clang 19 have the -mno-evex512 option, which allows the target to be implemented even without full support for the intrinsics for the new AVX10.2 instructions that will arrive in the upcoming GCC 15 and Clang 20 releases.
