Release Notes

Here we draft the release notes for the next release.

Note: format is [summary] [commit hash or PR#] [author(s)]

Use the release notes helper script to generate the preliminary list. Then group the changes, review the descriptions, and look out for ????

The first line of the commit message is usually a good summary, but please think through each entry and (re)write a summary that helps users quickly determine whether the change would be interesting or useful to them. For example, include the name of the intrinsic/function in the summary so that users don't have to click through to each commit themselves.
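When reviewing a drafted list, it can help to check each entry mechanically against the `[summary] [commit hash or PR#] [author(s)]` format. The following is a minimal sketch, not the helper script itself: `parse_entry` is a hypothetical name, and it assumes hashes are 7 hex digits, PR numbers are written as `(#123)`, and authors as `@handle`.

```python
import re

# One trailing token of an entry: a 7-hex-digit commit hash,
# a "(#123)" PR number, or an "@handle" author (assumed conventions).
_TOKEN = re.compile(
    r"^(?:(?P<hash>[0-9a-f]{7})|\(#(?P<pr>\d+)\)|@(?P<author>[\w-]+))$"
)

def parse_entry(line):
    """Split one release-note entry into summary, hashes, PR numbers, authors."""
    words = line.split()
    hashes, prs, authors = [], [], []
    # Peel reference tokens off the end; whatever remains is the summary.
    while words:
        m = _TOKEN.match(words[-1])
        if not m:
            break
        words.pop()
        if m.group("hash"):
            hashes.insert(0, m.group("hash"))
        elif m.group("pr"):
            prs.insert(0, int(m.group("pr")))
        else:
            authors.insert(0, m.group("author"))
    return {
        "summary": " ".join(words),
        "hashes": hashes,
        "prs": prs,
        "authors": authors,
    }
```

An entry whose parsed summary is empty, or that has no hash and no PR number, is a good candidate for a manual rewrite.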

SIMDe 0.8

Summary

The complete set of implementations for all NEON intrinsics has been finished! (@yyctw @wewe5215)
SIMDe PRs are tested using Fedora Rawhide (@junaruga)

X86

There are a total of 6876 SIMD functions on x86, 2937 (42.71%) of which have been implemented in SIMDe so far. Specifically for AVX-512, of the 5270 functions currently in AVX-512, SIMDe implements 1516 (28.77%).
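The percentages in this summary are plain ratios of implemented to total functions, rounded to two decimal places. A small sketch for re-checking them when the counts change (`coverage` is a hypothetical helper name, not part of any SIMDe tooling):

```python
def coverage(implemented, total):
    """Format a coverage figure the way the summary lines do."""
    percent = 100 * implemented / total
    return f"{implemented} of {total} ({percent:.2f}%)"

# Counts taken from the paragraph above.
print(coverage(2937, 6876))  # all x86 functions
print(coverage(1516, 5270))  # AVX-512 only
```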

Newly added function families

AES: 5 of 6 (83.33%)

Newly added AVX512 function families

cvtus_storeu: 1 of 18 (5.56%) implemented.
fpclass: 3 of 24 (12.50%) implemented.
i32gather: 1 of 8 (12.50%) implemented.
i64gather: 8 of 8 💯
kand
permutex
rcp
reduce

Additions to existing families

AVX512BW: 6 additional, 337 total of 790 (42.66%)
AVX512DQ: 5 additional, 112 total of 376 (29.79%)
AVX512F: 48 additional, 1087 total of 2824 (38.49%)
AVX512_FP16: 15 additional, 17 total of 1105 (1.54%)

Neon

SIMDe currently implements 6670 out of 6670 (100.00%) NEON functions; up from 56.46% in the previous release!

Newly added families

abal
abal_high
abd
abdh
abdl_high
addhn_high
aes
bfdot
bfdot_lane
cadd_rot
cale
calt
cmla_lane
cmla_rot_lane
copy_lane
cvt_high
cvt_n
cvta
cvtn
cvtp
cvtx
cvtx_high
div
dupb_lane
duph_lane
eor3
fmlal
fms
fms_lane
fms_n
ld2_dup
ld2_lane
ld3_dup
ld3_lane
ld4_dup
maxnmv
minnmv
mla_lane
mla_high_lane
mls_lane
mlsl_high_lane
mmla
mull_high_lane
mull_high_n
mulx
mulx_lane
pmaxnm
pminnm
qdmlal
qdmlal_high
qdmlal_high_lane
qdmlal_high_n
qdmlal_lane
qdmlal_n
qdmlsl
qdmlsl_high
qdmlsl_high_lane
qdmlsl_high_n
qdmlsl_lane
qdmlsl_n
qdmlslh
qdmlslh_lane
qdmulhh
qdmulhh_lane
qdmull_high
qdmull_high_lane
qdmull_high_n
qdmull_lane
qdmull_n
qdmullh_lane
qmovun_high
qrdmlah
qrdmlah_lane
qrdmlahh
qrdmlahh_lane
qrdmlsh
qrdmlsh_lane
qrdmlshh
qrdmlshh_lane
qrdmulhh_lane
qrshl
qrshlh
qrshrn_high_n
qrshrnh_n
qrshrun_high_n
qrshrunh_n
qshl_n
qshlh_n
qshluh_n
qshrn_high_n
qshrnh_n
qshrun_high_n
qshrunh_n
raddhn
raddhn_high
rax
recp
rnd32x
rnd64z
rnda
rndx
rshrn_high_n
rsubhn
set_lane
sha1
sha1h
sha256
sha512
shll_high_n
shrn_high_n
sli_n
sm3
sm4
sqrt
st1_x2
st1_x3
st1_x4
st1q_x2
st1q_x3
st1q_x4
subhn_high
sudot_lane
usdot
usdot_lane

Finally complete families

cvtn
mla_lane

MSA

Details

simde-f16: improve _Float16 usage; better INFHF/NANHF defs 8910057 @mr-c
simde_float16: prefer __fp16 if available aba26f6 @mr-c

Implementation of Arm intrinsics

NEON

cvtn: vcvtnq_{s32_f32,s64_f64}: add SSE & AVX512 optimized implementations e134cc7 @mr-c
cvtn: vcvtnq_u32_f32 is a V8 function 8432c70 @mr-c
min: Remove non-working MMX specialization from simde_vmin_s16 6858b92 @M-HT
shll: Extend constant range in simde_vshll_n_XXX intrinsics (#1064) beb1c61 @M-HT
various: Implement some f16XN types and f16 related intrinsics. (#1071) aae2245 @yyctw
qtbl/qtbx polyfills for A32V7 a2fef9e @easyaspi314
arm: use SIMDE_ARCH_ARM_FMA 7198d6d @mr-c
arm neon: Complex operations from Armv8.3-a (#1077) d08d67c @wewe5215
more fp16 using intrinsics supported by architecture v7 (skip version) (#1081) 5e7c4d4 @yyctw
st1{,q}_*_x{2,3,4}: initial implementation (#1082) 879d1a0 @yyctw
part 1 of implement all intrinsics supported by architecture A64 (#1090) 2eedece @yyctw
Add AES instructions. 23adcd2 805ccd2 @yyctw
Modified simde_float16 to simde_float16_t (#1100) 8a05dc6 @yyctw
implement all intrinsics supported by architecture A64-remaining part (#1093) 018ba24 @yyctw
add enable vmlaq_laneq_f32 and vcvtq_n_f64_u64 c7d314b @yyctw
implement all bf16-related intrinsics (#1110) c59db7c @yyctw
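For readers unfamiliar with the cvtn family: the vcvtn conversions round to the nearest integer with ties going to even, and saturate on overflow. The sketch below models a single lane of vcvtnq_s32_f32 in Python under that assumption (with NaN converting to 0). It is a reference model only, not SIMDe's implementation, and `vcvtn_s32_f32_lane` is a hypothetical name.

```python
import math

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def vcvtn_s32_f32_lane(x):
    """Model one lane: round to nearest (ties to even), saturate to int32."""
    if math.isnan(x):
        return 0  # assumed: NaN converts to zero
    if math.isinf(x):
        return INT32_MAX if x > 0 else INT32_MIN
    # Python's round() on a float uses round-half-to-even.
    return max(INT32_MIN, min(INT32_MAX, round(x)))

def vcvtnq_s32_f32(lanes):
    """Apply the lane model to a 4-element vector."""
    return [vcvtn_s32_f32_lane(x) for x in lanes]
```

The ties-to-even rule is what distinguishes this family from cvta, which rounds ties away from zero.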

SVE Intrinsics

WASM intrinsics

simd128: fix altivec_p7 version of wasm_f64x2_pmin 96d6e53 @mr-c
simd128: add missing unsigned functions ea5e283 @mr-c
simd128 f{32x4,64x2}_min: add workaround for a gcc<6 issue d5d6d10 @mr-c
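As background for the pmin fix: WebAssembly's pseudo-minimum is defined lane-wise as `b < a ? b : a`, which differs from the regular min in how it treats NaN and signed zeros. A scalar Python model of that definition (illustrative only; `f64x2_pmin` here is a plain function on 2-tuples, not the SIMDe API):

```python
def pmin_lane(a, b):
    """Pseudo-minimum of one lane: b if b < a, otherwise a."""
    return b if b < a else a

def f64x2_pmin(a, b):
    """Apply the lane rule to two 2-element vectors."""
    return tuple(pmin_lane(x, y) for x, y in zip(a, b))
```

Because any comparison with NaN is false, the result is always the first operand when either lane is NaN, so the operation is not symmetric in its arguments.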

x86 intrinsics

sse{,2,4.1}, avx{,2} *stream{,load}: use __builtin_nontemporal_{load,store} 6ce6030 @mr-

SSE*

sse: Fix issues related to MXCSR register (#1060) 653aba8 @M-HT
sse: implement _mm_movelh_ps for Arm64 514564e @mr-c
sse _mm_movemask_ps: remove unused code fba97e4 @mr-c
sse2 _mm_pause: more archs, add a basic test 692a2e8 @mr-
sse4.1: use logical OR instead of bitwise OR in neon impl of _mm_testnzc_si128 edd4678 @mr-c
sse4.1 _mm_testz_si128: fix backwards short circuit logic f132275 @mr-c

AVX

run test from #926 ce9708c @mr-c
simde_mm256_shuffle_pd fix for natural vector size < 128 1594d7c @mr-c

AVX2

AVX512

fpclass: naive implementation 353bf5f @mr-c
loadu: fix native detection 305f434 @mr-c
set: add simde_x_mm512_set_m256{,d} 67e0c50 @mr-c
gather: add MSVC native fallbacks 7b7e3f6 @mr-c

AVX512FP16 / m512h initial support e97691c @mr-c
fix many native aliases 75014b9 @mr-c

CLMUL

fix natives, some require VPCLMULQDQ f819c52 @mr-c

GFNI

XOP

F16C

FMA

SVML

enable SIMDE_X86_SVML_NATIVE for MSVC 2019+ 593af95 @mr-c

AES

aes: initial implementation of most aes instructions (#1072) 8632391 @Vineg

MIPS MSA intrinsics

msa neon impl: float64x2_t is not avail in A32V7 ae4c4ab @mr-c

Arch support

x86(-64)

fix SIMDE_ARCH_X86_SSE4_2 define 5e4b308 @cbielow

arm64

x86 aes: add neon implementation using the crypto extension fb3554f @mr-

z/Arch

Altivec

neon/st1: disable last remaining AltiVec implementation 0521245 @mr-c

e2k (Elbrus)

Power

sse2,wasm simd128: skip SIMDE_CONVERT_VECTOR_ implementations on PowerPC 4de999a @mr-c
wasm simd128: more powerpc fixes 7cb5691 @mr-c

Compiler Specific

GCC

GCC AVX512F: SIMDE_BUG_GCC_95399 was fixed in GCC 9.5, 10.4, 11.4, 12+ 3fa89c5 @mr-c
GCC x86/x64: SIMDE_BUG_GCC_98521 was fixed in 10.3 edde42e @mr-c
GCC x86: SIMDE_BUG_GCC_94482 was fixed in 8.5, 9.4, 10+ 43d86a3 @mr-c
Add workaround for GCC bug 111609 fdafd8e @M-HT
arm neon ld2: silence warnings at -O3 on gcc risc-v 8f56628 @mr-c
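Entries such as "fixed in GCC 9.5, 10.4, 11.4, 12+" describe a fix that was backported to several release branches, so a single "version >= X" comparison is not enough to tell whether a compiler is affected. The sketch below models that logic in Python. It assumes every version older than the listed fixes is affected, and `is_fixed` is a hypothetical helper, not how SIMDe itself encodes the check.

```python
def is_fixed(version, fixed_in):
    """Return True if `version` (major, minor) contains the fix.

    `fixed_in` lists the first fixed (major, minor) per release branch,
    in ascending order; the last entry stands for "this branch and later".
    """
    major, minor = version
    last_major = fixed_in[-1][0]
    if major > last_major:
        return True
    for fix_major, fix_minor in fixed_in:
        if major == fix_major:
            return minor >= fix_minor
    return False  # branch older than any listed fix

# SIMDE_BUG_GCC_95399: fixed in GCC 9.5, 10.4, 11.4, 12+
GCC_95399_FIXED = [(9, 5), (10, 4), (11, 4), (12, 0)]
```

With this table, GCC 10.3 is still treated as affected even though it is newer than 9.5.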

Clang

clang powerpc: vec_bperm bug was fixed in clang-14 6feb28a @mr-c
clmul: aarch64 clang has difficulties with poly64x1_t 1e1bd76 @mr-c
clang aarch64: optimization bug 45541 was fixed in clang-15 7ca5712 @mr-c
A32V7: Don't trust clang for load multiple on A32V7 927f141 @easyaspi314
clang wasm: SIMDE_BUG_CLANG_60655 is fixed in the upcoming 17.0 release 25cebbe @mr-c

ClangCL

fp16: don't use _Float16 on ClangCL if not supported 8a6b8c5 @mr-c
svml: don't enable SIMDE_X86_SVML_NATIVE for ClangCl c877fe5 @mr-

MSVC

avx512 types: avoid using native AVX512 types on MSVC unless required 029d749 @mr-c

Testing with Docker/Podman & CI

Update recipe for qemu git mode 54b8c8f @mr-c
riscv64 gcc: typo fix for endian little 7423339 @mr-c
add new cross sets; Ubuntu Focal and Bionic support b0b9710 @mr-c
native tests: also AVX512, MSA; fix WASM SIMD128 path bdd075b @mr-c
test-flags: support the x86 microarchitecture levels 518b777 @mr-c

Appveyor

preserve test log 9815161 @mr-c
save meson log on error 5207d83 @mr-

Azure

Circle CI

circleci: clang, set -Wno-unsafe-buffer-usage 24c93c2 @mr-c

Cirrus CI

Local testing with Docker/Podman

Drone.io

GitHub Actions

upgrade qemu; fixes remaining ppc64el fails! e91944b @mr-c
tidy matrix ordering for easier to read job names b52ac36 @mr-c
add clang-qemu: aarch64, riscv64, ppc64el, s390x 8a6dbab @mr-c
test armv7 with gcc-12 via qemu 8cd8de1 @mr-c
add armel to gcc and clang qemu matrices 4ca849b @mr-c
add armv7 to clang-qemu matrix a144aca @mr-c
use GCC 12 for adv x64 native testing + AVX512FP f156b41 @mr-c
expand mac-os/xcode testing matrix 8055410 @mr-c
fix macos-13+brew failure c6149de @mr-c
test with clang-16 e25ced8 @mr-c
add gcc-13 43ac8fc @mr-c
run on commits to the primary branch to prime the cache 6055bfb @mr-c

Misc

README: mark F16C as complete 2d87cf5 @mr-c
README: Give credit to creator/maintainer of the vcpkg for SIMDe ceb1e73 @mr-c
README: related projects: add AvxToNeon 13bf92a @mr-c

Template for next time

# Summary
## [X86](https://github.com/simd-everywhere/implementation-status/blob/main/x86.md)
### Newly added function families
### Additions to existing families
## [Neon](https://github.com/simd-everywhere/implementation-status/blob/main/neon.md)
## [MSA](https://github.com/simd-everywhere/implementation-status/blob/main/msa.md)
# Details
## Implementation of Arm intrinsics
### NEON
### SVE Intrinsics
## WASM intrinsics
## x86 intrinsics
### SSE*
### AVX
### AVX2
### AVX512
### GFNI 
### XOP
### F16C
### FMA
### SVML
## MIPS MSA intrinsics
## Arch support
### arm64
### z/Arch
### Altivec
### e2k (Elbrus)
### Power
## Testing with Docker/Podman & CI
### [Appveyor](https://ci.appveyor.com/project/nemequ/simde/history)
### [Azure](https://dev.azure.com/simd-everywhere/SIMDe/_build?definitionId=3)
### [Circle CI](https://app.circleci.com/pipelines/github/simd-everywhere/simde)
### [Cirrus CI](https://cirrus-ci.com/github/simd-everywhere/simde)
### [Local testing with Docker/Podman](https://github.com/simd-everywhere/simde/tree/master/docker#readme)
### [Drone.io](https://cloud.drone.io/simd-everywhere/simde)
### [GitHub Actions](https://github.com/simd-everywhere/simde/actions)
### [Netlify](https://app.netlify.com/sites/simde/)
### [Packit CI](https://dashboard.packit.dev/projects/github.com/simd-everywhere/simde)
### [Semaphore CI](https://nemequ.semaphoreci.com/projects/simde)
### [Travis](https://app.travis-ci.com/github/simd-everywhere/simde)
## Misc