Known Issues
half8 == and != operators don't conform to the IEEE 754 standard (this matches the behavior of Unity.Mathematics)
(s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
optimized (U)Int128 comparison operators didn't make it into this release
bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
most vectorized function overloads don't communicate return value ranges to the compiler yet, so they miss out on more efficient code paths that are selected purely at compile time through compile-time-only value range checks
AVX2 (s)byte32 all_dif lookup tables are currently far too large (kilobytes)
Fixes
(Issue #10) bool8/16/32 are now blittable when not used within an IJob
Additions
added comb(n, k) for scalar and vector integer types. This is known as the binomial coefficient or "n choose k". An optional Promise parameter can select an O(1) code path using the factorial formula. The standard code path runs in O(min(k, n - k)) time and cannot ever overflow unless the result itself overflows (which is not true of most solutions found online that claim it)
added perm(n, k) for scalar and vector integer types. This is known as "k-permutations of n". An optional Promise parameter can select an O(1) code path using the factorial formula. The standard code path runs in O(k) time and cannot ever overflow unless the result itself overflows
added nextgreater(x) for all types. For integer types, it is a wrapper function for addsaturated(x, 1). For floating point types, it returns the next greater representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations.
added nextsmaller(x) for all types. For integer types, it is a wrapper function for subsaturated(x, 1). For floating point types, it returns the next smaller representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations
added nexttoward(from, to) for all types, returning the next representable integer/floating point value(s) in a given direction, unless from is equal to to. For floating point types, from is returned if from is NaN or infinite. If to is NaN, NaN is returned. An optional Promise parameter allows for numerous optimizations.
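The overflow claim in the comb and perm entries above can be illustrated with a minimal sketch. This is not MaxMath's implementation: it is written in Python for illustration, the names `comb_naive`, `comb_safe`, `perm_safe` and `_checked` are hypothetical, and the gcd-based reduction is just one way (an assumption, not necessarily the one used by the library) to keep every intermediate value no larger than the final result. Since Python integers never overflow, a 64 bit register is simulated by raising an error.

```python
import math

U64_MAX = (1 << 64) - 1  # simulated register width; Python ints never overflow


def _checked(value: int) -> int:
    """Stand-in for a 64 bit register: fail loudly instead of wrapping."""
    if value > U64_MAX:
        raise OverflowError("intermediate value does not fit in 64 bits")
    return value


def comb_naive(n: int, k: int) -> int:
    """Common approach: multiply first, divide afterwards."""
    if k > n:
        return 0
    k = min(k, n - k)
    result = 1
    for i in range(1, k + 1):
        # result * (n - k + i) can exceed 64 bits even when C(n, k) fits.
        result = _checked(result * (n - k + i)) // i
    return result


def comb_safe(n: int, k: int) -> int:
    """Divide before multiplying; O(min(k, n - k)) loop iterations."""
    if k > n:
        return 0
    k = min(k, n - k)
    result = 1
    for i in range(1, k + 1):
        m = n - k + i
        g = math.gcd(result, i)
        # i // g always divides m here, so both divisions are exact and
        # the product equals C(m, i), which never exceeds C(n, k).
        result = _checked((result // g) * (m // (i // g)))
    return result


def perm_safe(n: int, k: int) -> int:
    """Falling product n * (n - 1) * ... * (n - k + 1); O(k) iterations."""
    if k > n:
        return 0
    result = 1
    for factor in range(n - k + 1, n + 1):
        # Every factor is >= 1, so partial products never exceed the result.
        result = _checked(result * factor)
    return result
```

For example, C(67, 33) fits in 64 bits, yet `comb_naive(67, 33)` raises `OverflowError` on its last multiplication, while `comb_safe(67, 33)` returns the correct value. A factorial-based formula, by comparison, needs n! itself to be representable, which for unsigned 64 bit integers holds only up to n = 20.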
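As a rough illustration of what "next representable floating point value" means in the nextgreater, nextsmaller and nexttoward entries above, the following sketch steps a 64 bit float by one unit in the last place via its bit pattern. It is written in Python and is not MaxMath code: the function names are hypothetical, returning x unchanged for NaN or infinite input in `next_greater` is an assumption based on the nexttoward description, and checking `to` for NaN before inspecting `start` is likewise an assumed order.

```python
import math
import struct


def _bits(x: float) -> int:
    """Reinterpret a float64 as its raw 64 bit unsigned pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def _from_bits(b: int) -> float:
    """Reinterpret a raw 64 bit unsigned pattern as a float64."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def next_greater(x: float) -> float:
    # Assumption: NaN and infinities are passed through unchanged.
    if math.isnan(x) or math.isinf(x):
        return x
    if x == 0.0:
        # Covers +0.0 and -0.0: the next value up is the smallest subnormal.
        return _from_bits(1)
    b = _bits(x)
    # For positive values a larger bit pattern is a larger float;
    # for negative values a smaller bit pattern moves toward zero.
    return _from_bits(b + 1 if x > 0.0 else b - 1)


def next_smaller(x: float) -> float:
    # Mirror image of next_greater.
    return -next_greater(-x)


def next_toward(start: float, to: float) -> float:
    # Assumed order of checks: a NaN target wins over everything else.
    if math.isnan(to):
        return math.nan
    if math.isnan(start) or math.isinf(start) or start == to:
        return start
    return next_greater(start) if to > start else next_smaller(start)
```

With these definitions, `next_greater(1.0)` is `1.0 + 2**-52` and `next_smaller(1.0)` is `1.0 - 2**-53`, reflecting that the spacing between neighboring values halves just below a power of two.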
Improvements
improved performance of 64 bit vectorized division thanks to a newly implemented and further optimized algorithm from a July 13th 2022 research paper. It replaces a vectorized loop with straight line code. The loop was rather slow: it ran for up to 64 iterations, and no instruction level parallelism outside the loop was possible until the loop finished executing, following an almost certainly mispredicted branch. Due to "recent" improvements to divider circuits, the new code path is inferior to hardware supported scalar division via element extraction for (u)long2 specifically, even when the quotient and/or remainder vector is in the middle of a dependency chain and even in tight loops. It is thus only implemented for (u)long3/4 types and only when compiling for AVX2
improved performance and reduced code size of (s)byte vector division for vectors of up to 8 elements and of every (u)short vector division when not compiling with FloatMode.Fast. Reduced the number of constants possibly read from RAM in either case.
fixed performance regression of SIMD register <-> software abstraction conversions for types using up the entirety of a hardware register
lcm for (s)byte vectors with 8 elements or less: decreased code size by 20 or 28 bytes; removed 2 or 4 or 8 bytes of constant data read from RAM; reduced latency by 2 or 3 clock cycles
verified and increased the (u)long scalar and vector intcbrt Promise.Unsafe0 range from [0, 1ul << 40] to [0, 1ul << 46]; this code path may also be chosen at compile time
implemented optimized quarter{X} IEEE 754 comparison operators (without having to cast to float{X}). Vectorized half{X} comparisons are implemented in MaxMath.Intrinsics.Xse as well and used where appropriate. compareto function overloads for quarter{X} and half{X} were implemented.
reduced latency of add/subsaturated for scalar Int128s, scalar and vector longs as well as vector ints by about a third
replaced the call to BigInteger.ToString() in (U)Int128.ToString(null, null), and thus unnecessary heap allocations, with an optimized implementation
(u)short8 / and % operators now correctly check for SSE2 support rather than AVX2
removed aliased fixed size buffers from all types, also improving indexer operator performance if the index is a compile time constant (in some cases)
Changes
Burst compiled code that uses a Promise argument which is not a compile time constant will throw an exception in DEBUG, as it represents significant overhead instead of an optimization. This will currently not inform users of the name of the function but rather the Burst compiled job/function that threw it.
Fixed Oversights
added explicit type conversion operators for scalar floats and doubles to half8 and all quarter vectors (as well as scalar halfs to quarter vectors)