v2.3.0

@MrUnbelievable92 MrUnbelievable92 released this 18 Aug 06:18
· 40 commits to master since this release
981f38f

Known Issues

  • half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
  • (s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
  • optimized (U)Int128 comparison operators didn't make it into this release
  • using bool vectors generated from 256 bit input vectors like so: long4 x = select(a, b, >>> myLong4a < myLong4b <<<) (as an example) does not generate the most efficient machine code possible
  • unit tests for 64-bit bits_zerohigh functions fail 100% of the time because of a bug related to the managed debug implementation of intrinsics (reported)
  • unit tests for intrinsics code paths for all functions that use "(mm256_)shuffle_ps" or "(mm256_)blendv_ps" can fail semi-randomly due to a bug which changes the bit content of ints which would be NaN if dereferenced as a float and written back to memory (reported)
  • most vectorized function overloads don't communicate return value ranges to the compiler yet, so more efficient code paths that are selected purely at compile time through value range checks are missed
  • (s)byte32 all_dif lookup tables are currently far too large (kilobytes)

Fixes

  • fixed quarter rounding behavior: casting a wider floating point type to a quarter now rounds to the nearest representable value instead of truncating the mantissa
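
The difference between truncating and rounding a mantissa can be sketched in plain C. This is only an illustration of the two behaviors on a raw integer mantissa, not the library's conversion code; the function names are hypothetical, and breaking ties towards the even value is an assumption.

```c
#include <stdint.h>

/* Drop the low `drop` bits by truncation (always rounds towards zero).
   Assumes 0 < drop < 32. */
uint32_t mantissa_truncate(uint32_t mantissa, unsigned drop)
{
    return mantissa >> drop;
}

/* Drop the low `drop` bits and round to the nearest remaining value.
   Ties go to the even value (assumption). Assumes 0 < drop < 32. */
uint32_t mantissa_round_nearest(uint32_t mantissa, unsigned drop)
{
    uint32_t kept      = mantissa >> drop;
    uint32_t remainder = mantissa & ((1u << drop) - 1u);
    uint32_t half      = 1u << (drop - 1u);

    if (remainder > half || (remainder == half && (kept & 1u)))
        kept += 1u; /* may carry out of the mantissa; a real conversion must then bump the exponent */
    return kept;
}
```

With the 5-bit mantissa 10111 and two bits dropped, truncation keeps 101 while rounding yields 110, which is the closer value.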

Additions

added namespace MaxMath.Intrinsics for users who want to use the math library through "high level" X86 intrinsics. Because users need to guard their intrinsics code with e.g. if (Burst.Intrinsics.X86.Sse2.IsSse2Supported) blocks and supported architectures vary (slightly) from function to function, these are considered unsafe, undocumented and unrecommended and only serve as an exposed layer of abstraction which is used internally anyway.

added flags enum Promise, with values Nothing, Everything, NoOverflow, ZeroOrGreater, ZeroOrLess, NonZero and Unsafe 0 through 3, as well as the composites Positive and Negative. This flags enum is only ever used as an optional parameter and offers faster, yet more unsafe code. Specifics vary between functions and sometimes even overloads but are documented accordingly. Optimizations are only ever to be added, not removed (= a ...promise ... of never introducing breaking changes in this regard)
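
The idea of trading safety for speed through an optional flags parameter can be sketched in plain C. The flag names, their values and the function below are hypothetical and only model the concept; they are not the library's definitions. The clamping mirrors what these notes state for factorial when no Promise is passed.

```c
#include <stdint.h>

/* Hypothetical flags that model the Promise idea; not the library's values. */
enum { PROMISE_NOTHING = 0, PROMISE_NO_OVERFLOW = 1 };

/* 32-bit factorial. Without a promise, the result is clamped to UINT32_MAX
   on overflow. With PROMISE_NO_OVERFLOW, the range check is skipped. */
uint32_t factorial_u32(uint32_t n, unsigned promise)
{
    /* 12! is the largest factorial that fits into 32 bits. */
    if (!(promise & PROMISE_NO_OVERFLOW) && n > 12u)
        return UINT32_MAX;

    uint32_t result = 1u;
    for (uint32_t i = 2u; i <= n; ++i)
        result *= i; /* wraps around silently if the promise was broken */
    return result;
}
```

A caller that already knows n <= 12 can pass the flag and skip the check; a caller that breaks the promise gets a wrapped, meaningless result.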

Other Additions

  • added factorial (for integer types) and gamma (floating point types) functions. factorial, when called without a Promise parameter, clamps the result to type.MaxValue in case of overflow
  • added erf(c), the (complementary) error function for floating point types
  • added (c)minmag and (c)maxmag functions, returning the (componentwise) minimum/maximum magnitude of two values or within a vector; equivalent to abs(x) > abs(y) ? x : y (maxmag) or abs(cmin(c)) > abs(cmax(c)) ? cmin(c) : cmax(c) (cmaxmag)
  • added (c)minmax and (c)minmaxmag functions which return both the (componentwise/columnwise) minimum and maximum (magnitude) as out parameters
  • added bitfield functions for scalar and vector integer types - small utility functions that pack several smaller integers into bigger ones
  • added copysign(x, y) functions for signed types, which is equivalent to return y < 0 ? nabs(x) : abs(x)
  • added (naive?) implementation for scalar- and vector float/double inverse hyperbolic functions asinh, acosh and atanh
  • added intlog10 functions (integer base ten logarithm)
  • added the bit test/bt family of functions for scalar and vector integer types. A testbit(POST_ACTION)((ref)x, i) function returns a boolean (vector), indicating whether the bit in x at index i is 1 and may (or may not) flip, set, or reset that bit afterwards
  • added a new category of type conversion functions with the suffix "unsafe". Added to(u)longunsafe and todoubleunsafe with a Promise parameter, allowing for up to two levels of optimization (vectorized 64bit int <-> 64 bit float is not hardware supported). Details in the XML documentation. Default double <-> (u)long conversion operators - apart from having their 4-element version improved - now check whether or not a safe range for unsafe conversions can be validated at compile time
  • added scalar/vectorized toquarterunsafe allowing for each type to be converted to a quarter type while specifying whether the input value will or will not overflow and/or is >= 0
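
Several of the additions above have simple scalar semantics, sketched here in plain C as reference models. These are not the library's implementations (which are vectorized C#); the function names are hypothetical, and the handling of edge cases such as intlog10 of 0 is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

static double abs_f64(double x) { return x < 0.0 ? -x : x; }

/* maxmag as stated in the notes: abs(x) > abs(y) ? x : y */
double maxmag_f64(double x, double y)
{
    return abs_f64(x) > abs_f64(y) ? x : y;
}

/* nabs: negative absolute value. Unlike abs, it cannot overflow for INT32_MIN. */
int32_t nabs_i32(int32_t x)
{
    return x < 0 ? x : -x;
}

/* copysign as stated in the notes: y < 0 ? nabs(x) : abs(x).
   Assumes x != INT32_MIN when y >= 0, since -INT32_MIN is not representable. */
int32_t copysign_i32(int32_t x, int32_t y)
{
    return y < 0 ? nabs_i32(x) : -nabs_i32(x);
}

/* Integer base ten logarithm, floor(log10(x)). Returning 0 for x == 0 is an assumption. */
uint32_t intlog10_u32(uint32_t x)
{
    uint32_t result = 0u;
    while (x >= 10u) {
        x /= 10u;
        ++result;
    }
    return result;
}

/* Bit test: is the bit of x at index i set? Assumes i < 32. */
bool testbit_u32(uint32_t x, unsigned i)
{
    return ((x >> i) & 1u) != 0u;
}

/* Bit test followed by a flip of that bit; returns the state before the flip. */
bool testbit_flip_u32(uint32_t *x, unsigned i)
{
    bool was_set = testbit_u32(*x, i);
    *x ^= (uint32_t)1 << i;
    return was_set;
}
```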

Improvements

considerably improved performance of several vector operators and function overloads for types that fill an entire hardware register and have to be up-cast to a wider type - the surrounding boilerplate code uses a new "in-house" faster-than-hardware algorithm, reducing its dependency chain latency from x [0 <= x <= 3] + (9 or 10) clock cycles down to x + (0 or 1 or 3) + (1 or 3) clock cycles

massive performance improvements for all vector types that are not a total of 128 or 256 bits wide, respectively, either through the Avx.[...]undefined[...] compiler intrinsics or through controlled undefined behaviour, by declaring an uninitialized variable and using pointer syntax to force the C# compiler into trusting that the variable has been fully initialized; this cannot lead to memory access violations, since the variable is declared and thus enough space is reserved on the stack, before it is optimized away by LLVM and assigned a hardware register instead, with undefined upper elements. This allows for upper elements of hardware registers to be ignored during compilation. Unnecessarily emitted instructions like movq xmm0, xmm0 (move the low 8 bytes from a register to the same register, zeroing out the upper 8 bytes, even though only the lower 8 bytes will be written back to memory) or far worse instruction sequences, for example when using vectors with 3 elements, are now (MOSTLY; there's still work to be done) omitted instead. Although most zero-upper-elements instruction( sequence)s only took a single clock cycle, they were always part of each dependency chain and could happen between almost each function call, including operators of course. The same improvements apply to Unity.Mathematics types when passed to maxmath functions.

improved performance throughout the library by adding what is effectively hundreds of thousands more Unity.Burst.CompilerServices.Constant.IsConstantExpression condition checks to many functions. Most notably, algorithms whose total latency depends on the byte size of their arguments may now run much faster. Some, but not yet all, of these constant checks are exposed through a Promise parameter

Other Improvements

  • improved performance of scalar (u)short to (u)short2/3/4 conversion
  • reduced latency of all, any, first, last, count and bitmask functions for bool8/16/32 when used with an expression as the argument, such as all(x != y) - a way to force the compiler to omit unnecessary instructions was found
  • reduced latency of addsaturated for scalar unsigned integer types
  • reduced latency of float/double to (U)Int128 conversion
  • reduced latency of shl, shrl and shra and thus all functions using those - especially for: shl for (s)byte vectors of all sizes if compiling for SSE4 and 32 byte sized vectors if compiling for AVX2; shl for (u)short vectors of 4 or more elements if compiling for at least SSE4; shra for (u)long vectors if compiling for AVX2 and the vector containing the shift amounts is a compile time constant.
  • reduced long2/3/4 shra code size and latency by another 2 clock cycles if compiling for AVX2
  • reduced latency of variable rol/r vector functions beyond shl/r improvements and added an optional Promise parameter, allowing the caller to promise the rotation values are in a specific range
  • reduced latency of long2/3/4 "is negative checks" - mylong4 < 0/0 > mylong4 by 33% by doubling its code size. This further improves performance/adds to code size of functions in the library
  • reduced latency of (u)long2/3/4 isinrange functions
  • reduced latency of unsigned byte and ushort vector to float vector conversion. This also affects performance of (s)byte and (u)short vector intsqrt functions, as well as the respective % and / operators (byte2/3/4/8, all ushort vectors)
  • reduced (u)long vector intcbrt latency by ~45% and reduced code size by ~20% (roughly 150 bytes). For other integer vector types, the latency has been reduced by ~8 to ~15 clock cycles
  • added hidden and retroactively improved exp2 scalar and vector integer argument function overloads. These return exp2((float/double)x) or (float/double)(1 << x) in 3 instead of 6 to 7 clock cycles at best; they of course also work for negative input values i.e. reciprocals of powers of 2. The (u)int overloads convert to floats, the (u)long overloads convert to doubles; explicit integer to integer casting should (and sometimes has to) be used for optimal results. Additionally, these overloads contain an optional 'Promise' parameter, allowing for omission of clamping which is needed to ensure correct underflow/overflow behavior, as dictated by Unity's exp2 implementation. If you ever used the standard exp2 function by implicitly converting an int type to a float type, performance was improved by a factor of about 30x. This overload only "breaks" code that casts (u)long types to float types implicitly if the result is expected to be a float type. It is recommended to explicitly cast the (u)long type to a (u)int type in such a case
  • added ==, !=, <, >, <= and >= operators for UInt128 and signed long/int comparisons, as the expensive float conversion and comparison was previously used when, for instance, comparing a UInt128 to a constant int such as 1 or 0
  • implemented SIMD (u)int and (u)long division/modulo algorithms. (u)long performance gains are only noticeable under certain conditions; the (u)int performance gain is substantial (and unfortunately not used for (u)int2/3/4 by LLVM/Burst - these are now exposed as further div overloads and new mod functions). Functions other than operator overloads also benefit
  • added SSE2 fallback code for all (s)byte2/3/4 shuffles, e.g. myByte4.xzzw
  • added more SSE4 -> SSE2 fallback code instead of n * (vector element extraction code + scalar code + vector element insertion code), where viable (now - thanks to some specific performance improvements)
  • improved performance of double4 to (u)long4 conversions if compiling for AVX2
  • optimized each possible byte vector division/modulo operation by a scalar compile time constant. Many, if not most, were not even auto-vectorized, let alone optimized for SIMD instructions instead of general purpose register instructions, which were translated poorly if vectorized
  • replaced double precision (r)cbrt's math.pow(x, (-)1d/3d) call with an optimized implementation
  • reduced latency of float scalar- and vector (r)cbrt by ~1 + (1 or 2) * ~4 clock cycles, while also gaining a small amount of precision; reduced code size, as well as the number of required compile time constants
  • reduced latency of float scalar- and vector (r)cbrt which handle negative inputs accurately (i.e. the new standard) by one clock cycle, making it just one clock cycle slower than the unsafe version, which now mostly just offers a somewhat considerable advantage with regard to code size
  • reduced (s)byte16 and (u)short16 all_dif lookup table size by 896 bytes (traded for an increase of 8 bytes in code size so this doesn't save RAM; it potentially reduces memory latency as well as register spilling onto the stack)
  • reduced (u)int8 t/lzcnt latency by ~10%, also positively affecting (u)int2/3/4/8 gcd and lcm performance, as it is part of a loop within gcd
  • reduced double and float to quarter conversion latency (15+ clock cycles down to 7, optimally (CPU dependent)), code size and the number of constants being used. This affects scalar and vector conversions; the scalar versions are now branch free.
  • added AVX2 -> SSSE3 -> SSE2 fallback code for (s)byte32 and (u)short16 all_dif functions
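
For the exp2 integer overloads mentioned above, one well-known way to get 2^x as a float without a conversion plus a general exp2 call is to write x straight into the exponent field of an IEEE 754 binary32 value. The sketch below shows that general technique in plain C; whether the library does exactly this is an assumption, and the function names are hypothetical. The clamped variant models why clamping is needed for correct underflow/overflow behavior, and the unclamped variant models what skipping it via a promise would look like.

```c
#include <stdint.h>
#include <string.h>

/* 2^x built directly from the binary32 layout: biased exponent, zero mantissa.
   Only valid for -126 <= x <= 127 (the normal range). */
float exp2_int_unclamped(int32_t x)
{
    uint32_t bits = (uint32_t)(x + 127) << 23;
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}

/* Same idea with the out-of-range cases handled explicitly. */
float exp2_int(int32_t x)
{
    uint32_t bits;
    if (x > 127)        bits = 0x7F800000u;               /* overflow: +infinity    */
    else if (x >= -126) bits = (uint32_t)(x + 127) << 23; /* normal range           */
    else if (x >= -149) bits = (uint32_t)1 << (x + 149);  /* subnormal: single bit  */
    else                bits = 0u;                        /* underflow: zero        */

    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

Negative inputs give reciprocals of powers of two, e.g. exp2_int(-1) is 0.5f.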

Changes

Complete avg Overhaul

  • renamed avg overloads which calculate the average value of a vector itself to cavg for consistency reasons (max vs cmax, for instance)
  • 32- and 64bit integer (c)avg calculations can no longer result in overflow of intermediate calculations and thus incorrect results (lower performance by default)
  • added Promise parameters to most (c)avg overloads. These can bring back the previous performance of 32- and 64bit integer overloads
  • reduced latency of signed 8/16 bit (c)avg overloads
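
Why intermediate overflow matters for 32-bit averages can be sketched in plain C. These scalar functions only illustrate the problem and two common remedies; they are not the library's code, the names are hypothetical, and rounding down is an assumption.

```c
#include <stdint.h>

/* Naive average: a + b wraps around for large inputs, giving a wrong result. */
uint32_t avg_naive_u32(uint32_t a, uint32_t b)
{
    return (a + b) / 2u;
}

/* Overflow-free: the bits both inputs share, plus half of the bits that differ. */
uint32_t avg_safe_u32(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}

/* Overflow-free by widening the intermediate sum to 64 bits. */
uint32_t avg_widened_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2u);
}
```

Averaging 4000000000 with itself shows the difference: the naive version wraps, while both safe versions return 4000000000. The extra work in the safe versions is the kind of cost a Promise parameter lets the caller opt out of.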

Other Changes

  • (U)Int128((u)long lo64, (u)long hi64) constructors are now public
  • the Int128 intsqrt overload now returns a ulong
  • replaced the optional float (r)cbrt bool parameter handleNegativeInput with a Promise parameter, whose NonNegative flag must be set to get the faster version, and removed the parameter from the double overloads completely. This is primarily due to the introduction of the Promise type and thus for consistency reasons. Also, the optimized double implementation handles negative numbers for free, which is now the standard behavior.
  • replaced the optional intcbrt bool parameter handleNegativeInput with a Promise parameter for the reasons mentioned above, also handling negative input values correctly by default
  • Bumped C# Dev Tools to version 1.0.8

Fixed Oversights

  • (Issue #5) .meta files are now included to allow for adding the repository to Unity projects via its github URL
  • added floorpow2 function overloads for scalar (u)int and (u)long types
  • added legitimately faster-than-hardware double scalar- and vector fastrcp and fastrsqrt overloads (substantially less accurate than FloatMode.Fast, FloatPrecision.Low 1d / x or 1d / sqrt(x))
  • the seven Bit Manipulation Instructions (functions with a bits_ prefix) now have their vector equivalents implemented as overloads
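
As a reference for what floorpow2 computes, here is a minimal scalar sketch in plain C: the largest power of two that is less than or equal to the input. The function name is hypothetical, returning 0 for an input of 0 is an assumption, and the library may well use a different method (for example a leading zero count).

```c
#include <stdint.h>

/* Largest power of two <= x; returns 0 for x == 0 (assumption). */
uint32_t floorpow2_u32(uint32_t x)
{
    /* Smear the highest set bit into every lower bit position ... */
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    /* ... then subtract the lower half, leaving only the highest bit. */
    return x - (x >> 1);
}
```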