Releases: MrUnbelievable92/MaxMath
v2.9.0
Known Issues
- half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics... for now... hint)
- bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- float8 min() and max() functions don't handle NaNs the same way Unity.Mathematics does
Fixes
- Fixed XML documentation not showing descriptions for valid Promise flags
- Fixed cminmax documentation
- bitmask64 with numBits equal to 64 now correctly returns a bitmask with all 64 bits set if not compiling for Bmi1 i.e. AVX2
- Fixed uint8 to float8 type conversion if compiling for AVX2
- Fixed incorrect mod implementations
- (ISSUE #16) Fixed float and double (r)cbrt edge cases (+/-0, Infinity and NaN). Additionally, the scalar- and vector float implementation now returns accurate results for subnormal numbers. Performance is affected negatively yet minimally (~2 clock cycles, + ~10 instructions); new valid Promise flags allow for call-site selection of faster code paths
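The bitmask64 fix concerns the numBits == 64 edge case. As a rough illustration of why that case tends to need special handling in shift-based code, here is a minimal sketch. LowBitsMask is a hypothetical helper, not MaxMath's bitmask64, and whether the library's bug had this exact cause is an assumption.

```csharp
using System;

public static class BitmaskSketch
{
    // Hypothetical helper (not MaxMath's bitmask64): returns a mask with the
    // lowest numBits bits set, for numBits in [0, 64].
    public static ulong LowBitsMask(int numBits)
    {
        if (numBits < 0 || numBits > 64)
            throw new ArgumentOutOfRangeException(nameof(numBits));

        // C# only uses the low 6 bits of the shift count for 64-bit operands,
        // so 1UL << 64 evaluates to 1UL << 0. The naive (1UL << numBits) - 1
        // would therefore yield 0 instead of all ones for numBits == 64.
        return numBits == 64 ? ulong.MaxValue : (1UL << numBits) - 1UL;
    }
}
```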
Additions
Divider<T>
Divider<T> is an opaque OOP-like struct which performs fast integer division and modulo operations as well as divisibility checks.
For any divisor of any scalar- or vector integer type T, a Divider<T> instance replaces division operations by multiplication-, shift- and rounding operations, utilizing the most suitable of 2 algorithms typically used by compilers for compile time constant divisors.
Divider<T> was carefully crafted in a way that allows for complete compile-time evaluation of constant divisors of all types in Burst compiled code.
Divider<T> is NOT meant to replace division operations; a (notable) performance gain is only to be expected in case the same divisor is used multiple times, or when multiple divisors are computed at once, utilizing SIMD (for instance, when a very predictable i is the divisor in a for-loop).
Numerous Promise flags allow for faster operations, provided that the Divider<T> instance is both initialized and used in the same block of Burst compiled code and not loaded from RAM.
The implementation is pseudo-generic and only works for integer types known to MaxMath. Furthermore, Burst's inability to compile-time evaluate typeof(T) often requires explicit initialization (example: new Divider<byte>((byte)42)). DEBUG-only validity checks ensure correct initialization and usage.
The current Divider API consists of:
- / and % operators:
  - LHS: scalar <> RHS: Divider(scalar): requires both scalars to be of the same type; returns a scalar of that type
  - LHS: vector <> RHS: Divider(vector): requires both vectors to be of the same type; returns a vector of that type
  - LHS: scalar <> RHS: Divider(vector): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
  - LHS: vector <> RHS: Divider(scalar): requires the vector type to contain integers of the scalar type; returns an instance of the vector type
- DivRem member methods
- EvenlyDivides member methods
- T Divisor as a readonly property
- public const Promises within Divider<T>, documenting valid promise flags with appropriate naming, starting with "PROMISE_"
- Get/SetInnerDivider<U> methods: get or set a scalar- or vector Divider<U> within a Divider<T>
- Component shuffles: Divider<T>.wzxy swizzle "operators" as properties.
NOTE: Get/SetInnerDivider<U> methods and Divider<T>.[a][b][c][d] properties will change in the future. Due to current limitations regarding C# generics, swizzle operators only take in or return the same type the respective property is a member of, i.e. you cannot use these to get a Divider<int2> from a Divider<int4>. Get/SetInnerDivider<U> are placeholders both for these operations as well as for the v[a]_[b] properties for vectors with 8 or more components. C# will at some point get more complex type extension language support, at which point this API will change.
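To make the idea of replacing division by multiplication and shift operations concrete, here is a minimal scalar sketch for 16 bit unsigned values. UShortDividerSketch and its members are hypothetical names chosen for illustration; this is not MaxMath's Divider<T>, and it shows only one well-known "magic number" scheme, not necessarily either of the 2 algorithms the library selects from.

```csharp
using System;

// Hypothetical illustration only; this is not MaxMath's Divider<T>.
public readonly struct UShortDividerSketch
{
    public readonly ushort Divisor;
    private readonly ulong magic; // ceil(2^32 / Divisor)

    public UShortDividerSketch(ushort divisor)
    {
        if (divisor == 0) throw new DivideByZeroException();
        Divisor = divisor;
        // floor((2^32 - 1) / d) + 1 equals ceil(2^32 / d) for every d >= 1.
        magic = 0xFFFFFFFFUL / divisor + 1UL;
    }

    // n / Divisor as one multiplication and one shift.
    // n < 2^16 and magic <= 2^32, so the 64 bit product cannot overflow.
    public ushort Divide(ushort n) => (ushort)((n * magic) >> 32);

    // n % Divisor, derived from the quotient.
    public ushort Remainder(ushort n) => (ushort)(n - Divide(n) * Divisor);

    public bool EvenlyDivides(ushort n) => Remainder(n) == 0;
}
```

The precomputation only pays off when the same instance is reused for many dividends, which matches the usage advice above.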
quadruple (PREVIEW)
Analogous to (U)Int128, this library now supports 128 bit floating point operations with its respective software-implemented type. It is fully IEEE 754 compliant and in the typical 1 sign bit, 15 exponent bits, 112 mantissa bits format.
NOTE: quadruple is in preview for an unforeseeable amount of time. This means that it is neither completely optimized, nor are all maxmath functions available for it at this time.
The following functions have been implemented: ToString and Parse (no perfect roundtrip guaranteed), all constants (example: PI_QUAD), Random128 NextQuadruple (optionally with min and max values), all type conversions except for decimal, - (unary), + (binary), - (binary), *, /, %, ==, !=, <, <=, >, >=, fmod, mad, msub, rcp, isnan, isinf, isfinite, isnormal, issubnormal, round, floor, ceil, trunc, roundtoint (and all other integer variations), fastsqrt, (r)sqrt, (r)cbrt, isinrange, approx, select, compareto, min, max, copysign, nextgreater, nextsmaller, nexttoward, radians, degrees, chgsign
Functions
- Added isnormal and issubnormal functions for floating point types
- Added hypot and inthypot functions for calculating [int]sqrt(a * a + b * b) without overflow, unless an optional Promise parameter with its NoOverflow flag set is passed as a compile time constant argument
- Added roundto(s)byte/(u)short/(u)int/(u)long/(U)Int128. These take in floating point values of any type and convert them to the respective integer scalar- or vector type while rounding towards the nearest integer
- Added cor and cxor. These reduce vectors of a given integer type to a scalar integer of that type by applying bitwise OR or XOR operations between each element
- Split approx into two overloads: one with a custom tolerance parameter (the old version) and one without, which calculates an appropriate tolerance instead
- Added roundmultiple(x, m), floormultiple(x, m), ceilmultiple(x, m) and truncmultiple(x, m) for all types, rounding x to the nearest multiple of any positive m with the selected rounding mode (for example: ceilmultiple rounds x to the nearest greater multiple of m)
- Added a whole stack of bit manipulation functions for all scalar- and vector integer types: parityodd, parityeven, countzerobits, l1cnt, t1cnt, lzmask, tzmask, l1mask, t1mask, bits_extractlowest0, bits_masktolowest, bits_masktolowest0, bits_maskfromlowest, bits_maskfromlowest0, bits_setlowest, bits_surroundlowest and bits_surroundlowest0
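As a rough picture of what the floor/ceil/trunc multiple rounding modes mean for signed integers, here is a small sketch. The method names are hypothetical and the code is not taken from MaxMath; it assumes m > 0 and ignores overflow near the limits of long.

```csharp
using System;

public static class MultipleSketch
{
    // Largest multiple of m that is <= x.
    public static long FloorMultiple(long x, long m)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));
        long r = x % m;    // C# remainder takes the sign of x
        if (r < 0) r += m; // shift into [0, m)
        return x - r;
    }

    // Smallest multiple of m that is >= x.
    public static long CeilMultiple(long x, long m)
    {
        long floor = FloorMultiple(x, m);
        return floor == x ? x : floor + m;
    }

    // Multiple of m closest to x in the direction of zero.
    public static long TruncMultiple(long x, long m)
    {
        if (m <= 0) throw new ArgumentOutOfRangeException(nameof(m));
        return x - x % m;
    }
}
```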
Global Compilation Options
- Added Global Compilation Options for OptimizeFor, FloatMode and FloatPrecision. A proposal for compile-time access to job-specific options has been forwarded to the Burst team and is on their backlog. For now, these global options are dependency-injection-style placeholders and thus hard-coded to OptimizeFor.Performance, FloatMode.Default and FloatPrecision.Standard, respectively, and can be customized within the source code itself at .../MaxMath/Runtime/Compiler Extensions/Compilation Options.cs
Improvements
Meta
- This library now fully supports ARM CPUs' SIMD instructions (huge!). It utilizes SSE2NEON and SIMDe to convert x86 SIMD instructions to ARM SIMD instructions or instruction sequences. Because of this, generated ARM code will sometimes remain slightly unoptimized, because the author is unable to verify correctness of ARM specific optimizations with unit tests in most cases.
Performance
- Implemented optimized (u)long vector to float vector type conversion operators
- Implemented the execution of two loop bodies in one for functions that use loop-based algorithms, when a vector type wider than 128 bits is used without compiling for AVX(2)
- Implemented an AssumeRangeAttribute equivalent for all vectorized functions with known return value ranges
- Implemented more optimal (U)Int128 comparison operators
- Implemented optimal (U)Int128 multiplication operations with- and division and modulo operations by compile time constants
- Implemented optimal (U)Int128 division and modulo operations by replacing a loop algorithm with straight line code. Because Burst does not expose the hardware-supported 128x64 narrowing division instruction as an intrinsic, this instruction, which is fundamentally important to the algorithm, is implemented with fallback code. A highly optimized (speed & size) native DLL written in Windows x86-64 assembly containing the most optimal implementation of any variation of 128 bit integer division was added to utilize this hardware instruction. This does mean that 128 bit integer division now results in a function call that cannot be inlined, yet the performance gain is worth it. Additionally, the C#/assembly interface was carefully crafted to avoid calling external functions partially or even entirely by utilizing Unity.Burst.CompilerServices.Constant.IsConstantExpression<T>()
- Increased valid Promise.Unsafe0 range for (u)long intcbrt from [0, 2^46] to [0, 2^48]
- Added an optional Promise parameter to gamma
- Added an optional Promise parameter to erf(c)
- Added an optional Promise parameter to gcd and lcm
- Added quarter and half scalar- and vector function overloads for min, max, minmax, clamp, saturate, isinrange, trunc, round, ceil, floor and sign
- Removed the only non-optimizing branch in vector code in the entire library within the long2/3/4 >> operator if the shift amount is not a comp...
v2.3.5
Known Issues
- half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
- (s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
- optimized (U)Int128 comparison operators didn't make it into this release
- bool vectors generated from operations on non-(s)byte vectors do not generate the most optimal machine code possible, partly due to an LLVM performance regression, partly due to other compiler related difficulties
- most vectorized function overloads don't communicate return value ranges to the compiler yet, missing out on more efficient code paths selected at compile-time-only with compile-time-only value range checks.
- AVX2 (s)byte32 all_dif lookup tables are currently way too large (kiloBytes)
Fixes
- (Issue #10) bool8/16/32 are now blittable when not used within an IJob
Additions
- added comb(n, k) for scalar- and vector integer types. This is known as the binomial coefficient or "n choose k". An optional Promise parameter can select an O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows (which is not true for most solutions found online that claim it), uses an O(min(k, n - k)) algorithm with respect to time
- added perm(n, k) for scalar- and vector integer types. This is known as "k-permutations of n". An optional Promise parameter can select an O(1) code path using the factorial formula, whereas the standard approach, which cannot ever overflow unless the result itself overflows, uses an O(k) algorithm with respect to time
- added nextgreater(x) for all types. For integer types, it is a wrapper function for addsaturated(x, 1). For floating point types, it returns the next greater representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations.
- added nextsmaller(x) for all types. For integer types, it is a wrapper function for subsaturated(x, 1). For floating point types, it returns the next smaller representable floating point value(s), unless x is NaN or infinite. An optional Promise parameter allows for numerous optimizations
- added nexttoward(from, to) for all types, returning the next representable integer/floating point value(s) in a given direction, unless from is equal to to. For floating point types, from is returned if from is NaN or infinite. If to is NaN, NaN is returned. An optional Promise parameter allows for numerous optimizations.
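One way to get an O(min(k, n - k)) binomial coefficient whose intermediate values never exceed the final result is sketched below. This is an assumption about how such a routine can be written, not MaxMath's actual comb implementation, and CombSketch is a hypothetical name.

```csharp
using System;

public static class CombSketch
{
    // Hypothetical helper, not MaxMath's comb.
    public static ulong Comb(ulong n, ulong k)
    {
        if (k > n) return 0;
        if (k > n - k) k = n - k; // C(n, k) == C(n, n - k)

        ulong c = 1;
        for (ulong i = 1; i <= k; i++)
        {
            ulong m = n - k + i;
            // Before this step c == C(m - 1, i - 1), and c * m is divisible by i.
            // Splitting c into quotient and remainder with respect to i yields
            // c * m / i without ever forming the larger product c * m.
            c = (c / i) * m + (c % i) * m / i;
        }
        return c;
    }
}
```

The common shortcut c = c * m / i computes the same values but can overflow in the multiplication even when the quotient would still fit.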
Improvements
- improved performance of 64bit vectorized division thanks to a newly implemented and further optimized algorithm from a July 13th 2022 research paper, which replaces a vectorized loop (rather slow; up to 64 iterations; no instruction level parallelism outside the loop possible until the loop finished executing, following an almost certainly mispredicted branch) with straight line code. Due to "recent" improvements to divider circuits, this code path is inferior to hardware supported scalar division via element extraction for (u)long2, specifically, even when the quotient and/or remainder vector is in the middle of a dependency chain and even in tight loops, and is thus only implemented for (u)long3/4 types and only if compiling for AVX2
- improved performance and reduced code size of up to (s)byte8 and every (u)short vector division if not compiling with FloatMode.Fast. Reduced constants possibly read from RAM in either case.
- fixed performance regression of SIMD register <-> software abstraction conversions for types using up the entirety of a hardware register
- lcm for (s)byte vectors with 8 elements or less: decreased code size by 20 or 28 bytes; removed 2 or 4 or 8 bytes of constant data read from RAM; reduced latency by 2 or 3 clock cycles
- verified and increased the (u)long scalar- and vector intcbrt Promise.Unsafe0 range from [0, 1ul << 40] to [0, 1ul << 46], the code path of which is also possibly chosen at compile time
- implemented optimized quarter{X} IEEE-754 comparison operators (without having to cast to float{X}). Vectorized halfX comparisons are implemented in MaxMath.Intrinsics.Xse as well and used where appropriate. compareto with quarter{X} and half{X} function overloads were implemented.
- reduced latency of add/subsaturated for scalar Int128s, scalar and vector longs as well as vector ints by about a third
- replaced (U)Int128.ToString(null, null)'s call to BigInteger.ToString() and thus unnecessary heap allocations with an optimized implementation
- (u)short8 / and % operators now correctly check for SSE2 support rather than AVX2
- removed aliased fixed size buffers from all types, also improving indexer operator performance if the index is a compile time constant (in some cases)
Changes
- Burst compiled code that uses a Promise argument which is not a compile time constant will throw an exception in DEBUG, as it represents significant overhead instead of an optimization. This will currently not inform users of the name of the function but rather the Burst compiled job/function that threw it.
Fixed Oversights
- added explicit type conversion operators for scalar floats and doubles to half8 and all quarter vectors (as well as scalar halfs to quarter vectors)
v2.3.0
Known Issues
- half8 == and != operators don't conform to the IEEE 754 standard (compliant with Unity.Mathematics)
- (s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal
- optimized (U)Int128 comparison operators didn't make it into this release
- using bool vectors generated from 256 bit input vectors like so: long4 x = select(a, b, >>> myLong4a < myLong4b <<<) (as an example) does not generate the most efficient machine code possible
- unit tests for 64-bit bits_zerohigh functions fail 100% of the time because of a bug related to the managed debug implementation of intrinsics (reported)
- unit tests for intrinsics code paths for all functions that use "(mm256_)shuffle_ps" or "(mm256_)blendv_ps" can fail semi-randomly due to a bug which changes the bit content of ints which would be NaN if dereferenced as a float and written back to memory (reported)
- most vectorized function overloads don't communicate return value ranges to the compiler yet, missing out on more efficient code paths selected at compile-time-only with compile-time-only value range checks.
- (s)byte32 all_dif lookup tables are currently way too large (kiloBytes)
Fixes
- fixed quarter rounding behavior when casting a wider floating point type to a quarter to round towards the nearest representable value instead of truncating the mantissa
Additions
added namespace MaxMath.Intrinsics for users who want to use the math library through "high level" X86 intrinsics. Because users need to guard their intrinsics code with e.g. if (Burst.Intrinsics.X86.Sse2.IsSse2Supported) blocks and supported architectures vary (slightly) from function to function, these are considered unsafe, undocumented and unrecommended and only serve as an exposed layer of abstraction which is used internally anyway.
added flags enum Promise, with values Nothing, Everything, NoOverflow, ZeroOrGreater, ZeroOrLess, NonZero and Unsafe0 through Unsafe3 as well as the composites Positive and Negative. This flags enum is only ever used as an optional parameter and offers faster, yet more unsafe code. Specifics vary between functions and sometimes even overloads but are documented accordingly. Optimizations are only ever to be added, not removed (= a ...promise ... of never introducing breaking changes in this regard)
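The following sketch shows the general pattern of an optional flags parameter that lets the caller opt out of a safety check. PromiseSketch is a hypothetical stand-in: its numeric values and the composition of Positive and Negative are assumptions, not MaxMath's definitions. The clamping behavior follows what these release notes state for factorial when it is called without a Promise.

```csharp
using System;

// Hypothetical stand-in for the Promise flags enum; the numeric values and
// the make-up of the composites are assumptions.
[Flags]
public enum PromiseSketch
{
    Nothing = 0,
    NoOverflow = 1 << 0,
    ZeroOrGreater = 1 << 1,
    ZeroOrLess = 1 << 2,
    NonZero = 1 << 3,
    Positive = ZeroOrGreater | NonZero,
    Negative = ZeroOrLess | NonZero,
}

public static class FactorialSketch
{
    public static uint Factorial(uint n, PromiseSketch promise = PromiseSketch.Nothing)
    {
        bool mayOverflow = (promise & PromiseSketch.NoOverflow) == 0;

        // 12! = 479001600 is the largest factorial that fits into a uint.
        // Without the caller's promise, larger inputs are clamped.
        if (mayOverflow && n > 12) return uint.MaxValue;

        uint result = 1;
        for (uint i = 2; i <= n; i++)
            result = unchecked(result * i); // wraps if the promise was broken
        return result;
    }
}
```

When the promise argument is a compile time constant, a compiler can drop the range check entirely, which is where the speed-up in such a design would come from.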
Other Additions
- added factorial (for integer types) and gamma (floating point types) functions. factorial, when called without a Promise parameter, clamps the result to type.MaxValue in case of overflow
- added erf(c), the (complementary) error function for floating point types
- added (c)minmag and (c)maxmag functions, returning the (componentwise) minimum/maximum magnitude of two values or within a vector; equivalent to abs(x) > abs(y) ? x : y (maxmag) or abs(cmin(c)) > abs(cmax(c)) ? cmin(c) : cmax(c) (cmaxmag)
- added (c)minmax and (c)minmaxmag functions which return both the (componentwise/columnwise) minimum and maximum (magnitude) as out parameters
- added bitfield functions for scalar and vector integer types - small utility functions that pack several smaller integers into bigger ones
- added copysign(x, y) functions for signed types, which is equivalent to return y < 0 ? nabs(x) : abs(x)
- added (naive?) implementation for scalar- and vector float/double inverse hyperbolic functions asinh, acosh and atanh
- added intlog10 functions (integer base ten logarithm)
- added the bit test/bt family of functions for scalar and vector integer types. A testbit(POST_ACTION)((ref)x, i) function returns a boolean (vector), indicating whether the bit in x at index i is 1 and may (or may not) flip, set, or reset that bit afterwards
- added a new category of type conversion functions with the suffix "unsafe". Added to(u)longunsafe and todoubleunsafe with a Promise parameter, allowing for up to two levels of optimization (vectorized 64bit int <-> 64 bit float is not hardware supported). Details in the XML documentation. Default double <-> (u)long conversion operators - apart from having their 4-element version improved - now check whether or not a safe range for unsafe conversions can be validated at compile time
- added scalar/vectorized toquarterunsafe allowing for each type to be converted to a quarter type while specifying whether the input value will or will not overflow and/or is >= 0
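The bit test family described above can be pictured with the scalar sketch below. The method names are hypothetical and only mirror the "test, then optionally modify" behavior; they are not MaxMath's signatures, and i is assumed to be in [0, 31].

```csharp
using System;

public static class BitTestSketch
{
    // Returns whether bit i of x is 1; x is left untouched.
    public static bool TestBit(uint x, int i) => ((x >> i) & 1u) != 0;

    // Returns the old state of bit i, then flips it.
    public static bool TestBitFlip(ref uint x, int i)
    {
        bool wasSet = TestBit(x, i);
        x ^= 1u << i;
        return wasSet;
    }

    // Returns the old state of bit i, then sets it to 1.
    public static bool TestBitSet(ref uint x, int i)
    {
        bool wasSet = TestBit(x, i);
        x |= 1u << i;
        return wasSet;
    }

    // Returns the old state of bit i, then resets it to 0.
    public static bool TestBitReset(ref uint x, int i)
    {
        bool wasSet = TestBit(x, i);
        x &= ~(1u << i);
        return wasSet;
    }
}
```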
Improvements
improved performance of several vector operators and function overloads for types that use up an entire hardware register while having to be up-cast to a wider type considerably - surrounding boilerplate code uses a new "in-house" faster-than-hardware algorithm with its dependency chain latency having been reduced from x [0 <= x <= 3] + (9 or 10) clock cycles down to x + (0 or 1 or 3) + (1 or 3) clock cycles
massive performance improvements for all vector types that are not a total of 128 or 256 bits wide, respectively, either through the Avx.[...]undefined[...] compiler intrinsics or through controlled undefined behaviour, by declaring an uninitialized variable and using pointer syntax to force the C# compiler into trusting that the variable has been fully initialized; this cannot lead to memory access violations, since the variable is declared and thus enough space is reserved on the stack, before it is optimized away by LLVM and assigned a hardware register instead, with undefined upper elements. This allows for upper elements of hardware registers to be ignored during compilation. Unnecessarily emitted instructions like movq xmm0, xmm0 (move the low 8 bytes from a register to the same register, zeroing out the upper 8 bytes, even though only the lower 8 bytes will be written back to memory) or far worse instruction sequences, for example when using vectors with 3 elements, are now (MOSTLY; there's still work to be done) omitted instead. Although most zero-upper-elements instruction( sequence)s only took a single clock cycle, they were always part of each dependency chain and could happen between almost each function call, including operators of course. The same improvements apply to Unity.Mathematics types when passed to maxmath functions.
improved performance throughout the library by effectively adding hundreds of thousands more Unity.Burst.CompilerServices.Constant.IsConstantExpression condition checks to many functions within the library. Most notably, algorithms where the total latency is dependent on the byte size of arguments may now perform much faster. Some but not yet all of these constant checks are exposed through a Promise parameter
Other Improvements
- improved performance of scalar (u)short to (u)short2/3/4 conversion
- reduced latency of all, any, first, last, count and bitmask functions for bool8/16/32 when used with an expression as the argument, such as all(x != y) - a way to force the compiler to omit unnecessary instructions was found
- reduced latency of addsaturated for scalar unsigned integer types
- reduced latency of float/double to (U)Int128 conversion
- reduced latency of shl, shrl and shra and thus all functions using those - especially for: shl for (s)byte vectors of all sizes if compiling for SSE4 and 32 byte sized vectors if compiling for AVX2; shl for (u)short vectors of 4 or more elements if compiling for at least SSE4; shra for (u)long vectors if compiling for AVX2 and the vector containing the shift amounts is a compile time constant.
- reduced long2/3/4 shra code size and latency by another 2 clock cycles if compiling for AVX2
- reduced latency of variable rol/r vector functions beyond shl/r improvements and added an optional Promise parameter, allowing the caller to promise the rotation values are in a specific range
- reduced latency of long2/3/4 "is negative checks" - mylong4 < 0 / 0 > mylong4 by 33% by doubling its code size. This further improves performance/adds to code size of functions in the library
- reduced latency of (u)long2/3/4 isinrange functions
- reduced latency of unsigned byte and ushort vector to float vector conversion. This also affects performance of (s)byte and (u)short vector intsqrt functions, as well as the respective % and / operators (byte2/3/4/8, all ushort vectors)
- reduced (u)long vector intcbrt latency by ~45% and reduced code size by ~20% (roughly 150 bytes). For other integer vector types, the latency has been reduced by ~8 to ~15 clock cycles
- added hidden and retroactively improved exp2 scalar and vector integer argument function overloads. These return exp2((float/double)x) or (float/double)(1 << x) in 3 instead of 6 to 7 clock cycles at best; they of course also work for negative input values i.e. reciprocals of powers of 2. The (u)int overloads convert to floats, the (u)long overloads convert to doubles; explicit integer to integer casting should (and sometimes has to) be used for optimal results. Additionally, these overloads contain an optional 'Promise' parameter, allowing for omission of clamping which is needed to ensure correct underflow/overflow behavior, as dictated by Unity's exp2 implementation. If you ever used the standard exp2 function by implicitly converting an int type to a float type, performance was improved by a factor of about 30x. This overload only "breaks" code that casts (u)long types to float types implicitly if the result is expected to be a float type. It is recommended to explicitly cast the (u)long type to a (u)int type in such a case
- added ==, !=, <, >, <= and >= operators for UInt128 and signed long/int comparisons, as the expensive float conversion and comparison was previously used when, for instance, compar...
MaxMath v2.2.0
Known Issues
- half8 == and != operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
- (s)byte, (u)short vector and (U)Int128 multiplication, division and modulo operations by compile time constants are not optimal. For (U)Int128, it requires a new Burst feature à la T Constant.ForceCompileTimeEvaluation<T, U>(Func<U, T> code) (proposed); currently work is being done on (s)byte and (u)short vectors in this regard, which will beat any compiler. The current (tested) state of all optimizations possible is included in this version.
- pow functions with compile time constant exponents currently do not handle many decimal numbers - math.rsqrt would often be used in those cases for optimal performance but it is actually slower when the Unity.Burst.FloatMode is set to anything but FloatMode.Fast. To guarantee optimal performance, compile time access to the current FloatMode would be needed (proposed)
- double (r)cbrt functions are currently not optimized
Fixes
- linked float8 rcp and rsqrt functions to Burst's FloatMode and FloatPrecision
- short.MinValue / -1 now correctly overflows to short.MinValue when dividing a short16 vector by another short16 vector when compiling for AVX or higher
- fixed scalar quarter to double conversion for when the quarter value is negative
- fixed scalar half to quarter conversion for when the half value is negative
- fixed vector quarter to ulong conversion for when a quarter value is negative
- fixed (u)short8 to quarter8 conversion
Additions
Added saturation arithmetic to the library for all scalar- and vector types. Saturation arithmetic clamps the result of an operation to type.MinValue and type.MaxValue if under- or overflow occurs, respectively, and has single-instruction hardware support for (s)bytes and (u)shorts. The included functions are:
- addsaturated
- subsaturated
- mulsaturated
- divsaturated (only clamps division of floating point types and signed division of, for instance, sbyte.MinValue (= -128) / -1 to sbyte.MaxValue (= 127), which would cause a hardware exception for ints and longs)
- castsaturated (all types to all other types with a smaller range)
- csumsaturated
- cprodsaturated
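For readers unfamiliar with saturation arithmetic, the scalar sketch below shows the clamping semantics described above. The class and method names are hypothetical and the code is a plain C# illustration, not the library's SIMD implementation.

```csharp
using System;

public static class SaturationSketch
{
    // Signed 8 bit addition, clamped to [sbyte.MinValue, sbyte.MaxValue].
    public static sbyte AddSaturated(sbyte a, sbyte b)
    {
        int sum = a + b; // evaluated as int, so it cannot overflow here
        return (sbyte)Math.Max(sbyte.MinValue, Math.Min(sbyte.MaxValue, sum));
    }

    // Unsigned 32 bit addition, clamped to uint.MaxValue.
    public static uint AddSaturated(uint a, uint b)
    {
        uint sum = unchecked(a + b);
        return sum < a ? uint.MaxValue : sum; // a wrapped sum is smaller than a
    }

    // Signed 8 bit division; the only overflowing case is MinValue / -1.
    public static sbyte DivSaturated(sbyte a, sbyte b)
    {
        int quotient = a / b; // -128 / -1 == 128 when evaluated as int
        return (sbyte)Math.Min(sbyte.MaxValue, quotient);
    }
}
```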
(U)Int128
- added high performance (U)Int128 types with full library support, meaning: all operators and type conversions as well as all functions support these types. Most operations of both types, in Burst code, compile down to optimal machine code. Exceptions: 1) signed 64x64 bit to 128 bit multiplication 2) *, /, % and divrem functions with a scalar compile time constant argument (See: Known Issues 2)
- added Random128 XOR-Shift pseudo random number generator for generating (U)Int128s
Cube Root
- added high performance & accuracy (r)cbrt - (reciprocal) cube root functions for scalar and vector float- and double types based on a research paper from 2021. An optional bool parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with math.pow(x, 1f/3f)), which is set to false by default
- added high performance intcbrt - integer cube root functions for all scalar and vector integer types. For signed integer types, an optional bool parameter allows the caller to decide whether or not negative input values should be handled correctly (which is not the case with math.pow(x, 1f/3f)), which is set to false by default
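To clarify what an integer cube root returns, here is a simple reference sketch based on binary search. It is deliberately slow compared to what the release describes, and it is not the algorithm used by intcbrt. The names are hypothetical, and the signed variant always handles negative inputs instead of taking an optional bool.

```csharp
using System;

public static class IntCbrtSketch
{
    // Largest r with r * r * r <= x.
    public static uint IntCbrt(uint x)
    {
        uint lo = 0, hi = 1625; // 1625^3 < 2^32 < 1626^3
        while (lo < hi)
        {
            uint mid = (lo + hi + 1) / 2;
            if ((ulong)mid * mid * mid <= x) lo = mid; // cube computed in 64 bits
            else hi = mid - 1;
        }
        return lo;
    }

    // Signed variant: takes the root of the magnitude and restores the sign,
    // so the result is rounded toward zero.
    public static int IntCbrtSigned(int x)
    {
        if (x >= 0) return (int)IntCbrt((uint)x);
        uint magnitude = (uint)(-(long)x); // safe even for int.MinValue
        return -(int)IntCbrt(magnitude);
    }
}
```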
Other Additions
- added a log function to all scalar and vector float- and double types with a second parameter b, which is the logarithm's base
- added reversebytes functions for all scalar- and vector types, which convert back and forth between big endian and little endian byte order, respectively. All of them (scalar, vector) compile down to single hardware instructions
- added pow functions with scalar exponents for float and double scalars and vectors, with optimizations for selected constant exponents (not necessarily whole exponents)
- added function overloads to all functions for scalar (s)bytes and (u)shorts in order to resolve function call resolution ambiguity which was already present in Unity.Mathematics, which may also improve performance in some cases
- added a static readonly New property to RandomX XOR-Shift pseudo random generators. It calls Environment.TickCount internally (and is thus seeded somewhat randomly), makes sure it is non-zero and can be called from Burst native code
- added fastrcp functions for float scalars and vectors, faster (and substantially less accurate) than FloatPrecision.Low, FloatMode.Fast Burst implementations
- added fastrsqrt functions for float scalars and vectors, faster (and substantially less accurate) than FloatPrecision.Low, FloatMode.Fast Burst implementations
Improvements
- added AVX and AVX2 code for float8 sin, cos, tan, sincos, asin, acos, atan, atan2, sinh, cosh, tanh, pow, exp, exp2, exp10, log, log2, log10 and fmod (and the % operator)
- optimized many /, %, * and divrem operations with a scalar compile time constant argument for (s)byte vectors (see 'Known Issues 2'), which were previously not optimized (...optimally/at all) by Burst.
- added SSE2 fallback code for converting AVX vector types to SSE vector types and vice versa (for example: short16 (256 bit) to byte16 (128 bit))
- scalar (s)byte and (u)short rol and ror functions now compile down to single hardware instructions
- improved performance and/or reduced code size of nearly all vector comparison operations (==, > etc.)
- improved performance of - and added SSE2 fallback code for bitfield to boolean vector conversion (toboolX and thus also select(vector a, vector b, bitmask c))
- improved performance of intpow functions in general and for when the exponent is a compile time constant
- improved performance and reduced code size of compareto vector functions (especially for unsigned types)
- added more optimizations to isdivisible
- improved performance of intsqrt functions for (u)long and (s)byte scalar and vector types considerably
- reduced code size of ispow2 vector functions
- reduced code size of (s)byte vector-by-vector division
- improved performance of Random64's (u)long4 generation if compiling for AVX2
- improved performance of (s)byte matrix multiplication
- reduced code size of (u)short- and up to (s)byte8 vector by vector division and divrem functions (and improved performance if compiling for SSE2 only)
- reduced code size and improved performance of isinrange functions for (u)long vector types
- reduced code size of ushort vector >= and <= operators for SSE2 fallback code by ~75%
- improved performance and reduced code size of SSE2 down-casting fallback code
Changes
- API BREAKING CHANGE: The various boolean to integer/floating point conversion functions (touint8/tof32 etc.) are now renamed to contain C# types in their names (tobyte/tofloat etc.)
- API BREAKING CHANGE: If you use this library as intended, meaning you import it and Unity.Mathematics.math statically (using static MaxMath.maxmath;) and you use the pow functions with scalar bases and scalar exponents in those scripts, you will encounter the first ever function call resolution ambiguity. It is strongly recommended to always use the maxmath.pow function, because it optimizes any pow call enormously if the exponent is a compile time constant, which does NOT necessarily mean that such a call must declare the exponent as a literal value - the exponent may become a compile time constant due to constant propagation
- quarter is now a readonly struct
- quarter to sbyte, short, int and long conversions are now required to be declared explicitly
- removed countbits(void* ptr, ulong bytes) from the library and added it to https://github.com/MrUnbelievable92/SIMD-Algorithms with more options
Fixed Oversights
- (Issue #3) added constructor wrappers to the maxmath class analogous to `Unity.Mathematics` (`byte4 myByte4 = (maxmath.)byte4(1, 2, 3, 4);`)
- added `dsub` - fused divide-subtract function for scalar and vector `float` types
- added an optional `bool fast = false` parameter to `dad`, `dsub`, `dadsub` and `dsubadd` functions
- added `andnot` function overloads for scalar and vector `bool` types
- added implicit type conversions of scalar `quarter` values to `half`, `float` and `double` vectors
- added `all_eq` and `all_dif` functions for vectors of size 2
- added `all_eq` and `all_dif` functions for `float` and `double` vectors
MaxMath v2.1.2
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed undefined behavior of "vshr" functions for vector types smaller than 128 bits
- fixed SSE2 implementations of "vrol" and "vror" functions for the (u)short16 type
Additions
- implemented Bmi1 and Bmi2 intrinsics as functions with a "bits_" prefix (except for "andn", which has already been implemented as "andnot")
- added high performance and/or SIMD "isdivisible" functions for all integer vector types and scalar value types
- added high performance and/or SIMD "intpow" - integer exponentiation - functions for (u)int, (u)long and all integer vector types
- added high performance and/or SIMD "floorpow2" functions for all integer vector types
- added "nabs" - negative absolute value functions for all non-boolean vector- and single value types
- added "indexof(vector v, value x)" functions for all non-boolean vector types
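For readers unfamiliar with the function names in this list, here is a scalar reference sketch of what "intpow", "floorpow2" and "nabs" compute, based only on their descriptions above. The class and method names are hypothetical, overflow wraps silently, the result of `FloorPow2(0)` is an assumption, and none of this is MaxMath's actual (SIMD) implementation.

```csharp
using System;

public static class ScalarReference
{
    // Integer exponentiation by squaring: O(log exponent) multiplications.
    public static uint IntPow(uint x, uint exponent)
    {
        uint result = 1;
        while (exponent != 0)
        {
            if ((exponent & 1) != 0) result *= x; // multiply in the current bit's power
            x *= x;                               // square the base for the next bit
            exponent >>= 1;
        }
        return result;
    }

    // Largest power of two that is <= x; returns 0 for x == 0 (an assumption).
    public static uint FloorPow2(uint x)
    {
        // Smear the highest set bit into every lower position...
        x |= x >> 1;
        x |= x >> 2;
        x |= x >> 4;
        x |= x >> 8;
        x |= x >> 16;
        // ...then keep only the highest bit.
        return x - (x >> 1);
    }

    // Negative absolute value: always <= 0, so int.MinValue needs no special case.
    public static int NAbs(int x) => x > 0 ? -x : x;
}
```

Unlike `Math.Abs`, the negative absolute value is representable for every input, which is one reason it can be useful as a building block.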
Improvements
- aggressively optimized away global variables (shuffle masks) and thus memory access and usage where appropriate
- improved performance of 256 bit vector subvector getters
- added Sse2 fallback code for all (u)long2/3/4 operators
- improved performance of multiplication, division and modulo operations for all (s)byte- and (u)short vector- and matrix types when dividing by a single non-compile time constant value
- added overloads for (s)byte- and (u)short vectors' "divrem" functions with a scalar value as the divisor parameter, improving performance when it is a compile time constant
- improved performance of "intsqrt" functions for most types
Changes
- bump com.unity.burst to version 1.5
Fixed Oversights
- added bitmask8 and bitmask16 functions for (s)byte and (u)short vector types, respectively
MaxMath v2.1.1
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed triggered burst compilation error by "Sse4_1.blend_epi16" when compiling for SSE2 due to fallback code not using a constant value for "imm8"
- fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
- fixed "tzcnt" implementations (were completely broken)
- fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors
Additions
- added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
- added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) - functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
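The contracts of "gcd", "lcm" and "intsqrt" stated in this list (unsigned results, an exception for negative square root inputs) can be illustrated with a scalar sketch. The class and method names below are hypothetical, the handling of zero inputs is an assumption, and the code is a plain reference version rather than the library's SIMD implementation.

```csharp
using System;

public static class IntegerMathSketch
{
    // Euclid's algorithm on the magnitudes of a and b; the result is unsigned.
    // Gcd(0, 0) is 0 here (an assumption).
    public static uint Gcd(int a, int b)
    {
        uint x = (uint)Math.Abs((long)a);
        uint y = (uint)Math.Abs((long)b);
        while (y != 0)
        {
            uint t = x % y;
            x = y;
            y = t;
        }
        return x;
    }

    // lcm(a, b) = |a| / gcd(a, b) * |b|, which is 0 if either input is 0.
    // The result wraps silently if it does not fit into 32 bits.
    public static uint Lcm(int a, int b)
    {
        uint g = Gcd(a, b);
        if (g == 0) return 0; // both inputs are 0; avoids dividing by zero
        return (uint)Math.Abs((long)a) / g * (uint)Math.Abs((long)b);
    }

    // floor(sqrt(x)); negative inputs throw.
    public static int IntSqrt(int x)
    {
        if (x < 0) throw new ArgumentOutOfRangeException(nameof(x), "x must be non-negative");
        int root = (int)Math.Sqrt(x);
        // Correct the estimate in case of floating point rounding.
        while ((long)root * root > x) root--;
        while ((long)(root + 1) * (root + 1) <= x) root++;
        return root;
    }
}
```

Returning an unsigned type means that even `Gcd(int.MinValue, 0)`, whose result is 2147483648, is representable.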
Improvements
- performance improvements of "avg" functions for signed integer vectors
- added SIMD implementations of the "transpose" functions for all matrix types
- added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
- added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
- added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
- added SSE2 fallback code for typecasting, propagating through the entire library
- added SSE2 fallback code for "addsub" and "subadd" functions
- bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively
Changes
- renamed "BurstCompilerException" to "CPUFeatureCheckException"
- "shl", "shrl" and "shra" now have undefined behavior when shifting by any amount outside of the interval [0, 8 * sizeof(integer_type) - 1], for performance reasons and because of differences between SSE, AVX and managed C#
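One of the differences between managed C# and SIMD shifts is easy to reproduce. C# masks the shift count of a 32-bit operand to its low 5 bits, whereas x86 SIMD logical shift instructions such as SSE2's `pslld` produce 0 once the count exceeds 31. The sketch below uses only plain C#; `SimdLikeShl` is a hypothetical helper that models the SIMD result and is not part of MaxMath.

```csharp
using System;

public static class ShiftCountDemo
{
    // Plain C# shift: the count is masked to its low 5 bits for 32-bit operands.
    public static uint ManagedShl(uint x, int count) => x << count;

    // Hypothetical model of an x86 SIMD logical left shift:
    // any count above 31 clears the element.
    public static uint SimdLikeShl(uint x, int count) => (uint)count > 31u ? 0u : x << count;

    public static void Main()
    {
        Console.WriteLine(ManagedShl(1u, 33));  // prints 2, because 33 & 31 == 1
        Console.WriteLine(SimdLikeShl(1u, 33)); // prints 0
    }
}
```

Both versions agree for every count inside [0, 31], so staying within the documented interval avoids the discrepancy.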
Fixed Oversights
- added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
- added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
- added "compareto" functions for all vector types except half- and quarter vectors
- added "all_dif" functions for (s)byte32 vectors
- added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors
2.1.1 Hotfix
Fixes
- fixed SSE2 "shl", "shrl" and "shra" implementations
- fixed SSE2 "intsqrt" implementations
Improvements
- improved performance of (s)byte2, -3, -4, -8, -16 and (u)short2, -3, -4, -8 "gcd" functions (and thus "lcm") when compiling for Avx2
- improved performance of "tzcnt" and "lzcnt" implementations for all vector types if compiling for SSE4 or higher, propagating through a lot of the library
Fixed Oversights
- added documentation for RandomX methods
MaxMath v2.1.0
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed triggered burst compilation error by "Sse4_1.blend_epi16" when compiling for SSE2 due to fallback code not using a constant value for "imm8"
- fixed incorrect CPU feature checks for quarter vector type-conversion code when compiling for SSE2
- fixed "tzcnt" implementations (were completely broken)
- fixed scalar (single value and C# fallback) "lzcnt" implementations for (s)byte and (u)short values and (u)long4 vectors
Additions
- added "ulong countbits(void* ptr, ulong bytes)", which counts the number of 1-bits in a given block of memory, using Wojciech Mula's SIMD population count algorithm
- added high performance and/or SIMD "gcd" a.k.a. greatest common divisor functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "lcm" a.k.a. least common multiple functions for (u)int, (u)long and all integer vector types, which always return unsigned types and vectors
- added high performance and/or SIMD "intsqrt" - integer square root (floor(sqrt(x))) - functions for all integer- and integer vector types, with the functions for signed integers and vectors throwing an ArgumentOutOfRangeException in case a value is negative
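As a scalar illustration of the memory-block "countbits" function in this list: Wojciech Mula's SIMD population count is built around a 16-entry lookup table indexed by 4-bit nibbles, applied to many bytes at once with a byte shuffle instruction. The sketch below applies the same table one byte at a time and takes a managed array instead of a `void*`. The class and method names are hypothetical and this is not the library's implementation.

```csharp
using System;

public static class PopCountSketch
{
    // Number of set bits in each possible 4-bit value.
    private static readonly byte[] NibbleBits =
    {
        0, 1, 1, 2, 1, 2, 2, 3,
        1, 2, 2, 3, 2, 3, 3, 4
    };

    // Counts the 1-bits in the given block of memory.
    public static ulong CountBits(byte[] data)
    {
        ulong total = 0;
        foreach (byte b in data)
        {
            // Look up the low and the high nibble separately and add both counts.
            total += (ulong)(NibbleBits[b & 0x0F] + NibbleBits[b >> 4]);
        }
        return total;
    }
}
```

For example, `PopCountSketch.CountBits(new byte[] { 0xFF, 0x0F, 0x01, 0x00 })` adds up 8 + 4 + 1 + 0 set bits.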
Improvements
- performance improvements of "avg" functions for signed integer vectors
- added SIMD implementations of the "transpose" functions for all matrix types
- added SSE4 and SSE2 fallback code for variable bitshifts ("shl", "shrl" and "shra")
- added SSE2 fallback code for (s)byte vector-by-vector division and modulo operations
- added SSE2 fallback code for "all_dif" for (s)byte16, (u)short8 and (u)int8 vectors
- added SSE2 fallback code for typecasting, propagating through the entire library
- added SSE2 fallback code for "addsub" and "subadd" functions
- bitmask32 and bitmask64 now allow for masks to be up to 32 and 64 bits wide, respectively
Changes
- renamed "BurstCompilerException" to "CPUFeatureCheckException"
- "shl", "shrl" and "shra" now have undefined behavior when shifting by any amount outside of the interval [0, 8 * sizeof(integer_type) - 1], for performance reasons and because of differences between SSE, AVX and managed C#
Fixed Oversights
- added "shl", "shrl" and "shra" (varying per element) functions for (s)byte and (u)short vectors
- added "ror" and "rol" (varying per element) functions for (s)byte and (u)short vectors
- added "compareto" functions for all vector types except half- and quarter vectors
- added "all_dif" functions for (s)byte32 vectors
- added vshr/l and vror/l functions for (s)byte32 and (u)short16 vectors
MaxMath v2.0.0
Re-Release Notes
- Version 2.0.0 adds - for the first time - fallback procedures from Avx2 to Sse4, Sse2 and platform independent instruction sets, respectively, with some major optimizations for all of them
- ARM and other instruction sets do NOT have optimized fallback procedures written for them, and there are no plans for it at this time. Burst/LLVM are good at recognizing the patterns in the code, though, and some of the code will be vectorized for other platforms (confirmed)
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation
Fixes
- fixed incorrect bool4 subvector getters of the bool8 type
Improvements
- removed "fixed" vector element access to improve performance in managed C#
Additions
- added "shuffle(vector, vector, ShuffleComponent(, ShuffleComponent)(, ShuffleComponent)(, ShuffleComponent))" functions for (s)byte, (u)short, (u)long, quarter and half vectors
Changes
- Bump com.unity.burst to version 1.4.4
Fixed Oversights
- Added "addsub" function for floating point types, complementary to "subadd"
- Added "addsub" and "subadd" functions for integer types
MaxMath v1.2.0
Known Issues
- half8 "==" and "!=" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation.
Fixes
- Added preliminary safety cast to a float of the half value in toboolsafe() until Unity fixes their half '==' and '!=' operators according to IEEE 754
Additions
"quarter" precision floats and vectors
- "quarter" is an 8-bit IEEE 754 1.3.4.-3 floating point value, often called a "minifloat"
- It has a very limited range of [-15.5, 15.5] with an epsilon of 0.015625. All integers, as well as i + 0.5, within that range can be represented as a quarter
- Type conversion from - and to quarters also conforms to the IEEE 754 standard. In detail, casting to a quarter performs rounding according to a) its precision and b) whether or not the more precise value is closer to 0 or to quarter.Epsilon. NaN and +/- zero preservation, as well as preservation of/clamping to +/- infinity, was also implemented
- "==" and "!=" operators for vectors conforming to the IEEE 754 standard were implemented (unlike, currently, Unity's "half" type). All the other boolean- and arithmetic operators were implemented for the base type only, which will return single precision results (for arithmetic operations). For vectors, quarter vectors are to be (implicitly) cast to single precision vectors first, until/if Unity changes their "half" implementation.
- Type conversions from - and to all other single value and vector types were implemented
- Full function implementation within the library was added, including: abs(), isnan(), isinf(), isfinite(), select(), as[s]byte/asquarter(), vrol/r(), vshl/r(), toboolsafe and toquartersafe
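The range and epsilon quoted in this list follow directly from the bit layout. The sketch below decodes an 8-bit pattern under the assumption that "1.3.4.-3" means 1 sign bit, 3 exponent bits, 4 mantissa bits and an exponent bias of 3, with IEEE 754 style subnormals, infinities and NaNs. The class and method names are hypothetical and this is not the library's conversion code.

```csharp
using System;

public static class QuarterSketch
{
    // Decodes a bit pattern laid out as: 1 sign bit, 3 exponent bits, 4 mantissa bits, bias 3.
    public static float ToFloat(byte bits)
    {
        int sign = bits >> 7;
        int exponent = (bits >> 4) & 0x7;
        int mantissa = bits & 0xF;

        float magnitude;
        if (exponent == 0)
        {
            // Zero and subnormals: no implicit leading 1, fixed scale of 2^(1 - 3).
            magnitude = mantissa / 16f * 0.25f;
        }
        else if (exponent == 7)
        {
            // All exponent bits set: infinity or NaN.
            magnitude = mantissa == 0 ? float.PositiveInfinity : float.NaN;
        }
        else
        {
            // Normal numbers: implicit leading 1, scale of 2^(exponent - 3).
            magnitude = (1f + mantissa / 16f) * (float)Math.Pow(2.0, exponent - 3);
        }

        return sign == 0 ? magnitude : -magnitude;
    }
}
```

Under this reading the largest finite pattern, `0x6F`, decodes to 1.9375 * 8 = 15.5 and the smallest positive pattern, `0x01`, decodes to 0.015625, matching the range and epsilon given above. In the top binade (8 to 15.5) the spacing is 0.5, which is why every i + 0.5 in range is representable.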
Fixed Oversights
- Added missing type conversions from - and to half8 for (s)byte8, (u)short8 and (u)int8 vectors
- Added missing type conversions from - and to half8 for booleans and boolean vectors
- Added half "select" functions
- Improved the performance of unsafe boolean-to-half/float/double functions
- Added (preliminary?) "abs", "isnan", "isinf" and "isfinite" for half and half vectors, eliminating unnecessary casting
MaxMath v1.1.0
Known Issues
- half8 "equals" and "not equals" operators don't conform to the IEEE 754 standard - Unity has not yet reacted to my bug-report in regards to their "half" implementation.
Fixes
- Fixed a bug where vshl-/vshr-ing a(n) (s)byte16 vector by 11 would return the vector itself
Changes
- Changed the return type of count(boolx) to a uint instead of an int
Additions
RNG
- Added/Modified 8, 16, 32 and 64 bit XOR-Shift pseudo random number generators:
- They use the most efficient (Avx2) SIMD instructions to generate vectors with elements of the corresponding size in bytes. When compared to Unity.Mathematics, the performance is better since a) scalar multiplication of each generated value has been replaced by a single SIMD instruction and b) doubles are generated by Random64 instead of two 32-bit RNG iterations
- Removed NextT(T max) from SIGNED integer and floating point types, since those will never generate negative numbers. One can either generate an unsigned integer and cast it to a signed value for free, or use the functions with min and max parameters, as both of these would be more clear in regards to what range the result will be in
- Unity.Mathematics.Random is implicitly convertible to Random32 and vice versa. Safe and fast explicit type conversions between Random8, 16, 32 and 64 were added
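For context, a scalar XOR-Shift generator looks roughly like the sketch below. It uses Marsaglia's classic 13/17/5 shift triple for 32-bit state; whether MaxMath's generators use these exact constants is not stated here, so treat them as an assumption. The struct name is hypothetical, and the library's versions produce whole vectors with SIMD instructions rather than one value at a time. `NextInt` shows the "generate an unsigned integer and cast it" approach mentioned above.

```csharp
using System;

public struct XorShift32Sketch
{
    private uint state; // must never be 0, otherwise the generator stays at 0

    public XorShift32Sketch(uint seed)
    {
        state = seed != 0 ? seed : 1u;
    }

    // One xorshift step using the 13/17/5 shift triple.
    public uint NextUInt()
    {
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        return state;
    }

    // Full-range signed value obtained by reinterpreting the unsigned result;
    // the cast itself costs nothing at runtime.
    public int NextInt() => unchecked((int)NextUInt());
}
```

Because the struct is mutable, keep it in a local variable or field and call the methods on that instance; calling them on a copy would not advance the original state.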
Shuffle
- Added bool8, bool16 and bool32 subvector getters
- Added setters for half8 subvectors
- Added (s)byte32 subvector getters
- Added setters for (s)byte8 and (s)byte16 subvectors
- Added setters for (u)short8 subvectors
- !!! Setters for (s)byte32, (u)short16, (u)int8 and float8 subvectors are implemented, but due to Unity.Burst related bugs, they are deactivated in the code. The issue was forwarded a month ago and should be fixed with Burst 1.5
- Slightly improved the performance of a select few (s)byte2/3/4 vector shuffle getters and setters