
Add support for SIMD multi-arch exports #102

Open
Ivorforce opened this issue Oct 4, 2024 · 2 comments
Labels
feature New feature or request

Comments

@Ivorforce
Owner

I discussed this on Discord with Claire: some people might be willing to trade a substantially larger binary for better top speeds when running on a capable architecture.

I see 3 ways to approach it:

  1. Dynamically rebind vatensor functions to different implementations depending on the runtime arch. This would probably be a lot of work, but it would unify everything under one self-contained binary with a common interface. Plus, a lot of non-critical code could stay un-duplicated; for example, reductions don't really benefit from AVX-512 (per preliminary tests).
  2. Offer multiple complete binaries based on arch feature tags. These don't exist yet, so Godot itself would have to be involved. This is less effort overall, but also a worse trade-off.
  3. Fork / extend xtensor to support runtime checks itself. This may actually already be implemented; I still have to check for real. I don't think it is, but it might be.

There might be another way, but I certainly don't know of one.
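Option 1 (runtime rebinding) could be sketched roughly like below. All names (`va::generic`, `va::avx2`, `va_add`) are hypothetical placeholders, not vatensor's actual API, and the feature check assumes GCC/Clang on x86-64:

```cpp
// Hypothetical scalar and AVX2 variants of one vatensor-style function.
// In practice these would live in translation units compiled with
// different -march flags; here both are trivial stand-ins.
namespace va {
namespace generic { inline int add(int a, int b) { return a + b; } }
namespace avx2    { inline int add(int a, int b) { return a + b; } }
}

using add_fn = int (*)(int, int);

// Pick the best implementation for this CPU, once, at startup.
add_fn resolve_add() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2"))
        return va::avx2::add;
#endif
    return va::generic::add;
}

// Callers go through one pointer indirection; the binding never changes
// after load, so there is no per-call feature check.
add_fn va_add = resolve_add();
```

The per-call cost is a single indirect call; the drawback is that every rebindable function needs a pointer (or vtable-like table) and matching signatures across all arch variants.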

@Ivorforce Ivorforce added the feature New feature or request label Oct 4, 2024
@Ivorforce
Owner Author

Ivorforce commented Nov 13, 2024

On macOS, it's possible to add an x86_64h slice, which targets Haswell and newer CPUs, alongside the default x86_64 slice. This should be especially beneficial for us, since x86_64h enables avx2, sse4.2, and more by default. The slice is added to the binary and can speed up execution on supported CPUs.
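A minimal sketch of how such a universal binary could be produced with clang on macOS; the file names are illustrative, not the project's actual build setup:

```shell
# Passing both -arch flags makes clang compile every file twice and emit a
# universal (fat) binary; dyld then selects the x86_64h slice automatically
# on Haswell+ CPUs and falls back to x86_64 elsewhere.
clang++ -arch x86_64 -arch x86_64h -O2 -shared \
    tensor_math.cpp -o libvatensor.dylib

# Verify which slices ended up in the binary:
lipo -archs libvatensor.dylib
```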

@Ivorforce
Owner Author

The slice adds significant size to the binary, and only some functions speed up substantially enough to warrant the difference.

I think I have a better solution:

  1. Determine which functions benefit from which SIMD extensions (I can only test up to AVX2, unfortunately).
  2. Make a Python script that uses features.py and scu.py functionality to compile all of these files separately, using the appropriate flag (I think -march=x86-64-v2 may be the most appropriate first test).

Then, either:

  • Call the vatensor function variant for the automatically determined SIMD level appropriate for the machine.
  • Within each vatensor function, add an if (avx2) { va::avx2::function(a, b, c...) } branch to add indirection after the call.

The former should be faster, but the latter makes for a cleaner, SIMD-agnostic interface; I prefer the latter for this reason. The difference shouldn't be huge, though.
Duplication can be avoided by separately exposing the 'smallest common denominator' functions for those that do some logic before dispatching with SIMD differences, though most are pretty minimal already.
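The second bullet (branching inside each function) might look roughly like this. The names and feature check are illustrative, not vatensor's actual code, and the `avx2` namespace merely stands in for a translation unit that would really be compiled with a higher -march level:

```cpp
#include <cstddef>

namespace va {

// Baseline scalar implementation.
namespace generic {
inline long sum(const int* p, std::size_t n) {
    long acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += p[i];
    return acc;
}
}

// Stand-in for a variant built with e.g. -march=x86-64-v2; here it just
// forwards to the scalar version so the sketch stays portable.
namespace avx2 {
inline long sum(const int* p, std::size_t n) { return generic::sum(p, n); }
}

inline bool cpu_has_avx2() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("avx2");
#else
    return false;
#endif
}

// Public, SIMD-agnostic entry point: callers never see the SIMD level,
// at the cost of one branch per call.
inline long sum(const int* p, std::size_t n) {
    if (cpu_has_avx2())
        return avx2::sum(p, n);
    return generic::sum(p, n);
}

} // namespace va
```

Since the branch is perfectly predictable (the CPU's feature set never changes at runtime), the per-call overhead should indeed be small.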

The upside of this solution is that it's agnostic to the dispatch target: theoretically, this could include BLAS dispatch, e.g. if BLAS is installed locally (or loaded otherwise). The downside is that it's a bit verbose at the dispatch call, but that could probably be made minimal.
