
Add support for SIMD multi-arch exports #102

Open
Ivorforce opened this issue Oct 4, 2024 · 2 comments
Labels
feature New feature or request

Comments

@Ivorforce
Owner

I discussed this on Discord with Claire: some people might be willing to trade a substantially larger binary for better top speeds when running on a capable architecture.

I see 3 ways to approach it:

  1. Dynamically rebind vatensor functions to different implementations depending on the runtime arch. This would probably be a lot of work, but it would unify everything under one self-contained binary with a common interface. Plus, a lot of non-critical code could stay un-duplicated; for example, reductions don't really benefit from AVX-512 (per preliminary tests).
  2. Offer multiple complete binaries based on arch feature tags. These don't exist yet, so Godot itself would have to be involved. This is less effort overall, but also a worse trade-off.
  3. Fork / extend xtensor to support runtime checks itself. This may actually already be implemented; I still have to check for real. I don't think it is, but it might be.

There might be another way, but I certainly don't know of one.
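Option 1 (runtime rebinding) could be sketched roughly like below. All names (`va::generic`, `va::avx2`, `va_add`) are hypothetical placeholders, not vatensor's actual API, and the feature check assumes GCC/Clang on x86-64:

```cpp
// Hypothetical scalar and AVX2 variants of one vatensor-style function.
// In practice these would live in translation units compiled with
// different -march flags; here both are trivial stand-ins.
namespace va {
namespace generic { inline int add(int a, int b) { return a + b; } }
namespace avx2    { inline int add(int a, int b) { return a + b; } }
}

using add_fn = int (*)(int, int);

// Pick the best implementation for this CPU, once, at startup.
add_fn resolve_add() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    if (__builtin_cpu_supports("avx2"))
        return va::avx2::add;
#endif
    return va::generic::add;
}

// Callers go through one pointer indirection; the binding never changes
// after load, so there is no per-call feature check.
add_fn va_add = resolve_add();
```

The per-call cost is a single indirect call; the drawback is that every rebindable function needs a pointer (or vtable-like table) and matching signatures across all arch variants.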

@Ivorforce Ivorforce added the feature New feature or request label Oct 4, 2024
@Ivorforce
Owner Author

Ivorforce commented Nov 13, 2024

On macOS, it's possible to add an x86_64h slice, which targets Haswell and newer CPUs, alongside the default x86_64 slice. This should be especially beneficial for us, since x86_64h enables avx2, sse4.2, and more by default. The slice is added to the binary and can speed up execution on supported CPUs.
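A minimal sketch of how such a universal binary could be produced with clang on macOS; the file names are illustrative, not the project's actual build setup:

```shell
# Passing both -arch flags makes clang compile every file twice and emit a
# universal (fat) binary; dyld then selects the x86_64h slice automatically
# on Haswell+ CPUs and falls back to x86_64 elsewhere.
clang++ -arch x86_64 -arch x86_64h -O2 -shared \
    tensor_math.cpp -o libvatensor.dylib

# Verify which slices ended up in the binary:
lipo -archs libvatensor.dylib
```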

@Ivorforce
Owner Author

The slice adds significant size to the binary, and only some functions speed up substantially enough to warrant the difference.

I think I have a better solution:

  1. Determine which functions benefit from which SIMD extensions (I can only test up to AVX2, unfortunately).
  2. Make a Python script that uses features.py and scu.py functionality to compile all of these files separately, using the appropriate flag (I think -march=x86-64-v2 may be the most appropriate first test).

Then, either:

  • Call the vatensor function variant for the automatically determined SIMD level appropriate for the machine.
  • Within each vatensor function, add an if (avx2) { va::avx2::function(a, b, c...) } branch to add indirection after the call.

The former should be faster, but the latter makes for a cleaner, SIMD-agnostic interface; I prefer the latter for this reason. The difference shouldn't be huge, though.
Duplication can be avoided by separately exposing the 'smallest common denominator' functions for those that do some logic before dispatching with SIMD differences, though most are pretty minimal already.
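The second bullet (branching inside each function) might look roughly like this. The names and feature check are illustrative, not vatensor's actual code, and the `avx2` namespace merely stands in for a translation unit that would really be compiled with a higher -march level:

```cpp
#include <cstddef>

namespace va {

// Baseline scalar implementation.
namespace generic {
inline long sum(const int* p, std::size_t n) {
    long acc = 0;
    for (std::size_t i = 0; i < n; ++i) acc += p[i];
    return acc;
}
}

// Stand-in for a variant built with e.g. -march=x86-64-v2; here it just
// forwards to the scalar version so the sketch stays portable.
namespace avx2 {
inline long sum(const int* p, std::size_t n) { return generic::sum(p, n); }
}

inline bool cpu_has_avx2() {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("avx2");
#else
    return false;
#endif
}

// Public, SIMD-agnostic entry point: callers never see the SIMD level,
// at the cost of one branch per call.
inline long sum(const int* p, std::size_t n) {
    if (cpu_has_avx2())
        return avx2::sum(p, n);
    return generic::sum(p, n);
}

} // namespace va
```

Since the branch is perfectly predictable (the CPU's feature set never changes at runtime), the per-call overhead should indeed be small.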

The upside of this solution is that it's agnostic to the dispatch target: theoretically, this could include BLAS dispatch, e.g. if BLAS is installed locally (or loaded otherwise). The downside is that it's a bit verbose at the dispatch call, but that could probably be made minimal.
