Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to implement an expand function / scatter for masks? #975

Closed
avitase opened this issue Nov 17, 2023 · 9 comments
Closed

How to implement an expand function / scatter for masks? #975

avitase opened this issue Nov 17, 2023 · 9 comments

Comments

@avitase
Copy link

avitase commented Nov 17, 2023

How can I implement an efficient "expand" function that behaves like the family of *_mask_expand_epi* instrinsics of the AVX-512 instruction set? As gather/scatter already got their own counterparts in xsimd's API, is it worth doing the same for compress/expand?

@avitase avitase changed the title How to implement an scatter for masks? How to implement an expand function / scatter for masks? Nov 17, 2023
@serge-sans-paille
Copy link
Contributor

Mmmh, that's an interesting scenario. It's going to be quite difficult to implement this in an efficient generic manner, but I understand the pattern. Let's discuss the expected API. Following the AVX512 intrinsics, à la xsimd, that could be

xsimd::batch<T, A> xsimd::expand(xsimd::batch<T, A> self, xsimd::batch_bool<T, A> mask, xsimd::batch<T, A> filler);

The main problem probably is that we're likely to require shuffles to implement this operation, and we only have shuffle with constant masks, so we actually can provide a generic

xsimd::batch<T, A> xsimd::expand(xsimd::batch<T, A> self, xsimd::batch_bool_constant<T, A> mask, xsimd::batch<T, A> filler);

Unfortunately, a constant mask probably has very little use...

@avitase
Copy link
Author

avitase commented Nov 19, 2023

@serge-sans-paille, thanks for your reply!

xsimd::batch<T, A> xsimd::expand(xsimd::batch<T, A> self, xsimd::batch_bool<T, A> mask, xsimd::batch<T, A> filler);

I am not sure about the purpose of filler. Naively, I would expect that mask indicates what part of self should be distributed into an empty array. Do you propose to do this operation in filler instead?

The main problem probably is that we're likely to require shuffles to implement this operation

That's what I was wondering about as well. However, for me it doesn't feel right since expand is simpler/less powerful than shuffle or scatter because it does not reorder or even duplicate and shouldn't be implemented in terms of these more costly/complex operations. However, if no hardware support is available, how bad would be a generic fallback with a raw for loop? The same holds for compress...

@serge-sans-paille
Copy link
Contributor

*mask_expand_epi* has a filler argument that can be used to fill the empty parts of the array. I'm fine with having the fillear always be zero if it makes the generic function easier / more efficient to write.

After some extra thoughts:
expand with a static mask is just a specialization of a single shuffle. Not super fancy but why not. The generic gather relies on a sequence of generic inserts, not super fancy either :-/

But wait, we already have a generic swizzle, and an expand with zeroes can be expressed in terms of swizzle, I'll give it a try!

@avitase
Copy link
Author

avitase commented Nov 20, 2023

Oh, you are right. I was looking at the maskz commands - my bad. Looking forward to seeing what you come up with:) Personally, I struggled to see how I could use swizzle to enumerate the mask, e.g., let's say we want to expand a b c d e f g according to the mask 0 0 1 0 1 1 0 to get 0 0 a 0 b c 0. The mask for swizzle then has to be ? ? 0 ? 1 ? 2 3 ? where the ? could be set to any value as the final result has to be selected by the original mask anyhow. The question is, how to do this enumeration/accumulation/scan of the 1s in 0 0 1 0 1 1 0 to get ? ? 0 ? 1 2 ? w/o expand on an index sequence? 🐔🥚 If the answer is a raw for-loop then this solution is no better than implementing expand with such a loop in the first place but maybe I am missing something here.

@serge-sans-paille
Copy link
Contributor

mmh the example you give is not an expand, it's just a bit mask, and you just need to bitwise_and your data and the mask to get the expected result. In your example, an expand would yield a b c 0 0 0 0, and indeed a swizzle is not the answer.

@avitase
Copy link
Author

avitase commented Nov 21, 2023

I think our discussion finally converges towards the same problem. My initial question was about *mask{z}_expand_* as defined by Intel Intrinsics Guide or in APL (with a subsequent truncation). Translating the pseudo-code from the Intel guide to Python I get something like this:

def _mm_maskz_expand_epi8(k, a):
    N = len(k)
    dst = [0] * N

    m = 0
    for i in range(N):
        if k[i]:
            dst[i] = a[m]
            m += 1

    return dst

Obviously, my translation of N = len(k) is not quite right, however, for my previous example, this indeed yields:

>>> _mm_maskz_expand_epi8([False, False, True, False, True, True, False], "abcdefg")
[0, 0, 'a', 0, 'b', 'c', 0]

So my question is how to implement this in the generic case when this intrinsic is not available? Is it possible to do this w/o a raw for-loop? At the same time, this question becomes a feature request for xsimd:)

@serge-sans-paille
Copy link
Contributor

Can you check if the code in #977 matches your performance expectation (on non - AVX512 targets, that is)?

avitase added a commit to avitase/xsimd that referenced this issue Nov 22, 2023
The previous implementation relied on a raw for-loop that compiles an index sequence that was subsequently fed into a combination of select and swizzle. The new version doesn't need the latter two steps but compiles the result directly in the for-loop.
@avitase
Copy link
Author

avitase commented Nov 22, 2023

Your solution works. Thanks:) However, I have pruned your solution a bit. What do you think about #978 ?

serge-sans-paille added a commit that referenced this issue Nov 23, 2023
Also provide a specialization for avx512f.

Related to #975
serge-sans-paille added a commit that referenced this issue Nov 28, 2023
* Generic, simple implementation fox xsimd::compress

Related to #975

* fixup! Generic, simple implementation fox xsimd::compress

* fixup! Generic, simple implementation fox xsimd::compress
serge-sans-paille added a commit that referenced this issue Nov 28, 2023
Also provide a specialization for avx512f.

Related to #975
serge-sans-paille added a commit that referenced this issue Nov 29, 2023
Also provide a specialization for avx512f.

Related to #975
serge-sans-paille added a commit that referenced this issue Nov 29, 2023
Also provide a specialization for avx512f.

Related to #975
serge-sans-paille added a commit that referenced this issue Nov 29, 2023
Also provide a specialization for avx512f.

Related to #975
@serge-sans-paille
Copy link
Contributor

Fixed as of #981 and #977

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants