Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup opportunity via vector shuffle #31

Open
amonakov opened this issue Jul 7, 2022 · 1 comment
Open

Speedup opportunity via vector shuffle #31

amonakov opened this issue Jul 7, 2022 · 1 comment

Comments

@amonakov
Copy link

amonakov commented Jul 7, 2022

@bonzini forwarded me a request to contribute this here. Thanks Paolo, and thank you Mikolaj for lbzip2.

There's an opportunity to significantly speed up the common case in the inverse MTF transform where the reordered index is in the first sliding window (i.e. 0..15). Instead of taking a poorly predictable branch, we can load the entire window, shuffle it on registers, and write it back. With SSSE3 this boils down to a load-pshufb-store sequence, but could be done as well by shifting/masking on general registers (obviously more instructions, but still should be an improvement over the current branchy code).

I coded the following as a personal exercise. Probably not upstreamable, would like to hear what the requirements for a proper patch might be. If you try this patch, keep in mind that CFLAGS need to enable SSSE3 (-mssse3, or something like -march=native).

--- ../lbzip2/src/decode.c	2022-07-06 00:57:22.365988349 +0300
+++ src/decode.c	2022-07-06 09:54:45.420303970 +0300
@@ -452,6 +452,34 @@ mtf_one(uint8_t **imtf_row, uint8_t *imt
        version is given in #else clause.
      */
 #if ROW_WIDTH == 16
+#if defined(__SSSE3__)
+    {
+      typedef char c8v16 __attribute__((vector_size(16)));
+      static const c8v16 tab[] = {
+        { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 1, 0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 2, 0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 3, 0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 4, 0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 5, 0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 6, 0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 7, 0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15 },
+        { 8, 0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15 },
+        { 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15 },
+        { 10, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15 },
+        { 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15 },
+        { 12, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15 },
+        { 13, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15 },
+        { 14, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15 },
+        { 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 },
+      };
+      c8v16 t;
+      memcpy(&t, pp, sizeof t);
+      t = __builtin_ia32_pshufb128(t, tab[nn]);
+      memcpy(pp, &t, sizeof t);
+      return c;
+    }
+#endif
     switch (nn) {
     default:
       abort();
@tansy
Copy link

tansy commented Oct 5, 2022

I tried to test it but despite cpuid claiming SSE3 support it won't compile. Also on ARM there is no SSE. Would you mind to release it in "general terms", so it could be tested?

Ed. I have used `-mssse3' and compiled it.
The test result turned out to be interesting:

$ time lbzip2-2.5  -t -n1  silesia.tar.bzip2

real    0m15.470s
user    0m14.405s
sys     0m1.073s

$ time lbzip2-2.5-i31  -t -n1  silesia.tar.bzip2

real    0m15.300s
user    0m14.034s
sys     0m1.116s

$ time lbzip2-2.5  -t -n1  silesia.tar.lbzip2 

real    0m15.496s
user    0m14.267s
sys     0m1.071s

$ time lbzip2-2.5-i31  -t -n1  silesia.tar.lbzip2

real    0m18.341s
user    0m16.430s
sys     0m1.099s

So with bzip2 archive there was no difference, with lbzip2 archive it was 20% slower.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants