@bonzini forwarded me a request to contribute this here. Thanks Paolo, and thank you Mikolaj for lbzip2.
There's an opportunity to significantly speed up the common case in the inverse MTF transform, where the reordered index falls in the first sliding window (i.e. 0..15). Instead of taking a poorly predictable branch, we can load the entire window, shuffle it in registers, and write it back. With SSSE3 this boils down to a load-pshufb-store sequence, but it could also be done by shifting/masking in general registers (obviously more instructions, but it should still be an improvement over the current branchy code).
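For illustration only (this is not the patch from this issue): a minimal sketch of the load-shuffle-store idea, assuming a 16-byte window held in a plain byte array. The names `mtf_shuffle`, `mtf_init_shuffle` and `mtf_front16` are hypothetical and do not come from lbzip2. When the compiler does not enable SSSE3, the sketch falls back to a scalar loop that applies the same permutation, so it can also be built on non-x86 targets.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>   /* _mm_shuffle_epi8 (pshufb) */
#endif

/* Hypothetical table: row i is the byte permutation that moves slot i to
 * the front and shifts slots 0..i-1 back by one; slots above i stay put. */
static uint8_t mtf_shuffle[16][16];

static void mtf_init_shuffle(void)
{
    for (int i = 0; i < 16; i++) {
        for (int j = 0; j < 16; j++) {
            if (j == 0)
                mtf_shuffle[i][j] = (uint8_t)i;        /* new front */
            else if (j <= i)
                mtf_shuffle[i][j] = (uint8_t)(j - 1);  /* shifted back */
            else
                mtf_shuffle[i][j] = (uint8_t)j;        /* untouched */
        }
    }
}

/* Move window[idx] to the front without branching on idx (idx in 0..15).
 * Returns the decoded symbol, i.e. the new window[0]. */
static uint8_t mtf_front16(uint8_t window[16], unsigned idx)
{
#if defined(__SSSE3__)
    __m128i w = _mm_loadu_si128((const __m128i *)window);
    __m128i m = _mm_loadu_si128((const __m128i *)mtf_shuffle[idx]);
    _mm_storeu_si128((__m128i *)window, _mm_shuffle_epi8(w, m));
#else
    /* Portable fallback: same permutation, one byte at a time. */
    uint8_t tmp[16];
    for (int j = 0; j < 16; j++)
        tmp[j] = window[mtf_shuffle[idx][j]];
    memcpy(window, tmp, sizeof tmp);
#endif
    return window[0];
}
```

Call `mtf_init_shuffle()` once before the first `mtf_front16()`. Looking the permutation up in a table keeps the hot path free of data-dependent branches; whether that beats the existing code in practice would have to be measured.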
I coded the following as a personal exercise. It is probably not upstreamable as is, but I would like to hear what the requirements for a proper patch would be. If you try this patch, keep in mind that CFLAGS need to enable SSSE3 (-mssse3, or something like -march=native).
I tried to test it, but it won't compile even though cpuid reports SSE3 support. Also, ARM has no SSE. Would you mind releasing it in "general terms", so it could be tested?
Edit: I used `-mssse3` and it compiled.
The test result turned out to be interesting:
$ time lbzip2-2.5 -t -n1 silesia.tar.bzip2
real 0m15.470s
user 0m14.405s
sys 0m1.073s
$ time lbzip2-2.5-i31 -t -n1 silesia.tar.bzip2
real 0m15.300s
user 0m14.034s
sys 0m1.116s
$ time lbzip2-2.5 -t -n1 silesia.tar.lbzip2
real 0m15.496s
user 0m14.267s
sys 0m1.071s
$ time lbzip2-2.5-i31 -t -n1 silesia.tar.lbzip2
real 0m18.341s
user 0m16.430s
sys 0m1.099s
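So with the bzip2 archive there was essentially no difference (about 1% faster in wall-clock time), while with the lbzip2 archive it was roughly 18% slower in wall-clock time (about 15% in user time).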