Any idea why C# is so much slower than Java and Go in related_post_gen benchmark? #93225
-
Does anyone have any idea why C# AOT is so much slower than Java (GraalVM) and Go in the related_post_gen benchmark? I'm just interested to learn what the culprit is. I tried optimizing it, but I can't get any meaningful performance improvement, and I don't understand where the big difference comes from. Do the other two compilers just produce more optimized assembly? Is some type from the standard library less optimized? Is it UTF-16 vs. UTF-8 strings? Or is there some glaring issue in the C# implementation that I'm missing?
Snapshot of benchmark results
Replies: 6 comments 26 replies
-
Might be interesting to compare nativeaot
-
Few things...
-
One optimization that crossed my mind was string-deduping during the JSON parsing. I don't know whether any of the other languages' libraries attempt this to any degree, but I know that S.T.Json does not:
Since the inner-most loop does a dictionary lookup on the tag string, it would benefit from the tags being de-duped during deserialization, as there are only 28 distinct tags. My understanding is that string.GetHashCode gets cached per instance, which would avoid having to hash all the duplicated strings during the "processing" phase, since it was already done during the deserialization phase. Additionally, the dictionary lookup needs to do a full string equality check for duplicate strings, rather than the reference-equality fast path that would be taken when they are de-duped.

Testing this on my machine:

standard deserializer (dupe tags):
custom deserializer (deduped tags):

So it saves about 16% of the processing time. I used Ben.StringIntern for the string-deduping. I should mention that the overall time is reduced partly because my custom deserialization code ignores the "title" property, since it isn't actually needed to produce the correct output.
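For anyone wanting to try this: the thread used Ben.StringIntern, but the same idea can be sketched with a plain custom S.T.Json converter. The `InterningStringConverter` below is a minimal, hypothetical stand-in (not the code used for the numbers above, and not thread-safe, since it assumes single-threaded deserialization): every duplicate tag comes back as the same string instance, so later dictionary lookups can take the reference-equality fast path.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical interning converter: returns one canonical instance per
// distinct string value seen during deserialization.
public sealed class InterningStringConverter : JsonConverter<string>
{
    private readonly Dictionary<string, string> _pool = new();

    public override string Read(ref Utf8JsonReader reader,
        Type typeToConvert, JsonSerializerOptions options)
    {
        string value = reader.GetString()!;
        // Return the pooled instance if we've seen this value before.
        if (_pool.TryGetValue(value, out var interned)) return interned;
        _pool.Add(value, value);
        return value;
    }

    public override void Write(Utf8JsonWriter writer, string value,
        JsonSerializerOptions options) => writer.WriteStringValue(value);
}
```

Register it via `options.Converters.Add(new InterningStringConverter())` before calling `JsonSerializer.Deserialize`; duplicate tags in the input will then deserialize to reference-equal strings.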
-
Made a PR to make the benchmark faster (jinyus/related_post_gen#304). Now it's in the top 3 in the latest results.
-
SIMD alone is unlikely to help much. It needs to use multiple threads, or it will remain a very slow version.
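The multithreading point can be sketched like this. The per-post counting pass is independent for each post, so it parallelizes trivially; the inner workload below (`(i ^ j) & 1`) is a made-up stand-in for the real tag-overlap count, and `ComputeAll` is a hypothetical helper, not the benchmark's code:

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelDemo
{
    // Hypothetical stand-in for the benchmark's O(n^2) counting phase:
    // each post i scans every other post j, and the rows are spread
    // across all cores with Parallel.For.
    public static long[] ComputeAll(int n)
    {
        var results = new long[n];
        Parallel.For(0, n, i =>
        {
            long sum = 0;
            for (int j = 0; j < n; j++)
                if (j != i) sum += (i ^ j) & 1; // stand-in for a tag-overlap count
            results[i] = sum; // each i writes its own slot, so no locking is needed
        });
        return results;
    }
}
```

Because every iteration writes only to its own index, there is no shared mutable state and no synchronization cost beyond the fork/join of `Parallel.For` itself.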
The bounds checks were getting removed in the original version too, but the problem was that the code was jumping around too much. Here's the summary:
The hot path is the case where the next post's count is lower than what we already have in the queue. But as you can see, the hot path starts at the top, then does a jump from 00007FF7612BD217 to 00007FF7612BD280 (because the count is lower), then it increments j, compares it with the limit, and jumps back from 00007FF7612BD284 to 00007FF7612BD20C at the top. The PR added an inner loop for this that doesn't do so much jumping around. The hot loop is just:
And we go from top to bottom, many many times (the branch at 00007FF692DAE55B is not taken most of the time, and the one at 00007FF692DAE566 is taken pretty much always).
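In C# terms, the restructuring described above looks roughly like the sketch below: the common case, a count too small to enter the current top 5, stays inside one tight inner loop instead of bouncing through the outer loop header. `Top5Indices`, its arrays, and the sample data are all hypothetical, not the PR's actual code:

```csharp
using System;

public static class TopKDemo
{
    // Hypothetical sketch: return the indices of the 5 largest counts,
    // assuming at least five positive entries.
    public static int[] Top5Indices(int[] counts)
    {
        var topIdx = new int[5];
        var topCnt = new int[5];
        int minCount = 0; // smallest count currently held in the top 5
        int j = 0;
        while (j < counts.Length)
        {
            // Hot inner loop: runs top-to-bottom with a single backward
            // branch while counts stay at or below the current minimum.
            while (j < counts.Length && counts[j] <= minCount) j++;
            if (j == counts.Length) break;

            // Cold path: counts[j] beats the minimum; insertion-sort it in.
            int pos = 4;
            while (pos > 0 && topCnt[pos - 1] < counts[j]) pos--;
            for (int k = 4; k > pos; k--)
            {
                topCnt[k] = topCnt[k - 1];
                topIdx[k] = topIdx[k - 1];
            }
            topCnt[pos] = counts[j];
            topIdx[pos] = j;
            minCount = topCnt[4];
            j++;
        }
        return topIdx;
    }

    public static void Main()
    {
        int[] counts = { 1, 9, 3, 7, 5, 8, 2, 6, 4, 0 };
        Console.WriteLine(string.Join(",", Top5Indices(counts)));
        // prints 1,5,3,7,4 (indices of counts 9, 8, 7, 6, 5)
    }
}
```

The point isn't the insertion sort (that path is rare); it's that the frequent "skip this post" case compiles to a short loop that falls through on the common branch, which is the shape the disassembly above shows.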