Any idea why C# is so much slower than Java and Go in related_post_gen benchmark? #93225
-
Does anyone have any idea why C# AOT is so much slower than Java (GraalVM) and Go in the related_post_gen benchmark? I'm just interested to learn what the culprit is. I tried optimizing it, but I can't get any meaningful performance improvement, and I don't understand where the big difference comes from. Do the other two compilers just produce more optimized assembly? Is some type from the standard library less optimized? Is it UTF-16 vs. UTF-8 strings? Or is there some glaring issue in the C# implementation that I'm missing?
Snapshot of benchmark results
Replies: 6 comments 26 replies
-
Might be interesting to compare nativeaot
-
Few things...
-
One optimization that crossed my mind was string-deduping during the JSON parsing. I don't know whether any of the other languages' libraries attempt this to any degree, but I know that S.T.Json does not:
Since the inner-most loop does a dictionary lookup on the tag string, it would benefit from the tags being de-duped during deserialization, as there are only 28 distinct tags. My understanding is that string.GetHashCode gets cached per instance, which would avoid having to hash all the duplicated strings during the "processing" phase, since it was already done during the deserialization phase. Additionally, the dictionary lookup needs to do a full string equality check for duplicate strings, rather than the reference-equality fast path that would be taken when they are de-duped.

Testing this on my machine:

standard deserializer (dupe tags):
custom deserializer (deduped tags):

So it saves about 16% of the processing time. I used Ben.StringIntern for the string-deduping. I should mention that the overall time is reduced partly because my custom deserialization code ignores the "title" property, since it isn't actually needed to produce the correct output.
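For anyone wanting to try this: the thread used Ben.StringIntern, but the same idea can be sketched with a plain custom S.T.Json converter. The `InterningStringConverter` below is a minimal, hypothetical stand-in (not the code used for the numbers above, and not thread-safe, since it assumes single-threaded deserialization): every duplicate tag comes back as the same string instance, so later dictionary lookups can take the reference-equality fast path.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical interning converter: returns one canonical instance per
// distinct string value seen during deserialization.
public sealed class InterningStringConverter : JsonConverter<string>
{
    private readonly Dictionary<string, string> _pool = new();

    public override string Read(ref Utf8JsonReader reader,
        Type typeToConvert, JsonSerializerOptions options)
    {
        string value = reader.GetString()!;
        // Return the pooled instance if we've seen this value before.
        if (_pool.TryGetValue(value, out var interned)) return interned;
        _pool.Add(value, value);
        return value;
    }

    public override void Write(Utf8JsonWriter writer, string value,
        JsonSerializerOptions options) => writer.WriteStringValue(value);
}
```

Register it via `options.Converters.Add(new InterningStringConverter())` before calling `JsonSerializer.Deserialize`; duplicate tags in the input will then deserialize to reference-equal strings.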
-
Made a PR to make the benchmark faster (jinyus/related_post_gen#304). Now it's in the top 3 in the latest results.
-
SIMD alone is unlikely to help much. It needs to use multiple threads, or it will remain a very slow version.
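The multithreading point can be sketched like this. The per-post counting pass is independent for each post, so it parallelizes trivially; the inner workload below (`(i ^ j) & 1`) is a made-up stand-in for the real tag-overlap count, and `ComputeAll` is a hypothetical helper, not the benchmark's code:

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelDemo
{
    // Hypothetical stand-in for the benchmark's O(n^2) counting phase:
    // each post i scans every other post j, and the rows are spread
    // across all cores with Parallel.For.
    public static long[] ComputeAll(int n)
    {
        var results = new long[n];
        Parallel.For(0, n, i =>
        {
            long sum = 0;
            for (int j = 0; j < n; j++)
                if (j != i) sum += (i ^ j) & 1; // stand-in for a tag-overlap count
            results[i] = sum; // each i writes its own slot, so no locking is needed
        });
        return results;
    }
}
```

Because every iteration writes only to its own index, there is no shared mutable state and no synchronization cost beyond the fork/join of `Parallel.For` itself.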
The bounds checks were getting removed in the original version too, but the problem was that the code was jumping around too much. Here's the summary:
The hot path is the case where the next post's count is lower than what we already have in the queue. But as you can see, the hot path starts at the top, then does a jump from 00007FF7612BD217 to 00007FF7612BD280 (because the count is lower), then it increments j, compares it with the limit, and jumps back from 00007FF7612BD284 to 00007FF7612BD20C at the top. The PR added an inner loop for this that doesn't do so much jumping around. The hot loop is just:
And we go from top to bottom, many many times (the branch at 00007FF692DAE55B is not taken most of the time, and the one at 00007FF692DAE566 is taken pretty much always).
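In C# terms, the restructuring described above looks roughly like the sketch below: the common case, a count too small to enter the current top 5, stays inside one tight inner loop instead of bouncing through the outer loop header. `Top5Indices`, its arrays, and the sample data are all hypothetical, not the PR's actual code:

```csharp
using System;

public static class TopKDemo
{
    // Hypothetical sketch: return the indices of the 5 largest counts,
    // assuming at least five positive entries.
    public static int[] Top5Indices(int[] counts)
    {
        var topIdx = new int[5];
        var topCnt = new int[5];
        int minCount = 0; // smallest count currently held in the top 5
        int j = 0;
        while (j < counts.Length)
        {
            // Hot inner loop: runs top-to-bottom with a single backward
            // branch while counts stay at or below the current minimum.
            while (j < counts.Length && counts[j] <= minCount) j++;
            if (j == counts.Length) break;

            // Cold path: counts[j] beats the minimum; insertion-sort it in.
            int pos = 4;
            while (pos > 0 && topCnt[pos - 1] < counts[j]) pos--;
            for (int k = 4; k > pos; k--)
            {
                topCnt[k] = topCnt[k - 1];
                topIdx[k] = topIdx[k - 1];
            }
            topCnt[pos] = counts[j];
            topIdx[pos] = j;
            minCount = topCnt[4];
            j++;
        }
        return topIdx;
    }

    public static void Main()
    {
        int[] counts = { 1, 9, 3, 7, 5, 8, 2, 6, 4, 0 };
        Console.WriteLine(string.Join(",", Top5Indices(counts)));
        // prints 1,5,3,7,4 (indices of counts 9, 8, 7, 6, 5)
    }
}
```

The point isn't the insertion sort (that path is rare); it's that the frequent "skip this post" case compiles to a short loop that falls through on the common branch, which is the shape the disassembly above shows.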