
Consider implementing http2 ping-pong health checks for inactive endpoints #2578

Open
searoz opened this issue Dec 5, 2024 · 0 comments
Labels: enhancement (New feature or request)

Is your feature request related to a problem? Please describe.

Consider the following code:

// Top-level program; assumes .NET 8 with implicit usings enabled.
using System.Net;
using Grpc.Core;
using Grpc.Net.Client.Balancer;
using Grpc.Net.Client.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;

var endpoints = new[]
{
    new DnsEndPoint("localhost", 5000),
    new DnsEndPoint("localhost", 6000)
};

var grpcClient = CreateGrpcClient(endpoints);
using var serverStream = grpcClient.ServerStream(new MyGrpcRequest());
// This server stream is effectively endless - it constantly pushes new responses and never returns.
while (await serverStream.ResponseStream.MoveNext())
{
    // process response
}

static MyGrpcClient CreateGrpcClient(IReadOnlyCollection<DnsEndPoint> endpoints)
{
    const string Scheme = "test";
    const string ServiceName = "test";

    var services = new ServiceCollection();

    services.AddLogging(logging =>
    {
        logging
            .SetMinimumLevel(LogLevel.Information)
            .AddFilter("Grpc.Net.Client.Balancer", LogLevel.Trace);

        logging.AddSimpleConsole(console =>
        {
            console.TimestampFormat = "[HH:mm:ss] ";
            console.SingleLine = true;
        });
    });

    services
        .AddGrpcClient<MyGrpcClient>(o =>
        {
            o.Address = new Uri($"{Scheme}://{ServiceName}");
        })
        .ConfigureChannel(o =>
        {
            o.Credentials = ChannelCredentials.Insecure;
            o.ServiceConfig = new ServiceConfig
            {
                LoadBalancingConfigs =
                {
                    new CustomLoadBalancingConfig()
                }
            };
        });

    services.AddSingleton<ResolverFactory>(new CustomResolverFactory(
        Scheme,
        new Dictionary<string, IReadOnlyCollection<DnsEndPoint>>
        {
            {ServiceName, endpoints}
        }));
    services.AddSingleton<LoadBalancerFactory, CustomLoadBalancerFactory>();

    var serviceProvider = services.BuildServiceProvider();

    return serviceProvider.GetRequiredService<MyGrpcClient>();
}

/// <summary>
/// Basic resolver factory that creates a basic resolver for the specified host
/// </summary>
sealed class CustomResolverFactory(string name, IReadOnlyDictionary<string, IReadOnlyCollection<DnsEndPoint>> endpointsByHost)
    : ResolverFactory
{
    public override Resolver Create(ResolverOptions options)
    {
        var endpoints =
            endpointsByHost.GetValueOrDefault(options.Address.Host)
            ??
            throw new InvalidOperationException($"No endpoints for host {options.Address.Host}");
        return new CustomResolver(endpoints, options.LoggerFactory);
    }

    public override string Name => name;
}

/// <summary>
/// Basic resolver that returns the specified set of endpoints
/// </summary>
sealed class CustomResolver(IReadOnlyCollection<DnsEndPoint> endpoints, ILoggerFactory loggerFactory)
    : PollingResolver(loggerFactory)
{
    protected override Task ResolveAsync(CancellationToken cancellationToken)
    {
        var addresses = endpoints.Select(e => new BalancerAddress(e)).ToArray();
        var result = ResolverResult.ForResult(addresses);
        Listener(result);
        return Task.CompletedTask;
    }
}

sealed class CustomLoadBalancingConfig() : LoadBalancingConfig(CustomLoadBalancerFactory.LoadBalancerFactoryName);

sealed class CustomLoadBalancerFactory : LoadBalancerFactory
{
    public const string LoadBalancerFactoryName = nameof(CustomLoadBalancerFactory);

    public override LoadBalancer Create(LoadBalancerOptions options) =>
        new CustomLoadBalancer(options.Controller, options.LoggerFactory);

    public override string Name => LoadBalancerFactoryName;
}

sealed class CustomLoadBalancer(IChannelControlHelper controller, ILoggerFactory loggerFactory)
    : SubchannelsLoadBalancer(controller, loggerFactory)
{
    protected override SubchannelPicker CreatePicker(IReadOnlyList<Subchannel> readySubchannels) =>
        new CustomSubchannelPicker(readySubchannels);
}

sealed class CustomSubchannelPicker(IReadOnlyList<Subchannel> readySubchannels) : SubchannelPicker
{
    public override PickResult Pick(PickContext context) =>
        readySubchannels switch
        {
            [var singleSubChannel] => PickResult.ForSubchannel(singleSubChannel),
            null or [] => PickResult.ForFailure(new Status(StatusCode.Unavailable, "No ready subchannels")),
            _ => PickResult.ForFailure(new Status(StatusCode.Unavailable,
                $"Too many ready subchannels: {readySubchannels.Count}")),
        };
}

After letting this code run for a couple of minutes and analyzing the logs, we can spot unexpected behavior:

[12:07:45] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[4] Subchannel id '1-2' checking socket Unspecified/localhost:6000.
[12:07:50] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[4] Subchannel id '1-2' checking socket Unspecified/localhost:6000.
[12:07:55] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[4] Subchannel id '1-2' checking socket Unspecified/localhost:6000.
[12:07:55] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[15] Subchannel id '1-2' socket Unspecified/localhost:6000 is receiving 17 available bytes.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[14] Subchannel id '1-2' socket Unspecified/localhost:6000 is in a bad state and can't be used.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[16] Subchannel id '1-2' socket Unspecified/localhost:6000 is being closed because it can't be used. Socket lifetime of 00:02:15.2011945. The socket either can't receive data or it has received unexpected data.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Subchannel[11] Subchannel id '1-2' state changed to Idle. Detail: 'Lost connection to socket.'.
[12:07:55] trce: Grpc.Net.Client.Balancer.Subchannel[14] Subchannel id '1-2' executing state changed registration '1-2-1'.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Internal.ConnectionManager[4] Channel picker updated.
[12:07:55] trce: Grpc.Net.Client.Balancer.PollingResolver[1] CustomResolver refresh requested.
[12:07:55] trce: Grpc.Net.Client.Balancer.PollingResolver[8] CustomResolver resolve starting.
[12:07:55] trce: Grpc.Net.Client.Balancer.PollingResolver[4] CustomResolver result with status code 'OK' and 2 addresses.
[12:07:55] trce: Grpc.Net.Client.Balancer.Subchannel[4] Subchannel id '1-2' connection requested.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Subchannel[11] Subchannel id '1-2' state changed to Connecting. Detail: 'Connection requested.'.
[12:07:55] trce: Grpc.Net.Client.Balancer.Subchannel[14] Subchannel id '1-2' executing state changed registration '1-2-1'.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Internal.ConnectionManager[4] Channel picker updated.
[12:07:55] dbug: Grpc.Net.Client.Balancer.Subchannel[6] Subchannel id '1-2' connecting to transport.
[12:07:55] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[1] Subchannel id '1-2' connecting socket to Unspecified/localhost:6000.
[12:07:55] trce: Grpc.Net.Client.Balancer.Subchannel[19] Subchannel id '1-1' updated with addresses: localhost:5000
[12:07:55] trce: Grpc.Net.Client.Balancer.PollingResolver[7] CustomResolver resolve task completed.
[12:07:57] dbug: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[2] Subchannel id '1-2' connected to socket Unspecified/localhost:6000.
[12:07:57] dbug: Grpc.Net.Client.Balancer.Subchannel[11] Subchannel id '1-2' state changed to Ready. Detail: 'Successfully connected to socket.'.
[12:07:57] trce: Grpc.Net.Client.Balancer.Subchannel[14] Subchannel id '1-2' executing state changed registration '1-2-1'.
[12:07:57] dbug: Grpc.Net.Client.Balancer.Internal.ConnectionManager[4] Channel picker updated.
[12:08:02] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[4] Subchannel id '1-2' checking socket Unspecified/localhost:6000.
[12:08:07] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[4] Subchannel id '1-2' checking socket Unspecified/localhost:6000.
[12:08:12] trce: Grpc.Net.Client.Balancer.Internal.SocketConnectivitySubchannelTransport[4] Subchannel id '1-2' checking socket Unspecified/localhost:6000.

To summarize what happened:

  • I've implemented a basic custom resolver plus a basic custom load balancer. Because of this custom load balancing, grpc-dotnet now creates two subchannels with one endpoint each, instead of the default single subchannel with two endpoints.
  • I connect to my gRPC server using the two endpoints and establish an endless server-side stream. grpc-dotnet selects one subchannel to serve this stream; the other remains idle.
  • grpc-dotnet begins its health-check routine for the idle subchannel by establishing a TCP socket and polling it every 5 seconds.
  • Because no actual HTTP/2 traffic passes through this socket, the gRPC server considers the connection stale and closes it after some time. For Kestrel this happens after 130 seconds by default (see the keep-alive sketch just after this list).
  • Our socket is now closed, which forces grpc-dotnet to update the corresponding picker and re-query our resolver.
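For reference, the 130-second figure above is Kestrel's KeepAliveTimeout. Raising it on the server keeps the idle health-check socket alive longer, though it only papers over the symptom; a minimal sketch, assuming a Kestrel-hosted gRPC server:

var builder = WebApplication.CreateBuilder(args);
builder.WebHost.ConfigureKestrel(kestrel =>
{
    // Defaults to 130 seconds; Kestrel closes idle connections once it elapses.
    kestrel.Limits.KeepAliveTimeout = TimeSpan.FromMinutes(30);
});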

Of course, in this particular example the quirk is nothing to worry about. The problem is that I use DnsResolver in my production code, and this behavior forces DnsResolver to re-query DNS entries over and over again. Given that I have tons of k8s pods in my production environment sharing the same custom gRPC balancing mechanism, this means that roughly every 135 seconds a burst of unnecessary DNS requests is issued, creating needless load on my DNS servers.
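As a client-side stopgap, the resolver itself can be rate-limited so that socket churn does not translate into DNS queries. A minimal sketch reusing the PollingResolver pattern from the repro above; ThrottledResolver and MinInterval are hypothetical names, and in production the cached branch would wrap the actual DNS lookup:

sealed class ThrottledResolver(IReadOnlyCollection<DnsEndPoint> endpoints, ILoggerFactory loggerFactory)
    : PollingResolver(loggerFactory)
{
    // Hypothetical cooldown: refreshes arriving faster than this interval
    // get the cached result re-published instead of a fresh resolution.
    private static readonly TimeSpan MinInterval = TimeSpan.FromMinutes(5);

    private ResolverResult? _cached;
    private DateTime _lastResolvedUtc = DateTime.MinValue;

    protected override Task ResolveAsync(CancellationToken cancellationToken)
    {
        if (_cached is null || DateTime.UtcNow - _lastResolvedUtc >= MinInterval)
        {
            // In production this would be the actual (expensive) DNS lookup.
            var addresses = endpoints.Select(e => new BalancerAddress(e)).ToArray();
            _cached = ResolverResult.ForResult(addresses);
            _lastResolvedUtc = DateTime.UtcNow;
        }

        Listener(_cached);
        return Task.CompletedTask;
    }
}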

Describe the solution you'd like

I can think of two solutions:

  • Implement an actual HTTP/2 ping-pong health-check mechanism in SocketConnectivitySubchannelTransport instead of just polling a TCP socket. If I understand correctly, this is exactly what grpc-go does (a frame-level sketch follows this list).
  • Make ISubchannelTransport and corresponding types like TransportStatus and ConnectResult public, so one can implement their own health-check logic.
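For illustration, this is the frame such a ping-pong check would exchange (RFC 7540 §6.7). A real implementation would send it over an established HTTP/2 connection, after the preface and SETTINGS exchange, and treat a timely PING ACK echoing the same payload as proof the endpoint is alive. BuildPingFrame is a hypothetical helper, not an existing grpc-dotnet API:

using System.Buffers.Binary;

static byte[] BuildPingFrame(ulong opaqueData, bool ack)
{
    // 9-byte frame header followed by the fixed 8-byte PING payload.
    var frame = new byte[17];
    frame[2] = 0x08;                          // 24-bit length: always 8 for PING
    frame[3] = 0x06;                          // type: PING
    frame[4] = ack ? (byte)0x01 : (byte)0x00; // flags: ACK bit set on the reply
    // Bytes 5..8 stay zero: PING is a connection-level frame (stream id 0).
    BinaryPrimitives.WriteUInt64BigEndian(frame.AsSpan(9), opaqueData);
    return frame;
}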

Describe alternatives you've considered

I've considered implementing my own health-check logic but, as stated above, ISubchannelTransport and the corresponding types are internal, which makes this impossible.
