Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration of js-rpc timeout and middleware changes to fix Auth Provider timeout issue #589

Merged
merged 4 commits into from
Nov 3, 2023

Conversation

addievo
Copy link
Contributor

@addievo addievo commented Oct 17, 2023

Description

Provider Auth Timeouts

This PR makes the timeout for the Provider Authentication flow longer. The result of how short it is currently is that the js-rpc default timeout set in src/config:

defaultsSystem: {
    rpcCallTimeoutTime: 15_000, // 15 seconds

As the Provider Authentication flow utilises the IdentitiesAuthenticate handler in client/handlers, it will follow this timeout. So there needs to be a way to override the default timeout for the IdentitiesAuthenticate handler only, without affecting any other handlers.

Although the aforementioned comment specifies

Therefore it is expected that specific clients calls and server handlers
* will override this timeout to cater to their specific circumstances.

This is currently not the case, as in js-rpc handler timeout is preferred, only if it is lesser than the default timeout, which is not the case in our desired scenario.

The solution is to make it so that the handler timeout overrides the default timeout set on the RPCServer, regardless of whether one is longer than the other.

This has now been updated in the following PR:
MatrixAI/js-rpc#47

Moving timeoutMiddleware to js-rpc

As per MatrixAI/js-rpc#42, timeoutMiddleware has been moved out, and is now a default middleware. This will mean that timeoutMiddleware will be applied everywhere.

Updated Type Requirements of js-rpc Request params and Response results

The updated types in js-rpc of JSONRPCRequestMessage, JSONRPCRequestNotification and JSONRPCResponseResultmean that JSONRPCRequest*.params and JSONRPCResponseResult.result should now always be objects. This is so that metadata can always be appended to the params. Therefore, we need to make sure in Polykey that non-JSON-objects are never being passed as results or params.

Long-Running Streaming Calls

Due to Timeout changes, specifically that timeouts no longer after the first server-sent message, long-running streaming calls are now possible without setting timeouts to Infinity.

Issues Fixed

Tasks

  • 1. Update js-rpc to allow for override of default timeout
  • 2. Update timeout for authentication handler to be 2 minutes.
  • 3. Remove timeoutMiddleware from Polykey.
  • 4. Check and update (if relevant) the type of message structure.
  • 5. Amend the types ClientRPCRequestParms and ClientRPCResponseResult from client handlers.
  • 6. Remove metadata.authorization from AgentRPCRequestParams and AgentRPCResponseResult

Final checklist

  • Domain specific tests
  • Full tests
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@ghost
Copy link

ghost commented Oct 17, 2023

👇 Click on the image for a new way to code review

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map legend

@addievo addievo mentioned this pull request Oct 17, 2023
@CMCDragonkai
Copy link
Member

Would this actually solve the problem? Maybe it's better to link to PK CLI to just try the authenticate command?

@addievo addievo force-pushed the feature-auth-timeout branch from 63b1178 to afb4bc0 Compare October 18, 2023 01:56
@addievo
Copy link
Contributor Author

addievo commented Oct 18, 2023

This problem is more complex than it first seemed.

The root issue is :-

  1. RPCServer has handlerTimeoutTime which is initialed in Polykey in PolykeyAgent.
  2. To alter this timeout, the timeout has to be changed in CommandStart, the only place where a PK Agent is started.
  3. Changing streamKeepAliveTimeoutTime in PolyKeyClient, which occurs in createPolykeyClient does not resolve the issue.

So therefore, the only possible solution right now is to alter timeout while RPCServer creating in PolykeyAgent in CommandStart, but that would require changing a global solution to fix a local problem, something which is less than ideal.

Currently exploring other possible solutions but I am fairly certain that this is the only solution here to fix the issue, as the core issue is the fact that RPC times out in exactly 15000ms, which isn't nearly enough time to comfortably authenticate github or other providers.

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Oct 18, 2023

To fix this properly there needs to be some changes to the handlers, callers and possible RPC.

  1. We need to specify the timeouts on a per handler/caller basis. I recall us designing something like this in the past but I don't recall if it's implemented in the RPC.
  2. If I recall, there are 3 ways the timeouts are set. first, the global timeout set for the client and server. second, the specific timeout for the handler or caller definition if that is implemented. third, what ever is passed in as the ctx timeout for the client call. By design I believe that the minimum of these values was used when creating the timeout timer. So if you want to enable a specific timeout larger than the default that will need to change.

Ultimately we need to be able to

  1. Specify a larger timeout for just the specific caller/handler we want to have the longer timeout.
  2. allow for timeouts longer than the global timeout value to be used when making a specific call.
  3. We probably don't want to allow the client to override the server's timeout to be much longer than it's timeout for that handler.

You'll need to explore the RPC code to see how things are implemented currently and likely a new feature needs to be added to the RPC to enable this.

If we do need to make changes to the RPC for this then be sure to write a spec. It's important you get then intended changes and requirements down. It's an important touch stone for tracking what needs to be done.

@addievo
Copy link
Contributor Author

addievo commented Oct 18, 2023

@tegefaulkes

  1. I did try to increase the RPCClient timeout, however the Server timeout still overrides it.

We need to specify the timeouts on a per handler/caller basis. I recall us designing something like this in the past but I don't recall if it's implemented in the RPC.

  1. Hmmm the ctx, yea that might be something worth looking into.

Alright, I'll look into the RPC code.

Thanks for the reply.

@CMCDragonkai
Copy link
Member

Please flesh out the spec in the OP @addievo

@tegefaulkes
Copy link
Contributor

This requires changes to the js-rpc code, is there no link to that work here?

@CMCDragonkai
Copy link
Member

It's in the github events: MatrixAI/js-rpc#47

@addievo
Copy link
Contributor Author

addievo commented Oct 25, 2023

So this PR grows in complexity by the day.

  1. Even after the changes to js-rpc to allow changes to the server timeout through the handler, the timeout shows no difference.

  2. Current observations are that while the timeout can be lowered via the caller ctx timeout, increasing timeout ends to no avail.

  3. My current thought process is that there is a time bottleneck elsewhere causing the issue. Some theories to suggest this are :

a. When RPC times out, theres a particular error it throws:

> ts-node src/polykey.ts -np ./tmp/nodeA identities authenticate github.com

url             https://github.com/login/device
userCode        5FC2-254D
ErrorRPCTimedOut: RPC has timed out

b. Whereas with the normal timeout, it throws a specific error as so:

> ts-node src/polykey.ts -np ./tmp/nodeA identities authenticate github.com

url             https://github.com/login/device
userCode        5FC2-254D
ErrorRPCTimedOut: RPC has timed out
  1. I think the problem domain is as follows:
// Promise.race does not cancel unfinished promises
    // the finished condition variable is needed to stop the pollAccessToken process
    // the pollTimer is needed to stop the pollTimerP
    let pollTimedOut = false;
    let pollTimer;
    const pollTimerP = new Promise<void>((r) => {
      pollTimer = setTimeout(() => {
        pollTimedOut = true;
        r();
      }, timeout);
    });
    const that = this;
    const pollAccessToken = async () => {
      const payload = new URLSearchParams();
      payload.set('grant_type', 'urn:ietf:params:oauth:grant-type:device_code');
      payload.set('client_id', that.clientId);
      payload.set('device_code', deviceCode);
      const request = new Request(
        'https://github.com/login/oauth/access_token',
        {
          method: 'POST',
          headers: {
            'Content-Type': 'application/x-www-form-urlencoded',
            Accept: 'application/json',
          },
          body: payload.toString(),
        },
      );
      while (true) {
        if (pollTimedOut) {
          this.logger.info('Poll has timed out');
          return;
        }
        const response = await fetch(request);
        if (!response.ok) {
          throw new identitiesErrors.ErrorProviderAuthentication(
            `Provider responded with ${response.status} ${response.statusText}`,
          );
        }
        let data;
        try {
          data = await response.json();
        } catch (e) {
          throw new identitiesErrors.ErrorProviderAuthentication(
            'Provider access token response is not valid JSON',
            { cause: e },
          );
        }
        if (data.error) {
          if (data.error === 'authorization_pending') {
            await utils.sleep(pollInterval);
            continue;
          } else if (data.error === 'slow_down') {
            // Convert seconds to milliseconds
            pollInterval = parseInt(data.get('interval') ?? '1') * 1000;
            await utils.sleep(pollInterval);
            continue;
          }
          throw new identitiesErrors.ErrorProviderAuthentication(
            `Provider access token request responded with: ${data.error}`,
          );
        }
        const providerToken = {
          accessToken: data.access_token,
        };
        return providerToken;
      }
    };
    let providerToken;
    try {
      providerToken = await Promise.race([pollAccessToken(), pollTimerP]);
    } finally {
      clearTimeout(pollTimer);
    }
    if (providerToken == null) {
      throw new identitiesErrors.ErrorProviderAuthentication(
        `Provider authentication flow timed out`,
      );
    }

As this is where the specific error being thrown.

I will investigate this further.

The only solution working till now is changing the default timeout (handlerTimeout) for the rpcServer formed during creation of PolykeyAgent.

Also to add on this point further, lets assume the default timeout for the rpcServer to be 30000ms.
We call the authIdentify handler with a timeout of 120000ms, the server's default timeout will be overridden and set to 120000ms, now this timeout would carry forward till
a. A new handler sets a different timeout.
b. The server is destroyed and created anew.

Maybe the rpc requires further changes to set the timeout to default after the rpc call is handled.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Oct 25, 2023

No the solution needs to be simplified. You have to verify that call site timeouts override client timeouts and handler timeouts override server timeouts.

Once that is verified then you have to understand that between client and server, the lowest timeout is chosen.

Therefore you just first give sufficient time on both ends. Using both call site timeout and also handler timeout. This has nothing to do with the client timeout or server timeout.

Afterwards any further fixes happens on the PK CLI command itself.


Also the scope isn't increasing - the scope is the requirement. You're supposed to figure out what needs to be done for the requirement.

@CMCDragonkai
Copy link
Member

This is blocked on MatrixAI/js-rpc#47

@CMCDragonkai CMCDragonkai assigned amydevs and unassigned addievo Oct 30, 2023
@tegefaulkes
Copy link
Contributor

I'll be merging the rpc PR. We'll have to update this PR after rpc has a new release.

@tegefaulkes
Copy link
Contributor

MatrixAI/js-rpc#47 has been merged now and doing a release. When that Is done i'll look into finishing this.

@tegefaulkes
Copy link
Contributor

I think we're adding some extra fixes to this PR. Things that need to be changed

  1. Update the timeout for the authentication handler in the client domain.
  2. Polykey-CLI needs to update the call for authentication to use the extended timeout as well. (blocked by this)
  3. The timeout middleware is built into rpc now, so it needs to be removed from polykey.
  4. Types of the message structure may have changed. so that needs to be checked.
  5. types like ClientRPCRequestParams need to be removed from the client and agent handler types.

@addievo
Copy link
Contributor Author

addievo commented Oct 31, 2023

@tegefaulkes Cleaning up PR and adding tasks

@addievo addievo changed the title Increase github auth timeout Integration of js-rpc timeout and middleware changes to fix GitHub timeout issue Oct 31, 2023
@addievo addievo self-assigned this Oct 31, 2023
@amydevs
Copy link
Contributor

amydevs commented Nov 1, 2023

I think we're adding some extra fixes to this PR. Things that need to be changed

1. Update the timeout for the authentication handler in the client domain.

2. `Polykey-CLI` needs to update the call for authentication to use the extended timeout as well. (blocked by this)

3. The timeout middleware is built into `rpc` now, so it needs to be removed from `polykey`.

4. Types of the message structure _may_ have changed. so that needs to be checked.

5. types like `ClientRPCRequestParams` need to be removed from the client and agent handler types.

i think types like ClientRPCRequestParams still need to stay in order to add properties to the metadata, specifically that for authorization

@amydevs amydevs force-pushed the feature-auth-timeout branch 3 times, most recently from 4f1469f to 6899223 Compare November 1, 2023 02:54
@amydevs amydevs changed the title Integration of js-rpc timeout and middleware changes to fix GitHub timeout issue Integration of js-rpc timeout and middleware changes to fix Auth Provider timeout issue Nov 2, 2023
@addievo
Copy link
Contributor Author

addievo commented Nov 2, 2023

Looks good

src/client/types.ts Outdated Show resolved Hide resolved
@CMCDragonkai
Copy link
Member

Looks good

Nope, it doesn't look good. See comments above.

@CMCDragonkai
Copy link
Member

Long-Running Streaming Calls

Do we have any calls in the entire client service or agent service that uses this yet?

I don't think so, but it is likely to be needed to provide push-flow abstractions over the network boundary.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Nov 2, 2023

@amydevs earlier @addievo meantioned the pollAccessToken process being used in identities/providers/github/GitHubProvider.ts.

I'm thinking was originally used to poll the GitHub API. Now with things like timed context, can we refactor that to push-down the timer context? That loop was PRE-invention of the timer contexts. And it seems it can be refactored to improve our system.

Perhaps it is necessary to actually solve the github auth timeout in its entirety?

@amydevs amydevs force-pushed the feature-auth-timeout branch from da3018d to e0523d0 Compare November 3, 2023 04:40
…to be less confusing

chore: bump `@matrixai/rpc` to `0.3.1`

[ci-skip]
@amydevs amydevs force-pushed the feature-auth-timeout branch from e0523d0 to 77d7fd7 Compare November 3, 2023 04:41
@amydevs
Copy link
Contributor

amydevs commented Nov 3, 2023

@CMCDragonkai

this would mean that the abstract Provider class would have to be updated accordingly to match. I think this would be out of scope with this PR

@amydevs
Copy link
Contributor

amydevs commented Nov 3, 2023

Long-Running Streaming Calls

Do we have any calls in the entire client service or agent service that uses this yet?

I don't think so, but it is likely to be needed to provide push-flow abstractions over the network boundary.

i don't think there are any places that need long-running calls at the moment

@CMCDragonkai
Copy link
Member

@CMCDragonkai

this would mean that the abstract Provider class would have to be updated accordingly to match. I think this would be out of scope with this PR

Ok

@amydevs
Copy link
Contributor

amydevs commented Nov 3, 2023

should be ready to merge

@CMCDragonkai CMCDragonkai merged commit 570548a into staging Nov 3, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants