
Tests for NAT-Traversal and Hole-Punching #381

Merged: 15 commits merged on Jun 24, 2022

Conversation

@emmacasolin (Contributor) commented Jun 9, 2022

Description

This PR description was copied from #357

The architecture for our NAT busting will be completed in #326. Once this is done, we should be able to run fully simulated tests for communication between nodes using hole-punching across NAT. This will involve writing Jest tests that create namespaces with iptables rules to simulate different kinds of NAT, and creating nodes inside these namespaces. Specifically, we are only looking at endpoint-independent and endpoint-dependent NAT mapping, since our NAT busting should be able to traverse all different types of firewalls. Thus, these tests will only be simulating port-restricted cone (endpoint-independent) and symmetric (endpoint-dependent) NAT.

The test cases that need to be considered here are:

  1. Node1 connects to Node2 - basic sanity test
  2. Node1 behind NAT connects to Node2 - here Node1 is acting like a client behind a NAT, connecting to an open Node2 that isn't behind NAT
  3. Node1 connects to Node2 behind NAT - here Node1 is acting like a client, connecting to a closed Node2 that is behind a NAT
  4. Node1 behind NAT connects to Node2 behind NAT - here Node1 is acting like a client behind NAT, connecting to Node2, which is also behind NAT

These tests should cover all possible NAT combinations and should be repeated both with and without the use of a seed node as a signalling server (as such, some tests will be expected to fail, since the seed nodes are used in our NAT busting).
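For a rough idea of the kind of namespace and iptables setup involved (a minimal sketch only - the interface names, addresses and exact rules here are illustrative assumptions, not the actual util functions from this PR):

```ts
import { execFileSync } from 'child_process';

// Tiny helper for the privileged setup commands (sketch only).
const run = (args: Array<string>) => execFileSync(args[0], args.slice(1));

// A namespace acting as the NAT router, wired to the host via a veth pair.
run(['ip', 'netns', 'add', 'nat1']);
run(['ip', 'link', 'add', 'veth0', 'type', 'veth', 'peer', 'name', 'veth1']);
run(['ip', 'link', 'set', 'veth1', 'netns', 'nat1']);
run(['ip', 'addr', 'add', '10.0.0.1/24', 'dev', 'veth0']);
run(['ip', 'link', 'set', 'veth0', 'up']);
run(['ip', 'netns', 'exec', 'nat1', 'ip', 'addr', 'add', '10.0.0.2/24', 'dev', 'veth1']);
run(['ip', 'netns', 'exec', 'nat1', 'ip', 'link', 'set', 'veth1', 'up']);

// The NAT behaviour itself is an SNAT rule inside the router namespace; the
// agent lives in a further namespace behind it (omitted here). Whether the
// mapping is endpoint-independent or endpoint-dependent comes down to how
// source ports are allocated (e.g. plain SNAT vs SNAT with --random).
run([
  'ip', 'netns', 'exec', 'nat1',
  'iptables', '-t', 'nat', '-A', 'POSTROUTING',
  '-s', '192.168.0.0/24', '-o', 'veth1', '-j', 'SNAT', '--to-source', '10.0.0.2',
]);
```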

Relates to #148

Issues Fixed

Tasks

  • 1. Create util functions for creating namespaces and interfaces, as well as for setting up various iptables rules
  • 2. Write tests for basic connection (no NAT, sanity test)
  • 3. Write tests for one node behind NAT
  • 4. Write tests for both nodes behind NAT
  • [ ] 5. Write tests using a seed node - to be done in a new PR
  • 6. Ensure tests are separated from the rest of the testing suite (i.e. are not run with npm run test), and make sure they check that the operating system is Linux before running
  • [ ] 7. Investigate the lack of STDERR logs when running pk agent status or pk agent start in the beginning (#357 (comment)) - could be resolved in Testnet Deployment #326; resolved here: #357 (comment)
  • 8. Investigate the usage of unshare instead of ip netns, which reduces the overhead of using sudo and makes our tests more accessible on other platforms like CI/CD.
  • [ ] 9. Authenticate the sender of hole-punch messages (Authenticate the sender of a hole-punching signalling message #148) - not relevant to this PR; we will need to review the security of the P2P system after testnet deployment

Final checklist

  • Domain specific tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@emmacasolin emmacasolin self-assigned this Jun 9, 2022
@emmacasolin emmacasolin force-pushed the feature-nat-testing branch from bb1d042 to 54d8b7d on June 9, 2022 02:04
@CMCDragonkai (Member)

Some important things here in order to help debugging:

  1. The iproute2, utillinux and iptables-legacy packages should be left to the ambient OS context to bring in. There's no point bringing these into shell.nix or nix-shell --packages, because even if they are brought into the shell, they won't work unless the ambient OS supports them.
  2. The gitlab runner image currently brings in iproute2 and utillinux but not iptables-legacy. The problem with bringing in iptables-legacy is that it gives you a userspace command that expects the underlying OS to be using nftables. We don't know if the gitlab runners are using nftables at the OS level, therefore we can't be sure that iptables-legacy is the right package to use. Right now we don't use it at all, so we just need to test on CI/CD later to see what works. Similarly, we wouldn't want to specify these things in our shell.nix either, since we consider them OS dependencies: they deal with the networking stack and kernel-level filesystem behaviour.
  3. Inside the Jest tests, your conditional testing for the NAT testing should be stricter. We should check that the platform is linux, but also the existence of the commands in the PATH we will be using: iptables, mount, ip and whatever else.
  4. In your network setup, and wherever you are executing shell commands, you must do it in a way that is easier to debug. See scripts/pkg.js: I always create a "command array" first, then afterwards execute the command array. Use logger.info to log out the command array you are executing, and always check the exit code to see if it succeeded. If it failed, print out the STDERR via logger.error. Note that if it failed, you should also reverse any side effects in the catch handler (a sketch of this pattern follows below).
  5. Make sure to be using https://nodejs.org/api/child_process.html#synchronous-process-creation to make it easier for you to catch your errors.

I'm going to check how to do a quick executable check in nodejs and also which exec sync you should be using.
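To illustrate points 4 and 5 above, a minimal sketch of that pattern (the logger type and namespace name here are stand-ins, not the actual test code):

```ts
import { execFileSync } from 'child_process';

type Logger = { info: (msg: string) => void; error: (msg: string) => void };

function setupNamespace(logger: Logger): void {
  // Build the command array first so the exact command can be logged.
  const command = ['ip', 'netns', 'add', 'agent1'];
  logger.info(`Executing: ${command.join(' ')}`);
  try {
    execFileSync(command[0], command.slice(1));
  } catch (e: any) {
    // The error thrown by execFileSync carries the exit status and captured stderr.
    logger.error(`Command failed with status ${e.status}: ${e.stderr?.toString()}`);
    // Reverse any side effects before rethrowing.
    execFileSync('ip', ['netns', 'delete', 'agent1']);
    throw e;
  }
}
```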

@CMCDragonkai (Member)

I think you should be using https://nodejs.org/api/child_process.html#child_processexecfilesyncfile-args-options when you run a synchronous command.

Use spawn when you want to do it asynchronously. Don't use any of the others. execFileSync won't run the command in a shell by default, whereas execSync always runs it in a shell. We don't need a shell if we are just executing a single command.

Then you want to make sure to catch the exception from it, and deal with it accordingly.

@CMCDragonkai (Member)

To check if a command exists, we can do it in 2 ways:

  1. Since we're already checking if the platform is linux, we can just use execSync along with command -v .... See https://stackoverflow.com/a/26759734/582917 for the right command here. In this case we are using execSync because we want to use a shell builtin. And because it is already a linux platform, there are no issues on Windows.
  2. Use shelljs. Since we are already bringing in shx to do cross-platform copying (@emmacasolin it should be in the devDependencies), and it relies on shelljs, we could use it to do shelljs.which. See the shelljs docs to understand how to use it:
if (!shell.which('git')) {
  // ...
}

The first case is nice so that we are not reliant on more and more packages. The second case is more extensible if we need to deal with more platform constraints in the future, like on Windows. I reckon since shelljs is already there, we might as well use it and proceed with it. Make sure both shx and shelljs are in devDependencies.

Note that this should not be embedded into the describeIf and testIf, keep those simple. These are more things that would be inlined specifically to test files where applicable.

@CMCDragonkai (Member)

Once it is running @emmacasolin please post a log of running the tests, change your log level to INFO for that run, and then change it back to WARN afterwards.

@CMCDragonkai (Member)

Is task 5. supposed to be for issue #159? In that case, it will be blocked by #378 merge and the completion of another deployment procedure to testnet.polykey.io.

@CMCDragonkai (Member)

As I explained in-person, we're not using nix-shell --packages because it ignores what's in the shell.nix. It's still useful but only if we are running a single command as that is more efficient than loading the entire shell.nix. In the future we can apply arguments like --arg and --argstr to pass arguments into the shell.nix that can optimise its behaviour for certain environments, but that's for later.

@CMCDragonkai (Member)

I think we can reuse pkExec as you've done, but also create a wrapper called exec.

This is just async exec () { ... } and does the same thing as pkExec but doesn't use any of the TS stuff. Then you can apply that to all your synchronous setup commands involving ip, iptables... etc.

If you do that, you should put it in tests/bin/utils.ts along with pkExec etc. You should put it above the pkExec-related functions.

@CMCDragonkai (Member)

Then where we are using execFileSync we can just use await exec - better UX. The sync one should only be used for interactive scripts like pkg.js; I forgot it would block the event loop. It also automates the stdio options.
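A sketch of what such an exec wrapper could look like (illustrative only, not the actual tests/bin/utils.ts implementation):

```ts
import { execFile } from 'child_process';

type ExecResult = { stdout: string; stderr: string };

// Async analogue of pkExec for arbitrary commands like `ip` and `iptables`.
async function exec(command: string, args: Array<string> = []): Promise<ExecResult> {
  return new Promise<ExecResult>((resolve, reject) => {
    execFile(command, args, (error, stdout, stderr) => {
      if (error != null) {
        // error.code carries the exit code for non-zero exits
        reject(error);
      } else {
        resolve({ stdout, stderr });
      }
    });
  });
}

// Usage: const { stdout } = await exec('ip', ['netns', 'list']);
```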

@emmacasolin (Contributor, Author)

Task 6 was resolved by introducing conditional testing rather than separating the NAT tests. A describe/test can be skipped by using describeIf or testIf respectively from tests/utils.ts and providing a condition as the first parameter followed by the usual arguments for a describe/test block. Here we are checking both that the platform is linux (process.platform === 'linux') and that the necessary commands exist in the PATH (shell.which('ip') && shell.which('iptables') && shell.which('nsenter') && shell.which('unshare')).
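For reference, the helpers and the condition look roughly like this (a sketch - the actual tests/utils.ts implementations may differ slightly):

```ts
import shell from 'shelljs';

// Run or skip an entire describe/test block based on a condition.
const describeIf = (condition: boolean, name: string, f: () => void) =>
  condition ? describe(name, f) : describe.skip(name, f);
const testIf = (condition: boolean, name: string, f: () => Promise<void>, timeout?: number) =>
  condition ? test(name, f, timeout) : test.skip(name, f, timeout);

// Only run the NAT tests on Linux with the required commands in the PATH.
const supportsNatTesting =
  process.platform === 'linux' &&
  shell.which('ip') != null &&
  shell.which('iptables') != null &&
  shell.which('nsenter') != null &&
  shell.which('unshare') != null;

describeIf(supportsNatTesting, 'NAT traversal', () => {
  // ... tests that create namespaces and iptables rules ...
});
```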

@CMCDragonkai (Member)

Looking good. It looks like we do need to add iptables-legacy to our gitlab runner image. The reason is that we can't use iptables-legacy in our shell.nix, because our NixOS machines are not compatible with it anymore. Previously we had specified it in our .gitlab-ci.yml file, but this is no longer doable because we want to use shell.nix inside our CI.

So we just need to add it to our gitlab runner and rebuild the image.

Once gitlab runners upgrade their instances to use nftables, we will have to update the gitlab runner image, changing iptables-legacy to iptables (or replace the whole thing with nft rules instead).

@CMCDragonkai (Member)

Proceed to recheck the tests @emmacasolin, and if all good, start squashing.

@emmacasolin (Contributor, Author)

Proceed to recheck the tests @emmacasolin, and if all good, start squashing.

NAT tests are passing on the pipeline now! Will proceed to start squashing

@emmacasolin emmacasolin force-pushed the feature-nat-testing branch from 438fa0b to 08209b2 on June 13, 2022 22:58
@CMCDragonkai (Member) commented Jun 14, 2022

Can you create a new issue addressing the migration from iptables to nftables, and its dependencies:

  • Gitlab runners supporting nftables instead of iptables-legacy
    • Means gitlab runner image has to change from iptables-legacy package to nftables package
  • NixOS runners supporting nftables (already does)
  • Jest tests changing from iptables commands to nft commands, and explanation of equivalent commands (a rough example is sketched below)
  • Reference the nftables work and iptables scripts that you used in this PR
  • Any other additional supporting information
  • No more xtables.lock problem, which should mean the mount namespace is no longer needed
  • Updating the describeIf constraints to constrain on nft and/or mount if needed

Such an issue can be addressed in the future so we can use nft instead.
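As a rough illustration of the kind of command mapping such an issue would document (the addresses are made up, and the nft table/chain would need to be created first):

```ts
// Current iptables form (expressed as an argument array for execFileSync):
const iptablesArgs = [
  'iptables', '-t', 'nat', '-A', 'POSTROUTING',
  '-s', '192.168.0.0/24', '-j', 'SNAT', '--to-source', '10.0.0.2',
];
// Approximate nft equivalent once the runners support nftables:
const nftArgs = [
  'nft', 'add', 'rule', 'ip', 'nat', 'POSTROUTING',
  'ip', 'saddr', '192.168.0.0/24', 'snat', 'to', '10.0.0.2',
];
```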

});
});
test(
'Node1 behind EIM NAT connects to Node2',
Review comment (Member):

Can you make your test descriptions lowercase? All of our test descriptions are lower case atm. And it should say node 1 behind ....

Also what is EIM? I'm not familiar with this acronym, perhaps it should be expanded?

shell.which('iptables') &&
shell.which('nsenter') &&
shell.which('unshare'),
'no NAT',
Review comment (Member):

I would prefer calling this DMZ, to indicate no NAT situations.

@CMCDragonkai (Member)

Why did you use "endpoint-independent" and "endpoint-dependent" terminology over using port-restricted and symmetric terms? Is it a more precise name for these?

@CMCDragonkai (Member)

@emmacasolin can you copy paste the test results from all the tests and also annotate how they map to the spec's required test cases. I'm a bit confused how each test corresponds to the situations in the spec. Because it's so easy to forget how NAT works, I'm thinking we should add more documentation, or much more descriptive names to our tests (and structure them), so that it's clearer what each situation is testing.

@emmacasolin (Contributor, Author)

Why did you use "endpoint-independent" and "endpoint-dependent" terminology over using port-restricted and symmetric terms? Is it a more precise name for these?

Yeah endpoint-independent mapping (EIM) and endpoint-dependent mapping (EDM) seem to be the preferred terms over port-restricted and symmetric. Port-restricted is a specific implementation of EIM and symmetric is a specific implementation of EDM, but there are other implementations so it's better to use the broader terms.

@CMCDragonkai (Member) commented Jun 14, 2022

@emmacasolin can you copy paste the test results from all the tests and also annotate how they map to the spec's required test cases. I'm a bit confused how each test corresponds to the situations in the spec. Because it's so easy to forget how NAT works, I'm thinking we should add more documentation, or much more descriptive names to our tests (and structure them), so that it's clearer what each situation is testing.

And I reckon it'd be a good idea to draw some simple ASCII diagrams with asciiflow and just embed it in the tests themselves so we have a quick reference to each architecture being tested.
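For example, something like this could sit above each test block (purely illustrative):

```ts
/**
 * node 1 behind EIM NAT connects to node 2 behind EIM NAT, via a seed node
 * acting as the signaller:
 *
 *   ┌────────┐   ┌─────────┐               ┌─────────┐   ┌────────┐
 *   │ Agent1 ├───┤ EIM NAT ├───────┬───────┤ EIM NAT ├───┤ Agent2 │
 *   └────────┘   └─────────┘       │       └─────────┘   └────────┘
 *                                ┌──┴───┐
 *                                │ Seed │
 *                                └──────┘
 */
```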

@emmacasolin (Contributor, Author)

Something strange... I wanted to see if I could make a test run that would usually pass fail, by adding in the additional time that we see in the failing run; however, this actually made the test run faster, and even caused a run that would usually fail to pass!

What I did was add an await sleep(6000) inside sendHolePunchMessage (in NodeConnectionManager) just before the client call is made. I wonder if maybe there's something that blocks the process when there's too much going on, so putting in a sleep allows it to continue? But I'm not sure.

It's obviously not a fix, but the "sweet spot" for the sleep time is 500ms - this is enough to stop the problem test from failing, and not so much that other tests time out.

@CMCDragonkai (Member) commented Jun 24, 2022

Based on some discussion, there are some architectural issues we need to review from ground up. However in order to merge this there are some quick fixes that we need to apply in this PR.

  1. Private addresses should never be contacted. Agent 2 should not be contacting 10.0.0.2. You need to verify what message is being sent from A2 to Seed, and then from Seed to A1. A1 should never even know the private IP 10.0.0.2, and you should verify that the Seed is rewriting the from address, or completely replacing it.
  2. Punch handlers should be checking whether a proxy connection already exists. If one exists, then there shouldn't be an attempt to create a new connection.
  3. The refresh queue should be exhausted quickly in such a small network. Something is causing the refresh queue to loop infinitely. This should be prevented.

Long term issues (non-exhaustive) to be addressed in staging:

  1. Use wait/poll to define the state transition for these tests. We need to wait for the network to settle first before doing the next step. This is just for testing.
  2. Queue & cancellable promises need to be specced out to reduce the complexity of each step of this architecture.
  3. Signalling requests should be "fire and forget", similar to a 204 No Content response in HTTP. We shouldn't be waiting for the response when signalling.
  4. Promise.all and Promise.any - see Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer #297.

Either way we need to merge this to staging, and then work on the staging directly and divide and conquer each of these issues.

@emmacasolin (Contributor, Author)

Replacing the proxy address in a relay hole punch message with the address in your own node graph (if this exists) has gotten rid of all of the 10.0.0.2:55552 private addresses from the reverse connection attempts. So the seed node wasn't previously re-writing this and that's where the private addresses were coming from. So this solves issue 1.
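For context, the relay-side rewrite is conceptually something like the following (all identifiers here are illustrative; the real gRPC message and NodeGraph APIs differ):

```ts
type NodeId = string;
type NodeAddress = { host: string; port: number };
type HolePunchMessage = { srcNodeId: NodeId; targetNodeId: NodeId; proxyAddress: string };

// Illustrative shape of the seed node's relay handler for a hole-punch
// signalling message; only meant to show where the rewrite happens.
async function relayHolePunchMessage(
  message: HolePunchMessage,
  getNode: (nodeId: NodeId) => Promise<NodeAddress | undefined>,
  forward: (targetNodeId: NodeId, message: HolePunchMessage) => Promise<void>,
): Promise<void> {
  const knownAddress = await getNode(message.srcNodeId);
  if (knownAddress != null) {
    // Replace the self-reported (possibly private) return address with the
    // address this seed node actually has recorded for the source node.
    message.proxyAddress = `${knownAddress.host}:${knownAddress.port}`;
  }
  await forward(message.targetNodeId, message);
}
```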

@emmacasolin (Contributor, Author) commented Jun 24, 2022

As for point 2, it doesn't look like there's any check for an existing open node connection when you attempt to ping another node, but there is a check for an existing proxy connection, so we just need to add a check for a node connection and then that should resolve that issue.

I'm assuming here that if we have an open node connection with another agent, then that implies the agent we're connecting to is online? So if this.connections.get(targetNodeIdString) returns something then ping should just return true immediately?

@tegefaulkes (Contributor)

We don't need to check for an existing node connection. If there was an existing node connection then it would be using the proxy connection. So we only have to check for the proxy connection.

@emmacasolin (Contributor, Author)

We don't need to check for an existing node connection. If there was an existing node connection then it would be using the proxy connection. So we only have to check for the proxy connection.

In that case then nothing needs to be done for point 2, because we already check for an existing forward connection before establishing a new one in establishForwardConnection.

@tegefaulkes (Contributor) commented Jun 24, 2022

I set up a test mimicking the 3-node setup Emma has to check how the refresh buckets queue behaves. It seems to be working as intended. It will iterate over the higher buckets performing a find node operation. In this case that looks like about 2 connections per bucket. Sometimes that can be 8 or more buckets, so we can end up with about 24 connections in the logs depending on how far apart the node IDs are.

We need to add a way for the refresh bucket queue to detect that it's getting no new information due to the small network and skip doing refresh operations. This might need some thinking about and specifying out in a new issue. For these tests we can just pause the refresh buckets queue. I may need to add a way to pause it though.

Here is a log of the queue:
seed: v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
node1: vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0
node2: vj1nq0nj5f52ugcpda7f107b6n36i0t74kn9d3l5s8q035ndc68o0

DEBUG:NodeGraph:Adding node vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0 to bucket 254
DEBUG:NodeManager:Removing bucket 254 from queue
DEBUG:NodeManager:Adding bucket 254 to queue
DEBUG:NodeManager:refresh bucket queue has unplugged
DEBUG:NodeManager:Adding bucket 255 to queue
DEBUG:NodeManager:refresh bucket queue has unplugged
DEBUG:NodeManager:processing refreshBucket for bucket 254, 2 left in queue
DEBUG:Queue:Plugging queue
DEBUG:NodeGraph:Updating node vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0 in bucket 254
DEBUG:NodeManager:Removing bucket 254 from queue
INFO:NodeConnectionManager:Getting connection to vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0
INFO:NodeConnectionManager:no existing entry, creating connection to vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0
INFO:NodeConnection 127.0.0.1:39612:Creating NodeConnection
INFO:clientFactory:Creating GRPCClientAgent connecting to 127.0.0.1:39612
INFO:NodeConnectionManager:Getting connection to v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
INFO:NodeConnectionManager:existing entry found for v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
INFO:NodeConnectionManager:withConnF calling function with connection to v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
INFO:Proxy:Handling CONNECT to 127.0.0.1:39612
INFO:ConnectionForward 127.0.0.1:39612:Starting Connection Forward
INFO:ConnectionForward 127.0.0.1:39612:Started Connection Forward
INFO:ConnectionForward 127.0.0.1:39612:Composing Connection Forward
INFO:ConnectionForward 127.0.0.1:39612:Composed Connection Forward
INFO:Proxy:Handled CONNECT to 127.0.0.1:39612
INFO:clientFactory:Created GRPCClientAgent connecting to 127.0.0.1:39612
DEBUG:clientFactory:Watch Channel State: READY
INFO:NodeConnection 127.0.0.1:39612:Created NodeConnection
DEBUG:NodeGraph:Updating node vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0 in bucket 254
DEBUG:NodeManager:Removing bucket 254 from queue
INFO:NodeConnectionManager:withConnF calling function with connection to vtukutu74cldiot9it3422r0u94t2jah6i07csek5kv78lgsn61i0
DEBUG:NodeGraph:Updating node v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0 in bucket 255
DEBUG:NodeManager:Removing bucket 255 from queue
INFO:NodeConnectionManager:Getting connection to v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
INFO:NodeConnectionManager:existing entry found for v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
INFO:NodeConnectionManager:withConnF calling function with connection to v02v1q057qmh5igvoc6d9k5cm3fpcpv3ee7okfetl20v6ublcb7l0
DEBUG:NodeManager:Removing bucket 254 from queue
DEBUG:NodeManager:refresh bucket queue has plugged

@emmacasolin (Contributor, Author)

I've verified by looking through the logs that each agent only creates one forward, reverse, and node connection per other agent. They will always check for the existence of a connection before creating a new one and won't create a new one if one exists for that address (or node Id in the case of node connections).

Notably, the address fix seems to have solved the timeout issue - there aren't any other outstanding issues with the NAT tests, and they can run with the refresh buckets queue enabled.

joshuakarp and others added 15 commits June 24, 2022 16:28
If an error was logged out by a service handler it would previously appear as `createClientService:`, which was too vague. The logger will now state the name of the handler.
The testnet PR brought in some changes that affect the NAT tests, so they needed to be modified slightly.
- Adding a node to the node graph pings the node by default. We want to disable this in the NAT tests so we have more control over the pings.
- Nodes add the details of any node that pings them. We can now remove some additional `nodes add` calls that were required to imitate this previously missing functionality.
Now that nodes add the details of a node that contacts them, we no longer need the `edmSimple` configuration to do this manually.
General linting, using capitals for constants in NAT utils, and lowercase test descriptions
A relay node for a hole punch message was previously not modifying the proxy address in the message (which is the "return address" used to contact the source node). For nodes behind a NAT, who do not know their own public address, they rely on this overwriting so that nodes do not try to contact them on their private, inaccessible address.
This will allow us to disable the queue for testing.
The composed flag was set at the beginning of the compose function, causing another function to throw an error due to an undefined property if it was called at the same time.