feat(libp2p): shared TCP listeners and AutoTLS.AutoWSS #10565
Got this to a state that is as close to the go-libp2p release as possible; filing this here so I won't forget:
There seems to be a regression between 9b887ea (this PR) and 433444b (latest master), caused by the TCP reuse logic in the go-libp2p version bumped here.

I did `curl localhost:5001/debug/pprof/profile > ipfs.cpuprof` and the CPU profile seems to confirm the source is the TCP reuse from libp2p/go-libp2p#2984 (cc @aschmahmann @MarcoPolo).

How to reproduce: run `ipfs daemon` on each build.

- `staging-2024-12-04-9b887ea` (go-libp2p master, `v0.37.1-0.20241202220543-9024f8e8c86e`) will grow CPU use once the ipfs-webui Peers screen is opened; CPU peaks within 1 minute.
- `master-2024-12-03-433444b` (go-libp2p `v0.37.2`) does not trigger high CPU load.

Here are docker images for repro convenience:
docker run --rm -it --net=host ipfs/kubo:staging-2024-12-04-9b887ea
docker run --rm -it --net=host ipfs/kubo:master-2024-12-03-433444b
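If it helps, here is a minimal Go sketch (not part of this PR) that grabs the same CPU profile programmatically; it is just the `curl` call above expressed in Go, and assumes the daemon API listens on the default `localhost:5001`.

```go
// fetch_cpuprof.go: save a 30s CPU profile from a locally running Kubo daemon,
// equivalent to `curl localhost:5001/debug/pprof/profile > ipfs.cpuprof`.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:5001/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("ipfs.cpuprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	// Inspect the result with: go tool pprof ipfs.cpuprof
	log.Println("wrote ipfs.cpuprof")
}
```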
Related work:
libp2p/go-libp2p#3080 seems to fix the CPU spin.
Tested in 2adb2f1 – need to investigate why CI fails.
I am debugging sharness with `make -O -j 1` because CI runs with `-j 10` and logs are mangled.
Getting bizarre. Tests with the same go-libp2p and boxo versions pass in #10631.

The only `go.mod` diff between a436f4e (this PR, CI failing due to timeout after 20m) and 070e6ae (#10631, CI green, sharness run takes 5 minutes) is:

Other than that, we don't change anything in the default config other than enabling shared TCP.
Will disable it (b5cfd6d) and re-run, to see if that causes an issue somewhere.
💡 👉 Sharness passes green if we disable `libp2p.ShareTCPListener()` (b5cfd6d passed in the job here).

Something related to `libp2p.ShareTCPListener()` makes sharness hang and time out when run in 10 parallel threads (`cd test/sharness` and `time make -O -j 10`).
Still looking for the cause, but I was able to reproduce locally by using the same settings as CI:

It hung for a few minutes and finished eventually, but took 3x longer than a regular run (15m vs 5m).
On a re-run it hung for 20m and I had to kill it, so there is some factor of randomness to the hang.

Looking at the process tree spawned by `make`, the `t0250-files-api.sh`, `t0260-sharding.sh`, `t0181-private-network.sh`, `t0182-circuit-relay.sh`, and `t0320-pubsub.sh` tests seem to hang the longest, in semi-random order.

Will resume tomorrow unless someone else has any idea.
In the meantime, I've re-run https://github.com/ipfs/kubo/actions/runs/12420930073 to double-confirm that disabling `libp2p.ShareTCPListener()` fixes sharness.
Ran two tests to narrow down the search:

- without `libp2p.ShareTCPListener()`, sharness finished in 8m successfully (log)
- with `libp2p.ShareTCPListener()`, sharness timed out after 20m (log)

Maybe increasing the timeout would help, but taking 2x as long feels like a bug.
OK, running sequentially (#10565 (comment)) found the problem: `libp2p.ShareTCPListener()` breaks PNET and `./t0181-private-network.sh` hangs. This kinda makes sense: PNET is a TCP-only feature of libp2p and we are messing with TCP here.

I'll now confirm this is the only one failing, and if other tests pass, I will disable TCP sharing when PNET is enabled for now.
I'll also see if we can set a timeout per test suite / unit, to avoid this debugging horror in the future.
Good news: PNET was the only reason sharness failed. Disabled port sharing when PNET is enabled (bb87df3) and CI is green again:
I'm going forward with Kubo 0.33.0-rc1.
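For readers following along, here is a minimal sketch of what that kind of conditional looks like when building a go-libp2p host directly; the `swarmKey` variable and surrounding structure are hypothetical and only illustrate the workaround, not Kubo's actual node-construction code.

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
)

func main() {
	// swarmKey stands in for the contents of swarm.key; a non-empty value
	// means the node runs a private network (PNET). The name is illustrative.
	var swarmKey []byte

	var opts []libp2p.Option
	if len(swarmKey) == 0 {
		// Share one TCP listener between transports only when PNET is off,
		// mirroring the workaround described above.
		opts = append(opts, libp2p.ShareTCPListener())
	}

	h, err := libp2p.New(opts...)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()
	log.Println("libp2p host started:", h.ID())
}
```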