Fix require.Eventually gotchas #31861

guyarb · 2024-12-08T07:25:31Z

What does this PR do?

Using require.* functions with t *testing.T inside the require.Eventually condition aborts the test early (and fails it) on the first failed check, even if the condition would have been satisfied on a later iteration. To use require.* methods inside an eventually-style loop, we need to switch to require.EventuallyWithT and pass the provided collect *assert.CollectT instead of t *testing.T.

Motivation

Fix possible test failures due to early abort.

Describe how you validated your changes

Possible Drawbacks / Trade-offs

Additional Notes

agent-platform-auto-pr · 2024-12-08T08:07:50Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR changes on a VM:

inv aws.create-vm --pipeline-id=50645090 --os-family=ubuntu

Note: This applies to commit 05195f4

cit-pr-commenter · 2024-12-08T09:26:10Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 78ec7b1e-e16d-4cde-903e-a5fe17b75b45

Baseline: 00f8f5b
Comparison: f41dfa5
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_idle_all_features	memory utilization	+2.30	[+2.20, +2.41]	1	Logs bounds checks dashboard
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.38	[-0.39, +1.16]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+0.30	[+0.24, +0.36]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.13	[-0.73, +0.98]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	+0.10	[-0.37, +0.56]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	+0.03	[-0.70, +0.76]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.02	[-0.60, +0.65]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.01	[-0.11, +0.09]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	-0.07	[-0.85, +0.71]	1	Logs
➖	file_tree	memory utilization	-0.28	[-0.42, -0.15]	1	Logs
➖	quality_gate_idle	memory utilization	-0.46	[-0.50, -0.42]	1	Logs bounds checks dashboard
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	-0.47	[-1.18, +0.25]	1	Logs
➖	otel_to_otel_logs	ingress throughput	-0.73	[-1.44, -0.03]	1	Logs
➖	quality_gate_logs	% cpu utilization	-2.19	[-5.08, +0.71]	1	Logs

Bounds Checks: ❌ Failed

perf	experiment	bounds_check_name	replicates_passed	links
❌	file_to_blackhole_0ms_latency	lost_bytes	8/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
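The three criteria above amount to a simple predicate. The sketch below encodes them in Go purely as an illustration; the experiment type and isRegression function are hypothetical, not the Regression Detector's code, and the example rows assume neither experiment is marked erratic, since the table does not say.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds one row of the "Fine details" table (illustrative fields).
type experiment struct {
	name      string
	deltaMean float64 // Δ mean %
	ciLow     float64 // lower bound of the 90% "Δ mean % CI"
	ciHigh    float64 // upper bound of the 90% "Δ mean % CI"
	erratic   bool    // assumed: whether the configuration marks it erratic
}

const effectSizeTolerance = 5.0 // |Δ mean %| threshold, in percent

// isRegression is true only if all three criteria hold: the effect is large
// enough, the confidence interval excludes zero, and it is not erratic.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.deltaMean) >= effectSizeTolerance
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && ciExcludesZero && !e.erratic
}

func main() {
	rows := []experiment{
		{"quality_gate_idle_all_features", 2.30, 2.20, 2.41, false},
		{"quality_gate_logs", -2.19, -5.08, 0.71, false},
	}
	for _, r := range rows {
		fmt.Println(r.name, isRegression(r))
	}
}
```

With the table's values both rows print false: quality_gate_idle_all_features has a CI that excludes zero but is below the 5.00% tolerance, and quality_gate_logs has a CI that contains zero.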

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.

hmahmood · 2024-12-09T18:21:39Z

pkg/network/tracer/tracer_test.go

@@ -198,12 +198,13 @@ func (s *TracerSuite) TestTCPSendAndReceive() {
 	require.NoError(t, err)

 	var conn *network.ConnectionStats
-	require.Eventually(t, func() bool {
+	require.EventuallyWithT(t, func(collect *assert.CollectT) {


Does this one need to be changed? I don't see any require.* calls inside the Eventually loop.

Yes, because connections := getConnections(t, tr) calls require.NoError internally.

pimlu: Worth noting that when I tried something like this earlier, I found that if getConnections(collect, tr) fails, it panics because of the require call (not assert), and the resulting panic is much harder to read than the failure you get with getConnections(t, tr). It doesn't turn the panic into a readable test failure, maybe because collect isn't expected to be used with require; I'm not sure. In my case I just passed t to getConnections inside Eventually, since I'm fine with the test failing if getConnections errors even once.

I temporarily changed require.NoError to require.Error in getConnections to force a failure in that case. Here is the output:

=== RUN   TestTracerSuite/CO-RE/TestTCPSendAndReceive
    tracer_test.go:201:
        Error Trace:    /home/guy/dd/datadog-agent/pkg/network/tracer/tracer_test.go:973
                        /home/guy/dd/datadog-agent/pkg/network/tracer/tracer_test.go:203
                        /home/guy/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1700
        Error:          An error is expected but got nil.
    tracer_test.go:201:
        Error Trace:    /home/guy/dd/datadog-agent/pkg/network/tracer/tracer_test.go:201
        Error:          Condition never satisfied
        Test:           TestTracerSuite/CO-RE/TestTCPSendAndReceive
        Messages:       failed to find connection
--- FAIL: TestTracerSuite/CO-RE/TestTCPSendAndReceive (5.99s)

There's no panic, and the output seems reasonable

hmahmood · 2024-12-09T18:53:43Z

pkg/network/tracer/tracer_linux_test.go

-		assert.NoError(c, err)
+	require.Eventually(t, func() bool {
+		clientIP, clientPort, _, err = testdns.SendDNSQueries([]string{destDomain}, destAddr, "udp")
+		return err == nil


I think these ones where we are checking the error are better with EventuallyWithT since we will get these errors in the test failure output. Here if there is an error, we won't see the actual error when the test fails.

hmahmood · 2024-12-09T18:55:22Z

pkg/network/tracer/tracer_linux_test.go

-		if !assert.NoError(t, err) {
-			return false
-		}
+		require.NoError(collect, err)


This won't return, right? We don't need to continue if there is an error.

require.* halts execution immediately and fails the current iteration, so we won't proceed past that line if there's an error.

hmahmood · 2024-12-09T19:01:57Z

pkg/network/tracer/tracer_linux_test.go

@@ -1421,7 +1423,7 @@ func (s *TracerSuite) TestUDPPythonReusePort() {

 		t.Log(conns)

-		return len(conns) == 4
+		require.Len(collect, conns, 4)


Should these be assert.*? The FailNow method on CollectT exits the program (https://github.com/stretchr/testify/blob/master/assert/assertions.go#L1975), so I am not sure this does what is intended in the PR.

Since we're passing collect, require.* halts only the current iteration, and the condition is retried after the tick interval.

Were you able to test this? Looking at the code for CollectT.FailNow (which every require.* call ends up calling), I see runtime.Goexit being called, which would just exit the program rather than continue after the sleep, no?

Yes, you can try it with this example:

func TestExample(t *testing.T) {
	i := 0
	require.EventuallyWithT(t, func(ct *assert.CollectT) {
		i++
		t.Log("running iteration", i)
		require.Greater(ct, i, 5)
		require.Equal(ct, 0, i%2)
		t.Log("iteration", i, "done")
	}, time.Second, 100*time.Millisecond)
}

The output is

=== RUN   TestExample
    tracer_linux_test.go:691: running iteration 1
    tracer_linux_test.go:691: running iteration 2
    tracer_linux_test.go:691: running iteration 3
    tracer_linux_test.go:691: running iteration 4
    tracer_linux_test.go:691: running iteration 5
    tracer_linux_test.go:691: running iteration 6
    tracer_linux_test.go:694: iteration 6 done
--- PASS: TestExample (0.60s)
PASS

Nvm, runtime.Goexit just terminates the current goroutine, so this should be fine.

guyarb · 2024-12-09T22:46:02Z

/merge

dd-devflow · 2024-12-09T22:46:13Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2024-12-09 22:46:13 UTC ℹ️ MergeQueue: waiting for PR to be ready

This merge request is not mergeable yet, because of pending checks/missing approvals. It will be added to the queue as soon as checks pass and/or get approvals.
Note: if you pushed new commits since the last approval, you may need additional approval.
You can remove it from the waiting list with /remove command.

2024-12-10 02:46:14 UTC ⚠️ MergeQueue: This merge request was unqueued

This merge request was unqueued

guyarb · 2024-12-10T05:42:06Z

/merge

dd-devflow · 2024-12-10T05:42:16Z

Devflow running: `/merge`

View all feedbacks in Devflow UI.

2024-12-10 05:42:15 UTC ℹ️ MergeQueue: pull request added to the queue

The median merge time in main is 24m.

guyarb added 5 commits December 8, 2024 08:47

usm: Fix require.Eventually for tracer_usm_linux_test

406b42c

npm: Remove redundant argument

7828abc

npm: Fix assertions in tracer_linux_test

7b7504b

npm: Fix assertions in tracer_test

8c0ef7a

usm: Fix assertions in kafka_monitor_test

69c0d5f

guyarb added the changelog/no-changelog, team/networks, team/usm, and qa/no-code-change labels Dec 8, 2024

github-actions bot added the component/system-probe and medium review labels Dec 8, 2024

wip: usm linux test

f41dfa5

guyarb marked this pull request as ready for review December 8, 2024 09:40

guyarb requested review from a team as code owners December 8, 2024 09:40

guyarb requested a review from AyyLam December 8, 2024 09:40

Yumasi approved these changes Dec 9, 2024

hmahmood reviewed Dec 9, 2024

github-actions bot added the long review label and removed the medium review label Dec 9, 2024

npm: revert changes for tests

05195f4

hmahmood approved these changes Dec 9, 2024

dd-mergequeue bot merged commit 5e3a9d7 into main Dec 10, 2024
302 checks passed

dd-mergequeue bot deleted the guy.arbitman/fix-require-eventually branch December 10, 2024 06:20

github-actions bot added this to the 7.62.0 milestone Dec 10, 2024

