-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert "unbundle agent and remove associated code (#31228)" #31796
Conversation
This reverts commit 052b83d.
Test changes on VMUse this command from test-infra-definitions to manually test this PR changes on a VM: inv aws.create-vm --pipeline-id=50963946 --os-family=ubuntu Note: This applies to commit ce7fc27 |
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 68f07c6 Optimization Goals: ✅ No significant changes detected
|
perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
---|---|---|---|---|---|---|
➖ | quality_gate_idle | memory utilization | +0.32 | [+0.27, +0.36] | 1 | Logs bounds checks dashboard |
➖ | file_tree | memory utilization | +0.07 | [-0.06, +0.20] | 1 | Logs |
➖ | file_to_blackhole_300ms_latency | egress throughput | +0.01 | [-0.62, +0.65] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http2 | egress throughput | +0.00 | [-0.84, +0.85] | 1 | Logs |
➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.01, +0.01] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency | egress throughput | -0.00 | [-0.85, +0.84] | 1 | Logs |
➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.10, +0.08] | 1 | Logs |
➖ | file_to_blackhole_100ms_latency | egress throughput | -0.01 | [-0.69, +0.66] | 1 | Logs |
➖ | file_to_blackhole_0ms_latency_http1 | egress throughput | -0.02 | [-0.85, +0.81] | 1 | Logs |
➖ | quality_gate_logs | % cpu utilization | -0.02 | [-2.95, +2.90] | 1 | Logs |
➖ | file_to_blackhole_500ms_latency | egress throughput | -0.06 | [-0.82, +0.71] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.30 | [-1.08, +0.48] | 1 | Logs |
➖ | file_to_blackhole_1000ms_latency_linear_load | egress throughput | -0.36 | [-0.81, +0.10] | 1 | Logs |
➖ | tcp_syslog_to_blackhole | ingress throughput | -0.50 | [-0.56, -0.43] | 1 | Logs |
➖ | otel_to_otel_logs | ingress throughput | -0.52 | [-1.19, +0.15] | 1 | Logs |
➖ | quality_gate_idle_all_features | memory utilization | -0.66 | [-0.77, -0.54] | 1 | Logs bounds checks dashboard |
➖ | uds_dogstatsd_to_api_cpu | % cpu utilization | -0.80 | [-1.51, -0.09] | 1 | Logs |
Bounds Checks: ❌ Failed
perf | experiment | bounds_check_name | replicates_passed | links |
---|---|---|---|---|
❌ | file_to_blackhole_500ms_latency | lost_bytes | 8/10 | |
✅ | file_to_blackhole_0ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http1 | memory_usage | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | lost_bytes | 10/10 | |
✅ | file_to_blackhole_0ms_latency_http2 | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_1000ms_latency_linear_load | memory_usage | 10/10 | |
✅ | file_to_blackhole_100ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_300ms_latency | lost_bytes | 10/10 | |
✅ | file_to_blackhole_300ms_latency | memory_usage | 10/10 | |
✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | |
✅ | quality_gate_idle | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_idle_all_features | memory_usage | 10/10 | bounds checks dashboard |
✅ | quality_gate_logs | lost_bytes | 10/10 | |
✅ | quality_gate_logs | memory_usage | 10/10 |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
deva agent.build --bundle process-agent --bundle security-agent | ||
``` | ||
|
||
To disable bundling entirely: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ question
Sounds confusing that this doc says that the agent can combine multiple functionalities as an option, and then here it says that to disable bundling you have to pass --bundle agent
. What is the default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't put anything it's equivalent to --bundle agent
/merge |
Devflow running:
|
Package size comparisonComparison with ancestor Diff per package
Decision |
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-7.61.x 7.61.x
# Navigate to the new working tree
cd .worktrees/backport-7.61.x
# Create a new branch
git switch --create backport-31796-to-7.61.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 906f61a453542d379ddfd379b9c53d401a9b696b
# Push it to GitHub
git push --set-upstream origin backport-31796-to-7.61.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-7.61.x Then, create a pull request where the |
What does this PR do?
It reverts the cleanup around the single-binary agent done in #31228 so that the Heroku Agent can keep a single-binary Agent including the core agent and the process-agent.
Motivation
Removing "bundling" on #31228 made the two Agent builds done for the heroku target essentially the same, when one of them is actually expected to include the process agent (bundled in the same binary).
This is the core logic that was lost and that this revert would recover:
datadog-agent/omnibus/config/software/datadog-agent.rb
Lines 35 to 37 in 4757161
datadog-agent/omnibus/config/software/datadog-agent.rb
Line 103 in 4757161
datadog-agent/omnibus/config/software/datadog-agent.rb
Line 109 in 4757161
This was discovered when we were investigating why we were building the Agent twice for Heroku with no clear differences (#31572, #31625).
This single-binary build was added for Heroku in #23422.
The Heroku Buildpack relies on the presence of a "core-agent" and a "core-agent+process-agent" and chooses between the two (deleting the other) depending on whether the process agent is required. The logic where this is done is here:
https://github.com/DataDog/heroku-buildpack-datadog/blob/5034fb2a7840a8996dee968bf7004f830f4c3149/bin/compile#L241-L262.
Describe how you validated your changes
TODO: Ideally we would confirm that we stopped having the process agent when the change was introduced and that we've recovered it with this PR, though as of now I have no idea how to do that for Heroku.
Possible Drawbacks / Trade-offs
This reverts the original PR wholesale, whereas this could probably be achieved with a smaller change. But we're in the middle of a release and the fix looks urgent to me to try to do it cleanly now.
A longer term discussion might still be needed regarding how we want to support this specific use case.
Additional Notes
This will have to be backported to 7.61.