Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unnecessary extra reload on integration policy change under EA managed mode #41751

Open
carsonip opened this issue Nov 8, 2024 · 3 comments · May be fixed by #41794
Open

Unnecessary extra reload on integration policy change under EA managed mode #41751

carsonip opened this issue Nov 8, 2024 · 3 comments · May be fixed by #41794
Assignees
Labels
bug impact:low Long-term priority, unless it's a quick fix. needs_team Indicates that the issue/PR needs a Team:* label

Comments

@carsonip
Copy link
Member

carsonip commented Nov 8, 2024

APM Server version (apm-server version): 8.15.0+

Description of the problem including expected versus actual behavior:

APM server reloader reloads runner (internal/beater/beater.go) 2 times instead of 1 when integration policy is changed.

This is likely due to nil check being earlier than the unchanged check in EA apm tracing PR: https://github.com/elastic/beats/pull/40030/files#diff-dd024b66f2ed585cbbfce1d8a0aeba55183ab0d04bed2a67f7738f62b5658e63R896-R908 . This causes apm tracing config nil -> nil to trigger a reload.

When this is fixed, we should also remove the workaround in https://github.com/elastic/apm-server/pull/14574/files#diff-e91b3485079439409a6365ec6aa74c2c009623463528bc9d541ca6494df8cf35R350

Steps to reproduce:

  1. Start EA managed apm-server e.g. on ESS
  2. Update apm integration policy
  3. Observe that every loaded input config log line is accompanied by a loaded apm tracing config even though apm tracing config has not changed. This implies 2 reloads are done instead of 1.

Provide logs (if relevant):

apm-server log lines from new to old. Everything is triggered by BeatV2Manager.unitListen, followed by loaded apm tracing config, then loaded input config.

"@timestamp",message
"Nov 8, 2024 @ 09:55:56.291","no longer blocking ingestion as all precondition checks are now satisfied"
"Nov 8, 2024 @ 09:55:56.289","Starting apm-server [f4b8e24ec4828569d47ef590282506d43367885c built 2024-10-08 15:25:30 +0000 UTC]. Hit CTRL-C to stop it."
"Nov 8, 2024 @ 09:55:56.289","update routine done"
"Nov 8, 2024 @ 09:55:56.289","CORS related setting `apm-server.rum.allow_origins` allows all origins. Consider more restrictive setting for production use."
"Nov 8, 2024 @ 09:55:56.289","Path /intake/v2/events added to request handler"
"Nov 8, 2024 @ 09:55:56.289","Path /intake/v3/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.289","RUM endpoints enabled!"
"Nov 8, 2024 @ 09:55:56.289","SSL enabled."
"Nov 8, 2024 @ 09:55:56.289","Path /v1/logs added to request handler"
"Nov 8, 2024 @ 09:55:56.289","Path /v1/metrics added to request handler"
"Nov 8, 2024 @ 09:55:56.289","loaded input config"
"Nov 8, 2024 @ 09:55:56.289","Path /v1/traces added to request handler"
"Nov 8, 2024 @ 09:55:56.288","Path / added to request handler"
"Nov 8, 2024 @ 09:55:56.288","running final aggregation"
"Nov 8, 2024 @ 09:55:56.288","Path /config/v1/rum/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.288","Path /intake/v2/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.288","LSM aggregator stopped"
"Nov 8, 2024 @ 09:55:56.288","Path /config/v1/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.287","Stop listening on: 0.0.0.0:8200"
"Nov 8, 2024 @ 09:55:56.287","SSL enabled."
"Nov 8, 2024 @ 09:55:56.287","RUM endpoints enabled!"
"Nov 8, 2024 @ 09:55:56.287","Server stopped"
"Nov 8, 2024 @ 09:55:56.287","stopping aggregator"
"Nov 8, 2024 @ 09:55:56.287","CORS related setting `apm-server.rum.allow_origins` allows all origins. Consider more restrictive setting for production use."
"Nov 8, 2024 @ 09:55:56.287","Path /v1/metrics added to request handler"
"Nov 8, 2024 @ 09:55:56.287","Path /v1/logs added to request handler"
"Nov 8, 2024 @ 09:55:56.287","Starting apm-server [f4b8e24ec4828569d47ef590282506d43367885c built 2024-10-08 15:25:30 +0000 UTC]. Hit CTRL-C to stop it."
"Nov 8, 2024 @ 09:55:56.286","Path /config/v1/rum/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.286","Path /intake/v2/events added to request handler"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.Transactions.MaxGroups set to 5000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","blocking ingestion until all preconditions are satisfied"
"Nov 8, 2024 @ 09:55:56.286","Path /v1/traces added to request handler"
"Nov 8, 2024 @ 09:55:56.286","docappender.MaxRequests set to 11 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Path /intake/v3/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.286","Path /intake/v2/rum/events added to request handler"
"Nov 8, 2024 @ 09:55:56.286","docappender.DocumentBufferSize set to 819 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.ServiceTransactions.MaxGroups for service aggregation set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Path / added to request handler"
"Nov 8, 2024 @ 09:55:56.286","MaxConcurrentDecoders set to 102 based on 80 percent of 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.ServiceDestinations.MaxGroups set to 10000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Aggregation.MaxServices set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.286","Path /config/v1/agents added to request handler"
"Nov 8, 2024 @ 09:55:56.284","failed to populate sourcemap metadata: failed to run initial search query: fetcher unavailable: server shutting down"
"Nov 8, 2024 @ 09:55:56.284","docappender.MaxRequests set to 11 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","docappender.DocumentBufferSize set to 819 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","MaxConcurrentDecoders set to 102 based on 80 percent of 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.ServiceTransactions.MaxGroups for service aggregation set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","blocking ingestion until all preconditions are satisfied"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.MaxServices set to 1000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.Transactions.MaxGroups set to 5000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","Aggregation.ServiceDestinations.MaxGroups set to 10000 based on 1.0gb of memory"
"Nov 8, 2024 @ 09:55:56.283","precondition failed: context canceled"
"Nov 8, 2024 @ 09:55:56.282","Listening on: [::]:8200"
"Nov 8, 2024 @ 09:55:56.282","using output.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:56.282","Falling back to elasticsearch output for API Key usage"
"Nov 8, 2024 @ 09:55:56.282","using apm-server.agent.config.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:56.282","stopping apm-server... waiting maximum of 30s for queues to drain"
"Nov 8, 2024 @ 09:55:56.281","update routine done"
"Nov 8, 2024 @ 09:55:56.280","LSM aggregator stopped"
"Nov 8, 2024 @ 09:55:56.280","running final aggregation"
"Nov 8, 2024 @ 09:55:56.280","Falling back to elasticsearch output for API Key usage"
"Nov 8, 2024 @ 09:55:56.280","Stop listening on: 0.0.0.0:8200"
"Nov 8, 2024 @ 09:55:56.280","stopping apm-server... waiting maximum of 30s for queues to drain"
"Nov 8, 2024 @ 09:55:56.280","received input from elastic-agent"
"Nov 8, 2024 @ 09:55:56.280","Listening on: [::]:8200"
"Nov 8, 2024 @ 09:55:56.280","Server stopped"
"Nov 8, 2024 @ 09:55:56.280","stopping aggregator"
"Nov 8, 2024 @ 09:55:56.280","loaded apm tracing config"
"Nov 8, 2024 @ 09:55:56.279","using output.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:56.279","using apm-server.agent.config.elasticsearch for fetching agent config"
"Nov 8, 2024 @ 09:55:55.278","BeatV2Manager.unitListen UnitChanged.ID(apm-es-containerhost-elastic-cloud-apm), UnitChanged.Type(modified), UnitChanged.Trigger(2): modified/config_change_triggered"
@carsonip carsonip added the bug label Nov 8, 2024
@simitt
Copy link
Contributor

simitt commented Nov 8, 2024

@carsonip can you please assess the impact of this in your description and then add an according impact label for this. While I agree the reload is unnecessary, how much of an optimization is it vs. what's the negative consequence.

@carsonip
Copy link
Member Author

carsonip commented Nov 8, 2024

IIUC the bug sits in beats. It is better seen as a regression from supporting apm tracing config reload in #40030 . As for the impact, I don't think it will create any actual problems, other than some extra API calls e.g. for agentcfg and sourcemap fetcher as part of the initialization process. If any new connection is unlucky enough to use the new runner that gets stopped immediately, these connections will need to be drained. Also, although I haven't studied how likely it is relevant, but it is possible that a reload that quick will trigger some race conditions (a bug in itself), e.g. if runners from 2 consecutive quick restarts race to modify a global monitoring registry.

TLDR: not high impact, but as this is a recent regression, it would be good to optimize this.

@simitt simitt added the impact:low Long-term priority, unless it's a quick fix. label Nov 8, 2024
@carsonip carsonip transferred this issue from elastic/apm-server Nov 22, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 22, 2024
@botelastic
Copy link

botelastic bot commented Nov 22, 2024

This issue doesn't have a Team:<team> label.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug impact:low Long-term priority, unless it's a quick fix. needs_team Indicates that the issue/PR needs a Team:* label
Projects
None yet
2 participants