Autoscaler Nomad Plugin Doesn't Take Into Consideration CPU or Memory requested by blocked evals #584
Comments
Hi @megalan247 👋 Those would definitely be useful metrics to have, but unfortunately there isn't a good way to read them from the Nomad API. The Nomad APM plugin uses Nomad's REST API to retrieve values, not client metrics. We have to do it this way because metrics are available per agent, meaning that, when you query To make things worse, these metrics are only emitted by the cluster leader, so the Autoscaler would need access to the Nomad server APIs, but some environments don't allow that, especially if the Autoscaler is running in Nomad itself, as an allocation. From Nomad's perspective, the problem is that these metrics are not persisted in the state store. They are only available as in-memory metrics, so it would not be possible to query them like the other information. So the only solution for now is to use an APM that is able to scrape/receive and aggregate metrics from all agents in your cluster; otherwise you will only have, at best, partial data. I will keep this open in case things change in the future, but unfortunately I think it will take a while for us to be able to do it.
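As a rough illustration of the per-agent limitation described above, here is a minimal Python sketch that pulls the blocked-eval gauges out of a single agent's metrics payload. It assumes the JSON shape returned by Nomad's `/v1/metrics` endpoint (a top-level `Gauges` list whose entries carry `Name`, `Value`, and `Labels`); the helper name `blocked_eval_gauges` and both sample payloads are hypothetical, and metric names may carry a hostname prefix unless `disable_hostname` is set.

```python
from typing import Any, Dict

# Gauge names taken from the issue; everything else here is illustrative.
BLOCKED_EVAL_GAUGES = (
    "nomad.nomad.blocked_evals.cpu",
    "nomad.nomad.blocked_evals.memory",
)


def blocked_eval_gauges(payload: Dict[str, Any]) -> Dict[str, float]:
    """Sum the blocked-eval gauges found in ONE agent's metrics payload.

    Gauges may be emitted once per label set (for example per node class),
    so values sharing a name are added together.
    """
    totals: Dict[str, float] = {}
    for gauge in payload.get("Gauges") or []:
        name = gauge.get("Name", "")
        if name in BLOCKED_EVAL_GAUGES:
            totals[name] = totals.get(name, 0.0) + float(gauge.get("Value", 0))
    return totals


# Hypothetical payload from the leader: it carries the blocked-eval gauges.
leader_payload = {
    "Gauges": [
        {"Name": "nomad.nomad.blocked_evals.cpu", "Value": 1000, "Labels": {"node_class": "a"}},
        {"Name": "nomad.nomad.blocked_evals.cpu", "Value": 500, "Labels": {"node_class": "b"}},
        {"Name": "nomad.nomad.blocked_evals.memory", "Value": 2048, "Labels": {"node_class": "a"}},
    ]
}

# Hypothetical payload from any other agent: no blocked-eval gauges at all.
other_payload = {
    "Gauges": [{"Name": "nomad.client.allocated.cpu", "Value": 500, "Labels": {}}]
}
```

Querying anything other than the current leader gives an empty result, which is why a single API call is not enough and an aggregating APM is suggested instead.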
For anyone stumbling upon this looking for more info, my team managed to come up with something that we think works for us, inspired by the config posted in a seemingly unrelated issue. Prerequisite: you have all of your agents configured to send metrics. We use DataDog, so our telemetry block looks like this:
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics = true
  datadog_address = "localhost:8125"
  disable_hostname = true
  collection_interval = "10s"
}
For a given AWS ASG that we want to scale, we have two checks of the metrics @megalan247 mentioned:
check "scale_up_on_exhausted_cpu" {
  source = "datadog"
  query_window = "5m"
  query = "default_zero(default_zero(sum:nomad.nomad.blocked_evals.cpu{environment:{{ environment }},node_class:{{ node_class }}})/default_zero(sum:nomad.nomad.blocked_evals.cpu{environment:{{ environment }},node_class:{{ node_class }}}))"
  strategy "target-value" {
    target = 0.9
  }
}
check "scale_up_on_exhausted_memory" {
  source = "datadog"
  query_window = "5m"
  query = "default_zero(default_zero(sum:nomad.nomad.blocked_evals.memory{environment:{{ environment }},node_class:{{ node_class }}})/default_zero(sum:nomad.nomad.blocked_evals.memory{environment:{{ environment }},node_class:{{ node_class }}}))"
  strategy "target-value" {
    target = 0.9
  }
}
Initial testing looks good: if a deployment is blocked because of resource exhaustion, our metric jumps and the autoscaler reacts appropriately.
Totally open to any and all feedback on this approach from the maintainers or other folks who have successfully solved this! And last, credit where credit is due: thank you @baxor! 🎉
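To illustrate why the two checks above react the way they do, here is a minimal Python sketch of the arithmetic. Each query divides a metric by itself, so it acts as an on/off signal: roughly 1 while anything is blocked and 0 otherwise. The sketch assumes that an empty or zero series falls back to 0 via `default_zero`, and that the `target-value` strategy scales the current count by `metric / target` and rounds up. It ignores the strategy's threshold, the policy's min/max bounds, and how the Autoscaler combines several checks. The function names are hypothetical.

```python
import math
from typing import Optional


def blocked_ratio(blocked: Optional[float]) -> float:
    """Mimic default_zero(default_zero(x) / default_zero(x)).

    Assumption: a missing series (None) or a zero value ends up as 0.0,
    while any positive value divides by itself and yields 1.0.
    """
    x = 0.0 if blocked is None else blocked
    if x == 0:
        return 0.0  # stands in for the outer default_zero on an empty result
    return x / x


def target_value_count(current: int, metric: float, target: float) -> int:
    """Simplified target-value calculation: scale the count by metric / target."""
    factor = metric / target
    return math.ceil(current * factor)


# With 10 nodes and something blocked, the metric is 1.0 against a target of 0.9,
# so the factor is about 1.11 and the desired count rises above 10.
desired = target_value_count(10, blocked_ratio(3500.0), 0.9)
```

Under these assumptions a blocked evaluation always pushes the desired count above the current one, whereas a metric of 0 yields a desired count of 0 that only makes sense once policy bounds and other checks are applied.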
Taken from Slack: "The metrics are already in Nomad. I think this issue is about adding them as an overhead or separate metric to the Nomad APM, so they can be used within calculations: 'running allocs + queued allocs = total CPU of task group.'"
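The "running allocs + queued allocs" idea can be sketched as a small calculation. This is only an illustration of the arithmetic, not how the Nomad APM plugin works today; the function name and the example numbers are hypothetical, and all three inputs are assumed to be in the same unit (for example MHz of CPU).

```python
def utilization_with_blocked(allocated: float, blocked: float, capacity: float) -> float:
    """Fraction of cluster capacity demanded by placed plus blocked work.

    allocated: resources held by allocations that are already placed
    blocked:   resources requested by blocked (unplaced) evaluations
    capacity:  total allocatable resources across the node pool
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (allocated + blocked) / capacity


# Hypothetical pool: 10000 MHz total, 8000 MHz placed, 3000 MHz blocked.
placed_only = utilization_with_blocked(8000, 0, 10000)      # looks healthy
with_blocked = utilization_with_blocked(8000, 3000, 10000)  # exceeds capacity
```

Counting only placed allocations reports 80% utilization, while adding the blocked demand shows the pool is oversubscribed and would need to grow.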
Hey everyone, this is an awesome project! However, in using it we found a small issue with the Nomad APM plugin.
Nomad now exposes the metrics below:
nomad.nomad.blocked_evals.cpu
nomad.nomad.blocked_evals.memory
These represent the amount of memory and CPU requested by blocked and unplaced evals. Currently the Nomad APM plugin just reads all the nodes and the allocs that are currently placed, but it should also take into consideration whether the cluster needs to scale up due to unplaced evals.
Right now we can use Prometheus to get around this, but we found that using the Nomad API directly was significantly more robust for cluster autoscaling. So ideally the Nomad APM would take this into consideration.
Thanks!