Autoscaler Nomad Plugin Doesn't Take Into Consideration CPU or Memory requested by blocked evals #584

rorylshanks · 2022-05-24T12:22:32Z

Hey everyone, this is an awesome project! However, in using it we found a small issue with the Nomad APM plugin.

Nomad now exposes the below metrics

nomad.nomad.blocked_evals.cpu
nomad.nomad.blocked_evals.memory

These represent the amount of memory and CPU requested by blocked and unplaced evals. Currently the Nomad APM plugin only reads the nodes and the allocs that are already placed, but it should also take into consideration whether the cluster needs to scale up due to unplaced evals.

Currently we can use Prometheus to get around this, but we found that using the Nomad API directly was significantly more robust for cluster autoscaling. So ideally the Nomad APM plugin would take these metrics into consideration.

Thanks!

lgfa29 · 2022-05-26T20:02:58Z

Hi @megalan247 👋

Those would definitely be useful metrics to have, but unfortunately there isn't a good way to read them from the Nomad API.

The Nomad APM plugin uses Nomad's REST API to retrieve values, not client metrics. We have to do it this way because metrics are available per agent: when you query /v1/metrics you only receive the values for the Nomad agent that you sent the request to. The Autoscaler would therefore need access to, and be able to scrape metrics from, all agents in your cluster, which may not be feasible.

To make things worse, these metrics are only emitted by the cluster leader, so the Autoscaler would need access to the Nomad server APIs, which some environments don't allow, especially if the Autoscaler is running in Nomad itself as an allocation.
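To illustrate the per-agent limitation, here is a minimal Python sketch that pulls the two blocked-eval gauges out of a single agent's /v1/metrics JSON response. The payloads are made-up samples (the values and the `nomad.client.allocated.cpu` entry are only placeholders), and `blocked_eval_gauges` is a hypothetical helper, not part of the Autoscaler. It assumes the response carries a top-level `Gauges` list whose entries have `Name` and `Value` fields.

```python
import json

# Gauge names taken from the issue description.
BLOCKED_EVAL_GAUGES = (
    "nomad.nomad.blocked_evals.cpu",
    "nomad.nomad.blocked_evals.memory",
)


def blocked_eval_gauges(metrics_payload):
    """Sum the blocked-eval gauges found in ONE agent's /v1/metrics body.

    Gauges that share a name but differ in labels are added together.
    An agent that does not emit these gauges yields an empty dict.
    """
    data = json.loads(metrics_payload)
    found = {}
    for gauge in data.get("Gauges") or []:
        name = gauge.get("Name")
        if name in BLOCKED_EVAL_GAUGES:
            found[name] = found.get(name, 0) + gauge["Value"]
    return found


# Hypothetical sample responses: one from the leader, one from another agent.
leader_payload = json.dumps({"Gauges": [
    {"Name": "nomad.nomad.blocked_evals.cpu", "Value": 2000, "Labels": {}},
    {"Name": "nomad.nomad.blocked_evals.memory", "Value": 4096, "Labels": {}},
    {"Name": "nomad.client.allocated.cpu", "Value": 500, "Labels": {}},
]})
other_agent_payload = json.dumps({"Gauges": [
    {"Name": "nomad.client.allocated.cpu", "Value": 500, "Labels": {}},
]})

print(blocked_eval_gauges(leader_payload))
print(blocked_eval_gauges(other_agent_payload))
```

Only the leader's response contains the gauges here, so a caller that happens to query any other agent sees nothing to scale on.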

From Nomad's perspective, the problem is that these metrics are not persisted in the state store, they are only available as in-memory metrics, so it would not be possible to query them like the other information.

So the only solution for now is to use an APM that is able to scrape/receive and aggregate metrics from all agents in your cluster, otherwise you will only have, at best, partial data.

I will keep this open in case things change in the future, but unfortunately I think it will take a while for us to be able to do it.

Cbeck527 · 2022-07-08T15:51:03Z

For anyone stumbling upon this looking for more info, my team managed to come up with something that we think works for us, inspired by the config posted in a seemingly unrelated issue.

Prerequisite: you have all of your agents configured to send metrics. We use DataDog, so our telemetry {} block looks something like this:

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  datadog_address            = "localhost:8125"
  disable_hostname           = true
  collection_interval        = "10s"
}

For a given AWS ASG that we want to scale we have two checks of the metrics @megalan247 mentioned:
(note: the {{ }} are variables populated by our config management)

    check "scale_up_on_exhausted_cpu" {
      source = "datadog"
      query_window = "5m"
      query = "default_zero(default_zero(sum:nomad.nomad.blocked_evals.cpu{environment:{{ environment }},node_class:{{ node_class }}})/default_zero(sum:nomad.nomad.blocked_evals.cpu{environment:{{ environment }},node_class:{{ node_class }}}))"

      strategy "target-value" {
        target = 0.9
      }
    }

    check "scale_up_on_exhausted_memory" {
      source = "datadog"
      query_window = "5m"
      query = "default_zero(default_zero(sum:nomad.nomad.blocked_evals.memory{environment:{{ environment }},node_class:{{ node_class }}})/default_zero(sum:nomad.nomad.blocked_evals.memory{environment:{{ environment }},node_class:{{ node_class }}}))"

      strategy "target-value" {
        target = 0.9
      }
    }

Initial testing looks good: if a deployment is blocked because of resource exhaustion, our metric jumps and the autoscaler reacts appropriately:

2022-07-07T19:23:52.099Z [INFO]  policy_eval.worker: scaling target: id=e12f7c10-4292-ace8-6872-833b95344800 policy_id=cf12159c-94b3-6156-6b93-19e1f2e7d87f queue=cluster target=aws-asg from=9 to=10 reason="scaling up because factor is 1.111111" meta=map[nomad_policy_id:cf12159c-94b3-6156-6b93-19e1f2e7d87f]
2022-07-07T19:24:12.933Z [INFO]  internal_plugin.aws-asg: successfully performed and verified scaling out: action=scale_out asg_name=workers desired_count=10
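The numbers in that log line can be reproduced with a short Python sketch. It assumes the target-value strategy computes a factor as metric divided by target and rounds the scaled count up, which is consistent with the log but is not confirmed in this thread. `target_value_count` is a hypothetical helper, and the metric value of 1.0 is an assumption based on the query dividing a non-zero blocked-evals sum by itself.

```python
import math


def target_value_count(current_count, metric, target):
    """Return (factor, new_count) under the ASSUMED target-value arithmetic.

    factor    = metric / target
    new_count = ceil(current_count * factor)
    """
    factor = metric / target
    new_count = math.ceil(current_count * factor)
    return factor, new_count


# 9 nodes, metric assumed to be 1.0 while evals are blocked, target = 0.9.
factor, new_count = target_value_count(current_count=9, metric=1.0, target=0.9)
print(f"factor={factor:.6f} from=9 to={new_count}")
```

Because the ratio is either 0 or 1, each evaluation under this assumption grows the group by roughly 11% while anything is blocked, regardless of how much CPU or memory the blocked evals actually request.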

Totally open to any and all feedback on this approach from the maintainers or other folks who have successfully solved for this! And last, credit where credit is due— thank you @baxor! 🎉

Juanadelacuesta · 2024-06-11T07:07:17Z

Taken from Slack: "The metrics are already in Nomad. I think this issue is about adding them as overhead or as a separate metric to the Nomad APM, so they can be used within calculations: running allocs + queued allocs = total CPU of task group."
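As a rough illustration of that calculation, the Python sketch below adds the blocked-eval demand to the already allocated CPU before computing utilisation. The function name, parameter names, and all numbers are hypothetical. It is not how the Nomad APM plugin currently works, only one way the proposal could be read.

```python
def cpu_utilisation_percent(allocated_mhz, blocked_mhz, allocatable_mhz):
    """Utilisation including demand from blocked evals (hypothetical formula).

    allocated_mhz:   CPU already reserved by running allocs
    blocked_mhz:     CPU requested by blocked evals
    allocatable_mhz: total CPU the node pool can offer
    """
    if allocatable_mhz <= 0:
        raise ValueError("allocatable_mhz must be positive")
    return 100.0 * (allocated_mhz + blocked_mhz) / allocatable_mhz


# Made-up numbers: the placed allocs alone look fine at 80%, but counting the
# blocked evals shows the pool is actually fully committed.
without_blocked = cpu_utilisation_percent(8000, 0, 10000)
with_blocked = cpu_utilisation_percent(8000, 2000, 10000)
print(without_blocked, with_blocked)
```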

lgfa29 added stage/thinking type/enhancement theme/apm/nomad labels May 26, 2022

protochron mentioned this issue Jul 14, 2022

When there are blocked evaluations, nomad.nomad.blocked_evals.[cpu,memory] are always 0 hashicorp/nomad#13759

Closed

Juanadelacuesta assigned Juanadelacuesta and unassigned Juanadelacuesta May 15, 2024

Juanadelacuesta self-assigned this Jun 3, 2024

Juanadelacuesta removed their assignment Jun 18, 2024
