Extra telemetry on policy evaluation failure #661

the-nando · 2023-07-09T08:22:30Z

We recently ran into two separate issues with the Nomad Autoscaler: in one it failed to describe AWS Auto Scaling groups because of an expired AWS token, and in the other it failed to evaluate a scaling policy because it could not reach the APM (Prometheus).

{"@level":"warn","@message":"failed to get target status","@module":"policy_manager.policy_handler","@timestamp":"2023-07-06T16:21:26.029652Z","error":"failed to describe AWS Autoscaling Group: operation error Auto Scaling: DescribeAutoScalingGroups, https response error StatusCode: 403, RequestID: c674bc86-1234-4fb1-5678-b264741176bc, api error ExpiredToken: The security token included in the request is expired","policy_id":"613aeb80-xs23-8f4e-1234-ef2ca2748d8a"}

It would be great if the autoscaler exported a couple of extra Prometheus metrics that we could monitor to detect simple failures like these.
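To make the request concrete, here is a minimal sketch of the kind of failure counters I have in mind. It uses only the Go standard library and is not the autoscaler's actual telemetry code: the type `FailureCounters`, the metric name `autoscaler_policy_failures_total`, and the `kind` values `target_status` and `apm_query` are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// FailureCounters is a hypothetical, stdlib-only stand-in for the requested
// metrics. It is NOT the autoscaler's real telemetry API.
type FailureCounters struct {
	mu     sync.Mutex
	counts map[string]uint64
}

// NewFailureCounters returns an empty set of counters.
func NewFailureCounters() *FailureCounters {
	return &FailureCounters{counts: make(map[string]uint64)}
}

// Inc records one failure of the given kind, for example "target_status"
// (could not describe the scaling target) or "apm_query" (could not reach
// the APM). Both kind names are assumptions made for this sketch.
func (f *FailureCounters) Inc(kind string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.counts[kind]++
}

// Get returns the current count for a kind (zero if never incremented).
func (f *FailureCounters) Get(kind string) uint64 {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.counts[kind]
}

// Render returns the counters in the Prometheus text exposition format,
// sorted by kind so the output is deterministic. Kinds are assumed to be
// simple ASCII identifiers, so %q yields a valid label value.
func (f *FailureCounters) Render() string {
	f.mu.Lock()
	defer f.mu.Unlock()

	kinds := make([]string, 0, len(f.counts))
	for k := range f.counts {
		kinds = append(kinds, k)
	}
	sort.Strings(kinds)

	var b strings.Builder
	b.WriteString("# TYPE autoscaler_policy_failures_total counter\n")
	for _, k := range kinds {
		fmt.Fprintf(&b, "autoscaler_policy_failures_total{kind=%q} %d\n", k, f.counts[k])
	}
	return b.String()
}
```

With counters like these, the policy handler would call something like `Inc("target_status")` wherever it currently logs "failed to get target status", and an alert on the rate of the counter would have caught the expired token without anyone reading the logs.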


the-nando linked a pull request Jul 9, 2023 that will close this issue

Add extra telemetry to monitor failures #660

Open

lgfa29 added the stage/accepted, type/enhancement, theme/agent, and theme/metrics labels on Jul 15, 2023
