Availability Metric #1068

nitinatgh · 2024-09-11T11:09:18Z

Hi Team,

I have a question/idea about availability metrics for NTH.

Currently there are only two metrics available, and each has different actions associated with it:

actions - cordon-and-drain, post-drain, pre-drain, each with node_status: success/error
events_error - event_error_where:sqs_monitor

For our purposes, we'd like to focus only on the tasks NTH itself has to execute to handle a spot interruption: everything from receiving the notice to notifying the Kubernetes API. We are not concerned with anything after that. For example, we have seen errors for cordon-and-drain because the node can't be drained in time for various reasons, such as a PDB or terminationGracePeriodSeconds >= 120s. We believe handling that is down to Kubernetes, not NTH.

Is it possible to add an action that only covers the work up to that point? That way we could have a proper NTH availability metric for our clusters and not get false alerts as a result of the above.
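To illustrate the idea, here is a rough sketch of how such an availability figure could be computed on the consumer side from scraped counter values. It is only an illustration under stated assumptions: the sample shape, the label keys `action` and `node_status`, and the function name `nth_availability` are hypothetical and only mirror the metric names mentioned above. It is not how NTH reports anything today.

```python
from typing import Iterable, Mapping, Tuple

# Each sample is (metric_name, labels, count). This shape is hypothetical;
# it only mirrors the metric and label names mentioned in this post.
Sample = Tuple[str, Mapping[str, str], float]

# Actions whose outcome we attribute to Kubernetes (PDBs, long grace
# periods) rather than to NTH itself. This choice is an assumption.
EXCLUDED_ACTIONS = {"cordon-and-drain"}


def nth_availability(samples: Iterable[Sample]) -> float:
    """Return successes / (successes + errors), skipping excluded actions."""
    ok = 0.0
    failed = 0.0
    for name, labels, count in samples:
        if name == "actions":
            # Skip both successes and errors of excluded actions.
            if labels.get("action") in EXCLUDED_ACTIONS:
                continue
            if labels.get("node_status") == "success":
                ok += count
            elif labels.get("node_status") == "error":
                failed += count
        elif name == "events_error":
            # Errors raised while monitoring events count against NTH.
            failed += count
    total = ok + failed
    # With no observations, report full availability instead of dividing by zero.
    return 1.0 if total == 0 else ok / total
```

With this approach, a cordon-and-drain error caused by a PDB would not lower the figure, while an events_error from sqs_monitor still would.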

Please let me know your thoughts around this.

Thanks!

LikithaVemulapalli · 2024-09-13T15:09:41Z

Maintainer

Hello @nitinatgh, thanks for opening this discussion. The main focus of Node Termination Handler is to cordon and drain the node before it is terminated following a spot interruption notice. We do not have an immediate action item to support the feature you suggested. NTH has already become complex because of the many configs and use cases added in response to customer requests, and we do not want to increase that complexity by adding new enhancements. We will update this thread if we plan to pick up this enhancement. Thanks again for opening the discussion.

3 replies

nitinatgh Sep 16, 2024
Author

Hi @LikithaVemulapalli ,

Thanks for the reply.

Understood. Alternatively, would it be possible to separate the two actions for the purpose of the metric, i.e. record cordon and then drain, since they are two separate tasks issued to the Kubernetes API?
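As a rough illustration of what I mean, the sketch below counts cordon and drain under separate action labels. Everything in it is hypothetical: `action_counts` stands in for a real metrics counter, and `cordon` and `drain` are placeholder callables, not NTH or Kubernetes client functions.

```python
from collections import Counter
from typing import Callable

# Stand-in for a metrics counter keyed by (action, status). Hypothetical.
action_counts: Counter = Counter()


def record(action: str, run: Callable[[], None]) -> bool:
    """Run one step and count it as success or error under its own action."""
    try:
        run()
    except Exception:
        action_counts[(action, "error")] += 1
        return False
    action_counts[(action, "success")] += 1
    return True


def cordon_then_drain(cordon: Callable[[], None], drain: Callable[[], None]) -> bool:
    """Record cordon and drain separately; skip drain if cordon fails."""
    if not record("cordon", cordon):
        return False
    return record("drain", drain)
```

With the two steps counted separately, an alert could be based on the cordon outcome only, while drain errors stay visible without triggering it.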

Thanks

Nitin

LikithaVemulapalli Sep 16, 2024
Maintainer

Hello @nitinatgh, this can be done on your end: if you want to change the existing code, you can fork the repo and maintain your own version based on your workloads. We do not want to modify how NTH's existing actions behave, as that might impact existing customers who have been using NTH for a long time. We are only focusing on adding new features and fixing critical bugs as per customer requests. Thanks!

nitinatgh Sep 17, 2024
Author

Ok sure, we'll look into it.
Otherwise, if there is scope for the original ask, please keep us informed.
Thanks for your time and help on this.
