The Addon-operator exposes a Prometheus target at the /metrics endpoint. The default port is 9650.
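A minimal sketch of how the endpoint could be queried by hand, assuming the operator's port is reachable on localhost (for example through a port-forward). The names METRICS_HOST, METRICS_PORT, metrics_url and fetch_metrics are hypothetical and exist only for illustration; the /metrics path and port 9650 are the only values taken from the text above.

```shell
#!/usr/bin/env bash
# Assumption: the operator is reachable on localhost; adjust for your setup.
METRICS_HOST="${METRICS_HOST:-localhost}"
METRICS_PORT="${METRICS_PORT:-9650}"   # default port of the Addon-operator

# Build the scrape URL from host and port.
metrics_url() {
  printf 'http://%s:%s/metrics' "$METRICS_HOST" "$METRICS_PORT"
}

# Fetch the raw metrics text; fails with a non-zero exit code on HTTP errors.
fetch_metrics() {
  curl --silent --show-error --fail "$(metrics_url)"
}

echo "Scrape URL: $(metrics_url)"
# Run `fetch_metrics` once the endpoint is reachable.
```

The fetch is wrapped in a function and not called, so the script can be sourced without network access.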
- addon_operator_binding_count{module="", hook=""}: a gauge with the number of bindings for every hook. Global hooks have an empty "module" label.
- addon_operator_config_values_errors_total{}: a counter of ConfigMap validation errors after kubectl edit. See validation.
- addon_operator_global_hook_run_seconds{hook="", binding="", activation="", queue=""}: a histogram of global hook execution times. The "hook" label is the name of the hook, "binding" is the binding name from the configuration, "queue" is the name of the queue where the hook is queued, and "activation" is the event that triggers hook execution.
- addon_operator_global_hook_run_errors_total{hook="", binding="", activation="", queue=""}: a counter of hook execution errors. It only tracks errors of hooks with allowFailure disabled (i.e. the key is omitted in the configuration or allowFailure: false is set). The "hook" label holds the name of the failed hook.
- addon_operator_global_hook_run_allowed_errors_total{hook="", binding="", activation="", queue=""}: a counter of hook execution errors. It only tracks errors of hooks that are allowed to exit with an error (allowFailure: true is set in the configuration). The "hook" label holds the name of the failed hook.
- addon_operator_global_hook_run_success_total{hook="", binding="", activation="", queue=""}: a counter of successful hook executions. The "hook" label holds the name of the hook that succeeded.
- addon_operator_global_hook_run_sys_cpu_seconds{hook="", binding="", activation="", queue=""}: a histogram of system CPU seconds used by global hooks.
- addon_operator_global_hook_run_user_cpu_seconds{hook="", binding="", activation="", queue=""}: a histogram of user CPU seconds used by global hooks.
- addon_operator_global_hook_run_max_rss_bytes{hook="", binding="", activation="", queue=""}: a gauge with the maximum RSS usage of a global hook, in bytes.
- addon_operator_module_hook_run_seconds{module="", hook="", binding="", activation="", queue=""}: a histogram of module hook execution times. The "module" label is the name of the module, "hook" is the name of the hook, "binding" is the binding name from the configuration, "queue" is the name of the queue where the hook is queued, and "activation" is the event that triggers hook execution.
- addon_operator_module_hook_run_errors_total{module="", hook="", binding="", activation="", queue=""}: a counter of hook execution errors. It only tracks errors of hooks with allowFailure disabled (i.e. the key is omitted in the configuration or allowFailure: false is set). The "hook" label holds the name of the failed hook.
- addon_operator_module_hook_run_allowed_errors_total{module="", hook="", binding="", activation="", queue=""}: a counter of hook execution errors. It only tracks errors of hooks that are allowed to exit with an error (allowFailure: true is set in the configuration). The "hook" label holds the name of the failed hook.
- addon_operator_module_hook_run_success_total{module="", hook="", binding="", activation="", queue=""}: a counter of successful hook executions. The "hook" label holds the name of the hook that succeeded.
- addon_operator_module_hook_run_sys_cpu_seconds{module="", hook="", binding="", activation="", queue=""}: a histogram of system CPU seconds used by module hooks.
- addon_operator_module_hook_run_user_cpu_seconds{module="", hook="", binding="", activation="", queue=""}: a histogram of user CPU seconds used by module hooks.
- addon_operator_module_hook_run_max_rss_bytes{module="", hook="", binding="", activation="", queue=""}: a gauge with the maximum RSS usage of a module hook, in bytes.
- addon_operator_module_discover_errors_total: a counter of errors during the module discovery process. It increases in these cases:
  - an 'enabled' script exits with an error
  - a module hook returns an invalid configuration
  - a call to the Kubernetes API ends with an error (for example, when retrieving Helm releases).
- addon_operator_module_run_errors_total{module=x}: a counter of errors on module start-up.
- addon_operator_module_delete_errors_total{module=x}: a counter of errors on module deletion.
- addon_operator_module_run_seconds{module=""}: a histogram of module execution timings.
- addon_operator_module_helm_seconds{module="", activation=""}: a histogram of the module's helm upgrade timings.
- addon_operator_helm_operation_seconds{module="", activation="", operation=""}: a histogram of timings for different helm operations.
- addon_operator_convergence_seconds{activation=onStartup}: a counter of seconds spent executing "reload all modules" processes. The "activation=OnStartup" label value can be used to retrieve information about the first "reload all modules" process, which runs when the operator starts.
- addon_operator_convergence_total{activation=onStartup}: a counter of "reload all modules" processes.
- addon_operator_tasks_queue_length{queue=""}: a gauge showing the length of the working queue. This metric can be used to warn about stuck hooks. The "queue" label holds the queue name.
- addon_operator_task_wait_in_queue_seconds_total{module="", hook="", binding="", queue=""}: a counter of the seconds a task has spent waiting in the queue.
- addon_operator_live_ticks: a counter that increases every 10 seconds. This metric can be used for alerting about an unhealthy Addon-operator. It has no labels.
- addon_operator_kube_jq_filter_duration_seconds{module="", hook="", binding="", queue="", kind=""}: a histogram of jq filter timings.
- addon_operator_kube_event_duration_seconds{module="", hook="", binding="", queue="", kind=""}: a histogram of kube event handling timings.
- addon_operator_kube_snapshot_objects{module="", hook="", binding="", queue=""}: a gauge with the number of cached objects (the snapshot) for a particular binding. The "module" label is empty for global hooks.
- addon_operator_kube_snapshot_bytes{module="", hook="", binding="", queue=""}: a gauge with the size in bytes of the cached objects for a particular binding. Each cached object contains a Kubernetes object and/or the result of jqFilter, depending on the binding configuration. The size is the sum of the length of the Kubernetes object in JSON format and the length of the jqFilter result in JSON format.
- addon_operator_kubernetes_client_request_result_total: a counter of requests made by the kubernetes/client-go library.
- addon_operator_kubernetes_client_request_latency_seconds: a histogram of the latency of requests made by the kubernetes/client-go library.
- addon_operator_tasks_queue_action_duration_seconds{queue_name="", queue_action=""}: a histogram with measurements of low-level queue operations. Use QUEUE_ACTIONS_METRICS="no" to disable this metric.
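As an illustration of how the queue length gauge could be inspected when looking for stuck hooks, the sketch below filters Prometheus text-format output for addon_operator_tasks_queue_length and prints each queue name with its current value. The function name queue_lengths and the sample lines (including the queue names and values) are made up for this example; in practice the input would come from the /metrics endpoint.

```shell
#!/usr/bin/env bash
# Print "<queue> <value>" for every addon_operator_tasks_queue_length sample on stdin.
queue_lengths() {
  awk '
    /^addon_operator_tasks_queue_length[{]/ {
      # Pull the value of the queue label out of the label set.
      if (match($0, /queue="[^"]*"/)) {
        q = substr($0, RSTART + 7, RLENGTH - 8)
        print q, $NF   # the sample value is the last field
      }
    }'
}

# Made-up sample in the Prometheus text format, for illustration only.
sample='# TYPE addon_operator_tasks_queue_length gauge
addon_operator_tasks_queue_length{queue="main"} 3
addon_operator_tasks_queue_length{queue="parallel"} 0
addon_operator_live_ticks 120'

printf '%s\n' "$sample" | queue_lengths
```

A queue whose value keeps growing between scrapes is a hint that a hook in that queue may be stuck.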
Hooks can export metrics by writing a set of operations in JSON format into the $METRICS_PATH file.
Operation to increase a counter:
{"name":"metric_name","add":1,"labels":{"label1":"value1"}}
Operation to set a value for a gauge:
{"name":"metric_name","set":33,"labels":{"label1":"value1"}}
Labels are not required, but Shell-operator adds hook and module labels.
Several metrics can be exported at once. For example, this script creates two metrics:
echo '{"name":"hook_metric_count","add":1,"labels":{"label1":"value1"}}' >> $METRICS_PATH
echo '{"name":"hook_metrics_items","add":1,"labels":{"label1":"value1"}}' >> $METRICS_PATH
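The repeated echo lines can be wrapped in a small helper. The following is only a sketch: emit_metric is a hypothetical function, not part of the operator, and the fallback to a temporary file is an assumption that allows a dry run outside the operator, where METRICS_PATH is not set.

```shell
#!/usr/bin/env bash
# Assumption: fall back to a temporary file when run outside the operator.
METRICS_PATH="${METRICS_PATH:-$(mktemp)}"

# emit_metric NAME OPERATION VALUE [LABELS_JSON]
# OPERATION is "add" (counter) or "set" (gauge); LABELS_JSON is a JSON object.
emit_metric() {
  local name="$1" op="$2" value="$3" labels="${4:-}"
  if [ -n "$labels" ]; then
    printf '{"name":"%s","%s":%s,"labels":%s}\n' "$name" "$op" "$value" "$labels"
  else
    # Labels are optional, so the key is left out entirely.
    printf '{"name":"%s","%s":%s}\n' "$name" "$op" "$value"
  fi >> "$METRICS_PATH"
}

emit_metric hook_metric_count add 1 '{"label1":"value1"}'
emit_metric hook_metrics_items set 33 '{"label1":"value1"}'
emit_metric hook_runs add 1   # hypothetical metric without labels
```

Each call appends one JSON operation per line, matching the format of the examples above. The helper does not escape its arguments, so names and label values containing quotes would need extra care.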