Skip to content

Latest commit

 

History

History
83 lines (55 loc) · 7.72 KB

METRICS.md

File metadata and controls

83 lines (55 loc) · 7.72 KB

Addon-operator metrics

The Addon-operator implements Prometheus target at /metrics endpoint. The default port is 9650.

  • addon_operator_binding_count{module="", hook=""} — a gauge with bindings count for every hooks. Global hooks has empty "module" label.

  • addon_operator_config_values_errors_total{} — a counter of ConfigMap validation errors after kubectl edit. See validation.

  • addon_operator_global_hook_run_seconds{hook="", binding="", activation="", queue=""} — a histogram with hook execution times. "hook" label is a name of the hook, "binding" is a binding name from configuration, "queue" is a queue name where hook is queued and "activation" is an event that triggers hook execution.

  • addon_operator_global_hook_run_errors_total{hook="", binding="", activation="", queue=""} – this is the counter of hooks’ execution errors. It only tracks errors of hooks with the disabled allowFailure (i.e. respective key is omitted in the configuration or the allowFailure: false parameter is set). This metric has a "hook" label with the name of a failed hook.

  • addon_operator_global_hook_run_allowed_errors_total{hook="", binding="", activation="", queue=""} – this is the counter of hooks’ execution errors. It only tracks errors of hooks that are allowed to exit with an error (the parameter allowFailure: true is set in the configuration). The metric has a "hook" label with the name of a failed hook.

  • addon_operator_global_hook_run_success_total{hook="", binding="", activation="", queue=""} – this is the counter of hooks’ success execution. The metric has a "hook" label with the name of a succeeded hook.

  • addon_operator_global_hook_run_sys_cpu_seconds{hook="", binding="", activation="", queue=""} — a histogram with global hook system cpu seconds.

  • addon_operator_global_hook_run_user_cpu_seconds{hook="", binding="", activation="", queue=""} — a histogram with global hook user cpu seconds.

  • addon_operator_global_hook_run_max_rss_bytes{hook="", binding="", activation="", queue=""} — a gauge with global hook max rss usage in bytes.

  • addon_operator_module_hook_run_seconds{module="", hook="", binding="", activation="", queue=""} — a histogram with module hook execution times. "module" label is a name of the module, "hook" label is a name of the hook, "binding" is a binding name from configuration, "queue" is a queue name where hook is queued and "activation" is an event that triggers hook execution.

  • addon_operator_module_hook_run_errors_total{module="", hook="", binding="", activation="", queue=""} – this is the counter of hooks’ execution errors. It only tracks errors of hooks with the disabled allowFailure (i.e. respective key is omitted in the configuration or the allowFailure: false parameter is set). This metric has a "hook" label with the name of a failed hook.

  • addon_operator_module_hook_run_allowed_errors_total{module="", hook="", binding="", activation="", queue=""} – this is the counter of hooks’ execution errors. It only tracks errors of hooks that are allowed to exit with an error (the parameter allowFailure: true is set in the configuration). The metric has a "hook" label with the name of a failed hook.

  • addon_operator_module_hook_run_success_total{module="", hook="", binding="", activation="", queue=""} – this is the counter of hooks’ success execution. The metric has a "hook" label with the name of a succeeded hook.

  • addon_operator_module_hook_run_sys_cpu_seconds{module="", hook="", binding="", activation="", queue=""} — a histogram with module hook system cpu seconds.

  • addon_operator_module_hook_run_user_cpu_seconds{module="", hook="", binding="", activation="", queue=""} — a histogram with module hook user cpu seconds.

  • addon_operator_module_hook_run_max_rss_bytes{module="", hook="", binding="", activation="", queue=""} — a gauge with module hook max rss usage in bytes.

  • addon_operator_module_discover_errors_total – a counter of errors during the modules discover process. It increases in these cases:

    • an 'enabled' script is executed with an error
    • a module hook return an invalid configuration
    • a call to the Kubernetes API ends with an error (for example, retrieving Helm releases).
  • addon_operator_module_run_errors_total{module=x} – counter of errors on module start-up.

  • addon_operator_module_delete_errors_total{module=x} – counter of errors on module deletion.

  • addon_operator_module_run_seconds{module=""} — a histogram with module execution timings.

  • addon_operator_module_helm_seconds{module="", activation=""} — a histogram of module’s helm upgrade timings.

  • addon_operator_helm_operation_seconds{module="", activation="", operation=""} — a histogram of different helm operations timings.

  • addon_operator_convergence_seconds{activation=onStartup} — a counter of seconds spent to execute "reload all modules" processes. "activation=OnStartup" label value can be used to retrieve information about first "reload all modules" when operator starts.

  • addon_operator_convergence_total{activation=onStartup} — a counter of "reload all modules" processes.

  • addon_operator_tasks_queue_length{queue=""} – a gauge showing the length of the working queue. This metric can be used to warn about stuck hooks. It has the "queue" label with the queue name.

  • addon_operator_task_wait_in_queue_seconds_total{module="", hook="", binding="", queue=""} — a counter with seconds that the task is elapsed in the queue.

  • addon_operator_live_ticks – a counter that increases every 10 seconds. This metric can be used for alerting about an unhealthy Addon-operator. It has no labels.

  • addon_operator_kube_jq_filter_duration_seconds{module="", hook="", binding="", queue="", kind=""} — a histogram with jq filter timings.

  • addon_operator_kube_event_duration_seconds{module="", hook="", binding="", queue="", kind=""} — a histogram with kube event handling timings.

  • addon_operator_kube_snapshot_objects{module="", hook="", binding="", queue=""} — a gauge with count of cached objects (the snapshot) for particular binding. "module" label is empty for global hook.

  • addon_operator_kube_snapshot_bytes{module="", hook="", binding="", queue=""} — a gauge with size in bytes of cached objects for particular binding. Each cached object contains a Kubernetes object and/or result of jqFilter depending on the binding configuration. The size is a sum of the length of Kubernetes object in JSON format and the length of jqFilter‘s result in JSON format.

  • addon_operator_kubernetes_client_request_result_total — a counter of requests made by kubernetes/client-go library.

  • addon_operator_kubernetes_client_request_latency_seconds — a histogram with latency of requests made by kubernetes/client-go library.

  • addon_operator_tasks_queue_action_duration_seconds{queue_name="", queue_action=""} — a histogram with measurements of low level queue operations. Use QUEUE_ACTIONS_METRICS="no" to disable this metric.

Custom metrics

Hooks can export metrics by writing a set of operation on JSON format into $METRICS_PATH file.

Operation to increase a counter:

{"name":"metric_name","add":1,"labels":{"label1":"value1"}}

Operation to set a value for a gauge:

{"name":"metric_name","set":33,"labels":{"label1":"value1"}}

Labels are not required, but Shell-operator adds hook and module labels.

Several metrics can be expored at once. For example, this script will create 2 metrics:

echo '{"name":"hook_metric_count","add":1,"labels":{"label1":"value1"}}' >> $METRICS_PATH
echo '{"name":"hook_metrics_items","add":1,"labels":{"label1":"value1"}}' >> $METRICS_PATH