rhobs · simonpasquier · Oct 8, 2024 · Oct 1, 2024 · Oct 4, 2024
diff --git a/.mdox.yaml b/.mdox.yaml
@@ -38,6 +38,12 @@ transformations:
             weight: 2
             pre: <i class='fas fa-users'></i>
 
+  - glob: "Products/OpenshiftMonitoring/instrumentation.md"
+    frontMatter:
+      template: |
+        title: "{{ .Origin.FirstHeader }}"
+        lastmod: "{{ .Origin.LastMod }}"
+        weight: 5
   - glob: "Products/OpenshiftMonitoring/collecting_metrics.md"
     frontMatter:
       template: |

diff --git a/content/Products/OpenshiftMonitoring/instrumentation.md b/content/Products/OpenshiftMonitoring/instrumentation.md
@@ -0,0 +1,82 @@
+# Instrumentation guidelines
+
+This document details good practices to adopt when you instrument your application for Prometheus. It is not meant to replace the [upstream documentation](https://prometheus.io/docs/practices/instrumentation/) but to serve as an introduction focused on the OpenShift use case.
+
+## Target audience
+
+This document is intended for OpenShift developers who want to instrument their operators and operands for Prometheus.
+
+## Getting started
+
+To instrument software written in Go, see the official [Go client library](https://pkg.go.dev/github.com/prometheus/client_golang). For other languages, refer to the [curated list](https://prometheus.io/docs/instrumenting/clientlibs/#client-libraries) of client libraries.
+
+Prometheus stores all data as time series, which are streams of timestamped values (samples) identified by a metric name and a set of unique labels (a.k.a. dimensions or key/value pairs). Its data model is described in detail on this [page](https://prometheus.io/docs/concepts/data_model/). In the text exposition format, time series are represented like this:
+
+```
+# HELP http_requests_total Total number of HTTP requests by method and handler.
+# TYPE http_requests_total counter
+http_requests_total{method="GET", handler="/messages"}  500
+http_requests_total{method="POST", handler="/messages"} 10
+```
+
+Prometheus supports four [metric types](https://prometheus.io/docs/concepts/metric_types/):
+* Gauge, which represents a single numerical value that can arbitrarily go up and down.
+* Counter, which represents a cumulative, monotonically increasing value that can only increase or be reset to zero on restart. When querying a counter metric, you usually apply the `rate()` or `increase()` function.
+* Histogram, which samples observations (usually things like request durations or response sizes) and counts them in configurable buckets.
+* Summary, which also samples observations but reports configurable quantiles over a (fixed) sliding time window. In practice, summaries are rarely used.
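+
+As a minimal sketch of the histogram type, the program below records a few request durations and reads back the number of observed samples. The metric name `http_request_duration_seconds`, the helper `recordDurations` and the observed values are hypothetical and chosen only for illustration.
+
+```golang
+package main
+
+import (
+	"fmt"
+
+	"github.com/prometheus/client_golang/prometheus"
+)
+
+// recordDurations observes each duration (in seconds) in a fresh histogram
+// and returns the sample count that the histogram reports when gathered.
+// The metric name is hypothetical.
+func recordDurations(durations []float64) uint64 {
+	requestDuration := prometheus.NewHistogram(prometheus.HistogramOpts{
+		Name:    "http_request_duration_seconds",
+		Help:    "Duration of HTTP requests in seconds.",
+		Buckets: prometheus.DefBuckets, // default buckets of the client library
+	})
+
+	reg := prometheus.NewRegistry()
+	reg.MustRegister(requestDuration)
+
+	// Each observation updates the matching buckets, the count and the sum.
+	for _, d := range durations {
+		requestDuration.Observe(d)
+	}
+
+	families, err := reg.Gather()
+	if err != nil {
+		panic(err)
+	}
+	return families[0].GetMetric()[0].GetHistogram().GetSampleCount()
+}
+
+func main() {
+	fmt.Println(recordDurations([]float64{0.02, 0.3, 1.7})) // prints 3
+}
+```
+
+In real code, the histogram would typically be created once and observed from the request handler rather than built per call as in this sketch.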
+
+Reviewing the metrics added for an operation should be part of the code review process, like any other requirement for production-ready code.
+
+To learn more about when to use which metric type, how to name metrics and how to choose labels, read the following documentation:
+* [Prometheus naming recommendations](https://prometheus.io/docs/practices/naming/)
+* [Prometheus instrumentation](https://prometheus.io/docs/practices/instrumentation/)
+* [Kubernetes metric instrumentation guide](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/metric-instrumentation.md)
+* [Instrumenting a Go application for Prometheus](https://prometheus.io/docs/guides/go-application/)
+
+## Example
+
+Here is a fictional Go code example instrumented with a Gauge metric and a multi-dimensional Counter metric:
+
+```golang
+package main
+
+import "github.com/prometheus/client_golang/prometheus"
+
+func main() {
+	// Gauge without labels: a single time series.
+	cpuTemp := prometheus.NewGauge(prometheus.GaugeOpts{
+		Name: "cpu_temperature_celsius",
+		Help: "Current temperature of the CPU.",
+	})
+
+	// Counter with one label: one time series per device.
+	hdFailures := prometheus.NewCounterVec(
+		prometheus.CounterOpts{
+			Name: "hd_errors_total",
+			Help: "Number of hard-disk errors.",
+		},
+		[]string{"device"},
+	)
+
+	// Metrics must be registered before they can be exposed.
+	reg := prometheus.NewRegistry()
+	reg.MustRegister(cpuTemp, hdFailures)
+
+	cpuTemp.Set(55.2)
+
+	// Record 1 failure for the /dev/sda device.
+	hdFailures.With(prometheus.Labels{"device": "/dev/sda"}).Inc()
+	// Record 3 failures for the /dev/sdb device.
+	hdFailures.With(prometheus.Labels{"device": "/dev/sdb"}).Inc()
+	hdFailures.With(prometheus.Labels{"device": "/dev/sdb"}).Inc()
+	hdFailures.With(prometheus.Labels{"device": "/dev/sdb"}).Inc()
+}
+```
+
+## Labels
+
+Deciding when to add a label to a metric is a [difficult choice](https://prometheus.io/docs/practices/instrumentation/#use-labels). The general rule is: the fewer labels, the better. Every unique combination of label names and values creates a new time series, and Prometheus memory usage is mostly driven by the number of time series loaded into RAM during ingestion and querying. A good rule of thumb is to have fewer than 10 time series per metric name and target. A common mistake is to store dynamic information such as usernames, IP addresses or error messages in a label, which can lead to thousands of time series.
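+
+To make the arithmetic concrete, here is a small sketch that computes the upper bound on the number of time series for one metric on one target as the product of the number of distinct values per label. The helper `maxSeries`, the label names and the value counts are hypothetical.
+
+```golang
+package main
+
+import "fmt"
+
+// maxSeries returns the upper bound on the number of time series that one
+// metric can expose on one target: the product of the number of distinct
+// values that each label can take.
+func maxSeries(valuesPerLabel map[string]int) int {
+	total := 1
+	for _, n := range valuesPerLabel {
+		total *= n
+	}
+	return total
+}
+
+func main() {
+	// Hypothetical label sets, for illustration only.
+	bounded := map[string]int{"method": 3, "code": 3}
+	withUsername := map[string]int{"method": 3, "code": 3, "username": 1000}
+
+	fmt.Println(maxSeries(bounded))      // prints 9
+	fmt.Println(maxSeries(withUsername)) // prints 9000
+}
+```
+
+Adding a single label with 1000 possible values multiplies the worst case by 1000, which is why unbounded values do not belong in labels.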
+
+Labels such as `pod`, `service`, `job` and `instance` shouldn't be set by the application. Instead, Prometheus discovers them at runtime when it queries the Kubernetes API to find out which targets should be scraped for metrics.
+
+## Custom collectors
+
+It is sometimes not feasible to use one of the four metric types, typically when your application already stores the information for another purpose (for instance, it maintains a list of custom objects retrieved from the Kubernetes API). In this case, the [custom collector](https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#hdr-Custom_Collectors_and_constant_Metrics) pattern can be useful.
+
+You can find an example of this pattern in the [github.com/prometheus-operator/prometheus-operator](https://github.com/prometheus-operator/prometheus-operator/blob/3df0811bdc7c046cb283006d94092e42219a0e2f/pkg/operator/operator.go#L166-L191) project.
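+
+The pattern can be sketched as follows. This is not the prometheus-operator code: the `objectStore` type, the `storeCollector` type and the metric name `example_managed_objects` are hypothetical and stand in for whatever state your application already maintains.
+
+```golang
+package main
+
+import (
+	"fmt"
+	"sync"
+
+	"github.com/prometheus/client_golang/prometheus"
+)
+
+// objectStore is a hypothetical stand-in for state that the application
+// already maintains, such as a cache of custom objects grouped by kind.
+type objectStore struct {
+	mu      sync.Mutex
+	objects map[string][]string // kind -> object names
+}
+
+// storeCollector implements prometheus.Collector on top of objectStore.
+type storeCollector struct {
+	store *objectStore
+	desc  *prometheus.Desc
+}
+
+func newStoreCollector(s *objectStore) *storeCollector {
+	return &storeCollector{
+		store: s,
+		desc: prometheus.NewDesc(
+			"example_managed_objects", // hypothetical metric name
+			"Number of objects managed per kind.",
+			[]string{"kind"}, nil,
+		),
+	}
+}
+
+// Describe sends the descriptor of every metric the collector can produce.
+func (c *storeCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }
+
+// Collect builds constant metrics from the current state at scrape time.
+func (c *storeCollector) Collect(ch chan<- prometheus.Metric) {
+	c.store.mu.Lock()
+	defer c.store.mu.Unlock()
+	for kind, names := range c.store.objects {
+		ch <- prometheus.MustNewConstMetric(
+			c.desc, prometheus.GaugeValue, float64(len(names)), kind)
+	}
+}
+
+func main() {
+	store := &objectStore{objects: map[string][]string{
+		"Prometheus":   {"k8s", "user-workload"},
+		"Alertmanager": {"main"},
+	}}
+
+	reg := prometheus.NewRegistry()
+	reg.MustRegister(newStoreCollector(store))
+
+	families, err := reg.Gather()
+	if err != nil {
+		panic(err)
+	}
+	for _, m := range families[0].GetMetric() {
+		fmt.Println(m.GetLabel()[0].GetValue(), m.GetGauge().GetValue())
+	}
+}
+```
+
+Because the values are computed on each scrape, the application does not need to keep a separate gauge in sync with its own state.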
+
+## Next steps
+
+* [Collect metrics](collecting_metrics.md) with Prometheus.
+* [Configure alerting](alerting.md) with Prometheus.