Missing count and sum in prometheus histogram metrics #41573

henrikno · 2024-11-08T20:50:54Z

This prometheus histogram metric:

cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.005"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.025"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.1"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.25"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="0.5"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="1"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="2"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="4"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="8"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="15"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="30"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="60"} 23
cilium_k8s_client_rate_limiter_duration_seconds_bucket{method="GET",path="/version",le="+Inf"} 23
cilium_k8s_client_rate_limiter_duration_seconds_sum{method="GET",path="/version"} 3.2705999999999995e-05
cilium_k8s_client_rate_limiter_duration_seconds_count{method="GET",path="/version"} 23

Gets turned into this in ES:

    "prometheus": {
      "cilium_k8s_client_api_latency_time_seconds": {
        "histogram": {
          "values": [],
          "counts": []
        }
      },
      "cilium_k8s_client_rate_limiter_duration_seconds": {
        "histogram": {
          "values": [],
          "counts": []
        }
      },
      "labels": {
        "instance": "10.21.98.177:9962",
        "job": "prometheus",
        "method": "GET",
        "path": "/version"
      },
      "labels_fingerprint": "lrtsgTghb5LOY7ViR50IWXf7y6M="
    },

We are using use_types and rate, 1. because it's the default in the elastic-agent integration, and 2. to be able to query them in Kibana.
https://www.elastic.co/docs/current/integrations/prometheus#histograms-and-types-1
However, the values don't look like the example, and what we expect.
The _count and _sum is missing. I was hoping to query the rate of the count/sum.

Sidenote, if the buckets are diffed, I expected it to be named .rate. I was confused why the values did not match what the prometheus endpoint waas reporting at all. This is what it does for counters. Also for counters it keeps the original, but I can see that that would increase the storage.

Another thing that looks funny is the empty values {"values":[],"counts":[]}. Looks like this is happening because we use TSDS, which is also the default in the prometheus integration.

I tried to reproduce it with just metricbeat 8.15.3.

Put the example at the top in a file called metrics, then run python3 -m http.server 9000

cat modules.d/prometheus.yml
- module: prometheus
  period: 20s
  hosts: ["localhost:9000"]
  metrics_path: /metrics
  use_types: true
  rate_counters: false

Set this in metricbeat.yml

output.console:
  pretty: true

With use_types: true, rate_counters: false, I get a bunch of zeroes:

   "cilium_k8s_client_rate_limiter_duration_seconds": {
      "histogram": {
        "values": [
          0.0025,
          0.015,
          0.0625,
          0.175,
          0.375,
          0.75,
          1.5,
          3,
          6,
          11.5,
          22.5,
          45,
          60
        ],
        "counts": [
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0,
          0
        ]
      }

With rate_counters: false I expected to get the exact same values that prometheus reports.
Also missing _sum and _count. If I disable use_types, I do see the values, but there's one document per bucket which is extremely difficult to query.

The text was updated successfully, but these errors were encountered:

jlind23 · 2024-11-12T06:15:36Z

Thanks @henrikno for creating this.
Let me pull @lalit-satapathy in as his team is the one owning the prometheus integration.

lalit-satapathy · 2024-11-12T06:36:43Z

Thanks @henrikno for creating this. Let me pull @lalit-satapathy in as his team is the one owning the prometheus integration.

@shmsr can you take a look?

shmsr · 2024-11-12T07:00:38Z

I'll take a look.

shmsr · 2024-11-14T09:24:22Z

Hi @henrikno 👋🏽

Apologies for taking some time to reply as I was busy investigating a couple of other bugs.

I do understand it is not working as expected for you but this is how it is.

So when use_types: true, this part of the code is used to form the event:

beats/x-pack/metricbeat/module/prometheus/collector/data.go

Line 147 in 367d94e

TODO convert histogram to ES type

and as you can see *_sum and *_count are intentionally left out. I am not sure why this decision was taken ~5years ago to omit it out.

Also, see this logic here how Prometheus Histogram is converted to an ES histogram:

beats/x-pack/metricbeat/module/prometheus/collector/histogram.go

Line 16 in 367d94e

    
           // PromHistogramToES takes a Prometheus histogram and converts it to an ES histogram:

As have a value of 23 for all buckets and hence the rate is turning out to be 0; so it's because of the sample input that you provided. I'd recommend reading the full logic there and further discussion here: elastic/apm-agent-python#1165 (comment). So, this is the reason why counts gives you an array filled with 0s.

Also, if you use_types: false, it seems you can get *_sum and *_count. Here:

beats/metricbeat/module/prometheus/collector/data.go

Line 137 in 367d94e

name + "_sum": histogram.GetSampleSum(),

But yes you miss out the transformation from Prometheus Histogram to ES Histogram here as use_types is false. But I recommend trying this out and see if this suits you?

I need to check if it is safe to add *_sum and *_count when use_types: true and why the decision of not adding them was taken back then.

Also, about the docs example that you mentioned: https://www.elastic.co/docs/current/integrations/prometheus#histograms-and-types-1

{
    "prometheus": {
        "labels": {
            "instance": "172.27.0.2:9090",
            "job": "prometheus"
        },
        "prometheus_target_interval_length_seconds_count": {
            "counter": 1,
            "rate": 0
        },
        "prometheus_target_interval_length_seconds_sum": {
            "counter": 15.000401344,
            "rate": 0
        }
        "prometheus_tsdb_compaction_chunk_range_seconds_bucket": {
            "histogram": {
                "values": [50, 300, 1000, 4000, 16000],
                "counts": [10, 2, 34, 7]
            }
        }
    },
}

If you notice prometheus_target_interval_length_seconds and prometheus_tsdb_compaction_chunk_range_seconds are two different metrics where the latter is a histogram and the former a counter (rate counter). For histogram here you can there's no *_sum and *_count when use_types is true.

I hope this explanation helps.

henrikno · 2024-11-20T17:14:14Z

I get that it was an "intentional" decision to not add it, but I think that was a mistake as it makes it very difficult or impossible to query it.
There's a TODO about adding it.

beats/x-pack/metricbeat/module/prometheus/collector/data.go

Lines 147 to 148 in 367d94e

    
           				TODO convert histogram to ES type 
        
           				Send sum & count? not sure it's worth it

There's an enhancement issue that considers adding it elastic/integrations#5042
In the docs for histogram https://prometheus.io/docs/practices/histograms/

Note that the number of observations (showing up in Prometheus as a time series with a _count suffix) is inherently a counter (as described above, it only goes up). The sum of observations (showing up as a time series with a _sum suffix) behaves like a counter, too, as long as there are no negative observations.

So a histogram is also "counter", so it should behave similar to a counter, so we can query it with a rate, avg etc.

shmsr · 2024-11-21T09:43:58Z

Thanks Henrik for updating.

As previously discussed on Slack (and sharing here for those who weren't part of that conversation): I reviewed the Slack thread asking for histogram changes and Andrew's proposal regarding histogram implementation enhancements filed more than a year ago. Since this is Andrew's proposal, it requires further discussion and it'd be great if he can also share his views on this. We need to evaluate this thoroughly while considering customers who are using the current histogram implementation. Making changes has implications, so this needs detailed discussion with the Prometheus package owners and members of ES/ Kibana.

The current code works as intended according to the original design assumptions. While I agree with Henrik that improvements are needed, we should discuss this in detail within our team.

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 8, 2024

jlind23 added the Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team label Nov 12, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 12, 2024

lalit-satapathy assigned shmsr Nov 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Missing count and sum in prometheus histogram metrics #41573

Missing count and sum in prometheus histogram metrics #41573

henrikno commented Nov 8, 2024

jlind23 commented Nov 12, 2024

lalit-satapathy commented Nov 12, 2024

shmsr commented Nov 12, 2024

shmsr commented Nov 14, 2024 •

edited

Loading

henrikno commented Nov 20, 2024

shmsr commented Nov 21, 2024 •

edited

Loading

Missing count and sum in prometheus histogram metrics #41573

Missing count and sum in prometheus histogram metrics #41573

Comments

henrikno commented Nov 8, 2024

jlind23 commented Nov 12, 2024

lalit-satapathy commented Nov 12, 2024

shmsr commented Nov 12, 2024

shmsr commented Nov 14, 2024 • edited Loading

henrikno commented Nov 20, 2024

shmsr commented Nov 21, 2024 • edited Loading

shmsr commented Nov 14, 2024 •

edited

Loading

shmsr commented Nov 21, 2024 •

edited

Loading