phc: add metrics for number of unhealthy endpoints #3040

Draft
wants to merge 2 commits into
base: master
3 changes: 2 additions & 1 deletion docs/operation/operation.md
@@ -925,7 +925,8 @@ while choosing the endpoint for the given request

A set of metrics will be exposed to track passive health check:

-* `passive-health-check.endpoints.dropped`: Number of all endpoints dropped before load balancing a request, so after N requests and M endpoints are being dropped this counter would be N*M.
+* Counter `passive-health-check.requests.failures.mitigated`: Number of all possible requests failures mitigated by passive health check, so after N requests and M endpoints this counter could be N*M in worst case scenario (all endpoints aren't healthy).
+* Gauge `passive-health-check.endpoints.dropped`: Number of unhealthy/filtered endpoints

## Memory consumption

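To make the counter and gauge semantics in the documentation change concrete, here is a minimal sketch. It assumes the worst case, where every request drops every unhealthy endpoint (the real filter drops endpoints probabilistically). `mockMetrics` and `simulate` are hypothetical stand-ins written for this illustration, not Skipper's actual metrics implementation.

```go
package main

import "fmt"

// mockMetrics is a hypothetical in-memory metrics sink (illustration only).
type mockMetrics struct {
	counters map[string]int64
	gauges   map[string]float64
}

func newMockMetrics() *mockMetrics {
	return &mockMetrics{
		counters: map[string]int64{},
		gauges:   map[string]float64{},
	}
}

// IncCounter accumulates: every call adds one to the running total.
func (m *mockMetrics) IncCounter(key string) { m.counters[key]++ }

// UpdateGauge overwrites: only the most recent value is kept.
func (m *mockMetrics) UpdateGauge(key string, v float64) { m.gauges[key] = v }

// simulate models the assumed worst case: each of nRequests requests
// drops all mUnhealthy endpoints.
func simulate(m *mockMetrics, nRequests, mUnhealthy int) {
	for r := 0; r < nRequests; r++ {
		for e := 0; e < mUnhealthy; e++ {
			m.IncCounter("passive-health-check.requests.failures.mitigated")
		}
		m.UpdateGauge("passive-health-check.endpoints.dropped", float64(mUnhealthy))
	}
}

func main() {
	m := newMockMetrics()
	simulate(m, 100, 3)
	// The counter grows to N*M, while the gauge stays at M.
	fmt.Println(m.counters["passive-health-check.requests.failures.mitigated"])
	fmt.Println(m.gauges["passive-health-check.endpoints.dropped"])
}
```

Under these assumptions the counter reaches N*M (300 here), while the gauge only reflects the latest request (3 here).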
2 changes: 1 addition & 1 deletion proxy/healthy_endpoints.go
@@ -25,7 +25,7 @@ func (h *healthyEndpoints) filterHealthyEndpoints(ctx *context, endpoints []rout
 		if p < dropProbability {
 			ctx.Logger().Infof("Dropping endpoint %q due to passive health check: p=%0.2f, dropProbability=%0.2f",
 				e.Host, p, dropProbability)
-			metrics.IncCounter("passive-health-check.endpoints.dropped")
+			metrics.IncCounter("passive-health-check.requests.failures.mitigated")
 		} else {
 			filtered = append(filtered, e)
 		}
5 changes: 4 additions & 1 deletion proxy/healthy_endpoints_test.go
@@ -175,8 +175,11 @@ func TestPHCForMultipleHealthyAndOneUnhealthyEndpoints(t *testing.T) {
 		failedReqs := sendGetRequests(t, ps)
 		assert.InDelta(t, 0, failedReqs, 0.1*float64(nRequests))
 		mockMetrics.WithCounters(func(counters map[string]int64) {
-			assert.InDelta(t, float64(nRequests), float64(counters["passive-health-check.endpoints.dropped"]), 0.3*float64(nRequests)) // allow 30% error
+			assert.InDelta(t, float64(nRequests), float64(counters["passive-health-check.requests.failures.mitigated"]), 0.3*float64(nRequests)) // allow 30% error
 		})
+		v, ok := mockMetrics.Gauge("passive-health-check.endpoints.dropped")
+		assert.True(t, ok, "passive-health-check.endpoints.dropped gauge not found")
+		assert.Equal(t, 1.0, v)
 	})
 }

2 changes: 2 additions & 0 deletions proxy/proxy.go
@@ -539,8 +539,10 @@ func setRequestURLForDynamicBackend(u *url.URL, stateBag map[string]interface{})
 func (p *Proxy) selectEndpoint(ctx *context) *routing.LBEndpoint {
 	rt := ctx.route
 	endpoints := rt.LBEndpoints
+	beforefiltering := len(endpoints)
 	endpoints = p.fadein.filterFadeIn(endpoints, rt)
 	endpoints = p.heathlyEndpoints.filterHealthyEndpoints(ctx, endpoints, p.metrics)
+	p.metrics.UpdateGauge("passive-health-check.endpoints.dropped", float64(beforefiltering-len(endpoints)))

@MustafaSaber (Member, Author) commented on Apr 26, 2024:

After some discussion with @RomanZavodskikh, we realized that this gauge will be overwritten by requests to different services. For example, if svc A has 2 unhealthy endpoints and svc B has 1, both will write to the same metric.

What we could do is append the routeId or a Name+Namespace combination to the metric name. We aren't sure that's a good idea memory-wise, wdyt @AlexanderYastrebov @szuecs?
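
To illustrate the overwrite concern raised in this comment, here is a minimal sketch. `gaugeStore`, `reportShared`, and `reportPerRoute` are hypothetical names introduced for illustration, and the per-route key format is an assumption, not a naming scheme settled in this PR.

```go
package main

import "fmt"

// gaugeStore is a hypothetical stand-in for a metrics backend:
// it keeps only the last value written under each key.
type gaugeStore map[string]float64

func (g gaugeStore) UpdateGauge(key string, v float64) { g[key] = v }

// reportShared writes every route's dropped count to one shared key,
// which is what the change to selectEndpoint does.
func reportShared(g gaugeStore, dropped float64) {
	g.UpdateGauge("passive-health-check.endpoints.dropped", dropped)
}

// reportPerRoute appends a route ID to the key (assumed naming scheme).
func reportPerRoute(g gaugeStore, routeID string, dropped float64) {
	g.UpdateGauge("passive-health-check.endpoints.dropped."+routeID, dropped)
}

func main() {
	shared := gaugeStore{}
	reportShared(shared, 2) // request to svc A: 2 unhealthy endpoints
	reportShared(shared, 1) // request to svc B: 1 unhealthy endpoint
	// Only svc B's value survives; svc A's was overwritten.
	fmt.Println(shared["passive-health-check.endpoints.dropped"])

	perRoute := gaugeStore{}
	reportPerRoute(perRoute, "routeA", 2)
	reportPerRoute(perRoute, "routeB", 1)
	// Both values survive, at the cost of one key per route.
	fmt.Println(len(perRoute))
}
```

The per-route variant keeps both values, but the number of keys grows with the number of routes, which is the memory trade-off in question.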

Member:

Of the two options, routeId is the better one, because Kubernetes data is 1) not available and 2) often not applicable (what about non-Kubernetes dataclients?).
Another option would be one gauge per endpoint, passive-health-check.endpoints.dropped.<endpoint>, with a value of 0 or 1; a sum() query would then return the number of currently dropped endpoints.

In any case, it seems we would have to add some unbounded memory usage, so we should consider whether we need this at all, or whether we should start with logs that name the endpoint.
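
The per-endpoint alternative suggested above could look roughly like the following sketch. `endpointGauges`, `setDropped`, and `sum` are hypothetical helpers, and `sum` only mimics what a sum() query in a monitoring system would compute; none of this is Skipper's API.

```go
package main

import "fmt"

// keyPrefix is the metric name proposed in the review comment.
const keyPrefix = "passive-health-check.endpoints.dropped."

// endpointGauges is a hypothetical in-memory model of per-endpoint gauges.
type endpointGauges map[string]float64

// setDropped records 1 for a dropped endpoint and 0 for a healthy one.
func (g endpointGauges) setDropped(endpoint string, dropped bool) {
	v := 0.0
	if dropped {
		v = 1.0
	}
	g[keyPrefix+endpoint] = v
}

// sum mimics a sum() query over all per-endpoint gauges.
func (g endpointGauges) sum() float64 {
	total := 0.0
	for _, v := range g {
		total += v
	}
	return total
}

func main() {
	g := endpointGauges{}
	g.setDropped("10.0.0.1:8080", true)
	g.setDropped("10.0.0.2:8080", false)
	g.setDropped("10.0.1.1:8080", true)
	fmt.Println(g.sum(), len(g)) // currently dropped endpoints, stored keys

	// An endpoint recovers: the sum decreases, but its key remains stored.
	g.setDropped("10.0.0.1:8080", false)
	fmt.Println(g.sum(), len(g))
}
```

In this sketch keys are never deleted, so the map grows with every endpoint ever seen. That is one way the unbounded memory usage mentioned above could show up unless stale keys are cleaned up.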


 	lbctx := &routing.LBContext{
 		Request: ctx.request,