
Sidecar issue since v0.37.0 #7947

Open · rgarcia89 opened this issue Nov 29, 2024 · 10 comments
@rgarcia89 (Contributor)

Thanos, Prometheus and Golang version used:
Thanos: v0.36.1 / v0.37.0
Prometheus: v2.51.2

What happened:
After upgrading the Thanos sidecar container from v0.36.1 to v0.37.0, the Thanos Querier (v0.37.0) is no longer able to properly connect to the sidecar. The screenshot below shows two clusters running v0.36.1 and four clusters running v0.37.0.

[screenshot: Thanos Query stores page for two v0.36.1 clusters and four v0.37.0 clusters]

As observed, the Min Time data is missing for the sidecars running on the latest version. Interestingly, when using Thanos Query, I can still see data newer than two hours. This suggests that the sidecar is being used since this data has not yet been uploaded to the object storage. Similarly, I can also view data from within the last two hours via Grafana. However, label_values queries no longer work when selecting this time range.

[screenshot: label_values query for the last two hours returning no results]
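For anyone trying to reproduce the label_values part, here is a minimal sketch using the Prometheus Go API client against the Thanos Query endpoint, restricted to the last hour of data. The endpoint address and the label name are placeholders, not the exact setup from this report:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address for the Thanos Query HTTP endpoint.
	client, err := api.NewClient(api.Config{Address: "http://thanos-query:9090"})
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	// Query only the last hour, i.e. data that should still be served by
	// the sidecars rather than by object storage.
	end := time.Now()
	start := end.Add(-time.Hour)

	// Placeholder label name; any label used in Grafana variables works.
	values, warnings, err := promAPI.LabelValues(ctx, "cluster", nil, start, end)
	if err != nil {
		log.Fatalf("label_values query failed: %v", err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("values:", values)
}
```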

@dosubot (bot) commented Nov 29, 2024

Hello, @rgarcia89! I'm here to help you with any bugs, questions, or contributions you have. Let's tackle this issue together!

I found some similar threads that might be relevant to your issue:

  1. Open Issue: Thanos Side Endpoints Not Showing on Querier - This issue discusses problems with the Querier not displaying sidecar endpoints. No solution was found in the comments [1].

  2. Closed Issue: Querier cannot speak to external sidecar - This issue was resolved by implementing a proxy solution using Envoy to allow the Querier to communicate with the external sidecar through a proxy setup [2].

Additionally, there is a known breaking change in Thanos v0.37.0: the removal of the Store gRPC Info function, which might affect connectivity if your setup relies on this functionality [3].

These threads and changes might provide some insights or potential solutions to your connectivity issue with the Thanos sidecar after upgrading to v0.37.0.

To continue talking to Dosu, mention @dosu.

@MichaHoffmann (Contributor)

Maybe this could be related to #7820?

@MichaHoffmann (Contributor)

Do you see anything suspicious in the sidecar logs?

@rgarcia89 (Contributor, Author)

Nope, nothing suspicious in there.

ts=2024-11-29T15:18:09.581373607Z caller=main.go:77 level=debug msg="maxprocs: Leaving GOMAXPROCS=2: CPU quota undefined"
ts=2024-11-29T15:18:09.584061062Z caller=options.go:29 level=info protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
ts=2024-11-29T15:18:09.5846933Z caller=factory.go:54 level=info msg="loading bucket configuration"
ts=2024-11-29T15:18:09.584853638Z caller=azure.go:150 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-11-29T15:18:09.648072464Z caller=sidecar.go:432 level=info msg="starting sidecar"
ts=2024-11-29T15:18:09.648222143Z caller=intrumentation.go:75 level=info msg="changing probe status" status=healthy
ts=2024-11-29T15:18:09.648276193Z caller=http.go:73 level=info service=http/server component=sidecar msg="listening for requests and metrics" address=:10902
ts=2024-11-29T15:18:09.648720274Z caller=reloader.go:274 level=info component=reloader msg="nothing to be watched"
ts=2024-11-29T15:18:09.648857256Z caller=tls_config.go:348 level=info service=http/server component=sidecar msg="Listening on" address=[::]:10902
ts=2024-11-29T15:18:09.648932726Z caller=tls_config.go:351 level=info service=http/server component=sidecar msg="TLS is disabled." http2=false address=[::]:10902
ts=2024-11-29T15:18:09.656684874Z caller=sidecar.go:444 level=warn msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
ts=2024-11-29T15:18:11.660515648Z caller=sidecar.go:444 level=warn msg="failed to get Prometheus flags. Is Prometheus running? Retrying" err="got non-200 response code: 503, response: Service Unavailable"
ts=2024-11-29T15:18:13.654530085Z caller=sidecar.go:200 level=info msg="successfully validated prometheus flags"
ts=2024-11-29T15:18:13.655620235Z caller=promclient.go:663 level=debug msg="build version" url=http://localhost:9090/api/v1/status/buildinfo
ts=2024-11-29T15:18:13.658524014Z caller=sidecar.go:223 level=info msg="successfully loaded prometheus version"
ts=2024-11-29T15:18:13.659194342Z caller=promclient.go:699 level=debug msg="lowest timestamp" url=http://localhost:9090/metrics
ts=2024-11-29T15:18:13.7098399Z caller=sidecar.go:254 level=info msg="successfully loaded prometheus external labels" external_labels="{cluster=\"rnd\", prometheus=\"monitoring/rnd\", prometheus_replica=\"prometheus-rnd-0\", stage=\"lab\"}"
ts=2024-11-29T15:18:13.714241929Z caller=intrumentation.go:56 level=info msg="changing probe status" status=ready
ts=2024-11-29T15:18:13.714966097Z caller=promclient.go:699 level=debug msg="lowest timestamp" url=http://localhost:9090/metrics
ts=2024-11-29T15:18:13.715796374Z caller=grpc.go:167 level=info service=gRPC/server component=sidecar msg="listening for serving gRPC" address=:10901
ts=2024-11-29T15:18:43.71504741Z caller=promclient.go:699 level=debug msg="lowest timestamp" url=http://localhost:9090/metrics
ts=2024-11-29T15:19:13.71577582Z caller=promclient.go:699 level=debug msg="lowest timestamp" url=http://localhost:9090/metrics

@MichaHoffmann (Contributor)

So it looks like we were able to get the lowest timestamp from Prometheus. Did Prometheus cut a block already?
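For reference, the debug lines above show the sidecar reading the lowest timestamp from Prometheus's own /metrics endpoint. A rough, illustrative way to inspect that value by hand is to scrape and parse the endpoint yourself; the metric name used below (prometheus_tsdb_lowest_timestamp_seconds) is an assumption about what is consulted, not taken from the sidecar code:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

func main() {
	// Same endpoint the sidecar scrapes in the debug logs above.
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		log.Fatalf("scraping metrics: %v", err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatalf("parsing metrics: %v", err)
	}

	// Assumed metric: the TSDB lowest timestamp exposed by Prometheus.
	mf, ok := families["prometheus_tsdb_lowest_timestamp_seconds"]
	if !ok || len(mf.GetMetric()) == 0 {
		log.Fatal("lowest timestamp metric not found")
	}
	fmt.Println("lowest timestamp (unix seconds):", mf.GetMetric()[0].GetGauge().GetValue())
}
```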

@rgarcia89 (Contributor, Author)

Not since I restarted it. We have configured a block size of 2h.

[screenshot: Prometheus TSDB status since the restart]
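One way to double-check whether a block has been cut since the restart is Prometheus's TSDB status API. A small sketch that prints the head's time range follows; the address is a placeholder and only the documented headStats fields of /api/v1/status/tsdb are decoded:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Minimal view of the /api/v1/status/tsdb response; only headStats is decoded.
type tsdbStatus struct {
	Data struct {
		HeadStats struct {
			NumSeries int64 `json:"numSeries"`
			MinTime   int64 `json:"minTime"`
			MaxTime   int64 `json:"maxTime"`
		} `json:"headStats"`
	} `json:"data"`
}

func main() {
	resp, err := http.Get("http://localhost:9090/api/v1/status/tsdb")
	if err != nil {
		log.Fatalf("querying tsdb status: %v", err)
	}
	defer resp.Body.Close()

	var status tsdbStatus
	if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
		log.Fatalf("decoding response: %v", err)
	}

	// minTime/maxTime are unix milliseconds; when Prometheus cuts a block,
	// the head is truncated and minTime jumps forward.
	fmt.Println("head min time:", time.UnixMilli(status.Data.HeadStats.MinTime).UTC())
	fmt.Println("head max time:", time.UnixMilli(status.Data.HeadStats.MaxTime).UTC())
	fmt.Println("head series:  ", status.Data.HeadStats.NumSeries)
}
```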

@MichaHoffmann (Contributor)

I think the problem should solve itself once Prometheus cuts a block for the first time.

@rgarcia89 (Contributor, Author)

I will let you know. Still, this is not happening with v0.36.1

@MichaHoffmann (Contributor)

> I will let you know. Still, this is not happening with v0.36.1

Yeah; we likely need to fall back to the shipper timestamp if we cannot consult the metrics ~ that's a bug; but it would still be cool to know if it recovers after the first block is cut!
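A minimal sketch of that fallback idea, with hypothetical interface and method names (not the actual sidecar types):

```go
package sidecar

import (
	"context"
	"errors"
)

// metricsSource is a hypothetical stand-in for the promclient lookup that
// scrapes the lowest timestamp from Prometheus's /metrics endpoint.
type metricsSource interface {
	LowestTimestamp(ctx context.Context) (int64, error)
}

// shipperSource is a hypothetical stand-in for the shipper's view of
// already-cut blocks.
type shipperSource interface {
	OldestBlockMinTime() (ts int64, ok bool)
}

// minTimestamp prefers the metrics-derived value and falls back to the
// shipper when the metric cannot be consulted, so the querier still gets a
// usable MinTime when the metric lookup yields nothing.
func minTimestamp(ctx context.Context, m metricsSource, s shipperSource) (int64, error) {
	if ts, err := m.LowestTimestamp(ctx); err == nil && ts != 0 {
		return ts, nil
	}
	if ts, ok := s.OldestBlockMinTime(); ok {
		return ts, nil
	}
	return 0, errors.New("no timestamp source available")
}
```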

@rgarcia89 (Contributor, Author)

No difference, even after Prometheus cut a block.
