-
Notifications
You must be signed in to change notification settings - Fork 361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometheus exporter #1200
base: master
Are you sure you want to change the base?
Prometheus exporter #1200
Conversation
Add a new command: cmsPrometheusExporter This will start a new HTTP server that listens on a port (defaults to 8811) for requests to /metrics. At this endpoint many metrics about a contest (or all contests) are exposed for Prometheus to be collected. Those metrics consider both system values (queue status, connected workers, ...) and contest values (submissions, users, questions, ...) The metrics exposed may leak information about the contest, so it's important to secure this endpoint (for example using a reverse proxy, or binding localhost). With this exporter it's possible to build interactive and real-time dashboard for monitoring the status of the contest. For example, you can find here [1] the dashboard we used at OII 2020, and here [2] the one used at an OIS round. [1]: https://snapshot.raintank.io/dashboard/snapshot/0J29w0KruEiymy6zV30beX98aRG6njiX?orgId=2 [2]: https://snapshot.raintank.io/dashboard/snapshot/JngCBt71ZOJhiyEUC7yxZhS3guZmZYYi?orgId=2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks great! The dashboard examples are nice.
I see that prometheus_client.start_http_server()
uses threading
, which should be patched by gevent — OK.
I noticed that:
- The service crashes on start if CMS (
EvaluationService
, to be precise) is not running. - The service returns HTTP 500 if CMS/
EvaluationService
is stopped.
As a status reporting service, it is probably good to make it more robust to the state of other services.
super().__init__() | ||
|
||
self.contest_id = args.contest_id | ||
self.export_subs = not args.no_submissions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to rename export_subs
and collect_subs
to export_submissions
and collect_submissions
, respectively. It's only this field that is shortened, and the full name is more descriptive.
if self.export_users: | ||
yield from self.collect_users(session) | ||
|
||
def collect_subs(self, session): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I understand, only the collect()
method is the public interface. In this case the other method names should be prefixed with _
.
service = PrometheusExporter(args) | ||
REGISTRY.register(service) | ||
start_http_server(args.port, addr=args.host) | ||
print("Started at http://%s:%s/metric" % (args.host, args.port)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proper logging should be used here.
) | ||
metric.add_metric(["total"], len(status)) | ||
|
||
connected = len([1 for worker in status.values() if worker["connected"]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use generator expressions to avoid constructing the full list: sum(1 for … in … if …)
. Similarly below.
REGISTRY.register(service) | ||
start_http_server(args.port, addr=args.host) | ||
print("Started at http://%s:%s/metric" % (args.host, args.port)) | ||
Service.run(service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not service.run()
?
parser = argparse.ArgumentParser(description="Prometheus exporter.") | ||
parser.add_argument( | ||
"--host", | ||
help="IP address to bind", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help="IP address to bind", | |
help="IP address to bind to", |
@@ -25,3 +25,6 @@ pyyaml>=5.3,<5.4 # http://pyyaml.org/wiki/PyYAML | |||
# Only for printing: | |||
pycups>=1.9,<1.10 # https://pypi.python.org/pypi/pycups | |||
PyPDF2>=1.26,<1.27 # https://github.com/mstamy2/PyPDF2/blob/master/CHANGELOG | |||
|
|||
# Only for cmsPrometheusExporter | |||
prometheus-client>=0.7,<0.8 # https://pypi.org/project/prometheus-client |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation should also be updated at "Installing CMS and its Python dependencies". We need python3-prometheus-client
for Ubuntu and python-prometheus_client
for Arch Linux as optional deps.
@@ -0,0 +1,293 @@ | |||
#!/usr/bin/env python3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The script file should have the executable permission.
REGISTRY.register(service) | ||
start_http_server(args.port, addr=args.host) | ||
print("Started at http://%s:%s/metric" % (args.host, args.port)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code could be moved to a new method in PrometheusExporter
. In fact, PrometheusExporter.run()
could also be overridden and this new method be called from there.
Add a new command:
cmsPrometheusExporter
This will start a new HTTP server that listens on a port (defaults to 8811) for requests to /metrics. At this endpoint many metrics about a contest (or all contests) are exposed for Prometheus to be collected. Those metrics consider both system values (queue status, connected workers, ...) and contest values (submissions, users, questions, ...)
This exporter works by executing simple queries to the database, and obtains the queue information asking Evaluation Service.
The metrics exposed may leak information about the contest, so it's important to secure this endpoint (for example using a reverse proxy, or binding localhost).
With this exporter it's possible to build interactive and real-time dashboard for monitoring the status of the contest. For example, you can find here the dashboard we used at OII 2020, and here the one used at an OIS round.
Those dashboards are built using Grafana, and we configured it also for sending telegram notifications when something wrong was happening (e.g. queue too long, see image below).