Installation of application-monitoring-operator using full declarative language #128
Tagging @david-martin to get an initial opinion.
There are a couple of things happening around integreatly and monitoring in OSD4 that mean I can't give a clear indication yet of what I think is the best way forward. Namely:
I'm not sure what's happening here, or why some element of it would be private.
@david-martin integreatly-operator doesn't pull operators from Quay.
@matskiv Do you think you can publish the operator on one of the 3 default OperatorSources, or maybe provide the needed setup to install it using OLM (so not installing it through the current Makefile)?
@slopezz AFAIK updating one of the 3 default registries would require some manual work for each release.
@david-martin what do you think about publishing AMO (at least on the integreatly OperatorSource)?
@slopezz At the moment, it's looking likely we'll drop AMO from the integreatly-operator (in OpenShift 4) and put AMO into maintenance mode on the v0 branch for existing Integreatly/RHMI 1.x clusters on OpenShift 3. I'll explain the thinking behind this further below. As such, it's unlikely we'd take on the publishing of AMO to an OperatorSource. The rationale for this is a number of things:
So, right now, our intent in the shorter term is to look into solving some of the above problems in a more efficient and Operator/OLM-friendly way that meets the needs of Integreatly on OpenShift 4, and that is likely to take the form of changes in the integreatly-operator rather than in AMO.
@david-martin It makes sense. We started using AMO because we thought it was the way to go for application monitoring, following the RHMI strategy. Right now we are using it mainly for dev purposes, and we have found it very useful: the engineering team can easily start playing with Grafana and Prometheus in order to build 3scale dashboards and alerts. But we understand your current concerns about it, and we definitely think that all Red Hat products (not only Integration ones) should use the same monitoring stack, which can lead to the standardization of application monitoring.

We have done a quick test of the current user workload monitoring feature; below I add our initial test notes. At the end of the test we have added a few takeaways that we can discuss afterwards; maybe we can work together with people from the OpenShift monitoring team in order to provide real feedback about it and improve the product.

## User Workload Monitoring Test

We have done a quick test of how user workload monitoring works, in order to check its current features and its viability for our usage in the 3scale product (both on-prem and SaaS). We have used the latest OCP 4.4.0-rc.4.

### Architecture

The architecture can be checked at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md. There are two Prometheus instances (each one scraping its own ServiceMonitors):

* the existing cluster Prometheus, in the `openshift-monitoring` namespace
* a new user workload Prometheus, in the `openshift-user-workload-monitoring` namespace
### How to setup

Following the official docs, you basically need to create a new ConfigMap in the `openshift-monitoring` namespace that enables the tech preview user workload monitoring feature.
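Something along these lines works (a minimal sketch; in OCP 4.4 the feature is still gated behind the `techPreviewUserWorkload` tech preview key):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # Enable the tech preview user workload monitoring stack
    techPreviewUserWorkload:
      enabled: true
```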
The user workload monitoring stack is then immediately created in the `openshift-user-workload-monitoring` namespace:

```
$ oc get pods -n openshift-user-workload-monitoring
NAME                                 READY   STATUS    RESTARTS   AGE
prometheus-operator-55569f49-6sfnh   1/1     Running   0          5h34m
prometheus-user-workload-0           5/5     Running   1          5h34m
prometheus-user-workload-1           5/5     Running   1          5h34m
```

Then you just need to deploy any application exposing metrics in any namespace.
In our case, we just deployed a sample application exposing Prometheus metrics and created a `ServiceMonitor` for it in the application namespace (see the sketch below).
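A minimal sketch of that kind of `ServiceMonitor` (the names, namespace and port are made up for illustration):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name
  namespace: example-namespace # the application's namespace
  labels:
    app: example-app
spec:
  selector:
    matchLabels:
      app: example-app         # must match the labels of the metrics Service
  endpoints:
  - port: metrics              # name of the Service port exposing /metrics
    interval: 30s
```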
### Metrics

Then, if you go to the web console Metrics page for the application's namespace, you can execute PromQL queries on the application metrics; we tried a couple of example queries there.
Both queries show data, because the queries are really executed against the query layer that aggregates metrics from both Prometheus instances.

The documentation says that, as an administrator, you can also execute application PromQL queries from the cluster Prometheus UI, but that does not work, because cluster Prometheus doesn't have the application data; application data is only in the user-workload-monitoring Prometheus (and in the aggregated query layer).

In addition, the user workload monitoring Prometheus does not include a public Route to check current alerts (which personally I find useful), unlike cluster Prometheus.

### Alerts

We have created a sample `PrometheusRule` with a fake alert that always fires (see the sketch at the end of these notes). The documentation says the alert should then appear in the same console Alerting view as the cluster alerts. But there, there is no active application alert (only the 2 active alerts from cluster Prometheus), and if we look for our specific sample alert, it does not appear. So we can see that, unlike metrics, application alerts are not surfaced there.

But if we go to the Alertmanager, we can see both the cluster Prometheus alerts (2 alerts firing) and the application alert (1 fake alert firing). So it seems that, for some reason, the application alert reaches Alertmanager but is not shown in the console Alerting view.

### Grafana

User workload monitoring does not include a Grafana instance (it is out of scope), and the current cluster Grafana is not an operator: it is a static deployment with specific volumes mounting specific Kubernetes Grafana dashboards from ConfigMaps. So if you want to have application dashboards, you need your own Grafana instance (like the integreatly grafana-operator, for example, with autodiscovery of dashboards using labels).

### Takeaways

* Application metrics are scraped by the dedicated user workload Prometheus and can be queried from the console, but not from the cluster Prometheus UI.
* The user workload Prometheus has no public Route, so there is no UI to inspect its alerts directly.
* Application alerts reach Alertmanager but are not visible in the console Alerting view.
* There is no Grafana for application dashboards, so a separate Grafana instance (e.g. the grafana-operator) is still needed.
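For reference, the fake alert we created was something along these lines (a sketch; the rule name, namespace and labels are made up):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert           # hypothetical name
  namespace: example-namespace  # the application's namespace
spec:
  groups:
  - name: example
    rules:
    - alert: ExampleFakeAlert
      # vector(1) always returns a value, so the alert fires immediately
      expr: vector(1)
      labels:
        severity: warning
      annotations:
        message: Fake alert used to test user workload monitoring alerting.
```

This is the alert that shows up as firing in the test above.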
## Context
At the 3scale engineering team, we want to use application-monitoring-operator, so that both RHMI and 3scale use the same monitoring stack, which will help both teams follow the same direction, taking into account that 3scale is working on adding metrics, prometheusRules and grafanaDashboards for the next release.
At the 3scale SRE/Ops team we are using OpenShift Hive to provision our on-demand dev OCP clusters (so engineers can easily do testing with metrics, dashboards...), and we are using the Hive `SyncSet` object in order to apply the same configuration to different OCP clusters: we define all resources once in a single yaml, and then we can apply the same config to any dev cluster by just adding the new cluster name to the list in the `SyncSet` object (a minimal sketch is included at the end of this section).

We have seen that the currently documented operator installation involves executing a Makefile target (with the grafana/prometheus versions), which executes a bash script that runs `oc apply` on different files, directories or URLs.

We need an easy way to install the monitoring stack using declarative language (no Makefile target executions), so it is easy to maintain and to keep track of every change for every release on GitHub (GitOps philosophy).
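A minimal sketch of the kind of `SyncSet` we use (cluster and resource names are made up; the resource list would contain the AMO manifests):

```yaml
apiVersion: hive.openshift.io/v1
kind: SyncSet
metadata:
  name: application-monitoring      # hypothetical name
  namespace: dev-clusters           # namespace holding the ClusterDeployments
spec:
  clusterDeploymentRefs:            # adding a cluster here applies the config to it
  - name: dev-cluster-01
  - name: dev-cluster-02
  resourceApplyMode: Sync
  resources:                        # embedded manifests, applied as-is to each cluster
  - apiVersion: v1
    kind: Namespace
    metadata:
      name: application-monitoring
  # ... the rest of the AMO resources go here ...
```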
## Current workaround
As a workaround, what we are doing now is to parse/extract all the resources deployed by scripts/install.sh and add them to a single `SyncSet` object (which has its own spec format). But before creating the `SyncSet` object, because OpenShift Hive only accepts k8s native APIs, some OpenShift apiVersions like `authorization.openshift.io/v1` are rejected and need to be replaced by the k8s native alternative `rbac.authorization.k8s.io/v1` (see openshift/hive#864 and https://issues.redhat.com/browse/CO-532). So we need to fix some resources in order to be fully compatible with Hive: switching `authorization.openshift.io/v1` to the k8s native `rbac.authorization.k8s.io/v1` (plus applying some additions, like adding `roleRef.kind` and `roleRef.apiGroup`). Actually, you are already using that k8s native apiVersion on other ClusterRole/ClusterRoleBinding objects (but not on all of them), for example the `grafana-proxy` resource (see the sketch below).
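Roughly, that native format looks like this (an illustrative sketch, not copied from the repo; the subject name and namespace are made up):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-proxy
roleRef:
  apiGroup: rbac.authorization.k8s.io   # required by the native API
  kind: ClusterRole                     # roleRef.kind must be set explicitly
  name: grafana-proxy
subjects:
- kind: ServiceAccount
  name: grafana-serviceaccount          # hypothetical subject
  namespace: application-monitoring
```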
We have checked in deploy/cluster-roles/README.md that you use the `Integr8ly installer` in order to install application-monitoring-operator (not the Makefile target), so you don't use the yamls on deploy/cluster-roles/.

In order to have full compatibility with k8s (and hence with OpenShift Hive), and not require us to transform almost all objects, we wonder if we can open a PR to fix those small issues, while still being fully compatible with OpenShift:

* switch the remaining ClusterRole/ClusterRoleBinding objects from `authorization.openshift.io/v1` to `rbac.authorization.k8s.io/v1`, using the same format already used by the `grafana-proxy` resource
## Possible improvement
To make the installation of `application-monitoring-operator` with a fully declarative language easier, without having to manage all those 25 yamls, we have seen that you are already using olm-catalog, so we wonder if you plan to:

* publish the operator on one of the 3 default `OperatorSource`s, like certified-operators, redhat-operators or community-operators (so the operator can be used by anybody), or
* provide an `OperatorSource` resource that can be easily deployed on an OpenShift cluster, so that we just need to create a `Subscription` object to deploy the operator on a given namespace, channel, version...

We have tried to deploy an `OperatorSource` using data from the Makefile (like `registryNamespace: integreatly`); see the sketch below.
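Something along these lines (a minimal sketch; the metadata name and display fields are made up):

```yaml
apiVersion: operators.coreos.com/v1
kind: OperatorSource
metadata:
  name: integreatly-operators        # hypothetical name
  namespace: openshift-marketplace
spec:
  type: appregistry
  endpoint: https://quay.io/cnr      # Quay app-registry endpoint
  registryNamespace: integreatly     # value taken from the Makefile
  displayName: "Integreatly Operators"
  publisher: "Integreatly"
```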
But we have seen that only the `integreatly` operator is available, so we guess `application-monitoring-operator` might be private.