Implement the infrastructure for Explorer to poll each node's /stats endpoint, and record/aggregate the collected stat history.
Implement the API endpoints for clients to query this stat history, per node.
Polling Frequency
Each node's /stats are cached on a per-minute basis in Riak, and therefore it doesn't make sense to poll each node much more frequently than once or twice per minute (twice because it's not synchronized exactly to when the minute ends, the boundary is rolling/undefined).
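That cadence can be sketched as a simple server-side polling loop (the node URLs, 30-second interval, and function names here are illustrative assumptions, not the actual Explorer implementation):

```python
import json
import time
import urllib.request

# Hypothetical node list; in practice this would come from cluster membership.
NODES = ["http://node1:8098", "http://node2:8098"]
POLL_INTERVAL = 30  # seconds: twice per minute, since the cache boundary is rolling

def poll_stats(base_url):
    """Fetch and decode one node's /stats payload."""
    with urllib.request.urlopen(base_url + "/stats") as resp:
        return json.loads(resp.read())

def poll_loop(record):
    """Poll every node each interval, handing results to a `record` callback."""
    while True:
        for node in NODES:
            try:
                record(node, poll_stats(node))
            except OSError:
                pass  # node unreachable this round; try again next interval
        time.sleep(POLL_INTERVAL)
```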
Server Side vs Client Side Polling
The Explorer API service needs to be the one polling and storing each node's stats. (The aggregate stats should be stored in Riak, after issue #123 gets implemented).
Although the polling could technically be done on the client side, from the Ember app, consider that multiple developers are likely to have an Explorer GUI running. For example, if the customer has a 10 node cluster and the stats polling were done on the client side, each dev that opens an Explorer app would add 10 more /stats requests per minute. This would add up quickly.
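The arithmetic behind that concern, made concrete (the dev count here is an assumed illustration):

```python
def stats_requests_per_minute(nodes, open_explorer_apps, polls_per_minute=1):
    """Total /stats requests per minute if polling were done client-side."""
    return nodes * open_explorer_apps * polls_per_minute

# 10-node cluster, 5 devs each with an Explorer app open, polling once a minute:
print(stats_requests_per_minute(10, 5))  # 50 requests/minute
```

With server-side polling, the same cluster costs a flat 10 requests per minute, no matter how many GUIs are open.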
Explorer Standalone Mode vs Node-Embedded (Mesos) Mode
One design challenge that needs to be solved is which Explorer node should be polling for stats.
Given that Explorer can run in basically two different modes:
1. A "standalone" mode, where it's pointed at a single node in a cluster, and doesn't have to be co-located with any Riak nodes.
2. An "embedded"/co-located mode. For example, in the Riak Mesos project, an instance of Explorer is spun up alongside each Riak node (to provide various cluster management capabilities that Mesos needs).
Given the second mode, where there are as many Explorer API services as there are Riak nodes, we can't have every Explorer service poll every Riak node. (That would be a lot of duplication.)
Instead, we have two choices:
A. Pick one of the Explorer nodes to be the only one polling & recording stats. However, this is not very distributed or resilient -- if something happens to that node's server, there goes all the stats collection.
B. Have every Explorer node only poll the local Riak node for stats. This at least solves the resilience problem.
The second option, B, would be preferable, except it applies only to Mesos-type Explorer installations. What about a single standalone Explorer API service? It doesn't have a local node to poll; in fact, it would need to poll all the nodes instead. (Since it's the only one running, there's no duplication there.)
To solve this dilemma, I propose an additional riak_explorer.conf setting:
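A sketch of what that setting might look like (the name polling_mode and its values are placeholders, not an agreed-upon interface):

```
## Hypothetical riak_explorer.conf setting (illustrative only):
## standalone = poll every Riak node in the cluster
## clustered  = poll only the co-located local node
polling_mode = standalone
```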
This would denote whether Explorer was running in standalone or clustered mode, which would then determine whether the API should poll all the Riak nodes in the cluster, or just its co-located local one.
Which Stats to Record
When looking at the output of a /stats request, it becomes clear that not all stats returned are useful to record. (For example, all of the *_version stats don't change, and aren't really needed that often).
Given that we have a lot of stats (~400 stats, not counting the MDC Replication ones), and they keep expanding with every Riak release, it would be helpful to be able to programmatically tell which stats are actually useful for graphing and recording.
The riak-help-json project attempts to solve this problem. Specifically, take a look at the riak_status.json file. Explorer can exclude (not store) stats with the category attribute equal to any of: cluster state, config, versions or usage. In addition, there is probably no need to store the various *_total counts (these are stats with metric_type = 'summary') as time series, since they are totals since node restart. Explorer only needs to store the latest value for all the summary metrics.
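A rough sketch of that exclusion logic, assuming riak_status.json maps each stat name to a dict carrying its category and metric_type (the helper name recordable_stats is hypothetical):

```python
# Categories whose stats Explorer would skip entirely.
EXCLUDED_CATEGORIES = {"cluster state", "config", "versions", "usage"}

def recordable_stats(stat_metadata):
    """Filter a {stat_name: {'category': ..., 'metric_type': ...}} mapping
    down to the stats worth recording as history."""
    keep = {}
    for name, meta in stat_metadata.items():
        if meta.get("category") in EXCLUDED_CATEGORIES:
            continue  # cluster state / config / versions / usage: not stored
        keep[name] = meta
    return keep
```

Summary-type stats (the *_total counters) would still appear in this filtered set; the difference is only that Explorer keeps their latest value rather than a full time series.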
Aggregated Stats Format
This needs further consideration/design. Things to think about:
The stats need to be recorded per individual node.
Only the metric_type = 'interval' stats (see above) need to be stored time-series-style. (Though possibly the metric_type = 'summary' ones also).
Consider using a "time series"-like key structure
Consider also aggregating averages on a minute, hourly and daily basis
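The last two bullets could be sketched like this, assuming flat string keys and in-memory sample lists (the key layout and names are illustrative only):

```python
from statistics import mean

def stat_key(node, stat, epoch_seconds, resolution=60):
    """Time-series-style key: one slot per node/stat/time bucket."""
    slot = epoch_seconds - (epoch_seconds % resolution)
    return f"{node}/{stat}/{resolution}s/{slot}"

def rollup(samples, factor):
    """Average consecutive groups of `factor` samples, e.g. 60 one-minute
    averages collapse into one hourly average."""
    return [mean(samples[i:i + factor]) for i in range(0, len(samples), factor)]
```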
Stats TS-like API for Client use
TBD