Implement API/infrastructure for polling node stats #125

dmitrizagidulin · 2015-12-21T19:51:21Z

Implement the infrastructure for Explorer to poll each node's /stats endpoint, and record/aggregate the collected stat history.
Implement the API endpoints for clients to query this stat history, per node.

Polling Frequency

Riak caches each node's /stats on a per-minute basis, so it doesn't make sense to poll a node much more frequently than once or twice per minute. (Twice, because polling isn't synchronized with the moment the cache refreshes; that boundary is rolling/undefined.)
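To make the polling cadence concrete, here is a minimal sketch of a poller that visits a node at most once per interval. The 30-second interval (twice per minute), the `StatsPoller` class, and the injectable `fetch` and `clock` callables are all assumptions for illustration; none of them exist in Explorer today.

```python
import time

# Assumption: poll twice per minute, per the reasoning above.
POLL_INTERVAL_SECONDS = 30


class StatsPoller:
    """Polls each node's /stats at most once per interval (hypothetical helper)."""

    def __init__(self, fetch, interval=POLL_INTERVAL_SECONDS, clock=time.monotonic):
        self.fetch = fetch          # callable: node name -> dict of stats
        self.interval = interval    # minimum seconds between polls of one node
        self.clock = clock          # injectable so the logic can be tested
        self.last_polled = {}       # node name -> time of the last poll
        self.history = {}           # node name -> list of (time, stats) samples

    def poll_due(self, nodes):
        """Poll every node whose interval has elapsed; return the nodes polled."""
        now = self.clock()
        polled = []
        for node in nodes:
            last = self.last_polled.get(node)
            if last is None or now - last >= self.interval:
                self.history.setdefault(node, []).append((now, self.fetch(node)))
                self.last_polled[node] = now
                polled.append(node)
        return polled
```

In a real service, `fetch` would issue the HTTP request to the node's /stats endpoint and `poll_due` would be driven by a timer; both are left out here to keep the sketch self-contained.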

Server Side vs Client Side Polling

The Explorer API service needs to be the one polling and storing each node's stats. (The aggregate stats should be stored in Riak once issue #123 is implemented.)

Although the polling could technically be done on the client side, from the Ember app, multiple developers are likely to have an Explorer GUI running at once. For example, if the customer has a 10-node cluster and stats polling were done on the client side, every dev who opens an Explorer app would add 10 more /stats requests per minute. This would add up quickly.
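The load argument above is simple arithmetic, sketched here for a few team sizes. The function name and the assumption of one poll per node per minute are illustrative only.

```python
def stats_requests_per_minute(node_count, poller_count, polls_per_minute=1):
    """Total /stats requests per minute hitting the cluster."""
    return node_count * poller_count * polls_per_minute


NODES = 10  # the 10-node cluster from the example above

# Client-side polling: every open Explorer app is its own poller.
client_side = {devs: stats_requests_per_minute(NODES, devs) for devs in (1, 5, 20)}

# Server-side polling: a single Explorer API service polls, however many apps are open.
server_side = stats_requests_per_minute(NODES, 1)
```

With client-side polling the request count grows linearly with the number of open apps, while server-side polling stays at the single-poller figure.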

Explorer Standalone Mode vs Node-Embedded (Mesos) Mode

One design challenge that needs to be solved is which Explorer node should be polling for stats, given that Explorer can run in two different modes:

1. A "standalone" mode, where it's pointed at a single node in a cluster and doesn't have to be co-located with any Riak nodes.
2. An "embedded"/co-located mode. For example, in the Riak Mesos project, an instance of Explorer is spun up with each Riak node (to provide various cluster management capabilities that Mesos needs).

Given the second mode, where there are as many Explorer API services as there are Riak nodes, we can't have every Explorer service poll every Riak node. (That would be a lot of duplication.)
Instead, we have two choices:

A. Pick one of the Explorer nodes to be the only one polling & recording stats. However, this is not very distributed or resilient -- if something happens to that node's server, there goes all the stats collection.

B. Have every Explorer node only poll the local Riak node for stats. This at least solves the resilience problem.

The second option, B, would be preferable, except that it applies only to Mesos-type Explorer installations. What about a single standalone Explorer API service? It doesn't have a local node to poll; in fact, it would need to poll all the nodes instead. (Since it's the only one running, there's no duplication there.)

To solve this dilemma, I propose an additional riak_explorer.conf setting:

## Acceptable values: standalone, clustered
deploy_mode = standalone

This would denote whether Explorer was running in standalone or clustered mode, which would then determine whether the API should poll all the Riak nodes in the cluster, or just its co-located local one.
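As a rough sketch of how the proposed setting could drive the polling targets: the `parse_conf` and `nodes_to_poll` helpers below are hypothetical, the node names are made up, and the parser handles only simple `key = value` lines with `#` comments rather than the full config syntax.

```python
def parse_conf(text):
    """Parse simple `key = value` lines, ignoring blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def nodes_to_poll(deploy_mode, cluster_nodes, local_node=None):
    """Choose which Riak nodes this Explorer instance should poll."""
    if deploy_mode == "standalone":
        # The only Explorer running: poll every node in the cluster.
        return list(cluster_nodes)
    if deploy_mode == "clustered":
        # One Explorer per Riak node: poll only the co-located node.
        if local_node is None:
            raise ValueError("clustered mode requires a local node")
        return [local_node]
    raise ValueError(f"unknown deploy_mode: {deploy_mode!r}")


conf = parse_conf("""
## Acceptable values: standalone, clustered
deploy_mode = standalone
""")
cluster = ["riak@10.0.0.1", "riak@10.0.0.2", "riak@10.0.0.3"]  # made-up names
targets = nodes_to_poll(conf["deploy_mode"], cluster)
```

Rejecting unknown values outright, instead of silently falling back to one mode, avoids accidentally polling every node from every Explorer instance.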

Which Stats to Record

When looking at the output of a /stats request, it becomes clear that not all stats returned are useful to record. (For example, all of the *_version stats don't change, and aren't really needed that often).

Given that we have a lot of stats (~400 stats, not counting the MDC Replication ones), and they keep expanding with every Riak release, it would be helpful to be able to programmatically tell which stats are actually useful for graphing and recording.

The riak-help-json project attempts to solve this problem. Specifically, take a look at the riak_status.json file. Explorer can exclude (not store) stats with the category attribute equal to any of: cluster state, config, versions or usage. In addition, there is probably no need to keep a history of the various *_total counts (the stats with metric_type = 'summary'), since these are running totals since the last node restart. Explorer only needs to store the latest value of each summary metric.
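The filtering rules above might look roughly like this. The metadata shape (a mapping from stat name to a dict with `category` and `metric_type` keys) is an assumption about how riak_status.json could be loaded, not its confirmed schema, and treating stats without metadata as skippable is likewise a guess.

```python
# Categories the issue proposes excluding from storage.
EXCLUDED_CATEGORIES = {"cluster state", "config", "versions", "usage"}


def storage_policy(meta):
    """Return 'skip', 'series', or 'latest' for one stat's metadata entry."""
    if meta is None or meta.get("category") in EXCLUDED_CATEGORIES:
        return "skip"    # unknown or excluded stats are not stored (assumption)
    if meta.get("metric_type") == "interval":
        return "series"  # stored time-series-style
    return "latest"      # e.g. summary totals: keep only the newest value


def split_stats(stats, metadata):
    """Split one /stats response into (series_stats, latest_only_stats)."""
    series, latest = {}, {}
    for name, value in stats.items():
        policy = storage_policy(metadata.get(name))
        if policy == "series":
            series[name] = value
        elif policy == "latest":
            latest[name] = value
    return series, latest
```

Any stat whose `metric_type` is neither excluded nor `interval` falls into the latest-only group here; that default is a design guess and may need revisiting once the real metadata is inspected.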

Aggregated Stats Format

This needs further consideration/design. Things to think about:

The stats need to be recorded on a per-node basis.
Only the metric_type = 'interval' stats (see above) need to be stored time-series-style. (Though possibly the metric_type = 'summary' ones also.)
Consider using a "time series"-like key structure.
Consider also aggregating averages on a minute, hourly and daily basis.
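One possible shape for the last two points, offered only as a starting point for the design discussion: keys of the form `node/stat/resolution/bucket` and per-bucket averages. The key layout, the UTC bucketing, and both function names are assumptions, not a settled format.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Timestamp precision that identifies a bucket at each resolution.
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}


def ts_key(node, stat, epoch_seconds, resolution):
    """Build a time-series-like key for the bucket containing the sample."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    stamp = moment.strftime(BUCKET_FORMATS[resolution])
    return f"{node}/{stat}/{resolution}/{stamp}"


def aggregate_averages(node, stat, samples, resolution):
    """Average (epoch_seconds, value) samples per bucket at the given resolution."""
    buckets = defaultdict(list)
    for epoch_seconds, value in samples:
        buckets[ts_key(node, stat, epoch_seconds, resolution)].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```

Because the node name leads the key, all of one node's history shares a prefix, which keeps the per-node requirement from the first point easy to satisfy.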

Stats TS-like API for Client use

TBD

dmitrizagidulin added this to the Riak Stats History / Graphing milestone Dec 21, 2015

dmitrizagidulin mentioned this issue Dec 23, 2015

Implement basho_bench integration in Standalone explorer mode #126

Open
