Implement API/infrastructure for polling node stats #125

Open
dmitrizagidulin opened this issue Dec 21, 2015 · 0 comments

  1. Implement the infrastructure for Explorer to poll each node's /stats endpoint, and record/aggregate the collected stat history.
  2. Implement the API endpoints for clients to query this stat history, per node.

Polling Frequency

Each node's /stats output is cached in Riak on a per-minute basis, so there is little point in polling a node more often than once or twice per minute. (Twice, because polling is not synchronized with the moment the cached minute ends; that boundary is rolling/undefined.)
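
As a rough illustration of that cadence, here is a minimal Python sketch of a polling loop that hits each node twice per minute. Everything in it is an assumption made for illustration: the names `fetch_stats` and `poll_nodes`, the 30-second interval, the `record` callback, and the idea that a node's stats are reachable at `<base_url>/stats` and returned as JSON. It is not Explorer's actual implementation.

```python
import json
import time
import urllib.request

POLL_INTERVAL_SECONDS = 30  # assumption: "twice per minute"


def fetch_stats(base_url, timeout=5):
    """GET <base_url>/stats and decode the JSON body (URL layout is assumed)."""
    url = base_url.rstrip("/") + "/stats"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def poll_nodes(node_urls, record, fetch=fetch_stats, rounds=None,
               interval=POLL_INTERVAL_SECONDS, sleep=time.sleep):
    """Poll every node once per interval; rounds=None means loop forever.

    record(node_url, polled_at, stats) is called for every node in every
    round; stats is None when the node could not be polled.
    """
    done = 0
    while rounds is None or done < rounds:
        polled_at = int(time.time())
        for url in node_urls:
            try:
                stats = fetch(url)
            except (OSError, ValueError):
                # Network errors and bad JSON: one unreachable node
                # must not stop the polling of the others.
                stats = None
            record(url, polled_at, stats)
        done += 1
        if rounds is None or done < rounds:
            sleep(interval)
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a live cluster.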

Server Side vs Client Side Polling

The Explorer API service needs to be the one polling and storing each node's stats. (The aggregate stats should be stored in Riak, once issue #123 is implemented.)

Although the polling could technically be done on the client side, from the Ember app, consider that multiple developers are likely to have an Explorer GUI running at the same time. For example, if the customer has a 10-node cluster and the stats polling were done on the client side, every dev who opens an Explorer app would add 10 more /stats requests per minute. This would add up quickly.
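
To put numbers on that, a small back-of-the-envelope sketch. The figure of 5 open Explorer GUIs and the helper name `stats_requests_per_minute` are hypothetical; the 10-node cluster and the once-per-minute rate come from the example above.

```python
def stats_requests_per_minute(nodes, polls_per_minute=1, pollers=1):
    """Total /stats requests hitting the cluster each minute."""
    return nodes * polls_per_minute * pollers


# Client-side polling: every open Explorer GUI polls all 10 nodes.
client_side = stats_requests_per_minute(nodes=10, pollers=5)

# Server-side polling: one Explorer API service polls all 10 nodes,
# regardless of how many GUIs are open.
server_side = stats_requests_per_minute(nodes=10, pollers=1)

print(client_side, server_side)  # 50 10
```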

Explorer Standalone Mode vs Node-Embedded (Mesos) Mode

One design challenge that needs to be solved is which Explorer node should be polling for stats, given that Explorer can run in two different modes:

  1. A "standalone" mode where it's pointed to a single node in a cluster, and doesn't have to be co-located with any Riak nodes.
  2. An "embedded"/co-located mode. For example, in the Riak Mesos project, an instance of Explorer is spun up with each Riak node (to provide various cluster management capabilities that Mesos needs).

Given the second mode, where there are as many Explorer API services as there are Riak nodes, we can't have every Explorer service poll every Riak node. (With N nodes, that would be N × N requests per polling round, almost all of them duplicates.)
Instead, we have two choices:

A. Pick one of the Explorer nodes to be the only one polling and recording stats. However, this is neither distributed nor resilient: if something happens to that node's server, all stats collection stops.

B. Have every Explorer node poll only its local Riak node for stats. This at least solves the resilience problem.

The second option, B, would be preferable, except that it applies only to Mesos-type Explorer installations. What about a single standalone Explorer API service? It doesn't have a local node to poll; in fact, it would need to poll all the nodes instead. (Since it's the only one running, there's no duplication.)

To solve this dilemma, I propose an additional riak_explorer.conf setting:

## Acceptable values: standalone, clustered
deploy_mode = standalone

This would denote whether Explorer is running in standalone or clustered mode, which in turn determines whether the API should poll all the Riak nodes in the cluster, or just its co-located local one.
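
A minimal Python sketch of how the proposed setting could drive target selection. The function names `read_deploy_mode` and `nodes_to_poll`, the tiny `key = value` parser, the fallback to `standalone` when the setting is absent, and the node names in the usage example are all assumptions for illustration, not existing Explorer code.

```python
VALID_MODES = ("standalone", "clustered")


def read_deploy_mode(conf_text, default="standalone"):
    """Extract deploy_mode from 'key = value' config text.

    Everything after a '#' is treated as a comment. Falling back to
    'standalone' when the key is missing is an assumption.
    """
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "deploy_mode":
            if value not in VALID_MODES:
                raise ValueError(f"invalid deploy_mode: {value!r}")
            return value
    return default


def nodes_to_poll(mode, cluster_nodes, local_node=None):
    """Return the Riak nodes this Explorer instance should poll."""
    if mode == "standalone":
        # The only Explorer instance: poll every node in the cluster.
        return list(cluster_nodes)
    if mode == "clustered":
        # One Explorer per Riak node: poll only the co-located node.
        if local_node is None:
            raise ValueError("clustered mode requires a local node")
        return [local_node]
    raise ValueError(f"invalid deploy_mode: {mode!r}")
```

For example, with `deploy_mode = clustered`, `nodes_to_poll("clustered", nodes, local_node="riak@10.0.0.2")` returns just `["riak@10.0.0.2"]`, while in standalone mode the full node list comes back.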

Which Stats to Record

When looking at the output of a /stats request, it becomes clear that not all of the stats returned are worth recording. (For example, the *_version stats don't change while a node is running, and aren't needed very often.)

Given that we have a lot of stats (~400 stats, not counting the MDC Replication ones), and they keep expanding with every Riak release, it would be helpful to be able to programmatically tell which stats are actually useful for graphing and recording.

The riak-help-json project attempts to solve this problem. Specifically, take a look at its riak_status.json file. Explorer can exclude (not store) any stat whose category attribute is one of: cluster state, config, versions or usage. In addition, there is probably no need to keep a history of the various *_total counts (the stats with metric_type = 'summary'), since these are running totals since the last node restart. For the summary metrics, Explorer only needs to store the latest value.
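
The filtering rule could look roughly like the Python sketch below. Note the assumptions: the metadata is modeled here as a dict mapping each stat name to a dict with `category` and `metric_type` keys, which may not match the real layout of riak_status.json; stats with no metadata are skipped; and the `kv` category in the usage example is made up.

```python
EXCLUDED_CATEGORIES = {"cluster state", "config", "versions", "usage"}


def classify(meta):
    """Decide how one stat is stored: 'series', 'latest' or 'skip'."""
    if meta.get("category") in EXCLUDED_CATEGORIES:
        return "skip"
    metric_type = meta.get("metric_type")
    if metric_type == "interval":
        return "series"  # keep the full history
    if metric_type == "summary":
        return "latest"  # keep only the most recent value
    return "skip"  # assumption: unknown or undescribed stats are not stored


def split_stats(stats, metadata):
    """Split one /stats response into (time-series stats, latest-only stats)."""
    series, latest = {}, {}
    for name, value in stats.items():
        kind = classify(metadata.get(name, {}))
        if kind == "series":
            series[name] = value
        elif kind == "latest":
            latest[name] = value
    return series, latest
```

Usage, with hypothetical metadata:

```python
metadata = {
    "node_gets": {"category": "kv", "metric_type": "interval"},
    "node_gets_total": {"category": "kv", "metric_type": "summary"},
    "riak_kv_version": {"category": "versions", "metric_type": "summary"},
}
stats = {"node_gets": 12, "node_gets_total": 3400, "riak_kv_version": "2.1.1"}
series, latest = split_stats(stats, metadata)
# series holds node_gets, latest holds node_gets_total, the version is dropped
```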

Aggregated Stats Format

This needs further consideration/design. Things to think about:

  • The stats need to be recorded on a per-node basis.
  • Only the metric_type = 'interval' stats (see above) need to be stored time-series-style. (Though possibly the metric_type = 'summary' ones also.)
  • Consider using a "time series"-like key structure.
  • Consider also aggregating averages on a per-minute, hourly and daily basis.
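
One possible shape for the points above, sketched in Python. The key layout `<node>/<stat>/<granularity>/<UTC time bucket>`, the function names, and the sample values are all assumptions meant to feed the design discussion, not a settled format.

```python
from collections import defaultdict
from datetime import datetime, timezone

# strftime patterns that truncate a timestamp to the start of its bucket
BUCKET_FORMATS = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}


def bucket_key(node, stat, epoch_seconds, granularity):
    """Build a per-node, per-stat key for the bucket containing a sample."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    stamp = moment.strftime(BUCKET_FORMATS[granularity])
    return f"{node}/{stat}/{granularity}/{stamp}"


def rollup_averages(node, stat, samples, granularity):
    """Average (epoch_seconds, value) samples within each time bucket."""
    buckets = defaultdict(list)
    for epoch_seconds, value in samples:
        key = bucket_key(node, stat, epoch_seconds, granularity)
        buckets[key].append(value)
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```

For instance, three samples of a stat taken at 0, 30 and 60 seconds past the epoch with values 10, 20 and 30 fall into two minute buckets (averaging 15.0 and 30.0) and a single hour bucket (averaging 20.0). Because the node and stat names lead the key, all buckets for one stat on one node share a common prefix.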

Stats TS-like API for Client use

TBD
