
Google Summer of Code 2014 : Improvement of Automatic Benchmarking System

Anand Soni edited this page Aug 8, 2014 · 20 revisions

Welcome to the benchmarks wiki! This wiki describes in detail the new additions to the benchmarking system made during GSoC 2014.

Overview

This repository was created by Marcus Edel to compare the performance of various machine learning libraries across a range of classifiers. Until the start of GSoC 2014, we compared the libraries based only on the run times of various algorithms/classifiers. But since run time alone is not sufficient to establish a meaningful benchmark, we came up with the following additions to make the repository more efficient, useful and unique in its own way:

  • Implemented some of the very widely used machine learning performance metrics
  • Modified the existing interface and integrated these metrics with various classifiers of the following libraries:
    • Mlpack
    • Scikit
    • Shogun
    • Weka
    • Matlab
    • Mlpy
  • Implemented a Bootstrapping framework which can be used to directly compare performance of libraries
  • Developed a visualization framework to plot the metric values for all libraries as a grouped chart
  • Added a Sortable table implementation to sort library performance on the basis of any metric value

Metrics Implemented

The following performance metrics were implemented during the Summer of 2014 -

  • Accuracy - The accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's actual (true) value. This gives a fair idea of how well the classifier is predicting classes.
  • Precision - Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant. In other words, it is the ratio of true positive predictions to the total number of positive predictions.
  • Recall - Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
  • F Measure - The F measure combines both precision and recall. It can be interpreted as a weighted average of the precision and recall, where an F score reaches its best value at 1 and its worst at 0.
  • Lift - Lift is a measure of the effectiveness of a classification model calculated as the ratio between the results obtained with and without the model. A good classifier will generally give us a high lift.
  • Matthews Correlation Coefficient - The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.
  • Mean Squared Error - The mean squared error (MSE) of an estimator measures the average of the squares of the "errors", that is, the differences between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss.
  • Mean Predictive Information - It is a metric closely related to cross entropy or the negative log likelihood and is a fairly natural metric to test the performance of classifiers.
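As a rough illustration, several of the definitions above can be computed directly from a binary confusion matrix. The sketch below uses hypothetical function names for clarity; it is not the repository's actual API:

```python
import math

# Counts from a binary confusion matrix: TP, FN, FP, TN.

def precision(tp, fp):
    # Fraction of positive predictions that are actually positive.
    return tp / (tp + fp)

def recall(tp, fn):
    # Fraction of actual positives that were predicted positive.
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # Combines precision and recall; best value 1, worst 0.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def mcc(tp, fn, fp, tn):
    # Matthews correlation coefficient, in [-1, +1].
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0
```

For example, a classifier with 8 true positives, 2 false negatives and 2 false positives has precision, recall and F measure all equal to 0.8.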

The Bootstrapping Framework

The Bootstrapping framework is the star addition to the benchmarking system during GSoC 2014. The performance of various classifiers depends on the problems/datasets selected. To compare and rank classifiers uniformly, and to normalize this process, we implemented a bootstrap analysis. We randomly select a bootstrap sample (sampling with replacement) from the available problems/datasets. For this bootstrap sample of problems and the implemented metrics, we rank the classifiers by mean performance across the sampled problems and metrics. This bootstrap sampling is repeated 100 times, and the results are averaged over the 100 repetitions, yielding a more statistically meaningful ranking of the learning methods.

To rank the classifiers, we represent the bootstrapping results in a tabular form as well as in the form of grouped bar charts. This table can then be sorted based on any metric value just by a click on a metric column header! The ranks so obtained give a fair picture of the overall performance of a method.
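The ranking procedure described above can be sketched in a few lines of Python. The `scores` table, function name and rank convention here are illustrative, not the system's actual code:

```python
import random

def bootstrap_ranks(scores, rounds=100, seed=0):
    """scores: {classifier: {dataset: mean metric value}}.
    Returns each classifier's rank averaged over the rounds."""
    rng = random.Random(seed)
    datasets = list(next(iter(scores.values())))
    totals = {clf: 0.0 for clf in scores}
    for _ in range(rounds):
        # Bootstrap sample: datasets drawn with replacement.
        sample = [rng.choice(datasets) for _ in datasets]
        # Mean performance of each classifier on this sample.
        means = {clf: sum(vals[d] for d in sample) / len(sample)
                 for clf, vals in scores.items()}
        # Rank 1 = best mean performance in this round.
        ordered = sorted(means, key=means.get, reverse=True)
        for rank, clf in enumerate(ordered, 1):
            totals[clf] += rank
    return {clf: total / rounds for clf, total in totals.items()}
```

A classifier that dominates on every sampled dataset ends up with an average rank of 1.0 regardless of which bootstrap samples are drawn.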

Changes in the benchmarking API/Interface

Three important changes were made to the config file structure to integrate it with the two new tasks added to the benchmarking system -

  • The metrics task - This task parses the config file, iterates over all libraries and calculates all the metrics for each library and method. The metrics are returned as a dictionary of dictionaries, keyed by method name; each value is itself a dictionary holding, per library, a sub-dictionary of all the metric values.
  • The bootstrap task - This task is created to retrieve the metric values from the database, perform the bootstrap analysis and create the graphical and tabular visualizations. (See the new structure of the config file in the next section)
  • The old method task was modified to a new timing task to keep the run-time based calculations separate from the metrics.
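To make the dictionary-of-dictionaries shape concrete, the metrics task's result looks roughly like the literal below. The method, library and metric keys are illustrative examples, not the exact strings the system produces:

```python
# method name -> library -> metric values
metrics_result = {
    'PERCEPTRON': {
        'mlpack': {'Accuracy': 0.95, 'Precision': 0.94, 'Recall': 0.93},
        'scikit': {'Accuracy': 0.94, 'Precision': 0.92, 'Recall': 0.95},
    }
}

# Looking up one library's metrics for a method:
scikit_metrics = metrics_result['PERCEPTRON']['scikit']
```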

Updated structure of the config file

For the older config file format, refer to the Automatic Benchmark Wiki on the Mlpack Trac. The new configuration file with both the new tasks should look something like this -

# Block for general settings.
library: general
settings:
    # Time until a timeout in seconds.
    timeout: 9000
    database: 'reports/benchmark.db'
    keepReports: 20
---
# SCIKIT:
# A machine learning library in Python.
library: scikit
methods:
    PERCEPTRON:
        run: ['timing', 'metrics', 'bootstrap']
        iteration: 1
        script: methods/scikit/perceptron.py
        format: [csv, txt]
        datasets:
            - files: [ ['datasets/iris_train.csv', 'datasets/iris_test.csv', 'datasets/iris_labels.csv'] ]

The above configuration file includes the new tasks to calculate the metrics and to perform the bootstrap analysis using the tables created in the database.

How to add new metrics?

Adding new metrics to the benchmarking system is really easy! All we need to do is add the new metric definition to the Metrics class in methods/metrics/definitions.py as a static method. Here is a glimpse -

  '''
  @param param1_name - param1 description
  @param param2_name - param2 description
  Description of the new metric comes here.
  '''
  @staticmethod
  def NewMetricName(param1_name, param2_name):
    # Metric definition comes here.
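For instance, a hypothetical specificity metric following the template above might look like this. The metric, its name and the confusion matrix layout are illustrative, not part of the repository:

```python
class Metrics:
  '''
  @param confusion_matrix - a 2x2 matrix [[TP, FN], [FP, TN]]
  Specificity: the fraction of actual negatives that are
  predicted negative.
  '''
  @staticmethod
  def Specificity(confusion_matrix):
    (tp, fn), (fp, tn) = confusion_matrix
    return tn / (tn + fp)
```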

Once the new metric definition is added, it is advisable to add a unit test for the metric too. To add unit tests implement a small test method in tests/metrics_unit_test.py in the Metrics_Test class as shown below -

  '''
  Test for the NewMetric metric
  '''
  def test_NewMetric(self):
    # Add the test here.
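A concrete (hypothetical) example of such a unit test, with the metric defined inline so the snippet is self-contained; in practice the metric would be imported from methods/metrics/definitions.py:

```python
import unittest

# Hypothetical metric, defined inline only for illustration.
def accuracy(confusion_matrix):
    # Fraction of correctly classified instances (diagonal / total).
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    total = sum(sum(row) for row in confusion_matrix)
    return correct / total

class Metrics_Test(unittest.TestCase):
    '''
    Test for the (hypothetical) accuracy metric
    '''
    def test_Accuracy(self):
        # [[TP, FN], [FP, TN]]: 9 of 10 instances classified correctly.
        self.assertEqual(accuracy([[5, 1], [0, 4]]), 0.9)
```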

For reference it is good to have a look at the corresponding files. And that's it! The new metric is implemented!

How to integrate a new metric with libraries?

After implementing the new metric, you might want to integrate it with one of the methods/classifiers of a particular library. It is an easy task but requires some understanding of the existing code and of the additions that need to be made. Some generic steps to follow are listed below -

  • First of all, open the method file and see if the metrics definitions path has been imported in the method file or not. If it is already there, well and good; otherwise add the following lines to the file -
#Import the metrics definitions path.
metrics_folder = os.path.realpath(os.path.abspath(os.path.join(
  os.path.split(inspect.getfile(inspect.currentframe()))[0], "../metrics")))
if metrics_folder not in sys.path:
  sys.path.insert(0, metrics_folder)  

  • The next step is the most important one. See if the RunMetrics(..) method has been implemented in the file. If it is there it will look like this -
  def RunMetrics(self, options):
    if len(self.dataset) >= 3:
      # Check if we need to build and run the model.
      # Possibly some checks here
      # Code to get test data and predicted labels data here.
      # Some confusion matrix creation here.
      AvgAcc = Metrics.AverageAccuracy(confusionMatrix)
      # Other metrics calculated here like the above one.
      metrics_dict = {}
      metrics_dict['Avg Accuracy'] = AvgAcc
      # Other metrics added to dictionary like the above one.
      return metrics_dict
    else:
      # Some error message here.

You just need to add two lines here. One to calculate your metric value (just like AvgAcc) and the other to add the value to the dictionary as shown in the code snippet above. However, if the RunMetrics(..) function has not been implemented, you need to add the complete function with all the other metrics and the new metric just as shown in the above snippet.
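For instance, integrating a hypothetical false-positive-rate metric would amount to the two marked lines below; the stub `Metrics` class and sample confusion matrix stand in for the real method file context:

```python
# Stub standing in for methods/metrics/definitions.py (hypothetical metric).
class Metrics:
    @staticmethod
    def FalsePositiveRate(confusion_matrix):
        (tp, fn), (fp, tn) = confusion_matrix
        return fp / (fp + tn)

confusionMatrix = [[8, 2], [1, 9]]
metrics_dict = {}

# The two lines to add inside RunMetrics(..):
FPR = Metrics.FalsePositiveRate(confusionMatrix)
metrics_dict['False Positive Rate'] = FPR
```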

  • Next, see if there is a RunTiming(..) function in the method file. If it is there, then you are almost done. If not, look for the RunMethod(..) function. It will certainly be there. Just rename it to RunTiming(..).
  • These generic steps will be followed every time a new metric has to be integrated with a method of any library. After these, mostly we need to debug the errors and make changes accordingly. It should not be a great ordeal to get the metric working once the above steps are done. Moreover, it is always a good idea to look at the already integrated methods for reference.