Google Summer of Code 2014: Improvement of Automatic Benchmarking System

Anand Soni edited this page Aug 8, 2014 · 20 revisions

Welcome to the benchmarks wiki! This wiki describes in detail the new additions to the benchmarking system made during GSoC 2014.

Overview

This repository was created by Marcus Edel to compare the performance of various machine learning libraries across a range of classifiers. Until the start of GSoC 2014, the libraries were compared only on the run times of their algorithms/classifiers. Since run time alone is not sufficient to establish a benchmark, we made the following additions to the repository to make it more useful:

  • Implemented several widely used machine learning performance metrics
  • Modified the existing interface and integrated these metrics with various classifiers of the following libraries:
    • Mlpack
    • Scikit
    • Shogun
    • Weka
    • Matlab
    • Mlpy
  • Implemented a Bootstrapping framework which can be used to directly compare performance of libraries
  • Developed a visualization framework to plot the metric values for all libraries as a grouped chart
  • Added a Sortable table implementation to sort library performance on the basis of any metric value

Metrics Implemented

The following performance metrics were implemented during the Summer of 2014 -

  • Accuracy - For a classifier, accuracy is the fraction of instances whose predicted class matches the actual (true) class. This gives a fair idea of how well the classifier is predicting classes.
  • Precision - Precision (also called positive predictive value) is the fraction of retrieved instances that are relevant. In other words, it is the ratio of correctly predicted positives (true positives) to all instances predicted as positive.
  • Recall - Recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
  • F Measure - It is the metric which combines both precision and recall. The F measure can be interpreted as a weighted harmonic mean of the precision and recall, where an F score reaches its best value at 1 and its worst value at 0.
  • Lift - Lift is a measure of the effectiveness of a classification model calculated as the ratio between the results obtained with and without the model. A good classifier will generally give us a high lift.
  • Matthews Correlation Coefficient - The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.
  • Mean Squared Error - The mean squared error (MSE) of an estimator measures the average of the squares of the "errors", that is, the differences between the estimator and what is estimated. MSE is a risk function, corresponding to the expected value of the squared error loss or quadratic loss.
  • Mean Predictive Information - It is a metric closely related to cross entropy or the negative log likelihood and is a fairly natural metric to test the performance of classifiers.
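
As an illustration of the definitions above, here is a minimal Python sketch that computes several of these metrics for a binary problem from lists of true and predicted labels. The function names (`confusion_counts`, `binary_metrics`, `mean_squared_error`) are hypothetical and chosen only for this example; they are not the names used in the benchmark repository, and the repository's own implementations may differ (for example in how multi-class problems are handled).

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, F1 and MCC as a dictionary.

    Ratios with a zero denominator are reported as 0.0 (an assumption
    made for this sketch).
    """
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1, "MCC": mcc}


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between truth and prediction."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels, used only to exercise the functions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
```

For these made-up labels there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so accuracy, precision, recall and F1 all equal 0.75, the MCC is 0.5 and the MSE is 0.25.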

The Bootstrapping Framework

The Bootstrapping framework is the main addition to the benchmarking system during GSoC 2014. The performance of a classifier depends on the problems/datasets selected. To compare and rank classifiers uniformly and to normalize this process, we implemented a bootstrap analysis. We randomly select a bootstrap sample (sampling with replacement) from the available problems/datasets. For this bootstrap sample of problems and the implemented metrics, we rank the classifiers by mean performance across the sampled problems and metrics. This bootstrap sampling is repeated 100 times. We then average the metric values over these 100 samples, yielding a potentially significant ranking of the learning methods.

To rank the classifiers, we represent the bootstrapping results in a tabular form as well as in the form of grouped bar charts. This table can then be sorted based on any metric value just by a click on a metric column header! The ranks so obtained give a fair picture of the overall performance of a method.
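
The bootstrap procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `scores` layout (one already-averaged metric score per library and dataset), the function names and the library and dataset names are all hypothetical, and every library is assumed to have a score for every dataset. The actual framework reads its values from the benchmark database and may aggregate them differently.

```python
import random


def bootstrap_scores(scores, n_rounds=100, seed=0):
    """Average each library's mean score over bootstrap samples of datasets.

    scores: {library: {dataset: score}} (hypothetical layout).
    Returns {library: average of the per-sample mean scores}.
    """
    rng = random.Random(seed)
    datasets = sorted({d for per_lib in scores.values() for d in per_lib})
    totals = {lib: 0.0 for lib in scores}
    for _ in range(n_rounds):
        # Sampling with replacement: a dataset may appear more than once.
        sample = [rng.choice(datasets) for _ in datasets]
        for lib, per_dataset in scores.items():
            totals[lib] += sum(per_dataset[d] for d in sample) / len(sample)
    return {lib: total / n_rounds for lib, total in totals.items()}


def rank(averages):
    """Order libraries from best to worst average score."""
    return sorted(averages, key=averages.get, reverse=True)


# Placeholder numbers, not real benchmark results.
scores = {
    "lib_a": {"d1": 0.90, "d2": 0.80, "d3": 0.95},
    "lib_b": {"d1": 0.70, "d2": 0.60, "d3": 0.75},
    "lib_c": {"d1": 0.50, "d2": 0.40, "d3": 0.30},
}
averages = bootstrap_scores(scores)
print(rank(averages))
```

Fixing the seed makes the run repeatable, which is convenient when two benchmark reports need to be compared; dropping the `seed` argument would give a fresh resampling each time.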

Changes in the benchmarking API/Interface

Two important changes were made to the config file structure to integrate it with the two new tasks added to the benchmarking system -

  • The metrics task - This task parses the config file, iterates over all libraries and calculates all the metrics for each library and method. The metrics are returned as a dictionary of dictionaries: each key is a method name, and its value is a dictionary that holds, for each library, a sub-dictionary of that library's metric values.
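  • The bootstrap task - This task performs the bootstrap analysis described above, using the tables created in the database.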
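
The dictionary of dictionaries returned by the metrics task could be laid out as in the Python sketch below. The exact key names, the metric values and the helper `rows_sorted_by` are assumptions made for illustration only; they are not taken from the benchmark code, and the numbers are placeholders rather than measured results.

```python
# Hypothetical result: method name -> library name -> metric name -> value.
metrics_result = {
    "PERCEPTRON": {
        "scikit": {"Accuracy": 0.90, "Precision": 0.88, "Recall": 0.91},
        "shogun": {"Accuracy": 0.87, "Precision": 0.90, "Recall": 0.85},
    },
}


def rows_sorted_by(result, method, metric):
    """Return (library, value) rows for one method, best value first."""
    per_library = result[method]
    rows = [(lib, values[metric]) for lib, values in per_library.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Sort the libraries by one metric, as a sortable table would.
for library, value in rows_sorted_by(metrics_result, "PERCEPTRON", "Precision"):
    print(library, value)
```

Keeping the method name as the outermost key means that all libraries implementing the same method sit side by side, which is the comparison the tables and grouped charts need.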

Updated structure of the config file

For the older config file structure, refer to Automatic Benchmark Wiki Mlpack Trac. The new configuration file with both new tasks should look something like this -

# Block for general settings.
library: general
settings:
    # Time until a timeout in seconds.
    timeout: 9000
    database: 'reports/benchmark.db'
    keepReports: 20
---
# Scikit-learn:
# Machine Learning in Python
library: scikit
methods:
    PERCEPTRON:
        run: ['timing', 'metrics', 'bootstrap']
        iteration: 1
        script: methods/scikit/perceptron.py
        format: [csv, txt]
        datasets:
            - files: [ ['datasets/iris_train.csv', 'datasets/iris_test.csv', 'datasets/iris_labels.csv'] ]

The above configuration file includes the new tasks to calculate the metrics and to perform the bootstrap analysis using the tables created in the database.

How to add new metrics?

How to integrate a new metric with libraries?
