This chapter explores boosting, another ensemble learning algorithm typically based on decision trees that often produces even better results the Random Forests.
The key difference is that boosting, in its original AdaBoost version, modifies the training data for each tree based on the cumulative errors made by the model before adding the new tree. Random Forests, in contrast, use bagging to train many trees independently from each other using randomized versions of the training set. While Random Forests can be trained in parallel, boosting proceeds sequentially using reweighted versions of the data. State-of-the-art boosting implementations also adopt the randomization strategies of random forests.
In this chapter, we will see how boosting has evolved into one of the most successful machine learning algorithms over the last three decades. At the time of writing, it has come to dominate machine learning competitions for structured data (as opposed to high-dimensional images or speech, for example, where the relationship between the input and output is more complex, and deep learning excels at). More specifically, in this chapter we will cover the following topics:
- How boosting works, and how it compares to bagging
- How boosting has evolved from adaptive to gradient boosting
- How to use and tune AdaBoost and gradient boosting models with sklearn
- How state-of-the-art GBM implementations dramatically speed up computation
- How to prevent overfitting of gradient boosting models
- How to build, tune, and evaluate gradient boosting models on large datasets using xgboost, lightgbm, and catboost
- How to interpret and gain insights from gradient boosting models
Like bagging, boosting combines base learners into an ensemble. Boosting was initially developed for classification problems, but can also be used for regression, and has been called one of the most potent learning ideas introduced in the last 20 years (as described in Elements of Statistical Learning by Trevor Hastie, et al). Like bagging, it is a general method or metamethod that can be applied to many statistical learning models.
The following sections briefly introduce AdaBoost and then focus on the gradient boosting model, as well as several state-of-the-art implementations of this algorithm.
AdaBoost is a significant departure from bagging, which builds ensembles on very deep trees to reduce bias. AdaBoost, in contrast, grows shallow trees as weak learners, often producing superior accuracy with stumps—that is, trees formed by a single split. The algorithm starts with an equal-weighted training set and then successively alters the sample distribution. After each iteration, AdaBoost increases the weights of incorrectly classified observations and reduces the weights of correctly predicted samples so that subsequent weak learners focus more on particularly difficult cases. Once trained, the new decision tree is incorporated into the ensemble with a weight that reflects its contribution to reducing the training error.
- A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, Y. Freund, R. Schapire, 1997.
As part of its ensemble module, sklearn provides an AdaBoostClassifier implementation that supports two or more classes.
The code examples for this section are in the notebook gbm_baseline that compares the performance of various algorithms with a dummy classifier that always predicts the most frequent class.
The algorithms in this chapter use a dataset generated by the notebook feature-engineering from Chapter 4 on Alpha Factor Research that needs to be executed first.
sklearn
AdaBoost docs
The main idea behind the resulting Gradient Boosting Machines (GBM) algorithm is the training of the base learners to learn the negative gradient of the current loss function of the ensemble. As a result, each addition to the ensemble directly contributes to reducing the overall training error given the errors made by prior ensemble members. Since each new member represents a new function of the data, gradient boosting is also said to optimize over the functions hm in an additive fashion.
- Greedy function approximation: A gradient boosting machine, Jerome H. Friedman, 1999
The two key drivers of gradient boosting performance are the size of the ensemble and the complexity of its constituent decision trees. The control of complexity for decision trees aims to avoid learning highly specific rules that typically imply a very small number of samples in leaf nodes. We covered the most effective constraints used to limit the ability of a decision tree to overfit to the training data in Chapter 4 on Decision Trees and Random Forests.
In addition to directly controlling the size of the ensemble, there are various regularization techniques, such as shrinkage, that we encountered in the context of the Ridge and Lasso linear regression models in Chapter 7, Linear Models – Regression and Classification. Furthermore, the randomization techniques used in the context of random forests are also commonly applied to gradient boosting machines.
The ensemble module of sklearn contains an implementation of gradient boosting trees for regression and classification, both binary and multiclass.
The notebook gbm_baseline demonstrates how to run cross-validation for the GradientBoostingClassifier.
The notebook sklearn_gbm_tuning shows how to GridSearchCV to search for the best set of parameters. This can be very time-consuming to run.
The notebook sklearn_gbm_tuning_results displays some of the results that can be obtained.
scikit-klearn
Gradient Boosting docs
Over the last few years, several new gradient boosting implementations have used various innovations that accelerate training, improve resource efficiency, and allow the algorithm to scale to very large datasets. The new implementations and their sources are as follows:
- XGBoost (extreme gradient boosting), started in 2014 by Tianqi Chen at the University of Washington
- LightGBM, first released in January 2017, by Microsoft
- CatBoost, first released in April 2017 by Yandex
The book review the numerous algorithmic innovations that have emerged over time and subsequently converged (so that most features are available for all implementations) before illustrating their implementation.
####References
- xgboost - LightGBM Parameter Comparison
- xgboost vs LightGBM Benchmarks
- Depth- vs Leaf-wise growth
- Rashmi Korlakai Vinayak, Ran Gilad-Bachrach. “DART: Dropouts meet Multiple Additive Regression Trees.”
XGBoost, LightGBM, and CatBoost offer interfaces for multiple languages, including Python, and have both a sklearn interface that is compatible with other sklearn features, such as GridSearchCV and their own methods to train and predict gradient boosting models. The notebook gbm_baseline illustrates the use of the sklearn interface for each implementation. The library methods are often better documented and are also easy to use, so we'll use them to illustrate the use of these models.
The process entails the creation of library-specific data formats, the tuning of various hyperparameters, and the evaluation of results that we will describe in the following sections.
The notebook xgboost_lightgbm_catboost_tuning relies on the the scripts gbm_tuning.py and [gbm_utils.py]gbm_utils.py) to produce the cross-validation results shown in the book, where the visualizations are created in the notebook xgboost_lightgbm_catboost_tuning_results.
All libraries have their own data format to precompute feature statistics to accelerate the search for split points. These can also be persisted to accelerate the start of subsequent training.
- GitHub Repo
- Documentation
- Parameters
- XGBoost: A Scalable Tree Boosting System
- Accelerating the XGBoost algorithm using GPU computing. Mitchell R, Frank E., 2017, PeerJ Computer Science 3:e127
- XGBoost: Scalable GPU Accelerated Learning, Rory Mitchell, Andrey Adinets, Thejaswi Rao, 2018
- Nvidia Parallel Forall: Gradient Boosting, Decision Trees and XGBoost with CUDA, Rory Mitchell, 2017
- Awesome XGBoost
- GitHub Repo
- Documentation
- Parameters
- Python API
- LightGBM: A Highly Efficient Gradient Boosting Decision Tree
- On Grouping for Maximum Homogeneity
- Python API
- CatBoost: unbiased boosting with categorical features
- CatBoost: gradient boosting with categorical features
Understanding why a model predicts a certain outcome is very important for several reasons, including trust, actionability, accountability, and debugging. Insights into the nonlinear relationship between features and the outcome uncovered by the model, as well as interactions among features, are also of value when the goal is to learn more about the underlying drivers of the phenomenon under study.
The notebook model_interpretation illustrates the methods listed below.
There are three primary ways to compute global feature importance values:
- Gain: This classic approach introduced by Leo Breiman in 1984 uses the total reduction of loss or impurity contributed by all splits for a given feature. The motivation is largely heuristic, but it is a commonly used method to select features.
- Split count: This is an alternative approach that counts how often a feature is used to make a split decision, based on the selection of features for this purpose based on the resultant information gain.
- Permutation: This approach randomly permutes the feature values in a test set and measures how much the model's error changes, assuming that an important feature should create a large increase in the prediction error. Different permutation choices lead to alternative implementations of this basic approach.
In addition to the summary contribution of individual features to the model's prediction, partial dependence plots visualize the relationship between the target variable and a set of features. The nonlinear nature of gradient boosting trees causes this relationship to depend on the values of all other features.
At the 2017 NIPS conference, Scott Lundberg and Su-In Lee from the University of Washington presented a new and more accurate approach to explaining the contribution of individual features to the output of tree ensemble models called SHapley Additive exPlanations, or SHAP values.
This new algorithm departs from the observation that feature-attribution methods for tree ensembles, such as the ones we looked at earlier, are inconsistent — that is, a change in a model that increases the impact of a feature on the output can lower the importance values for this feature.
SHAP values unify ideas from collaborative game theory and local explanations, and have been shown to be theoretically optimal, consistent, and locally accurate based on expectations. Most importantly, Lundberg and Lee have developed an algorithm that manages to reduce the complexity of computing these model-agnostic, additive feature-attribution methods from O(TLDM) to O(TLD2), where T and M are the number of trees and features, respectively, and D and L are the maximum depth and number of leaves across the trees.
This important innovation permits the explanation of predictions from previously intractable models with thousands of trees and features in a fraction of a second. An open source implementation became available in late 2017 and is compatible with XGBoost, LightGBM, CatBoost, and sklearn tree models.
Shapley values originated in game theory as a technique for assigning a value to each player in a collaborative game that reflects their contribution to the team's success. SHAP values are an adaptation of the game theory concept to tree-based models and are calculated for each feature and each sample. They measure how a feature contributes to the model output for a given observation. For this reason, SHAP values provide differentiated insights into how the impact of a feature varies across samples, which is important given the role of interaction effects in these nonlinear models.
SHAP values provide granular feature attribution at the level of each individual prediction, and enable much richer inspection of complex models through (interactive) visualization. The SHAP summary scatterplot displayed at the beginning of this section offers much more differentiated insights than a global feature-importance bar chart. Force plots of individual clustered predictions allow for more detailed analysis, while SHAP dependence plots capture interaction effects and, as a result, provide more accurate and detailed results than partial dependence plots.