The Ultimate Beginner Guide to Boosting
Ensemble methods for dummies. A step-by-step tutorial with hands-on Python code.
Boosting is a meta-algorithm from the ensemble learning paradigm in which multiple models (often termed “weak learners”) are trained to solve the same problem and combined to get better results. This article introduces everything you need to get started with boosting. The idea grew out of a question posed by the computational theorists Michael Kearns and Leslie Valiant: can a set of weak learners be combined into a single strong learner?
I recently wrote about bagging, which builds the same model on multiple bootstraps from the data and combines each model’s prediction to get an overall classification or prediction.
One of the big advantages of bagging is that it can be parallelized. Boosting, on the other hand, works sequentially: it starts with a weak learner and adds a new model trained on the residuals of that weak learner.
This step is repeated many times. Each model is trained on a modified version of the original dataset, and its contribution is scaled by a learning rate that acts as regularization. Unlike bagging, boosting leverages teamwork: each model that runs dictates which samples the next model will focus on. Boosting therefore trains weak learners sequentially and adaptively.
Each model might be a poor fit for the data, but a linear combination of the ensemble can be expressive and flexible. Each new simple model added to the ensemble compensates for the weaknesses of the current ensemble.
We’ll motivate Boosting by noticing patterns in the errors that a single classifier makes and how training a weak model on those errors can improve accuracy.
Why Does Boosting Work?
Let’s work with a small subset of the HIGGS dataset in the UCI machine learning repository. This 2014 paper contains further details about the data.
Each row represents an experiment in which beams of protons collide at high energy. The class column differentiates between collisions that produce Higgs bosons (value 1) and collisions that produce only background noise (value 0). We are interested in predicting the class using the boosting technique.
We first fit tree1, a decision tree with depth 3, to the training data. For each predictor, we make a plot that compares two distributions: the values of that predictor for samples that tree1 classified correctly, and the values for samples that it classified incorrectly.
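A minimal sketch of this step, assuming scikit-learn and a synthetic dataset from make_classification as a stand-in for the HIGGS subset (every name other than tree1 is illustrative, not taken from the original code). Instead of drawing the plots, it prints the per-predictor mean for the two groups; the same two arrays could be passed to a histogram function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the HIGGS subset used in the article.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit the depth-3 weak learner.
tree1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Boolean mask: True where tree1 classified a training sample correctly.
correct = tree1.predict(X_train) == y_train

# For each predictor, compare its values on correct vs. incorrect samples.
for j in range(X_train.shape[1]):
    right, wrong = X_train[correct, j], X_train[~correct, j]
    print(f"feature {j}: mean correct={right.mean():+.2f}, mean wrong={wrong.mean():+.2f}")
```

On synthetic data the two groups will not necessarily look as similar as they do in the article's figure; the point of the sketch is only the split into correctly and incorrectly classified samples.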
As the graphic below shows, the distributions of sample values for correct and wrong predictions are almost identical for each predictor. This suggests that the likelihood of a sample being misclassified does not depend on the value of any single predictor. Therefore, we can improve the predictive capability of the decision tree by combining it with another tree trained on misclassification information. This is the motivation behind boosting.
How Does Boosting Work?
The best way to understand boosting is to implement a simplified version of it using just two classifiers from scratch.
Let’s define the first classifier, tree1, as a simple decision tree with depth 3. The second classifier, tree2, is another decision tree with depth 3. tree1 is trained on the original training dataset. tree2 is trained on a modified training dataset, obtained by applying a weight of 2 to the samples that tree1 misclassified. Therefore, tree2 learns from the residuals of the weak learner tree1. The overall classification is computed by averaging the predictions from both trees.
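The two-tree scheme described above can be sketched as follows. This is an illustration under assumptions, not the article's original code: the data is synthetic (make_classification), the reweighting is passed through scikit-learn's sample_weight argument, and names such as weights and avg_proba are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data stands in for the HIGGS subset.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# First weak learner, trained on the original data.
tree1 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Weight 2 for samples tree1 misclassified, weight 1 for the rest.
weights = np.where(tree1.predict(X_train) != y_train, 2.0, 1.0)

# Second weak learner, trained on the reweighted data.
tree2 = DecisionTreeClassifier(max_depth=3, random_state=0)
tree2.fit(X_train, y_train, sample_weight=weights)

# Average the predicted class-1 probabilities and threshold at 0.5.
avg_proba = (tree1.predict_proba(X_test)[:, 1] + tree2.predict_proba(X_test)[:, 1]) / 2
y_pred = (avg_proba >= 0.5).astype(int)

print("tree1 test accuracy:   ", tree1.score(X_test, y_test))
print("combined test accuracy:", (y_pred == y_test).mean())
```

Whether the combined accuracy beats tree1 alone depends on the data; with only two trees and a fixed weight of 2 the gain is not guaranteed.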
The results show that training a second weak learner on the residuals produced by the first, and averaging the predicted probabilities of the two classifiers, increases the overall test accuracy. This is a great illustration of the boosting mechanism.
How Does Gradient Boosting Work?
Intuitively, each weak learner we add to the ensemble learns from the errors of the current ensemble. Thus, with each addition, the weighted residual influences the next weak learner. If we treat the weight as a tuning parameter, we can find its optimal value using the most popular optimization technique: gradient descent.
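To make the idea concrete, here is a simplified from-scratch sketch of gradient boosting for binary classification with log loss. It is a teaching illustration under assumptions, not the algorithm as implemented in any library: the data is synthetic, each regression tree is fitted to the negative gradient (y minus the predicted probability), and the leaf values are used as they are, scaled by a fixed learning rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Assumption: synthetic data; labels are 0/1.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

learning_rate, n_rounds = 0.1, 50
F = np.zeros(len(y))  # ensemble score (log-odds), starting at 0
trees, losses = [], []

for _ in range(n_rounds):
    residual = y - sigmoid(F)  # negative gradient of log loss with respect to F
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    F += learning_rate * tree.predict(X)  # gradient-descent step in function space
    trees.append(tree)
    p = np.clip(sigmoid(F), 1e-12, 1 - 1e-12)
    losses.append(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

print(f"training log loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("training accuracy:", ((F > 0).astype(int) == y).mean())
```

Each round nudges the ensemble score in the direction that reduces the loss, which is why the learning rate plays the same role here as the step size in ordinary gradient descent.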
In order to grasp how gradient boosting works, let’s reconsider a few decision trees of depth 1, 2, 3, and 4 as weak learners. This time, we will apply gradient descent with a learning rate of 0.05, and run boosting for 800 iterations.
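The experiment described above could be set up roughly as follows with scikit-learn's GradientBoostingClassifier. The dataset is a synthetic stand-in and the curves dictionary is an illustrative name; the depths, learning rate and number of iterations are the ones stated in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for the HIGGS subset.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

curves = {}
for depth in (1, 2, 3, 4):
    gb = GradientBoostingClassifier(
        max_depth=depth, learning_rate=0.05, n_estimators=800, random_state=0
    )
    gb.fit(X_train, y_train)
    # staged_predict yields the ensemble's predictions after each boosting iteration.
    train_acc = [accuracy_score(y_train, p) for p in gb.staged_predict(X_train)]
    test_acc = [accuracy_score(y_test, p) for p in gb.staged_predict(X_test)]
    curves[depth] = (np.array(train_acc), np.array(test_acc))
    print(f"depth {depth}: final train={train_acc[-1]:.3f}, final test={test_acc[-1]:.3f}")
```

Plotting the two arrays stored for each depth against the iteration number gives accuracy curves of the kind discussed next; the exact shapes will differ on synthetic data.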
With boosting, each new iteration learns from the residuals of the previous decision trees, so the training set accuracy keeps increasing until the model overfits the data and the accuracy levels off close to 100%.
As depicted below, the tree depth strongly affects the trends: training set accuracy is higher with deeper trees, regardless of the number of iterations.
Depth-1 and depth-2 trees are slow learners and need more iterations. Depth-3 and depth-4 trees learn faster: their test set accuracy rises steadily over the first iterations, reaches a peak, and then declines as the number of iterations grows. The depth-1 and depth-2 trees underfit the data, while the depth-4 tree overfits the most. Depth 3 appears to give the best trade-off between bias and variance.
Based on the plot we just made, what combination of weak-learner depth and number of iterations seems optimal?
A gradient-boosted depth-3 decision tree classifier found after 97 boosting iterations seems optimal in terms of low variance and low bias.
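One way to read such an optimum off programmatically is to track accuracy after every iteration and take the best one. The sketch below assumes synthetic data and uses a held-out validation split (X_val, y_val are illustrative names) rather than the test set, since choosing the iteration count on the test set gives an optimistic estimate. It will not reproduce the value of 97 reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic data stands in for the HIGGS subset.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

gb = GradientBoostingClassifier(
    max_depth=3, learning_rate=0.05, n_estimators=800, random_state=0
).fit(X_train, y_train)

# Validation accuracy after each boosting iteration.
val_acc = np.array([(pred == y_val).mean() for pred in gb.staged_predict(X_val)])

# +1 because stage index 0 corresponds to an ensemble of one tree.
best_iter = int(np.argmax(val_acc)) + 1
print(f"best number of iterations: {best_iter}, "
      f"validation accuracy: {val_acc[best_iter - 1]:.3f}")
```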
Gradient boosting achieves better accuracy than the manually boosted depth-3 trees from the previous section, which relied on updated sample weights.
The improved accuracy suggests that gradient boosting helps considerably. In particular, it outperforms the best accuracy attained through bagging in my previous article.
Bagging and boosting are so-called ‘ensemble’ techniques, which, by aggregating many weak-learner models, such as single decision trees, substantially improve the predictive accuracy.
In bagging, each tree can be trained independently. Bagging is therefore better suited to parallelization, for example on a multi-core CPU. Boosting is sequential, since each tree is built using the previous one.
Bagging is great for decreasing variance when a model is overfitting. Boosting is appropriate for decreasing bias in an underfit model.