Ace your Machine Learning Interview — Part 7
We have come to the seventh article in my “Ace your Machine Learning Interview” series, in which I recap some of the basics of Machine Learning, hoping it will help you face your next interview!😁
In case you are interested in the previous articles in this series, I leave the links here:
- Ace your Machine Learning Interview — Part 1: Dive into Linear, Lasso and Ridge Regression and their assumptions
- Ace your Machine Learning Interview — Part 2: Dive into Logistic Regression for classification problems using Python
- Ace your Machine Learning Interview — Part 3: Dive into Naive Bayes Classifier using Python
- Ace your Machine Learning Interview — Part 4: Dive into Support Vector Machines using Python
- Ace your Machine Learning Interview — Part 5: Dive into Kernel Support Vector Machines using Python
- Ace your Machine Learning Interview — Part 6: Dive into Decision Trees using Python
Ensemble Learning is a method that lets you combine multiple predictors, whether classifiers or regressors. In this way, you can build a more robust system that draws on the knowledge of several Machine Learning models.
If you think about it, it’s kind of what we normally do as humans when we are undecided about something. I remember an episode of The Big Bang Theory in which Sheldon was undecided about whether to buy an Xbox or a PlayStation. He asks his friends’ opinions, hoping to get a majority and make the decision accordingly. I will leave you the video here!
The Ensembling method does the same thing. To classify a point, it asks several classifiers for their opinion and takes the majority result. (Majority voting is not the only option, but it is the most common.)
General Ensemble Learning
So we have already figured out how to construct an Ensembling method: we train several classifiers, for example, and then take the majority vote to classify a new instance.
The following image pretty much sums it up.
To be technical, when we are dealing with binary classification we say we are using Majority Voting, whereas for multiclass classification we say Plurality Voting. In practice, you use the mode function, which in statistics returns the most frequent element of a list.
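The mode-based voting described above can be sketched in a few lines of NumPy. The predictions below are made up for illustration; each row holds the votes of three hypothetical classifiers for one test point:

```python
import numpy as np

# Hypothetical predictions: rows = test points, columns = classifiers
predictions = np.array([
    [0, 1, 1],   # two classifiers vote class 1 -> ensemble predicts 1
    [0, 0, 1],
    [2, 2, 0],
    [1, 1, 1],
    [0, 2, 2],
])

# Plurality voting = take the mode of each row (most frequent class label)
votes = np.apply_along_axis(lambda row: np.bincount(row).argmax(),
                            axis=1, arr=predictions)
print(votes)  # → [1 0 2 1 2]
```

`np.bincount(row).argmax()` is exactly the mode for non-negative integer labels, which is why label encoding is usually applied first.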
But what do these C1, C2, …, Cn classifiers look like? They can be any classifiers you prefer; the important thing is that you combine their predictions with a function such as majority voting. For example, we could use a Logistic Regression, an SVM and a Decision Tree.
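Scikit-learn already ships a ready-made combiner for this pattern, `VotingClassifier`. A minimal sketch with the three algorithms just mentioned (hyperparameters here are illustrative, not the article’s exact setup):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Three heterogeneous classifiers combined by hard (majority) voting
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC()),
        ("dt", DecisionTreeClassifier(random_state=1)),
    ],
    voting="hard",  # each model casts one vote; the mode wins
)

ensemble.fit(X_train, y_train)
acc = ensemble.score(X_test, y_test)
print(f"Ensemble accuracy: {acc:.2f}")
```

`voting="soft"` would instead average predicted class probabilities, which requires every base estimator to implement `predict_proba`.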
But instead of using all different algorithms in the Ensembling method, can we use the same algorithm every time? The answer is yes!
Bagging, Pasting and RandomForest
So far we have always taken the whole dataset and used it to train the different algorithms that we then grouped into an Ensembling method.
Another thing we can do is always use the same algorithm, for example three SVMs, but train each SVM on a different subset of the initial dataset. The subset can be a subset of the records in the dataset or a subset of the features (keeping all the records). In the end, as usual, we take the predictions and apply Majority Voting.
Suppose we have a training set of 900 records. We could draw a random subset of 300 records for each SVM. Note that if we draw this subset with replacement, we are using a method called Bagging. Without replacement, it is called Pasting.
Have you ever used, or seen used, an Ensembling method based on Bagging or Pasting? I bet you have: the Random Forest!
We mentioned in the previous article that a Decision Tree (DT) is often used in an Ensembling method to increase its robustness. An ensemble composed only of Decision Trees is called a Random Forest, a widely used algorithm that performs very well.
We have already mentioned that a DT always tries to split on the best feature it can find. A Random Forest, on the other hand, when it instantiates each of its many DTs, assigns it only a subset of the features, so that DT has to search for the best splits only within that subset.
There is a variant of Random Forest, called Extra Trees (Extremely Randomized Trees), that allows for faster training but adds noise. In this case, each split of a DT, in addition to being searched within a random subset of the features, is made at a randomly chosen threshold. For this reason they are very fast, and we can instantiate multiple DTs at the same time. Extra Trees trade a little more bias for lower variance and are faster to train.
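Both ensembles are one-liners in scikit-learn, so it is easy to compare them side by side. A quick sketch on Iris (the scores printed here are whatever cross-validation yields on your machine, not figures from the article):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Random Forest: best threshold searched within a random feature subset
rf = RandomForestClassifier(n_estimators=100, random_state=1)
# Extra Trees: thresholds are drawn at random as well -> faster, noisier trees
et = ExtraTreesClassifier(n_estimators=100, random_state=1)

rf_acc = cross_val_score(rf, X, y, cv=5).mean()
et_acc = cross_val_score(et, X, y, cv=5).mean()
print(f"RandomForest: {rf_acc:.3f}  ExtraTrees: {et_acc:.3f}")
```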
Let us first look at the results of 3 different algorithms on the Iris dataset. The dataset is provided by sklearn under an open license; it can be found here. The dataset is as follows.
We will use these 3 algorithms: LogisticRegression, DecisionTreeClassifier and KNeighborsClassifier.
In addition, for the first and the last, we will do some data preprocessing using a StandardScaler. To make things easier, we will group each algorithm and the scaler in a sklearn pipeline.
Each algorithm and pipeline will be evaluated using cross-validation with 10 folds.
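A sketch of this setup is below. The article reports ROC AUC, which for multiclass data needs a one-vs-rest scorer; for simplicity this sketch uses accuracy, so the numbers will not match the article’s figures, and all hyperparameters are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Scale features for LogisticRegression and KNN; a Decision Tree needs no scaling
clf_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_dt = DecisionTreeClassifier(random_state=1)
clf_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

results = {}
for name, clf in [("LogisticRegression", clf_lr),
                  ("DecisionTree", clf_dt),
                  ("KNN", clf_knn)]:
    scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross-validation
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.2f} (std: {scores.std():.2f})")
```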
The results obtained by the three algorithms are the following:
- Logistic Regression: ROC AUC: 0.92, std: 0.15
- Decision Tree: ROC AUC: 0.87, std: 0.18
- KNN: ROC AUC: 0.85, std: 0.13
Let’s see if we can do better with majority voting. We first define a class called MajorityVotingClassifier as follows. We only need to inherit from BaseEstimator and ClassifierMixin in order to get common estimator functionality such as get_params/set_params.
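The original code for the class is not reproduced here, but a minimal sketch along the lines described (plurality voting over a list of fitted clones, with a LabelEncoder so the mode computation works on integer labels) could look like this:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.preprocessing import LabelEncoder

class MajorityVotingClassifier(BaseEstimator, ClassifierMixin):
    """Combine an arbitrary list of classifiers via plurality voting."""

    def __init__(self, classifiers):
        self.classifiers = classifiers

    def fit(self, X, y):
        # Encode labels to 0..n-1 so np.bincount works in predict
        self.le_ = LabelEncoder().fit(y)
        self.classifiers_ = [clone(clf).fit(X, self.le_.transform(y))
                             for clf in self.classifiers]
        return self

    def predict(self, X):
        # Collect each classifier's vote, then take the most frequent label
        predictions = np.asarray(
            [clf.predict(X) for clf in self.classifiers_]).T
        maj = np.apply_along_axis(lambda row: np.bincount(row).argmax(),
                                  axis=1, arr=predictions)
        return self.le_.inverse_transform(maj)

# Quick sanity check on Iris (hypothetical usage)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
mv = MajorityVotingClassifier([LogisticRegression(max_iter=1000),
                               DecisionTreeClassifier(random_state=1)])
acc = mv.fit(X, y).score(X, y)  # score() comes from ClassifierMixin
print(f"Training accuracy: {acc:.2f}")
```

Inheriting from BaseEstimator is what gives us `get_params`/`set_params` for free, provided `__init__` only stores its arguments unchanged.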
Now we just need to instantiate a new classifier of type MajorityVotingClassifier and add it to the list of those we already had to compare them.
Let’s see the results obtained. Yours will probably be slightly different from mine.
- Logistic Regression: ROC AUC: 0.92, std: 0.15
- Decision Tree: ROC AUC: 0.87, std: 0.18
- KNN: ROC AUC: 0.85, std: 0.13
- Majority Voting: ROC AUC: 0.98, std: 0.05
You can see that this time Majority Voting managed to outperform the individual classifiers by several percentage points.
So by combining weaker classifiers, we were able to create a classifier that achieved a score of 98%!
The Ensemble Method is often used in competitions such as Kaggle, where it sometimes provides those few extra percentage points needed to win. On the other hand, it is not always useful in industry, since it is often not worth training multiple classifiers for small improvements in performance, so one must use this method consciously.
I myself used it to win the best system award in the NLP competition called EVALITA2020 in which I created a classifier based on the ensembling technique. You can read more about the system I developed to do stance detection here.😁