Random Forest and Decision Trees by hand — no coding




A pink sakura tree (Unsplash)

Introduction

In this article, we will discuss Decision Trees and Random Forest, two algorithms used in Machine Learning for classification and regression tasks.

I will show how to build a Decision Tree from scratch using a pen and paper and how to generalise this and build a Random Forest model.

Dataset

Let’s see how this works in practice with a simple dataset.

Dataset (Image by author)

Here, we are trying to predict whether or not someone has a college degree. We have two numerical features for this: the person’s age and their salary (in thousands of dollars).

Initial hypotheses about the dataset

One thing I always recommend when you look at a new dataset is to come up with hypotheses about it and then check whether they hold.

I personally have two assumptions here:

  • Someone with a high salary is more likely to have a college degree. I am basing this on the fact that many high-paying professions (medicine, law, tech, consulting, finance and more) require a college degree.
  • Younger people are more likely to go to college than their elders. I am basing this on research by the Pew Research Center and on personal experience (I have often heard older people around me say they feel like my generation goes to college a lot more than they did, and now I know it is true).

I quickly computed some averages and found that the average age of someone in the dataset who went to college is 34 years, compared to 43 years for someone who did not. Regarding salary, someone who went to college has an average salary of 63K, compared to 37K for someone who did not. Thus, both our assumptions hold for this dataset.
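
If you want to check hypotheses like these on your own data, a quick way is to compute group averages with pandas. Below is a minimal sketch with made-up numbers, purely to illustrate the computation; the real values are the ones in the dataset figure above.

```python
import pandas as pd

# Hypothetical toy data, only to show the computation;
# the actual values are in the dataset figure above.
df = pd.DataFrame({
    "age": [25, 30, 45, 50],
    "salary": [60, 70, 35, 40],            # in thousands of dollars
    "degree": ["Yes", "Yes", "No", "No"],
})

# Average age and salary per group (degree vs. no degree).
print(df.groupby("degree")[["age", "salary"]].mean())
```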

It is not mandatory to do this but it is always useful and allows you to better understand the dataset you are working with.

Decision Tree

Now, let’s see how we can train a Decision Tree on this dataset and make predictions.

Root node (Image by author)

At the beginning, we have 6 Yes and 4 No. This will make our root node.

At any time, we have access to the true labels (i.e. throughout the decision tree, we always know whether an example is a Yes or a No). Keep this in mind; it is important for what comes next.

At each node of the decision tree, we can predict one of the classes. Here, we can predict either Yes or No.

  • If we predict Yes, we will correctly classify 6 of the 10 datapoints and misclassify 4. Thus, we will have an error rate of 4/10 = 0.4.
  • If we predict No, we will correctly classify 4 of the 10 datapoints and misclassify 6. Thus, we will have an error rate of 6/10 = 0.6.

Thus here, we should predict Yes to get the lowest error rate.
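
As a quick sanity check, here is a minimal sketch of this majority-class error rate computation, using the 6 Yes and 4 No labels of the root node:

```python
from collections import Counter

labels = ["Yes"] * 6 + ["No"] * 4          # the 10 true labels at the root node
counts = Counter(labels)

for prediction in ("Yes", "No"):
    errors = len(labels) - counts[prediction]    # every label that does not match the prediction
    print(prediction, errors / len(labels))      # Yes -> 0.4, No -> 0.6
```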

Now, let’s see how we can improve on this. Keep in mind that we still have not used any features here, only the true labels of the dataset.

Splits

The question now is how we can use our features to improve our predictions and do better than an error rate of 0.4. We will use a threshold for this: if someone has a salary higher than X, we predict Yes; if their salary is lower than X, we predict No. Let’s see an example:

Splitting on Salary (Image by author)

Here, we still have our root node, but instead of predicting straight away, we first split on Salary at 32.5K (spoiler: it is the best split here) and then make our predictions; again, we have access to the true labels. Our error rate is now 0.2: we only misclassify the two yellow points in the right-hand box labelled 6 | 2, where we predict Yes although the true label is No.

So, two questions: why did we decide to split on Salary first (and not on Age or another feature if we had more) and why did we split at 32.5K and not at another value?

How to find the best threshold?

Salaries ordered from smallest to biggest, Green is Yes and Yellow is No (Image by author)

Here, I have ordered the salaries from smallest to biggest. The idea to find the best threshold for a given feature is the following:

  • Sort the values of the feature (like above)
  • Take the middle value between each pair of consecutive points
  • Use this middle value as a threshold, like earlier, and compute the error rate
  • Do this for all the middle values (i.e. all the possible splits)
  • Pick the best split for the feature (see the code sketch after this list)
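
Translated into code, the threshold search for a single numeric feature could look like the sketch below. It uses the plain error-rate criterion from the walkthrough; the function names (side_errors, best_threshold) are just illustrative.

```python
from collections import Counter

def side_errors(labels):
    """Errors made on one side of a split when predicting its majority class."""
    if not labels:
        return 0
    return len(labels) - max(Counter(labels).values())

def best_threshold(values, labels):
    """Scan all midpoints of a numeric feature and return (best threshold, error rate)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_err = None, float("inf")
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                 # no midpoint between identical values
        t = (pairs[i][0] + pairs[i + 1][0]) / 2      # the midpoint convention from the article
        left = [label for value, label in pairs if value <= t]
        right = [label for value, label in pairs if value > t]
        err = (side_errors(left) + side_errors(right)) / n
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err
```

On the Salary column of our dataset, this search should land on 32.5 with an error rate of 0.2, matching the table of splits shown further below.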

Let’s see an example.

Splitting at 42K of salary (Image by author)

Here, we take 42 as the threshold, which is the middle value between 40 and 44. This divides our list of numbers into two parts.

For each part, we can predict either Yes or No.

  • If we predicted Yes on the left and No on the right (the opposite of what is shown in the picture), we would make 6 errors, an error rate of 0.6.
  • If we did it the other way around, as in the picture, we would make 4 errors, an error rate of 0.4.

Therefore, the lowest error rate we can achieve here is 0.4 (what is on the picture).

The idea now is to compute the loss for all the possible split values. Let’s see what this looks like.

All the possible splits for Salary (Image by author)

Here, we have all the possible ways we could split on Salary. We see that 32.5K is the best value to split on, as it has the lowest error rate.

Keep in mind that whether you split on 42 or 42.005 does not matter as long as you pick a threshold between the two values you have. Here is what I mean:

Split region (Image by author)

Here, you can pick any value in the red region, as long as it is greater than 40 and smaller than 44 (so 40.1 works, as well as 43.9). The reason is that every value in this region gives the same error rate, so which one you pick does not matter. We simply pick the midpoint as a convention.

Deciding on which variable to split

Now, we know that, out of all the ways we could split on Salary, a threshold of 32.5K is the best way to split.

Now, how do we decide whether to split on Age or on Salary? Simple: compute the best error rate for Age like we did for Salary, compare the two, and pick the feature with the lowest error rate as our split. Let’s do that.

Summary of all possible splits at the root (Image by author)

The lowest error rate we can achieve with Age is 0.3, while it is 0.2 with Salary. Thus, we split on Salary at 32.5K for our first split.
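
In code, comparing features simply amounts to running the threshold search on each of them and keeping the lowest error rate. A sketch, reusing best_threshold from the earlier snippet:

```python
def best_split(features, labels):
    """features: dict mapping feature name -> list of values, one per example.
    Returns (feature name, threshold, error rate) of the best split found."""
    best_name, best_t, best_err = None, None, float("inf")
    for name, values in features.items():
        t, err = best_threshold(values, labels)      # from the earlier sketch
        if err < best_err:
            best_name, best_t, best_err = name, t, err
    return best_name, best_t, best_err
```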

Now, we recurse: as long as a node still makes errors and we are not out of features, we keep splitting. Here is what our tree looks like so far:

Current tree (Image by author)

  • On the left side, you see that we cannot do better (we have perfectly classified those 2 points). Therefore, we do not need to split further.
  • On the right side, we can still split on Age and potentially do better.

Let’s compute all the possible splits for Age. Here, we have 8 datapoints instead of 10, as we are only working with the datapoints on the right side of the tree.

We start like we did earlier by sorting the datapoints.

Sorting the 8 datapoints left (Image by author)

Now, we get the following list of splits.

Splits on Age (Image by author)

Thus, we split on Age at 52.5 and predict Yes if the age is lower than 52.5 and No if it is higher. This gives us the following tree:

Final tree for this dataset (Image by author)

We get an error rate of 0.1 for our final tree, as it only misclassifies one point (predicting Yes where the true label is No). We cannot split further, as we are out of features to split on.
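
The final tree is small enough to write out as a pair of nested conditions. This sketch simply mirrors the tree in the figure above, with the 32.5 and 52.5 thresholds from the walkthrough:

```python
def predict(age, salary):
    """Prediction rule of the final tree (salary in thousands of dollars)."""
    if salary <= 32.5:
        return "No"       # left branch: low salary -> no degree
    if age <= 52.5:
        return "Yes"      # right branch, younger -> degree
    return "No"           # right branch, older -> no degree

print(predict(age=30, salary=60))   # "Yes"
```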

Categorical Features

If you have categorical instead of numerical features, the process is the same as above. You will simply do the splits on the values that you have.

For example, if you had a categorical feature indicating whether someone lives in the countryside or in a city, you would split on it, see how well that split performs, and then compare it with the other features. The process is basically the same; this is why I covered numerical features first: with categorical features, the candidate splits are already given by the category values.
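
For instance, a binary categorical feature gives you a single candidate split, and you score it exactly as before. A minimal sketch with a hypothetical countryside/city feature and made-up labels:

```python
from collections import Counter

def split_error(groups):
    """Error rate when each group predicts its own majority class."""
    n = sum(len(g) for g in groups)
    errors = sum(len(g) - max(Counter(g).values()) for g in groups)
    return errors / n

# Hypothetical labels, grouped by the value of the categorical feature.
countryside = ["No", "No", "Yes"]
city        = ["Yes", "Yes", "Yes", "No"]
print(split_error([countryside, city]))   # compare against the best splits of the other features
```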

When to stop growing a tree?

Earlier, we did not have other features that we could split on (we had 2 features, Age and Salary, and split twice). Thus, this question did not come up.

In practice, you will have many more features, you will be able to split much more, and your trees will grow much bigger.

As I covered in a previous article, you will have a bias-variance tradeoff here. The deeper the tree, the better it will fit the dataset but the higher the risk of overfitting. Your tree will start to capture the noise of the data. Thus, the question of when to stop growing a tree arises.

There are two main techniques for this: early stopping and pruning.

In early stopping, you stop splitting a given node of the tree based on a given condition. Here are examples of early stopping conditions:

  • Setting the minimum number of samples required at a leaf node
  • Setting the maximum depth of the tree

In pruning, you first grow the tree like we did earlier, letting it become more complex than it would be with early stopping. Then, you remove some nodes based on a given condition.

Both have their pros and cons and are used in practice. I recommend having a look at the Scikit-learn documentation for this. The early stopping conditions I covered correspond to the max_depth and min_samples_leaf parameters, while pruning is controlled by ccp_alpha (cost-complexity pruning). There are many more parameters, and I recommend exploring the documentation as much as possible.
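
In Scikit-learn, a quick way to experiment with these controls looks something like the sketch below; the hyperparameter values are arbitrary, and X and y stand for your own feature matrix and labels.

```python
from sklearn.tree import DecisionTreeClassifier

# Early stopping: limit the depth and require a minimum number of samples per leaf.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Pruning: cost-complexity pruning is controlled by ccp_alpha (larger means more pruning).
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01)

# Then fit on your own data: tree.fit(X, y) and pruned_tree.fit(X, y)
```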

Random Forest

Now, how to build a Random Forest classifier? Simple.

First, you decide on a number of Decision Trees to build. Then, for each tree, you sample uniformly from your dataset (with replacement) as many times as there are examples in the dataset. So, if you have 100 examples, you will sample 100 points. “With replacement” means that once you have sampled a given point, you do not remove it from the dataset (so you can sample the same point more than once).

This gives us the following procedure for building a Random Forest classifier:

  • Sample with replacement from your dataset
  • Train one of the decision trees on this sub-dataset
  • Repeat for the desired number of trees

Once you have your set of Decision Trees, you simply take the majority vote. In our previous example, imagine you had trained 100 decision trees. If 64 trees predicted Yes and 36 predicted No for a given example, then you would predict Yes.
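
Putting the bagging procedure together in code could look like the sketch below. It leans on Scikit-learn's DecisionTreeClassifier for the individual trees, and it assumes X and y are NumPy arrays with integer-encoded labels (e.g. 0 = No, 1 = Yes); in practice you would reach for RandomForestClassifier directly.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_forest(X, y, n_trees=100, random_state=0):
    """Train each tree on a bootstrap sample of the same size as the dataset."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)             # sample n rows with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over all trees (labels assumed to be non-negative integers)."""
    votes = np.stack([tree.predict(X) for tree in trees])    # shape: (n_trees, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```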

Going further

I stayed relatively high-level in this article and did not cover a few things, such as the different metrics you can use for your splits and how trees work for regression problems. There are also many more parameters that you can find in the Scikit-learn documentation.

However, this should serve as a solid foundation for anyone who has not worked with this algorithm before. I recommend re-doing the calculations on a piece of paper and checking that you get the same results. This really helps with understanding the algorithm.

I recently wrote an article on Ridge and Lasso, two regularized algorithms for regression problems. Although they cannot be used for classification, the article will help you understand the bias-variance tradeoff. This is useful in the context of Decision Trees to understand why you should stop early when training your trees or prune them.
