Lasso and Ridge regression: an intuitive comparison




And how they can help you understand regularisation

Lasso and Ridge (The Elements of Statistical Learning)

Introduction

When people begin their Machine Learning journey, they often start with Linear Regression, one of the simplest algorithms out there. However, this model quickly shows its limitations, especially on datasets where it tends to overfit. The main remedies are Ridge and Lasso regression.

Bias Variance trade-off

To understand why these models are useful, we first need to discuss the bias-variance trade-off.

Bias Variance trade-off (The Elements of Statistical Learning)

There are two main sources of error for a model in a supervised setting: bias and variance.

  • Bias is the error from wrong assumptions in the learning algorithm. A high bias will make the algorithm miss the relevant relationships between the features and the target (also called underfitting).
  • Variance is the error due to sensitivity to small fluctuations in the training data. A high variance will make the algorithm model the random noise of the training data (also called overfitting).

Ideally, you want to find the sweet spot where the sum of these two components is minimized. That will give you the best-performing model.

Examples

First, we will see an example of underfitting:

An example of underfitting (Image by author)

Here, you see that the model does not capture the relationship between the features and the target well. Therefore, it has a high bias (the algorithm misses the relevant relationships between the features and the target) but a low variance (it does not model the random noise of the data).

On the contrary, here is an example of overfitting:

An example of overfitting (Image by author)

Here, you see that the algorithm understands the relationship between the features and the target but also models the noise of the data. Therefore, it has a low bias (the algorithm gets the relevant relationships between the features and the target) but a high variance (models the random noise of the training data).

Now, let’s see what an example of a good fit looks like:

An example of a good fit (Image by author)

Here, you see that the algorithm is able to model the relationship between the features and the target but it does not model the noise of the data. Therefore, it has low bias and low variance. This is the kind of fit we want to achieve.

What is the link between this and Ridge/Lasso?

When you fit a Linear Regression model, here is what happens. You have a set of features (often called X and represented as a matrix) and you want to find a set of coefficients (often called β and represented as a vector) by which you multiply the values in X to predict your target (often called y and represented as a vector).

Making predictions with a linear model (Image by author)
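In matrix form, the prediction is simply the product of X and β. Here is a minimal NumPy sketch of that computation; the numbers are made up purely to illustrate the shapes involved:

```python
import numpy as np

# Hypothetical values, just to show the shapes involved
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # features: one row per sample
beta = np.array([0.5, -1.0])    # coefficients found by the model

y_pred = X @ beta               # predictions: one value per sample
print(y_pred)                   # [-1.5 -2.5 -3.5]
```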

The problem is that, in some cases, a Linear Regression will overfit specific datasets. What do you do in this case? Use Ridge and Lasso regression.

How do these models work?

Lasso and Ridge are both Linear Regression models but with a penalty (also called regularisation). They penalise how big your beta vector can get, each in a different way.

Lasso regression

Lasso puts a penalty on the l1-norm of your Beta vector. The l1-norm of a vector is the sum of the absolute values in that vector.

l1-norm of a vector (Image by author)
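As a quick illustration, here is how the l1-norm of a (made-up) beta vector could be computed with NumPy:

```python
import numpy as np

beta = np.array([0.5, -1.0, 2.0])   # made-up coefficient vector
l1_norm = np.abs(beta).sum()        # |0.5| + |-1.0| + |2.0| = 3.5
print(l1_norm)
```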

This makes Lasso zero out some coefficients in your Beta vector. I will not explain why in detail, as it would overcomplicate this tutorial and requires a background in optimization. If you are interested in why this happens, check out this link.

To summarise it simply, using Lasso is like saying: “Try to achieve the best performance possible but if you find that some coefficients are useless, drop them”.

Ridge Regression

Ridge puts a penalty on the l2-norm of your Beta vector. The l2-norm of a vector is the square root of the sum of the squared values in that vector.

l2-norm of a vector (Image by author)
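And the same quick illustration for the l2-norm, again with a made-up beta vector:

```python
import numpy as np

beta = np.array([0.5, -1.0, 2.0])       # made-up coefficient vector
l2_norm = np.sqrt((beta ** 2).sum())    # sqrt(0.25 + 1.0 + 4.0) ≈ 2.29
print(l2_norm)
```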

This makes Ridge prevent the coefficients of your Beta vector from reaching extreme values (which often happens when overfitting).

To summarise it simply, using Ridge is like saying: “Try to achieve the best performance possible but none of the coefficients should have extreme values”.

Regularisation parameter

Both of these models have a regularisation parameter called lambda, which controls how large the penalty is. At λ=0, both Lasso and Ridge become plain Linear Regression models (we simply do not apply any penalty). By increasing lambda, we increase the constraint on the size of the beta vector. This is where each model optimises differently and tries to find the best set of coefficients given its own constraints.
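As a side note, if you use scikit-learn (which is what the sketches below assume, not necessarily the exact code in my screenshots), the regularisation parameter is exposed as alpha rather than lambda:

```python
from sklearn.linear_model import Lasso, Ridge

lmbda = 0.3                  # example value of the regularisation parameter
lasso = Lasso(alpha=lmbda)   # scikit-learn exposes lambda as "alpha"
ridge = Ridge(alpha=lmbda)
# As alpha approaches 0, both models behave like plain Linear Regression.
```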

An example: the Boston Housing dataset

Let’s try to see what problems we can face in practice with a dataset and how we can solve those with Ridge and Lasso.

To follow along, go to this link on my GitHub and simply follow the instructions in the Readme. The dataset I used can be downloaded here from Kaggle.

Boston Housing dataset

The Boston Housing dataset is from 1993 and is one of the most famous datasets in Machine Learning. The target is the median value of homes in Boston, while the features are the associated home and neighbourhood attributes.

Reading the dataset

The first step is to read the dataset and print its first 5 rows.

Reading the data (Image by author)

First, I define the names of the columns using a list. Then, I call read_csv with delim_whitespace=True to tell pandas that our data is separated by whitespaces and not commas, header=None to signify that the first line of the file is not the column header and finally names=colnames to instead use our list defined earlier as the names of the columns.

I also use .head(100) to only keep the first 100 rows of the dataset and not the full dataset with 505 rows of data. The reason for this is that I want to illustrate overfitting, which will be more likely if we have less data. In practice, you would keep the full dataset (in general, the more data, the better).
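For reference, here is a minimal sketch of that step; the file name housing.csv is an assumption, and the column names follow the standard Boston Housing layout rather than being copied from the screenshot:

```python
import pandas as pd

# Standard column names for the Boston Housing data; MEDV is the target
colnames = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Whitespace-separated file with no header row; keep only the first 100 rows
df = pd.read_csv("housing.csv", delim_whitespace=True,
                 header=None, names=colnames).head(100)
print(df.head())
```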

Train test split

The next step is to split our data into X (features) and y (target) and then to split those into a training set (X_train, y_train) and a test set (X_test, y_test). I put 80% of the data in the training set and 20% in the test set, which is one of the most common splits for Machine Learning problems.

Train test split (Image by author)
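A sketch of that split, assuming scikit-learn's train_test_split and MEDV as the target column (the random_state is an arbitrary choice for reproducibility, not taken from the screenshot):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["MEDV"])   # features
y = df["MEDV"]                  # target: median home value

# 80% of the rows for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```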

Fitting a linear regression model

After this, I fit a linear regression model on the training data and compute the Mean Squared Error (MSE) on the test data. Finally, I print out the Beta vector to see what the coefficients of our model look like.
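A minimal sketch of that step, again assuming scikit-learn (the exact MSE you get will depend on your split):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

mse = mean_squared_error(y_test, lin_reg.predict(X_test))
print(f"Linear Regression test MSE: {mse:.2f}")
print(lin_reg.coef_)   # the fitted Beta vector
```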

So, can we do better than ≈6.4 MSE? Yes.

Lasso Regression

In this example, I fit several Lasso regression models using a list of values of lambda, the regularisation parameter (the higher the lambda, the more we penalise the model, i.e. the more we restrict the sum of the absolute values of the beta vector).
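Here is a hedged sketch of that loop; the list of lambda values is illustrative and not necessarily the one used in the notebook:

```python
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

lambdas = [0.01, 0.1, 0.3, 1, 3, 10]    # candidate regularisation strengths
for lmbda in lambdas:
    lasso = Lasso(alpha=lmbda)          # alpha plays the role of lambda
    lasso.fit(X_train, y_train)
    mse = mean_squared_error(y_test, lasso.predict(X_test))
    print(f"lambda={lmbda}: test MSE = {mse:.2f}")
```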

We see that the highest performance is achieved at a value of Lambda=0.3 and the MSE is ≈4.2. Now, let’s see what the coefficients of our Beta vector look like.

Coefficients of the Beta vector with Lasso (Image by author)

You can see that the model has zeroed out around half of the coefficients. It has kept only 8 out of the 14 coefficients, and kept the weight of one of them quite large: RM, the average number of rooms per dwelling. This makes sense, as the number of rooms of a property is in general correlated with its price (an apartment for 6 people is almost always more expensive than an apartment for 1 person).

Therefore, you can see the link with what we discussed earlier. We “told” Lasso to find the best model given the constraint on how much weight could be put on each coefficient (i.e. the “budget”) and it “decided” to put a large amount of that “budget” on the number of rooms to figure out the price of the properties.

Now, let’s see what Ridge can do here.

Ridge Regression

Here, I apply the same steps as before with Lasso. The values of lambda that I use here are different. Keep in mind that values of lambda for Ridge and Lasso are not comparable, i.e. a lambda of 5 for Lasso is not in any sense equivalent to a lambda of 5 for Ridge.
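The corresponding sketch for Ridge (again, the lambda values here are only illustrative):

```python
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

lambdas = [0.1, 1, 3, 10, 30, 100]      # not comparable to the Lasso lambdas
for lmbda in lambdas:
    ridge = Ridge(alpha=lmbda)
    ridge.fit(X_train, y_train)
    mse = mean_squared_error(y_test, ridge.predict(X_test))
    print(f"lambda={lmbda}: test MSE = {mse:.2f}")
```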

We see here that we are able to do even better than before at lambda=3 with an MSE ≈ 4.1, which is better than both Lasso and Linear Regression. Now, let’s look at the Beta vector.

Beta vector for Ridge (Image by author)

We see that the coefficient for RM is still quite high (around 3.76) while all other coefficients have been decreased. However, none of them has been zeroed out like with Lasso.

This is the key difference between the two: Lasso will often zero out features while Ridge will reduce the weight of most in the model.

I invite you to go over the Beta vectors of each model and double-check the values: understanding what happens in the Beta vector is key to understanding these models.

How to decide which one to use?

Lasso is good when you have a few features with high predictive power while the others are useless. It will zero out the useless ones and keep only a subset of the variables.

Ridge is good when the predictive power of your dataset is spread out over the different features. It will not zero out features that could be helpful when making predictions but will simply reduce the weight of most variables in the model.

In practice, this is often hard to determine. Thus, the best approach is to do what I coded above: try different values of lambda and see which one gives the best MSE on the test set.

Going further

If you want to dig deeper into the maths (which I would advise since it will help you better understand how regularization works), I recommend reading Chapter 3.4 of The Elements of Statistical Learning by Trevor Hastie, Robert Tibshirani and Jerome Friedman. Robert Tibshirani is the inventor of Lasso and was my professor in Machine Learning at Stanford. His book is a reference in this field and goes deep into the maths while also giving the big picture of what is happening.

I also recommend re-implementing these models on a different dataset, seeing which one performs best, and trying to get a feel for why that is the case.
