Model Selection Fatigue — What to choose?




Decision fatigue is a psychological phenomenon: when you’re tired, you make terrible decisions. C’mon, remember that time ___fill in the blanks___? It’s the same when you’re selecting a machine learning model: when you’re tired, bad choices get made. Lucky for you, I’ve made some simple-to-follow code that makes those decisions much easier! Let me show you an easy way to select a model for logistics data; our objective is to predict the time it takes a package to be delivered. Follow/run the Jupyter notebook while reading 📚

Data science is 20% modeling, 60% cleaning and 20% complaining about cleaning.

Let’s avoid data cleaning confusion and run a data cleaning script (further explained here).

!python data_cleaning.py

Too easy! That saves a file clean-data.csv where the first column is the target variable y and the remaining columns are the features X. As other articles also suggest, we use an 80–20 split for training and test data:
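Here is a minimal sketch of that split (I’m assuming the file name and column layout above; the random_state is just for reproducibility):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the cleaned data: first column is the target y, the rest are features X
df = pd.read_csv("clean-data.csv")
y = df.iloc[:, 0]
X = df.iloc[:, 1:]

# Hold out 20% of the rows as a test set (the 80-20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)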

Now, we’re ready to train different models and evaluate them on our test set. Since we’re predicting a numerical value, the time taken for a package to be delivered, the mean squared error (MSE), i.e. the average of the squared differences between predicted and actual delivery times, is a suitable evaluation metric. We look at three different models:

  1. Linear life: Linear regression fits a linear relationship between X and y.
  2. Amazon forest: XGBoost is an ensemble of decision trees with gradient boosting (just a better way of combining results).
  3. Deep (& Meaningful) stuff: A multilayer perceptron (MLP) is a class of neural network that can account for nonlinear, complex relationships between X and y.

A simple way to compare and select models is to wrap each one in a sklearn pipeline and then show the results as a pandas dataframe, as in the sketch below.
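A minimal sketch of that comparison, assuming the X_train/X_test split from above (the scaling step and MLP settings are my choices, not necessarily those of the original notebook):

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# One pipeline per candidate model; scaling helps the linear and MLP models
models = {
    "Linear regression": make_pipeline(StandardScaler(), LinearRegression()),
    "XGBoost": XGBRegressor(),
    "MLP": make_pipeline(StandardScaler(), MLPRegressor(max_iter=500)),
}

# Fit each model, score it on the held-out test set, and collect the results
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    results.append({"model": name, "MSE": mse, "parameters": model.get_params()})

# Show the comparison as a dataframe, best (lowest MSE) model first
print(pd.DataFrame(results).sort_values("MSE"))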

The results look something like this: the XGBoost regressor has the lowest MSE, and the dataframe also shows each model’s default parameters (you can adjust this column name, of course). Linear regression predicts terribly due to the large number of columns (1712) and most likely suffers from multicollinearity. The MLP is just as good as XGBoost, but a more rigorous comparison would involve tuning the hyperparameters.
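If you want to take that last step, here is a hedged sketch of hyperparameter tuning for the MLP with grid search (the parameter grid and step names are illustrative, not from the original notebook):

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative grid; the article doesn't say which hyperparameters to tune
pipe = Pipeline([("scale", StandardScaler()), ("mlp", MLPRegressor(max_iter=500))])
grid = GridSearchCV(
    pipe,
    param_grid={
        "mlp__hidden_layer_sizes": [(50,), (100,), (100, 50)],
        "mlp__alpha": [1e-4, 1e-3, 1e-2],
    },
    scoring="neg_mean_squared_error",  # sklearn maximizes, so MSE is negated
    cv=5,
)
grid.fit(X_train, y_train)
print(grid.best_params_, -grid.best_score_)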

Thanks for taking the time to read ✌️

If you learned something new, please like, share, and comment below!
