Online resources on linear regression come in two kinds. The first kind paints a simplistic picture of it, as if a few lines of code were enough to implement regression and make a great prediction.
At the other end of the spectrum, regression is introduced as a complex algorithm with intimidating concepts like regularization with Ridge/LASSO/ElasticNet, hyper-parameter tuning, cross-validation, etc.
I am not going in either of those directions. I’d rather focus on a range of issues you may encounter while implementing regression, and on ways to get around them. I am calling it a “regression checklist”.
Note that the checklist provided below isn’t a linear workflow. Treat it as an iterative process that you repeat until you are satisfied with your model.
What is linear regression?
Linear regression is a statistical procedure to find the relationship between two or more variables and make a prediction based on that relationship. For example, we know that age and height are two correlated variables — i.e., kids get taller as they get older. Machine learning takes this statistical concept to make a prediction. That means, if we knew the age of kids, we could reasonably predict their heights.
If only one variable is used as a predictor (e.g. age as a predictor of height), it’s simple linear regression. When more than one predictor is used, it is called multiple regression. An example of multiple regression is predicting the volume of a tree based on its age and height.
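To make the age and height example concrete, here is a minimal sketch of simple linear regression using NumPy's least-squares solver. The ages and heights are made-up numbers for illustration only, not real measurements.

```python
import numpy as np

# Made-up ages (years) and heights (cm), for illustration only.
age = np.array([2, 4, 6, 8, 10, 12], dtype=float)
height = np.array([86, 102, 115, 128, 138, 149], dtype=float)

# Simple linear regression: height ≈ intercept + slope * age.
# The column of ones lets the solver estimate the intercept.
design = np.column_stack([np.ones_like(age), age])
(intercept, slope), *_ = np.linalg.lstsq(design, height, rcond=None)

# Use the fitted line to predict the height of a 7-year-old.
predicted_height = intercept + slope * 7
print(f"slope = {slope:.2f} cm per year, intercept = {intercept:.2f} cm")
print(f"predicted height at age 7: {predicted_height:.1f} cm")
```

Multiple regression would only add more columns to `design`, one per extra predictor.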
What follows below is a typical machine learning workflow with a list of issues that data scientists encounter in a regression project.
1. Data wrangling
Those who collect data and those who develop models are often different people. So when a dataset is handed over to a modeler, it can come in a variety of forms and shapes and require quite a bit of work: first to understand the data, and then to transform it into a usable format for modeling.
So data wrangling simply means cleaning up the data, understanding what each feature represents and transforming data into a better shape. Some initial steps you can take include:
- Checking feature names and units of measurement.
- Checking size and shape (rows & columns) of the dataset.
- Checking the data types of the features and whether any need a type conversion (e.g. converting strings to a numeric data type).
- Running descriptive stats to see the central tendency and dispersion of numeric data and observe anything unusual. Make sure to run descriptive stats separately for categorical variables.
- Checking if categorical features are stored as “object”; if so, you may need to convert them into “category” type.
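The checks above can be sketched with pandas roughly as follows. The tiny DataFrame and its column names (`age_years`, `height_cm`, `school`) are hypothetical stand-ins for whatever dataset you were handed.

```python
import pandas as pd

# Hypothetical raw data: a numeric column stored as strings,
# plus a categorical column stored as plain text.
df = pd.DataFrame({
    "age_years": ["5", "7", "9", "11"],
    "height_cm": [110.0, 122.5, 133.0, 144.5],
    "school": ["north", "south", "north", "east"],
})

print(df.shape)    # number of rows and columns
print(df.dtypes)   # spot columns that need a type conversion

# Convert strings to numbers; errors="coerce" turns unparseable values into NaN.
df["age_years"] = pd.to_numeric(df["age_years"], errors="coerce")

# Store the categorical feature as "category".
df["school"] = df["school"].astype("category")

# Descriptive stats, run separately for numeric and categorical columns.
print(df.describe())
print(df.describe(include="category"))
```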
2. Feature selection
A linear relationship between each predictor and the dependent variable is a prerequisite in linear regression. If a feature doesn’t demonstrate any relationship with the dependent variable, it isn’t going to add any value to the model, and you should eliminate it. The goal of this step is a set of predictors with strong explanatory power.
How to find the best set of features? Here are a few tools:
- Scatterplot and pair plot
- Correlation and heat map
- Principal Component Analysis (PCA)
- Statistical tests of linear relationship: t-test, ANOVA
- Test for multicollinearity with Variance Inflation Factor (VIF)
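Two of the tools above, correlation with the target and VIF, can be sketched like this. The data is synthetic (`x2` is deliberately built as a near copy of `x1`), and the `vif` helper is my own illustration computed with plain NumPy from the definition VIF = 1 / (1 - R^2), not a library function.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated noise.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "y": y})

# Correlation of each predictor with the dependent variable.
target_corr = df.corr()["y"].drop("y")
print(target_corr)

def vif(features: pd.DataFrame) -> pd.Series:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    feature j on all other features (with an intercept)."""
    values = {}
    for col in features.columns:
        target = features[col].to_numpy()
        others = features.drop(columns=col).to_numpy()
        design = np.column_stack([np.ones(len(features)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        values[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(values)

vif_scores = vif(df[["x1", "x2", "x3"]])
print(vif_scores)
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of multicollinearity. Here `x1` and `x2` should land far above that and `x3` close to 1, so one of the first two is a candidate for removal.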
3. Feature engineering
Feature engineering means transforming one or more existing features, or deriving new ones from them. The result is a new feature that differs in some way from the original.
The following is a checklist for feature engineering:
a) Creating new features by decomposing existing ones: e.g. splitting up a timestamp column (e.g. 06/30/2021) gives you new columns: year, month, weekday, holiday dummies etc.
b) Creating calculated columns: e.g. from two existing columns in a housing dataset, price and floor area, you create a new calculated column (“$/sqft”).
c) Feature transformation
- Feature encoding: label encoder, one-hot-encoder/dummy variables
- Rescaling: Normalization, Standardization
- Log transformation
- Data type conversion
d) Outlier treatment/Winsorizing
- Statistical techniques: boxplot, distribution plot, scatterplot, inter-quartile range (IQR)
- Time series: moving average
- Machine learning technique: Local Outlier Factor (LOF)
e) Missing values treatment
- Remove rows
- Remove columns (if too many missing in a single feature)
- Impute with an appropriate value (e.g. mean, median)
f) Regularization
- Lasso regression (L1 regularization)
- Ridge regression (L2 regularization)
- ElasticNet (a combination of L1 & L2 regularization)
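Here is a rough pandas sketch of items a) to e) from the checklist above, applied to a hypothetical housing table. The column names and values are made up, and clipping to the 1.5 × IQR fences is only one winsorizing-style way to treat outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data; names and values are made up for this sketch.
df = pd.DataFrame({
    "sold_on": ["06/30/2021", "07/04/2021", "07/15/2021", "08/01/2021", "08/20/2021"],
    "price": [250_000.0, 310_000.0, np.nan, 2_900_000.0, 275_000.0],
    "floor_area_sqft": [1_250.0, 1_500.0, 1_100.0, 1_400.0, 1_300.0],
    "type": ["condo", "house", "condo", "house", "townhouse"],
})

# a) Decompose a timestamp into new columns.
sold_on = pd.to_datetime(df["sold_on"], format="%m/%d/%Y")
df["year"] = sold_on.dt.year
df["month"] = sold_on.dt.month
df["weekday"] = sold_on.dt.dayofweek  # Monday=0 ... Sunday=6

# e) Impute the missing price with the median.
df["price"] = df["price"].fillna(df["price"].median())

# d) Treat outliers: clip prices to the 1.5 * IQR fences.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
df["price_clipped"] = df["price"].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# b) Calculated column: price per square foot.
df["price_per_sqft"] = df["price_clipped"] / df["floor_area_sqft"]

# c) Transformations: log, standardization, one-hot encoding.
df["log_price"] = np.log(df["price_clipped"])
area = df["floor_area_sqft"]
df["area_std"] = (area - area.mean()) / area.std()
df = pd.get_dummies(df, columns=["type"], drop_first=True)

print(df.drop(columns="sold_on"))
```

The order matters in this sketch: the missing price is imputed before the IQR fences are computed, and the calculated column uses the clipped price so the one extreme sale does not dominate it.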
4. Creating a benchmark model
After data cleaning, feature selection and feature engineering, you are ready to build your first model. It is not the final model but a “benchmark model”: a reference point against which later improvements are measured. Key steps in building a benchmark model include:
i) Storing the dependent (y) and predictor variables (X) separately
ii) Splitting data into training and testing sets
iii) Importing and instantiating model (with default parameters)
iv) Fitting model to training data
v) Predicting on testing data
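Steps i) to v) can be sketched with scikit-learn as below. The data is synthetic and stands in for a cleaned dataset. The 80/20 split and `random_state=42` are arbitrary choices of mine, not recommendations from the checklist.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# i) Predictors (X) and dependent variable (y) stored separately.
#    Synthetic stand-in for a cleaned dataset with two predictors.
X = rng.uniform(0, 10, size=(200, 2))
y = 4.0 + 2.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=200)

# ii) Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# iii) Instantiate the model with default parameters.
model = LinearRegression()

# iv) Fit on the training data only.
model.fit(X_train, y_train)

# v) Predict on the held-out testing data.
y_pred = model.predict(X_test)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```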
5. Performance evaluation
Once you’ve built the benchmark model, the next step is evaluating how it performed on the testing data. A number of evaluation metrics are available to choose from:
- Error metrics: MAE/MSE/RMSE/MAPE
- Model comparison metrics: AIC, BIC, R^2
- Residual plot
- Q-Q plot
- Histogram of errors for normality check
- Comparing errors (e.g. RMSE) with the average of y values (to check how large the errors are relative to a typical y value)
- K-fold cross validation
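A few of these checks can be sketched with scikit-learn as follows, again on synthetic data. The plots are left out; the `residuals` array is what you would feed into a residual plot, Q-Q plot or histogram.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)
# Synthetic data: one predictor, a linear signal, noise with std 2.
X = rng.uniform(0, 10, size=(300, 1))
y = 20.0 + 3.0 * X[:, 0] + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Error metrics on the testing data.
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

# RMSE relative to the average of y: how large is a typical error?
relative_rmse = rmse / y_test.mean()

# Residuals, the input for a residual plot, Q-Q plot or histogram.
residuals = y_test - y_pred

# 5-fold cross validation. scikit-learn reports negated RMSE so that
# higher is always better, hence the sign flip.
cv_rmse = -cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_root_mean_squared_error"
)

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
print(f"RMSE / mean(y) = {relative_rmse:.1%}")
print("CV RMSE per fold:", cv_rmse.round(2))
```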
6. Model tuning
From the evaluation metrics, you now have an idea of how bad your model is (yes, your benchmark model is supposed to be bad). You can now go ahead and see what knobs you can tune to make it better. You do this iteratively and may end up with many different versions of the model and data, so make sure you use a version control system such as DVC to track the changes. Here are some options for model tuning:
- Fit polynomial
- Forward selection of features
- Backward elimination
- Checking feature coefficients (weed out features that contribute little or nothing to the model)
- Cross validation
- Gradient Descent
- Apply another algorithm (e.g. Random Forest Regression, Support Vector Regression)
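As one example of turning a knob, the sketch below compares a plain linear benchmark against a polynomial fit, with and without Ridge regularization, using cross validation. The data is synthetic with a deliberately curved relationship. The degree (2) and `alpha=1.0` are arbitrary values for illustration, not tuned ones.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1)
# Synthetic data with a curved (quadratic) relationship.
X = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 0] ** 2 + rng.normal(scale=1.0, size=200)

candidates = {
    "linear (benchmark)": LinearRegression(),
    "degree-2 polynomial": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
    "degree-2 polynomial + ridge": make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    ),
}

# Score every candidate with the same 5-fold cross validation.
cv_rmse = {}
for name, estimator in candidates.items():
    scores = cross_val_score(
        estimator, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()
    print(f"{name}: CV RMSE = {cv_rmse[name]:.2f}")
```

Scoring all candidates with the same folds keeps the comparison fair, and a dictionary like `candidates` makes it easy to add another algorithm later.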
7. Final model
Once you’ve built the “right” model that you are satisfied with, it’s time to deploy it into production. Production can mean anything from a simple desktop app to a commercial deployment on the cloud. Regardless, a few final items in the checklist include:
- Retraining the model on the full dataset
- Sanity check with a domain expert
- Saving the model (e.g. with joblib)
- Interpreting the model in real-world terms
- Periodically checking for model drift
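A minimal sketch of retraining, saving and a crude drift check is below. The data is synthetic, the file path is a throwaway temporary location, and the "1.5 times the baseline RMSE" threshold is an arbitrary assumption of mine. A real drift check would depend on your data and use case.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Retrain the chosen model on the full dataset (no held-out split).
final_model = LinearRegression().fit(X, y)

# Save the model with joblib and load it back, as a deployment would.
model_path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(final_model, model_path)
loaded_model = joblib.load(model_path)

# Interpretation: each coefficient is the expected change in y for a
# one-unit change in that feature, holding the other feature fixed.
print("coefficients:", loaded_model.coef_)

def rmse(model, X_new, y_new):
    """Root mean squared error of the model on a batch of labeled data."""
    return float(np.sqrt(np.mean((y_new - model.predict(X_new)) ** 2)))

baseline_rmse = rmse(loaded_model, X, y)

# Simulated "new" data where the intercept has shifted from 1.0 to 4.0.
X_new = rng.uniform(0, 10, size=(50, 2))
y_new = 4.0 + 2.0 * X_new[:, 0] + 0.5 * X_new[:, 1] + rng.normal(scale=0.5, size=50)

# Crude drift flag: error on new data well above the baseline error.
drifted = rmse(loaded_model, X_new, y_new) > 1.5 * baseline_rmse
print("drift suspected:", drifted)
```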