DATA SCIENCE FUNDAMENTALS
From ML Model to ML Pipeline
With Scikit-learn in Python
Building a machine learning model is not only about choosing the right algorithm and tuning its hyperparameters. A significant amount of time is spent wrangling data and engineering features before model experimentation begins. These preprocessing steps can easily overwhelm your workflow and become hard to track. Shifting focus from the ML model to the ML pipeline, and treating the preprocessing steps as an integral part of building a model, can help keep your workflow organised. In this post, we will first look at the wrong way to preprocess data for a model, then learn a correct approach, followed by two ways to build an ML pipeline.
ML Pipeline has many definitions depending on the context. In this post, an ML pipeline is defined as a collection of preprocessing steps plus a model. This means that when raw data is passed to the ML pipeline, it preprocesses the data into the right format, scores the data with the model and returns a prediction.
📦 0. Setup
Let’s import libraries and a sample data: a subset of the titanic dataset (the data is available through Seaborn under the BSD-3 licence).
We will now define commonly used variables to easily reference later on:
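The original setup snippet is not shown here, so below is a minimal sketch of what it might look like. The variable names (TARGET, NUMERICAL, CATEGORICAL, FEATURES) and the exact column subset are my own choices, not necessarily the author's; the snippet falls back to a tiny inline sample so it also runs without the Seaborn download:

```python
import numpy as np
import pandas as pd

# A subset of the titanic columns (an assumed selection).
columns = ["survived", "pclass", "sex", "age", "fare", "embarked"]
try:
    import seaborn as sns
    df = sns.load_dataset("titanic")[columns]
except Exception:
    # Offline fallback: a tiny synthetic sample with the same columns.
    df = pd.DataFrame({
        "survived": [0, 1, 1, 0, 1, 0, 1, 0],
        "pclass":   [3, 1, 3, 2, 1, 3, 2, 3],
        "sex":      ["male", "female", "female", "male",
                     "female", "male", "female", "male"],
        "age":      [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
        "fare":     [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
        "embarked": ["S", "C", "S", "S", "S", "S", "S", "C"],
    })

# Commonly used variables, referenced throughout the rest of the post.
TARGET = "survived"
NUMERICAL = ["age", "fare"]
CATEGORICAL = ["pclass", "sex", "embarked"]
FEATURES = NUMERICAL + CATEGORICAL
```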
It’s time to look at the first approach.
❌ 1. Wrong approach
It’s not uncommon to use pandas methods like this when preprocessing:
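A sketch of this leaky pandas-only preprocessing, using a tiny inline stand-in for the titanic subset (the values are illustrative, not the real data):

```python
import numpy as np
import pandas as pd

# Tiny inline stand-in for the titanic subset (illustrative values).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
    "sex":  ["male", "female", "female", "male",
             "female", "male", "female", "male"],
})

# ❌ Statistics computed on the ENTIRE dataset -- test rows leak in.
for col in ["age", "fare"]:
    df[col] = df[col].fillna(df[col].mean())                    # impute with overall mean
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())  # scale to [0, 1]

# ❌ Resulting columns depend on whichever categories happen to be present.
df = pd.get_dummies(df, columns=["sex"])
```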
We imputed missing values, scaled numerical variables between 0 and 1 and one-hot-encoded categorical variables. After preprocessing, the data is partitioned and a model is fitted:
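A self-contained sketch of that flow, repeating the leaky preprocessing compactly and only then splitting and fitting (the stand-in data and model choice are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Tiny inline stand-in, preprocessed the "wrong" way on the full data.
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
    "sex":  ["male", "female", "female", "male",
             "female", "male", "female", "male"],
})
for col in ["age", "fare"]:
    df[col] = df[col].fillna(df[col].mean())
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
df = pd.get_dummies(df, columns=["sex"])

X = df.drop(columns="survived")
y = df["survived"]

# ❌ Partitioning only AFTER preprocessing: test statistics already leaked.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
```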
Okay, let’s analyse what was wrong with this approach:
◼️ Imputation: Numerical variables should be imputed with a mean from the training data instead of the entire data.
◼️ Scaling: Min and max should be calculated from the training data.
◼️ Encoding: Categories should be inferred from the training data. In addition, even if the data is partitioned prior to preprocessing, one-hot-encoding with pd.get_dummies(X_test) can result in inconsistent training and test data (i.e. the columns may vary depending on the categories present in each dataset). Therefore, pd.get_dummies() should not be used for one-hot-encoding when preparing data for a model.
💡 Test data should be set aside prior to preprocessing. Any statistics such as mean, min and max used for preprocessing should be derived from the training data. Otherwise, there will be a data leakage problem.
Now, let’s assess the model. We will use ROC-AUC to evaluate the model. We will create a function that calculates ROC-AUC since it will be useful for evaluating the subsequent approaches:
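The post's helper isn't shown here; a sketch of what it might look like (the function name is my own, and it assumes any fitted classifier or pipeline exposing predict_proba):

```python
from sklearn.metrics import roc_auc_score

def print_roc_auc(model, X_train, y_train, X_test, y_test):
    """Print train and test ROC-AUC for a fitted binary classifier
    (or pipeline) that exposes predict_proba."""
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
        print(f"{name} ROC-AUC: {auc:.4f}")
```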
❔ 2. Correct approach but …
We will partition the data first, then preprocess it using Scikit-learn’s transformers to prevent data leakage:
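A sketch of the correct order of operations, again on a tiny inline stand-in: split first, then fit every transformer on the training data only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Tiny inline stand-in for the titanic subset (illustrative values).
df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
    "sex":  ["male", "female", "female", "male",
             "female", "male", "female", "male"],
})

# ✅ Partition FIRST: every statistic below comes from the training data.
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "fare", "sex"]], df["survived"],
    test_size=0.25, stratify=df["survived"], random_state=42)

imputer = SimpleImputer(strategy="mean")
scaler = MinMaxScaler()
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit + transform on the training data only.
X_train_num = scaler.fit_transform(imputer.fit_transform(X_train[["age", "fare"]]))
X_train_cat = encoder.fit_transform(X_train[["sex"]]).toarray()
X_train_prep = np.hstack([X_train_num, X_train_cat])
```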
Lovely, we can fit the model now:
We need to preprocess the test dataset in the same way before evaluating:
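A self-contained sketch of that step: the same transformers are reused on the test set with transform() only, never refit (stand-in data and helper names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
    "sex":  ["male", "female", "female", "male",
             "female", "male", "female", "male"],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "fare", "sex"]], df["survived"],
    test_size=0.25, stratify=df["survived"], random_state=42)

imputer = SimpleImputer(strategy="mean")
scaler = MinMaxScaler()
encoder = OneHotEncoder(handle_unknown="ignore")

# Fit on train, then fit the model on the preprocessed training data.
X_train_num = scaler.fit_transform(imputer.fit_transform(X_train[["age", "fare"]]))
X_train_cat = encoder.fit_transform(X_train[["sex"]]).toarray()
model = LogisticRegression().fit(np.hstack([X_train_num, X_train_cat]), y_train)

# Same steps for the test set, using transform() only -- never fit.
X_test_num = scaler.transform(imputer.transform(X_test[["age", "fare"]]))
X_test_cat = encoder.transform(X_test[["sex"]]).toarray()
X_test_prep = np.hstack([X_test_num, X_test_cat])
auc = roc_auc_score(y_test, model.predict_proba(X_test_prep)[:, 1])
```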
Awesome, this time the approach was correct. But writing good code doesn’t stop at being correct. For each preprocessing step, we stored interim outputs for both the training and test datasets. As the number of preprocessing steps increases, this soon becomes very tedious to keep up with and therefore prone to errors, such as missing a step when preprocessing the test data. This code can be made more organised, streamlined and readable. That’s what we will do in the next sections.
✔️ 3. Elegant approach #1
Let’s streamline the previous code using Scikit-learn’s Pipeline and ColumnTransformer. If you aren’t familiar with them, this post explains them concisely. The resulting pipeline:
◼️ Splits input data into numerical and categorical groups
◼️ Preprocesses both groups in parallel
◼️ Concatenates the preprocessed data from both groups
◼️ Passes the preprocessed data into the model
When raw data is passed to the trained pipeline, it preprocesses the data and makes a prediction. This means we no longer have to store interim results for either the training or test dataset. Scoring unseen data is as simple as pipe.predict(). That’s very elegant, isn’t it? Now, let’s evaluate the performance of the model:
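A self-contained sketch of the whole approach, from raw stand-in data to evaluation; the step names inside the pipeline are my own:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
    "sex":  ["male", "female", "female", "male",
             "female", "male", "female", "male"],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "fare", "sex"]], df["survived"],
    test_size=0.25, stratify=df["survived"], random_state=42)

# Numerical and categorical groups are preprocessed in parallel ...
numerical = Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", MinMaxScaler())])
categorical = Pipeline([("encoder", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer([("num", numerical, ["age", "fare"]),
                                  ("cat", categorical, ["sex"])])

# ... then concatenated and passed into the model.
pipe = Pipeline([("preprocessor", preprocessor),
                 ("model", LogisticRegression())])

pipe.fit(X_train, y_train)  # raw data in, fitted pipeline out
auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
```

Note that evaluation now takes raw test data directly; the pipeline applies the train-time statistics internally.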
Great to see that it matches the performance of the previous approach, since the transformation was exactly the same, just written in a more elegant way. For our small example, this is the best of the four approaches shown in this post.
Scikit-learn’s out-of-the-box transformers such as SimpleImputer are fast and efficient. However, these prebuilt transformers may not always fulfil our unique preprocessing needs. In those cases, being familiar with the next approach gives us more control over bespoke ways of preprocessing.
✔️ 4. Elegant approach #2
In this approach, we will create custom transformers with Scikit-learn. Seeing how the preprocessing steps we have already familiarised ourselves with translate into custom transformers will hopefully help you grasp the main idea behind constructing them. If you are interested in example use cases of custom transformers, check out this GitHub repository.
Unlike before, the steps run sequentially, with each step passing its output to the next step as input. It’s time to evaluate the model:
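A sketch of what such custom transformers might look like: each one learns its statistics in fit() from the training data and reuses them in transform(). The class names, the stand-in data and the dummies-with-reindex encoding trick are my own choices, not necessarily the author's:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

class MeanImputer(BaseEstimator, TransformerMixin):
    """Impute numerical columns with means learned during fit."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, X, y=None):
        self.means_ = X[self.cols].mean()
        return self
    def transform(self, X):
        X = X.copy()
        X[self.cols] = X[self.cols].fillna(self.means_)
        return X

class MinMaxScalerDF(BaseEstimator, TransformerMixin):
    """Scale numerical columns using min/max learned during fit."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, X, y=None):
        self.min_ = X[self.cols].min()
        self.range_ = X[self.cols].max() - self.min_
        return self
    def transform(self, X):
        X = X.copy()
        X[self.cols] = (X[self.cols] - self.min_) / self.range_
        return X

class OneHotEncoderDF(BaseEstimator, TransformerMixin):
    """One-hot-encode with columns fixed at fit time, so train and
    test always get an identical layout (unlike bare pd.get_dummies)."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, X, y=None):
        self.columns_ = pd.get_dummies(X[self.cols].astype(str)).columns
        return self
    def transform(self, X):
        dummies = (pd.get_dummies(X[self.cols].astype(str))
                   .reindex(columns=self.columns_, fill_value=0))
        return pd.concat([X.drop(columns=self.cols), dummies], axis=1)

df = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1, 0, 1, 0],
    "age":  [22.0, 38.0, np.nan, 35.0, 54.0, 2.0, 27.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 7.90],
    "sex":  ["male", "female", "female", "male",
             "female", "male", "female", "male"],
})
X_train, X_test, y_train, y_test = train_test_split(
    df[["age", "fare", "sex"]], df["survived"],
    test_size=0.25, stratify=df["survived"], random_state=42)

# Steps run sequentially, each passing its output to the next.
pipe = Pipeline([
    ("imputer", MeanImputer(["age", "fare"])),
    ("scaler", MinMaxScalerDF(["age", "fare"])),
    ("encoder", OneHotEncoderDF(["sex"])),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
```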
Yay, we just learned another elegant way to achieve the same result as before. While we used only prebuilt transformers in the third approach and only custom transformers in the fourth, they can be used together, provided the custom transformers are defined to work coherently with the out-of-the-box ones.
That was it for this post! One benefit of the latter two approaches is that hyperparameter tuning can be done on the entire pipeline rather than only on the model. I hope you have learned practical ways to start using ML pipelines. ✨