Auto-Sklearn: Scikit-Learn on Steroids
Automate the “boring” stuff. Accelerate your model development lifecycle.
A typical machine learning workflow is an iterative cycle of data processing, feature processing, model training, and evaluation. Imagine having to experiment with different combinations of data processing methods, model algorithms, and hyperparameters until we reach a satisfactory model performance. This laborious and time-consuming task is commonly performed during hyperparameter optimization.
The objective of hyperparameter optimization is to find the optimal model pipeline components and their associated hyperparameters. Let’s assume a simple model pipeline that has two model pipeline components: an imputer step followed by a random forest classifier.
The imputer step has a hyperparameter called “strategy” which determines how the imputation is performed e.g. using mean, median or mode. The random forest classifier has a hyperparameter called “depth” which determines the maximum depth of an individual decision tree in the forest. Our objective is to find which combination of hyperparameters across model pipeline components provides the best result. Two common ways to do hyperparameter tuning are by using Grid Search or Random Search.
In grid search, we make a list of possible values for each hyperparameter and try all possible combinations of those values. In the case of our simple example, we have 3 imputer strategies and 3 different random forest classifier depths to try, hence there are 9 different combinations in total.
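For grid search, the nine combinations described above can be enumerated directly. The following is a minimal sketch; the three depth values (2, 4 and 6) and the variable names are assumptions chosen for illustration, not values from the article.

```python
from itertools import product

# Candidate values per hyperparameter; the depth values are illustrative assumptions.
imputer_strategies = ["mean", "median", "mode"]
depths = [2, 4, 6]

# Grid search tries every combination of the listed values.
grid = [
    {"strategy": strategy, "depth": depth}
    for strategy, depth in product(imputer_strategies, depths)
]

print(len(grid))
for params in grid:
    print(params)
```

Each dictionary in the list would correspond to one train-and-evaluate trial, so the cost grows multiplicatively with every hyperparameter added.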
In random search, we define the range and choices for each hyperparameter, and sets of hyperparameters are randomly chosen within these boundaries. In the case of our simple example, the range for depth is between 2 and 6 and the choices for imputer strategy are mean, median or mode.
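Random search over the same boundaries can be sketched as below. The helper name sample_hyperparameters, the fixed seed and the number of trials are assumptions made for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the draws are repeatable


def sample_hyperparameters(rng):
    """Draw one hyperparameter set at random within the stated boundaries."""
    return {
        "strategy": rng.choice(["mean", "median", "mode"]),
        "depth": rng.randint(2, 6),  # randint is inclusive on both ends
    }


# Each trial is drawn independently of the trials before it.
trials = [sample_hyperparameters(rng) for _ in range(5)]
for params in trials:
    print(params)
```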
Notice that the sets of hyperparameters in Grid and Random Search are selected independently of one another. Neither of these methods uses results from prior training and evaluation trials to improve results in the next trial. A more efficient way to do hyperparameter optimization is to use results from prior trials to improve the selection of hyperparameters for the next trial. This is the approach taken by Bayesian optimization.
Bayesian optimization stores previously searched hyperparameters and the results of a predefined objective function (e.g. binary cross entropy loss) and uses them to create a surrogate model. The purpose of the surrogate model is to quickly estimate the performance of the actual model given a particular set of candidate hyperparameters. This allows us to decide whether the set of candidate hyperparameters is worth training the actual model with. As the number of trials increases, the surrogate model, updated with additional trial results, improves and starts to recommend better candidate hyperparameters.
Bayesian optimization suffers from a cold start problem, as it requires trial data to build the surrogate model before it is able to recommend good candidate hyperparameters for the next trial. There are no historical trials for the surrogate model to learn from at the beginning, so the candidate hyperparameters are selected at random, which leads to a slow start in finding well-performing hyperparameters.
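The loop described above (random trials at the cold start, then a surrogate that ranks untried candidates before each real evaluation) can be sketched in a few lines. This is a deliberately simplified illustration and not how any real Bayesian optimizer is implemented: the objective is a toy function, and the surrogate is a nearest-neighbour lookup with a distance bonus standing in for the probabilistic model and acquisition function a real optimizer would use. All names here are hypothetical.

```python
import random


def objective(depth):
    """Toy stand-in for 'train the real model and return its validation score'."""
    return -(depth - 7) ** 2


def surrogate(depth, history, kappa=1.0):
    """Cheap estimate for an untried depth: the score of the nearest tried depth,
    plus a bonus that grows with distance to encourage exploring unseen regions."""
    nearest_depth, nearest_score = min(history, key=lambda t: abs(t[0] - depth))
    return nearest_score + kappa * abs(nearest_depth - depth)


def optimize(candidates, n_init=2, n_trials=8, seed=0):
    rng = random.Random(seed)
    # Cold start: with no history yet, the first trials are chosen at random.
    history = [(d, objective(d)) for d in rng.sample(candidates, n_init)]
    while len(history) < n_trials:
        tried = {d for d, _ in history}
        untried = [d for d in candidates if d not in tried]
        # Rank candidates with the cheap surrogate, not the expensive objective.
        next_depth = max(untried, key=lambda d: surrogate(d, history))
        history.append((next_depth, objective(next_depth)))  # one real evaluation
    return history


history = optimize(list(range(1, 21)))
best_depth, best_score = max(history, key=lambda t: t[1])
print(history)
print("best depth so far:", best_depth)
```

The point of the sketch is the division of labour: the surrogate is consulted for every candidate, while the expensive objective is called only once per trial.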
To overcome the cold start problem, Auto-Sklearn, an open source AutoML library, adds a warm start to Bayesian optimization through a process called meta-learning, so that the initial hyperparameters are better than random.
Automated Machine Learning (AutoML) is the process of automating tasks in the machine learning pipeline such as data preprocessing, feature preprocessing, hyperparameter optimization, model selection and evaluation. Auto-Sklearn automates these tasks for the popular Scikit-Learn machine learning framework. The image below shows how Auto-Sklearn works in a nutshell.
Auto-Sklearn uses Bayesian optimization with warm start (meta-learning) to find the optimal model pipeline, and builds an ensemble from the individual model pipelines at the end. Let’s examine the different components in the Auto-Sklearn framework.
The purpose of meta-learning is to find good initial hyperparameters for Bayesian optimization so that it performs better than random at the start. The intuition behind meta-learning is simple: datasets with similar meta features perform similarly on the same set of hyperparameters. Meta features, as defined by the Auto-Sklearn authors, are “characteristics of the dataset that can be computed efficiently and that help to determine which algorithm to use on a new dataset”.
During offline training, a total of 38 meta features such as skewness, kurtosis, number of features and number of classes were tabulated for 140 reference datasets from OpenML. Each reference dataset was trained using the Bayesian optimization process and the results were evaluated. The hyperparameters that gave the best results for each reference dataset are stored, and they serve as the initialization of the Bayesian optimizer for new datasets with similar meta features.
When training a model on a new dataset, the meta features of the new dataset are tabulated and the reference datasets are ranked by their L1 distance to the new dataset in the meta feature space. The stored hyperparameters from the 25 nearest reference datasets are used to initialize the Bayesian optimizer.
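The ranking step can be sketched as follows. The reference table, its three datasets, the four meta features and the stored configurations are all hypothetical, and the meta features are left unscaled for brevity; the sketch only illustrates ranking by L1 distance and taking the stored hyperparameters of the nearest neighbours.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two meta-feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def warm_start_configs(new_meta, reference, k):
    """Return the stored best hyperparameters of the k nearest reference datasets."""
    ranked = sorted(
        reference,
        key=lambda name: l1_distance(new_meta, reference[name]["meta"]),
    )
    return [reference[name]["best_config"] for name in ranked[:k]]


# Hypothetical reference table; meta features are
# (skewness, kurtosis, number of features, number of classes).
reference = {
    "dataset_a": {"meta": [0.1, 3.0, 20, 2], "best_config": {"strategy": "mean", "depth": 6}},
    "dataset_b": {"meta": [1.5, 9.0, 300, 10], "best_config": {"strategy": "median", "depth": 12}},
    "dataset_c": {"meta": [0.3, 2.5, 25, 2], "best_config": {"strategy": "mode", "depth": 4}},
}

new_meta = [0.2, 2.8, 22, 2]
initial_configs = warm_start_configs(new_meta, reference, k=2)
print(initial_configs)
```

With k set to 25 and a table of 140 reference datasets, the returned list would play the role of the initial trials that a cold-started optimizer would otherwise pick at random.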
The authors experimented with different variants of Auto-Sklearn on the reference datasets and compared them using the average ranking across different training durations. A lower rank indicates better performance. Variants with meta-learning (blue and green) show a steep drop in rank at the start due to good initialization of the Bayesian optimizer.
Auto-Sklearn preprocesses the data in the following order:
- One Hot Encoding of categorical features
- Imputation using mean, median or mode
- Rescaling features
- Balance the dataset using class weights
After data pre-processing, features may optionally be pre-processed with one or more of the following categories of feature pre-processors:
- Matrix decomposition using PCA, truncated SVD, kernel PCA or ICA
- Univariate feature selection
- Classification-based feature selection
- Feature clustering
- Kernel approximations
- Polynomial feature expansion
- Feature embeddings
- Sparse representation and transformation
During the training process, Auto-Sklearn trains multiple individual models which can be used to construct an ensemble model. An ensemble model combines the weighted outputs of multiple trained models to provide a final prediction. Ensembles are known to be less prone to overfitting and generally outperform single models.
In figure 1, the authors show that the variants that use an ensemble perform better than the variants without one (black vs red and green vs blue). The variant with both meta-learning and ensemble (green) performs best.
Let’s take a look at some practical examples of Auto-Sklearn in action.
Install the package
pip install auto-sklearn==0.13
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.metrics import (accuracy, f1, roc_auc, precision,
                                 average_precision, recall, log_loss)
Load the dataset
We will be using a dataset from UCI that describes a bank’s marketing campaign inviting clients to place a term deposit. The target variable is yes if the customer agrees and no if the customer decides not to place a term deposit. You can find the original dataset here.
We read the dataset as a Pandas dataframe.
df = pd.read_csv('bank-additional-full.csv', sep = ';')
Prepare the data
Auto-Sklearn requires us to identify whether each column is numerical or categorical, either in the pandas dataframe or later in the fit function. Let’s convert the columns now.
num_cols = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
cat_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df[cat_cols] = df[cat_cols].apply(pd.Categorical)
y = df.pop('y')
X = df.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1, stratify=y)
Instantiate the classifier
skf = StratifiedKFold(n_splits=5)
clf = AutoSklearnClassifier(time_left_for_this_task=600,
                            memory_limit = 10240,
                            resampling_strategy = skf,  # use the stratified 5-fold splitter defined above
                            ensemble_size = 3,
                            metric = average_precision,
                            scoring_functions=[roc_auc, average_precision, accuracy, f1, precision, recall, log_loss])
Below are some of the parameters used in AutoSklearnClassifier:
- time_left_for_this_task: Limit on the total training time (in seconds)
- max_models_on_disc: Limit on the number of models to keep
- memory_limit: The amount of memory (in MB) which we want to utilize
- resampling_strategy: Holdout or different kinds of cross validation. Refer to this documentation.
- ensemble_size: Number of models to include in the ensemble. Auto-Sklearn provides an option to create an ensemble after the individual models are created, by taking the top ensemble_size models in a weighted fashion.
- metric: The metric which we want to optimize
- scoring_functions: One or more metrics on which we want to evaluate the model
Fit the classifier
clf.fit(X = X_train, y = y_train)
Under the hood, Auto-Sklearn constructs a Scikit-Learn pipeline during each trial. A Scikit-Learn pipeline assembles a series of steps that perform data processing and feature processing, followed by an estimator (classifier or regressor). The fit function triggers the whole process: Auto-Sklearn constructs, fits and evaluates multiple Scikit-Learn pipelines until the stopping criterion time_left_for_this_task is met.
We can view the results and the chosen hyperparameters.
df_cv_results = pd.DataFrame(clf.cv_results_).sort_values(by = 'mean_test_score', ascending = False)
We can also view the comparison among all trials on the leaderboard
clf.leaderboard(detailed = True, ensemble_only=False)
We can also view which pipelines were selected for the ensemble. The corresponding method returns a list of tuples [(weight_1, model_1), …, (weight_n, model_n)], where each weight indicates how much the ensemble relies on the output of that model. The weight values sum to 1.
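Given a list of (weight, model) tuples in the shape described above, the ensemble output can be thought of as a weighted average of the individual model outputs. The sketch below uses hypothetical constant-output functions as stand-ins for fitted pipelines, and the weights are made up; how Auto-Sklearn combines outputs internally may differ in detail.

```python
# Hypothetical stand-ins for fitted pipelines: each "model" maps a sample to P(yes).
def model_1(x):
    return 0.9


def model_2(x):
    return 0.6


def model_3(x):
    return 0.3


# Same shape as the list described above: [(weight_1, model_1), ..., (weight_n, model_n)].
models_with_weights = [(0.5, model_1), (0.3, model_2), (0.2, model_3)]


def ensemble_proba(x, models_with_weights):
    """Weighted average of the individual model outputs for one sample."""
    return sum(weight * model(x) for weight, model in models_with_weights)


total_weight = sum(weight for weight, _ in models_with_weights)
print(round(total_weight, 6))
print(round(ensemble_proba(None, models_with_weights), 6))
```

Because the weights sum to 1, the ensemble output stays within the range spanned by the individual model outputs.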
We can also view additional training statistics.
Refit with all the training data
During k-fold cross validation, Auto-Sklearn fits each model pipeline k times on the dataset for evaluation only; it does not keep any of the trained models. Therefore we need to call the refit method to fit the model pipelines found during cross validation on all of the training data.
clf.refit(X = X_train, y = y_train)
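The difference between the evaluation-only fits and the final refit can be illustrated with a toy example. Everything here is an assumption made for illustration: the "model" simply predicts the mean of its training targets, and the folds are hand-rolled contiguous blocks rather than the stratified folds used earlier. It is not a description of Auto-Sklearn's internals.

```python
def fit_mean_model(y_values):
    """Toy 'model': always predicts the mean of the targets it was fitted on."""
    mean = sum(y_values) / len(y_values)
    return lambda: mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        stop = n_samples if fold == k - 1 else start + fold_size
        valid = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, valid


targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Cross validation: k models are fitted and scored, then discarded.
fold_errors = []
for train_idx, valid_idx in k_fold_indices(len(targets), k=3):
    model = fit_mean_model([targets[i] for i in train_idx])
    errors = [abs(targets[i] - model()) for i in valid_idx]
    fold_errors.append(sum(errors) / len(errors))
cv_score = sum(fold_errors) / len(fold_errors)

# Refit: one final model trained on all of the training data, kept for prediction.
final_model = fit_mean_model(targets)
print(cv_score, final_model())
```

The cross validation score tells us how good the configuration is, while only the refitted model has seen every training sample and is the one worth keeping.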
Load Model and Predict
Let’s load the saved model pipeline for inference.
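from joblib import load
# assumes the fitted classifier was saved beforehand, e.g. with joblib.dump(clf, 'model.joblib')
clf = load('model.joblib')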
y_probas = clf.predict_proba(X_test)
pos_label = 'yes'
y_proba = y_probas[:, clf.classes_.tolist().index(pos_label)]
Searching for the optimal model pipeline components and hyperparameters is a non-trivial task. Fortunately, there are AutoML solutions such as Auto-Sklearn which can help automate the process. In this article, we examined how Auto-Sklearn uses meta-learning and Bayesian optimization to find the optimal model pipeline and construct a model ensemble. Auto-Sklearn is one of many AutoML packages out there. Check out other alternatives such as H2O AutoML.
You can find the demo code used in this article here.