Beginner’s Baseline Model for a Machine Learning Project



Baseline Model

What is a baseline model? We can define it as a reference point for the actual model. It should be simple, easy to explain, and serve as a point of comparison. Moreover, it should be built on the same dataset that is used to create the actual model.

Why do we want to have a baseline model in our project? There are three main reasons.


Understand the Data

The baseline model helps us understand our data better, especially the aspects that matter for model creation. While building it, we pick up several pieces of information, including:

  • Predictive power of the dataset. A baseline model with little to no predictive power could indicate a weak signal in the data or a poorly fitting model.
  • Dataset subset classification. The baseline model can help us identify which observations in the dataset are harder to classify. That information helps us select the model.
  • Dataset target classification. Inspecting the observations alongside the baseline predictions also shows which target values are harder to identify.

Faster Model Iteration

Model development and the surrounding process become easier with a baseline model in place. How does it make things faster? The baseline gives you a reference to build the actual model against, and sometimes the baseline is already good enough for the business user.

Performance Benchmark

We use the baseline model to obtain a performance metric to measure against while developing the actual model. With it, we can assess whether we need a complex model or whether a simple one already works for the business. Moreover, we can use the baseline model to benchmark the business KPI.

Creating the Baseline Model

We now understand what a baseline model is and the advantages of having one; all that remains is to create one. But what counts as a baseline model? Anything can serve as one, from a simple mean to a complex model. However, a model that is too complex defeats the purpose of a baseline, which is to be simple and fast, so we usually use a complex model as a baseline only when it is an established benchmark in research.

If we want to categorize baseline models, I would put them into two groups:

  1. Simple Baseline Model
  2. Machine Learning Baseline Model

Simple Baseline Model

The simple baseline model is a model with simple logic to create the baseline. It could be simply a random model prediction or a particular rule-based model. The point is to have a simple model that we can use to benchmark against the actual model.

A simple baseline model could come from simple statistics, business logic, or various stochastic models. The right choice also depends on the modeling problem, such as:

  1. Structured data or unstructured data
  2. Supervised or unsupervised problem
  3. Classification or Regression
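To illustrate how the problem type changes the baseline, here is a minimal sketch of a simple-statistics baseline for a regression problem. It is not part of the original walkthrough: the synthetic data and the names X_demo, y_demo, and mean_baseline are made up for illustration, and it uses DummyRegressor from Scikit-Learn with the mean strategy.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data, invented purely for this illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# Baseline: always predict the mean of the training target
mean_baseline = DummyRegressor(strategy="mean")
mean_baseline.fit(X_demo[:150], y_demo[:150])
y_hat = mean_baseline.predict(X_demo[150:])

# Any real regression model should beat this error
baseline_mae = mean_absolute_error(y_demo[150:], y_hat)
print(f"MAE of the mean baseline: {baseline_mae:.3f}")
```

For a regression problem, the mean (or median) of the target plays the same role that the most frequent label plays for classification.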

Depending on the problem definition, you might want a different baseline model. For the sake of example, let’s stick to a classification problem on tabular data. I will use the Breast Cancer dataset in the example below.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score

# load_breast_cancer() returns a dictionary-like Bunch, not a DataFrame
df = load_breast_cancer()
df.keys()

# Features as a DataFrame, target as a Series
X, y = pd.DataFrame(df['data'], columns=df['feature_names']), pd.Series(df['target'])
X.info()

We can see that our dataset consists of 30 numerical columns. Let’s see the target distribution.

y.value_counts()

As we can see above, the distribution is slightly imbalanced toward target 1, but not by much.
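Since the targets are plain 0/1 labels, it can help to look at the distribution with class names and shares attached. The sketch below is one way to do that; the variable names (data, y_named, counts, shares) are my own and not part of the original code.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Map the numeric targets (0/1) to the class names shipped with the dataset
names = {i: str(name) for i, name in enumerate(data["target_names"])}
y_named = pd.Series(data["target"]).map(names)

# Absolute counts and relative shares per class
counts = y_named.value_counts()
shares = y_named.value_counts(normalize=True).round(3)
print(pd.DataFrame({"count": counts, "share": shares}))
```

In Scikit-Learn's copy of the dataset, target 0 is malignant and target 1 is benign, so the majority class is the benign one.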

We want to build a classification model on this dataset to predict breast cancer, so we need a baseline model as a benchmark. We will now create various simple baseline models with the help of DummyClassifier from Scikit-Learn.

What is DummyClassifier? It is a classifier that ignores the input features and makes predictions from simple rules, providing a simple baseline to compare against other, more complex classifiers. The class offers several strategies; let’s explore them one by one.

First, I will create a function that fits a baseline model and evaluates it.
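To make the "ignores input features" part concrete, here is a small sketch that fits the same dummy strategy twice, once on the real features and once on an all-zero matrix, and checks that the predictions match. The names clf_real and clf_zero are hypothetical and only used for this illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

X_arr, y_arr = load_breast_cancer(return_X_y=True)

# Same strategy, fitted once on the real features and once on zeros
clf_real = DummyClassifier(strategy="most_frequent").fit(X_arr, y_arr)
clf_zero = DummyClassifier(strategy="most_frequent").fit(np.zeros_like(X_arr), y_arr)

# The features play no role, so both models predict the same labels
same = np.array_equal(clf_real.predict(X_arr), clf_zero.predict(X_arr))
print("Identical predictions:", same)
```

Only the target values seen during fitting influence what the dummy classifier predicts.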

# Function for evaluation metrics
def print_binary_evaluation(X_train, X_test, y_train, y_true, strategy):
    """Fit a DummyClassifier with the given strategy and return its test metrics."""
    dummy_clf = DummyClassifier(strategy=strategy)
    dummy_clf.fit(X_train, y_train)
    y_pred = dummy_clf.predict(X_test)
    # Collect the metrics in a dictionary (returned, not printed)
    results_dict = {'accuracy': accuracy_score(y_true, y_pred),
                    'recall': recall_score(y_true, y_pred),
                    'precision': precision_score(y_true, y_pred),
                    'f1_score': f1_score(y_true, y_pred)}
    return results_dict

We evaluate the baseline model by accuracy, recall, precision, and F1 score. Next, I split the dataset.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
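As a quick sanity check on the split, the sketch below rebuilds it in a self-contained way and prints the number of rows and the share of target 1 on each side. By default train_test_split holds out 25% of the rows for testing. The variable names (X_all, X_tr, X_te, and so on) are my own, chosen so they do not clash with the ones above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features as a DataFrame and the target as a Series
X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)

# Same call as above: default test_size (25%), fixed seed for reproducibility
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=42)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("share of target 1 in train:", round(y_tr.mean(), 3))
print("share of target 1 in test: ", round(y_te.mean(), 3))
```

If the two shares differed a lot, passing stratify=y_all to train_test_split would be one option to keep the class proportions aligned.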

Let’s try testing each baseline model strategy using the above function.

Most Frequent (Mode)

This is the simplest strategy: we always predict the most frequent label in the training data. Target 1 is the most frequent label in our dataset, so the classifier always predicts 1.

print_binary_evaluation(X_train, X_test, y_train, y_test, 'most_frequent')

If we look at the dummy classifier’s predictions directly, every single one is target 1.

By predicting 1 all the time, we achieve 0.62 accuracy and a 0.76 F1 score. That is not bad in machine learning terms, but would the business do well by telling every patient the same thing? Obviously not. In Scikit-Learn’s copy of this dataset, target 1 is the benign class and 0 the malignant one, so this baseline would declare every tumor benign and miss every cancer case.

At the very least, we now have a baseline to compare our complex model against. Let’s try out another strategy.
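These numbers can be worked out by hand. For a classifier that always predicts the positive class, accuracy and precision both equal the share p of positives in the test set, and recall is 1. The sketch below turns that arithmetic into a small helper; the function name always_positive_metrics is hypothetical, and p = 0.62 is taken from the accuracy reported above.

```python
def always_positive_metrics(p):
    """Metrics of a classifier that always predicts the positive class.

    p is the share of positives in the evaluation set (0 < p <= 1).
    """
    accuracy = p      # correct exactly on the positives
    recall = 1.0      # every positive is predicted as positive
    precision = p     # a share p of the positive predictions is right
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1_score": f1}

metrics = always_positive_metrics(0.62)
print({name: round(value, 2) for name, value in metrics.items()})
```

The F1 score simplifies to 2p / (1 + p), which for p around 0.62 lands close to the value reported above. This is why a high F1 from a constant prediction says little about the model.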

Uniform

The uniform strategy creates a baseline model that draws its predictions from a uniform random distribution, so every target has the same probability of being the prediction output.

print_binary_evaluation(X_train, X_test,y_train, y_test, 'uniform')

As we can see from the result above, the metrics are closer to 50% because the baseline model prediction distribution was uniform.

The predicted labels are also split roughly evenly between the two targets.
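Because the uniform strategy is random, results change from run to run unless a seed is set. The sketch below, with names of my own choosing (uniform_clf, many_rows, share_of_ones), fixes random_state and predicts on a large dummy input to show that the share of predicted 1s settles near one half.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier

X_arr, y_arr = load_breast_cancer(return_X_y=True)

# random_state makes the random predictions reproducible
uniform_clf = DummyClassifier(strategy="uniform", random_state=0).fit(X_arr, y_arr)

# The feature values are ignored; only the number of rows matters here
many_rows = np.zeros((10_000, X_arr.shape[1]))
share_of_ones = uniform_clf.predict(many_rows).mean()
print(f"Share of predicted 1s: {share_of_ones:.3f}")
```

On a small test set the split will be noisier, which is one reason the metrics only come out close to 50% rather than exactly there.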

Stratified

The stratified strategy creates a baseline model whose random predictions follow the target distribution of the training data. It is suitable for imbalanced data because it reflects the actual class distribution.

print_binary_evaluation(X_train, X_test,y_train, y_test, 'stratified')

The metrics are similar to those of the uniform-strategy baseline because the target distribution in this dataset is not far from uniform.

The predictions are slightly dominated by target 1, as it is the dominant label.
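To check that the stratified strategy really mirrors the training distribution, the sketch below compares the share of target 1 in the training labels with the share among a large number of stratified predictions. The variable names are hypothetical, and random_state is set only to make the run reproducible.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_arr, y_arr = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_arr, y_arr, random_state=42)

# Predictions are sampled according to the class shares seen in y_tr
strat_clf = DummyClassifier(strategy="stratified", random_state=0).fit(X_tr, y_tr)

# Predict on many dummy rows so the sampled share is stable
many_rows = np.zeros((10_000, X_arr.shape[1]))
predicted_share = strat_clf.predict(many_rows).mean()
train_share = y_tr.mean()

print(f"Share of 1 in training labels: {train_share:.3f}")
print(f"Share of 1 in predictions:     {predicted_share:.3f}")
```

The two shares should be close, which is the sense in which this baseline "follows the target distribution".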

Overall, the simple baseline model is a prediction model that ignores the input data and focuses only on the prediction result. If we use the simple baselines above as a benchmark, the actual model should at least beat the one that gives every patient the same prediction.
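Here is a minimal sketch of what that benchmark comparison could look like. The candidate model, a scaled logistic regression, is my own assumption for illustration and not the model used in this article; any classifier could take its place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_arr, y_arr = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_arr, y_arr, random_state=42)

models = {
    # The benchmark to beat
    "baseline": DummyClassifier(strategy="most_frequent"),
    # Hypothetical candidate model, chosen only for illustration
    "candidate": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    scores[name] = {"accuracy": accuracy_score(y_te, y_pred),
                    "f1_score": f1_score(y_te, y_pred)}
    print(name, scores[name])
```

If the candidate does not clearly beat the baseline on the metrics that matter to the business, the extra complexity is hard to justify.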

There is another way to develop a simple baseline model based on business logic, but it requires domain knowledge and business understanding. I have written about the topic in another article that you could refer to.
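As a rough idea of what a rule-based baseline might look like in code, here is a sketch with a single hand-written rule. Everything in it is an assumption for illustration: the choice of the "mean radius" feature, the threshold (the midpoint between the two class means on the training data), and the function name rule_based_predict. A real business rule would come from domain experts, not from this heuristic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, random_state=42)

# Illustrative assumption: one feature and a midpoint threshold, not clinical advice
FEATURE = "mean radius"
threshold = (X_tr.loc[y_tr == 0, FEATURE].mean()
             + X_tr.loc[y_tr == 1, FEATURE].mean()) / 2

def rule_based_predict(X):
    """Predict target 1 (benign) when the feature is below the threshold."""
    return (X[FEATURE] < threshold).astype(int)

rule_accuracy = accuracy_score(y_te, rule_based_predict(X_te))
print(f"Threshold on '{FEATURE}': {threshold:.2f}")
print(f"Rule-based baseline accuracy: {rule_accuracy:.3f}")
```

Unlike the dummy strategies, a rule like this does look at the input data, so it usually sets a tougher benchmark for the actual model.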
