
Machine Learning

A Machine Learning Model Is No Longer a Black Box Thanks to SHAP

A step-by-step tutorial in Python to reveal how your machine learning model works internally

Photo by Sam Moqadam on Unsplash

One of the first mistakes a data scientist can make when building a model is to treat the algorithm as a black box. In practice, the data scientist may focus on data cleaning and then try one or more Machine Learning algorithms, without understanding what those algorithms actually do.

Indeed, before choosing this or that Machine Learning model, the first question a data scientist should ask is whether Machine Learning is really necessary at all.

So, my suggestion is to treat Machine Learning as a last resort, to be used only when there is no simpler alternative.

Once you have determined that Machine Learning is necessary, it is important to open the black box to understand what the algorithm does and how it works.

There are a variety of techniques to explain models and make it easier for people without machine learning expertise to understand why a model made certain predictions.

In this article, I will introduce SHAP values, one of the most popular techniques for model explanation. I will also walk through an example to show how to use SHAP values to gain insights.

The article is organized as follows:

  • Overview of SHAP
  • A Practical Example in Python

1 Overview of SHAP

SHAP stands for “SHapley Additive exPlanations.” It builds on Shapley values, a widely used concept from cooperative game theory.

In Machine Learning, a Shapley value measures the contribution of each individual input feature to the predicted outcome. In practice, SHAP decomposes a prediction into a base value (the average model output) plus one additive contribution per feature, which shows how the predicted value is built from the input features.

The SHAP algorithm was first published in 2017 by Lundberg and Lee in an article entitled A Unified Approach to Interpreting Model Predictions (the article has almost 5,500 citations, a sign of its importance).

For more details on how the SHAP value works, you can read these two interesting articles by Samuele Mazzanti, entitled SHAP Values Explained Exactly How You Wished Someone Explained to You and Black-Box models are actually more explainable than a Logistic Regression.

To deal with SHAP values in Python, you can install the shap package:

pip3 install shap

SHAP values can be calculated for a variety of Python libraries, including Scikit-learn, XGBoost, LightGBM, CatBoost, and Pyspark. The full documentation of the shap package is available at this link.

2 A Practical Example in Python

As a practical example, I use the well-known diabetes dataset provided by the scikit-learn package. The description of the dataset is available at this link. I test the following algorithms:

  • DummyRegressor
  • LinearRegression
  • SGDRegressor

For each tested model, I create the model, train it, and use it to predict the values of the test set. Then, I calculate the Mean Squared Error (MSE) to check its performance. Finally, I calculate and plot the SHAP values.
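
Since these steps are identical for every model, the whole workflow can be condensed into a small helper. The evaluate_and_explain function below is my own sketch, not part of the original tutorial:

from sklearn.metrics import mean_squared_error
import shap

def evaluate_and_explain(model, X_train, X_test, y_train, y_test):
    # Train the model and measure its error on the test set.
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
    # Explain the fitted model with SHAP, using the training set both as
    # background data and as the data to explain.
    explainer = shap.Explainer(model.predict, X_train)
    return explainer(X_train)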

2.1 Load Dataset

Firstly, I load the diabetes dataset:

from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
X = data.data
y = data.target

and I split it into training and test sets:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

The objective of this scenario is to predict a quantitative measure of disease progression (the y value) from some input features, including body mass index (bmi), blood pressure (bp), and other clinical measurements. The input features are already normalized. This is a typical regression problem.
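
For reference, a quick inspection of the data (not shown in the original article) lists the available columns:

# Inspect the shape and the feature names of the diabetes dataset.
print(X.shape)          # (442, 10)
print(list(X.columns))  # ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']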

2.2 Dummy Regressor

Before applying a real Machine Learning model, I build a baseline model, i.e. a Dummy Regressor, which simply predicts the average of the output values in the training set.

The Dummy Regressor is used for comparison, i.e. to check whether a Machine Learning model actually improves performance with respect to this trivial baseline.

from sklearn.dummy import DummyRegressor
model = DummyRegressor()
model.fit(X_train, y_train)
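
As a quick sanity check (mine, not in the original code), the fitted baseline really does predict the mean of the training targets for every input:

import numpy as np

# With the default "mean" strategy, DummyRegressor ignores the input
# features and always predicts the mean of the training targets.
assert np.allclose(model.predict(X_test), y_train.mean())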

I calculate the MSE for the model:

from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

the output is 5755.47.
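
This number is easy to interpret: since the baseline always predicts the training mean, its MSE is simply the mean squared deviation of the test targets from that constant, which can be verified directly:

# MSE of a constant prediction equal to the mean of the training targets.
print(((y_test - y_train.mean()) ** 2).mean())  # ≈ 5755.47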

Now, I can calculate the SHAP value for this basic model. I build a generic Explainer with the model and the training set, and then I calculate the SHAP values on a dataset, which can be different from the training set. In my example, I calculate the SHAP values for the training set.

import shap

explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer(X_train)

Note that the Explainer may receive as input the model itself or the model.predict function, depending on the model type.
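
A useful sanity check (not part of the original article) is the additivity property of SHAP: for every sample, the base value plus the sum of the per-feature SHAP values reproduces the model prediction.

import numpy as np

# base_values holds the expected model output over the background data;
# adding the per-feature SHAP values reconstructs each prediction.
reconstructed = shap_values.base_values + shap_values.values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X_train)))  # True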

The shap library provides different functions to plot the SHAP values, including the following ones:

  • summary_plot() — shows the overall impact of each feature on the model output across the dataset;
  • scatter() — shows a scatter plot of the SHAP values against the values of each input feature;
  • plots.force() — an interactive plot for the whole dataset;
  • plots.waterfall() — shows how the prediction for a single sample is built from its SHAP values.

Before drawing the interactive, JavaScript-based plots (such as plots.force()) in a notebook, the following command must be run:

shap.initjs()

Firstly, I draw the scatter plot:

shap.plots.scatter(shap_values, color=shap_values)
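
The other plot types listed above can be drawn in the same way. Here is a minimal sketch using the shap_values computed for the Dummy Regressor (since the baseline ignores the features, all contributions are close to zero, so these plots become more interesting for the real models):

# Overall impact of each feature on the model output, across the dataset.
shap.summary_plot(shap_values.values, X_train)

# How the prediction for a single sample (here the first one) is built from
# the base value and the per-feature SHAP values.
shap.plots.waterfall(shap_values[0])

# Interactive force plot for the whole dataset (requires shap.initjs()).
shap.plots.force(shap_values)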
