

## Machine Learning

# A Machine Learning Model Is No Longer a Black Box Thanks to SHAP

## A step-by-step tutorial in Python to reveal how your machine learning model works internally

One of the first mistakes a data scientist can make when building a model to represent data is to consider the algorithm used as a black box. In practice, the data scientist could focus more on data cleaning and then try one or more Machine Learning algorithms, without understanding what exactly that algorithm does.

Indeed, the first question a data scientist should ask, before choosing this or that Machine Learning model, is **whether it is really necessary to use Machine Learning at all**.

So, my suggestion is to treat Machine Learning as a last resort, to be used only if there is no simpler alternative.

Once you have determined that Machine Learning is necessary, it is important to **open the black box** to understand what the algorithm does and how it works.

There are a variety of techniques to explain the models and make it easier for people who do not have machine learning expertise to understand why a model made certain predictions.

In this article, I will introduce SHAP values, which is one of the most popular techniques for a model explanation. I will also walk through an example to show how to use SHAP values to get insights.

The article is organized as follows:

- Overview of SHAP
- A Practical Example in Python

# 1 Overview of SHAP

SHAP stands for “*SHapley Additive exPlanations.*” Shapley values are a widely used approach from cooperative game theory.

In Machine Learning, **a Shapley value measures the contribution of each individual feature to the outcome, among all the input features**. In practice, Shapley values let you understand how a predicted value is built up from the input features.
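The key property behind this is additivity: for any single prediction, the model's base value (its average output) plus the per-feature Shapley values must sum exactly to the predicted output. A toy sketch with made-up numbers (the feature names and contributions here are hypothetical):

```python
# Toy illustration of the additive property of Shapley values.
# base_value is the average model output over the background data;
# shap_contributions are hypothetical per-feature contributions for one sample.
base_value = 150.0
shap_contributions = {"bmi": 35.2, "bp": 12.1, "s5": -8.3}

# The prediction for this sample is reconstructed exactly as
# the base value plus the sum of all feature contributions.
prediction = base_value + sum(shap_contributions.values())
print(prediction)  # 189.0
```

This decomposition is what makes SHAP values interpretable: each feature's contribution is a number in the same units as the model output.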

The SHAP algorithm was first published in 2017 by Lundberg and Lee in an article entitled A Unified Approach to Interpreting Model Predictions (the article has almost 5,500 citations, a measure of its importance).

For more details on how the SHAP value works, you can read these two interesting articles by Samuele Mazzanti, entitled SHAP Values Explained Exactly How You Wished Someone Explained to You and Black-Box models are actually more explainable than a Logistic Regression.

To deal with SHAP values in Python, you can install the `shap` package:

`pip3 install shap`

SHAP values can be calculated for a variety of Python libraries, including Scikit-learn, XGBoost, LightGBM, CatBoost, and PySpark. The full documentation of the `shap` package is available at this link.

# 2 A Practical Example in Python

As a practical example, I exploit the well-known diabetes dataset provided by the `scikit-learn` package. The description of the dataset is available at this link. I test the following algorithms:

- `DummyRegressor`
- `LinearRegression`
- `SGDRegressor`

For each tested model, I create the model, train it, and predict new values on the test set. Then, I calculate the Mean Squared Error (MSE) to check its performance. Finally, I calculate and plot the SHAP values.
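The per-model workflow just described can be sketched as a small helper function (`evaluate_model` is a name I introduce for illustration, not part of scikit-learn):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Train a model, predict on the test set, and return the MSE."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return mean_squared_error(y_test, y_pred)


data = load_diabetes(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.33, random_state=42)
print(evaluate_model(DummyRegressor(), X_train, X_test, y_train, y_test))
```

The same helper can then be reused for each of the three regressors, so only the SHAP analysis changes between models.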

## 2.1 Load Dataset

Firstly, I load the diabetes dataset:

```python
from sklearn.datasets import load_diabetes

data = load_diabetes(as_frame=True)
X = data.data
y = data.target
```

and I split it into training and test sets:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```

The objective of this scenario is to predict a quantitative measure of disease progression (the y value) from some input features, including body mass index (bmi), blood pressure (bp), and other similar parameters. Input features are already normalized. This is a typical regression problem.
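To see which input features are available, you can inspect the loaded frame directly:

```python
from sklearn.datasets import load_diabetes

# The diabetes dataset ships with ten baseline features,
# already mean-centered and scaled by scikit-learn.
data = load_diabetes(as_frame=True)
print(list(data.data.columns))
# ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
```

Keeping the data as a `DataFrame` (via `as_frame=True`) is convenient here because `shap` uses the column names to label its plots.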

## 2.2 Dummy Regressor

Before applying a real Machine Learning model, I build a baseline model, i.e. Dummy Regressor, which calculates the output value as the average value of the outputs in the training set.

The Dummy Regressor can be used for comparison, i.e. check whether a Machine Learning model improves the performance with respect to it.

```python
from sklearn.dummy import DummyRegressor

model = DummyRegressor()
model.fit(X_train, y_train)
```

I calculate the MSE for the model:

```python
from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
```

the output is 5755.47.
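With the default `strategy="mean"`, `DummyRegressor` always predicts the mean of the training targets, so its test MSE is simply the mean squared distance of the test targets from that constant. A quick sketch with made-up numbers:

```python
import numpy as np

# Tiny hypothetical targets, just to show the arithmetic.
y_train = np.array([100.0, 200.0, 300.0])
y_test = np.array([150.0, 250.0])

constant = y_train.mean()  # 200.0, what DummyRegressor predicts for every sample
mse = np.mean((y_test - constant) ** 2)
print(mse)  # 2500.0
```

Any real model should beat this number; otherwise it has learned nothing useful from the features.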

Now, I can calculate the SHAP values for this basic model. I build a generic `Explainer` with the model and the training set, and then I calculate the SHAP values on a dataset, which can be different from the training set. In my example, I calculate the SHAP values for the training set.

```python
import shap

explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer(X_train)
```

Note that the `Explainer` may receive as input either the model itself or the `model.predict` function, depending on the model type.

The `shap` library provides different functions to plot the SHAP values, including the following ones:

- `summary_plot()` — shows the contribution of each feature to the SHAP values;
- `scatter()` — shows the scatter plot of SHAP values versus every input feature;
- `plots.force()` — an interactive plot for the whole dataset;
- `plots.waterfall()` — shows how the SHAP value is built for a single sample.
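All of these plots visualize the same underlying arrays: for each sample, `shap_values.values` holds one contribution per feature and `shap_values.base_values` the expected model output, and together they reconstruct the prediction. A numpy sketch of that consistency check (the arrays here are made up, mimicking the shapes that `explainer(X_train)` returns):

```python
import numpy as np

# Hypothetical SHAP output for 2 samples and 3 features:
# values has shape (n_samples, n_features), base_values has shape (n_samples,)
values = np.array([[35.2, 12.1, -8.3],
                   [-20.0, 5.5, 1.5]])
base_values = np.array([150.0, 150.0])

# Each prediction equals the base value plus the row-sum of contributions.
predictions = base_values + values.sum(axis=1)
print(predictions)  # [189. 137.]
```

The waterfall plot, for instance, is just a visual walk through one row of this sum, one feature at a time.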

Before calling the previous functions, the following command must be run:

`shap.initjs()`

Firstly, I draw the scatter plot:

`shap.plots.scatter(shap_values, color=shap_values)`
