How to Evaluate the Performance of Your ML/ AI Models


Accurate evaluation is the only path to improving performance


Learning by doing is one of the best approaches to learning anything, from tech to a new language or cooking a new dish. Once you have learned the basics of a field or an application, you can build on that knowledge by acting. Building models for various applications is the best way to make your knowledge concrete regarding machine learning and artificial intelligence.

Though both fields (or really sub-fields, since they do overlap) have applications in a wide variety of contexts, the steps to learning how to build a model are more or less the same regardless of the target application field.

AI language models such as ChatGPT and Bard are gaining popularity and interest from both tech novices and general audiences because they can be very useful in our daily lives.

Now that more models are being released and presented, one may ask, what makes a “good” AI/ ML model, and how can we evaluate the performance of one?

This is what we are going to cover in this article. We assume you already have an AI or ML model built and now want to evaluate and, if necessary, improve its performance. Regardless of the type of model you have and your end application, you can take the same steps to do so.

To help us follow through with the concepts, let’s use the Wine dataset from sklearn [1], apply the support vector classifier (SVC), and then test its metrics.

So, let’s jump right in…

First, let’s import the libraries we will use (don’t worry about what each of them does for now; we’ll get to that!).

import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import matplotlib.pyplot as plt

Now, we read our dataset, apply the classifier, and evaluate it.

wine_data = datasets.load_wine()
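# Assumed completion: features and class labels from the object returned by load_wine()
X = wine_data.data
y = wine_data.target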

1. Split the dataset for better analysis.

Depending on your stage in the learning process, you may not have access to a large amount of data for training, validating, and testing. At the same time, you can’t use the same data to both train and test your model, because that would prevent you from genuinely assessing its performance.

To overcome that challenge, split your data into three smaller random sets and use them for training, testing, and validating.

A good rule of thumb for that split is 60/20/20: you use 60% of the data for training, 20% for validation, and 20% for testing. You need to shuffle your data before you do the split so that each set is representative of the whole.

I know that may sound complicated, but luckily, scikit-learn comes to the rescue with a function that performs the split for you: train_test_split().

So, we can take our dataset and split it like so:

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.20, train_size=0.60, random_state=1, stratify=y)
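Note that the call above returns only the training and test portions; the remaining 20% of the data is not returned. One way to obtain all three sets, sketched here on the assumption that two chained calls to train_test_split() suit your workflow, is to hold out the test set first and then split what is left. The names X_demo, X_rest, X_val, and so on are illustrative and not part of the original example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load the Wine features and labels under illustrative names
X_demo, y_demo = load_wine(return_X_y=True)

# Step 1: hold out 20% of all samples as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)

# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20 of the total)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest
)

print(len(X_tr), len(X_val), len(X_te))
```

The validation set can then be used to compare settings, while the test set stays untouched until the final evaluation.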

Then use the training portion of it as input to the classifier.

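#Scale data (fit the scaler on the training set only, then reuse it for the test set)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
#Apply SVC model to the scaled training data
svc = SVC(kernel='linear', C=10.0, random_state=1)
svc.fit(X_train_std, Y_train)
#Obtain predictions for the scaled test data
Y_pred = svc.predict(X_test_std)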

At this point, we have some results to “evaluate.”

2. Define your evaluation metrics.

Before starting the evaluation process, we must ask ourselves an essential question about the model we use: what would make this model good?

The answer to this question depends on the model and how you plan to use it. That being said, there are standard evaluation metrics that data scientists use when they want to test the performance of an AI/ ML model, including:

  1. Accuracy is the percentage of correct predictions out of all the predictions the model makes. In other words, when I run the model, how many of its predictions are correct? This article goes into depth about testing the accuracy of a model.
  2. Precision is the percentage of true positive predictions out of all positive predictions the model makes. Precision and accuracy are often confused; one way to keep them apart is that accuracy looks at every prediction and asks how many were right, while precision looks only at the cases the model labeled positive and asks how many of those really are positive. Both are important for evaluating the model’s performance.
  3. Recall is the proportion of actual positive instances in the dataset that the model correctly predicts as positive. It measures how many of the relevant cases the model manages to find. In practice, recall and precision often trade off: changes that raise recall (such as lowering the decision threshold) tend to lower precision.
  4. F1 score is the harmonic mean of precision and recall, providing a single balanced measure of a model’s performance. This video by CodeBasics discusses the relation between precision, recall, and F1 score and how to find the optimal balance of those evaluation metrics.
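To make the four definitions concrete, here is a small sketch that computes them by hand from the counts of true/false positives and negatives and then prints the scikit-learn values next to them for comparison. The labels are invented for illustration and have nothing to do with the Wine dataset.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical binary labels (1 = positive, 0 = negative), invented for illustration
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0]

# Count the four possible outcomes
pairs = list(zip(toy_true, toy_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Apply the definitions from the list above
accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Print hand-computed values next to the scikit-learn results
print('Accuracy:  %.3f vs %.3f' % (accuracy, accuracy_score(toy_true, toy_pred)))
print('Precision: %.3f vs %.3f' % (precision, precision_score(toy_true, toy_pred)))
print('Recall:    %.3f vs %.3f' % (recall, recall_score(toy_true, toy_pred)))
print('F1 score:  %.3f vs %.3f' % (f1, f1_score(toy_true, toy_pred)))
```

In this toy case the model finds only three of the five positives (recall 0.6), but three of its four positive predictions are right (precision 0.75), which shows how the two metrics can diverge.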

Now, let’s calculate the different metrics for the predicted data. We will start by displaying the confusion matrix, a table that counts how often each actual class was predicted as each class (rows are the actual values, columns are the predicted values).

conf_matrix = confusion_matrix(y_true=Y_test, y_pred=Y_pred)
#Plot the confusion matrix
fig, ax = plt.subplots(figsize=(5, 5))
ax.matshow(conf_matrix, alpha=0.3)
#Write each count into its cell
for i in range(conf_matrix.shape[0]):
    for j in range(conf_matrix.shape[1]):
        ax.text(x=j, y=i, s=conf_matrix[i, j], va='center', ha='center', size='xx-large')
plt.xlabel('Predicted Values', fontsize=18)
plt.ylabel('Actual Values', fontsize=18)
plt.show()

In the resulting confusion matrix, we can see that in some cases the actual value was “1” while the predicted value was “0”, which means the classifier is not 100% accurate.

We can calculate this classifier’s accuracy, precision, recall, and F1 score using this code.

print('Precision: %.3f' % precision_score(Y_test, Y_pred, average='micro'))
print('Recall: %.3f' % recall_score(Y_test, Y_pred, average='micro'))
print('Accuracy: %.3f' % accuracy_score(Y_test, Y_pred))
print('F1 Score: %.3f' % f1_score(Y_test, Y_pred, average='micro'))

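For the run reported here, the results were:

  1. Precision = 0.889
  2. Recall = 0.889
  3. Accuracy = 0.889
  4. F1 score = 0.889

All four values are identical because, for a single-label multiclass problem, micro-averaged precision, recall, and F1 all reduce to accuracy. Your exact numbers may differ depending on the split and the preprocessing.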
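The code above uses average='micro', which pools all classes together. As a hedged illustration of how the choice of averaging changes the picture, the sketch below compares micro and macro averaging and prints the per-class precision. The three-class labels are made up for the example and are not the Wine predictions.

```python
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical three-class labels, invented for illustration
mc_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
mc_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]

acc = accuracy_score(mc_true, mc_pred)
micro_p = precision_score(mc_true, mc_pred, average='micro')  # pools all classes
macro_p = precision_score(mc_true, mc_pred, average='macro')  # unweighted mean over classes
per_class_p = precision_score(mc_true, mc_pred, average=None)  # one value per class

print('Accuracy:            %.3f' % acc)
print('Micro precision:     %.3f' % micro_p)
print('Macro precision:     %.3f' % macro_p)
print('Per-class precision:', per_class_p)
```

Macro averaging and the per-class values are useful when some classes are rarer or harder than others, since a single pooled number can hide a class the model handles poorly.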

Though you can use different approaches to evaluate your models, some evaluation metrics estimate performance better than others depending on the model type. For example, if the model you’re evaluating is a regression model (or includes a regression component), metrics such as the following are more appropriate:

– Mean Squared Error (MSE) is the average of the squared differences between predicted and actual values.

– Mean Absolute Error (MAE) is the average of the absolute differences between predicted and actual values.

Those two metrics are closely related, but MAE is simpler (at least mathematically) than MSE. The main practical difference is how they treat large errors: MAE weights every error in proportion to its size, while MSE squares the errors and therefore penalizes large ones much more heavily, which also makes it more sensitive to outliers.
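A minimal sketch of that difference, using made-up regression values: the second set of predictions is the same as the first except for one large miss, and MSE reacts far more strongly to it than MAE does.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical regression targets and predictions, invented for illustration
reg_true = [3.0, 5.0, 2.0, 7.0]
reg_pred = [2.0, 5.0, 4.0, 7.0]           # small errors only
reg_pred_outlier = [2.0, 5.0, 4.0, 17.0]  # same, but the last prediction is off by 10

mae_small = mean_absolute_error(reg_true, reg_pred)
mse_small = mean_squared_error(reg_true, reg_pred)
mae_large = mean_absolute_error(reg_true, reg_pred_outlier)
mse_large = mean_squared_error(reg_true, reg_pred_outlier)

print('Without the large miss: MAE = %.2f, MSE = %.2f' % (mae_small, mse_small))
print('With the large miss:    MAE = %.2f, MSE = %.2f' % (mae_large, mse_large))
```

Which behavior you want depends on the application: if occasional large errors are especially costly, MSE reflects that; if they are mostly noise, MAE gives a steadier picture.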

3. Validate and tune the model’s hyperparameters.

Before discussing hyperparameters, let’s first differentiate between a hyperparameter and a parameter. A parameter is a value the model learns from the data during training, such as the weights of a neural network or the coefficients of a linear model. In contrast, hyperparameters are settings chosen before training that control the learning process. They are often chosen by the data scientists (or the client, in some cases) and tuned using validation results to improve the model’s performance.

There are different types of hyperparameters that you can tune; some are common to many models that are trained iteratively, such as:

  • Learning Rate: this hyperparameter controls how much the model’s parameters change in response to the error each time they are updated. Choosing it is a trade-off with training time. If the learning rate is too low, training may be slow. If it is too high, training will be faster, but the updates may overshoot and the model’s performance may suffer.
  • Batch Size: the number of training samples the model processes before each parameter update. It affects the training time and interacts with the learning rate, so finding a good batch size is a skill that is often developed as you build more models and grow your experience.
  • Number of Epochs: An epoch is one complete pass over the training dataset. The number of epochs to use varies from one model to another. More epochs usually reduce the training error, but past a certain point the validation error can start to rise as the model overfits.
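To show where these three settings enter a training loop, here is a minimal sketch of mini-batch gradient descent for a one-feature linear model, written in plain NumPy. The synthetic data, the function name train_linear, and the chosen values are all assumptions made for illustration; they are not part of the Wine example.

```python
import numpy as np

# Synthetic data, invented for illustration: y is roughly 3 * x + 0.5 plus noise
data_rng = np.random.default_rng(0)
X_toy = data_rng.normal(size=(200, 1))
y_toy = 3.0 * X_toy[:, 0] + 0.5 + data_rng.normal(scale=0.1, size=200)

def train_linear(X, y, learning_rate, batch_size, n_epochs, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on the squared error."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(y)
    for _ in range(n_epochs):  # one epoch = one full pass over the data
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):  # one parameter update per batch
            idx = order[start:start + batch_size]
            err = w * X[idx, 0] + b - y[idx]
            # The learning rate scales each step along the negative gradient
            w -= learning_rate * 2.0 * np.mean(err * X[idx, 0])
            b -= learning_rate * 2.0 * np.mean(err)
    mse = np.mean((w * X[:, 0] + b - y) ** 2)
    return w, b, mse

# Hyperparameters: chosen by us before training starts
w_fit, b_fit, mse_fit = train_linear(X_toy, y_toy, learning_rate=0.05, batch_size=20, n_epochs=50)
print('w = %.3f, b = %.3f, training MSE = %.4f' % (w_fit, b_fit, mse_fit))
```

Here w and b are parameters, while learning_rate, batch_size, and n_epochs are hyperparameters; rerunning the function with different values is a simple way to see their effect on the final error.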

In addition to the above hyperparameters, there are model-specific hyperparameters, such as the regularization strength or the number of hidden layers in a neural network. This 15-minute video by APMonitor explores various hyperparameters and their differences.

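As one possible way to tune a model-specific hyperparameter, the sketch below searches over the SVC regularization setting C using 5-fold cross-validation on the training data. The candidate values for C, the use of GridSearchCV with a pipeline, and the variable names are assumptions chosen for illustration rather than the procedure used earlier in this article.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load the data under illustrative names and hold out a test set
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1, stratify=y_demo
)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear', random_state=1))

# Candidate values for C (assumed for this sketch)
param_grid = {'svc__C': [0.01, 0.1, 1.0, 10.0]}

# Each candidate is scored by 5-fold cross-validation on the training data only
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_tr, y_tr)

# The test set is used once, after the best setting has been selected
test_accuracy = search.score(X_te, y_te)
print('Best C:', search.best_params_['svc__C'])
print('Cross-validated accuracy: %.3f' % search.best_score_)
print('Test accuracy: %.3f' % test_accuracy)
```

Cross-validation plays the role of the validation set here, so the test data never influences which value of C is selected.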

4. Iterate and refine

Validating an AI/ ML model is not a linear process but an iterative one. You go through splitting the data, tuning the hyperparameters, and analyzing and validating the results, often more than once. The number of times you repeat that process depends on what the analysis of the results shows. For some models, you may only need to do this once; for others, you may need to do it a couple of times.

If you need to repeat the process, you will use the insights from the previous evaluation to improve the model’s architecture, training process, or hyperparameter settings until you are satisfied with the model’s performance.

Final Thoughts

When you start building your own ML and AI models, you will quickly realize that choosing and implementing the model is the easy part of the workflow. Testing and evaluation, however, is the part that will take up most of the development process. Evaluating an AI/ ML model is an iterative and often time-consuming process, and it requires careful analysis, experimentation, and fine-tuning to achieve the desired performance.

Luckily, the more models you build, the more systematic the process of evaluating your model’s performance will get. And it’s a worthwhile skill, because evaluation matters for several reasons:

  1. Evaluating our models allows us to objectively measure the model’s metrics, which helps in understanding its strengths and weaknesses and provides insights into its predictive or decision-making capabilities.
  2. If different models that can solve the same problems exist, then evaluating them enables us to compare their performance and choose the one that suits our application best.
  3. Evaluation provides insights into the model’s weaknesses, allowing for improvements through analyzing the errors and areas where the model underperforms.

So, have patience and keep building models; it gets better and more efficient with the more models you build. Don’t let the process details discourage you. It may look like a complex process, but once you understand the steps, it will become second nature to you.


[1] Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science. (CC BY 4.0)

