# Comprehensive Guide on Multiclass Classification Metrics

## Cohen’s Kappa score for multiclass classification

You can think of the kappa score as a supercharged version of accuracy, a version that also integrates measurements of chance and class imbalance.

As you probably know, accuracy can be very misleading because it does not take class imbalance into account. In a target where the positive-to-negative ratio is 10:100, a classifier that labels every sample as negative still reaches about 91% accuracy (100 correct out of 110). Also, as machine learning algorithms rely on probabilistic assumptions about the data, we need a score that can measure the inherent uncertainty that comes with generating predictions. The Kappa score, named after Jacob Cohen, is one of the few that can represent all that in a single metric.

In official literature, its definition is “a metric to quantify the agreement between two raters.” Here is the Wikipedia definition:

> Cohen’s kappa coefficient (κ) is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance.

Here is the official formula:

κ = (P_0 - P_e) / (1 - P_e)

In classification, this formula is interpreted as follows:

P_0 is the observed proportional agreement between actual and predicted values. This is the sum of the diagonal cells of the confusion matrix divided by the sum of all its cells. In other words, it is another name for simple accuracy.
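As a minimal sketch, assuming the confusion matrix is stored as nested lists with rows for actual classes and columns for predicted ones, P_0 is the diagonal sum over the grand total. The 3x3 counts below are hypothetical and are not the matrix used in this guide:

```python
# Hypothetical confusion matrix (rows = actual, columns = predicted).
# The counts are made up for illustration only.
confusion = [
    [50, 5, 5],
    [10, 30, 10],
    [5, 5, 30],
]

# Grand total of all cells, i.e. the number of samples.
total = sum(sum(row) for row in confusion)

# Diagonal cells are the samples where actual and predicted labels agree.
correct = sum(confusion[i][i] for i in range(len(confusion)))

p_0 = correct / total  # observed agreement, the same thing as accuracy
print(f"P_0 = {p_0:.4f}")
```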

P_e is the probability that actual and predicted values agree by chance. We will see how these are calculated using the matrix we were using throughout this guide:

Let’s find the accuracy first: the sum of the diagonal cells divided by the total number of samples, which gives 0.6. To find the value of P_e, we need, for each class, the probability that the true and predicted values agree on that class by chance.

1. Ideal class: the probability that both the true and predicted values are ideal by chance. There are 250 samples, 57 of which are ideal diamonds. So, the probability of a random diamond being ideal is

P(actual_ideal) = 57 / 250 = 0.228

Now, out of all 250 predictions, 16 of them are ideal. So, the probability of a random prediction being ideal is

P(predicted_ideal) = 16 / 250 = 0.064

Treating the two events as independent, the probability of both being true is their product:

P_e(actual_ideal, predicted_ideal) = 0.228 * 0.064 = 0.014592

Now, we will do the same for other classes:

2. Premium class: the probability that both the true and predicted values are premium by chance:

P(actual_premium) = 45 / 250 = 0.18

P(predicted_premium) = 28 / 250 = 0.112

P_e(actual_premium, predicted_premium) = 0.18 * 0.112 = 0.02016

3. Good class: the probability that both the true and predicted values are good by chance:

P(actual_good) = 74 / 250 = 0.296

P(predicted_good) = 26 / 250 = 0.104

P_e(actual_good, predicted_good) = 0.296 * 0.104 = 0.030784

4. Fair class: the probability that both the true and predicted values are fair by chance:

P(actual_fair) = 74 / 250 = 0.296

P(predicted_fair) = 30 / 250 = 0.12

P_e(actual_fair, predicted_fair) = 0.296 * 0.12 = 0.03552

Final P_e is the sum of the above calculations:

P_e(final) = 0.014592 + 0.02016 + 0.030784 + 0.03552 = 0.101056
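The per-class arithmetic above can be reproduced in a few lines. This is only a sketch that reuses the counts quoted in the walkthrough; the variable names (`counts`, `per_class`) are my own:

```python
n_samples = 250

# (actual count, predicted count) per class, as quoted in the walkthrough.
counts = {
    "ideal": (57, 16),
    "premium": (45, 28),
    "good": (74, 26),
    "fair": (74, 30),
}

# Chance agreement per class: P(actual = class) * P(predicted = class).
per_class = {
    name: (actual / n_samples) * (predicted / n_samples)
    for name, (actual, predicted) in counts.items()
}

# Final P_e is the sum of the per-class chance agreements.
p_e = sum(per_class.values())

for name, value in per_class.items():
    print(f"P_e({name}) = {value:.6f}")
print(f"P_e(final) = {p_e:.6f}")
```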

Accuracy, P_0 = 0.6

Plugging in the numbers:

κ = (0.6 - 0.101056) / (1 - 0.101056) = 0.498944 / 0.898944 ≈ 0.555

The good news is, you can do all this in a line of code with Sklearn:
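The original snippet is not reproduced here, so what follows is a minimal sketch rather than the author's code. It assumes scikit-learn is installed and uses `cohen_kappa_score` from `sklearn.metrics`. The labels `y_true` and `y_pred` are hypothetical, and `manual_kappa` is a helper of my own that repeats the hand calculation as a cross-check:

```python
from collections import Counter

# Hypothetical labels, made up for illustration (not the diamonds data).
y_true = ["ideal", "ideal", "premium", "good", "fair",
          "good", "premium", "fair", "ideal", "good"]
y_pred = ["ideal", "premium", "premium", "good", "fair",
          "fair", "premium", "fair", "ideal", "good"]


def manual_kappa(actual, predicted):
    """Cohen's kappa computed by hand: (P_0 - P_e) / (1 - P_e)."""
    n = len(actual)
    # Observed agreement: share of samples where both labels match.
    p_0 = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement: per-class product of marginal proportions, summed.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    p_e = sum(actual_counts[c] * predicted_counts[c] for c in actual_counts) / n**2
    return (p_0 - p_e) / (1 - p_e)


kappa_by_hand = manual_kappa(y_true, y_pred)
print(f"by hand: {kappa_by_hand:.4f}")

try:
    from sklearn.metrics import cohen_kappa_score
except ImportError:
    print("scikit-learn is not installed; skipping the library call")
else:
    # The one-liner itself.
    print(f"sklearn: {cohen_kappa_score(y_true, y_pred):.4f}")
```

Both numbers should agree, since the default (unweighted) `cohen_kappa_score` uses the same formula.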

Generally, a score above 0.8 is considered excellent. The score we got is a humble moderate.
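To make the "moderate" label concrete, here is a small sketch that maps a kappa value to the bands proposed by Landis and Koch (1977), a commonly cited rule of thumb. Using that particular scale is my assumption, since the text does not name one, and `describe_kappa` is a hypothetical helper:

```python
def describe_kappa(kappa):
    """Map a kappa value to the Landis and Koch (1977) agreement bands."""
    if kappa < 0:
        return "poor"
    # (upper bound, label) pairs, checked in increasing order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Kappa from the values worked out above: P_0 = 0.6, P_e = 0.101056.
kappa = (0.6 - 0.101056) / (1 - 0.101056)
print(f"{kappa:.3f} -> {describe_kappa(kappa)}")
```

These cut-offs are conventions rather than hard rules, so treat the label as a rough guide.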

For more information, I suggest reading these two excellent articles:
