
## Cohen’s Kappa score for multiclass classification

You can think of the *kappa score* as a supercharged version of accuracy, one that also accounts for chance agreement and class imbalance.

As you probably know, accuracy can be very misleading because it does not take class imbalance into account. In a target where the positive-to-negative ratio is 10:100, you can still get over 90% accuracy if the classifier simply predicts every sample as negative. Also, since machine learning algorithms rely on probabilistic assumptions about the data, we need a score that can measure the inherent uncertainty that comes with generating predictions. The kappa score, named after Jacob Cohen, is one of the few metrics that can represent all of that in a single number.
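As an illustration (with hypothetical labels, assuming scikit-learn is available), a classifier that always predicts the majority class on a 10:100 imbalanced target scores over 90% accuracy while its kappa is exactly zero:

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced target: 10 positives, 100 negatives
y_true = np.array([1] * 10 + [0] * 100)

# A useless classifier that always predicts the negative class
y_pred = np.zeros(110, dtype=int)

print(accuracy_score(y_true, y_pred))    # ≈ 0.909 — looks impressive
print(cohen_kappa_score(y_true, y_pred)) # 0.0 — no better than chance
```

Accuracy rewards the lazy majority-class guess; kappa compares the observed agreement against what chance alone would produce, so the guess earns nothing.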

In official literature, its definition is “a metric to quantify the agreement between two raters.” Here is the Wikipedia definition:

> Cohen’s kappa coefficient (κ) is a statistic that is used to measure inter-rater reliability (and also intra-rater reliability) for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance.

Here is the official formula:

**κ = (P_0 − P_e) / (1 − P_e)**

In classification, this formula is interpreted as follows:

*P_0* is the *observed proportional agreement* between actual and predicted values. This is the sum of the diagonal cells of the confusion matrix divided by the total number of samples (the sum of all cells). In other words, it is simple accuracy.

*P_e* is the probability that true values and predicted values agree by chance. We will see how both are calculated using the matrix we have been using throughout this guide:

Let’s find the accuracy first: the sum of the diagonal cells divided by the total number of samples gives 0.6. To find the value of *P_e*, we need to find, for each class, the probability that the true and predicted values coincide by chance.

*Ideal* class — the probability of both true and predicted values being *ideal* by chance. There are 250 samples, 57 of which are ideal diamonds. So, the probability of a random diamond being ideal is

**P(actual_ideal) = 57 / 250 = 0.228**

Now, out of all 250 predictions, 16 of them are ideal. So, the probability of a random prediction being ideal is

**P(predicted_ideal) = 16 / 250 = 0.064**

The probability of both conditions being true is their product, so:

**P_e(actual_ideal, predicted_ideal) = 0.228 * 0.064 = 0.014592**

Now, we will do the same for other classes:

*Premium* class — the probability of both true and predicted values being *premium* by chance:

**P(actual_premium) = 45 / 250 = 0.18**

**P(predicted_premium) = 28 / 250 = 0.112**

**P_e(actual_premium, predicted_premium) = 0.02016**

*Good* class — the probability of both true and predicted values being *good* by chance:

**P(actual_good) = 74 / 250 = 0.296**

**P(predicted_good) = 26 / 250 = 0.104**

**P_e(actual_good, predicted_good) = 0.030784**

*Fair* class — the probability of both true and predicted values being *fair* by chance:

**P(actual_fair) = 74 / 250 = 0.296**

**P(predicted_fair) = 30 / 250 = 0.12**

**P_e(actual_fair, predicted_fair) = 0.03552**

Final P_e is the sum of the above calculations:

**P_e(final) = 0.014592 + 0.02016 + 0.030784 + 0.03552 = 0.101056**

**Accuracy, P_0 = 0.6**
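To sanity-check the arithmetic above, here is a short Python sketch. The per-class actual and predicted counts are taken from the walkthrough (the underlying 250-sample diamond-cut data itself is not shown):

```python
# Reproduce the hand calculation of P_e and kappa from per-class counts.
n = 250
actual    = {"ideal": 57, "premium": 45, "good": 74, "fair": 74}
predicted = {"ideal": 16, "premium": 28, "good": 26, "fair": 30}

# Chance agreement: sum over classes of P(actual=c) * P(predicted=c)
p_e = sum((actual[c] / n) * (predicted[c] / n) for c in actual)

p_0 = 0.6  # observed agreement (accuracy)
kappa = (p_0 - p_e) / (1 - p_e)

print(round(p_e, 6))    # 0.101056
print(round(kappa, 3))  # 0.555
```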

Plugging in the numbers:

**κ = (0.6 − 0.101056) / (1 − 0.101056) ≈ 0.555**

The good news is that you can do all of this with a single line of code in Sklearn:
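The original snippet is not reproduced here, but a minimal sketch with hypothetical labels would call `sklearn.metrics.cohen_kappa_score`:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical multiclass labels standing in for the diamond-cut data
y_true = ["ideal", "premium", "good", "fair", "good", "ideal"]
y_pred = ["ideal", "good",    "good", "fair", "fair", "ideal"]

# One line: observed vs. chance agreement, exactly as derived above
print(cohen_kappa_score(y_true, y_pred))  # ≈ 0.538
```

The function accepts the same `y_true` / `y_pred` arrays you would pass to `accuracy_score`, including string labels for multiclass problems.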

Generally, a score above 0.8 is considered excellent agreement. The score we got (≈ 0.555) indicates only moderate agreement.

For more information, I suggest reading these two excellent articles:


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot