Understanding classification metrics

For a simple binary classification task, the beginner is often confused by a ton of metrics. Here I give a bottom-up explanation of the key concepts before introducing them by their names.

Those working on classification problems will often come across terms like true positive rate, false positive rate, recall, and precision, and it is useful to understand what they mean in depth.

There are actually many good sources out there explaining these terms, but true understanding only comes from sitting down and thinking things through. Furthermore, we each have our own ways of thinking, so there is no one-size-fits-all explanation. In this note I am crystallizing my understanding of these terms for my own reference and for anyone who thinks like me.

Figure: visualization of the classification problem, from https://commons.wikimedia.org/wiki/File:Precisionrecall.svg

Instead of introducing all these terms outright and taking the reader through the same confusion I had when I first read about them, I will take a different approach. I will describe a situation and an experiment and work through the math. Only at the end will I name the terms. That way the reader will understand the concepts first and only then bother with the names.

Let’s take a scenario where there is some state which can be True (T) or False (F). This could be, say, the Covid-19 status of individuals, or whether or not an image in ImageNet is of a cat. Then we have a classifier, which could be one of those cool canines that can sniff out Covid-19-positive patients, or your favorite CNN classifier that can tell cats from non-cats. It will output Positive (P) when it “thinks” the sample is True (T) and Negative (N) when it “thinks” the sample is False (F).

Now at this point we have a few natural quantities we can think about.

  1. We can look at the population probability of being true or false, i.e. P(T) and P(F). Note that they sum to 1. These may be known (cats in ImageNet) or unknown (true instances of Covid-19 cases), and in the latter case we may only have estimates.
  2. We can look at how well our classifier performs by asking what the probability is of a true sample being found to be positive, P(P|T), or negative, P(N|T). Note that P(P|T) + P(N|T) = 1, as a true sample can only be labelled one way or the other. Similarly, we can talk about P(P|F) and P(N|F), and they sum to 1 as well.
  3. We can also look at how well our classifier performs by asking the inverted question: the probability of a sample found to be positive actually being True, P(T|P), or being False, P(F|P). These have to add up to unity as well, since a sample labelled positive has to be either True or False, i.e. P(T|P) + P(F|P) = 1. Similarly, we can talk about P(T|N) and P(F|N). (All three groups of quantities are estimated in the short Python sketch after this list.)
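
Here is a minimal Python sketch estimating all three groups of quantities from data. The tiny truth/pred arrays are made up purely for illustration:

```python
import numpy as np

# Made-up toy data: the true state of each sample and the classifier's output.
truth = np.array([True, True, False, True, False, False, True, False])  # T/F
pred  = np.array([True, False, False, True, True, False, True, False])  # P/N

# 1. Population probabilities
p_T = truth.mean()                 # P(T)
p_F = 1 - p_T                      # P(F)

# 2. Probabilities conditioned on the true state
p_P_given_T = pred[truth].mean()   # P(P|T)
p_N_given_T = 1 - p_P_given_T      # P(N|T)
p_P_given_F = pred[~truth].mean()  # P(P|F)
p_N_given_F = 1 - p_P_given_F      # P(N|F)

# 3. Probabilities conditioned on the classifier's output
p_T_given_P = truth[pred].mean()   # P(T|P)
p_F_given_P = 1 - p_T_given_P      # P(F|P)
p_T_given_N = truth[~pred].mean()  # P(T|N)
p_F_given_N = 1 - p_T_given_N      # P(F|N)
```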

Now I am going to talk about Bayes’ theorem. If you only know about it but have not internalized it, I would suggest pausing and thinking about it hard until you understand it. It’s not very hard and can be understood with Venn diagrams. If you have not heard of it, you should read about it on Wikipedia and then think hard about it, as I suggested. I will not be motivating it here and will simply be using it.

Now, using Bayes’ theorem, we can relate the above quantities:

P(T|P) = P(P|T) · P(T) / P(P)

There are obviously three more relations I can write, but note that one of them is obtained from the above by swapping T <-> F and P <-> N, and the other two are obtained from the probabilities summing to one, as mentioned in point 3 above. Furthermore, from a business point of view, the above form is usually the most useful. Well, almost! We just need to massage it a bit: expand P(P) over the two true states, P(P) = P(P|T) · P(T) + P(P|F) · P(F), and use the sums of probabilities mentioned in point 2 above to write P(P|F) = 1 - P(N|F), giving

P(T|P) = P(P|T) · P(T) / [P(P|T) · P(T) + (1 - P(N|F)) · P(F)]
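
As a sanity check, here is a minimal Python sketch of the massaged formula. The function and its argument names are my own naming for this note:

```python
def p_T_given_P(p_P_given_T, p_N_given_F, p_T):
    """Compute P(T|P) from P(P|T), P(N|F), and P(T) via the formula above."""
    p_F = 1.0 - p_T                    # probabilities in point 1 sum to one
    p_P_given_F = 1.0 - p_N_given_F    # probabilities in point 2 sum to one
    return (p_P_given_T * p_T) / (p_P_given_T * p_T + p_P_given_F * p_F)
```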

That’s it! That is all there is. However, some of the terms above have special names, which I will now introduce. The guy on the left, P(T|P), is called precision. It is a very useful metric, as it quantifies what fraction of the samples classified as Positive are indeed True. P(P|T) is called sensitivity or recall, and it is the fraction of True samples that are classified as Positive by the classifier. Finally, P(N|F), which is quite analogous to the previous quantity, is called specificity; unfortunately it is not called “recall for negatives” (which would have made life simpler, so … obviously not).
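
In code, with made-up confusion-matrix counts (the numbers are assumptions purely for illustration), the named metrics look like this:

```python
# Hypothetical counts: tp = True samples labelled Positive,
# fp = False labelled Positive, fn = True labelled Negative,
# tn = False labelled Negative.
tp, fp, fn, tn = 80, 10, 20, 890

precision   = tp / (tp + fp)  # P(T|P)
recall      = tp / (tp + fn)  # P(P|T), a.k.a. sensitivity
specificity = tn / (tn + fp)  # P(N|F), the would-be "recall for negatives"

print(precision, recall, specificity)  # 0.888..., 0.8, 0.988...
```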

Usually, the metrics reported for tests or classifiers in medicine are sensitivity and specificity. However, it is often mentioned that these numbers should be taken with a pinch of salt when trying to invert them to get (what you now know to be) precision, if the true population density is skewed (i.e. P(T) >> P(F) or P(T) << P(F)). Arguably, from a medical or business point of view, precision and recall/sensitivity are often more useful to trade off against each other than specificity and recall/sensitivity.
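
To see why the skew matters, here is a small worked example. The 99%/99% figures and the 0.1% prevalence are made-up numbers for illustration:

```python
# A test with 99% sensitivity and 99% specificity, applied to a rare
# condition with prevalence P(T) = 0.1%.
p_P_given_T = 0.99   # sensitivity
p_N_given_F = 0.99   # specificity
p_T = 0.001          # prevalence

p_F = 1.0 - p_T
p_P_given_F = 1.0 - p_N_given_F
precision = (p_P_given_T * p_T) / (p_P_given_T * p_T + p_P_given_F * p_F)
print(precision)  # ~0.09: despite the impressive-looking test, over 90%
                  # of Positive results are false alarms.
```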
