Deriving Naive Bayes from scratch


Mathematically building Naive Bayes from scratch.


Naive Bayes is one of the most popular classifiers in Machine Learning. To be more specific, it refers to a family of algorithms, all based on some common principles. In this article, we will build up the mathematics behind this classifier from scratch.

Prerequisites:

1. Basic probability
2. Pen and paper
3. A cup of coffee (optional)

Behind the name “Naive Bayes”:

The name “Naive Bayes” is derived from two words; naive, meaning unsophisticated or simple, and Bayes, from the Bayes theorem formulated by Reverend Thomas Bayes.

Bayes’ Theorem:

Most of us have studied Bayes’ theorem in high school.
Bayes’ theorem states that, given two events A and B,
P(A|B) = P(B|A) * P(A) / P(B).
P(A|B): the probability of event A occurring given that B is true,
P(B|A): the probability of event B occurring given that A is true,
P(A): the probability of occurrence of A alone,
P(B): the probability of occurrence of B alone.
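As a quick numerical sketch of the theorem (the numbers below are invented purely for illustration, not taken from the article):

```python
# Hypothetical scenario: a test that detects a condition affecting 1% of
# a population, with 90% sensitivity and a 5% false-positive rate.
p_b_given_a = 0.90   # P(B|A): positive test given the condition
p_a = 0.01           # P(A): prior probability of the condition
# P(B) via the law of total probability
p_b = p_b_given_a * p_a + 0.05 * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # → 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior P(A) is small; this is exactly the interplay of prior and likelihood that Naive Bayes exploits.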

The task at hand:

Now let’s understand what our task basically is.

We are given a query point xq, and we need to find out the class to which it belongs.
For simplicity, let’s suppose our task is binary classification, so the class can take the value 0 (C = 0) or 1 (C = 1).

Now, the query point can have d features, so xq can be a d-dimensional vector, each dimension representing a unique feature:
xq = (x1, x2, x3, …, xd).

Now, given xq, we need to find the probability that it belongs to class C = 1, i.e. P(C1 | xq).
Similarly, we need to find, given xq, the probability that it belongs to class C = 0, i.e. P(C0 | xq).
Once we have these values, we can simply compare them and assign xq the class label with the higher probability.

But, the question is, how can we get the numerical values to compare?

Bayes’ Theorem to help us:

Let’s understand how we are going to calculate, P(C1 | xq). We can then extend the same for calculating P(C0 | xq).

Now, using Bayes’ Theorem,
P(C1 | xq) = P(xq | C1) * P(C1) / P(xq), and ,
P(C0 | xq) = P(xq | C0) * P(C0) / P(xq).

Note that the denominator in both equations is the same, so to compare the two values we only need the numerators.

P(C1 | xq) ∝ P(xq | C1) * P(C1)

Let’s focus on the term P(xq | C1).
xq is a vector containing d features. Each feature is an event in probabilistic terms, and xq represents nothing but the joint occurrence of all these events:
xq = (x1, x2, x3, …, xd)

P(xq | C1) = P(x1, x2, x3, …, xd | C1).

So, P(C1 | xq) ∝ P(x1, x2, x3, …, xd | C1) * P(C1)

According to conditional probability,
P(A|B) = P(A, B) / P(B)
=>P(A|B)*P(B) = P(A, B)
Note: here the comma denotes the intersection, i.e. the joint occurrence of the events.
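A quick numeric check of this identity, with joint and marginal probabilities invented for illustration:

```python
# Invented probabilities over two events A and B
p_a_and_b = 0.12   # P(A, B): joint probability
p_b = 0.30         # P(B): marginal probability

# Conditional probability: P(A|B) = P(A, B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Multiplying back recovers the joint: P(A|B) * P(B) = P(A, B)
print(round(p_a_given_b * p_b, 10))  # → 0.12
```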

Using conditional probability,
P(x1, x2, x3, …, xd | C1) * P(C1) = P(x1, x2, x3, …, xd, C1)

So, P(C1 | xq) ∝ P(x1, x2, x3, …, xd, C1)

Applying P(A, B) = P(A|B) * P(B) repeatedly to the term P(x1, x2, x3, …, xd, C1):

P(x1, x2, x3, …, xd, C1)
= P(x1 | x2, x3, …, xd, C1) * P(x2, x3, …, xd, C1)

= P(x1 | x2, x3, …, xd, C1) * P(x2 | x3, …, xd, C1) * P(x3, …, xd, C1)

= P(x1 | x2, x3, …, xd, C1) * P(x2 | x3, …, xd, C1) * … * P(xd-1 | xd, C1) * P(xd | C1) * P(C1)

Each term here is a complex conditional probability, and estimating every one of them from data is not feasible.

“Naive” Assumption to the rescue:

Given a conditional probability P(A | B, C), if A is conditionally independent of B given C, then
P(A | B, C) = P(A | C).

For the problem above, our “naive” assumption is going to save us.
Here, we will assume that all the features in xq are mutually independent of each other, conditional on the class C1.

P(x1 | x2, x3, …, xd, C1) = P(x1 | C1)
P(x2 | x3, …, xd, C1) = P(x2 | C1)
…
P(xd-1 | xd, C1) = P(xd-1 | C1)

So,
P(x1, x2, x3, …, xd, C1) = P(x1 | C1) * P(x2 | C1) * … * P(xd | C1) * P(C1)

P(C1 | xq) ∝ P(x1 | C1) * P(x2 | C1) * … * P(xd | C1) * P(C1)
=> P(C1 | xq) ∝ P(C1) * ∏ P(xi | C1) (i = 1 to d)

=> P(C1 | xq) = k * P(C1) * ∏ P(xi | C1) (i = 1 to d),
where k = 1/P(xq).

Similarly,
P(C0 | xq) = k * P(C0) * ∏ P(xi | C0) (i = 1 to d)

Since k = 1/P(xq) is the same in both equations, it can be dropped when comparing the two.
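In code, comparing the two unnormalized scores is a straightforward product. A common practical safeguard, sketched below with assumed per-feature probabilities (not values from this article), is to work with logarithms so the product of many small probabilities does not underflow:

```python
import math

# Assumed per-feature likelihoods P(xi|C) for a query point, and priors P(C)
likelihoods_c1 = [0.8, 0.6, 0.9]   # P(xi | C1), i = 1..d
likelihoods_c0 = [0.2, 0.5, 0.3]   # P(xi | C0), i = 1..d
prior_c1, prior_c0 = 0.4, 0.6

# Compare log P(C) + sum of log P(xi|C); log turns the product into a sum
# and avoids floating-point underflow when d is large.
score_c1 = math.log(prior_c1) + sum(math.log(p) for p in likelihoods_c1)
score_c0 = math.log(prior_c0) + sum(math.log(p) for p in likelihoods_c0)

label = 1 if score_c1 > score_c0 else 0
print(label)  # → 1
```

Since log is monotonic, the class with the larger log-score is also the class with the larger unnormalized probability, so the comparison is unchanged.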

Given a training dataset, it is very easy to calculate P(C1) and P(xi|C1).

P(C1) is simply the number of data points in our training dataset with class label 1, divided by the total number of data points.

P(xi | C1) is simply the number of training data points in which xi (the value of the i-th feature) occurs and which are also labelled 1, divided by the total number of data points labelled 1.
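These counting estimates can be sketched in a few lines of Python. The tiny binary-feature dataset below is made up for illustration, and the code follows the formulas above exactly (no smoothing):

```python
# Toy training set: each row of X is (x1, x2); y holds the class labels.
X = [(1, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]
y = [1, 1, 0, 1, 0, 1]

n = len(y)
n_c1 = sum(y)            # number of points labelled 1
p_c1 = n_c1 / n          # P(C1) = count of class 1 / total count
p_c0 = 1 - p_c1          # P(C0)

def likelihood(i, value, label):
    """P(xi = value | C = label), estimated by counting."""
    in_class = [x for x, c in zip(X, y) if c == label]
    return sum(1 for x in in_class if x[i] == value) / len(in_class)

# Classify a query point by comparing P(C) * prod P(xi|C)
xq = (1, 0)
score_c1, score_c0 = p_c1, p_c0
for i, v in enumerate(xq):
    score_c1 *= likelihood(i, v, 1)
    score_c0 *= likelihood(i, v, 0)
label = 1 if score_c1 > score_c0 else 0
print(label)  # → 1
```

Note that on this toy data P(x1 = 1 | C0) comes out as 0, which zeroes the entire product for class 0; in practice a smoothing scheme such as Laplace (add-one) smoothing is commonly applied to avoid exactly this, though it is outside the scope of this derivation.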

Once we have P(C1 | xq) and P(C0 | xq), we can simply compare them and assign xq the class label corresponding to the higher probability.

Though we have done this derivation for binary classification, the same can be extended to multi-class classification.


Naive Bayes algorithms are used extensively in spam detection, polarity analysis, and sentiment analysis. It is a family of algorithms, all of which are based on common principles: Bayes’ theorem and the naive independence assumption. Naive Bayes is very efficient and simple to understand.

The main limitation of Naive Bayes classifiers is the assumption of independent predictor features. Naive Bayes assumes that all the features are mutually independent. In real life, it’s almost impossible to have a dataset in which all the features are completely mutually independent.

