You now understand the formulas for updating the weights and outputs, but one important piece is still missing: how to evaluate the estimated probability distribution. In the following, we will discuss two key measures that are often used in BNN.
4. Negative Log-Likelihood
For regression problems, we typically use Mean Squared Error (MSE) as the loss function in SNN, since the network only gives a point estimate. In BNN we do something different: because the model predicts a distribution, we use the negative log-likelihood as the loss function.
Okay, let’s explain the three parts of that name one by one.
Likelihood is the joint probability of the observed data, viewed as a function of the predicted distribution. In other words, it measures how probable the observed data are under our predicted distribution. The larger the likelihood, the better our predicted distribution fits the data.
We take the logarithm of the likelihood because it makes the calculation easier. By the log property log(ab) = log a + log b, the product of many probabilities becomes a sum of their logs, which is also more stable numerically than multiplying many small numbers.
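Written out, the log property gives the identity below. This is a sketch under two assumptions that the text does not state: the N observations y_1, ..., y_N are independent, and q denotes the predicted density. Both symbols are notation introduced here for illustration.

```latex
\log L = \log \prod_{i=1}^{N} q(y_i) = \sum_{i=1}^{N} \log q(y_i)
```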
Last but not least, we add the negative sign to form the negative log-likelihood because, in machine learning, the convention is to minimize a cost or loss function rather than to maximize an objective. Maximizing the log-likelihood is the same as minimizing its negative.
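To make the three pieces concrete, here is a minimal sketch of the negative log-likelihood for a regression model. It assumes the model predicts a Gaussian (a mean and a standard deviation) for each target, which is a common choice but is not stated in the text. The function name `gaussian_nll` and the sample numbers are made up for illustration.

```python
import math


def gaussian_nll(y_true, mu, sigma):
    """Average negative log-likelihood of y_true under N(mu_i, sigma_i^2).

    Assumption: each prediction is an independent Gaussian, so the
    log-likelihood of the data set is the sum of per-point log densities.
    """
    total = 0.0
    for y, m, s in zip(y_true, mu, sigma):
        # -log N(y | m, s^2) = 0.5 * log(2 * pi * s^2) + (y - m)^2 / (2 * s^2)
        total += 0.5 * math.log(2 * math.pi * s ** 2) + (y - m) ** 2 / (2 * s ** 2)
    return total / len(y_true)


# Illustrative data: three targets and two candidate sets of predictions.
y = [1.0, 2.0, 3.0]
nll_close = gaussian_nll(y, mu=[1.1, 1.9, 3.2], sigma=[0.5, 0.5, 0.5])
nll_far = gaussian_nll(y, mu=[3.0, 0.0, 1.0], sigma=[0.5, 0.5, 0.5])

print("NLL, predictions close to the data:", nll_close)
print("NLL, predictions far from the data:", nll_far)
```

The predictions that sit closer to the observed targets give the smaller loss, which is the "larger likelihood, better fit" idea with the sign flipped. Unlike MSE, the loss also depends on the predicted spread `sigma`, so a model that is confidently wrong is penalized more heavily.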
5. Kullback-Leibler Divergence (KL Divergence)
KL divergence quantifies how much one distribution differs from another. Let’s say p is the true distribution and q is the predicted distribution. The KL divergence from p to q is equal to the cross-entropy between the two distributions minus the entropy of the true distribution p. It is never negative, and it is zero only when q matches p. In other words, it tells us how much further the predicted distribution q can be improved.
For those who are not familiar with entropy and cross-entropy: simply speaking, entropy is the lower bound on the “cost” of representing the true distribution p, while cross-entropy is the “cost” of representing the true distribution p using the predicted distribution q. It follows that KL divergence represents how much further the “cost” of using the predicted distribution q can be reduced.
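The relationship between the three quantities can be checked numerically. The sketch below assumes discrete distributions given as lists of probabilities, with q nonzero wherever p is nonzero. The function names and the example values of p and q are chosen only for illustration.

```python
import math


def entropy(p):
    """H(p) = -sum_i p_i * log(p_i): the lowest possible 'cost' for p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i): the 'cost' of representing p with q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i): the extra 'cost' from using q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Illustrative distributions over three outcomes.
p = [0.5, 0.3, 0.2]  # "true" distribution
q = [0.4, 0.4, 0.2]  # "predicted" distribution

kl = kl_divergence(p, q)
gap = cross_entropy(p, q) - entropy(p)

print("KL(p || q):                ", kl)
print("cross-entropy minus entropy:", gap)
print("KL(p || p):                ", kl_divergence(p, p))
```

The first two printed values agree up to floating-point rounding, and the divergence of p from itself is zero, since there is no "cost" left to reduce.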
So back to today’s focus: p refers to the true distribution of the model weights and outputs, while q is our predicted distribution. We use KL divergence to measure the difference between the two distributions and then update our predicted distribution to reduce it.
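As a rough illustration of how the divergence shrinks as q approaches p, the sketch below assumes both distributions are univariate Gaussians, for which KL(p || q) has a well-known closed form. That Gaussian assumption, the function name `kl_gaussians`, and the parameter values are illustrative and not taken from the text. In a real BNN the true distribution is not known and is stood in for here by fixed numbers.

```python
import math


def kl_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(p || q) for p = N(mu_p, sigma_p^2), q = N(mu_q, sigma_q^2)."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
        - 0.5
    )


# Hypothetical "true" distribution p of a single weight.
mu_p, sigma_p = 0.0, 1.0

# Three hypothetical predicted distributions q, each closer to p than the last.
kl_values = [
    kl_gaussians(mu_p, sigma_p, mu_q, sigma_q)
    for mu_q, sigma_q in [(2.0, 2.0), (1.0, 1.5), (0.0, 1.0)]
]

for value in kl_values:
    print("KL(p || q):", value)
```

Each step that moves the mean and standard deviation of q toward those of p lowers the divergence, and it reaches zero when the two Gaussians coincide.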