Bayesian Regression From Scratch


Photo by Klim Musalimov on Unsplash


Linear Regression is the best-known algorithm in Data Science, but there is more than one version of it. The version most people use comes from the Frequentist interpretation of statistics; another comes from the Bayesian school of thought.

In this article, we will go over Bayes’ theorem, the difference between Frequentist and Bayesian statistics and finally carry out Bayesian Linear Regression from scratch using Python.

Note: Throughout the post I assume the reader has a basic understanding of Bayesian statistics and Linear Regression. I do recap these topics, but not to a depth where a brand-new reader could fully grasp them.

Bayesian Inference Re-cap

Bayes’ Theorem

Bayes’ theorem is written as follows:

P(A | B) = P(B | A) · P(A) / P(B)

Where P(A | B) is the posterior probability of A given B, P(B | A) is the likelihood, P(A) is the prior and P(B) is the evidence.
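To make the theorem concrete, here is a quick numerical illustration in Python (the scenario and numbers are made up for this example, not taken from the article):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
# Made-up numbers: a test with 99% sensitivity, a 5% false-positive
# rate, and 1% prevalence of the condition being tested for.
p_a = 0.01              # P(A): prior
p_b_given_a = 0.99      # P(B|A): likelihood
p_b_given_not_a = 0.05  # P(B|~A): false-positive rate

# P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A|B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.167
```

Even with a very accurate test, the low prior drags the posterior down to about 17% — the prior matters.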

If you are not familiar with Bayes’ theorem, I highly recommend you check out my previous article on the subject:

Bayesian Updating

Bayes’ theorem is used to update our belief about a certain event in light of new data using the following formula:

posterior ∝ likelihood × prior, i.e. P(θ | data) ∝ P(data | θ) · P(θ)

After we calculate the posterior, we may acquire new data about what we are trying to model. We then calculate the new posterior with this new data using the old posterior as the new prior. This process of updating the prior with new data is called Bayesian updating. This is what Bayesian inference essentially is.
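As a toy sketch of this updating loop (a coin-flip example I have added for illustration; it is not part of the regression that follows), each observation multiplies the prior by the likelihood and renormalises, and the old posterior becomes the new prior:

```python
import numpy as np

# Grid of candidate values for a coin's heads-probability
theta = np.linspace(0, 1, 101)
prior = np.ones_like(theta) / len(theta)  # uniform prior

# Observe flips one at a time: 1 = heads, 0 = tails
for flip in [1, 1, 0, 1]:
    likelihood = theta if flip == 1 else 1 - theta
    posterior = prior * likelihood
    posterior /= posterior.sum()  # normalise
    prior = posterior             # old posterior becomes new prior

# Posterior mean after 3 heads, 1 tail: close to the Beta(4, 2) mean of 2/3
print(prior @ theta)
```

This grid-over-parameters approach is exactly what we will do for the regression parameters below, just in three dimensions instead of one.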

You can read more about Bayesian updating in one of my recent articles:

Regression Theory

Linear Regression

Regression aims to estimate the effect of some feature, x, on a certain target, y:

y = β_0 + β_1·x + ε

Where β_0 is the intercept, β_1 is the coefficient that scales the relation between the target and feature and ε is the error term, which in Linear Regression follows a normal distribution:

ε ~ N(0, σ²)

Where σ is the standard deviation.

The aim of Linear Regression is to determine the best values of the parameters β_0, β_1 and σ to describe the relationship between the feature, x, and the target, y.

Note: I am sure most people reading this are aware of what Linear Regression is; if not, there are many resources out there that can probably explain it to you better than I can!

Frequentist View

The most well known way of finding the parameters of the Linear Regression model comes from the Frequentist view of statistics.

The Frequentist view uses the Ordinary Least Squares (OLS) method, which estimates the parameters by minimising the residual sum of squares (RSS):

RSS = Σᵢ (y_i − ŷ_i)²

Where y is the vector of actual values and ŷ is the vector of predictions from our model, which takes the generic form:

ŷ = Xβ

Where X is the matrix of features and β is the vector of parameters.

The generic solution to this OLS equation is:

β̂ = (XᵀX)⁻¹ Xᵀ y

A full derivation of this solution can be found here.
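As a quick sanity check, the closed-form solution can be sketched in a few lines of NumPy on synthetic data (the data, noise level and seed here are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2 + 3x + noise
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(scale=0.5, size=100)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # approximately [2, 3]
```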

The key thing from the Frequentist approach is that we get a single fixed value for each parameter.

The final model, in a way, assumes that the data we are modelling must have come from these fixed parameters. However, it is impossible to acquire all possible data, so it can seem overconfident to assume that these single-valued parameters are 100% right. Another way of framing this is that we have assumed we have enough data to deduce a meaningful single value for each parameter.

Bayesian View

The other view is that the parameters take on a distribution of values with some being more likely than others. It considers several plausible parameter combinations that could have produced the observed data.

We have an initial view/range of what we think the parameters could be, for example we could think that the intercept is equally likely to be any number between 0 and 10. This is the prior of our parameters.

We then update the priors using the observed data to create a posterior distribution for each parameter.

That means the target, y, is now a normally distributed random variable conditioned on the data, x, and the parameters β_0, β_1 and σ:

y ~ N(β_0 + β_1·x, σ²)

And so the likelihood for each target variable is the probability density function (PDF) of the normal distribution:

P(y_i | x_i, β_0, β_1, σ) = (1 / (σ√(2π))) · exp(−(y_i − β_0 − β_1·x_i)² / (2σ²))

The product of all the individual likelihoods for each y_i gives the total likelihood of the current model with its given parameters.
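To make this concrete, here is a minimal sketch of evaluating that likelihood for a candidate parameter set (the synthetic data and the `log_likelihood` helper are my own illustration; in practice one sums log-densities rather than multiplying raw densities, to avoid numerical underflow):

```python
import numpy as np
from scipy.stats import norm

# Synthetic data for illustration: y = 2 + 3x + N(0, 1) noise
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=50)

def log_likelihood(b0, b1, sigma, x, y):
    """Sum of log N(0, sigma) densities of the residuals."""
    residuals = y - (b0 + b1 * x)
    return norm(0, sigma).logpdf(residuals).sum()

# Parameters close to the truth should score higher than a poor guess
print(log_likelihood(2, 3, 1, x, y) > log_likelihood(0, 0, 1, x, y))  # True
```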

A denser and fuller derivation of this likelihood function, and of Bayesian regression as a whole, can be found here.

A nice way to view the Bayesian method is that we update the distribution of the parameters as we acquire more data, and our model becomes more certain of what the parameters ought to be. They may well end up at the Frequentist OLS estimates, but this is not guaranteed.

This has been a quick rundown of the differences between Frequentist and Bayesian statistics. If you want a better understanding, there are many resources out there, but I liked this blog post!

Bayesian Regression in Python

Let’s now go through implementing Bayesian Linear Regression from scratch for a simple model with one feature!

Generating Data

We start by generating some data in Python using the make_regression function from sklearn:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Generate data with a single feature
# (the keyword arguments after n_samples were cut off in the original
#  post; noise and random_state here are assumed values)
x, y = datasets.make_regression(n_samples=100, n_features=1,
                                noise=10, random_state=42)

# Plot data
fig, ax = plt.subplots(figsize=(9,5))
ax.scatter(x, y)
Plot generated by author in Python.

Ordinary Least Squares

We can estimate the Frequentist regression line using the method of OLS via the statsmodels package:

# Packages
import numpy as np
import statsmodels.formula.api as smf

# Create a dataframe
data = pd.DataFrame(list(zip(x.flatten(), y)), columns=['x', 'y'])

# Calculate the slope and intercept
formula = 'y ~ x'
results = smf.ols(formula, data=data).fit()

# Get our equation of the OLS line
intercept = results.params['Intercept']
slope = results.params['x']
x_vals = np.arange(min(x), max(x), 0.1)
ols_line = slope*x_vals + intercept

# Plot the OLS line
fig, ax = plt.subplots(figsize=(9,5))
ax.scatter(data['x'], data['y'])
ax.plot(x_vals, ols_line, label='OLS Fit', color='red')
Plot generated by author in Python.

For clarity, the slope is β_1, the intercept is β_0 and sigma is σ — the notation we used above in the theory section to describe the regression line.

This is the Frequentist interpretation as we now only have single estimates for each parameter. We will now carry out the Bayesian version.


Priors

To begin, we need to assign a prior distribution to each parameter. Using the results of the OLS fit, we construct a uniform uninformative prior with a range that extends 20% either side of the OLS estimate:

def make_uninformative_prior(name, data):
    """Make a uniform prior over the given range of values."""
    df = pd.DataFrame({name: data})
    df['prior'] = 1/len(data)
    return df

# Create a range of values for the slope
data_slope = np.linspace(results.params['x']*0.8,
                         results.params['x']*1.2, num=60)
prior_slope = make_uninformative_prior('slope', data_slope)

# Create a range of values for the intercept
data_inter = np.linspace(results.params['Intercept']*0.8,
                         results.params['Intercept']*1.2, num=60)
prior_inter = make_uninformative_prior('intercept', data_inter)

# Create a range of values for sigma
data_sigma = np.linspace(results.resid.std()*0.8,
                         results.resid.std()*1.2, num=60)
prior_sigma = make_uninformative_prior('sigma', data_sigma)

If we had some prior knowledge of the actual parameters, we could have used a different prior that weights certain values more heavily. The choice of prior is ultimately subjective, which is a common argument against Bayesian statistics, as it leads to non-objective probabilities.

We now compute the joint prior distribution of the three parameters, i.e. the prior probability of each specific combination of parameter values:

# Counter for the row index
counter = 0

# Dataframe to store the combinations in
df = pd.DataFrame(columns=['slope', 'intercept', 'sigma', 'prior'])

# Iterate through the slope
for slope in prior_slope['slope']:
    prob_slope = \
        prior_slope['prior'].loc[prior_slope['slope'] == slope]

    # Iterate through the intercept
    for intercept in prior_inter['intercept']:
        prob_inter = \
            prior_inter['prior'].loc[prior_inter['intercept'] == intercept]

        # Iterate through the error
        for sigma in prior_sigma['sigma']:
            prob_sigma = \
                prior_sigma['prior'].loc[prior_sigma['sigma'] == sigma]

            # Prior of this combination: product of the marginal priors
            prob = float(prob_slope) * float(prob_inter) * float(prob_sigma)

            # Insert the row of data
            df.loc[counter] = [slope, intercept, sigma, prob]

            # Update row index
            counter += 1

Apologies if the formatting appears awkwardly on your screen. If it does, I highly recommend you view the code in my GitHub repo here to make it more interpretable and understandable!

I am fully aware that for loops are not optimal and that vectorised implementations using pandas and numpy would be quicker. However, I think loops make it clearer what is going on!
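For the curious, the same grid of prior combinations can be built without loops. Here is one possible vectorised sketch (a small illustrative grid with assumed values, not the article's 60-point grids):

```python
import numpy as np
import pandas as pd

# Small illustrative grids (the post uses 60 points per parameter)
slopes = np.linspace(40, 60, 5)
intercepts = np.linspace(-2, 2, 5)
sigmas = np.linspace(8, 12, 5)

# Cartesian product of all three grids in one shot
grid = np.array(np.meshgrid(slopes, intercepts, sigmas)).T.reshape(-1, 3)
df_vec = pd.DataFrame(grid, columns=['slope', 'intercept', 'sigma'])

# Uniform prior: every combination is equally likely
df_vec['prior'] = 1 / len(df_vec)
print(len(df_vec))  # 125 combinations
```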

As we have uninformative priors for each parameter, each combination has the exact same prior probability:

Image by author.


Likelihood

As we said before, the posterior is directly proportional to the product of the prior and the likelihood. Therefore, to reach the posterior distribution for each parameter, we need to compute the likelihood of the observed data under each parameter combination.

To compute the likelihood, we build a model for each possible combination (every row) and compute the residuals to find the likelihood using the formula I stated earlier:

# Packages
from scipy.stats import norm

# Counter for the row index
counter = 0

# Initialise the likelihood column (values are overwritten below)
df['likelihood'] = df['prior']

# Loop over the combinations of values
for slope in prior_slope['slope']:
    for intercept in prior_inter['intercept']:
        for sigma in prior_sigma['sigma']:

            # Compute the predictions from this line
            predictions = slope * data['x'] + intercept

            # Compute the residuals/errors of this line
            residuals = data['y'] - predictions

            # Likelihood of each residual under N(0, sigma)
            likelihoods = norm(0, sigma).pdf(residuals)

            # Total likelihood is the product over all data points
            df.loc[counter, 'likelihood'] = likelihoods.prod()

            counter += 1

This part may take a while to run, which is one of the issues with using Bayesian methods for large-scale models. To see why, check out my previous post on Bayesian Conjugate Priors, which covers some of the shortcomings of carrying out Bayes’ theorem:

Our dataframe now looks like this:

Image by author.

Bayesian Update

We can now carry out our Bayesian update as follows:

df['posterior'] = df['prior'] * df['likelihood']
df['posterior'] = df['posterior']/df['posterior'].sum()

And our resulting dataframe is now:

Image by author.

Marginal Distributions

To output the marginal posterior distribution for each parameter, we need to sum the posterior over the other two parameters. For example, to find the marginal posterior distribution of the slope, we sum the posterior over sigma and the intercept for each value of the slope (essentially a discrete integration):

slope_df = df.groupby('slope').sum()

We can then plot the posterior distribution of the slope:

# Plot the posterior distribution of the slope
plt.plot(slope_df.index, slope_df.posterior, linewidth=3)
plt.xlabel('Slope Value', fontsize=18)
plt.ylabel('PDF', fontsize=18)
plt.axvline(results.params['x'], color='red', ls='--', label='OLS Estimate')
plt.legend(fontsize=14)
Plot generated by author in Python.

The slope is now a distribution!
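Once we have a marginal posterior like this, it is easy to summarise. Here is a small self-contained sketch (with a toy, bell-shaped discrete posterior standing in for `slope_df`) that computes the posterior mean and an approximate central 90% credible interval:

```python
import numpy as np
import pandas as pd

# Toy discrete posterior over slope values (stand-in for slope_df)
values = np.linspace(40, 60, 201)
weights = np.exp(-0.5 * ((values - 50) / 2) ** 2)  # bell-shaped
posterior = pd.Series(weights / weights.sum(), index=values)

# Posterior mean of the slope
mean = float((posterior * posterior.index.values).sum())

# Approximate central 90% credible interval from the cumulative sums
cdf = posterior.cumsum()
lower = posterior.index[np.searchsorted(cdf.values, 0.05)]
upper = posterior.index[np.searchsorted(cdf.values, 0.95)]
print(mean, (lower, upper))
```

A credible interval like this is something the single-valued Frequentist estimate cannot give us directly.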

We can carry out a similar calculation for the intercept and error terms:

intercept_df = df.groupby('intercept').sum()
sigma_df = df.groupby('sigma').sum()

# Plot the posterior distribution of the intercept
plt.plot(intercept_df.index, intercept_df.posterior, linewidth=3)
plt.xlabel('Intercept Value', fontsize=18)
plt.ylabel('PDF', fontsize=18)
plt.axvline(results.params['Intercept'], color='red', ls='--', label='OLS Estimate')
plt.legend(fontsize=14)
plt.show()

# Plot the posterior distribution of sigma
plt.plot(sigma_df.index, sigma_df.posterior, linewidth=3)
plt.xlabel('Sigma Value', fontsize=18)
plt.ylabel('PDF', fontsize=18)
plt.axvline(results.resid.std(), color='red', ls='--', label='OLS Estimate')
plt.legend(fontsize=14)
plt.show()
Plot generated by author in Python.
Plot generated by author in Python.

The OLS estimate is the most likely value for these parameters, however it is not the only plausible one. There are 216,000 (60³) parameter combinations in our grid that we could use to model the data!


This calculation is not that straightforward, and writing this blog is what finally got me to understand the whole Bayesian regression process. I would advise the reader to check out the full notebook on my GitHub and have a play with it:


Conclusion

In this article we have recapped Bayes’ theorem, explained the key differences between Frequentist and Bayesian statistics and finally carried out Bayesian Linear Regression from scratch.

As you may have noticed, this topic requires quite a lot of prerequisite knowledge to fully grasp. If I had tried to fit in all the background topics, this blog would literally have been a textbook!

Therefore, I would recommend anyone who hasn’t fully understood what we did here to go over Bayesian statistics and Linear Regression. There are so many resources on these subjects that will teach you better than I can!

Connect With Me!

