
# Bayesian Regression From Scratch

# Introduction

**Linear Regression** is the most well-known algorithm in Data Science; however, there is more than one version of it. The version most people use comes from the **Frequentist** interpretation of statistics, but there is another that comes from the **Bayesian** school of thought.

In this article, we will go over **Bayes’ theorem**, the difference between Frequentist and Bayesian statistics and finally carry out **Bayesian Linear Regression** from scratch using Python.

Note: Throughout the post I assume the reader has a basic understanding of Bayesian statistics and Linear Regression. I do recap these topics, but not to a depth where a brand-new reader may fully grasp them.

# Bayesian Inference Re-cap

## Bayes’ Theorem

**Bayes’ theorem** is written as follows:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

If you are not familiar with Bayes’ theorem, I highly recommend you check out my previous article on the subject:

## Bayesian Updating

Bayes’ theorem is used to update our belief about a certain event in light of new data using the following formula:

$$\text{posterior} \propto \text{likelihood} \times \text{prior}$$

After we calculate the posterior, we may acquire new data about what we are trying to model. We then calculate the new posterior with this new data, using the old posterior as the new prior. This process of updating the prior with new data is called Bayesian updating, and it is what **Bayesian inference** essentially is.
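As a minimal sketch of this updating loop (a toy coin-bias example of my own, not from the article), we can represent the prior on a discrete grid and fold in each batch of data, with the old posterior becoming the new prior:

```python
import numpy as np

# Hypotheses: possible values for a coin's probability of heads
grid = np.linspace(0, 1, 101)

# Uninformative prior: every bias equally likely
prior = np.full(len(grid), 1 / len(grid))

def update(prior, heads, tails):
    """One round of Bayesian updating: posterior ∝ prior × likelihood."""
    likelihood = grid**heads * (1 - grid)**tails
    posterior = prior * likelihood
    return posterior / posterior.sum()

# First batch of data: 6 heads, 4 tails
posterior = update(prior, heads=6, tails=4)

# New data arrives: the old posterior becomes the new prior
posterior = update(posterior, heads=3, tails=7)

# Most probable bias after seeing 9 heads in 20 flips (≈ 0.45)
print(grid[np.argmax(posterior)])
```

Because the prior is uniform, the final posterior peaks at the overall observed frequency, 9/20.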

You can read more about Bayesian updating in one of my recent articles:

# Regression Theory

## Linear Regression

**Regression** aims to estimate the effect of some feature, *x*, on a certain target, *y*:

$$y = \beta_0 + \beta_1 x + \varepsilon$$

Where *β_0* is the intercept, *β_1* is the coefficient that scales the relation between the target and feature, and *ε* is the error term, which in Linear Regression follows a **normal distribution**:

$$\varepsilon \sim N(0, \sigma^2)$$

Where *σ* is the **standard deviation**.

The aim of Linear Regression is to determine the best values of the parameters *β_0*, *β_1* and *σ* that describe the relationship between the feature, *x*, and the target, *y*.

Note: I am sure most people reading this are aware of what Linear Regression is; if not, there are many resources out there that can probably explain it to you better than I can!

## Frequentist View

The most well-known way of finding the parameters of the Linear Regression model comes from the Frequentist view of statistics.

The Frequentist view uses the **Ordinary Least Squares (OLS)** method, minimising the **residual sum of squares (RSS)** to estimate the parameters:

$$RSS = \sum_i (y_i - \hat{y}_i)^2$$

Where *y* is the actual values and *ŷ* is the prediction from our model, which takes the generic form of:

$$\hat{y} = X\beta$$

Where *X* and *β* are an array of features and parameters.

The generic solution to this OLS equation is:

$$\beta = (X^T X)^{-1} X^T y$$

A full derivation of this solution can be found here.
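To make the closed-form solution concrete, here is a small standalone sketch (my own illustration, with made-up data generated from a known line) that computes β = (XᵀX)⁻¹Xᵀy with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generate data from the known line y = 2 + 3x + noise
x = rng.uniform(-5, 5, size=200)
y = 2 + 3 * x + rng.normal(0, 0.5, size=200)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed-form OLS solution: beta = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta)  # approximately [2, 3]
```

In practice `np.linalg.lstsq` is preferred over explicitly inverting XᵀX for numerical stability, but the explicit form mirrors the equation above.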

The key thing from the Frequentist approach is that we get a *single fixed value* for each parameter.

The final model has, in a way, assumed that the data we are modelling must come from these fixed parameters. However, it is impossible to acquire all the data, so it could seem foolish to assume that these *single-valued* parameters are 100% right. Another way of framing this is that we have assumed we have enough data to deduce a meaningful single value for each parameter.

## Bayesian View

The other view is that the parameters take on a distribution of values with some being more likely than others. It considers several plausible parameter combinations that could have produced the observed data.

We have an initial view/range of what we think the parameters could be, for example we could think that the intercept is equally likely to be any number between 0 and 10. This is the prior of our parameters.

We then update the priors using the observed data to create a posterior distribution for each parameter.

That means the target, *y*, is now a random variable whose distribution depends on the data, *x*, and the parameters *β_0, β_1, σ*:

$$y \sim N(\beta_0 + \beta_1 x,\; \sigma^2)$$

And so the likelihood for each target variable is the **probability density function (PDF)** of the normal distribution:

$$L(y_i \mid \beta_0, \beta_1, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2}\right)$$

The product of all the single likelihoods for each *y_i* produces the total likelihood of the current model with its given parameters.
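As a hedged illustration of this product (with made-up numbers, not the article's data): the total likelihood of a candidate line is the product of the normal PDF evaluated at each residual. In practice log-likelihoods are summed instead, to avoid numerical underflow:

```python
import numpy as np
from scipy.stats import norm

# Toy data and one candidate parameter combination
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
beta_0, beta_1, sigma = 1.0, 2.0, 0.5

# Residuals under this candidate line
residuals = y - (beta_0 + beta_1 * x)

# Total likelihood: product of the individual normal PDFs
likelihood = norm(0, sigma).pdf(residuals).prod()

# Equivalent and more numerically stable: sum of log-PDFs
log_likelihood = norm(0, sigma).logpdf(residuals).sum()

print(np.isclose(np.log(likelihood), log_likelihood))  # True
```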

A more dense and full derivation of this likelihood function and Bayesian regression as a whole can be found here.

A nice way to view the Bayesian method is that we update the distribution of parameters as we acquire more data, and our model becomes more certain of what the parameters ought to be. They may well centre on the OLS Frequentist estimate, but this is not guaranteed.

This has been a quick run down of the differences between Frequentist and Bayesian statistics. If you want to get a better understanding, there are many resources out there, but I liked this blog post!

# Bayesian Regression in Python

Let's now go through implementing Bayesian Linear Regression from scratch for a simple model where we have one feature!

## Generating Data

We start by generating some data in Python using the *make_regression* function from sklearn:

```python
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Generate data
x, y = datasets.make_regression(n_samples=100,
                                n_features=1,
                                noise=10)

# Plot data
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(x, y)
ax.ticklabel_format(style='plain')
plt.xlabel('x', fontsize=18)
plt.ylabel('y', fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.show()
```

## Ordinary Least Squares

We can estimate the Frequentist regression line using the method of OLS using the statsmodel package:

```python
# Packages
import numpy as np
import statsmodels.formula.api as smf

# Create a dataframe
data = pd.DataFrame(list(zip(x.flatten(), y)), columns=['x', 'y'])

# Calculate the slope and intercept
formula = 'y ~ x'
results = smf.ols(formula, data=data).fit()

# Get our equation of the OLS line
intercept = results.params['Intercept']
slope = results.params['x']
x_vals = np.arange(min(x), max(x), 0.1)
ols_line = slope*x_vals + intercept

# Plot the OLS line
fig, ax = plt.subplots(figsize=(9, 5))
ax.scatter(data['x'], data['y'])
ax.plot(x_vals, ols_line, label='OLS Fit', color='red')
ax.ticklabel_format(style='plain')
plt.xlabel('x', fontsize=18)
plt.ylabel('y', fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.legend(fontsize=16)
plt.show()
```

For clarity, the slope is *β_1*, the intercept is *β_0* and sigma is *σ*; these are what we used above in the theoretical section to describe the regression line.

This is the Frequentist interpretation, as we now only have single estimates for each parameter. We will now carry out the Bayesian version.

## Priors

To begin, we need to assign some prior distribution to our parameters. Using the results from the OLS estimate, we construct a uniform uninformative prior with a range that is 20% either side of the OLS estimate:

```python
def make_uninformative_prior(name, data):
    """Function to make priors."""
    df = pd.DataFrame({name: data})
    df['Prior'] = 1/len(data)
    return df

# Create a range of values for the slope
data_slope = np.linspace(results.params['x']*0.8,
                         results.params['x']*1.2, num=60)
prior_slope = make_uninformative_prior('slope', data_slope)
prior_slope.head()

# Create a range of values for the intercept
data_inter = np.linspace(results.params['Intercept']*0.8,
                         results.params['Intercept']*1.2, num=60)
prior_inter = make_uninformative_prior('intercept', data_inter)
prior_inter.head()

# Create a range of values for sigma
data_sigma = np.linspace(results.resid.std()*0.8,
                         results.resid.std()*1.2, num=60)
prior_sigma = make_uninformative_prior('sigma', data_sigma)
prior_sigma.head()
```

If we had some knowledge of the actual parameters, we may have used a different prior that weighted certain values of the parameters differently. The priors are completely arbitrary and subjective. This is often one argument against Bayesian statistics as it leads to non-objective probabilities.

We now compute the joint distributions of the three parameters. This tells us how likely a specific combination of parameters is at explaining the observed data:

```python
# Counter for the row index
counter = 0

# Dataframe to store the combinations in
df = pd.DataFrame(columns=['slope', 'intercept', 'sigma', 'prior'])

# Iterate through the slope
for slope in prior_slope['slope']:
    prob_slope = \
        prior_slope['Prior'].loc[prior_slope['slope'] == slope]

    # Iterate through the intercept
    for intercept in prior_inter['intercept']:
        prob_inter = \
            prior_inter['Prior'].loc[prior_inter['intercept'] == intercept]

        # Iterate through the error
        for sigma in prior_sigma['sigma']:
            prob_sigma = \
                prior_sigma['Prior'].loc[prior_sigma['sigma'] == sigma]

            # Calculate the prior of this specific combination
            prob = \
                float(prob_slope) * float(prob_inter) * float(prob_sigma)

            # Insert the row of data
            df.loc[counter] = [slope, intercept, sigma, prob]

            # Update row index
            counter += 1
```

Apologies if the formatting appears awkward on your screen. If it does, I highly recommend you view the code in my GitHub repo here to make it more interpretable and understandable!

I am fully aware that for loops are not optimal and that vectorised implementations using pandas and numpy would be quicker. However, I think using loops allows us to understand better what is going on!
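For reference, a vectorised version of the same grid construction could look like the sketch below. This is my own illustration using stand-in parameter ranges rather than the `prior_slope`-style dataframes above; `np.meshgrid` enumerates every combination without the triple loop:

```python
import numpy as np
import pandas as pd

# Candidate values for each parameter (stand-ins for the 60-point grids)
slopes = np.linspace(0.8, 1.2, 60)
intercepts = np.linspace(-2, 2, 60)
sigmas = np.linspace(8, 12, 60)

# Every combination in one shot
S, B, E = np.meshgrid(slopes, intercepts, sigmas, indexing='ij')
df = pd.DataFrame({
    'slope': S.ravel(),
    'intercept': B.ravel(),
    'sigma': E.ravel(),
    'prior': 1 / S.size,   # uniform joint prior over all combinations
})

print(len(df))  # 216000
```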

As we have uninformative priors for each parameter, each combination has the exact same prior probability:

## Likelihood

As we said before, the posterior is directly proportional to the product of the prior and the likelihood. Therefore, to reach the posterior distribution for each parameter, we need to compute the likelihood of the observed data under each parameter combination.

To compute the likelihood, we build a model for each possible combination (every row) and compute the residuals to find the likelihood using the formula I stated earlier:

```python
# norm gives us the normal PDF we saw above
from scipy.stats import norm

counter = 0
df['likelihood'] = df['prior']

# Loop over the combinations of values
for slope in prior_slope['slope']:
    for intercept in prior_inter['intercept']:
        for sigma in prior_sigma['sigma']:

            # Compute the predictions from this line
            predictions = slope * data['x'] + intercept

            # Compute residual/errors of this line
            residual = data['y'] - predictions

            # Compute the likelihood function that we saw above
            likelihoods = norm(0, sigma).pdf(residual)

            # Compute the total likelihood
            df['likelihood'].loc[counter] = likelihoods.prod()
            counter += 1
```

This part may take a while to run, which is one of the issues with using Bayesian methods for large-scale models. To see why this is the case, check out my previous post on Bayesian conjugate priors, which covers some shortcomings of carrying out Bayes’ theorem:

Our dataframe is now looking like:

## Bayesian Update

We can now carry out our Bayesian update as follows:

```python
# Posterior ∝ prior × likelihood, then normalise
df['posterior'] = df['prior'] * df['likelihood']
df['posterior'] = df['posterior'] / df['posterior'].sum()
```

And our resulting dataframe is now:

## Marginal Distributions

To output the marginal posterior distribution for each parameter, we need to sum up the posteriors over the two other parameters. For example, to find the marginal posterior distribution of the slope, we sum up, for each value of the slope, the posterior over sigma and the intercept (essentially an integration):

```python
slope_df = df.groupby('slope').sum()
```
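To see why `groupby(...).sum()` marginalises, here is a tiny standalone example (made-up numbers, two parameters instead of three): summing the posterior within each slope group collapses out sigma.

```python
import pandas as pd

# A small joint posterior over two parameters
joint = pd.DataFrame({
    'slope':     [1, 1, 2, 2],
    'sigma':     [0.5, 1.0, 0.5, 1.0],
    'posterior': [0.1, 0.3, 0.25, 0.35],
})

# Marginal over sigma: sum the posterior for each slope value
marginal = joint.groupby('slope')['posterior'].sum()

print(marginal[1])  # 0.1 + 0.3
print(marginal[2])  # 0.25 + 0.35
```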

We can then plot the posterior distribution of the slope:

```python
# Plot the posterior distribution of the slope
plt.figure(figsize=(8, 5))
plt.plot(slope_df.index, slope_df.posterior, linewidth=3)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.xlabel('Slope Value', fontsize=18)
plt.ylabel('PDF', fontsize=18)
plt.axvline(results.params['x'], color='red', ls='--', label='OLS Estimate')
plt.legend(fontsize=16)
plt.show()
```

The slope is now a distribution!
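Once we have a marginal posterior, we can also summarise it with a posterior mean and a credible interval. The sketch below is my own illustration on a made-up discrete posterior; the same idea applies directly to `slope_df`:

```python
import numpy as np

# A made-up discrete posterior over candidate slope values
values = np.linspace(40, 60, 201)
weights = np.exp(-0.5 * ((values - 50) / 2)**2)
posterior = weights / weights.sum()

# Posterior mean
mean = (values * posterior).sum()

# 90% credible interval from the posterior CDF
cdf = posterior.cumsum()
lower = values[np.searchsorted(cdf, 0.05)]
upper = values[np.searchsorted(cdf, 0.95)]

print(mean)          # centre of this toy posterior (≈ 50)
print(lower, upper)  # roughly 50 ± 1.645 × 2
```

This is one thing the Frequentist point estimate cannot give us directly: a range of values with an explicit probability attached.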

We can carry out a similar calculation for the intercept and error terms:

```python
intercept_df = df.groupby('intercept').sum()
sigma_df = df.groupby('sigma').sum()

# Plot the posterior distribution of the intercept
plt.figure(figsize=(8, 5))
plt.plot(intercept_df.index, intercept_df.posterior, linewidth=3)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.xlabel('Intercept Value', fontsize=18)
plt.ylabel('PDF', fontsize=18)
plt.axvline(results.params['Intercept'], color='red', ls='--', label='OLS Estimate')
plt.legend(fontsize=16)
plt.show()

# Plot the posterior distribution of sigma
plt.figure(figsize=(8, 5))
plt.plot(sigma_df.index, sigma_df.posterior, linewidth=3)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.xlabel('Sigma Value', fontsize=18)
plt.ylabel('PDF', fontsize=18)
plt.axvline(results.resid.std(), color='red', ls='--', label='OLS Estimate')
plt.legend(fontsize=16)
plt.show()
```

The OLS estimate is the most likely value for these parameters; however, it is not the only one. There are 216,000 (60 × 60 × 60) potential parameter combinations which we can use to model the data!
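If we do want a single point estimate back from the grid, the maximum a posteriori (MAP) combination is simply the row with the highest posterior. A standalone sketch with a toy dataframe of the same shape:

```python
import pandas as pd

# Toy grid of parameter combinations with posteriors (made-up numbers)
df = pd.DataFrame({
    'slope':     [1.0, 1.0, 2.0, 2.0],
    'intercept': [0.0, 1.0, 0.0, 1.0],
    'posterior': [0.1, 0.2, 0.5, 0.2],
})

# MAP estimate: the single most probable combination
map_row = df.loc[df['posterior'].idxmax()]

print(map_row['slope'], map_row['intercept'])  # 2.0 0.0
```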

## Summary

This calculation is not that straightforward, and writing this blog is what finally got me to understand the whole Bayesian regression process. I would advise the reader to check out the full notebook on my GitHub and have a play with it:

# Conclusion

In this article we have recapped Bayes’ theorem, explained the key differences between Frequentist and Bayesian statistics and finally carried out Bayesian Linear Regression from scratch.

As you may have noticed, this topic is quite wide in the pre-requisite knowledge that you need to fully grasp it. If I had tried to fit in all the background topics this blog would have literally been a textbook!

Therefore, I would recommend that anyone who hasn’t fully understood what we did here goes over Bayesian statistics and Linear Regression. There are so many resources on these subjects that will teach you better than I can!

# Connect With Me!
