Traditional ML: Linear Regression (Maths and Code)

This is my first article on traditional machine learning algorithms, and I will start with linear regression.

Linear regression is a statistical procedure for predicting the value of one variable from the values of other variables. More technically, it is a procedure for estimating a dependent variable from one or more independent variables. Before diving deeper into the topic, let's define dependent and independent variables.

  1. Independent variables: These variables do not depend on the other variables in the provided data and are treated as the cause(s).
  2. Dependent variables: These variables depend on the independent variables for their values and are the effects of those cause(s).

These definitions give a sense of the cause-and-effect relationship involved. A scatterplot or a similar graph usually gives a quick first impression of such a relationship.

Let's look at some sample problem statements:

  1. Predicting how well a student will score in a final exam, given their score(s) in previous exams or tests.
  2. Predicting the price of a car, given its mileage and its age (2 independent variables).

Linear regression uses the equation y = b*x + a, which describes the line of best fit for the relationship between y (dependent variable) and x (independent variable). Here b is the slope of the line (the regression coefficient) and a is the intercept. The coefficient of determination, r², indicates how much of the variability of y is explained by x.


Assumptions are a great way to check whether a particular dataset can be analysed using linear regression.

  1. Both dependent and independent variables must be continuous.
  2. When plotted, the two variables should show a roughly linear relationship.
  3. Data must show homoscedasticity, which means the variance around the line of best fit remains similar as you move along the line.
  4. There should be no significant outliers. If the plot shows any, deal with them (for example, by removing them) before fitting.
  5. All the values of “y” are independent of each other, though dependent on “x.”

[Figure: linear scatterplot]

For now, let us deal with the first problem statement and see what our data looks like.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load the dataset; it has an "Hours" column and a "Scores" column
data = pd.read_csv('./data_set/scores')
data.head()  # show the first few rows

Let us also see what the plot looks like.

plt.scatter(x=data.Hours, y=data.Scores, color="green")
plt.xlabel("No. of study Hours")
plt.ylabel("Scores")
plt.title("Hours V/S Scores")
plt.show()

Before moving ahead, we need to do some math. How do we get the values of a and b, i.e. the intercept and slope of the regression line we want to draw? Both are obtained by least squares, that is, by choosing the line that minimises the sum of squared vertical distances between the data points and the line.

Pearson’s coefficient

Here x̄ and Ȳ stand for the respective means of x and Y.
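
For reference, the usual textbook least-squares expressions can be written as below. This is a sketch in assumed notation: the index i, the sample size n, and the symbols x_i and Y_i for individual observations are not from the article, while a, b, and r match the intercept, slope, and Pearson's coefficient used in the text.

```latex
\[
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
a = \bar{Y} - b\,\bar{x}
\]
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
\]
```

The slope and Pearson's coefficient share the same numerator, which is why they always have the same sign.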

r² is the proportion of the total variance (s²) of Y that can be explained by the linear regression of Y on x. 1-r² is the proportion that is not explained by the regression. Thus 1-r² = s²_Y·x / s²_Y, where s²_Y·x is the residual variance of Y about the regression line and s²_Y is the total variance of Y.

The coefficient of determination is the portion of the total variation in the dependent variable that can be explained by variation in the independent variable(s). R² ranges from 0 to 1: a value of 1 means the independent variable(s) explain the dependent variable perfectly, and the closer it gets to 0, the weaker the relationship.
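
To make the math concrete, here is a minimal pure-Python sketch that computes the slope, intercept, Pearson's coefficient, and R² by hand. The five data points and the helper names (`fit_line`, `pearson_r`, `r_squared`) are made up for illustration and are not the article's dataset or code.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y = b*x + a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def r_squared(xs, ys, a, b):
    """R² = 1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical toy data: study hours and scores (not the article's dataset)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit_line(xs, ys)
r = pearson_r(xs, ys)
r2 = r_squared(xs, ys, a, b)

print(f"slope b = {b:.3f}, intercept a = {a:.3f}")  # slope b = 0.600, intercept a = 2.200
print(f"r = {r:.3f}, R^2 = {r2:.3f}")               # r = 0.775, R^2 = 0.600
```

For a single independent variable, R² equals the square of Pearson's r, which the two printed values confirm for this toy data.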

Splitting the dataset into training and test sets

x = data.drop("Scores", axis="columns")  # features: everything except the target
y = data.drop("Hours", axis="columns")   # target: the Scores column
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2, random_state=40)
# You can always print x, y, X_train, Y_train, etc. or check their shapes, just to confirm

Implementing Linear Regression

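lin = LinearRegression()
lin.fit(X_train, Y_train)  # fit on the training split so the test set stays unseen
line = lin.coef_ * x + lin.intercept_  # fitted line: y = b*x + a
plt.scatter(x, y, color="green")
plt.plot(x, line, color="orange")
plt.show()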

Making Predictions

Y_predicted = lin.predict(X_test)
pd.DataFrame(np.c_[X_test,Y_test,Y_predicted] ,columns = ["Study hours" ,"original marks" ,"predicted Marks"])

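This gives a table with all the predicted values next to the actual values. The last step is to check the accuracy of the model, which for a regression model is typically reported as the R² score on the test set.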
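
One common way to do this check is scikit-learn's `score` method, which returns R² for a `LinearRegression` model. The sketch below is self-contained and uses synthetic data, so the generated hours and scores, the noise level, and the variable names are assumptions rather than the article's dataset or exact code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Hours/Scores data (assumed, not the article's file)
rng = np.random.default_rng(0)
hours = rng.uniform(1, 10, size=50).reshape(-1, 1)
scores = 10 * hours.ravel() + 5 + rng.normal(0, 3, size=50)

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=40
)

model = LinearRegression().fit(X_train, y_train)

# score() returns the R² of the predictions on the given data
test_r2 = model.score(X_test, y_test)

# The same number computed explicitly from the predictions
same_r2 = r2_score(y_test, model.predict(X_test))

print(f"R^2 on the test set: {test_r2:.3f}")
```

With the variables used in this article, the equivalent call would be `lin.score(X_test, Y_test)`.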


My model had an accuracy of 0.95

