Capstone Project : Regression Analysis for Housing Price Prediction Using Python

Multiple Linear Regression for Predicting Housing Price

Authors: Vincent Leonard and Jonathan Valentino


Predicting the price of a house is hard because many variables affect it. Buyers struggle to judge a fair price when every house is in a different condition: they generally want a house that is worth its price, but they do not know how much a specific condition of the house affects that price. Sellers, in turn, do not want to sell their houses below the average price. Both sides therefore need to predict housing prices accurately so that they can get or give the best price.

To solve this problem, we can use Python to analyze available data and predict the housing price from certain variables or factors.

Regression analysis is a set of statistical methods used for the estimation of relationships between a dependent variable and independent variables. It can be used to calculate the strength of the relationship between variables and for modeling the future relationship between them. There are three requirements to do a regression analysis:

  • The relationship between the variables is linear
  • Both variables must be at least interval scale
  • The least squares criterion is used to determine the equation

There are three types of regression analysis: simple regression analysis, multiple regression analysis, and non-linear regression analysis. In this case, we will use multiple regression analysis.


  1. Multiple linear regression
  2. Dataset
  3. Read dataset
  4. Independent and dependent variables
  5. Splitting data
  6. Applying model
  7. Do the Prediction
  8. Model Evaluation

Multiple Regression Analysis

Multiple regression analysis is used when we want to predict the value of a variable based on the values of two or more other variables. The variable we want to predict is called the dependent variable, and the variables we use to predict it are called the independent variables. In this case, the dependent variable is the housing price and the independent variables are the other variables that affect the housing price.

The equation of multiple regression analysis is:

  Ŷ = a + b1X1 + b2X2 + … + bkXk

where Ŷ is the predicted value of the dependent variable and:

  • X1 … Xk are the independent variables
  • a is the Y-intercept (or Constant value)
  • b1 is the net change in Y for each unit change in X1 holding X2 … Xk constant. It is called a partial regression coefficient or just a regression coefficient
  • The least squares criterion is used to develop this equation
  • Determining b1, b2, etc. by hand is very tedious, so software such as Python, R, SPSS, or Excel is recommended.
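To make the least squares criterion concrete, here is a minimal sketch that recovers a, b1, and b2 with NumPy. The toy data and variable names are assumptions made up for illustration; they are not part of the housing dataset.

```python
import numpy as np

# Hypothetical toy data: two independent variables (X1, X2) per row.
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 5.0],
])
# Y is built from known values (a = 10, b1 = 2, b2 = 3) so the fit can be checked.
y = 10.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first fitted parameter is the intercept a.
design = np.column_stack([np.ones(len(X)), X])

# Least squares: pick a, b1, b2 that minimise the sum of squared residuals.
params, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b1, b2 = params
print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

Because the toy data contain no noise, the fitted values should match the ones used to build y. With real data the fit only approximates the underlying relationship.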


Dataset

The dataset we are using contains many records describing the features and condition of houses. Analyzing these data shows how each condition affects the housing price and how it can be used to determine house prices. The data were collected from May 2014 to May 2015 in King County, USA. The dataset that will be used for this analysis is taken from

Read the Dataset

Import the libraries that will be used

Load Dataset

We can load the dataset, which is already saved in Google Drive in CSV format. You can use the provided link:

There are 20 columns in total, and we took only the 10 that are most impactful. Because more than one variable can affect the price, we use multiple linear regression.
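Loading a CSV file and keeping a subset of its columns could look roughly like the sketch below. The inline CSV text and the column names are hypothetical stand-ins; with the real file, pass its path or URL to pd.read_csv instead.

```python
import io
import pandas as pd

# Stand-in for the CSV file: three made-up rows with hypothetical column names.
csv_text = """id,price,bedrooms,bathrooms,sqft_living,zipcode
1,300000,3,1.0,1200,98001
2,450000,4,2.5,2000,98002
3,250000,2,1.0,900,98003
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
raw = pd.read_csv(io.StringIO(csv_text))

# Keep only the columns judged useful for predicting the price.
selected_columns = ["price", "bedrooms", "bathrooms", "sqft_living"]
dataset = raw[selected_columns]

print(dataset.shape)
print(dataset.head())
```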

Exploratory Data Analysis (EDA)

Why do we use EDA in our analysis? EDA is a set of methods for finding important metrics and features, often through visualization. In this analysis, we use two popular EDA methods: univariate analysis and multivariate analysis.

Univariate analysis

Univariate analysis provides summary statistics for each field in the raw dataset, that is, a summary of one variable at a time. For this method, we will use a box plot. Here is the code in Python:

# One box per numeric column; showfliers=False hides the outlier points
plt.figure(figsize=(14, 6))
sns.boxplot(data=dataset, showfliers=False)
plt.show()

Multivariate Analysis

Multivariate analysis is used to understand the interactions between two or more fields in the dataset. Common tools for this are pair plots and 3D scatter plots; the code shown here draws a heatmap of the pairwise correlations between the columns. Here is the code in Python:

# Pairwise correlations, colored from -1 (blue) to 1 (red) with the values annotated
plt.figure(figsize=(10, 6))
sns.heatmap(dataset.corr(), vmin=-1, vmax=1, cmap="coolwarm", annot=True)
plt.show()

Assign Independent and Dependent Variables

First, we need to define the dependent variable and the independent variables in Python. We can use iloc to select specific columns by position and assign their values to our variables.
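A minimal sketch of this selection with iloc, assuming a small hypothetical DataFrame in which the price sits in the first column and the features follow it:

```python
import pandas as pd

# Hypothetical dataset: price in column 0, features in the remaining columns.
dataset = pd.DataFrame({
    "price": [300000, 450000, 250000, 520000],
    "bedrooms": [3, 4, 2, 4],
    "sqft_living": [1200, 2000, 900, 2400],
})

# All rows (:), columns from position 1 onward: the independent variables.
X = dataset.iloc[:, 1:].values
# All rows, column at position 0: the dependent variable (price).
y = dataset.iloc[:, 0].values

print(X.shape, y.shape)
```

The column positions depend on how the dataset is arranged, so they should be adjusted to the actual column order.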

Split dataset into data Training and data Test

Before applying the model, we must split the data into a training set and a test set, with a larger share for training and a smaller share for testing. In this case, we give 25% of the data to the test set.
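A 75/25 split can be sketched with scikit-learn's train_test_split. The random data below are placeholders, and random_state is an assumption added only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 houses with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# test_size=0.25 holds out 25% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```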

Create regression equation from all data

The next step is to train the linear regression model. We call fit() with X_train and y_train as arguments, which estimates the intercept and the coefficient of each variable automatically.
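As a sketch of this step, assuming scikit-learn's LinearRegression and synthetic training data built from known coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free training data: y = 5 + 1.5*X1 - 2*X2.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 2))
y_train = 5.0 + 1.5 * X_train[:, 0] - 2.0 * X_train[:, 1]

regressor = LinearRegression()
regressor.fit(X_train, y_train)

# intercept_ is a; coef_ holds b1, b2, ... in column order.
print("intercept (a):", regressor.intercept_)
print("coefficients (b1, b2):", regressor.coef_)
```

On this synthetic data the fitted intercept and coefficients should be close to 5, 1.5, and -2, the values used to generate y.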

House price prediction

We can now input the data for a house whose price we want to predict. The input values must be in the same order as the columns of the dataset used for training.
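The sketch below shows one way to pass a new observation to a fitted model. The two features and their order (bedrooms, then living area) are assumptions for illustration, and the training prices are generated from a made-up formula.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data with columns in the order [bedrooms, sqft_living].
X_train = np.array(
    [[2, 900], [3, 1200], [4, 2000], [4, 2400], [3, 1500]], dtype=float
)
# Made-up pricing rule: 50,000 + 10,000 per bedroom + 200 per square foot.
y_train = 50000.0 + 10000.0 * X_train[:, 0] + 200.0 * X_train[:, 1]

regressor = LinearRegression().fit(X_train, y_train)

# One new house, same column order; predict() expects a 2D array (one row per house).
new_house = np.array([[3, 1800]], dtype=float)
predicted_price = regressor.predict(new_house)[0]
print(round(predicted_price))
```

Swapping the two input values would silently produce a wrong prediction, which is why the column order matters.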

Model Summary based on data training

We can use model evaluation to check how accurate the predictions from our model are. To produce a model summary, we can use Ordinary Least Squares (OLS) in Python.

To know how good our result is, we can analyze the OLS summary table.

First is the R-squared. R-squared ranges from 0 to 1 and measures the proportion of the variance in the dependent variable that the model explains: 1 means the model explains all of the variance in price, while 0 means it explains none. In this case, an R-squared of 0.564 means that our model explains about 56.4% of the variance in housing price, which is a moderate fit.

The second value we can analyze is P>|t|, the p-value of each coefficient. It tests whether the coefficient is significantly different from zero: a small p-value (commonly below 0.05) suggests that the variable contributes to the prediction, while a large one suggests that it may not. It describes the coefficient rather than outliers in the data. In this case, the P>|t| of x1 stands out from the others. We decided to keep x1 in the model because the result we got is already decent.

Test the prediction results with test data

Now we will try to predict the result with test data.

Evaluating Goodness of Fit using RMSE

We can use Root Mean Squared Error (RMSE) to measure how accurate our prediction is. RMSE is the square root of the average squared difference between the actual and predicted prices, and we judge it by comparing it with the range of our data. The lower the RMSE relative to the data range, the more accurate our prediction is.

To know the data range of the price, we need to look at our previous table in the univariate analysis. The maximum price is 7,700,000 and the minimum price is 75,000, so the range of the data is 7,625,000. To conclude, our RMSE is 242,238 against a range of 7,625,000, or about 3.2% of the range, which means that the prediction is decently accurate.
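The calculation can be sketched with NumPy as follows. The five actual and predicted prices are made up for illustration and are not results from the housing model.

```python
import numpy as np

# Hypothetical actual and predicted prices for five test houses.
y_test = np.array([300000.0, 450000.0, 250000.0, 520000.0, 610000.0])
y_pred = np.array([320000.0, 430000.0, 260000.0, 500000.0, 640000.0])

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

# Compare the RMSE with the spread of the actual prices.
price_range = y_test.max() - y_test.min()
rmse_share = rmse / price_range

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE as a share of the range: {rmse_share:.1%}")
```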

Mean Absolute Percentage Error (MAPE)

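An alternative to RMSE is MAPE, which expresses the average error as a percentage of the actual price. With MAPE we can check whether the error we got is still acceptable. It is computed as:

  MAPE = (100% / n) × Σ |(yi − ŷi) / yi|

where yi is the actual price, ŷi is the predicted price, and n is the number of test observations.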

MAPE interpretation:

  1. Below 10% = Excellent
  2. 10% to 20% = Good
  3. 21% to 50% = Reasonable
  4. Above 50% = Inaccurate

Our MAPE result is 32%, which means that we have a reasonable prediction.
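Here is a minimal sketch of the MAPE calculation together with the interpretation bands listed above. The prices are made up, and interpret_mape is a hypothetical helper written for this example, not a library function.

```python
import numpy as np

# Hypothetical actual and predicted prices for five test houses.
y_test = np.array([300000.0, 450000.0, 250000.0, 520000.0, 610000.0])
y_pred = np.array([320000.0, 430000.0, 260000.0, 500000.0, 640000.0])

# MAPE: mean of the absolute errors relative to the actual values, in percent.
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100


def interpret_mape(value):
    """Map a MAPE percentage to the bands listed above (hypothetical helper)."""
    if value < 10:
        return "Excellent"
    if value <= 20:
        return "Good"
    if value <= 50:
        return "Reasonable"
    return "Inaccurate"


print(f"MAPE: {mape:.2f}% ({interpret_mape(mape)})")
```

Calling interpret_mape(32) returns "Reasonable", matching the band reported for our model. MAPE divides by the actual value, so it is undefined when an actual price is zero.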

End Notes

Thank you for reading this article. We hope that it helps you use multiple regression for your own predictions. Here is the full Python code in Google Colab.


