Machine Learning Theory and Programming — Supervised Learning: Regression Analysis
An introduction to popular machine learning algorithms
Machine learning, as a part of artificial intelligence, is the study of computer algorithms that can improve automatically through experience and by the use of data.
In 1959, Arthur Samuel, a pioneer in computer gaming and artificial intelligence, defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.
Machine learning has advanced significantly in the last couple of decades. In 1997, IBM supercomputer Deep Blue defeated the Russian chess grandmaster, Garry Kasparov. In 2016, Google DeepMind AI program AlphaGo defeated the Go World Champion, Lee Sedol. Machine learning algorithms have been used in a wide variety of applications, such as in medicine, email filtering, speech recognition, computer vision, and self-driving cars.
We are writing an introductory series about machine learning theory and programming. In this article, let’s walk through supervised learning: regression analysis.
Supervised learning builds a mathematical model of a set of data that contains both the inputs and the correct outputs. The data is known as training data, which is a set of training examples. This mathematical model is a function that best maps inputs to outputs, which is called the hypothesis (h).
After the training, we can use the hypothesis to predict outputs for new inputs.
Supervised learning has the correct answer for each example in the data, while unsupervised learning takes a set of data that contains only inputs. Without correct answers, unsupervised learning uses more complex methods and focuses on finding structures in the data, such as grouping or clustering of data points.
There is also semi-supervised learning in which training examples have some of the outputs. Those training examples without output data still produce a considerable improvement in learning accuracy, when used in conjunction with those examples with output data.
Supervised learning learns from previously collected, labeled data and then produces outputs for new inputs. It is used to solve real-world problems with computational algorithms.
Supervised learning has the following common types:
- Regression: It uses an algorithm to compute the relationship between dependent (target) and independent (predictor) variables. Popular regression algorithms are linear regression and polynomial regression.
- Classification: It uses an algorithm to assign data to specific categories. Popular classification algorithms are logistic regression, linear classifier, support vector machine, decision tree, and random forest.
Regression Analysis Theory
Regression analysis is a statistical method that models the relationship between a dependent variable (y) and one or more independent variables (x).
Linear regression is one of the easiest and most popular machine learning algorithms. It shows a linear relationship between dependent and independent variables. The hypothesis of a linear regression equation is defined as follows:
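Assuming the coefficients are written θ₀, θ₁, …, θₙ (a notation chosen here to match the θⱼ used in this article), the hypothesis for n input features is commonly written as:

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
```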
For example, it could be a model that predicts the housing value (h), where the input features are square footage (x₁), bedroom number (x₂), bathroom number (x₃), …, and homeowner association fee (xₙ). For i = 1, 2, …, m, the set of training data consists of m examples, where, for each example, (x₁, x₂, …, xₙ) is a set of house features and y is the actual appraised house value.
The cost function (J) is a mechanism that returns the error between the predicted outcomes and the actual outcomes.
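A common choice for linear regression, assumed here rather than taken from the text, is the squared-error cost over the m training examples, where the superscript (i) marks the i-th example:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2
```

The factor 1/2 is a convention that simplifies the derivative; it does not change where the minimum lies.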
Gradient descent is an optimization algorithm that is used to compute the coefficients that minimize the cost function. For every j = 0, 1, …, n, θⱼ is calculated iteratively using the following equation, where α is the learning rate.
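In its general form, the update subtracts the partial derivative of the cost, scaled by α. The second line is a sketch that assumes the squared-error cost J(θ) = (1/2m) Σ (h(x⁽ⁱ⁾) − y⁽ⁱ⁾)² and the convention x₀ = 1; both are assumptions made for illustration:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)},
\qquad x_0^{(i)} = 1
```

All θⱼ are updated simultaneously, using the values from the previous iteration.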
Because J(θ) is a convex function, gradient descent should converge to the global minimum. For a sufficiently small α, J(θ) decreases on every iteration. However, if α is too small, gradient descent may take a long time to converge.
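To make the iteration concrete, here is a minimal sketch of batch gradient descent in plain Python. It assumes a squared-error cost, and the function names, the toy data (generated from y = 1 + 2x), the learning rate, and the iteration count are all illustrative choices, not taken from the article.

```python
def predict(theta, x):
    """Hypothesis: theta[0] + theta[1]*x[0] + ... + theta[n]*x[n-1]."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))


def cost(theta, X, y):
    """Squared-error cost J(theta), averaged over the m examples (assumed form)."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


def gradient_descent(X, y, alpha=0.1, iterations=2000):
    """Return the fitted coefficients and the cost after each iteration."""
    m, n = len(y), len(X[0])
    theta = [0.0] * (n + 1)
    history = []
    for _ in range(iterations):
        errors = [predict(theta, xi) - yi for xi, yi in zip(X, y)]
        # Gradient for theta_0 uses the implicit feature x_0 = 1.
        grad = [sum(errors) / m]
        for j in range(n):
            grad.append(sum(e * xi[j] for e, xi in zip(errors, X)) / m)
        # Simultaneous update of every coefficient.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        history.append(cost(theta, X, y))
    return theta, history


# Toy data (hypothetical): four points on the line y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta, history = gradient_descent(X, y)
print(theta)
```

On this toy data the coefficients should approach 1 and 2. With features on very different scales (for example, square footage in the thousands), a learning rate like this one would likely diverge, so features are usually rescaled first.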
With the calculated coefficients, θ, the hypothesis is ready to predict outputs for new data.
When there is one input variable, it is called simple linear regression. When there are multiple input variables, it is called multiple linear regression.
Simple linear regression algorithms are more likely to underfit, which means the model is not able to capture the underlying trend of the data. A possible solution is to increase the number of variables. Of course, using more variables increases the computational complexity.
Polynomial regression is a regression algorithm that models the relationship between a dependent variable (y) and independent variables (x) as an nth-degree polynomial (non-linear). The one-variable polynomial regression equation is defined as follows:
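Using coefficients θ₀, …, θₙ (a notation assumed here for consistency with θⱼ elsewhere in the article), a degree-n polynomial in one variable is commonly written as:

```latex
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_n x^n
```

The model is non-linear in x but still linear in the coefficients, which is why the same fitting machinery applies.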
Polynomial regression can involve multiple variables as well. Here is an example:
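As one illustrative form (chosen here for illustration, not necessarily the author's example), a second-degree polynomial in two variables x₁ and x₂ could be written as:

```latex
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2
```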
High-order polynomial algorithms are more likely to overfit, which means the model fits the training data too closely and hence loses generality. Overfitting happens when the noise or random fluctuations in the training data are picked up and learned as real data by the model.
Overall, the hypothesis can be written in a matrix format:
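A standard way to write this, assuming a constant feature x₀ = 1 is added so that θ₀ fits the same pattern, is as a product of the coefficient vector and the feature vector:

```latex
h_\theta(x) = \theta^{T} x =
\begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix},
\qquad x_0 = 1
```

For polynomial regression, the entries of x are the powers (or products) of the original inputs.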
Regression Analysis Programming
From regression analysis theory, we can see that the computation involves advanced mathematics. Training the model from scratch in a general-purpose programming language is complicated.
Luckily, we are empowered by machine learning programming languages, such as MATLAB, Octave, R, etc. Some general-purpose programming languages, such as Python, are supplied with machine learning libraries. Therefore, it may only take one function call to train a model.
MATLAB is a proprietary multi-paradigm programming language and numeric computing environment developed by MathWorks. It provides built-in functions for simple linear regression (fitlm) and polynomial regression (polyfit).
Simple Linear Regression Example
Suppose we have ten houses, and their appraised values (y) are $999,998, $1,320,000, $1,200,500, $2,650,000, $1,950,000, $3,200,000, $2,700,500, $2,200,400, $3,050,000, and $1,100,000. We use one variable, square footage, as input data (x). The values are 1,000 ft², 1,400 ft², 1,200 ft², 3,000 ft², 2,050 ft², 3,500 ft², 1,900 ft², 2,400 ft², 2,580 ft², and 900 ft².
The following is a MATLAB program:
Line 1 specifies the function name, houseAppraisal, and the return value.
Line 2 uses a built-in function, fitlm, to fit a linear regression model with specific parameters. fitlm(X,y) returns a linear regression model of the responses, y, fit to the data matrix, X.
Line 3 is the data matrix, X, which lists the square footage values.
Line 4 is the responses, y, which is a vector of appraised house values.
Lines 2–4 effectively use one line of code to train the linear regression and assign the result to the return value.
Line 5 plots the linear regression, with linewidth set to 2.
Line 6 turns off the lower confidence bound. By default, the plot shows 95% confidence bounds.
Line 7 turns off the upper confidence bound.
Line 8 labels the x-axis.
Line 9 labels the y-axis.
Line 10 defines the plot title.
Line 11 defines legend labels.
Line 12 terminates the function.
At the MATLAB command console, we run the function:
The output contains a lot of information about the linear regression model, including the coefficients for our use case’s hypothesis: y = 290560 + 876.36x.
It is plotted as follows:
If we use this model to appraise a 2000 ft² house, the predicted house value is $2,043,280 (290560 + 876.36 * 2000).
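As a cross-check of these coefficients, here is a minimal sketch in plain Python rather than MATLAB. It uses the closed-form least-squares solution for one variable instead of fitlm, and it assumes the appraised values and square footages listed earlier are paired in the order given. The function and variable names are illustrative.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = intercept + slope * x for a single variable."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Data from the example: square footage (input) and appraised value (response).
square_feet = [1000, 1400, 1200, 3000, 2050, 3500, 1900, 2400, 2580, 900]
appraised = [999998, 1320000, 1200500, 2650000, 1950000,
             3200000, 2700500, 2200400, 3050000, 1100000]

intercept, slope = fit_simple_linear(square_feet, appraised)
predicted = intercept + slope * 2000  # appraisal for a 2000 sq ft house

print(f"y = {intercept:.0f} + {slope:.2f}x")
print(f"predicted value for 2000 sq ft: {predicted:.0f}")
```

The result should land very close to the coefficients reported by MATLAB. Small differences in the last digits are expected, because the values quoted in the text are rounded to five significant digits.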
Polynomial Regression Example
The following is a MATLAB program for polynomial regression:
Line 1 specifies the function name, polynomialExample.
Line 2 calls a built-in function, linspace, to return a row vector (x) from 0 to 4π with 10 evenly spaced points. It is [0 1.3963 2.7925 4.1888 5.5851 6.9813 8.3776 9.7738 11.1701 12.5664].
Line 3 generates a vector (y) that is the sine function of each x. It is [0 0.9848 0.3420 -0.8660 -0.6428 0.6428 0.8660 -0.3420 -0.9848 -0.0000].
Line 4 plots 10 points of (x, y) values, where the marker is set to cross, the color is set to blue, and linewidth is set to 2.
Line 5 calls hold on to retain plots in the current axes so that new plots added to the axes do not delete existing plots.
Line 6 uses a built-in function, polyfit, to return the coefficients for a polynomial p(x) of degree n (here, 7) that is a best fit for the data in y. It returns a vector, [-0.0001 0.0028 -0.0464 0.3702 -1.3808 1.9084 -0.1141 0.0002]. It means that the hypothesis of the trained model is y = -0.0001x⁷ + 0.0028x⁶ - 0.0464x⁵ + 0.3702x⁴ - 1.3808x³ + 1.9084x² - 0.1141x + 0.0002.
Line 7 generates a vector, x₁, that has the default 100 points.
Line 8 calls a built-in function, polyval(p, x₁), to evaluate the polynomial p at each point in x₁. The result is y₁ = -0.0001x₁⁷ + 0.0028x₁⁶ - 0.0464x₁⁵ + 0.3702x₁⁴ - 1.3808x₁³ + 1.9084x₁² - 0.1141x₁ + 0.0002.
Line 9 plots 100 points of (x₁, y₁) values, which are dense enough that the curve looks continuous.
Line 10 labels the x-axis.
Line 11 labels the y-axis.
Line 12 defines the plot title.
Line 13 defines legend labels.
Line 14 calls hold off to set the hold state to off.
Line 15 terminates the function.
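The sample data described for lines 2 and 3 can be reproduced outside MATLAB. The sketch below is plain Python; its linspace helper is a hypothetical stand-in written for this illustration (it is not MATLAB's function) and assumes at least two points.

```python
import math


def linspace(start, stop, num=100):
    """Return num evenly spaced points from start to stop, inclusive (num >= 2)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


# Ten training points on [0, 4*pi] and their sine values.
x = linspace(0.0, 4 * math.pi, 10)
y = [math.sin(v) for v in x]

# A denser grid with the default of 100 points, as used for the fitted curve.
x1 = linspace(0.0, 4 * math.pi)

print([round(v, 4) for v in x])
print([round(v, 4) for v in y])
```

The rounded values should match the two vectors listed in the walkthrough, up to the sign of the final zero.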
At the MATLAB command console, we run the function, polynomialExample. It plots the training data along with the trained polynomial regression curve.
We can also use this model to predict outputs: for the input value 10, the prediction is obtained by evaluating the polynomial with polyval(p, 10).
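Evaluating a polynomial from its coefficient vector is a short loop. The sketch below uses Horner's method in plain Python, with coefficients ordered from the highest degree down to the constant term, the same ordering as the vector shown for p. The polyval name here is a hypothetical helper for illustration, and the small polynomial is made up for the demonstration.

```python
def polyval(coeffs, x):
    """Evaluate a polynomial at x using Horner's method.

    coeffs are ordered from the highest-degree term to the constant term.
    """
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


# Hypothetical example polynomial: 2x^2 - 3x + 1.
p = [2.0, -3.0, 1.0]
values = [polyval(p, v) for v in [0.0, 1.0, 2.0]]
print(values)
```

Note that the coefficients printed in the walkthrough are rounded to four decimals. At x = 10 the leading term is multiplied by 10⁷, so plugging the rounded numbers into such a function would not reliably reproduce MATLAB's result; the full-precision vector p should be used instead.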
There are many machine learning algorithms. We have presented linear regression and polynomial regression in supervised learning.
Machine learning programming languages are designed with pre-built libraries and advanced support of data science and data models. We have shown examples to implement linear regression and polynomial regression using MATLAB.
Thanks for reading. I hope this was helpful. Stay tuned for other algorithms.
Special thanks to Josh Poduska, Andrew Ziegler, and Subir Mansukhani for recommending machine learning resources! Also, thanks to Professor Andrew Ng’s Machine Learning Class.