# Introduction to Deep Learning: The MLP



## Linearly Inseparable Data

In the case of linearly inseparable data, no hyperplane exists that separates the classes correctly, so a single perceptron cannot classify such data (Minsky 1969). For example, no straight line separates the points (-1,-1), (1,1) from the points (-1,1), (1,-1). Hence a single perceptron is not enough to compute an XOR gate.
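To see this concretely, here is a small sketch (the weight grid and the threshold convention `w·x + b > 0` are my assumptions) that brute-forces single-perceptron weights over the four XOR points and finds that no setting classifies all of them:

```python
import itertools

# A single perceptron with threshold activation: fires if w1*x1 + w2*x2 + b > 0.
def perceptron(w1, w2, b, x1, x2):
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

# XOR over inputs in {-1, 1}: (-1,-1) and (1,1) are class 0,
# (-1,1) and (1,-1) are class 1.
points = [(-1, -1, 0), (1, 1, 0), (-1, 1, 1), (1, -1, 1)]

# Exhaustively search a grid of weights and biases; since XOR is not
# linearly separable, no setting classifies all four points.
grid = [i / 10 for i in range(-20, 21)]
solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if all(perceptron(w1, w2, b, x1, x2) == label for x1, x2, label in points)
]
print(len(solutions))  # 0: no single perceptron computes XOR
```

The grid search is illustrative, not a proof, but it matches the geometric argument: any line keeping (-1,-1) and (1,1) on one side puts at least one of (-1,1), (1,-1) on the same side.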

To overcome this limitation, Marvin Minsky and Seymour Papert introduced a networked structure in 1969: a set of hidden neurons placed between the inputs and the output perceptron. These hidden neurons process the data and feed the result to the output neuron. With this structure, we can compute far more complicated Boolean and real-valued functions.

XOR Function can be decomposed as a function of AND, OR, and NOT gate.
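This decomposition can be written directly as a two-layer network of threshold perceptrons, each gate being one perceptron with hand-set weights (the specific weights and biases below are my choices; any values realizing the same gates work):

```python
# Each gate is one perceptron: output 1 if the weighted sum exceeds 0.
def step(z):
    return 1 if z > 0 else 0

def AND(a, b):  # fires only when both inputs are 1
    return step(a + b - 1.5)

def OR(a, b):   # fires when at least one input is 1
    return step(a + b - 0.5)

def NOT(a):     # inverts its input
    return step(-a + 0.5)

# XOR(a, b) = OR(AND(a, NOT b), AND(NOT a, b)): a two-layer network.
def XOR(a, b):
    return OR(AND(a, NOT(b)), AND(NOT(a), b))

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", XOR(a, b))
```

The hidden layer (the two ANDs and the NOTs folded into them) computes intermediate features; the output OR combines them, which is exactly the hidden-neuron structure described above.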

A sufficiently wide and deep network of Perceptrons is essentially a universal Boolean function.

But how deep do we need to go? For an N-variable XOR gate with just a single hidden layer, we require 2ᴺ⁻¹+1 perceptrons, i.e. O(2ᴺ⁻¹) complexity. With a slight rearrangement, chaining 2-input XORs instead, we can reduce the number of perceptrons to 3(N-1), i.e. O(N) complexity. Deeper networks may require far fewer neurons than shallower networks to express the same function.
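The 3(N-1) construction can be sketched as a chain: each 2-input XOR is built from 3 threshold perceptrons (two hidden units plus one output unit), and N inputs are folded together with N-1 such blocks. The particular weights are my illustrative choices:

```python
def step(z):
    return 1 if z > 0 else 0

# A 2-input XOR from 3 threshold perceptrons over inputs in {0, 1}.
def xor2(a, b):
    h1 = step(a - b - 0.5)       # fires when a=1, b=0
    h2 = step(b - a - 0.5)       # fires when b=1, a=0
    return step(h1 + h2 - 0.5)   # OR of the two hidden units

# N-input XOR (parity) as a chain of N-1 two-input XOR blocks:
# 3*(N-1) perceptrons in total, with depth growing linearly in N.
def xor_n(bits):
    acc = bits[0]
    for b in bits[1:]:
        acc = xor2(acc, b)
    return acc

print(xor_n([1, 0, 1, 1]))  # parity of three set bits -> 1
```

The depth-for-width trade is visible directly: the single-hidden-layer version enumerates exponentially many input patterns, while the chain reuses the same 3-perceptron block N-1 times.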

## MLP: The Universal Approximator

An MLP can also compute classification “boundaries” for real-valued inputs. Let’s build a network of units with a single output that fires if the input is in the colored region.

The region inside the “pentagon” belongs to one class and the region outside to another. To mark this region, we first use 5 perceptrons to draw 5 decision boundaries, then use another perceptron to combine them; the area bounded by the 5 boundaries gives us the desired region.
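A minimal sketch of this construction, assuming a regular pentagon centered at the origin with apothem 1 (the specific pentagon is my illustrative choice): five hidden perceptrons each test one edge's half-plane, and the output perceptron is an AND over them.

```python
import math

def step(z):
    return 1 if z > 0 else 0

# Hidden layer: five perceptrons, one per pentagon edge. Unit k fires
# when the point lies on the inner side of edge k, whose outward normal
# is (cos(2*pi*k/5), sin(2*pi*k/5)) at distance 1 from the origin.
def hidden(x, y):
    outputs = []
    for k in range(5):
        nx = math.cos(2 * math.pi * k / 5)
        ny = math.sin(2 * math.pi * k / 5)
        outputs.append(step(1.0 - (nx * x + ny * y)))
    return outputs

# Output perceptron: an AND over the five boundary units. It fires
# only when all five agree that the point is inside.
def inside(x, y):
    return step(sum(hidden(x, y)) - 4.5)

print(inside(0.0, 0.0))  # center of the pentagon -> 1
print(inside(3.0, 0.0))  # far outside -> 0
```

Any convex polygon works the same way: one hidden perceptron per edge, plus one output perceptron acting as an AND.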

By combining enough perceptrons, we can draw arbitrarily complex decision boundaries. But learning all of these decision boundaries with the perceptron training algorithm (PTA) is not feasible. In fact, for perceptrons with threshold activations, computing so many decision boundaries is an NP-complete combinatorial optimization problem.

Consider the function in Figure 8a. Can we learn a linear model from this data? We would require multiple lower-level perceptrons to compute the linear boundaries, but for that we would need labels for each individual neuron, as shown in Figure 8b. These labels are not given. We would have to relabel the blue dots so that each neuron learns a decision boundary keeping the red dots on one side, as in Figure 8b. This has to be done for every individual neuron, such that when they are combined by a higher-level perceptron we get the desired pattern; getting any one of them wrong results in incorrect output! Doing this for all the neurons takes exponential time, since we must test the output of every individual neuron, on every data point, in order to properly learn the output neuron. This is an NP-complete combinatorial optimization problem.
Other problems with using PTA for weight updates are:
1) Individual neurons’ weights can change significantly without changing the overall error, so there is no indication of which direction to change the weights to reduce the error. The threshold function is flat everywhere except for a jump at 0: its derivative is 0 everywhere and undefined at 0.
2) Real-life data is rarely clean and linearly separable, so the PTA would not converge in the first place.

To overcome this, we will use activation functions that are differentiable. We will read more about these, and about training neural networks in general, in the next part.
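As a preview, the sigmoid is one such differentiable activation (my example; the next part may use others). Unlike the threshold function, its derivative is nonzero everywhere, so it always signals which direction to nudge the weights:

```python
import math

# Sigmoid: a smooth, differentiable alternative to the hard threshold.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Its derivative, sigmoid(z) * (1 - sigmoid(z)), is positive for every z,
# unlike the step function, whose derivative is 0 away from the jump.
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0))       # 0.5
print(sigmoid_grad(0.0))  # 0.25, the maximum of the gradient
print(sigmoid_grad(5.0))  # small but still nonzero
```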


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot