A Little About Perceptrons and Activation Functions




Oftentimes, scientific breakthroughs and engineering feats are inspired by what has existed in nature long before we ever thought of them. From birds and airplanes to elephant-trunk-inspired robotic arms, we need no further proof that nature has been a major source of inspiration for the things we have today.

Computer science is, of course, no exception. Although the study of the human brain has been around much longer, it was in 1943 that Warren McCulloch and Walter Pitts introduced the first neural network model using electrical circuits. Neural networks work much like the human brain: their perceptrons are analogous to our neurons, hence the name.

When beginning to study machine learning, most people first learn about the concepts of overfitting and underfitting, both of which we want to avoid in our models. What we do want is a model that fits our data appropriately.

Underfit, Good fit, and Overfit example. Image retrieved from julienharbulot.com

In a regression problem, we want to predict the value of an output given an input. A neural network is capable of doing just that with its perceptrons.

The simplest mathematical model of a neuron, called the perceptron. Image retrieved from researchgate.com, uploaded by Zafeirios Fountas

Suppose we have a set of points in the X-Y plane, where X represents a student’s average hours of sleep per week and Y is that student’s test score. If we figure out the pattern, we can predict a student’s test score given their average hours of sleep per week.

Now let’s talk about perceptrons. They are analogous to the neurons inside the human brain and are essentially the building blocks of the regression model we construct. Suppose we have a simple neural network with one hidden layer, that is, the layer in which the perceptrons reside, containing three perceptrons. A perceptron takes in one or more inputs, multiplies each input by a factor called a weight, sums the results, and adds a number called a bias. That value is then plugged into an activation function, and the output of the activation function is what is actually taken as the output of the perceptron. In a network with a single hidden layer, the outputs of all the perceptrons are then combined to produce the predicted value for the original input.
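To make the mechanics concrete, here is a minimal sketch of that forward pass in Python with NumPy. The specific weights, biases, and the use of tanh as the hidden activation are illustrative assumptions, not values from the article.

import numpy as np

def forward(x, W1, b1, W2, b2):
    # each hidden perceptron: weight * input + bias, passed through an activation
    hidden = np.tanh(W1 * x + b1)
    # combine the hidden outputs into a single predicted value
    return float(np.dot(W2, hidden) + b2)

# Illustrative (assumed) parameters for 3 hidden perceptrons and 1 input
W1 = np.array([0.8, -0.5, 0.3])   # weights from the input to each hidden perceptron
b1 = np.array([0.1, 0.4, -0.2])   # biases of the hidden perceptrons
W2 = np.array([1.2, 0.7, -0.9])   # weights combining the hidden outputs
b2 = 0.05                          # output bias

print(forward(7.5, W1, b1, W2, b2))  # predicted score for 7.5 hours of sleep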

In our sleep-score example, the neural network takes the average amount of sleep a student gets each week as its input. Each perceptron multiplies that value by its own weight and adds its own bias, and the network predicts the student’s test score by combining the outputs of the perceptrons.

To create a prediction, as depicted by the blue line in the graph, a neural network with 3 perceptrons in a single hidden layer is constructed.

Figuring out the correct weights and biases is what it takes to create an accurate regression model, and that is done with the backpropagation algorithm. This article will not discuss how that algorithm works, however. Instead, we will look more closely at the kinds of activation functions.

Activation Functions

Most of the time, the pattern our data points trace on the X-Y plane is not a straight line. If that is the case, a simple straight-line linear regression of the form

y = mx + b

would not be good enough to predict the output Y from the input X. If we force a linear regression on such data, we end up with an underfitting model.

In this case, we need a tool that lets us bend the regression line when modeling our data points. We do this using activation functions. To put it simply, activation functions are what make our regression non-linear. A few of the commonly used activation functions are as follows.

Sigmoid Activation Function

The graph of the sigmoid function looks like the following.

Sigmoid function. Image retrieved from wikipedia.com

The equation for the “S”-shaped graph of the sigmoid function is as follows.
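σ(x) = 1 / (1 + e^(-x))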

Since the sigmoid function ranges only from 0 to 1, it is usually used for predicting probabilities. The sigmoid function is also differentiable, which means its derivative can be calculated. However, the derivative approaches 0 as the input approaches both positive and negative infinity, and small gradient values are undesirable for backpropagation. The logistic sigmoid can therefore cause a neural network to get stuck during training.
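A small sketch in Python with NumPy (purely for illustration, not code from the article) shows how the sigmoid’s gradient vanishes for large positive or negative inputs:

import numpy as np

def sigmoid(x):
    # logistic sigmoid: squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    # the gradient peaks at 0.25 near x = 0 and shrinks toward 0 at the tails
    print(x, sigmoid(x), sigmoid_grad(x))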

The softmax activation function, on the other hand, is a generalization of the logistic function used for multi-class classification, meaning that softmax can be used to solve a classification problem involving two or more classes.
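As a rough illustration (again in Python with NumPy, an assumption rather than code from the article), softmax turns a vector of class scores into probabilities that sum to 1:

import numpy as np

def softmax(scores):
    # subtract the max score for numerical stability, then normalize the exponentials
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities for 3 classes, summing to 1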

Tanh/Hyperbolic Tangent Activation Function

Tanh, or the hyperbolic tangent activation function, is very similar to the sigmoid function. The tanh activation function also has the sigmoidal “S” shape, but the difference is that tanh ranges from -1 to 1. The graph of the tanh activation function, compared side by side with the sigmoid activation function, is as follows.

Sigmoid and Tanh compared side by side. Image retrieved from https://stats.stackexchange.com answer by ekoulier

Tanh is usually applied in classification problems between two classes. Like the sigmoid activation function, the tanh function is differentiable, but its gradient likewise approaches 0 as the input approaches both infinities, which is undesirable for backpropagation. When one has to pick between sigmoid and tanh, tanh is usually the more favorable choice. The equation for the tanh function is as follows.
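tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))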

ReLU and Leaky ReLU Activation Functions

ReLU stands for Rectified Linear Unit and is the most commonly used activation function in neural networks. The ReLU activation function ranges from 0 to infinity, outputting 0 for inputs less than or equal to 0, that is, 0 for all negative values. In general, the ReLU activation function is more favorable than sigmoid and tanh. However, it may not map negative values properly, since it turns them into 0 right away.
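Written out, ReLU is simply:

ReLU(x) = max(0, x)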

This is where Leaky ReLU comes in. Leaky ReLU is an attempt to solve the dying ReLU problem. Instead of setting negative values to zero, it follows a linear function with a small slope, usually 0.01, on the negative side. When the slope is not fixed at 0.01 but chosen randomly, the variant is called Randomized ReLU.
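A minimal sketch of both functions in Python with NumPy (the 0.01 slope for the leaky variant follows the article; the sample inputs are illustrative assumptions):

import numpy as np

def relu(x):
    # 0 for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # like ReLU, but negative inputs are scaled by a small slope instead of zeroed out
    return np.where(x > 0, x, slope * x)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(x))        # [0. 0. 0. 2.]
print(leaky_relu(x))  # [-0.03 -0.005 0. 2.]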

ReLU vs Leaky ReLU graph. Image retrieved from towardsdatascience.com by Sagar Sharma
