# Intuitions behind different Activation Functions in Deep Learning

## Derivatives, Advantages, Disadvantages, Python implementation and Use cases

As we know, each neuron in a neural network has its own weights, a bias and an activation function. The inputs are multiplied by the weights, the bias is added, and the activation function is applied to the result before it is passed on to the next layer.

Why is the Activation Function required?

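Activation functions introduce non-linearity into the output of a neuron. Without it, a stack of layers collapses into a single linear transformation, however deep the network is. The choice of activation function also affects accuracy, computational cost and convergence speed. For gradient-based optimization, an activation function should ideally be monotonic, differentiable (at least almost everywhere) and should allow the weights to converge quickly.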

In this article, we will discuss the following important activation functions:

1) Sigmoid

2) Hyperbolic Tangent (Tanh)

3) Rectified Linear Unit (ReLU)

4) Leaky ReLU

5) Exponential Linear Units (ELU)

6) Parametric ReLU (PReLU)

7) Softmax

1) Sigmoid

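This is one of the most common activation functions, also known as the logistic function. It is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

The function and its derivative, $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, look like this:

[Figure: the sigmoid function and its derivative]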

From the above graph we observe:

a) The function is an S-shaped curve.

b) It maps any input to a value between 0 and 1. The output is centered at 0.5, i.e. it is not zero-centered.

c) The function is monotonic and differentiable. Note that its derivative ranges between 0 and 0.25, reaching the maximum of 0.25 at z = 0.
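The observations above can be checked numerically. The following is a minimal sketch in plain Python (no NumPy, to keep it self-contained); the function names `sigmoid` and `sigmoid_derivative` are chosen here for illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written so that exp never overflows for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) is at most 1
    return ez / (1.0 + ez)

def sigmoid_derivative(z: float) -> float:
    """d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The output stays inside (0, 1) and the derivative peaks at z = 0.
for z in (-5.0, 0.0, 5.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.4f}  derivative={sigmoid_derivative(z):.4f}")
```

Splitting the computation on the sign of z is a common trick to keep the exponential from overflowing; for moderate inputs the one-line formula gives the same result.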

Disadvantages of Sigmoid

a) Vanishing gradient: During backpropagation, each weight w is updated as below:

$$w_{new} = w_{old} - \eta \, \frac{\partial L}{\partial w} \qquad (1)$$

where η is the learning rate and L is the loss.

As the graph shows, the derivative of sigmoid is bounded between 0 and 0.25. By the chain rule, the gradient for a weight in an early layer is a product that contains one such derivative for every sigmoid layer between that weight and the output. This product can become so small that the weight is barely updated, so the early layers learn very slowly or not at all. This problem is called the vanishing gradient.

b) Computationally expensive, because it requires evaluating an exponential.

c) The output is not zero-centered, which reduces the efficiency of the weight updates.
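To get a feel for how quickly the gradient can shrink, here is a rough back-of-the-envelope sketch. It only multiplies the largest possible sigmoid derivative (0.25) across layers and ignores the weight terms that also appear in the chain rule, so it is an upper bound on the activation part only, not a full backpropagation. The name `max_gradient_factor` is made up for this example.

```python
# Largest value the sigmoid derivative can take (reached at z = 0).
MAX_SIGMOID_DERIVATIVE = 0.25

def max_gradient_factor(num_layers: int) -> float:
    """Upper bound on the product of sigmoid derivatives over num_layers layers."""
    return MAX_SIGMOID_DERIVATIVE ** num_layers

for n in (1, 5, 10, 20):
    print(f"{n:2d} layers -> factor <= {max_gradient_factor(n):.3e}")
```

Even in this best case the factor drops below one millionth after ten layers, which is why deep stacks of sigmoid layers are hard to train.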

2) Hyperbolic Tangent (Tanh)

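Another very commonly used activation function is tanh, which is defined as:

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

The function and its derivative, $1 - \tanh^2(z)$, look like this:

[Figure: the tanh function and its derivative]

It is clear from the above graph: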

a) The function is an S-shaped curve.

b) It maps any input to a value between -1 and 1, and the output is centered at 0.

c) The function is monotonic and differentiable. Note that its derivative ranges between 0 and 1.

Tanh and sigmoid are both monotonically increasing functions that approach a finite asymptote as the input goes to +inf and -inf.
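A small sketch in plain Python illustrates these properties. It relies on the standard library's `math.tanh`; the wrapper name `tanh_derivative` is an illustrative choice.

```python
import math

def tanh_derivative(z: float) -> float:
    """d/dz tanh(z) = 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2

# Outputs are symmetric around 0 and saturate towards -1 and +1,
# while the derivative peaks at z = 0 and decays on both sides.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  tanh={math.tanh(z):+.4f}  derivative={tanh_derivative(z):.4f}")
```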

Disadvantages of Tanh: Like sigmoid, tanh suffers from the vanishing gradient problem and is computationally expensive because of its exponential operations.

Advantage of Tanh over Sigmoid: As noted, tanh is zero-centered, which means the function is symmetric about the origin. Hence, convergence is usually faster if the average of each input variable over the training set is close to zero.

3) Rectified Linear Unit (ReLU)

ReLU is the most popular activation function for hidden layers. It returns 0 for any negative input and returns the input itself for any positive input. The function is defined as:

$$f(z) = \max(0, z)$$

The function and its derivative look like this:

[Figure: the ReLU function and its derivative]

From the above graph we observe:

a) The function outputs 0 for any input less than zero and returns the input unchanged for positive values. It is continuous and monotonically non-decreasing.

b) The derivative is 0 for z<0 and 1 for z>0. The function is not differentiable at z = 0.

c) Because the derivative is 0 for all negative inputs, no gradient flows through a neuron whose input is negative.

Advantages of ReLU

a) ReLU mitigates the vanishing gradient problem, as the derivative is 1 for z>0.

b) Because of its simple equation, it is computationally faster than the sigmoid and tanh activation functions.

Disadvantage (Dying ReLU): As mentioned above, the derivative is 0 for negative inputs, so equation (1) leads to w(new) = w(old). A neuron that ends up in this state stops responding to variations in the error or the input, because the gradient is 0 and nothing changes. This is called the dying ReLU problem: it leaves dead neurons whose weights can no longer be updated by backpropagation. Leaky ReLU was introduced to overcome this problem.
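The dying ReLU effect can be seen with a single toy neuron. The sketch below assumes a one-input neuron a = relu(w * x) without bias and a plain gradient-descent step; `updated_weight`, the learning rate and the sample values are all illustrative assumptions.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def relu_derivative(z: float) -> float:
    # Undefined at z == 0; returning 0 there is a common convention (assumed here).
    return 1.0 if z > 0 else 0.0

def updated_weight(w: float, x: float, upstream_grad: float, lr: float = 0.1) -> float:
    """One gradient-descent step for a toy neuron a = relu(w * x)."""
    z = w * x
    grad_w = upstream_grad * relu_derivative(z) * x  # chain rule
    return w - lr * grad_w

# Positive pre-activation: the weight moves.
print(updated_weight(w=0.5, x=2.0, upstream_grad=1.0))
# Negative pre-activation: the gradient is 0, so the weight stays where it was.
print(updated_weight(w=-0.5, x=2.0, upstream_grad=1.0))
```

If the pre-activation is negative for every training example, the second case applies on every step and the neuron never recovers.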

4) Leaky ReLU

This is an improvement over ReLU that tweaks the function for negative inputs as below:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ 0.01\,z & \text{otherwise} \end{cases}$$

The function and its derivative look like this:

[Figure: the Leaky ReLU function and its derivative]

Basically, Leaky ReLU allows a small, non-zero, constant gradient for negative inputs. This non-zero slope ensures that the neuron will not die.

Disadvantage of Leaky ReLU: If most of the inputs to the activation are negative, then by the chain rule the gradient gets multiplied by 0.01 many times over. This eventually leads to vanishing gradients, the very problem we were trying to overcome.
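As a minimal sketch, Leaky ReLU and its derivative can be written as follows. The parameter name `slope` and its default of 0.01 follow the value used in the text; the function names are illustrative.

```python
def leaky_relu(z: float, slope: float = 0.01) -> float:
    """Identity for positive inputs, a small linear slope for the rest."""
    return z if z > 0 else slope * z

def leaky_relu_derivative(z: float, slope: float = 0.01) -> float:
    return 1.0 if z > 0 else slope

for z in (-10.0, -1.0, 2.0):
    print(f"z={z:+.1f}  leaky_relu={leaky_relu(z):+.3f}  derivative={leaky_relu_derivative(z)}")

# If the negative branch is hit in many consecutive layers,
# the gradient is scaled by the slope once per layer.
print(leaky_relu_derivative(-1.0) ** 5)
```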

5) Exponential Linear Units (ELU)

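To overcome this issue while keeping the other properties of Leaky ReLU, ELU was introduced. It is defined as:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha\,(e^{z} - 1) & \text{otherwise} \end{cases}$$

where α is a positive constant, commonly set to 1.

The function and its derivative look like this:

[Figure: the ELU function and its derivative]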

Advantages:

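a) The dying ReLU problem is solved.

b) The output is closer to zero-centered, since negative inputs produce negative outputs.

c) With α = 1 the function is differentiable at 0, so the derivative there needs no special handling.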

Disadvantages: Due to its exponential nature, it is computationally expensive.
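A minimal plain-Python sketch of ELU and its derivative, assuming the common default α = 1 (the parameter name `alpha` and the function names are choices made for this example):

```python
import math

def elu(z: float, alpha: float = 1.0) -> float:
    """Identity for positive inputs, alpha * (exp(z) - 1) otherwise."""
    # math.expm1(z) computes exp(z) - 1 accurately for z close to 0.
    return z if z > 0 else alpha * math.expm1(z)

def elu_derivative(z: float, alpha: float = 1.0) -> float:
    return 1.0 if z > 0 else alpha * math.exp(z)

# Negative inputs saturate smoothly towards -alpha instead of being cut to 0.
for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"z={z:+.1f}  elu={elu(z):+.4f}  derivative={elu_derivative(z):.4f}")
```

With α = 1 both branches of the derivative agree at z = 0, which is the smoothness mentioned in the advantages.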

6) Parametric ReLU (PReLU)

This generalizes ReLU and Leaky ReLU. The function is defined as:

$$f(z) = \begin{cases} z & \text{if } z > 0 \\ \beta\,z & \text{otherwise} \end{cases}$$

where β is a parameter that is learned during backpropagation, along with the weights.

Note, if β = 0, PReLU is the same as ReLU

if β = 0.01, PReLU is the same as Leaky ReLU
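To illustrate what "learning β" means, here is a small sketch with one gradient-descent step on β for a single input. The starting value 0.25, the learning rate and the upstream gradient are arbitrary assumptions, and `prelu_grad_beta` is a name invented for this example.

```python
def prelu(z: float, beta: float) -> float:
    return z if z > 0 else beta * z

def prelu_grad_beta(z: float) -> float:
    """Partial derivative of prelu(z, beta) with respect to beta."""
    return 0.0 if z > 0 else z

# beta = 0 behaves like ReLU, beta = 0.01 like Leaky ReLU.
print(prelu(-3.0, beta=0.0), prelu(-3.0, beta=0.01))

# One illustrative update of beta (all numbers below are assumed values).
beta, lr = 0.25, 0.1
z, upstream_grad = -2.0, 1.0
beta = beta - lr * upstream_grad * prelu_grad_beta(z)
print(beta)
```

Only negative inputs contribute to the gradient of β, since the positive branch does not depend on it.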

7) Softmax

Softmax converts a vector of n raw scores into a probability distribution over n classes. It calculates the probability of each target class over all possible target classes. These probabilities are then used to determine the target class for the given input. The function is defined as below:

$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \quad i = 1, \dots, n$$

The function looks like this:

[Figure: the softmax function]

Softmax is generally used for multiclass problems, i.e. if the number of classes is more than 2, this activation function is used in the last layer. The 'max' part refers to the largest value receiving the highest probability, and the 'soft' part ensures that smaller values receive lower probabilities but are not discarded. Also note that the probabilities of all classes sum to 1.
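A minimal plain-Python sketch of softmax over a list of scores. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result; the function name and the sample scores are illustrative.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # keeps exp() from overflowing on large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest score gets the largest probability, while the smaller scores keep a non-zero share.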

# Python Code Snippet

The Python code below was used to create the above graphs for each function and its derivative.
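A minimal sketch of what such a plotting script could look like, assuming matplotlib is installed. It is an illustrative stand-in rather than the exact code behind the graphs; the names `CURVES`, `sample` and `plot_all`, the input range and the output file name are all invented here, and only four of the functions are included to keep it short.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Each entry maps a title to a (function, derivative) pair.
CURVES = {
    "Sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "Tanh": (math.tanh, lambda z: 1.0 - math.tanh(z) ** 2),
    "ReLU": (lambda z: max(0.0, z), lambda z: 1.0 if z > 0 else 0.0),
    "Leaky ReLU": (lambda z: z if z > 0 else 0.01 * z,
                   lambda z: 1.0 if z > 0 else 0.01),
}

def sample(fn, lo: float = -6.0, hi: float = 6.0, steps: int = 241):
    """Evaluate fn on an evenly spaced grid and return (xs, ys)."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return xs, [fn(x) for x in xs]

def plot_all(path: str = "activations.png") -> None:
    """Draw every function together with its derivative, one panel each."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, axes = plt.subplots(len(CURVES), 1, figsize=(6, 3 * len(CURVES)))
    for ax, (name, (fn, dfn)) in zip(axes, CURVES.items()):
        xs, ys = sample(fn)
        ax.plot(xs, ys, label=name)
        ax.plot(xs, sample(dfn)[1], linestyle="--", label=f"{name} derivative")
        ax.set_xlabel("z")
        ax.legend()
    fig.tight_layout()
    fig.savefig(path)
```

Calling `plot_all()` writes a single image with one panel per function. The remaining functions (ELU, PReLU) can be added to `CURVES` in the same way.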

# Few More Activation Functions

The functions above are frequently used activation functions in deep learning, but the list doesn't end here.

# When to use which activation functions

Usually, if the expected output lies in (0, 1) or (-1, 1), sigmoid or tanh can be used, respectively. On the other hand, to predict output values larger than 1, ReLU is commonly used, because tanh and sigmoid cannot produce such values by definition.

For a binary classifier, the sigmoid activation function should be used in the output layer. When predicting probabilities for a multiclass problem, the softmax activation function should be used in the last layer. Again, tanh and sigmoid usually do not work well in hidden layers; ReLU or Leaky ReLU should be used there. The Swish activation function is used when the number of hidden layers is high (close to 30).

However, the choice of activation function mostly depends on the data, the problem at hand and the range of the expected output.

Hope you like this article!!
