A Comprehensive Guide to Neural Network Activation Functions: How, When, and Why?




Picture generated with Stable Diffusion using the title as prompt

In artificial neural networks (ANNs), the units are designed as a loose copy of biological neurons in the brain. The first artificial neuron, designed by McCulloch and Pitts in 1943 [1], consisted of a linear threshold unit. The threshold captures the idea that an artificial neuron does not simply pass on the raw input it receives; instead, it outputs the result of an activation function applied to that input. This behavior is inspired by biological neurons, which fire or stay silent depending on the input they receive. From the Perceptron models to more modern Deep Learning architectures, a wide variety of activation functions have been used, and researchers are still looking for the perfect one. In this post, I will describe the classic properties of activation functions and when to use them. In the second part, I will present more advanced activation functions, such as adaptive ones, and what it takes to obtain an optimal activation function.

Overview of activation functions characteristics

When choosing an activation function for your neural network, there are a few properties to take into account.

  • Nonlinearity: The main property that comes to mind is nonlinearity. It is well known that a nonlinear activation function improves the training of an ANN compared to a linear one, mostly because nonlinearity lets the network separate high-dimensional, nonlinearly distributed data instead of being restricted to linear decision boundaries (a stack of purely linear layers collapses into a single linear map).
  • Computational cost: The activation function is evaluated for every unit at every forward and backward pass, in particular during backpropagation. It is thus essential to make sure that the activation function is tractable in terms of computation.
  • The gradient: When training an ANN, the gradient can be subject to vanishing or exploding gradient problems. Vanishing gradients arise because saturating activation functions squash their inputs into a small range, for example the logistic function, which maps everything towards [0,1]. After a few layers, there is almost no gradient left to propagate back (see the sketch after this list). A solution is to use non-saturating activation functions.
  • Differentiability: Since training relies on the backpropagation algorithm, the activation function must be differentiable (at least almost everywhere) for the algorithm to work properly.
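To make the gradient issue concrete, here is a minimal NumPy sketch (the function names are my own) showing how the sigmoid's derivative, which never exceeds 0.25, shrinks the backpropagated signal layer after layer:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

# Multiply the gradient through 10 saturating units (best case: x = 0).
grad = 1.0
for layer in range(10):
    grad *= sigmoid_grad(0.0)
print(grad)  # ~9.5e-07: the gradient has essentially vanished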

Classical activation functions

This section will describe some of the most common activation functions encountered with ANN, their properties as well as their performances on common machine learning tasks.

Piece-wise linear function

This function is one of the simplest (if not the simplest) activation functions. Its output range is [0,1]; however, the derivative is not defined at the breakpoint b.
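One common form, with the breakpoint b as in the figure below (this exact parameterization is an assumption, since variants exist):

$$f(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x/b & \text{if } 0 < x < b \\ 1 & \text{if } x \ge b \end{cases}$$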

Piece-wise linear function with b=2 (by me)

In general, this activation function is only used as the output of a regression problem.

Sigmoid function

The sigmoid function was one of the most popular choices of activation function in the early 1990s. It was notably used by Hinton et al. [2] for automatic speech recognition. The function is defined as:
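$$\sigma(x) = \frac{1}{1 + e^{-x}}$$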

This function is differentiable, which makes it a very suitable candidate for the backpropagation algorithm.

Sigmoid function (by me)

As seen in the figure, the sigmoid function is bounded, which was the reason for its popularity. It is, however, subject to the vanishing gradient problem, and it was shown [3] that the deeper a neural network is, the less effective it is to train it with the sigmoid as an activation function.

Hyperbolic tangent function

In the early 2000s, the hyperbolic tangent function replaced the sigmoid one. The function is defined as:
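$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$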

Its range is [-1,1], which makes it suitable for regression problems thanks to its zero-centered outputs.

Hyperbolic tangent function (by me)

The main disadvantage of this function is saturation. Indeed, the hyperbolic tangent saturates pretty fast (faster than the sigmoid one), which can make it hard for the artificial neural network to modify the weights accordingly during training. However, it is worth noting that this function is often used as an activation function for the units of the hidden layers in recurrent neural networks.

Rectified Linear Unit

The Rectified Linear Unit, or ReLU, is used to overcome the vanishing gradient problem. It is defined as:
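$$f(x) = \max(0, x)$$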

It is worth noting that, unlike the sigmoid and hyperbolic tangent functions, the derivative of ReLU is monotonic.

ReLU function (by me)

The ReLU function is commonly used in classification tasks and overcomes the vanishing gradient problem thanks to its unbounded positive part. In addition, its computational cost is lower than the sigmoid's, since no exponential has to be evaluated. This function is used in the hidden layers of almost all neural network architectures but is less common as an output function. Its main disadvantage is that it saturates for negative values, where the gradient is exactly zero. To overcome this issue, it has been proposed to use the leaky ReLU [4]:
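$$f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \le 0 \end{cases}$$

where α is a small positive slope (0.01 in the original formulation).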

All these functions can be compared on two standard classification tasks: MNIST and CIFAR-10. The MNIST database contains handwritten digits, and CIFAR-10 contains images of objects from 10 categories. I am providing here a summary of the best performances reported for each activation function in the literature, independently of the architecture (for a great summary of even more activation function performances, see [5]).

Designing an optimal activation function

The previous section highlighted the fact that the choice of activation function depends on the task the network has to solve and on where the function sits within the network, e.g., in a hidden layer or in the output layer. Rather than searching for a single optimal activation function, it is thus natural to try to beat the existing functions whose use cases are already well understood. This is the reason why adaptive activation functions have been introduced.

Parametric Activation Functions

The first type of adaptable function I want to discuss is the parametric activation function. For example, it is possible to use a parametric hyperbolic tangent function [7]:
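A typical parameterization (given here as an assumption, since several variants exist in the literature) scales both the input and the output of the hyperbolic tangent:

$$f(x) = a \tanh(b x)$$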

Example of a parametric hyperbolic tangent function (by me)

with the parameters a and b being adapted for every neuron. These parameters then vary during training, updated by the classic backpropagation algorithm.

The advantage of parametric activation functions is that almost any standard activation function can be modified this way by adding parameters. Of course, the added parameters increase the computational complexity, but this generally leads to better performance, and such functions are used in state-of-the-art deep learning architectures.
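As a concrete sketch, here is a hypothetical PyTorch module for a parametric hyperbolic tangent of the form a·tanh(b·x), with one (a, b) pair per neuron trained by backpropagation (the class name and initialization are my own choices, not necessarily the formulation of [7]):

import torch
import torch.nn as nn

class ParametricTanh(nn.Module):
    """Parametric tanh a * tanh(b * x) with one learnable (a, b) pair per neuron."""
    def __init__(self, num_features: int):
        super().__init__()
        self.a = nn.Parameter(torch.ones(num_features))  # output scale
        self.b = nn.Parameter(torch.ones(num_features))  # input slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.a * torch.tanh(self.b * x)

# Drop-in replacement for a fixed activation in a small network:
model = nn.Sequential(nn.Linear(16, 32), ParametricTanh(32), nn.Linear(32, 10))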

Stochastic adaptive activation functions

Another way to introduce adaptive functions is to use a stochastic approach. First introduced by Gulcehre et al. [6], it consists in adding structured and bounded noise to allow for faster learning. In other words, after the deterministic function has been applied to the input, one adds random noise with a specific bias and variance to the output. This is especially useful for activation functions that saturate. An example of a stochastic adaptive function is the following:

Stochastic adaptive ReLU (from [5] under CC-BY license)

This type of noisy activation function is useful when the function saturates: since the noise is applied after the threshold, the output can still move past it and a learning signal keeps flowing. As in the case of parametric activation functions, these functions outperform fixed activation functions in most machine learning tasks.
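To illustrate the idea (this is a minimal sketch, not the exact formulation of [6]; the function name and the noise scale sigma are my own choices), here is a noisy ReLU in NumPy that injects noise only in the saturated region during training:

import numpy as np

rng = np.random.default_rng(0)

def noisy_relu(x, sigma=0.1, training=True):
    """ReLU with Gaussian noise added only where the unit saturates (x <= 0),
    so the output can still move and carry a learning signal."""
    out = np.maximum(x, 0.0)
    if training:
        saturated = x <= 0.0
        out = out + saturated * sigma * rng.standard_normal(x.shape)
    return out

x = np.linspace(-3.0, 3.0, 7)
print(noisy_relu(x))                   # stochastic in the saturated region
print(noisy_relu(x, training=False))   # plain deterministic ReLU at test time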

Below is a summary of the performance of a few adaptive functions on the classic MNIST and CIFAR-10 tasks:

The adaptive functions do achieve better performance than their fixed counterparts.

Conclusion

In recent years, neural architectures have become bigger and bigger, with billions of parameters. One of the main challenges is thus to obtain fast convergence of the training algorithm. This can be achieved using adaptive functions, despite their higher computational cost. Their better performance has been observed in both classification and regression machine learning tasks. However, a sign that the best activation function still hasn't been found is that researchers are now investigating quantized adaptive activation functions, in order to reduce the computational cost while keeping the fast convergence of adaptive functions.

Connect with me on LinkedIn.

References

[1] McCulloch, Warren S., and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity.” Bulletin of mathematical biology 52.1 (1990): 99–115.

[2] Hinton, Geoffrey, et al. “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.” IEEE Signal processing magazine 29.6 (2012): 82–97.

[3] Nair, Vinod, and Geoffrey E. Hinton. “Rectified linear units improve restricted Boltzmann machines.” ICML, 2010.

[4] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. “Rectifier nonlinearities improve neural network acoustic models.” Proc. ICML, Vol. 30, No. 1, 2013.

[5] Jagtap, Ameya D., and George Em Karniadakis. “How important are activation functions in regression and classification? A survey, performance comparison, and future directions.” arXiv preprint arXiv:2209.02681 (2022).

[6] Gulcehre, Caglar, et al. “Noisy activation functions.” International conference on machine learning. PMLR, 2016.

[7] Chen, Chyi-Tsong, and Wei-Der Chang. “A feedforward neural network with function shape autotuning.” Neural networks 9.4 (1996): 627–641.
