Activation functions you might have missed




Should you “swish” to these new inventions, or stay with the oldies but goldies?

The pace of scientific progress in machine learning is unparalleled these days. It is quite hard to stay up-to-date unless you confine yourself to a narrow niche. Every now and then, a new paper pops up claiming to have achieved some state-of-the-art result. Most of these inventions never become default go-to methods, sometimes because they prove not as good as initially hoped, and sometimes simply because they get lost in the flood of new publications.

What a shame it would have been to miss some golden nugget! Dread not, for I have got you covered. I have recently browsed through some relatively recent papers on one of the building blocks of neural networks: activation functions. Let’s take a look at a couple of the most promising ones to see why they are good and when to use them. But before we do, we will quickly go through the commonly used activations to understand what problems they solve or create. If you can tell a PReLU from an RReLU, feel free to scroll down past the first two sections.

Why activate anyway?

Inside each of a neural network’s units, the inputs are multiplied by some weight parameters W, a bias b is added, and the result is fed into a function referred to as the activation function. Its output, in turn, is the unit’s output that is passed to the next layer of units.
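
In code, this boils down to a single line; below is a minimal numpy sketch of one unit, with made-up weights, bias, and inputs purely for illustration:

import numpy as np

def unit_forward(x, W, b, activation):
    # weighted sum of the inputs plus a bias, passed through the activation
    return activation(W @ x + b)

# hypothetical numbers: a unit with three inputs and a ReLU activation
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.1, 0.4, -0.2])
b = 0.3
print(unit_forward(x, W, b, lambda z: np.maximum(0.0, z)))  # prints 0.0 for these values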

The insides of a NN’s unit. Image by the author.

The activation function could in principle be any function, as long as it is not linear. Why? If we used a linear activation (which includes the identity function, meaning no activation at all), our network would effectively collapse into a simple linear regression model, no matter how many layers and units we used. This is because a linear combination of linear combinations is itself just a single linear combination.
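
A quick numpy sketch (with arbitrary random weights) makes this concrete: two stacked linear layers collapse into one.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# two "layers" without activations...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly one linear transformation
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True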

Such a network would have limited learning capabilities, hence the need to introduce a non-linearity.

Classical activation functions

Let’s take a quick look at the five most commonly used activation functions. Here they are, implemented using numpy.

Classical activation function: numpy implementation. Image by the author.
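
The implementation appeared as an image in the original post; here is a sketch of what such numpy definitions typically look like (the slope and scale defaults below are common choices):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))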

And here is what they look like:

Classical activation function: plots. Image by the author.

Let me discuss each of them briefly.

The sigmoid or logistic activation was historically the first one to replace step functions in early networks. It roughly resembles the way neurons in our biological brains are activated. It was a game changer, as the sigmoid’s well-defined, non-zero derivative made it possible to train neural networks with gradient descent. Since then, the sigmoid has been displaced by other functions inside networks’ hidden layers, although it is still used in the output layer for binary classification tasks.

The hyperbolic tangent (tanh) is quite similar in shape to the sigmoid, but it takes values between -1 and 1, instead of between 0 and 1. As a result, its outputs are more centered around zero, which helps speed up convergence, especially early in the training.

Both the sigmoid and the tanh, however, share one issue: they are saturating functions. When the input is very large or very small, the slope approaches zero, making the gradients vanish and the learning slow. Hence the need for non-saturating activations. The big success story here is the rectified linear unit (ReLU), which does not saturate for positive values. It is fast to compute, and because it has no maximum value, it helps prevent the vanishing gradient problem. It has one drawback though, referred to as the dying ReLU problem: ReLU outputs zero for any negative input. If the network’s weights reach values such that the weighted sum of a unit’s inputs is always negative, the entire ReLU-activated unit keeps producing zeros. If many neurons die like this, the network’s learning capacity gets impaired.

To alleviate the dying ReLU problem, a couple of upgrades to the ReLU have been proposed. Leaky ReLU has a small but non-zero slope for negative values, which ensures the neurons won’t die. More exotic variants include the randomized leaky ReLU (RReLU), in which this small slope is chosen randomly during training, and the parametric leaky ReLU (PReLU), in which the slope is treated as one of the network’s parameters and learned via gradient descent.
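
Leaving out the training loop, the differences can be sketched roughly as follows (the RReLU sampling bounds below are common defaults, not values from this article):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # alpha is a fixed hyperparameter
    return np.where(x > 0, x, alpha * x)

def rrelu(x, low=1/8, high=1/3, rng=np.random.default_rng(42)):
    # alpha is sampled anew while training
    alpha = rng.uniform(low, high)
    return np.where(x > 0, x, alpha * x)

# For PReLU, alpha would instead be a trainable parameter updated by gradient
# descent alongside the weights, e.g. torch.nn.PReLU in PyTorch.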

Finally, the exponential linear unit (ELU) came along, beating all the ReLU variants. It takes the best of all worlds: non-zero gradients for negative values eliminate the dying neuron problem just like in leaky ReLU, negative outputs push the activations closer to a zero mean just like in tanh, and, most importantly, ELU is smooth around zero, which speeds up convergence. It comes with its own problem though: the use of the exponential function makes it relatively slow to compute.

Here is the overview of the classical activations compiled for your convenience:

Classical activation functions: a comparison. Compiled by the author.

Let’s now take a look at some of the more recent inventions!

Scaled ELU (SELU)

The Scaled ELU or SELU activation was introduced in a 2017 paper by Klambauer et al. As the name suggests, it is a scaled version of the ELU, with the two scaling constants in the formula below fixed to the same values used in the TensorFlow and PyTorch implementations.
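
The formula itself appeared as an image in the original post; here is a numpy sketch using the (truncated) constants from the paper:

import numpy as np

ALPHA = 1.6732632423543772   # α from the SELU paper
LAMBDA = 1.0507009873554805  # λ, the scale, from the SELU paper

def selu(x):
    # scaled ELU: λ·x for positive inputs, λ·α·(exp(x) − 1) otherwise
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))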

The SELU function has a peculiar property. The authors of the paper showed that if properly initialized, dense feed-forward networks will self-normalize provided all hidden layers are SELU-activated. This means that each layer’s output will roughly have the mean equal to zero and the standard deviation equal to one, which helps prevent the vanishing or exploding gradients problems and allows for building deep networks. The paper evaluated such self-normalizing networks on over 120 tasks from the UCI machine learning repository, drug discovery benchmarks, and even astronomy tasks to find that they significantly outperform traditional feed-forward networks.

Gaussian Error Linear Unit (GELU)

The Gaussian Error Linear Unit, or GELU, was proposed in a 2016 paper by Hendrycks & Gimpel. The function simply multiplies its input by the standard normal cumulative distribution function evaluated at that input. Since this calculation is quite slow, a much faster approximation that only differs in the fourth decimal place is often used in practice.
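
The exact form and one commonly used tanh-based approximation can be sketched like this (scipy is pulled in here only to get the normal CDF):

import numpy as np
from scipy.stats import norm

def gelu_exact(x):
    # x multiplied by the standard normal CDF evaluated at x
    return x * norm.cdf(x)

def gelu_tanh_approx(x):
    # a fast tanh-based approximation of the same curve
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))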

In contrast to the ReLU family of activations, GELU weights its inputs by their value instead of thresholding them by their sign. The authors have evaluated the GELU activation against the ReLU and ELU functions and found performance improvements across all considered computer vision, natural language processing, and speech tasks.

Swish

The Swish activation function, invented at Google Brain in 2017 by Ramachandran et al., is remarkably simple: it just multiplies the input by its own sigmoid. It is pretty similar in shape to the GELU function.
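
In numpy that is a one-liner; the sketch below also includes the β parameter from the paper, with β = 1 giving the plain Swish:

import numpy as np

def swish(x, beta=1.0):
    # x multiplied by its own sigmoid; beta can be fixed or treated as trainable
    return x / (1.0 + np.exp(-beta * x))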

The authors of the paper notice that despite many other activations having been proposed, ReLU is still the most widely adopted, mainly due to the inconsistency of gains from using the novelties. Hence, they evaluated Swish by simply using it as a replacement for ReLU in network architectures that have been optimized for ReLU. They found a significant performance boost and suggest using Swish as a drop-in replacement for ReLU.

The Swish paper also contains an interesting discussion on what makes activation functions good. The authors point out being unbounded above, bounded below, non-monotonic, and smooth as the reasons why Swish works so well. You might have noticed that GELU had all those properties too, and so does the last activation we will discuss shortly. It looks like this is the direction in which the research on activations is heading.

Mish

The Mish activation is the most recent invention among the ones discussed so far. It was proposed by Misra in a 2019 paper. Mish was inspired by Swish and has been shown to outperform it in a variety of computer vision tasks.
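
For reference, Mish multiplies its input by the tanh of the softplus of that input; here is a straightforward numpy sketch:

import numpy as np

def softplus(x):
    # log(1 + exp(x)), written in a numerically stable way
    return np.logaddexp(0.0, x)

def mish(x):
    # Mish: x · tanh(softplus(x))
    return x * np.tanh(softplus(x))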

To quote the original paper, Mish was “found by systematic analysis and experimentation over the characteristics that made Swish so effective”. Mish seems to be the best activation currently on offer, but bear in mind that the original paper only tested it on computer vision tasks.

Which activation to use?

In his fantastic book “Hands-On Machine Learning with Scikit-Learn and TensorFlow”, Géron states the following general rule:

SELU > ELU > Leaky ReLU > ReLU

But there are some gotchas. If the network’s architecture prevents it from self-normalizing, then ELU might be a better choice than SELU. Next, if speed is important, (leaky) ReLU will be a better option than the slow ELU. But then, the book discusses none of the more recently proposed activations.

Once, I was discussing a network architecture I was working on with a colleague of mine, a former Googler. The first piece of advice he gave me was to replace the ReLUs with Swishes. It was not a game changer, but the performance did improve.

Based on this and my other experiences, I would suggest the following subjective decision tree for choosing activations, assuming the rest of the architecture is fixed.

How to choose activations, by the author.


Thanks for reading! If you liked this post, why don’t you subscribe to get email updates on my new articles? And by becoming a Medium member, you can support my writing and get unlimited access to all stories by other authors and myself.
