Leaky ReLU vs. ReLU Activation Functions: Which is Better?

Photo by Lucas Vasques on Unsplash

Introduction

As data scientists, we continuously seek the optimal parameters for our machine learning (ML) models. Choosing the right architecture and hyperparameters can be the difference between a model that is capable of answering your current problem and one that is utterly useless and provides no value to your issue.

We will be looking at two different activation functions you may want to incorporate in your ML models, which in today’s case are neural networks. Choosing the correct activation function can lead to a model with higher accuracy, lower loss, and a more stable training process.

What are Activation Functions?

Activation functions allow ML models to solve nonlinear problems. There are many different activation functions a model can adopt today (e.g., Sigmoid, Swish, Mish, tanh), and I highly recommend you research other options to see which function would best support your next model.

Rectified Linear Unit (ReLU) Activation Function

The Rectified Linear Unit (ReLU) Activation Function was adopted after it showed it could overcome the saturation that occurs when a model uses the Sigmoid Activation Function.

What is saturation? It is closely tied to the exploding and vanishing gradients problems that arise when training a neural network. When gradients “explode,” the activations grow to extremely large values and the weight updates become too big, producing a model that cannot learn to solve the given task. When gradients “vanish,” the weight updates become so small that the weights barely change and the model never adapts to the problem. The Sigmoid squashes its input into the (0, 1) range, so for large positive or negative inputs its gradient is nearly zero, which is exactly where the vanishing behavior comes from.
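To make that concrete, here is a minimal NumPy sketch (mine, not from the original post) showing how the Sigmoid’s gradient collapses toward zero as its input grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: it peaks at 0.25 and shrinks toward zero
    # as |x| grows, which is what "saturation" looks like numerically.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  gradient={sigmoid_grad(x):.6f}")
# At x=10 the gradient is roughly 0.000045, so almost no learning signal
# flows back through a saturated sigmoid unit.
```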

The ReLU Activation Function helped alleviate this issue in two ways.

  1. Its output is not constrained between zero and one (along the y-axis) like the Sigmoid Activation Function’s is.
  2. Any negative value is sent to zero.

ReLU Activation Function (Image from Author)

The ReLU Activation Function outputs the maximum of zero and its input x in each instance. Additionally, the ReLU Activation Function tends to converge faster since its calculation involves no division or exponentials [1].
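In code, the function is just a clipped maximum; a quick NumPy sketch (mine, not the author’s gist):

```python
import numpy as np

def relu(x):
    # ReLU: element-wise max(0, x); every negative input becomes zero.
    return np.maximum(0.0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # -> [0. 0. 0. 2.]
```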

Leaky Rectified Linear Unit (LReLU) Activation Function

The Leaky ReLU Activation Function (LReLU) is very similar to the ReLU Activation Function, with one change. Instead of sending negative values to zero, it multiplies them by a very small slope parameter, which preserves some information from negative inputs.

LReLU Activation Function (Image from Author)

This activation function was first introduced by Maas et al. [2] who used a slope parameter of 0.01.
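As a quick sketch (again mine, not the original gist), the only change from ReLU is that negative inputs are scaled by the small slope rather than zeroed:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: positive inputs pass through unchanged; negative inputs
    # are multiplied by a small slope (0.01 here, as in Maas et al.).
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-3.0, -0.5, 0.0, 2.0])))  # -> [-0.03 -0.005 0. 2.]
```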

Experiment

The dataset used for this experiment is the MNIST dataset of handwritten digits [3]. To use this dataset, we need to load it into our notebook with Python and reshape each image to (28, 28, 1). Once the images are reshaped, we can normalize them by dividing each pixel value by 255 (the maximum pixel value).
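The original loading code lives in the author’s notebook; the sketch below is my own take on that preparation step, assuming the Keras built-in copy of MNIST:

```python
from tensorflow.keras.datasets import mnist

# Load the 60,000 training and 10,000 test images of handwritten digits.
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Reshape each 28x28 grayscale image to (28, 28, 1) and scale pixel
# values from the [0, 255] range down to [0, 1].
x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
```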

For the experiment, I wanted to compare the performance of the activation functions over 1, 10, 100, and 1,000 epochs. This was the main parameter I changed for my experiment; there are various others that could have been adjusted to see how the different activation functions would affect performance (e.g., the number of layers within the models). The models were identical apart from the activation functions they adopted.

GIST: Model Architectures (Created by the Author)

As shown in the model above, I created a Convolutional Neural Network (CNN) for classifying the MNIST images. Don’t know what a CNN is? Check out my post below!

The two models will adopt the same architecture, two convolutional layers each followed by a max-pooling layer, but the activation functions will be interchanged.
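The exact gist is shown above as an image; here is a hedged sketch of that kind of architecture in Keras, with the layer sizes being my own illustrative assumptions and the activation passed in so one builder produces both models:

```python
from tensorflow.keras import layers, models

def build_model(make_activation):
    # Two convolutional blocks, each followed by max pooling, then a
    # softmax classifier over the ten digit classes.
    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, (3, 3)),
        make_activation(),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3)),
        make_activation(),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Identical models apart from the activation layers they adopt.
relu_model = build_model(layers.ReLU)
lrelu_model = build_model(lambda: layers.LeakyReLU(0.01))  # slope of 0.01, as in Maas et al.
```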

Setup

Gist Code: Experiment Setup (Created by the Author)

The code above was designed for this experiment. It is really useful if you are varying the number of epochs for your training experiment. You may want to use a more complex training routine (like a grid search) if you are planning to take a more advanced approach and vary other hyperparameters (model layers, number of neurons, etc.).
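For reference, a minimal version of that setup could look like the loop below, reusing the `build_model` helper and data from the earlier sketches and training a fresh pair of models for each epoch count:

```python
results = []
for epochs in [1, 10, 100, 1000]:
    for name, make_activation in [("ReLU", layers.ReLU),
                                  ("LReLU", lambda: layers.LeakyReLU(0.01))]:
        model = build_model(make_activation)
        history = model.fit(x_train, y_train, epochs=epochs,
                            batch_size=128, verbose=0)
        test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
        results.append({"activation": name, "epochs": epochs,
                        "train_acc": history.history["accuracy"][-1],
                        "train_loss": history.history["loss"][-1],
                        "test_acc": test_acc, "test_loss": test_loss})
```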

Results

Image: Experiment Results

For 1 epoch, the ReLU model performed better on the training set while the LReLU model performed better on the test set. While informative, the results from a single epoch of training do not provide the clearest indication of which activation function ultimately performs better, but they help in analyzing how model performance changes as the number of epochs increases. On the training set, the ReLU model’s accuracy was 0.2789% higher and its loss 0.9952% lower than the LReLU model’s. On the test set, the ReLU model’s accuracy was 0.5230% lower and its loss 1.0000% higher than the LReLU model’s, so the LReLU model performed better there.

With 10 epochs of training, the LReLU model performed better on all of the assessment criteria. On the training set, the ReLU model’s accuracy was 0.1166% lower and its loss 0.4811% higher than the LReLU model’s. On the test set, the ReLU model’s accuracy was 0.0190% lower and its loss 1.8700% higher. At this point, the LReLU model led the ReLU model across all metrics.

At 100 epochs of training, the LReLU activation function won again in every category. On the training set, the ReLU model’s accuracy was 0.0574% lower and its loss 0.1367% higher than the LReLU model’s. On the test set, the ReLU model’s accuracy was 0.1600% lower and its loss 1.7030% higher. As previously observed, the LReLU model made the better predictions on the test set.

Finally, at 1,000 epochs, the LReLU activation function came out ahead in every category once more. On the training set, the ReLU model’s accuracy was 0.0888% lower and its loss 0.3474% higher than the LReLU model’s. On the test set, the ReLU model’s accuracy was 0.0900% lower and its loss 2.1124% higher. The pattern held: the LReLU model performed better than the ReLU model.

Final Conclusion: The model supported by the LReLU activation function outperformed the model supported by the ReLU activation function.

Discussion

As shown by the results, the differences are minimal (which actually says a lot, since performance was very high for both models), but the LReLU model performed better as the number of epochs increased. The epoch count is important to factor in because, averaging accuracy on the training set across all trials, the ReLU model was actually 0.003275% better; however, the ReLU model also had higher loss across the different epoch counts, by 0.0075% on average.

For the test set, the LReLU model’s accuracy was 0.2424% higher than the ReLU model’s across all trials. Additionally, the LReLU model’s loss values were 0.85% lower on average than the ReLU model’s. While the ReLU model had a higher accuracy for one trial (at 1 epoch!), the results show that the model adopting the LReLU Activation Function performed better overall, and I would recommend experimenting with the LReLU Activation Function over the ReLU Activation Function for your next ML model!

Conclusion

While the results are marginal, the LReLU Activation Function does in fact perform better than the ReLU Activation Function as the number of training epochs changes (for today’s given task). There are a few caveats to this study that should be taken into consideration. First, the study was performed on the MNIST dataset, which is not a difficult dataset on which to achieve high model performance. A follow-on to this analysis would be to use a more difficult dataset (which could also lead to changes in the architecture you explore). Another caveat is that the only parameter changed for the experiment was the number of epochs. The goal today was to see whether there were any advantages to using the LReLU Activation Function over the ReLU Activation Function when training a neural network, and, as shown by the results, it may be more optimal to adopt the LReLU Activation Function in the architecture of your next neural network.

If you enjoyed today’s reading, PLEASE give me a follow and let me know if there is another topic you would like me to explore! If you do not have a Medium account, sign up through my link here! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!
