# A Quick Setup for Neural Networks Hyperparameters for Best Results

Finding good hyperparameter settings for a neural network is not easy: you usually have to test many configurations yourself before the results are good, and this tuning can take a lot of time. The most important thing, however, is not to be greedy, because squeezing out every last bit of training performance leads to overfitting. This article gives a quick setup for choosing these hyperparameters so that you get good results in a short time.

## Table of Contents:

- Number of Hidden Layers
- Number of Neurons per Hidden Layer
- Learning Rate
- Activation Function
- Batch Size
- Optimizers
- Loss Function
- Number of Epochs

# 1. Number of Hidden Layers

The first hyperparameter in a neural network is the number of hidden layers. For many problems, you can begin with a single hidden layer and get reasonable results: theoretically, a neural network with one hidden layer can model even very complex functions, provided it has enough neurons. However, for more complex problems, deep neural networks are much more efficient than shallow ones.

This is because, in deep neural networks, lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good solution, but it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces in pictures and you now want to train a new neural network to recognize hairstyles, you can kickstart the training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the values of the weights and biases of the lower layers of the first network. This way, the network will not have to learn from scratch all the low-level structures that occur in most pictures; it will only have to learn the higher-level structures (e.g., hairstyles). This is called transfer learning.
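The weight reuse described above can be sketched in plain Python. This is only an illustration under stated assumptions: the network is a simple list of layer dictionaries, and `init_network` and `reuse_lower_layers` are hypothetical helpers invented for this example, not functions from any library. The layer sizes are arbitrary.

```python
import copy
import random

def init_network(layer_sizes, seed=0):
    """Randomly initialize a fully connected network (hypothetical helper).

    Each layer is a dict with a weight matrix "W" (n_in rows, n_out columns,
    stored as nested lists) and a bias vector "b" of length n_out.
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        b = [0.0] * n_out
        layers.append({"W": W, "b": b})
    return layers

def reuse_lower_layers(pretrained, new_network, n_reused):
    """Copy the first `n_reused` layers of `pretrained` into `new_network`."""
    for i in range(n_reused):
        same_shape = (
            len(pretrained[i]["W"]) == len(new_network[i]["W"])
            and len(pretrained[i]["b"]) == len(new_network[i]["b"])
        )
        if not same_shape:
            raise ValueError(f"layer {i} has a different shape")
        # Deep copy so that later training of one network does not
        # silently change the other.
        new_network[i] = copy.deepcopy(pretrained[i])
    return new_network

# Pretend this network was already trained on the first task (e.g., faces).
face_net = init_network([64, 32, 16, 10], seed=1)
# New task with 5 output classes: same lower layers, different output layer.
hair_net = init_network([64, 32, 16, 5], seed=2)
hair_net = reuse_lower_layers(face_net, hair_net, n_reused=2)
```

Only the two lower layers are copied; the output layer of the new network keeps its random initialization because it has to learn the new task.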

**In summary, for many problems, you can start with just one or two hidden layers, and the neural network will work just fine. For instance, you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons and above 98% accuracy using two hidden layers with the same total number of neurons in roughly the same amount of training time.**

For more complex problems, you can ramp up the number of hidden layers until you start overfitting the training set. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers or even hundreds, but not fully connected ones, and they need a huge amount of training data. You will rarely have to train such networks from scratch: it is much more common to reuse parts of a pre-trained state-of-the-art network that performs a similar task. Training will then be a lot faster and require much less data.

# 2. Number of Neurons per Hidden Layer

The number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 x 28 = 784 input neurons and 10 output neurons, since it has ten classes. If the task is regression or binary classification, then the number of output neurons is just one.

As for the hidden layers, it used to be common to size them to form a pyramid, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. A typical neural network for MNIST might have 3 hidden layers, the first with 300 neurons, the second with 200, and the third with 100. However, this practice has been largely abandoned because using the same number of neurons in all hidden layers seems to perform just as well in most cases, or even better; plus, there is only one hyperparameter to tune instead of one per layer.
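To make the comparison concrete, the sketch below counts the trainable parameters (weights plus biases) of a fully connected network. The pyramid sizes come from the text; the uniform layout with 200 neurons per layer is an assumption chosen so that both networks have the same total number of hidden neurons. `count_parameters` is a hypothetical helper written for this example.

```python
def count_parameters(n_inputs, hidden_sizes, n_outputs):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    # Each layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# MNIST: 784 inputs, 10 outputs.
pyramid = count_parameters(784, [300, 200, 100], 10)
uniform = count_parameters(784, [200, 200, 200], 10)
print(pyramid, uniform)  # 316810 239410
```

With the uniform layout there is a single width to tune, and in this particular case the network also ends up with fewer parameters, mostly because the first hidden layer is smaller.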

That said, depending on the dataset, it can sometimes help to make the first hidden layer bigger than the others. Just like the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. But in practice, it’s often simpler and more efficient to pick a model with more layers and neurons than you actually need, then use **early stopping** and other regularization techniques to prevent it from overfitting.

Vincent Vanhoucke, a scientist at Google, has dubbed this the “stretch pants” approach: instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size. With this approach, you avoid bottleneck layers that could ruin your model. By contrast, if a layer has too few neurons, it will not have enough representational power to preserve all the useful information from the inputs (e.g., a layer with two neurons can only output 2D data, so if it processes 3D data, some information will be lost). No matter how big and powerful the rest of the network is, that information will never be recovered. **In general, you will get better results by increasing the number of layers instead of the number of neurons per layer.**

# 3. Learning Rate

The learning rate is considered to be one of the most important hyperparameters you can optimize in a neural network model.

One way to find a good learning rate is to train the model for a few hundred iterations, starting with a very low learning rate (e.g., 10^-5) and gradually increasing it up to a very large value (e.g., 10). This is done by multiplying the learning rate by a constant factor at each iteration (e.g., by exp(log(10^6)/500) to go from 10^-5 to 10 in 500 iterations).
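The arithmetic behind that constant factor can be checked with a few lines of Python. This is a small sketch using the values from the text; `lr_sweep` is a hypothetical helper name.

```python
import math

def lr_sweep(lr_start=1e-5, lr_end=10.0, n_iterations=500):
    """Return the constant growth factor and the learning rate per iteration."""
    # Going from lr_start to lr_end in n_iterations multiplications means
    # factor ** n_iterations == lr_end / lr_start.
    factor = math.exp(math.log(lr_end / lr_start) / n_iterations)
    lrs = [lr_start * factor ** i for i in range(n_iterations + 1)]
    return factor, lrs

factor, lrs = lr_sweep()
print(round(factor, 4))  # about 1.028, i.e. a 2.8% increase per iteration
```

In a real sweep you would train for one iteration at each of these learning rates and record the loss after every step.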

If you plot the loss as a function of the learning rate (using a log scale for the learning rate), you should see it drop at first. But after a while, the learning rate will be too large, so the loss will shoot back up: **the optimal learning rate will be a bit lower than the point at which the loss starts to climb (typically about 10 times lower than the turning point)**. You can then reinitialize the model and train it normally using this good learning rate.
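The "10 times lower than the turning point" rule can be sketched as follows. Two things here are assumptions: the turning point is approximated as the learning rate with the lowest recorded loss, and the loss values are made up for illustration rather than measured. `pick_learning_rate` is a hypothetical helper.

```python
def pick_learning_rate(lrs, losses, divisor=10.0):
    """Pick a learning rate `divisor` times below the turning point.

    The turning point is approximated as the learning rate at which the
    recorded loss is lowest, just before it starts to climb.
    """
    if not lrs or len(lrs) != len(losses):
        raise ValueError("lrs and losses must be non-empty and equally long")
    turning_index = min(range(len(losses)), key=losses.__getitem__)
    return lrs[turning_index] / divisor

# Made-up sweep results: the loss drops, then shoots back up.
sweep_lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]
sweep_losses = [2.30, 2.10, 1.40, 0.90, 0.60, 3.50, 50.0]
chosen_lr = pick_learning_rate(sweep_lrs, sweep_losses)
```

Here the loss is lowest at a learning rate of 0.1, so the sketch picks roughly 0.01. Real loss curves are noisy, so smoothing the losses before locating the minimum is usually a good idea.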

Finally, it is important to remember that the optimal learning rate depends on the other hyperparameters, especially the batch size, so if you modify any of the hyperparameters, remember to update the learning rate as well.

# 4. Activation Function

An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network. The choice of activation function has a large impact on the capability and performance of the neural network, and different activation functions may be used in different parts of the model.

A very good starting point is to use **ReLU** as the activation function for the hidden layers, as shown in the figure below.

The activation function of the output layer depends mainly on the task. For regression tasks, use a linear activation, since you want the output of the final fully connected layer to pass through unchanged.

For classification, there are three main types of problems, and each may use a different output activation.

- If there are two mutually exclusive classes (binary classification), then your output layer will have one node, and a **sigmoid activation** function should be used.
- If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class, and a **softmax activation** should be used.
- If there are two or more classes that are not mutually exclusive (multilabel classification), then your output layer will have one node for each class, and a **sigmoid activation** function is used.

So to summarize this:

- **Regression**: One node with linear activation.
- **Binary Classification**: One node, sigmoid activation.
- **Multiclass Classification**: One node per class, softmax activation.
- **Multilabel Classification**: One node per class, sigmoid activation.
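To show what these output activations actually compute, here is a minimal plain-Python sketch. The function names are chosen for this example, and the input values (often called logits) are arbitrary.

```python
import math

def linear(z):
    """Regression output: pass the value through unchanged."""
    return z

def sigmoid(z):
    """Squash one value into (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(zs):
    """Turn a list of values into probabilities that sum to 1."""
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
multiclass_probs = softmax(logits)                # classes compete, sum is 1
multilabel_probs = [sigmoid(z) for z in logits]   # each class is independent
```

The difference between the last two lines is the point: softmax forces the class probabilities to sum to 1, which fits mutually exclusive classes, while per-node sigmoids can all be high at once, which fits multilabel problems.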

# 5. Batch Size

The batch size can have a significant impact on your model’s performance and training time. You can choose a small batch size such as 32 or 64, or you can use the largest batch size that fits in memory. How do you decide between them, and are there other options?

The main benefit of using large batch sizes is that hardware accelerators such as GPUs can process them efficiently, so the algorithm sees more instances per second and training goes faster. Therefore, many researchers and practitioners recommend using the largest batch size that can fit in GPU RAM. However, in practice, large batch sizes often lead to training instabilities, especially at the beginning of training, and the resulting model may not generalize as well as a model trained with a small batch size.

Indeed, Yann LeCun tweeted in 2018, “**Friends don’t let friends use mini-batches larger than 32**”, citing the 2018 paper **Revisiting Small Batch Training for Deep Neural Networks** by Dominic Masters and Carlo Luschi, which concluded that using small batches (from 2 to 32) was preferable because small batches led to better models in less training time.
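One practical consequence of the batch size is the number of weight updates per epoch, which the short sketch below computes. The figure of 60,000 instances is the size of the MNIST training set; the batch sizes are arbitrary examples, and `steps_per_epoch` is a hypothetical helper.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of mini-batches (and weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)

small_batch_steps = steps_per_epoch(60_000, 32)     # 1875 updates per epoch
large_batch_steps = steps_per_epoch(60_000, 1024)   # 59 updates per epoch
```

A large batch makes each epoch cheaper on a GPU but gives the optimizer far fewer updates per epoch, which is part of why the learning rate usually needs to be retuned when the batch size changes.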

However, other papers point in the opposite direction. Two 2017 papers, **Train longer, generalize better: closing the generalization gap in large batch training of neural networks** by Elad Hoffer et al. and **Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour** by Priya Goyal et al., showed that it was possible to use very large batch sizes (up to 8,192) using various techniques such as **warming up the learning rate** (i.e., starting training with a small learning rate, then ramping it up). This led to a very short training time without any generalization gap.

So, in summary, a good starting strategy is to use the largest batch size your hardware allows, together with learning rate warmup. If training is unstable or the final performance is disappointing, try a small batch size instead.
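Learning rate warmup can be sketched as a simple schedule function. The linear ramp used here is an assumption made for illustration (the text only says the rate starts small and is ramped up), and the step counts and target rate are arbitrary. `warmup_learning_rate` is a hypothetical helper.

```python
def warmup_learning_rate(step, target_lr, warmup_steps, start_lr=0.0):
    """Linear warmup: ramp from start_lr to target_lr, then hold target_lr."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    fraction = step / warmup_steps
    return start_lr + fraction * (target_lr - start_lr)

# Learning rate for the first 8 training steps with a 5-step warmup.
schedule = [warmup_learning_rate(s, target_lr=0.1, warmup_steps=5)
            for s in range(8)]
```

The schedule starts at 0, rises in equal increments, and stays at the target learning rate of 0.1 from step 5 onward.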

# 6. Optimizers

An optimizer is a function or an algorithm that modifies the attributes of the neural network, such as its weights, and in some cases adapts the learning rate, in order to reduce the overall loss and improve accuracy. Choosing the right weights for a model is a daunting task, as a deep learning model generally has millions of parameters, so you need a suitable optimization algorithm for your application.

Different optimizers update your weights and learning rate in different ways, and the best choice depends on the application. One possible solution is to try all of them and pick the one with the best results. This might be a good solution if your dataset is small, but when dealing with hundreds of gigabytes of data, even a single epoch can take a considerable amount of time.

That being said, if you have to choose one optimizer, choose the **Adam optimizer**. Adam has several benefits that explain its wide use. It has been adopted as a benchmark in deep learning papers and is commonly recommended as a default optimization algorithm. Moreover, the algorithm is straightforward to implement, runs fast, has low memory requirements, and typically requires less tuning than most other optimization algorithms.
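To illustrate why Adam is considered straightforward to implement, here is a minimal plain-Python sketch of its update rule, using the commonly cited default values for `beta1`, `beta2`, and `eps`. The toy quadratic objective, the learning rate of 0.05, and the `adam_step` helper name are assumptions made for this example; in practice you would use the optimizer provided by your framework.

```python
import math

def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter list."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Running averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction, important during the first steps.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

# Toy problem: minimize f(w) = (w0 - 3)^2 + (w1 + 1)^2, minimum at (3, -1).
params = [0.0, 0.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
for _ in range(2000):
    grads = [2 * (params[0] - 3.0), 2 * (params[1] + 1.0)]
    params = adam_step(params, grads, state, lr=0.05)
```

The only extra memory Adam needs is the two running averages per parameter, stored here in `state["m"]` and `state["v"]`.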

# 7. Loss Function

The purpose of a loss function is to compute the quantity that a model should seek to minimize during training. Importantly, the choice of loss function is directly tied to the **activation function** used in the output layer of your neural network, the expected output, and the learning task: these elements must be chosen together.

We use the **sparse_categorical_crossentropy** loss when we have sparse labels (when for each instance we get a class index; for example, if we have three classes, then the labels will be 0, 1, 2). We use the **categorical_crossentropy** loss when we have one target probability per class for each instance, for example [0, 0, 1] for the third class. We use the **binary_crossentropy** loss for binary classification tasks.

For regression, we can use the **mean_squared_error** or the **mean_absolute_error** function.
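The relationship between these three losses is easier to see in code. The functions below are plain-Python re-implementations for a single instance, written for illustration; they borrow the names used in the text but are not the library functions themselves, and the predicted probabilities are made up.

```python
import math

def categorical_crossentropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy with one target probability per class (e.g., one-hot)."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(target_probs, predicted_probs))

def sparse_categorical_crossentropy(class_index, predicted_probs, eps=1e-12):
    """Cross-entropy when the label is a single class index."""
    return -math.log(max(predicted_probs[class_index], eps))

def binary_crossentropy(target, predicted_prob, eps=1e-12):
    """Cross-entropy for one sigmoid output and a 0/1 target."""
    p = min(max(predicted_prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

predicted = [0.1, 0.2, 0.7]  # softmax output for three classes
sparse_loss = sparse_categorical_crossentropy(2, predicted)
dense_loss = categorical_crossentropy([0, 0, 1], predicted)
```

Both calls give the same value, the negative log of 0.7: the sparse and one-hot versions compute the same loss and differ only in how the label is encoded.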

# 8. Number of Epochs

In most cases, the number of epochs or iterations does not need to be tuned. You can just use early stopping to halt training when the model's performance stops improving.
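The early stopping logic can be sketched in a few lines. This version works on a list of validation losses recorded after each epoch; the `patience` and `min_delta` parameters, the helper name, and the loss values are assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has not improved by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # Never triggered: training ran through every recorded epoch.
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: improvement stalls after epoch 3.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```

In a real training loop you would also keep a copy of the weights from `best_epoch` and restore them when training stops.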
