Why Using Learning Rate Schedulers In NNs May Be a Waste of Time


Hint: Batch size is the key, and it might not be what you think!


TL;DR: instead of decreasing the learning rate by some factor during training, increase the batch size by the same factor to achieve faster convergence and comparable, if not better, training results.

In recent years, the continuous development of neural networks has led to a growing number of real-world applications. Despite their popularity, neural networks and similar deep learning methods still have significant drawbacks that limit their usability and their performance on unseen datasets.

Neural networks are notorious for consuming significant computational resources and time. As research progresses, more and more complex architectures are proposed and put into practice. Even though some larger networks obtain better performance than their smaller counterparts, the sheer amount of training and inference time they require makes them impractical for many real-world scenarios. One standard way to decrease training time is to increase the batch size, i.e., how much data the network sees for each update of its parameters. Logically, if the network sees more data per update, the total number of parameter updates decreases compared to a network that updates its parameters every time a new sample is presented. By reducing the number of parameter updates, we can effectively cut network training time.
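
As a quick back-of-the-envelope check (the sample count here is made up purely for illustration), the number of parameter updates per epoch is simply the number of training samples divided by the batch size:

import math

n_samples = 50_000  # illustrative training-set size
for batch_size in (32, 256, 2048):
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(f"batch size {batch_size:>4}: {updates_per_epoch:>4} updates per epoch")

# batch size   32: 1563 updates per epoch
# batch size  256:  196 updates per epoch
# batch size 2048:   25 updates per epoch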

Batch Size and Generalization Gaps

However, recent studies show that larger batch sizes lead to bigger generalization gaps (see Nitish Shirish Keskar et al., “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”). The term generalization gap is closely related to overfitting: it describes the “gap” in performance between training and testing data, with performance on the test data being the worse of the two. To perform well on unseen datasets, models need the ability to generalize, applying what they have learned during training to new, unseen samples.

To explain why an increased batch size leads to larger generalization gaps, imagine a list of numbers that increases by 2, starting from “1, 3, 5, …” all the way up to “…, 37, 39”. Now, say that you were told to select 15 numbers at random from the list without looking. After you open your eyes, how likely are you to discover the increment pattern in the original list (assuming you were not told the pattern beforehand)? After some rearranging, very likely: you were given 15 of the 20 values in the list, and by chance many such combinations give away the pattern. What if you could only select three values at random? Would you still be able to identify the pattern? The answer is far less certain, because three values out of the 20 in the list may give little to no information about the underlying relationship (increment by 2).

A thought experiment on the difference between larger vs. smaller batch sizes. Image from the author.

The effect that batch size has on model performance works the same way as the thought experiment above: the larger the batch size, the more likely the model is to discover relationships between features and targets in fewer epochs. This can be beneficial during training, since the model may approach the global minimum more quickly, but it comes at the cost of poorer generalization. When the model sees less data at every update step, it may not find the parameters that best fit the training data, but it is more likely to learn a rule that transfers to other datasets of the same type (just as not every list of numbers in the world increments by 2). Models trained with larger batch sizes are therefore more prone to overfitting, creating bigger generalization gaps than models trained with smaller batch sizes, which tend to perform better on unseen samples.

How Batch Size and Learning Rate Relate to Noisy Training

Samuel L. Smith and Quoc V. Le, in their study “A Bayesian Perspective on Generalization and Stochastic Gradient Descent,” further support this point by proposing a “noise scale” for the SGD optimizer:

g = ε(N/B − 1)

where N is the number of training samples, B is the batch size, and ε is the learning rate. Since B ≪ N in most cases, the noise scale simplifies to

g ≈ εN/B.

The “noise scale” describes the magnitude of the random fluctuations that mini-batch updates introduce into SGD training. Intuitively, less noise means smoother optimization, but keeping a reasonable amount of noise helps the model generalize, acting as a form of regularization. Setting the batch size to a relatively large value produces minimal noise throughout training, which hurts generalization; this is the intuition behind larger batch sizes leading to greater generalization gaps.

Both increasing the batch size and decreasing the learning rate reduce the noise level.
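
To make this equivalence concrete, here is a minimal sketch of the simplified noise scale g ≈ εN/B; the sample count, batch sizes, and learning rates below are arbitrary illustrative values, not numbers from the paper:

def sgd_noise_scale(n_samples, batch_size, learning_rate):
    # Simplified SGD noise scale g ≈ εN/B (valid when B << N)
    return learning_rate * n_samples / batch_size

N = 50_000  # illustrative number of training samples
print(sgd_noise_scale(N, batch_size=256, learning_rate=0.1))    # ≈ 19.5
print(sgd_noise_scale(N, batch_size=1280, learning_rate=0.1))   # ≈ 3.9  (B x5   -> noise /5)
print(sgd_noise_scale(N, batch_size=256, learning_rate=0.02))   # ≈ 3.9  (eps /5 -> noise /5)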

Decaying learning rates are usually employed during training to reduce this noise. With a larger learning rate at the start, the network can explore a larger region of the parameter space under noisy conditions. Once the model has found a promising direction, the extra noise starts to hurt training (i.e., validation loss begins to rise), so the learning rate is gradually brought down. This reduces the noise as the model closes in on the minimum, settling into a more stable training phase. Doing so helps ensure generalization and decent training results, but at the cost of lengthy training, up to days or even weeks for complex ConvNets.
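
For context, a conventional decaying-learning-rate schedule in Keras might look like the sketch below; the milestone epochs are illustrative choices, not values taken from any of the cited papers:

import tensorflow as tf

def step_decay(epoch, lr):
    # Keep the initial (larger) learning rate for early, noisy exploration,
    # then shrink it by a factor of 5 at a few fixed milestones.
    return lr * 0.2 if epoch in (20, 40) else lr

lr_schedule = tf.keras.callbacks.LearningRateScheduler(step_decay, verbose=1)
# model.fit(X_train, y_train, epochs=60, callbacks=[lr_schedule], ...)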

Increase the Batch Size Instead of Decaying the Learning Rate

Notice that decaying the learning rate ε is not the only way to reduce the noise level: increasing the batch size B achieves the same effect. Exploiting this fact, Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le showed in their study “Don't Decay the Learning Rate, Increase the Batch Size” that the learning rate and the batch size are inversely related in their effect on training noise. The authors proposed that, instead of using learning rate schedulers, the batch size should be increased during training, which empirically achieves the same results as decaying the learning rate. In their experiment, they trained three types of models under the following conditions:

  1. Decaying learning rate: the learning rate is repeatedly decreased by a factor of 5.
  2. Hybrid: the batch size is increased at the first step while the learning rate is held constant; training then reverts to style #1. This simulates potential hardware limits where an even larger batch size is not feasible.
  3. Increasing batch size: the batch size is repeatedly increased by a factor of 5, on the same schedule as the decay in #1.

The diagram shown below presents their training results.

Experiment results from 3 types of models. From Samuel L. Smith et al.

On the left, we observe that the model trained with an increasing batch size reaches the same training result as the model trained with a decaying learning rate, but with 67% fewer parameter updates. More specifically, the authors suggest increasing the batch size until it reaches roughly N/10, then reverting to the traditional decaying-learning-rate style for the rest of training.
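
To get a feel for that N/10 cap, here is a quick back-of-the-envelope sketch; the starting batch size of 256, the factor of 5, and the training-set size are illustrative assumptions rather than values from the paper:

def batch_size_schedule(n_train, start=256, factor=5):
    # Grow the batch size by `factor` until it would exceed n_train / 10.
    sizes, b = [], start
    while b <= n_train / 10:
        sizes.append(b)
        b *= factor
    return sizes

print(batch_size_schedule(435_000))  # e.g. ~435,000 training samples -> [256, 1280, 6400, 32000]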

Essentially, we can replace learning rate schedulers with “batch size schedulers” and substantially decrease training time. Below is a simple demonstration of the authors' method in Keras, using the forest cover type dataset (available in scikit-learn).

import pandas as pd
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

# load the forest cover type dataset and one-hot encode the 7 target classes
dataset = fetch_covtype()
X = pd.DataFrame(dataset.data)
y = pd.DataFrame(dataset.target)
y = pd.get_dummies(y[0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

The model used for demonstration is a rather deep network for tabular data with seven hidden layers and a bottleneck-like design.

import tensorflow as tf
from tensorflow.keras import layers as L, models as M

def create_testing_model():
    # bottleneck-like architecture: 512 -> 256 -> 256 -> 128 -> 256 -> 256 -> 512
    inp = L.Input(shape=(X.shape[1],))
    x = L.Dense(512, activation="relu")(inp)
    x = L.Dropout(0.25)(x)
    x = L.Dense(256, activation="relu")(x)
    x = L.Dense(256, activation="relu")(x)
    x = L.Dropout(0.25)(x)
    x = L.Dense(128, activation="relu")(x)
    x = L.Dense(256, activation="relu")(x)
    x = L.Dense(256, activation="relu")(x)
    x = L.Dropout(0.25)(x)
    x = L.Dense(512, activation="relu")(x)
    x = L.Dropout(0.25)(x)
    out = L.Dense(7, activation="softmax")(x)

    normal_model = M.Model(inputs=inp, outputs=out)
    normal_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=6e-4),
                         loss=tf.keras.losses.CategoricalCrossentropy())
    return normal_model

Mimicking the Keras ReduceLROnPlateau callback, the batch size is increased by a factor of 5, starting from 256, whenever the validation loss does not improve for three epochs.

import time
from tensorflow.keras import callbacks as C

# factor by which the batch size is increased (and the learning rate later decayed)
reduce_fac = 5

# timing purposes
batch_start = time.time()
# total epoch budget
epoch = 60
# ensure the batch size is 256 on the first iteration (it is multiplied by reduce_fac before fitting)
batch_size = 256 / reduce_fac
# for recording training history
batch_history_total = pd.DataFrame([], columns=["loss", "val_loss"])
# init model
batch_model = create_testing_model()

# make sure the batch size does not exceed sample size / 10 and the epoch budget doesn't go below 0
while epoch > 0 and batch_size <= len(X_train) / 10:
    print(f"CURRENT BATCH SIZE: {batch_size * reduce_fac}")
    # early stop if val loss stops improving for 3 epochs
    early_stop = C.EarlyStopping(monitor="val_loss", patience=3, verbose=2)
    batch_history = batch_model.fit(X_train, y_train, epochs=epoch, batch_size=int(batch_size * reduce_fac),
                                    validation_data=(X_test, y_test), callbacks=[early_stop], verbose=2)
    cur_history = pd.DataFrame(batch_history.history)
    # concat the current training history to the total dataframe
    batch_history_total = pd.concat([batch_history_total, cur_history], axis=0).reset_index(drop=True)
    # adjust batch size
    batch_size = batch_size * reduce_fac
    # decrease the epoch budget by the number of epochs trained
    epoch_trained = len(cur_history)
    epoch -= epoch_trained

if epoch > 0:
    # revert to reduce-lr training for the remaining epochs
    print("reverting to reduce lr training")
    reduce_lr = C.ReduceLROnPlateau(monitor="val_loss", factor=(1 / reduce_fac), patience=3, verbose=2)
    batch_history = batch_model.fit(X_train, y_train, epochs=epoch, batch_size=int(batch_size),
                                    validation_data=(X_test, y_test), callbacks=[reduce_lr], verbose=2)
    batch_history_total = pd.concat([batch_history_total, pd.DataFrame(batch_history.history)], axis=0).reset_index(drop=True)

batch_end = time.time()

For comparison, the same model was trained with a constant batch size of 256, reducing the learning rate whenever the validation loss did not improve for three epochs. Both models were trained for 60 epochs, and the model trained solely with ReduceLROnPlateau was, astonishingly, about two times slower (on a GPU) with slightly worse validation performance.
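
For completeness, a sketch of that baseline run could look like the following, reusing create_testing_model, the data split, and the reduce_fac factor defined above (this is an assumed reconstruction, not the author's exact code):

# baseline: constant batch size of 256, decaying the learning rate on plateaus instead
lr_model = create_testing_model()
reduce_lr = C.ReduceLROnPlateau(monitor="val_loss", factor=1 / reduce_fac,
                                patience=3, verbose=2)

lr_start = time.time()
lr_history = lr_model.fit(X_train, y_train, epochs=60, batch_size=256,
                          validation_data=(X_test, y_test),
                          callbacks=[reduce_lr], verbose=2)
lr_end = time.time()

print(f"decaying-lr baseline wall time: {lr_end - lr_start:.1f}s")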

Training result comparison between the increasing-batch-size method and the decaying-learning-rate method. Image from the author.

Next time you train a neural network, whether for image classification, tabular data, or audio tasks, give “batch size schedulers” a try. They might save you some training time (and bring subtle improvements)!
