Getting Started with Deep Learning Using CNNs

Implementing the “Hello World!” of convolutional neural networks

A large part of the progress made in deep learning in recent years is due to convolutional neural networks, or CNNs. These networks have become the de facto standard for all but the most trivial image processing tasks. The basic concepts of CNNs originate from the 1980s, and the first application to image recognition was published in 1989. As with so many topics in deep learning, the big advances came with more computing power; one major factor was the use of GPUs instead of CPUs for training, starting in the mid-2000s.

We will look at one of the most important papers in the field, namely the 1998 paper “Gradient-Based Learning Applied to Document Recognition”. This paper is worth reading more than 20 years after publication for a variety of reasons. Notable are the authors, who include Yann LeCun and Yoshua Bengio; together with Geoffrey Hinton they are considered the “Godfathers of Deep Learning” and jointly won the 2018 Turing Award for their work in the field.

Another reason is that the paper is not purely a research effort: the solutions developed there were applied commercially to recognise handwritten digits at NCR Corporation. The paper describes that effort comprehensively, but to focus on the architecture of the neural network we only have to read sections 1 to 3 of the 10 sections in total. The whole paper is 46 pages long, which is noteworthy in itself and well above the usual length.

The most relevant reason here is that the LeNet-5 network discussed in the paper is generally considered one of the most important in the history of CNNs. It is sometimes referred to as the “Hello World!” of CNNs. There are some arguments supporting this:

  • “LeNet-5 implementation” gives you more than 130,000 hits on Google, quite a lot of them relevant.
  • Using current libraries, implementing the basic architecture takes a few lines of code, as opposed to the large research effort described in the paper.
  • You will get quite good results on the MNIST dataset fast. Not enough for best paper awards nowadays, but probably way better than trying from scratch.
  • Training the network takes between about 100 s per epoch on a CPU and only a few seconds per epoch on a GPU, depending on the hyperparameters used. This gives you a reasonable turnaround time for experimenting with the net.

On the other hand, closely reimplementing the network from the paper would be really difficult, because at the time the authors had to do everything by hand. In 1998 there were no GPUs, Python was not the lingua franca of AI and there were no established deep learning libraries. While doing everything from scratch they also made lots of optimisations which are difficult to reproduce using state-of-the-art libraries. Most implementations you will find using Google just make the network bigger to get similarly good results on MNIST fast. So getting comparable results is not as simple as “Hello World!”, but the network is still a good basis for getting started with CNNs.

Getting Started with Coding

We will not even attempt to reproduce the paper accurately, but as opposed to most other implementations found on the Internet we will start with a version that follows the paper in at least some relevant parameters. The code will be in Python and use Keras on TensorFlow. With a high-level library like Keras the code for the network is almost trivial, which will also make experimenting with it easier.

The basic structure of the network is explained in the paper in detail in section II.B. It has seven layers in total.

  • A convolutional layer with six 5×5 kernels, with padding, so we effectively have padded 32×32 images as input as opposed to the original 28×28 MNIST images.
  • A 2×2 pooling layer. Here reproducing the paper already starts getting difficult, as you will realize when reading the paper. For simplicity we use average pooling.
  • A convolutional layer with 16 5×5 kernels and no padding. Again this is a simplification of the structure in the paper.
  • Another 2×2 pooling layer using average pooling, again not the real thing.
  • A fully connected layer with 120 neurons.
  • A fully connected layer with 84 neurons.
  • A softmax layer to finally get the 10 possible output classes. While this is again different from the paper, the fully connected layers are the real thing.
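The spatial sizes implied by the list above can be checked with a little arithmetic. The following is a minimal sketch in plain Python and not part of the original code. The helper names `conv_out` and `pool_out` are made up for illustration, and the sketch assumes stride 1 for the convolutions and non-overlapping 2×2 pooling.

```python
def conv_out(size, kernel, padding):
    """Output width/height of a stride-1 convolution (illustrative helper)."""
    if padding == "same":
        return size            # input is zero-padded so the size is preserved
    return size - kernel + 1   # "valid": no padding

def pool_out(size, pool):
    """Output width/height of non-overlapping pooling (illustrative helper)."""
    return size // pool

size = 28                          # MNIST images are 28x28
c1 = conv_out(size, 5, "same")     # first convolution, padded
s2 = pool_out(c1, 2)               # first 2x2 pooling
c3 = conv_out(s2, 5, "valid")      # second convolution, no padding
s4 = pool_out(c3, 2)               # second 2x2 pooling
flat = s4 * s4 * 16                # 16 feature maps flattened for the dense layers
print(c1, s2, c3, s4, flat)
```

Padding the 28×28 input with `"same"` gives the same 28×28 output as an unpadded 5×5 convolution on a 32×32 image, which is why the first layer can be described either way.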

The activation function used in the paper is tanh and the loss function is similar to MSE, or mean squared error. Both were optimized for the paper, as they had to be coded by hand as opposed to just using them from a high-level library like we do. For the optimizer we choose simple SGD, or stochastic gradient descent. The network was trained over 20 epochs with a learning rate of “0.0005 for the first two passes, 0.0002 for the next three, 0.00005 for the next 4 and 0.00001 thereafter”, as explicitly specified in the paper.

Creating the model in Keras takes just a few lines of code:

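    from tensorflow import keras
    from tensorflow.keras import layers

    num_classes = 10  # digits 0 to 9

    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),  # 28x28 grayscale MNIST images
        layers.Conv2D(6, kernel_size=(5, 5), padding="same", activation="tanh"),
        layers.AveragePooling2D(pool_size=(2, 2)),
        layers.Conv2D(16, kernel_size=(5, 5), padding="valid", activation="tanh"),
        layers.AveragePooling2D(pool_size=(2, 2)),
        layers.Flatten(),  # 5x5x16 feature maps -> 400 values
        layers.Dense(120, activation="tanh"),
        layers.Dense(84, activation="tanh"),
        layers.Dense(num_classes, activation="softmax"),
    ])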

If you look at the model summary and compare it with the detailed description of the network you can see that only layers C1, C5 and F6 match the paper regarding the number of trainable parameters. This is because we did not try to replicate the optimisations done in the paper.

Model: "sequential"
Layer (type)                    Output Shape          Param #
conv2d (Conv2D)                 (None, 28, 28, 6)     156
average_pooling2d (AveragePo    (None, 14, 14, 6)     0
conv2d_1 (Conv2D)               (None, 10, 10, 16)    2416
average_pooling2d_1 (Average    (None, 5, 5, 16)      0
flatten (Flatten)               (None, 400)           0
dense (Dense)                   (None, 120)           48120
dense_1 (Dense)                 (None, 84)            10164
dense_2 (Dense)                 (None, 10)            850
Total params: 61,706
Trainable params: 61,706
Non-trainable params: 0
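The parameter counts in the summary can be reproduced by hand. This is a small sketch in plain Python; the helper names `conv_params` and `dense_params` are made up for illustration. It assumes every convolution kernel spans all input channels and every unit has one bias, which is how the fully connected Keras layers above behave.

```python
def conv_params(kernel, in_channels, filters):
    # Each filter has kernel*kernel*in_channels weights plus one bias.
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    # Each unit has one weight per input plus one bias.
    return (inputs + 1) * units

counts = {
    "conv2d": conv_params(5, 1, 6),      # 1 input channel (grayscale)
    "conv2d_1": conv_params(5, 6, 16),   # sees all 6 feature maps
    "dense": dense_params(400, 120),     # 5*5*16 = 400 flattened inputs
    "dense_1": dense_params(120, 84),
    "dense_2": dense_params(84, 10),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:10s} {n:6d}")
print(f"{'total':10s} {total:6d}")
```

The pooling and flatten layers contribute nothing here because Keras average pooling has no trainable weights.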

Training and Scoring

The significant challenge of reproducing the paper does not stop with the network architecture. The next question to answer is how to structure the training process. On the positive side the MNIST dataset used in the paper is still the MNIST dataset used today. There is an interesting and short history of MNIST in section III.A of the paper well worth reading.

Besides the loss function, the optimizer and the learning rate, we have to decide on the batch size and on whether and how to use cross validation during training. The low learning rates given in the paper, combined with training for only 20 epochs, only make sense if no batches were used, i.e. with a batch size of 1. Note that a GPU only gives a relevant speedup if mini-batches are used, so training will take some time with these parameters.

The paper also explores the effect of the amount of training data on the network performance and explicitly states that 60,000 training images from MNIST were used. These are all the training images available in standard MNIST. So we use no cross validation, i.e. a validation split of 0.
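To see why the batch size matters so much for such small learning rates, it helps to count the weight updates. The sketch below is plain Python with a made-up helper name, `updates_per_epoch`. It assumes one SGD update per batch and uses the 60,000 training images mentioned above.

```python
import math

TRAIN_IMAGES = 60_000  # size of the standard MNIST training set

def updates_per_epoch(batch_size, n_samples=TRAIN_IMAGES):
    """Number of SGD weight updates in one pass over the data (illustrative helper)."""
    return math.ceil(n_samples / batch_size)

for batch_size in (1, 32, 128):
    steps = updates_per_epoch(batch_size)
    print(f"batch_size={batch_size:>3}: {steps:>6} updates per epoch, "
          f"{steps * 20:>9} over 20 epochs")
```

With a batch size of 1 the weights are updated 60,000 times per epoch, so even a tiny step size adds up. With larger batches the same learning rate would move the weights far less over 20 epochs.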

This results in the following code for training the net:

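    # Per-epoch learning rates as specified in the paper (20 epochs).
    lr_list = [0.0005, 0.0005, 0.0002, 0.0002, 0.0002, 0.00005,
               0.00005, 0.00005, 0.00005, 0.00001, 0.00001, 0.00001,
               0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001,
               0.00001, 0.00001]

    def calc_lr(epoch, lr):
        # Keras calls this at the start of each epoch (0-based index).
        return lr_list[epoch]

    optimizer = keras.optimizers.SGD(learning_rate=lr_list[0])
    model.compile(loss="mean_squared_error", optimizer=optimizer,
                  metrics=["accuracy"])
    # x_train: images of shape (n, 28, 28, 1); y_train: one-hot labels (needed for MSE).
    model.fit(x_train, y_train, batch_size=1, epochs=20, validation_split=0.0,
              callbacks=[keras.callbacks.LearningRateScheduler(calc_lr)])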

Training the network outputs the accuracy on the training set and, if cross validation is used, the accuracy on the validation set. The only relevant metric for the network is its performance on the test set, as this gives a reasonable hint of how the network will perform on real-life data. Training the network with these parameters resulted, in my case, in an accuracy of 0.9257 on the training data and 0.9292 on the test data. It is a good sign that accuracy on the test set is higher, as this hints at good generalization of the network.

But is 93% a good accuracy overall? Well, actually not. The paper compares several techniques and models for MNIST classification in section III.C, and the worst one mentioned there has an accuracy of about 95%. An important result of the paper is that LeNet-5 as proposed there reaches accuracies of 99% or more on MNIST. So we have quite some way to go to win a best paper award, even in 1998.

Getting a Better Score

As mentioned before, the paper hints at many optimisations, and trying to replicate them is probably at least a PhD thesis' worth of work. So we are not going to go that way. The easier way to get better results is to take the basic structure of LeNet-5 and apply current best practice for CNNs rather than that of 20-plus years ago. Another path worth considering is making the net bigger, as training takes very little time on current hardware anyway.

To see what is possible we take a shortcut by using one of the many implementations found on the web, for example this one, which happened to be the first hit in my Google search:

The code is a bit more elaborate than ours, probably because an older version of Keras was used. There are not too many differences from our choices; the network actually looks exactly the same. Potentially relevant differences are:

  • Use of categorical cross entropy as opposed to mean squared error as the loss function.
  • Use of the default learning rate of 0.01 as opposed to the learning rates from the paper.
  • Use of the default batch size of 32 as opposed to no batches i.e. a batch size of 1.
  • Use of cross validation.
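To get a feeling for the first difference in the list, the two loss functions can be compared on a single made-up prediction. This is a plain Python sketch rather than the Keras implementation: the function names and the example probabilities are invented for illustration, and MSE is averaged over the 10 classes, which is also how Keras reduces it per sample.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over the class vector."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   # one-hot label for the digit 3

wrong = [0.01] * 10
wrong[7] = 0.91                            # confidently predicts 7 (wrong)
right = [0.01] * 10
right[3] = 0.91                            # confidently predicts 3 (correct)

for name, pred in (("wrong", wrong), ("right", right)):
    print(f"{name}: mse={mse(y_true, pred):.4f}  "
          f"cross_entropy={categorical_cross_entropy(y_true, pred):.4f}")
```

In this example the MSE of a confidently wrong prediction stays below 0.2, while the cross entropy is several units and keeps growing as the probability of the true class approaches zero. That stronger penalty is one commonly cited reason for preferring cross entropy in classification, though it does not by itself explain the whole improvement reported here.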

These simple adaptations give a vastly improved result of an accuracy of 0.9890 on the training data and 0.9870 on the test data in my case. Going back to batch size one and no cross validation I got a training data accuracy of 1.0000 which suggests overfitting. But as I also got 0.9888 on the test data I didn’t really care. So just by changing the loss function and increasing the learning rate we are quite a bit closer to a best paper award had we only handed in our work in 1998 already.


Reimplementing the basic structure of LeNet-5 is easy, but you will not really replicate the work in the paper. The performance of this more than 20-year-old network on the MNIST dataset is still quite impressive, however. It is critical to use good hyperparameters! While it may be considered best practice to use cross entropy for classification and MSE for regression, there are more options to decide on. Using CNNs on MNIST has been done so often that the maximum achievable accuracy is quite well known.

An alternative is Fashion MNIST which can replace classical MNIST without any changes to the network as image resolution is the same. It also has 10 classes, but the goal is to categorize fashion articles instead of handwritten digits. For further experimentation with the “Hello World!” of CNNs this is a more interesting data set.
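Switching datasets could look roughly like the sketch below. It is not taken from the original code: the function name `load_dataset` and its preprocessing (scaling to [0, 1], adding a channel axis, one-hot labels) are assumptions chosen to match the model and the MSE or cross entropy losses used above. It relies on the `keras.datasets.mnist` and `keras.datasets.fashion_mnist` loaders that ship with Keras, which download the data on first use.

```python
NUM_CLASSES = 10

# Label names for Fashion MNIST, indexed by class id 0..9.
FASHION_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def load_dataset(name="mnist"):
    """Return ((x_train, y_train), (x_test, y_test)) ready for the model (hypothetical helper)."""
    from tensorflow import keras  # imported lazily; requires TensorFlow

    source = {"mnist": keras.datasets.mnist,
              "fashion_mnist": keras.datasets.fashion_mnist}[name]
    (x_train, y_train), (x_test, y_test) = source.load_data()

    def prepare(x, y):
        x = x.astype("float32")[..., None] / 255.0        # (n, 28, 28) -> (n, 28, 28, 1)
        return x, keras.utils.to_categorical(y, NUM_CLASSES)

    return prepare(x_train, y_train), prepare(x_test, y_test)
```

Calling `load_dataset("fashion_mnist")` instead of `load_dataset("mnist")` would then be the only change needed, since both datasets have 28×28 grayscale images and 10 classes.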