Convolutional Neural Network


Artificial Intelligence refers to machine intelligence that mimics the way human intelligence works. Computers today are trained well enough not only to play games against humans but to compete with and even beat them, and all of this is possible because the machines are powered by Artificial Intelligence. Machines are continually being modified to approach human capabilities: AI lets them follow humans in vision and perception (image recognition, known as visual perception), speech recognition, Natural Language Processing (NLP), and many other tasks.

Deep learning is a subset of Artificial Intelligence, and the Convolutional Neural Network (CNN) is a class of deep neural networks suited to analyzing visual imagery; it is a workhorse algorithm of the Computer Vision domain. In this post, we will cover the basics of the Convolutional Neural Network, the building blocks of a CNN, its structure, and its implementation with TensorFlow.

What is Convolutional Neural Network?

A Convolutional Neural Network, or CNN, is a category of artificial neural network used for image recognition and image processing, operating directly on pixel data. It achieves this through the convolution operation. CNNs are the state of the art for recognizing what an image is and what is in it, and they even play a role in tasks such as assigning captions to images.

The purpose of the convolutional layer of a CNN is to condense a neighbourhood of pixels into a single value, which it does through the convolution process. For instance, when convolution is applied to an image, the image shrinks, and the information within each receptive field is brought together into a single output value; the outcome of the convolutional layer is a grid of such values. The vital property of a CNN that separates it from other kinds of networks is that it detects the important features automatically, without any human intervention. For example, given pictures of cats and dogs, it can learn to distinguish cats from dogs by picking out the key features of each class by itself.
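The convolution process itself can be sketched in a few lines of plain NumPy. This is a simplified "valid" convolution on a single-channel image; the image and kernel sizes here are illustrative only:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2D convolution: slide the kernel over the image and
    take a weighted sum at each position, collapsing each patch of
    pixels into a single output value."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0        # a simple averaging filter
feature_map = conv2d(image, kernel)
print(feature_map.shape)              # (3, 3): smaller than the 5x5 input
```

Note how the 5 × 5 input shrinks to a 3 × 3 output: each 3 × 3 patch has been condensed into a single value, exactly the behaviour described above.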

The structure of a Convolutional Neural Network is as follows: Convolution > Pooling > Convolution > Pooling > Fully Connected Layer > Output.

The convolution layer receives the input image and applies the convolution process. This layer creates feature maps from the original data and passes them on to the pooling layer (also a hidden layer), which in turn connects to a fully connected layer, our neural network proper, and finally the output is generated. The output obtained is of the same kind as that generated by a multi-layer perceptron model.

Convolutional layer

The convolutional layer is the vital building block of a CNN, the first and foremost layer in the structure of the network. Neurons in the first convolutional layer are not connected to every pixel of the input image; each is connected only to the pixels within its receptive field, as shown below.

As a result, neurons in the second convolutional layer connect only to neurons within a small rectangle of the first layer. Thanks to this architecture, the network concentrates on low-level features in the first hidden layer, assembles them into higher-level features in the next hidden layer, and so on. Because real-world images have this kind of hierarchical structure, CNNs are widely used for image recognition.

Figure 1: CNN layers with rectangular local receptive fields


The weights of a neuron can be represented as a small image the size of the receptive field. Figure 2 below shows two sets of weights, called filters. The first filter is a black square with a vertical white line through the middle; neurons using these weights will ignore nearly everything except the central vertical line. The other filter is a black square with a horizontal white line through the middle; neurons equipped with these weights ignore everything except the central horizontal line.

When the input image shown in Figure 2 is fed to the network and every neuron in a layer uses the same vertical-line filter, the result is the top-left image: the vertical white lines are sharpened while the rest is blurred. A feature map is thus the output of a layer full of neurons using the same filter, and it highlights the areas of the image most similar to that filter. During training, a CNN discovers the filters most useful for its task and learns to combine them into more complex patterns. For instance, a cross is an area of the image where both the vertical and the horizontal filters are active.

Figure 2: Applying two different filters to get two feature maps

Stacking Multiple Feature Maps

For the sake of simplicity, each convolutional layer has so far been represented as a thin 2D layer. In reality it is 3D, as shown below, because it is composed of several feature maps. Within one feature map all neurons share the same parameters, but different feature maps have different parameters. In a nutshell, a convolutional layer applies multiple filters to its inputs in parallel, making it capable of detecting multiple features anywhere in its inputs.

Since all the neurons in a feature map share the same parameters, the number of parameters in the model is dramatically reduced. Moreover, once a CNN has learned to recognize a pattern in one location, it can recognize it in any other location as well.
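The parameter savings from weight sharing are easy to quantify. As a rough sketch, comparing a convolutional layer against a fully connected layer with the same number of output units (the layer sizes below are hypothetical, chosen only for illustration):

```python
# Hypothetical sizes, chosen only for illustration.
height, width, channels = 150, 100, 3        # input image
kernel_h, kernel_w, feature_maps = 5, 5, 32  # convolutional layer

# Convolutional layer: one shared kernel (plus one bias) per feature
# map, independent of the image size.
conv_params = kernel_h * kernel_w * channels * feature_maps + feature_maps

# Fully connected layer with the same number of output units: one
# weight per input pixel per unit.
fc_params = height * width * channels * feature_maps + feature_maps

print(conv_params)  # 2432
print(fc_params)    # 1440032
```

The convolutional layer needs a few thousand parameters where the fully connected alternative needs well over a million, which is exactly the reduction weight sharing buys.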

Input layers also comprise multiple sub-layers: one per color channel. There are typically three channels, red, green, and blue (RGB). Grayscale images have only one channel, but some images may have many more, for instance satellite images that capture extra light frequencies such as infrared.

Figure 3: Convolution layers with multiple feature maps, and images with three channels

Implementing a Convolutional layer in TensorFlow

import numpy as np # first a couple of imports
import tensorflow as tf # this example uses the TensorFlow 1.x API
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_images # importing sample images

# Load sample images
dataset = np.array(load_sample_images().images, dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters_test = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters_test[:, 3, :, 0] = 1 # vertical line
filters_test[3, :, :, 1] = 1 # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters_test, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output =, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1]) # plot 1st image's 2nd feature map # to display the output

Pooling layer

The pooling layer comes after the convolutional layer in the structure. Pooling, also referred to as down-sampling or subsampling, shrinks the image, which reduces memory usage, the computational load, and the number of parameters (helping to prevent overfitting). When the input image size is reduced, the neural network can also tolerate a small shift in the image. The most common form of down-sampling is "max pooling": a region is selected, and the maximum value within that region becomes the new value for the entire region.

As in the convolutional layer, each neuron in the pooling layer is connected to the outputs of a limited number of neurons in the previous layer, within a rectangular receptive field. We again need to define the size, the stride, and the padding type, just as before. But pooling neurons have no weights; each simply aggregates its inputs using an aggregation function such as max or mean. Figure 4 shows a max pooling layer, the most popular kind of pooling layer. Note that only the maximum input value in each kernel makes it to the next layer; the other inputs are dropped.

Figure 4: Max pooling layer (2 × 2 pooling kernel, stride 2, no padding)

Implementing a max pooling layer in TensorFlow

[…] # loading the image dataset, just like above

# Create a graph with input X plus a max pooling layer
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    output =, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8)) # plot the output for the 1st image

The Architecture of Convolutional Neural Network

Typical CNN architectures stack a few convolutional layers, then a pooling layer, then a few more convolutional layers, then another pooling layer, and so on. The image not only gets smaller as it progresses through the network, it also gets deeper, because the convolutional layers add feature maps. In Figure 5, a regular neural network made of a few fully connected layers (+ReLUs) is added at the top of the stack, and the final layer outputs the prediction.
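The shrink-and-deepen behaviour described above can be sketched with simple shape arithmetic. The layer sizes here are illustrative assumptions, not taken from any particular architecture:

```python
def conv_out(size, kernel=3, stride=1, padding="same"):
    """Spatial size after a convolution (square input, square kernel)."""
    if padding == "same":
        return (size + stride - 1) // stride
    return (size - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial size after pooling with no padding."""
    return (size - kernel) // stride + 1

size, depth = 64, 3                  # hypothetical 64x64 RGB input
for maps in (16, 32):                # two conv + pool stages
    size = conv_out(size)            # "same" padding keeps the size
    depth = maps                     # the conv layer deepens the volume
    size = pool_out(size)            # pooling halves it spatially
    print(size, depth)               # 32 16, then 16 32
```

After two stages the 64 × 64 × 3 input has become a 16 × 16 × 32 volume: smaller spatially, deeper in feature maps.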

Figure 5: Typical CNN architecture

There are several well-known architectures associated with CNNs. Four basic architectures are:

· LeNet-5 architecture

· AlexNet architecture

· GoogLeNet architecture

· ResNet architecture


The most widely known CNN architecture is the LeNet-5 architecture. Created by Yann LeCun in 1998, it is widely used for handwritten digit recognition (MNIST).

The LeNet-5 architecture is composed of the layers Out, F6, C5, S4, C3, S2, C1, and the input layer. Its average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, multiplies the result by a learnable coefficient, adds a learnable bias term, and only then applies the activation function. Neurons in the C3 maps are connected to neurons in only three or four of the S2 maps.
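The subsampling step just described can be sketched in NumPy. This is a single 2 × 2 block with an illustrative coefficient and bias; the tanh activation follows the original paper:

```python
import numpy as np

def lenet_avg_pool(x, coeff, bias, activation=np.tanh):
    """LeNet-5-style subsampling over 2x2 blocks: average the inputs,
    scale by a learnable coefficient, add a learnable bias, then apply
    the activation function."""
    h, w = x.shape
    pooled = x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return activation(coeff * pooled + bias)

x = np.array([[1.0, 3.0],
              [5.0, 7.0]])            # one 2x2 block, average = 4.0
out = lenet_avg_pool(x, coeff=0.5, bias=-2.0)
print(out)                            # tanh(0.5 * 4.0 - 2.0) = tanh(0) = 0.0
```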

The output layer is also special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector.
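A minimal sketch of this distance-based output layer, with input and weight vectors made up purely for illustration:

```python
import numpy as np

def rbf_output(x, weights):
    """Each output unit emits the squared Euclidean distance between
    the input vector and that unit's weight vector, rather than a dot
    product."""
    return np.sum((weights - x) ** 2, axis=1)

x = np.array([1.0, 2.0])
weights = np.array([[1.0, 2.0],    # matches x exactly -> distance 0
                    [3.0, 4.0]])   # distance 2^2 + 2^2 = 8
print(rbf_output(x, weights))      # [0. 8.]
```

The unit whose weight vector best matches the input produces the smallest output, so the predicted class is the one with the minimum distance.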



AlexNet was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. It is similar to LeNet-5, only much larger and deeper, and it was the first architecture to stack convolutional layers directly on top of one another instead of placing a pooling layer on top of each convolutional layer.

AlexNet also adds a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization. The most strongly activated neurons inhibit neurons at the same position in neighbouring feature maps, which pushes different feature maps to specialize, driving them apart to explore a wider range of features and ultimately improving generalization.
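One way to sketch local response normalization for a single spatial position across feature maps (using AlexNet's published constants; the input activations below are illustrative):

```python
import numpy as np

def local_response_norm(a, radius=2, alpha=1e-4, beta=0.75, k=2.0):
    """Local response normalization across feature maps: each
    activation is divided by a term that grows with the squared
    activations of its neighbouring maps, so a strongly active map
    inhibits its neighbours. `a` holds the activations of all maps at
    one spatial position."""
    out = np.empty_like(a)
    n = len(a)
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out[i] = a[i] / (k + alpha * np.sum(a[lo:hi] ** 2)) ** beta
    return out

a = np.array([1.0, 10.0, 1.0])   # one strong map among weak neighbours
print(local_response_norm(a))    # every activation is scaled down
```

Because the strong map's squared activation appears in its neighbours' denominators, it suppresses them more than they suppress it, which is the competitive effect described above.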


This architecture was developed by Christian Szegedy et al. from Google Research. Its excellent performance came in large part from the network being much deeper than previous CNNs, which was made possible by sub-networks called inception modules. These let GoogLeNet use parameters much more efficiently than previous architectures: GoogLeNet actually has about 10 times fewer parameters than AlexNet. The architecture is so deep that it has to be represented in three columns. It contains nine inception modules, whose outputs are feature maps; every convolutional layer uses the ReLU activation function, and the number of feature maps output by each convolutional layer is shown before the kernel size.
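Much of the inception module's parameter efficiency comes from 1 × 1 "bottleneck" convolutions that reduce the depth before the expensive 3 × 3 convolutions. A back-of-the-envelope comparison (the map counts below are illustrative, not GoogLeNet's exact numbers):

```python
# A 3x3 convolution over 256 input maps producing 256 output maps,
# with and without a 1x1 bottleneck that first reduces depth to 64.
direct = 3 * 3 * 256 * 256                        # 589,824 weights
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 256  # 16,384 + 147,456

print(direct)      # 589824
print(bottleneck)  # 163840
```

The bottlenecked version needs roughly a quarter of the weights while still mixing information across all 256 input maps, which is how inception modules keep a very deep network's parameter count low.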


ResNet, or the Residual Network, developed by Kaiming He et al., delivered an error rate under 3.6% using an extremely deep CNN composed of 152 layers. The key to training such a deep network is the use of skip connections, also called shortcut connections.

The ResNet architecture starts and ends like the GoogLeNet architecture (except that ResNet has no dropout layer), and in between lies a very deep stack of simple residual units. Every few residual units, the number of feature maps doubles. ResNet-34, the ResNet with 34 layers, contains 3 residual units that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps, and 3 RUs with 512 maps.

Despite its depth, ResNet remains relatively light in parameters, since it relies on small kernels rather than large fully connected layers. All in all, ResNet is both one of the simplest and one of the most powerful CNN architectures, and it won the ILSVRC challenge.

Each residual unit is composed of two convolutional layers with Batch Normalization (BN) and ReLU activation, using 3 × 3 kernels and preserving spatial dimensions.
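A toy sketch of a residual unit's skip connection (dense weight matrices stand in for the two 3 × 3 convolutions, and BN is omitted for brevity):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """A simplified residual unit: two transformations form F(x), and a
    skip connection adds the input back before the final activation."""
    fx = relu(x @ w1) @ w2       # the residual function F(x)
    return relu(x + fx)          # skip connection: output = g(x + F(x))

x = np.array([1.0, -1.0])
w1 = np.eye(2)
w2 = np.zeros((2, 2))            # F(x) = 0 -> the unit passes x through
print(residual_unit(x, w1, w2))  # [1. 0.]
```

When the weights are near zero, the unit approximates the identity function, which is what makes very deep stacks of these units trainable.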


With this post, we have explored a few aspects of Artificial Intelligence and its real-world uses. Within Artificial Intelligence lies its subset, deep learning, and within deep learning lies the convolutional neural network, on which this entire article has focused. We covered the basic structure of the Convolutional Neural Network, took a thorough look at its building blocks along with the TensorFlow implementation of the different CNN layers, and, last but not least, walked through several well-known CNN architectures.

