Build Your First Convolutional Neural Network to Classify Cats and Dogs



Photo by Yan Laurichesse on Unsplash


Have you ever wondered how social media utilizes facial recognition, how object detection aids in the development of self-driving cars, or how medical imagery can be used to automatically determine a diagnosis? Convolutional Neural Networks are what make them all possible, but how?

To illustrate the structure and behavior of Convolutional Neural Networks (CNNs), in this article we will build a basic model to classify images of cats and dogs. The principles outlined here are the very same building blocks that enable the cutting-edge applications described above.

Suppose you have a random image and your task is to determine if the image contains a cat or a dog. The initial step would be to pass the image’s pixels (essentially matrices of numerical RGB values) to the neural network’s input layer in the form of arrays. By conducting various computations and manipulations (basically, applying algorithms with trained weights), the hidden layers extract features from the image array (called “feature maps”). As the pixel array matrices pass through deeper layers, the model is able to compose feature maps that represent increasingly complex and abstract features of the original image. Finally, the image passes through a fully-connected (dense) layer that uses the information identified in these feature maps to classify the image as its output.

What is a Convolutional Neural Network and how does it recognize an image?

A CNN is a type of feed-forward neural network that uses convolution kernels, which slide across the input to create feature maps, essentially arrays of numerical values. CNNs are probably best known for their success in visual imagery applications. They are also sometimes referred to as ConvNets and can be used to detect, and in some cases identify, the contents of an image.

Every image, whether printed on paper, displayed on your computer screen, or fed to a CNN, is represented as a collection (or array) of pixel values. For black and white images, this typically means a single layer of values ranging between 0 and 1. For color images, this often means 3 layers of values, one layer for red pixel values, one layer for blue pixel values, and one layer for green pixel values. Other image matrix structures exist, but are less common. Color image pixels most often range between 0 and 1 or 0 and 255.
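To make this concrete, here is a tiny sketch (using NumPy, with made-up pixel values) of how a grayscale image and a color image are represented as arrays:

```python
import numpy as np

# A tiny 2x2 grayscale "image": a single layer of values scaled to [0, 1]
gray = np.array([[0.0, 0.5],
                 [1.0, 0.25]])
print(gray.shape)  # (2, 2)

# A tiny 2x2 color "image": three stacked layers (red, green, blue),
# here using raw 8-bit values in the 0-255 range
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]  # top-left pixel is pure red
print(color.shape)  # (2, 2, 3)
```

The third dimension of the color array is exactly the "depth" of three channels that the convolutional layer below expects.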

Convolutional Neural Network Architecture


The three most common layers you’ll find in a CNN architecture are convolution layers, pooling layers, and densely connected layers. Other layer types exist as well, but for this basic tutorial, we’ll stick to examining these three.

1. Convolutional Layer — The convolutional layer is one of the most important components of a CNN since it is where the majority of the calculations takes place. It requires input data, a filter, and a feature map, among other things. Let’s pretend the input is a color picture, which is made up of a 3D matrix of pixels. This means the input will have three dimensions: height, width, and depth, which match the RGB pixel values of a picture.

We also have a feature detector, sometimes known as a convolution kernel or a filter, that moves over the image’s receptive fields, checking for the presence of a feature. The feature detector is a 2D weighted array of numerical values which is applied to the numerical pixel matrix from left to right, top to bottom. The filter size can vary, which also affects the size of the receptive field.
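As an illustration, here is a minimal NumPy sketch of a kernel sliding over an image from left to right, top to bottom; the `convolve2d` helper and the `edge_kernel` values are our own inventions for demonstration, not weights a trained CNN would actually learn:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Multiply the receptive field element-wise by the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])  # a simple vertical-edge detector
feature_map = convolve2d(image, edge_kernel)
print(feature_map.shape)  # (3, 3): smaller than the 4x4 input
```

Note how the output feature map is smaller than the input: a 2x2 kernel over a 4x4 image yields a 3x3 map, which is why filter size affects the receptive field.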


2. Pooling Layer — The pooling layer uses a summary statistic of neighboring outputs to condense pixel information onto a smaller output matrix. This reduces the representation’s spatial size, as well as the computational expense. Every slice of the representation is independently handled throughout the pooling procedure.

Some examples of functions commonly employed in the pooling layer are: the rectangular neighborhood average, the L2 norm of the rectangle neighborhood, and a weighted average depending on the distance from the center pixel. The most prevalent method, however, is max pooling, which reports the neighborhood’s maximum output, as shown below.
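Here is a minimal NumPy sketch of max pooling; the `max_pool` helper and the sample feature map are illustrative only, not taken from the model built later in this article:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Max pooling with stride equal to the pool size."""
    h, w = feature_map.shape
    # Trim any ragged edge, then group pixels into size x size blocks
    out = feature_map[:h - h % size, :w - w % size]
    out = out.reshape(h // size, size, w // size, size)
    # Keep only the maximum value from each block
    return out.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 1],
               [0, 8, 3, 5]], dtype=float)
print(max_pool(fm))
# [[6. 4.]
#  [8. 9.]]
```

Each 2x2 neighborhood collapses to its maximum, halving the spatial size in both dimensions.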


3. Fully-Connected (Dense) Layer — Neurons in this layer have a complete connection with all neurons in the preceding and following layers. The fully-connected layer aids in the mapping of representations between input and output. The final layer of any CNN must be a fully-connected layer with the same number of nodes as outputs.



The first step to building any model is to download and prepare the data. After downloading the dataset archive, make sure you remember to unzip the files (for example, with the unzip command) so that the PetImages folder is available. Let’s dive in!
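The unzip step can also be done in pure Python with the standard zipfile module. The sketch below builds a tiny stand-in archive in a temporary directory purely for demonstration, since the real download's file name may differ; with the actual dataset you would point ZipFile at the downloaded archive instead:

```python
import os
import tempfile
import zipfile

# For demonstration, create a tiny fake archive with the expected layout
workdir = tempfile.mkdtemp()
archive_path = os.path.join(workdir, "PetImages.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("PetImages/Cat/0.jpg", b"fake image bytes")
    archive.writestr("PetImages/Dog/0.jpg", b"fake image bytes")

# The actual unzip step: extract everything next to the archive
with zipfile.ZipFile(archive_path) as archive:
    archive.extractall(workdir)  # creates PetImages/Cat and PetImages/Dog

print(sorted(os.listdir(os.path.join(workdir, "PetImages"))))  # ['Cat', 'Dog']
```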


Next, we’ll import the required libraries:

import pandas as pd
import numpy as np
import os
import warnings
import random
from keras.preprocessing.image import load_img
import matplotlib.pyplot as plt

Below, we create two lists, input_path and label, where input_path stores the image paths as strings (including the appropriate folder names) and label indicates whether an image contains a cat (0) or a dog (1).

input_path = []
label = []
for folder_name in os.listdir("PetImages"):
    for file_name in os.listdir("PetImages/" + folder_name):
        input_path.append(os.path.join("PetImages", folder_name, file_name))
        label.append(0 if folder_name == 'Cat' else 1)

Now, we create a data frame from the dataset with two fields (columns), images and label, and shuffle its rows:

df = pd.DataFrame()
df['images'] = input_path
df['label'] = label
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows

We’ll also use Keras’s ImageDataGenerator class, which dynamically generates augmented images during training. This class is generally used to systematically create new images from the training images when the training data set is limited or insufficient. Our data set is rather large, but it’s an important tool to know how to use, so let’s see how it works:

# splitting the input data
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=42)

from keras.preprocessing.image import ImageDataGenerator

training_generator = ImageDataGenerator(
    rescale=1./255,        # normalization of images
    rotation_range=40,     # augmentation of images to avoid overfitting
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
)
validation_generator = ImageDataGenerator(rescale=1./255)

# target_size must match the model's input_shape below;
# class_mode='raw' passes our numeric 0/1 labels through unchanged
training_iterator = training_generator.flow_from_dataframe(
    train, x_col='images', y_col='label',
    target_size=(155, 155), class_mode='raw'
)
validation_iterator = validation_generator.flow_from_dataframe(
    test, x_col='images', y_col='label',
    target_size=(155, 155), class_mode='raw'
)

Now, let’s build the model architecture:

from keras import Sequential
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense

model = Sequential([
    Conv2D(16, (3, 3), activation='relu', input_shape=(155, 155, 3)),
    MaxPool2D((2, 2)),
    Conv2D(32, (3, 3), activation='relu'),
    MaxPool2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPool2D((2, 2)),
    Flatten(),                           # flatten feature maps for the dense layers
    Dense(512, activation='relu'),
    Dense(1, activation='sigmoid')       # single output: cat (0) vs. dog (1)
])

Compiling the model is as easy as calling the .compile method. Let’s also print a quick summary of what we’ve built before we start training:

model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])
model.summary()

Now it’s time to train the model:

prediction = model.fit(training_iterator, epochs=5,
                       validation_data=validation_iterator)

Visualizing the results

accuracy = prediction.history['accuracy']
val_accuracy = prediction.history['val_accuracy']
epochs = range(len(accuracy))

plt.plot(epochs, accuracy, 'g', label='Training Accuracy')
plt.plot(epochs, val_accuracy, 'b', label='Validation Accuracy')
plt.title('Accuracy Graph')
plt.legend()
plt.show()

loss = prediction.history['loss']
val_loss = prediction.history['val_loss']
plt.plot(epochs, loss, 'g', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Loss Graph')
plt.legend()
plt.show()


Above we plot two graphs: an Accuracy Graph and a Loss Graph. In each, the blue line shows the validation metric and the green line shows the training metric.

The complete code is available for download on GitHub.


In this article, we learned what a Convolutional Neural Network is, how it breaks down unstructured input data (like images), and what a basic CNN architecture looks like. Congratulations, you’ve now built your first Convolutional Neural Network!

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletters (Deep Learning Weekly and the Comet Newsletter), join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.


