Cat Dog Classification with CNN

A Pytorch image classification tutorial.

Photo by Tran Mau Tri Tam ✪ on Unsplash

Classifying cats and dogs might be somewhat unnecessary in practice, but to me it is a good starting point for learning neural networks. In this article I am going to share my approach to this classification task. The dataset to be used can be accessed through this link.

Here’s the outline of this article:

  1. Import Modules & Set Up Device
  2. Load Images & Create Labels
  3. Preprocessing and Data Augmentation
  4. Custom Dataset Class & Data Loader
  5. Create CNN Model
  6. Model Training
  7. Evaluation

Without further ado, let’s get our hands dirty with some code!

1. Import Modules & Set Up Device

Let’s start this project by importing the required modules. I will explain them all as we go through the article.

# Codeblock 1
import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as transforms

from tqdm import tqdm
from torchinfo import summary
from torch.utils.data import DataLoader

Once all modules have been imported successfully, we can initialize device, which is essentially just a string holding either cuda or cpu. If Pytorch detects a usable Nvidia GPU on your machine, it assigns cuda to the device variable. Keep in mind that at this point we haven’t actually made use of the GPU yet; we have only detected it.

# Codeblock 2
device = 'cuda' if torch.cuda.is_available() else 'cpu'

2. Load Images & Create Labels

As I have mentioned earlier, we are going to utilize Cat and Dog Dataset which is publicly available on Kaggle. Here’s what the structure of the dataset looks like.

All images are stored in separate folders named cats and dogs.

What we need to do is to load the images from the folders named cats and dogs which come from both test_set and training_set. This basically means that we will do the exact same thing four times. In order to make things simple, I decided to create a function for that which I name load_images(). The function is shown in Codeblock 3 below.

# Codeblock 3
def load_images(path):

    images = []
    filenames = os.listdir(path)

    for filename in tqdm(filenames):
        # Skip the stray non-image file found in every folder
        if filename == '_DS_Store':
            continue

        image = cv2.imread(os.path.join(path, filename))
        image = cv2.resize(image, dsize=(100,100))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        images.append(image)

    return np.array(images)

The load_images() function is fairly simple. It accepts the path of the directory where the images are stored (path). Every image file in the folder is loaded using cv2.imread(), resized to 100×100, and converted to RGB (remember that OpenCV loads images in BGR by default).

If you take a closer look at the above code, you will see an if statement that evaluates to True whenever it encounters a file named _DS_Store, and we skip that file. To be honest I am not entirely sure what this file is (it looks like a renamed .DS_Store, the folder metadata file that macOS creates), but it appears in all folders that we are working with.

The _DS_Store file
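Skipping one known filename works for this dataset, but a stray file with a different name would still break cv2.resize(). A possible alternative, shown here only as a sketch, is to keep files by extension instead. The helper name list_image_files and the extension tuple are assumptions for illustration and are not part of the original notebook.

```python
import os
import tempfile

# Assumed set of image extensions; adjust to whatever the dataset contains.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')

def list_image_files(path):
    """Return the sorted names of files in `path` that look like images."""
    return sorted(
        name for name in os.listdir(path)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )

# Small demonstration on a temporary folder with two images and one stray file.
with tempfile.TemporaryDirectory() as folder:
    for name in ('cat.1.jpg', 'cat.2.JPG', '_DS_Store'):
        open(os.path.join(folder, name), 'w').close()
    found = list_image_files(folder)

print(found)
```

With such a helper, the loop in load_images() could iterate over list_image_files(path) and drop the if statement.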

Now that the load_images() function has been created, we can use it to actually load the images. In the code below I store the images in cats_train, dogs_train, cats_test, and dogs_test; I think the names of these arrays are self-explanatory.

# Codeblock 4
cats_train = load_images('/kaggle/input/cat-and-dog/training_set/training_set/cats')
dogs_train = load_images('/kaggle/input/cat-and-dog/training_set/training_set/dogs')

cats_test = load_images('/kaggle/input/cat-and-dog/test_set/test_set/cats')
dogs_test = load_images('/kaggle/input/cat-and-dog/test_set/test_set/dogs')
The image loading progress bars.

It takes around 65 and 13 seconds to load all training and testing data, respectively. We can also see in the progress bars above that the training set contains 4000 cat images and 4005 dog images.

The values 4001 and 4006 in the progress bars are the numbers of iterations made when going through all files in each folder. We need to subtract 1 from each to get the actual number of images because of the _DS_Store file that we skipped.

Just to ensure that we have loaded those images properly, we are going to check the shape of those arrays by running the Codeblock 5 below. Here we can see that all images have successfully been loaded and reshaped to the dimension that we specified earlier.

# Codeblock 5
print(cats_train.shape, dogs_train.shape)
print(cats_test.shape, dogs_test.shape)
The shape of all arrays.

What we need to do next is to put the cat and dog images of each split into a single array, which can be achieved with the np.append() function. The code in Codeblock 6 below uses this function to concatenate the arrays along the 0-th axis. By doing so, we now have all our training and testing images stored in X_train and X_test, respectively.

# Codeblock 6
X_train = np.append(cats_train, dogs_train, axis=0)
X_test = np.append(cats_test, dogs_test, axis=0)

The shape of X_train and X_test.
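When an axis is given, np.append() simply concatenates its two arguments, so the same stacking can also be written with np.concatenate(). The sketch below checks this on small dummy arrays; the names cats and dogs and their sizes are made up for illustration.

```python
import numpy as np

# Dummy stand-ins for the image arrays: 4 "cat" and 5 "dog" images of 100x100x3.
cats = np.zeros((4, 100, 100, 3), dtype=np.uint8)
dogs = np.ones((5, 100, 100, 3), dtype=np.uint8)

# Stack along the first (sample) axis.
stacked = np.concatenate([cats, dogs], axis=0)

# np.append with axis=0 gives the same result.
same_as_append = np.array_equal(stacked, np.append(cats, dogs, axis=0))

print(stacked.shape, same_as_append)
```

np.concatenate() also accepts more than two arrays at once, which can be handy if more classes are added later.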

Creating Labels

That was all for the images. The subsequent step to be done is to create a label for each of those images. The idea of this part is to label cats with 0 and dogs with 1.

The way to do it is pretty simple. Remember that the first 4000 images in X_train are cats and the remaining 4005 are dogs. By knowing this structure, we can create an array of zeros and ones with these lengths which are then concatenated the same way as what we have done with X_train. We will do the exact same thing to create labels for test data.

# Codeblock 7
y_train = np.array([0] * len(cats_train) + [1] * len(dogs_train))
y_test = np.array([0] * len(cats_test) + [1] * len(dogs_test))

The shape of y_train and y_test.
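Since the labels rely entirely on the order of the images, it can be worth checking that the counts and the boundary between the two classes are where we expect them. The following is a small sketch using the training set sizes mentioned above; the variable names are illustrative only.

```python
import numpy as np

n_cats, n_dogs = 4000, 4005  # training set sizes reported earlier

# 0 for every cat followed by 1 for every dog.
labels = np.concatenate([
    np.zeros(n_cats, dtype=np.int64),
    np.ones(n_dogs, dtype=np.int64),
])

# Count how many samples fall into each class.
counts = np.bincount(labels)

print(counts)                                # samples per class
print(labels[n_cats - 1], labels[n_cats])    # last cat label, first dog label
```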

Up until this point we have successfully created labels for all images in our dataset.

Displaying Several Images

As a side task, I will also create a function for displaying several images in our dataset. The function, which I name show_images(), accepts 3 parameters: images, labels, and start_index. The first two are simply the arrays of images and labels, while start_index denotes the index of the first image that we want to show. The Codeblock 8 below shows what the function looks like.

# Codeblock 8
def show_images(images, labels, start_index):
    fig, axes = plt.subplots(nrows=4, ncols=8, figsize=(20,12))

    counter = start_index

    for i in range(4):
        for j in range(8):
            # Use the label of each image as its title
            axes[i,j].set_title(labels[counter].item())
            axes[i,j].imshow(images[counter], cmap='gray')
            counter += 1

    plt.show()

Once the function has been defined, we can try it out.

# Codeblock 9
show_images(X_train, y_train, 0)
The first 32 images from class cat.

In the Codeblock 10 below I set the start_index to 4000 since I want to see the first 32 dog images.

# Codeblock 10
show_images(X_train, y_train, 4000)
The first 32 images from class dog.

3. Preprocessing and Data Augmentation

Preprocessing is going to be done on both images and labels. Let’s talk about the latter first. The labels that we have created earlier are not in a shape that is accepted by Pytorch when it comes to training a model. Below is what the first 10 labels in the current y_train look like.

# Codeblock 11
print(y_train[:10])
The first 10 labels in y_train.

We now need to reshape this into a 2-dimensional array using the following code. We also convert it to a Pytorch tensor afterwards.

# Codeblock 12
y_train = torch.from_numpy(y_train.reshape(len(y_train),1))
y_test = torch.from_numpy(y_test.reshape(len(y_test),1))

The y_train array after being processed.
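The reason for the extra dimension is that the model defined later outputs one value per image, i.e. a tensor of shape (batch_size, 1), and the loss function expects the targets to have that same shape. As a small NumPy-only sketch of the reshape itself (the array y here is a made-up example):

```python
import numpy as np

y = np.array([0, 0, 1, 1, 0])

# -1 lets NumPy infer the number of rows, so this matches reshape(len(y), 1).
y_2d = y.reshape(-1, 1)

print(y.shape, y_2d.shape)
```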

Image Preprocessing and Augmentation

The convenient part when working with Pytorch is that we can chain image preprocessing and augmentation steps together using transforms.Compose(). What we are doing in Codeblock 13 below is converting the images (which are originally in the form of a Numpy array) to Pytorch tensors. ToTensor() also scales the pixel values to the range 0 to 1, and transforms.Normalize() then maps them to the range -1 to 1. It is important to know that I repeat the value for mean and standard deviation three times because the original image has RGB color channels. So basically, each element in the list corresponds to a single channel.

# Codeblock 13
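transforms_train = transforms.Compose([transforms.ToTensor(), # convert to tensor
                                       transforms.RandomRotation(degrees=20), # degrees=20 is an assumption; the value is missing from this copy
                                       transforms.RandomHorizontalFlip(), transforms.RandomVerticalFlip(p=0.005), # horizontal flip left at its default p=0.5
                                       transforms.RandomGrayscale(), # left at its default p=0.1
                                       transforms.Normalize(mean=[0.5,0.5,0.5], std=[0.5,0.5,0.5])]) # scale to the range -1 to 1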

As for the augmentation, we perform random rotation, random horizontal flip, random vertical flip and random grayscale. The p parameter denotes the probability of a transformation being applied to an image. We can see in the above code that the probability of an image being flipped vertically is very small (0.005). This is because we assume that most of the cats and dogs in our testing set are not going to appear upside down, but if any do, our model is still expected to predict them correctly.
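To make the normalization step concrete: ToTensor() first scales 8-bit pixel values to the range 0 to 1, and Normalize() then computes (value - mean) / std for each channel. With mean and std both set to 0.5, this maps 0 to -1 and 1 to 1. The sketch below reproduces that arithmetic in plain Python; the function name normalize and the sample pixel values are illustrative.

```python
def normalize(value, mean=0.5, std=0.5):
    """Apply the same per-value formula as a normalization with given mean/std."""
    return (value - mean) / std

pixels = [0, 128, 255]                       # raw 8-bit intensities
scaled = [p / 255 for p in pixels]           # scaling to [0, 1]
normalized = [normalize(s) for s in scaled]  # mapping to [-1, 1]

print(normalized)
```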

Image data augmentation should only be performed on training data. Images in the testing set should be left unchanged apart from the preprocessing, because we assume that these images are the ones we will see naturally in a real-world application. The Codeblock 14 below shows the transformations to be used on testing data.

# Codeblock 14
transforms_test = transforms.Compose([transforms.ToTensor(),
transforms.Normalize([0.5,0.5,0.5], [0.5,0.5,0.5])])

4. Custom Dataset Class & Data Loader

The next step is to create a class that stores the pairs of images and labels. We could use the TensorDataset class that is already available in Pytorch. However, we will create a custom one instead since TensorDataset does not accept a transforms.Compose object, which would leave us unable to do preprocessing and augmentation. The Codeblock 15 below shows what the custom Cat_Dog_Dataset class looks like.

# Codeblock 15
class Cat_Dog_Dataset():
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        label = self.labels[index]

        if self.transform:
            image = self.transform(image)

        return (image, label)

When it comes to creating a custom dataset class, there are three methods required to be defined: __init__(), __len__(), and __getitem__(). The first one is employed to initialize all attributes. Secondly, __len__() allows the Cat_Dog_Dataset to be passed into len() function in order to find out the number of samples within the dataset. Lastly, __getitem__() makes the object of this class possible to be indexed. All these functionalities are actually required by DataLoader (which I will explain later) in order for it to work properly.
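The role of these three methods can be seen without Pytorch at all. The sketch below uses a hypothetical PairDataset class that follows the same pattern, with plain numbers standing in for images and a lambda standing in for the transform.

```python
class PairDataset:
    """Minimal stand-in for a map-style dataset (hypothetical name)."""

    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        # Enables len(dataset)
        return len(self.images)

    def __getitem__(self, index):
        # Enables dataset[index]; the transform is applied lazily on access
        image = self.images[index]
        if self.transform:
            image = self.transform(image)
        return image, self.labels[index]

dataset = PairDataset(images=[1, 2, 3], labels=[0, 1, 0],
                      transform=lambda x: x * 10)

size = len(dataset)
first_sample = dataset[0]
last_sample = dataset[size - 1]
print(size, first_sample, last_sample)
```

Because the transform runs inside __getitem__(), a random augmentation produces a different result each time the same index is accessed, which is what makes augmentation vary from epoch to epoch.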

Now that the Cat_Dog_Dataset class has been defined, we can wrap our X_train and y_train with it. Don’t forget to pass transforms_train, as it lets each image get transformed whenever it is indexed. We will do the same for the test data, using transforms_test.

# Codeblock 16
train_dataset = Cat_Dog_Dataset(images=X_train, labels=y_train, transform=transforms_train)
test_dataset = Cat_Dog_Dataset(images=X_test, labels=y_test, transform=transforms_test)

Data Loader

Once train_dataset and test_dataset have been initialized, we need to create a DataLoader for each of them in order to determine how the data is going to be loaded. In our case, I decided to train the model with a batch size of 32. I also set the drop_last parameter to True so that the last batch is not fed into the model whenever it contains fewer than 32 images.

# Codeblock 17
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, drop_last=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=True, drop_last=True)
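To see what drop_last=True means for the training set, we can work out the number of batches by hand. This is plain arithmetic based on the 8005 training images counted earlier.

```python
num_samples = 8005   # 4000 cats + 4005 dogs in the training set
batch_size = 32

# Number of complete batches and the size of the incomplete one, if any.
full_batches, leftover = divmod(num_samples, batch_size)

# Without drop_last, the incomplete batch would be yielded as well.
batches_if_kept = full_batches + (1 if leftover else 0)

print(full_batches, leftover, batches_if_kept)
```

So each training epoch consists of 250 batches, and 5 images are left out per epoch. Since the loader is shuffled, these are different images each time.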

Visualizing Some Augmented Images

Below is the code to display the augmented images in case you’re wondering what our images look like after getting transformed randomly. One important thing to keep in mind is that after the data has been processed with the previous transformation functions, the shape of the image array changes from (no_of_images, height, width, no_of_channels) to (no_of_images, no_of_channels, height, width).

# Codeblock 18
iterator = iter(train_loader)
image_batch, label_batch = next(iterator)

print(image_batch.shape)

The shape of the image array after being processed with our transformation functions.

On the other hand, when it comes to displaying an image, we need to convert the shape of the array back to the initial dimension. This can be achieved by using permute() method. Once it is done, we can now feed the image array (image_batch_permuted) to the show_images() function.

# Codeblock 19
image_batch_permuted = image_batch.permute(0, 2, 3, 1)


show_images(image_batch_permuted, label_batch, 0)
The image augmentation results.

And here is what the transformed images look like. In fact, we cannot determine the appearance of the horizontally flipped images as we are not familiar with the original, unflipped versions. However, we can see here that some of the images get slightly rotated and some of those also get converted to grayscale (still with 3 channels though). Additionally, the images appear darker than the original because of the pixel scaling in the -1 to 1 value range.
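If the darker appearance is a concern, one option is to undo the normalization before displaying, i.e. compute value * std + mean. This is not done in the notebook; the sketch below only illustrates the idea with NumPy on a random dummy batch, where np.transpose plays the role of permute().

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy batch in (N, C, H, W) layout with values in [-1, 1), like a normalized batch.
batch = rng.uniform(-1, 1, size=(32, 3, 100, 100)).astype(np.float32)

# Move the channel axis to the end: (N, C, H, W) -> (N, H, W, C).
batch_nhwc = np.transpose(batch, (0, 2, 3, 1))

# Undo the normalization with mean=0.5 and std=0.5 to get back to [0, 1].
displayable = batch_nhwc * 0.5 + 0.5

print(batch_nhwc.shape, displayable.min() >= 0, displayable.max() <= 1)
```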

5. Create CNN Model

At this point we have had our dataset prepared. Not only that, but we have also ensured that the augmentation works as expected. Hence, the next step to do is creating the CNN model. The details of the model can be seen in Codeblock 20 below.

# Codeblock 20
class CNN(nn.Module):
    def __init__(self):
        # Required so that nn.Module can register the layers below
        super().__init__()

        self.conv0 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=(3,3), stride=(1,1), padding=(1,1), bias=False)
        self.bn0 = nn.BatchNorm2d(num_features=16)
        self.maxpool = nn.MaxPool2d(kernel_size=(2,2), stride=(2,2))

        self.conv1 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3,3), stride=(1,1), padding=(1,1), bias=False)
        self.bn1 = nn.BatchNorm2d(num_features=32)
        # self.maxpool

        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3,3), stride=(1,1), padding=(1,1), bias=False)
        self.bn2 = nn.BatchNorm2d(num_features=64)
        # self.maxpool

        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3,3), stride=(1,1), padding=(1,1), bias=False)
        self.bn3 = nn.BatchNorm2d(num_features=128)
        # self.maxpool

        self.dropout = nn.Dropout(p=0.5)
        self.fc0 = nn.Linear(in_features=128*6*6, out_features=64)
        self.fc1 = nn.Linear(in_features=64, out_features=32)
        self.fc2 = nn.Linear(in_features=32, out_features=1)

    def forward(self, x):
        x = F.relu(self.bn0(self.conv0(x)))
        x = self.maxpool(x)

        x = F.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)

        x = F.relu(self.bn2(self.conv2(x)))
        x = self.maxpool(x)

        x = F.relu(self.bn3(self.conv3(x)))
        x = self.maxpool(x)

        # Flatten to (batch_size, 128*6*6)
        x = x.reshape(x.shape[0], -1)

        x = self.dropout(x)
        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = F.sigmoid(self.fc2(x))

        return x

Let me explain the architecture.

The Convolutional Neural Network model that we are going to create is fairly simple. There are 4 convolution layers, all of which use a 3×3 kernel, a stride of 1 and a padding of 1. This configuration preserves the spatial dimension of the input tensor. What makes the four Conv2d layers different is the number of kernels. The first one, which is directly connected to the input layer of the CNN, consists of 16 kernels. This number doubles in each subsequent convolution layer, so the rest consist of 32, 64, and 128 kernels, respectively.

The increasing number of kernels increases the number of output channels as well. This leads to a large output dimension, since the spatial dimension stays the same. We don’t want this to happen because our model may run into the curse of dimensionality. To address this, we apply a max pooling layer after each convolution layer. With a pooling size and stride of 2, the spatial dimension is halved. Since we have 4 pooling layers, the input size of 100×100 is reduced to 50×50, 25×25, 12×12, and 6×6, respectively.

Note that the reduction from 25×25 to 12×12 happens because 25 is odd: the pooling layer rounds the output size down, so the last row and column of pixels are dropped.
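These sizes follow from the usual output-size formula for convolution and pooling layers, floor((size + 2*padding - kernel) / stride) + 1. The sketch below traces the spatial size through the four blocks in plain Python; the helper names are illustrative.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, kernel=2, stride=2):
    """Spatial output size of a pooling layer (no padding, rounded down)."""
    return (size - kernel) // stride + 1

size = 100
sizes = []
for _ in range(4):                     # four Conv -> MaxPool stages
    size = conv_output_size(size)      # 3x3, stride 1, padding 1 keeps the size
    size = pool_output_size(size)      # 2x2, stride 2 halves it (rounding down)
    sizes.append(size)

# 128 channels come out of the last convolution layer.
flattened = 128 * size * size

print(sizes, flattened)
```

The final value is where the in_features=128*6*6 of the first fully-connected layer comes from.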

Batch normalization layers are also implemented in this network. Each of them is placed between a convolution layer and ReLU (Rectified Linear Unit). Some of the research papers that use this Conv-BN-ReLU structure are [1], [2] and [3]. Furthermore, you might notice in the Conv2d layers that I set the bias parameter to False. The reason is that a batch normalization layer placed right after a convolution layer subtracts the per-channel mean, which makes the bias in the convolution redundant [4].

An example of Conv-BN-ReLU structure (the overall architecture is different to ours) [4].
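The redundancy of the bias can be checked numerically: adding a constant to every activation shifts the mean by the same constant, so it disappears once the mean is subtracted. Below is a simplified NumPy sketch of the normalization step only (without the learnable scale and shift); the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
activations = rng.normal(size=(8, 4))   # 8 samples, 4 features
bias = 3.7                              # an arbitrary constant bias

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch dimension."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# The bias is cancelled by the mean subtraction.
same = np.allclose(batch_norm(activations), batch_norm(activations + bias))

print(same)
```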

After reaching the last max pooling layer, the next step is to flatten the tensor. The resulting tensor is then passed through a dropout layer with a drop rate of 50%. This dropout layer is connected to two consecutive hidden layers with 64 and 32 neurons, respectively. Lastly, we connect an output layer with a single neuron. It might be worth noting that the two hidden layers use the ReLU activation function, while the output layer uses sigmoid.

After the CNN class has been created, we can initialize the model using the following code. Don’t forget to write to(device) in order to let the GPU do our work.

# Codeblock 21
model = CNN().to(device)

If you want, you are actually able to print out the details of the model using the summary() function which is taken from torchinfo module.

# Codeblock 22
summary(model, input_size=(4,3,100,100))
The details of the CNN model.
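As a cross-check on the summary, the number of trainable parameters can also be worked out by hand from the layer definitions in Codeblock 20. The sketch below does this in plain Python; the helper names are illustrative, and batch normalization is counted as one scale and one shift per channel.

```python
def conv_params(in_ch, out_ch, kernel=3, bias=False):
    """Weights of a square-kernel convolution, plus optional biases."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)

def bn_params(channels):
    """One learnable scale and one learnable shift per channel."""
    return 2 * channels

def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output."""
    return in_features * out_features + out_features

channels = [3, 16, 32, 64, 128]
total = 0

# Four Conv (bias=False) + BatchNorm stages
for in_ch, out_ch in zip(channels[:-1], channels[1:]):
    total += conv_params(in_ch, out_ch) + bn_params(out_ch)

# Three fully-connected layers
for in_f, out_f in [(128 * 6 * 6, 64), (64, 32), (32, 1)]:
    total += linear_params(in_f, out_f)

print(total)
```

Most of the parameters sit in the first fully-connected layer (4608 × 64 weights), which is typical for small CNNs that flatten before the classifier.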

6. Model Training

Before training the model, we need to specify a loss function and an optimizer to be used. In this case I will use binary cross entropy (BCELoss) and Adam for the two, respectively.

# Codeblock 23
loss_function = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
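For intuition, binary cross entropy averages -(y·log(p) + (1-y)·log(1-p)) over the samples, where p is the predicted probability and y is the 0/1 label. The plain-Python sketch below evaluates this formula on two made-up pairs of predictions; it is only meant to show how the loss behaves, not to replace nn.BCELoss.

```python
import math

def binary_cross_entropy(preds, targets):
    """Mean binary cross entropy for probabilities in (0, 1) and 0/1 targets."""
    losses = [
        -(t * math.log(p) + (1 - t) * math.log(1 - p))
        for p, t in zip(preds, targets)
    ]
    return sum(losses) / len(losses)

targets = [1, 0]  # first sample is a dog, second is a cat

mostly_right = binary_cross_entropy([0.9, 0.2], targets)
mostly_wrong = binary_cross_entropy([0.1, 0.8], targets)

print(round(mostly_right, 4), round(mostly_wrong, 4))
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is why the loss can fluctuate even when accuracy is stable.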

A Function to Predict Test Data

I want to be able to read all evaluation metrics on the test data for every single epoch. In order to make things simple, I wrap the process up in a function which I name predict_test_data(), and we will call this function at the end of each epoch.

# Codeblock 24
def predict_test_data(model, test_loader):

    num_correct = 0
    num_samples = 0

    # Switch to evaluation mode (disables dropout)
    model.eval()

    with torch.no_grad():
        for batch, (X_test, y_test) in enumerate(test_loader):
            X_test = X_test.float().to(device)
            y_test = y_test.float().to(device)

            # Calculate loss (forward propagation)
            test_preds = model(X_test)
            test_loss = loss_function(test_preds, y_test)

            # Calculate accuracy
            rounded_test_preds = torch.round(test_preds)
            num_correct += torch.sum(rounded_test_preds == y_test)
            num_samples += len(y_test)

    # Switch back to training mode
    model.train()

    test_acc = num_correct/num_samples

    return test_loss, test_acc

This function accepts two input parameters: a model and a data loader containing samples from the test set. The only process done inside the function is predicting the test data (forward propagation). The accuracy score is calculated afterwards. Note that the returned test_loss is the loss of the last batch only.

One important detail is that we need to put our model into evaluation mode using model.eval() prior to the forward propagation. One of the reasons for doing this is to reconnect the randomly disconnected neurons in the dropout layer so that the network can perform at its best. The model is then set back to training mode using model.train() after the entire prediction process is done.
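The difference between the two modes can be illustrated with a simplified NumPy version of dropout. During training, each value is zeroed with probability p and the survivors are scaled by 1/(1-p); in evaluation mode the input passes through unchanged. The function below is a sketch of that behaviour, not Pytorch's implementation.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Simplified inverted dropout: random zeroing when training, identity otherwise."""
    if not training:
        return x
    if rng is None:
        rng = np.random.default_rng()
    keep_mask = rng.random(x.shape) >= p   # True where the value is kept
    return x * keep_mask / (1 - p)         # rescale so the expected value is unchanged

x = np.ones(10)

train_out = dropout(x, training=True, rng=np.random.default_rng(0))
eval_out = dropout(x, training=False)

print(train_out)
print(eval_out)
```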

The Training Loop

As the predict_test_data() function has been created, we are now going to work on the training loop. This process is actually pretty standard. There will be two for loops, one which iterates for each epoch and the other one iterates for each batch. And we will do the following operations in every single batch: forward propagation, backward propagation, and gradient descent. There will also be several other operations regarding loss value and accuracy score calculations in between.

In this case, I decided to run the training for 100 epochs which takes me around 20 minutes to complete. The Codeblock 25 below displays how I create the training loops.

# Codeblock 25

train_losses = []    # Training and testing loss was calculated based on the last batch of each epoch.
test_losses = []
train_accs = []
test_accs = []

for epoch in range(100):

    num_correct_train = 0
    num_samples_train = 0
    for batch, (X_train, y_train) in tqdm(enumerate(train_loader), total=len(train_loader)):
        X_train = X_train.float().to(device)
        y_train = y_train.float().to(device)

        # Forward propagation
        train_preds = model(X_train)
        train_loss = loss_function(train_preds, y_train)

        # Calculate train accuracy
        with torch.no_grad():
            rounded_train_preds = torch.round(train_preds)
            num_correct_train += torch.sum(rounded_train_preds == y_train)
            num_samples_train += len(y_train)

        # Backward propagation
        optimizer.zero_grad()
        train_loss.backward()

        # Gradient descent
        optimizer.step()

    train_acc = num_correct_train/num_samples_train
    test_loss, test_acc = predict_test_data(model, test_loader)

    # Record the metrics of this epoch for plotting later
    train_losses.append(train_loss.item())
    test_losses.append(test_loss.item())
    train_accs.append(train_acc.item())
    test_accs.append(test_acc.item())

    print(f'Epoch: {epoch} \t|' \
          f' Train loss: {np.round(train_loss.item(),3)} \t|' \
          f' Test loss: {np.round(test_loss.item(),3)} \t|' \
          f' Train acc: {np.round(train_acc.item(),2)} \t|' \
          f' Test acc: {np.round(test_acc.item(),2)}')

When the above code is running, our notebook will output the metrics data which look as follows. Here I decided to only display the first several epochs since displaying the entire training progress is just a waste of space.

What the training progress looks like.
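Because the losses recorded here come from the last batch of each epoch only, the reported values can be noisy. A possible alternative, which is not what this notebook does, is to average the per-batch losses over the whole epoch, weighted by batch size. The sketch below shows that calculation on made-up numbers; the function name epoch_average is illustrative.

```python
def epoch_average(batch_losses, batch_sizes):
    """Average of per-batch mean losses, weighted by the number of samples per batch."""
    total = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

# Made-up per-batch losses for one short epoch of three batches of 32 samples.
batch_losses = [0.70, 0.50, 0.30]
batch_sizes = [32, 32, 32]

last_batch_loss = batch_losses[-1]
average_loss = epoch_average(batch_losses, batch_sizes)

print(last_batch_loss, round(average_loss, 3))
```

In a training loop, this would amount to collecting train_loss.item() and len(y_train) for every batch and calling the helper at the end of the epoch.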

7. Evaluation

The performance of this model is evaluated by plotting the loss value and accuracy score history across the epochs. The Codeblock 26 and 27 below show how I do it.

# Codeblock 26
plt.plot(train_losses)
plt.plot(test_losses)
plt.legend(['train_losses', 'test_losses'])
Loss value history.

The above figure shows that the training progress looks good. This is essentially because the loss value (both towards training and testing data) has a tendency to get slightly lower as the epoch goes despite the fluctuation that we encounter at around epoch 70.

In the figure below, we can observe that the accuracy towards both training and testing data is improving epoch-by-epoch. It is pretty cool to see that in this case we achieved the best testing accuracy of 94%.

# Codeblock 27
plt.plot(train_accs)
plt.plot(test_accs)
plt.legend(['train_accs', 'test_accs'])
Accuracy score history.

Predicting Images on Test Set

The process of predicting images on test set is pretty much the same as what we did in Codeblock 18 and 19. The difference here is that we pass the prediction results as the title of each image rather than the ground truths.

# Codeblock 28

# Load test images
iter_test = iter(test_loader)
img_test, lbl_test = next(iter_test)

# Predict labels
preds_test = model(img_test.to(device))
img_test_permuted = img_test.permute(0, 2, 3, 1)
rounded_preds = preds_test.round()

# Show test images and the predicted labels
show_images(img_test_permuted, rounded_preds, 0)
Predictions on test data.

And well, according to the above figure it seems like all predictions are correct. Remember that cats are labeled as 0 while dogs are 1.

That’s all for this article. Thanks for reading!

The fully working code of this notebook can be accessed through this link.


[1] Tianyu Ci, Zhen Liu, Ying Wang. Assessment of the Degree of Building Damage Caused by Using Convolutional Neural Networks in Combination with Ordinal Regression. Remote Sensing. [Accessed January 28, 2023].

[2] Garikoitz Lerma Usabiaga et al., Retrospective Head Motion Estimation in Structural Brain MRI with 3D CNNs. Research Gate. [Accessed January 28, 2023].

[3] Hao Sun, Xiangtao Zheng and Xiaoqiang Lu. A Supervised Segmentation Network for Hyperspectral Image Classification. IEEE Transactions on Image Processing. [Accessed January 28, 2023].

[4] Remove bias from the convolution if the convolution is followed by a normalization layer. Stack Overflow. [Accessed January 29, 2023].
