Enhancing Neural Networks with Mixup in PyTorch

Randomly mixing up images, and it works better?

Image classification is one of the domains that has thrived with the rapid improvement of deep learning. Traditional image recognition relies heavily on processing methods such as dilation/erosion, kernels, and transforms to the frequency domain, yet the difficulty of feature extraction has ultimately limited the progress made with these methods. Neural networks, on the other hand, learn the relationship between input images and output labels and ‘tune’ an architecture for that purpose. While the increase in accuracy has been significant, networks often require vast quantities of training data, so much research now focuses on data augmentation: the process of increasing the quantity of data from a pre-existing dataset.

This article introduces a simple yet surprisingly effective augmentation strategy, mixup, with a PyTorch implementation and a comparison of results.

Before Mixup — Why Data Augmentation?

The parameters of a neural network are trained and updated based on a given set of training data. However, since the training data covers only part of the full distribution of possible data, the network is likely to overfit to the ‘seen’ part of the distribution. Hence, the more training data we have, the better it should, in theory, cover the entire distribution.

While the amount of data we have is limited, we can always alter the images slightly and feed them into the network as ‘new’ training samples. This process is called data augmentation.

What is Mixup?

Figure 1. Simple Visualization of image mixup.

Suppose we are classifying images of dogs and cats, and we are given a set of labeled images for each (i.e., [1, 0] -> dogs, [0, 1] -> cats). Mixup simply averages two images, and their labels correspondingly, to create a new data sample.

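Specifically, we can write the concept of mixup mathematically:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$$
$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$$

where x̃, ỹ are the mixed image and label built from xᵢ (label yᵢ) and xⱼ (label yⱼ), and λ ∈ [0, 1] is a random number drawn from a given beta distribution.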

This provides continuous samples of data in between the different classes, which intuitively expands the distribution of a given training set and thus makes the network more robust during the testing phase.
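As a small worked example of the mixing rule, here is a minimal NumPy sketch. The function name `mixup_pair` and the toy 2x2 "images" are illustrative assumptions, not part of the original code:

```python
import numpy as np


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Blend two samples and their one-hot labels with weight lam in [0, 1]."""
    x_mixed = lam * x_i + (1.0 - lam) * x_j
    y_mixed = lam * y_i + (1.0 - lam) * y_j
    return x_mixed, y_mixed


# Two toy 2x2 "images": an all-ones dog and an all-zeros cat.
dog_image, dog_label = np.ones((2, 2)), np.array([1.0, 0.0])
cat_image, cat_label = np.zeros((2, 2)), np.array([0.0, 1.0])

# lam = 0.5 is the plain average of the two samples.
mixed_image, mixed_label = mixup_pair(dog_image, dog_label, cat_image, cat_label, lam=0.5)
print(mixed_label)  # [0.5 0.5]
```

With any other value of `lam`, the mixed label leans toward one class in the same proportion as the mixed image does.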

Using mixup on any network

Since mixup is merely a data augmentation method, it is orthogonal to the choice of network architecture: you can apply it to the dataset of any classification problem, with whatever network you wish.

In the original paper, mixup: Beyond Empirical Risk Minimization, Zhang et al. experimented with multiple datasets and architectures, showing empirically that the benefit of mixup is not a one-off special case.

Computing Environment


The entire program is built with the PyTorch library (including torchvision). Mixup requires sampling from a beta distribution, which NumPy provides; we also use the random library to pick random images to mix. The following code imports all the libraries:


For demonstration, we apply mixup to a traditional image classification task, for which CIFAR-10 is a natural choice. CIFAR-10 contains 60,000 color images in 10 classes (6,000 per class), divided into training and test sets in a 5:1 ratio. The images are fairly simple to classify, yet harder than those of the most basic digit recognition dataset, MNIST.

There are numerous ways to download the CIFAR-10 dataset, including from the University of Toronto website or via torchvision datasets. One platform worth mentioning is the Graviti Open Datasets platform, which hosts hundreds of datasets along with their authors and labels for each dataset’s designated training tasks (e.g., classification, object detection). You may download other classification datasets such as CompCars or SVHN to test the improvement mixup brings in different scenarios. The company is currently developing its SDK, which, although it currently takes extra time to load data directly, could be very useful in the near future as batch downloading is rapidly improving.

Hardware Requirements

It is preferable to train the neural network on a GPU, as this speeds up training significantly. However, if only a CPU is available, you can still test the program. To let your program detect the hardware itself, simply use the following:
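A minimal sketch of this check, using the standard `torch.cuda.is_available()` call. The variable names `device` and `example_batch` are illustrative:

```python
import torch

# Use a CUDA GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# Models and tensors are then moved to the chosen device with .to(device).
example_batch = torch.zeros(4, 3, 32, 32).to(device)
```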



The goal is to see the results of mixup and not the network itself. Hence, for demonstration purposes, a simple convolutional neural network (CNN) with 4 convolutional layers followed by 2 fully connected layers is implemented. Note that the same network is used for both the mixup and non-mixup training procedures to ensure a fair comparison.

We can build the simple network like the following:
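The original network listing is not reproduced here, so the following is only a sketch of a network matching the description (4 convolutional layers, then 2 fully connected layers). The class name `SimpleCNN`, the channel widths, kernel sizes, pooling placement, and hidden size are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleCNN(nn.Module):
    """Four 3x3 conv layers followed by two fully connected layers (sizes are assumptions)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        # Two 2x2 poolings shrink a 32x32 input to 8x8, so the flattened size is 128 * 8 * 8.
        self.fc1 = nn.Linear(128 * 8 * 8, 256)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))  # 32x32 -> 16x16
        x = F.relu(self.conv3(x))
        x = self.pool(F.relu(self.conv4(x)))  # 16x16 -> 8x8
        x = torch.flatten(x, start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)  # raw logits, one per class


model = SimpleCNN()
logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 10])
```

The network returns raw logits, leaving the choice of softmax or sigmoid to the loss function.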


The mixup stage is done during the dataset loading process. Therefore, we must write our own dataset class instead of using the default ones provided by torchvision.datasets.

The following is a simple implementation of mixup by incorporating the beta distribution function from NumPy:
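The original dataset listing is not reproduced here. As a sketch under stated assumptions, the class below wraps in-memory tensors rather than files on disk. The class name `MixupDataset`, its constructor arguments, and the use of a per-sample probability `mix_prob` to decide when to mix are hypothetical choices:

```python
import random

import numpy as np
import torch
from torch.utils.data import Dataset


class MixupDataset(Dataset):
    """Wraps image/label tensors and mixes a fraction of the samples on the fly."""

    def __init__(self, images, labels, num_classes=10, alpha=0.2, mix_prob=0.2):
        self.images = images            # float tensor, shape (N, C, H, W)
        self.labels = labels            # integer class indices, shape (N,)
        self.num_classes = num_classes
        self.alpha = alpha              # parameter of the Beta(alpha, alpha) distribution
        self.mix_prob = mix_prob        # chance that a requested sample is mixed

    def __len__(self):
        return len(self.images)

    def _one_hot(self, index):
        target = torch.zeros(self.num_classes)
        target[int(self.labels[index])] = 1.0
        return target

    def __getitem__(self, index):
        image, target = self.images[index], self._one_hot(index)
        if random.random() < self.mix_prob:
            # Pick a random partner and blend both the images and the labels.
            partner = random.randrange(len(self))
            lam = float(np.random.beta(self.alpha, self.alpha))
            image = lam * image + (1.0 - lam) * self.images[partner]
            target = lam * target + (1.0 - lam) * self._one_hot(partner)
        return image, target
```

Setting `mix_prob=0.0` turns the same class into the non-mixup baseline, which keeps the two training runs otherwise identical.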

Note that we did not apply mixup to all images, but to roughly one in five. We also used a beta distribution with parameter 0.2. You may change the distribution parameter as well as the fraction of images that are mixed for different experiments. Perhaps you may achieve even better results!

Training and Evaluation

The following code shows the training procedure. We set the batch size to 128, the learning rate to 1e-3, and the total number of epochs to 30. The entire training is performed twice: with and without mixup. The loss is also defined by hand so that it works directly with the fractional labels that mixup produces:
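The original training listing is not reproduced here. The sketch below uses the stated hyperparameters as defaults but makes two assumptions of its own: the optimizer (Adam) and the loss, which is written as a softmax cross-entropy over soft labels and may differ from the hand-written loss in the original code. The function names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader


def soft_cross_entropy(logits, soft_targets):
    """Cross-entropy against a probability vector instead of a class index."""
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()


def train(model, dataset, device, epochs=30, batch_size=128, lr=1e-3):
    """Trains the model in place and returns the mean loss of each epoch."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer choice is an assumption
    model.to(device)
    epoch_losses = []
    for _ in range(epochs):
        model.train()
        running_loss, num_batches = 0.0, 0
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            optimizer.zero_grad()
            loss = soft_cross_entropy(model(images), targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            num_batches += 1
        epoch_losses.append(running_loss / num_batches)
    return epoch_losses
```

For one-hot targets this loss reduces to the ordinary cross-entropy, so the same function can serve both the mixup and the non-mixup run.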

To evaluate the effect of mixup, we compute the final accuracy over three trials, with and without mixup. Without mixup, the network reached approximately 74.5% accuracy on the test set; with mixup, accuracy rose to around 76.5%!
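Test accuracy of this kind could be measured with a helper like the one below. This is a sketch, not the original evaluation code, and the function name `evaluate` and its arguments are assumptions:

```python
import torch
from torch.utils.data import DataLoader


@torch.no_grad()
def evaluate(model, dataset, device, batch_size=128):
    """Returns top-1 accuracy; targets may be class indices or one-hot/soft vectors."""
    loader = DataLoader(dataset, batch_size=batch_size)
    model.to(device).eval()
    correct, total = 0, 0
    for images, targets in loader:
        images, targets = images.to(device), targets.to(device)
        predictions = model(images).argmax(dim=1)
        if targets.dim() > 1:  # convert one-hot labels to class indices
            targets = targets.argmax(dim=1)
        correct += (predictions == targets).sum().item()
        total += targets.size(0)
    return correct / total
```

Averaging the value returned for each trial gives a figure comparable to the accuracies reported above.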

Extending Beyond Image Classification

While mixup has pushed state-of-the-art accuracies in image classification, research has shown that its benefits extend to other computer vision tasks, such as image generation and robustness to adversarial examples. The research literature has also extended the concept to 3D representations, where it has likewise been shown to be very effective (e.g., PointMixup).


So there you have it! Hopefully this article gives you a basic overview of mixup and guidance on how to apply it to your image classification network training. The full implementation can be found in the following GitHub repository:

Thank you for making it this far 🙏! I will be posting more on different areas of computer vision/deep learning. Make sure to check out my other articles on VAE, one shot learning, and many more!

