Maximizing the Impact of Data Augmentation: Effective Techniques and Best Practices
Data augmentation is a popular technique in machine learning that involves creating new data from existing data by making various modifications to it. These modifications could include adding noise, flipping images horizontally, or changing colors, among others. Data augmentation is an effective way to increase the size of a dataset and improve the performance of machine learning models.
However, data augmentation is not a one-size-fits-all solution. To maximize its impact, it is important to use effective techniques and best practices. In this article, we will explore some of the best practices for data augmentation and provide practical examples of how to implement them. We will also discuss the importance of data iteration loops and how they can be used to further enhance the impact of data augmentation.
Table of Contents:
- What is Data Augmentation?
- Data Augmentation Best Practices
- Data Augmentation Practical Example
- Data Iteration Loop
1. What is Data Augmentation?
Data augmentation is an essential technique in machine learning and deep learning, and it is used to artificially expand the size of a dataset by generating new and diverse samples from the existing ones.
We need data augmentation for several reasons:
- Improved model performance: A model trained on a larger dataset with diverse examples is more likely to generalize better on new data and perform better in real-world scenarios.
- Overcome overfitting: Data augmentation helps to overcome overfitting by increasing the variety of training examples, which reduces the risk of the model memorizing the training data.
- Balance class distribution: Sometimes, datasets may have an imbalanced distribution of classes. By applying data augmentation techniques to the minority class, we can generate new examples that help to balance the distribution of classes in the dataset.
- Reduce bias: Data augmentation can also help to reduce bias in the model by generating more diverse examples that represent a wider range of input variations.
Here are some basic techniques of data augmentation that are commonly used in machine learning and deep learning:
- Flip or mirror images: This technique involves flipping or mirroring an image horizontally or vertically to create a new sample.
- Rotation: Rotating an image by a certain degree can create a new sample that represents a different angle of the object or scene.
- Scaling: Scaling an image up or down can create new samples of the same object or scene at different sizes.
- Translation: Shifting an image horizontally or vertically can create a new sample of the same object or scene in a different position.
- Cropping: Cropping an image can create a new sample that focuses on a different part of the object or scene.
- Adding noise: Adding random noise to an image can create a new sample with slightly different pixel values, which can help the model generalize better to noisy or blurry images.
- Color jittering: Changing the brightness, contrast, or saturation of an image can create new samples with different color variations.
These techniques can be combined in various ways to create a more diverse set of augmented data samples.
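As a rough sketch of how several of these techniques could be combined into one pipeline, here is a minimal example using the torchvision library; the specific probabilities and ranges are illustrative assumptions you would tune for your own dataset:

```python
from torchvision import transforms

# A sketch of an augmentation pipeline combining several basic techniques.
# The probabilities and ranges below are illustrative, not recommended values.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),         # flip / mirror
    transforms.RandomRotation(degrees=15),          # rotation
    transforms.RandomResizedCrop(size=224,
                                 scale=(0.8, 1.0)), # scaling + cropping
    transforms.RandomAffine(degrees=0,
                            translate=(0.1, 0.1)),  # translation
    transforms.ColorJitter(brightness=0.2,
                           contrast=0.2,
                           saturation=0.2),         # color jittering
    transforms.ToTensor(),
])

# Applying the pipeline to a PIL image yields a new, randomly augmented sample:
# augmented = augment(pil_image)
```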
2. Data Augmentation Best Practices
Data augmentation can be a very efficient way to get more data, especially for unstructured data problems such as images, audio, and sometimes text. But when carrying out data augmentation, there are many choices you have to make.
- What are the parameters?
- How do you design the data augmentation setup?
Let’s dive into this to look at some best practices.
Let’s look at a speech recognition example, given an audio clip like this: ChatGPT is shaping the world. Suppose you take background cafe noise and add it to the previous audio clip. Literally, just take the two waveforms and sum them up, then you can create a synthetic example that sounds like someone is saying: ChatGPT is shaping the world in a noisy cafe. This is one form of data augmentation that lets you efficiently create a lot of data that sounds like data collected in the cafe. You can also add it to background music so that it sounds like someone saying it with maybe the radio on in the background.
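As a rough sketch of this waveform-summing idea in Python: the code below assumes both clips are mono and share the same sample rate, and the file names and noise level are purely illustrative.

```python
import numpy as np
import soundfile as sf  # library assumed available for reading/writing audio

# Hypothetical file names; both clips are assumed to be mono, same sample rate.
speech, sr = sf.read("chatgpt_is_shaping_the_world.wav")
noise, _ = sf.read("cafe_background.wav")

# Tile or trim the noise so the two waveforms have the same length.
if len(noise) < len(speech):
    noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
noise = noise[: len(speech)]

# "How loud should the background be?" is one of the parameters you must pick;
# here the noise is simply scaled to 30% amplitude as an illustrative choice.
noise_level = 0.3
augmented = np.clip(speech + noise_level * noise, -1.0, 1.0)

# Save the synthetic "speech in a noisy cafe" example.
sf.write("chatgpt_in_noisy_cafe.wav", augmented, sr)
```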
Now, when carrying out data augmentation, there are a few decisions you need to make, such as what types of background noise to use, how loud the background noise should be relative to the speech, and so on.
The goal of data augmentation is to create examples that your learning algorithm can learn from. As a framework for doing that, think about how you can create realistic examples that the algorithm currently does poorly on, because if the algorithm already does well on those examples, there is less for it to learn from. At the same time, you want the examples to still be ones that a human, or perhaps some other baseline, can do well on. Otherwise, one way to generate examples the algorithm does poorly on would simply be to create examples so noisy that no one can hear what was said, and that would not be helpful. You want examples that are hard enough to challenge the algorithm but not so hard that no human or algorithm could ever do well on them. That is why, when you generate new examples using data augmentation, you should focus on examples that meet both of these criteria.
One way some people do data augmentation is to generate an augmented dataset, train the learning algorithm, and see whether the algorithm does better on the dev set. Then they fiddle with the data augmentation parameters, retrain the learning algorithm, and so on. This turns out to be quite inefficient, because every time you change your data augmentation parameters, you need to train your network, or your learning algorithm, all over again, and that can take a long time.
Instead, using the following three checks lets you sanity-check that new data generated with data augmentation is useful, without having to spend hours or sometimes days training a learning algorithm on that data just to verify that it improves performance. Here are the three questions to go through when you are generating new data.
- Does the new data sound realistic? You want your audio to actually sound like realistic audio of the sort you want your algorithm to perform well on.
- Is the X-to-Y mapping clear? In other words, can humans still recognize what was said? This verifies the second criterion above: a human can still do well on the example.
- Is the algorithm currently doing poorly on this new data? This verifies the first criterion: if the model already performs very well on the new data, it will not learn anything new from it.
If you can generate data that meets all of these criteria, then there is a much better chance that, when you add this data to your training set and retrain the algorithm, the model's performance will improve and move closer to human-level performance.
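One way to apply the third check cheaply, before spending hours or days on retraining, is to score candidate augmented examples with the current model and keep only the ones it still gets wrong. The sketch below is one assumed way to wire this up; `model.predict` is a hypothetical method, and `candidates` are examples that have already passed the first two checks:

```python
def select_hard_examples(model, candidates):
    """Keep augmented examples the current model still gets wrong.

    `candidates` is assumed to be a list of (x, y) pairs that already look
    realistic and that a human can still label correctly. `model.predict(x)`
    is a hypothetical method returning the model's predicted label for x.
    """
    hard = []
    for x, y in candidates:
        if model.predict(x) != y:   # the model currently does poorly here,
            hard.append((x, y))     # so there is something left to learn
    return hard

# These selected examples are the ones worth adding to the training set
# before retraining:
# training_set.extend(select_hard_examples(model, augmented_candidates))
```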
3. Data Augmentation Practical Example
Let’s have a look at a practical example, using images this time. Let’s say that you have a very small set of images of smartphones with scratches. Here’s how you may be able to use data augmentation. You can take the image and flip it horizontally. This results in a pretty realistic image. The phone buttons are now on the other side, but this could be a useful example to add to your training set.
Another option is to change the image contrast, as shown in Figure 1. When we increase the contrast, or decrease it slightly, the scratch is still visible; but when we darken the image too much, it becomes so dark that even I, as a person, can't really tell whether there is a scratch there or not.
Whereas the two examples on top would pass the checklist from earlier, since a human can still detect the scratch, the last example is too dark and would fail it. Therefore, it is better to choose a data augmentation scheme that generates many examples like the two on top and few, if any, like the one at the bottom. In summary, we want images that look realistic, that humans can do well on, and that, ideally, the algorithm currently does poorly on.
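As a small sketch of this image example with the Pillow library (the file name and the darkness threshold are assumptions for illustration, not validated values):

```python
from PIL import Image, ImageEnhance, ImageOps
import numpy as np

image = Image.open("scratched_phone.jpg")   # hypothetical scratch image

# Horizontal flip: usually still a realistic smartphone image.
flipped = ImageOps.mirror(image)

# Brightness variants: brighter, slightly darker, and very dark.
variants = [ImageEnhance.Brightness(image).enhance(f) for f in (1.3, 0.8, 0.2)]

# Crude proxy for "can a human still see the scratch?": reject variants whose
# mean grayscale value falls below an arbitrary threshold.
MIN_MEAN_BRIGHTNESS = 40   # on a 0-255 scale; an assumed cutoff, tune per task
kept = [v for v in variants
        if np.array(v.convert("L")).mean() > MIN_MEAN_BRIGHTNESS]
```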
4. Data Iteration Loop
You may have heard of the term model iteration, which refers to iteratively training a model, doing error analysis, and then deciding how to improve the model. Taking a data-centric approach to AI development, it is sometimes useful to instead use a data iteration loop, where you repeatedly take the data and the model, train your learning algorithm, do error analysis, and, as you go through this loop, focus on how to add data or improve the quality of the data.
For many practical applications, taking this data iteration loop approach, together with a robust hyperparameter search (that is important too), results in faster improvements to your learning algorithm's performance, depending on your problem.
When you are working on an unstructured data problem, data augmentation can be an efficient way to improve your learning algorithm's performance, provided you can create new data that seems realistic, that humans can do quite well on, but that the algorithm still struggles with. For example, if you find through error analysis that your learning algorithm does poorly on speech with cafe noise, using data augmentation to generate more data with cafe noise could be an efficient way to improve performance.
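Put together, a data iteration loop might look like the sketch below; `train`, `error_analysis`, and `augment_slice` are placeholder callables standing in for your own training code, error analysis, and targeted augmentation (such as mixing in cafe noise):

```python
def data_iteration_loop(train, error_analysis, augment_slice,
                        training_data, dev_data, rounds=3):
    """Sketch of a data-centric iteration loop.

    The three callables are hypothetical placeholders:
      train(data)                 -> trained model
      error_analysis(model, dev)  -> tags of data slices the model does poorly on
      augment_slice(data, tag)    -> new augmented examples for that slice
    """
    model = None
    for _ in range(rounds):
        model = train(training_data)                   # (re)train on current data
        weak_slices = error_analysis(model, dev_data)  # e.g. "speech with cafe noise"
        for tag in weak_slices:
            # Add realistic, human-recognizable examples the model struggles on.
            training_data = training_data + augment_slice(training_data, tag)
    return model
```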