Data Augmentation for Text [with code]


This article shows how to code data augmentation techniques for deep learning problems such as text classification and text generation, for use with PyTorch.

For data augmentation on image-related problems, you can find the implementations here.

Data Augmentation Techniques

1. Back Translation

2. Random Insertion

3. Random Deletion

4. Random Swap

Note: Data augmentation for text is a costly operation. If we run it inside the training loop, it will increase the training time significantly. Unlike with images, text augmentation first has to replace words and then look up the word embedding for each new word. So we will create the augmented data beforehand and use it in training.

I have done the data augmentation for the English language, but these techniques are language agnostic.

Back Translation

The idea is to generate different sentences that mean the same thing and use them for training.

Step 1: Select the English sentence.

The Best Way To Get Started Is To Quit Talking And Begin Doing.

Step 2: Select a random language (Korean) and convert the sentence to that language using language translation.

시작하는 가장 좋은 방법은 말하기를 그만두고 시작하는 것입니다.

Step 3: Now translate it back to English.

The best way to start is to stop talking and start.

Now you can see that the two sentences mean roughly the same thing but have a different structure.

To achieve this we need not train a language translation model; we can leverage Google Translate.

Let’s install the library required:

pip install google_trans_new

Let’s code:
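Here is a minimal sketch of the three steps above, not a definitive implementation. The names back_translate, google_translate_fn and the pivots list are my own for illustration. The helper assumes that google_trans_new exposes google_translator().translate(text, lang_src=..., lang_tgt=...) as described in its README, so check this against the version you install. The translator is passed in as a plain function, which lets you swap in another translation service or a stub.

```python
import random
from typing import Callable, Optional, Sequence

# A translate function takes (text, source_lang, target_lang) and returns text.
TranslateFn = Callable[[str, str, str], str]


def back_translate(sentence: str, translate: TranslateFn, src: str = "en",
                   pivots: Sequence[str] = ("ko", "fr", "de"),
                   rng: Optional[random.Random] = None) -> str:
    """Translate src -> random pivot language -> src to paraphrase `sentence`."""
    rng = rng or random.Random()
    pivot = rng.choice(list(pivots))              # Step 2: pick a random language
    pivot_text = translate(sentence, src, pivot)  # Step 2: translate to it
    return translate(pivot_text, pivot, src).strip()  # Step 3: translate back


def google_translate_fn() -> TranslateFn:
    """Build a TranslateFn on google_trans_new (assumed API, see the text above)."""
    from google_trans_new import google_translator  # imported only when used

    translator = google_translator()

    def translate(text: str, source: str, target: str) -> str:
        return translator.translate(text, lang_src=source, lang_tgt=target)

    return translate


# Usage (needs network access and the google_trans_new package):
# sentence = "The Best Way To Get Started Is To Quit Talking And Begin Doing."
# print(back_translate(sentence, google_translate_fn(), pivots=["ko"]))
```

The output depends on the translation service and can change over time, so treat the example sentences above as illustrative.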

Please remember that Google Translate limits the number of sentences you can convert in a given period. There are 2 ways to handle this:
1. Take a paid subscription for the API.
2. Augment your dataset in several rounds. Say 800 calls is the limit: augment 800 sentences and save the result, then after some time augment the next 800 sentences, and so on.
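The second option can be sketched as a small resumable job. This is only one way to do it: the function name augment_in_batches, the JSON Lines output file and the default of 800 are assumptions for illustration. Each run counts how many sentences are already saved, augments the next batch, and appends the results, so you can rerun it later once the limit has reset.

```python
import json
import os
from typing import Callable, List


def augment_in_batches(sentences: List[str], augment: Callable[[str], str],
                       out_path: str, batch_size: int = 800) -> int:
    """Augment the next `batch_size` unprocessed sentences and append them to out_path.

    Returns how many sentences were processed in this run (0 when all are done).
    """
    # Every saved line is one finished sentence, so the line count tells us
    # where the previous run stopped.
    done = 0
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as f:
            done = sum(1 for _ in f)

    batch = sentences[done:done + batch_size]
    with open(out_path, "a", encoding="utf-8") as f:
        for sentence in batch:
            record = {"original": sentence, "augmented": augment(sentence)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(batch)


# Usage: call once per quota window with your own augmentation function, e.g.
# augment_in_batches(train_sentences, my_back_translate, "augmented.jsonl")
```

Because results are written one sentence at a time, a run that stops partway through on a rate-limit error still keeps what it finished.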

Random Insertion

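The idea here is to find the non-stop-words in the sentence and randomly replace a few of them with their synonyms. (Strictly speaking this is synonym replacement, since the words are swapped in place rather than inserted at new positions.)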

Step 1: Select the English sentence.

The Best Way To Get Started Is To Quit Talking And Begin Doing.

Step 2: Filter out the stop words.

["Best", "Way", "Started", "Quit", "Talking", "Begin", "Doing"]

Step 3: Choose 3 random words and find their closest synonyms.

words = ["Way", "Quit", "Talking"]
synonyms = ["Method", "Leave", "Speaking"]

Step 4: Replace those words in the original sentence.

The Best Method To Get Started Is To Leave Speaking And Begin Doing.

To achieve this we can use existing pre-trained word embeddings and treat a word's nearest neighbour as its synonym. For better results, use word embeddings trained on your own data or a language model.

Download and extract the 300-dimensional Google News word2vec vectors.

wget -c ""
gzip -d GoogleNews-vectors-negative300.bin.gz

Let’s code:
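Below is a minimal sketch of steps 2 to 4, under a few assumptions. The names synonym_replace and word2vec_nearest are hypothetical, the stop-word list is a tiny illustrative one, and the nearest neighbour in the embedding space stands in for a synonym. The word2vec helper assumes gensim is installed and that the extracted file is in the working directory.

```python
import random
from typing import Callable, Optional

# Tiny illustrative stop-word list (an assumption); use a fuller list in practice.
STOP_WORDS = {"the", "to", "get", "is", "and", "a", "an", "of", "in"}
PUNCT = ".,!?;:"

# A nearest function maps a word to a replacement, or None if it has none.
NearestFn = Callable[[str], Optional[str]]


def synonym_replace(sentence: str, nearest: NearestFn, n: int = 3,
                    stop_words=STOP_WORDS,
                    rng: Optional[random.Random] = None) -> str:
    """Replace up to n randomly chosen non-stop-words with nearest(word)."""
    rng = rng or random.Random()
    words = sentence.split()
    # Step 2: positions of words that are not stop words
    candidates = [i for i, w in enumerate(words)
                  if w.strip(PUNCT) and w.strip(PUNCT).lower() not in stop_words]
    # Step 3: choose n of them at random and look up a replacement
    for i in rng.sample(candidates, min(n, len(candidates))):
        core = words[i].strip(PUNCT)
        replacement = nearest(core)
        if replacement:
            # Step 4: swap the word in place, keeping attached punctuation
            words[i] = words[i].replace(core, replacement, 1)
    return " ".join(words)


def word2vec_nearest(path: str = "GoogleNews-vectors-negative300.bin") -> NearestFn:
    """Nearest-neighbour lookup backed by gensim word2vec vectors (slow to load)."""
    from gensim.models import KeyedVectors  # imported only when used

    vectors = KeyedVectors.load_word2vec_format(path, binary=True)

    def nearest(word: str) -> Optional[str]:
        if word not in vectors:  # out-of-vocabulary word: leave it unchanged
            return None
        return vectors.most_similar(positive=[word], topn=1)[0][0]

    return nearest


# Usage: nearest = word2vec_nearest(); print(synonym_replace(sentence, nearest))
```

Keep in mind that embedding neighbours are not always true synonyms (they can be antonyms or inflections of the same word), so it is worth reviewing a sample of the output.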

This can be done in one pass, but it still takes time to generate for the whole dataset.

Random Deletion

The idea here is to randomly delete a few words from the sentence.

Step 1: Select a sentence

The Best Way To Get Started Is To Quit Talking And Begin Doing.

Step 2: Assign a random score between 0 and 1 to every word

0.3 The 
0.2 Best
0.3 Way
0.8 To
0.8 Get
0.3 Started
0.7 Is
0.2 To
0.3 Quit
0.9 Talking
0.4 And
0.4 Begin
0.9 Doing.

Step 3: Remove all the words where the score is less than 0.3

The Way To Get Started Is Quit Talking And Begin Doing.

Let’s code:
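A minimal sketch of the steps above, using only the standard library. The function name random_deletion and the rule of always keeping at least one word are my own choices for illustration.

```python
import random
from typing import Optional


def random_deletion(sentence: str, threshold: float = 0.3,
                    rng: Optional[random.Random] = None) -> str:
    """Drop every word whose random score falls below `threshold`."""
    rng = rng or random.Random()
    words = sentence.split()
    # Steps 2 and 3: draw a score in [0, 1) per word, keep scores >= threshold
    kept = [w for w in words if rng.random() >= threshold]
    # Never return an empty sentence: fall back to one randomly chosen word
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)


# Usage:
# print(random_deletion("The Best Way To Get Started Is To Quit Talking And Begin Doing."))
```

With threshold=0.3, each word has roughly a 30% chance of being removed, so longer sentences lose more words on average.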

Random Swap

The idea here is to randomly pick 2 words at a time and swap their positions in the sentence. This can be repeated n times.

Step 1: Select the sentence

The Best Way To Get Started Is To Quit Talking And Begin Doing.

Step 2: Select the words and swap them (here two swaps: "Best" with "Started", then "Best" with "Quit")

The Started Way To Get Quit Is To Best Talking And Begin Doing.

Let’s code:
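A minimal sketch of the swap, again with only the standard library. The function name random_swap and its parameters are hypothetical.

```python
import random
from typing import Optional


def random_swap(sentence: str, n: int = 1,
                rng: Optional[random.Random] = None) -> str:
    """Swap two randomly chosen word positions, repeated n times."""
    rng = rng or random.Random()
    words = sentence.split()
    if len(words) < 2:  # nothing to swap
        return sentence
    for _ in range(n):
        # Pick two distinct positions and exchange the words at them
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


# Usage:
# print(random_swap("The Best Way To Get Started Is To Quit Talking And Begin Doing.", n=2))
```

The output always contains the same words as the input, only in a different order, so keep n small if word order matters for your task.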

Once we have all the augmented data ready, we can begin training. The methods above are general approaches. To apply them to your own dataset, first understand the dataset and then customize the augmentation techniques accordingly.

These techniques can also be mixed and matched among themselves.
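As a rough sketch of mixing and matching, the snippet below applies k randomly chosen augmenters one after another. Everything here is an assumption for illustration: the name augment, the compact delete_words and swap_words helpers, and the convention that each augmenter takes a sentence and a random generator.

```python
import random
from typing import Callable, Optional, Sequence

# An augmenter takes (sentence, rng) and returns a new sentence.
Augmenter = Callable[[str, random.Random], str]


def delete_words(sentence: str, rng: random.Random, threshold: float = 0.3) -> str:
    """Compact random deletion that keeps at least one word."""
    words = sentence.split()
    kept = [w for w in words if rng.random() >= threshold]
    return " ".join(kept or words[:1])


def swap_words(sentence: str, rng: random.Random) -> str:
    """Compact random swap of two word positions."""
    words = sentence.split()
    if len(words) >= 2:
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


def augment(sentence: str, augmenters: Sequence[Augmenter], k: int = 2,
            rng: Optional[random.Random] = None) -> str:
    """Apply k randomly chosen augmenters to the sentence, in random order."""
    rng = rng or random.Random()
    for fn in rng.sample(list(augmenters), min(k, len(augmenters))):
        sentence = fn(sentence, rng)
    return sentence


# Usage:
# print(augment("The Best Way To Get Started Is To Quit Talking And Begin Doing.",
#               [delete_words, swap_words]))
```

Stacking augmentations moves the sentence further from the original, so it may help to check that the label still holds for a sample of the results.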


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
