Natural Language Processing (NLP) with Deep Learning Models (RNN & CNN) — Coleridge Initiative




INTRODUCTION

In this discussion, we will train two deep learning RNN models (a Bidirectional LSTM and a GRU) and one deep learning CNN model (sep-CNN) to solve an NLP problem. The goal is to show how we can approach an NLP problem via the deep learning / sequence-vector route.

This discussion uses data from the Kaggle competition Coleridge Initiative where data scientists are challenged to show how publicly funded data are used to serve science and society. The results will show how public data are being used in science and help the government make wiser, more transparent public investments. It will help move researchers and governments from using ad-hoc methods to automated ways of finding out what datasets are being used to solve problems, what measures are being generated, and which researchers are the experts.

Data Source

Methodology

In this competition, natural language processing (NLP) is used to automate the discovery of how scientific data are referenced in publications. Utilizing the full text of scientific publications from numerous research areas gathered from CHORUS publisher members and other sources, data scientists will identify data sets that the publications’ authors used in their work.

If successful, data scientists will help support evidence in government data. Automated NLP approaches will enable government agencies and researchers to quickly find the information they need. The approach will be used to develop data usage scorecards to better enable agencies to show how their data are used and bring down a critical barrier to the access and use of public data.

Here is what I attempted:

In the first part, we covered the n-gram approach (blog, notebook). However, because bag-of-words is not an order-preserving tokenization method, it tends to be used in shallow language-processing models rather than in deep learning models. In this second part, we will discuss the sequence-vector approach, which treats tokens as a sequence and preserves the general structure of the sentence, since for some text samples word order is critical to the text's meaning. Models that can learn from the adjacency of tokens are known as sequence models. These include CNNs and RNNs, which can infer meaning from the order of words in a sample. For these models, we represent the text as a sequence of tokens, preserving order, instead of as a set of tokens. For this part of the project, we will build three deep learning models:

  • Model 1: Recurrent neural network (RNN) Bidirectional LSTM with GloVe embedding
  • Model 2: Recurrent neural network (RNN) GRU with GloVe embedding
  • Model 3: Convolutional neural network (CNN) sep-CNN also with GloVe embedding

Similarities between the models (a sketch of these shared settings follows the list):

  • batch_size = 256. Training in mini-batches spares the computer’s memory: instead of processing the whole dataset at once, we process smaller packs one at a time.
  • ReLU converts negative numbers to zeros and helps models learn non-linear functions
  • Because this is a multi-class classification problem, all models use activation = softmax which assigns decimal probabilities to each class in a multi-class problem. Those decimal probabilities must add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
  • The Adam optimizer is used for mini-batch gradient descent
  • loss = categorical cross-entropy, which is suitable for multi-class classification
  • accuracy is used to evaluate the model
  • epochs = 30 with EarlyStopping (training stops if there is no further improvement in both loss and accuracy): an epoch is one pass of the model over the entire training dataset
  • steps_per_epoch = total_training_samples // batch_size: the number of batch iterations before a training epoch is considered finished
  • validation_steps = total_validation_samples // validation_batch_size: similar to steps_per_epoch but on the validation set
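Below is a minimal sketch of these shared settings in Keras/TensorFlow. The compile_model helper is a hypothetical convenience function, not code from the original notebook; the settings themselves follow the list above.

from tensorflow.keras.optimizers import Adam

batch_size = 256
epochs = 30

def compile_model(model, total_training_samples, total_validation_samples):
    # Multi-class setup: categorical cross-entropy loss with accuracy as the metric.
    model.compile(optimizer=Adam(),
                  loss='categorical_crossentropy',
                  metrics=['acc'])
    # Number of batch iterations that make up one epoch on each split.
    steps_per_epoch = total_training_samples // batch_size
    validation_steps = total_validation_samples // batch_size
    return steps_per_epoch, validation_steps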

Metrics

The performance of the models is evaluated based on:

  1. Accuracy score
  2. Precision: of the sentences predicted as a given class, the fraction that actually belong to that class (how good the model is at avoiding false positives)
  3. Recall (sensitivity): of the sentences that actually belong to a given class, the fraction the model correctly detects (how good the model is at avoiding false negatives)
  4. F1: harmonic mean of precision and recall

SUMMARY OF FINDINGS

OBTAIN TRAINING SENTENCES

For the full EDA of this dataset, see notebook.

Using the following code, we read each individual publication and break it down into sentences using the sent_tokenize function from nltk (Natural Language Toolkit), a leading platform for building Python programs that work with human language data. For each sentence, we use .search to look for a matching dataset_title, dataset_label, or cleaned_label. When a match is found, it is returned as a tuple containing the starting and ending index of the matched string, and the match is labeled as DATASET.

To process the text we use two functions, clean_text and shorten_sentences. Since we need to preserve the exact words of dataset_title, the cleaning step is kept simple: remove special characters and lower-case the text. Then we break long sentences into shorter ones.
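The helpers might look roughly like the sketch below. This is a reconstruction under assumptions (the exact implementations of clean_text, shorten_sentences, and the matching loop live in the notebook), but it captures the idea: clean lightly, split long sentences, and tag sentences that contain a known dataset label.

import re
from nltk.tokenize import sent_tokenize  # requires nltk and its 'punkt' models

def clean_text(text):
    # Keep cleaning simple so dataset titles survive verbatim:
    # strip special characters and lower-case everything.
    return re.sub(r'[^a-z0-9 ]+', ' ', str(text).lower()).strip()

def shorten_sentences(sentence, max_words=60):
    # Break a long sentence into chunks of at most max_words words (chunk size assumed).
    words = sentence.split()
    return [' '.join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def label_sentences(publication_text, dataset_labels):
    # Split a publication into cleaned, shortened sentences and record the
    # matched dataset label (the DATASET tag), or '' when nothing matches.
    rows = []
    for sentence in sent_tokenize(publication_text):
        for short in shorten_sentences(clean_text(sentence)):
            match = None
            for label in dataset_labels:
                match = re.search(label, short)
                if match:
                    break
            rows.append((short, match.group(0) if match else ''))
    return rows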

Putting it all together:

Text with dataset: 64512
Text without dataset: 50303

Dataframe

print(train_df['Sentence'][1000])
print('\n')
print(train_df['Label'][1000])
using data from the baltimore longitudinal study of aging blsa we are able to generate systems level models of biological and physiological function and then demonstrate how these networks change with age.


baltimore longitudinal study of aging blsa

Check for number of unique labels:

train_df['Label'].nunique()
132

We have a total of 132 labels which is consistent with the dataset. Full EDA analysis of this dataset can be found in this notebook.

PREPROCESSING

Train Test Split

We need to split our dataset into a training and validation set. We’ll use 80% of the dataset as the training data and evaluate the performance on the remaining 20% (holdout set):

Train sentences: (51609,) 
Test sentences: (12903,)
Train labels: (51609,)
Test labels: (12903,)

Summary of Steps

  • Tokenizes the texts into words
  • Creates a vocabulary using the top 20,000 tokens
  • Converts the tokens into sequence vectors
  • Pads the sequences to a fixed sequence length

Like all other neural networks, deep learning models don’t take raw text as input, so the first step is to transform the text into numeric tensors.

Tokenization

Most neural network models begin by breaking a sequence of strings up into individual words, phrases, or whole sentences: a process known as tokenizing. A tokenizer builds the vocabulary and converts a word sequence into an integer sequence. Each integer maps to a value in a dictionary that encodes the entire corpus, with the keys of the dictionary being the vocabulary terms themselves. For example:

num_words is the size of our vocabulary. We set top_k = 20000 meaning the 20,000 most common words will be kept.

fit_on_texts

texts_to_sequences

pad_sequences: LSTM layers accept sequences of the same length only, but each text sequence usually has a different number of words. To counter this, we can use pad_sequences, which simply pads each sequence of word indices with zeros, so that every sentence represented as integers has the same length. We work with the length of the longest sequence and pad the shorter ones, so the resulting feature vectors contain mostly zeros for fairly short sentences.
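A minimal sketch of this tokenize-and-pad step with the Keras preprocessing utilities (top_k comes from the text; maxlen = 100 and the X_train / X_test names are assumptions consistent with the shapes shown later):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

top_k = 20000   # keep only the 20,000 most common words
maxlen = 100    # fixed sequence length (assumed value)

tokenizer = Tokenizer(num_words=top_k)
tokenizer.fit_on_texts(X_train)                        # build the vocabulary on the training text

X_train_seq = tokenizer.texts_to_sequences(X_train)    # words -> integer sequences
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Zero-pad (or truncate) every sequence to the same fixed length.
X_train_pad = pad_sequences(X_train_seq, maxlen=maxlen)
X_test_pad = pad_sequences(X_test_seq, maxlen=maxlen)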

print(train_df['Sentence'][10])
print(X_train_pad[10])
using data from the national education longitudinal study nels we estimate a value added education production function that includes parental effort as an input.
[0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 362 507 2 787 4001 2421 40 788 44
332 719 1164 1238 3 694 10 67 389 440 726 8 695 862
299 7]

maxlen is also a necessary parameter to specify how long the sequences should be. This cuts sequences that exceed that number. Our X_train_pad will look like so:

array([[   0,    0,    0, ..., 1789,  236,  381],
[ 0, 0, 0, ..., 238, 235, 351],
[ 0, 0, 0, ..., 11, 52, 53],
...,
[ 0, 0, 0, ..., 10, 5, 59],
[ 0, 0, 0, ..., 1086, 27, 55],
[ 0, 0, 0, ..., 82, 1975, 528]], dtype=int32)

The Tokenizer must be fit on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer. We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object. Later, when we make predictions, we can convert the predictions to numbers and look up their associated words in the same mapping. For example:

{'the': 1,
'of': 2,
'and': 3,
'in': 4,
'adni': 5,
'to': 6,
'data': 7,
'for': 8,
'a': 9,
'from': 10,
'study': 11,
's': 12,
'with': 13,
'were': 14,
'on': 15,
'longitudinal': 16,
'by': 17,
'as': 18,
'section': 19,
'text': 20,
'title': 21,
'disease': 22,
'national': 23,
'is': 24,
'1': 25,
'this': 26,
'2': 27,
'alzheimer': 28,
'education': 29,
'that': 30,
...}

We need to know the size of the vocabulary for defining the embedding layer later. We can determine the vocabulary by calculating the size of the mapping dictionary.

Words are assigned values from 1 to the total number of words (e.g. 32916). The Embedding layer needs to allocate a vector for every index from 0 up to the largest word index, and because array indexing is zero-offset, an array that can hold index 32916 must have length 32916 + 1 = 32917. Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as one larger than the actual vocabulary.
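With the Keras Tokenizer this is a one-liner (sketch; variable names follow the text):

# One larger than the number of unique words, to account for the reserved 0 index.
size_of_vocabulary = len(tokenizer.word_index) + 1
print(size_of_vocabulary)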

32917

Encode Labels

We need to one-hot encode the output label. This means converting it from an integer to a vector of 0 values, one for each class, with a 1 at the index of the label’s integer value. This way the model learns to predict a probability distribution over the classes, and the ground truth it learns from is 0 for every class except the actual label.

print(train_df['Sentence'][10])
print(train_df['Label'][10])
print(y_train[10])
using data from the national education longitudinal study nels we estimate a value added education production function that includes parental effort as an input.
national education longitudinal study
55

We see that national education longitudinal study is assigned 55 by the LabelEncoder. Next, we will binarize all the labels for the neural net.

to_categorical one-hot encodes the output label for each sentence.
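A minimal sketch of this label-encoding step, assuming scikit-learn’s LabelEncoder and Keras’ to_categorical (the train_labels / test_labels names are placeholders):

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

label_encoder = LabelEncoder()

# Map each label string to an integer (e.g. 'national education longitudinal study' -> 55) ...
y_train_int = label_encoder.fit_transform(train_labels)
y_test_int = label_encoder.transform(test_labels)

# ... then binarize: one column per class, with a 1 at the label's index.
y_train = to_categorical(y_train_int)
y_test = to_categorical(y_test_int, num_classes=y_train.shape[1])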

print(train_df['Sentence'][10])
print(train_df['Label'][10])
print(y_train[10])
using data from the national education longitudinal study nels we estimate a value added education production function that includes parental effort as an input.
national education longitudinal study
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

y_train will look like this:

array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

The final shapes of our X and y are:

X_train shape: (15728, 100)
X_test shape: (3933, 100)
y_train shape: (15728, 130)
y_test shape: (3933, 130)

Embedding Layer

Sequence models generally have a larger number of parameters to learn. The first layer in these models is an embedding layer, which learns the relationship between the words in a dense vector space [Google].

The Embedding layer is best understood as a dictionary mapping integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup [Google].

We will need the following parameters:

  • input_dim: the size of the vocabulary
  • output_dim: the size of the dense vector
  • input_length: the length of the sequence

We use size_of_vocabulary which has […] words

It also has a parameter to specify how many dimensions will be used to represent each word. That is, the size of the embedding vector space. Common values are 50, 100, and 300. Here we use 300.
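A sketch of the layer with these parameters (variable names follow the text; the pre-trained weights are plugged in in the next section):

from tensorflow.keras.layers import Embedding

embedding_dim = 300   # size of the dense vector for each word

embedding_layer = Embedding(input_dim=size_of_vocabulary,   # vocabulary size (+1 for the reserved 0 index)
                            output_dim=embedding_dim,
                            input_length=maxlen)             # length of each padded sequence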

Pre-trained Word Vector

Since the dataset used is small, one method of addressing this lack of data in a given domain is to leverage data from a similar domain. This means using what is learned from one task and applying that to another task without learning from scratch. Words in a given dataset are most likely not unique to that dataset. Thus we can transfer an embedding learned from another dataset into our embedding layer. These embeddings are referred to as pre-trained embeddings.

GloVe, or Global Vectors for Word Representation, developed by Stanford researchers in 2014 (Pennington et al.), is an embedding technique based on factorizing a matrix of word co-occurrence statistics. Its developers have made available pre-computed embeddings for millions of English tokens, obtained from Wikipedia data or from Common Crawl data.
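Loading the vectors and building the embedding matrix might look like the sketch below. The file name glove.840B.300d.txt is an assumption; any 300-dimensional GloVe file works the same way.

import numpy as np

# Load the pre-trained GloVe vectors into a word -> vector dictionary.
embeddings_index = {}
with open('glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.rstrip().split(' ')
        # Take the last 300 numbers as the vector so multi-word tokens parse correctly.
        embeddings_index[' '.join(values[:-300])] = np.asarray(values[-300:], dtype='float32')
print('Found %s word vectors.' % len(embeddings_index))

# Build the embedding matrix: row i holds the GloVe vector for the word with index i.
# Rows stay all-zero for words missing from GloVe (and for the reserved 0 index).
embedding_matrix = np.zeros((size_of_vocabulary, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector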

Found 2195885 word vectors.

We can find out how much of our vocabulary is covered by the pre-trained embedding:

0.7209344715496552

This means 72% of our vocabulary is covered by the pre-trained GloVe model, which is decent coverage.

We are done preprocessing the raw text; let’s start building our deep learning models.

Recurrent neural network (RNN)

Recurrent neural networks (RNNs) are used when the inputs are sequential, i.e. read sequentially from left to right. This is perfect for modeling language, because language is a sequence of words and each word depends on the words that come before it. However, plain RNNs suffer from short-term memory and the vanishing gradient problem. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models are the solution for this. These networks have internal mechanisms called gates that regulate the flow of information, so they can remember information for long periods of time without suffering from the vanishing gradient problem.

(source)

The differences between LSTM and GRU are (Phi, 2020):

  1. GRU has two gates (reset and update gates) whereas an LSTM has three gates (namely input, output and forget gates)
  2. LSTMs remember longer sequences than GRUs
  3. GRU is simpler and trains faster than LSTM

Bidirectional LSTM with GloVe

Long Short-Term Memory networks, or LSTMs, are recurrent neural networks that use LSTM cell blocks in place of standard neural network layers. LSTMs have been observed to be among the most effective solutions for sequence prediction problems, which are considered some of the hardest problems in the data science industry.

One of the many problems in NLP is how to understand a word’s context. In some cases, a word can have completely different meanings depending on the surrounding words. This is where bidirectional RNN comes in, which requires two steps because computers need to take in token input from both directions to get the full context. The first step is a forward pass (taking in words from left to right), and the second part is a backwards pass (taking in words from right to left).

Model Architecture

  • Start with initializing the model by specifying that it is a Sequential model. The Sequential model is the simplest Keras model, for neural networks composed of a single stack of layers connected sequentially.
  • Embedding is the first layer, which learns the relationship between the words in a dense vector space. To use the pre-trained GloVe vectors, set weights = [embedding_matrix]. The pre-trained Embedding layer needs to be frozen by setting trainable = False, which prevents the pre-trained weights from being updated during training, so the layer acts as a fixed feature extractor and training is faster.
  • LSTM layer. If we want a sequence as output rather than a single vector, we set return_sequences = True.
  • SpatialDropout1D is similar to Dropout but it drops entire 1D feature maps instead of individual elements. SpatialDropout1D is recommended after the Embedding layer, instead of normal Dropout, to help promote independence between the feature maps.
  • Bidirectional LSTM layer duplicates the RNN processing chain so that inputs are processed in both forward and reverse order.
  • Three Dense layers with 512, 128, and 64 neurons are added, connected to the LSTM output, to interpret the features extracted from the sequence. Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs, as well as a vector of bias terms (one per neuron).
  • activation = ReLU (Rectified Linear Unit) is added to each of these layers so that negative values do not get passed to the next layer.
  • One output layer with activation = softmax is added last to ensure the outputs behave like normalized probabilities. Since this is a multi-class classification problem, the output layer needs num_classes neurons, one per class. A sketch of this architecture follows the list.
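A minimal sketch of an architecture along these lines. The Dense sizes, frozen GloVe embedding, and softmax output follow the description above; the LSTM units, dropout rate, and num_classes value are assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, SpatialDropout1D,
                                     Bidirectional, LSTM, Dense)

model = Sequential([
    # Frozen pre-trained GloVe embedding, used as a fixed feature extractor.
    Embedding(input_dim=size_of_vocabulary, output_dim=300, input_length=maxlen,
              weights=[embedding_matrix], trainable=False),
    SpatialDropout1D(0.2),            # drops entire 1D feature maps (rate assumed)
    Bidirectional(LSTM(128)),         # forward + backward pass over the sequence (units assumed)
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax'),   # one probability per class
])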

Compile Model

Fit Model

Set patience = 10 for EarlyStopping, a regularization technique, meaning that the model will stop training if it doesn’t see any improvement in val_acc in 10 epochs.
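A sketch of the fit call with this callback; the monitored metric and patience follow the text, while the remaining arguments reuse the shared settings sketched earlier.

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_acc', patience=10)   # stop after 10 epochs without improvement

history = model.fit(X_train_pad, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(X_test_pad, y_test),
                    callbacks=[early_stopping])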

Model Evaluation

The training and validation curves follow each other closely: accuracy improves over time and loss decreases over time.

1613/1613 [==============================] - 346s 215ms/step - loss: 0.3470 - acc: 0.8369
Train loss & accuracy: [0.34700754284858704, 0.8368695378303528]


404/404 [==============================] - 88s 218ms/step - loss: 0.3859 - acc: 0.8283
Test loss & accuracy: [0.3859459459781647, 0.8282570242881775]

Train and test accuracy are similar (83% and 83%, respectively), so the model did not overfit.

Model: LSTM 
precision recall f1-score support

0 0.00 0.00 0.00 1
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 2
3 0.92 0.97 0.94 6189
4 0.00 0.00 0.00 5
7 0.00 0.00 0.00 1
8 0.95 1.00 0.97 220
9 0.00 0.00 0.00 7
10 0.52 0.31 0.39 726
11 0.00 0.00 0.00 16
12 0.00 0.00 0.00 9
13 0.00 0.00 0.00 3
14 0.00 0.00 0.00 3
15 0.69 0.86 0.76 149
16 0.41 0.15 0.22 60
17 0.76 1.00 0.86 333
19 0.00 0.00 0.00 106
20 0.60 0.84 0.70 496
21 0.30 0.07 0.11 392
22 0.34 0.77 0.47 84
24 0.00 0.00 0.00 1
25 0.00 0.00 0.00 1
26 0.00 0.00 0.00 2
27 0.90 0.99 0.95 279
28 0.00 0.00 0.00 1
29 0.00 0.00 0.00 2
30 0.76 1.00 0.87 71
31 0.00 0.00 0.00 2
32 0.69 0.96 0.81 78
33 0.00 0.00 0.00 1
35 0.00 0.00 0.00 12
37 0.00 0.00 0.00 1
38 0.50 0.10 0.17 10
40 0.40 0.96 0.57 28
41 0.00 0.00 0.00 12
43 0.98 0.99 0.98 445
44 0.73 0.43 0.54 360
45 0.00 0.00 0.00 1
46 0.00 0.00 0.00 4
47 0.00 0.00 0.00 2
48 0.35 0.75 0.48 44
49 0.00 0.00 0.00 4
50 0.00 0.00 0.00 3
51 0.00 0.00 0.00 49
52 0.87 0.87 0.87 47
53 0.64 0.44 0.52 16
54 0.85 0.91 0.88 186
55 0.00 0.00 0.00 21
56 0.00 0.00 0.00 4
57 0.00 0.00 0.00 3
58 0.86 0.52 0.65 23
59 0.00 0.00 0.00 9
60 0.00 0.00 0.00 4
61 0.00 0.00 0.00 4
63 0.00 0.00 0.00 1
66 0.49 0.77 0.60 241
68 0.00 0.00 0.00 1
72 0.00 0.00 0.00 9
74 0.00 0.00 0.00 1
76 0.78 1.00 0.88 21
77 0.56 0.82 0.67 17
79 0.00 0.00 0.00 13
82 0.50 0.14 0.22 21
84 0.00 0.00 0.00 8
85 0.00 0.00 0.00 16
86 0.00 0.00 0.00 3
88 0.00 0.00 0.00 7
89 0.90 0.99 0.94 113
90 0.00 0.00 0.00 5
91 0.00 0.00 0.00 2
92 0.00 0.00 0.00 2
93 0.62 1.00 0.77 98
95 0.00 0.00 0.00 56
97 0.00 0.00 0.00 1
98 0.00 0.00 0.00 15
100 0.00 0.00 0.00 3
101 0.00 0.00 0.00 1
102 0.65 0.77 0.71 44
103 0.81 1.00 0.89 55
104 0.00 0.00 0.00 3
106 0.79 0.83 0.81 23
107 0.00 0.00 0.00 1
108 0.00 0.00 0.00 1
110 0.99 1.00 1.00 142
112 0.00 0.00 0.00 6
113 0.00 0.00 0.00 1
114 0.00 0.00 0.00 2
115 0.00 0.00 0.00 41
116 0.46 0.89 0.60 71
117 1.00 1.00 1.00 64
119 0.99 1.00 0.99 208
120 0.85 0.90 0.87 172
121 0.87 0.92 0.89 281
122 0.55 0.87 0.67 38
123 0.85 0.98 0.91 42
124 0.33 0.06 0.11 16
125 0.00 0.00 0.00 1
126 0.00 0.00 0.00 3
127 0.00 0.00 0.00 1
128 0.99 0.99 0.99 368
129 0.00 0.00 0.00 26
130 0.00 0.00 0.00 3
131 0.83 0.99 0.90 99

accuracy 0.83 12903
macro avg 0.28 0.31 0.28 12903
weighted avg 0.79 0.83 0.80 12903

The classification report shows that only 41 classes out of the 133 total classes are picked up by the model. This is due to the highly imbalanced dataset. Although the accuracy score is high (83%), more training data to balance the minority classes is needed in the future.

GRU

Gated recurrent units (GRUs) are a gating mechanism in RNNs. A GRU is like an LSTM with a forget gate, but it has fewer parameters than an LSTM, as it lacks an output gate.

Model Architecture

  • Start with initializing the model by specifying that it is a Sequential model.
  • Embedding is the first layer, using the pre-trained GloVe vectors via weights = [embedding_matrix]. The pre-trained Embedding layer is frozen by setting trainable = False.
  • SpatialDropout1D drops entire 1D feature maps instead of individual elements.
  • GRU layer with return_sequences = True.
  • Three Dense layers with 512, 128, and 64 neurons are added to interpret the features extracted from the sequence.
  • activation = ReLU (Rectified Linear Unit) is added to each of these layers so that negative values do not get passed to the next layer.
  • One output layer with activation = softmax is added last to ensure the outputs behave like normalized probabilities. Since this is a multi-class classification problem, the output layer needs num_classes neurons, one per class. A sketch of this architecture follows the list.
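A sketch of this variant; compared with the Bidirectional LSTM above, only the recurrent layer changes (units and dropout rate are again assumptions).

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SpatialDropout1D, GRU, Dense

model = Sequential([
    Embedding(input_dim=size_of_vocabulary, output_dim=300, input_length=maxlen,
              weights=[embedding_matrix], trainable=False),
    SpatialDropout1D(0.2),
    GRU(128),                         # single GRU layer in place of the Bidirectional LSTM
    Dense(512, activation='relu'),
    Dense(128, activation='relu'),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax'),
])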

Compile Model

Fit Model

Model Evaluation

The training and validation curves follow each other closely: accuracy improves over time and loss decreases over time.

1613/1613 [==============================] - 437s 271ms/step - loss: 0.5258 - acc: 0.7959
Train loss & accuracy: [0.5257872343063354, 0.7959076762199402]


404/404 [==============================] - 109s 269ms/step - loss: 0.5597 - acc: 0.7894
Test loss & accuracy: [0.5597025156021118, 0.7893512845039368]

Train and test accuracy are similar (80% and 79%, respectively), so the model did not overfit.

Model: GRU 
precision recall f1-score support

0 0.00 0.00 0.00 1
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 2
3 0.97 0.90 0.93 6189
4 0.00 0.00 0.00 5
7 0.00 0.00 0.00 1
8 0.95 1.00 0.97 220
9 0.00 0.00 0.00 7
10 0.48 0.81 0.60 726
11 0.00 0.00 0.00 16
12 0.00 0.00 0.00 9
13 0.00 0.00 0.00 3
14 0.00 0.00 0.00 3
15 0.66 0.85 0.74 149
16 0.39 0.15 0.22 60
17 0.76 1.00 0.86 333
19 0.00 0.00 0.00 106
20 0.60 0.87 0.71 496
21 0.25 0.02 0.04 392
22 0.33 0.86 0.48 84
24 0.00 0.00 0.00 1
25 0.00 0.00 0.00 1
26 0.00 0.00 0.00 2
27 0.90 0.99 0.95 279
28 0.00 0.00 0.00 1
29 0.00 0.00 0.00 2
30 0.89 1.00 0.94 71
31 0.00 0.00 0.00 2
32 0.75 0.97 0.84 78
33 0.00 0.00 0.00 1
35 0.00 0.00 0.00 12
37 0.00 0.00 0.00 1
38 0.45 0.90 0.60 10
40 0.57 1.00 0.73 28
41 0.00 0.00 0.00 12
43 1.00 0.96 0.98 445
44 0.58 0.94 0.72 360
45 0.00 0.00 0.00 1
46 0.00 0.00 0.00 4
47 0.00 0.00 0.00 2
48 0.33 0.98 0.50 44
49 0.00 0.00 0.00 4
50 0.00 0.00 0.00 3
51 0.00 0.00 0.00 49
52 0.70 0.94 0.80 47
53 0.57 0.25 0.35 16
54 0.85 0.97 0.91 186
55 0.00 0.00 0.00 21
56 0.00 0.00 0.00 4
57 0.00 0.00 0.00 3
58 1.00 0.78 0.88 23
59 0.00 0.00 0.00 9
60 0.00 0.00 0.00 4
61 0.00 0.00 0.00 4
63 0.00 0.00 0.00 1
66 0.31 0.02 0.04 241
68 0.00 0.00 0.00 1
72 0.00 0.00 0.00 9
74 0.00 0.00 0.00 1
76 0.75 1.00 0.86 21
77 0.38 0.65 0.48 17
79 0.00 0.00 0.00 13
82 1.00 0.52 0.69 21
84 0.00 0.00 0.00 8
85 0.00 0.00 0.00 16
86 0.00 0.00 0.00 3
88 0.00 0.00 0.00 7
89 0.92 1.00 0.96 113
90 0.00 0.00 0.00 5
91 0.00 0.00 0.00 2
92 0.00 0.00 0.00 2
93 0.62 1.00 0.76 98
95 0.00 0.00 0.00 56
97 0.00 0.00 0.00 1
98 0.00 0.00 0.00 15
100 0.00 0.00 0.00 3
101 0.00 0.00 0.00 1
102 0.55 1.00 0.71 44
103 0.90 1.00 0.95 55
104 0.00 0.00 0.00 3
106 0.91 0.91 0.91 23
107 0.00 0.00 0.00 1
108 0.00 0.00 0.00 1
110 1.00 0.99 1.00 142
112 0.00 0.00 0.00 6
113 0.00 0.00 0.00 1
114 0.00 0.00 0.00 2
115 0.00 0.00 0.00 41
116 0.61 0.85 0.71 71
117 1.00 0.98 0.99 64
119 0.99 1.00 0.99 208
120 0.93 0.90 0.91 172
121 0.86 0.96 0.91 281
122 0.58 0.87 0.69 38
123 0.86 1.00 0.92 42
124 0.92 0.69 0.79 16
125 0.00 0.00 0.00 1
126 0.00 0.00 0.00 3
127 0.00 0.00 0.00 1
128 1.00 0.99 1.00 368
129 0.00 0.00 0.00 26
130 0.00 0.00 0.00 3
131 0.98 0.96 0.97 99

accuracy 0.83 12903
macro avg 0.29 0.33 0.30 12903
weighted avg 0.81 0.83 0.80 12903

The classification report again shows that only 41 classes are picked up by the model, out of 133 total classes. This is due to the highly imbalanced dataset. Although the accuracy score is high (83%), more training data to balance the minority classes is needed in the future.

Convolutional Neural Networks (CNNs)

When we think about Convolutional Neural Networks (CNNs), we typically think of computer vision. However, CNNs can also be applied to NLP. Instead of image pixels, sentences or documents represented as a matrix are the input. Each row of the matrix corresponds to one token, typically a word, but it could be a character. In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words) (Britz, 2016).

(source)

Using CNNs for NLP is counterintuitive: in vision, pixels close to each other are likely to be semantically related, whereas in language, parts of a phrase can be separated by several other words. However, Britz (2016) argues that if the simple bag-of-words model, an obvious oversimplification with incorrect assumptions, could be the standard approach for years with good results, then CNNs can arguably be used for NLP as well. Moreover, compared to something like n-grams, which quickly become expensive to compute beyond 3-grams, CNNs are much more efficient. Let’s try a CNN out.

sep-CNN

Original code from Google workshop is here.

Model Architecture

  • Start with initializing the model by specifying that it is a Sequential model.
  • Embedding is the first layer, with pre-trained GloVe vectors via weights = [embedding_matrix]. The pre-trained Embedding layer is frozen by setting trainable = False.
  • Dropout to reduce overfitting by randomly dropping a fraction of the units during training.
  • SeparableConv layers are used in place of standard Conv layers; they factor the convolution into a depthwise step followed by a pointwise step, which gives similar expressive power with far fewer parameters.
  • A Pooling layer is sandwiched between successive convolutional layers to reduce the spatial size of the convolved features and the number of parameters in the network. MaxPooling1D is one of the most common pooling methods.
  • activation = ReLU (Rectified Linear Unit) is added to each SeparableConv layer so that negative values do not get passed to the next layer.
  • GlobalAveragePooling1D to minimize overfitting by reducing the total number of parameters in the model.
  • One output layer with activation = softmax is added last to ensure the outputs behave like normalized probabilities. Since this is a multi-class classification problem, the output layer needs num_classes neurons, one per class. A sketch of this architecture follows the list.
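A minimal sketch of a sepCNN along these lines. The filter counts, kernel size, and dropout rate are assumptions; the block structure (separable convolutions, max pooling, global average pooling, softmax output) follows the description above and Google’s text-classification guide.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Dropout, SeparableConv1D,
                                     MaxPooling1D, GlobalAveragePooling1D, Dense)

model = Sequential([
    Embedding(input_dim=size_of_vocabulary, output_dim=300, input_length=maxlen,
              weights=[embedding_matrix], trainable=False),
    Dropout(0.2),
    SeparableConv1D(64, 3, activation='relu', padding='same'),
    SeparableConv1D(64, 3, activation='relu', padding='same'),
    MaxPooling1D(pool_size=2),              # shrink the spatial size between conv blocks
    SeparableConv1D(128, 3, activation='relu', padding='same'),
    SeparableConv1D(128, 3, activation='relu', padding='same'),
    GlobalAveragePooling1D(),               # collapse to one vector per sample
    Dropout(0.2),
    Dense(num_classes, activation='softmax'),
])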

Compile Model

Fit Model

Model Evaluation

The training and validation curves follow each other closely: accuracy improves over time and loss decreases over time.

1613/1613 [==============================] - 24s 15ms/step - loss: 2.3333 - acc: 0.4805
Train loss & accuracy: [2.333292007446289, 0.4804975986480713]


404/404 [==============================] - 6s 15ms/step - loss: 2.3373 - acc: 0.4797
Test loss & accuracy: [2.3373019695281982, 0.4796558916568756]

Train and test accuracy are similar (48% and 48%, respectively). The model is not overfit.

Model: SEPCNN 
precision recall f1-score support

0 0.00 0.00 0.00 1
1 0.00 0.00 0.00 4
2 0.00 0.00 0.00 2
3 0.48 1.00 0.65 6189
4 0.00 0.00 0.00 5
7 0.00 0.00 0.00 1
8 0.00 0.00 0.00 220
9 0.00 0.00 0.00 7
10 0.00 0.00 0.00 726
11 0.00 0.00 0.00 16
12 0.00 0.00 0.00 9
13 0.00 0.00 0.00 3
14 0.00 0.00 0.00 3
15 0.00 0.00 0.00 149
16 0.00 0.00 0.00 60
17 0.00 0.00 0.00 333
19 0.00 0.00 0.00 106
20 0.00 0.00 0.00 496
21 0.00 0.00 0.00 392
22 0.00 0.00 0.00 84
24 0.00 0.00 0.00 1
25 0.00 0.00 0.00 1
26 0.00 0.00 0.00 2
27 0.00 0.00 0.00 279
28 0.00 0.00 0.00 1
29 0.00 0.00 0.00 2
30 0.00 0.00 0.00 71
31 0.00 0.00 0.00 2
32 0.00 0.00 0.00 78
33 0.00 0.00 0.00 1
35 0.00 0.00 0.00 12
37 0.00 0.00 0.00 1
38 0.00 0.00 0.00 10
40 0.00 0.00 0.00 28
41 0.00 0.00 0.00 12
43 0.00 0.00 0.00 445
44 0.00 0.00 0.00 360
45 0.00 0.00 0.00 1
46 0.00 0.00 0.00 4
47 0.00 0.00 0.00 2
48 0.00 0.00 0.00 44
49 0.00 0.00 0.00 4
50 0.00 0.00 0.00 3
51 0.00 0.00 0.00 49
52 0.00 0.00 0.00 47
53 0.00 0.00 0.00 16
54 0.00 0.00 0.00 186
55 0.00 0.00 0.00 21
56 0.00 0.00 0.00 4
57 0.00 0.00 0.00 3
58 0.00 0.00 0.00 23
59 0.00 0.00 0.00 9
60 0.00 0.00 0.00 4
61 0.00 0.00 0.00 4
63 0.00 0.00 0.00 1
66 0.00 0.00 0.00 241
68 0.00 0.00 0.00 1
72 0.00 0.00 0.00 9
74 0.00 0.00 0.00 1
76 0.00 0.00 0.00 21
77 0.00 0.00 0.00 17
79 0.00 0.00 0.00 13
82 0.00 0.00 0.00 21
84 0.00 0.00 0.00 8
85 0.00 0.00 0.00 16
86 0.00 0.00 0.00 3
88 0.00 0.00 0.00 7
89 0.00 0.00 0.00 113
90 0.00 0.00 0.00 5
91 0.00 0.00 0.00 2
92 0.00 0.00 0.00 2
93 0.00 0.00 0.00 98
95 0.00 0.00 0.00 56
97 0.00 0.00 0.00 1
98 0.00 0.00 0.00 15
100 0.00 0.00 0.00 3
101 0.00 0.00 0.00 1
102 0.00 0.00 0.00 44
103 0.00 0.00 0.00 55
104 0.00 0.00 0.00 3
106 0.00 0.00 0.00 23
107 0.00 0.00 0.00 1
108 0.00 0.00 0.00 1
110 0.00 0.00 0.00 142
112 0.00 0.00 0.00 6
113 0.00 0.00 0.00 1
114 0.00 0.00 0.00 2
115 0.00 0.00 0.00 41
116 0.00 0.00 0.00 71
117 0.00 0.00 0.00 64
119 0.00 0.00 0.00 208
120 0.00 0.00 0.00 172
121 0.00 0.00 0.00 281
122 0.00 0.00 0.00 38
123 0.00 0.00 0.00 42
124 0.00 0.00 0.00 16
125 0.00 0.00 0.00 1
126 0.00 0.00 0.00 3
127 0.00 0.00 0.00 1
128 0.00 0.00 0.00 368
129 0.00 0.00 0.00 26
130 0.00 0.00 0.00 3
131 0.00 0.00 0.00 99

accuracy 0.48 12903
macro avg 0.00 0.01 0.01 12903
weighted avg 0.23 0.48 0.31 12903

However, when we look at the classification report, only one class (class 3) is picked up by the model while the rest failed. This model is discarded in the end.

PREDICTION

Now let’s make prediction on our training dataset. First we read the train dataset.

Then we import test dataset with read_json_pub function:

Set train_path = test_path so that a new test set of 8,000 publications can be fed into the model:

Now we will attempt to fill in the column PredictionString using all our deep learning models.
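A minimal sketch of going from a raw test sentence to a predicted label with one of the trained models; the tokenizer, label_encoder, and maxlen names carry over from the preprocessing sketches above, and predict_labels is a hypothetical helper.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def predict_labels(model, sentences):
    # Reuse the training-time preprocessing: tokenize, pad, predict, decode.
    seqs = tokenizer.texts_to_sequences(sentences)
    padded = pad_sequences(seqs, maxlen=maxlen)
    probs = model.predict(padded)
    class_ids = np.argmax(probs, axis=1)              # most probable class per sentence
    return label_encoder.inverse_transform(class_ids)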

For comparison, we also use literal matching to find and identify DATASET mentions (cleaned label comes from literal matching):

cleaned label: {'alzheimer s disease neuroimaging initiative adni ', 'adni'}
lstm label: {'adni'}
gru label: {'adni'}

cleaned label: {'nces common core of data', 'trends in international mathematics and science study', 'common core of data'}
lstm label: {'adni'}
gru label: {'our world in data'}

cleaned label: {'sea lake and overland surges from hurricanes', 'slosh model', 'noaa storm surge inundation'}
lstm label: {'ibtracs'}
gru label: {'ibtracs'}

cleaned label: {'rural urban continuum codes'}
lstm label: {'adni'}
gru label: {'adni'}

Combining all methods, we get:

alzheimer s disease neuroimaging initiative adni |adni


common core of data|our world in data|trends in international mathematics and science study|adni|nces common core of data


sea lake and overland surges from hurricanes|slosh model|ibtracs|noaa storm surge inundation


rural urban continuum codes|adni

Put all label_list entries into the required format for submission:
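A sketch of that last step. The Id and PredictionString column names follow the competition’s submission format; the grouping into a dict of predicted labels per publication is an assumption.

import pandas as pd

def to_submission(labels_per_publication):
    # labels_per_publication: dict mapping publication Id -> set of predicted dataset labels.
    rows = [{'Id': pub_id, 'PredictionString': '|'.join(sorted(labels))}
            for pub_id, labels in labels_per_publication.items()]
    pd.DataFrame(rows).to_csv('submission.csv', index=False)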

There you go: this is how you can approach an NLP problem with the deep learning / sequence-vector method.

Github

Kaggle

REFERENCE

Britz, D. (2016, January 10). Understanding Convolutional Neural Networks for NLP. WildML. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/.

Chollet, F. (2017). Chapter 6. Deep learning for text and sequences · Deep Learning with Python. · Deep Learning with Python. https://livebook.manning.com/book/deep-learning-with-python/chapter-6/18.

Google. (n.d.). Step 4: Build, Train, and Evaluate Your Model. Google. https://developers.google.com/machine-learning/guides/text-classification/step-4.

Pennington, J. (n.d.). GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/.

Phi, M. (2020, June 28). Illustrated Guide to LSTM’s and GRU’s: A step by step explanation. Medium. https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21.
