Long Short-Term Memory (LSTM) for Sentiment Analysis



Reading the Dataset

import pandas as pd

dataset = pd.read_csv("IMDB Dataset.csv")
dataset.head()
--Output--
  | review                                            | sentiment
0 | One of the other reviewers has mentioned that ... | positive
1 | A wonderful little production. <br /><br />The... | positive
2 | I thought this was a wonderful way to spend ti... | positive
3 | Basically there's a family where a little boy ... | negative
4 | Petter Mattei's "Love in the Time of Money" is... | positive

Let’s see how many sentiment sentences are annotated in our dataset:

dataset.sentiment.value_counts()
--Output--
positive    25000
negative    25000
Name: sentiment, dtype: int64

As you can see from the above output, our dataset is balanced, so we can expect a similar detection rate for the two classes.
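The same check can be expressed as proportions instead of raw counts (a one-line variation using pandas):

# A balanced dataset shows roughly 0.5 for each class
dataset.sentiment.value_counts(normalize=True)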

Factoring Sentences into Words

We need to factor the sentences into words for tokenization and word embedding (see below for more on these processes).

from nltk.tokenize import word_tokenize

word_corpus = []
for text in dataset['review']:
    words = [word.lower() for word in word_tokenize(text)]
    word_corpus.append(words)

# Number of reviews in the corpus; this value is later reused as the tokenizer's vocabulary cap
numberof_words = len(word_corpus)
print(numberof_words)
--Output--
50000

Splitting the Dataset

The dataset needs to be split into training and test (80% and 20%).

train_size = int(dataset.shape[0] * 0.8)
X_train = dataset.review[:train_size]
y_train = dataset.sentiment[:train_size]
X_test = dataset.review[train_size:]
y_test = dataset.sentiment[train_size:]
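Note that slicing by position keeps the original row order, so this split assumes the reviews are not sorted by sentiment. As an alternative (not what the rest of this article uses), scikit-learn's train_test_split can shuffle and stratify the split:

from sklearn.model_selection import train_test_split

# Shuffled, stratified 80/20 split as an alternative to positional slicing
X_train, X_test, y_train, y_test = train_test_split(
    dataset.review, dataset.sentiment,
    test_size=0.2, random_state=42, stratify=dataset.sentiment)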

Tokenization

Tokenization is the process of breaking down a phrase, sentence, paragraph, or full text document into smaller components, such as words or terms.

We need to tokenize the words and pad the sequences so that every input has the same length. To do that, we will use the Tokenizer available in Keras.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=numberof_words)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_train = pad_sequences(X_train, maxlen=128, truncating='post', padding='post')

First, we pass numberof_words to the Tokenizer as its vocabulary limit; then fit_on_texts builds the tokenizer's internal vocabulary from the training texts. After the vocabulary is built, texts_to_sequences converts each text into a sequence of integer word indices.
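To make this concrete, here is a small standalone example (not part of the original pipeline) of what fit_on_texts and texts_to_sequences produce; the exact indices depend on word frequency and order of appearance:

toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["the movie was great", "the movie was boring"])
print(toy_tokenizer.word_index)
# {'the': 1, 'movie': 2, 'was': 3, 'great': 4, 'boring': 5}
print(toy_tokenizer.texts_to_sequences(["the movie was great"]))
# [[1, 2, 3, 4]]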

Now that we have tokenized the words in the X_train set, let's see what the first sequence looks like.

X_train[0], len(X_train[0])
--Output--
(array([   27,     4,     1,    80,  2102,    45,  1073,    12,   100,
          147,    39,   316,  2968,   409,   459,    26,  3173,    33,
           23,   200,    14,    11,     6,   614,    48,   606,    16,
           68,     7,     7,     1,    87,   148,    12,  3256,    68,
           41,  2968,    13,    92,  5626,     2, 16202,   134,     4,
          569,    60,   271,     8,   200,    36,     1,   673,   139,
         1712,    68,    11,     6,    21,     3,   118,    15,     1,
         7870,  2257,    38, 11540,    11,   118,  2495,    54,  5662,
           16,  5182,     5,  1438,   377,    38,   569,    92,     6,
         3730,     8,     1,   360,   353,     4,     1,   673,     7,
            7,     9,     6,   431,  2968,    14,    12,     6,     1,
        11736,   356,     5,     1, 14689,  6526,  2594,  1087,     9,
         2661,  1432,    20, 22583,   534,    32,  4795,  2451,     4,
            1,  1193,   117,    29,     1,  6893,    25,  2874, 12191,
            2,   392], dtype=int32), 128)

As you can see from the above output, all the words are now represented as integers, and each sequence has been padded or truncated to a length of 128. The words in X_train have been effectively tokenized.

Let’s do the same for the X_test set.

X_test = tokenizer.texts_to_sequences(X_test)
X_test = pad_sequences(X_test, maxlen=128, truncating='post', padding='post')

After tokenizing both the training and test sets, let's look at how many unique words the tokenizer has seen. This count is the size of the full vocabulary in the dataset; note that only the most frequent words, up to the cap we passed to the Tokenizer, are actually used in the sequences.

index = tokenizer.word_index
print("Count of unique words: {}".format(len(index)))
--Output--
Count of unique words: 112173

Word Embedding

A word embedding is a learned representation of text in which words with related meanings are assigned similar real-valued vectors, so that words close together in the vector space tend to have similar meanings.

During this implementation, we will be using the GloVe embeddings. You are free to try different word embedding tools, such as Word2Vec, FastText, etc.

First, we need to build a dictionary of all the words covered by the pre-trained GloVe embeddings. (The vectors used here, glove.twitter.27B.100d.txt, can be downloaded from the GloVe project page.)

import numpy as np

embeddings = {}
with open("glove.twitter.27B.100d.txt") as file:
    for line in file:
        # Each line is a word followed by its 100-dimensional vector
        val = line.split()
        word = val[0]
        vector = np.asarray(val[1:], 'float32')
        embeddings[word] = vector
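As a quick sanity check on the idea that related words get similar vectors, we can compare a few entries from the loaded dictionary using cosine similarity (a small sketch; the exact values depend on the GloVe file, and the chosen words are simply assumed to be in its vocabulary):

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors (closer to 1.0 means more similar)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings['good'], embeddings['great']))     # relatively high
print(cosine_similarity(embeddings['good'], embeddings['keyboard']))  # noticeably lower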

Secondly, we need to build a matrix of the words available in the dataset, filled with vectors from the embedding dictionary. This matrix will hold the pre-trained weights for those words.

emb_matrix = np.zeros((numberof_words, 100))
for i, word in tokenizer.index_word.items():
    if i < numberof_words:  # only indices below the vocabulary cap fit in the matrix
        vector = embeddings.get(word)
        if vector is not None:
            emb_matrix[i] = vector

Creating a weights matrix is needed for the embedding layer of the model. The embedding layer is the first hidden layer of our model, and all that layer does is map the integer inputs to the vectors found at the corresponding index in the embedding matrix.
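To see that mapping directly, you can compare a row of emb_matrix with the GloVe vector of the corresponding word (a small sanity check, assuming the chosen word falls within the 50,000-word cap and exists in the GloVe vocabulary):

word = 'movie'
idx = tokenizer.word_index[word]  # integer index assigned by the tokenizer
# The row at that index of emb_matrix should equal the pre-trained GloVe vector for the word
print(np.allclose(emb_matrix[idx], embeddings[word]))  # True if 'movie' was found in GloVe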

Creating a Model

Before creating the model, we need to encode the labels in the dataset using LabelEncoder().

from sklearn.preprocessing import LabelEncoder

labels = LabelEncoder()
y_train = labels.fit_transform(y_train)
y_test = labels.transform(y_test)

There are two ways to create a model using Keras.

  • Sequential API
  • Functional API

The Sequential API is suitable for this scenario because we are building the model one layer at a time.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.initializers import Constant

model = Sequential()
model.add(Embedding(input_dim=numberof_words, output_dim=100,
                    embeddings_initializer=Constant(emb_matrix),
                    input_length=128, trainable=False))
model.add(LSTM(100, dropout=0.1))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The layers I have applied are as follows:

  • 1st Layer — Embedding Layer : Maps each tokenized word index to its 100-dimensional pre-trained vector
  • 2nd Layer — LSTM Layer : Processes the embedded sequence with 100 hidden units and a dropout of 0.1
  • 3rd Layer — Dense Layer : A single neuron connected to all 100 LSTM outputs
  • Activation Function — Sigmoid Activation Function : Squashes the output to a value between 0 and 1
  • Loss Function — Binary Cross Entropy : Compares the predicted probability with the actual class label (0 or 1); a small numeric sketch of the last two items follows below
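To make the last two items concrete, here is a small numeric sketch (plain NumPy, not the Keras internals) of what the sigmoid and the binary cross-entropy loss compute for a single prediction:

import numpy as np

def sigmoid(z):
    # Squashes any real value into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_pred):
    # Penalizes confident wrong predictions far more than uncertain ones
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

p = sigmoid(2.0)                   # ~0.88, read as "88% positive"
print(binary_cross_entropy(1, p))  # small loss: prediction matches the positive label
print(binary_cross_entropy(0, p))  # large loss: prediction contradicts the negative label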
model.summary()
--Output--
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 128, 100) 5000000
_________________________________________________________________
lstm (LSTM) (None, 100) 80400
_________________________________________________________________
dense (Dense) (None, 1) 101
=================================================================
Total params: 5,080,501
Trainable params: 80,501
Non-trainable params: 5,000,000
_________________________________________________________________
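The parameter counts in the summary can be checked by hand: the frozen embedding layer stores one 100-dimensional vector per word of the 50,000-word vocabulary, the LSTM has four gates operating on the concatenated input and hidden state, and the dense layer is a single neuron over the 100 LSTM outputs.

vocab_size, emb_dim, lstm_units = 50000, 100, 100

embedding_params = vocab_size * emb_dim                               # 5,000,000 (non-trainable)
lstm_params = 4 * ((emb_dim + lstm_units) * lstm_units + lstm_units)  # 80,400
dense_params = lstm_units * 1 + 1                                     # 101
print(embedding_params + lstm_params + dense_params)                  # 5,080,501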

Now let’s train our model for 5 epochs. This is a small number, just to test out our approach. Generally speaking, you’ll want to train your models for more epochs to achieve better performance.

history = model.fit(X_train, y_train, epochs=5, batch_size=2048, validation_data=(X_test, y_test))
--Output--
Epoch 1/5
20/20 [==============================] - 55s 3s/step - loss: 0.6879 - accuracy: 0.5345 - val_loss: 0.6384 - val_accuracy: 0.6338
Epoch 2/5
20/20 [==============================] - 45s 2s/step - loss: 0.6074 - accuracy: 0.6816 - val_loss: 0.5314 - val_accuracy: 0.7451
Epoch 3/5
20/20 [==============================] - 45s 2s/step - loss: 0.5362 - accuracy: 0.7417 - val_loss: 0.5010 - val_accuracy: 0.7680
Epoch 4/5
20/20 [==============================] - 45s 2s/step - loss: 0.5080 - accuracy: 0.7597 - val_loss: 0.4718 - val_accuracy: 0.7783
Epoch 5/5
20/20 [==============================] - 45s 2s/step - loss: 0.4871 - accuracy: 0.7646 - val_loss: 0.4719 - val_accuracy: 0.7745

As you can see from the above output, when the number of epochs increases, the accuracy of the model also increases.

Plotting the Loss

import matplotlib.pyplot as plt

plt.figure(figsize=(16, 5))
epochs = range(1, len(history.history['accuracy']) + 1)
plt.plot(epochs, history.history['loss'], color='red', label='Training Loss')
plt.plot(epochs, history.history['val_loss'], color='blue', label='Validation Loss')
plt.legend()
plt.show()
Plot of loss

As you can see from the above plot, the training loss and validation loss both decrease as the number of epochs increases, which indicates a good-fit learning curve so far.

Plotting the Accuracies

plt.figure(figsize=(16, 5))
epochs = range(1, len(history.history['accuracy']) + 1)
plt.plot(epochs, history.history['accuracy'], color='red', label='Training Accuracy')
plt.plot(epochs, history.history['val_accuracy'], color='blue', label='Validation Accuracy')
plt.legend()
plt.show()
Plot of accuracy

As you can observe from the above plot, accuracy increases as the number of epochs increases, but training for too many epochs would eventually lead to overfitting. We can stop training at the point where training accuracy and validation accuracy level off.
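One common way to stop at that point automatically (not used in the run above, just a standard Keras option) is an EarlyStopping callback that monitors the validation loss:

from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss hasn't improved for 2 consecutive epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
history = model.fit(X_train, y_train, epochs=50, batch_size=2048,
                    validation_data=(X_test, y_test), callbacks=[early_stop])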

Validation

We can validate the model by feeding it an example sentence and checking whether or not it can determine the sentence's sentiment.

sentence = ['This movie was the worst. Acting was not good.']
sentence_tokened = tokenizer.texts_to_sequences(sentence)
sentence_padded = pad_sequences(sentence_tokened, maxlen=128, truncating='post', padding='post')
print(sentence[0])
print("Positivity of the sentence: {}".format(model.predict(sentence_padded)[0]))
--Output--
This movie was the worst. Acting was not good.
Positivity of the sentence: [0.14108086]

As you can see from the above output, our model identified the sentence as having a low positivity score, even though the word ‘good’ appears in it. You can choose thresholds of your own to label the output as positive or negative, for example scores below 0.4 as Negative, 0.4 to 0.6 as Neutral, and above 0.6 as Positive.
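A minimal helper for mapping the raw score to such labels could look like this (the thresholds are purely illustrative, taken from the example ranges above):

def label_sentiment(score):
    # Map the sigmoid output to a coarse label using illustrative thresholds
    if score < 0.4:
        return 'Negative'
    elif score < 0.6:
        return 'Neutral'
    return 'Positive'

print(label_sentiment(model.predict(sentence_padded)[0][0]))  # 'Negative' for the example above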
