Tweet Sentiment Extraction

Table of Contents

  1. Introduction
  2. Usage of ML/DL for this problem
  3. Data Overview
  4. Performance Metric
  5. Exploratory Data Analysis
  6. Usage of Deep Learning model to solve this problem
  7. Base Model
  8. Modified Base model
  9. Transformers and BERT
  10. TFRoBERTa model for Question Answering
  11. Further Improvements that can be done
  12. Code in GitHub
  13. Reference


Introduction

Sentiment analysis is the process of analyzing text data and categorizing it as positive, negative, or neutral. It is used in areas such as social media monitoring, customer service, brand monitoring, and political campaigns. Analyzing customer feedback such as social media conversations, product reviews, and survey responses lets companies understand their customers' emotions better, which is becoming essential to meeting their needs.

Usage of ML/DL for this problem

It is almost impossible to manually sort thousands of social media conversations, customer reviews, and surveys, so we use ML or DL to build a model that analyzes the text and performs the required operations. The problem I am trying to solve here is part of a Kaggle competition. We are given some text along with its sentiment (positive/negative/neutral), and we need to find the words or phrases that best support that sentiment.

Data Overview

The dataset used here is from the Kaggle competition Tweet Sentiment Extraction, and its tweets come from Figure Eight's Data for Everyone platform. It consists of two data files, train.csv and test.csv, with 27481 rows of training data and 3534 rows of test data.

List of columns in the dataset

textID: unique id for each row of data

text: this column contains text data of the tweet.

sentiment: the sentiment of the text (positive/negative/neutral)

selected_text: the words or phrases from the text that best support the sentiment

Performance Metric

The performance metric used in this problem is the word-level Jaccard score. The Jaccard score (or Jaccard similarity) measures how similar two sets are: it is the size of their intersection divided by the size of their union.

Example of Jaccard score for text data:

Sentence 1: AI is our friend and it has been friendly
Sentence 2: AI and humans have always been friendly
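To make the metric concrete, here is a minimal sketch of a word-level Jaccard function applied to the two sentences above. Lowercasing, splitting on whitespace, and treating two empty strings as identical are assumptions of this sketch, not details stated in the article.

```python
def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard score: |A ∩ B| / |A ∪ B| over the sets of words."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    shared = a & b
    return len(shared) / (len(a) + len(b) - len(shared))


s1 = "AI is our friend and it has been friendly"
s2 = "AI and humans have always been friendly"

# Shared words: ai, and, been, friendly (4). Distinct words overall: 12.
print(round(jaccard(s1, s2), 3))  # 0.333
```

A score of 1.0 means the two word sets are identical, and 0.0 means they share no words.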

Exploratory Data Analysis

From the above plot, we can conclude that the training and test data have similar sentiment distributions: the majority of points are neutral, followed by positive and then negative texts.

The above plot shows that, for all sentiments, most texts are between 0 and 20 words long, and only a few points are longer than 25 words.

The number of words in the selected_text column differs between sentiments. There is a clear spike in two of the three plots above, which suggests that most selected_text phrases are 0–10 words long and only a few are longer than 10 words.
For neutral sentiment, selected_text phrases tend to be longer than for the positive/negative labels, with most of them between 0 and 20 words long.

The above plot shows the most common words that are found in the text and selected_text columns for different sentiments.

There is a huge spike around 1.0 for neutral sentiment, which means that for most neutral rows the selected_text is the text itself, i.e. the text and selected_text values are the same.
For both positive and negative labels there are two spikes in the graph, one around 0.1–0.15 and the other around 1.0. The spike around 0.1–0.15 indicates that the similarity between text and selected_text is very low, i.e. only a few words or phrases of the text are chosen as selected_text.
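The similarity analysis described here can be sketched as follows, assuming the similarity being plotted is the word-level Jaccard score between text and selected_text. The three rows are hypothetical examples in the shape of train.csv, not rows taken from the dataset.

```python
from collections import defaultdict


def jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard score between two strings."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0
    shared = a & b
    return len(shared) / (len(a) + len(b) - len(shared))


# Hypothetical rows: (sentiment, text, selected_text)
rows = [
    ("neutral", "going to the store later", "going to the store later"),
    ("positive", "I am so happy today because I bought a new phone", "I am so happy"),
    ("negative", "without ac it is too hot to sleep", "too hot"),
]

scores = defaultdict(list)
for sentiment, text, selected_text in rows:
    scores[sentiment].append(jaccard(text, selected_text))

# Mean similarity between text and selected_text for each sentiment
mean_by_sentiment = {s: sum(v) / len(v) for s, v in scores.items()}
print(mean_by_sentiment)
```

On real data, a mean close to 1.0 for neutral rows would suggest that simply returning the whole text is a strong baseline for that sentiment.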

Usage of Deep Learning model to solve this problem

Recurrent Neural Networks (RNNs) are among the most widely used neural networks when the input is sequential or contextual data. A major advantage of an RNN over a normal feed-forward network is that it considers not only the input at the current timestep but also previous values when predicting the current output.

As you can see from the above image, at the first timestep X⁰ is passed as input to the model to get H⁰. In the next timestep, both X¹ and H⁰ are passed as input to the model, which gives H¹ as output. Unlike in normal neural networks, the inputs are related to each other through the hidden state. The most commonly used RNN variants are LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
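The recurrence can be illustrated with a tiny scalar RNN. This is only a sketch: the tanh activation, the scalar weights w_x and w_h, and the zero initial state are illustrative assumptions, and a real layer uses weight matrices learned during training.

```python
import math


def rnn_step(x_t: float, h_prev: float, w_x: float, w_h: float, b: float) -> float:
    """One timestep: the new state depends on the current input and the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run the recurrence over a sequence and return every hidden state."""
    h = 0.0  # assumed initial hidden state
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# Only the first input is non-zero, yet later states are still non-zero
# because each step also receives the previous hidden state.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```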

We will not dive deep into the internals of LSTM and GRU here, but I will give a broad overview of these RNN variants.


An LSTM network has 3 main gates:
1. Forget gate
2. Input gate
3. Output gate

C_(t-1): previous cell state
C_t: current cell state
h_(t-1): output from the previous timestep
h_t: output of the current timestep

Forget gate decides how much information needs to be retained from previous states and how much can be neglected
Input gate decides which new information will be added to the cell state
Output gate will decide what information will be passed to the network at the next instance
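For reference, the three gates are commonly written as follows. This is the standard textbook formulation rather than anything specific to this article; x_t (the input at time t), the weight matrices W, the biases b, and the candidate state \tilde{C}_t are notation introduced here.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C [h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(current cell state)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(current output)}
\end{aligned}
```

Here \sigma is the sigmoid function, \odot is element-wise multiplication, and [h_{t-1}, x_t] is the concatenation of the previous output and the current input.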


The major difference between LSTM and GRU is that a GRU has only 2 gates: the update gate and the reset gate.

The update gate is a combination of the input gate and the forget gate. It decides what information is retained and what information is added.
The reset gate decides how much of the previous state is forgotten when the new candidate state is computed.

A GRU has fewer gates than an LSTM, so it is computationally cheaper and faster.
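A common way to write the GRU update is shown below. As with the LSTM, this is the standard formulation and not taken from the article; x_t, the weights W, and the biases b are notation introduced here, and some references swap the roles of z_t and 1 - z_t in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z [h_{t-1}, x_t] + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r [h_{t-1}, x_t] + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h [r_t \odot h_{t-1}, x_t] + b_h\right) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(current output)}
\end{aligned}
```

There is no separate cell state: the GRU carries everything in h_t, which is one reason it has fewer parameters than an LSTM.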

Base Model:

Here we build a base RNN model using LSTM/GRU that takes the text and sentiment as input and gives selected_text as output.
The input is in text format and must be converted into numerical data before it can be passed to the model. We also need to perform data cleaning and preprocessing so that the data is ready for further operations.

Converting the text to integers and padding them to the same length

The text column contains text from various tweets, and each data point has a different length. So we convert the text into numbers and pad all the points so that every input has the same length. TensorFlow (Keras) provides the Tokenizer class and the pad_sequences function for these operations.
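The two operations can be sketched in plain Python to show what they produce. This is a simplified stand-in for Tokenizer and pad_sequences, not their implementation: the helper names are hypothetical, ids are assigned in order of first appearance (Keras orders them by word frequency), and padding and truncation happen at the end of each sequence (Keras pads and truncates at the front unless padding="post" and truncating="post" are passed).

```python
def fit_word_index(texts):
    """Map each word to an integer id starting at 1; id 0 is reserved for padding."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def texts_to_padded(texts, word_index, max_len):
    """Convert texts to lists of ids, truncated or zero-padded to max_len."""
    padded = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        ids = ids[:max_len]                              # truncate long inputs
        padded.append(ids + [0] * (max_len - len(ids)))  # pad short inputs
    return padded


texts = ["I am so happy", "too hot to sleep today"]
word_index = fit_word_index(texts)
padded = texts_to_padded(texts, word_index, max_len=6)
print(padded)  # [[1, 2, 3, 4, 0, 0], [5, 6, 7, 8, 9, 0]]
```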

Embedding Layer:

The embedding layer holds an embedding matrix in which each row is a high-dimensional vector representation of a word that is present in the training data.

Usually we use pretrained embedding vectors, since they are obtained by training on a large amount of data; we can simply download them and use them in our embedding layer.

The above is the architecture of the base model, where the inputs are the text and sentiment values. These values are passed to a GRU layer, and finally a Dense layer predicts the start and end indexes used to extract the selected_text from the text.

For example:

text: I am so happy today because I bought a new phone
sentiment: Positive
selected_text: I am so happy
Start index: 0
End index: 3
So if I provide this input to my model, the predicted output should be 0, 3 (the start and end word positions, counting from 0), which corresponds to the text 'I am so happy'.
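The target indexes for training can be derived from the text and selected_text columns. The sketch below assumes whitespace tokenization and an exact word-level match; find_span is a hypothetical helper name, not a function from the article's code.

```python
def find_span(text: str, selected_text: str):
    """Return (start, end) word indexes of selected_text inside text, or None."""
    words = text.split()
    target = selected_text.split()
    n = len(target)
    if n == 0:
        return None
    for start in range(len(words) - n + 1):
        if words[start:start + n] == target:
            return start, start + n - 1
    return None  # selected_text does not align with whole words of text


text = "I am so happy today because I bought a new phone"
print(find_span(text, "I am so happy"))  # (0, 3)
```

In real tweets the selected_text sometimes starts or ends in the middle of a word, so a character-level search may be needed as a fallback.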

Performance of Base Model

Using this base model, we got a Jaccard score of around 0.5. The model's performance can certainly be improved.

Modified Base Model:

Since our base model didn't perform as well as expected, we need to modify it to improve the performance. Instead of a normal LSTM/GRU, here we use a bidirectional LSTM/GRU. The advantage of a bidirectional LSTM over a normal one is that the network has both forward and backward information about the sequence at every timestep.
Bidirectional RNNs run the input in two directions, one from first to last and the other from last to first. They therefore preserve information from later parts of the input as well, whereas unidirectional RNNs preserve information only about previous inputs.

Also, in this model the output format is different from the base model. Here the output is a vector of length MAX_LEN (the maximum input length). Words that are part of selected_text are given a value of 1 and all others a value of 0.

For example :
text: I am so happy today because I bought a new phone
sentiment: Positive
selected_text: I am so happy
output vector: 1 1 1 1 0 0 0 0 0 0 0 (one value per word, then padded with 0 up to MAX_LEN)
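A minimal sketch of how such a target vector can be built and turned back into text is shown below. It assumes whitespace tokenization; MAX_LEN = 15 and the helper names are hypothetical choices for this example.

```python
MAX_LEN = 15  # hypothetical maximum input length


def span_to_mask(text: str, selected_text: str, max_len: int = MAX_LEN):
    """Return a 0/1 vector of length max_len marking the words of selected_text."""
    words = text.split()
    target = selected_text.split()
    mask = [0] * max_len
    n = len(target)
    if n == 0:
        return mask
    for start in range(len(words) - n + 1):
        if words[start:start + n] == target:
            for i in range(start, min(start + n, max_len)):
                mask[i] = 1
            break
    return mask


def mask_to_text(text: str, mask) -> str:
    """Recover the predicted phrase by keeping the words whose mask value is 1."""
    return " ".join(w for w, m in zip(text.split(), mask) if m == 1)


text = "I am so happy today because I bought a new phone"
mask = span_to_mask(text, "I am so happy")
print(mask)                      # four 1s followed by eleven 0s
print(mask_to_text(text, mask))  # I am so happy
```

At prediction time the model outputs a probability per position, so a threshold (for example 0.5) would be applied before calling mask_to_text.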

Performance of the Model

This modified Seq2Seq model performed better than the base model, with a Jaccard score of 0.6. On analyzing the errors, we noticed that the model performs well on neutral data, but not as well on positive and negative data points.

Transformers and BERT

Transformers are the state-of-the-art models for most NLP tasks. They use a concept called attention to handle the dependencies between inputs and outputs. The attention mechanism lets the model look at an input sentence and decide, at each step, which other parts of the sequence are important. One more important point is that Transformers do not use any RNNs. The original Transformer is a sequence-to-sequence architecture, i.e. the input is fed into an encoder, which processes it and passes the result to a decoder that predicts the output.

The Transformer is built from a stack of encoders and a stack of decoders; the original architecture has 6 encoders and 6 decoders. Each encoder has two sub-layers: a multi-headed self-attention layer and a fully connected feed-forward network. Each decoder has three sub-layers: a self-attention layer, an encoder-decoder attention layer, and a feed-forward network.

The Self-attention layer in the encoder helps to look at other words in the input sentence as it encodes a specific word.

The Encoder-Decoder Attention layer in the decoder helps it to focus on relevant parts of the input sentence

A small brief about attention:

Given a sentence like the one in the above image, what does the word "it" refer to? Does it refer to the animal, the street, or something else? This is easy for humans to understand, and the attention layer is what helps the model understand how a particular word is related to the other words in the text.

Q is a matrix that contains the query (the vector representation of one word in the sequence), K holds all the keys (the vector representations of all the words in the sequence), and V holds the values, which are again vector representations of all the words in the sequence.
Self-attention is computed not once but multiple times in the Transformer architecture, in parallel and independently. It is therefore referred to as multi-head attention.
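A single attention head can be sketched in plain Python using the scaled dot-product form from the original Transformer paper, softmax(Q·Kᵀ / sqrt(d_k))·V. The tiny 2-dimensional vectors below are made-up values chosen only to keep the example readable; real models use learned projections and much larger dimensions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of query, key and value vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


# Two "words", each represented by a 2-dimensional vector (made-up values)
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
print(out)  # each word attends mostly to itself
```

Multi-head attention runs several such computations in parallel on different learned projections of Q, K and V and concatenates the results.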

BERT (Bidirectional Encoder Representations from Transformers)

BERT (Bidirectional Encoder Representations from Transformers) was introduced in a paper by researchers at Google AI Language. BERT provides state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others.

The transformer encoder reads the entire sequence of words at once. Therefore it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).

The first input token is the special [CLS] token, where CLS stands for classification. The encoder units inside BERT are similar to the ones we saw earlier in the Transformer model.
BERT takes a sequence of words as input, which flows up through the encoder stack: the output of each encoder is passed to the next encoder.
Each position outputs a vector of size hidden_size (768 in BERT Base).

The BERT model can also be fine-tuned for various tasks such as sequence classification, sentence-pair classification, question answering, and named entity recognition. Since the introduction of BERT, several methods have been presented to improve it, either on its performance metrics or on its computational speed. Below is the table with details about the various models.

For this problem, I have used the RoBERTa (Robustly optimized BERT Approach) model. As RoBERTa is built on BERT, they share a lot of configuration. Below are a few things that differ between RoBERTa and BERT.

  • Reserved tokens: BERT uses [CLS] and [SEP] as the starting token and separator token respectively, while RoBERTa uses <s> and </s> to delimit sentences.
  • Size of subword vocabulary: BERT has around 30k subwords while RoBERTa has around 50k subwords.
  • Bigger training data (16G for BERT vs 161G for RoBERTa)
  • Training on longer sequences

Which Tokenization strategy is used by BERT?

BERT uses WordPiece tokenization. The vocabulary is initialized with all the individual characters in the language, and then the most frequent/likely combinations of the existing words in the vocabulary are iteratively added.

How does BERT handle OOV words?

Any word that does not occur in the vocabulary is broken down into sub-words greedily. For example, if play, ##ing, and ##ed are present in the vocabulary but playing and played are OOV words then they will be broken down into play + ##ing and play + ##ed respectively. (## is used to represent sub-words).
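The greedy splitting can be sketched as a longest-match-first loop. This is a simplified illustration using the toy vocabulary from the example above, not BERT's actual tokenizer code; falling back to a single [UNK] token when no piece matches is the usual WordPiece behaviour.

```python
def wordpiece_tokenize(word: str, vocab, unk_token: str = "[UNK]"):
    """Greedily split a word into the longest sub-words found in vocab."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no sub-word matched at this position
        tokens.append(piece)
        start = end
    return tokens


vocab = {"play", "##ing", "##ed"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
print(wordpiece_tokenize("played", vocab))   # ['play', '##ed']
```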

TFRoBERTa model for Question Answering

This model uses a RoBERTa tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.

How do we formulate our problem as a Question Answering task?

Since our problem is to get the selected_text from the given text, it can be converted into a Question Answering task where the question is the sentiment, the context is the text, and the answer is the selected_text.

For example:

Question: Positive

Context: I am so happy because I bought a new phone today

Answer: I am so happy
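For a pair of sequences, RoBERTa expects the layout <s> A </s></s> B </s>. The sketch below builds that layout for the example above, using whitespace-split words as stand-ins for real sub-word tokens. Putting the question first and the context second is an assumption of this sketch; the article does not state which order its code uses.

```python
def build_pair(question_tokens, context_tokens, bos="<s>", eos="</s>"):
    """Lay out a (question, context) pair the way RoBERTa expects a sequence pair."""
    return [bos] + question_tokens + [eos, eos] + context_tokens + [eos]


question = "positive"
context = "I am so happy because I bought a new phone today"

# Whitespace-split words stand in for the tokenizer's real sub-word tokens
tokens = build_pair(question.split(), context.split())
print(tokens)
```

The model then predicts a start and an end position inside the context part of this sequence, which together identify the answer.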

There are certain parameters we need to be aware of before building the model.

input_ids — Indices of input sequence tokens in the vocabulary.

The input ids are often the only required parameters to be passed to the model as input. They are token indices, numerical representations of tokens building the sequences that will be used as input by the model.

attention_mask — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

1 for tokens that are not masked,

0 for tokens that are masked.
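The relationship between input_ids and attention_mask can be shown with a small padding helper. The ids 0, 1 and 2 are the ids of <s>, <pad> and </s> in the RoBERTa vocabulary; the other ids (100, 200, 300) and the helper name are made up for illustration.

```python
PAD_ID = 1  # id of the <pad> token in the RoBERTa vocabulary


def pad_and_mask(token_ids, max_len, pad_id=PAD_ID):
    """Pad token ids to max_len and build the matching attention mask."""
    ids = token_ids[:max_len]  # note: truncation may drop the closing </s>
    n_pad = max_len - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# <s> (0), three made-up token ids, </s> (2)
input_ids, attention_mask = pad_and_mask([0, 100, 200, 300, 2], max_len=8)
print(input_ids)       # [0, 100, 200, 300, 2, 1, 1, 1]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```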

Loading Roberta tokenizer:

Getting input_ids and attention mask:

The input ids and the attention mask can be obtained easily using the tokenizer.encode_plus() function, which returns a dictionary-like object containing the input_ids and attention_mask entries.


Getting Start tokens and end tokens:

The input text is converted to tokens and the start and end positions of the tokens that correspond to the selected_text are used to formulate the output vector.

Example :

text: maybe used to have. besides without ac it`s too hot to sleep

selected_text: too hot

text to tokens: [‘<s>’, ‘ maybe’, ‘ used’, ‘ to’, ‘ have’, ‘.’, ‘ besides’, ‘ without’, ‘ ac’, ‘ it’, ‘`’, ‘s’, ‘ too’, ‘ hot’, ‘ to’, ‘ sleep’, ‘</s>’]

The RoBERTa tokenizer converts the given text into tokens as above; <s> and </s> denote the start and end of the given sentence.

start tokens:[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ………]

end tokens: [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0………]

The length of start and end tokens will be equal to the MAX_LENGTH value.

If you look at the list of tokens, the tokens at indices 12 and 13 (counting from 0) correspond to the selected_text "too hot". That is why index 12 is marked 1 in start tokens and index 13 is marked 1 in end tokens, with 0 everywhere else.
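The construction of the two target vectors can be sketched as follows, using the token list from the example. MAX_LENGTH = 20 and the helper name find_sublist are hypothetical; the token strings are copied from the example with plain quotes.

```python
MAX_LENGTH = 20  # hypothetical value for this example

tokens = ["<s>", " maybe", " used", " to", " have", ".", " besides", " without",
          " ac", " it", "`", "s", " too", " hot", " to", " sleep", "</s>"]
selected_tokens = [" too", " hot"]


def find_sublist(tokens, selected):
    """Return (start, end) indexes of the first occurrence of selected in tokens."""
    n = len(selected)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == selected:
            return i, i + n - 1
    raise ValueError("selected tokens not found in tokens")


start_idx, end_idx = find_sublist(tokens, selected_tokens)

# One-hot vectors of length MAX_LENGTH: a single 1 at the start / end position
start_tokens = [0] * MAX_LENGTH
end_tokens = [0] * MAX_LENGTH
start_tokens[start_idx] = 1
end_tokens[end_idx] = 1

print(start_idx, end_idx)  # 12 13
```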

Loading Pretrained RoBERTa for Question Answering Model:

Build the model:

The above is the model architecture: the inputs are input_ids and attention_mask, and the model outputs the start index and end index from which the selected_text can be extracted from the given text.
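Turning the model's two outputs back into text can be sketched like this. The scores are made-up numbers standing in for the model's start and end predictions, and falling back to the whole text when the predicted end comes before the start is an assumption of this sketch, not something the article specifies.

```python
def decode_span(tokens, start_scores, end_scores):
    """Pick the highest-scoring start and end positions and join the tokens between them."""
    start = max(range(len(start_scores)), key=start_scores.__getitem__)
    end = max(range(len(end_scores)), key=end_scores.__getitem__)
    if end < start:
        # Assumed fallback: return the whole text without <s> and </s>
        return "".join(tokens[1:-1]).strip()
    return "".join(tokens[start:end + 1]).strip()


tokens = ["<s>", " maybe", " used", " to", " have", ".", " besides", " without",
          " ac", " it", "`", "s", " too", " hot", " to", " sleep", "</s>"]

# Made-up scores: the start peaks at index 12, the end peaks at index 13
start_scores = [0.0] * len(tokens)
end_scores = [0.0] * len(tokens)
start_scores[12] = 0.9
end_scores[13] = 0.8

print(decode_span(tokens, start_scores, end_scores))  # too hot
```

Because byte-level BPE tokens carry their leading space, joining them with an empty string reconstructs the original spacing.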

Performance Metric

The TFRoBERTa for Question Answering model achieves a Jaccard score of 0.7, which is the best score among all the models tried here.

Further Improvements that can be done

Various other versions of BERT models can be tried to see whether any models give better performance (higher Jaccard score) than the present model. Also, we can try to ensemble multiple models to improve model performance.

Code Reference:





via Deep Learning on Medium

November 2, 2020 at 02:37AM
