Toxicity of the comment By Jigsaw



Source: Google Images

In this blog post, I walk through a complete case study of the Jigsaw Unintended Bias in Toxicity Classification competition.


The Conversation AI team, a research initiative founded by Jigsaw and Google, builds technology to protect voice in a conversation. A main area of study is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful, or otherwise likely to make someone leave a discussion.

Business Problem:

Jigsaw Unintended Bias in Toxicity Classification challenged the Kaggle community to build toxicity models that operate fairly across a diverse range of conversations. Cyberbullying, sexual harassment, vulgar comments, and similar abuse are increasing as social media consumption grows day by day. The challenge of the competition is to classify toxic comments while minimizing unintended bias toward the identities mentioned in them. The main purpose is to use such a model to decrease the number of such crimes on social media.

ML Formulation:

The task is to predict the toxicity of the given comments using ML/DL models: find patterns in the text data, classify each comment as toxic or non-toxic, and also label the subgroups.


Data Source: Kaggle Competition

Three files are available.

The training data contains 1.8M+ comments, along with the following columns.

  • Comment_text: the comment itself (text data, string type); this is the only input feature for the model.
  • Target: the toxicity of the associated comment, ranging from 0.0 to 1.0; our main aim is to predict this column.
  • Id: unique id associated with a comment
  • Some toxicity subtypes are also given with the train data: severe_toxicity, obscene, threat, insult, identity_attack, sexual_explicit. We don’t need to predict them, but they are provided for research.
  • Additionally, a subset of comments is labeled with identity attributes, representing the identities that are mentioned in the comment. Here we have only those identity columns with more than 500 examples.
  • Some metadata from Jigsaw’s annotations is provided: toxicity_annotator_count and identity_annotator_count.
  • And some metadata from Civil Comments: created_date, publication_id, parent_id, article_id, rating, funny, wow, sad, likes, disagree

How did they calculate toxicity?

The same comment was given to 10 annotators, who were asked to “Rate the toxicity of this comment”; their ratings were then aggregated into the target value.
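As a rough illustration of that aggregation, the sketch below assumes the target is the fraction of annotators who rated a comment as toxic. The function name and the 0/1 vote encoding are hypothetical, chosen only for this example.

```python
def aggregate_toxicity(votes):
    """Return the fraction of annotators who marked the comment toxic.

    `votes` is a list of 0/1 flags, one per annotator (hypothetical encoding).
    """
    if not votes:
        raise ValueError("at least one annotator vote is required")
    return sum(votes) / len(votes)


# Ten annotators, three of whom rated the comment as toxic.
target = aggregate_toxicity([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
is_toxic = target >= 0.5  # binarize at 0.5 to get a toxic / non-toxic label
```

With three toxic votes out of ten, the target is 0.3, so this comment would count as non-toxic after binarization.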

List of given identity columns:

  1. Male
  2. Female
  3. Transgender
  4. Other_gender
  5. Heterosexual
  6. Homosexual_gay_or_lesbian
  7. Bisexual
  8. Other_sexual_orientation
  9. Christian
  10. Jewish
  11. Muslim
  12. Hindu
  13. Buddhist
  14. Atheist
  15. Other_religion
  16. Black
  17. White
  18. Asian
  19. Latino
  20. Other_race_or_ethnicity
  21. Physical_disability
  22. Intellectual_or_learning_disability
  23. Psychiatric_or_mental_illness
  24. Other_disability

These columns also take values in the range 0.0 to 1.0. We are not going to predict them either, but they help with our research.


In test data, we have two columns.

  • Comment_text: the comment itself (text data, string type); the input to the model.
  • Id: unique id associated with a comment


In the sample_submission.csv we have two columns.

  • Id: unique id associated with a comment
  • Prediction: our prediction of toxicity

Performance Metrics:

Selecting performance metrics is not easy for any machine learning problem. I tried a combination of ROC-AUC and F1-score. I mainly use the F1-score, but instead of a fixed 0.5 threshold for classifying a comment as positive or negative, I use a threshold chosen from the ROC curve, which gives me good results.
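One common way to pick a threshold from the ROC curve is to maximize TPR minus FPR (Youden's J). The post does not say which rule was used, so the following is only a sketch under that assumption, written with NumPy and toy data; the helper names are hypothetical.

```python
import numpy as np


def best_roc_threshold(y_true, y_score):
    """Pick the score threshold that maximizes TPR - FPR (Youden's J)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    best_thr, best_j = 0.5, -1.0
    for thr in np.unique(y_score):
        pred = y_score >= thr
        tpr = (pred & y_true).sum() / max(y_true.sum(), 1)
        fpr = (pred & ~y_true).sum() / max((~y_true).sum(), 1)
        if tpr - fpr > best_j:
            best_thr, best_j = float(thr), tpr - fpr
    return best_thr


def f1_at_threshold(y_true, y_score, thr):
    """F1 score after binarizing the scores at `thr`."""
    y_true = np.asarray(y_true, dtype=bool)
    pred = np.asarray(y_score, dtype=float) >= thr
    tp = (pred & y_true).sum()
    fp = (pred & ~y_true).sum()
    fn = (~pred & y_true).sum()
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Toy, imbalanced data: most comments are non-toxic.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.40]
thr = best_roc_threshold(y_true, y_score)
f1 = f1_at_threshold(y_true, y_score, thr)
```

On this toy data the selected threshold is 0.35 and the F1-score is 0.8, whereas a fixed 0.5 threshold predicts no positives at all and gives an F1-score of 0. This is the kind of gap that makes a tuned threshold attractive on imbalanced data.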

Let’s look at the EDA part to get a better understanding of the data.

Exploratory Data Analysis:

First, we will look at some examples from the given data.

Who is the jerk in the last row between the C & E?

Yeah, too bad … Oregon Live has 100 times the comments, including the ridiculous and offensive to the sublime. This is a classic Eugene over-sensitive reaction. The software is crap (is that civil, coming from a professional software engineer?).\n\n– Paul

Prosecute the bastards!

The examples above are positive (toxic) examples picked from the train dataset.

I understand that meth get a hold and then people kiss their lives goodbye. Tragic.

Nice to some attempts to try to make comments better—it feels like any innovation in commenting communities ended with the launch of Disqus nearly a decade ago.

If I told you I wanted Bernie to win because testicles, how would you react? \n\nFeel free to vote your emotions, but don’t expect many congratulations for it. I’ll help you out though. You’re almost certainly a Clinton voter. They are the only ones who find these two candidates even remotely interchangeable.

The examples above are negative (non-toxic) examples picked from the train dataset.

Target feature:

The data comes mainly from social platforms, and since most comments posted on social media are non-toxic, the dataset is highly imbalanced.

Target-probability distribution

From the PDF above, we can see that most target values are 0 or near 0.

Count plot

In the plot above, we can see that the data is highly imbalanced.

I also checked whether the number of capital characters, emojis, punctuation marks, and spaces, as well as the comment length, matter to the target feature.

Comment Length PDF

In the PDF of comment length above, toxic and non-toxic comments look the same.

Since these features are numerical and the target feature is categorical, I used a t-test to check the relation, but these features didn’t help me.
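To make the check concrete, here is a minimal sketch that extracts a few surface features and compares toxic and non-toxic comments with Welch's t statistic, using only the standard library. The comments are made up, the helper names are hypothetical, and the post does not say which t-test variant was used; in practice `scipy.stats.ttest_ind` would also return a p-value.

```python
import math
import statistics
import string


def text_features(comment):
    """Simple surface features of a comment."""
    return {
        "length": len(comment),
        "capitals": sum(ch.isupper() for ch in comment),
        "punctuation": sum(ch in string.punctuation for ch in comment),
        "spaces": comment.count(" "),
    }


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        var_a / len(a) + var_b / len(b)
    )


# Toy comments, not taken from the dataset.
toxic = ["You are an IDIOT!!!", "Shut up, you fool.", "What a STUPID idea!"]
non_toxic = ["Thanks for sharing this.", "I agree with the article.", "Nice work, well done!"]

t_stat = welch_t(
    [text_features(c)["capitals"] for c in toxic],
    [text_features(c)["capitals"] for c in non_toxic],
)
```

A t statistic close to zero on the real data would match the observation that these features carry little signal about the target.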

Next, I checked the most frequent words separately in positive and negative comments, and both have similar words. I tried the same for rare words, with the same result.

Wordcloud plot

The word cloud above is plotted from the positive comments; many of its most frequent words also occur in the negative comments.

Data Preprocessing:

There are lots of features that are useful for our model and lots that are not, so we will try to remove the ones that are not useful.

First of all, I handled punctuation. Some punctuation marks were useful and some were not, so I added a space before and after each useful punctuation mark and removed the rest.

Ex. “Hi, How are you?” -> “Hi , How are you ?”

I added the spaces so that during tokenization “Hi” and “,” are treated as separate tokens.
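The punctuation step can be sketched with two regular expressions. Which symbols are kept and which are dropped is an assumption made for this example (the `KEEP` and `DROP` sets are hypothetical), since the post does not list them.

```python
import re

# Hypothetical split: punctuation worth keeping vs. punctuation to drop.
KEEP = ".,!?"
DROP = "#*~|"


def space_punctuation(text):
    """Drop unwanted symbols and pad the kept punctuation with spaces."""
    text = re.sub(f"[{re.escape(DROP)}]", " ", text)
    text = re.sub(f"([{re.escape(KEEP)}])", r" \1 ", text)
    # Collapse the extra whitespace introduced by the two substitutions.
    return re.sub(r"\s+", " ", text).strip()


spaced = space_punctuation("Hi, How are you?")  # "Hi , How are you ?"
```

After this step, a plain whitespace split already yields "Hi" and "," as separate tokens.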

Next, I tried to handle contractions as below:

can’t -> can not

haven’t -> have not

I handled these kinds of words using TreebankWordTokenizer from the NLTK library.

Finally, I removed unusual single quotes.
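For illustration, the same two steps (normalizing unusual quotes and expanding contractions) can be sketched with a small lookup table. This is a dictionary-based alternative, not the TreebankWordTokenizer approach mentioned above, and the mapping and quote list are hypothetical and far from complete.

```python
import re

# Small illustrative mapping; a real one would cover many more contractions.
CONTRACTIONS = {"can't": "can not", "haven't": "have not", "won't": "will not"}
# Quote-like characters that often appear in place of a plain apostrophe.
ODD_QUOTES = "’‘`´"


def expand_contractions(text):
    """Normalize apostrophes, then replace known contractions."""
    for quote in ODD_QUOTES:
        text = text.replace(quote, "'")
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


expanded = expand_contractions("I can’t believe they haven’t replied")
```

Normalizing the quotes first matters because a typographic apostrophe would otherwise stop "can’t" from matching the table.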

Loss function:

The loss function matters a lot when training models, because weight updates are driven by the loss. A good loss helps the model converge in the right direction.

I implemented the loss function with the help of a GitHub repo.
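The formula itself is not reproduced here, so the following is not the author's loss. As a generic illustration of how a loss can be shaped for this kind of problem, here is a sketch of binary cross-entropy with per-sample weights, assuming some comments (for example, those mentioning an identity) should count more; the weights are hypothetical.

```python
import numpy as np


def weighted_bce(y_true, y_pred, sample_weight, eps=1e-7):
    """Binary cross-entropy averaged with per-sample weights."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log(0) never occurs.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    w = np.asarray(sample_weight, dtype=float)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float((w * bce).sum() / w.sum())


y_true = [1.0, 0.0, 0.0]
y_pred = [0.9, 0.2, 0.6]
# Hypothetical weights: the third comment mentions an identity, so it counts double.
loss = weighted_bce(y_true, y_pred, sample_weight=[1.0, 1.0, 2.0])
```

Because the third prediction is the worst one, doubling its weight raises the loss compared with uniform weights, which pushes training to fix that kind of mistake first.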


Whatever data we use, whether image, text, audio, or tabular, it must be transformed into numbers before the model can understand it.

So, I used two different tokenizers to tokenize the text data.

The first tokenizer comes from Hugging Face’s transformers library.

It converts text data to sequence data and is specifically useful for the BERT model.

The other tokenizer comes from TensorFlow.

Here we have to fit the tokenizer on our data; it builds a vocabulary and generates sequence data. The Keras tokenizer returns sequences of different lengths, so I use pad_sequences to convert them to the same length.
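To show what the fit, convert, and pad steps do, here is a simplified pure-Python sketch. It is not the Keras API: the function names are hypothetical, it only lowercases and splits on whitespace, and the left-padding and left-truncation mirror what I understand to be the `pad_sequences` defaults.

```python
from collections import Counter


def fit_vocab(texts):
    """Map each word to an integer id, most frequent first (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, vocab):
    """Replace known words by their ids and drop unknown words."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]


def pad(sequences, maxlen):
    """Left-pad with zeros, or keep the last `maxlen` ids, so all rows have equal length."""
    return [[0] * (maxlen - len(s)) + s[-maxlen:] for s in sequences]


texts = ["the comment is fine", "the comment is rude and the tone is bad"]
vocab = fit_vocab(texts)
padded = pad(texts_to_sequences(texts, vocab), maxlen=6)
```

The short comment is padded with two zeros on the left and the long one is cut down to its last six ids, so both rows can be stacked into one batch.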

Now, finally, let’s build the model.

Before looking at the model structure, let’s understand how some of the layers work.

Layers Understanding:

Embedding Layer:

The embedding layer represents each word as a vector that captures its relation to other words, so that similar words get similar representations.

This layer is part of the neural network and tries to place similar words in similar directions. Each word is mapped to a vector.

Embedding Layer

In the image above, the input is 237 words and the output is a 600-dimensional vector for each word. The output size can be anything: 50, 100, 300, etc.

In our scenario, I didn’t train the embedding layer; I used 2 pre-trained embeddings, each with an output size of 300. One is FastText, with 1M word vectors, and the other is GloVe, which was trained on 1.9M data.
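Two 300-dimensional embeddings giving a 600-dimensional vector per word suggests that the two sets of vectors are concatenated. A minimal sketch under that assumption, with toy 3-dimensional vectors standing in for the real FastText and GloVe files (all names and values here are hypothetical):

```python
import numpy as np

DIM = 3  # toy size; the pre-trained vectors in the post have 300 dimensions each

# Hypothetical stand-ins for vectors loaded from FastText and GloVe files.
fasttext = {"good": np.ones(DIM), "bad": -np.ones(DIM)}
glove = {"good": np.full(DIM, 0.5)}
word_index = {"good": 1, "bad": 2, "unseen": 3}  # id 0 is reserved for padding


def build_matrix(word_index, vectors, dim):
    """One row per word id; words missing from `vectors` stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Concatenate along the feature axis: each word gets a 2 * DIM vector.
embedding_matrix = np.concatenate(
    [build_matrix(word_index, fasttext, DIM), build_matrix(word_index, glove, DIM)],
    axis=1,
)
```

A matrix like this is typically passed as the initial weights of a frozen embedding layer, so the pre-trained vectors are used as-is during training.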

Bidirectional LSTM:

LSTM layers are used for sequence data such as text and time series.

An LSTM layer remembers earlier inputs; in text data, the previous words matter a lot for the next step.

The image above shows how an LSTM memorizes old data.

A bidirectional LSTM memorizes data from both ends of the sequence and combines the two outputs.
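The bidirectional idea can be sketched in NumPy: run one recurrence left to right, another right to left, and concatenate the two outputs at every time step. To keep the example short it uses a plain tanh recurrence instead of real LSTM gates, and all sizes and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)


def simple_rnn(x, w_in, w_rec):
    """Run a plain tanh recurrence over time and return every hidden state."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)


steps, features, units = 5, 4, 3
x = rng.normal(size=(steps, features))
w_in_f, w_rec_f = rng.normal(size=(features, units)), rng.normal(size=(units, units))
w_in_b, w_rec_b = rng.normal(size=(features, units)), rng.normal(size=(units, units))

forward = simple_rnn(x, w_in_f, w_rec_f)               # reads left to right
backward = simple_rnn(x[::-1], w_in_b, w_rec_b)[::-1]  # reads right to left, re-aligned
bidirectional = np.concatenate([forward, backward], axis=-1)  # shape (steps, 2 * units)
```

Each time step now carries context from the words before it and the words after it, which is why the output width is twice the number of units.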


Attention Layer:

I used an attention layer to get weights for the words.


In this layer, I initialize a weight, multiply it with the input, and apply the sigmoid activation function. The resulting weights are summed over the time steps and used as attention.
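A minimal NumPy sketch of this kind of attention pooling follows. It assumes a single weight vector, a sigmoid score per time step, and normalization of the scores so they sum to 1; the normalization and all names are assumptions, not the author's exact layer.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention_pool(x, w, b=0.0):
    """Collapse (steps, features) into one feature vector using learned weights."""
    scores = sigmoid(x @ w + b)        # one score per time step
    weights = scores / scores.sum()    # normalize so the weights sum to 1
    return weights @ x, weights        # weighted sum over time steps


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))   # 5 time steps, 4 features (e.g. LSTM outputs)
w = rng.normal(size=4)        # would be a trainable parameter in a real layer
context, weights = attention_pool(x, w)
```

Time steps with a higher score contribute more to the pooled vector, which is how the layer can emphasize the words that matter most for toxicity.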

Model Building:

I built a total of 4 models: 3 LSTM-based models and one fine-tuned BERT.

1. Fine-tuned BERT

Here I used the pre-trained model from Hugging Face.

  • input layer and attention layer
  • pre-trained BERT layer
  • dense layer with an output size of 128
  • dropout with a rate of 0.1
  • output 1 with an output size of 1
  • output 2 from the BERT layer with an output size of 8
  • merge output 1 and output 2

Trained for 14 epochs with a batch size of 32 on a Tesla V100.

2. Stacked-LSTM

  • input layer
  • embedding layer: FastText + GloVe
  • spatial 1D dropout layer
  • two bidirectional LSTM layers
  • global max-pooling and average-pooling
  • merged max-pooling and average-pooling outputs
  • dense layer with a size of 512 (dense1)
  • skip layer: dense1 output + max- and average-pooling output
  • dense layer with a size of 512 (dense2)
  • skip layer: dense1 + dense2
  • two outputs, one of size 1 and the other of size 16
  • merged both outputs

Trained for 6 epochs with a batch size of 256, using early stopping.
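The pooling and skip steps from the list above can be sketched in NumPy. This assumes the skip connection is an element-wise addition, which requires the dense layer to keep the width of the pooled vector; the sizes are toy values, not the 512 units used in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
lstm_out = rng.normal(size=(2, 7, 8))  # (batch, time steps, features), toy sizes

max_pool = lstm_out.max(axis=1)    # global max-pooling over time
avg_pool = lstm_out.mean(axis=1)   # global average-pooling over time
pooled = np.concatenate([max_pool, avg_pool], axis=1)  # shape (2, 16)


def dense_relu(x, w, b):
    """Fully connected layer followed by ReLU."""
    return np.maximum(x @ w + b, 0.0)


# For the skip connection, the dense layer must keep the input width.
width = pooled.shape[1]
w1, b1 = rng.normal(size=(width, width)), np.zeros(width)
hidden = pooled + dense_relu(pooled, w1, b1)  # skip: input added to the dense output
```

Adding the input back gives the later layers direct access to the pooled features, even if the dense layer learns little.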

3. Bi-LSTM-Attention

  • input layer
  • Embedding layer Fasttext + Glove
  • Spatial 1D dropout layer
  • Bi-directional LSTM
  • Spatial 1D dropout layer
  • Bi-directional LSTM
  • ReLU applied per time step on the LSTM output
  • Attention layer
  • Dropout with 0.5 rate
  • Dense layer with 512 size
  • Dropout with 0.5 rate
  • Dense layer with 512 size
  • Dropout with 0.5 rate
  • two outputs, one of size 1 and the other of size 16
  • merged both outputs

Trained for 6 epochs with a batch size of 256, using early stopping.

4. Bi-LSTM V-Attention

This model is similar to the 2 models above; I combined max-pooling, average-pooling, and the attention layer in one model.

In this model, I added 2 attention layers, one with each LSTM layer.

Trained for 4 epochs with a batch size of 256, using early stopping.


  1. BERT : 0.93063
  2. Stacked-LSTM : 0.92997
  3. Bi-LSTM-Attention: 0.92981
  4. Bi-LSTM V-Attention: 0.92750

The final ensemble result is 0.93622.
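The post does not say how the four models were combined. A simple and common option is a (weighted) average of their predictions, sketched below with made-up scores and weights; the `blend` helper is hypothetical.

```python
import numpy as np


def blend(predictions, weights=None):
    """Weighted average of per-model predictions (rows = models)."""
    predictions = np.asarray(predictions, dtype=float)
    return np.average(predictions, axis=0, weights=weights)


# Hypothetical toxicity scores from four models for three comments.
preds = [
    [0.91, 0.10, 0.40],
    [0.85, 0.05, 0.55],
    [0.88, 0.12, 0.50],
    [0.80, 0.20, 0.45],
]
equal = blend(preds)
weighted = blend(preds, weights=[0.4, 0.2, 0.2, 0.2])  # illustrative weights only
```

Averaging tends to help most when the models make different mistakes, which is one reason to mix a BERT model with LSTM-based ones.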

After the final submission, my final standing was 632 out of 3116.

Final Submission

I deployed the Model on GCP and here is the demo video of the model prediction.

Future work:

  • Train BERT with a batch size of 128 or 256; 256 is preferable
  • Try other NLP models like GPT2, XLNet, etc.

If you have read this far, you must have found something interesting in this blog, so feel free to add comments, feedback, or suggestions.

Thank you so much for reading my blog.



