Amazon Review Sentiment Analysis using BERT




Introduction

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. As of 2019, Google has been leveraging BERT to better understand user searches.

Basic Structure

In the BERT architecture, Transformer encoders are stacked one on top of another (12 encoder layers in BERT-base, 24 in BERT-large).
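You can see this stacking directly in the Hugging Face transformers implementation; a minimal sketch, assuming the standard bert-base-uncased checkpoint (which the article itself does not load at this point):

from transformers import BertModel

# load the pre-trained BERT-base checkpoint
model = BertModel.from_pretrained('bert-base-uncased')

# the encoder is literally a stack of identical layers
print(model.config.num_hidden_layers)  # 12
print(len(model.encoder.layer))        # 12 BertLayer modules, one per encoder block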

BERT is pre-trained on two NLP tasks:

  1. Masked Language Model
  2. Next Sentence Prediction

MLM teaches BERT to understand relationships between words — NSP teaches BERT to understand longer-term dependencies across sentences.

Masked Language Model

Need for Bi-directionality

BERT is designed as a deeply bidirectional model. The network effectively captures information from both the right and left context of a token from the first layer itself and all the way through to the last layer.

Traditionally, we had language models trained on a left-to-right context, predicting the next word in a sentence (the approach used in GPT), or trained on a right-to-left context. Either way, the model sees only one side of a token’s context, and this loss of information made such models susceptible to errors.

Let’s say we have a sentence — “I love to eat Pizza”. We want to train a bi-directional language model. Instead of trying to predict the next word in the sequence, we can build a model to predict a missing word from within the sequence itself.

Let’s replace “Pizza” with “[MASK]”. This is a token to denote that the token is missing. We’ll then train the model in such a way that it should be able to predict “Pizza” as the missing token: “I love to eat [MASK].”
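The transformers library exposes this masked-word objective through its fill-mask pipeline; a minimal sketch, assuming the standard bert-base-uncased checkpoint:

from transformers import pipeline

# the fill-mask pipeline runs BERT's masked-language-model head
unmasker = pipeline('fill-mask', model='bert-base-uncased')

# BERT ranks candidate tokens for the [MASK] position
for prediction in unmasker('I love to eat [MASK].'):
    print(prediction['token_str'], round(prediction['score'], 3))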

Next Sentence Prediction

Masked Language Models (MLMs) learn to understand the relationship between words. Additionally, BERT is also trained on the task of Next Sentence Prediction for tasks that require an understanding of the relationship between sentences.

NSP consists of giving BERT two sentences, sentence A and sentence B. We then say, ‘hey BERT, does sentence B come after sentence A?’ — and BERT says either IsNextSentence or NotNextSentence.

So let’s say we have three sentences:

  1. After finding the magic green orb, Dave went home.
  2. About 3.6 million years ago, human-like footprints were left in volcanic ash at Laetoli, northern Tanzania.
  3. Once home, Dave finished his leftover pizza and fell asleep on the couch.

If I asked you if you believe (logically) that sentence 2 follows sentence 1 — would you say yes? Probably not.

How about sentence 3 following sentence 1? Seems more likely.

It is this style of logic that BERT learns from NSP — longer-term dependencies between sentences.
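We can reproduce exactly this check with the NSP head that ships with the pre-trained model; a minimal sketch, again assuming bert-base-uncased:

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')

sentence_a = 'After finding the magic green orb, Dave went home.'
sentence_b = 'Once home, Dave finished his leftover pizza and fell asleep on the couch.'

# encode the pair; the tokenizer inserts [CLS]/[SEP] and sets the segment IDs
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
logits = model(**inputs).logits

# logit index 0 = IsNextSentence, index 1 = NotNextSentence
print('IsNextSentence' if torch.argmax(logits) == 0 else 'NotNextSentence')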

Text Preprocessing

  1. Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
  2. Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token. This allows the encoder to distinguish between sentences.
  3. Positional embeddings: A positional embedding is added to each token to indicate its position in the sequence.
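The first two pieces are visible directly in the tokenizer output (the positional embeddings are added internally by the model); a minimal sketch, assuming bert-base-uncased and two made-up sentences:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoding = tokenizer('I love to eat pizza.', 'It is my favourite food.')

# token embeddings: [CLS] sentence A [SEP] sentence B [SEP]
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))

# segment embeddings: 0 marks sentence A tokens, 1 marks sentence B tokens
print(encoding['token_type_ids'])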

I am going to use the nlptown/bert-base-multilingual-uncased-sentiment model for sentiment analysis.

This is a bert-base-multilingual-uncased model fine-tuned for sentiment analysis on product reviews in six languages: English, Dutch, German, French, Spanish, and Italian. It predicts the sentiment of a review as a number of stars (between 1 and 5).

This model is intended for direct use as a sentiment analysis model for product reviews in any of the six languages above, or for further fine-tuning on related sentiment analysis tasks.
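To get a feel for the model before wiring it into the full workflow, the sentiment-analysis pipeline works out of the box; a minimal sketch (the review texts are made up, and the second one is German to show the multilingual behaviour):

from transformers import pipeline

classifier = pipeline('sentiment-analysis',
                      model='nlptown/bert-base-multilingual-uncased-sentiment')

# the model returns star labels such as '1 star' ... '5 stars'
print(classifier('Great phone, the camera is amazing!'))
print(classifier('Das Handy ist leider nach zwei Tagen kaputtgegangen.'))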

We have to follow two steps:

  1. Web Scraping
  2. Implement BERT

Web Scraping

The BeautifulSoup library and the Splash rendering service will help us scrape the reviews from the webpage.

Install Docker and pull the Splash image on your system (docker pull scrapinghub/splash).

Run Splash in Docker so that it listens on port 8050, which the script below expects (docker run -p 8050:8050 scrapinghub/splash), then run the following script.

import requests
from bs4 import BeautifulSoup
import pandas as pd

reviewlist = []

def get_soup(url):
    # render the page through the local Splash instance so JavaScript content loads
    r = requests.get('http://localhost:8050/render.html', params={'url': url, 'wait': 2})
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def get_reviews(soup):
    reviews = soup.find_all('div', {'data-hook': 'review'})
    try:
        for item in reviews:
            review = {
                'product': soup.title.text.replace('Amazon.in:Customer reviews:', '').strip(),
                'title': item.find('a', {'data-hook': 'review-title'}).text.strip(),
                'rating': float(item.find('i', {'data-hook': 'review-star-rating'}).text.replace('out of 5 stars', '').strip()),
                'body': item.find('span', {'data-hook': 'review-body'}).text.strip(),
            }
            reviewlist.append(review)
    except AttributeError:
        # skip reviews with missing fields rather than crashing the crawl
        pass

for x in range(1, 999):
    soup = get_soup(f'https://www.amazon.in/Samsung-Galaxy-Storage-Additional-Exchange/product-reviews/B086KFBNV5/ref=cm_cr_getr_d_paging_btm_prev_2?ie=UTF8&reviewerType=all_reviews&pageNumber={x}')
    print(f'Getting page: {x}')
    get_reviews(soup)
    print(len(reviewlist))
    # a disabled "Next" button means we have reached the last page
    if not soup.find('li', {'class': 'a-disabled a-last'}):
        pass
    else:
        break

df = pd.DataFrame(reviewlist)
df.to_excel('Samsung_Galaxy_Z.xlsx', index=False)
print('Fin.')

We will get an Excel file containing the reviews.

Implement BERT

For installation:

!pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio===0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
!pip install transformers requests beautifulsoup4 pandas numpy

Import Libraries

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests
from bs4 import BeautifulSoup
import re

Instantiate Model

tokenizer = AutoTokenizer.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')
model = AutoModelForSequenceClassification.from_pretrained('nlptown/bert-base-multilingual-uncased-sentiment')

Load Reviews into DataFrame and Score
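First, load the reviews scraped earlier; a minimal sketch, assuming the Excel file saved by the scraping script above:

import pandas as pd

# read the reviews saved by the scraping step
df = pd.read_excel('Samsung_Galaxy_Z.xlsx')

Then define a function that scores a single review: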

def sentiment_score(review):
    # tokenize the review and run it through the fine-tuned classifier
    tokens = tokenizer.encode(review, return_tensors='pt')
    result = model(tokens)
    # logit indices 0-4 correspond to 1-5 stars, so add 1 to the argmax
    return int(torch.argmax(result.logits)) + 1

The code above returns the predicted star rating as a number in the range 1–5.
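As a quick sanity check, you can score a single review directly (the example text is made up):

print(sentiment_score('I loved this phone, the battery life is great!'))  # a high rating, e.g. 5

Now apply the function to every scraped review: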

# score the first 46 characters of each review, keeping the input well within BERT's 512-token limit
df['sentiment'] = df['body'].apply(lambda x: sentiment_score(x[:46]))

Notebook link

Output

Thank you, Nicholas Renotte, for your guidance. Do watch Nicholas’s channel for amazing content: link

My GitHub link

My LinkedIn profile link

My email id: pratikmpatil12@gmail.com

References

https://www.youtube.com/channel/UC8tgRQ7DOzAbn9L7zDL8mLg

https://towardsdatascience.com/bert-for-next-sentence-prediction-466b67f8226f
