Sentiment Analysis of Amazon Product Reviews




Hey folks, in this article I'll walk you through sentiment analysis of Amazon Electronics product reviews.


Contents:

  • What is sentiment analysis?
  • Download dataset
  • Analyze the dataset
  • Data pre-processing
  • Apply different machine learning algorithms
  • Training the dataset
  • Prediction
  • Confusion matrix
  • Plot ROC Curve

What is sentiment analysis?

Sentiment analysis uses natural language processing, text analysis, and statistics to analyze customer sentiment. The best businesses understand the sentiment of their customers — what people are saying, how they’re saying it, and what they mean. Customer sentiment can be found in tweets, comments, reviews, or other places where people mention your brand. Sentiment Analysis is the domain of understanding these emotions with software, and it’s a must-understand for developers and business leaders in a modern workplace.

As with many other fields, advances in deep learning have brought sentiment analysis to the foreground of cutting-edge algorithms. Today we use natural language processing, statistics, and text analysis to extract and classify the sentiment of text as positive, negative, or neutral.

Download dataset

Before we move forward, let's download the dataset used in this project.

You can download the dataset from here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Electronics_5.json.gz
The compressed download is about 1.2GB.

The dataset is gzipped, so you need to unzip it on your machine; uncompressed, it is around 2.5GB.

The file is probably too large to open in Microsoft Excel.

If you still want to inspect it, you can use the Delimit software instead. Download link: http://delimitware.com/download.html

Delimit: Handle large delimited data files with ease.

Let’s analyze the dataset

The dataset contains the following columns:

reviewerID — ID of the reviewer, e.g. A2SUAM1J3GNN3B
asin — ID of the product, e.g. 0000013714
reviewerName — name of the reviewer
vote — helpful votes of the review
style — a dictionary of the product metadata, e.g., “Format” is “Hardcover”
reviewText — text of the review
overall — rating of the product
summary — summary of the review
unixReviewTime — time of the review (unix time)
reviewTime — time of the review (raw)
image — images that users post after they have received the product

The dataset has many fields, but for sentiment analysis we only need the review text and the rating.

Step 1: Import the libraries we use in this project

import numpy as np
import pandas as pd
import random
import os
import json
import sys
import gzip
from collections import defaultdict
import csv
import time

#nltk libraries and packages
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn

#Ml related libraries
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression as LR
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection, naive_bayes, svm
from sklearn.tree import DecisionTreeClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import metrics

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score as AUC

Step 2: Read the dataset and build a filtered data frame

After reading the raw JSON, we create a dataset that contains the id, review text, and rating of each product for sentiment analysis.

#reading the json file into a list
values = []
with open("Electronics_5.json", "r") as f:
    for i in f:
        values.append(json.loads(i))
print(values[:5])

We save our filtered dataset to the Electronic_review.csv file.
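
The original post does not show the code that filters the raw records down to id, review text, and rating before writing Electronic_review.csv. A minimal sketch of that step, assuming we keep the reviewerID, reviewText, and overall fields, could look like this:

# minimal sketch (assumed): keep only the fields we need and write them to CSV
with open("Electronic_review.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for record in values:
        writer.writerow([
            record.get("reviewerID", ""),   # id of the reviewer
            record.get("reviewText", ""),   # text of the review
            record.get("overall", ""),      # 1-5 star rating
        ])

No header row is written here, which matches the header=None argument used when the file is read back below.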

Now we read the Electronic_review data into a data frame.

#read the dataset into a df
colnames = ["id", "text", "overall"]
df = pd.read_csv("Electronic_review.csv", names=colnames, header=None)

Step 4: Populate the data with sentiment labels

The sentiment labels are derived from the overall rating as follows (a code sketch for this mapping follows the list):

  • 0 < rating < 3 => Negative sentiment (-1)
  • rating = 3 => Neutral sentiment (0)
  • 3 < rating <= 5 => Positive sentiment (1)
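
The mapping code itself is not shown in the original post; a minimal sketch, assuming the rating is in the overall column and the result is stored in a Sentiment column of a copy called newdf (the name used in the save step below):

# minimal sketch (assumed): map the 1-5 star rating to a sentiment label
def rating_to_sentiment(rating):
    if rating < 3:
        return -1   # negative
    elif rating == 3:
        return 0    # neutral
    else:
        return 1    # positive

newdf = df.copy()
newdf["Sentiment"] = newdf["overall"].apply(rating_to_sentiment)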

Let’s save this data frame as processedData.csv.

newdf.to_csv("processedData.csv",chunksize=100000)

Let’s see what our processed data looks like.

df = pd.read_csv("processedData.csv",nrows = 100000)
print(df.head(5))

Step 5: Preprocess the text samples

Steps for preprocessing

  • Filter out numbers
  • Stemming and lemmatization
  • Remove stopwords
  • Other root level token changes

Let’s import some important libraries.

from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import wordnet as wn
import nltk
nltk.download("stopwords")
import re
nltk.download("punkt")
nltk.download("wordnet")   # needed by WordNetLemmatizer

Now read processedData.csv.

df = pd.read_csv("processedData.csv")

  • Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. This indiscriminate cutting can be successful on some occasions, but not always, which is why this approach has some limitations.
  • Developing a stemmer is far simpler than building a lemmatizer. For the latter, deep linguistic knowledge is required to create dictionaries that allow the algorithm to look up the proper form of the word. Once this is done, the noise is reduced and the results of the information retrieval process are more accurate.
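
The code that actually performs these preprocessing steps is not shown in the original post; a minimal sketch follows, assuming the raw review text lives in the text column and the cleaned output is written to a new reviewText_final column (the column name used later in the train/test split):

# minimal sketch (assumed): filter out numbers, remove stopwords, and lemmatize each review
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(review):
    review = str(review).lower()
    review = re.sub(r"\d+", " ", review)          # filter out numbers
    tokens = word_tokenize(review)
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)

df["reviewText_final"] = df["text"].apply(preprocess)
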
lat_df = df[:100000]
lat_df.to_csv("CurrentUsedFile.csv")

We saved the first 100000 rows of data as CurrentUsedFile.csv so that we can easily process the data.

Step 6: Split the dataset into train and test sets

#importing the new dataset
lat_df = pd.read_csv("CurrentUsedFile.csv")
print(lat_df.head(5))
#create x and y => x:textreview , y:sentiment
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(lat_df['reviewText_final'],lat_df['Sentiment'],test_size=0.2,random_state = 42)
print(Train_X.shape,Train_Y.shape)
print(Test_X.shape,Test_Y.shape)

from sklearn.preprocessing import label_binarize

# binarize the labels for the ROC curves; the classes listed here should match
# the sentiment labels actually stored in the data frame
Test_Y_binarise = label_binarize(Test_Y, classes=[0, 1, 2])

Step 7: Apply the TF-IDF vectorizer to the tokens of each review sample

# Vectorize the words using the TF-IDF vectorizer - this measures how important a word
# in a document is in comparison to the rest of the corpus
from sklearn.feature_extraction.text import TfidfVectorizer

Tfidf_vect = TfidfVectorizer(max_features=500000)   # tweak max_features based on the dataset
Tfidf_vect.fit(lat_df['reviewText_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
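
As a quick sanity check (not part of the original post), you can inspect the vocabulary size and the shapes of the transformed matrices:

# optional sanity check on the vectorizer output
print("Vocabulary size:", len(Tfidf_vect.vocabulary_))
print("Train matrix shape:", Train_X_Tfidf.shape)
print("Test matrix shape:", Test_X_Tfidf.shape)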

Step 8: Apply the SVM, Naive Bayes, and Decision Tree models

Before going ahead, let's create a model evaluation function.

def modelEvaluation(predictions, y_test_set):
    # Print evaluation metrics for the predictions
    print("\nAccuracy on validation set: {:.4f}".format(accuracy_score(y_test_set, predictions)))
    print("\nClassification report : \n", metrics.classification_report(y_test_set, predictions))
    print("\nConfusion Matrix : \n", metrics.confusion_matrix(y_test_set, predictions))

Naive Bayes Model:

# Classifier - Algorithm - Naive Bayes
# fit the training dataset on the classifier
import time
second=time.time()
Naive = naive_bayes.MultinomialNB()
historyNB = Naive.fit(Train_X_Tfidf,Train_Y)
# predict the labels on validation dataset
predictions_NB = Naive.predict(Test_X_Tfidf)
modelEvaluation(predictions_NB, Test_Y)
from sklearn.metrics import precision_recall_fscore_support

a,b,c,d = precision_recall_fscore_support(Test_Y, predictions_NB, average='macro')

# Use accuracy_score function to get the accuracy
print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)
print("Precision is: ",a)
print("Recall is: ",b)
print("F-1 Score is: ",c)

Now let’s plot the ROC curve for Naive Bayes

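The plotting code is not included in the original post; here is a minimal sketch of one way to draw one-vs-rest ROC curves for the three sentiment classes, assuming the binarized labels from Step 6 and the classifier's predicted probabilities (matplotlib is an extra dependency not imported above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# predicted class probabilities for each sentiment class
probs_NB = Naive.predict_proba(Test_X_Tfidf)

plt.figure(figsize=(8, 6))
for i, label in enumerate(Naive.classes_):
    # one-vs-rest ROC curve per class; assumes the columns of Test_Y_binarise
    # are in the same order as Naive.classes_
    fpr, tpr, _ = roc_curve(Test_Y_binarise[:, i], probs_NB[:, i])
    plt.plot(fpr, tpr, label="class {} (AUC = {:.3f})".format(label, auc(fpr, tpr)))

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves for the Naive Bayes model")
plt.legend()
plt.show()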

Support Vector Machine (SVM) Model
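
The training code for the SVM is missing from the original post; the evaluation below assumes a fitted classifier and its predictions, so here is a minimal sketch using a linear-kernel SVC (the exact hyperparameters are an assumption):

# Classifier - Algorithm - SVM (assumed configuration; the original code is not shown)
SVM = svm.SVC(C=1.0, kernel='linear', gamma='auto')
SVM.fit(Train_X_Tfidf, Train_Y)

# predict the labels on the validation dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
modelEvaluation(predictions_SVM, Test_Y)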

asvm,bsvm,csvm,dsvm = precision_recall_fscore_support(Test_Y, predictions_SVM, average='macro')
# Use accuracy_score function to get the accuracy
print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)
print("Precision is: ",asvm)
print("Recall is: ",bsvm)

Decision Tree Model:

third=time.time()
decTree = DecisionTreeClassifier()
decTree.fit(Train_X_Tfidf, Train_Y)
y_decTree_predicted = decTree.predict(Test_X_Tfidf)
modelEvaluation(y_decTree_predicted, Test_Y)
