The forest of amazon reviews, part 1 (an NLP story)

Jorge Ercoli
6 min read · May 14, 2021

In this post we will enter the world of NLP (natural language processing), the field that studies the interactions between computers and human language.

Text Analysis is a major application field for machine learning algorithms.

We will explain how to analyze text obtained from Amazon product reviews; this is known as sentiment analysis.

In Amazon reviews, customers attach a star rating (1 to 5) to their comment. In this case we only take into account good (4, 5 stars) or bad (1, 2 stars) reviews, not neutral ones.

Then we will apply a prediction model and evaluate its performance. We will do it in two parts:

part 1) We will clean up our text by removing everything that is not useful for analysis. We will use NLTK to remove stopwords and stem our words. We will create a vectorizer that builds a matrix of word counts.

part 2) We will apply a model called Random Forest to the dataframe obtained in part 1. Then we will evaluate its performance with different metrics and test different configurations of the model using GridSearchCV.

Ok, that is a lot to cover! I recommend that, as you read each concept and its code, you keep Jupyter open with the notebook dedicated to this post and try it for yourself.

Import the libraries

import pandas as pd
import nltk
import re
import string

From nltk we will need stopwords and PorterStemmer, but what are they for?

Stopwords

This is a set of words that carry little meaning on their own, such as articles, pronouns, and prepositions, which are filtered out before or after processing. Ex: 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", …

Stemming and Lemmatization

PorterStemmer is a kind of stemmer: "Stem (root) is the part of the word to which you add inflectional (changing/deriving) affixes such as (-ed, -ize, -s, -de, -ing). So stemming a word or sentence may result in words that are not actual words. Stems are created by removing the suffixes or prefixes used with a word." (datacamp.com)

Another similar technique is Lemmatization: "Lemmatization, unlike Stemming, reduces the inflected words properly ensuring that the root word belongs to the language. In Lemmatization root word is called Lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words." (datacamp.com)

So, Stemming and Lemmatization both generate the root form of inflected words. The difference is that a stem might not be an actual word, whereas a lemma is an actual word in the language. Both are widely used techniques in text mining.
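
To make the difference concrete, here is a minimal sketch comparing NLTK's PorterStemmer with its WordNetLemmatizer (the extra wordnet download is an assumption about your local setup; newer NLTK versions may also need 'omw-1.4'):

import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
wnl = WordNetLemmatizer()
for word in ['studies', 'feet']:
    # the stemmer may return non-words ('studi'), while the lemmatizer
    # returns dictionary forms ('study', 'foot'); by default it treats
    # every word as a noun
    print(word, '->', ps.stem(word), '/', wnl.lemmatize(word))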

# nltk libraries
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# a set makes the membership test in the cleaning function faster
stopwords = set(stopwords.words('english'))
ps = PorterStemmer()

Get the data

We get the data from Kaggle (Amazon Reviews for Sentiment Analysis). The data comes in txt files compressed with bz2, with the following format:

__label__n ReviewText, where the label can be __label__1 (bad review) or __label__2 (good review)
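
For example, splitting one of these lines on the first space separates the label from the text (the sample line here is made up):

line = '__label__2 Great book: This was a great book'
label, text = line.split(' ', 1)
print(label)  # __label__2
print(text)   # Great book: This was a great book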

import bz2  # to unzip and load the files
train_file = bz2.BZ2File('train.ft.txt.bz2')
test_file = bz2.BZ2File('test.ft.txt.bz2')
# readlines() returns bytes, so each line will be decoded to str later on
train_lines = train_file.readlines()
test_lines = test_file.readlines()
print('Train line: {}'.format(train_lines[11]))

Train line: b’__label__2 Great book: This was a great book,I just could not put it down,and could…

There are 3,600,000 train lines and 400,000 test lines!

So, due to processing time, we are only going to take a small part of that data:

train_part = train_lines[:10000]
test_part = test_lines[:2000]
# Put all together
amz_rev_part = train_part + test_part

Prepare the data

Now we will obtain our values of X (the review text) and y (the label), with some cleaning and transformation tasks on X.

from tqdm import tqdm  # progress bar for the loop below

# Get the label and return 0 (__label__1 = bad review) or 1 (__label__2 = good review)
def reviewToY(review):
    return 0 if review.split(' ')[0] == '__label__1' else 1

# Get the review text feature:
# drop the label, delete the last char (\n) and transform to lower case
def reviewToX(review):
    review = review.split(' ', 1)[1][:-1].lower()
    return review

def splitReviewsLabels(lines):
    reviews = []
    labels = []
    for review in tqdm(lines):
        review = review.decode('utf-8')  # readlines() gave us bytes
        rev = reviewToX(review)
        label = reviewToY(review)
        # only keep the first 512 chars of each review
        reviews.append(rev[:512])
        labels.append(label)
    return reviews, labels

X_amz_rev_part, y_amz_rev_part = splitReviewsLabels(amz_rev_part)
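
A quick sanity check on the result (the index is arbitrary):

print(X_amz_rev_part[11])  # lower-cased review text, at most 512 chars
print(y_amz_rev_part[11])  # 1 = good review, 0 = bad review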

Next, create a method that tokenizes our reviews (obtaining a list of words), removes the stopwords, and stems the remaining words:

# Tokenize the sentence, then delete stopwords and stem each word
def remove_stopword_and_stem(text):
    tokens = re.split(r'\W+', text)  # split on any non-word character
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text
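
For example, on a made-up review the function behaves like this (output shown as a comment):

print(remove_stopword_and_stem('i loved this amazing book'))
# ['love', 'amaz', 'book']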

Create the Bag of Words with CountVectorizer

What is a Bag of Words?

BoW turns text into fixed-length vectors by counting the number of times a word appears in a document.
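
As a toy illustration (separate from our pipeline), here are two short documents and their count vectors; note that newer scikit-learn versions rename get_feature_names() to get_feature_names_out():

from sklearn.feature_extraction.text import CountVectorizer

docs = ['good book good story', 'bad book']
cv = CountVectorizer()
bow = cv.fit_transform(docs)
print(cv.get_feature_names())  # ['bad', 'book', 'good', 'story']
print(bow.toarray())
# [[0 1 2 1]
#  [1 1 0 0]]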

In scikit-learn there are several vectorizers that generate a BoW; the most used are CountVectorizer and TfidfVectorizer.

For detailed information on this topic, read ‘Text feature extraction’ in sklearn-feature-extraction

from sklearn.feature_extraction.text import CountVectorizer

# create the vectorizer, using our cleaning function as the analyzer
count_vect = CountVectorizer(analyzer=remove_stopword_and_stem)
# train the vectorizer
vec = count_vect.fit(X_amz_rev_part)
# create the bag of words (X_cv) with transform()
X_cv = vec.transform(X_amz_rev_part)
# view the BoW dimensions and a slice of the vocabulary
print(X_cv.shape)
print(count_vect.get_feature_names()[3000:3010])

(12000, 31803)

['bock', 'bode', 'bodi', 'bodic', 'bodo', 'bodybuild', 'bodyfin', 'bodygroom', 'bodyguard', 'bodyment']

And this is the type and shape of my BoW:

<12000x31803 sparse matrix of type '<class 'numpy.int64'>' with 351918 stored elements in Compressed Sparse Row format>

Now we create a fresh dataframe with our BoW information:

# build a dataframe from the sparse matrix
# (toarray() densifies it: 12000 x 31803 int64 values need a few GB of RAM)
X_features_df = pd.DataFrame(X_cv.toarray())
# rename the columns with the corresponding word
X_features_df.columns = count_vect.get_feature_names()
X_features_df
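
If the dense array from toarray() does not fit in your memory, pandas (0.25 or later) can wrap the scipy sparse matrix directly; a minimal sketch:

X_features_df = pd.DataFrame.sparse.from_spmatrix(
    X_cv, columns=count_vect.get_feature_names()
)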

Now that we have the dataframe from which to obtain X (the features), we can add new columns based on the data we already have, such as a feature with the number of characters in each review (spaces excluded).

So we take our original reviews (X_amz_rev_part), count the non-space characters, and create the new feature (review_len) in the new df (X_features_df):

# Adding new features, e.g. the review length (spaces excluded)
X_amz_df = pd.DataFrame(X_amz_rev_part, columns=['Review'])
X_features_df['review_len'] = X_amz_df['Review'].apply(lambda x: len(x) - x.count(" "))
X_features_df['review_len']
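
Another feature you could add in the same way (a hypothetical example, not part of the original notebook) is the percentage of punctuation in each review, which finally puts the string module imported at the top to use:

def punct_pct(text):
    # share of punctuation among non-space characters
    chars = len(text) - text.count(' ')
    if chars == 0:
        return 0.0
    punct = sum(1 for ch in text if ch in string.punctuation)
    return 100 * punct / chars

X_features_df['punct_pct'] = X_amz_df['Review'].apply(punct_pct)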

OK, lots of numbers, matrices and vectors; what if we visualize something more understandable for a human? :)

For example, the ten most common words in our BoW:

import matplotlib.pyplot as plt

# sum the counts of each word in the Bag of Words and sort by frequency
sum_words = X_cv.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
# get the top 10 common words
common_words = words_freq[:10]
# create a df with the top words and their counts
df = pd.DataFrame(common_words, columns=['ReviewText', 'count'])
# build a bar graph and plot (set the size via figsize,
# since df.plot.bar creates its own figure)
ax = df.plot.bar(x='ReviewText', y='count', rot=0, figsize=(20, 8))
plt.show()

Ok folks, this is the first part of our sentiment analysis on Amazon reviews. We have transformed our data into the numerical form required by a machine learning algorithm.

In part 2 we will apply an ML classifier called Random Forest to this data, which we hope can predict as accurately as possible whether a review is good or bad.

As always, the GitHub link with the complete sentiment analysis Jupyter notebook is attached so that you can verify the code for yourself. Also follow me on my data science blog, EmpowereDataScience.

Your comments and/or your like are appreciated ;)
