Sentiment Analysis – Movie Reviews


By: Siddharth Panchal, Deependra Surana, Pratyaksh Shah

Abstract:
We are building a sentiment analysis system using Natural Language Processing (text mining). The system analyzes the text of a movie review and, based on its content, classifies the review as positive or negative. The classifications can then be aggregated to compare the number of positive and negative reviews for a movie.
Introduction:
Movie watchers are keen to know how a movie has been reviewed, and they decide whether to watch it based on those reviews. With the introduction of multiple platforms such as Netflix, Amazon Prime, Disney+, and Apple TV, users have access to thousands of movies. We have seen users browse many titles before settling on one to watch; sometimes they spend hours selecting a movie.
We are building a system that helps movie watchers select movies based on previous reviews, without having to read every review themselves. The system mines the review text using Natural Language Processing (text mining) and summarizes whether a movie's reviews are, overall, positive or negative.
This kind of recommendation is interesting and useful from both a business-model perspective and a user-experience perspective. Based on the sentiment of its reviews, a movie can be classified under a five-star system, which helps the business price movies accordingly; the same model helps movie watchers select movies and improves their overall experience.
Throughout our research we explore various analytical approaches for identifying the sentiment of a review, examine the language viewers use in their reviews, and apply appropriate techniques to clean the dataset. We test several methods and compare their classification accuracy.
The overall project goal is to improve the user experience by identifying the sentiment of movie viewers from their reviews. The success of the project depends on identifying movie sentiment with an acceptable level of accuracy.


Dataset:
The dataset, IMDB Dataset.csv, contains roughly 50,000 movie reviews, each labelled with a sentiment. Approximately half of the reviews carry a positive sentiment.

The dataset contains 2 columns: the movie review and its sentiment.

Ethical ML Framework:

The objective of the application is to identify viewer sentiment from movie reviews, so many aspects of an ethical ML framework do not apply directly. The application does not collect any personally identifiable information, and the data is not used for anything other than sentiment classification. The data may be biased towards viewers who like to review movies, and movies that were not widely released may not have received many reviews. The data is open source and was obtained from Kaggle.com.

Assumptions:
We assume that the data provided in this dataset has no bias in its reviews or their sentiments. We also assume that the sentiment of each review has been identified correctly.

Data Preparation & Exploration:


Importing libraries:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

import wordcloud

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import re

from bs4 import BeautifulSoup


Loading the movie dataset and doing some basic exploration:


dataset = pd.read_csv('https://raw.githubusercontent.com/Group-7-Big-Data/Assignment-2/master/IMDB_review_cleaned.csv')

Exploratory Data Analysis


The dataset contains 50,000 rows and 2 columns. The first column, review, contains the reviews left by users about movies; the second column, sentiment, indicates whether the review is positive or negative.
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 review 50000 non-null object
1 sentiment 50000 non-null object
dtypes: object(2)
memory usage: 781.4+ KB

When we group the dataset by its sentiment value, we can see a 50/50 distribution of positive and negative reviews. There are also some duplicated reviews, which we can confirm by looking at the frequency column of the summary.
dataset.groupby("sentiment").describe()
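As an additional check (an extra step, not part of the original notebook), the duplicated reviews could also be counted directly:

# Hypothetical extra check: count exact duplicate reviews.
n_dupes = dataset['review'].duplicated().sum()
print(f"Duplicated reviews: {n_dupes}")
# drop_duplicates would remove them; we keep them here to match the original analysis:
# dataset = dataset.drop_duplicates(subset='review').reset_index(drop=True)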

Plotting the counts of positive and negative reviews.


ax = sns.countplot(x="sentiment", data=dataset)


By creating a new column that holds the length (in characters) of each review, we can get a basic idea of how each review is written. We found that positive reviews tend to contain more characters/words than negative reviews. This may be because when people do not like something they often leave a short review expressing their disappointment, while when they like something they tend to go into detail and explain exactly why they like it.
dataset['sentence_length'] = dataset['review'].apply(len)
ax = sns.FacetGrid(dataset, col="sentiment")
ax.map(plt.hist, "sentence_length", color="steelblue", bins=50)

dataset.sort_values('sentence_length', ascending=False)
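To back the length claim with numbers, a quick per-class summary (an extra step, not in the original notebook) could look like this:

# Review length statistics per sentiment class.
print(dataset.groupby('sentiment')['sentence_length'].describe())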


Data Preprocessing - Cleaning Reviews:


Before we feed this data into our model, we need to clean the reviews, since they contain a lot of non-text content as well.

Some reviews include HTML tags, so we use the BeautifulSoup library to remove them. We also use a regular expression to strip out symbols and numbers. We then download the stopword list from the NLTK library, since we do not want to include words such as 'the', 'if', and 'and' in our model. Finally, we lemmatize each word and store the result as the new review.
#nltk.download('stopwords')
#nltk.download('wordnet')
wlm = WordNetLemmatizer()
all_stopwords = set(stopwords.words('english'))

# Creating a function to clean reviews

def review_to_words(text):
    # As we saw earlier, reviews include HTML tags; strip them out with BeautifulSoup
    review = BeautifulSoup(text).get_text()
    # We do not need symbols or numbers, just letters; the regex keeps only English letters
    review = re.sub('[^a-zA-Z]', ' ', review)
    # Convert all words to lower case and split them into a list
    review = review.lower()
    review = review.split()
    # Lemmatize each word and drop stopwords
    review = [wlm.lemmatize(word) for word in review if word not in all_stopwords]
    return review

# Apply the function and create a words column to store the cleaned words

dataset['words'] = dataset['review'].apply(review_to_words)

# Combine each list of words back into a sentence

dataset['review'] = dataset['words'].apply(lambda x: ' '.join(map(str, x)))
dataset.head()

Our review column is now clean text: it no longer includes stopwords, symbols, numbers, or HTML tags.
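As a quick sanity check (an extra illustration, not part of the original notebook), the cleaning function can be run on a made-up raw review:

# Hypothetical raw review containing HTML, punctuation, numbers, and stopwords.
sample = "<br />This movie was NOT good... the acting was terrible, 2/10!"
print(review_to_words(sample))
# Expected to yield lowercased, lemmatized tokens with stopwords removed,
# e.g. something like ['movie', 'good', 'acting', 'terrible']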


We have saved this clean text to a new CSV file so that in the future we do not have to repeat the cleaning process.
to_save = dataset.copy()
to_save = to_save.drop(['sentence_length', 'words'], axis=1)
to_save.to_csv('IMDB_review_cleaned.csv', index=False)

Data Visualization – Tag Cloud:


Let us view the words that positive and negative reviews contain in word-cloud form, so we can get a better idea of which words appear more often in positive reviews and which appear more often in negative reviews.
We have created two different data frames, one for each type of sentiment.
pos_review = dataset[dataset['sentiment'] == 'positive']
neg_review = dataset[dataset['sentiment'] == 'negative']

Now plotting the positive and negative review word clouds.


text = ' '.join(pos_review['review'].astype(str).tolist())
fig_wordcloud = wordcloud.WordCloud(background_color='lightgrey',
                                    colormap='viridis', width=800,
                                    height=600).generate(text)
plt.figure(figsize=(10,7), frameon=True)
plt.imshow(fig_wordcloud)
plt.axis('off')
plt.title('Positive Review - Word Cloud', fontsize=20 )
plt.show()


text = ' '.join(neg_review['review'].astype(str).tolist())


fig_wordcloud = wordcloud.WordCloud(background_color='lightgrey',
                                    colormap='viridis', width=800,
                                    height=600).generate(text)
plt.figure(figsize=(10,7), frameon=True)
plt.imshow(fig_wordcloud)
plt.axis('off')
plt.title('Negative Review - Word Cloud', fontsize=20 )
plt.show()

Bag of Words model vs Term Frequency-Inverse Document Frequency (TF-IDF) model:

Now that we have cleaned our reviews, it is time to turn the text into a simplified numeric representation for our machine learning algorithms. The bag-of-words model counts how many times each word appears in a document, while in the TF-IDF model each word carries a weight that measures its relevance (frequent in a document but rare across the corpus). A small illustration of the difference is sketched below.
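As a toy example (not part of the original notebook), the two vectorizers can be compared on a couple of short made-up reviews:

# Toy comparison of CountVectorizer (bag of words) and TfidfVectorizer.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["great movie great acting", "terrible movie"]

bow = CountVectorizer().fit(docs)
print(bow.get_feature_names_out())           # vocabulary
print(bow.transform(docs).toarray())         # raw counts per document

tfidf = TfidfVectorizer().fit(docs)
print(tfidf.transform(docs).toarray().round(2))  # 'movie' appears in both documents, so it gets a lower weight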
Bag-of-words model:
We look at the bag-of-words method first.
cv = CountVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))
cv_matrix = cv.fit_transform(dataset.review)

Train-Test split:

We split the data into training and test sets with a 70%/30% ratio.


X_train, X_test, y_train, y_test = train_test_split(cv_matrix, dataset.sentiment, test_size=0.3, random_state=0)


Linear SVC:
svc_m = LinearSVC()
svc_m.fit(X_train, y_train)
y_pred = svc_m.predict(X_test)

By looking at the accuracy and confusion matrix, we can see that the model has an accuracy of 89.2%.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy of Linear SVC on Bag of Word: {}".format(accuracy_score(y_
test, y_pred)))
[[6694 846]
[ 770 6690]]
Accuracy of Linear SVC on Bag of Word: 0.8922666666666667
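For a fuller picture than accuracy alone (an extra step, not in the original notebook), per-class precision and recall could be derived from the same predictions:

# Per-class precision, recall and F1 for the bag-of-words Linear SVC.
# Rows of the confusion matrix above are the true classes in alphabetical
# order ('negative', 'positive'); columns are the predicted classes.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))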

Multinomial Naïve Bayes:


NB = MultinomialNB()
NB.fit(X_train, y_train)
y_pred = NB.predict(X_test)

The accuracy of the multinomial naïve Bayes model is approximately 88.5%, slightly worse than the linear SVC model.
cm = confusion_matrix(y_test, y_pred)
print(cm)

print("Accuracy of Multinomial NB on Bag of Word: {}".format(accuracy_scor


e(y_test, y_pred)))

[[6666 874]
[ 855 6605]]

Accuracy of Multinomial NB on Bag of Word: 0.8847333333333334

SGD Classifier:
SGDC = SGDClassifier()
SGDC.fit(X_train, y_train)
y_pred = SGDC.predict(X_test)

The accuracy of the SGD classifier is 88.8%, still worse than the linear SVC.
cm = confusion_matrix(y_test, y_pred)
print(cm)

print("Accuracy of SGDC Classifier on Bag of Word: {}".format(accuracy_sco


re(y_test, y_pred)))

[[6807 733]
[ 940 6520]]

Accuracy of SGDC Classifier on Bag of Word: 0.8884666666666666


We now run all three of these classification models with the TF-IDF representation to see whether we get higher accuracy.
Term Frequency-Inverse Document Frequency (TF-IDF) model:
tfid = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))
tfid_matrix = tfid.fit_transform(dataset.review)

X_train, X_test, y_train, y_test = train_test_split(tfid_matrix, dataset.sentiment, test_size=0.3, random_state=0)

Linear SVC:
svc_m = LinearSVC()
svc_m.fit(X_train, y_train)
y_pred = svc_m.predict(X_test)

At an accuracy of 90.6%, this is the highest we have seen so far.


cm = confusion_matrix(y_test, y_pred)
print(cm)

print("Accuracy of Linear SVC on Bag of Word: {}".format(accuracy_score(y_


test, y_pred)))

[[6772 768]
[ 642 6818]]

Accuracy of Linear SVC on Bag of Word: 0.906

Multinomial Naïve Bayes:


NB = MultinomialNB()
NB.fit(X_train, y_train)
y_pred = NB.predict(X_test)

The accuracy of the multinomial naïve Bayes model is 88.8%.


cm = confusion_matrix(y_test, y_pred)
print(cm)

print("Accuracy of Multinomial NB on Bag of Word: {}".format(accuracy_scor


e(y_test, y_pred)))

[[6644 896]
[ 783 6677]]

Accuracy of Multinomial NB on Bag of Word: 0.8880666666666667

SGD Classifier:
SGDC = SGDClassifier()
SGDC.fit(X_train, y_train)
y_pred = SGDC.predict(X_test)


The accuracy of the SGD classifier is 89.5%.


cm = confusion_matrix(y_test, y_pred)
print(cm)

print("Accuracy of SGDC Classifier on Bag of Word: {}".format(accuracy_sco


re(y_test, y_pred)))

[[6602 938]
[ 631 6829]]

Accuracy of SGDC Classifier on Bag of Word: 0.8954

Conclusion & Final Model Development:


We were most successful with the TF-IDF representation and the Linear SVC model, at 90.6% accuracy, so we develop our final model with that combination.
We have created a function that accepts text as input and returns a prediction.
def predict_sentence(text, df):
    # Clean the new review (note: unlike the training data, stopwords are not
    # removed and words are not lemmatized here)
    text = BeautifulSoup(text).get_text()
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = text.lower()
    text = [text]

    # Append the new review after the labelled reviews so both share one vocabulary
    # (pd.concat([df.review, extra], ignore_index=True) on newer pandas)
    extra = pd.Series(text)
    review_series = df.review.append(extra, ignore_index=True)

    tfid = TfidfVectorizer(ngram_range=(1,2))
    tfid_transformed = tfid.fit_transform(review_series)

    # First 50,000 rows are the labelled reviews, the last row is the new review
    tfid_matrix = tfid_transformed[:50000]
    tfid_predict = tfid_transformed[-1:]

    # Fit the final model on all labelled reviews and predict the new one
    svc_m = LinearSVC()
    svc_m.fit(tfid_matrix, df.sentiment)

    y_pred = svc_m.predict(tfid_predict)

    return y_pred

Test One:
predict_sentence('Movie was really bad', dataset)
array(['negative'], dtype=object)


Test Two:
predict_sentence('Wonderful movie, I enjoyed every second of it. Would love to watch again.', dataset)
array(['positive'], dtype=object)
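Refitting the vectorizer and the classifier on every call is expensive. A leaner alternative (a sketch under our assumptions, not the code used above) would fit a scikit-learn Pipeline once and reuse it for any number of predictions:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Fit the vectorizer and classifier once on all labelled reviews.
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1, 2)),
    LinearSVC(),
)
sentiment_pipeline.fit(dataset.review, dataset.sentiment)

# Reuse the fitted pipeline for new reviews, cleaning them the same way first.
new_review = ' '.join(review_to_words('Movie was really bad'))
print(sentiment_pipeline.predict([new_review]))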

Model Deployment:
The model we have chosen has an accuracy of 90.6%. It would help people working for a movie review website determine whether reviews left by users are positive or negative and categorize them accordingly. More data can be collected as users enter new reviews, and the model can be retrained on that data.
We developed the model in Python, so we used Dash and Plotly to deploy the app on a Heroku cloud server. The model is currently deployed at https://group-7-text-mining.herokuapp.com/ and the code for the application can be found in our GitHub repo at https://github.com/Group-7-Big-Data/Assignment-2. A minimal sketch of such a Dash front end follows.
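The deployed app lives in the linked repo; purely as an illustration (a hypothetical Dash 2-style sketch, not the group's actual application), a minimal front end around the fitted pipeline from the earlier sketch might look like this:

# Minimal hypothetical Dash app wrapping the fitted sentiment pipeline.
from dash import Dash, html, dcc, Input, Output, State

app = Dash(__name__)
app.layout = html.Div([
    html.H2('Movie Review Sentiment'),
    dcc.Textarea(id='review-text', style={'width': '100%', 'height': 120}),
    html.Button('Classify', id='classify-btn', n_clicks=0),
    html.Div(id='result'),
])

@app.callback(Output('result', 'children'),
              Input('classify-btn', 'n_clicks'),
              State('review-text', 'value'),
              prevent_initial_call=True)
def classify(n_clicks, text):
    if not text:
        return ''
    # Reuse the same cleaning function and fitted pipeline as above.
    cleaned = ' '.join(review_to_words(text))
    return 'Predicted sentiment: {}'.format(sentiment_pipeline.predict([cleaned])[0])

if __name__ == '__main__':
    app.run_server(debug=True)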


Work Cited:

Nicholson, Chris. "A Beginner's Guide to Bag of Words & TF-IDF." Pathmind, 2020, pathmind.com/wiki/bagofwords-tf-idf.

Witten, I. H., and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., Elsevier, 2005.

Prabhakaran, Selva. "Lemmatization Approaches with Examples in Python." Machine Learning Plus, 18 May 2020, www.machinelearningplus.com/nlp/lemmatization-examples-python/.

Authors:
Siddharth Panchal
York University School of Continuing Studies

Deependra Surana
York University School of Continuing Studies

Pratyaksh Shah
York University School of Continuing Studies

