We are building a sentiment analysis system using Natural Language Processing (NLP) and text
mining. The system analyzes the text of a movie review and, based on its content, classifies the
review as positive or negative. These labels are then used to count positive versus negative
reviews.
Movie watchers are keen to read reviews and often decide whether to watch a movie based on
them. With the introduction of multiple platforms such as Netflix, Amazon Prime, Disney, and
Apple TV, users have access to thousands of movies. We have seen users browse many movies
before settling on one to watch, and sometimes they spend hours choosing.
We are building a system that helps movie watchers select movies based on previous reviews
without having to read every review. The system mines the review data using “Natural
Language Processing” or “Text Mining” capabilities and summarizes whether the reviews of a
movie are positive or negative overall.
This recommendation is useful from both a business model and a user experience perspective.
Based on the sentiment of their reviews, movies can be classified under a “5-star system”. This
model can help businesses price movies accordingly. Similarly, it can help movie watchers
select movies and improve their overall experience.
Throughout our research, we explore various analytical approaches to identify the tags and
sentiments of reviews, examine the language viewers use in their reviews, and apply appropriate
techniques to clean the data set. We test several methods and compare their classification
accuracy.
The overall project goal is to improve the user experience by identifying the sentiments of movie
viewers from their reviews. The success of the project depends on identifying review sentiment
with an acceptable level of accuracy.
The data set contains ~50K movie reviews labeled with sentiment. Approximately half of the
reviews are positive.
Ethical ML Framework:
The objective of the application is to identify viewers' sentiments from movie reviews, so many
aspects of the ethical ML framework do not apply directly. The application does not collect any
personally identifiable information, and the data is not used for anything other than sentiment
classification. The data may be biased toward viewers who choose to review movies, and movies
that did not get a wide release may have few reviews. This is open source data obtained from
We are assuming that the data provided in this dataset has no bias in its reviews or sentiment
labels. We are also assuming that each review's sentiment is labeled correctly.
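import re
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import wordcloud
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup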
When we group the dataset by sentiment value, we see a 50/50 distribution of positive and
negative reviews. Some reviews are also duplicated, which we can confirm by looking at the
frequency of each review text.
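The class balance and duplicate check described above can be sketched as follows. This is a minimal illustration, not the exact code used for the report: the tiny `toy` DataFrame is made-up stand-in data, and only the column names `review` and `sentiment` follow the rest of this report.

```python
import pandas as pd

# Toy stand-in for the review data (illustrative rows, not the real dataset).
toy = pd.DataFrame({
    "review": ["Great film", "Terrible plot", "Great film", "Loved it"],
    "sentiment": ["positive", "negative", "positive", "positive"],
})

# Number of reviews in each sentiment class.
class_counts = toy.groupby("sentiment")["review"].count()

# Frequency of each review text; anything above 1 is a duplicate.
review_freq = toy["review"].value_counts()
duplicated_reviews = review_freq[review_freq > 1]

# Count of rows that repeat an earlier review.
n_duplicates = int(toy["review"].duplicated().sum())

print(class_counts)
print(duplicated_reviews)
print(n_duplicates)
```

On the real data, the same calls would be applied to `dataset` instead of `toy`.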
By creating a new column that holds the number of characters in each review, we get a basic
idea of how each review is written. We found that positive reviews tend to be longer than
negative reviews. This may be because when people do not like something, they often leave a
short review expressing their disbelief, whereas when they like something, they tend to go
into detail and explain exactly why they like it.
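dataset['sentence_length'] = dataset['review'].apply(len)  # characters per review
# Histogram of review length per sentiment class (plt.hist is an assumed plotting function)
ax = sns.FacetGrid(dataset, col="sentiment")
ax.map(plt.hist, "sentence_length", color="steelblue", bins=50)
dataset.sort_values('sentence_length', ascending=False)  # longest reviews first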
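To compare review length between the two classes numerically, one option is to average the length per sentiment. The sketch below uses made-up rows, so it only shows the mechanics and says nothing about the real data. The `word_count` column is an assumption added here for illustration, since `apply(len)` counts characters rather than words.

```python
import pandas as pd

# Made-up reviews, used only to show the calls.
toy = pd.DataFrame({
    "review": [
        "Bad.",
        "I loved the acting and the story.",
        "Boring film.",
        "A wonderful, moving picture.",
    ],
    "sentiment": ["negative", "positive", "negative", "positive"],
})

toy["sentence_length"] = toy["review"].apply(len)        # characters per review
toy["word_count"] = toy["review"].str.split().str.len()  # whitespace-separated words

# Mean length of reviews in each sentiment class.
length_by_sentiment = toy.groupby("sentiment")[["sentence_length", "word_count"]].mean()
print(length_by_sentiment)
```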
Some reviews include HTML tags, so we used the BeautifulSoup library to remove them. We also
use regular expressions to take out symbols and numbers. We then download the stopwords list
from the nltk library, since we do not want to include words such as ‘the’, ‘if’, and ‘and’ in
our model. Finally, we lemmatize each word and store the result as the new review.
nltk.download('stopwords')
nltk.download('wordnet')
wlm = WordNetLemmatizer()
all_stopwords = stopwords.words('english')
Our review column now contains clean text, without stopwords, symbols, numbers, or HTML
tags.
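The cleaning steps described above can be sketched roughly as follows. This is a self-contained approximation rather than the report's exact code. The function name `clean_review` and the `TOY_STOPWORDS` set are hypothetical, a regular expression stands in for BeautifulSoup's tag removal, and the default lemmatizer does nothing so that the example runs without downloading nltk data.

```python
import re

# Small illustrative stopword list; the report uses nltk's English list instead.
TOY_STOPWORDS = {"the", "if", "and", "a", "is", "it", "was"}

def clean_review(text, stop_words=TOY_STOPWORDS, lemmatize=lambda word: word):
    """Strip HTML tags, symbols, numbers and stopwords, then lemmatize each word."""
    text = re.sub(r"<[^>]+>", " ", text)    # stand-in for BeautifulSoup tag removal
    text = re.sub(r"[^a-zA-Z]", " ", text)  # keep letters only (drops symbols and numbers)
    words = text.lower().split()
    kept = [lemmatize(word) for word in words if word not in stop_words]
    return " ".join(kept)

cleaned = clean_review("The movie was <br />GREAT, 10/10 and fun!")
print(cleaned)  # movie great fun
```

To follow the report more closely, `stop_words=set(stopwords.words('english'))` and `lemmatize=WordNetLemmatizer().lemmatize` could be passed in once the nltk data has been downloaded.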
We saved this clean text in a new CSV file so that we do not have to repeat the cleaning in the future.
to_save = dataset.copy()
to_save = to_save.drop(['sentence_length', 'words'], axis=1)
to_save.to_csv('IMDB_review_cleaned.csv', index=False)
Now that the reviews are clean, we need a simplified numeric representation of the text for our
machine learning algorithms. The bag-of-words model counts how many times each word appears
in a document, while the TF-IDF model gives each word a weight that measures its relevance.
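The difference between the two representations can be illustrated on a toy corpus in plain Python. This sketch uses the textbook weight (term count times log of N over document frequency). scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each row by default, so its numbers would differ. The corpus and the helper name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "good movie good cast",
    "bad movie",
    "good story",
]
tokenized = [doc.split() for doc in docs]

# Bag of words: raw count of each term in each document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tf_idf(term, doc_index):
    """Textbook weight: term count times log(N / document frequency)."""
    return bow[doc_index][term] * math.log(n_docs / doc_freq[term])

print(bow[0]["good"])     # raw count of "good" in the first document
print(tf_idf("good", 0))  # "good" appears in 2 of 3 documents
print(tf_idf("cast", 0))  # "cast" appears in only 1 of 3 documents
```

Although "good" occurs twice in the first document and "cast" only once, "cast" receives the higher TF-IDF weight because it is rarer across the corpus.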
Bag-of-words model:
We looked at the bag-of-words method first.
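from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))
cv_matrix = cv.fit_transform(dataset['review'])  # assumed input: the cleaned review column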
Train-Test split:
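A minimal sketch of this step, assuming scikit-learn's `train_test_split`. The 70/30 ratio is inferred from the confusion matrices in this report, which total 15,000 test reviews out of ~50K. The `random_state` value and the `stratify` argument are assumptions, and the small arrays are stand-ins for the feature matrix and the sentiment column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and labels (10 samples); in the report these would be
# the vectorized reviews and the sentiment column.
X = np.arange(20).reshape(10, 2)
y = np.array(["positive", "negative"] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # assumption inferred from the 15,000-review test set
    random_state=42,  # hypothetical seed, for reproducibility only
    stratify=y,       # assumption: keep the 50/50 class balance in both parts
)
print(X_train.shape, X_test.shape)
```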
Linear SVC:
from sklearn.svm import LinearSVC
svc_m = LinearSVC()
svc_m.fit(X_train, y_train)
y_pred = svc_m.predict(X_test)
The accuracy score and confusion matrix show that the model has an accuracy of 89.2%.
from sklearn.metrics import accuracy_score, confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print("Accuracy of Linear SVC on Bag of Word: {}".format(accuracy_score(y_test, y_pred)))
[[6694 846]
[ 770 6690]]
Accuracy of Linear SVC on Bag of Word: 0.8922666666666667
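As a check, the reported accuracy can be recomputed from the confusion matrix alone: the diagonal holds the correctly classified reviews, and the sum of all cells is the size of the test set. The sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported above for Linear SVC on bag of words.
cm = [[6694, 846],
      [770, 6690]]

correct = cm[0][0] + cm[1][1]        # diagonal: correctly classified reviews
total = sum(sum(row) for row in cm)  # all reviews in the test set
accuracy = correct / total

print(correct, total, round(accuracy, 4))  # 13384 15000 0.8923
```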
The multinomial naïve Bayes model has an accuracy of 88.4%, slightly worse than the linear SVC model.
cm = confusion_matrix(y_test, y_pred)
[[6666 874]
[ 855 6605]]
SGD Classifier:
from sklearn.linear_model import SGDClassifier
SGDC = SGDClassifier()
SGDC.fit(X_train, y_train)
y_pred = SGDC.predict(X_test)
The accuracy of the SGD classifier is 88.8%, still worse than linear SVC.
cm = confusion_matrix(y_test, y_pred)
[[6807 733]
[ 940 6520]]
We now run all three classification models with the TF-IDF method to see whether it gives
higher accuracy.
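One way to run the three classifiers side by side on TF-IDF features is a simple loop, sketched below on a made-up corpus. The texts, labels, and the `results` dictionary are illustrative only, and the accuracies printed for such a tiny sample say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up training and test reviews.
train_texts = ["great movie", "loved the film", "wonderful acting",
               "terrible movie", "hated the film", "awful acting"]
train_labels = ["positive"] * 3 + ["negative"] * 3
test_texts = ["great acting", "awful film"]
test_labels = ["positive", "negative"]

# Learn the vocabulary and weights on the training texts, then reuse them.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Linear SVC": LinearSVC(),
    "Multinomial NB": MultinomialNB(),
    "SGD": SGDClassifier(random_state=0),  # hypothetical seed
}

results = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    results[name] = accuracy_score(test_labels, model.predict(X_test))

print(results)
```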
Term Frequency-Inverse Document Frequency (TF-IDF) model:
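from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))
tfid_matrix = tfid.fit_transform(dataset['review'])  # assumed input: the cleaned review column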
Linear SVC:
svc_m = LinearSVC()
svc_m.fit(X_train, y_train)
y_pred = svc_m.predict(X_test)
[[6772 768]
[ 642 6818]]
[[6644 896]
[ 783 6677]]
SGD Classifier:
SGDC = SGDClassifier()
SGDC.fit(X_train, y_train)
y_pred = SGDC.predict(X_test)
[[6602 938]
[ 631 6829]]
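def predict_sentence(text, df):
    # Append the new sentence so it is vectorized with the same vocabulary as the reviews
    extra = pd.Series(text)
    review_series = pd.concat([df.review, extra], ignore_index=True)
    tfid = TfidfVectorizer(ngram_range=(1,2))
    tfid_transformed = tfid.fit_transform(review_series)
    tfid_matrix = tfid_transformed[:len(df)]  # the original reviews (50,000 rows)
    tfid_predict = tfid_transformed[-1:]      # the new sentence
    # Train on all reviews, then classify the new sentence
    svc_m = LinearSVC()
    svc_m.fit(tfid_matrix, df.sentiment)
    y_pred = svc_m.predict(tfid_predict)
    return y_pred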
Test One:
predict_sentence('Movie was really bad', dataset)
array(['negative'], dtype=object)
Test Two:
predict_sentence('Wonderful movie, I enjoyed every second of it. Would love to watch again.', dataset)
array(['positive'], dtype=object)
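Because `predict_sentence` refits the vectorizer and the classifier on every call, a deployed application might prefer to fit once and reuse the fitted model. The sketch below shows one possible way to do this with scikit-learn's `make_pipeline`. It is an alternative design and not the approach used in this report. The helper name `predict_with_fitted_model` and the toy reviews are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up training data standing in for the cleaned reviews and their labels.
reviews = ["wonderful movie", "enjoyed every second", "really bad movie", "boring and bad"]
sentiments = ["positive", "positive", "negative", "negative"]

# Vectorizer and classifier are fitted together, once.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(reviews, sentiments)

def predict_with_fitted_model(text):
    """Classify one sentence with the already fitted pipeline."""
    return model.predict([text])

prediction = predict_with_fitted_model("Movie was really bad")
print(prediction)
```

With this design, new sentences are transformed using the vocabulary learned during training, so words never seen in the training reviews are simply ignored.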
Model Deployment:
The model we have chosen has an accuracy of 90.6%. It would help people working for a movie
review website determine whether reviews left by users are positive or negative and categorize
them. As users enter new reviews on movies, more data can be collected and used to retrain
the model.
We developed the model in Python and used a Dash app with Plotly to deploy it on the Heroku
cloud server. The model is currently deployed at https://group-7-text- and code for the application can be found on our GitHub repo at
