Efficient Features for Movie Recommendation Systems

SUVIR BHARGAV

Master's Thesis at VionLabs AB
Stockholm, Sweden, October 2014

Supervisor: Roelof Pieters
Examiner: Markus Flierl

XR-EE-KT 2014:012

Abstract

User-written movie reviews carry substantial amounts of movie-related features such as descriptions of location, time period, genres, characters, etc. Using natural language processing and topic modeling based techniques, it is possible to extract features from movie reviews and find movies with similar features. In this thesis, a feature extraction method is presented and the use of the extracted features in finding similar movies is investigated. We first pre-process a collection of movie reviews. We then extract topics from the collection using topic modeling techniques and store the topic distribution for each movie. A similarity metric, the Hellinger distance, is then used to find movies with similar topic distributions. Furthermore, the extracted topics are used as an explanation during subjective evaluation. Experimental results show that our extracted topics represent useful movie features and that they can be used to find similar movies efficiently.

Acknowledgements

This thesis has been carried out at Vionlabs AB. From the initial idea to the final execution, everyone at Vionlabs supported the endeavour to build and create something around movies and technology. I would like to thank my supervisor, Roelof Pieters, for his guidance and for the many discussions around NLP, topic modeling and movie recommendation systems.

I would also like to thank the main author of the Gensim library, Radim Řehůřek, for his many suggestions and ideas. I extend my gratitude to the great community of programmers and engineers who took the time to answer my questions on Stack Overflow.

I would like to thank my coordinator and examiner, Markus Flierl, for giving valuable guidance and suggestions at each stage of the project. I would also like to thank all the movie judges at Vionlabs for their time and effort in rating movies. Finally, I would like to thank my family and friends, who constantly supported me throughout the thesis.
Contents

Contents
List of Figures

1 Introduction
   1.1 Question
   1.2 Goals
   1.3 Outline

2 Background
   2.1 Movie Data Processing: A Literature review
   2.2 Document representation
   2.3 Topic Modeling
       2.3.1 Overview
       2.3.2 Latent Dirichlet Allocation
   2.4 Similarity Metrics
       2.4.1 Cosine Similarity
       2.4.2 Kullback-Leibler (KL) divergence
       2.4.3 Hellinger Distance

3 Recommendation based on Movie Topics
   3.1 User Reviews of Movies as Data
   3.2 Text Preprocessing
   3.3 Feature Extraction
       3.3.1 Overview
       3.3.2 Movie Topics
   3.4 Topic Similarity

4 Experimental Setup and Results
   4.1 Experimental Setup
       4.1.1 Text processing
       4.1.2 Training LDA model
       4.1.3 Calculating movie similarity
   4.2 Evaluation
       4.2.1 Evaluation criteria
       4.2.2 Web based movie evaluation setup
   4.3 Results
       4.3.1 Evaluation result
       4.3.2 Rating correlation
       4.3.3 Observations on Subjective evaluation

5 Conclusion and Future Directions
   5.1 Conclusion
   5.2 Future Directions
       5.2.1 Movie review preprocessing
       5.2.2 Building complex topic models

Bibliography
List of Figures

2.1 Vector Space Model of documents. Figure by pyevolve [24].
2.2 The graphical model for latent Dirichlet allocation. Figure taken from Blei's paper [20].
2.3 Angle between two documents in a 2-d document-term space.
3.1 The overall system showing all steps involved.
3.2 A sample movie review taken from IMDB, with relevant features highlighted.
3.3 Collection and preprocessing of movie reviews.
3.4 Parallel preprocessing of movie reviews across the available CPU cores.
3.5 Tree showing the NLTK based chunking technique applied to movie data.
3.6 Sample topics generated from user movie reviews for the movie Gravity.
3.7 Correlation between Hellinger distance and cosine similarity scores.
4.1 A tree diagram showing the movie review corpus.
4.2 Chart showing the genres of popular movies from the last 10 years.
4.3 A visualization of 20 topics generated from 100 movie reviews.
4.4 Front page of the movie evaluation system, showing five target movies.
4.5 Web based movie evaluation system.
4.6 Movie evaluation system with explanation.
4.7 Result of average rating for Genre (top) and Genre with explanation (bottom).
4.8 Result of average rating for Mood (top) and Mood with explanation (bottom).
4.9 Result of average rating for Plot (top) and Plot with explanation (bottom).
4.10 Result of average rating for Overlap (top) and Overlap with explanation (bottom).
4.11 Average ratings for the movie topics.
4.12 Strong positive correlation between Genre and Mood.
4.13 Strong positive correlation of ratings between two judges.
Chapter 1

Introduction

The advent of movie streaming services has made thousands of movies available at the click of a button [1]. We now have movies not only from Hollywood, but also from international cinema, documentaries, indie movies, etc. With so many movies at hand, the consumer faces the dilemma of what to watch. At the end of the day, people just want to relax and watch something that matches their mood, taste and style. This is where Recommendation Systems (RS) can help, suggesting movies that match the user's taste and viewing habits. In order to recommend movies, we need to understand movies first. The better we understand movie features (genre, keywords, mood, etc.), the better recommendations we can serve.

Commercial streaming services such as Netflix [2] and Jinni [3] combine semantic information about movies with user ratings to obtain an optimal hybrid RS. However, they still depend on human taggers [4], [5] for the basic feature representation needed to classify movies or songs. Although the results obtained from human taggers are quite good, such an approach does not scale when tagging hundreds of thousands of movies or millions of daily generated videos.

For a system to understand a movie, it needs movie features such as the cast, genre and plot. With this information, a system can better categorize movies. User-written movie reviews are one such source of features. They carry a substantial amount of movie-related information such as location, time period, genre, lead characters and memorable scene descriptions.

Since a user-written movie review contains both useful (i.e. keywords) and useless (i.e. stopwords) information, some text pre-processing is required before it can be used by a RS. With pre-processed movie data, the next step is to find a good feature representation for movies. In this thesis, we explore feature extraction from movie reviews using Natural Language Processing (NLP) and topic modeling techniques and use the extracted features to find similar movies. The experiments are done on a small set of movies to show that movie topics are efficient features for RS.


1.1 Question
• Is it possible to extract or generate movie features from user reviews?

• Is it possible to use extracted features to find similar movies?

• What is a good feature and how can we distinguish good features from bad
ones?

1.2 Goals
The goals of this master's thesis are:

• Extract movie features from user reviews of movies.

• Investigate the use of the extracted features in finding similar movies.

• Draw conclusions about the performance of the developed prototype system.

1.3 Outline
This work is presented in the following chapters:

• Chapter 2 discusses the background study done during the project and presents the technical concepts used.

• Chapter 3 presents a recommendation system based on movie topics. The major steps involved in topic extraction are discussed in detail. The chapter closes by discussing the implementation of the similarity metrics used to find similar movies based on topics.

• Chapter 4 presents the experimental setup, evaluation system and results.

• Chapter 5 concludes the project and discusses future directions.


Chapter 2

Background

A Recommendation System provides items and suggestions to a person based on his or her interests and past usage history. Such systems are the backbone of many of today's content streaming services, such as Netflix (www.netflix.com), Pandora (www.pandora.com) and YouTube (www.youtube.com). Recommendation Systems (RS) are usually classified by the approach used to filter information: content based filtering, collaborative filtering (based on user activity) and hybrid filtering (combining both).

Collaborative filtering based RS have seen much interest lately because of the Netflix competition [6], whereas content based systems face the challenge of efficiently representing features of meta-data from audio, video and text. Luckily for the movie domain, a lot of textual information is readily available, such as plot lines, dialogues and reviews.

2.1 Movie Data Processing: A Literature review


Movie data in the form of keywords, scripts, dialogue and reviews has been used in research over the past decade [7]–[10]. [8] explores movie recommendation using cultural metadata such as user comments, plot outlines, keywords, etc., and shows the highest precision with user comments. The report [9] discusses movie classification using NLP based methods such as a Named Entity Recognizer (NER) and a Part-of-Speech (POS) tagger with movie scripts as input. It concludes that NLP based features perform well compared to non-NLP features (without the use of NER and POS), although it reports only 50% accuracy because of the small corpus size.

We decided to use movie reviews written by moviegoers, primarily because: a) they are easily available [7], [11] and computationally inexpensive to process on off-the-shelf hardware; b) each movie review can be considered a single document representing the movie, which allows us to use document based classification methods on movies; c) the simple heuristic of combining several user-written reviews of a single movie into a single document has the potential to discover semantic patterns at the movie level. Furthermore, combining all individual movie documents into a collection allows us to explore patterns across the collection (essentially, across genres).

In order to use movie reviews as data, it is necessary to remove irrelevant words, symbols, HTML tags, etc. In NLP, a large number of open source tools and libraries [12]–[14] are available and used as the first step in any kind of text processing. Chapter 7 of [12] describes the steps involved in extracting information from text. [15] uses the NLTK toolkit for the stopword removal and stemming steps. Noisy text data can drastically affect the result of any kind of NLP based model training. Text filtering removes unnecessary information and allows us to apply complex mathematical models to the data.

The paper [8] compares the results obtained by preprocessing meta-data. It simply computes the cosine similarity from the document-term matrix of movie data. Although the paper showed the highest precision with user comments of movies, it did not analyze the data further with more advanced techniques such as LSA. After preprocessing, reducing the dimensionality of the data is the next step in feature extraction.

The document-term matrix is used as the input to many semantic analysis techniques, from the basic tf-idf scheme to complex models such as LSI, LSA and LDA. These dimensionality reduction techniques can yield semantic features of big data on off-the-shelf hardware [13]. Such models are interesting candidates for investigating semantic concepts in movie data. The thesis [10] studies sentiment analysis of movie reviews using LSA, but concludes that the dimensions that capture the most variance are not always the most discriminating features for classification. On the other hand, [16] shows interesting results when using topic modeling in a content based RS. Probabilistic topic modeling allows us to extract hidden features, i.e. "topics", from documents. LDA, a model based on topic modeling, shows good results in both document clustering [17] and recommendation systems [18], [19]. It can capture important intra-document statistical structure by considering mixture models for exchangeability of both words and documents within a corpus [20]. With probabilistic techniques such as LDA, it is possible to derive semantic similarities from textual movie data. Such extracted semantic information can be used to find similar movies. Moreover, LDA can assign a topic distribution to new unseen documents, an important requirement for building a scalable movie RS, as it should be trivial to add new movies on a regular basis. For a RS, computing similarity is an essential part, be it similarity of content or of user ratings.

Clustering, an unsupervised classification of patterns, is a technique applied to movie meta-data in RS. The review paper on clustering [21] briefly discusses similarity measures and emphasizes that similarity is fundamental to clustering. [22], [23] present a detailed study of commonly used similarity measures in text clustering. Since the input data for our project, i.e. movie reviews, are in the form of text documents, we can start from the similarity measures discussed in [22]. Once the similarity between movies is computed, it is important to evaluate the result obtained. For unsupervised learning techniques such as LDA, evaluation is still a challenge. Since our project is based on movies, subjective evaluation is an obvious choice, as the movies are ultimately watched by people.

In the end, even though topic modeling has shown good results in recommender systems [16], [19], it has hardly been explored for movie recommendation. Understanding movie data still faces challenges, and we need algorithms with semantic understanding to solve them.

2.2 Document representation


Before stepping into NLP based techniques, it is important to understand basic document representation. Let's say we have a set of documents and, for the sake of simplicity, each document consists of a single sentence. We can represent such a model in vector space as shown in Figure 2.1. Such a representation is called a Vector Space Model (VSM). Each word corresponds to a dimension and each document is a vector with non-negative values on each dimension. Figure 2.1 is an example in a 3-dimensional space, but in practice the document space usually runs into tens of thousands of dimensions. The VSM allows us to take 2- and 3-dimensional geometric formulae and extend them to m dimensions, where m is the number of distinct terms appearing in the set of documents.

Figure 2.1. Vector Space Model of documents. Figure by pyevolve [24].

To represent a document as a term vector, consider each word as a term. Obviously, some terms appear more frequently than others and are considered more important for the document. Let $D = \{d_1, \ldots, d_n\}$ be a corpus of documents and $T = \{t_1, \ldots, t_m\}$ be the set of distinct terms occurring in $D$. Let $tf(d, t)$ represent the frequency of term $t \in T$ in document $d \in D$. For document $d$, we can then define the m-dimensional vector $\vec{t}_d$ [22] as
6 CHAPTER 2. BACKGROUND

$$\vec{t}_d = (tf(d, t_1), \ldots, tf(d, t_m))$$


In practice, more complicated schemes such as tf-idf weighting are used. For basic prototyping, the document-term vector $\vec{t}_d$ is a good starting point.
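As an illustration, here is a minimal Python sketch of this representation; the three toy sentences are our own and are not drawn from the thesis corpus.

```python
from collections import Counter

# A toy corpus D: each "document" is a single sentence, as in Figure 2.1.
docs = [
    "space shuttle drifts through debris",
    "detective hunts a killer through the city",
    "astronaut survives debris in space",
]

# The term set T: all distinct terms occurring in D.
terms = sorted({w for d in docs for w in d.split()})

def tf_vector(doc):
    """Return the m-dimensional term-frequency vector t_d of one document."""
    counts = Counter(doc.split())
    return [counts[t] for t in terms]

vectors = [tf_vector(d) for d in docs]  # one row per document
```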

2.3 Topic Modeling


2.3.1 Overview
Modeling text corpora is a central problem of information retrieval (IR) and classification. tf-idf, a widely used scheme in the IR domain, is based on the document-term methodology. It describes the importance of a word to a document in a collection and reduces documents to a fixed-length matrix representation. But tf-idf gives hardly any insight into intra- and inter-document statistical structure, and it still produces a term × document sized matrix, which is of quite high dimension. To tackle these problems, Latent Semantic Indexing (LSI) was proposed, which applies singular value decomposition to the document-term matrix. LSI, being a dimensionality reduction technique, quickly became popular. In 1999, Hofmann proposed an improvement over LSI called probabilistic LSI (pLSI) [25]. pLSI models each word in a document as a sample from a mixture model [20], thereby representing a document in terms of a probability distribution over "topics". The mixture components representing topics are basically multinomial random variables.

Although an improvement over LSI, pLSI lacks a probabilistic model at the level of documents. This leads to pLSI parameters growing linearly with the size of the corpus. Another challenge is assigning topic proportions to new unseen documents. Improving on these shortcomings of pLSI, the LDA model was introduced by David Blei [20].

Before going into LDA, it is important to distinguish between features and hidden features. In image analysis, a feature is a "point of interest" for image description. A "good feature" has useful properties [26] such as being:

• perceptually meaningful (to humans)

• analytically special (e.g. maxima)

• identifiable in different images

Hidden features, mostly used in statistical and probabilistic modeling, are hidden random variables that are inferred from observed variables. In the topic modeling sense, the hidden variables are topics representing the thematic structure of a document collection, and the observed variables are the words of the documents.

2.3.2 Latent Dirichlet Allocation


In the original paper [27], a topic is defined as a distribution over a fixed vocabulary. Such a distribution allows us to represent a document in terms of multiple topics with different proportions, thereby making it easier to classify, store and find similar documents in a collection.
LDA defines a generative process for documents under the assumption that the topics are generated first, before the documents. Hence, when training with the number of topics set to 100, we are assuming that there are 100 topics in the collection of documents. For each document in the collection, the words are generated in a two-stage process [27]:

1. Randomly choose a distribution over topics.

2. For each word in the document:

   a) Randomly choose a topic from the distribution over topics in step 1.

   b) Randomly choose a word from the corresponding topic's distribution over the vocabulary.

The above process reflects the idea of LDA that documents exhibit multiple topics. Step 1 shows that each document exhibits the topics in different proportions. Further, each word within each document is picked from one of the topics (step 2b), where the selected topic is chosen from the per-document distribution over topics (step 2a).
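The following numpy sketch mirrors this two-stage process for a toy configuration; the vocabulary size, topic count, document length and Dirichlet hyperparameters are illustrative values, not the ones used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, N = 1000, 100, 250   # vocabulary size, topics, words per document (toy values)
alpha, eta = 0.1, 0.01     # Dirichlet hyperparameters (illustrative)

# Topics are generated first: each beta_k is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)

def generate_document():
    theta = rng.dirichlet(np.full(K, alpha))    # step 1: per-document topic mixture
    words = []
    for _ in range(N):
        z = rng.choice(K, p=theta)              # step 2a: choose a topic
        words.append(rng.choice(V, p=beta[z]))  # step 2b: choose a word from it
    return words

doc = generate_document()  # a bag of word ids exhibiting multiple topics
```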
The generative process for LDA can be written as the joint distribution of the hidden and observed variables:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} \left( p(\theta_d) \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \qquad (2.1)$$

where $\beta_{1:K}$ are the topics and each $\beta_k$ is a distribution over the vocabulary; $w_d$ are the observed words for document $d$, with $w_{d,n}$ the $n$th word in document $d$. The topic proportions for the $d$th document are $\theta_d$, where $\theta_{d,k}$ is the topic proportion for topic $k$ in document $d$. The topic assignments for the $d$th document are $z_d$, where $z_{d,n}$ is the topic assignment for the $n$th word in document $d$. Figure 2.2 shows the graphical model of LDA with three levels. First, $\alpha$ and $\eta$ are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables $\theta_d$ are document-level variables, sampled once per document. Finally, the variables $z_{d,n}$ and $w_{d,n}$ are word-level variables, sampled once for each word in each document [27].

After obtaining the joint distribution, we can compute the conditional distribution of the hidden variables (the topics) given the observed variables (the words). In Bayesian statistics, this is called the posterior of the hidden variables given the observed variables:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})} \qquad (2.2)$$
8 CHAPTER 2. BACKGROUND

Figure 2.2. The graphical model for latent Dirichlet allocation. Each node is a random variable in the generative process. The shaded circle represents the observed variables, i.e. the words of the documents; the unshaded circles are all hidden variables. Plates represent replication: N denotes the words within a document and D the collection of documents. Figure taken from Blei's paper [20].

The numerator is the joint distribution of all the random variables and the denominator is the marginal probability, summing over all possible ways of assigning each observed word of the collection to one of the topics [27]. Since the denominator requires an exponentially large computation, various approximation techniques are used to approximate the posterior. We used the MALLET [28] package, which uses Gibbs sampling for posterior approximation.

As mentioned in [27], relaxing and extending the statistical assumptions made by LDA could narrow the topics down to specific semantic patterns. Nowadays, topic modeling implementations have been optimized with features such as online learning of the LDA model for documents arriving in a stream, and multi-threading support.

2.4 Similarity Metrics


Finding similar movies for a target movie is the objective of a content based RS. Media content can be in the form of audio, video and text. In our case, each movie is represented by a single document consisting of movie review text; hence it is useful to look at the similarity metrics currently used in the document clustering domain. In document clustering, closeness between documents is defined in terms of the similarity or distance between them. In the rest of this chapter, some commonly used similarity metrics are discussed.

2.4.1 Cosine Similarity


Cosine Similarity (CS) is the most widely used measure of document similarity. It is used in the information retrieval domain, for example to measure the similarity between documents with data obtained from the LSI algorithm [29]. In order to measure the similarity of two documents, we calculate the cosine of the angle between the two term vectors of the documents. Figure 2.3 shows the angle in a two-dimensional document space. Given two documents $\vec{t}_a$ and $\vec{t}_b$, their cosine similarity is
2.4. SIMILARITY METRICS 9

Figure 2.3. Angle between two documents in a 2-d document-term space.

$$docsim_{cs}(\vec{t}_a, \vec{t}_b) = \frac{\vec{t}_a \cdot \vec{t}_b}{|\vec{t}_a| \times |\vec{t}_b|}, \qquad (2.3)$$

where $\vec{t}_a$ and $\vec{t}_b$ are m-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$. It is important to note that for documents the tf-idf weights are non-negative; hence the CS always lies in [0, 1].
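A minimal sketch of equation 2.3, operating on the plain term-frequency vectors defined in section 2.2:

```python
import math

def cosine_similarity(ta, tb):
    """Cosine of the angle between two term vectors (equation 2.3)."""
    dot = sum(a * b for a, b in zip(ta, tb))
    norm_a = math.sqrt(sum(a * a for a in ta))
    norm_b = math.sqrt(sum(b * b for b in tb))
    return dot / (norm_a * norm_b)

cosine_similarity([1, 2, 0], [2, 1, 1])  # ~0.73; 1.0 means identical direction
```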
Although cosine similarity is a widely used similarity metric, it is important to consider metrics based on probability distributions when the input data is a topic distribution. The Kullback-Leibler divergence has been shown to effectively cluster text data using both terms [22] and topic distributions [30].

2.4.2 Kullback-Leibler (KL) divergence


In the field of information theory, a document is described by a probability distribution over terms. We can then calculate the similarity between two documents as the distance between the two corresponding probability distributions [22]. For two distributions P and Q, the KL divergence of Q from P is

$$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log\frac{P(i)}{Q(i)} \qquad (2.4)$$

In other words, the KL divergence of Q from P is a measure of the information lost when Q is used to approximate P [31].

The limitation of KL divergence, when using it for similarity between documents based on probability distributions of topics, is that it is not symmetric. For a distance measure to be considered a similarity metric, it must be symmetric, i.e. the distance from x to y must equal the distance from y to x. For the case of KL divergence, consider the following equation, again in the document scenario:
divergence, consider the following equation, again in the document scenario:
$$D_{KL}(\vec{t}_a \,\|\, \vec{t}_b) - D_{KL}(\vec{t}_b \,\|\, \vec{t}_a) = \sum_{t=1}^{m} \log\left(\frac{w_{t,a}}{w_{t,b}}\right)(w_{t,a} + w_{t,b}) \qquad (2.5)$$
10 CHAPTER 2. BACKGROUND

Since the above expression is not zero in general, KL divergence is not symmetric. The solution is to use the arithmetic average of $D_{KL}(P\|Q)$ and $D_{KL}(Q\|P)$, or to calculate the Hellinger distance (HL) instead [32], [33]. In this work, we explored the HL distance further.
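The following sketch makes the asymmetry concrete on two toy distributions (assumed strictly positive, so the logarithm is defined):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q): information lost when Q approximates P (equation 2.4)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
kl_divergence(p, q) == kl_divergence(q, p)  # False: KL is not symmetric
symmetric_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```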

2.4.3 Hellinger Distance


The Hellinger distance is a metric of similarity between two probability distributions. For probability distributions $P = \{p_i\}_{i \in [n]}$ and $Q = \{q_i\}_{i \in [n]}$ supported on $[n]$, the Hellinger distance [34] between P and Q is defined as
$$h(P, Q) = \frac{1}{\sqrt{2}} \cdot \left\| \sqrt{P} - \sqrt{Q} \right\|_2 \qquad (2.6)$$
It is important to note that for cosine similarity a higher value is better, whereas for the Hellinger distance a smaller value represents more similarity.
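A minimal sketch of equation 2.6 for discrete topic distributions:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (equation 2.6)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

# Symmetric and bounded in [0, 1]; smaller means more similar distributions.
hellinger([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]) == hellinger([0.5, 0.3, 0.2], [0.7, 0.2, 0.1])
```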
The motivation to improve movie recommendation was the initial push to explore NLP and topic modeling techniques. With this knowledge about document processing, topic modeling and similarity measures, we can now discuss the design approach taken during the project implementation.
Chapter 3

Recommendation based on Movie Topics

This chapter discusses the implementation of the major steps involved in prototyping a movie-topic based RS. The algorithm for the overall system is visualized in four steps, as shown in Figure 3.1.

Figure 3.1. The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model and extracting topics from it. The topics are then used to find similar movies.

Summarizing the system:

1. Two datasets were created and preprocessed for the experiment:

   • Corpus A, a set of user-written movie reviews extracted from the web: a list of 943 popular movies from the last 10 years, rated by users on the web.

   • Corpus B, a list of ten target movies representing popular genres, hand-picked by two movie lovers who later evaluated the results.

2. An LDA based model is trained on Corpus A to generate movie topics.

3. Using the trained model, indexes of topic distributions for both corpus A and the unseen corpus B are created.

4. Using a similarity metric, a list of five similar movies for each target movie is created and presented for evaluation.

To implement the above system, Python [35] was used as the programming language of choice because of the large ecosystem of machine learning (ML) tools and libraries around it. Python based ML systems are easy to scale, as most of the open source libraries are memory efficient and support multiple threads of execution.

We start by analyzing and pre-processing the movie data. Next, feature extraction is performed on the processed data. Finally, the extracted features are used to find the similarity between movies.

3.1 User Reviews of Movies as Data

Figure 3.2. Screenshot of a sample movie review taken from IMDB. The highlighted words are relevant features that can be used for finding similar movies.

Movie reviews are widely available in audio, video and text form, so we needed to narrow down our choice of initial data. We decided to use text based movie reviews, as they are easy to extract from the Internet and have low computational complexity when prototyping with different algorithms. Reviews are written either by movie critics or by users. Basing our feature extraction on movie critic reviews could result in a biased view of a movie. Combining a large number of reviews written by users and using them as the source for our feature extraction system has the benefit that we might pick up semantic patterns considered or agreed upon by a wide audience of cinema. Figure 3.2 shows such semantic patterns that we want to extract in this project. In the sample review for the movie Gravity shown there, observe the description of another movie, "Apollo 13". Users connect movies while writing reviews, and this can be useful for finding semantic patterns across movies belonging to the same genre.

In this report, we use the term "document" to be consistent with the IR and topic modeling domains, but in our experimental setup a document consists of the user-written movie reviews for a single movie and thus represents that movie.

3.2 Text Preprocessing


In Natural Language Processing, a corpus is a collection of text data [36] used for verifying hypotheses about language, such as extracting features from text or finding patterns of word usage. For the movie review data, we collected the text and followed the preprocessing steps shown in Figure 3.3. During preprocessing, irrelevant words such as {of, and, or} are removed using a common English stopword list.

Figure 3.3. Collection and preprocessing of movie reviews.



Figure 3.4. Preprocessing of movie reviews is done in parallel by spawning subprocesses for the available number of CPU cores. The representation is inspired by Chris Kiehl's blog [37].

Next, NLTK's default lemmatizer is used for lemmatisation. It uses the WordNet database (http://wordnet.princeton.edu/) to look up lemmas. A lemmatizer reduces all derivationally related forms of a word to a common base form; for example, the word "cars" is reduced to "car". This allows us to keep the concept words and remove other forms of the same word in a corpus.
Since text preprocessing is done on 1k movie reviews, it is useful to process them in parallel. Figure 3.4 shows the multiprocessing approach taken to implement the preprocessing of movie reviews in parallel. The Python multiprocessing package is used, as it allows spawning new processes to utilize multiple processors on a given machine [35]. This saves time during prototyping and allows us to scale the system; a sketch of this parallel pipeline is shown below.
With the preprocessed data at hand, we explored a number of techniques in the NLP domain. We experimented with chunk extraction on the movie data. Chunking is useful for segmenting and labelling multi-token sequences in a sentence. One such result is shown in Figure 3.5.

Figure 3.5. Tree showing the NLTK based chunking technique applied to movie data.
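As a sketch of the technique (the noun-phrase grammar below is a generic example, not the exact grammar used in the project), NLTK's RegexpParser chunks a POS-tagged sentence into a tree like the one in Figure 3.5:

```python
import nltk

sentence = "Sandra Bullock drifts past the damaged space shuttle"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# A simple noun-phrase grammar: optional determiner, adjectives, then nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)  # an nltk.Tree with NP subtrees
tree.pprint()
```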


Although a chunking based approach is useful for tasks such as information extraction, it is not the right tool for analyzing semantic patterns in large volumes of unlabeled text such as movie reviews. In the IR domain, analyzing large amounts of unlabeled text is a common requirement, and this motivated us to look at IR techniques such as LSI and LDA.

3.3 Feature Extraction


3.3.1 Overview
The goal of feature extraction is to transform data from images or text into numerical features for the purpose of analysis. In text processing, techniques such as the document-term methodology convert text documents into numerical data. We can then easily feed such matrix-form data into machine learning algorithms to observe the thematic structure of documents. Mathematical techniques such as Latent Semantic Indexing (LSI) are then used to project the document-term matrix from a high dimensional space to a lower dimensional one in order to identify semantic meaning and similarity between documents. LSI is basically an application of Singular Value Decomposition (SVD) to a document-term matrix. Another approach in text processing is to express words and documents in terms of probability distributions, leading to models useful for finding semantic information. Probabilistic LSI (pLSI) and Latent Dirichlet Allocation (LDA) are such probabilistic models. Compared to LDA, pLSI provides no probabilistic model at the level of documents. For analyzing movies, it is necessary to model at the level of movies in a collection. Another benefit of LDA is that it fits new unseen documents better (new upcoming movies, in our case), an important requirement for a movie recommendation system.

In Figure 3.2, we can observe that the movie review talks about the concept of space with words such as "science" and "cosmic", and about genres with words such as "drama" and "thriller". Hence, a single movie review blends multiple topics in different proportions. Essentially, a movie is a combination of different genres, where each genre can be represented in a different proportion. As discussed in section 2.3.2 of chapter 2, the LDA model matches this idea of representing a document (a movie in our case) with multiple topics. Hence, we experimented with LDA modeling on the movie reviews dataset to analyze the movie topics.

3.3.2 Movie Topics


For the project, Gensim's [13] Python wrapper for MALLET's LDA [28] is used. MALLET has a number of benefits, such as multi-threading support and a fast implementation of Gibbs sampling. In order to generate movie topics, we first train the LDA model on the 1k movie review corpus. We then obtain the topic distribution of a movie by passing its review document to the trained LDA model; a sketch of these steps follows.
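This is a sketch under the Gensim 3.x API (gensim.models.wrappers.LdaMallet, later removed in Gensim 4); the MALLET installation path is an assumption, and `cleaned_reviews` stands for the preprocessed token lists produced earlier.

```python
from gensim import corpora
from gensim.models.wrappers import LdaMallet  # Gensim 3.x API

# `cleaned_reviews` is assumed: one token list per movie, as produced earlier.
dictionary = corpora.Dictionary(cleaned_reviews)
bow_corpus = [dictionary.doc2bow(doc) for doc in cleaned_reviews]

# Train MALLET's Gibbs-sampling LDA through the Gensim wrapper.
lda = LdaMallet("/path/to/mallet/bin/mallet",  # local MALLET install (assumed)
                corpus=bow_corpus, num_topics=100, id2word=dictionary)

# Topic distribution for one movie's review document.
topic_dist = lda[dictionary.doc2bow(cleaned_reviews[0])]
```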
Figure 3.6 shows five topics generated from reviews of the movie Gravity. Each column represents one topic. It can be observed from the figure that topics t1, t2 and t4 represent the movie Gravity with words such as "shuttle", "exploration", "debris" and "adrenaline". Topics t3 and t5 do not give an accurate description and need further filtering in order to yield better topics.

Prototyping with the review dataset gave us the following insights about the quality of topics:

• Preprocess the reviews extensively; remove unnecessary words.

• Use descriptive reviews, as they are more useful than reviews carrying only sentiment.

Ultimately, training an LDA model on movie reviews is just one step in getting good movie features. As a post-processing step, similarity measures can be used to find movies with a similar topic distribution.

3.4 Topic Similarity


During prototyping we explored commonly used similarity measures such as Cosine Similarity (CS), Kullback-Leibler (KL) divergence and the Hellinger distance (HL). As mentioned in section 2.4.2, KL divergence is not a symmetric measure. Hence, we calculated both CS and HL as similarity metrics for the ten target movies against the corpus of 1k reviews. The similarity values were then converted to a common similarity score of 0-100 for comparison. Figure 3.7 shows the positive correlation obtained from 50 movie scores, computed separately for both CS and HL. Since the movie topics are probability distributions, we used the Hellinger distance as the similarity measure for the experimental setup. The distance metric is calculated as follows (a code sketch is given after the list):

1. Index the topic distributions of the query movies q and the movie corpus C.

2. Apply the distance metric formula to the indexed q and C.

3. Sort and pick the top five movies.
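A minimal sketch of these three steps, assuming each topic distribution has already been converted to a dense K-dimensional vector:

```python
import numpy as np

def hellinger(p, q):
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

def top_similar(query_dist, corpus_dists, k=5):
    """Rank the corpus by Hellinger distance to the query's topic distribution."""
    distances = [hellinger(query_dist, d) for d in corpus_dists]
    order = np.argsort(distances)  # smaller distance = more similar
    return order[:k]               # indices of the k closest movies

# `corpus_dists` is assumed: a 943 x 100 matrix of precomputed topic
# distributions; top_similar(query_dist, corpus_dists) returns the
# indices of the five most similar movies.
```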


Topic 1       Topic 2        Topic 3       Topic 4          Topic 5
clooney       suicide        man           johnson          gaps
willis        flashbacks     damaged       weapons          lasted
bullock       philosophical  outset        assassin         ialso
sandra        bleak          cards         bullets          welldone
gravity       symbolism      watcher       installment      ifeel
mcclane       narration      whathappens   bullet           posted
debris        linear         nerve         adrenaline       edit
brucewillis   exploration    onhis         actionsequences  knowthat
justin        poetic         maintains     matrix           insteadof
shuttle       artsy          atfirst       combat           ahuge

Figure 3.6. Sample topics generated from user movie reviews for the movie Gravity.
Figure 3.7. Cosine similarity and Hellinger distance show strong positive correlation. The X-axis shows the similarity score for the Hellinger distance, whereas the Y-axis represents the cosine similarity score.
Chapter 4

Experimental Setup and Results

4.1 Experimental Setup


We started the setup by collecting the necessary movie reviews from the web. A list of top movies from the last 10 years was created, consisting of around the top 50 movies from each year between 2004 and 2013. IMDB publishes a popular movie list for each year based on user votes (http://www.imdb.com/search/title?year=2013,2013&title_type=feature); such a list represents a good mixture of the popular genres liked by moviegoers. For this list of 943 movies, user-written movie reviews were scraped and stored in raw HTML format. Next, the text content was extracted from the HTML files using BeautifulSoup [38], an open source library; a sketch of this step follows.
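A minimal sketch of the extraction step; the directory names are hypothetical.

```python
from pathlib import Path
from bs4 import BeautifulSoup

# One scraped HTML file of user reviews per movie (paths are hypothetical).
for html_file in Path("raw_html").glob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")
    text = soup.get_text(separator=" ")  # strip tags, keep the review text
    out_path = Path("movie-reviews-10-years") / (html_file.stem + ".txt")
    out_path.write_text(text, encoding="utf-8")
```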
The extracted text is stored in a directory where each text file consists of the user-written movie reviews for a single movie. In total, the corpus contains 943 movie review documents, as shown in the tree structure in Figure 4.1. The prepared corpus is a balanced mixture of popular genres, as shown in Figure 4.2, which allows us to experiment without any bias towards a particular movie genre.

A point to note is that we also created a corpus by processing the Large Movie Review Dataset [39], but due to the computational complexity we decided to scale down and prototype on the smaller dataset mentioned above.

4.1.1 Text processing


For text processing of the reviews we used NLTK [40], an open source library. First, we iterate over the corpus and tokenize each file. Tokens are the basic elements of text mining, allowing us to analyze and process text at the word level. With access to individual words, we remove punctuation and unwanted words, i.e. stopwords. Stopwords are high-frequency grammatical words which are usually ignored as they do not provide any useful information; examples are {other, there, the, of, are}. We used NLTK's default stopword list for the English language (http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/english.stop).

src
└── movie-reviews-10-years
    ├── movie1.txt
    ├── movie2.txt
    ├── ...
    └── movie943.txt

Figure 4.1. A tree diagram showing the movie review corpus.

Figure 4.2. Chart showing the genres of popular movies from the last 10 years.

We kept all word tokens longer than two alphabetical characters. Words occurring only once in the whole corpus were also removed; a small sketch of these filters follows.
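A small sketch of the two filters, assuming `tokenized_docs` holds one token list per movie document:

```python
from collections import Counter

# Keep alphabetical tokens longer than two characters.
filtered = [[t for t in doc if t.isalpha() and len(t) > 2] for doc in tokenized_docs]

# Drop words occurring only once in the whole corpus.
corpus_freq = Counter(t for doc in filtered for t in doc)
filtered = [[t for t in doc if corpus_freq[t] > 1] for doc in filtered]
```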
During prototyping, we picked out a few meaningless topics (from the movie recommendation point of view: words such as good, great, bad, excellent, review, film) and created a stopword list out of them. The visualization in Figure 4.3 shows meaningless topics as high-density blue columns; these columns represent topics whose common words are present throughout the corpus. With the new list fed back into the system, we re-preprocessed our corpus and obtained improved topics.
Figure 4.3. A visualization showing 20 topics generated from 100 movie reviews. The vertical axis represents the movie review documents, denoted by their corresponding ids, while the horizontal axis represents the movie topics.

4.1.2 Training LDA model


The movie review corpus is passed to Gensim's [13] Python wrapper for MALLET's LDA [28]. We tested the quality of the generated topics with 100, 150 and 250 topics and found 100 topics to be the right size for our 1k movie review corpus; having too few or too many topics can reduce topic quality. Also, training a model with 100 topics is computationally efficient and saves time, which allows us to repeat the process with different inputs. With the generated movie topics, we decided to investigate further and experimented with using the topics to find similar movies.

4.1.3 Calculating movie similarity


Once the model is trained on the movie reviews, we can infer topic distributions for new, unseen movies by passing their reviews through the model in the same way as the original corpus. We indexed and stored the topic distributions for the ten target movies to be used during evaluation. Using the HL distance metric, we then find movies with a similar distribution. The final result is saved in JSON format and passed to the web evaluation system.

4.2 Evaluation
The goal of our experimental evaluation is twofold: first, to evaluate the performance of the system; second, to verify the topics themselves and their effectiveness in representing movies. We chose subjective evaluation with explanation, as it fits this twofold goal. Subjective evaluation measures are expressions of the users about the system or their interaction with the system [41]. They are commonly used to evaluate the usability of recommender systems.

During implementation, we borrowed ideas from approaches used in RS with explanation [42]. A traditional RS behaves like a black box to the end user, which leaves the user confused as to why a particular movie has been recommended. An RS with explanation can help the user understand the system better and complement it by giving feedback. For our evaluation, the movie topics are presented as an explanation of the recommended movie and as a criterion for receiving feedback on it.

4.2.1 Evaluation criteria


Table 4.1 shows the five-point criteria for the evaluation system. Genre, Mood and Plot are basic to movie similarity. We observed the presence of actor names in the extracted movie topics; hence, evaluating the effect of actor overlap could be useful.

Criteria                Explanation
Genre                   Similarity of genres between the target movie and the recommended movies.
Mood                    Similarity of mood.
Plot                    Similarity of plot.
Overlap                 Overlap of actors/actresses/director or lead cast.
Topic-relevance-score   Relevance of the topics as an explanation of the recommended movies.

Table 4.1. Evaluation criteria used in our web based movie evaluation system.

4.2.2 Web based movie evaluation setup


A web based movie evaluation system was created to evaluate the results obtained from our experimental setup. Figure 4.4 shows the home page of the system. It allows users to log in over the web and rate movies. Evaluation proceeds as follows:

• A target movie and a recommended movie are presented. Judges rate the recommended movie based on the various evaluation criteria.

• Next, an explanation in the form of movie topics is shown, and judges re-rate the movie after reading the explanation.

We decided to show an explanation for each recommended movie in order to evaluate how well the topics represent a movie. Figures 4.5 and 4.6 show this two-step evaluation. For each target movie, five movies are presented following the steps above. The ratings are then saved to a database and later used to analyze the results. For the project, three judges were invited to rate movies. Our judges are regular moviegoers and watch movies from a wide spectrum of genres.
Figure 4.4. Front-page of the movie evaluation system, showing five target movies.
A user clicks on a target movie and five similar movies are presented for evaluation.
Figure 4.5. Web based movie evaluation system. Shown on the left is a target movie.

Figure 4.6. Movie evaluation system with explanation.

4.3 Results
Initially we did the evaluation with ten target movies, but realized that subjective evaluation is a slow process; we therefore updated the system and did the evaluation for five target movies only. Figures 4.7, 4.8, 4.9, 4.10 and 4.11 show the evaluation results for the four evaluation criteria. For each criterion, we also show the movie topics as an explanation; the result for evaluation with explanation is shown in the bottom visualization of each figure. The movie judges rated each movie on a scale of 1-4: 1 means "Not Similar", 2 "Somewhat Similar", 3 "Similar" and 4 "Perfect", representing that the user is happy with the recommended movie.

4.3.1 Evaluation result


Some observations about the results:

• Out of the five movies given to the judges for evaluation, one movie had a similarity score of 40-50% and hence received the lowest ratings, whereas for most of the other movies the scores were in the range of 50-70%.

• As shown at the top of Figure 4.7, the genre criterion shows 30-35% of ratings between 2 and 3, with a median of 3 for genre only and 2 for genre with explanation. As observable in the re-ratings (bottom figure), the movie topics differ slightly from the judges' understanding of genre, but overall the judges agree with movie topics as an explanation of movie genre.

• As shown at the top of Figure 4.8, the mood criterion shows 25-30% of ratings between 2 and 3, with a median of 2. Both genre and mood information have been captured quite well by the movie topics. The ratings (top) and re-ratings (bottom) are almost the same for the mood evaluation; hence, the judges agree with movie topics as an explanation of the mood aspect of movies.

• As shown in Figure 4.9, the majority of recommended movies are not similar at all in terms of movie plot, with 40% of ratings given to "Not Similar". This shows that capturing the plot is much more difficult than genre or mood. The ratings (top) and re-ratings (bottom) are almost the same for the plot evaluation; hence, the judges agree that plot information is not well captured by movie topics, and more information is needed to recommend movies with similar plots.

• In the LDA model, the order of words and the order of documents are not considered. For modeling movie plot information, a time based description of concepts and events is important. In order to better extract plot information, topics would have to evolve over the timeline of a movie.

• The overlap criterion was rated "Not Similar" for 60-70% of the ratings. This is understandable, as our collection consists of only 1k movies and it is difficult to find actor overlaps within a smaller corpus. Again, the ratings (top) and re-ratings (bottom) are almost the same for the overlap evaluation; hence, the judges agree that actor overlaps are not well captured by movie topics.

• Figure 4.11 shows the ratings for the topic relevance score. A combined rating of 73% falls between 2 and 3. This represents the overall usefulness of the topics in finding similar movies and as an explanation.

• We did the subjective evaluation on a smaller scale, as rating movies is a slow process. The judges needed time to watch previously unseen movies before rating them.

• Overall, the genre, mood and topic relevance score criteria showed useful results.

It is important to note that the topics generated by an LDA model change every time the model is retrained. Running the model anew generates a different set of topics, slightly changed from the previous one. Hence, the final result of similar movies might change as well, based on the generated topics.

4.3.2 Rating correlation


• Although we analyzed the other criteria for correlation, we observed a strong positive correlation only between genre and mood, as shown in Figure 4.12.

• For correlation between the judges' ratings, we analyzed all the ratings. Figure 4.13 shows a strong positive correlation between two judges: both judges agree with each other on 34.4% of ratings at rating 1, followed by 13.6% of ratings at rating 2.

4.3.3 Observations on Subjective evaluation


Subjective evaluation is useful for getting feedback on recommended movies. Our judges gave feedback about the movie topics and showed interest in rating topics individually in a future evaluation. This could be useful for filtering noise and maintaining a top rated list of movie topics. Although our system has only 100 topics, topic rating could be highly relevant when building a hierarchical list of movie topics, as the key challenge with a higher number of topics is to keep the good topics and remove the bad ones. In the end, subjective evaluation is time constrained, as evaluating movies and topics individually is a slow process, but the outcome is a quite accurate conclusion about the extracted features, the recommended movies and the system itself.

Figure 4.7. Result of average rating for Genre (top) and Genre with explanation
(bottom).

Figure 4.8. Result of average rating for Mood (top) and Mood with explanation
(bottom).

Figure 4.9. Result of average rating for Plot (top) and Plot with explanation
(bottom).

Figure 4.10. Result of average rating for Overlap (top) and Overlap with explana-
tion (bottom).

Figure 4.11. Average ratings for the movie topics.

Figure 4.12. Strong positive correlation between Genre and Mood.



Figure 4.13. Strong positive correlation of ratings between two judges. The judges agree most often at rating 1, followed by ratings 2 and 3.
Chapter 5

Conclusion and Future Directions

5.1 Conclusion

In this project, we developed a prototype system for extracting movie features, i.e. topics. We trained a model on a collection of movie reviews and used the trained model to find similar movies based on the Hellinger distance between movie topic distributions.

Evaluation results show that such an approach gives good results even with a small movie collection. The results show that movie topics are efficient features, as they perform fairly well in capturing movie genre and mood. The movie plot results are only somewhat satisfactory and call for descriptive plot information and better methods that can capture the story-line. Our small movie corpus resulted in very little overlap of actors. Topics as an explanation in movie recommendation are quite useful, but need to be fine-tuned with the ability to rate individual topics; user rated movie topics could then be used as feedback to the system.

Finally, movie topics are efficient features for movie recommendation systems, as they represent the semantic patterns behind movies. With user movie reviews as data, movie topics capture essential movie aspects such as genre and mood. Our prototyping approach to feature extraction has the potential to scale to a large number of movies.

5.2 Future Directions

In this project, we used user-written movie reviews for extracting features. This method could be extended or combined with other forms of movie meta-data such as plot, genres and keywords. With recent advances in deep learning, it would be interesting to study the effect of using LDA as a preprocessing step in deep learning analysis of movie reviews. In the following, we discuss a few interesting future directions.


5.2.1 Movie review preprocessing


The basic LDA model itself does not account for word order. As is easily observable, word order matters in several cases, especially for bi-gram movie keywords such as "dark comedy" or "nordic horror". We did a small experiment with bi-grams but ended up with noisy bi-gram based movie topics, as the bi-grams were not consistent in their representation. Approaches in language construction [43] could be used to create multi-word movie keywords. Finally, extracting and using word constructions from movie reviews has the potential to further capture movie semantics.

5.2.2 Building complex topic models


The LDA model can be considered a base model, and more complex models can be built on top of it according to the needs of the data at hand. The Correlated Topic Model (CTM) [44] and the Dynamic Topic Model (DTM) [45] are such models built on top of LDA. For example, DTM could be used to observe changing movie patterns over the years. With TV shows running for 10-15 seasons, DTM could highlight the rise and fall of characters over the seasons.

Topic models can also be extended to include additional information such as meta-data. For example, author-topic models attach topic proportions to authors, making it possible to calculate author similarity [27] based on topic proportions. Hierarchical LDA models [46] are another direction to explore, as extending hundreds of topics to thousands could represent a wide spectrum of movie genres. Recommending movies based on the topics liked by users, and rating the topics themselves, are some of the ways to improve the extracted topics and build a system based on topic modeling.

With so many choices of streamable content, the challenge is to efficiently extract features from all forms of meta-data, recommend relevant content to the end user and keep serendipity in the recommendations.
Bibliography

[1] J. Booton, One-click Netflix button to make movie streaming even easier, Fox Business, Aug. 2011. [Online]. Available: http://www.foxbusiness.com/markets/2011/01/04/click-netflix-button-appear-remote-controls-movie-streaming/ (visited on Jun. 11, 2014).
[2] A. C. Madrigal, How Netflix reverse engineered Hollywood, Jan. 2014. [Online]. Available: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/ (visited on May 13, 2014).
[3] Me TV: how Jinni is revolutionizing search. [Online]. Available: http://www.forbes.com/sites/dorothypomerantz/2013/02/18/me-tv-how-jinni-is-revolutionizing-search/ (visited on May 13, 2014).
[4] B. Fritz, “Cadre of film buffs helps Netflix viewers sort through the clutter”, Los Angeles Times, Sep. 2012, issn: 0458-3035. [Online]. Available: http://articles.latimes.com/2012/sep/03/business/la-fi-0903-ct-netflix-taggers-20120903 (visited on May 15, 2014).
[5] J. Layton. (May 2006). How Pandora Radio works, [Online]. Available: http://computer.howstuffworks.com/internet/basics/pandora.htm.
[6] X. Amatriain, The Netflix tech blog: Netflix recommendations: beyond the 5 stars (part 1). [Online]. Available: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html (visited on Jun. 11, 2014).
[7] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques”, in Proceedings of EMNLP, 2002, pp. 79–86.
[8] S. Ahn and C.-K. Shi, “Exploring movie recommendation system using cultural metadata”, in 2008 International Conference on Cyberworlds, Sep. 2008, pp. 431–438. doi: 10.1109/CW.2008.13.
[9] A. Blackstock and M. Spitz, “Classifying movie scripts by genre with a MEMM using NLP-based features”, Stanford M.Sc. course Natural Language Processing, student report, Jun. 2008. [Online]. Available: http://nlp.stanford.edu/courses/cs224n/2008/reports/06.pdf.


[10] R. Berendsen, “Movie reviews: do words add up to a sentiment?”, PhD thesis, Rijksuniversiteit Groningen, Sep. 2010.
[11] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative filtering of netnews”, in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, ser. CSCW ’94, Chapel Hill, North Carolina, USA: ACM, 1994, pp. 175–186, isbn: 0-89791-689-1. doi: 10.1145/192844.192905. [Online]. Available: http://doi.acm.org/10.1145/192844.192905.
[12] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python, 1st ed. O’Reilly, 2009.
[13] R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora”, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta: ELRA, May 2010, pp. 45–50. [Online]. Available: http://is.muni.cz/publication/884893/en.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: machine learning in Python”, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[15] J. Vig, S. Sen, and J. Riedl, “Tagsplanations”, in Proceedings of the 13th International Conference on Intelligent User Interfaces (IUI ’09), 2009. doi: 10.1145/1502650.1502661. [Online]. Available: http://dx.doi.org/10.1145/1502650.1502661.
[16] T. Luostarinen and O. Kohonen, “Using topic models in content-based news recommender systems”, in Proceedings of NODALIDA 2013, ser. 19, vol. 85, Oslo, Norway: Linköping University Electronic Press, May 2013, p. 239 of 474, isbn: 978-91-7519-589-6. [Online]. Available: http://emmtee.net/oe/nodalida13/conference/11.pdf.
[17] R. K. V and K. Raghuveer, “Legal documents clustering using latent dirichlet allocation”, International Journal of Applied Information Systems, vol. 2, no. 6, pp. 27–33, May 2012, published by Foundation of Computer Science, New York, USA.
[18] R. Krestel, P. Fankhauser, and W. Nejdl, “Latent dirichlet allocation for tag recommendation”, in Proceedings of the Third ACM Conference on Recommender Systems (RecSys ’09), 2009. doi: 10.1145/1639714.1639726. [Online]. Available: http://dx.doi.org/10.1145/1639714.1639726.
[19] C. Wang and D. M. Blei, “Collaborative topic modeling for recommending scientific articles”, in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’11, San Diego, California, USA: ACM, 2011, pp. 448–456, isbn: 978-1-4503-0813-7. doi: 10.1145/2020408.2020480. [Online]. Available: http://doi.acm.org/10.1145/2020408.2020480.
[20] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation”, J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003, issn: 1532-4435. [Online]. Available: http://dl.acm.org/citation.cfm?id=944919.944937.
[21] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: a review”, CSUR, vol. 31, no. 3, pp. 264–323, 1999. doi: 10.1145/331499.331504. [Online]. Available: http://dx.doi.org/10.1145/331499.331504.
[22] A. Huang, “Similarity measures for text document clustering”, in Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC 2008), Christchurch, New Zealand, 2008, pp. 49–56.
[23] S. Bordag, “A comparison of co-occurrence and similarity measures as simulations of context”, in Proceedings of the 9th International Conference on Computational Linguistics and Intelligent Text Processing, 2008, pp. 52–63. [Online]. Available: http://dl.acm.org/citation.cfm?id=1787584.
[24] C. Perone. (Sep. 2013). Machine learning :: cosine similarity for vector space models (part III) | Pyevolve, [Online]. Available: http://pyevolve.sourceforge.net/wordpress/?p=2497.
[25] T. Hofmann, “Probabilistic latent semantic indexing”, in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 1999, pp. 50–57.
[26] A. Aichert, “Feature extraction techniques”, in CAMP Medical Seminar, 2008.
[27] D. M. Blei, “Introduction to probabilistic topic models”, Communications of the ACM, 2011. [Online]. Available: http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf.
[28] A. K. McCallum, MALLET: a machine learning for language toolkit, 2002. [Online]. Available: http://mallet.cs.umass.edu.
[29] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008, isbn: 0521865719.
[30] D. Olszewski, “Fraud detection in telecommunications using Kullback–Leibler divergence and latent dirichlet allocation”, in Adaptive and Natural Computing Algorithms, ser. Lecture Notes in Computer Science, A. Dobnikar, U. Lotrič, and B. Šter, Eds., vol. 6594, Springer Berlin Heidelberg, 2011, pp. 71–80, isbn: 978-3-642-20266-7. doi: 10.1007/978-3-642-20267-4_8. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-20267-4_8.
[31] Wikipedia. (Sep. 2014). Kullback–Leibler divergence, [Online]. Available: https://en.wikipedia.org/wiki/Kullback-Leibler_divergence.

[32] DocEng 2011: document visual similarity measure for document search, Oct. 2011. [Online]. Available: https://www.youtube.com/watch?v=KVFY-r-BLJQ&feature=youtube_gdata_player (visited on Aug. 26, 2014).
[33] D. M. Blei and J. D. Lafferty, “Topic models”, in Text Mining: Classification, Clustering, and Applications, vol. 10, p. 71, 2009.
[34] P. Harsha. (Sep. 2011). Hellinger distance, [Online]. Available: http://www.tcs.tifr.res.in/~prahladh/teaching/2011-12/comm/lectures/l12.pdf.
[35] G. van Rossum and F. L. Drake, The Python Language Reference Manual. Network Theory Ltd., 2011, isbn: 1906966141.
[36] D. Crystal, What is a corpus? What is corpus linguistics?, university website. [Online]. Available: http://www.tu-chemnitz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm.
[37] C. Kiehl. (Dec. 2013). Parallelism in one line, [Online]. Available: https://medium.com/@thechriskiehl/parallelism-in-one-line-40e9b2b36148.
[38] L. Richardson. (Apr. 2007). Beautiful Soup documentation, [Online]. Available: http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
[39] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis”, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA: Association for Computational Linguistics, Jun. 2011, pp. 142–150. [Online]. Available: http://www.aclweb.org/anthology/P11-1015.
[40] E. Loper and S. Bird. (May 2002). NLTK: the Natural Language Toolkit. arXiv: cs/0205028, [Online]. Available: http://arxiv.org/abs/cs/0205028.
[41] RecSysWiki. (Feb. 2011). Subjective evaluation measures, [Online]. Available: http://recsyswiki.com/wiki/Subjective_evaluation_measures.
[42] N. Tintarev and J. Masthoff, “Designing and evaluating explanations for recommender systems”, in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds., Springer US, 2011, pp. 479–510, isbn: 978-0-387-85819-7. doi: 10.1007/978-0-387-85820-3_15. [Online]. Available: http://dx.doi.org/10.1007/978-0-387-85820-3_15.
[43] M. Sahlgren and O. Knutsson, Eds., Proceedings of the Workshop on Extracting and Using Constructions in NLP, Swedish Institute of Computer Science, 2009.
[44] D. M. Blei and J. D. Lafferty, “A correlated topic model of science”, The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, Jun. 2007. doi: 10.1214/07-AOAS114. [Online]. Available: http://dx.doi.org/10.1214/07-AOAS114.

[45] D. M. Blei and J. D. Lafferty, “Dynamic topic models”, in Proceedings of the 23rd International Conference on Machine Learning, ser. ICML ’06, Pittsburgh, Pennsylvania: ACM, 2006, pp. 113–120, isbn: 1-59593-383-2. doi: 10.1145/1143844.1143859. [Online]. Available: http://doi.acm.org/10.1145/1143844.1143859.
[46] D. Griffiths and M. Tenenbaum, “Hierarchical topic models and the nested Chinese restaurant process”, in Advances in Neural Information Processing Systems, vol. 16, p. 17, 2004.
