Efficient Features for Movie Recommendation Systems

SUVIR BHARGAV

XR-EE-KT 2014:012
Abstract
Acknowledgements
This thesis was carried out at Vionlabs AB. From the initial idea to the final execution, everyone at Vionlabs supported the endeavour to build and create something around movies and technology. I would like to thank my supervisor, Roelof Pieters, for his guidance and for the many discussions around NLP, topic modeling and movie recommendation systems.
I would also like to thank the main author of the Gensim library, Radim, for his many suggestions and ideas. I extend my gratitude to the great community of programmers and engineers who took the time to reply to my questions on Stack Overflow.
I would like to thank my coordinator and examiner, Markus Flierl, for his valuable guidance and suggestions at each stage of the project. I would also like to thank all the movie judges at Vionlabs for their time and effort in rating movies. Finally, I would like to thank my family and friends, who constantly supported me throughout the thesis.
Contents

1 Introduction
  1.1 Question
  1.2 Goals
  1.3 Outline
2 Background
  2.1 Movie Data Processing: A Literature Review
  2.2 Document Representation
  2.3 Topic Modeling
    2.3.1 Overview
    2.3.2 Latent Dirichlet Allocation
  2.4 Similarity Metrics
    2.4.1 Cosine Similarity
    2.4.2 Kullback-Leibler (KL) Divergence
    2.4.3 Hellinger Distance
Bibliography
List of Figures

3.1 The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model and extracting topics from it. The topics are then used to find similar movies.
3.2 Screenshot of a sample movie review taken from IMDB. Highlighted words are relevant features that can be used for finding similar movies.
3.3 Collection and preprocessing of movie reviews.
3.4 Preprocessing of movie reviews is done in parallel by spawning subprocesses for the available number of CPU cores. The representation is inspired by Chris Kiehl's blog [37].
3.5 Tree showing the NLTK-based chunking technique applied to movie data.
3.6 Sample topics generated from user movie reviews for the movie Gravity.
3.7 Cosine similarity and Hellinger distance show a strong positive correlation. The X-axis shows the similarity score based on the Hellinger distance, whereas the Y-axis represents the cosine similarity score.
4.5 Web-based movie evaluation system: the front page upon log-in, showing 10 target movies.
4.6 Movie evaluation system with explanation.
4.7 Result of average rating for Genre (top) and Genre with explanation (bottom).
4.8 Result of average rating for Mood (top) and Mood with explanation (bottom).
4.9 Result of average rating for Plot (top) and Plot with explanation (bottom).
4.10 Result of average rating for Overlap (top) and Overlap with explanation (bottom).
4.11 Average ratings for the movie topics.
4.12 Strong positive correlation between Genre and Mood.
4.13 Strong positive correlation of ratings between two judges. Judges agree most at rating 1, then at ratings 2 and 3.
Chapter 1
Introduction
The advent of movie streaming services has made thousands of movies available at the click of a button [1]. We now have movies not only from Hollywood, but also from international cinema, documentaries, indie films, and so on. With so many movies at hand, the consumer faces the dilemma of what to watch. At the end of the day, people just want to relax and watch something that matches their mood, taste and style. This is where Recommendation Systems (RS) can help, by suggesting movies that match a user's taste and viewing habits. In order to recommend movies, we first need to understand them: the better we understand movie features (genre, keywords, mood, etc.), the better the recommendations we can serve.
Commercial streaming services such as Netflix [2] and Jinni [3] combine semantic information about movies with user ratings to obtain an optimal hybrid RS. However, they still depend on human taggers [4], [5] for the basic feature representations needed to classify movies or songs. Although the results obtained from human taggers are quite good, such an approach does not scale to tagging hundreds of thousands of movies or millions of daily generated videos.
For a system to understand a movie, it needs movie features such as the cast, the genre, the plot, etc. With this information, a system can better categorize movies. User-written movie reviews are one such source of features: they carry a substantial amount of movie-related information, such as location, time period, genre, lead characters and memorable scene descriptions.
Since a user-written movie review contains both useful information (i.e., keywords) and useless information (i.e., stopwords), some text preprocessing is required before it can be used by a RS. With preprocessed movie data, the next step is to find a good feature representation for movies. In this thesis, we explore feature extraction from movie reviews using Natural Language Processing (NLP) and topic modeling techniques, and use the extracted features to find similar movies. The experiments are done on a small set of movies to show that movie topics are efficient features for RS.
1.1 Question
• Is it possible to extract or generate movie features from user reviews?
• What is a good feature and how can we distinguish good features from bad
ones?
1.2 Goals
The goals of this master thesis are:
1.3 Outline
This work is presented in the following chapters:
• Chapter 2 discusses the background study done during the project. Technical
concepts that have been used in the project will be presented.
Chapter 2

Background
A heuristic of combining several user-written movie reviews of a single movie into a single document has the potential to discover semantic patterns at the movie level. Furthermore, combining all individual movie documents into a collection allows us to explore patterns across the collection (essentially, across genres).
In order to use movie reviews as data, it is necessary to remove irrelevant words, symbols, HTML tags, etc. In NLP, a large number of open-source tools and libraries [12]–[14] are available and are used as the first step in any kind of text processing. Chapter 7 of [12] describes the steps involved in extracting information from text, and [15] uses the NLTK toolkit for the stopword-removal and stemming steps. Noisy text data can drastically affect the result of any kind of NLP-based model training. Text filtering removes unnecessary information and allows us to apply complex mathematical models to the data.
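As an illustration, a minimal preprocessing sketch in Python using NLTK is given below; the function name and the exact filtering steps are our own assumptions for illustration, not the precise pipeline used in this thesis.

    # Minimal review-preprocessing sketch with NLTK (illustrative, not the
    # thesis' exact pipeline). Requires: nltk.download("stopwords"),
    # nltk.download("punkt").
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    def preprocess(review_text):
        # Strip HTML tags and non-alphabetic symbols, then lowercase.
        text = re.sub(r"<[^>]+>", " ", review_text)
        text = re.sub(r"[^A-Za-z\s]", " ", text).lower()
        # Tokenize, drop stopwords and very short tokens, and stem the rest.
        stop = set(stopwords.words("english"))
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in word_tokenize(text)
                if t not in stop and len(t) > 2]

    print(preprocess("An <b>amazing</b> space thriller with stunning visuals!"))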
The paper [8] compares the results obtained by preprocessing different kinds of metadata, simply computing the cosine similarity from the document-term matrix of the movie data. Although the paper showed the highest precision with user comments on movies, it did not analyze the data further with more advanced techniques such as LSA. After preprocessing, reducing the dimensionality of the data is the next step in feature extraction.
The document-term matrix is used as the input to many semantic analysis techniques, from the basic tf-idf scheme to more complex models such as LSI, LSA and LDA. These dimensionality reduction techniques can extract semantic features from large datasets on off-the-shelf hardware [13], which makes them interesting candidates for investigating semantic concepts in movie data. The thesis [10] studies sentiment analysis of movie reviews using LSA but concludes that the dimensions that capture the most variance are not always the most discriminating features for classification.
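As a concrete illustration of this input representation, the following sketch builds a sparse document-term matrix and a tf-idf model with Gensim; the toy token lists are made-up stand-ins for preprocessed review documents.

    # Sketch: building a document-term matrix and a tf-idf model with Gensim.
    from gensim import corpora, models

    docs = [["space", "shuttle", "debris", "astronaut"],
            ["astronaut", "survival", "space", "tension"],
            ["romance", "paris", "love"]]

    dictionary = corpora.Dictionary(docs)        # term <-> id mapping
    bow = [dictionary.doc2bow(d) for d in docs]  # sparse document-term matrix
    tfidf = models.TfidfModel(bow)               # reweight raw counts
    print(tfidf[bow[0]])                         # tf-idf vector of document 0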
On the other hand, [16] shows interesting results when using topic modeling in a content-based RS. Probabilistic topic modeling allows us to extract hidden features, i.e., “topics”, from documents. LDA, a topic model, shows good results in both document clustering [17] and recommendation systems [18], [19]. It can capture important intra-document statistical structure by considering mixture models for the exchangeability of both words and documents within a corpus [20]. With probabilistic techniques such as LDA, it is possible to derive semantic similarities from textual movie data, and such extracted semantic information can be used to find similar movies. Moreover, LDA can assign a topic distribution to new, unseen documents, an important requirement for building a scalable movie RS, since it should be trivial to add new movies on a regular basis. For a RS, computing similarity is an essential part, be it similarity of content or of user ratings.
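A minimal sketch of this workflow with Gensim's LdaModel is shown below, including the assignment of topics to an unseen document; the corpus and parameter values are toy assumptions, not the thesis settings.

    # Sketch: training an LDA model with Gensim and inferring the topic
    # distribution of an unseen document (toy corpus and parameters).
    from gensim import corpora, models

    docs = [["space", "shuttle", "debris", "astronaut"],
            ["astronaut", "survival", "space", "tension"],
            ["romance", "paris", "love", "heart"]]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

    # LDA can assign a topic distribution to a document it has never seen,
    # which is what makes adding new movies to the index cheap.
    unseen = dictionary.doc2bow(["space", "debris", "survival"])
    print(lda.get_document_topics(unseen))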
Clustering, an unsupervised classification of patterns, is a technique applied to movie metadata in RS. The review paper on clustering [21] only briefly discusses similarity measures, but emphasizes that similarity is fundamental to clustering. [22], [23] have done detailed studies of commonly used similarity measures in text clustering. Since the input data for our project, i.e., movie reviews, are in the form of text documents, we can begin with the similarity measures discussed in [22]. Once the similarity between movies is computed, it is important to evaluate the result obtained. For unsupervised learning techniques such as LDA, evaluation is not straightforward.
Hidden features, mostly used in statistical and probabilistic modeling, are hidden random variables that are inferred from observed variables. In the topic modeling sense, the hidden variables are topics representing the thematic structure of a document collection, and the observed variables are the words of the documents. Each document exhibits these topics with different proportions, thereby making it easier to classify, store and find similar documents in a collection.
LDA defines a generative process for documents under the assumption that the topics are generated first, before the documents. Hence, when training with the number of topics set to 100, we are assuming that there are 100 topics in the collection of documents. For each document in the collection, the words are generated in a two-stage process [27]:
1. Randomly choose a distribution over topics.
2. For each word in the document:
a) Randomly choose a topic from the distribution over topics in step 1.
b) Randomly choose a word from the corresponding distribution over the vocabulary.
The above process reflects the central idea of LDA that documents exhibit multiple topics. Step 1 shows that each document exhibits the topics in different proportions. Further, each word within each document is picked from one of the topics (step 2b), where the selected topic is chosen from the per-document distribution over topics (step 2a).
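The following toy simulation of this two-stage process, written with NumPy, is purely illustrative; the vocabulary, topic count and word counts are made-up assumptions.

    # Toy simulation of LDA's generative process (illustrative numbers only).
    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["space", "debris", "ship", "love", "paris", "heart"]
    K = 2                                              # assumed number of topics
    beta = rng.dirichlet(np.ones(len(vocab)), size=K)  # each topic: dist over vocab

    theta = rng.dirichlet(np.ones(K))  # step 1: per-document topic proportions
    doc = []
    for _ in range(10):                # step 2: generate each word
        z = rng.choice(K, p=theta)                # 2a: pick a topic
        w = rng.choice(len(vocab), p=beta[z])     # 2b: pick a word from that topic
        doc.append(vocab[w])
    print(doc)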
The generative process for LDA can be written as the joint distribution of the hidden and observed variables [27]:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right),$$
where $\beta_{1:K}$ are the topics, each $\beta_k$ being a distribution over the vocabulary, and $w_d$ are the observed words of document $d$, with $w_{d,n}$ the $n$th word in document $d$. The topic proportions for the $d$th document are $\theta_d$, where $\theta_{d,k}$ is the proportion of topic $k$ in document $d$. The topic assignments for the $d$th document are $z_d$, where $z_{d,n}$ is the topic assignment of the $n$th word in document $d$. Figure 2.2 shows the graphical model of LDA with its three levels. First, $\alpha$ and $\eta$ are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables $\theta_d$ are document-level variables, sampled once per document. Finally, the variables $z_{d,n}$ and $w_{d,n}$ are word-level variables, sampled once for each word in each document [27].
After obtaining the joint distribution, we compute the conditional distribution of the hidden variables (the topics) given the observed variables (the words). In Bayesian statistics, this is called the posterior of the hidden variables given the observed variables:

$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D} \mid w_{1:D}) = \frac{p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D})}{p(w_{1:D})}.$$
Figure 2.2. The graphical model for latent Dirichlet allocation. Each node is a random variable in the generative process. The shaded circle represents the observed variables, i.e., the words of the documents, and the unshaded circles are hidden variables. Plates represent replication: N denotes the words within a document and D the collection of documents. Figure taken from Blei's paper [20].
The numerator is the joint distribution of all the random variables, and the denominator is the marginal probability of the observations, which sums over all possible ways of assigning each observed word of the collection to one of the topics [27]. Because this denominator involves an exponentially large number of terms, various approximation techniques are used to approximate the posterior. We used the MALLET [28] package, which uses Gibbs sampling for posterior approximation.
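As a sketch of how this looks in practice, Gensim versions before 4.0 shipped an LdaMallet wrapper around MALLET's Gibbs sampler; the installation path below is a placeholder and the corpus is a toy assumption.

    # Sketch: Gibbs-sampling LDA via the MALLET wrapper in Gensim < 4.0.
    # MALLET must be installed separately; the path below is a placeholder.
    from gensim import corpora
    from gensim.models.wrappers import LdaMallet

    docs = [["space", "shuttle", "debris"], ["romance", "paris", "love"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    mallet_path = "/path/to/mallet-2.0.8/bin/mallet"  # placeholder
    lda = LdaMallet(mallet_path, corpus=corpus, id2word=dictionary, num_topics=2)
    print(lda.show_topics(num_topics=2))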
As mentioned in [27], relaxing and extending the statistical assumptions made by LDA could narrow the topics down to more specific semantic patterns. Topic modeling implementations have also been optimized with features such as online LDA for documents arriving in a stream and multi-threading support.
$$\mathrm{docsim}_{cs}(\vec{t}_a, \vec{t}_b) = \frac{\vec{t}_a \cdot \vec{t}_b}{|\vec{t}_a| \times |\vec{t}_b|}, \tag{2.3}$$

where $\vec{t}_a$ and $\vec{t}_b$ are $m$-dimensional vectors over the term set $T = \{t_1, \ldots, t_m\}$. It is important to note that for documents the tf-idf weights are non-negative; hence, the CS always lies in $[0, 1]$.
Although cosine similarity is a widely used similarity metric, it is important to consider metrics based on probability distributions when the input data are topic distributions. The Kullback-Leibler divergence has been shown to effectively cluster text data using both terms [22] and topic distributions [30]:
$$D_{kl}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}. \tag{2.4}$$
Since in general $D_{kl}(P \| Q) \neq D_{kl}(Q \| P)$, the KL divergence is not symmetric. A solution is to use the arithmetic average of $D_{kl}(P \| Q)$ and $D_{kl}(Q \| P)$, or to calculate the Hellinger distance (HL) in such cases [32], [33]. In this work, we explored the HL distance further.
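For completeness, the standard definition of the Hellinger distance between two discrete probability distributions $P$ and $Q$ (see, e.g., [34]) is:

$$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_i \left( \sqrt{P(i)} - \sqrt{Q(i)} \right)^2}.$$

Unlike the KL divergence, it is symmetric in $P$ and $Q$ and bounded in $[0, 1]$, which makes it convenient for comparing topic distributions.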
Chapter 3

Recommendation Based on Movie Topics

Figure 3.1. The overall system showing all steps involved. The system works by preprocessing reviews, training an LDA model and extracting topics from it. The topics are then used to find similar movies.
1. Two datasets were created and preprocessed for the experiment:
• Corpus A, a set of user-written movie reviews extracted from the web; essentially, a list of 943 popular movies from the last 10 years, rated by users on the web.
• Corpus B, a list of ten target movies representing popular genres, hand-picked by two movie lovers who later evaluated the results.
2. An LDA model is trained on the preprocessed corpus A.
3. Using the trained model, indexes of topic distributions for both corpus A and the unseen corpus B are created.
4. Using similarity metrics, a list of five similar movies for each target movie is created and presented for evaluation.
Figure 3.2. Screenshot of a sample movie review taken from IMDB. Highlighted words are relevant features that can be used for finding similar movies.
Movie reviews are widely available in audio, video and text form, so we needed to narrow down the choice of initial data. We decided to use text-based movie reviews, as they are easy to extract from the Internet and have low computational complexity when prototyping with different algorithms. Reviews themselves are written either by movie critics or by users. Basing our feature extraction on movie-critic reviews could result in a biased view of a movie. Combining a large number of user-written reviews and using them as the source for our feature extraction system has the benefit that we might pick up semantic patterns shared or agreed upon by a wide audience of cinema. Figure 3.2 shows the kind of semantic patterns we want to extract in this project. In the sample review for the movie Gravity, observe the description of another movie, “Apollo 13”. Users connect movies while writing reviews, and this can be useful for finding semantic patterns across movies belonging to the same genre.
In this report, we use the term “document” to be consistent with the IR and topic modeling domains; in our experimental setup, a document consists of the user-written movie reviews of a single movie, and thus represents that movie.
Figure 3.5. Tree showing the NLTK-based chunking technique applied to movie data.
1. http://wordnet.princeton.edu/
Although a chunking-based approach is useful for tasks such as information extraction, it is not the right tool for analyzing semantic patterns in large volumes of unlabeled text such as movie reviews. In the IR domain, analyzing large amounts of unlabeled text is a common requirement, and this motivated us to look at IR techniques such as LSI and LDA.
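For reference, a minimal NLTK chunking sketch in the spirit of Figure 3.5 is shown below; the sentence and the noun-phrase grammar are illustrative assumptions, not the exact grammar used in the thesis.

    # Sketch: noun-phrase chunking with NLTK (illustrative grammar).
    # Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger").
    import nltk

    sentence = "Sandra Bullock floats through stunning space debris"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # optional determiner, adjectives, nouns
    chunker = nltk.RegexpParser(grammar)
    print(chunker.parse(tagged))         # prints a chunk tree like Figure 3.5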
Topics t2 and t4 represent the movie Gravity with words such as “shuttle”, “exploration”, “debris” and “adrenaline”. Topics t3 and t5 do not give an accurate description and need further filtering in order to yield better topics.
Prototyping with the review dataset gave us the following insights about topic quality:
• Use descriptive reviews, as they are more useful than reviews carrying only sentiment.
Ultimately, training an LDA model on movie reviews is just one step towards good movie features. As a post-processing step, similarity measures can be used to find movies with similar topic distributions.
1. Index the topic distribution of the query movies q and the movie corpus C.
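A minimal sketch of such a similarity query, using Gensim's Hellinger-distance helper over LDA topic distributions, is given below; the movie titles and token lists are toy assumptions.

    # Sketch: ranking candidate movies against a target movie by the Hellinger
    # distance between their LDA topic distributions (toy data).
    from gensim import corpora, models
    from gensim.matutils import hellinger

    docs = {"Gravity":   ["space", "shuttle", "debris", "astronaut"],
            "Apollo 13": ["space", "astronaut", "mission", "survival"],
            "Amelie":    ["paris", "romance", "whimsy", "love"]}

    dictionary = corpora.Dictionary(docs.values())
    bows = {title: dictionary.doc2bow(toks) for title, toks in docs.items()}
    lda = models.LdaModel(list(bows.values()), id2word=dictionary, num_topics=2)

    target = lda.get_document_topics(bows["Gravity"], minimum_probability=0)
    for title, bow in bows.items():
        cand = lda.get_document_topics(bow, minimum_probability=0)
        print(title, round(hellinger(target, cand), 3))  # 0 = identical topics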
Figure 3.6. Sample topics generated from user movie reviews for the movie Gravity.
Figure 3.7. Cosine similarity and Hellinger distance show a strong positive correlation. The X-axis shows the similarity score based on the Hellinger distance, whereas the Y-axis represents the cosine similarity score.
Chapter 4

Experimental Setup and Results
src
  movie-reviews-10-years
    movie1.txt
    movie2.txt
    ...
    movie943.txt
Figure 4.2. Chart showing the genres of popular movies from the last 10 years.
We kept all word tokens longer than two alphabetical characters. Words occurring only once in the whole corpus were also removed.
During prototyping, we picked out a few topics that are meaningless from the movie recommendation point of view (dominated by words such as good, great, bad, excellent, review, film) and created a stop-word list from them. The visualization in Figure 4.3 shows such meaningless topics as high-density blue columns; these columns represent topics whose words are common throughout the corpus. Feeding the new list back into the system, we re-preprocessed our corpus and obtained improved topics.
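This feedback step can be sketched as follows; the stop-word set and documents are illustrative assumptions.

    # Sketch of the feedback loop: extend the stop-word list with corpus-specific
    # terms found in meaningless topics, then re-filter before re-training.
    domain_stopwords = {"good", "great", "bad", "excellent", "review", "film"}

    def refilter(docs, extra_stopwords):
        # Drop the newly identified common words from every document.
        return [[t for t in doc if t not in extra_stopwords] for doc in docs]

    docs = [["great", "space", "film", "debris"], ["bad", "romance", "review"]]
    print(refilter(docs, domain_stopwords))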
Figure 4.3. A visualization showing 20 topics generated from 100 movie reviews. The vertical axis represents the movie review documents, denoted by their corresponding ids, while the horizontal axis represents movie topics.
4.2 Evaluation
The goal of our experimental evaluation is twofold: first, to evaluate the performance of the system; second, to verify the topics themselves and their effectiveness in representing movies. We chose subjective evaluation with explanation, as it fits this twofold goal. Subjective evaluation measures are expressions of the users about the system or their interaction with the system [41]. They are commonly used to evaluate the usability of recommender systems.
During implementation, we borrowed ideas from approaches used in RS with explanation [42]. A traditional RS behaves like a black box to the end user, which leaves the user confused as to why a particular movie has been recommended. An RS with explanation can help the user understand the system better and complement it by giving feedback. In our evaluation, the movie topics are presented both as an explanation of the recommended movie and as a criterion on which to receive feedback.
Criteria                Explanation
Genre                   Similarity of genres between the target movie and the recommended movies.
Mood                    Similarity of mood.
Plot                    Similarity of plot.
Overlap                 Overlap of actors/actresses/director or lead cast.
Topic-relevance-score   Relevance of the topics as an explanation of the recommended movies.

Table 4.1. Evaluation criteria used in our web-based movie evaluation system.
• A target movie and a recommended movie are presented. Judges then rate the recommended movie on the various evaluation criteria.
• Next, an explanation in the form of movie topics is shown, and judges re-rate the movie after reading the explanation.
Figure 4.6. Movie evaluation system with explanation.
4.3 Results
Initially we evaluated ten target movies, but realized that subjective evaluation is a slow process. Hence, we updated the system and ran the evaluation for five target movies only. Figures 4.7, 4.8, 4.9, 4.10 and 4.11 show the evaluation results for the four evaluation criteria. For each criterion, we also show the movie topics as an explanation; the result of the evaluation with explanation is shown in the bottom visualization of each figure. The movie judges rated each movie on a scale of 1-4, where 1 means “Not Similar”, 2 “Somewhat Similar”, 3 “Similar” and 4 “Perfect”, the last indicating that the user is happy with the recommended movie.
• Out of the five movies given to the judges for evaluation, one movie had a similarity score of 40-50% and hence received the lowest ratings, whereas for most of the other movies the scores were in the range of 50-70%.
• As shown at the top of Figure 4.7, the genre criterion received 30-35% of ratings between 2 and 3, with a median of 3 for genre only and 2 for genre with explanation. As observable in the re-ratings (bottom figure), the movie topics differ slightly from the judges' understanding of genre, but overall the judges agree with movie topics as an explanation of movie genre.
• As shown at the top of Figure 4.8, the mood criterion received 25-30% of ratings between 2 and 3, with a median of 2. Both genre and mood information have been captured quite well by the movie topics. As observable, ratings (top) and re-ratings (bottom) are almost the same for the mood evaluation; hence, judges agree with movie topics as an explanation of the mood aspect of movies.
• As shown in Figure 4.9, the majority of recommended movies are not similar at all in terms of movie plot, with 40% of ratings given to “Not Similar”. This shows that capturing the plot is much more difficult than capturing genre or mood. Ratings (top) and re-ratings (bottom) are almost the same for the plot evaluation; hence, judges agree that plot information is not well captured by movie topics, and more information is needed to recommend movies with similar plots.
• In the LDA model, the order of words and the order of documents are not considered. For modeling movie plot information, a time-based description of concepts and events is important. In order to better extract plot information, topics would have to evolve over the timeline of a movie.
• The overlap criterion was rated “Not Similar” in 60-70% of the ratings. This is understandable, as our collection contains only about 1,000 movies, and it is difficult to find overlaps of actors within such a small corpus. Again, ratings (top) and re-ratings (bottom) are almost the same for the overlap evaluation; hence, judges agree that actor overlaps are not well captured by movie topics.
• Figure 4.11 shows the ratings for the topic-relevance score. A combined 73% of ratings lie between 2 and 3, which represents the overall usefulness of the topics both in finding similar movies and as an explanation.
• Overall, the genre, mood and topic-relevance-score criteria have shown useful results.
It is important to note that the topics generated by an LDA model change every time the model is re-trained. Running the model anew generates a different set of topics, slightly changed from the previous one. Hence, the final list of similar movies might change as well, depending on the generated topics.
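One practical way to make runs repeatable is to pin the training seed; a minimal sketch assuming Gensim's LdaModel, which accepts a random_state parameter, follows (toy corpus).

    # Sketch: pinning the random seed makes LDA training repeatable across
    # runs, mitigating the run-to-run topic variation described above.
    from gensim import corpora, models

    docs = [["space", "debris", "shuttle"], ["romance", "paris", "love"]]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    lda = models.LdaModel(corpus, id2word=dictionary,
                          num_topics=2, random_state=42)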
• For the correlation between the judges' ratings, we analyzed all the ratings. Figure 4.13 shows a strong positive correlation between the two judges: they agree with each other on 34.4% of ratings at rank 1, followed by 13.6% of ratings at rank 2.
Figure 4.7. Result of average rating for Genre (top) and Genre with explanation
(bottom).
Figure 4.8. Result of average rating for Mood (top) and Mood with explanation
(bottom).
Figure 4.9. Result of average rating for Plot (top) and Plot with explanation
(bottom).
Figure 4.10. Result of average rating for Overlap (top) and Overlap with explana-
tion (bottom).
Figure 4.11. Average ratings for the movie topics.
Figure 4.13. Strong positive correlation of ratings between two judges. The judges agree most at rating 1, followed by ratings 2 and 3.
Chapter 5

Conclusion and Future Directions
5.1 Conclusion
In this project, we developed a prototype system for extracting movie features, i.e., topics. We trained a model on a collection of movie reviews and used the trained model to find similar movies based on the Hellinger distance between movie topics.
The evaluation results show that such an approach gives good results even with a small movie collection. They show that movie topics are efficient features, as they perform fairly well in capturing movie genre and mood. The movie plot results are only somewhat satisfactory; capturing the plot requires more descriptive plot information and better methods that can model the story-line. Our small movie corpus resulted in very little overlap between actors. Topics as an explanation in movie recommendation are quite useful but need to be fine-tuned, with the ability to rate individual topics; user-rated movie topics could then be used as feedback to the system.
Finally, movie topics are efficient features for movie recommendation systems, as they represent the semantic patterns behind movies. With user movie reviews as data, movie topics capture essential movie aspects such as genre and mood. Our prototyping approach to feature extraction has the potential to scale to a large number of movies.
In this project, we considered user-written movie reviews for extracting features. Such a method could be extended or combined with other forms of movie metadata, such as plot, genres and keywords. With recent advances in deep learning, it would also be interesting to study the effect of using LDA as a preprocessing step in deep-learning analysis of movie reviews. In the following, we discuss a few interesting future directions.
Bibliography
[1] J. Booton, "One-click Netflix button to make movie streaming even easier", Fox Business, Aug. 2011. [Online]. Available: http://www.foxbusiness.com/markets/2011/01/04/click-netflix-button-appear-remote-controls-movie-streaming/ (visited on Jun. 11, 2014).
[2] A. C. Madrigal, "How Netflix reverse engineered Hollywood", Jan. 2014. [Online]. Available: http://www.theatlantic.com/technology/archive/2014/01/how-netflix-reverse-engineered-hollywood/282679/ (visited on May 13, 2014).
[3] "Me TV: How Jinni is revolutionizing search". [Online]. Available: http://www.forbes.com/sites/dorothypomerantz/2013/02/18/me-tv-how-jinni-is-revolutionizing-search/ (visited on May 13, 2014).
[4] B. Fritz, "Cadre of film buffs helps Netflix viewers sort through the clutter", Los Angeles Times, Sep. 2012, issn: 0458-3035. [Online]. Available: http://articles.latimes.com/2012/sep/03/business/la-fi-0903-ct-netflix-taggers-20120903 (visited on May 15, 2014).
[5] J. Layton. (May 2006). "How Pandora Radio works". [Online]. Available: http://computer.howstuffworks.com/internet/basics/pandora.htm.
[6] X. Amatriain, "The Netflix tech blog: Netflix recommendations: Beyond the 5 stars (part 1)". [Online]. Available: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html (visited on Jun. 11, 2014).
[7] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques", in Proceedings of EMNLP, 2002, pp. 79–86.
[8] S. Ahn and C.-K. Shi, "Exploring movie recommendation system using cultural metadata", in 2008 International Conference on Cyberworlds, Sep. 2008, pp. 431–438. doi: 10.1109/CW.2008.13.
[9] A. Blackstock and M. Spitz, "Classifying movie scripts by genre with a MEMM using NLP-based features", Stanford, M.Sc. course Natural Language Processing, student report, Jun. 2008. [Online]. Available: http://nlp.stanford.edu/courses/cs224n/2008/reports/06.pdf.
[32] "DocEng 2011: Document visual similarity measure for document search", Oct. 2011. [Online]. Available: https://www.youtube.com/watch?v=KVFY-r-BLJQ&feature=youtube_gdata_player (visited on Aug. 26, 2014).
[33] D. M. Blei and J. D. Lafferty, "Topic models", Text mining: classification, clustering, and applications, vol. 10, p. 71, 2009.
[34] P. Harsha. (Sep. 2011). "Hellinger distance". [Online]. Available: http://www.tcs.tifr.res.in/~prahladh/teaching/2011-12/comm/lectures/l12.pdf.
[35] G. van Rossum and F. L. Drake, The Python Language Reference Manual. Network Theory Ltd., 2011, isbn: 1906966141, 9781906966140.
[36] D. Crystal, "What is a corpus? What is corpus linguistics?", university website. [Online]. Available: http://www.tu-chemnitz.de/phil/english/chairs/linguist/independent/kursmaterialien/language_computers/whatis.htm.
[37] C. Kiehl. (Dec. 2013). "Parallelism in one line". [Online]. Available: https://medium.com/@thechriskiehl/parallelism-in-one-line-40e9b2b36148.
[38] L. Richardson. (Apr. 2007). "Beautiful Soup documentation". [Online]. Available: http://www.crummy.com/software/BeautifulSoup/bs4/doc/.
[39] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis", in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA: Association for Computational Linguistics, Jun. 2011, pp. 142–150. [Online]. Available: http://www.aclweb.org/anthology/P11-1015.
[40] E. Loper and S. Bird. (May 17, 2002). "NLTK: The Natural Language Toolkit". arXiv: cs/0205028. [Online]. Available: http://arxiv.org/abs/cs/0205028.
[41] RecSysWiki. (Feb. 2011). "Subjective evaluation measures". [Online]. Available: http://recsyswiki.com/wiki/Subjective_evaluation_measures.
[42] N. Tintarev and J. Masthoff, "Designing and evaluating explanations for recommender systems", in Recommender Systems Handbook, F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds., Springer US, 2011, pp. 479–510, isbn: 978-0-387-85819-7. doi: 10.1007/978-0-387-85820-3_15. [Online]. Available: http://dx.doi.org/10.1007/978-0-387-85820-3_15.
[43] M. Sahlgren and O. Knutsson, "Proceedings of the workshop on extracting and using constructions in NLP", Swedish Institute of Computer Science, 2009.
[44] D. M. Blei and J. D. Lafferty, "A correlated topic model of science", The Annals of Applied Statistics, vol. 1, no. 1, pp. 17–35, Jun. 2007. doi: 10.1214/07-AOAS114. [Online]. Available: http://dx.doi.org/10.1214/07-AOAS114.