Movie Genre Classification
All content following this page was uploaded by Ali Mert Ertugrul on 30 March 2018.
Abstract—Movie plot summaries are expected to reflect the genre of movies, since many spectators read the plot summaries before deciding to watch a movie. In this study, we perform movie genre classification from plot summaries of movies using bidirectional LSTM (Bi-LSTM). We first divide each plot summary of a movie into sentences and assign the genre of the corresponding movie to each sentence. Next, using the word representations of sentences, we train Bi-LSTM networks. We estimate the genres for each sentence separately. Since plot summaries generally contain multiple sentences, we use majority voting for the final decision by considering the posterior probabilities of the genres assigned to sentences. Our results show that training the Bi-LSTM network after dividing the plot summaries into their sentences and fusing the predictions for individual sentences outperforms training the network with the whole plot summaries when the amount of data is limited. Moreover, Bi-LSTM performs better than basic Recurrent Neural Networks (RNNs) and Logistic Regression (LR) as baselines.

Index Terms—Movie genre classification; LSTM; Recurrent Neural Networks (RNNs)

I. INTRODUCTION AND BACKGROUND

Movie plot summaries reflect the genre of movies such as action, drama and horror, so that people can easily capture the genre of a movie from its plot summary. In particular, several sentences in a plot summary are highly representative of the genre of the movie. People usually read the plot summaries of movies before watching them to get an idea about the movie. Therefore, plot summaries are written in such a way that they convey the genre information to the reader. For example, if the plot mentions humorous obstacles that must be overcome before lovers eventually come together, the movie is likely to be a romantic comedy [1]. In this regard, there is a hidden representation of genre information in movie plot summaries. In this study, we aim to learn this hidden representation. In other words, our purpose is to classify the genres of movies from their plot summaries using Bi-LSTM, considering the genre information represented by each individual sentence. With this method, representations of plot summaries can be used for movie recommendation. In addition, it can be inferred whether a plot summary actually reflects the genre of the movie it belongs to. Therefore, this method can be beneficial during the preparation of movie plots.

In the literature, a number of studies perform movie genre classification using a variety of sources, including visual, audio and textual features from trailers, posters and texts. Among the studies that employ visual and/or audio features, Rasheed et al. [2], [3] utilized visual features from movie previews, including average shot length, color variance, motion content and lighting key, to predict movie genres. Yuan et al. [4] also employed visual features from videos, including temporal and spatial ones, to classify genres using a hierarchical SVM. Zhou et al. [5] represented movie trailers using a bag-of-visual-words model with shot classes as vocabularies and utilized them for genre classification. Moreover, Huang et al. [6] extracted both visual and audio features from movie trailers using a meta-heuristic optimization algorithm and performed genre classification. Ekenel et al. [7] combined low-level audio and visual features, including signal energy and fundamental frequency for audio, and color- and texture-based features for the visual representation, to conduct multi-modal genre classification. Ivasic et al. [8] employed low-level visual features based on colors and edges obtained from movie posters, then used them to classify posters into genres. Furthermore, Simoes et al. [9] and Wehrmann et al. [10] used convolutional neural network (CNN) based architectures to perform movie genre classification from movie trailers instead of using hand-crafted features.

In addition to efforts employing visual and audio features, several studies used textual sources, including plots and synopses, for movie genre classification. Fu et al. [11] utilized a vector space model to represent synopses and used this representation as input for an SVM. Hong et al. [12] extracted textual features from social tags via social websites. Then, they applied probabilistic latent semantic analysis (PLSA) to incorporate textual, visual and audio features for genre classification. Furthermore, Arevalo et al. [13] proposed a gated unit for the multi-modal classification task and performed movie genre classification using poster and plot information, with basic recurrent neural networks (RNNs) representing the plot information. Similarly, Pham et al. [14] proposed a column network for collective classification and evaluated it on the movie genre classification task using plot summaries. They represented plot summaries as a Bag-of-Words (BoW) vector of the 1,000 most frequent words. Aforementioned studies using
TABLE I
DISTRIBUTION OF THE SAMPLES FOR EACH GENRE
II. METHOD
A. Data Collection and Pre-processing
In the pre-processing step, we first obtained movie names from the MovieLens1 datasets. We then collected the necessary information about these movies, including full plot summaries (input) and genres (ground truth), through the OMDb API2, using the corresponding movie names as inputs.

Fig. 1. Bi-LSTM Network Architecture for Movie Genre Classification
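The collection step can be sketched as below. This is a minimal illustration rather than the authors' actual script: the query parameters (`t`, `plot`, `apikey`) follow the public OMDb API, while the helper names, the API-key placeholder and the rule of keeping the first matching genre are assumptions made for this sketch.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# The four target genres used in this study.
TARGET_GENRES = {"Thriller", "Horror", "Comedy", "Drama"}

def fetch_movie(title, api_key):
    """Query OMDb for one movie's record (full plot, genres, ...)."""
    # 't', 'plot' and 'apikey' are standard OMDb query parameters.
    query = urlencode({"t": title, "plot": "full", "apikey": api_key})
    with urlopen("http://www.omdbapi.com/?" + query, timeout=10) as resp:
        return json.load(resp)

def extract_sample(record, target_genres=TARGET_GENRES):
    """Map an OMDb JSON record to a (plot, genre) pair, or None if unusable."""
    if record.get("Response") != "True":
        return None  # lookup failed for this movie name
    # OMDb lists genres as a comma-separated string, e.g. "Comedy, Drama".
    genres = [g.strip() for g in record.get("Genre", "").split(",")]
    matched = [g for g in genres if g in target_genres]
    if not matched or not record.get("Plot") or record["Plot"] == "N/A":
        return None  # no target genre or no plot summary available
    # Keep a single ground-truth genre per movie (single-label setup).
    return record["Plot"], matched[0]
```

In a full pipeline, `extract_sample(fetch_movie(name, key))` would be applied to every MovieLens movie name, discarding the `None` results.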
Within the scope of this study, we selected four genres, namely Thriller, Horror, Comedy and Drama, for movie genre classification. Since the number of movies in the database varies for each genre, we randomly sampled movies for each of them uniformly. However, the total number of sentences in the plot summaries changes for each genre, as the plots may include different numbers of sentences. Accordingly, in the document-level classification task (using whole plots as inputs), we uniformly sampled the data based on their genres. On the other hand, the data for the sentence-level classification task (using sentences as inputs) is unbalanced for training. As a result, we obtained a total of 6,360 movies and 22,278 sentences for the genre classification task. Table I shows the distribution of the number of movies and the total number of sentences for each genre in the dataset.

Before training a model for classification, we conducted a pre-processing step. We first converted all text in the plots to lowercase. Next, we eliminated all punctuation marks except the ones that separate sentences. Additionally, we eliminated the stop-words. We also divided the plot summaries into sentences for the sentence-level classification task. We performed all pre-processing tasks using NLTK3.

B. Text Representation

The purpose of this step is to represent the semantic and syntactic relationships among words, which improves performance when the training data is limited. After the pre-processing step, each input (full plot for document-level and sentence for sentence-level classification) is represented as a continuous vector. To this end, the pre-trained word vectors proposed by [15] are used. These word vectors were obtained by training on Wikipedia; they have a dimension of 300 and were produced with the skip-gram model. Therefore, the relationships between words and their contexts are modeled beforehand. As a result, the raw input is converted to a continuous vector representation and then fed into the network. Note that, for any word in the plots that does not have a corresponding word vector in the dictionary, a random word vector of dimension 300 was generated.

C. Model

The LSTM model [16] is an RNN architecture capable of learning complex dependencies across time. LSTM RNNs address the vanishing gradient problem of basic RNNs by employing gating functions together with the state dynamics. In this study, we use a Bi-LSTM network. It is composed of two LSTM neural networks: a forward LSTM to model the preceding contexts, and a backward LSTM to model the following contexts. The architecture used in the study is given in Fig. 1.

Note that each plot summary of a movie is divided into sentences and the genre of the corresponding movie is assigned to each sentence. During training, each input (sentence) is represented by the words it includes, and continuous word representations are obtained using [15]. This is useful when limited data is available, since the semantic and syntactic relationships among the words are captured. We name this representation in the architecture in Fig. 1 the embedding layer. Then, the word representations are fed into the Bi-LSTM network. Practically, a linear projection layer is placed between the Bi-LSTM and softmax layers. Finally, a softmax layer, which is stacked on top of the Bi-LSTM, takes the learned representation of the last output of the Bi-LSTM as input and returns the classification probabilities for each movie genre.
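A minimal sketch of this embedding / Bi-LSTM / projection / softmax stack is given below, written in PyTorch as an assumption (the paper does not name its framework). The hidden size and vocabulary size are illustrative; only the 300-dimensional embeddings and the four genre classes come from the text. In the actual setup, the embedding weights would be overwritten with the pre-trained skip-gram vectors of [15].

```python
import torch
import torch.nn as nn

class BiLSTMGenreClassifier(nn.Module):
    """Embedding -> Bi-LSTM -> linear projection -> softmax over genres."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128, num_genres=4):
        super().__init__()
        # In the paper this layer holds pre-trained 300-d skip-gram vectors;
        # out-of-dictionary words get a random 300-d vector, which the
        # default random initialization here also provides.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Linear projection placed between the Bi-LSTM and softmax layers.
        self.projection = nn.Linear(2 * hidden_dim, num_genres)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices, one sentence per row
        embedded = self.embedding(token_ids)
        outputs, _ = self.bilstm(embedded)
        last = outputs[:, -1, :]              # representation of the last output
        logits = self.projection(last)
        return torch.log_softmax(logits, dim=-1)  # log-probabilities per genre
```

Pre-trained vectors could be loaded with `model.embedding.weight.data.copy_(...)`, and the log-probabilities feed directly into a negative log-likelihood loss such as `nn.NLLLoss`.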
1 https://grouplens.org/datasets/movielens/
2 http://www.omdbapi.com/
3 http://www.nltk.org/

In order to train the model, we minimize the negative log-likelihood of the estimation error, where the loss function is given in Eq. 1 below.
L(\theta) = -\frac{1}{C} \sum_{i=1}^{C} y_i \log(\hat{y}_i)    (1)

where C is the number of target classes, y is the one-hot representation of the ground truth, and ŷ is the estimated probability distribution assigned to the genres by the model.

D. Classification

Since we divide the plot summaries into sentences, we estimate the class labels for each of them separately during

separately and then averages the results. The calculations of micro precision and micro recall are given in Eq. 2a and 2b, whereas the equations for macro precision and macro recall are shown in Eq. 3a and 3b.

p^{micro} = \frac{\sum_{i=1}^{C} tp_i}{\sum_{i=1}^{C} tp_i + \sum_{i=1}^{C} fp_i}    (2a)

r^{micro} = \frac{\sum_{i=1}^{C} tp_i}{\sum_{i=1}^{C} tp_i + \sum_{i=1}^{C} fn_i}    (2b)
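The micro-averaged metrics of Eq. 2a and 2b can be computed directly from per-class counts, as sketched below. The macro versions, referenced as Eq. 3a and 3b, are included under their standard definition (per-class precision and recall, then averaged over the C classes), which matches the description above.

```python
def micro_metrics(tp, fp, fn):
    """Micro-averaged precision and recall (Eq. 2a, 2b): pool the per-class
    true positives, false positives and false negatives before dividing."""
    p = sum(tp) / (sum(tp) + sum(fp))
    r = sum(tp) / (sum(tp) + sum(fn))
    return p, r

def macro_metrics(tp, fp, fn):
    """Macro-averaged precision and recall: compute each metric per class
    separately, then average the results over the C classes."""
    C = len(tp)
    p = sum(tp[i] / (tp[i] + fp[i]) for i in range(C)) / C
    r = sum(tp[i] / (tp[i] + fn[i]) for i in range(C)) / C
    return p, r
```

For example, with per-class counts tp = [3, 1], fp = [1, 1], fn = [0, 2], the micro precision and recall are both 4/6, while the macro precision is (3/4 + 1/2)/2 = 0.625.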