Semantic Class Detectors in Video Genre Recognition
Keywords: Video Genre Recognition, Semantic Indexing, Local Features, SIFT, SVM
Abstract: This paper presents our approach to video genre recognition which we developed for the MediaEval 2011 evaluation. We treat the genre recognition task as a classification problem. We encode visual information in a standard way using local features and a Bag of Words (BOW) representation. The audio channel is parameterized in a similar way, starting from its spectrogram. Further, we exploit the available automatic speech transcripts and user-generated metadata, for which we compute BOW representations as well. It is reasonable to expect that the semantic content of a video is strongly related to its genre, and if this semantic information were available, it would make genre recognition simpler and more reliable. To this end, we used annotations for 345 semantic classes from the TRECVID 2011 semantic indexing task to train semantic class detectors. Responses of these detectors were then used as features for genre recognition. The paper explains the approach in detail, shows the relative performance of the individual features and their combinations measured on the MediaEval 2011 genre recognition dataset, and sketches possible future research. The results show that, although the metadata is more informative than the content-based features, results are improved by adding content-based information to the metadata. Despite the fact that the semantic detectors were trained on a completely different dataset, using them as feature extractors on the target dataset provides better results than the original low-level audio and video features.
Codebook transform was used to convert the sets of local descriptors to BOW representations. Generally, codebook transform assigns objects to a set of prototypes and computes occurrence frequency histograms of the prototypes. The prototypes are commonly called codewords and a set of prototypes is called a codebook. In our case, the codebooks were created by the k-means algorithm with Euclidean distance. The size of the codebooks was 4096.

When assigning local features to codewords by hard mapping, quantization errors occur and some information is lost. This is especially significant in high-dimensional spaces, as is the case of the local patch descriptors, where the distances to several nearest codewords tend to be very similar. In the context of image classification, this issue was discussed for example by van Gemert et al. (van Gemert et al., 2010), who propose to distribute local patches to close codewords according to codeword uncertainty. Computation of BOW with codeword uncertainty is defined for each codeword w from a codebook B as

UNC(w) = \sum_{p \in P} \frac{K(w, p)}{\sum_{v \in B} K(v, p)},    (1)

where P is a set of local image features and K is a kernel function. We use the Gaussian kernel

K(w, w') = \exp\left( -\frac{\| w - w' \|_2^2}{2\sigma^2} \right),    (2)

where σ defines the size of the kernel. In our experiments, σ was set to the average distance between two closest neighboring codewords from the codebook.
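To make the soft assignment concrete, the following is a minimal NumPy sketch of the codeword-uncertainty BOW computation from Equations (1) and (2). The function name and the fully vectorized distance computation are our own simplifications rather than the original implementation, which used 4096-codeword codebooks and would typically process descriptors in batches.

```python
import numpy as np

def codeword_uncertainty_bow(descriptors, codebook, sigma):
    """Soft-assignment BOW histogram following Eq. (1) and (2).

    descriptors: (N, D) array of local features from one image or spectrogram,
    codebook:    (K, D) array of codewords (k-means centroids),
    sigma:       width of the Gaussian kernel.
    """
    # Squared Euclidean distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Gaussian kernel responses K(w, p) for all pairs, shape (N, K).
    k = np.exp(-d2 / (2.0 * sigma ** 2))
    # Denominator of Eq. (1): each local feature distributes unit mass
    # over the codewords according to its kernel responses.
    k /= k.sum(axis=1, keepdims=True)
    # Summing over local features gives UNC(w) for every codeword w.
    return k.sum(axis=0)
```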
For parameterization of the audio information, an approach similar to the parameterization of the visual information was used. The audio track was regularly segmented into 100 possibly overlapping segments. The length of the segments was 10 seconds and overlap was allowed as necessary. Mel-frequency spectrograms with 128 frequency bands, maximum frequency 8 kHz, window length 100 ms and overlap 80 ms were computed from these segments. The dynamic range of the spectrograms was reduced to fit 8-bit resolution. The spectrograms were then processed as images by dense sampling and the SIFT descriptor. The BOW representation was constructed for the spectrograms by codebook transform in the same way as for images.
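As an illustration of the audio pipeline, here is a rough sketch of how one 10-second segment could be turned into an 8-bit spectrogram image before dense SIFT sampling. It assumes librosa for the mel spectrogram and is not the authors' code; the exact window function, sampling rate and quantization details are not specified in the text.

```python
import numpy as np
import librosa

def spectrogram_image(segment, sr=16000):
    """Mel spectrogram of one audio segment rendered as an 8-bit image.

    Parameters follow the description above: 128 mel bands, 8 kHz maximum
    frequency, 100 ms windows with 80 ms overlap (i.e. a 20 ms hop).
    """
    win = int(0.100 * sr)   # 100 ms analysis window
    hop = int(0.020 * sr)   # 80 ms overlap between windows -> 20 ms hop
    spec = librosa.feature.melspectrogram(
        y=segment, sr=sr, n_fft=win, hop_length=hop, n_mels=128, fmax=8000)
    # Reduce the dynamic range (dB scale) and quantize to 8-bit resolution.
    db = librosa.power_to_db(spec, ref=np.max)
    img = np.uint8(255 * (db - db.min()) / (db.max() - db.min() + 1e-9))
    # 'img' is then densely sampled and described with SIFT like any image.
    return img
```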
For classification, the BOW histograms of the individual images and audio segments were averaged to get a single BOW vector of each representation for each video.

From the metadata and the ASR data, XML tags were removed together with any non-alphabetical characters, and words where a lower-case character was followed by an upper-case character were split. Stemming was not performed on the data. Although the data includes several Dutch, French and Spanish videos, we did not employ any machine translation, as the ratio of the non-English videos is relatively small and should not seriously influence the results. For each video, separate word occurrence counts for metadata and ASR were collected.
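A minimal sketch of the described text clean-up (tag removal, splitting of joined words, and removal of non-alphabetical characters) might look as follows; the regular expressions are our own approximation of the procedure, not the exact pipeline used.

```python
import re
from collections import Counter

def clean_and_count(raw_text):
    """Tokenize metadata or ASR text and return word occurrence counts."""
    text = re.sub(r"<[^>]+>", " ", raw_text)            # remove XML tags
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)     # split lowerUpper word joins
    text = re.sub(r"[^A-Za-z]+", " ", text)              # drop non-alphabetical characters
    return Counter(text.lower().split())                 # word counts for the BOW vector
```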
All feature vectors were normalized to unit length for classification.

2.2 Classification Scheme

Although the data in the MediaEval genre tagging task is multi-class (a video is assigned to a single class), the evaluation metric is Mean Average Precision, and the genre recognition problem is in general multi-label: one video may belong to several genres, e.g. Sci-Fi and comedy. As a result, we build classifiers for each genre separately and independently.

The classification structure has two levels. The first level consists of linear SVM classifiers, each based on a single BOW representation. These classifiers are then fused by logistic regression to produce robust estimates of the genres.

SVM (Cortes and Vapnik, 1995) is often used for various tasks in image and video classification (Le et al., 2011; van de Sande et al., 2010; van Gemert et al., 2010; Snoek et al., 2010; Smeaton et al., 2009). SVM has four main advantages: it generalizes well, it can use kernels, it is easy to work with, and good-quality SVM solvers are available. Although non-linear kernels have been shown to perform better in image recognition (Perronnin et al., 2010), we selected a linear kernel due to the very small training set size. Radial Basis Function kernels, which are usually used (Perronnin et al., 2010; van de Sande et al., 2010; van Gemert et al., 2010; Snoek et al., 2010), introduce an additional hyper-parameter which has to be estimated in cross-validation on the training set. Estimating this parameter together with the SVM regularization parameter could prove unreliable on the small dataset.

The single SVM regularization parameter was estimated by grid search with 5-fold cross-validation if enough samples for the particular class were available. The objective function in the grid search was Mean Average Precision, and the same parameter was used for all genre classes for a particular BOW representation.
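For one genre and one BOW representation, a first-level classifier could be trained roughly as below with scikit-learn. This is a simplified sketch: in the paper a single regularization parameter is selected per representation and shared by all genre classes with Mean Average Precision as the objective, whereas the sketch tunes C for a single genre using average precision.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def train_genre_svm(bow_vectors, is_genre):
    """Linear SVM for one genre with C chosen by 5-fold cross-validation.

    bow_vectors: (n_videos, dim) L2-normalized BOW features,
    is_genre:    binary labels (1 if the video belongs to the genre).
    """
    grid = GridSearchCV(
        LinearSVC(),
        param_grid={"C": np.logspace(-3, 3, 7)},
        scoring="average_precision",   # average precision of the decision values
        cv=5)
    grid.fit(bow_vectors, is_genre)
    return grid.best_estimator_, grid.best_params_["C"]
```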
Due to the fact that no validation set was available, we had to re-use the training set for the logistic regression fusion. To keep the classifiers from overfitting, we trained the logistic regression on responses of the 5 classifiers learned in cross-validation with the estimated best value of the SVM hyper-parameter. Each classifier computed responses on the part of the data held out in its cross-validation fold, so only a classifier which had not been trained on the data of a particular video was used to compute the response on that video. Before fusion, classifier responses were normalized to have zero mean and unit standard deviation. Multinomial L2-regularized logistic regression was used for the fusion. The regularization parameter was estimated by the same grid search and cross-validation procedure as in the case of the linear SVMs. Considering the different nature of the available features, the video and audio classifiers (see Section 2.1) were fused separately, and the classifiers using semantic features (see further in Section 2.3) were fused separately as well. Finally, the two classifiers created by fusion were fused again with the classifiers based on the ASR transcripts and metadata. In this second fusion, a single set of weights was computed for the different modalities and these weights were used for all genres in order to limit overfitting.
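The fusion stage can be sketched as follows: out-of-fold SVM responses are collected for every representation, standardized, and fed to a logistic regression. This is a single-stage simplification of the scheme described above (which first fuses the audio/video classifiers and the semantic classifiers separately and only then combines them with ASR and metadata), and the helper below is hypothetical rather than the original code.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def fuse_bow_classifiers(features_per_rep, genre_labels, best_C):
    """Fuse several BOW representations with logistic regression.

    features_per_rep: list of (n_videos, dim) arrays, one per representation,
    genre_labels:     (n_videos,) genre labels,
    best_C:           SVM regularization parameter found by grid search.
    """
    responses = []
    for X in features_per_rep:
        # Out-of-fold decision values: each video is scored only by a
        # classifier that did not see it during training.
        r = cross_val_predict(LinearSVC(C=best_C), X, genre_labels,
                              cv=5, method="decision_function")
        if r.ndim == 1:                       # binary case: make it a column
            r = r[:, None]
        # Normalize responses to zero mean and unit standard deviation.
        responses.append((r - r.mean(axis=0)) / (r.std(axis=0) + 1e-9))
    stacked = np.hstack(responses)
    # L2-regularized logistic regression as the fusion classifier; with the
    # lbfgs solver it fits a multinomial model for multi-class labels.
    fusion = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000)
    fusion.fit(stacked, genre_labels)
    return fusion
```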
2.3 Semantic Detectors in Genre Recognition

The TRECVID (http://trecvid.nist.gov/) 2011 SIN task provided a training dataset consisting of approximately 11,200 videos with a total length of 400 hours. The duration of the videos ranges from 10 seconds to 3.5 minutes. The source of the videos is the Internet Archive (http://www.archive.org/). The videos were partitioned into 266,473 shots (Ayache et al., 2006), which are represented by a corresponding keyframe. The 500 semantic classes proposed by the TRECVID organizers were annotated by active learning (Ayache and Quénot, 2007; Ayache and Quénot, 2008). In total, 4.1M hand-annotations were collected, and this produced 18M annotations after propagation using relations (e.g. Cat implies Animal). For 345 classes, the annotations contained more than 4 positive instances. Examples of the classes are Actor, Airplane Flying, Bicycling, Canoe, Doorway, Ground Vehicles, Stadium, Tennis, Armed Person, Door Opening, George Bush, Military Buildings, Researcher, Synthetic Images, Underwater and Violent Action.
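The propagation through implication relations mentioned above can be illustrated with a small sketch; the data layout (dictionaries of shot annotations and an "implies" map) is hypothetical and only shows how positive labels could be expanded transitively.

```python
def propagate_positives(annotations, implies):
    """Expand positive annotations along implication relations.

    annotations: dict mapping shot id -> set of positively annotated classes,
    implies:     dict mapping a class -> set of classes it implies
                 (e.g. {"Cat": {"Animal"}}).
    """
    propagated = {}
    for shot, classes in annotations.items():
        expanded = set(classes)
        frontier = list(classes)
        while frontier:                         # follow implication chains transitively
            cls = frontier.pop()
            for parent in implies.get(cls, ()):
                if parent not in expanded:
                    expanded.add(parent)
                    frontier.append(parent)
        propagated[shot] = expanded
    return propagated
```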
Using the TRECVID SIN task data, 345 semantic classifiers were trained. These classifiers use the same eight BOW feature types and the same SVM classifiers as described in Section 2.1 and Section 2.2. Further details on these classifiers can be found in (Beran et al., 2011) together with the results achieved in the TRECVID 2011 evaluation.

We applied these 345 classifiers to the extracted images and audio segments and created feature representations for the videos by computing histograms
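The text is cut off at this point, but the abstract states that the detector responses serve as features for genre recognition. A hypothetical pooling of per-keyframe detector outputs into a video-level feature could look like the sketch below; the bin layout and the combination with mean pooling are assumptions for illustration only.

```python
import numpy as np

def semantic_video_feature(keyframe_scores, n_bins=8):
    """Pool per-keyframe semantic detector responses into one feature vector.

    keyframe_scores: (n_keyframes, n_detectors) detector outputs for one video.
    """
    # Histogram of each detector's responses across the video, so the feature
    # keeps information about how often each concept fires and how strongly.
    lo, hi = keyframe_scores.min(), keyframe_scores.max()
    edges = np.linspace(lo, hi + 1e-9, n_bins + 1)
    hists = [np.histogram(keyframe_scores[:, c], bins=edges)[0]
             for c in range(keyframe_scores.shape[1])]
    # Append the mean response of each detector as a simple summary.
    return np.concatenate(hists + [keyframe_scores.mean(axis=0)])
```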
Features                  MAP
All including metadata    0.451
Metadata                  0.405
All content-based         0.304
ASR                       0.165

Table 3: Mean average precision achieved by fusion of all features, metadata alone, all content-based features (audio, video and ASR), and ASR alone.

Run     MAP
RUN1    0.165
RUN3    0.346
RUN4    0.322
RUN5    0.360

Table 4: Mean average precision on the test set achieved by the runs submitted to MediaEval 2011.
The difference was that the weights for the classifier fusion were set by hand. When fusing the audio and video features, uniform weights were used. RUN1 used only ASR. RUN3 combined all features with the weight of ASR and metadata increased to 2.5. RUN4 combined the low-level audio and video features, ASR and metadata. Here the weights of ASR and metadata were set to 1.25. RUN5 combined semantic features, ASR and metadata with the same weights as in RUN4. The results of these runs are shown in Table 4.

The best purely content-based method submitted to MediaEval 2011 achieved MAP 0.121 (Ionescu et al., 2011). Very successful were methods focusing on metadata and information retrieval methods. The best result was MAP 0.56 (Rouvier and Linares, 2011). This result was reached by explicitly using the IDs of the uploaders and the fact that uploaders tend to upload similar videos. Other than that, the approach classified the data by SVM on metadata, ASR transcripts and audio and video features.

5 CONCLUSIONS

The presented genre recognition approach achieves good results on the datasets used in the experiments. The results could even be considered surprisingly good given the small size of the training set used. However, it is not certain how the results would generalize to larger and more diverse datasets.

Although the metadata is definitely the most important source of information for genre recognition, the audio and video content features improve results when combined with the metadata. Compared to the metadata, content-based features achieve worse results, but they do not require any human effort.

The semantic features for classification improve over the low-level features individually, as well as when combined.
ACKNOWLEDGEMENTS

This work has been supported by the EU FP7 project TA2: Together Anywhere, Together Anytime, ICT-2007-214793, grant no. 214793, and by the BUT FIT grant no. FIT-11-S-2.

REFERENCES

Ayache, S. and Quénot, G. (2007). Evaluation of active learning strategies for video indexing. Signal Processing: Image Communication, 22(7-8):692–704.

Ayache, S. and Quénot, G. (2008). Video Corpus Annotation Using Active Learning. In Macdonald, C., Ounis, I., Plachouras, V., Ruthven, I., and White, R., editors, Advances in Information Retrieval, volume 4956 of Lecture Notes in Computer Science, pages 187–198. Springer Berlin / Heidelberg.

Ayache, S., Quénot, G., and Gensel, J. (2006). CLIPS-LSR Experiments at TRECVID 2006. In TREC Video Retrieval Evaluation Online Proceedings. TRECVID.

Beran, V., Hradis, M., Otrusina, L., and Reznicek, I. (2011). Brno University of Technology at TRECVid 2011. In TRECVID 2011: Participant Notebook Papers and Slides, Gaithersburg, MD, US. National Institute of Standards and Technology.

Brezeale, D. and Cook, D. J. (2008). Automatic Video Classification: A Survey of the Literature. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 38(3):416–430.

Cortes, C. and Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3):273–297.

Gauvain, J.-L., Lamel, L., and Adda, G. (2002). The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108.

Hradis, M., Reznicek, I., and Behun, K. (2011). Brno University of Technology at MediaEval 2011 Genre Tagging Task. In Working Notes Proceedings of the MediaEval 2011 Workshop, Pisa, Italy.

Ionescu, B., Seyerlehner, K., Vertan, C., and Lambert, P. (2011). Audio-Visual Content Description for Video Genre Classification in the Context of Social Media. In MediaEval 2011 Workshop, Pisa, Italy.

Larson, M., Eskevich, M., Ordelman, R., Kofler, C., Schmiedeke, S., and Jones, G. J. F. (2011). Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In MediaEval 2011 Workshop, Pisa, Italy.

Le, Q. V., Zou, W. Y., Yeung, S. Y., and Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.

Lowe, D. G. (1999). Object Recognition from Local Scale-Invariant Features. In ICCV '99: Proceedings of the International Conference on Computer Vision - Volume 2, page 1150, Washington, DC, USA. IEEE Computer Society.

Mikolajczyk, K. (2004). Scale & Affine Invariant Interest Point Detectors. International Journal of Computer Vision, 60(1):63–86.

Mikolajczyk, K. and Schmid, C. (2005). A Performance Evaluation of Local Descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10):1615–1630.

Perronnin, F., Sánchez, J., and Liu, Y. (2010). Large-scale image categorization with explicit data embedding. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2297–2304, San Francisco, CA.

Rouvier, M. and Linares, G. (2011). LIA @ MediaEval 2011: Compact Representation of Heterogeneous Descriptors for Video Genre Classification. In MediaEval 2011 Workshop, Pisa, Italy.

Smeaton, A. F., Over, P., and Kraaij, W. (2009). High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In Divakaran, A., editor, Multimedia Content Analysis, Theory and Applications, pages 151–174. Springer Verlag, Berlin.

Snoek, C. G. M., van de Sande, K. E. A., de Rooij, O., Huurnink, B., Gavves, E., Odijk, D., de Rijke, M., Gevers, T., Worring, M., Koelma, D. C., and Smeulders, A. W. M. (2010). The MediaMill TRECVID 2010 Semantic Video Search Engine. In TRECVID 2010: Participant Notebook Papers and Slides.

van de Sande, K. E. A., Gevers, T., and Snoek, C. G. M. (2010). Evaluating Color Descriptors for Object and Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582–1596.

van Gemert, J. C., Veenman, C. J., Smeulders, A. W. M., and Geusebroek, J. M. (2010). Visual Word Ambiguity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7):1271–1283.

You, J., Liu, G., and Perkis, A. (2010). A semantic framework for video genre classification and event analysis. Signal Processing: Image Communication, 25(4):287–302.