Recognizing Image Style
Tech Report
Sergey Karayev1,2   Aaron Hertzmann3   Holger Winnemöller3   Aseem Agarwala3   Trevor Darrell1,2
1 UC Berkeley, 2 ICSI, and 3 Adobe
arXiv:1311.3715v1 [cs.CV] 15 Nov 2013
1. Introduction
Images convey meaning in multiple ways; visual style is often a significant component of image meaning for creative images. For example, the same scene portrayed in the lush, springtime colors of a Renoir painting would tell a different story than shown in the harsh, dark tones of a typical horror movie. Visual style is crucial to how a viewer interprets an image in many contexts, including art, design, entertainment, advertising, and social media. Moreover, an increasing share of visual media consumption through online social media feeds, photo sharing sites, and news sites is now curated by machines rather than people. Yet virtually no research in computer vision has explored visual style.

This paper introduces new approaches and datasets for the automatic analysis of image style. Visual style is readily recognizable to human viewers, yet difficult to define precisely. Style may combine aspects of color, lighting, composition, scene objects, and other facets. Hence, we prefer to define style empirically through labeled data, and then analyze the divisions between these classes. Finding existing datasets insufficient, we gather a new large-scale dataset of photographs annotated with diverse visual style labels. This dataset embodies several different aspects of visual style, including photographic techniques (“Macro,” “HDR”), composition styles (“Minimal,” “Geometric”), moods (“Serene,” “Melancholy”), genres (“Vintage,” “Romantic,” “Horror”), and types of scenes (“Hazy,” “Sunny”). We also gather a large dataset of visual art (mostly paintings) annotated with art historical style labels, ranging from Renaissance to modern art. We perform a thorough evaluation of different visual features for the task of predicting these style annotations. We find that “deep” features trained on a large amount of data labeled with object class categories (ImageNet) perform significantly better than traditionally used hand-designed features.

The style predictors that our datasets and learning enable are useful as mid-level features in their own right. When making presentations, a searchable source of stylistically coherent images would be useful. A story may be illustrated with images that match not only its objective content, but also its sentiment. In addition to evaluating the classification performance of our approach, we demonstrate an application of style classifiers to visual search, making a large image collection searchable by both content tags and visual style (“bird, bright/energetic,” “train, film noir”). Additionally, we demonstrate that styles learned from paintings can be used to search collections of photographs, and vice versa.

All data, trained predictors, code, and a web-based user interface for searching image collections “with style” will be released upon publication.
Figure 2: Typical images in different style categories from our Wikipaintings dataset. The dataset comprises 85,000 images labeled with 22 art historical styles. [Panels: Baroque, Rococo, Northern Renaissance; Impressionism, Post-Impressionism, Ukiyo-e; Abstract Expressionism, Minimalism, Color Field Painting.]

2. Related Work

Most research in computer vision addresses recognition and reconstruction, independent of image style. A few previous works have focused directly on image composition, particularly on the high-level attributes of beauty, interestingness, and memorability.

The groundwork for predicting the aesthetic rating of photographs was laid by Datta et al. [4], who designed visual features to represent concepts such as colorfulness, saturation, rule-of-thirds, and depth of field. Classifiers based on these features were evaluated on a dataset of photographs rated for aesthetics and originality by users of the photo.net community. The same approach was later applied to a small set of Impressionist paintings [13]. The feature space was expanded with more high-level descriptive features such as “presence of animals” and “opposing colors” by Dhar et al., who also attempted to predict Flickr’s proprietary “interestingness” measure, which is determined by social activity on the website [6]. Their high-level features were themselves trained in a classification framework on labeled datasets. Gygli et al. [10] gathered and predicted human evaluations of image interestingness, building on work by Isola et al. [12], who used various high-level features to predict human judgements of image memorability.

Murray et al. [17] introduced the Aesthetic Visual Analysis (AVA) dataset, annotated with ratings by users of DPChallenge, a photographic skill competition website. This dataset is primarily aimed at predicting beauty, and Murray et al. showed that generic feature descriptors with state-of-the-art coding gave better predictions than the previously used hand-designed features. Our use of “deep network” features trained on a large amount of visual data is informed by their findings.

The AVA dataset contains some photographic style labels (e.g., “Duotones,” “HDR”), derived from the titles and descriptions of the photographic challenges to which photos were submitted. These style labels primarily reflect photographic techniques such as “HDR” and simple compositional qualities like “Duotones.” Using images from this dataset, Marchesotti and Perronnin [16] gathered bi-grams from user comments on the website, and used a simple sparse feature selection method to find ones predictive of aesthetic rating. The attributes they found to be informative (e.g., “lovely photo,” “nice detail”) are not specific to image style. In contrast to their unsupervised learning approach, we gather annotations of style that are supervised, either by membership in a user-curated Flickr group, or by art historian experts. We are unaware of other previous work gathering annotations of image style.

In a task similar to predicting the style of an image, Borth et al. [3] performed sentiment analysis on images. Following the “ObjectBank” [14] approach, they trained object detectors on data labeled with adjective-noun pairs of known sentiment value, and used the detector outputs to predict the sentiment of the entire image.

Features based on image statistics have been successfully employed to detect artistic forgeries, e.g., [15]. That work focused on extremely fine-scale discrimination between two very similar classes, and has not been applied to broader style classification.

3. Data Sources

Performance of scene and object recognition depends directly on the quality of the training data set. To our knowledge, there is only one existing dataset annotated with visual style, and it represents only a narrow range of styles [17]. We review this best current dataset for aesthetic prediction, which has a subset of style annotations. We then present two new datasets, covering a range of visual styles.

Table 1: Sample of our Flickr Style groups, showing the size of available data and the membership rules.
Figure 3: Distribution of image style, genre, and date in the Wikipaintings dataset. [Three histograms: number of images per style label, per genre label, and per year, with dates ranging from roughly 1400 to 2000.]

L*a*b* color histogram. The histogram in CIELAB color space has 4, 14, and 14 bins in the L*, a*, and b* channels, following Palermo et al. [19], who showed this to be the best performing single feature for determining the date of historical color images.

GIST. The classic gist descriptor [18] is known to perform well for scene classification and retrieval of images visually similar at a low-resolution scale, and thus can represent image composition to some extent. We use the INRIA LEAR implementation, resizing images to 256 by 256 pixels and extracting a 960-dimensional color GIST feature.

Graph-based visual saliency. We also model composition with a visual attention feature [11]. The feature is fast to compute and has been shown to predict human fixations in natural images about as well as an individual human (humans are far better in aggregate, however). The 1024-dimensional feature is computed from images resized to 256 by 256 pixels.

Meta-class binary features. Image content can be predictive of individual styles; for example, Macro images include many images of insects and flowers. The MC-bit feature [2] is a 15,000-dimensional bit vector learned as a non-linear combination of classifiers trained using existing features (e.g., SIFT, GIST, Self-Similarity) on thousands of random ImageNet synsets, including internal ILSVRC2010 nodes. In essence, MC-bit is a hand-crafted “deep” architecture, stacking classifiers and pooling operations on top of lower-level features.

Deep convolutional net. Current state-of-the-art results on ImageNet, the largest image classification challenge, have come from a deep convolutional network trained in a fully-supervised manner. We use DeCAF [7], an open-source implementation of such an eight-layer network, trained on over a million images annotated with 1,000 ImageNet classes. We investigate using two different layers of the network, referred to as DeCAF5 (9,000-dimensional) and DeCAF6 (4,000-dimensional, closer to the supervised signal), computed from images center-cropped and resized to 256 by 256 pixels. Since DeCAF is trained on object recognition, not style recognition, we also test whether tuning the network on our style datasets improves performance.

Content classifiers. Following Dhar et al. [6], who use high-level classifiers as features for their aesthetic rating prediction task, we evaluate using object classifier confidences as features. Specifically, we train classifiers for all 20 classes of the PASCAL VOC [9] using the DeCAF6 feature. The resulting classifiers are quite reliable, obtaining 0.7 mean AP on VOC 2012.

We aggregate the data to train four classifiers, for “animals,” “vehicles,” “indoor objects,” and “people.” These aggregate classes are presumed to discriminate between vastly different types of images, for which different style signals may apply. For example, a Romantic scene with people may be largely about the composition of the scene, whereas a Romantic scene with vehicles may be largely described by color.

To enable our classifiers to learn content-dependent style, we can take the outer product of a feature channel with the four aggregate content classifiers.
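To make the deep feature channel described above concrete, the following is an illustrative sketch of extracting a DeCAF6-style activation with an ImageNet-pretrained AlexNet from torchvision. This is a modern stand-in, not the original DeCAF/Caffe code; the preprocessing, layer indices, and function names are assumptions made for the example.

```python
# Hypothetical stand-in for DeCAF6 extraction using torchvision's pretrained AlexNet
# (not the original DeCAF implementation cited in the paper).
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import alexnet, AlexNet_Weights

model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1).eval()

# Resize and center-crop, roughly mirroring the 256x256 preprocessing in the paper.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # AlexNet's expected input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def fc6_feature(path):
    """Return a 4096-d activation analogous to DeCAF6 for a single image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        x = model.features(x)                   # convolutional layers (pool5-like)
        x = torch.flatten(model.avgpool(x), 1)
        # classifier[0:3] = Dropout, Linear(9216 -> 4096), ReLU, i.e. the fc6 stage
        x = model.classifier[2](model.classifier[1](model.classifier[0](x)))
    return x.squeeze(0)
```

The flattened output of `model.features` plays the role of the DeCAF5-style pool5 activation; either vector can then be fed to the linear classifiers described in the next section.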
5. Learning algorithm

We wish to learn to classify novel images according to their style, using the labels exemplified by the datasets given in the previous section. Because the datasets we deal with are quite large and some of the features are high-dimensional, we consider only linear classifiers, relying on sophisticated features to provide enough robustness for linear classification to be accurate.

We use an open-source implementation of Stochastic Gradient Descent with adaptive subgradient [1]. The learning process optimizes the function

    min_w  λ1 ‖w‖1 + (λ2 / 2) ‖w‖2^2 + Σ_i ℓ(x_i, y_i, w)

We set the L1 and L2 regularization parameters and the form of the loss function by validation on a held-out set.
[Figure: per-class average precision on the AVA Style categories (Complementary Colors, Duotones, HDR, Image Grain, Light On White, Long Exposure, Macro, Motion Blur, Negative Image, Rule of Thirds, Shallow DOF, Silhouettes, Soft Focus, Vanishing Point, and their mean) for each feature channel, including the Murray et al. (CVPR 2012) baseline.]

For the loss ℓ(x, y, w), we consider the hinge loss max(0, 1 − y · w^T x) and the logistic loss log(1 + exp(−y · w^T x)). For multi-class classification, we always use the One-vs-All reduction. We set the initial learning rate to 0.5 and use adaptive subgradient optimization [8].

For all features except binary ones, values are standardized: each column has its mean subtracted and is divided by its standard deviation.
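As a rough sketch of this learning setup, the snippet below uses scikit-learn's SGDClassifier as a stand-in for the open-source SGD package cited above; the regularization strengths shown are placeholders, not the values chosen by held-out validation in the paper.

```python
# Minimal sketch of the linear classification pipeline: standardize features,
# then train an elastic-net-regularized SGD classifier with hinge or logistic
# loss, one-vs-all over the style classes.
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_style_classifier(X_train, y_train, loss="hinge"):
    """X_train: (n_images, n_dims) feature matrix; y_train: style labels."""
    clf = make_pipeline(
        StandardScaler(),                 # subtract mean, divide by standard deviation
        SGDClassifier(
            loss=loss,                    # "hinge" or "log_loss"
            penalty="elasticnet",         # combined L1 + L2 regularization
            alpha=1e-4, l1_ratio=0.15,    # placeholders; select on a held-out set
            learning_rate="adaptive", eta0=0.5,
        ),
    )
    clf.fit(X_train, y_train)             # multi-class is handled one-vs-all
    return clf
```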
Table 2: Mean APs (or accuracies, where noted) on three datasets for the considered single-channel features and their second-stage combination. As some features were clearly dominated by others on the AVA dataset, only the better features were evaluated on the larger datasets.

                    Late-fusion  DeCAF5  DeCAF6  MC-bit  Tuned DC6  L*a*b* Hist  GIST   Saliency  random
AVA Rating (acc.)   -            0.779   0.686   0.843   0.720      0.574        0.558  0.539     0.500
AVA Style           0.604        0.427   0.577   0.529   0.552      0.291        0.220  0.149     0.127
Flickr              0.419        0.314   0.391   0.360   0.396      -            -      -         0.066
Wikipaintings       0.476        -       0.356   0.443   0.356      -            -      -         0.043
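The AP numbers in these tables are per-class average precisions of one-vs-all style predictions, averaged over classes to obtain mean AP. A minimal sketch of this standard metric, assuming per-class score and binary label matrices:

```python
# Sketch of the evaluation metric: per-class average precision (AP) for
# one-vs-all style predictions, and their mean over classes (mean AP).
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, scores):
    """y_true: (n_images, n_styles) binary labels; scores: same-shape classifier scores."""
    aps = [average_precision_score(y_true[:, k], scores[:, k])
           for k in range(y_true.shape[1])]
    return float(np.mean(aps)), aps
```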
[Figure: confusion matrix of the style classifier on the Flickr dataset over categories including Bright/Energetic, Depth of Field, Ethereal, Geometric Composition, HDR, Hazy, Horror, Long Exposure, Macro, Melancholy, Minimal, Noir, Romantic, Serene, Soft/Pastel, Sunny, and Vintage, with the class prior shown for reference.]

[Figure: correlation of each Flickr style label with the outputs of the PASCAL content classifiers (bicycle, bird, car, cat, dog, horse) and the aggregate meta-classes (metaanimal, metaindoor, metaperson, metavehicle).]

…occurs with “vehicle”, “HDR” doesn’t occur with “cat”). To further enable our linear classifier to take advantage of such correlations, we take an outer product of our content classifier features with the second-stage late fusion features (“Late-fusion × Content” in all results figures).
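A minimal sketch of this “Late-fusion × Content” construction, with illustrative names and dimensions (the paper's exact feature sizes are not assumed):

```python
# Sketch of the "Late-fusion x Content" feature: the outer product of a per-image
# late-fusion feature vector with the aggregate content classifier confidences
# (animals, vehicles, indoor objects, people), flattened into a single vector.
import numpy as np

def content_conditioned_feature(fusion_feat, content_scores):
    """fusion_feat: (d,) late-fusion feature; content_scores: (4,) classifier confidences."""
    return np.outer(content_scores, fusion_feat).ravel()   # shape (4 * d,)

# Example: a 6-d feature conditioned on 4 content scores gives a 24-d vector.
f = np.random.rand(6)
c = np.array([0.9, 0.05, 0.02, 0.40])   # e.g. confident "animal", some "person"
assert content_conditioned_feature(f, c).shape == (24,)
```

A linear classifier on this expanded feature can then learn a different style model for each broad content type, which is the intent of the construction described above.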
Figure 9: Top-K accuracies for the Flickr and Wikipaintings datasets, respectively. [Two plots of top-K accuracy for K = 1 to 5, comparing Late-fusion × Content, Fine-tuned DeCAF_6, DeCAF_6, DeCAF_5, MC-bit, ImageNet, and random features.]
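The top-K accuracy plotted in Figure 9 counts a prediction as correct when the true label is among the K highest-scoring styles. A small sketch of the metric, assuming a score matrix and integer labels:

```python
# Sketch of the top-K accuracy metric shown in Figure 9.
import numpy as np

def top_k_accuracy(scores, labels, k):
    """scores: (n_images, n_classes); labels: (n_images,) integer class indices."""
    top_k = np.argsort(-scores, axis=1)[:, :k]       # indices of the k highest scores
    hits = (top_k == labels[:, None]).any(axis=1)
    return float(hits.mean())
```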
…photography or design inspiration may be better navigated with a vocabulary of style. Currently, companies expend labor to manually annotate stock photography with such labels. With our approach, any image collection can be made searchable and rankable by style. We apply style classifiers to the PASCAL visual object class dataset, and show top hits for different styles for the “bird” and “train” categories in Figure 11. Additionally, styles learned from photographs can be used to order paintings, and styles learned from paintings can be used to order photographs, as illustrated in Figure 10.

7. Conclusion

We have described datasets and algorithms for classifying image styles. Given the importance of style in modern visual communication, we believe that understanding style is an important challenge for computer vision, and our results illustrate the potential for future research in this area.

One challenging question is how to define and understand the meaning of style. Different types of styles relate to content, color, lighting, composition, and other factors. Our work provides some preliminary evidence about the relationships among these quantities.

We were surprised by the success of the DeCAF convolutional net, which was originally trained for object recognition. Moreover, fine-tuning it for style did not significantly increase performance. Perhaps the network layers that we use as features are simply very good general-purpose visual representations. Another explanation is that object recognition depends on object appearance, e.g., distinguishing red from white wine or different kinds of terriers, and that the model learns to repurpose these features for image style.

Another possibility is that the style labels can be predicted from object content alone. We do see strong correlations in our data; e.g., Macro images frequently depict birds and flowers. However, we found that using 1,000 ImageNet classifiers as features performed significantly worse than the DeCAF6 layer feature (see Tables 3, 4, and 5).

References

[1] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 2012.

[2] A. Bergamo and L. Torresani. Meta-class features for large-scale object categorization on a budget. In CVPR, 2012.

[3] D. Borth, R. Ji, T. Chen, and T. M. Breuel. Large-scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs. In ACM MM, 2013.

[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Studying Aesthetics in Photographic Images Using a Computational Approach. In ECCV, 2006.

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.

[6] S. Dhar, V. Ordonez, and T. L. Berg. High Level Describable Attributes for Predicting Aesthetics and Interestingness. In CVPR, 2011.

[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Technical report, arXiv:1310.1531, 2013.

[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.

[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC Challenge Results, 2010.

[10] M. Gygli, F. Nater, and L. Van Gool. The Interestingness of Images. In ICCV, 2013.

[11] J. Harel, C. Koch, and P. Perona. Graph-Based Visual Saliency. In NIPS, 2006.

[12] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In CVPR, 2011.

[13] C. Li and T. Chen. Aesthetic Visual Quality Assessment of Paintings. IEEE Journal of Selected Topics in Signal Processing, 3(2):236–252, 2009.

[14] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. In NIPS, 2010.

[15] S. Lyu, D. Rockmore, and H. Farid. A digital technique for art authentication. PNAS, 101(49), 2004.

[16] L. Marchesotti and F. Perronnin. Learning beautiful (and ugly) attributes. In BMVC, 2013.

[17] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A Large-Scale Database for Aesthetic Visual Analysis. In CVPR, 2012.

[18] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. IJCV, 42(3):145–175, 2001.

[19] F. Palermo, J. Hays, and A. A. Efros. Dating Historical Color Images. In ECCV, 2012.
Figure 10: Cross-dataset style. On the left are shown top scorers from the Wikipaintings set, for styles learned on the Flickr set (Bright/Energetic, Serene, Ethereal). On the right, Flickr photographs are accordingly sorted by painting style (Minimalism, Impressionism, Cubism). (Figure best viewed in color.)

[Figure 11: top-ranked “bird” and “train” images from the PASCAL VOC dataset for the Geometric Composition, HDR, Film Noir, Vintage, Bright/Energetic, Horror, Cubism, and Surrealism style classifiers.]
Table 3: All per-class APs on all evaluated features on the AVA Style dataset.

                       Late-fusion  DeCAF6  DC6,ft  MC-bit  Murray  DeCAF5  ImageNet  L*a*b*  GIST   Saliency
Complementary Colors   0.469        0.548   0.514   0.329   0.440   0.368   0.389     0.294   0.223  0.111
Duotones               0.676        0.737   0.665   0.612   0.510   0.363   0.383     0.582   0.255  0.233
HDR                    0.669        0.594   0.516   0.624   0.640   0.494   0.335     0.194   0.124  0.101
Image Grain            0.647        0.545   0.563   0.744   0.740   0.535   0.219     0.213   0.104  0.104
Light On White         0.908        0.915   0.860   0.802   0.730   0.805   0.508     0.867   0.704  0.172
Long Exposure          0.453        0.431   0.444   0.420   0.430   0.208   0.242     0.232   0.159  0.147
Macro                  0.478        0.427   0.488   0.413   0.500   0.376   0.438     0.230   0.269  0.161
Motion Blur            0.478        0.467   0.380   0.458   0.400   0.327   0.186     0.117   0.114  0.122
Negative Image         0.595        0.619   0.561   0.499   0.690   0.427   0.323     0.268   0.189  0.123
Rule of Thirds         0.352        0.353   0.290   0.236   0.300   0.269   0.244     0.188   0.167  0.228
Shallow DOF            0.624        0.659   0.627   0.637   0.480   0.522   0.517     0.332   0.276  0.223
Silhouettes            0.791        0.801   0.835   0.801   0.720   0.609   0.401     0.261   0.263  0.130
Soft Focus             0.312        0.354   0.305   0.290   0.390   0.225   0.170     0.127   0.126  0.114
Vanishing Point        0.684        0.658   0.646   0.685   0.570   0.527   0.542     0.123   0.107  0.161
mean                   0.581        0.579   0.550   0.539   0.539   0.432   0.350     0.288   0.220  0.152
Table 4: All per-class APs on all evaluated features on the Flickr dataset.
Table 5: All per-class APs on all evaluated features on the Wikipaintings dataset.