
Recognizing Image Style

Tech Report

Sergey Karayev^{1,2}  Aaron Hertzmann^3  Holger Winnemöller^3  Aseem Agarwala^3  Trevor Darrell^{1,2}
^1 UC Berkeley, ^2 ICSI, ^3 Adobe

arXiv:1311.3715v1 [cs.CV] 15 Nov 2013

Abstract

The style of an image plays a significant role in how it is viewed, but has received little attention in computer vision research. We describe an approach to predicting the style of images, and perform a thorough evaluation of different image features for these tasks. We find that features learned in a multi-layer network generally perform best – even when trained with object class (not style) labels. Our large-scale learning methods result in the best published performance on an existing dataset of aesthetic ratings and photographic style annotations. We present two novel datasets: 55K Flickr photographs annotated with curated style labels as well as free-form tags, and 85K paintings annotated with style and genre labels. Our approach shows excellent classification performance on both datasets. We use the learned classifiers to extend traditional tag-based image search to consider stylistic constraints, and demonstrate cross-dataset understanding of style.

Figure 1: Typical images in different style categories of our Flickr Style dataset (panels: HDR, Long Exposure, Macro; Vintage, Romantic, Horror; Minimal, Hazy, Noir). The dataset comprises 18 styles in total, each with 3,000 examples.

1. Introduction
Images convey meaning in multiple ways; visual style is often a significant component of image meaning for creative images. For example, the same scene portrayed in the lush, springtime colors of a Renoir painting would tell a different story than shown in the harsh, dark tones of a typical horror movie. Visual style is crucial to how a viewer interprets an image in many contexts, including art, design, entertainment, advertising, and social media. Moreover, an increasing amount of visual media consumption, through online social media feeds, photo sharing sites, and news sites, is now curated by machines rather than people. Yet, virtually no research in computer vision has explored visual style.

This paper introduces new approaches and datasets for the automatic analysis of image style. Visual style is very recognizable to human viewers, yet difficult to define precisely. Style may combine aspects of color, lighting, composition, scene objects, and other facets. Hence, we prefer to define style empirically through labelled data, and then analyze the divisions between these classes. Finding existing datasets insufficient, we gather a new large-scale dataset of photographs annotated with diverse visual style labels. This dataset embodies several different aspects of visual style, including photographic techniques ("Macro," "HDR"), composition styles ("Minimal," "Geometric"), moods ("Serene," "Melancholy"), genres ("Vintage," "Romantic," "Horror"), and types of scenes ("Hazy," "Sunny"). We also gather a large dataset of visual art (mostly paintings) annotated with art historical style labels, ranging from Renaissance to modern art. We perform a thorough evaluation of different visual features for the task of predicting these style annotations. We find that "deep" features trained on a large amount of data labeled with object class categories (ImageNet) perform significantly better than traditionally used hand-designed features.

The style predictors that our datasets and learning enable are useful as mid-level features in their own right. When making presentations, a searchable source of stylistically coherent images would be useful. A story may be illustrated with images that match not only its objective content, but also its sentiment. In addition to evaluating the classification performance of our approach, we demonstrate an application of style classifiers to visual search, making a large image collection searchable by both content tags and visual style ("bird, bright/energetic," "train, film noir"). Additionally, we demonstrate that styles learned from paintings can be used to search collections of photographs, and vice versa.

All data, trained predictors, code, and a web-based user interface for searching image collections "with style" will be released upon publication.

Figure 2: Typical images in different style categories from our Wikipaintings dataset (panels: Baroque, Rococo, Northern Renaissance; Impressionism, Post-Impressionism, Ukiyo-e; Abstract Expressionism, Minimalism, Color Field Painting). The dataset comprises 85,000 images labeled with 22 art historical styles.

2. Related Work

Most research in computer vision addresses recognition and reconstruction, independent of image style. A few previous works have focused directly on image composition, particularly on the high-level attributes of beauty, interestingness, and memorability.

The groundwork for predicting the aesthetic rating of photographs was laid by Datta et al. [4], who designed visual features to represent concepts such as colorfulness, saturation, rule-of-thirds, and depth of field. Classifiers based on these features were evaluated on a dataset of photographs rated for aesthetics and originality by users of the photo.net community. The same approach was further applied to a small set of Impressionist paintings [13]. The feature space was expanded with more high-level descriptive features such as "presence of animals" and "opposing colors" by Dhar et al., who also attempted to predict Flickr's proprietary "interestingness" measure, which is determined by social activity on the website [6]. Their high-level features were themselves trained in a classification framework on labeled datasets. Gygli et al. [10] gathered and predicted human evaluation of image interestingness, building on work by Isola et al. [12], who used various high-level features to predict human judgements of image memorability.

Murray et al. [17] introduced the Aesthetic Visual Analysis (AVA) dataset, annotated with ratings by users of DPChallenge, a photographic skill competition website. This dataset is primarily aimed at predicting beauty, and Murray et al. showed that generic feature descriptors with state-of-the-art coding gave better predictions than the previously-used hand-designed features. Our use of "deep-network" features trained on a large amount of visual data is informed by their findings.

The AVA dataset contains some photographic style labels (e.g., "Duotones," "HDR"), derived from the titles and descriptions of the photographic challenges to which photos were submitted. These style labels primarily reflect photographic techniques such as "HDR" and simple compositional qualities like "Duotones." Using images from this dataset, Marchesotti and Perronnin [16] gathered bi-grams from user comments on the website, and used a simple sparse feature selection method to find ones predictive of aesthetic rating. The attributes they found to be informative (e.g., "lovely photo," "nice detail") are not specific to image style. In contrast to their unsupervised learning approach, we gather annotations of style that are supervised, either by membership in a user-curated Flickr group, or by art historian experts. We are unaware of other previous work gathering annotations of image style.

In a task similar to predicting the style of an image, Borth et al. [3] performed sentiment analysis on images. Following the "ObjectBank" [14] approach, the authors trained and deployed object detectors trained on data labeled with adjective-noun pairs of known sentiment value to predict the sentiment of the entire image.

Features based on image statistics have been successfully employed to detect artistic forgeries, e.g., [15]. That work focused on extremely fine-scale discrimination between two very similar classes, and has not been applied to broader style classification.

3. Data Sources

Performance of scene and object recognition depends directly on the quality of the training data set. To our knowledge, there is only one existing dataset annotated with visual style, and only a narrow range of styles is represented [17]. We review the best current dataset for aesthetic prediction, which has a subset of style annotations. We then present two new datasets, covering a range of visual styles.
3.1. Aesthetic Visual Analysis (AVA)

AVA [17] is a dataset of 250K images from dpchallenge.net, where users submit and judge photos organized into thematic challenges such as "Cats", "Textures and Materials", or "Depth of Field". On average, an image receives around 200 ratings on a 1-10 scale. These ratings reflect both absolute image quality and how well the image meets the goals of the specific challenge. The dataset also includes 14,000 images annotated with 14 labels of photographic style, manually created by the authors by combining photos from 72 different challenges. The task in this dataset is to predict rating means and standard deviations, and to predict style labels. The authors did not report prediction scores on style prediction.

The styles in AVA are primarily photographic techniques, such as "HDR" and "Long Exposure," and simple compositional techniques such as "Silhouettes" and "Vanishing Point." Out of the 14 photographic style labels, fewer than half have over a thousand positive examples, and "higher-level" styles such as mood and genre styles are not represented.

3.2. Flickr Style

Our goal is to describe a broader range of image style beyond photographic style, one that also includes genres, compositional styles, and moods. We would like to gather data from a rich source, such as Flickr, so that the size of our dataset can be increased with minimal effort. Although Flickr users often provide free-form tags for their uploaded images, the tags tend to be quite unreliable. Instead, we turn to Flickr groups, which are community-curated collections of visual concepts. There are Flickr Groups for most visual style concepts that we considered.

We selected 18 groups with large image collections and clearly defined membership rules to produce our Flickr Style dataset. We collected 3,000 positive examples for each label, for a total of 54,000 images. Example images are shown in Figure 1, and examples of the group rules and image counts are given in Table 1. The names of our styles can be seen in the evaluation plots in Figure 5 and Figure 6, and in Table 4.

Table 1: Sample of our Flickr Style groups, showing the size of available data and the membership rules.

Group name         Images   Description (excerpt)
Geometric Beauty   153,909  Circles, triangles, rectangles, symmetric objects, repeated patterns ...
Pastel, Soft       100,116  Pastels and whites, blossom and sunglares. Anything that is soft, pretty and just dreamy.
Film Noir Mood       7,775  Not just black and white photography, but a dark, gritty, moody feel. ...
Horror              16,501  The scariest pics you can find. ... your bloodiest, ghastliest, freakiest snaps. ...
Melancholy          90,748  Only melancholic shots.

The labels are clean in the positive examples, but noisy in the negative examples, in the same way as the ImageNet dataset [5]. That is, a picture labeled as Sunny is indeed Sunny, but it may also be Romantic, for which it is not labeled. We consider this an unfortunate but acceptable reality of working with a large-scale dataset.

3.3. Wikipaintings

The above datasets deal exclusively with modern, photographic images. To our knowledge, no existing vision algorithm can categorize non-photorealistic media, such as paintings and drawings. In a major step toward this goal, we collect a dataset of 100,000 high-art images – mostly paintings – labeled with artist, style, genre, date, and free-form tag information by a community of experts on the Wikipaintings website.

Analyzing the style of non-photorealistic media is an interesting problem, as much of our present understanding of visual style arises out of thousands of years of developments in fine art, marked by distinct historical styles. As shown in Figure 3, the dataset presents significant stylistic diversity, primarily spanning Renaissance styles to modern art movements. We select the 22 styles with more than 1,000 examples, for a total of 85,000 images. Example images are shown in Figure 2.

4. Image Features

In order to classify styles, we must choose appropriate image features. We hypothesize that image style may be related to many different features, including low-level statistics [15], color choices, composition, and content. Hence, we test features that embody these different elements, including features from the object recognition literature. We evaluate single-feature performance, as well as second-stage fusion of multiple features.

L*a*b color histogram. Many of the Flickr styles exhibit strong dependence on color. For example, Noir images are nearly all black-and-white, while most Horror images are very dark, and Vintage images use old photographic colors. We use a standard color histogram feature, computed on the whole image. The 784-dimensional joint histogram in CIELAB color space has 4, 14, and 14 bins in the L*, a*, and b* channels, following Palermo et al. [19], who showed this to be the best performing single feature for determining the date of historical color images.
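As a concrete illustration, a minimal sketch of this feature follows. The 4 × 14 × 14 binning is from the text; the scikit-image and NumPy calls are our own choices, since the paper does not name an implementation.

```python
# Sketch of the L*a*b joint color histogram described above (4 x 14 x 14 bins,
# i.e., 784 dimensions). Library choices are illustrative, not the paper's.
import numpy as np
from skimage import color, io

def lab_histogram(image_path):
    """Return a 784-d, L1-normalized joint CIELAB histogram for one image."""
    rgb = io.imread(image_path)
    lab = color.rgb2lab(rgb)  # L* in [0, 100]; a*, b* roughly in [-128, 127]
    pixels = lab.reshape(-1, 3)
    hist, _ = np.histogramdd(
        pixels,
        bins=(4, 14, 14),
        range=((0, 100), (-128, 127), (-128, 127)),
    )
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)  # normalize so images of any size compare
```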
Figure 3: Distribution of image style, genre, and date in the Wikipaintings dataset (three bar charts: per-style, per-genre, and per-year image counts, roughly 1400-2000).

GIST. The classic gist descriptor [18] is known to perform well for scene classification and retrieval of images visually similar at a low-resolution scale, and thus can represent image composition to some extent. We use the INRIA LEAR implementation, resizing images to 256 by 256 pixels and extracting a 960-dimensional color GIST feature.

Graph-based visual saliency. We also model composition with a visual attention feature [11]. The feature is fast to compute and has been shown to predict human fixations in natural images essentially as well as an individual human (humans are far better in aggregate, however). The 1024-dimensional feature is computed from images resized to 256 by 256 pixels.

Meta-class binary features. Image content can be predictive of individual styles; e.g., Macro images include many images of insects and flowers. The mc-bit feature [2] is a 15,000-dimensional bit vector feature learned as a non-linear combination of classifiers trained using existing features (e.g., SIFT, GIST, Self-Similarity) on thousands of random ImageNet synsets, including internal ILSVRC2010 nodes. In essence, MC-bit is a hand-crafted "deep" architecture, stacking classifiers and pooling operations on top of lower-level features.

Deep convolutional net. Current state-of-the-art results on ImageNet, the largest image classification challenge, have come from a deep convolutional network trained in a fully-supervised manner. We use DeCAF [7], an open-source implementation of such an eight-layer network, trained on over a million images annotated with 1,000 ImageNet classes. We investigate using two different layers of the network, referred to as DeCAF5 (9,000-dimensional) and DeCAF6 (4,000-dimensional, closer to the supervised signal), computed from images center-cropped and resized to 256 by 256 pixels. Since DeCAF is trained on object recognition, not style recognition, we also test whether tuning the network for our style datasets improves performance.
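For readers reproducing this kind of feature today: DeCAF itself was a Caffe-era release, but an analogous activation can be pulled from torchvision's pretrained AlexNet, whose fc6 layer plays the role of DeCAF6. The sketch below is a hedged modern stand-in, not the paper's exact pipeline; the layer indices, input size, and preprocessing are torchvision defaults (weights API requires torchvision >= 0.13).

```python
# Hedged stand-in for DeCAF6 extraction using torchvision's pretrained AlexNet.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

# classifier = [Dropout, Linear (fc6), ReLU, Dropout, Linear (fc7), ReLU, Linear]
fc6_extractor = torch.nn.Sequential(
    model.features, model.avgpool, torch.nn.Flatten(),
    *list(model.classifier.children())[:3],  # stop after fc6 + its ReLU
)

preprocess = transforms.Compose([
    transforms.Resize(256),          # cf. the paper's 256x256 center crop
    transforms.CenterCrop(224),      # AlexNet's expected input size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    x = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)
    feature = fc6_extractor(x)  # shape (1, 4096): a DeCAF6-like activation
```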
Content classifiers. Following Dhar et al. [6], who use high-level classifiers as features for their aesthetic rating prediction task, we evaluate using object classifier confidences as features. Specifically, we train classifiers for all 20 classes of the PASCAL VOC [9] using the DeCAF6 feature. The resulting classifiers are quite reliable, obtaining 0.7 mean AP on the VOC 2012.

We aggregate the data to train four classifiers for "animals", "vehicles", "indoor objects" and "people". These aggregate classes are presumed to discriminate between vastly different types of images – types for which different style signals may apply. For example, a Romantic scene with people may be largely about the composition of the scene, whereas Romantic scenes with vehicles may be largely described by color.

To enable our classifiers to learn content-dependent style, we can take the outer product of a feature channel with the four aggregate content classifiers, as sketched below.
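A minimal sketch of this outer-product construction follows; variable names are illustrative, since the paper does not give an implementation.

```python
# Outer product of a style feature with the four aggregate content confidences.
import numpy as np

def content_conditioned(style_feature, content_scores):
    """style_feature: (d,) vector from any feature channel.
    content_scores: (4,) confidences for the aggregate "animals", "vehicles",
    "indoor objects", and "people" classifiers.
    Returns a (4*d,) vector: one copy of the feature per content channel, each
    scaled by that channel's confidence, so a linear classifier can learn a
    different style model for each kind of content."""
    return np.outer(content_scores, style_feature).ravel()

x = content_conditioned(np.random.rand(1024), np.array([0.1, 0.7, 0.05, 0.15]))
assert x.shape == (4096,)
```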
5. Learning algorithm

We wish to learn to classify novel images according to their style, using the labels exemplified by the datasets given in the previous section. Because the datasets we deal with are quite large and some of the features are high-dimensional, we consider only linear classifiers, relying on sophisticated features to provide enough robustness for linear classification to be accurate.

We use an open-source implementation of Stochastic Gradient Descent with adaptive subgradient [1]. The learning process optimizes the function

\min_w \; \lambda_1 \|w\|_1 + \frac{\lambda_2}{2} \|w\|_2^2 + \sum_i \ell(x_i, y_i, w)

We set the L1 and L2 regularization parameters and the form of the loss function by validation on a held-out set. For the loss \ell(x, y, w), we consider the hinge ( \max(0, 1 - y \cdot w^T x) ) and logistic ( \log(1 + \exp(-y \cdot w^T x)) ) functions. For multi-class classification, we always use the One vs. All reduction. We set the initial learning rate to 0.5, and use adaptive subgradient optimization [8].

For all features except binary ones, values are standardized: each column has its mean subtracted, and is divided by its standard deviation. For feature combinations, we use two-stage late fusion: first, single-feature classifiers are trained; then their scores are linearly combined with weights learned by a second classifier. A sketch of this recipe follows.
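The following sketch approximates this recipe with scikit-learn rather than the terascale learner of [1]. Note that scikit-learn's "adaptive" learning-rate schedule is a simple decay, not the AdaGrad update of [8], so treat this as an illustration under those assumptions.

```python
# Approximate sketch: elastic-net-regularized linear classifiers per feature
# channel, One-vs-All multi-class handling, and two-stage late fusion.
import numpy as np
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_single_feature(X, y):
    """Linear classifier on one feature channel with L1 + L2 (elastic-net)
    penalties and hinge loss; use loss="log_loss" for the logistic loss.
    SGDClassifier reduces multi-class problems via One-vs-All by default."""
    return make_pipeline(
        StandardScaler(),  # standardize non-binary features, as in the text
        SGDClassifier(loss="hinge", penalty="elasticnet", l1_ratio=0.15,
                      alpha=1e-4, learning_rate="adaptive", eta0=0.5),
    ).fit(X, y)

def late_fuse(single_feature_clfs, feature_blocks, y):
    """Two-stage late fusion: linearly recombine per-feature scores with
    weights learned by a second classifier."""
    scores = np.column_stack(
        [clf.decision_function(X)
         for clf, X in zip(single_feature_clfs, feature_blocks)]
    )
    return LogisticRegression(max_iter=1000).fit(scores, y)
```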
6. Evaluation

Details of our experiments follow, with a concluding discussion section.

6.1. AVA Style

We evaluate classification of aesthetic rating and of 14 different photographic style labels on the 14,000 images of the AVA dataset that have such labels. For the style labels, the publishers of the dataset provide a train/test split, where training images have only one label, but test images may have more than one label [17]. Although the provided test split has an uneven class distribution, we found that a class-balanced set is needed to compare with the reported results. Consequently, we adhere to the provided split but compute evaluation metrics on a random class-balanced subset of the test data. We use class-balanced data for evaluation in all following experiments.

Metrics. Following the established approach, aesthetic rating is classified in a binary prediction task of being below or above the mean. On this task, we report the accuracy of our predictions. For multi-class prediction of the style labels, we report the confusion matrix of the most confident classification for each image, top-K accuracies (useful to see when the dataset has easily confused labels), and per-class Average Precision (AP): the area under the Precision vs. Recall curve.
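Concretely, the per-class AP and top-K metrics can be computed as follows; this is a sketch using scikit-learn's average_precision_score, and the array shapes are our assumptions.

```python
# Per-class AP (area under the precision-recall curve, one-vs-all) and top-K
# accuracy. Assumed shapes: y_true is (n,) integer labels, scores (n, n_classes).
import numpy as np
from sklearn.metrics import average_precision_score

def per_class_ap(y_true, scores):
    n_classes = scores.shape[1]
    aps = [average_precision_score(y_true == k, scores[:, k])
           for k in range(n_classes)]
    return np.array(aps), float(np.mean(aps))  # per-class APs and mean AP

def top_k_accuracy(y_true, scores, k=5):
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of k highest scores
    return float(np.mean([y in row for y, row in zip(y_true, top_k)]))
```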
Results. For all features, AP scores for the AVA Style dataset are shown in Figure 4. The mean AP scores and the aesthetic rating accuracy are given in the overall results table, Table 2.

Figure 4: APs on the AVA Style dataset, per style label, for all evaluated features and the random baseline.

For aesthetic rating performance, the best single feature is the MC-bit feature, obtaining 0.843 accuracy. Previous work did not report accuracy on this subset of the data, but their best reported accuracy on the test set of the full AVA data is 0.68 [17]. For style classification, the best single feature is the DeCAF6 convolutional network feature, obtaining 0.577 mean AP. Feature fusion improves the result to 0.604 mean AP; both results beat the previous state-of-the-art of 0.538 mean AP [17].

In all metrics, the DeCAF and MC-bit features significantly outperform the more low-level features. Accordingly, we do not evaluate the low-level features on the larger Flickr and Wikipaintings datasets.

6.2. Flickr Style

Following the same evaluation setup and metrics as above, we learn and predict style labels on the 53,000 images labeled with 18 different visual styles of our new Flickr Style dataset, using 20% of the data for testing, and another 20% for parameter tuning validation. Results are presented in Figures 5 and 9, and in Table 2. The best single-channel feature is again DeCAF6 with 0.396 mean AP; feature fusion obtains 0.419 mean AP. Surprisingly, fine-tuning the DeCAF convolutional net with images from our datasets did not increase performance.

Figure 5: APs on the Flickr dataset, per style label, for the better-performing features and the random baseline.

Content correlations. We plot the confusion matrix of this single-label dataset in Figure 6. As expected, there are points of understandable confusion: Depth of Field vs. Macro, Romantic vs. Pastel, Vintage vs. Melancholy. There are also surprising sources of mistakes: Macro vs. Bright/Energetic, for example. To explain this particular confusion, we observed that many Macro photos contain bright flowers, insects, or birds, often against vibrant greenery. Here, at least, the content of the image dominates its style label.

To explore further content-style correlations, we plot the outputs of PASCAL object class classifiers (one of our features) on the Flickr dataset in Figure 7. We can observe that some styles have strong correlations to content (e.g., "Hazy" occurs with "vehicle"; "HDR" does not occur with "cat"). To further enable our linear classifier to take advantage of such correlations, we take an outer product of our content classifier features with the second-stage late fusion features ("Late-fusion × Content" in all results figures).
Table 2: Mean APs (or accuracies, where noted) on three datasets for the considered single-channel features and their second-stage combination. As some features were clearly dominated by others on the AVA dataset, only the better features were evaluated on the larger datasets.

                   Late-fusion  DeCAF5  DeCAF6  MC-bit  Tuned DC6  L*a*b* Hist  GIST   Saliency  random
AVA Rating (acc.)  -            0.779   0.686   0.843   0.720      0.574        0.558  0.539     0.500
AVA Style          0.604        0.427   0.577   0.529   0.552      0.291        0.220  0.149     0.127
Flickr             0.419        0.314   0.391   0.360   0.396      -            -      -         0.066
Wikipaintings      0.476        -       0.356   0.443   0.356      -            -      -         0.043

Figure 6: Confusion matrix of our best classifier (Late-fusion × Content) on the Flickr dataset.

Figure 7: Correlation of PASCAL content classifiers (columns) against ground-truth Flickr style labels (rows).
6.3. Wikipaintings

With the same setup and features as in the Flickr experiments, we evaluate 85,000 images labeled with 22 different art styles. The results are given in Figures 8 and 9, and in Table 2. The best single-channel feature is MC-bit with 0.444 mean AP; feature fusion obtains 0.476 mean AP. As with Flickr, fine-tuning the convolutional net feature did not increase its performance on this dataset.

Figure 8: APs on the Wikipaintings dataset, per style label, for the better-performing features and the random baseline.

Figure 9: Top-K accuracies for the Flickr and Wikipaintings datasets, respectively.

6.4. Applications of style classifiers

Style classifiers learned on our datasets can be used toward novel goals. For example, sources of stock photography or design inspiration may be better navigated with a vocabulary of style. Currently, companies expend labor to manually annotate stock photography with such labels. With our approach, any image collection can be made searchable and rankable by style. We apply style classifiers to the PASCAL visual object class dataset, and show top hits for different styles for the "bird" and "train" categories in Figure 11.

Additionally, styles learned from photographs can be used to order paintings, and styles learned from paintings can be used to order photographs, as illustrated in Figure 10.
7. Conclusion

We have described datasets and algorithms for classifying image styles. Given the importance of style in modern visual communication, we believe that understanding style is an important challenge for computer vision, and our results illustrate the potential for future research in this area.

One challenging question is to define and understand the meaning of style. Different types of styles relate to content, color, lighting, composition, and other factors. Our work provides some preliminary evidence about the relationships of these quantities.

We were surprised by the success of the DeCAF convolutional net, which was originally trained for object recognition. Moreover, fine-tuning it for style did not significantly increase performance. Perhaps the network layers that we use are simply very good general-purpose features for image representation. Another explanation is that object recognition depends on object appearance, e.g., distinguishing red from white wine, or different kinds of terriers, and that the model learns to repurpose these features for image style.

Another possibility is that the style labels can be predicted from object content alone. We do see strong correlations in our data, e.g., Macro images frequently depict birds and flowers. However, we found that using 1,000 ImageNet classifiers as features was significantly worse than the performance of the DeCAF6 layer feature (see Tables 3, 4, 5).

References

[1] A. Agarwal, O. Chapelle, M. Dudik, and J. Langford. A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 2012.
[2] A. Bergamo and L. Torresani. Meta-class features for large-scale object categorization on a budget. In CVPR, 2012.
[3] D. Borth, R. Ji, T. Chen, and T. M. Breuel. Large-scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs. In ACM MM, 2013.
[4] R. Datta, D. Joshi, J. Li, and J. Z. Wang. Studying Aesthetics in Photographic Images Using a Computational Approach. In ECCV, 2006.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[6] S. Dhar, V. Ordonez, and T. L. Berg. High Level Describable Attributes for Predicting Aesthetics and Interestingness. In CVPR, 2011.
[7] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. Technical report, arXiv:1310.1531, 2013.
[8] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC Challenge Results, 2010.
[10] M. Gygli, F. Nater, and L. Van Gool. The Interestingness of Images. In ICCV, 2013.
[11] J. Harel, C. Koch, and P. Perona. Graph-Based Visual Saliency. In NIPS, 2006.
[12] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In CVPR, 2011.
[13] C. Li and T. Chen. Aesthetic Visual Quality Assessment of Paintings. IEEE Journal of Selected Topics in Signal Processing, 3(2):236-252, 2009.
[14] L.-J. Li, H. Su, E. P. Xing, and L. Fei-Fei. Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification. In NIPS, 2010.
[15] S. Lyu, D. Rockmore, and H. Farid. A digital technique for art authentication. PNAS, 101(49), 2004.
[16] L. Marchesotti and F. Perronnin. Learning beautiful (and ugly) attributes. In BMVC, 2013.
[17] N. Murray, L. Marchesotti, and F. Perronnin. AVA: A Large-Scale Database for Aesthetic Visual Analysis. In CVPR, 2012.
[18] A. Oliva and A. Torralba. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. IJCV, 42(3):145-175, 2001.
[19] F. Palermo, J. Hays, and A. A. Efros. Dating Historical Color Images. In ECCV, 2012.
Figure 10: Cross-dataset style. On the left, top scorers from the Wikipaintings set for styles learned on the Flickr set (Bright/Energetic, Serene, Ethereal); on the right, Flickr photographs sorted by painting styles learned on Wikipaintings (Minimalism, Impressionism, Cubism). (Figure best viewed in color.)

Figure 11: Style-based search within the PASCAL dataset, showing the top 3 images for the "bird" and "train" content classifiers in six Flickr styles (Geometric Composition, HDR, Film Noir, Vintage, Bright/Energetic, Horror) and two Wikipaintings styles (Cubism, Surrealism). Note that the PASCAL data includes only 773 bird images and 550 train images, and thus their stylistic range is limited. (Figure best viewed in color.)
Table 3: All per-class APs on all evaluated features on the AVA Style dataset.

Late-fusion  DeCAF6  Fine-tuned DC6  MC-bit  Murray  DeCAF5  ImageNet  L*a*b*  GIST  Saliency
Complementary Colors 0.469 0.548 0.514 0.329 0.440 0.368 0.389 0.294 0.223 0.111
Duotones 0.676 0.737 0.665 0.612 0.510 0.363 0.383 0.582 0.255 0.233
HDR 0.669 0.594 0.516 0.624 0.640 0.494 0.335 0.194 0.124 0.101
Image Grain 0.647 0.545 0.563 0.744 0.740 0.535 0.219 0.213 0.104 0.104
Light On White 0.908 0.915 0.860 0.802 0.730 0.805 0.508 0.867 0.704 0.172
Long Exposure 0.453 0.431 0.444 0.420 0.430 0.208 0.242 0.232 0.159 0.147
Macro 0.478 0.427 0.488 0.413 0.500 0.376 0.438 0.230 0.269 0.161
Motion Blur 0.478 0.467 0.380 0.458 0.400 0.327 0.186 0.117 0.114 0.122
Negative Image 0.595 0.619 0.561 0.499 0.690 0.427 0.323 0.268 0.189 0.123
Rule of Thirds 0.352 0.353 0.290 0.236 0.300 0.269 0.244 0.188 0.167 0.228
Shallow DOF 0.624 0.659 0.627 0.637 0.480 0.522 0.517 0.332 0.276 0.223
Silhouettes 0.791 0.801 0.835 0.801 0.720 0.609 0.401 0.261 0.263 0.130
Soft Focus 0.312 0.354 0.305 0.290 0.390 0.225 0.170 0.127 0.126 0.114
Vanishing Point 0.684 0.658 0.646 0.685 0.570 0.527 0.542 0.123 0.107 0.161
mean 0.581 0.579 0.550 0.539 0.539 0.432 0.350 0.288 0.220 0.152

Table 4: All per-class APs on all evaluated features on the Flickr dataset.

Late-fusion x Content  Fine-tuned DeCAF6  DeCAF6  MC-bit  DeCAF5  ImageNet


Bright, Energetic 0.355 0.340 0.331 0.250 0.313 0.231
Depth of Field 0.266 0.252 0.241 0.230 0.208 0.202
Ethereal 0.418 0.383 0.365 0.328 0.356 0.190
Geometric Composition 0.442 0.409 0.395 0.399 0.369 0.347
HDR 0.548 0.488 0.477 0.527 0.332 0.293
Hazy 0.565 0.504 0.506 0.489 0.386 0.330
Horror 0.479 0.471 0.464 0.304 0.337 0.286
Long Exposure 0.469 0.415 0.388 0.426 0.300 0.254
Macro 0.684 0.667 0.683 0.620 0.588 0.640
Melancholy 0.178 0.166 0.157 0.169 0.096 0.131
Minimal 0.498 0.476 0.465 0.452 0.319 0.281
Noir 0.529 0.527 0.521 0.409 0.372 0.290
Romantic 0.200 0.210 0.206 0.162 0.140 0.185
Serene 0.209 0.197 0.191 0.219 0.142 0.175
Soft, Pastel 0.309 0.304 0.317 0.267 0.269 0.272
Sunny 0.550 0.551 0.540 0.523 0.481 0.388
Vintage 0.421 0.382 0.385 0.348 0.309 0.268
mean 0.419 0.397 0.390 0.360 0.313 0.280

Table 5: All per-class APs on all evaluated features on the Wikipaintings dataset.

Late-fusion x Content MC-bit DeCAF6 Fine-tuned DeCAF6 ImageNet


Abstract Art 0.341 0.314 0.258 0.233 0.192
Abstract Expressionism 0.351 0.340 0.243 0.222 0.159
Art Informel 0.221 0.217 0.187 0.158 0.138
Art Nouveau (Modern) 0.421 0.402 0.197 0.219 0.096
Baroque 0.436 0.386 0.313 0.330 0.162
Color Field Painting 0.773 0.739 0.689 0.703 0.503
Cubism 0.495 0.488 0.400 0.427 0.193
Early Renaissance 0.578 0.559 0.453 0.424 0.192
Expressionism 0.235 0.230 0.186 0.186 0.093
High Renaissance 0.401 0.345 0.288 0.281 0.165
Impressionism 0.586 0.528 0.411 0.433 0.227
Magic Realism 0.521 0.465 0.428 0.435 0.198
Mannerism (Late Renaissance) 0.505 0.439 0.356 0.359 0.171
Minimalism 0.660 0.614 0.604 0.636 0.449
Naïve Art (Primitivism) 0.395 0.425 0.225 0.210 0.111
Neoclassicism 0.601 0.537 0.399 0.438 0.179
Northern Renaissance 0.560 0.478 0.433 0.339 0.119
Pop Art 0.441 0.398 0.281 0.304 0.163
Post-Impressionism 0.348 0.348 0.292 0.317 0.135
Realism 0.408 0.309 0.266 0.265 0.159
Rococo 0.616 0.548 0.467 0.501 0.242
Romanticism 0.392 0.389 0.343 0.265 0.185
Surrealism 0.262 0.247 0.134 0.152 0.099
Symbolism 0.390 0.390 0.260 0.296 0.172
Ukiyo-e 0.895 0.894 0.788 0.765 0.260
mean 0.473 0.441 0.356 0.356 0.191

