DeepStyle: Multimodal Search Engine For Fashion and Interior Design
life scenario where the user looks for “the same shirt but of
denim”. Our novel method, dubbed DeepStyle, mitigates those
shortcomings by using a joint neural network architecture to
model contextual dependencies between features of different
modalities. We prove the robustness of this approach on two
different challenging datasets of fashion items and furniture
where our DeepStyle engine outperforms baseline methods by
18-21% on the tested datasets. Our search engine is commercially
deployed and available through a Web-based application.
Index Terms—Multimedia computing, Multi-layer neural network, Multimodal Search, Machine Learning
I. INTRODUCTION
Fig. 1. Example of a typical multimodal query sent to a search engine for fashion items. By modeling a common multimodal space with a deep neural network, we can provide a more flexible and natural user interface while retrieving results that are semantically correct, as opposed to the results of the search based on the state-of-the-art Visual Search Embedding model [9].

A MULTIMODAL search engine allows the user to retrieve a set of items from a multimedia database according to their similarity to the query in more than one feature space, e.g. textual and visual, or audiovisual. This problem can be divided into smaller subproblems by using separate solutions for each modality. The advantage of this approach is that both textual and visual search engines have been developed for several decades now and have reached a certain level of maturity. Traditional approaches such as Video Google [2] have been improved, adapted and deployed in industry, especially in the ever-growing domain of e-commerce. Major online retailers such as Zalando and ASOS already offer visual search engine functionalities to help users find products that they want to buy [3]. Furthermore, interactive multimedia search engines are omnipresent in mobile devices and allow for speech, text or visual queries [4], [5], [6].

Nevertheless, using a separate search engine per modality suffers from one significant shortcoming: it prevents the users from specifying a very natural query such as “I want this type of dress but made of silk”. This is mainly due to the fact that the notion of similarity in the separate spaces of different modalities is different than in one multimodal space. Furthermore, modeling this highly dimensional multimodal space requires more complex training strategies and thoroughly annotated datasets. Finally, defining the right balance between the importance of the various modalities in the context of a user query is not obvious and hard to estimate a priori. Although several multimodal representations have been proposed in the context of a search for fashion items, they typically focus on using other modalities as an additional source of information, e.g. to increase classification accuracy of compatible and non-compatible outfits [7].

To address the above-mentioned shortcomings of the currently available search engines, we propose a novel end-to-end method that uses a neural network architecture to model the joint multimodal space of database objects. This method is an extension of our previous work [8] that blended multimodal results. Although in this paper we focus mostly on fashion items such as clothes and accessories, and on furniture, our search engine is in principle agnostic to object types and can be successfully applied in many other multimedia applications. We call our method DeepStyle and show that, thanks to its ability to jointly model both visual and textual modalities, it allows for more intuitive search queries, while providing higher accuracy than the competing approaches. We prove the superiority of our method over single-modality approaches and a state-of-the-art multimodal representation using two large-scale datasets of fashion and furniture items. Finally, we deploy our DeepStyle search engine as a web-based application.

To summarize, the contributions of our paper are threefold:
• In addition to the results using blending methods from multiple modalities, we propose a novel multimodal end-to-end search engine based on a deep neural network architecture. It is robust to domain changes and outperforms the baseline methods.

1 Polish-Japanese Academy of Information Technology, Warsaw, Poland
2 Warsaw University of Technology, Warsaw, Poland
3 Tooploox, Warsaw, Poland
Fig. 2. A high-level overview of our Simple Style Search Engine. The visual search block uses the YOLO 9000 object detection algorithm [25] and the outputs of a pretrained deep neural network. The textual block allows the user to further specify search criteria with text and increases the contextual importance of the retrieved results. Finally, blending the visual and textual search results significantly improves the stylistic and aesthetic similarity of the retrieved items.
Neural Machine Translation (NMT) and perceives the visual and textual modalities as the same concept described in different languages. The proposed architecture consists of LSTM RNNs for encoding sentences, a CNN for encoding images and a structure-content neural language model (SC-NLM) for decoding. The authors show that their learned multimodal embedding space preserves semantic regularities in terms of vector space arithmetic, e.g. an image of a blue car - “blue” + “red” is near images of red cars. However, results for this task are only available for some example images. We would like to leverage their work and numerically evaluate multimodal query retrieval, specifically in the domain of fashion and interior design.

Xintong Han et al. [30] train a bi-LSTM model to predict the next item in outfit generation. Moreover, they learn a joint image-text embedding by regressing image features to their semantic representations, aiming to inject attribute and category information as a regularization for training the LSTM. It should be noted, however, that their approach to stylistic compatibility differs from ours in that they optimize for the generation of a complete outfit (e.g. it should not contain two pairs of shoes), whereas we would like to retrieve items of similar style regardless of the category they belong to. Also, they evaluate compatibility with a “fill-in-the-blanks” test that does not incorporate retrieval from the full dataset of items. Only several example results are illustrated and no quantitative evaluation is presented.

Numerous works focus on the task of generating a compatible outfit from available clothing products [7], [30]. However, none of the related works focus on the notion of multimodality and multimodal fashion retrieval. Text information is only used as an alternative query and not as complementary information to extend the information about the searched object. Finally, the research community has not yet paid much attention to defining or evaluating style similarity.

III. FROM SINGLE TO MULTIMODAL SEARCH

In this section, we present the baseline style search engine model introduced in [8], which is the basis for our current research. It is built on top of two single-modal modules. More precisely, two searches are run independently for the image and text queries, resulting in two initial sets of results. Then, the best matches are selected from the initial pool of results according to blending methods - re-ranking based on visual feature similarity to the query image as well as on contextual similarity (items that appear more often together in the same context).

As input, the baseline style search engine takes two types of query information: an image containing object(-s), e.g. a picture of a dining room, and a textual query used to specify search criteria, e.g. cozy and fluffy. If needed, an object detection algorithm is run on the uploaded picture to detect objects of classes of interest such as chairs, tables or sofas. Once the objects are detected, their regions of interest are extracted as picture patches and run through the visual search method. For queries that already represent a single object, no object detection is required. Simultaneously, the engine retrieves the results for the textual query. With all visual and textual matches retrieved, our blending algorithm ranks them depending on their similarity in the respective feature spaces and returns the resulting list of stylistically and aesthetically similar objects. Fig. 2 shows a high-level overview of our Style Search Engine. Below, we describe each part of the engine in more detail.
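To make the data flow above concrete, the following is a minimal sketch of a two-branch retrieval with a simple blend of the two result lists. The feature matrices, their dimensions and the union-based blend are illustrative assumptions, not the deployed implementation; feature extraction (YOLO-based detection, CNN activations, averaged word2vec vectors) is assumed to happen upstream.

```python
# A minimal sketch of the two-branch search flow described above (cf. Fig. 2).
# Random arrays stand in for precomputed catalog features; the union of the two
# top-k lists is only one simple blending choice.
import numpy as np

def blended_search(query_vis, query_txt, catalog_vis, catalog_txt, k=10):
    """Return catalog indices for a multimodal query by blending two rankings."""
    # Independent single-modality searches: Euclidean nearest neighbours.
    vis_rank = np.argsort(np.linalg.norm(catalog_vis - query_vis, axis=1))[:k]
    txt_rank = np.argsort(np.linalg.norm(catalog_txt - query_txt, axis=1))[:k]
    # Blend the two top-k lists: union with duplicates removed, order preserved.
    return list(dict.fromkeys(vis_rank.tolist() + txt_rank.tolist()))

# Toy usage with stand-in features (2048-d visual, 300-d textual).
rng = np.random.default_rng(0)
catalog_vis = rng.normal(size=(500, 2048))
catalog_txt = rng.normal(size=(500, 300))
print(blended_search(catalog_vis[3], catalog_txt[3], catalog_vis, catalog_txt, k=5))
```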
catalog image i ∈ I and the textual description t ∈ T. The multimodal query provided by the user is given by Q = (i_q, t_q), where i_q ∈ I is the visual query and t_q ∈ T is the textual query.

We run a series of experiments with blending methods, aiming to combine the retrieval results from the various modalities in the most effective way. To that end, we use the following approaches for blending.

Late-fusion Blending: In the simplest case, we retrieve the top k items independently for each modality and take their union as the set of final results. We do not use the contextual information here.

Early-fusion Blending: In order to use the full potential of our multimodal search engine, we combine the retrieval results of the visual, textual as well as contextual search engines in a specific order. We optimize this order to present the most stylistically coherent sets to the user. To that end, we propose the Early-fusion Blending approach that uses features extracted from different modalities in a sequential manner.

More precisely, for a multimodal query (i_q, t_q), an initial set of results R_vis is returned for the visual modality - the closest images to i_q in terms of the Euclidean distance d_vis between their visual representations. Then, we retrieve contextually similar products R_cont that are close to the R_vis results in terms of the d_cont distance (context space search described in section III-C). Finally, R_vis and R_cont form a list of candidate items from which we select the results R by extracting the textual features (word2vec vectors) from the item descriptors and ranking them using the distance d_text from the textual query.

This process can be formulated as:

$$
\begin{aligned}
R_{vis} &= \Big\{\, p : \operatorname*{arg\,min}_{p_1,\dots,p_{n_1} \in P} \sum_{i=1}^{n_1} d_{vis}(i_q, i_i) \Big\} \;\Rightarrow\\
R_{cont} &= \bigcup_{p \in R_{vis}} \Big\{\, p : \operatorname*{arg\,min}_{p_1,\dots,p_{n_2} \in P} \sum_{i=1}^{n_2} d_{cont}(p, p_i) \Big\} \;\Rightarrow\\
R_{cand} &= R_{cont} \cup R_{vis}\\
R &= \Big\{\, p : \operatorname*{arg\,min}_{p_1,\dots,p_{n_3} \in R_{cand}} \sum_{i=1}^{n_3} d_{text}(t_q, t_i) \Big\}
\end{aligned}
\qquad (1)
$$

where n_1, n_2 and n_3 are parameters to be chosen empirically.
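As an illustration, the sketch below follows Eq. (1) step by step under simplifying assumptions: catalog items are rows of precomputed visual, contextual and word2vec text feature matrices, all distances are Euclidean, and the default values of n1, n2, n3 are arbitrary illustrative choices rather than the tuned parameters.

```python
# A sketch of Early-fusion Blending following Eq. (1). The feature matrices
# (visual, contextual, textual word2vec), one row per catalog item, are assumed
# to be precomputed; n1, n2, n3 defaults are arbitrary illustrative values.
import numpy as np

def topk(query, features, candidates, k):
    """Indices (taken from `candidates`) of the k rows of `features` closest to `query`."""
    dists = np.linalg.norm(features[candidates] - query, axis=1)
    return [candidates[i] for i in np.argsort(dists)[:k]]

def early_fusion(i_q, t_q, vis, ctx, txt, n1=20, n2=5, n3=10):
    all_items = np.arange(len(vis))
    # R_vis: the n1 items visually closest to the image query i_q.
    r_vis = topk(i_q, vis, all_items, n1)
    # R_cont: items contextually close to the visual hits (union over hits).
    r_cont = {j for p in r_vis for j in topk(ctx[p], ctx, all_items, n2)}
    # R_cand = R_vis ∪ R_cont, re-ranked by text distance to the textual query t_q.
    r_cand = np.array(sorted(set(r_vis) | r_cont))
    return topk(t_q, txt, r_cand, n3)
```

In this form the visual and contextual stages act as candidate generators, while the textual query is used only for the final re-ranking, mirroring the sequential design described above.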
IV. DEEPSTYLE: MULTIMODAL STYLE SEARCH ENGINE WITH DEEP LEARNING

Inspired by recent advancements in deep learning for computer vision, we experiment with end-to-end approaches that learn the embedding space jointly. In this section, we propose neural network architectures that are fed with image and text as inputs, while learning a multimodal embedding space. Such an embedding can later be used to retrieve results using a multimodal query. The first proposed architecture is a multimodal DeepStyle network that learns a common image-text embedding through a classification task. The second, the DeepStyle-Siamese network, improves over the first network by introducing a second branch with shared weights and a contrastive loss, learning to map pairs from the same outfit close to each other in the embedding space.

DeepStyle: Our proposed neural network learns a common embedding through a classification task. Our architecture, dubbed DeepStyle, is inspired by [7], where a multimodal joint embedding is used for fashion product retrieval. In contrast to their work, our goal is not to retrieve images with a text query (or vice versa) but to retrieve items where the text query complements the image and provides additional query requirements.

Similarly to [7], our network has two inputs - image features (the output of the penultimate layer of a pretrained CNN) and text features (processed with the same word2vec model trained on the descriptions). We then optimize a classification loss to enforce the concept of semantic regularities. For this purpose, product category labels (with an arbitrary number of classes) should be present in the dataset. Unlike [7], we do not consider the image and text branches separately for predictions but add a fully connected layer on top of the concatenated image and text embeddings that is used to predict a single class.
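As an illustration of the paragraph above, here is a compact PyTorch sketch of a DeepStyle-like network. The input sizes (2048-d ResNet-50 pooled features, 300-d averaged word2vec vectors), the 85 categories and the toy training step are assumptions made to keep the example runnable; only the overall structure (two 128-d branches, concatenation, a single classification head) follows Fig. 5.

```python
# A compact PyTorch sketch of a DeepStyle-like network: image and text feature
# branches are each compressed to 128 dimensions, concatenated, and classified
# into product categories. Input sizes and training details are assumptions.
import torch
import torch.nn as nn

class DeepStyle(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=128, n_classes=85):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, emb_dim), nn.ReLU())
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, emb_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * emb_dim, n_classes)

    def forward(self, img_feat, txt_feat):
        # The concatenated (penultimate) layer doubles as the multimodal embedding.
        emb = torch.cat([self.img_branch(img_feat), self.txt_branch(txt_feat)], dim=1)
        return emb, self.classifier(emb)

# Toy training step on random stand-in features.
model, loss_fn = DeepStyle(), nn.CrossEntropyLoss()
img, txt, labels = torch.randn(8, 2048), torch.randn(8, 300), torch.randint(0, 85, (8,))
_, logits = model(img, txt)
loss = loss_fn(logits, labels)
loss.backward()
```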
DeepStyle-Siamese: We also want to include context information (whether or not two items appeared in the same context) in our network. For this purpose, we design a Siamese network [35] where each branch has a dual input consisting of image and text features. Positive pairs are generated as image-text pairs from the same outfit, while unrelated pairs are obtained by randomly sampling an item (image and description) from a different outfit.

Two types of losses are optimized. The classification loss is used as before to help the network learn semantic regularities. Also, minimizing the contrastive loss encourages image-text pairs from the same outfit to have a small distance between their embedding vectors, while items from different outfits are pushed to a distance larger than a predefined margin.

Formally, the contrastive loss is defined in the following manner [35]:

$$ L_C(d, y) = (1 - y)\,\tfrac{1}{2}\,d^2 + y\,\tfrac{1}{2}\,\{\max(0,\, m - d)\}^2, \qquad (2) $$

where d is the Euclidean distance between two different embedded image-text vectors (i, t) and (i', t'), y is a binary label indicating whether the two vectors are from the same outfit (y = 0) or from different outfits (y = 1), and m is a predefined margin for the minimal distance between items from different outfits.

The full training loss consists of a weighted sum of the contrastive loss and the cross-entropy classification losses:

$$ \alpha L_C(d, y) + \beta L_X\big(Cl_1(i, t), \tilde{y}(i, t)\big) + \gamma L_X\big(Cl_2(i', t'), \tilde{y}(i', t')\big), \qquad (3) $$

where L_X is the cross-entropy loss, Cl_1(i, t) and Cl_2(i, t) are the outputs of the first and second classification branches respectively, and \tilde{y}(i, t) is the category label for the product with image i and text description t. Parameters \alpha, \beta, \gamma are treated as hyperparameters for tuning.
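A sketch of the DeepStyle-Siamese objective of Eqs. (2)-(3) is given below, reusing the DeepStyle sketch above as the shared-weight block. The margin, the loss weights and the batch layout are our own illustrative conventions, not the values or implementation used in the paper.

```python
# A sketch of the DeepStyle-Siamese training objective of Eqs. (2)-(3). The two
# branches share weights by simply reusing the same DeepStyle module.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, y, margin=1.0):
    """Eq. (2): y is a float tensor, 0 for same-outfit pairs, 1 for different outfits."""
    d = F.pairwise_distance(emb_a, emb_b)
    return torch.mean((1 - y) * 0.5 * d ** 2 +
                      y * 0.5 * torch.clamp(margin - d, min=0) ** 2)

def siamese_loss(model, batch, alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. (3): weighted contrastive loss plus two classification losses."""
    (img1, txt1, cls1), (img2, txt2, cls2), y = batch
    emb1, logits1 = model(img1, txt1)   # both branches reuse the same weights
    emb2, logits2 = model(img2, txt2)
    return (alpha * contrastive_loss(emb1, emb2, y) +
            beta * F.cross_entropy(logits1, cls1) +
            gamma * F.cross_entropy(logits2, cls2))
```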
Fig. 5. The proposed architecture of the DeepStyle network. An image is first fed through the ResNet-50 network pretrained on ImageNet, while the corresponding text description is transformed with word2vec. Both branches are compressed to 128 dimensions and then concatenated into a common vector. The final layer predicts the clothing item category. The penultimate layer serves as a multimodal image and text representation of a product item.
Fig. 6. The architecture of the DeepStyle-Siamese network. The DeepStyle block is the block of dense and concatenation layers from Fig. 5 and has shared weights between the image-text pairs. Three kinds of losses are optimised - the classification loss for each image-text branch and the contrastive loss for image-text pairs. The contrastive loss is computed on the joint image and text descriptors.

V. DATASETS

Although several datasets for standard visual search methods exist, e.g. Oxford 5K [13] or Paris 6K [36], they are not suitable for our experiments, as our multimodal approach requires an additional type of information to be evaluated. More precisely, a dataset that can be used with a multimodal search engine should fulfill the following conditions:

A. Interior Design

To our knowledge, there is no publicly available dataset that contains interior design items and fulfills the previously mentioned criteria. Hence, we collect our own dataset by scraping the website of one of the most popular interior design distributors - IKEA¹. We collect 298 room photos with their descriptions and 2193 individual product photos with their textual descriptions. A sample image of a room scene and an interior item along with their descriptions can be seen in Fig. 7. We also group together products from some of the most frequent object classes (e.g. chair, table, sofa) for more detailed analysis. In addition, we divide the room scene photos into 10 categories based on the room class (kitchen, living room, bedroom, children room, office). The vast majority of

1 https://ikea.com/
Fig. 7. Example entries from IKEA dataset. It contains room images, object images and their respective text descriptions.
B. Fashion
Several datasets for fashion related tasks are already pub-
licly available. DeepFashion [37] contains 800 000 images
divided into several subsets for different computer vision tasks.
However, it lacks the context (outfit) information as well as the
detailed text description. Fashion Icon [28] dataset contains
video frames for human parsing but no individual product
images. In contrast, the Polyvore [30] dataset satisfies our dataset conditions mentioned before.
Polyvore dataset contains 111 589 clothing items that are
grouped into compatible outfits (of 5-10 items per outfit).
We perform additional dataset cleaning - removing non-clothing items such as electronic gadgets, furniture, cosmetics, designer logos and plants. In addition, we perform additional scraping of the Polyvore² website for the product items in the cleaned dataset to obtain longer product descriptions and to add descriptions where they are missing. As a result, we have 82 229 items from 85 categories with text descriptions and context information. The context information is much weaker when compared to the IKEA dataset: only 30% of clothing items appear in more than one outfit.

Item (query) images are already object photos. Therefore, for the fashion dataset the object detection step of the style search engine is omitted during evaluation.

2 http://polyvore.com

Fig. 8. t-SNE visualization of the clothing items' visual feature embedding. Distinctive classes of objects, e.g. those that share visual similarities, are clustered around the same region of the space.

VI. EVALUATION

In this section we present the evaluation procedure, as well as the quantitative results.
A. Evaluation Metrics

Similarity score: As mentioned in Sec. II-C, defining a similarity metric that allows quantifying the stylistic similarity between products is a challenging task and an active area of research. In this work, we propose the following similarity measure that is inspired by [24] and based on a probabilistic data-driven approach.

Let us remind that P is the set of all possible product items available in the catalog. Let us then denote by C the set of all sets that contain stylistically compatible items (such as outfits or interior design rooms). Then we search for a similarity function between two items p_1, p_2 ∈ P which determines if they fit well together. We propose the empirical similarity function s_c : P × P → [0, 1] which is computed in the following way:

$$ s_c(p_1, p_2) = \frac{|\{C_i \in C : p_1 \in C_i \wedge p_2 \in C_i\}|}{\max_{p \in \{p_1, p_2\}} |\{C_j \in C : p \in C_j\}|}. \qquad (4) $$

In fact, it is the number of compatible sets C_i, empirically found in C, in which both p_1 and p_2 appear, normalized by the maximum number of compatible sets in which either of those items occurs. This metric can be interpreted as an empirical probability for the two objects p_1 and p_2 to appear in the same compatible set and it is expressed by a similarity score lying in the interval [0, 1].

In order to account for datasets that have weak context information (where two items rarely co-occur in the same compatible set), we add an additional similarity measure s_n that is directly derived from their name overlap. It counts the overlap of some of the most frequent descriptive words such as elegant, denim, casual, etc. It should be mentioned, however, that the product name information should be independent from the text description (that is used during training). As a result, the name-derived similarity is non-zero only on datasets that have this kind of additional name information.

$$ s_n(p_1, p_2) = \mathbb{1}\{W_{p_1} \cap W_{p_2} \neq \emptyset\}, \qquad (5) $$

where W_f is the set of frequent descriptive words appearing in the name of item f.

To summarize, an evaluated pair is considered to be similar if either of the two conditions is satisfied:
• the items co-occurred in the same outfit before
• the names of the two items overlap

Formally,

$$ s(p_1, p_2) = \max\big(s_c(p_1, p_2),\, s_n(p_1, p_2)\big). \qquad (6) $$
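The similarity measure of Eqs. (4)-(6) can be transcribed almost directly into Python. In the sketch below, the compatible sets C are plain Python sets of item identifiers and W maps an item to its set of frequent descriptive name words; this data layout is an illustrative assumption, not the one used in our experiments.

```python
# A direct transcription of Eqs. (4)-(6) under the assumptions stated above.
def s_c(p1, p2, C):
    """Eq. (4): co-occurrence count normalised by the busier item's set count."""
    both = sum(1 for c in C if p1 in c and p2 in c)
    denom = max(sum(1 for c in C if p1 in c), sum(1 for c in C if p2 in c))
    return both / denom if denom else 0.0

def s_n(p1, p2, W):
    """Eq. (5): 1 if the items share a frequent descriptive name word."""
    return float(bool(W.get(p1, set()) & W.get(p2, set())))

def s(p1, p2, C, W):
    """Eq. (6): a pair is similar if either condition holds."""
    return max(s_c(p1, p2, C), s_n(p1, p2, W))

# Toy example: two items that co-occur in one of the two outfits containing "shirt".
C = [{"shirt", "jeans"}, {"shirt", "boots"}]
W = {"jeans": {"denim"}, "boots": {"casual"}}
print(s("shirt", "jeans", C, W))   # prints 0.5
```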
Intra-List similarity: Given that our multimodal query search engine provides a non-ranked list of stylistically similar items, the definition of the evaluation problem differs significantly from other information retrieval domains. For this reason, instead of using some of the usual metrics for performance evaluation like mAP [38] or nDCG [39], which use a ranked list of items as an input, we apply a modified version of the established metric for non-ranked list retrieval. Inspired by [40], we define the average intra-list similarity for a generated results list R of length k to be:

$$ \mathrm{AILS}(R) = \binom{k}{2}^{-1} \sum_{p_i \in R} \;\sum_{p_j \in R,\, p_i \neq p_j} s(p_i, p_j), \qquad (7) $$

that is, an average similarity score computed across all possible pairs in the list of generated items. By doing so, we aim to assess the overall compatibility of the generated set. As mentioned in [40], this metric is also permutation-insensitive, hence the order of the retrieved results does not matter, making it suitable for non-ranked results.
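Under the same assumptions as the previous sketch, the AILS metric of Eq. (7) reduces to the mean of s(·,·) over all unordered pairs of retrieved items (s is symmetric), as sketched below.

```python
# A short sketch of the AILS metric of Eq. (7): the mean pairwise similarity
# over the retrieved items, reusing the s(...) helper defined above.
from itertools import combinations

def ails(retrieved, C, W):
    """Average Intra-List Similarity over all unordered pairs in `retrieved`."""
    pairs = list(combinations(retrieved, 2))
    return sum(s(p1, p2, C, W) for p1, p2 in pairs) / len(pairs) if pairs else 0.0
```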
B. Baseline

In the experiments, we compare our results with a recent multimodal approach to item retrieval, namely Visual Search Embedding (VSE) [9]. For evaluation, we fine-tune the weights of a pretrained model made publicly available by the authors on our datasets. The model was pretrained on the MS COCO dataset, which has 80 categories with broad semantic context, hence it is applicable to our datasets. For feature extraction we use the VGG 19 [41] architecture, as suggested by the authors.

We also compare our method with the Late- and Early-fusion Blending strategies.

C. Results

Evaluation protocol: In order to test the ability of our method to generalize, we evaluate it using a dataset different from the training dataset. For both datasets, we set aside 10% of the initial number of items for that purpose. All results shown in this section come from the following evaluation procedure (sketched in code below):

1) For each item/text query from the test set we extract visual and textual features.
2) We run the engine and retrieve a set of k most compatible items from the trained embedding space.
3) We evaluate the query results by computing the Average Intra-List Similarity metric for all possible pairs between the retrieved items and the query, which gives $\binom{k}{2}$ pairs for k retrieved items.
4) The final results are computed as the mean of the AILS scores for all of the tested queries.

It should be noted that for the IKEA dataset, object detection is performed on the room images and similar items are returned for the most confident item in the picture. On the other hand, for the Polyvore dataset the test set images are already catalog items of clothes on a white background, hence object detection is not necessary and this step is omitted.
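A minimal sketch of this evaluation loop follows; `search_engine` stands for any of the tested retrieval methods (it is assumed to map a held-out multimodal query to a list of item identifiers), and `ails` is the helper sketched earlier.

```python
# A sketch of the evaluation protocol: retrieve k items per held-out query and
# average the AILS scores. `search_engine` is a placeholder for the tested method.
def evaluate(search_engine, test_queries, C, W, k=10):
    scores = []
    for query in test_queries:
        retrieved = search_engine(query)[:k]   # step 2: k most compatible items
        scores.append(ails(retrieved, C, W))   # step 3: pairwise similarity
    return sum(scores) / len(scores)           # step 4: mean over all test queries
```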
Quantitative results: Tab. I shows the results of the blending methods for the IKEA dataset in terms of the mean value of our similarity metric.

When analyzing the results of the blending approaches, we experiment with several textual queries in order to evaluate the system's robustness towards changes in the text search. We observe that the DeepStyle approach outperforms the baseline and the other blending methods for almost all text queries, achieving
TABLE I
MEAN AILS RESULTS AVERAGED FOR THE IKEA DATASET AND SAMPLE TEXT QUERIES FROM THE SET OF MOST FREQUENT WORDS IN TEXT DESCRIPTIONS.

Text query | VSE [9] | Late-fusion [8] | Early-fusion [8] | DeepStyle | DeepStyle-Siamese
decorative | 0.1475  | 0.2742          | 0.2332           | 0.2453    | 0.2840
black      | 0.3217  | 0.2361          | 0.2354           | 0.1967    | 0.2237
white      | 0.1476  | 0.2534          | 0.2048           | 0.1730    | 0.2742
smooth     | 0.1648  | 0.2667          | 0.2472           | 0.3022    | 0.2642
cosy       | 0.2918  | 0.1073          | 0.2283           | 0.3591    | 0.2730
fabric     | 0.1038  | 0.1352          | 0.2225           | 0.0817    | 0.2487
colourful  | 0.3163  | 0.2698          | 0.2327           | 0.3568    | 0.2623
Average    | 0.2134  | 0.2164          | 0.2287           | 0.2449    | 0.2589
TABLE II
MEAN AILS RESULTS FOR FASHION SEARCH ON THE POLYVORE DATASET. SAMPLE TEXT QUERIES ARE SELECTED FROM THE SET OF MOST FREQUENT WORDS IN TEXT DESCRIPTIONS.

Text query | VSE [9] | Late-fusion [8] | Early-fusion [8] | DeepStyle | DeepStyle-Siamese
black      | 0.2932  | 0.2038          | 0.3038           | 0.2835    | 0.2719
white      | 0.2524  | 0.2047          | 0.2898           | 0.2012    | 0.2179
leather    | 0.2885  | 0.2355          | 0.2946           | 0.2510    | 0.3155
jeans      | 0.2381  | 0.1925          | 0.2843           | 0.4341    | 0.4066
wool       | 0.3025  | 0.1836          | 0.2657           | 0.5457    | 0.4337
women      | 0.2488  | 0.1931          | 0.3088           | 0.3808    | 0.3460
men        | 0.2836  | 0.1944          | 0.2900           | 0.1961    | 0.2549
floral     | 0.2729  | 0.3212          | 0.2954           | 0.3384    | 0.2858
vintage    | 0.2986  | 0.3104          | 0.3035           | 0.3317    | 0.3935
boho       | 0.2543  | 0.3074          | 0.2893           | 0.2750    | 0.3641
casual     | 0.2808  | 0.3361          | 0.3030           | 0.2071    | 0.2693
Average    | 0.2740  | 0.2439          | 0.2935           | 0.3131    | 0.3236
the highest average similarity score. The DeepStyle-Siamese approach gives the best results, outperforming the VSE baseline by 21% for the IKEA dataset and 18% for the Polyvore dataset.
Tab. II shows the results of all of the tested methods for the Polyvore dataset in terms of the mean value of our similarity metric. Here, we also evaluate the two joint architectures, namely DeepStyle and DeepStyle-Siamese. The results show that the DeepStyle architecture yields better results in terms of average performance over different textual queries, when compared to our previous manual blending approaches, as well as to the VSE baseline approach. In this case, DeepStyle-Siamese also yields the best average similarity results. In terms of average performance, it scores 32% higher when compared to the simplest baseline model and more than 4% higher when compared to DeepStyle.
VII. WEB APPLICATION

released to the public⁴. Fig. 10 shows a screenshot from the working Web application with the Style Search Engine.

3 http://flask.pocoo.org/
4 http://stylesearch.tooploox.com/
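The application is served with Flask (footnote 3). The snippet below is only a minimal sketch of how such a multimodal query endpoint could be wired up; the route name, the payload fields and the `search_engine` placeholder are our assumptions and do not reflect the deployed code.

```python
# A minimal Flask sketch of a multimodal query endpoint; route, payload fields
# and the search_engine placeholder are illustrative assumptions only.
from flask import Flask, request, jsonify

app = Flask(__name__)

def search_engine(image_file, text):
    # Placeholder: the real application would run detection, feature extraction
    # and DeepStyle retrieval here and return a list of product identifiers.
    return []

@app.route("/search", methods=["POST"])
def search():
    image = request.files.get("image")        # uploaded query picture
    text = request.form.get("text", "")       # optional textual refinement
    return jsonify({"items": search_engine(image, text)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```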
Fig. 10. Sample screenshot of our Style Search Engine for interior design applied in a web application, showing product detection and retrieval of visually similar products.

VIII. CONCLUSIONS

In this paper, we experiment with several different architectures for multimodal query item retrieval. This includes manual result blending approaches as well as joint systems, where we learn common embeddings using classification and contrastive loss functions. Our method achieves state-of-the-art results for the generation of stylistically compatible item sets using multimodal queries. We also show that our methodology can be applied to various commercial domain applications, easily adopting new e-commerce datasets by exploiting the product images and their associated metadata. Finally, we deploy a publicly available web implementation of our solution and release the new dataset with the IKEA furniture items.

REFERENCES

[1] G. Bradski, “OpenCV.” https://opencv.org/, 2014.
[2] J. Sivic and A. Zisserman, Video Google: Efficient Visual Search of Videos. Springer, 2006.
[3] B. Davis, “Image recognition in ecommerce: Visual search, product tagging and content curation.” https://econsultancy.com/blog, 2017.
[4] H. Li, Y. Wang, T. Mei, J. Wang, and S. Li, “Interactive multimodal visual search on mobile device,” IEEE Trans. Multimedia, vol. 15, no. 3, pp. 594–607, 2013.
[5] J. Sang, T. Mei, Y.-Q. Xu, C. Zhao, C. Xu, and S. Li, “Interaction design for mobile visual search,” IEEE Trans. Multimedia, vol. 15, no. 7, pp. 1665–1676, 2013.
[6] D. M. Chen and B. Girod, “A hybrid mobile visual search system with compact global signatures,” IEEE Trans. Multimedia, vol. 17, no. 7, pp. 1019–1030, 2015.
[7] Y. Li, L. Cao, J. Zhu, and J. Luo, “Mining fashion outfit composition using an end-to-end deep learning approach on set data,” IEEE Trans. Multimedia, vol. 19, no. 8, pp. 1946–1955, 2017.
[8] I. Tautkute, A. Mozejko, W. Stokowiec, T. Trzcinski, L. Brocki, and K. Marasek, “What looks good with my sofa: Multimodal search engine for interior design,” in Proceedings of the 2017 Federated Conference on Computer Science and Information Systems (M. Ganzha, L. Maciaszek, and M. Paprzycki, eds.), vol. 11 of Annals of Computer Science and Information Systems, pp. 1275–1282, IEEE, 2017.
[9] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying visual-semantic embeddings with multimodal neural language models,” CoRR, vol. abs/1411.2539, 2014.
[10] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” In CVPR, 2006.
[11] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” In IJCV, 2004.
[12] Z. S. Harris, “Distributional structure,” Papers on Syntax, pp. 3–22, 1981.
[13] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” In CVPR, 2007.
[14] L. Xie, J. Wang, B. Zhang, and Q. Tian, “Fine-grained image search,” IEEE Trans. Multimedia, vol. 17, no. 5, pp. 636–647, 2015.
[15] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” In ECCV, 2016.
[16] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of CNN activations,” CoRR, vol. abs/1511.05879, 2015.
[17] S. Bell and K. Bala, “Learning visual similarity for product design with convolutional neural networks,” ACM Trans. on Graph., vol. 34, no. 4, pp. 98:1–98:10, 2015.
[18] G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,” Commun. of the ACM, vol. 18, no. 11, pp. 613–620, 1975.
[19] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” CoRR, vol. abs/1301.3781, 2013.
[20] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” In EMNLP, 2014.
[21] M. E. Yumer and L. B. Kara, “Co-constrained handles for deformation in shape collections,” ACM Trans. on Graph., vol. 33, no. 6, pp. 1–11, 2014.
[22] O. V. Kaick, K. Xu, H. Zhang, Y. Wang, S. Sun, A. Shamir, and D. Cohen-Or, “Co-hierarchical analysis of shape structures,” ACM Trans. on Graph., vol. 32, no. 4, p. 1, 2013.
[23] Z. Lun, E. Kalogerakis, and A. Sheffer, “Elements of style: Learning perceptual shape style similarity,” ACM Trans. on Graph., vol. 34, no. 4, pp. 84:1–84:14, 2015.
[24] “Art history and its methods: a critical anthology,” Choice Reviews Online, vol. 33, no. 06, 1996.
[25] J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” CoRR, vol. abs/1612.08242, 2016.
[26] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun, “Neuroaesthetics in fashion: Modeling the perception of fashionability,” In CVPR, 2015.
[27] S. Liu, Z. Song, M. Wang, C. Xu, H. Lu, and S. Yan, “Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set,” In CVPR, 2012.
[28] S. Liu, X. Liang, L. Liu, K. Lu, L. Lin, X. Cao, and S. Yan, “Fashion parsing with video context,” IEEE Trans. Multimedia, vol. 17, no. 8, pp. 1347–1358, 2015.
[29] B. Zhao, X. Wu, Q. Peng, and S. Yan, “Clothing cosegmentation for shopping images with cluttered background,” IEEE Trans. Multimedia, vol. 18, no. 6, pp. 1111–1123, 2016.
[30] X. Han, Z. Wu, Y. Jiang, and L. S. Davis, “Learning fashion compatibility with bidirectional lstms,” CoRR, vol. abs/1707.05691, 2017.
[31] Y. Jing, D. C. Liu, D. Kislyuk, A. Zhai, J. Xu, J. Donahue, and S. Tavel, “Visual search at pinterest,” CoRR, vol. abs/1505.07647, 2015.
[32] J. Redmon, “Darknet: Open source neural networks in C.” http://pjreddie.com/darknet/, 2016.
[33] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and et al., “Imagenet large scale visual recognition challenge,” In IJCV, 2015.
[34] G. Hinton and L. Van der Maaten, “Visualizing data using t-sne,” In JMLR, 2008.
[35] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” In CVPR, 2006.