J. Vis. Commun. Image R. 72 (2020) 102860
Article history:
Received 4 June 2019
Revised 6 June 2020
Accepted 16 July 2020
Available online 24 July 2020

Keywords:
Image retrieval
Convolutional neural networks
Feature aggregation
Fourier transform
Low-pass filtering

Abstract: With the rapid development of deep learning techniques, convolutional neural networks (CNNs) have been widely investigated for feature representations in the image retrieval task. However, the key step in CNN-based retrieval, i.e., feature aggregation, has not been solved in a robust and general manner for different kinds of images. In this paper, we present a deep feature aggregation method for image retrieval based on the Fourier transform and low-pass filtering, which adaptively computes a discriminative weight for each feature map. Specifically, the low-pass filtering preserves the semantic information in each feature map by transforming it to the frequency domain. In addition, we develop three adaptive methods to further improve the robustness of feature aggregation: Region of Interest (ROI) selection, spatial weighting and channel weighting. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method over other state-of-the-art approaches in achieving robust and accurate object retrieval.

© 2020 Published by Elsevier Inc.
1. Introduction

The rapid development of the Internet and artificial intelligence [1,2] plays an increasingly important role in various fields, including image processing [3], video processing [4], natural language processing [5], defect detection [6], cross-modal retrieval [7–9], etc. Image retrieval, comprising text-based image retrieval and content-based image retrieval (CBIR), is one of the important sub-areas and has become a critical technique for searching related objects in large-scale and diverse image collections. In image retrieval, where the goal is to find similar and related images, including the desired target, in large-scale image databases, accuracy is the most important evaluation metric in practical use cases. For text-based image retrieval, manual labeling is labor-intensive and time-consuming. Moreover, the annotated semantic texts are subject to the annotators' subjective judgment and cannot fully express the rich information contained in an image. These characteristics result in limited retrieval performance. CBIR, in contrast, directly uses the query image itself as the retrieval input and returns images from the database that are semantically similar to the query. The results obtained in this way are more accurate, taking advantage of the ability of computers to quickly process repetitive tasks, and no cumbersome text annotation work is required. The advantage of "search by image" has made the technology widely used, not only in search engines such as Google, but also in e-commerce, medicine, and other fields.

One of the key problems of CBIR is obtaining the image representation. Most CBIR methods primarily use hand-crafted descriptors (e.g. SIFT [10]) or features from convolutional neural networks (CNNs [11]) as image representations. Although typical hand-crafted image representations, such as features organized with the "bag-of-words" model (BoW [12]), have achieved many successes, they do not extend well to searching large-scale datasets [13,14]. On the other hand, image representations based on CNN features benefit from the fast development of deep learning [11], which is increasingly applied in various fields including image processing [2,15,8], image classification [16–18], semantic segmentation [19,20], object detection [21,22], etc. The semantic information of images can be captured by the activations of deep neural networks trained on large datasets, which makes them suitable for image representation. For instance, image features extracted from CNNs such as VGG16 [23] can greatly improve image retrieval performance, which has made image retrieval with convolutional neural networks a frontier and popular research field.

In early works, CNN-based CBIR methods directly used the deep features extracted from the outputs of the fully connected layers as global feature vectors to represent images [24–26].

* Corresponding author.
E-mail address: cclidd@xjtu.edu.cn (C. Li).
https://doi.org/10.1016/j.jvcir.2020.102860
1047-3203/© 2020 Published by Elsevier Inc.
2 Z. Zhou et al. / J. Vis. Commun. Image R. 72 (2020) 102860
However, recent research finds that features extracted from the outputs of the convolutional layers perform better in the image retrieval task, since they contain more semantic information. In this field, one of the key problems is the aggregation of convolutional features, which is also the focus of our research. Recently, multiple feature aggregation methods have been proposed for image retrieval; they are briefly reviewed in the related works. However, certain prior assumptions, and steps that can be improved, remain the main limitations for accurate feature representation. Therefore, feature aggregation has not yet been solved in a robust and general manner.

Motivated by previous works [27,28], we propose an adaptive feature aggregation method for object retrieval based on the Fourier transform and low-pass filtering, with modified strategies for ROI selection, spatial weighting and channel weighting. In particular, the low-pass features can improve the discrimination of deep features, resulting in a significant improvement in retrieval precision. Our contributions can be summarized as follows:

– To the best of our knowledge, this is the first work that employs the Fourier transform and low-pass filtering for feature aggregation in image retrieval. Both are applied directly to the feature maps, which effectively computes a discriminative weight for each feature map in the frequency domain.

– The ROI selection, spatial weighting and channel weighting in our method are modified based on the Fourier-transformed and low-pass-filtered feature maps, and together they constitute a complete weighting framework. The constructed weight matrix is simple and intuitive, which makes the weighted feature maps more recognizable.

– Our framework does not require fine-tuning. It demonstrates robust, state-of-the-art performance on five benchmark datasets for image retrieval.

The remainder of this paper is organized as follows. Section 2 discusses related works. Section 3 presents the details of our main contributions. Section 4 gives a wide range of experiments to comprehensively evaluate the proposed methods. Section 5 concludes the current work and discusses future directions.

2. Related works

Content-based image retrieval based on convolutional neural networks has attracted a lot of attention in recent years, and some effective methods have been proposed in this field. In this work, the Fourier transform and low-pass filters are adopted to improve the performance of CNN-based CBIR. Thus, related works on these topics are briefly reviewed in this section.

2.1. The CNN-based CBIR

Deep image features extracted from the fully connected layer were mostly adopted in early CBIR works [24–26]. However, recent research shows that features extracted from the convolutional layers perform better in the image retrieval task, since more information useful for CBIR, i.e. the semantics, is contained in the convolutional features. Compared with fully connected layer features, convolutional features are high dimensional, which makes computing similarities hard. So how to aggregate the convolutional features into compact global feature vectors is very important for the image retrieval task.

To make convolutional features appropriate for image retrieval, multiple feature aggregation methods have been proposed. For example, MAC [29] generates an image representation using the max activation of the entire convolutional layer. Compared with sum-pooling, max-pooling performs poorly, as it may suppress discriminative activations. SPoC [27] introduces a centering prior and uses sum-pooling to aggregate the outputs of the last convolutional layer. In this work, the geometric center of the image is taken as its region of interest (ROI), which improves retrieval performance. However, the method relies heavily on the central prior, which is not always valid. R-MAC [30] performs max-pooling on the convolutional layer's activations over corresponding regions to derive representations for image regions, and then aggregates them into an image representation with sum-pooling. Similar to the MOP method, the sliding-window operation is performed on the feature map instead of the image, which speeds up feature extraction, and the method achieves good retrieval accuracy without fine-tuning. The convolutional features can thus be aggregated directly into the image representation via spatial max-pooling or sum-pooling, which is efficient. Further, some recent works [27,28,31,32] propose to select the region of interest (ROI) and weight the features in the ROI before aggregation.

ROI selection: Several works propose selecting an ROI before aggregation to improve retrieval accuracy. When searching for a specific purpose, we only need to complete the search based on the features in the regions we are interested in. Instead of using all features, only the features in the ROI are used, which increases both accuracy and efficiency. In this way, the interference of features in regions of non-interest is suppressed, so retrieval accuracy improves. For example, the SCDA (selective convolutional descriptor aggregation) method [31] obtains an "aggregation map" by adding up the activations of the last convolutional layer along the depth direction, and selects the region of the "aggregation map" whose values are larger than the average value as the ROI. However, this method only works well for images that have few noisy objects and are mostly filled with the objects of interest. When images contain not only the query object but also other notable, interfering objects, its performance degrades significantly. One of the methods proposed by Do et al. selects the max local feature of each feature map as a mask for ROI selection [32]. However, this method likewise works well only on images that contain a single object; when images contain other distracting objects, its performance also begins to decline.

Feature weighting: Prior to aggregation, some important discriminative features are weighted to further increase their discernibility, which also improves retrieval accuracy. Recently, several feature aggregation methods that include weighting schemes have been proposed for image retrieval, e.g. CroW [28], which significantly improves retrieval performance. In detail, CroW [28] designs a cross-dimensional weighting scheme for the feature maps and uses sum-pooling to aggregate them into the image representation. This method obtains good retrieval accuracy without fine-tuning. More importantly, it proposes the idea of weighting in the two dimensions of space and channel. However, CroW [28] fails to consider the ROI, and the feature maps of all channels are simply added up to construct the spatial weight.

Fine-tuning works: In the field of deep learning, fine-tuning is one way to further improve performance. Supervised fine-tuning is adopted in several image retrieval works [33–35] and does improve performance. However, the premise for fine-tuning to work effectively is that appropriate training data are available. Hence, it is necessary to collect, clean and annotate training data to make it suitable for fine-tuning, which is time-consuming and labor-intensive. So fine-tuning is not always feasible; the aggregation method we propose does not need to be fine-tuned.
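The baselines above differ mainly in the pooling operator applied to each feature map. A minimal NumPy sketch (with random values standing in for real convolutional features) contrasts MAC-style max-pooling with SPoC-style sum-pooling:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a conv feature tensor: K=512 maps of size 7x7 (VGG16 Pool5-like).
features = rng.random((512, 7, 7))

# MAC-style: the global descriptor keeps the max activation of each map.
mac = features.max(axis=(1, 2))          # shape (512,)

# SPoC-style: sum-pool each map instead.
spoc = features.sum(axis=(1, 2))         # shape (512,)

# Both descriptors are L2-normalized before images are compared by dot product.
mac /= np.linalg.norm(mac)
spoc /= np.linalg.norm(spoc)
```

The choice of operator decides whether a single strong activation or the total response of a channel dominates the descriptor, which is the trade-off the surveyed methods negotiate.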
2.2. Fourier transform and filtering

The Fourier transform is a classic technique in conventional digital image processing that converts the image to be processed from the spatial domain to the frequency domain. The image, after further processing by various filters, is then converted back to the spatial domain by the inverse Fourier transform. One of the most commonly used filters is the low-pass filter, which attenuates the high-frequency content of the image in the frequency domain and preserves the low-frequency content that carries most of the information; this suppresses noise and smooths (or blurs) the image. In recent works, improved variants of the Fourier transform and various filters are widely used in many sub-fields of image processing.

Zhu et al. [36] propose a 3D wear area reconstruction method that uses the Fourier transform to fuse the methods of depth from focus (DFF) and shape from shading (SFS). After performing high-pass filtering and low-pass filtering on the 3D reconstructed images of DFF and SFS, respectively, in the frequency domain, the 3D wear area reconstruction image of the grinding wheel is obtained by Fourier inversion. Kumar et al. [37] propose a fractional Fourier transform and fractional-order calculus-based image edge detector, which operates in the fractional Fourier frequency domain through the "fractional Fourier transformation" tool. Feng et al. [38] propose a hash authentication algorithm for color images involving the super-complex Fourier transform and input image pre-processing based on the mean filter and the rotation factor. Sun et al. [39] propose an improved Fourier transform phase measurement method for measuring the quality parameters of an image intensifier; in their method, the fundamental frequency components of the captured patterns are obtained after Fourier transform and filtering. Zhang et al. [40] propose a color-image encryption scheme based on two-dimensional (2D) compressive sensing (CS) and the fractional Fourier transform (FrFT). Chen et al. [41] propose several image segmentation algorithms combined with different image pre-processing methods, including Butterworth low-pass filtering, Butterworth high-pass enhanced filtering and adaptive weighted median filtering. Saxena and Sharma [42] propose a novel pansharpening scheme based on the two-dimensional discrete fractional Fourier transform (2D-DFRFT), in which panchromatic and intensity images are transformed using the 2D-DFRFT and filtered by high-pass filters.

In our method, the Fourier transform and low-pass filtering are adapted to apply directly to the feature maps. To the best of our knowledge, this is the first time they are employed for feature aggregation in image retrieval.

3. Methodology

3.1. The framework

First, the deep features are extracted from the last convolutional or pooling layer (denoted as ORIG_FM). Then, the Fourier transform operation (denoted as FFT) is designed for each feature map to compute a processed two-dimensional matrix (defined as FFT_FM). Further, low-pass filtering (denoted as LP) is introduced to obtain a low-pass processed feature map (denoted as LP_FM), which is similar to low-pass filtering in conventional digital image processing. After that, LP_FM is further processed by a mask selection process to obtain the mask (denoted as Mask) of the region of interest (ROI). We employ Mask to mask ORIG_FM and obtain the processed feature map (denoted as FM_ROI). Further, based on FM_ROI, we construct the spatial and channel weights (denoted as LSW and LCW) according to our modified weight construction strategy, and then perform the "cross" weighting aggregation of the feature maps. Finally, operations such as normalization and whitening are performed (consistent with the operations in CroW [28]). The whole process yields the global feature vector of the image for robust and accurate object retrieval.

3.2. FFT and LP for feature aggregation

For existing feature aggregation methods that directly use the feature maps extracted from convolutional or pooling layers, certain prior assumptions, and steps that can be improved, are the main limitations for accurate feature representation [27,30,28]. For digital images, high-frequency content represents edges or noise that can be attenuated, while low-frequency content carries most of the information and should be preserved. Visually, a smoother, noise-removed image is often easier to identify. Moreover, with the Fourier transform, a low-pass filter that is complex in the spatial domain becomes simple and intuitive in the frequency domain. Therefore, we propose to employ the Fourier transform and a low-pass filter to process the feature maps directly and preserve their low-frequency content.

Here, we develop an adaptive Fourier transform (FFT) and low-pass filter (LP) for feature aggregation. The input to the traditional Fourier transform is a two-dimensional matrix, while the convolutional features extracted from the convolutional or pooling layer of a CNN form a three-dimensional tensor. Thus, the convolutional features must be further processed before the Fourier transform. For example, the deep features we use, extracted from the Pool5 layer of VGG16 [23], form a three-dimensional tensor with 512 feature maps, each characterizing the deep feature obtained by the corresponding convolutional kernel. We therefore sum this set of feature maps along the channel direction as follows:

$$X_{FM}(i,j) = \sum_{k=1}^{K} X_k(i,j), \quad \forall i = 1,2,\ldots,W,\ \forall j = 1,2,\ldots,H \qquad (1)$$
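The channel sum of Eq. (1) and the subsequent FFT and low-pass steps can be sketched as follows. The circular shape of the ideal low-pass filter is an assumption (the extracted text fixes only the parameter P, reported as 60 in the experiments), and a smaller cutoff is used here so that the toy-sized map is actually filtered:

```python
import numpy as np

rng = np.random.default_rng(0)
K, W, H = 512, 37, 25                     # e.g. Pool5 maps for a rectangular input
features = rng.random((K, W, H))

# Eq. (1): sum the K feature maps along the channel direction.
x_fm = features.sum(axis=0)               # X_FM, shape (W, H)

# 2-D FFT of the summed map, with the zero frequency shifted to the center.
fft_fm = np.fft.fftshift(np.fft.fft2(x_fm))

# Assumed ideal circular low-pass filter: keep frequencies within radius P
# of the center, then transform back to the spatial domain.
P = 8
v, u = np.meshgrid(np.arange(H) - H // 2, np.arange(W) - W // 2)
lp_mask = (u ** 2 + v ** 2) <= P ** 2
lp_fm = np.real(np.fft.ifft2(np.fft.ifftshift(fft_fm * lp_mask)))
```

Because the DC component is retained, the mean of `lp_fm` matches that of `x_fm`; only the sharp spatial variations are removed.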
Fig. 1. The whole framework of our proposed method, including Fourier transform, low-pass filtering, ROI selection, spatial and channel weighting.
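Following Fig. 1, the ROI mask comes from thresholding the low-pass-processed map: Section 4.3 states that a position belongs to the ROI when its value exceeds K times the map's average (K = 0.3 in the experiments). A sketch with random stand-in data; treating the boolean mask as a per-position multiplier on ORIG_FM is one plausible reading of the masking step:

```python
import numpy as np

rng = np.random.default_rng(1)
lp_fm = rng.random((37, 25))              # stand-in for the low-pass-processed map
orig_fm = rng.random((512, 37, 25))       # stand-in for the original deep features

K = 0.3                                   # ROI threshold factor from the paper
mask = lp_fm > K * lp_fm.mean()           # boolean ROI mask, shape (37, 25)

# Keep activations inside the ROI and zero out the rest; the mask broadcasts
# over the channel axis.
fm_roi = orig_fm * mask
```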
$$s_{ij} = \left( L_{ij} \Big/ \Big( \sum_{m,n} L_{mn}^{a} \Big)^{1/a} \right)^{1/b}, \qquad (4)$$

where $a = 0.5$ and $b = 2$, following the setting in CroW [28]. By applying Eq. (4) to each position, we derive a two-dimensional matrix S, which can be used directly as the spatial weight matrix LSW applied to each feature map. In our deep feature aggregation framework, using the spatial weight matrix obtained in this part, we directly perform the spatial weighting operation on the feature map FM_ROI. In this way, the weighting operation in the spatial dimension is completed through the above intuitive and simple process.

Low-pass channel weighting: Previous channel weighting, e.g. Kalantidis et al. [28], mainly measures sparsity by the non-zero responses, which are disturbed by non-target objects such as people, sky and background. This scheme is not always suitable for generalized image retrieval scenarios. Therefore, we propose a channel weighting method based on the Fourier transform and low-pass filtering.

We directly use the aforementioned two-dimensional weight matrix S as a mask on the feature map FM_ROI, recording the obtained three-dimensional tensor as Q. Summation is then performed over each channel of Q to obtain a vector B. Subsequently, we find the maximum $B_{\max}$ and the minimum $B_{\min}$ of this vector and substitute them into Eq. (5):

$$\hat{B}_n = (B_n - B_{\min}) / (B_{\max} - B_{\min}), \qquad (5)$$

where $B_n$ denotes the value of the n-th channel. The $\hat{B}$ obtained above represents the low-pass factors for the channel weights, where $\hat{B}_n$ denotes the low-pass factor of the n-th channel. Then, we use a function to represent the negative correlation between the low-pass factor and the channel weight, as shown in Eq. (6). Although it is similar to the logarithmic function in CroW [28], and both can represent a negative correlation, the function we use is simpler and more intuitive. Substituting $\hat{B}$ into Eq. (6), we derive the vector C, which can be used directly as the channel weight vector LCW:

$$C_n = \exp(-\hat{B}_n). \qquad (6)$$

By weighting the spatially weighted and aggregated features with the channel weight LCW, the weighting operation in the channel dimension is completed. Finally, after performing operations such as normalization and whitening on the weighted feature vector, the final global feature vector of the image is obtained.

3.3. Implementation details

We summarize the deep feature aggregation framework in Algorithm 1. Let $ORIG\_FM \in \mathbb{R}^{W \times H \times K}$ denote the deep features extracted from the last convolutional or pooling layer, comprising K feature maps of width W and height H. Let $Mask \in \mathbb{R}^{W \times H}$ denote the generated mask matrix for ROI selection, of width W and height H. Let $LSW \in \mathbb{R}^{W \times H}$ and $LCW \in \mathbb{R}^{K}$ denote the low-pass spatial weight matrix and the low-pass channel weight vector, respectively.
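The weighting steps of Eqs. (4)-(6) can be sketched as below. Since the definition of L precedes Eq. (4) in a part of the text not reproduced here, the channel sum of FM_ROI is used as an assumed stand-in, and the minus sign in $\exp(-\hat{B}_n)$ encodes the stated negative correlation:

```python
import numpy as np

rng = np.random.default_rng(2)
fm_roi = rng.random((512, 37, 25))        # stand-in for the ROI-masked features

# Assumed source map for Eq. (4): channel sum of FM_ROI.
L = fm_roi.sum(axis=0)

# Eq. (4): normalized spatial weights with a = 0.5, b = 2, as in CroW.
a, b = 0.5, 2.0
S = (L / (L ** a).sum() ** (1.0 / a)) ** (1.0 / b)   # spatial weight matrix LSW

# Mask FM_ROI with S and sum each channel (the vector B of the text).
Q = fm_roi * S
B = Q.sum(axis=(1, 2))                    # shape (512,)

# Eq. (5): min-max normalization gives the low-pass factors.
B_hat = (B - B.min()) / (B.max() - B.min())

# Eq. (6): channel weights, negatively correlated with the low-pass factors.
C = np.exp(-B_hat)                        # channel weight vector LCW

# Channel-weight the spatially aggregated features; normalization and
# whitening would follow in the full pipeline.
g = B * C
g /= np.linalg.norm(g)
```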
Table 1
Retrieval performance comparison (in mAP) with some recent deep feature aggregation methods.
Algorithm 1. Deep feature aggregation

Input: tensor $ORIG\_FM \in \mathbb{R}^{W \times H \times K}$, dimensionality K′, parameter P, parameter K, whitening parameter W.
Output: K′-dimensional global representation $b \in \mathbb{R}^{K'}$.
1: Fourier transform ORIG_FM to FFT_FM;
2: Low-pass filter FFT_FM to obtain LP_FM, with parameter P;
3: Generate Mask for the ROI with LP_FM and parameter K;
4: Select the ROI on ORIG_FM with Mask to obtain FM_ROI;
5: Generate LSW with FM_ROI;
6: Perform spatial weighting on FM_ROI with LSW and aggregate to obtain a;
7: Generate LCW with FM_ROI and LSW;
8: Perform channel weighting on a with LCW to obtain b′;
9: Normalize b′ to obtain $\tilde{b}$;
10: Reduce dimensionality and whiten $\tilde{b}$ with parameters K′ and W to obtain $\hat{b}$;
11: Normalize $\hat{b}$ again to obtain the final global representation b.

4. Experiments

4.1. Experimental setting

Our experiments are conducted on five publicly available datasets: Oxford5K [44] (5062 building photos with 55 queries covering 11 landmarks), Oxford105K (an extension of Oxford5K [44] adding 100 K distractor photos from Flickr), Paris6K [45] (6392 building photos with 55 queries covering 11 landmarks), Paris106K (an extension of Paris6K [45] adding 100 K distractor photos from Flickr), and Holidays [46] (1491 holiday photos with 500 queries). The standard protocol used by other methods is employed, i.e., using the upright version of the images in Holidays [46] and the cropped queries in Oxford5K [44], Paris6K [45], Oxford105K and Paris106K.

We use the deep model VGG16 [23] pre-trained on ImageNet [43] without fine-tuning. All deep features used in the experiments are extracted directly from the Pool5 layer of VGG16 [23], so the number of channels is 512. For all searches, we use mean average precision (mAP) as the evaluation metric on each dataset, i.e. the mean, over all queries, of the average precision of the retrieved images. In particular, to test retrieval accuracy fairly on each dataset, when testing on Oxford5K (105K) [44] we learn the whitening parameters on Paris6K [45]; vice versa, when testing on Paris6K (106K) [45] and Holidays [46], we learn the whitening parameters on Oxford5K [44]. Our method involves two parameters: the parameter P of the low-pass filter and the parameter K for ROI selection. We determine their values empirically: P = 60, K = 0.3. Detailed discussions are provided in Section 4.3.

4.2. Validation

On the five public benchmark datasets, we compare the retrieval accuracy (measured in mAP) of our method at three dimensionalities with some recent deep feature aggregation methods without fine-tuning, including CroW [28], Neural codes [24], R-MAC [30], R-MAC + E [47], Razavian et al. [26], Zhang et al. [48] and SPoC [27]. Furthermore, for more comprehensive verification, we also perform a similar comparison with NetVLAD [33], which is fine-tuned. All comparison results are shown in Table 1.

As shown in Table 1, across the different representation dimensionalities, our method outperforms the other aggregation methods, apart from R-MAC [30] and R-MAC + E [47], which demonstrate excellent performance at 512 dimensions on Paris [45] and Holidays [46]. However, at 512 dimensions on Paris [45], the original CroW [28] does have a large gap with R-MAC [30], and R-MAC + E [47] is an improvement built on R-MAC [30], while our method improves retrieval performance by 2.5% and 3.0%, respectively, over the original CroW [28]. In particular, compared with the original CroW [28], the performance of our method increases by at least 2.4% at all dimensionalities on Paris [45], with a maximum increase of 3.9%.

The proposed method with query expansion (QE) is compared with CroW + QE [28] in the last part of Table 1. In these experiments, the average query expansion [49] computed over the top-10 query results is used. Our method with QE obtains the best performance at all three dimensionalities on all datasets.

These quantitative results validate that our method achieves a significant improvement in retrieval performance, which shows that introducing the Fourier transform and the low-pass filter into the aggregation step of CBIR is effective. It also demonstrates that our aggregation framework, including ROI selection, spatial weighting and channel weighting based on FFT and LP, can effectively improve the distinguishability of the feature maps.

To validate the retrieval performance more intuitively, we present several sets of randomly selected retrieval results for Oxford5K [44] in Fig. 2. On Paris6K [45], we have only two false results among all 55 groups of top-10 query results.

4.3. Discussion

Our framework yields significant improvements in retrieval results, which mainly benefit from the modified ROI selection and the spatial and channel cross-dimensional weighting based on the Fourier transform and the low-pass filter. Each component has its own effect in boosting the retrieval performance.

FFT and LP: For an image, after conversion from the spatial domain to the frequency domain, its high-frequency information mainly corresponds to parts that change sharply, such as edges, contours and noise, while its low-frequency information mainly corresponds to parts that change slowly, such as large flat areas. In fact, when retrieving a target image, we pay more attention to the low-frequency parts that describe most of the main information rather than the high-frequency parts. So we use the low-pass filter to process the image, which makes the smoothed image easier to distinguish. The reason for using the Fourier transform is that filtering in the spatial domain is disturbed by some pixels, such as noise, so some important information in the image may be lost, while the corresponding filtering operation in the frequency domain does not cause the loss of useful information. In this paper, we assume that the above reasoning also holds for feature maps, and we modify the Fourier transform and low-pass filter to suit them. Moreover, based on them, we develop a feature aggregation method including ROI selection and low-pass spatial and channel weighting. The experimental results validate that our method improves retrieval performance, which indicates that our assumption is valid and the filtered feature maps are more distinguishable.

ROI selection: We choose the feature map processed by the aforementioned low-pass filter to generate the mask for the ROI. This feature map suppresses most of the high-frequency information and retains the main low-frequency information to which we pay more attention. Hence, we calculate
Fig. 2. Randomly selected images from the Oxford5K [44] dataset with their corresponding top-10 retrieval results, using 512-dimensional global feature vectors, with no query expansion. The query image is displayed in the leftmost place. The query results are on the right, with wrong results marked by a red dashed line.
the average value of this feature map. Then, we use K times the average value as the benchmark to decide whether a position belongs to the ROI. We consider that the region of interest selected in this way retains the most important and representative information, while some interference from high-frequency information is suppressed.
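The components discussed above assemble into the pipeline of Algorithm 1. A single-function sketch under the same caveats: the ideal circular low-pass filter and the channel-sum source of the spatial weight are assumptions where the extracted text is incomplete, and the learned dimensionality reduction and whitening of step 10 are omitted:

```python
import numpy as np

def aggregate(orig_fm, P=8, K=0.3, a=0.5, b=2.0):
    """Sketch of Algorithm 1 (PCA-whitening, step 10, omitted)."""
    k_ch, w, h = orig_fm.shape
    # Steps 1-2: FFT of the channel-summed map, ideal low-pass filter of radius P.
    fft_fm = np.fft.fftshift(np.fft.fft2(orig_fm.sum(axis=0)))
    v, u = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2)
    lp_fm = np.real(np.fft.ifft2(np.fft.ifftshift(
        fft_fm * ((u ** 2 + v ** 2) <= P ** 2))))
    # Steps 3-4: ROI mask by thresholding at K times the mean, applied to ORIG_FM.
    fm_roi = orig_fm * (lp_fm > K * lp_fm.mean())
    # Steps 5-6: spatial weights (Eq. 4) and spatially weighted sum-pooling.
    L = fm_roi.sum(axis=0)
    S = (L / (L ** a).sum() ** (1.0 / a)) ** (1.0 / b)
    agg = (fm_roi * S).sum(axis=(1, 2))
    # Steps 7-8: channel weights (Eqs. 5-6) applied to the aggregated vector.
    low = (agg - agg.min()) / (agg.max() - agg.min())
    out = agg * np.exp(-low)
    # Steps 9 and 11: L2 normalization (dimensionality reduction omitted).
    return out / np.linalg.norm(out)

vec = aggregate(np.random.default_rng(3).random((512, 37, 25)))
```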
Table 2
Retrieval performance comparison (in mAP) between the original CroW [28] and CroW [28] with the ROI selection of our method added, at three dimensionalities.
Table 4
Retrieval performance comparison (in mAP) for different values of the parameter K at three dimensionalities.

5. Conclusion

In this paper, we proposed a deep feature aggregation framework for image retrieval, including modified ROI selection, low-pass spatial weighting and low-pass channel weighting, using the Fourier transform and low-pass filtering on the feature maps. In particular, the proposed FFT and LP allow the feature maps to undergo corresponding pre-processing steps before weighting and aggregation, which makes the processed feature maps more discriminative. Additionally, the modified ROI selection and the low-pass co-weighting over space and channel enhance the discrimination of the image representations, which yields better retrieval accuracy. Experimental results demonstrate the superiority of the proposed deep feature aggregation framework in comparison with other state-of-the-art aggregation methods. Our future work will focus on developing more general and effective aggregation methods for feature representation in large-scale image retrieval tasks.
[19] H. Lu, B. Li, J. Zhu, Y. Li, Y. Li, X. Xu, L. He, X. Li, J. Li, S. Serikawa, Wound intensity correction and segmentation with convolutional neural networks, Concurr. Comput.: Pract. Experience 29 (2017) e3927.
[20] J. Yang, D. She, M. Sun, M.-M. Cheng, P.L. Rosin, L. Wang, Visual sentiment prediction based on automatic discovery of affective regions, IEEE Trans. Multimedia 20 (2018) 2513–2525.
[21] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91–99.
[22] D. Kim, M. Arsalan, K. Park, Convolutional neural network-based shadow detection in images using visible light camera sensor, Sensors 18 (2018) 960.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[24] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural codes for image retrieval, in: European Conference on Computer Vision, Springer, pp. 584–599.
[25] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: European Conference on Computer Vision, Springer, pp. 392–407.
[26] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806–813.
[27] A. Babenko, V. Lempitsky, Aggregating local deep features for image retrieval, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1269–1277.
[28] Y. Kalantidis, C. Mellina, S. Osindero, Cross-dimensional weighting for aggregated deep convolutional features, in: European Conference on Computer Vision, Springer, pp. 685–701.
[29] A. Sharif Razavian, J. Sullivan, A. Maki, S. Carlsson, A baseline for visual
[35] A. Gordo, J. Almazán, J. Revaud, D. Larlus, Deep image retrieval: learning global representations for image search, in: European Conference on Computer Vision, Springer, pp. 241–257.
[36] A. Zhu, D. He, J. Zhao, W. Luo, W. Chen, 3D wear area reconstruction of grinding wheel by frequency-domain fusion, Int. J. Adv. Manuf. Technol. 88 (2017) 1111–1117.
[37] S. Kumar, R. Saxena, K. Singh, Fractional Fourier transform and fractional-order calculus-based image edge detection, Circ., Syst., Signal Process. 36 (2017) 1493–1513.
[38] H. Feng, G. Chang, X. Guo, Hash algorithm for color image based on super-complex Fourier transform coupled with position permutation, J. Front. Comput. Sci. Technol. (2017).
[39] S. Sun, Y. Cao, Y. Wang, C. Chen, G. Fu, X. Xu, Improved Fourier transform phase measurement method for measuring the quality parameters of an image intensifier, Opt. Eng. 56 (2017) 034113.
[40] D. Zhang, X. Liao, B. Yang, Y. Zhang, A fast and efficient approach to color-image encryption based on compressive sensing and fractional Fourier transform, Multimedia Tools Appl. 77 (2018) 2191–2208.
[41] J. Chen, F. Li, Y. Fu, Q. Liu, J. Huang, K. Li, A study of image segmentation algorithms combined with different image preprocessing methods for thyroid ultrasound images, in: 2017 IEEE International Conference on Imaging Systems and Techniques (IST), IEEE, pp. 1–5.
[42] N. Saxena, K.K. Sharma, Pansharpening scheme using filtering in two-dimensional discrete fractional Fourier transform, IET Image Proc. 12 (2018) 1013–1019.
[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 248–255.
[44] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large
instance retrieval with deep convolutional networks, in: International vocabularies and fast spatial matching, in: IEEE Conference on Computer
Conference on Learning Representations, May 7-9, 2015, San Diego, CA, ICLR. Vision & Pattern Recognition.
[30] G. Tolias, R. Sicre, H. Jégou, Particular object retrieval with integral max- [45] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Lost in quantization:
pooling of cnn activations, Comput. Sci. (2015). Improving particular object retrieval in large scale image databases, in: 2008
[31] X.S. Wei, J.H. Luo, J. Wu, Z.H. Zhou, Selective convolutional descriptor IEEE conference on computer vision and pattern recognition, IEEE, pp. 1–8.
aggregation for fine-grained image retrieval, IEEE Trans. Image Process. A [46] H. Jégou, M. Douze, C. Schmid, Improving bag-of-features for large scale image
Publ. IEEE Signal Process. Soc. 26 (2017) 2868. search, Int. J. Comput. Vision 87 (2010) 316–336.
[32] T.-T. Do, T. Hoang, D.-K.L. Tan, H. Le, T.V. Nguyen, N.-M. Cheung, From selective [47] P. Liu, G. Gou, H. Guo, D. Zhang, H. Zhao, Q. Zhou, Fusing feature distribution
deep convolutional features to compact binary representations for image entropy with r-mac features in image retrieval, Entropy 21 (2019) 1037.
retrieval, ACM Trans. Multimedia Comput., Commun., Appl. (TOMM) 15 (2019) [48] X. Zhang, Near-duplicate image retrieval based on multiple features, in: 2018
43. IEEE Visual Communications and Image Processing (VCIP), IEEE, pp. 1–4.
[33] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, J. Sivic, Netvlad: Cnn architecture [49] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall: Automatic
for weakly supervised place recognition, in: Proceedings of the IEEE query expansion with a generative feature model for object retrieval, in: 2007
Conference on Computer Vision and Pattern Recognition, pp. 5297–5307. IEEE 11th International Conference on Computer Vision, IEEE, pp. 1–8.
[34] F. Radenović, G. Tolias, O. Chum, Cnn image retrieval learns from bow: [50] L. Gao, X. Li, J. Song, H.T. Shen, Hierarchical lstms with adaptive attention for
Unsupervised fine-tuning with hard examples, in: European conference on visual captioning, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
computer vision, Springer, pp. 3–20.