
J. Vis. Commun. Image R. 72 (2020) 102860

Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/jvci

Adaptive deep feature aggregation using Fourier transform and low-pass filtering for robust object retrieval

Ziyao Zhou, Xinsheng Wang, Chen Li *, Ming Zeng, Zhongyu Li
School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China

ARTICLE INFO

Article history:
Received 4 June 2019
Revised 6 June 2020
Accepted 16 July 2020
Available online 24 July 2020

Keywords:
Image retrieval
Convolutional neural networks
Feature aggregation
Fourier transform
Low-pass filtering

ABSTRACT

With the rapid development of deep learning techniques, convolutional neural networks (CNN) have been widely investigated for feature representation in the image retrieval task. However, the key step in CNN-based retrieval, i.e., feature aggregation, has not been solved in a robust and general manner when tackling different kinds of images. In this paper, we present a deep feature aggregation method for image retrieval using the Fourier transform and low-pass filtering, which can adaptively compute a discriminative weight for each feature map. Specifically, the low-pass filtering preserves the semantic information in each feature map by transforming it to the frequency domain. In addition, we develop three adaptive methods to further improve the robustness of feature aggregation, i.e., Region of Interest (ROI) selection, spatial weighting and channel weighting. Experimental results demonstrate the superiority of the proposed method over other state-of-the-art methods, achieving robust and accurate object retrieval on five benchmark datasets.

© 2020 Published by Elsevier Inc.

1. Introduction

The rapid development of the Internet and artificial intelligence [1,2] plays an increasingly important role in various fields, involving image processing [3], video processing [4], natural language processing [5], defect detection [6], cross-modal retrieval [7–9], etc. Image retrieval, including text-based image retrieval and content-based image retrieval (CBIR), is one of the important sub-areas. It has become a critical technique for tackling large-scale and diverse image collections when searching for related objects. In the image retrieval process, where similar and related images containing the desired target are sought in large-scale image databases, accuracy is the most important evaluation metric in practical use cases. For text-based image retrieval, manual labeling is labor-intensive and time-consuming. Moreover, the annotated semantic texts are subject to the annotators' subjective factors and cannot fully express the rich information contained in an image. These characteristics result in limited retrieval performance. CBIR, in contrast, directly uses the query image itself as the retrieval input and outputs images from the database that are semantically similar to the query. The results obtained in this way are more accurate, since this approach exploits the ability of computers to quickly process repetitive tasks and does not require cumbersome text annotation. The advantage of "search by image" has made the technology widely used, not only in search engines such as Google, but also in e-commerce, medicine, and other fields.

One of the key problems of CBIR is obtaining the image representation. Most CBIR methods primarily use hand-crafted descriptors (e.g. SIFT [10]) or features from a Convolutional Neural Network (CNN [11]) as the image representations. Although typical hand-crafted image representations, such as features organized using the "bag-of-words" model (BoW [12]), have achieved many successes, they do not extend well to searching in large-scale datasets [13,14]. On the other hand, image representations based on CNN features benefit from the fast development of deep learning [11], which is increasingly applied in various fields including image processing [2,15,8], image classification [16–18], semantic segmentation [19,20], object detection [21,22], etc. The semantic information of images can be captured by the activations of a deep neural network trained on large datasets, which makes such activations suitable for image representation. For instance, image features extracted from CNNs such as VGG16 [23] can greatly improve image retrieval performance, which has made image retrieval using convolutional neural networks a popular frontier research field.

In early works, CNN-based CBIR methods directly used the deep features extracted from the outputs of the fully connected layers as the global feature vectors to represent images [24–26].

* Corresponding author.
E-mail address: cclidd@xjtu.edu.cn (C. Li).

https://doi.org/10.1016/j.jvcir.2020.102860
1047-3203/© 2020 Published by Elsevier Inc.
Recent research finds that features extracted from the outputs of the convolutional layers perform better in the image retrieval task, since they contain more semantic information. In this field, one of the key problems is the aggregation of convolutional features, which is also the focus of our research. Recently, multiple feature aggregation methods have been proposed for image retrieval; they are briefly reviewed in the related works. However, some prior assumptions and improvable steps remain the main limitations for accurate feature representation. Therefore, current feature aggregation methods do not work in a robust and general manner.

Motivated by previous works [27,28], we propose an adaptive feature aggregation method for object retrieval based on the Fourier transform and low-pass filtering, with modified strategies for ROI selection, spatial weighting and channel weighting. In particular, the low-pass features can be used to improve the discrimination of deep features, resulting in a significant improvement in retrieval precision. Our contributions can be summarized as follows:

– To the best of our knowledge, this is the first time that the Fourier transform and low-pass filtering are employed for feature aggregation in image retrieval. Both are applied directly to the feature maps, which effectively computes a discriminative weight for each feature map in the frequency domain.

– The ROI selection, spatial weighting and channel weighting in our method are modified based on the Fourier-transformed and low-pass-filtered feature maps. Together they constitute a complete weighting framework. The constructed weight matrix is simple and intuitive, which makes the weighted feature maps more recognizable.

– Our framework does not require fine-tuning. It demonstrates robust, state-of-the-art performance on five benchmark datasets for image retrieval.

The remainder of this paper is organized as follows. Section 2 discusses related works. Section 3 presents the details of our main contributions. Section 4 gives a wide range of experiments to comprehensively evaluate the proposed methods. Section 5 concludes the current work and discusses future directions.

2. Related works

Content-based image retrieval based on convolutional neural networks has attracted a lot of attention in recent years, and some effective methods have been proposed in this field. In this work, the Fourier transform and low-pass filters are adopted to give CNN-based CBIR better performance. Thus, related works on these topics are briefly reviewed in this section.

2.1. The CNN-based CBIR

Deep image features extracted from the fully connected layer were mostly adopted in early CBIR works [24–26]. However, recent research shows that features extracted from the convolutional layers perform better in the image retrieval task, since more useful information for CBIR, i.e. the semantics, is contained in the convolutional features. Compared with fully connected layer features, convolutional features are always high-dimensional, which makes similarities hard to compute. Thus, how to aggregate the convolutional features into compact global feature vectors is very important for the image retrieval task.

To make convolutional features appropriate for image retrieval, multiple feature aggregation methods have been proposed. For example, MAC [29] generates an image representation by using the max activation of the entire convolutional layer. Compared with sum-pooling, max-pooling's performance is poor, as discriminative activations may be suppressed. SPoC [27] introduces a centering prior and uses sum-pooling to aggregate the outputs of the last convolutional layer. In that work, the geometric center of the image is taken as its region of interest (ROI), which improves retrieval performance. However, the method relies heavily on the central prior, which is not always feasible. R-MAC [30] performs max-pooling on the convolutional layer's activations over corresponding regions to derive representations for image regions, and then aggregates them into an image representation with sum-pooling. Similar to the MOP method, the sliding-window operation is performed on the feature map instead of the image, which speeds up feature extraction, and this method can achieve good retrieval accuracy without fine-tuning. The convolutional features can thus be aggregated directly into the image representation via spatial max-pooling or sum-pooling, which is efficient. Further, some recent works [27,28,31,32] propose to select the region of interest (ROI) and weight the features in the ROI before aggregation.

ROI selection: Several works propose selecting an ROI before aggregation, which can improve retrieval accuracy. When searching for a specific purpose, we only need to complete the search based on the features in the regions we are interested in. Instead of using all features, only the features in the ROI are used, which increases both accuracy and efficiency. In this way, the interference of features in regions of non-interest is also suppressed, so that retrieval accuracy improves. For example, the SCDA (selective convolutional descriptor aggregation) [31] method obtains an "aggregation map" by adding up the activations of the last convolutional layer along the depth direction, and selects as the ROI the region of the "aggregation map" whose values are larger than the average value. However, this method only works well for images that have few noisy objects and are mostly filled with objects of interest. When the images contain not only the query object but also interference from other notable objects, the performance of this method degrades significantly. One of the methods proposed by Do et al. selects the max local feature of each feature map as a mask to select the ROI [32]. However, this method also only works well on images that contain a single object; when the images contain other disturbing objects, its performance likewise begins to decline.

Feature weighting: Prior to aggregation, some important discriminative features can be weighted to further increase their discernibility, which also improves retrieval accuracy. Recently, several feature aggregation methods including weighting schemes have been proposed for image retrieval, e.g. CroW [28], which significantly improves retrieval performance. In detail, CroW [28] designs a cross-dimensional weighting scheme to weight the feature maps and uses sum-pooling to aggregate them into the image representation. This method obtains good retrieval accuracy without fine-tuning. More importantly, it proposes the idea of weighting in the two dimensions of space and channel. However, the ROI is not considered in CroW [28]. Moreover, the feature maps of all channels are simply added up to construct the spatial weight.

Fine-tuning works: In the field of deep learning, fine-tuning is one way to further improve performance. Supervised fine-tuning is adopted in some image retrieval works [33–35], and it does improve performance. However, the premise for fine-tuning to work effectively is that appropriate training data is available. Hence, it is necessary to collect, clean and annotate the training data to make it suitable for fine-tuning, which is time-consuming and labor-intensive. So fine-tuning is not always feasible. Therefore, the aggregation method we propose does not need to be fine-tuned.
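As a concrete illustration of the two pooling strategies discussed above, MAC-style and SPoC-style aggregation can be sketched in a few lines of NumPy. The tensor shape and random values below are illustrative stand-ins for real convolutional activations, not the paper's code.

```python
import numpy as np

# Illustrative stand-in for a conv feature tensor (K channels of H x W),
# e.g. 512 Pool5 feature maps from VGG16; values here are random.
rng = np.random.default_rng(0)
feats = rng.random((512, 37, 50)).astype(np.float32)  # (K, H, W)

# MAC-style aggregation: the max activation of each feature map.
mac = feats.max(axis=(1, 2))   # shape (512,)

# SPoC-style aggregation: sum-pooling over each feature map.
spoc = feats.sum(axis=(1, 2))  # shape (512,)

# Both global descriptors are typically L2-normalized before similarity search.
mac /= np.linalg.norm(mac)
spoc /= np.linalg.norm(spoc)
```

Either descriptor can then be compared with a simple dot product, which is what makes these compact global representations attractive for large-scale search.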
2.2. Fourier transform and filtering

The Fourier transform is a classic technique in conventional digital image processing that converts images from the spatial domain to the frequency domain. The images, after further processing by various filters, are then converted back to the spatial domain by the inverse Fourier transform. One of the most commonly used filters is the low-pass filter, which attenuates the high-frequency content of the image in the frequency domain and preserves the low-frequency content that contains most of the information; this suppresses noise and smooths (or blurs) the image. In recent works, improved variants of the Fourier transform and various filters are widely used in many sub-fields of image processing.

Zhu et al. [36] propose a 3D wear area reconstruction method that uses the Fourier transform to fuse depth from focus (DFF) and shape from shading (SFS). After performing high-pass and low-pass filtering in the frequency domain on the 3D reconstructed images of DFF and SFS, respectively, the 3D wear area reconstruction image of the grinding wheel is obtained by Fourier inversion. Kumar et al. [37] propose fractional Fourier transform and fractional-order calculus-based image edge detection, performed in the fractional Fourier frequency domain using the "fractional Fourier transformation" tool. Feng et al. [38] propose a hash authentication algorithm for color images involving the super-complex Fourier transform and input image pre-processing based on the mean filter and the rotation factor. Sun et al. [39] propose an improved Fourier transform phase measurement method for measuring the quality parameters of an image intensifier; in their method, the fundamental frequency components of the captured patterns are obtained after Fourier transform and filtering. Zhang et al. [40] propose a color-image encryption scheme based on two-dimensional (2D) compressive sensing (CS) and the fractional Fourier transform (FrFT). Chen et al. [41] propose several image segmentation algorithms combined with different image pre-processing methods, including Butterworth low-pass filtering, Butterworth high-pass enhanced filtering and adaptive weighted median filtering. Nidhi et al. [42] propose a novel pansharpening scheme based on the two-dimensional discrete fractional Fourier transform (2D-DFRFT), in which panchromatic and intensity images are transformed using the 2D-DFRFT and filtered by high-pass filters.

In our method, the Fourier transform and low-pass filtering are adapted to apply directly to the feature maps. To the best of our knowledge, this is the first time they are employed for feature aggregation in image retrieval.

3. Methodology

In this section, we present the theoretical and technical details of our adaptive deep feature aggregation framework, including feature aggregation based on the Fourier transform and low-pass filtering, with modified strategies for ROI selection, spatial weighting and channel weighting.

3.1. Overview

In our feature aggregation framework (illustrated in Fig. 1), the image is fed directly into the VGG16 model [23] pre-trained on ImageNet [43] without fine-tuning, and deep features are extracted from the last convolutional or pooling layer, yielding a tensor for each image. In the aggregation method, the entire tensor (denoted ORIG_FM) is summed along the channel direction to obtain a dense representation (denoted FM). Then, a Fourier transform operation (denoted FFT) is applied to compute a processed two-dimensional matrix (denoted FFT_FM). Further, low-pass filtering (denoted LP) is introduced to obtain a low-pass processed feature map (denoted LP_FM), analogous to low-pass filtering in conventional digital image processing. After that, LP_FM is processed by a mask selection step to obtain the mask (denoted Mask) of the region of interest (ROI). We employ Mask to mask ORIG_FM and obtain the processed feature map (denoted FM_ROI). Based on FM_ROI, we construct the spatial and channel weights (denoted LSW and LCW) according to our modified weight construction strategy, and then perform the "cross" weighting aggregation of the feature maps. Finally, operations such as normalization and whitening are performed (consistent with the operations in CroW [28]). The whole process produces the global feature vector of the image for robust and accurate object retrieval.

3.2. FFT and LP for feature aggregation

For existing feature aggregation methods that directly use the feature maps extracted from convolutional or pooling layers, prior assumptions and improvable steps are the main limitations for accurate feature representation [27,30,28]. For digital images, high-frequency content represents edges or noise that can be attenuated, while low-frequency content contains most of the information and should be preserved. Visually, a smoother, noise-removed image is in many cases easier to identify. Moreover, with the Fourier transform, a low-pass filter that is complex in the spatial domain becomes simple and intuitive in the frequency domain. Therefore, we propose to employ the Fourier transform and a low-pass filter to process the feature maps directly and preserve their low-frequency content.

Here, we develop an adaptive Fourier transform (FFT) and low-pass filter (LP) for feature aggregation. The traditional Fourier transform processes a two-dimensional matrix, while the convolutional features extracted from a convolutional or pooling layer of a CNN form a three-dimensional tensor. Thus, the convolutional features must be further processed before the Fourier transform. For example, the deep features we use, extracted from the Pool5 layer of VGG16 [23], form a three-dimensional tensor with 512 feature maps, each characterizing the deep feature obtained by the corresponding convolutional kernel. We sum this set of feature maps along the channel direction as follows:

X_{FM}(i, j) = \sum_{k=1}^{K} X_k(i, j), \quad \forall i = 1, 2, \ldots, W, \ \forall j = 1, 2, \ldots, H   (1)

By Eq. (1), we obtain a dense feature map FM, a two-dimensional matrix, where X_k(i, j) denotes the value at position (i, j) of the k-th feature map in ORIG_FM and X_{FM}(i, j) denotes the value at each position of FM. FM is reshaped to a form suitable for the FFT and converted to the frequency domain. Next, we perform the low-pass filtering operation (LP) on the frequency-domain feature map FFT_FM. Since the low-frequency components are distributed in the four corners of FFT_FM while the rest contains the high-frequency components, we shift the low-frequency components to the middle to obtain the deformed Dft_shift; no information is lost in this operation. We then construct a low-pass mask of the required size, which involves a parameter P (according to the experiments, we set P = 60). The parameter P visually represents half the side length of the square in the middle of the frequency-domain feature map; only the low-frequency content inside this square is allowed to pass.
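Under our reading of the steps described above, the channel summation of Eq. (1) plus the FFT, frequency shift, square low-pass mask and inversion can be sketched with NumPy's FFT helpers. The function name and the exact mask layout are our own assumptions for illustration, not the authors' code.

```python
import numpy as np

def low_pass_feature_map(orig_fm: np.ndarray, p: int = 60) -> np.ndarray:
    """Sketch of the FFT + low-pass step on a (K, H, W) feature tensor.

    Sum over channels (Eq. (1)), transform to the frequency domain,
    shift low frequencies to the center, keep only a centered square
    of half side length p, then invert back to the spatial domain.
    """
    fm = orig_fm.sum(axis=0)             # Eq. (1): dense 2-D map FM
    fft_fm = np.fft.fft2(fm)             # FFT_FM (low freqs at the corners)
    dft_shift = np.fft.fftshift(fft_fm)  # low frequencies moved to the center
    h, w = fm.shape
    mask = np.zeros((h, w))
    ch, cw = h // 2, w // 2
    mask[max(ch - p, 0):ch + p, max(cw - p, 0):cw + p] = 1.0  # pass-band square
    filtered = np.fft.ifftshift(dft_shift * mask)  # restore corners before inversion
    lp_fm = np.real(np.fft.ifft2(filtered))        # LP_FM in the spatial domain
    return lp_fm
```

With p large enough to cover the whole spectrum, the function reduces to the plain channel sum, which is a handy sanity check on the round trip.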
Fig. 1. The whole framework of our proposed method, including Fourier transform, low-pass filtering, ROI selection, spatial and channel weighting. (Feature maps extracted from VGG16 (512 channels) are summed and low-pass filtered with P = 60 to generate the mask for ROI (K = 0.3); spatial and channel weights are then applied, followed by normalization and whitening to obtain the global feature vector.)

After that, the low-pass mask is used to mask Dft_shift to achieve low-pass filtering. Note that the masked result should be inversely deformed before reconstructing the feature map in the spatial domain: the low-frequency components in the middle are restored to the four corners. After this inverse deformation, the inverse Fourier transform can be performed. Through these continuous operations we obtain LP_FM, the spatial-domain result processed by the low-pass filter, which is more suitable for feature representation.

ROI selection: When searching for an image with specific objects, we pay more attention to the region of interest (ROI) rather than the whole image. Similarly, selecting the ROI plays an important role in image retrieval tasks, since it suppresses the interference of features in other regions of non-interest.

The traditional method of ROI selection takes a certain fixed region as the ROI, but this "fixed" approach is not always universal. For different images, the required ROI is not always the same fixed region, since differences in perspective and illumination change the ROI. Here, we propose a generalized ROI selection method based on the low-pass-filtered feature map LP_FM. LP_FM smooths out non-semantic details and noise while retaining most of the information, which makes it more discriminative; it is therefore natural to select the ROI based on LP_FM. We design a mask for the ROI to process the deep features ORIG_FM extracted from the neural network, using LP_FM to set the specific values in Mask. We first compute the average value of LP_FM as shown in Eq. (2). Then, we compare the value at each position of LP_FM with this average: if the value is larger than or equal to K times the average, we set the corresponding position in Mask to 1; otherwise we set it to 0, as shown in Eq. (3). Here, we set the parameter K to 0.3.

\bar{X} = \sum_{i=1}^{W} \sum_{j=1}^{H} X_{LP}(i, j) / (W \times H),   (2)

X_M(i, j) = \begin{cases} 1 & X_{LP}(i, j) \geq K \cdot \bar{X} \\ 0 & X_{LP}(i, j) < K \cdot \bar{X} \end{cases}   (3)

where X_{LP}(i, j) denotes the value at each position of LP_FM, \bar{X} denotes the average value of LP_FM, and X_M(i, j) denotes the value at each position of the Mask for the ROI. Through the above operations we obtain a mask for the ROI and use it to mask ORIG_FM, which selects the ROI and suppresses interference from other cluttered regions, finally yielding the feature map FM_ROI with the ROI selected. Note that FM_ROI is still a three-dimensional tensor with the same shape as ORIG_FM.

Instead of intercepting a fixed area in the traditional sense, our method dynamically generates a different average value according to the LP_FM obtained by low-pass filtering. Moreover, it is not restricted to a specific category of images and extends to multiple image retrieval scenarios. Accordingly, our method is robust and universal for image retrieval.

Low-pass spatial weighting: Kalantidis et al. [28] directly add up the feature maps of all channels to obtain the spatial weights, which cannot robustly account for the importance of each feature map. Here, we propose a method for constructing spatial weights based on FFT and LP. Specifically, we take the feature map FM_ROI obtained above as input and perform the Fourier transform and low-pass filtering on it, obtaining a result L processed by FFT and LP. We use L to replace S' in the normalization formula of CroW [28], recording the value at each position as L_{ij}. Then, the weight of each spatial position (i, j) is calculated by Eq. (4):
s_{ij} = \left( L_{ij} \Big/ \left( \sum_{m,n} L_{mn}^{a} \right)^{1/a} \right)^{1/b},   (4)

where a = 0.5 and b = 2, following the setting in CroW [28]. By applying Eq. (4) to each position, we derive a two-dimensional matrix S, which can be directly used as the spatial weight matrix LSW applied to each feature map. In our deep feature aggregation framework, using this spatial weight matrix, we directly perform the spatial weighting operation on the feature map FM_ROI. In this way, the weighting operation in the spatial dimension is completed through an intuitive and simple process.

Low-pass channel weighting: In previous channel weighting, e.g. Kalantidis et al. [28], sparsity is mainly measured by the non-zero responses, which are disturbed by non-target objects such as people, sky and background. This scheme is not always suitable for generalized image retrieval scenarios. Therefore, we propose a channel weighting method based on the Fourier transform and low-pass filtering.

We directly use the aforementioned two-dimensional weight matrix S as a mask on the feature map FM_ROI, recording the obtained three-dimensional tensor as Q. Summation is then performed over each channel of Q to obtain a vector B. Subsequently, we find the maximum B_max and the minimum B_min in this vector and substitute them into Eq. (5):

\bar{B}_n = (B_n - B_{min}) / (B_{max} - B_{min}),   (5)

where B_n denotes the value of the n-th channel. The resulting \bar{B} represents the low-pass factors for the channel weights, where \bar{B}_n denotes the low-pass factor of the n-th channel. Then, we use a function to represent the negative correlation between the low-pass factor and the channel weight, as shown in Eq. (6). Although it is similar to the logarithmic function in CroW [28], and both can represent a negative correlation, the function we use is simpler and more intuitive. Substituting \bar{B} into Eq. (6), we derive the vector C, which can be directly used as the channel weight vector LCW:

C_n = \exp(-\bar{B}_n).   (6)

By weighting the spatially weighted and aggregated features with the channel weight LCW, the weighting operation in the channel dimension is completed. Finally, after performing operations such as normalization and whitening on the weighted feature vector, the final global feature vector of the image is obtained.

3.3. Implementation details

We summarize the deep feature aggregation framework in Algorithm 1. Let ORIG_FM ∈ R^{W×H×K} denote the deep features extracted from the last convolutional or pooling layer, comprising K feature maps of width W and height H. Let Mask ∈ R^{W×H} denote the generated mask matrix for ROI selection, with width W and height H. Let LSW ∈ R^{W×H} and LCW ∈ R^{K} denote the low-pass spatial weight matrix and the low-pass channel weight vector, respectively.
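A minimal NumPy sketch of Eqs. (2)–(6) follows. The helper names are our own, and the clipping of negative filtered values in the spatial weight is our assumption, added to keep the fractional power in Eq. (4) well defined.

```python
import numpy as np

def roi_mask(lp_fm: np.ndarray, k: float = 0.3) -> np.ndarray:
    """Eqs. (2)-(3): binary ROI mask from the low-pass map LP_FM."""
    x_bar = lp_fm.mean()                             # Eq. (2): average value
    return (lp_fm >= k * x_bar).astype(lp_fm.dtype)  # Eq. (3): threshold at K * mean

def spatial_weight(l: np.ndarray, a: float = 0.5, b: float = 2.0) -> np.ndarray:
    """Eq. (4): spatial weight matrix S from the filtered map L (a = 0.5, b = 2)."""
    l = np.maximum(l, 0.0)            # assumption: clip negatives left by filtering
    norm = (l ** a).sum() ** (1.0 / a)
    return (l / norm) ** (1.0 / b)

def channel_weight(fm_roi: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Eqs. (5)-(6): channel weights from FM_ROI (K, H, W) and S (H, W)."""
    q = fm_roi * s[None, :, :]        # mask FM_ROI with the spatial weights -> Q
    b_vec = q.sum(axis=(1, 2))        # per-channel sums -> vector B
    b_bar = (b_vec - b_vec.min()) / (b_vec.max() - b_vec.min())  # Eq. (5)
    return np.exp(-b_bar)             # Eq. (6): negative correlation with B
```

Note that the min-max scaling in Eq. (5) assumes the channel sums are not all identical; a degenerate all-equal vector would need a guard in practice.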

Table 1. Retrieval performance comparison (in mAP) with some recent deep feature aggregation methods.
Algorithm 1. Deep feature aggregation.

Input: tensor ORIG_FM ∈ R^{W×H×K}, dimensionality K', parameter P, parameter K, whitening parameter W.
Output: K'-dimensional global representation b ∈ R^{K'}.
1: Fourier transform ORIG_FM to FFT_FM;
2: Low-pass filter FFT_FM to obtain LP_FM with parameter P;
3: Generate Mask for ROI with LP_FM and parameter K;
4: Select ROI on ORIG_FM with Mask to obtain FM_ROI;
5: Generate LSW with FM_ROI;
6: Perform spatial weighting on FM_ROI with LSW and aggregation to obtain a;
7: Generate LCW with FM_ROI and LSW;
8: Perform channel weighting on a with LCW to obtain b';
9: Normalize b' to obtain b̃;
10: Perform dimensionality reduction and whitening on b̃ with parameters K' and W to obtain b̂;
11: Normalize b̂ again to obtain the final global representation b.

4. Experiments

4.1. Experimental setting

Our experiments are conducted on five publicly available datasets: Oxford5K [44] (5062 building photos with 55 queries covering 11 landmarks), Oxford105K (an extension of Oxford5K [44] adding 100K distractor photos from Flickr), Paris6K [45] (6392 building photos with 55 queries covering 11 landmarks), Paris106K (an extension of Paris6K [45] adding 100K distractor photos from Flickr), and Holidays [46] (1491 holiday photos with 500 queries). The standard protocol used by other methods is employed, i.e., using the upright version of the images in Holidays [46] and the cropped queries in Oxford5K [44], Paris6K [45], Oxford105K and Paris106K.

We use the VGG16 model [23] pre-trained on ImageNet [43] without fine-tuning. All deep features used in the experiments are extracted directly from the Pool5 layer of VGG16 [23], and the number of channels is 512. For all searches, we use mean average precision (mAP) as the evaluation metric on each dataset, where mAP is defined as the average percentage of same-class images among all retrieved images, evaluated over all queries. In particular, to test retrieval accuracy fairly on each dataset, when testing on Oxford5K (105K) [44] we learn the whitening parameters on Paris6K [45]; vice versa, when testing on Paris6K (106K) [45] and Holidays [46], we learn the whitening parameters on Oxford5K [44]. Two parameters are involved in our method: the parameter P of the low-pass filter and the parameter K for ROI selection. We empirically determine their values as P = 60 and K = 0.3; detailed discussions are provided in Section 4.3.

4.2. Validation

On the five public benchmark datasets, we compare the retrieval accuracy (measured in mAP) of our method in three dimensions with some recent deep feature aggregation methods that do not require fine-tuning, including CroW [28] and Neural codes, as well as with NetVLAD [33], which is fine-tuned. All comparison results are shown in Table 1.

As shown in Table 1, for different representation dimensions, our method outperforms the other aggregation methods, apart from R-MAC [30] and R-MAC + E [47], which demonstrate excellent performance in the 512-dimension setting on Paris [45] and Holidays [46]. However, in the 512-dimension setting on Paris [45], the original CroW [28] does have a large gap with R-MAC [30], and R-MAC + E [47] is an improvement based on R-MAC [30], while our method improves retrieval performance by 2.5% and 3.0%, respectively, compared with the original CroW [28]. In particular, compared with the original CroW [28], the performance of our method increases by at least 2.4% in all dimensions on Paris [45], with a maximum increase of 3.9%.

The proposed method with query expansion (QE) is compared with CroW + QE [28] in the last part of Table 1. In the experiments, the average query expansion [49] computed from the top-10 query results is used. It can be seen that our method with QE achieves the best performance in all three dimensions on all datasets.

These quantitative results validate that our method achieves a significant improvement in retrieval performance, which shows that the introduction of the Fourier transform and the low-pass filter into the aggregation step of CBIR is effective. It also demonstrates that our aggregation framework, including ROI selection, spatial weighting and channel weighting based on FFT and LP, can effectively improve the distinguishability of feature maps.

To validate the retrieval performance in a more intuitive manner, we present several sets of randomly selected retrieval results for Oxford5K [44] in Fig. 2. In addition, we have only two false results among all 55 groups of top-10 query results on Paris6K [45].

4.3. Discussion

Our framework yields significant improvements in retrieval results, which mainly benefit from the modified ROI selection and the spatial and channel cross-dimensional weighting based on the Fourier transform and the low-pass filter. Each of these has its own effect in boosting retrieval performance.

FFT and LP: For an image, after conversion from the spatial domain to the frequency domain, the high-frequency information mainly corresponds to sharply changing parts such as edges, contours and noise, while the low-frequency information mainly corresponds to slowly changing parts such as large flat areas. In fact, when retrieving a target image, we pay more attention to the low-frequency parts that describe most of the main information rather than to the high-frequency parts. We therefore use the low-pass filter to process the image, which makes the smoothed image easier to distinguish. The reason for using the Fourier transform is that a filtering operation on the image in the spatial domain is disturbed by some pixels such as noise, so some important information in the image may be lost, whereas the corresponding filtering operation in the frequency domain does not cause the loss of useful information. In our paper, we assume that these properties also hold for feature maps, and we modify the Fourier transform and low-pass filter to be suitable for feature maps. Based on them, we develop a feature aggregation method including ROI selection and low-pass spatial and channel weighting. The experimental results validate that our method achieves an improvement in retrieval performance, which proves that our assumption is valid and that the filtered feature maps are more distinguishable.

ROI selection: We choose the feature map processed by the aforementioned low-pass filter to generate the mask for the ROI. We
[24], R-MAC [30], R-MAC + E [47], Razavian et al. [26], Zhang consider that this feature map suppresses most of the high-
et al. [48] and SPoC [27]. Furthermore, for more comprehensive frequency information. And it retains the main low-frequency
verification, we also perform a similar performance comparison information which we pay more attention to. Hence, we calculate
Z. Zhou et al. / J. Vis. Commun. Image R. 72 (2020) 102860 7

Fig. 2. Randomly selected images from Oxford5K [44] datasets with their corresponding top-10 retrieval results, using 512-dimensional global feature vectors, with no query
expansion. The query image is displayed in the leftmost place. The query results are in the right part, where the wrong results are marked with the red dashed line.

the average-value of this feature map. Then, we use K times of region of interest selected in this way is a region that retains most
average-value as benchmark to measure whether the correspond- important and representative information, while some interfer-
ing position is considered as a part of ROI. We consider that the ence from high-frequency information is suppressed, such as
8 Z. Zhou et al. / J. Vis. Commun. Image R. 72 (2020) 102860

edges, contours and noise. Moreover, instead of selecting a fixed area as the ROI, our method dynamically and adaptively selects the ROI according to each feature map itself, which means that the average value of its low-pass feature map is recalculated for every ROI selection. The experimental results confirm that our assumption and method are valid. Table 2 compares the retrieval performance of the original CroW [28] and CroW [28] with our ROI selection added, on Oxford5K [44] and Paris6K [45]. Although the improvement in retrieval accuracy is modest, it still shows that our ROI selection has a real effect and constitutes an important part of the complete aggregation framework based on FFT and LP. The increased distinguishability of the feature maps it produces is further amplified in the subsequent weighting operations.
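For concreteness, the low-pass filtering and ROI steps described above can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the paper's released implementation: it assumes an ideal (hard) low-pass filter whose pass band is the centered square of half side P, and the toy feature-map size is arbitrary.

```python
import numpy as np

def lowpass_feature_map(fmap: np.ndarray, P: int) -> np.ndarray:
    """Ideal low-pass filter on one H x W feature map: keep only the
    centered square (half side P) of the shifted 2-D spectrum, then
    transform back to the spatial domain."""
    H, W = fmap.shape
    spec = np.fft.fftshift(np.fft.fft2(fmap))  # low frequencies move to the center
    mask = np.zeros((H, W))
    cy, cx = H // 2, W // 2
    mask[max(cy - P, 0):cy + P, max(cx - P, 0):cx + P] = 1.0
    filtered = np.fft.ifft2(np.fft.ifftshift(spec * mask))
    return np.real(filtered)

def roi_mask(fmap: np.ndarray, P: int, K: float) -> np.ndarray:
    """Adaptive ROI: a position is kept when the low-pass response
    exceeds K times the mean of the filtered map (recomputed per map)."""
    lp = lowpass_feature_map(fmap, P)
    return (lp > K * lp.mean()).astype(np.float32)

# Toy usage: one 37 x 50 map; the real Pool5 spatial size depends on the input
# image, so P here is scaled down to fit this toy map.
fm = np.abs(np.random.default_rng(0).normal(size=(37, 50)))
m = roi_mask(fm, P=16, K=0.3)
print(m.shape)  # (37, 50)
```

The mask values are hard 0/1 decisions, matching the first optimization idea discussed later (smoother values between 0 and 1 are left as future work).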
Low-pass spatial and channel weighting: The modified spatial and channel weights designed in our method can be applied to deep features individually or jointly. For spatial weighting, we compare the low-pass spatial weight (LSW) in our method with the spatial weight (SW) in CroW [28]. For channel weighting, we compare the low-pass channel weight (LCW) in our method with the channel weight (CW) in CroW [28]. To fully validate the effectiveness of our method, we test six weighting combinations in three dimensions on Oxford5K [44] and report the retrieval accuracy (in mAP). Fig. 3 records the retrieval results of these combinations.

CroW [28] simply sums the original feature maps of all channels when generating its SW. In contrast, our LSW is generated from feature maps that have passed ROI selection and been processed by the low-pass filter, which further suppresses interference from high-frequency information such as edges, contours and noise. LSW therefore retains the advantages of the spatial weight in CroW [28] while also introducing the effect of the low-pass filter. In Fig. 3, the retrieval performance (in mAP) increases from CW to SW + CW to LSW + CW, which shows that spatial weighting indeed improves retrieval performance and that our LSW improves it more effectively than the SW of CroW [28].

CroW [28] implements its CW using the sparsity of each channel, indicated by the non-zero response. However, the non-zero response considers only the activated area and ignores the intensity of activation, so background regions (people, sky and ground) may receive large weights in feature maps with large non-zero responses. In contrast, we generate the channel weight from the sum value of the low-pass feature maps, which is more appropriate than sparsity; for LCW, the low-pass feature maps are obtained by applying the low-pass filter to the feature maps in the ROI. Moreover, we use a simpler and more effective expression than CroW [28] to encode a positive correlation between the channel weight and our low-pass factor. The rationale for this positive relation is that infrequently occurring features provide important signals. In Fig. 3, the retrieval performance (in mAP) increases from SW to SW + CW to SW + LCW, which shows that channel weighting indeed improves retrieval performance and that our LCW improves it more effectively than the CW of CroW [28].

Fig. 3. Retrieval performance comparison (in mAP) for different weighting combinations on three dimensions tested on Oxford5K [44].

Furthermore, the combination of LSW and LCW achieves even better retrieval performance, as shown by the LSW + LCW results in Fig. 3. Therefore, the effectiveness of our spatial and channel weighting is verified.

The parameters in the proposed method are P, involved in setting the low-pass filter, and K, involved in ROI selection.

Parameter P: The purpose of the low-pass filter is to let only low-frequency content pass. But "low" is an abstract concept: different people understand it differently, and it is not convenient for quantitative control. We therefore quantify "low" with the parameter P, which intuitively and quantitatively controls the range of low-frequency content allowed to pass. Visually, P represents half the side length of the square at the center of the feature map in the frequency domain; only the low-frequency content inside this square is allowed to pass. We discuss P in the experiments and determine its value empirically.

Table 3 compares the retrieval performance on Paris6K [45] for different values of P. In this experiment, we set K = 0.3 and apply low-pass spatial and channel weighting. As can be seen, for all values of P in the table, our method outperforms the baseline (CroW [28]) in all three dimensions. When P is between 50 and 70, the retrieval performance is relatively stable, and P = 60 achieves the highest retrieval performance in all three dimensions. When P lies outside this range, the performance fluctuates slightly.

Table 3
Retrieval performance comparison (in mAP) of different values of parameter P on three dimensions.
CW of CroW [28].

Table 2
Retrieval performance comparison (in mAP) of the original CroW [28] and CroW [28] with the ROI selection of our method, on three dimensions.

Method        Dim   Paris6K   Oxford5K
CroW [28]     512   79.7093   70.8360
ROI + CroW    512   79.7946   70.8406
CroW [28]     256   76.4181   68.3544
ROI + CroW    256   77.2547   68.4011
CroW [28]     128   74.5290   63.2791
ROI + CroW    128   74.7075   64.5579
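For reference, the mAP values reported in Table 2 and the other tables follow the standard definition given in the setup: per-query average precision over the ranked retrieval list, averaged over all queries. A minimal sketch assuming binary relevance labels follows (the official Oxford/Paris evaluation additionally handles "junk" images, which this sketch omits):

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is the ranked list of 0/1
    relevance labels; precision is accumulated at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_queries):
    """mAP: mean of the per-query average precisions."""
    return sum(average_precision(q) for q in all_queries) / len(all_queries)

# Two toy queries with ranked ground-truth labels.
print(mean_average_precision([[1, 0, 1], [0, 1]]))  # -> 2/3 for these two toy queries
```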

Table 4
Retrieval performance comparison (in mAP) of different values of parameter K on three dimensions.

Therefore, in terms of parameter sensitivity, parameter P is relatively insensitive, which means that our method achieves relatively stable and excellent retrieval performance as long as P lies in a suitable range. We therefore empirically set P = 60.

Parameter K: In Section 3.2, parameter K is involved in setting the values of the mask for ROI selection, so it directly affects the generated mask. We also examine the value of K experimentally; the results are shown in Table 4, which compares retrieval performance on Paris6K [45] for different values of K. In this experiment, we set P = 60 and apply low-pass spatial and channel weighting. According to Table 4, when K is between 0.1 and 0.3, our method achieves superior retrieval performance in three dimensions, with the best performance obtained at K = 0.3. However, the retrieval performance begins to decline and fluctuate as K increases further. Like P, parameter K is relatively insensitive, which means that our method achieves relatively stable and excellent retrieval performance when K lies in a suitable range. We therefore empirically set K = 0.3.

Optimization: To be clear, we summarize the major ideas for optimizing our algorithm as follows:

- When setting the specific values of the mask for the ROI in our framework, we simply set them to 0 or 1. We consider introducing smoother values that take the distribution into account (different weights between 0 and 1).
- We consider introducing new filters, such as selective filters. More specifically, we consider exposing external interfaces for the filter parameters, which would make it flexible and convenient to set different permissible ranges for specific retrieval tasks. At present, the same filter is applied by all components of our framework. In future work, we consider introducing several different filters with multiple combinations of parameters to offer a more suitable filter for each component.
- There are many classic and effective algorithms in the field of traditional digital image processing. We will try to find other algorithms that can be effectively ported to feature maps in CBIR through suitable modification.
- Inspired by [50], we consider introducing LSTM and attention mechanisms to enable more complex representations of visual data and information at different scales.

5. Conclusions

In this paper, we present an adaptive deep feature aggregation framework for image retrieval including ROI selection, low-pass spatial weighting and low-pass channel weighting, using the Fourier transform and low-pass filtering for feature maps. In particular, the proposed FFT and LP allow the feature maps to undergo the corresponding pre-processing steps before weighting and aggregation, which makes the processed feature maps more discriminative. Additionally, the modified ROI selection and the low-pass co-weighting over space and channels enhance the discrimination of the image representations, which yields better retrieval accuracy. Experimental results demonstrate the superiority of the proposed deep feature aggregation framework in comparison with other state-of-the-art aggregation methods. Our future work will focus on developing more general and effective aggregation methods for feature representation in large-scale image retrieval tasks.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) [Grant No. 61573273].

References

[1] H. Lu, Y. Li, C. Min, H. Kim, S. Serikawa, Brain intelligence: go beyond artificial intelligence, Mobile Networks Appl. 23 (2017) 368-375.
[2] H. Lu, Y. Li, T. Uemura, H. Kim, S. Serikawa, Low illumination underwater light field images reconstruction using deep convolutional neural networks, Future Gener. Comput. Syst. 82 (2018) 142-148.
[3] S. Serikawa, H. Lu, Underwater image dehazing using joint trilateral filter, Comput. Electr. Eng. 40 (2014) 41-50.
[4] L. Gao, Z. Guo, H. Zhang, X. Xu, H.T. Shen, Video captioning with attention-based LSTM and semantic consistency, IEEE Trans. Multimedia 19 (2017) 2045-2055.
[5] T. Young, D. Hazarika, S. Poria, E. Cambria, Recent trends in deep learning based natural language processing, IEEE Comput. Intell. Magaz. 13 (2018) 55-75.
[6] Y. Li, W. Zhao, J. Pan, Deformable patterned fabric defect detection with Fisher criterion-based deep learning, IEEE Trans. Autom. Sci. Eng. 14 (2017) 1256-1264.
[7] L. He, X. Xu, H. Lu, Y. Yang, F. Shen, H.T. Shen, Unsupervised cross-modal retrieval through adversarial learning, in: 2017 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp. 1153-1158.
[8] X. Xu, L. He, H. Lu, L. Gao, Y. Ji, Deep adversarial metric learning for cross-modal retrieval, World Wide Web 22 (2019) 657-672.
[9] X. Xu, H. Lu, J. Song, Y. Yang, H.T. Shen, X. Li, Ternary adversarial networks with self-supervision for zero-shot cross-modal retrieval, IEEE Trans. Cybernet. (2019).
[10] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2004) 91-110.
[11] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, pp. 1097-1105.
[12] J. Sivic, A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proceedings of the Ninth IEEE International Conference on Computer Vision, IEEE, p. 1470.
[13] H. Jegou, M. Douze, C. Schmid, P. Perez, Aggregating local descriptors into a compact image representation, in: CVPR 2010 - 23rd IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 3304-3311.
[14] Z. Li, X. Zhang, H. Muller, S. Zhang, Large-scale retrieval for medical image analytics: a comprehensive review, Med. Image Anal. 43 (2018) 66-84.
[15] X. Mao, C. Shen, Y.-B. Yang, Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections, in: Advances in Neural Information Processing Systems, pp. 2802-2810.
[16] H. Lu, Y. Li, T. Uemura, Z. Ge, X. Xu, L. He, S. Serikawa, H. Kim, FDCNet: filtering deep convolutional network for marine organism classification, Multimedia Tools Appl. 77 (2018) 21847-21860.
[17] L. Ran, Y. Zhang, W. Wei, Q. Zhang, A hyperspectral image classification framework with spatial pixel pair features, Sensors 17 (2017) 2421.
[18] L. Wang, X. Xu, H. Dong, R. Gui, F. Pu, Multi-pixel simultaneous classification of PolSAR image using convolutional neural networks, Sensors 18 (2018) 769.

[19] H. Lu, B. Li, J. Zhu, Y. Li, Y. Li, X. Xu, L. He, X. Li, J. Li, S. Serikawa, Wound intensity correction and segmentation with convolutional neural networks, Concurr. Comput.: Pract. Experience 29 (2017) e3927.
[20] J. Yang, D. She, M. Sun, M.-M. Cheng, P.L. Rosin, L. Wang, Visual sentiment prediction based on automatic discovery of affective regions, IEEE Trans. Multimedia 20 (2018) 2513-2525.
[21] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, in: Advances in Neural Information Processing Systems, pp. 91-99.
[22] D. Kim, M. Arsalan, K. Park, Convolutional neural network-based shadow detection in images using visible light camera sensor, Sensors 18 (2018) 960.
[23] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).
[24] A. Babenko, A. Slesarev, A. Chigorin, V. Lempitsky, Neural codes for image retrieval, in: European Conference on Computer Vision, Springer, pp. 584-599.
[25] Y. Gong, L. Wang, R. Guo, S. Lazebnik, Multi-scale orderless pooling of deep convolutional activation features, in: European Conference on Computer Vision, Springer, pp. 392-407.
[26] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 806-813.
[27] A. Babenko, V. Lempitsky, Aggregating local deep features for image retrieval, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1269-1277.
[28] Y. Kalantidis, C. Mellina, S. Osindero, Cross-dimensional weighting for aggregated deep convolutional features, in: European Conference on Computer Vision, Springer, pp. 685-701.
[29] A. Sharif Razavian, J. Sullivan, A. Maki, S. Carlsson, A baseline for visual instance retrieval with deep convolutional networks, in: International Conference on Learning Representations, May 7-9, 2015, San Diego, CA, ICLR.
[30] G. Tolias, R. Sicre, H. Jegou, Particular object retrieval with integral max-pooling of CNN activations, Comput. Sci. (2015).
[31] X.S. Wei, J.H. Luo, J. Wu, Z.H. Zhou, Selective convolutional descriptor aggregation for fine-grained image retrieval, IEEE Trans. Image Process. 26 (2017) 2868.
[32] T.-T. Do, T. Hoang, D.-K.L. Tan, H. Le, T.V. Nguyen, N.-M. Cheung, From selective deep convolutional features to compact binary representations for image retrieval, ACM Trans. Multimedia Comput., Commun., Appl. (TOMM) 15 (2019) 43.
[33] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, J. Sivic, NetVLAD: CNN architecture for weakly supervised place recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5297-5307.
[34] F. Radenovic, G. Tolias, O. Chum, CNN image retrieval learns from BoW: unsupervised fine-tuning with hard examples, in: European Conference on Computer Vision, Springer, pp. 3-20.
[35] A. Gordo, J. Almazan, J. Revaud, D. Larlus, Deep image retrieval: learning global representations for image search, in: European Conference on Computer Vision, Springer, pp. 241-257.
[36] A. Zhu, D. He, J. Zhao, W. Luo, W. Chen, 3D wear area reconstruction of grinding wheel by frequency-domain fusion, Int. J. Adv. Manuf. Technol. 88 (2017) 1111-1117.
[37] S. Kumar, R. Saxena, K. Singh, Fractional Fourier transform and fractional-order calculus-based image edge detection, Circ., Syst., Signal Process. 36 (2017) 1493-1513.
[38] H. Feng, G. Chang, X. Guo, Hash algorithm for color image based on super-complex Fourier transform coupled with position permutation, J. Front. Comput. Sci. Technol. (2017).
[39] S. Sun, Y. Cao, Y. Wang, C. Chen, G. Fu, X. Xu, Improved Fourier transform phase measurement method for measuring the quality parameters of an image intensifier, Opt. Eng. 56 (2017) 034113.
[40] D. Zhang, X. Liao, B. Yang, Y. Zhang, A fast and efficient approach to color-image encryption based on compressive sensing and fractional Fourier transform, Multimedia Tools Appl. 77 (2018) 2191-2208.
[41] J. Chen, F. Li, Y. Fu, Q. Liu, J. Huang, K. Li, A study of image segmentation algorithms combined with different image preprocessing methods for thyroid ultrasound images, in: 2017 IEEE International Conference on Imaging Systems and Techniques (IST), IEEE, pp. 1-5.
[42] N. Saxena, K.K. Sharma, Pansharpening scheme using filtering in two-dimensional discrete fractional Fourier transform, IET Image Proc. 12 (2018) 1013-1019.
[43] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 248-255.
[44] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: IEEE Conference on Computer Vision and Pattern Recognition.
[45] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Lost in quantization: improving particular object retrieval in large scale image databases, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, pp. 1-8.
[46] H. Jegou, M. Douze, C. Schmid, Improving bag-of-features for large scale image search, Int. J. Comput. Vision 87 (2010) 316-336.
[47] P. Liu, G. Gou, H. Guo, D. Zhang, H. Zhao, Q. Zhou, Fusing feature distribution entropy with R-MAC features in image retrieval, Entropy 21 (2019) 1037.
[48] X. Zhang, Near-duplicate image retrieval based on multiple features, in: 2018 IEEE Visual Communications and Image Processing (VCIP), IEEE, pp. 1-4.
[49] O. Chum, J. Philbin, J. Sivic, M. Isard, A. Zisserman, Total recall: automatic query expansion with a generative feature model for object retrieval, in: 2007 IEEE 11th International Conference on Computer Vision, IEEE, pp. 1-8.
[50] L. Gao, X. Li, J. Song, H.T. Shen, Hierarchical LSTMs with adaptive attention for visual captioning, IEEE Trans. Pattern Anal. Mach. Intell. (2019).
