Hu et al., Squeeze-and-Excitation Networks, CVPR 2018
Figure 1: A Squeeze-and-Excitation block.
features in a class-agnostic manner, bolstering the quality of the shared lower-level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and it can be used directly with existing state-of-the-art architectures whose modules can be strengthened by direct replacement with their SE counterparts.

Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets and provide an extensive evaluation on the ImageNet 2012 dataset [34]. To demonstrate their general applicability, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or task.

Using SENets, we won first place in the ILSVRC 2017 classification competition. Our top-performing model ensemble achieves a 2.251% top-5 error on the test set (http://image-net.org/challenges/LSVRC/2017/results). This represents a ∼25% relative improvement over the winning entry of the previous year (which had a top-5 error of 2.991%).
2. Related Work
Deep architectures. VGGNets [39] and Inception models [43] demonstrated the benefits of increasing depth. Batch normalization (BN) [16] improved gradient propagation by inserting units to regulate layer inputs, stabilising the learning process. ResNets [10, 11] showed the effectiveness of learning deeper networks through the use of identity-based skip connections. Highway networks [40] employed a gating mechanism to regulate shortcut connections. Reformulations of the connections between network layers [5, 14] have been shown to further improve the learning and representational properties of deep networks.

An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [15, 47]. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of operators [16, 42, 43, 44]. Recently, compositions which have been learned in an automated manner [26, 54, 55] have shown competitive performance. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 20] or jointly by using standard convolutional filters [24] with 1×1 convolutions. Much of this work has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal [17, 18, 22, 29, 32]. The benefits of such a mechanism have been shown across a range of tasks, from localisation and understanding in images [3, 19] to sequence-based models [2, 28]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [12, 41]. Recent work has shown its applicability to tasks such as image captioning [4, 48] and lip reading [7]. In these applications, it is often used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Wang et al. [46] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [31]. This high-capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of basic modules throughout the network.

3. Squeeze-and-Excitation Blocks

The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation F_tr : X → U, X ∈ R^{H′×W′×C′}, U ∈ R^{H×W×C}.
For simplicity, in the notation that follows we take F_tr to be a convolutional operator. Let V = [v_1, v_2, ..., v_C] denote the learned set of filter kernels, where v_c refers to the parameters of the c-th filter. We can then write the outputs of F_tr as U = [u_1, u_2, ..., u_C], where

u_c = v_c ∗ X = Σ_{s=1}^{C′} v_c^s ∗ x^s.   (1)

Here ∗ denotes convolution, v_c = [v_c^1, v_c^2, ..., v_c^{C′}] and X = [x^1, x^2, ..., x^{C′}] (to simplify the notation, bias terms are omitted), while v_c^s is a 2D spatial kernel and therefore represents a single channel of v_c which acts on the corresponding channel of X. Since the output is produced by a summation through all channels, the channel dependencies are implicitly embedded in v_c, but these dependencies are entangled with the spatial correlation captured by the filters. Our goal is to ensure that the network is able to increase its sensitivity to informative features so that they can be exploited by subsequent transformations, and to suppress less useful ones. We propose to achieve this by explicitly modelling channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram of an SE building block is shown in Fig. 1.

3.1. Squeeze: Global Information Embedding

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region. This issue becomes more severe in the lower layers of the network, whose receptive field sizes are small.

To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic z ∈ R^C is generated by shrinking U through its spatial dimensions H × W, where the c-th element of z is calculated by:

z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} u_c(i, j).   (2)

Discussion. The transformation output U can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in feature engineering work [35, 38, 49]. We opt for the simplest, global average pooling, noting that more sophisticated aggregation strategies could be employed here as well.
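To make the squeeze operation concrete, here is a minimal sketch of Eq. (2) in PyTorch (our illustration; the paper does not prescribe a framework): global average pooling collapses each H × W feature map into a single channel statistic.

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """Global-average-pool a feature map U of shape (N, C, H, W)
    into a channel descriptor z of shape (N, C), as in Eq. (2)."""
    return u.mean(dim=(2, 3))

# Example: a batch of 8 feature maps with C = 256 channels.
u = torch.randn(8, 256, 56, 56)
z = squeeze(u)   # shape (8, 256)
```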
3.2. Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship, since we would like to ensure that multiple channels are allowed to be emphasised (as opposed to enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z)),   (3)

where δ refers to the ReLU [30] function, W_1 ∈ R^{(C/r)×C} and W_2 ∈ R^{C×(C/r)}. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with parameters W_1 and reduction ratio r (this parameter choice is discussed in Sec. 6.4), a ReLU, and then a dimensionality-increasing layer with parameters W_2. The final output of the block is obtained by rescaling the transformation output U with the activations:

x̃_c = F_scale(u_c, s_c) = s_c · u_c,   (4)

where X̃ = [x̃_1, x̃_2, ..., x̃_C] and F_scale(u_c, s_c) refers to channel-wise multiplication between the feature map u_c ∈ R^{H×W} and the scalar s_c.

Discussion. The activations act as channel weights adapted to the input-specific descriptor z. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.
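Combining Eqs. (2)-(4), an entire SE block reduces to a few lines. The following PyTorch module is a sketch of ours, not the authors' code; the name SEBlock and the bias-free FC layers are our assumptions, chosen to mirror the notation of Eq. (3).

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: squeeze (Eq. 2), excite (Eq. 3), scale (Eq. 4)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction, bias=False)  # W1
        self.fc2 = nn.Linear(channels // reduction, channels, bias=False)  # W2

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                   # squeeze: (N, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))     # excitation: (N, C)
        return u * s.view(n, c, 1, 1)                            # channel-wise rescaling

# Example usage: recalibrate a batch of feature maps.
u = torch.randn(8, 256, 56, 56)
x_tilde = SEBlock(256)(u)    # same shape as u
```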
3.3. Exemplars: SE-Inception and SE-ResNet

It is straightforward to apply the SE block to AlexNet [21] and VGGNet [39]. The flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by integrating SE blocks into modern architectures with sophisticated designs.

For non-residual networks, such as the Inception network, SE blocks are constructed for the network by taking the transformation F_tr to be an entire Inception module (see Fig. 2). By making this change for each such module in the architecture, we construct an SE-Inception network. Moreover, SE blocks are sufficiently flexible to be used in residual networks. Fig. 3 depicts the schema of an SE-ResNet module. Here, the SE block transformation F_tr is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch. More variants that integrate with ResNeXt [47], Inception-ResNet [42], MobileNet [13] and ShuffleNet [52] can be constructed by following similar schemes. We describe the architectures of SE-ResNet-50 and SE-ResNeXt-50 in Table 1.
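To illustrate the SE-ResNet design in code, the sketch below (again our approximation rather than the released implementation) wraps the SEBlock from the previous sketch around the non-identity branch of a simplified ResNet bottleneck, so that recalibration happens before the summation with the identity shortcut.

```python
import torch
import torch.nn as nn

class SEBottleneck(nn.Module):
    """A ResNet bottleneck with an SE block on its residual branch (cf. Fig. 3).
    Simplified: stride 1, no shortcut downsampling, and it assumes
    in_channels == 4 * mid_channels so the summation shapes match."""

    def __init__(self, in_channels: int, mid_channels: int, reduction: int = 16):
        super().__init__()
        out_channels = 4 * mid_channels
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.se = SEBlock(out_channels, reduction)   # SEBlock from the sketch above
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        u = self.residual(x)      # F_tr: the non-identity branch
        u = self.se(u)            # squeeze-and-excite before the summation
        return self.relu(u + x)   # identity shortcut

# Example usage: one block of the 56x56 stage of an SE-ResNet-50-like network.
block = SEBottleneck(in_channels=256, mid_channels=64)
y = block(torch.randn(2, 256, 56, 56))
```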
Figure 2: The schema of the original Inception module (left) and the SE-Inception module (right). [SE branch: global pooling (1×1×C) → FC (1×1×C/r) → ReLU (1×1×C/r) → FC (1×1×C) → sigmoid (1×1×C) → scale (H×W×C).]

Figure 3: The schema of the original Residual module (left) and the SE-ResNet module (right).
4. Model and Computational Complexity useful in cases where parameter usage is a key considera-
For the proposed SE block to be viable in practice, it must provide an effective trade-off between model complexity and performance, which is important for scalability. We set the reduction ratio r to 16 in all experiments, except where stated otherwise (more discussion can be found in Sec. 6.4). To illustrate the cost of the module, we take the comparison between ResNet-50 and SE-ResNet-50 as an example, where the accuracy of SE-ResNet-50 is superior to ResNet-50 and approaches that of the deeper ResNet-101 network (shown in Table 2). ResNet-50 requires ∼3.86 GFLOPs in a single forward pass for a 224 × 224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small fully connected layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, SE-ResNet-50 requires ∼3.87 GFLOPs, corresponding to a 0.26% relative increase over the original ResNet-50.

In practice, with a training mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 (both timings are performed on a server with 8 NVIDIA Titan X GPUs). We argue that this represents a reasonable overhead, particularly since global pooling and small inner-product operations are less optimised in existing GPU libraries. Moreover, due to its importance for embedded device applications, we also benchmark CPU inference time for each model: for a 224 × 224 pixel input image, ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50. The small additional computational overhead required by the SE block is justified by its contribution to model performance.

Next, we consider the additional parameters introduced by the proposed block. All of them are contained in the two FC layers of the gating mechanism, which constitute a small fraction of the total network capacity. More precisely, the number of additional parameters introduced is given by:

(2/r) Σ_{s=1}^{S} N_s · C_s²,   (5)

where r denotes the reduction ratio, S refers to the number of stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), C_s denotes the dimension of the output channels and N_s denotes the number of repeated blocks for stage s. SE-ResNet-50 introduces ∼2.5 million additional parameters beyond the ∼25 million parameters required by ResNet-50, corresponding to a ∼10% increase. The majority of these parameters come from the last stage of the network, where excitation is performed across the greatest channel dimensions. However, we found that the comparatively expensive final stage of SE blocks could be removed at a marginal cost in performance (<0.1% top-1 error on ImageNet), reducing the relative parameter increase to ∼4%, which may prove useful in cases where parameter usage is a key consideration (see further discussion in Sec. 6.4).
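As a sanity check on Eq. (5), the ∼2.5 million additional parameters quoted for SE-ResNet-50 can be reproduced from the stage configuration of Table 1 (N_s = 3, 4, 6, 3 blocks with output channel dimensions C_s = 256, 512, 1024, 2048) at r = 16; the short calculation below is ours.

```python
# Added parameters from Eq. (5): (2 / r) * sum_s N_s * C_s**2
r = 16
blocks = [3, 4, 6, 3]               # N_s for SE-ResNet-50 (see Table 1)
channels = [256, 512, 1024, 2048]   # C_s: output channels per stage

extra = (2 / r) * sum(n * c ** 2 for n, c in zip(blocks, channels))
print(f"{extra / 1e6:.2f}M additional parameters")   # ~2.51M, i.e. roughly 10% of 25M
```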
Table 1: ResNet-50, SE-ResNet-50 and SE-ResNeXt-50 with a 32×4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets and the number of stacked blocks in a stage is presented outside. The brackets following fc indicate the output dimensions of the two fully connected layers in an SE module.

Output size 112×112 (all): conv, 7×7, 64, stride 2
Output size 56×56 (all): max pool, 3×3, stride 2; then
  ResNet-50:      [conv, 1×1, 64 | conv, 3×3, 64 | conv, 1×1, 256] × 3
  SE-ResNet-50:   [conv, 1×1, 64 | conv, 3×3, 64 | conv, 1×1, 256 | fc, [16, 256]] × 3
  SE-ResNeXt-50:  [conv, 1×1, 128 | conv, 3×3, 128, C = 32 | conv, 1×1, 256 | fc, [16, 256]] × 3
Output size 28×28:
  ResNet-50:      [conv, 1×1, 128 | conv, 3×3, 128 | conv, 1×1, 512] × 4
  SE-ResNet-50:   [conv, 1×1, 128 | conv, 3×3, 128 | conv, 1×1, 512 | fc, [32, 512]] × 4
  SE-ResNeXt-50:  [conv, 1×1, 256 | conv, 3×3, 256, C = 32 | conv, 1×1, 512 | fc, [32, 512]] × 4
Output size 14×14:
  ResNet-50:      [conv, 1×1, 256 | conv, 3×3, 256 | conv, 1×1, 1024] × 6
  SE-ResNet-50:   [conv, 1×1, 256 | conv, 3×3, 256 | conv, 1×1, 1024 | fc, [64, 1024]] × 6
  SE-ResNeXt-50:  [conv, 1×1, 512 | conv, 3×3, 512, C = 32 | conv, 1×1, 1024 | fc, [64, 1024]] × 6
Output size 7×7:
  ResNet-50:      [conv, 1×1, 512 | conv, 3×3, 512 | conv, 1×1, 2048] × 3
  SE-ResNet-50:   [conv, 1×1, 512 | conv, 3×3, 512 | conv, 1×1, 2048 | fc, [128, 2048]] × 3
  SE-ResNeXt-50:  [conv, 1×1, 1024 | conv, 3×3, 1024, C = 32 | conv, 1×1, 2048 | fc, [128, 2048]] × 3
Output size 1×1 (all): global average pool, 1000-d fc, softmax
Table 2: Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results
reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation
column. The SENet column refers to the corresponding architectures in which SE blocks have been added. The numbers in brackets denote
the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted
subset of the validation set (this is discussed in more detail in [42]), which may slightly improve results. VGG-16 and SE-VGG-16 are
trained with batch normalization.
5. Implementation

Each plain network and its corresponding SE counterpart are trained with identical optimisation schemes. During training on ImageNet, we follow standard practice and perform data augmentation with random-size cropping [43] to 224 × 224 pixels (299 × 299 for Inception-ResNet-v2 [42] and SE-Inception-ResNet-v2) and random horizontal flipping. Input images are normalised through mean channel subtraction. In addition, we adopt the data balancing strategy described in [36] for mini-batch sampling. The networks are trained on our distributed learning system "ROCS", which is designed to handle efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a mini-batch size of 1024. The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [9].

When testing, we apply a centre crop evaluation on the validation set, where 224 × 224 pixels are cropped from each image whose shorter edge is first resized to 256 (299 × 299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).
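For reference, a rough PyTorch equivalent of the optimisation schedule above might look as follows. This is our sketch (the authors trained on their own distributed system), and the model line is only a placeholder; the training step itself is elided.

```python
import torch

model = torch.nn.Linear(10, 10)   # placeholder; in practice a full SE-ResNet-50

# Synchronous SGD with momentum 0.9, initial learning rate 0.6,
# decayed by a factor of 10 every 30 epochs, 100 epochs total.
optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one pass over the training data: forward, loss, backward, optimizer.step()
    scheduler.step()
```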
6. Experiments

6.1. ImageNet Classification

The ImageNet 2012 dataset is comprised of 1.28 million training images and 50K validation images from 1000 classes. We train networks on the training set and report the top-1 and top-5 errors.

Network depth. We first compare SE-ResNet against ResNet architectures of different depths. The results in Table 2 show that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity.

Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%. Fig. 4 depicts the training and validation curves of SE-ResNet-50 and ResNet-50.
Table 3: Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. Here, MobileNet refers to "1.0 MobileNet-224" in [13] and ShuffleNet refers to "ShuffleNet 1× (g = 3)" in [52]. The numbers in brackets denote the improvement over the re-implemented baselines.

MobileNet [13]
  original:          top-1 err. 29.4, top-5 err. -
  re-implementation: top-1 err. 29.1, top-5 err. 10.1, 569 MFLOPs, 4.2M parameters
  SENet:             top-1 err. 25.3 (3.8), top-5 err. 7.9 (2.2), 572 MFLOPs, 4.7M parameters

ShuffleNet [52]
  original:          top-1 err. 34.1, top-5 err. -
  re-implementation: top-1 err. 33.9, top-5 err. 13.6, 140 MFLOPs, 1.8M parameters
  SENet:             top-1 err. 31.7 (2.2), top-5 err. 11.7 (1.9), 142 MFLOPs, 2.4M parameters
Table 5: Single-crop error rates (%) on the Places365 validation set.
  Places-365-CNN [37]:  top-1 err. 41.07, top-5 err. 11.48
  ResNet-152 (ours):    top-1 err. 41.15, top-5 err. 11.61
  SE-ResNet-152:        top-1 err. 40.37, top-5 err. 11.01

Table 6: Object detection results on the COCO 40k validation set using the basic Faster R-CNN.
  ResNet-50:      AP@IoU=0.5 45.2, AP 25.1
  SE-ResNet-50:   AP@IoU=0.5 46.8, AP 26.4
  ResNet-101:     AP@IoU=0.5 48.4, AP 27.2
  SE-ResNet-101:  AP@IoU=0.5 49.2, AP 27.9
Results on ILSVRC 2017 Classification Competition. SENets formed the foundation of our submission to the competition, where we won first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a 2.251% top-5 error on the test set. One of our high-performing networks, which we term SENet-154, was constructed by integrating SE blocks with a modified ResNeXt [47] (details are provided in the supplemental material), with the goal of reaching the best possible accuracy with less emphasis on model complexity. We compare it with the top-performing published models on the ImageNet validation set in Table 4. Our model achieved a top-1 error of 18.68% and a top-5 error of 4.47% using a 224 × 224 centre crop evaluation. To enable a fair comparison, we also provide a 320 × 320 centre crop evaluation, showing a significant performance improvement over prior work. After the competition, we train an SENet-154 with a larger input size of 320 × 320, achieving lower error rates under both the top-1 (16.88%) and the top-5 (3.58%) error metrics.

6.2. Scene Classification

We conduct experiments on the Places365-Challenge dataset [53] for scene classification. It comprises 8 million training images and 36,500 validation images across 365 categories. Relative to classification, the task of scene understanding can provide a better assessment of the ability of a model to generalise well and handle abstraction, since it requires the capture of more complex data associations and robustness to a greater level of appearance variation.

We use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the training and evaluation protocols in [37]. Table 5 shows the results of ResNet-152 and SE-ResNet-152. Specifically, SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets. This SENet also surpasses the previous state-of-the-art model Places-365-CNN [37], which has a top-5 error of 11.48% on this task.

6.3. Object Detection on COCO

We further evaluate the generalisation of SE blocks on the object detection task using the COCO dataset [25], which contains 80k training images and 40k validation images, following [10]. We use Faster R-CNN [33] as the detection method and follow the basic implementation in [10]. Here our intention is to evaluate the benefit of replacing the base architecture ResNet with SE-ResNet, so that the improvements can be attributed to better representations. Table 6 shows the results of using ResNet-50, ResNet-101 and their SE counterparts on the validation set. SE-ResNet-50 outperforms ResNet-50 by 1.3% (a relative 5.2% improvement) on COCO's standard AP metric and by 1.6% on AP@IoU=0.5. Importantly, SE blocks are also capable of benefiting the deeper architecture ResNet-101 by 0.7% (a relative 2.6% improvement) on the AP metric.

6.4. Analysis and Interpretation

Reduction ratio. The reduction ratio r introduced in Eqn. (5) is an important hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the model. To investigate this relationship, we conduct experiments based on SE-ResNet-50 for a range of different r values. The comparison in Table 7 reveals that performance does not improve monotonically with increased capacity. This is likely to be a result of enabling the SE block to overfit the channel interdependencies of the training set. In particular, we found that setting r = 16 achieved a good trade-off between accuracy and complexity and, consequently, we used this value for all experiments.

Table 7: Single-crop error rates (%) on the ImageNet validation set and parameter sizes for SE-ResNet-50 at different reduction ratios r. Here, original refers to ResNet-50.
  Ratio r = 4:   top-1 err. 23.21, top-5 err. 6.63, 35.7M parameters
  Ratio r = 8:   top-1 err. 23.19, top-5 err. 6.64, 30.7M parameters
  Ratio r = 16:  top-1 err. 23.29, top-5 err. 6.62, 28.1M parameters
  Ratio r = 32:  top-1 err. 23.40, top-5 err. 6.77, 26.9M parameters
  original:      top-1 err. 24.80, top-5 err. 7.48, 25.6M parameters

The role of Excitation. While SE blocks have been empirically shown to improve network performance, we would also like to understand how the self-gating excitation mechanism operates in practice. To provide a clearer picture of the behaviour of SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes at different blocks.
(a) SE_2_3   (b) SE_3_4   (c) SE_4_6

Figure 5: Activations induced by Excitation in the different modules of SE-ResNet-50 on ImageNet. Each module is named "SE_stageID_blockID".

Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in the supplemental material). We then draw fifty samples for each class from the validation set, compute the average activations for fifty uniformly sampled channels in the last SE block of each stage (immediately prior to downsampling) and plot their distribution in Fig. 5. For reference, we also plot the distribution of average activations across all 1000 classes.

This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Sec. 4, which demonstrated that the overall parameter count could be significantly reduced by removing the SE blocks of the last stage with only a marginal loss of performance.
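The probing procedure described above can be approximated with a forward hook that records the excitation vector s of a chosen SE block; the sketch below is our own illustration, reusing the SEBlock sketch from Sec. 3.2, with random tensors standing in for the feature maps of the sampled validation images.

```python
import torch

# Record the excitation vector s (Eq. 3) of an SE block during inference,
# so its per-channel averages can be plotted as in Fig. 5.
recorded = []

def save_excitation(module, inputs, output):
    # fc2 outputs the pre-sigmoid logits of Eq. (3); apply the sigmoid to recover s.
    recorded.append(torch.sigmoid(output).detach())

block = SEBlock(256)   # stand-in for one SE block of a trained SE-ResNet-50
handle = block.fc2.register_forward_hook(save_excitation)

with torch.no_grad():
    # Stand-in for the feature maps of fifty validation images of one class.
    features = torch.randn(50, 256, 14, 14)
    block(features)

handle.remove()
avg_activation = torch.cat(recorded).mean(dim=0)   # mean excitation per channel, shape (256,)
```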
References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
[2] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In NIPS, 2016.
[3] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.
[4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, 2017.
[5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS, 2017.
[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[7] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In CVPR, 2017.
[8] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In CVPR, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.
[15] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[17] L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2001.
[18] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 1998.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[20] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, 2010.
[23] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[26] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv:1711.00436, 2017.
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[28] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv:1706.06905, 2017.
[29] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
[30] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
[31] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[32] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 1993.
[33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[35] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. RR-8209, INRIA, 2013.
[36] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.
[37] L. Shen, Z. Lin, G. Sun, and J. Hu. Places401 and Places365 models. https://github.com/lishen-shirley/Places2-CNNs, 2016.
[38] L. Shen, G. Sun, Q. Huang, S. Wang, Z. Lin, and E. Wu. Multi-level discriminative dictionary learning with application to large scale image classification. IEEE TIP, 2015.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[40] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.
[41] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In CVPR, 2016.
[45] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[46] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.
[47] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[48] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[49] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[50] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[51] X. Zhang, Z. Li, C. C. Loy, and D. Lin. PolyNet: A pursuit of structural diversity in very deep networks. In CVPR, 2017.
[52] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv:1707.01083, 2017.
[53] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.
[54] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[55] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv:1707.07012, 2017.