Squeeze-and-Excitation Networks

Jie Hu¹*   Li Shen²*   Gang Sun¹
hujie@momenta.ai   lishen@robots.ox.ac.uk   sungang@momenta.ai
¹Momenta   ²Department of Engineering Science, University of Oxford
* Equal contribution.

Abstract

Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, several recent approaches have shown the benefit of enhancing spatial encoding. In this work, we focus on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a ∼25% relative improvement over the winning entry of 2016. Code and models are available at https://github.com/hujie-frank/SENet.

1. Introduction

Convolutional neural networks (CNNs) have proven to be effective models for tackling a variety of visual tasks [21, 27, 33, 45]. For each convolutional layer, a set of filters are learned to express local spatial connectivity patterns along input channels. In other words, convolutional filters are expected to be informative combinations by fusing spatial and channel-wise information together within local receptive fields. By stacking a series of convolutional layers interleaved with non-linearities and downsampling, CNNs are capable of capturing hierarchical patterns with global receptive fields as powerful image descriptions. Recent work has demonstrated that the performance of networks can be improved by explicitly embedding learning mechanisms that help capture spatial correlations without requiring additional supervision. One such approach was popularised by the Inception architectures [16, 43], which showed that the network can achieve competitive accuracy by embedding multi-scale processes in its modules. More recent work has sought to better model spatial dependence [1, 31] and incorporate spatial attention [19].

In this paper, we investigate a different aspect of architectural design - the channel relationship, by introducing a new architectural unit, which we term the "Squeeze-and-Excitation" (SE) block. Our goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

The basic structure of the SE building block is illustrated in Fig. 1. For any given transformation F_tr : X → U, X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C) (e.g. a convolution or a set of convolutions), we can construct a corresponding SE block to perform feature recalibration as follows. The features U are first passed through a squeeze operation, which aggregates the feature maps across spatial dimensions H × W to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an excitation operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps U are then reweighted to generate the output of the SE block which can then be fed directly into subsequent layers.

Figure 1: A Squeeze-and-Excitation block.

An SE network can be generated by simply stacking a collection of SE building blocks. SE blocks can also be used as a drop-in replacement for the original block at any depth in the architecture. However, while the template for the building block is generic, as we show in Sec. 6.4, the role it performs at different depths adapts to the needs of the network. In the early layers, it learns to excite informative
features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and can be used directly with existing state-of-the-art architectures whose modules can be strengthened by direct replacement with their SE counterparts.

Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets and provide an extensive evaluation on the ImageNet 2012 dataset [34]. To demonstrate their general applicability, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or task.

Using SENets, we won first place in the ILSVRC 2017 classification competition. Our top performing model ensemble achieves a 2.251% top-5 error on the test set (http://image-net.org/challenges/LSVRC/2017/results). This represents a ∼25% relative improvement in comparison to the winning entry of the previous year (with a top-5 error of 2.991%).

2. Related Work

Deep architectures. VGGNets [39] and Inception models [43] demonstrated the benefits of increasing depth. Batch normalization (BN) [16] improved gradient propagation by inserting units to regulate layer inputs, stabilising the learning process. ResNets [10, 11] showed the effectiveness of learning deeper networks through the use of identity-based skip connections. Highway networks [40] employed a gating mechanism to regulate shortcut connections. Reformulations of the connections between network layers [5, 14] have been shown to further improve the learning and representational properties of deep networks.

An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [15, 47]. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of operators [16, 42, 43, 44]. Recently, compositions which have been learned in an automated manner [26, 54, 55] have shown competitive performance. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 20] or jointly by using standard convolutional filters [24] with 1×1 convolutions. Much of this work has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal [17, 18, 22, 29, 32]. The benefits of such a mechanism have been shown across a range of tasks, from localisation and understanding in images [3, 19] to sequence-based models [2, 28]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [12, 41]. Recent work has shown its applicability to tasks such as image captioning [4, 48] and lip reading [7]. In these applications, it is often used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Wang et al. [46] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [31]. This high capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of basic modules throughout the network.

3. Squeeze-and-Excitation Blocks

The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation F_tr : X → U, X ∈ R^(H′×W′×C′), U ∈ R^(H×W×C).
For simplicity, in the notation that follows we take F_tr to be a convolutional operator. Let V = [v_1, v_2, ..., v_C] denote the learned set of filter kernels, where v_c refers to the parameters of the c-th filter. We can then write the outputs of F_tr as U = [u_1, u_2, ..., u_C], where

    u_c = v_c ∗ X = \sum_{s=1}^{C′} v_c^s ∗ x^s.    (1)

Here ∗ denotes convolution, v_c = [v_c^1, v_c^2, ..., v_c^{C′}] and X = [x^1, x^2, ..., x^{C′}] (to simplify the notation, bias terms are omitted), while v_c^s is a 2D spatial kernel and therefore represents a single channel of v_c which acts on the corresponding channel of X. Since the output is produced by a summation through all channels, the channel dependencies are implicitly embedded in v_c, but these dependencies are entangled with the spatial correlation captured by the filters. Our goal is to ensure that the network is able to increase its sensitivity to informative features so that they can be exploited by subsequent transformations, and to suppress less useful ones. We propose to achieve this by explicitly modelling channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram of an SE building block is shown in Fig. 1.

3.1. Squeeze: Global Information Embedding

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region. This is an issue that becomes more severe in the lower layers of the network whose receptive field sizes are small.

To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic z ∈ R^C is generated by shrinking U through spatial dimensions H × W, where the c-th element of z is calculated by:

    z_c = F_sq(u_c) = \frac{1}{H × W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).    (2)

Discussion. The transformation output U can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in feature engineering work [35, 38, 49]. We opt for the simplest, global average pooling, noting that more sophisticated aggregation strategies could be employed here as well.
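As an illustration of Eqn. (2), a minimal sketch of the squeeze step is shown below. It assumes feature maps stored in the common N × C × H × W tensor layout; the function name and PyTorch framing are ours, not part of the paper.

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """Global information embedding (Eqn. (2)): average each feature map
    u_c over its H x W spatial positions to produce one scalar per channel."""
    return u.mean(dim=(2, 3))   # (N, C, H, W) -> (N, C)

# Example: a batch of 8 feature maps with 256 channels at 56 x 56 resolution.
u = torch.randn(8, 256, 56, 56)
z = squeeze(u)
assert z.shape == (8, 256)
```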
3.2. Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship, since we would like to ensure that multiple channels are allowed to be emphasised, as opposed to a one-hot activation. To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

    s = F_ex(z, W) = σ(g(z, W)) = σ(W_2 δ(W_1 z)),    (3)

where δ refers to the ReLU [30] function, W_1 ∈ R^(C/r × C) and W_2 ∈ R^(C × C/r). To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with parameters W_1 with reduction ratio r (this parameter choice is discussed in Sec. 6.4), a ReLU and then a dimensionality-increasing layer with parameters W_2. The final output of the block is obtained by rescaling the transformation output U with the activations:

    x̃_c = F_scale(u_c, s_c) = s_c · u_c,    (4)

where X̃ = [x̃_1, x̃_2, ..., x̃_C] and F_scale(u_c, s_c) refers to channel-wise multiplication between the feature map u_c ∈ R^(H×W) and the scalar s_c.

Discussion. The activations act as channel weights adapted to the input-specific descriptor z. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.
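Putting Eqns. (2)-(4) together, one possible PyTorch-style sketch of an SE block is given below. The paper does not prescribe an implementation, so the module name and layout here are our own; note also that nn.Linear carries bias terms, which the paper's notation omits.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation block: squeeze (Eqn. (2)), excitation (Eqn. (3))
    and channel-wise rescaling (Eqn. (4))."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Bottleneck of two FC layers around a ReLU: W1 maps C -> C/r,
        # W2 maps C/r -> C, followed by a sigmoid gate.
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))             # squeeze: (N, C)
        s = self.fc(z)                     # excitation: (N, C), values in (0, 1)
        return u * s.view(n, c, 1, 1)      # scale: reweight each channel of U

# Example: recalibrate a batch of 256-channel feature maps.
se = SELayer(channels=256, reduction=16)
x_tilde = se(torch.randn(8, 256, 56, 56))  # output has the same shape as the input
```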
3.3. Exemplars: SE-Inception and SE-ResNet

It is straightforward to apply the SE block to AlexNet [21] and VGGNet [39]. The flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by integrating SE blocks into modern architectures with sophisticated designs.

For non-residual networks, such as the Inception network, SE blocks are constructed for the network by taking the transformation F_tr to be an entire Inception module (see Fig. 2). By making this change for each such module in the architecture, we construct an SE-Inception network. Moreover, SE blocks are sufficiently flexible to be used in residual networks. Fig. 3 depicts the schema of an SE-ResNet module. Here, the SE block transformation F_tr is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch. More variants that integrate with ResNeXt [47], Inception-ResNet [42], MobileNet [13] and ShuffleNet [52] can be constructed by following similar schemes. We describe the architectures of SE-ResNet-50 and SE-ResNeXt-50 in Table 1.
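To make the integration pattern of Fig. 3 concrete, the sketch below places the SELayer from the previous snippet on the non-identity branch of a simplified residual block, so that squeeze and excitation act before the summation with the identity branch. The two-convolution branch used here is illustrative and is not the exact ResNet-50 bottleneck design.

```python
import torch
import torch.nn as nn

class SEResidualBlock(nn.Module):
    """A simplified residual block with an SE unit on the non-identity branch."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Non-identity (residual) branch: two 3x3 convolutions with batch norm.
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SELayer(channels, reduction)  # SELayer from the sketch above
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.se(self.branch(x))      # recalibrate before the summation
        return self.relu(x + residual)          # add the identity branch
```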
Figure 2: The schema of the original Inception module (left) and the SE-Inception module (right).

Figure 3: The schema of the original Residual module (left) and the SE-ResNet module (right).

4. Model and Computational Complexity

For the proposed SE block to be viable in practice, it must provide an effective trade-off between model complexity and performance which is important for scalability. We set the reduction ratio r to be 16 in all experiments, except where stated otherwise (more discussion can be found in Sec. 6.4). To illustrate the cost of the module, we take the comparison between ResNet-50 and SE-ResNet-50 as an example, where the accuracy of SE-ResNet-50 is superior to ResNet-50 and approaches that of a deeper ResNet-101 network (shown in Table 2). ResNet-50 requires ∼3.86 GFLOPs in a single forward pass for a 224 × 224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small fully connected layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, SE-ResNet-50 requires ∼3.87 GFLOPs, corresponding to a 0.26% relative increase over the original ResNet-50.

In practice, with a training mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 (both timings are performed on a server with 8 NVIDIA Titan X GPUs). We argue that this represents a reasonable overhead, particularly since global pooling and small inner-product operations are less optimised in existing GPU libraries. Moreover, due to its importance for embedded device applications, we also benchmark CPU inference time for each model: for a 224 × 224 pixel input image, ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50. The small additional computational overhead required by the SE block is justified by its contribution to model performance.

Next, we consider the additional parameters introduced by the proposed block. All of them are contained in the two FC layers of the gating mechanism, which constitute a small fraction of the total network capacity. More precisely, the number of additional parameters introduced is given by:

    \frac{2}{r} \sum_{s=1}^{S} N_s · C_s^2,    (5)

where r denotes the reduction ratio, S refers to the number of stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), C_s denotes the dimension of the output channels and N_s denotes the repeated block number for stage s. SE-ResNet-50 introduces ∼2.5 million additional parameters beyond the ∼25 million parameters required by ResNet-50, corresponding to a ∼10% increase. The majority of these parameters come from the last stage of the network, where excitation is performed across the greatest channel dimensions. However, we found that the comparatively expensive final stage of SE blocks could be removed at a marginal cost in performance (<0.1% top-1 error on ImageNet) to reduce the relative parameter increase to ∼4%, which may prove useful in cases where parameter usage is a key consideration (see further discussion in Sec. 6.4).
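As a sanity check on Eqn. (5), the short calculation below plugs in the per-stage block counts and output channel widths of ResNet-50 from Table 1 (N_s = 3, 4, 6, 3 and C_s = 256, 512, 1024, 2048) with r = 16. It reproduces the ∼2.5 million additional parameters quoted above and shows that most of them sit in the final stage, consistent with the ∼4% figure when that stage is dropped.

```python
# Extra parameters from SE blocks, Eqn. (5): (2 / r) * sum_s N_s * C_s^2
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]  # (N_s, C_s) per ResNet-50 stage

per_stage = [2 / r * n * c * c for n, c in stages]
total = sum(per_stage)

print([round(p / 1e6, 2) for p in per_stage])   # [0.02, 0.13, 0.79, 1.57] (millions)
print(round(total / 1e6, 2))                    # 2.51 -> ~2.5 million extra parameters
print(round((total - per_stage[-1]) / 1e6, 2))  # 0.94 -> roughly 4% of ResNet-50's ~25 million
```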
5. Implementation

Each plain network and its corresponding SE counterpart are trained with identical optimisation schemes. During training on ImageNet, we follow standard practice and perform data augmentation with random-size cropping [43] to 224 × 224 pixels (299 × 299 for Inception-ResNet-v2 [42] and SE-Inception-ResNet-v2) and random horizontal flipping. Input images are normalised through mean channel subtraction. In addition, we adopt the data balancing strategy described in [36] for mini-batch sampling. The networks are trained on our distributed learning system "ROCS" which is designed to handle efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a mini-batch size of 1024. The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [9].

When testing, we apply a centre crop evaluation on the validation set, where 224 × 224 pixels are cropped from each image whose shorter edge is first resized to 256 (299 × 299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).
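For reference, the optimisation settings above translate roughly into the following PyTorch configuration. This is a sketch only: the paper's training ran on the distributed "ROCS" system, and details that are not stated in this section (e.g. weight decay) are deliberately left out rather than guessed.

```python
import torch

# Synchronous SGD, momentum 0.9, mini-batch 1024, initial LR 0.6,
# divided by 10 every 30 epochs, trained for 100 epochs (Sec. 5).
model = torch.nn.Linear(10, 10)  # placeholder; stands in for an SENet

optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... one epoch of training with mini-batches of 1024 images ...
    scheduler.step()
```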
Output size 112 × 112 (all three networks): conv, 7 × 7, 64, stride 2

Output size 56 × 56 (all three networks): max pool, 3 × 3, stride 2
  ResNet-50:      [conv, 1 × 1, 64;  conv, 3 × 3, 64;          conv, 1 × 1, 256] × 3
  SE-ResNet-50:   [conv, 1 × 1, 64;  conv, 3 × 3, 64;          conv, 1 × 1, 256;  fc, [16, 256]] × 3
  SE-ResNeXt-50:  [conv, 1 × 1, 128; conv, 3 × 3, 128, C = 32; conv, 1 × 1, 256;  fc, [16, 256]] × 3

Output size 28 × 28:
  ResNet-50:      [conv, 1 × 1, 128; conv, 3 × 3, 128;         conv, 1 × 1, 512] × 4
  SE-ResNet-50:   [conv, 1 × 1, 128; conv, 3 × 3, 128;         conv, 1 × 1, 512;  fc, [32, 512]] × 4
  SE-ResNeXt-50:  [conv, 1 × 1, 256; conv, 3 × 3, 256, C = 32; conv, 1 × 1, 512;  fc, [32, 512]] × 4

Output size 14 × 14:
  ResNet-50:      [conv, 1 × 1, 256; conv, 3 × 3, 256;         conv, 1 × 1, 1024] × 6
  SE-ResNet-50:   [conv, 1 × 1, 256; conv, 3 × 3, 256;         conv, 1 × 1, 1024; fc, [64, 1024]] × 6
  SE-ResNeXt-50:  [conv, 1 × 1, 512; conv, 3 × 3, 512, C = 32; conv, 1 × 1, 1024; fc, [64, 1024]] × 6

Output size 7 × 7:
  ResNet-50:      [conv, 1 × 1, 512;  conv, 3 × 3, 512;           conv, 1 × 1, 2048] × 3
  SE-ResNet-50:   [conv, 1 × 1, 512;  conv, 3 × 3, 512;           conv, 1 × 1, 2048; fc, [128, 2048]] × 3
  SE-ResNeXt-50:  [conv, 1 × 1, 1024; conv, 3 × 3, 1024, C = 32;  conv, 1 × 1, 2048; fc, [128, 2048]] × 3

Output size 1 × 1 (all three networks): global average pool, 1000-d fc, softmax

Table 1: (Left) ResNet-50. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a 32 × 4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets and the number of stacked blocks in a stage is presented outside. The inner brackets following fc indicate the output dimensions of the two fully connected layers in an SE module.

                              original                re-implementation                       SENet
                          top-1 err.  top-5 err.   top-1 err.  top-5 err.  GFLOPs   top-1 err.   top-5 err.   GFLOPs
ResNet-50 [10]            24.7        7.8          24.80       7.48        3.86     23.29(1.51)  6.62(0.86)   3.87
ResNet-101 [10]           23.6        7.1          23.17       6.52        7.58     22.38(0.79)  6.07(0.45)   7.60
ResNet-152 [10]           23.0        6.7          22.42       6.34        11.30    21.57(0.85)  5.73(0.61)   11.32
ResNeXt-50 [47]           22.2        -            22.11       5.90        4.24     21.10(1.01)  5.49(0.41)   4.25
ResNeXt-101 [47]          21.2        5.6          21.18       5.57        7.99     20.70(0.48)  5.01(0.56)   8.00
VGG-16 [39]               -           -            27.02       8.81        15.47    25.22(1.80)  7.70(1.11)   15.48
BN-Inception [16]         25.2        7.82         25.38       7.89        2.03     24.23(1.15)  7.14(0.75)   2.04
Inception-ResNet-v2 [42]  19.9†       4.9†         20.37       5.21        11.75    19.80(0.57)  4.79(0.42)   11.76

Table 2: Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results
reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation
column. The SENet column refers to the corresponding architectures in which SE blocks have been added. The numbers in brackets denote
the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted
subset of the validation set (this is discussed in more detail in [42]), which may slightly improve results. VGG-16 and SE-VGG-16 are
trained with batch normalization.

6. Experiments

6.1. ImageNet Classification

The ImageNet 2012 dataset is comprised of 1.28 million training images and 50K validation images from 1000 classes. We train networks on the training set and report the top-1 and the top-5 errors.

Network depth. We first compare SE-ResNet against ResNet architectures with different depths. The results in Table 2 show that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity.

Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%.
                    original               re-implementation                                    SENet
                top-1 err.  top-5 err.  top-1 err.  top-5 err.  MFLOPs  Million Parameters  top-1 err.  top-5 err.  MFLOPs  Million Parameters
MobileNet [13]  29.4        -           29.1        10.1        569     4.2                 25.3(3.8)   7.9(2.2)    572     4.7
ShuffleNet [52] 34.1        -           33.9        13.6        140     1.8                 31.7(2.2)   11.7(1.9)   142     2.4

Table 3: Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. Here, MobileNet refers to "1.0 MobileNet-224" in [13] and ShuffleNet refers to "ShuffleNet 1 × (g = 3)" in [52].

Figure 4: Training curves of ResNet-50 and SE-ResNet-50 on ImageNet.

Fig. 4 depicts the training and validation curves of SE-ResNet-50 and ResNet-50 (the curves of more networks are shown in supplementary material). While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with increasing the depth of the base architecture.

Integration with modern architectures. We next investigate the effect of combining SE blocks with another two state-of-the-art architectures, Inception-ResNet-v2 [42] and ResNeXt (using the setting of 32 × 4d) [47], which both introduce prior structures in modules.

We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 is given in Table 1). The results in Table 2 illustrate the significant performance improvement induced by SE blocks when introduced into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of 5.49%, which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost double the number of parameters and computational overhead. As for the experiments on Inception-ResNet-v2, we conjecture that the difference in cropping strategy might lead to the gap between their reported result and our re-implemented one, as their original image size has not been clarified in [42], while we crop the 299 × 299 region from a relatively larger image (where the shorter edge is resized to 352). SE-Inception-ResNet-v2 (4.79% top-5 error) outperforms our reimplemented Inception-ResNet-v2 (5.21% top-5 error) by 0.42% (a relative improvement of 8.1%) as well as the reported result in [42].

We also assess the effect of SE blocks when operating on non-residual networks by conducting experiments with the VGG-16 [39] and BN-Inception [16] architectures. As a deep network is tricky to optimise [16, 39], to facilitate the training of VGG-16 from scratch, we add a Batch Normalization layer after each convolution. We apply the identical scheme for training SE-VGG-16. The results of the comparison are shown in Table 2, exhibiting the same phenomena that emerged in the residual architectures.

Finally, we evaluate two representative efficient architectures, MobileNet [13] and ShuffleNet [52], in Table 3, showing that SE blocks can consistently improve the accuracy by a large margin at minimal increases in computational cost. These experiments demonstrate that improvements induced by SE blocks can be used in combination with a wide range of architectures. Moreover, this result holds for both residual and non-residual foundations.

                              224 × 224               320 × 320 / 299 × 299
                            top-1 err.  top-5 err.   top-1 err.  top-5 err.
ResNet-152 [10]             23.0        6.7          21.3        5.5
ResNet-200 [11]             21.7        5.8          20.1        4.8
Inception-v3 [44]           -           -            21.2        5.6
Inception-v4 [42]           -           -            20.0        5.0
Inception-ResNet-v2 [42]    -           -            19.9        4.9
ResNeXt-101 (64 × 4d) [47]  20.4        5.3          19.1        4.4
DenseNet-264 [14]           22.15       6.12         -           -
Attention-92 [46]           -           -            19.5        4.8
Very Deep PolyNet [51] †    -           -            18.71       4.25
PyramidNet-200 [8]          20.1        5.4          19.2        4.7
DPN-131 [5]                 19.93       5.12         18.55       4.16
SENet-154                   18.68       4.47         17.28       3.79
NASNet-A (6@4032) [55] †    -           -            17.3‡       3.8‡
SENet-154 (post-challenge)  -           -            16.88‡      3.58‡

Table 4: Single-crop error rates of state-of-the-art CNNs on the ImageNet validation set. The sizes of the test crops are 224 × 224 and 320 × 320 / 299 × 299 as in [11]. † denotes a model with a larger crop of 331 × 331. ‡ denotes a post-challenge result. SENet-154 (post-challenge) is trained with a larger input size of 320 × 320, compared to the original one with an input size of 224 × 224.
                     top-1 err.  top-5 err.
Places-365-CNN [37]  41.07       11.48
ResNet-152 (ours)    41.15       11.61
SE-ResNet-152        40.37       11.01

Table 5: Single-crop error rates (%) on the Places365 validation set.

               AP@IoU=0.5   AP
ResNet-50      45.2         25.1
SE-ResNet-50   46.8         26.4
ResNet-101     48.4         27.2
SE-ResNet-101  49.2         27.9

Table 6: Object detection results on the COCO 40k validation set using the basic Faster R-CNN.

Results on ILSVRC 2017 Classification Competition. SENets formed the foundation of our submission to the competition where we won first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a 2.251% top-5 error on the test set. One of our high-performing networks, which we term SENet-154, was constructed by integrating SE blocks with a modified ResNeXt [47] (details are provided in supplemental material), the goal of which is to reach the best possible accuracy with less emphasis on model complexity. We compare it with the top-performing published models on the ImageNet validation set in Table 4. Our model achieved a top-1 error of 18.68% and a top-5 error of 4.47% using a 224 × 224 centre crop evaluation. To enable a fair comparison, we provide a 320 × 320 centre crop evaluation, showing a significant performance improvement over prior work. After the competition, we train an SENet-154 with a larger input size of 320 × 320, achieving lower error rates under both the top-1 (16.88%) and the top-5 (3.58%) error metrics.

6.2. Scene Classification

We conduct experiments on the Places365-Challenge dataset [53] for scene classification. It comprises 8 million training images and 36,500 validation images across 365 categories. Relative to classification, the task of scene understanding can provide a better assessment of the ability of a model to generalise well and handle abstraction, since it requires the capture of more complex data associations and robustness to a greater level of appearance variation.

We use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the training and evaluation protocols in [37]. Table 5 shows the results of ResNet-152 and SE-ResNet-152. Specifically, SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets. This SENet surpasses the previous state-of-the-art model Places-365-CNN [37], which has a top-5 error of 11.48% on this task.

6.3. Object Detection on COCO

We further evaluate the generalisation of SE blocks on the object detection task using the COCO dataset [25], which contains 80k training images and 40k validation images, following [10]. We use Faster R-CNN [33] as the detection method and follow the basic implementation in [10]. Here our intention is to evaluate the benefit of replacing the base architecture ResNet with SE-ResNet, so that the improvements can be attributed to better representations. Table 6 shows the results obtained using ResNet-50, ResNet-101 and their SE counterparts on the validation set. SE-ResNet-50 outperforms ResNet-50 by 1.3% (a relative 5.2% improvement) on COCO's standard metric AP and by 1.6% on AP@IoU=0.5. Importantly, SE blocks are capable of benefiting the deeper architecture ResNet-101 by 0.7% (a relative 2.6% improvement) on the AP metric.

6.4. Analysis and Interpretation

Reduction ratio. The reduction ratio r introduced in Eqn. (5) is an important hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the model. To investigate this relationship, we conduct experiments based on SE-ResNet-50 for a range of different r values. The comparison in Table 7 reveals that performance does not improve monotonically with increased capacity. This is likely to be a result of enabling the SE block to overfit the channel interdependencies of the training set. In particular, we found that setting r = 16 achieved a good tradeoff between accuracy and complexity and consequently, we used this value for all experiments.

Ratio r    top-1 err.  top-5 err.  Million Parameters
4          23.21       6.63        35.7
8          23.19       6.64        30.7
16         23.29       6.62        28.1
32         23.40       6.77        26.9
original   24.80       7.48        25.6

Table 7: Single-crop error rates (%) on the ImageNet validation set and parameter sizes for SE-ResNet-50 at different reduction ratios r. Here original refers to ResNet-50.

The role of Excitation. While SE blocks have been empirically shown to improve network performance, we would also like to understand how the self-gating excitation mechanism operates in practice. To provide a clearer picture of the behaviour of SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes at different blocks. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance
(a) SE_2_3  (b) SE_3_4  (c) SE_4_6  (d) SE_5_1  (e) SE_5_2  (f) SE_5_3

Figure 5: Activations induced by Excitation in the different modules of SE-ResNet-50 on ImageNet. The module is named as "SE_stageID_blockID".

diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in supplemental material). We then draw fifty samples for each class from the validation set and compute the average activations for fifty uniformly sampled channels in the last SE block in each stage (immediately prior to downsampling) and plot their distribution in Fig. 5. For reference, we also plot the distribution of average activations across all 1000 classes.
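A minimal sketch of how such per-class excitation statistics could be collected is given below. It assumes a model built from the SELayer module sketched in Sec. 3 and uses forward hooks; this is our own illustration of the measurement procedure, not the authors' code.

```python
import torch
from collections import defaultdict

def collect_se_activations(model, loader, device="cpu"):
    """Record the excitation (gate) values of every SELayer over a data loader."""
    records = defaultdict(list)  # module name -> list of (N, C) activation tensors
    hooks = []

    def make_hook(name):
        def hook(module, inputs, output):
            # `output` is the rescaled feature map; recompute the gate values
            # from the module's input to obtain the activations themselves.
            u = inputs[0]
            s = module.fc(u.mean(dim=(2, 3)))
            records[name].append(s.detach().cpu())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, SELayer):  # SELayer from the Sec. 3 sketch
            hooks.append(module.register_forward_hook(make_hook(name)))

    labels_seen = []
    with torch.no_grad():
        for images, labels in loader:
            model(images.to(device))
            labels_seen.append(labels)

    for h in hooks:
        h.remove()
    return records, torch.cat(labels_seen)
```

Averaging the recorded activations per class label (and over the sampled channels) then yields class-conditional curves of the kind plotted in Fig. 5.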


We make the following three observations about the role of Excitation. First, the distribution across different classes is nearly identical in lower layers, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages of the network. Interestingly however, the second observation is that at greater depth, the value of each channel becomes much more class-specific as different classes exhibit different preferences for the discriminative value of features, e.g. SE_4_6 and SE_5_1. The two observations are consistent with findings in previous work [23, 50], namely that lower layer features are typically more general (i.e. class agnostic in the context of classification) while higher layer features have greater specificity. As a result, representation learning benefits from the recalibration induced by SE blocks, which adaptively facilitates feature extraction and specialisation to the extent that it is needed. Finally, we observe a somewhat different phenomenon in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to 1 and the remainder are close to 0. At the point at which all activations take the value 1, this block would become a standard residual block. At the end of the network in SE_5_3 (which is immediately followed by global pooling prior to the classifiers), a similar pattern emerges over different classes, up to a slight change in scale (which could be tuned by the classifiers). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Sec. 4, which demonstrated that the overall parameter count could be significantly reduced by removing the SE blocks of the last stage with only a marginal loss of performance.

7. Conclusion

In this paper we proposed the SE block, a novel architectural unit designed to improve the representational capacity of a network by enabling it to perform dynamic channel-wise feature recalibration. Extensive experiments demonstrate the effectiveness of SENets, which achieve state-of-the-art performance on multiple datasets. In addition, they provide some insight into the limitations of previous architectures in modelling channel-wise feature dependencies, which we hope may prove useful for other tasks requiring strong discriminative features. Finally, the feature importance induced by SE blocks may be helpful to related fields such as network pruning for compression.

Acknowledgements. We would like to thank Professor Andrew Zisserman for his helpful comments and Samuel Albanie for his discussions and for editing the writing of the paper. We would also like to thank Chao Li for his contributions to the training system. Li Shen is supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.
References

[1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.
[2] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In NIPS, 2016.
[3] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.
[4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, 2017.
[5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS, 2017.
[6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[7] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In CVPR, 2017.
[8] D. Han, J. Kim, and J. Kim. Deep pyramidal residual networks. In CVPR, 2017.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
[13] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[14] G. Huang, Z. Liu, K. Q. Weinberger, and L. Maaten. Densely connected convolutional networks. In CVPR, 2017.
[15] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[17] L. Itti and C. Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2001.
[18] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 1998.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
[20] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[22] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, 2010.
[23] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
[24] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.
[25] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[26] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. arXiv:1711.00436, 2017.
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[28] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv:1706.06905, 2017.
[29] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
[30] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
[31] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[32] B. A. Olshausen, C. H. Anderson, and D. C. V. Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 1993.
[33] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[35] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the fisher vector: Theory and practice. RR-8209, INRIA, 2013.
[36] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.
[37] L. Shen, Z. Lin, G. Sun, and J. Hu. Places401 and places365 models. https://github.com/lishen-shirley/Places2-CNNs, 2016.
[38] L. Shen, G. Sun, Q. Huang, S. Wang, Z. Lin, and E. Wu. Multi-level discriminative dictionary learning with application to large scale image classification. IEEE TIP, 2015.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[40] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.
[41] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In ICLR Workshop, 2016.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[45] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
[46] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.
[47] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[48] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[49] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.
[50] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.
[51] X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In CVPR, 2017.
[52] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv:1707.01083, 2017.
[53] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.
[54] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[55] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv:1707.07012, 2017.
