ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic
3 Shanghai AI Lab, Shanghai, China
4 SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society
{xt.kong, hy.zhao1, yu.qiao, chao.dong}@siat.ac.cn
* Corresponding author (e-mail: chao.dong@siat.ac.cn)

Abstract

…to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline – ClassSR – that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to their restoration difficulties, then applies an SR-Module to perform SR for the different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses – Class-Loss and Average-Loss – to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, so the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on the DIV8K dataset. This general framework can also be applied to other low-level vision tasks.

Figure 1. The SR result (×4) of ClassSR-FSRCNN. The Class-Module classifies the image “0896” (DIV2K) into 56% simple, 20% medium and 24% hard sub-images. Compared with FSRCNN, ClassSR-FSRCNN uses only 55% of the FLOPs to achieve the same performance.

1. Introduction

Image super-resolution (SR) is a long-studied topic, which aims to generate a high-resolution, visually pleasing image from a low-resolution input. In this paper, we study how to accelerate SR algorithms on “large” input images, which will be upsampled to at least 2K resolution (2048×1080). In real-world usage, the image/video resolution of smartphones and TV monitors has already reached 4K (4096×2160), or even 8K (7680×4320). As most recent SR algorithms are built on CNNs, the memory and computational cost grow quadratically with the input size. Thus it is necessary to decompose the input into sub-images and continuously accelerate SR algorithms to meet the requirement of real-time processing of real images.

Recent works on SR acceleration focus on proposing light-weight network structures, e.g., from the early FSRCNN [6] to the latest CARN [2], which are detailed in Sec. 2. We tackle this problem from a different perspective. Instead of designing a faster model, we propose a new processing pipeline that could accelerate most SR methods. Above all, we draw the observation that different image regions require different network complexities (see Sec. 3.1). For example, a flat area (e.g., sky, land) is naturally easier to process than textures (e.g., hair, feathers).
This indicates that if we can use smaller networks to treat less complex image regions, the computational cost will be significantly reduced. According to this observation, we can adopt different networks for different contents after decomposition.

Sub-image decomposition is especially beneficial for large images. First, more regions are relatively simple to restore. According to our statistics, about 60% of the LR sub-images (32×32) belong to smooth regions for the DIV8K [7] dataset, while the percentage drops to 30% for the DIV2K [1] dataset. Thus the acceleration ratio will be higher for large images. Second, sub-image decomposition helps save memory in real applications and is essential for low-memory processing chips. It is also plausible to distribute sub-images to parallel processors for further acceleration.

To address the above issue and accelerate existing SR methods, we propose a new solution pipeline, namely ClassSR, to perform classification and super-resolution simultaneously. The framework consists of two modules – Class-Module and SR-Module. The Class-Module is a simple classification network that classifies the input into a specific class according to its restoration difficulty, while the SR-Module is a network container that processes the classified input with the SR network of the corresponding class. They are connected together and need to be trained jointly. The novelty lies in the classification method and the training strategy. Specifically, we introduce two new losses to constrain the classification results. The first is a Class-Loss that encourages a higher probability of the selected class for individual sub-images. The second is an Average-Loss that ensures the overall classification results are not biased towards a single class. These two losses work cooperatively to make the classification meaningful and well-distributed. The Image-Loss (L1 loss) is also added to guarantee the reconstruction performance. For the training strategy, we first pre-train the SR-Module with the Image-Loss. Then we fix the SR-Module and optimize the Class-Module with all three losses. Finally, we optimize the two modules simultaneously until convergence. This pipeline is general and effective for different SR networks.

Experiments are conducted on representative SR networks of different scales – FSRCNN (tiny) [6], CARN (small) [2], SRResNet (middle) [13] and RCAN (large) [25]. As shown in Fig. 2, the ClassSR method helps these SR networks save 50%, 47%, 48% and 50% of the computational cost on the DIV8K dataset, respectively. An example is shown in Fig. 1, where the flat areas (colored light green) are processed with the simple network and the textures (colored red) are processed with the complex one. We have also provided a detailed ablation study on the choice of different network settings.

Figure 2. PSNR and FLOPs comparison between ClassSR and the original networks on Test8K with ×4 (FLOPs reductions: −50%, −48%, −47%, −50%).

Overall, our contributions are three-fold: (1) We propose ClassSR. It is the first SR pipeline that incorporates classification and super-resolution together on the sub-image level. (2) We tackle acceleration through the characteristics of the data. This makes ClassSR orthogonal to other acceleration approaches: even a network compressed to its limit can still be accelerated by ClassSR. (3) We propose a classification method with two novel losses. It divides sub-images according to their restoration difficulties on a specific branch instead of predetermined labels, so it can also be directly applied to other low-level vision tasks. The code will be made available: https://github.com/Xiangtaokong/ClassSR

2. Related work

2.1. CNNs for Image Super-Resolution

Since SRCNN [5] first introduced convolutional neural networks (CNNs) to the SR task, many deep neural networks have been developed to improve the reconstruction results. For example, VDSR [10] uses a very deep network to learn the image residual. SRResNet [13] introduces the ResBlock [8] to further expand the network size. EDSR [14] removes some redundant layers from SRResNet and advances the results. RDN [26] and RRDB [20] adopt dense connections to utilize the information from preceding layers. Furthermore, RCAN [25], SAN [4] and RFA [15] explore the attention mechanism to design deeper networks and constantly refresh the state of the art. However, their expensive computational cost has limited their practical usage.

2.2. Light-weight SR Networks

To reduce the computational cost, many acceleration methods have been proposed. FSRCNN [6] and ESPCN [18] use the LR image as input and upscale the feature maps at the end of the network. LapSRN [12] introduces a deep Laplacian pyramid network that gradually upscales the feature maps. CARN [2] uses group convolution to design a cascading residual network for fast processing. IMDN [9] extracts hierarchical features by splitting operations and then aggregates them to save computation. PAN [27] adopts pixel attention to obtain an effective network.

All of those methods aim to design a relatively light-weight network with an acceptable reconstruction performance. In contrast, our ClassSR is a general framework that could accelerate most existing SR methods, ranging from tiny networks to large networks.
2.3. Region-aware Image Restoration

Recently, investigators have started to treat different image regions with different processing strategies. RAISR [17] divides the image patches into clusters and constructs an appropriate filter for each cluster. It also uses an efficient hashing approach to reduce the complexity of the clustering algorithm. SFTGAN [19] introduces a novel spatial feature transform layer to incorporate a high-level semantic prior, which is an implicit way to process different regions with different parameters. RL-Restore [23] and Path-Restore [24] use reinforcement learning to select a proper processing path for different image regions.

3. Methods

3.1. Observation

…In Fig. 3, we show these values in a blue curve and separate them into three classes with the same numbers of sub-images – “simple, medium, hard”. It is observed that the sub-images with high PSNR values are generally smooth, while the sub-images with low PSNR values contain complex textures.

[Figure 3. The ranked PSNR curve of sub-images from the DIV2K validation set and the visualization of the three classes (simple, medium, hard).]

Then we adopt different networks to deal with the different kinds of sub-images. As shown in Table 1, we use three FSRCNN models with the same network structure but different channel numbers in the first conv. layer and the last deconv. layer (i.e., 16, 36, 56). They are separately trained with the “simple, medium, hard” sub-images from the training dataset. From Table 1, we find that there is almost no difference between FSRCNN(16) and FSRCNN-O(56) on “simple” sub-images, and FSRCNN(36) achieves roughly the same performance as FSRCNN-O(56) on “medium” sub-images. This indicates that we can use a light-weight network to deal with simple sub-images to save computational cost. That is why we propose the following ClassSR method, which treats different image regions differently and accelerates existing SR methods.

Model          FLOPs  Simple   Medium   Hard
FSRCNN (16)    141M   42.71dB  –        –
FSRCNN (36)    304M   –        29.62dB  –
FSRCNN (56)    468M   –        –        22.73dB
FSRCNN-O (56)  468M   42.70dB  29.69dB  22.71dB

Table 1. PSNR values obtained by the three SR branches of ClassSR-FSRCNN with ×4. They are separately trained with “simple, medium, hard” training data and tested on the corresponding validation data. -O: the original network trained with all data.

3.2. Overview of ClassSR

ClassSR is a new solution pipeline for single image SR. It consists of two modules – Class-Module and SR-Module, as shown in Fig. 4. The Class-Module classifies the input images into $M$ classes, while the SR-Module contains $M$ branches (SR networks) $\{f_{SR}^{j}\}_{j=1}^{M}$ to deal with the different inputs. To be specific, the large input LR image $X$ is first decomposed into overlapping sub-images $\{x_i\}_{i=1}^{N}$. The Class-Module accepts each sub-image $x_i$ and generates a probability vector $[P_1(x_i), \ldots, P_M(x_i)]$. After that, we determine which SR network to use by selecting the index of the maximum probability value, $J = \arg\max_j P_j(x_i)$. Then $x_i$ is processed by the $J$-th branch of the SR-Module: $y_i = f_{SR}^{J}(x_i)$. Finally, we combine all output sub-images $\{y_i\}_{i=1}^{N}$ to get the final large SR image $Y$ (2K–8K).
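To make the data flow concrete, the following is a minimal sketch of this inference pipeline, assuming PyTorch, placeholder `class_module` and `sr_branches` networks, and non-overlapping 32×32 sub-images (the actual pipeline uses overlapping sub-images and blends the overlaps when recombining):

```python
import torch

@torch.no_grad()
def classsr_inference(x_lr, class_module, sr_branches, sub=32, scale=4):
    """Sketch of ClassSR inference (Sec. 3.2) on an LR image of shape (3, H, W)."""
    _, h, w = x_lr.shape
    y = torch.zeros(3, h * scale, w * scale)
    for top in range(0, h - sub + 1, sub):
        for left in range(0, w - sub + 1, sub):
            xi = x_lr[:, top:top + sub, left:left + sub].unsqueeze(0)
            probs = class_module(xi)            # [P_1(x_i), ..., P_M(x_i)]
            j = int(probs.argmax(dim=1))        # J = argmax_j P_j(x_i)
            yi = sr_branches[j](xi)             # y_i = f_SR^J(x_i)
            y[:, top * scale:(top + sub) * scale,
                 left * scale:(left + sub) * scale] = yi.squeeze(0)
    return y
```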
3.3. Class-Module

…

[Figure 4. The overview of the proposed ClassSR, when the number of classes M = 3. The Class-Module (convolution layers, pooling and a fully-connected layer) generates the probability vector $[P_1(x_i), P_2(x_i), P_3(x_i)]$; after decomposition, each sub-image is routed to the Simple (16-channel), Medium (36-channel) or Complex (56-channel) branch $f_{SR}^{1}$, $f_{SR}^{2}$, $f_{SR}^{3}$ via $J = \arg\max_j P_j(x_i)$, and the outputs are combined into the SR result.]
3.4. SR-Module

The SR-Module is designed as a container that consists of several independent branches $\{f_{SR}^{j}\}_{j=1}^{M}$. In general, each branch can be any learning-based SR network. As our goal is to accelerate an existing SR method (e.g., FSRCNN, CARN), we adopt this SR network as the base network and set it as the most complex branch $f_{SR}^{M}$. The other branches are obtained by reducing the network complexity of $f_{SR}^{M}$. For simplicity, we use the number of channels in each convolution layer to control the network complexity. Then how many channels are required for each SR branch? The principle is that a branch network should achieve comparable results to the base network trained with all data in the corresponding class. For instance (see Table 1 and Fig. 4), the numbers of channels for $f_{SR}^{1}$, $f_{SR}^{2}$, $f_{SR}^{3}$ can be 16, 36 and 56, where 56 is the channel number of the base network. Note that we can also decrease the network complexity in other ways, such as reducing layers (see Sec. 4.3.4), as long as the network performance meets the above principle.
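As an illustration of this principle, the sketch below builds three branches that share one structure but differ in width; the FSRCNN-like layout and the names `make_branch`/`sr_branches` are our simplifying assumptions, not the paper's exact branch definitions:

```python
import torch.nn as nn

def make_branch(nf):
    """Illustrative FSRCNN-like x4 branch whose cost is controlled by width `nf`.
    A simplification: the real branches follow the paper's released code
    (see also Tables 3-4 of the supplementary for the SRResNet/RCAN branches)."""
    return nn.Sequential(
        nn.Conv2d(3, nf, 5, padding=2), nn.PReLU(),
        nn.Conv2d(nf, nf, 3, padding=1), nn.PReLU(),
        nn.Conv2d(nf, nf, 3, padding=1), nn.PReLU(),
        # output size: (S-1)*4 - 2*3 + 9 + 1 = 4*S, an exact x4 upscale
        nn.ConvTranspose2d(nf, 3, 9, stride=4, padding=3, output_padding=1),
    )

# Three branches with the channel numbers used for ClassSR-FSRCNN (Sec. 3.4):
sr_branches = [make_branch(nf) for nf in (16, 36, 56)]
```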
3.5. Classification Method

During training, the Class-Module classifies sub-images according to their restoration difficulties on a specific branch instead of predetermined labels. Therefore, different from testing, the input sub-image $x$ should pass through all $M$ SR branches. Besides, in order to ensure that the Class-Module can receive the gradients propagated from the reconstruction results, we multiply the reconstructed sub-images $f_{SR}^{i}(x)$ and the corresponding classification probabilities $P_i(x)$ to generate the final SR output $y$ as:

$$y = \sum_{i=1}^{M} P_i(x) \times f_{SR}^{i}(x). \quad (1)$$

We just use the Image-Loss (L1 loss) to constrain $y$; then we can obtain the classification probabilities automatically. But during testing, the input only passes through the SR branch with the maximum probability. Thus, we propose $L_c$ (Class-Loss, see Sec. 3.6.1) to make the maximum probability approach 1, so that $y$ becomes equal to the output of the branch with probability 1. Note that if we only adopt the Image-Loss and Class-Loss, the training easily converges to an extreme point, where all images are classified into the most complex branch. To avoid such a biased result, we design $L_a$ (Average-Loss, see Sec. 3.6.2) to constrain the classification results. This is our proposed new classification method.

3.6. Loss Functions

The loss function consists of three losses – a commonly used L1 loss (Image-Loss) and our proposed two losses $L_c$ (Class-Loss) and $L_a$ (Average-Loss). Specifically, $L_1$ is used to ensure the image reconstruction quality, $L_c$ improves the effectiveness of classification, and $L_a$ ensures that each SR branch can be chosen equally. The loss function is:

$$L = w_1 \times L_1 + w_2 \times L_c + w_3 \times L_a, \quad (2)$$

where $w_1$, $w_2$ and $w_3$ are the weights that balance the different loss terms. $L_1$ is the 1-norm distance between the output image and the ground truth, just as in previous works [10, 13]. The two new losses $L_c$ and $L_a$ are detailed below.

3.6.1 Class-Loss

As mentioned in Sec. 3.5, the Class-Loss constrains the output probability distribution of the Class-Module. We prefer the Class-Module to have much higher confidence in the class with the maximum probability than in the others. For example, the classification result [0.90, 0.05, 0.05] is better than [0.34, 0.33, 0.33], as the latter seems like a random selection. The Class-Loss is formulated as:

$$L_c = -\sum_{i=1}^{M-1} \sum_{j=i+1}^{M} |P_i(x) - P_j(x)|, \quad \text{s.t.} \; \sum_{i=1}^{M} P_i(x) = 1, \quad (3)$$
where $M$ is the number of classes. $L_c$ is the negative sum of the distances between each pair of class probabilities for the same sub-image. This loss greatly enlarges the probability gaps between the different classes, so that the maximum probability value will be close to 1.
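For reference, the snippet below sketches the training-time forward pass of Eq. (1) together with the Class-Loss of Eq. (3). It is a minimal sketch with placeholder names; in particular, the Average-Loss implementation is our assumption (its formula is not reproduced above): we penalize the deviation of the batch-average class probabilities from the uniform value 1/M, consistent with its stated goal that each branch be chosen equally:

```python
import torch

def classsr_train_losses(x, hr, class_module, sr_branches, M=3):
    """Sketch of Eqs. (1)-(3) for a batch of LR sub-images `x` and HR targets `hr`."""
    probs = class_module(x)                                  # (B, M), rows sum to 1
    outs = torch.stack([f(x) for f in sr_branches], dim=1)   # (B, M, C, sH, sW)
    # Eq. (1): probability-weighted combination of all branch outputs.
    y = (probs[:, :, None, None, None] * outs).sum(dim=1)
    image_loss = (y - hr).abs().mean()                       # Image-Loss (L1)
    # Eq. (3): negative sum of pairwise probability gaps (Class-Loss).
    class_loss = torch.zeros(())
    for i in range(M - 1):
        for j in range(i + 1, M):
            class_loss = class_loss - (probs[:, i] - probs[:, j]).abs().mean()
    # Average-Loss (Sec. 3.6.2) -- assumed form, not the paper's exact formula:
    # keep the batch-average class probabilities close to uniform 1/M.
    avg_loss = (probs.mean(dim=0) - 1.0 / M).abs().sum()
    return image_loss, class_loss, avg_loss
```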
3.6.2 Average-Loss

…

3.7. Training Strategy

…Afterwards, we relax all parameters and finetune the whole model. During joint training, the Class-Module refines its output probability vectors according to the final SR results, and the SR-Module updates itself according to the new classification results. In experiments (see Fig. 6), we find that the sub-images are assigned to different SR branches, while the performance and the efficiency improve simultaneously.
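The three-stage schedule (pre-train the SR-Module with the Image-Loss; freeze it and train the Class-Module with all three losses; then finetune everything jointly) can be sketched as follows. This is a minimal sketch: the optimizer choice, learning rate and freezing mechanics are our assumptions, not the paper's exact settings:

```python
import itertools
import torch

def set_requires_grad(module, flag):
    # Enable or disable gradient updates for all parameters of `module`.
    for p in module.parameters():
        p.requires_grad = flag

def make_stage_optimizer(stage, class_module, sr_branches, lr=1e-4):
    """Stage 1: pre-train SR branches; stage 2: train Class-Module only;
    stage 3: joint finetuning of both modules until convergence."""
    set_requires_grad(class_module, stage != 1)   # frozen only in stage 1
    for f in sr_branches:
        set_requires_grad(f, stage != 2)          # frozen only in stage 2
    if stage == 1:
        params = itertools.chain(*(f.parameters() for f in sr_branches))
    elif stage == 2:
        params = class_module.parameters()
    else:
        params = itertools.chain(class_module.parameters(),
                                 *(f.parameters() for f in sr_branches))
    return torch.optim.Adam(params, lr=lr)
```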
[Figure 5. Visual comparison of ClassSR and the original networks (GT and Bicubic shown for reference). RCAN-O 25.29dB/32.58G vs. ClassSR-RCAN 25.30dB/20.21G (62%); SRResNet-O 25.05dB/5.20G vs. ClassSR-SRResNet 25.06dB/3.39G (65%); CARN-O 24.85dB/1.15G vs. ClassSR-CARN 24.93dB/0.78G (67%); FSRCNN-O 24.52dB/468M vs. ClassSR-FSRCNN 24.54dB/293M (63%). -O: the original networks.]
4. Experiments

…pre-training. The Class-Module is trained within 200k iterations… …change the first conv. layer and the last deconv. layer… …(36, 52, 64) for SRResNet, and (36, 50, 64) for RCAN. Training and testing follow the same procedure as described above.

Results are summarized in Table 2. Obviously, most ClassSR methods obtain better performance than the original networks at a lower computational cost, ranging from 70% down to 50% of the original FLOPs. The reduction of FLOPs is highly correlated with the image resolution of the test data. The acceleration on Test8K is the most significant, nearly 2× (50% FLOPs) for all methods. This is because a larger input image can be decomposed into more sub-images, which have a higher probability of being processed by simple branches.
To further understand how ClassSR works, we use ClassSR-FSRCNN to illustrate the behaviors and intermediate results of the different training stages. First, let us look at the performance of SR-Module pre-training. As shown in Table 1, the results of the SR branches on the corresponding validation sets are roughly the same as those of the original network. This is in accordance with our observation in Sec. 3.1. Then we show the validation curves of Class-Module training and joint training in Fig. 6. We can see that the PSNR values increase while the FLOPs decrease, even during the training of the Class-Module. This indicates that the gain in performance does not come at the cost of extra computation. In other words, the input images are classified into more appropriate branches during the training process, demonstrating the effectiveness of both training procedures. After training, we test ClassSR-FSRCNN on Test8K. Statistically, 61%, 23% and 16% of the sub-images are assigned to FSRCNN (16), FSRCNN (36) and FSRCNN (56), respectively. The overall FLOPs drop from 468M to 236M. This further reflects the effectiveness of the classification.

[Figure 6. The training curves of Class-Module (Class) and joint training (Joint) of ClassSR-FSRCNN: (a) PSNR curve of Class; (b) FLOPs curve of Class; (c) PSNR curve of Joint; (d) FLOPs curve of Joint.]

Fig. 5 shows a visual example, where we observe that ClassSR methods obtain the same visual effects as the original networks. Furthermore, the transitions among different regions are smooth and natural. In other words, treating different regions differently brings no incoherence between adjacent sub-images.

Complexity Analysis. During testing, first, we use the average FLOPs of all 32×32 sub-images within a test set to evaluate the running time, because FLOPs is device-independent and well known by most researchers and engineers. The reported FLOPs already include the cost of the Class-Module, which is only 8M, almost negligible for the whole model. Second, we need to clarify that the aim of ClassSR is to save FLOPs rather than parameters. The former represents the real running time, while the latter mainly influences the memory. Note that the memory cost brought by model parameters is much smaller than that of storing intermediate features, so the extra parameters brought by ClassSR are acceptable.
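These statistics can be sanity-checked with a simple expected-cost computation (our arithmetic, using the per-branch FLOPs from Table 1 and the 8M Class-Module cost mentioned above; the small gap comes from rounding the percentages):

$$0.61 \times 141\text{M} + 0.23 \times 304\text{M} + 0.16 \times 468\text{M} + 8\text{M} \approx 239\text{M},$$

which is consistent with the reported average of 236M FLOPs per sub-image.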
4.3. Ablation Study

4.3.1 Effect of Class-Loss

In the ablation study, we test the effect of different components and settings with ClassSR-FSRCNN. First, we test the effect of the proposed Class-Loss by removing it from the loss function. Fig. 7 shows the resulting PSNR and FLOPs curves during training. Without the Class-Loss, the two curves cannot converge. This is because the output probability vectors of the Class-Module all become [0.333, 0.333, 0.333] under the influence of the Average-Loss. In other words, the input images are randomly assigned to an SR branch, leading to unstable performance. This demonstrates the importance of the Class-Loss.

[Figure 7. Training curves of the Class-Module with/without Class-Loss for ClassSR-FSRCNN: (a) PSNR curves; (b) FLOPs curves.]

4.3.2 Effect of Average-Loss

Then we evaluate the effect of the Average-Loss by removing it from the loss function ($w_3 = 0$). From Fig. 8, we can see that both PSNR and FLOPs stop changing at a very early stage. The reason is that all input images are assigned to the most complex branch, which is a bad local minimum for optimization. The Average-Loss is proposed to avoid such biased classification results.

[Figure 8. PSNR and FLOPs training curves of the Class-Module without the Average-Loss.]
4.3.3 Effect of the number of classes

We also investigate the effect of the number of classes, which is also the number of SR branches. We conduct experiments with two to five classes; the results are shown in Table 3.
Model                                   Test2K   FLOPs       Test4K   FLOPs       Test8K   FLOPs
ClassSR-FSRCNN(2) (16, 56)              25.61dB  310M(66%)   26.91dB  280M(60%)   32.72dB  228M(49%)
ClassSR-FSRCNN(3) (16, 36, 56)          25.61dB  311M(66%)   26.91dB  286M(61%)   32.73dB  238M(51%)
ClassSR-FSRCNN(4) (16, 29, 43, 56)      25.61dB  298M(64%)   26.92dB  290M(62%)   32.73dB  238M(51%)
ClassSR-FSRCNN(5) (16, 26, 36, 46, 56)  25.63dB  306M(65%)   26.93dB  286M(61%)   32.74dB  248M(53%)

Table 3. PSNR values obtained by ClassSR with different numbers of classes. ClassSR-FSRCNN(2) (16, 56): ClassSR with 2 branches, where $f_{SR}^{1}$ has 16 channels and $f_{SR}^{2}$ has 56 channels.
Model                                   Test2K   FLOPs       Test4K   FLOPs       Test8K   FLOPs
ClassSR-SRResNet (38 12, 54 14, 64 16)  26.20dB  3.60G(69%)  27.65dB  3.28G(63%)  33.50dB  2.68G(52%)
ClassSR-SRResNet (42 8, 56 12, 64 16)   26.20dB  3.60G(69%)  27.65dB  3.28G(63%)  33.50dB  2.68G(52%)

Table 4. PSNR values obtained by ClassSR with different layers and channels on Test2K, Test4K and Test8K. ClassSR-SRResNet (38 12, 54 14, 64 16): $f_{SR}^{1}$ has 38 channels and 12 layers, $f_{SR}^{2}$ has 54 channels and 14 layers, $f_{SR}^{3}$ has 64 channels and 16 layers.
…on DIV2K following the above training settings.
…Module to replace the SR-Module. Then we train the network…
…noise (σ=25) and set the patch size to 32×32.
References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–268, 2018.
[3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
[4] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11065–11074, 2019.
[5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295–307, 2015.
[6] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.
[7] S. Gu, A. Lugmayr, M. Danelljan, M. Fritsche, J. Lamour, and R. Timofte. DIV8K: Diverse 8K resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512–3516, 2019.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[9] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2024–2032, 2019.
[10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 624–632, 2017.
[13] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[14] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[15] Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, and Gangshan Wu. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[17] Yaniv Romano, John Isidoro, and Peyman Milanfar. RAISR: Rapid and accurate image super resolution. IEEE Transactions on Computational Imaging, 3(1):110–125, 2016.
[18] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[19] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[20] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[21] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
[22] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861–2873, 2010.
[23] Ke Yu, Chao Dong, Liang Lin, and Chen Change Loy. Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[24] Ke Yu, Xintao Wang, Chao Dong, Xiaoou Tang, and Chen Change Loy. Path-Restore: Learning network path selection for image restoration. arXiv preprint arXiv:1904.10343, 2019.
[25] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
[26] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
[27] Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, and Chao Dong. Efficient image super-resolution using pixel attention. 2020.
ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

Supplementary File
layer name     kernel size          n_c
conv_first     (3, n_c, 3, 3)       36  52  64
[ResBlock]×16  (n_c, n_c, 3, 3)
               (n_c, n_c, 3, 3)     36  52  64
upconv1        (n_c×4, n_c, 3, 3)   36  52  64
upconv2        (n_c×4, n_c, 3, 3)   36  52  64
conv           (n_c, n_c, 3, 3)     36  52  64
conv_last      (n_c, 3, 3, 3)       36  52  64
FLOPs          36: 1.66G; 52: 3.44G; 64: 5.20G

Table 3. The network architecture of the branches in ClassSR-SRResNet. n_c: number of channels; (3, n_c, 3, 3): input channel 3, output channel n_c, convolutional layer with kernel size 3×3.

Model            FLOPs   Simple    Medium    Hard
SRResNet (36)    1.66G   43.63dB   30.70dB   23.21dB
SRResNet (52)    3.44G   43.67dB   30.91dB   23.47dB
SRResNet (64)    5.20G   43.52dB   30.85dB   23.54dB
SRResNet-O (64)  5.20G   43.68dB   30.93dB   23.52dB

Table 7. PSNR values obtained by the three SR branches of ClassSR-SRResNet on the corresponding validation sets with ×4. -O: the original network trained with all data.
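A sketch of one such branch, matching the layer layout of Table 3 with configurable width n_c, is given below; activation placement and other details are our assumptions (the exact definitions follow the released code):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with a residual connection, as in Table 3."""
    def __init__(self, nc):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(nc, nc, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nc, nc, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def srresnet_branch(nc):
    """SRResNet-style x4 branch with width `nc` (36, 52 or 64)."""
    return nn.Sequential(
        nn.Conv2d(3, nc, 3, padding=1),                           # conv_first
        *[ResBlock(nc) for _ in range(16)],                       # [ResBlock]x16
        nn.Conv2d(nc, nc * 4, 3, padding=1), nn.PixelShuffle(2),  # upconv1 (x2)
        nn.Conv2d(nc, nc * 4, 3, padding=1), nn.PixelShuffle(2),  # upconv2 (x2)
        nn.Conv2d(nc, nc, 3, padding=1),                          # conv
        nn.Conv2d(nc, 3, 3, padding=1))                           # conv_last
```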
layer name   kernel size           n_c
conv1        (3, n_c, 3, 3)        36  50  64
[RCAB]×200   (n_c, n_c, 3, 3)
             (n_c, n_c, 3, 3)
             (n_c, n_c/16, 3, 3)
             (n_c/16, n_c, 3, 3)   36  50  64
upconv1      (n_c×4, n_c, 3, 3)    36  50  64
upconv2      (n_c×4, n_c, 3, 3)    36  50  64
conv_out     (n_c, 3, 3, 3)        36  50  64
FLOPs        36: 10.33G; 50: 19.90G; 64: 32.60G

Table 4. The network architecture of the branches in ClassSR-RCAN. n_c: number of channels; (3, n_c, 3, 3): input channel 3, output channel n_c, convolutional layer with kernel size 3×3.

Model        FLOPs    Simple    Medium    Hard
RCAN (36)    10.33G   43.78dB   31.00dB   23.42dB
RCAN (50)    19.90G   43.82dB   31.25dB   23.78dB
RCAN (64)    32.60G   43.60dB   31.16dB   23.86dB
RCAN-O (64)  32.60G   43.84dB   31.27dB   23.86dB

Table 8. PSNR values obtained by the three SR branches of ClassSR-RCAN on the corresponding validation sets with ×4. -O: the original network trained with all data.

Model          FLOPs  Simple    Medium    Hard
FSRCNN (16)    141M   42.71dB   29.28dB   22.42dB
FSRCNN (36)    304M   42.50dB   29.62dB   22.65dB
FSRCNN (56)    468M   42.00dB   29.61dB   22.73dB
FSRCNN-O (56)  468M   42.70dB   29.69dB   22.71dB

Table 5. PSNR values obtained by the three SR branches of ClassSR-FSRCNN on the corresponding validation sets with ×4. -O: the original network trained with all data.

2. Additional Experiments

2.1. Effect of Class-Loss and Average-Loss ratio

How do we balance the effect of the Class-Loss and the Average-Loss? We conduct experiments by changing the weight of the Class-Loss ($w_2 = 0.5, 1, 2$) and fixing the other weights ($w_1 = 2000$, $w_3 = 6$). Results are shown in Fig. 1. From Fig. 1(b) and Fig. 1(c), we observe that $w_2 = 1$ achieves the best trade-off between PSNR and FLOPs. When $w_2 = 2$, the performance decreases significantly. It seems that $w_2 = 0.5$ is comparable with $w_2 = 1$. However, from Fig. 1(a), we can see that $w_2 = 0.5$ (blue line) has a much larger classification loss. Here the value 0.4 indicates that the maximum classification probability is less than 80%, which is below our requirement. Therefore, we set $w_1 = 2000$, $w_2 = 1$, $w_3 = 6$ as our default setting.

[Figure 1. Class-Loss, PSNR and FLOPs training curves for $w_2$ = 0.5, 1.0 and 2.0.]
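With these defaults, the total objective of Eq. (2) in the main paper becomes $L = 2000 \times L_1 + 1 \times L_c + 6 \times L_a$. As a usage example, reusing the hypothetical `classsr_train_losses` sketch from Sec. 3.6:

```python
# Default weights from this section: w1=2000, w2=1, w3=6.
image_loss, class_loss, avg_loss = classsr_train_losses(
    x, hr, class_module, sr_branches)
total_loss = 2000 * image_loss + 1 * class_loss + 6 * avg_loss
total_loss.backward()
```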
Model              Test2K/FLOPs                       Test4K/FLOPs                       Test8K/FLOPs
FSRCNN-O           S:0% M:0% H:100% / 468M(100%)      S:0% M:0% H:100% / 468M(100%)      S:0% M:0% H:100% / 468M(100%)
ClassSR-FSRCNN     S:33% M:34% H:33% / 308M(66%)      S:43% M:29% H:28% / 284M(60%)      S:61% M:23% H:16% / 236M(50%)
CARN-O             S:0% M:0% H:100% / 1.15G(100%)     S:0% M:0% H:100% / 1.15G(100%)     S:0% M:0% H:100% / 1.15G(100%)
ClassSR-CARN       S:30% M:29% H:41% / 814M(71%)      S:40% M:27% H:33% / 742M(64%)      S:60% M:22% H:18% / 608M(53%)
SRResNet-O         S:0% M:0% H:100% / 5.20G(100%)     S:0% M:0% H:100% / 5.20G(100%)     S:0% M:0% H:100% / 5.20G(100%)
ClassSR-SRResNet   S:31% M:28% H:41% / 3.62G(70%)     S:41% M:26% H:33% / 3.30G(63%)     S:60% M:22% H:18% / 2.70G(52%)
RCAN-O             S:0% M:0% H:100% / 32.60G(100%)    S:0% M:0% H:100% / 32.60G(100%)    S:0% M:0% H:100% / 32.60G(100%)
ClassSR-RCAN       S:33% M:32% H:35% / 21.16G(65%)    S:42% M:29% H:29% / 19.47G(60%)    S:60% M:22% H:17% / 16.19G(50%)

Table 9. Classification results on Test2K, Test4K and Test8K. S/M/H: fraction of sub-images classified as simple/medium/hard. -O: the original networks trained with all data. -R: randomly selecting SR branches of ClassSR.

Table 10. PSNR values on Test2K, Test4K and Test8K. Red/Blue text: best performance/lowest FLOPs. 20dB–35dB: the model is trained with data whose PSNR obtained by MSRResNet [6] lies between 20dB and 35dB.

Table 11. PSNR values on Test2K, Test4K and Test8K. Red/Blue text: best performance/lowest FLOPs.

Table 12. PSNR values on Test8K_2K, Test8K_4K and Test8K. -O: the original networks. Red/Blue text: best performance/lowest FLOPs. Test8K_2K: the 2K images are downsampled from Test8K (images 1400–1500 of the DIV8K dataset).
2.2. Effect of the range of training data

…

2.3. Comparison with gradient-based Classification

…
2.4. Effect of the contents and the resolutions

To illustrate that the reduction of FLOPs is highly correlated with the image resolution of the test data, we also evaluate ClassSR on images with the same contents but different resolutions (see Table 12). Actually, there is no difference between using the same images and a random selection; they lead to the same conclusion, namely that ClassSR is more significant on high-resolution images, because a larger input image can be decomposed into more sub-images, which have a higher probability of being processed by simple branches. Besides, it also shows that the reduction of FLOPs of ClassSR is related to the specific images (see Sec. 3).

2.5. The actual running time

The actual running time is similar to FLOPs in this work. Some works [5] point out that FLOPs do not have a strong relationship with running time, but this phenomenon is usually caused by differing structures. The branches of ClassSR, however, are directly derived from the original networks by reducing layers/channels, so the FLOPs and the running time in this work follow the same trend. The reason we use FLOPs instead of time/activations is that FLOPs is device-independent and well known by most researchers and engineers. Besides, the sub-images after decomposition can be distributed to parallel processors for further acceleration in actual use.

References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252–268, 2018.
[3] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.
[4] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[5] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[6] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
[7] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
[Figure: per-image FLOPs reduction examples. ClassSR-FSRCNN: 38.86dB/180M (38%) vs. FSRCNN: 38.83dB/468M; ClassSR-FSRCNN: 29.27dB/467M (99%) vs. FSRCNN: 29.15dB/468M.]
[Figure 3. Visual results of ClassSR and the original networks on ×4 super-resolution. -O: the original networks. Reported PSNR/FLOPs — first example: RCAN-O 26.58dB/32.58G, SRResNet-O 25.97dB/5.20G, CARN-O 25.39dB/1.15G, FSRCNN-O 24.68dB/468M; second example: RCAN-O 23.17dB/32.58G, SRResNet-O 23.00dB/5.20G, CARN-O 22.78dB/1.15G, FSRCNN-O 22.52dB/468M.]