
ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

Xiangtao Kong 1,2   Hengyuan Zhao 1   Yu Qiao 1,3   Chao Dong 1,4 *

1 Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Shanghai AI Lab, Shanghai, China
4 SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society
{xt.kong, hy.zhao1, yu.qiao, chao.dong}@siat.ac.cn

* Corresponding author (e-mail: chao.dong@siat.ac.cn)

arXiv:2103.04039v1 [cs.CV] 6 Mar 2021

Abstract

We aim at accelerating super-resolution (SR) networks on large images (2K-8K). The large images are usually decomposed into small sub-images in practical usages. Based on this processing, we found that different image regions have different restoration difficulties and can be processed by networks with different capacities. Intuitively, smooth areas are easier to super-resolve than complex textures. To utilize this property, we can adopt appropriate SR networks to process different sub-images after the decomposition. On this basis, we propose a new solution pipeline – ClassSR – that combines classification and SR in a unified framework. In particular, it first uses a Class-Module to classify the sub-images into different classes according to restoration difficulties, then applies an SR-Module to perform SR for the different classes. The Class-Module is a conventional classification network, while the SR-Module is a network container that consists of the to-be-accelerated SR network and its simplified versions. We further introduce a new classification method with two losses – Class-Loss and Average-Loss – to produce the classification results. After joint training, a majority of sub-images will pass through smaller networks, so the computational cost can be significantly reduced. Experiments show that our ClassSR can help most existing methods (e.g., FSRCNN, CARN, SRResNet, RCAN) save up to 50% FLOPs on the DIV8K dataset. This general framework can also be applied in other low-level vision tasks.

Figure 1. The SR result (×4) of ClassSR-FSRCNN. The Class-Module classifies the image "0896" (DIV2K) into 56% simple, 20% medium and 24% hard sub-images. Compared with FSRCNN (36.81dB / 468M FLOPs), ClassSR-FSRCNN (36.97dB / 256M FLOPs) uses only 55% FLOPs to achieve the same performance.

1. Introduction

Image super-resolution (SR) is a long-studied topic, which aims to generate a high-resolution, visually pleasing image from a low-resolution input. In this paper, we study how to accelerate SR algorithms on "large" input images, which will be upsampled to at least 2K resolution (2048×1080). In real-world usage, the image/video resolution of smartphones and TV monitors has already reached 4K (4096×2160), or even 8K (7680×4320). As most recent SR algorithms are built on CNNs, the memory and computational cost grow quadratically with the input size. Thus it is necessary to decompose the input into sub-images and continuously accelerate SR algorithms to meet the requirement of real-time processing on real images.

Recent works on SR acceleration focus on proposing light-weight network structures, e.g., from the early FSRCNN [6] to the latest CARN [2], which are detailed in Sec. 2. We tackle this problem from a different perspective. Instead of designing a faster model, we propose a new processing pipeline that could accelerate most SR methods. Above all, we draw the observation that different image regions require different network complexities (see Sec. 3.1). For example, a flat area (e.g., sky, land) is naturally easier to process than textures (e.g., hair, feathers). This indicates that if we can use smaller networks to treat less complex image regions, the computational cost will be significantly reduced.

According to this observation, we can adopt different networks for different contents after decomposition.

Sub-image decomposition is especially beneficial for large images. First, more regions are relatively simple to restore. According to our statistics, about 60% of LR sub-images (32×32) belong to smooth regions in the DIV8K [7] dataset, while the percentage drops to 30% for the DIV2K [1] dataset. Thus the acceleration ratio will be higher for large images. Second, sub-image decomposition can help save memory in real applications, and is essential for low-memory processing chips. It is also plausible to distribute sub-images to parallel processors for further acceleration.

To address the above issue and accelerate existing SR methods, we propose a new solution pipeline, namely ClassSR, that performs classification and super-resolution simultaneously. The framework consists of two modules – Class-Module and SR-Module. The Class-Module is a simple classification network that classifies the input into a specific class according to the restoration difficulty, while the SR-Module is a network container that processes the classified input with the SR network of the corresponding class. They are connected together and need to be trained jointly. The novelty lies in the classification method and the training strategy. Specifically, we introduce two new losses to constrain the classification results. The first is a Class-Loss that encourages a higher probability of the selected class for individual sub-images. The other is an Average-Loss that ensures the overall classification results are not biased to a single class. These two losses work cooperatively to make the classification meaningful and well-distributed. The Image-Loss (L1 loss) is also added to guarantee the reconstruction performance. For the training strategy, we first pre-train the SR-Module with the Image-Loss. Then we fix the SR-Module and optimize the Class-Module with all three losses. Finally, we optimize the two modules simultaneously until convergence. This pipeline is general and effective for different SR networks.

Experiments are conducted on representative SR networks of different scales – FSRCNN (tiny) [6], CARN (small) [2], SRResNet (middle) [13] and RCAN (large) [25]. As shown in Fig. 2, ClassSR helps these SR networks save 50%, 47%, 48% and 50% of the computational cost on the DIV8K dataset, respectively. An example is shown in Fig. 1, where the flat areas (light green) are processed with the simple network and the textures (red) are processed with the complex one. We have also provided a detailed ablation study on the choice of different network settings.

Figure 2. PSNR and FLOPs comparison between ClassSR and the original networks on Test8K with ×4; the annotations (-50%, -48%, -47%, -50%) mark the FLOPs reduction of each network.

Overall, our contributions are three-fold: (1) We propose ClassSR. It is the first SR pipeline that incorporates classification and super-resolution together on the sub-image level. (2) We tackle acceleration through the characteristics of the data. This makes ClassSR orthogonal to other acceleration networks: a network compressed to the limit can still be accelerated by ClassSR. (3) We propose a classification method with two novel losses. It divides sub-images according to restoration difficulties determined by a specific branch instead of predetermined labels, so it can also be directly applied to other low-level vision tasks. The code will be made available at https://github.com/Xiangtaokong/ClassSR.

2. Related Work

2.1. CNNs for Image Super-Resolution

Since SRCNN [5] first introduced convolutional neural networks (CNNs) to the SR task, many deep neural networks have been developed to improve the reconstruction results. For example, VDSR [10] uses a very deep network to learn the image residual. SRResNet [13] introduces the ResBlock [8] to further expand the network size. EDSR [14] removes some redundant layers from SRResNet and advances the results. RDN [26] and RRDB [20] adopt dense connections to utilize the information from preceding layers. Furthermore, RCAN [25], SAN [4] and RFA [15] explore the attention mechanism to design deeper networks and constantly refresh the state of the art. However, the expensive computational cost has limited their practical usage.

2.2. Light-weight SR Networks

To reduce the computational cost, many acceleration methods have been proposed. FSRCNN [6] and ESPCN [18] use the LR image as input and upscale the feature maps at the end of the networks. LapSRN [12] introduces a deep Laplacian pyramid network that gradually upscales the feature maps. CARN [2] uses group convolution to design a cascading residual network for fast processing. IMDN [9] extracts hierarchical features by splitting operations and then aggregates them to save computation. PAN [27] adopts pixel attention to obtain an effective network.

All of those methods aim to design a relatively light-weight network with an acceptable reconstruction performance.

In contrast, our ClassSR is a general framework that could accelerate most existing SR methods, ranging from tiny networks to large ones.

2.3. Region-aware Image Restoration

Recently, investigators have started to treat different image regions with different processing strategies. RAISR [17] divides the image patches into clusters and constructs an appropriate filter for each cluster. It also uses an efficient hashing approach to reduce the complexity of the clustering algorithm. SFTGAN [19] introduces a novel spatial feature transform layer to incorporate a high-level semantic prior, which is an implicit way to process different regions with different parameters. RL-Restore [23] and Path-Restore [24] decompose the image into sub-images and estimate an appropriate processing path by reinforcement learning. Different from them, we propose a new classification method to determine the processing of each region.

3. Methods

3.1. Observation

We first illustrate our observation on different kinds of sub-images. Specifically, we investigate the statistical characteristics of 32×32 LR sub-images in the DIV2K validation dataset [1].¹ To evaluate their restoration difficulty, we pass all sub-images through MSRResNet [20] and rank them according to their PSNR values. As depicted in Fig. 3, we show these values in a blue curve and separate them into three classes with the same number of sub-images – "simple, medium, hard". It is observed that sub-images with high PSNR values are generally smooth, while sub-images with low PSNR values contain complex textures.

¹ We use 100 validation images (0801-0900), crop the sub-images with stride 32, and collect 17,808 sub-images in total.

Figure 3. The ranked PSNR curve of sub-images from the DIV2K validation set and the visualization of the three classes (simple, medium, hard).

Then we adopt different networks to deal with the different kinds of sub-images. As shown in Table 1, we use three FSRCNN models with the same network structure but different channel numbers in the first conv. layer and the last deconv. layer (i.e., 16, 36, 56). They are separately trained with the "simple, medium, hard" sub-images from the training dataset.² From Table 1, we find that there is almost no difference between FSRCNN(16) and FSRCNN-O(56) on "simple" sub-images, and FSRCNN(36) achieves roughly the same performance as FSRCNN-O(56) on "medium" sub-images. This indicates that we can use a light-weight network to deal with simple sub-images to save computational cost. That is why we propose the following ClassSR method, which treats different image regions differently and accelerates existing SR methods.

² We use 800 training images (0001-0800) in DIV2K, rescale them to 0.6, 0.7, 0.8, 0.9 times, crop the sub-images with stride 16, and collect 1,594,077 sub-images in total.

Model           FLOPs   Simple    Medium    Hard
FSRCNN (16)     141M    42.71dB   –         –
FSRCNN (36)     304M    –         29.62dB   –
FSRCNN (56)     468M    –         –         22.73dB
FSRCNN-O (56)   468M    42.70dB   29.69dB   22.71dB

Table 1. PSNR values obtained by the three SR branches of ClassSR-FSRCNN with ×4. They are separately trained with "simple, medium, hard" training data and tested on the corresponding validation data. -O: the original network trained with all data.

3.2. Overview of ClassSR

ClassSR is a new solution pipeline for single image SR. It consists of two modules – Class-Module and SR-Module, as shown in Fig. 4. The Class-Module classifies the input images into M classes, while the SR-Module contains M branches (SR networks) \{f_{SR}^{j}\}_{j=1}^{M} to deal with the different inputs. To be specific, the large input LR image X is first decomposed into overlapping sub-images \{x_i\}_{i=1}^{N}. The Class-Module accepts each sub-image x_i and generates a probability vector [P_1(x_i), ..., P_M(x_i)]. After that, we determine which SR network to use by selecting the index of the maximum probability value, J = \arg\max_j P_j(x_i). Then x_i is processed by the J-th branch of the SR-Module: y_i = f_{SR}^{J}(x_i). Finally, we combine all output sub-images \{y_i\}_{i=1}^{N} to get the final large SR image Y (2K-8K).

3.3. Class-Module

The goal of the Class-Module is to tell "whether the input sub-image is easy or hard to reconstruct" from low-level features. As shown in Fig. 4, we design the Class-Module as a simple classification network, which contains five convolution layers, an average pooling layer and a fully-connected layer. The convolution layers are responsible for feature extraction, while the pooling and fully-connected layers output the probability vector. This network is very light-weight and brings little additional computational cost. Experiments show that such a simple structure can already achieve satisfactory classification results.
Figure 4. Overview of the proposed ClassSR with the number of classes M = 3. The LR input is decomposed into sub-images \{x_i\}_{i=1}^{N}; the Class-Module (five convolutions – pooling – FC) outputs [P_1(x_i), P_2(x_i), P_3(x_i)], and J = \arg\max_j P_j(x_i) routes each sub-image to the Simple Net f_{SR}^{1} (16 channels), the Medium Net f_{SR}^{2} (36 channels) or the Complex Net f_{SR}^{3} (56 channels); the outputs \{y_i\}_{i=1}^{N} are combined into the SR result. Class-Module: generates the probability vector; SR-Module: processes the corresponding sub-images.
3.4. SR-Module

The SR-Module is designed as a container that consists of several independent branches \{f_{SR}^{j}\}_{j=1}^{M}. In general, each branch can be any learning-based SR network. As our goal is to accelerate an existing SR method (e.g., FSRCNN, CARN), we adopt this SR network as the base network and set it as the most complex branch f_{SR}^{M}. The other branches are obtained by reducing the network complexity of f_{SR}^{M}. For simplicity, we use the number of channels in each convolution layer to control the network complexity. Then how many channels are required for each SR branch? The principle is that each branch should achieve results comparable, on the corresponding class, to the base network trained with all data. For instance (see Table 1 and Fig. 4), the numbers of channels for f_{SR}^{1}, f_{SR}^{2}, f_{SR}^{3} can be 16, 36, 56, where 56 is the channel number of the base network. Note that we can also decrease the network complexity in other ways, such as reducing layers (see Sec. 4.3.4), as long as the network performance meets the above principle.

3.5. Classification Method

During training, the Class-Module classifies sub-images according to their restoration difficulties for a specific branch instead of predetermined labels. Therefore, different from testing, the input sub-image x should pass through all M SR branches. Besides, in order to ensure that the Class-Module can receive the gradients propagated from the reconstruction results, we multiply the reconstructed sub-images f_{SR}^{i}(x) with the corresponding classification probabilities P_i(x) to generate the final SR output y as:

y = \sum_{i=1}^{M} P_i(x) \times f_{SR}^{i}(x).   (1)

We just use the Image-Loss (L1 loss) to constrain y, and the classification probabilities are then obtained automatically. But during testing, the input only passes through the SR branch with the maximum probability. Thus, we propose L_c (Class-Loss, see Sec. 3.6.1) to make the maximum probability approach 1, so that y becomes equal to the sub-image reconstructed by that branch with probability 1. Note that if we only adopt the Image-Loss and Class-Loss, the training will easily converge to an extreme point where all images are classified into the most complex branch. To avoid such a biased result, we design L_a (Average-Loss, see Sec. 3.6.2) to constrain the classification results. This is our proposed new classification method.

3.6. Loss Functions

The loss function consists of three losses – the commonly used L1 loss (Image-Loss) and our proposed two losses L_c (Class-Loss) and L_a (Average-Loss). Specifically, L1 is used to ensure the image reconstruction quality, L_c improves the effectiveness of the classification, and L_a ensures that each SR branch can be chosen equally often. The loss function is:

L = w_1 \times L_1 + w_2 \times L_c + w_3 \times L_a,   (2)

where w_1, w_2 and w_3 are the weights balancing the loss terms. L1 is the 1-norm distance between the output image and the ground truth, just as in previous works [10, 13]. The two new losses L_c and L_a are detailed below.

3.6.1 Class-Loss

As mentioned in Sec. 3.5, the Class-Loss constrains the output probability distribution of the Class-Module. We prefer the Class-Module to have much higher confidence in the class with the maximum probability than in the others. For example, the classification result [0.90, 0.05, 0.05] is better than [0.34, 0.33, 0.33], as the latter looks like a random selection. The Class-Loss is formulated as:

L_c = -\sum_{i=1}^{M-1} \sum_{j=i+1}^{M} |P_i(x) - P_j(x)|, \quad \text{s.t.} \quad \sum_{i=1}^{M} P_i(x) = 1,   (3)
where M is the number of classes. L_c is the negative sum of the distances between each pair of class probabilities for the same sub-image. This loss greatly enlarges the probability gap between different classification results, so the maximum probability value will be close to 1.

3.6.2 Average-Loss

As mentioned in Sec. 3.5, if we only adopt the Image-Loss and Class-Loss, the sub-images are prone to be assigned to the most complex branch. This is because the most complex SR network can easily obtain better results. The Class-Module then loses its functionality and the SR-Module degenerates to the base network. To avoid this, we should ensure that each SR branch has an equal opportunity to be selected. Therefore, we design the Average-Loss to constrain the classification results. It is formulated as:

L_a = \sum_{i=1}^{M} \left| \sum_{j=1}^{B} P_i(x_j) - \frac{B}{M} \right|,   (4)

where B is the batch size. L_a is the sum, over classes, of the distance between the average number (B/M) and the number of sub-images assigned to each class within a batch. We use the probability sum \sum_{j=1}^{B} P_i(x_j) to compute the number of sub-images, because a hard count does not propagate gradients. With this loss, the numbers of sub-images that pass through each SR branch during training will be approximately the same.
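The following sketch shows how Eqs. (1)-(4) fit together during training, assuming the batched probability tensor comes from a Class-Module like the one sketched after Sec. 3.3. It is a schematic reading of the equations, not the authors' code; in particular, averaging the Class-Loss over the batch is our assumption.

import torch

def soft_sr_forward(x, class_module, sr_branches):
    # Training-time forward of Eq. (1): blend all branch outputs with the
    # class probabilities so gradients reach the Class-Module.
    probs = class_module(x)                                  # (B, M)
    outs = torch.stack([f(x) for f in sr_branches], dim=1)   # (B, M, C, H, W)
    y = (probs[:, :, None, None, None] * outs).sum(dim=1)    # sum_i P_i(x) f_SR^i(x)
    return y, probs

def class_loss(probs):
    # Eq. (3): negative sum of pairwise probability gaps (batch-averaged here).
    M = probs.shape[1]
    loss = torch.zeros(probs.shape[0], device=probs.device)
    for i in range(M - 1):
        for j in range(i + 1, M):
            loss = loss - (probs[:, i] - probs[:, j]).abs()
    return loss.mean()

def average_loss(probs):
    # Eq. (4): keep the soft per-class counts close to B/M within a batch.
    B, M = probs.shape
    return (probs.sum(dim=0) - B / M).abs().sum()

def total_loss(y, hr, probs, w1=2000.0, w2=1.0, w3=6.0):
    # Eq. (2), with the default weights given later in Sec. 4.1.3.
    image_loss = (y - hr).abs().mean()  # Image-Loss (L1)
    return w1 * image_loss + w2 * class_loss(probs) + w3 * average_loss(probs)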
3.7. Training Strategy

We propose to train ClassSR in three steps: first, pre-train the SR-Module; then train the Class-Module with the SR-Module fixed, using the proposed three losses; finally, fine-tune all networks jointly. The reason is that if we train both the Class-Module and the SR-Module from scratch, the performance will be very unstable, and the classification will easily fall into a bad local minimum.

To pre-train the SR-Module, we use the data classified by PSNR values. Specifically, all sub-images are passed through a well-trained MSRResNet. Then these sub-images are ranked according to their PSNR values. Next, the first 1/3 of the sub-images are assigned to the hard class, while the last 1/3 belong to the simple class, just as in Sec. 3.1. Then we train the simple/medium/complex SR branch on the corresponding simple/medium/hard data. Although using the PSNR obtained by MSRResNet to estimate the restoration difficulty is not perfect for the different SR branches, it provides them with a good starting point.

After that, we add the Class-Module and fix the parameters of the SR-Module. The overall model is trained with the three losses on all data. As shown in Fig. 6(a) and Fig. 6(b), this procedure gives the Class-Module a primary classification ability.

Afterwards, we relax all parameters and fine-tune the whole model. During joint training, the Class-Module refines its output probability vectors according to the final SR results, and the SR-Module updates according to the new classification results. In experiments (see Fig. 6), we find that the sub-images are assigned to different SR branches while the performance and efficiency improve simultaneously.

3.8. Discussion

We further clarify the unique features of ClassSR as follows. 1) The classification+SR strategy adopted by ClassSR has significant practical value. It is based on the observation that large-image SR (2K-8K) has different characteristics from small-image SR (e.g., the same content covers more pixels), and is thus more suitable for sub-image decomposition and special treatment. 2) While the idea of divide-and-conquer is straightforward, the novelty of our method lies in the joint optimization of classification and super-resolution. With a unified framework, we can simultaneously constrain the classification and reconstruction results by a dedicated loss combination. 3) ClassSR can be used together with previous methods for double acceleration.

4. Experiments

4.1. Setting

4.1.1 Training Data

We use the DIV2K [1] dataset for training. To prepare the training data, we first downsample³ the original images with scaling factors 0.6, 0.7, 0.8 and 0.9 to generate the HR images. These images are further ×4 downsampled to obtain the LR images. Then we densely crop 1.59M sub-images of size 32×32 from the LR images. These sub-images are equally divided into three classes (0.53M each) according to their PSNR values through MSRResNet [20]. All sub-images are further augmented by flipping and rotation. Finally, we obtain the "simple, medium, hard" datasets for SR-Module pre-training. Besides, we select ten images (index 0801-0810) from the DIV2K validation set for validation during training.

³ We use bicubic downsampling for all experiments.

4.1.2 Testing Data

Instead of the commonly used SR test sets, such as Set5 [3] and Set14 [22], whose images are too small to be decomposed, we select 300 images (index 1201-1500) from the DIV8K [7] dataset. Specifically, the first two hundred images are downsampled to 2K and 4K resolution, respectively, and used as the HR images of the Test2K and Test4K datasets. The last hundred images form the Test8K dataset. The LR images are also obtained by ×4 downsampling of the HR images.
Figure 5. Visual results of ClassSR and the original networks on 4K images with ×4 super-resolution, with per-crop PSNR/FLOPs (e.g., RCAN-O 25.29dB/32.58G vs. ClassSR-RCAN 25.30dB/20.21G (62%); SRResNet-O 30.40dB/5.20G vs. ClassSR-SRResNet 30.60dB/3.40G (65%)). The right images are 200×200 crops which contain decomposition borders (the size of the super-resolved sub-images is 128×128). -O: the original networks.

Model              Parameters   Test2K    FLOPs           Test4K    FLOPs           Test8K    FLOPs
FSRCNN-O           25K          25.61dB   468M (100%)     26.90dB   468M (100%)     32.66dB   468M (100%)
ClassSR-FSRCNN     113K         25.61dB   311M (66%)      26.91dB   286M (61%)      32.73dB   238M (51%)
CARN-O             295K         25.95dB   1.15G (100%)    27.34dB   1.15G (100%)    33.18dB   1.15G (100%)
ClassSR-CARN       645K         26.01dB   814M (71%)      27.42dB   742M (64%)      33.24dB   608M (53%)
SRResNet-O         1.5M         26.19dB   5.20G (100%)    27.65dB   5.20G (100%)    33.50dB   5.20G (100%)
ClassSR-SRResNet   3.1M         26.20dB   3.62G (70%)     27.66dB   3.30G (63%)     33.50dB   2.70G (52%)
RCAN-O             15.6M        26.39dB   32.60G (100%)   27.89dB   32.60G (100%)   33.76dB   32.60G (100%)
ClassSR-RCAN       30.1M        26.39dB   21.22G (65%)    27.88dB   19.49G (60%)    33.73dB   16.36G (50%)

Table 2. PSNR values on Test2K, Test4K and Test8K. -O: the original networks. Red/Blue text in the original: best performance/lowest FLOPs.
During testing, the LR images are cropped into 32×32 sub-images with stride 28. The super-resolved sub-images are combined into SR images by averaging the overlapping areas. We use the PSNR between the SR and HR images to evaluate the reconstruction performance, and the average FLOPs over all 32×32 sub-images within a test set to evaluate the computational cost.
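As a concrete reading of this protocol, the sketch below decomposes an LR image into 32×32 sub-images with stride 28 and recombines the ×4 outputs by averaging the overlapped areas. It is a simplified NumPy illustration (border handling when the image size is not aligned to the stride is omitted), and the function names are ours, not from the released code.

import numpy as np

def decompose(lr, size=32, stride=28):
    # Crop overlapping size x size sub-images with the given stride.
    H, W = lr.shape[:2]
    coords = [(r, c) for r in range(0, H - size + 1, stride)
                     for c in range(0, W - size + 1, stride)]
    return [lr[r:r + size, c:c + size] for (r, c) in coords], coords

def recombine(sub_srs, coords, hr_shape, scale=4, size=32):
    # Paste each super-resolved sub-image at its scaled position and
    # average wherever outputs overlap.
    acc = np.zeros(hr_shape, dtype=np.float64)
    cnt = np.zeros(hr_shape, dtype=np.float64)
    for y, (r, c) in zip(sub_srs, coords):
        rs, cs, s = r * scale, c * scale, size * scale
        acc[rs:rs + s, cs:cs + s] += y
        cnt[rs:rs + s, cs:cs + s] += 1.0
    return acc / np.maximum(cnt, 1.0)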
4.1.3 Training Details

First, we pre-train the SR-Module. f_{SR}^{1}, f_{SR}^{2} and f_{SR}^{3} are separately trained on the different training data ("simple, medium, hard"). The mini-batch size is set to 16. The L1 loss function [21] is adopted with the Adam optimizer [11] (β1 = 0.9, β2 = 0.999). The cosine annealing learning strategy is applied to adjust the learning rate. The initial learning rate is set to 10^-3 and the minimum to 10^-7. The period of the cosine is 500k iterations. Then we train the Class-Module with the three losses (the weights w1, w2, w3 are set to 2000, 1, 6) on all data. Note that we use a larger batch size (96), since the Average-Loss needs to balance the number of sub-images within each batch. The other settings are the same as in pre-training. The Class-Module is trained within 200k iterations. Finally, we train the two modules jointly with all settings unchanged. Besides, we also train the original network with all data for a larger number of iterations than ClassSR, for a fair comparison. All models are built on the PyTorch framework [16] and trained on NVIDIA 2080Ti GPUs.
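A minimal PyTorch rendering of this pre-training schedule might look as follows, assuming branch is one SR branch and loader yields LR/HR mini-batches of size 16; the variable names are ours.

import torch

optimizer = torch.optim.Adam(branch.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=500_000, eta_min=1e-7)   # 500k-iteration cosine period

for it, (lr_img, hr_img) in enumerate(loader):
    optimizer.zero_grad()
    loss = (branch(lr_img) - hr_img).abs().mean()  # Image-Loss (L1)
    loss.backward()
    optimizer.step()
    scheduler.step()                               # per-iteration annealing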

4.2. ClassSR with Existing SR Networks

ClassSR is a general framework that can incorporate most deep-learning-based SR methods, regardless of the network structure. We do not compare ClassSR with other network acceleration strategies, because they can also be further accelerated by ClassSR. To demonstrate its effectiveness, we use ClassSR to accelerate FSRCNN (tiny) [6], CARN (small) [2], SRResNet (middle) [13] and RCAN (large) [25], which are representative networks of different scales. Their SR-Modules all contain three branches. The most complex branch f_{SR}^{3} is the original network, while the other branches are obtained by reducing the channels in each convolution layer. Specifically, the channel configurations of the three branches are (16, 36, 56) for FSRCNN,⁴ (36, 52, 64) for CARN, (36, 52, 64) for SRResNet, and (36, 50, 64) for RCAN. Training and testing follow the same procedure as described above.

⁴ As FSRCNN has different numbers of channels in each layer, we only change the first conv. layer and the last deconv. layer.

Results are summarized in Table 2. Most ClassSR methods obtain better performance than the original networks with lower computational cost, ranging from 70% down to 50% of the original FLOPs. The reduction of FLOPs is highly correlated with the image resolution of the test data. The acceleration on Test8K is the most significant, nearly 2× (50% FLOPs) for all methods. This is because a larger input image can be decomposed into more sub-images, which have a higher probability of being processed by simple branches.

Figure 6. The training curves of the Class-Module (Class) and of joint training (Joint) for ClassSR-FSRCNN: (a) the PSNR curve and (b) the FLOPs curve of Class; (c) the PSNR curve and (d) the FLOPs curve of Joint.

To further understand how ClassSR works, we use ClassSR-FSRCNN to illustrate the behaviors and intermediate results of the different training stages. First, consider the performance of SR-Module pre-training. As shown in Table 1, the results of the SR branches on the corresponding validation sets are roughly the same as those of the original network. This is in accordance with our observation in Sec. 3.1. Then we show the validation curves of Class-Module training and joint training in Fig. 6. We can see that the PSNR values increase while the FLOPs decrease, even during the training of the Class-Module. This indicates that the increase in performance does not come at the cost of a computation burden. In other words, the input images are classified into more appropriate branches during the training process, demonstrating the effectiveness of both training procedures. After training, we test ClassSR-FSRCNN on Test8K. Statistically, 61%, 23% and 16% of the sub-images are assigned to FSRCNN (16), FSRCNN (36) and FSRCNN (56), respectively. The overall FLOPs drop from 468M to 236M. This further reflects the effectiveness of the classification.

Fig. 5 shows a visual example, where we observe that ClassSR methods obtain the same visual effects as the original networks. Furthermore, the transitions among different regions are smooth and natural. In other words, treating different regions differently brings no incoherence between adjacent sub-images.

Complexity Analysis. During testing, first, we use the average FLOPs of all 32×32 sub-images within a test set to evaluate the running time, because FLOPs are device-independent and well known to most researchers and engineers. The FLOPs already include the cost of the Class-Module, which is only 8M, almost negligible for the whole model. Second, we need to clarify that the aim of ClassSR is to save FLOPs rather than parameters. The former represents the real running time, while the latter mainly influences the memory. Note that the memory cost brought by the model parameters is much smaller than that of storing intermediate features, so the increased parameter count brought by ClassSR is acceptable.

4.3. Ablation Study

4.3.1 Effect of Class-Loss

In the ablation study, we test the effect of different components and settings with ClassSR-FSRCNN. First, we test the effect of the proposed Class-Loss by removing it from the loss function (w2 = 0). Fig. 7 shows the curves of PSNR and FLOPs during training. Without the Class-Loss, neither curve converges. This is because the output probability vectors of the Class-Module all become [0.333, 0.333, 0.333] under the influence of the Average-Loss. In other words, the input images are randomly assigned to an SR branch, leading to unstable performance. This demonstrates the importance of the Class-Loss.

Figure 7. Training curves ((a) PSNR, (b) FLOPs) of the Class-Module with and without the Class-Loss for ClassSR-FSRCNN.

4.3.2 Effect of Average-Loss

Then we evaluate the effect of the Average-Loss by removing it from the loss function (w3 = 0). From Fig. 8, we can see that both PSNR and FLOPs stop changing at a very early stage. The reason is that all input images are assigned to the most complex branch, which is a bad local minimum for the optimization. The Average-Loss is proposed to avoid such biased classification results.

Figure 8. Training curves ((a) PSNR, (b) FLOPs) of the Class-Module with and without the Average-Loss for ClassSR-FSRCNN.

4.3.3 Effect of the number of classes

We also investigate the effect of the number of classes, which is also the number of SR branches.
We conduct experiments with 2, 3, 4 and 5 classes. To pre-train the SR branches, we also divide the training data into the corresponding numbers of classes, using the same equal-division strategy as in Sec. 4.1.1. Correspondingly, we set different channel numbers for the different settings, as shown in Table 3. From the results, we observe that more classes bring better performance; however, the differences are insignificant. Even the case with two classes achieves satisfactory results. This shows that ClassSR is robust to the number of classes.

Model                                     Test2K    FLOPs        Test4K    FLOPs        Test8K    FLOPs
ClassSR-FSRCNN(2) (16, 56)                25.61dB   310M (66%)   26.91dB   280M (60%)   32.72dB   228M (49%)
ClassSR-FSRCNN(3) (16, 36, 56)            25.61dB   311M (66%)   26.91dB   286M (61%)   32.73dB   238M (51%)
ClassSR-FSRCNN(4) (16, 29, 43, 56)        25.61dB   298M (64%)   26.92dB   290M (62%)   32.73dB   238M (51%)
ClassSR-FSRCNN(5) (16, 26, 36, 46, 56)    25.63dB   306M (65%)   26.93dB   286M (61%)   32.74dB   248M (53%)

Table 3. PSNR values obtained by ClassSR with different numbers of classes. ClassSR-FSRCNN(2) (16, 56): ClassSR with 2 branches, where f_{SR}^{1} has 16 channels and f_{SR}^{2} has 56 channels.

4.3.4 Controlling network complexity in other ways

As mentioned in Sec. 3.4, we can also obtain branch networks with different complexities by changing the numbers of channels and layers at the same time. As shown in Table 4, this yields performance comparable to reducing only the channels as in Table 2. The reason why we do not reduce only the layers is that the FLOPs contributed by the middle layers account for a small proportion of the total FLOPs in light-weight networks (3% for FSRCNN, 58% for CARN and 47% for SRResNet). In other words, even removing all the middle layers of FSRCNN would reduce the FLOPs only slightly. Therefore, it is essential to select a proper way to reduce the network complexity for each base network.

Model                                     Test2K    FLOPs         Test4K    FLOPs         Test8K    FLOPs
ClassSR-SRResNet (38 12, 54 14, 64 16)    26.20dB   3.60G (69%)   27.65dB   3.28G (63%)   33.50dB   2.68G (52%)
ClassSR-SRResNet (42 8, 56 12, 64 16)     26.20dB   3.60G (69%)   27.65dB   3.28G (63%)   33.50dB   2.68G (52%)

Table 4. PSNR values obtained by ClassSR with different layers and channels on Test2K, Test4K and Test8K. ClassSR-SRResNet (38 12, 54 14, 64 16): f_{SR}^{1} has 38 channels and 12 layers, f_{SR}^{2} has 54 channels and 14 layers, f_{SR}^{3} has 64 channels and 16 layers.
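To make the middle-layer proportion of Sec. 4.3.4 concrete for FSRCNN, here is a back-of-the-envelope calculation based on the branch layout given in the supplementary file (Table 1 there), counting a convolution as 2 multiply-accumulates per weight application and charging the 9×9 deconvolution on its ×4 output grid. This accounting is our assumption, chosen because it reproduces the reported 141M/304M/468M totals.

def conv_flops(c_in, c_out, k, h, w):
    # 2 FLOPs (multiply + add) per weight application on an h x w feature map.
    return 2 * c_in * c_out * k * k * h * w

def fsrcnn_flops(nc, h=32, w=32, scale=4):
    head = conv_flops(3, nc, 5, h, w)                  # 5x5 feature extraction
    mid = (conv_flops(nc, 12, 1, h, w)                 # 1x1 shrink
           + 4 * conv_flops(12, 12, 3, h, w)           # 3x3 mapping layers
           + conv_flops(12, nc, 1, h, w))              # 1x1 expand
    tail = conv_flops(nc, 3, 9, h * scale, w * scale)  # 9x9 deconv, HR grid
    return head, mid, tail

for nc in (16, 36, 56):
    head, mid, tail = fsrcnn_flops(nc)
    total = head + mid + tail
    print(nc, f"{total/1e6:.0f}M", f"mid share: {mid/total:.1%}")
# -> roughly 141M / 304M / 468M, with the middle layers near 3% of the
#    total, matching the numbers quoted in Sec. 4.3.4.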
4.4. ClassSR in other low-level tasks

To demonstrate that our proposed ClassSR is flexible and can easily be applied to other low-level vision tasks where different regions have different restoration difficulties, we conduct experiments on image denoising. We use DnCNN with different channel numbers (38, 52, 64) as the Denoise-Module replacing the SR-Module. Then we train the network on DIV2K⁵ following the above training settings.

⁵ We use the 800 training images (0001-0800) in DIV2K with Gaussian noise (σ = 25) and set the patch size to 32×32.

Model      Test2K/FLOPs             Test4K/FLOPs
DnCNN-O    31.20dB / 1.14G (100%)   32.26dB / 1.14G (100%)
DnCNN-C    31.23dB / 0.83G (73%)    32.28dB / 0.76G (67%)

Table 5. PSNR values on Test2K and Test4K. -O: the original network. -C: denoising with the ClassSR framework.

As shown in Table 5, we evaluate the network on Test2K and Test4K with the same noise level. DnCNN with the ClassSR framework obtains higher PSNR than the original DnCNN with lower computational cost. Compared with SR, a noisy image does not contain as many "simple" sub-images for denoising. Therefore, the computational cost saved by ClassSR is not as large as in the SR task. Nevertheless, this result illustrates that ClassSR can be adapted to other low-level vision tasks.

5. Conclusion

In this work, we propose ClassSR with a new classification method and two novel loss functions, which can accelerate almost all learning-based SR methods on large images (2K-8K). The key idea is to use a Class-Module to classify the sub-images into different classes (e.g., "simple, medium, hard"), where each class corresponds to a processing branch with a different network capacity. Extensive experiments demonstrate that ClassSR can accelerate most existing methods on different datasets. Processing images with more "simple" regions (e.g., DIV8K) saves more FLOPs. Besides, ClassSR can also be applied to other low-level vision tasks.

Acknowledgements. This work was supported in part by the Shanghai Committee of Science and Technology, China (Grant No. 20DZ1100800), in part by the National Natural Science Foundation of China under Grant 61906184, the Science and Technology Service Network Initiative of the Chinese Academy of Sciences (KFJ-STS-QYZX-092), and the Shenzhen Institute of Artificial Intelligence and Robotics for Society.
References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252-268, 2018.
[3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
[4] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11065-11074, 2019.
[5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):295-307, 2015.
[6] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391-407. Springer, 2016.
[7] S. Gu, A. Lugmayr, M. Danelljan, M. Fritsche, J. Lamour, and R. Timofte. DIV8K: Diverse 8K resolution image dataset. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 3512-3516, 2019.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[9] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, pages 2024-2032, 2019.
[10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646-1654, 2016.
[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 624-632, 2017.
[13] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681-4690, 2017.
[14] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[15] Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, and Gangshan Wu. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[16] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[17] Yaniv Romano, John Isidoro, and Peyman Milanfar. RAISR: Rapid and accurate image super resolution. IEEE Transactions on Computational Imaging, 3(1):110-125, 2016.
[18] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874-1883, 2016.
[19] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[20] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[21] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600-612, 2004.
[22] Jianchao Yang, John Wright, Thomas S. Huang, and Yi Ma. Image super-resolution via sparse representation. IEEE Transactions on Image Processing, 19(11):2861-2873, 2010.
[23] Ke Yu, Chao Dong, Liang Lin, and Chen Change Loy. Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[24] Ke Yu, Xintao Wang, Chao Dong, Xiaoou Tang, and Chen Change Loy. Path-Restore: Learning network path selection for image restoration. arXiv preprint arXiv:1904.10343, 2019.
[25] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286-301, 2018.
[26] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472-2481, 2018.
[27] Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, and Chao Dong. Efficient image super-resolution using pixel attention, 2020.

ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristics
Supplementary File

Xiangtao Kong 1,2   Hengyuan Zhao 1   Yu Qiao 1,3   Chao Dong 1,4 *

1 Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Shanghai AI Lab, Shanghai, China
4 SIAT Branch, Shenzhen Institute of Artificial Intelligence and Robotics for Society
{xt.kong, hy.zhao1, yu.qiao, chao.dong}@siat.ac.cn

* Corresponding author (e-mail: chao.dong@siat.ac.cn)

Abstract

In this supplementary file, we first present more details and additional experimental results of our proposed ClassSR with existing SR networks; they illustrate the details of reducing the computational cost. Then we provide additional ablation experiments. Finally, we show more qualitative results of ClassSR to clearly show the effectiveness of our method.

1. Details of ClassSR with Existing SR networks

1.1. Network Architecture

In Tables 1, 2, 3 and 4, we provide more details of the branches (FSRCNN [3], CARN [2], SRResNet [4, 6] and RCAN [7]) used in ClassSR, to illustrate how to control the network scale.

layer name   kernel size       n_c
conv1        (3, n_c, 5, 5)    16   36   56
conv2        (n_c, 12, 1, 1)   16   36   56
conv3        (12, 12, 3, 3)    -    -    -
conv4        (12, 12, 3, 3)    -    -    -
conv5        (12, 12, 3, 3)    -    -    -
conv6        (12, 12, 3, 3)    -    -    -
conv7        (12, n_c, 1, 1)   16   36   56
deconv       (n_c, 3, 9, 9)    16   36   56
FLOPs        16: 141M; 36: 304M; 56: 468M

Table 1. The network architecture of the branches in ClassSR-FSRCNN. n_c: number of channels; (3, n_c, 5, 5): input channel 3, output channel n_c, convolutional layer with kernel size 5×5.

layer name   kernel size          n_c           group
conv1        (3, n_c, 3, 3)       36   52   64   1
Block1       (n_c, n_c, 3, 3)     36   52   64   4
C1           (n_c×2, n_c, 1, 1)   36   52   64   1
Block2       (n_c, n_c, 3, 3)     36   52   64   4
C2           (n_c×3, n_c, 1, 1)   36   52   64   1
Block3       (n_c, n_c, 3, 3)     36   52   64   4
C3           (n_c×4, n_c, 1, 1)   36   52   64   1
upconv1      (n_c, n_c×4, 3, 3)   36   52   64   4
upconv2      (n_c, n_c×4, 3, 3)   36   52   64   4
conv_out     (n_c, 3, 3, 3)       36   52   64   1
FLOPs        36: 0.38G; 52: 0.77G; 64: 1.15G

Table 2. The network architecture of the branches in ClassSR-CARN. n_c: number of channels; (3, n_c, 3, 3): input channel 3, output channel n_c, convolutional layer with kernel size 3×3.

1.2. Performance of Branches

We show all the results of the original networks and the SR branches used in ClassSR in Tables 5, 6, 7 and 8. Note that each small/large network is trained on the corresponding simple/complex dataset, while the original network is trained with all data (DIV2K [1]).

1.3. Classification Results

In this section, we provide more details of the classification results obtained by ClassSR, to illustrate how the FLOPs are reduced. For example, as shown in Table 9, ClassSR-FSRCNN assigns 61%, 23% and 16% of the sub-images of Test8K to FSRCNN (16) (simple), FSRCNN (36) (medium) and FSRCNN (56) (hard), respectively. Every branch has a different complexity, and the most complex branch, FSRCNN (56), has the same architecture as the original FSRCNN. Therefore, the overall FLOPs drop from 468M to 236M.
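As a rough consistency check of these numbers (our arithmetic, not from the paper), weighting the branch costs of Table 1 by the Table 9 percentages gives the expected per-sub-image cost:

# Expected per-sub-image FLOPs on Test8K from the rounded Table 9 percentages:
expected = 0.61 * 141 + 0.23 * 304 + 0.16 * 468   # MFLOPs per 32x32 sub-image
print(round(expected, 1))  # ~230.8M; with the ~8M Class-Module this is ~239M,
                           # consistent with the reported 236M once the
                           # rounding of the percentages is taken into account.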

layer name      kernel size          n_c
conv_first      (3, n_c, 3, 3)       36   52   64
[ResBlock]×16   (n_c, n_c, 3, 3)     36   52   64
                (n_c, n_c, 3, 3)
upconv1         (n_c×4, n_c, 3, 3)   36   52   64
upconv2         (n_c×4, n_c, 3, 3)   36   52   64
conv            (n_c, n_c, 3, 3)     36   52   64
conv_last       (n_c, 3, 3, 3)       36   52   64
FLOPs           36: 1.66G; 52: 3.44G; 64: 5.20G

Table 3. The network architecture of the branches in ClassSR-SRResNet. n_c: number of channels; (3, n_c, 3, 3): input channel 3, output channel n_c, convolutional layer with kernel size 3×3.

layer name    kernel size           n_c
conv1         (3, n_c, 3, 3)        36   50   64
[RCAB]×200    (n_c, n_c, 3, 3)      36   50   64
              (n_c, n_c, 3, 3)
              (n_c, n_c/16, 3, 3)
              (n_c/16, n_c, 3, 3)
upconv1       (n_c×4, n_c, 3, 3)    36   50   64
upconv2       (n_c×4, n_c, 3, 3)    36   50   64
conv_out      (n_c, 3, 3, 3)        36   50   64
FLOPs         36: 10.33G; 50: 19.90G; 64: 32.60G

Table 4. The network architecture of the branches in ClassSR-RCAN. n_c: number of channels; (3, n_c, 3, 3): input channel 3, output channel n_c, convolutional layer with kernel size 3×3.

Model           FLOPs   Simple    Medium    Hard
FSRCNN (16)     141M    42.71dB   29.28dB   22.42dB
FSRCNN (36)     304M    42.50dB   29.62dB   22.65dB
FSRCNN (56)     468M    42.00dB   29.61dB   22.73dB
FSRCNN-O (56)   468M    42.70dB   29.69dB   22.71dB

Table 5. PSNR values obtained by the three SR branches of ClassSR-FSRCNN on the different validation sets with ×4. -O: the original network trained with all data.

Model         FLOPs   Simple    Medium    Hard
CARN (36)     0.38G   42.88dB   29.83dB   22.68dB
CARN (52)     0.77G   43.01dB   30.36dB   23.06dB
CARN (64)     1.15G   43.14dB   30.45dB   23.23dB
CARN-O (64)   1.15G   43.25dB   30.33dB   23.08dB

Table 6. PSNR values obtained by the three SR branches of ClassSR-CARN on the different validation sets with ×4. -O: the original network trained with all data.

Model             FLOPs   Simple    Medium    Hard
SRResNet (36)     1.66G   43.63dB   30.70dB   23.21dB
SRResNet (52)     3.44G   43.67dB   30.91dB   23.47dB
SRResNet (64)     5.20G   43.52dB   30.85dB   23.54dB
SRResNet-O (64)   5.20G   43.68dB   30.93dB   23.52dB

Table 7. PSNR values obtained by the three SR branches of ClassSR-SRResNet on the different validation sets with ×4. -O: the original network trained with all data.

Model         FLOPs    Simple    Medium    Hard
RCAN (36)     10.33G   43.78dB   31.00dB   23.42dB
RCAN (50)     19.90G   43.82dB   31.25dB   23.78dB
RCAN (64)     32.60G   43.60dB   31.16dB   23.86dB
RCAN-O (64)   32.60G   43.84dB   31.27dB   23.86dB

Table 8. PSNR values obtained by the three SR branches of ClassSR-RCAN on the different validation sets with ×4. -O: the original network trained with all data.

2. Additional Experiments

2.1. Effect of Class-Loss and Average-Loss ratio

How should the effects of the Class-Loss and the Average-Loss be balanced? We conduct experiments by changing the weight of the Class-Loss (w2 = 0.5, 1, 2) while fixing the other weights (w1 = 2000, w3 = 6). Results are shown in Fig. 1. From Fig. 1(b) and Fig. 1(c), we observe that w2 = 1 achieves the best trade-off between PSNR and FLOPs. When w2 = 2, the performance decreases significantly. w2 = 0.5 seems comparable with w2 = 1; however, from Fig. 1(a), we can see that w2 = 0.5 (blue line) yields a much larger classification loss. A Class-Loss value around 0.4 indicates that the maximum classification probability is less than 80%, which is below our requirement. Therefore, we set w1 = 2000, w2 = 1, w3 = 6 as our default setting.

Figure 1. Training curves ((a) Class-Loss, (b) PSNR, (c) FLOPs) of the Class-Module with different weights of the Class-Loss for ClassSR-FSRCNN.

Furthermore, this phenomenon is reasonable because it is related to the characteristics of the loss functions. Specifically, a lower weight of the Class-Loss lets sub-images try more class selections and may lead to better performance. But a too-low weight of the Class-Loss prevents the maximum probability from approaching 1. Such a weight should therefore not be adopted, because it conflicts with testing.

Model              Test2K                             Test4K                             Test8K
FSRCNN-O           S:0% M:0% H:100% / 468M (100%)     S:0% M:0% H:100% / 468M (100%)     S:0% M:0% H:100% / 468M (100%)
ClassSR-FSRCNN     S:33% M:34% H:33% / 308M (66%)     S:43% M:29% H:28% / 284M (60%)     S:61% M:23% H:16% / 236M (50%)
CARN-O             S:0% M:0% H:100% / 1.15G (100%)    S:0% M:0% H:100% / 1.15G (100%)    S:0% M:0% H:100% / 1.15G (100%)
ClassSR-CARN       S:30% M:29% H:41% / 814M (71%)     S:40% M:27% H:33% / 742M (64%)     S:60% M:22% H:18% / 608M (53%)
SRResNet-O         S:0% M:0% H:100% / 5.20G (100%)    S:0% M:0% H:100% / 5.20G (100%)    S:0% M:0% H:100% / 5.20G (100%)
ClassSR-SRResNet   S:31% M:28% H:41% / 3.62G (70%)    S:41% M:26% H:33% / 3.30G (63%)    S:60% M:22% H:18% / 2.70G (52%)
RCAN-O             S:0% M:0% H:100% / 32.60G (100%)   S:0% M:0% H:100% / 32.60G (100%)   S:0% M:0% H:100% / 32.60G (100%)
ClassSR-RCAN       S:33% M:32% H:35% / 21.16G (65%)   S:42% M:29% H:29% / 19.47G (60%)   S:60% M:22% H:17% / 16.19G (50%)

Table 9. Classification results on Test2K, Test4K and Test8K. S/M/H: the fraction of sub-images assigned to the simple/medium/hard branch, followed by the resulting average FLOPs. -O: the original networks trained with all data.

Model                        Test2K    FLOPs        Test4K    FLOPs        Test8K    FLOPs
ClassSR-FSRCNN               25.61dB   308M (66%)   26.91dB   284M (61%)   32.73dB   236M (50%)
ClassSR-FSRCNN (20dB-35dB)   25.59dB   270M (58%)   26.89dB   254M (54%)   32.67dB   210M (45%)
ClassSR-FSRCNN (35dB-50dB)   25.50dB   364M (76%)   26.75dB   336M (71%)   32.62dB   310M (65%)

Table 10. PSNR values on Test2K, Test4K and Test8K. 20dB-35dB: the model is trained only with data whose PSNR obtained by MSRResNet [6] lies between 20dB and 35dB.

Model             Test2K    FLOPs         Test4K    FLOPs         Test8K    FLOPs
FSRCNN-O          25.61dB   468M (100%)   26.90dB   468M (100%)   32.66dB   468M (100%)
Gradient-FSRCNN   25.58dB   308M (66%)    26.87dB   284M (60%)    32.68dB   236M (50%)
ClassSR-FSRCNN    25.61dB   308M (66%)    26.91dB   284M (60%)    32.73dB   236M (50%)

Table 11. Comparison with gradient-based classification on Test2K, Test4K and Test8K.

Model              Parameters   Test8K_2K   FLOPs          Test8K_4K   FLOPs          Test8K    FLOPs
FSRCNN-O           25K          28.72dB     468M (100%)    30.27dB     468M (100%)    32.66dB   468M (100%)
ClassSR-FSRCNN     113K         28.73dB     282M (60%)     30.30dB     259M (55%)     32.73dB   236M (50%)
CARN-O             295K         29.33dB     1.15G (100%)   30.88dB     1.15G (100%)   33.18dB   1.15G (100%)
ClassSR-CARN       645K         29.29dB     0.72G (63%)    30.86dB     0.67G (58%)    33.24dB   0.61G (53%)
SRResNet-O         1.5M         29.55dB     5.20G (100%)   31.13dB     5.20G (100%)   33.50dB   5.20G (100%)
ClassSR-SRResNet   3.1M         29.56dB     3.20G (62%)    31.14dB     2.95G (57%)    33.50dB   2.70G (52%)
RCAN-O             15.6M        29.83dB     32.60G (100%)  31.41dB     32.60G (100%)  33.76dB   32.60G (100%)
ClassSR-RCAN       30.1M        29.80dB     18.98G (58%)   31.40dB     17.46G (54%)   33.73dB   16.19G (50%)

Table 12. PSNR values on Test8K_2K, Test8K_4K and Test8K. -O: the original networks. Test8K_2K: 2K images downsampled from Test8K (index 1400-1500 of the DIV8K dataset).

2.2. Effect of the range of training data

After the training of ClassSR, we find that the sub-images assigned to the simple branch are smooth. As shown in Table 10, when ClassSR is trained on DIV2K data with only a limited range of PSNR values, it obtains worse PSNR results, because a model trained with only simple or only complex data never learns how to deal with the other kind. Furthermore, the model trained only on simple data has a higher computational cost, because it "thinks" that normal data is relatively complex. Therefore, ClassSR should be trained with a wide range of data that contains both simple and complex samples.

2.3. Comparison with gradient-based Classification

Some methods, such as calculating gradients, can also distinguish whether a sub-image is smooth or not. Therefore, we calculate the average gradients of the sub-images of the training data and divide them into three classes by the gradient values (279.52, 556.79) to train the SR-Module of ClassSR-FSRCNN; after that, we use the same gradient thresholds (279.52, 556.79) as the classification standard for testing. As shown in Table 11, the PSNR of Gradient-FSRCNN is lower than those of ClassSR-FSRCNN and the original network.
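The text does not spell out the exact gradient statistic, so the following sketch is only one plausible reading of this baseline: classify each sub-image against the fixed thresholds (279.52, 556.79) given above, with the aggregation of horizontal and vertical differences being our assumption.

import numpy as np

def gradient_score(sub_image):
    # One plausible "average gradient" statistic; the exact normalization
    # used in the authors' experiment is not specified in the text.
    img = sub_image.astype(np.float64)
    dx = np.abs(np.diff(img, axis=1)).sum()
    dy = np.abs(np.diff(img, axis=0)).sum()
    return (dx + dy) / img.size

def gradient_class(sub_image, t1=279.52, t2=556.79):
    # Thresholds from Sec. 2.3: <= t1 -> simple, <= t2 -> medium, else hard.
    g = gradient_score(sub_image)
    return 0 if g <= t1 else (1 if g <= t2 else 2)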

2.4. Effect of the contents and the resolutions

To illustrate that the reduction of FLOPs is highly correlated with the image resolution of the test data, we also evaluate ClassSR on images with the same contents but different resolutions (see Table 12). There is effectively no difference between using the same images and a random selection: both lead to the same conclusion that ClassSR is more effective on high-resolution images, because a larger input image can be decomposed into more sub-images, which have a higher probability of being processed by simple branches. It also shows that the FLOPs reduction achieved by ClassSR depends on the specific images (see Sec. 3).

2.5. The actual running time

The actual running time behaves similarly to the FLOPs in this work. Some works [5] point out that FLOPs do not always have a strong relationship with running time, but this phenomenon is usually caused by differing structures. The branches of ClassSR, however, are directly derived from the original networks by reducing layers/channels, so the FLOPs and the running time in this work follow the same trend. The reason why we use FLOPs instead of time/activations is that FLOPs are device-independent and well known to most researchers and engineers. Besides, the sub-images after decomposition can be distributed to parallel processors for further acceleration in actual use.

3. More Qualitative Results

In this section, we provide additional qualitative results to clearly show the effectiveness of our ClassSR. We show the visual classification results in Fig. 2 and employ FLOPs and PSNR to evaluate them. ClassSR methods consistently obtain better performance with lower computational cost. Notably, 'DIV2K-0821' is so complex that almost all of its sub-images are assigned to the most complex network. In this case, ClassSR still performs well but saves only a little computational cost. However, ClassSR reduces the computational cost significantly when processing images that have more relatively simple regions. We also compare the visual results of the proposed ClassSR with the original networks in Fig. 3. The results illustrate that ClassSR methods obtain the same visual effects as the original networks, and show that treating different regions differently brings no incoherence between adjacent sub-images.

References

[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 252-268, 2018.
[3] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391-407. Springer, 2016.
[4] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681-4690, 2017.
[5] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollar. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[6] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[7] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286-301, 2018.
Figure 2. SR results (×4) of ClassSR-FSRCNN. The Class-Module classifies the sub-images into simple (green), medium (yellow) and hard (red) classes. ClassSR-FSRCNN vs. FSRCNN (PSNR / FLOPs): DIV2K-0843 (2K): 38.86dB / 180M (38%) vs. 38.83dB / 468M; DIV2K-0821 (2K): 29.27dB / 467M (99%) vs. 29.15dB / 468M; Test4K-1333 (4K): 25.80dB / 305M (65%) vs. 25.75dB / 468M; Test4K-1319 (4K): 33.30dB / 189M (40%) vs. 33.27dB / 468M; Test8K-1430 (8K): 45.86dB / 151M (32%) vs. 45.57dB / 468M; Test8K-1459 (8K): 41.77dB / 166M (36%) vs. 41.48dB / 468M.

Figure 3. Visual results of ClassSR and the original networks on ×4 super-resolution. -O: the original networks. For each example, ClassSR matches the original network's PSNR at a fraction of the FLOPs, e.g., RCAN-O 26.58dB/32.58G vs. ClassSR-RCAN 26.62dB/19.09G (59%); SRResNet-O 28.17dB/5.20G vs. ClassSR-SRResNet 28.27dB/3.20G (62%); CARN-O 29.38dB/1.15G vs. ClassSR-CARN 29.31dB/0.53G (46%); FSRCNN-O 22.52dB/468M vs. ClassSR-FSRCNN 22.53dB/356M (76%).
