
SwinIR: Image Restoration Using Swin Transformer

Jingyun Liang1  Jiezhang Cao1  Guolei Sun1  Kai Zhang1,*  Luc Van Gool1,2  Radu Timofte1
1 Computer Vision Lab, ETH Zurich, Switzerland    2 KU Leuven, Belgium
{jinliang, jiezcao, guosun, kai.zhang, vangool, timofter}@vision.ee.ethz.ch
https://github.com/JingyunLiang/SwinIR
arXiv:2108.10257v1 [eess.IV] 23 Aug 2021

Abstract

Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers, which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model, SwinIR, for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45dB, while the total number of parameters can be reduced by up to 67%.

[Figure 1: PSNR vs. the total number of parameters of different methods for image SR (×4) on Set5 [3]. Compared methods: SwinIR (ours), HAN (ECCV2020), NLSA (CVPR2021), IPT (CVPR2021), IGNN (NeurIPS2020), RCAN (ECCV2018), OISR (CVPR2019), RNAN (ICLR2019), RDN (CVPR2018), EDSR (CVPR2017).]

1. Introduction

Image restoration, such as image super-resolution (SR), image denoising and JPEG compression artifact reduction, aims to reconstruct the high-quality clean image from its low-quality degraded counterpart. Since several revolutionary works [18, 40, 90, 91], convolutional neural networks (CNN) have become the primary workhorse for image restoration [43, 51, 81, 92, 95, 24, 93, 46, 89, 88]. Most CNN-based methods focus on elaborate architecture designs such as residual learning [43, 51] and dense connections [97, 81]. Although the performance is significantly improved compared with traditional model-based methods [73, 14, 28], they generally suffer from two basic problems that stem from the basic convolution layer. First, the interactions between images and convolution kernels are content-independent: using the same convolution kernel to restore different image regions may not be the best choice. Second, under the principle of local processing, convolution is not effective for long-range dependency modelling.

As an alternative to CNN, Transformer [76] designs a self-attention mechanism to capture global interactions between contexts and has shown promising performance on several vision problems [6, 74, 19, 56]. However, vision Transformers for image restoration [9, 5] usually divide the input image into patches of fixed size (e.g., 48×48) and process each patch independently. Such a strategy inevitably gives rise to two drawbacks. First, border pixels cannot utilize neighbouring pixels that lie outside the patch for image restoration. Second, the restored image may introduce border artifacts around each patch. While this problem can be alleviated by patch overlapping, it would introduce extra computational burden.

Recently, Swin Transformer [56] has shown great promise as it integrates the advantages of both CNN and Transformer. On the one hand, it has the advantage of CNN to process images of large size due to the local attention mechanism. On the other hand, it has the advantage of Transformer to model long-range dependency with the shifted window scheme.

* Corresponding author.
In this paper, we propose an image restoration model, namely SwinIR, based on the Swin Transformer. More specifically, SwinIR consists of three modules: shallow feature extraction, deep feature extraction and high-quality image reconstruction modules. The shallow feature extraction module uses a convolution layer to extract shallow features, which are directly transmitted to the reconstruction module so as to preserve low-frequency information. The deep feature extraction module is mainly composed of residual Swin Transformer blocks (RSTB), each of which utilizes several Swin Transformer layers for local attention and cross-window interaction. In addition, we add a convolution layer at the end of the block for feature enhancement and use a residual connection to provide a shortcut for feature aggregation. Finally, both shallow and deep features are fused in the reconstruction module for high-quality image reconstruction.

Compared with prevalent CNN-based image restoration models, Transformer-based SwinIR has several benefits: (1) content-based interactions between image content and attention weights, which can be interpreted as spatially varying convolution [13, 21, 75]; (2) long-range dependency modelling enabled by the shifted window mechanism; (3) better performance with fewer parameters. For example, as shown in Fig. 1, SwinIR achieves better PSNR with fewer parameters compared with existing image SR methods.

2. Related Work

2.1. Image Restoration

Compared to traditional image restoration methods [28, 72, 73, 62, 32], which are generally model-based, learning-based methods, especially CNN-based methods, have become more popular due to their impressive performance. They often learn mappings between low-quality and high-quality images from large-scale paired datasets. Since the pioneering work of SRCNN [18] (for image SR), DnCNN [90] (for image denoising) and ARCNN [17] (for JPEG compression artifact reduction), a flurry of CNN-based models have been proposed to improve model representation ability by using more elaborate neural network architecture designs, such as residual blocks [40, 7, 88], dense blocks [81, 97, 98] and others [10, 42, 93, 78, 77, 79, 50, 48, 49, 92, 70, 36, 83, 30, 11, 16, 96, 64, 38, 26, 41, 25]. Some of them have exploited the attention mechanism inside the CNN framework, such as channel attention [95, 15, 63], non-local attention [52, 61] and adaptive patch aggregation [100].

2.2. Vision Transformer

Recently, the natural language processing model Transformer [76] has gained much popularity in the computer vision community. When used in vision problems such as image classification [66, 19, 84, 56, 45, 55, 75], object detection [6, 53, 74, 56], segmentation [84, 99, 56, 4] and crowd counting [47, 69], it learns to attend to important image regions by exploring the global interactions between different regions. Due to its impressive performance, Transformer has also been introduced for image restoration [9, 5, 82]. Chen et al. [9] proposed a backbone model IPT for various restoration problems based on the standard Transformer. However, IPT relies on a large number of parameters (over 115.5M), large-scale datasets (over 1.1M images) and multi-task learning for good performance. Cao et al. [5] proposed VSR-Transformer, which uses the self-attention mechanism for better feature fusion in video SR, but image features are still extracted by a CNN. Besides, both IPT and VSR-Transformer use patch-wise attention, which may be improper for image restoration. In addition, a concurrent work [82] proposed a U-shaped architecture based on the Swin Transformer [56].

3. Method

3.1. Network Architecture

As shown in Fig. 2, SwinIR consists of three modules: shallow feature extraction, deep feature extraction and high-quality (HQ) image reconstruction modules. We employ the same feature extraction modules for all restoration tasks, but use different reconstruction modules for different tasks.

Shallow and deep feature extraction. Given a low-quality (LQ) input I_LQ ∈ R^(H×W×C_in) (H, W and C_in are the image height, width and input channel number, respectively), we use a 3×3 convolutional layer H_SF(·) to extract the shallow feature F_0 ∈ R^(H×W×C) as

F_0 = H_SF(I_LQ),    (1)

where C is the feature channel number. The convolution layer is good at early visual processing, leading to more stable optimization and better results [86]. It also provides a simple way to map the input image space to a higher-dimensional feature space. Then, we extract the deep feature F_DF ∈ R^(H×W×C) from F_0 as

F_DF = H_DF(F_0),    (2)

where H_DF(·) is the deep feature extraction module and it contains K residual Swin Transformer blocks (RSTB) and a 3×3 convolutional layer. More specifically, intermediate features F_1, F_2, ..., F_K and the output deep feature F_DF are extracted block by block as

F_i = H_RSTB_i(F_{i-1}),  i = 1, 2, ..., K,
F_DF = H_CONV(F_K),    (3)

where H_RSTB_i(·) denotes the i-th RSTB and H_CONV is the last convolutional layer.
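As a concrete illustration of Eqs. (1)-(3), a minimal PyTorch-style sketch of the feature extraction path could look as follows. The `rstb_builder` is a placeholder for the residual Swin Transformer block of Sec. 3.2, and all names and defaults are illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of Eqs. (1)-(3): a 3x3 conv for the shallow feature F_0,
    then K residual Swin Transformer blocks and a 3x3 conv for F_DF."""
    def __init__(self, in_ch=3, dim=180, num_rstb=6, rstb_builder=None):
        super().__init__()
        self.conv_sf = nn.Conv2d(in_ch, dim, 3, padding=1)   # H_SF in Eq. (1)
        # rstb_builder() should return an RSTB (Sec. 3.2); an identity block
        # is used here only so that the sketch runs on its own.
        build = rstb_builder if rstb_builder is not None else (lambda: nn.Identity())
        self.rstbs = nn.ModuleList([build() for _ in range(num_rstb)])  # H_RSTB_i
        self.conv_df = nn.Conv2d(dim, dim, 3, padding=1)      # H_CONV in Eq. (3)

    def forward(self, i_lq):
        f0 = self.conv_sf(i_lq)            # Eq. (1)
        f = f0
        for rstb in self.rstbs:            # Eq. (3), block by block
            f = rstb(f)
        f_df = self.conv_df(f)             # Eqs. (2)-(3)
        return f0, f_df

# Example: f0, f_df = FeatureExtractor()(torch.randn(1, 3, 64, 64))
```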
Using a convolutional layer at the end of feature extraction can bring the inductive bias of the convolution operation into the Transformer-based network, and lay a better foundation for the later aggregation of shallow and deep features.

[Figure 2: The architecture of the proposed SwinIR for image restoration: shallow feature extraction (Conv), deep feature extraction (a sequence of RSTBs followed by a Conv, with a residual connection), and HQ image reconstruction. (a) Residual Swin Transformer Block (RSTB): several Swin Transformer Layers (STL), a Conv and a residual connection. (b) Swin Transformer Layer (STL): LayerNorm, MSA and MLP with residual connections.]

Image reconstruction. Taking image SR as an example, we reconstruct the high-quality image I_RHQ by aggregating shallow and deep features as

I_RHQ = H_REC(F_0 + F_DF),    (4)

where H_REC(·) is the function of the reconstruction module. The shallow feature mainly contains low frequencies, while the deep feature focuses on recovering lost high frequencies. With a long skip connection, SwinIR can transmit the low-frequency information directly to the reconstruction module, which helps the deep feature extraction module focus on high-frequency information and stabilizes training. For the implementation of the reconstruction module, we use the sub-pixel convolution layer [68] to upsample the feature. For tasks that do not need upsampling, such as image denoising and JPEG compression artifact reduction, a single convolution layer is used for reconstruction. Besides, we use residual learning to reconstruct the residual between the LQ and the HQ image instead of the HQ image itself. This is formulated as

I_RHQ = H_SwinIR(I_LQ) + I_LQ,    (5)

where H_SwinIR(·) denotes the function of SwinIR.

Loss function. For image SR, we optimize the parameters of SwinIR by minimizing the L1 pixel loss

L = ||I_RHQ - I_HQ||_1,    (6)

where I_RHQ is obtained by taking I_LQ as the input of SwinIR, and I_HQ is the corresponding ground-truth HQ image. For classical and lightweight image SR, we only use the naive L1 pixel loss, the same as previous work, to show the effectiveness of the proposed network. For real-world image SR, we use a combination of pixel loss, GAN loss and perceptual loss [81, 89, 80, 27, 39] to improve visual quality.

For image denoising and JPEG compression artifact reduction, we use the Charbonnier loss [8]

L = sqrt( ||I_RHQ - I_HQ||^2 + ε^2 ),    (7)

where ε is a constant that is empirically set to 10^-3.
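To make the training objectives concrete, below is a small sketch of the two losses in Eqs. (6) and (7). The helper names are hypothetical, and the Charbonnier term is written in the common per-pixel form and then averaged, with ε = 10^-3 as stated above.

```python
import torch

def l1_pixel_loss(i_rhq, i_hq):
    # Eq. (6): L1 pixel loss, used for classical and lightweight image SR.
    return torch.mean(torch.abs(i_rhq - i_hq))

def charbonnier_loss(i_rhq, i_hq, eps=1e-3):
    # Eq. (7): Charbonnier loss, used for image denoising and
    # JPEG compression artifact reduction.
    return torch.mean(torch.sqrt((i_rhq - i_hq) ** 2 + eps ** 2))
```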
3.2. Residual Swin Transformer Block

As shown in Fig. 2(a), the residual Swin Transformer block (RSTB) is a residual block with Swin Transformer layers (STL) and convolutional layers. Given the input feature F_{i,0} of the i-th RSTB, we first extract intermediate features F_{i,1}, F_{i,2}, ..., F_{i,L} by L Swin Transformer layers as

F_{i,j} = H_STL_{i,j}(F_{i,j-1}),  j = 1, 2, ..., L,    (8)

where H_STL_{i,j}(·) is the j-th Swin Transformer layer in the i-th RSTB. Then, we add a convolutional layer before the residual connection. The output of the RSTB is formulated as

F_{i,out} = H_CONV_i(F_{i,L}) + F_{i,0},    (9)

where H_CONV_i(·) is the convolutional layer in the i-th RSTB. This design has two benefits. First, although the Transformer can be viewed as a specific instantiation of spatially varying convolution [21, 75], convolutional layers with spatially invariant filters can enhance the translational equivariance of SwinIR. Second, the residual connection provides an identity-based connection from different blocks to the reconstruction module, allowing the aggregation of different levels of features.
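A minimal sketch of the RSTB of Eqs. (8)-(9) is given below. The Swin Transformer layer is passed in as a builder (the STL itself is described in the next subsection); keeping everything in (B, C, H, W) layout and the default sizes are simplifying assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class RSTB(nn.Module):
    """Residual Swin Transformer Block (Eqs. (8)-(9)):
    L Swin Transformer layers, a 3x3 conv, and a residual connection."""
    def __init__(self, dim=180, num_stl=6, stl_builder=None):
        super().__init__()
        # stl_builder() should return a Swin Transformer layer (STL);
        # an identity layer keeps this sketch self-contained.
        build = stl_builder if stl_builder is not None else (lambda: nn.Identity())
        self.stls = nn.ModuleList([build() for _ in range(num_stl)])
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)          # H_CONV_i

    def forward(self, f_in):
        f = f_in
        for stl in self.stls:          # Eq. (8): F_{i,j} = H_STL_{i,j}(F_{i,j-1})
            f = stl(f)
        return self.conv(f) + f_in     # Eq. (9): conv + residual connection
```

With such a block, the `rstb_builder` argument of the earlier `FeatureExtractor` sketch could simply be `lambda: RSTB(dim=180, num_stl=6)`.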
Swin Transformer layer. The Swin Transformer layer (STL) [56] is based on the standard multi-head self-attention of the original Transformer layer [76]. The main differences lie in local attention and the shifted window mechanism. As shown in Fig. 2(b), given an input of size H × W × C, the Swin Transformer first reshapes the input to a (HW/M^2) × M^2 × C feature by partitioning the input into non-overlapping M × M local windows, where HW/M^2 is the total number of windows. Then, it computes the standard self-attention separately for each window (i.e., local attention). For a local window feature X ∈ R^(M^2×C), the query, key and value matrices Q, K and V are computed as

Q = X P_Q,  K = X P_K,  V = X P_V,    (10)

where P_Q, P_K and P_V are projection matrices that are shared across different windows. Generally, we have Q, K, V ∈ R^(M^2×d). The attention matrix is thus computed by the self-attention mechanism in a local window as

Attention(Q, K, V) = SoftMax(Q K^T / √d + B) V,    (11)

where B is the learnable relative positional encoding. In practice, following [76], we perform the attention function h times in parallel and concatenate the results for multi-head self-attention (MSA).

Next, a multi-layer perceptron (MLP) that has two fully-connected layers with a GELU non-linearity between them is used for further feature transformations. A LayerNorm (LN) layer is added before both MSA and MLP, and residual connections are employed for both modules. The whole process is formulated as

X = MSA(LN(X)) + X,
X = MLP(LN(X)) + X.    (12)

However, when the partition is fixed for different layers, there is no connection across local windows. Therefore, regular and shifted window partitioning are used alternately to enable cross-window connections [56], where shifted window partitioning means shifting the feature by (⌊M/2⌋, ⌊M/2⌋) pixels before partitioning.
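To illustrate the window partitioning and attention of Eqs. (10)-(11), here is a simplified single-head sketch. The relative positional bias B, multi-head concatenation and shifted-window masking are omitted; shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, m):
    # x: (B, H, W, C) -> (num_windows*B, M*M, C), non-overlapping M x M windows.
    b, h, w, c = x.shape
    x = x.view(b, h // m, m, w // m, m, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, c)

class WindowAttention(nn.Module):
    """Single-head window self-attention, Eqs. (10)-(11) without the bias B."""
    def __init__(self, dim):
        super().__init__()
        self.p_q = nn.Linear(dim, dim, bias=False)   # P_Q
        self.p_k = nn.Linear(dim, dim, bias=False)   # P_K
        self.p_v = nn.Linear(dim, dim, bias=False)   # P_V

    def forward(self, x_windows):                    # (num_windows*B, M*M, C)
        q, k, v = self.p_q(x_windows), self.p_k(x_windows), self.p_v(x_windows)
        d = q.shape[-1]
        attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return attn @ v                              # Eq. (11)

# Shifted-window partitioning (used in every other STL) can be sketched by
# rolling the feature map by (-M//2, -M//2) along H and W before partitioning,
# e.g. torch.roll(x, shifts=(-m // 2, -m // 2), dims=(1, 2)).
```

For instance, `WindowAttention(180)(window_partition(torch.randn(1, 64, 64, 180), 8))` attends within each 8×8 window of a 64×64 feature map.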
4. Experiments

4.1. Experimental Setup

For classical image SR, real-world image SR, image denoising and JPEG compression artifact reduction, the RSTB number, STL number, window size, channel number and attention head number are generally set to 6, 6, 8, 180 and 6, respectively. One exception is that the window size is set to 7 for JPEG compression artifact reduction, as we observe a significant performance drop when using 8, possibly because JPEG encoding uses 8 × 8 image partitions. For lightweight image SR, we decrease the RSTB number and channel number to 4 and 60, respectively. Following [95, 63], when the self-ensemble strategy [51] is used in testing, we mark the model with the symbol "+", e.g., SwinIR+. Due to the page limit, training and evaluation details are provided in the supplementary.
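Assuming the module sketches above, the hyper-parameters stated in Sec. 4.1 could be collected into a small configuration like the following (names are illustrative; the released code uses its own arguments):

```python
# Settings from Sec. 4.1 for classical SR, real-world SR and denoising.
swinir_base = dict(
    num_rstb=6,      # RSTB number
    num_stl=6,       # STL number per RSTB
    window_size=8,   # set to 7 for JPEG artifact reduction (8x8 JPEG blocks)
    dim=180,         # channel number
    num_heads=6,     # attention head number
)

# Lightweight SR variant: fewer RSTBs and channels.
swinir_light = dict(swinir_base, num_rstb=4, dim=60)
```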
4.2. Ablation Study and Discussion

For the ablation study, we train SwinIR on DIV2K [1] for classical image SR (×2) and test it on Manga109 [60].

[Figure 3: Ablation study on different settings of SwinIR. Results are tested on Manga109 [60] for image SR (×2). PSNR (dB) is plotted against (a) channel number, (b) RSTB number, (c) layer number in a RSTB, (d) training patch size, (e) percentage of used training images and (f) training iterations; (d)-(f) compare SwinIR with the CNN-based RCAN.]

Impact of channel number, RSTB number and STL number. We show the effects of channel number, RSTB number and STL number in a RSTB on model performance in Figs. 3(a), 3(b) and 3(c), respectively. It is observed that the PSNR is positively correlated with these three hyper-parameters. For the channel number, although the performance keeps increasing, the total number of parameters grows quadratically. To balance performance and model size, we choose 180 as the channel number in the rest of the experiments. As for the RSTB number and layer number, the performance gain gradually becomes saturated. We choose 6 for both of them to obtain a relatively small model.

Impact of patch size and training image number; model convergence comparison. We compare the proposed SwinIR with a representative CNN-based model, RCAN, to compare the difference between Transformer-based and CNN-based models. From Fig. 3(d), one can see that SwinIR performs better than RCAN on different patch sizes, and the PSNR gain becomes larger when the patch size is larger. Fig. 3(e) shows the impact of the number of training images. Extra images from Flickr2K are used in training when the percentage is larger than 100% (800 images). There are two observations. First, as expected, the performance of SwinIR rises with the training image number. Second, different from the observation in IPT that Transformer-based models rely heavily on large amounts of training data, SwinIR achieves better results than CNN-based models using the same training data, even when the dataset is small (i.e., 25%, 200 images). We also plot the PSNR during training for both SwinIR and RCAN in Fig. 3(f). It is clear that SwinIR converges faster and better than RCAN, which contradicts previous observations that Transformer-based models often suffer from slow model convergence.

Impact of residual connection and convolution layer in RSTB. Table 1 shows four residual connection variants in RSTB: no residual connection, using a 1 × 1 convolution layer, using a 3 × 3 convolution layer and using three 3 × 3 convolution layers (the channel number of the intermediate layers is set to one fourth of the network channel number). From the table, we can make the following observations. First, the residual connection in RSTB is important, as it improves the PSNR by 0.16dB. Second, using 1 × 1 convolution brings little improvement, maybe because it cannot extract local neighbouring information as 3 × 3 convolution does. Third, although using three 3 × 3 convolution layers can reduce the number of parameters, the performance drops slightly.

Table 1: Ablation study on RSTB design.
Design | No residual | 1 × 1 conv | 3 × 3 conv | Three 3 × 3 conv
PSNR   | 39.42       | 39.45      | 39.58      | 39.56

4.3. Results on Image SR

Classical image SR. Table 2 shows the quantitative comparisons between SwinIR (middle size) and state-of-the-art methods: DBPN [31], RCAN [95], RRDB [81], SAN [15], IGNN [100], HAN [63], NLSA [61] and IPT [9]. As one can see, when trained on DIV2K, SwinIR achieves the best performance on almost all five benchmark datasets for all scale factors. The maximum PSNR gain reaches 0.26dB on Manga109 for scale factor 4. Note that RCAN and HAN introduce channel and spatial attention, IGNN proposes adaptive patch feature aggregation, and NLSA is based on the non-local attention mechanism. However, all these CNN-based attention mechanisms perform worse than the proposed Transformer-based SwinIR, which indicates the effectiveness of the proposed model. When we train SwinIR on a larger dataset (DIV2K+Flickr2K), the performance further increases by a large margin (up to 0.47dB), achieving better accuracy than the Transformer-based model IPT, even though IPT utilizes ImageNet (more than 1.3M images) in training and has a huge number of parameters (115.5M). In contrast, SwinIR has a small number of parameters (11.8M) even compared with state-of-the-art CNN-based models (15.4∼44.3M). As for runtime, the representative CNN-based model RCAN, IPT and SwinIR take about 0.2s, 4.5s and 1.1s, respectively, to test on a 1,024 × 1,024 image. Visual comparisons are shown in Fig. 4. SwinIR can restore high-frequency details and alleviate blurring artifacts, resulting in sharp and natural edges. In contrast, most CNN-based methods produce blurry images or even incorrect textures. IPT generates better images compared with CNN-based methods, but it suffers from image distortions and border artifacts.

Lightweight image SR. We also provide a comparison of SwinIR (small size) with state-of-the-art lightweight image SR methods: CARN [2], FALSR-A [12], IMDN [35], LAPAR-A [44] and LatticeNet [57]. In addition to PSNR and SSIM, we also report the total numbers of parameters and multiply-accumulate operations (evaluated on a 1280 × 720 HQ image) to compare the model size and computational complexity of different models. As shown in Table 3, SwinIR outperforms competitive methods by a PSNR margin of up to 0.53dB on different benchmark datasets, with similar total numbers of parameters and multiply-accumulate operations. This indicates that the SwinIR architecture is highly efficient for image restoration.

Real-world image SR. The ultimate goal of image SR is real-world applications. Recently, Zhang et al. [89] proposed a practical degradation model, BSRGAN, for real-world image SR and achieved surprising results in real scenarios¹. To test the performance of SwinIR for real-world SR, we re-train SwinIR by using the same degradation model as BSRGAN for low-quality image synthesis. Since there are no ground-truth high-quality images, we only provide a visual comparison with the representative bicubic model ESRGAN [81] and state-of-the-art real-world image SR models RealSR [37], BSRGAN [89] and Real-ESRGAN [80]. As shown in Fig. 5, SwinIR produces visually pleasing images with clear and sharp edges, whereas the other compared methods may suffer from unsatisfactory artifacts.

¹ https://github.com/cszn/BSRGAN
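The "+" entries in the following tables (e.g., SwinIR+) use the self-ensemble strategy [51] mentioned in Sec. 4.1. A minimal sketch of such test-time ensembling, assuming a trained restoration `model` and the usual eight flip/rotation transforms, might look like this (illustrative only, not the authors' exact code):

```python
import torch

def self_ensemble(model, lq):
    """x8 geometric self-ensemble: run the model on rotated/flipped copies
    of the LQ input, undo each transform on the output, and average."""
    outputs = []
    for k in range(4):                          # rotations by 0/90/180/270 degrees
        for flip in (False, True):              # with and without horizontal flip
            x = torch.rot90(lq, k, dims=(-2, -1))
            if flip:
                x = torch.flip(x, dims=(-1,))
            with torch.no_grad():
                y = model(x)
            if flip:                            # invert the transforms on the output
                y = torch.flip(y, dims=(-1,))
            y = torch.rot90(y, -k, dims=(-2, -1))
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```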
Table 2: Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for classical image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively. Results on ×8 are provided in the supplementary.
Method | Scale | Training Dataset | Set5 [3] PSNR/SSIM | Set14 [87] PSNR/SSIM | BSD100 [58] PSNR/SSIM | Urban100 [34] PSNR/SSIM | Manga109 [60] PSNR/SSIM
RCAN [95] ×2 DIV2K 38.27 0.9614 34.12 0.9216 32.41 0.9027 33.34 0.9384 39.44 0.9786
SAN [15] ×2 DIV2K 38.31 0.9620 34.07 0.9213 32.42 0.9028 33.10 0.9370 39.32 0.9792
IGNN [100] ×2 DIV2K 38.24 0.9613 34.07 0.9217 32.41 0.9025 33.23 0.9383 39.35 0.9786
HAN [63] ×2 DIV2K 38.27 0.9614 34.16 0.9217 32.41 0.9027 33.35 0.9385 39.46 0.9785
NLSA [61] ×2 DIV2K 38.34 0.9618 34.08 0.9231 32.43 0.9027 33.42 0.9394 39.59 0.9789
SwinIR (Ours) ×2 DIV2K 38.35 0.9620 34.14 0.9227 32.44 0.9030 33.40 0.9393 39.60 0.9792
SwinIR+ (Ours) ×2 DIV2K 38.38 0.9621 34.24 0.9233 32.47 0.9032 33.51 0.9401 39.70 0.9794
DBPN [31] ×2 DIV2K+Flickr2K 38.09 0.9600 33.85 0.9190 32.27 0.9000 32.55 0.9324 38.89 0.9775
IPT [9] ×2 ImageNet 38.37 - 34.43 - 32.48 - 33.76 - - -
SwinIR (Ours) ×2 DIV2K+Flickr2K 38.42 0.9623 34.46 0.9250 32.53 0.9041 33.81 0.9427 39.92 0.9797
SwinIR+ (Ours) ×2 DIV2K+Flickr2K 38.46 0.9624 34.61 0.9260 32.55 0.9043 33.95 0.9433 40.02 0.9800
RCAN [95] ×3 DIV2K 34.74 0.9299 30.65 0.8482 29.32 0.8111 29.09 0.8702 34.44 0.9499
SAN [15] ×3 DIV2K 34.75 0.9300 30.59 0.8476 29.33 0.8112 28.93 0.8671 34.30 0.9494
IGNN [100] ×3 DIV2K 34.72 0.9298 30.66 0.8484 29.31 0.8105 29.03 0.8696 34.39 0.9496
HAN [63] ×3 DIV2K 34.75 0.9299 30.67 0.8483 29.32 0.8110 29.10 0.8705 34.48 0.9500
NLSA [61] ×3 DIV2K 34.85 0.9306 30.70 0.8485 29.34 0.8117 29.25 0.8726 34.57 0.9508
SwinIR (Ours) ×3 DIV2K 34.89 0.9312 30.77 0.8503 29.37 0.8124 29.29 0.8744 34.74 0.9518
SwinIR+ (Ours) ×3 DIV2K 34.95 0.9316 30.83 0.8511 29.41 0.8130 29.42 0.8761 34.92 0.9526
IPT [9] ×3 ImageNet 34.81 - 30.85 - 29.38 - 29.49 - - -
SwinIR (Ours) ×3 DIV2K+Flickr2K 34.97 0.9318 30.93 0.8534 29.46 0.8145 29.75 0.8826 35.12 0.9537
SwinIR+ (Ours) ×3 DIV2K+Flickr2K 35.04 0.9322 31.00 0.8542 29.49 0.8150 29.90 0.8841 35.28 0.9543
RCAN [95] ×4 DIV2K 32.63 0.9002 28.87 0.7889 27.77 0.7436 26.82 0.8087 31.22 0.9173
SAN [15] ×4 DIV2K 32.64 0.9003 28.92 0.7888 27.78 0.7436 26.79 0.8068 31.18 0.9169
IGNN [100] ×4 DIV2K 32.57 0.8998 28.85 0.7891 27.77 0.7434 26.84 0.8090 31.28 0.9182
HAN [63] ×4 DIV2K 32.64 0.9002 28.90 0.7890 27.80 0.7442 26.85 0.8094 31.42 0.9177
NLSA [61] ×4 DIV2K 32.59 0.9000 28.87 0.7891 27.78 0.7444 26.96 0.8109 31.27 0.9184
SwinIR (Ours) ×4 DIV2K 32.72 0.9021 28.94 0.7914 27.83 0.7459 27.07 0.8164 31.67 0.9226
SwinIR+ (Ours) ×4 DIV2K 32.81 0.9029 29.02 0.7928 27.87 0.7466 27.21 0.8187 31.88 0.9423
DBPN [31] ×4 DIV2K+Flickr2K 32.47 0.8980 28.82 0.7860 27.72 0.7400 26.38 0.7946 30.91 0.9137
IPT [9] ×4 ImageNet 32.64 - 29.01 - 27.82 - 27.26 - - -
RRDB [81] ×4 DIV2K+Flickr2K 32.73 0.9011 28.99 0.7917 27.85 0.7455 27.03 0.8153 31.66 0.9196
SwinIR (Ours) ×4 DIV2K+Flickr2K 32.92 0.9044 29.09 0.7950 27.92 0.7489 27.45 0.8254 32.03 0.9260
SwinIR+ (Ours) ×4 DIV2K+Flickr2K 32.93 0.9043 29.15 0.7958 27.95 0.7494 27.56 0.8273 32.22 0.9273

[Figure 4: Visual comparison of bicubic image SR (×4) methods on "img 012" from Urban100 (×4): HR, VDSR [40], EDSR [51], RDN [97], OISR [33], SAN [15], RNAN [96], IGNN [100], IPT [9] and SwinIR (ours). Compared images are derived from [9]. Best viewed by zooming.]

Table 3: Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for lightweight image SR on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Method | Scale | #Params | #Mult-Adds | Set5 [3] PSNR/SSIM | Set14 [87] PSNR/SSIM | BSD100 [58] PSNR/SSIM | Urban100 [34] PSNR/SSIM | Manga109 [60] PSNR/SSIM
CARN [2] ×2 1,592K 222.8G 37.76 0.9590 33.52 0.9166 32.09 0.8978 31.92 0.9256 38.36 0.9765
FALSR-A [12] ×2 1,021K 234.7G 37.82 0.959 33.55 0.9168 32.1 0.8987 31.93 0.9256 - -
IMDN [35] ×2 694K 158.8G 38.00 0.9605 33.63 0.9177 32.19 0.8996 32.17 0.9283 38.88 0.9774
LAPAR-A [44] ×2 548K 171.0G 38.01 0.9605 33.62 0.9183 32.19 0.8999 32.10 0.9283 38.67 0.9772
LatticeNet [57] ×2 756K 169.5G 38.15 0.9610 33.78 0.9193 32.25 0.9005 32.43 0.9302 - -
SwinIR (Ours) ×2 878K 195.6G 38.14 0.9611 33.86 0.9206 32.31 0.9012 32.76 0.9340 39.12 0.9783
CARN [2] ×3 1,592K 118.8G 34.29 0.9255 30.29 0.8407 29.06 0.8034 28.06 0.8493 33.50 0.9440
IMDN [35] ×3 703K 71.5G 34.36 0.9270 30.32 0.8417 29.09 0.8046 28.17 0.8519 33.61 0.9445
LAPAR-A [44] ×3 544K 114.0G 34.36 0.9267 30.34 0.8421 29.11 0.8054 28.15 0.8523 33.51 0.9441
LatticeNet [57] ×3 765K 76.3G 34.53 0.9281 30.39 0.8424 29.15 0.8059 28.33 0.8538 - -
SwinIR (Ours) ×3 886K 87.2G 34.62 0.9289 30.54 0.8463 29.20 0.8082 28.66 0.8624 33.98 0.9478
CARN [2] ×4 1,592K 90.9G 32.13 0.8937 28.60 0.7806 27.58 0.7349 26.07 0.7837 30.47 0.9084
IMDN [35] ×4 715K 40.9G 32.21 0.8948 28.58 0.7811 27.56 0.7353 26.04 0.7838 30.45 0.9075
LAPAR-A [44] ×4 659K 94.0G 32.15 0.8944 28.61 0.7818 27.61 0.7366 26.14 0.7871 30.42 0.9074
LatticeNet [57] ×4 777K 43.6G 32.30 0.8962 28.68 0.7830 27.62 0.7367 26.25 0.7873 - -
SwinIR (Ours) ×4 897K 49.6G 32.44 0.8976 28.77 0.7858 27.69 0.7406 26.47 0.7980 30.92 0.9151

[Figure 5: Visual comparison of real-world image SR (×4) methods on real-world images: LR (×4), ESRGAN [81], RealSR [37], BSRGAN [89], Real-ESRGAN [80] and SwinIR (ours).]

Table 4: Quantitative comparison (average PSNR/SSIM/PSNR-B) with state-of-the-art methods for JPEG compression artifact reduction on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Dataset | q | ARCNN [17] | DnCNN-3 [90] | QGAC [20] | RNAN [96] | RDN [98] | DRUNet [88] | SwinIR (ours)
Classic5 [22] | 10 | 29.03/0.7929/28.76 | 29.40/0.8026/29.13 | 29.84/0.8370/29.43 | 29.96/0.8178/29.62 | 30.00/0.8188/- | 30.16/0.8234/29.81 | 30.27/0.8249/29.95
Classic5 [22] | 20 | 31.15/0.8517/30.59 | 31.63/0.8610/31.19 | 31.98/0.8850/31.37 | 32.11/0.8693/31.57 | 32.15/0.8699/- | 32.39/0.8734/31.80 | 32.52/0.8748/31.99
Classic5 [22] | 30 | 32.51/0.8806/31.98 | 32.91/0.8861/32.38 | 33.22/0.9070/32.42 | 33.38/0.8924/32.68 | 33.43/0.8930/- | 33.59/0.8949/32.82 | 33.73/0.8961/33.03
Classic5 [22] | 40 | 33.32/0.8953/32.79 | 33.77/0.9003/33.20 | - | 34.27/0.9061/33.4 | 34.27/0.9061/- | 34.41/0.9075/33.51 | 34.52/0.9082/33.66
LIVE1 [67] | 10 | 28.96/0.8076/28.77 | 29.19/0.8123/28.90 | 29.53/0.8400/29.15 | 29.63/0.8239/29.25 | 29.67/0.8247/- | 29.79/0.8278/29.48 | 29.86/0.8287/29.50
LIVE1 [67] | 20 | 31.29/0.8733/30.79 | 31.59/0.8802/31.07 | 31.86/0.9010/31.27 | 32.03/0.8877/31.44 | 32.07/0.8882/- | 32.17/0.8899/31.69 | 32.25/0.8909/31.70
LIVE1 [67] | 30 | 32.67/0.9043/32.22 | 32.98/0.9090/32.34 | 33.23/0.9250/32.50 | 33.45/0.9149/32.71 | 33.51/0.9153/- | 33.59/0.9166/32.99 | 33.69/0.9174/33.01
LIVE1 [67] | 40 | 33.63/0.9198/33.14 | 33.96/0.9247/33.28 | - | 34.47/0.9299/33.66 | 34.51/0.9302/- | 34.58/0.9312/33.93 | 34.67/0.9317/33.88

In addition, to exploit the full potential of SwinIR for real applications, we further propose a large model and train it on much larger datasets. Experiments show that it can deal with more complex corruptions and achieves even better performance on real-world images than the current model. Due to the page limit, the details are given on our project page https://github.com/JingyunLiang/SwinIR.

4.4. Results on JPEG Compression Artifact Reduction

Table 4 shows the comparison of SwinIR with state-of-the-art JPEG compression artifact reduction methods: ARCNN [17], DnCNN-3 [90], QGAC [20], RNAN [96], RDN [98] and DRUNet [88]. All of the compared methods are CNN-based models. Following [98, 88], we test different methods on two benchmark datasets (Classic5 [22] and LIVE1 [67]) for JPEG quality factors 10, 20, 30 and 40. As we can see, the proposed SwinIR has average PSNR gains of at least 0.11dB and 0.07dB on the two testing datasets for different quality factors. Besides, compared with the previous best model DRUNet, SwinIR only has 11.5M parameters, while DRUNet is a large model that has 32.7M parameters.

4.5. Results on Image Denoising

We show grayscale and color image denoising results in Table 5 and Table 6, respectively. Compared methods include the traditional models BM3D [14] and WNNM [29], and the CNN-based models DnCNN [90], IRCNN [91], FFDNet [92], N3Net [65], NLRN [52], FOCNet [38], RNAN [96], MWCNN [54] and DRUNet [88]. Following [90, 88], the compared noise levels include 15, 25 and 50. As one can see, our model achieves better performance than all compared methods. In particular, it surpasses the state-of-the-art model DRUNet by up to 0.3dB on the large Urban100 dataset, which has 100 high-resolution testing images. It is worth pointing out that SwinIR only has 12.0M parameters, whereas DRUNet has 32.7M parameters. This indicates that the SwinIR architecture is highly efficient in learning feature representations for restoration. Visual comparisons of grayscale and color image denoising with different methods are shown in Figs. 6 and 7. As we can see, our method can remove heavy noise corruption and preserve high-frequency image details, resulting in sharper edges and more natural textures. By contrast, other methods suffer from either over-smoothness or over-sharpness, and cannot recover rich textures.
Table 5: Quantitative comparison (average PSNR) with state-of-the-art methods for grayscale image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Dataset | σ | BM3D [14] | WNNM [29] | DnCNN [90] | IRCNN [91] | FFDNet [92] | N3Net [65] | NLRN [52] | FOCNet [38] | RNAN [96] | MWCNN [54] | DRUNet [88] | SwinIR (ours)
Set12 [90] | 15 | 32.37 | 32.70 | 32.86 | 32.76 | 32.75 | - | 33.16 | 33.07 | - | 33.15 | 33.25 | 33.36
Set12 [90] | 25 | 29.97 | 30.28 | 30.44 | 30.37 | 30.43 | 30.55 | 30.80 | 30.73 | - | 30.79 | 30.94 | 31.01
Set12 [90] | 50 | 26.72 | 27.05 | 27.18 | 27.12 | 27.32 | 27.43 | 27.64 | 27.68 | 27.70 | 27.74 | 27.90 | 27.91
BSD68 [59] | 15 | 31.08 | 31.37 | 31.73 | 31.63 | 31.63 | - | 31.88 | 31.83 | - | 31.86 | 31.91 | 31.97
BSD68 [59] | 25 | 28.57 | 28.83 | 29.23 | 29.15 | 29.19 | 29.30 | 29.41 | 29.38 | - | 29.41 | 29.48 | 29.50
BSD68 [59] | 50 | 25.60 | 25.87 | 26.23 | 26.19 | 26.29 | 26.39 | 26.47 | 26.50 | 26.48 | 26.53 | 26.59 | 26.58
Urban100 [34] | 15 | 32.35 | 32.97 | 32.64 | 32.46 | 32.40 | - | 33.45 | 33.15 | - | 33.17 | 33.44 | 33.70
Urban100 [34] | 25 | 29.70 | 30.39 | 29.95 | 29.80 | 29.90 | 30.19 | 30.94 | 30.64 | - | 30.66 | 31.11 | 31.30
Urban100 [34] | 50 | 25.95 | 26.83 | 26.26 | 26.22 | 26.50 | 26.82 | 27.49 | 27.40 | 27.65 | 27.42 | 27.96 | 27.98

Table 6: Quantitative comparison (average PSNR) with state-of-the-art methods for color image denoising on benchmark datasets. Best and second best performance are in red and blue colors, respectively.
Dataset | σ | BM3D [14] | DnCNN [90] | IRCNN [91] | FFDNet [92] | DSNet [64] | RPCNN [85] | BRDNet [71] | RNAN [96] | RDN [98] | IPT [9] | DRUNet [88] | SwinIR (ours)
CBSD68 [59] | 15 | 33.52 | 33.90 | 33.86 | 33.87 | 33.91 | - | 34.10 | - | - | - | 34.30 | 34.42
CBSD68 [59] | 25 | 30.71 | 31.24 | 31.16 | 31.21 | 31.28 | 31.24 | 31.43 | - | - | - | 31.69 | 31.78
CBSD68 [59] | 50 | 27.38 | 27.95 | 27.86 | 27.96 | 28.05 | 28.06 | 28.16 | 28.27 | 28.31 | 28.39 | 28.51 | 28.56
Kodak24 [23] | 15 | 34.28 | 34.60 | 34.69 | 34.63 | 34.63 | - | 34.88 | - | - | - | 35.31 | 35.34
Kodak24 [23] | 25 | 32.15 | 32.14 | 32.18 | 32.13 | 32.16 | 32.34 | 32.41 | - | - | - | 32.89 | 32.89
Kodak24 [23] | 50 | 28.46 | 28.95 | 28.93 | 28.98 | 29.05 | 29.25 | 29.22 | 29.58 | 29.66 | 29.64 | 29.86 | 29.79
McMaster [94] | 15 | 34.06 | 33.45 | 34.58 | 34.66 | 34.67 | - | 35.08 | - | - | - | 35.40 | 35.61
McMaster [94] | 25 | 31.66 | 31.52 | 32.18 | 32.35 | 32.40 | 32.33 | 32.75 | - | - | - | 33.14 | 33.20
McMaster [94] | 50 | 28.51 | 28.62 | 28.91 | 29.18 | 29.28 | 29.33 | 29.52 | 29.72 | - | 29.98 | 30.08 | 30.22
Urban100 [34] | 15 | 33.93 | 32.98 | 33.78 | 33.83 | - | - | 34.42 | - | - | - | 34.81 | 35.13
Urban100 [34] | 25 | 31.36 | 30.81 | 31.20 | 31.40 | - | 31.81 | 31.99 | - | - | - | 32.60 | 32.90
Urban100 [34] | 50 | 27.93 | 27.59 | 27.70 | 28.05 | - | 28.62 | 28.56 | 29.08 | 29.38 | 29.71 | 29.61 | 29.82

[Figure 6: Visual comparison of grayscale image denoising (noise level 50) methods on image "Monarch" from Set12 [90]: Noisy, BM3D [14], DnCNN [90], FFDNet [92], DRUNet [88] and SwinIR (ours). Compared images are derived from [88].]

[Figure 7: Visual comparison of color image denoising (noise level 50) methods on image "163085" from CBSD68 [59]: Noisy, DnCNN [90], FFDNet [92], IPT [9], DRUNet [88] and SwinIR (ours). Compared images are derived from [88].]

5. Conclusion

In this paper, we propose a Swin Transformer-based image restoration model, SwinIR. The model is composed of three parts: shallow feature extraction, deep feature extraction and HQ reconstruction modules. In particular, we use a stack of residual Swin Transformer blocks (RSTB) for deep feature extraction, and each RSTB is composed of Swin Transformer layers, a convolution layer and a residual connection. Extensive experiments show that SwinIR achieves state-of-the-art performance on three representative image restoration tasks and six different settings: classical image SR, lightweight image SR, real-world image SR, grayscale image denoising, color image denoising and JPEG compression artifact reduction, which demonstrates the effectiveness and generalizability of the proposed SwinIR. In the future, we will extend the model to other restoration tasks such as image deblurring and deraining.

Acknowledgements. This paper was partially supported by the ETH Zurich Fund (OK), a Huawei Technologies Oy (Finland) project, the China Scholarship Council and an Amazon AWS grant. Special thanks go to Yijue Chen.
References

[1] Eirikur Agustsson and Radu Timofte. Ntire 2017 challenge on single image super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 126–135, 2017.
[2] Namhyuk Ahn, Byungkon Kang, and Kyung-Ah Sohn. Fast, accurate, and lightweight super-resolution with cascading residual network. In European Conference on Computer Vision, pages 252–268, 2018.
[3] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie-Line Alberi Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In British Machine Vision Conference, pages 135.1–135.10, 2012.
[4] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537, 2021.
[5] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv preprint arXiv:2106.06847, 2021.
[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
[7] Lukas Cavigelli, Pascal Hager, and Luca Benini. Cas-cnn: A deep convolutional neural network for image compression artifact suppression. In 2017 International Joint Conference on Neural Networks, pages 752–759, 2017.
[8] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. In International Conference on Image Processing, volume 2, pages 168–172. IEEE, 1994.
[9] Hanting Chen, Yunhe Wang, Tianyu Guo, Chang Xu, Yiping Deng, Zhenhua Liu, Siwei Ma, Chunjing Xu, Chao Xu, and Wen Gao. Pre-trained image processing transformer. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12299–12310, 2021.
[10] Yunjin Chen and Thomas Pock. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1256–1272, 2016.
[11] Wenlong Cheng, Mingbo Zhao, Zhiling Ye, and Shuhang Gu. Mfagan: A compression framework for memory-efficient on-device super-resolution gan. arXiv preprint arXiv:2107.12679, 2021.
[12] Xiangxiang Chu, Bo Zhang, Hailong Ma, Ruijun Xu, and Qingyuan Li. Fast, accurate and lightweight super-resolution with neural architecture search. In International Conference on Pattern Recognition, pages 59–64. IEEE, 2020.
[13] Jean-Baptiste Cordonnier, Andreas Loukas, and Martin Jaggi. On the relationship between self-attention and convolutional layers. arXiv preprint arXiv:1911.03584, 2019.
[14] Kostadin Dabov, Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.
[15] Tao Dai, Jianrui Cai, Yongbing Zhang, Shu-Tao Xia, and Lei Zhang. Second-order attention network for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11065–11074, 2019.
[16] Xin Deng, Yutong Zhang, Mai Xu, Shuhang Gu, and Yiping Duan. Deep coupled feedback network for joint exposure fusion and image super-resolution. IEEE Transactions on Image Processing, 30:3098–3112, 2021.
[17] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In IEEE International Conference on Computer Vision, pages 576–584, 2015.
[18] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199, 2014.
[19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[20] Max Ehrlich, Larry Davis, Ser-Nam Lim, and Abhinav Shrivastava. Quantization guided jpeg artifact correction. In European Conference on Computer Vision, pages 293–309, 2020.
[21] Gamaleldin Elsayed, Prajit Ramachandran, Jonathon Shlens, and Simon Kornblith. Revisiting spatial invariance with low-rank local connectivity. In International Conference on Machine Learning, pages 2868–2879, 2020.
[22] Alessandro Foi, Vladimir Katkovnik, and Karen Egiazarian. Pointwise shape-adaptive dct for high-quality denoising and deblocking of grayscale and color images. IEEE Transactions on Image Processing, 16(5):1395–1411, 2007.
[23] Rich Franzen. Kodak lossless true color image suite. Source: http://r0k.us/graphics/kodak, 4(2), 1999.
[24] Manuel Fritsche, Shuhang Gu, and Radu Timofte. Frequency separation for real-world super-resolution. In IEEE Conference on International Conference on Computer Vision Workshops, pages 3599–3608, 2019.
[25] Xueyang Fu, Menglu Wang, Xiangyong Cao, Xinghao Ding, and Zheng-Jun Zha. A model-driven deep unfolding method for jpeg artifacts removal. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[26] Xueyang Fu, Zheng-Jun Zha, Feng Wu, Xinghao Ding, and John Paisley. Jpeg artifacts reduction via deep convolutional sparse coding. In IEEE International Conference on Computer Vision, pages 2501–2510, 2019.
[27] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[28] Shuhang Gu, Nong Sang, and Fan Ma. Fast image super resolution via local regression. In IEEE Conference on International Conference on Pattern Recognition, pages 3128–3131, 2012.
[29] Shuhang Gu, Lei Zhang, Wangmeng Zuo, and Xiangchu Feng. Weighted nuclear norm minimization with application to image denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2862–2869, 2014.
[30] Yong Guo, Jian Chen, Jingdong Wang, Qi Chen, Jiezhang Cao, Zeshuai Deng, Yanwu Xu, and Mingkui Tan. Closed-loop matters: Dual regression networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5407–5416, 2020.
[31] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1664–1673, 2018.
[32] Kaiming He, Jian Sun, and Xiaoou Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, 2010.
[33] Xiangyu He, Zitao Mo, Peisong Wang, Yang Liu, Mingyuan Yang, and Jian Cheng. Ode-inspired network design for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1732–1741, 2019.
[34] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
[35] Zheng Hui, Xinbo Gao, Yunchu Yang, and Xiumei Wang. Lightweight image super-resolution with information multi-distillation network. In ACM International Conference on Multimedia, pages 2024–2032, 2019.
[36] Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, and Qi Tian. Video super-resolution with recurrent structure-detail network. In European Conference on Computer Vision, pages 645–660. Springer, 2020.
[37] Xiaozhong Ji, Yun Cao, Ying Tai, Chengjie Wang, Jilin Li, and Feiyue Huang. Real-world super-resolution via kernel estimation and noise injection. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 466–467, 2020.
[38] Xixi Jia, Sanyang Liu, Xiangchu Feng, and Lei Zhang. Focnet: A fractional optimal control network for image denoising. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6054–6063, 2019.
[39] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
[40] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[41] Yoonsik Kim, Jae Woong Soh, Jaewoo Park, Byeongyong Ahn, Hyun-Seung Lee, Young-Su Moon, and Nam Ik Cho. A pseudo-blind convolutional neural network for the reduction of compression artifacts. IEEE Transactions on Circuits and Systems for Video Technology, 30(4):1121–1135, 2019.
[42] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang. Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 624–632, 2017.
[43] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
[44] Wenbo Li, Kun Zhou, Lu Qi, Nianjuan Jiang, Jiangbo Lu, and Jiaya Jia. Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. arXiv preprint arXiv:2105.10422, 2021.
[45] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
[46] Zhen Li, Jinglei Yang, Zheng Liu, Xiaomin Yang, Gwanggil Jeon, and Wei Wu. Feedback network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3867–3876, 2019.
[47] Dingkang Liang, Xiwu Chen, Wei Xu, Yu Zhou, and Xiang Bai. Transcrowd: Weakly-supervised crowd counting with transformer. arXiv preprint arXiv:2104.09116, 2021.
[48] Jingyun Liang, Andreas Lugmayr, Kai Zhang, Martin Danelljan, Luc Van Gool, and Radu Timofte. Hierarchical conditional flow: A unified framework for image super-resolution and image rescaling. In IEEE Conference on International Conference on Computer Vision, 2021.
[49] Jingyun Liang, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Mutual affine network for spatially variant kernel estimation in blind image super-resolution. In IEEE Conference on International Conference on Computer Vision, 2021.
[50] Jingyun Liang, Kai Zhang, Shuhang Gu, Luc Van Gool, and Radu Timofte. Flow-based kernel prior with application to blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10601–10610, 2021.
[51] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 136–144, 2017.
[52] Ding Liu, Bihan Wen, Yuchen Fan, Chen Change Loy, and Thomas S Huang. Non-local recurrent network for image restoration. arXiv preprint arXiv:1806.02919, 2018.
[53] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul Fieguth, Jie Chen, Xinwang Liu, and Matti Pietikäinen. Deep learning for generic object detection: A survey. International Journal of Computer Vision, 128(2):261–318, 2020.
[54] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo. Multi-level wavelet-cnn for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 773–782, 2018.
[55] Yun Liu, Guolei Sun, Yu Qiu, Le Zhang, Ajad Chhatkuli, and Luc Van Gool. Transformer in convolutional neural networks. arXiv preprint arXiv:2106.03180, 2021.
[56] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[57] Xiaotong Luo, Yuan Xie, Yulun Zhang, Yanyun Qu, Cuihua Li, and Yun Fu. Latticenet: Towards lightweight image super-resolution with lattice block. In European Conference on Computer Vision, pages 272–289, 2020.
[58] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE Conference on International Conference on Computer Vision, pages 416–423, 2001.
[59] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision, pages 416–423, 2001.
[60] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20):21811–21838, 2017.
[61] Yiqun Mei, Yuchen Fan, and Yuqian Zhou. Image super-resolution with non-local sparse attention. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3517–3526, 2021.
[62] Tomer Michaeli and Michal Irani. Nonparametric blind super-resolution. In IEEE Conference on International Conference on Computer Vision, pages 945–952, 2013.
[63] Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, and Haifeng Shen. Single image super-resolution via a holistic attention network. In European Conference on Computer Vision, pages 191–207, 2020.
[64] Yali Peng, Lu Zhang, Shigang Liu, Xiaojun Wu, Yu Zhang, and Xili Wang. Dilated residual networks with symmetric skip connection for image denoising. Neurocomputing, 345:67–76, 2019.
[65] Tobias Plötz and Stefan Roth. Neural nearest neighbors networks. arXiv preprint arXiv:1810.12575, 2018.
[66] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909, 2019.
[67] HR Sheikh. Live image quality assessment database release 2. http://live.ece.utexas.edu/research/quality, 2005.
[68] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
[69] Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, and Luc Van Gool. Boosting crowd counting with transformers. arXiv preprint arXiv:2105.10926, 2021.
[70] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu. Memnet: A persistent memory network for image restoration. In IEEE International Conference on Computer Vision, pages 4539–4547, 2017.
[71] Chunwei Tian, Yong Xu, and Wangmeng Zuo. Image denoising using deep cnn with batch renormalization. Neural Networks, 121:461–473, 2020.
[72] Radu Timofte, Vincent De Smet, and Luc Van Gool. Anchored neighborhood regression for fast example-based super-resolution. In IEEE Conference on International Conference on Computer Vision, pages 1920–1927, 2013.
[73] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126, 2014.
[74] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
[75] Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, Niki Parmar, Blake Hechtman, and Jonathon Shlens. Scaling local self-attention for parameter efficient visual backbones. arXiv preprint arXiv:2103.12731, 2021.
[76] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[77] Longguang Wang, Yingqian Wang, Xiaoyu Dong, Qingyu Xu, Jungang Yang, Wei An, and Yulan Guo. Unsupervised degradation representation learning for blind super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10581–10590, 2021.
[78] Longguang Wang, Yingqian Wang, Zhengfa Liang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning parallax attention for stereo image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12250–12259, 2019.
[79] Longguang Wang, Yingqian Wang, Zaiping Lin, Jungang Yang, Wei An, and Yulan Guo. Learning a single network for scale-arbitrary super-resolution. In IEEE Conference on International Conference on Computer Vision, pages 10581–10590, 2021.
[80] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. arXiv preprint arXiv:2107.10833, 2021.
[81] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision Workshops, pages 701–710, 2018.
[82] Zhendong Wang, Xiaodong Cun, Jianmin Bao, and Jianzhuang Liu. Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106, 2021.
[83] Yunxuan Wei, Shuhang Gu, Yawei Li, Radu Timofte, Longcun Jin, and Hengjie Song. Unsupervised real-world image super resolution via domain-distance aware training. In IEEE Conference on Computer Vision and Pattern Recognition, pages 13385–13394, 2021.
[84] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677, 2020.
[85] Zhihao Xia and Ayan Chakrabarti. Identifying recurring patterns with deep neural networks for natural image denoising. In IEEE Winter Conference on Applications of Computer Vision, pages 2426–2434, 2020.
[86] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881, 2021.
[87] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711–730, 2010.
[88] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[89] Kai Zhang, Jingyun Liang, Luc Van Gool, and Radu Timofte. Designing a practical degradation model for deep blind image super-resolution. In IEEE Conference on International Conference on Computer Vision, 2021.
[90] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
[91] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep cnn denoiser prior for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3929–3938, 2017.
[92] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Ffdnet: Toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing, 27(9):4608–4622, 2018.
[93] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3262–3271, 2018.
[94] Lei Zhang, Xiaolin Wu, Antoni Buades, and Xin Li. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. Journal of Electronic Imaging, 20(2):023016, 2011.
[95] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In European Conference on Computer Vision, pages 286–301, 2018.
[96] Yulun Zhang, Kunpeng Li, Kai Li, Bineng Zhong, and Yun Fu. Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082, 2019.
[97] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
[98] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(7):2480–2495, 2020.
[99] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6881–6890, 2021.
[100] Shangchen Zhou, Jiawei Zhang, Wangmeng Zuo, and Chen Change Loy. Cross-scale internal graph neural network for image super-resolution. arXiv preprint arXiv:2006.16673, 2020.
