Fast and Accurate Image Super Resolution by Deep CNN With Skip Connection and Network in Network
Keywords: Deep Learning, Image Super Resolution, Deep CNN, Residual Net,
Skip Connection, Network in Network
1 Introduction
Single Image Super-Resolution (SISR) was mainly used in specific fields such as security video surveillance and medical imaging. Today, however, SISR is widely needed in TV, video playback, and websites, because display resolutions keep increasing while source content is often two to eight times lower in resolution than recent displays. In other cases, network bandwidth is limited while the display's resolution is rather high. Recent deep-learning based methods (especially deep, fully convolutional networks) have achieved high performance on the problem of SISR from low resolution (LR) images to high resolution (HR) images. We believe this is because deep learning can progressively grasp both local and global structures of the image at the same time by cascading CNN and nonlinear layers. However, with regard to power consumption and real-time processing, deep, fully convolutional networks require heavy computation and a lengthy processing time. In this paper, we propose a lighter network obtained by optimizing the network structure with recent deep-learning techniques, as shown in Figure 1. For example, recent state-of-the-art deep-learning based SISR models, which we introduce in Section 2, have 20 to 30 CNN layers, while our proposed model (DCSCN) needs only 11 layers, and the total computation of its CNN filters is 10 to 100 times smaller than that of the others.
Fig. 1. Our model (DCSCN) structure. The last CNN layer (dark blue) outputs as many channels as the square of the scale factor; these are then reshaped into the HR image.
Image Detail Reconstruction For data up-sampling, the transposed convolutional layer (also known as a deconvolution layer) proposed by Matthew D. Zeiler et al. [1] is typically used. A transposed convolutional layer can learn up-sampling kernels; however, the process is similar to that of a usual convolutional layer and the reconstruction ability is limited. To obtain better reconstruction performance, transposed convolutional layers need to be stacked deeply, which makes the process computationally heavy. We therefore propose a parallelized CNN structure like the Network in Network [2], which usually consists of one (or more) 1x1 CNN(s). Remarkably, a 1x1 CNN layer not only reduces the dimensions of the previous layer for faster computation with little information loss, but also adds more nonlinearity to enhance the potential representation power of the network. With this structure, we can significantly reduce the number of CNN or transposed CNN filters. Since a 1x1 convolution needs 9 times less computation than a 3x3 convolution, our reconstruction network is much lighter than those of other deep-learning based methods.
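As an illustration, the following is a minimal sketch of such a parallelized reconstruction block in PyTorch. The filter counts (A1 = 64, B1 = 32, B2 = 32, L = 4) are taken from Table 1; the framework, the PReLU activations, the 3x3 kernels of B2 and the last layer, and the 410 input channels (the sum of the feature extraction filters in Table 1, assuming all layer outputs are concatenated by the skip connections) are our assumptions, not a definitive implementation.

import torch
import torch.nn as nn

class NiNReconstruction(nn.Module):
    """Sketch of a parallelized, Network-in-Network style reconstruction
    block. Filter counts (A1=64, B1=32, B2=32, L=4) follow Table 1; the
    3x3 kernels of B2 and the last layer, the PReLU activations and the
    410 input channels are assumptions."""

    def __init__(self, in_ch=410, scale=2):
        super().__init__()
        # A1: one 1x1 conv reduces dimensionality cheaply (a 1x1 kernel
        # needs 9x less computation than a 3x3 kernel per output value).
        self.a1 = nn.Sequential(nn.Conv2d(in_ch, 64, 1), nn.PReLU())
        # B1 -> B2: a 1x1 reduction followed by a 3x3 conv adds nonlinearity.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, 32, 1), nn.PReLU())
        self.b2 = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.PReLU())
        # L: the last conv outputs scale**2 channels (4 when scale = 2),
        # one per corner pixel of each up-sampled pixel (dark blue in Fig. 1).
        self.last = nn.Conv2d(64 + 32, scale * scale, 3, padding=1)

    def forward(self, x):
        a = self.a1(x)            # branch A
        b = self.b2(self.b1(x))   # branch B
        return self.last(torch.cat([a, b], dim=1))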
2 Related Work
Deep learning-based methods are currently an active area of research and show significant performance on SISR tasks. The Super-Resolution Convolutional Neural Network (SRCNN) [3] was proposed at a very early stage of this trend. C. Dong et al. use 2 to 4 CNN layers to show that a learned CNN model performs well on SISR tasks, and conclude that using a larger CNN filter size is better than using deeper CNN layers. SRCNN was followed by the Deeply-Recursive Convolutional Network for Image Super-Resolution (DRCN) [4]. DRCN uses deep (a total of 20) CNN layers, which would normally mean a huge number of parameters. However, the recursive layers share their weights, which reduces the number of parameters to train; the authors thereby succeed in training the deep network and achieve significant performance.
Another deep learning-based method, VDSR [5], was proposed by the same authors as DRCN. VDSR uses Deep Residual Learning [6], which was developed by researchers at Microsoft Research and is famous for winning first place in ILSVRC 2015 (a large image classification competition). By using residual learning and gradient clipping, VDSR significantly speeds up the training step. Very deep Residual Encoder-Decoder Networks (RED) [7] are also based on residual learning. RED contains symmetric convolutional (encoder) and deconvolutional (decoder) layers, together with skip connections that connect every two or three layers. With this symmetric structure, they can train a very deep (30-layer) network and achieve state-of-the-art performance. These studies reflect the trend of "the deeper, the better".
On the other hand, Yaniv Romano et al. proposed Rapid and Accurate Image Super Resolution (RAISR) [8], a shallow and faster learning-based method. It classifies input image patches according to each patch's angle, strength and coherence, and then learns mappings from LR to HR images within the clustered patches. C. Dong et al. also proposed FSRCNN [9] as a faster version of their SRCNN [3]. FSRCNN uses a transposed CNN to process the input image directly. RAISR and FSRCNN are 10 to 100 times faster than other state-of-the-art deep learning-based methods; however, their performance is not as high as that of deeply convolutional methods such as DRCN, VDSR or RED.
3 Proposed Method
We built our model from scratch, starting from only one CNN layer and a small dataset, and then growing the number of layers, filters and training data. Whenever performance stopped improving, we changed the model structure and tried many deep-learning techniques, such as mini-batch training, dropout, batch normalization, regularization, initialization schemes, optimizers and activation functions, to learn the effect of each structure and technique. Finally, we carefully chose structures and hyper-parameters suited to the SISR task and built our final model.
In previous studies, an up-sampled image was often used as the input to the deep learning-based architecture. In such models, the SISR network operates pixel-wise on the up-sampled image: 20 to 30 CNN layers are applied to each up-sampled pixel, and heavy computation (up to 4x, 9x and 16x for scale factors 2, 3 and 4) is required, as shown in Figure 2. It also seems inefficient to extract features from an up-sampled image rather than from the original image, even from the perspective of the reconstruction process.
Fig. 2. Simplified process structures of (a) other models and (b) our model (DCSCN).
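A short worked calculation makes this overhead concrete. A convolutional layer with a $k \times k$ kernel, $c_{in}$ input channels and $c_{out}$ output channels costs about $H W k^2 c_{in} c_{out}$ multiply-accumulate operations on an $H \times W$ input, so running the same layer on an input up-sampled by scale factor $s$ costs

\[
\frac{(sH)(sW)\,k^2 c_{in} c_{out}}{H W\,k^2 c_{in} c_{out}} = s^2
\]

times more, i.e. 4x, 9x and 16x for s = 2, 3 and 4. Extracting features at LR size, as DCSCN does, avoids exactly this factor in every layer.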
The last CNN layer, shown in dark blue in Figure 1, outputs s^2 channels (4 channels when the scale factor s = 2), where each channel represents one corner pixel of each up-sampled pixel. DCSCN reshapes this 4-channel LR-size output into an HR image (with 4 times as many pixels) and finally adds it to the bi-cubically up-sampled original input image. As in typical residual learning networks, the model is thus made to focus on learning the residual output, which greatly helps learning performance, even for shallow models (fewer than 7 layers).
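A minimal sketch of this final step, assuming PyTorch and that the described reshape is equivalent to a pixel shuffle (the exact reshape implementation is not specified in the text):

import torch
import torch.nn.functional as F

def reconstruct_hr(residual_4ch, lr_image, scale=2):
    """Sketch of the final reshape-and-add step for scale factor s = 2.
    Assumes the reshape described above is equivalent to a pixel shuffle:
    each of the s*s = 4 channels becomes one corner pixel of an
    up-sampled pixel."""
    # (N, 4, H, W) -> (N, 1, 2H, 2W): interleave channels into space.
    residual_hr = F.pixel_shuffle(residual_4ch, scale)
    # Bi-cubically up-sample the original input and add the learned
    # residual, so the network only has to learn the high-frequency detail.
    base = F.interpolate(lr_image, scale_factor=scale, mode="bicubic",
                         align_corners=False)
    return base + residual_hr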
Table 1. The number of filters in each CNN layer of our proposed model

              Feature extraction network        Reconstruction network
               1    2    3    4    5    6    7    A1   B1   B2   L
  DCSCN       96   76   65   55   47   39   32    64   32   32   4
  c-DCSCN     32   26   22   18   14   11    8    24    8    8   4
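For concreteness, here is a sketch of the feature extraction column built from the DCSCN filter counts in Table 1. The 3x3 kernels, the PReLU activations and the concatenation of every layer's output for the skip path are assumptions consistent with Figure 1, not the paper's exact implementation:

import torch
import torch.nn as nn

class FeatureExtraction(nn.Module):
    """Sketch of the 7-layer feature extraction network with skip
    connections, using the DCSCN filter counts from Table 1. Kernel sizes
    and activations are assumptions."""

    def __init__(self, filters=(96, 76, 65, 55, 47, 39, 32)):
        super().__init__()
        layers, in_ch = [], 1  # only the Y channel is processed
        for out_ch in filters:
            layers.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.PReLU()))
            in_ch = out_ch
        self.layers = nn.ModuleList(layers)

    def forward(self, y):
        feats = []
        for layer in self.layers:
            y = layer(y)
            feats.append(y)  # keep every level's features for the skip path
        # Concatenate local and global features for the reconstruction
        # network: 96+76+65+55+47+39+32 = 410 channels.
        return torch.cat(feats, dim=1)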
4 Experiments
The total number of training images is 1,164 and their total size is 435 MB. Color (RGB) images are converted to YCbCr and only the Y channel is processed. Each training image is split into 32x32 patches with a stride of 16, and 64 patches are used as a mini-batch. For testing, we use the Set5 [14], Set14 [15] and BSDS100 [13] datasets.
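A small sketch of this data preparation step (NumPy is assumed; the ITU-R BT.601 luma weights below are a common choice for the RGB-to-Y conversion, not one stated in the text):

import numpy as np

def y_channel_patches(rgb, size=32, stride=16):
    """Convert an RGB image (H, W, 3) to its luma channel and split it
    into size x size patches with the given stride, as described above."""
    rgb = rgb.astype(np.float32)
    # Y = 0.299 R + 0.587 G + 0.114 B (BT.601 luma; offsets omitted here).
    y = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    h, w = y.shape
    patches = [y[i:i + size, j:j + size]
               for i in range(0, h - size + 1, stride)
               for j in range(0, w - size + 1, stride)]
    return np.stack(patches)  # later grouped into mini-batches of 64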
[Table: comparison of SRCNN, DRCN, VDSR, RED30, DCSCN (ours) and c-DCSCN (ours) on each dataset.]
Fig. 4. Comparison of reconstruction performance on Set14 versus computational complexity. DCSCN's complexity is taken as 1.00.
5 Conclusion
This paper proposed a fast and accurate image super-resolution method based on a CNN with skip connections and Network in Network. In the feature extraction network of our method, the structure is optimized and both local and global features are sent to the reconstruction network via skip connections. In the reconstruction network, a Network in Network architecture is used to obtain better reconstruction performance with less computation. In addition, the model is designed to be capable of processing original-size (LR) images. With these design choices, our model achieves state-of-the-art performance with fewer computation resources.
Since SISR tasks are now beginning to be used at the network edge (entry-point devices of services, such as mobiles, tablets and IoT devices), building a small but still effective model is rather important. While this model was developed through numerous trial-and-error processes, there should be a more principled way of tuning the model structure and hyper-parameters; establishing a method for designing suitable model complexity for each problem remains necessary.
Another noteworthy aspect of this study is the use of ensemble learning. Deep learning itself has a good capacity for complex problems; however, classic ensemble learning tends to yield good results with less computation, even when there is great diversity within the problem. An ensemble model is also easier to parallelize for faster computation. Therefore, small sets of deep-learning models could be built and combined into an ensemble model to solve real and complex problems.
References
1. Zeiler, M.D., Krishnan, D., Taylor, G.W., Fergus, R.: Deconvolutional Networks. In: Computer Vision and Pattern Recognition, pp. 2528–2535 (2010)
2. Lin, M., Chen, Q., Yan, S.: Network in Network. In: International Conference on Learning Representations (2014)
3. Dong, C., Loy, C.C., He, K., Tang, X.: Learning a Deep Convolutional Network for Image Super-Resolution. In: European Conference on Computer Vision, pp. 184–199 (2014)
4. Kim, J., Lee, J.K., Lee, K.M.: Deeply-Recursive Convolutional Network for Image Super-Resolution. In: Computer Vision and Pattern Recognition, pp. 1637–1645 (2016)
5. Kim, J., Lee, J.K., Lee, K.M.: Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In: Computer Vision and Pattern Recognition, pp. 1646–1654 (2016)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)
7. Mao, X.J., Shen, C., Yang, Y.B.: Image Restoration Using Very Deep Convolutional Encoder-Decoder Networks with Symmetric Skip Connections. In: Neural Information Processing Systems (2016)
8. Romano, Y., Isidoro, J., Milanfar, P.: RAISR: Rapid and Accurate Image Super Resolution. IEEE Transactions on Computational Imaging 3(1), 110–125 (2017)
9. Dong, C., Loy, C.C., Tang, X.: Accelerating the Super-Resolution Convolutional Neural Network. In: European Conference on Computer Vision (2016)
10. Han, S., Mao, H., Dally, W.J.: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In: International Conference on Learning Representations (2016)
11. He, K., Zhang, X., Ren, S., Sun, J.: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In: IEEE International Conference on Computer Vision, pp. 1026–1034 (2015)
12. Yang, J., Wright, J., Huang, T.S., Ma, Y.: Image Super-Resolution via Sparse Representation. IEEE Transactions on Image Processing 19(11), 2861–2873 (2010)
13. Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour Detection and Hierarchical Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5), 898–916 (2011)
14. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.L.: Low-Complexity Single-Image Super-Resolution Based on Nonnegative Neighbor Embedding. In: British Machine Vision Conference (2012)
15. Zeyde, R., Elad, M., Protter, M.: On Single Image Scale-Up Using Sparse-Representations. In: International Conference on Curves and Surfaces, pp. 711–730 (2012)
16. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving Neural Networks by Preventing Co-adaptation of Feature Detectors. arXiv preprint arXiv:1207.0580 (2012)
17. Kingma, D.P., Ba, J.L.: Adam: A Method for Stochastic Optimization. In: International Conference on Learning Representations (2015)