Electronics 08 00292 v3 PDF
Review
A State-of-the-Art Survey on Deep Learning Theory
and Architectures
Md Zahangir Alom 1, *, Tarek M. Taha 1 , Chris Yakopcic 1 , Stefan Westberg 1 , Paheding Sidike 2 ,
Mst Shamima Nasrin 1 , Mahmudul Hasan 3 , Brian C. Van Essen 4 , Abdul A. S. Awwal 4 and
Vijayan K. Asari 1
1 Department of Electrical and Computer Engineering, University of Dayton, Dayton, OH 45469, USA;
ttaha1@udayton.edu (T.M.T.); cyakopcic1@udayton.edu (C.Y.); westbergs1@udayton.edu (S.W.);
nasrinm1@udayton.edu (M.S.N.); vasari1@udayton.edu (V.K.A.)
2 Department of Earth and Atmospheric Sciences, Saint Louis University, Saint Louis, MO 63108, USA;
sidike.paheding@slu.edu
3 Comcast Labs, Washington, DC 20005, USA; mahmud.ucr@gmail.com
4 Lawrence Livermore National Laboratory (LLNL), Livermore, CA 94550, USA;
vanessen1@llnl.gov (B.C.V.E.); awwal1@llnl.gov (A.A.S.A.)
* Correspondence: alomm1@udayton.edu
Received: 17 January 2019; Accepted: 31 January 2019; Published: 5 March 2019
Abstract: In recent years, deep learning has garnered tremendous success in a variety of application
domains. This new field of machine learning has been growing rapidly and has been applied to
most traditional application domains, as well as some new areas that present more opportunities.
Different methods have been proposed based on different categories of learning, including supervised,
semi-supervised, and un-supervised learning. Experimental results show state-of-the-art performance
using deep learning when compared to traditional machine learning approaches in the fields of
image processing, computer vision, speech recognition, machine translation, art, medical imaging,
medical information processing, robotics and control, bioinformatics, natural language processing,
cybersecurity, and many others. This paper presents a brief survey of the advances that have
occurred in the area of Deep Learning (DL), starting with the Deep Neural Network (DNN). The
survey goes on to cover Convolutional Neural Network (CNN), Recurrent Neural Network (RNN),
including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), Auto-Encoder (AE),
Deep Belief Network (DBN), Generative Adversarial Network (GAN), and Deep Reinforcement
Learning (DRL). Additionally, we discuss recent developments, such as advanced variants of
DL techniques based on these DL approaches. This work considers most of the papers published
after 2012, when the history of deep learning began. Furthermore, DL approaches that have
been explored and evaluated in different application domains are also included in this survey. We
also included recently developed frameworks, SDKs, and benchmark datasets that are used for
implementing and evaluating deep learning approaches. There are some surveys that have been
published on DL using neural networks and a survey on Reinforcement Learning (RL). However,
those papers have not discussed individual advanced techniques for training large-scale deep learning
models and the recently developed method of generative models.
Keywords: deep learning; convolutional neural network (CNN); recurrent neural network (RNN);
auto-encoder (AE); restricted Boltzmann machine (RBM); deep belief network (DBN); generative
adversarial network (GAN); deep reinforcement learning (DRL); transfer learning
Figure 2. Category of Deep Learning approaches.
several learning algorithms are applied to the features of a single task or dataset and a decision is
made according to the multiple outcomes from the different algorithms.
On the other hand, in the case of DL, the features are learned automatically and are represented
hierarchically in multiple levels. This is the strong point of DL against traditional machine learning
approaches. Table 1 shows the different feature-based learning approaches with different learning steps.
Figure 3. Accuracy for ImageNet classification challenge with different DL models.
At present, DL is being applied in almost all areas. As a result, this approach is often called a universal learning approach.
Figure 4. Phone error rate (PER) for TIMIT Acoustic-Phonetic Continuous Speech Corpus dataset [13–23]. (Bar chart; horizontal axis: phone error rate (PER) in percentage (%), from 0 to 35; bars include Boundary-factored SCRF [14], End-to-end DL [17], CTC [19], RNN transducer [19], DCNN [20], Ensemble DNN/CNN/RNN [21], Attention-based RNN [22], and Segmental RNN [23].)
(b). Automatic speech recognition. The initial success in the field of speech recognition on the popular TIMIT dataset (a common dataset generally used for evaluation) was with small-scale recognition tasks [24]. The TIMIT Acoustic-Phonetic Continuous Speech Corpus contains 630 speakers from eight major dialects of American English, where each speaker reads 10 sentences. Figure 4 summarizes the error rates, including these early results, measured as percent phone error rate (PER) over the last 20 years. The bar graph clearly shows that the recently developed DL approaches (top of the graph) perform better compared to any other previous machine learning approaches on the TIMIT dataset.
Some example applications are shown in Figures 5 and 6.
Figure 5. Data-driven traffic forecasting. Using the dynamics of the traffic flow (roads 1, 2, and 3) to capture the spatial dependency using the Diffusion Convolutional Recurrent Neural Network [25].
Electronics 2019, 8, 292 6 of 66
Figure 6. Example images where DL is applied successfully and achieved state-of-the-art performance. The images were taken from the corresponding references [26–29].
1.5.4. Scalability
The DL approach is highly scalable. Microsoft invented a deep network known as ResNet [11].
This network contains 1202 layers and is often implemented at a supercomputing scale. There is a big
initiative at Lawrence Livermore National Laboratory (LLNL) in developing frameworks for networks
like this, which can implement thousands of nodes [24].
1.6. Challenges of DL
There are several challenges for DL:
• Big data analytics using DL
• Scalability of DL approaches
• Ability to generate data, which is important where data is not available for learning the system (especially for computer vision tasks, such as inverse graphics).
• Energy efficient techniques for special purpose devices, including mobile intelligence, FPGAs, and so on.
• Multi-task and transfer learning or multi-module learning. This means learning from different domains or with different models together.
• Dealing with causality in learning.
Most of the above-mentioned challenges have already been considered by the DL community. Firstly, for the big data analytics challenge, there is a good survey that was conducted in 2014 [30]. In this paper, the authors explained in detail how DL can deal with different criteria, including volume, velocity, variety, and veracity of the big data problem. The authors also showed different advantages of DL approaches when dealing with big data problems [31,32]. Figure 7 clearly demonstrates that traditional ML approaches perform better for smaller amounts of input data. As the amount of data increases beyond a certain number, the performance of traditional machine learning approaches levels off, whereas the performance of DL approaches keeps increasing as the amount of data grows.
Figure 7. The performance of deep learning with respect to the amount of data.
Secondly, in most cases of solving large-scale problems, the solution is implemented on a High-Performance Computing (HPC) system (super-computing, cluster, sometimes considered cloud computing), which offers immense potential for data-intensive business computing. As data explodes in velocity, variety, veracity, and volume, it is getting increasingly difficult to scale compute performance using enterprise-class servers and storage in step with the increase. Most of the articles considered all the demands and suggested efficient HPC with heterogeneous computing systems. In one example, Lawrence Livermore National Laboratory (LLNL) has developed a framework called Livermore Big Artificial Neural Networks (LBANN) for large-scale implementation (at super-computing scale) of DL, which clearly addresses the issue of scalability of DL [24].
Thirdly, generative models are another challenge for deep learning. One example is the GAN, which is an outstanding approach for data generation for any task and can generate data with the same distribution [33]. Fourthly, multi-task and transfer learning are discussed in Section 9. Fifthly, there is a lot of research that has been conducted on energy efficient deep learning approaches with respect to network architectures and hardware. Section 10 discusses this issue.
Can we make any uniform model that can solve multiple tasks in different application domains?
As far as the multi-model system is concerned, one article from Google titled One Model To Learn Them
All [34] is a good example. This approach can learn from different application domains, including
ImageNet, multiple translation tasks, Image captioning (MS-COCO dataset), speech recognition
corpus and English parsing task. We will be discussing most of the challenges and respective solutions
through this survey. There are some other multi-task techniques that have been proposed in the last
few years [35–37].
Finally, a learning system with causality has been presented, which is a graphical model that
defines how one may infer a causal model from data. Recently a DL based approach has been proposed
for solving this type of problem [38]. In addition, many other challenging problems have been
solved in the last few years which were not possible to solve efficiently before this revolution. For
example, image or video captioning [39], style transferring from one domain to another domain using
GAN [40], text to image synthesis [41], and many more [42].
There are some surveys that have been conducted recently in the DL field [43–46]. These papers survey DL and its revolution, but they did not address the recently developed generative model called GAN [33]. In addition, they discuss RL only briefly and did not cover recent trends in DRL approaches [1,44]. In most cases, the surveys that have been conducted cover different DL approaches individually. There is a good survey of Reinforcement Learning approaches [46,47]. Another survey exists on transfer learning [48]. One survey has been conducted on neural network hardware [49]. However, the main objective of this work is to provide an overall idea of deep learning and its related fields, including deep supervised (e.g., DNN, CNN, and RNN), unsupervised (e.g., AE, RBM, GAN; GAN is sometimes also used for semi-supervised learning tasks), and DRL. In some cases, DRL is considered to be a semi-supervised or an unsupervised approach. In addition, we have considered the recently developing trends in this field and applications which are developed based on these techniques. Furthermore, we have included the frameworks and benchmark datasets which are often used for evaluating deep learning techniques. Moreover, the names of the conferences and journals that this community considers for publishing its research articles are also included.
The rest of the paper is organized as follows: DNNs are surveyed in detail in Section 2, and Section 3 discusses CNNs. Section 4 describes different advanced techniques for efficient training of DL approaches. Section 5 discusses RNNs. AEs and RBMs are discussed in Section 6. GANs with applications are discussed in Section 7. RL is presented in Section 8. Section 9 explains transfer learning. Section 10 presents energy efficient approaches and hardware for DL. Section 11 discusses deep learning frameworks and software development kits (SDKs). The benchmarks for different application domains with web links are given in Section 12. The conclusions are made in Section 13.
(including weights and biases that are learned during training) which produce outputs. This unit is called a perceptron. The fundamentals of ANNs are discussed in References [1,3].
1943 - present
• 1943: McCulloch & Pitts show that neurons can be combined to construct a Turing machine [50].
• 1958: Rosenblatt shows that perceptrons will converge if what they are trying to learn can be represented [51].
• 1969: Minsky & Papert show the limitations of perceptrons, killing research in neural networks for a decade [52].
• 1985: The backpropagation algorithm by Geoffrey Hinton et al. [53] revitalizes the field.
• 1988: Neocognitron: a hierarchical neural network capable of visual pattern recognition [54].
• 1998: CNNs with Backpropagation for document analysis by Yann LeCun [55].
• 2006: The Hinton lab solves the training problem for DNNs [56,57].
• 2012 - present: A variety of deep learning algorithms are increasingly emerging.
2.4. Back-Propagation (BP)
DNNs are trained with the popular Back-Propagation (BP) algorithm with SGD [47,53]. In the case of MLPs, we can easily represent NN models using computation graphs, which are directed acyclic graphs. For that representation of DL, we can use the chain rule to efficiently calculate the gradient from the top to the bottom layers with BP, as shown in References [53,59–63].
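The chain-rule computation described above can be sketched in plain Python for a very small MLP. This is only an illustrative sketch: the network shape (one sigmoid hidden layer, one linear output unit), the squared-error loss, and all function names are assumptions made for the example, not details taken from the survey.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, w2):
    # Hidden pre-activations and activations, then a linear output unit.
    z = [sum(wij * xj for wij, xj in zip(row, x)) for row in W1]
    h = [sigmoid(zi) for zi in z]
    y = sum(w * hi for w, hi in zip(w2, h))
    return h, y

def loss(x, t, W1, w2):
    # Squared-error loss (an assumption of this sketch).
    _, y = forward(x, W1, w2)
    return 0.5 * (y - t) ** 2

def backward(x, t, W1, w2):
    """Gradients of the loss w.r.t. W1 and w2, computed from the top layer down."""
    h, y = forward(x, W1, w2)
    dy = y - t                                    # dL/dy
    grad_w2 = [dy * hi for hi in h]               # dL/dw2_i = dL/dy * h_i
    dh = [dy * w for w in w2]                     # dL/dh_i
    dz = [dhi * hi * (1.0 - hi) for dhi, hi in zip(dh, h)]  # through sigmoid
    grad_W1 = [[dzi * xj for xj in x] for dzi in dz]        # dL/dW1_ij
    return grad_W1, grad_w2

def sgd_step(x, t, W1, w2, lr=0.1):
    # One plain SGD update using the back-propagated gradients.
    grad_W1, grad_w2 = backward(x, t, W1, w2)
    W1 = [[w - lr * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, grad_W1)]
    w2 = [w - lr * g for w, g in zip(w2, grad_w2)]
    return W1, w2
```

A finite-difference check of one weight is a simple way to confirm that the analytic gradient and the loss agree.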
2.5. Momentum
Momentum is a method which helps to accelerate the training process with the SGD approach. The main idea behind it is to use the moving average of the gradient instead of using only the current real value of the gradient. We can express this with the following equation mathematically:
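As a rough illustration, the momentum update can be sketched in Python. The sketch assumes the commonly used formulation v_t = γ·v_{t−1} − η·g_t followed by θ_t = θ_{t−1} + v_t; the symbols γ (momentum coefficient) and η (learning rate), the default values, and the toy quadratic objective are assumptions of this example rather than details stated in the text.

```python
def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update (assumed form: v = gamma*v - lr*g; theta += v)."""
    new_velocity = [gamma * v - lr * g for v, g in zip(velocity, grad)]
    new_theta = [t + v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

def minimize_quadratic(steps=200):
    # Toy objective f(theta) = 0.5 * (theta_0^2 + 10 * theta_1^2),
    # whose gradient is (theta_0, 10 * theta_1).
    theta, velocity = [1.0, 1.0], [0.0, 0.0]
    for _ in range(steps):
        grad = [theta[0], 10.0 * theta[1]]
        theta, velocity = momentum_step(theta, velocity, grad)
    return theta
```

The velocity acts as a decaying moving average of past gradients, so consistent gradient directions accumulate while oscillating ones partly cancel.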
where $\eta_t$ is the learning rate at round $t$, $\eta_0$ is the initial learning rate, and $\beta$ is the decay factor with a value in the range (0, 1).
The step function format for exponential decay is:
$$\eta_t = \eta_0 \, \beta^{\lfloor t/\epsilon \rfloor}. \qquad (4)$$
The common practice is to use a learning rate decay of β = 0.1 to reduce the learning rate by a
factor of 10 at each stage.
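A minimal Python sketch of the step decay in Equation (4) follows. It assumes that ε is the number of training rounds per stage (the text does not define it), and the function and parameter names are illustrative only.

```python
def step_decay(eta0, beta, t, epsilon):
    """Learning rate at round t: eta0 * beta ** floor(t / epsilon).

    epsilon is assumed to be the (integer) number of rounds per stage.
    """
    return eta0 * beta ** (t // epsilon)

# With beta = 0.1 the rate drops by a factor of 10 every `epsilon` rounds.
schedule = [step_decay(0.1, 0.1, t, 30) for t in (0, 29, 30, 65)]
```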
3.1. CNN Overview
This network structure was first proposed by Fukushima in 1988 [54]. It was not widely used, however, due to limits of computation hardware for training the network. In the 1990s, LeCun et al. [55] applied a gradient-based learning algorithm to CNNs and obtained successful results for the handwritten digit classification problem. After that, researchers further improved CNNs and reported state-of-the-art results in many recognition tasks. CNNs have several advantages over DNNs, including being more like the human visual processing system, being highly optimized in the structure for processing 2D and 3D images, and being effective at learning and extracting abstractions of 2D features. The max pooling layer of CNNs is effective in absorbing shape variations. Moreover, composed of sparse connections with tied weights, CNNs have significantly fewer parameters than a fully connected network of similar size. Most of all, CNNs are trained with the gradient-based learning algorithm and suffer less from the diminishing gradient problem. Given that the gradient-based algorithm trains the whole network to minimize an error criterion directly, CNNs can produce highly optimized weights.
Figure 9 shows that the overall architecture of CNNs consists of two main parts: feature extractors and a classifier. In the feature extraction layers, each layer of the network receives the output from its immediate previous layer as its input and passes its output as the input to the next layer. The CNN architecture consists of a combination of three types of layers: convolution, max-pooling, and classification. There are two types of layers in the low and middle levels of the network: convolutional layers and max-pooling layers. The even-numbered layers are for convolutions and the odd-numbered layers are for max-pooling operations. The output nodes of the convolution and max-pooling layers are grouped into 2D planes called feature maps. Each plane of a layer is usually derived from the combination of one or more planes of previous layers. The nodes of a plane are connected to a small region of each connected plane of the previous layer. Each node of the convolution layer extracts the features from the input images by convolution operations on the input nodes.
Figure 9. The overall architecture of the Convolutional Neural Network (CNN) includes an input layer, multiple alternating convolution and max-pooling layers, one fully-connected layer and one classification layer.
Higher-level features are derived from features propagated from lower level layers. As the features propagate to the highest layer or level, the dimensions of features are reduced depending on the size of the kernel for the convolutional and max-pooling operations respectively. However, the number of feature maps usually increases in order to represent better features of the input images and ensure classification accuracy. The output of the last layer of the CNN is used as the input to a fully connected network, which is called the classification layer. Feed-forward neural networks have been used as the classification layer as they have better performance [56,64]. In the classification layer, the extracted features are taken as inputs with respect to the dimension of the weight matrix of the final neural network. However, the fully connected layers are expensive in terms of network or learning parameters. Nowadays, there are several new techniques, including average pooling and global average pooling, that are used as alternatives to fully-connected networks. The score of the respective class is calculated in the top classification layer using a soft-max layer. Based on the highest score, the classifier gives output for the corresponding classes. Mathematical details on different layers of CNNs are discussed in the following section.
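The final classification step described above (soft-max scores, then picking the class with the highest score) can be sketched as follows. The function names and the example scores are illustrative assumptions; the max-subtraction is a standard numerical-stability trick and not something the text prescribes.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return (index of the highest-probability class, all class probabilities)."""
    probs = softmax(scores)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs
```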
where $x_j^l$ is the output of the current layer, $x_i^{l-1}$ is the previous layer output, $k_{ij}^l$ is the kernel for the present layer, and $b_j^l$ are the biases for the current layer. $M_j$ represents a selection of input maps. For each output map, an additive bias $b$ is given. However, the input maps will be convolved with distinct kernels to generate the corresponding output maps. The output maps finally go through a linear or non-linear activation function (such as sigmoid, hyperbolic tangent, Softmax, rectified linear, or identity functions).
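A small pure-Python sketch of such a convolutional layer is given below. It assumes the layer computes, for each output map j, an activation applied to the sum over i in M_j of the input map i convolved with kernel k_ij, plus the bias b_j, which matches the symbol definitions above. The "valid" border handling, the choice of ReLU as the activation, and the data layout (`kernels[i][j]`, `selections[j]`) are assumptions of this example.

```python
def conv2d_valid(x, k):
    """'Valid' 2D convolution of map x with kernel k (the kernel is flipped)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            out[r][c] = sum(
                x[r + a][c + b] * k[kh - 1 - a][kw - 1 - b]
                for a in range(kh) for b in range(kw)
            )
    return out

def relu(v):
    return max(0.0, v)

def conv_layer(inputs, kernels, biases, selections, f=relu):
    """Compute every output map j from the input maps listed in selections[j]."""
    outputs = []
    for j, M_j in enumerate(selections):
        acc = None
        for i in M_j:
            y = conv2d_valid(inputs[i], kernels[i][j])
            if acc is None:
                acc = y
            else:
                acc = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(acc, y)]
        # Add the per-map bias, then apply the activation function.
        outputs.append([[f(v + biases[j]) for v in row] for row in acc])
    return outputs
```

Many deep learning libraries implement this layer as cross-correlation (no kernel flip); since the kernels are learned, the two conventions are equivalent in practice.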
where down(.) represents a sub-sampling function. Two types of operations are mostly performed in this layer: average pooling or max-pooling. In the case of the average pooling approach, the function usually sums up over N × N patches of the feature maps from the previous layer and selects the average value. On the other hand, in the case of max-pooling, the highest value is selected from the N × N patches of the feature maps. Therefore, the output map dimensions are reduced by a factor of N. In some special cases, each output map is multiplied by a scalar. Some alternative sub-sampling layers have been proposed, such as the fractional max-pooling layer and sub-sampling with convolution. These are explained in Section 4.6.
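The two sub-sampling operations can be sketched in a few lines of Python. The sketch assumes non-overlapping N × N patches (stride equal to N) and drops any rows or columns that do not fill a complete patch; the function name and `mode` argument are illustrative.

```python
def pool2d(x, n, mode="max"):
    """Down-sample a 2D feature map over non-overlapping n x n patches."""
    rows, cols = len(x) // n, len(x[0]) // n
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            patch = [x[r * n + a][c * n + b] for a in range(n) for b in range(n)]
            if mode == "max":
                row.append(max(patch))               # max-pooling
            else:
                row.append(sum(patch) / len(patch))  # average pooling
        out.append(row)
    return out
```

For a 4 × 4 input and n = 2, both modes return a 2 × 2 map, which shows the reduction of each dimension by a factor of N.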
the number of layers which are incorporated in the network model. However, in most cases, two to four layers have been observed in different architectures, including LeNet [55], AlexNet [7], and VGG Net [9]. As the fully connected layers are expensive in terms of computation, alternative approaches have been proposed during the last few years. These include the global average pooling layer and the average pooling layer which help to reduce the number of parameters in the network significantly.
In the backward propagation through the CNNs, the fully connected layer updates following the general approach of fully connected neural networks (FCNN). The filters of the convolutional layers are updated by performing the full convolutional operation on the feature maps between the convolutional layer and its immediate previous layer. Figure 10 shows the basic operations in the convolution and sub-sampling of an input image.
If bias is added with the weights, then the above equation can be written as follows:
$$Parm_l = (F \times (F + 1) \times FM_{l-1}) \times FM_l, \qquad (13)$$
here the total number of parameters of the $l$th layer can be represented with $Parm_l$, $FM_l$ is the total number of output feature maps, and $FM_{l-1}$ is the total number of input feature maps or channels. For example, let's assume the $l$th layer has $FM_{l-1} = 32$ input feature maps, $FM_l = 64$ output feature maps, and the filter size is $F = 5$. In this case, the total number of parameters with a bias for this layer is $Parm_l = (5 \times 5 \times 33) \times 64 = 52{,}800$. Thus, the amount of memory ($Mem_l$) needed for the operations of the $l$th layer can be expressed as
$$Mem_l = (N_l \times N_l \times FM_l). \qquad (14)$$
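The counts above are easy to check numerically. The sketch below evaluates Equation (13) as printed, the grouping used in the worked example, (F × F × (FM_{l−1} + 1)) × FM_l, and, as a separate assumption not taken from the text, the widespread convention of one bias per output map. The three do not coincide, so it is worth stating which convention is meant when reporting a parameter count. `N` in `feature_map_memory` is assumed to be the width and height of the output feature map.

```python
def conv_params_eq13(F, fm_in, fm_out):
    """Equation (13) as printed: (F x (F + 1) x FM_{l-1}) x FM_l."""
    return (F * (F + 1) * fm_in) * fm_out

def conv_params_example(F, fm_in, fm_out):
    """Grouping used in the worked example: (F x F x (FM_{l-1} + 1)) x FM_l."""
    return (F * F * (fm_in + 1)) * fm_out

def conv_params_per_map_bias(F, fm_in, fm_out):
    """Common convention (an assumption here): one bias per output map."""
    return (F * F * fm_in + 1) * fm_out

def feature_map_memory(N, fm_out):
    """Equation (14): number of values stored for N x N output maps."""
    return N * N * fm_out

# F = 5, 32 input maps, 64 output maps, as in the example in the text.
counts = (
    conv_params_eq13(5, 32, 64),
    conv_params_example(5, 32, 64),      # (5 x 5 x 33) x 64
    conv_params_per_map_bias(5, 32, 64),
)
```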
recognition accuracy against all the traditional machine learning and computer vision approaches.
It was a significant breakthrough in the field of machine learning and computer vision for visual
recognition and classification tasks and is the point in history where interest in deep learning
increased rapidly.
The architecture of AlexNet is shown in Figure 12. The first convolutional layer performs convolution and max-pooling with Local Response Normalization (LRN), where 96 different receptive filters are used that are 11 × 11 in size. The max pooling operations are performed with 3 × 3 filters with a stride size of 2. The same operations are performed in the second layer with 5 × 5 filters. 3 × 3 filters are used in the third, fourth, and fifth convolutional layers with 384, 384, and 256 feature maps respectively. Two fully connected (FC) layers are used with dropout, followed by a Softmax layer at the end. Two networks with similar structure and the same number of feature maps are trained in parallel for this model. Two new concepts, Local Response Normalization (LRN) and dropout, are introduced in this network. LRN can be applied in two different ways. First, it can be applied on a single channel or feature map, where an N × N patch is selected from the same feature map and normalized based on the neighborhood values. Second, LRN can be applied across the channels or feature maps (a neighborhood along the third dimension but at a single pixel or location).
Figure 12. The architecture of AlexNet: Convolution, max-pooling, Local Response Normalization
(LRN) and fully connected (FC) layer.
AlexNet has five convolution layers and three fully connected layers. When processing the
ImageNet dataset, the size of the first layer can be calculated as follows: the input samples are
224 × 224 × 3, the filters (kernels or masks), or receptive fields, have a size of 11 × 11, the
stride is 4, and the output of the first convolution layer is 55 × 55 × 96. According to the
equations in Section 3.1.4, this first layer has 290,400 (55 × 55 × 96) neurons, each with 364
weights (11 × 11 × 3 = 363, plus 1 bias). The first convolution layer therefore has
290,400 × 364 = 105,705,600 connections; because the weights are shared across spatial positions,
this corresponds to 96 × 364 = 34,944 distinct learnable parameters. Table 2 shows the number of
parameters for each layer in millions. The total number of weights and MACs for the whole network
are 61M and 724M, respectively.
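The first-layer arithmetic above can be checked with a short sketch. The helper name `conv_output_size` is an assumption introduced here for illustration, and the standard no-padding output-size formula is assumed. Under that formula, the stated 55 × 55 output follows from the 227 × 227 input size listed in Table 2, whereas a 224 × 224 input would give 54 × 54.

```python
def conv_output_size(input_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1


# Values taken from the text: 11 x 11 kernels, stride 4, 96 filters, 3 input channels.
out_size_227 = conv_output_size(227, kernel=11, stride=4)  # input size from Table 2
out_size_224 = conv_output_size(224, kernel=11, stride=4)  # input size quoted in the text

neurons = 55 * 55 * 96                      # output activations of the first layer
weights_per_neuron = 11 * 11 * 3 + 1        # 363 weights plus 1 bias
connections = neurons * weights_per_neuron  # one multiply per neuron per weight
shared_parameters = 96 * weights_per_neuron # distinct weights, shared across positions

print(out_size_227, out_size_224)
print(neurons, weights_per_neuron, connections, shared_parameters)
```

The gap between `connections` and `shared_parameters` is the effect of weight sharing: every spatial position reuses the same 96 filters.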
3.2.3. ZFNet / Clarifai (2013)
Matthew Zeiler and Rob Fergus won the 2013 ILSVRC with a CNN architecture which was an extension
of AlexNet. The network was called ZFNet [8], after the authors' names. As CNNs are computationally
expensive, an optimum use of parameters is needed from a model complexity point of view. The ZFNet
architecture is an improvement of AlexNet, designed by tweaking the network parameters of the
latter. ZFNet uses 7 × 7 kernels instead of 11 × 11 kernels, which dramatically reduces the number
of network parameters and improves overall recognition accuracy.
3.2.5. VGGNET (2014)
The Visual Geometry Group (VGG) was the runner-up of the 2014 ILSVRC [9]. The main contribution
of this work is that it shows that the depth of a network is a critical component for achieving
better recognition or classification accuracy in CNNs. The VGG architecture consists of two
convolutional layers, both of which use the ReLU activation function. Following the activation
function is a single max-pooling layer and several fully connected layers, also using a ReLU
activation function. The final layer of the model is a Softmax layer for classification. In
VGG-E [9], the convolution filter size is changed to a 3 × 3 filter with a stride of 1. Three
VGG-E [9] models, VGG-11, VGG-16, and VGG-19, were proposed; the models had 11, 16, and 19 layers,
respectively. The VGG network model is shown in Figure 13.
Figure 13. The basic building block of VGG network: Convolution (Conv) and FC for fully connected
layers.
All versions of the VGG-E models ended the same way, with three fully connected layers. However,
the number of convolution layers varied: VGG-11 contained 8 convolution layers, VGG-16 had 13
convolution layers, and VGG-19 had 16 convolution layers. VGG-19, the most computationally
expensive model, contained 138M weights and had 15.5G MACs.
3.2.6. GoogLeNet (2014)
GoogLeNet, the winner of ILSVRC 2014 [10], was a model proposed by Christian Szegedy of
Google with the objective of reducing computation complexity compared to the traditional CNN.
The proposed method was to incorporate Inception Layers that had variable receptive fields, which
were created by different kernel sizes. These receptive fields created operations that captured
sparse correlation patterns in the new feature map stack.
stack.can be seen in Figure 14. GoogLeNet improved
state-of-the-art recognition accuracy using a stack of Inception layers, seen in Figure 15. The difference
between the naïve inception layer and final Inception Layer was the addition of 1 × 1 convolution
kernels. These kernels allowed for dimensionality reduction before computationally expensive layers.
GoogLeNet consisted of 22 layers in total, which was far greater than any network before it. Later
improved version of this network is proposed in [71]. However, the number of network parameters
7M network parameters when AlexNet had 60M and VGG-19 138M. The computations for
GoogLeNet also were 1.53G MACs far14.
Figure
Figure
lower
14. than layer:
Inception
Inception
that ofNaive
layer: AlexNet or VGG.
Naive version.
version.
The initial concept of the Inception layer can be seen in Figure 14. GoogLeNet improved state-
of-the-art recognition accuracy using a stack of Inception layers, seen in Figure 15. The difference
between the naïve inception layer and final Inception Layer was the addition of 1x1 convolution
Electronics 2019, 8, x FOR PEER REVIEW 17 of 67
kernels. These kernels allowed for dimensionality reduction before computationally expensive
layers. GoogLeNet
7M network consisted of 22
parameters layers
when in total,
AlexNet which
had 60M and was far 138M.
VGG-19 greater
Thethan any network
computations for before it.
GoogLeNet also were 1.53G MACs far lower than that of AlexNet or VGG.
Later improved version of this network is proposed in [71]. However, the number of network
parameters GoogLeNet used was much lower than its predecessor AlexNet or VGG. GoogLeNet had
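The effect of the 1 × 1 dimensionality reduction described above can be illustrated with a weight count. This is a minimal sketch: the channel counts (192 input channels, a 16-channel bottleneck, 32 output channels) are hypothetical values chosen for illustration rather than figures from the text, and biases are ignored.

```python
def conv_weights(kernel: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Hypothetical channel counts, chosen only for illustration.
in_channels, bottleneck, out_channels = 192, 16, 32

# Naive version: a 5 x 5 convolution applied directly to the input stack.
naive = conv_weights(5, in_channels, out_channels)

# Final version: a 1 x 1 convolution reduces the depth before the 5 x 5 convolution.
reduced = conv_weights(1, in_channels, bottleneck) + conv_weights(5, bottleneck, out_channels)

print(naive, reduced, round(naive / reduced, 1))
```

With these assumed sizes, the bottleneck cuts the weights of the 5 × 5 branch by roughly an order of magnitude, which is the mechanism behind the low parameter count reported for GoogLeNet.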
Figure 16. Basic diagram of the Residual block.
The residual network consists of several basic residual blocks. However, the operations in the
residual block can vary depending on the architecture of the residual network [11]. A wider
version of the residual network was proposed by Zagoruyko et al. [72], and another improved
residual network approach is known as aggregated residual transformation [73]. Recently, some other
variants of residual models have been introduced based on the Residual Network architecture [74–76].
Furthermore, there are several advanced architectures that combine Inception and Residual
units. The basic conceptual diagram of the Inception-Residual unit is shown in Figure 17.
Figure 17. The basic block diagram for Inception Residual unit.
Mathematically, this concept can be represented as
$$x_l = \mathcal{F}\left(x_{l-1}^{3\times3} \odot x_{l-1}^{5\times5}\right) + x_{l-1}, \tag{16}$$
where the symbol $\odot$ refers to the concatenation operation between the two outputs from the
3 × 3 and 5 × 5 filters. After that, the convolution operation is performed with 1 × 1 filters.
Finally, the outputs are added to the inputs of this block, $x_{l-1}$. The concept of the Inception
block with residual connections was introduced in the Inception-v4 architecture [71]. Improved
versions of the Inception-Residual network were also proposed [76,77].
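The Inception-Residual unit of Equation (16) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation from [71]: the function names (`conv2d_same`, `inception_residual_unit`) and all tensor sizes are hypothetical, the convolution is a naive zero-padded loop, and nonlinearities and normalization are omitted.

```python
import numpy as np


def conv2d_same(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Naive zero-padded convolution. x: (H, W, C_in); w: (K, K, C_in, C_out), K odd."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    height, width = x.shape[:2]
    out = np.zeros((height, width, w.shape[3]))
    for i in range(height):
        for j in range(width):
            patch = xp[i:i + k, j:j + k, :]             # (K, K, C_in)
            out[i, j] = np.tensordot(patch, w, axes=3)  # contract K, K, C_in
    return out


def inception_residual_unit(x, w3, w5, w1):
    """x_l = F(x^{3x3} (concat) x^{5x5}) + x_{l-1}, with F taken as a 1 x 1 convolution."""
    branch3 = conv2d_same(x, w3)                          # 3 x 3 branch
    branch5 = conv2d_same(x, w5)                          # 5 x 5 branch
    merged = np.concatenate([branch3, branch5], axis=-1)  # concatenation along channels
    projected = conv2d_same(merged, w1)                   # 1 x 1 conv back to C channels
    return projected + x                                  # residual (identity) connection


rng = np.random.default_rng(0)
channels, branch_channels = 4, 3                          # hypothetical sizes
x = rng.standard_normal((8, 8, channels))
w3 = rng.standard_normal((3, 3, channels, branch_channels)) * 0.1
w5 = rng.standard_normal((5, 5, channels, branch_channels)) * 0.1
w1 = rng.standard_normal((1, 1, 2 * branch_channels, channels)) * 0.1

y = inception_residual_unit(x, w3, w5, w1)
identity = inception_residual_unit(x, w3, w5, np.zeros_like(w1))
print(y.shape)
```

The 1 × 1 projection has to restore the input channel count so that the addition with $x_{l-1}$ is well defined. With a zero projection the unit reduces to the identity mapping.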
3.2.8. Densely Connected Network (DenseNet)
DenseNet, developed by Gao Huang et al. in 2017 [68], consists of densely connected CNN layers:
the outputs of each layer are connected with all successor layers in a dense block [68]. It is
therefore formed with dense connectivity between the layers, which gives it the name DenseNet. This
concept is efficient for feature reuse, which dramatically reduces network parameters. DenseNet
consists of several dense blocks and transition blocks, which are placed between two adjacent dense
blocks. The conceptual diagram of a dense block is shown in Figure 18.
Each layer takes all the preceding feature maps as input. When deconstructing Figure 18, the
$l$-th layer receives all the feature maps from the previous layers, $x_0, x_1, x_2, \cdots, x_{l-1}$,
as input:
$$x_l = H_l\left([x_0, x_1, x_2, \cdots, x_{l-1}]\right), \tag{17}$$
where $[x_0, x_1, x_2, \cdots, x_{l-1}]$ are the concatenated features for layers $0, \cdots, l-1$,
considered as a single tensor, and $H_l(\cdot)$ performs three different consecutive operations:
Batch-Normalization (BN) [78], followed by a ReLU [70] and a 3 × 3 convolution operation. In the
transition block, 1 × 1 convolutional operations are performed with BN, followed by a 2 × 2 average
pooling layer. This new model shows state-of-the-art accuracy with a reasonable number of network
parameters for object recognition tasks.
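The dense connectivity of Equation (17) can be sketched as follows. This is only an illustration of how features are concatenated and reused: the name `dense_block` and the sizes are hypothetical, and the composite function $H_l$ is replaced by a simplified stand-in (a ReLU followed by a per-pixel linear projection) instead of the BN, ReLU, and 3 × 3 convolution sequence described above.

```python
import numpy as np


def dense_block(x0: np.ndarray, num_layers: int, growth_rate: int, rng) -> np.ndarray:
    """Each layer receives the concatenation [x_0, ..., x_{l-1}] and adds growth_rate maps."""
    features = [x0]
    for _ in range(num_layers):
        concat = np.concatenate(features, axis=-1)   # [x_0, x_1, ..., x_{l-1}]
        w = rng.standard_normal((concat.shape[-1], growth_rate)) * 0.1
        x_l = np.maximum(concat, 0.0) @ w            # simplified stand-in for H_l
        features.append(x_l)
    return np.concatenate(features, axis=-1)


rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4, 8))                  # hypothetical 4 x 4 map with 8 channels
out = dense_block(x0, num_layers=3, growth_rate=4, rng=rng)
print(out.shape)
```

Every layer only has to produce `growth_rate` new feature maps because all earlier maps remain available through the concatenation, so the channel count grows linearly with depth (here 8 + 3 × 4 = 20).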
3.2.9. FractalNet (2016)
This architecture is an advanced and alternative architecture of the ResNet model, which is
efficient for designing large models with nominal depth, but shorter paths for the propagation of
gradient during training [69]. This concept is based on drop-path, which is another regularization
approach for making large networks. As a result, this concept helps to enforce speed versus accuracy
tradeoffs. The basic block diagram of FractalNet is shown in Figure 19.
Figure 19. The detailed FractalNet module on the left and FractalNet on the right.
3.3. CapsuleNet
CNNs are an effective methodology for detecting features of an object and achieving good
recognition performance compared to state-of-the-art handcrafted feature detectors. However, CNNs
have limits: they do not take into account the spatial relationships, perspective, size, and
orientation of features. For example, in a face image, the placement of the different components
(nose, eye, mouth, etc.) does not matter: the neurons of a CNN will wrongly activate and
recognize the image as a face without considering spatial relationships (orientation, size). Now,
imagine a neuron which contains the likelihood together with the properties of features
(perspective, orientation, size, etc.). This special type of neuron, the capsule, can detect a face
efficiently with distinct information. The capsule network consists of several layers of capsule
nodes. The first version of the capsule network (CapsNet) consisted of three layers of capsule
nodes in an encoding unit.
In this architecture, for MNIST (28 × 28) images, 256 kernels of size 9 × 9 are applied with a
stride of 1, so the output is 20 × 20 (28 − 9 + 1 = 20) with 256 feature maps. The outputs are then
fed to the primary capsule layer, which is a modified convolution layer that generates an 8-D
vector instead of a scalar. In this primary capsule layer, 9 × 9 kernels are applied with a stride
of 2, so the output dimension is 6 × 6 (⌊(20 − 9)/2⌋ + 1 = 6). The primary capsules use 8 × 32
kernels, which generate 32 × 8 × 6 × 6 outputs (32 groups of 8 neurons with 6 × 6 size).
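The layer sizes quoted for CapsNet on MNIST can be reproduced with the standard no-padding output-size formula. The helper name `conv_output_size` is an assumption introduced for illustration. Integer division is used because (20 − 9)/2 is not a whole number.

```python
def conv_output_size(input_size: int, kernel: int, stride: int = 1) -> int:
    """Output size of an unpadded convolution: floor((W - K) / S) + 1."""
    return (input_size - kernel) // stride + 1


conv1_size = conv_output_size(28, kernel=9, stride=1)            # first convolution
primary_size = conv_output_size(conv1_size, kernel=9, stride=2)  # primary capsule layer

groups, capsule_dim = 32, 8                                      # values from the text
num_primary_capsules = groups * primary_size * primary_size      # number of 8-D vectors
num_primary_outputs = num_primary_capsules * capsule_dim         # 32 x 8 x 6 x 6 scalars

print(conv1_size, primary_size, num_primary_capsules, num_primary_outputs)
```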
Figure 20. A CapsNet encoding unit with 3 layers. The instance of each class is represented with a
vector of a capsule in DigitCaps layer that is used for calculating classification loss. The weights
between the primary capsule layer and DigitCaps layer are represented with $W_{ij}$.
The entire encoding and decoding processes of CapsNet are shown in Figures 20 and 21,
respectively. A max-pooling layer is often used in CNNs to handle translation variance: even if a
feature moves, it can still be detected as long as it remains under a max-pooling window. As the
capsule contains the weighted sum of features from the previous layer, this approach is capable of
detecting overlapped features, which is important for segmentation and detection tasks.
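The translation tolerance of max-pooling mentioned above can be demonstrated on a toy input. This is a minimal sketch with a hypothetical `max_pool2d` helper and a 4 × 4 single-channel map: a feature that shifts inside the same 2 × 2 window produces the same pooled output, while a shift into another window does not.

```python
import numpy as np


def max_pool2d(x: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Max-pooling over a single-channel 2-D feature map."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out


original = np.zeros((4, 4))
original[0, 0] = 1.0            # feature at the top-left corner

shifted_inside = np.zeros((4, 4))
shifted_inside[1, 1] = 1.0      # same feature, moved within the same 2 x 2 window

shifted_outside = np.zeros((4, 4))
shifted_outside[2, 2] = 1.0     # same feature, moved into a different window

print(max_pool2d(original))
print(max_pool2d(shifted_inside))
print(max_pool2d(shifted_outside))
```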
Figure 21. The decoding unit where a digit is reconstructed from DigitCaps layer representation. The
Euclidean distance is used for minimizing the error between the input sample and the reconstructed
sample from the sigmoid layer. True labels are used for reconstruction target during training.
In the traditional CNN, a single cost function is used to evaluate the overall error, which
propagates backward during training. However, in this case, if the weight between two neurons is
zero, then the activation of a neuron is not propagated from that neuron. The signal is routed with
respect to the feature parameters rather than a one-size-fits-all cost function in iterative dynamic
routing with agreement. For details about this architecture, please see Reference [79]. This new
CNN architecture provides state-of-the-art accuracy for handwritten digit recognition on MNIST.
However, from an application point of view, this architecture is more suitable for segmentation and
detection tasks compared to classification tasks.
3.4. Comparison of Different Models
The comparison of recently proposed models based on error, network parameters, and a maximum
number of connections is given in Table 2.
Table 2. Top-5 errors (%) with computational parameters and MACs for different deep CNN models.
Methods                  | LeNet-5 [54] | AlexNet [7] | OverFeat (fast) [8] | VGG-16 [9] | GoogLeNet [10] | ResNet-50(v1) [11]
Top-5 errors             | n/a          | 16.4        | 14.2                | 7.4        | 6.7            | 5.3
Input size               | 28 × 28      | 227 × 227   | 231 × 231           | 224 × 224  | 224 × 224      | 224 × 224
Number of Conv Layers    | 2            | 5           | 5                   | 16         | 21             | 50
Filter Size              | 5            | 3,5,11      | 3,7                 | 3          | 1,3,5,7        | 1,3,7
Number of Feature Maps   | 1,6          | 3–256       | 3–1024              | 3–512      | 3–1024         | 3–1024
Stride                   | 1            | 1,4         | 1,4                 | 1          | 1,2            | 1,2
Number of Weights (Conv) | 26 k         | 2.3 M       | 16 M                | 14.7 M     | 6.0 M          | 23.5 M
Number of MACs (Conv)    | 1.9 M        | 666 M       | 2.67 G              | 15.3 G     | 1.43 G         | 3.86 G
Number of FC layers      | 2            | 3           | 3                   | 3          | 1              | 1
Number of Weights (FC)   | 406 k        | 58.6 M      | 130 M               | 124 M      | 1 M            | 1 M
Number of MACs (FC)      | 405 k        | 58.6 M      | 130 M               | 124 M      | 1 M            | 1 M
Total Weights            | 431 k        | 61 M        | 146 M               | 138 M      | 7 M            | 25.5 M
Total MACs               | 2.3 M        | 724 M       | 2.8 G               | 15.5 G     | 1.43 G         | 3.9 G
compared to existing approaches [114–119]. However, the state-of-the-art models for classification,
segmentation and detection tasks are listed as follows:
(1) Models for classification problems: according to the architecture of classification models, the
input images are encoded in different steps with convolution and subsampling layers, and finally the
SoftMax approach is used to calculate class probability. Most of the models discussed above
are applied to the classification problem. However, these models with a classification layer can
also be used as feature extractors for segmentation and detection tasks. The classification models
include: AlexNet [55], VGGNet [9], GoogleNet [10], ResNet [11], DenseNet [68], FractalNet [69],
CapsuleNet [79], IRCNN [83], IRRCNN [77], DCRN [120] and so on.
(2) Models for segmentation problems: several semantic segmentation models have been proposed in
the last few years. A segmentation model consists of two units: encoding and decoding units. In
the encoding unit, convolution and subsampling operations are performed to encode the input into a
lower-dimensional latent space, whereas the decoding unit decodes the image from the latent space
by performing deconvolution and up-sampling operations. The very first segmentation model is the
Fully Convolutional Network (FCN) [27,121]. Later, an improved version of this network was proposed,
named SegNet [122]. Several new models have been proposed recently, including RefineNet [123],
PSPNet [124], DeepLab [125], UNet [126], and R2U-Net [127].
(3) Models for detection problems: the detection problem is a bit different compared to
classification and segmentation problems. In this case, the goal of the model is to identify target
types together with their corresponding positions. The model answers two questions: What is the
object (classification problem)? And where is the object (regression problem)? To achieve these
goals, two losses are calculated for the classification and regression units on top of the feature
extraction module, and the model weights are updated with respect to both losses. Region-based CNN
(R-CNN) was the first model proposed for the object detection task [128]. Improved versions of this
network were later proposed, called Fast R-CNN and Faster R-CNN [80,130]. More recent detection
approaches include focal loss for dense object detection [129], Mask R-CNN [131], You Only Look
Once (YOLO) [132], SSD: Single Shot MultiBox Detector [133], and UD-Net for tissue detection
from pathological images [120].
$$w_l \sim \mathcal{N}\left(0, \frac{2}{n_l}\right). \tag{18}$$
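Equation (18) can be sketched in NumPy as follows. The sketch assumes that $n_l$ denotes the number of incoming connections (fan-in) of layer $l$ and that the second argument of $\mathcal{N}$ is the variance. The function name `he_normal` and the layer sizes are hypothetical.

```python
import numpy as np


def he_normal(n_in: int, n_out: int, rng) -> np.ndarray:
    """Draw weights from N(0, 2 / n_in); n_in is assumed to be the layer's fan-in."""
    std = np.sqrt(2.0 / n_in)      # standard deviation is the square root of the variance
    return rng.normal(0.0, std, size=(n_in, n_out))


rng = np.random.default_rng(0)
w = he_normal(512, 256, rng)       # hypothetical layer with 512 inputs and 256 outputs

target_variance = 2.0 / 512
print(w.shape, w.var(), target_variance)
```

The sample variance of the drawn weights should be close to `target_variance`. Wider layers (larger $n_l$) therefore start with proportionally smaller weights.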
4.4. Alternative Convolutional Methods
Alternative and computationally efficient convolutional techniques that reduce the cost of
multiplications by a factor of 2.5 have been proposed [147].
4.5. Activation Function
The traditional Sigmoid and Tanh activation functions have been used for implementing neural
network approaches in the past few decades. The graphical and mathematical representation is shown
in Figure 22.
Sigmoid:
$$y = \frac{1}{1 + e^{-x}}. \tag{19}$$
Tanh:
$$y = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \tag{20}$$
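Both activation functions can be written directly from their standard definitions, Equations (19) and (20). This is a minimal NumPy sketch; the function names are chosen here for illustration, and the hand-written Tanh is only compared against NumPy's built-in `np.tanh` as a sanity check.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid: y = 1 / (1 + exp(-x)), with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def tanh(x: np.ndarray) -> np.ndarray:
    """Tanh: y = (exp(x) - exp(-x)) / (exp(x) + exp(-x)), with outputs in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))


x = np.linspace(-4.0, 4.0, 9)
sig_values = sigmoid(x)
tanh_values = tanh(x)
print(sig_values)
print(tanh_values)
```

The two functions are related by tanh(x) = 2·sigmoid(2x) − 1, so Tanh is a rescaled Sigmoid that is centered at zero.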
The popular activation function called Rectified Linear Unit (ReLU), proposed in 2010, solves the
vanishing gradient problem for training deep learning approaches. The basic concept is simple: keep
all the values above zero and set all negative values to zero, as shown in Figure 23 [64]. The
ReLU activation was first used in AlexNet [7].
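The ReLU rule described above (keep positive values, set negative values to zero) can be sketched as follows. The function names are illustrative, and the gradient at exactly zero is taken as 0 here, which is a common convention but an assumption of this sketch.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    """Keep values above zero and set negative values to zero."""
    return np.maximum(x, 0.0)


def relu_grad(x: np.ndarray) -> np.ndarray:
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise (0 at x = 0 by convention)."""
    return (x > 0).astype(float)


x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
activated = relu(x)
grads = relu_grad(x)
print(activated)
print(grads)
```

For positive inputs the gradient is exactly 1 regardless of magnitude, so it does not shrink as it is multiplied through many layers. This is the property that helps with the vanishing gradient problem.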
Figure 23. Pictorial representation of Rectified Linear Unit (ReLU).
Figure 24. Diagram for (a) Leaky ReLU (Rectified Linear Unit), and (b) Exponential Linear Unit (ELU).
Leaky ReLU:
Figure 26. Spatial pyramid pooling.
The multi-scale pyramid pooling was proposed in 2015 [153]. In 2015, Benjamin Graham proposed a
new architecture with Fractional max pooling, which provides state-of-the-art classification accuracy
for the CIFAR-10 and CIFAR-100 datasets. This structure generalizes the network by considering two
important properties for a sub-sampling layer or pooling layer. First, the non-overlapped max-pooling
layer limits the generalization of the deep structure of the network, so this paper proposed a
network with 3 × 3 overlapped max-pooling with a stride of 2 instead of 2 × 2 as the sub-sampling
layer [154]. Another paper conducted research on different types of pooling approaches, including
mixed, gated, and tree pooling, as a generalization of pooling functions [155].
4.7. Regularization Approaches for DL
There are different regularization approaches that have been proposed in the past few years for
deep CNN. The simplest but efficient approach, called dropout, was proposed by Hinton in 2012 [156].
In Dropout, a randomly selected subset of activations is set to zero within a layer [157]. The
dropout concept is shown in Figure 27.
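The dropout operation described in this section (a randomly selected subset of activations is set to zero within a layer) can be sketched as follows. The function name and the drop probability are hypothetical. Rescaling of the surviving activations, which many implementations apply, is not described in the text and is therefore omitted here.

```python
import numpy as np


def dropout(activations: np.ndarray, drop_prob: float, rng):
    """Set a random subset of activations to zero; returns the result and the keep mask."""
    keep_mask = rng.random(activations.shape) >= drop_prob
    return activations * keep_mask, keep_mask


rng = np.random.default_rng(0)
layer_output = np.ones((4, 5))                     # hypothetical layer activations
dropped, keep_mask = dropout(layer_output, drop_prob=0.5, rng=rng)
unchanged, _ = dropout(layer_output, drop_prob=0.0, rng=rng)
print(dropped)
```

A new mask is drawn on every call, so each training step effectively uses a different thinned version of the layer.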
Figure 27. Pictorial representation of the concept Dropout.
Another regularization approach is called Drop Connect. In this case, instead of dropping the
activation, a subset of weights within the network layers is set to zero. As a result, each layer
receives a randomly selected subset of units from the immediate previous layer [158]. Some other
regularization approaches have been proposed as well [159].
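The difference from dropout can be made concrete with a small sketch of Drop Connect for a fully connected layer, where the random mask is applied to the weights instead of the activations. The function name, layer sizes, and drop probability are hypothetical, and no rescaling is applied.

```python
import numpy as np


def dropconnect_forward(x: np.ndarray, w: np.ndarray, drop_prob: float, rng):
    """Fully connected forward pass with a random subset of weights set to zero."""
    keep_mask = rng.random(w.shape) >= drop_prob   # one Bernoulli draw per weight
    return x @ (w * keep_mask), keep_mask


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 6))                    # hypothetical batch of 2 inputs
w = rng.standard_normal((6, 3))                    # hypothetical 6 -> 3 layer weights

masked_out, keep_mask = dropconnect_forward(x, w, drop_prob=0.5, rng=rng)
full_out, _ = dropconnect_forward(x, w, drop_prob=0.0, rng=rng)
zero_out, _ = dropconnect_forward(x, w, drop_prob=1.0, rng=rng)
print(masked_out)
```

Because individual connections are removed, each output unit sees its own random subset of the previous layer's units.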
5.1. Introduction
Human thoughts have persistence; humans do not throw everything away and start their thinking from
scratch every second. As you are reading this article, you understand each word or sentence based
on your understanding of previous words or sentences. The traditional neural network approaches,
including DNNs and CNNs, cannot deal with this type of problem, for the following reasons. First,
these approaches only handle a fixed-size vector as input (e.g., an image or video frame) and
produce a fixed-size vector as output (e.g., probabilities of different classes). Second, those
models operate with a fixed number of computational steps (e.g., the number of layers in the model).
RNNs are unique as they allow operation over a sequence of vectors over time. The Hopfield network
introduced this concept in 1982, but the idea was briefly described in 1974 [163]. The pictorial
representation is shown in Figure 28.
Different versions of RNN have been proposed by Elman and Jordan [164,165]. In the Elman network,
the architecture uses the output from the hidden layers as inputs alongside the normal inputs of
the hidden layers [129]. In the Jordan network, in contrast, the outputs from the output unit are
used as inputs alongside the inputs to the hidden layer [130]. Mathematically, these are expressed
as:
Elman network [164]:
[1164]:
ht = σh (wh xt + uh ht−1 + bh ), (24)
h = σ (w x + u h + b ), (24)
y =σ w h +b . (25)
Jordan network [165]:
h = σ (w x + u y + b ), (26)
y =σ w h +b , (27)
architecture uses the output from hidden layers as inputs alongside the normal inputs of hidden
layers [129]. On the other hand, the outputs from the output unit are used as inputs with the inputs
of the hidden layer in Jordan network [130]. Jordan, in contrast, uses inputs from the outputs of the
output unit with the inputs to the hidden layer. Mathematically expressed as:
Elman network [1164]:
Electronics 2019, 8, 292 28 of 66
h = σ (w x + u h + b ), (24)
yyt =
= σσy w
wy h
ht +
+bby .
(25)
(25)
Jordan network [165]:
ht = σh (w
wh xt + uh yt−1 + bh),, (26)
h =σ x +u y +b (26)
yyt = σy w y ht +
= σ w h + b ,, b y (27)
(27)
where $x_t$ is a vector of inputs, $h_t$ are hidden layer vectors, $y_t$ are the output vectors, $w$ and $u$ are weight matrices and $b$ is the bias vector.
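Equations (24)–(27) differ only in which signal is fed back into the hidden layer. The sketch below makes that explicit; it assumes tanh for $\sigma_h$ and the logistic sigmoid for $\sigma_y$ (the text does not fix these choices), and the dictionary keys and toy weights are illustrative names rather than anything from [164,165].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def hidden(x_t, feedback, p):
    """sigma_h(w_h x_t + u_h feedback + b_h), with tanh assumed for sigma_h."""
    wx = matvec(p["w_h"], x_t)
    uf = matvec(p["u_h"], feedback)
    return [math.tanh(a + b + c) for a, b, c in zip(wx, uf, p["b_h"])]

def output(h_t, p):
    """sigma_y(w_y h_t + b_y), with the logistic sigmoid assumed for sigma_y."""
    return [sigmoid(a + b) for a, b in zip(matvec(p["w_y"], h_t), p["b_y"])]

def elman_step(x_t, h_prev, p):
    h_t = hidden(x_t, h_prev, p)   # Eq. (24): the feedback is h_{t-1}
    return h_t, output(h_t, p)     # Eq. (25)

def jordan_step(x_t, y_prev, p):
    h_t = hidden(x_t, y_prev, p)   # Eq. (26): the feedback is y_{t-1}
    return h_t, output(h_t, p)     # Eq. (27)

# Toy sizes (assumed): 2 inputs, 2 hidden units, 1 output.
elman = {"w_h": [[0.5, -0.3], [0.1, 0.8]], "u_h": [[0.2, 0.0], [0.0, 0.2]],
         "b_h": [0.0, 0.0], "w_y": [[1.0, -1.0]], "b_y": [0.0]}
jordan = dict(elman, u_h=[[0.2], [-0.2]])  # u_h now maps the 1-d output back

h, y = [0.0, 0.0], [0.0]
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h, _ = elman_step(x_t, h, elman)
    _, y = jordan_step(x_t, y, jordan)
print("Elman hidden state:", h)
print("Jordan output     :", y)
```

Note that the shape of $u_h$ changes with the feedback signal: hidden-by-hidden in the Elman case, hidden-by-output in the Jordan case.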
A loop allows information to be passed from one step of the network to the next. A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor. Figure 29 shows what happens if we unroll the loop.
The main problem with RNN approaches is the vanishing gradient problem. This problem was first addressed by Hochreiter et al. [166]. A deep RNN consisting of 1000 subsequent layers was implemented and evaluated to solve deep learning tasks in 1993 [167]. Several solutions to the vanishing gradient problem of RNN approaches have been proposed in the past few decades. Two effective solutions are, first, to clip the gradient and rescale it if its norm is too large and, second, to create a better RNN model. One of the better models, named Long Short-Term Memory (LSTM), was introduced by Felix A. Gers et al. in 2000 [168,169]. Building on the LSTM, different advanced approaches have been proposed in the last few years, which are explained in the following sections. The diagram for LSTM is shown in Figure 30.
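The first of the two remedies mentioned above, clipping the gradient and rescaling it when its norm is too large, can be sketched as follows. The function name and the `max_norm` threshold are illustrative assumptions; the gradient is treated as one flat list of numbers.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # small enough: leave untouched
    scale = max_norm / norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], 1.0))  # norm 5 is above the threshold, so it is rescaled
print(clip_by_norm([0.3, 0.4], 1.0))  # norm 0.5 is below the threshold, so it is kept
```

Rescaling preserves the direction of the gradient and only limits its length, which is why it is usually preferred over clamping each component separately.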
Figure 30. Diagram for Long Short-Term Memory (LSTM).
RNN approaches allow sequences in the input, the output, or, in the most general case, both. For example, in DL for text mining, building deep learning models on textual data requires a representation of the basic text unit, the word, and neural network structures that can hierarchically capture the sequential nature of the text. In most of these cases, RNNs or Recursive Neural Networks are used for language understanding [170]. In language modeling, the network tries to predict the next word, a set of words, or in some cases sentences based on the previous ones [171]. RNNs are networks with loops in them, allowing information to persist. As another example, RNNs are able to connect previous information to the present task: using previous video frames to understand the present frame and to try to generate future frames as well [172].
5.2. Long Short-Term Memory (LSTM)
The key idea of LSTMs is the cell state, the horizontal line running through the top of Figure 31. LSTMs remove or add information to the cell state through structures called gates: an input gate ($i_t$), a forget gate ($f_t$) and an output gate ($o_t$), which can be defined as:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f), \tag{28}$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \tag{29}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C), \tag{30}$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t, \tag{31}$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \tag{32}$$
$$h_t = o_t * \tanh(C_t). \tag{33}$$
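A single LSTM time step following Equations (28)–(33) can be sketched in a few lines. The parameter dictionary, its key names and the `constant_params` helper are hypothetical and exist only for this illustration; `[h_{t-1}, x_t]` is implemented as list concatenation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    return [sum(w * vi for w, vi in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    hx = h_prev + x_t                                                  # [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], hx, p["b_f"])]           # Eq. (28)
    i = [sigmoid(a) for a in affine(p["W_i"], hx, p["b_i"])]           # Eq. (29)
    c_tilde = [math.tanh(a) for a in affine(p["W_C"], hx, p["b_C"])]   # Eq. (30)
    c = [fk * ck + ik * gk
         for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]             # Eq. (31)
    o = [sigmoid(a) for a in affine(p["W_o"], hx, p["b_o"])]           # Eq. (32)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                   # Eq. (33)
    return h, c

def constant_params(value, hidden, inputs):
    """Hypothetical helper: every weight equals `value`, every bias is zero."""
    p = {}
    for gate in ("f", "i", "C", "o"):
        p["W_" + gate] = [[value] * (hidden + inputs) for _ in range(hidden)]
        p["b_" + gate] = [0.0] * hidden
    return p

p = constant_params(0.1, hidden=2, inputs=1)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(x, h, c, p)
print("h:", h)
print("c:", c)
```

The cell state `c` is only ever scaled by the forget gate and incremented by the gated candidate, which is the additive path that lets gradients survive over many time steps.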
$$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t]), \tag{36}$$
$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t. \tag{37}$$
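Equations (36) and (37) can be turned into a single GRU step as sketched below. The two gate computations are an assumption: the sketch uses the common form in which $z_t$ and $r_t$ are sigmoid functions of $[h_{t-1}, x_t]$, with hypothetical weight names `W_z` and `W_r`, and biases are omitted as in Equation (36).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(W, v):
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, p):
    hx = h_prev + x_t                                         # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(p["W_z"], hx)]            # update gate (assumed form)
    r = [sigmoid(a) for a in matvec(p["W_r"], hx)]            # reset gate (assumed form)
    rhx = [rk * hk for rk, hk in zip(r, h_prev)] + x_t        # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(p["W"], rhx)]     # Eq. (36)
    return [(1.0 - zk) * hk + zk * gk
            for zk, hk, gk in zip(z, h_prev, h_tilde)]        # Eq. (37)

hidden, inputs = 2, 1
p = {name: [[0.1] * (hidden + inputs) for _ in range(hidden)]
     for name in ("W_z", "W_r", "W")}                         # toy weights (assumed)
h = [0.0, 0.0]
for x in ([1.0], [0.5], [-1.0]):
    h = gru_step(x, h, p)
print("h:", h)
```

In this sketch the GRU needs three weight matrices of size hidden × (hidden + inputs), where an LSTM cell of the same width needs four.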
Which one is better, LSTM or GRU? According to various empirical studies, there is no clear evidence of a winner. However, the GRU requires fewer network parameters, which makes the model faster. On the other hand, LSTM provides better performance if enough data and computational power are available [174]. There is a variant of LSTM named Deep LSTM [175]. Another variant, which takes a slightly different approach, is called the Clockwork RNN [176]. An important empirical evaluation of different versions of RNN approaches, including LSTM, was conducted by Greff et al. in 2015, and the final conclusion was that all the LSTM variants performed about the same [177]. Another empirical evaluation was conducted on thousands of RNN architectures, including LSTM, GRU and so on, and found some that worked better than LSTMs on certain tasks [178].
5.4. Convolutional LSTM (ConvLSTM)
The problem with the fully connected LSTM (FC-LSTM for short) in handling spatiotemporal data is its use of full connections in the input-to-state and state-to-state transitions, in which no spatial information is encoded. The internal gates of ConvLSTM are 3D tensors, where the last two dimensions are spatial dimensions (rows and columns). The ConvLSTM determines the future state of a certain cell in the grid from the inputs and the past states of its local neighbors, which can be achieved using convolution operations in the state-to-state and input-to-state transitions, as shown in Figure 32.
Figure 32. Pictorial diagram for ConvLSTM.
ConvLSTM provides good performance for temporal data analysis with video datasets [172]. Mathematically, the ConvLSTM is expressed as follows, where $*$ represents the convolution operation and $\circ$ denotes the Hadamard product:
$$i_t = \sigma(w_{xi} * \mathcal{X}_t + w_{hi} * \mathcal{H}_{t-1} + w_{ci} \circ \mathcal{C}_{t-1} + b_i), \tag{38}$$
$$f_t = \sigma(w_{xf} * \mathcal{X}_t + w_{hf} * \mathcal{H}_{t-1} + w_{cf} \circ \mathcal{C}_{t-1} + b_f), \tag{39}$$
$$\tilde{\mathcal{C}}_t = \tanh(w_{xc} * \mathcal{X}_t + w_{hc} * \mathcal{H}_{t-1} + b_c), \tag{40}$$
$$\mathcal{C}_t = f_t \circ \mathcal{C}_{t-1} + i_t \circ \tilde{\mathcal{C}}_t, \tag{41}$$
$$o_t = \sigma(w_{xo} * \mathcal{X}_t + w_{ho} * \mathcal{H}_{t-1} + w_{co} \circ \mathcal{C}_t + b_o), \tag{42}$$
$$\mathcal{H}_t = o_t \circ \tanh(\mathcal{C}_t). \tag{43}$$
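To make the roles of the convolution and the Hadamard product in Equations (38)–(43) concrete, the sketch below runs one ConvLSTM step on a single-channel grid. Everything beyond the equations is an assumption for illustration: one channel, 3×3 kernels, zero padding, scalar biases, and the key names `w_ci`, `w_cf`, `w_co` for the weights multiplied element-wise with the cell state. As in most deep learning libraries, the kernel is applied without flipping.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def conv2d_same(grid, kernel):
    """Single-channel sliding-window filter with zero padding (output keeps the input size)."""
    rows, cols, k = len(grid), len(grid[0]), len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    if 0 <= r + dr < rows and 0 <= c + dc < cols:
                        out[r][c] += kernel[dr + k][dc + k] * grid[r + dr][c + dc]
    return out

def cellwise(fn, *grids):
    """Apply fn cell by cell across equally sized 2D grids."""
    return [[fn(*cells) for cells in zip(*rows)] for rows in zip(*grids)]

def pre_activation(name, X, H_prev, p, C_peep=None):
    """Convolve input and previous state, add the bias and, if given, the Hadamard cell term."""
    total = cellwise(lambda a, b: a + b + p["b_" + name],
                     conv2d_same(X, p["w_x" + name]),
                     conv2d_same(H_prev, p["w_h" + name]))
    if C_peep is not None:
        total = cellwise(lambda t, w, c: t + w * c, total, p["w_c" + name], C_peep)
    return total

def convlstm_step(X, H_prev, C_prev, p):
    i = cellwise(sigmoid, pre_activation("i", X, H_prev, p, C_prev))   # Eq. (38)
    f = cellwise(sigmoid, pre_activation("f", X, H_prev, p, C_prev))   # Eq. (39)
    C_tilde = cellwise(math.tanh, pre_activation("c", X, H_prev, p))   # Eq. (40)
    C = cellwise(lambda fk, ck, ik, gk: fk * ck + ik * gk,
                 f, C_prev, i, C_tilde)                                # Eq. (41)
    o = cellwise(sigmoid, pre_activation("o", X, H_prev, p, C))        # Eq. (42)
    H = cellwise(lambda ok, ck: ok * math.tanh(ck), o, C)              # Eq. (43)
    return H, C

def toy_params(rows, cols, value):
    """Hypothetical helper: constant 3x3 kernels, constant cell weights, zero biases."""
    p = {}
    for g in ("i", "f", "c", "o"):
        p["w_x" + g] = [[value] * 3 for _ in range(3)]
        p["w_h" + g] = [[value] * 3 for _ in range(3)]
        p["b_" + g] = 0.0
    for g in ("i", "f", "o"):
        p["w_c" + g] = [[value] * cols for _ in range(rows)]
    return p

X = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
zeros = [[0.0] * 3 for _ in range(3)]
H, C = convlstm_step(X, zeros, zeros, toy_params(3, 3, 0.1))
print("centre cell:", H[1][1], C[1][1])
```

Because every gate is computed by a local filter, the state of a cell depends only on its spatial neighbourhood, which is the property the section attributes to ConvLSTM.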
5.5. Variant Architectures of RNN with Respect to the Applications
To incorporate the attention mechanism with RNNs, Word2Vec is used in most cases for word or sentence encoding. Word2Vec is a powerful word embedding technique that learns from raw text inputs with a 2-layer predictive NN. This approach is used in different fields of application, including unsupervised learning with words, relationship learning between different words, abstracting the higher-level meaning of words based on similarity, sentence modeling, language understanding and many more. Various other word embedding approaches have been proposed in the past few years, which are used to solve difficult tasks and provide state-of-the-art performance, including machine translation and language modeling, image and video captioning, and time series data analysis [179–181].
From the application point of view, RNNs can solve different types of problems, which need different RNN architectures, as shown in Figure 33. In Figure 33, input vectors are represented in green, RNN states in blue, and output vectors in orange. These structures can be described as:
One to One: Standard mode for classification without RNN (e.g., image classification problem), shown in Figure 33a.
Many to One: Sequence of inputs and a single output (e.g., sentiment analysis, where the inputs are a set of sentences or words and the output is a positive or negative expression), shown in Figure 33b.
One to Many: A system takes an input and produces a sequence of outputs (e.g., the image captioning problem: the input is a single image and the output is a set of words with context), shown in Figure 33c.
Many to Many: Sequences of inputs and outputs (e.g., machine translation: the machine takes a sequence of words in English and translates it to a sequence of words in French), shown in Figure 33d.
Many to Many: Sequence to sequence learning (e.g., the video classification problem, in which we take video frames as input and wish to label each frame of the video), shown in Figure 33e.
Figure 33. The different structures of RNN with respect to the applications: (a) One to one; (b) many to one; (c) one to many; (d) many to many; and (e) many to many.
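The many-to-one and per-step many-to-many structures use the same unrolled recurrence and differ only in which outputs are kept. The toy sketch below illustrates this; `toy_step` is a hypothetical scalar recurrence invented for the example, not a trained model.

```python
def run_rnn(step, inputs, h0):
    """Unroll a recurrent step over a sequence and collect every output."""
    h, outputs = h0, []
    for x_t in inputs:
        h, y_t = step(x_t, h)
        outputs.append(y_t)
    return outputs

def toy_step(x_t, h_prev):
    """Hypothetical recurrence: the state is a running sum, the output is its sign."""
    h_t = h_prev + x_t
    return h_t, ("positive" if h_t >= 0 else "negative")

scores = [0.5, -2.0, 1.0, 1.5]   # e.g., per-word sentiment scores (assumed values)
per_step = run_rnn(toy_step, scores, 0.0)
print("many to many (one label per step):", per_step)
print("many to one  (last label only)   :", per_step[-1])
```

A one-to-many structure would instead feed a single input at the first step and keep unrolling on the state alone.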
5.6. Attention-based Models with RNN
Different attention-based models have been proposed using RNN approaches. The first RNN with attention that automatically learns to describe the content of images was proposed by Xu et al. in 2015 [182]. A dual-stage attention-based RNN has been proposed for effective time series prediction [183]. Another difficult task is Visual Question Answering (VQA) using GRUs, where the inputs are an image and a natural language question about the image and the task is to provide an accurate natural language answer. The output is conditional on both the image and the textual inputs. A CNN is used to encode the image and an RNN is used to encode the sentence [184]. Another outstanding concept, released by Google, is called Pixel Recurrent Neural Networks (Pixel RNN). This approach provides state-of-the-art performance for image completion tasks [185]. A new model called residual RNN has also been proposed, in which an effective residual connection is introduced into a deep recurrent network [186].
The encoder and decoder transitions can be represented with $\phi$ and $\varphi$, where $\phi : \mathcal{X} \rightarrow \mathcal{F}$ and $\varphi : \mathcal{F} \rightarrow \mathcal{X}$; then
$$\phi, \varphi = \arg\min_{\phi, \varphi} \| X - (\varphi \circ \phi) X \|^2. \tag{44}$$
If we consider a simple autoencoder with one hidden layer, where the input $x \in \mathbb{R}^d = \mathcal{X}$ is mapped onto $z \in \mathbb{R}^p = \mathcal{F}$, it can be expressed as follows:
$$z = \sigma_1(Wx + b), \tag{45}$$
where $W$ is the weight matrix and $b$ is the bias. $\sigma_1$ represents an element-wise activation function, such as a sigmoid or a rectified linear unit (ReLU). Let us consider that $z$ is again mapped or reconstructed onto $x'$, which has the same dimension as $x$. The reconstruction can be expressed as
$$x' = \sigma_2(W'z + b'). \tag{46}$$
This model is trained by minimizing the reconstruction error, which is defined as the loss function
$$L(x, x') = \| x - x' \|^2 = \| x - \sigma_2(W'(\sigma_1(Wx + b)) + b') \|^2. \tag{47}$$
Usually, the feature space $\mathcal{F}$ has lower dimensions than the input feature space $\mathcal{X}$, so it can be considered a compressed representation of the input sample. In the case of a multilayer autoencoder, the same operation is repeated as required within the encoding and decoding phases. A deep autoencoder is constructed by extending the encoder and decoder with multiple hidden layers. The vanishing gradient problem is still a big issue with deeper AE models: the gradient becomes too small as it passes back through many layers of an AE model. Different advanced AE models are discussed in the following sections.
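The one-hidden-layer autoencoder of Equations (45)–(47) amounts to two affine maps with an activation each, plus a squared-error loss. The sketch below assumes the logistic sigmoid for both $\sigma_1$ and $\sigma_2$ and uses arbitrary toy weights; the function names are illustrative only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer(W, v, b):
    """Element-wise sigmoid of W v + b."""
    return [sigmoid(sum(w * vi for w, vi in zip(row, v)) + bi) for row, bi in zip(W, b)]

def encode(x, W, b):
    return layer(W, x, b)                  # Eq. (45): z = sigma_1(W x + b)

def decode(z, W_prime, b_prime):
    return layer(W_prime, z, b_prime)      # Eq. (46): x' = sigma_2(W' z + b')

def reconstruction_loss(x, x_prime):
    return sum((u - v) ** 2 for u, v in zip(x, x_prime))   # Eq. (47)

# d = 3 inputs compressed to p = 2 features (weights are arbitrary toy values).
W, b = [[0.5, -0.5, 0.1], [0.3, 0.8, -0.2]], [0.0, 0.1]
W_prime, b_prime = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.9]], [0.0, 0.0, 0.0]
x = [1.0, 0.0, 0.5]
z = encode(x, W, b)
x_prime = decode(z, W_prime, b_prime)
print("code z:", z)
print("loss  :", reconstruction_loss(x, x_prime))
```

Training would adjust `W`, `b`, `W_prime` and `b_prime` to reduce this loss over a dataset; the sketch only evaluates the forward pass.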
Figure 36. Split-Brain Autoencoder.
6.4. Applications of AE
AE is applied in bioinformatics [136,208] and cybersecurity [209]. We can apply AE for unsupervised feature extraction and then apply Winner Take All (WTA) to cluster those samples and generate labels [210]. In the last decade, AE has been used as an encoding and decoding technique with or for other deep learning approaches, including CNN, DNN, RNN, and RL. Some other recently published approaches are described in [207,211].
Figure 37. Block diagram for Restricted Boltzmann Machine (RBM).
Energy-based models mean that the probability distribution over the variables of interest is defined through an energy function. The energy function is composed from a set of observable variables $V = \{v_i\}$ and a set of hidden variables $H = \{h_j\}$, where $i$ is a node in the visible layer and $j$ is a node in the hidden layer. It is restricted in the sense that there are no visible-visible or hidden-hidden connections. The observed values correspond to the visible units of the RBM, because their states are observed; the feature detectors correspond to the hidden units. A joint configuration, $(v, h)$, of the visible and hidden units has an energy (Hopfield, 1982) given by:
$$E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i} \sum_{j} v_i h_j w_{ij}, \tag{49}$$
where $v_i$, $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$, $b_j$ are their biases and $w_{ij}$ is the weight between them. The network assigns a probability to a possible pair of a visible and a hidden vector via this energy function:
$$p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \tag{50}$$
where the partition function, $Z$, is given by summing over all possible pairs of visible and hidden vectors:
$$Z = \sum_{v, h} e^{-E(v, h)}. \tag{51}$$
The probability that the network assigns to a visible vector, $v$, is given by summing over all possible hidden vectors:
$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}. \tag{52}$$
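For a very small RBM, the partition function and the probabilities in Equations (50)–(52) can be evaluated exactly by enumerating every binary configuration. The sketch below assumes the standard RBM energy $E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i h_j w_{ij}$ with the biases $a_i$, $b_j$ and weights $w_{ij}$ named in the text; the toy parameter values are arbitrary. Enumeration is exponential in the number of units, so this is an illustration only.

```python
import math
from itertools import product

def energy(v, h, a, b, W):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij."""
    visible = sum(ai * vi for ai, vi in zip(a, v))
    hidden = sum(bj * hj for bj, hj in zip(b, h))
    pair = sum(v[i] * h[j] * W[i][j] for i in range(len(v)) for j in range(len(h)))
    return -(visible + hidden + pair)

def binary_vectors(n):
    return list(product((0, 1), repeat=n))

def partition_function(a, b, W):
    """Eq. (51): sum over all visible/hidden pairs."""
    return sum(math.exp(-energy(v, h, a, b, W))
               for v in binary_vectors(len(a)) for h in binary_vectors(len(b)))

def p_joint(v, h, a, b, W):
    """Eq. (50)."""
    return math.exp(-energy(v, h, a, b, W)) / partition_function(a, b, W)

def p_visible(v, a, b, W):
    """Eq. (52): marginalize over all hidden vectors."""
    z = partition_function(a, b, W)
    return sum(math.exp(-energy(v, h, a, b, W)) for h in binary_vectors(len(b))) / z

a, b = [0.1, -0.2], [0.3]     # toy biases (assumed values)
W = [[0.5], [-0.4]]           # W[i][j]: weight between visible i and hidden j
for v in binary_vectors(2):
    print(v, round(p_visible(v, a, b, W), 4))
```

The printed probabilities sum to one, which is exactly what dividing by $Z$ guarantees.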
The probability that the network assigns to a training sample can be raised by adjusting the weights and biases to lower the energy of that sample and to raise the energy of other samples, especially those that have low energies and therefore make a big contribution to the partition function. The derivative of the log probability of a training vector with respect to a weight is surprisingly simple:
$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model}, \tag{53}$$
where the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. This leads to a simple learning rule for performing stochastic steepest ascent in the log probability of the training data:
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{model} \right), \tag{54}$$
where $\varepsilon$ is a learning rate. Given a randomly selected training image, $v$, the binary state, $h_j$, of each hidden unit, $j$, is set to 1 with probability
$$p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i} v_i w_{ij}\Big), \tag{55}$$
where $\sigma(x)$ is the logistic sigmoid function $1/(1 + e^{-x})$; $v_i h_j$ is then an unbiased sample. Because there are no direct connections between visible units in an RBM, it is also easy to get an unbiased sample of the state of a visible unit, given a hidden vector:
$$p(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j} h_j w_{ij}\Big). \tag{56}$$
Getting an unbiased sample of $\langle v_i h_j \rangle_{model}$ is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a long time. A single iteration of alternating Gibbs sampling consists of updating all the hidden units in parallel using Equation (55), followed by updating all the visible units in parallel using Equation (56). A much faster learning procedure was proposed in Hinton (2002). This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using Equation (55). Once binary states have been chosen for the hidden units, a reconstruction is produced by setting each $v_i$ to 1 with a probability given by Equation (56). The change in weight is then given by
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon} \right). \tag{57}$$
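The faster procedure described above (one data-driven hidden sample, one reconstruction, then the update of Equation (57)) can be sketched for a single training vector as follows. The function names are illustrative. One choice in the sketch is an assumption rather than something the text prescribes: the hidden units in the reconstruction phase are represented by their probabilities instead of sampled binary states, and the expectations are estimated from this single sample.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_hidden(v, b, W, rng):
    """Eq. (55): p(h_j = 1 | v) and a binary sample drawn from it."""
    probs = [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b))]
    return probs, [1 if rng.random() < pj else 0 for pj in probs]

def sample_visible(h, a, W, rng):
    """Eq. (56): p(v_i = 1 | h) and a binary sample drawn from it."""
    probs = [sigmoid(a[i] + sum(h[j] * W[i][j] for j in range(len(h))))
             for i in range(len(a))]
    return probs, [1 if rng.random() < pi else 0 for pi in probs]

def cd1_weight_update(v_data, a, b, W, lr, rng):
    """One-sample estimate of Eq. (57) for every weight w_ij."""
    _, h_data = sample_hidden(v_data, b, W, rng)       # hidden states driven by the data
    _, v_recon = sample_visible(h_data, a, W, rng)     # reconstruction of the visible units
    p_h_recon, _ = sample_hidden(v_recon, b, W, rng)   # hidden probabilities (assumed choice)
    return [[lr * (v_data[i] * h_data[j] - v_recon[i] * p_h_recon[j])
             for j in range(len(b))] for i in range(len(a))]

rng = random.Random(0)
a, b = [0.0, 0.0, 0.0], [0.0, 0.0]                     # toy biases (assumed values)
W = [[0.1, -0.1], [0.2, 0.0], [-0.3, 0.4]]
print(cd1_weight_update([1, 0, 1], a, b, W, lr=0.1, rng=rng))
```

In practice the update would be averaged over a mini-batch and applied repeatedly; analogous updates using single-unit states are used for the biases.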
A simplified version of the same learning rule that uses the states of individual units instead of a pairwise product is used for the biases [214]. This approach is mainly used for pre-training a neural network in an unsupervised manner to generate initial weights. One of the most popular deep learning approaches, the Deep Belief Network (DBN), is based on this approach. Examples of applications of RBM and DBN include data encoding, news clustering, image segmentation, and cybersecurity; for details, see References [57,215–217].
GAN is an unsupervised deep learning approach where two neural networks compete against each other in a zero-sum game. In the case of the image generation problem, the generator starts with Gaussian noise to generate images and the discriminator determines how good the generated images are. This process continues until the outputs of the generator become close to actual input samples. According to Figure 38, the Discriminator (D) and Generator (G) can be considered two players playing a min-max game with the value function $V(D, G)$, which can be expressed as follows [33,218]:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log(D(x))] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]. \tag{58}$$
In practice, this equation may not provide sufficient gradient for learning G (which starts from random Gaussian noise) at the early stages. In the early stages, D can reject samples because they are clearly different from the training samples. In this case, $\log(1 - D(G(z)))$ saturates. Instead of training G to minimize $\log(1 - D(G(z)))$, we can train G to maximize the $\log(D(G(z)))$ objective function, which provides much better gradients in the early stages of learning. However, there were some limitations of convergence during training with the first version. In its initial form, a GAN has limitations regarding the following issues:
• The lack of a heuristic cost function (such as pixel-wise approximate mean square error (MSE))
• Unstable training (which can sometimes result in nonsensical outputs)
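The saturation argument can be checked numerically. The sketch below assumes that the discriminator output on a generated sample is $D = \sigma(s)$ for some pre-sigmoid score $s$ (a hypothetical parameterization introduced only for this example) and compares the gradient of the two generator objectives with respect to $s$. Early in training D rejects generated samples confidently, which corresponds to a very negative $s$.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def grad_saturating(s):
    """d/ds of log(1 - D), with D = sigmoid(s): the objective G minimizes in Eq. (58)."""
    return -sigmoid(s)

def grad_non_saturating(s):
    """d/ds of log(D): the alternative objective that G maximizes."""
    return 1.0 - sigmoid(s)

for s in (-6.0, -2.0, 0.0, 2.0):
    print(f"D={sigmoid(s):.4f}  |grad log(1-D)|={abs(grad_saturating(s)):.4f}  "
          f"|grad log D|={grad_non_saturating(s):.4f}")
```

When D is close to 0 the first gradient is close to 0 as well, while the second stays close to 1, which is the reason the alternative objective trains better in the early stages.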
Figure 40 represents generated bedroom images after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples, such as the baseboards of some of the beds.
Figure 39. Experimental outputs of bedroom images.
Figure 40. Reconstructed bedroom images using deep convolution GAN (DCGAN).
In Figure 40, according to the article in [221], the top rows show interpolation between a series of nine random points in Z and demonstrate that the learned space has smooth transitions. Every image in the space plausibly looks like a bedroom. In the 6th row, a room without a window slowly transforms into a room with a giant window. In the 10th row, what appears to be a TV is slowly transformed into a window. Figure 41 shows an effective application of latent space vectors. Latent space vectors can be turned into meaningful output by first performing addition and subtraction operations and then decoding the result. According to the article in [221], Figure 41 shows that a man with glasses minus a man plus a woman results in a woman with glasses.
Figure 41. Example of smile arithmetic and arithmetic for wearing glasses using GAN: a man with glasses minus a man without glasses plus a woman without glasses is equal to a woman with glasses.
Figure 42. Face generation at different angles using GAN.
Recently, Google proposed an extended version of GANs called Boundary Equilibrium Generative Adversarial Networks (BEGAN), with a simple but robust architecture [228]. BEGAN has a better training procedure with fast and stable convergence. The concept of equilibrium helps to balance the power of the discriminator against the generator. In addition, it can balance the trade-off between image diversity and visual quality [228]. Another similar work is the Wasserstein GAN (WGAN) algorithm, which shows significant benefits over the traditional GAN [229]. WGANs have two major benefits over traditional GANs. First, a WGAN meaningfully correlates the loss metric with the generator's convergence and sample quality. Second, WGANs have improved stability of the optimization process.
An improved version of WGAN has been proposed with an alternative to weight clipping that penalizes the norm of the gradient of the critic with respect to its inputs [230]. A promising architecture has been proposed based on generative models in which images are represented with an untrained DNN, which gives an opportunity for better understanding and visualization of DNNs [231]. Adversarial examples for generative models have also been introduced [232]. Energy-based GAN was proposed by Yann LeCun from Facebook in 2016 [233]. Because the training process is difficult for GANs, Manifold Matching GAN (MMGAN) was proposed with a better training process; it was tested on three different datasets, and the experimental results clearly demonstrate the efficacy of MMGAN against other models [234]. A GAN has also been applied to geo-statistical simulation and inversion with an efficient training approach [235].
Probabilistic GAN (PGAN) is a new kind of GAN with a modified objective function. The main idea behind this method is to integrate a probabilistic model (a Gaussian Mixture Model) into the GAN framework so that it supports likelihood rather than classification [236]. A GAN with a Bayesian Network model has also been proposed [237]. The Variational Autoencoder (VAE) is a popular deep learning approach; training it with Adversarial Variational Bayes (AVB) helps to establish a principled connection between VAE and GAN [238]. Other work includes the f-GAN, which is based on the general feed-forward neural network [239], a Markov model-based GAN for texture synthesis [240], another generative model based on the doubly stochastic MCMC method [241], and a GAN with multiple generators [242].
An unsupervised GAN has been proposed that is capable of learning pixel-level domain adaptation, transforming images in the pixel space from one domain to another. This approach provides state-of-the-art performance, outperforming several unsupervised domain adaptation techniques by a large margin [243]. A new network called the Schema Network has been proposed, which is an object-oriented generative physics simulator able to disentangle multiple causes of events and to reason through causes to achieve a goal, learning the dynamics of an environment from data [244]. There is interesting research that uses a GAN for Generative Adversarial Text to Image Synthesis. In this paper, a new deep architecture is proposed for the GAN formulation, which takes the text description of an image and produces realistic images with respect to the inputs. This is an effective technique for text-based image synthesis using a character-level text encoder and a class-conditional GAN. The GAN is evaluated first on bird and flower datasets and then on general text-to-image generation using the MS COCO dataset [40].
The generative moment matching network (GMMN) is an alternative approach to generative modeling [264].
Other applications of GAN include pose estimation [265], a photo editing network [266], and anomaly detection [267]. Further examples are DiscoGAN for learning cross-domain relations with GAN [40], unsupervised image-to-image translation with a generative model [268], single-shot learning with GAN [269], and response generation and question answering systems [270,271]. Last but not least, WaveNet has been developed as a generative model for generating audio waveforms [272], and a dual path network is presented in [273].
Figure 43. Conceptual diagram for Reinforcement Learning (RL) system.
Unlike general supervised and unsupervised machine learning, RL is defined not by characterizing learning methods, but by characterizing a learning problem. However, the recent success of DL has had a huge impact on the success of RL, and the combination is known as DRL. According to the learning strategy, the RL technique learns through observation. For observing the environment, promising DL techniques, including CNN, RNN, LSTM, and GRU, are used depending upon the observation space. As DL techniques encode data efficiently, the following step of action is performed more accurately. According to the action, the agent receives an appropriate reward. As a result, the entire RL approach becomes more efficient at learning and interacting in the environment with better performance.
The history of the modern DRL revolution began at Google DeepMind in 2013 with DRL applied to Atari games, in which DRL-based approaches performed better than a human expert in many of the games. In this case, the environment is observed through video frames, which are processed using a CNN [275,276]. The success of DRL approaches depends on the level of difficulty of the task to be solved. After the huge success of AlphaGo and Atari from Google DeepMind, a reinforcement learning environment based on StarCraft II, called SC2LE (StarCraft II Learning Environment), was proposed in 2017 [277]. SC2LE is a multi-agent game with interactions among multiple players. It has a large action space involving the selection and control of hundreds of units, contains many states to observe from the raw feature space, and requires strategies over thousands of steps. The open-source Python-based StarCraft II game engine is freely available online.
8.2. Q-Learning
There are some fundamental strategies which are essential to know for working with DRL. First, the RL approach uses a function that calculates the quality of a state-action combination, called the Q-function; learning this function is known as Q-learning. Algorithm 2 describes the basic computational flow of Q-learning.
Q-learning is a model-free reinforcement learning approach used to find an optimal action-selection policy for any given (finite) Markov Decision Process (MDP). An MDP is a mathematical framework for modeling decisions using states, actions, and rewards. Q-learning only needs to know the available states and the possible actions in each state. An improved version of Q-learning is known as bi-directional Q-learning. In this article, standard Q-learning is discussed; for details on bi-directional Q-learning, please see Reference [278].
In each state s, choose the action a which maximizes the function Q(s, a):
• Q is an estimated utility function; it tells us how good an action is in a given state.
• Q(s, a) is the immediate reward r(s, a) for taking an action plus the best utility (Q) of the resulting state.
This can be formulated with the recursive definition as follows:
$Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')$. (59)
This equation is called Bellman's equation, which is the core equation for RL. Here r(s, a) is the immediate reward, γ ∈ [0, 1] is the relative value of delayed vs. immediate rewards, and s′ is the new state after action a. The a and a′ are the actions in states s and s′, respectively. The action is selected based on the following equation:
$\pi(s) = \arg\max_{a} Q(s, a)$. (60)
Each state is assigned a value called a Q-value. When we visit a state, we receive a reward accordingly, and we use the reward to update the estimated value for that state. As the reward is stochastic, we need to visit the states many times. In addition, it is not guaranteed that we will get the same reward ($R_t$) in another episode. In episodic tasks, the return is the sum of the future rewards; because environments are unpredictable, the further we go into the future, the more the rewards diverge:
$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T$. (61)
The sum of discounted future rewards weights each reward by a scalar factor, where γ is a constant. The further a reward lies in the future, the less we take it into account.
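The discounted sum can be illustrated with a short sketch. It assumes the conventional form $G_t = \sum_{k \ge 0} \gamma^{k} R_{t+k+1}$; the exact indexing of the discounted-return equation in the original is not recoverable from this text and may differ. The function name `discounted_return` and the reward values are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ... for one episode.

    `rewards` lists R_{t+1}, ..., R_T in order. This is the conventional
    indexing, used here as an assumption.
    """
    g = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


episode_rewards = [1.0, 1.0, 1.0]
undiscounted = discounted_return(episode_rewards, gamma=1.0)  # plain sum, as in Equation (61)
discounted = discounted_return(episode_rewards, gamma=0.5)    # later rewards count less
print(undiscounted, discounted)
```

With γ = 1 the function reduces to the undiscounted sum of Equation (61); with γ < 1 rewards further in the future contribute less.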
Properties of Q-learning:
• Convergence of the Q-function: the approximation converges to the true Q-function, but only if every possible state-action pair is visited infinitely many times.
• The size of the state table can vary depending on the observation space and complexity.
• Unseen values are not considered during observation.
The way to fix these problems is to use a neural network (particularly a DNN) as an approximator instead of the state table. The inputs of the DNN are the state and action, and the outputs are numbers between 0 and 1 that represent the utility, encoding the states and actions properly. This is where deep learning approaches contribute to making better decisions with respect to the state information. In most cases, we observe the learning environment using acquisition devices, such as a camera or other sensors. For example, in the setup for the AlphaGo challenge, the environment, action, and reward are learned based on the pixel values (pixel in action). For details see References [275,276,279].
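As a rough illustration of replacing the state table with a network, the sketch below scores a (state, action) pair with a tiny untrained network whose sigmoid output lies between 0 and 1, as described above. The architecture (one tanh hidden layer, one-hot action encoding), the sizes, and all names are assumptions chosen for brevity, not the networks used in the cited works.

```python
import math
import random


def init_q_network(n_state, n_actions, n_hidden, seed=0):
    """Random weights for a one-hidden-layer network (hypothetical sizes)."""
    rng = random.Random(seed)
    n_in = n_state + n_actions
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    return w1, w2


def q_value(params, state, action, n_actions):
    """Utility estimate in (0, 1) for taking `action` in `state`."""
    w1, w2 = params
    one_hot = [1.0 if i == action else 0.0 for i in range(n_actions)]
    x = list(state) + one_hot                      # network input: state and action
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
    z = sum(w * h for w, h in zip(w2, hidden))
    return 1.0 / (1.0 + math.exp(-z))              # sigmoid keeps the output in (0, 1)


def greedy_action(params, state, n_actions):
    """pi(s) = argmax_a Q(s, a), evaluated with the network instead of a table."""
    return max(range(n_actions), key=lambda a: q_value(params, state, a, n_actions))


params = init_q_network(n_state=4, n_actions=3, n_hidden=8)
state = [0.1, -0.3, 0.7, 0.0]
scores = [q_value(params, state, a, 3) for a in range(3)]
best = greedy_action(params, state, 3)
print(scores, best)
```

Because the network generalizes over its inputs, it can score states that were never visited, which a lookup table cannot do.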
However, it is difficult to develop an agent which can interact or perform well in any observation environment. Therefore, most researchers in the field select their action space or environment before training the agent for that environment. The benchmark concept, in this case, is somewhat different from that of supervised or unsupervised deep learning. Due to the variety of environments, the benchmark depends on the level of difficulty of the environment considered compared to previous or existing research. The difficulty depends on different parameters, such as the number of agents, the way the agents interact, the number of players, and so on.
Recently, another good learning approach has been proposed for DRL [46,274]. Many papers have been published with different DRL networks, including Deep Q-Networks (DQN), Double DQN, asynchronous methods, and policy optimization strategies (including deterministic policy gradient, deep deterministic policy gradient, guided policy search, trust region policy optimization, and combining policy gradient and Q-learning) [46,274]. Other examples include policy gradient (DAGGER); superhuman Go using supervised learning with policy gradient and Monte Carlo tree search with a value function [46,280]; robotic manipulation using guided policy search [281]; and DRL for 3D games using policy gradients [282].
Algorithm 2: Q-Learning
Initialization:
For each state-action pair (s, a)
initialize the table entry Q̂(s, a) to zero
Steps:
1. Observe the current state s
2. REPEAT:
- s = s′
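The algorithm listing above is incomplete in this text, so the following is only a minimal sketch of tabular Q-learning under standard assumptions, not a reconstruction of the missing steps. The five-state chain environment, the epsilon-greedy exploration, the learning rate `alpha`, and all names are hypothetical choices made for illustration; with `alpha = 1` the update reduces to the Bellman form of Equation (59).

```python
import random

N_STATES, ACTIONS, GOAL = 5, (0, 1), 4   # toy chain; action 0 = left, 1 = right


def step(s, a):
    """Deterministic toy environment: reward 1 only when the goal state is reached."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward


def q_learning(episodes=500, gamma=0.9, alpha=1.0, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    # Initialize every table entry Q(s, a) to zero.
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(GOAL)                      # observe the current state
        for _ in range(50):                          # cap the episode length
            if rng.random() < epsilon:               # explore
                a = rng.choice(ACTIONS)
            else:                                    # exploit: argmax_a Q(s, a)
                a = max(ACTIONS, key=lambda act: q[s][act])
            s_next, r = step(s, a)
            target = r + gamma * max(q[s_next])      # r(s, a) + gamma * max_a' Q(s', a')
            q[s][a] += alpha * (target - q[s][a])
            s = s_next                               # s = s'
            if s == GOAL:
                break
    return q


q_table = q_learning()
policy = [max(ACTIONS, key=lambda act: q_table[s][act]) for s in range(GOAL)]
print(policy)
```

The learned greedy policy is read off the table with Equation (60); in this toy chain it should prefer moving right, toward the rewarded state.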
group-driven RL is proposed for health care on a mobile device for personalized mHealth intervention. In this work, K-means clustering is applied to group the people, and an RL policy is then shared within each group [285]. Optimal policy learning is a challenging task for an RL agent. Option-Observation Initiation Sets (OOIs) allow agents to learn optimal policies in challenging POMDP tasks, and they learn faster than an RNN [286]. The 3D Bin Packing Problem (BPP) has been addressed with DRL. The main objective is to place a number of cuboid-shaped items in a way that minimizes the surface area of the bin [287].
An important component of DRL is the reward, which is determined based on the observation and the action of the agent. A real-world reward function is not always perfect. Due to sensor error, the agent may receive the maximum reward when the actual reward should be smaller. To address this, a formulation based on a generalized Markov Decision Problem (MDP), called the Corrupt Reward MDP, has been proposed [288]. Trust-region-optimization-based deep RL has been proposed using the recently developed Kronecker-factored approximation to the curvature (K-FAC) [289]. In addition, some research has been conducted on physics experiments using the deep learning approach. This work focuses on an agent learning basic properties, such as the mass and cohesion of objects, in an interactive simulation environment [290].
Recently, fuzzy RL policies have been proposed that are suitable for continuous state and action spaces [291]. An important investigation and discussion has been made of hyper-parameters in policy gradient methods for continuous control and of the general variance of the algorithms. This paper also provides a guideline for reporting results and comparing against baseline methods [292]. Deep RL is also applied to high-precision assembly tasks [293]. The Bellman equation is one of the main functions of the RL technique; a function approximation has been proposed which ensures that the Bellman Optimality Equation always holds, and the function is then estimated to maximize the likelihood of the observed motion [294]. A DRL-based hierarchical system is used for cloud resource allocation and power management in cloud computing systems [295]. A novel Attention-aware Face Hallucination (Attention-FH) framework has been proposed, where deep RL is used to enhance image quality on single patches of face images [296].
10. Transfer Learning
10.1. Transfer Learning
A good way to explain transfer learning is to look at the student-teacher relationship. A teacher offers a course after gathering detailed knowledge regarding that subject [48]. The information is conveyed through a series of lectures over time. This can be considered as the teacher (expert) transferring information (knowledge) to the students (learners). The same thing happens in deep learning: a network is trained with a large amount of data, and during training the model learns the weights and biases. These weights can be transferred to other networks for testing or retraining a similar new model. The network can start with pre-trained weights instead of training from scratch. The conceptual diagram for the transfer learning method is shown in Figure 44.
Figure 44. Conceptual diagram for transfer learning: Pretrained on ImageNet and transfer learning is used for retraining on PASCAL dataset.
10.2. What Is A Pre-trained Model?
A pre-trained model is a model which is already trained in the same domain as the intended domain. For example, for an image recognition task, an Inception model already trained on ImageNet can be downloaded. The Inception model can then be used for a different recognition task, and instead of training it from scratch the weights can be left as is with some learned features. This method of training is useful when there is a lack of sample data. There are a lot of pre-trained models available (including VGG, ResNet, and Inception Net on different datasets) in the model zoo at the following link: https://github.com/BVLC/caffe/wiki/Model-Zoo.
10.3. Why Will You Use Pre-trained Models?
There are a lot of reasons for using pre-trained models. Firstly, it requires a lot of expensive computation power to train big models on big datasets. Secondly, it can take up to multiple weeks to train big models. Training new models with pre-trained weights can speed up convergence, as well as help the network generalization.
We can take a network trained for one domain and adapt it to another domain for the target task [307,308]. First, train a network using standard backpropagation on a closely related domain for which labeled data is easy to obtain, for example, ImageNet classification or pseudo classes from augmented data. Then cut off the top layers of the network and replace them with the supervised objective for the target domain. Finally, tune the network using backpropagation with labels for the target domain until the validation loss starts to increase [307,308]. Several survey papers and books have been published on transfer learning [309,310]. Related work includes self-taught learning with transfer learning [311] and a boosting approach for transfer learning [312].
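The three steps above (reuse trained layers, replace the top layer, tune until the validation loss stops improving) can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the "pre-trained" weights `W_PRE` are hand-picked stand-ins, the data are synthetic, and only the new top layer is trained while the reused layer is kept frozen, which is one common variant of the procedure rather than the exact recipe of the cited works.

```python
import math
import random

W_PRE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # stand-in for pre-trained lower layers


def frozen_features(x):
    """Reused layers, kept fixed: one tanh unit per row of W_PRE."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W_PRE]


def predict(head, feats):
    z = head[-1] + sum(h * f for h, f in zip(head, feats))
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(head, data):
    eps = 1e-12
    total = 0.0
    for feats, y in data:
        p = predict(head, feats)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fine_tune_head(train, val, n_feats, lr=0.5, max_epochs=200, patience=5):
    head = [0.0] * (n_feats + 1)                # new top layer (weights + bias)
    best, best_loss, bad = list(head), log_loss(head, val), 0
    for _ in range(max_epochs):
        for feats, y in train:                  # SGD on the target-domain labels
            err = predict(head, feats) - y
            for i, f in enumerate(feats):
                head[i] -= lr * err * f
            head[-1] -= lr * err
        loss = log_loss(head, val)
        if loss < best_loss:
            best, best_loss, bad = list(head), loss, 0
        else:                                   # validation loss stopped improving
            bad += 1
            if bad >= patience:
                break
    return best, best_loss


def make_data(rng, n):
    points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    return [(frozen_features(p), 1 if p[0] + p[1] > 0 else 0) for p in points]


rng = random.Random(0)
train_set, val_set = make_data(rng, 80), make_data(rng, 40)
head, val_loss = fine_tune_head(train_set, val_set, n_feats=len(W_PRE))
val_acc = sum((predict(head, f) > 0.5) == bool(y) for f, y in val_set) / len(val_set)
print(val_loss, val_acc)
```

In practice the lower layers would come from a model trained on a large dataset such as ImageNet, and some or all of them may also be tuned with a small learning rate.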
11.1. Overview
DNNs have been successfully applied and have achieved better recognition accuracies in different application domains, such as computer vision, speech processing, natural language processing, big data problems, and many more. However, in most cases the training is executed on Graphics Processing Units (GPUs) to deal with big volumes of data, which is expensive in terms of power.
Recently, researchers have been training and testing deeper and wider networks to achieve even better classification accuracy, reaching human-level or beyond-human-level recognition accuracy in some cases. As the size of a neural network increases, it becomes more powerful and provides better classification accuracy. However, the storage consumption, memory bandwidth, and computational cost increase sharply as well. Such massive-scale implementations with large numbers of network parameters are not suitable for low-power platforms, such as unmanned aerial vehicles (UAVs) and various medical devices, or for low-memory systems, such as mobile devices and Field Programmable Gate Arrays (FPGAs).
There is much ongoing research on developing better network structures, or networks with lower computation cost and fewer parameters, for low-power and low-memory systems without lowering classification accuracy. There are two ways to design an efficient deep network structure:
• The first approach is to optimize the internal operational cost with an efficient network structure;
• The second is to design a network with low-precision operations or a hardware-efficient network.
The internal operations and parameters of a network structure can be reduced by using low-dimensional convolution filters for the convolution layers [71,99].
There are several benefits to this approach. Firstly, convolutions with rectification operations make the decision function more discriminative. Secondly, the main benefit of this approach is to reduce the number of computation parameters drastically. For example, a layer with 5 × 5 filters can be replaced with two layers of 3 × 3 filters (without a pooling layer in between them) for better feature learning; three layers of 3 × 3 filters can be used as a replacement for 7 × 7 filters, and so on. The benefit of using lower-dimensional filters can be quantified by assuming that both the input and the output of each convolutional layer have C channels: for three layers of 3 × 3 filters, the total number of weights is 3 × (3 × 3 × C × C) = 27C², whereas for a single layer of 7 × 7 filters the total number of weights is (7 × 7 × C × C) = 49C², which is almost double that of the three 3 × 3 layers. Moreover, the placement of layers, such as convolutional, pooling, and drop-out layers, at different intervals in the network has an impact on overall classification accuracy. Several strategies have recently been described for optimizing the network architecture to design robust deep learning models [99,100,313] and for efficient implementation of CNNs on FPGA platforms [314].
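The weight counts quoted above can be checked with a few lines of arithmetic. The sketch below ignores biases, as the text does, and uses an arbitrary channel count `C = 64`; the helper names are illustrative. The `receptive_field` helper adds one standard fact not stated above: stride-1 convolutions stacked without pooling cover layers × (kernel − 1) + 1 input pixels, which is why three 3 × 3 layers can stand in for one 7 × 7 layer.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in one convolution layer (biases ignored, as in the text)."""
    return kernel * kernel * c_in * c_out


def stacked_weights(kernel, layers, channels):
    """Weights in `layers` identical conv layers with C input and C output channels."""
    return layers * conv_weights(kernel, channels, channels)


def receptive_field(kernel, layers):
    """Receptive field of stride-1 layers stacked without pooling."""
    return layers * (kernel - 1) + 1


C = 64  # illustrative channel count
three_3x3 = stacked_weights(3, 3, C)   # 27 * C^2
one_7x7 = conv_weights(7, C, C)        # 49 * C^2
two_3x3 = stacked_weights(3, 2, C)     # 18 * C^2
one_5x5 = conv_weights(5, C, C)        # 25 * C^2

print(three_3x3 // C**2, one_7x7 // C**2, one_7x7 / three_3x3)
print(two_3x3 // C**2, one_5x5 // C**2)
print(receptive_field(3, 3), receptive_field(3, 2))
```

The ratio 49/27 is roughly 1.8, which matches the "almost double" figure in the text.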
Strategy 1: Replace 3 × 3 filters with 1 × 1 filters. The main reason to use a lower-dimensional filter is to reduce the overall number of parameters. Replacing a 3 × 3 filter with a 1 × 1 filter reduces the number of parameters by a factor of 9.
Strategy 2: Decrease the number of input channels to 3 × 3 filters. For a layer, the size of the output feature maps, which is related to the network parameters, is calculated using $\frac{N - F}{S} + 1$, where N is the input map's size, F is the filter size, and S is the stride. To reduce the number of parameters, it is not enough to reduce the size of the filters; it also requires controlling the number of input channels or feature dimensions.
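Strategies 1 and 2 can be made concrete with a small calculation. The sketch below evaluates the output-size formula (N − F)/S + 1 without padding, as in the text, and counts weights as filter height × filter width × input channels × output channels; the function names and the example sizes are illustrative assumptions.

```python
def conv_output_size(n, f, s=1):
    """Output map size for input size n, filter size f, stride s (no padding)."""
    return (n - f) // s + 1


def conv_weights(kernel, c_in, c_out):
    """Weights in one convolution layer, biases ignored."""
    return kernel * kernel * c_in * c_out


# Output-size formula: a 256 x 256 input through a 3 x 3 filter.
size_stride1 = conv_output_size(256, 3, 1)
size_stride2 = conv_output_size(256, 3, 2)

# Strategy 1: a 1 x 1 filter has 9x fewer weights than a 3 x 3 filter.
ratio_3x3_to_1x1 = conv_weights(3, 64, 64) // conv_weights(1, 64, 64)

# Strategy 2: halving the input channels of a 3 x 3 layer halves its weights.
full_inputs = conv_weights(3, 64, 64)
half_inputs = conv_weights(3, 32, 64)

print(size_stride1, size_stride2, ratio_3x3_to_1x1, full_inputs, half_inputs)
```

The second comparison shows why shrinking the filters alone is not enough: the weight count also scales linearly with the number of input channels.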
Strategy 3: Down-sample late in the network so that convolution layers have large activation maps. The outputs of the convolution layers can be at least 1 × 1 and are often larger than 1 × 1. The output width and height can be controlled by two criteria: (1) the size of the input sample (e.g., 256 × 256) and (2) the choice of where to place the down-sampling layers. Most commonly, pooling layers such as average or max pooling are used; an alternative sub-sampling layer is a convolution (3 × 3 filters) with a stride of 2. If most of the earlier layers have a larger stride, then most of the layers will have small activation maps.
Some other techniques have been proposed in the last few years [320–323]. Another power-efficient and hardware-friendly network structure has been proposed for a CNN with XNOR operations. In XNOR-based CNN implementations, both the filters and the inputs to the convolution layers are binary. This results in about 58× faster convolutional operations and 32× memory savings. In the same paper, Binary-Weight-Networks were proposed, which provide around 32× memory savings. That makes it possible to run state-of-the-art networks on a CPU for real-time use instead of a GPU. These networks are tested on the ImageNet dataset and provide only 2.9% less classification accuracy than full-precision AlexNet (in the top-1 measure). This network requires less power and computation time, which could make it possible to dramatically accelerate the training process of deep neural networks on specialized hardware [273,274]. The Energy Efficient Deep Neural Network (EEDN) architecture was proposed for neuromorphic systems in 2016. In addition, the authors released a deep learning framework called EEDN, which provides accuracy close to the state of the art on almost all the popular benchmarks except the ImageNet dataset [324,325].
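The idea behind XNOR-based convolution can be illustrated with a single binary dot product. When weights and inputs are restricted to +1 and −1 and packed as bits, the dot product equals 2 × (number of agreeing bit positions) − n, so multiplications are replaced by an XNOR and a bit count. This is a minimal sketch of that identity only; the bit encoding and function names are assumptions, and the scaling factors and other details of the cited networks are omitted.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int: bit 1 encodes +1, bit 0 encodes -1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1, -1} vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask       # XNOR: 1 where the signs match
    matches = bin(agree).count("1")         # popcount
    return 2 * matches - n                  # matches contribute +1, mismatches -1


weights = [1, -1, 1, 1, -1, -1, 1, -1]
inputs = [1, 1, -1, 1, -1, 1, 1, -1]
fast = xnor_dot(pack_bits(weights), pack_bits(inputs), len(weights))
reference = sum(w * x for w, x in zip(weights, inputs))
print(fast, reference)
```

Storing one bit per weight instead of a 32-bit float is also where the roughly 32× memory saving quoted above comes from.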
14. Summary
In this paper, we have provided an in-depth review of deep learning and its applications over
the past few years. Different state-of-the-art deep learning models in different categories of learning,
including supervised, unsupervised, and Reinforcement Learning (RL), as well as their applications in
different domains were reviewed. In addition, we have explained in detail the different supervised deep
learning techniques, including DNN, CNN, and RNN. The un-supervised deep learning techniques,
including AE, RBM, and GAN, were reviewed in detail. In the same section, we have considered
and explained unsupervised learning techniques which are proposed based on LSTM and RL. In
Section 8, we presented a survey on Deep Reinforcement Learning (DRL) with the fundamental
learning technique called Q-Learning. The recently developed Bayesian Deep Learning (BDL) and
Transfer Learning (TL) approaches are also discussed in Sections 9 and 10, respectively. Furthermore,
we have conducted a survey on energy efficient deep learning approaches, transfer learning with DL,
and hardware development trends of DL. Moreover, we have discussed some DL frameworks and
benchmark datasets, which are often used for the implementation and evaluation of deep learning
approaches. Finally, we have included relevant journals and conferences, where the DL community
has been publishing their valuable research articles.
Funding: This work was supported by the National Science Foundation under awards 1718633 and 1309708.
Acknowledgments: We would like to thank all authors mentioned in the reference of this paper from whom we
have learned a lot and thus made this review paper possible.
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
Most of the time, people use different deep learning frameworks and Software Development Kits (SDKs) for implementing deep learning approaches, which are listed below:
A.1. Frameworks
• Tensorflow: https://www.tensorflow.org/
• Caffe: http://caffe.berkeleyvision.org/
• KERAS: https://keras.io/
• Theano: http://deeplearning.net/software/theano/
• Torch: http://torch.ch/
• PyTorch: http://pytorch.org/
• Lasagne: https://lasagne.readthedocs.io/en/latest/
• DL4J (DeepLearning4J): https://deeplearning4j.org/
• Chainer: http://chainer.org/
• DIGITS: https://developer.nvidia.com/digits
• CNTK (Microsoft): https://github.com/Microsoft/CNTK
• MatConvNet: http://www.vlfeat.org/matconvnet/
• MINERVA: https://github.com/dmlc/minerva
• MXNET: https://github.com/dmlc/mxnet
• OpenDeep: http://www.opendeep.org/
• PuRine: https://github.com/purine/purine2
• Pylearn2: http://deeplearning.net/software/pylearn2/
• TensorLayer: https://github.com/zsdonghao/tensorlayer
• LBANN: https://github.com/LLNL/lbann
A.2. SDKs
• cuDNN: https://developer.nvidia.com/cudnn
• TensorRT: https://developer.nvidia.com/tensorrt
• DeepStreamSDK: https://developer.nvidia.com/deepstream-sdk
• cuBLAS: https://developer.nvidia.com/cublas
• cuSPARSE: http://docs.nvidia.com/cuda/cusparse/
• NCCL: https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl/
• MNIST: http://yann.lecun.com/exdb/mnist/
• CIFAR 10/100: https://www.cs.toronto.edu/~kriz/cifar.html
• SVHN/SVHN2: http://ufldl.stanford.edu/housenumbers/
• Flickr-30k
• Common Objects in Context (COCO): http://cocodataset.org/#overview, http://sidgan.me/technical/2016/01/09/Exploring-Datasets
In addition, there is an alternative solution in data programming that labels subsets of data using weak supervision strategies or domain heuristics as labeling functions, even if they are noisy and may conflict [87].
A.4.1. Conferences
• Conference on Neural Information Processing Systems (NIPS)
• International Conference on Learning Representations (ICLR): What are you doing for
Deep Learning?
• International Conference on Machine Learning (ICML)
• Computer Vision and Pattern Recognition (CVPR): What are you doing with Deep Learning?
• International Conference on Computer Vision (ICCV)
• European Conference on Computer Vision (ECCV)
• British Machine Vision Conference (BMVC)
A.4.2. Journal
• Journal of Machine Learning Research (JMLR)
• IEEE Transactions on Neural Networks and Learning Systems (TNNLS)
• IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
• Computer Vision and Image Understanding (CVIU)
• Pattern Recognition Letters
References
1. Schmidhuber, J. Deep Learning in Neural Networks: An Overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
[PubMed]
2. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
3. Bengio, Y.; Courville, A.; Vincent, P. Representation Learning: A Review and New Perspectives. IEEE Trans.
Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [CrossRef] [PubMed]
4. Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127. [CrossRef]
5. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.;
Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015,
518, 529–533. [CrossRef] [PubMed]
6. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari
with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks.
In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe,
NV, USA, 3–6 December 2012; pp. 1106–1114.
8. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. arXiv 2013, arXiv:1311.2901.
9. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014,
arXiv:1409.1556.
10. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
11. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
12. Canziani, A.; Paszke, A.; Culurciello, E. An analysis of deep neural network models for practical applications.
arXiv 2016, arXiv:1605.07678.
13. Zweig, G. Classification and recognition with direct segment models. In Proceedings of the 2012 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March
2012; pp. 4161–4164.
14. He, Y.; Fosler-Lussier, E. Efficient segmental conditional random fields for one-pass phone recognition. In
Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association,
Portland, OR, USA, 9–13 September 2012.
15. Abdel-Hamid, O.; Deng, L.; Yu, D.; Jiang, H. Deep segmental neural networks for speech recognition.
Interspeech 2013, 36, 70.
16. Tang, H.; Wang, W.; Gimpel, K.; Livescu, K. Discriminative segmental cascades for feature-rich phone
recognition. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding
(ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 561–568.
17. Song, W.; Cai, J. End-to-End Deep Neural Network for Automatic Speech Recognition. 1. (Errors: 21.1), 2015.
Available online: https://cs224d.stanford.edu/reports/SongWilliam.pdf (accessed on 17 January 2018).
18. Deng, L.; Abdel-Hamid, O.; Yu, D. A deep convolutional neural network using heterogeneous pooling
for trading acoustic invariance with phonetic confusion. In Proceedings of the 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, 26–31 May 2013;
pp. 6669–6673.
19. Graves, A.; Mohamed, A.-R.; Hinton, G. Speech recognition with deep recurrent neural networks. In
Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
20. Zhang, Y.; Pezeshki, M.; Brakel, P.; Zhang, S.; Laurent, C.; Bengio, Y.; Courville, A. Towards end-to-end speech
recognition with deep convolutional neural networks. arXiv 2017, arXiv:1701.02720.
21. Deng, L.; Platt, J. Ensemble deep learning for speech recognition. In Proceedings of the Fifteenth Annual
Conference of the International Speech Communication Association, Singapore, 14–18 September 2014.
22. Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech
recognition. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015;
pp. 577–585.
23. Lu, L.; Kong, L.; Dyer, C.; Smith, N.A.; Renals, S. Segmental recurrent neural networks for end-to-end speech
recognition. arXiv 2016, arXiv:1603.00223.
24. Van Essen, B.; Kim, H.; Pearce, R.; Boakye, K.; Chen, B. LBANN: Livermore big artificial neural network
HPC toolkit. In Proceedings of the Workshop on Machine Learning in High-Performance Computing
Environments, Austin, TX, USA, 15–20 November 2015; p. 5.
25. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Graph Convolutional Recurrent Neural Network: Data-Driven Traffic
Forecasting. arXiv 2017, arXiv:1707.01926.
26. Md, Z.A.; Aspiras, T.; Taha, T.M.; Asari, V.K.; Bowen, T.J. Advanced deep convolutional neural network
approaches for digital pathology image analysis: A comprehensive evaluation with different use cases. In
Proceedings of the Pathology Visions 2018, San Diego, CA, USA, 4–6 November 2018.
27. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015;
pp. 3431–3440.
28. Alom, M.Z.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Nuclei Segmentation with Recurrent Residual Convolutional
Neural Networks based U-Net (R2U-Net). In Proceedings of the NAECON 2018-IEEE National Aerospace
and Electronics Conference, Dayton, OH, USA, 23–26 July 2018; pp. 228–233.
29. Alom, M.Z.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Microscopic Blood Cell Classification Using Inception
Recurrent Residual Convolutional Neural Networks. In Proceedings of the NAECON 2018-IEEE National
Aerospace and Electronics Conference, Dayton, OH, USA, 23–26 July 2018; pp. 222–227.
30. Chen, X.-W.; Lin, X. Big Data Deep Learning: Challenges and Perspectives. IEEE Access 2014, 2, 514–525.
[CrossRef]
31. Zhou, Z.-H.; Chawla, N.V.; Jin, Y.; Williams, G.J. Big data opportunities and challenges: Discussions from
data analytics perspectives. IEEE Comput. Intell. Mag. 2014, 9, 62–74. [CrossRef]
32. Najafabadi, M.M.; Villanustre, F.; Khoshgoftaar, T.M.; Seliya, N.; Wald, R.; Muharemagic, E. Deep learning
applications and challenges in big data analytics. J. Big Data 2015, 2, 1. [CrossRef]
33. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y.
Generative adversarial nets. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge,
MA, USA, 2014; pp. 2672–2680.
34. Kaiser, L.; Gomez, A.N.; Shazeer, N.; Vaswani, A.; Parmar, N.; Jones, L.; Uszkoreit, J. One model to learn
them all. arXiv 2017, arXiv:1706.05137.
35. Collobert, R.; Weston, J. A unified architecture for natural language processing: Deep neural networks with
multitask learning. In Proceedings of the 25th International Conference on Machine Learning, Helsinki,
Finland, 5–9 July 2008; pp. 160–167.
36. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.; Wattenberg, M.;
Corrado, G.; et al. Google’s multilingual neural machine translation system: Enabling zero-shot translation.
Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [CrossRef]
37. Argyriou, A.; Evgeniou, T.; Pontil, M. Multi-task feature learning. In Advances in Neural Information Processing
Systems; The MIT Press: Cambridge, MA, USA, 2007; pp. 41–48.
38. Singh, K.; Gupta, G.; Vig, L.; Shroff, G.; Agarwal, P. Deep Convolutional Neural Networks for Pairwise
Causality. arXiv 2017, arXiv:1701.00597.
39. Yu, H.; Wang, J.; Huang, Z.; Yang, Y.; Xu, W. Video paragraph captioning using hierarchical recurrent neural
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas,
NV, USA, 27–30 June 2016; pp. 4584–4593.
40. Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative
adversarial networks. arXiv 2017, arXiv:1703.05192.
41. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis.
arXiv 2016, arXiv:1605.05396.
42. Deng, L.; Yu, D. Deep learning: Methods and applications. Found. Trends Signal Process. 2014, 7, 197–387.
[CrossRef]
43. Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent
advances in convolutional neural networks. arXiv 2015, arXiv:1512.07108.
44. Sze, V.; Chen, Y.; Yang, T.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey.
Proc. IEEE 2017, 105, 2295–2329. [CrossRef]
45. Kwon, D.; Kim, H.; Kim, J.; Suh, S.C.; Kim, I.; Kim, K.J. A survey of deep learning-based network anomaly
detection. Cluster Comput. 2017, 1–13. [CrossRef]
46. Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274.
47. Kober, J.; Bagnell, J.A.; Peters, J. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 2013, 32,
1238–1274. [CrossRef]
48. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [CrossRef]
49. Schuman, C.D.; Potok, T.E.; Patton, R.M.; Birdwell, J.D.; Dean, M.E.; Rose, G.S.; Plank, J.S. A survey of
neuromorphic computing and neural networks in hardware. arXiv 2017, arXiv:1705.06963.
50. McCulloch, W.S.; Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys.
1943, 5, 115–133. [CrossRef]
51. Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain.
Psychol. Rev. 1958, 65, 386. [CrossRef] [PubMed]
52. Minsky, M.; Papert, S.A. Perceptrons: An Introduction to Computational Geometry; MIT Press: Cambridge, MA,
USA, 2017.
53. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. 1985, 9,
147–169. [CrossRef]
54. Fukushima, K. Neocognitron: A hierarchical neural network capable of visual pattern recognition.
Neural Netw. 1988, 1, 119–130. [CrossRef]
55. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition.
Proc. IEEE 1998, 86, 2278–2324. [CrossRef]
56. Hinton, G.E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18,
1527–1554. [CrossRef] [PubMed]
57. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006,
313, 504–507. [CrossRef] [PubMed]
58. Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg,
Germany, 2012; pp. 421–436.
59. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors.
Cogn. Model. 1988, 5, 1. [CrossRef]
60. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep
learning. Int. Conf. Mach. Learn. 2013, 28, 1139–1147.
61. Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H. Greedy Layer-Wise Training of Deep Networks. In
Advances in Neural Information Processing Systems 19 (NIPS 2006); MIT Press: Cambridge, MA, USA, 2007;
pp. 153–160.
62. Erhan, D.; Manzagol, P.; Bengio, Y.; Bengio, S.; Vincent, P. The difficulty of training deep architectures and
the effect of unsupervised pre-training. Artif. Intell. Stat. 2009, 5, 153–160.
63. Mohamed, A.-R.; Dahl, G.E.; Hinton, G. Acoustic modeling using deep belief networks. IEEE Trans. Audio
Speech Lang. Process. 2012, 20, 14–22. [CrossRef]
64. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the
27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814.
65. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P. Extracting and composing robust features with denoising
autoencoders. In Proceedings of the Twenty-fifth International Conference on Machine Learning, Helsinki,
Finland, 5–9 July 2008; pp. 1096–1103.
66. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
67. Springenberg, J.T.; Dosovitskiy, A.; Brox, T.; Riedmiller, M. Striving for simplicity: The all convolutional net.
arXiv 2014, arXiv:1412.6806.
68. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26
July 2017; pp. 4700–4708.
69. Larsson, G.; Maire, M.; Shakhnarovich, G. FractalNet: Ultra-Deep Neural Networks without Residuals. arXiv
2016, arXiv:1605.07648.
70. Szegedy, C.; Ioffe, S.; Vanhoucke, V. Inception-v4, inception-resnet and the impact of residual connections on
learning. arXiv 2016, arXiv:1602.07261.
71. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer
vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV,
USA, 27–30 June 2016; pp. 2818–2826.
72. Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2016, arXiv:1605.07146.
73. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks.
arXiv 2016, arXiv:1611.05431.
74. Veit, A.; Wilber, M.J.; Belongie, S. Residual networks behave like ensembles of relatively shallow networks.
In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 550–558.
75. Abdi, M.; Nahavandi, S. Multi-Residual Networks: Improving the Speed and Accuracy of Residual Networks.
arXiv 2016, arXiv:1609.05672.
76. Zhang, X.; Li, Z.; Loy, C.C.; Lin, D. Polynet: A pursuit of structural diversity in very deep networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 718–726.
77. Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Improved inception-residual convolutional
neural network for object recognition. arXiv 2017, arXiv:1712.09888.
78. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate
shift. arXiv 2015, arXiv:1502.03167.
79. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. In Advances in Neural Information
Processing Systems (NIPS); MIT Press: Cambridge, MA, USA, 2017; pp. 3856–3866.
80. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal
networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015;
pp. 91–99.
81. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv 2016, arXiv:1610.02357.
82. Liang, M.; Hu, X. Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
83. Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M. Inception Recurrent Convolutional Neural Network for
Object Recognition. arXiv 2017, arXiv:1704.07709.
84. Li, Y.; Ouyang, W.; Wang, X.; Tang, X. Vip-cnn: Visual phrase guided convolutional neural network. In
Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu,
HI, USA, 21–26 July 2017; pp. 7244–7253.
85. Bagherinezhad, H.; Rastegari, M.; Farhadi, A. LCNN: Lookup-based Convolutional Neural Network. arXiv
2016, arXiv:1611.06473.
86. Bansal, A.; Chen, X.; Russell, B.; Gupta, A.; Ramanan, D. Pixelnet: Representation of the pixels, by the pixels,
and for the pixels. arXiv 2017, arXiv:1702.06506.
87. Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; Weinberger, K.Q. Deep networks with stochastic depth. In European
Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 646–661.
88. Lee, C.-Y.; Xie, S.; Gallagher, P.; Zhang, Z.; Tu, Z. Deeply-supervised nets. In Proceedings of the Artificial
Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; pp. 562–570.
89. Pezeshki, M.; Fan, L.; Brakel, P.; Courville, A.; Bengio, Y. Deconstructing the ladder network architecture. In
Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016;
pp. 2368–2376.
90. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review.
Neural Comput. 2017, 29, 2352–2449. [CrossRef] [PubMed]
91. Tzeng, E.; Hoffman, J.; Darrell, T.; Saenko, K. Simultaneous deep transfer across domains and tasks. In
Proceedings of the IEEE International Conference on Computer Vision, Las Condes, Chile, 11–18 December
2015; pp. 4068–4076.
92. Ba, J.; Caruana, R. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems;
NIPS Proceedings; MIT Press: Cambridge, MA, USA, 2014.
93. Urban, G.; Geras, K.J.; Kahou, S.E.; Aslan, O.; Wang, S.; Caruana, R.; Mohamed, A.; Philipose, M.;
Richardson, M. Do deep convolutional nets really need to be deep and convolutional? arXiv 2016,
arXiv:1603.05691.
94. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv
2014, arXiv:1412.6550.
95. Mishkin, D.; Matas, J. All you need is a good init. arXiv 2015, arXiv:1511.06422.
96. Pandey, G.; Dukkipati, A. To go deep or wide in learning? arXiv 2014, arXiv:1402.5634.
97. Ratner, A.J.; de Sa, C.M.; Wu, S.; Selsam, D.; Ré, C. Data programming: Creating large training sets, quickly.
In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 3567–3575.
98. Aberger, C.R.; Lamb, A.; Tu, S.; Nötzli, A.; Olukotun, K.; Ré, C. Emptyheaded: A relational engine for graph
processing. ACM Trans. Database Syst. 2017, 42, 20. [CrossRef]
99. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. Squeezenet: Alexnet-level
accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
100. Han, S.; Mao, H.; Dally, W.J. Deep compression: Compressing deep neural network with pruning, trained
quantization and huffman coding. arXiv 2015, arXiv:1510.00149.
101. Niepert, M.; Ahmed, M.; Kutzkov, K. Learning Convolutional Neural Networks for Graphs. arXiv 2016,
arXiv:1605.05273.
102. Awesome Deep Vision. Available online: https://github.com/kjw0612/awesome-deep-vision (accessed on
17 January 2018).
103. Jia, X.; Xu, X.; Cai, B.; Guo, K. Single Image Super-Resolution Using Multi-Scale Convolutional Neural
Network. In Pacific Rim Conference on Multimedia; Springer: Cham, Switzerland, 2017; pp. 149–157.
104. Ahn, B.; Cho, N.I. Block-Matching Convolutional Neural Network for Image Denoising. arXiv 2017,
arXiv:1704.00524.
105. Ma, S.; Liu, J.; Chen, C.W. A-Lamp: Adaptive Layout-Aware Multi-Patch Deep Convolutional Neural
Network for Photo Aesthetic Assessment. arXiv 2017, arXiv:1704.00248.
106. Cao, X.; Zhou, F.; Xu, L.; Meng, D.; Xu, Z.; Paisley, J. Hyperspectral Image Classification With Markov
Random Fields and a Convolutional Neural Network. IEEE Trans. Image Process. 2018, 27, 2354–2367.
[CrossRef] [PubMed]
107. De Vos, B.D.; Berendsen, F.F.; Viergever, M.A.; Staring, M.; Išgum, I. End-to-end unsupervised deformable
image registration with a convolutional neural network. In Deep Learning in Medical Image Analysis and
Multimodal Learning for Clinical Decision Support; Springer: Cham, Switzerland, 2017; pp. 204–212.
108. Wang, X.; Oxholm, G.; Zhang, D.; Wang, Y. Multimodal transfer: A hierarchical deep convolutional neural
network for fast artistic style transfer. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; Volume 2, p. 7.
109. Babaee, M.; Dinh, D.T.; Rigoll, G. A deep convolutional neural network for background subtraction. arXiv
2017, arXiv:1702.01731.
110. Alom, M.Z.; Sidike, P.; Hasan, M.; Taha, T.M.; Asari, V.K. Handwritten Bangla Character Recognition Using
the State-of-the-Art Deep Convolutional Neural Networks. Comput. Intell. Neurosci. 2018, 2018, 6747098.
[CrossRef] [PubMed]
111. Alom, M.Z.; Awwal, A.A.S.; Lowe-Webb, R.; Taha, T.M. Optical beam classification using deep learning:
A comparison with rule-and feature-based classification. In Proceedings of the Optics and Photonics for
Information Processing XI, San Diego, CA, USA, 6–10 August 2017; Volume 10395.
112. Sidike, P.; Sagan, V.; Maimaitijiang, M.; Maimaitiyiming, M.; Shakoor, N.; Burken, J.; Mockler, T.; Fritschi, F.B.
dPEN: deep Progressively Expanded Network for mapping heterogeneous agricultural landscape using
WorldView-3 satellite imagery. Remote Sens. Environ. 2019, 221, 756–772. [CrossRef]
113. Alom, M.Z.; Alam, M.; Taha, T.M.; Iftekharuddin, K.M. Object recognition using cellular simultaneous
recurrent networks and convolutional neural network. In Proceedings of the 2017 International Joint
Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 2873–2880.
114. Ronao, C.A.; Cho, S.-B. Human activity recognition with smartphone sensors using deep learning neural
networks. Expert Syst. Appl. 2016, 59, 235–244. [CrossRef]
115. Yang, J.; Nguyen, M.N.; San, P.P.; Li, X.L.; Krishnaswamy, S. Deep convolutional neural networks on
multichannel time series for human activity recognition. In Proceedings of the Twenty-Fourth International
Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
116. Hammerla, N.Y.; Halloran, S.; Ploetz, T. Deep, convolutional, and recurrent models for human activity
recognition using wearables. arXiv 2016, arXiv:1604.08880.
117. Ordóñez, F.J.; Roggen, D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable
activity recognition. Sensors 2016, 16, 115. [CrossRef] [PubMed]
118. Rad, N.M.; Kia, S.M.; Zarbo, C.; van Laarhoven, T.; Jurman, G.; Venuti, P.; Marchiori, E.; Furlanello, C. Deep
learning for automatic stereotypical motor movement detection using wearable sensors in autism spectrum
disorders. Signal Process. 2018, 144, 180–191.
119. Ravi, D.; Wong, C.; Lo, B.; Yang, G. Deep learning for human activity recognition: A resource efficient
implementation on low-power devices. In Proceedings of the 2016 IEEE 13th International Conference
on Wearable and Implantable Body Sensor Networks (BSN), San Francisco, CA, USA, 14–17 June 2016;
pp. 71–76.
120. Alom, M.Z.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Microscopic Nuclei Classification, Segmentation
and Detection with improved Deep Convolutional Neural Network (DCNN) Approaches. arXiv 2018,
arXiv:1811.03447.
121. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep
convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062.
122. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for
image segmentation. arXiv 2015, arXiv:1511.00561.
123. Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic
segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5168–5177.
124. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017;
pp. 2881–2890.
125. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation
with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal.
Mach. Intell. 2018, 40, 834–848. [CrossRef] [PubMed]
126. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation.
In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham,
Switzerland, 2015; pp. 234–241.
127. Alom, M.Z.; Hasan, M.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Recurrent Residual Convolutional Neural
Network based on U-Net (R2U-Net) for Medical Image Segmentation. arXiv 2018, arXiv:1802.06955.
128. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and
semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
129. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the
IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
130. Wang, X.; Shrivastava, A.; Gupta, A. A-fast-rcnn: Hard positive generation via adversary for object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017.
131. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988.
132. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
27–30 June 2016; pp. 779–788.
133. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox
detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37.
134. Hou, J.-C.; Wang, S.; Lai, Y.; Tsao, Y.; Chang, H.; Wang, H. Audio-Visual Speech Enhancement Using
Multimodal Deep Convolutional Neural Networks. arXiv 2017, arXiv:1703.10893.
135. Xu, Y.; Kong, Q.; Huang, Q.; Wang, W.; Plumbley, M.D. Convolutional gated recurrent neural network
incorporating spatial features for audio tagging. In Proceedings of the 2017 International Joint Conference
on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 3461–3466.
136. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.;
van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017,
42, 60–88. [CrossRef] [PubMed]
137. Zhang, Z.; Xie, Y.; Xing, F.; McGough, M.; Yang, L. Mdnet: A semantically and visually interpretable
medical image diagnosis network. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6428–6436.
138. Tran, P.V. A fully convolutional neural network for cardiac segmentation in short-axis MRI. arXiv 2016,
arXiv:1604.00494.
139. Tan, J.H.; Acharya, U.R.; Bhandary, S.V.; Chua, K.C.; Sivaprasad, S. Segmentation of optic disc, fovea and
retinal vasculature using a single convolutional neural network. J. Comput. Sci. 2017, 20, 70–79. [CrossRef]
140. Moeskops, P.; Viergever, M.A.; Mendrik, A.M.; de Vries, L.S.; Benders, M.J.N.L.; Išgum, I. Automatic
segmentation of MR brain images with a convolutional neural network. IEEE Trans. Med Imaging 2016, 35,
1252–1261. [CrossRef] [PubMed]
141. Alom, M.Z.; Yakopcic, C.; Taha, T.M.; Asari, V.K. Breast Cancer Classification from Histopathological Images
with Inception Recurrent Residual Convolutional Neural Network. arXiv 2018, arXiv:1811.04241.
142. LeCun, Y.; Bottou, L.; Orr, G. Efficient BackProp. In Neural Networks: Tricks of the Trade; Orr, G., Müller, K.,
Eds.; Lecture Notes in Computer Science; Springer: Berlin, Germany, 2012.
143. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy,
13–15 May 2010; pp. 249–256.
144. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on
imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Las Condes,
Chile, 11–18 December 2015; pp. 1026–1034.
145. Vedaldi, A.; Lenc, K. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the 23rd
ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 689–692.
146. Laurent, C.; Pereyra, G.; Brakel, P.; Zhang, Y.; Bengio, Y. Batch normalized recurrent neural networks. In
Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Shanghai, China, 20–25 March 2016; pp. 2657–2661.
147. Lavin, A.; Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021.
148. Clevert, D.-A.; Unterthiner, T.; Hochreiter, S. Fast and accurate deep network learning by exponential linear
units (ELUs). arXiv 2015, arXiv:1511.07289.
149. Li, Y.; Fan, C.; Li, Y.; Wu, Q.; Ming, Y. Improving deep neural network with multiple parametric exponential
linear units. Neurocomputing 2018, 301, 11–24. [CrossRef]
150. Jin, X.; Xu, C.; Feng, J.; Wei, Y.; Xiong, J.; Yan, S. Deep Learning with S-Shaped Rectified Linear Activation
Units. AAAI 2016, 3, 2–3.
151. Xu, B.; Wang, N.; Chen, T.; Li, M. Empirical evaluation of rectified activations in convolutional network.
arXiv 2015, arXiv:1505.00853.
152. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual
recognition. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 346–361.
153. Yoo, D.; Park, S.; Lee, J.; Kweon, I.S. Multi-scale pyramid pooling for deep convolutional representation. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA,
USA, 7–12 June 2015; pp. 71–80.
154. Graham, B. Fractional max-pooling. arXiv 2014, arXiv:1412.6071.
155. Lee, C.-Y.; Gallagher, P.W.; Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed,
gated, and tree. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016;
pp. 464–472.
156. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks
by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
157. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent
neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
158. Wan, L.; Zeiler, M.; Zhang, S.; LeCun, Y.; Fergus, R. Regularization of neural networks using dropconnect.
In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013;
pp. 1058–1066.
159. Bulò, S.R.; Porzi, L.; Kontschieder, P. Dropout distillation. In Proceedings of the International Conference on
Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 99–107.
160. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
161. Le, Q.V.; Ngiam, J.; Coates, A.; Lahiri, A.; Prochnow, B.; Ng, A.Y. On optimization methods for deep learning.
In Proceedings of the 28th International Conference on Machine Learning,
Bellevue, WA, USA, 28 June–2 July 2011; pp. 265–272.
162. Koushik, J.; Hayashi, H. Improving stochastic gradient descent with feedback. arXiv 2016, arXiv:1611.01505.
163. Sathasivam, S.; Abdullah, W.A. Logic learning in Hopfield networks. arXiv 2008, arXiv:0804.4075.
164. Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [CrossRef]
165. Jordan, M.I. Serial order: A parallel distributed processing approach. Adv. Psychol. 1997, 121, 471–495.
166. Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. Gradient Flow in Recurrent Nets: The Difficulty of
Learning Long-Term Dependencies; IEEE Press: New York, NY, USA, 2001.
167. Schmidhuber, J. Habilitation Thesis: Netzwerkarchitekturen, Zielfunktionen und Kettenregel (Network
architectures, objective functions, and chain rule). Ph.D. Thesis, Technische Universität München, München,
Germany, 15 April 1993.
168. Gers, F.A.; Schmidhuber, J. Recurrent nets that time and count. In Proceedings of the IEEE-INNS-ENNS
International Joint Conference on Neural Networks, Como, Italy, 24–27 July 2000; Volume 3.
169. Gers, F.A.; Schraudolph, N.N.; Schmidhuber, J. Learning precise timing with LSTM recurrent networks.
J. Mach. Learn. Res. 2002, 3, 115–143.
170. Socher, R.; Lin, C.C.; Manning, C.; Ng, A.Y. Parsing natural scenes and natural language with recursive
neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11),
Bellevue, WA, USA, 28 June–2 July 2011; pp. 129–136.
171. Mikolov, T.; Karafiát, M.; Burget, L.; Černocký, J.; Khudanpur, S. Recurrent neural network based language
model. In Proceedings of the Eleventh Annual Conference of the International Speech Communication
Association, Makuhari, Chiba, Japan, 26–30 September 2010; Volume 2.
172. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.; Wong, W.; Woo, W. Convolutional LSTM network: A machine
learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems (NIPS);
NIPS Proceedings; MIT Press: Cambridge, MA, USA, 2015; pp. 802–810.
173. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on
sequence modeling. arXiv 2014, arXiv:1412.3555.
174. Jozefowicz, R.; Zaremba, W.; Sutskever, I. An empirical exploration of recurrent network architectures. In
Proceedings of the 32nd International Conference on Machine Learning (ICML-15), Lille, France, 6–11 July 2015.
175. Yao, K.; Cohn, T.; Vylomova, K.; Duh, K.; Dyer, C. Depth-gated recurrent neural networks. arXiv 2015,
arXiv:1508.03790.
176. Koutnik, J.; Greff, K.; Gomez, F.; Schmidhuber, J. A clockwork rnn. arXiv 2014, arXiv:1402.3511.
177. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey.
IEEE Trans. Neural Netw. Learn. Syst. 2017, 28, 2222–2232. [CrossRef] [PubMed]
178. Karpathy, A.; Li, F.-F. Deep visual-semantic alignments for generating image descriptions. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015.
179. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space.
arXiv 2013, arXiv:1301.3781.
180. Goldberg, Y.; Levy, O. word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding
method. arXiv 2014, arXiv:1402.3722.
181. Fukushima, K. Neural network model for selective attention in visual pattern recognition and associative recall.
Appl. Opt. 1987, 26, 4985–4992.
182. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell:
Neural image caption generation with visual attention. In Proceedings of the International Conference on
Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
183. Qin, Y.; Song, D.; Chen, H.; Cheng, W.; Jiang, G.; Cottrell, G. A dual-stage attention-based recurrent neural
network for time series prediction. arXiv 2017, arXiv:1704.02971.
184. Xiong, C.; Merity, S.; Socher, R. Dynamic memory networks for visual and textual question answering. In
Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016.
185. Oord, A.v.d.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. arXiv 2016, arXiv:1601.06759.
186. Xue, W.; Nachum, I.B.; Pandey, S.; Warrington, J.; Leung, S.; Li, S. Direct estimation of regional wall
thicknesses via residual recurrent neural network. In International Conference on Information Processing in
Medical Imaging; Springer: Cham, Switzerland, 2017; pp. 505–516.
187. Tjandra, A.; Sakti, S.; Manurung, R.; Adriani, M.; Nakamura, S. Gated recurrent neural tensor network.
In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC,
Canada, 24–29 July 2016; pp. 448–455.
188. Wang, S.; Jiang, J. Learning natural language inference with LSTM. arXiv 2015, arXiv:1512.08849.
189. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural
Information Processing Systems (NIPS); MIT Press: Cambridge, MA, USA, 2014; pp. 3104–3112.
190. Lakhani, V.A.; Mahadev, R. Multi-Language Identification Using Convolutional Recurrent Neural Network.
arXiv 2016, arXiv:1611.04010.
191. Längkvist, M.; Karlsson, L.; Loutfi, A. A review of unsupervised feature learning and deep learning for
time-series modeling. Pattern Recognit. Lett. 2014, 42, 11–24. [CrossRef]
192. Malhotra, P.; Vishnu, T.V.; Vig, L.; Agarwal, P.; Shroff, G. TimeNet: Pre-trained deep recurrent neural network
for time series classification. arXiv 2017, arXiv:1706.08838.
193. Soltau, H.; Liao, H.; Sak, H. Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary
speech recognition. arXiv 2016, arXiv:1610.09975.
194. Sak, H.; Senior, A.; Beaufays, F. Long short-term memory recurrent neural network architectures for large
scale acoustic modeling. In Proceedings of the Fifteenth Annual Conference of the International Speech
Communication Association, Singapore, 14–18 September 2014.
195. Adavanne, S.; Pertilä, P.; Virtanen, T. Sound event detection using spatial features and convolutional recurrent
neural network. arXiv 2017, arXiv:1706.02291.
196. Chien, J.-T.; Misbullah, A. Deep long short-term memory networks for speech recognition. In Proceedings of
the 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, China,
17–20 October 2016.
197. Choi, E.; Schuetz, A.; Stewart, W.F.; Sun, J. Using recurrent neural network models for early detection of
heart failure onset. J. Am. Med Inform. Assoc. 2016, 24, 361–370. [CrossRef] [PubMed]
198. Azzouni, A.; Pujolle, G. A Long Short-Term Memory Recurrent Neural Network Framework for Network
Traffic Matrix Prediction. arXiv 2017, arXiv:1705.05690.
199. Olabiyi, O.; Martinson, E.; Chintalapudi, V.; Guo, R. Driver Action Prediction Using Deep (Bidirectional)
Recurrent Neural Network. arXiv 2017, arXiv:1706.02257.
200. Kim, B.D.; Kang, C.M.; Lee, S.H.; Chae, H.; Kim, J.; Chung, C.C.; Choi, J.W. Probabilistic vehicle trajectory
prediction over occupancy grid map via recurrent neural network. arXiv 2017, arXiv:1704.07049.
201. Richard, A.; Gall, J. A bag-of-words equivalent recurrent neural network for action recognition. Comput. Vis.
Image Underst. 2017, 156, 79–91. [CrossRef]
202. Bontemps, L.; McDermott, J.; Le-Khac, N.-H. Collective Anomaly Detection Based on Long Short-Term
Memory Recurrent Neural Networks. In International Conference on Future Data and Security Engineering;
Springer International Publishing: Cham, Switzerland, 2016.
203. Kingma, D.P.; Welling, M. Stochastic gradient VB and the variational auto-encoder. In Proceedings of the
Second International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014.
204. Ng, A. Sparse autoencoder. CS294A Lect. Notes 2011, 72, 1–19.
205. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P. Stacked denoising autoencoders: Learning
useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11,
3371–3408.
206. Zhang, R.; Isola, P.; Efros, A.A. Split-brain autoencoders: Unsupervised learning by cross-channel prediction.
arXiv 2016, arXiv:1611.09842.
207. Lu, J.; Deshpande, A.; Forsyth, D. CDVAE: Co-embedding Deep Variational Auto Encoder for Conditional
Variational Generation. arXiv 2016, arXiv:1612.00132.
208. Chicco, D.; Sadowski, P.; Baldi, P. Deep Autoencoder Neural Networks for Gene Ontology Annotation
Predictions. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and
Health Informatics—BCB ’14, Newport Beach, CA, USA, 20–23 September 2014; pp. 533–540.
209. Alom, M.Z.; Taha, T.M. Network Intrusion Detection for Cyber Security using Unsupervised Deep Learning
Approaches. In Proceedings of the Aerospace and Electronics Conference (NAECON), Dayton, OH, USA,
27–30 June 2017.
210. Song, C.; Liu, F.; Huang, Y.; Wang, L.; Tan, T. Auto-encoder based data clustering. In Iberoamerican Congress
on Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2013; pp. 117–124.
211. Ahmad, M.; Protasov, S.; Khan, A.M. Hyperspectral Band Selection Using Unsupervised Non-Linear Deep
Auto Encoder to Train External Classifiers. arXiv 2017, arXiv:1705.06920.
212. Freund, Y.; Haussler, D. Unsupervised learning of distributions on binary vectors using two layer networks.
In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 1992; pp. 912–919.
213. Larochelle, H.; Bengio, Y. Classification using discriminative restricted Boltzmann machines. In Proceedings
of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008.
214. Salakhutdinov, R.; Hinton, G.E. Deep Boltzmann machines. AISTATS 2009, 1, 3.
215. Alom, M.Z.; Bontupalli, V.R.; Taha, T.M. Intrusion detection using deep belief networks. In Proceedings of
the Aerospace and Electronics Conference (NAECON), Dayton, OH, USA, 16–19 June 2015.
216. Alom, M.Z.; Sidike, P.; Taha, T.M.; Asari, V.K. Handwritten bangla digit recognition using deep learning.
arXiv 2017, arXiv:1705.02680.
217. Albalooshi, F.A.; Sidike, P.; Sagan, V.; Albalooshi, Y.; Asari, V.K. Deep Belief Active Contours (DBAC) with Its
Application to Oil Spill Segmentation from Remotely Sensed Aerial Imagery. Photogramm. Eng. Remote Sens.
2018, 84, 451–458. [CrossRef]
218. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.K.; Wang, Z.; Smolley, S.P. Least squares generative adversarial networks. In
Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017;
pp. 2794–2802.
219. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for
training gans. arXiv 2016, arXiv:1606.03498.
220. Vondrick, C.; Pirsiavash, H.; Torralba, A. Generating videos with scene dynamics. In Advances in Neural
Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 613–621.
221. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative
adversarial networks. arXiv 2015, arXiv:1511.06434.
222. Wang, X.; Gupta, A. Generative image modeling using style and structure adversarial networks. In European
Conference on Computer Vision; Springer: Cham, Switzerland, 2016.
223. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable
representation learning by information maximizing generative adversarial nets. In Advances in Neural
Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016.
224. Im, D.J.; Kim, C.D.; Jiang, H.; Memisevic, R. Generating images with recurrent adversarial networks. arXiv
2016, arXiv:1602.05110.
225. Isola, P.; Zhu, J.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks.
arXiv 2017, arXiv:1611.07004.
226. Liu, M.-Y.; Tuzel, O. Coupled generative adversarial networks. In Advances in Neural Information Processing
Systems; MIT Press: Cambridge, MA, USA, 2016.
227. Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv 2016, arXiv:1605.09782.
228. Berthelot, D.; Schumm, T.; Metz, L. Began: Boundary equilibrium generative adversarial networks. arXiv
2017, arXiv:1703.10717.
229. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein gan. arXiv 2017, arXiv:1701.07875.
230. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of wasserstein gans.
In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 5767–5777.
231. He, K.; Wang, Y.; Hopcroft, J. A powerful generative model using random weights for the deep image
representation. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016.
232. Kos, J.; Fischer, I.; Song, D. Adversarial examples for generative models. arXiv 2017, arXiv:1702.06832.
233. Zhao, J.; Mathieu, M.; LeCun, Y. Energy-based generative adversarial network. arXiv 2016, arXiv:1609.03126.
234. Park, N.; Anand, A.; Moniz, J.R.A.; Lee, K.; Chakraborty, T.; Choo, J.; Park, H.; Kim, Y. MMGAN: Manifold
Matching Generative Adversarial Network for Generating Images. arXiv 2017, arXiv:1707.08273.
235. Laloy, E.; Hérault, R.; Jacques, D.; Linde, N. Efficient training-image based geostatistical simulation and
inversion using a spatial generative adversarial neural network. arXiv 2017, arXiv:1708.04975.
236. Eghbal-zadeh, H.; Widmer, G. Probabilistic Generative Adversarial Networks. arXiv 2017, arXiv:1708.01886.
237. Fowkes, J.; Sutton, C. A Bayesian Network Model for Interesting Itemsets. In Joint European Conference on
Machine Learning and Knowledge Discovery in Databases; Springer International Publishing: Cham, Switzerland, 2016.
238. Mescheder, L.; Nowozin, S.; Geiger, A. Adversarial variational bayes: Unifying variational autoencoders and
generative adversarial networks. arXiv 2017, arXiv:1701.04722.
239. Nowozin, S.; Cseke, B.; Tomioka, R. f-gan: Training generative neural samplers using variational divergence
minimization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016.
240. Li, C.; Wand, M. Precomputed real-time texture synthesis with markovian generative adversarial networks.
In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016.
241. Du, C.; Zhu, J.; Zhang, B. Learning Deep Generative Models with Doubly Stochastic Gradient MCMC.
IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3084–3096. [CrossRef] [PubMed]
242. Hoang, Q.; Nguyen, T.D.; Le, T.; Phung, D. Multi-Generator Generative Adversarial
Nets. arXiv 2017, arXiv:1708.02556.
243. Bousmalis, K.; Silberman, N.; Dohan, D.; Erhan, D.; Krishnan, D. Unsupervised pixel-level domain adaptation
with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 7.
244. Kansky, K.; Silver, T.; Mély, D.A.; Eldawy, M.; Lázaro-Gredilla, M.; Lou, X.; Dorfman, N.; Sidor, S.; Phoenix, S.;
George, D. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. arXiv
2017, arXiv:1706.04317.
245. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.;
Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv
2016, arXiv:1609.04802.
246. Souly, N.; Spampinato, C.; Shah, M. Semi and Weakly Supervised Semantic Segmentation Using Generative
Adversarial Network. arXiv 2017, arXiv:1703.09695.
247. Dash, A.; Gamboa, J.C.B.; Ahmed, S.; Liwicki, M.; Afzal, M.Z. TAC-GAN-text conditioned auxiliary classifier
generative adversarial network. arXiv 2017, arXiv:1703.06412.
248. Zhang, H.; Dana, K. Multi-style Generative Network for Real-time Transfer. arXiv 2017, arXiv:1703.06953.
249. Zhang, H.; Sindagi, V.; Patel, V.M. Image De-raining Using a Conditional Generative Adversarial Network.
arXiv 2017, arXiv:1701.05957.
250. Serban, I.V.; Sordoni, A.; Bengio, Y.; Courville, A.C.; Pineau, J. Building End-To-End Dialogue Systems Using
Generative Hierarchical Neural Network Models. AAAI 2016, 16, 3776–3784.
251. Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech Enhancement Generative Adversarial Network. arXiv
2017, arXiv:1703.09452.
252. Yang, L.-C.; Chou, S.-Z.; Yang, Y.-I. MidiNet: A convolutional generative adversarial network for
symbolic-domain music generation. In Proceedings of the 18th International Society for Music Information
Retrieval Conference (ISMIR’2017), Suzhou, China, 23–27 October 2017.
253. Yang, Q.; Yan, P.; Zhang, Y.; Yu, H.; Shi, Y.; Mou, X.; Kalra, M.K.; Zhang, Y.; Sun, L.; Wang, G. Low-dose
CT image denoising using a generative adversarial network with Wasserstein distance and perceptual loss.
IEEE Trans. Med. Imaging 2018, 37, 1348–1357. [CrossRef] [PubMed]
254. Rezaei, M.; Harmuth, K.; Gierke, W.; Kellermeier, T.; Fischer, M.; Yang, H.; Meinel, C. A conditional
adversarial network for semantic segmentation of brain tumor. In International MICCAI Brainlesion Workshop;
Springer: Cham, Switzerland, 2017; pp. 241–252.
255. Xue, Y.; Xu, T.; Zhang, H.; Long, L.R.; Huang, X. Segan: Adversarial network with multi-scale L1 loss for
medical image segmentation. Neuroinformatics 2018, 16, 383–392. [CrossRef] [PubMed]
256. Mardani, M.; Gong, E.; Cheng, J.Y.; Vasanawala, S.; Zaharchuk, G.; Alley, M.; Thakur, N.; Han, S.; Dally, W.;
Pauly, J.M.; et al. Deep generative adversarial networks for compressed sensing automates MRI. arXiv 2017,
arXiv:1706.00051.
257. Choi, E.; Biswal, S.; Malin, B.; Duke, J.; Stewart, W.F.; Sun, J. Generating Multilabel Discrete Electronic Health
Records Using Generative Adversarial Networks. arXiv 2017, arXiv:1703.06490.
258. Esteban, C.; Hyland, S.L.; Rätsch, G. Real-valued (medical) time series generation with recurrent conditional
gans. arXiv 2017, arXiv:1706.02633.
259. Hayes, J.; Melis, L.; Danezis, G.; de Cristofaro, E. LOGAN: evaluating privacy leakage of generative models
using generative adversarial networks. arXiv 2017, arXiv:1705.07663.
260. Gordon, J.; Hernández-Lobato, J.M. Bayesian Semisupervised Learning with Deep Generative Models. arXiv
2017, arXiv:1706.09751.
261. Abbasnejad, M.E.; Shi, Q.; Abbasnejad, I.; van den Hengel, A.; Dick, A. Bayesian conditional generative
adverserial networks. arXiv 2017, arXiv:1706.05477.
262. Grnarova, P.; Levy, K.Y.; Lucchi, A.; Hofmann, T.; Krause, A. An online learning approach to generative
adversarial networks. arXiv 2017, arXiv:1706.03269.
263. Li, Y.; Swersky, K.; Zemel, R. Generative moment matching networks. In Proceedings of the International
Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1718–1727.
264. Li, C.-L.; Chang, W.; Cheng, Y.; Yang, Y.; Póczos, B. Mmd gan: Towards deeper understanding of moment
matching network. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA,
2017; pp. 2203–2213.
265. Nie, X.; Feng, J.; Xing, J.; Yan, S. Generative partition networks for multi-person pose estimation. arXiv 2017,
arXiv:1705.07422.
266. Saeedi, A.; Hoffman, M.D.; DiVerdi, S.J.; Ghandeharioun, A.; Johnson, M.J.; Adams, R.P. Multimodal
prediction and personalization of photo edits with deep generative models. arXiv 2017, arXiv:1704.04997.
267. Schlegl, T.; Seeböck, P.; Waldstein, S.M.; Schmidt-Erfurth, U.; Langs, G. Unsupervised anomaly detection
with generative adversarial networks to guide marker discovery. In International Conference on Information
Processing in Medical Imaging; Springer: Cham, Switzerland, 2017; pp. 146–157.
268. Liu, M.-Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. In Advances in Neural
Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 700–708.
269. Mehrotra, A.; Dukkipati, A. Generative Adversarial Residual Pairwise Networks for One Shot Learning.
arXiv 2017, arXiv:1703.08033.
270. Sordoni, A.; Galley, M.; Auli, M.; Brockett, C.; Ji, Y.; Mitchell, M.; Nie, J.; Gao, J.; Dolan, B. A neural network
approach to context-sensitive generation of conversational responses. arXiv 2015, arXiv:1506.06714.
271. Yin, J.; Jiang, X.; Lu, Z.; Shang, L.; Li, H.; Li, X. Neural generative question answering. arXiv 2015,
arXiv:1512.01337.
272. Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.;
Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
273. Chen, Y.; Li, J.; Xiao, H.; Jin, X.; Yan, S.; Feng, J. Dual path networks. In Advances in Neural Information
Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 4467–4475.
274. Mahmud, M.; Kaiser, M.S.; Hussain, A.; Vassanelli, S. Applications of deep learning and reinforcement
learning to biological data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2063–2079. [CrossRef] [PubMed]
275. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
276. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J. Mastering
the game of Go with deep neural networks and tree search. Nature 2016, 529, 484. [CrossRef] [PubMed]
277. Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A.S.; Yeo, M.; Makhzani, A.; Küttler, H.;
Agapiou, J.; Schrittwieser, J.; et al. Starcraft ii: A new challenge for reinforcement learning. arXiv 2017,
arXiv:1708.04782.
278. Koenig, S.; Simmons, R.G. Complexity Analysis of Real-Time Reinforcement Learning Applied to Finding Shortest
Paths in Deterministic Domains; Tech. Report, No. CMU-CS-93-106; Computer Science Department,
Carnegie-Mellon University: Pittsburgh, PA, USA, December 1992.
279. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.;
Bolton, A.; et al. Mastering the game of go without human knowledge. Nature 2017, 550, 354. [CrossRef]
[PubMed]
280. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.I.; Moritz, P. Trust Region Policy Optimization. In Proceedings
of the 32nd International Conference on Machine Learning (ICML-15), Lille, France, 6–11 July 2015; Volume
37, pp. 1889–1897.
281. Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res.
2016, 17, 1334–1373.
282. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous
methods for deep reinforcement learning. In Proceedings of the International Conference on Machine
Learning, New York, NY, USA, 20–22 June 2016; pp. 1928–1937.
283. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. A brief survey of deep reinforcement
learning. arXiv 2017, arXiv:1708.05866.
284. Zhu, F.; Liao, P.; Zhu, X.; Yao, Y.; Huang, J. Cohesion-based online actor-critic reinforcement learning for
mhealth intervention. arXiv 2017, arXiv:1703.10039.
285. Zhu, F.; Guo, J.; Xu, Z.; Liao, P.; Yang, L.; Huang, J. Group-driven reinforcement learning for personalized
mhealth intervention. In International Conference on Medical Image Computing and Computer-Assisted
Intervention; Springer: Cham, Switzerland, 2018; pp. 590–598.
286. Steckelmacher, D.; Roijers, D.M.; Harutyunyan, A.; Vrancx, P.; Plisnier, H.; Nowé, A. Reinforcement
learning in POMDPs with memoryless options and option-observation initiation sets. In Proceedings
of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
287. Hu, H.; Zhang, X.; Yan, X.; Wang, L.; Xu, Y. Solving a new 3d bin packing problem with deep reinforcement
learning method. arXiv 2017, arXiv:1708.05930.
288. Everitt, T.; Krakovna, V.; Orseau, L.; Hutter, M.; Legg, S. Reinforcement learning with a corrupted reward
channel. arXiv 2017, arXiv:1705.08417.
289. Wu, Y.; Mansimov, E.; Grosse, R.B.; Liao, S.; Ba, J. Scalable trust-region method for deep reinforcement
learning using kronecker-factored approximation. In Advances in Neural Information Processing Systems; MIT
Press: Cambridge, MA, USA, 2017; pp. 5279–5288.
290. Denil, M.; Agrawal, P.; Kulkarni, T.D.; Erez, T.; Battaglia, P.; de Freitas, N. Learning to perform physics
experiments via deep reinforcement learning. arXiv 2016, arXiv:1611.01843.
291. Hein, D.; Hentschel, A.; Runkler, T.; Udluft, S. Particle swarm optimization for generating interpretable
fuzzy reinforcement learning policies. Eng. Appl. Artif. Intell. 2017, 65, 87–98. [CrossRef]
292. Islam, R.; Henderson, P.; Gomrokchi, M.; Precup, D. Reproducibility of benchmarked deep reinforcement
learning tasks for continuous control. arXiv 2017, arXiv:1708.04133.
293. Inoue, T.; de Magistris, G.; Munawar, A.; Yokoya, T.; Tachibana, R. Deep reinforcement learning for high
precision assembly tasks. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 819–825.
294. Li, K.; Burdick, J.W. Inverse Reinforcement Learning in Large State Spaces via Function Approximation.
arXiv 2017, arXiv:1707.09394.
295. Liu, N.; Li, Z.; Xu, J.; Xu, Z.; Lin, S.; Qiu, Q.; Tang, J.; Wang, Y. A hierarchical framework of cloud resource
allocation and power management using deep reinforcement learning. In Proceedings of the 2017 IEEE 37th
International Conference on Distributed Computing Systems (ICDCS), Atlanta, GA, USA, 5–8 June 2017;
pp. 372–382.
296. Cao, Q.; Lin, L.; Shi, Y.; Liang, X.; Li, G. Attention-aware face hallucination via deep reinforcement learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 690–698.
297. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? In
Advances in Neural Information Processing Systems (NIPS); MIT Press: Cambridge, MA, USA, 2017.
298. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and
semantics. arXiv 2017, arXiv:1705.07115.
299. Google Photos labeled black people ‘gorillas’. Available online: https://www.usatoday.com/story/tech/
2015/07/01/google-apologizes-after-photos-identify-black-people-as-gorillas/29567465/ (accessed on
1 March 2019).
300. Gal, Y.; Ghahramani, Z. Bayesian convolutional neural networks with Bernoulli approximate variational
inference. arXiv 2015, arXiv:1506.02158.
301. Kumar, S.; Laumann, F.; Maurin, A.L.; Olsen, M.; Liwicki, M. Bayesian Convolutional Neural Networks with
Variational Inference. arXiv 2018, arXiv:1704.02798.
302. Vladimirova, M.; Arbel, J.; Mesejo, P. Bayesian neural networks become heavier-tailed with depth. In
Proceedings of the Bayesian Deep Learning Workshop during the Thirty-Second Conference on Neural
Information Processing Systems (NIPS 2018), Montréal, QC, Canada, 7 December 2018.
303. Hu, S.X.; Moreno, P.G.; Lawrence, N.; Damianou, A. β-BNN: A Rate-Distortion
Perspective on Bayesian Neural Networks. In Proceedings of the Bayesian Deep Learning Workshop during
the Thirty-Second Conference on Neural Information Processing Systems (NIPS 2018), Montréal, QC, Canada,
7 December 2018.
304. Lombardo, S.; Han, J.; Schroers, C.; Mandt, S. Video Compression through Deep Bayesian Learning.
In Proceedings of the Bayesian Deep Learning Workshop during the Thirty-Second Conference on Neural Information
Processing Systems (NIPS 2018), Montréal, QC, Canada, 7 December 2018.
305. Krishnan, R.; Subedar, M.; Tickoo, O. BAR: Bayesian Activity Recognition using variational inference. arXiv
2018, arXiv:1811.03305.
306. Chen, T.; Goodfellow, I.; Shlens, J. Net2net: Accelerating learning via knowledge transfer. arXiv 2015,
arXiv:1511.05641.
307. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. arXiv 2014, arXiv:1409.7495.
308. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V.
Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 2096–2130.
309. Taylor, M.E.; Stone, P. Transfer learning for reinforcement learning domains: A survey. J. Mach. Learn. Res.
2009, 10, 1633–1685.
310. McKeough, A. Teaching for Transfer: Fostering Generalization in Learning; Routledge: London, UK, 2013.
311. Raina, R.; Battle, A.; Lee, H.; Packer, B.; Ng, A.Y. Self-taught learning: transfer learning from unlabeled data.
In Proceedings of the 24th international conference on Machine learning, Corvallis, OR, USA, 20–24 June
2007; pp. 759–766.
312. Dai, W.; Yang, Q.; Xue, G.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International
Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 193–200.
313. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.;
et al. Google’s neural machine translation system: Bridging the gap between human and machine translation.
arXiv 2016, arXiv:1609.08144.
314. Qiu, J.; Wang, J.; Yao, S.; Guo, K.; Li, B.; Zhou, E.; Yu, J.; Tang, T.; Xu, N.; Song, S.; et al. Going deeper
with embedded fpga platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 21–23 February 2016;
pp. 26–35.
315. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360.
316. Lin, Z.; Courbariaux, M.; Memisevic, R.; Bengio, Y. Neural networks with few multiplications. arXiv 2015,
arXiv:1510.03009.
317. Courbariaux, M.; David, J.-P.; Bengio, Y. Training deep neural networks with low precision multiplications.
arXiv 2014, arXiv:1412.7024.
318. Courbariaux, M.; Bengio, Y.; David, J.-P. Binaryconnect: Training deep neural networks with binary weights
during propagations. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015.
319. Hubara, I.; Soudry, D.; El Yaniv, R. Binarized Neural Networks. arXiv 2016, arXiv:1602.02505.
320. Kim, M.; Smaragdis, P. Bitwise neural networks. arXiv 2016, arXiv:1601.06071.
321. Dettmers, T. 8-Bit Approximations for Parallelism in Deep Learning. arXiv 2015, arXiv:1511.04561.
322. Gupta, S.; Agrawal, A.; Gopalakrishnan, K.; Narayanan, P. Deep learning with limited numerical precision.
In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015;
pp. 1737–1746.
323. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural
networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160.
324. Merolla, P.A.; Arthur, J.V.; Alvarez-Icaza, R.; Cassidy, A.S.; Sawada, J.; Akopyan, F.; Jackson, B.L.; Imam, N.;
Guo, C.; Nakamura, Y.; et al. A million spiking-neuron integrated circuit with a scalable communication
network and interface. Science 2014, 345, 668–673. [CrossRef] [PubMed]
325. Esser, S.K.; Merolla, P.A.; Arthur, J.V.; Cassidy, A.S. Convolutional networks for fast, energy-efficient
neuromorphic computing. Proc. Natl. Acad. Sci. USA 2016, 113, 11441–11446.
326. Zidan, M.A.; Strachan, J.P.; Lu, W.D. The future of electronics based on memristive systems. Nat. Electron.
2018, 1, 22. [CrossRef]
327. Chen, Y.-H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An energy-efficient reconfigurable accelerator for deep
convolutional neural networks. IEEE J. Solid-State Circuits 2017, 52, 127–138. [CrossRef]
328. Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N.; et al. Dadiannao: A
machine-learning supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 609–622.
329. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.;
Borchers, A.; et al. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 2017
ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, ON, Canada,
24–28 June 2017; pp. 1–12.
330. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient inference engine
on compressed deep neural network. In Proceedings of the 2016 ACM/IEEE 43rd Annual International
Symposium on Computer Architecture (ISCA), Seoul, Korea, 18–22 June 2016; pp. 243–254.
331. Zhang, X.; Zou, J.; Ming, X.; He, K.; Sun, J. Efficient and accurate approximations of nonlinear convolutional
networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA,
USA, 7–12 June 2015; pp. 1984–1992.
332. Novikov, A.; Podoprikhin, D.; Osokin, A.; Vetrov, D.P. Tensorizing neural networks. In Advances in Neural
Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; pp. 442–450.
333. Zhu, C.; Han, S.; Mao, H.; Dally, W.J. Trained ternary quantization. arXiv 2016, arXiv:1612.01064.
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).