10 1016@j Isprsjprs 2019 04 016

ISPRS Journal of Photogrammetry and Remote Sensing 152 (2019) 192–210
Contents lists available at ScienceDirect
ISPRS Journal of Photogrammetry and Remote Sensing

journal homepage: www.elsevier.com/locate/isprsjprs
Meta-analysis of deep neural networks in remote sensing: A comparative T

study of mono-temporal classification to support vector machines
Shahriar S. Heydari, Giorgos Mountrakis
⁎
Department of Environmental Resources Engineering, State University of New York, College of Environmental Science and Forestry, 1 Forestry Drive, Syracuse, NY 13210,
United States
ARTICLE INFO ABSTRACT
Keywords: Deep learning methods have recently found widespread adoption for remote sensing tasks, particularly in image
Deep learning or pixel classification. Their flexibility and versatility has enabled researchers to propose many different designs
Classification to process remote sensing data in all spectral, spatial, and temporal dimensions. In most of the reported cases
Convolutional neural network they surpass their non-deep rivals in overall classification accuracy. However, there is considerable diversity in
Deep belief network
implementation details in each case and a systematic quantitative comparison to non-deep classifiers does not
Stacked auto encoder
Support vector machine
exist. In this paper, we look at the major research papers that have studied deep learning image classifiers in
recent years and undertake a meta-analysis on their performance compared to the most used non-deep rival,
Support Vector Machine (SVM) classifiers. We focus on mono-temporal classification as the time-series image
classification did not offer sufficient samples. Our work covered 103 manuscripts and included 92 cases that
supported direct accuracy comparisons between deep learners and SVMs.
Our general findings are the following: (i) Deep networks have better performance than non-deep spectral
SVM implementations, with Convolutional Neural Networks (CNNs) performing better than other deep learners.
This advantage, however, diminishes when feeding SVM with richer features extracted from data (e.g. spatial
filters). (ii) Transfer learning and fine-tuning on pre-trained CNNs are offering promising results over spectral or
enhanced SVM, however these pre-trained networks are currently limited to RGB input data, therefore currently
lack applicability in multi/hyperspectral data. (iii) There is no strong relationship between network complexity
and accuracy gains over SVM; small to medium networks perform similarly to more complex networks. (iv)
Contrary to the popular belief, there are numerous cases of high deep networks performance with training
proportions of 10% or less.
Our study also indicates that the new generation of classifiers is often overperforming existing benchmark
datasets, with accuracies surpassing 99%. There is a clear need for new benchmark dataset collections with
diverse spectral, spatial and temporal resolutions and coverage that will enable us to study the design gen-
eralizations, challenge these new classifiers, and further advance remote sensing science. Our community could
also benefit from a coordinated effort to create a large pre-trained network specifically designed for remote
sensing images that users could later fine-tune and adjust to their study specifics.
1. Introduction nodes (Rumelhart et al., 1986). The back-propagation method has

worked well for non-deep structures (1–2 hidden layers) but gradient-
Artificial neural networks (ANNs) first started with cybernetics in based training of deep neural networks (DNNs) could get stuck in local
the 1940–1960s and led to the invention of the first single neuron minima or plateaus due to the dramatic increase in number of model
model named perceptron (Rosenblatt, 1958). Being a data-driven model parameters and vanishing of gradients during backpropagation (Bengio,
with the ability to simulate arbitary computing functions through op- 2009). There is no standard definition to label a neural network as
timization, ANNs found a wide range of applications. The next major deep, but it mostly refers to network of two hidden layers or more, used
breakthrough happened in late 80s with the invention of back-propa- to automatically extract a hierarchical set of features from data. Com-
gation and a gradient-based optimization algorithm to train a neural pared to 1–2 layer structures, DNNs promise to provide more compact
network with one or two hidden layers with any desired number of models for the same modeling capabilities (Bengio, 2009). However,
⁎
Corresponding author.
E-mail addresses: sshahhey@syr.edu (S.S. Heydari), gmountrakis@esf.edu (G. Mountrakis).
https://doi.org/10.1016/j.isprsjprs.2019.04.016
Received 1 February 2018; Received in revised form 17 April 2019; Accepted 24 April 2019
Available online 01 May 2019
0924-2716/ © 2019 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.
S.S. Heydari and G. Mountrakis ISPRS Journal of Photogrammetry and Remote Sensing 152 (2019) 192–210
the high node number of DNNs made it difficult to train and optimize in deep learning benefits. For example, Khatami et al. (2016) grouped all
a practical manner. neural network types under one category and did not distinguish deep
The seminal work of Hinton et al. (2006) showed that unsupervised networks from other implementations. Ma et al. (2017) conducted si-
pre-training of each layer, one after another, could considerably im- milar meta-analysis focusing on object-based classification (thus ex-
prove results. This layer-wise training approach, named greedy algo- cluding pixel-based ones) without separating deep learning methods.
rithm, was the key that opened new avenues to deep neural networks. There are some other papers that review deep learning architectures in
The greedy algorithm could also be followed by a fine-tuning process, in general such as Deng (2014) or W. Liu et al. (2017), or for specific type
which the entire network is tuned together using backpropagation, but of data, such as Camps-Valls et al. (2014) on hyperspectral data clas-
this time from a much better starting point. Deep network theories and sification. These works also lack quantitative comparisonsusing a meta-
practices have expanded considerably during the last decade. It has analysis approach.
resulted in establishment of some major network types (with con- The overarching goal is to provide readers with the “big picture” of
tinuous enhancements) and numerous applications in different do- current research and build on the collective knowledge of published
mains. In close relationship with image processing and computer vision, works to assess deep learning benefits in remote sensing. To undertake
remote sensing (RS) is one of many areas that deep learning is targeting. the proposed meta-analysis task, we reviewed major research papers
Generally and following discussion in L. Zhang et al. (2016), we can and built a database of case studies of deep network applications in the
categorize remote sensing applications of deep learning into four remote sensing field while extracting main network and data char-
groups: (1) RS image pre-processing, (2) scene classification, (3) pixel- acteristics. This database was analyzed to identify deep learning clas-
based classification and image segmentation, and (4) target detection. sification performance and its distribution across these network (e.g.
For image pre-processing tasks, we can name pan-sharpening, de- network complexity) and data characteristics (e.g. spatial resolution).
noising, and resolution enhancement as major applications. Scene We expect this analysis to provide a knowledge baseline as the remote
classification is done based on some extracted features from a scene, sensing community further incorporates deep leaning in related activ-
which the deep networks are assumed to be good at. The non-deep ities.
approaches normally use some handcrafted features extracted from the The structure of the manuscript is as follows. A brief overlook of
scene to feed the classifier (SVM, KNN, etc.) and predict the scene type. deep network types is presented in Section 2 along with key in-
Deep networks have opened the door to direct use of spectral and troductory references. A summary table is also provided to describe
spatial information together to generate a richer set of features auto- extracted parameters for each research paper. In Section 3 after in-
matically. This automatic extraction increases the potential for good troducing a descriptive statistics and summarizing design ideas en-
generalization and scalability of this method compared to handcrafted countered in the selected research papers and used datasets, we provide
features. Handcrafted features tend to be tailored closely to a specific our main comparative analysis and discuss important research ques-
case and application and possibly perform better than any automatic tions about parameters effect on network performance. The last section
system, but because of this specificity they cannot be easily or suc- provides concluding remarks.
cessfully generalized to another cases/studies. This type of work is
closely related to image recognition task but for categorization of re- 2. Methods
mote sensing scenes (such as agricultural field, residential area, airport,
parking lot, etc.), therefore sharing network configurations between In this section we first describe three DNN methods that have been
computer vision and remote sensing applications is common here. Pixel popular in RS tasks. Section 2.2 contains an explanation on the paper
classification and segmentation (or semantic labeling) are similar to database and associated characteristics and metrics used in the com-
scene classification but operate at the pixel rather than scene level, and parative accuracy analysis between DNNs and non-deep methods.
produce a thematic map instead of a single category index. This is
perhaps the most studied RS application and deep networks have shown 2.1. Summary of popular deep neural networks in remote sensing
performance benefits due to their ability to co-process spatial and
spectral data easily, especially for hyperspectral images. Our main The deep learning paradigm is concentrated on automated hier-
target in this paper is to focus on image or scene classification – we do archical feature extraction. Numerous methods and their modifications
not address other applications. In addition, we focus on mono-temporal have been devised along the past years. Here we briefly introduce the
classification as the time-series image classification is still in its infancy. three most widely used structures which were used in our identified
Target or object detection is generally an extension to the three afore- studies. More detailed descriptions of each structure can be found in
mentioned groups, where specific objects defined by their shape or many machine learning textbooks, for example Bengio (2009) and
boundary are extracted from an image. This field has found many useful Goodfellow et al. (2016), or tutorials such as Le (2015) and Deng
but challenging applications in high resolution and real time image/ (2014). Zhu et al. (2017) and L. Zhang et al. (2016) also provide tu-
video processing. torials for deep learning for remote sensing (RS) applications.
Following the explosive growth of new algorithmic developments Deep networks have been developed to enhance and enrich data
and case studies in deep learning RS applications in the past 3–4 years, representations in an automated and intelligent manner. A good re-
several review manuscripts have been published (Ghamisi et al., 2017, presentation is, of course, dependent on the specific application and
Xia et al., 2017, L. Zhang et al., 2016, or P. Liu et al., 2017) and a should be learned from training data. One important deep network
Special Issue in this journal has summarized several recent advance- category in this class is based on Autoencoders (AEs). The idea behind
ments (Mountrakis et al., 2018). The majority of these reviews are an autoencoder is basically an encoder-decoder network to regenerate
descriptive and do not offer a quantitative assessment of deep learning the input as accurately as possible in its output. Under specific condi-
benefits building on the extensive available comparisons in the litera- tions, the encoder part works as a good feature extractor and can be
ture. The overal goal of this work is to bridge this knowledge gap by stacked to build deep networks named Stacked Auto Encoders or SAEs
undertaking a meta-analysis comparing deep and non-deep classifica- (the decoder part is not used). The imposed condition on objective
tion algorithms through a meta-analysis of published research. function is typically a form of sparsity, but other variants are also
Other meta-analysis works exist but they do not examine explicitly studied. To put it simply, AE learns a deterministic representation of the
193
Table 1
Parameters collected on each case study.
Reference Citation code for the referenced research paper
Network type One of below categories:

– Convolutional Neural Network (CNN),
– Deep Belief Network (DBN),
– Stacked AutoEncoder (SAE)
Learning strategy One of below categories:
– Unsupervised
– Unsupervised & fine-tuning
– Semisupervised
– (fully) Supervised
– Transfer learning
– Transfer learning & fine-tuning
Number of parameters Number of trainable network parameters, i.e. weights and biases of network neurons and connections. We manually created this number to approximate
network complexity
Dataset Name of dataset used for the research, including:
– Brazilian coffee, NWPU-RESISC45, RSSCN7, UC Merced, and WHU-RS19: 3-band images used in scene classification,
– Indian Pines, Houston, Kennedy Space Center, Pavia University, Pavia City Center, and Salinas: hyperspectral images used in pixel classification,
– ISPRS Potsdam and ISPRS Vaihingen: very high resolution images used in image segmentation,
– Others: Remaining datasets.
Spatial resolution Dataset spatial resolution expressed through pixel size
# of channels Number of spectral and auxiliary channels
Training proportion Proportion of training data size in reference dataset
Metric type Metric used for reporting performance in research case, including Overall Accuracy, Average Accuracy, Average Precision, F1, Kappa, etc
Deep network result Best reported value of network classification performance
SVM results Best achieved performance of SVM implementation
input by minimizing a cost function based on the difference between invariance property to the network. Each layer’s output is typically
input and the regenerated one at the decoder output. This learning named a map and it is generally desired to have multiple maps generated
takes place using gradient-based optimization and standard back- at each layer. Here the filter weights are tuned typically by supervised
propagation techniques. AEs are well suited to unsupervised learning training, as the limited number of shared parameters in each layer
and can be trained layer-wise, possibly followed by a supervised fine- (compared to a fully connected network) allows it. There are also some
tuning phase of the entire network. For a good overview of auto- pre-trained large network structures publicly available for use and fine-
encoders with some work examples and executable codes see Andrew tuning them for specific applications is another common approach. For a
Ng’s Deep Learning tutorial at http://deeplearning.stanford.edu/wiki/ university course on convolutional neural networks readers are referred
index.php/UFLDL_Tutorial. Vincent et al. (2010) also provide more to http://cs231n.stanford.edu/. Zeiler and Fergus (2014) also provide a
details on autoencoders and unsupervised learning. discussion on visualization and understanding of the internal CNN
Another way of thinking about data representation is to learn the workings.
statistical distribution of input, i.e. a probabilistic approach. This ap- Working with sequence data is another important type of remote
proach has led to Generative models or Structured Probabilistic Models. sensing works, particularly on three bases: studying hyperspectral
Deep belief network (DBN) based on stacking layers of Restricted signal variations and analyzing their dependencies; adding the time
Boltzmann Machine (RBM) is the most popular variant for RS appli- dimension as another data element to explore land use feature patterns
cations. Here the aim is to minimize the Boltzmann cost function, to (profiles) and use them in classification; and pursuing detection of
maximize “the similarity (in a probabilistic sense) between the re- changes in land cover or land use by processing time-series data. Neural
presentation and projection of the input” (Singhal et al., 2016). This networks – and specifically Recurrent Neural Networks – are gaining
optimization does not use an assumed output, so a different algorithm momentum for these applications but the number of published papers is
(contrastive divergence) is required to train the neurons. However, si- still low. These networks are promising with new modifications such as
milarly to the autoencoder, training is unsupervised and, more im- adding more powerful and deep memory cells (see for example Lyu
portant, it can be done in a greedy layer-wise approach for a stack of et al. (2016), Mou et al. (2017), Rußwurm and Körner (2017),
layers. This layer-wise approach was devised by the seminal work of Rußwurm and Körner (2018), Ndikumana et al. (2018), Niculescu et al.
Hinton et al. (2006) and later implemented by both SAEs and DBNs. (2018), or Sharma et al. (2018)). However, we did not consider se-
Therefore SAEs and DBNs are often discussed together in the literature quence data applications in our paper due to lack of enough data and
(e.g. Vincent et al. (2010)). When trained, the network can provide our focus was only on feed-forward networks and its three main var-
extracted features for the new data to be classified. Tutorials on RBM iants: SAE, DBN, and CNN.
and DBN are available through the internet, for example see https://
deeplearning4j.org/restrictedboltzmannmachine, which includes ex-
2.2. Comparative performance database creation
ecutable codes.
The third type, which is the most used structure in recent years, is the
Our overarching goal is to look at the analyzed DNN case studies
convolutional neural network (CNN). Inspired by the human visual
and compare them together and to a well-known non-deep classifier,
system and designed to process images, it has limited connection to only
Support Vector Machine (SVM). SVM will serve both as a representative
adjacent neurons in each layer, with the same connection weights for
for non-deep classifier to compare with deep networks, and as a base-
each neuron within each layer. It may include down-sampling in each
line to compare different DNN architectures. SVMs were selected as the
layer, which reduces the processing resolution but adds translation
benchmarking algorithm because: (i) they were found to be the best
194
non-deep performing classifiers in an extensive comparison of pub- some studies but not in the others). Note that in some cases the input
lished work (Khatami et al., 2016), and (ii) the majority of DNN papers channels are processed and dimensionality was reduced (mostly em-
found in this review chose to include SVM as the main benchmark, thus ploying PCA) and the result is applied to the network, but we do not
validating our decision. We also examine accuracy trends across data mention this dimensionality reduction in the table, although we take it
and method characteristics. Direct comparisons of published works are into consideration when calculating the number of parameters and
not feasible due to variances in data types, sampling design, algorithmic consider the network in its actual tested configuration. There are two
details, and test metrics. Therefore, we concentrated on aggregating cases of using Landsat and one case of MODIS imagery that has been
results from manuscripts where accuracy metrics are reported mutually indicated in table separately due to importance of these data sources
under common conditions for deep and non-deep implementations. Although from one hand they are of less attention today because of
This database was then used to do comparative meta-analysis and other their inferior spatial resolution, but from the other hand they are of
quantitative statistical analyses. interest for their rich temporal dimension in time-series analysis. Data
The result was 103 research papers from 2014 until Nov. 2018 fusion from different sources is also experiencing growing attention,
covering 183 case studies that include deep learning-based classifica- especially adding height data through digital elevation models (DEM).
tion, 92 cases of which supported direct comparisons of accuracy to We discuss this further in the design options (Section 3.3) but an in-
SVM. The main characteristics of these case studies are summarized in depth analysis of this issue was outside the scope of this work.
Table A1, Appendix A, with each column of the appendix table defined Another important factor in network prediction performance is the
in Table 1 below. These parameters reflect the most important aspects data training size. More training data typically leads to better network
of the research design and we use them to present the discussion of our generalization, but in many cases the labeled training data is very
research questions in the subsequent sections. We treat each data set in limited. The corresponding column shows the rounded proportion of
a research paper as a separate case, because the output result and (labeled) training data samples to the entire reference data set, varying
possibly the network structure may vary per case in any single paper. from as low as 0.1% to 90%. We refer to it as “training proportion”
One of the most important parameters in network specification is hereafter, and consider the proportion in one single run of the network,
the number of network parameters which reflects network complexity. therefore a cross-validation scheme does not change the value in the
This is typically a surrogate of network depth and width. It is expected table from a similar hold-out.
that a bigger network would be more powerful, but the network ar- The reported classification accuracy (overall or average) value is the
chitecture and way of processing (reflected in other columns of the best performance reported for the reference dataset in each case. It is
table) greatly impacts this performance. Therefore, it is not unexpected reported as a number between 0 and 100 except for the metric Average
that a smaller but more elegant network outperforms a larger one in Normalized Modified Retrieval Rank (ANMRR). Although overall ac-
obtained accuracy. For example, in classifying the ISPRS Potsdam and curacy is an aggregate metric and cannot show important class-de-
Vaihingen datasets, Maggiori et al. (2016) achieved > 1% better ac- pendent performance values, but it is still the most widely used metric
curacy than Volpi and Tuia (2017) by a network having around 1/10th due to its simplicity and general applicability. Even though in some
of their network size. This number is mostly calculated from network cases more detailed evaluations are provided along with overall accu-
parameters given in the cited paper but in some cases it is given in the racy, due to different experimental designs and data structures in our
cited paper as well. In cases that given information was not sufficient or meta-analysis, these detailed metrics were not widely comparable and
ambiguity was not cleared by correspondence, the entry was left blank. therefore class-specific measures were not included.
This number includes parameters in as many network branches as im- In some cases, an additional pre- or post-processing step comple-
plemented, but it does not include parameters associated with addi- ments the deep network to enhance the performance, for example
tional stages of combination or fusion with other data or algorithms. It merging the resulting map with an auxiliary segmentation result,
also counts the network layers parameters up to the last layer before the adding a conditional random field (CRF) layer for edge enhancement,
final classifier, which is typically a Softmax layer but SVM is also used. or object-based processing. These methods differ largely in im-
In around 70% of our cases the deep network is followed by a Softmax plementation details and experiment setup so cannot be directly com-
classifier, therefore we drop the final classifier type from our list of pared to assess the processing gains; we provide more details on them in
parameters. Section 3.3.
The learning strategy column is another important network para- Although the chosen non-deep methods vary greatly in type and
meter. It does not point to the final classifier training as it is always options from paper to paper, there are still numerous cases where DNNs
supervised, but shows the methodology for determining network are compared to an SVM-based implementation, with Random Forest
parameters. The supervised learning is the most common approach in and KNN being the next classifier types used by much less frequency in
deep networks. It can also have different variations in the form of cost our observed cases. Therefore we chose those papers reporting on SVM
function or optimization procedure, or being enhanced by data-driven results as the candidates for doing our quantitative analysis (in the next
techniques such as active learning. Those advanced cases are desig- section). SVM is a good choice for benchmarking because it is a well-
nated as supervised+ in our database. The fine-tuning options show the established and proven classification tool with generally superior per-
cases when network parameters are fine-tuned after an initial un- formance (Mountrakis et al., 2011; Khatami et al., 2016; Heydari and
supervised learning or transferred from a pre-trained network in Mountrakis, 2018). Note that in remote sensing image or scene classi-
transfer learning. Transfer learning is available to CNN only. DBNs are fication tasks, we are generally interested in both feature generation
usually limited to unsupervised & fine-tuning type, while SAEs are used and classification. Neural networks can do both automatically – and
with both unsupervised learning techniques. Semi-supervised learning deep networks put more stress on the feature extraction task – but SVM
is also used in some cases, which is a strategy for using both labeled and classifiers should be fed with features already generated by another
unlabeled data in optimizing the network cost function. algorithm. The SVM implementation itself may vary between proces-
Spatial resolution in our collected research cases varies from 5 cm sing the raw pixels data or some secondary handcrafted spectral/spatial
for VHR imagery to 30 m for Landsat, left blank if not provided. The features derived from data. To ensure a more fair comparison we se-
number of channels shows the ones that have been actually used in the parated these two cases due to the potential important impact of
experiment (some channels have been set aside for their low quality in working with features instead of raw data. Clearly, there are many
195
Table 2
Basic statistics of collected case studies.
Year 2014 2015 2016 2017 2018
Number of cases 4 27 21 22 33
Network type CNN DBN SAE
Number of cases 150 9 25
Dataset spatial resolution < 30 cm 30 cm – 3 m >3 m
Number of cases 23 88 48
Spectral and auxiliary bands 1–3 4–10 11–99 > 100
Number of cases 59 48 1 70
variations and methods for handcrafting features and each paper may classification applications as indicated in the last column, and the “la-
include a different set of methods for comparison, so we could not go belled elements” column should be interpreted accordingly. Among
into their implementations detail and a detailed comparison. Further- them, the Brazilian Coffee, NWPU-RESISC45, RSSCN7, UC Merced, and
more, SVM optimization methods varied (e.g. hyperparameters and WHU-RS19 have been used for scene classification while the others
kernel choice), however we assumed (and it was also stressed in some concentrated on pixel classification/image segmentation. It is im-
papers) that after tuning the authors reported their best SVM perfor- portant to note a significant limitation. While still being extensively
mance. used even in papers from 2018, some of the commonly used datasets are
We should mention here that although our meta-analysis covers old and outdated: the major issue is their small size compared to da-
many different cases, each case has almost a unique setting of the above tasets with millions of elements typically used in computer vision and
parameters and therefore our analysis is naturally limited in depth and other artificial intelligence studies. This issue has been partly addressed
statistical richness. Our objective was to study general trends and for by some very high resolution datasets such as ISPRS Vaihingen and
the first time in the literature offer a quantitative meta-analysis of DNNs Potdam datasets, which became a standard test bench for newly arrived
in remote sensing applications. Our quantitative analysis did not go into (mostly CNN-based) networks. Furthermore, hyperspectral cases are
a detailed analysis of the effect of every design option due to lack of limited to a single scene and some datasets cover a very small geo-
data. graphic area, which limits the generalization ability of the obtained
results. Again, there is a new dataset presented through IEEE GRSS
3. Results and discussion contest in 2018 which consists of a relatively large area of 1.4 km2
covered by both very high resolution (5 cm) RGB and high resolution
3.1. Descriptive statistics (1 m) multispectral data. However, none of our reviewed articles was
based on this new dataset (Le Saux et al., 2018).
Table 2 provides information on case studies distribution by year, There is a still a need to create more large and rich datasets for
network type, spatial resolution and input dimensionality. Note that remote sensing applications in different spatial and spectral resolutions.
some manuscripts may contain more than one study, and spatial or Preparing datasets for tackling temporal applications is another im-
input dimensionality information was not always available. portant issue, which is even more restricted than other applications.
There is an increase in research papers on deep networks for remote However, the decision to pick specific labels and the procedure for
sensing classification applications after 2014, continuing to date. CNN creating ground truth maps is very application-specific. Provision of
is the most commonly used network type, then SAE followed by DBN. auxiliary data (commonly DSM based on LiDAR) is also an important
Most of the datasets are either hyperspectral (> 100 spectral channels) enhancement that is available in few datasets and should be en-
or less than 10 channels. Just one case study had spectral channels couraged.
between 10 and 100. Hyperspectral dataset are of high spatial resolu-
tion (around 1 m) so sit in the middle group of spatial resolution ca- 3.3. Network design options
tegory. Very high resolution ones (< 30 cm) are mostly available in
RGB with possibly adding Near-Infrared band and/or DSM data to it, In terms of network optimization for deep networks the simplest
with just one very recent case incorporating a drone-based six band way is to change the network depth (number of layers) and width
experiment at spatial resolution of 4.7 cm. More information on data- (neurons per layer). Additional modifications include changes in the
sets will be given in the next section. activation function, the type of classifier or the training strategy (su-
pervised/unsupervised). Looking beyond these fairly common adjust-
3.2. Datasets in the selected case studies ments, we present in Table 3 a descriptive summary of the most im-
portant design innovations we encountered. The table is organized to
A wide variety of approximately 60 different datasets were used titles summarizing the main design point, followed by specific design
throughout the selected case studies. They included frequently used ideas in each section. The number of papers using each option is pro-
datasets along with datasets selected from public sources such as vided to suggest popularity. Some design options are not exclusive to a
Google Earth, QuickBird, WorldView, Landsat archives and proprietary specific network type (e.g. network mixing options), while others are
data sources. Cases that have been used more than twice in our review only applicable to specific network types (e.g. fully convolutional net-
are listed in Table 3. The table includes both scene and pixel work). The classification task type may also require special provisions.
196
Table 4
45 class, 700 samples per class/scene

3 class, but highly imbalanced/scene

# of classes/classification task Network options and design innovations in collected papers.
Option Frequency
Making dense (full resolution) output options (for CNN):

Fully Convolutional Network (convolution and deconvolution) 13
No down-sampling 1
Multiscale capability options:
15 class/pixel
16 class/pixel
13 class/pixel
16 class/pixel
6 class/pixel
6 class/pixel
9 class/pixel
9 class/pixel
Getting multiscale input 10
Using multiscale kernels (filters) 7
Skip links (forwarding features/scores from one layer to another 8
non-adjacent layer)
Network mixing options (fusion/aggregation method varies by case):
Parallel handcrafted features 3
Parallel 1-D and 2-D convolutional networks 3
# of spectral and aux. channels
Parallel networks on different band combinations or sensors 9

Cascaded networks 3
Parallel (Ensemble) of different deep networks 10
Training options:
Salient patches selection to train/test network 2
RGB + NI + DSM
RG + NI + DSM
Active learning or iterative feature selection (removing inferior 4

144 + LiDAR
features)
Data augmentation or adding virtual samples to the input data 15
RG + NI
Other specialized methods 9

RGB
RGB
RGB
RGB
220
224
103
103
224
Multimodal processing:
3-D processing modules 7
Spatial averaging/filtering over a neighborhood for 2
spectral + spatial input generation
Spatial res.
PCA dimensionality reduction and spectral + spatial input 9

0.2–30 m
generation
2.5 m
1.3 m
1.3 m
3.7 m
0.5 m
20 m
18 m
5 cm
9 cm
Sequence data/temporal processing 4

1 ft
Other specialized methods 2

Other features:
Feeding network with handcrafted features (not raw data) 4
Optimizing input band selection with genetic algorithms 1
24 full images (of 38)
16 full images (of 33)
MRF/CRF processing or boundary detection 7

Labelled elements
Denoising SAE implementation 3

50,000 scenes
31,500 scenes
Initial and/or final data/feature filtering or segmentation to enhance 12

2800 scenes
2100 scenes
950 scenes
object discrimination
148,152
15,029
10,249
42,776
Sparse or other type of coding to create codebook after feature 4

5211
5348
generation and classify the code

Emerging network modules (e.g. residual module, inception module, 10
LSTM)
circa 2500 × 2000
Image size (pixels)
For example, in image segmentation the objects’ boundary alignment is

6000 × 6000
1905 × 349
of primary concern, while in scene classification this is not important.

145 × 145
512 × 614
256 × 256
512 × 614
610 × 340
400 × 400
512 × 217
256 × 256
600 × 600
64 × 64
This makes edge enhancement techniques more relevant to the former

application than the latter. As each design idea is presented and tested
in a unique setting with single or multiple choices of listed options on
different datasets and compared with different non-deep rivals, com-
parison between different design ideas and quantitative analysis of their
Dataset size
merit is not possible. However, we discuss general findings on design

50,000
31,500
2800
2100
options below and our intent is that this table will act as a preliminary
950
38
33
1
1
1
1
catalog and guide future research, either through gap analysis or

through frequently-implemented method identification.
USGS satellite imagery
Google Earth images
Google Earth images
Google Earth images
3.3.1. Dense (fully connected) networks

Sensor platform
Specifications for most frequently used datasets.
This CNN-type network is the de-facto network of choice for very

Aerial photo
Aerial photo
ITRES-CASI
high resolution classification and almost all of the image segmentation

AVIRIS
AVIRIS
AVIRIS
ROSIS
ROSIS
works – particularly experiments with ISPRS Potsdam and Vaihingen

SPOT
datasets. The competition in this field is extensive, and some of the most
popular networks have been implemented in this category to win the
ISPRS competition. It is always possible to run the entire network and
KSC (Kennedy Space Center)
classify the image pixel by pixel, but it means a huge redundancy in

calculations and therefore a direct map-to-map conversion (which ty-
Houston (2013 GRSS)
pically contains chain of downsampling and then upsampling) is pre-

NWPU-RESISC45
ferred. Upsampling design is a hot topic and each paper tries to find a
ISPRS Vaihingen
Pavia University
Brazilian coffee
ISPRS Potsdam
Dataset name
better way to do it. Edge enhancement and additional segmentation

Pavia Center
Indian pines
WHU-RS19
UC Merced
techniques have also been examined by different approaches to en-

RSSCN7
Salinas
Table 3
hance the result (we will refer to it in another paragraph in this sec-
tion).
197
3.3.2. Multiscale capability options

This issue is of a particular interest in CNN networks due to the
limited connectivity of their neurons to the previous layer, but other
network types may also use it when they use a sliding window me-
chanism in their input layer. In custom CNN networks the multiscale
filters with or without skip links (forwarding features/scores from one
layer to another non-adjacent layer) is a promising choice, but this
option is not typically available for pre-trained networks.
3.3.3. Network mixing options

There is a variety of practices for this option as listed by categories
in Table 4. The most frequent option is ensemble of different networks
or using a parallel network on different bands (especially when the
additional input in form of DSM or LiDAR is provided). Parallel 1-D
(spectral) and 2-D (spatial) network is also found in some cases, but
other forms of spectral/spatial input combination are more frequent
(we discuss it in a later paragraph). As the use of pretrained networks
becomes more common, parallel networks are the natural way of
overcoming the imported network input limitation to RGB channels.
3.3.4. Training options

Engineering the input data is the most frequent form of enhancing
training operations in deep networks, which is implemented in a variety Fig. 1. Comparative performance distribution of DNN vs SVM.
of methods. The simplest case is to crop, rotate and flip the input pat-
ches (basic data augmentation) or adding virtual samples to the input
data (particularly used for making input set more balanced). Recently, – Initial segmentation and creation of superpixels to feed deep net-
active learning and interactive sample selection approaches are gaining work.
more attention. There are other specially designed algorithms used to – Merging of deep network predicted map and segmented or CRF/
enhance data quality, such as salient patch selection; or specialized MRF generated map based on network prediction confidence.
methods for calculating network parameters, such as calculation of – Other pre/post processing methods (e.g. GLCM/gabor filters).
neurons weights by clustering or PCA decomposition instead of
training. There is no dominant method among the aforementioned techni-
ques and new methods are continuously emerging.
3.3.5. Multimodal processing
Deep networks in remote sensing classification started with pro- 3.4. DNN vs. SVM classification accuracy comparisons
cessing spectral components but quickly evolved to process other di-
mensions of data as well. Data processing in spatial context is now ty- This section focuses on classification accuracy comparisons between
pical, especially with CNN, and joint spectral-spatial processing in 3-D DNN and SVM methods. We focused on SVM comparisons since the
convolutional filters is popular. Before that, other techniques such as majority of the manuscripts we reviewed selected SVM as their
averaging over spatial dimension or PCA compression of spectral di- benchmark. The SVM choice over other methods (e.g. RF) is further
mension were common, but newer 3-D architectures has shown slightly supported by a previously conducted meta-analysis, where SVM was
better performance in our case studies. The newest trend in multimode found to outperform other (non-deep) methods (Khatami et al., 2016).
processing is to incorporate sequence/temporal processing, for example A detailed table summarizing each study is available in Appendix A. For
by treating spectral component of hyperspectral imagery as a (corre- a case study to be considered both methods were tested on the same
lated) sequence, or working on time-series of spectral-spatial data dataset and results were reported in the form of average or overall
cubes. accuracy.
Deep networks are usually designed to employ high volumes of
3.3.6. Other features available spectral and spatial data. However, in many of the selected
In addition to the above, we identified numerous special algorithms cases of pixel classification, the authors compare DNNs to simple
and techniques throughout our survey that are organized in this section. spectral processing by SVM or other non-deep rivals, thus providing an
In earlier studies we saw some cases of performance improvement by unfair advantage to DNNs as they also incorporate spatial information.
feeding network with handcrafted features, but it seems to be an ob- Knowledge of feature generation details may not be a primary concern
solete idea now. Object-based classification, image segmentation, and in deep networks as it is optimized automatically by the network, but
additional MRF/CRF processing have been attractive research areas finding the best method for feature generation to feed an SVM is not a
from the early days and still draw a lot of attention. Parallel to that, straightforward task that often requires trial and error for each dataset.
developing and applying newer and more complicated network mod- On the other hand, designing the best deep networks out of standard
ules (for example residual modules in CNN or LSTM in RNN) in RS basic schemes is not a trivial issue and we see new designs continuously
applications are increasing trends. In the reviewed cases, newly emer- arising. To further inform readers in all figures in this section, we
ging modules seem to have the upper hand at the expense of much marked the cases with enhanced feature generation for non-deep clas-
larger and more complicated networks. The other options found are: sifier (SVM) with a black circle to separate them from cases using ex-
clusively spectral information in their SVM implementation. Initial
– CRF postprocessing of deep network predictions to delineate and summary results are depicted in Fig. 1. Also as there were just two cases
enhance object edges. with SVM accuracy below 70%, we set our scale to start from that and
198
example, one author used the average values of a neighborhood around

each pixel (to be classified) for each band and added it to the central
pixel’s own data, then fed the SAE or DBN with this composite data
vector. It is also common in hyperspectral image classification to pro-
cess input image with dimensionality reduction techniques such as PCA
first, and then build the spatial information to be added to the original
central pixel for classification.
CNN also has the ability to preserve the spatial relationships while
processing through different layers, as spatial filtering takes place in
each layer without flattening data to a row vector. In SAE or DBN im-
plementations, the spatial information is flattened and concatenated to
the spectral data at the network input, and although the spatial in-
formation is implicitly included, the spatial relations between vector
values are lost (Y. Li et al., 2017; Yue et al., 2015; Basu et al., 2015).
However, the CNN spatial coverage is limited to a neighborhood of
fixed size at the input, increasing step by step while resolution is re-
duced accordingly in pooling layers. This issue is restrictive to scale-
dependent information, although it can be remedied to some extent by
a multiscale structure (for example see X. Chen et al., 2014, Zhao et al.,
2015, or Zhao and Du, 2016). Other networks are not inherently limited
by these rules, though the strength of spatial relationships is generally
reduced with the increasing distance from the central point according
Fig. 2. Comparative performance distribution across learning methods for to Tobler’s law in geography (Tobler, 1970). It is also important to
CNN. consider that the SAE and DBN methods are trained in an unsupervised
fashion while the CNN method follows a supervised approach. There-
fore, the CNN implementation may be advantageous due to the in-
omitted those two in display to reduce the congestion on the upper corporation of labeling information (Y. Li et al., 2017; Shi and Pun,
accuracy values in the provided figures. 2018). SAE and DBN are also trained in a greedy layer-wise fashion that
In general, deep learning approaches offer consistently better results may limit potential learning opportunities; each layer’s parameters are
than SVM methods, even when only the cases with enhanced (non- fixed when tuning the next layer. Joint training of layers for SAE and
spectral) features are considered. The reported improvement (differ- DBN has been proposed in Zhou et al. (2014) and reported to perform
ence in accuracy value) can be as high as 30% for CNN, 16% for SAE better than typical greedy layer-wise approach, but it is not of common
and 3% for DBN. The DBN values may not be very representative due to use.
the scarcity of this network type application in remote sensing, but this To investigate further DNN accuracy gains, we examined their dis-
lower application rate itself can be a sign of its lack of merit and/or tribution across five contributing factors, namely the DNN learning
underlying complexity. CNN accuracy benefits are often attributed to method, the network complexity, spatial resolution, input dimension-
the integrated processing of spatial and spectral information, while for ality and training dataset proportion. Due to the low number of case
SAE or DBN benefits involve specific experimental design. As an studies and variation of design parameters and datasets employed in
different studies we do not report a multivariable regression model.
Instead, we limit our analysis to single factor distribution plots.
3.4.1. Distribution across learning methods

Figs. 2 and 3 present accuracy comparisons for different learning
methods for CNN and SAE, respectively. DBNs have a single learning
option therefore they are omitted from this analysis. Starting with Fig. 2
and CNN methods, the majority of cases have used supervised training
or its enhanced version shown as supervised+ (cases with different cost
function or optimization procedure, or being enhanced by data-driven
techniques such as active learning). CNNs using supervised learning is
mostly compared to spectral SVM and tend to offer higher relative gains
in more complex classifications, where the corresponding SVM accu-
racy is lower. This result is expected due to integration of spatial data in
CNN and lack of it in spectral SVM. As mentioned in C. Zhao et al.
(2017), W. Zhao et al. (2017), there are similarities between low-level
features in different classes that cannot be resolved solely by the
spectral components and integration of spatial data is required (an
example is the road and building roof pixels in an aerial image). As seen
in the upper right corner of the graph, in cases where the SVM was fed
with enhanced features, the performance is fairly close to the su-
pervised learning DNN cases. One benefit of deep networks is the
flexibility to build the features automatically and match them to the
specific dataset under study, contrary to handcrafted features that
Fig. 3. Comparative performance distribution across learning methods for SAE.
199
Fig. 4. Comparative performance distribution across network complexity. Fig. 5. Comparative performance distribution across spatial resolution.
should be selected among many variants for SVM or other non-deep

classifiers.
Transfer learning in CNN also offers some improvements visible in
Fig. 2 – and especially when combined with fine-tuning – over en-
hanced SVM. Base networks are typically taken from models developed
and trained in computer vision industry and will not be introduced here
here. The most widely used model in our reviewed cases as AlexNet (9
cases), followed by VGG-16 (8 cases), VGG-M (5 cases), and GoogLeNet
(4 cases). Other networks have also been applied with lower frequency
such as ResNet, other VGG-series networks, SegNet, Overfeat, and
CaffeNet. Their improved performance over supervised learning CNN
cases could be attributed to the fact that supervised CNN cases are
usually custom designed and small in size compared to CNN networks
used for transfer learning, therefore they may not be much more
powerful than a SVM fed with enhanced features. A large fully su-
pervised network may achieve considerable improvement over an en-
hanced SVM, but the computational budget might be prohibitive and
the risk of overfitting high. Transfer learning, however, uses a proven
network architecture that is pre-set using an extensive collection of
labelled image data, and reduces user involvement into network design
issues. Successful application of this technique suggests that the fea-
tures generated by those large image collections have a good general-
ization capability and can be matched to arbitrary datasets assuming a
Fig. 6. Comparative performance distribution across input data dimensionality.
fine-tuning step. Comparisons of different possibilities for feature ex-
traction, supervised training and transfer learning (with or without fine
tuning) for selected CNN architectures are described in detail in
Nogueira et al. (2017) and tested for three well-known scene classifi- spectrum and obtain good results. Such a combination can also be run
cation datasets. They suggest to use transfer learning and fine-tuning on limited number of input samples as the large network is pretrained.
instead of fully supervised training because the pre-trained networks For the other options of semisupervised and unsupervised learning we
start from a better initialization state in the search space. A significant can see limited improvement but there are not enough samples for
limitation though is that pre-trained networks are not currently ap- conclusive results.
plicable in multispectral/hyperspectral classification tasks because ex- Looking at Fig. 3 and the SAE learning methods, fine-tuning of
isting pre-trained networks come from computer vision – trained on unsupervised methods tends to offer some gains over enhanced SVM,
ImageNet – using RGB images. However, as studied in Huang et al. while there is no gain without fine-tuning. An explanation could be that
(2018), we can mix a big pre-trained network fed by the RGB portion of unsupervised learning receives its strength from using much more data
spectrum with a smaller deep network capable of mining the entire (labeled or unlabeled), so the features may be more representative of
200
3.4.2. Distribution across network complexity

To examine this, we discretized the number of parameters to six bins
from less than 10 K (class A) to greater than 100M (class F); the result is
shown in Fig. 4. Extremely low end (class A) cases are rare and do not
seem to offer considerable improvements. Class B has the highest fre-
quency (23 cases), then class C, D, E, and F with 16, 11, 10, and 9 cases
correspondingly. It can be seen that class B, which is a still relatively
small network, has been compared to spectral SVM only and is mostly
present in the upper right part of the graph, where the performance of
spectral SVM is already high. These cases are those mostly associated
with supervised learning method mentioned before. But larger net-
works (especially classes D and F) show considerable improvements
over enhanced SVM. Based on this, we may advise to use larger net-
works (with fine-tuning) as mentioned before. However, this graph also
demonstrates that all network complexity classes have the potential to
achieve accuracy of 95% or more, which may be sufficient for many
cases, especially considering other data limitations (e.g. registration
errors). Note that class F cases are all ImageNet pre-trained networks,
which are naturally the largest networks in our study cases.
3.4.3. Distribution across spatial resolution

The corresponding graph is shown in Fig. 5. It is difficult to discern a
Fig. 7. Comparative performance distribution across training proportion.
specific pattern with respect to the spatial resolution, therefore no
conclusive remarks could be made.
the data. However, matching them to classes requires an extra step of 3.4.4. Distribution across input data dimensionality
supervised learning. Therefore, unsupervised learning alone is com- Fig. 6 organizes the results in three general categories, mostly se-
parable to enhanced SVM, and fine-tuning further improves results. parating RGB (group A) and hyperspectral images (group C), with
Semi-supervised learning was used in two cases with better results, but group B being cases employing additional multispectral components
its application detail is case-dependent. There are different methods such as NI and/or auxiliary data such as DSM/LiDAR.
and also underlying assumptions about actual class distribution for Although it seems that multispectral group (B) generally achieves a
doing semi-supervised learning (for example see Zhu and Goldberg bit less improvement compared to other two groups, there is no strong
(2009) and Camps-Valls et al. (2014)). Each method and assumption is evidence and supporting theory for that.
embedding a specific additional regularization term for unlabeled data
in the optimization cost function but there is no standardized way of 3.4.5. Distribution across training size/proportion
doing that. This lack of standardization may be a cause for its limited In the examined manuscripts the sampling is either a single pixel
use. (for pixel classification or image segmentation applications) or an
image patch (for scene classification or target detection applications).
Labeled data size is mostly in the order of a few thousands, with ad-
ditional cases with considerably more labelled data. Sampling is done
within the labelled dataset, with the proportion varying substantially in
different implementations from as low as 0.1% to as high as 90%. We
consider two different ways of training data size affecting network si-
mulations. The first issue is the training data size, which should be
considered in accordance to the network size and number of para-
meters. A large network with few training data may experience over-
fitting and lack of generalization, while a small network may not be
powerful enough to model a complex set of training data. The other
issue is the training data proportion, which imposes the same under-
fitting/overfitting scenario. We compared the absolute number of
training data units (pixels or scenes) to the number of network para-
meters in our database and found that in almost 90% of cases we have
less data units than networks parameters to be tuned. The overfitting
control mechanisms such as regularization are always included in the
network design and it will alleviate overfitting problem, but there is still
a substantial difference between the remote sensing and computer vi-
sion fields, as we have very large reference datasets in the latter.
Looking at Table 4, the only case with millions of samples in remote
sensing are ISPRS datasets, but the winners are all CNNs and there is no
comparison reported with SVMs (competition is just between different
CNN architectures) so we couldn’t include them in our SVM-based
charts.
Fig. 8. Comparative performance distribution across training data size.
201
network and enlarging scale of pixel influence, there is always some

trace of even far pixels on training phase. Therefore, strictly enforcing
independency rule to the neighboring pixels may invalidate all of the
CNN networks, which is no desire for anybody.
3.4.6. Review of widely-used data sets

In previous sections the objective was to reveal patterns (or lack of
any pattern) in different networks comparative performance along
important parameters. The main limitation of this analysis is that the
comparisons could not be done by varying just one parameter and
fixing the others while we could not have such a control in our data
collection (hence we used the term ‘distribution’ instead of ‘effect’ in
our section titles). In this section we go one step further and look at
different cases as applied on the same dataset to extract more in-
formation on the competency of different network types. There are
some datasets that are heavily used in various papers and therefore can
serve as a benchmark for algorithmic comparison. Except for ISPRS
Potsdam and ISPRS Vaihingen (where comparison to SVM was not
available), Fig. 9 shows the result graphically.
Here are some observations:
– Indian Pines (a hyperspectral dataset): The highest accuracy here is

Fig. 9. Comparative performance distribution across widely used datasets. obtained by CNN at 99.8% overall accuracy, but SAE and CNN have
generally the same level of performance. Their improvement over
spectral SVM can be as large as 16% for both network types.
However, this high gain is reduced to just about 2–4% in compar-
In order to examine how DNN gains are influenced by training data ison to enhanced SVM cases.
absolute size and relative proportion with respect to the testing data – Kennedy Space Center (a hyperspectral dataset): Recently CNN
two figures were produced. Fig. 7 shows the comparative performance achieved an accuracy of 100% on this dataset (Haut et al., 2018) but
categorized in training proportions from A (less than 20%) to E (greater with a high training proportion of 85% and 5.6% gain improvement
than 80%), and Fig. 8 groups cases by absolute training dataset size over spectral SVM. Other comparisons were made with CNN and
from A (below 1000) to E (over one hundred thousand). The observed SAE but with enhanced SVM. The best accuracy of SAE was 98.8%,
variability in the graphs and the lack of a consistent pattern suggest that which was almost the same as a very sophisticated SVM im-
high training size or proportion are not a general requirement for deep plementation.
learning algorithms because there are various cases of high (> 95%) – Pavia Center (a hyperspectral dataset): CNN implementations show
overall accuracy from very low to very high sampling ratio or size. A a little improvement up to 3.3% with training proportion of10%.
closer examination took place to further investigate training size and The peak achieved accuracy was 99.95% but with a training pro-
proportion with respect to network and learning method type. In cases portion of 80% with very minor gains over SVM, and in all cases it
of DBN, the training proportion was always high (> 50%) but there was was compared to the spectral SVM. We have only one SAE im-
no explanation or justification for it in the reviewed articles. In general, plementation for this dataset in our list, which does not improve
the CNN methods with supervised learning have been used in all over the enhanced SVM.
training proportions, while CNNs with transfer learning with fine- – Pavia University (a hyperspectral dataset): Here CNN here works
tuning were run with higher training proportion. This may be attributed better than SAE with a maximum overall accuracy of 99.7% and
to overfitting concerns in transfer learning cases, as the base network is improvements up to 16.7% over spectral SVM, while for SAE it is at
usually large with millions of parameters. Therefore, it is an open most 7.6% (both with training proportion of 10%).We have about
question how the transfer learning works in remote sensing cases where 2% improvement over a very sophisticated SVM implementation for
low training ratios are predominant. this dataset, but for SAE the gain over enhanced SVM is minor.
There is also another concern that has not been discussed in the – Salinas (a hyperspectral dataset): This dataset was only applied to
reviewed papers. About 64% of the cases in our entire database (and CNN and the best achieved accuracy was 99.9% with a training
about 73% of cases used in the figures) belong to pixel classification proportion of 50%. This was 8% gain in accuracy compared to
category with the rest focusing on scene classification. In scene classi- spectral SVM, and other results showed some other gains. But there
fication cases we have completely separate train and test scenes, was no case of comparison with enhanced SVM.
therefore adding spatial data in training phase (which naturally hap- – UC Merced (an RGB dataset used for scene classification): Here CNN
pens in any CNN network) will not affect the testing performance. In works well with maximum overall accuracy of 99.5% and im-
pixel classification application the train and test pixels are chosen in- provement up to 21% over SVM, while SAE was tested once with
dependently, but if the spatial processing is part of algorithm (that is improvement of just 1%. In all cases, training proportion was high
typical), the training and testing pixels’ neighborhoods may overlap and (60–80%) and it was compared to enhanced SVM, but SVM was fed
this may violate the basic assumption of independent training/testing with very different features in different cases.
samples. The real impact of this issue is not discussed in any of re-
viewed literature and it seems that the authors didn’t consider it cri- For the ISPRS Potsdam and Vaihingen datasets, the CNNs has been
tical. It can be also argued that with multiple pooling layers in a CNN the winner over all of the recent contests, so the race is only between
202
them and there is no research that compare them to a SVM based developing methods for it (for example see Zeiler and Fergus (2013) or
classification. Therefore, we could not include them in Fig. 7. In both Yosinski et al. (2015)), but currently research is lacking in remote
ISPRS datasets the best results are achieved by transfer learning & fine- sensing tasks.
tuning in recent years. The best case was based on ResNet-101 with We compare different studies and reflect on their findings in a
overall accuracy of 91.1% for both cases, followed by VGG-16 and collective manner. The possible reasons for deep network strengths in
SegNet-based cases with overall accuracy of 90.3%. Training proportion each individual aspect (network type, learning strategy, sampling pro-
is standardized at 30% for Vaihingen and 45% for Potsdam (except portion, etc.) was discussed in previous sections without going into
ResNet-101 case, where the training proportion was 47% and 63%, mathematical formulas, due to the nature of meta-analysis. The ma-
respectively). These implementations are large networks, but a recent jority of manuscripts reported that the SVM (or other rivals) parameters
paper (Zhang et al., 2018b) has also achieved accuracies of 89.4% for have been tuned and optimized for best performance, but there is a lack
Potsdam and 88.4% for Vaihingen with a small supervised network of consistency in reporting and protocol (e.g. grid search density).
with number of parameters much less than above transferred networks Establishing best optimization practices would benefit our community
(but with increasing training proportion to 70–75%). In almost all cases by limiting inconsistencies that could lead to result bias.
additional enhancement techniques such as joint segmentation, CRF Another important conclusion is that algorithms are now outpacing
processing or multiscale blocks has been implemented to boost the benchmark datasets. We already see accuracy estimations exceeding
performance a bit higher. 99% for some well-known datasets such as Indian Pine, Pavia Center
The above datasets, while used extensively for classification as- and University, Salinas, and UC Merced. To allow deep learners to reach
sessment, should be avoided in the future. They are relatively small to their full potential, it is paramount that more elaborate benchmark
match the generalization capabilities of deep networks and in most datasets should become available with diverse spectral/spatial/tem-
cases there are already algorithms that reach 100% accuracy, therefore poral resolution and geographic coverage.
offering limited opportunities for improvement. It is necessary to de- We could not analyze further the processing time because either it
velop new, large and multi-sensory datasets for remote sensing image was not available in many cases, or it was not specified if it contains the
classification, especially for hyperspectral data, to help better in- entire time for optimizing meta-parameters or not. It is generally true
vestigate the potential of deep networks. that deep networks need considerably more processing time for training
(though the testing/simulation process is generally quick) but with
4. Concluding remarks continuous increases in processing power, deep networks are readily
usable particularly by incorporating both CPUs and GPUs together. It
While the number of case studies precluded detailed statistical would be interesting to evaluate the time saved by using pre-trained
analysis on the effect of each contributing factors generally we can see networks and just fine-tuning them, but currently there were no sta-
that: tistics reported to extract conclusive guidance.
There are numerous design options currently offered (see Table 3).
– Deep networks have generally better performance than spectral Multiscale input is particularly useful to capture geographic relation-
SVM implementations, with CNNs performing better than other ships in earth observations. Furthermore, fully convolutional networks
deep learners. This advantage, however, diminishes when using are promising for dense semantic labeling (classification of all image
SVM over more rich features extracted from data. pixels at once and producing the same output dense map as the input
– Transfer learning and fine-tuning on pre-trained CNNs offer pro- image size). Other researches have added various segmentation tech-
mising results even when compared to enhanced SVM im- niques, boundary detection and correction methods and CRF/MRF post-
plementations, and they provide for flexibility and scalability be- processing and showed their benefit to enhance classification of edge
cause there is no need to manually engineer the features or use a pixels. While existing comparisons suggest the potential of CNN, they
very large training dataset. However, these pre-trained networks are do not concretely identify a winning design among different options.
currently limited to RGB input data, therefore currently lack ap- For example, at the ISPRS Vaihingen image segmentation contest three
plicability in multi/hyperspectral data. They have also not been CNN methods were within 1.2% of overall accuracy (Sherrah, 2016;
tested in low training proportion scenarios. Audebert et al., 2016; and Marmanis et al., 2016b). Looking into the
– There is no strong relationship between network complexity and future, remote sensing experts will favor 3-D CNN structures from pre-
accuracy gains over SVM; small to medium networks perform si- processing, dimensionality reduction methods like PCA or shallow 1-D
milarly to more complex networks. and 2-D networks. The current state of the art 3-D CNN structures has
– Contrary to popular belief, there are numerous cases of good deep already offered significant improvements and the training process is
network performance with training proportions of 10% or lower. becoming easier (see Chen et al., 2016; Y. Li et al., 2017). Furthermore,
our community would significantly benefit from a coordinated invest-
As previously noted, deep networks are important due to their ment from large funding institutions to create a pre-trained DNN for
ability to extract useful rich features automatically from large data sets remote sensing data (similar to the ImageNet for RBG images). This pre-
without the need for manual feature extraction. For example, automatic trained network would harness the power of large data volumes while
feature extraction has been used in Rußwurm and Körner (2018) to allowing fine-tuning to specific applications.
automatically detect cloud occlusion in temporal remote sensing data.
This automation of feature extraction also has limitations, most notably Acknowledgments
the difficulty to extract and evaluate these features. The visualizations
in deep networks rarely go further than the first two layers, which focus This work was supported by the USDA McIntire Stennis program, a
on very basic features like edges and gradients. There have been limited SUNY ESF Graduate Assistantship and NASA's Land Cover Land Use
trials to describe and visualize the extracted features and even Change Program (grant # NNX15AD42G).
Appendix A
203
Table A1
Database of collected deep network application in remote sensing.
Other network parameters Dataset specification Best reported performances
Reference Network Type # of parameters Learning type Dataset Spatial resolution # of channels Training Metric type Deep network SVM (Non
proportion deep)
(Penatti et al., 2015) CNN 289 M Transfer learning Brazilian coffee 3 0.8 Average 83 87
S.S. Heydari and G. Mountrakis
accuracy
(Yu et al., 2017) CNN 24.6 M Unsupervised Brazilian coffee 3 0.8 Overall accuracy 87.8 87
(Castelluccio et al., 2015) CNN 5M Transfer learning Brazilian coffee 3 Overall accuracy 91.8
(Nogueira et al., 2017) CNN 60 M Transfer learning & fine- Brazilian coffee 3 0.6 Overall accuracy 94.5 87
tuning
(Wu and Prasad, 2018) CNN + RNN Semisupervised Houston 2.5 m 144 Overall accuracy 82.6 80.2
(Xu et al., 2018) CNN Supervised+ Houston 2.5 m 144 + 1 0.19 Overall accuracy 88 80.5
(Pan et al., 2018) CNN Houston 2.5 m 144 Overall accuracy 90.8
(Li et al., 2014) DBN 14.7 K Unsupervised & fine-tuning Houston 2.5 m 144 Overall accuracy 97.7 97.5
(Zabalza et al., 2016) SAE 4.2 K Unsupervised Indian Pines 20 m 200 0.05 Overall accuracy 80.7 82.1
(Ghamisi et al., 2016) CNN 188 K Supervised Indian Pines 20 m 200 0.05 Overall accuracy 83.3 78.2
(Shi and Pun, 2018) CNN 2.5 M Supervised Indian Pines 20 m 200 0.01 Overall accuracy 85.2
(Mou et al., 2018b) CNN 1.44 M Unsupervised & fine-tuning Indian Pines 20 m 200 0.05 Overall accuracy 85.8 72.8
(C. Zhao et al., 2017) SAE 30.2 K Unsupervised & fine-tuning Indian Pines 20 m 200 0.1 Overall accuracy 89.8 88.9
(W. Hu et al., 2015) CNN 80.6 K Supervised Indian Pines 20 m 220 0.2 Overall accuracy 90.2 87.6
(Pan et al., 2018) CNN Indian Pines 20 m 200 Overall accuracy 90.7
(Xing et al., 2016) SAE 241 K Unsupervised Indian Pines 20 m 200 0.5 Overall accuracy 92.1 90.6
(W. Li et al., 2017) CNN 57.9 K Supervised Indian Pines 20 m 220 0.2 Overall accuracy 94.3 88.2
(Chen et al., 2015) DBN Unsupervised & fine-tuning Indian Pines 20 m 200 0.5 Overall accuracy 96 95.5
(Li et al., 2015) SAE 21.7 M Unsupervised & fine-tuning Indian Pines 20 m 200 0.05 Overall accuracy 96.3 92.4
(Sun et al., 2017) SAE 107 K Semisupervised Indian Pines 20 m 200 0.1 Overall accuracy 96.4 80.6
204
(Ding et al., 2017) CNN 380 K Unsupervised Indian Pines 20 m 200 0.5 Overall accuracy 97.8
(Ma et al., 2015) SAE 14.2 K Unsupervised & fine-tuning Indian Pines 20 m 200 0.1 Overall accuracy 98.2
(Paoletti et al., 2018) CNN 96 M Supervised Indian Pines 20 m 200 0.24 Overall accuracy 98.4
(Chen et al., 2016) CNN 44.9 M Supervised Indian Pines 20 m 200 0.2 Overall accuracy 98.5 96.9
(H. Zhang et al., 2017) CNN Supervised Indian Pines 20 m 200 0.1 Overall accuracy 98.8
(Makantasis et al., 2015) CNN 97.6 K Supervised Indian Pines 20 m 224 0.8 Overall accuracy 98.9 82.7
(Y. Li et al., 2017) CNN 197 K Supervised Indian Pines 20 m 200 0.5 Overall accuracy 99.1
(Haut et al., 2018) CNN 8.9 M Supervised+ Indian Pines 20 m 200 0.5 Overall accuracy 99.8 81.3
(Sherrah, 2016) CNN 3.26 M Supervised ISPRS Potsdam 5 cm 5 0.45 Overall accuracy 84.1
(Volpi and Tuia, 2017) CNN 6.38 M Supervised ISPRS Potsdam 5 cm 5 0.45 Overall accuracy 85.8
(Maggiori et al., 2016) CNN 530 K Supervised ISPRS Potsdam 5 cm 4 0.45 Overall accuracy 87
(Zhang et al., 2018a) CNN 17 K Supervised ISPRS Potsdam 5 cm 4 0.75 Overall accuracy 89.4 82.4
(Sherrah, 2016) CNN 22.7 M Transfer learning & fine- ISPRS Potsdam 5 cm 4 0.45 Overall accuracy 90.3
tuning
(Yongcheng Liu et al., 2018) CNN 481 M Transfer learning & fine- ISPRS Potsdam 5 cm 4 (DSMs not used) 0.63 Overall accuracy 91.1
tuning
(Tschannen et al., 2016) CNN 30 K Supervised ISPRS Vaihingen 9 cm 5 0.3 Overall accuracy 85.5
(Paisitkriangkrai et al., 2015) CNN Supervised ISPRS Vaihingen 9 cm 5 0.3 Overall accuracy 86.9
(W. Zhao et al., 2017) CNN Supervised ISPRS Vaihingen 9 cm 4 0.1 Overall accuracy 87.1 66.6
(Volpi and Tuia, 2017) CNN 6.38 M Supervised ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 87.3
(Marcos et al., 2018) CNN 100 K Supervised ISPRS Vaihingen 9 cm 4 0.45 Overall accuracy 87.6
(Zhang et al., 2018b) CNN 17 K Supervised ISPRS Vaihingen 9 cm 4 0.7 Overall accuracy 88.4 81.7
(Maggiori et al., 2016) CNN 727 K Supervised ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 88.9
(Sherrah, 2016) CNN 3.26 M Supervised ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 89.1
(Audebert et al., 2016) CNN 32 M Transfer learning & fine- ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 89.8
tuning
(Marmanis et al., 2016b) CNN 806 M Transfer learning & fine- ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 90.3
tuning
(continued on next page)
Table A1 (continued)
proportion deep)
(Yongcheng Liu et al., 2018) CNN 481 M Transfer learning & fine- ISPRS Vaihingen 9 cm 3 (DSMs not used) 0.47 Overall accuracy 91.1
tuning
(C. Zhao et al., 2017) SAE 20.8 K Unsupervised & fine-tuning Kennedy Space 18 m 224 0.1 Overall accuracy 93.5 91.1
Center
(Chen et al., 2016) CNN 5.85 M Supervised Kennedy Space 18 m 224 0.1 Overall accuracy 97.1 95.7
Center
(Y. Chen et al., 2014) SAE 8.72 K Unsupervised & fine-tuning Kennedy Space 18 m 176 0.6 Overall accuracy 98.8 98.7
Center
(Haut et al., 2018) CNN 8.8 M Supervised+ Kennedy Space 18 m 224 0.85 Overall accuracy 100 94.4
Center
(Ishii et al., 2015) CNN 60 M Supervised Landsat 8 30 m 3 0.35 F1 71 37.2
(Mou et al., 2018a) CNN + RNN Supervised Landsat ETM 30 m 6 Overall accuracy 98 95.7
(Karalas et al., 2015) SAE 155 K Unsupervised & fine-tuning MODIS 500 sq.m 7 Average 62.8
precision
(Zhou et al., 2017) CNN 126 M Transfer learning & fine- Other 0.5 m 3 ANMRR 0.04
tuning
(Kemker et al., 2018) CNN 11.9 M Supervised+ Other 4.7 cm 6 0.25 Average 57.3 29.6
accuracy
(Kemker et al., 2018) CNN 69 M Supervised+ Other 4.7 cm 6 0.25 Average 59.8 29.6
accuracy
(Bittner et al., 2017) CNN 134 M Transfer learning & fine- Other 0.5 m 1 F1 70
tuning
(Lagrange et al., 2015) CNN 141 M Transfer learning Other 5 cm 4 0.6 Overall accuracy 72.4 70.2
205
(Cao et al., 2016) CNN 60 M Transfer learning & fine- Other 3 F1 72.4
tuning
(Ji et al., 2018) CNN 102 K Supervised+ Other 15 m 4 0.85 Overall accuracy 79.4 78.5
(Fu et al., 2017) CNN Supervised Other 1m 3 0.9 F1 79.5 61.5
(Tang et al., 2017) CNN Transfer learning & fine- Other 3 Average 79.5
tuning precision
(Huang et al., 2018) CNN 39 M Transfer learning & fine- Other 0.5 m 4 0.57 Overall accuracy 80 71.8
tuning
(Chen et al., 2018) CNN Transfer learning & fine- Other 8, 16 m 3 Average 80
tuning precision
(Chen et al., 2013) DBN 4.2 M Unsupervised & fine-tuning Other 3 0.2 F1 81.7 78.4
(Marcos et al., 2018) CNN 430 K Supervised Other 5 cm 4 0.7 Overall accuracy 82.6
(Cheng et al., 2017b) CNN 14.7 M Transfer learning Other 30 m 0.2 Overall accuracy 84.3
(Yanfei Liu et al., 2018) CNN Supervised Other 4 m (MSI), 1 m (Pan) 3 0.8 Overall accuracy 85 84.7
(Yongcheng Liu et al., 2018) CNN 481 M Transfer learning & fine- Other 1m 3 0.93 F1 85.6
tuning
(Geng et al., 2015) SAE 28.4 K Unsupervised & fine-tuning Other 0.38 m 1 0.5 Overall accuracy 88.1 76.9
(Lguensat et al., 2017) CNN 177 K Supervised Other 1 0.18 Overall accuracy 88.6
(Han et al., 2018) CNN 286 M Semisupervised Other 30 m Overall accuracy 88.6
(Zhao et al., 2015) DBN 379 K Unsupervised & fine-tuning Other 0.6 m 1 0.7 Overall accuracy 88.9 85.6
(Zhang et al., 2018c) CNN 226 K Supervised Other 50 cm 4 0.6 Overall accuracy 89.5 79.5
(Zhang et al., 2018a) CNN 17 K Supervised Other 50 cm 4 0.5 Overall accuracy 89.6
(Zhang et al., 2018b) CNN 17 K Supervised Other 50 cm 4 0.7 Overall accuracy 89.8 81.2
(Vakalopoulou et al., 2015) CNN 60 M Transfer learning Other 0.6 m 4 0.4 Average 90
precision
(Qayyum et al., 2017) CNN 6.61 M Transfer learning Other 15 cm 3 0.8 Overall accuracy 90.3 83.1
(Cheng et al., 2017a) CNN 134 M Transfer learning & fine- Other 30 m 0.8 Overall accuracy 90.3
tuning
(F. Zhang et al., 2015) SAE 90.3 K Unsupervised Other 1m 3 0.25 Overall accuracy 90.8 90

proportion deep)
(Zhang et al., 2018a) CNN 17 K Supervised Other 50 cm 4 0.5 Overall accuracy 90.9
(Zhang et al., 2018c) CNN 226 K Supervised Other 50 cm 4 0.6 Overall accuracy 90.9 80.4
(Zhang et al., 2018b) CNN 17 K Supervised Other 50 cm 4 0.7 Overall accuracy 91 81.7
(Zhao and Du, 2016) CNN Supervised Other 1.8 m 8 0.15 Overall accuracy 91.1
(Huang et al., 2018) CNN 39 M Transfer learning & fine- Other 1.24 m 8 0.62 Overall accuracy 91.3 80
tuning
(Khan et al., 2017) CNN 151 M Transfer learning & fine- Other 25 m 3 0.9 Overall accuracy 91.3 76.5
tuning
(Han et al., 2018) CNN 286 M Semisupervised Other Overall accuracy 91.4
(X. Chen et al., 2014) CNN 395 K Supervised Other 1 F1 91.6 79.3
(L. Zhang et al., 2015) CNN 44 M Transfer learning Other 3 0.75 F1 91.8
(Khan et al., 2017) CNN 151 M Transfer learning & fine- Other 25 m 3 0.9 Overall accuracy 92 74.1
tuning
(F. Zhang et al., 2017) CNN 266 K Supervised Other 1.2 m 3 0.8 Overall accuracy 92.4
(Pan et al., 2018) CNN Other 1m 84 Overall accuracy 93.2
(S. Liu et al., 2018) CNN 28.4 M Transfer learning & fine- Other 3 0.67 Overall accuracy 93.4 78
tuning
(Basu et al., 2015) DBN 3.6 K Unsupervised & fine-tuning Other 4 0.8 Overall accuracy 93.9
(Längkvist et al., 2016) CNN 1.91 M Unsupervised & fine-tuning Other 0.5 m 6 0.7 Overall accuracy 94.5
(Ji et al., 2018) CNN 102 K Supervised+ Other 4m 4 0.17 Overall accuracy 94.7 93.5
(Rezaee et al., 2018) CNN 53.9 M Transfer learning & fine- Other 5m 5 (3 used for CNN) 0.46 Overall accuracy 94.8
tuning
(Cui et al., 2018) CNN 9.7 K Supervised Other 2 m (MSI), 0.5 m 8 (MSI) + Pan 0.8 Overall accuracy 94.8
206
(Pan)
(Yanfei Liu et al., 2018) CNN Supervised Other 2m 3 0.8 Overall accuracy 94.8 80.3
(Xing et al., 2016) SAE 52.8 K Unsupervised Other 30 m 224 0.5 Overall accuracy 95.5 96.9
(Gong et al., 2017) SAE 81 K Unsupervised & fine-tuning Other 2m 4 0.5 Overall accuracy 95.7 94.4
(Ma et al., 2016) CNN Supervised Other 4 0.8 Overall accuracy 96
(W. Zhao et al., 2017) CNN Supervised Other 0.5 m 8 0.1 Overall accuracy 96.3 66.5
(Han et al., 2018) CNN 286 M Semisupervised Other 3 Overall accuracy 96.8
(Hu et al., 2017) CNN Supervised Other 1m 161 Overall accuracy 97 93.6
(Yu et al., 2016) DBN 2.43 M Unsupervised & fine-tuning Other 0.27 m 3 F1 97
(Wu and Prasad, 2018) CNN + RNN Semisupervised Other 1m 360 Overall accuracy 97.3 95.2
(Wang et al., 2018) CNN 252 K Supervised Other 3 0.74 Overall accuracy 97.3 93.7
(Basu et al., 2015) DBN 3.6 K Unsupervised & fine-tuning Other 4 0.8 Overall accuracy 97.9
(Xu et al., 2018) CNN Supervised+ Other 1m 63 + 1 0.03 Overall accuracy 97.9 92.7
(Nogueira et al., 2017) CNN 5M Transfer learning & fine- Other 0.5 m 3 0.6 Overall accuracy 98 90
tuning
(Tao et al., 2018) CNN Supervised Other 0.5–4 m 4 0.008 Overall accuracy 98.4 89.2
(Ma et al., 2016) CNN Supervised Other 4 0.8 Overall accuracy 98.4
(Gong et al., 2018) CNN 139 M Transfer learning & fine- Other 2m 3 0.8 Overall accuracy 98.5 77.7
tuning
(Wang et al., 2015) CNN 438 K Supervised Other 3 0.6 Overall accuracy 98.7
(Fan Zhang et al., 2016) CNN Supervised Other 1m 3 0.2 Overall accuracy 98.8
(Weng et al., 2018) CNN 3.4 M Transfer learning Other 0.25 Overall accuracy 98.8 91.3
(Gong et al., 2018) CNN 139 M Transfer learning & fine- Other 3 0.8 Overall accuracy 98.8
tuning
(Ji et al., 2018) CNN 107 K Supervised+ Other 4m 4 0.03 Overall accuracy 98.9 96.5
(Maggiori et al., 2017) CNN 459 K Supervised Other 1m 3 0.9 Overall accuracy 99.5 94.9
(Y. Li et al., 2017) CNN 128 K Supervised Other 30 m 242 0.5 Overall accuracy 99.6
(Basaeed et al., 2016) CNN 56.4 K Supervised Other 30 m 10 0.75 Overall accuracy 99.7
(Weng et al., 2018) CNN 3.4 M Transfer learning Other 0.5 Overall accuracy 99.7

proportion deep)
(W. Zhao et al., 2017) CNN Supervised Pavia Center 1.3 m 103 0.1 Overall accuracy 96.3 92.98
(Shi and Pun, 2018) CNN 673 K Supervised Pavia Center 1.3 m 103 0.001 Overall accuracy 97
(Aptoula et al., 2016) CNN 1.31 M Supervised Pavia Center 1.3 m 103 0.05 Kappa 97.4
(Zabalza et al., 2016) SAE 2.4 K Unsupervised Pavia Center 1.3 m 103 0.05 Overall accuracy 97.4 97.4
(Ben Hamida et al., 2018) CNN 3681 Supervised Pavia Center 1.3 m 103 0.05 Overall accuracy 98.9
(Tao et al., 2015) SAE Unsupervised Pavia Center 1.3 m 103 0.05 Overall accuracy 99.6
(Zhao and Du, 2016) CNN Supervised Pavia Center 1.3 m 103 0.05 Overall accuracy 99.7 97.7
(Makantasis et al., 2015) CNN 10.9 K Supervised Pavia Center 1.3 m 103 0.8 Overall accuracy 99.9 99
(Ghamisi et al., 2016) CNN 188 K Supervised Pavia University 1.3 m 103 0.1 Overall accuracy 83.4 78.2
(Mou et al., 2018b) CNN 1.39 M Unsupervised & fine-tuning Pavia University 1.3 m 103 0.1 Overall accuracy 87.4 79.9
(Wu and Prasad, 2018) CNN + RNN Semisupervised Pavia University 1.3 m 103 Overall accuracy 88.4 81.2
(Ding et al., 2017) CNN 226 K Unsupervised Pavia University 1.3 m 100 0.5 Overall accuracy 90.6
(W. Hu et al., 2015) CNN 80.6 K Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 92.6 90.5
(Yue et al., 2015) CNN 182 K Supervised Pavia University 1.3 m 103 Overall accuracy 95.2 85.2
(Xing et al., 2016) SAE 212 K Unsupervised Pavia University 1.3 m 103 0.5 Overall accuracy 96 93.6
(Zhao et al., 2015) CNN 239 K Unsupervised Pavia University 1.3 m 103 0.1 Overall accuracy 96.4 85.2
(W. Li et al., 2017) CNN 57.9 K Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 96.5 90.6
(Zhao and Du, 2016) CNN Supervised Pavia University 1.3 m 103 0.1 Overall accuracy 96.8 80.1
(Ben Hamida et al., 2018) CNN 6862 Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 97.2
(Paoletti et al., 2018) CNN 173 M Supervised Pavia University 1.3 m 103 0.04 Overall accuracy 97.8
(Aptoula et al., 2016) CNN 1.31 M Supervised Pavia University 1.3 m 103 0.1 Kappa 97.9
(Shi and Pun, 2018) CNN 673 K Supervised Pavia University 1.3 m 103 0.01 Overall accuracy 98.5
(Y. Chen et al., 2014) SAE 29 K Unsupervised & fine-tuning Pavia University 1.3 m 103 0.6 Overall accuracy 98.5 97.4
207
(Tao et al., 2015) SAE Unsupervised Pavia University 1.3 m 103 0.1 Overall accuracy 98.6
(Ma et al., 2015) SAE 10 K Unsupervised & fine-tuning Pavia University 1.3 m 103 0.1 Overall accuracy 98.7
(Sun et al., 2017) SAE 30.2 K Semisupervised Pavia University 1.3 m 103 0.1 Overall accuracy 98.7 91.1
(Xu et al., 2018) CNN Supervised+ Pavia University 1.3 m 103 0.04 Overall accuracy 99.1 89.9
(Chen et al., 2015) DBN Unsupervised & fine-tuning Pavia University 1.3 m 103 0.5 Overall accuracy 99.1 98.4
(Y. Li et al., 2017) CNN 110 K Supervised Pavia University 1.3 m 103 0.5 Overall accuracy 99.4
(Makantasis et al., 2015) CNN 10.9 K Supervised Pavia University 1.3 m 103 0.8 Overall accuracy 99.6 93.9
(H. Zhang et al., 2017) CNN Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 99.7
(Chen et al., 2016) CNN 5.85 M Supervised Pavia University 1.3 m 103 0.1 Overall accuracy 99.7 97.7
(Zhou et al., 2017) CNN 126 M Transfer learning & fine- RSSCN7 3 ANMRR 0.3
tuning
(Zou et al., 2015) DBN 3.1 M Unsupervised & fine-tuning RSSCN7 3 0.5 Average 77
accuracy
(Wu et al., 2016) SAE 2.53 M Unsupervised RSSCN7 3 0.5 Overall accuracy 90.4
(W. Hu et al., 2015) CNN 80.6 K Supervised Salinas 3.7 m 220 0.05 Overall accuracy 92.6 91.7
(W. Li et al., 2017) CNN 57.9 K Supervised Salinas 3.7 m 204 0.05 Overall accuracy 94.8 92.9
(Xu et al., 2018) CNN Supervised+ Salinas 3.7 m 204 0.06 Overall accuracy 97.7 92.2
(Ma et al., 2015) SAE 37.7 K Unsupervised & fine-tuning Salinas 3.7 m 204 0.01 Overall accuracy 98.3
(Makantasis et al., 2015) CNN 10.9 K Supervised Salinas 3.7 m 224 0.8 Overall accuracy 99.5 94
(Haut et al., 2018) CNN 8.9 M Supervised+ Salinas 3.7 m 204 0.5 Overall accuracy 99.9 91.1
(Zhou et al., 2017) CNN 126 M Transfer learning & fine- UC Merced 1 ft 3 ANMRR 0.33
tuning
(Zhou et al., 2015) SAE 51.6 K Unsupervised UC Merced 1 ft 3 Average 64.5
precision
(F. Zhang et al., 2015) SAE 301 K Unsupervised UC Merced 1 ft 3 0.8 Overall accuracy 82.7 81.7
(Romero et al., 2016) CNN 49.1 M Unsupervised UC Merced 1 ft 3 0.8 Overall accuracy 84.5
(Yu et al., 2017) CNN 24.6 M Unsupervised UC Merced 1 ft 3 0.8 Overall accuracy 88.57 81.7
(Marmanis et al., 2016a) CNN 155 M Transfer learning & fine- UC Merced 1 ft 3 0.7 Overall accuracy 92.4
tuning

References
SVM (Non
Aptoula, E., Ozdemir, M.C., Yanikoglu, B., 2016. Deep learning with attribute profiles for
deep)
hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 13, 1970–1974.
92.9
92.3
81.7
77.4
https://doi.org/10.1109/LGRS.2016.2619354.
81
90
Audebert, N., Le Saux, B., Lefèvre, S., 2016. Semantic segmentation of earth observation
Deep network
data using multimodal and multi-scale deep networks. In: Asian Conference on
Computer Vision. Springer, Cham, pp. 180–196.
Best reported performances
Basaeed, E., Bhaskar, H., Al-Mualla, M., 2016. Supervised remote sensing image seg-
96.48
mentation using boosted convolutional neural networks. Knowl.-Based Syst. 99,
92.7
93.4
93.5
94.5
94.5
95.6
96.9
97.1
97.1
98.3
99.4
99.5
98.6
19–27. https://doi.org/10.1016/j.knosys.2016.01.028.
Basu, S., Ganguly, S., Mukhopadhyay, S., DiBiano, R., Karki, M., Nemani, R., 2015.
accuracy
accuracy
accuracy
accuracy
accuracy
accuracy
accuracy
accuracy
accuracy
accuracy
accuracy
Overall accuracy
Overall accuracy
Deepsat: a learning framework for satellite imagery. In: Proceedings of the 23rd
SIGSPATIAL International Conference on Advances in Geographic Information
Metric type
Systems, pp. 37.
accuracy
Average
Ben Hamida, A., Benoit, A., Lambert, P., Ben Amar, C., 2018. 3-D deep learning approach
Overall
Overall
Overall
Overall
Overall
Overall
Overall
Overall
Overall
Overall
Overall
for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 56,
4420–4434. https://doi.org/10.1109/TGRS.2018.2818945.
Bengio, Y., 2009. Learning deep architectures for AI. Found. Trends Mach. Learn. 2,
1–127. https://doi.org/10.1561/2200000006.
Bittner, K., Cui, S., Reinartz, P., 2017. Building extraction from remote sensing data using
fully convolutional networks. ISPRS – Int. Arch. Photogramm. Remote Sens. Spat. Inf.
proportion
Sci. XLII-1/W1, 481–486. https://doi.org/10.5194/isprs-archives-XLII-1-W1-481-

Training
2017.
Camps-Valls, G., Tuia, D., Bruzzone, L., Benediktsson, J.A., 2014. Advances in hyper-
0.5
0.7
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.6
0.6
spectral image classification: earth monitoring with statistical learning methods. IEEE
Signal Process. Mag. 31, 45–54. https://doi.org/10.1109/MSP.2013.2279179.
Cao, Y., Niu, X., Dou, Y., 2016. Region-based convolutional neural networks for object
detection in very high resolution remote sensing images. IEEE 548–554. https://doi.
# of channels
org/10.1109/FSKD.2016.7603232.
Castelluccio, M., Poggi, G., Sansone, C., Verdoliva, L., 2015. Land use classification in
remote sensing images by convolutional. Neural Networks ArXiv150800092 Cs.
Chen, F., Ren, R., Van de Voorde, T., Xu, W., Zhou, G., Zhou, Y., 2018. Fast automatic
3
3
3
3
3
3
3
3
3
3
3
airport detection in remote sensing images using convolutional neural networks.

Remote Sens. 10, 443. https://doi.org/10.3390/rs10030443.
Chen, X., Xiang, S., Liu, C.-L., Pan, C.-H., 2014. Vehicle detection in satellite images by
Spatial resolution
hybrid deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 11,
1797–1801. https://doi.org/10.1109/LGRS.2014.2309695.
Chen, X., Xiang, S., Liu, C.-L., Pan, C.-H., 2013. Aircraft detection by deep belief nets.
IEEE 54–58. https://doi.org/10.1109/ACPR.2013.5.
Chen, Y., Jiang, H., Li, C., Jia, X., Ghamisi, P., 2016. Deep feature extraction and clas-
ft
ft
ft
ft
ft
ft
ft
ft
ft
ft
ft
1 ft
1 ft
1
1
1
1
1
1
1
1
1
1
1
sification of hyperspectral images based on convolutional neural networks. IEEE

Trans. Geosci. Remote Sens. 54, 6232–6251. https://doi.org/10.1109/TGRS.2016.
Dataset specification
2584107.
Chen, Y., Lin, Z., Zhao, X., Wang, G., Gu, Y., 2014. Deep learning-based classification of
hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 7, 2094–2107.
WHU-RS19
Merced
Merced
Merced
Merced
Merced
Merced
Merced
Merced
Merced
Merced
Merced
UC Merced
UC Merced
https://doi.org/10.1109/JSTARS.2014.2329330.
Dataset
Chen, Y., Zhao, X., Jia, X., 2015. Spectral-spatial classification of hyperspectral data based
on deep belief network. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 8,
UC
UC
UC
UC
UC
UC
UC
UC
UC
UC
UC
2381–2392. https://doi.org/10.1109/JSTARS.2015.2388577.
Cheng, G., Han, J., Lu, X., 2017a. Remote sensing image scene classification: benchmark
and state of the art. Proc. IEEE 105, 1865–1883. https://doi.org/10.1109/JPROC.
Transfer learning & fine-
2017.2675998.
Cheng, G., Li, Z., Yao, X., Guo, L., Wei, Z., 2017b. Remote sensing image scene classifi-
cation using bag of convolutional features. IEEE Geosci. Remote Sens. Lett. 14,
Transfer learning
Transfer learning
Transfer learning
Transfer learning
Transfer learning
Transfer learning
Semisupervised
Learning type
Cui, W., Zheng, Z., Zhou, Q., Huang, J., Yuan, Y., 2018. Application of a parallel spec-
Unsupervised
Supervised
Supervised
Supervised
tral–spatial convolution neural network in object-oriented remote sensing land use

classification. Remote Sens. Lett. 9, 334–342. https://doi.org/10.1080/2150704X.
Other network parameters
tuning
tuning
tuning
2017.1420265.
Deng, L., 2014. A tutorial survey of architectures, algorithms, and applications for deep
learning. APSIPA Trans. Signal Inf. Process. 3. https://doi.org/10.1017/atsip.2013.9.
# of parameters
Ding, C., Li, Y., Xia, Y., Wei, W., Zhang, L., Zhang, Y., 2017. Convolutional neural net-
works based hyperspectral image classification method with adaptive kernels.
Remote Sens. 9, 618. https://doi.org/10.3390/rs9060618.
Fu, G., Liu, C., Zhou, R., Sun, T., Zhang, Q., 2017. Classification for high resolution re-
2.53 M
19.6 M
44.1 M
286 M
126 M
117 M
139 M
204 M
920 K
60 M
mote sensing imagery using a fully convolutional network. Remote Sens. 9, 498.
5M
5M
https://doi.org/10.3390/rs9050498.
Geng, J., Fan, J., Wang, H., Ma, X., Li, B., Chen, F., 2015. High-resolution SAR image
Network Type
classification via deep convolutional autoencoders. IEEE Geosci. Remote Sens. Lett.
12, 2351–2355. https://doi.org/10.1109/LGRS.2015.2478256.
Ghamisi, P., Chen, Y., Zhu, X.X., 2016. A self-improving convolution neural network for
CNN
CNN
CNN
CNN
CNN
CNN
CNN
CNN
CNN
CNN
CNN
CNN
CNN
the classification of hyperspectral data. IEEE Geosci. Remote Sens. Lett. 13,
SAE
Ghamisi, P., Plaza, J., Chen, Y., Li, J., Plaza, A.J., 2017. Advanced spectral classifiers for
hyperspectral images: a review. IEEE Geosci. Remote Sens. Mag. 5, 8–32. https://doi.
(Castelluccio et al., 2015)
org/10.1109/MGRS.2016.2616418.
(Fan Zhang et al., 2016)
(Yanfei Liu et al., 2018)
(Nogueira et al., 2017)
Gong, M., Zhan, T., Zhang, P., Miao, Q., 2017. Superpixel-based difference representation
(Penatti et al., 2015)

(Weng et al., 2017)
(F. Hu et al., 2015)
(F. Hu et al., 2015)
learning for change detection in multispectral remote sensing images. IEEE Trans.
(Gong et al., 2018)
(Zhou et al., 2016)
(Luus et al., 2015)
(Han et al., 2018)

(Wu et al., 2016)
(Gu et al., 2018)
Geosci. Remote Sens. 55, 2658–2673. https://doi.org/10.1109/TGRS.2017.2650198.

Gong, X., Xie, Z., Liu, Y., Shi, X., Zheng, Z., 2018. Deep salient feature based anti-noise
transfer network for scene classification of remote sensing imagery. Remote Sens. 10,
Reference
410. https://doi.org/10.3390/rs10030410.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. The MIT Press, Cambridge,
Massachusetts.
208
Gu, X., Angelov, P.P., Zhang, C., Atkinson, P.M., 2018. A massively parallel deep rule- very high resolution images via a self-cascaded convolutional neural network. ISPRS
based ensemble classifier for remote sensing scenes. IEEE Geosci. Remote Sens. Lett. J. Photogramm. Remote Sens. 145, 78–95. https://doi.org/10.1016/j.isprsjprs.2017.
15, 345–349. https://doi.org/10.1109/LGRS.2017.2787421. 12.007.
Han, W., Feng, R., Wang, L., Cheng, Y., 2018. A semi-supervised generative framework Liu, Yanfei, Zhong, Y., Fei, F., Zhu, Q., Qin, Q., 2018. Scene classification based on a deep
with deep learning features for high-resolution remote sensing image scene classifi- random-scale stretched convolutional neural network. Remote Sens. 10, 444. https://
cation. ISPRS J. Photogramm. Remote Sens. 145, 23–43. https://doi.org/10.1016/j. doi.org/10.3390/rs10030444.
isprsjprs.2017.11.004. Luus, F.P.S., Salmon, B.P., van den Bergh, F., Maharaj, B.T.J., 2015. Multiview deep
Haut, J.M., Paoletti, M.E., Plaza, J., Li, J., Plaza, A., 2018. Active learning with con- learning for land-use classification. IEEE Geosci. Remote Sens. Lett. 12, 2448–2452.
volutional neural networks for hyperspectral image classification using a new baye- https://doi.org/10.1109/LGRS.2015.2483680.
sian approach. IEEE Trans. Geosci. Remote Sens. 56, 6440–6461. https://doi.org/10. Lyu, H., Lu, H., Mou, L., 2016. Learning a transferable change rule from a recurrent neural
1109/TGRS.2018.2838665. network for land cover change detection. Remote Sens. 8, 506. https://doi.org/10.
Heydari, S.S., Mountrakis, G., 2018. Effect of classifier selection, reference sample size, 3390/rs8060506.
reference class distribution and scene heterogeneity in per-pixel classification accu- Ma, L., Li, M., Ma, X., Cheng, L., Du, P., Liu, Y., 2017. A review of supervised object-based
racy using 26 Landsat sites. Remote Sens. Environ. 204, 648–658. https://doi.org/10. land-cover image classification. ISPRS J. Photogramm. Remote Sens. 130, 277–293.
1016/j.rse.2017.09.035. https://doi.org/10.1016/j.isprsjprs.2017.06.001.
Hinton, G.E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief Ma, X., Geng, J., Wang, H., 2015. Hyperspectral image classification via contextual deep
nets. Neural Comput. 18, 1527–1554. https://doi.org/10.1162/neco.2006.18.7. learning. EURASIP J. Image Video Process 2015. https://doi.org/10.1186/s13640-
1527. 015-0071-8.
Hu, F., Xia, G.-S., Hu, J., Zhang, L., 2015. Transferring deep convolutional neural net- Ma, Z., Wang, Z., Liu, C., Liu, X., 2016. Satellite Imagery Classification Based on Deep
works for the scene classification of high-resolution remote sensing imagery. Remote Convolution Network. World Acad. Sci. Eng. Technol. 10.
Sens. 7, 14680–14707. https://doi.org/10.3390/rs71114680. Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P., 2017. Convolutional neural networks
Hu, J., Mou, L., Schmitt, A., Zhu, X.X., 2017. FusioNet: a two-stream convolutional neural for large-scale remote-sensing image classification. IEEE Trans. Geosci. Remote Sens.
network for urban scene classification using PolSAR and hyperspectral data. IEEE 55, 645–657. https://doi.org/10.1109/TGRS.2016.2612821.
1–4. https://doi.org/10.1109/JURSE.2017.7924565. Maggiori, E., Tarabalka, Y., Charpiat, G., Alliez, P., 2016. High-Resolution Semantic
Hu, W., Huang, Y., Wei, L., Zhang, F., Li, H., 2015. Deep convolutional neural networks Labeling with Convolutional Neural Networks. ArXiv Prepr. ArXiv161101962.
for hyperspectral image classification. J. Sens. 2015, 1–12. https://doi.org/10.1155/ Makantasis, K., Karantzalos, K., Doulamis, A., Doulamis, N., 2015. Deep supervised
2015/258619. learning for hyperspectral data classification through convolutional neural networks.
Huang, B., Zhao, B., Song, Y., 2018. Urban land-use mapping using a deep convolutional In: Geoscience and Remote Sensing Symposium (IGARSS), 2015 IEEE International.
neural network with high spatial resolution multispectral remote sensing imagery. IEEE, pp. 4959–4962.
Remote Sens. Environ. 214, 73–86. https://doi.org/10.1016/j.rse.2018.04.050. Marcos, D., Volpi, M., Kellenberger, B., Tuia, D., 2018. Land cover mapping at very high
Ishii, T., Nakamura, R., Nakada, H., Mochizuki, Y., Ishikawa, H., 2015. Surface object resolution with rotation equivariant CNNs: towards small yet accurate models. ISPRS
recognition with CNN and SVM in Landsat 8 images. IEEE 341–344. https://doi.org/ J. Photogramm. Remote Sens. 145, 96–107. https://doi.org/10.1016/j.isprsjprs.
10.1109/MVA.2015.7153200. 2018.01.021.
Ji, S., Zhang, C., Xu, A., Shi, Y., Duan, Y., 2018. 3D convolutional neural networks for Marmanis, D., Datcu, M., Esch, T., Stilla, U., 2016a. Deep learning earth observation
crop classification with multi-temporal remote sensing images. Remote Sens. 10, 75. classification using imagenet pretrained networks. IEEE Geosci. Remote Sens. Lett.
https://doi.org/10.3390/rs10010075. 13, 105–109. https://doi.org/10.1109/LGRS.2015.2499239.
Karalas, K., Tsagkatakis, G., Zervakis, M., Tsakalides, P., 2015. Deep learning for multi- Marmanis, D., Schindler, K., Wegner, J.D., Galliani, S., Datcu, M., Stilla, U., 2016b.
label land cover classification. In: Bruzzone, L. (Ed.), p. 96430Q. https://doi.org/10. Classification with an edge: improving semantic image segmentation with boundary
1117/12.2195082. detection. ArXiv Prepr. ArXiv161201337.
Kemker, R., Salvaggio, C., Kanan, C., 2018. Algorithms for semantic segmentation of Mou, L., Bruzzone, L., Zhu, X.X., 2018a. Learning Spectral-Spatial-Temporal Features via
multispectral remote sensing imagery using deep learning. ISPRS J. Photogramm. a Recurrent Convolutional Neural Network for Change Detection in Multispectral
Remote Sens. 145, 60–77. https://doi.org/10.1016/j.isprsjprs.2018.04.014. Imagery. ArXiv180302642 Cs.
Khan, S.H., He, X., Porikli, F., Bennamoun, M., 2017. Forest change detection in in- Mou, L., Ghamisi, P., Zhu, X.X., 2018b. Unsupervised spectral-spatial feature learning via
complete satellite images with deep neural networks. IEEE Trans. Geosci. Remote deep residual conv–deconv network for hyperspectral image classification. IEEE
Sens. 1–17. https://doi.org/10.1109/TGRS.2017.2707528. Trans. Geosci. Remote Sens. 56, 391–406. https://doi.org/10.1109/TGRS.2017.
Khatami, R., Mountrakis, G., Stehman, S.V., 2016. A meta-analysis of remote sensing 2748160.
research on supervised pixel-based land-cover image classification processes: general Mou, L., Ghamisi, P., Zhu, X.X., 2017. Deep recurrent neural networks for hyperspectral
guidelines for practitioners and future research. Remote Sens. Environ. 177, 89–100. image classification. IEEE Trans. Geosci. Remote Sens. 55, 3639–3655. https://doi.
https://doi.org/10.1016/j.rse.2016.02.028. org/10.1109/TGRS.2016.2636241.
Lagrange, A., Le Saux, B., Beaupere, A., Boulch, A., Chan-Hon-Tong, A., Herbin, S., Mountrakis, G., Im, J., Ogole, C., 2011. Support vector machines in remote sensing: a
Randrianarivo, H., Ferecatu, M., 2015. Benchmarking classification of Earth-ob- review. ISPRS J. Photogramm. Remote Sens. 66, 247–259. https://doi.org/10.1016/j.
servation data: from learning explicit features to convolutional networks, in: IGARSS isprsjprs.2010.11.001.
2015. Ndikumana, E., Ho Tong Minh, D., Baghdadi, N., Courault, D., Hossard, L., 2018. Deep
Längkvist, M., Kiselev, A., Alirezaie, M., Loutfi, A., 2016. Classification and segmentation recurrent neural network for agricultural classification using multitemporal SAR
of satellite orthoimagery using convolutional neural networks. Remote Sens. 8, 329. sentinel-1 for camargue, France. Remote Sens. 10, 1217. https://doi.org/10.3390/
https://doi.org/10.3390/rs8040329. rs10081217.
Le, Q.V., 2015. A Tutorial on Deep Learning Part 2: Autoencoders, Convolutional Neural Niculescu, S., Ienco, D., Hanganu, J., 2018. Application of deep learning of multi-tem-
Networks and Recurrent Neural Networks. poral sentinel-1 images for the classification of coastal vegetation zone of the danube
Le Saux, B., Yokoya, N., Hansch, R., Prasad, S., 2018. Advanced multisource optical re- delta. ISPRS – Int. Arch. Photogramm. Remote Sens. Spat Inf. Sci. XLII–3, 1311–1318.
mote sensing for urban land use and land cover classification [Technical https://doi.org/10.5194/isprs-archives-XLII-3-1311-2018.
Committees]. IEEE Geosci. Remote Sens. Mag. 6, 85–89. https://doi.org/10.1109/ Nogueira, K., Penatti, O.A.B., dos Santos, J.A., 2017. Towards better exploiting con-
MGRS.2018.2874328. volutional neural networks for remote sensing scene classification. Pattern Recognit.
Lguensat, R., Sun, M., Fablet, R., Mason, E., Tandeo, P., Chen, G., 2017. EddyNet: a deep 61, 539–556. https://doi.org/10.1016/j.patcog.2016.07.001.
neural network for pixel-wise classification of oceanic eddies. ArXiv171103954 Phys. Paisitkriangkrai, S., Sherrah, J., Janney, P., Hengel, A.V.-D., 2015. Effective semantic
Li, J., Bruzzone, L., Liu, S., 2015. Deep feature representation for hyperspectral image pixel labelling with convolutional networks and Conditional Random Fields. In: 2015
classification. IEEE 4951–4954. https://doi.org/10.1109/IGARSS.2015.7326943. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
Li, T., Zhang, J., Zhang, Y., 2014. Classification of hyperspectral image based on deep Presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition
belief networks. In: 2014 IEEE International Conference on Image Processing (ICIP). Workshops (CVPRW), pp. 36–43. 10.1109/CVPRW.2015.7301381.
Presented at the 2014 IEEE International Conference on Image Processing (ICIP), pp. Pan, B., Shi, Z., Xu, X., 2018. MugNet: deep learning for hyperspectral image classification
5132–5136. 10.1109/ICIP.2014.7026039. using limited samples. ISPRS J. Photogramm. Remote Sens. 145, 108–119. https://
Li, W., Wu, G., Zhang, F., Du, Q., 2017. Hyperspectral image classification using deep doi.org/10.1016/j.isprsjprs.2017.11.003.
pixel-pair features. IEEE Trans. Geosci. Remote Sens. 55, 844–853. https://doi.org/ Paoletti, M.E., Haut, J.M., Plaza, J., Plaza, A., 2018. A new deep convolutional neural
10.1109/TGRS.2016.2616355. network for fast hyperspectral image classification. ISPRS J. Photogramm. Remote
Li, Y., Zhang, H., Shen, Q., 2017. Spectral-spatial classification of hyperspectral imagery Sens. 145, 120–147. https://doi.org/10.1016/j.isprsjprs.2017.11.021.
with 3D convolutional neural network. Remote Sens. 9, 67. https://doi.org/10.3390/ Penatti, O.A.B., Nogueira, K., Santos, J.A. dos, 2015. Do deep features generalize from
rs9010067. everyday objects to remote sensing and aerial scenes domains?. In: 2015 IEEE
Liu, P., Choo, K.-K.R., Wang, L., Huang, F., 2017. SVM or deep learning? A comparative Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
study on remote sensing image classification. Soft Comput. 21, 7053–7065. https:// Presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition
doi.org/10.1007/s00500-016-2247-2. Workshops (CVPRW), pp. 44–51. 10.1109/CVPRW.2015.7301382.
Liu, S., Li, M., Zhang, Z., Xiao, B., Cao, X., 2018. Multimodal ground-based cloud clas- Qayyum, A., Malik, A.S., Saad, N.M., Iqbal, M., Faris Abdullah, M., Rasheed, W., Rashid
sification using joint fusion convolutional neural network. Remote Sens. 10, 822. Abdullah, T.A., Bin Jafaar, M.Y., 2017. Scene classification for aerial images based on
https://doi.org/10.3390/rs10060822. CNN using sparse coding technique. Int. J. Remote Sens. 38, 2662–2685. https://doi.
Liu, W., Wang, Z., Liu, X., Zeng, N., Liu, Y., Alsaadi, F.E., 2017. A survey of deep neural org/10.1080/01431161.2017.1296206.
network architectures and their applications. Neurocomputing 234, 11–26. https:// Rezaee, M., Mahdianpari, M., Zhang, Y., Salehi, B., 2018. Deep convolutional neural
doi.org/10.1016/j.neucom.2016.12.038. network for complex wetland classification using optical remote sensing imagery.
Liu, Yongcheng, Fan, B., Wang, L., Bai, J., Xiang, S., Pan, C., 2018. Semantic labeling in IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 11, 3030–3039. https://doi.org/10.
209
1109/JSTARS.2018.2846178. Yu, Y., Gong, Z., Wang, C., Zhong, P., 2017. An unsupervised convolutional feature fusion
Romero, A., Gatta, C., Camps-Valls, G., 2016. Unsupervised deep feature extraction for network for deep representation of remote sensing images. IEEE Geosci. Remote Sens.
remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 54, Lett. 1–5. https://doi.org/10.1109/LGRS.2017.2767626.
1349–1362. https://doi.org/10.1109/TGRS.2015.2478379. Yu, Y., Guan, H., Zai, D., Ji, Z., 2016. Rotation-and-scale-invariant airplane detection in
Rosenblatt, F., 1958. The perceptron: a probabilistic model for information storage and high-resolution satellite images based on deep-Hough-forests. ISPRS J. Photogramm.
organization in the brain. Psychol. Rev. 65, 386–408. https://doi.org/10.1037/ Remote Sens. 112, 50–64. https://doi.org/10.1016/j.isprsjprs.2015.04.014.
h0042519. Yue, J., Zhao, W., Mao, S., Liu, H., 2015. Spectral–spatial classification of hyperspectral
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back- images using deep convolutional neural networks. Remote Sens. Lett. 6, 468–477.
propagating errors. Nature 323, 533–536. https://doi.org/10.1038/323533a0. https://doi.org/10.1080/2150704X.2015.1047045.
Rußwurm, M., Körner, M., 2018. Multi-Temporal Land Cover Classification with Zabalza, J., Ren, J., Zheng, J., Zhao, H., Qing, C., Yang, Z., Du, P., Marshall, S., 2016.
Sequential Recurrent Encoders. ArXiv180202080 Cs. Novel segmented stacked autoencoder for effective dimensionality reduction and
Rußwurm, M., Körner, M., 2017. Multi-temporal land cover classification with long short- feature extraction in hyperspectral imaging. Neurocomputing 185, 1–10. https://doi.
term memory neural networks. ISPRS – Int. Arch. Photogramm. Remote Sens. Spat. org/10.1016/j.neucom.2015.11.044.
Inf. Sci. XLII-1/W1, 551–558. https://doi.org/10.5194/isprs-archives-XLII-1-W1- Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks. In:
551-2017. Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (Eds.), Computer Vision – ECCV 2014:
Sharma, A., Liu, X., Yang, X., 2018. Land cover classification from multi-temporal, multi- 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings,
spectral remotely sensed imagery using patch-based recurrent neural networks. Part I. Springer International Publishing, Cham, pp. 818–833 https://doi.org/
Neural Netw. 105, 346–355. https://doi.org/10.1016/j.neunet.2018.05.019. 10.1007/978-3-319-10590-1_53.
Sherrah, J., 2016. Fully convolutional networks for dense semantic labelling of high- Zeiler, M.D., Fergus, R., 2013. Visualizing and Understanding Convolutional Networks.
resolution aerial imagery. ArXiv Prepr. ArXiv160602585. ArXiv13112901 Cs.
Shi, C., Pun, C.-M., 2018. Superpixel-based 3D deep neural networks for hyperspectral Zhang, C., Pan, X., Li, H., Gardiner, A., Sargent, I., Hare, J., Atkinson, P.M., 2018a. A
image classification. Pattern Recognit. 74, 600–616. https://doi.org/10.1016/j. hybrid MLP-CNN classifier for very fine resolution remotely sensed image classifi-
patcog.2017.09.007. cation. ISPRS J. Photogramm. Remote Sens. 140, 133–144. https://doi.org/10.1016/
Singhal, V., Gogna, A., Majumdar, A., 2016. Deep dictionary learning vs deep belief j.isprsjprs.2017.07.014.
network vs stacked autoencoder: an empirical analysis. In: Hirose, A., Ozawa, S., Zhang, C., Sargent, I., Pan, X., Gardiner, A., Hare, J., Atkinson, P.M., 2018b. VPRS-based
Doya, K., Ikeda, K., Lee, M., Liu, D. (Eds.), Neural Information Processing. Springer regional decision fusion of CNN and MRF classifications for very fine resolution re-
International Publishing, Cham, pp. 337–344. https://doi.org/10.1007/978-3-319- motely sensed images. IEEE Trans. Geosci. Remote Sens. 56, 4507–4521. https://doi.
46681-1_41. org/10.1109/TGRS.2018.2822783.
Sun, X., Zhou, F., Dong, J., Gao, F., Mu, Q., Wang, X., 2017. Encoding spectral and spatial Zhang, C., Sargent, I., Pan, X., Li, H., Gardiner, A., Hare, J., Atkinson, P.M., 2018c. An
context information for hyperspectral image classification. IEEE Geosci. Remote Sens. object-based convolutional neural network (OCNN) for urban land use classification.
Lett. 14, 2250–2254. https://doi.org/10.1109/LGRS.2017.2759168. Remote Sens. Environ. 216, 57–70. https://doi.org/10.1016/j.rse.2018.06.034.
Tang, T., Zhou, S., Deng, Z., Zou, H., Lei, L., 2017. Vehicle detection in aerial images Zhang, F., Du, B., Zhang, L., 2017. A multi-task convolutional neural network for mega-
based on region convolutional neural networks and hard negative example mining. city analysis using very high resolution satellite imagery and geospatial data.
Sensors 17, 336. https://doi.org/10.3390/s17020336. ArXiv170207985 Cs.
Tao, C., Pan, H., Li, Y., Zou, Z., 2015. Unsupervised spectral-spatial feature learning with Zhang, Fan, Du, B., Zhang, L., 2016. Scene classification via a gradient boosting random
stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geosci. convolutional network framework. IEEE Trans. Geosci. Remote Sens. 54, 1793–1802.
Remote Sens. Lett. 12, 2438–2442. https://doi.org/10.1109/LGRS.2015.2482520. https://doi.org/10.1109/TGRS.2015.2488681.
Tao, Y., Xu, M., Lu, Z., Zhong, Y., 2018. DenseNet-based depth-width double reinforced Zhang, F., Du, B., Zhang, L., 2015. Saliency-guided unsupervised feature learning for
deep learning neural network for high-resolution remote sensing image per-pixel scene classification. IEEE Trans. Geosci. Remote Sens. 53, 2175–2184. https://doi.
classification. Remote Sens. 10, 779. https://doi.org/10.3390/rs10050779. org/10.1109/TGRS.2014.2357078.
Tobler, W.R., 1970. A computer movie simulating urban growth in the detroit region. Zhang, H., Li, Y., Zhang, Y., Shen, Q., 2017. Spectral-spatial classification of hyperspectral
Econ. Geogr. 46, 234. https://doi.org/10.2307/143141. imagery using a dual-channel convolutional neural network. Remote Sens. Lett. 8,
Tschannen, M., Cavigelli, L., Mentzer, F., Wiatowski, T., Benini, L., 2016. Deep Structured 438–447. https://doi.org/10.1080/2150704X.2017.1280200.
Features for Semantic Segmentation. ArXiv160907916 Cs. Zhang, L., Shi, Z., Wu, J., 2015. A hierarchical oil tank detector with deep surrounding
Vakalopoulou, M., Karantzalos, K., Komodakis, N., Paragios, N., 2015. Building detection features for high-resolution optical satellite imagery. IEEE J. Sel. Top. Appl. Earth
in very high resolution multispectral data with deep learning features. IEEE Obs. Remote Sens. 8, 4895–4909. https://doi.org/10.1109/JSTARS.2015.2467377.
1873–1876. https://doi.org/10.1109/IGARSS.2015.7326158. Zhang, L., Zhang, L., Du, B., 2016. Deep learning for remote sensing data: a technical
Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., 2010. Stacked denoising tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 4, 22–40. https://doi.
autoencoders: learning useful representations in a deep network with a local de- org/10.1109/MGRS.2016.2540798.
noising criterion. J. Mach. Learn. Res. 11, 3371–3408. Zhao, C., Wan, X., Zhao, G., Cui, B., Liu, W., Qi, B., 2017. Spectral-spatial classification of
Volpi, M., Tuia, D., 2017. Dense semantic labeling of subdecimeter resolution images with hyperspectral imagery based on stacked sparse autoencoder and random forest. Eur.
convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 55, 881–893. J. Remote Sens. 50, 47–63. https://doi.org/10.1080/22797254.2017.1274566.
Wang, J., Song, J., Chen, M., Yang, Z., 2015. Road network extraction: a neural-dynamic Zhao, W., Du, S., 2016. Learning multiscale and deep representations for classifying re-
framework based on deep learning and a finite state machine. Int. J. Remote Sens. 36, motely sensed imagery. ISPRS J. Photogramm. Remote Sens. 113, 155–165. https://
3144–3169. https://doi.org/10.1080/01431161.2015.1054049. doi.org/10.1016/j.isprsjprs.2016.01.004.
Wang, S.-H., Sun, J., Phillips, P., Zhao, G., Zhang, Y.-D., 2018. Polarimetric synthetic Zhao, W., Du, S., Emery, W.J., 2017. Object-based convolutional neural network for high-
aperture radar image segmentation by convolutional neural network using graphical resolution imagery classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 10,
processing units. J. Real-Time Image Process. 15, 631–642. https://doi.org/10.1007/ 3386–3396. https://doi.org/10.1109/JSTARS.2017.2680324.
s11554-017-0717-0. Zhao, W., Guo, Z., Yue, J., Zhang, X., Luo, L., 2015. On combining multiscale deep
Weng, Q., Mao, Z., Lin, J., Guo, W., 2017. Land-use classification via extreme learning learning features for the classification of hyperspectral remote sensing imagery. Int. J.
classifier based on deep convolutional features. IEEE Geosci. Remote Sens. Lett. 14, Remote Sens. 36, 3368–3379. https://doi.org/10.1080/2150704X.2015.1062157.
704–708. https://doi.org/10.1109/LGRS.2017.2672643. Zhou, W., Newsam, S., Li, C., Shao, Z., 2017. Learning low dimensional convolutional
Weng, Q., Mao, Z., Lin, J., Liao, X., 2018. Land-use scene classification based on a CNN neural networks for high-resolution remote sensing image retrieval. Remote Sens. 9,
using a constrained extreme learning machine. Int. J. Remote Sens. 39, 6281–6299. 489. https://doi.org/10.3390/rs9050489.
https://doi.org/10.1080/01431161.2018.1458346. Zhou, W., Shao, Z., Cheng, Q., 2016. Deep feature representations for high-resolution
Wu, H., Liu, B., Su, W., Zhang, W., Sun, J., 2016. Deep filter banks for land-use scene remote sensing scene classification, in: 2016 4th International Workshop on Earth
classification. IEEE Geosci. Remote Sens. Lett. 13, 1895–1899. https://doi.org/10. Observation and Remote Sensing Applications (EORSA). Presented at the 2016 4th
1109/LGRS.2016.2616440. International Workshop on Earth Observation and Remote Sensing Applications
Wu, H., Prasad, S., 2018. Semi-supervised deep learning using pseudo labels for hyper- (EORSA), pp. 338–342. 10.1109/EORSA.2016.7552825.
spectral image classification. IEEE Trans. Image Process. 27, 1259–1270. https://doi. Zhou, W., Shao, Z., Diao, C., Cheng, Q., 2015. High-resolution remote-sensing imagery
org/10.1109/TIP.2017.2772836. retrieval using sparse features by auto-encoder. Remote Sens. Lett. 6, 775–783.
Xia, G.-S., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., Zhang, L., Lu, X., 2017. AID: a https://doi.org/10.1080/2150704X.2015.1074756.
benchmark data set for performance evaluation of aerial scene classification. IEEE Zhou, Y., Arpit, D., Nwogu, I., Govindaraju, V., 2014. Is Joint Training Better for Deep
Trans. Geosci. Remote Sens. 55, 3965–3981. https://doi.org/10.1109/TGRS.2017. Auto-Encoders? ArXiv14051380 Cs Stat.
2685945. Zhu, X., Goldberg, A.B., 2009. Introduction to semi-supervised learning. Synth. Lect. Artif.
Xing, C., Ma, L., Yang, X., 2016. Stacked denoise autoencoder based feature extraction Intell. Mach. Learn. 3, 1–130. https://doi.org/10.2200/
and classification for hyperspectral images. J. Sens. 2016, 1–10. https://doi.org/10. S00196ED1V01Y200906AIM006.
1155/2016/3632943. Zhu, X.X., Tuia, D., Mou, L., Xia, G.-S., Zhang, L., Xu, F., Fraundorfer, F., 2017. Deep
Xu, X., Li, W., Ran, Q., Du, Q., Gao, L., Zhang, B., 2018. Multisource remote sensing data learning in remote sensing: a comprehensive review and list of resources. IEEE
classification based on convolutional neural network. IEEE Trans. Geosci. Remote Geosci. Remote Sens. Mag. 5, 8–36. https://doi.org/10.1109/MGRS.2017.2762307.
Sens. 56, 937–949. https://doi.org/10.1109/TGRS.2017.2756851. Zou, Q., Ni, L., Zhang, T., Wang, Q., 2015. Deep learning based feature selection for
Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H., 2015. Understanding Neural remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 12, 2321–2325.
Networks Through Deep Visualization. ArXiv150606579 Cs. https://doi.org/10.1109/LGRS.2015.2475299.
210

10 1016@j Isprsjprs 2019 04 016

Uploaded by

Copyright:

Available Formats

10 1016@j Isprsjprs 2019 04 016

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

10 1016@j Isprsjprs 2019 04 016

Uploaded by

Copyright:

Available Formats

ISPRS Journal of Photogrammetry and Remote Sensing 152 (2019) 192–210

Contents lists available at ScienceDirect

ISPRS Journal of Photogrammetry and Remote Sensing

Meta-analysis of deep neural networks in remote sensing: A comparative T

ARTICLE INFO ABSTRACT

1. Introduction nodes (Rumelhart et al., 1986). The back-propagation method has

Network type One of below categories:

Network type CNN DBN SAE

Number of cases 150 9 25

Dataset spatial resolution < 30 cm 30 cm – 3 m >3 m

Spectral and auxiliary bands 1–3 4–10 11–99 > 100

45 class, 700 samples per class/scene

21 class, 100 samples per class/scene

7 class, 400 samples per class/scene

19 class, 50 samples per class/scene

Making dense (full resolution) output options (for CNN):

Parallel networks on different band combinations or sensors 9

Active learning or iterative feature selection (removing inferior 4

Other specialized methods 9

PCA dimensionality reduction and spectral + spatial input 9

Sequence data/temporal processing 4

Other specialized methods 2

MRF/CRF processing or boundary detection 7

Denoising SAE implementation 3

Initial and/or final data/feature filtering or segmentation to enhance 12

Sparse or other type of coding to create codebook after feature 4

generation and classify the code

For example, in image segmentation the objects’ boundary alignment is

of primary concern, while in scene classification this is not important.

This makes edge enhancement techniques more relevant to the former

merit is not possible. However, we discuss general findings on design

catalog and guide future research, either through gap analysis or

Google Earth images

Google Earth images

3.3.1. Dense (fully connected) networks

This CNN-type network is the de-facto network of choice for very

high resolution classification and almost all of the image segmentation

works – particularly experiments with ISPRS Potsdam and Vaihingen

classify the image pixel by pixel, but it means a huge redundancy in

pically contains chain of downsampling and then upsampling) is pre-

better way to do it. Edge enhancement and additional segmentation

techniques have also been examined by different approaches to en-

3.3.2. Multiscale capability options

3.3.3. Network mixing options

3.3.4. Training options

example, one author used the average values of a neighborhood around

3.4.1. Distribution across learning methods

should be selected among many variants for SVM or other non-deep

3.4.2. Distribution across network complexity

3.4.3. Distribution across spatial resolution

network and enlarging scale of pixel influence, there is always some

3.4.6. Review of widely-used data sets

– Indian Pines (a hyperspectral dataset): The highest accuracy here is

Other network parameters Dataset specification Best reported performances

(continued on next page)

Other network parameters Dataset specification Best reported performances

(continued on next page)

Other network parameters Dataset specification Best reported performances

(continued on next page)