Department of Environmental Resources Engineering, State University of New York, College of Environmental Science and Forestry, 1 Forestry Drive, Syracuse, NY 13210,
United States
Keywords: Deep learning; Classification; Convolutional neural network; Deep belief network; Stacked auto encoder; Support vector machine
Deep learning methods have recently found widespread adoption for remote sensing tasks, particularly in image
or pixel classification. Their flexibility and versatility have enabled researchers to propose many different designs
to process remote sensing data in all spectral, spatial, and temporal dimensions. In most of the reported cases
they surpass their non-deep rivals in overall classification accuracy. However, there is considerable diversity in
implementation details in each case and a systematic quantitative comparison to non-deep classifiers does not
exist. In this paper, we look at the major research papers that have studied deep learning image classifiers in
recent years and undertake a meta-analysis on their performance compared to the most used non-deep rival,
Support Vector Machine (SVM) classifiers. We focus on mono-temporal classification as the time-series image
classification did not offer sufficient samples. Our work covered 103 manuscripts and included 92 cases that
supported direct accuracy comparisons between deep learners and SVMs.
Our general findings are the following: (i) Deep networks have better performance than non-deep spectral
SVM implementations, with Convolutional Neural Networks (CNNs) performing better than other deep learners.
This advantage, however, diminishes when feeding SVM with richer features extracted from data (e.g. spatial
filters). (ii) Transfer learning and fine-tuning on pre-trained CNNs offer promising results over spectral or
enhanced SVM; however, these pre-trained networks are currently limited to RGB input data and therefore
lack applicability to multi/hyperspectral data. (iii) There is no strong relationship between network complexity
and accuracy gains over SVM; small to medium networks perform similarly to more complex networks. (iv)
Contrary to popular belief, there are numerous cases of high deep network performance with training
proportions of 10% or less.
Our study also indicates that the new generation of classifiers is often saturating existing benchmark
datasets, with accuracies surpassing 99%. There is a clear need for new benchmark dataset collections with
diverse spectral, spatial and temporal resolutions and coverage that will enable us to study the design
generalizations, challenge these new classifiers, and further advance remote sensing science. Our community could
also benefit from a coordinated effort to create a large pre-trained network specifically designed for remote
sensing images that users could later fine-tune and adjust to their study specifics.
E-mail addresses: sshahhey@syr.edu (S.S. Heydari), gmountrakis@esf.edu (G. Mountrakis).
the high node number of DNNs made it difficult to train and optimize in a practical manner.
The seminal work of Hinton et al. (2006) showed that unsupervised pre-training of each layer, one after another, could considerably improve results. This layer-wise training approach, named the greedy algorithm, was the key that opened new avenues to deep neural networks. The greedy algorithm could also be followed by a fine-tuning process, in which the entire network is tuned together using backpropagation, but this time from a much better starting point. Deep network theories and practices have expanded considerably during the last decade. This has resulted in the establishment of some major network types (with continuous enhancements) and numerous applications in different domains. In close relationship with image processing and computer vision, remote sensing (RS) is one of many areas that deep learning is targeting.
Generally, and following the discussion in L. Zhang et al. (2016), we can categorize remote sensing applications of deep learning into four groups: (1) RS image pre-processing, (2) scene classification, (3) pixel-based classification and image segmentation, and (4) target detection. For image pre-processing tasks, we can name pan-sharpening, denoising, and resolution enhancement as major applications. Scene classification is done based on features extracted from a scene, a task at which deep networks are assumed to be good. Non-deep approaches normally use handcrafted features extracted from the scene to feed the classifier (SVM, KNN, etc.) and predict the scene type. Deep networks have opened the door to direct use of spectral and spatial information together to generate a richer set of features automatically. This automatic extraction increases the potential for good generalization and scalability of this method compared to handcrafted features. Handcrafted features tend to be tailored closely to a specific case and application and may perform better than any automatic system, but because of this specificity they cannot be easily or successfully generalized to other cases/studies. This type of work is closely related to the image recognition task, but for categorization of remote sensing scenes (such as agricultural field, residential area, airport, parking lot, etc.); therefore, sharing network configurations between computer vision and remote sensing applications is common here. Pixel classification and segmentation (or semantic labeling) are similar to scene classification but operate at the pixel rather than scene level, and produce a thematic map instead of a single category index. This is perhaps the most studied RS application and deep networks have shown performance benefits due to their ability to co-process spatial and spectral data easily, especially for hyperspectral images. Our main target in this paper is to focus on image or scene classification – we do not address other applications. In addition, we focus on mono-temporal classification as the time-series image classification is still in its infancy. Target or object detection is generally an extension to the three aforementioned groups, where specific objects defined by their shape or boundary are extracted from an image. This field has found many useful but challenging applications in high resolution and real time image/video processing.
Following the explosive growth of new algorithmic developments and case studies in deep learning RS applications in the past 3–4 years, several review manuscripts have been published (Ghamisi et al., 2017, Xia et al., 2017, L. Zhang et al., 2016, or P. Liu et al., 2017) and a Special Issue in this journal has summarized several recent advancements (Mountrakis et al., 2018). The majority of these reviews are descriptive and do not offer a quantitative assessment of deep learning benefits building on the extensive available comparisons in the literature. The overall goal of this work is to bridge this knowledge gap by comparing deep and non-deep classification algorithms through a meta-analysis of published research.
Other meta-analysis works exist but they do not examine explicitly deep learning benefits. For example, Khatami et al. (2016) grouped all neural network types under one category and did not distinguish deep networks from other implementations. Ma et al. (2017) conducted a similar meta-analysis focusing on object-based classification (thus excluding pixel-based ones) without separating deep learning methods. There are some other papers that review deep learning architectures in general such as Deng (2014) or W. Liu et al. (2017), or for a specific type of data, such as Camps-Valls et al. (2014) on hyperspectral data classification. These works also lack quantitative comparisons using a meta-analysis approach.
The overarching goal is to provide readers with the “big picture” of current research and build on the collective knowledge of published works to assess deep learning benefits in remote sensing. To undertake the proposed meta-analysis task, we reviewed major research papers and built a database of case studies of deep network applications in the remote sensing field while extracting main network and data characteristics. This database was analyzed to identify deep learning classification performance and its distribution across these network (e.g. network complexity) and data characteristics (e.g. spatial resolution). We expect this analysis to provide a knowledge baseline as the remote sensing community further incorporates deep learning in related activities.
The structure of the manuscript is as follows. A brief overview of deep network types is presented in Section 2 along with key introductory references. A summary table is also provided to describe extracted parameters for each research paper. In Section 3, after introducing descriptive statistics and summarizing design ideas encountered in the selected research papers and the datasets used, we provide our main comparative analysis and discuss important research questions about the effect of parameters on network performance. The last section provides concluding remarks.
2. Methods
In this section we first describe three DNN methods that have been popular in RS tasks. Section 2.2 contains an explanation of the paper database and associated characteristics and metrics used in the comparative accuracy analysis between DNNs and non-deep methods.
2.1. Summary of popular deep neural networks in remote sensing
The deep learning paradigm is concentrated on automated hierarchical feature extraction. Numerous methods and their modifications have been devised over the past years. Here we briefly introduce the three most widely used structures which were used in our identified studies. More detailed descriptions of each structure can be found in many machine learning textbooks, for example Bengio (2009) and Goodfellow et al. (2016), or tutorials such as Le (2015) and Deng (2014). Zhu et al. (2017) and L. Zhang et al. (2016) also provide tutorials for deep learning for remote sensing (RS) applications.
Deep networks have been developed to enhance and enrich data representations in an automated and intelligent manner. A good representation is, of course, dependent on the specific application and should be learned from training data. One important deep network category in this class is based on Autoencoders (AEs). The idea behind an autoencoder is basically an encoder-decoder network to regenerate the input as accurately as possible in its output. Under specific conditions, the encoder part works as a good feature extractor and can be stacked to build deep networks named Stacked Auto Encoders or SAEs (the decoder part is not used). The imposed condition on the objective function is typically a form of sparsity, but other variants are also studied. To put it simply, AE learns a deterministic representation of the
Table 1
Parameters collected on each case study.
Reference Citation code for the referenced research paper
input by minimizing a cost function based on the difference between the input and the regenerated one at the decoder output. This learning takes place using gradient-based optimization and standard backpropagation techniques. AEs are well suited to unsupervised learning and can be trained layer-wise, possibly followed by a supervised fine-tuning phase of the entire network. For a good overview of autoencoders with some worked examples and executable code see Andrew Ng’s Deep Learning tutorial at http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial. Vincent et al. (2010) also provide more details on autoencoders and unsupervised learning.
Another way of thinking about data representation is to learn the statistical distribution of input, i.e. a probabilistic approach. This approach has led to Generative models or Structured Probabilistic Models. Deep belief network (DBN) based on stacking layers of Restricted Boltzmann Machine (RBM) is the most popular variant for RS applications. Here the aim is to minimize the Boltzmann cost function, to maximize “the similarity (in a probabilistic sense) between the representation and projection of the input” (Singhal et al., 2016). This optimization does not use an assumed output, so a different algorithm (contrastive divergence) is required to train the neurons. However, similarly to the autoencoder, training is unsupervised and, more important, it can be done in a greedy layer-wise approach for a stack of layers. This layer-wise approach was devised by the seminal work of Hinton et al. (2006) and later implemented by both SAEs and DBNs. Therefore SAEs and DBNs are often discussed together in the literature (e.g. Vincent et al. (2010)). When trained, the network can provide extracted features for the new data to be classified. Tutorials on RBM and DBN are available through the internet, for example see https://deeplearning4j.org/restrictedboltzmannmachine, which includes executable code.
The third type, which is the most used structure in recent years, is the convolutional neural network (CNN). Inspired by the human visual system and designed to process images, it has limited connection to only adjacent neurons in each layer, with the same connection weights for each neuron within each layer. It may include down-sampling in each layer, which reduces the processing resolution but adds a translation invariance property to the network. Each layer’s output is typically named a map and it is generally desired to have multiple maps generated at each layer. Here the filter weights are typically tuned by supervised training, as the limited number of shared parameters in each layer (compared to a fully connected network) allows it. There are also some pre-trained large network structures publicly available for use, and fine-tuning them for specific applications is another common approach. For a university course on convolutional neural networks readers are referred to http://cs231n.stanford.edu/. Zeiler and Fergus (2014) also provide a discussion on visualization and understanding of the internal CNN workings.
Working with sequence data is another important type of remote sensing work, particularly on three bases: studying hyperspectral signal variations and analyzing their dependencies; adding the time dimension as another data element to explore land use feature patterns (profiles) and use them in classification; and pursuing detection of changes in land cover or land use by processing time-series data. Neural networks – and specifically Recurrent Neural Networks – are gaining momentum for these applications but the number of published papers is still low. These networks are promising with new modifications such as adding more powerful and deep memory cells (see for example Lyu et al. (2016), Mou et al. (2017), Rußwurm and Körner (2017), Rußwurm and Körner (2018), Ndikumana et al. (2018), Niculescu et al. (2018), or Sharma et al. (2018)). However, we did not consider sequence data applications in our paper due to lack of enough data, and our focus was only on feed-forward networks and their three main variants: SAE, DBN, and CNN.
2.2. Comparative performance database creation
Our overarching goal is to look at the analyzed DNN case studies and compare them to each other and to a well-known non-deep classifier, Support Vector Machine (SVM). SVM will serve both as a representative non-deep classifier to compare with deep networks, and as a baseline to compare different DNN architectures. SVMs were selected as the benchmarking algorithm because: (i) they were found to be the best
non-deep performing classifiers in an extensive comparison of published work (Khatami et al., 2016), and (ii) the majority of DNN papers found in this review chose to include SVM as the main benchmark, thus validating our decision. We also examine accuracy trends across data and method characteristics. Direct comparisons of published works are not feasible due to variances in data types, sampling design, algorithmic details, and test metrics. Therefore, we concentrated on aggregating results from manuscripts where accuracy metrics are reported mutually under common conditions for deep and non-deep implementations. This database was then used to do comparative meta-analysis and other quantitative statistical analyses.
The result was 103 research papers from 2014 until Nov. 2018 covering 183 case studies that include deep learning-based classification, 92 cases of which supported direct comparisons of accuracy to SVM. The main characteristics of these case studies are summarized in Table A1, Appendix A, with each column of the appendix table defined in Table 1. These parameters reflect the most important aspects of the research design and we use them to present the discussion of our research questions in the subsequent sections. We treat each data set in a research paper as a separate case, because the output result and possibly the network structure may vary per case in any single paper.
One of the most important parameters in network specification is the number of network parameters, which reflects network complexity. This is typically a surrogate of network depth and width. It is expected that a bigger network would be more powerful, but the network architecture and way of processing (reflected in other columns of the table) greatly impact this performance. Therefore, it is not unexpected that a smaller but more elegant network outperforms a larger one in obtained accuracy. For example, in classifying the ISPRS Potsdam and Vaihingen datasets, Maggiori et al. (2016) achieved > 1% better accuracy than Volpi and Tuia (2017) with a network around 1/10th of their network size. The number of parameters is mostly calculated from the network specifications given in the cited paper, but in some cases it is reported directly in the cited paper. In cases where the given information was not sufficient or ambiguity was not cleared by correspondence, the entry was left blank. This number includes parameters in as many network branches as implemented, but it does not include parameters associated with additional stages of combination or fusion with other data or algorithms. It also counts the network layer parameters up to the last layer before the final classifier, which is typically a Softmax layer but SVM is also used. In around 70% of our cases the deep network is followed by a Softmax classifier, therefore we drop the final classifier type from our list of parameters.
The learning strategy column is another important network parameter. It does not refer to the final classifier training, which is always supervised, but shows the methodology for determining network parameters. Supervised learning is the most common approach in deep networks. It can also have different variations in the form of cost function or optimization procedure, or be enhanced by data-driven techniques such as active learning. Those advanced cases are designated as supervised+ in our database. The fine-tuning options show the cases when network parameters are fine-tuned after an initial unsupervised learning or transferred from a pre-trained network in transfer learning. Transfer learning is available to CNN only. DBNs are usually limited to unsupervised & fine-tuning type, while SAEs are used with both unsupervised learning techniques. Semi-supervised learning is also used in some cases, which is a strategy for using both labeled and unlabeled data in optimizing the network cost function.
Spatial resolution in our collected research cases varies from 5 cm for VHR imagery to 30 m for Landsat, and is left blank if not provided. The number of channels shows the ones that have been actually used in the experiment (some channels have been set aside for their low quality in some studies but not in the others). Note that in some cases the input channels are processed and their dimensionality is reduced (mostly employing PCA) before the result is applied to the network; we do not mention this dimensionality reduction in the table, although we take it into consideration when calculating the number of parameters and consider the network in its actual tested configuration. There are two cases using Landsat and one case using MODIS imagery, which are indicated separately in the table due to the importance of these data sources. Although they receive less attention today because of their inferior spatial resolution, they are of interest for their rich temporal dimension in time-series analysis. Data fusion from different sources is also experiencing growing attention, especially adding height data through digital elevation models (DEM). We discuss this further in the design options (Section 3.3) but an in-depth analysis of this issue was outside the scope of this work.
Another important factor in network prediction performance is the training data size. More training data typically leads to better network generalization, but in many cases the labeled training data is very limited. The corresponding column shows the rounded proportion of (labeled) training data samples to the entire reference data set, varying from as low as 0.1% to 90%. We refer to it as “training proportion” hereafter, and consider the proportion in one single run of the network; therefore, a cross-validation scheme does not change the value in the table from a similar hold-out.
The reported classification accuracy (overall or average) value is the best performance reported for the reference dataset in each case. It is reported as a number between 0 and 100 except for the metric Average Normalized Modified Retrieval Rank (ANMRR). Although overall accuracy is an aggregate metric and cannot show important class-dependent performance values, it is still the most widely used metric due to its simplicity and general applicability. Even though in some cases more detailed evaluations are provided along with overall accuracy, due to different experimental designs and data structures in our meta-analysis, these detailed metrics were not widely comparable and therefore class-specific measures were not included.
In some cases, an additional pre- or post-processing step complements the deep network to enhance the performance, for example merging the resulting map with an auxiliary segmentation result, adding a conditional random field (CRF) layer for edge enhancement, or object-based processing. These methods differ largely in implementation details and experiment setup so cannot be directly compared to assess the processing gains; we provide more details on them in Section 3.3.
Although the chosen non-deep methods vary greatly in type and options from paper to paper, there are still numerous cases where DNNs are compared to an SVM-based implementation, with Random Forest and KNN being the next classifier types, used with much lower frequency in our observed cases. Therefore we chose those papers reporting on SVM results as the candidates for doing our quantitative analysis (in the next section). SVM is a good choice for benchmarking because it is a well-established and proven classification tool with generally superior performance (Mountrakis et al., 2011; Khatami et al., 2016; Heydari and Mountrakis, 2018). Note that in remote sensing image or scene classification tasks, we are generally interested in both feature generation and classification. Neural networks can do both automatically – and deep networks put more stress on the feature extraction task – but SVM classifiers should be fed with features already generated by another algorithm. The SVM implementation itself may vary between processing the raw pixel data or some secondary handcrafted spectral/spatial features derived from data. To ensure a fairer comparison we separated these two cases due to the potentially important impact of working with features instead of raw data. Clearly, there are many
Table 2
Basic statistics of collected case studies.
Year 2014 2015 2016 2017 2018
Number of cases 4 27 21 22 33
Number of cases 23 88 48
Number of cases 59 48 1 70
variations and methods for handcrafting features and each paper may include a different set of methods for comparison, so we could not go into their implementation details or a detailed comparison. Furthermore, SVM optimization methods varied (e.g. hyperparameters and kernel choice); however, we assumed (and it was also stressed in some papers) that after tuning the authors reported their best SVM performance.
We should mention here that although our meta-analysis covers many different cases, each case has an almost unique setting of the above parameters and therefore our analysis is naturally limited in depth and statistical richness. Our objective was to study general trends and, for the first time in the literature, offer a quantitative meta-analysis of DNNs in remote sensing applications. Our quantitative analysis did not go into a detailed analysis of the effect of every design option due to lack of data.
3. Results and discussion
3.1. Descriptive statistics
Table 2 provides information on case studies distribution by year, network type, spatial resolution and input dimensionality. Note that some manuscripts may contain more than one study, and spatial or input dimensionality information was not always available.
There is an increase in research papers on deep networks for remote sensing classification applications after 2014, continuing to date. CNN is the most commonly used network type, then SAE followed by DBN. Most of the datasets are either hyperspectral (> 100 spectral channels) or less than 10 channels. Just one case study had spectral channels between 10 and 100. Hyperspectral datasets are of high spatial resolution (around 1 m) so sit in the middle group of the spatial resolution category. Very high resolution ones (< 30 cm) are mostly available in RGB, possibly with an added Near-Infrared band and/or DSM data, with just one very recent case incorporating a drone-based six band experiment at spatial resolution of 4.7 cm. More information on datasets will be given in the next section.
3.2. Datasets in the selected case studies
A wide variety of approximately 60 different datasets were used throughout the selected case studies. They included frequently used datasets along with datasets selected from public sources such as Google Earth, QuickBird, WorldView, Landsat archives and proprietary data sources. Cases that have been used more than twice in our review are listed in Table 3. The table includes both scene and pixel classification applications as indicated in the last column, and the “labelled elements” column should be interpreted accordingly. Among them, the Brazilian Coffee, NWPU-RESISC45, RSSCN7, UC Merced, and WHU-RS19 have been used for scene classification while the others concentrated on pixel classification/image segmentation. It is important to note a significant limitation. While still being extensively used even in papers from 2018, some of the commonly used datasets are old and outdated: the major issue is their small size compared to datasets with millions of elements typically used in computer vision and other artificial intelligence studies. This issue has been partly addressed by some very high resolution datasets such as the ISPRS Vaihingen and Potsdam datasets, which became a standard test bench for newly arrived (mostly CNN-based) networks. Furthermore, hyperspectral cases are limited to a single scene and some datasets cover a very small geographic area, which limits the generalization ability of the obtained results. Again, there is a new dataset presented through the IEEE GRSS contest in 2018 which consists of a relatively large area of 1.4 km² covered by both very high resolution (5 cm) RGB and high resolution (1 m) multispectral data. However, none of our reviewed articles was based on this new dataset (Le Saux et al., 2018).
There is still a need to create more large and rich datasets for remote sensing applications in different spatial and spectral resolutions. Preparing datasets for tackling temporal applications is another important issue, which is even more restricted than other applications. However, the decision to pick specific labels and the procedure for creating ground truth maps is very application-specific. Provision of auxiliary data (commonly DSM based on LiDAR) is also an important enhancement that is available in few datasets and should be encouraged.
3.3. Network design options
In terms of network optimization for deep networks the simplest way is to change the network depth (number of layers) and width (neurons per layer). Additional modifications include changes in the activation function, the type of classifier or the training strategy (supervised/unsupervised). Looking beyond these fairly common adjustments, we present in Table 3 a descriptive summary of the most important design innovations we encountered. The table is organized into titles summarizing the main design point, followed by specific design ideas in each section. The number of papers using each option is provided to suggest popularity. Some design options are not exclusive to a specific network type (e.g. network mixing options), while others are only applicable to specific network types (e.g. fully convolutional network). The classification task type may also require special provisions.
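To make the depth and width adjustments concrete, the following minimal sketch counts trainable parameters for a fully connected stack and for a single convolutional layer. It follows the counting convention described in Section 2.2 (layers up to, but not including, the final classifier). The function names and all layer sizes are hypothetical and chosen only for illustration; they are not taken from any of the reviewed papers.

```python
def dense_params(layer_sizes):
    """Count weights and biases of a fully connected stack.

    layer_sizes lists the input size followed by each hidden layer width;
    the final classifier (e.g. Softmax) is deliberately left out.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def conv2d_params(in_channels, out_maps, kernel_h, kernel_w):
    """Count shared filter weights plus one bias per output map."""
    return out_maps * (in_channels * kernel_h * kernel_w + 1)


# Hypothetical layer sizes for a 200-band pixel vector (illustration only).
baseline = dense_params([200, 100])           # one hidden layer
deeper = dense_params([200, 100, 100, 100])   # more layers (depth)
wider = dense_params([200, 400])              # more neurons per layer (width)
conv = conv2d_params(in_channels=200, out_maps=32, kernel_h=3, kernel_w=3)

print(baseline, deeper, wider, conv)
```

Because the convolutional count depends only on the filter size and the number of maps, not on the image size, it illustrates why weight sharing keeps CNN layers comparatively small.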
Table 4
Dataset name: KSC (Kennedy Space Center); ISPRS Vaihingen; Pavia University; Brazilian coffee; ISPRS Potsdam; UC Merced
Spatial res.: 2.5 m; 1.3 m; 1.3 m; 3.7 m; 0.5 m; 20 m; 18 m; 5 cm; 9 cm
# of spectral and aux. channels
512 × 614; 256 × 256; 512 × 614; 610 × 340; 400 × 400; 512 × 217; 256 × 256; 600 × 600; 64 × 64
31,500 scenes; 2100 scenes; 950 scenes
13 class/pixel; 16 class/pixel; 6 class/pixel; 6 class/pixel; 9 class/pixel; 9 class/pixel; object discrimination
Design option | Number of papers
Getting multiscale input | 10
Using multiscale kernels (filters) | 7
Skip links (forwarding features/scores from one layer to another non-adjacent layer) | 8
Network mixing options (fusion/aggregation method varies by case):
  Parallel handcrafted features | 3
  Parallel 1-D and 2-D convolutional networks | 3
Data augmentation or adding virtual samples to the input data | 15
Multimodal processing:
  3-D processing modules | 7
  Spatial averaging/filtering over a neighborhood for spectral + spatial input generation | 2
options below and our intent is that this table will act as a preliminary
datasets. The competition in this field is extensive, and some of the most popular networks have been implemented in this category to win the ISPRS competition. It is always possible to run the entire network and
ferred. Upsampling design is a hot topic and each paper tries to find a
hance the result (we will refer to it in another paragraph in this sec-
Fig. 4. Comparative performance distribution across network complexity.
Fig. 5. Comparative performance distribution across spatial resolution.
the data. However, matching them to classes requires an extra step of supervised learning. Therefore, unsupervised learning alone is comparable to enhanced SVM, and fine-tuning further improves results. Semi-supervised learning was used in two cases with better results, but its application detail is case-dependent. There are different methods and also underlying assumptions about actual class distribution for doing semi-supervised learning (for example see Zhu and Goldberg (2009) and Camps-Valls et al. (2014)). Each method and assumption embeds a specific additional regularization term for unlabeled data in the optimization cost function, but there is no standardized way of doing that. This lack of standardization may be a cause for its limited use.
3.4.4. Distribution across input data dimensionality
Fig. 6 organizes the results in three general categories, mostly separating RGB (group A) and hyperspectral images (group C), with group B being cases employing additional multispectral components such as near-infrared (NIR) and/or auxiliary data such as DSM/LiDAR.
Although it seems that the multispectral group (B) generally achieves a bit less improvement compared to the other two groups, there is no strong evidence or supporting theory for that.
3.4.5. Distribution across training size/proportion
In the examined manuscripts the sampling unit is either a single pixel (for pixel classification or image segmentation applications) or an image patch (for scene classification or target detection applications). Labeled data size is mostly in the order of a few thousands, with additional cases with considerably more labelled data. Sampling is done within the labelled dataset, with the proportion varying substantially in different implementations from as low as 0.1% to as high as 90%. We consider two different ways in which training data size affects network simulations. The first issue is the training data size, which should be considered in accordance with the network size and number of parameters. A large network with few training data may experience overfitting and lack of generalization, while a small network may not be powerful enough to model a complex set of training data. The other issue is the training data proportion, which imposes the same underfitting/overfitting scenario. We compared the absolute number of training data units (pixels or scenes) to the number of network parameters in our database and found that in almost 90% of cases we have fewer data units than network parameters to be tuned. Overfitting control mechanisms such as regularization are always included in the network design and alleviate the overfitting problem, but there is still a substantial difference between the remote sensing and computer vision fields, as we have very large reference datasets in the latter.
Looking at Table 4, the only case with millions of samples in remote
sensing are ISPRS datasets, but the winners are all CNNs and there is no
comparison reported with SVMs (competition is just between different
CNN architectures) so we couldn’t include them in our SVM-based
Fig. 8. Comparative performance distribution across training data size.
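The comparison of training data units against network parameters described in Section 3.4.5 can be illustrated with a minimal sketch. It assumes parameter counts written in the notation of Table A1 (for example "24.6 M", "5M", "3681"); the function names `parse_count` and `fraction_underdetermined` and the training-unit numbers are hypothetical and serve only as illustration, not as the procedure or data used in this study.

```python
import re

# Multipliers for the suffixes used in the "# of parameters" column of Table A1.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000}


def parse_count(text):
    """Convert a count such as '24.6 M', '5M', or '3681' to a number."""
    match = re.fullmatch(r"\s*([\d.]+)\s*([KM]?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized count: {text!r}")
    return float(match.group(1)) * _SUFFIX[match.group(2)]


def fraction_underdetermined(cases):
    """Share of cases with fewer training units than network parameters."""
    flags = [n_train < parse_count(params) for n_train, params in cases]
    return sum(flags) / len(flags)


# Hypothetical (training units, parameter count) pairs, for illustration only.
example_cases = [
    (1_000, "188 K"),
    (5_000, "3681"),
    (20_000, "24.6 M"),
    (800, "5M"),
]

print(fraction_underdetermined(example_cases))
```

With these made-up pairs, three of the four cases have fewer training units than parameters. Applying the same count to a real database would require the absolute number of training units for each case, which Table A1 does not list directly (it reports the training proportion instead).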
them and there is no research that compares them to an SVM-based classification. Therefore, we could not include them in Fig. 7. In both ISPRS datasets the best results in recent years are achieved by transfer learning & fine-tuning. The best case was based on ResNet-101 with overall accuracy of 91.1% for both datasets, followed by VGG-16 and SegNet-based cases with overall accuracy of 90.3%. Training proportion is standardized at 30% for Vaihingen and 45% for Potsdam (except the ResNet-101 case, where the training proportion was 47% and 63%, respectively). These implementations are large networks, but a recent paper (Zhang et al., 2018b) has also achieved accuracies of 89.4% for Potsdam and 88.4% for Vaihingen with a small supervised network with far fewer parameters than the transferred networks above (but with the training proportion increased to 70–75%). In almost all cases additional enhancement techniques such as joint segmentation, CRF processing or multiscale blocks have been implemented to boost the performance a bit further.

The above datasets, while used extensively for classification assessment, should be avoided in the future. They are too small to match the generalization capabilities of deep networks, and in most cases there are already algorithms that reach 100% accuracy, therefore offering limited opportunities for improvement. It is necessary to develop new, large and multi-sensory datasets for remote sensing image classification, especially for hyperspectral data, to help better investigate the potential of deep networks.

4. Concluding remarks

While the number of case studies precluded detailed statistical analysis of the effect of each contributing factor, generally we can see that:

– Deep networks have generally better performance than spectral SVM implementations, with CNNs performing better than other deep learners. This advantage, however, diminishes when using SVM over richer features extracted from data.
– Transfer learning and fine-tuning on pre-trained CNNs offer promising results even when compared to enhanced SVM implementations, and they provide flexibility and scalability because there is no need to manually engineer the features or use a very large training dataset. However, these pre-trained networks are currently limited to RGB input data and therefore lack applicability in multi/hyperspectral data. They have also not been tested in low training proportion scenarios.
– There is no strong relationship between network complexity and accuracy gains over SVM; small to medium networks perform similarly to more complex networks.
– Contrary to popular belief, there are numerous cases of good deep network performance with training proportions of 10% or lower.

As previously noted, deep networks are important due to their ability to extract useful rich features automatically from large data sets without the need for manual feature extraction. For example, automatic feature extraction has been used in Rußwurm and Körner (2018) to automatically detect cloud occlusion in temporal remote sensing data. This automation of feature extraction also has limitations, most notably the difficulty of extracting and evaluating these features. The visualizations in deep networks rarely go further than the first two layers, which focus on very basic features like edges and gradients. There have been limited trials to describe and visualize the extracted features and even to develop methods for it (for example, see Zeiler and Fergus (2013) or Yosinski et al. (2015)), but currently research is lacking in remote sensing tasks.

We compare different studies and reflect on their findings in a collective manner. The possible reasons for deep network strengths in each individual aspect (network type, learning strategy, sampling proportion, etc.) were discussed in previous sections without going into mathematical formulas, due to the nature of meta-analysis. The majority of manuscripts reported that the SVM (or other rivals) parameters have been tuned and optimized for best performance, but there is a lack of consistency in reporting and protocol (e.g. grid search density). Establishing best optimization practices would benefit our community by limiting inconsistencies that could lead to result bias.

Another important conclusion is that algorithms are now outpacing benchmark datasets. We already see accuracy estimations exceeding 99% for some well-known datasets such as Indian Pines, Pavia Center and University, Salinas, and UC Merced. To allow deep learners to reach their full potential, it is paramount that more elaborate benchmark datasets become available with diverse spectral/spatial/temporal resolution and geographic coverage.

We could not further analyze the processing time because either it was not available in many cases, or it was not specified whether it included the entire time for optimizing meta-parameters. It is generally true that deep networks need considerably more processing time for training (though the testing/simulation process is generally quick), but with continuous increases in processing power, deep networks are readily usable, particularly by incorporating both CPUs and GPUs together. It would be interesting to evaluate the time saved by using pre-trained networks and just fine-tuning them, but currently no statistics are reported from which to extract conclusive guidance.

There are numerous design options currently offered (see Table 3). Multiscale input is particularly useful to capture geographic relationships in earth observations. Furthermore, fully convolutional networks are promising for dense semantic labeling (classification of all image pixels at once, producing a dense output map of the same size as the input image). Other researchers have added various segmentation techniques, boundary detection and correction methods and CRF/MRF post-processing and showed their benefit in enhancing classification of edge pixels. While existing comparisons suggest the potential of CNN, they do not concretely identify a winning design among different options. For example, at the ISPRS Vaihingen image segmentation contest three CNN methods were within 1.2% of overall accuracy (Sherrah, 2016; Audebert et al., 2016; and Marmanis et al., 2016b). Looking into the future, remote sensing experts will favor 3-D CNN structures over pre-processing, dimensionality reduction methods like PCA or shallow 1-D and 2-D networks. Current state-of-the-art 3-D CNN structures have already offered significant improvements and the training process is becoming easier (see Chen et al., 2016; Y. Li et al., 2017). Furthermore, our community would significantly benefit from a coordinated investment from large funding institutions to create a pre-trained DNN for remote sensing data (similar to the ImageNet for RGB images). This pre-trained network would harness the power of large data volumes while allowing fine-tuning to specific applications.

Acknowledgments

This work was supported by the USDA McIntire Stennis program, a SUNY ESF Graduate Assistantship and NASA's Land Cover Land Use Change Program (grant # NNX15AD42G).
Appendix A
Table A1
Database of collected deep network applications in remote sensing.
Other network parameters Dataset specification Best reported performances
Reference, Network type, # of parameters, Learning type, Dataset, Spatial resolution, # of channels, Training proportion, Metric type, Deep network, SVM (non-deep)
(Penatti et al., 2015) CNN 289 M Transfer learning Brazilian coffee 3 0.8 Average 83 87
(Yu et al., 2017) CNN 24.6 M Unsupervised Brazilian coffee 3 0.8 Overall accuracy 87.8 87
(Castelluccio et al., 2015) CNN 5M Transfer learning Brazilian coffee 3 Overall accuracy 91.8
(Nogueira et al., 2017) CNN 60 M Transfer learning & fine-tuning Brazilian coffee 3 0.6 Overall accuracy 94.5 87
(Wu and Prasad, 2018) CNN + RNN Semisupervised Houston 2.5 m 144 Overall accuracy 82.6 80.2
(Xu et al., 2018) CNN Supervised+ Houston 2.5 m 144 + 1 0.19 Overall accuracy 88 80.5
(Pan et al., 2018) CNN Houston 2.5 m 144 Overall accuracy 90.8
(Li et al., 2014) DBN 14.7 K Unsupervised & fine-tuning Houston 2.5 m 144 Overall accuracy 97.7 97.5
(Zabalza et al., 2016) SAE 4.2 K Unsupervised Indian Pines 20 m 200 0.05 Overall accuracy 80.7 82.1
(Ghamisi et al., 2016) CNN 188 K Supervised Indian Pines 20 m 200 0.05 Overall accuracy 83.3 78.2
(Shi and Pun, 2018) CNN 2.5 M Supervised Indian Pines 20 m 200 0.01 Overall accuracy 85.2
(Mou et al., 2018b) CNN 1.44 M Unsupervised & fine-tuning Indian Pines 20 m 200 0.05 Overall accuracy 85.8 72.8
(C. Zhao et al., 2017) SAE 30.2 K Unsupervised & fine-tuning Indian Pines 20 m 200 0.1 Overall accuracy 89.8 88.9
(W. Hu et al., 2015) CNN 80.6 K Supervised Indian Pines 20 m 220 0.2 Overall accuracy 90.2 87.6
(Pan et al., 2018) CNN Indian Pines 20 m 200 Overall accuracy 90.7
(Xing et al., 2016) SAE 241 K Unsupervised Indian Pines 20 m 200 0.5 Overall accuracy 92.1 90.6
(W. Li et al., 2017) CNN 57.9 K Supervised Indian Pines 20 m 220 0.2 Overall accuracy 94.3 88.2
(Chen et al., 2015) DBN Unsupervised & fine-tuning Indian Pines 20 m 200 0.5 Overall accuracy 96 95.5
(Li et al., 2015) SAE 21.7 M Unsupervised & fine-tuning Indian Pines 20 m 200 0.05 Overall accuracy 96.3 92.4
(Sun et al., 2017) SAE 107 K Semisupervised Indian Pines 20 m 200 0.1 Overall accuracy 96.4 80.6
(Ding et al., 2017) CNN 380 K Unsupervised Indian Pines 20 m 200 0.5 Overall accuracy 97.8
(Ma et al., 2015) SAE 14.2 K Unsupervised & fine-tuning Indian Pines 20 m 200 0.1 Overall accuracy 98.2
(Paoletti et al., 2018) CNN 96 M Supervised Indian Pines 20 m 200 0.24 Overall accuracy 98.4
(Chen et al., 2016) CNN 44.9 M Supervised Indian Pines 20 m 200 0.2 Overall accuracy 98.5 96.9
(H. Zhang et al., 2017) CNN Supervised Indian Pines 20 m 200 0.1 Overall accuracy 98.8
(Makantasis et al., 2015) CNN 97.6 K Supervised Indian Pines 20 m 224 0.8 Overall accuracy 98.9 82.7
(Y. Li et al., 2017) CNN 197 K Supervised Indian Pines 20 m 200 0.5 Overall accuracy 99.1
(Haut et al., 2018) CNN 8.9 M Supervised+ Indian Pines 20 m 200 0.5 Overall accuracy 99.8 81.3
(Sherrah, 2016) CNN 3.26 M Supervised ISPRS Potsdam 5 cm 5 0.45 Overall accuracy 84.1
(Volpi and Tuia, 2017) CNN 6.38 M Supervised ISPRS Potsdam 5 cm 5 0.45 Overall accuracy 85.8
(Maggiori et al., 2016) CNN 530 K Supervised ISPRS Potsdam 5 cm 4 0.45 Overall accuracy 87
(Zhang et al., 2018a) CNN 17 K Supervised ISPRS Potsdam 5 cm 4 0.75 Overall accuracy 89.4 82.4
(Sherrah, 2016) CNN 22.7 M Transfer learning & fine-tuning ISPRS Potsdam 5 cm 4 0.45 Overall accuracy 90.3
(Yongcheng Liu et al., 2018) CNN 481 M Transfer learning & fine-tuning ISPRS Potsdam 5 cm 4 (DSMs not used) 0.63 Overall accuracy 91.1
(Tschannen et al., 2016) CNN 30 K Supervised ISPRS Vaihingen 9 cm 5 0.3 Overall accuracy 85.5
(Paisitkriangkrai et al., 2015) CNN Supervised ISPRS Vaihingen 9 cm 5 0.3 Overall accuracy 86.9
(W. Zhao et al., 2017) CNN Supervised ISPRS Vaihingen 9 cm 4 0.1 Overall accuracy 87.1 66.6
(Volpi and Tuia, 2017) CNN 6.38 M Supervised ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 87.3
(Marcos et al., 2018) CNN 100 K Supervised ISPRS Vaihingen 9 cm 4 0.45 Overall accuracy 87.6
(Zhang et al., 2018b) CNN 17 K Supervised ISPRS Vaihingen 9 cm 4 0.7 Overall accuracy 88.4 81.7
(Maggiori et al., 2016) CNN 727 K Supervised ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 88.9
(Sherrah, 2016) CNN 3.26 M Supervised ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 89.1
(Audebert et al., 2016) CNN 32 M Transfer learning & fine-tuning ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 89.8
(Marmanis et al., 2016b) CNN 806 M Transfer learning & fine-tuning ISPRS Vaihingen 9 cm 4 0.3 Overall accuracy 90.3
(Yongcheng Liu et al., 2018) CNN 481 M Transfer learning & fine-tuning ISPRS Vaihingen 9 cm 3 (DSMs not used) 0.47 Overall accuracy 91.1
(C. Zhao et al., 2017) SAE 20.8 K Unsupervised & fine-tuning Kennedy Space 18 m 224 0.1 Overall accuracy 93.5 91.1
(Chen et al., 2016) CNN 5.85 M Supervised Kennedy Space 18 m 224 0.1 Overall accuracy 97.1 95.7
(Y. Chen et al., 2014) SAE 8.72 K Unsupervised & fine-tuning Kennedy Space 18 m 176 0.6 Overall accuracy 98.8 98.7
(Haut et al., 2018) CNN 8.8 M Supervised+ Kennedy Space 18 m 224 0.85 Overall accuracy 100 94.4
(Ishii et al., 2015) CNN 60 M Supervised Landsat 8 30 m 3 0.35 F1 71 37.2
(Mou et al., 2018a) CNN + RNN Supervised Landsat ETM 30 m 6 Overall accuracy 98 95.7
(Karalas et al., 2015) SAE 155 K Unsupervised & fine-tuning MODIS 500 sq.m 7 Average 62.8
(Zhou et al., 2017) CNN 126 M Transfer learning & fine-tuning Other 0.5 m 3 ANMRR 0.04
(Kemker et al., 2018) CNN 11.9 M Supervised+ Other 4.7 cm 6 0.25 Average 57.3 29.6
(Kemker et al., 2018) CNN 69 M Supervised+ Other 4.7 cm 6 0.25 Average 59.8 29.6
(Bittner et al., 2017) CNN 134 M Transfer learning & fine-tuning Other 0.5 m 1 F1 70
(Lagrange et al., 2015) CNN 141 M Transfer learning Other 5 cm 4 0.6 Overall accuracy 72.4 70.2
(Cao et al., 2016) CNN 60 M Transfer learning & fine-tuning Other 3 F1 72.4
(Ji et al., 2018) CNN 102 K Supervised+ Other 15 m 4 0.85 Overall accuracy 79.4 78.5
(Fu et al., 2017) CNN Supervised Other 1m 3 0.9 F1 79.5 61.5
(Tang et al., 2017) CNN Transfer learning & fine-tuning Other 3 Average precision 79.5
(Huang et al., 2018) CNN 39 M Transfer learning & fine-tuning Other 0.5 m 4 0.57 Overall accuracy 80 71.8
(Chen et al., 2018) CNN Transfer learning & fine-tuning Other 8, 16 m 3 Average precision 80
(Chen et al., 2013) DBN 4.2 M Unsupervised & fine-tuning Other 3 0.2 F1 81.7 78.4
(Marcos et al., 2018) CNN 430 K Supervised Other 5 cm 4 0.7 Overall accuracy 82.6
(Cheng et al., 2017b) CNN 14.7 M Transfer learning Other 30 m 0.2 Overall accuracy 84.3
(Yanfei Liu et al., 2018) CNN Supervised Other 4 m (MSI), 1 m (Pan) 3 0.8 Overall accuracy 85 84.7
(Yongcheng Liu et al., 2018) CNN 481 M Transfer learning & fine-tuning Other 1m 3 0.93 F1 85.6
(Geng et al., 2015) SAE 28.4 K Unsupervised & fine-tuning Other 0.38 m 1 0.5 Overall accuracy 88.1 76.9
(Lguensat et al., 2017) CNN 177 K Supervised Other 1 0.18 Overall accuracy 88.6
(Han et al., 2018) CNN 286 M Semisupervised Other 30 m Overall accuracy 88.6
(Zhao et al., 2015) DBN 379 K Unsupervised & fine-tuning Other 0.6 m 1 0.7 Overall accuracy 88.9 85.6
(Zhang et al., 2018c) CNN 226 K Supervised Other 50 cm 4 0.6 Overall accuracy 89.5 79.5
(Zhang et al., 2018a) CNN 17 K Supervised Other 50 cm 4 0.5 Overall accuracy 89.6
(Zhang et al., 2018b) CNN 17 K Supervised Other 50 cm 4 0.7 Overall accuracy 89.8 81.2
(Vakalopoulou et al., 2015) CNN 60 M Transfer learning Other 0.6 m 4 0.4 Average 90
(Qayyum et al., 2017) CNN 6.61 M Transfer learning Other 15 cm 3 0.8 Overall accuracy 90.3 83.1
(Cheng et al., 2017a) CNN 134 M Transfer learning & fine-tuning Other 30 m 0.8 Overall accuracy 90.3
(F. Zhang et al., 2015) SAE 90.3 K Unsupervised Other 1m 3 0.25 Overall accuracy 90.8 90
(Zhang et al., 2018a) CNN 17 K Supervised Other 50 cm 4 0.5 Overall accuracy 90.9
(Zhang et al., 2018c) CNN 226 K Supervised Other 50 cm 4 0.6 Overall accuracy 90.9 80.4
(Zhang et al., 2018b) CNN 17 K Supervised Other 50 cm 4 0.7 Overall accuracy 91 81.7
(Zhao and Du, 2016) CNN Supervised Other 1.8 m 8 0.15 Overall accuracy 91.1
(Huang et al., 2018) CNN 39 M Transfer learning & fine-tuning Other 1.24 m 8 0.62 Overall accuracy 91.3 80
(Khan et al., 2017) CNN 151 M Transfer learning & fine-tuning Other 25 m 3 0.9 Overall accuracy 91.3 76.5
(Han et al., 2018) CNN 286 M Semisupervised Other Overall accuracy 91.4
(X. Chen et al., 2014) CNN 395 K Supervised Other 1 F1 91.6 79.3
(L. Zhang et al., 2015) CNN 44 M Transfer learning Other 3 0.75 F1 91.8
(Khan et al., 2017) CNN 151 M Transfer learning & fine-tuning Other 25 m 3 0.9 Overall accuracy 92 74.1
(F. Zhang et al., 2017) CNN 266 K Supervised Other 1.2 m 3 0.8 Overall accuracy 92.4
(Pan et al., 2018) CNN Other 1m 84 Overall accuracy 93.2
(S. Liu et al., 2018) CNN 28.4 M Transfer learning & fine-tuning Other 3 0.67 Overall accuracy 93.4 78
(Basu et al., 2015) DBN 3.6 K Unsupervised & fine-tuning Other 4 0.8 Overall accuracy 93.9
(Längkvist et al., 2016) CNN 1.91 M Unsupervised & fine-tuning Other 0.5 m 6 0.7 Overall accuracy 94.5
(Ji et al., 2018) CNN 102 K Supervised+ Other 4m 4 0.17 Overall accuracy 94.7 93.5
(Rezaee et al., 2018) CNN 53.9 M Transfer learning & fine-tuning Other 5m 5 (3 used for CNN) 0.46 Overall accuracy 94.8
(Cui et al., 2018) CNN 9.7 K Supervised Other 2 m (MSI), 0.5 m 8 (MSI) + Pan 0.8 Overall accuracy 94.8
(Yanfei Liu et al., 2018) CNN Supervised Other 2m 3 0.8 Overall accuracy 94.8 80.3
(Xing et al., 2016) SAE 52.8 K Unsupervised Other 30 m 224 0.5 Overall accuracy 95.5 96.9
(Gong et al., 2017) SAE 81 K Unsupervised & fine-tuning Other 2m 4 0.5 Overall accuracy 95.7 94.4
(Ma et al., 2016) CNN Supervised Other 4 0.8 Overall accuracy 96
(W. Zhao et al., 2017) CNN Supervised Other 0.5 m 8 0.1 Overall accuracy 96.3 66.5
(Han et al., 2018) CNN 286 M Semisupervised Other 3 Overall accuracy 96.8
(Hu et al., 2017) CNN Supervised Other 1m 161 Overall accuracy 97 93.6
(Yu et al., 2016) DBN 2.43 M Unsupervised & fine-tuning Other 0.27 m 3 F1 97
(Wu and Prasad, 2018) CNN + RNN Semisupervised Other 1m 360 Overall accuracy 97.3 95.2
(Wang et al., 2018) CNN 252 K Supervised Other 3 0.74 Overall accuracy 97.3 93.7
(Basu et al., 2015) DBN 3.6 K Unsupervised & fine-tuning Other 4 0.8 Overall accuracy 97.9
(Xu et al., 2018) CNN Supervised+ Other 1m 63 + 1 0.03 Overall accuracy 97.9 92.7
(Nogueira et al., 2017) CNN 5M Transfer learning & fine-tuning Other 0.5 m 3 0.6 Overall accuracy 98 90
(Tao et al., 2018) CNN Supervised Other 0.5–4 m 4 0.008 Overall accuracy 98.4 89.2
(Ma et al., 2016) CNN Supervised Other 4 0.8 Overall accuracy 98.4
(Gong et al., 2018) CNN 139 M Transfer learning & fine-tuning Other 2m 3 0.8 Overall accuracy 98.5 77.7
(Wang et al., 2015) CNN 438 K Supervised Other 3 0.6 Overall accuracy 98.7
(Fan Zhang et al., 2016) CNN Supervised Other 1m 3 0.2 Overall accuracy 98.8
(Weng et al., 2018) CNN 3.4 M Transfer learning Other 0.25 Overall accuracy 98.8 91.3
(Gong et al., 2018) CNN 139 M Transfer learning & fine-tuning Other 3 0.8 Overall accuracy 98.8
(Ji et al., 2018) CNN 107 K Supervised+ Other 4m 4 0.03 Overall accuracy 98.9 96.5
(Maggiori et al., 2017) CNN 459 K Supervised Other 1m 3 0.9 Overall accuracy 99.5 94.9
(Y. Li et al., 2017) CNN 128 K Supervised Other 30 m 242 0.5 Overall accuracy 99.6
(Basaeed et al., 2016) CNN 56.4 K Supervised Other 30 m 10 0.75 Overall accuracy 99.7
(Weng et al., 2018) CNN 3.4 M Transfer learning Other 0.5 Overall accuracy 99.7
(W. Zhao et al., 2017) CNN Supervised Pavia Center 1.3 m 103 0.1 Overall accuracy 96.3 92.98
(Shi and Pun, 2018) CNN 673 K Supervised Pavia Center 1.3 m 103 0.001 Overall accuracy 97
(Aptoula et al., 2016) CNN 1.31 M Supervised Pavia Center 1.3 m 103 0.05 Kappa 97.4
(Zabalza et al., 2016) SAE 2.4 K Unsupervised Pavia Center 1.3 m 103 0.05 Overall accuracy 97.4 97.4
(Ben Hamida et al., 2018) CNN 3681 Supervised Pavia Center 1.3 m 103 0.05 Overall accuracy 98.9
(Tao et al., 2015) SAE Unsupervised Pavia Center 1.3 m 103 0.05 Overall accuracy 99.6
(Zhao and Du, 2016) CNN Supervised Pavia Center 1.3 m 103 0.05 Overall accuracy 99.7 97.7
(Makantasis et al., 2015) CNN 10.9 K Supervised Pavia Center 1.3 m 103 0.8 Overall accuracy 99.9 99
(Ghamisi et al., 2016) CNN 188 K Supervised Pavia University 1.3 m 103 0.1 Overall accuracy 83.4 78.2
(Mou et al., 2018b) CNN 1.39 M Unsupervised & fine-tuning Pavia University 1.3 m 103 0.1 Overall accuracy 87.4 79.9
(Wu and Prasad, 2018) CNN + RNN Semisupervised Pavia University 1.3 m 103 Overall accuracy 88.4 81.2
(Ding et al., 2017) CNN 226 K Unsupervised Pavia University 1.3 m 100 0.5 Overall accuracy 90.6
(W. Hu et al., 2015) CNN 80.6 K Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 92.6 90.5
(Yue et al., 2015) CNN 182 K Supervised Pavia University 1.3 m 103 Overall accuracy 95.2 85.2
(Xing et al., 2016) SAE 212 K Unsupervised Pavia University 1.3 m 103 0.5 Overall accuracy 96 93.6
(Zhao et al., 2015) CNN 239 K Unsupervised Pavia University 1.3 m 103 0.1 Overall accuracy 96.4 85.2
(W. Li et al., 2017) CNN 57.9 K Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 96.5 90.6
(Zhao and Du, 2016) CNN Supervised Pavia University 1.3 m 103 0.1 Overall accuracy 96.8 80.1
(Ben Hamida et al., 2018) CNN 6862 Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 97.2
(Paoletti et al., 2018) CNN 173 M Supervised Pavia University 1.3 m 103 0.04 Overall accuracy 97.8
(Aptoula et al., 2016) CNN 1.31 M Supervised Pavia University 1.3 m 103 0.1 Kappa 97.9
(Shi and Pun, 2018) CNN 673 K Supervised Pavia University 1.3 m 103 0.01 Overall accuracy 98.5
(Y. Chen et al., 2014) SAE 29 K Unsupervised & fine-tuning Pavia University 1.3 m 103 0.6 Overall accuracy 98.5 97.4
(Tao et al., 2015) SAE Unsupervised Pavia University 1.3 m 103 0.1 Overall accuracy 98.6
(Ma et al., 2015) SAE 10 K Unsupervised & fine-tuning Pavia University 1.3 m 103 0.1 Overall accuracy 98.7
(Sun et al., 2017) SAE 30.2 K Semisupervised Pavia University 1.3 m 103 0.1 Overall accuracy 98.7 91.1
(Xu et al., 2018) CNN Supervised+ Pavia University 1.3 m 103 0.04 Overall accuracy 99.1 89.9
(Chen et al., 2015) DBN Unsupervised & fine-tuning Pavia University 1.3 m 103 0.5 Overall accuracy 99.1 98.4
(Y. Li et al., 2017) CNN 110 K Supervised Pavia University 1.3 m 103 0.5 Overall accuracy 99.4
(Makantasis et al., 2015) CNN 10.9 K Supervised Pavia University 1.3 m 103 0.8 Overall accuracy 99.6 93.9
(H. Zhang et al., 2017) CNN Supervised Pavia University 1.3 m 103 0.05 Overall accuracy 99.7
(Chen et al., 2016) CNN 5.85 M Supervised Pavia University 1.3 m 103 0.1 Overall accuracy 99.7 97.7
(Zhou et al., 2017) CNN 126 M Transfer learning & fine-tuning RSSCN7 3 ANMRR 0.3
(Zou et al., 2015) DBN 3.1 M Unsupervised & fine-tuning RSSCN7 3 0.5 Average 77
(Wu et al., 2016) SAE 2.53 M Unsupervised RSSCN7 3 0.5 Overall accuracy 90.4
(W. Hu et al., 2015) CNN 80.6 K Supervised Salinas 3.7 m 220 0.05 Overall accuracy 92.6 91.7
(W. Li et al., 2017) CNN 57.9 K Supervised Salinas 3.7 m 204 0.05 Overall accuracy 94.8 92.9
(Xu et al., 2018) CNN Supervised+ Salinas 3.7 m 204 0.06 Overall accuracy 97.7 92.2
(Ma et al., 2015) SAE 37.7 K Unsupervised & fine-tuning Salinas 3.7 m 204 0.01 Overall accuracy 98.3
(Makantasis et al., 2015) CNN 10.9 K Supervised Salinas 3.7 m 224 0.8 Overall accuracy 99.5 94
(Haut et al., 2018) CNN 8.9 M Supervised+ Salinas 3.7 m 204 0.5 Overall accuracy 99.9 91.1
(Zhou et al., 2017) CNN 126 M Transfer learning & fine-tuning UC Merced 1 ft 3 ANMRR 0.33
(Zhou et al., 2015) SAE 51.6 K Unsupervised UC Merced 1 ft 3 Average 64.5
(F. Zhang et al., 2015) SAE 301 K Unsupervised UC Merced 1 ft 3 0.8 Overall accuracy 82.7 81.7
(Romero et al., 2016) CNN 49.1 M Unsupervised UC Merced 1 ft 3 0.8 Overall accuracy 84.5
(Yu et al., 2017) CNN 24.6 M Unsupervised UC Merced 1 ft 3 0.8 Overall accuracy 88.57 81.7
(Marmanis et al., 2016a) CNN 155 M Transfer learning & fine-tuning UC Merced 1 ft 3 0.7 Overall accuracy 92.4
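
For rows of Table A1 that report both a deep network accuracy and an SVM accuracy, the accuracy gain is simply the difference between the two columns. The sketch below shows one possible way to summarize such gains by network type. It uses six Indian Pines rows copied from the table; the function names `gains_by_type` and `median_gain_by_type` are hypothetical, and the use of the median is an assumption made for illustration rather than the statistic used in this meta-analysis.

```python
from collections import defaultdict
from statistics import median

# (reference, network type, deep network accuracy, SVM accuracy), copied from
# six Indian Pines rows of Table A1 that report both values.
ROWS = [
    ("Zabalza et al., 2016", "SAE", 80.7, 82.1),
    ("Ghamisi et al., 2016", "CNN", 83.3, 78.2),
    ("Mou et al., 2018b", "CNN", 85.8, 72.8),
    ("C. Zhao et al., 2017", "SAE", 89.8, 88.9),
    ("W. Hu et al., 2015", "CNN", 90.2, 87.6),
    ("Xing et al., 2016", "SAE", 92.1, 90.6),
]


def gains_by_type(rows):
    """Group accuracy gains (deep minus SVM, in percentage points) by network type."""
    grouped = defaultdict(list)
    for _reference, net_type, deep, svm in rows:
        # Round to one decimal to match the precision reported in the table.
        grouped[net_type].append(round(deep - svm, 1))
    return dict(grouped)


def median_gain_by_type(rows):
    """Median accuracy gain over SVM for each network type."""
    return {net_type: median(gains) for net_type, gains in gains_by_type(rows).items()}


print(median_gain_by_type(ROWS))
```

Six rows from a single dataset are far too few to support any conclusion; the sketch only makes the gain computation concrete. A negative gain, as in the first row, marks a case where the SVM outperformed the deep network.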
