6232 IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, VOL. 54, NO. 10, OCTOBER 2016

Deep Feature Extraction and Classification of Hyperspectral Images Based on Convolutional Neural Networks
Yushi Chen, Member, IEEE, Hanlu Jiang, Chunyang Li, Xiuping Jia, Senior Member, IEEE, and
Pedram Ghamisi, Member, IEEE

Abstract: Due to the advantages of deep learning, in this paper, a regularized deep feature extraction (FE) method is presented for hyperspectral image (HSI) classification using a convolutional neural network (CNN). The proposed approach employs several convolutional and pooling layers to extract deep features from HSIs, which are nonlinear, discriminant, and invariant. These features are useful for image classification and target detection. Furthermore, in order to address the common issue of imbalance between high dimensionality and limited availability of training samples for the classification of HSI, a few strategies such as L2 regularization and dropout are investigated to avoid overfitting in class data modeling. More importantly, we propose a 3-D CNN-based FE model with combined regularization to extract effective spectral–spatial features of hyperspectral imagery. Finally, in order to further improve the performance, a virtual sample enhanced method is proposed. The proposed approaches are carried out on three widely used hyperspectral data sets: Indian Pines, University of Pavia, and Kennedy Space Center. The obtained results reveal that the proposed models with sparse constraints provide results competitive with state-of-the-art methods. In addition, the proposed deep FE opens a new window for further research.

Index Terms: Convolutional neural network (CNN), deep learning, feature extraction (FE), hyperspectral image (HSI) classification.

Manuscript received August 1, 2015; revised February 16, 2016; accepted June 12, 2016. Date of publication July 18, 2016; date of current version August 11, 2016. This work was supported in part by the Fundamental Research Funds for the Central Universities under Grant HIT.NSRIF.2013028 and in part by the National Natural Science Foundation of China under Grant 61301206. (Corresponding author: Yushi Chen.)
Y. Chen, H. Jiang, and C. Li are with the Department of Information Engineering, School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin 150001, China (e-mail: chenyushi@hit.edu.cn; halo91@163.com; lcy_buzz@mail.dlut.edu.cn).
X. Jia is with the School of Engineering and Information Technology, The University of New South Wales, Canberra, A.C.T. 2600, Australia (e-mail: x.jia@adfa.edu.au).
P. Ghamisi is with Signal Processing in Earth Observation, Technische Universität München, 80333 Munich, Germany, and also with the Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), 82234 Weßling, Germany (e-mail: p.ghamisi@gmail.com).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TGRS.2016.2584107
0196-2892 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

Hyperspectral images (HSIs) are usually composed of several hundreds of spectral data channels of the same scene. The detailed spectral information provided by hyperspectral sensors increases the power of accurately differentiating materials of interest with increased classification accuracy. Moreover, with respect to advances in hyperspectral technology, the fine spatial resolution of recently operated sensors makes the analysis of small spatial structures in images possible [1]. The aforementioned advances make the hyperspectral data a useful tool for a wide variety of applications.

By increasing the dimensionality of the images in the spectral domain, theoretical and practical problems may arise. In this manner, conventional techniques which are developed for multispectral data are no longer efficient for the processing of high-dimensional data, mostly due to the so-called curse of dimensionality [2]. In order to address the curse of dimensionality, feature extraction (FE) is considered as a crucial step in HSI processing [3]. However, due to the spatial variability of spectral signatures, HSI FE is still a challenging task [4].

In the early stage of the study on HSI FE, the focus was on spectral-based methods, including principal component analysis (PCA) [5], independent component analysis (ICA) [6], linear discriminant analysis [7], etc. [8], [9]. These methods apply linear transformations to extract potentially better features of the input data in the new domain. With respect to the complex light-scattering mechanisms of natural objects (e.g., vegetation), hyperspectral data are inherently nonlinear [10], [11], which makes linear transformation-based methods not that suitable for the analysis of such data.

Since 2000, when two papers on manifold learning were published in Science [12], [13], manifold learning has become a hot topic in many research areas, including hyperspectral remote sensing. Manifold learning attempts to find the intrinsic structure of nonlinearly distributed data, which is expected to be highly useful for hyperspectral FE [14].

Alternatively, the nonlinear problem can be addressed by kernel-based algorithms for data representation [15]. Kernel methods map the original data into a higher dimensional Hilbert space and offer a possibility of converting a nonlinear problem to a linear one [16].

Recent studies have suggested incorporating spatial information into a spectral-based FE system [17]. With the development of imaging technology, hyperspectral sensors can provide good spatial resolution. As a result, detailed spatial information has become available [18]. It has been found that spectral–spatial FE methods provide good improvement in terms of classification performance [19]. In [20], a method was introduced based on the fusion of morphological operators and support vector machine (SVM), which leads to high classification
accuracy. In [21], the proposed framework extracted the spatial and spectral information using loopy belief propagation and active learning. The sparse representation [22] of extended morphological attribute profile was investigated to incorporate spatial information in remote sensing image classification in [23], which further improves classification accuracy. In the hyperspectral remote sensing community, most of the current FE methods consider only one-layer processing, which downgrades the capacity of feature learning.

Most FE and classification methods are not based on a “deep” manner. The widely used PCA and ICA are single-layer learning methods [24]. Classifiers such as linear SVMs and logistic regression (LR) can be regarded as single-layer classifiers, whereas decision tree or kernel SVMs are believed to have two layers [24].

On the other hand, it is found in neuroscience that the primate visual system is characterized by a sequence of different levels of processing (on the order of 10), and this kind of learning system performs very well in the tasks of object recognition [25]. Deep learning-based methods, which include two or more layers to extract new features, are designed to simulate the process from the retina to the cortex, and these deep architectures have a potential to yield high performances in image classification and target detection [26], [27].

Undesired scattering from other objects may deform the spectral characteristics of the object of interest. Furthermore, other factors such as different atmospheric scattering conditions and intraclass variability make it extremely difficult to extract the features of hyperspectral data effectively. To address such issues, deep architecture is known as a promising option since it can potentially lead to more abstract features at high levels, which are generally robust and invariant [28].

Very recently, some deep models have been proposed for hyperspectral remote sensing image processing [49]. To the best of our knowledge, a deep learning method, i.e., stacked autoencoder (SAE), was proposed for HSI classification in 2014 [29]. Later, an improved autoencoder was proposed based on sparse constraint [50]. In 2015, another deep model, entitled deep belief network (DBN), was proposed [30]. The deep models could extract the robust features and outperform other methods in terms of classification accuracy. However, due to the full connection of different layers in the aforementioned approaches, they require training a large number of parameters, which is an undesirable factor due to the lack of available training samples. Furthermore, SAE and DBN cannot extract the spatial information efficiently because they need to represent the spatial information as a vector before the training stage.

Convolutional neural network (CNN) uses local connections to effectively extract the spatial information and shared weights to significantly reduce the number of parameters. Very recently, an unsupervised convolutional network has been proposed for remote sensing image analysis. This method uses greedy layerwise unsupervised pretraining to formulate a deep CNN model [31].

Compared with the unsupervised method, supervised CNN may extract more effective features with the help of class-specific information, which can be provided by training samples. To extract the spectral and spatial information of hyperspectral data simultaneously, it is reasonable to formulate a 3-D CNN. Furthermore, to address the problem of overfitting caused by limited training samples of hyperspectral data, we design a combined regularization strategy, including rectified linear unit (ReLU) and dropout, to achieve better model generalization.

In this paper, we investigate the application of supervised CNN, which is one of the deep models, in HSI FE and develop a 3-D CNN model for effective spectral-and-spatial-based HSI classification. It is challenging to apply deep learning to HSI since its data structure is complex and the number of training samples is limited. In computer vision, the number of training samples varies from tens of thousands to tens of millions [32], [33], whereas having such a large number of training samples is not common in hyperspectral remote sensing classification. In general, a neural network has a powerful representation capability with abundant training samples. Without enough training samples, a neural network faces a problem of “overfitting,” which means that the classification performance on test data will be downgraded. This problem is expected when deep learning is applied to remote sensing data, and this paper presents a solution to make such approaches feasible for situations when only a limited number of training samples is available. We use several regularization methods, including L2 regularization and dropout, to handle the overfitting issue.

The main goal of this paper is to propose a deep FE method for HSI classification. With the help of training samples, the proposed CNN models extract the abstract and robust features of HSI, which are important for classification. In more detail, the main contributions are listed as follows.

1) Three deep FE architectures based on a CNN are proposed to extract the spectral, spatial, and spectral–spatial features of HSI. The designed 3-D CNN can extract the spectral–spatial features effectively, which leads to better classification performance.
2) To address the problem of overfitting caused by the limited number of training samples, some regularization strategies, including L2 regularization and dropout, are used in the training process.
3) In order to further improve the performance, a virtual sample enhanced method is proposed to create training samples from the imaging procedure perspective.
4) The hierarchical features of different depth extracted from HSI are visualized and analyzed for the first time.
5) The proposed methods are applied on three well-known hyperspectral data sets. In this context, we compared the proposed methods with some traditional methods from different perspectives such as classification accuracy, analysis of complexities, and processing time.

The remainder of this paper is organized as follows: Section II presents the description of CNN and 1-D CNN-based HSI spectral FE frameworks. Sections III and IV present the spatial and spectral–spatial FE frameworks for HSI classification, respectively. The virtual sample enhanced CNN is introduced in Section V. The experiments conducted are reported in Section VI. We conclude this paper in Section VII with some discussions.
II. ONE-DIMENSIONAL CNN-BASED HSI FE AND CLASSIFICATION

A. Neural Network and Deep Learning

How to find effective features is the core issue in image classification and pattern recognition. Humans have an amazing skill in extracting meaningful features, and a lot of research projects have been undertaken in the last several decades to build an FE system as capable as a human. Deep learning is a newly developed approach aiming for artificial intelligence.

Deep learning-based methods build a network with several layers, typically deeper than three layers. A deep neural network (DNN) can represent complicated data. However, it is very difficult to train the network. Due to the lack of a proper training algorithm, it was difficult to harness this powerful model until Hinton and Salakhutdinov proposed a deep learning idea [27].

Deep learning involves a class of models that try to learn multiple levels of data representation, which helps to take advantage of input data such as image, speech, and text. A deep learning model is usually initialized via unsupervised learning and followed by fine-tuning in a supervised manner. The high-level features can be learnt from the low-level features. This kind of learning leads to the extraction of abstract and invariant features, which is beneficial for a wide variety of tasks such as classification and target detection.

There are a few deep learning models in the literature, including DBN [34], [35], SAE [36], and CNN [28]. Recently, CNNs have been found to be a good alternative to other deep learning models in classification [38], [39] and detection [40]. In this paper, we investigate the application of deep CNN for HSI FE.

B. CNN (1-D CNN)

The human visual system can tackle classification, detection, and recognition issues very effectively. Therefore, machine learning researchers have developed advanced data processing methods in recent years based on the inspirations from biological visual systems [25].

CNN is a special type of DNN that is inspired by neuroscience. From Hubel’s earlier work, we know that the cells in the cortex of the human vision system are sensitive to small regions. The responses of cells within receptive fields have a strong capability to exploit the local spatial correlation in images.

Additionally, there are two types of cells within the visual cortex, i.e., simple cells and complex cells. While simple cells detect local features, complex cells “pool” the outputs of simple cells within a neighborhood. In other words, simple cells are sensitive to specific edge-like patterns within their receptive field, whereas complex cells have large receptive fields and they are locally invariant.

The architecture of CNN is different from other deep learning models. There are two special aspects in the architecture of CNN, i.e., local connections and shared weights. CNN exploits the local correlation using local connectivity between the neurons of adjacent layers. We illustrate this in Fig. 1, where the neurons in the mth layer are connected to three adjacent neurons in the (m − 1)th layer, as an example. In CNN, some connections between neurons are replicated across the entire layer, which share the same weights and biases. In Fig. 2, the same color indicates the same weight. Using a specific architecture like local connections and shared weights, CNN tends to provide better generalization when facing computer vision problems.

Fig. 1. Local connections in the architecture of the CNN.

Fig. 2. Shared weights in the architecture of the CNN.

A complete CNN stage contains a convolution layer and a pooling layer. Deep CNN is constructed by stacking several convolution layers and pooling layers to form deep architecture.

The convolutional layer is introduced first. The value of a neuron $v_{ij}^{x}$ at position x of the jth feature map in the ith layer is denoted as follows:

\[ v_{ij}^{x} = g\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} w_{ijm}^{p} \, v_{(i-1)m}^{x+p} \right) \tag{1} \]

\[ g(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2} \]

where m indexes the feature map in the previous layer ((i − 1)th layer) connected to the current feature map, $w_{ijm}^{p}$ is the weight of position p connected to the mth feature map, $P_i$ is the width of the kernel along the spectral dimension, and $b_{ij}$ is the bias of the jth feature map in the ith layer.

Pooling can offer invariance by reducing the resolution of the feature maps [37]. Each pooling layer corresponds to the previous convolutional layer. The neuron in the pooling layer combines a small N × 1 patch of the convolution layer. The most common pooling operation is max pooling, which is used throughout this paper. The max pooling is as follows:

\[ a_j = \max_{N \times 1} \left( a_i^{n \times 1} \, u(n, 1) \right) \tag{3} \]

where u(n, 1) is a window function to the patch of the convolution layer, and $a_j$ is the maximum in the neighborhood.

All layers, including the convolutional layers and pooling layers of the deep CNN model, are trained using a back-propagation algorithm.

C. Spectral FE Framework for HSI Classification

In this section, we present a 1-D FE method considering only spectral information. This method stacks several CNNs to develop a deep CNN model with L2 regularization. Generally, the classification of HSI includes two procedures, including FE
and classification. In the FE procedure, LR is taken into account to adjust the weights and biases in the back-propagation. After the training, the learned features can be used in conjunction with classifiers such as LR, K-nearest neighbor (KNN), and SVMs [1].

The proposed architecture is shown in Fig. 3. The input of the system is a pixel vector of hyperspectral data, and the output of the system is the label of the pixel vector. It consists of several convolutional and pooling layers and an LR layer. In Fig. 3, as an example, the flexible CNN model includes two convolution layers and two pooling layers. There are three feature maps in the first convolution layer and six feature maps in the second convolution layer.

Fig. 3. Architecture of deep CNN with spectral FE of HSI.

After several layers of convolution and pooling, the input pixel vector can be converted into a feature vector, which captures the spectral information in the input pixel vector. Finally, we use LR or other classifiers to fulfill the classification step.

The power of CNN depends on the connections (weights) of the network; hence, it is very important to find a set of proper weights. Gradient back-propagation is the fundamental algorithm for all kinds of neural networks. In this paper, the model parameters are initialized randomly and trained by an error back-propagation algorithm.

Before setting an updating rule for the weights, one needs to properly set an “error” measure, i.e., a cost function. There are several ways to define such a cost function. In our implementation, a mini-batch update strategy is adopted, which is suitable for large data set processing, and the cost is computed on a mini-batch of inputs [37]

\[ c_0 = -\frac{1}{m} \sum_{i=1}^{m} \left[ x_i \log(z_i) + (1 - x_i) \log(1 - z_i) \right]. \tag{4} \]

Here, m denotes the mini-batch size. The variables $x_i$ and $z_i$ denote the ith label and the ith predicted label in the mini-batch, respectively (the logarithm in (4) is applied to the prediction). The summation over i is done over the whole mini-batch. Our goal is to optimize (4) using mini-batch stochastic gradient descent.

LR is a type of probabilistic statistical classification model. It measures the relation between a categorical variable and the input variables using probability scores as the predicted values of the input variables.

To perform classification by utilizing the learned features from the CNN, we employ an LR classifier, which uses softmax as its output-layer activation. Softmax ensures that the activations of the output units sum to 1 so that we can deem the output as a set of conditional probabilities. For a given input vector R, the probability that the input belongs to category i can be estimated as follows:

\[ P(Y = i \mid R, W, b) = s(WR + b) = \frac{e^{W_i R + b_i}}{\sum_{j} e^{W_j R + b_j}} \tag{5} \]

where W and b are the weights and biases of the LR layer, and the summation is done over all the output units.

In the LR, the size of the output layer is set to be the same as the total number of classes defined, and the size of the input layer is set to be the same as the size of the output layer of the CNN. Since the LR is implemented as a single-layer neural network, it can be merged with the former layers of networks to form a deep classifier.

D. L2 Regularization of CNN

Overfitting is a common problem of neural network approaches, which means that the classification results can be very good on the training data set but poor on the test data set. In this case, HSI will be classified with low accuracy. The number of training samples is limited in HSI classification, which often leads to the problem of overfitting.

To avoid overfitting, it is necessary to adopt additional techniques such as regularization. In this section, we introduce L2 regularization in the proposed model, which penalizes models with extreme parameter values [41].

L2 regularization encourages the sum of the squares of the parameters to be small, which can be added to learning algorithms that minimize a cost function. Equation (4) is then modified to

\[ c = c_0 + \frac{\lambda}{2m} \sum_{j=1}^{N} w_j^2 \tag{6} \]

where m denotes the mini-batch size, N is the number of weights, and λ is a free parameter that needs to be tuned empirically. In addition, the coefficient 1/2 is used to simplify the process of the derivation.

In (6), one can see that L2 regularization can make w small. In most cases, it can help reduce the variance of the model and thus mitigate the overfitting problem.
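As an illustration of how (1)–(6) fit together, the following is a minimal NumPy sketch of one forward pass through a single convolution and pooling stage followed by the softmax LR layer and the L2-regularized cost. It is not the authors' implementation: the function names, the 200-band input, the kernel width of 5, the three feature maps, the nine classes, the value of λ, and the small constant `eps` are all assumptions chosen for the example, and the cost is summed over the output units as well as the mini-batch.

```python
import numpy as np

def conv1d_layer(v_prev, w, b):
    """Eqs. (1)-(2): v_prev is (M, X), w is (J, M, P), b is (J,)."""
    n_maps, _, width = w.shape
    length = v_prev.shape[1] - width + 1
    out = np.empty((n_maps, length))
    for j in range(n_maps):
        for x in range(length):
            # Sum over input maps m and kernel offsets p, plus the bias b_ij.
            out[j, x] = b[j] + np.sum(w[j] * v_prev[:, x:x + width])
    return np.tanh(out)

def max_pool1d(a, n):
    """Eq. (3): maximum over non-overlapping N x 1 patches of each map."""
    usable = (a.shape[1] // n) * n
    return a[:, :usable].reshape(a.shape[0], -1, n).max(axis=2)

def softmax(scores):
    """Eq. (5): class probabilities from the LR scores W R + b."""
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()

def regularized_cost(targets, outputs, weights, lam):
    """Eq. (4) plus the L2 penalty of Eq. (6); rows index the mini-batch."""
    m = targets.shape[0]
    eps = 1e-12  # guards log(0); an assumption, not part of the formula
    c0 = -np.sum(targets * np.log(outputs + eps)
                 + (1 - targets) * np.log(1 - outputs + eps)) / m
    penalty = lam / (2 * m) * sum(np.sum(w ** 2) for w in weights)
    return c0 + penalty

rng = np.random.default_rng(0)
pixel = rng.random((1, 200))              # hypothetical pixel: 1 map, 200 bands
w1, b1 = rng.normal(0, 0.1, (3, 1, 5)), np.zeros(3)
features = max_pool1d(conv1d_layer(pixel, w1, b1), 2).ravel()
W, b = rng.normal(0, 0.1, (9, features.size)), np.zeros(9)
probs = softmax(W @ features + b)
target = np.eye(9)[[2]]                   # one-hot label for class 2, shape (1, 9)
cost_plain = regularized_cost(target, probs[None, :], [w1, W], lam=0.0)
cost_l2 = regularized_cost(target, probs[None, :], [w1, W], lam=0.1)
```

With these assumed sizes, the 200-band pixel becomes three maps of length 196 after convolution and a 294-element feature vector after pooling. The explicit loops mirror the summation indices of (1) and are written for readability, not speed.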
III. TWO-DIMENSIONAL CNN-BASED HSI FE AND CLASSIFICATION

A. Two-Dimensional CNN

A complete 2-D CNN layer contains a convolutional layer and a pooling layer. The 2-D convolutional layer is obtained by the extension of (1). The value of a neuron $v_{ij}^{xy}$ at position (x, y) of the jth feature map in the ith layer is denoted as follows:

\[ v_{ij}^{xy} = g\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} w_{ijm}^{pq} \, v_{(i-1)m}^{(x+p)(y+q)} \right) \tag{7} \]

where m indexes the feature map in the (i − 1)th layer connected to the current (jth) feature map, $w_{ijm}^{pq}$ is the weight of position (p, q) connected to the mth feature map, $P_i$ and $Q_i$ are the height and the width of the spatial convolution kernel, and $b_{ij}$ is the bias of the jth feature map in the ith layer.

Pooling is carried out in a similar way to the 1-D CNN. The neuron in the pooling layer combines a small n × n patch of the convolutional layer.

B. Fine-Tuning and Classification

Based on the theory described previously, a variety of CNN architectures can be developed. In this section, we present the designed CNN for a single band (the first principal component) of HSI. The architecture is shown in Fig. 4.

Fig. 4. Architecture of CNN with spatial features for HSI classification. The first step of processing is PCA along the spectral dimension, and then CNN is introduced to extract layerwise deep features.

We choose K × K neighborhoods of a current pixel as the input to the 2-D CNN model. Then, we build deep CNN to extract the useful features. Each layer of CNN contains 2-D convolution and pooling. When the spatial resolution of the image is not very high, a 4 × 4 kernel or a 5 × 5 kernel can be selected to run convolution and a 2 × 2 kernel for pooling.

After several layers of convolution and pooling, the input image can be represented by some feature vectors, which capture the spatial information contained in the K × K neighborhood region of the input pixel. Then, the learned features are fed to the LR for classification.

IV. THREE-DIMENSIONAL CNN-BASED HSI FE AND CLASSIFICATION

A. Three-Dimensional CNN

We can see from Sections II and III that the 1-D CNN extracts spectral features and the 2-D CNN extracts the local spatial features of each pixel. We further develop 3-D CNN to learn both spatial and spectral features of HSI. Fig. 5 shows the comparison of 2-D and 3-D convolutions.

Fig. 5. Comparison of (a) 2-D and (b) 3-D convolutions. In (b), the size of the convolution kernel in the spectral dimension is 3 and the weights are color-coded.

The value of a neuron $v_{ij}^{xyz}$ at position (x, y, z) of the jth feature map in the ith layer is given by

\[ v_{ij}^{xyz} = g\left( \sum_{m} \sum_{p=0}^{P_i - 1} \sum_{q=0}^{Q_i - 1} \sum_{r=0}^{R_i - 1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} + b_{ij} \right) \tag{8} \]

where m indexes the feature map in the (i − 1)th layer connected to the current (jth) feature map, and $P_i$ and $Q_i$ are the height and the width of the spatial convolution kernel. $R_i$ is the size of the kernel along the spectral dimension, $w_{ijm}^{pqr}$ is the value of position (p, q, r) connected to the mth feature map, and $b_{ij}$ is the bias of the jth feature map in the ith layer.

Through 3-D convolution, CNN can extract the spatial and spectral information of hyperspectral data simultaneously. The learned features are useful for the further image classification step.

B. Spectral–Spatial FE Framework

Hyperspectral remote sensing images contain both spatial and spectral information. In this section, we integrate the spectral and spatial features together to construct a joint spectral–spatial classification framework using a 3-D CNN.
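To make the index structure of the 3-D convolution in (8) concrete, the following NumPy sketch computes one output feature map directly from the quadruple sum. It is only an illustrative sketch: the function name, the 9 × 9 × 40 input cube, and the 4 × 4 × 8 kernel are assumed values for the example, and the activation g defaults to tanh as in (2).

```python
import numpy as np

def conv3d_feature_map(v_prev, w, b, g=np.tanh):
    """Eq. (8) for one output map j.

    v_prev: (M, X, Y, Z) feature maps of the previous layer
    w:      (M, P, Q, R) kernel weights w_ijm^{pqr}
    b:      scalar bias b_ij
    """
    _, X, Y, Z = v_prev.shape
    _, P, Q, R = w.shape
    out = np.empty((X - P + 1, Y - Q + 1, Z - R + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                # Sum over m, p, q, r of w * v at offset (x+p, y+q, z+r).
                patch = v_prev[:, x:x + P, y:y + Q, z:z + R]
                out[x, y, z] = np.sum(w * patch) + b
    return g(out)

cube = np.ones((1, 9, 9, 40))     # hypothetical 9 x 9 neighborhood, 40 bands
kernel = np.ones((1, 4, 4, 8))    # hypothetical 4 x 4 x 8 kernel
# Identity activation so the raw sums can be inspected.
fmap = conv3d_feature_map(cube, kernel, b=0.0, g=lambda t: t)
```

With an all-ones cube and kernel, every output value equals the kernel volume 4 · 4 · 8 = 128, and the output shape is (6, 6, 33). This shows that the kernel slides along the spectral axis as well as the two spatial axes, which is what distinguishes (8) from (7).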
Fig. 6 shows the architecture of 3-D CNN for HSI classification. We choose K × K × B neighborhoods of a pixel as an input to the 3-D CNN model, in which B is the number of bands. Each layer of CNN contains 3-D convolution and pooling. As an example, a 4 × 4 × 32 kernel or a 5 × 5 × 32 kernel can be applied to 3-D convolution, and a 2 × 2 kernel can be applied for subsampling. After performing a deep 3-D CNN, the LR approach is conducted for the classification step.

Fig. 6. Architecture of 3-D CNN with spectral–spatial features for HSI classification.

C. Regularizations Based on Sparse Constraints

The issue of high dimensionality and limited number of training samples makes overfitting a serious problem, particularly when the input is a 3-D cube. The dimensionality of the spectral-based CNN, which is presented in Section II-C, is around a couple of hundreds (the number of bands); the dimensionality of the spatial-based CNN, which is presented in Section III-B, is around several hundreds (K × K, e.g., K = 27); the dimensionality of the spectral-and-spatial-based CNN, which is presented in Section IV-B, is around several thousands (K × K × B). It is easy to see that the high dimensionality of the input data may lead to an overfitting situation. In order to handle this issue of 3-D CNN, a combined regularization strategy based on sparse constraint is developed, which includes ReLU and dropout, and applies dropout in the fully connected layer.

There are different kinds of ReLUs available to apply. In this paper, the adopted ReLU is a simple nonlinear operation that passes the input of a neuron if it is positive, whereas it outputs 0 if the input is negative. In many applications, ReLUs in CNNs can improve the performances [42].

Dropout is a recently introduced method to handle overfitting. It sets the output of some hidden neurons to zero, which means that the dropped neurons do not contribute in the forward pass and they are not used in the back-propagation procedure. In different training epochs, the deep CNN forms a different neural network by dropping neurons randomly. The dropout method prevents complex co-adaptations [43].

By using ReLU and dropout, the outputs of many neurons are 0. We use several ReLUs and dropouts at several layers to achieve powerful sparse-based regularization for the deep network and address the overfitting problem in HSI classification.

V. VIRTUAL SAMPLE ENHANCED CNN

As a matter of fact, CNN has a lot of weights needed to be trained. Inappropriate weights may cause the training to become trapped in a local minimum of the loss function, which results in poor performance. To obtain proper weights, a lot of samples are required in the training procedure. However, these samples are usually obtained by manual labeling of a small number of pixels in an image or based on some field measurements. Therefore, the collection of these samples is both expensive and time demanding. Consequently, the number of available training samples is usually limited, which is a challenging issue in supervised classification. To solve the dilemma, we utilize virtual samples as a promising tool from a different perspective.

The virtual sample method tries to create new training samples from given training samples. The critical issue is how to generate proper samples, and we figure out a solution from the imaging procedure perspective. Because of the complex situation of lighting in the large scene, objects of the same class show different characteristics in different locations. Therefore, we can simulate a virtual sample by multiplying a training sample by a random factor and adding random noise. Furthermore, we can generate a virtual sample from two given samples of the same class with proper ratios. The virtual sample idea is helpful in the training of a CNN.

To tackle the problem of having limited training samples, instead of regularization such as L2 regularization and dropout, virtual samples have been generated and added to the training samples.

A. Changing Radiation-Based Virtual Samples

Remote sensing, including hyperspectral imaging, usually covers a large scene, whereas the objects of the same class in different locations are affected by different radiation. Virtual samples can be created by simulating the imaging procedure. A new virtual sample $y_n$ is obtained by multiplying a training sample $x_m$ by a random factor and adding random noise

\[ y_n = \alpha_m x_m + \beta n. \tag{9} \]

The training sample $x_m$ is a cube extracted from the hyperspectral cube, which contains the spectral and spatial information of the pixel to be classified.

In (9), $\alpha_m$ indicates the disturbance of light intensity, which can vary under many situations such as seasons and atmospheric conditions, whereas β controls the weight of the random Gaussian noise n, which may result from the interaction of adjacent pixels and instrumental error.
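The changing-radiation rule in (9) can be sketched in a few lines of NumPy. The sketch below is an assumption-laden illustration, not the authors' procedure: the function name, the uniform sampling range for the factor α, the value of β, and the 27 × 27 × 20 sample shape are all hypothetical, since this section does not state how the random factor and the noise weight are chosen.

```python
import numpy as np

def radiation_virtual_sample(x_m, rng, alpha_range=(0.9, 1.1), beta=0.01):
    """Eq. (9): y_n = alpha_m * x_m + beta * n.

    alpha_range and beta are hypothetical defaults for illustration.
    """
    alpha_m = rng.uniform(*alpha_range)   # random light-intensity factor
    n = rng.standard_normal(x_m.shape)    # random Gaussian noise
    return alpha_m * x_m + beta * n

rng = np.random.default_rng(0)
x_m = rng.random((27, 27, 20))            # hypothetical K x K x B training cube
y_n = radiation_virtual_sample(x_m, rng)  # keeps the class label of x_m
```

The virtual sample has the same shape as the training sample it was derived from and would be given the same class label. Calling the function several times on one real sample yields several distinct virtual samples.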
TABLE I. LAND-COVER CLASSES AND NUMBERS OF PIXELS ON THE INDIAN PINES DATA SET

Fig. 7. Indian Pines data set. (Left) False color composite image (bands 28, 19, and 10) and (right) ground truth.

B. Mixture-Based Virtual Samples

Because of the long distance between the object and the sensor, mixture is very common in remote sensing. Inspired by the phenomenon, it is possible to generate a virtual sample $y_k$ from two given samples of the same class with proper ratios

$$y_k = \frac{\alpha_i x_i + \alpha_j x_j}{\alpha_i + \alpha_j} + \beta n. \tag{10}$$

In (10), $x_i$ and $x_j$ are two training samples from the same class, and $y_k$ is the virtual sample generated by the two training samples, whereas $\beta$ controls the weight of the random Gaussian noise $n$. Based on the fact that the hyperspectral characteristics of one class vary within a certain range, it is reasonable to assume that results of mixture within this range still belong to the same class. Therefore, here, we assign the same label of training samples to the virtual sample $y_k$.
Then, the real training samples and virtual samples are used
together as training samples to get the proper weights in the
network.
Although there are many other methods that can generate
the virtual samples, the changing radiation and mixture-based
methods are simple yet effective ways.
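The mixture rule in (10) can be sketched in the same way. This is a hedged illustration only: the weights $\alpha_i$ and $\alpha_j$ are drawn uniformly from [0, 1] as stated in the experimental section, whereas the value of $\beta$ for this method, the unit-variance noise, and the function name are assumptions.

```python
import numpy as np

def mixture_virtual_sample(x_i, x_j, rng, beta=1.0 / 25.0):
    """Return the normalized mixture of two same-class cubes plus noise (Eq. 10)."""
    # Two independent mixing weights; both being exactly zero is practically impossible.
    alpha_i, alpha_j = rng.uniform(0.0, 1.0, size=2)
    noise = rng.standard_normal(x_i.shape)  # unit variance is assumed
    mixed = (alpha_i * x_i + alpha_j * x_j) / (alpha_i + alpha_j)
    return mixed + beta * noise

rng = np.random.default_rng(0)
x_i = rng.uniform(-0.5, 0.5, size=(27, 27, 103))  # toy cubes of one class
x_j = rng.uniform(-0.5, 0.5, size=(27, 27, 103))
y_k = mixture_virtual_sample(x_i, x_j, rng)       # labeled with the shared class
```

Because the weights are normalized by their sum, the noise-free part of $y_k$ is a convex combination of $x_i$ and $x_j$ and therefore stays within the range spanned by the two samples.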

VI. EXPERIMENTAL RESULTS

A. Data Description and Experiment Design

In our study, three widely used hyperspectral data sets with different environmental settings were adopted to validate the proposed methods. They are a mixed vegetation site over the Indian Pines test area in Northwestern Indiana (Indian Pines), an urban site over the city of Pavia, Italy (University of Pavia), and a site over Kennedy Space Center (KSC), Florida.
The first data set was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). The data set was obtained from an aircraft, with a size of 145 pixels × 145 pixels and 220 spectral bands in the wavelength range of 0.4–2.5 μm. The false color image is shown in Fig. 7 (left). The number of bands is reduced to 200 by removing water absorption bands. Sixteen different land-cover classes are provided in the ground truth, as shown in Fig. 7. The number of samples of each class is listed in Table I.
The second data set was gathered by a sensor known as the Reflective Optics System Imaging Spectrometer (ROSIS-3) over the city of Pavia, Italy, with 610 pixels × 340 pixels and 115 bands in the range of 0.43–0.86 μm. The high spatial resolution of 1.3 m/pixel aims to avoid a high percentage of mixed pixels. In the experiment, noisy bands have been removed and the remaining 103 channels were used for classification. Nine land-cover classes were selected, which are shown in Fig. 8, and the numbers of samples for each class are given in Table II.

Fig. 8. University of Pavia data set. (Left) False color composite (bands 10, 27, and 46) and (right) ground truth representing nine land-cover classes.

The third data set was acquired by the AVIRIS instrument over KSC, Florida, on March 23, 1996. The KSC data set was acquired at an altitude of approximately 20 km, with a spatial resolution of 18 m. The data set includes 176 bands used for the analysis after removing water absorption and low-signal-to-noise-ratio bands. For classification purposes, 13 classes were defined for the site. The samples are listed in Table III and shown in Fig. 9.
For all three data sets, we split the labeled samples into two subsets, i.e., training and test samples, and the details are listed in Tables I–III. During the training procedure of CNN, we used 90% of the training samples to learn weights and biases of each neuron and the remaining 10% of the training samples to guide the design of proper architectures. In other words, we use

the classification results of the remaining 10% of the training samples to identify if the network is overfitted. This is important for designing the network. The test set is used to assess final classification performance.
The experiments were conducted in four scenarios. The first scenario aims at extracting the deep spectral features of HSI. The second scenario tests the usefulness of deep spatial FE. After this, the effectiveness of deep spectral–spatial FE is investigated. In the last scenario, the results of virtual sample methods are presented.
In order to quantitatively compare the capabilities of the proposed models, overall accuracy (OA), average accuracy (AA), and Kappa coefficient K are used as performance measures. We run the experiments 20 times with different initial random training samples, and then confidence intervals of OA, AA, and K are reported.

TABLE II
LAND-COVER CLASSES AND NUMBERS OF PIXELS ON THE UNIVERSITY OF PAVIA DATA SET

TABLE III
LAND-COVER CLASSES AND NUMBERS OF PIXELS IN THE KSC DATA SET

TABLE IV
ARCHITECTURES OF THE 1-D CNN ON THREE DATA SETS

Fig. 9. KSC data set. (Left) False color composite (bands 28, 19, and 10) and (right) ground truth.

B. Design CNN With Spectral Features

Spectral feature-based HSI classification is a traditional and widely used method, in which the pixel vector of HSI is the input. The primary objective of this section is to design a CNN model to evaluate the effectiveness of deep FE in the spectral domain. The experiments include the design and visualization of spectral information-based CNN and the comparisons with other FE methods and typical classifiers.
1) Architecture Design of the 1-D CNN: Optimization of CNN was again performed using the trial-and-error approach to determine the model parameters, namely, the number of nodes in hidden layers, the learning rate, the kernel size, and the number of convolution layers.
Table IV shows the architectures of deep CNNs for three data sets. As an example, for the Indian Pines data set, there are 13 layers, denoted as I1, C2, S3, C4, S5, C6, S7, C8, S9, C10, S11, F12, and O13 in sequence. I1 is the input layer. C refers to the convolution layers, and S refers to the pooling layers. F12 is a fully connected layer, and O13 is the output layer of the whole neural network.
The input data are normalized into [−1, 1]. For the LR, the learning rate is set to 0.005, and the training epoch is 700 for the Indian Pines data set. For the University of Pavia data set, we set the learning rate to 0.01 and the number of epochs to 300. For the KSC data set, the learning rate is 0.001 with 600 epochs. A generalized cross-validation method is applied to estimate the normalization parameter of L2 regularization [44].
Fig. 10 shows the classification results of the Indian Pines, University of Pavia, and KSC data sets. In Fig. 10, we can see that the depth does help improve classification accuracy. However, too many layers may degrade classification results. The numbers of proper convolution layers are 5, 3, and 4 for the Indian Pines, University of Pavia, and KSC data sets, respectively. This is affected by the dimensionalities of inputs, which are 200, 103, and 176, respectively.
2) Visualization and Analysis of the 1-D CNN: In order to gain detailed understanding of the 1-D CNN, visualization of CNN for HSI is provided in this section. In the visualization and analysis part, the University of Pavia data set is used as an example.
Weights play a key role in a neural network; hence, they are displayed in grayscale images for visualization. Every row in the figures represents a convolutional kernel, and the intensities

in a row represent the connection intensity of the network. Each convolutional kernel can extract the unique feature of the input. Fig. 11 shows the weights of the first convolutional layer on the University of Pavia data set. The weights are randomly initialized and trained using back-propagation methods. From Fig. 11(b), the learned weights show some structures. For example, the intensities of the first row are high on the left side and low on the right side. Fig. 14 shows the weights of the 2-D CNN, and it is helpful for the understanding. Different convolutional kernels can extract the features from different perspectives, and the abundant features are helpful for further processing.
Fig. 12 shows the weights learned at the second and third convolutional layers in an image form where the brightness is proportional to the value of the weights. There are 12 and 24 convolutional kernels at layers 2 and 3, respectively. The numbers of weights, i.e., 42 at layer 2 and 96 at layer 3, are arranged in an image form artificially. Different convolutional kernels can extract the features from different perspectives. The abundant features are helpful for further processing.
The learned features, which are obtained by the convolution of inputs and kernels, on the University of Pavia data set are illustrated as curves in Fig. 13. The class of Meadows is selected for visualization, and the extracted features after each convolutional layer are shown with a different color. It is shown that these different features are extracted by different convolution kernels. The extracted features become more abstract after the third convolutional and pooling layers.

Fig. 10. Spectral information-based classification results of different depths using CNN.

Fig. 11. Weights of the first convolutional layer on the University of Pavia data set. Each tiny image (1 × 8) stands for the weights of a convolutional kernel. There are six convolutional kernels in the first convolutional layer. The intensity of each pixel stands for the value of corresponding weight. (a) Randomly initialized weights of first convolution layer. (b) Learned weights of first convolution layer.

Fig. 12. Weights of the second and third convolutional layers on the University of Pavia data set. In the first column of the image, there are 12 filters, and each tiny image contains 42 (6 × 7) weights of a convolutional kernel. The second one shows 24 filters and 96 (12 × 8) weights in a tiny image. (a) Learned weights of the second convolutional layer. (b) Learned weights of the third convolutional layer.

In order to evaluate the effectiveness of the extracted features, the similarity in the same class and the divisibility between different classes are shown in Table V in a quantitative way. We selected three classes for calculation and calculated the divisibility of different classes with the J-M distance. The J-M distance is defined as [45]

$$J_{ij} = 2\left(1 - e^{-B_{ij}}\right) \tag{11}$$

$$B_{ij} = \frac{1}{8}(m_i - m_j)^T \left(\frac{c_i + c_j}{2}\right)^{-1} (m_i - m_j) + \frac{1}{2}\log\left(\frac{\left|\frac{c_i + c_j}{2}\right|}{\sqrt{|c_i||c_j|}}\right) \tag{12}$$

where $m_i$ and $c_i$ are the average vector and covariance matrix of the samples of class $i$. $B_{ij}$ is the Bhattacharyya distance between the two classes.
The similarity in the same class is evaluated with the correlation coefficient on a scale of −1 to 1. The correlation coefficient calculation formula is defined as follows:

$$\rho_{x,y} = \frac{C(x, y)}{\sqrt{D(x)}\sqrt{D(y)}} \tag{13}$$

where $x$ and $y$ are two feature vectors, whereas $C(x, y)$ is their covariance. $D(x)$ and $D(y)$ are the variances of the two vectors. We use the mean of all correlation coefficients to evaluate the similarity in the same class.
The higher similarity within class and the higher divisibility between classes make the classification step smoother. From Table V, by comparing the calculated results in different layers, one can see that features have a high similarity in the same class and large divisibility in different classes as the number of

Fig. 13. Extracted features after convolution and pooling layers on the University of Pavia data set. (a) Original spectral information. (b) and (c) Features after
the first convolutional layer. (d)–(f) Features after the second convolutional layer. (g)–(i) Features after the third convolutional layer.

TABLE V
SIMILARITY AND DIVISIBILITY OF SPECTRAL FEATURES ON THE UNIVERSITY OF PAVIA DATA SET

convolutional layers increases. Therefore, the results indicate that the extracted features are valid and efficient.
3) Comparisons With Different FE Methods and Classifiers: In this set of experiments, CNN was compared with the PCA, factor analysis (FA), and locally linear embedding (LLE) in order to investigate the potential of CNN for hyperspectral spectral FE. PCA is a widely used FE method. FA is a linear statistical method designed to find potential factors from observed variables to replace original data [46]. LLE is a popular nonlinear dimension reduction method, which is considered as a kind of manifold learning algorithm [47]. In this paper, the effectiveness of different FE methods is evaluated mainly through classification results. We also classify the features using several classifiers such as KNN classifier and a nonlinear SVM based on radial basis function (RBF-SVM). Using the same features with different classifiers, we can evaluate the effectiveness of the extracted features.
Tables VI–VIII show that the CNN-based FE methods always provide the best performances of OA, AA, and Kappa for all three data sets. The classification accuracy values are given in the form of mean ± standard deviation from the perspective of statistics, which is used as a measurement of volatility.
In order to have a fair comparison, we used 10% of the training samples to find the best parameters of FE methods using grid search. The results reported in Tables VI–VIII are the best classification results when the number of features was properly selected for each FE method. On the selection of parameters, the number of features was chosen in the range of 10 to N (i.e., the number of hyperspectral bands) with an interval of 10. The number of neighbors in LLE has been changed in a range from 1 to 10. The final classification results such as OA, AA, and Kappa were calculated on the test data set.
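For reference, the three measures used in Tables VI–VIII can be computed from a confusion matrix as sketched below. The paper does not spell out the formulas, so the code uses the standard definitions of OA, AA, and Cohen's Kappa; the function name and the convention that rows hold the true classes are assumptions.

```python
import numpy as np

def accuracy_measures(confusion):
    """Return (OA, AA, Kappa) from a square confusion matrix (rows = true class)."""
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    # Overall accuracy: fraction of all test pixels on the diagonal.
    oa = np.trace(confusion) / total
    # Average accuracy: mean of the per-class accuracies.
    aa = (np.diag(confusion) / confusion.sum(axis=1)).mean()
    # Chance agreement from the row and column marginals.
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy two-class example: 17 of 20 pixels are classified correctly.
oa, aa, kappa = accuracy_measures([[8, 2], [1, 9]])
```

For this toy matrix, OA and AA are both 0.85 and the chance agreement is 0.5, so Kappa is 0.7. Repeating the computation over the 20 random splits and taking the mean and standard deviation gives values in the mean ± standard deviation form used in the tables.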

TABLE VI
CLASSIFICATION RESULTS OBTAINED BY DIFFERENT FE APPROACHES ON THE INDIAN PINES DATA SET

TABLE VII
CLASSIFICATION RESULTS OBTAINED BY DIFFERENT FE APPROACHES ON THE UNIVERSITY OF PAVIA DATA SET

TABLE VIII
CLASSIFICATION RESULTS OBTAINED BY DIFFERENT FE APPROACHES ON THE KSC DATA SET

For KNN, the range of the nearest neighbors has been changed from 1 to 30 with the interval of two. In RBF-SVM, there are two parameters, i.e., $C$ and $\gamma$ [48]; thus, we applied 2-D grid search from a wide range (i.e., $C = 2^{-5}, 2^{-4}, \ldots, 2^{19}$; $\gamma = 2^{-15}, 2^{-14}, \ldots, 2^{4}$) to get the best parameters. The learning rate and the number of epochs for LR were selected empirically.
Table VI also shows the results obtained in a situation when the models were trained using the original complete set of spectral bands (200 bands) of the Indian Pines data set. Due to the imbalance between the numbers of training samples and the numbers of bands used, the accuracy and its corresponding variance have a wide range from one class to another one. Compared with PCA, FA, and LLE, CNN-based FE leads to better performance, particularly when it is combined with LR. The CNN-LR exhibits the highest OA, AA, and K, the highest percentage of correctly classified pixels among all the test pixels considered, with an improvement of 3.92%, 3.88%, and 0.046 over the RBF-SVM, respectively.
Table VII shows the experimental results for the University of Pavia data set. It is shown that the CNN-LR provides better results again and outperforms RBF-SVM by 2.26%, 2.39%, and 0.0237 on average in terms of OA, AA, and K, respectively. It is worth noting that the obtained variance is very small. In terms of class accuracy values, the class “Bricks” was the most difficult one to be classified. The CNN-LR still exhibits the best accuracy (89.09 ± 1.18) for this class. Concerning computational cost, CNN has the longest processing time (given by the sum of the training and test times) compared with the other methods.

TABLE IX
ARCHITECTURE OF THE 2-D CONVOLUTION NEURAL NETWORK

C. CNN With Spatial Features

In this section, we investigate the effectiveness of the 2-D CNN for hyperspectral data FE and classification. There are two reasons. On one hand, the original CNN is designed for 2-D image classification. The usefulness of 2-D CNN for HSI classification should be tested. In this paper, 1-D CNN and 3-D CNN are designed for spectral classification and spectral–spatial classification, respectively. On the other hand, to maintain the integrity, 2-D CNN should be investigated. There are several

Fig. 14. Weights of the first convolutional layer. Each tiny image (4 × 4) stands for a convolutional kernel. There are 32 kernels in the first convolutional layer.
The intensity of each pixel stands for the value of corresponding weight. (a) Randomly initialized weights of the first convolution layer of the University of Pavia
data set. (b) Learned weights of the first convolution layer of the University of Pavia data set.

Fig. 15. Extracted features of the University of Pavia data set. There are six rows, and each row of images represents one class. There are four columns in the
figure. The first column is allocated to the input images. The second column is allocated to the four feature maps after the first convolution. The third column is
allocated to the four feature maps after the first ReLU operation. The last column is composed of the four features of the first pooling operation.

approaches to create 2-D input, such as choosing one band of HSI randomly. Here, the first principal component is used to create the 2-D input because the first principal component contains the most energy of the whole HSI. The learned spatial features are used for further classification.
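The construction of the 2-D input from the first principal component can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function names are hypothetical, the PCA is computed through an SVD of the centered pixel matrix, and reflective padding at the image border is an assumption, since the border handling is not described. The 27 × 27 window is the size used in the experiments.

```python
import numpy as np

def first_principal_component(cube):
    """Project a (rows, cols, bands) cube onto its first principal axis."""
    rows, cols, bands = cube.shape
    pixels = cube.reshape(-1, bands).astype(float)  # astype makes a copy
    pixels -= pixels.mean(axis=0)                   # center each band
    # Right singular vectors of the centered data are the principal axes.
    _, _, vt = np.linalg.svd(pixels, full_matrices=False)
    return (pixels @ vt[0]).reshape(rows, cols)

def extract_patch(image, row, col, size=27):
    """Return the size x size neighborhood centered on pixel (row, col)."""
    half = size // 2
    padded = np.pad(image, half, mode="reflect")    # border handling is assumed
    return padded[row:row + size, col:col + size]

cube = np.random.default_rng(0).normal(size=(30, 32, 10))  # toy cube, not real data
pc1 = first_principal_component(cube)
patch = extract_patch(pc1, row=0, col=5)
```

Each labeled pixel then contributes one patch, whose center element is the first-principal-component value of that pixel.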
1) Architecture Design, Visualization, and Analysis: There are several factors that need to be selected in the experiments. The input images were normalized into [−0.5, 0.5]. We used a large neighborhood window (27 × 27) for the first principal component as the input 2-D image for the three data sets.
The details of the architecture of the CNN are listed in Table IX. Because of the small size of the input image, we used three convolution layers and three pooling layers. After the CNN, the input image was converted into a vector with 128 dimensions.
In the training procedure, we used a mini-batch-based back-propagation method, whereas the size of mini-batch is 100. The learning rate of all CNNs is set to be 0.01. In this part of the experiment, the number of training epochs of the CNN is 200.

Fig. 16. Extracted features after three convolutional layers on two asphalt samples. The number of feature maps after the first convolution layer is 32, and the size of each feature map is 24 × 24; the number of feature maps after the second convolution layer is 64, and the size of each feature map is 8 × 8; and the number of feature maps after the third convolution layer is 128, and the size of each feature map is 1 × 1.
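The feature-map sizes quoted in the caption of Fig. 16 follow from simple size arithmetic, sketched below. Only the 27 × 27 input, the 4 × 4 first kernel, and the resulting 24, 12, 8, and 1 sizes are stated in the text; the 5 × 5 and 4 × 4 kernels of the second and third layers and the 2 × 2 pooling after the second layer are inferred here as one configuration consistent with those sizes, not taken from Table IX. The third pooling layer is left out because its parameters are not given.

```python
def conv_output_size(size, kernel):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_output_size(size, window):
    """Output width of non-overlapping pooling."""
    return size // window

size = 27
conv_sizes = []
# (kernel, pooling window) per stage; stages 2 and 3 are assumed values.
for kernel, pool in [(4, 2), (5, 2), (4, None)]:
    size = conv_output_size(size, kernel)
    conv_sizes.append(size)
    if pool is not None:
        size = pool_output_size(size, pool)

print(conv_sizes)  # [24, 8, 1]
```

With 128 kernels in the third layer and 1 × 1 maps, flattening gives the 128-dimensional vector mentioned above.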

TABLE X
SIMILARITY AND DIVISIBILITY OF SPATIAL FEATURES ON THE UNIVERSITY OF PAVIA DATA SET

TABLE XI
CLASSIFYING WITH SPATIAL FEATURES ON THE INDIAN DATA SET

TABLE XII
CLASSIFYING WITH SPATIAL FEATURES ON THE UNIVERSITY OF PAVIA DATA SET

TABLE XIII
CLASSIFYING WITH SPATIAL FEATURES ON THE KSC DATA SET

TABLE XIV
ARCHITECTURE OF THE 3-D CONVOLUTION NEURAL NETWORK

In the visualization part, we used the University of Pavia data set as an example. The power of the neural network lies in the weights. In the beginning, the weights are randomly initialized with standard deviation 0.001. The weights of different convolutional kernels in the first convolutional layer are shown in Fig. 14. The initial weights, which are shown in Fig. 14(a), are random; the learned weights show obvious structures.
Fig. 15 shows the features after the convolutional layer, ReLU layer, and pooling layer. The first column shows the inputs of different classes, which contains nine 27 × 27 small images. After convolution processing of 32 kernels, which are shown in Fig. 14(b), the input is converted into 32 feature maps. Four of the feature maps, which are 24 × 24 images after the 4 × 4 convolution processing, are shown in the second column in Fig. 15. The ReLU layer cuts off the features that are less than 0, and the features are shown in the third column in Fig. 15. After the max pooling operation, the size of feature maps is 12 × 12. Different convolutional kernels extract different features of the input, and the abundant features give a lot of potential for further processing.
Extracted features after three convolutional layers of two asphalt samples are shown in Fig. 16. The convolutional operation can extract different features according to different

Fig. 17. Classification results with and without dropout on the (left) Indian Pines, (middle) University of Pavia, and (right) KSC data sets.

Fig. 18. Training error with and without ReLU on the (left) Indian Pines, (middle) University of Pavia, and (right) KSC data sets.

convolutional kernels. The correlation coefficient of the two input images is 0.2975, which means that they are quite different. After three convolutions, the correlation coefficient of the features extracted from the two inputs is 0.8324, which means they are quite similar. This kind of processing is very useful for further classification.
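The similarity measure quoted here is the correlation coefficient of (13). A minimal sketch, assuming the two inputs (images or feature vectors) are flattened before comparison and that population (1/N) statistics are used; the function name is hypothetical.

```python
import numpy as np

def correlation_coefficient(x, y):
    """rho = C(x, y) / (sqrt(D(x)) * sqrt(D(y))) for two equally sized arrays."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    covariance = (xc * yc).mean()                       # C(x, y)
    return covariance / np.sqrt((xc ** 2).mean() * (yc ** 2).mean())

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 1.0, 4.0, 3.0])
rho = correlation_coefficient(a, b)
```

The 1/N factors cancel between numerator and denominator, so the value agrees with `np.corrcoef(a, b)[0, 1]`.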
In order to evaluate the effectiveness of the extracted features, the similarity in the same class and the divisibility between the different classes are shown in Table X in a quantitative way. From Table X, after convolution operations, the similarity in the same class and the divisibility in different classes are increased. However, some features in the middle layers have relatively low similarity in the same class and relatively low divisibility between different classes; these middle-layer features are therefore not suitable for classification.
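The divisibility values in Tables V and X are J-M distances, defined in (11) and (12). The sketch below uses the standard Bhattacharyya form (inverse of the averaged covariance, and the square root of the product of determinants in the logarithm); the function name, the sample-covariance estimate, and the toy data are assumptions.

```python
import numpy as np

def jm_distance(features_i, features_j):
    """J-M distance between two classes given (n_samples, n_features) arrays."""
    m_i, m_j = features_i.mean(axis=0), features_j.mean(axis=0)
    c_i = np.atleast_2d(np.cov(features_i, rowvar=False))
    c_j = np.atleast_2d(np.cov(features_j, rowvar=False))
    c_mean = 0.5 * (c_i + c_j)
    diff = m_i - m_j
    # First term of B_ij: Mahalanobis-like distance between the class means.
    mean_term = 0.125 * diff @ np.linalg.solve(c_mean, diff)
    # Second term of B_ij: log-determinants are used for numerical stability.
    _, logdet_mean = np.linalg.slogdet(c_mean)
    _, logdet_i = np.linalg.slogdet(c_i)
    _, logdet_j = np.linalg.slogdet(c_j)
    cov_term = 0.5 * (logdet_mean - 0.5 * (logdet_i + logdet_j))
    b_ij = mean_term + cov_term
    return 2.0 * (1.0 - np.exp(-b_ij))

rng = np.random.default_rng(0)
class_a = rng.normal(size=(200, 3))   # toy features, not real data
class_b = class_a + 10.0              # same spread, distant mean
far = jm_distance(class_a, class_b)
```

The distance is 0 for identical classes and saturates at 2 for well-separated ones, which is why larger values in the tables indicate better divisibility. The covariance estimate needs more samples than feature dimensions to stay invertible.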
In Tables XI–XIII, we can see that the deep CNN method outperforms RBF-SVMs in terms of the OA, AA, and Kappa. In this case, the CNN significantly improves on RBF-SVM for all three data sets.

Fig. 19. Influence of spatial size on the (left) Indian Pines, (middle) University of Pavia, and (right) KSC data sets. Best view in color.

D. CNN With Spatial–Spectral Features

In this part of the experiments, we investigate the advantage of 3-D CNN for HSI FE and classification. With the help of proper CNN architecture, we used the neighbors of the pixel in all bands. The CNN learns the spectral and spatial features by itself, and learned features were used in classification.
1) Architecture Design and Parameter Analysis: For the Indian Pines, University of Pavia, and KSC data sets, we use 27 × 27 × 200, 27 × 27 × 103, and 27 × 27 × 176 neighbors of each pixel as the input 3-D images, respectively. The input images are normalized into [−0.5, 0.5]. The structures' details are given in Table XIV. After the 3-D CNN, the input image was converted into a vector. The size of mini-batch was 100,

and the learning rate was 0.003. In this set of experiments, the number of training epochs of the CNNs is 400.

TABLE XV
SIMILARITY AND DIVISIBILITY OF SPECTRAL–SPATIAL FEATURES ON THE UNIVERSITY OF PAVIA DATA SET

TABLE XVI
CLASSIFICATION WITH SPECTRAL–SPATIAL FEATURES ON THE INDIAN PINES DATA SET

There are three factors (dropout, ReLU, and the size of the
spatial window) that influence the final classification accuracy
significantly, and they are analyzed in the following.
In the proposed architecture, dropout plays an important
role to address overfitting. In this experiment, the results
(classification error) with and without dropout on the three data
sets are presented in Fig. 17. In the figure, the training errors
without dropout regularization are very low after dozens of
epochs, whereas the test errors without dropout are very high.
This is the problem of overfitting. For the training and test errors
with dropout, the training errors are relatively high, whereas the
test errors are relatively low. This means that the model with
dropout has a good capability of generalization.
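The dropout mechanism discussed here can be sketched as below. This is a generic illustration rather than the network used in the paper: it uses the common "inverted" scaling by 1/(1 − p) during training, which is an assumption, and the function names and the drop probability of 0.5 are hypothetical.

```python
import numpy as np

def dropout_forward(activations, drop_prob, rng, training=True):
    """Zero a random subset of units; return the output and the scaled mask."""
    if not training or drop_prob == 0.0:
        return activations, np.ones_like(activations)
    keep = rng.random(activations.shape) >= drop_prob
    # Inverted dropout: rescale kept units so the expected output is unchanged.
    scaled_mask = keep.astype(activations.dtype) / (1.0 - drop_prob)
    return activations * scaled_mask, scaled_mask

def dropout_backward(grad_output, scaled_mask):
    """Dropped units receive no gradient, so they are not updated."""
    return grad_output * scaled_mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 5))
out, mask = dropout_forward(hidden, 0.5, rng)
grad = dropout_backward(np.ones_like(hidden), mask)
```

A fresh mask is drawn for every mini-batch, so each update trains a different thinned network; at test time (`training=False`) all units are kept.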
The effectiveness of the dropout can be explained in two ways. The first one is to prevent co-adaptations of the units on the training samples, and the second one is to average the predictions of many different networks [43]. If a hidden unit knows its collaborative units, it leads to good performance on the training data. However, these units might not perform well on the test data set. In contrast, if a hidden unit adapts well to many different collaborative units, it will be more dependent on itself rather than depending on certain combinations of hidden units. Dropout strategy makes it possible to train
different networks, and each network gets a classification result. As the training procedure continues, most of the networks give the correct results to eliminate incorrect results on the final classification results.
ReLU is another important factor that is influential to final performance. Krizhevsky et al. claimed that nonsaturating nonlinear functions such as ReLU can gain better performances than saturating nonlinearities such as the sigmoid function [26]. The classification errors with and without ReLU on the three data sets are demonstrated in Fig. 18. From Fig. 18, convergence of the models with sigmoid function is slower than convergence of the models with ReLU. In particular, on the Indian Pines data set, a CNN with ReLU (red solid lines) reaches a 50% error rate six times faster than the same network with sigmoid (blue dashed lines). On the other hand, the models with ReLU can lead to lower training error (close to 0) at the end of training. In summary, CNN with ReLU can accelerate convergence and improve the training accuracy.
The size of 3-D input is an important parameter too. The dimensionality toward spectral dimension is fixed, whereas the dimensionalities toward spatial dimension are changeable. A set of experiments is organized to get a proper size of 3-D inputs. Fig. 19 shows the results using different sizes of spatial window. The half widths of spatial size are set to W = [11, 12, 13, 14, 15], and the full width is 2W + 1. To have a fair comparison, we resize other spatial sizes to 27 × 27 and get classification accuracy values using the models aforementioned. For the Indian Pines data set, the OA reaches its highest value, nearly 98%, when the half width is 14. For the University of Pavia and KSC data sets, the results show that the best accuracy values are obtained when the half width is 13.
In order to evaluate the effectiveness of the extracted spectral–spatial features, Table XV presents the similarity in the same class and the divisibility between the different classes. Compared with Tables V and X, after convolution operations, the spectral–spatial features get the highest similarity in the same class and the highest divisibility between the different classes, which shows that the spectral–spatial features have the potential for accurate classification.
2) Comparative Experiments With Other Spectral–Spatial Methods: We also conducted RBF-SVM with the original data sets and extended morphological profile (EMP) for comparison. EMP followed by SVM is an advanced spatial–spectral classification method for hyperspectral data. We used opening and closing operations on the first five, seven, and three principal components of the Indian Pines, University of Pavia, and KSC data sets to extract structural information, respectively. In the experiments, the structuring element used was a disk and

TABLE XVII TABLE XIX

C LASSIFICATION W ITH S PECTRAL –S PATIAL F EATURES RUNNING T IME C OMPARISON
ON THE U NIVERSITY OF PAVIA D ATA S ET

TABLE XX
C LASSIFICATION A CCURACY VALUES ON THE I NDIAN P INES D ATA S ET

TABLE XXI
C LASSIFICATION A CCURACY VALUES ON THE
U NIVERSITY OF PAVIA D ATA S ET

TABLE XVIII
C LASSIFICATION W ITH S PECTRAL –S PATIAL
F EATURES ON THE KSC D ATA S ET

TABLE XXII
C LASSIFICATION A CCURACY VALUES ON THE KSC D ATA S ET

the structure sizes were progressively increased from 1 to 4. Therefore, 40, 56, and 24 spatial features were generated. The generated spatial features and the original spectral features are used for classification. For the EMP with RBF-SVM method, wide ranges of the SVM parameters c and g were searched; the selected values were $c = 2^{18}$ and $g = 2^{1}$ for the Indian Pines data set, $c = 2^{19}$ and $g = 2^{1}$ for the Pavia data set, and $c = 2^{10}$ and $g = 2^{-1}$ for the KSC data set.
Tables XVI–XVIII compare the classification results with those of the typical SVMs. The CNN achieves higher OA, AA, and Kappa coefficient than the other FE and classification methods, which shows that the designed CNN can help improve the classification accuracy of HSI.
3) Computation Cost of the 3-D CNN: In general, we concede that neural networks take longer to train than other machine learning algorithms such as KNN or SVM, and so do the proposed deep learning methods. On the other hand, the advantage of deep learning algorithms is that they are very fast at test time. The training and test times are shown in Table XIX. The test time is only 0.88, 1.02, and 0.30 min for the Indian Pines, University of Pavia, and KSC data sets, respectively. Fast test time is very important in real applications. With the quick development of hardware technology, particularly graphics processing units, the drawback of the long training time of deep learning methods can be mitigated in the near future.

E. CNN With Virtual Sample
1) Classification Results: In this part of the experiments, the advantages of 3-D CNN with virtual samples for HSI FE and classification are investigated. The two proposed virtual sample methods (Method A and Method B) in Section V are tested on the three data sets.
For every virtual sample generated by (9), $\alpha_m$ is a uniformly distributed random number in [0.9, 1.1], and $\beta$, which is the weight of the noise $n$, is set to 1/25. Meanwhile, for every virtual sample generated by (10), $\alpha_i$ and $\alpha_j$ are uniformly distributed random numbers on the interval [0, 1], whereas $x_i$ and $x_j$ are randomly chosen from the same class.
The CNN architecture and the training procedure in this section are the same as in the previous section. Classification accuracy values obtained by different approaches on the Indian Pines, University of Pavia, and KSC data sets are shown in Tables XX–XXII.
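The parameter settings for the two virtual sample methods can be illustrated with a minimal Python sketch. It assumes that (9) has the form y = α·x + β·n with n drawn as zero-mean, unit-variance Gaussian noise per band, and that (10) is the normalized mixture y = (α_i·x_i + α_j·x_j)/(α_i + α_j); both forms, the function names, and the assumption that spectra are already scaled to a small range are illustrative and should be checked against Section V.

```python
import random


def radiometric_virtual_sample(x, alpha_range=(0.9, 1.1), beta=1 / 25, rng=random):
    """Method A sketch (assumed form of (9)): y = alpha * x + beta * n.

    x is one pixel spectrum (list of floats); n is assumed to be
    zero-mean, unit-variance Gaussian noise drawn independently per band.
    """
    alpha = rng.uniform(*alpha_range)
    return [alpha * v + beta * rng.gauss(0.0, 1.0) for v in x]


def mixture_virtual_sample(x_i, x_j, rng=random):
    """Method B sketch (assumed form of (10)): normalized mixture of two
    spectra that belong to the same class."""
    a_i = rng.uniform(0.0, 1.0)
    a_j = rng.uniform(0.0, 1.0)
    total = a_i + a_j
    if total == 0.0:
        # Degenerate draw: fall back to a copy of the first sample.
        return list(x_i)
    return [(a_i * u + a_j * v) / total for u, v in zip(x_i, x_j)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Two hypothetical same-class spectra with four bands each.
    x1 = [0.20, 0.35, 0.50, 0.40]
    x2 = [0.25, 0.30, 0.55, 0.45]
    print(radiometric_virtual_sample(x1, rng=rng))
    print(mixture_virtual_sample(x1, x2, rng=rng))
```

Because the mixture weights are normalized, each band of a Method B sample lies between the corresponding bands of the two source spectra, so the virtual sample stays within the range spanned by its class.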

Fig. 20. Indian Pines. (a) False color image. (b)–(f) Classification maps for different classifiers: (b) 1D-SVM, (c) 3D-EMP-SVM, (d) 3D-CNN, (e) 3D-CNN with
Method A, and (f) 3D-CNN with Method B.

Fig. 21. University of Pavia. (a) False color image. (b)–(f) Classification maps for different classifiers: (b) 1D-SVM, (c) 3D-EMP-SVM, (d) 3D-CNN, (e) 3D-CNN
with Method A, and (f) 3D-CNN with Method B.

Fig. 22. KSC. (a) False color image. (b)–(f) Classification maps for different classifiers: (b) 1D-SVM, (c) 3D-EMP-SVM, (d) 3D-CNN, (e) 3D-CNN with
Method A, and (f) 3D-CNN with Method B.

Under the condition of limited training samples, CNN with virtual samples outperformed the EMP-based and original CNN methods in terms of OA, AA, and Kappa coefficient. This indicates that CNN with virtual samples is a powerful tool for HSI classification.
In Tables XX–XXII, in comparison with the original CNN, the OA improved by 0.97%, 0.12%, and 0.76% on the Indian Pines, University of Pavia, and KSC data sets, respectively. Moreover, the variances of OA decrease as well, which means that the CNNs with virtual samples are less influenced by the choice of training samples.
The experiments also show that the CNN classifier achieves better classification accuracy when more virtual samples are created.
2) Classification Maps: Finally, the classification results are examined from a visual perspective. The 1D-SVM, 3D-EMP-SVM, 3D-CNN, and 3D-CNN with virtual samples are selected to classify the whole images. Figs. 20–22 show the classification maps of the different methods investigated in this paper for the three data sets. All parameters in these models are optimized. From the resulting images, we can see how the proposed FE method affects the classification results.
From Figs. 20–22, it is clear that the spectral classification method (1D-SVM) always results in noisy scattered points in the images [see Figs. 20(b)–22(b)]. The spectral–spatial methods correct this shortcoming and eliminate the noisy scattered points of misclassification. The CNN with virtual sample method gives more detailed classification maps [see Fig. 22(e) and (f)].
Both of the proposed virtual sample approaches can increase the classification accuracy of CNN significantly under insufficient training data.

VII. DISCUSSION AND CONCLUSION

In order to harness the power of deep models for HSI FE and classification, in this paper, we have proposed deep CNN architectures to extract the spectral, spatial, and spectral-and-spatial-based deep features.

The design of proper deep CNN models is the first important issue we face. In the design of the spectral deep model, we use a small local receptive field and three to five convolutional layers. For the spatial deep model, we use a small local receptive field. For the spectral-and-spatial-based deep model, we use a special 3-D CNN model with a large receptive field in the spectral domain and a small receptive field in the spatial domain to extract the integrated features of HSI. A proper design balances the capacity and complexity of the network, which is very important for further FE and classification.
In hyperspectral remote sensing cases, only limited training samples are available. To address the problem of overfitting, we use L2 regularization for the spectral CNN. When the input is a 3-D cube, overfitting becomes more serious. We then adopt a regularization technique called dropout. Proper regularization strategies play an important role in the accurate classification of HSI.
Parameters affect the classification accuracy and computational complexity. In the realization of deep CNNs for HSI FE and classification, we have gathered some useful experience on parameter setting. The experimental results suggest that one or two layers often provide limited capacity in FE of HSI. Based on our experimental results, we suggest using a three-layer CNN with a 4 × 4 or 5 × 5 convolution kernel and a 2 × 2 pooling kernel in each layer for HSI FE.
By using a proper architecture and powerful regularization, the proposed 3-D deep CNN has been demonstrated to provide excellent classification performance under the condition of limited training samples. The proposed deep model is promising and opens a new window for further research.
In order to further improve the performance of CNN-based methods, a method called virtual sample is proposed. Virtual samples are generated by changing the radiation and by mixing different samples. Then, the training samples and the created virtual samples are used together to train a CNN.
In summary, to address the HSI FE and classification problem with limited training samples, we propose the idea of a big network with strong constraints. The big feedforward DNN using deep 3-D CNN with virtual samples achieves the best results in terms of classification accuracy among the methods compared in this paper.
CNN is a hot topic in machine learning and computer vision. Various improvements have been made in recent years, and they can also be used in the proposed CNN architecture. The proposed model can be combined with post-classification processing to enhance mapping performance, which deserves to be investigated as possible future work.
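As a worked illustration of the parameter suggestion in this section (three convolutional layers with 4 × 4 or 5 × 5 kernels, each followed by 2 × 2 pooling), the following Python sketch tracks the spatial size of the feature maps. It assumes unpadded convolution with stride 1 and nonoverlapping pooling with the size rounded down; the input patch sizes are hypothetical and are not taken from the experiments.

```python
def feature_map_sizes(input_size, kernel_size, pool_size=2, num_layers=3):
    """Return the side length after each convolution and pooling step.

    Assumes square inputs, unpadded stride-1 convolution, and
    nonoverlapping pooling with floor division (illustrative assumptions).
    """
    sizes = []
    size = input_size
    for layer in range(1, num_layers + 1):
        size = size - kernel_size + 1  # unpadded convolution
        if size < 1:
            raise ValueError(f"input too small for convolution in layer {layer}")
        sizes.append(size)
        size = size // pool_size  # nonoverlapping pooling
        if size < 1:
            raise ValueError(f"input too small for pooling in layer {layer}")
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Hypothetical 36 x 36 patch with 4 x 4 kernels.
    print(feature_map_sizes(36, 4))  # [33, 16, 13, 6, 3, 1]
    # Hypothetical 44 x 44 patch with 5 x 5 kernels.
    print(feature_map_sizes(44, 5))  # [40, 20, 16, 8, 4, 2]
```

Under these assumptions, the kernel size constrains how small the input neighborhood can be: a 27 × 27 patch with 5 × 5 kernels, for example, cannot pass through a third convolution, so the function raises an error.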
[24] Y. Bengio, A. Courville, and P. Vincent, “Representation learning. A
R EFERENCES review and new perspectives,” IEEE Trans. Pattern Anal. Mach. Intell.,
[1] J. A. Benediktsson and P. Ghamisi, Spectral–Spatial Classification of Hy- vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
perspectral Remote Sensing Images. Boston, MA, USA: Artech House, [25] N. Kruger et al., “Deep hierarchies in primate visual cortex what can
2015. we learn for computer vision?” IEEE Trans. Pattern Anal. Mach. Intell.,
[2] G. Hughes, “On the mean accuracy of statistical pattern recognizers,” vol. 35, no. 8, pp. 1847–1871, Aug. 2013.
IEEE Trans. Inf. Theory, vol. IT-14, no. 1, pp. 55–63, Jan. 1968. [26] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with
[3] J. B. Dias et al., “Hyperspectral remote sensing data analysis and future deep convolutional neural networks,” in Proc. Neural Inf. Process. Syst.,
challenges,” IEEE Geosci. Remote Sens. Mag., vol. 1, no. 2, pp. 6–36, Lake Tahoe, NV, USA, 2012, pp. 1106–1114.
Feb. 2013. [27] G. Hinton and R. Salakhutdinov, “Reducing the dimensionality of data
[4] X. Jia, B. Kuo, and M. M. Crawford, “Feature mining for hyperspectral with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006.
image classification,” Proc. IEEE, vol. 101, no. 3, pp. 676–679, [28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learn-
Mar. 2013. ing applied to document recognition,” Proc. IEEE, vol. 86, no. 11,
[5] G. Licciardi, P. R. Marpu, J. Chanussot, and J. A. Benediktsson, “Linear pp. 2278–2324, Nov. 1998.
versus nonlinear PCA for the classification of hyperspectral data based on [29] Y. Chen, Z. Lin, X. Zhao, and G. Wang, “Deep learning-based classifi-
the extended morphological profiles,” IEEE Geosci. Remote Sens. Lett., cation of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ.
vol. 9, no. 3, pp. 447–451, May 2011. Remote Sens., vol. 7, no. 6, pp. 2094–2107, Jun. 2014.
CHEN et al.: DEEP FEATURE EXTRACTION AND CLASSIFICATION OF HYPERSPECTRAL IMAGES 6251

Yushi Chen (M’11) received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2008.
Currently, he is an Associate Professor with the School of Electronics and Information Engineering, Harbin Institute of Technology. His research interests include remote sensing data processing and machine learning.

Hanlu Jiang received the Bachelor’s degree in remote sensing science and technology in 2014 from the Harbin Institute of Technology, Harbin, China, where she is currently working toward the Master’s degree in the School of Electronics and Information Engineering.
Her research area is in remote sensing image processing technologies.

Chunyang Li has been working toward the Master’s degree in the Department of Information Engineering, School of Electronics and Information Engineering, Harbin Institute of Technology, Harbin, China, since 2015.
Her research concerns remote sensing image processing based on deep learning methods.

Xiuping Jia (SM’03) received the B.Eng. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 1982 and the Ph.D. degree in electrical engineering from The University of New South Wales, Sydney, Australia, in 1996.
Since 1988, she has been with the School of Engineering and Information Technology, The University of New South Wales, Canberra, Australia, where she is currently a Senior Lecturer. She is also a Guest Professor with Harbin Engineering University, Harbin, China, and an Adjunct Researcher with the National Engineering Research Center for Information Technology in Agriculture, Beijing. She is the coauthor of the remote sensing textbook titled Remote Sensing Digital Image Analysis [Springer-Verlag, 3rd ed. (1999) and 4th ed. (2006)]. Her research interests include remote sensing and image data analysis.
Dr. Jia served as the inaugural Chair of the IEEE Australia Capital Territory and New South Wales Section GRSS Chapter from 2010 to 2013. She is an Associate Editor of the IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING.

Pedram Ghamisi (S’12–M’15) received the B.Sc. degree in civil (survey) engineering from the Islamic Azad University, South Tehran Branch, Tehran, Iran; the M.Sc. degree (with first class honors) in remote sensing from Khajeh Nasir Toosi University of Technology, Tehran, in 2012; and the Ph.D. degree in electrical and computer engineering from the University of Iceland, Reykjavik, Iceland, in 2015.
He was a Postdoctoral Research Fellow with the University of Iceland. Since October 2015, he has been a Postdoctoral Research Fellow with Signal Processing in Earth Observation, Technical University of Munich, Munich, Germany, and a Researcher with the Remote Sensing Technology Institute (IMF), German Aerospace Center (DLR), Weßling, Germany. His research interests are in remote sensing and image analysis with special focus on spectral and spatial techniques for hyperspectral image classification and the integration of LiDAR and hyperspectral data for land-cover assessment.
In 2015, Dr. Ghamisi won the prestigious Alexander von Humboldt Fellowship.
