Abstract—Accurate feature extraction plays a vital role in the fields of machine learning, pattern recognition and image processing. Feature extraction methods based on principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA) are capable of improving the performance of classifiers. In this paper, we propose two feature extraction approaches that integrate the extracted features of PCA and ICA through a statistical criterion. The performance of the proposed feature extraction approaches is evaluated on simulated data and three public data sets using the cross-validation accuracy of different classifiers found in the statistics and machine learning literature. Our experimental results show that the integrated ICA and PCA features are more effective than the alternatives in classification analysis.

I. INTRODUCTION

With the rapid growth of modern technology and the wide application of the internet, the amount of data has increased greatly, almost doubling approximately every two years [1]. Big data analysis has become a popular topic and has aroused the interest of researchers over the last few decades [2]-[8]. Especially in pattern classification, the existence of too much information may often reduce the performance of a classifier. It may also cause a classification algorithm to overfit the training samples and generalize poorly to new samples [3].

Intrinsically, good classification results may be obtained from a set of representative features constructed from the knowledge of domain experts. When such expert knowledge is not available, general feature extraction and feature selection techniques are very beneficial for removing redundant or irrelevant features.

In feature extraction for a classification problem, there may exist irrelevant or redundant features that interfere with the learning process and thus lead to a wrong result [5]. Moreover, when the dimension of the feature space is very large, many instances are required to discover the associations among features, causing slow training and testing of the learning algorithm. Some classifiers, such as the support vector machine (SVM), can tolerate extra redundant features, but the performance of other classifiers is not naturally resistant to non-informative predictors: tree- and rule-based models, naïve Bayes, k-nearest neighbors, etc. deteriorate when extra irrelevant features are added [6], [16]-[18].

The most common and useful feature transformation is PCA, proposed by Pearson in the early 20th century [7]. One drawback of PCA is that the extracted components are not always independent or invariant under transformation, which may contradict the assumptions of many supervised classification methods [24].

Another linear transformation method commonly used in classification systems is LDA, proposed by Fisher. It uses the class labels to compute the between-class and within-class scatter matrices and seeks the directions along which the classes are best separated [8]. Although LDA is a very powerful and useful method for feature extraction, it requires enough training samples in each class, and its application is limited when the class means do not differ significantly [23], [25].

Recently, ICA has been found to be a very useful and effective technique for extracting representative features in pattern classification. It was originally proposed by Jutten and Herault [9] for solving the blind source separation (BSS) problem. Although ICA was initially developed to solve the BSS problem, past studies have shown that it can serve as an effective feature extraction method for improving classification performance in both supervised [10]-[14] and unsupervised [15]-[17] classification. It has also been found that ICA may help to improve the performance of various classifiers, such as SVMs, artificial neural networks, decision trees, hidden Markov models, and the naïve Bayes classifier [10]-[17]. In pattern classification, a large number of papers have used PCA, LDA and ICA directly for feature extraction in the fields of face recognition, signal analysis and the UCI machine learning databases [18]-[23].

Although ICA and PCA can be used directly for feature extraction, they are not guaranteed to generate useful information individually [13], [18]. This paper integrates PCA and ICA features to generate more representative features for improving classification performance. Among the numerous applications of ICA, one limitation is the lack of a rule for sorting its components: principal components (PC's) are sorted according to their eigenvalues, but independent components (IC's) have no standard ordering. Past studies have shown that non-gaussian IC's are sometimes the interesting ones in classification problems [34], [36]. The standardized fourth central moment, kurtosis, has previously been used as a measure of non-gaussianity as well as to sort IC's [36]. Since the conventional measure of kurtosis is sensitive to outliers, we consider a quantile measure of kurtosis instead of the classical kurtosis in this paper.
II. FEATURE EXTRACTION: PCA, LDA, AND ICA

An array of attributes used to classify the output class is called a set of features. Selecting a subset of features is referred to as feature selection, whereas feature extraction refers to the process of transforming the data space into a feature space in which the original data are represented by a reduced number of effective features, retaining as much of the intrinsic information as possible. Some common feature extraction methods are described below.

A. Principal Component Analysis (PCA)

The central idea of PCA is to transform the data linearly into a low-dimensional subspace that maximizes the variance of the data. The resulting vectors form an uncorrelated orthogonal basis set. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric.

Mathematically, consider k observations in a data set, each observation being n-dimensional once the class label is ignored. Let x_1, x_2, \dots, x_k \in \mathbb{R}^n. PCA is computed by the following steps:

• Calculate the n-dimensional mean vector \mu by

  \mu = \frac{1}{k} \sum_{i=1}^{k} x_i

• Compute the estimated covariance matrix S of the observed data by

  S = \frac{1}{k} \sum_{i=1}^{k} (x_i - \mu)(x_i - \mu)^t

• Compute the eigenvalues and corresponding eigenvectors of S, where \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0.

• From the n original variables x_1, x_2, \dots, x_n, calculate the n principal components by

  y_1 = a_{11} x_1 + a_{12} x_2 + \dots + a_{1n} x_n
  y_2 = a_{21} x_1 + a_{22} x_2 + \dots + a_{2n} x_n
  \vdots
  y_n = a_{n1} x_1 + a_{n2} x_2 + \dots + a_{nn} x_n

The y_i's are uncorrelated (orthogonal); y_1 explains as much as possible of the original variance in the data set, y_2 explains as much as possible of the remaining variance, and so on.

In general, a few large eigenvalues dominate the others in most practical data sets, that is,

  \gamma_m = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_m}{\lambda_1 + \lambda_2 + \dots + \lambda_m + \dots + \lambda_n} \ge 80\%
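As a concrete illustration of these steps, the following R sketch (base R only; the data set and variable names are our own choices, not from the paper) computes the principal components via the eigendecomposition of the covariance matrix and counts how many components are needed to reach the 80% threshold used above.

```r
# Minimal PCA sketch following the steps above.
# X is a k x n numeric matrix (k observations, n variables); the iris
# measurements are used here purely as a stand-in example.
X <- as.matrix(iris[, 1:4])

mu  <- colMeans(X)                     # n-dimensional mean vector
Xc  <- sweep(X, 2, mu)                 # centered observations
S   <- crossprod(Xc) / nrow(Xc)        # covariance estimate (1/k) t(Xc) %*% Xc
eig <- eigen(S, symmetric = TRUE)      # eigenvalues in decreasing order

Y <- Xc %*% eig$vectors                # principal component scores y_1, ..., y_n

# Proportion of variance explained and number of PC's needed for >= 80%
gamma <- cumsum(eig$values) / sum(eig$values)
m80   <- which(gamma >= 0.80)[1]
```

In practice prcomp() performs the same computation, up to the 1/k versus 1/(k-1) scaling of the covariance matrix.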
|J (W )| =
• From k original variables, calculate k principal compo- W T SW W
nents by Find W that maximize J(W ) to solving the eigenval-
y1 = a11 x1 + a12 x2 + ... + a1k xk ue/eigenvector system as given below
SB W = λSW W
y2 = a21 x1 + a22 x2 + ... + a2k xk
LDA is supervised can extract (C-1) features while PCA
...
is unsupervised can extract r (rank of data) principles
yk = ak1 x1 + ak2 x2 + ... + akk xk features. In supervised learning, LDA is more efficient
feature extraction method than PCA because its extracted
yk ’s are uncorrelated (orthogonal). y1 explains as much as features use the class information. However, it is assumed
possible of original variance in data set, y2 explains as much that the distributions of samples in each class are normal and
as possible of remaining variance etc. homoscedastic. Therefore, it may be difficult to find a good
In general, a few larger eigenvalues dominate the others in and representative feature space if this assumption is violated
the most practical data sets, that is [25]. Furthermore, LDA may fail not only in heteroscedastic
λ1 + λ2 + ... + λm cases and sometimes even in homoscedastic cases [23], [33].
γk = ≥ 80%
λ1 + λ2 + ... + λm + ... + λk
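A compact R sketch of these formulas is given below (base R, again with iris as a stand-in data set; the helper names are ours). It builds S_W and S_B directly and takes the discriminant directions from the eigenvectors of S_W^{-1} S_B, the usual numerical route to solving S_B w = λ S_W w; MASS::lda() would return equivalent directions up to scaling.

```r
# Minimal multi-class LDA sketch: scatter matrices and projection directions.
X <- as.matrix(iris[, 1:4])            # N-dimensional observations
y <- iris$Species                      # class labels, C = 3 classes

mu  <- colMeans(X)
cls <- levels(y)
Sw  <- matrix(0, ncol(X), ncol(X))
Sb  <- matrix(0, ncol(X), ncol(X))
for (cl in cls) {
  Xi  <- X[y == cl, , drop = FALSE]
  mui <- colMeans(Xi)
  Sw  <- Sw + crossprod(sweep(Xi, 2, mui))        # within-class scatter
  Sb  <- Sb + nrow(Xi) * tcrossprod(mui - mu)     # between-class scatter
}

# Solve S_B w = lambda S_W w through eigen(solve(Sw) %*% Sb);
# only the first C-1 eigenvalues are (numerically) non-zero.
eg <- eigen(solve(Sw) %*% Sb)
W  <- Re(eg$vectors[, 1:(length(cls) - 1)])       # projection matrix
Z  <- sweep(X, 2, mu) %*% W                       # (C-1)-dimensional features
```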
1084
C. Independent Component Analysis (ICA)

ICA is a relatively new statistical and computational technique used to discover hidden factors (sources or features) from a set of measurements or observed data such that the sources are maximally independent.

Mathematically, the observed variables x(t) = x_1(t), x_2(t), \dots, x_n(t) are assumed to be composed of linear combinations of original, mutually independent sources s(t) = s_1(t), s_2(t), \dots, s_n(t) at time point t, which is expressed as

  x(t) = A s(t)    (1)

where A is a mixing matrix with full rank. In the principles of ICA, Eq. (1) is often written as

  y = W x    (2)

where W = A^{-1} is the demixing matrix and y = y_1, y_2, \dots, y_n denotes the independent components. The task is to estimate the demixing matrix and the independent components based only on the mixed observations, which can be done by various ICA algorithms such as FastICA, JADE, Infomax, etc.
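As a small illustration of the model in Eqs. (1)-(2) being inverted by one of these algorithms, the R sketch below uses the fastICA package on two mixtures that we simulate ourselves (the sources and mixing matrix are illustrative, not data from the paper).

```r
# Toy blind source separation with the fastICA package.
library(fastICA)

set.seed(1)
n <- 1000
s <- cbind(runif(n, -1, 1),                        # sub-gaussian (uniform) source
           sin(seq(0, 8 * pi, length.out = n)))    # oscillatory source
A <- matrix(c(0.6, 0.4, 0.3, 0.7), 2, 2)           # full-rank mixing matrix
x <- s %*% t(A)                                    # observed mixtures, x = A s

ica <- fastICA(x, n.comp = 2)
# ica$S holds the estimated independent components y = W x and
# ica$A the estimated mixing matrix (up to sign and permutation).
head(ica$S)
```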
In the principles of ICA estimation, the extracted components are non-gaussian and independent. Kurtosis (\beta_1) is one of the ways to measure non-gaussianity. A gaussian IC has a kurtosis value equal to 0, a sub-gaussian IC has \beta_1 \le 0, and a super-gaussian IC has \beta_1 \ge 0. The classical measure of kurtosis is defined as

  \beta_1 = \frac{E(x - \mu)^4}{\left[E(x - \mu)^2\right]^2} - 3 = \frac{\mu_4}{\sigma^4} - 3    (3)

Since the conventional measures of kurtosis are essentially based on sample averages, they are sensitive to outliers. Moreover, the impact of outliers is greatly amplified in the conventional measures of skewness and kurtosis because the deviations are raised to the third and fourth powers [32]. To overcome this problem, we attempt to use a robust measure of kurtosis in ICA. Moors (1988) proposed a quantile-based alternative to \beta_1. The Moors kurtosis is

  \text{Kurtosis} = \frac{(E_7 - E_5) + (E_3 - E_1)}{E_6 - E_2}    (4)

where E_i is the i-th octile, that is, E_i = F^{-1}(i/8). For gaussian independent components, Moors' quantile kurtosis is equal to 1.23. One advantage of the quantile measure of kurtosis is that it does not depend on the first and second moments, so it is more robust than the classical measure of kurtosis.
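Eq. (4) translates directly into R as shown below (the function name is ours). Smaller values of the Moors statistic indicate flatter, more sub-gaussian components, so ordering estimated IC's by this quantity from smallest upward is one way to implement the kurtosis-based ordering used in this paper.

```r
# Moors (1988) octile-based kurtosis, Eq. (4); approximately 1.23 for a
# gaussian variable.  The name 'moors_kurtosis' is our own.
moors_kurtosis <- function(x) {
  E <- quantile(x, probs = (1:7) / 8, names = FALSE)   # octiles E_1, ..., E_7
  ((E[7] - E[5]) + (E[3] - E[1])) / (E[6] - E[2])
}

# Example: rank estimated IC's from most sub-gaussian upward, assuming
# 'ica$S' is the component matrix returned by fastICA above.
# k_moors   <- apply(ica$S, 2, moors_kurtosis)
# S_ordered <- ica$S[, order(k_moors)]
```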
In pattern classification, ICA is useful as a dimension-preserving transformation because it produces statistically independent components, and it has been used directly for feature extraction [26]-[29]. In earlier studies, Kwak et al. [17] showed that ICA outperforms PCA and LDA as a feature extraction method for face recognition. More recently, Fan et al. [13], [14] presented sequential feature extraction using class-conditional independent component analysis for naïve Bayes classification of microarray data.

In the past, PCA, ICA and LDA have been studied individually in various supervised and unsupervised classification problems. In the next section, we discuss our newly proposed feature extraction methods, which are derived from PCA and ICA.

III. PROPOSED APPROACH

Although PCA and ICA are powerful in the fields of data visualization and blind source separation [28], [29], as feature extraction methods for classification problems they are not as good as expected [8], [18]. To overcome this problem, we propose two feature extraction methods that integrate ICA and PCA to represent significant feature sets for classification problems.

The idea of the proposed feature extraction is very simple. In the first approach, PCA is applied to the original data and we retain those PC's that explain at least 80% of the total variation; a standard ICA algorithm is then applied to the extracted PC's, and the resulting IC's are ordered using the quantile measure of kurtosis. This method is named ICA on PCA (IPCA). In the proposed method, the component with the most negative kurtosis (kurtosis < 0) is considered IC1, the second most negative is IC2, and so on.

In classification analysis, sub-gaussian distributions are the more interesting ones, since they can indicate a cluster structure or at least a uniformly distributed factor. Thus the components with the most negative kurtosis can give us the most relevant information for classification [34], [36].

In our second approach, ICA and PCA are applied to the original data individually. The extracted IC's and PC's are ordered using the quantile measure of kurtosis and the eigenvalues, respectively. The ordered extracted features of ICA and PCA are then integrated in such a way that they contain the most sub-gaussian IC's and those PC's that explain at least 80% of the variability of the original data. This proposed approach is named IC-PC feature extraction. Fig. 1 shows the flow chart of implementing IPCA and IC-PC on the four databases; a code sketch of the two constructions follows.
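The R sketch below is our reading of the two constructions under stated assumptions: the 80% variance cut-off from Section II-A, the Moors ordering implemented by the moors_kurtosis() function above, and fastICA as the ICA algorithm. The number of IC's retained in IC-PC and all helper names are illustrative choices, not values fixed by the paper.

```r
# Sketch of the IPCA and IC-PC feature sets; X is an observations-by-features
# numeric matrix.  Requires the fastICA package and moors_kurtosis() above.
library(fastICA)

ipca_features <- function(X, var_cut = 0.80) {
  pc <- prcomp(X, center = TRUE, scale. = TRUE)
  m  <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= var_cut)[1]
  S  <- fastICA(pc$x[, 1:m, drop = FALSE], n.comp = m)$S   # ICA on retained PC's
  S[, order(apply(S, 2, moors_kurtosis)), drop = FALSE]    # most sub-gaussian first
}

icpc_features <- function(X, var_cut = 0.80, n_ic = 1) {
  pc <- prcomp(X, center = TRUE, scale. = TRUE)
  m  <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= var_cut)[1]
  S  <- fastICA(scale(X), n.comp = ncol(X))$S              # ICA on original data
  S  <- S[, order(apply(S, 2, moors_kurtosis)), drop = FALSE]
  cbind(pc$x[, 1:m, drop = FALSE],      # PC's explaining >= 80% of the variance
        S[, 1:n_ic, drop = FALSE])      # most sub-gaussian IC's
}
```

Under this reading, the IC-PC(2,1) entry of Table II corresponds to fusing the first two PC's with the single most sub-gaussian IC.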
IV. RESULTS FROM EXPERIMENT

In this work, we evaluate the performance of the proposed feature extraction approaches on a simulated data set, two data sets from the UCI repository [30], namely the Wisconsin breast cancer and wine data, and one data set collected on Australian crabs [35].

In order to test the effectiveness of the proposed feature extraction methods, we also select the most influential original features using a random forest algorithm (FS-RFA), which is available in the R CRAN package FSelector [31]. FS-RFA first employs a weight function to generate a weight for each feature, using the mean decrease in accuracy as the importance measure; it then selects the optimal subset of features through a ranking approach based on chi-square and information gain. Finally, the process returns the top most influential subset of the original features.
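A minimal sketch of this selection step with the FSelector package is given below; it is our reconstruction of the described procedure (importance.type = 1 requests the mean-decrease-in-accuracy weights, and the cut-off k is an illustrative choice rather than a value fixed by the paper).

```r
# Feature selection by random forest importance (FS-RFA) via FSelector.
library(FSelector)

# 'dat' is assumed to be a data frame whose column 'Class' holds the label.
w   <- random.forest.importance(Class ~ ., dat, importance.type = 1)
top <- cutoff.k(w, k = 5)              # keep the k highest-weighted features
dat_fs <- dat[, c(top, "Class")]
```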
Fig. 1. Flow chart for implementing IPCA and IC-PC feature extraction.
of IPCA is approximately the same as that of the other methods for the first four PC's, whereas IC-PC(2,1), i.e. the fusion of the first two PC's and the single most negative IC, gives better results for SVM and naïve Bayes classification. Surprisingly, only one PC and one IC perform well in decision tree classification. The performance of LDA is almost the same as that of IC-PC for this problem. Table II shows that the IPCA and IC-PC feature extraction methods can significantly improve the classifier performance of SVM, the decision tree and the MLP based on 10-fold cross-validation.

TABLE II
RESULTS FOR BREAST CANCER DATA (PARENTHESES ARE THE NUMBER OF PC'S & IC'S, RESPECTIVELY)

                Classification Accuracy (%)
Features    SVM           Naïve Bayes   C5.0          MLP
Original    96.56         95.99         94.27         96.85
FS-RFA      96.28         96.28         94.13         96.56
PCA         96.85 (3)     96.56 (3)     96.28 (2)     96.85 (4)
LDA         96.99         97.13         96.42         96.85
ICA         95.85         94.57         91.41         96.71
IPCA        96.99 (4)     95.99 (3)     94.99 (4)     96.99 (4)
IC-PC       96.85 (2,1)   96.71 (2,1)   97.28 (1,1)   96.85 (4,1)

C. Wine Data

These data come from the machine learning repository [30]. A chemical analysis of 178 Italian wines from three different cultivars yielded 13 measurements. The data set consists of 13 numerical variables and three classes, where the numbers of instances are 59 in class 1, 71 in class 2, and 48 in class 3. This data set is often used to test and compare the performance of various classification algorithms.

For these data, the first 5 PC's and the first 7 PC's explain 80% and 90.06% of the total variation, respectively, while the number of original attributes is 13. The classification accuracy rates for the four classifiers are displayed in Table III. It can be seen that both IPCA and IC-PC perform better than the others, which demonstrates the effectiveness of the proposed approach.

TABLE III
RESULTS FOR WINE DATA (PARENTHESES ARE THE NUMBER OF PC'S & IC'S, RESPECTIVELY)

                Classification Accuracy (%)
Features    SVM           Naïve Bayes   C5.0          MLP
Original    98.33         97.22         90.52         97.75
FS-RFA      97.78         97.78         91.66         98.33
PCA         97.78 (5)     97.78 (7)     97.07 (5)     97.22 (5)
LDA         98.90         98.99         96.63         97.75
ICA         97.78         89.38         76.96         98.30
IPCA        97.78 (5)     89.38 (9)     96.07 (4)     97.75 (5)
IC-PC       98.90 (5,9)   98.90 (6,9)   94.97 (5,9)   97.75 (1,9)
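The accuracies reported above are cross-validation estimates (10-fold in the breast cancer experiment). The following R sketch shows one way such an estimate can be computed for a given feature matrix, with e1071's SVM standing in for any of the four classifiers; the fold construction and function name are our own.

```r
# 10-fold cross-validation accuracy for one classifier on one feature set.
# 'feat' is a numeric feature matrix (e.g. the output of icpc_features) and
# 'y' the class labels as a factor.
library(e1071)

cv_accuracy <- function(feat, y, k = 10) {
  set.seed(42)
  folds <- sample(rep(1:k, length.out = nrow(feat)))
  acc <- sapply(1:k, function(f) {
    fit  <- svm(x = feat[folds != f, , drop = FALSE], y = y[folds != f])
    pred <- predict(fit, feat[folds == f, , drop = FALSE])
    mean(pred == y[folds == f])
  })
  mean(acc)
}
```

The same loop can be repeated with naiveBayes(), C5.0() from the C50 package, or an MLP implementation to mirror the comparison in Tables II and III.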
ACKNOWLEDGMENT

This work was supported by the Natural Science Foundation of China under Grant 61171138.

REFERENCES

[1] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, Knowledge discovery in databases: An overview, AI Magazine, no. 3, pp. 57-70, 1992.
[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence and The MIT Press, 1996.
[3] V. S. Cherkassky and I. F. Mulier, Learning from Data, chapter 5, John Wiley & Sons, 1998.
[4] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, 1986.
[5] K. J. Cios, W. Pedrycz, and R. W. Swiniarski, Data Mining Methods for Knowledge Discovery, chapter 9, Kluwer Academic Publishers, 1998.
[6] G. H. John, Enhancements to the Data Mining Process, Ph.D. thesis, Computer Science Dept., Stanford University, 1997.
[7] K. Pearson, LIII. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, vol. 2, no. 11, pp. 559-572, 1901.
[8] A. M. Martinez and A. C. Kak, PCA versus LDA, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228-233, 2001.
[9] C. Jutten and J. Herault, Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture, Signal Processing, vol. 24, pp. 1-10, 1991.
[10] X. Zhang, V. Ramani, Z. Long, Y. Zeng, A. Ganapathiraju, and J. Picone, Scenic beauty estimation using independent component analysis and support vector machines, in Proceedings of IEEE Southeastcon, pp. 274-277, 1999.
[11] N. Kwak, C. H. Choi, and J. Y. Choi, Feature extraction using ICA, Lecture Notes in Computer Science, vol. 2130, pp. 568-573, 2001.
[12] S. N. Yu and K. T. Chou, Integration of independent component analysis and neural networks for ECG beat classification, Expert Systems with Applications, vol. 34, pp. 2841-2846, 2008.
[13] L. Fan, K. L. Poh, and P. Zhou, A sequential feature extraction approach for naive Bayes classification of microarray data, Expert Systems with Applications, vol. 36, pp. 9919-9923, 2009.
[14] L. Fan, K. L. Poh, and P. Zhou, Partition-conditional ICA for Bayesian classification of microarray data, Expert Systems with Applications, vol. 37, pp. 8188-8192, 2010.
[15] S. I. Lee and S. Batzoglou, Application of independent component analysis to microarrays, Genome Biology, vol. 4, R76, 2003.
[16] A. Kapoor, T. Bowles, and J. Chambers, A novel combined ICA and clustering technique for the classification of gene expression data, in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 5, pp. 621-624, 2005.
[17] N. Kwak, Feature extraction for classification problems and its application to face recognition, Pattern Recognition, vol. 41, pp. 1701-1717, 2008.
[18] N. Kwak and C. Choi, Feature extraction based on ICA for binary classification problems, IEEE Transactions on Knowledge and Data Engineering, vol. 15, pp. 1374-1388, 2003.
[19] X. Chen, Z. Jing, and G. Xiao, Nonlinear fusion for face recognition using fuzzy integral, Communications in Nonlinear Science and Numerical Simulation, vol. 12, pp. 823-831, 2007.
[20] M. R. Boutell and J. Luo, Beyond pixels: Exploiting camera metadata for photo classification, Pattern Recognition, vol. 38, pp. 935-946, 2005.
[21] V. Sanchez-Poblador, E. Monte-Moreno, and J. Sol-Casals, ICA as a preprocessing technique for classification, Lecture Notes in Computer Science (LNCS), vol. 3195, pp. 1165-1172, 2004.
[22] J. Fortuna and D. Capson, Improved support vector classification using PCA and ICA feature space modification, Pattern Recognition, vol. 37, pp. 1117-1129, 2004.
[23] J. Oh, N. Kwak, M. Lee, and C. H. Choi, Generalized mean for feature extraction in one-class classification problems, Pattern Recognition, vol. 46, pp. 3328-3340, 2013.
[24] A. R. Webb, Statistical Pattern Recognition, 2nd ed., John Wiley and Sons, 2002.
[25] M. Zhu and A. Martinez, Subclass discriminant analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 8, pp. 1274-1286, 2006.
[26] A. Hyvarinen, E. Oja, P. Hoyer, and J. Hurri, Image feature extraction by sparse coding and independent component analysis, in Proc. 14th Int'l Conf. Pattern Recognition, Aug. 1998.
[27] A. D. Back and T. P. Trappenberg, Input variable selection using independent component analysis, in Proc. Int'l Joint Conf. Neural Networks, July 1999.
[28] H. H. Yang and J. Moody, Data visualization and feature selection: New algorithms for nongaussian data, Advances in Neural Information Processing Systems, vol. 12, 2000.
[29] T.-Y. Yang and C.-C. Chen, Data visualization by PCA, LDA, and ICA, ACEAT-493, 2015.
[30] University of California, Irvine (UCI) Machine Learning Repository. http://www.ics.uci.edu/mlearn/
[31] The Comprehensive R Archive Network. https://cran.r-project.org/
[32] T. H. Kim and H. White, On more robust estimation of skewness and kurtosis: Simulation and application to the S&P 500 index, Department of Economics, UCSD, 2003.
[33] D. Tao, X. Li, X. Wu, and S. Maybank, General averaged divergence analysis, in Proceedings of the IEEE International Conference on Data Mining, 2007.
[34] M. S. Reza, M. Nasser, and M. Shahjaman, An improved version of kurtosis measure and their application in ICA, International Journal of Wireless Communication and Information Systems, vol. 1, no. 1, 2011.
[35] N. A. Campbell and R. J. Mahon, A multivariate study of variation in two species of rock crab of genus Leptograpsus, Australian Journal of Zoology, vol. 22, pp. 417-425, 1974.
[36] M. Scholz, Y. Gibon, M. Stitt, and J. Selbig, Independent component analysis of starch deficient pgm mutants, in Proceedings of the German Conference on Bioinformatics, pp. 95-104, 2004.