Principal Component Analysis
works illustrating the potential of PCA applications to a wide range of areas. An experimental
investigation of the ability of PCA to explain variance and reduce dimensionality is also
developed, which confirms the efficacy of PCA and also shows that whether or not the original
data is standardized can have important effects on the obtained results. Overall, we believe the several covered
issues can assist researchers from the most diverse areas in using and interpreting PCA.
Galileo’s gravitation studies [1]. Since that time, substantial technological advances, in particular in electronics and informatics, have implied an ever increasing accumulation of large amounts of the most varied types of data, extending from eCommerce to Astronomy. Not only have more types of data become available, but traditional measurements in areas such as particle physics are now performed with increased resolution and in substantially larger numbers. Such trends are now aptly known as the data deluge [2]. However, such vast quantities of data are, by themselves, of no great avail unless means are applied to identify the most relevant information contained in such repositories, a process known as data mining [3]. Indeed, provided effective means are available for mining, truly valuable information can be extracted. For instance, it is likely that the information in existing databases would already be enough to allow us to find the cure for several illnesses. The importance of organizing and summarizing data can therefore hardly be exaggerated.

While a definitive solution to the problem of data mining remains elusive, there are some well-established approaches which have proven useful for organizing and summarizing data [4]. Perhaps the most popular among these is Principal Component Analysis – PCA [5–7]. Let’s organize the several (N) measurements of each object or individual i in terms of a respective feature vector X~i, existing in an N-dimensional feature space. PCA can then be understood as a statistical method in which the coordinate axes of the feature space are rotated so that the first axis results in the maximum possible data dispersion (as quantified by the statistical variance), the second axis in the second largest dispersion, and so on. This principle is illustrated with respect to a simple situation with N = 2 in Figure 1. Here, we have Q objects (real beans), each described by N = 2 respective measurements, which are themselves organized as a respective feature vector. More specifically, each object i has two respective measurements X1i and X2i, giving rise to X~i. In this particular example involving real beans, the two chosen measurements correspond to the diameter (i.e. the maximum distance between any two points belonging to the border of the object) and the square root of the bean area.

When mapped into the respective two-dimensional feature space, these objects define a distribution of points which, in the case of this example, assumes an elongated shape. Finding this type of point distribution in the feature space can be understood as an indication of correlation between the measurements. In the case of beans, their shape is not far from a disk, in which the area is given as π times the square of the radius (equal to half the diameter). So, except for shape variations, the two chosen measurements are directly related and would be, in principle, redundant. However, because no two beans have exactly the same shape, we have the dispersion observed in the feature space (Figure 1(b)).

The application of PCA to this dataset will rotate the coordinate system, yielding the new axes identified as PCA1 and PCA2 in the figure. The maximum data dispersion in one dimension is now found along the first PCA axis, PCA1. The second axis, PCA2, will be characterized by the second largest one-dimensional dispersion. Interestingly, provided the original data distribution is elongated enough, it is now possible to discard the second axis without great loss of overall data variation. The resulting feature space now has dimension M = 1. The essence of PCA applications, therefore, consists in simplifying the original data with minimum loss of overall dispersion, paving the way to a reduction of the dimensionality in which the data is represented. Typical applications of PCA are characterized by having M << N. Observe that PCA ensures maximum dispersion projections and promotes dimensionality reduction, but does not guarantee that the main axes (along the directions of largest variation in the original data) will necessarily correspond to the directions that would be most useful for each particular study. For instance, if one is aiming at separating categories of data, the direction of best discrimination may not necessarily correspond to that of maximum dispersion, as provided by PCA. Indeed, a more robust approach to exploring and modeling a data set should involve, in addition to PCA, the application of several types of projections, including Linear Discriminant Analysis (LDA), Independent Component Analysis (ICA), and maximum entropy, amongst many others [8, 9]. This is illustrated in Figure 2(a). However, this approach can imply substantial computational cost because of the non-linear optimization required by many of the aforementioned projections. In addition, the use of high-dimensional data as input to projection methods can imply problems of statistical significance [4]. Interestingly, in the case of data sets characterized by the presence of correlations between the measurements, PCA can be applied prior to the other, computationally more expensive projections in order to obtain data simplification, therefore reducing the overall execution time and catering for more significant statistics. This situation is illustrated in Figure 2(b). Thus, one particularly important issue with PCA regards its efficiency for simplifying, through decorrelation, data sets typically found in the real world or in simulations. This issue is addressed experimentally in the present work.

To our knowledge, these questions have not been specifically and systematically addressed in the context of PCA. However, extensive evidence exists supporting hypothesis (1), including several examples in which real-world data can have most of its dispersion preserved while keeping just a few of the principal component axes. However, despite such evidence, it would still be interesting to perform a more systematic investigation of the potential/efficiency of PCA with respect to specific types of data (e.g. biological, astronomical, simulated data, etc.). All in all, this work has three main objectives: (a) to present, in an intuitive and accessible manner, the concept of PCA as well as several issues regarding its practical
[Figure residue removed. FIG. 1 panels: a bean with its two measurements, Diameter (X1i) and √Area (X2i), and the respective feature space. FIG. 2 panels: (a) Data (N features) fed to several projections (LDA, ICA, Entropy) for analysis and modeling; (b) the same pipeline with a PCA feature-extraction step reducing N to M features; (c) comparison plot of the resulting projections.]
TABLE I. Typical organization of the input data for PCA application.

[Figure residue removed. Panels (a)–(d) summarize the normalization transformations: centering, X̂ = X − µX, yields µX̂ = 0; scaling, X̃ = X̂/σX, yields σX̃ = 1; the expectations E[·] of the products of the original, centered, and standardized variables give Corr(X, X), Cov(X, X), and Pearson(X, X), respectively.]
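The centering and scaling transformations summarized above can be sketched in a few lines of Python with numpy. This is a minimal illustration on a hypothetical two-variable dataset with very different scales; rows are measurements and columns are objects:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 1000)) * np.array([[10.0], [0.1]])  # very different scales

X_hat = X - X.mean(axis=1, keepdims=True)                   # centering: mean becomes 0
X_tilde = X_hat / X_hat.std(axis=1, ddof=1, keepdims=True)  # scaling: std becomes 1

Q = X.shape[1]
cov = (X_hat @ X_hat.T) / (Q - 1)          # covariance from the centered data
pearson = (X_tilde @ X_tilde.T) / (Q - 1)  # Pearson correlation from standardized data
```

The same expectation E[·] applied to the centered data yields the covariance, and applied to the standardized data yields the Pearson correlation, whose diagonal is identically 1.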
mean vector µ~X can be calculated by taking the average of each variable independently. For N > 1, it is possible to calculate the correlation, covariance or Pearson correlation coefficient for all pairs of random variables.

It should be observed that the three joint statistical measurements discussed in this section assume a linear relationship between pairs of random variables. Other measurements can be used to characterize non-linear relationships, such as Spearman’s rank correlation and mutual information [14].

III. PRINCIPAL COMPONENT ANALYSIS

In this section, we present the mathematical formulation of PCA. For simplicity’s sake, we develop this formulation by integrating the conceptual framework presented in the introduction with the basic statistical concepts covered in Section II. Consider that the original dataset to be analyzed is given as a matrix X, where each of the Q columns represents an object/individual, and each row i, 1 ≤ i ≤ N, expresses a respective measurement/feature X~i. Also, the values measured for the jth object are represented as X~j.

An important fact that needs to be taken into account when working with PCA is that, being a linear transformation, it can be expressed in the following simple matrix form:

    Y = W X.  (6)

In other words, the PCA transformation corresponds to multiplying a respective transformation matrix W by the original measurements X. Figure 6 shows the notation used for representing each variable involved in the transformation. All we need to do in order to implement the PCA of a given dataset is to obtain the respective transformation matrix W and then apply Equation 6. The derivation of W is explained as follows.

The ith element of the empirical mean vector µ~X = (µ_X1, ..., µ_XN)^T, with dimension N × 1, is defined as:

    µ_Xi = (1/Q) ∑_{j=1}^{Q} X_ij,  (7)

where X_ij corresponds to the value of the ith measurement taken for the jth object. The measurements are then brought to the coordinate origin by subtracting the respective means, i.e.:

    X̂_i = X_i − µ_Xi.  (8)

The matrix containing all the elements X̂_i is henceforth represented as X̂. The covariance matrix of the variables in matrix X̂ can now be defined as:

    K = Cov(X̂) = (1/(Q − 1)) X̂ X̂^T.  (9)

At this stage we have the covariance matrix of the original measurements. The next step consists in obtaining the necessarily non-negative eigenvalues λ_i, sorted in decreasing order, and respective eigenvectors v~_i, 1 ≤ i ≤ N, of K.

The eigenvectors are now stacked in order to obtain the transformation matrix, i.e.:

    W = [← v~_1 →; ...; ← v~_N →],  (10)

with each row containing one eigenvector. So, all we need to do now to obtain the PCA projection of a given individual i is to use Equation 6, i.e.:

    Y~_i = W X~_i,  (11)

where X~_i and Y~_i represent, respectively, the feature vector of object i in the original and projected space.

An important point to be kept in mind is that each dataset will yield a respective transformation matrix W. In other words, this matrix adapts to the data in order to provide some critically important properties of PCA, such as the ability to completely decorrelate the original variables and to concentrate variation in the first PCA axes.

So far, the transformed data matrix Y still has the same size as the original data matrix X. That is, the transformation implied by Equation 6 only remapped the data into a new feature space defined by the eigenvectors of the covariance matrix. This process can be understood as a rotation of the coordinate system that aligns the axes along the directions of largest data variation. Reducing the number of variables corresponds to keeping the first M ≤ N PCA axes. The key question here is: what are the conditions allowing this data simplification? In addition, are there subsidies for choosing a reasonable value for M?

The first important fact to consider is that each eigenvalue λ_i, 1 ≤ i ≤ N, of the data covariance matrix K corresponds to the variance σ²_Yi of the respective transformed variable Y_i. Let’s represent the sum of all these variances as:

    S = ∑_{i=1}^{N} σ²_Yi.  (12)

An important property, demonstrated in Section V B, is that the total data variance S is preserved under axis rotation, and therefore also by PCA. In other words, the total variance of the original data is equal to that of the new data produced by PCA.

We can define the conserved variance in a PCA with M axes as:

    Sc = ∑_{i=1}^{M} σ²_Yi.  (13)
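For illustration, the procedure summarized by Equations 6–13 can be sketched directly in Python with numpy. This is a minimal sketch, not the authors' implementation; as in the matrix X above, rows are measurements and columns are objects:

```python
import numpy as np

def pca(X, M=None):
    """PCA of an N x Q data matrix X (rows = measurements, columns = objects).

    Returns the projected data Y = W X (Equation 11) and the eigenvalues of
    the covariance matrix K (Equation 9), sorted in decreasing order.
    """
    N, Q = X.shape
    X_hat = X - X.mean(axis=1, keepdims=True)   # centering (Equation 8)
    K = (X_hat @ X_hat.T) / (Q - 1)             # covariance matrix (Equation 9)
    lam, V = np.linalg.eigh(K)                  # eigh: K is symmetric
    order = np.argsort(lam)[::-1]               # sort eigenvalues decreasingly
    lam, V = lam[order], V[:, order]
    W = V.T if M is None else V.T[:M]           # eigenvectors stacked as rows (Eq. 10)
    return W @ X_hat, lam
```

The fraction of conserved variance for a given M (the ratio Sc/S of Equations 12 and 13) then follows as `lam[:M].sum() / lam.sum()`.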
[FIG. 6: Element-wise notation of the transformation Y = W X — Y is an M × Q matrix with entries Y_ij, W is M × N with entries W_ij, and X is N × Q with entries X_ij.]
So, the overall conservation of variance by PCA can be expressed in terms of the ratio

    G = (Sc / S) 100%.  (14)

Now, the number M of variables to preserve can be defined with respect to G. For instance, if we desire to preserve 70% of the overall variance after PCA, we choose M so that G ≈ 70%.

Many distinct methods have been defined to assist in the choice of a suitable M. For instance, Tipping and Bishop [15] defined a probabilistic version of PCA based on a latent variable model, which allowed the definition of an effective dimensionality of the dataset using a Bayesian treatment of PCA [16]. In [17], a generalization error was employed to select the number of principal components, which was evaluated analytically and empirically. In order to compute this error analytically, the authors modeled the data using a multivariate normal distribution.

In principle, there is no assurance that an M < N exists ensuring that a given variance preservation can be achieved. This will critically depend on the distribution of the values λ_i, which itself depends on each specific dataset. More specifically, datasets with highly correlated variables will favor variance preservation. It has been empirically verified that substantial variance preservation can be obtained for many types of real-world data. Indeed, one of the objectives of the current work is to investigate typical variance preservations that are commonly achieved for several categories of real-world data.

IV. TO STANDARDIZE OR NOT TO STANDARDIZE?

We have already seen that random variables can be normalized, through statistical transformations, in several ways so as to address specific requirements. The application of PCA often raises the question of whether or not to normalize the original data. Quite often, the dataset is standardized prior to PCA [5], but other normalizations can also be considered. In this section, we discuss the important issue of data standardization prior to PCA.

A possible way to address this issue is to first consider the respective implications. As seen in Section II, standardization of a random variable (or vector) leads to respective dimensionless new variables that have zero mean and unit standard deviation (i.e. similar scales). Therefore, all standardized, dimensionless variables will have similar ranges of variation. So, standardization can be particularly advisable as a way to avoid biasing the influence of certain variables when the original variables have significantly different dispersions or scales. When the original measurements already have similar dispersions, standardization has little effect.

There are, however, some situations in which standardization may not be advisable. Figure 7(a) shows such a situation, in which one of the variables, namely X1, varies within the range [−100, 100], but the other variable, X2, is almost constant other than by a small variation. In case this small variation is intrinsic to the data (i.e. it is not an artifact) and meaningful, standardization can be used to amplify this information. However, if this variation is a consequence of an unwanted effect (e.g. experimental error or noise), standardization will emphasize what should have otherwise been eliminated (Figure 7(b)). In such cases, either the noise should be reduced by some means, or standardization avoided.

In order to better understand the influence of standardization on PCA, let’s consider two properties P1 and P2 that can be used to characterize a set of objects. Suppose that these two properties are perfectly correlated, that is, their Pearson correlation coefficient is ρ_P1P2 = 1. When property P2 is measured, an intrinsic error might
be incorporated into the measurement. This error may be due to, for instance, the finite resolution of the measurement apparatus or the influence of other variables that were not accounted for in the measurement process. Therefore, the actual measured value X2 may be written as

    X2 = P2 + ε,  (15)

where ε is an additive noise. Suppose that ε is a random variable having a normal distribution with mean 0 and variance σ²_ε. Also, for simplicity’s sake, consider that there is no noise associated with the measurement of the other variable P1, that is, X1 = P1. The Pearson correlation coefficient between variables X1 and X2 is given by

    ρ_X1X2 = E[(X1 − µ_X1)(X2 − µ_X2)] / (σ_X1 σ_X2).  (16)

[FIG. 7 caption fragment: ... the first axis, but the dispersion in the second axis is only artifact/noise. If standardized, the unwanted variation in the second axis will be substantially magnified, therefore affecting the overall analysis.]

Therefore, the Pearson correlation coefficient between the two perfectly correlated variables P1 and P2 will be measured as ρ_X1X2, given by Equation 21. Note that ρ_X1X2 only depends on the ratio of the variances. Figure 8(a) shows a plot of Equation 21, together with simulated data containing 200 objects having perfectly correlated properties P1 and P2, but with measured properties X1 = P1 and X2 = P2 + ε. The standard deviation of the noise was set to σ_ε = 0.5. The figure shows that when σ_P2/σ_ε ≳ 2, the measured Pearson correlation is close to the true value of 1. As σ_P2/σ_ε decreases, or equivalently, as the noise dominates the variation observed for the measurement, the Pearson correlation goes to 0.

Figure 8(b) shows the variance explained by the first PCA axis as a function of σ_P2/σ_ε. The variables were standardized before the application of PCA. The result shows an important aspect of variable standardization: if the typical variation of the measurement is moderately larger than any variations caused by noise, the respective variable can be standardized. Otherwise, standardizing the variable may be detrimental to PCA. For extreme cases, when noise completely dominates the measurement, the obtained PCA values will indicate that this meaningless measurement has a great importance for the characterization of the objects.
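The simulated experiment described above (200 objects, σ_ε = 0.5) can be sketched as follows. The function and its parameter values are illustrative; note that the measured coefficient depends only on the ratio of the dispersions, consistently with the text:

```python
import numpy as np

def measured_pearson(sigma_P2, sigma_eps=0.5, Q=200, seed=42):
    """Pearson correlation between X1 = P1 and X2 = P2 + eps (Equation 15),
    for two perfectly correlated properties (here simply P2 = P1)."""
    rng = np.random.default_rng(seed)
    P1 = rng.normal(scale=sigma_P2, size=Q)    # property values with std sigma_P2
    eps = rng.normal(scale=sigma_eps, size=Q)  # additive measurement noise
    X1, X2 = P1, P1 + eps
    return np.corrcoef(X1, X2)[0, 1]
```

For σ_P2/σ_ε well above 2 the measured coefficient approaches the true value of 1; when the noise dominates, it approaches 0.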
[Figure residue removed: four two-dimensional scatter plots with axes PCA1 (Y1) and PCA2 (Y2).]

FIG. 9. Four two-dimensional PCA projections can be obtained for the Iris dataset or, indeed, any other data. The colors identify the three categories in the Iris dataset and are included here only for reference, being immaterial to PCA.

V. OTHER ASPECTS OF PCA

There are some important issues that need to be borne in mind when applying PCA. These include the underdetermination of the direction of the PCA axes, the stability of the transformation matrix, and the interpretation of the relative importance of the original variables. These issues are discussed as follows.

A rotation of the data can be written as

    Y = W X,  (22)

where the rotation matrix W is an orthogonal matrix. The covariance matrix of Y is given by:

    Cov(Y) = (1/(Q − 1)) (Y − µ~_Y h~^T)(Y − µ~_Y h~^T)^T
           = (1/(Q − 1)) (W X − W µ~_X h~^T)(W X − W µ~_X h~^T)^T
           = (1/(Q − 1)) W (X − µ~_X h~^T)(X − µ~_X h~^T)^T W^T
           = W Cov(X) W^T
           = W Cov(X) W^{−1},  (23)

where h~ is a column vector filled with ones and the identities µ~_Y = W µ~_X and W W^T = I were used. The eigenvalues and eigenvectors of matrix Cov(Y) are given by:

    Cov(Y) v~_y = λ v~_y,  (24)
    W Cov(X) W^{−1} v~_y = λ v~_y,  (25)
    Cov(X) W^{−1} v~_y = λ W^{−1} v~_y,  (26)
    Cov(X) v~_x = λ v~_x,  (27)

where v~_y and v~_x are eigenvectors of, respectively, matrices Cov(Y) and Cov(X). So the eigenvalues of the covariance matrix are conserved under rotation.

In the special case when W is the eigenvector matrix of Cov(X), that is, when each row of W contains a respective eigenvector of Cov(X), Cov(Y) is a diagonal matrix. If the eigenvectors are sorted according to the respective eigenvalues in decreasing order, W is the PCA transformation matrix.
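The conservation expressed by Equations 22–27 is easy to verify numerically. A sketch with an arbitrary orthogonal matrix obtained via QR decomposition (such a matrix may include a reflection, which does not affect the argument):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 300))
X[2] += X[0]                                   # introduce some correlation

W, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # random orthogonal matrix (W W^T = I)
Y = W @ X                                      # rotated data (Equation 22)

lam_X = np.sort(np.linalg.eigvalsh(np.cov(X)))
lam_Y = np.sort(np.linalg.eigvalsh(np.cov(Y)))
```

The sorted eigenvalue sets `lam_X` and `lam_Y` coincide, and so do the traces of the two covariance matrices (i.e. the total variance S).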
In order to maximize W~_i Cov(X) W~_i^T we need to constrain W~_i, otherwise we obtain the trivial solution in which W~_i is infinite. Here, we set the constraint that W consists of a rigid rotation transformation, and thus is an orthonormal matrix. Recall that a matrix is orthonormal if, and only if, its rows form an orthonormal set (i.e., W W^T = I). So W~_i W~_i^T must be unitary.

To maximize W~_i Cov(X) W~_i^T subject to W~_i W~_i^T = 1, we use the Lagrange multipliers technique, maximizing the function

    f(W~_i, λ_i) = W~_i Cov(X) W~_i^T − λ_i (W~_i W~_i^T − 1),  (31)

with respect to W~_i. Differentiating and equating to zero yields:

    df(W~_i, λ_i)/dW~_i = [Cov(X) + Cov^T(X)] W~_i^T − 2 λ_i W~_i^T = 0,  (32)
    2 Cov(X) W~_i^T − 2 λ_i W~_i^T = 0,  (33)
    Cov(X) W~_i^T = λ_i W~_i^T.  (34)

We see that W~_i^T is an eigenvector of Cov(X); thus W is formed by combining the eigenvectors of Cov(X) row-wise. The respective variances are calculated as:

    W~_i Cov(X) W~_i^T = W~_i λ_i W~_i^T = λ_i.  (35)

So, the eigenvector corresponding to the largest eigenvalue of Cov(X) is placed in the first row of W, the eigenvector associated with the second largest eigenvalue is placed in the second row, and so on. As a result, the first M eigenvectors will lead to an M-dimensional space that possesses optimal preservation of the variance in the original data.

It is interesting to observe that the efficacy of PCA in explaining variance is, to a good extent, a consequence of two properties of the eigenvectors associated with each principal axis. First, we have that these eigenvectors are orthogonal (as a consequence of the covariance matrix being symmetric). Then, we also have that each eigenvector corresponds to a ‘prototype’ of the data, in the sense of having a significant similarity with the original data. As a consequence, if a given data item has a large scalar product with one of the eigenvectors (i.e. the data aligns with that eigenvector), it will necessarily be different from the other eigenvectors as a consequence of the latter being orthogonal. This means a substantial decay of variance along the subsequent principal axes.

VI. PCA LOADINGS AND BIPLOTS

The principal axes identified by PCA are linear combinations of the original measurements. As such, an interesting question arises regarding the identification of how those measurements are related to the implemented projection. For instance, in the case of the beans example in Section I, we have that the first principal variable is defined as PCA1 = 0.82 Diameter + 0.57 √Area. In other words, we have that the weights of the Diameter and √Area measurements are 0.82 and 0.57, respectively. As a consequence, the two original measurements contribute almost equally to the first principal axis.

A possible manner to visualize the relationship between the original and principal variables consists in projecting the former into the obtained PCA space. Figure 10(a) illustrates such a projection with respect to the Iris database [18]. This database involves four measurements for each individual, namely: sepal length, sepal width, petal length and petal width. The projections of the axes defined by each of these four original variables are identified by the four respective vectors in the PCA space in Figure 10(a). Each of these projected vectors is obtained by multiplying the PCA matrix (eigenvectors) by the respective versor associated with the measurement. For instance, in the case of the sepal length variable, the projected vector is calculated as:

    (Sepal length1, Sepal length2)^T = [W11 W12 W13 W14; W21 W22 W23 W24] (1, 0, 0, 0)^T.  (36)

Two interesting relationships can be inferred from Figure 10(a). First, we have that the angles between the projected measurements indicate relationships between the original measurements. For instance, the fact that the petal length and petal width axes resulted almost parallel indicates that these two measurements are very similar to one another. The second relationship involving the projected variables regards their comparison with the new variables PCA1 (horizontal axis) and PCA2 (vertical axis). For instance, we have from Figure 10(a) that the petal length is inversely aligned with PCA1, while the sepal width is almost parallel to the vertical axis (PCA2).

A closely related manner to study the relationship between original and new variables is based on the concept of the biplot [19]. Figure 10(b) illustrates the biplot obtained for the Iris dataset. There are two main differences between the biplot and the projection shown in Figure 10(a). First, we have that the axes of the biplot are the PCA components divided by the respective standard deviation, i.e.

    Ỹ1 = Y1 / σ_Y1,  (37)
    Ỹ2 = Y2 / σ_Y2.  (38)

The other difference is that the projections of the original variables are obtained by multiplying a normalized version of the PCA matrix by the respective versors, that
FIG. 10. Visualizing the original variables of the Iris dataset on the respective PCA projection. (a) Vectors representing the projections of the original variables onto the PCA components. (b) Biplot containing normalized PCA components and measurement vectors.
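The vectors shown in Figure 10(a) are obtained by applying the first two rows of W to each versor, as in Equation 36 for the sepal length. A minimal sketch, using a synthetic four-variable dataset as a stand-in for the Iris measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 150))          # stand-in for the four Iris measurements
X[1] += 0.9 * X[0]                     # a correlated pair of variables

Xc = X - X.mean(axis=1, keepdims=True)
lam, V = np.linalg.eigh(np.cov(Xc))    # eigh returns ascending eigenvalues
W = V[:, ::-1].T                       # eigenvectors as rows, decreasing order

versors = np.eye(4)                    # one unit vector per original variable
arrows = W[:2] @ versors               # each column: 2D vector for one variable
```

Each column of `arrows` is the vector drawn for the corresponding original variable in the plane spanned by the first two principal axes; correlated variables tend to point in similar directions, as observed for petal length and petal width in Figure 10(a).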
Please refer to Appendix X for a demonstration of this property. Therefore, the projection of each vector shown in Figure 10(b) onto a PCA axis corresponds to the Pearson correlation coefficient between the respective measurement and the PCA component. The vector L~_i = (√λ_i W_i1, ..., √λ_i W_i4) is called the loading of the ith PCA component. Furthermore, the angles between the vectors representing the measurements approximate well the correlations between them [19]. Thus, the biplot provides an intuitive visualization of the relationships among the original measurements and between those and the PCA components.

VII. LDA – ANOTHER PROJECTION METHOD

Linear Discriminant Analysis (LDA) [6, 20] is a statistical projection closely related to PCA. It is used over

where X~_i contains the measures for object i, C_j is the set of objects in the jth category and µ~_Cj is the average vector for the category. The intra-group scatter matrix, S_intra, measures the combined dispersion in each group and is defined as:

    S_intra = ∑_{j=1}^{K} S_j,  (43)

where K is the number of groups. The inter-group scatter matrix, S_inter, measures the dispersion of the groups (based on their centroids) and is defined as:

    S_inter = ∑_{j=1}^{K} Q_j (µ~_Cj − µ~_X)(µ~_Cj − µ~_X)^T,  (44)

where Q_j is the number of objects belonging to the jth group and µ~_X is the average vector of the data matrix X.
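A minimal sketch of these operations follows. The per-group scatter S_j is assumed to have the usual definition (the sum of outer products of the centered objects of group j, whose defining equation falls outside this excerpt); the matrix S = S_intra^{-1} S_inter of Equation 45 is then eigendecomposed in place of the covariance matrix used by PCA:

```python
import numpy as np

def lda_axes(X, labels):
    """Sketch of LDA for an N x Q data matrix X and one label per column.

    Builds S_intra (Equation 43) and S_inter (Equation 44), then
    eigendecomposes S = S_intra^{-1} S_inter (Equation 45).
    """
    N = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)
    S_intra = np.zeros((N, N))
    S_inter = np.zeros((N, N))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        D = Xc - mu_c
        S_intra += D @ D.T                                    # per-group scatter S_j
        S_inter += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T  # Equation 44
    S = np.linalg.inv(S_intra) @ S_inter                      # Equation 45
    vals, vecs = np.linalg.eig(S)          # S is not symmetric in general
    order = np.argsort(vals.real)[::-1]
    return vals.real[order], vecs.real[:, order].T
```

Projecting the data onto the leading returned axis yields the direction of best linear discrimination between the groups, rather than the direction of maximum dispersion given by PCA.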
The matrix S can now be defined as the product of the inter-group scatter matrix by the inverse of the intra-group scatter matrix, i.e.:

    S = S_intra^{−1} S_inter.  (45)

A measurement of the separation of the groups can be readily obtained from the trace of matrix S (other approaches can be used to derive alternative separation distances [21]).

LDA consists of applying the same sequence of operations as PCA, but with the matrix S being used in place of the covariance matrix.

VIII. REVIEW OF PCA APPLICATIONS

A. Biology

Data in biology come in various forms, from measurements of jaw length in vertebrates [22] to gene expression patterns in cells [23]. In many of these cases, the original feature space is high-dimensional, with as many as ∼20000 dimensions in the case of gene expression profiling [24, 25]. It is no surprise that dimensionality reduction methods, PCA among them, can frequently be found in many areas of quantitative biology.

In bioinformatics, high-throughput measurements of gene expression, DNA methylation, and protein profiling are on the rise as methods of exploring the complex mechanisms of cellular systems. Given the large number of measurements in any such experiment, dimensionality reduction algorithms rapidly became an intrinsic part of the exploratory analysis in the field. As concrete examples, one may cite the arrayQualityMetrics software [26], used for quality assessment of microarray gene expression profiling, or the usage of the biplot variation for determining which variables contribute most to the samples’ variance [27]. At other times, a researcher is interested in removing redundancy from a dataset before analyzing it, and thus performs a PCA before feeding the data to some more elaborate algorithm [28, 29].

One might recall that a biplot refers to the practice of showing the original data axes as projected onto the principal components [19, 30]. In this application, the original axes correspond to expression values of particular genes; thus, axes projected closer to the PCs indicate that the corresponding gene is strongly represented in that principal component. It serves as a clue as to which genes influence the divergence between samples.

In a considerably separate area of bioinformatics, namely structural biology, the principal component analysis technique is used to identify large-scale motions in a biomolecule’s dynamics, such as protein and RNA folding [31]. After calculating atomic displacements between different conformers (i.e., locally stable structures) of the molecule, covariances between motions are calculated, and the principal components provide insights into the significant structural changes [32]. Han et al. [33] applied PCA, among other methods, to understand how phosphorylation (the addition of a phosphate group to an amino acid, one of the most common post-translational modifications in proteins) induces conformational changes in protein structure.

In quantitative genetics, correlations between phenotypical features are of central importance – e.g., the correlations between lengths and widths of certain bones [34]. The Breeder’s Equation tells us that a set of phenotypic traits respond jointly to selective pressure according to the correlations between them [35–37]. Thus, PCA serves as a way to extract the directions along which significant evolutionary changes are more likely to happen, and to visualize them directly.

Population genetics, on the other hand, deals with prevalences of certain genotypes in a population, or preserved sequences between groups. Here, PCA is applied to genetic variation data in various groups and species, and used to identify population structures [38] and putative migratory or evolutionary events [39–42]. An interesting application by Galinsky et al. [43] analyzed the allele distribution of a population and compared its principal components to those of a null distribution derived from a neutral model, identifying genes undergoing natural selection in a population. In particular, they observed that ADH1B, an enzyme associated with alcohol consumption behaviors, seems to be undergoing simultaneous and independent evolution in both Eastern Asia and Europe.

In ecology, one might be interested in comparing data from several different species [44, 45]. For instance, microbial ecology is often concerned with the metabolic profiles or gene expression of various microorganisms in the same environment, leading to a dataset where the variables are concentrations of a specific catabolite and the samples indicate different species in a substrate. At another scale, PCA of transect data (i.e., counting the occurrence of certain species along a predefined path) is a conventional approach to distinguish between different animal communities [46]. In another example, PCA was employed to reduce the amount of data in a Maximum Entropy model to infer the distribution of red spiny lobster populations in the Galapagos Islands [47].

B. Medicine

Modern medical science relies on sophisticated imaging techniques such as functional Magnetic Resonance Imaging (fMRI), Positron Emission Tomography (PET) and Computed Tomography Imaging (CTI) [48]. The obtained data need to be pre-processed: noise must be extracted, redundancies must be discarded, and different sources are to be gathered [49]. Thus, PCA is often used in these steps as a computationally efficient and yet reliable technique to aid in producing the image.

Apart from imaging, several clinical variables may be
combined to maximize the information obtained from a patient's data, like age [50], concentrations of certain substances in the blood [51], Glasgow scores [52], electrocardiogram (ECG) signals [53] and others. These may then be used to classify the patient's possible outcome, and in this process, it may be necessary to rotate or combine variable axes [50, 53–55].

Another type of application that has incorporated PCA is related to diagnostics. For instance, PCA was used to reduce the dimensionality of the data in the diagnostic prediction of cancers [56]. In this way, gene-expression signatures were analyzed, and artificial neural networks were employed as a classifier. Chemical-related tools were also employed with PCA in diagnosis approaches, such as in [57], in which the authors proposed a methodology to improve the differentiation between hepatitis and hepatocirrhosis. For that, metabolites present in urine samples were analyzed.

C. Neuroscience

The brain is a structure with an extremely high number of components and a very complex topology. Therefore, statistical tools, including PCA, are useful for studying it. In neuroscience, PCA is often used in classification methods and in the analysis of measurements of brain activity and morphology, such as in electroencephalography (EEG) and magnetic resonance imaging (MRI). PCA is also used as a data analysis tool in psychophysical experiments.

Epilepsy is a neurological disorder that is associated with uncontrolled neuronal activity which may lead to seizures. Electroencephalography (EEG) is often used for epilepsy diagnosis and seizure detection through the identification of EEG markers or abnormalities. Epilepsy diagnosis is a complicated task, so the diagnosis is normally confirmed by EEG interpretation by a neurologist while taking into account the medical history of the patient. Due to possible errors in the diagnosis, it is interesting to have a precise automatic system that could assist epilepsy diagnosis and the detection of seizures. In [58] the authors propose a supervised classification method that consists of PCA applied to nine selected features of the EEG. The transformed data serves as input to a cosine radial basis function neural network (RBFNN) classifier. The method could classify the patient's EEG as normal, interictal (period between seizures) or ictal (during a seizure), with a false alarm seizure detection rate of 3.2% and a missed detection rate of 5.2% for the parameters and data utilized in the article.

Magnetic resonance imaging (MRI) is an imaging technique capable of obtaining high quality pictures of the anatomy and physiological processes of the human body, including the brain. MRI is widely used for clinical diagnosis. In the case of some neurological diseases, the diagnosis is sometimes assisted by an automated classification based on the brain MRI image. In [59], a classification method is proposed for MRI images that employs PCA after the discrete wavelet transform (DWT). The PCA reduces the dimension of the feature space from 65536 to 1024 while retaining 95.4% of the variance. The PCA-processed data is used as input to a kernel support vector machine (KSVM) with the GRB kernel, so as to infer the health of the brain. The diseases considered in the method are the following: glioma, meningioma, Alzheimer's disease, Alzheimer's disease plus visual agnosia, Pick's disease, sarcoma, and Huntington's disease.

The dendrites of a neuron can grow in a very complex and branched way. The respective arborizations can be digitalized as a set of points in a 3D space representing its roots, nodes, tips and curvatures. PCA can be used to describe a dendritic arborization [60]. The shape of the arborization can be described by the relative values of the standard deviations of the digitalized dendritic arborization data after PCA application. The dimensions of the arborization are defined as the length of the interval between the most extreme points projected onto each PCA axis. Depending on the shape of the dendritic arborization, the PCA axes can also be used to determine the orientation of the arborization.

The accuracy of the ability to recall past painful experiences is still an object of controversy. A generally accepted way of describing pain is by its sensory-discriminative (intensity) and affective-motivational (unpleasantness) dimensions. In [61], the authors conduct a psychophysical experiment to study the ability to recall pain intensity and unpleasantness over a very short time interval. The subjects were thermally stimulated and evaluated in real time. The intensity and unpleasantness of the pain were assessed on a visual analog scale (VAS) both simultaneously with and shortly after the stimulation. PCA is applied to the VAS data and the first three principal components are used for further analysis, explaining about 90% of the variance. The results of the study support the loss of pain memory information and reveal a significant difference between subjects in the ability to recall the stimuli.

D. Psychology

In psychology-related areas, quantitative data is often provided in the form of examination scores. Part of its information can be analyzed in terms of multidimensional statistics. For instance, PCA was employed in an analysis of how people store information through a memory test [62]. The researchers considered childhood and adolescent development and found differences related to age and the cognitive maturation process.

Efforts directed at a better understanding of behavior can also employ PCA, for instance, studies regarding the connection between memory and anxiety [63], and the relationship between the organization of working memory and cognitive abilities [64]. Facial expressions were also investigated by using a PCA-based approach [65]. This
study considered datasets of faces and obtained features from the considered images. The results indicate that pictures of facial expression, when processed by the PCA-based approach, provided reliable results when compared to the social psychologist's analysis.

E. Sports

The existence of multivariate measurements in sports provides many opportunities for PCA applications. For instance, the precision required of elite athletes and martial artists demands coordination between several parts of the body, and kinematics-derived measurements can be submitted to a PCA in order to uncover synergies and principles of a specific sportive practice [66, 67], or pinpoint health-hazardous practices in everyday actions such as walking [68]. Furthermore, PCA can be employed in doping tests [69, 70]; in [70], the authors employed PCA for dimensionality reduction of measurements of anabolic steroids.

PCA has also been applied to compacting three-dimensional coordinates of body points at different times [71]. In [71] PCA was applied to 26 three-dimensional body coordinates of 6 alpine ski racers. In this experiment, the first four principal components were responsible for 95.5% of the variance. In order to study the performance of vertical jumping, researchers considered PCA to eliminate correlation in the athletes' data measurements. Other characteristics of athletes analyzed with PCA are related to somatic anxiety, that is, the physical symptoms of anxiety [72].

F. Chemistry

In chemistry, multivariate statistical methods are frequently employed [78]. One of the most important tools of chemometrics is the PCA technique [81], which has been incorporated into many studies, including the analysis of food [73, 82], drugs [83, 84], disease diagnosis [85], the presence of pollutants in water [75], etc. The use of PCA in chemistry-related studies is normally related to the following two main aspects: (i) data visualisation and (ii) dimensionality reduction.

A possible application of PCA in food chemistry regards the data visualisation of metabolomic analyses [73, 86]. In a recent study, the quality of cattle meat was characterised with respect to different diets [73]. The animals were grouped into different classes fed with different amounts of mate herb extract. Levels of metabolites in the meat were measured using the 1H NMR technique and PCA was employed to better understand the relationship between meat quality and the animal feeding. As the data were projected onto the principal components, the classes emerged naturally. Furthermore, in order to understand the relationship among the different metabolites, the concept of loadings was used. Other works investigating food chemistry have been reported [74, 82], including the use of PCA as an auxiliary method to classify different types of grapevines [82].

Other applications in chemistry include the diagnosis of diseases, such as the identification of pancreatic cancer in patients [85]. More specifically, the patient serum was analyzed and PCA was employed as a data reduction method [85], providing support for multivariate analysis. In other studies, the PCA technique was applied in order to identify the chemical characteristics of phytomedicines [83, 84]. Samples prepared from river water were analyzed by EPR and the data was then projected with PCA [75]. Different indications of the mercury cycle were found along the river.
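The projection-and-loadings workflow described above can be sketched in a few lines; the data below is synthetic, standing in for metabolite concentrations, and the two "diet" classes, sample counts and variable names are illustrative assumptions rather than details of the cited studies:

```python
# Hedged sketch: standardize, project samples onto the first two PCs
# ("scores"), and inspect how each variable contributes to them
# ("loadings"). Synthetic data only; not from the cited studies.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 40 samples x 6 variables, drawn from two artificial classes
# (e.g., two diets) with shifted means so that the classes separate.
class_a = rng.normal(0.0, 1.0, size=(20, 6))
class_b = rng.normal(1.5, 1.0, size=(20, 6))
X = np.vstack([class_a, class_b])

X_std = StandardScaler().fit_transform(X)   # standardize each variable
pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # one 2D point per sample
loadings = pca.components_.T                # one loading pair per variable

print(scores.shape)                         # (40, 2)
print(loadings.shape)                       # (6, 2)
print(pca.explained_variance_ratio_)        # variance captured by PC1, PC2
```

Plotting `scores` colored by class reproduces the kind of projection in which groups "emerge naturally", while the rows of `loadings` play the role of the loadings used to relate individual variables to the principal components.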
tings and devices: (i) the variation of parameters was relatively small within each transistor type, giving rise to respective clusters; (ii) a moderate level of negative feedback was found not to be able to completely eliminate parameter variations; and (iii) high variance explanation was achieved by using only two axes. The latter result was further developed, giving rise to a new modeling approach [105].

A subsequent work [95] focused on studying the patterns of parameter variability among devices encapsulated in the same transistor array. PCA was employed in a similar fashion as in the previous work to quantify parameter variability. The results confirmed the substantially higher uniformity among parameters of transistors belonging to the same array.

I. Safety

In the safety area, the study conducted in [106] devised a model based on PCA to identify the correlation among sensors. Usually, the identification of errors in sensors can be performed in different ways. Some measurements can reach unusual values – providing indication of a failure – which can be easily identified by setting lower and upper limits. Nevertheless, some minor failures cannot be detected in such a straightforward way, since the measured absolute values do not take on extreme values. Correlation via PCA can be used to identify such minor errors in faulty sensors since, in typical operation, measures obtained by sensors are usually correlated. Given the correlations in normal operation, the PCA technique was also used to reconstruct measurements of faulty sensors via orthonormal decomposition.

J. Computer Science

In computer science, PCA has been employed in a diverse range of applications, varying from specific tasks, such as biometrics and data compression, to more general problems, including unsupervised classification and visualization.

In the scope of computer vision, one of the most iconic applications of PCA is face recognition, known as the Eigenface method [107–109]. This technique is used to recognize faces in images by linearly projecting them onto a lower dimensional space. For this, the sequence of pixel intensities along each image is considered as the feature space, and PCA is employed to find relevant eigenvectors (named eigenfaces in this context).

In the Eigenface method, classification is attained by PCA-projecting both the unknown image and reference faces onto the eigenspace and calculating the respective Euclidean distances. Since the size of the covariance matrix grows quadratically with the number of pixels, its direct calculation is often unfeasible. However, because the rank of the covariance matrix is limited by the number of samples, Singular Value Decomposition (SVD) can be used directly over the feature space, dropping the need to compute the covariance matrix explicitly. Other methods similar to the Eigenface have been developed for other tasks also involving biometric data. This includes recognition of palm print [110, 111], iris [112], gestures [113] and behavior [114, 115]. In [116], the authors conclude that, for the task of face recognition, PCA can sometimes outperform LDA when the training dataset is small, while also being less sensitive to changes in the training data.

Another application of PCA in computer vision is the unsupervised classification of images and videos. Among the popular techniques for this task is GPCA (Generalized Principal Component Analysis) [117], which uses PCA to combine subspaces defined by homogeneous polynomials for data consisting of sets of images or video clips. Classification is attained by applying a clustering algorithm in the reduced space. The technique was found to be appropriate for many specific tasks, including segmentation of videos over time, face classification under different illumination conditions, and tracking of 3D objects in video clips.

Other frequent applications of PCA in computer vision are based on the idea of applying PCA followed by a clustering algorithm over the raw data or sets of features extracted from images. In texture analysis [118], the set of features is usually obtained from data lying in frequency space (such as subspaces obtained from Fourier or wavelet transforms). In [119], PCA is used to combine image descriptors given by the SIFT method [120]. This class of descriptors is obtained by finding points of interest in the images, resulting in a local set of descriptors for each image. The authors found that applying PCA, in comparison to using just the histograms of the descriptors directly, not only gives a more compact representation of the images but also significantly improves the matching accuracy of some tasks. These tasks include tracking of objects across different images obtained from real-world or controlled three-dimensional transformations (such as rotation, shift, and scale). PCA was also employed for image retrieval, where, for a provided query image, the algorithm finds a set of similar images from a large database [121].

An interesting result regarding the use of PCA in image analysis is that certain classes of neural networks, trained with image datasets, seem to mimic the PCA transformation. This observation is partially supported by the fact that multilayered neural networks use uncorrelated linear projections of the data as internal representations [122, 123]. In [124], the authors develop a mathematical proof connecting the approach taken by some neural networks with PCA, which is accomplished by creating a neural network that effectively reproduces the PCA transformation.

In general data analysis, PCA can be regarded as a
pre-processing step to be applied to the data before using more sophisticated methods for classification or learning [99]. In such a context, PCA can also be understood as a feature selection or feature extraction process in the sense that it does not result in a subset of the original features but in a combination of them. Many benefits can be attained by using this approach; for example, the computational cost of a method can be reduced substantially, with minimal loss of accuracy, by reducing the size of the input data before applying a more complex classification algorithm [125, 126]. However, one should be careful when combining PCA and other classification techniques, as this kind of benefit depends on the classification technique being used and on the dataset. For instance, in [127] the application of PCA before a Support Vector Machine (SVM) considerably reduced the classification accuracy on the analyzed datasets.

Another notable application of PCA in computer science is data compression. In particular, PCA was used to compress image data. An example of this approach is present in [128], in which the authors propose a modification of the JPEG2000 standard by incorporating an extra step based on PCA to improve the rate-distortion performance of hyperspectral image compression. Results show that PCA outperforms the traditional approach, in which the coder is based on spectral decorrelation using wavelets. In a similar direction, the work [129] also employed PCA as a technique to compress data from stellar spectra. The results show that PCA attained a compression rate of 30:1 while keeping 95% of the variance.

Aside from images, PCA was also employed to compress other types of data. In [130], PCA is used to compress neural codes into short codes that maintain accuracy and retrieval rates similar to or better than state-of-the-art techniques. Neural codes are visual descriptors obtained from the top layers of a large neural network trained with image data. These can be used to retrieve data from large datasets. PCA was also found to be useful in compressing data describing human motion sequences [131]. This kind of data incorporates three-dimensional trajectories of sets of markers that represent the motion of the human skeleton captured from human actors performing specific actions. The technique is based on compressing the positions of the markers to a lower dimensional space using PCA for each keyframe of the animation.

K. Deep Learning

Many of the proposed methods regarding neural networks have employed PCA as a dimensionality reduction pre-processing step [124, 132, 133]. A new area of study, called Deep Learning, emerged and many related approaches have also incorporated PCA. Usually, as in the case of standard neural networks, PCA is employed in the deep learning area as a pre-processing step, in which the data is reduced, and the first principal components can be used as features. Some of the studies that applied PCA are the classification of hyper-spectral data [134], face recognition [135], and the extraction of features from videos [136].

Because of the high dimensionality involved in hyper-spectral measurements, it is possible to use PCA to compress the input data. In [134], remote sensing data were measured comprising many different characteristics of the analyzed regions, for instance, information taking into account trees, water, streets, among others. One of the employed pipelines involves PCA as the first step, in which PCA compresses the multidimensional data. Next, such compressed data is summarized according to the neighboring regions and is then flattened into a vector. Finally, this vector can be employed as input to the neural network. Note that this method and other variations of such a pipeline were used to create features to describe the system. Classification tests using these features were applied and, as a result, the authors found that the proposed features provided higher accuracy when compared to some other methods.

In deep learning, PCA is commonly used as part of other image analysis tasks, such as generating features for face recognition methods, also in combination with deep learning [135]. Another example is the PCA-based deep learning method that considers a cascade of PCAs to classify images, called PCANet [137]. This methodology was applied to many tasks, such as recognition of hand-written digits and objects, and the results were compared to other deep learning based approaches. In general, good results were obtained, and the authors suggest that PCANet is a valuable baseline for tasks involving a significant amount of images.

In another study, different deep learning approaches were proposed for the unsupervised classification of sleep stages [138]. This method considered PCA after the feature selection step, which was used in order to capture most of the data variance. As a parameter, the authors used the first five principal components. In general, the accuracy reached by the PCA-based method was not the best one. However, this automatic approach illustrated a way to classify sleep stages without relying on specialist knowledge. The proposed methodology can also be used in tasks of detecting anomalies and noisy redundancies.

L. Economy

Most of the applications of PCA in Economy are based on evaluating the financial development of countries or entities using a combination of features [139]. Usually, PCA is employed to combine economic indicators in order to attain a smaller set of values so that these entities can be ranked or compared. An example of this approach is present in [140], in which countries are ranked according to the principal components obtained from sustainability features taken over time. The authors indicate that while
some progress was found for economic development, the overall conditions got worse during the considered period. Other works also consider PCA to build an integrated sustainable development index, such as in [141] and [142].

Another example of using PCA to aggregate economic indices is explored in [143], in which the loadings resulting from PCA of several indices were employed to determine the importance of macroeconomic indices for countries. The article also compared the combination of macroeconomic indices with the returns of their respective stock markets.

In other approaches, PCA was used to better understand the local economic characteristics of cities or provinces. In [144], it was used to visualize data involving the living conditions of households in rural and urban regions of Ghana. Results indicate very distinct characteristics between the population living in rural areas and those in urban regions. Using a similar approach, in [145], the evolution of the economic characteristics of the Liaoning province is studied over time. A coordination development index was devised by using PCA.

M. Scientometry

Scientometrics is devoted to studying the quantitative aspects of science [146]. Typical analyses include the assessment of the scientific impact of journals, institutions, and scientists via metrics such as the total number of articles, citations, views or patents [147]. Popular topics of interest in scientometric studies are the evolution of science [148], the identification of interdisciplinarity [149] and co-authorship dynamics [150]. Because many aspects of science are subjective (e.g., the concept of quality), many measures have been proposed to capture different views. PCA, in this case, has been used as a visualization tool and, most importantly, as a way to make sense of scientometric data.

Several indexes have been proposed to assess the quality of universities worldwide. However, the validity of some criteria has been questioned by academic stakeholders. In [151], the authors studied whether the metrics conceived by the annual academic rankings of world universities (ARWU, see Liu and Cheng 2005) privilege larger universities. The authors argue that this is an essential debate because such an alleged privilege may cause institutions to pursue growth devoid of quality, since much importance is currently being given to size. They argued, via PCA analysis of several metrics in the ARWU ranking, that two main factors would account for the data variability. While the first principal component accounts for 54% of the variance, the second component was found to explain 30%. A more in-depth analysis of the variables revealed that, in fact, the size factor accounts for a significant fraction of the variance. However, the excellence factor also seems to play an important role, as related metrics populate the first principal component.

An important point of interest for scientometric researchers concerns the introduction of measurements to quantify the relevance of research, journals, and papers. While most of the metrics rely on some type of citation information, there is no consensus on which would be the most important measurement. In addition, if a multi-view impact is desired, it would be important to understand which measurements are interrelated. In this context, a systematic comparison of 39 impact metrics was performed in [152]. The considered metrics included traditional metrics based on raw citation and usage data, social network measures of scientific impact, and other hybrid metrics. The PCA analysis revealed that the first principal component identifies citation measures, discriminating them from almost all usage metrics. The second principal component accounts for the discrimination between citation and social network metrics. All in all, the clustering observed with the PCA projection revealed that the impact metrics could be interpreted according to two main dimensions: (i) the time in which the evaluation is performed (i.e., usage vs. citation), and (ii) the dimension discriminating popularity from prestige (citation vs. social impact).

The detection of evergreens through principal component analysis was performed in [153]. Differently from conventional scientific papers, evergreens are those manuscripts in which the number of citations regularly increases, with no significant decay effect over time. Even though predicting the citation consistency of evergreens is unfeasible, it is still important to understand the behavior of their citation trajectories. The method proposed in [153] for clustering the citation behavior of evergreens consists in decomposing the trajectory curves via functional principal component analysis. Such a decomposition is then used for data partitioning via K-means [154]. The main results suggested that most of the data variability could be explained solely by two functional components. The main component, which explains 95% of the data variation, is characterized by a steadily growing citation curve. In fact, it is related to the behavior of most evergreens. The main findings obtained by this method based on functional PCA suggest that papers with similar citation patterns shortly after their publications may display distinct trajectories in the long run.

N. Physics

Most of the current problems in physics can be approached in two main manners: by developing the basic laws for the problem (ab initio), or by constructing an empirical model regarding the relationships of some aspects of the considered system, which should be confirmed by at least one experiment. Note that the second type of approach is inspired by a sequence of tests in which the parameters of the proposed model are systematically varied. When the dimension of the associated data is considerable, PCA can be applied to determine which of the parameters are potentially more relevant [155, 156].
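A minimal sketch of this variance-based screening, under the assumption that one retains enough components to explain a chosen fraction of the variance (the synthetic data, the 95% threshold and all names below are illustrative, not taken from the cited works):

```python
# Hedged sketch: rank directions by explained variance and keep the
# smallest number of components reaching a 95% variance target.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic "measurements": 200 observations of 10 parameters driven
# by only 3 latent factors plus a little noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep)  # number of components needed for 95% of the variance
```

Because only three latent factors generate the data, `n_keep` comes out far below the 10 original parameters; equivalently, passing a float such as `PCA(n_components=0.95)` asks scikit-learn to perform this selection internally.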
The area of quantum many-body problems involves the study of many interacting particles in a microscopic system. The difficulty of this problem is related to the high dimensionality of the underlying Hilbert space, which makes PCA a useful tool for distinguishing and organizing configurations in that space. PCA can be applied to recognize the phases of a given quantum system. For instance, a neural-network-based approach has been used to identify the phase transition critical points after a preliminary PCA-assisted identification of phases [155]. PCA was also used to reveal that the energies of crystal structures in binary alloys are strongly correlated between different chemical systems [157]. This study proposed an approach that uses this information to accelerate predicting the crystal structure of other materials.

In the same context, other studies considered PCA as a tool to better analyze quantum mechanics. In order to find insights about quantum statistical properties, a quantum correlation matrix was defined, and the PCA computed [158]. In this study, the results were discussed within the quantum mechanical framework. In [159], the authors introduced a quantum version of PCA, called qPCA. This method consists in a quantum algorithm that computes the eigenvectors and eigenvalues of a density matrix, ρ, which describes a mixed quantum system. In the case of ρ being a covariance matrix, the algorithm can perform a classical PCA.

Among other areas, in nuclear physics, PCA has been applied to study the effect of many nuclear parameters on the classification of even-even nuclear structures [160]. Additionally, PCA was used to look into the local viscoelastic properties and microstructure of a fluid through a series of images capturing motion in this fluid [161]. More specifically, the authors considered a series of images of particles suspended in a Newtonian fluid (Brownian motion). PCA has also been used to solve problems in network theory, such as for understanding network measurements [162], analyzing gene networks [163] and analyzing and visualizing data obtained from text networks [164].

Similarly to the methods already discussed in section VIII F, in physical chemistry, PCA has been used together with molecular quantum similarity measures (MQSM) to improve the identification of groups of molecules with similar characteristics [156, 165].

O. Astronomy

The use of and interest in PCA in astronomy have been growing in the last decades. Due to technological advances, new techniques for data capture and storage have been developed. In order to deal with this new data, PCA can be employed as an auxiliary tool. Applications range from the analysis of data obtained from images [166] to the identification of stars [167], among other possibilities [168].

In another study, by employing PCA on a pulsar waterfall diagram, a method for determining the optimal periods of pulsars was proposed [169]. Here, a waterfall diagram is the data matrix obtained from the photon signal of the pulsar.

Conventional approaches for the classification of astronomical objects and galaxy properties are based on artificial neural networks (ANNs) [170–175]. ANNs are frequently used because of their non-linear classification capacity. In this context, some studies consider PCA as a complementary tool [172, 173, 176]. For instance, in the classification of astronomical objects, PCA was employed as a compression tool for the input data [176]. Such an approach reduces the computational cost due to the lower number of variables. Regarding stellar classification, other works used PCA to compress the stellar spectra, which are the data input of an ANN [171, 174, 175]. These studies provide a relevant analysis of the compressibility of stellar spectra. Additionally, PCA affected the classification accuracy, replicability, network stability, and convergence of the employed ANNs.

Apart from the studies considering neural networks, many other works employed PCA as part of spectral analysis [177–182]. Taking into consideration the dimension reduction provided by PCA, Dultzin-Hacyan et al. [183] studied the spectra obtained from type 1 and type 2 Seyfert galaxies. Interestingly, the spectrum of a type 1 Seyfert galaxy could be well described by a single component, but a type 2 Seyfert galaxy required at least three principal components. Still considering spectral information, PCA can also be used to improve classification schemes of galaxies [184, 185]. Such methods analyze clusters in the PCA space, which are described by a few principal components obtained from given galaxy spectra data.

P. Geography

Many works are related to geographical or spatial information [42, 144, 186–189]. For example, an exploratory study used PCA to analyse regularities in the distribution of innovative activities [186]. In other words, the authors investigated technological companies located in the same specific area. Data was obtained from patents of European countries, including France, Germany, Italy, and the UK. The information on the companies' addresses was used to infer their location, and PCA was applied subsequently. In general, different results were obtained for distinct technological classes. In addition, for some of these classes, different countries presented similar trends.

Another study related to geographical information is the investigation of the relationship between the genetic structure of human beings and their location [40, 41]. In [40], information about the genes of several European people was employed, and PCA was used to summarize the data. The geographic distance was found to influence the gene distribution.

Another work studied the determination of the provenance of food by employing chemical-related methods and PCA [76, 77, 190]. For example, studies analysed samples of green coffee, where PCA was used to determine where they were produced [76, 77]. In both cases, the chemical characteristics of the coffee were measured using multivariate data obtained from mass spectrometry and visualized through projection onto the principal components.

PCA can also be applied in order to study the geographical origin of propolis [187]. Chemical experiments were performed, and the measured multivariate data was

duction in power stations was reported in [201]. PCA was used as a preprocessing tool for a forecasting method for wind and solar energy generation. As in the previous study, this technique was used to reduce the dimensionality of the input data, which were historical time series of power measurements of the power plants. The reduced data was employed as a training set for two different types of classifiers: (i) the Neural Network (NN) implemented by Venables and Ripley [202] and (ii) the Analog Ensemble (AnEn) algorithm [203].

Other studies about climate employed PCA for find-
investigated by using PCA in the same fashion as in ing Earth modifications. Examples include identify-
chemometrics (see Chemistry section). By considering ing consequences of climate change [204], finding the
the three-dimensional PCA projection, four groups of source of pollutants [205], and classifying bioclimatic
samples were identified visually. zones [206]. Another study analyzed the ionospheric
An essential type of feature that can be used to de- equatorial anomaly through a method called Total Elec-
scribe regions is temporal information. A study about tronic Content (TEC) [197]. TEC is a descriptive method
the rainfall patterns of Spain considered data from the for Earth’s ionosphere which counts the amount of elec-
years 1912 to 2000 [188]. In this work, PCA was em- trons along a circular cross section of the atmosphere and
ployed as a preprocessing step of a clustering method integrates it between two points, giving the total num-
to detect patterns of seasonal rainfalls. PCA was also ber of electrons in that region. PCA was used [197] as a
applied as part of a work monitoring the growth of the data analysis tool over TEC to aid the identification of
urban area in Pearl River Delta region [191]. temporal and spatial patterns.
Q. Weather R. Agriculture
Studies in meteorology and weather involve many dif- PCA was used to study the origin of toxic elements
ferent variables [192]. The high degree of freedom [192], in soils in [207]. While traditional techniques have been
the chaotic behavior [193] and the difficulty of know- used for this purpose (profile and spatial distribution), it
ing the initial conditions [194] of meteorological systems has been claimed that they are often unreliable to identify
make the statistical approach attractive. PCA, in partic- the sources of some elements in soils. Other traditional
ular, is used in these areas both as part of forecast meth- approaches such as parent rock decomposition and the
ods and as part of its data analysis [195, 196]. Such stud- knowledge of anthropogenic loads also yield inaccurate
ies are important for many applications, ranging from results with some frequency.
understanding the environment [197] to practical appli- The study conducted in [207] investigated the sources
cations [198, 199], such as the prediction of wind direc- of pollution by analyzing the concentration of Cu, Hg, Ni,
tion, essential for the high performance of wind energy Pb, and Zn in the Czech Republic. It focused on the first
generation [196, 198]. three principal components, which accounted for 70% of
In order to improve the turbine performance of wind the data variance. A simple visualization allowed the
power stations, the prediction of wind speed and di- identification of a cluster of elements, such as Co, Cr, Cu,
rection are essential parameters, and can vary substan- Ni, and Zn. Interestingly, the data analysis showed that
tially along time. Dealing with this problem, the authors the main component could be interpreted as representing
of [199] proposed a forecasting method based mainly on those elements of geogenic origin. Conversely, the third
PCA. The employed data are given by standardized time component was found to be able to identify pollution
series of the wind speed and/or direction. By considering from atmospheric deposition. The results obtained in
a subset of the wind measured data (the training set) and their study suggested that PCA can be employed as a
using Takens’ method of delays algorithm [200], the de- tool to assist the identification of the source of elements
lay matrix is computed. As a part of this method, PCA in soils.
is applied to reduce the matrix dimensionality. The data
of the test set (the remaining data) was converted to the
same PCA space as the training set. Finally, by employ- S. Tourism
ing a strategy based on nearest neighbors, the current
state of the wind is identified. Note that the neighbor- In Tourism, PCA has applications mainly on hotel lo-
hood of the PCA sample was used to predict its state cation or recommendation systems. In [208] a recom-
and the forecast error. mendation system based on collaborative filtering (data
Another study that aimed at better understanding coming from similar users) was proposed. Here, PCA is
weather characteristics in order to improve energy pro- used as a preprocessor to reduce the redundancy in the
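The forecasting scheme of [199] described in the Weather subsection above (delay embedding, PCA projection, nearest-neighbor matching) can be sketched as follows. This is a minimal illustration: the synthetic wind-speed series, window length, and number of retained axes are assumptions for demonstration, not the settings of the original study.

```python
import numpy as np

def delay_matrix(series, dim):
    # Takens' method of delays: each row is a window of `dim` consecutive values
    return np.lib.stride_tricks.sliding_window_view(series, dim)

def fit_pca(X, m):
    # principal axes of the centered delay matrix via SVD
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:m]

# toy standardized wind-speed series (illustrative; [199] used measured records)
rng = np.random.default_rng(42)
t = np.arange(2000)
speed = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)

D = delay_matrix(speed, 20)              # delay matrix
train, test = D[:1500], D[1500:]
mu, axes = fit_pca(train, m=3)           # PCA reduces the matrix dimensionality
P_train = (train - mu) @ axes.T          # training windows in PCA space
P_test = (test - mu) @ axes.T            # test windows converted to the same space

# nearest-neighbor strategy: the state that followed the closest training
# window is taken as the forecast of the current state of the wind
i = int(np.argmin(np.linalg.norm(P_train[:-1] - P_test[0], axis=1)))
forecast = train[i + 1][-1]
```

The distance to the matched neighbor can also serve as a proxy for the expected forecast error, in the spirit of the original method.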
from language applications, the PCA can alternatively be used as a preprocessing step in text classification tasks. In the study carried out in [252], the authors aimed at classifying documents into several categories with a low computational cost. In order to perform the classification, the proposed method removed words conveying low semantic meaning (i.e., the stopwords). Then, the remaining words were stemmed, and a tf-idf (term frequency-inverse document frequency) weighting value [253] was assigned to each word. By using concepts from information theory, the best features (i.e., the most discriminative words) were selected according to the mutual information metric. After such a feature selection, PCA was applied to the remaining features as an additional feature selection step. As a criterion to select an adequate number of features, the method retained the principal components so as to keep 75% of the original data variability. The classification algorithm applied to the compressed data revealed that a high accuracy rate can be obtained even if several features are disregarded. The authors reported that when the total number of considered features decreases from 319 to 75, the accuracy only drops 3% in the textual classification task.

IX. EXPERIMENTAL STUDY OF PCA APPLIED TO DIVERSE DATABASES

One of the main reasons justifying the popularity of PCA is its ability to reduce the dimensionality of the original data while preserving its variation as much as possible. In addition to allowing faster and simpler computational analysis, this reduction also allows the identification of new important variables with distinctive explanatory capabilities. As reported in the literature, a small number of PCA variables can often account for most of the variance explanation. In this section we develop an experimental approach aimed at quantifying, in more objective terms, the explanatory ability of PCA with respect to some representative real-world databases derived from the previously surveyed material. Here, we considered two datasets representative of each of 10 different areas. The selected areas are astronomy, biology, chemistry, computer science, engineering, geography, linguistics, materials, medical, and weather. After the preprocessing required for eliminating incomplete and categorical data, PCA was applied to the databases. Two cases were considered for generality's sake: (a) without standardization; and (b) with standardization. The possible effects of the database size, number of features, and number of categories on the variance explanation were also investigated.

A. Dataset Selection

For all considered datasets, we eliminated non-numerical data, such as categorical values and dates. Some characteristics of the considered datasets are shown in Table II.

In the astronomy area, we considered data from galaxies [254]; more specifically, a table that comprises measurements of spectroscopic redshifts. The second dataset comprised ionosphere measurements obtained by radar [255]. In biology, we considered datasets regarding gene expression levels [256] and measurements of leaves [257, 258]. Two datasets of food were employed representing chemistry: (i) data on characteristics of wine [258, 259] and (ii) data on milk composition [260]. In the case of computer science, two different subjects were considered, namely characteristics of computers [258] (machine) and features of image segmentation data. This image segmentation (segment-challenge) dataset is part of the datasets provided by the software Weka [261]. In engineering, we used a dataset of the electric power consumption in houses [258] (energy) and information on concrete slump tests [258, 262].

The datasets of geography contain data of spatial coordinates and information related to weather. The first dataset is about dengue disease [263] and the second is about forest fires [258, 264]. In linguistics, the first dataset comprises the frequency of linguistic elements (e.g., punctuation and symbols) in texts of commerce reviews [258, 265]; the second one contains statistics of blog feedbacks [258, 266]. The datasets considered in the materials area are glass identification [267], with information on refractive index and chemical elements, and measurements regarding plate faults [258, 268]. In the medical area, one dataset is about characteristics of people with or without diabetes [269], and the other dataset considers biomedical voice measurements regarding patients with or without Parkinson's disease [258, 270]. Finally, the datasets of weather are: (i) environmental measures of El Niño [258, 271] and (ii) measures related to ozone level [258].

B. Results and Discussion

In order to compare the amount of variance retained by PCA for the different datasets, in Figure 11 we plot the number of PCA components against the respective variance ratio, defined by Equation 14. Figures 11(a) and (b) show the measurements for standardized and non-standardized data, respectively. Note that, in the majority of the cases for standardized data, the first three principal components can represent more than 50% of the variance in the datasets. When considering the data without standardization, 60% of the variance is contained in the first two principal components for the majority of the datasets. This is because, when the data is not standardized, a few measurements having large values dominate the variance in the data. One example of such an effect is shown in the linguistics (reviews) dataset, which has more than 50% of its variance explained by a single component without standardization, but negligible
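The document-classification pipeline of [252] summarized above can be sketched as follows. This is a minimal illustration on a toy corpus of already stemmed, stopword-free documents; the mutual-information selection step is omitted, and all names and values here are assumptions rather than the settings of the original study.

```python
import numpy as np

# toy corpus (illustrative; [252] worked on a real document collection)
docs = [["wine", "tasting", "note"], ["cheap", "wine", "sale"],
        ["galaxy", "redshift", "survey"], ["radar", "ionosphere", "survey"]]
vocab = sorted({w for d in docs for w in d})

# tf-idf weighting: term frequency times inverse document frequency
tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)
df = (tf > 0).sum(axis=0)
X = tf * np.log(len(docs) / df)

# PCA on the centered tf-idf matrix, keeping enough axes for 75% of the variance
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
ratio = np.cumsum(s**2) / np.sum(s**2)
m = int(np.searchsorted(ratio, 0.75)) + 1   # smallest m reaching 75%
Y = Xc @ Vt[:m].T                           # documents described by m components
```

Any classifier can then be trained on the compressed representation Y instead of the full tf-idf matrix, which is the source of the computational savings reported in [252].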
FIG. 12. Examples of the curve fits of the cumulative sum of the variance ratio against the number of dimensions, for standardized data. The circles represent the measured data and the line is the fitted curve. Panels: (c) geography (dengue); (d) materials (plates).

FIG. 13. PCA variance ratio against normalized number of dimensions, for all the datasets presented in Table II. Note that the normalized number of principal components consists in the fraction of components computed for each dataset. In this experiment all the data were standardized.

FIG. 14. By considering the standardized datasets, the curves are the average PCA variance ratio against the number of dimensions (in log scale), for all the datasets presented in Table II. The shaded area represents the standard deviation.

FIG. 15. In the case of the standardized datasets, the curves correspond to the average PCA variance ratio against the normalized number of dimensions, for all the datasets presented in Table II. The shaded area represents the standard deviation.
normalization. In both cases, the data was standardized before applying PCA.

X. CONCLUDING REMARKS

Principal component analysis – PCA – has become a standard approach in data analysis as a consequence of its ability to reduce dimensionality while preserving the variance of the data. In this work, we reported an integrated and systematic review of PCA covering several of its theoretical and applied aspects. We start by providing a simple and yet complete application example of PCA to real-world data (beans), and by identifying three typical ways in which PCA can be applied. Next, we developed the concept of PCA from more basic aspects of multivariate statistics, and presented the important issue of variance preservation. The option to normalize or not the original data is addressed subsequently, and it is shown that each of these alternatives can have a major impact on the obtained results. Guidelines are provided that can help the reader to decide whether or not to normalize the data.
Other aspects of PCA application are also addressed, including the orientation of the principal axes, the relationship between PCA and rotation, and the demonstration of the maximum variance of the first PCA axes. Another projection approach, namely LDA, is briefly presented next.

After presenting the several aspects and properties of PCA, we develop a systematic, but not exhaustive, review of some representative works from several distinct areas that have used PCA for the most diverse applications. This review fully substantiates the generality and efficacy of PCA for a wide range of data analysis applications, confirming its role as a method of choice for that purpose.

The last part of this work presents an experimental investigation of the potential of PCA for variance explanation and dimensionality reduction. Several real-world databases are considered, drawn from the main areas reviewed in the previous sections. The obtained results confirm the ability of PCA to explain several types of data while using only a few principal axes. Special attention was given to the study of the effects of data standardization on variance explanation, and we found that non-standardized data tend to yield more intense variance explanation. We also showed that the variance ratio curves can be reasonably well fitted by using the exponential function. This result allowed us to quantify the effect of data size, number of classes, and number of features on the overall variance explanation. Interestingly, it has been found that these properties do not tend to have any pronounced influence.

All in all, we hope that the reported work on PCA, covering from basic principles to a systematic survey of applications, and including experimental investigations of variance explanation, can provide resources for researchers from the most varied areas that can help them to better apply PCA and interpret the respective results.

ACKNOWLEDGMENTS

Gustavo R. Ferreira acknowledges financial support from CNPq (grant no. 158128/2017-6). Henrique F. de Arruda acknowledges CAPES for sponsorship. Filipi N. Silva thanks FAPESP (grants no. 2015/08003-4 and 2017/09280-7) for sponsorship. Cesar H. Comin thanks FAPESP (grant no. 15/18942-8) for financial support. Diego R. Amancio acknowledges financial support from FAPESP (16/19069-9 and 17/13464-6). Luciano da F. Costa thanks CNPq (grant no. 307333/2013-2) and NAP-PRP-USP for sponsorship. This work has also been supported by FAPESP grants 11/50761-2 and 2015/22308-2.

APPENDIX A - SYMBOLS

Number of features: N
Number of projected features: M
Number of objects: Q
Data matrix: X
i-th feature: X_i
i-th feature for all objects: \vec{X}_i
i-th feature of object j: X_{ij}
Average vector of data matrix X: \vec{\mu}_X
Average of feature X_i: \mu_{X_i}
Standard deviation vector of data matrix X: \vec{\sigma}_X
Standard deviation of feature X_i: \sigma_{X_i}
Transformed data matrix: Y
i-th transformed feature: Y_i
i-th transformed feature for all objects: \vec{Y}_i
Feature vector of object i: \vec{X}_i
Transformed feature vector of object i: \vec{Y}_i
PCA transformation matrix: W
i-th row of W: \vec{v}_i
Correlation matrix: R
Covariance matrix: K
Pearson correlation matrix: C
Pearson correlation between the i-th and j-th variables: C_{ij}
Pearson correlation between variables X_i and X_j: \rho_{X_i X_j}
Percentage of preserved variance: G
Expectation of random variable A: E[A]

TABLE III. Description of the main symbols used in this work.

APPENDIX B - CONSEQUENCES OF NORMALITY

As shown in the main text, PCA returns a maximal variance projection for any dataset of finite variance. If the underlying data follow a normal (Gaussian) distribution, more can be said: as we show here, the principal components are independent and maximize the projection's entropy given the original data.

Firstly, we recall that an N-dimensional random variable \vec{X} follows a normal distribution with mean \vec{\mu} and covariance matrix \Sigma – denoted \vec{X} \sim \mathcal{N}(\vec{\mu}, \Sigma) – if its probability density function is expressed as

f(\vec{x}) = \frac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^T \Sigma^{-1} (\vec{x}-\vec{\mu})},    (47)

where |\Sigma| denotes the determinant of \Sigma. We also state some basic properties of a normal distribution:

Lemma 1. Let \vec{X} = (X_1, X_2, \ldots, X_N) be a random variable following a Gaussian distribution with mean \vec{\mu} and covariance matrix \Sigma; then:
J[p] = -\int_{-\infty}^{+\infty} p(x) \ln p(x)\, dx + \lambda_0 \left[ 1 - \int_{-\infty}^{+\infty} p(x)\, dx \right] + \lambda_1 \left[ \mu - \int_{-\infty}^{+\infty} x\, p(x)\, dx \right] + \lambda_2 \left[ \sigma^2 - \int_{-\infty}^{+\infty} x^2\, p(x)\, dx \right].    (49)

Differentiating with respect to p (in the context of the calculus of variations; see, for instance, [278]) and equating to zero, we obtain:

PCorr(Y_i, \tilde{X}_j) = \frac{1}{Q-1} \sum_k \frac{Y_{ik} \tilde{X}_{jk}}{\sigma_{Y_i}}    (52)
                        = \frac{1}{(Q-1)\sqrt{\lambda_i}} \sum_k Y_{ik} \tilde{X}_{jk},    (53)

where Q is the number of objects. The k-th value of PCA component i is a linear combination of the original measurements weighted by the respective eigenvector, that is

Y_{ik} = \sum_l W_{il} \tilde{X}_{lk}.    (54)
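The stationarity condition obtained by differentiating the functional J[p] of Eq. (49) follows the standard maximum-entropy argument; a reconstruction of this step (consistent with the constraint terms of Eq. (49), not verbatim from the text) reads:

```latex
\frac{\delta J}{\delta p}
  = -\ln p(x) - 1 - \lambda_0 - \lambda_1 x - \lambda_2 x^2 = 0
\quad\Longrightarrow\quad
p(x) = e^{-1 - \lambda_0 - \lambda_1 x - \lambda_2 x^2},
```

so the entropy-maximizing density subject to fixed normalization, mean, and variance is the exponential of a quadratic, i.e., a Gaussian, with the multipliers \lambda_0, \lambda_1, \lambda_2 fixed by the three constraints.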
PCorr(\tilde{X}_j, \tilde{X}_1) W_{i1} + PCorr(\tilde{X}_j, \tilde{X}_2) W_{i2} + \ldots + PCorr(\tilde{X}_j, \tilde{X}_N) W_{iN} = \lambda_i W_{ij},    (57)

which can be more compactly represented as

PCorr(Y_i, \tilde{X}_j) = \sqrt{\lambda_i}\, W_{ij}.    (61)

Therefore, the Pearson correlation coefficient between PCA component Y_i and standardized variable \tilde{X}_j is given by the square root of the respective eigenvalue multiplied by the j-th element of the respective eigenvector.
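The relation in Eq. (61) is easy to verify numerically: build PCA from the eigendecomposition of the Pearson correlation matrix and compare the empirical correlations with sqrt(lambda_i) W_ij (the random data below are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, N = 500, 5
X = rng.normal(size=(Q, N)) @ rng.normal(size=(N, N))   # correlated features

# standardized variables (tilde X) and their Pearson correlation matrix C
Xt = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C = (Xt.T @ Xt) / (Q - 1)

lam, V = np.linalg.eigh(C)            # eigenvalues in ascending order
lam, W = lam[::-1], V[:, ::-1].T      # rows of W are the leading eigenvectors
Y = Xt @ W.T                          # PCA components for all objects

# empirical Pearson correlation between component Y_i and variable tilde X_j
corr = np.array([[np.corrcoef(Y[:, i], Xt[:, j])[0, 1]
                  for j in range(N)] for i in range(N)])
pred = np.sqrt(lam)[:, None] * W      # sqrt(lambda_i) * W_ij, as in Eq. (61)
```

The two matrices `corr` and `pred` agree to numerical precision, confirming the derivation above.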
[1] K. Ferguson, Tycho and Kepler: The Unlikely Partnership that Forever Changed our Understanding of the Heavens (Bloomsbury Publishing USA, 2002).
[2] G. Bell, T. Hey, and A. Szalay, Science 323, 1297 (2009).
[3] D. J. Hand, Drug Safety 30, 621 (2007).
[4] C. M. Bishop, Pattern Recognition and Machine Learning (Springer, 2006).
[5] I. Jolliffe, Principal Component Analysis, Springer Series in Statistics (Springer, 1986).
[6] L. da Fontoura Costa and R. M. Cesar Jr, Shape Classification and Analysis: Theory and Practice (CRC Press, Inc., 2009).
[7] H. Abdi and L. J. Williams, Wiley Interdisciplinary Reviews: Computational Statistics 2, 433 (2010).
[8] I. K. Fodor, A Survey of Dimension Reduction Techniques, Tech. Rep. (Lawrence Livermore National Lab., CA (US), 2002).
[9] P. Cunningham, in Machine Learning Techniques for Multimedia (Springer, 2008) pp. 91–112.
[10] D. P. Bertsekas and J. N. Tsitsiklis, Introduction to Probability, Vol. 1 (Athena Scientific, Belmont, MA, 2002).
[11] W. Feller, An Introduction to Probability Theory and its Applications, Vol. 2 (John Wiley & Sons, 2008).
[12] B. Everitt and A. Skrondal, The Cambridge Dictionary of Statistics, Vol. 106 (Cambridge University Press, Cambridge, 2002).
[13] K. Pearson, Proceedings of the Royal Society of London 58, 240 (1895).
[14] J. F. Hair, W. C. Black, B. J. Babin, R. E. Anderson, R. L. Tatham, et al., Multivariate Data Analysis, Vol. 5 (Prentice Hall, Upper Saddle River, NJ, 1998).
[15] M. E. Tipping and C. M. Bishop, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 611 (1999).
[16] C. M. Bishop, in Advances in Neural Information Processing Systems (1999) pp. 382–388.
[17] L. K. Hansen, J. Larsen, F. Å. Nielsen, S. C. Strother, E. Rostrup, R. Savoy, N. Lange, J. Sidtis, C. Svarer, and O. B. Paulson, NeuroImage 9, 534 (1999).
[18] R. A. Fisher, Annals of Human Genetics 7, 179 (1936).
[19] K. R. Gabriel, Biometrika 58, 453 (1971).
[20] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (John Wiley & Sons, 2012).
[21] K. Fukunaga, Introduction to Statistical Pattern Classification (Academic Press, USA, 1990).
[22] J. L. Fish, B. Villmoare, K. Köbernick, C. Compagnucci, O. Britanova, V. Tarabykin, and M. J. Depew, Evolution & Development 13, 549 (2011).
[23] K. Birnbaum, D. E. Shasha, J. Y. Wang, J. W. Jung, G. M. Lambert, D. W. Galbraith, and P. N. Benfey, Science 302, 1956 (2003).
[24] M. Dai et al., Nucleic Acids Research 33, e175 (2005).
[25] B. Alberts et al., Molecular Biology of the Cell, 6th ed. (Garland Science, 2014).
[26] A. Kauffmann, R. Gentleman, and W. Huber, Bioinformatics 25, 415 (2009).
[27] S. Chapman, S. P, K. K, and J. Manners, Bioinformatics 18, 202 (2002).
[28] Z. Lin et al., PNAS 113, 14662 (2016).
[29] F. Wagner, PLoS ONE 10, e0143196 (2015).
[81] R. Bro and A. K. Smilde, Analytical Methods 6, 2812 (2014).
[82] L. Forveille, J. Vercauteren, and D. N. Rutledge, Food Chemistry 57, 441 (1996).
[83] N. J. Bailey, J. Sampson, P. J. Hylands, J. K. Nicholson, and E. Holmes, Planta Medica 68, 734 (2002).
[84] Y. Wang, H. Tang, J. K. Nicholson, P. J. Hylands, J. Sampson, I. Whitcombe, C. G. Stewart, S. Caiger, I. Oru, and E. Holmes, Planta Medica 70, 250 (2004).
[85] D. OuYang, J. Xu, H. Huang, and Z. Chen, Applied Biochemistry and Biotechnology 165, 148 (2011).
[86] C. Ceribeli, A. de Zawadzki, A. C. R. Mondini, L. A. Colnago, L. H. Skibsted, and D. R. Cardoso, Journal of the Brazilian Chemical Society, 1 (2018).
[87] D. Scott, P. Coveney, J. Kilner, J. Rossiny, and N. M. N. Alford, Journal of the European Ceramic Society 27, 4425 (2007).
[88] X. V. Eynde and P. Bertrand, Surface and Interface Analysis 25, 878 (1997).
[89] P. M. Shenai, Z. Xu, and Y. Zhao, in Principal Component Analysis: Engineering Applications (InTech, 2012).
[90] M. Wagner and D. G. Castner, Langmuir 17, 4649 (2001).
[91] K. Rajan, Materials Today 8, 38 (2005).
[92] C. Suh, A. Rajagopalan, X. Li, and K. Rajan, Data Science Journal 1, 19 (2002).
[93] U. Tisch and H. Haick, MRS Bulletin 35, 797 (2010).
[94] L. da F. Costa, F. N. Silva, and C. H. Comin, Electrical Engineering, 1 (2016).
[95] L. da F. Costa, F. N. Silva, and C. H. Comin, Physica A 499, 176 (2018).
[96] X. Hua, Y. Ni, J. Ko, and K. Wong, Journal of Computing in Civil Engineering 21, 122 (2007).
[97] G. Kerschen, P. De Boe, J.-C. Golinval, and K. Worden, Smart Materials and Structures 14, 36 (2004).
[98] L. Shuang and L. Meng, in Mechatronics and Automation, 2007. ICMA 2007. International Conference on (IEEE, 2007) pp. 3503–3507.
[99] A. Malhi and R. X. Gao, IEEE Transactions on Instrumentation and Measurement 53, 1517 (2004).
[100] C.-M. Kwan, R. Xu, and L. S. Haynes, in Thermosense XXIII, Vol. 4360 (International Society for Optics and Photonics, 2001) pp. 285–290.
[101] B. Liu and V. Makis, IMA Journal of Management Mathematics 19, 39 (2007).
[102] W. Li and Y. Xu, International Journal of Modelling, Identification and Control 10, 246 (2010).
[103] D. Basak, S. Pal, and D. C. Patranabis, Neural Information Processing-Letters and Reviews 11, 203 (2007).
[104] P. Nomikos and J. F. MacGregor, AIChE Journal 40, 1361 (1994).
[105] L. da F. Costa, arXiv preprint arXiv:1801.06025 (2018).
[106] R. Dunia, S. J. Qin, T. F. Edgar, and T. J. McAvoy, AIChE Journal 42, 2797 (1996).
[107] M. Turk and A. Pentland, Journal of Cognitive Neuroscience 3, 71 (1991).
[108] M. Kirby and L. Sirovich, IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 103 (1990).
[109] L. Sirovich and M. Kirby, JOSA A 4, 519 (1987).
[110] G. Lu, D. Zhang, and K. Wang, Pattern Recognition Letters 24, 1463 (2003).
[111] T. Connie, A. T. B. Jin, M. G. K. Ong, and D. N. C. Ling, Image and Vision Computing 23, 501 (2005).
[112] A. Basit, M. Y. Javed, and M. A. Anjum, in WEC (2) (2005) pp. 24–26.
[113] P. Gawron, P. Głomb, J. A. Miszczak, and Z. Puchała, in Man-Machine Interactions 2 (Springer, 2011) pp. 49–56.
[114] C. Fookes and S. Sridharan, in Information Sciences Signal Processing and their Applications (ISSPA), 2010 10th International Conference on (IEEE, 2010) pp. 654–657.
[115] C. Fookes, A. Maeder, S. Sridharan, and G. Mamic, in Behavioral Biometrics for Human Identification: Intelligent Applications (2009) pp. 237–263.
[116] A. M. Martínez and A. C. Kak, IEEE Transactions on Pattern Analysis and Machine Intelligence 23, 228 (2001).
[117] R. Vidal, Y. Ma, and S. Sastry, IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 1945 (2005).
[118] M. H. Bharati, J. J. Liu, and J. F. MacGregor, Chemometrics and Intelligent Laboratory Systems 72, 57 (2004).
[119] Y. Ke and R. Sukthankar, in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, Vol. 2 (IEEE, 2004) pp. II–II.
[120] D. G. Lowe, in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, Vol. 2 (IEEE, 1999) pp. 1150–1157.
[121] Y. Gong, S. Lazebnik, A. Gordo, and F. Perronnin, IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 2916 (2013).
[122] R. Brunelli and T. Poggio, Biological Cybernetics 69, 235 (1993).
[123] H. Bourlard and Y. Kamp, Biological Cybernetics 59, 291 (1988).
[124] E. Oja, Neural Networks 5, 927 (1992).
[125] F. Song, Z. Guo, and D. Mei, in System Science, Engineering Design and Manufacturing Informatization (ICSEM), 2010 International Conference on, Vol. 1 (IEEE, 2010) pp. 27–30.
[126] H. K. Ekenel and B. Sankur, Pattern Recognition Letters 25, 1377 (2004).
[127] A. Janecek, W. Gansterer, M. Demel, and G. Ecker, in New Challenges for Feature Selection in Data Mining and Knowledge Discovery (2008) pp. 90–105.
[128] Q. Du and J. E. Fowler, IEEE Geoscience and Remote Sensing Letters 4, 201 (2007).
[129] C. A. Bailer-Jones, Publications of the Astronomical Society of the Pacific 109, 932 (1997).
[130] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, in European Conference on Computer Vision (Springer, 2014) pp. 584–599.
[131] G. Liu and L. McMillan, in Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (Eurographics Association, 2006) pp. 127–135.
[132] J. Chen and C.-M. Liao, Journal of Process Control 12, 277 (2002).
[133] A. H. Sahoolizadeh, B. Z. Heidari, and C. H. Dehghani, International Journal of Computer Science and Engineering 2, 218 (2008).
[134] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7, 2094 (2014).
[135] Y. Sun, Y. Chen, X. Wang, and X. Tang, in Advances
ing 19, 1501 (1998).
[192] R. Daley, Atmospheric Data Analysis, 2 (Cambridge University Press, 1993).
[193] D. Rind, Science 284, 105 (1999).
[194] T. Gneiting and A. E. Raftery, Science 310, 248 (2005).
[195] M. Jaruszewicz and J. Mandziuk, in Neural Information Processing, 2002. ICONIP'02. Proceedings of the 9th International Conference on, Vol. 5 (IEEE, 2002) pp. 2359–2363.
[196] N. Sharma, P. Sharma, D. Irwin, and P. Shenoy, in Smart Grid Communications (SmartGridComm), 2011 IEEE International Conference on (IEEE, 2011) pp. 528–533.
[197] Y. S. Maslennikova, V. Bochkarev, and D. Voloskov, in Journal of Physics: Conference Series, Vol. 574 (IOP Publishing, 2015) p. 012152.
[198] M. Zarzo and P. Martí, Applied Energy 88, 2775 (2011).
[199] C. Skittides and W.-G. Früh, Renewable Energy 69, 365 (2014).
[200] F. Takens et al., Lecture Notes in Mathematics 898, 366 (1981).
[201] F. Davò, S. Alessandrini, S. Sperati, L. Delle Monache, D. Airoldi, and M. T. Vespucci, Solar Energy 134, 327 (2016).
[202] W. N. Venables and B. D. Ripley, Modern Applied Statistics with S-PLUS (Springer Science & Business Media, 2013).
[203] S. Alessandrini, L. Delle Monache, S. Sperati, and J. Nissen, Renewable Energy 76, 768 (2015).
[204] S. R. Loarie, B. E. Carter, K. Hayhoe, S. McMahon, R. Moe, C. A. Knight, and D. D. Ackerly, PLoS ONE 3, e2502 (2008).
[205] T. Chan and M. Mozurkewich, Atmospheric Chemistry and Physics 7, 887 (2007).
[206] L. Pineda-Martínez, N. Carbajal, and E. Medina-Roldan, Atmósfera 20, 133 (2007).
[207] L. Boruvka, O. Vacek, and J. Jehlicka, Geoderma 128, 289 (2005).
[208] M. Nilashi, O. bin Ibrahim, N. Ithnin, and N. H. Sarmin, Electronic Commerce Research and Applications 14, 542 (2015).
[209] M. Tkalcic, M. Kunaver, J. Tasic, and A. Košir, in Proceedings of the 5th Workshop on Emotion in Human-Computer Interaction: Real World Challenges (2009) pp. 30–37.
[210] G. Hankinson, Journal of Services Marketing 19, 24 (2005).
[211] J. T. Coshall, Journal of Travel Research 39, 85 (2000).
[212] V. Vieira, R. Fabbri, G. Travieso, O. N. Oliveira Jr, and L. da F. Costa, Journal of Statistical Mechanics: Theory and Experiment 2012, P08010 (2012).
[213] V. Vieira, R. Fabbri, D. Sbrissa, and L. da F. Costa, Physica A 417, 110 (2015).
[214] P.-S. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012) pp. 57–60.
[215] P. Smaragdis and J. C. Brown, in 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684) (2003) pp. 177–180.
[216] Y.-H. Yang, D. Bogdanov, P. Herrera, and M. Sordo, in 21st International World Wide Web Conference (WWW 2012): Advances in Music Information Research (AdMIRe 2012) (Lyon, France, 2012).
[217] S. Baronti, A. Casini, F. Lotti, and S. Porcinai, Chemometrics and Intelligent Laboratory Systems 39, 103 (1997).
[218] M. Bacci, R. Chiari, S. Porcinai, and B. Radicati, Chemometrics and Intelligent Laboratory Systems 39, 115 (1997).
[219] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, IEEE Transactions on Audio, Speech, and Language Processing 16, 448 (2008).
[220] Y.-A. Chen, J.-C. Wang, Y.-H. Yang, and H. Chen, in ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (2014) pp. 2149–2153.
[221] R. Recours, F. Aussagel, and N. Trujillo, Culture, Medicine and Psychiatry 33, 473 (2009).
[222] B. Whitman, G. Flake, and S. Lawrence, in Neural Networks for Signal Processing XI, 2001. Proceedings of the 2001 IEEE Signal Processing Society Workshop (IEEE, 2001) pp. 559–568.
[223] P. Smaragdis and J. C. Brown, in Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on (IEEE, 2003) pp. 177–180.
[224] Y. Panagakis, C. Kotropoulos, and G. R. Arce, in Signal Processing Conference, 2009 17th European (IEEE, 2009) pp. 1–5.
[225] S. Baronti, A. Casini, F. Lotti, and S. Porcinai, Applied Optics 37, 1299 (1998).
[226] M. J. Baxter, Exploratory Multivariate Analysis in Archaeology (Edinburgh University Press, 1994).
[227] J. D. Wilcock, Archaeology in the Age of the Internet, CAA 97, 35e51 (1999).
[228] U. Vinagre Filho, R. Latini, A. V. Bellido, A. Buarque, and A. Borges, Brazilian Journal of Physics 35, 779 (2005).
[229] R. Ravisankar, A. Naseerutheen, A. Chandrasekaran, S. Bramha, K. Kanagasabapathy, M. Prasad, and K. Satpathy, Journal of Radiation Research and Applied Sciences 7, 44 (2014).
[230] A. Moropoulou and K. Polikreti, Journal of Cultural Heritage 10, 73 (2009).
[231] R. Klinger, W. Schwanghart, and B. Schütt, DIE ERDE–Journal of the Geographical Society of Berlin 142, 213 (2011).
[232] A. R. Templeton, Population Genetics and Microevolutionary Theory (John Wiley & Sons, 2006).
[233] X. Chunjie, L. Cavalli-Sforza, E. Minch, and D. Ruofu, Science in China Ser. C 43, 472 (2000).
[234] P. Moorjani, N. Patterson, J. N. Hirschhorn, A. Keinan, L. Hao, G. Atzmon, E. Burns, H. Ostrer, A. L. Price, and D. Reich, PLoS Genetics 7, e1001373 (2011).
[235] K.-H. Chen and L. L. Cavalli-Sforza, Human Biology, 367 (1983).
[236] S. K. Lenka, Theoretical & Applied Economics 22 (2015).
[237] S. Kolenikov and G. Angeles, Review of Income and Wealth 55, 128 (2009).
[238] D. Filmer and L. H. Pritchett, Demography 38, 115 (2001).
[239] H. Liu, Journal of Computer-Mediated Communication 13, 252 (2007).
[240] S. W. Gangestad, J. A. Simpson, A. J. Cousins, C. E.
2012): 4th International Workshop on Advances in Garver-Apgar, and P. N. Christensen, Psychological
33
Science 15, 203 (2004). [259] M. Forma, R. Leardi, C. Armanino, S. Lanteri, P. Conti,
[241] J. Binongo and M. Smith, Literary and Linguistic Com- and P. Princi, PARVUS, an Extendable Package of Pro-
puting 14, 445 (1999). grams for Data Exploration, Classification and Correla-
[242] R. A. Diego, O. N. Oliveira Jr., and L. da F. Costa, tion (Elsevier Scientific Software, Amsterdam, 1988).
Physica A: Statistical Mechanics and its Applications [260] J. Daudin, C. Duby, and P. Trecourt, Statistics: A jour-
391, 4406 (2012). nal of theoretical and applied statistics 19, 241 (1988).
[243] D. R. Amancio, O. N. Oliveira Jr., and L. da F. Costa, [261] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data
New Journal of Physics 14, 043029 (2012). Mining: Practical Machine Learning Tools and Tech-
[244] D. R. Amancio, Journal of Statistical Mechanics: The- niques (Morgan Kaufmann, 2016).
ory and Experiment 2015, P03005 (2015). [262] I.-C. Yeh, Journal of Computing in Civil Engineering
[245] D. R. Amancio, Scientometrics 105, 1763 (2015). 20, 217 (2006).
[246] H. F. de Arruda, L. da. F. Costa, and D. R. Amancio, [263] S. Hales, N. De Wet, J. Maindonald, and A. Woodward,
EPL (Europhysics Letters) 113, 28007 (2016). The Lancet 360, 830 (2002).
[247] V. Q. Marinho, H. F. de Arruda, T. S. Lima, L. da Fon- [264] P. Cortez and A. d. J. R. Morais, (2007).
toura Costa, and D. R. Amancio, in TextGraphs@ACL [265] S. Liu, Z. Liu, J. Sun, and L. Liu, International Journal
(Association for Computational Linguistics, 2017) pp. of Digital Content Technology and its Applications 5,
1–10. 126 (2011).
[248] G. Vinodhini and R. M. Chandrasekaran, in Interna- [266] K. Buza, in Data analysis, machine learning and knowl-
tional Conference on Information Communication and edge discovery (Springer, 2014) pp. 145–152.
Embedded Systems (ICICES2014) (2014) pp. 1–6. [267] I. W. Evett and J. S. Ernest, Reading, Berkshire RG7
[249] D. R. Amancio, PLOS ONE 10, e0118394 (2015). 4PN (1987).
[250] P. Marcie, M. Roudier, M.-C. Goldblum, and F. Boller, [268] “Dataset provided by Semeion, Research Center of Sci-
Journal of Communication Disorders 26, 53 (1993). ences of Communication, Via Sersale 117, 00128, Rome,
[251] M. López, J. Ramírez, J. M. Górriz, I. Álvarez, D. Salas- Italy,” (2018).
Gonzalez, F. Segovia, R. Chaves, P. Padilla, and [269] J. W. Smith, J. Everhart, W. Dickson, W. Knowler, and
M. Gómez-Río, Neurocomput. 74, 1260 (2011). R. Johannes, in Proceedings of the Annual Symposium
[252] H. Uguz, Knowledge-Based Systems 24, 1024 (2011). on Computer Application in Medical Care (American
[253] C. D. Manning and H. Schütze, Foundations of Statis- Medical Informatics Association, 1988) p. 261.
tical Natural Language Processing (MIT Press, Cam- [270] M. A. Little, P. E. McSharry, S. J. Roberts, D. A.
bridge, MA, USA, 1999). Costello, and I. M. Moroz, BioMedical Engineering On-
[254] K. W. Willett, C. J. Lintott, S. P. Bamford, K. L. Mas- Line 6, 23 (2007).
ters, B. D. Simmons, K. R. Casteels, E. M. Edmondson, [271] S. D. Bay, D. Kibler, M. J. Pazzani, and P. Smyth,
L. F. Fortson, S. Kaviraj, W. C. Keel, et al., Monthly ACM SIGKDD Explorations Newsletter 2, 81 (2000).
Notices of the Royal Astronomical Society 435, 2835 [272] J. J. Moré, in Numerical analysis (Springer, 1978) pp.
(2013). 105–116.
[255] V. G. Sigillito, S. P. Wing, L. V. Hutton, and K. B. [273] T. W. Anderson, An Introduction to Multivariate Sta-
Baker, Johns Hopkins APL Technical Digest 10, 262 tistical Analysis, 3rd ed. (John Wiley & Sons, 2003).
(1989). [274] T. M. Cover and J. A. Thomas, Elements of Information
[256] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Bot- Theory, 2nd ed. (John Wiley & Sons, 2006).
stein, Proceedings of the National Academy of Sciences [275] E. T. Jaynes, Physical Review 106, 620 (1957).
95, 14863 (1998). [276] A. Caticha, AIP Conference Proceedings 1305, 20
[257] P. F. Silva, A. R. Marcal, and R. M. A. da Silva, in In- (2011).
ternational Conference Image Analysis and Recognition [277] A. Caticha and R. Preuss, Physical Review E 70, 046127
(Springer, 2013) pp. 197–204. (2004).
[258] D. Dheeru and E. Karra Taniskidou, “UCI machine [278] I. M. Gelfand and S. V. Fomin, Calculus of Variations
learning repository,” (2017). (Prentice-Hall, 1963).