Principal Component Analysis: Term Paper For Data Mining & Data Warehousing
Guided by:
Asst. Prof.
Dept. of CSE
Submitted by:
Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. PCA is sensitive to the relative scaling of the original variables.
PCA was invented in 1901 by Karl Pearson. It is now mostly used as a tool in exploratory data analysis and for building predictive models. PCA can be carried out by eigenvalue decomposition of a data covariance matrix or by singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores (the transformed variable values corresponding to a particular case in the data) and loadings (the weight by which each original variable is multiplied to obtain the component score).
PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation
can be thought of as revealing the internal structure of the data in a way which best explains
the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a
high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-
dimensional picture, a "shadow" of this object when viewed from its (in some sense) most
informative viewpoint. This is done by using only the first few principal components so that
the dimensionality of the transformed data is reduced.
PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix.
OBJECTIVES OF PRINCIPAL COMPONENTS ANALYSIS:
1. To discover or reduce the dimensionality of the data set.
2. To identify new, meaningful underlying variables.
For a data matrix X^T with zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), where each of the n rows represents a different repetition of the experiment and each of the m columns gives a particular kind of datum, the PCA transformation is given by

    Y^T = X^T W = V Σ^T

where the matrices W, Σ, and V are given by a singular value decomposition (SVD) of X, X = W Σ V^T. Σ is an m-by-n diagonal matrix with nonnegative real numbers on the diagonal.
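As an illustration of this SVD formulation, here is a minimal numpy sketch. The data, seed, and variable names are assumptions made purely for illustration; the sketch follows the convention above that X holds one row per variable and one column per observation.

    import numpy as np

    # Hypothetical data: m = 3 variables (rows), n = 100 observations (columns).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 100))

    X = X - X.mean(axis=1, keepdims=True)    # mean-centre each variable

    # SVD: X = W Σ V^T; the columns of W are the principal directions.
    W, s, Vt = np.linalg.svd(X, full_matrices=False)

    # PCA transformation: Y = W^T X (equivalently Y^T = X^T W = V Σ^T).
    Y = W.T @ X

    # The component scores are uncorrelated; their variances are s^2 / (n - 1).
    print(np.allclose(np.cov(Y), np.diag(s**2 / (X.shape[1] - 1))))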
1. The first principal component (the eigenvector with the largest eigenvalue) corresponds to a line that passes through the mean and minimizes the sum of squared errors with those points (see the sketch after this list).
2. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points.
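The first of these properties can be checked numerically. The sketch below uses assumed random data (none of the numbers come from this paper) and compares the squared error of the line along the first principal direction against many randomly chosen directions through the mean.

    import numpy as np

    # Assumed 2-D data with correlated coordinates.
    rng = np.random.default_rng(1)
    pts = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=200)
    centred = pts - pts.mean(axis=0)

    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    w1 = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue

    def sq_error(direction):
        """Sum of squared perpendicular distances to the line through the mean."""
        d = direction / np.linalg.norm(direction)
        residual = centred - np.outer(centred @ d, d)
        return np.sum(residual ** 2)

    # The first principal direction beats (or ties) every random candidate direction.
    candidates = rng.normal(size=(1000, 2))
    print(all(sq_error(w1) <= sq_error(c) + 1e-9 for c in candidates))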
Each eigenvalue is proportional to the portion of the "variance" that is correlated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean.
PCA essentially rotates the set of points around their mean in order to align with the first few principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. PCA is often used in this manner for dimensionality reduction.
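A short sketch of this use of PCA for dimensionality reduction, on assumed synthetic data that has an approximately two-dimensional structure embedded in ten dimensions:

    import numpy as np

    # Assumed data: 500 points in 10 dimensions, generated from 2 latent factors plus noise.
    rng = np.random.default_rng(2)
    latent = rng.normal(size=(500, 2))
    X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))

    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    kept = 2
    scores = Xc @ eigvecs[:, :kept]          # coordinates along the first two principal axes
    reconstruction = scores @ eigvecs[:, :kept].T
    rel_error = np.linalg.norm(Xc - reconstruction) / np.linalg.norm(Xc)

    print(f"variance captured by first {kept} components: {eigvals[:kept].sum() / eigvals.sum():.4f}")
    print(f"relative reconstruction error after dropping 8 of 10 dimensions: {rel_error:.4f}")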
Mean subtraction (i.e. "mean centering") is necessary for performing PCA to ensure that the
first principal component describes the direction of maximum variance. If mean subtraction is
not performed, the first principal component might instead correspond more or less to the
mean of the data. A mean of zero is needed for finding a basis that minimizes the mean
square error of the approximation of the data.
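The effect of skipping mean subtraction can be seen in a small sketch (the data set is assumed for illustration): without centring, the leading direction found from the raw data points roughly towards the data mean, whereas after centring it recovers the true direction of maximum variance.

    import numpy as np

    # Assumed data centred far from the origin: mean ≈ (10, 10), most variance along x.
    rng = np.random.default_rng(3)
    X = rng.normal(loc=[10.0, 10.0], scale=[1.0, 0.2], size=(1000, 2))

    def first_direction(A):
        """Leading right-singular vector of A (leading eigenvector of A^T A)."""
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[0]

    mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
    print(np.abs(first_direction(X) @ mean_dir))                                # ~1: aligned with the mean
    print(np.abs(first_direction(X - X.mean(axis=0)) @ np.array([1.0, 0.0])))   # ~1: true max-variance axis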
Assuming zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), the first principal component w_1 of a data set X can be defined as

    w_1 = arg max_{||w|| = 1} Var(w^T x) = arg max_{||w|| = 1} E{ (w^T x)^2 }

With the first k − 1 components, the kth component can be found by subtracting the first k − 1 principal components from X,

    X_(k−1) = X − ∑_{i=1}^{k−1} w_i w_i^T X

and by substituting this as the new data set to find a principal component in

    w_k = arg max_{||w|| = 1} E{ (w^T X_(k−1))^2 }

The reduced-space data matrix Y is then obtained by projecting X down into the reduced space defined by only the first L singular vectors, W_L:

    Y = W_L^T X = Σ_L V_L^T
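This greedy definition translates directly into code. The sketch below, on assumed random data, extracts each component as the direction of maximum variance, subtracts (deflates) it from the data, and checks that the resulting directions agree with the eigenvectors of X X^T.

    import numpy as np

    # Assumed data: 5 variables (rows), 200 observations (columns), mean-centred.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(5, 200))
    X = X - X.mean(axis=1, keepdims=True)

    def leading_direction(A):
        """Unit vector w maximising E[(w^T a)^2]: the leading eigenvector of A A^T."""
        vals, vecs = np.linalg.eigh(A @ A.T)
        return vecs[:, -1]

    components = []
    X_hat = X.copy()
    for _ in range(X.shape[0]):
        w = leading_direction(X_hat)
        components.append(w)
        X_hat = X_hat - np.outer(w, w) @ X_hat    # remove the component just found

    W_greedy = np.column_stack(components)
    W_eig = np.linalg.eigh(X @ X.T)[1][:, ::-1]   # same directions, ordered by decreasing eigenvalue
    print(np.allclose(np.abs(W_greedy.T @ W_eig), np.eye(X.shape[0]), atol=1e-6))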
The matrix W of singular vectors of X is equivalently the matrix W of eigenvectors of the matrix of observed covariances C = X X^T, since

    X X^T = W Σ Σ^T W^T

PCA can therefore be carried out as an eigendecomposition of the covariance matrix of the data; if each variable is first standardized to unit variance, this amounts to an eigendecomposition of the correlation matrix instead.
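A short numerical check of this equivalence, and of the covariance-versus-correlation distinction, on assumed data with deliberately unequal variable scales:

    import numpy as np

    # Assumed data: 4 variables (rows) with very different scales, 300 observations.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(4, 300)) * np.array([[1.0], [5.0], [0.5], [2.0]])
    X = X - X.mean(axis=1, keepdims=True)

    # Left singular vectors of X coincide with the eigenvectors of X X^T.
    W_svd, _, _ = np.linalg.svd(X, full_matrices=False)
    _, W_eig = np.linalg.eigh(X @ X.T)
    print(np.allclose(np.abs(W_svd.T @ W_eig[:, ::-1]), np.eye(4), atol=1e-6))

    # Standardising each variable first means diagonalising the correlation matrix.
    Z = X / X.std(axis=1, ddof=1, keepdims=True)
    print(np.allclose(np.cov(Z), np.corrcoef(X)))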
ASSUMPTIONS:
Derivation of PCA is based on the following assumptions:
Assumption on linearity-
We assume the observed data set to be a linear combination of a certain basis. Non-linear methods such as kernel PCA have been developed without assuming linearity.
Random Sampling-
Each subject will contribute one score on each observed variable. These sets of
scores should represent a random sample drawn from the population of interest.
Normal Distributions-
Each pair of observed variables should display a bivariate normal distribution; e.g.,
they should form an elliptical scattergram when plotted.
We take any data set; in this example we take a two-dimensional data set so that plotting may be demonstrated.
For PCA to work properly, we have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the x values have the mean of the x values of all the data points subtracted from them, and all the y values have the mean of the y values subtracted from them. This produces a data set whose mean is zero.
      Data              DataAdjust
    x      y            x       y
   2.5    2.4          0.69    0.49
   0.5    0.7         -1.31   -1.21
   2.2    2.9          0.39    0.99
   1.9    2.2          0.09    0.29
   3.1    3.0          1.29    1.09
   2.3    2.7          0.49    0.79
   2.0    1.6          0.19   -0.31
   1.0    1.1         -0.81   -0.81
   1.5    1.6         -0.31   -0.31
   1.1    0.9         -0.71   -1.01
Fig: Plot of the data
The covariance matrix is given by

    C^{n×n} = ( c_{i,j} : c_{i,j} = cov(Dim_i, Dim_j) )

where C^{n×n} is a matrix with n rows and n columns, and Dim_x is the xth dimension. The covariance matrix for the two-dimensional data set above can hence be found to be

    C = | 0.616555556   0.615444444 |
        | 0.615444444   0.716555556 |
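This matrix can be reproduced in a few lines of numpy from the data values tabulated above:

    import numpy as np

    # The ten (x, y) points of the worked example.
    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    data_adjust = data - data.mean(axis=0)    # subtract the per-dimension mean
    C = np.cov(data_adjust, rowvar=False)     # unbiased (n - 1) covariance matrix
    print(C)
    # [[0.61655556 0.61544444]
    #  [0.61544444 0.71655556]]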
By this process of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterise the data. The rest of the steps involve transforming the data so that it is expressed in terms of those lines.
Here is where the notion of data compression and reduced dimensionality comes
into it. Once eigenvectors are found from the covariance matrix, the next step is to
order them by eigenvalue, highest to lowest. This gives us the components in order
of significance. We can also decide to ignore the components of lesser significance.
The feature vector is constructed by taking the eigenvectors that we want to keep
from the list of eigenvectors, and forming a matrix with these eigenvectors in the
columns.
In our example data set, since we have two eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors, or we can choose to leave out the smaller, less significant component and keep only a single column, as computed in the sketch below.
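The sketch below carries out this step on the covariance matrix computed above: it orders the eigenvectors by decreasing eigenvalue and forms either the full feature vector or the single-column version (the variable names are illustrative only).

    import numpy as np

    C = np.array([[0.61655556, 0.61544444],
                  [0.61544444, 0.71655556]])

    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # most significant component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    feature_vector_both = eigvecs             # keep both components (no reduction)
    feature_vector_one = eigvecs[:, [0]]      # keep only the most significant component

    print(eigvals)                            # roughly [1.284, 0.049]: the first component dominates
    print(feature_vector_one)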
Fig: A plot of the normalised data (mean subtracted) with the eigenvectors of the covariance matrix overlaid on top.
Once we have chosen the components (eigenvectors) that we wish to keep in our
data and formed a feature vector, we simply take the transpose of the vector and
multiply it on the left of the original data set, transposed.
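Putting the worked example together, a minimal sketch of this final multiplication (variable names are illustrative, not prescribed by the paper):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = data - data.mean(axis=0)

    eigvals, eigvecs = np.linalg.eigh(np.cov(data_adjust, rowvar=False))
    feature_vector = eigvecs[:, ::-1]         # both eigenvectors, most significant first

    # FinalData = FeatureVector^T x DataAdjust^T: one row per component, one column per point.
    final_data = feature_vector.T @ data_adjust.T
    print(final_data.T)                       # each original point expressed in the new basis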
Both PCA and factor analysis are often used to construct multiple-item scales from the items that constitute questionnaires. Regardless of which method is used, once these scales have been developed it is often desirable to assess their reliability by computing coefficient alpha: an index of internal consistency reliability.
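As a rough illustration of coefficient alpha (the item scores below are hypothetical, not from the paper), one common formulation is alpha = k/(k − 1) · (1 − sum of item variances / variance of the total score):

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """items: one row per respondent, one column per questionnaire item."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Hypothetical questionnaire: five noisy items measuring one shared trait.
    rng = np.random.default_rng(6)
    trait = rng.normal(size=(100, 1))
    items = trait + 0.5 * rng.normal(size=(100, 5))
    print(round(cronbach_alpha(items), 2))    # fairly high internal consistency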