
Principal Component Analysis

Term Paper for Data Mining & Data Warehousing

Guided by:

Mrs. Rosy Das (Sarmah)

Asst. Prof.

Dept. of CSE

Submitted by:

Soma Sarkar (CSB07012)

Rohit Kashyap (CSB07014)


INTRODUCTION:
Principal component analysis (PCA) involves a mathematical procedure that transforms a
number of possibly correlated variables into a number of uncorrelated variables called
principal components, related to the original variables by an orthogonal transformation.

This transformation is defined in such a way that the first principal component has as high a
variance as possible (that is, accounts for as much of the variability in the data as possible),
and each succeeding component in turn has the highest variance possible under the
constraint that it be orthogonal to the preceding components. PCA is sensitive to the
relative scaling of the original variables.

Depending on the field of application, it is also named the discrete Karhunen–Loève transform (KLT), the Hotelling transform or proper orthogonal decomposition (POD).

PCA was invented in 1901 by Karl Pearson. Now it is mostly used as a tool in exploratory
data analysis and for making predictive models. PCA can be done by eigenvalue
decomposition of a data covariance matrix or singular value decomposition of a data matrix,
usually after mean centering the data for each attribute. The results of a PCA are usually
discussed in terms of component scores (the transformed variable values corresponding to a
particular case in the data) and loadings (the variance each original variable would have if
the data were projected onto a given PCA axis).

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation
can be thought of as revealing the internal structure of the data in a way which best explains
the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a
high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-
dimensional picture, a "shadow" of this object when viewed from its (in some sense) most
informative viewpoint. This is done by using only the first few principal components so that
the dimensionality of the transformed data is reduced.

PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix.
OBJECTIVES OF PRINCIPAL COMPONENTS ANALYSIS:
1. To discover or reduce the dimensionality of the dataset.
2. To identify new meaningful underlying variables.

PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

For a data matrix X^T with zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), where each of the n rows represents a different repetition of the experiment, and each of the m columns gives a particular kind of datum, the PCA transformation is given by

Y^T = X^T W = V Σ^T,

where the matrices W, Σ and V are given by a singular value decomposition (SVD) of X as X = W Σ V^T. Σ is an m-by-n diagonal matrix with nonnegative real numbers on the diagonal.
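As a rough sketch of this transformation (assuming NumPy; the data layout follows the convention above, one variable per row and one observation per column, and the data and variable names here are purely illustrative):

import numpy as np

# Illustrative data: m = 3 variables, n = 10 observations (one observation per column)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 10))
X = X - X.mean(axis=1, keepdims=True)      # enforce zero empirical mean for each variable

# Singular value decomposition X = W Sigma V^T
W, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# PCA transformation: Y^T = X^T W = V Sigma^T (each row of Y_T holds the component scores of one observation)
Y_T = X.T @ W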

Given a set of points in Euclidean space,

1. The first principal component (the eigenvector with the largest eigenvalue)
corresponds to a line that passes through the mean and minimizes the sum of
squared distances from the points to the line.

2. The second principal component corresponds to the same concept after all
correlation with the first principal component has been subtracted out from the
points.

Each eigenvalue is proportional to the portion of the "variance" that is correlated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean.

PCA essentially rotates the set of points around their mean in order to align with the first few principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. PCA is often used in this manner for dimensionality reduction.

However, nonlinear dimensionality reduction techniques tend to be more computationally demanding than PCA.

PCA is sensitive to the scaling of the variables.

BASIC MATHEMATICS INVOLVED:

Mean subtraction (i.e. "mean centering") is necessary for performing PCA to ensure that the
first principal component describes the direction of maximum variance. If mean subtraction is
not performed, the first principal component might instead correspond more or less to the
mean of the data. A mean of zero is needed for finding a basis that minimizes the mean
square error of the approximation of the data.

Assuming zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), the first principal component w_1 of a data set X can be defined as

w_1 = arg max_{||w|| = 1} E{ (w^T X)^2 }.

With the first k − 1 components, the kth component can be found by subtracting the first k − 1 principal components from X,

X̂_{k-1} = X - Σ_{i=1}^{k-1} w_i w_i^T X,

and by substituting this as the new data set in which to find the next principal component,

w_k = arg max_{||w|| = 1} E{ (w^T X̂_{k-1})^2 }.
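The iterative definition above can be turned into a small sketch (assuming NumPy; pca_by_deflation and the other names are illustrative, not from the paper):

import numpy as np

def pca_by_deflation(X, num_components):
    # X is an m x n mean-centered data matrix, one observation per column.
    # Each w_k maximizes E{(w^T X)^2} on the deflated data, i.e. it is the top
    # eigenvector of the current X X^T; the found component is then subtracted out.
    components = []
    X_hat = X.copy()
    for _ in range(num_components):
        eigvals, eigvecs = np.linalg.eigh(X_hat @ X_hat.T)
        w = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue
        components.append(w)
        X_hat = X_hat - np.outer(w, w) @ X_hat  # remove the variance along w
    return np.column_stack(components)          # m x num_components matrix of components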

The Karhunen–Loève transform is therefore equivalent to finding the singular value decomposition of the data matrix X,

X = W Σ V^T,

and then obtaining the reduced-space data matrix Y by projecting X down into the reduced space defined by only the first L singular vectors, W_L:

Y = W_L^T X = Σ_L V_L^T.

The matrix W of singular vectors of X is equivalently the matrix W of eigenvectors of the matrix of observed covariances C = X X^T,

X X^T = W Σ Σ^T W^T.
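This equivalence can be checked numerically with a short sketch (assuming NumPy; the data and variable names are illustrative):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50))                            # 4 variables, 50 observations
X = X - X.mean(axis=1, keepdims=True)                   # mean-centred data matrix

W, sigma, Vt = np.linalg.svd(X, full_matrices=False)    # X = W Sigma V^T
eigvals, V_C = np.linalg.eigh(X @ X.T)                  # eigen-decomposition of C = X X^T
eigvals, V_C = eigvals[::-1], V_C[:, ::-1]              # eigh sorts ascending; reverse

print(np.allclose(sigma**2, eigvals))                   # True: squared singular values = eigenvalues of C
print(np.allclose(np.abs(W), np.abs(V_C)))              # True (columns agree up to sign)

L = 2
Y = W[:, :L].T @ X                                      # reduced-space data matrix, L x n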

TABLE OF SYMBOLS AND ABBREVIATIONS:

Symbol   Meaning                                                            Dimensions
X        data matrix, consisting of the set of all data vectors,            m × n
         one vector per column
n        the number of column vectors in the data set                       scalar
m        the number of elements in each column vector (dimension)           scalar
L        the number of dimensions in the dimensionally reduced subspace     scalar
         vector of empirical means, one mean for each row m of the          m × 1
         data matrix
         vector of empirical standard deviations, one standard deviation    m × 1
         for each row m of the data matrix
         vector of all 1's                                                  1 × n
         deviations from the mean of each row m of the data matrix          m × n
         z-scores, computed using the mean and standard deviation for       m × n
         each row m of the data matrix
C        covariance matrix                                                  m × m
         correlation matrix                                                 m × m
V        matrix consisting of the set of all eigenvectors of C,             m × m
         one eigenvector per column
D        diagonal matrix consisting of the set of all eigenvalues of C      m × m
         along its principal diagonal, and 0 for all other elements
W        matrix of basis vectors, one vector per column, where each         m × L
         basis vector is one of the eigenvectors of C, and where the
         vectors in W are a subset of those in V
Y        matrix consisting of n column vectors, where each vector is        L × n
         the projection of the corresponding data vector from matrix X
         onto the basis vectors contained in the columns of matrix W

ASSUMPTIONS:
The derivation of PCA is based on the following assumptions:

 Assumption on linearity-

We assume the observed data set to be a linear combination of a certain basis. Non-linear methods such as kernel PCA have been developed without assuming linearity.

 Interval level measurement-

All analyzed variables should be assessed on an interval or ratio level of measurement.

 Random Sampling-

Each subject will contribute one score on each observed variable. These sets of
scores should represent a random sample drawn from the population of interest.

 Normal Distributions-

Each observed variable should be normally distributed. Variables that demonstrate marked skewness or kurtosis may be transformed to better approximate normality.

 Bivariate Normal Distribution-

Each pair of observed variables should display a bivariate normal distribution; i.e., they should form an elliptical scattergram when plotted.

METHODS OF COMPUTING PCA:


 Step I: Get the Data

We take any dataset. In this example we take a two-dimensional dataset; this is done so that plotting may be demonstrated.

 Step II: Subtract the mean

For PCA to work properly, we have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the x values have x̄ (the mean of the x values of all the data points) subtracted, and all the y values have ȳ subtracted from them. This produces a data set whose mean is zero.

Data                DataAdjust
  x      y            x        y
 2.5    2.4          0.69     0.49
 0.5    0.7         -1.31    -1.21
 2.2    2.9          0.39     0.99
 1.9    2.2          0.09     0.29
 3.1    3.0          1.29     1.09
 2.3    2.7          0.49     0.79
 2.0    1.6          0.19    -0.31
 1.0    1.1         -0.81    -0.81
 1.5    1.6         -0.31    -0.31
 1.1    0.9         -0.71    -1.01

Fig: Plot of the data
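This step can be reproduced with a few lines (a sketch assuming NumPy; the array holds the dataset tabulated above, one (x, y) point per row):

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

mean = data.mean(axis=0)          # [1.81, 1.91], the means of x and y
data_adjust = data - mean         # DataAdjust: both columns now have zero mean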

 Step III: Calculate the Covariance matrix

Covariance can be calculated by the formula

c_{i,j} = cov(Dim_i, Dim_j),

where C^(n x n) = (c_{i,j}) is a matrix with n rows and n columns, and Dim_x is the xth dimension. The covariance matrix for 2 dimensions can hence be found out by

C = ( cov(x, x)   cov(x, y)
      cov(y, x)   cov(y, y) )

The covariance matrix for the above dataset would be

C = ( 0.6166   0.6154
      0.6154   0.7166 )
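The same matrix can be obtained directly with NumPy (a sketch; np.cov with rowvar=False treats each column as a variable and uses the n − 1 denominator, which matches the values above):

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

C = np.cov(data, rowvar=False)
# C is approximately [[0.6166, 0.6154],
#                     [0.6154, 0.7166]]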


 Step IV: Calculate the eigenvectors and eigenvalues of the covariance matrix

Compute the matrix V of eigenvectors which diagonalizes the covariance matrix C,

V^(-1) C V = D,

where D is the diagonal matrix of eigenvalues of C.

By this process of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterise the data. The rest of the steps involve transforming the data so that it is expressed in terms of these lines.
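In NumPy this step is a single call (a sketch; np.linalg.eigh is appropriate here because the covariance matrix is symmetric, and it returns the eigenvalues in ascending order):

import numpy as np

C = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])

eigenvalues, V = np.linalg.eigh(C)
# eigenvalues are approximately [0.049, 1.284]
# the columns of V are the corresponding unit-length eigenvectors of C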

 Step V: Choosing components and forming a feature vector

Here is where the notion of data compression and reduced dimensionality comes
into it. Once eigenvectors are found from the covariance matrix, the next step is to
order them by eigenvalue, highest to lowest. This gives us the components in order
of significance. We can also decide to ignore the components of lesser significance.
The feature vector is constructed by taking the eigenvectors that we want to keep
from the list of eigenvectors, and forming a matrix with these eigenvectors in the
columns.

In our example dataset, since we have 2 eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors (ordered by eigenvalue, highest first),

FeatureVector = ( 0.6779   -0.7352
                  0.7352    0.6779 )

or we can choose to leave out the smaller, less significant component and only have a single column,

FeatureVector = ( 0.6779
                  0.7352 )

Fig: A plot of the normalised data (mean subtracted) with the eigenvectors of the covariance matrix overlayed on top.
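A sketch of this selection step (assuming NumPy; names such as feature_vector are illustrative, not from the original):

import numpy as np

C = np.array([[0.6166, 0.6154],
              [0.6154, 0.7166]])
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Order the eigenpairs from highest to lowest eigenvalue (order of significance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 1                                   # keep only the most significant component
feature_vector = eigenvectors[:, :k]    # 2 x k matrix with the kept eigenvectors as columns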

 Step VI: Deriving the new Dataset

Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the vector and multiply it on the left of the original data set, transposed:

FinalData = RowFeatureVector × RowDataAdjust,

where RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top, and RowDataAdjust is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. FinalData is the final data set, with data items in columns and dimensions along rows.
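The whole derivation of the new dataset can be collected into one short sketch (assuming NumPy; it reproduces Steps II to VI on the example data, and the variable names mirror the text where possible):

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

row_data_adjust = (data - data.mean(axis=0)).T       # dimensions in rows, data items in columns
C = np.cov(row_data_adjust)                          # 2 x 2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]                # most significant eigenvector first
row_feature_vector = eigenvectors[:, order].T        # eigenvectors in the rows

final_data = row_feature_vector @ row_data_adjust    # FinalData: data items in columns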

RELATION BETWEEN PCA AND K-MEANS CLUSTERING:

It has been shown recently (2007) [12][13] that the relaxed solution of K-means clustering, specified by the cluster indicators, is given by the PCA principal components, and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix. Thus PCA automatically projects to the subspace where the global solution of K-means clustering lies, and thereby facilitates K-means clustering in finding near-optimal solutions.
CONCLUSION:
Principal component analysis is a powerful tool for reducing a number of observed variables
into a smaller number of artificial variables that account for most of the variance in the
dataset. It is particularly useful when we need a data reduction procedure that makes no
assumptions concerning an underlying causal structure that is responsible for covariation in
the data. When it is possible to postulate the existence of such an underlying causal
structure, it may be more appropriate to analyze the data using exploratory factor analysis.

Both PCA and factor analysis are often used to construct multiple-item scales from the items that constitute questionnaires. Regardless of which method is used, once these scales have been developed it is often desirable to assess their reliability by computing coefficient alpha: an index of internal consistency reliability.
