Principal Component Analysis: Term Paper For Data Mining & Data Warehousing
Guided by:
Asst. Prof.
Dept. of CSE
Submitted by:
Principal component analysis (PCA) is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components. PCA is sensitive to the relative scaling of the original variables.
PCA was invented in 1901 by Karl Pearson. It is now mostly used as a tool in exploratory data analysis and for building predictive models. PCA can be carried out by eigenvalue decomposition of a data covariance matrix or by singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores (the transformed variable values corresponding to a particular case in the data) and loadings (the weight by which each original variable is multiplied to obtain the component score).
PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation
can be thought of as revealing the internal structure of the data in a way which best explains
the variance in the data. If a multivariate dataset is visualised as a set of coordinates in a
high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-
dimensional picture, a "shadow" of this object when viewed from its (in some sense) most
informative viewpoint. This is done by using only the first few principal components so that
the dimensionality of the transformed data is reduced.
PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix.
OBJECTIVES OF PRINCIPAL COMPONENTS ANALYSIS:
1. To discover or reduce the dimensionality of the data set.
2. To identify new, meaningful underlying variables.
For a data matrix X^T with zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), where each of the n rows represents a different repetition of the experiment and each of the m columns gives a particular kind of datum, the PCA transformation is given by

    Y^T = X^T W = V Σ^T

where the matrices W, Σ, and V are given by a singular value decomposition (SVD) of X, X = W Σ V^T. Σ is an m-by-n diagonal matrix with nonnegative real numbers on the diagonal.
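As an illustration of this SVD formulation, here is a minimal numpy sketch. The data, seed, and variable names are assumptions made purely for illustration; the sketch follows the convention above that X holds one row per variable and one column per observation.

    import numpy as np

    # Hypothetical data: m = 3 variables (rows), n = 100 observations (columns).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(3, 100))

    X = X - X.mean(axis=1, keepdims=True)    # mean-centre each variable

    # SVD: X = W Σ V^T; the columns of W are the principal directions.
    W, s, Vt = np.linalg.svd(X, full_matrices=False)

    # PCA transformation: Y = W^T X (equivalently Y^T = X^T W = V Σ^T).
    Y = W.T @ X

    # The component scores are uncorrelated; their variances are s^2 / (n - 1).
    print(np.allclose(np.cov(Y), np.diag(s**2 / (X.shape[1] - 1))))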
1. The first principal component (the eigenvector with the largest eigenvalue) corresponds to a line that passes through the mean and minimizes the sum of squared errors with those points (see the sketch after this list).
2. The second principal component corresponds to the same concept after all correlation with the first principal component has been subtracted out from the points.
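The first of these properties can be checked numerically. The sketch below uses assumed random data (none of the numbers come from this paper) and compares the squared error of the line along the first principal direction against many randomly chosen directions through the mean.

    import numpy as np

    # Assumed 2-D data with correlated coordinates.
    rng = np.random.default_rng(1)
    pts = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=200)
    centred = pts - pts.mean(axis=0)

    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    w1 = eigvecs[:, -1]                      # eigenvector with the largest eigenvalue

    def sq_error(direction):
        """Sum of squared perpendicular distances to the line through the mean."""
        d = direction / np.linalg.norm(direction)
        residual = centred - np.outer(centred @ d, d)
        return np.sum(residual ** 2)

    # The first principal direction beats (or ties) every random candidate direction.
    candidates = rng.normal(size=(1000, 2))
    print(all(sq_error(w1) <= sq_error(c) + 1e-9 for c in candidates))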
Each eigenvalue is proportional to the portion of the "variance" that is correlated with each eigenvector. The sum of all the eigenvalues is equal to the sum of the squared distances of the points from their multidimensional mean.
PCA essentially rotates the set of points around their mean in order to align with the first few principal components. This moves as much of the variance as possible (using a linear transformation) into the first few dimensions. The values in the remaining dimensions therefore tend to be small and may be dropped with minimal loss of information. PCA is often used in this manner for dimensionality reduction.
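A short sketch of this use of PCA for dimensionality reduction, on assumed synthetic data that has an approximately two-dimensional structure embedded in ten dimensions:

    import numpy as np

    # Assumed data: 500 points in 10 dimensions, generated from 2 latent factors plus noise.
    rng = np.random.default_rng(2)
    latent = rng.normal(size=(500, 2))
    X = latent @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))

    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    kept = 2
    scores = Xc @ eigvecs[:, :kept]          # coordinates along the first two principal axes
    reconstruction = scores @ eigvecs[:, :kept].T
    rel_error = np.linalg.norm(Xc - reconstruction) / np.linalg.norm(Xc)

    print(f"variance captured by first {kept} components: {eigvals[:kept].sum() / eigvals.sum():.4f}")
    print(f"relative reconstruction error after dropping 8 of 10 dimensions: {rel_error:.4f}")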
Mean subtraction (i.e. "mean centering") is necessary for performing PCA to ensure that the
first principal component describes the direction of maximum variance. If mean subtraction is
not performed, the first principal component might instead correspond more or less to the
mean of the data. A mean of zero is needed for finding a basis that minimizes the mean
square error of the approximation of the data.
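The effect of skipping mean subtraction can be seen in a small sketch (the data set is assumed for illustration): without centring, the leading direction found from the raw data points roughly towards the data mean, whereas after centring it recovers the true direction of maximum variance.

    import numpy as np

    # Assumed data centred far from the origin: mean ≈ (10, 10), most variance along x.
    rng = np.random.default_rng(3)
    X = rng.normal(loc=[10.0, 10.0], scale=[1.0, 0.2], size=(1000, 2))

    def first_direction(A):
        """Leading right-singular vector of A (leading eigenvector of A^T A)."""
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        return Vt[0]

    mean_dir = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
    print(np.abs(first_direction(X) @ mean_dir))                                # ~1: aligned with the mean
    print(np.abs(first_direction(X - X.mean(axis=0)) @ np.array([1.0, 0.0])))   # ~1: true max-variance axis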
Assuming zero empirical mean (the empirical mean of the distribution has been subtracted from the data set), the first principal component w_1 of a data set X can be defined as

    w_1 = arg max_{||w|| = 1} Var(w^T x) = arg max_{||w|| = 1} E{ (w^T x)^2 }

With the first k − 1 components, the kth component can be found by subtracting the first k − 1 principal components from X,

    X_(k−1) = X − ∑_{i=1}^{k−1} w_i w_i^T X

and by substituting this as the new data set to find a principal component in

    w_k = arg max_{||w|| = 1} E{ (w^T X_(k−1))^2 }

The reduced-space data matrix Y is then obtained by projecting X down into the reduced space defined by only the first L singular vectors, W_L:

    Y = W_L^T X = Σ_L V_L^T
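This greedy definition translates directly into code. The sketch below, on assumed random data, extracts each component as the direction of maximum variance, subtracts (deflates) it from the data, and checks that the resulting directions agree with the eigenvectors of X X^T.

    import numpy as np

    # Assumed data: 5 variables (rows), 200 observations (columns), mean-centred.
    rng = np.random.default_rng(4)
    X = rng.normal(size=(5, 200))
    X = X - X.mean(axis=1, keepdims=True)

    def leading_direction(A):
        """Unit vector w maximising E[(w^T a)^2]: the leading eigenvector of A A^T."""
        vals, vecs = np.linalg.eigh(A @ A.T)
        return vecs[:, -1]

    components = []
    X_hat = X.copy()
    for _ in range(X.shape[0]):
        w = leading_direction(X_hat)
        components.append(w)
        X_hat = X_hat - np.outer(w, w) @ X_hat    # remove the component just found

    W_greedy = np.column_stack(components)
    W_eig = np.linalg.eigh(X @ X.T)[1][:, ::-1]   # same directions, ordered by decreasing eigenvalue
    print(np.allclose(np.abs(W_greedy.T @ W_eig), np.eye(X.shape[0]), atol=1e-6))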
The matrix W of singular vectors of X is equivalently the matrix W of eigenvectors of the matrix of observed covariances C = X X^T, since

    X X^T = W Σ Σ^T W^T

PCA can therefore be carried out as an eigendecomposition of the covariance matrix of the data; if each variable is first standardized to unit variance, this amounts to an eigendecomposition of the correlation matrix instead.
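A short numerical check of this equivalence, and of the covariance-versus-correlation distinction, on assumed data with deliberately unequal variable scales:

    import numpy as np

    # Assumed data: 4 variables (rows) with very different scales, 300 observations.
    rng = np.random.default_rng(5)
    X = rng.normal(size=(4, 300)) * np.array([[1.0], [5.0], [0.5], [2.0]])
    X = X - X.mean(axis=1, keepdims=True)

    # Left singular vectors of X coincide with the eigenvectors of X X^T.
    W_svd, _, _ = np.linalg.svd(X, full_matrices=False)
    _, W_eig = np.linalg.eigh(X @ X.T)
    print(np.allclose(np.abs(W_svd.T @ W_eig[:, ::-1]), np.eye(4), atol=1e-6))

    # Standardising each variable first means diagonalising the correlation matrix.
    Z = X / X.std(axis=1, ddof=1, keepdims=True)
    print(np.allclose(np.cov(Z), np.corrcoef(X)))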
ASSUMPTIONS:
Derivation of PCA is based on the following assumptions:
Assumption on linearity-
We assume the observed data set to be a linear combination of a certain basis. Non-linear methods such as kernel PCA have been developed without assuming linearity.
Random Sampling-
Each subject will contribute one score on each observed variable. These sets of
scores should represent a random sample drawn from the population of interest.
Normal Distributions-
Each pair of observed variables should display a bivariate normal distribution; e.g.,
they should form an elliptical scattergram when plotted.
We take any data set; in this example we take a two-dimensional data set so that plotting may be demonstrated.
For PCA to work properly, we have to subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension. So, all the x values have the mean of the x values of all the data points subtracted from them, and all the y values have the mean of the y values subtracted from them. This produces a data set whose mean is zero.
      Data              DataAdjust
    x      y            x       y
   2.5    2.4          0.69    0.49
   0.5    0.7         -1.31   -1.21
   2.2    2.9          0.39    0.99
   1.9    2.2          0.09    0.29
   3.1    3.0          1.29    1.09
   2.3    2.7          0.49    0.79
   2.0    1.6          0.19   -0.31
   1.0    1.1         -0.81   -0.81
   1.5    1.6         -0.31   -0.31
   1.1    0.9         -0.71   -1.01
Fig: Plot of the data
The covariance matrix is given by

    C^{n×n} = ( c_{i,j} : c_{i,j} = cov(Dim_i, Dim_j) )

where C^{n×n} is a matrix with n rows and n columns, and Dim_x is the xth dimension. The covariance matrix for the two-dimensional data set above can hence be found to be

    C = | 0.616555556   0.615444444 |
        | 0.615444444   0.716555556 |
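This matrix can be reproduced in a few lines of numpy from the data values tabulated above:

    import numpy as np

    # The ten (x, y) points of the worked example.
    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

    data_adjust = data - data.mean(axis=0)    # subtract the per-dimension mean
    C = np.cov(data_adjust, rowvar=False)     # unbiased (n - 1) covariance matrix
    print(C)
    # [[0.61655556 0.61544444]
    #  [0.61544444 0.71655556]]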
By this process of taking the eigenvectors of the covariance matrix, we have been able to extract lines that characterise the data. The rest of the steps involve transforming the data so that it is expressed in terms of those lines.
Here is where the notion of data compression and reduced dimensionality comes
into it. Once eigenvectors are found from the covariance matrix, the next step is to
order them by eigenvalue, highest to lowest. This gives us the components in order
of significance. We can also decide to ignore the components of lesser significance.
The feature vector is constructed by taking the eigenvectors that we want to keep
from the list of eigenvectors, and forming a matrix with these eigenvectors in the
columns.
In our example data set, since we have two eigenvectors, we have two choices. We can either form a feature vector with both of the eigenvectors, or we can choose to leave out the smaller, less significant component and keep only a single column, as computed in the sketch below.
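The sketch below carries out this step on the covariance matrix computed above: it orders the eigenvectors by decreasing eigenvalue and forms either the full feature vector or the single-column version (the variable names are illustrative only).

    import numpy as np

    C = np.array([[0.61655556, 0.61544444],
                  [0.61544444, 0.71655556]])

    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # most significant component first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    feature_vector_both = eigvecs             # keep both components (no reduction)
    feature_vector_one = eigvecs[:, [0]]      # keep only the most significant component

    print(eigvals)                            # roughly [1.284, 0.049]: the first component dominates
    print(feature_vector_one)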
Fig: A plot of the normalised data (mean subtracted) with the eigenvectors of the covariance matrix overlaid on top.
Once we have chosen the components (eigenvectors) that we wish to keep in our
data and formed a feature vector, we simply take the transpose of the vector and
multiply it on the left of the original data set, transposed.
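Putting the worked example together, a minimal sketch of this final multiplication (variable names are illustrative, not prescribed by the paper):

    import numpy as np

    data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                     [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
    data_adjust = data - data.mean(axis=0)

    eigvals, eigvecs = np.linalg.eigh(np.cov(data_adjust, rowvar=False))
    feature_vector = eigvecs[:, ::-1]         # both eigenvectors, most significant first

    # FinalData = FeatureVector^T x DataAdjust^T: one row per component, one column per point.
    final_data = feature_vector.T @ data_adjust.T
    print(final_data.T)                       # each original point expressed in the new basis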
Both PCA and factor analysis are often used to construct multiple-item scales from the items that constitute questionnaires. Regardless of which method is used, once these scales have been developed it is often desirable to assess their reliability by computing coefficient alpha: an index of internal consistency reliability.
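As a rough illustration of coefficient alpha (the item scores below are hypothetical, not from the paper), one common formulation is alpha = k/(k − 1) · (1 − sum of item variances / variance of the total score):

    import numpy as np

    def cronbach_alpha(items: np.ndarray) -> float:
        """items: one row per respondent, one column per questionnaire item."""
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Hypothetical questionnaire: five noisy items measuring one shared trait.
    rng = np.random.default_rng(6)
    trait = rng.normal(size=(100, 1))
    items = trait + 0.5 * rng.normal(size=(100, 5))
    print(round(cronbach_alpha(items), 2))    # fairly high internal consistency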