Principal Component Analysis
What is PCA?
• Principal component analysis, or PCA, is a dimensionality reduction method that is
often used to reduce the dimensionality of large data sets, by transforming a large
set of variables into a smaller one that still contains most of the information in the
large set.
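• PCA reduces the number of variables in a data set while preserving as much
information as possible.
• Goal: identify the directions in the data that carry the most information. Strictly, this is feature extraction (new features built from the original ones) rather than feature selection.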
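The idea of shrinking a data set while keeping most of its information can be illustrated with a minimal NumPy sketch. The data here is synthetic and hypothetical (5 features generated from 2 hidden factors plus a little noise), and the variable names are chosen only for illustration. The reduction uses the SVD of the centred data, which is one common way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 200 samples, 5 features driven by only 2 hidden factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Centre each feature, then take the SVD of the centred data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                      # number of components to keep
Z = Xc @ Vt[:k].T          # reduced data, shape (200, 2)

# Fraction of the total variance retained by the first k components.
explained = s**2 / np.sum(s**2)
retained = explained[:k].sum()
print(Z.shape, float(retained))
```

Because the five features in this sketch are built from only two underlying factors, two components keep almost all of the variance. Real data sets rarely compress this cleanly.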
Why PCA?
• PCA is extremely useful when working with data sets that have a lot of features.
• Problems: long model training time
• Difficulty in visualization
• Difficulty in interpreting the model
• Curse of dimensionality: reduced accuracy of machine learning models
• Overfitting
• Increased computation time
• Finding the time to read a 1000-page book is a luxury that few can afford. Wouldn’t it be nice if we could summarize
the most important points in just 2 or 3 pages so that the information is easily digestible even by the busiest person?
We may lose some information in the process, but at least we get the big picture. The same applies to PCA.
• How does PCA know which features are the most important?
• Can PCA tell which part of the data matters most?
• Can we mathematically quantify the amount of information embedded in the data?
• Variance can: the greater the variance, the more the information.
PCA
• Common applications
• Image processing and genome research routinely deal with thousands, if not tens of thousands, of columns.
How does PCA work?
• The greater the variance, the more the information
• Variance: measures the average squared distance of each point from the mean
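Variance is usually computed as the mean of the squared deviations from the mean. The short sketch below applies that definition to two groups of heights. The first group uses the 185 cm, 160 cm and 145 cm heights from the example in this section; the second group of nearly identical heights is hypothetical, and the helper name `variance` is chosen only for illustration.

```python
import numpy as np

def variance(values):
    """Population variance: mean of squared deviations from the mean."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - values.mean()) ** 2)

spread_out = [185, 160, 145]   # heights in cm, from the example in the text
similar = [171, 172, 173]      # hypothetical, nearly identical heights

print(variance(spread_out))    # large: the heights tell the people apart
print(variance(similar))       # small: height alone says little
```

The group with widely spread heights has a much larger variance, which matches the intuition that height is informative only when it differs between people.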
Can you guess who’s who? It’s tough when they are very similar in height.
Earlier, we had no trouble differentiating a 185cm person from a 160cm and 145cm person because their heights differed widely.
How is variance associated with the amount of information?
• In Principal Component Analysis, it is assumed that the information is carried in the variance of
the features, that is, the higher the variation in a feature, the more information that feature
carries.
• Therefore, PCA keeps the directions of highest variance when reducing dimensionality.
• PCA formally defined:
• Principal Component Analysis (PCA) is a technique for dimensionality reduction that identifies a set of orthogonal
axes, called principal components, that capture the maximum variance in the data. The principal components are
linear combinations of the original variables in the dataset and are ordered in decreasing order of importance. The
total variance captured by all the principal components is equal to the total variance in the original dataset.
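The three properties in this definition (orthogonal axes, components ordered by variance, total variance preserved) can be checked numerically. The following is a minimal sketch on hypothetical random data, using an eigendecomposition of the sample covariance matrix. It is one standard way to compute the components, not necessarily the one used elsewhere in these slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 300 samples of 3 correlated features.
A = rng.normal(size=(3, 3))
X = rng.normal(size=(300, 3)) @ A

Xc = X - X.mean(axis=0)              # centre the data
C = np.cov(Xc, rowvar=False)         # 3x3 sample covariance matrix

# eigh is meant for symmetric matrices; eigenvalues come back ascending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]    # re-order: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                # columns are the principal components

# 1. The principal axes are orthonormal.
axes_orthonormal = np.allclose(eigvecs.T @ eigvecs, np.eye(3))
# 2. The variance along each axis equals its eigenvalue.
variances_match = np.allclose(scores.var(axis=0, ddof=1), eigvals)
# 3. The components together carry the total variance of the original data.
total_preserved = np.isclose(eigvals.sum(), X.var(axis=0, ddof=1).sum())

print(axes_orthonormal, variances_match, total_preserved)
```

Each column of `eigvecs` holds the weights of one linear combination of the original variables, which is what "linear combinations of the original variables" means in practice.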
How is variance associated with the amount of information? Continued…
• Round two: let’s guess who a person is based on both height and weight
Taking t as 1, where t is some real number
PCA Solved
Given the data in the table, reduce the dimension from 2 to 1 using the Principal Component Analysis
(PCA) algorithm.
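The usual steps for this kind of exercise (centre the data, form the covariance matrix, find the eigenvector of the largest eigenvalue, project) can be sketched in NumPy. The data values below are hypothetical and are NOT the values from the table, which is not reproduced here, so the numbers this sketch prints will not match the worked solution.

```python
import numpy as np

# Hypothetical 2-feature data (NOT the values from the table in the text).
X = np.array([[2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 7.0],
              [7.0, 8.0]])

# Step 1: subtract the mean of each feature.
Xc = X - X.mean(axis=0)

# Step 2: sample covariance matrix (divides by n - 1).
C = np.cov(Xc, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the symmetric matrix C.
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order

# Step 4: unit eigenvector belonging to the largest eigenvalue (lambda_1).
lambda1 = eigvals[-1]
e1 = eigvecs[:, -1]

# Step 5: project each centred sample onto e1 to get the 1-D data.
z = Xc @ e1

print(lambda1, e1)
print(z)
```

`np.linalg.eigh` already returns eigenvectors of unit length, so no separate normalisation step is needed here. The sign of `e1` is arbitrary: flipping it flips the sign of every projected value without changing the variance captured.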
Therefore, a unit eigenvector corresponding to λ1 is