Linear Algebra
Reduce Multicollinearity
SUBMITTED BY
INDEX
Introduction
Problem Statement
Solution
Conclusion
References
INTRODUCTION
PCA finds extensive application in preprocessing data for machine learning algorithms. It
extracts informative features while retaining the most relevant information, thus mitigating the
"curse of dimensionality," where adding features can degrade model performance. By projecting
high-dimensional data into a reduced feature space, PCA addresses issues like multicollinearity
and overfitting. Multicollinearity, which arises from highly correlated independent variables, can
hinder causal modeling, while overfit models generalize poorly to new data.
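As a minimal sketch of the multicollinearity point (using NumPy and scikit-learn, neither of which is prescribed by this report, with synthetic data), the example below builds two highly correlated features and shows that their principal component scores are uncorrelated:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=500)
    x2 = 0.95 * x1 + 0.05 * rng.normal(size=500)      # x2 is nearly a copy of x1
    X = np.column_stack([x1, x2])

    print(np.corrcoef(X, rowvar=False))               # strong off-diagonal correlation
    scores = PCA(n_components=2).fit_transform(X)
    print(np.corrcoef(scores, rowvar=False))          # off-diagonals ~ 0: uncorrelated components

The near-zero off-diagonal entries of the second correlation matrix are exactly the property that makes the components safer inputs for downstream models.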
Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are dimension
reduction techniques, with PCA being versatile for both supervised and unsupervised tasks,
while LDA is primarily used in supervised learning. In unsupervised scenarios, PCA can reduce
dimensions without considering class labels or categories, unlike LDA. Additionally, PCA is
closely related to factor analysis, as both methods aim to minimize information loss while
reducing the number of dimensions or variables in a dataset.
Comparing PCA to K-means clustering, both are unsupervised techniques serving distinct
purposes. PCA transforms a dataset by creating new variables (principal components) as linear
combinations of the original variables, effectively reducing dimensionality. In contrast, K-means
clustering identifies clusters within the data based on the similarity of data points to cluster
centers. PCA is favored for exploratory data analysis and dimensionality reduction, while K-
means clustering is useful for identifying clusters in the data. Each technique plays a vital role in
data analysis, catering to different objectives and complementing each other in various
applications.
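To illustrate how the two unsupervised techniques can complement each other, the sketch below (illustrative only; the choice of scikit-learn, the Iris data, and the parameter values are assumptions of this write-up, not part of the report) first reduces a dataset with PCA and then clusters the reduced scores with K-means:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    X = load_iris().data                                         # 150 samples, 4 original features
    X_2d = PCA(n_components=2).fit_transform(X)                  # reduce to two principal components
    labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)   # cluster in the reduced space

    print(X_2d.shape)      # (150, 2)
    print(labels[:10])     # cluster assignments for the first ten samples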
Principal Component Analysis (PCA) condenses the information within large datasets into a
smaller set of independent variables called principal components. These components, formed as
linear combinations of original variables, capture the maximum variance in the data. PCA
leverages linear algebra and matrix operations to transform the dataset into a new coordinate
system defined by these principal components, derived from eigenvectors and eigenvalues of the
covariance matrix.
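The eigen-decomposition described above can be sketched in a few lines of NumPy (the report itself gives no code, so the data and variable names here are illustrative): center the data, form the covariance matrix, and project onto its eigenvectors sorted by eigenvalue:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))                 # 200 samples, 3 variables (synthetic)

    X_centered = X - X.mean(axis=0)               # center each variable
    cov = np.cov(X_centered, rowvar=False)        # 3 x 3 covariance matrix

    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]             # sort components by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    scores = X_centered @ eigvecs                 # data in the new coordinate system
    print(eigvals / eigvals.sum())                # fraction of variance captured by each component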
Eigenvectors signify the directions of variance in the data, while eigenvalues quantify how much variance lies along each of those directions. Since principal components represent the directions of maximal variance, they align with the eigenvectors of the covariance matrix. The two primary principal components, denoted as PC1 and PC2, capture the highest and second-highest variances, respectively. PC2 is always orthogonal to PC1, ensuring independence between the principal components. If any further principal components are included, they also remain uncorrelated with one another and explain the remaining variation in the data.
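A quick check of these properties, under the same illustrative assumptions (scikit-learn with synthetic data), confirms that PC1 and PC2 are orthogonal and that each successive component explains less variance than the one before it:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(2)
    X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))   # correlated 4-dimensional data

    pca = PCA(n_components=3).fit(X)
    pc1, pc2 = pca.components_[0], pca.components_[1]

    print(np.dot(pc1, pc2))                # ~ 0: PC1 and PC2 are orthogonal
    print(pca.explained_variance_ratio_)   # each component explains less variance than the last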
PC1 defines the direction of highest variance, encapsulating the majority of the information in the original dataset, while PC2 captures the next highest variance. The relationship between PC1 and PC2 is depicted in the scatterplot below, where the two axes are perpendicular.
(Zakaria, 2024)