Advanced Data Analysis Techniques 2
Data Analysis Techniques (Part 2)
Overview
DIMENSIONALITY REDUCTION
CLUSTER ANALYSIS
Cluster Analysis
What is Clustering?
Centers are calculated
These become the new centroids
This is repeated for all points
Steps for K-Means Clustering
6. Repeat steps 4 and 5
Sensitivity to initialization
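The loop described above (assign points, calculate centers, use them as the new centroids, repeat) can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `k_means`, its parameters, and the four toy points are assumptions made for this example. Running it from two different starting positions shows how the result can depend on initialization.

```python
import numpy as np

def k_means(points, initial_centroids, max_iter=100):
    """Plain k-means from caller-supplied starting centroids (illustrative sketch)."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Calculate each cluster's center; these become the new centroids.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Final assignment and within-cluster sum of squared distances.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = float(dists[np.arange(len(points)), labels].sum())
    return labels, centroids, inertia

# Hypothetical toy data: the corners of a 4 x 1 rectangle.
points = [[0, 0], [0, 1], [4, 0], [4, 1]]

# Start 1: one centroid near each short side.
good_labels, good_centroids, good_inertia = k_means(points, [[0, 0.5], [4, 0.5]])
# Start 2: both centroids in the middle, one low and one high.
bad_labels, bad_centroids, bad_inertia = k_means(points, [[2, 0], [2, 1]])

print(good_labels, good_inertia)
print(bad_labels, bad_inertia)
```

On this toy data the first start groups the left and right pairs, while the second converges to a top/bottom split with a much larger within-cluster sum of squares. A common mitigation is to run the algorithm from several starting positions and keep the run with the lowest value.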
Dimensionality reduction is crucial for dealing with high-dimensional data, which can suffer
from the "curse of dimensionality."
Importance of Dimensionality Reduction
Improved Model Performance
High-dimensional data can lead to overfitting, where models perform well on training data but
poorly on new, unseen data. Dimensionality reduction can help prevent overfitting by reducing
noise and irrelevant information.
Reduced Computational Cost
Fewer features mean faster training and prediction times for machine learning models.
What is Principal Component Analysis?
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a
dataset while preserving as much of the variation in the data as possible.
It does this by identifying a new set of variables, called principal components, which are linear
combinations of the original variables.
          A   B   C   D   E   F
Maths    76  55  97  45  89  45
English  34  67  55  67  76  46
Art      67  45  89  44  66  44
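As a rough illustration of "linear combinations of the original variables", the sketch below runs PCA by hand on the scores in the table above. It assumes one common formulation (eigen-decomposition of the sample covariance matrix of the centered data); the variable names are chosen for this example only.

```python
import numpy as np

# Scores from the table: one row per student (A to F),
# one column per subject (Maths, English, Art).
scores = np.array([
    [76, 34, 67],
    [55, 67, 45],
    [97, 55, 89],
    [45, 67, 44],
    [89, 76, 66],
    [45, 46, 44],
], dtype=float)

# Center each subject on its mean so the components describe variation only.
centered = scores - scores.mean(axis=0)

# Sample covariance between subjects (3 x 3).
cov = np.cov(centered, rowvar=False)

# eigh suits symmetric matrices; it returns eigenvalues in ascending order,
# so reorder to put the largest-variance direction first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each column of `eigenvectors` holds the weights of one principal component,
# i.e. one linear combination of Maths, English and Art.
components = centered @ eigenvectors

# Share of the total variance captured by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

print("weights of first component:", eigenvectors[:, 0])
print("explained variance ratio:", explained_ratio)
```

The signs of the weights are arbitrary (an eigenvector and its negation describe the same direction), so only their relative sizes and signs within a component are meaningful.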
[Figure: scores for students A to F plotted on axes labelled Maths and ArtEng]
Example
Choose Principal Components
Select the top k eigenvectors, where k is the desired number of dimensions.
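The selection step can be sketched as follows. The helper name `project_onto_top_k` and the random demo data are hypothetical, and the sketch assumes the eigenvectors come from the covariance matrix of the centered data, ranked by eigenvalue.

```python
import numpy as np

def project_onto_top_k(data, k):
    """Keep the k eigenvectors with the largest eigenvalues and project onto them."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    # Rank directions by variance (eigenvalue), largest first, and keep k of them.
    top_k = np.argsort(eigenvalues)[::-1][:k]
    basis = eigenvectors[:, top_k]   # shape: (n_features, k)
    reduced = centered @ basis       # shape: (n_samples, k)
    return reduced, basis, mean

# Hypothetical demo data: 50 samples with 5 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))

reduced, basis, mean = project_onto_top_k(data, k=2)
print(reduced.shape)  # 5 features reduced to 2 dimensions

# Approximate reconstruction in the original feature space.
approx = reduced @ basis.T + mean
```

Choosing k is a trade-off: a smaller k gives a more compact representation, while a larger k preserves more of the variance. Keeping every eigenvector reproduces the original data exactly, up to floating-point error.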
Data Compression
Data Visualization
Noise Reduction
Feature Extraction
Conclusion