
Advanced Data Analysis Techniques (Part 2)
Overview

Cluster Analysis

Dimensionality Reduction
Cluster Analysis
What is Clustering?

Clustering is a statistical method used to group similar data points into clusters based on their characteristics.

It is widely used in exploratory data analysis, machine learning, and pattern recognition.
Key Clustering Concepts

Cluster: A collection of data points that are more similar to each other than to points in other clusters.

Centroid: The center of a cluster, often representing the average or mean of the points in that cluster.

Distance Metric: A mathematical formula to calculate the similarity or dissimilarity between data points (e.g., Euclidean distance, Manhattan distance, cosine similarity).
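As a quick illustration of these distance metrics, here is a minimal sketch using NumPy and SciPy; the two sample points are arbitrary:

```python
# Minimal sketch of the three distance metrics named above,
# computed on two arbitrary sample points.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))  # straight-line distance
print(distance.cityblock(a, b))  # Manhattan (city-block) distance
print(distance.cosine(a, b))     # cosine distance = 1 - cosine similarity
```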
Types of Clustering

Partition-Based Clustering

Divides the data into non-overlapping subsets.
Example: K-Means clustering.

Hierarchical Clustering

Creates a tree-like structure (dendrogram) showing the nested grouping of data points.
Types:
Agglomerative (bottom-up approach).
Divisive (top-down approach).
Types of Clustering

Density-Based Clustering

Groups data points based on regions of high density separated by low-density regions.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

Model-Based Clustering

Assumes data is generated by a mixture of underlying probability distributions.
Example: Gaussian Mixture Models (GMM).
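As a sketch of how these two families are used in practice, the snippet below runs DBSCAN and a Gaussian Mixture Model from scikit-learn on synthetic data; the parameter values (eps=0.5, min_samples=5, n_components=3) are illustrative assumptions, not values from the slides:

```python
# Illustrative sketch: density-based (DBSCAN) and model-based (GMM)
# clustering on synthetic blob data with scikit-learn.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# DBSCAN: eps and min_samples are assumed example values.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# GMM: assumes the data is a mixture of 3 Gaussians.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

print(np.unique(db_labels))   # -1 marks points DBSCAN treats as noise
print(np.unique(gmm_labels))
```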
K-Means Clustering

K-Means is one of the simplest and most common methods for cluster analysis.

It partitions your data into a given number of clusters: we have a dataset and would like to divide it into k clusters.
Steps for K-Means Clustering

1. Define the number of clusters

The number of clusters is the k in k-means.
Steps for K-Means Clustering

2. Set cluster centers randomly

Initial cluster centroids are defined, usually at random. Each centroid represents a cluster.
Steps for K-Means Clustering

3. Assign points to clusters

The distance from the first point to each of the cluster centroids is measured. The point is then assigned to the cluster with the shortest distance. This is repeated for all points.


Steps for K-Means Clustering

4. Calculate the center of each cluster

The centers of the clusters are calculated; these become the new centroids. This is repeated for all clusters.


Steps for K-Means Clustering

5. Assign points to the new clusters

Each point is assigned to the cluster whose new centroid is closest. This is repeated for all points.
Steps for K-Means Clustering

6. Repeat steps 4 and 5

Until the cluster distribution no longer changes.
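The six steps above translate directly into a short NumPy implementation. This is a minimal sketch (centroids initialized from randomly sampled points, a fixed iteration cap, and no handling of empty clusters), not a production implementation:

```python
# Minimal from-scratch K-Means following the six steps above.
# Assumptions: centroids start at randomly sampled data points,
# a fixed max_iters cap, and empty clusters are not handled.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and set initial centroids randomly.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the center of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids  # Step 5: reassignment happens next pass.
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)
```

In practice, sklearn.cluster.KMeans adds refinements such as k-means++ initialization and multiple restarts.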
Evaluation Metrics
Internal Validation

Silhouette Score: Measures how similar an object is to its own cluster versus other clusters.
Dunn Index: Evaluates compactness and separation of clusters.

External Validation

Rand Index: Compares the clustering result to a ground truth.
Adjusted Mutual Information (AMI): Measures similarity between clustering and ground truth.

Stability Measures: Assess the robustness of clustering by applying the algorithm to different subsets of the data.
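A minimal sketch of computing some of these metrics with scikit-learn; the synthetic data and k=3 are illustrative, and the Dunn index is omitted because scikit-learn has no built-in implementation:

```python
# Illustrative sketch: internal and external validation with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, rand_score, adjusted_mutual_info_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal validation: no ground truth needed.
print("Silhouette:", silhouette_score(X, labels))

# External validation: compares against the known labels y_true.
print("Rand index:", rand_score(y_true, labels))
print("AMI:", adjusted_mutual_info_score(y_true, labels))
```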
Challenges of Clustering

Determining the optimal number of clusters
Sensitivity to initialization
Handling high-dimensional data
Dealing with noise and outliers
Ensuring interpretability of results


Applications of Clustering

Customer segmentation
Image segmentation
Anomaly detection
Document clustering
Dimensionality Reduction

What is Dimensionality Reduction?

Dimensionality reduction is a set of techniques used to reduce the number of input variables (features) in a dataset while preserving as much of the important information as possible.

This is crucial for dealing with high-dimensional data, which can suffer from the "curse of dimensionality."
Importance of Dimensionality Reduction
Improved Model Performance: High-dimensional data can lead to overfitting, where models perform well on training data but poorly on new, unseen data. Dimensionality reduction can help prevent overfitting by reducing noise and irrelevant information.

Reduced Computational Cost: Fewer features mean faster training and prediction times for machine learning models.

Data Visualization: High-dimensional data is difficult to visualize. Dimensionality reduction can project data onto lower-dimensional spaces (e.g., 2D or 3D) for easier visualization and exploration.
Importance of Dimensionality Reduction

Data Storage: Reducing the number of features can significantly reduce storage requirements.

Improved Interpretability: With fewer features, it can be easier to understand the relationships between variables and the underlying patterns in the data.
Types of Dimensionality Reduction

Linear Methods

Principal Component Analysis (PCA):
• Projects data onto a new set of orthogonal axes (principal components) that capture the maximum variance in the data.
• Effective for reducing noise and identifying the most important features.

Linear Discriminant Analysis (LDA):
• Similar to PCA, but focuses on maximizing the separation between classes in supervised learning problems.

Non-linear Methods

t-SNE:
• A powerful non-linear technique that maps high-dimensional data to a low-dimensional space while preserving local similarities.
• Effective for visualizing complex, non-linear structures in data.

Isomap:
• Constructs a low-dimensional embedding that preserves geodesic distances between data points.

Autoencoders:
• Neural networks trained to reconstruct input data, effectively learning a compressed representation of the data.
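The contrast between a linear and a non-linear method can be sketched in a few lines of scikit-learn; the digits dataset and the perplexity value are illustrative choices:

```python
# Illustrative sketch: linear (PCA) vs non-linear (t-SNE) reduction to 2D.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 64-dimensional digit images

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # both (1797, 2)
```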
What is Principal Component Analysis?

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while preserving as much of the variation in the data as possible.

It does this by identifying a new set of variables, called principal components, which are linear combinations of the original variables.

Example scores for six students:

         A   B   C   D   E   F
Maths    76  55  97  45  89  45
English  34  67  55  67  76  46
Art      67  45  89  44  66  44

[Figure: scatter plots of the scores, first Maths vs English, then with Art added, and finally Maths against a combined English/Art axis ("ArtEng").]
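As a concrete illustration, a minimal sketch applying PCA to the student-score table above, with each student as an observation and each subject as a feature; the use of scikit-learn here is an assumption, not something from the slides:

```python
# Illustrative sketch: PCA on the student-score table above.
# Each of the six students is an observation; Maths/English/Art are features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scores = np.array([
    [76, 34, 67],  # A: Maths, English, Art
    [55, 67, 45],  # B
    [97, 55, 89],  # C
    [45, 67, 44],  # D
    [89, 76, 66],  # E
    [45, 46, 44],  # F
])

X = StandardScaler().fit_transform(scores)  # standardize first
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d)                            # students in the new 2D space
print(pca.explained_variance_ratio_)   # variance captured per component
```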
Example

Consider a dataset of customer data with features such as age, income, and spending habits.

PCA can be used to identify the principal components that explain most of the variation in the data.

These principal components can then be used to group customers with similar characteristics, which can be useful for targeted marketing campaigns.
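One way this workflow might look in code, a minimal sketch with synthetic customer data standing in for real age, income, and spending features; all values and parameter choices here are illustrative assumptions:

```python
# Illustrative sketch: PCA followed by K-Means for customer segmentation.
# The synthetic "customers" array stands in for real age/income/spending data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
customers = np.column_stack([
    rng.uniform(18, 70, 200),       # age
    rng.uniform(20e3, 120e3, 200),  # income
    rng.uniform(0, 100, 200),       # spending score
])

X = StandardScaler().fit_transform(customers)
components = PCA(n_components=2).fit_transform(X)
segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(components)
print(np.bincount(segments))  # customers per segment
```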
Steps in Principal Component Analysis

Standardize the Data: Standardize the data to have zero mean and unit variance. This ensures that all variables are on the same scale.

Compute the Covariance Matrix: Calculate the covariance matrix of the standardized data. The covariance matrix measures the relationship between pairs of variables.

Compute Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues represent the amount of variance explained by each principal component.
Steps in Principal Component Analysis (continued)

Sort Eigenvectors: Sort the eigenvectors in descending order of their corresponding eigenvalues.

Choose Principal Components: Select the top k eigenvectors, where k is the desired number of dimensions.

Transform the Data: Project the original data onto the selected principal components.
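These steps map directly onto a short NumPy sketch; note that production implementations such as scikit-learn's PCA typically use the singular value decomposition rather than an explicit eigendecomposition:

```python
# Minimal sketch of PCA following the steps above, using NumPy.
# Assumes no constant (zero-variance) columns in X.
import numpy as np

def pca(X, k):
    # 1. Standardize the data to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the covariance matrix of the standardized data.
    cov = np.cov(X, rowvar=False)
    # 3. Find eigenvectors (components) and eigenvalues (variance explained).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by descending eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]
    # 5. Choose the top k principal components.
    W = eigenvectors[:, :k]
    # 6. Project the data onto the selected components.
    return X @ W

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)  # (100, 2)
```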
Applications of Principal Component Analysis

Data Compression

Data Visualization

Noise Reduction

Feature Extraction
Conclusion

Clustering is a foundational technique of unsupervised learning, the unsupervised counterpart of classification.

High-dimensional data is often noisy and can drag model performance down.

Dimensionality reduction enables better visualization and better model performance.

Principal Component Analysis is a linear method of dimensionality reduction.
