
Advanced Data Analysis Techniques (Part 2)
Overview

Cluster Analysis

Dimensionality Reduction
Cluster Analysis
What is Clustering?

Clustering is a statistical method used to group similar data points into clusters based on their characteristics.

It is widely used in exploratory data analysis, machine learning, and pattern recognition.
Key Clustering Concepts

Cluster: A collection of data points that are more similar to each other than to points in other clusters.

Centroid: The center of a cluster, often representing the average or mean of the points in that cluster.

Distance Metric: A mathematical formula to calculate the similarity or dissimilarity between data points (e.g., Euclidean distance, Manhattan distance, cosine similarity).
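As a quick illustration of these distance metrics, here is a minimal sketch using NumPy and SciPy; the two sample points are arbitrary:

```python
# Minimal sketch of the three distance metrics named above,
# computed on two arbitrary sample points.
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))  # straight-line distance
print(distance.cityblock(a, b))  # Manhattan (city-block) distance
print(distance.cosine(a, b))     # cosine distance = 1 - cosine similarity
```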
Types of Clustering

Partition-Based Clustering

Divides the data into non-overlapping subsets.
Example: K-Means clustering.

Hierarchical Clustering

Creates a tree-like structure (dendrogram) showing the nested grouping of data points.
Types:
Agglomerative (bottom-up approach).
Divisive (top-down approach).
Types of Clustering

Density-Based Clustering

Groups data points based on regions of high density separated by low-density regions.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise).

Model-Based Clustering

Assumes data is generated by a mixture of underlying probability distributions.
Example: Gaussian Mixture Models (GMM).
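As a sketch of how these two families are used in practice, the snippet below runs DBSCAN and a Gaussian Mixture Model from scikit-learn on synthetic data; the parameter values (eps=0.5, min_samples=5, n_components=3) are illustrative assumptions, not values from the slides:

```python
# Illustrative sketch: density-based (DBSCAN) and model-based (GMM)
# clustering on synthetic blob data with scikit-learn.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# DBSCAN: eps and min_samples are assumed example values.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# GMM: assumes the data is a mixture of 3 Gaussians.
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X)

print(np.unique(db_labels))   # -1 marks points DBSCAN treats as noise
print(np.unique(gmm_labels))
```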
K-Means Clustering

K-Means is one of the simplest and most common methods for cluster analysis.

It partitions your data into a given number of clusters: we have a dataset and would like to divide it into k clusters.
Steps for K-Means Clustering

1. Define the number of clusters

The number of clusters is the k in k-means.
Steps for K-Means Clustering

2. Set cluster centers randomly

Initial cluster centroids are defined, usually at random. Each centroid represents a cluster.
Steps for K-Means Clustering

3. Assign points to clusters

The distance from the first point to each of the cluster centroids is measured. The point is then assigned to the cluster with the shortest distance. This is repeated for all points.


Steps for K-Means Clustering

4. Calculate the center of each cluster

The centers of the clusters are calculated; these become the new centroids. This is repeated for all clusters.


Steps for K-Means Clustering

5. Assign points to the new clusters

Each point is assigned to the cluster whose new centroid is closest. This is repeated for all points.
Steps for K-Means Clustering

6. Repeat steps 4 and 5

Until the cluster distribution no longer changes.
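The six steps above translate directly into a short NumPy implementation. This is a minimal sketch (centroids initialized from randomly sampled points, a fixed iteration cap, and no handling of empty clusters), not a production implementation:

```python
# Minimal from-scratch K-Means following the six steps above.
# Assumptions: centroids start at randomly sampled data points,
# a fixed max_iters cap, and empty clusters are not handled.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and set initial centroids randomly.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute the center of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids  # Step 5: reassignment happens next pass.
    return labels, centroids

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = kmeans(X, k=3)
print(centroids)
```

In practice, sklearn.cluster.KMeans adds refinements such as k-means++ initialization and multiple restarts.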
Evaluation Metrics
Internal Validation

Silhouette Score: Measures how similar an object is to its own cluster versus other clusters.
Dunn Index: Evaluates compactness and separation of clusters.

External Validation

Rand Index: Compares the clustering result to a ground truth.
Adjusted Mutual Information (AMI): Measures similarity between clustering and ground truth.

Stability Measures: Assess the robustness of clustering by applying the algorithm to different subsets of the data.
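A minimal sketch of computing some of these metrics with scikit-learn; the synthetic data and k=3 are illustrative, and the Dunn index is omitted because scikit-learn has no built-in implementation:

```python
# Illustrative sketch: internal and external validation with scikit-learn.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, rand_score, adjusted_mutual_info_score

X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal validation: no ground truth needed.
print("Silhouette:", silhouette_score(X, labels))

# External validation: compares against the known labels y_true.
print("Rand index:", rand_score(y_true, labels))
print("AMI:", adjusted_mutual_info_score(y_true, labels))
```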
Challenges of Clustering

Determining the optimal number of clusters
Sensitivity to initialization
Handling high-dimensional data
Dealing with noise and outliers
Ensuring interpretability of results


Applications of Clustering

Customer segmentation
Image segmentation
Anomaly detection
Document clustering
Dimensionality Reduction

What is Dimensionality Reduction?

Dimensionality reduction is a set of techniques used to reduce the number of input variables (features) in a dataset while preserving as much of the important information as possible.

This is crucial for dealing with high-dimensional data, which can suffer from the "curse of dimensionality."
Importance of Dimensionality Reduction
Improved Model Performance: High-dimensional data can lead to overfitting, where models perform well on training data but poorly on new, unseen data. Dimensionality reduction can help prevent overfitting by reducing noise and irrelevant information.

Reduced Computational Cost: Fewer features mean faster training and prediction times for machine learning models.

Data Visualization: High-dimensional data is difficult to visualize. Dimensionality reduction can project data onto lower-dimensional spaces (e.g., 2D or 3D) for easier visualization and exploration.
Importance of Dimensionality Reduction

Data Storage: Reducing the number of features can significantly reduce storage requirements.

Improved Interpretability: With fewer features, it can be easier to understand the relationships between variables and the underlying patterns in the data.
Types of Dimensionality Reduction

Linear Methods

Principal Component Analysis (PCA):
• Projects data onto a new set of orthogonal axes (principal components) that capture the maximum variance in the data.
• Effective for reducing noise and identifying the most important features.

Linear Discriminant Analysis (LDA):
• Similar to PCA, but focuses on maximizing the separation between classes in supervised learning problems.

Non-linear Methods

t-SNE:
• A powerful non-linear technique that maps high-dimensional data to a low-dimensional space while preserving local similarities.
• Effective for visualizing complex, non-linear structures in data.

Isomap:
• Constructs a low-dimensional embedding that preserves geodesic distances between data points.

Autoencoders:
• Neural networks trained to reconstruct input data, effectively learning a compressed representation of the data.
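The contrast between a linear and a non-linear method can be sketched in a few lines of scikit-learn; the digits dataset and the perplexity value are illustrative choices:

```python
# Illustrative sketch: linear (PCA) vs non-linear (t-SNE) reduction to 2D.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)  # 64-dimensional digit images

X_pca = PCA(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)

print(X_pca.shape, X_tsne.shape)  # both (1797, 2)
```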
What is Principal Component Analysis?

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while preserving as much of the variation in the data as possible.

It does this by identifying a new set of variables, called principal components, which are linear combinations of the original variables.

Example scores for six students:

         A   B   C   D   E   F
Maths    76  55  97  45  89  45
English  34  67  55  67  76  46
Art      67  45  89  44  66  44

[Figure: scatter plots of the scores, first Maths vs English, then with Art added, and finally Maths against a combined English/Art axis ("ArtEng").]
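As a concrete illustration, a minimal sketch applying PCA to the student-score table above, with each student as an observation and each subject as a feature; the use of scikit-learn here is an assumption, not something from the slides:

```python
# Illustrative sketch: PCA on the student-score table above.
# Each of the six students is an observation; Maths/English/Art are features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scores = np.array([
    [76, 34, 67],  # A: Maths, English, Art
    [55, 67, 45],  # B
    [97, 55, 89],  # C
    [45, 67, 44],  # D
    [89, 76, 66],  # E
    [45, 46, 44],  # F
])

X = StandardScaler().fit_transform(scores)  # standardize first
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d)                            # students in the new 2D space
print(pca.explained_variance_ratio_)   # variance captured per component
```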
Example

Consider a dataset of customer data with features such as age, income, and spending habits.

PCA can be used to identify the principal components that explain most of the variation in the data.

These principal components can then be used to group customers with similar characteristics, which can be useful for targeted marketing campaigns.
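One way this workflow might look in code, a minimal sketch with synthetic customer data standing in for real age, income, and spending features; all values and parameter choices here are illustrative assumptions:

```python
# Illustrative sketch: PCA followed by K-Means for customer segmentation.
# The synthetic "customers" array stands in for real age/income/spending data.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
customers = np.column_stack([
    rng.uniform(18, 70, 200),       # age
    rng.uniform(20e3, 120e3, 200),  # income
    rng.uniform(0, 100, 200),       # spending score
])

X = StandardScaler().fit_transform(customers)
components = PCA(n_components=2).fit_transform(X)
segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(components)
print(np.bincount(segments))  # customers per segment
```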
Steps in Principal Component Analysis

Standardize the Data: Standardize the data to have zero mean and unit variance. This ensures that all variables are on the same scale.

Compute the Covariance Matrix: Calculate the covariance matrix of the standardized data. The covariance matrix measures the relationship between pairs of variables.

Compute Eigenvectors and Eigenvalues: Find the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent the principal components, and the eigenvalues represent the amount of variance explained by each principal component.
Steps in Principal Component Analysis (continued)

Sort Eigenvectors: Sort the eigenvectors in descending order of their corresponding eigenvalues.

Choose Principal Components: Select the top k eigenvectors, where k is the desired number of dimensions.

Transform the Data: Project the original data onto the selected principal components.
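These steps map directly onto a short NumPy sketch; note that production implementations such as scikit-learn's PCA typically use the singular value decomposition rather than an explicit eigendecomposition:

```python
# Minimal sketch of PCA following the steps above, using NumPy.
# Assumes no constant (zero-variance) columns in X.
import numpy as np

def pca(X, k):
    # 1. Standardize the data to zero mean and unit variance.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the covariance matrix of the standardized data.
    cov = np.cov(X, rowvar=False)
    # 3. Find eigenvectors (components) and eigenvalues (variance explained).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort eigenvectors by descending eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    eigenvectors = eigenvectors[:, order]
    # 5. Choose the top k principal components.
    W = eigenvectors[:, :k]
    # 6. Project the data onto the selected components.
    return X @ W

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)  # (100, 2)
```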
Applications of Principal Component Analysis

Data Compression

Data Visualization

Noise Reduction

Feature Extraction
Conclusion

Clustering is a foundational technique of unsupervised learning, the unsupervised counterpart of classification.

High-dimensional data is often noisy and can drag model performance down.

Dimensionality reduction enables better visualization and better model performance.

Principal Component Analysis is a linear method of dimensionality reduction.
