Clustering
November 2023
Outline
1. Clustering - General Concepts
   Main idea, real-life applications, types
Motivating Example: Customer Segmentation
Clusters:
● Demographic
● Geographic
● Behavioral
● Psychographic
What is Cluster Analysis or Clustering?
Given a set of objects, place them in groups such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
● Inter-cluster distances are maximized
● Intra-cluster distances are minimized
Real-life Applications: Google News
Real-life Applications: Anomaly Detection
Source: https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
Real-life Applications: Sport Science
Source: https://www.americansocceranalysis.com/home/2019/3/11/using-k-means-to-learn-what-soccer-passing-tells-us-about-playing-styles
Real-life Applications: Image Segmentation
Source: http://pixelsciences.blogspot.com/2017/07/image-segmentation-k-means-clustering.html
Real-life Applications: Recommendation
● Cluster-based ranking
● Group recommendation
● …
What Affects Cluster Analysis?
[Diagram: the Data and the Algorithm together drive the Clustering, which produces the Clusters]
Characteristics of the Input Data Are Important
● High dimensionality
○ Dimensionality reduction
● Types of attributes
○ Binary, discrete, continuous, asymmetric
○ Mixed attribute types (e.g., continuous & nominal)
● Differences in attribute scales
○ Normalization techniques
● Size of data set
● Noise and Outliers
● Properties of the data space
Characteristics of Clusters
● Data distribution
○ Parametric models
● Shape
○ Globular or arbitrary shape
● Differing sizes
● Differing densities
● Level of separation among clusters
○ Distance metrics
● Relationship among clusters
● Subspace clusters
How to Measure the Similarity/Distance?
Source: https://towardsdatascience.com/9-distance-measures-in-data-science-918109d069fa
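To make these concrete, here is a minimal NumPy/SciPy sketch of three common measures (the vectors are illustrative, not from the source):

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

euclidean = np.linalg.norm(a - b)        # straight-line (L2) distance
manhattan = np.abs(a - b).sum()          # city-block (L1) distance
cosine_d = distance.cosine(a, b)         # 1 - cosine similarity (~0 here: same direction)

print(euclidean, manhattan, cosine_d)
```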
Notion of a Cluster can be Ambiguous
Types of Clusterings
Source: https://www.datanovia.com/en/blog/types-of-clustering-methods-overview-and-quick-start-r-code/
Partitional Clustering
Hierarchical Clustering
[Figure: clustering dendrogram]
Fuzzy Clustering
Density-based Clustering
Non-linear separation
Model-based Clustering
2. Typical Clustering Algorithms
   Intuition, main idea, limitations
Typical Clustering Algorithms
◎ Partitional Clustering
○ K-Means & Variants
◎ Hierarchical Clustering
○ HAC
◎ Density-based Clustering
○ DBSCAN
K-Means Clustering: An Example
K-Means Clustering
● Main idea: Each point is assigned to the cluster with the closest centroid
● Number of clusters, K, must be specified
● Objective: minimize the Sum of Squared Error (SSE); see the formula below
● Complexity: O(n * K * I * d)
○ n = number of points, K = number of clusters,
○ I = number of iterations, d = number of attributes
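For reference, the SSE objective that K-means minimizes is commonly written as

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \operatorname{dist}(c_i, x)^2$$

where $C_i$ is the i-th cluster and $c_i$ is its centroid.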
Elbow Method for Optimal Value of K
WCSS (within-cluster sum of squares) is the sum of squared distances between each point and the centroid of its cluster.
The plot of WCSS versus K changes slope sharply at a point called the elbow point, which suggests a good value for K.
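A minimal scikit-learn sketch of the elbow method (the data matrix `X` is random and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                 # illustrative data: (n_samples, n_features)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)               # inertia_ = within-cluster sum of squares (WCSS)

plt.plot(ks, wcss, marker="o")             # look for the 'elbow' in this curve
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()
```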
Two different K-means Clusterings
[Figure: original points, an optimal clustering, and a sub-optimal clustering]
Importance of Choosing Initial Centroids
Solutions to Initial Centroids Problem
● Multiple runs
○ Helps, but probability is not on your side
● Use a strategy to select candidate initial centroids and then choose among these candidates
○ Select most widely separated, e.g., K-means++
○ Use hierarchical clustering to determine initial centroids
● Bisecting K-Means
○ Not as susceptible to initialization issues
K-Means++
1. Choose one center uniformly at random among the data points.
2. For each data point x not chosen yet, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)².
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard K-Means clustering.
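A minimal NumPy sketch of this seeding procedure (illustrative only; in practice scikit-learn's `KMeans(init="k-means++")` already implements it):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centers from X with the K-means++ weighting."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # step 1: first center uniformly at random
    for _ in range(1, k):
        C = np.array(centers)
        # step 2: D(x)^2 = squared distance to the nearest already-chosen center
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # step 3: sample the next center with probability proportional to D(x)^2
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)                          # step 4 happens inside the loop

centers = kmeans_pp_init(np.random.rand(100, 2), k=3)
```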
Bisecting K-Means
Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Hierarchical Agglomerative Clustering
● Main Idea:
○ Start with the points as individual clusters
○ At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left
● Key operation is the computation of the proximity of two clusters
○ Worst-case complexity: O(N³)
HAC: Algorithm
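As an illustrative sketch of the agglomerative procedure described above, using SciPy (data and parameters are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                        # illustrative points
Z = linkage(X, method="single")                  # agglomerative merges, closest pair first

dendrogram(Z)                                    # visualize the merge order
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into K = 3 clusters
```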
Closest Pair of Clusters
● Many variants for defining the closest pair of clusters
● Single-link
○ Similarity of the closest elements
● Complete-link
○ Similarity of the “furthest” points
● Average-link
○ Average cosine similarity between pairs of elements
● Ward’s Method
○ The increase in squared error when two clusters are merged
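For reference, these variants map directly onto the `method` argument of SciPy's `linkage` (a sketch; Ward additionally assumes Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.rand(20, 2)                        # illustrative points
Z_single   = linkage(X, method="single")         # MIN: closest elements
Z_complete = linkage(X, method="complete")       # MAX: furthest elements
Z_average  = linkage(X, method="average")        # average pairwise proximity
Z_ward     = linkage(X, method="ward")           # increase in squared error when merging
```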
HAC - Single-link (MIN)
[Figure: nested clusters and the corresponding dendrogram]
HAC - Complete-link (MAX)
[Figure: nested clusters and the corresponding dendrogram]
HAC - Average-link
[Figure: nested clusters and the corresponding dendrogram]
HAC: Limitations
● Once two clusters are combined, the merge cannot be undone
● No global objective function is directly minimized
● Typical problems:
○ Sensitivity to noise
○ Difficulty handling clusters of different sizes and non-globular shapes
○ Breaking large clusters
Density-based Clustering - DBSCAN
DBSCAN: Algorithm
How to Determine Points? (example: MinPts = 7)
● Core point: has at least a specified number of points (MinPts) within Eps
● Border point: not a core point, but is in the neighborhood of a core point
● Noise point: any point that is not a core point or a border point
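A minimal scikit-learn sketch that also recovers the three point types (parameter values are arbitrary examples, not the slide's):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)                       # illustrative data
db = DBSCAN(eps=0.1, min_samples=7).fit(X)

labels = db.labels_                              # cluster id per point; -1 marks noise
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True             # core points
border = ~core & (labels != -1)                  # assigned to a cluster but not core
noise = labels == -1
```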
DBSCAN: Core, Border and Noise Points
DBSCAN: How to Determine Eps, MinPts?
Intuition:
● Core point: its k-th nearest neighbor is at a close distance.
● Noise point: its k-th nearest neighbor is at a far distance.
Plot the sorted distance of every point to its k-th nearest neighbor; the sharp bend in this curve suggests a value for Eps.
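A sketch of this heuristic with scikit-learn's NearestNeighbors (k plays the role of MinPts; the data is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(300, 2)
k = 4                                            # MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
dist, _ = nn.kneighbors(X)

plt.plot(np.sort(dist[:, -1]))                   # sorted distance to the k-th nearest neighbor
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to k-th nearest neighbor")
plt.show()                                       # Eps ≈ the distance at the sharp bend
```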
DBSCAN: Limitations
● Varying densities
● High-dimensional data
[Figure: original points clustered with (MinPts=4, Eps=9.92) and with (MinPts=4, Eps=9.75)]
Which Clustering Algorithm?
● Type of Clustering
● Type of Cluster
○ Prototype vs connected regions vs density-based
● Characteristics of Clusters
● Characteristics of Data Sets and Attributes
● Noise and Outliers
● Number of Data Objects
● Number of Attributes
● Algorithmic Considerations
A Comparison of Clustering Algorithms
Summary
◎ General Concepts of Clustering
○ Definition
○ Real-life Applications
○ Types of Clustering
◎ Typical Clustering Algorithms
○ K-Means
○ HAC
○ DBSCAN
Dimensionality Reduction
Curse of Dimensionality
The number of training examples required increases exponentially with the dimensionality d (i.e., as k^d, where k is the number of bins per feature). For example, with k = 3 bins per feature, the feature space is divided into 3¹, 3², and 3³ cells for d = 1, 2, and 3.
Dimensionality Reduction
● Feature extraction: extract features from the raw samples (this may increase the number of features)
● Feature selection: select the relevant features from the full set of features
● Dimensionality reduction: find a proper mapping to a lower-dimensional space
The mapping f() could be linear or non-linear. Feature selection keeps a subset of the original coordinates:

$$x = [x_1, x_2, \ldots, x_N]^T \;\rightarrow\; [x_{i_1}, x_{i_2}, \ldots, x_{i_K}]^T, \quad K \ll N$$

while dimensionality reduction applies a general mapping f:

$$x = [x_1, x_2, \ldots, x_N]^T \;\xrightarrow{\;f(x)\;}\; y = [y_1, y_2, \ldots, y_K]^T, \quad K \ll N$$
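When f() is linear, the mapping is just a matrix multiplication; a minimal NumPy sketch (W here is an arbitrary illustrative matrix, whereas PCA below derives it from the data):

```python
import numpy as np

N, K = 10, 3                       # original and reduced dimensionality, K << N
x = np.random.rand(N)              # an N-dimensional sample
W = np.random.rand(K, N)           # a linear map f(x) = W x
y = W @ x                          # the K-dimensional representation
```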
Dimensionality Reduction
PCA achieves this by:
(1) Solving the eigenvalue problem $\Sigma_x u_i = \lambda_i u_i$
(2) Choosing the K “largest” eigenvectors $u_i$ (corresponding to the K “largest” eigenvalues $\lambda_i$)
PCA - Steps
Suppose we are given M vectors $x_1, x_2, \ldots, x_M$, each of size $N \times 1$.
Step 1: compute the sample mean $\bar{x} = \frac{1}{M}\sum_{i=1}^{M} x_i$
Step 2: subtract the mean: $\Phi_i = x_i - \bar{x}$
Step 3: compute the sample covariance matrix $\Sigma_x$ (the usual $\hat{S} = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})(x_k - \hat{\mu})^T$ written in terms of the centered vectors):

$$\Sigma_x = \frac{1}{M}\sum_{i=1}^{M}\Phi_i\Phi_i^T = \frac{1}{M}AA^T, \quad \text{where } A = [\Phi_1\ \Phi_2\ \cdots\ \Phi_M] \ (N \times M \text{ matrix})$$
PCA - Steps
Step 4: compute the eigenvalues/eigenvectors of $\Sigma_x$:

$$\Sigma_x u_i = \lambda_i u_i$$

The projection of a centered vector onto eigenvector $u_i$ is

$$y_i = \frac{(x-\bar{x})^T u_i}{u_i^T u_i} = (x-\bar{x})^T u_i \quad \text{if } \|u_i\| = 1 \text{ (normalized)}$$
PCA - Steps
Step 5: approximation with the transformation matrix U (using the first K eigenvectors)

$$x - \bar{x} = \sum_{i=1}^{N} y_i u_i = y_1 u_1 + y_2 u_2 + \cdots + y_N u_N$$

$$\hat{x} - \bar{x} = \sum_{i=1}^{K} y_i u_i = y_1 u_1 + y_2 u_2 + \cdots + y_K u_K$$

or, in matrix form,

$$\hat{x} - \bar{x} = U \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_K \end{bmatrix}, \quad \text{where } U = [u_1\ u_2\ \cdots\ u_K] \ \text{is } N \times K$$
Example
Compute the PCA for the dataset (1,2), (3,3), (3,5), (5,4), (5,6), (6,5), (8,7), (9,8)
The eigenvectors are the solutions of the system $\Sigma_x u_i = \lambda_i u_i$.
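A NumPy sketch that carries out the steps above on this dataset (illustrative; the numbers it prints are whatever `numpy.linalg.eigh` returns, not values quoted from the slides):

```python
import numpy as np

X = np.array([(1, 2), (3, 3), (3, 5), (5, 4), (5, 6), (6, 5), (8, 7), (9, 8)], dtype=float)

mean = X.mean(axis=0)                    # step 1: sample mean
Phi = X - mean                           # step 2: subtract the mean
Sigma = (Phi.T @ Phi) / len(X)           # step 3: covariance matrix (1/M) A A^T
lam, U = np.linalg.eigh(Sigma)           # step 4: eigenvalues/eigenvectors (ascending)
lam, U = lam[::-1], U[:, ::-1]           # largest eigenvalue first
y = Phi @ U[:, :1]                       # step 5: project onto the first K = 1 eigenvector
print(lam, U, y, sep="\n")
```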
How do we choose K ?
• K is typically chosen based on how much information (variance) we want to preserve:

$$\frac{\sum_{i=1}^{K}\lambda_i}{\sum_{i=1}^{N}\lambda_i} > T, \quad \text{where } T \text{ is a threshold (e.g., 0.9)}$$

• If K = N, then we “preserve” 100% of the information in the data (i.e., it is just a change of basis)
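A sketch of applying this criterion to eigenvalues sorted in decreasing order (the values and the threshold are just examples):

```python
import numpy as np

lam = np.array([4.2, 1.1, 0.4, 0.2, 0.1])   # example eigenvalues, largest first
T = 0.9
ratio = np.cumsum(lam) / lam.sum()          # variance preserved by the first K components
K = int(np.argmax(ratio > T)) + 1           # smallest K whose ratio exceeds T (here K = 3)
```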
Approximation Error
• The approximation error (or reconstruction error) can be computed as:

$$e = \|x - \hat{x}\|, \quad \text{where } \hat{x} = \bar{x} + \sum_{i=1}^{K} y_i u_i$$
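Continuing the NumPy sketch from the PCA example above (reusing `X`, `mean`, `Phi`, and `U` defined there), the reconstruction and its error can be computed as:

```python
import numpy as np

K = 1
y = Phi @ U[:, :K]                          # coordinates in the K-dimensional basis
X_hat = mean + y @ U[:, :K].T               # reconstruction x_hat from the first K eigenvectors
err = np.linalg.norm(X - X_hat, axis=1)     # approximation error ||x - x_hat|| per sample
```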