
FITE3010 (Fall 2024)

Big Data and Data Mining


Chapter 5 Dimension Reduction

Liu Qi
Motivation
If your data lies on or near a low d-dimensional subspace, then instead of using
the original points in the D-dimensional space, we can map the points into that
lower d-dimensional subspace.
Motivation
High-dimensional data = a lot of features

Document classification:
Features per document =
thousands of words/unigrams, millions of bigrams,
contextual information

Netflix watching data: 480189 users x 17770 movies

Dimension reduction reduces the number of dimensions and makes working with
your data easier.
Dimension Reduction
Dimension reduction is the transformation of data from a high-dimensional
space into a low-dimensional space, so that the low-dimensional representation
retains some meaningful properties of the original data.
Why Do We Need Dimension Reduction?
Why reduce dimensions?

➢ Discover hidden correlations/topics


o In topic modeling, dimension reduction is often used to identify
and extract latent topics present in a large corpus of text data.
o Examples: latent semantic analysis (LSA) and latent Dirichlet allocation (LDA)

➢ Remove redundant and noisy features


o Dimension reduction is used to eliminate features that contain
noise or irrelevant information that can negatively impact the
performance of your model.

➢ Interpretation and visualization

➢ Easier storage and processing of the data


Application 1: Visualization
In exploratory data analysis, data visualization plays an important role.

We often have difficulties visualizing high-dimensional data.

This is because we can only plot 2-dimensional or 3-dimensional data.

Example: t-SNE can visualize similar words from Google News, which makes the data more intuitive to explore.
Application 2: Dimensionality Reduction Saves
Computational Resources When Training Models

Training models with low-dimensional data is more efficient than with
high-dimensional data, so dimensionality reduction can reduce the training
time of models by simplifying calculations.

For example, an autoencoder can compress the input into low-dimensional
vectors, and we can then use these low-dimensional vectors to train the model.
Application 3: Dimensionality reduction transforms
non-linear data into a linearly-separable form
Suppose you have the data shown in the image on the left.

Your task is to draw a straight line to separate the red and blue points.

We can clearly see that the original data is non-linear and cannot be separated by a straight
line.

The image on the right shows the data after applying Kernel PCA: the transformed data is
linearly separable.
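Below is a minimal scikit-learn sketch of this idea. The dataset (two concentric circles), the RBF kernel, and the gamma value are illustrative assumptions rather than the exact setting behind the figures.

```python
# Minimal sketch: Kernel PCA can turn non-linear (circular) data into a
# linearly separable form. Kernel choice and gamma are illustrative.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Map the points with an RBF kernel, then keep two principal components.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

# A linear classifier now separates the two classes almost perfectly.
clf = LogisticRegression().fit(X_kpca, y)
print("accuracy on transformed data:", clf.score(X_kpca, y))
```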


Organization of the Lecture
Recap Linear Algebra: The math tools used in dimension
reduction algorithms.

Dimension Reduction Method 1: Principal Component Analysis

Dimension Reduction Method 2: Singular Value Decomposition

Dimension Reduction Method 3: Autoencoder


Linear Algebra
Linear algebra is one of the most important math skills for data
mining and machine learning.

Most data mining models can be expressed in terms of matrices and matrix
computations.

Next, we briefly recap determinants, eigenvalues, eigenvectors, etc.
Determinant of a Matrix
In mathematics, the determinant is a scalar value that is a
function of the elements of a square matrix.
Determinant: More Generally
More generally, for a k by k matrix (k > 3), the determinant is calculated in
a recursive fashion.
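One standard way to write this recursion is the Laplace (cofactor) expansion along the first row, where $a_{1j}$ is the entry in row 1 and column j, and $A_{1j}$ is the (k-1) by (k-1) submatrix obtained by deleting row 1 and column j:

```latex
\det(A) = \sum_{j=1}^{k} (-1)^{1+j} \, a_{1j} \, \det(A_{1j})
```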
Eigenvalue and Eigenvectors
𝐴𝑣 = 𝜆𝑣

𝐴: matrix, 𝑣: eigenvector, 𝜆: eigenvalue (a scalar)

Example:

Matrix $\begin{pmatrix} -6 & 3 \\ 4 & 5 \end{pmatrix}$,  eigenvalue $6$,  eigenvector $\begin{pmatrix} 1 \\ 4 \end{pmatrix}$:

$\begin{pmatrix} -6 & 3 \\ 4 & 5 \end{pmatrix} \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 6 \\ 24 \end{pmatrix} = 6 \begin{pmatrix} 1 \\ 4 \end{pmatrix}$
Unit Vector
If 𝑣 is an eigenvector of matrix 𝐴, then for any nonzero constant 𝑐, 𝑐𝑣
is also an eigenvector. Why?

We know 𝐴𝑣 = 𝜆𝑣.

Therefore: 𝐴(𝑐𝑣) = 𝑐(𝐴𝑣) = 𝑐𝜆𝑣 = 𝜆(𝑐𝑣)

To avoid ambiguity, we require that every eigenvector be a


unit vector.
Finding Eigenvalues and Eigenvectors
An n by n matrix has at most n (eigenvalue, eigenvector) pairs.

How do we find the eigenvalues and eigenvectors of a matrix?

$Av = \lambda v \;\Rightarrow\; (A - \lambda I)v = 0$

A fact of linear algebra is that in order for $(A - \lambda I)v = 0$ to hold
for a vector $v \neq 0$, the determinant of $A - \lambda I$ must be 0.

Therefore, $\lambda$ is an eigenvalue of $A$ if and only if $\det(A - \lambda I) = 0$.


Example
Suppose we want to compute the eigenvalues and
eigenvectors of the following matrix A:

$A = \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$

We know the eigenvalues satisfy $\det(A - \lambda I) = 0$:

$\det\begin{pmatrix} 3-\lambda & 2 \\ 2 & 6-\lambda \end{pmatrix} = 0$

The determinant is equal to $(3-\lambda)(6-\lambda) - 4 = 0$

Solving this equation, we get the eigenvalues 𝜆 = 7 or 𝜆 = 2.

Example (continued)
We know 𝜆 = 7 is an eigenvalue of the matrix $\begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$.

Next we calculate the corresponding eigenvector:

$\begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 7 \begin{pmatrix} x \\ y \end{pmatrix}$

Therefore, 3𝑥 + 2𝑦 = 7𝑥 and 2𝑥 + 6𝑦 = 7𝑦, which gives 𝑦 = 2𝑥.

Taking 𝑥 = 1 and 𝑦 = 2 and normalizing to a unit vector gives $\left(\tfrac{1}{\sqrt{5}}, \tfrac{2}{\sqrt{5}}\right)$.
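As a quick check, NumPy computes the same eigenvalues and unit eigenvectors (up to sign and ordering):

```python
# Quick numerical check of the worked example above.
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # 7 and 2 (the order is not guaranteed)
print(eigenvectors)   # columns are unit eigenvectors, e.g. (1/sqrt(5), 2/sqrt(5)) up to sign
```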
Principal-Component Analysis (PCA)
Unsupervised technique for extracting low-dimensional features
from high-dimensional datasets.

The key idea of PCA is to change the basis.


Intuition behind PCA
(Figure: a 2-D point cloud with two candidate axes drawn through it; the variance along one axis is small, while the variance along the other axis is large.)

Question: Can we transform the features so that we only need to preserve the
features with the largest variance?
PCA

Formally, identifying the axes with the largest variance is known as Principal
Component Analysis, and they can be obtained by using classic matrix
computation tools, such as eigenvectors and eigenvalues.

Then we can project the points onto these identified axes for dimension
reduction.
Example: 2D Gaussian Dataset

The points are sampled from a 2-dimensional Gaussian distribution.

X and Y are highly correlated in this example; in particular, they are
negatively correlated.
1st and 2nd PCA Axes

These two directions need to be orthogonal in order to form a new coordinate
system.
How to Measure Variance?
We project the points onto the axis.

The variance is measured by how spread out the projected points are.

In the figure, the variance of option A is larger than that of option B.
How to Find Axes with Largest Variance?
Suppose we have m points in n-dimensional space. We can stack them into a
matrix $A \in \mathbb{R}^{m \times n}$, where each row is a point.

A key fact is that the axes with the largest variance are the eigenvectors of
the matrix $A^{T}A$ corresponding to the largest eigenvalues.

The problem boils down to:


1. We first find the n (eigenvalue, eigenvector) pairs of the matrix $A^{T}A$.
2. We order the pairs by eigenvalues.
3. We only keep k (eigenvalue, eigenvector) pairs that have the largest
eigenvalues.
4. We use the k eigenvectors as the new axes and project the points onto
these k new axes.
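A minimal NumPy sketch of these four steps is shown below. The function name pca_project is illustrative, and the sketch follows the slides in working with $A^{T}A$ directly (without mean-centering the data first).

```python
# Minimal sketch of the four-step PCA procedure described above.
import numpy as np

def pca_project(A: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of A onto the k eigenvectors of A^T A with the
    largest eigenvalues."""
    # 1. Find the (eigenvalue, eigenvector) pairs of A^T A (symmetric -> eigh).
    eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)
    # 2. Order the pairs by decreasing eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    # 3. Keep the k eigenvectors with the largest eigenvalues as the new axes.
    top_k = eigenvectors[:, order[:k]]
    # 4. Project the points onto these k new axes.
    return A @ top_k
```

Calling pca_project(A, 1) on the 4 by 2 example that follows reproduces the projection computed there (up to sign).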
Example
Given the following matrix, we want to reduce this 4 by 2 matrix to a 4 by 1 matrix:

$A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix}$

We first compute $A^{T}A$:

$A^{T}A = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} = \begin{pmatrix} 30 & 28 \\ 28 & 30 \end{pmatrix}$

This is a symmetric matrix because $(A^{T}A)^{T} = A^{T}A$.
Example (Continued)
Then we find the eigenvalues and eigenvectors of $A^{T}A$:

$\lambda_1 = 58$ with eigenvector $\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$,   $\lambda_2 = 2$ with eigenvector $\begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$

We find that 58 is the largest eigenvalue, so the corresponding eigenvector $v = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$ is the axis with the largest variance.
Example (Continued)
We project all the points in matrix A onto this axis. To achieve this, we
multiply A with the eigenvector with the largest eigenvalue (question: why the
projection can be simply obtained this way?):

$Av = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} \\ 3/\sqrt{2} \\ 7/\sqrt{2} \\ 7/\sqrt{2} \end{pmatrix}$  (this is the result of PCA)

If we want to keep two dimensions, we multiply A with the matrix of the two
eigenvectors:

$\begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} & 1/\sqrt{2} \\ 3/\sqrt{2} & -1/\sqrt{2} \\ 7/\sqrt{2} & 1/\sqrt{2} \\ 7/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$
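Running NumPy directly on this example reproduces these numbers (up to the sign of the eigenvectors):

```python
# Reproducing the worked example: eigenvalues 2 and 58, and the projection
# onto the top eigenvector, which equals (3, 3, 7, 7) / sqrt(2) up to sign.
import numpy as np

A = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)  # ascending order: [2, 58]
v = eigenvectors[:, -1]                              # eigenvector for lambda = 58
print(eigenvalues)   # [ 2. 58.]
print(A @ v)         # approx [2.12, 2.12, 4.95, 4.95], i.e. (3, 3, 7, 7)/sqrt(2)
```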
Eigenvector Matrix
The eigenvector matrix is an n by n matrix:

$B = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$

The eigenvector matrix is an orthonormal matrix:

• Each pair of columns is orthogonal, i.e. their inner product is 0.
• Each column is a unit vector.

$BB^{T} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
PCA Applications: Face Recognition
Given face images with a resolution of 256 x 256 pixels, we can use PCA to
perform dimension reduction to reduce noise and improve efficiency.
After PCA
After PCA, the face features are more significant.

These are useful features for training machine learning


models.
PCA Application: Lossy Image Compression

Divide each original 372x492 image into patches.

Each patch is an instance that contains 12x12 pixels on a grid.
View each patch as a 144-D vector.
Then we can perform PCA on these patches.
After PCA
We obtain a compressed approximation of the image that takes less space to store.
Singular Value Decomposition
Let M be an m by n matrix, and let the rank of the matrix M be r.

We can use singular value decomposition (SVD) to decompose this matrix into
three matrices:

$M = U \Sigma V^{T}$
The Meaning of the Three Matrices

• U is the matrix of eigenvectors of $MM^{T}$. Each column is an eigenvector.
• Σ is a diagonal matrix, and $\Sigma^{2}$ is the diagonal matrix whose entries are the corresponding eigenvalues of $MM^{T}$.
• V is the matrix of eigenvectors of $M^{T}M$. Each column is an eigenvector.
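A minimal NumPy sketch of this decomposition; the matrix M here is a random illustrative example rather than data from the slides.

```python
# np.linalg.svd returns U, the singular values (the diagonal of Sigma), and V^T.
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 3))                       # m = 5, n = 3

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (5, 3) (3,) (3, 3)

# The squared singular values equal the eigenvalues of M^T M (and MM^T).
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(M.T @ M))))

# Reconstruction: M = U @ diag(s) @ V^T.
print(np.allclose(M, U @ np.diag(s) @ Vt))
```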
SVD Example: Movie Ratings

(Figure: a user-movie rating matrix decomposed by SVD into U, Σ, and $V^{T}$.)
Real World Meaning of SVD
We have r columns in U and V, where r is equal to the rank of the matrix M.

What is the real-world meaning of the matrices $U$, $\Sigma$, $V$ produced by SVD?
M: m × n
U: m × r
Σ: r × r
V: n × r
Real World Meaning of SVD (continued)
We can think of each of the r columns as an "abstract concept".
M: m × n -> user-movie matrix
U: m × r -> user-concept matrix
Σ: r × r -> eigenvalues
V: n × r -> movie-concept matrix



Dimension Reduction
The matrix U can be seen as the low-dimensional representation of the original
matrix M, i.e. the result of dimension reduction for M.

Using the abstract concepts:


➢ The matrix U provides us with low-dimensional features for each user.

➢ The matrix V provides us with low-dimensional features for each movie.

Each user/movie can be represented as an r-dimensional vector.

Query with Low-Dimensional Features
Given a user or movie, we can calculate the distance between its r-dimensional
features and the features of the other users or movies to find similar users
or movies.

Joe: (0.13, 0.02, −0.01)

Jane: (0.07, −0.29, 0.32)

The Euclidean distance between Joe and Jane is about 0.457.
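A quick check of this arithmetic:

```python
# Euclidean distance between the two 3-dimensional concept vectors.
import numpy as np

joe = np.array([0.13, 0.02, -0.01])
jane = np.array([0.07, -0.29, 0.32])
print(np.linalg.norm(joe - jane))   # approx 0.457
```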


Autoencoder

An autoencoder is a neural network trained to copy its input to its output; in
other words, it tries to approximate the identity function.

It contains two parts:

• Encoder: maps the input to a hidden representation
• Decoder: maps the hidden representation to the output
Autoencoder

We do not really care about the copying itself.

Because the autoencoder is forced to select which aspects of the input to
preserve, it can hopefully learn useful properties of the data.

The hidden representation h is the result after dimension reduction; the
dimension of h is usually much smaller than that of the input x.


Autoencoder
We constrain h to have a smaller dimension than the input.

Training: minimize a reconstruction loss L(x, r), where r is the network's
reconstruction of the input x.


Autoencoder
Special case: the encoder $f$ and decoder $g$ are linear, and $L$ is the mean
squared error:

$L = \lVert x - r \rVert^{2}$

This special case has been proven to be equivalent to Principal Component Analysis.
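The slides do not prescribe a framework; below is a minimal sketch of such an autoencoder assuming PyTorch, with illustrative layer sizes and random stand-in data. Dropping the ReLU makes the encoder and decoder linear, which corresponds to the special case above.

```python
# Minimal autoencoder sketch (assumes PyTorch; sizes and data are illustrative).
import torch
from torch import nn

input_dim, hidden_dim = 144, 16        # e.g. 12x12 patches -> 16-D codes

encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim))
loss_fn = nn.MSELoss()                 # L = ||x - r||^2 (averaged over entries)
optimizer = torch.optim.Adam(list(encoder.parameters()) +
                             list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, input_dim)          # a dummy batch standing in for real data
for _ in range(100):
    h = encoder(x)                     # low-dimensional representation h
    r = decoder(h)                     # reconstruction of the input
    loss = loss_fn(r, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, encoder(x) gives the dimension-reduced representation.
```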


Implementation
Scikit-learn (sklearn) provides ready-to-use implementations of both PCA and SVD.
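A minimal usage sketch; the data here is random and only for illustration, and TruncatedSVD is scikit-learn's SVD-based reducer.

```python
# PCA and truncated SVD with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

X = np.random.default_rng(0).random((100, 20))   # 100 points in 20-D

pca = PCA(n_components=2)                        # keep the top 2 components
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_)

svd = TruncatedSVD(n_components=2)               # also works on sparse matrices
X_svd = svd.fit_transform(X)
print(X_svd.shape)
```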
References

Mining Massive Datasets, Chapter 11

http://infolab.stanford.edu/~ullman/mmds/ch11.pdf
Summary

We recapped the relevant linear algebra.

We introduced three dimension reduction methods: PCA, SVD, and the
autoencoder.

We went through some applications of dimension reduction.
Questions?
