
FITE3010 (Fall 2024)

Big Data and Data Mining


Chapter 5 Dimension Reduction

Liu Qi
Motivation
If your data lies on or near a low d-dimensional subspace, then instead of using
the original points in the D-dimensional space, we can map the points into that
lower d-dimensional subspace.
Motivation
High-dimensional data = a lot of features

Document classification:
Features per document =
thousands of words/unigrams, millions of bigrams,
contextual information

Netflix watching data: 480189 users x 17770 movies

Dimension reduction reduces the number of dimensions and makes working with
your data easier.
Dimension Reduction
Dimension reduction is the transformation of data from a high-dimensional
space into a low-dimensional space, so that the low-dimensional representation
retains some meaningful properties of the original data.
Why Do We Need Dimension Reduction?
Why reduce dimensions?

➢ Discover hidden correlations/topics


o In topic modeling, dimension reduction is often used to identify
and extract latent topics present in a large corpus of text data.
o Examples: latent semantic analysis (LSA) and latent Dirichlet allocation (LDA)

➢ Remove redundant and noisy features


o Dimension reduction is used to eliminate features that contain
noise or irrelevant information that can negatively impact the
performance of your model.

➢ Interpretation and visualization

➢ Easier storage and processing of the data


Application 1: Visualization
In exploratory data analysis, data visualization plays an important role.

We often have difficulties visualizing high-dimensional data.

This is because we can only plot 2-dimensional or 3-dimensional data.

Example: t-SNE can visualize similar words from Google News, which makes the data more intuitive to explore.
Application 2: Dimensionality Reduction Saves
Computational Resources When Training Models

Training models with low-dimensional data is more efficient than with
high-dimensional data, so dimensionality reduction can reduce the training
time of models by simplifying calculations.

For example, an autoencoder can compress the input into low-dimensional
vectors, and we can then use these low-dimensional vectors to train the model.
Application 3: Dimensionality reduction transforms
non-linear data into a linearly-separable form
Suppose you have the data shown in the image on the left.

Your task is to draw a straight line to separate the red and blue points.

We can clearly see that the original data is non-linear and cannot be separated by a straight
line.

The image on the right shows the data after applying Kernel PCA: the transformed data is
linearly separable.
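Below is a minimal scikit-learn sketch of this idea. The dataset (two concentric circles), the RBF kernel, and the gamma value are illustrative assumptions rather than the exact setting behind the figures.

```python
# Minimal sketch: Kernel PCA can turn non-linear (circular) data into a
# linearly separable form. Kernel choice and gamma are illustrative.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Map the points with an RBF kernel, then keep two principal components.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

# A linear classifier now separates the two classes almost perfectly.
clf = LogisticRegression().fit(X_kpca, y)
print("accuracy on transformed data:", clf.score(X_kpca, y))
```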


Organization of the Lecture
Recap Linear Algebra: The math tools used in dimension
reduction algorithms.

Dimension Reduction Method 1: Principal Component Analysis

Dimension Reduction Method 2: Singular Value Decomposition

Dimension Reduction Method 3: Autoencoder


Linear Algebra
Linear algebra is one of the most important math skills for data
mining and machine learning.

Most data mining models can be expressed in terms of matrices and matrix
computations.

Next, we briefly recap determinants, eigenvalues, eigenvectors, etc.
Determinant of a Matrix
In mathematics, the determinant is a scalar value that is a
function of the elements of a square matrix.
Determinant: More Generally
More generally, for a k by k matrix (k > 3), the determinant is calculated in
a recursive fashion.
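One standard way to write this recursion is the Laplace (cofactor) expansion along the first row, where $a_{1j}$ is the entry in row 1 and column j, and $A_{1j}$ is the (k-1) by (k-1) submatrix obtained by deleting row 1 and column j:

```latex
\det(A) = \sum_{j=1}^{k} (-1)^{1+j} \, a_{1j} \, \det(A_{1j})
```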
Eigenvalue and Eigenvectors
𝐴𝑣 = 𝜆𝑣

𝐴: matrix, 𝑣: eigenvector, 𝜆: eigenvalue (a scalar)

Example:

Matrix $\begin{pmatrix} -6 & 3 \\ 4 & 5 \end{pmatrix}$,  eigenvalue $6$,  eigenvector $\begin{pmatrix} 1 \\ 4 \end{pmatrix}$:

$\begin{pmatrix} -6 & 3 \\ 4 & 5 \end{pmatrix} \begin{pmatrix} 1 \\ 4 \end{pmatrix} = \begin{pmatrix} 6 \\ 24 \end{pmatrix} = 6 \begin{pmatrix} 1 \\ 4 \end{pmatrix}$
Unit Vector
If 𝑣 is an eigenvector of matrix 𝐴, then for any nonzero constant 𝑐, 𝑐𝑣
is also an eigenvector. Why?

We know 𝐴𝑣 = 𝜆𝑣.

Therefore: 𝐴(𝑐𝑣) = 𝑐(𝐴𝑣) = 𝑐𝜆𝑣 = 𝜆(𝑐𝑣)

To avoid ambiguity, we require that every eigenvector be a


unit vector.
Finding Eigenvalues and Eigenvectors
An n by n matrix has at most n (eigenvalue, eigenvector) pairs.

How do we find the eigenvalues and eigenvectors of a matrix?

$Av = \lambda v \;\Rightarrow\; (A - \lambda I)v = 0$

A fact of linear algebra is that in order for $(A - \lambda I)v = 0$ to hold
for a vector $v \neq 0$, the determinant of $A - \lambda I$ must be 0.

Therefore, $\lambda$ is an eigenvalue of $A$ if and only if $\det(A - \lambda I) = 0$.


Example
Suppose we want to compute the eigenvalues and
eigenvectors of the following matrix A:

$A = \begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$

We know the eigenvalues satisfy $\det(A - \lambda I) = 0$:

$\det\begin{pmatrix} 3-\lambda & 2 \\ 2 & 6-\lambda \end{pmatrix} = 0$

The determinant is equal to $(3-\lambda)(6-\lambda) - 4 = 0$

Solving this equation, we get the eigenvalues 𝜆 = 7 or 𝜆 = 2.

Example (continued)
We know 𝜆 = 7 is an eigenvalue of the matrix $\begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix}$.

Next we calculate the corresponding eigenvector:

$\begin{pmatrix} 3 & 2 \\ 2 & 6 \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = 7 \begin{pmatrix} x \\ y \end{pmatrix}$

Therefore, 3𝑥 + 2𝑦 = 7𝑥 and 2𝑥 + 6𝑦 = 7𝑦, which gives 𝑦 = 2𝑥.

Taking 𝑥 = 1 and 𝑦 = 2 and normalizing to a unit vector gives $\left(\tfrac{1}{\sqrt{5}}, \tfrac{2}{\sqrt{5}}\right)$.
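As a quick check, NumPy computes the same eigenvalues and unit eigenvectors (up to sign and ordering):

```python
# Quick numerical check of the worked example above.
import numpy as np

A = np.array([[3.0, 2.0],
              [2.0, 6.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # 7 and 2 (the order is not guaranteed)
print(eigenvectors)   # columns are unit eigenvectors, e.g. (1/sqrt(5), 2/sqrt(5)) up to sign
```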
Principal-Component Analysis (PCA)
Unsupervised technique for extracting low-dimensional features
from high-dimensional datasets.

The key idea of PCA is to change the basis.


Intuition behind PCA
(Figure: a 2-D point cloud with two candidate axes drawn through it; the variance along one axis is small, while the variance along the other axis is large.)

Question: Can we transform the features so that we only need to preserve the
features with the largest variance?
PCA

Formally, identifying the axes with the largest variance is known as Principal
Component Analysis, and they can be obtained by using classic matrix
computation tools, such as eigenvectors and eigenvalues.

Then we can project the points onto these identified axes for dimension
reduction.
Example: 2D Gaussian Dataset

The points are sampled from a 2-dimensional Gaussian distribution.

X and Y are highly correlated in this example; in particular, they are
negatively correlated.
1st and 2nd PCA Axes

These two directions need to be orthogonal in order to form a new coordinate
system.
How to Measure Variance?
We project the points onto the axis.

The variance is measured by how spread out the projected points are.

In the figure, the variance of option A is larger than that of option B.
How to Find Axes with Largest Variance?
Suppose we have m points in n-dimensional space. We can stack them into a
matrix $A \in \mathbb{R}^{m \times n}$, where each row is a point.

A key fact is that the axes with the largest variance are the eigenvectors of
the matrix $A^{T}A$ corresponding to the largest eigenvalues.

The problem boils down to:


1. We first find the n (eigenvalue, eigenvector) pairs of the matrix $A^{T}A$.
2. We order the pairs by eigenvalues.
3. We only keep k (eigenvalue, eigenvector) pairs that have the largest
eigenvalues.
4. We use the k eigenvectors as the new axes and project the points onto
these k new axes.
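A minimal NumPy sketch of these four steps is shown below. The function name pca_project is illustrative, and the sketch follows the slides in working with $A^{T}A$ directly (without mean-centering the data first).

```python
# Minimal sketch of the four-step PCA procedure described above.
import numpy as np

def pca_project(A: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of A onto the k eigenvectors of A^T A with the
    largest eigenvalues."""
    # 1. Find the (eigenvalue, eigenvector) pairs of A^T A (symmetric -> eigh).
    eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)
    # 2. Order the pairs by decreasing eigenvalue.
    order = np.argsort(eigenvalues)[::-1]
    # 3. Keep the k eigenvectors with the largest eigenvalues as the new axes.
    top_k = eigenvectors[:, order[:k]]
    # 4. Project the points onto these k new axes.
    return A @ top_k
```

Calling pca_project(A, 1) on the 4 by 2 example that follows reproduces the projection computed there (up to sign).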
Example
Given the following matrix, we want to reduce this 4 by 2 matrix to a 4 by 1 matrix:

$A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix}$

We first compute $A^{T}A$:

$A^{T}A = \begin{pmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} = \begin{pmatrix} 30 & 28 \\ 28 & 30 \end{pmatrix}$

This is a symmetric matrix because $(A^{T}A)^{T} = A^{T}A$.
Example (Continued)
Then we find the eigenvalues and eigenvectors of $A^{T}A$:

$\lambda_1 = 58$ with eigenvector $\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$,   $\lambda_2 = 2$ with eigenvector $\begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$

We find that 58 is the largest eigenvalue, so the corresponding eigenvector $v = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$ is the axis with the largest variance.
Example (Continued)
We project all the points in matrix A onto this axis. To achieve this, we
multiply A with the eigenvector with the largest eigenvalue (question: why the
projection can be simply obtained this way?):

$Av = \begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} \\ 3/\sqrt{2} \\ 7/\sqrt{2} \\ 7/\sqrt{2} \end{pmatrix}$  (this is the result of PCA)

If we want to keep two dimensions, we multiply A with the matrix of the two
eigenvectors:

$\begin{pmatrix} 1 & 2 \\ 2 & 1 \\ 3 & 4 \\ 4 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 3/\sqrt{2} & 1/\sqrt{2} \\ 3/\sqrt{2} & -1/\sqrt{2} \\ 7/\sqrt{2} & 1/\sqrt{2} \\ 7/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$
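Running NumPy directly on this example reproduces these numbers (up to the sign of the eigenvectors):

```python
# Reproducing the worked example: eigenvalues 2 and 58, and the projection
# onto the top eigenvector, which equals (3, 3, 7, 7) / sqrt(2) up to sign.
import numpy as np

A = np.array([[1, 2], [2, 1], [3, 4], [4, 3]], dtype=float)
eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)  # ascending order: [2, 58]
v = eigenvectors[:, -1]                              # eigenvector for lambda = 58
print(eigenvalues)   # [ 2. 58.]
print(A @ v)         # approx [2.12, 2.12, 4.95, 4.95], i.e. (3, 3, 7, 7)/sqrt(2)
```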
Eigenvector Matrix
The eigenvector matrix is an n by n matrix:

$B = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$

The eigenvector matrix is an orthonormal matrix:

• Each pair of columns is orthogonal, i.e. their inner product is 0.
• Each column is a unit vector.

$BB^{T} = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
PCA Applications: Face Recognition
Given face images with a resolution of 256 x 256 pixels, we can use PCA to
perform dimension reduction to reduce noise and improve efficiency.
After PCA
After PCA, the face features are more significant.

These are useful features for training machine learning


models.
PCA Application: Lossy Image Compression

Divide each original 372x492 image into patches.

Each patch is an instance that contains 12x12 pixels on a grid.
View each patch as a 144-D vector.
Then we can perform PCA on these patches.
After PCA
We obtain a compressed approximation of the image that takes less space to store.
Singular Value Decomposition
Let M be an m by n matrix, and let the rank of the matrix M be r.

We can use singular value decomposition (SVD) to decompose this matrix into
three matrices:

$M = U \Sigma V^{T}$
The Meaning of the Three Matrices

• U is the matrix of eigenvectors of $MM^{T}$. Each column is an eigenvector.
• Σ is a diagonal matrix, and $\Sigma^{2}$ is the diagonal matrix whose entries are the corresponding eigenvalues of $MM^{T}$.
• V is the matrix of eigenvectors of $M^{T}M$. Each column is an eigenvector.
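A minimal NumPy sketch of this decomposition; the matrix M here is a random illustrative example rather than data from the slides.

```python
# np.linalg.svd returns U, the singular values (the diagonal of Sigma), and V^T.
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((5, 3))                       # m = 5, n = 3

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(U.shape, s.shape, Vt.shape)            # (5, 3) (3,) (3, 3)

# The squared singular values equal the eigenvalues of M^T M (and MM^T).
print(np.allclose(np.sort(s**2), np.sort(np.linalg.eigvalsh(M.T @ M))))

# Reconstruction: M = U @ diag(s) @ V^T.
print(np.allclose(M, U @ np.diag(s) @ Vt))
```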
SVD Example: Movie Ratings

(Figure: a user-movie rating matrix decomposed by SVD into U, Σ, and $V^{T}$.)
Real World Meaning of SVD
We have r columns in U and V, where r is equal to the rank of the matrix M.

What is the real-world meaning of the matrices $U$, $\Sigma$, $V$ produced by SVD?
M: m × n
U: m × r
Σ: r × r
V: n × r
Real World Meaning of SVD (continued)
We can think of each of the r columns as an "abstract concept".
M: m × n -> user-movie matrix
U: m × r -> user-concept matrix
Σ: r × r -> eigenvalues
V: n × r -> movie-concept matrix



Dimension Reduction
The matrix U can be seen as the low-dimensional representation of the original
matrix M, i.e. the result of dimension reduction for M.

Using the abstract concepts:


➢ The matrix U provides us with low-dimensional features for each user.

➢ The matrix V provides us with low-dimensional features for each movie.

Each user/movie can be represented as an r-dimensional vector.

Query with Low-Dimensional Features
Given a user or movie, we can calculate the distance between its r-dimensional
features and the features of the other users or movies to find similar users
or movies.

Joe: (0.13, 0.02, −0.01)

Jane: (0.07, −0.29, 0.32)

The Euclidean distance between Joe and Jane is about 0.457.
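A quick check of this arithmetic:

```python
# Euclidean distance between the two 3-dimensional concept vectors.
import numpy as np

joe = np.array([0.13, 0.02, -0.01])
jane = np.array([0.07, -0.29, 0.32])
print(np.linalg.norm(joe - jane))   # approx 0.457
```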


Autoencoder

An autoencoder is a neural network trained to copy its input to its output; in
other words, it tries to approximate the identity function.

It contains two parts:

• Encoder: maps the input to a hidden representation
• Decoder: maps the hidden representation to the output
Autoencoder

We do not really care about the copying itself.

Because the autoencoder is forced to select which aspects of the input to
preserve, it can hopefully learn useful properties of the data.

The hidden representation h is the result after dimension reduction; the
dimension of h is usually much smaller than that of the input x.


Autoencoder
We constrain h to have a smaller dimension than the input.

Training: minimize a reconstruction loss L(x, r), where r is the network's
reconstruction of the input x.


Autoencoder
Special case: the encoder $f$ and decoder $g$ are linear, and $L$ is the mean
squared error:

$L = \lVert x - r \rVert^{2}$

This special case has been proven to be equivalent to Principal Component Analysis.
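The slides do not prescribe a framework; below is a minimal sketch of such an autoencoder assuming PyTorch, with illustrative layer sizes and random stand-in data. Dropping the ReLU makes the encoder and decoder linear, which corresponds to the special case above.

```python
# Minimal autoencoder sketch (assumes PyTorch; sizes and data are illustrative).
import torch
from torch import nn

input_dim, hidden_dim = 144, 16        # e.g. 12x12 patches -> 16-D codes

encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim))
loss_fn = nn.MSELoss()                 # L = ||x - r||^2 (averaged over entries)
optimizer = torch.optim.Adam(list(encoder.parameters()) +
                             list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, input_dim)          # a dummy batch standing in for real data
for _ in range(100):
    h = encoder(x)                     # low-dimensional representation h
    r = decoder(h)                     # reconstruction of the input
    loss = loss_fn(r, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, encoder(x) gives the dimension-reduced representation.
```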


Implementation
Scikit-learn (sklearn) provides ready-to-use implementations of both PCA and SVD.
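A minimal usage sketch; the data here is random and only for illustration, and TruncatedSVD is scikit-learn's SVD-based reducer.

```python
# PCA and truncated SVD with scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

X = np.random.default_rng(0).random((100, 20))   # 100 points in 20-D

pca = PCA(n_components=2)                        # keep the top 2 components
X_pca = pca.fit_transform(X)
print(X_pca.shape, pca.explained_variance_ratio_)

svd = TruncatedSVD(n_components=2)               # also works on sparse matrices
X_svd = svd.fit_transform(X)
print(X_svd.shape)
```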
References

Mining Massive Datasets, Chapter 11

http://infolab.stanford.edu/~ullman/mmds/ch11.pdf
Summary

We recapped the relevant linear algebra.

We introduced three dimension reduction methods: PCA, SVD, and the
autoencoder.

We went through some applications of dimension reduction.
Questions?
