Feature Extraction Techniques

Introduction
• So far we have discussed various approaches for Feature Selection; we now turn to Feature Extraction.
Feature Selection vs Extraction
• Feature selection keeps a subset of the original features, while feature extraction creates entirely new ones.
• Feature extraction builds a new, smaller set of features that still captures most of the useful information.
Why Feature Extraction?
• A simple model fed with meaningful features will generally perform better than a sophisticated algorithm fed with low-quality features (garbage in, garbage out).
• This becomes even more important when the number of features is very large.
• Feature extraction fills this requirement: it builds valuable information from raw data by reformatting, combining and transforming the primary features into new ones.
Why Feature Extraction?
• Using Regularization could certainly help reduce the
risk of overfitting, but Feature Extraction techniques
can also lead to other types of advantages such as:
– Accuracy improvements.
– Overfitting risk reduction.
– Speed up in training.
– Improved Data Visualization.
– Increase in explainability of our model.
Feature Extraction Types
• As a stand-alone task, feature extraction can
be categorized into:
– Unsupervised
• Principal Component Analysis (PCA)
– Supervised
• Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)
• PCA is an unsupervised algorithm that creates
linear combinations of the original features.
• The new features are orthogonal, which means
that they are uncorrelated.
• Furthermore, they are ranked in order of their
"explained variance."
• The first principal component (PC1) explains the
most variance in your dataset, PC2 explains the
second-most variance, and so on.
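As a quick illustration, here is a minimal Python sketch (assuming NumPy and scikit-learn are available; the data matrix is random and purely illustrative) showing that the fitted components are ranked by explained variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 original features

pca = PCA(n_components=5).fit(X)

# Components are returned in order of decreasing explained variance:
# PC1 explains the most variance, PC2 the second-most, and so on.
print(pca.explained_variance_ratio_)     # sorted in descending order
```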
When should I use PCA?
• Do you want to reduce the number of variables, but
aren’t able to identify variables to completely
remove from consideration?
• Do you want to ensure your variables are
independent of one another?
• Are you comfortable making your independent
variables less interpretable?
• If you answered “yes” to all three questions, then
PCA is a good method to use.
Mathematics Behind PCA
• The whole process consists of six steps:
– Take the whole dataset consisting of d+1 dimensions and
ignore the labels such that our new dataset becomes d
dimensional.
– Compute the mean for every dimension of the whole
dataset.
– Compute the covariance matrix of the whole dataset.
– Compute eigenvectors and the corresponding eigenvalues.
– Sort the eigenvectors by decreasing eigenvalues and
choose k eigenvectors with the largest eigenvalues to form
a d × k dimensional matrix W.
– Use this d × k eigenvector matrix to transform the samples
onto the new subspace.
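The six steps can be sketched directly in NumPy; the function below is a minimal illustration rather than a production implementation, and X stands for any n-samples × d-features matrix with the label column already removed:

```python
import numpy as np

def pca_manual(X, k):
    # Step 2: mean of every dimension
    mean = X.mean(axis=0)
    # Step 3: covariance matrix of the (centred) dataset
    cov = np.cov(X - mean, rowvar=False)
    # Step 4: eigenvectors and eigenvalues (eigh, since the covariance matrix is symmetric)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 5: sort by decreasing eigenvalue and keep the top k -> d x k matrix W
    order = np.argsort(eigvals)[::-1][:k]
    W = eigvecs[:, order]
    # Step 6: transform the samples onto the new k-dimensional subspace
    return (X - mean) @ W

X = np.random.default_rng(1).normal(size=(10, 3))   # toy data: 10 samples, d = 3
print(pca_manual(X, k=2).shape)                      # (10, 2)
```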
Example
• Step-1 (Ignoring the class label)
Example
• Step-2 (Compute the mean of every dimension)

Mean of the data matrix A, computed column by column.
Example
• Step-3 (Compute the covariance matrix of the whole dataset)

a) The covariance between math and English is positive (360), and the
covariance between math and art is positive (180). This means the scores tend
to covary in a positive way. As scores on math go up, scores on art and English
also tend to go up; and vice versa.
b) The covariance between English and art, however, is zero. This means there
tends to be no predictable relationship between the movement of English and
art scores.
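The score table itself appears on the slide only as an image, so the matrix A in the following NumPy sketch is an assumed reconstruction chosen so that its covariances match the values quoted above (math–English = 360, math–art = 180, English–art = 0); treat it as illustrative:

```python
import numpy as np

A = np.array([[90, 60, 90],      # columns: math, English, art (assumed values)
              [90, 90, 30],
              [60, 60, 60],
              [60, 60, 90],
              [30, 30, 30]], dtype=float)

mean = A.mean(axis=0)                      # Step 2: mean of every dimension
cov = np.cov(A, rowvar=False, bias=True)   # Step 3: population covariance matrix
print(mean)                                # [66. 60. 60.]
print(cov)                                 # off-diagonals: 360, 180 and 0, as quoted above
```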
Example
• Step-4 (Compute eigenvectors and the corresponding eigenvalues)
• An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.
• Let A be a square matrix, ν a vector and λ a scalar that satisfies Aν = λν; then λ is called an eigenvalue associated with the eigenvector ν of A.
• The eigenvalues of A are the roots of the characteristic equation det(A − λI) = 0.
Example
• Step-4 (Compute eigenvectors and the corresponding eigenvalues)
• We calculate det(A − λI) first, where A here denotes the covariance matrix from Step-3 and I is the identity matrix.
• Simplifying the matrix A − λI first, we can calculate the determinant later.
Example
• Step-4 (Compute eigenvectors and the corresponding eigenvalues)
• Now that we have our simplified matrix, we can find its determinant using the cofactor rule for a 3 × 3 matrix with rows (a, b, c), (d, e, f), (g, h, i):

|A| = a(ei − fh) − b(di − fg) + c(dh − eg)

• Equating the resulting polynomial det(A − λI) to zero gives the characteristic equation, whose roots are the eigenvalues.
Example
• Step-4 (Compute eigenvectors and the corresponding eigenvalues)
• After solving this equation for λ, we get the following three eigenvalues: 44.81966, 629.110396 and 910.06995.
• Now we need to calculate the eigenvector corresponding to each of these eigenvalues:
– Eigenvector corresponding to 44.81966
– Eigenvector corresponding to 629.110396
– Eigenvector corresponding to 910.06995
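These eigenvalues can be checked numerically; the NumPy sketch below uses the covariance matrix reconstructed in the earlier step (an assumption, since the slide shows it only as an image):

```python
import numpy as np

cov = np.array([[504., 360., 180.],        # assumed covariance matrix from the earlier step
                [360., 360.,   0.],
                [180.,   0., 720.]])

eigvals, eigvecs = np.linalg.eig(cov)
print(eigvals)     # approximately 910.07, 44.82 and 629.11 (order is not guaranteed)
# eigvecs[:, i] is the eigenvector corresponding to eigvals[i]
```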
Example
• Step-4 (Compute eigenvectors and the corresponding eigenvalues)
• So, after solving, we obtain one eigenvector for each of the eigenvalues above.
• Step-5 (Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d × k dimensional matrix W)
• So, after sorting the eigenvalues in decreasing order, we have: 910.06995, 629.110396, 44.81966.
Example
• Assume that we are reducing the 3-dimensional feature space to a 2-dimensional feature subspace; we combine the 2 eigenvectors with the highest eigenvalues to construct our d × k (3 × 2) dimensional eigenvector matrix W.
• Step-6 (Transform the samples onto the new subspace)
• We use the 3 × 2 dimensional matrix W that we just computed to transform our samples onto the new subspace via the equation y = W′ × x, where W′ is the 2 × 3 transpose of the matrix W.
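Steps 5 and 6 can be sketched in NumPy as follows, again using the assumed score matrix A from the earlier example; the projection at the end is the matrix form of y = W′ × x applied to every centred sample at once:

```python
import numpy as np

A = np.array([[90., 60., 90.], [90., 90., 30.], [60., 60., 60.],
              [60., 60., 90.], [30., 30., 30.]])          # assumed score matrix

cov = np.cov(A, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eig(cov)

# Step 5: sort eigenvectors by decreasing eigenvalue and keep the top k = 2
order = np.argsort(eigvals)[::-1][:2]
W = eigvecs[:, order]                                     # d x k = 3 x 2 matrix W

# Step 6: y = W' x for every centred sample (done for the whole matrix at once)
Y = (A - A.mean(axis=0)) @ W
print(Y.shape)                                            # (5, 2): 5 samples in the 2-D subspace
```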
Points to remember
• You should always normalize your dataset before performing PCA because the transformation is dependent on scale. If you don't, the features that are on the largest scale will dominate your new principal components (see the sketch after this list).
• Strengths: PCA is a versatile technique that works well in practice. It's fast and simple to implement, which means you can easily test algorithms with and without PCA to compare performance. In addition, PCA offers several variations and extensions (e.g. kernel PCA, sparse PCA) to tackle specific roadblocks.
• Weaknesses: The new principal components are not
interpretable, which may be a deal-breaker in some settings.
In addition, you must still manually set or tune a threshold for
cumulative explained variance.
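A minimal scikit-learn sketch of the scaling point above (the data are synthetic and purely illustrative): without standardization, the large-scale feature dominates the first principal component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 200),       # small-scale feature
                     rng.normal(0, 1000, 200)])   # large-scale feature

print(PCA(n_components=2).fit(X).explained_variance_ratio_)
# -> PC1 is almost entirely the large-scale feature

X_scaled = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_scaled).explained_variance_ratio_)
# -> after standardization the variance is shared much more evenly
```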
PCA vs. LDA
• In case of uniformly distributed data, LDA almost always
performs better than PCA.
• However if the data is highly skewed (irregularly
distributed) then it is advised to use PCA since LDA can
be biased towards the majority class.
• It is beneficial that PCA can be applied to labeled as well
as unlabeled data since it doesn't rely on the output
labels.
• On the other hand, LDA requires output classes for
finding linear discriminants and hence requires labeled
data.
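A minimal scikit-learn sketch of this difference (using the bundled Iris data purely as an example): PCA is fitted without the labels, while LDA cannot be fitted without them.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                             # unsupervised: labels unused
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised: labels required

print(X_pca.shape, X_lda.shape)   # (150, 2) (150, 2)
```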
Motivation for Non-linear PCA using Kernels
• When the data have a non-linear structure, linear projections will not detect the pattern.
Kernel Principal Component Analysis
• Kernel Principal Component Analysis (KPCA) is a non-linear dimensionality reduction technique.
• It is an extension of Principal Component Analysis (PCA), which is a linear dimensionality reduction technique, using kernel methods.
• Sometimes the structure of the data is non-linear, and the principal components will not give us the optimal dimensionality reduction, so we use non-linear methods like Kernel PCA.
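A minimal scikit-learn sketch of Kernel PCA on data that linear PCA cannot untangle: two concentric circles. The RBF kernel and the gamma value are illustrative choices, not prescribed by the slides.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                     # linear PCA: circles stay tangled
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Along the first kernel principal component the two circles separate,
# which no linear projection of this 2-D data can achieve.
print(X_kpca.shape)   # (400, 2)
```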
Intuition behind Kernel PCA
• The idea of KPCA relies on the intuition that many datasets which are not linearly separable in their original space can be made linearly separable by projecting them into a higher dimensional space.
• The added dimensions are just simple arithmetic operations performed on the original data dimensions.
• So we project our dataset into a higher dimensional feature space and, because the data become linearly separable there, we can apply PCA on this new dataset.
• Performing this linear dimensionality reduction in that space is equivalent to a non-linear dimensionality reduction in the original space.
Example

We define the mapping T(x1, x2) = (z1, z2, z3) = (x1, x2, x1² + x2²).

This mapping function projects the data from a lower dimension (2D) to a higher dimension (3D).
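A small NumPy sketch of this explicit mapping, applied to synthetic concentric-circles data (the dataset and its parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# T(x1, x2) = (x1, x2, x1^2 + x2^2): add the squared radius as a third coordinate.
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

# In 3-D the outer and inner circles sit at different heights along z3,
# so a plane (and hence linear PCA) can now separate them.
print(Z[y == 0, 2].mean(), Z[y == 1, 2].mean())   # roughly 1.0 vs 0.1
```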
Kernel Principal Component Analysis
• Now the considered dataset becomes linearly separable.
• The optimal surface that PCA finds will lie in the middle, between the blue and the brown points.
• If we invert the transformation, this surface corresponds to a non-linear curve in the original 2D space.
• By projecting onto that curve, we can find the optimal new features.
• But doing PCA in a high dimensional space requires a lot of computation, so to avoid this cost we use kernel methods.
Kernel Principal Component Analysis
Commonly used Kernels
• A linear kernel is simply the dot product of any two given observations:
K(x, xi) = sum(x * xi)
• Polynomial kernel: a more generalized form of the linear kernel:
K(x, xi) = (1 + sum(x * xi))^d, where d is the degree of the polynomial.
• Radial Basis Function (RBF) kernel: can map an input space into an infinite dimensional space:
K(x, xi) = exp(-gamma * sum((x - xi)^2)), where gamma is a positive parameter, often set between 0 and 1 (0.1 is a common choice).
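The three kernels can be written out in a few lines of Python; the sketch below works on two single observations x and xi, and the degree d and gamma values are illustrative defaults:

```python
import numpy as np

def linear_kernel(x, xi):
    return np.dot(x, xi)                            # K(x, xi) = x . xi

def polynomial_kernel(x, xi, d=3):
    return (1 + np.dot(x, xi)) ** d                 # K(x, xi) = (1 + x . xi)^d

def rbf_kernel(x, xi, gamma=0.1):
    return np.exp(-gamma * np.sum((x - xi) ** 2))   # K(x, xi) = exp(-gamma * ||x - xi||^2)

x = np.array([1.0, 2.0])
xi = np.array([2.0, 0.5])
print(linear_kernel(x, xi), polynomial_kernel(x, xi), rbf_kernel(x, xi))
```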
Independent Component Analysis
• Independent Component Analysis (ICA) is a machine
learning technique to separate independent sources
from a mixed signal.
• Unlike principal component analysis, which focuses on maximizing the variance of the data points, independent component analysis focuses on statistical independence, i.e. on finding independent components.
Independent Component Analysis
• Problem: to extract the independent sources’ signals from a mixed signal composed of the signals from those sources.
– Given: a mixed signal from five different independent sources.
– Aim: to decompose the mixed signal into its independent sources: Source 1, Source 2, Source 3, Source 4, Source 5.
– Solution: Independent Component Analysis (ICA).
Independent Component Analysis
• Consider the Cocktail Party Problem, or Blind Source Separation problem, to understand the kind of problem that is solved by independent component analysis.
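A minimal scikit-learn sketch of this blind source separation idea using FastICA; the two sources and the mixing matrix are invented for illustration, not taken from the slides:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t),               # source 1: sinusoid
                     np.sign(np.sin(3 * t))])     # source 2: square wave
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])                        # assumed mixing matrix
X = S @ A.T                                       # observed "microphone" mixtures

S_est = FastICA(n_components=2, random_state=0).fit_transform(X)
print(S_est.shape)   # (2000, 2): recovered sources (up to order and scaling)
```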
Independent Component Analysis
• Restrictions on ICA:
• The independent components generated by ICA are assumed to be statistically independent of each other.
• The independent components generated by ICA must have a non-Gaussian distribution.
• The number of independent components generated by ICA is equal to the number of observed mixtures.
PCA Vs. ICA
• Principal Component Analysis:
– It reduces the dimensions to avoid the problem of overfitting.
– It deals with the principal components.
– It focuses on maximizing the variance.
– It focuses on the mutual orthogonality property of the principal components.
– It doesn’t focus on the mutual independence of the components.
• Independent Component Analysis:
– It decomposes the mixed signal into its independent sources’ signals.
– It deals with the independent components.
– It doesn’t focus on the issue of variance among the data points.
– It doesn’t focus on the mutual orthogonality of the components.
– It focuses on the mutual independence of the components.
