Feature Extraction Techniques
• So far we have discussed various approaches
for Feature Selection.
Feature Selection vs Extraction
• Feature selection keeps a subset of the original
features while feature extraction creates new ones
• Feature extraction creates a new, smaller set
of features that still captures most of the useful
information.
Why Feature Extraction?
• A simple model fed with meaningful features will
often perform better than a sophisticated algorithm fed
with low-quality features (garbage in, garbage out).
• This becomes even more important when the
number of features is very large.
• Feature extraction fills this requirement: it builds
valuable information from raw data by reformatting,
combining, and transforming primary features into new
ones.
• Regularization can certainly help reduce the
risk of overfitting, but feature extraction techniques
can also bring other advantages, such as:
– Accuracy improvements.
– Overfitting risk reduction.
– Speed up in training.
– Improved Data Visualization.
– Increase in explainability of our model.
Feature Extraction Types
• As a stand-alone task, feature extraction can
be categorized into:
– Unsupervised
• Principal Component Analysis (PCA)
– Supervised
• Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)
• PCA is an unsupervised algorithm that creates
linear combinations of the original features.
• The new features are orthogonal, which means
that they are uncorrelated.
• Furthermore, they are ranked in order of their
"explained variance."
• The first principal component (PC1) explains the
most variance in your dataset, PC2 explains the
second-most variance, and so on.
When should I use PCA?
• Do you want to reduce the number of variables, but
aren’t able to identify variables to completely
remove from consideration?
• Do you want to ensure your variables are
uncorrelated with one another?
• Are you comfortable making your independent
variables less interpretable?
• If you answered “yes” to all three questions, then
PCA is a good method to use.
Mathematics Behind PCA
• The whole process consists of six steps:
– Take the whole dataset consisting of d+1 dimensions and
ignore the labels such that our new dataset becomes d
dimensional.
– Compute the mean for every dimension of the whole
dataset.
– Compute the covariance matrix of the whole dataset.
– Compute eigenvectors and the corresponding eigenvalues.
– Sort the eigenvectors by decreasing eigenvalues and
choose k eigenvectors with the largest eigenvalues to form
a d × k dimensional matrix W.
– Use this d × k eigenvector matrix to transform the samples
onto the new subspace.
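The six steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `pca_fit_transform`, the toy dataset, and the choice of k = 2 are all assumptions made for the example.

```python
import numpy as np

def pca_fit_transform(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # Step 2: mean of every dimension (column), then center the data.
    mean = X.mean(axis=0)
    X_centered = X - mean
    # Step 3: d x d covariance matrix (rowvar=False: columns are features).
    cov = np.cov(X_centered, rowvar=False)
    # Step 4: eigh is suitable because a covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 5: sort by decreasing eigenvalue and keep the top k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                       # d x k matrix
    # Step 6: transform the samples onto the new subspace.
    Z = X_centered @ W
    explained_ratio = eigvals[:k] / eigvals.sum()
    return Z, W, explained_ratio

# Step 1: a hypothetical dataset with d + 1 = 4 columns; the last is the label.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 3)) @ np.array([[3.0, 1.0, 0.5],
                                                  [0.0, 1.0, 0.2],
                                                  [0.0, 0.0, 0.1]])
labels = rng.integers(0, 2, size=(100, 1))
data = np.hstack([features, labels])
X = data[:, :-1]                             # ignore the label column

Z, W, ratio = pca_fit_transform(X, k=2)
print(Z.shape)    # (100, 2)
print(ratio)      # share of total variance explained by PC1 and PC2
```

The returned `ratio` is sorted in decreasing order, which matches the ranking of components by explained variance.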
• Step-1 (Ignoring the class label)
• Step-2 (Compute the mean of every dimension)
Mean of Matrix A
• Step-3 (Compute the covariance matrix)
a) The covariance between math and English is positive (360), and the
covariance between math and art is positive (180). This means the scores tend
to covary in a positive way. As scores on math go up, scores on art and English
also tend to go up; and vice versa.
b) The covariance between English and art, however, is zero. This means there
tends to be no predictable relationship between the movement of English and
art scores.
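As a rough illustration of how such a covariance matrix is computed, the sketch below uses made-up scores for five students. The data is an assumption, chosen only so that the off-diagonal covariances match the values quoted above (360, 180 and 0); it is not taken from the original example. It also assumes the population covariance (dividing by n rather than n - 1).

```python
import numpy as np

# Illustrative scores (assumed, not from the original slide).
# Rows are students; columns are math, English, art.
scores = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
], dtype=float)

# bias=True divides by n (population covariance) instead of n - 1.
cov = np.cov(scores, rowvar=False, bias=True)
print(cov)
# cov[0, 1] is the math/English covariance, cov[0, 2] is math/art,
# and cov[1, 2] is English/art.
```

The diagonal entries are the variances of each subject, and the matrix is symmetric because the covariance of A with B equals the covariance of B with A.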
• Step-4 (Compute eigenvectors and the corresponding eigenvalues)
• An eigenvector is a nonzero vector whose direction remains
unchanged when a linear transformation is applied to it; it is
only scaled by the corresponding eigenvalue.
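The defining property can be checked numerically: for a matrix A, an eigenvector v and its eigenvalue λ satisfy A v = λ v. The small symmetric matrix below is a hypothetical example chosen only for illustration.

```python
import numpy as np

# Hypothetical symmetric matrix used only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order;
# the eigenvectors are the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(A)

for lam, v in zip(eigvals, eigvecs.T):
    # A @ v points along v and is scaled by the eigenvalue lam.
    print(lam, np.allclose(A @ v, lam * v))
```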
• So, after solving for eigenvectors we would get the
following solution for the corresponding eigenvalues
Kernel Principal Component Analysis
• We define the nonlinear mapping $T(x_1, x_2) = (z_1, z_2, z_3) = (x_1, x_2, x_1^2 + x_2^2)$.
• This mapping projects the data from a lower-dimensional (2D) space to a higher-dimensional (3D) space.
• Now the considered dataset becomes linearly separable.
• The optimal surface that PCA finds will lie midway between the blue
and brown points.
• If we invert the transformation, this surface will
correspond to a nonlinear curve in the original 2D space.
• By projecting on that curve, we can find the optimal new
• But doing PCA in a high-dimensional space requires a lot of
computation, so to solve this problem we use
kernel methods.
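The effect of the mapping T(x1, x2) = (x1, x2, x1^2 + x2^2) can be sketched on a toy dataset. The two concentric rings below are hypothetical data standing in for the two point classes; the radii and the function name `lift` are assumptions made for the example.

```python
import numpy as np

def lift(X):
    """Map each point (x1, x2) to (x1, x2, x1^2 + x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2 + x2**2])

# Hypothetical data: two concentric rings that no straight line separates in 2D.
rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, size=200)
radii = np.where(np.arange(200) < 100, 1.0, 3.0)   # inner ring, then outer ring
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

Z = lift(X)
inner_z3, outer_z3 = Z[:100, 2], Z[100:, 2]
# In 3D the third coordinate is the squared radius, so any plane z3 = c
# with c between the two squared radii separates the rings.
print(inner_z3.max(), outer_z3.min())
```

Computing the lifted coordinates explicitly is cheap here, but it becomes expensive as the target dimension grows, which is the cost that kernel methods avoid.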
Commonly Used Kernels
• Linear kernel: the ordinary dot product between any
two given observations;
$K(x, x_i) = \sum_j x_j \, x_{ij}$
• Polynomial kernel: a more
generalized form of the linear kernel;
$K(x, x_i) = \left(1 + \sum_j x_j \, x_{ij}\right)^d$
d is the degree of the polynomial.
• Radial Basis Function (RBF) kernel: maps the input space
into an infinite-dimensional space;
$K(x, x_i) = \exp\left(-\gamma \sum_j (x_j - x_{ij})^2\right)$
γ (gamma) is a positive parameter, typically chosen between 0 and 1 (0.1 is a common default).
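The three kernels, and one way they can be plugged into kernel PCA, might look like the sketch below. The function names, the random toy data, and the centering and projection steps are assumptions based on the standard kernel PCA formulation, not details given in the text above.

```python
import numpy as np

def linear_kernel(x, xi):
    return np.dot(x, xi)

def polynomial_kernel(x, xi, d=2):
    return (1.0 + np.dot(x, xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    return np.exp(-gamma * np.sum((x - xi) ** 2))

def kernel_pca(X, kernel, k):
    """Project the n samples in X onto the top-k kernel principal components."""
    n = X.shape[0]
    # Kernel (Gram) matrix: pairwise kernel values between all samples.
    K = np.array([[kernel(a, b) for b in X] for a in X])
    # Center the data in feature space without computing the mapping itself.
    one_n = np.full((n, n), 1.0 / n)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decompose and keep the k largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    order = np.argsort(eigvals)[::-1][:k]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Projection of each training sample: eigenvector scaled by sqrt(eigenvalue).
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

x = np.array([1.0, 2.0])
xi = np.array([3.0, 4.0])
print(linear_kernel(x, xi))          # 1*3 + 2*4 = 11.0
print(polynomial_kernel(x, xi, 2))   # (1 + 11)^2 = 144.0
print(rbf_kernel(x, xi, 0.1))        # exp(-0.1 * 8)

# Hypothetical toy data for the projection.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Z = kernel_pca(X, lambda a, b: rbf_kernel(a, b, gamma=0.1), k=2)
print(Z.shape)                       # (50, 2)
```

Only kernel values between pairs of samples are needed, so the high-dimensional mapping is never computed explicitly.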
Independent Component Analysis
• Independent Component Analysis (ICA) is a machine
learning technique to separate independent sources
from a mixed signal.
• Unlike principal component analysis, which focuses on
maximizing the variance of the data points,
independent component analysis focuses on
independence, i.e. independent components.
• Problem: To extract independent sources’ signals from a
mixed signal composed of the signals from those sources.
Given: Mixed signal from five different independent
sources.
Aim: To decompose the mixed signal into independent
sources:
– Source 1, Source 2, Source 3, Source 4, Source 5
• Solution: Independent Component Analysis (ICA).
• Consider the Cocktail Party Problem, a classic Blind Source
Separation problem, to understand what
independent component analysis solves.
• Restrictions on ICA –
• The independent components generated by the ICA are
assumed to be statistically independent of each other.
• The independent components generated by the ICA must
have a non-Gaussian distribution.
• The number of independent components generated by
the ICA is equal to the number of observed mixtures.
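A small blind source separation experiment can illustrate these restrictions. The text does not name a specific algorithm, so the sketch below assumes a FastICA-style fixed-point iteration with a tanh non-linearity. The two source signals, the mixing matrix `A`, and all function names are hypothetical choices for the example. The sources are non-Gaussian, and the number of estimated components equals the number of observed mixtures.

```python
import numpy as np

def whiten(X):
    """Center X (n_signals x n_samples) and give it identity covariance."""
    X = X - X.mean(axis=1, keepdims=True)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X))
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T @ X

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fast_ica(X, n_iter=200, tol=1e-6, seed=0):
    """Estimate as many independent components as there are mixtures in X."""
    Xw = whiten(X)
    n, m = Xw.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.normal(size=(n, n)))
    for _ in range(n_iter):
        G = np.tanh(W @ Xw)                  # non-linearity g
        G_prime = 1.0 - G ** 2               # its derivative g'
        W_new = G @ Xw.T / m - G_prime.mean(axis=1)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when each new row is parallel to the previous one.
        delta = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if delta < tol:
            break
    return W @ Xw

t = np.linspace(0, 8, 2000)
S = np.vstack([np.sin(2 * t), np.sign(np.sin(3 * t))])  # two non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])                               # hypothetical mixing matrix
X = A @ S                                                # two observed mixtures

S_est = fast_ica(X)
# Sources are recovered only up to order, sign and scale, so compare them
# with the true sources through absolute correlations.
corr = np.abs(np.corrcoef(S, S_est)[:2, 2:])
print(corr.round(2))
```

Each row of `corr` should contain one value close to 1, indicating which estimated component corresponds to which original source.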
Principal Component Analysis | Independent Component Analysis
It reduces the dimensions to avoid the problem of overfitting. | It decomposes the mixed signal into its independent sources’ signals.
It deals with the Principal Components. | It deals with the Independent Components.
It focuses on maximizing the variance. | It doesn’t focus on the issue of variance among the data points.
It focuses on the mutual orthogonality property of the principal components. | It doesn’t focus on the mutual orthogonality of the components.
It doesn’t focus on the mutual independence of the components. | It focuses on the mutual independence of the components.