
Feature Engineering

Prof. Gajendra P.S. Raghava


Head, Department of Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Feature Engineering: Few Facts
 Most important part of prediction/classification
 It is an art; it is human-driven design
 A number of techniques have been used in the past
 Principles are not well defined or validated
 Commonly used feature selection techniques will be discussed
 Theoretical view: more features means more discriminative power
 In practice, identifying the major features is more important
 Optimization of features is a major challenge

Example: Final Grade Prediction based on marks up to mid-sem
Possible Features
1. Quiz1 marks
2. Quiz2 marks
3. Assignment 1 marks
4. Mid-sem marks
5. Age of student
6. Height of student
7. Quiz1 + Quiz2 marks
8. Total marks
9. Attendance of student
10. Weight of student
11. Marks in 12th class
12. Marks in 10th class
Why feature reduction?
(Example: Final Grade Prediction based on marks up to mid-sem, with the possible features listed above)
➢ Thousands of variables/features in many domains
➢ Many features are irrelevant or redundant
➢ The probability distribution can be very complex and hard to estimate
➢ Irrelevant and redundant features can “confuse” learners
➢ Limited training data
➢ Limited computational resources
➢ Curse of dimensionality
Feature Engineering: Overview
➢ Data Preprocessing: Data Imputation, Handling outliers, Data encoding, Log transformation, Feature Scaling
➢ Dimension Reduction / Feature Generation: PCA, SVD, LDA, tSNE, UMAP
➢ Feature Selection
   • Filter Methods: Correlation Coefficient, Chi-squared, ANOVA
   • Wrapper Methods: Recursive Search, Genetic Algorithm, Random Search
   • Embedded Methods: Decision Tree, Lasso Regularization
➢ Univariate Analysis, Multivariate Analysis, Python Libraries
Data Imputation
(Handling Missing Data)
 Drop rows/samples if sufficient data remain
 Replace a missing value by the mean of the rest of the values in the column (variable)
 Replace a missing value by the median of the rest of the values in the column (variable)
 Impute missing data with models (using the other features, not the target)
 K-NN, Regression, Deep learning
 MaCH is a tool for genotype imputation and haplotyping using WGS sequence data; it uses a Markov chain approach.
 AutoImpute: Autoencoder-based imputation of single-cell RNA-seq data
 mice: Multivariate Imputation by Chained Equations
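As a minimal sketch (not part of the original slides), mean-based and K-NN-based imputation could look like this with scikit-learn; the toy matrix is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing values (NaN)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)  # replace NaN by the column mean
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)        # model-based: nearest-neighbour imputation
print(X_mean)
print(X_knn)
```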
Outliers in Data
 Extremely high or low value in the data
 Detection: Z-score based outlier detection
 Possible applications
 Credit card fraud detection
 Telecommunication fraud detection
 Network intrusion detection
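A minimal sketch of Z-score based outlier detection (the 2-sigma threshold and toy data are assumptions for illustration):

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95], dtype=float)
z = (x - x.mean()) / x.std()          # Z-score of each value
outliers = x[np.abs(z) > 2.0]         # flag values more than 2 standard deviations from the mean
print(outliers)                       # -> [95.]
```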
Encoding: One-hot encoding converts a string into a vector
(Binary profile of the sequence “AGRTHLM”)

Position A R N D C E Q G H I L K M F P S T W Y V
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
R 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
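A minimal sketch of how the binary profile above could be generated, assuming the same 20-column amino-acid order:

```python
import numpy as np

ALPHABET = list("ARNDCEQGHILKMFPSTWYV")  # column order used in the table above

def binary_profile(seq):
    """Return a len(seq) x 20 one-hot (binary) profile of an amino-acid sequence."""
    profile = np.zeros((len(seq), len(ALPHABET)), dtype=int)
    for i, aa in enumerate(seq):
        profile[i, ALPHABET.index(aa)] = 1
    return profile

print(binary_profile("AGRTHLM"))
```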
Log transformation
 The log function is the inverse of the exponential function: it is defined such that log10(10^x) = x.
 This means that the log function maps the small range of numbers in (0, 1) to the entire range of negative numbers (–∞, 0).
 The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on.
 In other words, the log function compresses the range of large numbers: the larger x is, the slower log(x) increases.
 The log base may be e, 2, or 10 (i.e., ln, log2, log10).
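A minimal sketch of a log transformation with NumPy; adding 1 before taking the log (log1p) is a common convention, assumed here, to handle zero values:

```python
import numpy as np

counts = np.array([0, 1, 9, 99, 999], dtype=float)
print(np.log10(counts + 1))   # base-10 log of (x + 1): [0. 0.301 1. 2. 3.]
print(np.log1p(counts))       # natural log of (1 + x)
```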
Feature Scaling
 Features often have different scales
 Features at different scales are difficult to handle in ML
 Feature scaling changes the scale of a feature (e.g., to percentage values)
 Scaling is done for individual features
 Commonly used techniques
 Normalization (0 to 1)
 Standardization (Variance Scaling)
 L2 normalization
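A minimal sketch of the three techniques listed above with scikit-learn (toy data; note that Normalizer applies L2 normalization per sample rather than per feature):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)         # normalization to [0, 1]
X_std = StandardScaler().fit_transform(X)          # standardization (variance scaling)
X_l2 = Normalizer(norm="l2").fit_transform(X)      # L2 normalization (row-wise)
```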
Feature Engineering roadmap (recap): Data Preprocessing → Dimension Reduction → Feature Selection (see the overview above).
Principal Component Analysis based
Feature Reduction
• PCA is an unsupervised learning algorithm used for dimensionality reduction.
• It converts correlated features into a set of linearly uncorrelated features using an orthogonal transformation.
• These new transformed features are called the Principal Components.
• It reduces the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d).
• It builds a recipe for converting many variables into a single value, called a Principal Component (PC), e.g.:
PC = (GeneA × 10) + (GeneB × 3) + (GeneC × −4) + (GeneD × −20) + …
Principal Component Analysis based Feature Reduction: Steps
[Figure: data in feature axes f1, f2 with principal eigenvectors e1, e2]
1. Calculate the covariance matrix of the features
2. Calculate the eigenvectors and eigenvalues of the covariance matrix
3. Sort the eigenvectors in descending order of their eigenvalues
4. Choose the first k eigenvectors; k is the number of new dimensions
5. Transform the original data into the k dimensions
Notes:
6. Assumption: the data is linear, continuous, and follows a Gaussian distribution.
7. PCA fails if the dataset does not have linear relationships; it also fails for categorical data.
8. PCA does not handle missing data directly.
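A minimal sketch of steps 1-5 with NumPy only; the random data and k = 2 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, d = 5 features
Xc = X - X.mean(axis=0)                  # center the data

cov = np.cov(Xc, rowvar=False)           # 1. covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # 2. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # 3. sort by eigenvalue, descending

k = 2
W = eigvecs[:, order[:k]]                # 4. first k eigenvectors
X_pca = Xc @ W                           # 5. project the data onto k dimensions
print(X_pca.shape)                       # (100, 2)
```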
Singular Value Decomposition (SVD)
 Normalize by moving the origin to the center of the dataset
 Find the eigenvectors of the data (or covariance) matrix
 These define the new space
 Sort the eigenvalues in “goodness” order
[Figure: feature axes f1, f2 with eigenvectors e1, e2]
Principal Component Analysis
Major Steps
 Standardize the data.
 Perform Singular Value Decomposition to get the eigenvectors and eigenvalues.
 Sort the eigenvalues in descending order and choose the top k eigenvectors.
 Construct the projection matrix from the selected k eigenvectors.
 Transform the original dataset via the projection matrix to obtain a k-dimensional feature subspace.
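A minimal sketch of the same steps with scikit-learn, which performs the SVD internally (random toy data, k = 3 assumed):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

X_std = StandardScaler().fit_transform(X)   # standardize the data
pca = PCA(n_components=3)                   # keep k = 3 principal components
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)        # variance explained by each component
```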
Linear Discriminant Analysis (LDA)
LDA is another feature reduction technique;
it is a supervised feature reduction technique.

Major steps in LDA
 Inter-class (between-class) variance: calculate the distance between the means of the different classes
 Intra-class (within-class) variance: calculate the distance between the mean and the samples of each class
 Construct the lower-dimensional space that maximizes the inter-class variance and minimizes the intra-class variance
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)
1. Compute the d-dimensional mean vectors.
2. Compute the scatter matrices
3. Compute the eigenvectors and corresponding eigenvalues for the
scatter matrices.
4. Sort the eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d×k-dimensional matrix
5. Transform the samples onto the new subspace.
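A minimal sketch of supervised reduction with LDA in scikit-learn; the iris data set is used only as a convenient example:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_lda = lda.fit_transform(X, y)                    # uses class labels, unlike PCA
print(X_lda.shape)                                 # (150, 2)
```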
t-distributed Stochastic Neighbor
Embedding (tSNE)
 tSNE is an alternative to PCA for feature reduction
 It overcomes a number of limitations of PCA
Important terms
Stochastic → not definite but random probability
Neighbor → retaining the variance of neighbor points
Embedding → plotting data into lower dimensions
How does tSNE work?
 All-vs-all table of pairwise cell to cell distances
 Perplexity = expected number of neighbours within a cluster
 Distances scaled relative to perplexity neighbours
[Figure: perplexity robustness]
tSNE Major Steps
 Step 1: t-SNE constructs a probability distribution on pairs in higher
dimensions such that similar objects are assigned a higher probability
and dissimilar objects are assigned lower probability.
 Step 2: Then, t-SNE builds a similar probability distribution in the lower-dimensional space and adjusts it iteratively until the Kullback-Leibler (KL) divergence is minimized.
 The Kullback-Leibler divergence is a measure of the difference between the probability distributions from Step 1 and Step 2.
 The KL divergence is the expected value of the logarithm of the ratio of these probability distributions: KL(P‖Q) = Σ p(i,j) log( p(i,j) / q(i,j) ).
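A minimal sketch of t-SNE with scikit-learn; the random data and perplexity value are assumptions for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                               # 200 samples, 50 features

tsne = TSNE(n_components=2, perplexity=30, random_state=0)   # minimizes the KL divergence
X_2d = tsne.fit_transform(X)
print(X_2d.shape)                                            # (200, 2)
```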
PCA vs tSNE
Limitations
• PCA: requires more than 2 dimensions; thrown off by quantised data; expects linear relationships
• tSNE: can’t cope with noisy data; loses the ability to cluster

Answer: combine the two methods and get the best of both worlds
• PCA: good at extracting signal from noise; extracts informative dimensions
• tSNE: can reduce to 2D well; can cope with non-linear scaling

This is what CellRanger does in its default analysis.
UMAP to overcome limitations of PCA
and tSNE
 UMAP is a replacement for tSNE designed to overcome its limitations
 It is conceptually similar to tSNE, but with a couple of relevant changes
 UMAP is faster than tSNE and preserves more global structure than tSNE
 UMAP can run on raw data without PCA preprocessing
 Instead of the single perplexity value in tSNE, UMAP defines two parameters:
 Nearest neighbours: the number of expected nearest neighbours – basically the same concept as perplexity
 Minimum distance: how tightly UMAP packs points which are close together
 Nearest neighbours affects the influence given to global vs local information; min dist affects how compactly packed the local parts of the plot are.
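A minimal sketch using the third-party umap-learn package (assumed installed via `pip install umap-learn`); n_neighbors and min_dist are the two parameters described above:

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
X_2d = reducer.fit_transform(X)
print(X_2d.shape)   # (500, 2)
```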
Practical approach PCA + tSNE/UMAP
 Filter heavily before starting
 Nicely behaving cells
 Expressed genes
 Variable genes

 Do PCA
 Extract most interesting signal
 Take top PCs. Reduce dimensionality (but not to 2)

 Do tSNE/UMAP
 Calculate distances from the PCA projections
 Scale distances and project into 2 dimensions
Feature Engineering roadmap (recap): Data Preprocessing → Dimension Reduction → Feature Selection (see the overview above).
Definition of Feature Selection
 Classification/Regression (Supervised Learning):
L = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)} ⊆ X × Y, where each sample is x = (x_1, ..., x_n)^T and the input space is X = f_1 × ... × f_i × ... × f_n.
 Feature selection: select features from F = {f_1, ..., f_i, ..., f_n} to obtain F′ ⊆ F.
 The reduced set F′ has fewer than or the same number of features as F.
Feature Selection / Extraction
 Feature Selection:
{f_1, ..., f_i, ..., f_n} —(feature selection)→ {f_{i_1}, ..., f_{i_j}, ..., f_{i_m}},
where i_j ∈ {1, ..., n}, j = 1, ..., m, and i_a = i_b ⇒ a = b for a, b ∈ {1, ..., m} (no feature is selected twice).
 Feature Extraction/Creation:
{f_1, ..., f_i, ..., f_n} —(feature extraction)→ {g_1(f_1, ..., f_n), ..., g_j(f_1, ..., f_n), ..., g_m(f_1, ..., f_n)},
i.e., each new feature g_j is a function of the original features.
Filter Methods: Variable Ranking
 A simple method for feature selection using variable ranking is to select the k highest-ranked features according to a scoring function S.
 This is usually not optimal,
 but often preferable to other, more complicated methods,
 and computationally efficient: only the calculation and sorting of n scores is required.
Filter Methods: Ranking by Correlation
Correlation criteria:
 Pearson correlation coefficient
R(f_i, y) = cov(f_i, y) / sqrt( var(f_i) · var(y) )
 Estimate for m samples:
R(f_i, y) = Σ_{k=1}^{m} (f_{k,i} − f̄_i)(y_k − ȳ) / sqrt( Σ_{k=1}^{m} (f_{k,i} − f̄_i)² · Σ_{k=1}^{m} (y_k − ȳ)² )

The higher the correlation between the feature and the target, the higher the score!
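A minimal sketch of correlation-based ranking: score each feature by |R(f_i, y)| and keep the top k (toy data and k = 2 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)  # y depends on features 0 and 3

scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
k = 2
top_k = np.argsort(scores)[::-1][:k]
print(top_k)   # indices of the two highest-ranked features (0 and 3 here)
```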
Filter Methods: Classification Methods
1. Difference in means of positive and negative samples
   Rank features by the maximum difference F = (P_mean − N_mean)
2. Significance of the difference in means (t-test)
3. Threshold-based feature selection
   Based on MCC
   Based on Accuracy
4. Ranking of features
Nomenclature
 Univariate method: considers one variable (feature) at a time.
 Multivariate method: considers subsets of variables (features) together.
 Filter method: ranks features or feature subsets independently of the predictor (classifier).
 Wrapper method: uses a classifier to assess features or feature subsets.
Univariate feature ranking
[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = −1) along feature xi, with class means μ− and μ+]
• Normally distributed classes with equal variance σ², unknown and estimated from the data as σ²_within.
• Null hypothesis H0: μ+ = μ−
• T statistic: if H0 is true,
t = (μ+ − μ−) / ( σ_within √(1/m+ + 1/m−) ) ~ Student(m+ + m− − 2 d.f.)
• The F-score is another commonly used univariate criterion.

Univariate feature selection: in filter selection, a single input variable is scored at a time against the target variable; such statistical measures are termed univariate statistical measures.
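A minimal sketch of univariate filter selection with an ANOVA F-test per feature, using SelectKBest from scikit-learn (iris data used as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # score each feature independently
X_new = selector.fit_transform(X, y)
print(selector.scores_)          # one F-score per feature
print(selector.get_support())    # mask of the k selected features
```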
Wrapper Methods
Perspectives: Search of a Subset of Features
Search Space: [Figure: lattice of all possible feature subsets]
Search Directions
 Sequential Forward Generation (SFG): It starts with an empty set of features S. As the search starts, features are added into S according to some criterion that distinguishes the best feature from the others. S grows until it reaches the full set of original features.
 Sequential Backward Generation (SBG): It starts with the full set of features and, iteratively, removes them one at a time. Here, the criterion must point out the worst or least important feature.
 Bidirectional Generation (BG): Begins the search in both directions, performing SFG and SBG concurrently. They stop in two cases: (1) when one search finds the best subset comprised of m features before it reaches the exact middle, or (2) when both searches reach the middle of the search space. It takes advantage of both SFG and SBG.
 Random Generation (RG) or Genetic Algorithm: It starts the search in a random direction. The choice of adding or removing a feature is a random decision. RG tries to avoid stagnation in local optima by not following a fixed path for subset generation. Unlike SFG or SBG, the size of the subset of features cannot be stipulated.
Perspectives: Search of a Subset of Features
Search Strategies
 Exhaustive Search: It explores all possible subsets to find the optimal ones. As said before, the space complexity is O(2^M). If we establish a threshold m of minimum features to be selected and the direction of search, the search space is the same regardless of forward or backward generation. Only exhaustive search can guarantee optimality. Nevertheless, it is impractical in real data sets with a high M.
 Heuristic Search: It employs heuristics to carry out the search. Thus, it avoids brute-force search, but it may find a non-optimal subset of features. It draws a path connecting the beginning and the end, much like a depth-first search. The maximum length of this path is M and the number of subsets generated is O(M). The choice of the heuristic is crucial to finding a near-optimal subset of features quickly.
 Nondeterministic Search: A complementary combination of the previous two. It is also known as the random search strategy; it can constantly generate good subsets and keep improving the quality of the selected features as time goes by. In each step, the next subset is obtained at random.
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!
• A wide range of heuristic search strategies can be used.
Two different classes:
– Forward selection (start with an empty feature set and add features at each step)
– Backward elimination (start with the full feature set and discard features at each step)
• Predictive power is usually measured on a validation set or by cross-validation.
• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
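A minimal sketch of a wrapper approach, sequential forward selection with scikit-learn, where the classifier is used as a black box and scored by cross-validation (the iris data and k-NN classifier are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", cv=5)   # forward selection, 5-fold CV
sfs.fit(X, y)
print(sfs.get_support())   # mask of the selected feature subset
```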
Embedded Methods
• These methods combine the benefits of both the wrapper and filter methods,
• while maintaining a reasonable computational cost.
• Embedded methods are iterative, carefully extracting the features that contribute the most to the training in a particular iteration.
• They search for a subset of features using many techniques, including:
  L1-Regularization
  Tree-based methods
L1-Regularization method
 Regularization consists of adding a penalty to the different
parameters to reduce the freedom of the model.
 In linear model regularization, the penalty is applied over the
coefficients that multiply each of the predictors.
 Lasso (L1) regularization is the most popular choice for feature selection
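A minimal sketch of embedded selection with L1 (Lasso) regularization in scikit-learn, where features whose coefficients are driven to zero are discarded (the diabetes data set and alpha value are assumptions for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)                          # larger alpha -> stronger penalty, fewer features
selector = SelectFromModel(lasso).fit(X_std, y)
print(selector.get_support())                     # mask of features with non-zero coefficients
```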
Tree Based Feature Selection
 Random Forests aggregate a specified number of decision trees.
 The decrease in impurity due to a feature can be measured.
 The more a feature decreases the impurity, the more important the feature is.
 In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.
 Random forests naturally rank features by how well they improve the purity of the nodes, in other words by the decrease in impurity (Gini impurity).
 Nodes with the greatest decrease in impurity occur at the start of the trees,
 while nodes with the least decrease in impurity occur at the end of the trees.
 Thus, by pruning trees below a particular node, we can create a subset of the most important features.
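A minimal sketch of tree-based feature ranking via the mean decrease in Gini impurity, using a random forest from scikit-learn (iris data used as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Impurity decrease per feature, averaged across all trees in the forest
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```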
Filters, Wrappers, and Embedded Methods (schematic)
 Filter: All features → Filter → Feature subset → Predictor
 Wrapper: All features → Multiple feature subsets ⇄ Predictor → selected feature subset
 Embedded: All features → Embedded method (selection during training) → Predictor
Filters
Methods:
 Criterion: Measure feature/feature subset
“relevance”
 Search: Usually order features (individual feature
ranking or nested subsets of features)
 Assessment: Use statistical tests

Results:
 Are (relatively) robust against overfitting
 May fail to select the most “useful” features
Wrappers
Methods:
 Criterion: Measure feature subset “usefulness”
 Search: Search the space of all feature subsets
 Assessment: Use cross-validation

Results:
 Can in principle find the most “useful” features,
but are prone to overfitting
Embedded Methods
Methods:
 Criterion: Measure feature subset “usefulness”
 Search: Search guided by the learning process
 Assessment: Use cross-validation

Results:
 Similar to wrappers, but less computationally expensive and less prone to overfitting
Important feature selection techniques
mRMR (minimum Redundancy – Maximum Relevance)
 mRMR is a minimal-optimal feature selection algorithm.
 This means it is designed to find the smallest relevant subset of features for a given machine learning task.
 It tries to find a small set of features that are relevant with respect to the target variable and are scarcely redundant with each other.
 It is valuable not only because it is effective, but also because its simplicity makes it fast and easy to implement in any pipeline.
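A minimal sketch of the mRMR idea using mutual information from scikit-learn: greedily pick the feature with the highest relevance to the target minus its mean redundancy with the already selected features. This is an illustrative re-implementation under those assumptions, not a reference mRMR package:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_iris(return_X_y=True)
n_select = 2

relevance = mutual_info_classif(X, y, random_state=0)   # MI between each feature and the target
selected, remaining = [], list(range(X.shape[1]))

for _ in range(n_select):
    best, best_score = None, -np.inf
    for j in remaining:
        # Redundancy: mean MI between candidate feature j and the already selected features
        redundancy = np.mean([
            mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
            for s in selected
        ]) if selected else 0.0
        score = relevance[j] - redundancy                # mRMR criterion: relevance - redundancy
        if score > best_score:
            best, best_score = j, score
    selected.append(best)
    remaining.remove(best)

print(selected)   # indices of the selected features
```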
