
Feature Engineering

Prof. Gajendra P.S. Raghava


Head, Department of Computational Biology

Web Site: http://webs.iiitd.edu.in/raghava/

These slides were created using various resources, so no claim of authorship is made on any slide.
Feature Engineering: Few Facts
 Most important part of prediction/classification
 It is an art; it is human-driven design
 A number of techniques have been used in the past
 Principles are not well defined or validated
 Commonly used feature selection techniques will be discussed
 Theoretical view: more features means more discriminative power
 In practice, identifying the major features is more important
 Optimization of features is a major challenge

Example: Final Grade Prediction based on marks up to mid-sem
Possible Features
1. Quiz1 marks
2. Quiz2 marks
3. Assignment 1 marks
4. Mid-sem marks
5. Age of student
6. Height of student
7. Quiz1 + Quiz2 marks
8. Total marks
9. Attendance of student
10. Weight of student
11. Marks in 12th class
12. Marks in 10th class
Why feature reduction?
(Example: Final Grade Prediction based on marks up to mid-sem, with the possible features listed above)
➢ Thousands of variables/features in many domains
➢ Many features are irrelevant or redundant
➢ The probability distribution can be very complex and hard to estimate
➢ Irrelevant and redundant features can “confuse” learners
➢ Limited training data
➢ Limited computational resources
➢ Curse of dimensionality
Feature Engineering: Overview
➢ Data Preprocessing: Data Imputation, Handling outliers, Data encoding, Log transformation, Feature Scaling
➢ Dimension Reduction / Feature Generation: PCA, SVD, LDA, tSNE, UMAP
➢ Feature Selection
   • Filter Methods: Correlation Coefficient, Chi-squared, ANOVA
   • Wrapper Methods: Recursive Search, Genetic Algorithm, Random Search
   • Embedded Methods: Decision Tree, Lasso Regularization
➢ Univariate Analysis, Multivariate Analysis, Python Libraries
Data Imputation
(Handling Missing Data)
 Drop rows/samples if sufficient data remain
 Replace a missing value by the mean of the rest of the values in the column (variable)
 Replace a missing value by the median of the rest of the values in the column (variable)
 Impute missing data with models (using the other features, not the target)
 K-NN, Regression, Deep learning
 MaCH is a tool for genotype imputation and haplotyping using WGS sequence data; it uses a Markov chain approach.
 AutoImpute: Autoencoder-based imputation of single-cell RNA-seq data
 mice: Multivariate Imputation by Chained Equations
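As a minimal sketch (not part of the original slides), mean-based and K-NN-based imputation could look like this with scikit-learn; the toy matrix is made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing values (NaN)
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

X_mean = SimpleImputer(strategy="mean").fit_transform(X)  # replace NaN by the column mean
X_knn = KNNImputer(n_neighbors=1).fit_transform(X)        # model-based: nearest-neighbour imputation
print(X_mean)
print(X_knn)
```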
Outliers in Data
 Extremely high or low value in the data
 Detection: Z-score based outlier detection
 Possible applications
 Credit card fraud detection
 Telecommunication fraud detection
 Network intrusion detection
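A minimal sketch of Z-score based outlier detection (the 2-sigma threshold and toy data are assumptions for illustration):

```python
import numpy as np

x = np.array([10, 12, 11, 13, 12, 95], dtype=float)
z = (x - x.mean()) / x.std()          # Z-score of each value
outliers = x[np.abs(z) > 2.0]         # flag values more than 2 standard deviations from the mean
print(outliers)                       # -> [95.]
```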
Encoding: One-hot encoding converts a string into a vector
(Binary profile of the sequence “AGRTHLM”)

Position A R N D C E Q G H I L K M F P S T W Y V
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
R 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
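A minimal sketch of how the binary profile above could be generated, assuming the same 20-column amino-acid order:

```python
import numpy as np

ALPHABET = list("ARNDCEQGHILKMFPSTWYV")  # column order used in the table above

def binary_profile(seq):
    """Return a len(seq) x 20 one-hot (binary) profile of an amino-acid sequence."""
    profile = np.zeros((len(seq), len(ALPHABET)), dtype=int)
    for i, aa in enumerate(seq):
        profile[i, ALPHABET.index(aa)] = 1
    return profile

print(binary_profile("AGRTHLM"))
```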
Log transformation
 The log function is the inverse of the exponential function: it is defined such that log10(10^x) = x.
 This means that the log function maps the small range of numbers in (0, 1) to the entire range of negative numbers (–∞, 0).
 The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on.
 In other words, the log function compresses the range of large numbers: the larger x is, the slower log(x) increases.
 The log base may be e, 2, or 10 (i.e., ln, log2, log10).
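A minimal sketch of a log transformation with NumPy; adding 1 before taking the log (log1p) is a common convention, assumed here, to handle zero values:

```python
import numpy as np

counts = np.array([0, 1, 9, 99, 999], dtype=float)
print(np.log10(counts + 1))   # base-10 log of (x + 1): [0. 0.301 1. 2. 3.]
print(np.log1p(counts))       # natural log of (1 + x)
```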
Feature Scaling
 Features often have different scales
 Features at different scales are difficult to handle in ML
 Feature scaling changes the scale of a feature (e.g., to percentage values)
 Scaling is done for individual features
 Commonly used techniques
 Normalization (0 to 1)
 Standardization (Variance Scaling)
 L2 normalization
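A minimal sketch of the three techniques listed above with scikit-learn (toy data; note that Normalizer applies L2 normalization per sample rather than per feature):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)         # normalization to [0, 1]
X_std = StandardScaler().fit_transform(X)          # standardization (variance scaling)
X_l2 = Normalizer(norm="l2").fit_transform(X)      # L2 normalization (row-wise)
```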
Feature Engineering roadmap (recap): Data Preprocessing → Dimension Reduction → Feature Selection (see the overview above).
Principal Component Analysis based
Feature Reduction
• PCA is an unsupervised learning algorithm used for dimensionality reduction.
• It converts correlated features into a set of linearly uncorrelated features using an orthogonal transformation.
• These new transformed features are called the Principal Components.
• It reduces the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d).
• It builds a recipe for converting many variables into a single value, called a Principal Component (PC), e.g.:
PC = (GeneA × 10) + (GeneB × 3) + (GeneC × −4) + (GeneD × −20) + …
Principal Component Analysis based Feature Reduction: Steps
[Figure: data in feature axes f1, f2 with principal eigenvectors e1, e2]
1. Calculate the covariance matrix of the features
2. Calculate the eigenvectors and eigenvalues of the covariance matrix
3. Sort the eigenvectors in descending order of their eigenvalues
4. Choose the first k eigenvectors; k is the number of new dimensions
5. Transform the original data into the k dimensions
Notes:
6. Assumption: the data is linear, continuous, and follows a Gaussian distribution.
7. PCA fails if the dataset does not have linear relationships; it also fails for categorical data.
8. PCA does not handle missing data directly.
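A minimal sketch of steps 1-5 with NumPy only; the random data and k = 2 are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, d = 5 features
Xc = X - X.mean(axis=0)                  # center the data

cov = np.cov(Xc, rowvar=False)           # 1. covariance matrix of the features
eigvals, eigvecs = np.linalg.eigh(cov)   # 2. eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # 3. sort by eigenvalue, descending

k = 2
W = eigvecs[:, order[:k]]                # 4. first k eigenvectors
X_pca = Xc @ W                           # 5. project the data onto k dimensions
print(X_pca.shape)                       # (100, 2)
```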
Singular Value Decomposition (SVD)
 Normalize by moving the origin to the center of the dataset
 Find the eigenvectors of the data (or covariance) matrix
 These define the new space
 Sort the eigenvalues in “goodness” order
[Figure: feature axes f1, f2 with eigenvectors e1, e2]
Principal Component Analysis
Major Steps
 Standardize the data.
 Perform Singular Value Decomposition to get the eigenvectors and eigenvalues.
 Sort the eigenvalues in descending order and choose the top k eigenvectors.
 Construct the projection matrix from the selected k eigenvectors.
 Transform the original dataset via the projection matrix to obtain a k-dimensional feature subspace.
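A minimal sketch of the same steps with scikit-learn, which performs the SVD internally (random toy data, k = 3 assumed):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

X_std = StandardScaler().fit_transform(X)   # standardize the data
pca = PCA(n_components=3)                   # keep k = 3 principal components
X_pca = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)        # variance explained by each component
```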
Linear Discriminant Analysis (LDA)
LDA is another feature reduction technique;
it is a supervised feature reduction technique.

Major steps in LDA
 Inter-class (between-class) variance: calculate the distance between the means of the different classes
 Intra-class (within-class) variance: calculate the distance between the mean and the samples of each class
 Construct the lower-dimensional space that maximizes the inter-class variance and minimizes the intra-class variance
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA)
1. Compute the d-dimensional mean vectors.
2. Compute the scatter matrices
3. Compute the eigenvectors and corresponding eigenvalues for the
scatter matrices.
4. Sort the eigenvalues and choose the k eigenvectors with the largest eigenvalues to form a d×k-dimensional matrix
5. Transform the samples onto the new subspace.
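A minimal sketch of supervised reduction with LDA in scikit-learn; the iris data set is used only as a convenient example:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # at most (number of classes - 1) components
X_lda = lda.fit_transform(X, y)                    # uses class labels, unlike PCA
print(X_lda.shape)                                 # (150, 2)
```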
t-distributed Stochastic Neighbor
Embedding (tSNE)
 tSNE is an alternative to PCA for feature reduction
 It overcomes a number of limitations of PCA
Important terms
Stochastic → not definite but random probability
Neighbor → retaining the variance of neighbor points
Embedding → plotting data into lower dimensions
How does tSNE work?
 All-vs-all table of pairwise cell to cell distances
 Perplexity = expected number of neighbours within a cluster
 Distances scaled relative to perplexity neighbours
[Figure: perplexity robustness]
tSNE Major Steps
 Step 1: t-SNE constructs a probability distribution on pairs in higher
dimensions such that similar objects are assigned a higher probability
and dissimilar objects are assigned lower probability.
 Step 2: Then, t-SNE builds a similar probability distribution in the lower-dimensional space and adjusts it iteratively until the Kullback-Leibler (KL) divergence is minimized.
 The Kullback-Leibler divergence is a measure of the difference between the probability distributions from Step 1 and Step 2.
 The KL divergence is the expected value of the logarithm of the ratio of these probability distributions: KL(P‖Q) = Σ p(i,j) log( p(i,j) / q(i,j) ).
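A minimal sketch of t-SNE with scikit-learn; the random data and perplexity value are assumptions for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                               # 200 samples, 50 features

tsne = TSNE(n_components=2, perplexity=30, random_state=0)   # minimizes the KL divergence
X_2d = tsne.fit_transform(X)
print(X_2d.shape)                                            # (200, 2)
```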
PCA vs tSNE
Limitations
• PCA: requires more than 2 dimensions; thrown off by quantised data; expects linear relationships
• tSNE: can’t cope with noisy data; loses the ability to cluster

Answer: combine the two methods and get the best of both worlds
• PCA: good at extracting signal from noise; extracts informative dimensions
• tSNE: can reduce to 2D well; can cope with non-linear scaling

This is what CellRanger does in its default analysis.
UMAP to overcome limitations of PCA
and tSNE
 UMAP is a replacement for tSNE designed to overcome its limitations
 It is conceptually similar to tSNE, but with a couple of relevant changes
 UMAP is faster than tSNE and preserves more global structure than tSNE
 UMAP can run on raw data without PCA preprocessing
 Instead of the single perplexity value in tSNE, UMAP defines two parameters:
 Nearest neighbours: the number of expected nearest neighbours – basically the same concept as perplexity
 Minimum distance: how tightly UMAP packs points which are close together
 Nearest neighbours affects the influence given to global vs local information; min dist affects how compactly packed the local parts of the plot are.
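A minimal sketch using the third-party umap-learn package (assumed installed via `pip install umap-learn`); n_neighbors and min_dist are the two parameters described above:

```python
import numpy as np
import umap

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
X_2d = reducer.fit_transform(X)
print(X_2d.shape)   # (500, 2)
```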
Practical approach PCA + tSNE/UMAP
 Filter heavily before starting
 Nicely behaving cells
 Expressed genes
 Variable genes

 Do PCA
 Extract most interesting signal
 Take top PCs. Reduce dimensionality (but not to 2)

 Do tSNE/UMAP
 Calculate distances from the PCA projections
 Scale distances and project into 2 dimensions
Feature Engineering roadmap (recap): Data Preprocessing → Dimension Reduction → Feature Selection (see the overview above).
Definition of Feature Selection
 Classification/Regression (Supervised Learning):
L = {(x_1, y_1), ..., (x_i, y_i), ..., (x_m, y_m)} ⊆ X × Y, where each sample is x = (x_1, ..., x_n)^T and the input space is X = f_1 × ... × f_i × ... × f_n.
 Feature selection: select features from F = {f_1, ..., f_i, ..., f_n} to obtain F′ ⊆ F.
 The reduced set F′ has fewer than or the same number of features as F.
Feature Selection / Extraction
 Feature Selection:
{f_1, ..., f_i, ..., f_n} —(feature selection)→ {f_{i_1}, ..., f_{i_j}, ..., f_{i_m}},
where i_j ∈ {1, ..., n}, j = 1, ..., m, and i_a = i_b ⇒ a = b for a, b ∈ {1, ..., m} (no feature is selected twice).
 Feature Extraction/Creation:
{f_1, ..., f_i, ..., f_n} —(feature extraction)→ {g_1(f_1, ..., f_n), ..., g_j(f_1, ..., f_n), ..., g_m(f_1, ..., f_n)},
i.e., each new feature g_j is a function of the original features.
Filter Methods: Variable Ranking
 A simple method for feature selection using variable ranking is to select the k highest-ranked features according to a scoring function S.
 This is usually not optimal,
 but often preferable to other, more complicated methods,
 and computationally efficient: only the calculation and sorting of n scores is required.
Filter Methods: Ranking by Correlation
Correlation criteria:
 Pearson correlation coefficient
R(f_i, y) = cov(f_i, y) / sqrt( var(f_i) · var(y) )
 Estimate for m samples:
R(f_i, y) = Σ_{k=1}^{m} (f_{k,i} − f̄_i)(y_k − ȳ) / sqrt( Σ_{k=1}^{m} (f_{k,i} − f̄_i)² · Σ_{k=1}^{m} (y_k − ȳ)² )

The higher the correlation between the feature and the target, the higher the score!
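A minimal sketch of correlation-based ranking: score each feature by |R(f_i, y)| and keep the top k (toy data and k = 2 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.1, size=100)  # y depends on features 0 and 3

scores = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(X.shape[1])])
k = 2
top_k = np.argsort(scores)[::-1][:k]
print(top_k)   # indices of the two highest-ranked features (0 and 3 here)
```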
Filter Methods: Classification Methods
1. Difference in means of positive and negative samples
   Rank features by the maximum difference F = (P_mean − N_mean)
2. Significance of the difference in means (t-test)
3. Threshold-based feature selection
   Based on MCC
   Based on Accuracy
4. Ranking of features
Nomenclature
 Univariate method: considers one variable (feature) at a time.
 Multivariate method: considers subsets of variables (features) together.
 Filter method: ranks features or feature subsets independently of the predictor (classifier).
 Wrapper method: uses a classifier to assess features or feature subsets.
Univariate feature ranking
[Figure: class-conditional densities P(Xi | Y = 1) and P(Xi | Y = −1) along feature xi, with class means μ− and μ+]
• Normally distributed classes with equal variance σ², unknown and estimated from the data as σ²_within.
• Null hypothesis H0: μ+ = μ−
• T statistic: if H0 is true,
t = (μ+ − μ−) / ( σ_within √(1/m+ + 1/m−) ) ~ Student(m+ + m− − 2 d.f.)
• The F-score is another commonly used univariate criterion.

Univariate feature selection: in filter selection, a single input variable is scored at a time against the target variable; such statistical measures are termed univariate statistical measures.
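A minimal sketch of univariate filter selection with an ANOVA F-test per feature, using SelectKBest from scikit-learn (iris data used as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)   # score each feature independently
X_new = selector.fit_transform(X, y)
print(selector.scores_)          # one F-score per feature
print(selector.get_support())    # mask of the k selected features
```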
Wrapper Methods
Perspectives: Search of a Subset of Features
Search Space: [Figure: lattice of all possible feature subsets]
Search Directions
 Sequential Forward Generation (SFG): It starts with an empty set of features S. As the search starts, features are added into S according to some criterion that distinguishes the best feature from the others. S grows until it reaches the full set of original features.
 Sequential Backward Generation (SBG): It starts with the full set of features and, iteratively, removes them one at a time. Here, the criterion must point out the worst or least important feature.
 Bidirectional Generation (BG): Begins the search in both directions, performing SFG and SBG concurrently. They stop in two cases: (1) when one search finds the best subset comprised of m features before it reaches the exact middle, or (2) when both searches reach the middle of the search space. It takes advantage of both SFG and SBG.
 Random Generation (RG) or Genetic Algorithm: It starts the search in a random direction. The choice of adding or removing a feature is a random decision. RG tries to avoid stagnation in local optima by not following a fixed path for subset generation. Unlike SFG or SBG, the size of the subset of features cannot be stipulated.
Perspectives: Search of a Subset of Features
Search Strategies
 Exhaustive Search: It explores all possible subsets to find the optimal ones. As said before, the space complexity is O(2^M). If we establish a threshold m of minimum features to be selected and the direction of search, the search space is the same regardless of forward or backward generation. Only exhaustive search can guarantee optimality. Nevertheless, it is impractical in real data sets with a high M.
 Heuristic Search: It employs heuristics to carry out the search. Thus, it avoids brute-force search, but it may find a non-optimal subset of features. It draws a path connecting the beginning and the end, much like a depth-first search. The maximum length of this path is M and the number of subsets generated is O(M). The choice of the heuristic is crucial to finding a near-optimal subset of features quickly.
 Nondeterministic Search: A complementary combination of the previous two. It is also known as the random search strategy; it can constantly generate good subsets and keep improving the quality of the selected features as time goes by. In each step, the next subset is obtained at random.
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!
• A wide range of heuristic search strategies can be used.
Two different classes:
– Forward selection (start with an empty feature set and add features at each step)
– Backward elimination (start with the full feature set and discard features at each step)
• Predictive power is usually measured on a validation set or by cross-validation.
• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
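A minimal sketch of a wrapper approach, sequential forward selection with scikit-learn, where the classifier is used as a black box and scored by cross-validation (the iris data and k-NN classifier are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", cv=5)   # forward selection, 5-fold CV
sfs.fit(X, y)
print(sfs.get_support())   # mask of the selected feature subset
```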
Embedded Methods
• These methods combine the benefits of both the wrapper and filter methods,
• while maintaining a reasonable computational cost.
• Embedded methods are iterative, carefully extracting the features that contribute the most to the training in a particular iteration.
• They search for a subset of features using many techniques, including:
  L1-Regularization
  Tree-based methods
L1-Regularization method
 Regularization consists of adding a penalty to the different
parameters to reduce the freedom of the model.
 In linear model regularization, the penalty is applied over the
coefficients that multiply each of the predictors.
 Lasso (L1) regularization is the most popular choice for feature selection
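A minimal sketch of embedded selection with L1 (Lasso) regularization in scikit-learn, where features whose coefficients are driven to zero are discarded (the diabetes data set and alpha value are assumptions for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1)                          # larger alpha -> stronger penalty, fewer features
selector = SelectFromModel(lasso).fit(X_std, y)
print(selector.get_support())                     # mask of features with non-zero coefficients
```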
Tree Based Feature Selection
 Random Forests aggregate a specified number of decision trees.
 The decrease in impurity due to a feature can be measured.
 The more a feature decreases the impurity, the more important the feature is.
 In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.
 Random forests naturally rank features by how well they improve the purity of the nodes, in other words by the decrease in impurity (Gini impurity).
 Nodes with the greatest decrease in impurity occur at the start of the trees,
 while nodes with the least decrease in impurity occur at the end of the trees.
 Thus, by pruning trees below a particular node, we can create a subset of the most important features.
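A minimal sketch of tree-based feature ranking via the mean decrease in Gini impurity, using a random forest from scikit-learn (iris data used as a convenient example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Impurity decrease per feature, averaged across all trees in the forest
for name, score in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {score:.3f}")
```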
Filters, Wrappers, and Embedded Methods (schematic)
 Filter: All features → Filter → Feature subset → Predictor
 Wrapper: All features → Multiple feature subsets ⇄ Predictor → selected feature subset
 Embedded: All features → Embedded method (selection during training) → Predictor
Filters
Methods:
 Criterion: Measure feature/feature subset
“relevance”
 Search: Usually order features (individual feature
ranking or nested subsets of features)
 Assessment: Use statistical tests

Results:
 Are (relatively) robust against overfitting
 May fail to select the most “useful” features
Wrappers
Methods:
 Criterion: Measure feature subset “usefulness”
 Search: Search the space of all feature subsets
 Assessment: Use cross-validation

Results:
 Can in principle find the most “useful” features,
but are prone to overfitting
Embedded Methods
Methods:
 Criterion: Measure feature subset “usefulness”
 Search: Search guided by the learning process
 Assessment: Use cross-validation

Results:
 Similar to wrappers, but less computationally expensive and less prone to overfitting
Important feature selection techniques
mRMR (minimum Redundancy – Maximum Relevance)
 mRMR is a minimal-optimal feature selection algorithm.
 This means it is designed to find the smallest relevant subset of features for a given machine learning task.
 It tries to find a small set of features that are relevant with respect to the target variable and are scarcely redundant with each other.
 It is valuable not only because it is effective, but also because its simplicity makes it fast and easy to implement in any pipeline.
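A minimal sketch of the mRMR idea using mutual information from scikit-learn: greedily pick the feature with the highest relevance to the target minus its mean redundancy with the already selected features. This is an illustrative re-implementation under those assumptions, not a reference mRMR package:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

X, y = load_iris(return_X_y=True)
n_select = 2

relevance = mutual_info_classif(X, y, random_state=0)   # MI between each feature and the target
selected, remaining = [], list(range(X.shape[1]))

for _ in range(n_select):
    best, best_score = None, -np.inf
    for j in remaining:
        # Redundancy: mean MI between candidate feature j and the already selected features
        redundancy = np.mean([
            mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
            for s in selected
        ]) if selected else 0.0
        score = relevance[j] - redundancy                # mRMR criterion: relevance - redundancy
        if score > best_score:
            best, best_score = j, score
    selected.append(best)
    remaining.remove(best)

print(selected)   # indices of the selected features
```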
