Feature Engineering
One-hot encoding (example: amino acid residue at each sequence position)

Position  A R N D C E Q G H I L K M F P S T W Y V
A         1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G         0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
R         0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T         0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
H         0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
L         0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
M         0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
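As a minimal sketch, the table above could be reproduced in code; the alphabet ordering is taken from the column headers, numpy is assumed to be available, and the function name is illustrative.

import numpy as np

# 20 standard amino acids, in the column order used in the table above
ALPHABET = list("ARNDCEQGHILKMFPSTWYV")

def one_hot_encode(sequence):
    """Return a (len(sequence) x 20) one-hot matrix for a peptide."""
    encoding = np.zeros((len(sequence), len(ALPHABET)), dtype=int)
    for position, residue in enumerate(sequence):
        encoding[position, ALPHABET.index(residue)] = 1
    return encoding

print(one_hot_encode("AGRTHLM"))  # reproduces the table, row by row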
Log transformation
The log function is the inverse of the exponential function: log10(x) is defined so that 10^(log10(x)) = x.
This means the log function maps the small range of numbers in (0, 1) to the entire range of negative numbers (–∞, 0).
The function log10(x) maps the range [1, 10] to [0, 1], [10, 100] to [1, 2], and so on.
In other words, the log function compresses the range of large numbers: the larger x is, the slower log(x) increments.
The log may use base e, 2, or 10 (e.g., loge (ln), log2, log10).
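A minimal illustration of how log10 compresses large values (numpy assumed; the sample arrays are arbitrary):

import numpy as np

x = np.array([0.01, 0.1, 1, 10, 100, 1000])
print(np.log10(x))        # [-2. -1.  0.  1.  2.  3.]

# log1p is often preferred for features that can be zero
counts = np.array([0, 1, 9, 99, 999])
print(np.log1p(counts))   # log(1 + x) avoids log(0) = -inf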
Feature Scaling
Different features often have very different scales
Many ML algorithms handle features at different scales poorly
Feature scaling changes the scale of a feature (e.g., expressing values as percentages or mapping them to a common range)
Scaling is done for individual features
Commonly used techniques
Normalization (0 to 1)
Standardization (Variance Scaling)
L2 normalization
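A sketch of the three techniques with scikit-learn (the toy matrix X is arbitrary); note that the two scalers work per feature (column), while scikit-learn's L2 normalization works per sample (row):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Normalization (min-max): rescales each feature to the range [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization (variance scaling): zero mean, unit variance per feature
print(StandardScaler().fit_transform(X))

# L2 normalization: rescales each sample (row) to unit Euclidean norm
print(Normalizer(norm="l2").fit_transform(X))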
Dimension Reduction
Principal Component Analysis
Major Steps
Standardize the data.
Perform Singular Value Decomposition (or eigendecomposition of the covariance matrix) to obtain the eigenvectors and eigenvalues.
Sort the eigenvalues in descending order and choose the k eigenvectors corresponding to the k largest eigenvalues.
Construct the projection matrix from the selected k eigenvectors.
Transform the original dataset via the projection matrix to obtain a k-dimensional feature subspace.
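These steps could be sketched directly with numpy; X is a hypothetical data matrix with samples in rows, and k is the number of components to keep:

import numpy as np

def pca_project(X, k):
    # 1. Standardize the data (zero mean, unit variance per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. SVD: rows of Vt are the eigenvectors of the covariance matrix,
    #    and the squared singular values give the eigenvalues
    U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
    eigenvalues = S ** 2 / (X.shape[0] - 1)

    # 3./4. SVD returns components sorted by decreasing eigenvalue,
    #       so the projection matrix is simply the first k eigenvectors
    W = Vt[:k].T                      # shape: (n_features, k)

    # 5. Project onto the k-dimensional subspace
    return X_std @ W, eigenvalues[:k]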
Linear Discriminant Analysis (LDA)
LDA is another dimensionality reduction technique.
Unlike PCA, it is supervised: it uses the class labels to find directions that maximize class separability.
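For comparison, a minimal supervised reduction with scikit-learn's LDA (the iris dataset is used purely as an illustration):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
# LDA uses the class labels y, unlike PCA; at most (n_classes - 1) components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
print(X_lda.shape)  # (150, 2)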
Combine the two methods and get the best of both worlds:
• PCA
– Good at extracting signal from noise
– Extracts informative dimensions
• tSNE
– Can reduce to 2D well
– Can cope with non-linear scaling
Do PCA
– Extract the most interesting signal
– Take the top PCs; reduce dimensionality (but not to 2)
Do tSNE/UMAP
– Calculate distances from the PCA projections
– Scale distances and project into 2 dimensions
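A sketch of this two-stage recipe with scikit-learn; the default of 50 principal components is an arbitrary illustration, not a recommendation from the slides:

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def pca_then_tsne(X, n_pcs=50):
    # Stage 1: PCA extracts the strongest signal into the top PCs
    X_pca = PCA(n_components=n_pcs).fit_transform(X)
    # Stage 2: tSNE works on distances in the PCA space, projecting to 2D
    return TSNE(n_components=2, init="pca").fit_transform(X_pca)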
Feature Extraction/Creation
[Diagram: original feature set F transformed into a new feature set F']
Correlation criterion (Pearson) for scoring feature f_i against target y:

R(f_i, y) = \frac{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (f_{k,i} - \bar{f}_i)^2} \, \sqrt{\sum_{k=1}^{m} (y_k - \bar{y})^2}}
The higher the correlation between the feature and the target, the higher the score!
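A direct numpy sketch of this score for every feature column at once (X and y are hypothetical arrays):

import numpy as np

def correlation_score(X, y):
    """R(f_i, y) for every column f_i of X (Pearson correlation)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    numerator = Xc.T @ yc
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return numerator / denominator

# Features are then ranked by |R|: the higher, the better.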
Filter Methods: Classification Methods
1. Difference in means between positive and negative samples
2. Based on MCC (Matthews Correlation Coefficient)
3. Based on Accuracy
4. Ranking of features
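As a sketch of criterion 1, features can be ranked by the absolute difference between the class means; labels are assumed to be +1/−1 as in the density figure below:

import numpy as np

def mean_difference_ranking(X, y):
    """Rank features by |mean(positive class) - mean(negative class)|."""
    mu_pos = X[y == 1].mean(axis=0)
    mu_neg = X[y == -1].mean(axis=0)
    scores = np.abs(mu_pos - mu_neg)
    return np.argsort(scores)[::-1]   # feature indices, best first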
Nomenclature
Univariate method: considers one variable (feature) at a time.
[Figure: class-conditional densities P(Xi | Y = +1) and P(Xi | Y = -1) plotted against xi, with the class means μ- and μ+ marked]
• Normally distributed classes, equal variance \sigma^2 unknown; estimated from the data as \sigma^2_{within}.
• Null hypothesis H0: \mu_+ = \mu_-
• T statistic: if H0 is true,
  t = \frac{\mu_+ - \mu_-}{\sigma_{within}\sqrt{1/m_+ + 1/m_-}} \sim \text{Student}(m_+ + m_- - 2 \text{ d.f.})
F-score
Univariate feature selection: in filter selection, a single input variable is evaluated at a time against the target variable. These statistical measures are termed univariate statistical measures.
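A minimal scikit-learn example of univariate filter selection using the ANOVA F-score mentioned above; the synthetic dataset and the choice k=10 are arbitrary illustrations:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# Each feature is scored against the target independently (univariate)
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
X_selected = selector.transform(X)
print(selector.get_support(indices=True))  # indices of the 10 kept features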
Wrapper Methods
Perspectives: Search of a Subset of Features
Search Space: all possible subsets of the M original features (2^M candidates).
Search Directions
Sequential Forward Generation (SFG): It starts with an empty set of features S. As the search starts, features are added to S according to some criterion that distinguishes the best feature from the others. S grows until it reaches the full set of original features (a code sketch follows this list).
Sequential Backward Generation (SBG): It starts with the full set of features and, iteratively, removes them one at a time. Here, the criterion must point out the worst or least important feature.
Bidirectional Generation (BG): It begins the search in both directions, performing SFG and SBG concurrently. The searches stop in two cases: (1) when one search finds the best subset of m features before reaching the exact middle, or (2) when both searches reach the middle of the search space. It takes advantage of both SFG and SBG.
Random Generation (RG) or Genetic Algorithm: It starts the search in a random direction. The choice of adding or removing a feature is a random decision. RG tries to avoid stagnating in a local optimum by not following a fixed path for subset generation. Unlike SFG or SBG, the size of the final subset of features cannot be stipulated.
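A simplified sketch of Sequential Forward Generation, using cross-validated score of a hypothetical estimator as the criterion and stopping greedily when no remaining feature improves it:

import numpy as np
from sklearn.model_selection import cross_val_score

def sequential_forward_generation(estimator, X, y, max_features=None):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every candidate feature added to the current subset S
        scores = {
            f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
            for f in remaining
        }
        f_best, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:      # no candidate improves the criterion
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = score
    return selected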
Perspectives: Search of a Subset of Features
Search Strategies
Exhaustive Search: It explores all possible subsets to find the optimal ones. As said before, the space complexity is O(2^M). If we establish a threshold m on the minimum number of features to be selected and fix the direction of search, the size of the search space is the same regardless of forward or backward generation. Only exhaustive search can guarantee optimality. Nevertheless, it is impractical for real data sets with a high M.
Heuristic Search: It employs heuristics to carry out the search. It thus avoids brute-force search, but it may return a non-optimal subset of features. It draws a path connecting the beginning and the end, in the manner of a depth-first search. The maximum length of this path is M and the number of subsets generated is O(M). The choice of heuristic is crucial to finding a near-optimal subset of features quickly.
Nondeterministic Search: A complementary combination of the previous two. Also known as random search, it can keep generating candidate subsets and improving the quality of the selected features as time goes by. In each step, the next subset is obtained at random.
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!
L1-Regularization method
Regularization consists of adding a penalty to the different
parameters to reduce the freedom of the model.
In linear model regularization, the penalty is applied over the
coefficients that multiply each of the predictors.
Lasso (L1) regularization is the most popular choice for feature selection because it can shrink some coefficients exactly to zero.
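A sketch of L1-based selection with scikit-learn; the alpha value is an arbitrary illustration, and features whose coefficients are driven to zero are dropped:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_select(X, y, alpha=0.01):
    """Keep the features whose Lasso coefficients are non-zero."""
    X_scaled = StandardScaler().fit_transform(X)   # L1 penalties need scaled features
    selector = SelectFromModel(Lasso(alpha=alpha)).fit(X_scaled, y)
    return selector.get_support(indices=True)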
Tree Based Feature Selection
Random Forests aggregate a specified number of decision trees.
The decrease in impurity attributable to a feature can be measured.
The more a feature decreases the impurity, the more important the feature is.
In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.
Random forests naturally rank features by how well they improve the purity of a node, in other words by the decrease in impurity (Gini impurity).
Nodes with the greatest decrease in impurity occur at the top of the trees,
while nodes with the least decrease in impurity occur near the leaves.
Thus, by pruning trees below a particular node, we can create a subset of the most important features.
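A sketch of impurity-based ranking with scikit-learn (X and y are hypothetical; feature_importances_ is the tree-averaged impurity decrease described above):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rank_by_impurity_decrease(X, y, n_trees=100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X, y)
    importances = forest.feature_importances_      # mean impurity decrease per feature
    return np.argsort(importances)[::-1]           # feature indices, most important first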
Filters, Wrappers, and Embedded Methods
[Diagram]
Filter: All features → Filter → Feature subset → Predictor
Wrapper: All features → Multiple feature subsets → Predictor (the predictor evaluates each subset)
Embedded: All features → Embedded method (the feature subset is selected while training the Predictor)
Filters
Methods:
Criterion: Measure feature/feature subset
“relevance”
Search: Usually order features (individual feature
ranking or nested subsets of features)
Assessment: Use statistical tests
Results:
Are (relatively) robust against overfitting
May fail to select the most “useful” features
Wrappers
Methods:
Criterion: Measure feature subset “usefulness”
Search: Search the space of all feature subsets
Assessment: Use cross-validation
Results:
Can in principle find the most “useful” features,
but are prone to overfitting
Embedded Methods
Methods:
Criterion: Measure feature subset “usefulness”
Search: Search guided by the learning process
Assessment: Use cross-validation
Results:
Similar to wrappers, but less computationally expensive
and less prone to overfitting
Important feature selection techniques
mRMR (minimum Redundancy - Maximum Relevance)
mRMR is a minimal-optimal feature selection algorithm.
This means it is designed to find the smallest relevant subset of features for a given machine learning task.
It tries to find a small set of features that are relevant with respect to the target variable and minimally redundant with each other.
It is valuable not only because it is effective, but also because its simplicity
makes it fast and easily implementable in any pipeline.
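A simplified greedy mRMR sketch, using mutual information for relevance (feature vs. target) and absolute correlation for redundancy (feature vs. already-selected features); this is an illustrative variant, not a reference implementation:

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr(X, y, k):
    """Greedy mRMR: relevance by mutual information, redundancy by |correlation|."""
    relevance = mutual_info_classif(X, y)          # relevance of each feature to y
    corr = np.abs(np.corrcoef(X, rowvar=False))    # pairwise feature redundancy
    selected = [int(np.argmax(relevance))]         # start with the most relevant feature
    while len(selected) < min(k, X.shape[1]):
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        # Score = relevance minus mean redundancy with already-selected features
        scores = [relevance[f] - corr[f, selected].mean() for f in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected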