Feature Engineering and Selection
CS 294: Practical Machine Learning, October 1, 2009
Alexandre Bouchard-Côté
Abstract supervised setup

• Training data: pairs (x_i, y_i)
• x_i : input vector

  x_i = (x_{i,1}, x_{i,2}, …, x_{i,n})^T,  x_{i,j} ∈ ℝ

• y : response variable
  – y ∈ {0, 1} : binary classification
  – y ∈ ℝ : regression
  – what we want to be able to predict, having observed some new x.
Concrete setup

[Figure: a raw input is mapped to an output label, e.g. "Danger".]

Featurization

[Figure: the raw input is first turned into a feature vector x_i = (x_{i,1}, x_{i,2}, …, x_{i,n}), which is then mapped to the output, e.g. "Danger".]
Outline
• Program:
– Part I : Handcrafting features: examples, bag
of tricks (feature engineering)
– Part II: Automatic feature selection
Part I: Handcrafting Features
Machines still need us
Example 1: email classification

• Goal: assign each email to a class such as PERSONAL, WORK, SPAM, ...
• Features are generated by templates applied to the text, e.g.
  – Feature template 2: BIGRAM:word1 word2 (example: BIGRAM:Cheap Viagra)

[Slide shows a code sketch of a feature-extraction routine that accumulates such features and returns the result.]
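To make the template idea concrete, here is a minimal Python sketch (the function name and preprocessing are assumptions, not the slide's actual code):

def extract_features(email_text):
    """Return a set of indicator feature names for one email.

    Minimal sketch: a real extractor would also lowercase, strip
    punctuation, and conjoin these features with the candidate class.
    """
    words = email_text.split()
    features = set()
    for w in words:                          # feature template: single words
        features.add("UNIGRAM:" + w)
    for w1, w2 in zip(words, words[1:]):     # feature template 2: adjacent word pairs
        features.add("BIGRAM:" + w1 + " " + w2)
    return features

# Produces features such as BIGRAM:Cheap Viagra
print(extract_features("Cheap Viagra available now"))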
Features for multitask learning

[Figure: graphical model with one copy of the model per user (User 1, User 2, ...), each with its own weight vector w relating inputs x to outputs y.]
Structure on the output space

• In multiclass classification, the output space often has known structure as well
• Example: a hierarchy:

  Emails
    Spam: Advance fee frauds, Backscatter
    Ham: Work, Personal

• Input features can then be conjoined with nodes at any level of the hierarchy:
  ...
  UNIGRAM:Alex AND CLASS=PERSONAL
  UNIGRAM:Alex AND CLASS=HAM
  ...
Structure on the output space
• Not limited to hierarchies
– multiple hierarchies
– in general, arbitrary featurization of the output
• Another use:
– want to model that if no words in the email
were seen in training, it’s probably spam
– add a bias feature that is activated only in the SPAM subclass (it ignores the input):
  CLASS=SPAM
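A sketch of how these class-conjoined features, including the input-independent bias feature, could be generated; the hierarchy table and function below are hypothetical encodings of the example above:

# Hypothetical encoding of the hierarchy above: leaf class -> ancestors.
CLASS_ANCESTORS = {
    "ADVANCE_FEE_FRAUD": ["SPAM"],
    "BACKSCATTER":       ["SPAM"],
    "WORK":              ["HAM"],
    "PERSONAL":          ["HAM"],
}

def joint_features(input_features, y):
    """Conjoin input features with the candidate class y and its ancestors,
    and add a class-only bias feature that ignores the input."""
    feats = set()
    for c in [y] + CLASS_ANCESTORS.get(y, []):
        feats.add("CLASS=" + c)                    # bias feature, active regardless of input
        for f in input_features:
            feats.add(f + " AND CLASS=" + c)       # e.g. UNIGRAM:Alex AND CLASS=HAM
    return feats

print(sorted(joint_features({"UNIGRAM:Alex"}, "PERSONAL")))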
Dealing with continuous data

[Figure: a continuous input, the sound waveform of "Danger".]
Dealing with continuous data

• Step 1: Find a coordinate system where similar inputs have similar coordinates
  – Use Fourier transforms and knowledge about the human ear

[Figure: Sound 1 and Sound 2, each shown in the time domain and in the frequency domain.]
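As a hedged sketch of a frequency-domain featurization (the frame length and the log-magnitude spectrum are assumptions, not details from the slides):

import numpy as np

def frequency_features(signal, frame_len=512):
    """Cut a 1-D signal into frames and describe each frame by its
    log-magnitude spectrum (a crude stand-in for ear-inspired features)."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    spectra = np.abs(np.fft.rfft(frames, axis=1))   # magnitude of each frequency bin
    return np.log1p(spectra)                        # compress the dynamic range

# Example on a synthetic 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000.0
features = frequency_features(np.sin(2 * np.pi * 440 * t))
print(features.shape)   # (n_frames, frame_len // 2 + 1)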
Dealing with continuous data

• Step 2 (optional): Transform the continuous data into discrete data
  – Bad idea: COORDINATE=(9.54,8.34)
  – Better: vector quantization (VQ)
  – Run k-means on the training data as a preprocessing step
  – The feature is the index of the nearest centroid, e.g. CLUSTER=1, CLUSTER=2
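A minimal sketch of this vector-quantization step, assuming scikit-learn is available (the number of clusters is an arbitrary choice):

import numpy as np
from sklearn.cluster import KMeans

# Continuous training coordinates (e.g. frequency-domain features); 2-D here for illustration.
train_points = np.random.rand(1000, 2)

# Preprocessing: learn the codebook on the training data only.
vq = KMeans(n_clusters=16, random_state=0).fit(train_points)

def vq_feature(point):
    """Discrete feature: index of the nearest centroid, e.g. 'CLUSTER=3'."""
    cluster = vq.predict(np.asarray(point).reshape(1, -1))[0]
    return "CLUSTER=%d" % cluster

print(vq_feature([0.95, 0.83]))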
Dealing with continuous data

• Important special case: integration of the output of a black box
  – Back to the email classifier: assume we have an executable that returns, given an email e, its belief B(e) that the email is spam
  – We want to model monotonicity
  – Solution: thermometer features

    B(e) > 0.4 AND CLASS=SPAM
    B(e) > 0.6 AND CLASS=SPAM
    B(e) > 0.8 AND CLASS=SPAM
    ...
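A sketch of how such thermometer features could be generated (the thresholds mirror the slide; the function name is hypothetical):

def thermometer_features(belief, thresholds=(0.4, 0.6, 0.8)):
    """One indicator per threshold the black-box belief B(e) exceeds,
    so the learned weights can express a monotone effect."""
    return ["B(e)>%.1f AND CLASS=SPAM" % t for t in thresholds if belief > t]

print(thermometer_features(0.7))   # ['B(e)>0.4 AND CLASS=SPAM', 'B(e)>0.6 AND CLASS=SPAM']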
Dealing with continuous data

• Another way of integrating a calibrated black box as a feature:

  f_i(x, y) = log B(e)  if y = SPAM
            = 0         otherwise

• Recall: votes are combined additively
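In code, this feature is just a class-gated log-probability; a tiny sketch (the string label for the class is an assumption):

import math

def log_belief_feature(belief, y):
    """Real-valued feature: log of the black-box spam belief, active only when y = SPAM.
    Its learned weight scales how much the black box's vote counts in the additive score."""
    return math.log(belief) if y == "SPAM" else 0.0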
Part II: (Automatic) Feature Selection
What is feature selection?
• Reducing the feature space by throwing
out some of the features
• Motivating idea: try to find a simple,
“parsimonious” model
– Occam's razor: the simplest explanation that accounts for the data is best
What is feature selection?

Example 1. Task: classify emails as spam, work, ... Data: presence/absence of words.

  Full feature vector X:
    UNIGRAM:Viagra 0, UNIGRAM:the 1, BIGRAM:the presence 0, BIGRAM:hello Alex 1,
    UNIGRAM:Alex 1, UNIGRAM:of 1, BIGRAM:absence of 0, BIGRAM:classify email 0,
    BIGRAM:free Viagra 0, BIGRAM:predict the 1, BIGRAM:emails as 1, …
  Reduced X:
    UNIGRAM:Viagra 0, BIGRAM:hello Alex 1, BIGRAM:free Viagra 0, …

Example 2. Task: predict chances of lung disease. Data: medical history survey.

  Full feature vector X:
    Vegetarian No, Plays video games Yes, Family history No, Athletic No, Smoker Yes,
    Gender Male, Lung capacity 5.8L, Hair color Red, Car Audi, Weight 185 lbs, …
  Reduced X:
    Family history No, Smoker Yes
Outline
• Review/introduction
– What is feature selection? Why do it?
• Filtering
• Model selection
– Model evaluation
– Model search
• Regularization
• Summary recommendations
Why do it?
• Case 1: We’re interested in features—we want
to know which are relevant. If we fit a model, it
should be interpretable.
[Figure: linear regression example labeling the input, parameters, response, and prediction; the gap between an observation and the prediction is the error or "residual".]

[Figure: a degree-15 polynomial fit to data (x from 0 to 20).]
K-fold cross validation

• A technique for estimating test error
• Uses all of the data to validate
• Divide the data into K groups
• Use each group in turn as a validation set, then average all validation errors

[Figure: the data is split into groups X1, ..., X7; in each round one group is held out as the test set and the model is learned on the remaining groups.]
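The procedure above can be sketched in a few lines of Python (the generic fit/error interface is an assumption, not part of the lecture):

import numpy as np

def k_fold_cv_error(X, y, fit, error, K=5, seed=0):
    """Estimate test error: split the data into K groups, hold each group
    out in turn as the validation set, train on the rest, and average."""
    n = len(y)
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.hstack([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        errs.append(error(model, X[val], y[val]))
    return float(np.mean(errs))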
Model Search
• We have an objective function
– Time to search for a good model.
• This is known as a “wrapper” method
– Learning algorithm is a black box
– Just use it to compute objective function, then
do search
• Exhaustive search is expensive
  – for n features, there are 2^n possible subsets s
• Greedy search is common and effective
Model search

Forward selection:
  Initialize s = {}
  Do: add to s the feature that improves K(s) most
  While K(s) can be improved

Backward elimination:
  Initialize s = {1, 2, …, n}
  Do: remove from s the feature that improves K(s) most
  While K(s) can be improved
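A hedged sketch of forward selection as a wrapper around a black-box objective K(s) (K could be, for instance, negative cross-validation error; the interface below is an assumption):

def forward_selection(n_features, K):
    """Greedy wrapper: repeatedly add the single feature whose inclusion
    improves the objective K(s) the most, until no addition helps."""
    s = set()
    best = K(s)
    while True:
        candidates = [(K(s | {j}), j) for j in range(n_features) if j not in s]
        if not candidates:
            break
        score, j = max(candidates)
        if score <= best:        # no single feature improves K(s)
            break
        s.add(j)
        best = score
    return s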
Regularization

• p = 1: Taxicab or Manhattan norm:

  ||w||_1 = |w_1| + · · · + |w_n|

[Figure: the penalty plotted against the feature weight value.]
Univariate case: intuition

[Figure: the penalty as a function of the feature weight value, for L1 and L2.]

• L1 penalizes more than L2 when the weight is small
Univariate example: L2

[Figure: three examples in which a loss curve and the L2 penalty are added to give the regularized objective.]

Multivariate case: w gets cornered

[Figure: contours of the objective over the plane (weight of feature #1, weight of feature #2); the corners of the L1 penalty tend to make the minimizer land where some weights are exactly zero.]
• To minimize an L1-regularized objective of the form

  Σ_{i=1}^m loss(x_i, y_i, w) + C ||w||_1

  we can solve by (e.g.) gradient descent.
• Two questions:
  – 1. How do we perform this minimization?
    • Difficulty: not differentiable everywhere
  – 2. How do we choose C?
    • Determines how much sparsity will be obtained
    • C is called a hyperparameter
Question 1: Optimization/learning

• The set of points where the objective is not differentiable has Lebesgue measure zero, but the optimizer WILL hit them
• Several approaches, including:
  – Projected gradient, stochastic projected subgradient, coordinate descent, interior point, orthant-wise L-BFGS [Friedman 07, Andrew et al. 07, Koh et al. 07, Kim et al. 07, Duchi 08]
  – More on this in John's lecture on optimization
  – Open source implementation: edu.berkeley.nlp.math.OW_LBFGSMinimizer in http://code.google.com/p/berkeleyparser/
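Of these, perhaps the simplest to sketch is a proximal-gradient scheme built on the soft-thresholding operator; this is a generic illustration, not the Berkeley implementation linked above:

import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrink each coordinate toward 0 by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista(grad_loss, w0, C, step=0.01, iters=1000):
    """Minimize loss(w) + C * ||w||_1 by iterating: a gradient step on the smooth
    loss, then soft-thresholding to handle the non-differentiable L1 term."""
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        w = soft_threshold(w - step * grad_loss(w), step * C)
    return w

# Tiny example: least-squares loss 0.5 * ||Xw - y||^2
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 0.0, 1.0])
w_hat = ista(lambda w: X.T @ (X @ w - y), np.zeros(2), C=0.5)
print(w_hat)   # with enough regularization, some coordinates are exactly 0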
Question 2: Choosing C
• Up until a few years ago
this was not trivial
– Fitting model: optimization
problem, harder than
least-squares
– Cross validation to choose
C: must fit model for every
candidate C value
• Not with LARS! (Least Angle Regression, Hastie et al., 2004)
– Find trajectory of w for all
possible C values
simultaneously, as
efficiently as least-squares
– Can choose exactly how
many features are wanted
[Figure: experimental results; higher F1 is better.]
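For instance, assuming scikit-learn is available, the entire regularization path can be traced in one call; a sketch, not the tool used in the lecture:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import lars_path

# Synthetic data where only a few features are truly informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

# One call gives the coefficients along the whole path of regularization strengths.
alphas, active, coefs = lars_path(X, y, method="lasso")

# Number of nonzero weights at each point on the path: pick the point
# with exactly the number of features you want.
print([int(np.sum(c != 0)) for c in coefs.T])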
When can feature selection hurt?

• NLP example: back to the email classification task
• Zipf's law: the frequency of a word is inversely proportional to its frequency rank
  – Fat tail: many n-grams are seen only once in the training data
  – Yet they can be very useful predictors
  – E.g. the 8-gram "today I give a lecture on feature selection" occurs only once in my mailbox, but it's a good predictor that the email is WORK
Outline
• Review/introduction
– What is feature selection? Why do it?
• Filtering
• Model selection
– Model evaluation
– Model search
• Regularization
• Summary
Summary: feature engineering
• Feature engineering is often crucial to get
good results
• Strategy: overshoot and regularize
– Come up with lots of features: better to include
irrelevant features than to miss important
features
– Use regularization or feature selection to
prevent overfitting
– Evaluate your feature engineering on the DEV set. Then, when the feature set is frozen, evaluate on TEST to get a final evaluation (Daniel will say more on evaluation next week)
Summary: feature selection

When should you do it?
– If the only concern is accuracy, and the whole dataset can be processed, feature selection is not needed (as long as there is regularization)
– If computational complexity is critical (embedded device, web-scale data, fancy learning algorithm), consider using feature selection
  • But there are alternatives: e.g. the hashing trick, a fast, non-linear dimensionality reduction technique [Weinberger et al. 2009] (see the sketch after this list)
– When you care about the features themselves
  • Keep in mind the correlation/causation issues
  • See [Guyon et al., Causal feature selection, 07]
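A minimal sketch of that hashing trick (Python's built-in hash is used only for illustration; real implementations such as Weinberger et al.'s use a fixed signed hash function):

import numpy as np

def hashed_vector(feature_names, dim=2**10):
    """Map an arbitrary set of named features into a fixed-size vector
    without keeping a dictionary: each name hashes to one of `dim` buckets."""
    v = np.zeros(dim)
    for f in feature_names:
        h = hash(f)
        sign = 1.0 if (h >> 1) % 2 == 0 else -1.0   # a pseudo-random sign reduces collision bias
        v[h % dim] += sign
    return v

print(np.count_nonzero(hashed_vector({"UNIGRAM:Viagra", "BIGRAM:Cheap Viagra"})))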
Summary: how to do feature selection

Methods, ordered by increasing computational cost:

• Filtering
  – Good preprocessing step
• L1 regularization (embedded methods)
• Wrappers
  – Most directly optimize prediction performance
  – Can be very expensive
  – Search strategies: forward selection, backward selection, other search, exhaustive
    • Exhaustive search is the "ideal", but very seldom done in practice
    • With a cross-validation objective, there is a chance of over-fitting: some subset might randomly perform quite well in cross-validation
Extra slides
Feature engineering case study:
Modeling language change [Bouchard et al. 07, 09]

             'fish'   'fear'
  Hawaiian   iʔa      makaʔu
  Tongan     ika
  Maori      ika      mataku

• Example sound change: *k > ʔ (as in Hawaiian iʔa, makaʔu)

[Figure: IPA consonant chart used to featurize the sounds involved in such changes.]
Feature selection case study:
Protein Energy Prediction [Blum et al '07]

• What is a protein?
  – A protein is a chain of amino acids.
• Proteins fold into a 3D conformation by minimizing energy
  – The "native" conformation (the one found in nature) is the lowest energy state
  – We would like to find it using only computer search
  – Very hard: need to try several initializations in parallel
• Regression problem:
  – Input: many different conformations of the same sequence
  – Output: energy
• Features derived from φ and ψ torsion angles
• Restrict the next wave of search to agree with features that predicted high energy
Featurization

• Torsion angle features can be binned

  φ1    ψ1    φ2     ψ2     φ3     ψ3     φ4     ψ4     φ5     ψ5
  0     75.3  -61.6  -24.8  -68.6  -51.9  -63.3  -37.6  -62.8  -42.3

[Figure: the (φ, ψ) plane, from (-180, -180) to (180, 180), divided into labeled bins (regions A, B, E, G).]
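A sketch of how binned torsion-angle features might be produced; the bin boundaries and labels below are arbitrary placeholders, not the regions used in the actual study:

import numpy as np

# Hypothetical bin boundaries over [-180, 180) and labels; the real study
# used regions of the (phi, psi) plane (A, B, E, G) rather than a 1-D grid.
BIN_EDGES = np.array([-180, -90, 0, 90, 180])
BIN_LABELS = ["A", "B", "E", "G"]

def binned_angle_features(phis, psis):
    """One discrete feature per residue, naming the bins its (phi, psi) pair falls in."""
    feats = []
    for i, (phi, psi) in enumerate(zip(phis, psis)):
        b_phi = BIN_LABELS[np.digitize(phi, BIN_EDGES) - 1]
        b_psi = BIN_LABELS[np.digitize(psi, BIN_EDGES) - 1]
        feats.append("RESIDUE%d:PHI=%s,PSI=%s" % (i + 1, b_phi, b_psi))
    return feats

print(binned_angle_features([0, -61.6, -68.6], [75.3, -24.8, -51.9]))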