Learning_Feature_Engineering_for_Classification
All content following this page was uploaded by Udayan Khurana on 21 November 2017.
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)
and class label distributions, and transformation.

Different datasets contain different feature sizes and different value ranges. One key challenge in generalizing across different datasets is to convert feature values and their class labels to a fixed-size feature vector representation that can be fed into LFE classifiers. To characterize datasets, hand-crafted meta-features, fixed-size stratified sampling, neural networks and hashing methods have been used for different tasks [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015; Weinberger et al., 2009]. However, these representations do not directly capture the correlation between feature values and class labels. To capture such correlations, LFE constructs a stack of fixed-size representations of feature values per target class. We use Quantile Data Sketch to represent feature values of each class. Quantile has been used as a fixed-size space representation and achieves reasonably accurate approximation to the distribution function induced by the data being sketched [Greenwald and Khanna, 2001].

LFE presents a computationally efficient and effective alternative to other automated feature engineering approaches by recommending suitable transformations for features in a dataset. To showcase the capabilities of LFE, we trained LFE on 85K features, extracted from 900 datasets, for 10 unary transformations and 122K feature pairs for 4 binary transformations, for two models: Random Forest and Logistic Regression. The transformations are listed in Table 1. We empirically compare LFE with a suite of feature engineering approaches proposed in the literature or applied in practice (such as the Data Science Machine, evaluation-based, random selection of transformations and always applying the most popular transformation in the training data) on a subset of 50 datasets from the UCI repository [Lichman, 2013], OpenML [Vanschoren et al., 2014] and other sources. Our experiments show that, of the datasets that demonstrated any improvement through feature engineering, LFE was the most effective in 89% of the cases. As shown in Figure 2, similar results were observed for the LFE trained with Logistic Regression. Moreover, LFE runs in significantly less time than the other approaches. This also enables interaction with a practitioner, since it recommends transformations on features in a short amount of time.

2 Related Work

The FICUS algorithm [Markovitch and Rosenstein, 2002] takes as input a dataset and a set of transformations, and performs a beam search over the space of possible features. FICUS's search for better features is guided by heuristic measures based on information gain in a decision tree, and other surrogate measures of performance. The more recent FCTree [Fan et al., 2010] also uses a decision tree to partition the data using original or constructed features as splitting points. The FEADIS algorithm of [Dor and Reich, 2012] relies on a combination of random feature generation and feature selection. FEADIS adds constructed features greedily, and as such requires many expensive performance evaluations. The Deep Feature Synthesis component of Data Science Machine (DSM) [Kanter and Veeramachaneni, 2015] relies on exhaustively enumerating all possible new features, then performing feature selection over the augmented dataset. Cognito [Khurana et al., 2016] recommends a series of transformations based on a greedy, hierarchical heuristic search. Cognito and DSM focus on sequences of feature transformations, which is outside the scope of this paper. In contrast to these approaches, we do not require expensive classifier evaluations to measure the impact of a transformation. ExploreKit [Katz et al., 2016] also generates all possible candidate features, but uses a learned ranking function to sort them. ExploreKit requires generating all possible candidate features when engineering features for a new (test) dataset and, as such, results reported for ExploreKit are with a time limit of three days per dataset. In contrast, LFE can generate effective features within seconds, on average.

Several machine learning methods perform feature extraction or learning indirectly. While they do not explicitly work on input features and transformations, they generate new features as a means to solving another problem [Storcheus et al., 2015]. Methods of that type include dimensionality reduction, kernel methods and deep learning. Kernel algorithms such as SVM [Shawe-Taylor and Cristianini, 2004] can be seen as embedded methods, where the learning and the (implicit) feature generation are performed jointly. This is in contrast to our setting, where feature engineering is a preprocessing step.

Deep neural networks learn useful features automatically [Bengio et al., 2013] and have shown remarkable successes on video, image and speech data. However, in some domains feature engineering is still required. Moreover, features derived by neural networks are often not interpretable, which is an important factor in certain application domains such as healthcare [Che et al., 2015].

3 Automated Feature Engineering Problem

Consider a dataset, $D$, with features $F = \{f_1, \dots, f_n\}$ and a target class, a set of transformations, $T = \{T_1, \dots, T_m\}$, and a classification task, $L$. The feature engineering problem is to find the $q$ best paradigms for constructing new features such that appending the new features to $D$ maximizes the accuracy of $L$. Each paradigm consists of a candidate transformation $T_c \in T$ of arity $r$, an ordered list of features $[f_i, \dots, f_{i+r-1}]$ and a usefulness score.

For a dataset with $n$ features and with $u$ unary transformations, $O(u \times n)$ new features can be constructed. With $b$ binary transformations, there are $O(b \times P^n_2)$ new possible features, where $P^n_2$ is the number of 2-permutations of $n$ features. Given a fixed set of transformations, the number of new features and their combinations to explore, for an exact solution, grows exponentially. Hence, we make the case that mere enumeration and trial by model training and testing is not a computationally practical option, and a scalable solution to the problem must avoid this computational bottleneck. LFE reduces the complexity of feature space exploration by providing an efficient approach for finding a particularly "good" transformation for a given set of features. Therefore, given $n$ features, for unary and binary transformations, LFE performs, respectively, $O(n)$ and $O(P^n_2)$ transformation predictions.

In order to assess the relative impact of adding new features
across different techniques, we add as many new features as those originally in the data. For unary transformations, LFE predicts the most suitable transformation for each feature. For binary and higher arity transformations, LFE considers a random sample of all combinations of features, finds the paradigm for each combination and selects the top-k useful ones. In the following section, we describe how LFE learns and predicts useful transformations for features.

4 Transformation Recommendation

LFE models the problem of predicting a useful $r$-ary transformation $T_c \in T_r$ (where $T_r$ is the set of $r$-ary transformations in $T$) for a given list of features $[f_1, \dots, f_r]$ as a multi-class classification problem, where the input is a representation of the features, $R_{[f_1,\dots,f_r]}$, and the output classes are the transformations in $T_r$. LFE takes a one-vs-rest approach [Rifkin and Klautau, 2004]. Each transformation is modelled as a Multi-Layer Perceptron (MLP) binary classifier with real-valued confidence scores as output. Recommending an $r$-ary transformation for $r$ features involves applying all $|T_r|$ MLPs on $R_{[f_1,\dots,f_r]}$. If the highest confidence score obtained from the classifiers is above a given threshold, LFE recommends the corresponding transformation to be applied to the features. Let $g_k(R_{[f_1,\dots,f_r]})$ be the confidence score of the MLP corresponding to transformation $T_k$, and let $\gamma$ be the threshold for confidence scores, which we determined empirically. LFE recommends the transformation $T_c$, for features $[f_1, \dots, f_r]$, as follows:

$$c = \arg\max_k \; g_k(R_{[f_1,\dots,f_r]})$$

$$\text{recommend}: \begin{cases} T_c, & \text{if } g_c(R_{[f_1,\dots,f_r]}) > \gamma \\ \text{none}, & \text{otherwise} \end{cases} \qquad (1)$$

In the following sections, we describe how numerical features are represented to be the input of transformation MLPs. Next, we explain the process of collecting past feature engineering knowledge as training samples to train MLPs.

4.1 Feature-Class Representation

Transformations are used to reveal and improve significant correlation or discriminative information between features and class labels. The more pronounced this correlation, the higher the chance that a model can achieve a better predictive performance. Each LFE classifier learns the patterns of feature-class distributions for which the corresponding transformation has been effective in improving feature-class correlation. LFE represents feature $f$ in a dataset with $k$ classes as follows:

$$R_f = \left[ Q_f^{(1)}; Q_f^{(2)}; \dots; Q_f^{(k)} \right] \qquad (2)$$

where $Q_f^{(i)}$ is a fixed-size representation of the values in $f$ that are associated with class $i$. We call this representation Quantile Sketch Array. Next, we describe how feature values associated with class $i$ are translated into the representation $Q_f^{(i)}$, which is meant to capture the distribution of feature values.

Neural networks have been successful in learning representations for image and speech data [Bengio et al., 2013]. Others have proposed solutions for estimating a Probability Distribution Function (PDF) in an n-dimensional space [Garrido and Juste, 1998]. However, it is not clear how existing representation learning and PDF learning approaches may be applied in the context of raw numerical data. The main challenge is the high variability in the size and the range of feature values (e.g., from 10 to millions). In our setting, features are data points represented with various numbers of dimensions (number of distinct feature values). Hence, Random Projection [Bingham and Mannila, 2001] for dimensionality reduction is not applicable. Although Recurrent Neural Networks can deal with varying input size, we aim at determining a fixed-size representation that captures the correlation between features and target classes.

Previous approaches have used hand-crafted meta-features, including information-theoretic and statistical meta-features, to represent datasets [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015]. Such meta-features aim at modelling the distribution of values in datasets. Performing fixed-size sampling of feature values is another approach for representing distributions. Samples extracted from features and classes are required to reflect the distribution of values in both feature and target classes. While stratified sampling solves the issue for one feature, it becomes more complex for multiple features with high correlation [Skinner et al., 1994]. Feature hashing has been used to represent features of type "string" as feature vectors [Weinberger et al., 2009]. Although feature hashing can be generalized for numerical values, it is not straightforward to choose distance-preserving hash functions that map values within a small range to the same hash value. In Section 5, we empirically show how LFE transformation classifiers perform using each of the representations described above.

Quantile Sketch Array (QSA) uses quantile data sketch [Wang et al., 2013] to represent feature values associated with a class label. QSA is a non-parametric representation that enables characterizing the approximate Probability Distribution Function of values. QSA is closely related to the familiar concept of a histogram, where data is summarized into a small number of buckets. Naumann et al. used quantile representation for numerical features to perform feature classification [Naumann et al., 2002]. We apply the exact brute-force approach to compute the quantile sketch for numerical values [Wang et al., 2013], described as follows.

Let $V_i$ be the bag of values in feature $f$ that belong to the training data points with the label $c_i$, and let $Q_f^{(i)}$ be the quantile sketch of $V_i$. First, we scale these values to a predefined range $[lb, ub]$. Generating $Q_f^{(i)}$ involves bucketing all values in $V_i$ into a set of bins. Given a fixed number of bins, $r$, the range $[lb, ub]$ is partitioned into $r$ disjoint bins of uniform width $w = \frac{ub - lb}{r}$. Assume the range $[lb, ub]$ is partitioned into bins $\{b_0, \dots, b_{r-1}\}$, where the bin $b_j$ is the range $[lb + j \cdot w,\; lb + (j+1) \cdot w)$. Function $B(v_l) = b_j$ associates the value $v_l$ in $V_i$ with the bin $b_j$. Function $P(b_j)$ returns the number of feature values that are bucketed in $b_j$. Finally, $I(b_j) = \frac{P(b_j)}{\sum_{0 \le m < r} P(b_m)}$ is the normalized value of $P(b_j)$ across all bins.
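The representation and decision rule of Section 4 can be illustrated with a short sketch. The following Python code is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, min-max scaling over the whole feature is assumed (the text only says that values are scaled to $[lb, ub]$), values equal to $ub$ are placed in the last bin, and the confidence scores passed to `recommend` stand in for the outputs of the trained MLPs.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def scale_to_range(values: Sequence[float], lb: float, ub: float) -> List[float]:
    """Min-max scale values into [lb, ub] (assumed scaling scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: map everything to the middle of the range
        return [(lb + ub) / 2.0 for _ in values]
    return [lb + (v - lo) * (ub - lb) / (hi - lo) for v in values]


def quantile_sketch(values: Sequence[float], n_bins: int,
                    lb: float, ub: float) -> List[float]:
    """Normalized equi-width bin counts I(b_0), ..., I(b_{r-1})."""
    width = (ub - lb) / n_bins
    counts = [0] * n_bins
    for v in values:
        # B(v): index of the bin [lb + j*w, lb + (j+1)*w); v == ub is
        # clamped into the last bin (an assumption of this sketch).
        j = min(max(int((v - lb) / width), 0), n_bins - 1)
        counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * n_bins


def quantile_sketch_array(feature: Sequence[float], labels: Sequence[Hashable],
                          classes: Sequence[Hashable], n_bins: int = 10,
                          lb: float = -10.0, ub: float = 10.0) -> List[float]:
    """R_f = [Q_f^(1); ...; Q_f^(k)]: one sketch per class, concatenated."""
    scaled = scale_to_range(feature, lb, ub)
    rep: List[float] = []
    for c in classes:
        class_values = [v for v, y in zip(scaled, labels) if y == c]
        rep.extend(quantile_sketch(class_values, n_bins, lb, ub))
    return rep


def recommend(scores: Dict[str, float], gamma: float) -> Optional[str]:
    """Equation (1): the arg-max transformation if its score exceeds gamma."""
    best = max(scores, key=scores.get)
    return best if scores[best] > gamma else None


# Two perfectly separated classes give the same representation at any scale.
small = quantile_sketch_array([0.0, 1.0, 9.0, 10.0], [-1, -1, 1, 1], [-1, 1], n_bins=2)
large = quantile_sketch_array([0.0, 1000.0, 9000.0, 10000.0], [-1, -1, 1, 1], [-1, 1], n_bins=2)
```

The resulting vector has length $k \times r$ regardless of how many values the feature contains, which is what allows a single MLP per transformation to be applied across datasets of different sizes.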
Figure 1: An example of feature representation using Quantile Sketch Array. The values of feature f1 are binned into 10 equi-width bins, separately for classes -1 and +1.

To illustrate, Figure 1 shows an example of the representation of a feature. Note that this representation has certain behaviours. For example, for feature values that are perfectly separated in terms of the class label, the representation ends up being very similar regardless of the scale of feature values. In this particular case there is also no need for any transformation since values are already separated. We decide QSA parameters (number of bins and scaling range) based on the performance of classifiers through empirical observations.

4.2 Training

To generate training samples for transformation MLP classifiers, LFE considers numerical features in classification datasets across various repositories. Each classifier is trained with the samples for which the corresponding transformation has been found useful as positive samples and all other samples as negative. In order to decide whether a transformation for a set of features leads to improvement, we evaluate a selected model, $L$, on the original features and the target class as well as the constructed feature by itself and the target class. If the constructed feature shows performance improvement beyond a threshold, $\theta$, the input features together with their class labels are considered as a positive training sample for the transformation classifier.

A sample generated from feature $f$, in a dataset with $k$ classes, for transformation $t$ is translated into $R_f = [Q_f^{(1)}; \dots; Q_f^{(k)}]$. Assuming $b$ bins are used for each quantile data sketch, $R_f$ is a vector of size $k \times b$. Then, $R_f$ is fed into the MLP corresponding to $t$. Assume the MLP corresponding to the unary transformation $t$ has one hidden layer with $h$ units. The probability of $t$ being a useful transformation or not for feature $f$ in a dataset is computed as:

$$[p_{t \text{ is useful}}(f),\; p_{t \text{ is not useful}}(f)] = \sigma_2\left(b^{(2)} + W^{(2)}\left(\sigma_1\left(b^{(1)} + W^{(1)}[Q_f^{(1)}; \dots; Q_f^{(k)}]\right)\right)\right) \qquad (3)$$

where $W^{(1)}$ and $W^{(2)}$ are weight matrices, $b^{(1)}$ and $b^{(2)}$ are bias vectors, and $\sigma_1$ and $\sigma_2$ are the rectified linear unit (ReLU) [Nair and Hinton, 2010] and softmax functions, respectively. We use Stochastic Gradient Descent with minibatches to train transformation MLPs. In order to prevent overfitting, we apply regularization and drop-out [Srivastava et al., 2014]. The generated training samples are dependent on model type. In other words, while there may be overlaps in the suggested feature engineering paradigms across models, the optimal use of LFE for a specific model comes from using that same model while training LFE. In Section 5, we show that LFE is robust in terms of the choice of classification model.

5 Experimental Results

We evaluate three aspects of LFE: (1) the impact of using the Quantile Sketch Array representation on the performance of transformation classifiers compared to other representations, (2) the capability of LFE in recommending useful transformations, (3) the benefit of using LFE to perform feature engineering compared to other alternatives, in prediction accuracy and time taken. We implemented LFE transformation classifiers and the meta-feature learner (auto-encoder) in TensorFlow. Transformation computation and model training were implemented using Scikit-learn. To showcase the capabilities of LFE, we considered the following ten unary and four binary transformations, respectively: log, square-root (both applied on the absolute of values), frequency (count of how often a value occurs), square, round, tanh, sigmoid, isotonic regression, zscore, normalization (mapping to [-1, 1]), sum, subtraction, multiplication and division. In order to avoid data leakage, we apply transformations on train and test folds separately.

Without loss of generality, we focus on binary classification for the purpose of these experiments. We collected 900 classification datasets from the OpenML and UCI repositories to train transformation classifiers. A subset of 50 datasets was reserved for testing and was not used for training. In order to better utilize the datasets, we converted the multi-class problems into one-vs-all binary classification problems. Training samples were generated for Random Forest and Logistic Regression using 10-fold cross validation and a performance improvement threshold, $\theta$, of 1%. Different improvement thresholds result in collecting a variable number of training samples. Since the distribution of collected samples for transformations is not uniform, we decided on the threshold upon due empirical exploration. We generated approximately 84.5K training samples for unary transformations and 122K for binary transformations. Table 1 reports the statistics of positive training samples generated from all datasets. Frequency and multiplication are two transformations that incur performance improvement for the majority of features. All transformation classifiers are MLPs with one hidden layer. We tuned the number of hidden units to optimize the F-score for each classifier, and they vary from 400 to 500. The average time to train transformation classifiers offline was 6 hours.

5.1 Feature Representation

We evaluate the efficacy of Quantile Sketch Array (QSA) through the performance of LFE consisting of QSA compared to other representations, as listed in Table 2.

For the hand-crafted meta-feature representation, we consider the following meta-features used in the literature: first 30 moments of feature values, median, standard deviation, min, max, number of values and its log. For the sampling representation, we obtain 250 stratified random samples per class. We also perform meta-feature learning by training an auto-encoder on all features in our corpus. The input and output of the auto-encoder are the feature hashing representation [Weinberger et al., 2009] of features with 200 hash values per class. The auto-encoder consists of a two-layer encoder and a two-layer decoder, which are symmetric and each has a layer of 100 units connected to another layer of 75 units. For a hashed input feature, we consider the output of the encoder component of a trained auto-encoder as learned meta-features of the feature. Finally, for Quantile Sketch Array, we consider a scaling range of [-10, 10] and a quantile data sketch size of 200 bins.

We use the same set of samples for training classifiers of different representations and tuned the MLPs to have their best possible configuration. In order to overcome the imbalanced-data classification problem, as shown in Table 1, we oversample the minority class samples to have balanced negative and positive samples during the training phase. The average 10-fold cross validation F1-score of a subset of classifiers is reported in Table 2 as a proxy of the benefit of representations. There is a clear distinction in terms of predictive performance and employed representation. The poorest results are obtained by learned meta-features, possibly because feature values belong to various ranges and the hashing representation is not distance preserving, which makes it challenging for the auto-encoder to learn useful meta-features. Hand-crafted meta-features perform better than sampling and learned meta-features; however, Quantile Sketch Array outperforms other representations by a significant margin (35.7%).

Representation                F1 Score
Hand-crafted Meta-features    0.5558
Stratified Sampling           0.5173
Meta-feature Learning         0.3256
Quantile Sketch Array         0.9129

Table 2: F1 Score of Transformation Classifiers.

5.2 Transformation Classifier

To evaluate the predictive power of each classifier, we trained classifiers corresponding to unary and binary transformations, using 10-fold cross validation on all training samples. During the test phase, the corresponding transformation of each classifier is applied on one or a pair of feature(s) and the F1 score of the model based on the transformed feature only is evaluated. The F1 scores of classifiers using Random Forest and an improvement threshold of 1% are shown in Table 1. We considered 0/1 loss evaluation. Table 1 demonstrates the high predictive capability of LFE transformation classifiers. As a baseline, a random transformation recommender converged to an F1 score of 0.50 for all transformation classifiers.

Transformation   #Positive Training Samples   Classifier Performance
log              6710                         0.95
sqrt             6293                         0.96
square           5488                         0.97
freq             73919                        0.77
round            15855                        0.92
tanh             70829                        0.92
sigmoid          70529                        0.80
isotonic-reg     624                          0.98
zscore           664                          0.97
normalize        31019                        0.91
sum              28492                        0.98
subt             36296                        0.99
mult             37086                        0.97
div              19020                        0.98

Table 1: Statistics of Training Samples and F1 Score of LFE Classifiers for 10-fold Cross Validation of Random Forest.

5.3 Transformation Recommender

To showcase the benefits of LFE in recommending useful features, we compare the predictive performance of test datasets augmented with features engineered by LFE and features engineered by the following approaches.
• Random, which iterates for a given r runs and in each run recommends a random or no transformation for feature(s). The performance of the dataset augmented with features constructed by randomly selected transformations is evaluated in each run and the transformations in the run with the highest predictive performance are recommended.
• Majority, which always recommends the single most effective transformation in the training samples (i.e. frequency for unary and multiplication for binary transformations).
• Brute-force, inspired by the Feature Synthesis component of Data Science Machine [Kanter and Veeramachaneni, 2015], enumerates the whole feature space by applying all transformations to all features and performs feature selection on the augmented dataset.
• Model evaluation-based, which chooses the useful transformation for a feature by model evaluation on each constructed feature by each transformation. Having t unary transformations, this approach performs t model training runs for each feature.

To compare LFE with other feature engineering approaches, we consider 50 binary classification datasets. The detailed statistics of 23 of the test datasets are shown in Table 3. Test datasets have a diverse range in number of features (2 to 10,936) and data points (57 to 140,707). All experimental evaluations are based on 10-fold cross validation on a Random Forest model. Since Quantile Sketch Array considers both feature values and class labels, in order to avoid the data leakage problem, we do not consider the entire dataset when LFE computes the recommendations. For experiments with unary transformations, for each fold we ask LFE to recommend at most one transformation for each feature in the train data fold. Next, recommended transformations are applied on the corresponding features of the test data fold and added as new features to the dataset. To analyze only the capability of feature engineering approaches, we do not perform any feature selection, except for the brute-force approach, to which feature selection is essential.

Table 3 compares the predictive performance and execution time of LFE to other feature engineering approaches on test datasets. All columns in this table, except the last, report results when using unary transformations. Since 50% of datasets time out for the evaluation-based approach and 62% for the brute-force approach, we only report the performance of LFE for binary transformations. All reported times in Table 3 include preparing data, recommending transformations, applying transformations and model training. Since the cost of training classifiers is one-time and offline, we do not consider it while comparing the runtime performance. LFE consistently outperforms all approaches on most datasets at a small cost of execution time. We argue that it is effective to spend certain effort offline in order to provide prompt answers at runtime. No feature engineering approach results in improvement of the performance of three datasets: sonar, spambase and twitter-absolute. The evaluation-based approach on AP-omentum-lung, convex, gisette and higgs-boson timed out due to the excessive number of model training calls. Since the
Dataset              #Numerical  #Data    Base     Majority  Brute-    Random     Evaluation-  Unary  Binary
                     Features    Points   Dataset            Force     (10 runs)  based        LFE    LFE
AP-omentum-lung      10936       203      0.883    0.915     0.925     0.908      -            0.929  0.904
AP-omentum-ovary     10936       275      0.724    0.775     0.801     0.745      0.788        0.811  0.775
autos                48          4562     0.946    0.95      0.944     0.929      0.954        0.96   0.949
balance-scale        8           369      0.884    0.916     0.892     0.881      0.882        0.919  0.884
convex               784         50000    0.82     0.5       0.913     0.5        -            0.819  0.821
credit-a             6           690      0.753    0.647     0.521     0.643      0.748        0.771  0.771
dbworld-bodies       2           100      0.93     0.939     0.927     0.909      0.921        0.961  0.923
diabetes             8           768      0.745    0.694     0.737     0.719      0.731        0.762  0.749
fertility            9           100      0.854    0.872     0.861     0.832      0.833        0.873  0.861
gisette              5000        2100     0.941    0.601     0.741     0.855      -            0.942  0.933
hepatitis            6           155      0.747    0.736     0.753     0.727      0.814        0.807  0.831
higgs-boson-subset   28          50000    0.676    0.584     0.661     0.663      -            0.68   0.677
ionosphere           34          351      0.931    0.918     0.912     0.907      0.913        0.932  0.925
labor                8           57       0.856    0.827     0.855     0.806      0.862        0.896  0.896
lymph                10936       138      0.673    0.664     0.534     0.666      0.727        0.757  0.719
madelon              500         780      0.612    0.549     0.585     0.551      0.545        0.617  0.615
megawatt1            37          253      0.873    0.874     0.882     0.869      0.877        0.894  0.885
pima-indians-subset  8           768      0.74     0.687     0.751     0.726      0.735        0.745  0.76
secom                590         470      0.917    0.917     0.913     0.915      0.915        0.918  0.915
sonar                60          208      0.808    0.763     0.468     0.462      0.806        0.801  0.783
spambase             57          4601     0.948    0.737     0.39      0.413      0.948        0.947  0.947
spectf-heart         43          80       0.941    0.955     0.881     0.942      0.955        0.955  0.956
twitter-absolute     77          140707   0.964    0.866     0.946     0.958      0.963        0.964  0.964

Feature Engineering and Model Evaluation Time (seconds):
geomean                                   2.66     11.06     48.54     69.57      403.81       18.28  44.58
average                                   19.23    1219.34   13723.51  2041.52    10508.75     97.90  188.17
Table 3: Statistics of Datasets and F1 Score of LFE and Other Feature Engineering Approaches with 10-fold Cross Validation of Random
Forest. The best performing approach is shown in bold for each dataset. The improving approaches are underlined.
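
Table 3 summarizes running time with both a geometric mean and an arithmetic average, and the two differ sharply (for brute-force, 48.54 versus 13723.51 seconds). The sketch below shows how the two summaries are computed and why a few very slow datasets inflate the average far more than the geometric mean. The helper names and the timing values are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import math
from typing import Sequence


def geometric_mean(times: Sequence[float]) -> float:
    """n-th root of the product, computed in log space for stability."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


def arithmetic_mean(times: Sequence[float]) -> float:
    """Plain average; dominated by the largest values."""
    return sum(times) / len(times)


# Hypothetical per-dataset times in seconds: three fast runs, one very slow run.
hypothetical_times = [1.0, 10.0, 100.0, 10000.0]

geo = geometric_mean(hypothetical_times)
avg = arithmetic_mean(hypothetical_times)
print(f"geomean = {geo:.2f} s, average = {avg:.2f} s")
```

Reading the two rows together therefore indicates both the typical per-dataset cost (geometric mean) and the presence of expensive outliers (average).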