

Learning Feature Engineering for Classification

Conference Paper · August 2017


DOI: 10.24963/ijcai.2017/352


All content following this page was uploaded by Udayan Khurana on 21 November 2017.



Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Learning Feature Engineering for Classification


Fatemeh Nargesian¹, Horst Samulowitz², Udayan Khurana², Elias B. Khalil³, Deepak Turaga²
¹University of Toronto, ²IBM Research, ³Georgia Institute of Technology
fnargesian@cs.toronto.edu, {samulowitz, ukhurana}@us.ibm.com, lyes@gatech.edu, turaga@us.ibm.com

Abstract

Feature engineering is the task of improving predictive modelling performance on a dataset by transforming its feature space. Existing approaches to automate this process rely on either transformed feature space exploration through evaluation-guided search, or explicit expansion of datasets with all transformed features followed by feature selection. Such approaches incur high computational costs in runtime and/or memory. We present a novel technique, called Learning Feature Engineering (LFE), for automating feature engineering in classification tasks. LFE is based on learning the effectiveness of applying a transformation (e.g., arithmetic or aggregate operators) on numerical features, from past feature engineering experiences. Given a new dataset, LFE recommends a set of useful transformations to be applied on features without relying on model evaluation or explicit feature expansion and selection. Using a collection of datasets, we train a set of neural networks, which aim at predicting the transformation that impacts classification performance positively. Our empirical results show that LFE outperforms other feature engineering approaches for an overwhelming majority (89%) of the datasets from various sources while incurring a substantially lower computational cost.

1 Introduction

Feature engineering is a central task in data preparation for machine learning. It is the practice of constructing suitable features from given features that lead to improved predictive performance. Feature engineering involves the application of transformation functions such as arithmetic and aggregate operators on given features to generate new ones. Transformations help scale a feature or convert a non-linear relation between a feature and a target class into a linear relation, which is easier to learn.

Feature engineering is usually conducted by a data scientist relying on her domain expertise and iterative trial and error and model evaluation. To perform automated feature engineering, some existing approaches adopt guided-search in feature space using heuristic feature quality measures (such as information gain) and other surrogate measures of performance [Markovitch and Rosenstein, 2002; Fan et al., 2010]. Others perform greedy feature construction and selection based on model evaluation [Dor and Reich, 2012; Khurana et al., 2016]. Kanter et al. proposed the Data Science Machine (DSM), which treats the feature engineering problem as feature selection on the space of novel features. DSM relies on exhaustively enumerating all possible features that can be constructed from a dataset, given sequences generated from a set of transformations, then performing feature selection on the augmented dataset [Kanter and Veeramachaneni, 2015]. Evaluation-based and exhaustive feature enumeration and selection approaches result in high time and memory cost and may lead to overfitting due to brute-force generation of features. Moreover, although deep neural networks (DNN) allow for useful meta-features to be learned automatically [Bengio et al., 2013], the learned features are not always interpretable and DNNs are not effective learners in various application domains.

In this paper, we propose LFE (Learning Feature Engineering), a novel meta-learning approach to automatically perform interpretable feature engineering for classification, based on learning from past feature engineering experiences. By generalizing the impact of different transformations on the performance of a large number of datasets, LFE learns useful patterns between features, transforms and target that improve learning accuracy. We show that generalizing such patterns across thousands of features from hundreds of datasets can be used to successfully predict suitable transformations for features in new datasets without actually applying the transformations or performing model building and validation, which are time consuming. LFE takes as input a dataset and recommends a set of paradigms for constructing new useful features. Each paradigm consists of a transformation and an ordered list of features on which the transformation is suitable.

At the core of LFE, there is a set of Multi-Layer Perceptron (MLP) classifiers, each corresponding to a transformation. Given a set of features and class labels, the classifier predicts whether the transformation can derive a more useful feature than the input features. LFE considers the notion of feature and class relevance in the context of a transformation as the measure of the usefulness of a pattern of feature value
and class label distributions, and transformation.

Different datasets contain different feature sizes and different value ranges. One key challenge in generalizing across different datasets is to convert feature values and their class labels to a fixed-size feature vector representation that can be fed into LFE classifiers. To characterize datasets, hand-crafted meta-features, fixed-size stratified sampling, neural networks and hashing methods have been used for different tasks [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015; Weinberger et al., 2009]. However, these representations do not directly capture the correlation between feature values and class labels. To capture such correlations, LFE constructs a stack of fixed-size representations of feature values per target class. We use Quantile Data Sketch to represent feature values of each class. The quantile sketch has been used as a fixed-size space representation and achieves a reasonably accurate approximation to the distribution function induced by the data being sketched [Greenwald and Khanna, 2001].

LFE presents a computationally efficient and effective alternative to other automated feature engineering approaches by recommending suitable transformations for features in a dataset. To showcase the capabilities of LFE, we trained LFE on 85K features, extracted from 900 datasets, for 10 unary transformations and 122K feature pairs for 4 binary transformations, for two models: Random Forest and Logistic Regression. The transformations are listed in Table 1. We empirically compare LFE with a suite of feature engineering approaches proposed in the literature or applied in practice (such as the Data Science Machine, evaluation-based, random selection of transformations and always applying the most popular transformation in the training data) on a subset of 50 datasets from the UCI repository [Lichman, 2013], OpenML [Vanschoren et al., 2014] and other sources. Our experiments show that, of the datasets that demonstrated any improvement through feature engineering, LFE was the most effective in 89% of the cases. As shown in Figure 2, similar results were observed for the LFE trained with Logistic Regression. Moreover, LFE runs in significantly less time than the other approaches. This also enables interactions with a practitioner since it recommends transformations on features in a short amount of time.

2 Related Work

The FICUS algorithm [Markovitch and Rosenstein, 2002] takes as input a dataset and a set of transformations, and performs a beam search over the space of possible features. FICUS's search for better features is guided by heuristic measures based on information gain in a decision tree, and other surrogate measures of performance. The more recent FCTree [Fan et al., 2010] also uses a decision tree to partition the data using original or constructed features as splitting points. The FEADIS algorithm of [Dor and Reich, 2012] relies on a combination of random feature generation and feature selection. FEADIS adds constructed features greedily, and as such requires many expensive performance evaluations. The Deep Feature Synthesis component of Data Science Machine (DSM) [Kanter and Veeramachaneni, 2015] relies on exhaustively enumerating all possible new features, then performing feature selection over the augmented dataset. Cognito [Khurana et al., 2016] recommends a series of transformations based on a greedy, hierarchical heuristic search. Cognito and DSM focus on sequences of feature transformations, which is outside the scope of this paper. In contrast to these approaches, we do not require expensive classifier evaluations to measure the impact of a transformation. ExploreKit [Katz et al., 2016] also generates all possible candidate features, but uses a learned ranking function to sort them. ExploreKit requires generating all possible candidate features when engineering features for a new (test) dataset and as such, results reported for ExploreKit are with a time limit of three days per dataset. In contrast, LFE can generate effective features within seconds, on average.

Several machine learning methods perform feature extraction or learning indirectly. While they do not explicitly work on input features and transformations, they generate new features as a means to solving another problem [Storcheus et al., 2015]. Methods of that type include dimensionality reduction, kernel methods and deep learning. Kernel algorithms such as SVM [Shawe-Taylor and Cristianini, 2004] can be seen as embedded methods, where the learning and the (implicit) feature generation are performed jointly. This is in contrast to our setting, where feature engineering is a preprocessing step.

Deep neural networks learn useful features automatically [Bengio et al., 2013] and have shown remarkable successes on video, image and speech data. However, in some domains feature engineering is still required. Moreover, features derived by neural networks are often not interpretable, which is an important factor in certain application domains such as healthcare [Che et al., 2015].

3 Automated Feature Engineering Problem

Consider a dataset, D, with features $F = \{f_1, \dots, f_n\}$ and a target class, a set of transformations $T = \{T_1, \dots, T_m\}$, and a classification task, L. The feature engineering problem is to find the q best paradigms for constructing new features such that appending the new features to D maximizes the accuracy of L. Each paradigm consists of a candidate transformation $T_c \in T$ of arity r, an ordered list of features $[f_i, \dots, f_{i+r-1}]$ and a usefulness score.

For a dataset with n features and with u unary transformations, $O(u \times n)$ new features can be constructed. With b binary transformations, there are $O(b \times P^n_2)$ new possible features, where $P^n_2$ is the number of 2-permutations of n features. Given a fixed set of transformations, the number of new features and their combinations to explore, for an exact solution, grows exponentially. Hence, we make the case that mere enumeration and trial by model training and testing is not a computationally practical option, and a scalable solution to the problem must avoid this computational bottleneck. LFE reduces the complexity of feature space exploration by providing an efficient approach for finding a particularly "good" transformation for a given set of features. Therefore, given n features, for unary and binary transformations, LFE performs, respectively, $O(n)$ and $O(P^n_2)$ transformation predictions.

In order to assess the relative impact of adding new features
across different techniques, we add as many new features as those originally in the data. For unary transformations, LFE predicts the most suitable transformation for each feature. For binary and higher arity transformations, LFE considers a random sample of all combinations of features, finds the paradigm for each combination and selects the top-k useful ones. In the following section, we describe how LFE learns and predicts useful transformations for features.

4 Transformation Recommendation

LFE models the problem of predicting a useful r-ary transformation $T_c \in T_r$ ($T_r$ is the set of r-ary transformations in T) for a given list of features $[f_1, \dots, f_r]$ as a multi-class classification problem, where the input is a representation of features, $R_{[f_1,\dots,f_r]}$, and the output classes are transformations in $T_r$. LFE takes a one-vs-rest approach [Rifkin and Klautau, 2004]. Each transformation is modelled as a Multi-Layer Perceptron (MLP) binary classifier with real-valued confidence scores as output. Recommending an r-ary transformation for r features involves applying all $|T_r|$ MLPs on $R_{[f_1,\dots,f_r]}$. If the highest confidence score obtained from classifiers is above a given threshold, LFE recommends the corresponding transformation to be applied on the features. Let $g_k(R_{[f_1,\dots,f_r]})$ be the confidence score of the MLP corresponding to transformation $T_k$, and let $\gamma$ be the threshold for confidence scores, which we determined empirically. LFE recommends the transformation $T_c$ for features $[f_1, \dots, f_r]$ as follows:

$$c = \arg\max_k \; g_k(R_{[f_1,\dots,f_r]})$$

$$\text{recommend}: \begin{cases} T_c, & \text{if } g_c(R_{[f_1,\dots,f_r]}) > \gamma \\ \text{none}, & \text{otherwise} \end{cases} \tag{1}$$

In the following sections, we describe how numerical features are represented to be the input of transformation MLPs. Next, we explain the process of collecting past feature engineering knowledge as training samples to train MLPs.

4.1 Feature-Class Representation

Transformations are used to reveal and improve significant correlation or discriminative information between features and class labels. The more pronounced this correlation, the higher the chance that a model can achieve a better predictive performance. Each LFE classifier learns the patterns of feature-class distributions for which the corresponding transformation has been effective in improving feature-class correlation. LFE represents feature f in a dataset with k classes as follows:

$$R_f = \left[ Q_f^{(1)}; Q_f^{(2)}; \dots; Q_f^{(k)} \right] \tag{2}$$

where $Q_f^{(i)}$ is a fixed-size representation of values in f that are associated with class i. We call this representation Quantile Sketch Array. Next, we describe how feature values associated with class i are translated into the representation $Q_f^{(i)}$, which is meant to capture the distribution of feature values.

Neural networks have been successful in learning representations for image and speech data [Bengio et al., 2013]. Others have proposed solutions for estimating a Probability Distribution Function (PDF) in an n-dimensional space [Garrido and Juste, 1998]. However, it is not clear how existing representation learning and PDF learning approaches may be applied in the context of raw numerical data. The main challenge is the high variability in the size and the range of feature values (e.g., from 10 to millions). In our setting, features are data points represented with various numbers of dimensions (number of distinct feature values). Hence, Random Projection [Bingham and Mannila, 2001] for dimensionality reduction is not applicable. Although Recurrent Neural Networks can deal with varying input size, we aim at determining a fixed-size representation that captures the correlation between features and target classes.

Previous approaches have used hand-crafted meta-features, including information-theoretic and statistical meta-features, to represent datasets [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015]. Such meta-features aim at modelling the distribution of values in datasets. Performing fixed-size sampling of feature values is another approach for representing distributions. Samples extracted from features and classes are required to reflect the distribution of values in both feature and target classes. While stratified sampling solves the issue for one feature, it becomes more complex for multiple features with high correlation [Skinner et al., 1994]. Feature hashing has been used to represent features of type "string" as feature vectors [Weinberger et al., 2009]. Although feature hashing can be generalized for numerical values, it is not straightforward to choose distance-preserving hash functions that map values within a small range to the same hash value. In Section 5, we empirically show how LFE transformation classifiers perform using each of the representations described above.

Quantile Sketch Array (QSA) uses quantile data sketch [Wang et al., 2013] to represent feature values associated with a class label. QSA is a non-parametric representation that enables characterizing the approximate Probability Distribution Function of values. QSA is closely related to the familiar concept of a histogram, where data is summarized into a small number of buckets. Naumann et al. used quantile representation for numerical features to perform feature classification [Naumann et al., 2002]. We apply the exact brute-force approach to compute the quantile sketch for numerical values [Wang et al., 2013], described as follows.

Let $V_k$ be the bag of values in feature f that belong to the training data points with the label $c_k$, and let $Q_f^{(i)}$ be the quantile sketch of $V_k$. First, we scale these values to a predefined range $[lb, ub]$. Generating $Q_f^{(i)}$ involves bucketing all values in $V_k$ into a set of bins. Given a fixed number of bins, r, the range $[lb, ub]$ is partitioned into r disjoint bins of uniform width $w = \frac{ub - lb}{r}$. Assume the range $[lb, ub]$ is partitioned into bins $\{b_0, \dots, b_{r-1}\}$, where the bin $b_j$ is the range $[lb + j \cdot w,\; lb + (j+1) \cdot w)$. Function $B(v_l) = b_j$ associates the value $v_l$ in $V_k$ with the bin $b_j$. Function $P(b_j)$ returns the number of feature values that are bucketed in $b_j$. Finally, $I(b_j) = \frac{P(b_j)}{\sum_{0 \le m < r} P(b_m)}$ is the normalized value of $P(b_j)$ across all bins.
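
As an illustration of the bucketing steps above and of the recommendation rule in Equation 1, the following Python sketch builds a Quantile Sketch Array for one feature and applies a threshold to a set of confidence scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the min-max scaling over the whole feature, the clipping of the maximum value into the last bin, and the constant scoring functions that stand in for trained MLPs are all assumptions made for the example.

```python
def quantile_sketch_array(feature, labels, n_bins=10, lb=-10.0, ub=10.0):
    """Return R_f: one normalized equi-width histogram per class, concatenated."""
    lo, hi = min(feature), max(feature)
    span = hi - lo
    width = (ub - lb) / n_bins
    classes = sorted(set(labels))
    counts = {c: [0] * n_bins for c in classes}
    for v, c in zip(feature, labels):
        # Assumption: min-max scale over the whole feature into [lb, ub].
        scaled = lb + (v - lo) * (ub - lb) / span if span > 0 else (lb + ub) / 2.0
        # Bin b_j covers [lb + j*w, lb + (j+1)*w); the value ub is clipped
        # into the last bin (an assumption, since the interval is half-open).
        j = int((scaled - lb) // width)
        j = max(0, min(j, n_bins - 1))
        counts[c][j] += 1
    sketch = []
    for c in classes:
        total = sum(counts[c])                       # sum of P(b_m) over all bins
        sketch.extend(n / total for n in counts[c])  # I(b_j) for each bin
    return sketch


def recommend(representation, scorers, threshold):
    """Equation 1: pick the highest-scoring transformation if it clears the threshold."""
    scores = {name: g(representation) for name, g in scorers.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


# Hypothetical feature whose values are perfectly separated by class.
feature = [1, 2, 3, 4, 100, 200, 300, 400]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
r_f = quantile_sketch_array(feature, labels, n_bins=10)

# Placeholder scorers: constant functions standing in for trained MLPs.
placeholder_scorers = {"log": lambda r: 0.9, "square": lambda r: 0.4}
choice = recommend(r_f, placeholder_scorers, threshold=0.5)
print(len(r_f), choice)
```

With two classes and ten bins, the representation has k × b = 20 entries, and each per-class block sums to one. Raising the threshold above every score makes `recommend` return `None`, which corresponds to the "none" branch of Equation 1.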

Figure 1: An example of feature representation using Quantile Sketch Array. The values of feature f1 are binned into 10 equi-width bins, separately for classes −1 and +1.

To illustrate, Figure 1 shows an example of the representation of a feature. Note that this representation has certain behaviours. For example, for feature values that are perfectly separated in terms of the class label, the representation ends up being very similar regardless of the scale of feature values. In this particular case there is also no need for any transformation since values are already separated. We decide QSA parameters (number of bins and scaling range) based on the performance of classifiers through empirical observations.

4.2 Training

To generate training samples for transformation MLP classifiers, LFE considers numerical features in classification datasets across various repositories. Each classifier is trained with the samples for which the corresponding transformation has been found useful as positive samples and all other samples as negative. In order to decide whether a transformation for a set of features leads to improvement, we evaluate a selected model, L, on the original features and the target class as well as the constructed feature by itself and the target class. If the constructed feature shows performance improvement beyond a threshold, $\theta$, the input features together with their class labels are considered as a positive training sample for the transformation classifier.

A sample generated from feature f, in a dataset with k classes, for transformation t is translated into $R_f = [Q_f^{(1)}; \dots; Q_f^{(k)}]$. Assuming b bins are used for each quantile data sketch, $R_f$ is a vector of size $k \times b$. Then, $R_f$ is fed into the MLP corresponding to t. Assume the MLP corresponding to the unary transformation t has one hidden layer with h units. The probability of t being a useful transformation or not for feature f in a dataset is computed as:

$$[p_{t \text{ is useful}}(f),\; p_{t \text{ is not useful}}(f)] = \sigma_2\left(b^{(2)} + W^{(2)}\,\sigma_1\left(b^{(1)} + W^{(1)}[Q_f^{(1)}; \dots; Q_f^{(k)}]\right)\right) \tag{3}$$

where $W^{(1)}$ and $W^{(2)}$ are weight matrices, $b^{(1)}$ and $b^{(2)}$ are bias vectors, and $\sigma_1$ and $\sigma_2$ are the rectified linear unit (ReLU) and softmax functions, respectively [Nair and Hinton, 2010]. We use Stochastic Gradient Descent with minibatches to train transformation MLPs. In order to prevent overfitting, we apply regularization and drop-out [Srivastava et al., 2014]. The generated training samples are dependent on model type. In other words, while there may be overlaps in the suggested feature engineering paradigms across models, the optimal use of LFE for a specific model comes from using that same model while training LFE. In Section 5, we show that LFE is robust in terms of the choice of classification model.

5 Experimental Results

We evaluate three aspects of LFE: (1) the impact of using Quantile Sketch Array representation on the performance of transformation classifiers compared to other representations, (2) the capability of LFE in recommending useful transformations, (3) the benefit of using LFE to perform feature engineering compared to other alternatives, in prediction accuracy and time taken. We implemented LFE transformation classifiers and the meta-feature learner (auto-encoder) in TensorFlow. Transformation computation and model training were implemented using Scikit-learn. To showcase the capabilities of LFE, we considered the following ten unary and four binary transformations, respectively: log, square-root (both applied on the absolute of values), frequency (count of how often a value occurs), square, round, tanh, sigmoid, isotonic regression, zscore, normalization (mapping to [-1, 1]), sum, subtraction, multiplication and division. In order to avoid data leakage, we apply transformations on train and test folds separately.

Without loss of generality, we focus on binary classification for the purpose of these experiments. We collected 900 classification datasets from the OpenML and UCI repositories to train transformation classifiers. A subset of 50 datasets was reserved for testing and was not used for the purpose of training. In order to better utilize the datasets, we converted the multi-class problems into one-vs-all binary classification problems. Training samples were generated for Random Forest and Logistic Regression using 10-fold cross validation and the performance improvement threshold, $\theta$, of 1%. Different improvement thresholds result in collecting a variable number of training samples. Since the distribution of collected samples for transformations is not uniform, we decided on the threshold upon due empirical exploration. We generated approximately 84.5K training samples for unary transformations and 122K for binary transformations. Table 1 reports the statistics of positive training samples generated from all datasets. Frequency and multiplication are two transformations that incur performance improvement for the majority of features. All transformation classifiers are MLPs with one hidden layer. We tuned the number of hidden units to optimize the F-score for each classifier, and they vary from 400 to 500. The average time to train transformation classifiers offline was 6 hours.

Transformation   #Positive Training Samples   Classifier Performance
log              6710                         0.95
sqrt             6293                         0.96
square           5488                         0.97
freq             73919                        0.77
round            15855                        0.92
tanh             70829                        0.92
sigmoid          70529                        0.80
isotonic-reg     624                          0.98
zscore           664                          0.97
normalize        31019                        0.91
sum              28492                        0.98
subt             36296                        0.99
mult             37086                        0.97
div              19020                        0.98

Table 1: Statistics of Training Samples and F1 Score of LFE Classifiers for 10-fold Cross Validation of Random Forest.

5.1 Feature Representation

We evaluate the efficacy of Quantile Sketch Array (QSA) through the performance of LFE consisting of QSA compared to other representations, as listed in Table 2.

Hand-crafted     Stratified   Meta-feature   Quantile
Meta-features    Sampling     Learning       Sketch Array
0.5558           0.5173       0.3256         0.9129

Table 2: F1 Score of Transformation Classifiers.

For hand-crafted meta-feature representation, we consider the following meta-features used in the literature: first 30 moments of feature values, median, standard deviation, min, max, number of values and its log. For sampling representation, we obtain 250 stratified random samples per class. We also perform meta-feature learning by training an auto-encoder on all features in our corpus. The input and output of the auto-encoder are the feature hashing representation [Weinberger et al., 2009] of features with 200 hash values per class. The auto-encoder consists of a two-layer encoder and a two-layer decoder, which are symmetric and each has a layer of 100 units connected to another layer of 75 units. For a hashed input feature, we consider the output of the encoder component of a trained auto-encoder as learned meta-features of the feature. Finally, for Quantile Sketch Array, we consider a scaling range of [-10, 10] and a quantile data sketch size of 200 bins.

We use the same set of samples for training classifiers of different representations and tuned the MLPs to have their best possible configuration. In order to overcome the imbalanced data classification problem, as shown in Table 1, we oversample the minority class samples to have balanced negative and positive samples during the training phase. The average 10-fold cross validation F1-score of a subset of classifiers is reported in Table 2 as a proxy of the benefit of representations. There is a clear distinction in terms of predictive performance and employed representation. The poorest results are obtained by learned meta-features, possibly because feature values belong to various ranges and the hashing representation is not distance preserving, so it is challenging for the auto-encoder to learn useful meta-features. Hand-crafted meta-features perform better than sampling and learned meta-features; however, Quantile Sketch Array outperforms the other representations by a significant margin (35.7 percentage points of F1 over the next best representation).

5.2 Transformation Classifier

To evaluate the predictive power of each classifier, we trained classifiers corresponding to unary and binary transformations, using 10-fold cross validation on all training samples. During the test phase, the corresponding transformation of each classifier is applied on one or a pair of feature(s) and the F1 score of the model based on the transformed feature only is evaluated. The F1 scores of classifiers using Random Forest and an improvement threshold of 1% are shown in Table 1. We considered 0/1 loss evaluation. Table 1 demonstrates the high predictive capability of LFE transformation classifiers. As a baseline, a random transformation recommender converged to an F1 score of 0.50 for all transformation classifiers.

5.3 Transformation Recommender

To showcase the benefits of LFE in recommending useful features, we compare the predictive performance of test datasets augmented with features engineered by LFE and features engineered by the following approaches.

• Random, which iterates for a given r runs and in each run recommends a random or no transformation for feature(s). The performance of the dataset augmented with features constructed by randomly selected transformations is evaluated in each run and the transformations in the run with the highest predictive performance are recommended.
• Majority, which always recommends the single most effective transformation in the training samples (i.e. frequency for unary and multiplication for binary transformations).
• Brute-force, inspired by the Feature Synthesis component of Data Science Machine [Kanter and Veeramachaneni, 2015], which enumerates the whole feature space by applying all transformations to all features and performs feature selection on the augmented dataset.
• Model evaluation-based, which chooses the useful transformation for a feature by model evaluation on each feature constructed by each transformation. Having t unary transformations, this approach performs t model training runs for each feature.

To compare LFE with other feature engineering approaches, we consider 50 binary classification datasets. The detailed statistics of 23 of the test datasets are shown in Table 3. Test datasets have a diverse range in number of features (2 to 10,936) and data points (57 to 140,707). All experimental evaluations are based on 10-fold cross validation on a Random Forest model. Since Quantile Sketch Array considers both feature values and class labels, in order to avoid the data leakage problem, we do not consider the entire dataset when LFE computes the recommendations. For experiments with unary transformations, for each fold, we ask LFE to recommend at most one transformation for each feature in the train data fold. Next, recommended transformations are applied on the corresponding features of the test data fold and added as new features to the dataset. To only analyze the capability of feature engineering approaches, we do not perform any feature selection, except for the brute-force approach, to which feature selection is essential.

Table 3 compares the predictive performance and execution time of LFE to other feature engineering approaches on test datasets. All columns in this table, except the last, report results when using unary transformations. Since 50% of datasets time out for the evaluation-based approach and 62% for the brute-force approach, we only report the performance of LFE for binary transformations. All reported times in Table 3 include preparing data, recommending transformations, applying transformations and model training. Since the cost of training classifiers is one-time and offline, we do not consider it while comparing the runtime performance. LFE consistently outperforms all approaches on most datasets at a small cost of execution time. We argue that it is effective to spend certain effort offline in order to provide prompt answers at runtime. No feature engineering approach improves the performance on three datasets: sonar, spambase and twitter-absolute. The evaluation-based approach on AP-omentum-lung, convex, gisette and higgs-boson timed out due to the excessive number of model training calls. Since the
2533
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Dataset              #Numerical  #Data    Base     Majority  Brute-    Random     Evaluation-  Unary  Binary
                     Features    Points   Dataset            Force     (10 runs)  based        LFE    LFE
AP-omentum-lung      10936       203      0.883    0.915     0.925     0.908      -            0.929  0.904
AP-omentum-ovary     10936       275      0.724    0.775     0.801     0.745      0.788        0.811  0.775
autos                48          4562     0.946    0.95      0.944     0.929      0.954        0.96   0.949
balance-scale        8           369      0.884    0.916     0.892     0.881      0.882        0.919  0.884
convex               784         50000    0.82     0.5       0.913     0.5        -            0.819  0.821
credit-a             6           690      0.753    0.647     0.521     0.643      0.748        0.771  0.771
dbworld-bodies       2           100      0.93     0.939     0.927     0.909      0.921        0.961  0.923
diabetes             8           768      0.745    0.694     0.737     0.719      0.731        0.762  0.749
fertility            9           100      0.854    0.872     0.861     0.832      0.833        0.873  0.861
gisette              5000        2100     0.941    0.601     0.741     0.855      -            0.942  0.933
hepatitis            6           155      0.747    0.736     0.753     0.727      0.814        0.807  0.831
higgs-boson-subset   28          50000    0.676    0.584     0.661     0.663      -            0.68   0.677
ionosphere           34          351      0.931    0.918     0.912     0.907      0.913        0.932  0.925
labor                8           57       0.856    0.827     0.855     0.806      0.862        0.896  0.896
lymph                10936       138      0.673    0.664     0.534     0.666      0.727        0.757  0.719
madelon              500         780      0.612    0.549     0.585     0.551      0.545        0.617  0.615
megawatt1            37          253      0.873    0.874     0.882     0.869      0.877        0.894  0.885
pima-indians-subset  8           768      0.74     0.687     0.751     0.726      0.735        0.745  0.76
secom                590         470      0.917    0.917     0.913     0.915      0.915        0.918  0.915
sonar                60          208      0.808    0.763     0.468     0.462      0.806        0.801  0.783
spambase             57          4601     0.948    0.737     0.39      0.413      0.948        0.947  0.947
spectf-heart         43          80       0.941    0.955     0.881     0.942      0.955        0.955  0.956
twitter-absolute     77          140707   0.964    0.866     0.946     0.958      0.963        0.964  0.964
Feature Engineering and Model Evaluation Time (seconds)
geomean                                   2.66     11.06     48.54     69.57      403.81       18.28  44.58
average                                   19.23    1219.34   13723.51  2041.52    10508.75     97.90  188.17

Table 3: Statistics of Datasets and F1 Score of LFE and Other Feature Engineering Approaches with 10-fold Cross Validation of Random
Forest. The best performing approach is shown in bold for each dataset. The improving approaches are underlined.

evaluation-based approach requires model training for each combination of feature and transformation, the evaluation-based approach is only applicable to datasets with a small number of features and data points.
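The cost argument above can be made concrete with a minimal sketch of an evaluation-guided loop. This is not the implementation benchmarked in Table 3: the function names, the `toy_evaluate` scorer and the `min_gain` threshold are hypothetical, and only unary transformations are covered. The point is the count of evaluation calls, which grows with the number of features times the number of transformations.

```python
import math
from typing import Callable, Dict, List, Tuple

Column = List[float]
Dataset = Dict[str, Column]


def evaluation_based_search(
    features: Dataset,
    transforms: Dict[str, Callable[[float], float]],
    evaluate: Callable[[Dataset], float],
    min_gain: float = 0.01,  # assumed acceptance threshold, for illustration only
) -> Tuple[List[Tuple[str, str]], int]:
    """Score every (feature, unary transform) pair with one model evaluation each."""
    base_score = evaluate(features)
    calls = 1  # the baseline evaluation
    accepted: List[Tuple[str, str]] = []
    for feature_name, column in features.items():
        for transform_name, transform in transforms.items():
            candidate = dict(features)  # shallow copy; original columns are reused
            candidate[f"{transform_name}({feature_name})"] = [transform(v) for v in column]
            score = evaluate(candidate)  # stands for one full train-and-validate cycle
            calls += 1
            if score - base_score >= min_gain:
                accepted.append((feature_name, transform_name))
    return accepted, calls


def toy_evaluate(dataset: Dataset) -> float:
    """Hypothetical stand-in for cross-validated model training: rewards log features."""
    return 0.6 if any(name.startswith("log(") for name in dataset) else 0.5


features = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}
transforms = {"log": math.log, "sqrt": math.sqrt}
accepted, calls = evaluation_based_search(features, transforms, toy_evaluate)
print(accepted, calls)
```

With two features and two transformations the loop already makes five evaluations. Under the same accounting, a dataset with 5000 features such as gisette and, say, ten unary transformations would need 50,001 evaluations, each of which is a 10-fold cross validation in the setting of Table 3.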
To broaden the scope of experiments, we performed feature engineering using the same approaches and settings as in Table 3 on 50 test datasets. Figure 2 shows the percentage of test datasets whose predictive performance improved by each feature engineering approach. For Random Forest, we observed that for 24% of test datasets, none of the considered approaches, including LFE, can improve classification performance. In all remaining datasets, LFE always improves the predictive performance and for 89% of such datasets LFE generates features that result in the highest classification performance. For the remaining 11% of datasets, the higher performance is achieved by other approaches at higher computational cost than LFE. To investigate the robustness of LFE with respect to the model, we performed the experiments for Logistic Regression and similar results were observed. While LFE is a one-shot approach, it already achieves an average improvement of more than 2% in F1 score and a maximal improvement of 13% across the 50 test datasets.

Figure 2: The percentage of datasets, from a sample of 50, for which a feature engineering approach results in performance improvement (measured by F1 score of 10 fold cross validation for Random Forest and Logistic Regression).

6 Conclusion and Future Work
In this paper, we present a novel framework called LFE to perform automated feature engineering by learning patterns between feature characteristics, class distributions, and useful transformations, from historical data. The cornerstone of our framework is a novel feature representation, called Quantile Sketch Array (QSA), that reduces any variable-sized feature to a fixed-size array, preserving its essential characteristics. QSA enables LFE to learn and predict the ability of a transform to improve the accuracy of a given dataset. Our empirical evaluation demonstrates the efficacy and efficiency of LFE in improving the predictive performance at low computational costs for a variety of classification problems. We plan to add more transformations, to design an iterative version of LFE, and combine it with exploration-based methods to achieve even more pronounced improvements. In addition, we aim to use the family of Recurrent Neural Networks to deal with both varying dataset sizes and relationships across multiple features within a dataset to improve transformation recommendation.
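The idea of mapping a feature of any length to a fixed-size array can be illustrated with a small sketch. It is a simplified stand-in in the spirit of QSA, not the exact construction defined earlier in the paper: the equal-width bucketing, the `bins` parameter and the function name `fixed_size_sketch` are assumptions made for this example.

```python
from typing import Dict, Hashable, List, Sequence


def fixed_size_sketch(
    values: Sequence[float],
    labels: Sequence[Hashable],
    bins: int = 10,  # assumed sketch size; every feature maps to this many buckets
) -> Dict[Hashable, List[float]]:
    """Summarise one numeric feature as `bins` normalised bucket counts per class."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid division by zero for constant features
    counts: Dict[Hashable, List[float]] = {}
    for value, label in zip(values, labels):
        index = min(int((value - lo) / width), bins - 1)  # clamp the maximum value
        counts.setdefault(label, [0.0] * bins)[index] += 1.0
    # Normalise per class so features with different numbers of rows are comparable.
    return {label: [c / sum(row) for c in row] for label, row in counts.items()}


sketch = fixed_size_sketch([0.0, 1.0, 2.0, 3.0, 4.0, 10.0], [0, 0, 0, 1, 1, 1], bins=5)
short = fixed_size_sketch([5.0, 7.0], ["x", "y"], bins=5)
print(sketch)
print(short)
```

Both calls return arrays of length five per class even though the inputs have six and two rows. That fixed size is the property that lets a single classifier consume features from datasets of different sizes.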


References

[Bengio et al., 2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE TPAMI, 35(8), 2013.
[Bingham and Mannila, 2001] Ella Bingham and Heikki Mannila. Random projection in dimensionality reduction: Applications to image and text data. KDD, pages 245–250, 2001.
[Che et al., 2015] Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Distilling knowledge from deep networks with applications to healthcare domain. arXiv preprint arXiv:1512.03542, 2015.
[Dor and Reich, 2012] Ofer Dor and Yoram Reich. Strengthening learning algorithms by feature discovery. Information Sciences, 2012.
[Fan et al., 2010] Wei Fan, Erheng Zhong, Jing Peng, Olivier Verscheure, Kun Zhang, Jiangtao Ren, Rong Yan, and Qiang Yang. Generalized and heuristic-free feature construction for improved accuracy. SDM, 2010.
[Feurer et al., 2015] Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning. NIPS, 2015.
[Garrido and Juste, 1998] Lluis Garrido and Aurelio Juste. On the determination of probability density functions by using neural networks. Computer Physics Communications, 115(1):25–31, 1998.
[Greenwald and Khanna, 2001] Michael Greenwald and Sanjeev Khanna. Space-efficient online computation of quantile summaries. SIGMOD, pages 58–66, 2001.
[Kalousis, 2002] Alexandros Kalousis. Algorithm selection via meta-learning. PhD thesis, Universite de Geneve, 2002.
[Kanter and Veeramachaneni, 2015] James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. DSAA, 2015.
[Katz et al., 2016] Gilad Katz, Eui Chul Richard Shin, and Dawn Song. Explorekit: Automatic feature generation and selection. ICDM, pages 979–984, 2016.
[Khurana et al., 2016] Udayan Khurana, Deepak Turaga, Horst Samulowitz, and Srinivasan Parthasarathy. Cognito: Automated Feature Engineering for Supervised Learning. ICDM, 2016.
[Lichman, 2013] M. Lichman. UCI machine learning repository, 2013.
[Markovitch and Rosenstein, 2002] Shaul Markovitch and Dan Rosenstein. Feature generation using general constructor functions. Machine Learning, 2002.
[Michie et al., 1994] Donald Michie, D. J. Spiegelhalter, C. C. Taylor, and John Campbell, editors. Machine Learning, Neural and Statistical Classification. 1994.
[Nair and Hinton, 2010] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. ICML, pages 807–814, 2010.
[Naumann et al., 2002] F. Naumann, C.-T. Ho, X. Tian, L. Haas, and N. Megiddo. Attribute classification using feature analysis. ICDE, page 271, 2002.
[Rifkin and Klautau, 2004] Ryan Rifkin and Aldebaro Klautau. In defense of one-vs-all classification. JMLR, 5:101–141, 2004.
[Shawe-Taylor and Cristianini, 2004] John Shawe-Taylor and Nello Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004.
[Skinner et al., 1994] C.J. Skinner, D.J. Holmes, and D. Holt. Multiple frame sampling for multivariate stratification. International Statistical Review, 62(3), 1994.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[Storcheus et al., 2015] Dmitry Storcheus, Afshin Rostamizadeh, and Sanjiv Kumar. A survey of modern questions and challenges in feature extraction. Proceedings of The 1st International Workshop on Feature Extraction, NIPS, 2015.
[Vanschoren et al., 2014] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. Openml: Networked science in machine learning. SIGKDD Explor. Newsl., 15(2):49–60, June 2014.
[Wang et al., 2013] Lu Wang, Ge Luo, Ke Yi, and Graham Cormode. Quantiles over data streams: An experimental study. SIGMOD, pages 737–748, 2013.
[Weinberger et al., 2009] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. ICML, 2009.
