A Review of Feature Selection Methods With Applications
Abstract - Feature selection (FS) methods can be used in data pre-processing to achieve efficient data reduction. This is useful for finding accurate data models. Since exhaustive search for an optimal feature subset is infeasible in most cases, many search strategies have been proposed in the literature. The usual applications of FS are in classification, clustering, and regression tasks. This review considers most of the commonly used FS techniques, with particular emphasis on their application aspects. In addition to the standard filter, wrapper, and embedded methods, we also provide insight into recent hybrid approaches and other advanced FS topics.

I. INTRODUCTION

The abundance of data in contemporary datasets demands the development of clever algorithms for discovering important information. Data models are constructed depending on the data mining task, usually in the areas of classification, regression and clustering. Often, pre-processing of the datasets takes place for two main reasons: 1) reduction of the size of the dataset in order to achieve more efficient analysis, and 2) adaptation of the dataset to best suit the selected analysis method. The former reason is more important nowadays because of the plethora of analysis methods at the researcher's disposal, while the size of an average dataset keeps growing both in the number of features and in the number of samples.

Dataset size reduction can be performed in one of two ways: feature set reduction or sample set reduction. In this paper, the focus is on feature set reduction. The problem is important because a number of features comparable to or higher than the number of samples leads to model overfitting, which in turn leads to poor results on validation datasets. Additionally, constructing models from datasets with many features is more computationally demanding [1]. All of this has led researchers to propose many methods for feature set reduction. The reduction is performed through the processes of feature extraction (transformation) and feature selection. Feature extraction methods such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA) and Multidimensional Scaling transform the original features into a new feature set constructed from combinations of the original ones, with the aim of discovering more meaningful information in the new set [2]. The new feature set can then be easily reduced by taking into consideration characteristics such as dataset variance coverage. Feature selection, on the other hand, is a process of taking a small subset of features from the original feature set without transformation (thus preserving the interpretation) and validating it with respect to the analysis goal. The selection process can be carried out in a number of ways depending on the goal, the resources at hand, and the desired level of optimization.

In this paper, we focus on feature selection and provide an overview of the existing methods that are available for handling several different classes of problems. Additionally, we consider the most important application domains and review comparative studies on feature selection therein, in order to investigate which methods perform best for specific tasks. This research is motivated by the fact that there is an abundance of work in this field and insufficient systematization, particularly with respect to various application domains and novel research topics.

Feature set reduction is based on the notions of feature relevance and redundancy with respect to the goal. More specifically, a feature is usually categorized as: 1) strongly relevant, 2) weakly relevant but not redundant, 3) irrelevant, or 4) redundant [3,4]. A strongly relevant feature is always necessary for an optimal feature subset; it cannot be removed without affecting the original conditional target distribution [3]. A weakly relevant feature may not always be necessary for an optimal subset; this depends on certain conditions. Irrelevant features need not be included at all. Redundant features are those that are weakly relevant but can be completely replaced with a set of other features such that the target distribution is not disturbed (the set of other features is called the Markov blanket of the feature). Redundancy is thus always inspected in the multivariate case (when examining a feature subset), whereas relevance is established for individual features. The aim of feature selection is to maximize relevance and minimize redundancy, and it usually amounts to finding a feature subset consisting of only relevant features.

In order to ensure that the optimal feature subset with respect to the goal concept has been found, a feature selection method would have to evaluate a total of 2^m − 1 subsets, where m is the total number of features in the dataset (the empty feature subset is excluded). This is computationally infeasible even for a moderately large m.
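The relevance categories introduced above are usually formalized following [3]; a compact restatement (our paraphrase, with F = {X_1, ..., X_m} the full feature set, Y the target, and S_i = F \ {X_i}) is:

```latex
% Paraphrase of the relevance definitions used in [3].
\begin{align*}
\text{strongly relevant: } & P(Y \mid X_i, S_i) \neq P(Y \mid S_i)\\
\text{weakly relevant: }   & P(Y \mid X_i, S_i) = P(Y \mid S_i) \ \text{and}\ \exists\, S_i' \subset S_i : P(Y \mid X_i, S_i') \neq P(Y \mid S_i')\\
\text{irrelevant: }        & \forall\, S_i' \subseteq S_i : P(Y \mid X_i, S_i') = P(Y \mid S_i')
\end{align*}
```

A weakly relevant feature is then redundant if some subset of the remaining features forms a Markov blanket for it, i.e. makes it superfluous with respect to the target distribution.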
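To make the 2^m − 1 figure concrete, the minimal Python sketch below (an illustration only, not from the reviewed work) counts the non-empty subsets by brute-force enumeration; already at m = 20 there are more than a million candidate subsets, and at m = 100 roughly 1.3 × 10^30.

```python
from itertools import combinations

def count_candidate_subsets(m):
    """Brute-force count of all non-empty subsets of an m-feature set."""
    return sum(1 for k in range(1, m + 1) for _ in combinations(range(m), k))

for m in (5, 10, 20):
    n = count_candidate_subsets(m)
    assert n == 2 ** m - 1
    print(m, n)   # 5 -> 31, 10 -> 1023, 20 -> 1048575
```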
Therefore, putting completeness of the search aside, many heuristic methods have been proposed to find a sufficiently good (but not necessarily optimal) subset. The whole process of finding the feature subset typically consists of four basic steps: 1) subset generation, 2) subset evaluation, 3) a stopping criterion, and 4) validation of the results [5]. Feature subset generation depends on the state space search strategy. After a strategy selects a candidate subset, the subset is evaluated using an evaluation criterion in step 2. After steps 1 and 2 have been repeated a number of times, as governed by the stopping criterion, the best candidate feature subset is selected. This subset is then validated on an independent dataset or using domain knowledge, while considering the type of task at hand.

II. CLASSIFICATION OF FEATURE SELECTION METHODS

Feature selection methods can be classified in a number of ways. The most common one is the classification into filter, wrapper, embedded, and hybrid methods [6]. This classification assumes feature independency or near-independency. Additional methods have been devised for datasets with structured features, where dependencies exist, and for streaming features [2].

A. Filter methods

Filter methods select features based on a performance measure regardless of the employed data modeling algorithm. Only after the best features are found can the modeling algorithms use them. Filter methods can rank individual features or evaluate entire feature subsets. We can roughly classify the developed measures for feature filtering into information, distance, consistency, similarity, and statistical measures. While there are many filter methods described in the literature, a list of common methods is given in Table I, along with references that provide details. Not all filters can be used for all classes of data mining tasks; therefore, the filters are also classified by task: classification, regression or clustering. Due to lack of space, we do not consider semi-supervised feature selection methods in this work; an interested reader is referred to [16] for more information.

Univariate feature filters evaluate (and usually rank) a single feature, while multivariate filters evaluate an entire feature subset. Feature subset generation for multivariate filters depends on the search strategy. While there are many search strategies, there are four usual starting points for feature subset generation: 1) forward selection, 2) backward elimination, 3) bidirectional selection, and 4) heuristic feature subset selection. Forward selection typically starts with an empty feature set and then considers adding one or more features to the set. Backward elimination typically starts with the whole feature set and considers removing one or more features from the set. Bidirectional search starts from both sides, from an empty set and from the whole set, simultaneously considering larger and smaller feature subsets. Heuristic selection generates a starting subset based on a heuristic (e.g. a genetic algorithm) and then explores it further.

The most common search strategies that can be used with multivariate filters can be categorized into exponential, sequential and randomized algorithms. Exponential algorithms evaluate a number of subsets that grows exponentially with the feature space size. Sequential algorithms add or remove features sequentially (one or a few at a time), which may lead to local minima. Randomized algorithms incorporate randomness into their search procedure, which helps them avoid local minima [17]. Common search strategies are shown in Table II.

TABLE I. COMMON FILTER METHODS FOR FEATURE SELECTION

Name | Filter class | Applicable to task | Study
Information gain | univariate, information | classification | [6]
Gain ratio | univariate, information | classification | [7]
Symmetrical uncertainty | univariate, information | classification | [8]
Correlation | univariate, statistical | regression | [8]
Chi-square | univariate, statistical | classification | [7]
Inconsistency criterion | multivariate, consistency | classification | [9]
Minimum redundancy, maximum relevance (mRmR) | multivariate, information | classification, regression | [2]
Correlation-based feature selection (CFS) | multivariate, statistical | classification, regression | [7]
Fast correlation-based filter (FCBF) | multivariate, information | classification | [8]
Fisher score | univariate, statistical | classification | [10]
Relief and ReliefF | univariate, distance | classification, regression | [11]
Spectral feature selection (SPEC) and Laplacian Score (LS) | univariate, similarity | classification, clustering | [4]
Feature selection for sparse clustering | multivariate, similarity | clustering | [12]
Localized Feature Selection Based on Scatter Separability (LFSBSS) | multivariate, statistical | clustering | [13]
Multi-Cluster Feature Selection (MCFS) | multivariate, similarity | clustering | [4]
Feature weighting K-means | multivariate, statistical | clustering | [14]
ReliefC | univariate, distance | clustering | [15]

TABLE II. SEARCH STRATEGIES FOR FEATURE SELECTION

Algorithm group | Algorithm name
Exponential | Exhaustive search; Branch-and-bound
Sequential | Greedy forward selection or backward elimination; Best-first; Linear forward selection; Floating forward or backward selection; Beam search (and beam stack search)
Randomized | Race search; Random generation; Simulated annealing; Evolutionary computation algorithms (e.g. genetic, ant colony optimization); Scatter search
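As a concrete illustration of the univariate, information-based filters listed in Table I, the sketch below ranks features by their estimated mutual information with the class label using scikit-learn; the benchmark dataset and the cut-off of ten features are arbitrary choices made for this example only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Small benchmark dataset: 569 samples, 30 numeric features, binary target.
data = load_breast_cancer()
X, y = data.data, data.target

# Univariate filter: score each feature independently against the class
# label with an information-based measure, then keep the k highest scores.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

ranking = np.argsort(selector.scores_)[::-1]
print("Top-ranked features:", [data.feature_names[i] for i in ranking[:10]])
print("Reduced data shape:", X_reduced.shape)   # (569, 10)
```

Because each feature is scored in isolation, redundancy among the selected features is ignored; multivariate filters such as mRmR, CFS or FCBF from Table I were designed to address exactly that.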
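The next sketch illustrates greedy sequential forward selection, the simplest of the sequential strategies in Table II, written around a pluggable subset-evaluation function; the function names and the choice of a cross-validated Naïve Bayes score as the evaluator are illustrative assumptions. With such a classifier-based evaluator the loop effectively behaves as a wrapper in the sense of the next subsection; substituting a multivariate filter measure would keep it a pure filter.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

def forward_selection(n_features, evaluate_subset, max_size=None):
    """Greedy sequential forward selection.

    Starts from the empty set and repeatedly adds the single feature whose
    inclusion most improves evaluate_subset(subset); stops when no candidate
    improves the score, which may of course be only a local optimum.
    """
    selected, best_score = [], float("-inf")
    max_size = max_size or n_features
    while len(selected) < max_size:
        remaining = [f for f in range(n_features) if f not in selected]
        score, best_f = max((evaluate_subset(selected + [f]), f) for f in remaining)
        if score <= best_score:
            break                       # no further improvement
        selected.append(best_f)
        best_score = score
    return selected, best_score

# Wrapper-style evaluator: 5-fold cross-validated accuracy of a fast classifier.
X, y = load_breast_cancer(return_X_y=True)
evaluate = lambda subset: cross_val_score(GaussianNB(), X[:, subset], y, cv=5).mean()
subset, score = forward_selection(X.shape[1], evaluate, max_size=5)
print("Selected feature indices:", subset, "CV accuracy: %.3f" % score)
```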
B. Wrapper methods

Wrappers assess feature subsets by the performance they yield with a modelling algorithm, which is taken as a black-box evaluator. Thus, for classification tasks, a wrapper evaluates subsets based on classifier performance (e.g. Naïve Bayes or SVM) [18,19], while for clustering, a wrapper evaluates subsets based on the performance of a clustering algorithm (e.g. K-means) [20]. The evaluation is repeated for each subset, and subset generation depends on the search strategy, in the same way as with filters. Wrappers are much slower than filters in finding sufficiently good subsets because they inherit the resource demands of the modelling algorithm. The resulting subsets are also biased towards the modelling algorithm on which they were evaluated (even when cross-validation is used). Therefore, for a reliable generalization error estimate, both an independent validation sample and another modelling algorithm should be used after the final subset is found. On the other hand, it has been shown empirically that wrappers obtain subsets with better performance than filters because the subsets are evaluated using a real modelling algorithm. Practically any combination of search strategy and modelling algorithm can be used as a wrapper, but wrappers are only feasible with greedy search strategies and fast modelling algorithms such as Naïve Bayes [21], linear SVM [22], and Extreme Learning Machines [23].

C. Embedded and hybrid methods

Embedded methods perform feature selection during the execution of the modelling algorithm. These methods are thus embedded in the algorithm either as its normal or as extended functionality. Common embedded methods include various types of decision tree algorithms (CART, C4.5, random forest [24]), but also other algorithms (e.g. multinomial logistic regression and its variants [25]). Some embedded methods perform feature weighting based on regularization models whose objective functions minimize fitting errors while forcing the feature coefficients to be small or exactly zero. Such methods, based on the Lasso [26] or the Elastic Net [27], usually work with linear classifiers (SVM or others) and penalize features that do not contribute to the model.

Hybrid methods were proposed to combine the best properties of filters and wrappers. First, a filter method is used to reduce the feature space dimension, possibly obtaining several candidate subsets [28]. Then, a wrapper is employed to find the best candidate subset. Hybrid methods usually achieve the high accuracy characteristic of wrappers and the high efficiency characteristic of filters. While practically any combination of filter and wrapper can be used to construct a hybrid method, several interesting methodologies have recently been proposed, such as fuzzy random forest based feature selection [29], hybrid genetic algorithms [30], hybrid ant colony optimization [31], and a mixed gravitational search algorithm [32].

D. Structured and streaming features

In some datasets, features may exhibit certain internal structures such as spatial or temporal smoothness, disjoint/overlapping groups, or tree- or graph-like structures. In these datasets, features are not independent, so it is advisable to employ algorithms that deal with the dependencies explicitly in order to increase the performance of the selected feature subsets. Most of the algorithms dealing with feature structures are recent and are based on adaptations of Lasso regularization to accommodate different structures. Good overviews of these methods can be found in [2,33].

Streaming (or dynamic) features are features whose number is not known in advance; they are generated dynamically and arrive as streamed data, and the modelling algorithm has to decide whether to keep them as useful for model construction or not. Moreover, some features may become irrelevant over time and should be discarded. This scenario is common in social networks such as Twitter, where new words are generated that are not all relevant for a given subject [2]. The most important feature selection methods in this category are the Grafting algorithm [34], the Alpha-Investing algorithm [35], the OSFS algorithm [36], and a dynamic feature selection approach based on fuzzy-rough sets [37].

III. FEATURE SELECTION APPLICATION DOMAINS

The choice of feature selection methods differs among various application areas. In the following subsections, we review comparative studies on feature selection pertaining to several well-known application domains. Table III summarizes the findings from the reviewed studies.

A. Text mining

In text mining, the standard way of representing a document is the bag-of-words model. The idea is to model each document with the counts of the words occurring in that document. Feature vectors are typically formed so that each feature (i.e. each element of the feature vector) represents the count of a specific word, an alternative being to just indicate the presence/absence of a word without specifying the count. The set of words whose occurrences are counted is called a vocabulary. Given a dataset that needs to be represented, one can use all the words from all the documents in the dataset to build the vocabulary and then prune the vocabulary using feature selection.

It is common to apply a degree of preprocessing prior to feature selection, typically including the removal of rare words with only a few occurrences, the removal of overly common words (e.g. "a", "the", "and" and similar) and the grouping of differently inflected forms of a word (lemmatization, stemming) [38].

Forman [38] performed a detailed experimental study of filter feature selection methods for text classification. Twelve feature selection metrics were evaluated on 229 text classification problem instances. Feature vectors were formed not as word counts, but as Boolean indicators of whether a certain word occurred or not. A linear SVM classifier with untuned parameters was used to evaluate performance. The results were analyzed with respect to precision, recall, F-measure and accuracy. Information gain was shown to perform best with respect to precision, while the author-introduced bi-normal separation measure performed best for recall, F-measure and accuracy.
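To connect this setup to a concrete pipeline, the sketch below builds a pruned bag-of-words representation and applies a chi-square filter before a linear SVM; it is a generic scikit-learn illustration on a stand-in corpus (two classes of 20 Newsgroups), not a reconstruction of Forman's experimental protocol, and the vocabulary pruning thresholds and k = 1000 are arbitrary.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-in two-class text classification task (downloads the corpus on first use).
docs = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "sci.space"],
                          remove=("headers", "footers", "quotes"))

pipeline = make_pipeline(
    CountVectorizer(stop_words="english", min_df=3),  # prune rare and overly common words
    SelectKBest(chi2, k=1000),                        # chi-square filter on word counts
    LinearSVC(),                                      # untuned linear SVM as the evaluator
)
scores = cross_val_score(pipeline, docs.data, docs.target, cv=5)
print("Mean CV accuracy with 1000 selected words: %.3f" % scores.mean())
```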
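Embedded selection in the sense of Section II-C is also common on such sparse text features: an L1 penalty added to a linear model drives most word weights to exactly zero during training, so selection falls out of model fitting itself. A hedged sketch on the same stand-in corpus follows (the penalty strength C = 0.5 is an arbitrary illustrative value, and a recent scikit-learn with get_feature_names_out is assumed):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = fetch_20newsgroups(subset="train",
                          categories=["sci.med", "sci.space"],
                          remove=("headers", "footers", "quotes"))
vectorizer = TfidfVectorizer(stop_words="english", min_df=3)
X = vectorizer.fit_transform(docs.data)

# L1-regularized linear SVM: the sparsity-inducing penalty is part of the
# training objective, so features are discarded as a by-product of fitting.
clf = LinearSVC(penalty="l1", dual=False, C=0.5).fit(X, docs.target)

kept = np.flatnonzero(clf.coef_.ravel())
words = np.asarray(vectorizer.get_feature_names_out())
print("Words with non-zero weight: %d of %d" % (kept.size, X.shape[1]))
print("Examples:", words[kept[:10]].tolist())
```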
TABLE III. SUMMARIZED FINDINGS OF RELEVANT FEATURE SELECTION METHODS IN VARIOUS APPLICATION AREAS

Application area | Subfield | Datasets | Feature selection methods | Best performing | Evaluation metrics | Study
Text mining | Text classification | 229 text classification problem instances gathered from Reuters, TREC, OHSUMED, etc. | Accuracy, accuracy balanced, bi-normal separation, chi-square, document frequency, F1-measure, information gain, odds ratio, odds ratio numerator, power, probability ratio, random | Information gain (precision), bi-normal separation (accuracy, F-measure, recall) | Accuracy, F-measure, precision, recall | [38]
Text mining | Text clustering | Reuters-21578, 20 Newsgroups, Web Directory | Information gain, chi-square, document frequency, term strength, entropy-based ranking, term contribution, iterative feature selection | Iterative feature selection | Entropy, precision | [39]
Image processing / computer vision | Image classification | Aerial Images, The Digits Data, Cats and Dogs | Relief (R), K-means (K), sequential floating forward selection (F), sequential floating backward selection (B), various combinations R + K + F/B | R+K+B / R+K+F / R+K, depending on the size of the feature subset | Average MSE of 100 neural networks | [40]
Image processing / computer vision | Breast density classification from mammographic images | Mini-MIAS, KBD-FER | Best-first with forward, backward and bi-directional search, genetic search and random search (k-NN and Naïve Bayesian classifiers) | Best-first forward, best-first backward | Accuracy | [41]
Bioinformatics | Biomarker discovery | Three benchmark datasets deriving from DNA microarray experiments | Chi-square, information gain, symmetrical uncertainty, gain ratio, OneR, ReliefF, SVM-embedded | Chi-square, symmetrical uncertainty, information gain, ReliefF | Stability, AUC | [42]
Bioinformatics | Microarray gene expression data classification | Two gene expression datasets (Freije, Phillips) | Information gain, twoing rule, sum minority, max minority, Gini index, sum of variances, t-statistics, one-dimensional SVM | Consensus of all methods | Accuracy | [43]
Industrial applications | Fault diagnosis | Wind turbine test rig dataset | Distance, entropy, SVM wrapper, neural network wrapper, global geometric similarity scheme | Global geometric similarity scheme with wrapper | Accuracy | [22]
Liu et al. [39] investigated the use of feature selection in the problem of text clustering, showing that feature selection can improve its performance and efficiency. Five filter feature selection methods were tested on three document datasets. Unsupervised feature selection methods were shown to improve clustering performance, achieving about 2% entropy reduction and 1% precision improvement on average, while removing 90% of the features. The authors also proposed an iterative feature selection method inspired by expectation maximization that combines supervised feature selection methods with clustering in a bootstrap setting. The proposed method reduces the entropy by 13.5% and increases precision by 14.6%, hence coming closest to the established baseline obtained by using a supervised approach.

B. Image processing and computer vision

Representing images is not a straightforward task, as the number of possible image features is practically unlimited [40]. The choice of features typically depends on the target application. Examples of features include histograms of oriented gradients, edge orientation histograms, Haar wavelets, raw pixels, gradient values, edges, color channels, etc. [44].

Bins and Draper [40] studied the use of filter feature selection methods in the general problem of image classification. Three different image datasets were used. They proposed a three-step method for feature selection that combines Relief, K-means clustering and sequential floating forward/backward feature selection (SFFS/SFBS). The idea is to: 1) use the Relief algorithm to remove irrelevant features, 2) use K-means clustering to cluster similar features and remove redundancy, and 3) run SFFS or SFBS to obtain the final set of features. The authors found that the proposed hybrid combination of algorithms yields better performance than using Relief or SFFS/SFBS alone. In cases when there are no irrelevant or redundant features in the dataset, the proposed algorithm does not degrade performance. When the goal is to select a specific number of features, it is suggested to use the R+K+B variant of the algorithm if the number of relevant and non-redundant features is less than 110, and otherwise R+K+F. If the number of selected features is allowed to vary, the authors suggest using R+K. The authors also note that Relief is good at removing irrelevant features, but not adequate for selecting the best among the relevant ones.
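The filter-then-wrapper pattern behind such hybrid schemes can be sketched generically: a cheap filter first shrinks the candidate pool, and a wrapper then searches only within that pool. The sketch below uses a univariate ANOVA F-score filter as a stand-in for the Relief stage and scikit-learn's sequential forward selector with a k-NN classifier as the wrapper stage; these substitutions, the pool size of 15 and the target of 5 features are illustrative assumptions, not the procedure of [40] or [41].

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Stage 1 (filter): cheap univariate ranking shrinks the pool of candidates.
pool = SelectKBest(f_classif, k=15).fit(X, y).get_support(indices=True)

# Stage 2 (wrapper): greedy forward search inside the pool, scored by
# cross-validated accuracy of the final classifier.
sfs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=5,
                                direction="forward", cv=5).fit(X[:, pool], y)
chosen = pool[sfs.get_support(indices=True)]

print("Selected feature indices:", chosen.tolist())
print("CV accuracy on the selection: %.3f"
      % cross_val_score(KNeighborsClassifier(), X[:, chosen], y, cv=5).mean())
```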
Muštra et al. [41] investigated the use of wrapper feature selection methods for breast density classification in mammographic images. Five wrapper feature selection methods were evaluated in conjunction with three different classifiers on two datasets of mammographic images. The best-performing methods were best-first search with forward selection and best-first search with backward selection. Overall, the results over different classifiers and datasets improved by between 3% and 12% when feature selection was used.

C. Industrial applications

Feature selection is important in fault diagnosis in industrial applications, where numerous redundant sensors monitor the performance of a machine. Liu et al. [22] have shown that the accuracy of detecting a fault (i.e. solving a binary classification problem of machine state as faulty vs. normal) can be improved by using feature selection. They proposed a global geometric model and a similarity metric for feature selection in fault diagnostics. The idea is to find feature subsets that are geometrically similar to the original feature set. The authors experimented with three different similarity measures: angular similarity, mutual information and the structural similarity index. The proposed approach was compared with distance-based and entropy-based feature selection, and with SVM and neural network wrappers. The best performance was obtained by combining the proposed geometric similarity approach with a wrapper, so that the top 10% of feature subsets were preselected by geometric similarity, followed by an exhaustive search-based wrapper to find the best subset.

D. Bioinformatics

An interesting application of feature selection is biomarker discovery from genomics data. In genomics data, individual features correspond to genes, so by selecting the most relevant features one gains important knowledge about the genes that are the most discriminative for a particular problem. Dessì et al. [42] proposed a framework for comparing different biomarker selection methods, taking into account the predictive performance and stability of the selected gene sets. They compared eight selection methods on three benchmark datasets derived from DNA microarray experiments. Additionally, they analyzed how similar the outputs of different selection methods are, and found that the outputs of univariate methods seem to be more similar to each other than to those of the multivariate methods. In particular, the SVM-embedded selection seems to select features quite distinct from the ones selected by other methods. When jointly optimizing stability and predictive performance, the best results were obtained using chi-square, symmetrical uncertainty, information gain and ReliefF.

Abusamra [43] analyzed the performance of eight different filter-based feature selection methods and three classification methods on two datasets of microarray gene expression data. The best individually performing feature selection methods varied depending on the dataset and the classifier used. Notably, using the Gini index for feature selection improved the performance of an SVM classifier on both datasets. Some feature selection methods were shown to degrade classification performance. However, Abusamra demonstrated that classification accuracy can be consistently improved on both datasets by using a consensus of all feature selection methods to find the top 20 features, counting the number of feature selection methods that selected each feature. Seven features were selected by all the methods, and an additional 13 features were randomly selected from a pool of features selected by seven out of eight methods.

IV. CONCLUSION

Current research advances in this field are concentrated in the area of hybrid feature selection methods, particularly methodologies based on evolutionary computation heuristics such as swarm intelligence and various genetic algorithms. Additionally, application areas such as bioinformatics, image processing, industrial applications and text mining deal with high-dimensional feature spaces where a clever hybrid methodology design is of utmost importance if any success is to be obtained. Therein, features may exhibit complex internal structures or may even be unknown in advance.

While there is no silver-bullet method, filters based on information theory and wrappers based on greedy stepwise approaches seem to offer the best results. Future research should focus on optimizing the efficiency and accuracy of the feature subset search strategy by combining the best existing filter and wrapper approaches. Most research tends to focus on a small number of datasets on which the proposed methodology works well; larger comparative studies should be pursued in order to obtain more reliable results.

ACKNOWLEDGEMENTS

This work has been supported in part by the Croatian Science Foundation, within the project "De-identification Methods for Soft and Non-Biometric Identifiers" (DeMSI, UIP-11-2013-1544). This support is gratefully acknowledged.

REFERENCES

[1] F. Korn, B. Pagel, and C. Faloutsos, "On the 'dimensionality curse' and the 'self-similarity blessing'," IEEE Trans. Knowl. Data Eng., vol. 13, no. 1, pp. 96–111, 2001.
[2] J. Tang, S. Alelyani, and H. Liu, "Feature Selection for Classification: A Review," in: C. Aggarwal (ed.), Data Classification: Algorithms and Applications, CRC Press, 2014.
[3] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy," J. Mach. Learn. Res., vol. 5, pp. 1205–1224, 2004.
[4] S. Alelyani, J. Tang, and H. Liu, "Feature Selection for Clustering: A Review," in: C. Aggarwal and C. Reddy (eds.), Data Clustering: Algorithms and Applications, CRC Press, 2013.
[5] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, 2005.
[6] N. Hoque, D. K. Bhattacharyya, and J. K. Kalita, "MIFS-ND: A mutual information-based feature selection method," Expert Systems with Applications, vol. 41, no. 14, pp. 6371–6385, 2014.
[7] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, San Francisco, CA, USA: Morgan Kaufmann, 2011.
[8] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," in: Proc. 20th International Conference on Machine Learning (ICML-2003), Washington DC, USA, AAAI Press, pp. 856–863, 2003.
[9] H. Liu and R. Setiono, "A Probabilistic Approach to Feature Selection - A Filter Solution," in: Proc. 13th International Conference on Machine Learning (ICML-1996), Bari, Italy, Morgan Kaufmann, pp. 319–327, 1996.
[10] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, Wiley-Interscience, 2012.
[11] M. Robnik-Šikonja and I. Kononenko, "Theoretical and empirical analysis of Relief and ReliefF," Mach. Learn., vol. 53, pp. 23–69, 2003.
[12] D. M. Witten and R. Tibshirani, "A framework for feature selection in clustering," Journal of the American Statistical Association, vol. 105, no. 490, pp. 713–726, 2010.
[13] Y. Li, M. Dong, and J. Hua, "Localized feature selection for clustering," Pattern Recognition Letters, vol. 29, no. 1, pp. 10–18, 2008.
[14] D. S. Modha and W. S. Spangler, "Feature weighting in k-means clustering," Mach. Learn., vol. 52, no. 3, pp. 217–237, 2003.
[15] M. Dash and Y.-S. Ong, "RELIEF-C: Efficient Feature Selection for Clustering over Noisy Data," in: Proc. 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Boca Raton, Florida, USA, pp. 869–872, 2011.
[16] Z. Xu, I. King, and M. R.-T. Lyu, "Discriminative Semi-Supervised Feature Selection Via Manifold Regularization," IEEE Trans. Neural Networks, vol. 21, no. 7, pp. 1033–1047, 2010.
[17] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining, London: Kluwer Academic Publishers, 1998.
[18] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in: Proc. 15th International Conference on Machine Learning (ICML-1998), Madison, Wisconsin, USA, Morgan Kaufmann, pp. 82–90, 1998.
[19] S. Maldonado, R. Weber, and F. Famili, "Feature selection for high-dimensional class-imbalanced data sets using Support Vector Machines," Information Sciences, vol. 286, pp. 228–246, 2014.
[20] Y. S. Kim, W. N. Street, and F. Menczer, "Evolutionary model selection in unsupervised learning," Intelligent Data Analysis, vol. 6, no. 6, pp. 531–556, 2002.
[21] J. C. Cortizo and I. Giraldez, "Multi Criteria Wrapper Improvements to Naive Bayes Learning," LNCS, vol. 4224, pp. 419–427, 2006.
[22] C. Liu, D. Jiang, and W. Yang, "Global geometric similarity scheme for feature selection in fault diagnosis," Expert Systems with Applications, vol. 41, no. 8, pp. 3585–3595, 2014.
[23] F. Benoît, M. van Heeswijk, Y. Miche, M. Verleysen, and A. Lendasse, "Feature selection for nonlinear models with extreme learning machines," Neurocomputing, vol. 102, pp. 111–124, 2013.
[24] M. Sandri and P. Zuccolotto, "Variable Selection Using Random Forests," in: S. Zani, A. Cerioli, M. Riani, and M. Vichi (eds.), Data Analysis, Classification and the Forward Search, Studies in Classification, Data Analysis, and Knowledge Organization, Springer, pp. 263–270, 2006.
[25] G. C. Cawley, N. L. C. Talbot, and M. Girolami, "Sparse Multinomial Logistic Regression via Bayesian L1 Regularisation," in: B. Schölkopf, J. C. Platt, and T. Hoffmann (eds.), Advances in Neural Information Processing Systems, MIT Press, pp. 209–216, 2007.
[26] S. Ma and J. Huang, "Penalized feature selection and classification in bioinformatics," Briefings in Bioinformatics, vol. 9, no. 5, pp. 392–403, 2008.
[27] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[28] S. Das, "Filters, wrappers and a boosting-based hybrid for feature selection," in: Proc. 18th International Conference on Machine Learning (ICML-2001), San Francisco, CA, USA, Morgan Kaufmann, pp. 74–81, 2001.
[29] J. M. Cadenas, M. C. Garrido, and R. Martínez, "Feature subset selection Filter–Wrapper based on low quality data," Expert Systems with Applications, vol. 40, pp. 6241–6252, 2013.
[30] I. S. Oh, J. S. Lee, and B. R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, 2004.
[31] S. I. Ali and W. Shahzad, "A Feature Subset Selection Method based on Conditional Mutual Information and Ant Colony Optimization," International Journal of Computer Applications, vol. 60, no. 11, pp. 5–10, 2012.
[32] S. Sarafrazi and H. Nezamabadi-pour, "Facing the classification of binary problems with a GSA-SVM hybrid system," Mathematical and Computer Modelling, vol. 57, no. 1-2, pp. 270–278, 2013.
[33] J. Zhou, J. Liu, V. Narayan, and J. Ye, "Modeling disease progression via fused sparse group lasso," in: Proc. 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, ACM, pp. 1095–1103, 2012.
[34] S. Perkins and J. Theiler, "Online feature selection using grafting," in: Proc. 20th International Conference on Machine Learning (ICML-2003), Washington DC, USA, AAAI Press, pp. 592–599, 2003.
[35] D. Zhou, J. Huang, and B. Schölkopf, "Learning from labeled and unlabeled data on a directed graph," in: Proc. 22nd International Conference on Machine Learning (ICML-2005), Bonn, Germany, ACM, pp. 1041–1048, 2005.
[36] X. Wu, K. Yu, H. Wang, and W. Ding, "Online streaming feature selection," in: Proc. 27th International Conference on Machine Learning (ICML-2010), Haifa, Israel, Omnipress, pp. 1159–1166, 2010.
[37] R. Diao, M. N. Parthalain, and Q. Shen, "Dynamic feature selection with fuzzy-rough sets," in: Proc. IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2013), Hyderabad, India, IEEE Press, pp. 1–7, 2013.
[38] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289–1305, 2003.
[39] T. Liu, S. Liu, and Z. Chen, "An evaluation on feature selection for text clustering," in: Proc. 20th International Conference on Machine Learning (ICML-2003), Washington DC, USA, AAAI Press, pp. 488–495, 2003.
[40] J. Bins and B. A. Draper, "Feature selection from huge feature sets," in: Proc. 8th International Conference on Computer Vision (ICCV-01), Vancouver, British Columbia, Canada, IEEE Computer Society, pp. 159–165, 2001.
[41] M. Muštra, M. Grgić, and K. Delač, "Breast density classification using multiple feature selection," Automatika, vol. 53, pp. 1289–1305, 2012.
[42] N. Dessì, E. Pascariello, and B. Pes, "A Comparative Analysis of Biomarker Selection Techniques," BioMed Research International, vol. 2013, article ID 387673, DOI: 10.1155/2013/387673.
[43] H. Abusamra, "A comparative study of feature selection and classification methods for gene expression data of glioma," Procedia Computer Science, vol. 23, pp. 5–14, 2013.
[44] K. Brkić, "Structural analysis of video by histogram-based description of local space-time appearance," Ph.D. dissertation, University of Zagreb, Faculty of Electrical Engineering and Computing, 2013.