Abstract

The question of variable selection in a multiple linear regression model is a major open research topic in statistics. The subset selection problem in multiple linear regression deals with the selection of a minimal subset of input variables without loss of explanatory power. In this paper, we adapt the genetic and simulated annealing algorithms for variable selection in multiple linear regression. The performance of this hybrid heuristic method is compared to those obtained by forward selection, backward elimination and classical genetic algorithm search. A comparative analysis on the literature data sets and simulation data shows that our hybrid heuristic method may suggest an efficient alternative to traditional subset selection methods for the variable selection problem in multiple linear regression models.

Keywords: Regression analysis; Subset selection problem; Genetic algorithm; Simulated annealing algorithm; Hybrid heuristic optimization

© 2013 Elsevier Inc. All rights reserved.
1. Introduction
When dealing with regression models, it is often difficult to decide how many variables to measure in order to build
an adequate model. Often too many variables are measured. This usually happens when the modeller is unsure as to
which are important or because the experimental procedure provides measurement of a large number of variables auto-
matically [6].
It is very tempting for the modeller to generate a model using all available variables. This, however, creates several data
analysis problems [5]: (1) some of the variables may be completely irrelevant to the objectives of the model, and cloud any
meaningful relationships that exist between other variables; (2) in order to obtain reliable parameter estimates the number
of observations made on each variable should be significantly greater than the number of variables; (3) variables may be
correlated, in which case replicated information is redundant; (4) the signal to noise ratio of certain variables may be so
low that their inclusion in the model may be questioned (and will certainly lead to a poorer model), especially if other ‘‘clea-
ner’’ correlated variables are available; (5) when the model parameters are optimised using iterative methods, a greater
number of parameters can result in a more complex error surface to be optimised. This complexity may affect the overall convergence time of the model.
For these reasons it is advantageous to select the ‘‘best’’ variables prior to the modelling process [30] (best is generally
used to refer to the optimum set of variables with respect to statistical significance or a statistical criterion). When a data set
including many explanatory variables and a response variable is given, the choice of best model which predicts the response
variable is known as ‘‘variable selection’’ or ‘‘the selection of the best subset model’’.
There are numerous situations where researchers are faced with a large pool of candidate variables for possible inclusion
in a multivariate statistical analysis. In most cases, the inclusion of all variables in the statistical analysis is, at best, unnec-
essary and, at worst, a serious impediment to the correct interpretation of the data. Not surprisingly, the general problem of
variable selection in multiple regression analysis has been acknowledged for more than 40 years [4,14,11,30,12,13,7,15,
20,1,33,16,28,38,21,31,32].
The general approach to variable selection is to minimise a cost function, where the cost function calculates some metric
to decide which subset of the available variables produces the ‘‘best’’ model. An optimization algorithm is then needed to select the subsets to be tested by the cost function.
The simplest method of selection would be to examine all possible combinations of the variables exhaustively. If there are m initial variables then this would result in 2^m possible subsets [24]. If m is large then this is computationally expensive
(and in most situations practically impossible). Disqualifying this search method means that there is no guaranteed way
of finding the optimal variable subset for a given model.
The most popular statistical strategies are forward selection procedure (FS) and the backward elimination procedure (BE)
[30].
The forward selection procedure starts with an equation containing no predictor variables, only a constant term. The first
variable included in the equation is the one which has the highest simple correlation with the response variable Y. If the
regression coefficient of this variable is significantly different from zero it is retained in the equation, and a search for a sec-
ond variable is made. The variable that enters the equation as the second variable is one which has the highest correlation
with Y, after Y has been adjusted for the effect of the first variable. The significance of the regression coefficient of the second
variable is then tested. If the regression coefficient is significant, a search for a third variable is made in the same way. The
procedure is terminated when the last variable entering the equation has an insignificant regression coefficient or all the
variables are included in the equation. The significance of the regression coefficient of the last variable introduced in the
equation is judged by the standard t-test computed from the latest equation [8].
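The forward selection loop just described can be summarized in code. The following is a minimal sketch, not the paper's implementation: it assumes a NumPy design matrix X (n rows, m candidate variables), a response vector y, and uses statsmodels (an assumed dependency) to obtain the t-test p-value of the most recently entered coefficient. Selecting the candidate with the smallest p-value is equivalent to selecting the one with the largest (partial) correlation with Y given the variables already in the model.

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: repeatedly add the candidate whose coefficient is
    most significant in the current model; stop when no addition is significant."""
    n, m = X.shape
    selected, remaining = [], list(range(m))
    while remaining:
        best_p, best_j = None, None
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            p_val = fit.pvalues[-1]              # p-value of the newly entered variable
            if best_p is None or p_val < best_p:
                best_p, best_j = p_val, j
        if best_p > alpha:                       # last entered coefficient is insignificant: stop
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```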
The backward elimination procedure starts with the full equation and successively drops one variable at a time. The vari-
ables are dropped on the basis of their contribution to the reduction of error sum of squares. The first variable deleted is the
one with the smallest contribution to the reduction of error sum of squares. This is equivalent to deleting the variable which
has the smallest t-test in the equation. If all the t-tests are significant, the full set of variables is retained in the equation.
Assuming that there are one or more variables that have insignificant t-tests, the procedure operates by dropping the var-
iable with the smallest insignificant t-test.
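Analogously, backward elimination can be sketched as below (same assumptions as the forward selection sketch above): the variable with the smallest absolute t-statistic is dropped while its test remains insignificant.

```python
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    """Start from the full model; while the least significant variable has an
    insignificant t-test, drop it and refit."""
    selected = list(range(X.shape[1]))
    while selected:
        fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
        t_abs = np.abs(fit.tvalues[1:])          # skip the intercept
        worst = int(np.argmin(t_abs))
        if fit.pvalues[1:][worst] <= alpha:      # every remaining variable is significant
            break
        selected.pop(worst)
    return selected
```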
Although FS and BE are widely used, there are several criticisms to them. FS and BE imply an order of importance to the
variables. This can be misleading since, for example, it is not uncommon to find that the first variable included in FS is quite
unnecessary in the presence of other variables. Similarly, it is easily demonstrated that the first variable deleted in BE can be
the first variable included in FS [30].
The purpose of this paper is to introduce an alternative variable selection method for use in multiple linear regression analysis. This method is based on a hybridization of the genetic algorithm (GA) and simulated annealing (SA) methods. GA, developed by Holland [19], is based on the Darwinian theory of biological evolution and is an important stochastic search algorithm for solving optimization problems. SA, which originates in the work of Metropolis et al. [29], is another important stochastic adaptive method based on the physical process of annealing. The hybrid procedure combines the powerful features of both methods.
The rest of this paper is organized as follows. Section 2 presents the problem of selecting regression variables and model
selection criteria. The genetic algorithm, simulated annealing algorithm and hybrid of these heuristic methods are intro-
duced in Section 3. Section 4 demonstrates the approaches with literature and simulated data sets. Conclusions are given
in Section 5.
2. Selecting regression variables and model selection criteria

In building a multiple regression model, a crucial problem is the selection of the regressors to be included. If too few regressors are selected, the parameter estimates will not be consistent, and if too many are selected, their variance will increase.
Let $\{X_1, \ldots, X_m\}$ be the set of m independent variables (each with n observations) and Y the dependent or response variable in a multiple regression model. Suppose that the model

$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon,$$

explains the relationship between the dependent and independent variables, where $X \subseteq \{X_1, \ldots, X_m\}$ is the set of the $p \leq m$ independent variables chosen as regressors and $\beta_1, \ldots, \beta_p$ is the parameter set. There are 2^m possible submodels. When m is high, the computational requirements of an exhaustive search can be prohibitive because the number of models becomes infeasible. In order to resolve this intractable problem, several heuristic methods that restrict attention to a smaller number of potential subsets of regressors are usually employed by practitioners. Such heuristic procedures, rather than searching through all possible models, seek a good path through them. Some of the most popular are the stepwise procedures, such as forward selection or backward elimination, which sequentially include or exclude variables based on t-statistic considerations [30].
There are several criteria for deciding on an appropriate subset. Some of the more common ones are:

(1) The residual mean square (RMS),

$$\mathrm{RMS}_p = \frac{\mathrm{RSS}_p}{n - p}. \qquad (2.1)$$

(2) The squared multiple correlation coefficient (R^2),

$$R^2_p = 1 - \frac{\mathrm{RSS}_p}{\mathrm{TSS}}. \qquad (2.2)$$

(3) The adjusted R^2,

$$\bar{R}^2_p = 1 - (n - 1)\,\frac{1 - R^2_p}{n - p}. \qquad (2.3)$$

(4) Mallows' $C_p$,

$$C_p = \frac{\mathrm{RSS}_p}{\hat{\sigma}^2} + 2p - n. \qquad (2.4)$$

(5) The Akaike information criterion (AIC),

$$\mathrm{AIC}(p) = n \log S^2_p + 2p, \qquad (2.5)$$

where p is the number of input or independent variables in the model, $\mathrm{RSS}_p$ is the residual sum of squares for the p-variable model, TSS is the total sum of squares, $S^2_p$ is the variance of the residuals, $\hat{\sigma}^2$ is an estimate of the error variance (usually taken from the full model), and n is the number of observations.
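For concreteness, the criteria (2.1)–(2.5) can be evaluated for a candidate subset as in the sketch below. This is an illustrative NumPy implementation, not the paper's code; it takes the error variance estimate from the full model and uses $S^2_p = \mathrm{RSS}_p / n$, two conventions the text does not fix.

```python
import numpy as np

def subset_criteria(X, y, subset, sigma2_full):
    """Selection criteria (2.1)-(2.5) for the p-variable model indexed by `subset`."""
    n, p = len(y), len(subset)
    Z = np.column_stack([np.ones(n), X[:, subset]])        # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(((y - Z @ beta) ** 2).sum())               # RSS_p
    tss = float(((y - y.mean()) ** 2).sum())               # TSS
    r2 = 1.0 - rss / tss                                    # (2.2)
    return {
        "RMS":   rss / (n - p),                             # (2.1)
        "R2":    r2,
        "adjR2": 1.0 - (n - 1) * (1.0 - r2) / (n - p),      # (2.3)
        "Cp":    rss / sigma2_full + 2 * p - n,             # (2.4)
        "AIC":   n * np.log(rss / n) + 2 * p,               # (2.5), with S2_p = RSS_p / n
    }
```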
3. Genetic, simulated annealing and hybrid heuristic algorithms

The new approach for selecting regressors proposed in this paper is based on a hybrid of the genetic and simulated annealing heuristic optimization procedures, called hybridGSA. First, the genetic and simulated annealing algorithms are summarized in the following subsections.
3.1. Genetic algorithms

Genetic algorithms (GAs) are adaptive methods which may be used to solve search and optimization problems [17,19]. GAs perform search in complex, large and multimodal landscapes, and provide near-optimal solutions for the objective or fitness function of an optimization problem.
In GAs, the parameters of the search space are generally encoded in the form of binary strings (called chromosomes),
where each binary digit within the chromosome represents a gene. A collection of such strings is called a population. Ini-
tially, a random population is created, which represents different points in the search space. An objective or fitness function
is associated with each string that represents the degree of goodness of the string. Based on the principle of survival of the
fittest, a few of the strings are selected and each is assigned a number of copies that go into the mating pool. Biologically
inspired operators like crossover and mutation are applied on these strings to yield a new generation of strings. The process
of selection, crossover and mutation continues for a fixed number of generations or until a termination condition is satisfied.
Although many variants of the original operators exist, their original purpose remains intact for most implementations,
and can be described as follows:
1. Selection. Selection is the process that mimics the ‘‘survival of the fittest’’ principle in the biological theory of evolution.
Selection is implemented in a GA as a policy for determining which chromosomes in the population will survive and be
carried over into the next generation.
2. Crossover. The crossover process is implemented in a GA by exchanging chromosome segments between two or more par-
ent chromosomes to form a child chromosome. The crossover operator typically serves a dual purpose. First, to effectively
reduce the search space to regions of greater promise. Second, to provide a mechanism for allowing offspring to inherit
the properties of their parents. The crossover process is also commonly referred to as ‘‘recombination’’.
3. Mutation. Mutation is the process that mimics the unpredictable and unexpected developments that occur in biological
reproduction. In a GA, mutation is a random perturbation to one or more genes that occurs infrequently during the evo-
lutionary process. The purpose of the mutation operator is to provide a mechanism to escape local optima. Local optima
are manifested by stalled evolutionary progress when one or more dimensions in the search space have lost genetic diver-
sity. This anomaly is often referred to as premature convergence or misconvergence and is a common demise for many
GAs.
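Putting these operators together for subset selection, a chromosome is a 0/1 mask over the m candidate variables and the fitness can be, for example, the negative AIC of the encoded subset. The sketch below is illustrative only: binary tournament selection, single-point crossover and bit-flip mutation are assumed operator choices, not necessarily those used in the paper. In practice the fitness function should assign a very poor value to the all-zero mask (the empty subset).

```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_search(fitness, m, pop_size=100, p_cross=0.8, p_mut=0.1, generations=1000):
    """Binary-encoded GA: each chromosome is a 0/1 mask over the m candidate variables;
    `fitness` is maximized (e.g. the negative AIC of the encoded subset)."""
    pop = rng.integers(0, 2, size=(pop_size, m))
    fit = np.array([fitness(c) for c in pop])
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            # binary tournament selection of two parents
            i, j = rng.integers(0, pop_size, 2), rng.integers(0, pop_size, 2)
            p1 = pop[i[np.argmax(fit[i])]].copy()
            p2 = pop[j[np.argmax(fit[j])]].copy()
            if rng.random() < p_cross:                     # single-point crossover
                cut = rng.integers(1, m)
                p1[cut:], p2[cut:] = p2[cut:].copy(), p1[cut:].copy()
            for child in (p1, p2):                         # bit-flip mutation
                flip = rng.random(m) < p_mut
                child[flip] = 1 - child[flip]
                children.append(child)
        pop = np.array(children[:pop_size])
        fit = np.array([fitness(c) for c in pop])
    return pop[np.argmax(fit)]                             # best subset mask found
```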
3.2. Simulated annealing

As its name implies, simulated annealing (SA) exploits an analogy between the way in which a metal cools and freezes into a minimum-energy crystalline structure (the annealing process) and the search for a minimum in a more general system. There is a close analogy between this approach and the thermodynamic process of annealing (the cooling of a solid), hence the name. It was Metropolis and coworkers [29] who first proposed this idea, and 30 years later Kirkpatrick et al. [22,23] observed that this approach could be used to search for feasible solutions of quite general optimization problems. The main idea is that the SA strategy may help prevent the search from being trapped at poor solutions associated with local optima of the fitness function.
In essence, it is a method that uses the objective function to create a nonhomogeneous Markov chain that asymptotically
converges to the minimum (maximum) of the objective function. The concept is originally based on the manner in which
liquids freeze or metals recrystallise in the process of annealing. In an annealing process a melt, initially at high temperature
and disordered, is slowly cooled so that the system at any time is approximately in thermodynamic equilibrium. As cooling
proceeds, the system becomes more ordered and approaches a ‘‘frozen’’ ground state. The analogy to an optimisation prob-
lem is as follows: The current state of the thermodynamic system is analogous to the current solution to the optimisation
problem, the energy equation for the thermodynamic system is analogous to the objective function, and the ground state
is analogous to the global optimum.
SA proceeds through a sequence of decreasing temperatures, with a number of iterations performed at each temperature. First, the initial temperature is selected and an initial solution is randomly chosen. The value of the cost function for the current solution (i.e., the initial solution in this case) is then calculated; the goal is to minimize the cost function. Afterwards, a new solution is generated from the neighborhood of the current solution. The cost function value of the new solution is calculated and compared to the current cost function value. If the new cost function value is less than the current value, the new solution is accepted. Otherwise, it is accepted only when the Metropolis criterion [29], which is based on the Boltzmann probability, is met. According to the Metropolis criterion, if the difference between the cost function values of the newly generated and the current solutions ($\Delta E$) is equal to or larger than zero, a random number d in $[0, 1]$ is generated from a uniform distribution. If
$$\exp(-\Delta E / T) \geq d, \qquad (3.1)$$

is met, the newly generated solution is accepted as the current solution. The exponential function given by (3.1) is occasionally called a Boltzmann function, and the operator is therefore also called a Boltzmann-type operator.
The number of new solutions generated at each temperature is the same as the iteration number at the temperature
which is constrained by the termination condition. The termination condition could be as simple as a certain number of iter-
ations. After all the iterations at a temperature are completed, the temperature is lowered according to the temperature updating rule. At the updated (and lowered) temperature, all required iterations have to be completed before moving to the next temperature. This process repeats until the halting criterion is met. The result of simulated annealing (SA) depends on the number of iterations at each temperature and on the speed of reducing the temperature. The temperature
updating rule can be chosen as
$$T_1 = \alpha^N T_0, \qquad (3.2)$$

where N is the number of iterations (generations), $T_0$ is the initial temperature, $T_1$ ($T_0 > T_1$) is the final temperature, and $\alpha$ is the cooling ratio. The cooling ratio controls the speed of cooling: the smaller the cooling ratio, the faster the temperature cools down.
The implementation of the SA algorithm is remarkably easy. The following elements must be provided: a representation of candidate solutions, a neighborhood (move) structure, a cost (objective) function, a cooling schedule, and an acceptance rule.
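A minimal sketch of SA for subset selection along these lines is given below (again illustrative, not the paper's code): solutions are 0/1 masks, the neighborhood move flips one variable in or out, acceptance follows the Metropolis criterion (3.1), and the temperature is cooled geometrically as in (3.2).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_annealing(cost, m, T0=100.0, alpha=0.9, iters_per_temp=50, n_temps=100):
    """SA over 0/1 variable masks; `cost` (e.g. AIC of the encoded subset) is minimized."""
    current = rng.integers(0, 2, m)
    current_cost = cost(current)
    best, best_cost = current.copy(), current_cost
    T = T0
    for _ in range(n_temps):
        for _ in range(iters_per_temp):
            cand = current.copy()
            cand[rng.integers(m)] ^= 1                       # neighbor: flip one variable in/out
            dE = cost(cand) - current_cost
            if dE < 0 or np.exp(-dE / T) >= rng.random():    # Metropolis acceptance, cf. (3.1)
                current, current_cost = cand, current_cost + dE
                if current_cost < best_cost:
                    best, best_cost = current.copy(), current_cost
        T *= alpha                                           # geometric cooling, cf. (3.2)
    return best, best_cost
```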
3.3. The hybrid genetic-simulated annealing algorithm (hybridGSA)

GA has been successfully used in a wide range of differentiable, nondifferentiable and discontinuous optimization problems encountered in statistics, engineering and economics applications [3,37]. However, GA has two major limitations. First,
the performance might deteriorate as the problem size grows. In fact, with a growing size of the problem, GA requires a lar-
ger population to obtain a satisfactory solution. Second, premature convergence might occur when the GA cannot find the
optimal solution due to loss of some important characters (genes) in candidate solutions [18,34,2,35,27,9,10].
The performance of GA can be improved by introducing more diversity among the chromosomes in the early stage of the
solution process so that premature convergence can be eliminated. A hybrid algorithm, which combines aspects of GA and SA, is proposed to overcome the limitations of GA. To implement this idea, the Metropolis acceptance test from SA is adopted into the GA. The new hybrid algorithm, referred to as hybridGSA, has been shown to overcome the poor convergence properties of GA and to outperform GA or SA alone.
Some authors [25] have suggested inserting a Boltzmann-type or SA operator after the crossover and mutation opera-
tions. This operator will induce competition between parents and children. This helps to prevent the population from becom-
ing homogeneous too soon.
By adding this Boltzmann-type operator (SA operator), we obtain a hybrid simulated annealing-genetic algorithm (hybridGSA). In each generation, selection, crossover and mutation produce offspring as in the standard GA; each offspring is then compared with the worst solution in the current population and, if it is not better, it is accepted only when the Metropolis criterion (3.1) is satisfied. The temperature is lowered and the process is repeated until the number of generations exceeds the pre-set maximum.
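The following sketch shows one way this generation loop with the Metropolis-based replacement could look in code. It reuses the illustrative GA operators from the sketch in Section 3.1 and is not the paper's implementation; the per-generation offspring count and the point at which the temperature is lowered are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hybrid_gsa(fitness, m, pop_size=100, p_cross=0.8, p_mut=0.1,
               generations=1000, T0=100.0, alpha=0.9):
    """GA with an SA (Boltzmann-type) operator: each offspring competes with the worst
    member of the population under the Metropolis acceptance rule; `fitness` is maximized."""
    pop = rng.integers(0, 2, size=(pop_size, m))
    fit = np.array([fitness(c) for c in pop])
    T = T0
    for _ in range(generations):
        # one offspring per generation step (illustrative; batch variants are possible)
        i, j = rng.integers(0, pop_size, 2), rng.integers(0, pop_size, 2)
        p1, p2 = pop[i[np.argmax(fit[i])]].copy(), pop[j[np.argmax(fit[j])]].copy()
        child = p1.copy()
        if rng.random() < p_cross:                          # single-point crossover
            cut = rng.integers(1, m)
            child[cut:] = p2[cut:]
        flip = rng.random(m) < p_mut                        # bit-flip mutation
        child[flip] = 1 - child[flip]
        child_fit = fitness(child)
        worst = int(np.argmin(fit))
        dE = fit[worst] - child_fit                         # positive when the child is worse
        if dE <= 0 or np.exp(-dE / T) >= rng.random():      # SA operator: Metropolis test (3.1)
            pop[worst], fit[worst] = child, child_fit
        T *= alpha                                          # cool down each generation, cf. (3.2)
    return pop[np.argmax(fit)]
```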
4. Computational results
In this section, we provide numerical examples that have frequently appeared in the related literature to illustrate the performance of the proposed hybridGSA method. These data sets were provided by Longley [26], Waugh [36] and Eksioglu et al. [15]. All computations are performed in MATLAB 9. In order to compare the performance of the models, the AIC criterion given by (2.5) is used for the Longley [26] and Waugh [36] data sets, and the adjusted R^2 given by (2.3) is used for the data sets taken from Eksioglu et al. [15]. In regression analysis, a small AIC value is preferable, whereas a large adjusted R^2 (or R^2) value is preferable.
Table 1 presents, for the Longley [26] and Waugh [36] data sets, the number of candidate independent variables, the sample sizes, the best model variables, the best model AIC value and the working time (in seconds) of an all-possible-regressions search.
For the GA search, the number of individuals in the population, the crossover probability, the mutation probability and the maximum number of iterations are chosen as 100, 0.8, 0.1 and 1000, respectively. Similarly, for the hybridGSA search, the number of individuals in the population, the crossover probability, the mutation probability, the maximum number of iterations, the initial temperature and the cooling rate are chosen as 100, 0.8, 0.1, 1000, 100 and 0.9, respectively. The statistical significance level is chosen as 0.05 for the FS and BE methods.
Tables 2–5 present, for the Longley [26] and Waugh [36] data sets, the detailed results of FS, BE, GA and hybridGSA, respectively. As can be seen in Table 5, the proposed hybridGSA method found the best subsets for both literature data sets. In contrast, FS, BE and GA did not find the best subsets, especially for the Waugh data set. For example, for the Waugh data set, X3, X4, X6, X7, X8, X9 is the set found by the hybridGSA method (which is the optimal set), X1, X4, X5, X7, X8, X9 is the set found by the FS and BE methods, and X1, X4, X6, X7, X8, X9 is the set found by the GA method. It is important to note that the variable X3, which is in the optimal group, was not identified by any of the other procedures, whereas the variable X1, which is not in the optimal group, was included by the other procedures.
[Flowchart of the hybridGSA procedure: the produced offspring is compared with the worst solution in the population; if it is not better, it is accepted only if the Metropolis condition is met; the loop stops when the number of generations exceeds the pre-set maximum.]
In addition to the Longley [26] and Waugh [36] data sets, a set of experiments taken from Eksioglu et al. [15] has been conducted to compare the performance of the hybridGSA model, and the results are reported in Table 6. A total of twelve data sets have been used for the experiments. Eksioglu et al. [15] used the adjusted R2 criterion for the comparisons. Therefore, we have used the adjusted R2 criterion for hybridGSA, FS, BE, GA and also for the GRASP and Lagrangian relaxation approaches proposed by Eksioglu et al. [15]. Detailed information about the data sets and the GRASP and Lagrangian relaxation models can be found in Eksioglu et al. [15].
As can be seen from the results (Table 6), hybridGSA found the best subset in eight out of twelve instances, whereas GRASP and Lagrangian relaxation found the best subset in five out of twelve instances (best values are shown in bold).
The hybridGSA procedure was also compared with FS, BE and GA by the AIC criterion, against the all-possible-regressions benchmark, using simulated data sets. The linear regression problems had 100 and 200 observations and k variables with 11 ≤ k ≤ 25. A total of 30 (15 × 2) different data sets were generated for this study. Each problem data set was generated independently of the others. For the entire set of problems, complete enumerations were performed over all possible subsets in order to find the best one in a rigorous way (i.e. the k = 25 problem has 2^25 = 33,554,432 possible subsets).
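The paper does not describe how the simulated regression problems were generated, so the sketch below is purely illustrative: standard normal predictors, a sparse true coefficient vector and Gaussian noise are assumptions. The exhaustive enumeration mirrors the all-possible-regressions benchmark using criterion (2.5) and is feasible only for moderate k.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def make_problem(n, k, n_active=5, noise_sd=1.0):
    """One simulated regression problem: only `n_active` of the k candidate variables
    actually enter the true model (all distributional choices here are assumptions)."""
    X = rng.standard_normal((n, k))
    beta = np.zeros(k)
    beta[rng.choice(k, n_active, replace=False)] = rng.uniform(1.0, 3.0, n_active)
    y = X @ beta + noise_sd * rng.standard_normal(n)
    return X, y

def best_subset_by_aic(X, y):
    """Complete enumeration of all non-empty subsets (the all-possible-regressions benchmark)."""
    n, k = X.shape
    best_aic, best_sub = np.inf, None
    for p in range(1, k + 1):
        for sub in itertools.combinations(range(k), p):
            Z = np.column_stack([np.ones(n), X[:, sub]])
            rss = float(((y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]) ** 2).sum())
            aic = n * np.log(rss / n) + 2 * p              # criterion (2.5)
            if aic < best_aic:
                best_aic, best_sub = aic, sub
    return best_sub, best_aic
```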
Table 1
All possible regression results.

                                  Longley            Waugh
Number of variables               6                  9
Sample size                       16                 17
Best model variables              X2, X3, X4, X6     X3, X4, X6, X7, X8, X9
Best AIC value                    227.66             93.528
Working time (in seconds)         0.4477             3.6103
Table 2
Forward selection results.

                                  Longley            Waugh
Predicted model variables         X2, X3, X4, X5     X1, X4, X5, X7, X8, X9
Predicted AIC value               252.45             118.96
Working time (in seconds)         0.2145             1.4587
Table 3
Backward elimination results.

                                  Longley            Waugh
Predicted model variables         X2, X3, X4, X5     X1, X4, X5, X7, X8, X9
Predicted AIC value               252.45             118.96
Working time (in seconds)         0.2346             1.5258
Table 4
Genetic algorithm results.

                                        Longley            Waugh
Predicted model variables               X2, X3, X4, X6     X1, X4, X6, X7, X8, X9
Predicted AIC value                     227.66             97.45
Working time (in seconds)               0.0941             0.5841
Number of individuals in the population 100                100
Crossover probability                   0.8                0.8
Mutation probability                    0.1                0.1
Maximum number of generations           1000               1000
Table 5
HybridGSA results.

                                        Longley            Waugh
Predicted model variables               X2, X3, X4, X6     X3, X4, X6, X7, X8, X9
Predicted AIC value                     227.66             93.528
Working time (in seconds)               0.0914             0.5087
Number of individuals in the population 100                100
Crossover probability                   0.8                0.8
Mutation probability                    0.1                0.1
Maximum number of generations           1000               100
Initial temperature                     100                100
Cooling rate                            0.9                0.9
Tables 7 and 8 present the optimal solutions determined by all possible regressions for the 100- and 200-observation problems, respectively. The multiple linear regression problems were then analyzed using the FS, BE, GA and hybridGSA procedures.
Tables 9 and 10 provide the results of the FS and BE procedures for n = 100 and n = 200, respectively. The statistical significance level is chosen as 0.05 for these statistical methods.
Table 6
Adjusted R2 values for the best subsets.

Table 7
All possible regression results (n = 100).

Table 8
All possible regression results (n = 200).
Tables 11 and 12 provide the results of the hybridGSA and GA procedures for n = 100 and n = 200, respectively. Control parameters are as in the literature examples for the GA and hybridGSA methods (see Table 12).
For each procedure, the number of variables in the solution and the predicted AIC values are provided. The number of variables in the solution represents the number of variables predicted by each procedure.
As can be seen in Tables 10 and 11, the proposed hybridGSA search procedure found the best subset for all of the problem sizes, except the n = 200, k = 24 case. In fact, the FS and BE methods provided quite different results even amongst themselves.
Table 9
Forward selection and backward elimination results (n = 100).

Table 10
Forward selection and backward elimination results (n = 200).

Table 11
HybridGSA and genetic algorithm results (n = 100).
Also, GA did not find the best subsets in the n = 100, k = 15, 18, 21, 24, 25 and n = 200, k = 17, 19, 21, 23, 24, 25 cases.
Table 12
HybridGSA and genetic algorithm results (n = 200).
The results obtained from the literature and simulated data sets indicate the superiority of the hybridGSA procedure over the existing subset selection procedures.
5. Conclusion
This paper introduced an alternative variable selection method for use in regression analysis that is based on a hybrid of the genetic and simulated annealing algorithms (hybridGSA). Using literature and simulated data sets, the hybridGSA procedure was compared with forward selection, backward elimination and the classical genetic algorithm, with the all-possible-regressions procedure as the benchmark. The results indicate the superiority of the hybridGSA procedure over existing subset selection procedures in multiple regression analysis.
Acknowledgments
We would like to thank the anonymous referees for their helpful and constructive comments on the previous version of
the manuscript which improved the presentation of this article.
References
[1] A. Al-Ani, A. Alsukker, R.N. Khushaba, Feature subset selection using differential evolution and a wheel based search strategy, Swarm Evol. Comput. 9
(2013) 15–26.
[2] H. Aytug, G.J. Koehler, New stopping criterion for genetic algorithms, Eur. J. Oper. Res. 26 (2000) 662–674.
[3] R. Baragona, F. Battaglia, C. Calzini, Genetic algorithms for the identification of additive and innovation outliers in time series, Comput. Stat. Data Anal.
37 (2001) 1–12.
[4] E.M.L. Beale, M.G. Kendall, D.W. Mann, The discarding of variables in multivariate analysis, Biometrika 54 (1967) 357–366.
[5] D. Broadhurst, R. Goodacre, A. Jones, J.J. Rowland, D.B. Kell, Genetic algorithms as a method for variable selection in multiple linear regression and
partial least squares regression, with applications to pyrolysis mass spectrometry, Anal. Chim. Acta 348 (1997) 71–86.
[6] P.J. Brown, Measurement Regression and Calibration, Clarendon Press, Oxford, 1993.
[7] M.J. Brusco, D. Steinley, J.D. Cradit, An exact algorithm for hierarchically well-formulated subsets in second-order polynomial regression,
Technometrics 51 (3) (2009) 306–315.
[8] S. Chatterjee, A.S. Hadi, Regression Analysis by Example, Wiley, 2006.
[9] K. Deep, M. Thakur, A new mutation operator for real coded genetic algorithms, Appl. Math. Comput. 193 (2007) 229–247.
[10] K. Deep, M. Thakur, A new crossover operator for real coded genetic algorithms, Appl. Math. Comput. 188 (2007) 895–911.
[11] Z. Drezner, G.A. Marcoulides, S. Salhi, Tabu search model selection in multiple regression analysis, Commun. Stat. Simul. 28 (2) (1999) 349–367.
[12] A.P. Duarte Silva, Efficient variable screening for multivariate analysis, J. Multivariate Anal. 76 (2001) 35–62.
[13] K. Fueda, M. Iizuka, Y. Mori, Variable selection in multivariate methods using global score estimation, Comput. Stat. 24 (2009) 127–144.
[14] G.M. Furnival, R.W. Wilson, Regression by leaps and bounds, Technometrics 16 (1974) 499–512.
[15] B. Eksioglu, R. Demirer, I. Capar, Subset selection in multiple linear regression: a new mathematical programming approach, Comput. Ind. Eng. 49
(2005) 155–167.
[16] C. Gatu, P.I. Yanev, E.J. Kontoghiorghes, A graph approach to generate all possible regression submodels, Comput. Stat. Data Anal. 52 (2007) 799–815.
[17] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
[18] J.J. Grefenstette, Optimization of control parameters for genetic algorithms, IEEE Trans. Syst. Man Cybern. 16 (1986) 122–128.
[19] J. Holland, Adaptation in Natural and Artificial Systems, Michigan Press, Michigan, 1975.
[20] C.L. Huang, ACO-based hybrid classification system with feature subset selection and model parameters optimization, Neurocomputing 73 (2009)
438–448.
[21] C. Jin, S.W. Jin, L.N. Qin, Attribute selection method based on a hybrid BPNN and PSO algorithms, Appl. Soft Comput. 12 (2012) 2147–2155.
[22] S. Kirkpatrick, C.D. Gerlatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[23] S. Kirkpatrick, Optimization by simulated annealing-quantitative studies, J. Stat. Phys. 34 (1984) 975–986.
[24] W.J. Krzanowski, Principles of Multivariate Analysis-A User’s Perspective, Oxford University Press, 1988.
[25] F.T. Lin, C.Y. Kao, C.C. Hsu, Applying the genetic approach to simulated annealing in solving some NP-hard problems, IEEE Trans. Syst. Man Cybern. 23
(6) (1993) 1752–1767.
[26] J. Longley, An appraisal of least squares programs for the electronic computer from the point of view of the user, J. Am. Stat. Assoc. 62 (1967) 819–841.
[27] H. Maaranen, K. Miettinen, M.M. Mäkelä, Quasi-random initial population for genetic algorithms, Comput. Math. Appl. 47 (2004) 1885–1895.
[28] R. Meiri, J. Zahavi, Using simulated annealing to optimize the feature selection problem in marketing applications, Eur. J. Oper. Res. 171 (2006) 842–
858.
[29] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines, J. Chem. Phys. 21 (6)
(1953) 1087–1091.
[30] A.J. Miller, Subset Selection in Regression, second ed., Chapman and Hall, London, 2002.
[31] M. Monirul Kabir, M. Shahjahan, K. Murase, A new local search based hybrid genetic algorithm for feature selection, Neurocomputing 74 (2011) 2914–
2928.
[32] M. Monirul Kabir, M. Shahjahan, K. Murase, A new hybrid ant colony optimization algorithm for feature selection, Expert Syst. Appl. 39 (2012) 3747–
3763.
[33] X. Peng, D. Xu, A local information-based feature-selection algorithm for data regression, Pattern Recognition, in press.
[34] M. Srinivas, L.M. Patnaik, Adaptive probabilities of crossover and mutation in genetic algorithms, IEEE Trans. Syst. Man Cybern. 24 (1994) 656–667.
[35] S. Tsutsui, D.E. Goldberg, Search space boundary extension method in real-coded genetic algorithms, Inf. Sci. 133 (2001) 229–247.
[36] F.B. Waugh, Graphic Analysis in Agricultural Economics, Agricultural Handbook No. 128, U.S. Department of Agriculture, 1957.
[37] P. Winker, M. Gilli, Applications of optimization heuristics to estimation and modelling problems, Comput. Stat. Data Anal. 47 (2) (2004) 211–223.
[38] W. Zhao, R. Zhang, Y. Lv, J. Liu, Variable selection of the quantile varying coefficient regression models, J. Korean Stat. Soc., in press.