The No Free Lunch Theorem Does Not Apply To Continuous Optimization
Social acceleration: $c_2 \left( g(k) - x_i(k) \right)$   (3)

Cognitive acceleration: $c_1 \left( p_i(k) - x_i(k) \right)$   (4)

where:
$c_1$ is the cognitive acceleration constant,
$c_2$ is the social acceleration constant,
$x_i(k)$ is the position vector of particle $i$ at iteration $k$,
$p_i(k)$ is the personal best of particle $i$ at iteration $k$,
$g(k)$ is the global best at iteration $k$.
Expression (3) defines the social acceleration of Global Best (Gbest) PSO. Local Best (Lbest)
PSO limits each particle's social sphere to knowledge of the best solution found by its neighbors
instead of immediately granting each particle knowledge of the best solution found so far by the entire
search team.
Substituting the sum of these two acceleration terms for $a(k)$ in Equation (2), while applying the
subscript $i$ adopted in (3) and (4), produces Equation (5).
$x_i(k+1) = x_i(k) + v_i(k) + \tfrac{1}{2} c_1 \left( p_i(k) - x_i(k) \right) + \tfrac{1}{2} c_2 \left( g(k) - x_i(k) \right)$   (5)
Having replaced physical acceleration in the position update equation of physics with social and
cognitive modeling, the next step toward producing a stochastic search algorithm is the replacement of
$1/2$ with a pseudo-random number sampled per dimension from the uniform distribution between 0
and 1, $U(0,1)$. Note that the expected or mean value of the distribution is still $1/2$. Designating the
first vector of pseudo-random numbers as $r_{1,i}$ and the second as $r_{2,i}$ produces Equation (6).
$x_i(k+1) = x_i(k) + v_i(k) + c_1 r_{1,i} \otimes \left( p_i(k) - x_i(k) \right) + c_2 r_{2,i} \otimes \left( g(k) - x_i(k) \right)$   (6)

where $\otimes$ denotes the Hadamard or "element-wise" product.
For convenience, the rather long Equation (6) is separated into a velocity update equation (7) and a
position update equation (8). This primarily helps with record keeping since each value can be stored
separately for post-simulation analysis. Substituting Equation (7) into (8) shows equivalency to (6).
$v_i(k+1) = v_i(k) + c_1 r_{1,i} \otimes \left( p_i(k) - x_i(k) \right) + c_2 r_{2,i} \otimes \left( g(k) - x_i(k) \right)$   (7)
$x_i(k+1) = x_i(k) + v_i(k+1)$   (8)
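To make the split concrete, the following is a minimal Python/NumPy sketch of Equations (7) and (8) for a single particle. The function and variable names are illustrative rather than taken from the paper, and the acceleration constants shown are simply the values adopted later in Subsection 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_update(v, x, p, g, c1=1.49617977, c2=1.49617977):
    """Equation (7): v, x, p are (n,) arrays for one particle; g is the
    (n,) global best. r1 and r2 are drawn per dimension from U(0, 1), so
    the multiplications below are Hadamard (element-wise) products."""
    r1 = rng.uniform(size=x.shape)
    r2 = rng.uniform(size=x.shape)
    return v + c1 * r1 * (p - x) + c2 * r2 * (g - x)

def position_update(x, v_new):
    """Equation (8): step to the new position using the updated velocity."""
    return x + v_new
```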
Since its conception, Equation (7) has acquired two mechanisms by which to improve search
behavior. The inertia weight, $\omega$, roughly simulates friction in a computationally inexpensive manner
by carrying over to the next iteration only a user-specified percentage of the current iteration's
velocity. This is done by multiplying the velocity of the current iteration by $\omega \in (-1, 1)$¹ as shown in
the first term of Equation (9) [2]. The constriction models use a constriction coefficient instead [3],
but the popular Type 1 parameters can be converted to Clerc's equivalents for use in Equation (9) [4].
$v_i(k+1) = \omega v_i(k) + c_1 r_{1,i} \otimes \left( p_i(k) - x_i(k) \right) + c_2 r_{2,i} \otimes \left( g(k) - x_i(k) \right)$   (9)
The other restriction imposed on velocity is essentially a speed limit [5]. Rather than limiting the
vector magnitude itself, the computationally simpler approach of limiting each component is
implemented as shown in Equation (10), which limits the magnitude indirectly.
$v_{i,j}(k+1) = \operatorname{sign}\left( v_{i,j}(k+1) \right) \cdot \min\left( \left| v_{i,j}(k+1) \right|, v_j^{\max} \right)$   (10)

where $j \in \{1, 2, \ldots, n-1, n\}$, and $n$ denotes the problem dimensionality.
This limits the maximum step size on dimension $j$ by clamping: (i) values greater than $v_j^{\max}$ to a
maximum value of $v_j^{\max}$, and (ii) values less than $-v_j^{\max}$ to a minimum of $-v_j^{\max}$. From a physical
perspective, particles with clamped velocities are analogous to birds with limited flying speeds.
Considering the psychological aspects of the algorithm, clamped velocities could also be considered
analogous to self-limited emotive responses.
Concerning the calculation of $v_j^{\max}$, suppose the feasible candidates for an application problem lie
within $[12, 20]$ on some dimension to be optimized. Clamping velocities to, say, 50% of $x^{\max}$ would
allow particles to take excessively large steps relative to the range of the search space on that
dimension; in this case, the maximum step size would be $0.5 \times 20 = 10$. But stepping 10 units in any
direction when the search space is only 8 units wide would be nonsensical. Since real-world
applications are not necessarily centered at the origin of Euclidean space, it is preferable to clamp
velocities based on the range of the search space per dimension in order to remove dependence on the
frame of reference [10]; hence, subscript $j$ is included in Equation (10) for the sake of generality, but it
can be dropped for applications with the same range of values per dimension.
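A short sketch of this range-based clamping in Python/NumPy follows; the bounds are assumed to be arrays holding one value per dimension, and the 15% fraction is the value adopted in Subsection 2.1.

```python
import numpy as np

def clamp_velocity(v, x_min, x_max, fraction=0.15):
    """Equation (10): per-dimension velocity clamping. Defining v_max from
    the range (x_max - x_min) of each dimension j removes any dependence
    on the frame of reference."""
    v_max = fraction * (x_max - x_min)                # one limit per dimension j
    return np.sign(v) * np.minimum(np.abs(v), v_max)  # sign(v) * min(|v|, v_max)
```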
1.2 Swarm Initialization
The optimization process begins by randomly initializing positions between a minimum and maximum
per dimension as per Relation (11). The most common benchmarks use the same minimum and
maximum per dimension. For application problems, however, these might differ depending on the
characteristics being optimized; hence, the general formula is provided, which uses subscript j to
indicate the dimension.
$x_{i,j}(k = 0) \in U\left( x_j^{\min}, x_j^{\max} \right)$   (11)
Velocities are similarly initialized according to Relation (12). For application problems with a
different range of feasible values on one dimension than on another, different step sizes per dimension
would make sense; hence, the general form is presented, which avoids unnecessarily imposing the
same range of feasible values on all characteristics to be optimized.
$v_{i,j}(k = 0) \in U\left( -v_j^{\max}, v_j^{\max} \right)$   (12)
Each particle's personal best is initialized to its starting position as shown in Equation (13).
$p_i(k = 0) = x_i(k = 0)$   (13)
________________________________
¹ $\omega$ is often selected to lie within $(0, 1)$, but a small negative value models trust and distrust quite effectively [4].
The global best is always the best of all personal bests as shown in Equation (14).
$g(k) = \operatorname*{argmin}_{p_i(k)} f\left( p_i(k) \right)$   (14)
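The initialization steps of Relations (11) and (12) and Equations (13) and (14) might be sketched as follows in Python; the helper name and signature are illustrative, not from the paper.

```python
import numpy as np

def initialize_swarm(S, x_min, x_max, v_max, f, rng):
    """S is the swarm size; x_min, x_max, v_max are (n,) arrays; f is the
    cost function to be minimized."""
    n = x_min.size
    x = rng.uniform(x_min, x_max, size=(S, n))    # (11): positions
    v = rng.uniform(-v_max, v_max, size=(S, n))   # (12): velocities
    p = x.copy()                                  # (13): personal bests
    f_p = np.array([f(xi) for xi in p])           # cost of each personal best
    g = p[np.argmin(f_p)].copy()                  # (14): best of all personal bests
    return x, v, p, f_p, g
```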
1.3 Iterative Optimization Routine
Once the swarm has been initialized, particles iteratively: (i) accelerate (i.e. adjust their velocity
vectors) toward the global best and their own personal bests, (ii) update and clamp their velocities, (iii)
update their positions, and (iv) update their personal bests and the global best. This routine is repeated
until reaching a user-specified termination criterion.
For convenience, the relevant equations are restated below as needed in order of implementation.
$v_i(k+1) = \omega v_i(k) + c_1 r_{1,i} \otimes \left( p_i(k) - x_i(k) \right) + c_2 r_{2,i} \otimes \left( g(k) - x_i(k) \right)$   (9)
$v_{i,j}(k+1) = \operatorname{sign}\left( v_{i,j}(k+1) \right) \cdot \min\left( \left| v_{i,j}(k+1) \right|, v_j^{\max} \right)$   (10)
$x_i(k+1) = x_i(k) + v_i(k+1)$   (8)
A particle's personal best is only updated when the new position offers a better function value:
$p_i(k+1) = \begin{cases} x_i(k+1) & \text{if } f\left( x_i(k+1) \right) < f\left( p_i(k) \right) \\ p_i(k) & \text{otherwise.} \end{cases}$   (15)
The global best is always the best of all personal bests:
$g(k+1) = \operatorname*{argmin}_{p_i(k+1)} f\left( p_i(k+1) \right).$   (14)
Rather than accelerating due to external physical forces, particles adjust toward solutions of
relative quality. Each position encountered as particles swarm is evaluated and compared to existing
bests. Though the behavior of each individual is simple, the collective result is an optimization
algorithm capable of maximizing or minimizing problems that would be difficult to tackle with
straightforward mathematical analyses, either because the problem is not well understood in advance
or simply because the problem is quite complicated.
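As a concrete summary, the following is a compact, runnable Python/NumPy sketch of the routine above, implementing Equations (9), (10), (8), (15), and (14) in order, with the inertia weight and acceleration constants adopted in Subsection 2.1; all names are illustrative.

```python
import numpy as np

def gbest_pso(f, x_min, x_max, k_max=1000, S=10, w=0.72984379,
              c1=1.49617977, c2=1.49617977, seed=0):
    rng = np.random.default_rng(seed)
    n = x_min.size
    v_max = 0.15 * (x_max - x_min)                 # range-based clamping limits
    x = rng.uniform(x_min, x_max, size=(S, n))     # (11)
    v = rng.uniform(-v_max, v_max, size=(S, n))    # (12)
    p = x.copy()                                   # (13)
    f_p = np.array([f(xi) for xi in p])
    g = p[np.argmin(f_p)].copy()                   # (14)
    for _ in range(k_max):
        r1 = rng.uniform(size=(S, n))
        r2 = rng.uniform(size=(S, n))
        v = w * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)   # (9)
        v = np.sign(v) * np.minimum(np.abs(v), v_max)       # (10)
        x = x + v                                           # (8)
        f_x = np.array([f(xi) for xi in x])
        better = f_x < f_p                                  # (15)
        p[better], f_p[better] = x[better], f_x[better]
        g = p[np.argmin(f_p)].copy()                        # (14)
    return g, f_p.min()

# Example run on the Rastrigin benchmark behind Figure 1:
rastrigin = lambda z: 10 * z.size + np.sum(z**2 - 10 * np.cos(2 * np.pi * z))
best, best_cost = gbest_pso(rastrigin, np.full(2, -5.12), np.full(2, 5.12))
```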
1.4 The No Free Lunch (NFL) Theorems
The NFL Theorems [6, 7, 8] essentially state that any pair of optimizers must perform equivalently
across the set of all problems, which implies that any attempt to design a general-purpose optimizer is
necessarily futile. The theorems are well established and have even become the basis for a book that
attempts to draw biological inferences from them [9]; consequently, it is worthwhile to scrutinize
their validity closely.
One could hardly question the productivity of designing for a particular purpose and clearly
communicating the intended problem class, yet the assertion that no algorithm can outperform any
other on the whole is disconcerting since it implies that no true general-purpose optimizer can exist. If
physical tools are analogous to algorithmic tools, then for Algorithm A to be no more effective across
the set of all problems than Algorithm B would imply that no rope, wheel, or pulley could be more
effective across the set of all problems than a shoe horn; yet pulleys clearly have more applications
than shoe horns, which calls the No Free Lunch concept into question. Even comparing shoe horns
to shoe horns, it would be a trivial matter to design one of such width as to be practically useless; this
too is incongruous with the No Free Lunch concept since the mean performance of a shoe horn that is
20 cm wide could hardly match the mean performance of a shoe horn that is 5 cm wide.
Disproof is often simpler than proof since a counterexample suffices. Subsection 2.2 demonstrates
via counterexample that the first NFL Theorem is too strong to be true, which also invalidates the
corollary that only via problem-specific knowledge can an algorithm outperform randomness [8].
Were algorithms only written for specific problem types, each problem would need to be
classifiable in advance unless multiple algorithms were to be randomly applied in series until
producing a quality solution, which would be quite an inefficient approach. Since modern
optimization algorithms can operate on systems capable of mapping inputs to output(s) black-box
style, a quality general-purpose optimizer, were such a thing possible, would offer an increased
probability of solving problems that are not classifiable in advance.
Furthermore, accidental misclassification of a problem would likely produce a relatively low-
quality solution. A quality general-purpose optimizer would therefore also be useful for checking
answers produced by more problem-specific tools. So the question is: can one optimizer produce
better mean performance than another in the long run?
2 The NFL Theorems Do Not Pertain to Continuous Optimization
The first NFL Theorem is arguably the most important to address since its notion of $a$-independence
is the seed from which the tree of NFL Theorems has sprung. It is quoted below for convenience.
Theorem 1: "For any pair of algorithms $a_1$ and $a_2$,

$\sum_f P\left( d_m^y \mid f, m, a_1 \right) = \sum_f P\left( d_m^y \mid f, m, a_2 \right)$."   (16)

where:
$f : X \to Y$ is an element of $F$, which denotes the space of all possible problems,
$a_1$ and $a_2$ are the algorithms being compared,
$m$ is the number of iterations allotted to the comparison, assuming no revisitation (i.e. the number of unique cost evaluations divided by the population size),
$d_m^y$ denotes the "ordered set of cost values",
$P\left( d_m^y \mid f, m, a \right)$ measures "the performance of an algorithm, $a$, iterated $m$ times on a cost function, $f$" [7].
Theorem 1 is only asserted to hold for deterministic algorithms. While PSO is stochastic in the
sense that it uses pseudo-random numbers, the counterexample of Subsection 2.2 works equally well
with the static $1/2$ of the traditional physics formula replacing the stochastic random number vectors,
$r_{1,i}$ and $r_{2,i}$. Following this substitution, only the randomized starting locations would be stochastic;
however, initial positions can be set identically for both algorithms either via any non-stochastic
initialization scheme or via identical seeding of the randomizer. Consequently, the pseudo-stochastic
nature of particle swarm does not exempt it from the NFL Theorems. Since the pair of algorithms
presented as a counterexample to NFL Theorem 1 in Subsection 2.2 presumes the seed of the
randomizer to be set identically as a practical matter, all starting positions are identical; hence,
removal of the stochasticity is unnecessary, though it would be straightforward to accomplish.
This perspective is consistent with Wolpert's statement, "A search algorithm is deterministic;
every sample maps to a unique new point. Of course, essentially, all algorithms implemented on
computers are deterministic, and in this our definition is not restrictive" [7]. In a footnote, Wolpert
further clarifies, "In particular, note that pseudorandom number generators are deterministic given a
seed" [7]. For purposes of the counterexample, the pseudo-randomness employed is deterministic as
evidenced by: (i) the full repeatability of any particular computer simulation, and (ii) the ability to give
multiple algorithms the same starting positions via identical seeding of the randomizer.
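With a modern generator, identical seeding is a one-liner, which makes the determinism concrete (a sketch; the seed and array sizes are arbitrary):

```python
import numpy as np

seed = 12345                                                # any fixed seed
x0_a1 = np.random.default_rng(seed).uniform(size=(10, 2))  # initial positions for a1
x0_a2 = np.random.default_rng(seed).uniform(size=(10, 2))  # identical positions for a2
assert np.array_equal(x0_a1, x0_a2)                        # the streams coincide exactly
```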
Theorem 1 essentially states that for any performance measure selected (e.g. the best function
value returned over $m$ iterations), the sum of an algorithm's performance over all possible functions
is identical to the sum of any other algorithm's performance. This is equivalent to stating that the
mean performances of all algorithms are equal across the set of all problems. Quoting Wolpert and
Macready, "This theorem explicitly demonstrates that what an algorithm gains in performance on one
class of problems is necessarily offset by its performance on the remaining problems."
Perhaps it is common sense that an algorithm specially designed for one problem type should be
expected to perform worse on another problem type than an algorithm specially designed for the latter
type; however, asserting that all algorithms should be expected to perform equivalently across the set
of all problems is tantamount to stating that a shoe horn should be expected to perform as well across
the set of all problems as a rope, wheel, pulley, or Swiss army knife. Since some tools are more
widely applicable than others, the notion that all tools must produce precisely the same mean
performance across the same large set of problems seems counterintuitive.
2.1 One Optimizer Can Outperform Another if Re-visitations Count
This example shows that one algorithm can outperform another across the set of all static, continuous
problems, $S_1$, if any possible re-visitations producing repeated evaluation of the same locations are
counted for comparison purposes. While the NFL Theorems do not claim to apply to the case
considered in Subsection 2.1, the demonstration provides an intuitive basis for the counterexample of
Subsection 2.2, which counts evaluations of approximately the same locations for comparison: a case
to which the NFL Theorems are thought to apply. In practice, continuous optimizers generally do not
attempt to prevent re-visitation since modern optimization software can generate $10^{16n}$ distinct points to
model continuous search spaces, such that the probability of any location being revisited is essentially
zero if the number of iterations is consistent with values used in practice, as discussed further in
Subsection 2.2.
Let Gbest PSO as described in Section 1 be algorithm $a_1$. To fully define the algorithm, let the
velocities of $a_1$ be clamped to 15% of the range of the search space per dimension using Formula (10).
Let the inertia weight be set to 0.72984379, and let both acceleration constants be set to 1.49617977:
this parameter selection produces the same search behavior as the Type 1 constriction model of PSO
[3] when $\kappa = 1$, $\phi = 4.1$, and $c_1$ and $c_2$ of Equations (2.14) and (2.16) of [4] are both 2.05. These
parameters define Gbest PSO throughout this paper and were used to generate the swarm trajectory
snapshots of Figure 1. More important than defining a particular swarm size is
using the same population size for both algorithms to produce a straightforward comparison;
nevertheless, Figure 1 was generated using a swarm size of 10 particles.
Define $a_2$ as a simple algorithm that uses the same population size as $a_1$ but never updates
positions: its initial positions are simply preserved. Let the initial positions be set identically for both
algorithms, which can be accomplished either by seeding the randomizer identically or by utilizing
any non-stochastic initialization scheme, such as the maximum possible equidistant spacing of
particles per dimension.
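A sketch of $a_2$ makes its passivity explicit (the helper name is illustrative):

```python
def a2_never_moves(f, x0, k_max):
    """a2: the population never moves, so every one of the k_max iterations
    re-visits the shared initial positions, and the best cost value is
    decided entirely at k = 0."""
    return min(f(xi) for xi in x0)
```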
Except for $m$ and $F$, whose definitions would not necessarily apply to Subsection 2.1, the
symbology of [7] is utilized to avoid symbolic clutter. Let $k_{\max}$ denote the number of iterations
allotted to the comparison. Let the performance measure be the lowest $Y$ value in sample $d^y_{k_{\max}}$,
i.e. $\min\left( d^y_{k_{\max}} \right) = \min\left\{ d^y_{k_{\max}}(i) : i = 1, \ldots, k_{\max} \right\}$; for Gbest PSO, this equates to the function value of the
global best when the search concludes at iteration $k_{\max}$.
Since algorithms $a_1$ and $a_2$ are initialized identically for the sake of a controlled experiment, and
since the global best of Gbest PSO updates only when a better position is discovered (i.e. only when
the lowest $Y$ value improves), Gbest PSO can never perform worse than $a_2$ on any static problem.
Consequently, proving that $a_1$ produces better mean performance than $a_2$ only requires that Gbest
PSO be demonstrated to improve upon its initial positions for any one problem.

Figure 1 shows the swarm migrating from a poorly initialized state to premature convergence at a
local minimizer offering the best function value of all positions occupied during the course of the
search, which is what attracts all particles to it via the social component [4]. Even in the worst-case
scenario that for all other problems in $S_1$, Gbest PSO should somehow fail to improve upon the best
position of the initialization phase in all successive iterations, $a_1$ would still outperform $a_2$ on set $S_1$
since $a_1$ can perform no worse than $a_2$ on any static problem for which positions are identically
initialized, by definition of the global best in Equation (14). In other words, since $a_1$ improves
performance over $a_2$ on at least one problem, and since $a_1$ performs at least as well as $a_2$ on all other
problems, algorithm $a_1$ achieves better mean performance than algorithm $a_2$ across the set of all
problems; hence, what algorithm $a_1$ gains in performance on one problem is never offset by its
performance on the remaining problems.
[Figure 1: Snapshots of a Gbest PSO swarm on the Rastrigin benchmark, migrating from a poorly initialized state to convergence at a local minimizer.]
The snapshots of Figure 1 were generated from data produced by an actual simulation on the
Rastrigin benchmark [10]. They illustrate that Gbest PSO improves upon its initial positions on at
least one problem; since algorithm $a_2$ cannot outperform algorithm $a_1$ on any static problem by
definition of the global best of $a_1$, algorithm $a_1$ outperforms algorithm $a_2$ on the set of all static
problems when precise re-visitations are counted for comparison purposes. Subsection 2.2 extends this
concept by proving that one algorithm can outperform another across the set of all problems if
approximate re-visitations are counted for comparison purposes, which in turn disproves the
applicability of the NFL Theorems to continuous optimization.
2.2 One Optimizer Can Outperform Another if Approximate Re-visitations Count
To demonstrate that one continuous optimizer can outperform another across the set of all possible
problems, $F$, even if the outperformed algorithm never revisits any precise location, let Gbest PSO as
defined in Subsection 2.1 be algorithm $a_1$. Let $P_1$ denote the percentage of trials, written as a decimal,
for which $a_1$ successfully approximates a global minimizer across set $F$ within a user-specified
tolerance for error (e.g. $P_1 = 0.5$ if algorithm $a_1$ successfully approximates the global minimizer on 50
attempts out of 100).
Let each dimension be linearly normalized to lie between 0 and 1 prior to the optimization
process, which both simplifies the ensuing calculations and ensures that each significant figure used
by the software assumes all 10 integer values from 0 to 9, thereby generating the maximum number of
discrete points within the search space. If MATLAB's format is set appropriately, 16 significant
figures are reliably stored per variable; hence, $10^{16}$ feasible values exist per dimension, constituting an
$n$-dimensional search space of precisely $10^{16n}$ locations. Though the search space is literally discrete
from a microscopic perspective, with $10^{16n}$ points it is certainly continuous from the macroscopic
perspective. The following counterexample demonstrates the NFL Theorems to be irrelevant to
optimization that is continuous from a macroscopic perspective, regardless of whether it becomes
discrete at a microscopic level or not.
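A minimal sketch of the linear normalization, assuming NumPy arrays of per-dimension bounds (the helper names are illustrative):

```python
import numpy as np

def normalize(x, x_min, x_max):
    """Map each dimension of a candidate solution linearly onto [0, 1]."""
    return (x - x_min) / (x_max - x_min)

def denormalize(u, x_min, x_max):
    """Map a normalized candidate back to the original search space."""
    return x_min + u * (x_max - x_min)
```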
Let $k_{\max}$ be a reasonable number of iterations over which to optimize a system. "Reasonable" in
this context means that the number is consistent with values used in practice. The first benefit of this
selection is that velocity vectors will have nonzero magnitudes throughout the simulation, such that no
precise $n$-dimensional location will be revisited due to absolute immobility: in practice, continuing the
optimization process once the swarm has literally stopped moving from even the most microscopic of
perspectives would expend computational time and energy without any chance of improving solution
quality, which is why the situation is avoided. Let $S$ denote both the swarm size of Gbest PSO and
the population size of the comparison algorithm, which are set identically for the sake of a controlled
experiment: the second benefit of using a reasonable value for $k_{\max}$ is that step size $\Delta < \frac{P_1}{S\,k_{\max}}$ will
be large enough for the software to distinguish it from 0; for example, step sizes larger than $10^{-323}$ are
reliably distinguished from 0 by MATLAB.
Since NFL Theorem 1 claims to be true for any pair of algorithms, define $a_2$ as a population-based
optimizer that, following any given initialization, iteratively updates each particle's position vector by
stepping to the right (i.e. in the positive direction) per dimension via increments of user-specified
width $\Delta < \frac{P_1}{S\,k_{\max}}$. If a particle ever reaches the rightmost edge of the search space on any
dimension, its subsequent position update will move it to the leftmost point on that dimension, from
which it will continue stepping to the right. Since $a_2$ searches uni-directionally with a step size
satisfying $10^{-16} \ll \Delta < \frac{P_1}{S\,k_{\max}}$, the discrete nature of the search space at a microscopic level
does not become a limiting factor; rather, it becomes irrelevant to the large-scale continuous nature of
the problem, as proved below.
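A sketch of this sweeping behavior follows, assuming the search space has been normalized to $[0, 1]$ per dimension as above and using modulo arithmetic to implement the wrap from the rightmost edge back to the leftmost point:

```python
def a2_sweep(f, x0, k_max, delta):
    """a2 of Subsection 2.2: every particle steps rightward by delta on each
    dimension each iteration, wrapping at the edge of the unit hypercube."""
    x = x0.copy()
    best = min(f(xi) for xi in x)
    for _ in range(k_max):
        x = (x + delta) % 1.0                     # uni-directional step with wraparound
        best = min(best, min(f(xi) for xi in x))  # count approximate re-visits too
    return best
```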
Even if the first algorithm, $a_1$, only successfully approximates a global minimizer once per one
million simulations with a population size of 100 (i.e. $P_1 = 10^{-6}$, $S = 10^2$), the resulting value of $10^8$
iterations for $k_{\max} = \frac{P_1}{S\,\Delta}$ at the microscopic step size $\Delta = 10^{-16}$ would still be larger than the
number of iterations used in practice by a factor of thousands. Since the number of iterations can
reasonably be selected such that $k_{\max} < \frac{P_1}{S} \cdot 10^{16}$, it follows that the step size can reasonably be
selected such that $\Delta > 10^{-16}$; hence, the computational requirements for this counterexample are
satisfied. Since the number of iterations is not set nearly as large as $10^8$ in practice, the microscopic
level at which the search space produces discrete limitations is not reached, so the optimization
process is continuous rather than discrete from the perspective of the optimization algorithm.
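The arithmetic behind the $10^8$ figure is easy to check directly (a sketch using the worst-case values assumed above):

```python
P1 = 1e-6       # a1 succeeds once per one million trials
S = 100         # population size of 10**2
delta = 1e-16   # step size at the microscopic granularity of the software
k_max = P1 / (S * delta)
print(k_max)    # ~1e8 iterations: far beyond values used in practice
```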
For problems that are continuous from the large-scale perspective, an optimizer can be designed to
search such a small fraction of the search space that it will necessarily perform worse than a
comparison algorithm. The following counterexample considers two mutually exclusive and
collectively exhaustive cases, both of which lead to the same conclusion: one algorithm can produce
better overall performance than another across the set of all problems.
Case I: Suppose that over the set of all relevant problems, $F$, global minimizers are uniformly
distributed relative to the search spaces containing them. In this case, dividing each dimension of each
problem's search space into 100 equidistant regions and counting the frequency with which the global
minimizers of $F$ occur within each region would produce a histogram showing precisely a 1%
probability of occurrence per region. In other words, suppose that all relative regions of the search
space are equally likely to contain a global minimizer for any problem chosen at random.
Continuous optimization algorithms sample locations within the search space and compare their
qualities to propose the best location visited as a solution. Such an algorithm can only locate a global
minimizer within the portion of the search space from which it successfully samples locations;
consequently, if $a_2$ searches within a region comprising less than $P_1 \cdot 100\%$ of the entire search
space, its success rate, $P_2$, will necessarily be less than $P_1$ across set $F$ due to the supposed uniform
distribution of global minimizers. Since step size $\Delta < \frac{P_1}{S\,k_{\max}}$