Bayesian Theory in Perception
Jacob Feldman
Dept. of Psychology, Center for Cognitive Science
Rutgers University - New Brunswick
Abstract
Bayesian approaches to perception offer a principled, coherent and elegant
answer to the central problem of perception: what the brain should be-
lieve about the world based on sensory data. This chapter gives a tutorial
introduction to Bayesian inference, illustrating how it has been applied to
problems in perception.
Inference in perception
One of the central ideas in the study of perception is that the proximal stimulus—the
pattern of energy that impinges on sensory receptors, such as the visual image—is not
sufficient to specify the actual state of the world outside (the distal stimulus). That is, while
the image of your grandmother on your retina might look like your grandmother, it also
looks like an infinity of other arrangements of matter, each having a different combination
of 3D structure, surface properties, color properties, etc., so that they happen to look just
like your grandmother from a particular viewpoint. Naturally, the brain generally does not
perceive these far-fetched alternatives, but rapidly converges on a single solution which
is what we consciously perceive. A shape on the retina might be a large object that is
far away, or a smaller one more nearby, or anything in between. A mid-gray region on
the retina might be a bright white object in dim light, or a dark object in bright light, or
anything in between. An elliptical shape on the retina might be an elliptical object face-on,
or a circular object slanted back in depth, or anything in between. Every proximal stimulus
is consistent with an infinite family of possible scenes, only one of which is perceived.
The central problem for the perceptual system is to quickly and reliably decide
among all these alternatives, and the central problem for visual science is to figure out
what rules, principles, or mechanisms the brain uses to do so. This process was called
unconscious inference by Helmholtz, perhaps the first scientist to appreciate the problem,
and is sometimes called inverse optics to convey the idea that the brain must in a sense
I am grateful to Manish Singh, Vicky Froyen, and Johan Wagemans for helpful discussions. Prepa-
ration of this article was supported by NIH EY021494. Please direct correspondence to the author at ja-
cob@ruccs.rutgers.edu.
invert the process of optical projection—to take the image and recover the world that gave
rise to it.
The modern history of visual science contains a wealth of proposals for how exactly
this process works, far too numerous to review here. Some are very broad, like the
Gestalt idea of Prägnanz (infer the simplest or most reasonable scene consistent with the
image). Many others are narrowly addressed to specific aspects of the problem like the
inference of shape or surface color. But historically, the vast majority of these proposals
suffer from one (or both) of the following two problems. First, many (like Prägnanz and
many other older suggestions) are too vague to be realized as computational mechanisms.
They rest on central ideas, like the Gestalt term “goodness of form,” that can at best be
only subjectively defined and cannot be implemented algorithmically without a host
of additional assumptions. Second, many proposed rules are arbitrary or unmotivated,
meaning that it is unclear exactly why the brain would choose them rather than an infinity of
other equally effective ones. Of course, it cannot be taken for granted that mental processes
are principled in this sense, and some have argued for a view of the brain as a “bag of
tricks” (Ramachandran, 1985). Nevertheless, to many theorists, a mental function as central
and evolutionarily ancient as perceptual inference seems to demand a more coherent and
principled explanation.
The conditional probability of A given B is the ratio of the probability that A and B are both
true divided by the probability that B is true,

p(A|B) = p(A and B) / p(B). (1)
Similarly, the probability of B given A is the ratio of the probability that B and A are both
true divided by the probability that A is true, hence
p(B|A) = p(B and A) / p(A). (2)
It was the reverend Thomas Bayes (1763) who first noticed that these mathematically simple
observations can be combined1 to yield a formula for the conditional probability p(A|B) (A
given B) in terms of the inverse conditional probability p(B|A) (B given A),
1
More specifically, note that p(B and A) = p(A and B) (conjunction is commutative). Substituting the latter
for the former in Eq. 2 shows that p(A|B)p(B) and p(B|A)p(A) are both equal to p(A and B) and thus to each
other. Dividing both sides of p(A|B)p(B) = p(B|A)p(A) by p(B) yields Bayes’ rule.
p(A|B) = p(B|A)p(A) / p(B), (3)
a formula now called Bayes’ theorem or Bayes’ rule.2 Before Bayes, the mathematics of
probability had been used exclusively to calculate the chances of a particular random
outcome of a stochastic process, like the chance of getting ten consecutive heads in ten
flips of a fair coin [p(10 heads|fair coin)]. Bayes realized that his rule allowed us to invert
this inference and calculate the probability of the conditions that gave rise to the observed
outcome—here, the probability, having observed 10 consecutive heads, that the coin was
fair in the first place [p(fair coin|10 heads)]. Of course, to determine this, you need to
assume that there is some other hypothesis we might entertain about the state of the coin,
such as that it is biased towards heads. Bayes’ logic, often called inverse probability, allows
us to evaluate the plausibility of various hypotheses about the state of the world (the nature
of the coin) on the basis of what we have observed (the sequence of flips). For example, it
allows us to quantify the degree to which observing 10 heads in a row might persuade us
that the coin is biased towards heads.
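To make this concrete, here is a minimal numerical sketch (in Python, which the chapter itself does not use) of the coin example. The prior probability of a fair coin (.99) and the heads bias of the alternative coin (.9) are illustrative assumptions, not values from the text.

# Inverse probability for the coin example: having observed 10 heads in a row,
# how plausible is "fair coin" versus "biased towards heads"?
p_fair = 0.99                # assumed prior probability that the coin is fair
p_biased = 1 - p_fair        # prior probability that it is biased towards heads

lik_fair = 0.5 ** 10         # p(10 heads | fair coin)
lik_biased = 0.9 ** 10       # p(10 heads | biased coin), assuming a 0.9 heads bias

# Bayes' rule: the posterior is proportional to prior times likelihood.
evidence = p_fair * lik_fair + p_biased * lik_biased    # p(10 heads)
post_fair = p_fair * lik_fair / evidence
post_biased = p_biased * lik_biased / evidence

print(f"p(fair | 10 heads)   = {post_fair:.3f}")     # about 0.22
print(f"p(biased | 10 heads) = {post_biased:.3f}")   # about 0.78

Even with a strong prior belief that the coin is fair, ten consecutive heads shift most of the posterior probability onto the biased hypothesis, which is exactly the kind of quantified persuasion described above.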
Bayes and his followers, especially the visionary French mathematician Laplace, saw
how inverse probability could form the basis of a full-fledged theory of inductive inference
(see Stigler, 1986). As David Hume had pointed out only a few decades previously, much
of what we believe in real life—including all generalizations from experience—cannot be
proved with logical certainty, but instead merely seems intuitively plausible on the basis
of our knowledge and observations. To philosophers seeking a deductive basis for our
beliefs, this argument was devastating. But Laplace realized that Bayes’ rule allowed us to
quantify belief—to precisely gauge the plausibility of inductive hypotheses.
By Bayes’ rule, given any data D which has a variety of possible hypothetical causes
H1 , H2 , etc., each cause Hi is plausible in proportion to the product of two numbers: the
probability of the data if the hypothesis is true p(D|Hi ), called the likelihood; and the prior
probability of the hypothesis, p(Hi ), that is, how probable the hypothesis was in the first
place. If the various hypotheses are all mutually exclusive, then the probability of the data
D is the sum of its probability under all the various hypotheses,
X
p(D) = p(H1 )p(D|H1 ) + p(H1 )p(D|H1 ) + . . . = p(Hi )p(D|Hi ). (4)
i
Plugging this into Bayes’ rule (with Hi playing the role of A, and D playing the role of B),
this means that the probability of hypothesis Hi given data D, called the posterior probability
p(Hi|D), is

p(Hi|D) = p(D|Hi)p(Hi) / Σj p(D|Hj)p(Hj). (5)
The posterior probability p(Hi |D) quantifies how much we should believe Hi after consid-
ering the data. It is simply the ratio of the probability of the evidence under Hi (the product
of its prior and likelihood) relative to the total probability of the evidence arising under
all hypotheses (the sum of the prior-likelihood products for all the hypotheses). This ratio
measures how plausible Hi is relative to all the other hypotheses under consideration.
But Laplace’s ambitious account was followed by a century of intense controversy
about the use of inverse probability (see Howie, 2004). In modern retellings, critics’ ob-
jection to Bayesian inference is often reduced to the idea that to use Bayes’ rule we need
to know the prior probability of each of the hypotheses (for example, the probability the
coin was fair in the first place), and that we often don’t have this information. But their
criticism was far more fundamental and relates to the meaning of probability itself. They
argued that many propositions—those whose truth value is fixed though unknown—can’t
be assigned probabilities at all, in which case the use of inverse
probability would be nonsensical. This criticism reflects a conception of probability, often
called frequentism, in which probability refers exclusively to relative frequency in a repeatable
chance situation. Thus, in their view, you can calculate the probability of a string of heads
for a fair coin, because this is a random event that occurs on some fraction of trials; but you
can’t calculate the probability of a non-repeatable state of nature, like “this coin is fair” or
“the Higgs boson exists,” because such hypotheses are either definitely true or definitely false,
and are not “random.” The frequentist objection was not just that we don’t know the prior
for many hypotheses, but that most hypotheses don’t have priors—or posteriors, or any
probabilities at all.
But in contrast, Bayesians generally thought of probability as quantifying the de-
gree of belief, and were perfectly content to apply it to any proposition at all, including
non-repeatable ones. To Bayesians, the probability of any proposition is simply a charac-
terization of our state of knowledge about it, and can freely be applied to any proposition as
a way of quantifying how strongly we believe it. This conception of probability, sometimes
called subjectivist (or epistemic or sometimes just Bayesian), is thus essential to the Bayesian
program. Without it, one cannot calculate the posterior probability of a non-repeatable
proposition, because such propositions simply don’t have probabilities—and this would
rule out most uses of Bayes’ rule to perform induction.
The sometimes ferocious controversy over this issue culminated around 1920 when
the fervently frequentist statisticians Fisher, Neyman, and Pearson founded what we now
call classical statistics—sampling distributions, significance tests, confidence intervals, and
so forth—on a platform of rejecting inverse probability in the name of objectivity. But
the theory of Bayesian inference continued to develop in the shadows, and was given a
comprehensive modern formulation by Harold Jeffreys (1939/1961) and others. This history
helps explain why, despite centuries of development, Bayesian techniques are only in the
last few decades being applied without apology to inference problems in many fields,
including human cognition.
Cox (1961) showed that Bayesian inference, and it alone among inference systems, satisfies basic considerations of internal consistency,
such as invariance to the order in which evidence is considered. If one wishes to assign
degrees of belief to hypotheses in a rational way, one must inevitably use the conventional
rules of probability, and specifically Bayes’ rule. Later de Finetti (see de Finetti, 1970/1974)
demonstrated the uniquely rational status of Bayesian inference in an even more acute way.
He showed that if a system of inference differs from Bayesian inference in any substantive
way, it is subject to catastrophic failures of rationality. (His so-called Dutch book theorem
shows, in essence, that any non-Bayesian reasoner can be turned into a “money pump”.) In
recent decades these strong arguments for Bayesian inference as a uniquely rational system
for fixing belief were brought to wide attention by the vigorous advocacy of the physicist
E. T. Jaynes (see Jaynes, 2003). Though there are of course many subtleties surrounding the
supposedly optimal nature of Bayesian inference (see Earman, 1992), most contemporary
statisticians have rejected the dogmatic frequentism that underlies classical statistics, and
now regard Bayesian inference as an optimal method for making inferences on the basis of
data.
This characterization of Bayesian inference—as an optimal method for deciding what
to believe under conditions of uncertainty—makes it perfectly suited to the central problem
of perception, that of estimating the properties of the physical world based on sense data.
The basic idea is to think of the stimulus (e.g. the visual image) as reflecting both stable
properties of the world (which we would like to infer) plus some uncertainty introduced
in the process of image formation (which we would like to disregard). Bayesian inference
allows us to estimate the stable properties of the world conditioned on the image data. The
aptness of Bayesian inference as a model of perceptual inference was first noticed in the
1980s by a number of authors, and brought to wider attention by the collection of papers
in Knill and Richards (1996). Since then the applications of Bayes to perception have
multiplied and evolved, while always retaining the core idea of associating perceptual
belief with the posterior probability as given by Bayes’ rule. Several excellent reviews of
the literature are already available (e.g. see Kersten, Mamassian, & Yuille, 2004; Knill,
Kersten, & Yuille, 1996; Yuille & Bülthoff, 1996) each with a slightly different emphasis or
slant. The current chapter is intended as a tutorial introduction to the main ideas of
Bayesian inference in human perception, with some emphasis on misunderstandings that
tend to arise in the minds of newcomers to the topic. Although the examples are drawn
from the perception literature, most of the main ideas apply equally to other areas of
cognition as well. The emphasis will be on central principles rather than on mathematical
details or recent technical advances.
We begin with several simple numerical examples to illustrate the basic calculations
in Bayesian inference, before moving on to perceptual examples.
The simplest type of Bayesian inference involves a finite number of distinct hypothe-
ses H1 . . . Hn , each of which has a prior probability p(Hi ) and a likelihood function p(X|Hi )
which gives the probability of each possible dataset X conditioned on that hypothesis.3 For
example, imagine that you hear a noise X on your roof, which is either an animal A or a
burglar B. The noise sounds a bit like an animal, implying a moderate animal likelihood,
say p(X|A) = .3. (That is, if it were an animal, there is about a 30% chance of a noise of
the type that you hear.) But unfortunately it sounds a lot like a burglar, implying a high
burglar likelihood, say p(X|B) = .8. Classical statistics dictates that we select hypotheses
by maximizing likelihood, which in this situation would imply a burglar (and necessitate
an immediate call to the police). But Bayes’ rule tells us that along with the likelihood we
should incorporate the prior, which we assume strongly favors animal, say p(A) = .999
and p(B) = .001. (Burglars are, thankfully, rare.) For each hypothesis the posterior is
proportional to the product of the prior and likelihood, hence

p(X|A)p(A) = .3 × .999 = .2997,
p(X|B)p(B) = .8 × .001 = .0008.
The denominator in Bayes’ rule is the total probability of the data under all hypotheses,
here

p(X) = p(X|A)p(A) + p(X|B)p(B) = .2997 + .0008 = .3005,

so the posteriors are
p(A|X) = p(X|A)p(A) / p(X) = .2997 / .3005 = .9973, (10)
p(B|X) = p(X|B)p(B) / p(X) = .0008 / .3005 = .0027, (11)
strongly favoring animal. Notice that when comparing the posteriors, we really only need
to compare the numerators since the denominators are the same. Hence Bayes’ rule is
often given in its “proportional” form p(H|D) ∝ p(D|H)p(H), in which the denominator is
disregarded.
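The numbers in this example are easy to reproduce; the minimal Python sketch below (not part of the original text) simply applies the proportional form of Bayes’ rule to the two hypotheses.

# Posterior over the two hypotheses (animal vs. burglar) given the noise X,
# using the priors and likelihoods assumed in the example above.
priors      = {"animal": 0.999, "burglar": 0.001}   # p(H)
likelihoods = {"animal": 0.3,   "burglar": 0.8}     # p(X | H)

# Proportional form: p(H | X) is proportional to p(X | H) p(H).
unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
p_X = sum(unnormalized.values())                    # the denominator p(X) = .3005
posterior = {h: unnormalized[h] / p_X for h in priors}

print(posterior)   # {'animal': 0.9973..., 'burglar': 0.0026...}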
3
Students are often warned that the likelihood function is not a probability distribution, a remark that in
my experience tends to cause confusion. In traditional terminology, likelihood is an aspect of the model or
hypothesis, not the data, and one refers for example to the likelihood of H (and not the likelihood of the data
under H). This is because the term likelihood was introduced by frequentists, who insisted that hypotheses did
not have probabilities (see text), and sought a word other than “probability” to express the degree of support
given by the data to the hypothesis in question. However, to Bayesians, the distinction is unimportant,
since both data and hypotheses can have probabilities, so Bayesians tend (especially recently) to refer to the
likelihood of the data under the hypothesis, or the likelihood of the hypothesis, in both cases meaning the
probability p(D|H). In this sense, likelihoods are indeed probabilities. However note that the likelihoods of
the various hypotheses do not have to sum to one (for example, it is perfectly possible for many hypotheses to
have likelihood near one given a dataset that they all fit well). In this sense, the distribution of likelihood
over hypotheses (models) is certainly not a probability distribution. But the distribution of likelihood over the
data for a single fixed model is, in fact, a probability distribution and sums to one.
Parameter estimation
Imagine that the Dubious Cola company claims to put one liter (1000 ml) in each of its
bottles, and that you measure the contents of n bottles, obtaining measurements x1 . . . xn
that are normally distributed about the true mean volume µ with standard deviation σ.
(In this example for simplicity we’ll assume that σ is known and just try to estimate
µ.) Because the n measurements are all independent, the entire dataset X = x1 . . . xn has
likelihood4

p(X|µ) = ∏i (1/(σ√2π)) exp[−(xi − µ)²/(2σ²)].
Classical statistics would say that the best estimate of the “population mean” µ is the value
with maximum likelihood, which in this case is the sample mean X̄, here 802. But Bayes’ rule
says that in addition to the likelihood, which reflects the information gained from the data,
you should incorporate whatever prior information you have about the probable value of
the parameter—in this case the assumption that Dubious Cola puts one liter (1000 ml) in
a one-liter bottle. Indeed, the optimality of Bayesian inference means that it is in effect
irrational to ignore this information. In this case it is reasonable to assume that the value of
µ is probably about 1000, with (again by assumption) a normal distribution of uncertainty
about this value. Narrower distributions would mean stronger biases towards 1000, wider
ones weaker biases. (If you really had no idea what value to expect, you could make your
prior very wide and flat, in which case it would exert very little influence on the posterior.)
Now Bayes’ rule tells us that the posterior probability of each value of µ, meaning how
believable it is in light of both the data and the prior, is proportional to the product of the
prior and likelihood,

p(µ|X) ∝ p(µ) p(X|µ).
This yields a value of p(µ|X) for every possible value of µ (the posterior distribution) which
indicates how strongly we should believe in each value of µ given both the data and our
prior beliefs.
Fig. 1 illustrates how the posterior distribution evolves as more data are acquired,
and how it relates to the prior and likelihood. The prior is a normal distribution centered at
1000, because that’s what we believed a bottle would contain before we started measuring.
(In the figure distributions are depicted via their mean plus error bars to indicate one
standard deviation; all distributions are normal.) As data is acquired (moving from left
to right in the figure), the likelihood is always centered at the sample mean (which is the
value that best fits the data so far). But the posterior, which combines the prior with the
likelihood via Bayes’ rule, is somewhere in between the prior and likelihood—gradually
approaching the likelihood, and gradually getting tighter (narrower error bars) as we collect
more data and our knowledge gets firmer. That is, the data gradually draw our beliefs
away from the prior and towards what the evidence tells us. Thus as we collect more and
more data, the posterior distribution increasingly resembles the likelihood distribution.
This is often referred to as the likelihood “overwhelming” the prior, and is one of the
reasons why in some (though not all) situations the exact choice of prior doesn’t matter
very much—because as evidence accumulates the prior tends to matter less and less.
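The evolution shown in Fig. 1 can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (normal prior centered at 1000 with standard deviation 50, data generated around 800); the measurement noise standard deviation of 100 is an additional assumption made here for concreteness.

import random

# Conjugate normal-normal updating: with a normal prior and a normal likelihood
# (known noise sd), the posterior is again normal, with a precision-weighted mean.
random.seed(1)
prior_mean, prior_sd = 1000.0, 50.0     # prior belief about the bottle contents
true_mean, noise_sd = 800.0, 100.0      # data-generating process (as in Fig. 1)

data = []
for n in [2, 4, 8, 16, 32, 64, 128]:
    while len(data) < n:
        data.append(random.gauss(true_mean, noise_sd))
    sample_mean = sum(data) / n
    # Precisions (inverse variances) add; the posterior mean is a precision-weighted
    # average of the prior mean and the sample mean.
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = n / noise_sd ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + data_prec * sample_mean) / post_prec
    post_sd = post_prec ** -0.5
    print(f"n={n:4d}  posterior mean = {post_mean:6.1f}   posterior sd = {post_sd:4.1f}")

As n grows, the posterior mean drifts from 1000 towards the sample mean and the posterior standard deviation shrinks, which is the “overwhelming” of the prior described above.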
The peak of the posterior distribution, the value of the parameter that has the highest
posterior probability, is called the maximum a posteriori or MAP value. If we need to reduce
our posterior beliefs to a single value, this is the most plausible, and casual descriptions of
Bayesian inference often imply that Bayes’ rule dictates that we choose the MAP hypothesis.
But remember that Bayes’ rule does not actually authorize this reduction; it simply tells us how
much to believe each hypothesis—that is, the full posterior distribution. In many situations
use of the MAP can be quite undesirable: for example, when the posterior is broadly distributed,
with many other highly probable values, or multimodal, with multiple peaks
that are almost as plausible as the MAP. Reducing the posterior distribution to a single
“winner” discards useful information, and it should be kept in mind that in principle only
the entire posterior distribution expresses the totality of our posterior beliefs.
Model selection
Many situations require both discrete hypothesis selection and parameter estimation
because the observer has to choose between several qualitatively distinct models, each of
which has some number of parameters that must be estimated; this is the problem of model
selection. Assessing the relative probability of such models can be difficult if, as is often
the case, the competing models have different numbers of parameters, because all else
being equal models with more parameters have more flexibility to fit the data, since each
parameter can act as a “fudge factor” that can improve the fit (increase the likelihood).
Classical statistics has very limited tools to deal with this very common situation unless
the models are nested (one a subset of the other). But Bayesian techniques can be applied in
a straightforward way, the simplest being to consider the ratio of the integrated likelihood
of one model relative to that of another, sometimes called the Bayes factor (see Kass &
Raftery, 1995). This is not the same as comparing the maximized likelihood of each model
(the likelihood of the model after all its parameters have been set so as to maximize fit
to the data). The maximized likelihood ratio, unlike the Bayes factor, considers only the
Figure 1. Relationship between prior, likelihood, and posterior distributions as data is accumulated
over time. Each distribution here is normal (Gaussian) and is depicted as a point representing the
mean, with error bars representing the standard deviation. The observer has a prior centered on x =
1000 with a standard deviation of 50. Data are actually generated from a normal centered at x = 800.
The posterior distribution gradually migrates from the prior, where belief was initially centered,
towards the likelihood, where the evidence points. Both likelihood and posterior gradually tighten
as more data is acquired.
best fitting parameter settings for each model, which intrinsically favors more complex
models (i.e. ones with more parameters) unless a correction is used such as AIC (Akaike,
1974) or BIC (Schwarz, 1978) (see Burnham & Anderson, 2004). But Bayesians argue that
no complexity correction is necessary with the use of Bayes factors, because Bayes’ rule
automatically trades off fit to the data (the likelihood, which tends to benefit from more
parameters) with the complexity of the model (which tends to be penalized in the prior;
see below). This tradeoff, a version of the bias-variance tradeoff that is seen everywhere
in statistical inference (see Hastie, Tibshirani, & Friedman, 2001), is quite fundamental to
Bayesian inference, because the essence of Bayes’ rule is the optimal combination of data
fit (reflected in the likelihood) and bias (reflected in the prior).
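A toy illustration of the difference between integrated and maximized likelihood, sketched below in Python, assumes coin-flip data (7 heads in 10 flips, an arbitrary choice) and compares a zero-parameter model (a fair coin) against a one-parameter model (a coin with unknown bias and a uniform prior on that bias). Neither the data nor the models come from the chapter; the point is only to show the automatic complexity penalty.

import math
import numpy as np

heads, n = 7, 10    # assumed data: 7 heads out of 10 flips

def binom_lik(theta):
    """Probability of the observed data for a coin with heads probability theta."""
    return math.comb(n, heads) * theta ** heads * (1 - theta) ** (n - heads)

# Model M0 (fair coin) has no free parameters, so its integrated likelihood is just
# its likelihood.  Model M1 (unknown bias) integrates the likelihood over its uniform
# prior on theta, averaging good and bad parameter settings together.
lik_M0 = binom_lik(0.5)
thetas = np.linspace(0.0, 1.0, 10001)
lik_M1 = np.trapz([binom_lik(t) for t in thetas], thetas)

# For contrast: the maximized likelihood of M1 uses only its best-fitting theta.
lik_M1_max = binom_lik(heads / n)

print(f"Bayes factor M1/M0 (integrated likelihoods): {lik_M1 / lik_M0:.2f}")      # < 1
print(f"Maximized-likelihood ratio M1/M0:            {lik_M1_max / lik_M0:.2f}")  # > 1

The flexible model wins by maximized likelihood but loses by integrated likelihood, because its prior is spread over many parameter values that fit the data poorly.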
Bayesian inference can be used to estimate perceptual parameters in much the same
way it was in the Dubious Cola example. An example is the estimation of color, a classic
case of perceptual ambiguity. The reflectance properties of a surface, which determine
which wavelengths of light are reflected off the surface in what proportions, are a fixed
attribute of the material. But the light that hits our eyes reflects both this attribute, which is
what we are trying to determine, and the properties of the light source, which we usually
are not. In effect, the quantity of (say) red light that hits our eyes is a product of how much
red light is in the light source multiplied by the proportion of red light that the particular
surface reflects. Since all we can measure directly is their product, we cannot infer the
surface properties—what we care about—without some additional assumptions or tricks.
As in all problems of perception, the sensory data is insufficient by itself to disambiguate
the properties of the world. The question then is how the brain solves this problem and
thus infers the material properties of the surface—thus explaining why red things look red
approximately regardless of the color of the light source.
Brainard and Freeman (1997) and Brainard et al. (2006) have proposed a simple
Bayesian solution to this problem. First, they assume that the measurement of light am-
plitude at each frequency is, like the measurement of the volume of Coca-cola, subject to
Gaussian error. That is, when our photoreceptors measure the amount of (say) red light
reflected off a surface, the measurement reflects the true reflectance ρ plus some
normally distributed error. This determines the likelihood function p(x|ρ). But (following
Bayes’ rule) in order to estimate the true ρ, we also need to consider the prior distribution
p(ρ), that is, the prior probability that the surface will have the given reflectance ρ
prior to considering the image (Fig. 2). Brainard et al. (2006) estimated this by first deriving
a low-parameter model of surfaces (that is, finding a small number of parameters that
together describe the variation among most surfaces). They then empirically measured
the relative frequency of different values of each of these parameters among the surfaces.
The results suggest a Gaussian (normal) prior over each of the parameters, meaning that
(just as with the volume of Coke bottles) each parameter has a typical mean value with bell-shaped uncertainty
Figure 2. Schematic of Brainard et al.’s (2006) theory of color estimation. The observer’s goal
is to infer the true surface reflectance ρ, though the observed light at the given frequency is the
product of ρ and the illumination I. The Bayesian solution is to adopt a prior over ρ and a likelihood
function p(x|ρ) that assumes normally distributed noise, which together lead to a posterior over potential
surface properties.
about the mean. We can then compute the posterior probability of each parameter based
on the image data and prior knowledge about plausible surfaces, to give an estimate of
the perceived color of each surface patch. The results show a remarkable agreement with
human judgments, suggesting that our color judgments are close to optimal given the
uncertainty inherent in the situation.
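The essential computation can be sketched as follows (a minimal one-dimensional caricature in Python, not Brainard et al.’s actual model): the sensor reading is the product of reflectance and illuminant plus Gaussian noise, and Gaussian priors over both unknowns yield a posterior over reflectance once the illuminant is integrated out. All numerical values are illustrative assumptions.

import numpy as np

noise_sd   = 0.05          # assumed sensor noise
x_observed = 0.36          # measured "red" signal (reflectance times illuminant, plus noise)

rhos = np.linspace(0.01, 1.0, 200)    # candidate surface reflectances
ills = np.linspace(0.01, 2.0, 400)    # candidate illuminant intensities

# Gaussian priors over reflectance and illuminant (assumed independent).
prior_rho = np.exp(-0.5 * ((rhos - 0.5) / 0.2) ** 2)
prior_ill = np.exp(-0.5 * ((ills - 1.0) / 0.3) ** 2)

# Joint posterior on a grid: prior(rho) * prior(I) * likelihood(x | rho, I).
R, I = np.meshgrid(rhos, ills, indexing="ij")
lik = np.exp(-0.5 * ((x_observed - R * I) / noise_sd) ** 2)
post = prior_rho[:, None] * prior_ill[None, :] * lik
post /= post.sum()

# Marginal posterior over reflectance: sum out the unknown illuminant.
post_rho = post.sum(axis=1)
print("most probable reflectance:", round(float(rhos[np.argmax(post_rho)]), 3))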
[Figure 3: prior and likelihood distributions over motion velocity v, from slow to fast.]
Weiss, Simoncelli, and Adelson (2002) showed that human motion percepts, including a range
of standard motion illusions, are predicted by a simple Bayesian model in which (a) the
prior favors slower speeds over faster ones, and (b) the likelihood is based on conventional
Gaussian noise assumptions (Fig. 3). That is, the posterior distribution favors motion
speeds and directions that minimize speed while simultaneously maximizing fit to the
observed data (leading to the simple slogan “slow and smooth”). The close fit between
human percepts and the predictions of the Bayesian model is particularly striking in that in
addition to accounting for normal motion percepts, it also systematically explains certain
illusions of motion as side-effects of rational inference.
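The shrinking effect of a “slow” prior can be seen in a two-line calculation (a schematic Python sketch with made-up numbers, not the Weiss et al. model itself): with a zero-mean Gaussian prior over speed and Gaussian measurement noise, the posterior mean speed is the measured speed shrunk towards zero, and the shrinkage is greater when the measurement is noisier (as with a low-contrast stimulus).

def posterior_speed(measured, noise_sd, prior_sd=1.0):
    # Posterior mean for a Gaussian prior centered on zero ("slow") combined with
    # a Gaussian likelihood centered on the measured speed.
    w = prior_sd ** 2 / (prior_sd ** 2 + noise_sd ** 2)   # weight given to the data
    return w * measured

print(posterior_speed(measured=2.0, noise_sd=0.2))   # reliable measurement: about 1.92
print(posterior_speed(measured=2.0, noise_sd=1.5))   # noisy measurement:    about 0.62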
The problem of perceptual organization—how to group the visual image into con-
tours, surfaces, and objects—seems at first blush quite different from color or motion
estimation, because the property we seek to estimate is not a physical parameter of the
world, but a representation of how we choose to organize it. Still, Bayesian methods can be
applied in a straightforward fashion as long as we assume that each image is potentially
subject to many grouping interpretations, but that some are more intrinsically plausible
than others (allowing us to define a prior over interpretations), and some fit the observed
image better than others (allowing us to define a likelihood function). We can then use
Bayes’ rule to infer a posterior distribution over grouping interpretations.
A simple example comes from the problem of contour integration, in which the question
is whether two visual edges belong to the same contour (H1) or different contours (H2).
Because physical contours can take on a wide variety of geometric forms, practically any
observed configuration of two edges is consistent with the hypothesis of a single common
contour. But because edges drawn from the same contour tend to be relatively collinear,
the angle between two observed edges provides some evidence about how plausible this
hypothesis is, relative to the competing hypothesis that the two edges arise from distinct
contours. This decision, repeated many times for pairs of edges throughout the image,
forms the basis for the extraction of coherent object contours from the visual image.
To formalize this as a Bayesian problem, we need priors p(H1 ) and p(H2 ) for the two
hypotheses, and likelihood functions p(α|H1 ) and p(α|H2 ) that express the probability of the
angle between the two edges (called the turning angle) conditioned on each hypothesis.
Figure 4. Two edges can be interpreted as part of the same smooth contour (hypothesis A, top) or
as two distinct contours (hypothesis B, bottom). Each hypothesis has a likelihood (right) that is a
function of the turning angle α; with p(α|A) sharply peaked at 0, but p(α|B) flat.
Several authors have modeled the same-contour likelihood function p(α|H1 ) as a normal
distribution centered on collinearity (0◦ turning angle; see Feldman, 1997; Geisler, Perry,
Super, & Gallogly, 2001). Fig. 4 illustrates the decision problem in its Bayesian formulation.
In essence, each successive pair of contour elements must be classified as either part of
the same contour or as parts of distinct contours. The likelihood of each hypothesis is
determined by the geometry of the observed configuration, with the normal likelihood
function assigning higher likelihood to element pairs that are closer to collinear. The prior
(in practice fitted to subjects’ responses) tends to favor H2 , presumably because most image
edges come from disparate objects. Bayes’ rule puts these together to determine the most
plausible grouping. Applying this simple formulation more broadly to all the image edge
pairs allows the image’s set of contour elements to be divided up into a discrete collection
of “smooth” contours—that is, contours made up of elements all of which Bayes’ rule says
belong to the same contour. The resulting parse of the image into contours agrees closely
with human judgments (Feldman, 2001). Related models have been applied to contour
completion and extrapolation as well (Singh & Fulvio, 2005).
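A minimal Python sketch of this two-hypothesis decision is given below. The normal likelihood for turning angles, the uniform alternative, and the prior favoring “different contours” follow the description above, but the specific numerical values are illustrative assumptions rather than fitted parameters.

import math

def p_same_contour(alpha_deg, sigma=20.0, prior_same=0.3):
    # H1: same contour, turning angle alpha ~ Normal(0, sigma) (in degrees).
    lik_same = math.exp(-0.5 * (alpha_deg / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    # H2: different contours, all turning angles equally likely.
    lik_diff = 1.0 / 360.0
    num = prior_same * lik_same
    return num / (num + (1 - prior_same) * lik_diff)

for alpha in [0, 20, 45, 90]:
    print(f"turning angle {alpha:3d} deg -> p(same contour) = {p_same_contour(alpha):.3f}")

Near-collinear pairs end up grouped into a common contour, while pairs with large turning angles do not.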
Other questions of perceptual organization have a similar character: how, for example, is a shape divided into
a set of component parts (Singh & Hoffman, 2001)? Each of these problems can be placed
into a Bayesian framework by assigning to each distinct alternative interpretation a prior
and a method for determining likelihood.
Each of these problems requires its own unique approach, but broadly speaking a
Bayesian framework for any problem in perceptual organization flows from a generative
model for image configurations (Feldman, Singh, & Froyen, 2012). Perceptual organization
is based on the idea that the visual image is generated by regular processes that tend to
create visual structures with varying probability, which can be used to define likelihood
functions. The challenge of Bayesian perceptual grouping is to discover psychologically
reasonable generative models of visual structure.
For example, Feldman and Singh (2006) proposed a Bayesian approach to shape rep-
resentation based on the idea that shapes are generated from axial structures (skeletons)
from which the shape contour is understood to have “grown” laterally. Each skeleton con-
sists of a hierarchically organized collection of axes, and generates a shape via a probabilistic
process that defines a probability distribution over shapes (Fig. 5). This allows a prior over
skeletons to be defined, along with a likelihood function that determines the probability of
any given contour shape conditioned on the skeleton. This in turn allows the visual system
to determine the MAP skeleton (the skeleton most likely to have generated the observed
shape) or, more broadly, a posterior distribution over skeletons. The estimated skeleton in
turn determines the perceived decomposition into parts, with each section of the contour
identified with a distinct generating axis perceived as a distinct “part.” This shape model
is certainly oversimplified relative to the myriad factors that influence real shapes, but the
basic framework can be augmented with a more elaborate generative model, and tuned to
the properties of natural shapes (Wilder, Feldman, & Singh, 2011). Because the framework
is Bayesian, the resulting representation of shape is, in the sense discussed above, optimal
given the assumptions specified in the generative model.
Discussion
This section raises several issues that often arise when Bayesian models of cognitive
processes are considered.
Figure 5. Generative model for shape from Feldman and Singh (2006), giving (a) prior over
skeletons (b) likelihood function (c) MAP skeleton, the maximum posterior skeleton for the given
shape, and (d) examples of the MAP skeleton.
Leeuwenberg & Boselie, 1988). Simplicity too can be defined in a variety of ways, which
has led to an inconclusive debate in which examples purporting to illustrate the preference
for simplicity over likelihood, or vice versa, could be dissected without clear resolution
(Hatfield & Epstein, 1985; Perkins, 1976).
More recently, Chater (1996) has argued that simplicity and likelihood are two sides
of the same coin, for several reasons that stem from Bayesian arguments. First, basic consid-
erations from information theory suggest that more likely propositions are automatically
simpler in that they can be expressed in more compact codes. Specifically, Shannon (1948)
showed that an optimal code—meaning one that has minimum expected code length—
should express each proposition A in a code of length proportional to the negative log
probability of A, i.e. − log p(A). This quantity is often referred to as the surprisal, because it
quantifies how “surprising” the message is (larger values indicate less probable outcomes),
or as the Description Length (DL), because it also quantifies how many symbols it occupies
in an optimal code (longer codes for more unusual messages). Just as in Morse code (or for
that matter approximately in English) more frequently used concepts should be assigned
shorter expressions, so that the total length of expressions is minimized on average. Be-
cause the proposition with maximum posterior probability (the MAP) also has minimum
negative log posterior probability, the MAP hypothesis is also the minimum DL (MDL)
hypothesis. More specifically, while in Bayesian inference the MAP hypothesis is the one
that maximizes the product of the prior and the likelihood p(H)p(D|H), in MDL the winning
hypothesis is the one that minimizes the sum of the DL of the model plus the DL of the data
as encoded via the model (− log p(H)−log p(D|H)), a sum of logs having replaced a product.
In this sense the simplest interpretation is necessarily also the most probable—though it
must be kept in mind that this easy identification rests on the perhaps tenuous assumption
that the underlying coding language is optimal.
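The equivalence is easy to verify numerically. In the sketch below (Python, with arbitrary illustrative priors and likelihoods), the hypothesis that maximizes the posterior is the same one that minimizes the total description length.

import math

priors      = {"H1": 0.60, "H2": 0.30, "H3": 0.10}   # illustrative priors
likelihoods = {"H1": 0.05, "H2": 0.20, "H3": 0.40}   # illustrative likelihoods p(D|H)

# MAP: maximize p(H) p(D|H).  MDL: minimize -log p(H) - log p(D|H).
map_h = max(priors, key=lambda h: priors[h] * likelihoods[h])
mdl_h = min(priors, key=lambda h: -math.log(priors[h]) - math.log(likelihoods[h]))

print(map_h, mdl_h)   # the same hypothesis (here H2) wins under both criteria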
More broadly, Bayesian inference tends to favor simple hypotheses even without
any assumptions about the optimality of the coding language.6 This tendency, some-
times called “Bayes Occam” (after Occam’s razor, a traditional term for the preference
for simplicity), reflects fundamental considerations about the way prior probability is
distributed over hypotheses (see MacKay, 2003). Assuming that the hypotheses Hi are
mutually exclusive, their total prior necessarily equals one (Σi p(Hi) = 1), meaning
simply that the observer believes that one of them must be correct. This in turn means
that models with more parameters must distribute the same total prior over a larger set of
specific models (combinations of parameter settings), inevitably requiring each specific model to be
assigned a smaller prior. That is, more highly parameterized models—models that can ex-
press a wider variety of states of nature—necessarily assign lower priors to each individual
hypothesis. Hence in this sense Bayesian inference automatically assigns lower priors to
more complex models, and higher priors to simple ones, thus enforcing a simplicity metric
without any mechanisms designed especially for the purpose.
Though the close relationship between simplicity and Bayesian inference is widely
recognized, the exact nature of the relationship is more controversial (see Feldman, 2009
and van der Helm, 2000). Bayesians regard the calculation of the Bayesian posterior
as fundamental, and the simplicity principle as merely a heuristic concept whose value
6
“The simplest law is chosen because it is most likely to give correct predictions” (Jeffreys, 1939/1961, p. 4).
derives from its correspondence to Bayes’ rule. The originators of MDL and information-
theoretic statistics (e.g. Akaike, 1974; Rissanen, 1989; Wallace, 2004) take the opposite
view, regarding the minimization of complexity (DL or related measures) as the more
fundamental principle, and some of the assumptions underlying Bayesian inference as
naive (see Burnham & Anderson, 2002; Grünwald, 2005).
Bayesian models more generally). The choice of loss function in real situations probably
depends on details of the task, and remains a subject of research.
Loss functions in naturalistic behavioral situations can be arbitrarily complex, and
it is not generally understood either how they are apprehended or how human decision
making takes them into account. Trommershauser, Maloney, and Landy (2003) explored
this problem by imposing a moderately complex loss function on their subjects in a simple
motor task; they asked their subjects to touch a target on a screen that was surrounded by
several different penalty zones structured so that misses in one direction cost more than
misses in the other direction. Their subjects were surprisingly adept at modulating their
taps so that expected loss (penalty) was minimized, implying a detailed knowledge of the
noise in their own arm motions and a quick apprehension of the geometry of the imposed
utility function (see also Trommershauser, Maloney, & Landy, 2008).
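The logic of such experiments can be sketched with a simple expected-gain calculation (Python; the target geometry, point values, and motor noise below are illustrative assumptions, not the actual stimuli): given Gaussian noise on the touch point, the rational aim point shifts away from the penalty region even though that lowers the chance of hitting the target dead center.

import math

noise_sd = 0.8                       # assumed motor noise (sd of the touch point)

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_gain(aim):
    # Target: |x| < 1 gains 5 points.  Penalty zone: -3 < x < -1 costs 20 points.
    p_target  = Phi((1.0 - aim) / noise_sd) - Phi((-1.0 - aim) / noise_sd)
    p_penalty = Phi((-1.0 - aim) / noise_sd) - Phi((-3.0 - aim) / noise_sd)
    return 5.0 * p_target - 20.0 * p_penalty

aims = [i / 20.0 for i in range(-10, 31)]        # candidate aim points from -0.5 to 1.5
best = max(aims, key=expected_gain)
print(f"expected-gain-maximizing aim point: {best:+.2f}")   # shifted away from the penalty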
One common recommendation is to choose a prior that is as “neutral” as possible; this allows the data (via the likelihood) to be the primary influence on
posterior belief. Exactly how to choose an uninformative prior can, however, be prob-
lematic. For example, to estimate the success probability of a binomial process, like the
probability of heads in a coin toss, it is tempting to adopt a uniform prior over success
probability (i.e. equal over the range 0 to 100%).7 But mathematical arguments suggest
that a truly uninformative prior should be relatively peaked at 0 and 100% (the beta(0,0)
distribution, sometimes called the Haldane prior; see Lee, 2004). But recall that (as il-
lustrated above), in many situations as data accumulates, the likelihood eventually tends
to dominate the posterior. Hence while the source of the prior may be philosophically
controversial, in many real situations the actual choice is moot.
More specifically, certain types of simple priors occur over and over again in Bayesian
accounts. When a particular parameter x is believed to fall around some value µ, but with
some uncertainty that is approximately symmetric about µ, Bayesians routinely assume
a Gaussian (normal) prior distribution for x, i.e. p(x) ∝ N(µ, σ²). Again, this is simply a
formal way of expressing what is known about the value of x (that it falls somewhere near
µ) in as neutral a manner as possible (technically, this is the maximum entropy prior with
mean µ and variance σ2 ). Gaussian error is often a reasonable assumption because random
variations from independent sources, when summed, tend to yield a normal distribution
(the so-called central limit theorem).8 But it should be kept in mind that an assumption
of normal error along x does not entail an affirmative assertion that repeated samples of
x would be normally distributed—indeed in many situations (such as where x is a fixed
quantity of the world, like a physical constant) this interpretation does not even make
sense. Such simple assumptions work surprisingly well in practice and are often the basis
for robust inference.
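A small simulation (Python, with arbitrary illustrative choices) shows why the central limit theorem makes the Gaussian such a common default: sums of independent, non-normal random components quickly come to look normal.

import random
import statistics

random.seed(0)

# Each observation is the sum of 12 independent uniform components: no individual
# component is normal, but their sum is approximately normal (central limit theorem).
samples = [sum(random.uniform(-1.0, 1.0) for _ in range(12)) for _ in range(100000)]

mean = statistics.mean(samples)
sd = statistics.stdev(samples)
within_1sd = sum(abs(s - mean) < sd for s in samples) / len(samples)

print(f"mean = {mean:.3f}, sd = {sd:.3f}")
print(f"fraction within one sd of the mean: {within_1sd:.3f}   (normal: about 0.683)")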
Another common assumption is that priors for different parameters that have no
obvious relationship are independent (that is, knowing the value of one conveys no infor-
mation about the value of the other). Bayesian models that assume independence among
parameters whose relationship is unknown are sometimes called naive Bayesian models.
Again, an assumption of independence does not reflect an affirmative empirical assertion
about the real-world relationship between the parameters, but rather an expression of
ignorance about their relationship.
Another fundamental problem for Bayesian inference is the source of the hypotheses.
Bayesian theory provides a method for quantifying belief in each hypothesis, but it does
not provide the class of hypotheses themselves, nor any principled way to generate them.
Traditional Bayesians are generally content to assume that some member of the hypothesis
set lies sufficiently “close” to the truth, meaning that it approximates reality within some
acceptable margin of error. However such assumptions are occasionally criticized as naive
7
Bayes himself suggested this prior, which is now sometimes called Bayes’ postulate. But he was apparently
uncertain of its validity, and his hesitation may have contributed to his reluctance to publish his Essay, which
was published posthumously (see Stigler, 1983).
8
More technically, the central limit theorem says that the sum of random variables with finite variances
tends towards normality in the limit. In practice this means that if x is really the sum of a number of component
variables, each of which is random though not necessarily normal itself, then x tends to be normally distributed.
Conclusions
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic
Control, 19(6), 716–723.
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil. Trans. of the
Royal Soc. of London, 53, 370–418.
Brainard, D. H., & Freeman, W. T. (1997). Bayesian color constancy. Journal of the Optical Society of
America A, 14, 1393–1411.
Brainard, D. H., Longere, P., Delahunt, P. B., Freeman, W. T., Kraft, J. M., & Xiao, B. (2006). Bayesian
model of human color constancy. J Vis, 6(11), 1267–1281.
Burge, J., Fowlkes, C. C., & Banks, M. S. (2010). Natural-scene statistics predict how the figure-
ground cue of convexity affects human depth perception. J. Neurosci., 30(21), 7269–7280.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multi-model inference: A practical
information-theoretic approach. New York: Springer.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in
model selection. Sociological Methods & Research, 33(2), 261–304.
Chater, N. (1996). Reconciling simplicity and likelihood principles in perceptual organization.
Psychological Review, 103(3), 566–581.
Compton, B. J., & Logan, G. D. (1993). Evaluating a computational model of perceptual grouping by
proximity. Perception & Psychophysics, 53(4), 403–421.
Cox, R. T. (1961). The algebra of probable inference. London: Oxford University Press.
de Finetti, B. (1970/1974). Theory of probability. Torino: Giulio Einaudi. (Translation 1990 by A.
Machi and A. Smith, John Wiley and Sons)
Earman, J. (1992). Bayes or bust? : a critical examination of bayesian confirmation theory. MIT Press.
Feldman, J. (1997). Curvilinearity, covariance, and regularity in perceptual groups. Vision Research,
37(20), 2835–2848.
Feldman, J. (2001). Bayesian contour integration. Perception & Psychophysics, 63(7), 1171–1182.
Feldman, J. (2009). Bayes and the simplicity principle in perception. Psychological Review, 116(4),
875–887.
Feldman, J. (in press). Tuning your priors to the world. Topics in Cognitive Science.
Feldman, J., & Singh, M. (2006). Bayesian estimation of the shape skeleton. Proceedings of the National
Academy of Science, 103(47), 18014–18019.
Feldman, J., Singh, M., & Froyen, V. (2012). Perceptual grouping as Bayesian mixture estimation.
(Forthcoming.)
Geisler, W. S., & Diehl, R. L. (2002). Bayesian natural selection and the evolution of perceptual
systems. Philosophical Transactions of the Royal Society of London B, 357, 419–448.
Geisler, W. S., Perry, J. S., Super, B. J., & Gallogly, D. P. (2001). Edge co-occurrence in natural images
predicts contour grouping performance. Vision Research, 41, 711–724.
Gregory, R. (2006). Editorial essay. Perception, 35, 143–144.
Griffiths, T. L., & Yuille, A. L. (2006). A primer on probabilistic inference. Trends in Cognitive Sciences,
10(7).
Grünwald, P. D. (2005). A tutorial introduction to the minimum description length principle. In
P. D. Grünwald, I. J. Myung, & M. Pitt (Eds.), Advances in minimum description length: Theory
and applications. Cambridge, MA: MIT press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining,
inference, and prediction. New York: Springer.
Hatfield, G., & Epstein, W. (1985). The status of the minimum principle in the theoretical analysis
of visual perception. Psychological Bulletin, 97(2), 155–186.
Hochberg, J., & McAlister, E. (1953). A quantitative approach to figural “goodness”. Journal of
Experimental Psychology, 46, 361–364.
Hoffman, D. D. (2009). The user-interface theory of perception: Natural selection drives true
perception to swift extinction. In S. Dickinson, M. Tarr, A. Leonardis, & B. Schiele (Eds.), Object
categorization: Computer and human vision perspectives. Cambridge: Cambridge University
Press.
Hoffman, D. D., & Singh, M. (in press). Computational evolutionary perception. Perception.
Howie, D. (2004). Interpreting probability: Controversies and developments in the early twentieth century.
Cambridge: Cambridge University Press.
Jaynes, E. T. (1982). On the rationale of maximum-entropy methods. Proceedings of the I.E.E.E., 70(9),
939–952.
Jaynes, E. T. (2003). Probability theory: the logic of science. Cambridge: Cambridge University Press.
Jeffreys, H. (1939/1961). Theory of probability (third edition). Oxford: Clarendon Press.
Jones, M., & Love, B. C. (2011). Bayesian fundamentalism or enlightenment? On the explanatory
status and theoretical contributions of Bayesian models of cognition. Behavioral and Brain
Sciences, 34, 169–188.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90,
773–795.
Kersten, D., Mamassian, P., & Yuille, A. (2004). Object perception as Bayesian inference. Annual
Review of Psychology, 55, 271–304.
Knill, D. C., Kersten, D., & Yuille, A. (1996). Introduction: a Bayesian formulation of visual
perception. In D. C. Knill & W. Richards (Eds.), Perception as Bayesian inference (pp. 123–162).
Cambridge: Cambridge University Press.
Knill, D. C., & Richards, W. (Eds.). (1996). Perception as Bayesian inference. Cambridge: Cambridge
University Press.
Kuhn, T. S. (1962). The structure of scientific revolutions. U. Chicago Press.
Lee, P. (2004). Bayesian statistics: an introduction (3rd ed.). Wiley.
Leeuwenberg, E. L. J., & Boselie, F. (1988). Against the likelihood principle in visual form perception.
Psychological Review, 95, 485–491.
MacKay, D. J. C. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge
University Press.
Maloney, L. T. (2002). Statistical decision theory and biological vision. In D. Heyer & R. Mausfeld
(Eds.), Perception and the physical world: Psychological and philosophical issues in perception (pp.
145–189). New York: Wiley.
McClelland, J. L., Botvinick, M. M., Noelle, D. C., Plaut, D. C., Rogers, T., Seidenberg, M. S., et
al. (2010). Letting structure emerge: Connectionist and dynamical systems approaches to
understanding cognition. Trends Cogn. Sci., 14, 348–356.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference. San Mateo,
CA: Morgan Kauffman.
Perkins, D. (1976). How good a bet is good form? Perception, 5, 393–406.
Pylyshyn, Z. (1999). Is vision continuous with cognition? The case for cognitive impenetrability of
visual perception. Behav Brain Sci, 22(3), 341–365.
Ramachandran, V. S. (1985). The neurobiology of perception. Perception, 14, 97–103.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Rumelhart, D. E., McClelland, J. L., & Hinton, G. E. (1986). Parallel distributed processing: explorations
in the microstructure of cognition. Cambridge, Massachusetts: MIT Press.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Shannon, C. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27,
379–423.
Singh, M., & Fulvio, J. M. (2005). Visual extrapolation of contour geometry. PNAS, 102(3), 939–944.
Singh, M., & Hoffman, D. D. (2001). Part-based representations of visual shape and implications
for visual cognition. In T. Shipley & P. Kellman (Eds.), From fragments to objects: segmentation
and grouping in vision, advances in psychology, vol. 130 (pp. 401–459). New York: Elsevier.
Stigler, S. M. (1983). Who discovered Bayes’s theorem? The American Statistician, 37(4), 290–296.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Harvard
University Press.
Trommershauser, J., Maloney, L. T., & Landy, M. S. (2003). Statistical decision theory and the selection
of rapid, goal-directed movements. J Opt Soc Am A Opt Image Sci Vis, 20(7), 1419–1433.
Trommershauser, J., Maloney, L. T., & Landy, M. S. (2008). Decision making, movement planning
and statistical decision theory. Trends Cogn. Sci., 12(8), 291–297.
van der Helm, P. (2000). Simplicity versus likelihood in visual perception: From surprisals to
precisals. Psychological Bulletin, 126(5), 770–800.
Wallace, C. S. (2004). Statistical and inductive inference by minimum message length. Springer.
Weiss, Y., Simoncelli, E. P., & Adelson, E. H. (2002). Motion illusions as optimal percepts. Nat.
Neurosci., 5(6), 598–604.
Wilder, J., Feldman, J., & Singh, M. (2011). Superordinate shape classification using natural shape
statistics. Cognition, 119, 325–340.
Yuille, A. L., & Bülthoff, H. H. (1996). Bayesian decision theory and psychophysics. In D. C. Knill
& W. Richards (Eds.), Perception as Bayesian inference (pp. 123–162). Cambridge: Cambridge
University Press.
Zucker, S. W., Stevens, K. A., & Sander, P. (1983). The relation between proximity and brightness
similarity in dot patterns. Perception and Psychophysics, 34(6), 513–522.