Bayesian Programming
Pierre Bessière
Juan-Manuel Ahuactzin
Kamel Mekhnacha
Emmanuel Mazer
Contact:
Pierre.Bessiere@imag.fr
Juan-Manuel.Ahuactzin@inrialpes.fr
To the late Edward T. Jaynes
for his doubts about certitudes
and for his certitudes about probabilities
Table of Contents
1 Introduction 11
1.1. Probability: an alternative to logic 11
1.2. A need for a new computing paradigm 15
1.3. A need for a new modeling methodology 15
1.4. A need for new inference algorithms 19
1.5. A need for a new programming language and new hardware 21
1.6. A place for numerous controversies 22
2 Basic Concepts 25
2.1. Variable 26
2.2. Probability 26
2.3. The normalization postulate 26
2.4. Conditional probability 27
2.5. Variable conjunction 28
2.6. The conjunction postulate (Bayes theorem) 28
2.7. Syllogisms 29
2.8. The marginalization rule 30
2.9. Joint distribution and questions 31
2.10. Decomposition 33
2.11. Parametric forms 34
2.12. Identification 35
2.13. Specification = Variables + Decomposition + Parametric Forms 36
2.14. Description = Specification + Identification 36
2.15. Question 36
2.16. Bayesian program = Description + Question 38
2.17. Results 39
3 Incompleteness and Uncertainty 45
3.1. The "beam in the bin" investigation 45
3.2. Observing a water treatment unit 48
3.2.1. The elementary water treatment unit 49
3.2.2. Experimentation and uncertainty 53
3.3. Lessons, comments and notes 56
3.3.1. The effect of incompleteness 56
3.3.2. The effect of inaccuracy 56
3.3.3. Not taking into account the effect of ignored variables may lead to wrong decisions 57
3.3.4. From incompleteness to uncertainty 58
4 Description = Specification + Identification 61
4.1. Pushing objects and following contours 62
4.1.1. The Khepera robot 62
4.1.2. Pushing objects 64
4.1.3. Following contours 68
4.2. Description of a water treatment unit 71
4.2.1. Specification 71
4.2.2. Identification 74
4.2.3. Results 75
4.3. Lessons, comments and notes 75
4.3.1. Description = Specification + Identification 75
4.3.2. Specification = Variables + Decomposition + Forms 76
4.3.3. Learning is a means to transform incompleteness into uncertainty 77
5 The Importance of Conditional Independence 79
5.1. Water treatment center Bayesian model (continuation) 79
5.2. Description of the water treatment center 80
5.2.1. Specification 81
5.2.2. Identification 84
5.3. Lessons, comments and notes 85
5.3.1. Independence versus conditional independence 85
5.3.2. The importance of conditional independence 87
6 Bayesian Program = Description + Question 89
6.1. Water treatment center Bayesian model (end) 90
6.2. Forward simulation of a single unit 90
6.2.1. Question 90
6.2.2. Results 93
6.3. Forward simulation of the water treatment center 93
6.3.1. Question 93
6.3.2. Results 96
6.4. Control of the water treatment center 97
6.4.1. Question (1) 97
6.4.2. Result (1) 97
6.4.3. Question (2) 98
6.4.4. Result (2) 99
6.5. Diagnosis 101
6.5.1. Question 101
6.5.2. Results 102
6.6. Lessons, comments and notes 104
6.6.1. Bayesian Program = Description + Question 104
6.6.2. The essence of Bayesian inference 105
6.6.3. No inverse or direct problem 106
6.6.4. No ill-posed problem 107
7 Information fusion and inverse programming 109
7.1. Fusion of information in ADAS systems 110
7.1.1. Statement of the problem 110
7.1.2. Bayesian Program 110
7.1.3. Results 110
7.2. Programming and training video games avatars 110
7.2.1. Statement of the problem 110
7.2.2. Bayesian Program 111
7.2.3. Results 111
7.3. Lessons, comments and notes 111
7.3.1. Information fusion 111
7.3.2. Coherence fusion 111
7.3.3. Inverse programming 111
8 Calling Bayesian Subroutines 113
8.1. Example 1 114
8.1.1. Statement of the problem 114
8.1.2. Bayesian Program 114
8.1.3. Results 114
8.2. Evaluation of operational risk 114
8.2.1. Statement of the problem 114
8.2.2. Bayesian Program 114
8.2.3. Results 114
8.3. Lessons, comments and notes 114
8.3.1. Calling subroutines 114
8.3.2. Hierarchies of description 114
9 Bayesian program mixture 117
9.1. Homing Behavior 118
9.1.1. Statement of the problem 118
9.1.2. Bayesian Program 118
9.1.3. Results 118
9.2. Heating forecast 118
9.2.1. Statement of the problem 118
9.2.2. Bayesian Program 118
9.2.3. Results 118
9.3. Lessons, comments and notes 118
9.3.1. Bayesian program combination 118
9.3.2. A probabilistic "if-then-else" 118
10 Bayesian filters 121
10.1. Markov localization 122
10.1.1. Statement of the problem 122
10.1.2. Bayesian Program 122
10.1.3. Results 122
10.2. ??? 122
10.2.1. Statement of the problem 122
10.2.2. Bayesian Program 122
10.2.3. Results 122
10.3. Lessons, comments and notes 122
10.3.1. $$$ 122
11 Using functions 123
11.1. ADD dice 124
11.1.1. Statement of the problem 124
11.1.2. Bayesian Program 124
11.1.3. Results 124
11.2. CAD system 124
11.2.1. Statement of the problem 124
11.2.2. Bayesian Program 124
11.2.3. Results 124
11.3. Lessons, comments and notes 124
12 Bayesian Programming Formalism 125
12.1. How simple! How subtle! 125
12.2. Logical propositions 126
12.3. Probability of a proposition 126
12.4. Normalization and conjunction postulates 126
12.5. Disjunction rule for propositions 127
12.6. Discrete variables 127
12.7. Variable conjunction 128
12.8. Probability on variables 128
12.9. Conjunction rule for variables 128
12.10. Normalization rule for variables 129
12.11. Marginalization rule 129
12.12. Bayesian program 130
12.13. Description 130
12.14. Specification 131
12.15. Questions 132
12.16. Inference 132
13 Bayesian Models Revisited 135
13.1. General purpose probabilistic models 136
13.1.1. Graphical models and Bayesian networks 136
13.1.2. Recursive Bayesian estimation: Bayesian filters, Hidden Markov Models, Kalman filters and particle filters 139
13.1.3. Mixture models 144
13.1.4. Maximum entropy approaches 147
13.2. Problem-oriented probabilistic models 149
13.2.1. Sensor fusion 149
13.2.2. Classification 151
13.2.3. Pattern recognition 152
13.2.4. Sequence recognition 152
13.2.5. Markov localization 153
13.2.6. Markov decision processes 154
13.2.7. Bayesian models in life science 156
13.3. Summary 157
14 Bayesian Inference Algorithms Revisited 159
14.1. Stating the problem 159
14.2. Symbolic computation 162
14.2.1. Exact symbolic computation 162
14.2.2. Approximate symbolic computation 176
14.3. Numerical computation 177
14.3.1. Searching the modes in high-dimensional spaces 177
14.3.2. Marginalization (integration) in high-dimensional spaces 183
15 Bayesian Learning Revisited 189
15.1. Problematic 189
15.1.1. How to identify (learn) the value of the free parameters? 190
15.1.2. How to compare different probabilistic models (specifications)? 192
15.1.3. How to find interesting decompositions and associated parametric forms? 193
15.1.4. How to find the pertinent variables to model a phenomenon? 194
15.2. Expectation - Maximization (EM) 194
15.2.1. EM and Bayesian networks 195
15.2.2. EM and Mixture Models 195
15.2.3. EM and HMM: The Baum-Welch Algorithm 195
15.3. Problem Oriented Models 197
15.4. Learning Structure of Bayesian Networks 198
15.5. Bayesian Evolution? 199
16 Frequently Asked Questions and Frequently Argued Matters 201
16.1. APPLICATIONS OF BAYESIAN PROGRAMMING (WHAT ARE?) 201
16.2. BAYES, THOMAS (WHO IS?) 202
16.3. BAYESIAN DECISION THEORY (WHAT IS?) 202
16.4. BIAS VERSUS VARIANCE DILEMMA 202
16.5. Computation complexity of Bayesian Inference 202
16.6. Cox theorem (What is?) 202
16.7. DECOMPOSITION 202
16.8. DESCRIPTION 202
16.9. DISJUNCTION RULE AS AN AXIOM (WHY DON'T YOU TAKE?) 202
16.10. DRAW VERSUS BEST 202
16.11. FORMS 202
16.12. FREQUENTIST VERSUS NON-FREQUENTIST 202
16.13. FUZZY LOGIC VERSUS BAYESIAN INFERENCE 202
16.14. HUMAN (ARE THEY BAYESIAN?) 202
16.15. IDENTIFICATION 202
16.16. Incompleteness irreducibility 202
16.17. JAYNES, ED. T. (WHO IS?) 203
16.18. KOLMOGOROV (WHO IS?) 203
16.19. KOLMOGOROV'S AXIOMATIC (WHY DON'T WE NEED?) 203
16.20. LAPLACE, MARQUIS SIMON DE (WHO IS?) 203
16.21. LAPLACE'S SUCCESSION LAW CONTROVERSY 203
16.22. Maximum entropy principle justifications 203
16.23. MIND PROJECTION FALLACY (WHAT IS?) 203
16.24. Noise or ignorance? 203
16.25. PERFECT DICE (WHAT IS?) 203
16.26. PHYSICAL CAUSALITY VERSUS LOGICAL CAUSALITY 203
16.27. PROSCRIPTIVE PROGRAMMING 203
16.28. SPECIFICATION 203
16.29. Subjectivism vs objectivism controversy 203
16.30. VARIABLE 203
17 Bibliography 205
1 Introduction
The most incomprehensible thing about the world is that it is comprehen-
sible
Albert Einstein
1.1 Probability: an alternative to logic
Computers have brought a new dimension to modeling. A model, once translated into a
program and run on a computer, may be used to understand, measure, simulate, mimic,
optimize, predict, and control. During the last fifty years science, industry, finance, med-
icine, entertainment, transport, and communication have been completely transformed by
this revolution.
However, models and programs suffer from a fundamental flaw: incompleteness. Any
model of a real phenomenon is incomplete. Hidden variables, not taken into account in
the model, influence the phenomenon. The effect of the hidden variables is that the model
and the phenomenon never have the exact same behaviors. Uncertainty is the direct and
unavoidable consequence of incompleteness. A model may not foresee exactly the future
observations of a phenomenon, as these observations are biased by the hidden variables,
and may not predict the consequences of its decisions exactly.
Computing a cost price to decide on a sell price may seem a purely arithmetic opera-
tion consisting of adding elementary costs. However, often these elementary costs may
not be known exactly. For instance, a part's cost may be biased by exchange rates, pro-
duction cost may be biased by the number of orders and transportation costs may be
biased by the period of the year. Exchange rates, the number of orders, and the period of
the year when unknown, are hidden variables, which induce uncertainty in the computa-
tion of the cost price.
Analyzing the content of an email to filter spam is a difficult task, because no word or
combination of words can give you an absolute certitude about the nature of the email. At
most, the presence of certain words is a strong clue that an email is spam. It may never
be a conclusive proof, because the context may completely change its meaning. For
instance, if one of your friends is forwarding you a spam for discussion about the spam
phenomenon, its whole content is suddenly not spam any longer. A linguistic model of
spam is irremediably incomplete because of this boundless contextual information. Fil-
tering spam is not hopeless and some very efficient solutions exist, but the perfect result
is a chimera.
Machine control and dysfunction diagnosis are very important to industry. However,
the dream of building a complete model of a machine and all its possible failures is an
illusion. One should recall the first "bug" of the computer era: the moth located in relay
70, panel F, of the Harvard Mark II computer. Once again, this does not mean that control
and diagnosis are hopeless; it only means that models of these machines should take into
account their own incompleteness and the resulting uncertainty.
In 1781, Sir William Herschel discovered Uranus, the seventh planet of the solar sys-
tem. In 1846, Johann Galle observed Neptune, the eighth planet, for the first time. In the
meantime, both Urbain Leverrier, a French astronomer, and John Adams, an English one,
became interested in the "uncertain" trajectory of Uranus. The planet was not following
exactly the trajectory that Newton's theory of gravitation predicted. They both came to
the conclusion that these irregularities should be the result of a hidden variable not taken
into account by the model: the existence of an eighth planet. They even went much fur-
ther, finding the most probable position of this eighth planet. The Berlin observatory
received Leverrier's prediction on September 23, 1846 and Galle observed Neptune the
very same day!
Logic is both the mathematical foundation of rational reasoning and the fundamental
principle of present day computing. However, logic, by essence, is restricted to problems
where information is both complete and certain. An alternative mathematical framework
and an alternative computing framework are both needed to deal with incompleteness and
uncertainty.
Probability theory is this alternative mathematical framework. It is a model of rational
reasoning in the presence of incompleteness and uncertainty. It is an extension of logic
where both certain and uncertain information have their places.
James C. Maxwell stated this point concisely:
The actual science of logic is conversant at present only with things either
certain, impossible, or entirely doubtful, none of which (fortunately) we
have to reason on. Therefore the true logic for this world is the calculus of
Probabilities, which takes account of the magnitude of the probability
which is, or ought to be, in a reasonable man's mind.
James C. Maxwell, quoted in Probability Theory: The Logic of Science
by Edward T. Jaynes (Jaynes, 2003)
Considering probability as a model of reasoning is called the subjectivist or Bayesian
approach. It is opposed to the objectivist approach, which considers probability as a model
of the world. This opposition is not only an epistemological controversy; it has many
fundamental and practical consequences¹.
To model reasoning, you must take into account the preliminary knowledge of the sub-
ject who is doing the reasoning. This preliminary knowledge plays the same role as the
axioms in logic. Starting from different preliminary knowledge may lead to different con-
clusions. Starting from wrong preliminary knowledge will lead to wrong conclusions
even with perfectly correct reasoning. Reaching wrong conclusions following correct
reasoning proves that the preliminary knowledge was wrong, offers the opportunity to
correct it and eventually leads you to learning. Incompleteness is simply the irreducible
gap between the preliminary knowledge and the phenomenon and uncertainty is a direct
1. See 'Subjectivism vs objectivism controversy', page 203
and measurable consequence of this imperfection.
In contrast, modeling the world by denying the existence of a "subject" and conse-
quently rejecting preliminary knowledge leads to complicated situations and apparent
paradoxes. This rejection implies that if the conclusions are wrong, either the reasoning
could be wrong or the data could be aberrant, leaving no room for improvement or learn-
ing. Incompleteness does not mean anything without preliminary knowledge, and uncertainty
and noise must then be treated as mysterious properties of the physical world.
The objectivist school has been dominant during the 20th century, but the subjectivist
approach has a history as long as probability itself. It can be traced back to Jakob Ber-
noulli in 1713:
Uncertainty is not in things but in our head: uncertainty is a lack of
knowledge.
Jakob Bernoulli, Ars Conjectandi (Bernoulli, 1713);
to the Marquis Simon de Laplace¹, one century later, in 1812:
Probability theory is nothing but common sense reduced to calculation.
Simon de Laplace, Théorie Analytique des Probabilités (Laplace, 1812)
to the already quoted James C. Maxwell in 1850, and to the visionary Henri Poincaré
in 1902:
Randomness is just the measure of our ignorance.
To undertake any probability calculation, and even for this calculation to
have a meaning, we have to admit, as a starting point, an hypothesis or a
convention, that always comprises a certain amount of arbitrariness. In
the choice of this convention, we can be guided only by the principle of
sufficient reason.
From this point of view, every science would just be an unconscious application
of the calculus of probabilities. Condemning this calculus would
1. See 'LAPLACE, MARQUIS SIMON DE (WHO IS?)', page 203
be condemning the whole of science.
Henri Poincaré, La Science et l'Hypothèse (Poincaré, 1902)
and finally, by Edward T. Jaynes¹ in his book Probability Theory: The Logic of Science
(Jaynes, 2003), where he brilliantly presents the subjectivist alternative and sets out clearly
and simply the basis of the approach:
By inference we mean simply: deductive reasoning whenever enough
information is at hand to permit it; inductive or probabilistic reasoning
when - as is almost invariably the case in real problems - all the necessary
information is not available. Thus the topic of "Probability as Logic" is
the optimal processing of uncertain and incomplete knowledge.
Edward T. Jaynes, Probability Theory: The Logic of Science (Jaynes,
2003)
1.2 A need for a new computing paradigm
Bayesian probability theory is clearly the sought mathematical alternative to logic².
However, we want working solutions to incomplete and uncertain problems. Conse-
quently, we require an alternative computing framework based on Bayesian probabilities.
To create such a complete Bayesian computing framework, we require a new modeling
methodology to build probabilistic models, we require new inference algorithms to auto-
mate probabilistic calculus, we require new programming languages to implement these
models on computers, and finally, we will eventually require new hardware to run these
Bayesian programs efficiently.
The ultimate goal is a Bayesian computer. The purpose of this book is to describe the
current first steps in this direction.
1.3 A need for a new modeling methodology
The existence of a systematic and generic method to build models is a sine qua non
requirement for the success of a modeling and computing paradigm. This is why algo-
1. See 'JAYNES, ED. T. (WHO IS?)', page 203
2. See 'Cox theorem (What is?)', page 202
rithms are taught in basic computer science courses, giving students the basic and
necessary methods to develop classical programs. Such a systematic and generic method
exists within the Bayesian framework. Moreover, this method is very simple even if it is
atypical and a bit disconcerting at the beginning.
The purpose of Chapters 2 to 11 is to present this new modeling methodology. The
presentation is intended for the general public and does not assume any prerequisites
other than a basic foundation in mathematics. Its purpose is to introduce the fundamental
concepts, to present the novelty and interest of the approach, and to initiate the reader into
the subtle art of Bayesian modeling. Numerous simple examples of applications are pre-
sented in different fields such as $$$ medicine, robotics, finance, and process control.
Chapter 2 gently introduces the basic concepts of Bayesian Programming. We start with a
simple example of a Bayesian spam filter that helps you dispose of junk emails. Commer-
cially available software is based on a similar approach.
The problem is very easy to formulate. We want to classify texts (emails) into one of two
categories: either "spam" or "to consider". The only information we have to classify the
emails is their content: a set of words.
The classifier should furthermore be able to adapt to its user and to learn from experi-
ence. Starting from an initial standard setting, the classifier should modify its internal
parameters when the choice of the user does not correspond to its own decision. It will
hence adapt to the user's criteria for choosing between "spam" and "not-spam". It will improve
its results as it analyzes more and more classified emails.
The goal of Chapter 3 is to explore thoroughly the concept of incompleteness with a very
simple physical experiment called the "beam in the bin" experiment. We demonstrate
how incompleteness is the source of uncertainty by using a second, more elaborate exper-
iment involving a model of a water treatment center.
Two sources of uncertainty are shown: the existence of hidden variables and the inac-
curacy of measures. The effects of both are quantified. We also demonstrate that ignoring
incompleteness may lead to certain but definitely wrong decisions. Finally, we explain
how learning transforms incompleteness into uncertainty and how probabilistic inference
leads to informed decisions despite this uncertainty.
In Chapter 4 we present in some detail the fundamental notion of description. A descrip-
tion is the probabilistic model of a given phenomenon. It is obtained after two phases of
development: first, a specification phase where the programmer expresses his own knowl-
edge about the modeled phenomenon in probabilistic terms; and second, an identification
phase where this starting probabilistic canvas is refined by learning from data. Descrip-
tions are the basic elements that are used, combined, composed, manipulated, computed,
compiled, and questioned in different ways to build Bayesian programs.
The specification itself is obtained in three phases. First, programmers must choose
the pertinent variables. Second, in a decomposition phase they must express the joint prob-
ability on the selected variables as a product of simpler distributions. Finally, they must
choose a parametric form for each of these distributions.
These three phases are depicted by: (i) a robotic example where a small mobile robot
is taught how to push, avoid, and circle small objects and (ii) a continuation of the water
treatment center instance where the corresponding description is illustrated.
Two "equations" may summarize the content of this chapter: "Description = Specifi-
cation + Identification" and "Specification = Variables + Decomposition + Forms".
In Chapter 5 the notions of independence and conditional independence are introduced.
We demonstrate the importance of conditional independence in actually solving and com-
puting complex probabilistic problems.
The water treatment center instance is further developed to exemplify these central
concepts.
The water treatment center instance is completed in Chapter 6. The probabilistic descrip-
tion of this process (built in Chapters 4 and 5) is used to solve different problems: predic-
tion of the output, choice of the best control strategy, and diagnosis of failures. This
shows that multiple questions may be asked of the same description to solve very differ-
ent problems. This clear separation between the model and its use is a very important fea-
ture of Bayesian Programming.
This chapter completes the introduction and definition of a Bayesian program, which
is made of both a description and a question: "Bayesian Program = Description + Ques-
tion".
The essence of the critical computational difficulties of Bayesian inference is pre-
sented but, per contra, we explain that in Bayesian modeling there are neither "ill posed
problems" nor opposition between "direct" and "inverse" problems. Within the Bayesian
framework, any inverse problem has an evident theoretical solution, but this solution may
be very costly to compute.
Chapters 2 to 6 present the concept of the Bayesian program. Chapters 7 to 11 are used to
show how to combine elementary Bayesian programs to build more complex ones. Some
analogies are stressed between this probabilistic mechanism and the corresponding algo-
rithmic ones, for instance the use of subroutines or conditional and case operators.
For instance, Chapter 8 shows how Bayesian subroutines can be called within Bayesian
programs. As in classical programming, subroutines are a major means to build com-
plex models from simpler ones, using them as elementary bricks. We can create a hierar-
chy of probabilistic descriptions resulting from either a top-down analysis or a bottom-up
elaboration.
A financial analysis example is used to exemplify this concept. $$$more to come
when Chapter 7 will be written$$$.
Chapter 9 introduces the Bayesian program combination. Using both a simple robotic
example and a more elaborate natural risk analysis problem we show how to combine dif-
ferent Bayesian programs with the help of a choice variable. If known with certainty, this
choice variable would act as a switch between the different models. In that case this
model acts as an "if-then-else" or a case statement. If the choice variable is not known
with certainty, we then obtain the probabilistic equivalent of these conditional construc-
tors.
In Chapter 7 we investigate how to sequence Bayesian programs. Inverse programming is
proposed as a potentially efficient solution. It consists of expressing independently the
conditional probabilities of the conditions knowing the action. Even in atypical cases,
this modeling method appears to be convenient and generic. Furthermore it leads to very
easy learning schemes.
The inverse programming concept is exemplified with a video game application. We
address the problem of real-time reactive selection of elementary behaviors for an agent
playing a first person shooter game. We show how Bayesian Programming leads to a
more condensed and easier formalization than a finite state machine. We also demon-
strate that using this technique it is easy to implement learning by imitation in a fully
transparent way for the player.
Chapter 10 approaches the topic of Bayesian program iteration. Recursive Bayesian esti-
mation and more specifically Bayesian filtering are presented as the main tools to imple-
ment iteration.
An Advanced Driver Assistance System (ADAS) application is presented as an exam-
ple. The goal of ADAS systems is largely to automate driving by assisting and eventually
replacing the human driver in some tasks.
Chapter 11 explains how to combine probabilistic inference and algebraic operations. A
very simple example of computing a cost price for container chartering is presented. The
global cost price is the sum of sub costs, some of which can only be known approxi-
mately. We show how the probability distribution on the global cost may be derived from
the different probability distributions on the subcosts.
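To give a rough feeling for what such a computation looks like, here is a minimal Monte Carlo sketch in plain Python; it is not the Bayesian treatment developed in Chapter 11, and every subcost distribution below is invented purely for illustration:

```python
import random

# Hypothetical subcost models (illustrative only): each returns one sampled value.
def sample_transport():   # uncertain: depends, e.g., on the period of the year
    return random.gauss(1200.0, 150.0)

def sample_handling():    # only roughly known
    return random.uniform(300.0, 350.0)

def sample_insurance():   # known exactly
    return 80.0

# Propagate uncertainty: the distribution on the global cost is induced by
# the distributions on the subcosts.
samples = [sample_transport() + sample_handling() + sample_insurance()
           for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"global cost: mean ~ {mean:.0f}, std ~ {var ** 0.5:.0f}")
```

Sampling the subcosts and adding them propagates their uncertainty to the global cost; the chapter answers the same question with exact probabilistic inference instead of sampling.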
Bayesian Programming may be intimately combined with classical programming. It is
possible to incorporate Bayesian computation within classical programs. Moreover, it is
also possible to use classical programming within Bayesian programs. Indeed, there are
different ways to call functions inside Bayesian programs, and these are presented in
Chapter 11.
A Computer Aided Design (CAD) application is used to demonstrate this functionality.
1.4 A need for new inference algorithms
A modeling methodology is not sufficient to run Bayesian programs. We also require an
efficient Bayesian inference engine to automate the probabilistic calculus. This assumes
we have a collection of inference algorithms adapted and tuned to more or less specific
models and a software architecture to combine them in a coherent and unique tool.
Numerous such Bayesian inference algorithms have been proposed in the literature.
The purpose of this book is not to present these different computing techniques and their asso-
ciated models once more. Instead, we offer a synthesis of this work and a number of bib-
liographic references for those who would like more detail on these subjects.
The purpose of Chapter 12 is to present Bayesian Programming formally. It may seem
surprising to present the formalism near the end of the book and after all the examples. We
have made this choice to help comprehension and favor intuition without sacrificing
rigor. Anyone can check after reading this chapter that all the examples and programs
presented beforehand comply with the formalism.
The goal of Chapter 13 is to revisit the main existing Bayesian models. We use the Baye-
sian Programming formalism to present them systematically. This is a good way to be
precise and concise and it also simplifies their comparison.
We chose to divide the different probabilistic models into two categories, the general
purpose probabilistic models and the problem oriented probabilistic models.
In the general purpose category, the modeling choices are made independently of any
specific knowledge about the modeled phenomenon. Most commonly these choices are
essentially made to keep inferences tractable. However, the technical simplifications of
these models may be compatible with large classes of problems and consequently may
have numerous applications. Among others, we are restating in the Bayesian Program-
ming (BP) formalism Bayesian Networks (BN), Dynamic Bayesian Networks (DBN),
Hidden Markov Models (HMM), Kalman filters, and mixture models.
In the problem oriented category, on the contrary, the modeling choices and simplifi-
cations are decided according to some specific knowledge about the modeled phenome-
non. These choices could eventually lead to very poor models from a computational point
of view. However, most of the time problem dependent knowledge, such as conditional
independence between variables, leads to very important and effective simplifications
and computational improvements.
Chapter 14 surveys the main available general purpose algorithms for Bayesian infer-
ence.
It is well known that general Bayesian inference is a very difficult problem, which
may be practically intractable. Exact inference has been proved to be NP-hard (Cooper,
1990) as has the general problem of approximate inference (Dagum & Luby, 1993).
Numerous heuristics and restrictions to the generality of possible inferences have
been proposed to achieve admissible computation time. The purpose of this chapter is to
make a short review of these heuristics and techniques.
Before starting to crunch numbers, it is usually possible (and wise) to make some
symbolic computations to reduce the amount of numerical computation required. The
first section of this chapter presents the different possibilities. We will see that these sym-
bolic computations can be either exact or approximate.
Once simplified, the expression obtained must be numerically evaluated. In a few
cases exact (exhaustive) computation may be possible thanks to the previous symbolic
simplification, but most of the time, even with the simplifications, only approximate cal-
culations are possible. The second section of this chapter describes the main algorithms
to do so.
Finally, Chapter 15 surveys the different learning algorithms. The best known ones are
rapidly recalled and restated in Bayesian Programming terms. $$$More to come when
Chapter 18 has been written$$$.
1.5 A need for a new programming language and new hardware
A modeling methodology and new inference algorithms are not sufficient to make these
models operational. We also require new programming languages to implement them on
classical computers and, eventually, new specialized hardware architectures to run these
programs efficiently.
However captivating these topics may be, we chose not to deal with them in this
book.
Concerning new programming languages, this book comes with a companion website
(Bayesian-Programming.org) where such a programming language called ProBT is pro-
vided under a free license restricted to noncommercial uses.
ProBT is a C++ multi-platform professional library used to automate probabilistic
calculus. The ProBT library has two main components: (i) a friendly Application Pro-
gram Interface (API) for building Bayesian models and (ii) a high-performance Bayesian
Inference Engine (BIE) allowing the entire probability calculus to be executed exactly or
approximately.
ProBT comes with its complete documentation and numerous examples, including
those used in this book. Utilization of the library requires some computer science profi-
ciency, which is not required of the readers of this book. This companion website is a
plus but is not necessary for the comprehension of this book.
It is too early to be clear about new hardware dedicated to probabilistic inference, and the
book is already too long to make room for one more topic!
However, we would like to stress that 25 years ago, no one dared to dream about
graphical computers. Today, no one dares to sell a computer without a graphical display
with millions of pixels able to present real-time 3D animations or to play high quality
movies thanks to specific hardware that makes such a marvel feasible.
We are convinced that 25 years from now, the ability to treat incomplete and uncertain
data will be as inescapable for computers as graphical abilities are today. We hope that
you will also be convinced of this at the end of this book. Consequently, we will require
specific hardware to face the huge computing burden that some Bayesian inference prob-
lems may generate.
Many possible directions of research may be envisioned to develop such new hard-
ware. Some, especially promising, are inspired by biology. Indeed, some researchers are
currently exploring the hypothesis that the central nervous system (CNS) could be a
probabilistic machine either at the level of individual neurons or assemblies of neurons.
Feedback from these studies could provide inspiration for this necessary hardware.
1.6 A place for numerous controversies
We believe that Bayesian modeling is an elegant matter that can be presented simply,
intuitively, and with mathematical rigor. We hope that we succeed in doing so in this
book. However, the subjectivist approach to probability has been and still is a subject of
countless controversies.
Some questions must be asked, discussed, and answered, such as: the role of decision
theory; the dilemma of bias versus variance; the computational complexity of Bayesian
inference; the frequentist versus nonfrequentist argument; fuzzy logic versus the probabi-
listic treatment of uncertainty; physical causality versus logical causality; and last but not
least, the subjectivist versus objectivist epistemological conception of probability itself.
To make the main exposition as clear and simple as possible, none of these controver-
sies, historical notes, epistemological debates, and tricky technical questions are dis-
cussed in the body of the book. We have made the didactic choice to develop all these
questions in a special annex entitled "FAQ and FAM" (Frequently Asked Questions and
Frequently Argued Matters).
This annex is organized as a collection of "record cards", four pages long at most,
presented in alphabetical order. Cross references to these subjects are included in the
main text for readers interested in going further than a simple presentation of the princi-
ples of Bayesian modeling.
2 Basic Concepts
Far better an approximate answer to the right question, which is often
vague, than an exact answer to the wrong question, which can always be
made precise.
John W. Tukey
The purpose of this chapter is to gently introduce the basic concepts of Bayesian Pro-
gramming. These concepts will be extensively used and developed in Chapters 4 to 13
and they will be revisited, summarized and formally defined in Chapter 12.
We start with a simple example of Bayesian spam filtering, which helps to eliminate
junk emails. Commercially available software is based on a similar approach.
The problem is very easy to formulate. We want to classify texts (emails) into one of two categories: either "nonspam" or "spam". The only information we can use to classify the emails is their content: a set of words.
The classifier should furthermore be able to adapt to its user and to learn from experience. Starting from an initial standard setting, the classifier should modify its internal parameters when the user disagrees with its own decision. It will hence adapt to the user's criteria for choosing between "nonspam" and "spam". It will improve its results as it encounters more and more classified emails. The classifier uses a dictionary of N words. Each email will be classified according to the presence or absence of each of the words in the dictionary.
2.1 Variable
The variables necessary to write this program are the following:
1. $Spam$¹: a binary variable, false if the email is not spam and true otherwise.
2. $W_0, W_1, \ldots, W_{N-1}$: $N$ binary variables. $W_n$ is true if the $n^{th}$ word of the dictionary is present in the text.
These $N + 1$ binary variables sum up all the information we have about an email.
2.2 Probability
A variable can have one and only one value at a given time, so the value of $Spam$ is either true² or false, as the email may either be spam or not.
However, this value may be unknown. Unknown does not mean that you do not have any information concerning $Spam$. For instance, you may know that the average rate of nonspam email is 25%. This information may be formalized, writing:
1. $P([Spam = false]) = 0.25$, which stands for "the probability that an email is not spam is 25%";
2. $P([Spam = true]) = 0.75$.
1. Variables will be denoted by their name in italics with an initial capital.
2. Variable values will be denoted by their name in roman, in lowercase.
2.3 The normalization postulate
According to our hypothesis, an email is either interesting to read or spam. It means that it cannot be both but it is necessarily one of them. This implies that:
$P([Spam = false]) + P([Spam = true]) = 1.0$    (2.1)
This property is true for any discrete variable (not only for binary ones) and consequently the probability distribution on a given variable $X$ should necessarily be normalized:
$\sum_{x \in X} P([X = x]) = 1.0$    (2.2)
For the sake of simplicity, we will use the following notation:
$\sum_{X} P(X) = 1.0$    (2.3)
2.4 Conditional probability
We may be interested in the probability that a given variable assumes a value based on some information. This is called a conditional probability.
For instance, we may be interested in the probability that a given word appears in spam: $P([W_j = true] \mid [Spam = true])$. The sign "$\mid$" separates the variables into two sets: on the right are the variables with values known with certainty, on the left the probed variables.
This notation may be generalized as $P(W_j \mid [Spam = true])$, which stands for the probability distribution on $W_j$ knowing that the email is spam. This distribution is defined by two probabilities corresponding to the two possible values of $W_j$. For instance:
1. $P([W_j = false] \mid [Spam = true]) = 0.9996$
2. $P([W_j = true] \mid [Spam = true]) = 0.0004$
Analogously to Expression (2.3), we have that for any two variables $X$ and $Y$:
$\sum_{X} P(X \mid Y) = 1.0$    (2.4)
Consequently, $\sum_{W_j} P(W_j \mid [Spam = true]) = 1.0$.
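To make the notation concrete, here is a minimal sketch in plain Python (not part of the original text) that stores $P(Spam)$ and $P(W_j \mid Spam)$ as simple tables and checks the normalization constraints (2.2) and (2.4); the values for the [Spam = true] row are those quoted above, while the [Spam = false] row is an invented illustrative choice:

```python
# P(Spam): prior on the Spam variable (values from Section 2.2).
p_spam = {False: 0.25, True: 0.75}

# P(Wj | Spam): one table per value of the conditioning variable.
# The Spam = True row uses the figures quoted above; the Spam = False
# row is an arbitrary illustrative choice.
p_wj_given_spam = {
    True:  {False: 0.9996, True: 0.0004},
    False: {False: 0.9999, True: 0.0001},
}

# Normalization (2.2): the distribution on Spam sums to 1.
assert abs(sum(p_spam.values()) - 1.0) < 1e-9

# Normalization (2.4): for each known value of Spam, the conditional
# distribution on Wj sums to 1.
for spam_value, dist in p_wj_given_spam.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```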
2.5 Variable conjunction
We may also be interested in the probability of the conjunction of two variables: $P(Spam \wedge W_j)$.
$Spam \wedge W_j$, the conjunction of the two variables $Spam$ and $W_j$, is a new variable that can take four different values:
$\{(false, false), (false, true), (true, false), (true, true)\}$    (2.5)
This may be generalized to the conjunction of an arbitrary number of variables. For instance, in the sequel, we will be very interested in the joint probability distribution of the conjunction of $N + 1$ variables:
$P(Spam \wedge W_0 \wedge \ldots \wedge W_j \wedge \ldots \wedge W_{N-1})$    (2.6)
2.6 The conjunction postulate (Bayes theorem)
The probability of a conjunction of two variables $X$ and $Y$ may be computed according to the Conjunction Rule:
$P(X \wedge Y) = P(X) \times P(Y \mid X) = P(Y) \times P(X \mid Y)$    (2.7)
This rule is better known in the form of the so-called Bayes theorem:
$P(X \mid Y) = \dfrac{P(X) \times P(Y \mid X)}{P(Y)}$    (2.8)
However, we prefer the first form, which clearly states that it is a means of computing the probability of a conjunction of variables according to both the probabilities of these variables and their relative conditional probabilities.
For instance, we have:
$P([Spam = true] \wedge [W_j = true]) = P([Spam = true]) \times P([W_j = true] \mid [Spam = true]) = 0.75 \times 0.0004 = 0.0003$    (2.9)
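The conjunction rule and Bayes theorem can be checked numerically with the same toy tables (again, the [Spam = false] row is an invented illustrative choice); the first computation reproduces (2.9):

```python
p_spam = {False: 0.25, True: 0.75}                       # P(Spam), from Section 2.2
p_wj_given_spam = {True:  {False: 0.9996, True: 0.0004}, # P(Wj | Spam): true row as above,
                   False: {False: 0.9999, True: 0.0001}} # false row invented for illustration

# Conjunction rule (2.7): P([Spam=true] AND [Wj=true])
p_joint = p_spam[True] * p_wj_given_spam[True][True]     # 0.75 * 0.0004
print(p_joint)                                           # 0.0003, as in (2.9)

# Bayes theorem (2.8): invert the conditioning.
# P([Wj=true]) is obtained by summing over the two values of Spam
# (anticipating the marginalization rule of Section 2.8).
p_wj_true = sum(p_spam[s] * p_wj_given_spam[s][True] for s in (False, True))
print(p_joint / p_wj_true)                               # P([Spam=true] | [Wj=true])
```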
2.7 Syllogisms
It is very important to acquire a clear intuitive feeling of what a conditional probability
and the Conjunction Rule mean. A first step toward this understanding may be to restate
the classical logical syllogisms in their probabilistic forms.
Let us first recall the two logical syllogisms:
1. Modus Ponens¹: $a \wedge (a \Rightarrow b) \vdash b$ (if a is true and if a implies b then b is true)
2. Modus Tollens: $\neg b \wedge (a \Rightarrow b) \vdash \neg a$ (if b is false and if a implies b then a is false)
For instance, if a stands for "n may be divided by 9" and b stands for "n may be divided by 3", we know that $a \Rightarrow b$, and we have:
1. Modus Ponens: if "n may be divided by 9" then "n may be divided by 3".
2. Modus Tollens: if "n may be divided by 3" is false then "n may be divided by 9" is also false.
Using probabilities, we may state:
1. Modus Ponens: $P(b \mid a) = 1$, which means that knowing that a is true, we may be sure that b is true.
2. Modus Tollens: $P(\neg a \mid \neg b) = 1$, which means that knowing that b is false, we may be sure that a is false.
1. Logical propositions will be denoted by their name in italics, in lowercase.
$P(\neg a \mid \neg b) = 1$ may be derived from $P(b \mid a) = 1$, using the normalization and conjunction postulates:
$P(\neg a \mid \neg b) = 1.0 - P(a \mid \neg b)$    from (2.3)
$= 1.0 - \dfrac{P(\neg b \mid a) \times P(a)}{P(\neg b)}$    from (2.8)
$= 1.0 - \dfrac{(1 - P(b \mid a)) \times P(a)}{P(\neg b)}$    from (2.3)
$= 1.0$    because $P(b \mid a) = 1$
However, using probabilities we may go further than with logic:
1. From $P(b \mid a) = 1$, using the normalization and conjunction postulates, we may derive¹ that $P(a \mid b) \geq P(a)$, which means that if we know that b is true, the probability that a is true is higher than it would be if we knew nothing about b.
Obviously, the probability that "n may be divided by 9" is higher if you do know that "n may be divided by 3" than if you do not. This very common kind of reasoning is beyond the scope of pure logic but is very simple in the Bayesian framework.
2. From $P(b \mid a) = 1$, using the normalization and conjunction postulates, we may derive² that $P(\neg b \mid \neg a) \geq P(\neg b)$, which means that if we know that a is false, the probability that b is false is higher than it would be if we knew nothing about a.
The probability that "n may be divided by 3" is less if you know that n may not be divided by 9 than if you do not know anything about n.
1. This derivation is a good exercise. The solution may be found in Chapter 12.
2. This derivation is a good exercise. The solution may be found in Chapter $$$ at $$$.
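A quick numerical check of these two inequalities on the divisibility example, counting over the integers 1 to 90 with a uniform choice of n (an assumption made only for this illustration):

```python
# a: "n may be divided by 9", b: "n may be divided by 3"; here a implies b.
ns = range(1, 91)
a = [n % 9 == 0 for n in ns]
b = [n % 3 == 0 for n in ns]

def p(event):                      # probability under a uniform choice of n
    return sum(event) / len(ns)

def p_given(event, cond):          # conditional probability P(event | cond)
    both = sum(e and c for e, c in zip(event, cond))
    return both / sum(cond)

# Beyond Modus Ponens: knowing b raises the probability of a.
print(p_given(a, b), ">=", p(a))                 # 0.333... >= 0.111...

# Beyond Modus Tollens: knowing (not a) raises the probability of (not b).
not_a = [not x for x in a]
not_b = [not x for x in b]
print(p_given(not_b, not_a), ">=", p(not_b))     # 0.75 >= 0.666...
```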
2.8 The marginalization rule
A very useful rule, called the marginalization rule, may be derived from the normalization and conjunction postulates. This rule states:
$\sum_{X} P(X \wedge Y) = P(Y)$    (2.10)
It may be derived as follows (2.11):
$\sum_{X} P(X \wedge Y) = \sum_{X} P(Y) \times P(X \mid Y)$    from (2.7)
$= P(Y) \times \sum_{X} P(X \mid Y)$
$= P(Y)$    from (2.4)
2.9 Joint distribution and questions
The joint distribution on a set of two variables $X$ and $Y$ is the distribution on their conjunction: $P(X \wedge Y)$. If you know the joint distribution, then you know everything you may want to know about the variables. Indeed, using the conjunction and marginalization rules you have:
1. $P(Y) = \sum_{X} P(X \wedge Y)$
2. $P(X) = \sum_{Y} P(X \wedge Y)$
3. $P(Y \mid X) = \dfrac{P(X \wedge Y)}{\sum_{Y} P(X \wedge Y)}$
4. $P(X \mid Y) = \dfrac{P(X \wedge Y)}{\sum_{X} P(X \wedge Y)}$
This is of course also true for a joint distribution on more than two variables.
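A small brute-force sketch in Python (with an arbitrary, invented joint table) showing that the marginals and conditionals of two binary variables all follow from the joint distribution alone:

```python
from itertools import product

# An arbitrary (normalized) joint distribution P(X AND Y) on two binary variables.
p_xy = {(False, False): 0.4, (False, True): 0.2,
        (True, False): 0.1, (True, True): 0.3}

# 1. and 2. Marginals by summing out the other variable (marginalization rule).
p_y = {y: sum(p_xy[(x, y)] for x in (False, True)) for y in (False, True)}
p_x = {x: sum(p_xy[(x, y)] for y in (False, True)) for x in (False, True)}

# 3. and 4. Conditionals as the joint divided by the corresponding marginal.
p_y_given_x = {(y, x): p_xy[(x, y)] / p_x[x] for x, y in product((False, True), repeat=2)}
p_x_given_y = {(x, y): p_xy[(x, y)] / p_y[y] for x, y in product((False, True), repeat=2)}

# Everything we may want to know about X and Y follows from the joint table.
assert abs(sum(p_xy.values()) - 1.0) < 1e-9
print(p_x, p_y)
```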
For our spam instance, if you know the joint distribution
$P(Spam \wedge W_0 \wedge \ldots \wedge W_j \wedge \ldots \wedge W_{N-1})$    (2.12)
you can compute any of the $3^{N+1} - 2^{N+1}$ possible questions that you can imagine on this set of $N + 1$ variables.
A question is defined by partitioning the set of variables into three subsets: the searched variables (on the left of the conditioning bar), the known variables (on the right of the conditioning bar) and the free variables. The searched variables set must not be empty.
Examples of these questions are:
1. $P(Spam \wedge W_0 \wedge \ldots \wedge W_j \wedge \ldots \wedge W_{N-1})$, the joint distribution itself;
2. $P(Spam) = \sum_{W_0 \wedge \ldots \wedge W_{N-1}} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})$, the a priori probability of being spam;
3. $P(W_j) = \sum_{Spam \wedge W_0 \wedge \ldots \wedge W_{j-1} \wedge W_{j+1} \wedge \ldots \wedge W_{N-1}} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})$, the a priori probability for the $j^{th}$ word of the dictionary to appear;
4. $P(W_j \mid [Spam = true]) = \dfrac{\sum_{W_0 \wedge \ldots \wedge W_{j-1} \wedge W_{j+1} \wedge \ldots \wedge W_{N-1}} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})}{\sum_{W_0 \wedge \ldots \wedge W_{N-1}} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})}$, the probability for the $j^{th}$ word to appear, knowing that the text is spam;
5. $P(Spam \mid [W_j = true]) = \dfrac{\sum_{W_0 \wedge \ldots \wedge W_{j-1} \wedge W_{j+1} \wedge \ldots \wedge W_{N-1}} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})}{\sum_{Spam \wedge W_0 \wedge \ldots \wedge W_{j-1} \wedge W_{j+1} \wedge \ldots \wedge W_{N-1}} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})}$, the probability for the email to be spam, knowing that the $j^{th}$ word appears in the text;
6. $P(Spam \mid W_0 \wedge \ldots \wedge W_j \wedge \ldots \wedge W_{N-1}) = \dfrac{P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})}{\sum_{Spam} P(Spam \wedge W_0 \wedge \ldots \wedge W_{N-1})}$, finally, the most interesting one, the probability that the email is spam, knowing for all $N$ words in the dictionary whether they are present or not in the text.
2.10 Decomposition
The key challenge for a Bayesian programmer is to specify a way to compute the joint distribution that has the three main qualities of being a good model, easy to compute and easy to identify.
This is done using a decomposition that restates the joint distribution as a product of simpler distributions.
Starting from the joint distribution and applying the conjunction rule recursively, we obtain:
$P(Spam \wedge W_0 \wedge \ldots \wedge W_j \wedge \ldots \wedge W_{N-1}) = P(Spam) \times P(W_0 \mid Spam) \times P(W_1 \mid Spam \wedge W_0) \times \ldots \times P(W_{N-1} \mid Spam \wedge W_0 \wedge \ldots \wedge W_{N-2})$    (2.13)
This is an exact mathematical expression.
We simplify it drastically by assuming that the probability of appearance of a word knowing the nature of the text (spam or not) is independent of the appearance of the other words. For instance, we assume that:
$P(W_1 \mid Spam \wedge W_0) = P(W_1 \mid Spam)$    (2.14)
We finally obtain:
$P(Spam \wedge W_0 \wedge \ldots \wedge W_j \wedge \ldots \wedge W_{N-1}) = P(Spam) \times \prod_{i=0}^{N-1} P(W_i \mid Spam)$    (2.15)
Figure 2.1 shows the graphical model of this expression.
Figure 2.1: The graphical model of the Bayesian spam filter.
Observe that the assumption of independence between words is clearly not completely true. For instance, it completely neglects that the appearance of pairs of words may be more significant than isolated appearances. However, as subjectivists, we assume this hypothesis and may develop the model and the associated inferences to test how reliable it is.
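To see how the decomposition (2.15) is used, here is a minimal sketch in plain Python (not ProBT) that builds the joint distribution from $P(Spam)$ and the $N$ terms $P(W_i \mid Spam)$ and then answers question 6 of Section 2.9 by normalizing over $Spam$; the dictionary size and all parameter values are invented for illustration:

```python
# Illustrative parameters for a 3-word dictionary (invented values).
p_spam = {False: 0.25, True: 0.75}
p_w_given_spam = [                   # p_w_given_spam[i][spam] = P([Wi=true] | Spam=spam)
    {False: 0.01, True: 0.30},       # word 0: frequent in spam
    {False: 0.10, True: 0.08},       # word 1: roughly neutral
    {False: 0.20, True: 0.02},       # word 2: frequent in nonspam
]

def joint(spam, words):
    """Decomposition (2.15): P(Spam AND W0 AND ... AND WN-1)."""
    p = p_spam[spam]
    for p_wi, present in zip(p_w_given_spam, words):
        p *= p_wi[spam] if present else 1.0 - p_wi[spam]
    return p

def p_spam_given_words(words):
    """Question 6 of Section 2.9: normalize the joint over Spam."""
    return joint(True, words) / (joint(True, words) + joint(False, words))

print(p_spam_given_words((True, False, False)))   # word 0 present: high spam probability
print(p_spam_given_words((False, False, True)))   # word 2 present: low spam probability
```

Note that only $2N + 1$ numbers are needed to define the whole joint distribution, which is the computational benefit of the independence assumption.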
2.11 Parametric forms
To be able to compute the joint distribution, we must now specify the $N + 1$ distributions appearing in the decomposition.
We already specified $P(Spam)$ in Section 2.2:
$P([Spam = false]) = 0.25$
$P([Spam = true]) = 0.75$
Each of the $N$ forms $P(W_i \mid Spam)$ must in turn be specified. The first idea is to simply count the number of times the $i^{th}$ word of the dictionary appears in both spam and nonspam emails. This would naively lead to histograms:
$P([W_i = true] \mid [Spam = false]) = \dfrac{n_f^i}{n_f}$
$P([W_i = true] \mid [Spam = true]) = \dfrac{n_t^i}{n_t}$
where $n_f^i$ stands for the number of appearances of the $i^{th}$ word in nonspam emails and $n_f$ stands for the total number of nonspam emails. Similarly, $n_t^i$ stands for the number of appearances of the $i^{th}$ word in spam emails and $n_t$ stands for the total number of spam emails.
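A minimal sketch of this counting identification in plain Python, with a tiny invented training set; note how a word that never occurs in spam receives a zero estimate, which is precisely the drawback discussed next:

```python
# Invented training data: each email is (set of word indices present, is_spam).
emails = [({0, 2}, True), ({0}, True), ({1, 2}, False), ({2}, False), ({1}, False)]
N = 3                                                     # dictionary size

n_t = sum(1 for _, is_spam in emails if is_spam)          # number of spam emails
n_f = len(emails) - n_t                                   # number of nonspam emails
n_t_i = [sum(1 for words, s in emails if s and i in words) for i in range(N)]
n_f_i = [sum(1 for words, s in emails if not s and i in words) for i in range(N)]

# Histogram estimates of P([Wi = true] | Spam).
p_wi_true_given_spam = [n_t_i[i] / n_t for i in range(N)]
p_wi_true_given_nonspam = [n_f_i[i] / n_f for i in range(N)]
print(p_wi_true_given_spam)      # word 1 never seen in spam: estimated probability 0.0
print(p_wi_true_given_nonspam)
```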
The drawback of histograms is that when no observation has been made, the probabilities are null. For instance, if the $i^{th}$ word has never been observed in spam, then:
$P([W_i = true] \mid [Spam = true]) = 0$    (2.16)
A very strong assumption indeed, which says that what has not yet been observed is impossible! Consequently, we prefer to assume that the parametric forms $P(W_i \mid Spam)$ are Laplace succession laws rather than histograms: