Vstatmp E17
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
ISBN 978-3-945931-13-4
DOI 10.3204/PUBDB-2017-08987
July 2017, Gerhard Bohm, Günter Zech
June 2014, Gerhard Bohm, Günter Zech
Preface
There is a large number of excellent statistics books. Nevertheless, we think that it is justified to complement them by another textbook with a focus on modern applications in nuclear and particle physics. To this end we have included a large number of related examples and figures in the text. We emphasize the mathematical foundations less and appeal more to the intuition of the reader.
Data analysis in modern experiments is unthinkable without simulation techniques. We discuss in some detail how to apply Monte Carlo simulation to parameter estimation, deconvolution, and goodness-of-fit tests. We also sketch modern developments like artificial neural nets, bootstrap methods, boosted decision trees and support vector machines.
Likelihood is a central concept of statistical analysis and its foundation is the likelihood principle. We discuss this concept in more detail than is usually done in textbooks and base the treatment of inference problems as far as
February 2010, Gerhard Bohm, Günter Zech
Contents
4 Measurement Errors
4.1 General Considerations
4.1.1 Importance of Error Assignments
4.1.2 The Declaration of Errors
4.1.3 Definition of Measurement and its Error
4.2 Statistical Errors
4.2.1 Errors Following a Known Statistical Distribution
4.2.2 Errors Determined from a Sample of Measurements
4.2.3 Error of the Empirical Variance
4.3 Systematic Errors
4.3.1 Definition and Examples
4.3.2 How to Avoid, Detect and Estimate Systematic Errors
4.3.3 Treatment of Systematic Errors
4.4 Linear Propagation of Errors
4.4.1 Error Propagation
6 Estimation I
6.1 Introduction
6.2 Inference with Given Prior
6.2.1 Discrete Hypotheses
6.2.2 Continuous Parameters
6.3 Likelihood and the Likelihood Ratio
6.4 The Maximum Likelihood Method for Parameter Inference
6.4.1 The Recipe for a Single Parameter
6.4.2 Examples
6.4.3 Likelihood Inference for Several Parameters
6.4.4 Complicated Likelihood Functions
6.4.5 Combining Measurements
6.5 Likelihood and Information
6.5.1 Sufficiency
6.5.2 The Conditionality Principle
6.5.3 The Likelihood Principle
6.5.4 Stopping Rules
6.6 The Moments Method
7 Estimation II
7.1 Likelihood of Histograms
7.1.1 The χ2 Approximation
7.2 Extended Likelihood
7.3 Comparison of Observations to a Monte Carlo Simulation
7.3.1 Motivation
7.3.2 The Likelihood Function
7.3.3 The χ2 Approximation
7.3.4 Weighting the Monte Carlo Observations
7.3.5 Including the Monte Carlo Uncertainty
7.3.6 Solution for a Large Number of Monte Carlo Events
7.4 Parameter Estimation of a Signal Contaminated by Background
7.4.1 Introduction
7.4.2 Parametrization of the Background
7.4.3 Histogram Fits with Separate Background Measurement
7.4.4 The Binning-Free Likelihood Approach
7.5 Inclusion of Constraints
7.5.1 Introduction
7.5.2 Eliminating Redundant Parameters
7.5.3 Gaussian Approximation of Constraints
7.5.4 The Method of Lagrange Multipliers
7.5.5 Conclusion
7.6 Reduction of the Number of Variates
7.6.1 The Problem
7.6.2 Two Variables and a Single Linear Parameter
7.6.3 Generalization to Several Variables and Parameters
7.6.4 Non-linear Parameters
7.7 Approximated Likelihood Estimators
7.8 Nuisance Parameters
7.8.1 Nuisance Parameters with Given Prior
7.8.2 Factorizing the Likelihood Function
7.8.3 Parameter Transformation, Restructuring [19]
7.8.4 Conditional Likelihood
7.8.5 Profile Likelihood
7.8.6 Resampling Methods
7.8.7 Integrating out the Nuisance Parameter
9 Unfolding
9.1 Introduction
9.2 Discrete Inverse Problems and the Response Matrix
9.2.1 Introduction and Definition
9.2.2 The Histogram Representation
9.2.3 Expansion of the True Distribution
9.2.4 The Least Square Solution and the Eigenvector Decomposition
9.2.5 The Maximum Likelihood Approach
9.3 Unfolding with Explicit Regularization
9.3.1 General Considerations
9.3.2 Variable Dependence and Correlations
9.3.3 Choice of the Regularization Strength
9.3.4 Error Assignment to Unfolded Distributions
9.3.5 EM Unfolding with Early Stopping
9.3.6 SVD Based Methods [68, 78]
9.3.7 Penalty Regularization
9.3.8 Comparison of the Methods
9.3.9 Spline Approximation
9.3.10 Statistical and Systematic Uncertainties of the Response Matrix
9.4 Unfolding with Implicit Regularization
9.5 Inclusion of Background
13 Appendix
13.1 Large Number Theorems
13.1.1 Chebyshev Inequality and Law of Large Numbers
13.1.2 Central Limit Theorem
13.2 Consistency, Bias and Efficiency of Estimators
13.2.1 Consistency
13.2.2 Bias of Estimates
13.2.3 Efficiency
13.3 Properties of the Maximum Likelihood Estimator
13.3.1 Consistency
References
Index
1 Introduction: Probability and Statistics
Though it is an exaggeration to claim that in our life only death and taxes are certain, it is true that the majority of all predictions suffer from uncertainties. Thus the occupation with probabilities and statistics is useful for everybody; for scientists in the experimental and empirical sciences it is indispensable.
of the poll, but one also wants to know the accuracy of the prognosis, or how many voters have to be asked in order to issue a reasonably precise statement.
2. In an experiment we record the lifetimes of 100 decays of an unstable nucleus. To determine the mean life of the nucleus, we take the average of the observed times. Here the uncertainty has its origin in the quantum mechanical random process. The laws of physics tell us that the lifetimes follow an exponential distribution. The sample is assumed to be representative of the total of the infinitely many decay times that could have occurred.
3. From 10 observations the period of a pendulum is to be determined. We will take as estimate the mean value of the replicates. Its uncertainty has to be evaluated from the dispersion of the individual observations. The actual observations form a sample from the infinite number of all possible observations.
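The following short Python sketch illustrates example 2 (the same mean-and-spread recipe applies to example 3): it simulates 100 exponential decay times and estimates the mean life together with its statistical uncertainty. The true value τ = 1 and the random seed are illustrative assumptions, not numbers taken from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
tau_true = 1.0                     # assumed true mean life (illustrative)
N = 100                            # number of recorded decays, as in example 2
t = rng.exponential(tau_true, N)   # simulated exponential decay times

tau_hat = t.mean()                 # estimate of the mean life
err = t.std(ddof=1) / np.sqrt(N)   # uncertainty of the mean from the sample spread
print(f"estimated mean life: {tau_hat:.3f} +/- {err:.3f}")
```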
These examples are related to parameter inference. Further statistical
topics are testing, deconvolution, and classification.
4. A bump is observed in a mass distribution. Is it a resonance or just a
background fluctuation?
5. An angular distribution is predicted to be linear in the cosine of the polar
angle. Are the observed data compatible with this hypothesis?
6. It is to be tested whether two experimental setups perform identically. To
this end, measurement samples from both are compared to each other. It
is tested whether the samples belong to the same population, while the
populations themselves are not identified explicitly.
7. A frequency spectrum is distorted by the finite resolution of the detector.
We want to reconstruct the true distribution.
8. In a test beam the development of shower cascades produced by electrons
and pions is investigated. The test samples are characterized by several
variables like penetration depth and shower width. The test samples are
used to develop procedures which predict the identity of unknown parti-
cles from their shower parameters.
A further, very important part of statistics is decision theory. We shall
not cover this topic.
1. Thomas Bayes was a mathematician and theologian who lived in the 18th century.
2. Note that probability assignments based on experience also have a frequency background.
3. For two reasons: the proof that Kolmogorov's axioms are fulfilled is rather easy, and the probability of complex events can be calculated by straightforward combinatorics.
When the Z0 mass and its error were determined, a uniform prior probability in the mass was assumed. If instead a uniform probability in the mass squared had been used, the result would have changed only by about $10^{-3}$ times the uncertainty of the mass determination. This means that applying Bayes' assumption to either the mass or the mass squared makes no difference within the precision of the measurement in this specific case.
In other situations prior probabilities, which we will discuss in detail in Chap. 6, can have a considerable influence on a result.
- Recently a practical guide to data analysis in high energy physics [10] has been published. The chapters are written by different experienced physicists and reflect the present state of the art. As is common in statistics publications, some parts are slightly biased by the personal preferences of the authors. More specialized is [11], which emphasizes particularly probability density estimation and machine learning.
- Very useful, especially for the solution of numerical problems, is the book by Blobel and Lohrmann, "Statistische und numerische Methoden der Datenanalyse" [12], written in German.
- Other useful books written by particle physicists are found in Refs. [13, 14, 15]. The book by Roe is more conventional, while Cowan and D'Agostini favor a moderate Bayesian view.
- Modern techniques of statistical data analysis are presented in a book written by professional statisticians for non-professional users, Hastie et al., "The Elements of Statistical Learning" [16].
- A modern professional treatment of Bayesian statistics is the textbook by Box and Tiao, "Bayesian Inference in Statistical Analysis" [17].
The interested reader will find work on the foundations of statistics, on basic principles and on the standard theory in the following books:
- Fisher's book [18], "Statistical Methods, Experimental Design and Scientific Inference", provides an interesting overview of his complete work.
- Edwards' book "Likelihood" [19] stresses the importance of the likelihood function and contains many useful references and the history of the likelihood principle.
- Many basic considerations and a collection of personal contributions from a moderate Bayesian view are contained in the book "Good Thinking" by Good [20]. A collection of work by Savage [21] presents a more extreme Bayesian point of view.
- Somewhat old-fashioned textbooks of Bayesian statistics which are of historical interest are the books of Jeffreys [22] and Savage [23].
Recent statistical work by particle physicists and astrophysicists can be found in the proceedings of the PHYSTAT conferences [24] held during the past few years. Many interesting and well written articles can also be found on the internet.
This personal selection of literature is obviously in no way exhaustive.
2 Basic Probability Relations
$$A \cup \bar{A} = \Omega\,, \qquad A \cap \bar{A} = \emptyset\,,$$
and
$$\emptyset \subset A \subset \Omega\,. \tag{2.1}$$
For any A, B we have
$$\varepsilon_1 = \frac{n_{1\cap 2}}{n_2}\,, \qquad \varepsilon_2 = \frac{n_{1\cap 2}}{n_1}\,, \qquad n = \frac{n_1 n_2}{n_{1\cap 2}}\,.$$
This scheme is used in many analogous situations.
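A minimal sketch of this scheme, under the assumption that n1 and n2 are the event counts registered by two independent detectors (or selection criteria) and n1∩2 the count registered by both; the function name and the example numbers are illustrative.

```python
def efficiency_estimates(n1, n2, n12):
    """Estimate the two efficiencies and the total number of events
    from the counts of two independent detectors and their overlap."""
    eps1 = n12 / n2          # efficiency of detector 1, measured on events seen by 2
    eps2 = n12 / n1          # efficiency of detector 2, measured on events seen by 1
    n_total = n1 * n2 / n12  # estimated total number of produced events
    return eps1, eps2, n_total

# illustrative counts
print(efficiency_estimates(n1=900, n2=800, n12=720))  # -> (0.9, 0.8, 1000.0)
```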
Bayes’ theorem is applied in the next two examples, where the attributes
are not independent.
$$P\{B \mid A\} = \frac{P\{A \mid B\}\,P\{B\}}{P\{A\}} = \frac{0.018 \cdot 0.5}{0.02} = 0.45\,.$$
About 45% of the students are women.
where the bracket in the first line is equal to 1. In the second line Bayes’
theorem is applied. Applying it once more, we get
$$P\{A \mid b\} = \frac{P\{b \mid A\}\,P\{A\}}{P\{b\}} = \frac{P\{b \mid A\}\,P\{A\}}{P\{b \mid A\}\,P\{A\} + P\{b \mid \bar{A}\}\,P\{\bar{A}\}} = \frac{0.98 \cdot 0.0001}{0.98 \cdot 0.0001 + 0.01 \cdot 0.9999} = 0.0097\,.$$
About 1% of the selected events correspond to b-quark production.
Bayes’ theorem is rather trivial, thus the results of the last two examples
could have easily been written down without referring to it.
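For concreteness, a short Python check of the numbers used in the b-quark example above; the fractions 0.0001, 0.98 and 0.01 are those quoted in the text, while the function name is arbitrary.

```python
def posterior(prior_A, p_b_given_A, p_b_given_notA):
    """Bayes' theorem for a binary hypothesis: P(A | b)."""
    num = p_b_given_A * prior_A
    den = num + p_b_given_notA * (1.0 - prior_A)
    return num / den

# b-quark example: prior 0.0001, tag efficiency 0.98, mistag rate 0.01
print(posterior(0.0001, 0.98, 0.01))  # approx. 0.0097
```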
3 Probability Distributions and their Properties
Fig. 3.1. Probability distribution of the sum of the points of three dice.
It is defined by
with ǫ positive and smaller than the distance to neighboring variate values.
$$f(x) = \frac{dF(x)}{dx}\,. \tag{3.1}$$
Note that the p.d.f. is defined on the full range $-\infty < x < \infty$; it may be zero in certain regions. It has the following properties:
a) $f(-\infty) = f(+\infty) = 0$,
b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
The probability $P\{x_1 \le x \le x_2\}$ to find the random variable $x$ in the interval $[x_1, x_2]$ is given by
$$P\{x_1 \le x \le x_2\} = F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(x)\,dx\,.$$
1. We will, however, use the notations probability distribution and distribution for discrete as well as for continuous distributions.
We will discuss specific distributions in Sect. 3.6, but we introduce two common distributions already here. They will serve as examples in the following sections.
are shown in Fig. 3.4. The probability of observing a lifetime longer than $\tau$ is
$$P\{t > \tau\} = F(\infty) - F(\tau) = e^{-1}\,.$$
[Fig. 3.4: the exponential p.d.f. f(x) = e^{-x} and its distribution function F(x) = 1 - e^{-x}.]
$$f(x) = N(x|0, s)\,,$$
$$N(x|x_0, s) = \frac{1}{\sqrt{2\pi}\,s}\, e^{-(x-x_0)^2/(2s^2)}\,. \tag{3.3}$$
The width constant $s$ is, as will be shown later, proportional to the square root of the number of scattering processes or the square root of time. When we descend by the factor $1/\sqrt{e}$ from the maximum, the full width is just $2s$. A statistical drift motion, or more generally a random walk, is met frequently in
science and also in everyday life. The normal distribution also describes approximately the motion of snowflakes or the erratic movements of a drunkard in the streets.
Many processes are too complex or not well enough understood to be described by a distribution in the form of a simple algebraic formula. In these cases it may be useful to approximate the underlying distribution using an experimental data sample. The simplest way to do this is to histogram the observations and to normalize the frequency histogram. More sophisticated methods of probability density estimation will be sketched in Chap. 12. The quality of the approximation depends of course on the available number of observations.
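A minimal sketch of this simplest approach, the normalized frequency histogram used as a density estimate; the simulated sample and the binning are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
sample = rng.normal(0.0, 1.0, 1000)   # experimental data sample (here simulated)

# normalized frequency histogram as a crude density estimate
counts, edges = np.histogram(sample, bins=20)
widths = np.diff(edges)
density = counts / (counts.sum() * widths)   # integrates to one

print(np.sum(density * widths))  # -> approximately 1.0 (normalization check)
```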
from the distribution $f(x)$, calculating $u_i = u(x_i)$, and then averaging over these values. Obviously, we have to assume the existence of such a limiting value.
In quantum mechanics, expected values of physical quantities are the main results of theoretical calculations and experimental investigations, and provide the connection to classical mechanics. Also in statistical mechanics and thermodynamics the calculation of expected values is frequently needed. We can, for instance, calculate from the velocity distribution of gas molecules the expected value of their kinetic energy, that is, essentially their temperature. In probability theory and statistics expected values play a central role.
Definition:
$$E(u(x)) = \sum_{i=1}^{\infty} u(x_i)\,p(x_i) \quad \text{(discrete distribution)}\,, \tag{3.4}$$
$$E(u(x)) = \int_{-\infty}^{\infty} u(x)\,f(x)\,dx \quad \text{(continuous distribution)}\,. \tag{3.5}$$
Here and in what follows, we assume the existence of integrals and sums.
This condition restricts the choice of the allowed functions u, p, f .
From the definition of the expected value follow the relations (c is a con-
stant, u, v are functions of x):
E(c) = c, (3.6)
E(E(u)) = E(u), (3.7)
E(u + v) = E(u) + E(v), (3.8)
E(cu) = cE(u) . (3.9)
$$E(u) \equiv \langle u \rangle\,.$$
Sometimes this simplifies the appearance of the formulas. We will use both
notations.
The expected value of the variate $x$ is also called the mean value. It can be visualized as the center of gravity of the distribution. Usually it is denoted by the Greek letter $\mu$. Both names, mean value and expected value of the corresponding distribution, are used synonymously.
Definition:
$$E(x) \equiv \langle x \rangle = \mu = \sum_{i=1}^{\infty} x_i\,p(x_i) \quad \text{(discrete distribution)}\,,$$
$$E(x) \equiv \langle x \rangle = \mu = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad \text{(continuous distribution)}\,.$$
It is called the sample mean. It is a random variable and has the expected value
$$\langle \bar{x} \rangle = \frac{1}{N}\sum_i \langle x_i \rangle = \langle x \rangle\,,$$
3.2.3 Variance
The square root σ of the variance is called standard deviation, and is the
standard measure of stochastic uncertainties.
A mechanical analogy to the variance is the moment of inertia for a mass
distribution along the x-axis for a total mass equal to unity.
Definition:
$$\operatorname{var}(x) = \sigma^2 = E\left((x - \mu)^2\right)\,.$$
From this definition follows immediately
$$\operatorname{var}(cx) = c^2 \operatorname{var}(x)\,,$$
and $\sigma/\mu$ is independent of the scale of $x$.
Very useful is the following expression for the variance, which is easily derived from its definition and (3.8), (3.9):
$$\sigma^2 = E(x^2 - 2x\mu + \mu^2) = E(x^2) - 2\mu^2 + \mu^2 = E(x^2) - \mu^2\,.$$
Sometimes this is written more conveniently as
$$\sigma^2 = \langle x^2 \rangle - \langle x \rangle^2 = \langle x^2 \rangle - \mu^2\,. \tag{3.11}$$
In analogy to Steiner's theorem for moments of inertia, we have
$$\langle (x-a)^2 \rangle = \langle (x-\mu)^2 \rangle + \langle (\mu - a)^2 \rangle = \sigma^2 + (\mu - a)^2\,,$$
implying (3.11) for $a = 0$.
The variance is invariant against a translation of the distribution by $a$:
$$x \to x + a\,, \quad \mu \to \mu + a \;\Rightarrow\; \sigma^2 \to \sigma^2\,.$$
Let us calculate the variance $\sigma^2$ for the distribution of the sum $x$ of two independent random numbers $x_1$ and $x_2$, which follow different distributions with mean values $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$:
$$x = x_1 + x_2\,,$$
$$\begin{aligned}
\sigma^2 &= \langle (x - \langle x \rangle)^2 \rangle \\
&= \langle ((x_1 - \mu_1) + (x_2 - \mu_2))^2 \rangle \\
&= \langle (x_1 - \mu_1)^2 + (x_2 - \mu_2)^2 + 2(x_1 - \mu_1)(x_2 - \mu_2) \rangle \\
&= \langle (x_1 - \mu_1)^2 \rangle + \langle (x_2 - \mu_2)^2 \rangle + 2\langle x_1 - \mu_1 \rangle \langle x_2 - \mu_2 \rangle \\
&= \sigma_1^2 + \sigma_2^2\,.
\end{aligned}$$
In the fourth step the independence of the variates (3.10) was used.
This result is important for all kinds of error estimation. For a sum of two independent measurements, their standard deviations add quadratically. We can generalize the last relation to a sum $x = \sum x_i$ of $N$ variates or measurements:
$$\sigma^2 = \sum_{i=1}^{N} \sigma_i^2\,. \tag{3.12}$$
$$\sigma^2 = \sigma_g^2 + \sigma_h^2\,.$$
From the last relation we obtain the variance $\sigma_{\bar{x}}^2$ of the sample mean $\bar{x}$ of $N$ independent random numbers $x_i$, which all follow the same distribution⁴ $f(x)$ with expected value $\mu$ and variance $\sigma^2$:
$$\bar{x} = \sum_{i=1}^{N} x_i / N\,,$$
$$\operatorname{var}(N\bar{x}) = N \operatorname{var}(x) = N\sigma^2\,,$$
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}\,. \tag{3.13}$$
The last two relations (3.12), (3.13) have many applications, for instance in random walk, diffusion, and error propagation. The root mean square distance reached by a diffusing molecule after $N$ scatters is proportional to $\sqrt{N}$ and therefore also to $\sqrt{t}$, $t$ being the diffusion time. The total length of 100 aligned objects, all having the same standard deviation $\sigma$ of their nominal length, will have a standard deviation of only $10\,\sigma$. To a certain degree, random fluctuations compensate each other.
4. The usual abbreviation is i.i.d. variates, for independent, identically distributed.
which has the correct expected value $\langle v_\mu^2 \rangle = \sigma^2$. Usually, however, the true mean value $\mu$ is unknown, except perhaps in calibration measurements, and must be estimated from the same sample as is used to derive $v_\mu^2$. We are then obliged to use the sample mean $\bar{x}$ instead of $\mu$ and calculate the mean quadratic deviation $v^2$ of the sample values relative to $\bar{x}$. In this case the expected value of $v^2$ will depend not only on $\sigma$, but also on $N$. In a first step we find
$$v^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2 = \frac{1}{N}\sum_i \left(x_i^2 - 2x_i\bar{x} + \bar{x}^2\right) = \frac{1}{N}\sum_i x_i^2 - \bar{x}^2\,. \tag{3.14}$$
With
$$\langle x^2 \rangle = \sigma^2 + \mu^2\,, \qquad \langle \bar{x}^2 \rangle = \operatorname{var}(\bar{x}) + \langle \bar{x} \rangle^2 = \frac{\sigma^2}{N} + \mu^2$$
we get with (3.14)
$$\langle v^2 \rangle = \langle x^2 \rangle - \langle \bar{x}^2 \rangle = \sigma^2\left(1 - \frac{1}{N}\right)\,,$$
$$\sigma^2 = \frac{N}{N-1}\,\langle v^2 \rangle = \frac{\langle \sum_i (x_i - \bar{x})^2 \rangle}{N-1}\,. \tag{3.15}$$
The expected value of the mean squared deviation $\langle v^2 \rangle$ is smaller than the variance of the distribution by a factor of $(N-1)/N$.
The relation (3.15) is widely used for the estimation of measurement errors when several independent measurements are available. The variance $\sigma_{\bar{x}}^2$ of the sample mean $\bar{x}$ itself is approximated, according to (3.13), by
$$\frac{v^2}{N-1} = \frac{\sum_i (x_i - \bar{x})^2}{N(N-1)}\,.$$
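A short sketch applying relations (3.15) and (3.13) to a set of repeated measurements; the sample values are invented for illustration.

```python
import numpy as np

x = np.array([10.3, 9.8, 10.1, 10.4, 9.9, 10.2])  # repeated measurements (illustrative)
N = len(x)
xbar = x.mean()

sigma2 = np.sum((x - xbar) ** 2) / (N - 1)   # unbiased variance estimate, Eq. (3.15)
err_mean = np.sqrt(sigma2 / N)               # error of the sample mean, Eq. (3.13)
print(f"mean = {xbar:.3f} +/- {err_mean:.3f}")
```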
where $f_1, f_2$ may have different mean values $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$:
$$\mu = \alpha\mu_1 + \beta\mu_2\,,$$
$$\begin{aligned}
\sigma^2 &= E\left((x - E(x))^2\right) = E(x^2) - \mu^2 \\
&= \alpha E_1(x^2) + \beta E_2(x^2) - \mu^2 \\
&= \alpha(\mu_1^2 + \sigma_1^2) + \beta(\mu_2^2 + \sigma_2^2) - \mu^2 \\
&= \alpha\sigma_1^2 + \beta\sigma_2^2 + \alpha\beta(\mu_1 - \mu_2)^2\,.
\end{aligned}$$
3.2.4 Skewness
Fig. 3.6. Three distributions with equal mean and variance but different skewness and kurtosis (legend: γ1 = 0, γ2 = 17; γ1 = 2, γ2 = 8; γ1 = 0, γ2 = 2).
$$\gamma_2 = \beta_2 - 3\,,$$
is defined such that it is equal to zero for the normal distribution, which is used as a reference (see Sect. 3.6.5).
3.2.6 Discussion
This relation is often used to estimate quickly the standard deviation $\sigma$ for an empirical distribution given in the form of a histogram. As seen from the examples in Fig. 3.6, which, for the same variance, differ widely in their f.w.h.m., this procedure may lead to wrong results for non-Gaussian distributions.
3.2.7 Examples
$$\langle t \rangle = \tau\,, \quad \langle t^2 \rangle = 2\tau^2\,, \quad \langle t^3 \rangle = 6\tau^3\,, \quad \sigma = \tau\,, \quad \gamma_1 = 2\,.$$
Example 12. Mean value of the volume of a sphere with a normally distributed
radius
The parameters $x_0, s$ of the normal distribution are simply the mean value and the standard deviation $\mu, \sigma$, and the p.d.f. with these parameters is abbreviated as $N(x|\mu, \sigma)$. We now assume that the radius $r$ of a sphere is smeared according to a normal distribution around the mean value $r_0$ with standard deviation $s$. This assumption is certainly only approximately valid for $r_0 \gg s$, since negative radii are of course impossible. Let us calculate the expected value of the volume $V(r) = \frac{4}{3}\pi r^3$:
$$\begin{aligned}
\langle V \rangle &= \int_{-\infty}^{\infty} dr\, V(r)\, f(r) \\
&= \frac{4\pi}{3\sqrt{2\pi}\,s}\int_{-\infty}^{\infty} dr\, r^3\, e^{-\frac{(r-r_0)^2}{2s^2}} \\
&= \frac{4\pi}{3\sqrt{2\pi}\,s}\int_{-\infty}^{\infty} dz\, (z + r_0)^3\, e^{-\frac{z^2}{2s^2}} \\
&= \frac{4\pi}{3\sqrt{2\pi}\,s}\int_{-\infty}^{\infty} dz\, (z^3 + 3z^2 r_0 + 3zr_0^2 + r_0^3)\, e^{-\frac{z^2}{2s^2}} \\
&= \frac{4}{3}\pi\left(r_0^3 + 3s^2 r_0\right)\,.
\end{aligned}$$
The mean volume is larger than the volume calculated using the mean radius.
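A quick Monte Carlo check of this result; the values r0 = 5 and s = 1 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
r0, s = 5.0, 1.0                      # illustrative mean radius and smearing
r = rng.normal(r0, s, 1_000_000)      # normally smeared radii

mc_mean = np.mean(4.0 / 3.0 * np.pi * r**3)
exact = 4.0 / 3.0 * np.pi * (r0**3 + 3 * s**2 * r0)
print(mc_mean, exact)                 # agree within Monte Carlo fluctuations
```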
w1 + w2 = 1 .
Player 1 gains the capital $K_2$ with probability $w_1$ and loses $K_1$ with probability $w_2$. Thus his mean gain is $w_1 K_2 - w_2 K_1$. The same is valid for player two, only with reversed sign. As both players play equally well, the expected gain should be zero for both:
$$w_1 K_2 - w_2 K_1 = 0\,,$$
$$w_1 = \frac{d_2}{d_1 + d_2}\,, \qquad w_2 = \frac{d_1}{d_1 + d_2}\,.$$
and
$$\sigma^2 = \int_{-0.5}^{0.5} t^2\,dt = \frac{1}{12}\,. \tag{3.16}$$
The root mean square measurement uncertainty (standard deviation) is $\sigma = 1\,\mathrm{s}/\sqrt{12} \approx 0.29\,\mathrm{s}$. The variance of a uniform distribution which covers a range $a$ is accordingly $\sigma^2 = a^2/12$. This result is widely used for the error estimation of digital measurements. A typical example from particle physics is the position measurement of ionizing particles with wire chambers.
3.3.1 Moments
and
$$\mu_n = E(x^n) = \sum_{k=1}^{\infty} x_k^n\, p(x_k)\,.$$
Apart from these moments, called moments about the origin, we consider also the moments about an arbitrary point $a$, where $x^n$ is replaced by $(x-a)^n$. Of special importance are the moments about the expected value of the distribution. They are called central moments.
Definition: The $n$-th central moment about $\mu = \mu_1$ of $f(x)$, $p(x)$ is
$$\mu'_n = E\left((x-\mu)^n\right) = \int_{-\infty}^{\infty} (x-\mu)^n f(x)\,dx\,,$$
respectively
$$\mu'_n = E\left((x-\mu)^n\right) = \sum_{k=1}^{\infty} (x_k - \mu)^n p(x_k)\,.$$
Accordingly, the first central moment is zero: $\mu'_1 = 0$. Generally, the moments are related to the expected values introduced before as follows:
First central moment: $\mu'_1 = 0$
Second central moment: $\mu'_2 = \sigma^2$
Third central moment: $\mu'_3 = \gamma_1\sigma^3$
Fourth central moment: $\mu'_4 = \beta_2\sigma^4$
Under conditions usually met in practice, a distribution is uniquely fixed by its moments. This means that if two distributions have the same moments in all orders, they are identical. We will present below plausibility arguments for the validity of this important assertion.
Setting $t = 0$, it follows that
$$\frac{d^n\phi(0)}{dt^n} = \int_{-\infty}^{\infty} (ix)^n f(x)\,dx = i^n\mu_n\,. \tag{3.20}$$
$$\phi(t) = \sum_{n=0}^{\infty} \frac{1}{n!}\, t^n\, \frac{d^n\phi(0)}{dt^n} = \sum_{n=0}^{\infty} \frac{1}{n!}\, (it)^n \mu_n\,, \tag{3.21}$$
$$\frac{d^n\phi'(0)}{dt^n} = i^n\mu'_n\,. \tag{3.23}$$
The Taylor expansion is
$$\phi'(t) = \sum_{n=0}^{\infty} \frac{1}{n!}\, (it)^n \mu'_n\,. \tag{3.24}$$
The results (3.20), (3.21), (3.23), (3.24) remain valid also for discrete distributions. The Taylor expansion of the right-hand side of relation (3.22) allows us to calculate the central moments from the moments about the origin and vice versa:
$$\mu'_n = \sum_{k=0}^{n}\binom{n}{k}(-1)^{k}\mu^{k}\,\mu_{n-k}\,, \qquad \mu_n = \sum_{k=0}^{n}\binom{n}{k}\mu^{n-k}\,\mu'_{k}\,.$$
Proof: Because of (3.10) we find for expected values
$$\phi_f(t) = E\left(e^{it(x+y)}\right) = E\left(e^{itx}e^{ity}\right) = E\left(e^{itx}\right)E\left(e^{ity}\right) = \phi_g(t)\,\phi_h(t)\,.$$
The third step requires the independence of the two variates. Applying the inverse Fourier transform to $\phi_f(t)$, we get
$$f(z) = \frac{1}{2\pi}\int e^{-itz}\phi_f(t)\,dt\,.$$
The solution of this integral is not always simple. For some functions it can be found in tables of the Fourier transform.
In the general case where $x$ is a linear combination of independent random variables, $x = \sum c_j x_j$, we find in an analogous way
$$\phi(t) = \prod_j \phi_j(c_j t)\,.$$
3.3.3 Cumulants
$$K(t) = \ln\phi(t) = \ln E\left(e^{itx}\right) = \kappa_1(it) + \kappa_2\frac{(it)^2}{2!} + \kappa_3\frac{(it)^3}{3!} + \cdots\,.$$
Since φ(0) = 1 there is no constant term. The coefficients κi , defined in this
way, are called cumulants or semiinvariants. The denotation semiinvariant
indicates that the cumulants κi , with the exception of κ1 , remain invariant
under the translations x → x + b of the variate x. Of course, the cumulant of
order i can be expressed by moments about the origin or by central moments
µk , µ′k up to the order i. We do not present the general analytic expressions
for the cumulants which can be derived from the power expansion of exp K(t)
and give only the remarkably simple relations for i ≤ 6 as a function of the
central moments:
$$\kappa_1 = \mu_1 \equiv \mu = \langle x \rangle\,, \qquad \kappa_2 = \mu'_2 \equiv \sigma^2 = \operatorname{var}(x)\,, \qquad \kappa_3 = \mu'_3\,,$$
$$\kappa_4 = \mu'_4 - 3\mu_2'^{\,2}\,, \qquad \kappa_5 = \mu'_5 - 10\mu'_2\mu'_3\,, \qquad \kappa_6 = \mu'_6 - 15\mu'_2\mu'_4 - 10\mu_3'^{\,2} + 30\mu_2'^{\,3}\,. \tag{3.26}$$
Besides expected value and variance, also skewness and excess are easily expressed by cumulants:
$$\gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}}\,, \qquad \gamma_2 = \frac{\kappa_4}{\kappa_2^{2}}\,. \tag{3.27}$$
Since the product of the characteristic functions $\phi(t) = \phi^{(1)}(t)\,\phi^{(2)}(t)$ turns into the sum $K(t) = K^{(1)}(t) + K^{(2)}(t)$, the cumulants are additive, $\kappa_i = \kappa_i^{(1)} + \kappa_i^{(2)}$. In the general case, where $x$ is a linear combination of independent variates, $x = \sum c_j x^{(j)}$, the cumulants of the resulting $x$ distribution, $\kappa_i$, are derived from those of the various $x^{(j)}$ distributions according to
$$\kappa_i = \sum_j c_j^{\,i}\,\kappa_i^{(j)}\,. \tag{3.28}$$
We have already met examples of this useful relation in Sect. 3.2.3, where we computed the variance of the distribution of a sum of variates. We will use it again in the discussion of the Poisson distribution in the following example and in Sect. 3.6.3.
3.3.4 Examples
$$\phi(t) = \sum_{k=0}^{\infty} \frac{1}{k!}\left(e^{it}\lambda\right)^k e^{-\lambda} = \exp\!\left(e^{it}\lambda\right) e^{-\lambda} = \exp\!\left(\lambda\left(e^{it} - 1\right)\right)\,,$$
$$\mu = \langle k \rangle = \lambda\,, \qquad \mu_2 = \langle k^2 \rangle = \lambda^2 + \lambda\,,$$
and the mean value and the standard deviation are given by
$$\langle k \rangle = \lambda\,, \qquad \sigma = \sqrt{\lambda}\,.$$
Expanding
$$K(t) = \ln\phi(t) = \lambda\left(e^{it} - 1\right) = \lambda\left[(it) + \frac{1}{2!}(it)^2 + \frac{1}{3!}(it)^3 + \cdots\right]\,,$$
for the cumulants we get the simple result
$$\kappa_1 = \kappa_2 = \kappa_3 = \cdots = \lambda\,.$$
The calculation of the lower central moments is then trivial. For example, skewness and excess are simply given by
$$\gamma_1 = \kappa_3/\kappa_2^{3/2} = 1/\sqrt{\lambda}\,, \qquad \gamma_2 = \kappa_4/\kappa_2^{2} = 1/\lambda\,.$$
we observe that $\phi(t)$ is just the characteristic function of the Poisson distribution $P_{\lambda_1+\lambda_2}(k)$. The sum of two Poisson-distributed variates is again Poisson distributed, the mean value being the sum of the mean values of the two original distributions. This property is sometimes called stability.
$$\frac{d\phi(t)}{dt} = \frac{i\lambda}{(\lambda - it)^2}\,,$$
$$\frac{d^n\phi(t)}{dt^n} = \frac{n!\, i^n\lambda}{(\lambda - it)^{n+1}}\,,$$
$$\frac{d^n\phi(0)}{dt^n} = \frac{n!\, i^n}{\lambda^n}\,,$$
$$\mu_n = n!\,\lambda^{-n}\,,$$
$$\mu = 1/\lambda\,,$$
In one of the examples of Sect. 3.2.7 we calculated the expected value of the energy from the distribution of the velocity. For certain applications, knowing the mean value of the energy may not be sufficient and its complete distribution may be required. To derive it, we have to perform a variable transformation.
For discrete distributions, this is a trivial exercise: the probability that the event "$u$ has the value $u(x_k)$" occurs, where $u$ is a uniquely invertible function of $x$, is of course the same as for "$x$ has the value $x_k$":
$$P\{u = u(x_k)\} = P\{x = x_k\}\,.$$
Fig. 3.8. Transformation of a probability density f(x) into g(u) given u(x). The shaded areas are equal.
$$P\{x_1 < x < x_2\} = \int_{x_1}^{x_2} f(x')\,dx' = \int_{u_1}^{u_2} g(u')\,du'\,.$$
Taking the absolute value guarantees the positivity of the probability density.
Integrating (3.29), we find numerical equality of the cumulative distribution
functions, F (x) = G(u(x)).
If u(x) is not a monotone function, then, contrary to the above assump-
tion, x(u) is not a unique function (Fig. 3.9) and we have to sum over the
contributions of the various branches of the inverse function:
$$g(u) = f(x)\left|\frac{dx}{du}\right|_{\mathrm{branch}\,1} + f(x)\left|\frac{dx}{du}\right|_{\mathrm{branch}\,2} + \cdots\,. \tag{3.30}$$
Fig. 3.9. Transformation of a p.d.f. f(x) into g(u) with u = x². The sum of the shaded areas below f(x) is equal to the shaded area below g(u).
Example 21. Calculation of the p.d.f. for the volume of a sphere from the p.d.f. of the radius
Given a uniform distribution for the radius $r$,
$$f(r) = \begin{cases} 1/(r_2 - r_1) & \text{if } r_1 < r < r_2\,, \\ 0 & \text{else}\,, \end{cases}$$
with
$$g(V) = f(r)\left|\frac{dr}{dV}\right|\,, \qquad \frac{dV}{dr} = 4\pi r^2\,,$$
we get
$$g(V) = \frac{1}{r_2 - r_1}\,\frac{1}{4\pi r^2} = \frac{1}{V_2^{1/3} - V_1^{1/3}}\,\frac{1}{3}\,V^{-2/3}\,.$$
Fig. 3.10. Transformation of a uniform distribution of the radius into the distribution of the volume of a sphere.
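A sketch that checks this transformation by simulation; the radius limits r1 = 4 and r2 = 8 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
r1, r2 = 4.0, 8.0                       # illustrative radius limits
r = rng.uniform(r1, r2, 1_000_000)      # uniform radii
V = 4.0 / 3.0 * np.pi * r**3            # transformed variable

V1, V2 = 4.0 / 3.0 * np.pi * r1**3, 4.0 / 3.0 * np.pi * r2**3
def g(V):                               # density of the volume derived above
    return (1.0 / 3.0) * V**(-2.0 / 3.0) / (V2**(1.0 / 3.0) - V1**(1.0 / 3.0))

counts, edges = np.histogram(V, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(counts - g(centers))))   # small: histogram follows g(V)
```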
Its kinetic energy is $E = mv^2/2$, for which we want to know the distribution $g(E)$. The function $v(E)$ has again two branches. We get, in complete analogy to the example above,
$$\left|\frac{dv}{dE}\right| = \frac{1}{\sqrt{2mE}}\,,$$
$$g(E) = \left.\frac{1}{2\sqrt{\pi k T E}}\, e^{-E/kT}\right|_{\mathrm{branch}\,1} + \cdots\Big|_{\mathrm{branch}\,2}\,.$$
The contributions of both branches are the same, hence
$$g(E) = \frac{1}{\sqrt{\pi k T E}}\, e^{-E/kT}\,.$$
F (x) = G(u) ,
G(u) = x,
u = G−1 (x) . (3.32)
Of course, we could also have used relation (3.32) directly. Obviously, in the last relation we could substitute $1 - x$ by $x$, since both quantities are uniformly distributed. When we transform the uniformly distributed random numbers $x$ delivered by our computer according to the last relation into the variable $u$, the latter will be exponentially distributed. This is the usual way to simulate the lifetime distribution of unstable particles and other decay processes (see Chap. 5).
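A minimal sketch of this inverse-transform recipe for the exponential case. The explicit formula u = -ln(1 - x)/λ is assumed here, since the corresponding equation is not part of this excerpt; λ = 2 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
lam = 2.0                              # illustrative decay constant
x = rng.uniform(0.0, 1.0, 100_000)     # uniform random numbers from the computer

u = -np.log(1.0 - x) / lam             # inverse of F(u) = 1 - exp(-lam*u)
print(u.mean(), 1.0 / lam)             # sample mean is close to 1/lambda
```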
$$F(\infty, \infty) = 1\,, \qquad F(-\infty, y) = F(x, -\infty) = 0\,,$$
$$f(x, y) = \frac{\partial^2 F}{\partial x\,\partial y}\,.$$
From these definitions follows the normalization condition
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1\,.$$
The projections $f_x(x)$ and $f_y(y)$ of the joint probability density onto the coordinate axes are called marginal distributions:
$$f_x(x) = \int_{-\infty}^{\infty} f(x, y)\,dy\,, \qquad f_y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx\,.$$
$$f_x(x|y) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\,dx} = \frac{f(x, y)}{f_y(y)}\,, \tag{3.34}$$
$$f_y(y|x) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\,dy} = \frac{f(x, y)}{f_x(x)}\,. \tag{3.35}$$
$f_y(y|x = 1)$ and $f(y, 1)$ differ in the normalization factor, which results from the requirement $\int f_y(y|1)\,dy = 1$.
Graphical Presentation
Fig. 3.11 shows a similar superposition of two Gaussians together with its
marginal distributions and one conditional distribution. The chosen form of
the graphical representation as a contour plot for two-dimensional distribu-
tions is usually to be favored over three-dimensional surface plots.
3.5.2 Moments
Fig. 3.11. Two-dimensional probability density. The lower left-hand plot shows the conditional p.d.f. of y for x = 2. The lower curve is the p.d.f. f(y, 2). It corresponds to the dashed line in the upper plot. The right-hand side displays the marginal distributions.
$$\begin{aligned}
\mu_x &= E(x)\,, \\
\mu_y &= E(y)\,, \\
\sigma_x^2 &= E\left((x - \mu_x)^2\right)\,, \\
\sigma_y^2 &= E\left((y - \mu_y)^2\right)\,, \\
\sigma_{xy} &= E\left[(x - \mu_x)(y - \mu_y)\right]\,, \\
\mu_{lm} &= E(x^l y^m)\,, \\
\mu'_{lm} &= E\left((x - \mu_x)^l (y - \mu_y)^m\right)\,.
\end{aligned}$$
Explicitly,
$$\mu_x = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} x\,f_x(x)\,dx\,,$$
$$\mu_y = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\,f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} y\,f_y(y)\,dy\,,$$
$$\mu_{lm} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^l y^m f(x, y)\,dx\,dy\,,$$
$$\mu'_{lm} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_x)^l (y - \mu_y)^m f(x, y)\,dx\,dy\,.$$
The mixed moment $\sigma_{xy}$ is called the covariance of $x$ and $y$, and is sometimes also denoted as $\operatorname{cov}(x, y)$. If $\sigma_{xy}$ is different from zero, the variables $x$ and $y$ are said to be correlated. The mean value of $y$ depends on the value chosen for $x$ and vice versa. Thus, for instance, the weight of a man is positively correlated with his height.
The degree of correlation is quantified by the dimensionless quantity
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x\sigma_y}\,,$$
variates $x, y$ is independent of the value of the other one, i.e. the conditional distributions are equal to the marginal distributions, $f_x(x|y) = f_x(x)$, $f_y(y|x) = f_y(y)$. Independence is realized only if the joint distribution $f(x, y)$ factorizes into its marginal distributions (see Chap. 2):
we find $\sigma_{xy} = 0$. The curves $f = \text{const.}$ are circles, but $x$ and $y$ are not independent: the conditional distribution $f_y(y|x)$ of $y$ depends on $x$.
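A small numerical illustration of the distinction between uncorrelatedness and independence. Instead of the rotationally symmetric example of the text, the sketch uses the simpler construction y = x² with x standard normal (an assumption made for illustration), which also has vanishing covariance although y is completely determined by x.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
x = rng.normal(0.0, 1.0, 1_000_000)
y = x**2                                   # fully determined by x, hence dependent

cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / (x.std() * y.std())
print(rho)   # close to zero: uncorrelated, although y depends on x
```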
The probability densities $f(x, y)$ and $g(u, v)$ are related, given the transformation functions $u(x, y)$, $v(x, y)$, analogously to the univariate case:
$$g(u, v)\,du\,dv = f(x, y)\,dx\,dy\,,$$
$$g(u, v) = f(x, y)\left|\frac{\partial(x, y)}{\partial(u, v)}\right|\,,$$
with the Jacobian determinant replacing the differential quotient $dx/du$.
$$x = r\cos\varphi\,, \qquad y = r\sin\varphi\,.$$
The Jacobian is
$$\left|\frac{\partial(x, y)}{\partial(r, \varphi)}\right| = r\,.$$
We get
$$g(r, \varphi) = \frac{1}{2\pi}\,r\,e^{-r^2/2}$$
with the marginal distributions
$$g_r = \int_0^{2\pi} g(r, \varphi)\,d\varphi = r\,e^{-r^2/2}\,, \qquad g_\varphi = \int_0^{\infty} g(r, \varphi)\,dr = \frac{1}{2\pi}\,.$$
The joint distribution factorizes into its marginal distributions (Fig. 3.13). Not only $x, y$, but also $r, \varphi$ are independent.
In most cases, the choice $v = x$ is suitable. More formally, we might use the equivalent reduction formula
$$g(u) = \int_{-\infty}^{\infty} f(x, y)\,\delta\left(u - u(x, y)\right)\,dx\,dy\,. \tag{3.37}$$
[Fig. 3.13: panels a) to d) with the marginal distributions f_x(x), f_y(y), g_r(r), and g_φ(φ).]
t = t1 − t2 ,
t1 = t1
Fig. 3.14. Distribution of the difference t between two times t1 and t2, which both have true clock readings equal to zero.
and has the boundaries shown in Fig. 3.14. The form of the marginal distribution is found by integration over $t_1$, or directly by reading it off from the figure:
$$g(t) = \begin{cases} (t - T + \Delta)/\Delta^2 & \text{for } -\Delta < t - T < 0\,, \\ (-t + T + \Delta)/\Delta^2 & \text{for } 0 < t - T < \Delta\,, \\ 0 & \text{else}\,, \end{cases}$$
where $T = T_1 - T_2$, now for arbitrary values of $T_1$ and $T_2$.
$$p_x = \sqrt{q}\,\cos\varphi\,, \qquad p_y = \sqrt{q}\,\sin\varphi$$
with
$$\left|\frac{\partial(p_x, p_y)}{\partial(q, \varphi)}\right| = \frac{1}{2}$$
and obtain
$$h(q, \varphi) = \frac{1}{4\pi s^2}\, e^{-q/(2s^2)}$$
with the marginal distribution
$$h_q(q) = \int_0^{2\pi} \frac{1}{4\pi s^2}\, e^{-q/(2s^2)}\,d\varphi = \frac{1}{2s^2}\, e^{-q/(2s^2)}\,,$$
$$g(p^2) = \frac{1}{\langle p^2 \rangle}\, e^{-p^2/\langle p^2 \rangle}\,.$$
$$f(x, y) = f(x)f(y) = \frac{1}{2\pi}\exp\!\left(-\frac{x^2 + y^2}{2}\right)\,,$$
we want to find the distribution $g(u)$ of the quotient $u = y/x$. Again, we transform first into new variates $u = y/x$, $v = x$, or, inverted, $x = v$, $y = uv$, and get
$$h(u, v) = f\left(x(u, v), y(u, v)\right)\left|\frac{\partial(x, y)}{\partial(u, v)}\right|\,,$$
with the Jacobian
$$\frac{\partial(x, y)}{\partial(u, v)} = -v\,,$$
hence
$$g(u) = \int h(u, v)\,dv = \frac{1}{2\pi}\int_{-\infty}^{\infty}\exp\!\left(-\frac{v^2 + u^2 v^2}{2}\right)|v|\,dv = \frac{1}{\pi}\int_0^{\infty} e^{-(1+u^2)z}\,dz = \frac{1}{\pi}\,\frac{1}{1 + u^2}\,,$$
where the substitution $z = v^2/2$ has been used. The result $g(u)$ is the Cauchy distribution (see Sect. 3.6.9). Its long tails are caused here by the finite probability of arbitrarily small values in the denominator. This effect is quite important in experimental situations when we estimate the uncertainty of quantities which are quotients of normally distributed variates in cases where the p.d.f. in the denominator is not negligible at the value zero.
The few examples given above should not lead to the impression that transformations of variates always yield more or less simple analytical expressions for the resulting distributions. This is rather the exception. However, as we will learn in Chap. 5, a simple, straightforward numerical solution is provided by Monte Carlo methods.
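As an illustration of such a numerical solution, the following sketch generates the quotient u = y/x of two standard normal variates and compares a histogram with the Cauchy density 1/(π(1 + u²)) derived above; sample size, binning and the restricted range are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
x = rng.normal(size=1_000_000)
y = rng.normal(size=1_000_000)
u = y / x                                      # quotient of two normal variates

# histogram restricted to |u| < 10; renormalize the Cauchy density to that window
counts, edges = np.histogram(u, bins=100, range=(-10, 10), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
frac = 2.0 / np.pi * np.arctan(10.0)           # Cauchy probability inside the window
cauchy = 1.0 / (np.pi * (1.0 + centers**2)) / frac
print(np.max(np.abs(counts - cauchy)))         # small residual (statistics and binning)
```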
$$\int_0^{\rho} \rho'\,e^{-\rho'^2/2}\,d\rho' = r_1\,,$$
$$\left[-e^{-\rho'^2/2}\right]_0^{\rho} = r_1\,,$$
$$1 - e^{-\rho^2/2} = r_1\,,$$
$$\rho = \sqrt{-2\ln(1 - r_1)}\,,$$
$$\varphi = 2\pi r_2\,.$$
These variables are independent and distributed normally about the origin with variance unity:
$$f(x, y) = \frac{1}{2\pi}\, e^{-(x^2 + y^2)/2}\,.$$
(We could replace $1 - r_1$ by $r_1$, since $1 - r_1$ is uniformly distributed as well.)
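A sketch of this polar recipe in code. It assumes, as is standard although the corresponding line is not part of this excerpt, that the Cartesian coordinates are recovered as x = ρ cos φ, y = ρ sin φ.

```python
import numpy as np

rng = np.random.default_rng(seed=8)
n = 1_000_000
r1 = rng.uniform(size=n)
r2 = rng.uniform(size=n)

rho = np.sqrt(-2.0 * np.log(1.0 - r1))   # radius, from the inverted integral above
phi = 2.0 * np.pi * r2                   # uniform azimuthal angle
x, y = rho * np.cos(phi), rho * np.sin(phi)

print(x.mean(), x.std(), y.std())        # approx. 0, 1, 1: a standard normal pair
```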
It is not difficult to generalize the relations just derived for two variables to multivariate distributions of $N$ variables. We define the distribution function $F(x_1, \ldots, x_N)$ as the probability to find values of the variates smaller than $x_1, \ldots, x_N$,
$$\mathbf{x} = \{x_1, x_2, \ldots, x_N\}\,.$$
$$f(x_1, \ldots, x_N)\,dx_1\cdots dx_N = dP\left\{\left(x_1 - \tfrac{dx_1}{2} \le x'_1 \le x_1 + \tfrac{dx_1}{2}\right) \cap \cdots \cap \left(x_N - \tfrac{dx_N}{2} \le x'_N \le x_N + \tfrac{dx_N}{2}\right)\right\}\,.$$
Because of the additivity of expected values this relation also holds for vector
functions u(x).
The dispersion of multivariate distributions is now described by the so-
called covariance matrix C:
Transformation of Variables
$$g(\mathbf{y}) = f(\mathbf{x})\left|\frac{\partial(x_1, \ldots, x_N)}{\partial(y_1, \ldots, y_N)}\right|\,.$$
8. We omit the formulas because they are very clumsy.
x = r cos ϕ ,
y = r sin ϕ
For an isotropic distribution all angles are equally likely and we obtain the uniform distribution of $\varphi$,
$$g(\varphi) = \frac{1}{2\pi}\,.$$
Since we have to deal with periodic functions, we have to be careful when we compute moments and, in general, expected values. For example, the mean of the two angles $\varphi_1 = \pi/2$, $\varphi_2 = -\pi$ is not $(\varphi_1 + \varphi_2)/2 = -\pi/4$, but $3\pi/4$. To avoid this kind of mistake it is advisable to go back to the unit vectors $\{x_i, y_i\} = \{\cos\varphi_i, \sin\varphi_i\}$, to average those, and to extract the resulting angle.
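A sketch of this unit-vector averaging for the two angles quoted in the text; numpy's arctan2 recovers the angle of the averaged vector.

```python
import numpy as np

phi = np.array([np.pi / 2, -np.pi])          # the two angles from the text
x_mean = np.mean(np.cos(phi))                # average the unit vectors
y_mean = np.mean(np.sin(phi))

print(np.arctan2(y_mean, x_mean))            # 3*pi/4 = 2.356..., not -pi/4
print((phi[0] + phi[1]) / 2)                 # naive arithmetic mean: -pi/4
```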
Spatial directions are described by the polar angle $\theta$ and the azimuthal angle $\varphi$, which we define through the transformation relations from the Cartesian coordinates:
$$x = r\sin\theta\cos\varphi\,, \quad -\pi \le \varphi \le \pi\,,$$
$$y = r\sin\theta\sin\varphi\,, \quad 0 \le \theta \le \pi\,,$$
$$z = r\cos\theta\,.$$
$$f(r, \theta, \varphi) = \frac{1}{(2\pi)^{3/2}\sigma^3}\, r^2\sin\theta\, \exp\!\left(-\frac{r^2 + r_0^2 - 2 r r_0\cos\theta}{2\sigma^2}\right)\,.$$
For fixed distance $r$ we obtain a function of $\theta$ and $\varphi$ only, which for our choice of $r_0$ is also independent of $\varphi$. The parameter $\kappa$ is again given by $\kappa = r r_0/\sigma^2$. Applying the normalization condition $\int g\,d\theta\,d\varphi = 1$ we find $c_N(\kappa) = \kappa/(4\pi\sinh\kappa)$ and
$$g(\theta, \varphi) = \frac{\kappa}{4\pi\sinh\kappa}\, e^{\kappa\cos\theta}\sin\theta\,, \tag{3.42}$$
a two-dimensional, unimodal distribution, known as Fisher's spherical distribution. As in the previous example we get in the limit $\kappa \to 0$ the uniform distribution (3.41) and for large $\kappa$ the asymptotic distribution
$$\tilde{g}(\theta, \varphi) \approx \frac{1}{4\pi}\,\kappa\theta\, e^{-\kappa\theta^2/2}\,,$$
which is an exponential distribution of $\theta^2$. As a function of $\tilde{z} = \cos\theta$ the distribution (3.42) simplifies to
$$h(\tilde{z}, \varphi) = \frac{\kappa}{4\pi\sinh\kappa}\, e^{\kappa\tilde{z}}\,,$$
which illustrates the spatial shape of the distribution much better than (3.42).
What is the probability to get with ten dice just two times a six? The answer is given by the binomial distribution:
$$B_{1/6}^{10}(2) = \binom{10}{2}\left(\frac{1}{6}\right)^2\left(1 - \frac{1}{6}\right)^{8}\,.$$
The probability to get a six with 2 particular dice, and not the number six with the remaining 8 dice, is given by the product of the two power factors. The binomial coefficient
$$\binom{10}{2} = \frac{10!}{2!\,8!}$$
counts the number of possibilities to distribute the 2 sixes over the 10 dice. These are just 45. With the above formula we obtain a probability of about 0.29.
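The probability quoted above can be reproduced with a few lines of Python (a sketch using only the standard library):

```python
from math import comb

p = 1.0 / 6.0
prob = comb(10, 2) * p**2 * (1.0 - p)**8
print(comb(10, 2), round(prob, 4))   # 45, approx. 0.2907
```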
Considering, more generally, $n$ randomly chosen objects (or a sequence of $n$ independent trials), which have with probability $p$ the property $A$, which we will call success, the probability to find $k$ out of these $n$ objects with property $A$ is $B_p^n(k)$,
$$B_p^n(k) = \binom{n}{k} p^k (1-p)^{n-k}\,, \qquad k = 0, \ldots, n\,.$$
Since this is just the term of order $p^k$ in the power expansion of $[p + (1-p)]^n$, we have the normalization condition
$$[p + (1-p)]^n = 1\,, \tag{3.43}$$
$$\sum_{k=0}^{n} B_p^n(k) = 1\,.$$
$$E(k) = np\,.$$
With a similar argument we can find the variance: for $n = 1$ we can directly compute the expected quadratic difference, i.e. the variance $\sigma_1^2$. Using $\langle k \rangle = p$ and that $k = 1$ is found with probability $P\{1\} = p$ and $k = 0$ with $P\{0\} = 1 - p$, we find
$$\sigma^2 = n\sigma_1^2 = np(1-p)\,.$$
$$\gamma_1 = \frac{1 - 2p}{\sqrt{np(1-p)}}\,, \qquad \gamma_2 = \frac{1 - 6p(1-p)}{np(1-p)}\,.$$
The variance for each single term in the numerator is $w_i^2\,\varepsilon_i(1 - \varepsilon_i)$. Then the variance $\sigma_T^2$ of $\varepsilon_T$ becomes
$$\sigma_T^2 = \frac{\sum w_i^2\,\varepsilon_i(1 - \varepsilon_i)}{\left(\sum w_i\right)^2}\,.$$
When a single experiment or trial has not only two, but $N$ possible outcomes with probabilities $p_1, p_2, \ldots, p_N$, the probability to observe in $n$ experiments $k_1, k_2, \ldots, k_N$ trials belonging to the outcomes $1, \ldots, N$ is equal to
$$M^n_{p_1, p_2, \ldots, p_N}(k_1, k_2, \ldots, k_N) = \frac{n!}{\prod_{i=1}^{N} k_i!}\prod_{i=1}^{N} p_i^{k_i}\,,$$
where $\sum_{i=1}^{N} p_i = 1$ and $\sum_{i=1}^{N} k_i = n$ are satisfied. Hence we have $N - 1$ independent variates. The value $N = 2$ reproduces the binomial distribution.
In complete analogy to the binomial distribution, the multinomial distribution may be generated by expanding the multinomial expression
$$(p_1 + p_2 + \ldots + p_N)^n = 1\,.$$
Its characteristic function is
$$\phi(t_1, \ldots, t_{N-1}) = \left(1 + \sum_{i=1}^{N-1} p_i\left(e^{it_i} - 1\right)\right)^{n}\,.$$
$$P_\lambda(k) = \frac{\lambda^k}{k!}\, e^{-\lambda}\,.$$
Occasionally we will also use the notation $P(k|\lambda)$. The expected value and variance have already been calculated above:
$$E(k) = \lambda\,, \qquad \operatorname{var}(k) = \lambda\,.$$
The characteristic function and cumulants have also been derived in Sect. 3.3.2:
$$\phi(t) = \exp\!\left(\lambda\left(e^{it} - 1\right)\right)\,, \tag{3.45}$$
$$\kappa_i = \lambda\,, \quad i = 1, 2, \ldots\,.$$
Skewness and excess,
$$\gamma_1 = \frac{1}{\sqrt{\lambda}}\,, \qquad \gamma_2 = \frac{1}{\lambda}\,,$$
decrease with $\lambda$ and indicate that the distribution approaches the normal distribution ($\gamma_1 = 0$, $\gamma_2 = 0$) with increasing $\lambda$ (see Fig. 3.15).
The Poisson distribution itself can be considered as the limiting case of a binomial distribution with $np = \lambda$, where $n$ approaches infinity ($n \to \infty$) and, at the same time, $p$ approaches zero, $p \to 0$. The corresponding limit of the characteristic function of the binomial distribution (3.44) produces the
[Fig. 3.15: Poisson distributions for several mean values, up to λ = 20 and λ = 100.]
Mean value and variance are $\langle x \rangle = \xi$ and $\sigma^2 = \alpha^2/12$, respectively. The characteristic function is
$$\phi(t) = \frac{1}{\alpha}\int_{\xi - \alpha/2}^{\xi + \alpha/2} e^{itx}\,dx = \frac{2}{\alpha t}\sin\!\left(\frac{\alpha t}{2}\right) e^{i\xi t}\,. \tag{3.47}$$
Using the power expansion of the sine function we find from (3.47) for $\xi = 0$ the even moments (the odd moments vanish):
$$\mu'_{2k} = \frac{1}{2k+1}\left(\frac{\alpha}{2}\right)^{2k}\,, \qquad \mu'_{2k-1} = 0\,.$$
The uniform distribution is the basis for the computer simulation of all
other distributions because random number generators for numbers uniformly
distributed between 0 and 1 are implemented on all computers used for sci-
entific purposes. We will discuss simulations in some detail in Chap. 5.
The normal distribution in two dimensions with its maximum at the origin has the general form
$$N_0(x, y) = \frac{1}{2\pi s_x s_y\sqrt{1 - \rho^2}}\exp\!\left(-\frac{1}{2(1 - \rho^2)}\left(\frac{x^2}{s_x^2} - 2\rho\frac{xy}{s_x s_y} + \frac{y^2}{s_y^2}\right)\right)\,. \tag{3.48}$$
The notation has been chosen such that it indicates the moments:
$$\langle x^2 \rangle = s_x^2\,, \qquad \langle y^2 \rangle = s_y^2\,, \qquad \langle xy \rangle = \rho s_x s_y\,.$$
$$\frac{1}{1 - \rho^2}\left(\frac{x^2}{s_x^2} - 2\rho\frac{xy}{s_x s_y} + \frac{y^2}{s_y^2}\right) = \text{const}$$
describe concentric ellipses. For the special choice $\text{const} = 1$ we show the ellipse in Fig. 3.17. At this so-called error ellipse the value of the p.d.f. is just $N_0(0, 0)/\sqrt{e}$, i.e. reduced with respect to the maximum by a factor $1/\sqrt{e}$.
By a simple rotation we achieve uncorrelated variables $x'$ and $y'$:
$$x' = x\cos\phi + y\sin\phi\,,$$
$$y' = -x\sin\phi + y\cos\phi\,,$$
$$\tan 2\phi = \frac{2\rho s_x s_y}{s_x^2 - s_y^2}\,.$$
The half-axes, i.e. the variances $s_x'^2$ and $s_y'^2$ of the uncorrelated variables $x'$ and $y'$, are
$$V = C^{-1}$$
with constant $c$. The property (3.50) is found only for the exponential function $f(t) = a e^{bt}$. If we require that the probability density be normalized, we get
$$f(t) = \lambda e^{-\lambda t}\,.$$
This result could also have been derived by differentiating (3.49) and solving the corresponding differential equation $f = -c\,df/dt$.
The characteristic function
$$\phi(t) = \frac{\lambda}{\lambda - it}$$
and the moments
$$\mu_n = n!\,\lambda^{-n}$$
have already been derived in Example 20 in Sect. 3.3.4.
$$\chi^2 = \sum_{i=1}^{f} \frac{x_i^2}{\sigma_i^2}\,,$$
where the $x_i$ are independent, normally distributed variates with zero mean and variance $\sigma_i^2$.
We have already come across the simplest case with $f = 1$ in Sect. 3.4.1: the transformation of a normally distributed variate $x$ with expected value zero to $u = x^2/s^2$, where $s^2$ is the variance, yields
$$g_1(u) = \frac{1}{\sqrt{2\pi u}}\, e^{-u/2} \qquad (f = 1)\,.$$
[Figure: χ² distributions for f = 1, 2, 4, 8, and 16 degrees of freedom.]
If the variates $x_i$ of the sample are distributed normally with mean $x_0$ and variance $\sigma^2$, then $Nv^2/\sigma^2$ follows a $\chi^2$ distribution with $f = N - 1$ degrees of freedom. We omit the formal proof; the result is plausible, however, from the expected values derived in Sect. 3.2.3:
$$\langle v^2 \rangle = \frac{N-1}{N}\,\sigma^2\,, \qquad \left\langle \frac{Nv^2}{\sigma^2} \right\rangle = N - 1\,.$$
In Sect. 6.7 we will discuss the method of least squares for parameter estimation. To adjust a curve to measured points $x_i$ with Gaussian errors $\sigma_i$ we minimize the quantity
$$\chi^2 = \sum_{i=1}^{N} \frac{\left(x_i - t_i(\lambda_1, \ldots, \lambda_Z)\right)^2}{\sigma_i^2}\,,$$
where the $t_i$ are the ordinates of the curve depending on the $Z$ free parameters $\lambda_k$. Large values of $\chi^2$ signal a bad agreement between measured values and the fitted curve. If the predictions $t_i$ depend linearly on the parameters, the sum $\chi^2$ obeys a $\chi^2$ distribution with $f = N - Z$ degrees of freedom. The reduction of $f$ accounts for the fact that the expected value of $\chi^2$ is reduced when we allow for free parameters. Indeed, for $Z = N$ we could adjust the parameters such that $\chi^2$ would vanish.
Generally, in statistics the term degrees of freedom¹¹, $f$, denotes the number of independent predictions. For $N = Z$ we have no prediction for the observations $x_i$. For $Z = 0$ we predict all $N$ observations, $f = N$. When we fit a straight line through 3 points with given abscissa and observed ordinate, we have $N = 3$ and $Z = 2$ because the line contains 2 parameters. The corresponding $\chi^2$ distribution has 1 degree of freedom. The quantity $Z$ is called the number of constraints, a somewhat misleading term. In the case of the sample width discussed above, one quantity, the mean, is adjusted. Consequently, we have $Z = 1$ and the sample width follows a $\chi^2$ distribution of $f = N - 1$ degrees of freedom.
11. Often the notation number of degrees of freedom, abbreviated n.d.f. or NDF, is used in the literature.
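A sketch of the straight-line example discussed above: a least-squares fit of a line to three points with unit errors leaves f = N - Z = 1 degree of freedom for the resulting χ². The data values are invented for illustration, and numpy's polyfit is used as a generic linear least-squares routine.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])        # given abscissae
y = np.array([1.1, 1.9, 3.2])        # observed ordinates (illustrative)
sigma = np.ones_like(y)              # unit Gaussian errors

slope, intercept = np.polyfit(x, y, deg=1)   # linear least squares, Z = 2 parameters
chi2 = np.sum(((y - (slope * x + intercept)) / sigma) ** 2)
ndf = len(x) - 2                     # N - Z = 1 degree of freedom
print(chi2, ndf)
```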
Forming the $N$-fold product, and using the scaling rule for Fourier transforms (3.19), $\phi_{x/N}(t) = \phi_x(t/N)$, we arrive at the characteristic function of a gamma distribution with scaling parameter $N\lambda$ and shape parameter $N$:
$$\phi_{\bar{x}}(t) = \left(1 - \frac{it}{N\lambda}\right)^{-N}\,. \tag{3.53}$$
Thus the p.d.f. $f(\bar{x})$ is equal to $G(\bar{x}|N, N\lambda)$. Considering the limit for large $N$, we convince ourselves of the validity of the law of large numbers and the central limit theorem. From (3.53) we derive
$$\ln\phi_{\bar{x}}(t) = -N\ln\!\left(1 - \frac{it}{N\lambda}\right) = -N\left[-\frac{it}{N\lambda} - \frac{1}{2}\left(\frac{it}{N\lambda}\right)^2 + O(N^{-3})\right]\,,$$
$$\phi_{\bar{x}}(t) = \exp\!\left(i\frac{1}{\lambda}\,t - \frac{1}{2}\frac{1}{N\lambda^2}\,t^2 + O\!\left(N^{-2}\right)\right)\,.$$
When $N$ is large, the term of order $N^{-2}$ can be neglected and with the two remaining terms in the exponent we get the characteristic function of a normal distribution with mean $\mu = 1/\lambda = \langle x \rangle$ and variance $\sigma^2 = 1/(N\lambda^2) = \operatorname{var}(x)/N$ (see 3.3.4), in agreement with the central limit theorem. If $N$ approaches infinity, only the first term remains and we obtain the characteristic function of a delta distribution $\delta(1/\lambda - \bar{x})$. This result is predicted by the law of large numbers (see Appendix 13.1). This law states that, under certain conditions, with increasing sample size the difference between the sample mean and the population mean approaches zero.
Fig. 3.19. Lorentz distribution with mean equal to 1 and halfwidth Γ/2 = 0.2.
The sample mean has the same distribution as the original population. It
is therefore, as already stated above, not suited for the estimation of the
location parameter.
Note that the distribution is defined only for positive x, while u0 can also be negative.
[Figure: f(x) for s = 0.1, 0.2, 0.5, and 1.]
$$s^2 = \frac{1}{N(N-1)}\sum_{i=1}^{N} (x_i - \bar{x})^2\,.$$
The sum on the right-hand side, after division by the variance $\sigma^2$ of the Gaussian, follows a $\chi^2$ distribution with $f = N - 1$ degrees of freedom, see (3.51). Dividing also the numerator of (3.56) by its standard deviation $\sigma/\sqrt{N}$, it follows a normal distribution of variance unity. Thus the variable $t$ of the $t$ distribution is the quotient of a normal variate and the square root of a $\chi^2$ variate.
The analytical form of the p.d.f. can be found by the standard method used in Sect. 3.5.4. The result is
$$h(t|f) = \frac{\Gamma\!\left((f+1)/2\right)}{\Gamma(f/2)\sqrt{\pi f}}\left(1 + \frac{t^2}{f}\right)^{-\frac{f+1}{2}}\,.$$
They exist only for i ≤ f − 1. The variance for f ≥ 3 is σ 2 = f /(f − 2), the
excess for f ≥ 5 is γ2 = 6/(f − 4), disappearing for large f , in agreement
with the fact that the distribution approaches the normal distribution.
The typical field of application for the t distribution is the derivation
of tests or confidence intervals in cases where a sample is supposed to be
taken from a normal distribution of unknown variance but known mean µ.
Qualitatively, very large absolute values of t indicate that the sample mean
Fig. 3.21. Student's distributions for 1, 2, 5 degrees of freedom and normal distribution.
The family of extreme value distributions is relevant for the following type
of problem: Given a sample taken from a certain distribution, what can be
said about the distribution of its maximal or minimal value? It is found that
these distributions converge with increasing sample size to distributions of
the types given below.
This distribution has been studied in connection with the lifetime of complex aggregates. It is a limiting distribution for the smallest member of a sample taken from a distribution limited from below. The p.d.f. is
$$f(x|a, p) = \frac{p}{a}\left(\frac{x}{a}\right)^{p-1}\exp\!\left(-\left(\frac{x}{a}\right)^{p}\right)\,, \qquad x > 0\,, \tag{3.57}$$
with the positive scale and shape parameters $a$ and $p$. The mode is
$$x_m = a\left(\frac{p-1}{p}\right)^{1/p} \quad \text{for } p \ge 1\,,$$
mean value and variance are
$$\mu = a\,\Gamma(1 + 1/p)\,, \qquad \sigma^2 = a^2\left[\Gamma(1 + 2/p) - \Gamma^2(1 + 1/p)\right]\,.$$
The moments are
$$\mu_i = a^i\,\Gamma(1 + i/p)\,.$$
For $p = 1$ we get an exponential distribution with decay constant $1/a$.
Fig. 3.23. Lifetime distribution, original (solid line) and measured with Gaussian resolution (dashed line).
The relation (3.60) has the form of a convolution and is closely related to the
marginalization of a two-dimensional distribution of x and λ. A compound
distribution may also have the form of (3.58) or (3.59) where the weights are
independently randomly distributed.
Frequently we measure a statistical quantity with a detector that has a limited resolution. Then the probability to observe the value $x'$ is a random distribution $R(x'|x)$ depending on the undistorted value $x$, which itself is distributed according to a distribution $g(x)$. (In this context the notation is usually different from the one used in (3.60).) We have the convolution
$$f(x') = \int_{-\infty}^{\infty} R(x', x)\,g(x)\,dx\,.$$
resulting in
1 ′ 1 2 2 −t′ + σ 2 γ
f (t ) = γ exp−γt − 2 σ γ erfc
′
√ .
2 2σ
The result for γ = 1, σ = 0.5 is shown in Fig. 3.23. Except for small values of
t, the observed time is shifted to larger values. In the asymptotic limit t → ∞
the integral becomes a constant and the distribution f∞ (t′ ) is exponentially
decreasing with the same slope γ as the undistorted distribution:
f_∞(t′) ∝ γ e^{−γt′} .
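The closed-form expression above can be checked against a brute-force numerical convolution. The following Python sketch uses the values γ = 1, σ = 0.5 quoted for Fig. 3.23; the integration grid is an arbitrary choice.

import math
import numpy as np

gamma_, sigma = 1.0, 0.5   # values used for Fig. 3.23

def f_closed(tp):
    # exponential decay smeared by a Gaussian resolution, erfc form
    return 0.5 * gamma_ * math.exp(-gamma_ * tp + 0.5 * sigma**2 * gamma_**2) \
        * math.erfc((sigma**2 * gamma_ - tp) / (math.sqrt(2.0) * sigma))

def f_numeric(tp, n=20000):
    # crude numerical convolution over the true decay time t > 0
    t = np.linspace(0.0, tp + 10.0 * sigma + 20.0 / gamma_, n)
    dt = t[1] - t[0]
    decay = gamma_ * np.exp(-gamma_ * t)
    gauss = np.exp(-(tp - t) ** 2 / (2.0 * sigma**2)) / (math.sqrt(2.0 * math.pi) * sigma)
    return float(np.sum(decay * gauss) * dt)

for tp in (0.0, 0.5, 2.0, 5.0):
    print(tp, f_closed(tp), f_numeric(tp))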
varies from event to event. We can estimate the true number of decays by
weighting each observation with the inverse of its detection probability. Some-
times weighting is used to measure the probability that an event belongs to
a certain particle type. Weighted events play also a role in some Monte Carlo
integration methods and in parameter inference (see Chap. 6, Sect. 7.3), if
weighted observations are summed up in histogram bins.
The CPD also describes the sum x = Σ_{i=1}^{N} n_i w_i, if there is a given discrete weight distribution w_i, i = 1, 2, ..., N, and the numbers n_i are Poisson distributed. The equivalence of the two definitions of the CPD is shown in
Appendix 13.11.1. In Ref. [26] some properties of the compound Poisson distribution and the treatment of samples of weighted events are described.
The CPD does not have a simple analytic expression. However, the cumulants
and thus also the moments of the distribution can be calculated exactly.
Let us consider the definite case that on average λ1 observations are ob-
tained with probability ε1 and λ2 observations with probability ε2 . We correct
the losses by weighting the observed numbers with w1 = 1/ε1 and w2 = 1/ε2 .
For the Poisson-distributed numbers k1 , k2
P_{λ1}(k1) = (λ1^{k1}/k1!) e^{−λ1} ,
P_{λ2}(k2) = (λ2^{k2}/k2!) e^{−λ2} ,
the weighted sum
k = w1 k1 + w2 k2
has (with λ = λ1 + λ2 and ⟨w^i⟩ = (w1^i λ1 + w2^i λ2)/λ) the mean and variance
µ = w1 λ1 + w2 λ2 = λ⟨w⟩ ,      (3.61)
σ² = w1² λ1 + w2² λ2 = λ⟨w²⟩ .      (3.62)
The skewness and excess are
γ1 = κ3/κ2^{3/2} = (w1³ λ1 + w2³ λ2)/(w1² λ1 + w2² λ2)^{3/2} = ⟨w³⟩/(λ^{1/2} ⟨w²⟩^{3/2}) ,      (3.64)
γ2 = κ4/κ2² = (w1⁴ λ1 + w2⁴ λ2)/(w1² λ1 + w2² λ2)² = ⟨w⁴⟩/(λ ⟨w²⟩²) .      (3.65)
λhw2 i2
The formulas can easily be generalized to more than two Poisson distri-
butions and to a continuous weight distribution (see Appendix 13.11.1). The
relations (3.61), (3.62), (3.64), (3.65) remain valid.
In particular, for a CPD with a weight distribution with second moment E(w²) and expected number of weights λ, the variance of the sum of the weights is λE(w²), as indicated by (3.62).
For large values of λ the CPD can be approximated by a normal distri-
bution or by a scaled Poisson distribution (see Appendix 13.11.1).
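Relations (3.61) and (3.62) are easy to verify by simulation. The following Python sketch uses two event classes with assumed efficiencies 0.5 and 0.25, corrected by the weights w = 1/ε; all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(2)

lam1, lam2 = 30.0, 20.0
w1, w2 = 1.0 / 0.5, 1.0 / 0.25       # weights w = 1/efficiency (assumed values)

n_exp = 200_000
k = w1 * rng.poisson(lam1, n_exp) + w2 * rng.poisson(lam2, n_exp)

mu_pred = w1 * lam1 + w2 * lam2                 # (3.61)
var_pred = w1**2 * lam1 + w2**2 * lam2          # (3.62)
print(k.mean(), mu_pred)
print(k.var(), var_pred)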
4 Measurement Errors
The natural sciences owe their success to the possibility to compare quanti-
tative hypotheses to experimental facts. However, we are able to check the-
oretical predictions only if we have an idea about the accuracy of the mea-
surements. If this is not the case, our measurements are completely useless.
Of course, we also want to compare the results of different experiments
to each other and to combine them. Measurement errors must be defined in
such a way that this is possible without knowing details of the measurement
procedure. Only then, important parameters, like constants of nature, can be
determined more and more accurately and possible variations with time, like
it was hypothesized for the gravitational constant, can be tested.
Finally, it is indispensable for the utilization of measured data in other
scientific or engineering applications to know their accuracy and reliability.
An overestimated error can lead to a waste of resources and, even worse, an
underestimated error may lead to wrong conclusions.
2
Note that we do not need to know the full error distribution but only its
standard deviation.
In Sect. 8.1 we will, as mentioned above, also discuss more complex cases,
including asymmetric errors due to low event rates or other sources.
Apart from the definition of a measurement and its error by the estimated
mean and standard deviation of the related distribution there exist other
conventions: Distribution median, maximal errors, width at half maximum
and confidence intervals. They are useful in specific situations but suffer from
the crucial disadvantage that they are not suited for the combination of
measurements or the determination of the errors of dependent variables, i.e.
error propagation.
There are uncertainties of different nature: statistical errors and system-
atic errors. Their definitions are not unambiguous; they differ from author to
author and depend somewhat on the scientific discipline in which they are
treated.
We have to require that the fluctuations are purely statistical and that cor-
related systematic variations are absent, i.e. the data have to be independent
from each other. The relative uncertainty of the error estimate follows the 1/√N law. It will be studied below. For example, with 100 repetitions of the
measurement, the uncertainty of the error itself is reasonably small, i.e. about
10 % but depends on the distribution of x.
When the true value is unknown, we can approximate it by the sample mean x̄ = (1/N) Σ_{i=1}^{N} x_i and use the following recipe:

(δx)² = 1/(N − 1) Σ_{i=1}^{N} (x_i − x̄)² .      (4.1)

More generally, when the expected values are themselves estimated from the data by adjusting Z parameters, the recipe becomes

(δx)² = 1/(N − Z) Σ_{i=1}^{N} (x_i − x̂_i)² ,      (4.2)

where x̂_i are the estimates of the true values corresponding to the measure-
ments xi and Z is the number of parameters that have been adjusted using
the data. When we compare the data of a sample to the sample mean we have
Z = 1 parameter, namely x̄, when we compare coordinates to the values of a
straight line fit then we have Z = 2 free parameters to be adjusted from the
data, for instance, the slope and the intercept of the line with the ordinate
axis. Again, the denominator N − Z is intuitively plausible: for N = Z, e.g. two points and a straight line, the points lie exactly on the fitted curve, the numerator vanishes as well, and the result is indefinite.
Relation (4.2) is frequently used in particle physics to estimate momentum
or coordinate errors from empirical distributions (of course, all errors are
assumed to be the same). For example, the spatial resolution of tracking
devices is estimated from the distribution of the residuals (xi − x̂i ). The
individual measurement error δx as computed from M tracks with N points per track is then estimated quite reliably by

(δx)² = 1/[(N − Z)M] Σ_{i=1}^{M·N} (x_i − x̂_i)² .
Not only the precision of the error estimate, but also the precision of a measurement can be increased by repetition. The error δx̄ of the corresponding sample mean is, following the results of the previous section, given by

(δx̄)² = (δx)²/N = 1/[N(N − 1)] Σ_{i=1}^{N} (x_i − x̄)² .      (4.3)
Our recipe yields δx̄ ∼ 1/√N, i.e. the error becomes arbitrarily small if the number of measurements approaches infinity. The validity of the 1/√N behavior relies on the assumption that the fluctuations are purely
statistical and that correlated systematic variations are absent, i.e. the data
have to be independent of each other. When we measure repeatedly the period
of a pendulum, then the accuracy of the measurements can be deduced from
the variations of the results only if the clock is not stopped systematically
too early or too late and if the clock is not running too fast or too slow. Our
experience tells us that some correlation between the different measurements
usually cannot be avoided completely and thus there is a lower limit for δx.
To obtain a reliable estimate of the uncertainty, we have to take care that
the systematic uncertainties are small compared to the statistical error δx.
and it follows δs/s = 1/√(2N), which also follows from the variance of the χ distribution. This relation is sometimes applied to arbitrary distributions. It then often underestimates the uncertainty.
Systematic errors are at least partially based on assumptions made by the ex-
perimenter, are model dependent or follow unknown distributions. This leads
to correlations between repeated measurements because the assumptions en-
tering into their evaluations are common to all measurements. Therefore,
contrary to statistical errors, the relative error of the mean value of repeated measurements suffering from systematic errors violates the 1/√N law.
A systematic error arises for instance if we measure a distance with a
tape-measure which may have expanded or shrunk due to temperature ef-
fects. Corrections can be applied and the corresponding uncertainty can be
estimated roughly from the estimated range of temperature variations and
the known expansion coefficient of the tape material if it is made out of metal.
It may also be guessed from previous experience.
Systematic errors occur also when an auxiliary parameter is taken from
a technical data sheet where the given uncertainty is usually not of the type
“statistical”. It may happen that we have to derive a parameter from two or
three observations following an unknown distribution. For instance, the cur-
rent of a magnet may have been measured at the beginning and at the end of
an experiment. The variation of the current introduces an error for the mo-
mentum measurement of charged particles. The estimate of the uncertainty
from only two measurements obeying an unknown distribution of the magnet
variations will be rather vague and thus the error is classified as systematic.
A relatively large systematic error has affected the measurement of the
mass of the Z 0 particle by the four experiments at the LEP collider. It was
due to the uncertainty in the beam energy and has led to sizable correlations
between the four results.
Typical systematic uncertainties are the following:
1. Uncertainties in the experimental conditions (Calibration uncertainties
for example of a calorimeter or the magnetic field, unknown beam con-
ditions, unknown geometrical acceptance, badly known detector reso-
lutions, temperature and pressure dependence of the performance of
gaseous tracking detectors.),
2. unknown background behavior,
3. limited quality of the Monte Carlo simulation due to technical approxi-
mations,
4. uncertainties in the theoretical model used in the simulation (approxima-
tions in radiative or QCD corrections, poorly known parton densities),
5. systematic uncertainties caused by the elimination of nuisance parameters
(see Sect. 7.8),
6. uncertainties in auxiliary parameters taken from data sheets or previous
experiments.
Some systematic errors are difficult to retrieve3 . If, for instance, in the data
acquisition system the deadtime is underestimated, all results may look per-
fectly all right. In order to detect and to estimate systematic errors, expe-
rience, common sense, and intuition are needed. General advice is to try to
suppress them as far as possible already by an appropriate design of the ex-
periment and to include the possibility of control measurements, like regular
calibration. Since correlation of repeated measurements is characteristic for
the presence of systematic errors, observed correlations of results with pa-
rameters related to the systematic effects provide the possibility to estimate
and reduce the latter. In the pendulum example, where the frequency is de-
termined from a time measurement for a given number of periods, systematic
3
For example, the LEP experiments had to discover that monitoring the beam
energy required a magnet model which takes into account leakage currents from
nearby passing trains and tidal effects.
contribution to the error due to a possible unknown bias in the stopping pro-
cedure can be estimated by studying the result as a function of the number
of periods and it can be reduced by increasing the measurement time. In par-
ticle physics experiments where usually only a fraction of events is accepted
by some filtering procedure, it is advisable to record also a fraction of those
events that are normally rejected (downscaling) and to try to understand
their nature. Some systematic effects are related to the beam intensity, thus
a variation of the beam intensity helps to study them.
How can we detect systematic errors caused for instance by background
subtraction or efficiency corrections at the stage of data analysis? Clearly, a
thorough comparison of the collected data with the simulation in as many
different distributions as possible is the primary method. All effects that can
be simulated are necessarily understood.
Often kinematical or geometrical constraints can be used to retrieve sys-
tematic shifts and to estimate the uncertainties. A trivial example is the
comparison of the sum of measured angles of a triangle with the value 180°,
which is common in surveying. In the experiments of particle physics we can
apply among other laws the constraints provided by energy and momentum
conservation. When we adjust curves, e.g. a straight line to measured points,
the deviations of the points from the line permit us to check the goodness of
the fit, and if the fit is poor, we might reject the presumed parametrization
or revise the error assignment. (Goodness-of-fit tests will be treated in Chap.
10.) Biases in the momentum measurement can be detected by comparing the
locations and widths of mass peaks to the nominal values of known particles.
A widely used method is also the investigation of the results as a function
of the selection criteria. A correlation of the interesting parameter with the
value of a cut-off parameter in a certain variable is a clear indication for the
presence of systematic errors. It is evident though that the systematic errors
then have to be much larger than the normal statistical fluctuations in order
to be detected. Obviously, we want to discriminate also systematic errors
which are of the same order of magnitude as the statistical ones, preferably
much smaller. Therefore we have to investigate samples, where the systematic
effects are artificially enhanced. If we suspect rate dependent distortion effects
as those connected with dead times, it is recommended to analyze a control
sample with considerably enhanced rate. When we eliminate a background
reaction by a selection criterion, we should investigate its importance in the
region which has been excluded, where it is supposed to be abundant.
Frequently made mistakes are: 1. From the fact that the data are consistent with the absence of systematic errors, it is concluded that they do not exist. This always leads to an underestimation of the systematic errors. 2. The changes of the results found by changing the selection criteria are directly converted into systematic errors. This in most cases leads to overestimates, because the variations are partially due to the normal statistical fluctuations.
y_m = ⟨y(x)⟩ ≈ ⟨y(x_m)⟩ + ⟨y′(x_m) ∆x⟩ = y(x_m) ,
and
(δy)² = ⟨(y − y_m)²⟩ ≈ ⟨(y(x_m) + y′(x_m) ∆x − y_m)²⟩ = y′²(x_m) ⟨(∆x)²⟩ = y′²(x_m) (δx)² ,
δy = |y′(x_m)| δx .
This result could also have been read off directly from Fig. 4.1.
Examples of the linear propagation of errors for some simple functions
are compiled below:
Function                Relation between errors
y = a x^n      ⇒   δy/|y| = |n| δx/|x| ,
y = a ln(bx)   ⇒   δy = |a| δx/|x| ,
y = a e^{bx}   ⇒   δy/|y| = |b| δx ,
y = tan x      ⇒   δy/|y| = δx/|cos x sin x| .
For a product of powers of two independently measured quantities, z = x^n y^m, the relative errors add quadratically:
(δz/z)² = (n δx/x)² + (m δy/y)² .
It is not difficult to generalize our results to functions y(x1 , .., xN ) of N
measured quantities. We obtain
(δy)² = Σ_{i,j=1}^{N} (∂y/∂x_i)(∂y/∂x_j) R_ij δx_i δx_j
      = Σ_{i=1}^{N} (∂y/∂x_i)² (δx_i)² + Σ_{i≠j} (∂y/∂x_i)(∂y/∂x_j) R_ij δx_i δx_j .
With the covariance matrix elements C_ij = R_ij δx_i δx_j this can be written as
(δy)² = Σ_{i,j=1}^{N} (∂y/∂x_i)(∂y/∂x_j) C_ij ,
(δy)² = ∇yᵀ C ∇y .
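A small Python sketch of this relation: the gradient is obtained by finite differences and contracted with the covariance matrix. The function y = x1·x2 and the numbers in C are illustrative assumptions.

import numpy as np

def propagate(f, x, C, eps=1e-6):
    # linear error propagation (dy)^2 = grad(f)^T C grad(f); gradient by central differences
    x = np.asarray(x, dtype=float)
    grad = np.empty_like(x)
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps * max(1.0, abs(x[i]))
        grad[i] = (f(x + dx) - f(x - dx)) / (2.0 * dx[i])
    return float(grad @ C @ grad)

x = [2.0, 3.0]
C = np.array([[0.04, 0.01],
              [0.01, 0.09]])          # covariance matrix of x1, x2 (illustrative)
print(np.sqrt(propagate(lambda v: v[0] * v[1], x, C)))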
Error Ellipsoids
Two-dimensional Gaussian error distributions like (4.5) (see Sect. 3.6.5) have
the property that the curves of constant probability density are ellipses. In-
stead of nσ error intervals in one dimension, we define nσ error ellipses.
The curve of constant probability density with density down by a factor of
exp(−n2 /2) relative to the maximal density is the nσ error ellipse.
For the error distribution in the form of (4.5) the error ellipse is
[(∆x1)²/δ1² − 2ρ ∆x1 ∆x2/(δ1 δ2) + (∆x2)²/δ2²] / (1 − ρ²) = n² .
For uncorrelated errors the one standard deviation error ellipse is simply
(∆x1)²/δ1² + (∆x2)²/δ2² = 1 .
In higher dimensions we obtain ellipsoids, which are best written in vector notation:
∆xᵀ C⁻¹ ∆x = n² .
Remember that in this chapter we assume that the errors are small enough
to neglect a dependence of the error on the value of the measured quantity
within the range of the error. This condition is violated for instance for small
Poisson numbers. The general case will be discussed in Chap. 8.
As an example let us consider two measurements with measured values
x1 , x2 and errors δ1 , δ2 . With the relations given in Sect. 3.2.3, we find for
the error squared δ 2 of a weighted sum
x = w1 x1 + w2 x2 ,
δ 2 = w12 δ12 + w22 δ22 .
Now we choose the weights in such a way that the error of the weighted sum is minimal, i.e. we look for the minimum of δ² under the condition w1 + w2 = 1. The result is
w_i = (1/δ_i²) / (1/δ1² + 1/δ2²) .
Generalized to N measurements, the weighted mean and its error are
x̄ = Σ_{i=1}^{N} (x_i/δ_i²) / Σ_{i=1}^{N} (1/δ_i²) ,      (4.6)
1/δ² = Σ_{i=1}^{N} 1/δ_i² .      (4.7)
When all measurements have the same error, all the weights are equal to w_i = 1/N, and we get the normal arithmetic mean, with the corresponding reduction of the error by the factor 1/√N.
Remark: If the original raw data of different experiments are available,
then we have the possibility to improve the averaging process compared to
the simple use of the relations (4.6) and (4.7). When, for example, in two rate measurements of 1 and 3 hours duration, 2, respectively 12 events are ob-
served, then the combined rate is (2 + 12)/(1 h + 3 h) = 3.5 h−1 , with an
error ±0.9 h−1 . Averaging according to (4.6) would lead to too low a value
of (3.2 ± 1.2) h−1 , due to the above mentioned problem of small rates and
asymmetric errors. The optimal procedure is in any case the addition of the
log-likelihoods which will be discussed in Chap. 8. It will correspond to the
addition of the original data, as done here.
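The effect can be reproduced with a few lines of Python; the √n errors assigned to the individual rates are the usual naive choice and are part of what causes the bias for small counts.

import numpy as np

n1, t1 = 2, 1.0        # 2 events in 1 hour
n2, t2 = 12, 3.0       # 12 events in 3 hours

# combining the raw data: add counts and times
rate_combined = (n1 + n2) / (t1 + t2)
err_combined = np.sqrt(n1 + n2) / (t1 + t2)

# naive weighted mean of the two rates with sqrt(n) errors, relations (4.6), (4.7)
r = np.array([n1 / t1, n2 / t2])
d2 = np.array([n1 / t1**2, n2 / t2**2])       # squared errors of the individual rates
w = (1.0 / d2) / np.sum(1.0 / d2)
rate_weighted = np.sum(w * r)
err_weighted = np.sqrt(1.0 / np.sum(1.0 / d2))

print(rate_combined, err_combined)    # 3.5 per hour
print(rate_weighted, err_weighted)    # 3.2 per hour, biased low for small counts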
x = w1 x1 + w2 x2 , with w1 + w2 = 1 .
The weights which minimize the error are
w1 = (C22 − C12)/(C11 + C22 − 2C12) ,
w2 = (C11 − C12)/(C11 + C22 − 2C12) .      (4.8)
The uncorrelated weighted mean corresponds to C12 = 0. Contrary to this
case, where the expression for the minimal value of δx2 is particularly simple,
it is not as transparent in the correlated case.
The case of N correlated measurements leads to the following expression
for the weights:
w_i = Σ_{j=1}^{N} V_ij / Σ_{i,j=1}^{N} V_ij ,
where V is the inverse matrix of C which we called the weight matrix in Sect.
3.6.5.
The weighted mean and its error, derived by error propagation, are:
x̄ = Σ_{i=1}^{N} w_i x_i = Σ_{i,j=1}^{N} V_ij x_i / Σ_{i,j=1}^{N} V_ij ,      (4.9)
δ² = Σ_{i,j,k,l=1}^{N} V_ij V_kl C_ik / (Σ_{i,j=1}^{N} V_ij)² = 1 / Σ_{i,j=1}^{N} V_ij .      (4.10)
C is the sum of a diagonal matrix and a matrix where all elements are iden-
tical, namely equal to δ0². In this special situation the variance var(E*) ≡ δ² of the combined result E* = Σ w_i E_i is
δ² = Σ_i w_i² C_ii + Σ_{i≠j} w_i w_j C_ij
   = Σ_i w_i² δ_i² + (Σ_i w_i)² δ0² .
Since the second sum is unity, the second term is unimportant when we
minimize δ 2 , with respect to the weights and we get the same result (4.6) for
the weighted mean E ∗ as in the uncorrelated case. For its error we find, as
could have been expected,
δ² = (Σ_i 1/δ_i²)⁻¹ + δ0² .
It is interesting that in some rare cases the weighted mean of two correlated measurements x1 and x2 is not located between the individual measurements; the so-called “mean value” is then not contained in the interval [x1, x2].
Example 50. Average outside the range defined by the individual measure-
ments
The matrix
C = ( 1  2
      2  5 )
with eigenvalues λ_{1,2} = 3 ± √8 > 0
is symmetric and positive definite and thus a possible covariance matrix. But following (4.8) it leads to the weights w1 = 3/2, w2 = −1/2. Thus the weighted mean x̄ = (3/2)x1 − (1/2)x2 with x1 = 0, x2 = 1 will lead to x̄ = −1/2, which is less than both input values. The reason for this sensible but at first sight unexpected
result can be understood intuitively in the following way: Due to the strong
correlation, x1 and x2 , both will usually be either too large or too low. An
indication, that x2 is too large is the fact that it is larger than x1 which is
the more precise measurement. Thus the true value x then is expected to be
located below both x1 and x2 .
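The example can be reproduced directly from the general relations (4.8)–(4.10); a minimal Python sketch, assuming numpy:

import numpy as np

x = np.array([0.0, 1.0])                  # the two measurements x1, x2
C = np.array([[1.0, 2.0],
              [2.0, 5.0]])                # their covariance matrix

# weights from (4.9): w_i = sum_j V_ij / sum_ij V_ij with V = C^-1
V = np.linalg.inv(C)
w = V.sum(axis=1) / V.sum()
mean = w @ x
err = np.sqrt(1.0 / V.sum())              # (4.10)

print(w)       # [ 1.5, -0.5]
print(mean)    # -0.5, outside the interval [x1, x2]
print(err)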
with
w_i = (1/δ_i²) / Σ_{i=1}^{N} (1/δ_i²) .
The statistical and the systematic errors are
a² = Σ_{i=1}^{N} w_i² a_i² ,
b² = Σ_{i=1}^{N} w_i² b_i² .
⟨∆y_k ∆y_l⟩ = Σ_{i,j=1}^{N} (∂y_k/∂x_i)(∂y_l/∂x_j) ⟨∆x_i ∆x_j⟩ ,      (4.11)
E_kl = Σ_{i,j=1}^{N} (∂y_k/∂x_i)(∂y_l/∂x_j) C_ij .
Defining a matrix
D_ki = ∂y_k/∂x_i ,
we can write more compactly
E_kl = Σ_{i,j=1}^{N} D_ki D_lj C_ij ,      (4.12)
E = D C Dᵀ .      (4.13)
4.4.7 Examples
For a straight line y = mx + b fixed by two measured points (x1, y1 ± δy1) and (x2, y2 ± δy2), slope and intercept are
m = (y2 − y1)/(x2 − x1) ,   b = (x2 y1 − x1 y2)/(x2 − x1) ,
and linear error propagation yields
(δm)² = [(δy1)² + (δy2)²]/(x2 − x1)² ,
(δb)² = [x2²(δy1)² + x1²(δy2)²]/(x2 − x1)² ,
E12 = ⟨∆m ∆b⟩ = −[x2(δy1)² + x1(δy2)²]/(x2 − x1)² .
The error matrix E for m and b is therefore
E = 1/(x2 − x1)² ( (δy1)² + (δy2)²          −x2(δy1)² − x1(δy2)²
                   −x2(δy1)² − x1(δy2)²     x2²(δy1)² + x1²(δy2)² ) .
The correlation matrix element R12 is
R12 = E12/(δm δb)
    = −[x2(δy1)² + x1(δy2)²] / {[(δy1)² + (δy2)²][x2²(δy1)² + x1²(δy2)²]}^{1/2} .      (4.14)
Remark: As seen from (4.14), for a suitable choice of the abscissa the correla-
tion disappears. To achieve this, we take as the origin the “center of gravity”
xs of the x-values xi , weighted with the inverse squared errors of the ordi-
nates, 1/(δyi )2 :
x_s = Σ x_i/(δy_i)² / Σ 1/(δy_i)² .
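A short Python sketch of this example: the covariance matrix of (m, b) is built via (4.13), the correlation agrees with (4.14), and shifting the origin to the weighted center of gravity x_s makes it vanish. The numerical values of x1, x2, δy1, δy2 are illustrative assumptions.

import numpy as np

x1, x2 = 1.0, 3.0
dy1, dy2 = 0.2, 0.4
C = np.diag([dy1**2, dy2**2])             # uncorrelated errors of y1, y2

def cov_mb(x1, x2):
    # m = (y2-y1)/(x2-x1), b = (x2*y1 - x1*y2)/(x2-x1); D = d(m,b)/d(y1,y2)
    D = np.array([[-1.0, 1.0],
                  [x2, -x1]]) / (x2 - x1)
    return D @ C @ D.T                    # relation (4.13)

E = cov_mb(x1, x2)
R12 = E[0, 1] / np.sqrt(E[0, 0] * E[1, 1])
print(R12)                                # agrees with (4.14)

# shifting the origin to the weighted center of gravity removes the correlation
xs = (x1 / dy1**2 + x2 / dy2**2) / (1.0 / dy1**2 + 1.0 / dy2**2)
E_shift = cov_mb(x1 - xs, x2 - xs)
print(E_shift[0, 1])                      # ~ 0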
⟨x − x0⟩ = ⟨1/x⟩/⟨1/x²⟩ − x0 ,
1/x = (1/x0)[1 − ∆x/x0 + (∆x/x0)² + · · ·] ,   ⟨1/x⟩ ≈ (1/x0)(1 + δ²/x0²) ,
1/x² = (1/x0²)[1 − 2∆x/x0 + 3(∆x/x0)² + · · ·] ,   ⟨1/x²⟩ ≈ (1/x0²)(1 + 3δ²/x0²) ,
⟨x − x0⟩ ≈ x0 (1 + δ²/x0²)/(1 + 3δ²/x0²) − x0 ≈ x0 (1 − 2δ²/x0²) − x0 ,
⟨x − x0⟩/x0 ≈ −2 δ²/x0² .
Fig. 4.2. Confidence ellipses for 1, 2 and 3 standard deviations and corresponding
probabilities.
accuracy. For the normal distribution we present some limits in units of the
standard deviation in Table 4.2. The numerical values can be taken from
tables of the χ2 -distribution function.
For distributions of several variates, the probability to find all variables
inside their error limits is strongly decreasing with the number of variables.
Some probabilities for Gaussian errors are given in Table 4.1. In three di-
mensions only 20 % of the observations are found in the 1σ ellipsoid. Fig. 4.2
shows confidence ellipses and probabilities for two variables.
Example 57. Confidence level for the mean of normally distributed variates
Let us consider a sample of N measurements x1 , . . . , xN which are sup-
posed to be normally distributed with unknown mean µ but known variance σ². The sample mean x̄ is also normally distributed, with standard deviation δ_N = σ/√N. The 1σ confidence interval [x̄ − δ_N, x̄ + δ_N] covers, as we have
discussed above, the true value µ in 68.3 % of the cases. We can, with the help
of Table 4.1, also find a 99 % confidence level, i.e. [x − 2.58δN , x + 2.58δN ].
Table 4.1. Confidence levels for different values of the standard deviation σ.
Deviation Dimension
1 2 3 4
1 σ 0.683 0.393 0.199 0.090
2 σ 0.954 0.865 0.739 0.594
3 σ 0.997 0.989 0.971 0.939
4 σ 1. 1. 0.999 0.997
which usually are not known with great accuracy. Then for a given confidence
level much wider intervals than in the above case are required.
Table 4.2. Error limits in units of the standard deviation σ for several confidence
levels.
Confidence Dimension
level 1 2 3 4
0.50 0.67 1.18 1.54 1.83
0.90 1.65 2.14 2.50 2.79
0.95 1.96 2.45 2.79 3.08
0.99 2.58 3.03 3.37 3.64
We come back to our previous example but now we assume that the error
has to be estimated from the sample itself, according to (4.1), (4.3):
δ_N² = Σ_{i=1}^{N} (x_i − x̄)² / [N(N − 1)] .
To compute the confidence level for a given interval in units of the standard
deviation, we now have to switch to Student’s distribution (see Sect. 3.6.11).
The variate t, given by (x̄ − µ)/δ_N, can be shown to be distributed according to h(t|f) with f = N − 1 degrees of freedom. The confidence level for a given
number of standard deviations will now be lower, because of the tails of
Student’s distribution. Instead of quoting this number, we give in Table 4.3
the factor k by which we have to increase the interval length to get the same
confidence level as in the Gaussian case. To clarify its meaning, let us look at
two special cases: For 68.3% confidence and N = 3 we require a 1.32 standard deviation interval and for 99% confidence and N = 10 a 1.26 × 2.58 = 3.25
standard deviation interval. As expected, the discrepancies are largest for
small samples and high confidence levels. In the limit when N approaches
infinity the factor k has to become equal to one.
Table 4.3. Values of the factor k for the Student’s t-distribution as a function of
the confidence levels CL and sample size N .
N 68.3% 99%
3 1.32 3.85
10 1.06 1.26
20 1.03 1.11
∞ 1.00 1.00
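The entries of Table 4.3 can be reproduced from the quantiles of Student's distribution; a small sketch, assuming scipy is available:

from scipy.stats import norm, t

# factor k by which the interval must be widened when the error is estimated
# from the sample itself (Student's t with f = N - 1 degrees of freedom)
for CL in (0.683, 0.99):
    z = norm.ppf(0.5 + CL / 2.0)          # Gaussian interval half-width in units of sigma
    for N in (3, 10, 20):
        k = t.ppf(0.5 + CL / 2.0, df=N - 1) / z
        print(f"N={N:3d}  CL={CL:.3f}  k={k:.2f}")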
5 Monte Carlo Simulation
5.1 Introduction
The possibility to simulate stochastic processes and to perform numerical modeling on the computer simplifies the solution of many problems in science and engineering enormously. The deeper reason for this is characterized quite aptly by the German saying “Probieren geht über studieren” (Trying beats studying). Monte Carlo methods replace intellectual effort by computational effort, which is delegated to the computer.
A few simple examples will demonstrate the advantages, but also the lim-
its of this method. The first two of them are purely mathematical integration
problems which could be solved also by classical numerical methods, but show
the conceptual simplicity of the statistical approach.
All examples lead finally to integration problems. In the first three ex-
amples also numerical integration, even exact analytical methods, could have
been used. For the Examples 61 and 63, however, this is hardly possible,
since the number of variables is too large. Furthermore, the mathematical
formulation of the problems becomes rather involved.
Monte Carlo simulation does not require a profound mathematical exper-
tise. Due to its simplicity and transparency mistakes can be avoided. It is
true, though, that the results are subject to statistical fluctuations which,
however, may be kept small enough in most cases thanks to the fast com-
puters available nowadays. For the simulation of chemical reactions (Example 63), however, we reach the limits of computing power quite soon, even with supercomputers. The treatment of macroscopic quantities (one mole, say)
is impossible. Most questions can be answered, however, by simulating small
samples.
Nowadays, even statistical problems are often solved through Monte Carlo
simulations. In some big experiments the error estimation for parameters
determined in a complex analysis is so involved that it is easier to simulate
the experiment, including the analysis, several times, and to derive the errors
quasi experimentally from the distribution of the resulting parameter values.
The relative statistical fluctuations can be computed for small samples and
then scaled down with the square root of the sample size.
In the following section we will treat the simulation of the basic univariate
distributions which are needed for the generation of more complex processes.
The generalization to several dimensions is not difficult. Then we continue
with a short summary on Monte Carlo integration methods.
The computer delivers pseudo random numbers in the interval between zero and one. Because of the finite number of digits used to represent data in a computer, these are discrete, rational numbers which due to the usual floating point accuracy can take only 2²³ ≈ 8 · 10⁶ different values, and follow a fixed, reproducible sequence which, however, appears as stochastic to the user. More refined algorithms can avoid, though, the repetition of the same sequence after 2²³ calls. The Mersenne twister, one of the fastest reasonable random number generators, invented in 1997 by M. Matsumoto and T. Nishimura, has the enormous period of 2¹⁹⁹³⁷ − 1 which never can be exhausted and is shown to be uniformly distributed in 623 dimensions. In all generators, the user has the possibility to set a starting value, called seed, and thus to repeat exactly the same sequence, or to interrupt a simulation and to continue with the sequence in order to generate statistically independent samples.
In the following we will speak of random numbers also when we mean
pseudo random numbers.
There are many algorithms for the generation of random numbers. The
principle is quite simple: One performs an arithmetic operation and uses only
the insignificant digits of the resulting number. How this works is shown by
the prescription
x_{i+1} = n⁻¹ mod(λ x_i ; n) ,
producing from the old random number xi a new one between zero and
one. The parameters λ and n fulfil the condition λ ≫ n. With the values
x1 = 0.7123, λ = 4158, n = 1 we get, for instance, the number
x2 = mod(2961.7434; 1) = 0.7434 .
Fig. 5.1. Correlation plot of consecutive random numbers (top) and frequency of
random numbers (bottom).
Continuous Variables
With the restrictions discussed above, we can generate with the computer
random numbers obeying the uniform distribution
u(r) = 1 for 0 ≤ r ≤ 1.
In the following we use the notations u for the uniform distribution and
r for a uniformly distributed variate in the interval [0, 1]. Other univariate
distributions f (x) are obtained by variable transformations r(x) with r a
monotone function of x (see Chap. 3):
f(x)dx = u(r)dr ,
∫_{−∞}^{x} f(x′)dx′ = ∫_{0}^{r(x)} u(r′)dr′ = r(x) ,
F(x) = r ,
x(r) = F⁻¹(r) .
Fig. 5.2. The p.d.f. (top) follows from the distribution function as indicated by the
arrows.
– Linear distribution:
f(x) = 2x ,   0 ≤ x ≤ 1 ,
x(r) = √r .
– Exponential distribution:
f(x) = γ e^{−γx} ,
x(r) = −(1/γ) ln(1 − r) .
– Breit–Wigner (Cauchy) distribution:
f(x) = [1/(π Γ/2)] (Γ/2)²/[x² + (Γ/2)²] ,
x(r) = (Γ/2) tan[π(r − 1/2)] .
– Uniform azimuthal angle:
ϕ = 2πr .
– Isotropic direction in space (uniform in cos θ and ϕ):
cos θ = 2r1 − 1 ,   θ = arccos(2r1 − 1) ,   ϕ = 2πr2 .
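A minimal Python sketch of two of the inverse-transform recipes listed above; the parameter values and the simple consistency checks at the end are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
r = rng.uniform(size=100_000)

# exponential distribution f(x) = gamma*exp(-gamma*x):  x = -ln(1-r)/gamma
gamma_ = 2.0
x_exp = -np.log(1.0 - r) / gamma_

# Breit-Wigner (Cauchy) with full width Gamma:  x = (Gamma/2)*tan(pi*(r - 1/2))
Gamma = 1.0
x_bw = 0.5 * Gamma * np.tan(np.pi * (r - 0.5))

print(x_exp.mean(), 1.0 / gamma_)              # exponential mean = 1/gamma
print(np.median(np.abs(x_bw)), 0.5 * Gamma)    # half of the Cauchy sample lies within |x| < Gamma/2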
(Figure: the Poisson distribution P(k; 4.6) and its distribution function.)
Discrete Distributions
Histograms
normalize it to one, and generate at first i, and then for given i in the same
way j. That means that we need for each value of i the distribution summed
over j.
In the majority of cases it is not possible to find and invert the distribution
function analytically. As an example for a non-analytic approach, we consider
the generation of photons following the Planck black-body radiation law. The
appropriately scaled frequency x obeys the distribution
f(x) = c x³/(e^x − 1)      (5.1)
with the normalization constant c. This function is shown in Fig. 5.4 for
c = 1, i.e. not normalized. We restrict ourselves to frequencies below a given
maximal frequency xmax .
A simple method to generate this distribution f (x) is to choose two uni-
formly distributed random numbers, where r1 is restricted to the interval
(xmin , xmax ) and r2 to (0, fmax ). This pair of numbers P (r1 , r2 ) corresponds
to a point inside the rectangle shown in the figure. We generate points and
those lying above the curve f (x) are rejected. The density of the remaining
r1 values follows the desired p.d.f. f (x).
A disadvantage of this method is that it requires several randomly dis-
tributed pairs to select one random number following the distribution. In our
example the ratio of successes to trials is about 1:10. For generating photons
up to arbitrary large frequencies the method cannot be applied at all.
Fig. 5.4. Random selection method. The projection of the points located below
the curve follows the desired distribution.
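A short Python sketch of this hit-or-miss generation of the Planck distribution; the cut-off x_max = 30 is taken from the figure range and the maximum of the curve is determined numerically rather than quoted.

import numpy as np

rng = np.random.default_rng(4)

def planck(x):
    return x**3 / np.expm1(x)            # un-normalized Planck law (5.1) with c = 1

x_max = 30.0
grid = np.linspace(1e-6, x_max, 10_000)
f_max = planck(grid).max()

n_trial = 200_000
r1 = rng.uniform(0.0, x_max, n_trial)
r2 = rng.uniform(0.0, f_max, n_trial)
sample = r1[r2 < planck(r1)]             # keep only points below the curve

print(len(sample) / n_trial)             # success rate of the rejection method
print(sample.mean())                     # sample mean of the generated spectrum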
(Fig. 5.5: the function f(x) = e^{−0.2x} sin²(x), shown in linear and logarithmic scale.)
– m(x) ≥ f(x) for all x,
– x = M⁻¹(r), i.e. the indefinite integral M(x) = ∫_{−∞}^{x} m(x′)dx′ is invertible.
If it exists (see Fig. 5.5), we generate x according to m(x) and, in a second step, drop stochastically for given x the fraction [m(x) − f(x)]/m(x) of the events. This means, for each event (i.e. each generated x) a second, this time uniform random number between zero and m(x) is generated, and if it is larger than f(x), the event is abandoned. The advantage is that for m(x)
being not much different from f (x) in most of the cases, the generation of one
event requires only two random numbers. Moreover, in this way it is possible
to generate also distributions which extend to infinity, as for instance the
Planck distribution, and many other distributions.
We illustrate the method with a simple example (Fig. 5.5):
r = ∫_{0}^{x} 0.2 e^{−0.2x′} dx′ = 1 − e^{−0.2x} .
Thus the variate transformation from the uniformly distributed random number r1 to x is
x = −(1/0.2) ln(1 − r1) .
We draw a second uniform random number r2, also between zero and one, and test whether r2 m(x) exceeds the desired p.d.f. f(x). If this is the case, the event is rejected.
With this method a uniform distribution of random points below the majo-
rant curve is generated, while only those points are kept which lie below the
p.d.f. to be generated. On average about 4 random numbers per event are
needed in this example, since the test has a positive result in about half of
the cases.
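The example can be written out in a few lines of Python; the target f(x) ∝ e^{−0.2x} sin²(x) and the exponential majorant follow the description above, the sample size is arbitrary.

import numpy as np

rng = np.random.default_rng(5)

def sample_one():
    # majorant method for f(x) ~ exp(-0.2x)*sin^2(x) with majorant m(x) = exp(-0.2x)
    while True:
        r1, r2 = rng.uniform(size=2)
        x = -np.log(1.0 - r1) / 0.2        # x distributed according to the normalized majorant
        if r2 * np.exp(-0.2 * x) < np.exp(-0.2 * x) * np.sin(x) ** 2:
            return x                       # accepted with probability f(x)/m(x) = sin^2(x)

sample = np.array([sample_one() for _ in range(20_000)])
print(sample.mean())
# on average about 4 random numbers per event, since roughly half of the trials are accepted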
The factor x^{−0.1} does not influence the asymptotic behavior significantly but permits the analytical integration:
M2(x) = ∫_{x1}^{x} m2(x′)dx′ = (200c/0.9) [e^{−x1^{0.9}} − e^{−x^{0.9}}] .
Fig. 5.7. Generated Planck spectrum.
Quite often the p.d.f. to be considered is a sum of several terms. Let us restrict ourselves to the simplest case with two terms,
f(x) = f1(x) + f2(x) ,
with
S1 = ∫_{−∞}^{∞} f1(x)dx ,   S2 = ∫_{−∞}^{∞} f2(x)dx ,   S1 + S2 = 1 .
Denoting by F1, F2 the indefinite integrals of f1, f2, we choose
x = F1⁻¹(r)   for r ≤ S1 ,
respectively
x = F2⁻¹(r − S1)   for r > S1 .
For the p.d.f.
f(x) = ε λ e^{−λx}/(1 − e^{−λa}) + (1 − ε)/a   for 0 < x < a ,
we choose for r < ε
x = −(1/λ) ln[1 − r (1 − e^{−λa})/ε] ,
and for r > ε
x = a (r − ε)/(1 − ε) .
We need only one random number per event. The direct way to use the
inverse of the distribution function F (x) would not have been successful,
since it cannot be given in analytic form.
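A Python sketch of this composition method; the parameter values λ = 1.5, a = 2, ε = 0.3 are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(6)
lam, a, eps = 1.5, 2.0, 0.3

def sample(n):
    r = rng.uniform(size=n)
    x = np.empty(n)
    exp_part = r < eps                # branch 1: truncated exponential, weight eps
    x[exp_part] = -np.log(1.0 - r[exp_part] / eps * (1.0 - np.exp(-lam * a))) / lam
    x[~exp_part] = a * (r[~exp_part] - eps) / (1.0 - eps)   # branch 2: uniform, weight 1-eps
    return x

x = sample(200_000)
print(x.min(), x.max())               # all values inside (0, a)
print(np.mean(x < 0.5 * a))           # fraction below a/2, above 0.5 because of the exponential part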
Introduction
P(x → x′) = W(x → x′)/[W(x → x′) + W(x′ → x)] = f(x′)/[f(x) + f(x′)] .
In an ideal gas and in many other systems a transition concerns only one or two molecules and we need only consider the effect of the change of those. Then the evaluation of the transition probability is rather simple. Now we simulate the stochastic changes of the states with the computer, by choosing a molecule at random and changing its state with the probability P(x → x′) into an also randomly chosen state x′. The choice of the initial distribution for x is relevant for the speed of convergence but not for the asymptotic result.
This mechanism has been introduced by Metropolis et al. [33] with a dif-
ferent acceptance function in 1953. It is well suited for the calculation of mean
values and fluctuations of parameters of thermodynamical or quantum sta-
tistical distributions. The process continues after the equilibrium is reached
and the desired quantity is computed periodically. This process simulates a
periodic measurement, for instance of the energy of a gas with a small number of molecules in a heat bath. Measurements performed shortly one after the other will be correlated. The same is true for sequentially probed quantities of the MCMC sampling. For the calculation of statistical fluctuations the effect of correlations has to be taken into account. It can be estimated by varying the number of moves between subsequent measurements.
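A compact Python sketch of a Markov chain using the acceptance probability P(x → x′) = f(x′)/[f(x) + f(x′)] quoted above, here applied to a one-dimensional toy target (a standard normal); the proposal step size, chain length and burn-in are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(7)

def f(x):
    return np.exp(-0.5 * x**2)                 # un-normalized target distribution

n_steps, step = 100_000, 1.0
x = 0.0
chain = np.empty(n_steps)
for i in range(n_steps):
    x_new = x + rng.uniform(-step, step)       # symmetric proposal of a new state
    if rng.uniform() < f(x_new) / (f(x) + f(x_new)):
        x = x_new                              # accept the move
    chain[i] = x

print(chain[10_000:].mean(), chain[10_000:].var())   # ~0 and ~1 after discarding a burn-in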
(Fig. 5.8: the mean squared distance ⟨d²⟩ as a function of the iteration number.)
Fig. 5.9. Solid spheres in a box. The plot is a projection onto the x-z plane.
in Fig. 5.8 as a function of the iteration number. Its mean value converges
to an asymptotic value after a number of moves which is large compared
to the number of atoms. Fig. 5.9 shows the position of atoms projected to
the x-z plane, for 300 out of 1000 considered atoms, after 20000 moves. Also
the statistical fluctuations can be found and, eventually, re-calculated for a modified number of atoms according to the 1/√N factor.
Integrals with the integrand changing sign are subdivided into integrals over
intervals with only positive or only negative integrand. Hence it is sufficient
to consider only the case
I = ∫_{x_a}^{x_b} y(x) dx   with y > 0 .      (5.2)
(Fig. 5.11: photon yield and track length as functions of the track position.)
on the location where the particle intersects the fiber. The particle traverses
the fiber in y direction at a distance x from the fiber axis. To evaluate the
acceptance, we perform the following steps:
– Set the fiber radius R = 1, create a photon at x, y uniformly distributed
in the square 0 < x , y < 1,
– calculate r2 = x2 + y 2 , if r2 > 1 reject the event,
– choose the azimuth angle ϕ for the photon direction, with respect to an axis
parallel to the fiber direction in the point x, y, 0 < ϕ < 2π, ϕ uniformly
distributed,
– calculate the projected angle α (sin α = r sin ϕ),
– choose a polar angle ϑ for the photon direction, 0 < cos(ϑ) < 1, cos(ϑ)
uniformly distributed,
– calculate the angle β of the photon with respect to the (inner) surface
normal of the fiber, cos β = sin ϑ cos α,
– for β < β0 reject the event,
– store x for the successful trials in a histogram and normalize to the total
number of trials.
The efficiency is normalized such that particles crossing the fiber at x = 0
produce exactly 1 photon.
Fig. 5.11 shows the result of our simulation. For large values of x the track
length is small, but the photon capture efficiency is large, therefore the yield
increases with x almost until the fiber edge.
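The listed steps translate almost literally into a short Python simulation. The critical angle β0 and the simple per-bin normalization are assumptions of this sketch, not values given in the text.

import numpy as np

rng = np.random.default_rng(8)
beta0 = np.radians(60.0)                       # assumed capture angle

n_trial = 500_000
x = rng.uniform(0.0, 1.0, n_trial)             # photon creation point, fiber radius R = 1
y = rng.uniform(0.0, 1.0, n_trial)
r2 = x**2 + y**2
inside = r2 < 1.0                              # reject points outside the fiber cross section

phi = rng.uniform(0.0, 2.0 * np.pi, n_trial)   # azimuth of the photon direction
sin_alpha = np.sqrt(r2) * np.sin(phi)          # projected angle alpha: sin(alpha) = r*sin(phi)
cos_theta = rng.uniform(0.0, 1.0, n_trial)     # polar angle with uniform cos(theta)
sin_theta = np.sqrt(1.0 - cos_theta**2)
cos_beta = sin_theta * np.sqrt(np.maximum(0.0, 1.0 - sin_alpha**2))   # cos(beta) = sin(theta)*cos(alpha)
captured = inside & (cos_beta <= np.cos(beta0))        # keep photons with beta >= beta0

bins = np.linspace(0.0, 1.0, 21)
yield_x, _ = np.histogram(x[captured], bins=bins)
trials_x, _ = np.histogram(x, bins=bins)
print(yield_x / trials_x)                      # photon yield versus track position, cf. Fig. 5.11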
We can gain in accuracy by reducing the area in which the points are dis-
tributed, as above by introduction of a majorant function, Fig. 5.5. As seen
from (5.3), the relative error is proportional to the square root of the ineffi-
ciency.
We come back to the first example of this chapter:
π̂ = 4N/N0 ,
δπ̂/π̂ = √[(1 − π/4)/(N0 π/4)] ≈ 0.52/√N0 .
Choosing a circumscribed octagon as the reference area, the error is reduced
by about a factor two. A further improvement is possible by inscribing another
polygon inside the circle and considering only the area between the polygons.
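A minimal Python sketch of the hit-or-miss estimate of π and of its binomial error, to be compared with the 0.52/√N0 relative error quoted above.

import numpy as np

rng = np.random.default_rng(9)
n0 = 1_000_000
x = rng.uniform(-1.0, 1.0, n0)
y = rng.uniform(-1.0, 1.0, n0)
n_hit = np.count_nonzero(x**2 + y**2 < 1.0)

pi_hat = 4.0 * n_hit / n0
# binomial error of the hit fraction, propagated to pi_hat
err = 4.0 * np.sqrt(n_hit * (1.0 - n_hit / n0)) / n0
print(pi_hat, err, 0.52 * pi_hat / np.sqrt(n0))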
b) Importance Sampling
If there exists a majorant m(x) for the function y(x) to be integrated,
I = ∫_{x_a}^{x_b} y(x)dx ,      (5.4)
a) Simple Weighting
Î = (x_b − x_a) ȳ .
Fig. 5.13. Monte Carlo integration of the difference between the function to be
integrated and an integrable function.
b) Subtraction method
We now have to evaluate by Monte Carlo only the second term with
relatively small fluctuations (Fig. 5.13).
In many cases it makes sense to factorize the integrand y(x) = f (x)y1 (x)
into a factor f (x) corresponding to a p.d.f. normalized inside the integration
interval which is easy to generate, and a second factor y1 (x). To be effective,
the method requires that f is close to y. Our integral has now the form of an
expected value:
∫_{x_a}^{x_b} y(x)dx = ∫_{x_a}^{x_b} f(x) y1(x)dx = ⟨y1⟩ .
Again, the estimate is the more precise, the less the y1-values fluctuate, i.e. the more similar the functions y and f are. The error estimate is analogous to (5.5).
etc. Once a statistical ensemble has been generated, all these quantities are
easily obtained, while with the usual integration methods, one has to repeat
each time the full integration.
Even more obvious are these advantages in acceptance calculations. Big
experiments in particle physics and other areas have to be simulated as com-
pletely and realistically as allowed by the available computing power. The
acceptance of a given system of particle detectors for a certain class of events
is found in two steps: first, a sample of interesting events is generated and
the particles produced are traced through the detecting apparatus. The hits
in various detectors together with other relevant information (momenta, par-
ticle identities) are stored in data banks. In a second step the desired accep-
tance for a class of events is found by simulating the selection procedure and
counting the fraction of events which are retained. Arbitrary changes in the
selection procedure are readily implemented without the need to simulate
large event samples more than once.
Finally, we want to stress again how easy it is to estimate the errors of
Monte Carlo integration. It is almost identical1 to the error estimation for
the experimental data. We usually will generate a number of Monte Carlo
reactions which is large enough to neglect their statistical error compared to
the experimental error. In other words, the number of Monte Carlo events
should be large compared to the number of experimental events. Usually a
factor of ten is sufficient.
1
The Monte Carlo errors are usually described by the binomial distribution,
those of the experimental data by the Poisson distribution.
6 Estimation I
6.1 Introduction
We now leave the probability calculus and its simple applications and turn
to the field of statistics. More precisely, we are concerned with inferential
statistics.
While the probability calculus, starting from distributions, predicts prop-
erties of random samples, in statistics, given a data sample, we look for a
theoretical description of the population from which it has been derived by
some random process. In the simplest case, the sample consists of indepen-
dent observations, randomly drawn from a parent population. If not specified
differently, we assume that the population is a collection of elements which
all follow the same discrete or continuous distribution. Frequently, the sample
consists of data collected in a measurement sequence.
Usually we either want to check whether our sample is compatible with a specific theory, decide between several theories, or infer unknown parameters of a given theory.
To introduce the problem, we discuss three simple examples:
1. At a table we find eight playing cards: two kings, three queens, one ten, one eight and one seven. Do the cards belong to a set of Canasta cards or to
a set of Skat cards?
2. A college is attended by 523 boys and 490 girls. Are these numbers
compatible with the assumption that on average the tendency to attend a
college is equal for boys and girls?
3. The lifetimes of five unstable particles of a certain species have been
measured. How large is the mean life of that particle and how large is the
corresponding uncertainty?
In our first example we would favor the Skat game because none of the
cards two to six is present which, however, are part of Canasta card sets.
Assuming that the cards have been taken at random from a complete card
set, we can summarize the available information in the following way: The
probability to observe no card with value below seven in eight cards of a
Canasta game is L_C = (5/13)⁸ = 4.8 × 10⁻⁴ whereas it is L_S = 1 for a
Skat game. We call these quantities likelihoods 1 . The likelihood indicates how
well a given hypothesis is supported by the observation, but the likelihood
alone is not sufficient for a decision in favor of one or another hypothesis.
Additional considerations may play an important role. When the cards are
located in a Swiss youth hostel we would consider the hypothesis Skat more
sceptically than when the cards are found in a pub in Hamburg. We therefore
would weight our hypotheses with prior probabilities (in short: priors) which
quantify this additional piece of information. Prior probabilities are often
hard to estimate, often they are completely unknown. As a consequence,
results depending on priors are usually model dependent.
We will avoid introducing prior probabilities and stay with likelihoods, but sometimes this is not possible. Then the results have to be interpreted
conditional on the validity of the applied prior probabilities.
In our second example we are confronted with only one hypothesis and no well specified alternative. The validity of the alternative, e.g. a deviation from the equality of the distribution of the sexes, is hardly measurable since an arbitrarily small deviation from equality is present in any case. There is no other possibility than to quantify the deviation of the data from the prediction in some proper way. We will treat this problem in the section on goodness-of-fit tests.
In our third example the number of hypotheses is infinite. To each value
of the unknown parameter, i.e. to each different mean life, corresponds a dif-
ferent prediction. The difficulties are very similar to those in case one. If we
want to quote probabilities, we are forced to introduce a priori probabilities
– here for the parameter under investigation. Again, in most cases no reliable
prior information will be available. We will quote the parameter best sup-
ported by the data and define an error interval based on the likelihood of the
parameter values.
The following table summarizes the cases which we have discussed.
case 1 given: N alternative hypotheses Hi
wanted: relative probabilities for the validity of Hi
case 2 given: one hypothesis H0
wanted: a quantitative statement about the validity of H0
case 3 given: one valid hypothesis H(λ) where λ is a single parameter or a set of unknown continuous parameters
wanted: “best” value of λ and its uncertainty
In practice we often will compare observations with a theory which con-
tains free parameters. In this case we have to infer parameters and to test the
compatibility of the hypothesis with the data, i.e. case 2 and case 3 apply.
1
The term likelihood was first used by the British biologist and statistician Sir
Ronald Aylmer Fisher (1890-1962). We postpone the exact definition of likelihood.
Fig. 6.1. Quantitative Venn diagram. The areas indicate the probabilities for
certain combinations of hypotheses Hi and discrete events of type kj . The marginal
probabilities are given in brackets.
Here P{Hi} is the assumed probability for the validity of hypothesis Hi before the observation happens; it is the a priori probability.
In Fig. 6.1 we illustrate relation (6.2) in the form of a so-called Venn diagram where in the present example 3 out of the 5 hypotheses have the same prior. Each hypothesis bin is divided into 3 regions with areas proportional to the probabilities to observe k = k1, k = k2 and k = k3, respectively. For example, when the observation is k = k2 (shaded in gray) then the gray areas provide the relative probabilities of the validity of the corresponding hypotheses. In our example hypothesis H3 is the most probable, H1 the most unlikely.
The computation of P {k} which is the marginal distribution of k, i.e. the
probability of a certain observation, summed over all hypotheses, yields:
P{k} = Σ_i P{k|Hi} P{Hi} .
As required, P {Hi |k} is normalized in such a way that the probability that
any of the hypotheses is fulfilled is equal to one. We get
P{Hi|k} = P{k|Hi} P{Hi} / Σ_j P{k|Hj} P{Hj} .      (6.3)
In words: The probability for the validity of hypothesis Hi after the mea-
surement k is equal to the prior P {Hi } of Hi multiplied with the probability
to observe k if Hi applies and divided by a normalization factor. When we
are only interested in the relative probabilities of two different hypotheses Hi
and Hj for an observation k, we have:
P{K|µ}/P{π|µ} = (0.10 × 1)/(0.02 × 3) = 5/3 ,
P{K|µ}/(P{K|µ} + P{π|µ}) = 0.10 × 1/(0.02 × 3 + 0.10 × 1) = 0.625 .
The kaon hypothesis is more likely than the pion hypothesis. Its probability
is 0.625.
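A small numerical sketch of relation (6.3) behind these numbers; the likelihoods 0.10 and 0.02 and the prior weights 1 and 3 are the values quoted above, everything else is illustrative Python.

# relative probabilities of two hypotheses from likelihoods and priors, eq. (6.3)
likelihood = {"K": 0.10, "pi": 0.02}     # probability of the observation under each hypothesis
prior = {"K": 1.0, "pi": 3.0}            # prior weights (only their ratio matters)

norm = sum(likelihood[h] * prior[h] for h in likelihood)
posterior = {h: likelihood[h] * prior[h] / norm for h in likelihood}
print(posterior["K"] / posterior["pi"])  # 5/3
print(posterior["K"])                    # 0.625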
Now we extend our considerations to the case where the hypothesis index
is replaced by a continuous parameter θ, i.e. we have an infinite number of
hypotheses. Instead of probabilities we obtain probability densities. Bayes' theorem now reads
f(x, θ) = fx(x|θ) πθ(θ) = fθ(θ|x) πx(x) ,
which is just the relation 3.36 of Sect. 3.5, where fx, fθ are conditional distribution densities and πx(x), πθ(θ) are the marginal distributions of f(x, θ). The joint probability density f(x, θ) of the two random variables x, θ is equal to the conditional probability density fx(x|θ) of x, where θ is fixed, multiplied by the probability density πθ(θ), the marginal distribution of θ.
For an observation x we obtain analogously to our previous relations
fθ(θ|x) = fx(x|θ) πθ(θ) / πx(x) ,
and
fθ(θ|x) = fx(x|θ) πθ(θ) / ∫_{−∞}^{∞} fx(x|θ) πθ(θ) dθ .      (6.5)
3
A function of the observations is called a statistic, to be distinguished from the
discipline statistics.
Fig. 6.2. Fit with known prior: Probability density for the true decay time. The
maximum of the distribution is located at θ = 1, the observed time is 1.5.
f(θ) = e^{−(t−θ)²/(2σ²)} e^{−θ/τ} / ∫_{0}^{∞} e^{−(t−θ)²/(2σ²)} e^{−θ/τ} dθ ,
If the value of the probability density fx (x|θ) in (6.5) varies much more
rapidly with θ than the prior – this is the case when the observation restricts
the parameter drastically – then to a good approximation the prior can be
regarded as constant in the interesting region. We then have
fθ(θ|x) ≈ fx(x|θ) / ∫_{−∞}^{∞} fx(x|θ) dθ .
In the absence of prior information the likelihood ratio is the only element
which we have to judge the relative virtues of alternative hypotheses. According to a lemma of J. Neyman and E. Pearson there is no other, more powerful quantity to discriminate between competing hypotheses (see Chap. 10).
Definition: The likelihood Li of a hypothesis Hi , to which corresponds
a probability density fi (x) ≡ f (x|Hi ) or a discrete probability distribution
Wi (k) ≡ P {k|Hi }, when the observation x, k, respectively, has been realized,
is equal to
Li ≡ L(i|x) = fi (x)
and
Li ≡ L(i|k) = Wi (k) ,
respectively. Here the index i which denotes the hypothesis is treated as an
independent random variable. When we replace it by a continuous parameter
θ and consider a parameter dependent p.d.f. f (x|θ) or a discrete probability
distribution W (k|θ) and observations x, k, the corresponding likelihoods are
L(θ) ≡ L(θ|x) = f (x|θ) ,
L(θ) ≡ L(θ|k) = W (k|θ) .
While the likelihood is related to the validity of a hypothesis given an
observation, the p.d.f. is related to the probability to observe a variate for a
given hypothesis. In our notation, the quantity which is considered as fixed
is placed behind the bar while the random variable is located left of it. When
both quantities are fixed the function values of both the likelihood and the
p.d.f. are equal. To attribute a likelihood makes sense only if alternative
hypotheses, either discrete or differing by parameters, can apply. If the like-
lihood depends on one or several continuous parameters, we have a likelihood
function.
For all values of θ the function f˜ evaluated for the sample x1 , . . . , xN is equal
to the likelihood L̃,
L̃(θ) ≡ L̃(θ|x1, x2, . . . , xN) = f̃(x1, x2, . . . , xN|θ) = ∏_{i=1}^{N} f(xi|θ) = ∏_{i=1}^{N} L(θ|xi) .
Fig. 6.3. Likelihood of three observations and two hypotheses with different p.d.f.s.
L̃(θ) ≡ L̃(θ|k1, . . . , kN) = ∏_{i=1}^{N} W(ki|θ) = ∏_{i=1}^{N} L(θ|ki) .
hence
L = LA LB ,
ln L = ln LA + ln LB .
Fig. 6.4. Likelihood ratio for two normal distributions. Top: 1 observation, bottom:
5 observations.
c) We now consider five observations which have been taken from distribution
f1 (Fig. 6.4c) and distribution f2 , respectively (Fig. 6.4d). We obtain the
likelihood ratios
L1/L2 = 30 (Fig. 6.4c) ,
L1/L2 = 1/430 (Fig. 6.4d) .
It turns out that narrow distributions are easier to exclude than broad ones.
On the other hand we get in case b) a preference for distribution 1 even
though the observation is located right at the center of distribution 2.
f1(t) = (1/τ) e^{−t/τ} / (e^{−tmin/τ} − e^{−tmax/τ}) ,
f2(t) = 1/(tmax − tmin) .
The likelihoods are equal to the product of the p.d.f.s at the observations:
L1 = [τ (e^{−tmin/τ} − e^{−tmax/τ})]^{−N} exp(−Σ_{i=1}^{N} ti/τ) ,
L2 = 1/(tmax − tmin)^{N} .
With t̄ = Σ ti/N the mean value of the times, we obtain the likelihood ratio
L1/L2 = [(tmax − tmin)/(τ (e^{−tmin/τ} − e^{−tmax/τ}))]^{N} e^{−N t̄/τ} .
4
Carl Friedrich Gauß (1777-1855), German mathematician, astronomer and
physicist.
Fig. 6.5. Log-likelihood function and uncertainty limits for 1, 2, 3 standard devi-
ations.
L(θ) = ∏_{i=1}^{N} f(xi|θ) ,      (6.8)
ln L(θ) = Σ_{i=1}^{N} ln f(xi|θ) .      (6.9)
5
The advantage of using the log-likelihood compared to the likelihood itself is that we do not have to differentiate a product but only a sum, which is much more convenient.
6.4.2 Examples
Example 76. Maximum likelihood estimate (MLE) of the mean life of an un-
stable particle
Given be N decay times ti of an unstable particle with unknown mean
life τ . For an exponential decay time distribution
f (t|γ) = γe−γt
the likelihood of the sample is
L = ∏_{i=1}^{N} γ e^{−γ ti} = γ^{N} e^{−γ Σ_{i=1}^{N} ti} ,
ln L = N ln γ − γ Σ_{i=1}^{N} ti .
Setting the derivative with respect to γ equal to zero yields γ̂ = N/Σ ti = 1/t̄.
Thus the estimate is just equal to the mean value t of the observed decay
times. In practice, the full range up to infinitely large decay times is not
always observable. If the measurement is restricted to an interval 0 < t <
tmax , the p.d.f. changes, it has to be renormalized:
f(t|γ) = γ e^{−γt}/(1 − e^{−γ tmax}) ,
ln L = N [ln γ − ln(1 − e^{−γ tmax})] − γ Σ_{i=1}^{N} ti .
The maximum is now located at the estimate γ̂, which fulfils the relation
0 = N [1/γ̂ − tmax e^{−γ̂ tmax}/(1 − e^{−γ̂ tmax})] − Σ_{i=1}^{N} ti ,
τ̂ = t̄ + tmax e^{−tmax/τ̂}/(1 − e^{−tmax/τ̂}) ,
which has to be evaluated numerically. If the time interval is not too short, tmax > τ, an iterative computation suggests itself: The correction term on the right hand side is neglected in zeroth order. At the subsequent iterations we insert in this term the value τ̂ of the previous iteration. We notice that the
estimate again depends solely on the mean value t of the observed decay
times. The quantity t is a sufficient statistic. We will explain this term in
more detail later. The case with also a lower bound tmin of t can be reduced
easily to the previous one by transforming the variable to t′ = t − tmin .
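A short Python sketch of this iteration for simulated, truncated decay times; the true mean life, the cut-off and the sample size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(10)
tau_true, t_max, n = 1.0, 2.0, 1000

# generate decay times restricted to 0 < t < t_max (truncated exponential)
t = -tau_true * np.log(1.0 - rng.uniform(size=n) * (1.0 - np.exp(-t_max / tau_true)))

# iterate tau = t_bar + t_max*exp(-t_max/tau)/(1 - exp(-t_max/tau)), starting from tau = t_bar
t_bar = t.mean()
tau = t_bar
for _ in range(50):
    tau = t_bar + t_max * np.exp(-t_max / tau) / (1.0 - np.exp(-t_max / tau))
print(t_bar, tau)    # tau should be close to tau_true, while t_bar alone is biased low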
(Fig. 6.6: the log-likelihood ln L as a function of the parameter for the examples below; panel a) refers to the mean µ, panel b) to the width σ.)
In the following examples we discuss the likelihood functions and the MLEs
of the parameters of the normal distribution with mean µ and standard de-
viation σ evaluated for 10 events drawn from N(x|1, 2) in four different
situations:
Example 77. MLE of the mean value of a normal distribution with known
width
Given be N observation xi drawn from a normal distribution of known
width s. The mean value µ is to be estimated:
f(x|µ) = 1/(√(2π) s) exp[−(x − µ)²/(2s²)] ,
L(µ) = ∏_{i=1}^{N} 1/(√(2π) s) exp[−(xi − µ)²/(2s²)] ,
ln L(µ) = −Σ_{i=1}^{N} (xi − µ)²/(2s²) + const      (6.11)
        = −N (\overline{x²} − 2 x̄ µ + µ²)/(2s²) + const .
The log-likelihood function is a parabola. It is shown in Fig. 6.6a for s = 2.
Deriving it with respect to the unknown parameter µ and setting the result
equal to zero, we get
N (x̄ − µ̂)/s² = 0 ,
µ̂ = x̄ .
The likelihood estimate µ̂ for the mean of the normal distribution is equal to
the arithmetic mean x of the sample. It is independent of s, but s determines
the width of the likelihood function and the standard error δµ = s/√N.
Example 78. MLE of the width of a normal distribution with given mean
Given are now N observations xi which follow a normal distribution of
unknown width σ to be estimated for known mean.
L(σ) = ∏_{i=1}^{N} 1/(√(2π) σ) exp[−(xi − x0)²/(2σ²)] ,
ln L(σ) = −N (½ ln 2π + ln σ) − Σ_{i=1}^{N} (xi − x0)²/(2σ²)
        = −N [ln σ + \overline{(x − x0)²}/(2σ²)] + const .
The log-likelihood function for our numerical values is presented in Fig. 6.6b.
Deriving it with respect to the parameter of interest and setting the result
equal to zero we find
0 = 1/σ̂ − \overline{(x − x0)²}/σ̂³ ,
σ̂ = √(\overline{(x − x0)²}) .
Again we obtain a well known result. The mean square deviation of the sample
values provides an estimate for the width of the normal distribution. This
relation is the usual distribution-free estimate of the standard deviation if the
expected value is known. The error bounds from the drop of the log-likelihood
function by 1/2 become asymmetric. Solving the respective transcendental
equation, neglecting higher orders in 1/N , one finds
q
1
σ̂ 2N
δσ± = q .
1
1 ∓ 2N
Example 79. MLE of the mean of a normal distribution with unknown width
The solution of this problem can be taken from Sect. 3.6.11 where we found that t = (x̄ − µ)/s with s² = Σ(xi − x̄)²/[N(N − 1)] = v²/(N − 1) follows the Student's distribution with N − 1 degrees of freedom,
h(t|N − 1) = Γ(N/2)/[Γ((N − 1)/2) √(π(N − 1))] · (1 + t²/(N − 1))^{−N/2} .
Example 80. MLE of the width of a normal distribution with unknown mean
Obviously, shifting a sample changes the mean value but not the true or the empirical variance $v^2 = \overline{(x-\bar x)^2}$. Thus the empirical variance v² can only depend on σ and not on µ. Without going into the details of the calculation, we state that N v²/σ² follows a χ² distribution of N − 1 degrees of freedom,
$$f(v^2|\sigma) = \frac{N}{\Gamma[(N-1)/2]\,2\sigma^2}\left(\frac{N v^2}{2\sigma^2}\right)^{(N-3)/2}\exp\left(-\frac{N v^2}{2\sigma^2}\right),$$
$$\ln L(\sigma) = -(N-1)\ln\sigma - \frac{N v^2}{2\sigma^2}\,,$$
corresponding to the dashed curve in Fig. 6.6b. (The numerical value of the
true value of µ was chosen such that the maxima of the two curves are located
at the same value in order to simplify the comparison.) The MLE is
$$\hat\sigma^2 = \frac{N}{N-1}\,v^2\,,$$
in agreement with our result (3.15). For the asymmetric error limits we find
in analogy to example 78
$$\delta\sigma_{\pm} = \frac{\hat\sigma\sqrt{\frac{1}{2(N-1)}}}{1\mp\sqrt{\frac{1}{2(N-1)}}}\,.$$
To find the maximum of the likelihood function, we set the partial derivatives equal to zero. Those values λ̂_k which satisfy the system of equations obtained this way are the MLEs λ̂_k of the parameters λ_k:
$$\left.\frac{\partial\ln L}{\partial\lambda_k}\right|_{\hat\lambda_1,\ldots,\hat\lambda_K} = 0\,. \qquad (6.14)$$
The error interval is now to be replaced by an error volume with its surface defined again by the drop of ln L by 1/2:
$$\ln L(\hat\lambda) - \ln L(\lambda) = 1/2\,.$$
Fig. 6.7. MLE of the parameters of a normal distribution and lines of constant
log-likelihood. The numbers indicate the values of log-likelihood relative to the
maximum.
We have to assume that this defines a closed surface in the parameter space,
in two dimensions just a closed contour, as shown in the next example.
Example 81. MLEs of the mean value and the width of a normal distribution
Given are N observations x_i which follow a normal distribution where now both the width σ and the mean value µ are unknown. As above, the log-likelihood is
$$\ln L(\mu,\sigma) = -N\left[\ln\sigma + \frac{\overline{(x-\mu)^2}}{2\sigma^2}\right] + \mathrm{const}\,.$$
The MLE and log-likelihood contours for a sample of 10 events with empirical mean values $\bar x = 1$ and $\overline{x^2} = 5$ are depicted in Fig. 6.7. The innermost line
encloses the standard error area. If one of the parameters, for instance µ = µ₁, is given, the log-likelihood of the other parameter, here σ, is obtained from the cross section of the likelihood function at µ = µ₁.
Similarly any other relation between µ and σ defines a curve in Fig. 6.7
along which a one-dimensional likelihood function is defined.
Remark: Frequently, we are interested only in one of the parameters, and we want to eliminate the others, the nuisance parameters. How to achieve this will be discussed in Sect. 7.8. Generally, it is not allowed to use the MLE of
a single parameter in the multi-parameter case separately, ignoring the other
parameters. While in the previous example σ̂ is the correct estimate of σ if µ̂
applies, the solution for the estimate and its likelihood function independent
of µ has been given in example 80 and that of µ independent of σ in example
79.
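The two-dimensional log-likelihood of Example 81 and its Δ ln L = −1/2 contour are easily scanned numerically. The following sketch (Python/NumPy, not part of the original text) draws its own random sample instead of fixing x̄ and the mean of squares, so the numbers are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 2.0, 10)
N = len(x)

def lnL(mu, sigma):
    # ln L(mu, sigma) = -N [ln sigma + mean((x-mu)^2)/(2 sigma^2)] + const
    return -N * (np.log(sigma) + np.mean((x - mu) ** 2) / (2 * sigma ** 2))

mu_hat = x.mean()
sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # joint MLE of sigma

# scan a grid and mark the standard error region, lnL(max) - lnL <= 1/2
mus = np.linspace(mu_hat - 3, mu_hat + 3, 201)
sigmas = np.linspace(0.3 * sigma_hat, 3 * sigma_hat, 201)
grid = np.array([[lnL(m, s) for m in mus] for s in sigmas])
inside = grid >= lnL(mu_hat, sigma_hat) - 0.5
print(f"MLE: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}; "
      f"grid fraction inside contour = {inside.mean():.3f}")
```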
f (x) = θ1 x + θ2 + θ3 N(x|µ, σ) .
Here N is the normal distribution with unknown mean µ and standard de-
viation σ. The other parameters are not independent because f has to be
normalized in the given interval xmin < x < xmax . Thus we can eliminate one
parameter. Assuming that the normal distribution is negligible outside the
interval, the norm D is:
$$D = \frac{1}{2}\theta_1\left(x_{max}^2 - x_{min}^2\right) + \theta_2\left(x_{max}-x_{min}\right) + \theta_3\,.$$
The normalized p.d.f. is therefore
$$f(x|\theta_1',\theta_2',\mu,\sigma) = \frac{\theta_1' x + \theta_2' + \mathrm{N}(x|\mu,\sigma)}{\frac{1}{2}\theta_1'(x_{max}^2-x_{min}^2) + \theta_2'(x_{max}-x_{min}) + 1}$$
with the new parameters θ1′ = θ1/θ3 and θ2′ = θ2/θ3. The likelihood function
is obtained in the usual way by inserting the observations of the sample into ln L = Σ ln f(x_i|θ1′, θ2′, µ, σ). Maximizing this expression, we obtain the four parameters and from those the fraction of signal events S = θ3/D:
$$S = \left[1 + \frac{1}{2}\theta_1'\left(x_{max}^2-x_{min}^2\right) + \theta_2'\left(x_{max}-x_{min}\right)\right]^{-1}.$$
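A minimal numerical sketch of such a signal-fraction fit is given below (Python with NumPy/SciPy, not part of the original text). The toy sample, the starting values and the true signal fraction of 0.3 are assumptions made only for illustration; here the background is taken as flat, i.e. θ1′ is compatible with zero.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(4)
xmin, xmax = 0.0, 1.0
# toy sample: 70% flat background, 30% Gaussian signal at 0.5
x = np.concatenate([rng.uniform(xmin, xmax, 700),
                    rng.normal(0.5, 0.05, 300)])

def neg_lnL(p):
    t1p, t2p, mu, sigma = p          # t1p = theta1/theta3, t2p = theta2/theta3
    if sigma <= 0:
        return 1e12
    dens = t1p * x + t2p + norm.pdf(x, mu, sigma)
    if np.any(dens <= 0):
        return 1e12
    D = 0.5 * t1p * (xmax**2 - xmin**2) + t2p * (xmax - xmin) + 1.0
    return -(np.log(dens).sum() - len(x) * np.log(D))

res = minimize(neg_lnL, x0=[0.0, 2.0, 0.5, 0.1], method="Nelder-Mead",
               options={"maxiter": 5000})
t1p, t2p, mu, sigma = res.x
S = 1.0 / (1.0 + 0.5 * t1p * (xmax**2 - xmin**2) + t2p * (xmax - xmin))
print(f"mu = {mu:.3f}, sigma = {sigma:.3f}, signal fraction S = {S:.3f}")
```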
In a previous example, we have seen that the likelihood function for a sample of exponentially distributed decay times is a function only of the sample mean. In fact, in many cases the i.i.d. individual elements of a sample {x1, . . . , xN} can be combined into fewer quantities, ideally into a single one, without affecting the estimation of the interesting parameters. The set of these quantities, which are functions of the observations, is called a sufficient statistic. The sample itself is of course a sufficient, though uninteresting, statistic.
According to R. A. Fisher, a statistic is sufficient for one or several parameters, if by addition of arbitrary other statistics of the same data sample, the
The distribution g(t|θ) then contains all the information which is relevant for the parameter estimation. This means that for the estimation process we can replace the sample by the sufficient statistic. In this way we may reduce the amount of data considerably. In the standard situation where all parameter components are constrained by the data, the dimension of t must be greater than or equal to the dimension of the parameter vector θ. Every set of uniquely invertible functions of t is also a sufficient statistic.
The relevance of sufficiency is expressed in a different way in the so-called
sufficiency principle:
If two different sets of observations have the same values of a sufficient
statistic, then the inference about the unknown parameter should be the same.
Of special interest is a minimal sufficient statistic. It consists of a minimal
number of components, ideally only of one element per parameter.
In what follows, we consider the case of a one-dimensional sufficient statis-
tic t(x1 , . . . , xN ) and a single parameter θ. The likelihood function can ac-
cording to (6.15) be written in the following way:
⁶ Note that also the domain of x has to be independent of θ.
Example 85. Sufficient statistic for mean value and width of a normal distri-
bution
Let x1 , . . . , xN be N normally distributed observations. The mean value
µ and the width σ be the parameters of interest. From (6.17)
$$L(\mu,\sigma|x_1,\ldots,x_N) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{N}\exp\left[-N\,\frac{\overline{x^2}-2\bar x\mu+\mu^2}{2\sigma^2}\right]$$
⁷ Note that the combination of two ancillary statistics is not necessarily ancillary.
The LP follows inevitably from the sufficiency principle and the conditioning principle. It goes back to Fisher and has been reformulated and derived several times [45, 46, 47, 48]. Some of the early promoters of the LP (Barnard, Birnbaum) later came close to rejecting it or to restricting its applicability. The refusal of the LP probably has its origin in its incompatibility with some concepts of classical statistics. A frequently expressed argument against the LP is that the confidence intervals of frequentist statistics cannot be derived from the likelihood function alone and thus contradict the LP. But this fact merely shows that certain statistical methods do not use the full information content of a measurement and/or use irrelevant information. Another reason lies in problems which sometimes occur if the LP is applied in the social sciences, in medicine or in biology. There it is often not possible to parameterize the empirical models in a stringent way, and uncertainties in the model prohibit the application of the LP; the exact validity of the model is a basic requirement for its application.
In the literature, examples are presented which are claimed to contradict the LP. These examples are not really convincing and rather strengthen the LP. Moreover, they often involve quite exotic distributions which are irrelevant in physics applications and which, when treated in a frequentist way, lead to unacceptable results [48].
We abstain from reproducing the rather abstract proof of the LP and limit ourselves to a simple and transparent illustration of it:
The quantity which contains all the information we have on θ after the measurement is the p.d.f. of θ,
$$g(\theta) = \frac{L(\theta|x)\,\pi(\theta)}{\int L(\theta|x)\,\pi(\theta)\,d\theta}\,.$$
It is derived from the prior density and the likelihood function. The prior
does not depend on the data, thus the complete information that can be
extracted from the data, and which is relevant for g(θ), must be contained in
the likelihood function.
A direct consequence of the LP is that in the absence of prior information, optimal parameter inference has to be based solely upon the likelihood function. It is then logical to select for the estimate the value of the parameter which maximizes the likelihood function, and to choose the error interval such that the likelihood is constant at the border, i.e. is smaller everywhere outside than inside (see Chap. 8). All approaches which are not based on the likelihood function are inferior to the likelihood method or at best equivalent to it.
An experiment searches for a rare reaction. Just after the first successful ob-
servation at time t the experiment is stopped. Do we have to consider the
stopping rule in the inference process? The answer is “no” but many scien-
tists have a different opinion. This is the reason why we find the expression
stopping rule paradox in the literature.
The possibility to stop an experiment without compromising the data
analysis, for instance because a detector failed, no money was left or because
the desired precision has been reached, means a considerable simplification
of the data analysis.
In this context we examine a simple example.
The fact that an arbitrary sequential stopping rule does not change the
expectation value is illustrated with an example given in Fig. 6.10. A rate
is determined. The measurement is stopped if a sequence of 3 decays oc-
curs within a short time interval of only one second. It is probable that the
Fig. 6.9. Likelihood functions for 20 experiments. Left-hand: time for 4 events. Right-hand: number of events in a fixed time interval. The dashed curve is the average of an infinite number of experiments.
observed rate is higher than the true one, the estimate is too high in most
cases. However, if we perform many such experiments one after the other,
their combination is equivalent to a single very long experiment where the
stopping rule does not influence the result and from which we can estimate
the mean value of the rate with high precision. Since the log-likelihood of
the long experiment is equal to the sum of the log-likelihoods of the short
experiments, the log-likelihoods of the short experiments obviously represent
correctly the measurements.
Why does the fact that neglecting the stopping rule is justified contradict our intuition? Well, most of the sequences indeed lead to too high rates, but when we combine measurements the few long sequences get a higher weight and they tend to produce lower rates, and the average is correct. On the other hand, one might argue that the LP ignores the information that in most cases the true value of the rate is lower than the MLE. This information clearly matters if we were to bet on this property, but it is irrelevant for estimating the parameter value. A bias correction would improve somewhat the not very precise estimate for small sequences, but would be very unfavorable for the fewer but more precise long sequences, and since we have no prior information, we cannot know whether our sequence is short or long (see also Appendix 13.7).
$$\mu_n(\theta) = \int x^n f(x|\theta)\,dx\,. \qquad (6.18)$$
The empirical moments, e.g. the sample mean or the mean of squares, which we can extract trivially from a sample, are estimators of the moments of the distribution. From the inverse function $\mu^{-1}$ we obtain a consistent estimate of the parameter,
$$\hat\theta = \mu^{-1}(\hat\mu)\,,$$
because according to the law of large numbers we have (see Appendix 13.1)
It is clear that any function u(x), for which expected value and variance
exist, and where hui is an invertible function of θ, can be used instead of xn .
Therefore the method is somewhat more general than suggested by its name.
If the distribution has several parameters to be estimated, we must use
several moments or expected values, approximate them by empirical averages,
and solve the resulting system of – in general non-linear – equations for the
unknown parameters.
The estimators derived from the lower moments are usually more precise than those computed from the higher ones. Parameter estimation from the moments is usually inferior to that of the ML method. Only if the moments used form a sufficient statistic do the two approaches produce the same result.
The uncertainties of the fitted parameters have to be estimated from the covariance matrix of the corresponding moments and subsequent error propagation, or alternatively by a Monte Carlo simulation, generating the measurement several times. Also the bootstrap method, which will be introduced in Chap. 12, can be employed. Sometimes the error calculation is a bit annoying and reproduces the ML error intervals only in the large sample limit.
Example 90. Moments method: Mean and variance of the normal distribution
We come back to the example from Sect. 6.4.2. For a sample {x1 , . . . , xN },
following the distribution
$$f(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],$$
we determine independently the parameters µ and σ. We use again the abbreviations x̄ for the sample mean, $\overline{x^2}$ for the mean of the squares and $v^2 = \overline{(x-\bar x)^2} = \overline{x^2}-\bar x^2$ for the empirical variance. The relation between the moment µ1 and the parameter of the distribution µ is simply µ1 = µ, therefore
$$\hat\mu = \hat\mu_1 = \bar x\,.$$
In Chap. 3, we have derived the relation (3.15), $\langle v^2\rangle = \sigma^2(N-1)/N$, between the expectation of the empirical variance and the variance of the distribution; inverting it, we get
$$\hat\sigma = \sqrt{\frac{N}{N-1}}\,v\,.$$
The two estimates are uncorrelated. The error of µ̂ is derived from the estimated variance,
$$\delta\mu = \frac{\hat\sigma}{\sqrt{N}}\,,$$
and the error of σ̂ is determined from the expected variance of v. We omit the calculation; the result is:
$$\delta\sigma = \frac{\hat\sigma}{\sqrt{2(N-1)}}\,.$$
In the special case of the normal distribution, the independent point estimates
of µ and σ of the moments method are identical to those of the maximum
likelihood method. The errors differ for small samples but coincide in the
limit N → ∞.
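For illustration, the moments-method estimates of Example 90 can be evaluated directly from a sample; the following Python sketch (not part of the original text) uses an arbitrary toy sample of 100 events.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(1.0, 2.0, 100)
N = len(x)

# empirical moments
x_bar = x.mean()
v2 = np.mean((x - x_bar) ** 2)            # empirical variance

# moments-method estimates and their errors
mu_hat = x_bar
sigma_hat = np.sqrt(N / (N - 1) * v2)
delta_mu = sigma_hat / np.sqrt(N)
delta_sigma = sigma_hat / np.sqrt(2 * (N - 1))

print(f"mu    = {mu_hat:.3f} +/- {delta_mu:.3f}")
print(f"sigma = {sigma_hat:.3f} +/- {delta_sigma:.3f}")
```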
The moments method has the advantage that it is very simple, especially
in the case of distributions which depend linearly on the parameters – see
the next example below:
$$\hat\alpha = 3\,\bar x\,.$$
The least square method goes back to Gauss. Historically it has success-
fully been applied to astronomical problems and is still the best method we
have to adjust parameters of a curve to measured points if only the vari-
ance of the error distribution is known. It is closely related to the likelihood
method if the errors are normally distributed. Then we can write the p.d.f.
of the measurements in the following way:
$$f(y_1,\ldots,y_N|\theta) \propto \exp\left[-\sum_{i=1}^{N}\frac{(y_i-t(x_i,\theta))^2}{2\delta_i^2}\right],$$
Example 92. Counter example to the least square method: Gauging a digital
clock
Fig. 6.12. χ²-fit (dashed) of a straight line to digital measurements (time channel versus true time).
A digital clock has to be gauged. Fig. 6.12 shows the time channel as a
function of the true time and a least square fit by a straight line. The error
bars in the figure are not error bars in the usual sense but indicate the channel
width. The fit fails to meet the allowed range of the fifth point and therefore
is not compatible with the data. All straight lines which meet all “error bars”
have the same likelihood. One correct solution is indicated in the figure.
where V, the weight matrix, is the inverse of the covariance matrix. The
quantity χ2 is up to a factor two equal to the negative log-likelihood of a
multivariate normal distribution,
$$f(y_1,\ldots,y_N|\theta) \propto \exp\left[-\frac{1}{2}\sum_{i,j=1}^{N}(y_i - t_i)\,V_{ij}\,(y_j - t_j)\right],$$
y(x) = ax + b (6.22)
We set the derivatives to zero and introduce the following abbreviations. (In
parentheses we put the expressions for the special case where all uncertainties
are equal, δi = δ):
$$\bar x = \sum_i \frac{x_i}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Bigl(\sum_i x_i/N\Bigr)\,,$$
$$\bar y = \sum_i \frac{y_i}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Bigl(\sum_i y_i/N\Bigr)\,,$$
$$\overline{x^2} = \sum_i \frac{x_i^2}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Bigl(\sum_i x_i^2/N\Bigr)\,,$$
$$\overline{xy} = \sum_i \frac{x_i y_i}{\delta_i^2} \Big/ \sum_i \frac{1}{\delta_i^2} \quad \Bigl(\sum_i x_i y_i/N\Bigr)\,.$$
We obtain
$$\hat b = \bar y - \hat a\,\bar x\,, \qquad \overline{xy} - \hat a\,\overline{x^2} - \hat b\,\bar x = 0\,,$$
and
$$\hat a = \frac{\overline{xy}-\bar x\,\bar y}{\overline{x^2}-\bar x^2}\,, \qquad \hat b = \frac{\overline{x^2}\,\bar y - \bar x\,\overline{xy}}{\overline{x^2}-\bar x^2}\,.$$
The problem is simplified when we put the origin of the abscissa at the center of gravity x̄:
$$x' = x - \bar x\,, \qquad \hat a' = \frac{\overline{x'y}}{\overline{x'^2}}\,, \qquad \hat b' = \bar y\,.$$
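The weighted averages and the resulting straight-line estimates translate directly into code. The following Python sketch (not part of the original text) uses a small toy data set with individual, assumed uncertainties δ_i.

```python
import numpy as np

rng = np.random.default_rng(6)
a_true, b_true = 2.0, 1.0
x = np.linspace(0.0, 10.0, 11)
delta = 0.5 + 0.05 * x                     # individual uncertainties delta_i
y = a_true * x + b_true + rng.normal(0.0, delta)

w = 1.0 / delta**2                         # weights 1/delta_i^2
xm  = np.sum(w * x) / np.sum(w)            # weighted means as defined in the text
ym  = np.sum(w * y) / np.sum(w)
x2m = np.sum(w * x**2) / np.sum(w)
xym = np.sum(w * x * y) / np.sum(w)

a_hat = (xym - xm * ym) / (x2m - xm**2)
b_hat = ym - a_hat * xm
print(f"a_hat = {a_hat:.3f}, b_hat = {b_hat:.3f}")
```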
y(θ) = Aθ + e . (6.24)
A, also called the design matrix, is a rectangular matrix of given elements with
P columns and N rows, defining the above mentioned linear mapping from
the P -dimensional parameter space into the N-dimensional sample space.
The straight line fit discussed in Example 93 is a special case of (6.24) with $E(y_i) = \sum_{j=1}^{P=2} A_{ij}\theta_j = \theta_1 x_i + \theta_2$, $i = 1,\ldots,N$, and
$$A = \begin{pmatrix} x_1 & \cdots & x_N \\ 1 & \cdots & 1 \end{pmatrix}^{\!T}.$$
χ2 = (y − Aθ)T VN (y − Aθ)
D = (AT VN A)−1 AT VN
CP = (AT VN A)−1 .
A feature of the linear model is that the result (6.27) for θ̂ turns out to be linear in the measurements y. Using it together with (6.25) one easily finds E(θ̂) = θ, i.e. the estimate is unbiased⁹. The Gauss–Markov theorem states that any other estimate obeying these two assumptions will have an error matrix with diagonal elements larger than or equal to those of the above estimate (also called BLUE: best linear unbiased estimate).
Linear regression provides an optimal solution only for normally dis-
tributed, known errors. Often, however, the latter depend on the parameters.
Strictly linear problems are therefore rare. When the prediction is a non-
linear function of the parameters, the problem can be linearized by a Taylor
expansion as a first rough approximation. By iteration the precision can be
improved.
The importance of non-linear parameter inference by iterative linear re-
gression has decreased considerably. The minimum searching routines which
we find in all computer libraries are more efficient and easier to apply. Some
basic minimum searching approaches are presented in Appendix 13.12.
⁸ We keep here the notation χ², which is strictly justified only in the case of Gaussian error distributions or asymptotically for large N. Only then does it obey a χ² distribution with N − P degrees of freedom. The index of quadratic matrices indicates their dimension.
⁹ This is true for any N, not only asymptotically.
6.8 Properties of Estimators
The content of this Section is summarized in the Appendices 13.2 and 13.2.2.
6.8.1 Consistency
We require that the estimate θ̂ and the estimate of a function of θ, $\widehat{f(\theta)}$, satisfy the relation $\widehat{f(\theta)} = f(\hat\theta)$. For example, the mean lifetime τ and the decay rate γ of a particle are related by γ = 1/τ. Therefore their estimates from a sample of observations have to be related by γ̂ = 1/τ̂. If they were different, the prediction for the number of decays in a given time interval would depend on the choice of τ̂ or γ̂ used to evaluate the number. Similarly, in the computation of a cross section which depends on different powers of a coupling constant g, we would get inconsistent results unless $\widehat{g^n} = \hat g^{\,n}$. Estimators applied to
The bias b of an estimate θ̂ is the deviation of its expectation value from the
true value θ of the parameter:
b = E(θ̂) − θ .
$$f(\bar t|\gamma) = \frac{(5\gamma)^5}{4!}\,\bar t^{\,4}\exp(-5\gamma\bar t)\,,$$
and thus the expectation value E(γ̂) of γ̂ is
$$E(\hat\gamma) = \int_0^\infty \hat\gamma\, f(\bar t|\gamma)\,d\bar t = \frac{(5\gamma)^5}{4!}\int_0^\infty \bar t^{\,3}\exp(-5\gamma\bar t)\,d\bar t = \frac{5}{4}\,\gamma\,.$$
Example 95. Bias of the estimate of a Poisson rate with observation zero
We search for a rare decay but we do not observe any. The likelihood for
the mean rate λ is according to the Poisson statistic
$$L(\lambda) = \frac{e^{-\lambda}\lambda^0}{0!} = e^{-\lambda}\,.$$
When we normalize the likelihood function to obtain the Bayesian p.d.f. with a uniform prior, we obtain the expectation value ⟨λ⟩ = 1, while the value λ̂ = 0 corresponds to the maximum of the likelihood function. (It may seem astonishing that an expectation value of one follows from a null measurement. This result is a consequence of the assumption of a uniform prior distribution, which is not unreasonable because, had we not anticipated the possibility of a decay, we would not have performed the measurement. Since also mean rates different from zero may lead to the observation zero, it is natural that the expectation value of λ is different from zero.) Now if none of 10 similar experiments observed a decay, a naive averaging of the expectation values alone would again result in a mean of one, a crazy value. Strictly speaking, the likelihoods of the individual experiments should be multiplied, or, equivalently, the null rate would have to be normalized to ten times the original measuring time, with the Bayesian result 1/10.
Fig. 6.13. Likelihood function of the width of a uniform distribution for 12 observations.
Fig. 7.1. Linear distribution with adjusted straight line (left) and likelihood function (right).
We have seen in Sect. 3.6.3 that with increasing mean value t, the Poisson
distribution asymptotically approaches a normal distribution with variance
t. Thus for high statistics histograms the number of events d in a bin with
prediction t(θ) is described by
$$f(d) = \frac{1}{\sqrt{2\pi t}}\exp\left[-\frac{(d-t)^2}{2t}\right].$$
Contrary to the case of relation (6.20) the denominator of the exponent and
the normalization now depend on the parameters.
The corresponding log-likelihood is
$$\ln L = -\frac{(d-t)^2}{2t} - \frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln t\,.$$
For large t, the logarithmic term is an extremely slowly varying function of t. In situations where the Poisson distribution can be approximated by a normal distribution, it can safely be neglected. Omitting it and the constant term, we find for the whole histogram
$$\ln L = -\frac{1}{2}\sum_{i=1}^{B}\frac{(d_i-t_i)^2}{t_i} = -\frac{1}{2}\chi^2$$
with
$$\chi^2 = \sum_{i=1}^{B}\frac{(d_i-t_i)^2}{t_i}\,. \qquad (7.3)$$
If the approximation of the Poisson distribution by a normal distribution is
justified, the likelihood estimation of the parameters is equivalent to a least
square fit and the standard errors are given by an increase of χ2 by one unit.
Often histograms contain some bins with few entries. Then a binned like-
lihood fit is to be preferred to a χ2 fit, since the above condition of large ti
is violated. It is recommended to always perform a likelihood adjustment.
Again we can omit the last term in the likelihood analysis, because it does
not depend on θ.
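The following sketch (Python with NumPy/SciPy, not part of the original text) contrasts the binned Poisson likelihood fit with the χ² fit of (7.3) for a toy histogram of exponential decay times with sparsely populated bins; the sample size, binning and mean life are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
edges = np.linspace(0.0, 5.0, 26)
d, _ = np.histogram(rng.exponential(1.0, 200), bins=edges)
lo, hi = edges[:-1], edges[1:]

def prediction(tau, n_tot=200):
    # expected Poisson mean per bin for mean life tau, normalized to the window
    return n_tot * (np.exp(-lo / tau) - np.exp(-hi / tau)) / (1 - np.exp(-edges[-1] / tau))

def neg_lnL(tau):                      # Poisson likelihood, constant terms dropped
    t = prediction(tau)
    return -np.sum(d * np.log(t) - t)

def chi2(tau):                         # normal approximation, unreliable for small t_i
    t = prediction(tau)
    return np.sum((d - t) ** 2 / t)

tau_L = minimize_scalar(neg_lnL, bounds=(0.2, 5.0), method="bounded").x
tau_C = minimize_scalar(chi2, bounds=(0.2, 5.0), method="bounded").x
print(f"likelihood fit: tau = {tau_L:.3f},  chi2 fit: tau = {tau_C:.3f}")
```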
Example 98. Fit of the particle composition of an event sample (1) [36]
We consider the distribution f (x) of a mixture of K different types of
particles. The p.d.f. of the identification variable x (this could be, for example, the energy loss) for particles of type k be f_k(x). The task is to determine the
numbers λk of the different particle species in the sample from the observed
values xi of N detected particles. The p.d.f. of x is
¹ In the statistical literature this is called a compound distribution, see Sect. 3.60.
$$f(x) = \sum_{k=1}^{K}\lambda_k f_k(x)\Big/\sum_{k=1}^{K}\lambda_k$$
$$\frac{\left(\sum_{k=1}^{K}\lambda_k\right)^{N}}{N!}\,\exp\left(-\sum_{k=1}^{K}\lambda_k\right).$$
$$\frac{\partial\ln L}{\partial\lambda_m} = -1 + \sum_{i=1}^{N}\frac{f_m(x_i)}{\sum_{k=1}^{K}\lambda_k f_k(x_i)} = 0\,,$$
$$1 = \sum_{i=1}^{N}\frac{f_m(x_i)}{\sum_{k=1}^{K}\lambda_k f_k(x_i)}\,. \qquad (7.6)$$
$$\lambda_m^{(n)} = \sum_{i=1}^{N}\frac{\lambda_m^{(n-1)} f_m(x_i)}{\sum_{k=1}^{K}\lambda_k^{(n-1)} f_k(x_i)}$$
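This iteration is easily implemented. The sketch below (Python with NumPy/SciPy, not part of the original text) uses two assumed particle species with Gaussian identification p.d.f.s and recovers their abundances from a toy sample.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
# two particle species with known identification p.d.f.s f_k(x)
f = [lambda x: norm.pdf(x, 0.0, 1.0), lambda x: norm.pdf(x, 2.0, 1.0)]
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.0, 1.0, 700)])

lam = np.array([len(x) / 2.0, len(x) / 2.0])   # start with equal abundances
fx = np.column_stack([fk(x) for fk in f])      # f_k(x_i), shape (N, K)

for _ in range(200):                           # iteration of Eq. (7.6)
    denom = fx @ lam                           # sum_k lambda_k f_k(x_i)
    lam = lam * np.sum(fx / denom[:, None], axis=0)

print("fitted numbers of particles:", lam.round(1))   # about [300, 700]
```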
As a special case, let us assume that the cross section for a certain reaction
is equal to g(x|θ). Then we get the p.d.f. by normalization of g:
$$f(x|\theta) = \frac{g(x|\theta)}{\int g(x|\theta)\,dx}\,. \qquad (7.7)$$
The theoretical models are represented by Monte Carlo samples and the
parameter inference is carried out by a comparison of experimental and sim-
ulated histograms of the observed variable x′ . For di observed and mi Monte
Carlo events in bin i and a normalization parameter cm , we get for the like-
lihood instead of (7.1):
B
X
ln L = (−cm mi + di ln(cm mi )) . (7.9)
i=1
We assume that the statistical error of the simulation can be neglected, i.e.
M ≫ N applies, with M simulated events and N observed events. In some
rare cases the normalization cm is known, if not, it is a free parameter in
the likelihood fit. The parameters of interest are hidden in the Monte Carlo
predictions mi (θ).
If the number of the entries in all bins is large enough to approximate the
Poisson distribution by the normal distribution, we can as well minimize the
corresponding χ2 expression (7.3)
$$\chi^2 = \sum_{i=1}^{B}\frac{(d_i - c_m m_i)^2}{c_m m_i}\,. \qquad (7.10)$$
The simulation programs usually consist of two different parts. The first
part describes the physical process which depends on the parameters of inter-
est. The second models the detection process. Both parts often require large
program packages, the so-called event generators and the detector simula-
tors. The latter usually consume considerable computing power. Limitations
in the available computing time then may result in non-negligible statistical
fluctuations of the simulation.
In rare cases it is necessary to include the statistical error of the Monte Carlo simulation. The formulas are derived in Appendix 13.10 and the problem is discussed in detail in Ref. [26]. We summarize here the relevant relations. The Monte Carlo prediction for a histogram bin is, up to a normalization constant, $m_i = \sum_k w_{ik}$, where the sum runs over all $K_i$ weights $w_{ik}$ of the events of the bin. We omit in the following three formulas the bin index i. The quantities m̃, s and c̃_m have to be evaluated for each bin. We define a scaled number m̃,
$$\tilde m = s\,m \quad\text{with}\quad s = \frac{\sum w_k}{\sum w_k^2}\,,$$
and a normalization constant c̃ specific for the bin,
$$\tilde c_m = c_m/s\,.$$
The χ² expression is then
$$\chi^2 = \sum_{i=1}^{B}\left[\frac{(n - \tilde c_m\tilde m)^2}{\tilde c_m(n + \tilde m)}\right]_i\,.$$
If resolution effects are absent and only acceptance losses have to be taken care of, all weights in bin i are equal, $w_i$. The above relation simplifies with $K_i$ Monte Carlo entries in bin i to
$$\chi^2 = \sum_{i=1}^{B}\frac{(n_i - c_m m_i)^2}{c_m w_i(n_i + K_i)}\,.$$
$$\omega_1(x) = \frac{1}{f_0}\frac{df}{d\theta}(x|\theta_0)\,, \qquad (7.13)$$
$$\omega_2(x) = \frac{1}{2f_0}\frac{d^2f}{d\theta^2}(x|\theta_0)\,. \qquad (7.14)$$
The parameter inference of ∆θ is performed by comparing $m_i = m_{0i} + \Delta\theta\, m_{1i} + (\Delta\theta)^2 m_{2i}$ with the experimental histogram $d_i$ as explained in Sect. 2:
$$\chi^2 = \sum_{i=1}^{B}\frac{(d_i - c\,m_i)^2}{c\,m_i}\,. \qquad (7.15)$$
In many cases the quadratic term can be omitted. In other situations it
might be necessary to iterate the procedure.
We illustrate the method with two examples.
Example 99. Fit of the slope of a linear distribution with Monte Carlo cor-
rection
The p.d.f. be
$$f(x|\theta) = \frac{1+\theta x}{1+\theta/2}\,, \qquad 0\le x\le 1\,.$$
We generate observations x uniformly distributed in the interval 0 ≤ x ≤ 1,
simulate the experimental resolution and the acceptance, and histogram the
distorted variable x′ into bins i and obtain contents m0i . The same obser-
vations are weighted by x and summed up to the histogram m1i . These two
distributions are shown in Fig. 7.2 a, b. The dotted histograms correspond to
Fig. 7.2. The superposition of two Monte Carlo distributions, a) flat and b) triangular, is adjusted to the experimental data.
Fig. 7.3. Lifetime fit. The dotted histogram in b) is the superposition of the three
histograms of a) with weights depending on ∆λ.
$$f(\theta) = f(\theta_0) + \Delta\theta\,\frac{df}{d\theta}(\theta_0) + \frac{(\Delta\theta)^2}{2!}\frac{d^2f}{d\theta^2}(\theta_0) + \cdots$$
$$= f(\theta_0)\left[1 + \Delta\theta\,\frac{1}{f_0}\frac{df}{d\theta}(\theta_0) + \frac{(\Delta\theta)^2}{2!}\frac{1}{f_0}\frac{d^2f}{d\theta^2}(\theta_0) + \cdots\right].$$
$$f(x) = \frac{\alpha}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] + \beta_0 + \beta_1(x-0.5) + \beta_2(x-0.5)^2$$
to the observed histogram. The following table summarizes the results for linear and quadratic background subtractions and different ranges of x.

background   range          α̂           µ̂            σ̂            χ²/NDF
quadratic    [0.2, 0.8]     5033(122)    0.4994(10)    0.0515(11)    0.82
quadratic    [0.3, 0.7]     4975(242)    0.4996(10)    0.0512(16)    1.10
linear       [0.3, 0.7]     5131(104)    0.4996(10)    0.0521(10)    1.14
linear       [0.34, 0.66]   5165(133)    0.5006(11)    0.0524(12)    0.88
For the linear background subtraction the fitted amount of background and the width of the peak are larger than for the quadratic background interpolation. The quadratic background function leaves more freedom to the fit than the linear one and consequently the errors become larger. We choose the conservative solution with quadratic background shape and narrow range, α̂ = 4975 ± 242, µ̂ = 0.4996 ± 0.0010, σ̂ = 0.0512 ± 0.0016. The error margins cover also the results of the linear background subtraction. Part of the errors are of systematic type, caused by the uncertainties in the background parametrization. The purely statistical errors can be estimated by fixing the parameters of the background function. They are δα = 73, δµ = 0.0008, δσ = 0.0008. As expected, the precision of the number of events suffers primarily from the uncertain shape of the background. As the statistical and the systematic errors squared add up to the total error squared, we can calculate the systematic contributions δα^(sys) = 231, δµ^(sys) = 0.0006, δσ^(sys) = 0.0014. Except for the location µ, the systematic errors dominate. If different parametrizations of the background produce significantly different results, the systematic error has to be increased. The values of χ² are acceptable in all cases. (The χ² goodness-of-fit test will be discussed in Chap. 10.) In our Monte Carlo experiment we know the true parameter values α = 5000, µ = 0.5, σ = 0.05. The linear fits underestimate the background contribution and therefore lead to too large values of σ.
In rare cases we have the chance to record, independently of the signal sample, also a reference sample containing pure background. The measuring times or fluxes, i.e. the relative normalization of the two experiments, are either known or to be determined from the data distributions. In this lucky situation, we do not need to parameterize the background distribution and thus are independent of assumptions about its shape.
We introduce B additional parameters βi for the unknown background
prediction. The relative flux normalization c can either be known, or be an
unknown parameter in the fit. Our model predicts ti (θ) + βi for bin i of the
signal histogram and β_i/c for the background histogram. Our LS statistic is
$$\chi^2 = \sum_{i=1}^{B}\left\{\frac{[t_i(\theta)+\beta_i-d_i]^2}{t_i(\theta)+\beta_i} + \frac{(\beta_i/c - b_i)^2}{\beta_i/c}\right\}$$
and the corresponding log-likelihood is
$$\ln L = \sum_{i=1}^{B}\bigl[d_i\ln\bigl(t_i(\theta)+\beta_i\bigr) - \bigl(t_i(\theta)+\beta_i\bigr) + b_i\ln(\beta_i/c) - \beta_i/c\bigr]\,.$$
The idea behind the method is simple: The log-likelihood of the wanted
signal parameter as derived for the full signal sample is a superposition of
the log-likelihood of the genuine signal events and the log-likelihood of the
background events. The latter can be estimated from the reference sample
and subtracted from the full log-likelihood.
To illustrate the procedure, imagine we want to measure the signal
response of a radiation detector by recording a sample of signal heights
x1 , . . . , xN from a mono-energetic source. For a pure signal, the xi would
follow a normal distribution with resolution σ:
$$f(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)}\,.$$
The unknown parameter µ is to be estimated. After removing the source,
we can – under identical conditions – take a reference sample x′1 , . . . , x′M of
background events. They follow a distribution which is of no interest to us.
If we knew which observations x_i in our signal sample were signal (x_i^(S)), respectively background (x_i^(B)) events, we could take only the S signal events and calculate the correct log-likelihood function which is, up to constants,
$$\ln L = \sum_{i=1}^{S}\ln f(x_i^{(S)}|\mu) = -\sum_{i=1}^{S}\frac{(x_i^{(S)}-\mu)^2}{2\sigma^2}$$
$$= -\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2} + \sum_{i=1}^{B}\frac{(x_i^{(B)}-\mu)^2}{2\sigma^2}\,,$$
Fig. 7.5. Number of events versus x for the contaminated signal sample (left) and the background reference sample (right).
The general problem where the sample and parameter spaces could be
multi-dimensional and with different fluxes of the signal and the reference
sample, is solved in complete analogy to our example: Given a contaminated
signal distribution of size N and a reference distribution of size M and flux
1/r times that of the signal sample, we put
$$\ln\tilde L = \sum_{i=1}^{N}\ln f(x_i|\theta) - r\sum_{i=1}^{M}\ln f(x_i'|\theta)\,. \qquad (7.18)$$
$$\hat\mu = \frac{N\bar x - rM\bar x'}{N - rM} = \frac{95\cdot 0.61 - 0.4\cdot 91\cdot 1.17}{95 - 0.4\cdot 91} = 0.26 \pm 0.33\,.$$
The error is estimated by linear error propagation. The result is indicated
in Fig. 7.5. The distributions were generated with nominally 60 pure signal
plus 40 background events and 100 background reference events. The signal
corresponds to a normal distribution, N (x|0, 1), and the background to an
exponential, ∼ exp(−0.2x).
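A minimal sketch of this background-subtraction likelihood (7.18) is given below (Python with NumPy/SciPy, not part of the original text). The toy samples roughly follow the numbers quoted in the example (about 60 signal plus 40 background events, 100 reference events, r = 0.4), but the background shape and the random values are assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
sigma = 1.0
# contaminated signal sample and pure background reference sample
x_sig = np.concatenate([rng.normal(0.0, sigma, 60), rng.exponential(5.0, 40)])
x_ref = rng.exponential(5.0, 100)
r = 0.4                                   # flux ratio signal/reference

def neg_lnLtilde(mu):
    # Eq. (7.18) with the Gaussian signal p.d.f.; constants dropped
    return (np.sum((x_sig - mu) ** 2) - r * np.sum((x_ref - mu) ** 2)) / (2 * sigma ** 2)

mu_hat = minimize_scalar(neg_lnLtilde, bounds=(-3.0, 3.0), method="bounded").x
# analytic solution for the Gaussian case, as quoted in the text
mu_analytic = (len(x_sig) * x_sig.mean() - r * len(x_ref) * x_ref.mean()) \
              / (len(x_sig) - r * len(x_ref))
print(f"numerical: {mu_hat:.3f}, analytic: {mu_analytic:.3f}")
```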
The interesting parameters are not always independent of each other but are
often constrained by physical or geometrical laws.
As an example let us look at the decay of a Λ particle into a proton and
a pion, Λ → p + π, where the direction of flight of the Λ hyperon and
the momentum vectors of the decay products are measured. The momentum
vectors of the three particles which participate in the reaction are related
through the conservation laws of energy and momentum. Taking into account
the conservation laws, we add information and can improve the precision of
the momentum determination.
In the following we assume that we have N direct observations xi which
are predicted by functions ti (θ) of a parameter vector θ with P components
as well as K constraints of the form hk (θ) = 0. Let us assume further that
In most cases the constraints are obeyed exactly, δk = 0, and the sec-
ond term in (7.20) diverges. This difficulty is avoided in the following three
procedures:
1. The constraints are used to reduce the number of parameters.
2. The constraints are approximated by narrow Gaussians and the limit
δk → 0 is approached.
3. Lagrange multipliers are adjusted to satisfy the constraints.
We will discuss the problem in terms of a χ2 minimization. The solutions
can also be applied to likelihood fits.
$$\chi^2 = \frac{(l_1-\lambda_1)^2}{\delta^2} + \frac{(l_2-\lambda_2)^2}{\delta^2}$$
including the constraint λ1 + λ2 = l = 100 cm. We simply replace λ2 by l − λ1 and adjust λ1, minimizing
$$\chi^2 = \frac{(l_1-\lambda_1)^2}{\delta^2} + \frac{(l-l_2-\lambda_1)^2}{\delta^2}\,.$$
The minimization relative to λ1 leads to the result:
$$\hat\lambda_1 = \frac{l}{2} + \frac{l_1-l_2}{2} = 35.5 \pm 0.2\ \mathrm{cm}\,.$$
From the MLE we obtain in the usual way the first M−1 parameters and their error matrix E. The remaining parameter λ_M and the related error matrix elements E_{Mj} are derived from the constraint (7.21) and the corresponding relation Σ∆λ_m = 0. The diagonal error is the expected value of (∆λ_M)²:
$$\Delta\lambda_M = -\sum_{m=1}^{M-1}\Delta\lambda_m\,,$$
$$(\Delta\lambda_M)^2 = \left[\sum_{m=1}^{M-1}\Delta\lambda_m\right]^2\,,$$
$$E_{MM} = \sum_{m=1}^{M-1}E_{mm} + \sum_{m=1}^{M-1}\sum_{l\ne m}E_{ml}\,.$$
These trivial examples are not really representative for the typical prob-
lems we have to solve in particle- or astrophysics. Indeed, it is often com-
plicated or even impossible to reduce the parameter set analytically to an
unconstrained subset, but we can introduce a new unconstrained parameter
set which then predicts the measured quantities. To find such a set is straightforward in the majority of problems: We just have to think how we would sim-
ulate the corresponding experimental process. A simulation is always based
on a minimum set of parameters. The constraints are satisfied automatically.
$$\boldsymbol\pi(\boldsymbol\pi_c,\boldsymbol\pi_a,\boldsymbol\pi_b) \equiv \boldsymbol\pi_c - \boldsymbol\pi_a - \boldsymbol\pi_b = 0\,,$$
$$\varepsilon(\boldsymbol\pi_c,\boldsymbol\pi_a,\boldsymbol\pi_b) \equiv \sqrt{\pi_c^2+m_c^2} - \sqrt{\pi_a^2+m_a^2} - \sqrt{\pi_b^2+m_b^2} = 0\,.$$
Often the reduced parameter set is more relevant than the set correspond-
ing to the measurement, because a simulation usually is based on parameters
which are of scientific interest. For example, the investigation of the Λ decay
might have the goal to determine the Λ decay parameter which depends on
the center of mass direction of the proton relative to the Λ polarization, i.e.
on one of the directly fitted quantities.
The exact limit will not be obtained, but it is sufficient to choose the parameters δ_k small compared to the experimental resolution of the constraint. Parameter estimation is performed by numerical approximation in computer programs following methods like Simplex. The required precision is steered by a parameter provided by the user. The parameter δ plays a similar role. To estimate the resolution, the constraint is evaluated from the observed data, h̃(x_1, ..., x_N), and we require δ_k² ≪ h̃². The precise choice of the constraint precision δ_k is not at all critical, but extremely small values of δ_k could lead to numerical problems. In case the minimum search is slow, or does not converge, one should start initially with loose constraints which subsequently can be tightened.
The value of χ2 in the major part of the parameter space is dominated
by the contributions from the constraint terms. In the minimum searching
programs the parameter point will therefore initially move quickly from its
starting value towards the subspace defined by the constraint equations and
then proceed towards the minimum of χ2 .
Note that the minimum of χ² is found at parameter values that satisfy
the constraints much better than naively expected from the set constraint
tolerances. The reason is the following: Once the parameters are close to
their estimates, small changes which reduce the χ2 contribution of the penalty
terms, will not sizably affect the remaining terms. Thus the minimum will be
observed very close to hk = 0. As a consequence, the contribution of the K
constraint terms in (7.20) to the minimum value of χ2 is negligible.
produces the same result as the fit presented above. The value δ_k² = 10⁻¹⁰ δ² is chosen small compared to δ.
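A minimal sketch of the penalty-term approach is given below (Python with SciPy, not part of the original text), applied to the two length measurements of the earlier example. The numerical values l1, l2 and δ are illustrative assumptions chosen to reproduce λ̂1 = 35.5 ± 0.2 cm; only the result is quoted in the text.

```python
import numpy as np
from scipy.optimize import minimize

# two length measurements with equal errors and the constraint lambda1 + lambda2 = l
l1, l2, delta = 35.3, 64.3, 0.283
l = 100.0
delta_c = 1e-5 * delta          # penalty width, delta_c^2 = 1e-10 * delta^2

def chi2(lam):
    lam1, lam2 = lam
    return ((l1 - lam1) ** 2 + (l2 - lam2) ** 2) / delta ** 2 \
           + (lam1 + lam2 - l) ** 2 / delta_c ** 2     # constraint as narrow Gaussian

res = minimize(chi2, x0=[l1, l2], method="Powell")
print("lambda1, lambda2 =", res.x.round(3))            # ~ [35.5, 64.5]
print("exact:", (l + l1 - l2) / 2, (l + l2 - l1) / 2)
```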
moves along the z axis. The measured coordinates x, y are small compared to the decay length r ≈ z which is fixed. The χ² expression is
$$\chi^2 = \frac{(x-\xi)^2}{\delta_x^2} + \frac{(y-\theta)^2}{\delta_y^2} + \sum_{i,j=1}^{3}(p_{ai}-\pi_{ai})\,V_{aij}\,(p_{aj}-\pi_{aj}) + \sum_{i,j=1}^{3}(p_{bi}-\pi_{bi})\,V_{bij}\,(p_{bj}-\pi_{bj}) + \cdots$$
The first two terms of (7.23) compare the x and y components of the Λ
path vector with the corresponding parameters ξ and θ. The next two terms
measure the difference between the observed and the fitted momentum com-
ponents of the proton and the pion. The following two terms constrain the
direction of the Λ hyperon flight path to the direction of the momentum vec-
tor π = π a + π b and the last term constrains the invariant mass mpπ (π a , πb )
of the decay products to the Λ mass. We generate 104 events, all with the
same nominal parameter values but different normally distributed measure-
ment errors. The velocity of the Λ particle is parallel to the z axis with a
Lorentz factor γ = 9. The decay length is fixed to 1 m. The direction of the
proton in the Λ center of mass is defined by the polar and azimuthal an-
gles θ = 1.5, φ = 0.1. The measurement errors of the x and y coordinates
are δx = δy = 1 cm. The momentum error is assumed to be the sum of a term proportional to the momentum p squared, δp_r = 2p²/(GeV)², and a constant term δp_0 = 0.02 GeV, added to each momentum component. The tolerances for the constraints are δα = 0.001 and δm = 0.1 MeV, i.e. about 10⁻³ times the experimental uncertainty. The minimum search is performed with a combination of a simplex and a parabolic minimum searching routine. The starting values for the parameters are the measured values. The fit starts with a typical value of χ²₀ of 2 × 10⁸ and converges for all events, with a mean value of χ² of 2.986 and a standard deviation of 2.446, to be compared to the nominal values 3 and √6 = 2.450. The contribution from each of the three constraint terms to χ² is 10⁻⁴. Thus the deviation from the constraints is about 10⁻⁷ times the experimental uncertainty.
with the MLE λ̂_{1,2} = l_{1,2} − δ²α. Using λ̂1 + λ̂2 = l we find δ²α = (l1 + l2 − l)/2 and, as before, λ̂1 = (l + l1 − l2)/2, λ̂2 = (l + l2 − l1)/2.
7.5.5 Conclusion
By far the simplest method is the one where the constraint is directly included
and approximated by a narrow Gaussian. With conventional minimizing pro-
grams the full error matrix is produced automatically.
The approach using a reduced parameter set is especially interesting when
we are primarily interested in the parameters of the reduced set. This is the
case in most kinematical fits. Due to the reduced dimension of the parameter
space, it is faster than the other methods. The determination of the errors of
the original parameters through error propagation is sometimes tedious, but
in most applications only the reduced set is of interest.
A p.d.f. f(x, y|θ) of two variates with a linear parameter dependence can always be written in the form
$$g(u,v) = v\,(1+u\theta)\,\frac{\partial u\,\partial v}{\partial x\,\partial y}\,,$$
we derive the log-likelihood of θ
$$\ln L(\theta) = \sum_i\ln(1+u_i\theta) + \mathrm{const.}$$
with δi the error of the bracket in the numerator. Estimate α̂, β̂ and their
errors.
4. compute
$$\hat\theta = \frac{\hat\alpha\theta_1 + \hat\beta\theta_2}{\hat\alpha+\hat\beta}\,.$$
The steps 3 and 4 can be combined. The two parameters α̂, β̂ can be eliminated and we obtain χ² as a function of θ and the normalization parameter c:
$$\chi^2 = \sum_i\frac{\left[d_i - c\,\frac{(\theta-\theta_2)\,t_{1i}-(\theta-\theta_1)\,t_{2i}}{\theta_1-\theta_2}\right]^2}{\delta_i^2}\,.$$
Thus our procedure makes sense only if the number of parameters is smaller
than the dimension of the variate space.
Fig. 7.6. Simulated p.d.f.s of the reduced variable u for the values ±θ of the parameter.
$$u = \frac{x+y^3}{(x^2+y^2+z^2)^{1/2}}\,, \qquad |u|\le\sqrt{2}\,,$$
$$v = (x^2+y^2+z^2)^{1/2}\,, \qquad 0\le v\le 1\,,$$
$$z = z\,,$$
$$g'(u,v,z|\theta) = \frac{v}{\pi}\,[1+u\theta]\,\frac{\partial(x,y,z)}{\partial(u,v,z)}\,,$$
$$g(u|\theta) = \int dz\,dv\;g'(u,v,z|\theta)\,.$$
Fig. 7.7. Measured lifetime distribution. The insert indicates the transformation
of the measured lifetime to the corrected one.
of the distorted sample t′i will usually still contain almost the full information
relative to the mean life τ . The relation τ (t′ ) between τ and its approximation
where the denominator is the global acceptance and provides the correct normalization. We abbreviate it by A(λ). The log-likelihood of N observations is
$$\ln L(\lambda) = \sum\ln\alpha(x_i) + \sum\ln f(x_i|\lambda) - N\ln A(\lambda)\,.$$
The first term can be omitted. The acceptance A(λ) can be determined by
a Monte Carlo simulation. Again a rough estimation is sufficient, at most it
reduces the precision but does not introduce a bias, since all approximations
are automatically corrected with the transformation λ(λ′ ).
Frequently, the relation (7.27) can only be solved numerically, i.e. we
find the maximum of the likelihood function in the usual manner. We are
also allowed to approximate this relation such that an analytic solution is
possible. The resulting error is compensated in the simulation.
at b_0 with
$$b = b_0 + \beta$$
and derive it with respect to β to find the value β̂ at the maximum:
$$\sum\frac{x_i}{0.5+(b_0+\hat\beta)x_i} = 0\,.$$
Neglecting quadratic and higher order terms in β̂ we can solve this equation for β̂ and obtain
$$\hat\beta \approx \frac{\sum x_i/f_{0i}}{\sum x_i^2/f_{0i}^2} \qquad (7.28)$$
where we have set f_{0i} = f(x_i|b_0).
where we have set f0i = f (xi |b0 ). If we allow also for a quadratic term
and get, after deriving ln L with respect to α and β and linearizing, two linear
equations for α̂ and β̂:
7.8 Nuisance Parameters 227
$$\hat\alpha\sum A_i^2 + \hat\beta\sum A_iB_i = \sum A_i\,,$$
$$\hat\alpha\sum A_iB_i + \hat\beta\sum B_i^2 = \sum B_i\,, \qquad (7.29)$$
We will re-discuss this example in the next subsection and present in the following some approaches which permit the elimination of the nuisance parameters. First we will investigate exact methods and then we will turn to the more problematic part where we have to apply approximations.
Fig. 7.8. Log-likelihood contour as a function of the decay rate and the number of
background events. For better visualization the discrete values of the event numbers
are connected.
The elimination of the nuisance parameter is very easy if the p.d.f. is of the form
with θ the parameter which we are interested in. The normalized x distribution depends only on θ. Whatever value ν takes, the shape of this distribution remains always the same. Therefore we can estimate θ independently of ν. The likelihood function is proportional to a normal distribution of θ,
$$L(\theta) \sim \exp\left[-\frac{a^2}{2}(\theta-\hat\theta)^2\right],$$
with the estimate θ̂ = x̄ = Σx_i/N.
$$L(\theta,\nu) \sim \exp\left(-\frac{a^2(\theta-\hat\theta)^2 - 2ab\rho(\theta-\hat\theta)(\nu-\hat\nu) + b^2(\nu-\hat\nu)^2}{2(1-\rho^2)}\right),$$
$$f_1(r_1|\rho_1) = \frac{e^{-\rho_1}\rho_1^{r_1}}{r_1!}\,, \qquad f_2(r_2|\rho_2) = \frac{e^{-\rho_2}\rho_2^{r_2}}{r_2!}\,.$$
The interesting parameter is the expected absorption θ = ρ2 /ρ1 . In first
approximation we can use the estimates r1 , r2 of the two independent pa-
rameters ρ1 and ρ2 and their errors to calculate in the usual way through
error propagation θ and its uncertainty:
$$\hat\theta = \frac{r_2}{r_1}\,, \qquad \frac{(\delta\hat\theta)^2}{\hat\theta^2} = \frac{1}{r_1}+\frac{1}{r_2}\,.$$
For large numbers r1 , r2 this method is justified but the correct way is to
transform the parameters ρ1 , ρ2 of the combined distribution
$$\tilde f(r_1,r_2|\theta,\nu) = e^{-\nu}\,\nu^{r_1+r_2}\,\frac{(1+1/\theta)^{-r_2}(1+\theta)^{-r_1}}{r_1!\,r_2!}\,,$$
L(θ, ν|r1 , r2 ) = Lν (ν|r1 , r2 )Lθ (θ|r1 , r2 ) .
It is presented in Fig. 7.9 for the specific values r1 = 10, r2 = 20. The
maximum is located at θ̂ = r2 /r1 , as obtained with the simple estimation
above. However the errors are asymmetric.
Example 118. Fitting the width of a normal distribution with the mean as nuisance parameter
The sample mean x̄ of measurements is a sufficient statistic for µ of the
normal distribution N (x|µ, σ). We can replace µ by x̄ in the Gaussian and
are left with the wanted parameter only, see also example 80.
Fig. 7.10. Profile likelihood (solid curve, right hand scale) and ∆(ln L) = −1/2,
θ − ν contour (left-hand scale). The dashed curve is ν̂(θ).
The function ρ̂₁, the profile likelihood and the one standard deviation error contour are depicted in Fig. 7.10. The result coincides with that of the exact factorization. (In the figure the nuisance parameter is denoted by ν.)
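The profile likelihood of the absorption example (r1 = 10, r2 = 20) is easily computed by a brute-force scan; the sketch below (Python/NumPy, not part of the original text) maximizes the full Poisson likelihood over the nuisance parameter ν for each value of θ, and the grid ranges are illustrative assumptions.

```python
import numpy as np

r1, r2 = 10, 20                       # observed counts without / with absorber

def lnL(theta, nu):
    # Poisson log-likelihood in theta = rho2/rho1, nu = rho1 + rho2
    rho1 = nu / (1.0 + theta)
    rho2 = nu * theta / (1.0 + theta)
    return r1 * np.log(rho1) - rho1 + r2 * np.log(rho2) - rho2

thetas = np.linspace(0.5, 6.0, 500)
nus = np.linspace(10.0, 60.0, 500)
# profile likelihood: maximize over the nuisance parameter nu for each theta
profile = np.array([lnL(t, nus).max() for t in thetas])
profile -= profile.max()

theta_hat = thetas[np.argmax(profile)]
inside = thetas[profile >= -0.5]       # delta(lnL) = -1/2 error interval
print(f"theta_hat = {theta_hat:.2f}, "
      f"interval = [{inside[0]:.2f}, {inside[-1]:.2f}]")
```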
The point estimate is a statistic that depends on the input data. The uncertainties of the data determine the error that we can associate with the estimate. We distinguish between two input situations: a) a set of i.i.d. observations is given; b) we have measurements with associated error distributions. In the first case we can apply the bootstrap method, in the second we resample the input variables from the error distribution. The simulated data can be used to generate distributions of the wanted parameter from which moments and confidence limits can be derived.
Bootstrap Resampling
ple x∗1, x∗2, . . . , x∗N. (The bootstrap sample may contain the same observation several times.) We use the bootstrap sample to estimate θ, ν. The procedure is repeated many times and produces a distribution of θ from which we can derive arbitrary moments and confidence intervals. The bootstrap method also permits estimating the uncertainties of the estimates.
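A minimal bootstrap sketch (Python/NumPy, not part of the original text): the sample, the estimator (here the sample mean as estimate of a mean life) and the number of replicas are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.exponential(1.0, 50)              # original i.i.d. sample
tau_hat = x.mean()                        # point estimate of the mean life

B = 2000
boot = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=len(x), replace=True)   # bootstrap sample
    boot[b] = xb.mean()

# spread and confidence interval of the estimate from the bootstrap distribution
lo, hi = np.quantile(boot, [0.16, 0.84])
print(f"tau_hat = {tau_hat:.3f}, bootstrap std = {boot.std():.3f}, "
      f"68% interval = [{lo:.3f}, {hi:.3f}]")
```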
to the likelihood function. The bin to bin fluctuations of the histogram reflect
the discrete nature of the Poisson distribution.
If the methods which we have discussed so far fail, we are left with only two possibilities: Either we give up the elimination of the nuisance parameter or we integrate it out. The simple integration
$$L_\theta(\theta|x) = \int_{-\infty}^{\infty} L(\theta,\nu|x)\,d\nu$$
Usually the error limits will show the same dependence as the MLE which
means that the width of the interval is independent of ν.
However, publishing a dependence of the parameter of interest on the
nuisance parameter is useful only if ν corresponds to a physical constant and
not to an internal parameter of an experiment like efficiency or background.
7.8.9 Recommendation
¹ The term interval is not restricted to a single dimension. In n dimensions it describes an n-dimensional volume.
² The term credibility interval is used for Bayesian intervals.
$$V_{ij} = -\left.\frac{\partial^2\ln L}{\partial\theta_i\,\partial\theta_j}\right|_{\hat\theta}$$
As above, we again use the likelihood ratio to define the error limits which
now usually are asymmetric. In the one-dimensional case the two errors δ−
and δ+ satisfy
$$L_\tau = \prod_{i=1}^{N}\frac{1}{\tau}\,e^{-t_i/\tau} = \frac{1}{\tau^N}\,e^{-N\bar t/\tau}\,. \qquad (8.3)$$
The values of the functions are equal at equivalent values of the two param-
eters τ and λ, i.e. for λ = 1/τ :
Lλ (λ) = Lτ (τ ) .
Fig. 8.1 shows the two log-likelihoods for a small sample of ten events with
mean value t = 0.5. The lower curves for the parameter τ are strongly asym-
metric. This is also visible in the limits for changes of the log-likelihood by
0.5 or 2 units which are indicated on the right hand cut-outs. The likelihood
with the decay rate as parameter (upper figures) is much more symmetric
than that of the mean life. This means that the decay rate is the more ap-
propriate parameter to document the shape of the likelihood function, to
average different measurements and to perform error propagation, see below.
On the other hand, we can of course transform the maximum likelihood esti-
mates and errors of the two parameters into each other without knowing the
likelihood function itself.
Generally, it does not matter whether we use one or the other parameter
to present the result but for further applications it is always simpler and
more precise to work with approximately symmetric limits. For this reason
usually 1/p (p is the absolute value of the momentum) instead of p is used as
parameter when charged particle trajectories are fitted to the measured hits
in a magnetic spectrometer.
In the general case we satisfy the conditions 4 to 7, 10, 11 of our wish list, but 2, 3, 8, 9 are only approximately valid. We can neither associate an exact probability content to the intervals, nor do the limits correspond to moments of a p.d.f.
Fig. 8.1. Likelihood functions for the parameters decay rate (top) and lifetime
(below). The standard deviation limits are shown in the cut-outs on the right hand
side.
$$\ln L(\tau) = -\sum_{i=1}^{N} n_i\left(\ln\tau + \hat\tau_i/\tau\right) = -n\ln\tau - \frac{\sum n_i\hat\tau_i}{\tau}$$
with the maximum at
$$\hat\tau = \frac{\sum n_i\hat\tau_i}{n}$$
and its error
$$\delta = \frac{\hat\tau}{\sqrt{n}}\,.$$
The individual measurements are weighted by their event numbers, instead
of weights proportional to 1/δi2 . As the errors are correlated with the mea-
surements, the standard weighted mean (4.6) with weights proportional to
1/δi2 would be biased. In our specific example the correlation of the errors
and the parameter values is known and we could use weights proportional to
(τi /δi )2 .
$$\ln L = \sum_i\bigl[-m_i\ln(1+1/\theta) - n_i\ln(1+\theta)\bigr] = -m\ln(1+1/\theta) - n\ln(1+\theta)$$
with m = Σm_i and n = Σn_i. The MLE is θ̂ = m/n and the error limits have to be computed numerically in the usual way or, for not too small n, m, by linear error propagation, δθ²/θ² = 1/n + 1/m.
In the common situation where we do not know the full likelihood func-
tion but only the MLE and the error limits, we have to be content with an
approximate procedure. If the likelihood functions which have been used to
extract the error limits are parabolic, then the standard weighted mean (4.6)
is exactly equal to the result which we obtain when we add the log-likelihood
functions and extract then the estimate and the error.
Proof: A sum of terms of the form (8.1) can be written in the following way:
$$\frac{1}{2}\sum_i V_i(\theta-\theta_i)^2 = \frac{1}{2}\tilde V(\theta-\tilde\theta)^2 + \mathrm{const.}$$
Since the right hand side is the most general form of a polynomial of second order, a comparison of the coefficients of θ² and θ yields
$$\tilde V = \sum V_i\,, \qquad \tilde V\tilde\theta = \sum V_i\theta_i\,,$$
that is just the weighted mean including its error. Consequently, we should
aim at approximately parabolic log-likelihood functions when we present ex-
perimental results. Sometimes this is possible by a suitable choice of the
parameter. For example, we are free to quote either the estimate of the mass
or of the mass squared.
Fig. 8.2. Four examples of asymmetric log-likelihood functions ln L (panels a–d). The curves compare the exact function with the approximations σ(θ) linear, σ²(θ) linear, and two Gaussians.
or
(σ(θ))2 = δ+ δ− + (δ+ − δ− )(θ − θ̂) , (8.6)
respectively. The log-likelihood function has poles at locations of θ where the
width becomes zero, σ(θ) = 0. Thus our approximations are justified only in
the range of θ which excludes the corresponding parameter values.
In Fig. 8.2 we present four typical examples of asymmetric likelihood func-
tions. The log-likelihood function of the mean life of four exponentially dis-
tributed times is shown in 8.2 a. Fig. 8.2 b is the corresponding log-likelihood
function of the decay time⁴. Figs. 8.2 c, d have been derived by a parameter transformation from normally distributed observations where in one case the new parameter is one over the mean and the square root of the mean⁵ in the other case. A method which is optimum for all cases does not exist. All three approximations fit very well inside the one standard deviation limits. Outside, the two parametrizations (8.5) and (8.6) are superior to the two-parabola approximation.
We propose to use one of the two parametrizations (8.5, 8.6) but to be careful if σ(θ) becomes small.
⁴ The likelihood function of a Poisson mean has the same shape.
⁵ An example of such a situation is a fit of a particle mass from normally dis-
Before we rely on a mean value computed from the results of different ex-
periments we should make sure that the various input data are statistically
compatible. What we mean with compatible is not obvious at this point. It
will become clearer in Chap. 10, where we discuss significance tests which
lead to the following plausible procedure that has proven to be quite useful
in particle physics [31].
We compute the weighted mean value θ̃ of the N results and form the
sum of the quadratic deviations of the individual measurements from their
average, normalized to their expected errors squared:
$$\chi^2 = \sum_i(\theta_i-\tilde\theta)^2/\delta_i^2\,.$$
where δi+ and δi− , respectively, are valid for θi < θ̃ and θi > θ̃.
$$\hat\theta = \langle\theta\rangle = \sum\langle\xi_i\rangle\,, \qquad \delta_\theta^2 = \sigma_\theta^2 \approx \sum\sigma_i^2\,,$$
$$\langle\ln\theta\rangle = \sum\langle\ln\xi_i\rangle\,, \qquad \sigma_{\ln\theta} \approx \sqrt{\sum\sigma^2_{\ln\xi_i}}$$
Fig. 8.3. Distribution of the product of 10 variates with mean 1 and standard
deviation 0.2.
The reason for this simple relation is founded on the fact that a sum of
weighted Poisson numbers can be approximated by the Poisson distribution
of the equivalent number of events (see Sect. 3.7.3). A condition for the
validity of this approximation is that the number of equivalent events is large
enough to use symmetric errors. If this number is low we derive the limits
from the Poisson distribution of the equivalent number of events which then
will be asymmetric.
The following example is a warning that naive linear error propagation may
lead to false results.
It can lead to the strange result that the least square estimate ξb of the two
cross sections is located outside the range defined by the individual results
[54] , e.g. ξb < ξ1 , ξ2 . This anomaly is known as Peelle’s Pertinent Puzzle [55].
Its reason is that the normalization error is proportional to the true cross section and not to the observed one and thus has to be the same for the two measurements, i.e. in first approximation proportional to the estimate ξ̂ of the true cross section. The correct covariance matrix is
$$C = \begin{pmatrix} \delta_{10}^2 + \delta_f^2\hat\xi^2 & \delta_f^2\hat\xi^2 \\ \delta_f^2\hat\xi^2 & \delta_{20}^2 + \delta_f^2\hat\xi^2 \end{pmatrix}. \qquad (8.7)$$
Since the best estimate of ξ cannot depend on the common scaling error, it is given by the weighted mean
$$\hat\xi = \frac{\delta_{10}^{-2}\xi_1 + \delta_{20}^{-2}\xi_2}{\delta_{10}^{-2} + \delta_{20}^{-2}}\,. \qquad (8.8)$$
The error δ is obtained by the usual linear error propagation,
$$\delta^2 = \frac{1}{\delta_{10}^{-2} + \delta_{20}^{-2}} + \delta_f^2\,\hat\xi^2\,. \qquad (8.9)$$
Proof: The weighted mean ξ̂ is defined as the combination
$$\hat\xi = w_1\xi_1 + w_2\xi_2$$
which, under the condition w_1 + w_2 = 1, has minimal variance (see Sect. 4.4):
$$\mathrm{var}(\hat\xi) = w_1^2\delta_{10}^2 + (1-w_1)^2\delta_{20}^2 + \delta_f^2\hat\xi^2\,.$$
Setting the derivative with respect to w_1 to zero, we get the usual result
$$w_1 = \frac{\delta_{10}^{-2}}{\delta_{10}^{-2}+\delta_{20}^{-2}}\,, \qquad w_2 = 1 - w_1\,,$$
$$\delta^2 = \min[\mathrm{var}(\hat\xi)] = \frac{\delta_{10}^{-2}}{(\delta_{10}^{-2}+\delta_{20}^{-2})^2} + \frac{\delta_{20}^{-2}}{(\delta_{10}^{-2}+\delta_{20}^{-2})^2} + \delta_f^2\hat\xi^2\,,$$
$$C = 1 - \sum_{i=0}^{k}\frac{e^{-\mu_0}\mu_0^{\,i}}{i!} = 1 - \sum_{i=0}^{k}\mathcal{P}(i|\mu_0)\,.$$
However, the sum over the Poisson probabilities cannot be solved analytically
for µ0 . It has to be solved numerically, or (8.12) is evaluated with the help
of tables of the incomplete gamma function.
A special role is played by the case k = 0, i.e. when no event has been observed. The integral simplifies to:
$$C = 1 - e^{-\mu_0}\,, \qquad \mu_0 = -\ln(1-C)\,.$$
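For k > 0 the relation has to be solved numerically. A short sketch (Python with SciPy, not part of the original text) is given below; it ignores background, so the k = 2 value is not directly comparable to the background-subtracted example in the text.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import poisson

def upper_limit(k, C=0.9):
    """Upper limit mu0 for k observed events: C = 1 - sum_{i<=k} P(i|mu0)."""
    return brentq(lambda mu: 1.0 - poisson.cdf(k, mu) - C, 1e-9, 100.0)

print(upper_limit(0), -np.log(1 - 0.9))   # k = 0 reproduces -ln(1-C) = 2.30
print(upper_limit(2))                      # k = 2, no background: mu0 ~ 5.32
```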
Fig. 8.4. Upper limits for Poisson rates. The dashed lines are likelihood ratio limits (decrease by e⁻²).
We find µ0 = 3.88. The Bayesian probability that the mean rate µ is larger
than 3.88 is 10 %. Fig. 8.4 shows the likelihood functions for the two cases
Fig. 8.5. Log-likelihood function for a Poisson signal with uncertainty in background and acceptance. The arrow indicates the upper 90% limit. Also shown is the likelihood ratio limit (decrease by e^{-2}, dashed lines).
b = 2 and b = 0 together with the limits. For comparison are also given the
likelihood ratio limits which correspond to a decrease from the maximum
by e−2 . (For a normal distribution this would be equivalent to two standard
deviations).
We now investigate the more general case that both the acceptance ε
and the background are not perfectly known, and that the p.d.f.s of the
background and the acceptance fb , fε are given. For a mean Poisson signal
µ the probability to observe k events is
g(k|\mu) = \int db \int d\varepsilon\; P(k|\varepsilon\mu + b)\, f_b(b)\, f_\varepsilon(\varepsilon) = L(\mu|k) .
(Figure: likelihood function of a constrained parameter, showing the original likelihood, the unphysical region and the likelihood renormalized to the physical region; caption missing.)
Example 128. Upper limit for a Poisson rate with uncertainty in background
and acceptance
Observed are 2 events, expected are background events following a normal
distribution N (b|2.0, 0.5) with mean value b0 = 2 and standard deviation
σb = 0.5. The acceptance is assumed to follow also a normal distribution
with mean ε0 = 0.5 and standard deviation σε = 0.1. The likelihood function
is

L(\mu|2) = \int d\varepsilon \int db\; P(2|\varepsilon\mu + b)\, N(\varepsilon|0.5, 0.1)\, N(b|2.0, 0.5) .
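The double integral can be estimated by averaging the Poisson probability over values of ε and b drawn from the two normal distributions. The sketch below does this on a grid of µ and extracts a Bayesian 90% upper limit assuming a uniform prior in µ; grid range, sample size and seed are arbitrary choices, and the numerical result depends on the Monte Carlo statistics.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k = 2
b = rng.normal(2.0, 0.5, 50_000)       # background prior N(2.0, 0.5)
eps = rng.normal(0.5, 0.1, 50_000)     # acceptance prior N(0.5, 0.1)

def likelihood(mu):
    lam = np.clip(eps * mu + b, 1e-12, None)   # guard against rare negative values
    return stats.poisson.pmf(k, lam).mean()    # Monte Carlo estimate of the integral

mu_grid = np.linspace(0.0, 30.0, 301)
L = np.array([likelihood(m) for m in mu_grid])

cdf = np.cumsum(L) / L.sum()                   # uniform prior in mu
mu_up = mu_grid[np.searchsorted(cdf, 0.9)]     # 90% Bayesian upper limit
print(mu_up)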
8.4 Summary
9.1 Introduction
In many experiments the measurements are deformed by limited acceptance,
sensitivity, or resolution of the detectors. Knowing the properties of the de-
tector, we are able to simulate these effects, but is it possible to invert this
process, to reconstruct from a distorted event sample the original distribution
from which the undistorted sample has been drawn?
There is no simple answer to this question. Apart from the unavoidable statistical uncertainties, the correction of losses is straightforward, but unfolding the effects caused by the limited resolution is difficult and feasible only by introducing a priori assumptions about the shape of the original distribution or by grouping the data in histogram bins. Therefore, we should ask ourselves whether unfolding is really a necessary step of our analysis. If we want to verify a theoretical prediction for a distribution f(x), it is much easier and more accurate to fold f with the known resolution and then to compare the smeared prediction with the experimental distribution using the methods discussed in Chap. 10. If a prediction contains interesting parameters, those too should be estimated by comparing the smeared distribution with the observed data. When we study, for instance, a sharp resonance peak
above a slowly varying background, it will be very difficult, if not impossible,
to determine the relevant parameters from an unfolded spectrum, while it is
easy to fit them directly to the observed distribution, see Sect. 7.3 and Ref.
[57]. However, in situations where a reliable theoretical description is missing,
or where the measurement is to be compared with a distribution obtained
in another experiment with different experimental conditions, unfolding of
the data cannot be avoided. Examples are the determination of structure
functions in deep inelastic scattering or transverse momentum distributions
from the Large Hadron Collider at CERN where an obvious parametrization
is missing.
The choice of the unfolding procedure depends on the goal one is aiming for. We can either try to optimize the reconstruction of the distribution, with the typical trade-off between resolution and bias, where we have a kind of probability density estimation (PDE) problem (see Chap. 12), or we can treat unfolding as an inference problem where the errors should contain the
The function f (x) is folded with a response function h(x′ , x), resulting in
the smeared function g(x′ ). We call f (x) the true distribution and g(x′ ) the
smeared distribution or the observed distribution. The three functions g, h, f
can have discontinuities but of course the integral has to exist. The integral
equation (9.1) is called Fredholm equation of the first kind with the kernel
h(x′, x). If h(x′, x) is a function of the difference x′ − x only, (9.1) is called a convolution integral, but often the terms convolution and folding are not distinguished. The relation (9.1) describes the direct process of folding. We are interested in the inverse problem: knowing g and h, we want to infer f(x). This inverse problem is classified by mathematicians as ill-posed because it has no unique solution. In the direct process high frequencies are
washed out. The damping of strongly oscillating contributions in turn means
that in mapping g to f high frequencies are amplified, and the higher the
frequency, the stronger is the amplification. In fact, in practical applications
we do not really know g, the information we have consists only of a sample of
observations with the unavoidable statistical fluctuations2 . The fluctuations
of g correspond to large perturbations of f and consequently to ambiguities.
The response function often, but not always, describes a simple resolu-
tion effect and then it is called point spread function (PSF). There exists
1
We base our errors on the likelihood function. In most cases with not too small event numbers, the definition coincides to a good approximation with the error definition derived from the coverage paradigm, see Appendix 13.6. In this chapter arguing with coverage is more convenient than with the likelihood ratio.
2
In the statistical literature the fluctuations are called noise.
Fig. 9.1. Relations between the histograms involved in the unfolding process.
E(d) = Aθ . (9.2)
E\begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix} = \begin{pmatrix} A_{11} & \cdots & A_{1M} \\ A_{21} & \cdots & A_{2M} \\ \vdots & & \vdots \\ A_{N1} & \cdots & A_{NM} \end{pmatrix} \cdot \begin{pmatrix} \theta_1 \\ \vdots \\ \theta_M \end{pmatrix} .
E(d_ij) is the expected number of observed events in bin i that originate from true bin j. In the following we will often abbreviate E(d_i) by t_i = \sum_j A_{ij}\theta_j. The
value Aij represents the probability that the detector registers an event in
bin i that belongs to the true histogram bin j. This interpretation assumes
that all elements of d, A and θ are positive. The number of columns M
is the number of bins in the true histogram and the number of parameters
that have to be determined. The number of rows N is the number of bins
in the observed histogram. We do not want to have more parameters than
measurements and require N ≥ M . Normally we constrain the unknown true
histogram, requiring N > M . With N bins of the observed histogram and M
bins of the true histogram we have N − M constraints. The relation between
the various histograms is shown in Fig. 9.1. We follow the simpler right-hand
path where multinomial errors need not be handled.
We require that A is rank efficient which means that the rank is equal
to the number of columns M . Formally, this means that all columns are
linearly independent and at least M rows are linearly independent: No two
bins of the true histogram should produce observed distributions that are
proportional to each other. Unfolding would be ambiguous in this situation,
but a simple solution of the problem is to combine the bins. More complex
cases that lead to a rank deficiency never occur in practice. A more serious
requirement is the following: By definition of A, the observed histogram must
not contain events that originate from other sources than the M true bins. In
other words, the range of the true histogram has to cover all observed events.
This requirement often entails that only a small fraction of the events that are contained in the border bins of the true histogram is found in the observed
histogram. The correspondingly low efficiency leads to large errors of the
reconstructed number of events in these bins. Published simulation studies
often avoid this complication by restricting the range of the true variable.
Some publications refer to a null space of the matrix A. The null space is
spanned by vectors that fulfill Aθ = 0. With our definitions and the restric-
tions that we have imposed, the null space is empty.
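For illustration, a response matrix can be estimated from simulated events with a two-dimensional histogram of the observed versus the true variable, normalizing each column to the number of events generated in the corresponding true bin. The true distribution and the detector model below are invented for the sketch.

import numpy as np

rng = np.random.default_rng(1)
# invented "true" distribution and Gaussian detector smearing, for illustration only
x_true = rng.exponential(1.0, size=1_000_000)
x_obs = x_true + rng.normal(0.0, 0.3, size=x_true.size)

true_edges = np.linspace(0.0, 4.0, 11)    # M = 10 true bins
obs_edges = np.linspace(-1.0, 5.0, 25)    # N = 24 observed bins

# A_ij = probability to observe an event in bin i if it was generated in true bin j;
# events falling outside the observed range are lost (finite acceptance)
H, _, _ = np.histogram2d(x_obs, x_true, bins=[obs_edges, true_edges])
n_gen, _ = np.histogram(x_true, bins=true_edges)
A = H / n_gen                    # column-wise normalization

theta = np.full(10, 500.0)       # some true histogram
t = A @ theta                    # expected observed histogram, E(d) = A theta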
Fig. 9.2. Folded distributions (left) for two different distributions (right).
Fig. 9.3. Naive unfolding result obtained by matrix inversion. The curve corresponds to the true distribution.
In particle physics the experimental setups are mostly quite complex, and
for this reason they are simulated with Monte Carlo programs. To construct
the matrix A, we generate events following an assumed true distribution f (x)
characterized by the true variable x and a corresponding true bin j. The
detector simulation produces the observed variable x′ and the corresponding
observed bin i. We will assume for the moment that we can generate an in-
finitely large amount of so-called Monte Carlo events such that we do not
have to care about statistical fluctuations of the elements of A. The statisti-
The discrete model avoids the ambiguity of the continuous ill-posed problem
but especially if the observed bins are narrow compared to the resolution, the
matrix is badly conditioned which means that the inverse or pseudo-inverse of
A contains large components. This is illustrated in Fig. 9.2 which shows two
different original distributions and the corresponding distributions smeared
with a Gaussian N (x − x′ |0, 1). In spite of the extremely different original
distributions, the smeared distributions of the samples are practically indis-
tinguishable. This demonstrates the sizeable information loss that is caused
by the smearing, especially in the case of the distribution with four peaks.
Sharp structures are washed out and can hardly be reconstructed. Given the
observed histogram with some additional noise, it will be almost impossible
to exclude one of the two candidates for the true distribution even with a huge
amount of data. Naive unfolding by matrix inversion can produce oscillations
as shown in Fig. 9.3.
If the matrix A is square, we can simply invert (9.2) and get an estimate θ̂ of the true histogram:

\hat{\theta} = A^{-1} d .   (9.4)
In practice this simple solution usually does not work because, as mentioned,
our observations suffer from statistical fluctuations.
In Fig. 9.4 the result of a simple inversion of the data vector of Fig. 9.4
is depicted. The left-hand plot is realized with 10 bins. It is clear that either
fewer bins have to be chosen, see Fig. 9.4 right-hand plot, or some smoothing
has to be applied.
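The oscillating behavior is easy to reproduce with a small simulation; the square Gaussian response matrix, the true histogram and the resolution below are invented for the sketch.

import numpy as np

rng = np.random.default_rng(2)
M = 20
edges = np.linspace(0.0, 1.0, M + 1)
centers = 0.5 * (edges[:-1] + edges[1:])

sigma = 0.08                                   # resolution larger than the bin width
A = np.exp(-0.5 * ((centers[:, None] - centers[None, :]) / sigma) ** 2)
A /= A.sum(axis=0)                             # column normalization, no acceptance losses

theta = 1000.0 * np.exp(-0.5 * ((centers - 0.5) / 0.15) ** 2)   # smooth true histogram
d = rng.poisson(A @ theta).astype(float)       # observed histogram with Poisson noise

theta_hat = np.linalg.solve(A, d)              # naive unfolding by inversion, (9.4)
print(np.round(theta_hat))                     # typically large, oscillating bin contents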
f(x) \approx \sum_{j=1}^{M} \beta_j B_j(x)   (9.5)

We get

E(d_i) = \int_{\mathrm{bin}\,i} dx' \sum_{j=1}^{M} \beta_j \int_{-\infty}^{\infty} h(x', x) B_j(x)\, dx = \sum_{j=1}^{M} A_{ij}\beta_j   (9.6)

with

A_{ij} = \int_{\mathrm{bin}\,i} dx' \int_{-\infty}^{\infty} h(x', x) B_j(x)\, dx .
The response matrix element Aij now is the probability to observe an
event in bin i of the observed histogram that originates from the distribution
Bj . In other words, the observed histogram is approximated by a superposi-
tion of the histograms produced by folding the functions Bj . Unfolding means
to determine the amplitudes βj of the functions Bj .
Below we will discuss the approximation of f(x) by a superposition of basis spline functions (B-splines). For our applications the B-splines of order 2 (linear), 3 (quadratic) or 4 (cubic) are appropriate (see Appendix 13.15). Un-
folding then produces a smooth function which normally is closer to the true
distribution than a histogram. The disadvantage of spline approximations
compared to the histogram representation is that a quantitative comparison
with predictions or the combination of several results is more difficult.
Remark: In probability density estimation (PDE) a histogram is consid-
ered as a first order spline function. The spline function corresponds to the
line that limits the top of the histogram bins. The interpretation of a his-
togram in experimental sciences is different from that in PDE. Observations
are collected in bins and then the content of the bin measures the integral
of the function g over the bin and the bin content of the unfolded histogram
is an estimate of the integral of f over that bin. A function can always be
If the data follow a Poisson distribution where the statistics is high enough
to approximate it by a normal distribution and where the denominator of
(9.7) can be approximated by di , the least square minimum can be evaluated
by a simple linear matrix calculus. (The linear LS solution is treated in Sect.
6.7.)
\chi^2_{stat} = \sum_{i=1}^{N} \frac{(t_i - d_i)^2}{d_i} = \sum_{i=1}^{N} \frac{\left(\sum_{k=1}^{M} A_{ik}\theta_k - d_i\right)^2}{d_i} ,   (9.9)
We apply the transformations

d \Rightarrow b = A^{\mathsf T} V d ,   (9.10)

A \Rightarrow Q = A^{\mathsf T} V A .   (9.11)
3
In the literature the error matrix or covariance matrix is frequently denoted by V and the weight matrix by V^{-1}.
E(b) = Qθ (9.12)
Cθ = Q−1 .
4
We require that the square M × M matrix Q has M linearly independent
eigenvectors and that all eigenvalues are real and positive. These conditions are
satisfied if a LS solution exists.
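In this notation the solution and its covariance matrix follow from a few lines of linear algebra; a sketch, assuming the Poisson approximation V = diag(1/d_i) of (9.9):

import numpy as np

def ls_unfold(A, d):
    """Least-squares solution of the transformed system (9.10)-(9.12);
    V = diag(1/d_i) requires d_i > 0."""
    V = np.diag(1.0 / d)
    b = A.T @ V @ d                       # transformed data vector, (9.10)
    Q = A.T @ V @ A                       # modified LS matrix, (9.11)
    theta_hat = np.linalg.solve(Q, b)
    return theta_hat, np.linalg.inv(Q)    # estimate and covariance C_theta = Q^{-1}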
Fig. 9.6. Eigenvectors of the modified LS matrix ordered with decreasing eigenvalues.
shape but reduced by the factor λi as shown on the right-hand side. The
eigenvalues decrease from top to bottom. Strongly oscillating components of
the true histogram correspond to small eigenvalues. They are hardly visible
in the observed data, and in turn, small contributions vi to the observed
data caused by statistical fluctuations can lead to rather large oscillating
contributions ui = v i /λi to the unfolded histogram if the eigenvalues are
small. Eigenvector contributions with eigenvalues below a certain value can-
not be reconstructed, because they cannot be distinguished from noise in the
observed histogram.
The eigenvector decomposition is equivalent to the singular value decom-
position (SVD). In the following we will often refer to the term SVD instead
of the eigenvector decomposition, because the former is commonly used in
the unfolding literature.
Fig. 9.7. Observed eigenvectors 1 (top left) and 20 (top right), eigenvalues (bottom left) and significance of eigenvector amplitudes (bottom right).
Fig. 9.8. Left hand: Parameter significance as a function of the eigenvalue index. The effective number of parameters is 17. Right hand: Fitted parameter values as a function of the eigenvalue index.
with even index should vanish. Statistical fluctuations in the simulation par-
tially destroy the symmetry. Eigenvector contributions where the significance
is below one are compatible with being absent within less than one standard
deviation.
is indicated in the right-hand graph. This graph shows that some amplitudes
that correspond to small eigenvalues become rather large. This is due to the
amplification of high frequency noise in the unfolding. The number of bins in
the unfolded distribution should not be much larger than the effective number
of parameters, because then we keep too much redundant information, but on
the other hand it has to be large enough to represent the smallest significant
eigenvector. A reasonable choice for the number of bins is about twice N_eff.
The optimal number will also depend on the shape of the distribution.
Histogram Representation
P(d_i) = \frac{e^{-t_i}\, t_i^{d_i}}{d_i!}

and the corresponding log-likelihood is, up to an irrelevant constant,

\ln L_{stat} = \sum_{i=1}^{N} \left[ d_i \ln t_i - t_i \right] = \sum_{i=1}^{N} \left[ d_i \ln \sum_{j=1}^{M} A_{ij}\theta_j - \sum_{j=1}^{M} A_{ij}\theta_j \right] .   (9.15)
– Unfolding step:

\theta_j^{(k+1)} = \frac{\theta_j^{(k)}}{\alpha_j} \sum_{i=1}^{N} A_{ij}\, \frac{d_i}{d_i^{(k)}} ,   (9.17)

where d_i^{(k)} = \sum_l A_{il}\theta_l^{(k)} is the folded distribution of the previous iteration and \alpha_j = \sum_i A_{ij} the acceptance of true bin j.
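A compact implementation of the iteration (9.17) could look as follows (our own function names; the starting histogram is taken uniform unless one is supplied):

import numpy as np

def em_unfold(A, d, n_iter, theta0=None):
    """EM (Richardson-Lucy) unfolding, iterating (9.17); assumes t_i > 0."""
    alpha = A.sum(axis=0)                       # acceptance of each true bin
    theta = (np.full(A.shape[1], d.sum() / A.shape[1])
             if theta0 is None else np.asarray(theta0, dtype=float))
    for _ in range(n_iter):
        t = A @ theta                           # folded distribution of the current step
        theta = theta / alpha * (A.T @ (d / t))
    return theta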
Spline Approximation
The main field where professional unfolding methods are applied lies in image reconstruction. In medicine, the unblurring of tomographic pictures of arterial stenoses, of tumors or of orthopedic objects is important. Other areas of interest are the unblurring of images of astronomical surveys, of geographical maps, and of tomographic pictures of tools or mechanical components like train wheels to detect damage. Pattern recognition, for example of fingerprints or of the iris, is another important field of interest. The goal of most applications is to dig
out hidden or fuzzy structures from blurred images, to remove noise and to
improve the contrast. Also in physics applications it is of interest to visualize
hidden structures and to reconstruct distributions which may be used for
instance to simulate experimental situations. We want to take advantage of
the fact that physics distributions are rather smooth. Often we can remove
the roughness of an unfolding result without affecting very much the real
structures of the true distribution.
To obtain a smooth distribution, several different regularization mecha-
nisms are available.
In particle- and astrophysics the following methods are applied:
Visual Inspection
If we give up the idea of using the unfolded distribution for parameter fits, it seems tolerable to apply subjective criteria for the choice of the regularization strength. By inspection of the unfolding results obtained with increasing regularization, we are to some extent able to distinguish fluctuations caused by noise from structures in the true distribution and to select a reasonable value of the regularization parameter. Probably this method is in most cases as good as the following approaches, which are partially quite involved.
We have seen that the unfolding result can be expanded into orthogonal
components which are statistically independent.
We have studied above the eigenvector decomposition of the modified
least square matrix and realized that the components with small eigenvalues λ_i cause the unwanted oscillations. A smooth result is obtained by cutting all contributions whose eigenvalues lie below a cut value λ_k. This procedure is called truncated singular value decomposition (TSVD). The value λ_k is chosen such that eigenvectors with statistically insignificant amplitudes are excluded. The truncation in the framework of the LSF has its equivalence
in the ML method. We can order the eigenvectors of the covariance matrix
of a MLF according to decreasing errors and retain only the dominant com-
ponents.
The physicist community is still attached to the linear matrix calculus, which is popular for historical reasons. Nowadays computers are fast and truncation
ISE is not defined for histograms in the way physicists interpret them. To adapt the ISE concept to our needs, we modify the definition such that it measures the difference between the estimated histogram content θ̂_i and the prediction θ_i. In addition we normalize it to the total number of events n and the number of true bins M:

ISE' = \frac{1}{nM}\sum_{i=1}^{M} \left( \hat{\theta}_i - \theta_i \right)^2 .   (9.19)
ISE ′ depends mainly on the resolution, i.e. the response matrix and less on
the shape of the true distribution. A crude guess of the latter can be used
to estimate the regularization parameter r. (Here r is a generic name for
the number of iterations in the EM method, the penalty strength or the cut
in truncation approaches.) The distribution is unfolded with a preliminarily
chosen regularization parameter. The unfolding result is then used to find the
regularization parameter that minimizes ISE′. The process can be iterated,
but since the shape of the true distribution is not that critical, this will not
be necessary in the majority of cases. The procedure consists of the following
steps:
1. Unfold d with varying regularization strength r and select the “best”
value r̃ and θ̃(0) by visual inspection of the unfolded histograms.
2. Use θ̃(0) as input for typically n = 100 simulations of “observed” distri-
butions d˜i , i = 1, n.
3. Unfold each “observed” distribution with varying r and select the value
r̃i that corresponds to the smallest value of ISE ′ . The value of ISE′ is
computed by comparing the unfolded histogram with θ̃(0) .
4. Take the mean value r̄ of the regularization strengths r̃i , unfold the ex-
perimental distribution and obtain θ̃(1) . If necessary, go back to 2, replace
θ̃(0) by θ̃(1) and iterate.
The procedure is independent of the regularization method.
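A sketch of the four steps; unfold(d, A, r) is a placeholder for an arbitrary regularized unfolding routine with regularization parameter r, and the toy histograms are generated with Poisson fluctuations:

import numpy as np

def select_regularization(d, A, unfold, r_grid, r0, n_toys=100, rng=None):
    """ISE'-driven choice of the regularization strength r (steps 1-4 above)."""
    rng = rng or np.random.default_rng()
    theta0 = unfold(d, A, r0)                   # step 1: preliminary result, r0 chosen by eye
    n, M = d.sum(), len(theta0)
    best = []
    for _ in range(n_toys):                     # step 2: toy "observed" histograms
        d_toy = rng.poisson(A @ theta0)
        ise = [np.sum((unfold(d_toy, A, r) - theta0) ** 2) / (n * M) for r in r_grid]
        best.append(r_grid[int(np.argmin(ise))])   # step 3: r minimizing ISE'
    return float(np.mean(best))                 # step 4: use the mean r on the data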
5
Error propagation starting from the observed data instead of the best estimate
can lead to inconsistent results. A striking example is known as Peelle’s pertinent
puzzle [55].
It has been proposed [67, 81, 82] to apply after the iteration sequence a final smoothing step: After iteration i the result θ^{(i)} is folded with a smoothing matrix g, yielding θ^{(i)'} with \theta_k^{(i)'} = \sum_l g_{kl}\theta_l^{(i)}. If θ_k^{(i)'} agrees with θ_k^{(i-1)'} within given limits, the iteration sequence is terminated. In this way, convergence to a smooth result is imposed. In [81] it is proposed to add after the convergence one further iteration to obtain θ^{(i+1)'}.
The parameters of the smoothing matrix which define the regularization
strength have to be adjusted to the specific properties of the problem that
Fig. 9.9. Unfolding with the EM method for different numbers of iterations. The experimental resolution is σ_s = 0.08. The squares correspond to the true distribution.
(Figure: ISE′ as a function of the number of iterations; caption missing.)
Fig. 9.11. Iterative unfolding with two different starting histograms, left uniform and right experimental.
Truncated SVD
Fig. 9.12. Unfolding result with SVD and different number of eigenvector contributions, resolution σ_s = 0.08 and 50000 events.
(Figure: ISE′ as a function of the number of eigenvectors; caption missing.)
Smooth truncation
It has been proposed [78, 62] to replace the brute-force chopping off of the noise dominated components by a smooth cut. This is accomplished by filter factors

\varphi(\lambda) = \frac{\lambda^2}{\lambda^2 + \lambda_0^2} ,   (9.20)
where λ0 is the eigenvalue which fixes the degree of smoothing and λ is the
eigenvalue corresponding to the coefficient which is to be multiplied by ϕ(λ).
The solution is then
(Figure 9.14: the filter factor as a function of the eigenvector index; caption missing.)
\theta_{reg} = \sum_{i=1}^{M} \varphi(\lambda_i)\, a_i u_i .
The function (9.20) is displayed in Fig. 9.14. The amplitude of the eigenvector with eigenvalue λ = λ_0 is reduced by a factor of 2. For large eigenvalues λ the filter factor is close to one, and for small values it is close to zero. It is not obvious why a reduction of the amplitude of a component m and the inclusion of a fraction of the amplitude of a less significant component n > m should improve the performance.
In [78] it is shown that the filtered SVD solution is equivalent to Tikhonov’s
norm regularization under the condition that the uncertainties of the obser-
vations correspond to white noise (normally distributed fluctuations with
constant variance). We will come back to the norm regularization below.
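Applied to the transformed least square system of Sect. 9.2 (matrix Q and vector b), the filtered solution takes only a few lines; a sketch:

import numpy as np

def filtered_solution(Q, b, lambda0):
    """Damp the eigenvector amplitudes of the LS solution with the
    filter factors (9.20) instead of truncating them; assumes positive eigenvalues."""
    lam, U = np.linalg.eigh(Q)             # Q symmetric: eigenvalues and eigenvectors u_i
    a = (U.T @ b) / lam                    # amplitudes of the unregularized solution Q^-1 b
    phi = lam**2 / (lam**2 + lambda0**2)   # filter factors phi(lambda_i)
    return U @ (phi * a)                   # theta_reg = sum_i phi(lambda_i) a_i u_i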
The EM and truncated SVD methods are very intuitive and general. If we
have specific ideas about what we consider as smooth, we can penalize devi-
ations from the wanted features by introduction of a penalty term R in the
likelihood or LS expression:
\ln L = \ln L_{stat} - R ,   (9.21)

\chi^2 = \chi^2_{stat} + R .   (9.22)
Here ln L_stat and χ²_stat are the expressions given in (9.15) and (9.7). The sign of R is positive such that with increasing R the unfolded histogram becomes smoother.
Curvature regularization
Entropy regularization
and its relation to probability which have been at the origin of its application
in unfolding problems. We penalize a low entropy and thus favor a uniform
distribution.
The entropy S of a discrete distribution with probabilities pi , i =
1, . . . , M is defined through the relation:
S = -\sum_{i=1}^{M} p_i \ln p_i .
For a random distribution the probability for one of the n = Σθ_i events to fall into true bin i is given by θ_i/n. The maximum of the entropy corresponds to a uniform population of the bins, i.e. θ_i = const. = n/M, and equals S_max = -M (1/M) ln(1/M) = ln M, while its minimum S_min = 0 is found for the one-point distribution (all events in the same bin j), θ_i = nδ_{i,j}. We define the entropy regularization penalty of the distribution, with regularization strength r_e, by

R = r_e \sum_{i=1}^{M} \frac{\theta_i}{n} \ln\frac{\theta_i}{n} .   (9.25)
The most obvious and simplest way to regularize unfolding results is to pe-
nalize a large value of the norm squared ||θ||2 of the solution:
R = \frac{r_n}{n^2} \sum_{i=1}^{M} \theta_i^2 .   (9.26)
The norm regularization has first been proposed by Tikhonov [83]. Minimiz-
ing the norm implies a bias towards a small number of events in the unfolded
distribution. To reduce this effect, contrary to the originally proposed penalty,
we normalize the norm to the number of events squared n2 .
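Both penalty terms are easily coded; a sketch (the clip only protects the logarithm for empty bins):

import numpy as np

def entropy_penalty(theta, r_e):
    """Entropy term (9.25); large when the distribution is far from uniform."""
    p = theta / theta.sum()
    return r_e * np.sum(p * np.log(np.clip(p, 1e-300, None)))

def norm_penalty(theta, r_n):
    """Tikhonov norm term (9.26), normalized to the squared number of events."""
    return r_n * np.sum(theta**2) / theta.sum()**2

# R is subtracted from ln L_stat, (9.21), or added to chi^2_stat, (9.22)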
Fig. 9.15. Unfolding results from different methods. The top left-hand plot shows the true distribution and its smeared version. The squares correspond to the true distribution.
defined in the interval [−7, 7] with smearing σs = 1 which has been used
in [80]. The function and its smeared version are displayed in Fig. 9.15 top
left. The unfolded distributions obtained with the EM, the truncated SVD
and three penalty methods for the first of 10 samples with 5000 events are
depicted in the same figure. Visual inspection does not reveal large differences between the results. The mean values of ISE′ from the 10 samples are presented in Fig. 9.16. They indicate that the EM method and the entropy regularization perform better than the other approaches. From a repetition
with 100 simulated experiments, we find that the mean value of ISE ′ is
smaller for the EM method compared to the entropy regularization by a
factor of 2.00 ± 0.16.
The superiority of the EM method has been confirmed for different dis-
tributions [56].
(Figure 9.16: mean values of ISE′ for the EM, TSVD, entropy, norm and curvature regularization methods; caption missing.)
Fig. 9.17. EM unfolding to a spline function. Three examples are presented. The top graphs show the true distribution together with the unfolding results. The bottom graphs contain the corresponding histograms.
Until now we have assumed that we know exactly the probability Aij for
observing elements in bin i which originally were produced in bin j. This is,
at least in principle, impossible, as we have to average the true distribution
f (x) – which we do not know – over the respective bin interval
Fig. 9.18. Difference of the log-likelihood from the value at 10^6 iterations as a function of the number of iterations.
A_{ij} = \frac{\int_{x'\in \mathrm{bin}\,i}\int_{x\in \mathrm{bin}\,j} h(x', x)\, f(x)\, dx\, dx'}{\int_{\mathrm{bin}\,j} f(x)\, dx} .   (9.27)
than the experimental sample such that the fluctuations can be neglected.
A rough estimate shows that for a factor of ten more simulated observa-
tions the contribution of the simulation to the statistical error of the result is
only about 5% and then certainly tolerable. When this condition cannot be
fulfilled, bootstrap methods (see Chap. 11) can be used to estimate the uncer-
tainties caused by the statistical fluctuations of A. Apart from the statistical
error of the response matrix, the precision of the reconstruction of f depends
on the size of the experimental sample and on the accuracy with which we
know the resolution. In nuclear and particle physics the sample size is often
the limiting factor, in other fields, like optics, the difficulties frequently are
related to a limited knowledge of the resolution of the measurement.
Fig. 9.19 shows the effect of using a wrong resolution function. The dis-
tribution in the middle is produced from that on the left hand side by convo-
lution with a Gaussian with width σf . Unfolding produces the distribution
on the right hand side, where the assumed width σf′ was taken too low by
10%. For a relative error δ,

\delta = \frac{\sigma_f - \sigma_f'}{\sigma_f} ,

where σ_art² has to be added to the squared width of the original line. Thus
a Dirac δ-function becomes a normal distribution of width σart . Even small
deviations in the resolution can lead to a substantial artificial broadening of
sharp structures if the width of the smearing function is larger than that of
peaks in the true distribution.
Fig. 9.20. Unfolding without explicit regularization. The left-hand plot shows the observed distribution with resolution σ_s = 0.04, the central plot the unfolded histogram and the right-hand plot indicates the correlation of bin 10 with the other bins of the histogram. The curve represents the true distribution.
Fig. 9.21. Unfolding without explicit regularization. The left-hand plot shows the observed histogram with resolution σ_s = 0.08, the central plot is the unfolded histogram and the right hand plot indicates the correlation of bin 4 with the other bins of the histogram. The curve represents the true distribution.
with 40 bins and the unfolded distribution with 18 bins are shown in Fig.
9.20. The true distribution is not much modified by the smearing. The height
of the peak is slightly reduced, the peak is a bit wider and at the borders
there are acceptance losses. The central plot shows the unfolded distribution
with the diagonal errors. Due to the strong correlation between neighboring bins, the errors are about a factor of five larger than √θ_i. In the right-hand
plot the correlation coefficients of bin 10 relative to the other bins are given.
The correlation with the two adjacent bins is negative. It oscillates with the
Wide bins have the disadvantage that the response matrix depends
strongly on the distribution that is used to generate it. To cure the prob-
lem, the distribution can be approximated by the unfolding result obtained
with explicit regularization to a spline approximation. The remaining bias is
usually negligible.
The weight will remain constant once the densities t′ and d′ agree. As a result we obtain a discrete distribution of coordinates x_i with appropriate weights
The basic idea of this method [89] is the following: We generate a Monte
Carlo sample of the same size as the experimental data sample. We let the
Monte Carlo events migrate until the distribution of their observed positions
is compatible with the observed data. With the help of a test variable φ, which
could for example be the negative log likelihood and which we will specify
later, we have the possibility to judge quantitatively the compatibility. When
the process has converged, i.e. φ has reached its minimum, the Monte Carlo
sample represents the unfolded distribution.
We proceed as follows:
We denote by {x'_1, . . . , x'_N} the locations of the points of the experimental sample and by {y_1, . . . , y_N} those of the Monte Carlo sample. The observed density of the simulation is f(y') = \sum_i t(y_i, y'), where t is the response function. The test variable φ[x'_1, . . . , x'_N; f(y')] is a function of the sample coordinates x'_i and the density expected for the simulation. We execute the following steps:
1. The points of the experimental sample {x′1 , . . . , x′N } are used as a first
approximation to the true locations y 1 = x′1 , . . . , y N = x′N .
2. We compute the test statistic φ of the system.
3. We select randomly a Monte Carlo event and let it migrate by a random
amount ∆y i into a randomly chosen direction, y i → y i + ∆y i .
4. We recompute φ. If φ has decreased, we keep the move, otherwise we
reject it. If φ has reached its minimum, we stop, if not, we return to step
3.
The resolution or smearing function t is normally not a simple analytic
function, but only numerically available through a Monte Carlo simulation.
Thus we associate to each true Monte Carlo point i a set of K generated
observations {y′i1 , . . . , y ′iK }, which we call satellites and which move together
with y i . The test quantity φ is now a function of the N experimental positions
and the N × K smeared Monte Carlo positions.
Choices of the test statistic φ are presented in Chap. 10. We recommend
to use the variable energy.
The migration distances ∆y i should be taken from a distribution with
a width somewhat larger than the measurement resolution, while the exact
shape of the distribution is not relevant. We therefore recommend to use a
In the rare cases where the transfer function t(x, x′ ) is known analytically
or easily calculable otherwise, we can maximize the likelihood where the
parameters are the locations of the true points. Neglecting acceptance losses,
the p.d.f. for an observation x′ , with the true values x1 , . . . , xN as parameters
is
N
1X
fN (x′ |x1 , . . . , xN ) = t(xi , x′ )
N i=1
where t is assumed to be normalized with respect to x′ . The log likelihood
then is given, up to a constant, by
\ln L(x|x') = \sum_{k=1}^{N} \ln \sum_{i=1}^{N} t(x_i, x'_k) .
The maximum can either be found using the well known minimum search-
ing procedures or the migration method which we have described above and
which is not restricted to low event numbers. Of course maximizing the likeli-
hood leads to the same artifacts as observed in the histogram based methods.
The true points form clusters which, eventually, degenerate into discrete dis-
tributions. A smooth result is obtained by stopping the maximizing process
before the maximum has been reached. For definiteness, similar to the case
of histogram unfolding, a fixed difference of the likelihood from its maximum
value should be chosen to stop the maximization process. Similarly p to the
histogram case, this difference should be of the order of ∆L ≈ N DF/2
Fig. 9.23. Deconvolution of point locations. The middle plot on the left hand side is deconvoluted and shown in the bottom plot. The true point distribution is given in the top plot. The right hand side shows the corresponding projections onto the x axis in form of histograms.
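A sketch of such a maximization by random migration for a one-dimensional sample with a known Gaussian resolution; all numbers are invented, the proposal width is an arbitrary choice, and, as discussed above, the loop should in practice be stopped before the maximum is reached:

import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                                    # known Gaussian resolution (assumed)
true = rng.normal(0.0, 2.0, 200)               # toy truth, only used to make x_obs
x_obs = true + rng.normal(0.0, sigma, 200)     # observed sample

def t_row(xi):                                 # t(x_i, x'_k) for one true point
    return np.exp(-0.5 * ((x_obs - xi) / sigma) ** 2)

x = x_obs.copy()                               # start from the observed points
T = np.array([t_row(xi) for xi in x])
col = T.sum(axis=0)                            # sum_i t(x_i, x'_k)
lnL = np.log(col + 1e-300).sum()

for _ in range(50_000):                        # random migration of single points
    i = rng.integers(len(x))
    xi_new = x[i] + rng.normal(0.0, 1.5 * sigma)
    row_new = t_row(xi_new)
    col_new = col - T[i] + row_new
    lnL_new = np.log(col_new + 1e-300).sum()
    if lnL_new > lnL:                          # keep only moves that increase ln L
        x[i], T[i], col, lnL = xi_new, row_new, col_new, lnL_new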
10.1 Introduction
So far we treated problems where a data sample was used to discriminate be-
tween completely fixed competing hypotheses or to estimate free parameters
of a given distribution. Now we turn to the task to quantify the compati-
bility of observed data with a given hypothesis. We distinguish between the
following topics:
a) Classification, for example event selection in particle reactions.
b) Testing the compatibility of a distribution with a theoretical prediction,
i.e. goodness-of-fit tests.
c) Testing whether two samples originate from the same population, two-
sample tests.
d) Quantification of the significance of signals, like the Higgs signal.
The hypothesis that we intend to test is called the null hypothesis H0. In most tests the alternative hypothesis H1 is simply “H0 is false”. Often,
additional characteristics of H1 are very vague and cannot be quantified, but
a test makes sense only if we have an idea about H1 . Otherwise a sensible
formulation of the test is not possible. Formally, a test is associated with a
decision: accept or reject. This is obvious for classification problems, while in
the other cases we are mostly satisfied with the quotation of the so-called
p-value, introduced by R. Fisher, which measures the compatibility of a given
data sample with the null hypothesis.
There is some confusion about the terms significance test and hypothe-
sis test which has its origin in a controversy between R. Fisher on one side
and J. Neyman1 and E. Pearson2 on the other side. Fisher had a more prag-
matic view while Neyman-Pearson emphasized a strictly formal treatment
with prefixed criteria whether to accept or reject the hypothesis. We will not
distinguish between the two terms but use the term significance mainly for
the analysis of small signals. We will talk about tests even when we do not
decide on the acceptance of H0 .
1
Jerzy Neyman (1894-1981), Polish mathematician
2
Egon Sharpe Pearson (1895-1980), English statistician
The test procedure has to be fixed before looking at the data3. To base the selection of a test and its parameters on the data which we want to analyze, to optimize a test on the basis of the data, or to terminate the running time of an experiment as a function of the output of a test would bias the result.
Goodness-of-fit (GOF) tests and two-sample tests are closely related. Goodness-of-fit tests are often applied after a parameter of a distribution has been adjusted to the observed data. In this case the hypothesis depends on one or several free parameters. We have a composite hypothesis and a composite test. Two-sample tests are applied if data are to be compared to a prediction that is modeled by a Monte Carlo sample. Sometimes it is of
interest to check whether experimental conditions have changed. To test the
hypothesis that this is not the case, samples taken at different times are
compared.
At the end of this chapter we will treat another case in which we have a
partially specified alternative and which plays an important role in physics.
There the goal is to investigate whether a small signal is significant or ex-
plainable by a fluctuation of a background distribution. We call this procedure
signal test.
After we have fixed the null hypothesis and the admitted alternative H1 , we
must choose a test statistic t(x), which is a function of the sample values,
3
Scientists often call this a blind analysis.
Fig. 10.1. Relation between the critical value n of a Poisson experiment with mean 100 and the significance level. Observations n > 124 are excluded at a significance level of 1%.
After the test parameters are selected, we can apply the test to our data. If the actually obtained value of t is outside the critical region, t ∉ K, then we accept H0, otherwise we reject it. This procedure implies four different outcomes with the following a priori probabilities:
1. H0 ∩ t ∈ K, P{t ∈ K|H0} = α: error of the first kind (H0 is true but rejected),
2. H0 ∩ t ∉ K, P{t ∉ K|H0} = 1 − α (H0 is true and accepted),
3. H1 ∩ t ∈ K, P{t ∈ K|H1} = 1 − β (H0 is false and rejected),
4. H1 ∩ t ∉ K, P{t ∉ K|H1} = β: error of the second kind (H0 is false but accepted).
When we apply the test to a large number of data sets or events, then
the rate α, the error of the first kind, is the inefficiency in the selection of
H0 events, while the rate β, the error of the second kind, represents the
background with which the selected events are contaminated with H1 events.
Of course, for α given, we would like to have β as small as possible. Given the
A test is called consistent if its power tends to unity as the sample size tends
to infinity. In other words: If we have an infinitely large data sample, we
should always be able to decide between H0 and the alternative H1 .
We also want that independent of α the rejection probability for H1 is
higher than for H0 , i.e. α < 1 − β. Tests that violate this condition are called
biased. Consistent tests are asymptotically unbiased.
When H1 represents a family of distributions, consistency and unbiased-
ness are valid only if they apply to all members of the family. Thus in case
that the alternative H1 is not specified, a test is biased if there is an arbi-
trary hypothesis different from H0 with rejection probability less than α and
it is inconsistent if we can find a hypothesis different from H0 which is not
rejected with power unity in the large sample limit.
10.2.4 P-Values
Definition
(Figure: p.d.f. f(t) of the test statistic with the critical region above t_c, and the p-value p(t) as a function of t; caption missing.)
4
This condition can always be realized for one-sided tests by a variable trans-
formation. For two-sided tests, p-values cannot be defined.
(Figure: two observed distributions, A with p = 0.082 and B with p = 0.073; caption missing.)
Since the distribution of p under H0 is uniform in the interval [0, 1], all values
of p in this interval are equally probable. When we reject a hypothesis under
the condition p < 0.1 we have a probability of 10% to reject H0 . The rejection
probability would be the same for a rejection region p > 0.9. The reason for
cutting at low p-values is the expectation that distributions of H1 would
produce low p-values.
The name p-value is derived from the word probability, but the p-value is
not the probability that the hypothesis under test is true. It is the probability
under H0 to obtain a value of the test statistic t that is larger than the value
that is actually observed or, equivalently, the probability to obtain a p-value
which is smaller than the observed one. A p-value between zero and p is
expected to occur in the fraction p of experiments if H0 is true.
We learn from this example that the p-value is more sensitive to deviations
from H0 in large samples than in small samples. Since in practice small un-
(Figure: distribution of p-values with a selection cut; caption missing.)
Combination of p-values
If two p-values p1 , p2 which have been derived from independent test statistics
t1 , t2 are available, we would like to combine them to a single p-value p. The
at first sight obvious idea to set p = p1 p2 suffers from the fact that the
distribution of p will not be uniform. A popular but arbitrary choice is
p = p1 p2 [1 − ln(p1 p2 )] (10.2)
which can be shown to be uniformly distributed [91]. This choice has the unpleasant feature that the combination of the p-values is not associative, i.e. p[(p_1, p_2), p_3] ≠ p[p_1, (p_2, p_3)]. There is no satisfactory way to combine p-values.
We propose, if possible, not to use (10.2) but to go back to the original
test statistics and construct from them a combined statistic t and the cor-
responding p distribution. For instance, the obvious combination of two χ2
statistics would be t = χ²_1 + χ²_2.
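Both options take only a few lines; the χ² case serves as the example of a combined statistic:

import numpy as np
from scipy import stats

def combine_p(p1, p2):
    """Combination (10.2): uniform under H0, but not associative."""
    q = p1 * p2
    return q * (1.0 - np.log(q))

def combine_chi2(chi2_1, ndf1, chi2_2, ndf2):
    """Preferred route for two chi^2 statistics: add them and their degrees of freedom."""
    return stats.chi2.sf(chi2_1 + chi2_2, ndf1 + ndf2)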
(Figure: experimental lifetime distribution compared to the prediction; caption missing.)
δk2 = N pk (1 − pk ) .
Usually the observations are distributed into typically more than 10 bins.
Thus the probabilities pk are small compared to unity and the expression in
brackets can be omitted. This is the Poisson approximation of the binomial
distribution. The mean quadratic deviation is equal to the number of expected
observations in the bin:
δk2 = N pk .
We now normalize the observed to the expected mean quadratic deviation,
\chi^2_k = \frac{(d_k - N p_k)^2}{N p_k} ,

and sum over all B bins:

\chi^2 = \sum_{k=1}^{B} \frac{(d_k - N p_k)^2}{N p_k} .   (10.3)

By construction we have

\langle\chi^2_k\rangle \approx 1 , \qquad \langle\chi^2\rangle \approx B .

\langle\chi^2\rangle = f = B - 1 .   (10.4)
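A minimal sketch of the test, assuming that the bin probabilities p_k are completely fixed by H0:

import numpy as np
from scipy import stats

def chi2_gof(d, p, n_constraints=1):
    """Pearson chi^2 test (10.3): observed bin contents d against bin probabilities p."""
    N = d.sum()
    chi2 = np.sum((d - N * p) ** 2 / (N * p))
    ndf = len(d) - n_constraints           # f = B - 1 when only the normalization is fixed
    return chi2, stats.chi2.sf(chi2, ndf)  # test statistic and p-value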
Fig. 10.7. p-value for the observed value χ̂².
5
We have to be especially careful when the significance level α is small.
Fig. 10.8. Critical χ² values as a function of the number of degrees of freedom with the significance level as parameter.
α, and reject it for p < α. The χ² test is also called the Pearson test after the statistician Karl Pearson6, who introduced it as early as 1900.
Figure 10.8 gives the critical values of χ2 , as a function of the number of
degrees of freedom with the significance level as parameter. To simplify the
presentation we have replaced the discrete points by curves. The p-value as
a function of χ2 with NDF as parameter is available in the form of tables or
in graphical form in many books. The internet provides on-line calculation
programs. For large f , about f > 20 and not too small α, the χ2 distribution
can be approximated sufficiently well by a normal distribution with mean
value x0 = f and variance s2 = 2f . We are then able to compute the p-
values from integrals over the normal distribution. Tables can be found in the
literature or alternatively, the computation can be performed with computer
programs like Mathematica or Maple.
There is no general rule for the choice of the number and width of the histogram bins for the χ² comparison, but we note that the χ² test loses significance when the number of bins becomes too large.
6
Karl Pearson (1857-1936), British mathematician
Some statisticians propose to adjust the bin parameters such that the
number of events is the same in all bins. In our table this partitioning is
denoted by e.p. (equal probability). In the present example this does not
improve the significance.
The value of χ2 is independent of the signs of the deviations. However, if
several adjacent bins show an excess (or lack) of events like in the left hand
histogram of Fig. 10.9 this indicates a systematic discrepancy which one
would not expect at the same level for the central histogram which produces
the same value for χ2 . Because correlations between neighboring bins do
not enter in the test, a visual inspection is often more effective than the
mathematical test. Sometimes it is helpful to present for every bin the value
of χ2 multiplied by the sign of the deviation either graphically or in form of
a table.
Fig. 10.9. The left hand and the central histogram produce the same χ² p-value, the left hand and the right hand histograms produce the same Kolmogorov p-value.
Warning
The assumption that the distribution of the test statistic under H0 is de-
scribed by a χ2 distribution relies on the following assumptions: 1. The en-
7
Note that also the σi have to be independent of the parameters.
tries in all bins of the histogram are normally distributed. 2. The expected
number of entries depends linearly on the free parameters in the considered
parameter range. An indication for a non-linearity are asymmetric errors of
the adjusted parameters. 3. The estimated uncertainties σi in the denomina-
tors of the summands of χ2 are independent of the parameters. Deviations
from these conditions affect mostly the distribution at large values of χ2 and
thus the estimation of small p-values. Corresponding conditions have to be
satisfied when we test the GOF of a curve to measured points. Whenever we
are not convinced about their validity we have to generate the distribution
of χ2 by a Monte Carlo simulation.
Relevant numerical values of λ(k, µ0 ) = Pµ0 (k)/Pk (k) and Pµ0 (k) for µ0 = 10
are given in the following table.
k 8 9 10 11 12 13
λ 0.807 0.950 1.000 0.953 0.829 0.663
P 0.113 0.125 0.125 0.114 0.095 0.073
It is seen that the sum over k runs over all k except k = 9, 10, 11, 12: p = \sum_{k=0}^{8} P_{10}(k) + \sum_{k=13}^{\infty} P_{10}(k) = 1 - \sum_{k=9}^{12} P_{10}(k) = 0.541, which is certainly acceptable.
The likelihood ratio test in this general form is useful to discriminate be-
tween a specific and a more general hypothesis, a problem which we will study
in Sect. 10.6.2. To apply it as a goodness-of-fit test, we have to histogram the
data.
If the bin content follows the Poisson statistics we get (see Chap. 6, Sect.
7.1)
V = \sum_{i=1}^{B} \left[ -t_i + d_i \ln t_i - \ln(d_i!) + d_i - d_i \ln d_i + \ln(d_i!) \right] = \sum_{i=1}^{B} \left[ d_i - t_i + d_i \ln(t_i/d_i) \right] .
The distribution of the test statistic V is not universal, i.e. not independent of the distribution to be tested as in the case of χ². It has to be generated by a Monte Carlo simulation.
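A sketch: V is computed from the observed and predicted bin contents, and its distribution under H0, and with it the p-value, is generated by Monte Carlo, assuming Poisson-distributed bin contents:

import numpy as np

def v_statistic(d, t):
    d = np.asarray(d, dtype=float)
    t = np.asarray(t, dtype=float)
    ratio = np.ones_like(d)
    ratio[d > 0] = t[d > 0] / d[d > 0]           # d*ln(t/d) -> 0 for empty bins
    return np.sum(d - t + d * np.log(ratio))     # V as given above

def v_pvalue(d, t, n_toys=10_000, rng=None):
    rng = rng or np.random.default_rng()
    v_obs = v_statistic(d, t)
    toys = rng.poisson(t, size=(n_toys, len(t)))
    v_toys = np.array([v_statistic(dt, t) for dt in toys])
    return np.mean(v_toys <= v_obs)              # small V signals poor agreement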
Fig. 10.10. Comparison of the empirical distribution function S(x) with the theoretical distribution function F(x).
The test statistic is the maximum difference D between the two functions, D = \sup_x |S(x) - F(x)|.

(Figure 10.11: p-value as a function of D*; caption missing.)
…bers larger than 20 the approximation D* = D(\sqrt{N} + 0.12 + 0.11/\sqrt{N}) is still a very good approximation8. The function p(D*) is displayed in Fig. 10.11.
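For a completely specified H0 the test is available in standard libraries; a sketch for an exponential hypothesis (the sample is invented), together with the scaling to D* quoted above:

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.exponential(1.0, size=100)

D, p = stats.kstest(sample, stats.expon.cdf)    # H0: standard exponential
N = len(sample)
D_star = D * (np.sqrt(N) + 0.12 + 0.11 / np.sqrt(N))
print(D, D_star, p)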
The Kolmogorov–Smirnov test emphasizes more the center of the distribu-
tion than the tails because there the distribution function is tied to the values
zero and one and thus is little sensitive to deviations at the borders. Since it
is based on the distribution function, deviations are integrated over a certain
range. Therefore it is not very sensitive to deviations which are localized in a
narrow region. In Fig. 10.9 the left hand and the right hand histograms have
the same excess of entries in the region left of the center. The Kolmogorov–
Smirnov test produces in both cases approximately the same value of the
test statistic, even though we would think that the distribution of the right
hand histogram is harder to explain by a statistical fluctuation of a uniform
distribution. This shows again that the power of a test depends strongly on
the alternatives to H0 . The deviations of the left hand histogram are well
detected by the Kolmogorov–Smirnov test, those of the right hand histogram
much better by the Anderson–Darling test which we will present below.
There exist other EDF tests [90], which in most situations are more ef-
fective than the simple Kolmogorov–Smirnov test.
8
D does not scale exactly with √N because S increases in discrete steps.
The test statistics of the Anderson–Darling and the Watson test are

A^2 = N \int_{-\infty}^{\infty} \frac{[S(x) - F(x)]^2}{F(x)\,[1 - F(x)]}\, dF ,

U^2 = N \int_{-\infty}^{\infty} \left\{ S(x) - F(x) - \int_{-\infty}^{\infty} [S(x) - F(x)]\, dF \right\}^2 dF .
In the Appendix 13.8 we show how to compute the test statistics. There
also the asymptotic distributions are collected.
where θi are parameters and the functions πi (z) are modified orthogonal
Legendre polynomials that are normalized to the interval [0, 1] and symmetric
or antisymmetric with respect to z = 1/2:
\pi_0(z) \equiv 1 ,

\pi_1(z) = \sqrt{3}\,(2z - 1) ,

\pi_i(z) = \sqrt{2i + 1}\, P_i(2z - 1) .
Here Pi (x) is the Legendre polynomial in the usual form. The first parameter
θ0 is fixed, θ0 = 1, and the other parameters are restricted such that gk is
positive. The user has to choose the parameter k which limits the degree of
the polynomials. If the alternative hypothesis is suspected to contain narrow
structures, we have to admit large k. The test with k = 1 rejects a linear
contribution, k = 2 in addition a quadratic component and so on. Obviously,
the null hypothesis H0 corresponds to θ_1 = · · · = θ_k = 0, or equivalently to \sum_{i=1}^{k}\theta_i^2 = 0. We have to look for a test statistic which increases with the

because

\int_0^1 \pi_i(z)\, dz = 0 .
The binning-free tests discussed so far are restricted to one dimension, i.e. to
univariate distributions. We now turn to multivariate tests.
A very obvious way to express the difference between two distributions f
and f0 is the integrated quadratic difference
9
For k = 1, for instance, the test cannot exclude distributions concentrated near
z = 1/2.
L_2 = \int [f(r) - f_0(r)]^2\, dr .   (10.11)
So far the L2 test [97] has not found as much attention as it deserves
because the calculation of the integral is tedious. However its Monte Carlo
version is pretty simple. It offers the possibility to adjust the width of the
smearing function to the density f0 . Where we expect large distances of
observations, the Gaussian width should be large, α ∼ f02 .
A more sophisticated version of the L2 test is presented in [97]. The Monte
Carlo version is included in Sect. 10.4.10, see below.
with

C_{ij} = \frac{1}{N}\sum_{n=1}^{N} (x_{ni} - \bar{x}_i)(x_{nj} - \bar{x}_j) .
In the following tests the choice of the metric is up to the user. In many
situations it is reasonable to use the Mahalanobis distance, even though mod-
erate variations of the metric normally have little influence on the power of
a test.
A very general expression that measures the difference between two distribu-
tions f (r) and f0 (r) in an n dimensional space is
\phi = \frac{1}{2}\int\!\!\int dr\, dr'\, [f(r) - f_0(r)]\,[f(r') - f_0(r')]\, R(r, r') .   (10.12)
if the charge is zero everywhere, i.e. the two charge densities are equal up to
the sign. Because of this analogy we refer to φ as energy.
For our purposes the logarithmic function R(r) = − ln(r) and the bell
function R(r) ∼ exp(−cr2 ) are more suitable than 1/r.
We multiply the expressions in brackets in (10.12) and obtain

\phi = \frac{1}{2}\int\!\!\int dr\, dr' \left[ f(r)f(r') + f_0(r)f_0(r') - 2f(r)f_0(r') \right] R(|r - r'|) .   (10.14)
A Monte Carlo integration of this expression is obtained when we generate M
random points {r01 . . . r 0M } of the distribution f0 (r) and N random points
{r1 , . . . , rN } of the distribution f (r) and weight each combination of points
with the corresponding distance function. The Monte Carlo approximation
is:
\phi \approx \frac{1}{N(N-1)}\sum_i\sum_{j>i} R(|r_i - r_j|) - \frac{1}{NM}\sum_i\sum_j R(|r_i - r_{0j}|) + \frac{1}{M(M-1)}\sum_i\sum_{j>i} R(|r_{0i} - r_{0j}|)

\approx \frac{1}{N^2}\sum_i\sum_{j>i} R(|r_i - r_j|) - \frac{1}{NM}\sum_i\sum_j R(|r_i - r_{0j}|) + \frac{1}{M^2}\sum_i\sum_{j>i} R(|r_{0i} - r_{0j}|) .   (10.15)
\phi = \phi_1 - \phi_2 + \phi_3 ,   (10.16)

\phi_1 = \frac{1}{N(N-1)}\sum_{i<j} R(|r_i - r_j|) ,   (10.17)

\phi_2 = \frac{1}{NM}\sum_{i,j} R(|r_i - r_{0j}|) ,   (10.18)

\phi_3 = \frac{1}{M(M-1)}\sum_{i<j} R(|r_{0i} - r_{0j}|) .   (10.19)
The term φ3 is independent of the data and can be omitted but is normally
included to reduce statistical fluctuations.
The distance function R relates sample points and simulated points of the
null hypothesis to each other. Proven useful have the functions
R_l = -\ln(r + \varepsilon) ,   (10.20)

R_s = e^{-r^2/(2s^2)} .   (10.21)
The small positive constant ε suppresses the pole of the logarithmic dis-
tance function. Its value should be chosen approximately equal to the ex-
perimental resolution11 but variations of ε by a large factor have no sizable
influence on the result. With the function R1 = 1/r we get the special case of
electrostatics. With the Gaussian distance function Rs the test is very similar
to the χ2 test with bin width 2s but avoids the arbitrary binning of the lat-
ter. The parameter s has to be adjusted to the application. The logarithmic
distance function is less sensitive to the scale and does not require to tune a
parameter.
The distribution of the test statistic under H0 can be obtained by generating Monte Carlo samples. If the number of events is large and if the significance level α is small, the computation may become tedious.
Also resampling techniques can be applied to construct the distribution
of φ under H0 . A data set of 2N observations is generated, which by splitting
it, allows us to obtain two simulated values of φ. Then we shuffle the elements
and compute again two additional values of φ. The shuffling is repeated as
long as needed to get the required statistics. An efficient shuffling technique
invented by Fisher and Yates and improved by Durstenfeld is described in
the Appendix 13.9. The values of φ are correlated, but the correlation is
mostly negligible. In case of doubts, the shuffle should be repeated with sev-
eral independent 2N sets. From the fluctuation of the p-values its error can
be derived.
The energy test is consistent [98]. It is quite powerful in many situations
and has the advantage that it is not required to sort the sample elements.
11
Distances between two points that are smaller than the resolution are accidental
and thus insignificant.
(Figure 10.14: power of the tests (Anderson, Kolmogorov, Kuiper, Neyman, Watson, Region, Energy (log), Energy (Gaussian), χ²) for the six admixtures (30% B, 20% N(0.5, 1/24), 20% N(0.5, 1/32), 50% linear, 30% A, 30% C); caption missing.)
(Figure: two-dimensional sample with p(χ²) = 0.71 and p(F) = 0.02; caption missing.)
To get an idea of the power of different tests, we consider six different ad-
mixtures to a uniform distribution and compute the fraction of cases in which
the distortion of the uniform distribution is detected at a significance level of
5%. For each distribution constructed in this way, we generate stochastically
1000 mixtures with 100 observations each. The distributions which we add
are depicted in Fig. 10.13. One of them is linear, two are normal with differ-
ent widths, and three are parabolic. The χ2 test was performed with 12 bins
following the prescription of Ref. [92], the parameter of Neyman’s smooth
test was k = 2 and the width of the Gaussian of the energy test was s = 1/8.
The sensitivity of different tests is presented in Fig. 10.14.
The histogram of Fig. 10.14 shows that none of the tests is optimum
in all cases. The χ2 test performs only mediocrely. Probably a lower bin
number would improve the result. The tests of Neyman, Anderson–Darling
and Kolmogorov–Smirnov are sensitive to a shift of the mean value while
the Anderson–Darling test reacts especially to changes at the borders of the
distribution. The tests of Watson and Kuiper detect preferentially variations
of the variance. Neyman’s test and the energy test with logarithmic distance
function are rather efficient in most cases.
Multivariate Distributions
To test whether two samples are compatible, we can apply the χ2 test or the
Kolmogorov–Smirnov test with minor modifications.
When we calculate the χ² statistic we have to normalize the two samples A and B of sizes N and M to each other. For a_i and b_i entries in bin i, a_i/N − b_i/M should be compatible with zero. With the usual error propagation we obtain an estimate a_i/N² + b_i/M² of the quadratic error of this quantity and
$$\chi^2 = \sum_{i=1}^{B}\frac{(a_i/N - b_i/M)^2}{a_i/N^2 + b_i/M^2}\;. \qquad (10.22)$$
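A sketch of this two-sample χ² in Python (the names are illustrative); a and b are the bin contents of the two histograms with identical binning:

```python
import numpy as np

def two_sample_chi2(a, b):
    """chi2 of (10.22) for two histograms a, b of total contents N and M."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    N, M = a.sum(), b.sum()
    num = (a / N - b / M) ** 2
    den = a / N**2 + b / M**2
    mask = den > 0                 # skip bins that are empty in both histograms
    return (num[mask] / den[mask]).sum()
```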
The likelihood ratio test is less vulnerable to low event numbers than the χ² test.
Setting r = M/N we compute the likelihood that we observe in a single bin a entries with expectation λ and b entries with expectation ρλ, where the hypothesis H0 is characterized by ρ = r:
$$\ln L = -\lambda(1+\rho) + (a+b)\ln\lambda + b\ln\rho\;.$$
Maximizing without the constraint ρ = r yields λ̂ = a, ρ̂ = b/a and
$$\ln L_{u\,max} = -(a+b) + a\ln a + b\ln b\;.$$
Our test statistic is VAB , the logarithm of the likelihood ratio, now summed
over all bins:
Fig. 10.16. Two-sample test. Left hand: the samples which are to be compared.
Right hand: distribution of test statistic and actual value.
We compute the energy φAB in the same way as above, replacing the
Monte Carlo sample by one of the experimental samples. The expected dis-
tribution of the test statistic φAB is computed in the same way as for the like-
lihood ratio test from the combined sample using the permutation technique
by shuffling. Our experimental p-value is equal to the fraction of generated
φi from the bootstrap sample which are larger than φAB :
$$p = \frac{\text{Number of permutations with } \phi_i > \phi_{AB}}{\text{Total number of permutations}}\;.$$
The k-nearest neighbor test is by construction a two-sample test. The distribution of the test statistic is obtained in exactly the same way as in the two-sample energy test which we have discussed in the previous section.
The performance of the k-nearest neighbor test is similar to that of the energy test. The energy test (and the L2 test which is automatically included in the former) is more flexible than the k-nearest neighbor test and includes all observations of the sample in the continuous distance function. The k-nearest neighbor test, on the other hand, is less sensitive to variations of the density which are problematic for the energy test with a Gaussian distance function of constant width.
Tests for signals are closely related to goodness-of-fit tests but their aim is different. We are not interested in verifying that H0 is compatible with a sample; rather we intend to quantify the evidence for signals which are possibly present in a sample that consists mainly of uninteresting background. Here not only the distribution of the background has to be known, but in addition we must be able to parameterize the alternative which we search for. The null hypothesis H0 corresponds to the absence of deviations from the background. The alternative Hs is not fully specified, otherwise it would be sufficient to compute the simple likelihood ratio which we have discussed in Chap. 6.
Signal tests are applied when we search for rare decays or reactions like neutrino oscillations. Another frequently occurring problem is that we want to interpret a line in a spectrum as an indication of a resonance or a new particle. To establish the evidence of a signal, we usually require a very significant deviation from the null hypothesis, i.e. the sum of background and signal has to describe the data much better than the background alone, because particle physicists look in hundreds of histograms for more or less wide lines and thus always find candidates¹² which in most cases are just background fluctuations. For this reason, signals are only accepted by the community if they have a significance of at least four or five standard deviations. In cases where the existence of new phenomena is not unlikely, a smaller significance may be sufficient. A high significance for a signal corresponds to a low p-value of the null hypothesis.
Quoting the p-value instead of the significance expressed by the number of standard deviations by which the signal exceeds the background expectation is to be preferred, because the p-value is a measure which is independent of
¹² This is the so-called look-elsewhere effect.
the form of the distribution. However, the standard deviation scale is better suited to indicate the significance than the p-value scale, where very small values dominate. For this reason it has become customary to transform the p-value p into the number of Gaussian standard deviations sG, which are related through
$$p = \frac{1}{\sqrt{2\pi}}\int_{s_G}^{\infty}\exp(-x^2/2)\,dx \qquad (10.23)$$
$$= \bigl[1 - \operatorname{erf}(s_G/\sqrt{2})\bigr]/2\;. \qquad (10.24)$$
The function sG (p) is given in Fig. 10.17. Relations (10.23), (10.24) refer to
one-sided tests. For two-sided tests, p has to be multiplied by a factor two.
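The conversion between the one-sided p-value and the number of Gaussian standard deviations can be done numerically, for instance with scipy (a minimal sketch; the function names below are illustrative):

```python
from scipy.stats import norm

def p_from_sigma(s_g):
    """One-sided p-value for s_G standard deviations, cf. (10.23)."""
    return norm.sf(s_g)                  # survival function 1 - Phi(s_G)

def sigma_from_p(p):
    """Number of Gaussian standard deviations for a one-sided p-value p."""
    return norm.isf(p)                   # inverse survival function

# p_from_sigma(5.0)  -> about 2.9e-7
# sigma_from_p(1e-3) -> about 3.1
```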
When we require very low p-values for H0 to establish signals, we have
to be especially careful in modeling the distribution of the test statistic.
Often the distribution corresponding to H0 is approximated for instance by a
polynomial with some uncertainties in the parameters and assumptions which
are difficult to implement in the test procedure. We then have to be especially
conservative. It is better to underestimate the significance of a signal than to
present evidence for a new phenomenon based on a doubtful number.
To illustrate this problem we return to our standard example where we search for a line in a one-dimensional spectrum. Usually, the background under an observed bump is estimated from the number of events outside but near the bump, in the so-called side bands. If the side bands are chosen too close to the signal, they are affected by the tails of the signal; if they are chosen too far away, the extrapolation into the signal region is sensitive to the assumed shape of the background distribution, which often is approximated by a linear or quadratic function. This makes it difficult to estimate the size and the uncertainty of the expected background with sufficient accuracy to establish the p-value for a large (>4 st. dev.) signal. As a numerical example let us consider an expectation of 1000 background events which the experimenter underestimates by 2%, i.e. takes to be 980. A 4.3 st. dev. excess would then be claimed as a 5 st. dev. effect, and the quoted p-value would be too low by a factor of about 28. We also have to be careful with numerical
approximations, for instance when we approximate a Poisson distribution by
a Gaussian. These uncertainties have to be included in the simulation of the
distribution of the test statistic.
Usually the likelihood ratio, i.e. the ratio of the maximum likelihood under Hs to the maximum likelihood under H0, provides the most powerful test statistic. In some situations a relevant parameter which characterizes the signal strength is more informative.
[Fig. 10.17. Number of standard deviations as a function of the p-value.]
The p-value derived from the LR statistic does not take into account
that a simple hypothesis is a priori more attractive than a composite one
which contains free parameters. Another point of criticism is that the LR
is evaluated only at the parameters that maximize the likelihood while the
parameters suffer from uncertainties. Thus conclusions should not be based
on the p-value only.
A Bayesian approach applies so-called Bayes factors to correct for the mentioned effects but is not very popular because it has other caveats. Its essentials are presented in Appendix 13.17.
Fig. 10.18. Distributions of the test statistic under H0 and p-value as a function
of the test statistic.
Fig. 10.19. Histogram of event sample used for the likelihood ratio test. The curve
is an unbinned likelihood fit to the data.
essentially the negative logarithm of the likelihood of the MLE. Fig. 10.18 shows the results from a million simulated experiments. The distribution of −ln λ under H0 has a mean value of 1.502 which corresponds to ⟨∆χ²⟩ = 3.004. The p-value as a function of −ln λ follows asymptotically an exponential, as is illustrated in the right hand plot of Fig. 10.18. Thus it is possible to extrapolate the function to smaller p-values, which is necessary to claim large effects. Figure 10.19 displays the result of an experiment where a likelihood fit finds a resonance at the energy 0.257. It contains a fraction of 0.0653 of the events. The logarithm of the likelihood ratio is 9.277. The corresponding p-value for H0 is pLR = 1.8 · 10⁻⁴. Hence it is likely that the observed bump is a resonance. In fact it had been generated as a 7% contribution of a Gaussian distribution N(x|0.25, 0.05) to a uniform distribution.
Fig. 10.20. Distributions of the test statistic for H0 and for experiments with a
1.5% resonance contribution. In the lower graph the p-value for H0 is given as a
function of the test statistic.
We have to remember though that the p-value is not the probability that
H0 is true, it is the probability that H0 simulates the resonance of the size
seen in the data or larger. In a Bayesian treatment, see Appendix 13.17, we
find betting odds in favor of H0 of about 2% which is much less impressive.
The two numbers refer to different issues but nonetheless we have to face the
fact that the two different statistical approaches lead to different conclusions
about how evident the existence of a bump really is.
In experiments with a large number of events, the computation of the p-
value distribution based on the unbinned likelihood ratio becomes excessively
slow and we have to turn to histograms and to compute the likelihood ratio
of H0 and Hs from the histogram. Figure 10.20 displays some results from
the simulation of 106 experiments of the same type as above but with 10000
events distributed over 100 bins.
In the figure the distributions of the LR statistic are shown for H0 and for experiments with a 1.5% resonance added. The large spread of the signal distributions reflects the fact that identical experiments may by chance observe a very significant signal or just a slight indication of a resonance.
We now extend the likelihood ratio test to the multi-channel case. We assume that the observations xk of the channels k = 1, . . . , K are independent of each other. The overall likelihood is the product of the individual likelihoods. For the log-likelihood ratio we then have to replace (10.26) by
$$\ln\lambda = \sum_{k=1}^{K}\Bigl\{\ln\sup\bigl[L_{0k}(\theta_{0k}|x_k)\bigr] - \ln\sup\bigl[L_{sk}(\theta'_{0k},\theta_{sk}|x_k)\bigr]\Bigr\}\;.$$
Note that the MLEs of the parameters θ0k depend on the hypothesis. They are different for the null and the signal hypotheses and, for this reason, have been marked by an apostrophe in the latter.
We learn from this example that the LR statistic provides the most powerful test among the considered alternatives. It takes into account not only the excess of events of a signal but also its expected shape. For this reason pLR is smaller than pf.
Often the significance of a signal s is stated in units of standard deviations σ:
$$s = \frac{N_s}{\sqrt{N_0 + \delta_0^2}}\;.$$
Here Ns is the number of events associated to the signal, N0 is the number of events in the signal region expected from H0 and δ0 its uncertainty. In the Gaussian approximation it can be transformed into a p-value via (10.23). Unless N0 is very large and δ0 is very well known, this p-value has to be considered as a lower limit or a rough guess.
11 Statistical Learning
11.1 Introduction
In the process of its mental evolution a child learns to classify objects, persons, animals, and plants. This process partially proceeds through explanations by parents and teachers (supervised learning), but partially also by cognition of the similarities of different objects (unsupervised learning). But the process of learning – of children and adults – is not restricted to the development of the ability merely to classify; it also includes the realization of relations between similar objects, which leads to ordering and quantifying physical quantities, like size, time, temperature, etc. This is relatively easy when the laws of nature governing a specific relation have been discovered. If this is not the case, we have to rely on approximations, like inter- or extrapolations.
Also computers, when appropriately programmed, can perform learning processes in a similar way, though to a rather modest degree. The achievements of so-called artificial intelligence are still rather moderate in most areas; however, substantial progress has been achieved in the fields of supervised learning and classification, where computers profit from their ability to handle a large amount of data in a short time and to provide precise quantitative solutions to well defined specific questions. The techniques and programs that allow computers to learn and to classify are summarized in the literature under the term machine learning.
Let us specify the type of problems which we discuss in this chapter: For
an input vector x we want to find an output ŷ. The input is also called predic-
tor, the output response. Usually, each input consists of several components
(attributes, properties), and is written therefore in boldface letters. Normally,
it is a metric (quantifiable) quantity but it could also be a categorical quantity
like a color or a particle type. The output can also contain several components
or consist of a single real or discrete (Yes or No) variable. Like a human
being, a computer program learns from past experience. The teaching pro-
cess, called training, uses a training sample {(x1 , y1 ), (x2 , y 2 ) . . . (xN , y N )},
where for each input vector xi the response y i is known. When we ask for the
response to an arbitrary continuous input x, usually its estimate ŷ(x) will be
more accurate when the distance to the nearest input vector of the training
sample is small than when it is far away. Consequently, the training sample
general computer algorithms. This book can only introduce these methods,
without claim of completeness. A nice review of the whole field is given in
[16].
described in Chap. 6.7.1. In this section we treat the general non-linear case
with arbitrary errors.
In principle, the independent variable may also be multi-dimensional.
Since then the treatment is essentially the same as in the one-dimensional
situation, we will mainly discuss the latter.
k-Nearest Neighbors
$$(\delta y_j)^2 = \frac{\delta_j^2}{K} + \langle y_j(x) - \hat{y}_j(x)\rangle^2\;. \qquad (11.1)$$
The first term is the statistical fluctuation of the mean value. The second
term is the bias which is equal to the systematic shift squared, and which
is usually difficult to evaluate. There is the usual trade-off between the two
error components: with increasing K the statistical term decreases, but the
bias increases by an amount depending on the size of the fluctuations of the
true function within the averaging region.
The simple average suffers from the drawback that at the boundary of the
variable space the measurements contributing to the average are distributed
asymmetrically with respect to the point of interest x. If, for instance, the
function falls strongly toward the left-hand boundary of a one-dimensional
space, averaging over points which are predominantly located at the right
hand side of x leads to too large a result. (See also the example at the end of
this section). This problem can be avoided by fitting a linear function through
the K neighboring points instead of using the mean value of y.
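A one-dimensional sketch of both variants (simple average over the K neighbors, and the local straight-line fit that reduces the boundary bias) could look as follows; the function name and arguments are illustrative:

```python
import numpy as np

def knn_smooth(x_train, y_train, x_eval, k=10, linear=False):
    """K-nearest-neighbor smoothing: mean (or local linear fit) of the
    responses of the k training points closest to each evaluation point."""
    x_train, y_train = np.asarray(x_train, float), np.asarray(y_train, float)
    result = []
    for x in np.atleast_1d(x_eval):
        idx = np.argsort(np.abs(x_train - x))[:k]     # the k nearest neighbors
        if linear:                                    # reduces the boundary bias
            coef = np.polyfit(x_train[idx], y_train[idx], 1)
            result.append(np.polyval(coef, x))
        else:
            result.append(y_train[idx].mean())
    return np.asarray(result)
```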
Gaussian Kernels
To take all k-nearest neighbors into account with the same weight indepen-
dent of their distance to x is certainly not optimal. Furthermore, its out-
put function is piecewise constant (or linear) and thus discontinuous. Better
should be a weighting procedure, where the weights become smaller with in-
creasing distances. An often used weighting or kernel function1 is the Gaus-
sian. The sum is now taken over all N training inputs:
$$\hat{y}(x) = \frac{\sum_{i=1}^{N} y_i\, e^{-\alpha|x-x_i|^2}}{\sum_{i=1}^{N} e^{-\alpha|x-x_i|^2}}\;.$$
The constant α determines the range of the correlation. Therefore the width s = 1/√(2α) of the Gaussian has to be adjusted to the density of the points and to the curvature of the function. If computing time has to be economized, the sum may of course be truncated and restricted to the neighborhood of x, for instance to the distance 2s. According to (11.1) the mean squared error becomes²:
$$(\delta y_j)^2 = \delta_j^2\,\frac{\sum e^{-2\alpha|x-x_i|^2}}{\bigl[\sum e^{-\alpha|x-x_i|^2}\bigr]^2} + \langle y_j(x) - \hat{y}_j(x)\rangle^2\;.$$
¹ The term kernel will be justified later, when we introduce classification methods.
² This relation has to be modified if not all errors are equal.
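A sketch of the Gaussian-weighted estimate in one dimension (the names are illustrative; all N training points enter with weight exp(−α|x − x_i|²)):

```python
import numpy as np

def gaussian_kernel_smooth(x_train, y_train, x_eval, s=0.2):
    """Weighted mean of the training responses with Gaussian weights of
    width s, i.e. alpha = 1/(2 s^2)."""
    x_train, y_train = np.asarray(x_train, float), np.asarray(y_train, float)
    x_eval = np.atleast_1d(np.asarray(x_eval, float))
    alpha = 1.0 / (2.0 * s**2)
    w = np.exp(-alpha * (x_eval[:, None] - x_train[None, :])**2)
    return (w * y_train).sum(axis=1) / w.sum(axis=1)
```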
and have
$$(u_i, u_j) = \delta_{ij}\;, \qquad \sum_i u_i^*(x)\,u_i(x') = \delta(x - x')\;.$$
For instance, the functions of the well known Fourier system for the interval [a, b] = [−L/2, L/2] are u_n(x) = (1/√L) exp(i2πnx/L).
Every square integrable function can be represented by the series
$$f(x) = \sum_{i=0}^{\infty} a_i u_i(x)\;, \quad \text{with } a_i = (u_i, f)\;,$$
in the sense that the squared difference converges to zero with increasing number of terms⁴:
$$\lim_{K\to\infty}\Bigl[f(x) - \sum_{i=0}^{K} a_i u_i(x)\Bigr]^2 = 0\;. \qquad (11.2)$$
Polynomial Approximation
The simplest function approximation is achieved with a simple polynomial f(x) = Σ_k a_k x^k or, more generally, by f(x) = Σ_k a_k u_k where u_k is a polynomial of order k.
and now define the inner product of two functions g(x), h(x) by
$$(g, h) = \sum_{\nu} w_{\nu}\, g(x_{\nu})\, h(x_{\nu})\;.$$
$$\chi^2 = \sum_{\nu=1}^{N} w_{\nu}\Bigl[y_{\nu} - \sum_{k=0}^{K} a_k u_k(x_{\nu})\Bigr]^2\;.$$
$$(y, u_j) = a_j\;. \qquad (11.4)$$
This relation produces the coefficients also in the interesting case K < N − 1.
The error of the coefficients is given by
$$(\delta a_i)^2 = 1\Big/\sum_{\nu=1}^{N}\frac{1}{\delta_{\nu}^2}\;.$$
The derivation of this formula is given in the Appendix 13.14 together with
formulas for the polynomials in the special case where all measurements have
equal errors and are uniformly distributed in x.
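In practice the coefficients and their errors can also be obtained directly from a weighted least-squares fit with a design matrix. The following sketch uses simple powers x^k as basis functions (illustrative names; not the orthogonal polynomials of the text):

```python
import numpy as np

def weighted_poly_fit(x, y, dy, K):
    """Weighted least-squares fit of a polynomial of order K, minimizing the
    chi2 sum with weights 1/dy^2; returns coefficients and their covariance."""
    x, y, dy = (np.asarray(a, float) for a in (x, y, dy))
    A = np.vander(x, K + 1, increasing=True)     # columns 1, x, ..., x^K
    AtW = A.T / dy**2                            # A^T W with W = diag(1/dy^2)
    cov = np.linalg.inv(AtW @ A)                 # covariance of the coefficients
    a = cov @ (AtW @ y)
    return a, cov
```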
and their explicit form can be obtained by the simple recursion relation
$$\tilde{H}_{i+1} = x\tilde{H}_i - i\tilde{H}_{i-1}\;.$$
With H̃0 = 1, H̃1 = x we get
$$\tilde{H}_2 = x^2 - 1\;, \quad \tilde{H}_3 = x^3 - 3x\;, \quad \tilde{H}_4 = x^4 - 6x^2 + 3\;,$$
and so on.
When we multiply both sides of (11.7) with H̃_j(x) and integrate, we find, according to (11.8), the coefficients a_i from
$$a_i = \frac{1}{i!}\int f(x)\,\tilde{H}_i(x)\,dx\;.$$
These integrals can be expressed as combinations of moments of f (x),
which are to be approximated by the sample moments of the experimental
distribution. First, the sample mean and the sample variance are used to
shift and scale the experimental distribution such that the transformed mean
and variance equal 0 and 1, respectively. Then a1,2 = 0, and the empirical
skewness and excess of the normalized sample γ1,2 as defined in Sect. 3.2 are
proportional to the parameters a3,4 . The approximation to this order is
$$f(x) \approx N(x)\Bigl(1 + \frac{\gamma_1}{3!}\tilde{H}_3(x) + \frac{\gamma_2}{4!}\tilde{H}_4(x)\Bigr)\;.$$
As mentioned, this approximation is well suited to describe distributions
which are close to normal distributions. This is realized, for instance, when the
variate is a sum of independent variates such that the central limit theorem
applies. It is advisable to check the convergence of the corresponding Gram–
Charlier series and not to truncate the series too early [3].
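A small sketch of this approximation (with illustrative names): the sample is standardized, the empirical skewness γ1 and excess γ2 are computed, and the truncated Gram–Charlier series is evaluated:

```python
import numpy as np

def gram_charlier_pdf(sample, x):
    """Truncated Gram-Charlier approximation N(z)(1 + g1 H3/3! + g2 H4/4!)
    with z the standardized variable; returns a density in x."""
    sample = np.asarray(sample, float)
    mu, sig = sample.mean(), sample.std()
    zs = (sample - mu) / sig
    g1 = np.mean(zs**3)                     # empirical skewness
    g2 = np.mean(zs**4) - 3.0               # empirical excess
    z = (np.asarray(x, float) - mu) / sig
    H3, H4 = z**3 - 3*z, z**4 - 6*z**2 + 3
    normal = np.exp(-z**2 / 2) / np.sqrt(2*np.pi)
    return normal * (1 + g1*H3/6 + g2*H4/24) / sig   # /sig normalizes in x
```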
11.2.3 Wavelets
The trigonometric functions used in the Fourier series are discrete in the
frequency domain, but extend from minus infinity to plus infinity in the
spatial domain and thus are not very well suited to describe strongly localized
function variations. To handle this kind of problem, the wavelet system has
been invented. Wavelets are able to describe pulse signals and spikes like
those generated in electrocardiograms, nuclear magnetic resonance (NMR)
records or seismic records, in data transmission, and for the coding of images
and hand-written text. For data reduction and storage they have become an
indispensable tool.
The simplest orthogonal system with the desired properties are the Haar
wavelets shown in Fig. 11.1. The lowest row shows three wavelets which are
orthogonal, because they have no overlap. The next higher row contains again
three wavelets with one half the length of those below. They are orthogonal
to each other and to the wavelets in the lower row. In the same way the higher
frequency wavelets in the following row are constructed. We label them with
two indices j, k indicating length and position. We define a mother function
ψ(x), the bottom left wavelet function of Fig. 11.1.
$$\psi(x) = \begin{cases} 1, & \text{if } 0 \le x < \tfrac{1}{2} \\ -1, & \text{if } \tfrac{1}{2} \le x < 1 \\ 0, & \text{else} \end{cases}$$
and set W00 = ψ(x). The remaining wavelets are then obtained by translations and dilatations in discrete steps from the mother function ψ(x),
$$W_{jk}(x) = 2^{j/2}\,\psi(2^j x - k)\;.$$
Other frequently used mother functions are
$$\psi(x) = \frac{1}{\sqrt{2\pi}\,\sigma^3}\, e^{-x^2/(2\sigma^2)}\Bigl(1 - \frac{x^2}{\sigma^2}\Bigr) \quad \text{(Mexican hat)}\;, \qquad (11.10)$$
$$\psi(x) = (e^{ix} - c)\, e^{-x^2/(2\sigma^2)} \quad \text{(Morlet wavelet)}\;, \qquad (11.11)$$
and many others. The first function, the Mexican hat, is the second derivative
of the Gaussian function, Fig. 11.2. The second, the Morlet function, is a
complex monochromatic wave, modulated by a Gaussian. The constant c =
exp(−σ²/2) in the Morlet function can usually be neglected by choosing a wide lowest order function, σ ≳ 5. In both functions σ defines the width of the window.
The mother function ψ has to fulfil, apart from the trivial normalization property (11.9), also the relation
$$\int \psi(x)\,dx = 0\;.$$
Any square integrable function f(x) fulfilling ∫ f(x) dx = 0 can be expanded in the discrete wavelet series
$$f(x) = \sum_{j,k} c_{jk} W_{jk}(x)\;.$$
⁵ The Haar wavelets are real, but some types of wavelets are complex.
[Fig. 11.2. The Mexican hat wavelet.]
Fig. 11.3. Linear spline approximation.
[Figure: linear, quadratic, and cubic B-splines.]
A sensible choice should take into account the mean squared dispersion of
the points, i.e. the χ2 -sum should be of the order of the number of degrees of
freedom. When the response values y are exact and equidistant, the points
are simply connected by a polygon.
The amplitudes a_k can be obtained from a least squares fit. For values of the response function y_i and errors δy_i at the input points x_i, i = 1, . . . , N, we minimize
$$\chi^2 = \sum_{i=1}^{N}\frac{\bigl[y_i - \sum_{k=0}^{K} a_k B_k(x_i)\bigr]^2}{(\delta y_i)^2}\;. \qquad (11.13)$$
Of course, the number N of input values has to be at least equal to the
number K of splines. Otherwise the number of degrees of freedom would
become negative and the approximation under-determined.
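A minimal sketch for linear B-splines (hat functions) on sorted knot positions; each basis function is generated by linear interpolation of a unit impulse placed at its knot (the names are illustrative, and the x values are assumed to lie inside the knot range):

```python
import numpy as np

def linear_bspline_fit(x, y, dy, knots):
    """Weighted least-squares fit of the spline amplitudes a_k of (11.13)."""
    x, y, dy, knots = (np.asarray(a, float) for a in (x, y, dy, knots))
    # hat function B_k(x): linear interpolation of a unit impulse at knot k
    B = np.column_stack([np.interp(x, knots, np.eye(len(knots))[k])
                         for k in range(len(knots))])
    BtW = B.T / dy**2                           # weights 1/(delta y_i)^2
    a = np.linalg.solve(BtW @ B, BtW @ y)       # amplitudes minimizing chi2
    return a, B @ a                             # coefficients, fitted values
```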
11.2.6 Example
[Panels of Fig. 11.5, labeled s = 0.2, k = 12, nparam = 10 and nparam = 5, each showing the approximation as a function of x.]
Fig. 11.5. Smoothing and function approximation. The measurements are con-
nected by a polygon. The curve corresponding to the original function is dashed.
reasonable to look first for a linear relationship between features and pa-
rameters. Then the subspace is a linear vector space and easy to identify.
In the special situation where only one component exists, all points lie ap-
proximately on a straight line, deviations being due to measurement errors
and non-linearity. To identify the multi-dimensional plane, we have to inves-
tigate the correlation matrix. Its transformation into diagonal form delivers
the principal components – linear combinations of the feature vectors in the
direction of the principal axes. The principal components are the eigenvectors
of the correlation matrix ordered according to decreasing eigenvalues. When
we ignore the principal components with small eigenvalues, the remaining
components form the planar subspace.
Factor analysis or PCA has been developed in psychology, but it is widely
used also in other descriptive fields, and there are numerous applications in
chemistry and biology. Its moderate computing requirements, which come at the expense of the restriction to linear relations, are certainly one of the historical reasons for its popularity. We sketch it below, because it is still in
use, and because it helps to get a quick idea of hidden structures in multi-
dimensional data. When no dominant components are found, it may help to
disprove expected relations between different observations.
A typical application is the search for factors explaining similar properties
between different objects: Different chemical compounds may act similarly,
e.g. decrease the surface tension of water. The compounds may differ in var-
ious features, as molecular size and weight, electrical dipole moment, and
others. We want to know which parameter or combination of parameters is
relevant for the interesting property. Another application is the search for
decisive factors for a similar curing effect of different drugs. The knowledge
of the principal factors helps to find new drugs with the same positive effect.
In physics factor analysis does not play a central role, mainly because its
results are often difficult to interpret and, as we will see below, not unam-
biguous. It is not easy, therefore, to find examples from our discipline. Here
we illustrate the method with an artificially constructed example taken from
astronomy.
Fig. 11.6. Scatter diagram of two attributes of 11 measured objects.
$$\mathbf{R} \to \mathbf{V}^T\mathbf{R}\mathbf{V} = \mathrm{diag}(\lambda_1 \ldots \lambda_P)\;.$$
$$(\mathbf{R} - \lambda_p \mathbf{I})\,\mathbf{v}_p = 0\;, \qquad (11.14)$$
$$\mathbf{R}\,\mathbf{v}_p = \lambda_p \mathbf{v}_p\;.$$
$$\det(\mathbf{R} - \lambda\mathbf{I}) = 0\;.$$
In the simple case described above of only two features, this is a quadratic equation
$$\begin{vmatrix} R_{11} - \lambda & R_{12} \\ R_{21} & R_{22} - \lambda \end{vmatrix} = 0\;,$$
that fixes the two eigenvalues. The eigenvectors are calculated from (11.14)
after substituting the respective eigenvalue. As they are fixed only up to
By construction, these factors represent variates with zero mean and unit
variance. In most cases they are assumed to be normally distributed. Their
relation to the original data xnp is given by a linear (not orthogonal) trans-
formation with a matrix A, the elements of which are called factor loadings.
Its definition is
$$\mathbf{x}_n = \mathbf{A}\,\mathbf{f}_n\;, \quad\text{or}\quad \mathbf{X}^T = \mathbf{A}\,\mathbf{F}^T\;. \qquad (11.15)$$
Its components show how strongly the input data are influenced by certain
factors.
In the classical factor analysis, the idea is to reduce the number of factors
such that the description of the data is still satisfactory within tolerable
deviations ε:
$$\begin{aligned}
x_1 &= a_{11} f_1 + \cdots + a_{1Q} f_Q + \varepsilon_1\\
x_2 &= a_{21} f_1 + \cdots + a_{2Q} f_Q + \varepsilon_2\\
&\;\;\vdots\\
x_P &= a_{P1} f_1 + \cdots + a_{PQ} f_Q + \varepsilon_P
\end{aligned}$$
$$\mathbf{X} = \mathbf{U}\,\mathbf{D}\,\mathbf{V}^T\;,$$
which is the same as (11.15), with factors and loadings being rotated.
There exist program packages which perform the numerical calculation of
principal components and factors.
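With numpy the principal components are obtained in a few lines (a sketch with illustrative names; the rows of X are the measured objects, the columns the attributes):

```python
import numpy as np

def principal_components(X):
    """Eigenvalues/eigenvectors of the correlation matrix, sorted by
    decreasing eigenvalue, and the data expressed in the principal axes."""
    X = np.asarray(X, float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # center and scale the attributes
    R = np.corrcoef(Z, rowvar=False)           # correlation matrix
    eigval, eigvec = np.linalg.eigh(R)         # eigh: R is symmetric
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    return eigval, eigvec, Z @ eigvec          # components of each object
```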
Remarks:
1. The transformation of the correlation matrix to diagonal form makes
sense, as we obtain in this way uncorrelated inputs. The new variables
help to understand better the relations between the various measure-
ments.
2. The silent assumption that the principal components with larger eigenval-
ues are the more important ones is not always convincing, since starting
with uncorrelated measurements, due to the scaling procedure, would
result in eigenvalues which are all identical. An additional difficulty for
interpreting the data comes from the ambiguity (11.16) concerning rota-
tions of factors and loadings.
11.4 Classification
We have come across classification already when we have treated goodness-of-
fit. There the problem was either to accept or to reject a hypothesis without
a clear alternative. Now we consider a situation where we collect events based
upon their features into two or more classes. We assume that we have either
a data set where we know the classification and which is used to train the
classification algorithm or an analytic description of the distributions.
The assignment of an object according to some quality to a class or cat-
egory is described by a so-called categorical variable. For two categories we
⁸ Physicists may find the method familiar from the discussion of the moment of inertia tensor and many similar problems.
can label the two possibilities by discrete numbers; usually the values ±1
or 1 and 0 are chosen. In most cases, we replace the strict classification by
weights which indicate the probability that the event should be assigned to
a certain class. The classification into more than two cases can be performed
sequentially by first combining classes such that we have a two class system
and then splitting them further.
Classification is indispensable in data analysis in many areas. Examples
in particle physics are the identification of particles from shower profiles or
from Cerenkov ring images, beauty, top or Higgs particles from kinematics
and secondaries and the separation of rare interactions from frequent ones. In
astronomy the classification of galaxies and other stellar objects is of interest.
But classification is also a precondition for decisions in many scientific fields
and in everyday life.
We start with an example: A patient suffers from certain symptoms:
stomach-ache, diarrhoea, temperature, head-ache. The doctor has to give
a diagnosis. He will consider further factors, as age, sex, earlier diseases, pos-
sibility of infection, duration of the illness, etc.. The diagnosis is based on the
experience and education of the doctor.
A computer program which is supposed to help the doctor in this matter
should be able to learn from past cases, and to compare new inputs in a
sensible way with the stored data. Of course, as opposed to most problems
in science, it is not possible here to provide a functional, parametric relation.
Hence there is a need for suitable methods which interpolate or extrapolate
in the space of the input variables. If these quantities cannot be ordered,
e.g. sex, color, shape, they have to be classified. In a broad sense, all these problems may be considered as variants of function approximation.
The most important methods for this kind of problem are the discriminant analysis, artificial neural nets, kernel or weighting methods, and decision trees. In recent years, remarkable progress in these fields has been achieved with the development of support vector machines, boosted decision trees, and random forest classifiers.
Before discussing these methods in more detail let us consider a further
example: The interactions of electrons and hadrons in calorimeter detectors
of particle physics differ in many parameters. Calorimeters consist of a large
number of detector elements, for which the signal heights are evaluated and
recorded. The system should learn from a training sample obtained from test
measurements with known particle beams to classify electrons and hadrons
with minimal error rates.
An optimal classification is possible if the likelihood ratio is available
which then is used as a cut variable. The goal of intelligent classification
methods is to approximate the likelihood ratio or an equivalent variable which
is a unique function of the likelihood ratio. The relation itself need not be
known.
[Figure: contamination as a function of efficiency.]
⁹ One minus the contamination is called purity.
Normally we will get a different number of wrong assignments for the two classes: observations originating from the broader distribution will be misassigned more often (see Fig. 11.8) than those of the narrower distribution.
In most cases it will matter whether an input from class 1 or from class
2 is wrongly assigned. An optimal classification is then reached using an
appropriately adjusted likelihood ratio:
If we want to have the same error rates (case 2), we must choose the
constant c such that the integrals over the densities in the selected regions
are equal:
$$\int_{f_1/f_2 > c} f_1(x)\,dx = \int_{f_1/f_2 < c} f_2(x)\,dx\;. \qquad (11.17)$$
This assignment has again a minimal error rate, but now under the constraint
(11.17). We illustrate the two possibilities in Fig. 11.8 for univariate functions.
For normal distributions we can formulate the condition for the classifi-
cation explicitly: For case 2 we choose that class for which the observation x
has the smallest distance to the mean measured in standard deviations. This
condition can then be written as a function of the exponents. With the usual
notations we get
$$(\mathbf{x} - \boldsymbol{\mu}_1)^T \mathbf{V}_1 (\mathbf{x} - \boldsymbol{\mu}_1) - (\mathbf{x} - \boldsymbol{\mu}_2)^T \mathbf{V}_2 (\mathbf{x} - \boldsymbol{\mu}_2) < 0 \;\Rightarrow\; \text{class 1}\;,$$
$$(\mathbf{x} - \boldsymbol{\mu}_1)^T \mathbf{V}_1 (\mathbf{x} - \boldsymbol{\mu}_1) - (\mathbf{x} - \boldsymbol{\mu}_2)^T \mathbf{V}_2 (\mathbf{x} - \boldsymbol{\mu}_2) > 0 \;\Rightarrow\; \text{class 2}\;.$$
This condition can easily be generalized to more than two classes; the
assignment according to the standardized distances will then, however, no
longer lead to equal error rates for all classes.
The classical discriminant analysis sets V1 = V2 . The left-hand side in
the above relations becomes a linear combination of the xp . The quadratic
terms cancel. Equating it to zero defines a hyperplane which separates the
two classes. The sign of this linear combination thus determines the class
Fig. 11.8. Separation of two classes. The dashed line separates the events such that
the error rate is minimal, the dotted line such that the wrongly assigned events are
the same in both classes.
membership. Note that the separating hyperplane cuts the line connecting the distribution centers at a right angle only for spherically symmetric distributions.
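A sketch of the corresponding classification rule (illustrative names; V denotes the inverse covariance matrix, so the quadratic forms are squared Mahalanobis distances):

```python
import numpy as np

def discriminant_classify(x, mu1, cov1, mu2, cov2):
    """Assign x to the class whose mean is closer in standard deviations."""
    V1, V2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    d1 = (x - mu1) @ V1 @ (x - mu1)
    d2 = (x - mu2) @ V2 @ (x - mu2)
    return 1 if d1 < d2 else 2

# The classical linear discriminant analysis corresponds to using a common
# (e.g. pooled) covariance matrix, cov1 == cov2.
```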
If the distributions are only known empirically from representative sam-
ples, we approximate them by continuous distributions, usually by a normal
distribution, and fix their parameters to reproduce the empirical moments.
In situations where the empirical distributions strongly overlap, for instance when a narrow distribution is located at the center of a broad one, the simple discriminant analysis no longer works. The classification methods introduced in the following sections have been developed for this and other more complicated situations, where the only source of information on
the population is a training sample. The various approaches are all based on
the continuity assumption that observations with similar attributes lead to
similar outputs.
Introduction
certainly will play a role also in science in the more distant future. It could
e.g. be imagined that a self-organizing ANN would be able to classify a data
set of events produced at an accelerator without human intervention and thus
would be able to discover new reactions and particles.
The species considered here has a comparably more modest aim: The
network is trained in a first step to ascribe a certain output (response) to the
inputs. In this supervised learning scheme, the response is compared with the
target response, and then the network parameters are modified to improve
the agreement. After a training phase the network is able to classify new
data.
ANNs are used in many fields for a broad variety of problems. Examples
are pattern recognition, e.g. for hand-written letters or figures, or the forecast
of stock prices. They are successful in situations where the relations between
many parameters are too complex for an analytical treatment. In particle
physics they have been used, among other applications, to distinguish electron
from hadron cascades and to identify reactions with heavy quarks.
With ANNs, many independent computing steps have to be performed.
Therefore specialized computers have been developed which are able to eval-
uate the required functions very fast by parallel processing.
Primarily, the net approximates an algebraic function which transforms
the input vector x into the response vector y,
y = f (x|w) .
¹⁰ Here function approximation is used to perform calculations. In the previous section its purpose was to parametrize data.
Fig. 11.9. Backpropagation. At each knot the sigmoid function of the sum of the
weighted inputs is computed.
Network Structure
Our network consists of two layers of knots (neurons), see Fig. 11.9. Each
component xk of the n-component input vector x is transmitted to all knots,
labeled i = 1, . . . , m, of the first layer. Each individual data line k → i is ascribed a weight $W^{(1)}_{ik}$. In each unit the weighted sum $u_i = \sum_k W^{(1)}_{ik} x_k$ of the data components connected to it is calculated. Each knot symbolizes
a non-linear so-called activation function x′i = s(ui ), which is identical for
all units. The first layer produces a new data vector x′ . The second layer,
with m′ knots, acts analogously on the outputs of the first one. We call the
corresponding m × m′ weight matrix W(2) . It produces the output vector y.
The first layer is called hidden layer, since its output is not observed directly.
In principle, additional hidden layers could be implemented but experience
shows that this does not improve the performance of the net.
The net executes the following functions:
$$x'_j = s\Bigl(\sum_k W^{(1)}_{jk} x_k\Bigr)\;, \qquad y_i = s\Bigl(\sum_j W^{(2)}_{ij} x'_j\Bigr)\;. \qquad (11.18)$$
[Fig. 11.10. The sigmoid function.]
Sometimes it is appropriate to shift the input of each unit in the first layer
by a constant amount (bias). This is easily realized by specifying an artificial
additional input component x0 ≡ 1.
The number of weights (the parameters to be fitted) is, when we include
the component x0 , (n + 1) × m + mm′ .
Activation Function
The activation function s(x) has to be non-linear, in order to achieve that
the superposition (11.18) is able to approximate widely arbitrary functions.
It is plausible that it should be more sensitive to variations of the arguments
near zero than for very large absolute values. The input bias x0 helps to shift
input parameters which have a large mean value into the sensitive region. The
activation function is usually standardized to vary between zero and one. The
most popular activation function is the sigmoid function
$$s(u) = \frac{1}{e^{-u} + 1}\;,$$
which is similar to the Fermi function. It is shown in Fig. 11.10.
A common choice for the loss function is
$$E = (\mathbf{y} - \mathbf{y}_t)^2\;, \qquad (11.19)$$
which measures for each training object the deviation of the response from
the expected one.
To reduce the error E we walk backward in the weight space. This means,
we change each weight component by ∆W , proportional to the sensitivity
∂E/∂W of E to changes of W :
$$\Delta W = -\frac{1}{2}\alpha\frac{\partial E}{\partial W} = -\alpha\,(\mathbf{y} - \mathbf{y}_t)\cdot\frac{\partial \mathbf{y}}{\partial W}\;.$$
The proportionality constant α, the learning rate, determines the step width.
We now have to find the derivatives. Let us start with s:
$$\frac{ds}{du} = \frac{e^{-u}}{(e^{-u}+1)^2} = s(1-s)\;. \qquad (11.20)$$
From (11.18) and (11.20) we compute the derivatives with respect to the
weight components of the first and the second layer,
$$\frac{\partial y_i}{\partial W^{(2)}_{ij}} = y_i(1-y_i)\,x'_j\;,$$
and
$$\frac{\partial y_i}{\partial W^{(1)}_{jk}} = y_i(1-y_i)\,W^{(2)}_{ij}\,x'_j(1-x'_j)\,x_k\;.$$
It is seen that the derivatives depend on the same quantities which have
already been calculated for the determination of y (the forward run through
the net). Now we run backwards, change first the matrix W(2) and then with
already computed quantities also W(1) . This is the reason why this process
is called back propagation. The weights are changed in the following way:
$$W^{(1)}_{jk} \to W^{(1)}_{jk} - \alpha\sum_i (y_i - y_{t,i})\,y_i(1-y_i)\,W^{(2)}_{ij}\,x'_j(1-x'_j)\,x_k\;,$$
$$W^{(2)}_{ij} \to W^{(2)}_{ij} - \alpha\,(y_i - y_{t,i})\,y_i(1-y_i)\,x'_j\;.$$
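The forward run and the two update rules fit into a few lines of numpy. The sketch below (illustrative names; one training input per call, no bias term) implements the relations quoted above:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (np.exp(-u) + 1.0)

def backprop_step(x, y_t, W1, W2, alpha):
    """One training step: forward run, then update W2 and W1 in place."""
    xp = sigmoid(W1 @ x)                          # hidden layer output x'
    y = sigmoid(W2 @ xp)                          # network response
    delta2 = (y - y_t) * y * (1.0 - y)            # uses ds/du = s(1-s)
    delta1 = (W2.T @ delta2) * xp * (1.0 - xp)    # computed with the old W2
    W2 -= alpha * np.outer(delta2, xp)            # update of the second layer
    W1 -= alpha * np.outer(delta1, x)             # update of the first layer
    return y

# W1 has shape (m, n) for n inputs and m hidden units, W2 has shape (m', m).
```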
The gradient descent minimum search has not necessarily reached the minimum after processing the training sample a single time, especially when the available sample is small. Then the sample should be used several times (e.g. 10 or 100 times). On the other hand it may happen for too small a training
sample that the net performs correctly for the training data, but produces wrong results for new data. The network has, so to say, learned the training data by heart. Similar to other minimizing concepts, the net interpolates and extrapolates the training data. When the number of fitted parameters (here the weights) becomes of the same order as the number of constraints from the training data, the net will occasionally, after sufficient training time, describe the training data exactly but fail for new input data. This effect is called overfitting and is common to all fitting schemes when too many parameters are adjusted.
It is therefore indispensable to validate the network function after the
optimization, with data not used in the training phase or to perform a cross
validation. If in the training phase simulated data are used, it is easy to
generate new data for testing. If only experimental data are available with
no possibility to enlarge the sample size, usually a certain fraction of the data
is reserved for testing. If the validation result is not satisfactory, one should
try to solve the problem by reducing the number of network parameters or
the number of repetitions of the training runs with the same data set.
The neural network generates from the input data the response through
the fairly complicated function (11.18). It is impossible by an internal analysis
of this function to gain some understanding of the relation between input
and resulting response. Nevertheless, it is not necessary to regard the ANN
as a “black box”. We have the possibility to display graphically correlations
between input quantities and the result, and all functional relations. In this
way we gain some insight into possible connections. If, for instance, a physicist
would have the idea to train a net with an experimental data sample to
predict for a certain gas the volume from the pressure and the temperature,
he would be able to reproduce, with a certain accuracy, the results of the
van der Waals equation. He could display the relations between the three
quantities graphically. Of course the analytic form of the equation and its
interpretation cannot be delivered by the network.
Often a study of the optimized weights makes it possible to simplify the
net. Very small weights can be set to zero, i.e. the corresponding connections
between knots are cut. We can check whether switching off certain neurons
has a sizable influence on the response. If this is not the case, these neurons
can be eliminated. Of course, the modified network has to be trained again.
– The number of units in each layer should more or less match the number
of input components. Some experts plead for a higher number. The user
should try to find the optimal number.
– The sigmoid function has values only between zero and unity. Therefore
the output or the target value has to be appropriately scaled by the user.
– The raw input components are usually correlated. The net is more efficient
if the user orthogonalizes them. Then often some of the new components
have negligible effect on the output and can be discarded.
– The weights have to be initialized at the beginning of the training phase.
This can be done by a random number generator or they can be set to
fixed values.
– The loss function E (11.19) has to be adjusted to the problem to be solved.
– The learning rate α should be chosen relatively high at the beginning of
a training phase, e.g. α = 10. In the course of fitting it should be reduced
to avoid oscillations.
– The convergence of the minimizing process is slow if the gradient is small. If this is the case, and the fit is still bad, it is recommended to increase the learning constant for a certain number of iterations.
– In order to check whether a minimum is only local, one should train the
net with different start values of the weights.
– Other possibilities for the improvement of the convergence and the elim-
ination of local minima can be found in the substantial literature. An
ANN program package that proceeds automatically along many of the
proposed steps is described in [104].
Charged, relativistic particles can emit photons by the Čerenkov effect. The
photons hit a detector plane at points located on a circle. Of interest are
radius and center of this circle, since they provide information on direction
and velocity of the emitting particle. The number of photons and the coor-
dinates where they hit the detector fluctuate statistically and are disturbed
by spurious noise signals. It has turned out that ANNs can reconstruct the
parameters of interest from the available coordinates with good efficiency and
accuracy.
We study this problem by a Monte Carlo simulation. In a simplified model,
we assume that exactly 5 photons are emitted by a particle and that the
coordinate pairs are located on a circle and registered. The center, the radii,
and the hit coordinates are generated stochastically. The input vector of the
net thus consists of 10 components, the 5 coordinate pairs. The output is a
single value, the radius R. The loss function is (R − Rtrue )2 , where the true
value Rtrue is known from the simulation.
The relative accuracy of the reconstruction as a function of the iteration
step is shown in Fig. 11.11. Different sequences of the learning rate have been
tried. Typically, the process is running by steps, where after a flat phase
[Fig. 11.11. Cerenkov circles: relative error of the reconstruction as a function of the number of iterations for learning rates α = 5, 10, 20, 40.]
Hardware Realization
inside this region to decide about the class membership of the input. The
region to be considered here can be chosen in different ways; it can be a fixed
volume around x, or a variable volume defined by requiring that it contains
a fixed number of observations, or an infinite volume, introducing weights for
the training objects which decrease with their distance from x.
In any case we need a metric to define the distance. The choice of a metric
in multi-dimensional applications is often a rather intricate problem, espe-
cially if some of the input components are physically of very different nature.
A way out seems to be to normalize the different quantities to equal variance
and to eliminate global correlations by a linear variable transformation. This
corresponds to the transformation to principal components discussed above
(see Sect. 11.3) with subsequent scaling of the principal components. An al-
ternative but equivalent possibility is to use a direction dependent weighting.
The same result is achieved when we apply the Mahalanobis metric, which
we have introduced in Sect. 10.4.8.
For a large training sample the calculation of all distances is expensive
in computing time. A drastic reduction of the number of distances to be
calculated is in many cases possible by the so-called support vector machines
which we will discuss below. Those are not machines, but programs which
reduce the training sample to a few, but decisive inputs, without impairing
the results.
K-Nearest Neighbors
We choose a number K which of course will depend on the size of the training
sample and the overlap of the classes. For an input x we determine the K
nearest neighbors and the numbers k1 , k2 = K − k1 , of observations that
belong to class I and II, respectively. For a ratio k1 /k2 greater than α, we
assign the new observation to class I, in the opposite case to class II:
k1 /k2 > α =⇒ class I ,
k1 /k2 < α =⇒ class II .
The choice of α depends on the loss function. When the loss function treats
all classes alike, then α will be unity and we get a simple majority vote. To
find the optimal value of K we minimize the average of the loss function
computed for all observations of the training sample.
Instead of treating all training vector inputs x′ within a given region in the
same way, one should attribute a larger weight to those located nearer to the
input x. A sensible choice is again a Gaussian kernel,
$$K(\mathbf{x}, \mathbf{x}') \sim \exp\Bigl(-\frac{(\mathbf{x} - \mathbf{x}')^2}{2s^2}\Bigr)\;.$$
where $x_{\beta i}$ are the locations of the training vectors of the class β.
If there are only two classes, writing the training sample as {x₁, y₁, . . . , x_N, y_N} with the response vector y_i = ±1, the classification of a new input x is done according to the value ±1 of the classifier ŷ(x), given by
$$\hat{y}(\mathbf{x}) = \operatorname{sign}\Bigl(\sum_{y_i=+1} K(\mathbf{x}, \mathbf{x}_i) - \sum_{y_i=-1} K(\mathbf{x}, \mathbf{x}_i)\Bigr) = \operatorname{sign}\Bigl(\sum_i y_i K(\mathbf{x}, \mathbf{x}_i)\Bigr)\;. \qquad (11.22)$$
For a direction dependent density of the training sample, we can use a direction dependent kernel, possibly in the Mahalanobis form mentioned above:
$$K(\mathbf{x}, \mathbf{x}') \sim \exp\Bigl(-\frac{1}{2}(\mathbf{x} - \mathbf{x}')^T\mathbf{V}(\mathbf{x} - \mathbf{x}')\Bigr)\;,$$
with the weight matrix V. When we first normalize the sample, this compli-
cation is not necessary. The parameter s of the matrix V, which determines
the width of the kernel function, again is optimized by minimizing the loss
for the training sample.
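A sketch of the classifier (11.22) with a spherical Gaussian kernel (illustrative names; the training inputs are the rows of x_train and y_train contains the labels ±1):

```python
import numpy as np

def kernel_classify(x, x_train, y_train, s=1.0):
    """Sign of the kernel-weighted sum of the training labels, cf. (11.22)."""
    x_train = np.asarray(x_train, float)
    if x_train.ndim == 1:
        x_train = x_train[:, None]                # 1-d inputs as column vectors
    x = np.atleast_1d(np.asarray(x, float))
    d2 = np.sum((x_train - x)**2, axis=1)         # squared distances |x - x_i|^2
    K = np.exp(-d2 / (2.0 * s**2))                # Gaussian kernel
    return int(np.sign(np.sum(np.asarray(y_train) * K)))
```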
Fig. 11.12. Separation of two classes. Top: learning sample, bottom: wrongly assigned events of a test sample.
In higher dimensional spaces with overlapping classes and for more than
two classes the problem to determine support vectors is of course more compli-
cated. But also in these circumstances the number of relevant training inputs
can be reduced drastically. The success of SVMs is based on the so-called
kernel trick, by which non-linear problems in the input space are treated as
linear problems in some higher-dimensional space by well known optimiza-
tion algorithms. For the corresponding algorithms and proofs we refer to the
literature, e.g. [16, 106, 107]. A short introduction is given in Appendix 13.16.
The top panel of Fig. 11.12 shows two overlapping training samples of 500 inputs each. The loss function is the number of wrong assignments, independent of the respective class. Since the distributions are quite similar
Simple Trees
We consider the simple case, the two class classification, i.e. the assignment
of inputs to one of two classes I and II, and N observations with P features
x1 , x2 , . . . , xP , which we consider, as before, as the components of an input
vector.
In the first step we consider the first component $x_{11}, x_{21}, \ldots, x_{N1}$ for all N input vectors of the training sample. We search for a value $x_{c1}$ which
optimally divides the two classes and obtain a division of the training sample
into two parts A and B. Each of these parts which belong to two different
subspaces, will now be further treated separately. Next we take the subspace
A, look at the feature x2 , and divide it, in the same way as before the full
space, again into two parts. Analogously we treat the subspace B. Now we
can switch to the next feature or return to feature 1 and perform further
splittings. The sequence of divisions leads to smaller and smaller subspaces,
each of them assigned to a certain class. This subdivision process can be
regarded as the development of a decision tree for input vectors for which the
class membership is to be determined. The growing of the tree is stopped by
a pruning rule. The final partitions are called leaves.
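The elementary operation of the tree growing is the search for the best cut value in one feature. A sketch (illustrative names; the labels are ±1 and the quality of a split is measured by the number of misclassified training inputs):

```python
import numpy as np

def best_split(x, labels):
    """Return the cut value on one feature that minimizes the number of
    wrongly assigned training inputs, together with that number."""
    x, labels = np.asarray(x, float), np.asarray(labels)
    best_cut, best_err = None, np.inf
    for c in np.unique(x):
        left, right = labels[x <= c], labels[x > c]
        if len(left) == 0 or len(right) == 0:
            continue
        # each part votes for its majority class; count the wrong assignments
        err = min((left == 1).sum(), (left == -1).sum()) \
            + min((right == 1).sum(), (right == -1).sum())
        if err < best_err:
            best_cut, best_err = c, err
    return best_cut, best_err
```

Applying this search recursively to the resulting subspaces, and to the other features, grows the tree until the pruning rule stops the splitting.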
Fig. 11.13. Decision tree (bottom) corresponding to the classification shown below.
usually as good as those of ANNs. Their algorithm is very well suited for
parallel processing. There are first applications in particle physics [109].
Before the first run, all training inputs have the weight 1. In the following
run each input gets a weight wi , determined by a certain boosting algorithm
(see below) which depends on the particular method. The definition of the
node impurity P for calculating the loss function, see (11.23), (11.24), is
changed accordingly to
$$P = \frac{\sum_{I} w_i}{\sum_{I} w_i + \sum_{II} w_i}\;,$$
where the sums $\sum_I$, $\sum_{II}$ run over all events in class I or II, respectively.
Again the weights will be boosted and the next run started. Typically M ≈
1000 trees are generated in this way.
If we indicate the decision of a tree m for the input xi by Tm (xi ) = 1 (for
class I) and = −1 (for class II), the final result will be given by the sign of
the weighted sum over the results from all trees
$$T_M(\mathbf{x}_i) = \operatorname{sign}\Bigl(\sum_{m=1}^{M}\alpha_m T_m(\mathbf{x}_i)\Bigr)\;.$$
Bagging
The concept of bagging was first introduced by Breiman [110]. He has shown
that the performance of unstable classifiers can be improved considerably by
training many classifiers with bootstrap replicates and then using a majority
vote of those: From a training sample containing N input vectors, N vectors
are drawn at random with replacement. Some vectors will be contained sev-
eral times. This bootstrap¹² sample is used to train a classifier. Many classifiers, 100 or 1000, are produced in this way. New inputs are run through
all trees and each tree “votes” for a certain classification. The classification
receiving the majority of votes is chosen. In a study of real data [110] a re-
duction of error rates by bagging between 20% and 47% was found. There
the bagging concept had been applied to simple decision trees, however, the
bagging concept is quite general and can be adopted also to other classifiers.
Random Forest
Another new development [111] which includes the bootstrap idea, is the
extension of the decision tree concept to the random forest classifier.
Many trees are generated from bootstrap samples of the training sam-
ple, but now part of the input vector components are suppressed. A tree is
constructed in the following way: First m out of the M components or at-
tributes of the input vectors are selected at random. The tree is grown in a
m-dimensional subspace of the full input vector space. It is not obvious how
m is to be chosen, but the author proposes m ≪ M and says that the results
show little dependence on m. With large m the individual trees are powerful
but strongly correlated. The value of m is the same for all trees.
¹² We will discuss bootstrap methods in the following chapter.
We have discussed various methods for classification. Each of them has its advantages and its drawbacks. Which one is the most suitable depends on the specific problem.
The discriminant analysis offers itself for one- or two-dimensional continuous distributions (preferably normal or other unimodal distributions). It is useful for event selection in simple situations.
Kernel methods are relatively easy to apply. They work well if the division line between classes is sufficiently smooth and transitions between different classes are continuous. Categorical variables cannot be treated. The variant with support vectors reduces computing time and the memory space for the storage of the training sample. In standard cases with not too extensive statistics one should avoid this additional complication. Kernel methods can perform event selection in more complicated environments than is possible with the primitive discriminant analysis. The better performance comes, however, at the price of a reduced possibility of interpreting the results.
Artificial neural networks are, due to the enormous number of free pa-
rameters, able to solve any problem in an optimal way. They suffer from the
disadvantage that the user usually has to intervene to guide the minimizing
process to a correct minimum. The user has to check and improve the re-
sult by changing the network structure, the learning constant and the start
values of the weights. New program packages are able to partially take over
these tasks. ANNs are able to separate classes in very involved situations and
extract very rare events from large samples.
Decision trees are a very attractive alternative to ANN. One should use
boosted decision trees, random forest or apply bagging though, since those
discriminate much better than simple trees. The advantage of simple trees is
that they are very transparent and that they can be displayed graphically.
Like ANN, decision trees can, with some modifications, also be applied to
categorical variables.
At present, there is a lack of theoretical framework and of experimental information on some of the new developments. We would like to know to what
extent the different classifiers are equivalent and which classifier should be se-
lected in a given situation. There will certainly be answers to these questions
in the near future.
12 Auxiliary Methods
The simplest and most common way to measure the quality of the PDE is to evaluate the integrated square error (ISE) L_2,
L_2 = \int_{-\infty}^{\infty} \left[ \hat{f}(x) - f(x) \right]^2 dx ,
and its expectation value E(L_2), the mean integrated square error (MISE)¹. The mean quadratic difference E\left( [\hat{f}(x) - f(x)]^2 \right) has two components, according to the usual decomposition:
E\left( [\hat{f}(x) - f(x)]^2 \right) = \mathrm{var}\,\hat{f}(x) + \mathrm{bias}^2 \hat{f}(x) .
The first term, the variance, caused by statistical fluctuations, decreases with
increasing smoothing and the second term, the bias squared, decreases with
decreasing smoothing. The challenge is to find the optimal balance between
these two contributions.
We will give a short introduction to PDE mainly restricted to one-
dimensional distributions. The generalization of the simpler methods to
multi-dimensional distributions is straightforward, but for the more sophisticated ones it is more involved. A rather complete and comprehensive
overview can be found in the books by J.S. Simonoff [112], A.W. Bowman
and A. Azzalini [100], D. W. Scott [113] and W. Härdle et al. [115]. A sum-
mary is presented in an article by D. W. Scott and S. R. Sain [84].
Histogram Approximation
¹ The estimate \hat{f}(x) is a function of the set \{x_1, \ldots, x_N\} of random variables and thus also a random variable.
choice for the bin width h is derived from the requirement that the mean integrated square error should be as small as possible. The mean integrated square error, MISE, for a histogram is
\mathrm{MISE} = \frac{1}{Nh} + \frac{1}{12}\, h^2 \int f'(x)^2\, dx + O\!\left( \frac{h^4}{N} \right) . (12.1)
The integral \int f'(x)^2\, dx = R(f') is called roughness. For a normal density with variance \sigma^2 it is R = (4\sqrt{\pi}\,\sigma^3)^{-1}. Neglecting the small terms (h \to 0), we can derive [84] the optimal bin width h^* and the corresponding asymptotic mean integrated square error AMISE:
h^* \approx \left[ \frac{6}{N \int f'(x)^2\, dx} \right]^{1/3} \approx 3.5\,\sigma N^{-1/3} , (12.2)
\mathrm{AMISE} \approx \left[ \frac{9 \int f'(x)^2\, dx}{16\, N^2} \right]^{1/3} \approx 0.43\, N^{-2/3}/\sigma .
The second part of relation (12.2) holds for a Gaussian p.d.f. with variance σ² and is a reasonable approximation for a distribution with typical width σ. Even though the derivative f′ and the bandwidth² σ are not precisely known, they can be estimated from the data. As expected, the optimal bin width is proportional to the bandwidth, whereas its N^{-1/3} dependence on the sample size N is less obvious.
In d dimensions similar relations hold. Of course the N -dependence has
to be modified. For d-dimensional cubical bins the optimal bin width scales
with N −1/(d+2) and the mean square error scales with N −2/(d+2) .
² Contrary to what is usually understood under bandwidth, in PDE this term is used to describe the typical width of structures. For a Gaussian it equals the standard deviation.
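As an illustration of the normal-reference rule (12.2), the following short sketch estimates σ from a hypothetical sample and derives the bin width h* ≈ 3.5 σ N^(-1/3); the sample itself is an assumption made only for this example.

```python
# Sketch of the optimal histogram bin width h* ≈ 3.5 σ N^(-1/3) from relation (12.2).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=2.0, size=1000)      # hypothetical sample of N = 1000

N = len(x)
sigma = x.std(ddof=1)                              # estimate of the bandwidth σ
h_opt = 3.5 * sigma * N ** (-1.0 / 3.0)
nbins = int(np.ceil((x.max() - x.min()) / h_opt))

hist, edges = np.histogram(x, bins=nbins, density=True)   # histogram PDE f̂(x)
print(f"sigma = {sigma:.2f}, optimal bin width = {h_opt:.2f}, {nbins} bins")
```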
Fig. 12.1. Experimental signal with some background (left) and background ref-
erence sample with two times longer exposure time (right). The fitted signal and
background functions are indicated.
of background events in the reference sample ρ and the fraction φ of true sig-
nal events in the signal sample. These 7 parameters are to be determined in
a likelihood fit. The log-likelihood function ln L = ln L1 + ln L2 + ln L3 + ln L4
comprises 4 terms, with: 1. L1 , the likelihood of the ns events in the signal
sample (superposition of signal and background distribution):
\ln L_1(\mu, \sigma, \phi, \beta_1, \beta_2, \beta_3) = \sum_{i=1}^{n_s} \ln \left[ \phi\, \mathrm{N}(x_i|\mu, \sigma) + (1 - \phi)\, h(x_i|\beta_1, \beta_2, \beta_3) \right] .
In d dimensions the optimal bin width for polygon bins scales with
N −1/(d+4) and the mean square error scales with N −4/(d+4) .
The simple PDE methods sketched above suffer from several problems, some
of which are unavoidable:
1. The boundary bias: When the variable x is bounded, say x < a, then f̂(x) is biased downwards, unless f(a) = 0, if the averaging process includes the region x > a where we have no data. When the averaging is
restricted to the region x < a, the bias is positive (negative) for a distribution
decreasing (increasing) towards the boundary. In both cases the size of the
bias can be estimated and corrected for, using so-called boundary kernels.
2. Many smoothing methods do not guarantee normalization of the es-
timated probability density. While this effect can be corrected for easily by
renormalizing fˆ, it indicates some problem of the method.
3. Fixed bandwidth methods over-smooth in regions where the density
is high and tend to produce fake bumps in regions where the density is
low. Variable bandwidth kernels are able to avoid this effect partially. Their
bandwidth is chosen inversely proportional to the square root of the density,
h(xi ) = h0 f (xi )−1/2 . Since the true density is not known, f must be replaced
by a first estimate obtained for instance with a fixed bandwidth kernel.
4. Kernel smoothing corresponds to a convolution of the discrete data distribution with a smearing function and thus unavoidably tends to flatten peaks and to fill up valleys. This is especially pronounced where the distribution shows strong structure, that is, where the second derivative f′′ is large.
Convolution and thus also PDE implies a loss of some information contained
in the original data. This defect may be acceptable if we gain sufficiently
due to knowledge about f that we put into the smoothing program. In the
simplest case this is only the fact that the distribution is continuous and dif-
ferentiable but in some situations also the asymptotic behavior of f may be
given, or we may know that it is unimodal. Then we will try to implement
this information into the smoothing method.
Some of the remedies for the difficulties mentioned above use estimates
of f and its derivatives. Thus iterative procedures seem to be the solution.
However, the iteration process usually does not converge and thus has to be
supervised and stopped before artifacts appear.
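The variable-bandwidth recipe of item 3, h(x_i) = h_0 f(x_i)^(-1/2) with f replaced by a first fixed-bandwidth estimate, can be sketched as follows; the pilot bandwidth choice (normal reference rule) and the sample are illustrative assumptions.

```python
# Sketch of a Gaussian kernel PDE with a variable bandwidth h_i = h0 / sqrt(f_pilot(x_i)).
import numpy as np

def kde_fixed(x_grid, sample, h):
    """Fixed-bandwidth Gaussian kernel estimate f̂(x) = (1/N) Σ N(x | x_i, h)."""
    z = (x_grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

def kde_adaptive(x_grid, sample, h0):
    """Variable-bandwidth estimate: pilot f̂, then h_i inversely proportional to sqrt(f̂(x_i))."""
    pilot = kde_fixed(sample, sample, h0)          # first estimate at the sample points
    h_i = h0 / np.sqrt(pilot / pilot.mean())       # normalized so that the average h_i ≈ h0
    z = (x_grid[:, None] - sample[None, :]) / h_i[None, :]
    k = np.exp(-0.5 * z**2) / (h_i[None, :] * np.sqrt(2 * np.pi))
    return k.mean(axis=1)

rng = np.random.default_rng(3)
sample = rng.exponential(1.0, size=500)
h0 = 1.06 * sample.std(ddof=1) * len(sample) ** (-0.2)   # normal-reference pilot bandwidth
grid = np.linspace(0.0, 6.0, 200)
f_hat = kde_adaptive(grid, sample, h0)
```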
In Fig. 12.2 three simple smoothing methods are compared. A sample of 1000 events has been generated from the function shown as a dashed line.
Fig. 12.2. Estimated probability density. Left hand: nearest neighbor, center: Gaussian kernel, right hand: polygon.
with the free parameters: weights α_i, mean values µ_i and covariance matrices Σ_i.
If information about the shape of the distribution is available, more spe-
cific parametrizations which describe the asymptotic behavior can be ap-
plied. Distributions which resemble a Gaussian should be approximated by
the Gram-Charlier series (see last paragraph of Sect. 11.2.2). If the data sam-
ple is sufficiently large and the distribution is unimodal with known asymp-
totic behavior the construction of the p.d.f. from the moments as described
in [114] is quite efficient.
Physicists use PDE mainly for the visualization of the data. Here, in
one dimension, histogramming is the standard method. When the estimated
distribution is used to simulate an experiment, frequency polygons are to be
preferred. Whenever a useful parametrization is at hand, then PDE should be
\hat\mu^* = \frac{1}{B} \sum \mu_b^* ,
\delta\mu^{*2} = \frac{1}{B} \sum (\mu_b^* - \hat\mu^*)^2 .
Fig. 12.3 shows the sample distribution corresponding to the 10 observations
and the bootstrap distribution of the mean values. The bootstrap estimates
µ̂∗ = 0.74, δµ∗ = 0.19, agree reasonably well with the directly obtained val-
ues. The larger value of δµ compared to δµ∗ is due to the bias correction in
its evaluation. The bootstrap values correspond to the maximum likelihood
estimates. The distribution of Fig. 12.3 contains further information. We re-
alize that the distribution is asymmetric, the reason being that the sample
was drawn from an exponential. We could, for example, also derive the skew-
ness or the frequency that the mean value exceeds 1.0 from the bootstrap
distribution.
While we know the exact solution for the estimation of the mean value and
mean squared error of an arbitrary function u(x), it is difficult to compute
the same quantities for more complicated functions like the median or for
correlations.
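A minimal bootstrap sketch for the quantities discussed above is given below. The sample values and the number of replicates B are illustrative; the median is included as an example of a quantity without a simple exact treatment.

```python
# Bootstrap sketch: B replicates give the distributions of the sample mean and median,
# from which µ̂*, δµ* and e.g. the frequency of means above 1.0 can be read off.
import numpy as np

rng = np.random.default_rng(4)
sample = rng.exponential(0.75, size=10)      # hypothetical 10 observations

B = 1000
means = np.empty(B)
medians = np.empty(B)
for b in range(B):
    resample = rng.choice(sample, size=len(sample), replace=True)
    means[b] = resample.mean()
    medians[b] = np.median(resample)

print("mu* =", means.mean(), " dmu* =", means.std())
print("P(mean > 1.0) =", (means > 1.0).mean())
print("median and its bootstrap error:", medians.mean(), medians.std())
```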
Fig. 12.3. Sample distribution (left) and distribution of bootstrap sample mean
values (right).
Fig. 12.4. Distribution of points in a unit square. The right hand graph shows the
bootstrap distribution of the mean distance of the points.
demonstrates that the bootstrap method is able to solve problems which are
hardly accessible with other methods.
Classifiers like decision trees and ANNs usually subdivide the learning sample into two parts; one part is used to train the classifier and a smaller part is reserved to test it. The precision can be enhanced considerably by using bootstrap samples for both training and testing.
Instead of a decision tree we can use any other classifier, for instance an ANN. The corresponding tests are potentially very powerful but also quite involved. Even with today's computing facilities, training some 1000 decision trees or artificial neural nets is quite an effort.
Jackknife is mainly used for bias removal⁴. Estimates derived from a sample
of N observations x1 , ..., xN are frequently biased. The bias of a consistent
estimator vanishes in the limit N → ∞.
Let us assume that the bias decreases proportionally to 1/N. This assump-
tion holds in the majority of cases. To infer the size of the bias b = t̂N − t of
the estimate t̂N of the true parameter t, we estimate t̂N −1 for a sample of
N − 1 events and use the 1/N relation to compute the bias. For the expected
values E(t̂N ) and E(t̂N −1 ) we get
\frac{E(\hat t_N) - t}{E(\hat t_{N-1}) - t} = \frac{N-1}{N}
and obtain
t = N E(\hat t_N) - (N-1) E(\hat t_{N-1}) ,
E(b) = E(\hat t_N) - t = (N-1)\,\left[ E(\hat t_{N-1}) - E(\hat t_N) \right] ,
\hat b = (N-1)\,(\hat t_{N-1} - \hat t_N) .
The remaining bias after the jackknife correction is zero or of order 1/N² or higher. The jackknife was invented in the 1950s by Maurice Quenouille and John Tukey. The name jackknife was chosen to indicate the simplicity of the statistical tool.
⁴ Remember, bias corrections should be applied to MLEs only in exceptional situations.
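The following short sketch applies the jackknife bias estimate b̂ = (N−1)(t̂_{N−1} − t̂_N) to the biased variance estimator discussed below; the sample is illustrative.

```python
# Jackknife sketch: t̂_{N-1} is the average of the leave-one-out estimates,
# and the bias estimate b̂ = (N-1)(t̂_{N-1} - t̂_N) is subtracted from t̂_N.
import numpy as np

def jackknife_bias_corrected(sample, estimator):
    N = len(sample)
    t_N = estimator(sample)
    # leave-one-out estimates t̂_{N-1,i}, averaged
    t_loo = np.array([estimator(np.delete(sample, i)) for i in range(N)])
    bias = (N - 1) * (t_loo.mean() - t_N)
    return t_N - bias

rng = np.random.default_rng(5)
x = rng.normal(0.0, 2.0, size=20)
biased_var = lambda s: np.mean((s - s.mean()) ** 2)      # expected value (N-1)/N σ²

print("biased estimate:   ", biased_var(x))
print("jackknife corrected:", jackknife_bias_corrected(x, biased_var))
print("classical N/(N-1) correction:", biased_var(x) * len(x) / (len(x) - 1))
```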
then \delta_N^2 is biased; its expected value is smaller than \sigma^2. We remove one observation at a time, compute each time the mean squared error \delta_{N-1,i}^2, and average the results:
\delta_{N-1}^2 = \frac{1}{N} \sum_{i=1}^{N} \delta_{N-1,i}^2 .
With
E(\delta_N^2) = \frac{N-1}{N}\, \sigma^2 ,
we confirm the bias-corrected result E(\delta_c^2) = \sigma^2 .
13 Appendix
For a probability density f(x) with expected value µ, finite variance σ², and arbitrary given positive δ, the following inequality, known as Chebyshev inequality, is valid:
P\{|x - \mu| \geq \delta\} \leq \frac{\sigma^2}{\delta^2} . (13.1)
This very general theorem says that a given, fixed deviation from the
expected value becomes less probable when the variance becomes smaller. It
is also valid for discrete distributions.
To prove the inequality, we use the definition
P_I \equiv P\{x \in I\} = \int_I f(x)\, dx .
For the indicator function I_I(x) of the region I this gives \langle I_I \rangle = P_I ,
\mathrm{var}(I_I) = \int I_I^2(x) f(x)\, dx - \langle I_I \rangle^2 = P_I (1 - P_I) \leq 1/4 ,
The central limit theorem states that the distribution of the sample mean \bar{x},
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i ,
\phi(t) = 1 - \frac{t^2}{2N} + c\, \frac{t^3}{N^{3/2}} + \cdots
The characteristic function of the sum z = \sum_{i=1}^{N} y_i is given by the product
\phi_z = \left( 1 - \frac{t^2}{2N} + c\, \frac{t^3}{N^{3/2}} + \cdots \right)^N
¹ This theorem was derived by the Dutch-Swiss mathematician Jakob I. Bernoulli (1654-1705).
which in the limit N → ∞, where only the first two terms survive, approaches
the characteristic function of the standard normal distribution N(0, 1):
\lim_{N \to \infty} \phi_z = \lim_{N \to \infty} \left( 1 - \frac{t^2}{2N} \right)^N = e^{-t^2/2} .
It can be shown that the convergence of characteristic functions implies the convergence of the distributions. The distribution of \bar{x} for large N is then approximately
f(\bar{x}) \approx \frac{\sqrt{N}}{\sqrt{2\pi}\,\sigma} \exp\left( - \frac{N(\bar{x} - \mu)^2}{2\sigma^2} \right) .
Remark: The law of large numbers and the central limit theorem can be
generalized to sums of independent but not identically distributed variates.
The convergence is relatively fast when the variances of all variates are of
similar size.
13.2.1 Consistency
We expect from a useful estimator that it becomes more accurate with increasing size of the sample, i.e. larger deviations from the true value should
become more and more improbable.
A sequence of estimators tN of a parameter θ is called consistent, if their
p.d.f.s for N → ∞ are shrinking towards a central value equal to the true
parameter value θ0 , or, expressing it mathematically, if
\lim_{N \to \infty} P\{|t_N - \theta_0| > \varepsilon\} = 0 (13.4)
is valid for arbitrary ε. A sufficient condition for consistency which can be
easier checked than (13.4), is the combination of the two requirements
\lim_{N \to \infty} \langle t_N \rangle = \theta_0 , \qquad \lim_{N \to \infty} \mathrm{var}(t_N) = 0 ,
where of course the existence of mean value and variance for the estimator
tN has to be assumed.
For instance, as implied by the law of large numbers, the sample moments
t_m = \frac{1}{N} \sum_{i=1}^{N} x_i^m
are consistent estimators for the respective m-th moments µ_m of f(x) if these moments exist.
The bias of an estimate has already been introduced in Sect. 6.8.3: An estimate t_N for θ is unbiased if already for finite N (possibly N > N_0) and for all parameter values considered, the estimator satisfies the condition
\langle t_N \rangle = \theta .
The bias
b = \langle t_N \rangle - \theta
of a consistent estimator vanishes only asymptotically,
\lim_{N \to \infty} b(N) = 0 .
13.2.3 Efficiency
\langle (t - \theta_0)^2 \rangle = \mathrm{var}(t) + b^2 . (13.5)
[\mathrm{cov}(t, y)]^2 \leq \mathrm{var}(t)\, \mathrm{var}(y)
13.3.2 Efficiency
⁴ The boundaries of the domain of x must not depend on θ and the maximum of L should not be reached at the boundary of the range of θ.
\langle y \rangle = \int \frac{\partial \ln L}{\partial \theta}\, L \, dx = 0 , (13.14)
\sigma_y^2 = \mathrm{var}(y) = \left\langle \left( \frac{\partial \ln L}{\partial \theta} \right)^2 \right\rangle = - \left\langle \frac{\partial^2 \ln L}{\partial \theta^2} \right\rangle . (13.15)
The last relation follows after further differentiation of (13.14) and from the relation
\int \frac{\partial^2 \ln L}{\partial \theta^2}\, L \, dx = - \int \frac{\partial \ln L}{\partial \theta} \frac{\partial L}{\partial \theta} \, dx = - \int \frac{\partial \ln L}{\partial \theta} \frac{\partial \ln L}{\partial \theta}\, L \, dx .
From the Taylor expansion of \partial \ln L/\partial\theta|_{\theta=\hat\theta}, which is zero by definition, and with (13.15) we find
0 = \frac{\partial \ln L}{\partial \theta}\Big|_{\theta=\hat\theta} \approx \frac{\partial \ln L}{\partial \theta}\Big|_{\theta=\theta_0} + (\hat\theta - \theta_0)\, \frac{\partial^2 \ln L}{\partial \theta^2}\Big|_{\theta=\theta_0}
\approx y - (\hat\theta - \theta_0)\, \sigma_y^2 , (13.16)
where the consistency of the MLE guarantees the validity of this approximation in the sense of stochastic convergence. Following the central limit theorem, y/σ_y, being a sum of i.i.d. variables, is asymptotically normally distributed with mean zero and variance unity. The same is then true for (θ̂ − θ_0)σ_y, i.e. θ̂ follows asymptotically a normal distribution with mean θ_0 and asymptotically vanishing variance 1/σ_y² ∼ 1/N, as seen from (13.9).
A similar result as derived in the last paragraph for the p.d.f. of the MLE θ̂
can be derived for the likelihood function itself.
Considering the Taylor expansion of y = ∂ ln L/∂θ around the MLE θ̂ and using y(θ̂) = 0, we get
y(\theta) \approx (\theta - \hat\theta)\, y'(\hat\theta) . (13.17)
As discussed in the last paragraph, we have for N → ∞
The criterion of asymptotic efficiency, fulfilled by the MLE for large samples,
is usually extended to small samples, where the normal approximation of
the sampling distribution does not apply, in the following way: A bias-free
estimate t is called a minimum variance (MV) estimate if var(t) ≤ var(t′ )
for any other bias-free estimate t′ . If, moreover, the Cramer–Rao inequality
(13.6) is fulfilled as an equality, one speaks of a minimum variance bound
(MVB) estimate, often also called efficient or most efficient, estimate (not to
be confused with the asymptotic efficiency which we have considered before
in Appendix 13.2). The latter, however, exists only for a certain function
τ (θ) of the parameter θ if it has a one-dimensional sufficient statistic (see
6.5.1). It can be shown [3] that under exactly this condition the MLE for τ
will be this MVB estimate, and therefore bias-free for any N . The MLE for
any non-linear function of τ will in general be biased, but still optimal in the
following sense: if bias-corrected, it becomes an MV estimate, i.e. it will have
the smallest variance among all unbiased estimates.
according to the relation between σ and σ². It is biased and thus not efficient in the sense of the above definition. A bias-corrected estimator for σ is (see for instance [117])
\hat\sigma_{\mathrm{corr}} = \sqrt{\frac{N}{2}}\, \frac{\Gamma(N/2)}{\Gamma((N+1)/2)}\, \hat\sigma .
This estimator can be shown to have the smallest variance of all unbiased estimators, independent of the sample size N. This property is not preserved for other parametrizations, since variance and bias are not invariant properties.
\ln L(\mu_1, \ldots, \mu_M) = \sum_{m=1}^{M} \left( \frac{\sum_{i=1}^{N} z_{mi}\, x_i}{\sum_{i=1}^{N} z_{mi}} - \mu_m \right)
If this is not the case, we can estimate zmi from the observed distribution.
In the EM formalism zmi is called missing or latent variable. We can solve
our problem iteratively with two alternating steps, an expectation and a
maximization step. We start with a first guess \mu_m^{(1)} of the parameters of interest and estimate the missing data. In the expectation step k we compute the probability g_{mi}^{(k)} that x_i belongs to subdistribution m. It is proportional to the value of the distribution f_m(x_i|\mu_m) at x_i:
g_{mi}^{(k)} = \frac{f_m(x_i|\hat\mu_m^{(k)})}{\sum_{j=1}^{M} f_j(x_i|\hat\mu_j^{(k)})} .
The probability g_{mi} is the expected value of the latent variable z_{mi}. The expected log-likelihood is
Q(\mu, \hat\mu^{(k)}) = \sum_{m=1}^{M} \sum_{i=1}^{N} g_{mi}^{(k)} \left( x_i - \mu_m \right) .
If the values of the vector components z are discrete, the integral is replaced by a sum over all J possible values:
Q(\theta, \hat\theta^{(k)}) = \sum_{j=1}^{J} \ln L(\theta|x, z_j)\, g\!\left( z_j|x, \hat\theta^{(k)} \right) .
The procedure is started with a first guess θ^{(1)} of the parameters and iterated. It converges to a maximum of the log-likelihood. To avoid that the iteration is caught by a local maximum, different starting values can be selected. The EM method is especially useful in classification problems in connection with p.d.f.s of the exponential family⁵, where the maximization step is relatively simple.
⁵ To the exponential family belong, among others, the normal, Poisson, exponential, gamma and chi-squared distributions.
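A compact sketch of the two alternating EM steps for a superposition of Gaussians with unknown mean values is given below; equal weights and unit widths are simplifying assumptions made only for this illustration, not the general formalism of the text.

```python
# EM sketch for a mixture of Gaussians with unknown means (unit widths, equal weights).
import numpy as np

def em_means(x, mu, n_iter=50):
    """Alternate expectation (g_mi) and maximization (weighted means) steps."""
    for _ in range(n_iter):
        # expectation step: probability g_mi that x_i belongs to subdistribution m
        f = np.exp(-0.5 * (x[None, :] - mu[:, None]) ** 2)    # f_m(x_i | mu_m)
        g = f / f.sum(axis=0, keepdims=True)
        # maximization step: mu_m <- weighted mean of the observations
        mu = (g * x[None, :]).sum(axis=1) / g.sum(axis=1)
    return mu

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])
print(em_means(x, mu=np.array([0.0, 1.0])))      # first guess mu^(1)
```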
= \sum_{j=1}^{M} \sum_{i=1}^{N} \left[ -A_{ij}\theta_j + E(d_{ij}^{(k)}) \ln (A_{ij}\theta_j) \right] .
The expected value E(d_{ij}^{(k)}), conditioned on d_i and \hat\theta^{(k)}, is given by d_i times the probability that an event of bin i belongs to true bin j:
E(d_{ij}^{(k)}) = d_i\, \frac{A_{ij}\hat\theta_j^{(k)}}{\sum_{j=1}^{M} A_{ij}\hat\theta_j^{(k)}} .
We get
Q(\theta, \hat\theta^{(k)}) = \sum_{j=1}^{M} \sum_{i=1}^{N} \left[ -A_{ij}\theta_j + d_i\, \frac{A_{ij}\hat\theta_j^{(k)}}{\sum_{m=1}^{M} A_{im}\hat\theta_m^{(k)}}\, \ln (A_{ij}\theta_j) \right] .
– Maximization step:
\sum_{i=1}^{N} A_{ij} = \alpha_j is the average acceptance of the events in the true bin j.
This formula defines the background-corrected estimate θ̂. It differs from the
“ideal” estimate θ̂(S) which would be obtained in the absence of background,
i.e. by equating to zero the first sum on the left hand side. Writing θ̂ =
θ̂(S) + ∆θ̂ in the first sum, and Taylor expanding it up to the first order, we
get
\sum_{i=1}^{S} \frac{\partial^2 \ln f(x_i^{(S)}|\theta)}{\partial \theta^2}\bigg|_{\hat\theta^{(S)}} \Delta\hat\theta + \left[ \sum_{i=1}^{B} \frac{\partial \ln f(x_i^{(B)}|\theta)}{\partial \theta} - r \sum_{i=1}^{M} \frac{\partial \ln f(x_i'|\theta)}{\partial \theta} \right]_{\hat\theta} = 0 . (13.18)
The first sum, if taken with a minus sign, is the Fisher information of the signal sample on θ; the sum itself equals −1/var(θ̂^{(S)}) asymptotically. The approximation relies on the assumption that \sum \ln f(x_i^{(S)}|\theta) is parabolic in the region θ̂^{(S)} ± Δθ̂. Then we have
" B M
#
X ∂ ln f (x(B) |θ) X ∂ ln f (x′i |θ)
∆θ̂ ≈ var(θ̂ )(S) i
−r . (13.19)
i=1
∂θ i=1
∂θ
θ̂
We take the expected value with respect to the background distribution and
obtain
\langle \Delta\hat\theta \rangle = \mathrm{var}(\hat\theta^{(S)})\, \langle B - rM \rangle\, \left\langle \frac{\partial \ln f(x|\theta)}{\partial \theta}\Big|_{\hat\theta} \right\rangle .
Since \langle B - rM \rangle = 0, the background correction is asymptotically bias-free.
Squaring (13.19), and writing the summands in short as y_i, y_i', we get
(\Delta\hat\theta)^2 = \left( \mathrm{var}(\hat\theta^{(S)}) \right)^2 \left[ \sum_{i=1}^{B} y_i - r \sum_{i=1}^{M} y_i' \right]^2 ,
[\cdots]^2 = \sum_i^B \sum_j^B y_i y_j + r^2 \sum_i^M \sum_j^M y_i' y_j' - 2r \sum_i^B \sum_j^M y_i y_j'
= \sum_i^B y_i^2 + r^2 \sum_i^M y_i'^2 + \sum_i^B \sum_{j \neq i} y_i y_j + r^2 \sum_i^M \sum_{j \neq i} y_i' y_j' - 2r \sum_i^B \sum_j^M y_i y_j' ,
\langle (\Delta\hat\theta)^2 \rangle = \left( \mathrm{var}(\hat\theta^{(S)}) \right)^2 \left[ \langle B + r^2 M \rangle\, \langle (y - \langle y \rangle)^2 \rangle + \langle (B - rM)^2 \rangle\, \langle y \rangle^2 \right] . (13.20)
Solving the linear equation system for ∆θ̂ and constructing from its com-
ponents the error matrix E, we find in close analogy to the one-dimensional
case
E = C^{(S)}\, Y\, C^{(S)} ,
with C^{(S)} = (V^{(S)})^{-1} being the covariance matrix of the background-free estimates and Y defined as
Fig. 13.1. Confidence belt. The shaded area is the confidence belt, consisting of
the probability intervals [t1 (θ), t2 (θ)] for the estimator t. The observation t = 4
leads to the confidence interval [θmin , θmax ].
Fig. 13.2. Confidence interval. The shaded area is the confidence region for the two-
dimensional measurement (θ̂1, θ̂2 ). The dashed curves indicate probability regions
associated to the locations denoted by capital letters.
life we calculate its expected decay length λ. The prior density for the actual
decay length θ is π(θ) = exp(−θ/λ)/λ. The experimental distance measure-
ment which follows a Gaussian with standard deviation s yields d. According
to (6.2.2), the p.d.f. for the actual distance is given by
f(\theta|d) = \frac{e^{-(d-\theta)^2/(2s^2)}\, e^{-\theta/\lambda}}{\int_0^{\infty} e^{-(d-\theta)^2/(2s^2)}\, e^{-\theta/\lambda}\, d\theta} .
This is an ideal situation. We can determine for instance the mean value and
the standard deviation or the mode of the θ distribution and an asymmetric
error interval with well defined probability content, for instance 68.3%. The
confidence level is of no interest and due to the application of the prior the
estimate of θ is biased, but this is irrelevant.
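Numerically, this posterior and its summaries can be obtained with a few lines; the values of λ, s and d below are illustrative assumptions, not numbers from the text.

```python
# Sketch: posterior f(θ|d) ∝ exp(-(d-θ)²/(2s²)) exp(-θ/λ) for θ > 0, evaluated on a grid.
import numpy as np

lam, s, d = 10.0, 2.0, 5.0                      # prior mean decay length, resolution, measurement
theta = np.linspace(0.0, 60.0, 6001)
post = np.exp(-0.5 * ((d - theta) / s) ** 2) * np.exp(-theta / lam)
post /= np.trapz(post, theta)                   # normalize the posterior

mean = np.trapz(theta * post, theta)
mode = theta[np.argmax(post)]
cdf = np.cumsum(post) * (theta[1] - theta[0])
lo, hi = theta[np.searchsorted(cdf, 0.1585)], theta[np.searchsorted(cdf, 0.8415)]
print(f"mean = {mean:.2f}, mode = {mode:.2f}, 68.3% interval = [{lo:.2f}, {hi:.2f}]")
```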
Fig. 13.3. Two hypotheses compared to an observation. The likelihood ratio sup-
ports hypothesis 1 while the distance in units of st.dev. supports hypothesis 2.
time. The first prediction H1 differs by more than one standard deviation from
the observation, prediction H2 by less than one standard deviation. Is then
H2 the more probable theory? Well, we cannot attribute probabilities to the
theories but the likelihood ratio R which here has the value R = 26, strongly
supports hypothesis H1 . We could, however, also consider both hypotheses
as special cases of a third general theory with the parametrization
f(t) = \frac{25}{\sqrt{2\pi}\,\theta^2} \exp\left( - \frac{625\,(t - \theta)^2}{2\theta^4} \right)
and now try to infer the parameter θ and its error interval. The observation
produces the likelihood function shown in the lower part of Fig. 13.3. The
usual likelihood ratio interval contains the parameter θ1 and excludes θ2
while the frequentist standard confidence interval [7.66, ∞] would lead to the
reverse conclusion which contradicts the likelihood ratio result and also our
intuitive conclusions.
13.7.5 Conclusion
The choice of the statistical method has to be adapted to the concrete appli-
cation. The frequentist reasoning is relevant in rare situations like event selec-
tion, where coverage could be of some importance or when secondary statistics
is performed with estimated parameters. In some situations Bayesian tools
are required to proceed to sensible results. In all other cases the likelihood
function, or as a summary of it, the MLE and a likelihood ratio interval are
the best choice.
the latter rely on the likelihood principle, they base parameter and interval
inference solely on the likelihood function and these parameters cannot and
need not be considered. Nevertheless it is of some interest to investigate how classical statistics treats the MLE, and it is reassuring that asymptotically, for large samples, the frequentist approach is in accordance with the
likelihood ratio method. This manifests itself in the consistency of the MLE.
Also for small samples, the MLE has certain optimal frequentist properties,
but there the methods provide different solutions.
Efficiency is defined through the variance of the estimator for given values
of the true parameter (independent of the measured value). In inference prob-
lems, however, the true value is unknown and of interest is the deviation of
the true parameter from a given estimate. Efficiency is not invariant against
parameter transformation. For example, the MLE of the lifetime θ̂ with an
exponential decay distribution is an efficient estimator while the MLE of the
decay rate γ̂ = 1/θ̂ is not.
Similar problems exist for the bias which also depends on the parameter
metric. Frequentists usually correct estimates for a bias. This is justified
again in commercial applications, where many replicates are considered. If in
a long-term business relation the price for a batch of some goods is agreed
to be proportional to some product quality (weight, mean lifetime...) which
is estimated for each delivery from a small sample, this estimate should be
unbiased, as otherwise gains and losses due to statistical fluctuations would
not cancel in the long run. It does not matter here that the quantity bias
is not invariant against parameter transformations. In the business example
the mentioned agreement would be on weight or on size and not on both.
In the usual physics application where experiments determine constants of
nature, the situation is different, there is no justification for bias corrections,
and invariance is an important issue.
Somewhat inconsistent in the frequentist approach is that confidence intervals are invariant against parameter transformations while efficiency and bias are not, and that the aim for efficiency supports the MLE for point estimation, which goes along with likelihood ratio intervals rather than with coverage intervals.
W^2 = \frac{1}{12N} + \sum_{i=1}^{N} \left( z_i - \frac{2i-1}{2N} \right)^2 , (13.23)
U^2 = \frac{\sum_{i=2}^{N} (z_i - z_{i-1})^2}{\sum_{i=1}^{N} z_i^2} ,
A^2 = -N - \frac{1}{N} \sum_{i=1}^{N} (2i - 1)\left[ \ln z_i + \ln(1 - z_{N+1-i}) \right] . (13.24)
Here Nexp and NMC are the experimental respectively the simulated sample
sizes.
Calculation of p-values
After normalizing the test variables with appropriate powers of N they follow p.d.f.s which are independent of N. The test statistics D^*, W^{2*}, A^{2*} modified in this way are defined by the following empirical relations:
D^* = D_{\max} \left( \sqrt{N} + 0.12 + \frac{0.11}{\sqrt{N}} \right) , (13.25)
W^{2*} = \left( W^2 - \frac{0.4}{N} + \frac{0.6}{N^2} \right)\left( 1.0 + \frac{1.0}{N} \right) , (13.26)
A^{2*} = A^2 . (13.27)
The relation between these modified statistics and the p-values is given
in Fig. 13.4.
Fig. 13.4. Modified test statistics D∗, V∗, A∗ and W²∗ (scaled by a factor 10) as a function of the p-value.
\chi^2 = \frac{(c_n n - c_m m)^2}{\delta^2} , (13.28)
where the denominator \delta^2 is the expected variance of the parenthesis in the numerator under the null hypothesis. To compute \delta we have to estimate \lambda. The p.d.f. of n and m is P(n|\lambda/c_n)\, P(m|\lambda/c_m), leading to the corresponding log-likelihood of \lambda,
\ln L(\lambda) = n \ln\frac{\lambda}{c_n} - \frac{\lambda}{c_n} + m \ln\frac{\lambda}{c_m} - \frac{\lambda}{c_m} + \mathrm{const.} ,
with the MLE
\hat\lambda = \frac{n + m}{1/c_n + 1/c_m} = c_n c_m\, \frac{n + m}{c_n + c_m} . (13.29)
Assuming now that n is distributed according to a Poisson distribution with mean \hat{n} = \hat\lambda/c_n and, respectively, m with mean \hat{m} = \hat\lambda/c_m, we find
\delta^2 = c_n^2 \hat{n} + c_m^2 \hat{m} = (c_n + c_m)\hat\lambda = c_n c_m (n + m) ,
\chi^2 = \frac{1}{c_n c_m}\, \frac{(c_n n - c_m m)^2}{n + m} . (13.30)
As mentioned, only the relative normalization cn /cm is relevant.
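A small sketch of the comparison (13.30) with illustrative counts and normalization constants; the p-value is taken from the χ² distribution with one degree of freedom, assuming scipy is available.

```python
# Sketch of the χ² comparison (13.30) of two Poisson numbers with different normalization.
import numpy as np
from scipy.stats import chi2

def chi2_two_poisson(n, m, cn, cm):
    """χ² = (c_n n - c_m m)² / (c_n c_m (n + m)); only the ratio c_n/c_m matters."""
    return (cn * n - cm * m) ** 2 / (cn * cm * (n + m))

n, m = 40, 95               # e.g. counts in a signal region and in a reference sample
cn, cm = 1.0, 0.5           # reference sample with twice the exposure
t = chi2_two_poisson(n, m, cn, cm)
print(f"chi2 = {t:.2f}, p-value = {chi2.sf(t, df=1):.3f}")
```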
13.10.3 χ2 of Histograms
We have to evaluate the expression (13.32) for each bin and sum over all B
bins
\chi^2 = \sum_{i=1}^{B} \left[ \frac{1}{\tilde{c}_n \tilde{c}_m}\, \frac{(\tilde{c}_n \tilde{n} - \tilde{c}_m \tilde{m})^2}{\tilde{n} + \tilde{m}} \right]_i , (13.34)
where the prescription indicated by the index i means that all quantities
in the bracket have to be evaluated for bin i. In case the entries are not
weighted the tilde is obsolete. The constants cn , cm in (13.33) usually are
overall normalization constants and equal for all bins of the corresponding
histogram. If the histograms are normalized with respect to each other, we
have cn Σni = cm Σmi and we can set cn = Σmi = M and cm = Σni = N .
χ2 Goodness-of-Fit Test
This expression can be used for goodness-of-fit tests. In case the normalization
constants are given externally, for instance through the luminosity, χ2 follows
approximately a χ2 distribution of B degrees of freedom. Frequently the
histograms are normalized with respect to each other. Then we have one
degree of freedom less, i.e. B − 1. If P parameters have been adjusted in
addition, then we have B − P − 1 degrees of freedom.
In Chap. 10, Sect. 10.4.3 we have introduced the likelihood ratio test for
histograms. For a pair of Poisson numbers n, m the likelihood ratio is the
ratio of the maximal likelihood under the condition that the two numbers
are drawn from the same distribution to the unconditioned maximum of
the likelihood for the observation of n. The corresponding difference of the
logarithms is our test statistic V (see likelihood ratio test for histograms)
V = n \ln\frac{\lambda}{c_n} - \frac{\lambda}{c_n} - \ln n! + m \ln\frac{\lambda}{c_m} - \frac{\lambda}{c_m} - \ln m! - \left[ n \ln n - n - \ln n! \right]
= n \ln\frac{\lambda}{c_n} - \frac{\lambda}{c_n} + m \ln\frac{\lambda}{c_m} - \frac{\lambda}{c_m} - \ln m! - n \ln n + n .
We now turn to weighted events and perform the same replacements as
above:
V = \tilde{n} \ln\frac{\tilde\lambda}{\tilde{c}_n} - \frac{\tilde\lambda}{\tilde{c}_n} + \tilde{m} \ln\frac{\tilde\lambda}{\tilde{c}_m} - \frac{\tilde\lambda}{\tilde{c}_m} - \ln \tilde{m}! - \tilde{n} \ln \tilde{n} + \tilde{n} .
The variables and parameters of this formula are given in relations (13.35), (13.31) and (13.33). They depend on c_n, c_m. As stated above, only the ratio c_n/c_m matters. The ratio is either given or obtained from the normalization c_n Σn_i = c_m Σm_i.
The distribution of the test statistic under H0 for large event number
follows approximately a χ2 distribution of B degrees of freedom if the nor-
malization is given or of B − 1 degrees of freedom in the usual case where
the histograms are normalized to each other. For small event numbers the
distribution of the test statistic has to be obtained by simulation.
\chi^2 = \sum_{i=1}^{B} \left[ \frac{(c_n n - \tilde{c}_m \tilde{m})^2}{c_n \tilde{c}_m (n + \tilde{m})} \right]_i
For each minimization step we have to recompute the weights and with
(13.31) and (13.33) the LS parameter χ2 . If the relative normalization of the
simulated and observed data is not known the ratio cn /cm is a free parameter
in the fit. As only the ratio matters, we can set for instance cm = 1.
We do not recommend applying a likelihood fit, because the approximation of the distribution of the sum of weights by a scaled Poisson distribution is not valid for small event numbers, where the statistical errors of the simulation are important.
The left hand side describes N independent Poisson processes with mean
values λi and random variables ki , and the right hand side corresponds to a
single Poisson process with λ = Σλi and the random variable k = Σki where
the numbers ki follow a multinomial distribution
M_{\varepsilon_1, \ldots, \varepsilon_N}(k_1, \ldots, k_N) = \frac{k!}{\prod_{i=1}^{N} k_i!} \prod_{i=1}^{N} \varepsilon_i^{k_i} .
The scaled Poisson distribution (SPD) is fixed by the requirement that the first two moments of the weighted sum have to be reproduced. We define an equivalent mean value \tilde\lambda,
\tilde\lambda = \frac{\lambda E(w)^2}{E(w^2)} , (13.38)
and a scale factor s,
s = \frac{E(w^2)}{E(w)} , (13.39)
such that the expected value E(s\tilde{k}) = \mu and \mathrm{var}(s\tilde{k}) = \sigma^2. The cumulants of the scaled distribution are \tilde\kappa_m = s^m \tilde\lambda.
We compare the cumulants of the two distributions and form the ratios \kappa_m/\tilde\kappa_m. By definition the ratios for m = 1, 2 agree, because the two lowest moments agree.
The skewness and excess of the two distributions are, in terms of the moments E(w^m) of w:
\gamma_1 = \frac{\kappa_3}{\sigma^3} = \frac{E(w^3)}{\lambda^{1/2} E(w^2)^{3/2}} , (13.40)
\gamma_2 = \frac{\kappa_4}{\sigma^4} = \frac{E(w^4)}{\lambda E(w^2)^2} , (13.41)
\tilde\gamma_1 = \left[ \frac{E(w^2)}{\lambda E(w)^2} \right]^{1/2} , (13.42)
\tilde\gamma_2 = \frac{E(w^2)}{\lambda E(w)^2} , (13.43)
where a_i, b_i are non-negative and p > 1. For p = 2 one obtains the Cauchy–Schwarz inequality. Setting a_i = w_i^{3/2} and b_i = w_i^{1/2}, we get immediately the relation (13.44) for the skewness:
\left( \sum_i w_i^2 \right)^2 \leq \sum_i w_i^3\, \sum_i w_i .
In general, with p = n - 1, a_i = w_i^{n/(n-1)} and b_i = w_i^{(n-2)/(n-1)}, the inequality becomes
Figure 1. Simulated CPDs compared with the SPD approximation (dotted) and the normal approximation (dashed); left: exponential weights f(w) = e^{-w}, µ = 20, right: truncated normal weights, µ = 25.
\left( \sum_i w_i^2 \right)^{n-1} \leq \sum_i w_i^n \left( \sum_i w_i \right)^{n-2}
Example 170. Comparison of the CPD with the SPD approximation and the
normal distribution
In Figure 1 the results of a simulation of CPDs with two different weight distributions are shown. The simulated events are collected into histogram bins, but the histograms are displayed as line graphs, which are easier to read
than column graphs. Corresponding SPD distributions are generated with the
parameters chosen according to the relations (13.38) and (13.39). They are
indicated by dotted lines. The approximations by normal distributions are
shown as dashed lines. In the left-hand graph the weights are exponentially distributed, and the weight distribution of the right-hand graph is a truncated, renormalized normal distribution N_t(x|1, 1) = cN(x|1, 1), x > 0, with mean and variance equal to 1, where negative values are cut. In this case the approx-
imation by the SPD is hardly distinguishable from the CPD. The exponential
weight distribution includes large weights with low frequency where the ap-
proximation by the SPD is less good. Still it models the CPD reasonably well. The examples show that the approximation by the SPD is close to the CPD and superior to the approximation by the normal distribution.
In the standard bootstrap [119], samples are drawn with replacement from the observations x_i, i = 1, 2, ..., n. Poisson bootstrap is a special re-sampling technique where to each of the n observations x_i a Poisson distributed number k_i ∼ P_1(k_i) = 1/(e\,k_i!) is associated. More precisely, for a bootstrap sample the value x_i is taken k_i times, where k_i is randomly chosen from the Poisson distribution with mean equal to one. Samples where the sum of multiplicities differs from the observed sample size n, i.e. \sum_{i=1}^{n} k_i \neq n, are rejected. Poisson bootstrap is then completely equivalent to the standard bootstrap. It has attractive theoretical properties [122].
In applications of the CPD the situation is different. One does not have a sample of CPD outcomes but only a single observed value of x, which is accompanied by a sample of weights. As the distribution of the number of weights is known up to the Poisson mean, the bootstrap technique is used to infer parameters depending on the weight distribution. To generate observations x_k, we have to generate the numbers k_i ∼ P_1(k_i) and form the sum x = Σ k_i w_i. All results are kept. The resulting Poisson bootstrap distribution (PBD) permits estimating uncertainties of parameters and quantiles of the CPD.
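A minimal sketch of this Poisson bootstrap for a CPD observation is given below; the exponential weights are an illustrative assumption.

```python
# Poisson bootstrap sketch for a compound Poisson observation: given the observed
# weights w_i, each replicate uses Poisson(1) multiplicities k_i and forms x* = Σ k_i w_i.
import numpy as np

rng = np.random.default_rng(7)
weights = rng.exponential(1.0, size=25)          # the observed weights accompanying x

B = 5000
x_star = np.array([(rng.poisson(1.0, size=len(weights)) * weights).sum()
                   for _ in range(B)])           # all results are kept

x_obs = weights.sum()
print("observed x =", round(x_obs, 2))
print("bootstrap std of x =", round(x_star.std(), 2))
print("central 68.3% interval:", np.quantile(x_star, [0.1585, 0.8415]).round(2))
```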
(Fig. 13.6 a)–c): possible new simplex configurations constructed from the points A, B and C′.)
the best point A (13.6c). In each case one of the four configurations is chosen
and the iteration continued.
There exist many variants (see the references in [126]) of the original version of Nelder and Mead [125]. The standard Simplex [125] has been used in most fits of this book. If the number of parameters is large, and especially if the parameters are correlated, Simplex fits have the tendency to stop without having reached the function minimum [126]. This situation occurs in fits of unfolded histograms. Simplex may choose shrinkage while a reflection of the worst parameter point could be the optimal choice. Finally, all points have almost equal parameter coordinates such that the convergence criterion is fulfilled.
Further improvement steps are so small that reducing the convergence pa-
rameter does not change the result. The convergence problem is studied in
great detail in [126] and a solution which introduces stochastic elements in
the stepping process is proposed.
In this book a different approach is followed. After Simplex signals convergence, the fit is repeated, where the best point obtained so far is kept and the remaining points are initialized in the same way as before. Alternatively, these points are chosen randomly, centered at the best value.
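The restart strategy can be sketched as follows, using scipy's Nelder–Mead implementation and the Rosenbrock function purely as an illustration; this is not the fitting code of this book.

```python
# Sketch of the restart strategy: after Nelder-Mead (Simplex) signals convergence,
# the fit is repeated starting from the best point found so far.
import numpy as np
from scipy.optimize import minimize, rosen

def simplex_with_restarts(fun, x0, n_restarts=3):
    best = minimize(fun, x0, method="Nelder-Mead")
    for _ in range(n_restarts):
        # re-initialize the simplex around the best point obtained so far
        restart = minimize(fun, best.x, method="Nelder-Mead")
        if restart.fun < best.fun:
            best = restart
    return best

res = simplex_with_restarts(rosen, x0=np.full(8, 2.0))
print(res.fun, res.x.round(3))
```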
Fig. 13.7. Method of steepest descent.
Fig. 13.8. Stochastic annealing. A local minimum can be left with a certain prob-
ability.
A physical system which is cooled down to the absolute zero point will in principle occupy the energetic minimum. When cooled down fast it may, though, be captured in a local (relative) minimum. An example is a particle in a potential well. At somewhat higher temperature it may leave the local minimum, thanks to the statistical energy distribution (Fig. 13.8). This is used for instance in the annealing of defects in solid matter.
This principle can be used for minimum search in general. When using the method of steepest descent, a step in the wrong direction, where the function increases by ∆f, can be accepted, e.g. with a probability
P(\Delta f) = \frac{1}{1 + e^{\Delta f / T}} .
The scale factor T (“temperature”) steers the strength of the effect. It has been shown that for successively decreasing T the absolute minimum will be reached.
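A minimal sketch of this acceptance rule with a slowly decreasing temperature is given below; the test function, step size and cooling schedule are illustrative choices, not prescriptions from the text.

```python
# Stochastic annealing sketch with acceptance probability P(Δf) = 1/(1 + exp(Δf/T)).
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: x**2 + 4.0 * np.sin(5.0 * x)       # function with several local minima

x, T = 3.0, 2.0
for step in range(5000):
    x_new = x + rng.normal(0.0, 0.3)
    df = f(x_new) - f(x)
    p_accept = 1.0 / (1.0 + np.exp(min(df / T, 50.0)))   # accept also some uphill steps
    if df < 0 or rng.random() < p_accept:
        x = x_new
    T = max(0.999 * T, 1e-3)                      # cooling schedule

print("found minimum near x =", round(x, 3), " f(x) =", round(f(x), 3))
```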
A^T V_N A\,\theta + H^T \alpha = A^T V_N y , (13.47)
H\theta = \rho (13.48)
to be solved for \hat\theta and \hat\alpha. Note that Λ is minimized only with respect to θ, but maximized with respect to α: the stationary point is a saddle point, which complicates a direct extremum search. Solving (13.47) for θ and inserting it into (13.48), we find
\hat\alpha = C_K^{-1}\, (H C_P A^T V_N y - \rho) ,
\hat\theta = C_P\, [\, A^T V_N y - H^T C_K^{-1} (H C_P A^T V_N y - \rho)\, ] ,
where
D = C_P\, (I_P - H^T C_K^{-1} H C_P)\, A^T V_N
has been used. The covariance matrix is symmetric and positive definite. Without constraints it equals C_P; the negative term is absent. Of course, the introduction of constraints reduces the errors and thus improves the parameter estimation.
\mathrm{var}(a_k) = 1 \Big/ \sum_{\nu=1}^{N} \frac{1}{\delta_\nu^2} ,
which is valid for all k = 1, . . . , K. Thus all errors are equal to the error of the weighted mean of the measurements y_\nu.
Proof: from linear error propagation we have, for independent measurements y_\nu,
\mathrm{var}(a_k) = \mathrm{var}\left( \sum_\nu w_\nu u_k(x_\nu) y_\nu \right)
= \sum_\nu w_\nu^2\, (u_k(x_\nu))^2\, \delta_\nu^2
= \sum_\nu w_\nu u_k^2(x_\nu) \Big/ \sum_\nu \frac{1}{\delta_\nu^2}
= 1 \Big/ \sum_\nu \frac{1}{\delta_\nu^2} ,
where in the third step we used the definition of the weights, and in the last step the normalization of the polynomials u_k.
If the errors δ1 , . . . , δN are uniform, the weights become equal to 1/N , and for
certain patterns of the locations x1 , . . . , xN , for instance for an equidistant
distribution, the orthogonalized polynomials uk (x) can be calculated. They
are given in mathematical handbooks, for instance in Ref. [127]. Although the
general expression is quite involved, we reproduce it here for the convenience
of the reader. For x defined in the domain [−1, 1] (if necessary after a linear transformation and shift), and N = 2M + 1 equidistant (with distance ∆x = 1/M) measured points x_ν = ν/M, ν = 0, ±1, . . . , ±M, they are given by
u_k(x) = \left[ \frac{(2M+1)(2k+1)\,[(2M)!]^2}{(2M+k+1)!\,(2M-k)!} \right]^{1/2} \sum_{i=0}^{k} (-1)^{i+k}\, \frac{(i+k)^{[2i]}\,(M+t)^{[i]}}{(i!)^2\,(2M)^{[i]}} ,
They are normalized to unit area. Since the central values are equidistant,
we fix them by the lower limit xmin of the x-interval and count them as
x0 (k) = xmin + kb, with the index k running from kmin = 0 to kmax =
(xmax − xmin )/b = K.
At the borders only half of a spline is used.
Remark: The border splines are defined in the same way as the other
splines. After the fit the part of the function outside of its original domain is
ignored. In the literature the definition of the border splines is often different.
w·x+b=0, (13.49)
where b fixes its distance from the origin. Note that w is not normalized, a
common factor in w and b does not change condition (13.49). Once we have
found the hyperplane {w, b} which separates the two classes yi = ±1 of the
training sample {(x1 , y1 ), . . . , (xN , yN )} we can use it to classify new input:
Fig. 13.9. The red hyperplane separates squares from circles. Shown are the convex
hulls and the support vectors in red.
To find the optimal hyperplane which divides ∆ into equal parts, we define
the two marginal planes which touch the hulls:
w · x + b = ±1 .
For the solution, only the constraints with equality sign are relevant. The vectors corresponding to points on the marginal planes form the so-called active set and are called support vectors (see Fig. 13.9). The optimal solution can be written as
w = \sum_i \alpha_i y_i x_i
with \alpha_i > 0 for the active set, else \alpha_i = 0, and furthermore \sum_i \alpha_i y_i = 0. The last condition ensures translation invariance: w(x_i - a) = w(x_i). Together with the active constraints, after substituting the above expression for w, it provides just the required number of linear equations to fix \alpha_i and b. Of
⁷ A linear classification scheme was already introduced in Sect. 11.4.1.
⁸ The convex hull is the smallest polyhedron which contains all points and their connecting straight lines.
course, the main problem is to find the active set. For realistic cases this
requires the solution of a large quadratic optimization problem, subject to
linear inequalities. For this purpose an extensive literature as well as program libraries exist.
This picture can be generalized to the case of overlapping classes. Assuming that the optimal separation is still given by a hyperplane, the picture remains essentially the same, but the optimization process is substantially more complex. The standard way is to introduce so-called soft margin classifiers. Here some points on the wrong side of their marginal plane are tolerated, but with a certain penalty in the optimization process. It is chosen proportional to the sum of their distances or of their squared distances from their own territory. The proportionality constant is adjusted to the given problem.
All quantities determining the linear classifier ŷ (13.50) depend only on inner
products of vectors of the input space. This concerns not only the dividing
hyperplane, given by (13.49), but also the expressions for w, b and the factors
αi associated to the support vectors. The inner product x · x′ which is a
bilinear symmetric scalar function of two vectors, is now replaced by another
scalar function K(x, x′ ) of two vectors, the kernel, which need not be bilinear,
but should also be symmetric, and is usually required to be positive definite.
In this way a linear problem in an inner product space is mapped into a very
non-linear problem in the original input space where the kernel is defined.
We then are able to separate the classes by a hyperplane in the inner product
space that may correspond to a very complicated hypersurface in the input
space. This is the so-called kernel trick.
To illustrate how a non-linear surface can be mapped into a hyperplane,
we consider a simple example. In order to work with a linear cut, i.e. with a
dividing hyperplane, we transform our input variables x into new variables:
x → X(x). For instance, if x_1, x_2, x_3 are momentum components and a cut in energy, x_1^2 + x_2^2 + x_3^2 < r^2, is to be applied, we could transform the momentum space into a space
X = \{x_1^2, x_2^2, x_3^2, \ldots\} ,
where the cut corresponds to the hyperplane X_1 + X_2 + X_3 = r^2. Such a mapping can be realized by substituting the inner product by a kernel:
K(x, x') = (x \cdot x')^d \quad \text{with } d = 2 ,
(x \cdot x')^2 = (x_1 x_1' + x_2 x_2' + x_3 x_3')^2 = X(x) \cdot X(x') , (13.51)
with
X(x) = \{x_1^2,\; x_2^2,\; x_3^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3,\; \sqrt{2}\,x_2 x_3\} .
The sphere x_1^2 + x_2^2 + x_3^2 = r^2 in x-space is mapped into the 5-dimensional hyperplane X_1 + X_2 + X_3 = r^2 in 6-dimensional X-space. (A kernel inducing, instead of monomials of order d as in (13.51), polynomials of all orders up to order d, is K(x, x') = (1 + x \cdot x')^d.)
The most common kernel used for classification is the Gaussian (see Sect. 11.2.1):
K(x, x') = \exp\left( - \frac{(x - x')^2}{2 s^2} \right) .
It can be shown that it induces a mapping into a space of infinite dimensions
[107] and that nevertheless the training vectors can in most cases be replaced
by a relatively small number of support vectors. The only free parameter
is the penalty constant which regulates the degree of overlap of the two
classes. A high value leads to a very irregular shape of the hypersurface
separating the training samples of the two classes to a high degree in the
original space whereas for a low value its shape is much smoother and more
minority observations are tolerated.
In practice, this mapping into the inner product space is not performed
explicitly, in fact it is even rarely known. All calculations are performed in
x-space, especially the determination of the support vectors and their weights
α. The kernel trick merely serves to prove that a classification with support
vectors is feasible. The classification of new input then proceeds with the
kernel K and the support vectors directly:
\hat{y} = \mathrm{sign}\left( \sum_{y_i = +1} \alpha_i K(x, x_i) - \sum_{y_i = -1} \alpha_i K(x, x_i) \right) .
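In practice such a classification is usually delegated to a library. The following hedged sketch uses scikit-learn's SVC with the Gaussian kernel, where C is the penalty constant and gamma = 1/(2s²) corresponds to the kernel width s; the two-class sample is illustrative.

```python
# Support vector classification sketch with the Gaussian (RBF) kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(9)
N = 400
X = np.vstack([rng.normal(0.0, 1.0, (N // 2, 2)), rng.normal(2.0, 1.0, (N // 2, 2))])
y = np.array([-1] * (N // 2) + [+1] * (N // 2))

s = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2.0 * s**2), C=1.0).fit(X, y)
print("number of support vectors:", len(clf.support_vectors_))
print("classification of new input:", clf.predict([[1.0, 1.0], [3.0, 3.0]]))
```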
The posterior rating is equal to the prior rating times the Bayes factor.
The Bayes factor is a very reasonable and conceptually attractive concept
which requires little computational effort. It is to be preferred to the frequen-
tist p-value approach in decision making. However, for the documentation of
a measurement it has the typical Bayesian drawback that it depends on prior
densities and unfortunately there is no objective way to fix those.
⁹ Postulated by William of Ockham, English logician in the 14th century.
If one or a few observations in a sample are separated from the bulk of the
data, we speak of outliers. The reasons for their existence range from trivial
mistakes or detector failures to important physical effects. In any case, the
assumed statistical model has to be questioned if one is not willing to admit
that a large and very improbable fluctuation did occur.
Outliers are quite disturbing: They can change parameter estimates by
large amounts and increase their errors drastically.
Frequently outliers can be detected simply by inspection of appropriate plots. It goes without saying that simply dropping them is not good advice. In any case, at least a complete documentation of such an event is required.
Clearly, objective methods for their detection and treatment are preferable.
In the following, we restrict our treatment to the simple one-dimensional
case of Gaussian-like distributions, where outliers are located far from the
average, and where we are interested in the mean value. If a possible outlier
is contained in the allowed variate range of the distribution – which is always
true for a Gaussian – a statistical fluctuation cannot be excluded as a logical
possibility. Since the outliers are removed on the basis of a statistical proce-
dure, the corresponding modification of results due to the possible removal
of correct observations can be evaluated.
We distinguish three cases:
1. The standard deviations of the measured points are known.
2. The standard deviations of the measured points are unknown but known
to be the same for all points.
3. The standard deviations are unknown and different.
It is obvious that case 3 of unknown and unequal standard deviations
cannot be treated.
The treatment of outliers, especially in situations like case 2, within the
LS formalism is not really satisfying. If the data are of bad quality we may
expect a sizeable fraction of outliers with large deviations. These may distort
the LS fit to such an extent that outliers become difficult to define (masking of outliers). This kind of fragility of the LS method, and the fact that in higher dimensions the outlier detection becomes even more critical, has led statisticians to look for estimators which are less disturbed by data not
obeying the assumed statistical model (typical are deviations from the as-
sumed normal distribution), even when the efficiency suffers. In a second –
not robust – fit procedure with cleaned data it is always possible to optimize
the efficiency.
In particle physics, a typical problem is the reconstruction of particle
tracks from hits in wire or silicon detectors. Here outliers due to other tracks
or noise are a common difficulty, and for a first rough estimate of the track
462 13 Appendix
parameters and the associated hit selection for the pattern recognition, robust
methods are useful.
M-Estimators
with ρ(z) = z² for the LS method, which is optimal for Gaussian errors. For the Laplace distribution mentioned above, the optimal objective function is based on ρ(z) = |z|, derived from the likelihood analog which suggests ρ ∝ −ln f. To obtain a more robust estimation, the function ρ can be modified in various ways, but we have to retain the symmetry ρ(z) = ρ(−z) and to require a single minimum at z = 0. Estimators with objective functions ρ different from z² are called M-estimators, “M” reminding of maximum likelihood. The best known example is due to Huber [128]. His proposal is a kind of mixture of the appropriate functions of the Gauss and the Laplace cases:
\rho(z) = \begin{cases} z^2/2 & \text{if } |z| \leq c \\ c\,(|z| - c/2) & \text{if } |z| > c . \end{cases}
The constant c has to be adapted to the given problem. For a normal
population the estimate is of course not efficient. For example with c = 1.345
the inverse of the variance is reduced to 95% of the standard value. Obviously,
the fitted objective function (13.52) no longer follows a χ2 distribution with
appropriate degrees of freedom.
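A short sketch of an M-estimate of a location parameter with Huber's ρ and c = 1.345 is given below, minimizing Σ ρ((x_i − µ)/σ) numerically; taking σ as known is a simplifying assumption for this illustration.

```python
# Huber M-estimate of a location parameter, compared with the ordinary mean.
import numpy as np
from scipy.optimize import minimize_scalar

def rho_huber(z, c=1.345):
    return np.where(np.abs(z) <= c, 0.5 * z**2, c * (np.abs(z) - 0.5 * c))

def m_estimate(x, sigma, c=1.345):
    objective = lambda mu: rho_huber((x - mu) / sigma, c).sum()
    return minimize_scalar(objective, bounds=(x.min(), x.max()), method="bounded").x

rng = np.random.default_rng(10)
x = np.append(rng.normal(5.0, 1.0, 20), [15.0, 18.0, 20.0])   # sample with three outliers
print("LS (mean):", round(x.mean(), 2), "  Huber M-estimate:", round(m_estimate(x, 1.0), 2))
```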
zero, but for M-estimators or truncated fits, changing a single point would not be sufficient to shift the fitted parameter by a large amount. The maximal value of ε is smaller than 50% if the outliers are the minority. It is not difficult
to construct estimators which approach this limit, see [129]. This is achieved,
for instance, by ordering the residuals according to their absolute value (or
ordering the squared residuals, resulting in the same ranking) and retaining
only a certain fraction, at least 50%, for the minimization. This so-called least
trimmed squares (LTS) fit is to be distinguished from truncated least square
fit (LST, LSTS) with a fixed cut against large residuals.
Another method relying on rank order statistics is the so-called least median of squares (LMS) method. It is defined as follows: Instead of minimizing with respect to the parameters µ the sum of squared residuals, \sum_i r_i^2, one searches the minimum of the sample median of the squared residuals:
\mathrm{minimize}_{\mu}\ \mathrm{median}\left( r_i^2(\mu) \right) .
This definition implies that for N data points, N/2 + 1 points enter for
N even and (N + 1)/2 for N odd. Assuming equal errors, this definition can
be illustrated geometrically in the one-dimensional case considered here: µ̂ is
the center of the smallest interval (vertical strip in Fig. 13.10) which covers
half of all x values. The width 2∆ of this strip can be used as an estimate of
the error. Many variations are of course possible: Instead of requiring 50% of
the observations to be covered, a larger fraction can be chosen. Usually, in a
second step, a LS fit is performed with the retained observations, thus using
the LMS only for outlier detection. This procedure is chosen, since it can be
shown that, at least in the case of normal distributions, ranking methods are
statistically inferior as compared to LS fits.
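A minimal sketch of the LMS location estimate as the center of the shortest interval covering half of the observations is given below; the sample with three outliers is illustrative, and a subsequent LS fit on the retained points is left out.

```python
# LMS location sketch: µ̂ is the center of the shortest interval covering half the points.
import numpy as np

def lms_location(x):
    xs = np.sort(x)
    n_half = len(xs) // 2 + 1                     # half of the observations
    widths = xs[n_half - 1:] - xs[:len(xs) - n_half + 1]
    i = np.argmin(widths)                         # shortest interval covering n_half points
    return 0.5 * (xs[i] + xs[i + n_half - 1]), 0.5 * widths[i]

rng = np.random.default_rng(11)
x = np.append(rng.normal(5.0, 1.0, 20), [11.0, 12.0, 13.0])   # sample with three outliers
mu, delta = lms_location(x)
print(f"LMS estimate = {mu:.2f} +- {delta:.2f},  mean = {x.mean():.2f}")
```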
Fig. 13.10. Estimates of the location parameter for a sample with three outliers.
64. H. N. Mülthei and B. Schorr, On an iterative method for the unfolding of spectra,
Nucl. Instr. and Meth. A257 (1986) 371.
65. M. Schmelling, The method of reduced cross-entropy - a general approach to
unfold probability distributions, Nucl. Instr. and Meth. A340 (1994) 400.
66. L. Lindemann and G. Zech, Unfolding by weighting Monte Carlo events, Nucl.
Instr. and Meth. A354 (1994) 516.
67. G. D’Agostini, A multidimensional unfolding method based on Bayes’ theorem,
Nucl. Instr. and Meth. A 362 (1995) 487.
68. A. Hoecker and V. Kartvelishvili, SVD approach to data unfolding, Nucl. Instr.
and Meth. A 372 (1996), 469.
69. N. Milke et al. Solving inverse problems with the unfolding program TRUEE:
Examples in astroparticle physics, Nucl. Instr. and Meth. A697 (2013) 133.
70. A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum Likelihood from Incom-
plete Data via the EM Algorithm, J. R. Statist.Soc. B 39 (1977) 1.
71. W. H. Richardson, Bayesian based Iterative Method of Image restoration Jour-
nal. of the Optical Society of America 62 (1972) 55.
72. L. B. Lucy, An iterative technique for the rectification of observed distributions,
Astron. Journ. 79 (1974) 745.
73. L. A. Shepp and Y. Vardi, Maximum likelihood reconstruction for emission
tomography, IEEE trans. Med. Imaging MI-1 (1982) 113.
74. A. Kondor, Method of converging weights - an iterative procedure for solving
Fredholm’s integral equations of the first kind, Nucl. Instr. and Meth. 216 (1983)
177.
75. Y. Vardi, L. A. Shepp and L. Kaufmann, A statistical model for positron emis-
sion tomography, J. Am. Stat. Assoc.80 (1985) 8, Y. Vardi and D. Lee, From
image deblurring to optimal investments: Maximum likelihood solution for pos-
itive linear inverse problems (with discussion), J. R. Statist. Soc. B55, 569
(1993).
76. H. N. Mülthei, B. Schorr, On properties of the iterative maximum likelihood
reconstruction method, Math. Meth. Appl. Sci. 11 (2005) 331.
77. D. M. Titterington, Some aspects of statistical image modeling and restoration,
Proceedings of PHYSTAT 05, ed. L. Lyons and M. K. Ünel, Oxford (2005).
78. P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, SIAM, (2009)
79. B. Efron and R. T. Tibshirani, An Introduction to the Bootstrap, Chapman &
Hall, London (1993).
80. M. Kuusela and V. M. Panaretos, Statistical unfolding of elementary particle
spectra: Empirical Bayes estimation and bias-corrected uncertainty quantifica-
tion, Annals of Applied Statistics 9 (2015) 1671. M. Kuusela, Statistical Issues
in Unfolding Methods for High Energy Physics, Master's thesis, Aalto University, Finland (2012).
81. G. D’Agostini, Improved iterative Bayesian unfolding, arXiv:1010.632v1 (2010).
82. I. Volobouev, On the Expectation-Maximization Unfolding with smoothing,
arXiv:1408.6500v2 (2015).
83. A. N. Tichonoff, Solution of incorrectly formulated problems and the regulariza-
tion method, translation of the original article (1963) in Soviet Mathematics 4,
1035.
84. D. W. Scott, and S. R. Sain, Multi-Dimensional Density Estimation, in Hand-
book of Statistics, Vol 24: Data Mining and Computational Statistics, ed. C.R.
Rao and E.J. Wegman, Elsevier, Amsterdam (2004).
130. R. Maronna, D. Martin and V. Yohai, Robust Statistics – Theory and Methods,
John Wiley, New York (2006).
131. Some useful internet links:
- http://www.stats.gla.ac.uk/steps/glossary/basic-definitions.html, Statistics
Glossary (V. J. Easton and J. H. McColl).
- http://www.nu.to.infn.it/Statistics/, Useful Statistics Links for Particle
Physicists.
- http://www.statsoft.com/textbook/stathome.html, Electronic Textbook Statsoft.
- http://wiki.stat.ucla.edu/socr/index.php/EBook, Electronic Statistics Book.
- http://www.york.ac.uk/depts /maths/histstat/lifework.htm,Life and Work of
Statisticians (University of York, Dept. of Mathematics).
- http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2000/Geu00/geurts-
pkdd2000-bagging.pdf, Some enhancements of Decision Tree Bagging (P.
Geurts).
- http://www.math.ethz.ch/ blatter/Waveletsvortrag.pdf, Wavelet script (in
German).
- http://www.xmarks.com/site/www.umiacs.umd.edu/ joseph/support-vector-
machines4.pdf, A Tutorial on Support Vector Machines for Pattern Recognition
(Ch. J. C. Burges).
- http://www-stat.stanford.edu /∼jhf/ftp/machine.pdf, Recent Advances in
Predictive (Machine) Learning (J. B. Friedman).
Table of Symbols
Symbol Explanation
A,B Events
A Negation of A
Ω/∅ Certain / impossible event
A ∪ B, A ∩ B, A ⊂ B    A OR B, A AND B, A implies B, etc.
P {A} Probability of A
P {A|B} Conditional probability of A
(for given B)
x,y,z ; k,l,m (Continuous; discrete) random variable (variate)
θ, µ, σ Parameter of distributions
f (x) , f (x|θ) Probability density function
F (x) , F (x|θ) Integral (probability-) distribution function
(for parameter value θ, respectively)(p. 16)
f (x) , f (x|θ) Respective multidimensional generalizations (p. 46)
A , Aji Matrix, matrix element in column i, row j
AT , ATji = Aij Transposed matrix
a , a · b ≡ aT b Column vector, inner (dot-) product
L(θ) , L(θ|x1 , . . . , xN ) Likelihood function (p. 155)
L(θ|x1 , . . . , xN ) Generalization to more dimensional parameter space
θ̂ Statistical estimate of the parameter θ(p. 161)
E(u(x)) , hu(x)i Expected value of u(x)
u(x) Arithmetic sample mean, average (p. 21)
δx Measurement error of x (p. 92)
σx Standard deviation of x
σx2 , var(x) Variance (dispersion) of x (p. 22)
cov(x, y) , σxy Covariance (p. 50)
ρxy Correlation coefficient
µi Moment of order i with respect to origin 0, initial moment
µ′i Central moment (p. 34)
µij , µ′ij etc. Two-dimensional generalizations (p. 48)
γ1 , β2 , γ2 Skewness , kurtosis , excess (p. 26)
κi Semiinvariants (cumulants) of order i, (p. 37)
Index
UMP test, see test, uniformly most powerful
unfolding, 261
   binning, 290
   binning-free, 296
   curvature regularization, 286
   eigendecomposition, see deconvolution
   eigenvector decomposition, 268
   EM method, 280
   entropy regularization, 286
   error assignment, 279
   expectation maximization method, 274
   explicit regularization, 275
   implicit regularization, 293
   integrated square error, 278
   iterative, 280
   least square method, 268
   migration method, 298
   ML approach, 274
   norm regularization, 287
   penalty regularization, 285
   regularization strength, 277
   response matrix, 290
   Richardson-Lucy method, 280
   spline approximation, 275, 289
   truncated SVD, 283
   wide bin regularization, 293
   with background, 295
uniform distribution, 33, 69
upper limit, 255
   Poisson statistics, 256
   Poisson statistics with background, 257
v. Mises distribution, 60
variables
   independent, identically distributed, 59
variance, 22
   estimation by bootstrap, 408
   of a sum, 23
   of a sum of distributions, 26
   of sample mean, 24
variate, 10
   transformation, 45
Venn diagram, 10, 152
Watson statistic, 439
Watson test, 328
wavelets, 364
Weibull distribution, 83
weight matrix, 74
weighted events, 87
weighted observations, 87
width of sample, 25
   relation to variance, 25
List of Examples
Chapter 1
1. Uniform prior for a particle mass
Chapter 2
2. Card game, independent events
3. Random coincidences, measuring the efficiency of a counter
4. Bayes’ theorem, fraction of women among students
5. Bayes’ theorem, beauty filter
Chapter 3
6. Discrete probability distribution (dice)
7. Probability density of an exponential distribution
8. Probability density of the normal distribution
9. Variance of the convolution of two distributions
10. Expected values, dice
11. Expected values, lifetime distribution
12. Mean value of the volume of a sphere with a normally distributed radius
13. Playing poker until the bitter end
14. Diffusion
15. Mean kinetic energy of a gas molecule
16. Reading accuracy of a digital clock
17. Efficiency fluctuations of a detector
18. Characteristic function of the Poisson distribution
19. Distribution of a sum of independent, Poisson distributed variates
20. Characteristic function and moments of the exponential distribution
21. Calculation of the p.d.f. for the volume of a sphere from the p.d.f. of the
radius
22. Distribution of the quadratic deviation
23. Distribution of kinetic energy in the one-dimensional ideal gas
24. Generation of an exponential distribution starting from a uniform distribution
25. Superposition of two two-dimensional normal distributions
26. Correlated variates
27. Dependent variates with correlation coefficient zero
28. Transformation of a normal distribution from Cartesian into polar coordinates
29. Distribution of the difference of two digitally measured times
30. Distribution of the transverse momentum squared of particle tracks
31. Quotient of two normally distributed variates
32. Generation of a two-dimensional normal distribution starting from uniform distributions
33. The v. Mises distribution
34. Fisher’s spherical distribution
35. Efficiency fluctuations of a Geiger counter
77. MLE of the mean value of a normal distribution with known width
78. MLE of the width of a normal distribution with given mean
79. MLE of the mean of a normal distribution with unknown width
80. MLE of the width of a normal distribution with unknown mean
81. MLEs of the mean value and the width of a normal distribution
82. Determination of the axis of a given distribution of directions
83. Likelihood analysis for a signal with a linear background
84. Sufficient statistic and expected value of a normal distribution
85. Sufficient statistic for mean value and width of a normal distribution
86. Conditionality
87. Likelihood principle, dice
88. Likelihood principle, V − A
89. Stopping rule: Four decays in a fixed time interval
90. Moments method: Mean and variance of the normal distribution
91. Moments method: Asymmetry of an angular distribution
92. Counter example to the least square method: Gauging a digital clock
93. Least square method: Fit of a straight line
94. Bias of the MLE of the decay parameter
95. Bias of the estimate of a Poisson rate with observation zero
96. Bias of the measurement of the width of a uniform distribution
Chapter 7
97. Adjustment of a linear distribution to a histogram
98. Fit of the particle composition of an event sample (1)
99. Fit of the slope of a linear distribution with Monte Carlo correction
100. Lifetime fit with Monte Carlo correction
101. Fit of the parameters of a peak over background
102. Fit of the parameters of a peak with a background reference sample
103. Fit with constraint: Two pieces of a rope
104. Fit with constraint: Particle composition of an event sample (2)
105. Kinematical fit with constraints: Eliminating parameters
106. Example 103 continued
107. Example 105 continued
108. Example 103 continued
109. Reduction of the variate space
110. Approximated likelihood estimator: Lifetime fit from a distorted distribution
111. Approximated likelihood estimator: Linear and quadratic distributions
112. Nuisance parameter: Decay distribution with background
113. Nuisance parameter: Measurement of a Poisson rate with a digital clock
114. Nuisance parameter: Decay distribution with background sample
115. Elimination of a nuisance parameter by factorization of a two-dimensional
normal distribution
116. Elimination of a nuisance parameter by restructuring: Absorption measurement