
Gerhard Bohm, Günter Zech

Introduction to Statistics and Data Analysis for Physicists

– Third Revised Edition –

Verlag Deutsches Elektronen-Synchrotron


Prof. Dr. Gerhard Bohm
Deutsches Elektronen-Synchrotron
Platanenallee 6
D-15738 Zeuthen
e-mail: gerhard.bohm@desy.de

Univ.-Prof. Dr. Günter Zech


Universität Siegen
Fachbereich Physik
Walter-Flex-Str. 3
D-57068 Siegen
e-mail: zech@physik.uni-siegen.de

This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.

ISBN 978-3-945931-13-4
DOI 10.3204/PUBDB-2017-08987

Publisher and distributor (Herausgeber und Vertrieb):
Verlag Deutsches Elektronen-Synchrotron
Notkestraße 85
D-22607 Hamburg

Copyright: Gerhard Bohm, Günter Zech

Preface to the third edition

We have revised most parts of the book and added new examples. The chapter on unfolding has been rewritten, and the sections on the elimination of nuisance parameters and on background subtraction have been extended. For personal reasons, G.B. was not able to check all modifications with the necessary care; G.Z. therefore takes full responsibility for all new parts.

July 2017,
Gerhard Bohm, Günter Zech

Preface to the second edition


Since the first edition four years ago, some new developments have been published which have led to a few modifications of our book. The section concerning the distribution of weighted Poisson events has been modified, and the compound Poisson distribution has been included. The sections on parameter estimation by comparison of data with simulation and on unfolding have been revised. The notation has been unified, and minor corrections and extensions have been applied to many parts of the book.

June 2014,
Gerhard Bohm, Günter Zech

Preface
There is a large number of excellent books on statistics. Nevertheless, we think it is justified to complement them by another textbook with a focus on modern applications in nuclear and particle physics. To this end we have included a large number of related examples and figures in the text. We put less emphasis on the mathematical foundations and appeal instead to the intuition of the reader.
Data analysis in modern experiments is unthinkable without simulation techniques. We discuss in some detail how to apply Monte Carlo simulation to parameter estimation, deconvolution, and goodness-of-fit tests. We also sketch modern developments such as artificial neural nets, bootstrap methods, boosted decision trees, and support vector machines.
Likelihood is a central concept of statistical analysis, and its foundation is the likelihood principle. We discuss this concept in more detail than is usually done in textbooks and base the treatment of inference problems as far as possible on the likelihood function only, as is common in the majority of the nuclear and particle physics community. In this way point and interval estimation, error propagation, the combination of results, and the inference of discrete and continuous parameters are treated consistently. We apply Bayesian methods where the likelihood function alone is not sufficient to reach sensible results, for instance in the handling of systematic errors, in deconvolution problems, and in some cases where nuisance parameters have to be eliminated, but we avoid improper prior densities. Goodness-of-fit and significance tests, for which no likelihood function exists, are based on standard frequentist methods.
Our textbook is based on lecture notes from a course given to master physics students at the University of Siegen, Germany, a few years ago. The content has been considerably extended since then. A preliminary German version is published as an electronic book in the DESY library. The present book is addressed mainly to master and Ph.D. students, but also to physicists who are interested in an introduction to recent developments in the statistical methods of data analysis in particle physics. When reading the book, some parts can be skipped, especially in the first five chapters. Where necessary, back references are included.
We welcome comments, suggestions, and indications of mistakes and typing errors. We are prepared to discuss or answer questions on specific statistical problems.
We acknowledge the technical support provided by DESY and the Uni-
versity of Siegen.

February 2010,
Gerhard Bohm, Günter Zech
Contents

1 Introduction: Probability and Statistics . . . . . . . . . . . . . . . . . . . 1


1.1 The Purpose of Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Random Variable, Variate, Event, Observation and Measurement . . . . 2
1.3 How to Define Probability? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Assignment of Probabilities to Events . . . . . . . . . . . . . . . . . . . . . 5
1.5 Outline of this Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Basic Probability Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9


2.1 Random Events and Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Probability Axioms and Theorems . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.1 Axioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Conditional Probability, Independence, and Bayes’ Theorem . . . . 11

3 Probability Distributions and their Properties . . . . . . . . . . . . 15


3.1 Definition of Probability Distributions . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Discrete Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Continuous Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Empirical Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Expected Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Definition and Properties of the Expected Value . . . . . . 21
3.2.2 Mean Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Skewness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.5 Kurtosis (Excess) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Moments and Characteristic Functions . . . . . . . . . . . . . . . . . . . . 33
3.3.1 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Characteristic Function . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Transformation of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.1 Calculation of the Transformed Density . . . . . . . . . . . . . 41

3.4.2 Determination of the Transformation Relating two Distributions . . . . 45
3.5 Multivariate Probability Densities . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.1 Probability Density of two Variables . . . . . . . . . . . . . . . . 46
3.5.2 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5.3 Transformation of Variables . . . . . . . . . . . . . . . . . . . . . . . 51
3.5.4 Reduction of the Number of Variables . . . . . . . . . . . . . . . 52
3.5.5 Determination of the Transformation between two Distributions . . . . 56
3.5.6 Distributions of more than two Variables . . . . . . . . . . . . 57
3.5.7 Independent, Identically Distributed Variables . . . . . . . 59
3.5.8 Angular Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Some Important Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6.1 The Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.6.2 The Multinomial Distribution . . . . . . . . . . . . . . . . . . . . . . 65
3.6.3 The Poisson Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.6.4 The Uniform Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6.5 The Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.6.6 The Exponential Distribution . . . . . . . . . . . . . . . . . . . . . . 74
3.6.7 The χ2 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.8 The Gamma Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6.9 The Lorentz and the Cauchy Distributions . . . . . . . . . . . 79
3.6.10 The Log-normal Distribution . . . . . . . . . . . . . . . . . . . . . . 80
3.6.11 Student’s t Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.6.12 The Extreme Value Distributions . . . . . . . . . . . . . . . . . . . 83
3.7 Mixed and Compound Distributions . . . . . . . . . . . . . . . . . . . . . . 85
3.7.1 Superposition of distributions . . . . . . . . . . . . . . . . . . . . . . 85
3.7.2 Compound Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.7.3 The Compound Poisson Distribution . . . . . . . . . . . . . . . . 87

4 Measurement Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1 General Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.1 Importance of Error Assignments . . . . . . . . . . . . . . . . . . . 91
4.1.2 The Declaration of Errors . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.1.3 Definition of Measurement and its Error . . . . . . . . . . . . . 92
4.2 Statistical Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.1 Errors Following a Known Statistical Distribution . . . . 94
4.2.2 Errors Determined from a Sample of Measurements . . . 95
4.2.3 Error of the Empirical Variance . . . . . . . . . . . . . . . . . . . . 98
4.3 Systematic Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
4.3.1 Definition and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.2 How to Avoid, Detect and Estimate Systematic Errors 100
4.3.3 Treatment of Systematic Errors . . . . . . . . . . . . . . . . . . . . 102
4.4 Linear Propagation of Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.4.1 Error Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.4.2 Error of a Function of Several Measured Quantities . . . 104


4.4.3 Averaging Uncorrelated Measurements . . . . . . . . . . . . . . 106
4.4.4 Averaging Correlated Measurements . . . . . . . . . . . . . . . . 108
4.4.5 Averaging Measurements with Systematic Errors . . . . . 110
4.4.6 Several Functions of Several Measured Quantities . . . . . 112
4.4.7 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.5 Biased Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.6 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

5 Monte Carlo Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.2 Generation of Statistical Distributions . . . . . . . . . . . . . . . . . . . . 123
5.2.1 Computer Generated Pseudo Random Numbers . . . . . . 124
5.2.2 Generation of Distributions by Variable Transformation . . . . 126
5.2.3 Simple Rejection Sampling . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.4 Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.5 Treatment of Additive Probability Densities . . . . . . . . . 134
5.2.6 Weighting Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.2.7 Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . 136
5.3 Solution of Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.3.1 Simple Random Selection Method . . . . . . . . . . . . . . . . . . 140
5.3.2 Improved Selection Method . . . . . . . . . . . . . . . . . . . . . . . . 142
5.3.3 Weighting Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.3.4 Reduction to Expected Values . . . . . . . . . . . . . . . . . . . . . 145
5.3.5 Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.4 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

6 Estimation I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
6.2 Inference with Given Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2.1 Discrete Hypotheses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.2.2 Continuous Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3 Likelihood and the Likelihood Ratio . . . . . . . . . . . . . . . . . . . . . . 155
6.4 The Maximum Likelihood Method for Parameter Inference . . 160
6.4.1 The Recipe for a Single Parameter . . . . . . . . . . . . . . . . . . 161
6.4.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.4.3 Likelihood Inference for Several Parameters . . . . . . . . . . 168
6.4.4 Complicated Likelihood Functions . . . . . . . . . . . . . . . . . . 171
6.4.5 Combining Measurements . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.5 Likelihood and Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.5.1 Sufficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.5.2 The Conditionality Principle . . . . . . . . . . . . . . . . . . . . . . . 175
6.5.3 The Likelihood Principle . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.5.4 Stopping Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
6.6 The Moments Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

6.7 The Least Square Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183


6.7.1 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
6.8 Properties of estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.8.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.8.2 Transformation Invariance . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.8.3 Accuracy and Bias of Estimators . . . . . . . . . . . . . . . . . . . 189
6.9 Comparison of Estimation Methods . . . . . . . . . . . . . . . . . . . . . . . 193

7 Estimation II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.1 Likelihood of Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.1.1 The χ2 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.2 Extended Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.3 Comparison of Observations to a Monte Carlo Simulation . . . 200
7.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.3.2 The Likelihood Function . . . . . . . . . . . . . . . . . . . . . . . . . . 200
7.3.3 The χ2 Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.3.4 Weighting the Monte Carlo Observations . . . . . . . . . . . . 201
7.3.5 Including the Monte Carlo Uncertainty . . . . . . . . . . . . . . 202
7.3.6 Solution for a large number of Monte Carlo events . . . . 202
7.4 Parameter Estimation of a Signal Contaminated by Background . . . . 207
7.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
7.4.2 Parametrization of the Background . . . . . . . . . . . . . . . . . 207
7.4.3 Histogram Fits with Separate Background Measurement . . . . 209
7.4.4 The Binning-Free Likelihood Approach . . . . . . . . . . . . . . 209
7.5 Inclusion of Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
7.5.2 Eliminating Redundant Parameters . . . . . . . . . . . . . . . . . 213
7.5.3 Gaussian Approximation of Constraints . . . . . . . . . . . . . 216
7.5.4 The Method of Lagrange Multipliers . . . . . . . . . . . . . . . . 218
7.5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
7.6 Reduction of the Number of Variates . . . . . . . . . . . . . . . . . . . . . . 220
7.6.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
7.6.2 Two Variables and a Single Linear Parameter . . . . . . . . 220
7.6.3 Generalization to Several Variables and Parameters . . . 221
7.6.4 Non-linear Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.7 Approximated Likelihood Estimators . . . . . . . . . . . . . . . . . . . . . . 224
7.8 Nuisance Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
7.8.1 Nuisance Parameters with Given Prior . . . . . . . . . . . . . . 228
7.8.2 Factorizing the Likelihood Function . . . . . . . . . . . . . . . . . 229
7.8.3 Parameter Transformation, Restructuring [19] . . . . . . . . 230
7.8.4 Conditional Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.8.5 Profile Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
7.8.6 Resampling Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
7.8.7 Integrating out the Nuisance Parameter . . . . . . . . . . . . . 237

7.8.8 Explicit Declaration of the Parameter Dependence . . . . 237


7.8.9 Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237

8 Interval Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239


8.1 Error Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
8.1.1 Parabolic Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . 241
8.1.2 General Situation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
8.2 Error Propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
8.2.1 Averaging Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . 244
8.2.2 Approximating the Likelihood Function . . . . . . . . . . . . . 247
8.2.3 Incompatible Measurements . . . . . . . . . . . . . . . . . . . . . . . 249
8.2.4 Error Propagation for a Scalar Function of a Single Parameter . . . . 249
8.2.5 Error Propagation for a Function of Several Parameters . . . . 250
8.3 One-sided Confidence Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
8.3.1 General Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
8.3.2 Upper Poisson Limits, Simple Case . . . . . . . . . . . . . . . . . 256
8.3.3 Poisson Limit for Data with Background . . . . . . . . . . . . 257
8.3.4 Unphysical Parameter Values . . . . . . . . . . . . . . . . . . . . . . 259
8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260

9 Unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
9.2 Discrete Inverse Problems and the Response matrix . . . . . . . . . 262
9.2.1 Introduction and definition . . . . . . . . . . . . . . . . . . . . . . . . 262
9.2.2 The Histogram Representation . . . . . . . . . . . . . . . . . . . . . 263
9.2.3 Expansion of the True Distribution . . . . . . . . . . . . . . . . . 267
9.2.4 The Least Square Solution and the Eigenvector Decomposition . . . . 268
9.2.5 The Maximum Likelihood Approach . . . . . . . . . . . . . . . . 274
9.3 Unfolding with Explicit Regularization . . . . . . . . . . . . . . . . . . . . 275
9.3.1 General considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
9.3.2 Variable Dependence and Correlations . . . . . . . . . . . . . . 276
9.3.3 Choice of the Regularization Strength . . . . . . . . . . . . . . . 277
9.3.4 Error Assignment to Unfolded Distributions . . . . . . . . . 279
9.3.5 EM Unfolding with Early Stopping . . . . . . . . . . . . . . . . . 280
9.3.6 SVD based methods [68, 78] . . . . . . . . . . . . . . . . . . . . . . . 283
9.3.7 Penalty regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
9.3.8 Comparison of the Methods . . . . . . . . . . . . . . . . . . . . . . . 287
9.3.9 Spline approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
9.3.10 Statistical and Systematic Uncertainties of the Response Matrix . . . . 290
9.4 Unfolding with Implicit Regularization . . . . . . . . . . . . . . . . . . . . 293
9.5 Inclusion of Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295

9.6 Summary and Recommendations for the Unfolding of Histograms . . . . 295
9.7 Binning-free Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
9.7.1 Iterative Unfolding Based on Probability Density Estimation . . . . 297
9.7.2 The Satellite Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
9.7.3 The Maximum Likelihood Method . . . . . . . . . . . . . . . . . . 300
9.7.4 Summary for Binning-free Methods . . . . . . . . . . . . . . . . . 302

10 Hypothesis Tests and Significance of Signals . . . . . . . . . . . . . . 303


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
10.2 Some Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
10.2.1 Single and Composite Hypotheses . . . . . . . . . . . . . . . . . . 304
10.2.2 Test Statistic, Critical Region and Significance Level . . 304
10.2.3 Consistency and Bias of Tests . . . . . . . . . . . . . . . . . . . . . . 307
10.2.4 P -Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
10.3 Classification problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
10.4 Goodness-of-Fit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
10.4.1 General Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
10.4.2 The χ2 Test in Generalized Form . . . . . . . . . . . . . . . . . . . 315
10.4.3 The Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . 323
10.4.4 The Kolmogorov–Smirnov Test . . . . . . . . . . . . . . . . . . . . . 325
10.4.5 Tests of the Kolmogorov–Smirnov and Cramer–von Mises Families . . . . 328
10.4.6 Neyman’s Smooth Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328
10.4.7 The L2 Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
10.4.8 Comparing a Data Sample to a Monte Carlo Sample and the Metric . . . . 331
10.4.9 The k-Nearest Neighbor Test . . . . . . . . . . . . . . . . . . . . . . 332
10.4.10 The Energy Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
10.4.11 Tests Designed for Specific Problems . . . . . . . . . . . . . . . 336
10.4.12 Comparison of Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
10.5 Two-Sample Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
10.5.1 The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
10.5.2 The χ2 Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.5.3 The Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.5.4 The Kolmogorov–Smirnov Test . . . . . . . . . . . . . . . . . . . . . 341
10.5.5 The Energy Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
10.5.6 The k-Nearest Neighbor Test . . . . . . . . . . . . . . . . . . . . . . 343
10.6 Significance of Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
10.6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
10.6.2 The Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.6.3 Tests Based on the Signal Strength . . . . . . . . . . . . . . . . . 351

11 Statistical Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
11.2 Smoothing of Measurements and Approximation by Analytic Functions . . . . 356
11.2.1 Smoothing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
11.2.2 Approximation by Orthogonal Functions . . . . . . . . . . . . 359
11.2.3 Wavelets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 364
11.2.4 Spline Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
11.2.5 Approximation by a Combination of Simple Functions . . . . 369
11.2.6 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
11.3 Linear Factor Analysis and Principal Components . . . . . . . . . . 371
11.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
11.4.1 The Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 379
11.4.2 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . 380
11.4.3 Weighting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
11.4.4 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
11.4.5 Bagging and Random Forest . . . . . . . . . . . . . . . . . . . . . . . 395
11.4.6 Comparison of the Methods . . . . . . . . . . . . . . . . . . . . . . . 396

12 Auxiliary Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399


12.1 Probability Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.1.2 Fixed Interval Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
12.1.3 Fixed Number and Fixed Volume Methods . . . . . . . . . . 404
12.1.4 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
12.1.5 Problems and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 405
12.2 Resampling Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
12.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
12.2.2 Definition of Bootstrap and Simple Examples . . . . . . . . 408
12.2.3 Precision of the Error Estimate . . . . . . . . . . . . . . . . . . . . 411
12.2.4 Confidence Limits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
12.2.5 Precision of Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
12.2.6 Random Permutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
12.2.7 Jackknife and Bias Correction . . . . . . . . . . . . . . . . . . . . . . 413

13 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
13.1 Large Number Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
13.1.1 Chebyshev Inequality and Law of Large Numbers . . . . 415
13.1.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 416
13.2 Consistency, Bias and Efficiency of Estimators . . . . . . . . . . . . . 417
13.2.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
13.2.2 Bias of Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
13.2.3 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
13.3 Properties of the Maximum Likelihood Estimator . . . . . . . . . . . 420
13.3.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420

13.3.2 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421


13.3.3 Asymptotic Form of the Likelihood Function . . . . . . . . . 422
13.3.4 Properties of the Maximum Likelihood Estimate for Small Samples . . . . 423
13.4 The Expectation Maximization (EM) Algorithm . . . . . . . . . . . . 424
13.5 Consistency of the Background Contaminated Parameter Estimate and its Error . . . . 427
13.6 Frequentist Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . 430
13.7 Comparison of Different Inference Methods . . . . . . . . . . . . . . . . 433
13.7.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
13.7.2 The Frequentist Approach . . . . . . . . . . . . . . . . . . . . . . . . . 436
13.7.3 The Bayesian Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
13.7.4 The Likelihood Ratio Approach . . . . . . . . . . . . . . . . . . . . 437
13.7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
13.7.6 Consistency, Efficiency, Bias . . . . . . . . . . . . . . . . . . . . . . . 437
13.8 p-values for EDF-Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
13.9 Fisher–Yates shuffle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
13.10 Comparison of Histograms Containing Weighted Events . . . . 441
13.10.1 Comparison of two Poisson Numbers with Different Normalization . . . . 441
13.10.2 Comparison of Weighted Sums . . . . . . . . . . . . . . . . . . . . 442
13.10.3 χ2 of Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
13.10.4 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
13.11 The Compound Poisson Distribution and Approximations of it . . . . 444
13.11.1 Equivalence of two Definitions of the CPD . . . . . . . . . . 444
13.11.2 Approximation by a Scaled Poisson Distribution . . . . . 445
13.11.3 The Poisson Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.12 Extremum Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.12.1 Monte Carlo Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
13.12.2 The Simplex Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 449
13.12.3 Parabola Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 450
13.12.4 Method of Steepest Descent . . . . . . . . . . . . . . . . . . . . . . 450
13.12.5 Stochastic Elements in Minimum Search . . . . . . . . . . . 452
13.13 Linear Regression with Constraints . . . . . . . . . . . . . . . . . . . . . . 452
13.14 Formulas Related to the Polynomial Approximation . . . . . . . 454
13.15 Formulas for B-Spline Functions . . . . . . . . . . . . . . . . . . . . . . . . . 455
13.15.1 Linear B-Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
13.15.2 Quadratic B-Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 455
13.15.3 Cubic B-Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
13.16 Support Vector Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
13.16.1 Linear Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
13.16.2 General Kernel Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 458
13.17 Bayes Factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 459
13.18 Robust Fitting Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461


13.18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
13.18.2 Robust Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 467

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 481
1 Introduction: Probability and Statistics

Though it is an exaggeration to claim that in our lives only death and taxes
are certain, it is true that the majority of all predictions suffer from
uncertainties. Thus the occupation with probability and statistics is use-
ful for everybody; for scientists of experimental and empirical sciences it is
indispensable.

1.1 The Purpose of Statistics


Whenever we perform an experiment and want to interpret the collected
data, we need statistical tools. The accuracy of measurements is limited by
the precision of the equipment which we use, and thus the results emerge from
a random process. In many cases also the processes under investigation are
of stochastic nature, i.e. not predictable with arbitrary precision, such that
we are forced to present the results in form of estimates with error intervals.
Estimates accompanied by an uncertainty interval allow us to test scientific
hypotheses and by averaging the results of different experiments to improve
continuously the accuracy. It is by this procedure that a constant progress in
experimental sciences and their applications was made possible.
Inferential statistics provides mathematical methods to infer the prop-
erties of a population from a randomly selected sample taken from it. A
population is an arbitrary collection of elements, a sample just a subset of it.
A trivial, qualitative case of an application of statistics in every day life
is the following: To test whether a soup is too salted, we taste it with a
spoon. To obtain a reliable result, we have to stir the soup thoroughly and
the sample contained in the spoon has to be large enough: Samples have to
be representative and large enough to achieve a sufficiently precise estimate
of the properties of the population.
Scientific measurements are subject to the same scheme. Let us look at a
few statistical problems:
1. From the results of an exit poll the allocation of seats among the different
parties in the parliament is predicted. The population is the total of the
votes of all electors, the sample a representative selection from it. It is
relatively simple to compute the distribution of the seats from the results

of the poll, but one wants to know in addition the accuracy of the prog-
nosis, respectively how many electors have to be asked in order to issue a
reasonably precise statement.
2. In an experiment we record the lifetimes of 100 decays of an unstable
nucleus. To determine the mean life of the nucleus, we take the average
of the observed times. Here the uncertainty has its origin in the quantum
mechanical random process. The laws of physics tell us, that the lifetimes
follow a random exponential distribution. The sample is assumed to be
representative of the total of the infinitely many decay times that could
have occurred.
3. From 10 observations the period of a pendulum is to be determined. We
will take as estimate the mean value of the replicates. Its uncertainty has
to be evaluated from the dispersion of the individual observations. The
actual observations form a sample from the infinite number of all possible
observations.
These examples are related to parameter inference. Further statistical
topics are testing, deconvolution, and classification.
4. A bump is observed in a mass distribution. Is it a resonance or just a
background fluctuation?
5. An angular distribution is predicted to be linear in the cosine of the polar
angle. Are the observed data compatible with this hypothesis?
6. It is to be tested whether two experimental setups perform identically. To
this end, measurement samples from both are compared to each other. It
is tested whether the samples belong to the same population, while the
populations themselves are not identified explicitly.
7. A frequency spectrum is distorted by the finite resolution of the detector.
We want to reconstruct the true distribution.
8. In a test beam the development of shower cascades produced by electrons
and pions is investigated. The test samples are characterized by several
variables like penetration depth and shower width. The test samples are
used to develop procedures which predict the identity of unknown parti-
cles from their shower parameters.
A further, very important part of statistics is decision theory. We shall
not cover this topic.

1.2 Random Variable, Variate, Event, Observation and Measurement
Each discipline has its own terminology. The notations used in statistics are
sometimes different from those used by physicists. To avoid confusion, we will
fix the meaning of some terms which we need in the following.

A random variable can take different values according to a stochastic
process. A variate is strictly speaking the realization of a random variable.
However, in many text books variate is used as a short term for random vari-
able, for instance, when multi-variate distributions are discussed. We follow
this habit and denote the outcome of a random variable by random event,
simply event or observation. This could be the decay of a nucleus in a certain
time interval or the coincidence that two pupils of a class have their birthdays
the same day. To distinguish between random variables and their realization,
professional statistics books and publications use capital letters (X) for the
former and small letters (x) for the latter.
As indicated above, a population is the set of all possible events, i.e. all
potential observations. In the natural sciences, ideally experiments can be
repeated infinitely often, thus we usually deal with infinite populations.
When we infer properties, i.e. parameters characterizing the population,
from the sample, we talk about an estimate or a measurement. The decay
times of 10 pion decays correspond to a sample of observations from the
population of all possible events, the decay times. The estimation of the
mean life of pions from the sample is a measurement. An observation as such
– the reading of a meter, a decay time, the number of detected cosmic muons
– has no error associated with it. Its value is fixed by a random process.
On the contrary, the measurement which corresponds to parameter inference
is afflicted with an uncertainty. In many simple situations, observation and
measurement coincide numerically, in other cases the measurement is the
result of an extensive analysis based on a large amount of observations.

1.3 How to Define Probability?


Statistics is at least partially based on experience which is manifest in fields
like deconvolution and pattern recognition. It applies probability theory but
should not be confounded with it. Probability theory, contrary to statistics,
is a purely mathematical discipline and based on simple axioms. On the other
hand, all statistical methods use probability theory. Therefore, we will deal
in the first part of this book with simple concepts and computational rules
of this field.
In statistics, there exist several different notions on what probability
means. In the Dictionary of Statistical Terms of Kendall and Buckland [1]
we find the following definition:
“probability, a basic concept which may be taken as undefinable,
expressing in some way a degree of belief, or as the limiting frequency
in an infinite random series. Both approaches have their difficulties
and the most convenient axiomatization of probability theory is a
matter of personal taste. Fortunately both lead to much the same
calculus of probability.”

We will try to extend this short explanation:


In the frequentist statistics, sometimes also called classical statistics, the
probability of an event, the possible outcome of an experiment, is defined
as the frequency with which it occurs in the limit of an infinite number of
repetitions of the experiment. If in throwing dice the result five occurs with
frequency 1/6 in an infinite number of trials, the probability to obtain five is
defined to be 1/6.
In the more modern, so-called Bayesian statistics 1 this narrow notion of
probability is extended. Probability is also ascribed to fixed but incompletely
known facts and to processes that cannot be repeated. It may be assigned
to deterministic physical phenomena when we lack sufficient information. We
may roll a die and, before looking at the result, state that the result is “5” with
probability 1/6. Similarly, a probability can be attributed to the fact that the
electron mass is located within a certain mass interval. That in the context of
a constant like the electron mass probability statements are applied, is due to
our limited knowledge of the true facts. It would be more correct, but rather
clumsy, to formulate: “The probability that we are right with the proposition
that the electron mass is located in that error interval is such and such.” The
assignment of probabilities sometimes relies on assumptions which cannot be
proved but usually they are well founded on symmetry arguments, physical
laws or on experience2 . The results obviously depend on these assumptions
and can be interpreted only together with those.
The frequentist concept, as compared to the Bayesian one, has the ad-
vantage that additional unprovable assumptions are obsolete, but the dis-
advantage that its field of application is rather restricted. Important parts
of statistics, like deconvolution, pattern recognition and decision theory are
outside its reach. The Bayesian statistics exists in different variants. Its ex-
treme version permits very subjective assignments of probabilities and thus
its results are sometimes vulnerable and useless for scientific applications.
Anyway, these very speculative probabilities do not play a significant role in
the scientific practice.
Both schools, the classical frequentist oriented and the modern Bayesian
have developed important statistical concepts. In most applications the re-
sults are quite similar. A short comparison of the two approaches will be
presented in the appendix. A very instructive and at the same time amusing
article comparing the Bayesian and the frequentist statistical philosophies is
presented in Ref. [2].

1
Thomas Bayes was a mathematician and theologian who lived in the 18th
century.
2
Remark, also probability assignments based on experience have a frequency
background.

For completeness we mention a third classical interpretation of probability
which is appreciated by mathematicians3 : If an experiment has N equally
likely and mutually exclusive outcomes, and if the event A can occur in P of
them, then the probability of event A is equal to P/N . It has the difficulty that
it can hardly be translated into real situations and a slight logical problem in
that the term equally likely already presumes some idea of what probability
means.
Independent of the statistical approach, in order to be able to apply the
results of probability theory, it is necessary that the statistical probability
follows the axioms of the mathematical probability theory, i.e. it has to obey
Kolmogorov’s axioms. For example, probabilities have to be positive and
smaller or equal to one. We will discuss these axioms below.
In this book we will adopt a moderately Bayesian point of view. This
means that in some cases we will introduce sensible assumptions without
being able to prove their validity. However, we will establish fixed, simple
rules that have to be applied in data analysis. In this way we achieve an
objective parametrization of the data. This does not exclude that in some
occasions as in goodness-of-fit tests we favor methods of frequentist statistics.

1.4 Assignment of Probabilities to Events

The mathematician assumes that the assignment of probabilities to events


exists. To achieve practical, useful results in the natural sciences, in sociol-
ogy, economics or medicine, statistical methods are required and a sensible
assignment has to be made.
There are various possibilities to do so:
– Symmetry properties are frequently used to assign equal probabilities to
events. This is done in gambling, examples are rolling dice, roulette and
card games. The isotropy of space predicts equal probabilities for radiation
from a point source into different directions.
– Laws of nature like the Boltzmann’s law of thermodynamics, the expo-
nential decay law of quantum mechanics or Mendel’s laws allow us to
calculate the probabilities for certain events.
– From the observed frequencies of certain events in empirical studies we
can estimate their probabilities, like those of female and male births, of
muons in cosmic rays, or of measurement errors in certain repeatable
experiments. Here we derive frequencies from a large sample of observa-
tions from which we then derive with sufficient accuracy the probability
of future events.

3
For two reasons: The proof that the Kolmogorov’s axioms are fulfilled is rather
easy, and the calculation of the probability for complex events is possible by
straightforward combinatorics.

– In some situations we are left with educated guesses or we rely on the
opinion of experts, when for example the weather is to be predicted or
the risk of an accident of a new oil-ship has to be evaluated.
– In case of absolute ignorance often a uniform probability distribution is
assumed. This is known as Bayes’ postulate. When we watch a tennis
game and do not know the players, we might assign equal probabilities of
winning to player A and B.
To illustrate the last point in a more scientific situation, let us look at a
common example in particle physics:

Example 1. Uniform prior for a particle mass


Before a precise measurement of a particle mass is performed, we only
know that a particle mass m lies between the values m1 and m2 . We may
assume initially that all values of the mass inside the interval are equally
likely. Then the a priori (or prior) probability P {m0 ≤ m < m2 } that m is
larger than m0 , with m0 located between m1 and m2 , is equal to:

P {m0 ≤ m < m2 } = (m2 − m0 )/(m2 − m1 ) .
This assertion relies on the assumption of a uniform distribution of the mass
within the limits and is obviously assailable, because, had we assumed – with
equal right – a uniform distribution for the mass squared, we would have
obtained a different result:

P {m0² ≤ m² < m2²} = (m2² − m0²)/(m2² − m1²) ≠ P {m0 ≤ m < m2 } .

Of course, the difference is small if the interval is small, m2 − m1 ≪ m, for
then we have:

(m2² − m0²)/(m2² − m1²) = (m2 − m0)/(m2 − m1) × (m2 + m0)/(m2 + m1)
                        = (m2 − m0)/(m2 − m1) × ( 1 + (m0 − m1)/(m2 + m1) )
                        ≈ (m2 − m0)/(m2 − m1) .

When the Z0 mass and its error were determined, a uniform prior proba-
bility in the mass was assumed. If instead a uniform probability in the mass
squared had been used, the result would have changed only by about 10⁻³
times the uncertainty of the mass determination. This means that applying
Bayes’ assumption to either the mass or the mass squared makes no difference
within the precision of the measurement in this specific case.
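The size of this effect is easy to check numerically. The sketch below compares the two prior assignments; the interval endpoints m1, m2 and the threshold m0 are made-up values chosen near the Z0 mass only for illustration:

```python
# Probability P{m0 <= m < m2} under two different uniform prior assumptions.

def prob_uniform_in_mass(m0, m1, m2):
    # prior flat in the mass m
    return (m2 - m0) / (m2 - m1)

def prob_uniform_in_mass_squared(m0, m1, m2):
    # prior flat in the mass squared m^2
    return (m2**2 - m0**2) / (m2**2 - m1**2)

# Hypothetical narrow mass window (arbitrary units), m2 - m1 << m:
m1, m2 = 91.0, 91.2
m0 = 91.1
p_m = prob_uniform_in_mass(m0, m1, m2)
p_m2 = prob_uniform_in_mass_squared(m0, m1, m2)
print(p_m, p_m2, abs(p_m - p_m2))  # the difference is tiny for a narrow window
```

For this narrow window the two probabilities agree to a few parts in 10⁴, in line with the statement about the Z0 mass above; widening the interval makes the prior dependence visible.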
In other situations prior probabilities which we will discuss in detail in
Chap. 6 can have a considerable influence on a result.

1.5 Outline of this Book


After a short chapter on probability axioms and theorems we present prop-
erties of probability distributions in Chapter 3, and its application to simple
error calculus and Monte Carlo simulation in Chapters 4 and 5.
The statistics part starts in Chapters 6 and 7 with point estimation fol-
lowed by interval estimation, Chapter 8.
Chapter 9 deals with deconvolution problems.
In Chapter 10 significance and goodness-of-fit tests are discussed.
Chapter 11 with the title Statistical Learning summarizes some approxi-
mation and classification techniques.
In Chapter 12 a short introduction into probability density estimation
and bootstrap techniques is given.
Finally, the Appendix contains some useful mathematical or technical
objects, introduces important frequentist concepts and theorems and presents
a short comparison of the different statistical approaches.

Recommendations for Ancillary Literature


- The standard book of Kendall and Stuart “The Advanced Theory of Statis-
tics” [3], consisting of several volumes, provides an excellent and rather com-
plete presentation of classical statistics with all necessary proofs and many
references. It is a sort of Bible of conservative statistics, well suited to look
up specific topics. Modern techniques, like Monte Carlo methods are not
included.
- The books of Brandt “Data Analysis” [4] and Frodesen et al. “Proba-
bility and Statistics in Particle Physics” [5] give a pretty complete overview
of the standard statistical methods as used by physicists.
- For an introduction to statistics for physicists we highly recommend the
book by Barlow [6].
- Very intuitive and also well suited for beginners is the book by Lyons [7],
“Statistics for Nuclear and Particle Physicists”. It reflects the large practical
experience of the author.
- Larger, very professional and more ambitious is the book of Eadie et
al. “Statistical Methods in Experimental Physics” [8], also intended mainly
for particle and nuclear physicists and written by particle physicists and
statisticians. A new edition has appeared recently [9]. Modern techniques of
data analysis are not discussed.

- Recently a practical guide to data analysis in high energy physics [10] has
been published. The chapters are written by different experienced physicists
and reflect the present state of the art. As is common in statistics publi-
cations, some parts are slightly biased by the personal preferences of the
authors. More specialized is [11] which emphasizes particularly probability
density estimation and machine learning.
- Very useful especially for the solution of numerical problems is a book
by Blobel and Lohrman “Statistische und numerische Methoden der Daten-
analyse” [12] written in German.
- Other useful books written by particle physicists are found in Refs. [13,
14, 15]. The book by Roe is more conventional while Cowan and D’Agostini
favor a moderate Bayesian view.
- Modern techniques of statistical data analysis are presented in a book
written by professional statisticians for non-professional users, Hastie et al.
“The Elements of Statistical Learning” [16].
- A modern professional treatment of Bayesian statistics is the textbook
by Box and Tiao “Bayesian Inference in Statistical Analysis” [17].
The interested reader will find work on the foundations of statistics, on
basic principles and on the standard theory in the following books:
- Fisher’s book [18] “Statistical Method, Experimental Design and Scien-
tific Inference” provides an interesting overview of his complete work.
- Edward’s book “Likelihood” [19] stresses the importance of the likelihood
function, contains many useful references and the history of the Likelihood
Principle.
- Many basic considerations and a collection of personal contributions
from a moderate Bayesian view are contained in the book “Good Thinking”
by Good [20]. A collection of work by Savage [21] presents a more extreme
Bayesian point of view.
- Somewhat old-fashioned textbooks of Bayesian statistics which are of
historical interest are the books of Jeffreys [22] and Savage [23].
Recent statistical work by particle physicists and astrophysicists can be
found in the proceedings of the PHYSTAT Conferences [24] held during the
past few years. Many interesting and well written articles can be found also
in the internet.
This personal selection of literature is obviously in no way exhaustive.
2 Basic Probability Relations

2.1 Random Events and Variables


Events are processes or facts that are characterized by a specific property,
like obtaining a “3” with a die. A goal in a soccer game or the existence of
fish in a lake can be random events. Events can also be complex facts, like
rolling a die twice with the results three and five or the occurrence
of a certain mean value in a series of measurements or the estimate of the
parameter of a theory. There are elementary events, which mutually exclude
each other but also events that correspond to a class of elementary events,
like the result greater than three when throwing a die. We are concerned with
random events which emerge from a stochastic process as already introduced
above.
When we consider several events, then there are events which exclude each
other and events which are compatible. We stick to our standard example, the
die. The elementary events three and five exclude each other, the events greater
than two and five are of course compatible. Another common example: We
select an object from a bag containing blue and red cubes and spheres. Here
the events sphere and cube exclude each other, the events sphere and red may
be compatible.
The event Ā is called the complement of event A if either event A or event
Ā applies, but not both at the same time (exclusive or ). Complementary to
the event three in the dice example is the event less than three or larger than
three (inclusive or ). Complementary to the event red sphere is the event cube
or blue sphere.
The event consisting of the fact that an arbitrary event out of all possible
events applies, is called the certain event. We denote it with Ω. The com-
plementary event is the impossible event, that none of all considered events
applies: It is denoted with ∅, thus ∅ = Ω̄.
Some further definitions are useful:
Definition 1: A ∪ B means A or B.
The event A ∪ B has the attributes of event A or event B or those of
both A and B (inclusive or ). (The attribute cube ∪ red corresponds to the
complementary event blue sphere.)
Definition 2: A ∩ B means A and B.

The event A ∩ B has the attributes of event A as well as those of event
B. (The attribute cube ∩ red corresponds to red cube.) If A ∩ B = ∅, then A,
B mutually exclude each other.
Definition 3: A ⊂ B means that A implies B.
It is equivalent to both A ∪ B = B and A ∩ B = A.
From these definitions follow the trivial relations

A ∪ Ā = Ω ,  A ∩ Ā = ∅ ,
and
∅ ⊂ A ⊂ Ω . (2.1)

For any A, B we have De Morgan’s laws: the complement of A ∪ B equals
Ā ∩ B̄ , and the complement of A ∩ B equals Ā ∪ B̄ .
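These relations are easy to verify on finite sets. A minimal check in Python; the universe Ω and the events A, B below are arbitrary toy choices:

```python
# Verify complement and De Morgan relations on a small universe.
omega = set(range(1, 7))    # e.g. the six faces of a die
A = {1, 2, 3}               # some event
B = {3, 4}                  # another event
A_bar = omega - A           # complement of A
B_bar = omega - B           # complement of B

assert A | A_bar == omega              # A ∪ Ā = Ω
assert A & A_bar == set()              # A ∩ Ā = ∅
assert omega - (A | B) == A_bar & B_bar   # De Morgan: complement of A ∪ B
assert omega - (A & B) == A_bar | B_bar   # De Morgan: complement of A ∩ B
print("all relations hold")
```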

To the random event A we associate the probability P {A} as discussed
above. In all practical cases random events can be identified with a variable,
the random variable or in short variate following [19]. Examples for variates
are the decay time in particle decay, the number of cosmic muons penetrating
a body in a fixed time interval and measurement errors. When the random
events involve values that cannot be ordered, like shapes or colors, then they
can be associated with classes or categorical variates.

2.2 Probability Axioms and Theorems


2.2.1 Axioms

The assignment of probabilities P {A} to members A, B, C, ... of a set of
events has to satisfy the following axioms1 . Only then the rules of probability
theory are applicable.
– Axiom 1 0 ≤ P {A}
The probability of an event is a positive real number.
– Axiom 2 P {Ω} = 1
The probability of the certain event is one.
– Axiom 3 P {A ∪ B} = P {A} + P {B} if A ∩ B = ∅
The probability that A or B applies is equal to the sum of the probabilities
that A or that B applies, if the events A and B are mutually exclusive.
These axioms and definitions imply the following theorems whose validity
is rather obvious. They can be illustrated with so-called Venn diagrams, Fig.
2.1. There the areas of the ellipses and their intersection are proportional to
the probabilities.
1
They are called Kolmogorov axioms, after the Russian mathematician
A. N. Kolmogorov (1903-1987).

P {Ā} = 1 − P {A} ,  P {∅} = 0 ,
P {A ∪ B} = P {A} + P {B} − P {A ∩ B} ,
A ⊂ B ⇒ P {A} ≤ P {B} . (2.2)
Relation (2.1) together with (2.2) and axioms 1, 2 imply 0 ≤ P {A} ≤ 1. For
arbitrary events we have
P {A ∪ B} ≥ P {A} , P {B} ; P {A ∩ B} ≤ P {A} , P {B} .
If all events with the attribute A possess also the attribute B, A ⊂ B, then
we have P {A ∩ B} = P {A}, and P {A ∪ B} = P {B}.

2.2.2 Conditional Probability, Independence, and Bayes’ Theorem


In the following we need two further definitions:
Definition: P {A | B} is the conditional probability of event A under the
condition that B applies. It is given, as is obvious from Fig. 2.1, by:

P {A | B} = P {A ∩ B} / P {B} ,  P {B} ≠ 0 . (2.3)
A conditional probability is, for example, the probability to find a sphere
among the red objects. The notation A | B expresses that B is considered as
fixed, while A is the random event to which the probability refers. Contrary
to P {A}, which refers to arbitrary events A, we require that also B is valid
and therefore P {A ∩ B} is normalized to P {B}.
Among the events A | B the event A = B is the certain event, thus
P {B | B} = 1. More generally, from definition 3 of the last section and (2.3)
follows P {A|B} = 1 if B implies A:
B ⊂ A ⇒ A ∩ B = B ⇒ P {A|B} = 1 .
Definition: If P {A ∩ B} = P {A} × P {B}, the events A and B (more
precisely: the probabilities for their occurrence) are independent.
From (2.3) then follows P {A | B} = P {A}, i.e. the conditioning on B is
irrelevant for the probability of A. Likewise P {B | A} = P {B}.
In Relation (2.3) we can exchange A and B and thus P {A | B}P {B} =
P {A ∩ B} = P {B | A}P {A} and we obtain the famous Bayes’ theorem:
P {A | B}P {B} = P {B | A}P {A} . (2.4)
Bayes’ theorem is frequently used to relate the conditional probabilities
P {A | B} and P {B | A}, and, as we will see, is of some relevance in parameter
inference.
The following simple example illustrates some of our definitions. It as-
sumes that each of the considered events is composed of a certain number of
elementary events which mutually exclude each other and which because of
symmetry arguments all have the same probability.

Fig. 2.1. Venn diagram.

Example 2. Card game, independent events


The following table summarizes some probabilities for randomly selected
cards from a card set consisting of 32 cards and 4 colors.

P {king}: 4/32 = 1/8 (prob. for king)
P {heart}: 1/4 (prob. for heart)
P {heart ∩ king}: 1/8 · 1/4 = 1/32 (prob. for heart king)
P {heart ∪ king}: 1/8 + 1/4 − 1/32 = 11/32 (prob. for heart or king)
P {heart | king}: 1/4 (prob. for heart if king)

The probabilities P {heart} and P {heart | king} are equal as required
from the independence of the events A and B.
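The table can be reproduced by enumerating the 32-card deck; the suit and rank labels below are chosen only for illustration:

```python
from fractions import Fraction

# Build a 32-card deck: 4 suits x 8 ranks (7, 8, 9, 10, J, Q, K, A).
suits = ["heart", "diamond", "spade", "club"]
ranks = ["7", "8", "9", "10", "J", "Q", "K", "A"]
deck = [(s, r) for s in suits for r in ranks]

def prob(event):
    """Probability of an event (a subset of the deck) for one random draw."""
    return Fraction(len(event), len(deck))

kings = [c for c in deck if c[1] == "K"]
hearts = [c for c in deck if c[0] == "heart"]
heart_kings = [c for c in deck if c in kings and c in hearts]

assert prob(kings) == Fraction(1, 8)
assert prob(hearts) == Fraction(1, 4)
assert prob(heart_kings) == prob(kings) * prob(hearts)   # independence
# P{heart | king} = P{heart ∩ king} / P{king} = P{heart}
assert prob(heart_kings) / prob(kings) == prob(hearts)
```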

The following example illustrates how we make use of independence.

Example 3. Random coincidences, measuring the efficiency of a counter


When we want to measure the efficiency of a particle counter (1), we
combine it with a second counter (2) in such a way that a particle beam
crosses both detectors. We record the number of events n1 , n2 in the two
counters and in addition the number of coincidences n1∩2 . The corresponding
efficiencies relate these numbers, ignoring the statistical fluctuations of the
observations, to the unknown number of particles n crossing the detectors.
n1 = ε1 n , n2 = ε2 n , n1∩2 = ε1∩2 n .
For independent counting efficiencies we have ε1∩2 = ε1 ε2 and we get

ε1 = n1∩2 /n2 ,  ε2 = n1∩2 /n1 ,  n = n1 n2 /n1∩2 .
This scheme is used in many analogous situations.
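The relations above translate directly into code. The count numbers used below are made-up values, chosen so the result can be checked by eye:

```python
def counter_efficiencies(n1, n2, n12):
    """Efficiencies of two counters and the estimated number of particles n,
    assuming independent counters and ignoring statistical fluctuations."""
    eps1 = n12 / n2          # efficiency of counter 1
    eps2 = n12 / n1          # efficiency of counter 2
    n = n1 * n2 / n12        # estimated number of particles crossing
    return eps1, eps2, n

# Hypothetical counts: counter 1 fired 900 times, counter 2 fired 800 times,
# and 720 coincidences were recorded.
eps1, eps2, n = counter_efficiencies(900, 800, 720)
print(eps1, eps2, n)  # 0.9 0.8 1000.0
```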

Bayes’ theorem is applied in the next two examples, where the attributes
are not independent.

Example 4. Bayes’ theorem, fraction of women among students


From the proportion of students and women in the population and the
fraction of students among women we compute the fraction of women among
students:

P {A} = 0.02 (fraction of students in the population)
P {B} = 0.5 (fraction of women in the population)
P {A | B} = 0.018 (fraction of students among women)
P {B | A} = ? (fraction of women among students)

The dependence of the events A and B manifests itself in the difference
of P {A} and P {A | B}. Applying Bayes’ theorem we obtain

P {B | A} = P {A | B}P {B} / P {A} = (0.018 · 0.5)/0.02 = 0.45 .
About 45% of the students are women.

Example 5. Bayes’ theorem, beauty filter


The probability P {A} that beauty quark production occurs in a colliding
beam reaction be 0.0001. A filter program selects beauty reactions A with
efficiency P {b | A} = 0.98, and the probability that it falsely assumes that
beauty is present if it is not be P {b | Ā} = 0.01. What is the probability
P {A | b} to have genuine beauty production in a selected event? To solve the
problem, first the probability P {b} that a random event is selected has to be
evaluated,

P {b} = P {b} ( P {A | b} + P {Ā | b} )
      = P {b | A}P {A} + P {b | Ā}P {Ā} ,

where the bracket in the first line is equal to 1. In the second line Bayes’
theorem is applied. Applying it once more, we get

P {A | b} = P {b | A}P {A} / P {b}
          = P {b | A}P {A} / ( P {b | A}P {A} + P {b | Ā}P {Ā} )
          = (0.98 · 0.0001)/(0.98 · 0.0001 + 0.01 · 0.9999) = 0.0097 .
About 1% of the selected events corresponds to b quark production.
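Both Bayes examples follow the same pattern and can be wrapped in a small helper; the function name is ours, and the numbers are those of the examples:

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P{A | b} from Bayes' theorem, with the sample space split into A and not-A."""
    p_not_a = 1.0 - p_a
    # total probability that the filter selects a random event
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Example 5: beauty filter
p_beauty = bayes_posterior(0.98, 0.0001, 0.01)
print(round(p_beauty, 4))  # 0.0097

# Example 4 uses the theorem directly: P{B | A} = P{A | B} P{B} / P{A}
p_women = 0.018 * 0.5 / 0.02
print(round(p_women, 2))  # 0.45
```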

Bayes’ theorem is rather trivial, thus the results of the last two examples
could have easily been written down without referring to it.
3 Probability Distributions and their Properties

A probability distribution assigns probabilities to random variables. As an
example we show in Fig. 3.1 the distribution of the sum s of the points
obtained by throwing three ideal dice. Altogether there are 6³ = 216 different
combinations. The random variable s takes values between 3 and 18 with
different probabilities. The sum s = 6, for instance, can be realized in 10
different ways, all of which are equally probable. Therefore the probability
for s = 6 is P {s = 6} = 10/216 ≈ 4.6 %. The distribution is symmetric with
respect to its mean value 10.5. It is restricted to discrete values of s, namely
natural numbers.
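The 216 combinations can be enumerated directly, which reproduces the probability quoted above:

```python
from itertools import product
from collections import Counter

# Enumerate all 6^3 = 216 outcomes of three dice and count each sum.
counts = Counter(a + b + c for a, b, c in product(range(1, 7), repeat=3))

assert sum(counts.values()) == 216
assert counts[6] == 10            # ten equally probable ways to obtain s = 6
print(counts[6] / 216)            # P{s = 6} ~ 4.6 %
```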
In our example the random variable is discrete. In other cases the random
variables are continuous. Then the probability for any fixed value is zero, we
have to describe the distribution by a probability density and we obtain a
finite probability when we integrate the density over a certain interval.
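The distribution of Fig. 3.1 can be reproduced by enumerating all 216 combinations; a short sketch:

```python
from collections import Counter
from itertools import product

# Enumerate all 6^3 = 216 outcomes of three ideal dice and count each sum.
counts = Counter(sum(faces) for faces in product(range(1, 7), repeat=3))
total = 6 ** 3

print(counts[6], total)             # 10 216
print(round(counts[6] / total, 3))  # 0.046, i.e. about 4.6 %
# the distribution is symmetric about 10.5: P(s) = P(21 - s)
assert all(counts[s] == counts[21 - s] for s in range(3, 19))
```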

Fig. 3.1. Probability distribution of the sum of the points of three dice.

3.1 Definition of Probability Distributions

We define a distribution function, also called cumulative or integral
distribution function, F (t), which specifies the probability P to find a value of x
smaller than t:

F (t) = P {x < t} , with − ∞ < t < ∞ .


The probability axioms require the following properties of the distribution
function:
a) F (t) is a non-decreasing function of t ,
b) F (−∞) = 0 ,
c) F (∞) = 1 .
We distinguish between
– Discrete distributions (Fig. 3.2)
– Continuous distributions (Fig. 3.3)

3.1.1 Discrete Distributions

If not specified differently, we assume in the following that discrete
distributions assign probabilities to an enumerable set of different events, which are
characterized by an ordered, real number xi , with i = 1, . . . , N , where N may
be finite or infinite. The probabilities p(xi ) to observe the values xi satisfy
the normalization condition:
∑_{i=1}^{N} p(xi ) = 1 .

It is defined by

p(xi ) = P {x = xi } = F (xi + ε) − F (xi − ε) ,

with ε positive and smaller than the distance to neighboring variate values.

Example 6. Discrete probability distribution (dice)


For a fair die, the probability to throw a certain number k is just one-sixth:
p(k) = 1/6 for k = 1, 2, 3, 4, 5, 6.

It is possible to treat discrete distributions with the help of Dirac’s δ-


function like continuous ones. Therefore we will often consider only the case
of continuous variates.

Fig. 3.2. Discrete probability distribution and distribution function.

3.1.2 Continuous Distributions

We replace the discrete probability distribution by a probability density1 f (x),
abbreviated as p.d.f. (probability density function). It is defined as follows:

dF (x)
f (x) = . (3.1)
dx
Note that the p.d.f. is defined in the full range −∞ < x < ∞. It may
be zero in certain regions.
It has the following properties:
a) f (−∞) = f (+∞) = 0 ,
b) ∫_{−∞}^{∞} f (x) dx = 1 .
The probability P {x1 ≤ x ≤ x2 } to find the random variable x in the
interval [x1 , x2 ] is given by
P {x1 ≤ x ≤ x2 } = F (x2 ) − F (x1 ) = ∫_{x1}^{x2} f (x) dx .

1 We will, however, use the notations probability distribution and distribution for
discrete as well as for continuous distributions.

Fig. 3.3. Probability density and distribution function of a continuous distribution.

We will discuss specific distributions in Sect. 3.6 but we introduce two com-
mon distributions already here. They will serve us as examples in the following
sections.

Example 7. Probability density of an exponential distribution


The decay time t of an unstable particle follows an exponential
distribution with the p.d.f.
f (t) ≡ f (t|λ) = λe−λt for t ≥ 0 , (3.2)
where the parameter2 λ > 0, the decay constant, is the inverse of the mean
lifetime τ = 1/λ. The probability density and the distribution function
F (t) = ∫_{−∞}^{t} f (t′ ) dt′ = 1 − e^{−λt}

are shown in Fig. 3.4. The probability of observing a lifetime longer than τ
is
P {t > τ } = F (∞) − F (τ ) = e−1 .
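The statement P {t > τ } = e⁻¹ is easy to verify by sampling; a small Monte Carlo sketch (the value of τ is arbitrary, the result does not depend on it):

```python
import math
import random

random.seed(1)

# Monte Carlo check of P{t > tau} = 1/e for the exponential distribution.
# tau = 2.0 is an illustrative value; the result does not depend on it.
tau = 2.0
lam = 1.0 / tau  # decay constant lambda

n = 200_000
decays = [random.expovariate(lam) for _ in range(n)]
frac = sum(t > tau for t in decays) / n

print(round(frac, 3), round(math.exp(-1), 3))  # both close to 0.368
```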

Fig. 3.4. Probability density and distribution function of an exponential distribution.

Example 8. Probability density of the normal distribution


An oxygen atom is drifting in argon gas, driven by thermal scattering.
It starts at the origin. After a certain time its position is (x, y, z). Each
projection, for instance x, has approximately a normal distribution (see Fig.
3.5), also called Gauss distribution.

f (x) = N (x|0, s) ,
N (x|x0 , s) = 1/(√(2π) s) · e^{−(x−x0 )²/(2s²)} . (3.3)

The width constant s is, as will be shown later, proportional to the square
root of the number of scattering processes or the square root of time. When
we descend by the factor 1/√e from the maximum, the full width is just 2s. A
statistical drift motion, or more generally a random walk, is met frequently in

Fig. 3.5. Normal distribution.

science and also in everyday life. The normal distribution also describes
approximately the motion of snow flakes or the erratic movements of a drunkard
in the streets.

3.1.3 Empirical Distributions

Many processes are too complex or not well enough understood to be de-
scribed by a distribution in form of a simple algebraic formula. In these cases
it may be useful to approximate the underlying distribution using an
experimental data sample. The simplest way to do this is to histogram the
observations and to normalize the frequency histogram. More sophisticated
methods of probability density estimation will be sketched in Chap. 12. The
quality of the approximation depends of course on the available number of
observations.

3.2 Expected Values

In this section we will consider some general characteristic quantities of
distributions, like mean value, width, and asymmetry or skewness. Before in-
troducing the calculation methods, we turn to the general concept of the
expected value.
The expected value E(u) of a quantity u(x), which depends on the random
variable x, can be obtained by collecting infinitely many random values xi

from the distribution f (x), calculating ui = u(xi ), and then averaging over
these values. Obviously, we have to assume the existence of such a limiting
value.
In quantum mechanics, expected values of physical quantities are the main
results of theoretical calculations and experimental investigations, and pro-
vide the connection to classical mechanics. Also in statistical mechanics and
thermodynamics the calculation of expected values is frequently needed. We
can, for instance, calculate from the velocity distribution of gas molecules
the expected value of their kinetic energy, that means essentially their tem-
perature. In probability theory and statistics expected values play a central
role.

3.2.1 Definition and Properties of the Expected Value

Definition:

E(u(x)) = ∑_i u(xi ) p(xi )   (discrete distribution) , (3.4)

E(u(x)) = ∫_{−∞}^{∞} u(x) f (x) dx   (continuous distribution) . (3.5)

Here and in what follows, we assume the existence of integrals and sums.
This condition restricts the choice of the allowed functions u, p, f .
From the definition of the expected value follow the relations (c is a con-
stant, u, v are functions of x):

E(c) = c, (3.6)
E(E(u)) = E(u), (3.7)
E(u + v) = E(u) + E(v), (3.8)
E(cu) = cE(u) . (3.9)

They characterize E as a linear functional.


For independent (see also Chap. 2 and Sect. 3.5) variates x, y the following
important relation holds:

E (u(x)v(y)) = E(u)E(v) . (3.10)

Often expected values are denoted by angular brackets:

E(u) ≡ ⟨u⟩ .

Sometimes this simplifies the appearance of the formulas. We will use both
notations.

3.2.2 Mean Value

The expected value of the variate x is also called the mean value. It can be
visualized as the center of gravity of the distribution. Usually it is denoted
by the Greek letter µ. Both names, mean value, and expected value3 of the
corresponding distribution are used synonymously.
Definition:

E(x) ≡ ⟨x⟩ = µ = ∑_i xi p(xi )   (discrete distribution) ,

E(x) ≡ ⟨x⟩ = µ = ∫_{−∞}^{∞} x f (x) dx   (continuous distribution) .

The mean value of the exponential distribution (3.2) is

⟨t⟩ = ∫_0^{∞} λt e^{−λt} dt = 1/λ = τ .

We will distinguish ⟨x⟩ from the average value of a sample, consisting of
a finite number N of variate values, x1 , . . . , xN , which will be denoted by x̄:

x̄ = (1/N ) ∑_i xi .

It is called sample mean. It is a random variable and has the expected value

⟨x̄⟩ = (1/N ) ∑_i ⟨xi ⟩ = ⟨x⟩ ,

as follows from (3.8), (3.9).

3.2.3 Variance

The variance σ 2 measures the spread of a distribution, defined as the mean


quadratic deviation of the variate from its mean value. Usually, we want
to know not only the mean value of a stochastic quantity, but require also
information on the dispersion of the individual random values relative to it.
When we buy a laser, we are of course interested in its mean energy per
pulse, but also in the variation of the single energies around that mean value.
The mean value alone does not provide information about the shape of a
distribution. The mean height with respect to sea level of Switzerland is about
700 m, but this alone does not say much about the beauty of that country,
which, to a large degree, depends on the spread of the height distribution.
3 The notation expected value is somewhat misleading, as the probability to
obtain it can be zero (see the example “dice” in Sect. 3.2.7).

The square root σ of the variance is called standard deviation, and is the
standard measure of stochastic uncertainties.
A mechanical analogy to the variance is the moment of inertia for a mass
distribution along the x-axis for a total mass equal to unity.
Definition:

var(x) = σ² = E((x − µ)²) .

From this definition follows immediately

var(cx) = c² var(x) ,

and σ/µ is independent of the scale of x.
Very useful is the following expression for the variance which is easily
derived from its definition and (3.8), (3.9):

σ² = E(x² − 2xµ + µ²)
   = E(x²) − 2µ² + µ²
   = E(x²) − µ² .

Sometimes this is written more conveniently as

σ² = ⟨x²⟩ − ⟨x⟩² = ⟨x²⟩ − µ² . (3.11)

In analogy to Steiner’s theorem for moments of inertia, we have

⟨(x − a)²⟩ = ⟨(x − µ)²⟩ + ⟨(µ − a)²⟩ = σ² + (µ − a)² ,

implying (3.11) for a = 0.
The variance is invariant against a translation of the distribution by a:

x → x + a , µ → µ + a ⇒ σ² → σ² .

Variance of a Sum of Random Numbers

Let us calculate the variance σ² for the distribution of the sum x of two
independent random numbers x1 and x2 , which follow different distributions
with mean values µ1 , µ2 and variances σ1², σ2²:

x = x1 + x2 ,

σ² = ⟨(x − ⟨x⟩)²⟩
   = ⟨((x1 − µ1 ) + (x2 − µ2 ))²⟩
   = ⟨(x1 − µ1 )² + (x2 − µ2 )² + 2(x1 − µ1 )(x2 − µ2 )⟩
   = ⟨(x1 − µ1 )²⟩ + ⟨(x2 − µ2 )²⟩ + 2⟨x1 − µ1 ⟩⟨x2 − µ2 ⟩
   = σ1² + σ2² .

In the fourth step the independence of the variates (3.10) was used.
This result is important for all kinds of error estimation. For a sum of
two independent measurements, their standard deviations add quadratically.
We can generalize the last relation to a sum x = ∑ xi of N variates or
measurements:

σ² = ∑_{i=1}^{N} σi² . (3.12)

Example 9. Variance of the convolution of two distributions


We consider a quantity x with the p.d.f. g(x) with variance σg² which is
measured with a device which produces a smearing with a p.d.f. h(y) with
variance σh². We want to know the variance of the “smeared” value x′ = x + y.
According to (3.12), this is the sum of the variances of the two p.d.f.s:

σ² = σg² + σh² .

Variance of the Sample Mean of Independent Identically
Distributed Variates

From the last relation we obtain the variance σx̄² of the sample mean x̄ from
N independent random numbers xi , which all follow the same distribution4
f (x), with expected value µ and variance σ²:

x̄ = ∑_{i=1}^{N} xi /N ,

var(N x̄) = N var(x) = N σ² ,

σx̄ = σ/√N . (3.13)
The last two relations (3.12), (3.13) have many applications, for instance
in random walk, diffusion, and error propagation. The root mean square
distance reached by a diffusing molecule after N scatters is proportional to
√N and therefore also to √t, t being the diffusion time. The total length
of 100 aligned objects, all having the same standard deviation σ of their
nominal length, will have a standard deviation of only 10 σ. To a certain
degree, random fluctuations compensate each other.
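Relation (3.13) can be checked by simulation; a sketch with illustrative values of N and σ:

```python
import random
import statistics

random.seed(42)

# Monte Carlo check of (3.13): the sample mean of N i.i.d. variates has
# standard deviation sigma / sqrt(N).  N, trials and sigma are illustrative.
N = 25
trials = 20_000
sigma = 1.0

means = [statistics.fmean(random.gauss(0.0, sigma) for _ in range(N))
         for _ in range(trials)]

observed = statistics.stdev(means)   # spread of the sample means
expected = sigma / N ** 0.5          # sigma / sqrt(N)

print(round(observed, 3), expected)
```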

4 The usual abbreviation is i.i.d. variates for independent identically distributed.

Width v of a Sample and Variance of the Distribution

Often, as we will see in Chap. 6, a sample is used to estimate the variance


σ² of the underlying distribution. In case the mean value µ is known, we
calculate the quantity

vµ² = (1/N ) ∑_i (xi − µ)²

which has the correct expected value ⟨vµ²⟩ = σ². Usually, however, the true
mean value µ is unknown – except perhaps in calibration measurements –
and must be estimated from the same sample as is used to derive vµ². We
then are obliged to use the sample mean x̄ instead of µ and calculate the
mean quadratic deviation v² of the sample values relative to x̄. In this case
the expected value of v² will depend not only on σ, but also on N . In a first
step we find

v² = (1/N ) ∑_i (xi − x̄)²
   = (1/N ) ∑_i (xi² − 2xi x̄ + x̄²)
   = (1/N ) ∑_i xi² − x̄² . (3.14)

To calculate the expected value, we use (3.11) and (3.13),

⟨x²⟩ = σ² + µ² ,
⟨x̄²⟩ = var(x̄) + ⟨x̄⟩² = σ²/N + µ² ,

and get with (3.14)

⟨v²⟩ = ⟨x²⟩ − ⟨x̄²⟩ = σ² (1 − 1/N ) ,

σ² = N ⟨v²⟩/(N − 1) = ⟨∑_i (xi − x̄)²⟩/(N − 1) . (3.15)

The expected value of the mean squared deviation ⟨v²⟩ is smaller than the
variance of the distribution by a factor of (N − 1)/N .
The relation (3.15) is widely used for the estimation of measurement er-
rors, when several independent measurements are available. The variance σx̄²
of the sample mean x̄ itself is approximated, according to (3.13), by

v²/(N − 1) = ∑_i (xi − x̄)² / (N (N − 1)) .
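The factor (N − 1)/N is visible already for small samples; a Monte Carlo sketch (standard normal variates, illustrative sample size):

```python
import random
import statistics

random.seed(7)

# Check the bias behind (3.15): the mean squared deviation v^2 about the
# sample mean underestimates sigma^2 by the factor (N - 1)/N.
N = 5
trials = 100_000
true_var = 1.0  # variance of the sampled standard normal

v2_sum = 0.0
for _ in range(trials):
    sample = [random.gauss(0.0, 1.0) for _ in range(N)]
    xbar = statistics.fmean(sample)
    v2_sum += sum((x - xbar) ** 2 for x in sample) / N

v2_mean = v2_sum / trials
print(round(v2_mean, 2), (N - 1) / N * true_var)  # both close to 0.8
```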

Mean Value and Variance of a Superposition of two Distributions

Frequently a distribution consists of a superposition of elementary distribu-
tions. Let us compute the mean µ and variance σ 2 of a linear superposition
of two distributions

f (x) = αf1 (x) + βf2 (x) , α + β = 1 ,

where f1 , f2 may have different mean values µ1 , µ2 and variances σ1², σ2²:

µ = αµ1 + βµ2 ,

σ² = E((x − E(x))²)
   = E(x²) − µ²
   = αE1 (x²) + βE2 (x²) − µ²
   = α(µ1² + σ1²) + β(µ2² + σ2²) − µ²
   = ασ1² + βσ2² + αβ(µ1 − µ2 )² .

Here, E, E1 , E2 denote expected values related to the p.d.f.s f , f1 , f2 . In


the last step the relation α + β = 1 has been used. Of course, the width
increases with the distance of the mean values. The result for σ 2 could have
been guessed by considering the limiting cases (µ1 = µ2 , σ1 = σ2 = 0).
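A quick Monte Carlo check of the two formulas for a mixture of two normal distributions (all parameters are illustrative):

```python
import random
import statistics

random.seed(3)

# Sample from f = alpha*f1 + beta*f2 (two normals, illustrative parameters)
# and compare with mu = alpha*mu1 + beta*mu2 and
# sigma^2 = alpha*sigma1^2 + beta*sigma2^2 + alpha*beta*(mu1 - mu2)^2.
alpha, beta = 0.3, 0.7
mu1, sig1 = -1.0, 0.5
mu2, sig2 = 2.0, 1.0

xs = [random.gauss(mu1, sig1) if random.random() < alpha
      else random.gauss(mu2, sig2) for _ in range(200_000)]

m_sample = statistics.fmean(xs)
v_sample = statistics.pvariance(xs)

mu = alpha * mu1 + beta * mu2
var = alpha * sig1**2 + beta * sig2**2 + alpha * beta * (mu1 - mu2)**2
print(round(m_sample, 2), mu)             # close to 1.1
print(round(v_sample, 2), round(var, 3))  # close to 2.665
```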

3.2.4 Skewness

The skewness coefficient γ1 measures the asymmetry of a distribution with


respect to its mean. It is zero for the normal distribution, but quite sizable
for the exponential distribution. There it has the value γ1 = 2, see Sect. 3.3.4
below.
Definition:

γ1 = E[(x − µ)³]/σ³ .
Similarly to the variance, γ1 can be expressed by expected values of powers
of the variate x:
 
γ1 = E[(x − µ)³]/σ³
   = E[x³ − 3µx² + 3µ²x − µ³]/σ³
   = [E(x³) − 3µ(E(x²) − µE(x)) − µ³]/σ³
   = [E(x³) − 3µσ² − µ³]/σ³ .
The skewness coefficient is defined in such a way that it satisfies the
requirement of invariance under translation and dilatation of the distribution.
Its square is usually denoted by β1 = γ12 .

Fig. 3.6. Three distributions with equal mean and variance but different skewness
(γ1 ) and kurtosis (γ2 ): γ1 = 0, γ2 = 17; γ1 = 2, γ2 = 8; γ1 = 0, γ2 = 2. The lower
panel shows the same curves on a logarithmic scale.

3.2.5 Kurtosis (Excess)

A fourth parameter, the kurtosis β2 , measures the tails of a distribution.


Definition:

β2 = E[(x − µ)⁴]/σ⁴ .
A kurtosis coefficient or excess γ2 ,

γ2 = β2 − 3 ,

is defined such that it is equal to zero for the normal distribution, which is
used as a reference (see Sect. 3.6.5).

3.2.6 Discussion

The mean value of a distribution is a so-called position or location parameter,


the standard deviation is a scale parameter . A translation of the variate
x → y = x + a changes the mean value correspondingly, ⟨y⟩ = ⟨x⟩ + a.

This parameter is therefore sensitive to the location of the distribution (like


the center of gravity for a mass distribution). The variance (corresponding
to the moment of inertia for a mass distribution), respectively the standard
deviation remain unchanged. A change of the scale (dilatation) x → y = cx
entails, besides ⟨y⟩ = c⟨x⟩, also σ(y) = cσ(x). Skewness and kurtosis remain
unchanged under both transformations. They are shape parameters.
The four parameters mean, variance, skewness, and kurtosis, or equiva-
lently the expected values of x, x², x³ and x⁴, fix a distribution quite well if in
addition the range of the variates and the behavior of the distribution at the
limits is given. Then the distribution can be reconstructed quite accurately.
Fig. 3.6 shows three probability densities, all with the same mean µ =
0 and standard deviation σ = 1, but different skewness and kurtosis. The
apparently narrower curve has clearly longer tails, as seen in the lower graph
with logarithmic scale.
Mainly in cases where the type of the distribution is not well known, i.e.
for empirical distributions, other location and scale parameters are common.
These are the mode xmod , the variate value, at which the distribution has
its maximum, and the median, defined as the variate value x0.5 , at which
P {x < x0.5 } = F (x0.5 ) = 0.5, i.e. the median subdivides the domain of the
variate into two regions with equal probability of 50%. More generally, we
define a quantile xa of order a by the requirement F (xa ) = a.
A well known example for a median is the half-life t0.5 which is the time
at which 50% of the nuclei of an unstable isotope have decayed. From the
exponential distribution (3.2) follows the relation between the half-life and
the mean lifetime τ
t0.5 = τ ln 2 ≈ 0.693 τ .
The median is invariant under non-linear transformations y = y(x) of the
variate, y0.5 = y(x0.5 ), while for the mean value µ and the mode xmod this
is usually not the case, µy ≠ y(µx ), ymod ≠ y(xmod ). The reason for these
properties is that probabilities but not probability densities are invariant
under variate transformations. Thus the mode should not be considered as
the “most probable value”. The probability to obtain exactly the mode value
is zero. To obtain finite probabilities, we have to integrate the p.d.f. over
some range of the variate as is the case for the median.
In statistical analyses of data contaminated by background the sample
median is more “robust” than the sample mean as estimator of the distribution
mean (see Appendix, Sect. 13.18). Instead of the sample width v, often the
full width at half maximum (f.w.h.m.) is used to characterize the spread
of a distribution. It ignores the tails of the distribution. This makes sense
for empirical distributions, e.g. in the investigation of spectral lines above a
sizable background. For a normal distribution the f.w.h.m. is related to the
standard deviation by

f.w.h.m.gauss ≈ 2.36 σgauss .



This relation is often used to estimate quickly the standard deviation σ for an
empirical distribution given in form of a histogram. As seen from the examples
in Fig. 3.6, which, for the same variance, differ widely in their f.w.h.m., this
procedure may lead to wrong results for non-Gaussian distributions.

3.2.7 Examples

In this section we compute expected values of some quantities for different


distributions.

Example 10. Expected values, dice


We have p(k) = 1/6, k = 1, . . . , 6.

⟨x⟩ = (1 + 2 + 3 + 4 + 5 + 6) · 1/6 = 7/2 ,
⟨x²⟩ = (1 + 4 + 9 + 16 + 25 + 36) · 1/6 = 91/6 ,
σ² = 91/6 − (7/2)² = 35/12 ,
σ ≈ 1.71 ,
γ1 = 0 .

The expected value has probability zero.
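The dice moments can be reproduced exactly with rational arithmetic; a minimal sketch:

```python
from fractions import Fraction

# Exact moments of a fair die, p(k) = 1/6 (Example 10), using rationals.
p = Fraction(1, 6)
mean = sum(k * p for k in range(1, 7))   # <x>
m2 = sum(k**2 * p for k in range(1, 7))  # <x^2>
var = m2 - mean**2                       # sigma^2

print(mean, m2, var)                 # 7/2 91/6 35/12
print(round(float(var) ** 0.5, 2))   # 1.71
```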

Example 11. Expected values, lifetime distribution


f (t) = (1/τ ) e^{−t/τ} , t ≥ 0 ,

⟨t^n⟩ = ∫_0^{∞} (t^n/τ ) e^{−t/τ} dt = n! τ^n ,

⟨t⟩ = τ ,
⟨t²⟩ = 2τ² ,
⟨t³⟩ = 6τ³ ,
σ = τ ,
γ1 = 2 .
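The moment formula ⟨t^n⟩ = n! τ^n yields σ = τ and γ1 = 2 for any τ; a short check with an arbitrary value:

```python
import math

# Moments of the lifetime distribution via <t^n> = n! tau^n (Example 11);
# tau is an arbitrary illustrative value.
tau = 1.5
m1 = math.factorial(1) * tau
m2 = math.factorial(2) * tau**2
m3 = math.factorial(3) * tau**3

var = m2 - m1**2                                 # sigma^2 = tau^2
gamma1 = (m3 - 3 * m1 * var - m1**3) / var**1.5  # skewness

print(round(var, 3), round(gamma1, 3))  # 2.25 2.0
```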

Example 12. Mean value of the volume of a sphere with a normally distributed
radius

The normal distribution is given by

f (x) = 1/(√(2π) s) · e^{−(x−x0 )²/(2s²)} .

It is symmetric with respect to x0 . Thus the mean value is µ = x0 , and the
skewness is zero. For the variance we obtain

σ² = 1/(√(2π) s) ∫_{−∞}^{∞} dx (x − x0 )² e^{−(x−x0 )²/(2s²)} = s² .

The parameters x0 , s of the normal distribution are simply the mean value
and the standard deviation µ, σ, and the p.d.f. with these parameters is
abbreviated as N (x|µ, σ). We now assume that the radius r of a sphere is
smeared according to a normal distribution around the mean value r0 with
standard deviation s. This assumption is certainly only approximately valid
for r0 ≫ s, since negative radii are of course impossible. Let us calculate the
expected value of the volume V (r) = (4/3)πr³:

⟨V ⟩ = ∫_{−∞}^{∞} dr V (r)f (r)
     = 4π/(3√(2π) s) ∫_{−∞}^{∞} dr r³ e^{−(r−r0 )²/(2s²)}
     = 4π/(3√(2π) s) ∫_{−∞}^{∞} dz (z + r0 )³ e^{−z²/(2s²)}
     = 4π/(3√(2π) s) ∫_{−∞}^{∞} dz (z³ + 3z²r0 + 3zr0² + r0³) e^{−z²/(2s²)}
     = (4/3)π(r0³ + 3s²r0 ) .
The mean volume is larger than the volume calculated using the mean radius.
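A Monte Carlo check of ⟨V⟩ = (4/3)π(r0³ + 3s²r0) (the values of r0 and s are illustrative, with r0 ≫ s as assumed in the text):

```python
import math
import random

random.seed(5)

# Monte Carlo check of <V> = 4/3 pi (r0^3 + 3 s^2 r0) for r ~ N(r0, s);
# r0 >> s, so negative radii are negligible.  Values are illustrative.
r0, s = 10.0, 1.0
n = 400_000

mc = sum(4.0 / 3.0 * math.pi * random.gauss(r0, s) ** 3 for _ in range(n)) / n

exact = 4.0 / 3.0 * math.pi * (r0**3 + 3 * s**2 * r0)
naive = 4.0 / 3.0 * math.pi * r0**3  # volume at the mean radius

print(round(mc), round(exact), round(naive))  # the naive value is too small
```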

Example 13. Playing poker until the bitter end


Two players are equally clever, but own different capitals K1 and K2 ,
respectively. They play until one of the players is left without money. We
denote the probabilities for player 1, (2) to win finally with w1 (w2 ). The
probability that one of the two players wins is unity5 :

w1 + w2 = 1 .

Fig. 3.7. Brownian motion.

Player 1 gains the capital K2 with probability w1 and loses K1 with proba-
bility w2 . Thus his mean gain is w1 K2 − w2 K1 . The same is valid for player
two, only with reversed sign. As both players play equally well, the expected
gain should be zero for both:

w1 K2 − w2 K1 = 0 .

From the two relations follows:

w1 = K1 /(K1 + K2 ) ;   w2 = K2 /(K1 + K2 ) .
The probability to win is proportional to the capital one possesses. However,
the greater risk of the player with the smaller capital comes along with the
possibility of a higher gain.
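The result w1 = K1 /(K1 + K2 ) can be checked by simulating the game as a symmetric random walk of player 1's capital (unit stakes, illustrative capitals):

```python
import random

random.seed(11)

# Gambler's-ruin simulation of the poker example: a fair game until one
# player is broke.  The text predicts w1 = K1 / (K1 + K2).
K1, K2 = 3, 7  # illustrative capitals, in units of the stake
trials = 50_000

wins1 = 0
for _ in range(trials):
    capital = K1  # capital of player 1
    while 0 < capital < K1 + K2:
        capital += 1 if random.random() < 0.5 else -1
    wins1 += capital == K1 + K2  # player 1 owns everything

print(round(wins1 / trials, 2), K1 / (K1 + K2))  # both close to 0.3
```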

Example 14. Diffusion (random walk)

A particle is moving stochastically according to Brownian motion, where
every step is independent of the previous ones (Fig. 3.7).
The starting point has a distance d1 from the wall 1 and d2 from the oppo-
site wall 2. We want to know the probabilities w1 , w2 to hit wall 1 or 2. The
direct calculation of w1 and w2 is a quite involved problem. However, using
the properties of expected values, it can be solved quite simply, without even
knowing the probability density. The problem here is completely analogous
to the previous one:

w1 = d2 /(d1 + d2 ) ,   w2 = d1 /(d1 + d2 ) .

Example 15. Mean kinetic energy of a gas molecule


The velocity vx of a particle in x-direction is given by a normal distribution

f (vx ) = 1/(s√(2π)) · e^{−vx²/(2s²)} ,

with
s² = kT /m ,

where k, T , m are the Boltzmann constant, the temperature, and the mass
of the molecule. The kinetic energy is

εkin = (m/2)(vx² + vy² + vz²)

with the expected value

E(εkin ) = (m/2)(E(vx²) + E(vy²) + E(vz²)) = (3m/2) E(vx²) ,

where in the last step the velocity distribution was assumed to be isotropic.
It follows:

E(vx²) = 1/(s√(2π)) ∫_{−∞}^{∞} dvx vx² e^{−vx²/(2s²)} = s² = kT /m ,

E(εkin ) = (3/2) kT .

Example 16. Reading accuracy of a digital clock


For an accurate digital clock which displays the time in seconds, the devi-
ation of the reading from the true time is maximally ± 0.5 seconds. After the
reading, we may associate to the true time a uniform distribution with the
actual reading as its central value. To simplify the calculation of the variance,
we set the reading equal to zero. We thus have

f (t) = 1 for −0.5 < t < 0.5 , and f (t) = 0 else ,

and

σ² = ∫_{−0.5}^{0.5} t² dt = 1/12 . (3.16)

The root mean square measurement uncertainty (standard deviation) is σ =
1 s/√12 ≈ 0.29 s. The variance of a uniform distribution, which covers a
range of a, is accordingly σ² = a²/12. This result is widely used for the error
estimation of digital measurements. A typical example from particle physics
is the position measurement of ionizing particles with wire chambers.
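A numerical check of σ² = a²/12 for the uniform distribution (a = 1 s as in the clock example):

```python
import random
import statistics

random.seed(13)

# Check sigma^2 = a^2/12 for a uniform distribution covering a range a
# (here a = 1 second, the display granularity of the clock example).
a = 1.0
readings = [random.uniform(-a / 2, a / 2) for _ in range(500_000)]

var_obs = statistics.pvariance(readings)
print(round(var_obs, 4), round(a**2 / 12, 4))
print(round(statistics.pstdev(readings), 2))  # about 0.29 s
```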

Example 17. Efficiency fluctuations of a detector


A counter registers on average the fraction ε = 0.9 of all traversing elec-
trons. How large is the relative fluctuation σ of the registered number N1
for N particles passing the detector? The exact solution of this problem re-
quires the knowledge of the probability distribution, in this case the binomial
distribution. But also without this knowledge we can derive the dependence
on N with the help of relation (3.13):

σ(N1 /N ) ∼ 1/√N .

The whole process can be split into single processes, each being associated
with the passing of a single particle. Averaging over all processes leads to the
above result. (The binomial distribution gives σ(N1 /N ) = √(ε(1 − ε)/N ), see
Sect. 3.6.1.)

All stochastic processes which can be split into N identical, indepen-
dent elementary processes show the typical 1/√N behavior of their relative
fluctuations.

3.3 Moments and Characteristic Functions


The characteristic quantities of distributions considered up to now, mean
value, variance, skewness, and kurtosis, have been calculated from expected
values of the lower four powers of the variate. Now we will investigate the
expected value of arbitrary powers of the random variable x for discrete and
continuous probability distributions p(x), f (x), respectively. They are called
moments of the distribution. Their calculation is particularly simple, if the
characteristic function of the distribution is known. The latter is just the
Fourier transform of the distribution.

3.3.1 Moments

Definition: The n-th moments of f (x), respectively p(x), are

µn = E(x^n) = ∫_{−∞}^{∞} x^n f (x) dx ,

and
µn = E(x^n) = ∑_{k=1}^{∞} xk^n p(xk ) ,

where n is a natural number6 .

Apart from these moments, called moments about the origin, we consider
also the moments about an arbitrary point a where x^n is replaced by
(x − a)^n . Of special importance are the moments about the expected value of the
distribution. They are called central moments.
Definition: The n-th central moment about µ = µ1 of f (x), p(x) is:
µ′n = E((x − µ)^n) = ∫_{−∞}^{∞} (x − µ)^n f (x) dx ,

respectively

µ′n = E((x − µ)^n) = ∑_{k=1}^{∞} (xk − µ)^n p(xk ) .

Accordingly, the first central moment is zero: µ′1 = 0. Generally, the moments
are related to the expected values introduced before as follows:
First central moment: µ′1 = 0
Second central moment: µ′2 = σ²
Third central moment: µ′3 = γ1 σ³
Fourth central moment: µ′4 = β2 σ⁴
Under conditions usually met in practice, a distribution is uniquely fixed
by its moments. This means, if two distributions have the same moments in
all orders, they are identical. We will present below plausibility arguments
for the validity of this important assertion.

3.3.2 Characteristic Function

We define the characteristic function φ(t) of a distribution as follows:


Definition: The characteristic function φ(t) of a probability density f (x)
is

φ(t) = E(e^{itx}) = ∫_{−∞}^{∞} e^{itx} f (x) dx , (3.17)

and, respectively for a discrete distribution p(xk )


6 In one dimension the zeroth moment is irrelevant. Formally, it is equal to one.

φ(t) = E(e^{itx}) = ∑_{k=1}^{∞} e^{it xk} p(xk ) . (3.18)

For continuous distributions, φ(t) is the Fourier transform of the p.d.f.


From the definition of the characteristic function follow several useful
properties.
φ(t) is a continuous, in general complex-valued function of t, −∞ < t < ∞,
with |φ(t)| ≤ 1, φ(0) = 1 and φ(−t) = φ∗ (t). φ(t) is a real function if and
only if the distribution is symmetric, f (x) = f (−x). Especially for continuous
distributions there is lim_{t→∞} φ(t) = 0. A linear transformation of the variate
x → y = ax + b induces a transformation of the characteristic function of the
form

φx (t) → φy (t) = e^{ibt} φx (at) . (3.19)
Further properties are found in handbooks on the Fourier transform.
The transformation is invertible: With (3.17) it is

∫_{−∞}^{∞} φ(t) e^{−itx} dt = ∫_{−∞}^{∞} e^{−itx} ( ∫_{−∞}^{∞} e^{itx′} f (x′ ) dx′ ) dt
                           = ∫_{−∞}^{∞} f (x′ ) ( ∫_{−∞}^{∞} e^{it(x′ −x)} dt ) dx′
                           = 2π ∫_{−∞}^{∞} f (x′ ) δ(x′ − x) dx′
                           = 2π f (x) ,

f (x) = (1/2π) ∫_{−∞}^{∞} φ(t) e^{−itx} dt .

The same is true for discrete distributions, as may be verified by substituting
(3.18):

p(xk ) = lim_{T→∞} (1/2T ) ∫_{−T}^{T} φ(t) e^{−it xk} dt .

In all cases of practical relevance, the probability distribution is uniquely


determined by its characteristic function.
Knowing the characteristic functions simplifies considerably the calcula-
tion of moments and of the distributions of sums or linear combinations of
variates. For continuous distributions moments are found by n-fold derivation
of φ(t):

d^n φ(t)/dt^n = ∫_{−∞}^{∞} (ix)^n e^{itx} f (x) dx .

With t = 0 follows

d^n φ(0)/dt^n = ∫_{−∞}^{∞} (ix)^n f (x) dx = i^n µn . (3.20)

The Taylor expansion of φ(t),

φ(t) = ∑_{n=0}^{∞} (t^n/n!) d^n φ(0)/dt^n = ∑_{n=0}^{∞} (it)^n µn /n! , (3.21)

generates the moments of the distribution.


The characteristic function φ(t) is closely related to the moment generat-
ing function which is defined through M (t) = E(e^{tx}). In some textbooks M
is used instead of φ for the evaluation of the moments.
We realize that the moments determine φ uniquely, and, since the Fourier
transform is uniquely invertible, the moments also determine the probability
density, as stated above.
In the same way we obtain the central moments:

φ′ (t) = E(e^{it(x−µ)}) = ∫_{−∞}^{∞} e^{it(x−µ)} f (x) dx = e^{−itµ} φ(t) , (3.22)

d^n φ′ (0)/dt^n = i^n µ′n . (3.23)

The Taylor expansion is

φ′ (t) = ∑_{n=0}^{∞} (it)^n µ′n /n! . (3.24)

The results (3.20), (3.21), (3.23), (3.24) remain valid also for discrete
distributions. The Taylor expansion of the right hand side of relation (3.22)
allows us to calculate the central moments from the moments about the origin
and vice versa:

µ′n = ∑_{k=0}^{n} \binom{n}{k} (−1)^k µ^k µ_{n−k} ,   µn = ∑_{k=0}^{n} \binom{n}{k} µ′_{n−k} µ^k .

Note that for n = 0, µ0 = µ′0 = 1.


In some applications we have to compute the distribution f (z) where z
is the sum z = x + y of two independent random variables x and y with the
probability densities g(x) and h(y). The result is given by the convolution
integral, see Sect. 3.5.4,
f (z) = ∫ g(x)h(z − x) dx = ∫ h(y)g(z − y) dy ,

which often is difficult to evaluate analytically. It is simpler in most situations


to proceed indirectly via the characteristic functions φg (t), φh (t) and φf (t)
of the three p.d.f.s which obey the simple relation

φf (t) = φg (t)φh (t) . (3.25)


3.3 Moments and Characteristic Functions 37

Proof:
Because of (3.10) we find for expected values

φf (t) = E(e^{it(x+y)})
       = E(e^{itx} e^{ity})
       = E(e^{itx}) E(e^{ity})
       = φg (t)φh (t) .

The third step requires the independence of the two variates. Applying the
inverse Fourier transform to φf (t), we get

f (z) = (1/2π) ∫ e^{−itz} φf (t) dt .

The solution of this integral is not always simple. For some functions it can
be found in tables of the Fourier transform.
In the general case where x is a linear combination of independent random
variables, x = ∑ cj xj , we find in an analogous way:

φ(t) = ∏_j φj (cj t) .
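Relation (3.25) can be verified with empirical characteristic functions; a Monte Carlo sketch (the choice of the two distributions and of t is arbitrary):

```python
import cmath
import random

random.seed(17)

# Monte Carlo check of (3.25), phi_f(t) = phi_g(t) * phi_h(t), for the sum
# z = x + y of two independent variates (exponential and uniform, as an
# arbitrary illustration).
n = 200_000
xs = [random.expovariate(1.0) for _ in range(n)]
ys = [random.uniform(0.0, 1.0) for _ in range(n)]

def phi(sample, t):
    """Empirical characteristic function, the sample mean of exp(i t x)."""
    return sum(cmath.exp(1j * t * v) for v in sample) / len(sample)

t = 0.7
lhs = phi([x + y for x, y in zip(xs, ys)], t)
rhs = phi(xs, t) * phi(ys, t)
print(abs(lhs - rhs))  # small: the two estimates agree within MC noise
```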

3.3.3 Cumulants

As we have seen, the characteristic function simplifies in many cases the


calculation of moments and the convolution of two distributions. Interesting
relations between the moments of the three distributions g(x), h(y) and f (z)
with z = x + y are obtained from the expansion of the logarithm K(t) of the
characteristic functions into powers of it:

K(t) = ln φ(t) = ln E(e^{itx}) = κ1 (it) + κ2 (it)²/2! + κ3 (it)³/3! + · · · .
Since φ(0) = 1 there is no constant term. The coefficients κi , defined in this
way, are called cumulants or semiinvariants. The denotation semiinvariant
indicates that the cumulants κi , with the exception of κ1 , remain invariant
under the translations x → x + b of the variate x. Of course, the cumulant of
order i can be expressed by moments about the origin or by central moments
µk , µ′k up to the order i. We do not present the general analytic expressions
for the cumulants which can be derived from the power expansion of exp K(t)
and give only the remarkably simple relations for i ≤ 6 as a function of the
central moments:

\kappa_1 = \mu_1 \equiv \mu = \langle x \rangle ,
\kappa_2 = \mu'_2 \equiv \sigma^2 = \mathrm{var}(x) ,
\kappa_3 = \mu'_3 ,
\kappa_4 = \mu'_4 - 3\mu'^2_2 ,
\kappa_5 = \mu'_5 - 10\mu'_2 \mu'_3 ,
\kappa_6 = \mu'_6 - 15\mu'_2 \mu'_4 - 10\mu'^2_3 + 30\mu'^3_2 . (3.26)

Besides expected value and variance, also skewness and excess are easily
expressed by cumulants:
\gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}} , \qquad \gamma_2 = \frac{\kappa_4}{\kappa_2^2} . (3.27)

Since the product of the characteristic functions \varphi(t) = \varphi^{(1)}(t)\varphi^{(2)}(t) turns into the sum K(t) = K^{(1)}(t) + K^{(2)}(t), the cumulants are additive, \kappa_i = \kappa_i^{(1)} + \kappa_i^{(2)}. In the general case, where x is a linear combination of independent variates, x = \sum_j c_j x^{(j)}, the cumulants of the resulting x distribution, \kappa_i, are derived from those of the various x^{(j)} distributions according to

\kappa_i = \sum_j c_j^i\, \kappa_i^{(j)} . (3.28)

We have met examples for this useful relation already in Sect. 3.2.3 where
we have computed the variance of the distribution of a sum of variates. We
will use it again in the discussion of the Poisson distribution in the following
example and in Sect. 3.6.3.
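Relation (3.28) is easy to check by simulation. The sketch below (Python with NumPy; the parameter values are our own choice) builds x = c_1 x_1 + c_2 x_2 from two independent Poisson variates, for which all cumulants equal \lambda (see the following example), and compares the sample cumulants \kappa_1, \kappa_2, \kappa_3 with the prediction \sum_j c_j^i \lambda_j:

```python
import numpy as np

rng = np.random.default_rng(0)
lam1, lam2, c1, c2 = 3.0, 5.0, 2.0, -1.0
n = 1_000_000

# linear combination of two independent Poisson variates
x = c1 * rng.poisson(lam1, n) + c2 * rng.poisson(lam2, n)

# sample cumulants: kappa_1 = mean, kappa_2 = variance,
# kappa_3 = third central moment
k1 = x.mean()
k2 = x.var()
k3 = ((x - k1) ** 3).mean()

# predictions from (3.28), using kappa_i = lambda for the Poisson distribution
p1 = c1 * lam1 + c2 * lam2          # = 1
p2 = c1**2 * lam1 + c2**2 * lam2    # = 17
p3 = c1**3 * lam1 + c2**3 * lam2    # = 19
```

Note that odd powers of a negative coefficient flip the sign of the corresponding cumulant contribution, as (3.28) demands.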

3.3.4 Examples

Example 18. Characteristic function of the Poisson distribution


The Poisson distribution

P_\lambda(k) = \frac{\lambda^k}{k!}\, e^{-\lambda}

has the characteristic function

\varphi(t) = \sum_{k=0}^{\infty} e^{itk}\, \frac{\lambda^k}{k!}\, e^{-\lambda}

which can be simplified to

\varphi(t) = \sum_{k=0}^{\infty} \frac{1}{k!} \left(e^{it}\lambda\right)^k e^{-\lambda}
= \exp\left(e^{it}\lambda\right) e^{-\lambda}
= \exp\left(\lambda(e^{it} - 1)\right) ,

from which we derive the moments:


\frac{d\varphi}{dt} = \exp\left(\lambda(e^{it} - 1)\right) \lambda i e^{it} ,
\frac{d\varphi(0)}{dt} = i\lambda ,
\frac{d^2\varphi}{dt^2} = \exp\left(\lambda(e^{it} - 1)\right) \left[ (\lambda i e^{it})^2 - \lambda e^{it} \right] ,
\frac{d^2\varphi(0)}{dt^2} = -(\lambda^2 + \lambda) .
Thus, the two lowest moments are

\mu = \langle k \rangle = \lambda ,
\mu_2 = \langle k^2 \rangle = \lambda^2 + \lambda

and the mean value and the standard deviation are given by

\langle k \rangle = \lambda , \qquad \sigma = \sqrt{\lambda} .

Expanding
K(t) = \ln \varphi(t) = \lambda(e^{it} - 1) = \lambda\left[(it) + \frac{(it)^2}{2!} + \frac{(it)^3}{3!} + \cdots\right] ,
for the cumulants we get the simple result

κ1 = κ2 = κ3 = · · · = λ .

The calculation of the lower central moments is then trivial. For example,
skewness and excess are simply given by
\gamma_1 = \kappa_3/\kappa_2^{3/2} = 1/\sqrt{\lambda} , \qquad \gamma_2 = \kappa_4/\kappa_2^2 = 1/\lambda .

Example 19. Distribution of a sum of independent, Poisson distributed variates
We start from the distributions

P1 (k1 ) = Pλ1 (k1 ),


P2 (k2 ) = Pλ2 (k2 )

and calculate the probability distribution P(k) for k = k1 + k2 . When we


write down the characteristic function for P(k),

\varphi(t) = \varphi_1(t)\,\varphi_2(t)
= \exp\left(\lambda_1(e^{it} - 1)\right) \exp\left(\lambda_2(e^{it} - 1)\right)
= \exp\left((\lambda_1 + \lambda_2)(e^{it} - 1)\right) ,

we observe that φ(t) is just the characteristic function of the Poisson dis-
tribution Pλ1 +λ2 (k). The sum of two Poisson distributed variates is again
Poisson distributed, the mean value being the sum of the mean values of the
two original distributions. This property is sometimes called stability.
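This stability property can be verified numerically. A minimal sketch (Python with NumPy; the values of \lambda_1, \lambda_2 are arbitrary choices) draws pairs of Poisson numbers and compares the frequencies of their sum with the Poisson probabilities P_{\lambda_1+\lambda_2}(k):

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(1)
lam1, lam2, n = 2.0, 3.0, 500_000

# sum of two independent Poisson variates
k = rng.poisson(lam1, n) + rng.poisson(lam2, n)
lam = lam1 + lam2

# compare empirical frequencies with the Poisson probabilities P_{lam1+lam2}(j)
freq = {j: np.mean(k == j) for j in (3, 5, 8)}
prob = {j: lam**j * exp(-lam) / factorial(j) for j in (3, 5, 8)}
```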

Example 20. Characteristic function and moments of the exponential distribution
For the p.d.f.
f (x) = λe−λx
we obtain the characteristic function
\varphi(t) = \int_0^{\infty} e^{itx}\, \lambda e^{-\lambda x}\, dx
= \frac{\lambda}{-\lambda + it} \left[ e^{(-\lambda + it)x} \right]_0^{\infty}
= \frac{\lambda}{\lambda - it}
and deriving it with respect to t, we get from

\frac{d\varphi(t)}{dt} = \frac{i\lambda}{(\lambda - it)^2} ,
\frac{d^n\varphi(t)}{dt^n} = \frac{n!\, i^n \lambda}{(\lambda - it)^{n+1}} ,
\frac{d^n\varphi(0)}{dt^n} = \frac{n!\, i^n}{\lambda^n}

the moments of the distribution:

µn = n! λ−n .

From these we obtain the mean value

µ = 1/λ ,

the standard deviation

\sigma = \sqrt{\mu_2 - \mu^2} = 1/\lambda ,

and the skewness

\gamma_1 = (\mu_3 - 3\sigma^2\mu - \mu^3)/\sigma^3 = 2 .
Contrary to the Poisson example, here we do not gain in using the charac-
teristic function, since the moments can be calculated directly:
\int_0^{\infty} x^n\, \lambda e^{-\lambda x}\, dx = n!\, \lambda^{-n} .

3.4 Transformation of Variables

In one of the examples of Sect. 3.2.7 we had calculated the expected value
of the energy from the distribution of velocity. For certain applications, to
know the mean value of the energy may not be sufficient and its complete
distribution may be required. To derive it, we have to perform a variable
transformation.
For discrete distributions, this is a trivial exercise: The probability that
the event “u has the value u(xk)” occurs, where u is a uniquely invertible
function of x, is of course the same as for “x has the value xk ”:

P {u = u(xk )} = P {x = xk } .

For continuous distributions, the probability densities are transformed ac-


cording to the usual rules as applied for example for mass or charge densities.

3.4.1 Calculation of the Transformed Density

We consider a probability density f(x) and a monotone, i.e. uniquely invertible function u(x). We are interested in the p.d.f. of u, g(u) (Fig. 3.8).
The relation P {x1 < x < x2 } = P {u1 < u < u2 } with u1 = u(x1 ), u2 =
u(x2 ) has to hold, and therefore

Fig. 3.8. Transformation of a probability density f(x) into g(u) given u(x). The shaded areas are equal.

P\{x_1 < x < x_2\} = \int_{x_1}^{x_2} f(x')\, dx' = \int_{u_1}^{u_2} g(u')\, du' .

This may be written in differential form as

|g(u)\, du| = |f(x)\, dx| , (3.29)

g(u) = f(x) \left| \frac{dx}{du} \right| .

Taking the absolute value guarantees the positivity of the probability density.
Integrating (3.29), we find numerical equality of the cumulative distribution
functions, F (x) = G(u(x)).
If u(x) is not a monotone function, then, contrary to the above assump-
tion, x(u) is not a unique function (Fig. 3.9) and we have to sum over the
contributions of the various branches of the inverse function:

g(u) = \left[ f(x) \left| \frac{dx}{du} \right| \right]_{\mathrm{branch\ 1}} + \left[ f(x) \left| \frac{dx}{du} \right| \right]_{\mathrm{branch\ 2}} + \cdots . (3.30)

Fig. 3.9. Transformation of a p.d.f. f(x) into g(u) with u = x². The sum of the shaded areas below f(x) is equal to the shaded area below g(u).

Example 21. Calculation of the p.d.f. for the volume of a sphere from the
p.d.f. of the radius
Given a uniform distribution for the radius r

f(r) = \begin{cases} 1/(r_2 - r_1) & \text{if } r_1 < r < r_2 \\ 0 & \text{else} \end{cases}

we are interested in the distribution g(V ) of the volume V (r). With

g(V) = f(r) \left| \frac{dr}{dV} \right| , \qquad \frac{dV}{dr} = 4\pi r^2

we get

g(V) = \frac{1}{r_2 - r_1}\, \frac{1}{4\pi r^2} = \frac{1}{3\left(V_2^{1/3} - V_1^{1/3}\right)}\, V^{-2/3} .
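The transformed density can be cross-checked by simulation. In the sketch below (Python with NumPy; r1 = 4, r2 = 8 are arbitrary choices) we draw the radius uniformly, compute V, and compare the empirical distribution function of V with the analytic one obtained by integrating g(V):

```python
import numpy as np

rng = np.random.default_rng(2)
r1, r2 = 4.0, 8.0
r = rng.uniform(r1, r2, 1_000_000)   # uniformly distributed radius
V = 4.0 / 3.0 * np.pi * r**3         # transformed variate

V1 = 4.0 / 3.0 * np.pi * r1**3
V2 = 4.0 / 3.0 * np.pi * r2**3

def G(v):
    """Analytic distribution function of V, the integral of g(V) ~ V^(-2/3)."""
    return ((v ** (1.0 / 3.0) - V1 ** (1.0 / 3.0))
            / (V2 ** (1.0 / 3.0) - V1 ** (1.0 / 3.0)))
```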

Example 22. Distribution of the quadratic deviation


For a normally distributed variate x with mean value x0 and variance s2
we ask for the distribution g(u), where
u = (x − x0 )2 /s2

Fig. 3.10. Transformation of a uniform distribution of the radius into the distribution of the volume of a sphere.

is the normalized quadratic deviation. The expected value of u is unity, since the expected value of (x − µ)² by definition equals σ² for any distribution. The function x(u) has two branches. With

f(x) = \frac{1}{s\sqrt{2\pi}}\, e^{-(x - x_0)^2/(2s^2)}

and

\left| \frac{dx}{du} \right| = \frac{s}{2\sqrt{u}}

we find

g(u) = \left[ \frac{1}{2\sqrt{2\pi u}}\, e^{-u/2} \right]_{\mathrm{branch\ 1}} + \left[\ \cdots\ \right]_{\mathrm{branch\ 2}} .

The contributions from both branches are the same, thus

g(u) = \frac{1}{\sqrt{2\pi u}}\, e^{-u/2} . (3.31)

The function g(u) is the so-called χ2 - distribution (chi-squared distribution)


for one degree of freedom.
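Formula (3.31) can be checked by simulation. The sketch below (Python with NumPy; x0 and s are arbitrary choices) squares normalized Gaussian deviations and compares low moments of u with those of the χ² distribution with one degree of freedom, namely E(u) = 1 and var(u) = 2:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, s, n = 2.0, 0.5, 1_000_000

x = rng.normal(x0, s, n)
u = (x - x0) ** 2 / s**2   # normalized quadratic deviation

# for the chi-squared distribution with one degree of freedom (3.31):
# E(u) = 1 and var(u) = 2
```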

Example 23. Distribution of kinetic energy in the one-dimensional ideal gas


Let v be the velocity of a particle in x direction with probability density

f(v) = \sqrt{\frac{m}{2\pi kT}}\, e^{-mv^2/(2kT)} .

Its kinetic energy is E = mv²/2, for which we want to know the distribution g(E). The function v(E) has again two branches. We get, in complete analogy to the example above,

\left| \frac{dv}{dE} \right| = \frac{1}{\sqrt{2mE}} ,

g(E) = \left[ \frac{1}{2\sqrt{\pi kTE}}\, e^{-E/kT} \right]_{\mathrm{branch\ 1}} + \left[\ \cdots\ \right]_{\mathrm{branch\ 2}} .

The contributions of both branches are the same, hence

g(E) = \frac{1}{\sqrt{\pi kTE}}\, e^{-E/kT} .

3.4.2 Determination of the Transformation Relating two Distributions

In the computer simulation of stochastic processes we are frequently con-


fronted with the problem that we have to transform the uniform distribution
of a random number generator into a desired distribution, e.g. a normal or
exponential distribution. More generally, we want to obtain for two given
distributions f (x) and g(u) the transformation u(x) connecting them.
We have

\int_{-\infty}^{x} f(x')\, dx' = \int_{-\infty}^{u} g(u')\, du' .

Integrating, we get F (x) and G(u):

F (x) = G(u) ,

u(x) = G−1 (F (x)) .


G−1 is the inverse function of G. The problem can be solved analytically,
only if f and g can be integrated analytically and if the inverse function of
G can be derived.
Let us consider now the special case mentioned above, where the primary
distribution f (x) is uniform, f (x) = 1 for 0 ≤ x ≤ 1. This implies F (x) = x
and

G(u) = x,
u = G−1 (x) . (3.32)

Example 24. Generation of an exponential distribution starting from a uniform distribution
Given are the p.d.f.s

f(x) = \begin{cases} 1 & \text{for } 0 < x < 1 \\ 0 & \text{else ,} \end{cases} \qquad
g(u) = \begin{cases} \lambda e^{-\lambda u} & \text{for } 0 < u \\ 0 & \text{else .} \end{cases}

The desired transformation u(x), as demonstrated above in the general case, is obtained by integration and inversion:

\int_0^u g(u')\, du' = \int_0^x f(x')\, dx' ,
\int_0^u \lambda e^{-\lambda u'}\, du' = x ,
1 - e^{-\lambda u} = x ,
u = -\ln(1 - x)/\lambda .

We could have used, of course, also the relation (3.32) directly. Obviously in
the last relation we could substitute 1 − x by x, since both quantities are
uniformly distributed. When we transform the uniformly distributed random
numbers x delivered by our computer according to the last relation into the
variable u, the latter will be exponentially distributed. This is the usual way to simulate the lifetime distribution of unstable particles and other decay processes (see Chap. 5).
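The recipe of this example reads directly as code. A minimal sketch (Python with NumPy; \lambda = 0.5 is an arbitrary choice) transforms uniform random numbers with u = −ln(1 − x)/\lambda and checks mean and variance against the exponential values 1/\lambda and 1/\lambda²:

```python
import numpy as np

rng = np.random.default_rng(4)
lam, n = 0.5, 1_000_000

x = rng.uniform(0.0, 1.0, n)   # uniform random numbers in [0, 1)
u = -np.log(1.0 - x) / lam     # exponentially distributed

# exponential distribution: mean 1/lam, variance 1/lam^2
```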

3.5 Multivariate Probability Densities


The results of the last sections are easily extended to multivariate distributions. We restrict ourselves here to the case of continuous distributions7.

3.5.1 Probability Density of two Variables


Definitions
As in the one-dimensional case we define an integral distribution function
F (x, y), now taken to be the probability to find values of the variates x′ , y ′
smaller than x, respectively y
7 An example of a multivariate discrete distribution will be presented in Sect. 3.6.2.

F (x, y) = P {(x′ < x) ∩ (y ′ < y)} . (3.33)

The following properties of this distribution function are satisfied:

F (∞, ∞) = 1,
F (−∞, y) = F (x, −∞) = 0 .

In addition, F has to be a monotone increasing function of both variables. We define a two-dimensional probability density, the so-called joint probability density, as the partial derivative of F with respect to the variables x, y:

f(x, y) = \frac{\partial^2 F}{\partial x\, \partial y} .
From these definitions follows the normalization condition
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, dx\, dy = 1 .

The projections fx(x), respectively fy(y), of the joint probability density onto the coordinate axes are called marginal distributions:

f_x(x) = \int_{-\infty}^{\infty} f(x, y)\, dy ,
f_y(y) = \int_{-\infty}^{\infty} f(x, y)\, dx .

The marginal distributions are one-dimensional (univariate) probability den-


sities.
The conditional probability densities for fixed values of the second variate
and normalized with respect to the first one are denoted by fx (x|y) and
fy (y|x) for given values of y or x, respectively. We have the following relations:

f_x(x|y) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\, dx} = \frac{f(x, y)}{f_y(y)} , (3.34)

f_y(y|x) = \frac{f(x, y)}{\int_{-\infty}^{\infty} f(x, y)\, dy} = \frac{f(x, y)}{f_x(x)} . (3.35)

Together, (3.34) and (3.35) express again Bayes’ theorem:

fx (x|y)fy (y) = fy (y|x)fx (x) = f (x, y) . (3.36)



Example 25. Superposition of two two-dimensional normal distributions


(The two-dimensional normal distribution will be discussed in Sect. 3.6.5.)
The marginal distributions fx(x), fy(y) and the conditional p.d.f. fy(y|x = 1) for the joint two-dimensional p.d.f.

f(x, y) = \frac{1}{2\pi} \left[ 0.6 \exp\left(-\frac{x^2}{2} - \frac{y^2}{2}\right) + \frac{0.4}{\sqrt{3}} \exp\left(-\frac{(x-2)^2}{3} - \frac{(y-2.5)^2}{4}\right) \right]

are

f_x(x) = \frac{1}{\sqrt{2\pi}} \left[ 0.6 \exp\left(-\frac{x^2}{2}\right) + \frac{0.4}{\sqrt{1.5}} \exp\left(-\frac{(x-2)^2}{3}\right) \right] ,

f_y(y) = \frac{1}{\sqrt{2\pi}} \left[ 0.6 \exp\left(-\frac{y^2}{2}\right) + \frac{0.4}{\sqrt{2}} \exp\left(-\frac{(y-2.5)^2}{4}\right) \right] ,

f(y, x = 1) = \frac{1}{2\pi} \left[ 0.6 \exp\left(-\frac{1}{2} - \frac{y^2}{2}\right) + \frac{0.4}{\sqrt{3}} \exp\left(-\frac{1}{3} - \frac{(y-2.5)^2}{4}\right) \right] ,

f_y(y|x = 1) = 0.667 \left[ 0.6 \exp\left(-\frac{1}{2} - \frac{y^2}{2}\right) + \frac{0.4}{\sqrt{3}} \exp\left(-\frac{1}{3} - \frac{(y-2.5)^2}{4}\right) \right] .

f_y(y|1) and f(y, 1) differ in the normalization factor, which results from the requirement \int f_y(y|1)\, dy = 1.

Graphical Presentation

Fig. 3.11 shows a similar superposition of two Gaussians together with its
marginal distributions and one conditional distribution. The chosen form of
the graphical representation as a contour plot for two-dimensional distribu-
tions is usually to be favored over three-dimensional surface plots.

3.5.2 Moments

Analogously to the one-dimensional case we define moments of two-dimensional


distributions:

Fig. 3.11. Two-dimensional probability density. The lower left-hand plot shows the conditional p.d.f. of y for x = 2. The lower curve is the p.d.f. f(y, 2). It corresponds to the dashed line in the upper plot. The right-hand side displays the marginal distributions.

\mu_x = E(x) ,
\mu_y = E(y) ,
\sigma_x^2 = E\left[(x - \mu_x)^2\right] ,
\sigma_y^2 = E\left[(y - \mu_y)^2\right] ,
\sigma_{xy} = E\left[(x - \mu_x)(y - \mu_y)\right] ,
\mu_{lm} = E(x^l y^m) ,
\mu'_{lm} = E\left[(x - \mu_x)^l (y - \mu_y)^m\right] .

Explicitly,

Fig. 3.12. Curves f(x, y) = const. with different correlation coefficients.

\mu_x = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x f(x, y)\, dx\, dy = \int_{-\infty}^{\infty} x f_x(x)\, dx ,
\mu_y = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y f(x, y)\, dx\, dy = \int_{-\infty}^{\infty} y f_y(y)\, dy ,
\mu_{lm} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^l y^m f(x, y)\, dx\, dy ,
\mu'_{lm} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x - \mu_x)^l (y - \mu_y)^m f(x, y)\, dx\, dy .

Obviously, µ′x , µ′y (= µ′10 , µ′01 ) are zero.

Correlations, Covariance, Independence

The mixed moment σxy is called covariance of x and y, and is sometimes also denoted as cov(x, y). If σxy is different from zero, the variables x and y are said to be correlated. The mean value of y depends on the value chosen for x and vice versa. Thus, for instance, the weight of a man is positively correlated with his height.
The degree of correlation is quantified by the dimensionless quantity

\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} ,

the correlation coefficient. Schwarz' inequality ensures |ρxy| ≤ 1.


Figure 3.12 shows lines of constant probability for various kinds of cor-
related distributions. In the extreme case |ρ| = 1 the variates are linearly
related.
If the correlation coefficient is zero, this does not necessarily mean sta-
tistical independence of the variates. The dependence may be more subtle,
as we will see shortly. As defined in Chap. 2, two random variables x, y are
called independent or orthogonal, if the probability to observe one of the two

variates x, y is independent from the value of the other one, i.e. the conditional distributions are equal to the marginal distributions, fx(x|y) = fx(x), fy(y|x) = fy(y). Independence is realized only if the joint distribution f(x, y) factorizes into its marginal distributions (see Chap. 2):

f (x, y) = fx (x)fy (y) .

Clearly, correlated variates cannot be independent.

Example 26. Correlated variates


A measurement uncertainty of a point in the xy-plane follows independent
normal distributions in the polar coordinates r, ϕ (the errors are assumed
small enough to neglect the regions r < 0 and |ϕ| > π ). A line of constant
probability in the xy-plane would look similar to the second graph of Fig. 3.12.
The cartesian coordinates are negatively correlated, although the original
polar coordinates have been chosen as uncorrelated, in fact they are even
independent.

Example 27. Dependent variates with correlation coefficient zero


For the probability density

f(x, y) = \frac{1}{2\pi\sqrt{x^2 + y^2}}\, e^{-\sqrt{x^2 + y^2}}

we find σxy = 0. The curves f = const. are circles, but x and y are not
independent, the conditional distribution fy (y|x) of y depends on x.

3.5.3 Transformation of Variables

The probability densities f(x, y) and g(u, v) are transformed, given the transformation functions u(x, y), v(x, y), analogously to the univariate case:

g(u, v)\, du\, dv = f(x, y)\, dx\, dy ,

g(u, v) = f(x, y) \left| \frac{\partial(x, y)}{\partial(u, v)} \right| ,

with the Jacobian determinant replacing the differential quotient dx/du.

Example 28. Transformation of a normal distribution from cartesian into polar coordinates
A two-dimensional normal distribution

f(x, y) = \frac{1}{2\pi}\, e^{-(x^2 + y^2)/2}

is to be transformed into polar coordinates

x = r\cos\varphi ,
y = r\sin\varphi .

The Jacobian is

\frac{\partial(x, y)}{\partial(r, \varphi)} = r .

We get

g(r, \varphi) = \frac{1}{2\pi}\, r\, e^{-r^2/2}

with the marginal distributions

g_r = \int_0^{2\pi} g(r, \varphi)\, d\varphi = r\, e^{-r^2/2} ,
g_\varphi = \int_0^{\infty} g(r, \varphi)\, dr = \frac{1}{2\pi} .

The joint distribution factorizes into its marginal distributions (Fig. 3.13).
Not only x, y, but also r, ϕ are independent.

3.5.4 Reduction of the Number of Variables


Frequently, we are faced with the problem to find from a given joint distribution f(x, y) the distribution g(u) of a dependent random variable u(x, y). We can reduce it to that of a usual transformation by inventing a second variable v = v(x, y), performing the transformation f(x, y) → h(u, v) and, finally, calculating the marginal distribution in u,

g(u) = \int_{-\infty}^{\infty} h(u, v)\, dv .

In most cases, the choice v = x is suitable. More formally, we might use the
equivalent reduction formula
g(u) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x, y)\, \delta\left(u - u(x, y)\right)\, dx\, dy . (3.37)

Fig. 3.13. Transformation of a two-dimensional normal distribution of cartesian coordinates into the distribution of polar coordinates: (a) lines of constant probability; (b) cartesian marginal distributions; (c), (d) marginal distributions of the polar coordinates.

For the distribution of a sum u = x + y of two independent variates x, y, i.e. f(x, y) = fx(x)fy(y), after integration over y follows

g(u) = \int f(x, u - x)\, dx = \int f_x(x)\, f_y(u - x)\, dx .

This is called the convolution integral or convolution product of fx and fy.

Example 29. Distribution of the difference of two digitally measured times


The true times t1, t2 are taken to follow a uniform distribution

f(t_1, t_2) = \begin{cases} 1/\Delta^2 & \text{for } |t_1 - T_1|,\, |t_2 - T_2| < \Delta/2 \\ 0 & \text{else} \end{cases}

around the readings T1 , T2 . We are interested in the probability density of the


difference t = t1 −t2 . To simplify the notation, we choose the case T1 = T2 = 0
and ∆ = 2 (Fig. 3.14). First we transform the variables according to

t = t1 − t2 ,
t1 = t1

Fig. 3.14. Distribution of the difference t between two times t1 and t2 which both have clock readings equal to zero.

with the Jacobian

\left| \frac{\partial(t_1, t_2)}{\partial(t_1, t)} \right| = 1 .

The new distribution is also uniform:

h(t_1, t) = f(t_1, t_2) = 1/\Delta^2 ,

and has the boundaries shown in Fig. 3.14. The form of the marginal distribution is found by integration over t1, or directly by reading it off from the figure:

g(t) = \begin{cases} (t - T + \Delta)/\Delta^2 & \text{for } -\Delta < t - T < 0 \\ (-t + T + \Delta)/\Delta^2 & \text{for } 0 < t - T < \Delta \\ 0 & \text{else ,} \end{cases}
where T = T1 − T2 now for arbitrary values of T1 and T2 .
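The triangular shape of g(t) is easy to reproduce by simulation. The sketch below (Python with NumPy; Δ = 2 and T1 = T2 = 0 as in the figure) draws the two times uniformly and compares the empirical distribution of t = t1 − t2 with the triangular distribution function:

```python
import numpy as np

rng = np.random.default_rng(5)
delta, n = 2.0, 1_000_000

t1 = rng.uniform(-delta / 2, delta / 2, n)
t2 = rng.uniform(-delta / 2, delta / 2, n)
t = t1 - t2                  # triangular distribution on [-delta, delta]

def G(x):
    """Distribution function of the triangular density g(t) for T = 0."""
    if x < 0.0:
        return (x + delta) ** 2 / (2.0 * delta**2)
    return 1.0 - (delta - x) ** 2 / (2.0 * delta**2)
```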

Example 30. Distribution of the transverse momentum squared of particle tracks
The projections of the momenta are assumed to be independently normally distributed,

f(p_x, p_y) = \frac{1}{2\pi s^2}\, e^{-(p_x^2 + p_y^2)/(2s^2)} ,

with equal variances \langle p_x^2 \rangle = \langle p_y^2 \rangle = s^2. For the transverse momentum squared we set q = p^2 and calculate its distribution. We transform the distribution into polar coordinates

p_x = \sqrt{q}\, \cos\varphi ,
p_y = \sqrt{q}\, \sin\varphi

with

\left| \frac{\partial(p_x, p_y)}{\partial(q, \varphi)} \right| = \frac{1}{2}

and obtain

h(q, \varphi) = \frac{1}{4\pi s^2}\, e^{-q/(2s^2)}

with the marginal distribution

h_q(q) = \int_0^{2\pi} \frac{1}{4\pi s^2}\, e^{-q/(2s^2)}\, d\varphi = \frac{1}{2s^2}\, e^{-q/(2s^2)} ,

g(p^2) = \frac{1}{\langle p^2 \rangle}\, e^{-p^2/\langle p^2 \rangle} .

The result is an exponential distribution in p^2 with mean \langle p^2 \rangle = \langle p_x^2 \rangle + \langle p_y^2 \rangle .

Example 31. Quotient of two normally distributed variates


For variates x, y, independently and identically normally distributed, i.e.

f(x, y) = f(x)\, f(y) = \frac{1}{2\pi} \exp\left(-\frac{x^2 + y^2}{2}\right) ,
we want to find the distribution g(u) of the quotient u = y/x. Again, we transform first into new variates u = y/x, v = x, or, inverted, x = v, y = uv, and get

h(u, v) = f\left(x(u, v), y(u, v)\right) \left| \frac{\partial(x, y)}{\partial(u, v)} \right| ,

with the Jacobian

\frac{\partial(x, y)}{\partial(u, v)} = -v ,

hence

g(u) = \int h(u, v)\, dv
= \frac{1}{2\pi} \int_{-\infty}^{\infty} \exp\left(-\frac{v^2 + u^2 v^2}{2}\right) |v|\, dv
= \frac{1}{\pi} \int_0^{\infty} e^{-(1 + u^2)z}\, dz
= \frac{1}{\pi}\, \frac{1}{1 + u^2} ,
where the substitution z = v²/2 has been used. The result g(u) is the Cauchy distribution (see Sect. 3.6.9). Its long tails are caused here by the finite probability of arbitrarily small values in the denominator. This effect is quite important in experimental situations when we estimate the uncertainty of quantities which are quotients of normally distributed variates in cases where the p.d.f. of the denominator is not negligible at the value zero.
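The Cauchy shape can be checked by simulation. Since mean and variance of the Cauchy distribution do not exist, the sketch below (Python with NumPy) compares quantiles instead: from the distribution function F(u) = 1/2 + arctan(u)/π, the median of u = y/x should be 0 and the quartiles ±1:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000

x = rng.normal(0.0, 1.0, n)
y = rng.normal(0.0, 1.0, n)
u = y / x                      # should follow the Cauchy distribution

q25, q50, q75 = np.quantile(u, [0.25, 0.5, 0.75])
# Cauchy distribution: median 0, quartiles -1 and +1
```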

The few examples given above should not lead to the impression that
transformations of variates always yield more or less simple analytical expres-
sions for the resulting distributions. This is rather the exception. However,
as we will learn in Chap. 5, a simple, straightforward numerical solution is
provided by Monte Carlo methods.

3.5.5 Determination of the Transformation between two Distributions
As in the one-dimensional case, for the purpose of simulation, we frequently
need to generate a required distribution from the uniformly distributed ran-
dom numbers delivered by the computer. The general method of integration
and inversion of the cumulative distribution can be used directly, only if we
deal with independent variates. Often, a transformation of the variates is
helpful. We consider here a special example, which we need later in Chap. 5.

Example 32. Generation of a two-dimensional normal distribution, starting from uniform distributions
We use the result from example 28 and start with the representation of
the two-dimensional Gaussian in polar coordinates
g(\rho, \varphi)\, d\rho\, d\varphi = \frac{1}{2\pi}\, d\varphi\ \rho\, e^{-\rho^2/2}\, d\rho ,

which factorizes in ϕ and ρ. With two variates r1, r2, uniformly distributed in the interval [0, 1], we obtain the function ρ(r1):

\int_0^{\rho} \rho'\, e^{-\rho'^2/2}\, d\rho' = r_1 ,
\left[ -e^{-\rho'^2/2} \right]_0^{\rho} = r_1 ,
1 - e^{-\rho^2/2} = r_1 ,
\rho = \sqrt{-2\ln(1 - r_1)} .

In the same way we get ϕ(r2 ):

ϕ = 2πr2 .

Finally we find x and y:


x = \rho\cos\varphi = \sqrt{-2\ln(1 - r_1)}\ \cos(2\pi r_2) , (3.38)
y = \rho\sin\varphi = \sqrt{-2\ln(1 - r_1)}\ \sin(2\pi r_2) . (3.39)

These variables are independent and distributed normally about the origin with variance unity:

f(x, y) = \frac{1}{2\pi}\, e^{-(x^2 + y^2)/2} .
(We could replace 1 − r1 by r1 , since 1 − r1 is uniformly distributed as well.)
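Equations (3.38) and (3.39) constitute the well-known Box–Muller prescription. A minimal sketch (Python with NumPy) implements it and checks that x, y come out as uncorrelated standard normal variates:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

r1 = rng.uniform(0.0, 1.0, n)
r2 = rng.uniform(0.0, 1.0, n)

rho = np.sqrt(-2.0 * np.log(1.0 - r1))   # radius, as derived above
x = rho * np.cos(2.0 * np.pi * r2)       # (3.38)
y = rho * np.sin(2.0 * np.pi * r2)       # (3.39)

# x and y should be independent standard normal variates
```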

3.5.6 Distributions of more than two Variables

It is not difficult to generalize the relations just derived for two variables to multivariate distributions of N variables. We define the distribution function F(x1, . . . , xN) as the probability to find values of the variates smaller than x1, . . . , xN,

F(x_1, \ldots, x_N) = P\{(x'_1 < x_1) \cap \cdots \cap (x'_N < x_N)\} ,

and the p.d.f.

f(x_1, \ldots, x_N) = \frac{\partial^N F}{\partial x_1 \cdots \partial x_N} .
Often it is convenient to use the vector notation, F (x), f (x) with

x = {x1 , x2 , . . . , xN } .

These variate vectors can be represented as points in an N -dimensional space.


The p.d.f. f (x) can also be defined directly, without reference to the
distribution function F (x), as the density of points at the location x, by
setting

f(x_1, \ldots, x_N)\, dx_1 \cdots dx_N = dP\left\{ \left(x_1 - \frac{dx_1}{2} \le x'_1 \le x_1 + \frac{dx_1}{2}\right) \cap \cdots \cap \left(x_N - \frac{dx_N}{2} \le x'_N \le x_N + \frac{dx_N}{2}\right) \right\} .

Expected Values and Correlation Matrix

The expected value of a function u(x) is


E(u) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u(x)\, f(x) \prod_{i=1}^{N} dx_i .

Because of the additivity of expected values this relation also holds for vector
functions u(x).
The dispersion of multivariate distributions is now described by the so-
called covariance matrix C:

Cij = h(xi − hxi i)(xj − hxj i)i = hxi xj i − hxi ihxj i .

The correlation matrix is given by


\rho_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\, C_{jj}}} .

Transformation of Variables

The densities transform as in the two-dimensional case: we multiply with the absolute value of the N-dimensional Jacobian,

g(y) = f(x) \left| \frac{\partial(x_1, \ldots, x_N)}{\partial(y_1, \ldots, y_N)} \right| .

Correlation and Independence

As in the two-dimensional case, two variables xi, xj are called uncorrelated if their correlation coefficient ρij is equal to zero. The two variates xi, xj are independent if the conditional p.d.f. of xi conditioned on all other variates does not depend on xj. The combined density f then has to factorize into two factors where one of them is independent of xi and the other one is independent of xj8. All variates are independent of each other, if

f(x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} f_{x_i}(x_i) .

8 We omit the formulas because they are very clumsy.

3.5.7 Independent, Identically Distributed Variables

One of the main topics of statistics is the estimation of free parameters of a


distribution from a random sample of observations all drawn from the same
population. For example, we might want to estimate the mean lifetime τ of a
particle from N independent measurements ti where t follows an exponential
distribution depending on τ . The probability density f˜ for N independent
and identically distributed variates (abbreviated as i.i.d. variates) xi , each
distributed according to f (x), is, according to the definition of independence,
\tilde{f}(x_1, \ldots, x_N) = \prod_{i=1}^{N} f(x_i) .

The covariance matrix of i.i.d. variables is diagonal, with C_{ii} = \mathrm{var}(x_i) = \mathrm{var}(x_1).

3.5.8 Angular Distributions

In physics applications we are often interested in spatial distributions. For-


tunately our problems often exhibit certain symmetries which facilitate the
description of the phenomena. Depending on the kind of symmetry of the
physical process or the detector, we choose appropriate coordinates, spherical,
cylindrical or polar. These coordinates are especially well suited to describe
processes where radiation is emitted by a local source or where the detector
has a spherical or cylindrical symmetry. Then the distance, i.e. the radius
vector, is not the most interesting parameter and we often describe the pro-
cess solely by angular distributions. In other situations, only directions enter,
for example in particle scattering, when we investigate the polarization of
light crossing an optically active medium, or of a particle decaying in flight
into a pair of secondaries where the orientation of the normal of the decay
plane contains relevant information. Similarly, distributions of physical pa-
rameters on the surface of the earth are expressed as functions of the angular
coordinates.

Distribution of the Polar Angle

As already explained above, the expressions

x = r cos ϕ ,
y = r sin ϕ

relate the polar coordinates r, ϕ to the cartesian coordinates x , y. Since we


have periodic functions, we restrict the angle ϕ to the interval [−π, π]. This
choice is arbitrary to a certain extent.

For an isotropic distribution all angles are equally likely and we obtain the uniform distribution of ϕ,

g(\varphi) = \frac{1}{2\pi} .

Since we have to deal with periodic functions, we have to be careful when
we compute moments and in general expected values. For example the mean
of the two angles ϕ1 = π/2, ϕ2 = −π is not (ϕ1 + ϕ2 )/2 = −π/4, but 3π/4.
To avoid this kind of mistake it is advisable to go back to the unit vectors
{xi , yi } = {cos ϕi , sin ϕi }, to average those and to extract the resulting angle.

Example 33. The v. Mises distribution


We consider the Brownian motion of a particle on the surface of a liquid.
Starting from a point r 0 its position r after some time will be given by the
expression

f(\mathbf{r}) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{|\mathbf{r} - \mathbf{r}_0|^2}{2\sigma^2}\right) .
Taking into account the Jacobian ∂(x, y)/∂(r, ϕ) = r, the distribution in polar coordinates is:

g(r, \varphi) = \frac{r}{2\pi\sigma^2} \exp\left(-\frac{r^2 + r_0^2 - 2rr_0\cos\varphi}{2\sigma^2}\right) .
For convenience we have chosen the origin of ϕ such that ϕ0 = 0. For fixed r
we obtain the conditional distribution
g̃(ϕ) = g(ϕ|r) = cN (κ) exp (κ cos ϕ)
with κ = rr0 /σ 2 and cN (κ) the normalization constant. This is the v. Mises
distribution. It is symmetric in ϕ, unimodal with its maximum at ϕ = 0. The
normalization

c_N(\kappa) = \frac{1}{2\pi I_0(\kappa)}
contains I0 , the modified Bessel function of order zero [25]. For large values of
κ the distribution approaches a Gaussian with variance 1/κ. To demonstrate
this feature, we rewrite the distribution in a slightly modified way,
\tilde{g}(\varphi) = c_N(\kappa)\, e^{\kappa}\, e^{-\kappa(1 - \cos\varphi)} ,

and make use of the asymptotic form I_0(x) \sim e^x/\sqrt{2\pi x} for x \to \infty (see [25]). The exponential function is suppressed for large values of (1 − cos ϕ), and small values can be approximated by ϕ²/2. Thus the asymptotic form of the distribution is

\tilde{g} = \sqrt{\frac{\kappa}{2\pi}}\, e^{-\kappa\varphi^2/2} . (3.40)
In the limit κ = 0, which is the case for r0 = 0 or σ → ∞, the distribution
becomes uniform, as it should.
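NumPy provides a generator for this distribution. The sketch below draws from it and checks the asymptotic Gaussian behaviour (3.40): for large κ the sample variance of ϕ should approach 1/κ (κ = 50 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(8)
kappa, n = 50.0, 1_000_000

phi = rng.vonmises(0.0, kappa, n)   # v. Mises sample with maximum at phi = 0

# for large kappa, (3.40) predicts approximately a Gaussian N(0, 1/kappa)
mean_phi, var_phi = phi.mean(), phi.var()
```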

Distribution of Spherical Angles

Spatial directions are described by the polar angle θ and the azimuthal angle
ϕ which we define through the transformation relations from the cartesian
coordinates:

x = r sin θ cos ϕ , −π ≤ ϕ ≤ π
y = r sin θ sin ϕ , 0 ≤ θ ≤ π
z = r cos θ .

The Jacobian is ∂(x, y, z)/∂(r, θ, ϕ) = r2 sin θ. A uniform distribution inside


a sphere of radius R in cartesian coordinates

f_u(x, y, z) = \begin{cases} 3/(4\pi R^3) & \text{if } x^2 + y^2 + z^2 \le R^2 , \\ 0 & \text{else} \end{cases}

thus transforms into

h_u(r, \theta, \varphi) = \frac{3r^2}{4\pi R^3}\, \sin\theta \quad \text{if } r \le R .

We obtain the isotropic angular distribution by marginalizing or conditioning on r:

h_u(\theta, \varphi) = \frac{1}{4\pi}\, \sin\theta . (3.41)

Spatial distributions are usually expressed in the coordinates z̃ = cos θ and ϕ, because then the uniform distribution simplifies further to

g_u(\tilde{z}, \varphi) = \frac{1}{4\pi}

with |z̃| ≤ 1.
The p.d.f. g(z̃, ϕ) of an arbitrary distribution of z̃, ϕ is defined in the
standard way through the probability d2 P = g(z̃, ϕ)dz̃dϕ. The product
dz̃dϕ = sin θdθdϕ = d2 Ω is called solid angle element and corresponds to
an infinitesimal area at the surface of the unit sphere. A solid angle Ω defines
a certain area at this surface and contains all directions pointing into this
area.

Example 34. Fisher’s spherical distribution


Instead of the uniform distribution considered in the previous example
we now investigate the angular distribution generated by a three-dimensional
rotationally symmetric Gaussian distribution with variances σ 2 = σx2 = σy2 =
σz². We put the center of the Gaussian on the z-axis, r0 = {0, 0, 1}. In spherical coordinates we then obtain the p.d.f.

f(r, \theta, \varphi) = \frac{1}{(2\pi)^{3/2}\sigma^3}\, r^2 \sin\theta\, \exp\left(-\frac{r^2 + r_0^2 - 2rr_0\cos\theta}{2\sigma^2}\right) .

For fixed distance r we obtain a function of θ and ϕ only which for our choice
of r0 is also independent of ϕ:

g(θ, ϕ) = cN (κ) sin θ exp(κ cos θ) .

The parameter κ is again given by κ = rr0/σ². Applying the normalization condition ∫ g dθ dϕ = 1 we find cN(κ) = κ/(4π sinh κ) and

g(\theta, \varphi) = \frac{\kappa}{4\pi\sinh\kappa}\, e^{\kappa\cos\theta}\, \sin\theta , (3.42)
a two-dimensional, unimodal distribution, known as Fisher's spherical distribution. As in the previous example we get in the limit κ → 0 the uniform distribution (3.41) and for large κ the asymptotic distribution

\tilde{g}(\theta, \varphi) \approx \frac{\kappa}{2\pi}\, \theta\, e^{-\kappa\theta^2/2} ,

which is an exponential distribution of θ². As a function of z̃ = cos θ the distribution (3.42) simplifies to

h(\tilde{z}, \varphi) = \frac{\kappa}{4\pi\sinh\kappa}\, e^{\kappa\tilde{z}} ,

which illustrates the spatial shape of the distribution much better than (3.42).
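Because h(z̃, ϕ) is so simple, directions following Fisher's distribution can be generated with the inverse-transform method of Sect. 3.4.2 applied to z̃ = cos θ. The sketch below (Python with NumPy; κ = 2 is an arbitrary choice) does this and checks the sample mean of z̃ against the analytic value E(z̃) = coth κ − 1/κ, obtained by differentiating the logarithm of the normalization integral with respect to κ:

```python
import numpy as np

rng = np.random.default_rng(9)
kappa, n = 2.0, 1_000_000

r = rng.uniform(0.0, 1.0, n)
# inverse of the distribution function of h(z) ~ exp(kappa*z) on [-1, 1]
z = np.log(np.exp(-kappa) + r * (np.exp(kappa) - np.exp(-kappa))) / kappa
phi = rng.uniform(-np.pi, np.pi, n)   # the azimuth is uniform

# analytic mean of z = cos(theta): coth(kappa) - 1/kappa
mean_z = z.mean()
```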

3.6 Some Important Distributions

3.6.1 The Binomial Distribution

What is the probability to get exactly two sixes with ten dice? The answer
is given by the binomial distribution:

B_{1/6}^{10}(2) = \binom{10}{2} (1/6)^2 (1 − 1/6)^8 .

The probability to get with 2 particular dice six, and with the remaining 8
dice not the number six, is given by the product of the two power factors.
The binomial coefficient

\binom{10}{2} = 10!/(2! 8!)
3.6 Some Important Distributions 63

counts the number of possibilities to distribute the 2 sixes over the 10 dice.
These are just 45. With the above formula we obtain a probability of about
0.29.
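The dice probability above can be checked directly; a minimal sketch in Python (the helper name binom_pmf is our own choice):

```python
from math import comb

def binom_pmf(k, n, p):
    # B_p^n(k) = C(n, k) p^k (1 - p)^(n - k)
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Two sixes among ten dice, p = 1/6:
n_ways = comb(10, 2)             # 45 ways to pick the two dice
prob = binom_pmf(2, 10, 1 / 6)   # about 0.29
```

Summing binom_pmf over k = 0, ..., 10 reproduces the normalization condition (3.43).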
Considering, more generally, n randomly chosen objects (or a sequence of
n independent trials) which have with probability p the property A, which
we will call success, the probability to find k out of these n objects with
property A is B_p^n(k),
 
B_p^n(k) = \binom{n}{k} p^k (1 − p)^{n−k} ,   k = 0, . . . , n .

Since this is just the term of order p^k in the power expansion of [p + (1 − p)]^n,
we have the normalization condition

[p + (1 − p)]^n = 1 ,   (3.43)
Σ_{k=0}^{n} B_p^n(k) = 1 .

Since the mean number of successes in one trial is given by p, we obtain,


following the rules for expected values, for n independent trials

E(k) = np .

With a similar argument we can find the variance: For n = 1, we can directly
compute the expected quadratic difference, i.e. the variance σ_1^2. Using ⟨k⟩ = p
and that k = 1 is found with probability⁹ P{1} = p and k = 0 with P{0} =
1 − p, we find:

σ_1^2 = ⟨(k − ⟨k⟩)^2⟩
      = p (1 − p)^2 + (1 − p)(0 − p)^2
      = p(1 − p) .

According to (3.12) the variance of the sum of n i.i.d. random numbers is

σ^2 = n σ_1^2 = np(1 − p) .

The characteristic function has the form:


φ(t) = [1 + p (e^{it} − 1)]^n .   (3.44)

It is easily derived by substituting in the kth term of the expansion of (3.43)
p^k with (p e^{it})^k. From (3.25) follows the property of stability, which is also
convincing intuitively:
convincing intuitively:
⁹ For n = 1 the binomial distribution is also called two-point or Bernoulli distribution.
64 3 Probability Distributions and their Properties

The distribution of a sum of numbers k = k_1 + · · · + k_N obeying binomial
distributions B_p^{n_i}(k_i) is again a binomial distribution B_p^n(k) with n = n_1 +
· · · + n_N.
There is no particularly simple expression for higher moments; they can of
course be calculated from the Taylor expansion of φ(t), as explained in Sect.
3.3.2. We give only the results for the coefficients of skewness and excess:

γ_1 = (1 − 2p)/\sqrt{np(1 − p)} ,   γ_2 = (1 − 6p(1 − p))/(np(1 − p)) .

Example 35. Efficiency fluctuations of a Geiger counter


A Geiger counter with a registration probability of 90% (p = 0.9) detects
n′ out of n = 1000 particles crossing it. On average this will be ⟨n′⟩ =
np = 900. The mean fluctuation (standard deviation) of this number is σ =
\sqrt{np(1 − p)} = \sqrt{90} ≈ 9.5. The observed efficiency ε = n′/n will fluctuate
by σ_ε = σ/n = \sqrt{p(1 − p)/n} ≈ 0.0095. For efficiencies close to one, we find
σ ≈ \sqrt{n(1 − p)}, which is the Poisson fluctuation of the missing particles. For
small efficiencies, we find σ ≈ \sqrt{np}, which is the Poisson fluctuation of the
detected particles.

Example 36. Accuracy of a Monte Carlo integration


We want to estimate the value of π by a Monte Carlo integration. We
distribute randomly n points in a square of area 4 cm^2, centered at the origin.
The number of points with a distance less than 1 cm from the origin is k = np
with p = π/4. To reach an accuracy of 1% requires
σ/(np) = 0.01 ,
\sqrt{np(1 − p)}/(np) = 0.01 ,
n = (1 − p)/(0.01^2 p) = (4 − π)/(0.01^2 π) ≈ 2732 ,
i.e. we have to generate n = 2732 pairs of random numbers.
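A minimal sketch of this integration in Python (the seed and variable names are our own choices; the 1% accuracy is statistical, so individual runs scatter around π accordingly):

```python
import random

random.seed(1)
n = 2732                             # number of points from the estimate above
hits = 0
for _ in range(n):
    x = random.uniform(-1.0, 1.0)    # point in the square of area 4
    y = random.uniform(-1.0, 1.0)
    if x * x + y * y < 1.0:          # inside the unit circle
        hits += 1
pi_estimate = 4.0 * hits / n         # should be correct to about 1%
```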

Example 37. Acceptance fluctuations for weighted events


3.6 Some Important Distributions 65

The acceptance of a complex detector is determined by Monte Carlo sim-


ulation which depends on a probability density f0 (x) where x denotes all
relevant kinematical variables. In order to avoid the repetition of the sim-
ulation for a different physical situation (e.g. a different cross section) de-
scribed by a p.d.f. f(x), it is customary to weight the individual events with
w_i = f(x_i)/f_0(x_i), i = 1, . . . , N for N generated events. The acceptance ε_i for
event i is either 1 or 0. Hence the overall acceptance is
ε_T = Σ w_i ε_i / Σ w_i .

The variance for each single term in the numerator is w_i^2 ε_i(1 − ε_i). Then the
variance σ_T^2 of ε_T becomes

σ_T^2 = Σ w_i^2 ε_i(1 − ε_i) / (Σ w_i)^2 .
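For illustration, the two formulas can be evaluated on a small set of invented numbers; here we treat ε_i as per-event acceptance probabilities, since for observed 0/1 acceptances the variance has to be estimated from such probabilities (all values below are hypothetical):

```python
# Hypothetical weights w_i = f(x_i)/f0(x_i) and per-event acceptance
# probabilities eps_i (invented numbers, for illustration only):
w   = [1.2, 0.8, 1.0, 1.5, 0.6]
eps = [0.9, 0.7, 0.8, 0.95, 0.6]

sum_w = sum(w)
# Overall acceptance eps_T = sum(w_i eps_i) / sum(w_i):
eps_T = sum(wi * ei for wi, ei in zip(w, eps)) / sum_w
# Variance sigma_T^2 = sum(w_i^2 eps_i (1 - eps_i)) / (sum w_i)^2:
var_T = sum(wi ** 2 * ei * (1 - ei) for wi, ei in zip(w, eps)) / sum_w ** 2
sigma_T = var_T ** 0.5
```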

3.6.2 The Multinomial Distribution

When a single experiment or trial has not only two, but N possible outcomes
with probabilities p1 , p2 , . . . , pN , the probability to observe in n experiments
k1 , k2 , . . . , kN trials belonging to the outcomes 1, . . . , N is equal to
M^n_{p_1,p_2,...,p_N}(k_1, k_2, . . . , k_N) = n!/(k_1! · · · k_N!) · ∏_{i=1}^{N} p_i^{k_i} ,

where Σ_{i=1}^{N} p_i = 1 and Σ_{i=1}^{N} k_i = n are satisfied. Hence we have N − 1
independent variates. The value N = 2 reproduces the binomial distribution.
In complete analogy to the binomial distribution, the multinomial distri-
bution may be generated by expanding the multinomial

(p_1 + p_2 + · · · + p_N)^n = 1

in powers of pi , see (3.43). The binomial coefficients are replaced by multino-


mial coefficients which count the number of ways in which n distinguishable
objects can be distributed into N classes which contain k1 , . . . , kN objects.
The expected values are

E(k_i) = n p_i

and the covariance matrix is given by

C_{ij} = n p_i (δ_{ij} − p_j) .

They can be derived from the characteristic function


66 3 Probability Distributions and their Properties

φ(t_1, . . . , t_{N−1}) = (1 + Σ_{i=1}^{N−1} p_i (e^{i t_i} − 1))^n ,

which is a straightforward generalization of the one-dimensional case (3.44).
The correlations are negative: If, for instance, more events k_i than expected fall
into class i, the mean number of k_j for any other class will tend to be smaller
than its expected value E(k_j).
The multinomial distribution applies for the distribution of events into
histogram bins. For a total number n of events with the probability p_i to
collect an event in bin i, the expected number of events in that bin will be
n_i = n p_i and the variance C_ii = n p_i(1 − p_i). Normally a histogram has many
bins and p_i ≪ 1 for all i. Then we approximate C_{ij} ≈ n_i δ_{ij}. The correlation
between the bin entries can be neglected and the fluctuation of the entries
in a bin is described by the Poisson distribution which we will discuss in the
following section.

3.6.3 The Poisson Distribution

When a certain reaction happens randomly in time with an average frequency


λ in a given time interval, then the number k of reactions in that time interval
will follow a Poisson distribution (Fig. 3.15)

P_λ(k) = e^{−λ} λ^k / k! .
Occasionally we will use also the notation P(k|λ). The expected value
and variance have already been calculated above (see 38):

E(k) = λ , var(k) = λ .

The characteristic function and cumulants have also been derived in Sect.
3.3.2:

φ(t) = exp(λ(e^{it} − 1)) ,   (3.45)
κ_i = λ ,   i = 1, 2, . . . .
Skewness and excess,

γ_1 = 1/\sqrt{λ} ,   γ_2 = 1/λ ,

decrease with λ and indicate that the distribution approaches the normal
distribution (γ1 = 0, γ2 = 0) with increasing λ (see Fig. 3.15).
The Poisson distribution itself can be considered as the limiting case of
a binomial distribution with np = λ, where n approaches infinity (n → ∞)
and, at the same time, p approaches zero, p → 0. The corresponding limit of
the characteristic function of the binomial distribution (3.44) produces the
3.6 Some Important Distributions 67

Fig. 3.15. Poisson distributions with different expected values (λ = 1, 5, 20, 100).



characteristic function of the Poisson distribution (3.45): With p = λ/n we
then obtain

lim_{n→∞} (1 + (λ/n)(e^{it} − 1))^n = exp(λ(e^{it} − 1)) .
For the Poisson distribution, the supply of potential events or number of
trials is supposed to be infinite while the chance of a success, p, tends to zero.
It is often used in cases where in principle the binomial distribution applies,
but where the number of trials is very large.
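This limit can also be checked numerically; a small sketch (the helper names are our own):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

lam = 5.0
# Largest pointwise difference between B_{lam/n}^n and P_lam for growing n,
# holding n*p = lam fixed:
diffs = [max(abs(binom_pmf(k, n, lam / n) - poisson_pmf(k, lam))
             for k in range(20))
         for n in (10, 100, 10000)]
# diffs shrinks toward zero as n grows
```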

Example 38. Poisson limit of the binomial distribution


A volume of 1 l contains 10^16 hydrogen ions. The mean number of ions in
a sub-volume of 1 µm^3 is then λ = 10 and its standard deviation for a Poisson
distribution is σ = \sqrt{10} ≈ 3. The exact calculation of the standard deviation
with the binomial distribution would change σ only by a factor \sqrt{1 − 10^{−15}}.

Also the number of radioactive decays in a given time interval follows a


Poisson distribution, if the number of nuclei is big and the decay probability
for a single nucleus is small.
The Poisson distribution is of exceptional importance in nuclear and par-
ticle physics, but also in the fields of microelectronics (noise), optics, and gas
discharges it describes the statistical fluctuations.

Specific Properties of the Poisson Distribution

The sum k = k1 + k2 of Poisson distributed numbers k1 , k2 with expected


values λ1 , λ2 is again a Poisson distributed number with expected value
λ = λ1 + λ2 . This property, which we called stability in connection with
the binomial distribution follows formally from the structure of the charac-
teristic function, or from the additivity of the cumulants given above. It is
also intuitively obvious.

Example 39. Fluctuation of a counting rate minus background


Expected are S signal events with a mean background B. The mean fluctuation
(standard deviation) of the observed number k is \sqrt{S + B}. This is
also the fluctuation of k − B, because B is a constant. For a mean signal
S = 100 and an expected background B = 50 we will observe on average 150
events with a fluctuation of \sqrt{150}. After subtracting the background, this
fluctuation will remain. Hence, the background corrected signal is expected
to be 100 with the standard deviation σ = \sqrt{150}. The uncertainty would be
even larger if also the mean value B was not known exactly.
3.6 Some Important Distributions 69

If from a Poisson-distributed number n with expected value λ0 on average


only a fraction ε is registered, for instance when the size of a detector is
reduced by a factor of ε, then the expected rate is λ = λ0 ε and the number of
observed events k follows the Poisson distribution Pλ (k). This intuitive result
is also obtained analytically: The number k follows a binomial distribution
B_ε^n(k) where n is a Poisson-distributed number. The probability p(k) is:

p(k) = Σ_{n=k}^{∞} B_ε^n(k) P_{λ_0}(n)
     = Σ_{n=k}^{∞} n!/(k!(n − k)!) · ε^k (1 − ε)^{n−k} · e^{−λ_0} λ_0^n/n!
     = e^{−λ_0} (ελ_0)^k/k! · Σ_{n=k}^{∞} (λ_0 − λ_0 ε)^{n−k}/(n − k)!
     = e^{−ελ_0} (ελ_0)^k/k!
     = P_λ(k) .
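The thinning property can also be checked by simulation; a sketch assuming a simple Poisson generator (Knuth's multiplication method; seed and parameters are our own choices):

```python
import math
import random
from statistics import mean, variance

random.seed(3)

def poisson_sample(lam):
    # Knuth's method: multiply uniforms until the product drops below e^-lam
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

lam0, eps = 8.0, 0.25          # original mean and registration probability
thinned = []
for _ in range(20000):
    n = poisson_sample(lam0)
    k = sum(1 for _ in range(n) if random.random() < eps)  # binomial thinning
    thinned.append(k)

m, v = mean(thinned), variance(thinned)
# For a Poisson variate, mean == variance == eps * lam0 = 2
```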

Of interest is also the following mathematical identity


Σ_{i=0}^{k} P_λ(i) = ∫_λ^∞ dλ′ P_{λ′}(k) ,
Σ_{i=0}^{k} e^{−λ} λ^i/i! = ∫_λ^∞ (λ′)^k/k! · e^{−λ′} dλ′ ,

which allows us to calculate the probability P{i ≤ k} to find a number i less
than or equal to k using a well-known integral (described by the incomplete
gamma function). It is applied to estimate upper and lower interval limits in
Chap. 8.
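The identity can be verified numerically; a sketch with a crude trapezoidal integration (the step count and upper cutoff are arbitrary choices):

```python
import math

def poisson_cdf(k, lam):
    # Left-hand side: sum of Poisson probabilities up to k
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))

def poisson_tail_integral(k, lam, upper=60.0, steps=100000):
    # Right-hand side: integral over lam' of P_lam'(k) from lam to upper
    f = lambda x: x ** k * math.exp(-x) / math.factorial(k)
    h = (upper - lam) / steps
    s = 0.5 * (f(lam) + f(upper)) + sum(f(lam + i * h) for i in range(1, steps))
    return s * h

lhs = poisson_cdf(4, 3.0)
rhs = poisson_tail_integral(4, 3.0)
# lhs and rhs agree up to the numerical integration error
```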

3.6.4 The Uniform Distribution

The uniform distribution is the simplest continuous distribution. It describes,


for instance, digital measurements where the random variable is tied to a
given interval and where inside the interval all its values are equally probable.
Given an interval of length α centered at the mean value ξ the p.d.f. reads

1/α if |x − ξ| < α/2
f (x|ξ, α) = (3.46)
0 else .

Mean value and variance are ⟨x⟩ = ξ and σ^2 = α^2/12, respectively. The
characteristic function is
70 3 Probability Distributions and their Properties
φ(t) = (1/α) ∫_{ξ−α/2}^{ξ+α/2} e^{itx} dx = (2/(αt)) sin(αt/2) · e^{iξt} .   (3.47)

Using the power expansion of the sine function we find from (3.47) for ξ = 0
the even moments (the odd moments vanish):

µ′_{2k} = (1/(2k + 1)) (α/2)^{2k} ,   µ′_{2k−1} = 0 .
The uniform distribution is the basis for the computer simulation of all
other distributions because random number generators for numbers uniformly
distributed between 0 and 1 are implemented on all computers used for sci-
entific purposes. We will discuss simulations in some detail in Chap. 5.
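As a small illustration of this role, a uniform generator combined with the inverse of a cumulative distribution yields samples from other distributions, here the exponential distribution of Sect. 3.6.6 (seed and sample size are arbitrary choices):

```python
import math
import random
from statistics import mean

random.seed(5)
lam = 2.0
# Inverse-transform method: for u uniform in [0,1),
# x = -ln(1-u)/lam is exponentially distributed with rate lam.
xs = [-math.log(1.0 - random.random()) / lam for _ in range(50000)]
m = mean(xs)   # should approach the exponential mean 1/lam = 0.5
```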

3.6.5 The Normal Distribution

The normal or Gauss distribution which we introduced already in Sect. 3.2.7,


N(x|µ, σ) = 1/(\sqrt{2π} σ) · e^{−(x−µ)^2/(2σ^2)} ,
enjoys great popularity among statisticians. This has several reasons which,
however, are not independent from each other.
1. The sum of normally distributed quantities is again normally
distributed (stability), with µ = Σ µ_i, σ^2 = Σ σ_i^2, in obvious notation.
2. The discrete binomial and Poisson distributions and also the χ^2
distribution approach the normal distribution in the limit of a large number
of trials, a large mean value and many degrees of freedom, respectively.
3. Many distributions met in natural sciences are well approximated by
normal distributions. We have already mentioned some examples: velocity
components of gas molecules, diffusion, Brownian motion and many mea-
surement errors obey normal distributions to good accuracy.
4. Certain analytically simple statistical procedures for parameter estima-
tion and propagation of errors are valid exactly only for normally distributed
errors.
The deeper reason for point 2 and 3 is explained by the central limit theo-
rem: The mean value of a large number N of independent random variables,
obeying the same distribution with variance σ_0^2, approaches a normal distribution
with variance σ^2 = σ_0^2/N. The important point is that this theorem
is valid for quite arbitrary distributions, provided they have a finite variance,
a condition which practically always can be fulfilled, if necessary by cutting
off large absolute values of the variates. Instead of a formal proof¹⁰, we show
in Fig. 3.16, how with increasing number of variates the distribution of their
mean value approaches the normal distribution better and better.
¹⁰ A simplified proof is presented in Appendix 13.1.
3.6 Some Important Distributions 71

Fig. 3.16. Illustration of the central limit theorem. The mean values of N exponentially
or uniformly distributed variates times \sqrt{N} approach with increasing N
the normal distribution with variance one (shown for N = 1, 5, 25).

As examples we have chosen the mean values of uniformly and of exponentially
distributed numbers. For the very asymmetric exponential
distribution on the left-hand side of the figure the convergence to a normal
distribution is not as fast as for the uniform distribution, where already the
distribution is not as fast as for the uniform distribution, where already the
distribution of the mean of five random numbers is in good agreement with
the normal distribution. The central limit theorem applies also when the
individual variates follow different distributions provided that the variances
are of the same order of magnitude.
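The scaling σ = σ_0/√N of the mean can be checked with a short simulation (sample sizes and repetition counts are arbitrary choices):

```python
import random
from statistics import pstdev

random.seed(11)

def mean_of_uniforms(N):
    return sum(random.random() for _ in range(N)) / N

sigma0 = (1.0 / 12.0) ** 0.5        # standard deviation of Uniform(0,1)
spread = {}
for N in (5, 25):
    means = [mean_of_uniforms(N) for _ in range(20000)]
    spread[N] = pstdev(means)       # should be close to sigma0 / sqrt(N)
```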
The characteristic function of the normal distribution is
φ(t) = exp(−σ^2 t^2/2 + iµt) .
It is real and also of Gaussian shape for µ = 0. The stability (see point 1
above) is easily proven, using the convolution theorem (3.25) and the expo-
nential form of φ(t).
Differentiating the characteristic function, setting µ = 0, we obtain the
central moments of the normal distribution:
µ′_{2j} = (2j)!/(2^j j!) · σ^{2j} .

Cumulants, with the exception of κ_1 = µ and κ_2 = σ^2, vanish. Also the odd
central moments are zero.

The Normal Distribution in Higher Dimensions

The normal distribution in two dimensions with its maximum at the origin
has the general form
N_0(x, y) = 1/(2π s_x s_y \sqrt{1 − ρ^2}) · exp[ −1/(2(1 − ρ^2)) · (x^2/s_x^2 − 2ρxy/(s_x s_y) + y^2/s_y^2) ] .   (3.48)
The notation has been chosen such that it indicates the moments:

⟨x^2⟩ = s_x^2 ,   ⟨y^2⟩ = s_y^2 ,   ⟨xy⟩ = ρ s_x s_y .

We skip the explicit calculation. Integrating (3.48) over y, (x), we obtain


the marginal distributions of x, (y). They are again normal distributions with
widths sx and sy . A characteristic feature of the normal distribution is that
for a vanishing correlation ρ = 0 the two variables are independent, since in
this case the p.d.f. N0 (x, y) factorizes into normal distributions of x and y.
Curves of equal probability are obtained by equating the exponent to a
constant. The equations

Fig. 3.17. Transformation of the error ellipse.

 
(1/(1 − ρ^2)) · (x^2/s_x^2 − 2ρxy/(s_x s_y) + y^2/s_y^2) = const

describe concentric ellipses. For the special choice const = 1 we show the
ellipse in Fig. 3.17. At this so-called error ellipse the value of the p.d.f. is just
N_0(0, 0)/\sqrt{e}, i.e. reduced with respect to the maximum by a factor 1/\sqrt{e}.
By a simple rotation we achieve uncorrelated variables x′ and y ′ :

x′ = x cos φ + y sin φ ,
y ′ = −x sin φ + y cos φ ,
tan 2φ = 2ρ s_x s_y/(s_x^2 − s_y^2) .

The half-axes, i.e. the variances s′_x^2 and s′_y^2 of the uncorrelated variables x′
and y′, are

s′_x^2 = (s_x^2 + s_y^2)/2 + (s_x^2 − s_y^2)/(2 cos 2φ) ,
s′_y^2 = (s_x^2 + s_y^2)/2 − (s_x^2 − s_y^2)/(2 cos 2φ) .
In the new variables, the normal distribution has then the simple form
  
N_0(x′, y′) = 1/(2π s′_x s′_y) · exp[ −(1/2)(x′^2/s′_x^2 + y′^2/s′_y^2) ] = f(x′) g(y′) .

The two-dimensional normal distribution with its maximum at (x0 , y0 ) is


obtained from (3.48) with the substitution x → x − x0 , y → y − y0 .
We now generalize the normal distribution to n dimensions. We skip again
the simple algebra and present directly the result. The variables are written
in vector form x and with the symmetric and positive definite covariance
matrix C, the p.d.f. is given by
 
N(x) = 1/\sqrt{(2π)^n det(C)} · exp[ −(1/2)(x − x_0)^T C^{−1} (x − x_0) ] .

Frequently we need the inverse of the covariance matrix

V = C^{−1}

which is called weight matrix. Small variances Cii of components xi lead to


large weights Vii . The normal distribution in n dimensions has then the form
 
N(x) = 1/\sqrt{(2π)^n det(C)} · exp[ −(1/2)(x − x_0)^T V (x − x_0) ] .

In the two-dimensional case the matrices C and V are


C = (  s_x^2        ρ s_x s_y )
    (  ρ s_x s_y    s_y^2     ) ,

V = 1/(1 − ρ^2) · (   1/s_x^2        −ρ/(s_x s_y) )
                  (  −ρ/(s_x s_y)     1/s_y^2     )

with the determinant det(C) = s_x^2 s_y^2 (1 − ρ^2) = 1/det(V).
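The 2×2 relations can be written out directly; a small sketch (the values of s_x, s_y, ρ are invented for illustration):

```python
sx, sy, rho = 2.0, 1.0, 0.5          # illustrative values

C = [[sx ** 2,       rho * sx * sy],
     [rho * sx * sy, sy ** 2      ]]

det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]   # = sx^2 sy^2 (1 - rho^2)

# Weight matrix V = C^-1, inverted explicitly for the 2x2 case:
V = [[ C[1][1] / det_C, -C[0][1] / det_C],
     [-C[1][0] / det_C,  C[0][0] / det_C]]

det_V = V[0][0] * V[1][1] - V[0][1] * V[1][0]   # = 1 / det_C
```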

3.6.6 The Exponential Distribution

Also the exponential distribution appears in many physical phenomena. Be-


sides lifetime distributions (decay of unstable particles, nuclei or excited
states), it describes the distributions of intervals between Poisson distributed
events like time intervals between decays or gap lengths in track chambers
(bubble chambers, emulsion stacks) and of the penetration depth of particles
in absorbing materials.
The main characteristic of processes described by the exponential distribution
is the lack of memory, i.e. such processes are not influenced by their
history. For instance, the decay probability of an unstable particle is independent
of its age, and the scattering probability for a gas molecule at the time t
is independent of t and of the time that has passed since the last scattering
event. The probability density for the decay of a particle at the time t1 + t2
must be equal to the probability density f (t2 ) multiplied with the probability
1 − F (t1 ) to survive until t1 :

f (t1 + t2 ) = (1 − F (t1 )) f (t2 ) .

Since f (t1 + t2 ) must be symmetric under exchanges of t1 and t2 , the first


factor has to be proportional to f (t1 ),

1 − F (t1 ) = cf (t1 ) , (3.49)


f (t1 + t2 ) = cf (t1 )f (t2 ) (3.50)

with constant c. The property (3.50) is found only for the exponential
function: f(t) = a e^{bt}. If we require that the probability density is normalized, we
get

f(t) = λ e^{−λt} .
This result could also have been derived by differentiating (3.49) and solving
the corresponding differential equation f = −c df /dt.
The characteristic function

φ(t) = λ/(λ − it)

and the moments

µ_n = n! λ^{−n}

have already been derived in Example 20 in Sect. 3.3.4.

3.6.7 The χ2 Distribution

The chi-square distribution (χ2 distribution) plays an important role in the


comparison of measurements with theoretical distributions (see Chap. 10).
The corresponding tests allow us to discover systematic measurement errors
and to check the validity of theoretical models. The variable χ2 which we
will define below, is certainly the quantity which is most frequently used to
quantify the quality of the agreement of experimental data with the theory.
The variate χ2 is defined as the sum

χ^2 = Σ_{i=1}^{f} x_i^2/σ_i^2 ,

where x_i are independent, normally distributed variates with zero mean and
variance σ_i^2.
We have already come across the simplest case with f = 1 in Sect. 3.4.1:
The transformation of a normally distributed variate x with expected value
zero to u = x^2/s^2, where s^2 is the variance, yields

g_1(u) = 1/\sqrt{2πu} · e^{−u/2}   (f = 1) .

Fig. 3.18. χ^2 distribution for different degrees of freedom (f = 1, 2, 4, 8, 16).

(We have replaced the variable χ2 by u = χ2 to simplify the writing.) Mean


value and variance of this distribution are E(u) = 1 and var(u) = 2.
When we now add f independent summands, we obtain
g_f(u) = 1/(Γ(f/2) 2^{f/2}) · u^{f/2−1} e^{−u/2} .   (3.51)

The only parameter of the χ2 distribution is the number of degrees of


freedom f , the meaning of which will become clear later. We will prove (3.51)
when we discuss the gamma distribution, which includes the χ2 distribution
as a special case. Fig. 3.18 shows the χ2 distribution for some values of f .
The value f = 2 corresponds to an exponential distribution. As follows from
the central limit theorem, for large values of f the χ2 distribution approaches
a normal distribution.
By differentiation of the p.d.f. we find for f > 2 the maximum at the
mode u_mod = f − 2. The expected value of the variate u is equal to f and its
variance is 2f. These relations follow immediately from the definition of u:

u_mod = f − 2   for f > 2 ,
E(u) = f ,
var(u) = 2f .
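These moments are easy to verify by simulation, generating u as a sum of f squared standard normal variates (seed and sample size are arbitrary choices):

```python
import random
from statistics import mean, variance

random.seed(13)
f = 4
# u = sum of f squared standard-normal variates follows a chi-square
# distribution with f degrees of freedom:
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(f))
           for _ in range(20000)]
m, v = mean(samples), variance(samples)   # expect m ~ f and v ~ 2f
```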

Distribution of the Sample Width

We define the width v of a sample of N elements xi as follows (see 3.2.3):


v^2 = (1/N) Σ_{i=1}^{N} x_i^2 − x̄^2 = \overline{x^2} − x̄^2 .

If the variates x_i of the sample are distributed normally with mean x_0 and
variance σ^2, then N v^2/σ^2 follows a χ^2 distribution with f = N − 1 degrees
of freedom. We omit the formal proof; the result is plausible, however, from
the expected value derived in Sect. 3.2.3:
⟨v^2⟩ = σ^2 (N − 1)/N ,
⟨N v^2/σ^2⟩ = N − 1 .

Degrees of Freedom and Constraints

In Sect. 6.7 we will discuss the method of least squares for parameter esti-
mation. To adjust a curve to measured points xi with Gaussian errors σi we
minimize the quantity
χ^2 = Σ_{i=1}^{N} (x_i − t_i(λ_1, . . . , λ_Z))^2/σ_i^2 ,

where ti are the ordinates of the curve depending on the Z free parameters
λk . Large values of χ2 signal a bad agreement between measured values and
the fitted curve. If the predictions t_i depend linearly on the parameters,
the sum χ^2 obeys a χ^2 distribution with f = N − Z degrees of freedom. The

reduction of f accounts for the fact that the expected value of χ2 is reduced
when we allow for free parameters. Indeed, for Z = N we could adjust the
parameters such that χ2 would vanish.
Generally, in statistics the term degrees of freedom¹¹ f denotes the number
of independent predictions. For N = Z we have no prediction for the
observations xi . For Z = 0 we predict all N observations, f = N . When
we fit a straight line through 3 points with given abscissa and observed or-
dinate, we have N = 3 and Z = 2 because the line contains 2 parameters.
The corresponding χ2 distribution has 1 degree of freedom. The quantity Z
is called the number of constraints, a somewhat misleading term. In the case
of the sample width discussed above, one quantity, the mean, is adjusted.
Consequently, we have Z = 1 and the sample width follows a χ2 distribution
of f = N − 1 degrees of freedom.

¹¹ Often the notation number of degrees of freedom, abbreviated by n.d.f. or NDF,
is used in the literature.

3.6.8 The Gamma Distribution


The distributions considered in the last two sections, the exponential- and
the chi-square distribution, are special cases of the gamma distribution
G(x|ν, λ) = λ^ν/Γ(ν) · x^{ν−1} e^{−λx} ,   x > 0 .
The parameter λ > 0 is a scale parameter, while the parameter ν > 0 de-
termines the shape of the distribution. With ν = 1 we obtain the exponential
distribution. The parameter ν is not restricted to natural numbers. With the
special choice ν = f /2 and λ = 1/2 we get the χ2 -distribution with f degrees
of freedom (see Sect. 3.6.7).
The gamma distribution is used typically for the description of random
variables that are restricted to positive values, as in the two cases just men-
tioned. The characteristic function is very simple:
φ(t) = (1 − it/λ)^{−ν} .   (3.52)
As usual, we obtain expected value, variance, moments about the origin,
skewness and excess by differentiating φ(t):
⟨x⟩ = ν/λ ,   var(x) = ν/λ^2 ,   µ_i = Γ(i + ν)/(λ^i Γ(ν)) ,   γ_1 = 2/\sqrt{ν} ,   γ_2 = 6/ν .
The maximum of the distribution is at xmod = (ν − 1)/λ, (ν > 1).
The gamma distribution has the property of stability in the following
sense: The sum of variates following gamma distributions with the same
scaling parameter λ, but different shape parameters νi is again gamma dis-
tributed, with the shape parameter ν,
ν = Σ ν_i .

This result is obtained by multiplying the characteristic functions (3.52). It


proves also the corresponding result (3.51) for the χ2 -distribution.

Example 40. Distribution of the mean value of decay times

Let us consider the sample mean x̄ = Σ x_i/N of exponentially distributed
variates x_i. The characteristic function is (see 3.3.4)
φ_x(t) = 1/(1 − it/λ) .

Forming the N-fold product and using the scaling rule for Fourier transforms
(3.19), φ_{x/N}(t) = φ_x(t/N), we arrive at the characteristic function of
a gamma distribution with scaling parameter Nλ and shape parameter N:

φ_x̄(t) = (1 − it/(Nλ))^{−N} .   (3.53)

Thus the p.d.f. f(x̄) is equal to G(x̄|N, Nλ). Considering the limit for large
N, we convince ourselves of the validity of the law of large numbers and the
central limit theorem. From (3.53) we derive

ln φ_x̄(t) = −N ln(1 − it/(Nλ))
          = −N [ −it/(Nλ) − (1/2)(it/(Nλ))^2 + O(N^{−3}) ] ,
φ_x̄(t) = exp( i t/λ − (1/2) t^2/(N λ^2) + O(N^{−2}) ) .

When N is large, the term of order N^{−2} can be neglected and with the two
remaining terms in the exponent we get the characteristic function of a normal
distribution with mean µ = 1/λ = ⟨x⟩ and variance σ^2 = 1/(Nλ^2) =
var(x)/N (see 3.3.4), in agreement with the central limit theorem. If N approaches
infinity, only the first term remains and we obtain the characteristic
function of a delta distribution δ(1/λ − x̄). This result is predicted by the law
of large numbers (see Appendix 13.1). This law states that, under certain conditions,
with increasing sample size the difference between the sample mean
and the population mean approaches zero.
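The statement f(x̄) = G(x̄|N, Nλ) fixes mean 1/λ and variance 1/(Nλ²); a quick simulation check (seed and parameter values are arbitrary choices):

```python
import math
import random
from statistics import mean, variance

random.seed(17)
lam, N = 2.0, 10
xbars = []
for _ in range(20000):
    # Sample mean of N exponential variates (inverse-transform sampling):
    xs = [-math.log(1.0 - random.random()) / lam for _ in range(N)]
    xbars.append(sum(xs) / N)

m, v = mean(xbars), variance(xbars)
# Gamma G(x|N, N*lam): mean 1/lam = 0.5, variance 1/(N lam^2) = 0.025
```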

3.6.9 The Lorentz and the Cauchy Distributions


The Lorentz distribution (Fig. 3.19)

f(x) = (1/π) · (Γ/2)/((x − a)^2 + (Γ/2)^2)
is symmetric with respect to x = a. Although it is bell-shaped like a Gaussian,
it has, because of its long tails, no finite variance. This means that we cannot
infer the location parameter a of the distribution¹² from the sample mean,
even for arbitrary large samples. The Lorentz distribution describes resonance
effects, where Γ represents the width of the resonance. In particle or nuclear
physics, mass distributions of short-lived particles follow this p.d.f. which
then is called Breit–Wigner distribution.
The Cauchy distribution corresponds to the special choice of the scale
parameter Γ = 2.¹³ For the location parameter a = 0 it has the characteristic

¹² The first moment exists only as a Cauchy principal value and equals a.
¹³ In the literature also a more general definition with two parameters is met.
Fig. 3.19. Lorentz distribution with mean equal to 1 and halfwidth Γ/2 = 0.2.

function φ(t) = exp(−|t|), which obviously has no derivatives at t = 0, another
consequence of the nonexistence of moments. The characteristic function
for the sample mean of N measurements, x̄ = Σ_{i=1}^{N} x_i/N, is found with
the help of (3.19), (3.25) as

φ_x̄(t) = (φ(t/N))^N = φ(t) .

The sample mean has the same distribution as the original population. It
is therefore, as already stated above, not suited for the estimation of the
location parameter.
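This failure of averaging is easy to see in a simulation: the interquartile spread of the sample mean does not shrink with N (a sketch; the standard Cauchy is generated by inverse transform, and the seed and counts are arbitrary):

```python
import math
import random

random.seed(19)

def cauchy_sample():
    # Inverse transform for the standard Cauchy: tan(pi * (u - 1/2))
    return math.tan(math.pi * (random.random() - 0.5))

def iqr_of_means(N, reps=5000):
    # Interquartile range of the sample mean of N Cauchy variates
    means = sorted(sum(cauchy_sample() for _ in range(N)) / N
                   for _ in range(reps))
    return means[3 * reps // 4] - means[reps // 4]

s1, s100 = iqr_of_means(1), iqr_of_means(100)
# Both are close to 2, the interquartile range of a standard Cauchy:
# averaging 100 values has not reduced the spread.
```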

3.6.10 The Log-normal Distribution

The distribution of a variable x > 0 whose logarithm u is normally distributed


g(u) = 1/(\sqrt{2π} s) · e^{−(u−u_0)^2/(2s^2)}
with mean u_0 and variance s^2 follows the log-normal distribution, see Fig.
3.20:

f(x) = 1/(x s \sqrt{2π}) · e^{−(ln x − u_0)^2/(2s^2)} .
This is, like the normal distribution, a two-parameter distribution where the
parameters u_0, s^2, however, are not identical with the mean µ and variance
σ^2; the latter are given by

µ = e^{u_0 + s^2/2} ,
σ^2 = (e^{s^2} − 1) e^{2u_0 + s^2} .   (3.54)

Note that the distribution is defined only for positive x, while u_0 can also
be negative.

Fig. 3.20. Log-normal distribution with u_0 = 1 and different values of s (s = 0.1, 0.2, 0.5, 1).

The characteristic function cannot be written in closed form, but only as
a power expansion. The moments of order k about the origin are

µ_k = e^{k u_0 + k^2 s^2/2} .

Other characteristic parameters are

median:   x_{0.5} = e^{u_0} ,
mode:     x_mod = e^{u_0 − s^2} ,
skewness: γ_1 = (e^{s^2} + 2) \sqrt{e^{s^2} − 1} ,
excess:   γ_2 = e^{4s^2} + 2e^{3s^2} + 3e^{2s^2} − 6 .   (3.55)
The distribution of a variate x = ∏ x_i which is the product of many
variates x_i, each of which is positive and has a small variance compared to
its squared mean, σ_i^2 ≪ µ_i^2, can be approximated by a log-normal
distribution. This is a consequence of the central limit theorem (see 3.6.5).
Writing

ln x = Σ_{i=1}^{N} ln x_i

we realize that ln x is normally distributed in the limit N → ∞ if the summands
fulfil the conditions required by the central limit theorem. Accordingly,
x follows the log-normal distribution.
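A short simulation illustrates this: the logarithm of a product of many positive factors with small spread is close to normal (the factor distribution, seed and counts are arbitrary choices):

```python
import math
import random
from statistics import mean, pstdev

random.seed(23)

def product_variate(n=50):
    # Product of n positive factors with small relative variance
    prod = 1.0
    for _ in range(n):
        prod *= random.uniform(0.9, 1.1)
    return prod

logs = [math.log(product_variate()) for _ in range(20000)]
m, s = mean(logs), pstdev(logs)
# If ln x is normal, about 68.3% of the values lie within m +/- s:
frac = sum(1 for u in logs if abs(u - m) <= s) / len(logs)
```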

3.6.11 Student’s t Distribution

This distribution, introduced by W. S. Gosset (pseudonym “Student”) is fre-


quently used to test the compatibility of a sample with a normal distribution
with given mean but unknown variance. It describes the distribution of the
so-called “studentized” variate t, defined as

t = (x̄ − µ)/s .   (3.56)
The numerator is the difference between a sample mean and the mean of the
Gaussian from which the sample of size N is drawn. It follows a normal dis-
tribution centered at zero. The denominator s is an estimate of the standard
deviation of the numerator derived from the sample. It is defined by (3.15).

s^2 = 1/(N(N − 1)) · Σ_{i=1}^{N} (x_i − x̄)^2 .

The sum on the right-hand side, after division by the variance σ^2 of the
Gaussian, follows a χ^2 distribution with f = N − 1 degrees of freedom, see
(3.51). Dividing also the numerator of (3.56) by its standard deviation σ/\sqrt{N},
it follows a normal distribution of variance unity. Thus the variable t of the
t distribution is the quotient of a normal variate and the square root of a χ^2
variate.
The analytical form of the p.d.f. can be found by the standard method
used in Sect. 3.5.4. The result is
h(t|f) = Γ((f + 1)/2)/(Γ(f/2) \sqrt{πf}) · (1 + t^2/f)^{−(f+1)/2} .

The only parameter is f , the number of degrees of freedom. For f = 1 we


recover the Cauchy distribution. For large f it approaches the normal dis-
tribution N (0, 1) with variance equal to one. The distribution is symmetric,
centered at zero, and bell shaped, but with longer tails than N (0, 1). The
even moments are

µ_i = f^{i/2} · (1 · 3 · · · (i − 1))/((f − 2)(f − 4) · · · (f − i)) .

They exist only for i ≤ f − 1. The variance for f ≥ 3 is σ 2 = f /(f − 2), the
excess for f ≥ 5 is γ2 = 6/(f − 4), disappearing for large f , in agreement
with the fact that the distribution approaches the normal distribution.
The typical field of application for the t distribution is the derivation
of tests or confidence intervals in cases where a sample is supposed to be
taken from a normal distribution of unknown variance but known mean µ.
Qualitatively, very large absolute values of t indicate that the sample mean
is incompatible with µ. Sometimes the t distribution is used to approximate
experimental distributions which differ from Gaussians because they have
longer tails. In a way, the t distribution interpolates between the Cauchy (for
f = 1) and the normal distribution (for f → ∞).

Fig. 3.21. Student’s distributions for 1, 2, 5 degrees of freedom and normal distri-
bution.
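The definition (3.56) translates directly into a short numerical sketch. The fragment below is illustrative only (the function name and sample values are not from the text); it computes the studentized variate for a sample drawn from a normal distribution with known mean but unknown variance:

```python
import math
import random

# Illustrative sketch: the studentized variate t of (3.56) for a sample
# drawn from a normal distribution with known mean mu but unknown variance.

def studentized_t(sample, mu):
    n = len(sample)
    xbar = sum(sample) / n
    # s^2 estimates the variance of the sample mean, cf. the formula above
    s2 = sum((x - xbar) ** 2 for x in sample) / (n * (n - 1))
    return (xbar - mu) / math.sqrt(s2)

random.seed(1)
sample = [random.gauss(5.0, 2.0) for _ in range(10)]
t = studentized_t(sample, mu=5.0)  # follows h(t|f) with f = 9
```

Large |t| would indicate that the sample mean is incompatible with the assumed µ.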

3.6.12 The Extreme Value Distributions

The family of extreme value distributions is relevant for the following type
of problem: Given a sample taken from a certain distribution, what can be
said about the distribution of its maximal or minimal value? It is found that
these distributions converge with increasing sample size to distributions of
the types given below.

The Weibull Distribution

This distribution has been studied in connection with the lifetime of complex
aggregates. It is a limiting distribution for the smallest member of a sample
taken from a distribution limited from below. The p.d.f. is

f(x|a, p) = \frac{p}{a} \left( \frac{x}{a} \right)^{p-1} \exp\left[ -\left( \frac{x}{a} \right)^{p} \right] , \quad x > 0   (3.57)

with the positive scale and shape parameters a and p. The mode is

x_m = a \left( \frac{p-1}{p} \right)^{1/p} \quad \text{for } p \ge 1 ,

mean value and variance are

\mu = a\, \Gamma(1 + 1/p) ,
\sigma^2 = a^2 \left[ \Gamma(1 + 2/p) - \Gamma^2(1 + 1/p) \right] .

The moments are

\mu_i = a^i\, \Gamma(1 + i/p) .
For p = 1 we get an exponential distribution with decay constant 1/a.
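As an illustrative sketch (not part of the original text), a Weibull variate can be generated by the inverse-transform method: the c.d.f. of (3.57) is F(x) = 1 − exp(−(x/a)^p), so x = a(−ln u)^{1/p} with u uniform in (0, 1). The helper name below is hypothetical:

```python
import math
import random

# Sketch: sampling the Weibull p.d.f. (3.57) by inverse transform.

def weibull_sample(a, p, rng):
    u = 1.0 - rng.random()          # in (0, 1], avoids log(0)
    return a * (-math.log(u)) ** (1.0 / p)

rng = random.Random(7)
a, p = 2.0, 3.0
sample = [weibull_sample(a, p, rng) for _ in range(100000)]
mean = sum(sample) / len(sample)
mu = a * math.gamma(1.0 + 1.0 / p)  # exact mean a*Gamma(1 + 1/p)
```

The empirical mean of the generated sample reproduces the analytic moment formula above.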

The Fisher–Tippett Distribution

Also this distribution with the p.d.f.

f_\pm(x|x_0, s) = \frac{1}{s} \exp\left[ \pm\frac{x - x_0}{s} - e^{\pm(x-x_0)/s} \right]
belongs to the family of extreme value distributions. It is sometimes called
extreme value distribution (without further specification) or log-Weibull dis-
tribution.
If y is Weibull-distributed (3.57) with parameters a, p, the transformation
to x = − ln y leads for x to a log-Weibull distribution with parameters x0 =
− ln a and s = 1/p. The first of these, the location parameter x0 , gives the
position of the maximum, i.e. xmod = x0 , and the parameter s > 0 is a
scale parameter. Mean value µ and variance σ 2 depend on these parameters
through
\mu = x_0 \mp Cs , \quad \text{with Euler's constant } C = 0.5772\ldots ,

\sigma^2 = s^2 \frac{\pi^2}{6} .
Mostly, the negative sign in the exponent is realized. Its normal form

f(x|0, 1) = \exp\left( -x - e^{-x} \right)

is also known as Gumbel’s distribution and shown in Fig. 3.22.
Using mathematical properties of Euler’s Γ function [25] one can derive
the characteristic function in closed form:
\phi(t) = \Gamma(1 \pm ist)\, e^{i x_0 t} ,

whose logarithmic derivatives give in turn the cumulants for this distribution:

\kappa_1 = x_0 \mp Cs , \qquad \kappa_{i \ge 2} = (\mp 1)^i (i-1)!\, s^i \zeta(i) ,

with Riemann’s zeta function \zeta(z) = \sum_{n=1}^{\infty} 1/n^z (see [25]). Skewness and
excess are given by γ1 ≈ 1.14 and γ2 = 12/5.

Fig. 3.22. Gumbel distribution.
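The convergence of sample extremes towards this family can be checked numerically. The sketch below (illustrative, not from the text) shifts the maximum of n standard-exponential variates by ln n; its average approaches Euler's constant, the mean of Gumbel's distribution:

```python
import math
import random

# Sketch: the maximum of n standard-exponential variates, shifted by
# ln n, approaches Gumbel's distribution f(x|0,1) = exp(-x - exp(-x)),
# whose mean is Euler's constant C (and whose variance is pi^2/6).

def shifted_maximum(n, rng):
    return max(rng.expovariate(1.0) for _ in range(n)) - math.log(n)

rng = random.Random(3)
values = [shifted_maximum(200, rng) for _ in range(5000)]
mean = sum(values) / len(values)
C = 0.5772  # Euler's constant
```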

3.7 Mixed and Compound Distributions


In the statistical literature the terms mixed distribution and compound dis-
tribution are not clearly defined and separated. Sometimes the compound
distribution is regarded as a specific mixed distribution.

3.7.1 Superposition of distributions

The term mixed distribution is sometimes used for a superposition of distributions:

f(x) = \sum_{i=1}^{N} w_i f_i(x) ,   (3.58)

P(k) = \sum_{i=1}^{N} w_i P_i(k) .   (3.59)

In physics applications superpositions occur, for instance, if a series of reso-


nances or peaks is observed over a background. We have calculated the mean
value and the variance of the superposition of two continuous distributions
in Sect. 3.2. The relations in Sect. 3.2.3 can easily be extended to more than
two components.

3.7.2 Compound Distributions

If a parameter of a distribution is itself randomly distributed, then we have
a compound distribution. We can form different combinations of continuous
and discrete distributions, but restrict ourselves to the case of a resulting
continuous distribution:

f(x) = \int_{-\infty}^{\infty} h(x|\lambda)\, g(\lambda)\, d\lambda ,   (3.60)

f(x) = \sum_{k} h(x|\lambda_k)\, P_k .

Fig. 3.23. Lifetime distribution, original (solid line) and measured with Gaussian
resolution (dashed line).

The relation (3.60) has the form of a convolution and is closely related to the
marginalization of a two-dimensional distribution of x and λ. A compound
distribution may also have the form of (3.58) or (3.59) where the weights are
independently randomly distributed.
Frequently we measure a statistical quantity with a detector that has a
limited resolution. Then the probability to observe the value x′ is a random
distribution R(x′ |x) depending on the undistorted value x which itself is
distributed according to a distribution g(x). (In this context the notation is
usually different from the one used in (3.60).) We have the convolution

f(x') = \int_{-\infty}^{\infty} R(x', x)\, g(x)\, dx .

Example 41. Measurement of a decay time distribution with Gaussian resolution

A muon stops in a scintillator and decays subsequently into an electron.
The time t between the two corresponding light pulses follows an exponential
distribution γe^{−γt}, with γ the muon decay constant. The observed value is t′,
with the response function

R(t', t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t'-t)^2}{2\sigma^2} \right) .
The convolution integral

f(t') = \int_0^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t'-t)^2}{2\sigma^2} \right) \gamma e^{-\gamma t}\, dt

= \gamma e^{-\gamma t'} \int_0^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(t'-t)^2}{2\sigma^2} \right) e^{\gamma(t'-t)}\, dt

= \gamma e^{-\gamma t'} \int_{-t'}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{x^2}{2\sigma^2} \right) e^{-\gamma x}\, dx

can be expressed with the help of the error function

\operatorname{erfc}(x) = 1 - \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt ,

resulting in

f(t') = \frac{1}{2}\,\gamma\, \exp\left( -\gamma t' + \frac{1}{2}\sigma^2\gamma^2 \right) \operatorname{erfc}\left( \frac{-t' + \sigma^2\gamma}{\sqrt{2}\,\sigma} \right) .
The result for γ = 1, σ = 0.5 is shown in Fig. 3.23. Except for small values of
t′, the observed time is shifted to larger values. In the asymptotic limit t′ → ∞
the integral becomes a constant and the distribution f∞(t′) is exponentially
decreasing with the same slope γ as the undistorted distribution:

f_\infty(t') \propto \gamma e^{-\gamma t'} .

This property applies also to all measurements where the resolution is a


function of t′ − t. This condition is usually realized.
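The closed-form result of the example can be cross-checked by direct Monte Carlo smearing. The following sketch is illustrative only (it assumes γ = 1, σ = 0.5 as in Fig. 3.23):

```python
import math
import random

# Sketch: the analytic convolution f(t') of Example 41, cross-checked
# against direct Monte Carlo smearing of an exponential decay time
# with a Gaussian resolution.

def f_smeared(tp, gamma, sigma):
    arg = (-tp + sigma * sigma * gamma) / (math.sqrt(2.0) * sigma)
    return (0.5 * gamma
            * math.exp(-gamma * tp + 0.5 * sigma * sigma * gamma * gamma)
            * math.erfc(arg))

rng = random.Random(11)
gamma, sigma = 1.0, 0.5
smeared = [rng.expovariate(gamma) + rng.gauss(0.0, sigma)
           for _ in range(100000)]
# observed fraction of smeared times in the interval [1, 2]
frac = sum(1 for t in smeared if 1.0 <= t < 2.0) / len(smeared)
```

The observed fraction agrees with the integral of f_smeared over the same interval.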

3.7.3 The Compound Poisson Distribution

The compound Poisson distribution (CPD) describes the sum of a Poisson-distributed
number of independent and identically distributed weights. It applies
if we weight events and sum up the weights. For example, when we
measure the activity of a β source with a Geiger counter, the probability that
it fires, the detection probability, may depend on the electron energy, which
varies from event to event. We can estimate the true number of decays by
weighting each observation with the inverse of its detection probability. Sometimes
weighting is used to measure the probability that an event belongs to
a certain particle type. Weighted events also play a role in some Monte Carlo
integration methods and in parameter inference (see Chap. 6, Sect. 7.3), if
weighted observations are summed up in histogram bins.

The CPD also describes the sum x = \sum_{i=1}^{N} n_i w_i if there is a given discrete
weight distribution w_i, i = 1, 2, 3, . . . , N, and where the numbers n_i are Poisson
distributed. The equivalence of the two definitions of the CPD is shown in
Appendix 13.11.1. In Ref. [26] some properties of the compound Poisson
distribution and the treatment of samples of weighted events are described.
The CPD does not have a simple analytic expression. However, the cumulants
and thus also the moments of the distribution can be calculated exactly.
Let us consider the definite case that on average λ1 observations are ob-
tained with probability ε1 and λ2 observations with probability ε2 . We correct
the losses by weighting the observed numbers with w1 = 1/ε1 and w2 = 1/ε2 .
For the Poisson-distributed numbers k₁, k₂

P_{\lambda_1}(k_1) = \frac{\lambda_1^{k_1}}{k_1!} e^{-\lambda_1} ,

P_{\lambda_2}(k_2) = \frac{\lambda_2^{k_2}}{k_2!} e^{-\lambda_2} ,

k = w_1 k_1 + w_2 k_2 .

The mean value µ of the variate k and its variance σ² are

\mu = w_1\lambda_1 + w_2\lambda_2 = \lambda \langle w \rangle ,   (3.61)

\sigma^2 = w_1^2\lambda_1 + w_2^2\lambda_2 = \lambda \langle w^2 \rangle ,   (3.62)

with λ = λ₁ + λ₂, ⟨w⟩ = (w₁λ₁ + w₂λ₂)/λ and ⟨w²⟩ = (w₁²λ₁ + w₂²λ₂)/λ. We
have used var(cx) = c² var(x).
According to (3.28), the cumulant κᵢ of order i of the distribution of k is
related to the cumulants κᵢ⁽¹⁾, κᵢ⁽²⁾ of the corresponding distributions of k₁, k₂
through

\kappa_i = w_1^i \kappa_i^{(1)} + w_2^i \kappa_i^{(2)} .   (3.63)
With (3.27) we also get skewness γ₁ and excess γ₂:

\gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}} = \frac{w_1^3\lambda_1 + w_2^3\lambda_2}{(w_1^2\lambda_1 + w_2^2\lambda_2)^{3/2}} = \frac{\langle w^3 \rangle}{\lambda^{1/2} \langle w^2 \rangle^{3/2}} ,   (3.64)

\gamma_2 = \frac{\kappa_4}{\kappa_2^2} = \frac{w_1^4\lambda_1 + w_2^4\lambda_2}{(w_1^2\lambda_1 + w_2^2\lambda_2)^2} = \frac{\langle w^4 \rangle}{\lambda \langle w^2 \rangle^2} .   (3.65)

The formulas can easily be generalized to more than two Poisson distri-
butions and to a continuous weight distribution (see Appendix 13.11.1). The
relations (3.61), (3.62), (3.64), (3.65) remain valid.
In particular, for a CPD with expected number of weights λ and a weight
distribution with second moment E(w²), the variance of the sum of the weights
is λE(w²), as indicated by (3.62).
For large values of λ the CPD can be approximated by a normal distri-
bution or by a scaled Poisson distribution (see Appendix 13.11.1).
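The relations (3.61) and (3.62) are easily verified by simulation. The sketch below (illustrative names and numbers, not from the text) draws two Poisson-distributed counts, weights them with w = 1/ε, and compares the empirical moments with λ⟨w⟩ and λ⟨w²⟩:

```python
import math
import random

# Sketch: Monte Carlo check of the CPD moment relations (3.61), (3.62)
# for two weight classes w1, w2.

def poisson(lam, rng):
    # Knuth's multiplication method, adequate for small lam
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(5)
lam1, lam2 = 3.0, 2.0
w1, w2 = 1.0 / 0.5, 1.0 / 0.8          # weights = 1/efficiency
data = [w1 * poisson(lam1, rng) + w2 * poisson(lam2, rng)
        for _ in range(100000)]
mean = sum(data) / len(data)            # expect w1*lam1 + w2*lam2 = 8.5
var = sum((x - mean) ** 2 for x in data) / (len(data) - 1)
# expect w1^2*lam1 + w2^2*lam2 = 15.125
```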
4 Measurement Errors

4.1 General Considerations


When we talk about measurement errors, we do not mean mistakes caused by
the experimenter, but the unavoidable random dispersion of measurements.
Therefore, a better name would be measurement uncertainties. We will use
the terms uncertainty and error synonymously.
The correct determination and treatment of measurement errors is not
always trivial. In principle, the evaluation of parameters and their uncertain-
ties are part of the statistical problem of parameter inference, which we will
treat in Chaps. 6, 6.4.5 and 8. There we will come back to this problem and
look at it from a more general point of view. In the present chapter we will
introduce certain approximations which are often well justified in practice.
Official recommendations are given in the “Guide to the Expression of Uncertainty
in Measurement”, published in 1993 and updated in 1995 in the name of
many relevant organizations like ISO and BIPM (Guide to the Expression of
Uncertainty in Measurement, International Organization for Standardization,
Geneva, Switzerland) [27]. More recently, a task force of the European cooperation
for Accreditation of Laboratories (EAL) with members of all western
European countries has issued a document (EAL-R2) with the aim to harmonize
the evaluation of measurement uncertainties. It follows the rules of
the document mentioned above but is more specific in some fields, especially
in calibration issues which are important when measurements are exchanged
between different laboratories. The two reports essentially recommend to es-
timate the expected value and the standard deviation of the quantity to be
measured. Our treatment of measurement uncertainty will basically be in
agreement with the recommendations of the two cited documents which deal
mainly with systematic uncertainties and follow the Bayesian philosophy, but
we will extend their concept in Sect. 8.1 where we introduce asymmetric error
limits.

4.1.1 Importance of Error Assignments

The natural sciences owe their success to the possibility to compare quantitative
hypotheses to experimental facts. However, we are able to check theoretical
predictions only if we have an idea about the accuracy of the measurements.
If this is not the case, our measurements are completely useless.
Of course, we also want to compare the results of different experiments
to each other and to combine them. Measurement errors must be defined in
such a way that this is possible without knowing details of the measurement
procedure. Only then, important parameters, like constants of nature, can be
determined more and more accurately and possible variations with time, like
it was hypothesized for the gravitational constant, can be tested.
Finally, it is indispensable for the utilization of measured data in other
scientific or engineering applications to know their accuracy and reliability.
An overestimated error can lead to a waste of resources and, even worse, an
underestimated error may lead to wrong conclusions.

4.1.2 The Declaration of Errors

There are several ways to present measurements with their uncertainties.


Some of the more frequent ones are given in the following examples:
t = (34.5 ± 0.7) × 10⁻³ s
t = 34.5 × 10⁻³ s ± 2 %
x = 10.3 (+0.7 / −0.3)
me = (0.510 999 06 ± 0.000 000 15) MeV/c²
me = 0.510 999 06 (15) MeV/c²
me = 9.109 389 7 × 10⁻³¹ kg ± 0.3 ppm
The abbreviation ppm means parts per million. The treatment of asym-
metric errors will be postponed to Chap. 8. The measurement and its
error must have the same number of significant digits. Declarations like
x = 3.2 ± 0.01 or x = 3.02 ± 0.1 are inconsistent.
A relatively crude declaration of the uncertainty is sufficient, one or two
significant digits are adequate in any case, keeping in mind that often we
do not know all sources of errors or are unable to estimate their influence
on the result with high accuracy1 . This fact also justifies in most cases the
approximations which we have to apply in the following.
We denote the error of x by δx or δ_x. Sometimes it is convenient to
quote dimensionless relative errors δx/x, which are useful in error propagation
– see below.

4.1.3 Definition of Measurement and its Error

Measurements are either quantities read from a measurement device or simply


an instrument – we call them input quantities – or derived quantities, like the
average of two or more input quantities, the slope of a street, or a rate which
are computed from several input quantities. Let us first restrict ourselves to
¹ There are exceptions to this rule in hypothesis testing (see Chap. 10).

input quantities. An input quantity can be regarded as an observation, i.e. a


random variable x drawn from a distribution centered around the true value
xt of the quantity which we want to determine. The measurement process,
including the experimental setup, determines the type of this distribution
(Gauss, Poisson, etc.) For the experimenter the true value is an unknown
parameter of the distribution. The measurement and its error are estimates
of the true value and of the standard deviation of the distribution2 . This
definition allows us to apply relations which we have derived in the previous
chapter for the standard deviation to calculations of the uncertainty, e.g. the
error δ of a sum of independent measurements with individual errors δ_i is
given by δ² = Σ δ_i².
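The quadratic addition rule can be written as a one-line helper; the sketch below is purely illustrative:

```python
import math

# Sketch: the error of a sum of independent measurements is obtained
# by adding the individual errors in quadrature, delta^2 = sum(delta_i^2).

def error_of_sum(errors):
    return math.sqrt(sum(d * d for d in errors))

delta = error_of_sum([0.3, 0.4, 1.2])  # sqrt(0.09 + 0.16 + 1.44) = 1.3
```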
In an ideal situation the following conditions are fulfilled:
1. The mean value of infinitely often repeated measurements coincides with
the true value, i.e. the true value is equal to the expectation value hxi
of the measurement distribution, see Sect. 3.2. The measurement is then
called unbiased.
2. The assigned measurement error is independent of the measured value.
These properties can not always be realized exactly but often they are
valid to a sufficiently good approximation. The following two examples refer
to asymmetric errors where in the first but not in the second the asymmetry
can be neglected.

Example 42. Scaling error


A tape measure is slightly elastic. The absolute measurement error increases
with the measured length. Assuming a scaling error of 1%, the
estimate of the error of a measured length would on average also be wrong by
1% and asymmetric by the same proportion. This, however, is completely
unimportant.

Example 43. Low decay rate


We want to measure the decay rate of a radioactive compound. After one
hour we have recorded one decay. Given such small rates, it is not correct
to compute the error from a Poisson distribution (see Sect. 3.6.3) in which
we replace the mean value by the observed measurement. The declaration
R = 1 ± 1 does not reflect the result correctly because R = 0 is excluded by
the observation while R = 2.5 on the other hand is consistent with it.

² Note that we do not need to know the full error distribution but only its
standard deviation.

In Sect. 8.1 we will, as mentioned above, also discuss more complex cases,
including asymmetric errors due to low event rates or other sources.
Apart from the definition of a measurement and its error by the estimated
mean and standard deviation of the related distribution there exist other
conventions: Distribution median, maximal errors, width at half maximum
and confidence intervals. They are useful in specific situations but suffer from
the crucial disadvantage that they are not suited for the combination of
measurements or the determination of the errors of dependent variables, i.e.
error propagation.
There are uncertainties of different nature: statistical errors and systematic
errors. Their definitions are not unambiguous, differ from author to
author, and depend somewhat on the scientific discipline in which they are
treated.

4.2 Statistical Errors

4.2.1 Errors Following a Known Statistical Distribution

Relatively simple is the interpretation of measurements if the distributions


of the errors follow known statistical laws. The corresponding uncertainties
are called statistical errors. Examples are the measurement of counting rates
(Poisson distribution), counter efficiency (binomial distribution) or of the
lifetime of unstable particles (exponential distribution). Characteristic for
statistical errors is that sequential measurements are uncorrelated and thus
the precision of the combined results is improved by the repetition of the
measurement. In these cases the distribution is known up to a parameter –
its expected value. We then associate the actually observed value to that pa-
rameter and declare as measurement error the standard deviation belonging
to that distribution.

Example 44. Poisson distributed rate


Recorded have been N = 150 decays. We set the rate and its error equal
to Z = N ± √N = 150 ± √150 ≈ 150 ± 12 .

Example 45. Digital measurement (uniform distribution)


With a digital clock the time t = 237 s has been recorded. The error is
δt = 1/√12 s ≈ 0.3 s, thus t = (237.0 ± 0.3) s .

Example 46. Efficiency of a detector (binomial distribution)


From N0 = 60 particles which traverse a detector, 45 are registered. The
efficiency is ε = N/N₀ = 0.75. The error derived from the binomial distribution
is

\delta\varepsilon = \delta N/N_0 = \sqrt{\varepsilon(1-\varepsilon)/N_0} = \sqrt{0.75 \cdot 0.25/60} = 0.06 .
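The computation of the example can be sketched as follows (illustrative function name, not from the text):

```python
import math

# Sketch: efficiency and its binomial error, as in Example 46:
# delta_eps = sqrt(eps * (1 - eps) / N0).

def efficiency_with_error(n_passed, n_total):
    eps = n_passed / n_total
    return eps, math.sqrt(eps * (1.0 - eps) / n_total)

eps, d_eps = efficiency_with_error(45, 60)   # 0.75 +- 0.06
```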

Example 47. Calorimetric energy measurement (normal distribution)


The energy of a high-energy electron is measured by a scintillating fiber
calorimeter by collecting the light produced by the electromagnetic cascade in
the scintillator of the device. From the calibration of the calorimeter with
electrons of known energies E we know that the calorimeter response is well
described by a Gaussian with mean proportional to E and variance proportional
to E.

Many experimental signals follow to a very good approximation a normal


distribution. This is due to the fact that they consist of the sum of many
contributions and a consequence of the central limit theorem.
In particle physics we derive parameters usually from a sample of events
and thus take the average of many independent measurements. We have seen
that the relative error of the mean from N i.i.d. measurements decreases with
1/√N, see relation (3.13). This behavior is typical for statistical errors.
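The 1/√N behavior can be illustrated with a small simulation (the numbers below are arbitrary choices, not from the text):

```python
import math
import random

# Sketch: the error of the mean of N i.i.d. measurements falls like
# 1/sqrt(N); quadrupling N halves the error.

def error_of_mean(n_meas, n_repeat, rng):
    means = []
    for _ in range(n_repeat):
        sample = [rng.gauss(10.0, 2.0) for _ in range(n_meas)]
        means.append(sum(sample) / n_meas)
    m = sum(means) / n_repeat
    return math.sqrt(sum((x - m) ** 2 for x in means) / (n_repeat - 1))

rng = random.Random(2)
e10 = error_of_mean(10, 4000, rng)   # expect sigma/sqrt(10) ~ 0.632
e40 = error_of_mean(40, 4000, rng)   # expect sigma/sqrt(40) ~ 0.316
```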

4.2.2 Errors Determined from a Sample of Measurements

An often used method for the estimation of errors is to repeat a measurement


several times and to estimate the error from the fluctuation of the results.
The results presented below will be justified in subsequent chapters but are
also intuitively plausible.
In the simplest case, for instance in calibration procedures, the true value
xt of the measured quantity x is known, and the measurement is just done
to get information about the accuracy of the measurement. An estimate of
the average error δx of x from N measurements is in this case

(\delta x)^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - x_t)^2 .

We have to require that the fluctuations are purely statistical and that cor-
related systematic variations are absent, i.e. the data have to be independent

from each other. The relative uncertainty of the error estimate follows the
1/√N law. It will be studied below. For example, with 100 repetitions of the
measurement, the uncertainty of the error itself is reasonably small, i.e. about
10 %, but depends on the distribution of x.
When the true value is unknown, we can approximate it by the sample
mean \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i and use the following recipe:

(\delta x)^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 .   (4.1)

In the denominator of the formula used to determine the mean quadratic
deviation (δx)² of a single measurement, N − 1 appears instead of N. This is
plausible because, when we compute the empirical mean value x̄, the measurements
xi enter, and thus they are expected to be on average nearer to their
mean value than to the true value. In particular, division by N would produce
the absurd value δx = 0 for N = 1, while division by N − 1 yields
an indefinite result. The derivation of (4.1) follows from (3.15). The quantity
(δx)² in (4.1) is sometimes called empirical variance. We have met it already
in Sect. 3.2.3 of the previous chapter.
Frequently, we want to find the error for measurements xi which are con-
strained by physical or mathematical laws and where the true values are
estimated by a parameter fit (to be explained in subsequent chapters). The
expression (4.1) then is generalized to

(\delta x)^2 = \frac{1}{N-Z} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2 .   (4.2)

where x̂i are the estimates of the true values corresponding to the measure-
ments xi and Z is the number of parameters that have been adjusted using
the data. When we compare the data of a sample to the sample mean we have
Z = 1 parameter, namely x̄, when we compare coordinates to the values of a
straight line fit then we have Z = 2 free parameters to be adjusted from the
data, for instance, the slope and the intercept of the line with the ordinate
axis. Again, the denominator N − Z is intuitively plausible, since for N = Z
we have 2 points lying exactly on the straight line which is determined by
them, so also the numerator is zero and the result then is indefinite.
Relation (4.2) is frequently used in particle physics to estimate momentum
or coordinate errors from empirical distributions (of course, all errors are
assumed to be the same). For example, the spatial resolution of tracking
devices is estimated from the distribution of the residuals (x_i − x̂_i). The
individual measurement error δx as computed from M tracks and N points
per track is then estimated quite reliably by

(\delta x)^2 = \frac{1}{(N-Z)M} \sum_{i=1}^{M \times N} (x_i - \hat{x}_i)^2 .
Not only the precision of the error estimate, but also the precision of a
measurement can be increased by repetition. The error δx̄ of the corresponding
sample mean is, following the results of the previous section, given by

(\delta \bar{x})^2 = (\delta x)^2 / N = \frac{1}{N(N-1)} \sum_{i=1}^{N} (x_i - \bar{x})^2 .   (4.3)

Example 48. Average from 5 measurements


In the following table five measurements are displayed.

measurements x_i    quadratic deviations (x_i − x̄)²
2.22                0.0009
2.25                0.0000
2.30                0.0025
2.21                0.0016
2.27                0.0004
Σ x_i = 11.25       Σ (x_i − x̄)² = 0.0054
x̄ = 2.25            (δx)² = Σ (x_i − x̄)² / 4 = 0.0013

The resulting mean value is x̄ = 2.25 ± 0.02. We have used that the error of
the mean value is smaller by the factor √5 than that of a single measurement,
δx̄ = δx/√5. With only 5 repetitions the precision of the error estimate is
rather poor.
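The use of the denominator N − 1 in (4.1) can also be illustrated numerically: averaged over many small samples, the recipe reproduces the true variance, while division by N would underestimate it. The sketch below uses arbitrary illustrative numbers:

```python
import random

# Sketch: the empirical variance (4.1), with denominator N-1, is on
# average equal to the true variance; dividing by N would be too small.

def empirical_var(sample):
    n = len(sample)
    xbar = sum(sample) / n
    return sum((x - xbar) ** 2 for x in sample) / (n - 1)

rng = random.Random(9)
true_var = 4.0
estimates = [empirical_var([rng.gauss(0.0, 2.0) for _ in range(5)])
             for _ in range(50000)]
avg = sum(estimates) / len(estimates)   # close to true_var = 4
biased_avg = avg * 4.0 / 5.0            # denominator N: underestimates
```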


Our recipe yields δx̄ ∼ 1/√N, i.e. the error becomes arbitrarily small
if the number of measurements approaches infinity. The validity of the
1/√N behavior relies on the assumption that the fluctuations are purely
statistical and that correlated systematic variations are absent, i.e. the data
have to be independent of each other. When we measure repeatedly the period
of a pendulum, then the accuracy of the measurements can be deduced from
the variations of the results only if the clock is not stopped systematically
too early or too late and if the clock is not running too fast or too slow. Our
experience tells us that some correlation between the different measurements
usually cannot be avoided completely and thus there is a lower limit for δx.
To obtain a reliable estimate of the uncertainty, we have to take care that
the systematic uncertainties are small compared to the statistical error δx.

4.2.3 Error of the Empirical Variance


Sometimes we are interested in the variance of an empirical distribution and
in its uncertainty. In the same category falls the problem to estimate the
error of the error of a parameter which is determined from a series of mea-
surements. For example, we may need to know the resolution of a meter or
the width of a spectral line and the related accuracy. It is also of interest
to know how often a calibration measurement has to be performed to esti-
mate the corresponding error with sufficient accuracy. In these situations the
variance s2 itself is the result of the investigation to which we would like to
associate an uncertainty.
The variance of (x − µ)² for a given distribution is easily calculated using
the formulas of Sect. 3.2.3. We omit the details of the calculation and quote
the result, which is related to the second and fourth central moments:

\mathrm{var}[(x-\mu)^2] = \left\langle \left( (x-\mu)^2 - \sigma^2 \right)^2 \right\rangle = \mu_4' - \sigma^4 .
We now assume that our sample is large and replace the distribution moments
µ′_n by the empirical central moments m′_n,

m_n' = \frac{1}{N} \sum (x_i - \bar{x})^n .
The moment s² = m′₂ is an estimate for σ². For N events in the sample, we
get for the uncertainty δs² of s²

(\delta s^2)^2 = \frac{m_4' - m_2'^2}{N}

and from error propagation (see next section 4.4) we derive the uncertainty
of s itself:

\frac{\delta s}{s} = \frac{1}{2} \frac{\delta s^2}{s^2} = \frac{1}{2\sqrt{N}} \frac{\sqrt{m_4' - s^4}}{s^2} .
If the type of distribution is known, we can use relations between moments.
Thus, for the normal distribution we have µ′₄ = 3σ⁴ (see Sect. 3.6.5),
and it follows that δs/s = 1/√(2N), which also follows from the variance of the χ²
distribution. This relation sometimes is applied to arbitrary distributions. It
then often underestimates the uncertainty.
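The relation δs/s = 1/√(2N) for the normal distribution can be checked with a short simulation (illustrative parameters, not from the text):

```python
import math
import random

# Sketch: for samples of size N from a normal distribution, the
# relative uncertainty of the empirical standard deviation s is
# delta_s / s = 1 / sqrt(2N).

def empirical_sd(sample):
    n = len(sample)
    xbar = sum(sample) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))

rng = random.Random(4)
N = 50
estimates = [empirical_sd([rng.gauss(0.0, 1.0) for _ in range(N)])
             for _ in range(20000)]
m = sum(estimates) / len(estimates)
spread = math.sqrt(sum((s - m) ** 2 for s in estimates)
                   / (len(estimates) - 1))
rel = spread / m               # close to 1/sqrt(2N) = 0.1
```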

4.3 Systematic Errors


The errors assigned to measurements serve primarily the purpose to verify
or reject theoretical predictions and to establish new discoveries. Sometimes

a significance of four or five standard deviations is required to accept a new


finding, for example a new particle that manifests itself through a bump in
a mass distribution. Obviously our reasoning here is based on the assump-
tion that the errors are approximately normally distributed which is the case
for the statistical error derived from the number of events that have been
involved. If background has to be subtracted which usually is extrapolated
from regions left and right of the bump location, then the distribution of
the background has to be estimated. In this way additional uncertainties are
introduced, which are summarized in a systematic error. Nearly every mea-
surement is subject to systematic errors, typically associated with auxiliary
parameters related to the measuring apparatus, or to model assumptions.
Their evaluation is especially important in high precision measurements like
those of the magnetic dipole moment of the muon or of the CP violation
constants in the neutral kaon system.
The result of a measurement is typically presented in the form x = 2.34 ±
0.06 = 2.34 ± 0.05(stat.) ± 0.03(syst.).
The main reason for the separate quotation of the two uncertainties is that
the systematic uncertainties are usually less well known than the purely sta-
tistical errors. Thus, for example, excluding a prediction by say a 4 standard
deviation measurement where the errors are dominantly of systematic type is
certainly less convincing than if the result is purely statistical. Furthermore
the separate quotation is informative for subsequent experiments; it helps to
design an experiment in such a way that the systematic errors are reduced
or avoided such that the precision of a measurement can be improved.

4.3.1 Definition and Examples

Systematic errors are at least partially based on assumptions made by the experimenter,
are model dependent or follow unknown distributions. This leads
to correlations between repeated measurements because the assumptions entering
into their evaluations are common to all measurements. Therefore,
contrary to statistical errors, the relative error of the mean value of repeated
measurements suffering from systematic errors violates the 1/√N law.
A systematic error arises for instance if we measure a distance with a
tape-measure which may have expanded or shrunk due to temperature ef-
fects. Corrections can be applied and the corresponding uncertainty can be
estimated roughly from the estimated range of temperature variations and
the known expansion coefficient of the tape material if it is made out of metal.
It may also be guessed from previous experience.
Systematic errors occur also when an auxiliary parameter is taken from
a technical data sheet where the given uncertainty is usually not of the type
“statistical”. It may happen that we have to derive a parameter from two or
three observations following an unknown distribution. For instance, the cur-
rent of a magnet may have been measured at the beginning and at the end of

an experiment. The variation of the current introduces an error for the mo-
mentum measurement of charged particles. The estimate of the uncertainty
from only two measurements obeying an unknown distribution of the magnet
variations will be rather vague and thus the error is classified as systematic.
A relatively large systematic error has affected the measurement of the
mass of the Z 0 particle by the four experiments at the LEP collider. It was
due to the uncertainty in the beam energy and has led to sizable correlations
between the four results.
Typical systematic uncertainties are the following:
1. Uncertainties in the experimental conditions (Calibration uncertainties
for example of a calorimeter or the magnetic field, unknown beam con-
ditions, unknown geometrical acceptance, badly known detector reso-
lutions, temperature and pressure dependence of the performance of
gaseous tracking detectors.),
2. unknown background behavior,
3. limited quality of the Monte Carlo simulation due to technical approxi-
mations,
4. uncertainties in the theoretical model used in the simulation (approxima-
tions in radiative or QCD corrections, poorly known parton densities),
5. systematic uncertainties caused by the elimination of nuisance parameters
(see Sect. 7.8),
6. uncertainties in auxiliary parameters taken from data sheets or previous
experiments.

Contrary to some authors [28] we classify uncertainties from a limited


number of Monte Carlo events as statistical.

4.3.2 How to Avoid, Detect and Estimate Systematic Errors

Some systematic errors are difficult to retrieve3 . If, for instance, in the data
acquisition system the deadtime is underestimated, all results may look per-
fectly all right. In order to detect and to estimate systematic errors, expe-
rience, common sense, and intuition is needed. A general advice is to try to
suppress them as far as possible already by an appropriate design of the ex-
periment and to include the possibility of control measurements, like regular
calibration. Since correlation of repeated measurements is characteristic for
the presence of systematic errors, observed correlations of results with pa-
rameters related to the systematic effects provide the possibility to estimate
and reduce the latter. In the pendulum example, where the frequency is de-
termined from a time measurement for a given number of periods, systematic

3
For example, the LEP experiments had to discover that monitoring the beam
energy required a magnet model which takes into account leakage currents from
nearby passing trains and tidal effects.
4.3 Systematic Errors 101

contribution to the error due to a possible unknown bias in the stopping pro-
cedure can be estimated by studying the result as a function of the number
of periods and it can be reduced by increasing the measurement time. In par-
ticle physics experiments where usually only a fraction of events is accepted
by some filtering procedure, it is advisable to record also a fraction of those
events that are normally rejected (downscaling) and to try to understand
their nature. Some systematic effects are related to the beam intensity, thus
a variation of the beam intensity helps to study them.
How can we detect systematic errors caused for instance by background
subtraction or efficiency corrections at the stage of data analysis? Clearly, a
thorough comparison of the collected data with the simulation in as many
different distributions as possible is the primary method. All effects that can
be simulated are necessarily understood.
Often kinematical or geometrical constraints can be used to retrieve sys-
tematic shifts and to estimate the uncertainties. A trivial example is the
comparison of the sum of measured angles of a triangle with the value 180°
which is common in surveying. In the experiments of particle physics we can
apply among other laws the constraints provided by energy and momentum
conservation. When we adjust curves, e.g. a straight line to measured points,
the deviations of the points from the line permit us to check the goodness of
the fit, and if the fit is poor, we might reject the presumed parametrization
or revise the error assignment. (Goodness-of-fit tests will be treated in Chap.
10.) Biases in the momentum measurement can be detected by comparing the
locations and widths of mass peaks to the nominal values of known particles.
A widely used method is also the investigation of the results as a function
of the selection criteria. A correlation of the interesting parameter with the
value of a cut-off parameter in a certain variable is a clear indication for the
presence of systematic errors. It is evident though that the systematic errors
then have to be much larger than the normal statistical fluctuations in order
to be detected. Obviously, we want to discriminate also systematic errors
which are of the same order of magnitude as the statistical ones, preferably
much smaller. Therefore we have to investigate samples, where the systematic
effects are artificially enhanced. If we suspect rate dependent distortion effects
as those connected with dead times, it is recommended to analyze a control
sample with considerably enhanced rate. When we eliminate a background
reaction by a selection criterion, we should investigate its importance in the
region which has been excluded, where it is supposed to be abundant.
Frequently made mistakes are the following: 1. From the fact that the data are consistent with the absence of systematic errors, it is concluded that such errors do not exist. This always leads to an underestimation of the systematic errors. 2. The changes of the results found by varying the selection criteria are directly converted into systematic errors. This in most cases leads to overestimates, because the variations are partially due to the normal statistical fluctuations.
There is no simple recipe for the estimation of systematic uncertainties.
Let us consider again the problem of background subtraction under an in-
teresting physics signal. If we know nothing about the background, we can-
not exclude with absolute certainty that the whole signal is faked by the
background. We should exploit all possibilities to reduce the background by
looking into many different distributions to derive efficient kinematical cuts.
In the end we have to use plausible extrapolations of the background shape
based on experience and common sense.
4.3.3 Treatment of Systematic Errors
As mentioned, the systematic and the statistical contributions to the measurement error should be declared separately.
In many experiments there appears a quite large number – typically a
dozen or so – of such systematic uncertainties. When we combine systematic
errors (see Sect. 8.1), we can often profit from the central limit theorem (see
Sect. 3.6.5) provided that they are all of the same order of magnitude and that
the contributions to the measurement are additive. The distribution of the
sum of variables suffering from systematic uncertainties approaches a normal
distribution, with variance equal to the sum of variances of the contributing
distributions. In this case tails in the distributions of the individual systematic
errors are less disturbing.
Sometimes a systematic effect affects several experiments which measure
the same quantity. For example, the measurements of the mass of the Z 0
particle by the four experiments at the e⁺e⁻ collider LEP suffered from
a common systematic uncertainty of the beam energy. When we combine the
results the correlation has to be taken into account.
Sometimes systematic errors are combined linearly. There is no justifica-
tion for such a procedure.
Interesting discussions of systematic errors can be found in [28, 29]. In [30]
a very detailed and competent study of systematic errors as met in particle
physics experiments is presented.
In Ref. [28] purely statistical uncertainties related to detector effects or
secondary measurements are called class 1 systematic errors, but the author
states that a classification of these uncertainties as statistical errors would be
more informative. He subdivides further the real systematic errors following
his definition of systematic errors (which coincides with ours), into system-
atic errors related to experimental effects (class 2 ) and those depending on
theoretical models (class 3 ). This distinction makes sense, because our pos-
sibilities to reduce, detect and estimate class 2 and class 3 errors are very
different.
Fig. 4.1. Linear error propagation.
4.4 Linear Propagation of Errors
4.4.1 Error Propagation
We now want to investigate how a measurement error propagates into quantities which are functions of the measurement. We consider a function y(x),
a measurement value xm ± δx, with the standard deviation δx, and are in-
terested in ym , the corresponding measurement of y and its error δy. If the
p.d.f. f (x) is known, we can determine the p.d.f. of y, its expectation value
ym and the standard deviation δy by an analytic or numerical transformation
of the variables, as introduced above in Chap. 3. We will assume, however,
that the measurement error is small enough to justify the approximation of
the function by a linear expression within the error limits. Then we need not
know the full p.d.f. f (x).
We use the Taylor expansion of y around xm:

y = y(xm) + y′(xm)Δx + (1/2!) y′′(xm)(Δx)² + ··· .
We neglect quadratic and higher order terms, set ym equal to the expected
value of y, and (δy)2 equal to the expected value of the squared deviation.
According to the definition, the expected value of ∆x = x − xm is zero,
and that of (∆x)2 equals (δx)2 . (In our notation quantities denoted by δ
are expected values, i.e. fixed positive parameters, while ∆x is a random
variable). We get
ym = ⟨y(x)⟩ ≈ ⟨y(xm)⟩ + ⟨y′(xm)Δx⟩ = y(xm) ,
and

(δy)² = ⟨(y − ym)²⟩
      ≈ ⟨(y(xm) + y′(xm)Δx − ym)²⟩
      = y′²(xm) ⟨(Δx)²⟩
      = y′²(xm) (δx)² ,
δy = |y′(xm)| δx .
This result could also have been read off directly from Fig. 4.1.
Examples of the linear propagation of errors for some simple functions
are compiled below:
Function:               Relation between errors:
y = axⁿ        ⇒   δy/|y| = |n| δx/|x| ,
y = a ln(bx)   ⇒   δy = |a| δx/|x| ,
y = a eᵇˣ      ⇒   δy/|y| = |b| δx ,
y = tan x      ⇒   δy/|y| = δx/|cos x sin x| .
4.4.2 Error of a Function of Several Measured Quantities
Most physical measurements depend on several input quantities and their uncertainties. For example, a velocity measurement v = s/t based on the
measurements of length and time has an associated error which obviously
depends on the errors of both s and t.
Let us first consider a function y(x1 , x2 ) of only two measured quantities
with values x1m , x2m and errors δx1 , δx2 . With the Taylor expansion
y = y(x1m, x2m) + (∂y/∂x1)(x1m, x2m) Δx1 + (∂y/∂x2)(x1m, x2m) Δx2 + ···
we get as above to lowest order:
ym = ⟨y(x1, x2)⟩ = y(x1m, x2m)
and

(δy)² = ⟨(Δy)²⟩
      = (∂y/∂x1)² ⟨(Δx1)²⟩ + (∂y/∂x2)² ⟨(Δx2)²⟩ + 2 (∂y/∂x1)(∂y/∂x2) ⟨Δx1 Δx2⟩
      = (∂y/∂x1)² (δx1)² + (∂y/∂x2)² (δx2)² + 2 (∂y/∂x1)(∂y/∂x2) R12 δx1 δx2 ,   (4.4)
with the correlation coefficient

R12 = ⟨Δx1 Δx2⟩ / (δx1 δx2) .
In most cases the quantities x1 and x2 are uncorrelated. Then the relation
(4.4) simplifies with R12 = 0 to
(δy)² = (∂y/∂x1)² (δx1)² + (∂y/∂x2)² (δx2)² .
If the function is a product of independent quantities, it is convenient to
use relative errors as indicated in the following example:
z = xⁿ yᵐ ,

(δz/z)² = (n δx/x)² + (m δy/y)² .
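A minimal sketch of (4.4) for two measured quantities (our own illustration; the function name is hypothetical). For z = xⁿyᵐ with uncorrelated inputs it reproduces the quadratic addition of relative errors:

```python
import math

def propagate2(dydx1, dydx2, dx1, dx2, rho=0.0):
    """Error of y(x1, x2) by Eq. (4.4): partial derivatives, standard
    deviations dx1, dx2 and correlation coefficient rho."""
    var = (dydx1 * dx1)**2 + (dydx2 * dx2)**2 \
        + 2 * dydx1 * dydx2 * rho * dx1 * dx2
    return math.sqrt(var)

# z = x^n * y^m with independent x and y
n, m = 2, 3
x, dx = 4.0, 0.1
y, dy = 2.0, 0.05
z = x**n * y**m
dz = propagate2(n * x**(n - 1) * y**m, m * x**n * y**(m - 1), dx, dy)
rel = math.sqrt((n * dx / x)**2 + (m * dy / y)**2)
assert math.isclose(dz / z, rel, rel_tol=1e-12)
```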
It is not difficult to generalize our results to functions y(x1 , .., xN ) of N
measured quantities. We obtain
(δy)² = Σᵢ,ⱼ (∂y/∂xi)(∂y/∂xj) Rij δxi δxj
      = Σᵢ (∂y/∂xi)² (δxi)² + Σᵢ≠ⱼ (∂y/∂xi)(∂y/∂xj) Rij δxi δxj
with the correlation coefficients

Rij = ⟨Δxi Δxj⟩ / (δxi δxj) ,   Rij = Rji ,   Rii = 1 .
The Covariance Matrix
To simplify the notation, we introduce the covariance matrix C,

    C = ( ⟨Δx1Δx1⟩  ⟨Δx1Δx2⟩  ...  ⟨Δx1Δxn⟩ )
        ( ⟨Δx2Δx1⟩  ⟨Δx2Δx2⟩  ...  ⟨Δx2Δxn⟩ )
        (     :         :              :    )
        ( ⟨ΔxnΔx1⟩  ⟨ΔxnΔx2⟩  ...  ⟨ΔxnΔxn⟩ ) ,

    Cij = Rij δxi δxj ,
which, in this context, is also called the error matrix. The covariance matrix is by definition positive definite and symmetric. The error δy of the dependent variable y is then given in linear approximation by
(δy)² = Σᵢ,ⱼ (∂y/∂xi)(∂y/∂xj) Cij ,

which can also be written in matrix notation as

(δy)² = ∇yᵀ C ∇y .
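In this matrix form, error propagation is a single expression. The following sketch (our own, assuming NumPy is available) checks (δy)² = ∇yᵀC∇y against the explicit two-variable formula (4.4):

```python
import numpy as np

def propagate(grad, C):
    """delta_y from (delta_y)^2 = grad^T C grad for correlated inputs."""
    grad = np.asarray(grad, dtype=float)
    return float(np.sqrt(grad @ C @ grad))

# Two correlated inputs: C_ij = R_ij * d_i * d_j
d1, d2, rho = 0.3, 0.4, 0.5
C = np.array([[d1**2,        rho * d1 * d2],
              [rho * d1 * d2, d2**2       ]])

# y = x1 + x2: (delta_y)^2 = d1^2 + d2^2 + 2*rho*d1*d2
dy = propagate([1.0, 1.0], C)
expected = (d1**2 + d2**2 + 2 * rho * d1 * d2) ** 0.5
assert np.isclose(dy, expected)
```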
For two variables with normally distributed errors following (3.48),

N(Δx1, Δx2) = [2πδ1δ2√(1−ρ²)]⁻¹ exp{ −[(Δx1)²/δ1² − 2ρ Δx1Δx2/(δ1δ2) + (Δx2)²/δ2²] / [2(1−ρ²)] } ,   (4.5)

we get

    C = ( δ1²     ρδ1δ2 )
        ( ρδ1δ2   δ2²   ) .
Error Ellipsoids
Two-dimensional Gaussian error distributions like (4.5) (see Sect. 3.6.5) have
the property that the curves of constant probability density are ellipses. In-
stead of nσ error intervals in one dimension, we define nσ error ellipses.
The curve of constant probability density with density down by a factor of
exp(−n2 /2) relative to the maximal density is the nσ error ellipse.
For the error distribution in the form of (4.5) the error ellipse is

[(Δx1)²/δ1² − 2ρ Δx1Δx2/(δ1δ2) + (Δx2)²/δ2²] / (1 − ρ²) = n² .
For uncorrelated errors the one standard deviation error ellipse is simply

(Δx1)²/δ1² + (Δx2)²/δ2² = 1 .
In higher dimensions we obtain ellipsoids, which are best written in vector notation with the weight matrix V = C⁻¹:

Δxᵀ V Δx = n² .
4.4.3 Averaging Uncorrelated Measurements
Important measurements are usually performed by various experiments in parallel, or are repeated several times. The combination of the results from
various measurements should be performed in such a way that it leads to op-
timal accuracy. Under these conditions we can calculate a so-called weighted
mean, with an error smaller than that of any of the contributing measure-
ments. We assume that the individual measurements are independent.
Remember that in this chapter we assume that the errors are small enough
to neglect a dependence of the error on the value of the measured quantity
within the range of the error. This condition is violated for instance for small
Poisson numbers. The general case will be discussed in Chap. 8.
As an example let us consider two measurements with measured values
x1 , x2 and errors δ1 , δ2 . With the relations given in Sect. 3.2.3, we find for
the error squared δ² of a weighted sum:

x = w1 x1 + w2 x2 ,
δ² = w1² δ1² + w2² δ2² .
Now we choose the weights in such a way that the error of the weighted sum is minimal, i.e. we seek the minimum of δ² under the condition w1 + w2 = 1. The result is
wi = (1/δi²) / (1/δ1² + 1/δ2²) ,

and for the combined error we get

1/δ² = 1/δ1² + 1/δ2² .
Generally, for N measurements we find
x = Σᵢ (xi/δi²) / Σᵢ (1/δi²) ,   (4.6)

1/δ² = Σᵢ (1/δi²) .   (4.7)
When all measurements have the same error, all weights are equal to wi = 1/N, and we get the normal arithmetic mean, with the corresponding reduction of the error by the factor 1/√N.
Remark: If the original raw data of different experiments are available, then we have the possibility to improve the averaging process compared to the simple use of the relations (4.6) and (4.7). When, for example, in two rate measurements of 1 and 3 hours duration, 2 and 12 events are observed, respectively, then the combined rate is (2 + 12)/(1 h + 3 h) = 3.5 h⁻¹, with an error of ±0.9 h⁻¹. Averaging according to (4.6) would lead to too low a value of (3.2 ± 1.2) h⁻¹, due to the above mentioned problem of small rates and asymmetric errors. The optimal procedure is in any case the addition of the log-likelihoods, which will be discussed in Chap. 8. It corresponds to the addition of the original data, as done here.
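The numbers of this remark can be reproduced directly (a sketch of our own, contrasting the combination of the raw counts with the naive weighted mean (4.6)):

```python
# Two counting measurements: n1 = 2 events in t1 = 1 h, n2 = 12 events in t2 = 3 h
n1, t1 = 2, 1.0
n2, t2 = 12, 3.0

# Combining the raw counts (equivalent to adding the log-likelihoods):
rate_combined = (n1 + n2) / (t1 + t2)        # 3.5 per hour
err_combined = (n1 + n2) ** 0.5 / (t1 + t2)  # ~0.94 per hour

# Naive weighted mean (4.6) of the individual rates r_i = n_i/t_i with
# errors sqrt(n_i)/t_i comes out too low:
r1, d1 = n1 / t1, n1 ** 0.5 / t1
r2, d2 = n2 / t2, n2 ** 0.5 / t2
w1, w2 = 1 / d1**2, 1 / d2**2
rate_mean = (w1 * r1 + w2 * r2) / (w1 + w2)  # 3.2 per hour

assert rate_combined > rate_mean
```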
4.4.4 Averaging Correlated Measurements
In Sect. 4.4.3 we derived the expression for the weighted mean of independent measurements of one and the same quantity. This is a special case of
a more general result for a sample of N measurements of the same quantity
which differ not only in their variances, but are also correlated, and therefore
not statistically independent. Consequently, they have to be described by a
complete N × N covariance or error matrix C.
We choose the weights for a weighted mean such that the variance of
the combined value is minimal, in much the same way as in Sect.4.4.3 for
uncorrelated measurements. For simplicity, we restrict ourselves to two mea-
surements x1,2 . The weighted sum x is

x = w1 x1 + w2 x2 , with w1 + w2 = 1 .

To calculate var(x) we have to take into account the correlation terms:
δx² ≡ var(x) = w1² C11 + w2² C22 + 2 w1 w2 C12 .
The minimum of δx² is achieved for

w1 = (C22 − C12) / (C11 + C22 − 2C12) ,
w2 = (C11 − C12) / (C11 + C22 − 2C12) .   (4.8)
The uncorrelated weighted mean corresponds to C12 = 0. Contrary to this
case, where the expression for the minimal value of δx2 is particularly simple,
it is not as transparent in the correlated case.
The case of N correlated measurements leads to the following expression for the weights:

wi = Σⱼ Vij / Σᵢ,ⱼ Vij ,
where V is the inverse matrix of C which we called the weight matrix in Sect.
3.6.5.
The weighted mean and its error, derived by error propagation, are:
x = Σᵢ wi xi = Σᵢ,ⱼ Vij xi / Σᵢ,ⱼ Vij ,   (4.9)

δ² = Σ_{ijkl} Vij Vkl Cik / (Σᵢ,ⱼ Vij)² = 1 / Σᵢ,ⱼ Vij .   (4.10)
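A sketch of (4.9) and (4.10) (our own helper, assuming NumPy is available); in the uncorrelated limit it reduces to (4.6) and (4.7):

```python
import numpy as np

def correlated_mean(x, C):
    """Weighted mean (4.9) and its error (4.10) for correlated
    measurements x with covariance matrix C."""
    V = np.linalg.inv(C)             # weight matrix
    w = V.sum(axis=1) / V.sum()      # weights w_i
    return float(w @ x), float(np.sqrt(1.0 / V.sum()))

# Uncorrelated limit: reproduces the ordinary weighted mean
x = np.array([1.0, 2.0])
d = np.array([0.1, 0.2])
mean, err = correlated_mean(x, np.diag(d**2))
w = (1 / d**2) / (1 / d**2).sum()
assert np.isclose(mean, (w * x).sum())
assert np.isclose(err, (1 / d**2).sum() ** -0.5)
```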
Example 49. Average of measurements with common offset error
Several experiments (i) determine the energy Ei∗ of an excited nuclear
state by measuring its transition energy Ei with the uncertainty δi to the
ground state with energy E0 . They take the value of E0 from the same table
which quotes an uncertainty of δ0 of the ground state energy. Thus the results
Ei∗ = Ei + E0 are correlated. The covariance matrix is
Cij = ⟨(Δi + Δ0)(Δj + Δ0)⟩ = δi² δij + δ0² .
C is the sum of a diagonal matrix and a matrix whose elements are all identical, namely equal to δ0². In this special situation the variance var(E*) ≡ δ² of the combined result E* = Σ wi Ei* is
δ² = Σᵢ wi² Cii + Σᵢ≠ⱼ wi wj Cij
   = Σᵢ wi² δi² + (Σᵢ wi)² δ0² .
Since the second sum is unity, the second term is irrelevant when we minimize δ² with respect to the weights, and we get the same result (4.6) for the weighted mean E* as in the uncorrelated case. For its error we find, as could have been expected,
δ² = (Σᵢ 1/δi²)⁻¹ + δ0² .
It is interesting that in some rare cases the weighted mean of two correlated measurements x1 and x2 is not located between the individual measurements: the so-called “mean value” is not contained in the interval [x1, x2].
Example 50. Average outside the range defined by the individual measurements

The matrix

    C = ( 1  2 )
        ( 2  5 )

with eigenvalues λ1,2 = 3 ± √8 > 0 is symmetric and positive definite and thus a possible covariance matrix. But following (4.8) it leads to the weights w1 = 3/2, w2 = −1/2. Thus the weighted mean x = (3/2)x1 − (1/2)x2 with x1 = 0, x2 = 1 will lead to x = −1/2, which is less than

both input values. The reason for this sensible but at first sight unexpected
result can be understood intuitively in the following way: Due to the strong
correlation, x1 and x2 , both will usually be either too large or too low. An
indication, that x2 is too large is the fact that it is larger than x1 which is
the more precise measurement. Thus the true value x then is expected to be
located below both x1 and x2 .

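The numbers of Example 50 can be checked in a few lines (an illustration of our own):

```python
# Covariance matrix of Example 50 and the weights of Eq. (4.8)
C11, C22, C12 = 1.0, 5.0, 2.0
den = C11 + C22 - 2 * C12
w1 = (C22 - C12) / den   # 3/2
w2 = (C11 - C12) / den   # -1/2

x1, x2 = 0.0, 1.0
x = w1 * x1 + w2 * x2    # -1/2, outside the interval [x1, x2]
assert x == -0.5 and x < min(x1, x2)
```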
4.4.5 Averaging Measurements with Systematic Errors
The combination of measurements with systematic errors proceeds in the same way as for measurements with purely random errors. We form the
weighted sum where the weights are computed from the full error and we
associate to it again a statistical and a systematic error that we calculate by
simple error propagation.
To avoid complicated indices, we write the result of a measurement as x ± δ or x ± a ± b, where a stands for the statistical and b for the systematic error. For N measurements xi ± δi, i.e. xi ± ai ± bi, we have as before
x = Σᵢ wi xi

with

wi = (1/δi²) / Σᵢ (1/δi²) .
The statistical and the systematic errors are
a² = Σᵢ wi² ai² ,
b² = Σᵢ wi² bi² .
Now we consider correlated errors. As always, we assume that the statistical errors are not correlated with the systematic errors. Then we have, apart
from the combined covariance matrix C, statistical and systematic covariance
matrices A and B which add up to C, Cij = Aij + Bij . The formulas (4.9)
and (4.10) remain valid and if we split the error into its statistical and its
systematic part we get:
a² = Σ_{ijkl} Vij Vkl Aik / (Σᵢ,ⱼ Vij)² ,

b² = Σ_{ijkl} Vij Vkl Bik / (Σᵢ,ⱼ Vij)² .
Example 51. Average of Z⁰ mass measurements
In four experiments at the LEP storage ring the mass of the Z⁰ particle has been measured. The results in units of GeV are summarized in the first four lines of the following table:
experiment   mass x    error δ   stat. error a   syst. error b
OPAL         91.1852   0.0030    0.0023          0.0018
DELPHI       91.1863   0.0028    0.0023          0.0016
L3           91.1898   0.0031    0.0024          0.0018
ALEPH        91.1885   0.0031    0.0024          0.0018
mean         91.1871   0.0023    0.0016          0.0017
The estimated covariance matrices in MeV² are:

    C = ( 30²  16²  16²  16² )
        ( 16²  28²  16²  16² )
        ( 16²  16²  31²  16² ) ,
        ( 16²  16²  16²  31² )

    A = ( 23²   0    0    0  )      B = ( 18²  16²  16²  16² )
        (  0   23²   0    0  )          ( 16²  16²  16²  16² )
        (  0    0   24²   0  ) ,        ( 16²  16²  18²  16² ) .
        (  0    0    0   24² )          ( 16²  16²  16²  18² )

The covariance matrices are estimates derived from numbers given in [31]. The systematic errors are almost completely correlated. The weight matrix V = C⁻¹ is:

    V = (  1.32  −0.29  −0.29  −0.29 )
        ( −0.29   1.54  −0.29  −0.29 )
        ( −0.29  −0.29   1.22  −0.29 ) · 10⁻³ .
        ( −0.29  −0.29  −0.29   1.22 )
We insert these numbers into (4.9) and (4.10) and obtain the results displayed in the last line of the table. Remark: had we neglected the correlation, the uncertainty would have been only 15 MeV, compared to the correct number 23 MeV. The results do not exactly agree with the numbers m(Z⁰) = 91.1876 ± 0.0021 quoted in [31], where theoretical corrections have been applied to the Z⁰ mass.
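A sketch of the computation with the covariance matrix quoted above (our own code, assuming NumPy; the results are rounded as in the table):

```python
import numpy as np

# Masses (GeV) and total covariance matrix (MeV^2) as quoted above
x = np.array([91.1852, 91.1863, 91.1898, 91.1885])
C = np.array([[30**2, 16**2, 16**2, 16**2],
              [16**2, 28**2, 16**2, 16**2],
              [16**2, 16**2, 31**2, 16**2],
              [16**2, 16**2, 16**2, 31**2]], dtype=float)

V = np.linalg.inv(C)                        # weight matrix
w = V.sum(axis=1) / V.sum()                 # weights of Eq. (4.9)
mean = float(w @ x)                         # close to 91.1871 GeV
err_mev = float(np.sqrt(1.0 / V.sum()))     # close to 23 MeV

# Neglecting the correlations underestimates the error (about 15 MeV):
naive_mev = float(np.sqrt(1.0 / (1.0 / np.diag(C)).sum()))
assert err_mev > naive_mev
```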
4.4.6 Several Functions of Several Measured Quantities
When we fix a straight line by two measured points in the plane, we are normally interested in its slope and its intercept with a given axis. The errors
of these two quantities are usually correlated. The correlations often have to
be known in subsequent calculations, e.g. of the crossing point with a second
straight line.
In the general case we are dealing with K functions yk (x1 , .., xN ) of N
variables with given measurement values xi and error matrix C. The sym-
metric error matrix E related to the values yk is
⟨Δyk Δyl⟩ = Σᵢ,ⱼ (∂yk/∂xi)(∂yl/∂xj) ⟨Δxi Δxj⟩ ,   (4.11)

Ekl = Σᵢ,ⱼ (∂yk/∂xi)(∂yl/∂xj) Cij .
Defining a matrix Dki = ∂yk/∂xi, we can write more compactly

Ekl = Σᵢ,ⱼ Dki Dlj Cij ,   (4.12)

E = D C Dᵀ .   (4.13)
4.4.7 Examples
The following examples represent some standard cases of error propagation.
Example 52. Error propagation: velocity of a sprinter
Given are s = (100.0 ± 0.1) m and t = (10.00 ± 0.02) s; searched for is δv:

(δv/v)² = (δt/t)² + (δs/s)² ,

δv/v = √[(0.02/10)² + (0.1/100)²] = 2.2 · 10⁻³ .
Example 53. Error propagation: area of a rectangular table
Given are the sides a, b with a reading error δ1 and a relative scaling error δ2, caused by a possible extension or shrinkage of the measuring tape. We want to calculate the error δF of the area F = ab. We find

(δa)² = (δ1)² + (aδ2)² ,
(δb)² = (δ1)² + (bδ2)² ,
Cab = ab(δ2)² ,
(δF)² = b²(δa)² + a²(δb)² + 2ab Cab ,

(δF/F)² = (δ1)² (1/a² + 1/b²) + 4(δ2)² .
For large areas, the contribution of the reading error is negligible compared
to that of the scaling error.
Example 54. Straight line through two measured points
Given are two measured points (x1, y1 ± δy1), (x2, y2 ± δy2) of the straight line y = mx + b, where only the ordinate y possesses an error. We want to find the error matrix for the intercept

b = (x2 y1 − x1 y2)/(x2 − x1)

and the slope

m = (y2 − y1)/(x2 − x1) .
According to (4.11) we calculate the errors

(δm)² = [(δy1)² + (δy2)²] / (x2 − x1)² ,
(δb)² = [x2²(δy1)² + x1²(δy2)²] / (x2 − x1)² ,
E12 = ⟨Δm Δb⟩ = −[x2(δy1)² + x1(δy2)²] / (x2 − x1)² .

The error matrix E for m and b is therefore

    E = 1/(x2 − x1)² (  (δy1)² + (δy2)²         −x2(δy1)² − x1(δy2)²  )
                     ( −x2(δy1)² − x1(δy2)²   x2²(δy1)² + x1²(δy2)²  ) .
The correlation matrix element R12 is
R12 = E12 / (δm δb)
    = −[x2(δy1)² + x1(δy2)²] / {[(δy1)² + (δy2)²][x2²(δy1)² + x1²(δy2)²]}¹ᐟ² .   (4.14)
For the special case δy1 = δy2 = δy the results simplify to

(δm)² = 2(δy)² / (x1 − x2)² ,
(δb)² = (x1² + x2²)(δy)² / (x1 − x2)² ,
E12 = −(x1 + x2)(δy)² / (x1 − x2)² ,
R12 = −(x1 + x2) / √[2(x1² + x2²)] .
Remark: As seen from (4.14), for a suitable choice of the abscissa the correlation disappears. To achieve this, we take as the origin the “center of gravity” xs of the x-values xi, weighted with the inverse squared errors 1/(δyi)² of the ordinates:

xs = Σᵢ [xi/(δyi)²] / Σᵢ [1/(δyi)²] .
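The decorrelation property of the remark can be verified numerically (a sketch of our own; the function name is hypothetical):

```python
import math

def line_errors(x1, x2, dy1, dy2):
    """Errors of slope m and intercept b of a line through
    (x1, y1 +- dy1), (x2, y2 +- dy2), and their correlation R12, Eq. (4.14)."""
    d = (x2 - x1) ** 2
    dm2 = (dy1**2 + dy2**2) / d
    db2 = (x2**2 * dy1**2 + x1**2 * dy2**2) / d
    e12 = -(x2 * dy1**2 + x1 * dy2**2) / d
    return math.sqrt(dm2), math.sqrt(db2), e12 / math.sqrt(dm2 * db2)

# Shifting the origin to the weighted center of gravity xs removes the correlation:
x1, x2, dy1, dy2 = 1.0, 3.0, 0.2, 0.1
xs = (x1 / dy1**2 + x2 / dy2**2) / (1 / dy1**2 + 1 / dy2**2)
_, _, r12 = line_errors(x1 - xs, x2 - xs, dy1, dy2)
assert abs(r12) < 1e-12
```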
Example 55. Error of a sum of weighted measurements
In the evaluation of event numbers, the events are often counted with
different weights, in order to take into account, for instance, a varying accep-
tance of the detector. Weighting is also important in Monte Carlo simulations
(see Sect. 5.2.6), especially when combined with parameter estimation (Sect. 7.3). For
N different weights wi , i = 1, ..., N and ni events with weight wi the weighted
number of events is
s = Σᵢ ni wi .
As ni is Poisson distributed, its uncertainty is √ni. From error propagation we obtain for the error of the sum δs² = Σ ni wi². Normally we register individual events, ni = 1, and we get
δs² = Σᵢ wi² .   (4.15)
The sum of the weights follows a compound Poisson distribution which is
described in Sect. 3.7.3. The result (4.15) corresponds to (3.62).
4.5 Biased Measurements
We have required that our measurement values xi are undistorted (unbiased). We have used this property in the discussion of error propagation. Indeed, it
is rather plausible that we should always avoid biased measurements, because
averaging measurements with a common, e.g. correlated bias would produce
a result with the same bias. The average from infinitely many measurements
would thus be different from the true parameter value but the associated
error would be infinitely small. However a closer look at the problem reveals
that to require that independent measurements be unbiased, is not justified.
When we average measurements, the measurements xi are weighted with
1/δi2 , their inverse squared errors, as we have seen above. To be consistent,
it is therefore required that the quantities xi /δi2 are unbiased! Of course, we
explicitly excluded the possibility of errors which depend on the measurement
values, but since this requirement is violated so often in reality and since a
bias which is small compared to the uncertainty in an individual experiment
can become important in the average, we stress this point here and present
an example.
Example 56. Bias in averaging measurements
Let us assume that several measurements of a constant x0 produce unbiased results xi with errors δi ∼ xi which are proportional to the measured values. This could be, for instance, measurements of particle lifetimes, where the relative error is determined by the number of recorded decays and thus the absolute error is set proportional to the observed mean life. When we compute the weighted mean x over many such measurements,

x = Σ (xi/δi²) / Σ (1/δi²)
  = Σ (1/xi) / Σ (1/xi²)
  ≈ ⟨1/x⟩ / ⟨1/x²⟩ ,
the expected value is shifted systematically to lower values. This is easily seen from a Taylor expansion of the expected values:
⟨x − x0⟩ = ⟨1/x⟩/⟨1/x²⟩ − x0 ,

⟨1/x⟩ = (1/x0) ⟨1 − Δx/x0 + (Δx/x0)² − ···⟩
      ≈ (1/x0)(1 + δ²/x0²) ,

⟨1/x²⟩ = (1/x0²) ⟨1 − 2Δx/x0 + 3(Δx/x0)² − ···⟩
       ≈ (1/x0²)(1 + 3δ²/x0²) ,

⟨x − x0⟩ ≈ x0 (1 + δ²/x0²)/(1 + 3δ²/x0²) − x0
         ≈ x0 (1 − 2δ²/x0²) − x0 ,

⟨x − x0⟩/x0 ≈ −2 δ²/x0² .
Here δ² is the expectation of the squared error of an individual measurement. For a measurement error δ/x0 of 20 % we obtain a sizable bias of 8 % in the asymptotic result of infinitely many contributions.
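The asymptotic bias can be illustrated by a small simulation (our own sketch, not from the text): unbiased normal measurements of x0 are averaged with weights 1/δi², where the assigned error δi = 0.2 xi is proportional to the measured value.

```python
import random

random.seed(1)
x0, rel = 1.0, 0.2       # true value and 20 % relative error
N = 200_000

# Unbiased normal measurements, but with assigned errors d_i = rel * x_i
# proportional to the measured value:
xs = [random.gauss(x0, rel * x0) for _ in range(N)]

num = sum(1 / x for x in xs)       # sum x_i/d_i^2, proportional to sum 1/x_i
den = sum(1 / x**2 for x in xs)    # sum 1/d_i^2,  proportional to sum 1/x_i^2
weighted_mean = num / den

bias = (weighted_mean - x0) / x0   # expected near -2 * rel^2 = -8 %
assert -0.12 < bias < -0.04
```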
4.6 Confidence Intervals
Under the condition that the error distribution is a one-dimensional Gaussian, with a width independent of the expected value, the error intervals of many repeated measurements will cover the true parameter value in 68.3 % of the cases, because for any true value µ the probability to observe x inside a one standard deviation interval is

[√(2π) δ]⁻¹ ∫_{µ−δ}^{µ+δ} exp[−(x − µ)²/(2δ²)] dx ≈ 0.683 .
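The 68.3 % value follows from the error function: erf(n/√2) is the probability content of an nσ interval (a one-line check, our own sketch):

```python
import math

def gauss_coverage(n):
    """Probability for a normal variate to lie within n standard
    deviations of its mean: erf(n/sqrt(2))."""
    return math.erf(n / math.sqrt(2.0))

assert abs(gauss_coverage(1) - 0.683) < 1e-3   # the 68.3 % quoted above
assert abs(gauss_coverage(2) - 0.954) < 1e-3
```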
The region [x − δ, x + δ] is called a confidence interval⁴ with the confidence level (CL) of 68.3 %, or, in physicists’ jargon, a 1σ confidence interval. Thus
in about one third of the cases our standard error intervals, under the above
assumption of normality, will not contain the true value. Often a higher safety
is desired, for instance 90 %, 95 %, or even 99 %. The respective limits can
be calculated, provided the probability distribution is known with sufficient
⁴ We will discuss confidence intervals in more detail in Chap. 8 and in Appendix 13.6.
Fig. 4.2. Confidence ellipses for 1, 2 and 3 standard deviations and the corresponding probabilities 0.393, 0.865 and 0.989.
accuracy. For the normal distribution we present some limits in units of the
standard deviation in Table 4.2. The numerical values can be taken from
tables of the χ2 -distribution function.
For distributions of several variates, the probability to find all variables
inside their error limits is strongly decreasing with the number of variables.
Some probabilities for Gaussian errors are given in Table 4.1. In three di-
mensions only 20 % of the observations are found in the 1σ ellipsoid. Fig. 4.2
shows confidence ellipses and probabilities for two variables.

Example 57. Confidence level for the mean of normally distributed variates
Let us consider a sample of N measurements x1, . . . , xN which are supposed to be normally distributed with unknown mean µ but known variance σ². The sample mean x is also normally distributed, with standard deviation δN = σ/√N. The 1σ confidence interval [x − δN, x + δN] covers, as we have discussed above, the true value µ in 68.3 % of the cases. With the help of Table 4.2 we can also construct a 99 % confidence interval, i.e. [x − 2.58δN, x + 2.58δN].
We have to keep in mind that the Gaussian confidence limits do not, or only approximately, apply to other distributions. Error distributions often
have tails which are not well understood. Then it is impossible to derive
reliable confidence limits with high confidence levels. The same is true when
systematic errors play a role, for example due to background and acceptance
Table 4.1. Confidence levels for different values of the standard deviation σ.

Deviation         Dimension
             1       2       3       4
1 σ        0.683   0.393   0.199   0.090
2 σ        0.954   0.865   0.739   0.594
3 σ        0.997   0.989   0.971   0.939
4 σ        1.      1.      0.999   0.997
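For two dimensions the entries of Table 4.1 have a closed form, since n² follows a χ² distribution with two degrees of freedom, so the ellipse content is 1 − exp(−n²/2) (a sketch of our own, not from the text):

```python
import math

def coverage_2d(n):
    """Probability content of the n-sigma error ellipse in two dimensions,
    P(chi2_2 <= n^2) = 1 - exp(-n^2/2)."""
    return 1.0 - math.exp(-n**2 / 2.0)

# Reproduces the two-dimensional column of Table 4.1:
assert abs(coverage_2d(1) - 0.393) < 1e-3
assert abs(coverage_2d(2) - 0.865) < 1e-3
assert abs(coverage_2d(3) - 0.989) < 1e-3
```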

which usually are not known with great accuracy. Then for a given confidence
level much wider intervals than in the above case are required.
Table 4.2. Error limits in units of the standard deviation σ for several confidence levels.

Confidence        Dimension
level        1      2      3      4
0.50       0.67   1.18   1.54   1.83
0.90       1.65   2.14   2.50   2.79
0.95       1.96   2.45   2.79   3.08
0.99       2.58   3.03   3.37   3.64
We come back to our previous example, but now we assume that the error has to be estimated from the sample itself, according to (4.1), (4.3):

δN² = Σᵢ (xi − x)² / [N(N − 1)] .
To compute the confidence level for a given interval in units of the standard deviation, we now have to switch to Student’s distribution (see Sect. 3.6.11). The variate t, given by (x − µ)/δN, can be shown to follow the Student distribution hf(t) with f = N − 1 degrees of freedom. The confidence level for a given number of standard deviations will now be lower, because of the tails of Student’s distribution. Instead of quoting this number, we give in Table 4.3 the factor k by which we have to increase the interval length to get the same confidence level as in the Gaussian case. To clarify its meaning, let us look at two special cases: for 68.3 % confidence and N = 3 we require a 1.32 standard deviation interval, and for 99 % confidence and N = 10 a 1.26 × 2.58 = 3.25 standard deviation interval. As expected, the discrepancies are largest for small samples and high confidence levels. In the limit where N approaches infinity the factor k becomes equal to one.
Table 4.3. Values of the factor k for Student’s t-distribution as a function of the confidence level CL and sample size N.

N      68.3 %   99 %
3       1.32    3.85
10      1.06    1.26
20      1.03    1.11
∞       1.00    1.00
5 Monte Carlo Simulation
5.1 Introduction
The possibility to simulate stochastic processes and to perform numerical modeling on the computer simplifies the solution of many problems in science and engineering enormously. The deeper reason for this is characterized quite aptly by the German saying “Probieren geht über studieren” (trying beats studying). Monte Carlo methods replace intellectual effort by computational effort, which is delegated to the computer.
A few simple examples will demonstrate the advantages, but also the lim-
its of this method. The first two of them are purely mathematical integration
problems which could be solved also by classical numerical methods, but show
the conceptual simplicity of the statistical approach.

Example 58. Area of a circle of diameter d


We should keep in mind that without knowledge of the quantity π the
problem requires quite some mathematics, but even a child can solve it
experimentally: It may inscribe a circle into a square with edge length d and
sprinkle confetti with uniform density over it. The fraction of confetti pieces
confined inside the circle provides the area of the circle in units of the square
area. Digital computers have no problem in “sprinkling confetti” homogeneously
over given regions.
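The confetti experiment translates directly into a few lines of code; a minimal sketch (the function name and seed are our own):

```python
import random

def circle_area_mc(d, n=200_000, seed=1):
    """Estimate the area of a circle of diameter d by sprinkling n
    'confetti' points uniformly over the enclosing square of edge d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # point in the square, measured from the circle center
        x = rng.uniform(-d / 2, d / 2)
        y = rng.uniform(-d / 2, d / 2)
        if x * x + y * y <= (d / 2) ** 2:   # inside the circle?
            hits += 1
    return hits / n * d * d                 # hit fraction times square area

area = circle_area_mc(2.0)   # for d = 2 the result approaches pi
```

The statistical accuracy of such an estimate is discussed in Sect. 5.3.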

Example 59. Volume of the intersection of a cone and a torus


We solve the problem simply by scattering points homogeneously inside
a cuboid containing the intersection. The fraction of points inside both bodies
measures the ratio of the intersection volume to that of the cuboid.

In the following three examples we consider the influence of the measurement
process on the quantity to be determined.

Example 60. Correction of decay times


The decay time of unstable particles is measured with a digital clock which
is stopped at a certain maximal time. How can we determine the mean lifetime
of the particles? The measured decay times are distorted both by the limited
resolution and by the finite measurement time, and have to be corrected.
The correction can be determined by a simulation of the whole measurement
process. (We will come back to details below.)

Example 61. Efficiency of particle detection


Charged particles passing a scintillating fiber produce photons. A fraction
of the photons is reflected at the surface of the fiber, and, after many reflec-
tions, eventually produces a signal in a photomultiplier. The photon yield
per crossing particle has to be known as a function of several parameters
like the track length of the particle inside the fiber, its angle of incidence,
fiber length and curvature, surface parameters of the fiber, etc. A numerical
solution using classical integration methods would be extremely involved, and
an experimental calibration would require a large number of measurements.
In this and many similar situations, a Monte Carlo simulation is the only
sensible approach.

Example 62. Measurement of a cross section in a collider experiment


Particle experiments often consist of millions of detector elements which
have to measure the trajectories of sometimes thousands of particles and
the energies deposited in an enormous number of calorimeter cells. To mea-
sure a specific cross section, the corresponding events have to be selected,
acceptance losses have to be corrected, and unavoidable background has to
be estimated. This can only be achieved by sophisticated Monte Carlo sim-
ulations which require a huge amount of computing time. These simulations
consist of two distinct parts, namely the generation of the particle reaction
(event generation) which contains the interesting physics, and the simulation
of the detector response. The computing time needed for the event genera-
tion is negligible compared to that required for the detector simulation. As
a consequence one tries to avoid the repetition of the detector simulation
and takes, if possible, modifications of the physical process into account by
re-weighting events.

Example 63. Reaction rates of gas mixtures


A vessel contains different molecules with translational and rotational
motion according to the given temperature. The molecules scatter off the
walls and off each other, and transform into other molecules by chemical
processes depending on their energy. The composition of the gas after a
certain time is to be determined. The process can be simulated for a limited
number of particles; the particle trajectories and the reactions have to be
computed.

All examples finally lead to integration problems. In the first three
examples numerical integration, or even exact analytical methods, could also
have been used. For Examples 61 and 63, however, this is hardly possible,
since the number of variables is too large. Furthermore, the mathematical
formulation of the problems becomes rather involved.
Monte Carlo simulation does not require profound mathematical expertise.
Due to its simplicity and transparency, mistakes can be avoided. It is
true, though, that the results are subject to statistical fluctuations, which,
however, may be kept small enough in most cases thanks to the fast computers
available nowadays. For the simulation of chemical reactions (Example 63),
however, we reach the limits of computing power quite soon, even with
supercomputers. The treatment of macroscopic quantities (one mole, say)
is impossible. Most questions can be answered, however, by simulating small
samples.
Nowadays, even statistical problems are often solved through Monte Carlo
simulations. In some big experiments the error estimation for parameters
determined in a complex analysis is so involved that it is easier to simulate
the experiment, including the analysis, several times, and to derive the errors
quasi experimentally from the distribution of the resulting parameter values.
The relative statistical fluctuations can be computed for small samples and
then scaled down with the square root of the sample size.
In the following section we will treat the simulation of the basic univariate
distributions which are needed for the generation of more complex processes.
The generalization to several dimensions is not difficult. Then we continue
with a short summary on Monte Carlo integration methods.

5.2 Generation of Statistical Distributions


The simplest distribution is the uniform distribution which serves as the
basis for the generation of all other distributions. In the following we will
introduce some frequently used methods to generate random numbers with
desired distributions.

Some of the simpler methods have been introduced already in Chap. 3,
Sect. 3.6.4, 3.6.5: By a linear transformation we can generate uniform
distributions of any location and width. The sum of two uniformly distributed
random numbers follows a triangular distribution. The addition of only five
such numbers produces a quite good approximation of a Gaussian variate.
Since our computers work deterministically, they cannot produce numbers
that are really random, but they can be programmed to deliver for practi-
cally any application sufficiently unordered numbers, pseudo random numbers
which approximate random numbers to a very good accuracy.

5.2.1 Computer Generated Pseudo Random Numbers

The computer delivers pseudo random numbers in the interval between zero
and one. Because of the finite number of digits used to represent data in a
computer, these are discrete, rational numbers which, due to the usual floating
point accuracy, can take only 2^23 ≈ 8 · 10^6 different values, and follow a fixed,
reproducible sequence which, however, appears as stochastic to the user. More
refined algorithms can avoid, though, the repetition of the same sequence
after 2^23 calls. The Mersenne twister, one of the fastest reasonable random
number generators, invented in 1997 by M. Matsumoto and T. Nishimura, has
the enormous period of 2^19937 − 1, which never can be exhausted, and is shown
to be uniformly distributed in 623 dimensions. In all generators, the user
has the possibility to set a starting value, called the seed, and thus to repeat
exactly the same sequence, or to interrupt a simulation and to continue with
the sequence in order to generate statistically independent samples.
In the following we will speak of random numbers also when we mean
pseudo random numbers.
There are many algorithms for the generation of random numbers. The
principle is quite simple: One performs an arithmetic operation and uses only
the insignificant digits of the resulting number. How this works is shown by
the prescription

x_{i+1} = n^{−1} mod(λx_i ; n) ,

producing from the old random number x_i a new one between zero and
one. The parameters λ and n fulfil the condition λ ≫ n. With the values
x_1 = 0.7123, λ = 4158, n = 1 we get, for instance, the number

x_2 = mod(2961.7434; 1) = 0.7434 .

The apparent “randomness” is due to cutting off the significant digits by
the mod operation.
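Written out in code, the prescription reads as follows (a toy sketch with the numbers from the text, not a generator to be used in practice; the function name is ours):

```python
def next_random(x, lam=4158, n=1.0):
    """One step of x_{i+1} = mod(lam * x_i; n) / n: the multiplication
    shifts digits, the mod operation keeps only the insignificant ones."""
    return (lam * x) % n / n

x1 = 0.7123
x2 = next_random(x1)   # mod(2961.7434; 1) = 0.7434
```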
This random number generator is far from being perfect, as can be shown
experimentally by investigation of the correlations of consecutive random
numbers. The generators installed in the commonly used program libraries
are almost always sufficiently good. Nevertheless it is advisable to check their
Fig. 5.1. Correlation plot of consecutive random numbers (top) and frequency of
random numbers (bottom).

quality before starting important calculations. Possible problems with random
number generators are a shorter than expected repetition period, correlations
of successive values, and lack of uniformity. For simulations which require a
high accuracy, we should remember that with the standard generators only a
limited number of random numbers is available. Though intuitively attractive,
randomly mixing the results of different random number generators does not
improve the overall quality.

In Fig. 5.1 the values of two consecutive random numbers from a PC
routine are plotted against each other. No obvious correlations or clustering
can be detected. The histogram of a projection is well compatible with
a uniform distribution. A quantitative judgment of the quality of random
number generators can be derived with goodness-of-fit tests (see Chap. 10).
In principle, one could of course integrate random number generators into
the computers which indeed work stochastically and replace the deterministic
generators. As physical processes, the photoelectric effect or, even simpler,
thermal noise could be used. Each bit of a computer word could be set by
a dual oscillator which is stopped by the stochastic process. Unfortunately,
such hardware random number generators are presently not used, although
they could be produced quite economically, a large number on a single chip.
They would make obsolete some discussions, which come up from time to
time, on the reliability of software generators. On the other hand, the
reproducibility of a pseudo random number sequence is quite useful when we
want to compare different program versions, or to debug them.

5.2.2 Generation of Distributions by Variable Transformation

Continuous Variables

With the restrictions discussed above, we can generate with the computer
random numbers obeying the uniform distribution

u(r) = 1 for 0 ≤ r ≤ 1.

In the following we use the notations u for the uniform distribution and
r for a uniformly distributed variate in the interval [0, 1]. Other univariate
distributions f (x) are obtained by variable transformations r(x) with r a
monotone function of x (see Chap. 3):

f(x) dx = u(r) dr ,
∫_{−∞}^{x} f(x′) dx′ = ∫_{0}^{r(x)} u(r′) dr′ = r(x) ,
F(x) = r ,
x(r) = F^{−1}(r) .

The variable x is calculated from the inverse function F^{−1}, where F(x)
is the distribution function, which is set equal to r. For an analytic solution
the p.d.f. has to be analytically integrable and the distribution function must
have an inverse in analytic form.
The procedure is explained graphically in Fig. 5.2: A random number r
between zero and one is chosen on the ordinate. The distribution function (or
rather its inverse) then delivers the respective value of the random variable
x.
Fig. 5.2. The p.d.f. (top) follows from the distribution function as indicated by the
arrows.

In this way it is possible to generate the following distributions by simple
variable transformation from the uniform distribution:
– Linear distribution:

f(x) = 2x , 0 ≤ x ≤ 1 ,
x(r) = √r .

– Power-law distribution:

f(x) = (n + 1)x^n , 0 ≤ x ≤ 1, n > −1 ,
x(r) = r^{1/(n+1)} .

– Exponential distribution (Sect. 3.6.6):

f(x) = γ e^{−γx} ,
x(r) = −(1/γ) ln(1 − r) .

– Normal distribution (Sect. 3.6.5): Two independent normally distributed
random numbers x, y are obtained from two uniformly distributed random
numbers r1, r2, see (3.38), (3.39):

f(x, y) = (1/2π) exp[−(x² + y²)/2] ,
x(r1, r2) = √(−2 ln(1 − r1)) cos(2πr2) ,
y(r1, r2) = √(−2 ln(1 − r1)) sin(2πr2) .

– Breit–Wigner distribution (Sect. 3.6.9):

f(x) = [1/(πΓ/2)] (Γ/2)²/[x² + (Γ/2)²] ,
x(r) = (Γ/2) tan[π(r − 1/2)] .

– Log-Weibull (Fisher–Tippett) distribution (3.6.12):

f(x) = exp(−x − e^{−x}) ,
x(r) = − ln(− ln r) .

The expression 1 − r can be replaced by r in the formulas. More general
versions of these distributions are obtained by translation and/or scaling
operations. A triangular distribution can be constructed as a superposition
of two linear distributions. Correlated normally distributed random numbers
are obtained by scaling x and y differently and subsequently rotating the
coordinate frame. How to generate superpositions of distributions will be
explained in Sect. 5.2.5.

Uniform Angular, Circular and Spherical Distributions

Very often the generation of a uniform angular distribution is required. The
azimuthal angle ϕ is given by

ϕ = 2πr .

To obtain a spatially isotropic distribution, we also have to generate the polar
angle θ. As we have discussed in Sect. 3.5.8, its cosine is uniformly distributed
in the interval [−1, 1]. Therefore

cos θ = 2r1 − 1 ,
θ = arccos(2r1 − 1) ,
ϕ = 2πr2 .

A uniform distribution inside a circle of radius R0 is generated by

R = R0 √r1 ,
ϕ = 2πr2 .
Fig. 5.3. Generation of a Poisson distributed random number.

A uniform distribution inside a sphere of radius R0 is obtained similarly
from

R = R0 r1^{1/3} ,
θ = arccos(2r2 − 1) ,
ϕ = 2πr3 .

Discrete Distributions

The generation of random numbers drawn from discrete distributions is
performed in a completely analogous fashion. We demonstrate the method
with a simple example: We generate random numbers k following a Poisson
distribution (see Sect. 3.6.3) P_{4.6}(k) with expected value 4.6, which is
displayed in Fig. 5.3. By summation of the bins starting from the left
(integration), we obtain the distribution function S(k) = Σ_{i=0}^{k} P_{4.6}(i)
shown in the figure. To a uniformly distributed random number r we attach
the value k which corresponds to the minimal S(k) fulfilling S(k) > r. The
numbers k follow the desired distribution.
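The summation-and-comparison procedure can be sketched as follows (a minimal implementation; the function name is ours):

```python
import math
import random

def poisson_variate(mu, rng):
    """Return the minimal k whose distribution function S(k) exceeds r."""
    r = rng.random()
    p = math.exp(-mu)     # P(0)
    s = p                 # S(0)
    k = 0
    while s <= r:
        k += 1
        p *= mu / k       # P(k) = P(k-1) * mu / k
        s += p            # S(k) = S(k-1) + P(k)
    return k

rng = random.Random(5)
mean = sum(poisson_variate(4.6, rng) for _ in range(50_000)) / 50_000
```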

Histograms

A similar method is applied when an empirical distribution given in the form
of a histogram has to be simulated. The random number r determines the
bin j. The remainder r − S(j − 1) is used for the interpolation inside the bin
interval. Often the bins are small enough to justify a uniform distribution for
this interpolation. A linear approximation does not require much additional
effort.
For two-dimensional histograms h_ij we first produce a projection,

g_i = Σ_j h_ij ,

normalize it to one, and generate at first i, and then for given i in the same
way j. That means that we need for each value of i the distribution summed
over j.
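A sketch of the one-dimensional case with uniform interpolation inside the bin (the names and the use of a cumulative table are our own choices):

```python
import bisect
import random

def histogram_variate(edges, contents, rng):
    """Pick the bin j from the summed (normalized) contents, then use the
    remainder of r for a uniform interpolation inside the bin interval."""
    total = float(sum(contents))
    cum, s = [], 0.0
    for c in contents:
        s += c
        cum.append(s / total)               # distribution function S(j)
    r = rng.random()
    j = bisect.bisect_right(cum, r)         # first bin with S(j) > r
    lo = cum[j - 1] if j > 0 else 0.0
    frac = (r - lo) / (cum[j] - lo)         # remainder, rescaled to [0, 1)
    return edges[j] + frac * (edges[j + 1] - edges[j])

rng = random.Random(6)
xs = [histogram_variate([0.0, 1.0, 2.0], [1.0, 3.0], rng) for _ in range(20_000)]
```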

5.2.3 Simple Rejection Sampling

In the majority of cases it is not possible to find and invert the distribution
function analytically. As an example for a non-analytic approach, we consider
the generation of photons following the Planck black-body radiation law. The
appropriately scaled frequency x obeys the distribution

f(x) = c x³/(eˣ − 1)    (5.1)
with the normalization constant c. This function is shown in Fig. 5.4 for
c = 1, i.e. not normalized. We restrict ourselves to frequencies below a given
maximal frequency xmax .
A simple method to generate this distribution f (x) is to choose two uni-
formly distributed random numbers, where r1 is restricted to the interval
(xmin , xmax ) and r2 to (0, fmax ). This pair of numbers P (r1 , r2 ) corresponds
to a point inside the rectangle shown in the figure. We generate points and
those lying above the curve f (x) are rejected. The density of the remaining
r1 values follows the desired p.d.f. f (x).
A disadvantage of this method is that it requires several uniformly
distributed pairs to select one random number following the distribution. In our
example the ratio of successes to trials is about 1:10. For generating photons
up to arbitrarily large frequencies the method cannot be applied at all.
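A minimal sketch of this hit-or-miss method for the truncated Planck spectrum (the rectangle height 1.5 is our own safe bound above the maximum of the curve, which lies near x ≈ 2.82 at about 1.42 for c = 1):

```python
import math
import random

def planck_variate(x_max, rng):
    """Hit-or-miss sampling of f(x) = x^3/(e^x - 1) on (0, x_max)."""
    f_max = 1.5
    while True:
        r1 = rng.uniform(0.0, x_max)
        r2 = rng.uniform(0.0, f_max)
        # accept if r2 < f(r1); written multiplicatively to avoid
        # a division by zero at r1 = 0
        if r2 * math.expm1(r1) < r1 ** 3:
            return r1

rng = random.Random(7)
mean = sum(planck_variate(20.0, rng) for _ in range(20_000)) / 20_000
```

The acceptance rate of roughly one in five here illustrates the inefficiency discussed above.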

5.2.4 Importance Sampling

An improved selection method, called importance sampling, is the following:
We look for an appropriate function m(x), called the majorant, with the
properties
Fig. 5.4. Random selection method. The projection of the points located below
the curve follows the desired distribution.

Fig. 5.5. The function f = e^{−0.2x} sin²x with its majorant (dashed) used for
importance sampling.


Fig. 5.6. Planck spectrum with majorant.

– m(x) ≥ f(x) for all x,
– x = M^{−1}(r) can be computed, i.e. the indefinite integral M(x) =
∫_{−∞}^{x} m(x′) dx′ is invertible.
If such a majorant exists (see Fig. 5.5), we generate x according to m(x) and,
in a second step, stochastically drop for given x the fraction [m(x) − f(x)]/m(x)
of the events. This means that for each event (i.e. each generated x) a second,
this time uniform random number between zero and m(x) is generated, and if
it is larger than f(x), the event is abandoned. The advantage is that, for m(x)
not much different from f(x), in most cases the generation of one event requires
only two random numbers. Moreover, in this way it is also possible to generate
distributions which extend to infinity, as for instance the Planck distribution,
and many others.
We illustrate the method with a simple example (Fig. 5.5):

Example 64. Importance sampling


To generate

f(x) = c e^{−0.2x} sin²x    for 0 < x < ∞

with the majorant

m(x) = c e^{−0.2x} ,
we normalize m(x) and calculate its distribution function

r = ∫₀ˣ 0.2 e^{−0.2x′} dx′ = 1 − e^{−0.2x} .

Thus the variate transformation from the uniformly distributed random
number r1 to x is

x = −(1/0.2) ln(1 − r1) .
We draw a second uniform random number r2 , also between zero and one,
and test whether r2 m(x) exceeds the desired p.d.f. f (x). If this is the case,
the event is rejected:

for r2 m(x) < f(x), i.e. r2 < sin²x : keep x ,
for r2 m(x) > f(x), i.e. r2 > sin²x : drop x .

With this method a uniform distribution of random points below the majorant
curve is generated, while only those points are kept which lie below the
p.d.f. to be generated. On average about 4 random numbers per event are
needed in this example, since the test has a positive result in about half of
the cases.
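The two steps of Example 64, inversion of the majorant followed by the acceptance test, can be sketched as (the function name is ours):

```python
import math
import random

def damped_sine_variate(rng):
    """Draw x from the majorant m(x) = c*exp(-0.2x) by inversion, then
    keep x with probability f(x)/m(x) = sin(x)**2 (Example 64)."""
    while True:
        x = -math.log(1.0 - rng.random()) / 0.2   # distributed as the majorant
        if rng.random() < math.sin(x) ** 2:       # acceptance test
            return x

rng = random.Random(8)
sample = [damped_sine_variate(rng) for _ in range(20_000)]
mean = sum(sample) / len(sample)
```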

If an appropriate continuous, analytical majorant function cannot be
found, often a piecewise constant function (step function) is chosen.

Example 65. Generation of the Planck distribution


Here a piecewise defined majorant is useful. We consider again the Planck
distribution (5.1), and define the majorant in the following way: For small
values x < x1 we choose a constant majorant m1(x) = 6c. For larger values
x > x1 the second majorant m2(x) should be integrable with an invertible
integral function. Due to the x³-term, the Planck distribution decreases
somewhat more slowly than e^{−x}. Therefore we choose for m2 an exponential
factor with x substituted by x^{1−ε}. With the arbitrary choice ε = 0.1 we take

m2(x) = 200 c x^{−0.1} e^{−x^{0.9}} .

The factor x^{−0.1} does not influence the asymptotic behavior significantly but
permits the analytical integration:

M2(x) = ∫_{x1}^{x} m2(x′) dx′ = (200c/0.9) [e^{−x1^{0.9}} − e^{−x^{0.9}}] .

Fig. 5.7. Generated Planck spectrum.

This function can easily be solved for x, therefore it is possible to generate
m2 via a uniformly distributed random number. Omitting further details of
the calculation, we show in Fig. 5.6 the Planck distribution with the two
majorant pieces in logarithmic scale, and in Fig. 5.7 the generated spectrum.

5.2.5 Treatment of Additive Probability Densities

Quite often the p.d.f. to be considered is a sum of several terms. Let us restrict
ourselves to the simplest case with two terms,

f (x) = f1 (x) + f2 (x) ,

with

S1 = ∫_{−∞}^{∞} f1(x) dx ,
S2 = ∫_{−∞}^{∞} f2(x) dx ,
S1 + S2 = 1 .

Now we choose with probability S1 (S2) a random number distributed
according to f1 (f2). If the integral functions

F1(x) = ∫_{−∞}^{x} f1(x′) dx′ ,
F2(x) = ∫_{−∞}^{x} f2(x′) dx′

are invertible, we obtain with a uniformly distributed random number r the
variate x distributed according to f(x):

x = F1^{−1}(r) for r < S1 ,

respectively

x = F2^{−1}(r − S1) for r > S1 .

The generalization to more than two terms is trivial.

Example 66. Generation of an exponential distribution with constant
background

In order to generate the p.d.f.

f(x) = ε λe^{−λx}/(1 − e^{−λa}) + (1 − ε)/a    for 0 < x < a ,

we choose for r < ε

x = −(1/λ) ln[1 − r(1 − e^{−λa})/ε] ,

and for r > ε

x = a (r − ε)/(1 − ε) .

We need only one random number per event. The direct way, using the
inverse of the full distribution function F(x), would not have been successful,
since it cannot be given in analytic form.

The separation into additive terms is always recommended, even when
the individual terms cannot be handled by simple variate transformations as
in the example above.
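Example 66 in code (a sketch; the function name is ours). Note how a single random number r both selects the component and, suitably rescaled, provides the variate:

```python
import math
import random

def signal_plus_background(eps, lam, a, rng):
    """Additive p.d.f.: with probability eps draw from the truncated
    exponential, otherwise from the flat background (Example 66)."""
    r = rng.random()
    if r < eps:
        # r/eps is uniform in [0, 1); invert the truncated exponential
        return -math.log(1.0 - r * (1.0 - math.exp(-lam * a)) / eps) / lam
    return a * (r - eps) / (1.0 - eps)

rng = random.Random(10)
xs = [signal_plus_background(0.5, 1.0, 5.0, rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)
```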

5.2.6 Weighting Events

In Sect. 3.6.3 we have discussed some statistical properties of weighted events
and realized that the relative statistical error of a sum of N weighted events
can be much larger than the Poisson value 1/√N, especially when the
individual weights are very different. Thus we will usually refrain from weighting.
However, there are situations where it is not only convenient but essential
to work with weighted events. If a large sample of events has already been

generated and stored and the p.d.f. has to be changed afterwards, it is of
course much more economical to re-weight the stored events than to generate
new ones, because the simulation of high energy reactions in highly complex
detectors is quite expensive. Furthermore, for small changes the weights are
close to one and will not increase the errors much. As we will see later,
parameter inference based on a comparison of data with a Monte Carlo simulation
usually requires re-weighting anyway.
An event with weight w stands for w identical events with weight 1. When
interpreting the results of a simulation, i.e. calculating errors, one has to take
into account the distribution of a sum of weights, see the last example in Sect.
4.4.7. There we showed that

var(Σ w_i) = Σ w_i² .

Relevant is the relative error of a sum of weights:

δ(Σ w_i) / Σ w_i = √(Σ w_i²) / Σ w_i .

Strongly varying weights lead to large statistical fluctuations and should
therefore be avoided.
To simulate a distribution

f(x) with x_a < x < x_b

with weighted events is especially simple: We generate events x_i that are
uniformly distributed in the interval [x_a, x_b] and weight each event with
w_i = f(x_i).
In Example 64 we could have generated events following the majorant
distribution, weighting them with sin²x. The weights would then be
w_i = f(x_i)/m(x_i).
When we have generated events following a p.d.f. f(x|θ) depending on a
parameter θ and are interested in the distribution f′(x|θ′), we only have to
re-weight the events by f′/f.
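A small sketch of this weighting scheme for f(x) = 2x on [0, 1] (names are ours), including the relative error of the sum of weights quoted above:

```python
import math
import random

rng = random.Random(11)
xa, xb = 0.0, 1.0

# uniform events, each weighted with w_i = f(x_i); here f(x) = 2x
events = [rng.uniform(xa, xb) for _ in range(100_000)]
weights = [2.0 * x for x in events]

# the weighted sample estimates expectations under f, e.g. E[x] = 2/3
weighted_mean = sum(w * x for w, x in zip(weights, events)) / sum(weights)

# relative error of the sum of weights: sqrt(sum w_i^2) / sum w_i
rel_err = math.sqrt(sum(w * w for w in weights)) / sum(weights)
```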

5.2.7 Markov Chain Monte Carlo

Introduction

Sampling from high-dimensional distributions is difficult with the methods
that we have described above. Markov chain Monte Carlo (MCMC) is able
to generate samples of distributions with hundreds or thousands of dimensions.
It has become popular in thermodynamics, where statistical distributions
are simulated to compute macroscopic mean values and especially to study
phase transitions. It has also been applied to the approximation of functions
on discrete lattices. The method is used mainly in theoretical physics to
sample multi-dimensional distributions. It is also applied to optimize artificial
neural networks.
Characteristic of a Markov chain is that a random variable x is mod-
ified stochastically in discrete steps, its value at step i depending only on
its value at the previous step i − 1. Values of older steps are forgotten:
P (xi |xi−1 , xi−2 , ..., x1 ) = P (xi |xi−1 ). A typical example of a Markov chain
is random walk. Of interest are Markov chains that converge to an equilib-
rium distribution, like random walk in a fixed volume. MCMC generates a
Markov chain that has as its equilibrium distribution the desired distribution.
Continuing with the chain once the equilibrium has been reached produces
further variates of the distribution. To satisfy this requirement, the chain has
to satisfy certain conditions which are fulfilled for instance for the so-called
Metropolis algorithm, which we will use below. There exist also several other
sampling methods. Here we will only sketch this subject and refer the inter-
ested reader to the extensive literature which is nicely summarized in [32].

Thermodynamical Model, Metropolis Algorithm

In thermodynamics the molecules of an arbitrary initial state always approach
– if there is no external intervention – a stationary equilibrium distribution.
Transitions then obey the principle of detailed balance. In a simple model
with atoms or molecules in only two possible states, in the stationary case
the rate of transitions from state 1 to state 2 has to be equal to the reverse
rate from 2 to 1. For occupation numbers N1, N2 of the respective states and
transition rates per molecule and time W12, respectively W21, we have the
equation of balance

N1 W12 = N2 W21 .
For instance, for atoms with an excited state, where the occupation numbers
are very different, the equilibrium corresponds to a Boltzmann distribution,
N1 /N2 = e−∆E/kT , with ∆E being the excitation energy, k the Boltzmann
constant and T the absolute temperature. When the stationary state is not
yet reached, e.g. the number N1 is smaller than in the equilibrium, there
will be less transitions to state 2 and more to state 1 on average than in
equilibrium. The occupation number of state 1 will therefore increase until
equilibrium is reached. Since transitions are performed stochastically, even
in equilibrium the occupation numbers will fluctuate around their nominal
values.
If now, instead of discrete states, we consider systems that are character-
ized by a continuous variable x, the occupation numbers are to be replaced
by a density distribution f (x) where x is multidimensional. It represents the
total of all energies of all molecules. As above, for a stationary system we
have
f (x)W (x → x′ ) = f (x′ )W (x′ → x) .

As probability P(x → x′) for a transition from state x to a state x′ we choose
the Boltzmann acceptance function

P(x → x′) = W(x → x′) / [W(x → x′) + W(x′ → x)] = f(x′) / [f(x) + f(x′)] .

In an ideal gas and in many other systems a transition concerns only
one or two molecules, and we need only consider the effect of the change
of those. Then the evaluation of the transition probability is rather simple.
Now we simulate the stochastic changes of the states with the computer,
by choosing a molecule at random and changing its state with the probability
P(x → x′) into an also randomly chosen state x′. The choice of the
initial distribution for x is relevant for the speed of convergence but not for
the asymptotic result.
This mechanism was introduced by Metropolis et al. [33] in 1953, with a
different acceptance function. It is well suited for the calculation of mean
values and fluctuations of parameters of thermodynamical or quantum
statistical distributions. The process continues after the equilibrium is reached,
and the desired quantity is computed periodically. This simulates a
periodic measurement, for instance of the energy of a gas with a small number
of molecules in a heat bath. Measurements performed shortly one after the
other will be correlated. The same is true for sequentially probed quantities of
the MCMC sampling. For the calculation of statistical fluctuations the effect
of correlations has to be taken into account. It can be estimated by varying
the number of moves between subsequent measurements.
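For a one-dimensional toy distribution the scheme reduces to a few lines. A sketch with the Boltzmann acceptance function (the names, proposal width, and burn-in length are our own choices):

```python
import math
import random

def mcmc_sample(f, x0, step, n, rng):
    """Markov chain with the Boltzmann acceptance function
    P(x -> x') = f(x') / (f(x) + f(x')); after equilibration the chain
    elements are (correlated) variates of f."""
    x, chain = x0, []
    for _ in range(n):
        xp = x + rng.uniform(-step, step)          # symmetric proposal
        if rng.random() < f(xp) / (f(x) + f(xp)):  # accept the transition
            x = xp
        chain.append(x)
    return chain

rng = random.Random(12)
f = lambda x: math.exp(-0.5 * x * x)     # unnormalized standard normal
chain = mcmc_sample(f, 3.0, 2.0, 200_000, rng)[10_000:]   # drop the burn-in
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

Note that successive chain elements are correlated, so the naive error on the mean would underestimate the true fluctuation, as discussed above.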

Example 67. Mean distance of gas molecules


We consider an atomic gas enclosed in a cubic box located in the
gravitational field of the earth. The N atoms are treated as hard balls with given
radius R. Initially the atoms are arranged on a regular lattice. The p.d.f. is
zero for overlapping atoms, and proportional to e^{−αz}, where z is the vertical
coordinate of a given atom. The exponential factor corresponds to the
Boltzmann distribution for the potential energy in the gravitational field. An atom
is chosen randomly; its position may be (x, y, z). A second position inside the
box is randomly selected by means of three uniformly distributed random
numbers. If another atom is found within a distance of less than 2R, the
move is rejected and we repeat the selection of a possible new location. If the
position search with the coordinates (x′, y′, z′) is successful, we form the ratio
w = e^{−αz′}/(e^{−αz′} + e^{−αz}). The position is changed if the condition
r < w is fulfilled, with a further random number r. Periodically, the quantity being
studied, here the mean distance between atoms, is calculated. It is displayed
Fig. 5.8. Mean distance of spheres as a function of the number of iterations.

Fig. 5.9. Solid spheres in a box. The plot is a projection onto the x-z plane.

in Fig. 5.8 as a function of the iteration number. Its mean value converges
to an asymptotic value after a number of moves which is large compared
to the number of atoms. Fig. 5.9 shows the positions of atoms projected
onto the x-z plane, for 300 out of 1000 considered atoms, after 20000 moves.
Also the statistical fluctuations can be determined and, eventually, re-scaled
for a modified number of atoms according to the 1/√N factor.

5.3 Solution of Integrals


The generation of distributions always has the aim, finally, of evaluating
integrals. There the integration consists simply in counting the sample
elements (the events), for instance when we determine the acceptance or
efficiency of a detector.
The integration methods follow very closely those treated above for the
generation of distributions. To simplify the discussion, we will consider mainly
one-dimensional integrals. The generalization to higher dimensions, where the
advantages of the Monte Carlo method become even more pronounced than
for one-dimensional integration, does not impose difficulties.
Monte Carlo integration is especially simple and has the additional ad-
vantage that the accuracy of the integrals can be determined by the usual
methods of statistical error estimation. Error estimation is often quite in-
volved with the conventional numerical integration methods.

5.3.1 Simple Random Selection Method

Integrals with the integrand changing sign are subdivided into integrals over
intervals with only positive or only negative integrand. Hence it is sufficient
to consider only the case
I = ∫_{x_a}^{x_b} y(x) dx with y > 0 .    (5.2)

As in the analogous case when we generate a distribution, we produce


points which are distributed randomly and uniformly in a rectangle covering
the integrand function. An estimate \hat{I} for the area I is obtained from the
ratio of successes – these are the points falling below the function y(x) – to
the number of trials N_0, multiplied by the area I_0 of the rectangle:

\hat{I} = I_0 \frac{N}{N_0} .
To evaluate the uncertainty of this estimate, we refer to the binomial
distribution in which we approximate the probability of success ε by the
experimental value ε = N/N_0:

δN = \sqrt{N_0\, ε(1 − ε)} ,

\frac{δ\hat{I}}{\hat{I}} = \frac{δN}{N} = \sqrt{\frac{1 − ε}{N}} . (5.3)
As expected, the accuracy rises with the square root of the number
of successes and with ε. The smaller the deviation of the curve from the
rectangle, the smaller the uncertainty.

Fig. 5.10. Geometry of photon radiation in a scintillating fiber.

Fig. 5.11. Photon yield as a function of track position.

Example 68. Photon yield for a particle crossing a scintillating fiber


Ionizing particles are crossing a scintillating fiber with circular cross sec-
tion perpendicular to the fiber axis which is parallel to the z-axis (Fig. 5.10),
and generate photons with spatially isotropic angular distribution (see 5.2.2).
Photons hitting the fiber surface will be reflected if the angle with respect
to the surface normal is larger than β_0 = 60°. For smaller angles they will
be lost. We want to know how the number of captured photons depends
on the location where the particle intersects the fiber. The particle traverses
the fiber in y direction at a distance x from the fiber axis. To evaluate the
acceptance, we perform the following steps:
– Set the fiber radius R = 1, create a photon at x, y uniformly distributed
in the square 0 < x , y < 1,
– calculate r2 = x2 + y 2 , if r2 > 1 reject the event,
– choose the azimuth angle ϕ for the photon direction, with respect to an axis
parallel to the fiber direction in the point x, y, 0 < ϕ < 2π, ϕ uniformly
distributed,
– calculate the projected angle α (sin α = r sin ϕ),
– choose a polar angle ϑ for the photon direction, 0 < cos(ϑ) < 1, cos(ϑ)
uniformly distributed,
– calculate the angle β of the photon with respect to the (inner) surface
normal of the fiber, cos β = sin ϑ cos α,
– for β < β0 reject the event,
– store x for the successful trials in a histogram and normalize to the total
number of trials.
The efficiency is normalized such that particles crossing the fiber at x = 0
produce exactly 1 photon.
Fig. 5.11 shows the result of our simulation. For large values of x the track
length is small, but the photon capture efficiency is large, therefore the yield
increases with x almost until the fiber edge.
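The steps above can be sketched in Python/NumPy as follows (the sample size, histogram binning, and seed are arbitrary choices of this illustration, and the histogram is left unnormalized):

```python
import numpy as np

rng = np.random.default_rng(7)
N0 = 1_000_000                        # number of generated photons (trials)
cos_beta0 = np.cos(np.radians(60.0))  # beta > beta0 = 60 deg means capture

# photon creation points, uniform in the square 0 < x, y < 1 (R = 1)
x = rng.uniform(0.0, 1.0, N0)
y = rng.uniform(0.0, 1.0, N0)
r2 = x**2 + y**2
inside = r2 < 1.0                     # reject points outside the fiber
x, r = x[inside], np.sqrt(r2[inside])

# isotropic photon direction: uniform azimuth phi and uniform cos(theta)
phi = rng.uniform(0.0, 2.0 * np.pi, x.size)
sin_alpha = r * np.sin(phi)           # projected angle: sin(alpha) = r sin(phi)
cos_theta = rng.uniform(0.0, 1.0, x.size)
sin_theta = np.sqrt(1.0 - cos_theta**2)
cos_beta = sin_theta * np.sqrt(1.0 - sin_alpha**2)  # cos(beta) = sin(theta) cos(alpha)
captured = cos_beta < cos_beta0       # total reflection for beta > beta0

# photon yield versus track position x, normalized to the total number of trials
hist, edges = np.histogram(x[captured], bins=20, range=(0.0, 1.0))
yield_per_bin = hist / N0
```

With these choices the capture efficiency grows with x while the track length shrinks, so the yield rises almost up to the fiber edge, in line with Fig. 5.11.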

5.3.2 Improved Selection Method

a) Reducing the Reference Area

We can gain in accuracy by reducing the area in which the points are dis-
tributed, as above by introduction of a majorant function, Fig. 5.5. As seen
from (5.3), the relative error is proportional to the square root of the ineffi-
ciency.
We come back to the first example of this chapter:

Example 69. Determination of π


The area of a circle with radius 1 is π. For N0 uniformly distributed trials
in a circumscribed square of area 4 (Fig. 5.12) the number of successes N is
on average

⟨N⟩ = N_0 \frac{π}{4} .

An estimate \hat{π} for π is

Fig. 5.12. Estimation of the number π.

\hat{π} = \frac{4N}{N_0} ,

\frac{δ\hat{π}}{π} = \sqrt{\frac{1 − π/4}{N_0\, π/4}} ≈ \frac{0.52}{\sqrt{N_0}} .
Choosing a circumscribed octagon as the reference area, the error is reduced
by about a factor two. A further improvement is possible by inscribing another
polygon inside the circle and considering only the area between the polygons.
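A sketch of the basic square-reference-area version of this estimate (sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N0 = 100_000
x = rng.uniform(-1.0, 1.0, N0)
y = rng.uniform(-1.0, 1.0, N0)
N = int(np.count_nonzero(x**2 + y**2 < 1.0))  # successes inside the unit circle

pi_hat = 4.0 * N / N0
# binomial error of the estimate, with eps = N/N0; ~ 0.52 pi / sqrt(N0)
eps = N / N0
err = pi_hat * np.sqrt((1.0 - eps) / N)
```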

b) Importance Sampling
If there exists a majorant m(x) for the function y(x) to be integrated,
I = \int_{x_a}^{x_b} y(x)\, dx , (5.4)

with the property that the indefinite integral M(x)

M(x) = \int_{x_a}^{x} m(x′)\, dx′

can be inverted, we generate N0 x-values according to the distribution m(x).


For each xi a further random number yi in the interval 0 < y < m(xi ) is gen-
erated. Again, as for the simulation of distributions, points lying above y(xi )
are rejected. The number N of the remaining events provides the integral
\hat{I} = M(x_b) \frac{N}{N_0} .
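A sketch of importance sampling for an integrand and majorant chosen purely for this illustration: y(x) = e^{−x}(1 + 0.1 sin 5x) with majorant m(x) = 1.1 e^{−x}, whose integral M(x) is invertible analytically:

```python
import numpy as np

rng = np.random.default_rng(3)
xa, xb = 0.0, 2.0
y = lambda x: np.exp(-x) * (1.0 + 0.1 * np.sin(5.0 * x))  # integrand, y > 0
m = lambda x: 1.1 * np.exp(-x)                            # majorant, m(x) >= y(x)
M = lambda x: 1.1 * (1.0 - np.exp(-x))                    # integral of m from xa

N0 = 200_000
# generate x according to m(x) by inverting M:
u = rng.uniform(0.0, M(xb), N0)
x = -np.log(1.0 - u / 1.1)
# for each x a further random number 0 < y_i < m(x); keep points below y(x):
yi = rng.uniform(0.0, m(x))
N = int(np.count_nonzero(yi < y(x)))
I_hat = M(xb) * N / N0                                    # exact value ~ 0.8864
```

Since the acceptance ε = I/M(x_b) is close to one here, the relative error is much smaller than with a rectangular reference area.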

5.3.3 Weighting Method

a) Simple Weighting

We generate N random numbers xi in the interval xa < x < xb and average


over the function values:
\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y(x_i) .

An estimate for the integral (5.4) is given by

\hat{I} = (x_b − x_a)\, \bar{y} .

This method corresponds to the usual numerical integration, with the


peculiarity that the supporting points on the abscissa are not chosen regu-
larly but are distributed at random. This alone cannot be an advantage, and
indeed the Monte Carlo integration in one and two dimensions for a given
number of supporting points is less efficient than conventional methods. It
is, however, superior to other methods for multi-dimensional integrations.
Already in three dimensions it competes favorably in many cases.
To estimate the accuracy, we apply the usual statistical error estimation.
We consider the numbers y_i = y(x_i) as N stochastic measurements of y. The
expected mean squared error of \bar{y} is then given by (4.3):

(δ\bar{y})^2 = \frac{1}{N(N−1)} \sum (y_i − \bar{y})^2 .

The relative errors of \bar{y} and \hat{I} are the same,

\left( \frac{δ\hat{I}}{\hat{I}} \right)^2 = \left( \frac{δ\bar{y}}{\bar{y}} \right)^2 = \frac{\sum (y_i − \bar{y})^2}{N(N−1)\, \bar{y}^2} . (5.5)

The numerator is an estimate of the variance of the y distribution. The


accuracy is the better, the smaller the fluctuations of the function around its
mean value are.
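A sketch of the simple weighting method for the integral of sin x over [0, π] (the integrand, sample size, and seed are illustrative choices; the exact value is 2):

```python
import numpy as np

rng = np.random.default_rng(5)
xa, xb = 0.0, np.pi
N = 100_000
x = rng.uniform(xa, xb, N)
yi = np.sin(x)                       # function values at random support points

ybar = yi.mean()
I_hat = (xb - xa) * ybar             # estimate of the integral
# statistical error following (5.5): sample standard deviation / sqrt(N)
dI = (xb - xa) * yi.std(ddof=1) / np.sqrt(N)
```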

Fig. 5.13. Monte Carlo integration of the difference between the function to be
integrated and an integrable function.

b) Subtraction method

The accuracy can be improved through a reduction of the fluctuations of the


integrand.
If we find a function ỹ(x) which is integrable analytically and does not
differ too much from the original integrand y(x), we cast the integral into
the form

\int_{x_a}^{x_b} y(x)\, dx = \int_{x_a}^{x_b} ỹ(x)\, dx + \int_{x_a}^{x_b} (y(x) − ỹ(x))\, dx .

We now have to evaluate by Monte Carlo only the second term with
relatively small fluctuations (Fig. 5.13).
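A sketch of the subtraction method for the integral of e^{−x²} over [0, 1] with the analytically integrable approximation ỹ(x) = 1 − x² (both functions are choices of this illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
y = lambda x: np.exp(-x**2)
yt = lambda x: 1.0 - x**2            # analytically integrable approximation
I_yt = 2.0 / 3.0                     # exact integral of yt over [0, 1]

N = 100_000
x = rng.uniform(0.0, 1.0, N)
d = y(x) - yt(x)                     # only the difference is done by Monte Carlo
I_hat = I_yt + d.mean()              # (xb - xa) = 1; exact value ~ 0.74682
dI = d.std(ddof=1) / np.sqrt(N)
```

The fluctuations of the difference d are roughly half those of y itself here, so the error shrinks by about a factor two for the same sample size.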

5.3.4 Reduction to Expected Values

In many cases it makes sense to factorize the integrand y(x) = f (x)y1 (x)
into a factor f (x) corresponding to a p.d.f. normalized inside the integration
interval which is easy to generate, and a second factor y1 (x). To be effective,
the method requires that f is close to y. Our integral has now the form of an
expected value:
\int_{x_a}^{x_b} y(x)\, dx = \int_{x_a}^{x_b} f(x)\, y_1(x)\, dx = ⟨y_1⟩ .

We generate values xi distributed according to f (x) and obtain from these


an estimate for the integral I:
\hat{I} = \frac{\sum_i y_1(x_i)}{N} ,

\left( \frac{δ\hat{I}}{\hat{I}} \right)^2 = \frac{\sum [y_1(x_i) − \bar{y}_1]^2}{N(N−1)\, \bar{y}_1^2} .

The estimate is again the better, the less the y1 -values are fluctuating, i.e.
the more similar the functions y and f are. The error estimate is analogous
to (5.5).
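A sketch with f(x) = e^{−x} as the normalized p.d.f. and y₁(x) = cos x, so that the integral of e^{−x} cos x over [0, ∞) becomes the expected value ⟨cos x⟩ (the factorization is an illustrative choice; the exact value is 1/2):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
x = rng.exponential(1.0, N)          # sample from the normalized p.d.f. f
y1 = np.cos(x)

I_hat = y1.mean()                    # <y1>, exact value 1/2
dI = y1.std(ddof=1) / np.sqrt(N)     # error estimate analogous to (5.5)
```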

5.3.5 Stratified Sampling

In stratified sampling the domain of integration is partitioned into sub-


domains. Over each of these we integrate separately. The advantage is that
the distribution in each sub-domain is more uniform and thus the fluctuations
of the random variables are smaller and the statistical error is reduced. This
method is somewhat antithetical to the basic idea of the simple Monte Carlo
method, since it produces a more uniform (equidistant) distribution of the
supporting points and requires some effort to combine the errors from the
different contributions. Thus we recommend it only if the integrand shows
very strong variations.
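A sketch of stratified sampling for the strongly varying integrand e^{5x} on [0, 1] (the number of strata, points per stratum, and integrand are illustrative choices; the per-stratum errors are combined in quadrature):

```python
import numpy as np

rng = np.random.default_rng(9)
y = lambda x: np.exp(5.0 * x)        # exact integral: (e^5 - 1)/5
K, n = 20, 5_000                     # K sub-domains, n points in each
edges = np.linspace(0.0, 1.0, K + 1)

I_hat, var = 0.0, 0.0
for a, b in zip(edges[:-1], edges[1:]):
    yi = y(rng.uniform(a, b, n))
    I_hat += (b - a) * yi.mean()     # simple weighting within the stratum
    var += ((b - a) * yi.std(ddof=1) / np.sqrt(n))**2  # errors add in quadrature
dI = np.sqrt(var)
```

Within each narrow stratum the integrand is almost constant, which is what suppresses the fluctuations compared to sampling the whole interval at once.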

5.4 General Remarks


Often we need to solve integrals over different domains but always with the
same integrand. In these cases the Monte Carlo approach is particularly ad-
vantageous. We store all single simulated values (usually called “events”) and
are able to select events afterwards according to the chosen domain, and
obtain the integral with relatively small computing expense by summation.
Similarly a change of event weights is possible without repeating the gener-
ation of the events.
Let us illustrate this feature with a mechanical example: If, for instance,
we want to obtain the tensor of inertia for a complex mass distribution like
a car, we distribute points stochastically within the body and store their
coordinates together with the respective mass densities. With these data it
is easy to calculate by summations the mass, the center of mass and the
moments of inertia with respect to arbitrary axes. If desired, parts of the
body can be eliminated simply by rejecting the corresponding points in the
sums and different materials can be considered by changing the density.
In thermodynamic systems we are often interested in several mean values,
like the mean free path length, mean kinetic or potential energy, velocities

etc. Once a statistical ensemble has been generated, all these quantities are
easily obtained, while with the usual integration methods, one has to repeat
each time the full integration.
Even more obvious are these advantages in acceptance calculations. Big
experiments in particle physics and other areas have to be simulated as com-
pletely and realistically as allowed by the available computing power. The
acceptance of a given system of particle detectors for a certain class of events
is found in two steps: first, a sample of interesting events is generated and
the particles produced are traced through the detecting apparatus. The hits
in various detectors together with other relevant information (momenta, par-
ticle identities) are stored in data banks. In a second step the desired accep-
tance for a class of events is found by simulating the selection procedure and
counting the fraction of events which are retained. Arbitrary changes in the
selection procedure are readily implemented without the need to simulate
large event samples more than once.
Finally, we want to stress again how easy it is to estimate the errors of
Monte Carlo integration. It is almost identical1 to the error estimation for
the experimental data. We usually will generate a number of Monte Carlo
reactions which is large enough to neglect their statistical error compared to
the experimental error. In other words, the number of Monte Carlo events
should be large compared to the number of experimental events. Usually a
factor of ten is sufficient.

1
The Monte Carlo errors are usually described by the binomial distribution,
those of the experimental data by the Poisson distribution.
6 Estimation I

6.1 Introduction
We now leave the probability calculus and its simple applications and turn
to the field of statistics. More precisely, we are concerned with inferential
statistics.
While the probability calculus, starting from distributions, predicts prop-
erties of random samples, in statistics, given a data sample, we look for a
theoretical description of the population from which it has been derived by
some random process. In the simplest case, the sample consists of indepen-
dent observations, randomly drawn from a parent population. If not specified
differently, we assume that the population is a collection of elements which
all follow the same discrete or continuous distribution. Frequently, the sample
consists of data collected in a measurement sequence.
Usually we either want to check whether our sample is compatible with
a specific theory, to decide between several theories, or to infer unknown
parameters of a given theory.
To introduce the problem, we discuss three simple examples:
1. At a table we find eight playing cards: two kings, three queens, one ten,
one eight and one seven. Do the cards belong to a set of Canasta cards or to
a set of Skat cards?
2. A college is attended by 523 boys and 490 girls. Are these numbers
compatible with the assumption that on average the tendency to attend a
college is equal for boys and girls?
3. The lifetimes of five unstable particles of a certain species have been
measured. How large is the mean life of that particle and how large is the
corresponding uncertainty?
In our first example we would favor the Skat game because none of the
cards two to six is present which, however, are part of Canasta card sets.
Assuming that the cards have been taken at random from a complete card
set, we can summarize the available information in the following way: The
probability to observe no card with value below seven in eight cards of a
Canasta game is L_C = (5/13)^8 ≈ 4.8 × 10^{−4}, whereas it is L_S = 1 for a

Skat game. We call these quantities likelihoods1. The likelihood indicates how
well a given hypothesis is supported by the observation, but the likelihood
alone is not sufficient for a decision in favor of one or another hypothesis.
Additional considerations may play an important role. When the cards are
located in a Swiss youth hostel we would consider the hypothesis Skat more
sceptically than when the cards are found in a pub at Hamburg. We therefore
would weight our hypotheses with prior probabilities (in short: priors) which
quantify this additional piece of information. Prior probabilities are often
hard to estimate, often they are completely unknown. As a consequence,
results depending on priors are usually model dependent.
We will avoid introducing prior probabilities and stay with likelihoods,
but sometimes this is not possible. Then the results have to be interpreted
conditional on the validity of the applied prior probabilities.
In our second example we are confronted with only one hypothesis and
no well specified alternative. The validity of the alternative, e.g. a deviation
from equality in the distribution of the sexes, is hardly measurable since an
arbitrarily small deviation from equality is present in any case. There is no
other possibility than to quantify the deviation of the data from the prediction
in some proper way. We will treat this problem in the section on goodness-of-fit
tests.
In our third example the number of hypotheses is infinite. To each value
of the unknown parameter, i.e. to each different mean life, corresponds a dif-
ferent prediction. The difficulties are very similar to those in case one. If we
want to quote probabilities, we are forced to introduce a priori probabilities
– here for the parameter under investigation. Again, in most cases no reliable
prior information will be available. We will quote the parameter best sup-
ported by the data and define an error interval based on the likelihood of the
parameter values.
The following table summarizes the cases which we have discussed.
case 1: given: N alternative hypotheses H_i
        wanted: relative probabilities for the validity of H_i
case 2: given: one hypothesis H_0
        wanted: a quantitative statement about the validity of H_0
case 3: given: one valid hypothesis H(λ) where λ is a single parameter
        or a set of unknown continuous parameters
        wanted: “best” value of λ and its uncertainty
In practice we often will compare observations with a theory which con-
tains free parameters. In this case we have to infer parameters and to test the
compatibility of the hypothesis with the data, i.e. case 2 and case 3 apply.

1
The term likelihood was first used by the British biologist and statistician Sir
Ronald Aylmer Fisher (1890-1962). We postpone the exact definition of likelihood.

Fig. 6.1. Quantitative Venn diagram. The areas indicate the probabilities for
certain combinations of hypotheses H_i and discrete events of type k_j. The
marginal probabilities are given in brackets.

6.2 Inference with Given Prior


If prior information is available, it is possible by means of Bayes’ theorem to
derive from a given sample probabilities for hypotheses or parameters.

6.2.1 Discrete Hypotheses


In Chap. 1 we had shown that conditional probabilities fulfil the following
relation (Bayes’ theorem):
P {A ∩ B} = P {A|B}P {B} = P {B|A}P {A} . (6.1)
The probability P {A∩B} that both the properties A and B apply is equal
to the probability P {B}, to find property B multiplied by the conditional
probability P {A|B} to find A, when B is realized. This is the first part of
the relation above. The second part is analogous.
We apply this relation to a discrete random variable k and hypotheses
Hi . The index denoting the hypothesis is interpreted as a random variable2 .
We assume that the probability P{k|H_i} to observe k is given for a finite
number of mutually exclusive hypotheses H_i. Then we have

P{k|H_i} P{H_i} = P{H_i|k} P{k} ,

P{H_i|k} = \frac{P{k|H_i} P{H_i}}{P{k}} . (6.2)
2
In this case this is a categorical variable which denotes a certain class.

Here P {Hi } is the assumed probability for the validity of hypothesis i before
the observation happens, it is the a priori probability.
In Fig. 6.1 we illustrate relation (6.2) in form of a so called Venn diagram
where in the present example 3 out of the 5 hypotheses have the same prior.
Each hypothesis bin is divided into 3 regions with areas proportional to the
probabilities to observe k = k1 , k = k2 and k = k3 , respectively. For example
when the observation is k = k2 (shadowed in gray) then the gray areas provide
the relative probabilities of the validity of the corresponding hypotheses. In
our example hypothesis H3 is the most probable, H1 the most unlikely.
The computation of P {k} which is the marginal distribution of k, i.e. the
probability of a certain observation, summed over all hypotheses, yields:
P{k} = \sum_i P{k|H_i} P{H_i} .

As required, P {Hi |k} is normalized in such a way that the probability that
any of the hypotheses is fulfilled is equal to one. We get

P{H_i|k} = \frac{P{k|H_i} P{H_i}}{\sum_j P{k|H_j} P{H_j}} . (6.3)

In words: The probability for the validity of hypothesis Hi after the mea-
surement k is equal to the prior P {Hi } of Hi multiplied with the probability
to observe k if Hi applies and divided by a normalization factor. When we
are only interested in the relative probabilities of two different hypotheses Hi
and Hj for an observation k, we have:

\frac{P{H_i|k}}{P{H_j|k}} = \frac{P{k|H_i} P{H_i}}{P{k|H_j} P{H_j}} .

Example 70. Bayes’ theorem: Pion or kaon decay?


A muon has been detected. Does it originate from a pion or from a
kaon decay? The decay probabilities inside the detector are known and are
P {µ|π} = 0.02 and P {µ|K} = 0.10, respectively. The ratio of pions and
kaons in the beam is P {π} : P {K} = 3 : 1. With these numbers we obtain:

\frac{P{K|µ}}{P{π|µ}} = \frac{0.10 × 1}{0.02 × 3} = \frac{5}{3} ,

\frac{P{K|µ}}{P{K|µ} + P{π|µ}} = \frac{0.10 × 1}{0.02 × 3 + 0.10 × 1} = 0.625 .

The kaon hypothesis is more likely than the pion hypothesis. Its probability
is 0.625.
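The arithmetic of this example can be reproduced in a few lines of Python (the dictionary layout is merely one possible bookkeeping for the numbers given above):

```python
# conditional decay probabilities P(mu | hypothesis) and prior weights pi : K = 3 : 1
p_mu = {"K": 0.10, "pi": 0.02}
prior = {"K": 1.0, "pi": 3.0}

# Bayes' theorem (6.3): posterior proportional to likelihood times prior
post = {h: p_mu[h] * prior[h] for h in p_mu}
norm = sum(post.values())
post = {h: v / norm for h, v in post.items()}
```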

6.2.2 Continuous Parameters

Now we extend our considerations to the case where the hypothesis index
is replaced by a continuous parameter θ, i.e. we have an infinite number of
hypotheses. Instead of probabilities we obtain probability densities. Bayes’
theorem now reads

f (x, θ) = fx (x|θ)πθ (θ) = fθ (θ|x)πx (x) (6.4)

which is just the relation 3.36 of Sect. 3.5, where fx , fθ are conditional dis-
tribution densities and πx (x), πθ (θ) are the marginal distributions of f (x, θ).
The joint probability density f(x, θ) of the two random variables x, θ is
equal to the conditional probability density fx (x|θ) of x, where θ is fixed,
multiplied by the probability density πθ (θ), the marginal distribution of θ.
For an observation x we obtain analogously to our previous relations

f_θ(θ|x) = \frac{f_x(x|θ) π_θ(θ)}{π_x(x)} ,

and

f_θ(θ|x) = \frac{f_x(x|θ) π_θ(θ)}{\int_{−∞}^{∞} f_x(x|θ) π_θ(θ)\, dθ} . (6.5)

In words: For a measurement with the result x, we compute the probability


density for the parameter θ from the value of the probability density fx (x|θ)
for x, multiplied by the probability density (prior) πθ (θ) of θ before the mea-
surement, divided by a normalization integral. Again, the quantity fx (x|θ)
determines how strongly various parameter values θ are supported by the
given observation x and is called – in this context – likelihood of θ.
From the probability density fθ (θ|x) of the interesting parameter we can
derive a best estimate θ̂ and an error interval. An obvious choice is the ex-
pectation value and the standard deviation. Thus the estimate is a function
of the observations3, θ̂ = θ̂(x).

Example 71. Time of a decay with exponential prior


A detector with finite resolution registers at time t the decay of a K
meson. The time resolution corresponds to a Gaussian with variance σ 2 . We
are interested in the time θ at which the decay occurred. The mean lifetime
τ of kaons is known. The probability density for the parameter θ before the
measurement, the prior, is π(θ) = e−θ/τ /τ , θ ≥ 0. The probability density
for t with θ fixed is the Gaussian. Applying (6.5) we obtain the probability
density f (θ) = f (θ|t) of the parameter θ,

3
A function of the observations is called a statistic, to be distinguished from the
discipline statistics.

Fig. 6.2. Fit with known prior: Probability density for the true decay time. The
maximum of the distribution is located at θ = 1, the observed time is 1.5.

f(θ) = \frac{e^{−(t−θ)^2/(2σ^2)}\, e^{−θ/τ}}{\int_0^∞ e^{−(t−θ)^2/(2σ^2)}\, e^{−θ/τ}\, dθ} ,

which is displayed in Fig. 6.2. As a consequence of the exponential prior it is


visibly shifted to the left with respect to the observation.
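The posterior of this example is easy to evaluate numerically (the values t = 1.5, σ = 0.7, τ = 1 are illustrative choices, not given in the text; they roughly reproduce the shift seen in Fig. 6.2):

```python
import numpy as np

# illustrative values: observed time, Gaussian resolution, mean lifetime
t, sigma, tau = 1.5, 0.7, 1.0
theta = np.linspace(0.0, 6.0, 60001)
dtheta = theta[1] - theta[0]

# unnormalized posterior: Gaussian likelihood times exponential prior
w = np.exp(-(t - theta)**2 / (2.0 * sigma**2)) * np.exp(-theta / tau)
f = w / (w.sum() * dtheta)            # normalized posterior density of theta

theta_mode = theta[np.argmax(f)]      # analytically: t - sigma**2/tau
```

The mode lies at t − σ²/τ, i.e. shifted to the left of the observation, exactly the effect visible in Fig. 6.2.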

If the value of the probability density fx (x|θ) in (6.5) varies much more
rapidly with θ than the prior – this is the case when the observation restricts
the parameter drastically – then to a good approximation the prior can be
regarded as constant in the interesting region. We then have

f_θ(θ|x) ≈ \frac{f_x(x|θ)}{\int_{−∞}^{∞} f_x(x|θ)\, dθ} .

In this approximation the probability density fθ of the parameter corresponds


to the normalized likelihood function.
In practice, fθ often follows to a good approximation a normal distribu-
tion. The value θ̂ where fθ is maximal then is the estimate of θ and the values
where f_θ has decreased by the factor e^{1/2} define a standard deviation error
interval and thus fix the uncertainty of the estimate θ̂.

6.3 Likelihood and the Likelihood Ratio


Usually we do not know the prior or our ideas about it are rather vague.

Example 72. Likelihood ratio: V + A or V − A reaction?


An experiment is performed to measure the energy E of muons produced
in the decay of the tau lepton, τ − → µ− ντ ν̄µ , to determine whether the decay
corresponds to a V −A or a V +A matrix element. We know the corresponding
normalized decay distributions f− (E) and f+ (E). For a single observation
E ′ we can compute the likelihood ratio RL = L+ /L− of the likelihoods
L+ = f+ (E ′ ), L− = f− (E ′ ). But how should we choose the prior densities
for the two alternative hypotheses? In this example it would not make sense
to quantify the prejudices for the two hypotheses and to compute the resulting
probabilities. One would rather publish only the ratio RL .

In the absence of prior information the likelihood ratio is the only element
which we have to judge the relative virtues of alternative hypotheses. Accord-
ing to a lemma of J. Neyman and E. Pearson there is no other more powerful
quantity to discriminate between competing hypotheses (see Chap. 10).
Definition: The likelihood Li of a hypothesis Hi , to which corresponds
a probability density fi (x) ≡ f (x|Hi ) or a discrete probability distribution
Wi (k) ≡ P {k|Hi }, when the observation x, k, respectively, has been realized,
is equal to
Li ≡ L(i|x) = fi (x)
and
Li ≡ L(i|k) = Wi (k) ,
respectively. Here the index i which denotes the hypothesis is treated as an
independent random variable. When we replace it by a continuous parameter
θ and consider a parameter dependent p.d.f. f (x|θ) or a discrete probability
distribution W (k|θ) and observations x, k, the corresponding likelihoods are
L(θ) ≡ L(θ|x) = f (x|θ) ,
L(θ) ≡ L(θ|k) = W (k|θ) .
While the likelihood is related to the validity of a hypothesis given an
observation, the p.d.f. is related to the probability to observe a variate for a
given hypothesis. In our notation, the quantity which is considered as fixed
is placed behind the bar while the random variable is located left of it. When
both quantities are fixed the function values of both the likelihood and the
p.d.f. are equal. To attribute a likelihood makes sense only if alternative
hypotheses, either discrete or differing by parameters, can apply. If the like-
lihood depends on one or several continuous parameters, we have a likelihood
function.

Remark: The likelihood function is not a probability density of the
parameter. There is no differential element like dθ involved and it does not obey
the laws of probability. To distinguish it from probability, R. A. Fisher had
invented the name likelihood. Multiplied by a prior and normalized, a prob-
ability density of the parameter is obtained. Statisticians call this inverse
probability or probability of causes to emphasize that compared to the direct
probability where the parameter is known and the chances of an event are
described, we are in the inverse position where we have observed the event
and want to associate probabilities to the various causes that could have led
to the observation.
As already stated above, the likelihood of a certain hypothesis is large if
the observation is probable for this hypothesis. It measures how strongly a
hypothesis is supported by the data. If an observation is very unlikely the
validity of the hypothesis is doubtful – however this classification applies only
when there is an alternative hypothesis with larger likelihood. Relevant are
only ratios of likelihoods.
Usually experiments provide a sample of N independent observations xi
which all follow independently the same p.d.f. f (x|θ) which depends on the
unknown parameter θ (i.i.d. variates). The combined p.d.f. f̃ then is equal to
the product of the N simple p.d.f.s,

f̃(x_1, . . . , x_N |θ) = \prod_{i=1}^{N} f(x_i|θ) .

For discrete variates we have the corresponding relation,


W̃(k_1, . . . , k_N |θ) = \prod_{i=1}^{N} W(k_i|θ) .

For all values of θ the function f̃ evaluated for the sample x_1, . . . , x_N is
equal to the likelihood L̃,

L̃(θ) ≡ L̃(θ|x_1, x_2, . . . , x_N)
     = f̃(x_1, x_2, . . . , x_N |θ)
     = \prod_{i=1}^{N} f(x_i|θ)
     = \prod_{i=1}^{N} L(θ|x_i) .

The same relation also holds for discrete variates:



Fig. 6.3. Likelihood of three observations and two hypotheses with different
p.d.f.s (left: L = 0.00055, right: L = 0.016).

L̃(θ) ≡ L̃(θ|k_1, . . . , k_N)
     = \prod_{i=1}^{N} W(k_i|θ)
     = \prod_{i=1}^{N} L(θ|k_i) .

When we have a sample of independent observations, it is convenient to


consider the logarithm of the likelihood. It is called log-likelihood. It is
equal to

ln L̃(θ) = \sum_{i=1}^{N} ln f(x_i|θ)

for continuous variates. A corresponding relation holds for discrete variates.


Fig. 6.3 illustrates the notion of likelihood in a concrete case of two hy-
potheses which predict different p.d.f.s of the variate x. For a sample of three
observations we present the values of the likelihood, i.e. the products of the
three corresponding p.d.f. values. The broad p.d.f. in the right hand picture
matches better. Its likelihood is about thirty times higher than that of the
left hand hypothesis.
So far we have considered the likelihood of samples of i.i.d. variates. Also
the case where two independent experiments A, B measure the same quantity

x is of considerable interest. The combined likelihood L is just the product


of the individual likelihoods LA (θ|x1 ) = fA (x1 |θ) and LB (θ|x2 ) = fB (x2 |θ)
as is obvious from the definition:

f (x1 , x2 |θ) = fA (x1 |θ)fB (x2 |θ) ,


L(θ) = f (x1 , x2 |θ) ,

hence

L = LA LB ,
ln L = ln LA + ln LB .

We state: The likelihood of several independent observations or experi-


ments is equal to the product of the individual likelihoods. Correspondingly,
the log-likelihoods add up.
L = \prod L_i , (6.6)

ln L = \sum ln L_i . (6.7)

Example 73. Likelihood ratio of Poisson frequencies


We observe 5 decays and want to compute the relative probabilities for
three hypotheses. Prediction H1 assumes a Poisson distribution with expec-
tation value 2, H2 and H3 have expectation values 9 and 20, respectively.
The likelihoods following from the Poisson distribution Pλ (k) are:
L1 = P2 (5) ≈ 0.036 ,
L2 = P9 (5) ≈ 0.061 ,
L3 = P20 (5) ≈ 0.00005 .
We can form different likelihood ratios. If we are interested for example in
hypothesis 2, then the quotient L2 /(L1 + L2 + L3 ) ≈ 0.63 is relevant. If
we observe in a second measurement in the same time interval 8 decays, we
obtain:
L_1 = P_2(5) P_2(8) = P_4(13) ≈ 6.4 · 10^{−3} ,
L_2 = P_9(5) P_9(8) = P_{18}(13) ≈ 5.1 · 10^{−2} ,
L_3 = P_{20}(5) P_{20}(8) = P_{40}(13) ≈ 6.1 · 10^{−7} .
The likelihood ratio L_2/(L_1 + L_2 + L_3) ≈ 0.89 is now much more significant.
(For H_1 and H_3 the corresponding values are 0.11 and 10^{−5}.) The fact that
all values Li are small is unimportant because one of the three hypotheses
has to be valid.
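The first measurement of this example can be verified directly (a sketch using a hand-rolled Poisson probability):

```python
from math import exp, factorial

def poisson(lam, k):
    # Poisson probability P_lam(k) = exp(-lam) lam^k / k!
    return exp(-lam) * lam**k / factorial(k)

k = 5
L = [poisson(lam, k) for lam in (2.0, 9.0, 20.0)]   # L1, L2, L3
rel = [Li / sum(L) for Li in L]                     # relative likelihoods
```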

We now turn to hypotheses with probability densities.



Fig. 6.4. Likelihood ratio for two normal distributions: a) L_1/L_2 = 2,
b) L_1/L_2 = 1.2, c) L_1/L_2 = 30, d) L_1/L_2 = 1/430. Top: 1 observation,
bottom: 5 observations.

Example 74. Likelihood ratio of normal distributions


We compare samples drawn from one out of two alternative normal dis-
tributions with different expectation values and variances (Fig. 6.4).
f_1 = \frac{1}{\sqrt{2π}·1}\, e^{−(x−1)^2/2} ,

f_2 = \frac{1}{\sqrt{2π}·2}\, e^{−(x−2)^2/8} .
a) Initially the sample consists of a single observation at x = 0, for both
cases one standard deviation off the mean values of the two distributions
(Fig. 6.4a):
\frac{L_1}{L_2} = \frac{e^{−1/2}}{\frac{1}{2} e^{−4/8}} = 2 .
b) Now we place the observation at x = 2, the maximum of the second
distribution (Fig. 6.4b):
\frac{L_1}{L_2} = \frac{e^{−1/2}}{\frac{1}{2} e^{0}} ≈ 1.2 .

c) We now consider five observations which have been taken from distribution
f_1 (Fig. 6.4c) and from distribution f_2 (Fig. 6.4d), respectively. We obtain
the likelihood ratios
L_1/L_2 = 30 (Fig. 6.4c) ,
L_1/L_2 = 1/430 (Fig. 6.4d) .
It turns out that narrow distributions are easier to exclude than broad ones.
On the other hand we get in case b) a preference for distribution 1 even
though the observation is located right at the center of distribution 2.
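The ratios in cases a) and b) can be checked directly (a minimal sketch of the two densities given above):

```python
from math import exp, pi, sqrt

def f1(x):                       # normal density, mean 1, sigma 1
    return exp(-(x - 1.0)**2 / 2.0) / sqrt(2.0 * pi)

def f2(x):                       # normal density, mean 2, sigma 2
    return exp(-(x - 2.0)**2 / 8.0) / (2.0 * sqrt(2.0 * pi))

r_a = f1(0.0) / f2(0.0)          # case a): observation at x = 0, ratio 2
r_b = f1(2.0) / f2(2.0)          # case b): observation at x = 2, ratio ~1.21
```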

Example 75. Likelihood ratio for two decay time distributions


A sample of N decay times ti has been recorded in the time interval
t_{min} < t < t_{max}. The times are expected to follow either an exponential
distribution f_1(t) ∼ e^{−t/τ} (hypothesis 1) or a uniform distribution f_2(t) =
const. (hypothesis 2). How likely are H_1, H_2? First we have to normalize the
p.d.f.s:

f_1(t) = \frac{1}{τ} \frac{e^{−t/τ}}{e^{−t_{min}/τ} − e^{−t_{max}/τ}} ,

f_2(t) = \frac{1}{t_{max} − t_{min}} .
The likelihoods are equal to the product of the p.d.f.s at the observations:
L_1 = \left[ τ \left( e^{−t_{min}/τ} − e^{−t_{max}/τ} \right) \right]^{−N} \exp\left( −\sum_{i=1}^{N} t_i/τ \right) ,

L_2 = 1/(t_{max} − t_{min})^N .

With \bar{t} = \sum t_i/N the mean value of the times, we obtain the likelihood
ratio

\frac{L_1}{L_2} = \left( \frac{t_{max} − t_{min}}{τ \left( e^{−t_{min}/τ} − e^{−t_{max}/τ} \right)} \right)^N e^{−N\bar{t}/τ} .
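A sketch evaluating this ratio for a simulated sample (the interval limits, τ, sample size, and seed are arbitrary choices; the sample is drawn from the truncated exponential, so hypothesis 1 should come out strongly favored):

```python
import numpy as np

rng = np.random.default_rng(4)
tmin, tmax, tau, N = 0.0, 5.0, 1.0, 50

# simulate a sample from the truncated exponential (hypothesis 1) by inversion
u = rng.uniform(0.0, 1.0, N)
t = -tau * np.log(np.exp(-tmin / tau)
                  - u * (np.exp(-tmin / tau) - np.exp(-tmax / tau)))

tbar = t.mean()
norm = tau * (np.exp(-tmin / tau) - np.exp(-tmax / tau))
ratio = ((tmax - tmin) / norm)**N * np.exp(-N * tbar / tau)  # L1/L2
```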

6.4 The Maximum Likelihood Method for Parameter


Inference
In the previous examples we have compared fixed hypotheses. We now allow
for an infinite number of hypotheses by varying the value of a parameter. As in
the discrete case, in the absence of a given prior probability, the only available

piece of information which allows us to judge different parameter values is the likelihood function. A formal justification for this assertion is given by the
likelihood principle (LP) which states that the likelihood function exhausts all
the information contained in the observations related to the parameters. The
LP will be discussed in the following chapter. It is then plausible to choose
the parameter such that the likelihood is as large as possible. This is the
maximum likelihood estimate (MLE). When we are interested in a parameter
range, we will choose the interval such that the likelihood outside is always
less than inside.
Note that the MLE, as well as likelihood intervals, are invariant under
transformations of the parameter. The likelihood is not a p.d.f. but a function
of the parameter and therefore L(θ) = L′ (θ′ ) for θ′ (θ). Thus a likelihood
analysis estimating, for example, the mass of a particle will give the same
result as that inferring the mass squared, and estimates of the decay rate γ
and mean life τ = 1/γ will be consistent.
Here and in the following sections we assume that the likelihood function
is continuous and differentiable and has exactly one maximum inside the valid
range of the parameter. This condition is fulfilled in the majority of all cases.
Besides the maximum likelihood (ML) method, invented by Fisher, there
exist a number of other methods of parameter estimation. Especially popular is the method of least squares (LS), which was first proposed by Gauß4 .
It is used to adjust parameters of curves which are fixed by some measured
points and will be discussed below. It can be traced back to the ML method
if the measurement errors are normally distributed and independent of the
parameter.
In most cases we are not able to compute analytically the location of the
maximum of the likelihood. To simplify the numerical computation, linear
approximations (e.g. linear regression) are still used quite frequently. These
methods find the solution by matrix operations and iteration. They are dispensable nowadays. With common PCs and maximum-searching programs, the maximum of a function of some hundred parameters can be determined without problems, given enough observations to fix it.

6.4.1 The Recipe for a Single Parameter

We proceed according to the following recipe. Given a sample of N i.i.d. observations {x1 , . . . , xN } from a p.d.f. f (x|θ) with unknown parameter θ,
we form the likelihood or its logarithm, respectively, in the following way:

4 Carl Friedrich Gauß (1777-1855), German mathematician, astronomer and physicist.
Fig. 6.5. Log-likelihood function and uncertainty limits for 1, 2, 3 standard deviations (log-likelihood drops of 0.5, 2.0 and 4.5).
L(\theta) = \prod_{i=1}^{N} f(x_i|\theta) , \qquad (6.8)

\ln L(\theta) = \sum_{i=1}^{N} \ln f(x_i|\theta) . \qquad (6.9)
In most cases the likelihood function resembles a bell-shaped Gaussian and ln L(θ) approximately a downwards open parabola (see Fig. 6.5). This approximation is especially good for large samples.
To find the maximum of L (ln L and L have their maxima at the same location), we differentiate the log-likelihood5 with respect to the parameter and set the derivative equal to zero. The value θ̂ that satisfies the equation which we

5 The advantage of using the log-likelihood instead of the likelihood itself is that we have to differentiate a sum rather than a product, which is much more convenient.
obtain in this way is the MLE of θ:


\frac{d \ln L}{d\theta}\bigg|_{\hat\theta} = 0 . \qquad (6.10)
Since only the derivative of the likelihood function is of importance, fac-
tors in the likelihood or summands in the log-likelihood which are indepen-
dent of θ can be omitted.
The estimate θ̂ is a function of the sample values xi , and consequently a
statistic.
The point estimate has to be accompanied by an error interval. Point esti-
mate and error interval form an ensemble and cannot be discussed separately.
Choosing as point estimate the value that maximizes the likelihood function, it is natural to include inside the error limits parameter values with higher likelihood than all parameter values that are excluded. This prescription leads to
so-called likelihood ratio error intervals.
We will discuss the error interval estimation in a separate chapter, but fix
the error limit already now by definition:
Definition: The limits of a standard error interval are located at the parameter values where the likelihood function has decreased from its maximum by a factor e^{1/2}. For two and three standard deviations the factors are e^{2} and e^{4.5}. This choice corresponds to differences in the log-likelihood of 0.5 for one, 2 for two and 4.5 for three standard error intervals, as illustrated in Fig. 6.5. We assume that these limits exist inside the parameter range.
The reason for this definition is the following: As already mentioned,
asymptotically, when the sample size N tends to infinity, under very gen-
eral conditions the likelihood function approaches a Gaussian and becomes
proportional to the probability density of the parameter (for a proof, see
Appendix 13.3). Then our error limit corresponds exactly to the standard
deviation of the p.d.f., i.e. the square root of the variance of the Gaussian.
We keep the definition also for non-normally shaped likelihood functions and small sample sizes. In that case we usually obtain asymmetric error limits.
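When no closed form exists, these limits are found numerically by locating the two points where ln L has dropped by 1/2 from its maximum. A minimal sketch, assuming an exponential decay-time likelihood with invented data (the fact that the maximum lies at the sample mean is derived in the examples below):

```python
import math
import random

random.seed(1)
n = 200
t_bar = sum(random.expovariate(1.0 / 2.0) for _ in range(n)) / n  # sample mean

def log_l(tau):
    # exponential-sample log-likelihood, additive constants dropped:
    # ln L(tau) = -n ln(tau) - n t_bar / tau
    return -n * math.log(tau) - n * t_bar / tau

tau_hat = t_bar                 # position of the maximum for this likelihood
target = log_l(tau_hat) - 0.5   # the one-standard-deviation level

def cross(lo, hi):
    # bisection for the parameter value where log_l equals target
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if (log_l(mid) - target) * (log_l(lo) - target) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

tau_minus = cross(0.5 * tau_hat, tau_hat)  # lower limit of the interval
tau_plus = cross(tau_hat, 2.0 * tau_hat)   # upper limit of the interval
```

The resulting interval is slightly asymmetric, as expected for a finite sample.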

6.4.2 Examples

Example 76. Maximum likelihood estimate (MLE) of the mean life of an un-
stable particle
Given are N decay times ti of an unstable particle with unknown mean life τ . For an exponential decay time distribution

f (t|γ) = γe−γt

with γ = 1/τ the likelihood is



L = \prod_{i=1}^{N} \gamma\, e^{-\gamma t_i} = \gamma^{N} e^{-\gamma \sum_{i=1}^{N} t_i} ,

\ln L = N \ln\gamma - \gamma \sum_{i=1}^{N} t_i .

The estimate γ̂ satisfies

\frac{d \ln L}{d\gamma}\bigg|_{\hat\gamma} = 0 ,

0 = \frac{N}{\hat\gamma} - \sum_{i=1}^{N} t_i ,

\hat\tau = \hat\gamma^{-1} = \sum_{i=1}^{N} t_i / N = \bar{t} .

Thus the estimate is just equal to the mean value \bar{t} of the observed decay times. In practice, the full range up to infinitely large decay times is not always observable. If the measurement is restricted to an interval 0 < t < t_{max}, the p.d.f. has to be renormalized:
f(t|\gamma) = \frac{\gamma\, e^{-\gamma t}}{1 - e^{-\gamma t_{max}}} ,
\ln L = N \left[\ln\gamma - \ln(1 - e^{-\gamma t_{max}})\right] - \gamma \sum_{i=1}^{N} t_i .
The maximum is now located at the estimate γ̂, which fulfils the relation

0 = N \left(\frac{1}{\hat\gamma} - \frac{t_{max}\, e^{-\hat\gamma t_{max}}}{1 - e^{-\hat\gamma t_{max}}}\right) - \sum_{i=1}^{N} t_i ,

\hat\tau = \bar{t} + \frac{t_{max}\, e^{-t_{max}/\hat\tau}}{1 - e^{-t_{max}/\hat\tau}} ,
which has to be evaluated numerically. If the time interval is not too short, t_{max} > τ, an iterative computation suggests itself: the correction term on the right-hand side is neglected in zeroth order; at the subsequent iterations we insert in this term the value τ̂ of the previous iteration. We notice that the estimate again depends solely on the mean value \bar{t} of the observed decay times. The quantity \bar{t} is a sufficient statistic. We will explain this term in more detail later. The case with an additional lower bound t_{min} of t can easily be reduced to the previous one by transforming the variable to t′ = t − t_{min}.
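The iteration can be sketched as follows (simulated truncated sample, invented numbers; the correction term is re-evaluated with the previous τ̂ until it is stable):

```python
import math
import random

random.seed(2)
tau_true, t_max = 1.0, 2.5
# simulated decay times restricted to 0 < t < t_max (later decays are lost)
times = []
while len(times) < 500:
    t = random.expovariate(1.0 / tau_true)
    if t < t_max:
        times.append(t)
t_bar = sum(times) / len(times)

# fixed-point iteration of
#   tau = t_bar + t_max * exp(-t_max/tau) / (1 - exp(-t_max/tau))
tau = t_bar  # zeroth order: correction term neglected
for _ in range(50):
    w = math.exp(-t_max / tau)
    tau = t_bar + t_max * w / (1.0 - w)
```

Each pass adds the correction evaluated at the previous estimate; for t_max > τ the iteration converges geometrically.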
Fig. 6.6. Log-likelihood functions for the parameters of a normal distribution: a) for the mean µ with known width (solid curve) and unknown width (dashed curve), b) for the width σ with known mean (solid curve) and unknown mean (dashed curve). The curves represent expected likelihoods for 10 events.

In the following examples we discuss the likelihood functions and the MLEs of the parameters of the normal distribution with mean µ and standard deviation σ, evaluated for 10 events drawn from N(x|1, 2), in four different situations:

Example 77. MLE of the mean value of a normal distribution with known
width
Given are N observations xi drawn from a normal distribution of known width s. The mean value µ is to be estimated:
 
f(x|\mu) = \frac{1}{\sqrt{2\pi}\, s} \exp\left[-\frac{(x-\mu)^2}{2s^2}\right] ,
L(\mu) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\, s} \exp\left[-\frac{(x_i-\mu)^2}{2s^2}\right] ,

\ln L(\mu) = -\sum_{i=1}^{N} \frac{(x_i-\mu)^2}{2s^2} + const \qquad (6.11)

= -N\, \frac{\overline{x^2} - 2\bar{x}\mu + \mu^2}{2s^2} + const .
The log-likelihood function is a parabola. It is shown in Fig. 6.6a for s = 2.
Differentiating it with respect to the unknown parameter µ and setting the result equal to zero, we get

N\, \frac{\bar{x} - \hat\mu}{s^2} = 0 ,

\hat\mu = \bar{x} .

The likelihood estimate µ̂ for the mean of the normal distribution is equal to the arithmetic mean \bar{x} of the sample. It is independent of s, but s determines the width of the likelihood function and the standard error \delta_\mu = s/\sqrt{N}.

Example 78. MLE of the width of a normal distribution with given mean
Given are now N observations xi which follow a normal distribution with known mean x0 ; the unknown width σ is to be estimated.
L(\sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left[-\frac{(x_i-x_0)^2}{2\sigma^2}\right] ,

\ln L(\sigma) = -N\left(\frac{1}{2}\ln 2\pi + \ln\sigma\right) - \sum_{i=1}^{N} \frac{(x_i-x_0)^2}{2\sigma^2}

= -N\left[\ln\sigma + \frac{\overline{(x-x_0)^2}}{2\sigma^2}\right] + const .

The log-likelihood function for our numerical values is presented in Fig. 6.6b.
Differentiating it with respect to the parameter of interest and setting the result equal to zero, we find

0 = \frac{1}{\hat\sigma} - \frac{\overline{(x-x_0)^2}}{\hat\sigma^3} ,

\hat\sigma = \sqrt{\overline{(x-x_0)^2}} .
Again we obtain a well known result. The mean square deviation of the sample
values provides an estimate for the width of the normal distribution. This
relation is the usual distribution-free estimate of the standard deviation if the
expected value is known. The error bounds from the drop of the log-likelihood
function by 1/2 become asymmetric. Solving the respective transcendental
equation, neglecting higher orders in 1/N , one finds
\delta\sigma_\pm = \frac{\hat\sigma\, \sqrt{1/(2N)}}{1 \mp \sqrt{1/(2N)}} .
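The quality of this approximation is easy to check by solving the transcendental equation numerically; the sketch below (illustrative N = 100, σ̂ set to 1) compares the exact 1/2-drop points with the formula:

```python
import math

N = 100
sigma_hat = 1.0  # assume the sample gives sqrt(mean squared deviation) = 1

def drop(sigma):
    # decrease of ln L below its maximum, with
    # ln L(sigma) = -N (ln sigma + sigma_hat^2 / (2 sigma^2)) + const
    return N * (math.log(sigma / sigma_hat)
                + 0.5 * (sigma_hat ** 2 / sigma ** 2 - 1.0))

def cross(lo, hi):
    # bisection for the point where the drop equals 1/2
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if (drop(mid) - 0.5) * (drop(lo) - 0.5) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

exact_minus = sigma_hat - cross(0.5 * sigma_hat, sigma_hat)
exact_plus = cross(sigma_hat, 2.0 * sigma_hat) - sigma_hat

a = math.sqrt(1.0 / (2 * N))          # the approximate formula above
approx_minus = sigma_hat * a / (1.0 + a)
approx_plus = sigma_hat * a / (1.0 - a)
```

For N = 100 the approximation reproduces the exact asymmetric limits to within a few per cent.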

Example 79. MLE of the mean of a normal distribution with unknown width
The solution of this problem can be taken from Sect. 3.6.11 where we found that t = (\bar{x} - \mu)/s with s^2 = \sum (x_i - \bar{x})^2/[N(N-1)] = v^2/(N-1) follows the Student's distribution with N − 1 degrees of freedom,

h(t|N-1) = \frac{\Gamma(N/2)}{\Gamma((N-1)/2)\, \sqrt{\pi(N-1)}} \left(1 + \frac{t^2}{N-1}\right)^{-N/2} .

The corresponding log-likelihood is

\ln L(\mu) = -\frac{N}{2} \ln\left[1 + \frac{(\bar{x}-\mu)^2}{v^2}\right]

with the maximum at µ̂ = \bar{x}. It corresponds to the dashed curve in Fig. 6.6a. From the drop of ln L by 1/2 we now get for the squared standard error the expression

\delta_\mu^2 = (e^{1/N} - 1)\, v^2 .
For large N, after expanding the exponential function, this becomes very similar to the expression for the standard error in the case with known width, but with s replaced by v.

Example 80. MLE of the width of a normal distribution with unknown mean
Obviously, shifting a sample changes the mean value but not the true or the empirical variance v^2 = \overline{(x-\bar{x})^2}. Thus the empirical variance v^2 can only depend on σ and not on µ. Without going into the details of the calculation, we state that N v^2/\sigma^2 follows a \chi^2 distribution with N − 1 degrees of freedom,
f(v^2|\sigma) = \frac{N}{2\sigma^2\, \Gamma[(N-1)/2]} \left(\frac{N v^2}{2\sigma^2}\right)^{(N-3)/2} \exp\left(-\frac{N v^2}{2\sigma^2}\right) ,

with the log-likelihood

\ln L(\sigma) = -(N-1)\ln\sigma - \frac{N v^2}{2\sigma^2} ,
corresponding to the dashed curve in Fig. 6.6b. (The numerical value of the
true value of µ was chosen such that the maxima of the two curves are located
at the same value in order to simplify the comparison.) The MLE is

\hat\sigma^2 = \frac{N}{N-1}\, v^2 ,
in agreement with our result (3.15). For the asymmetric error limits we find in analogy to example 78

\delta\sigma_\pm = \frac{\hat\sigma\, \sqrt{1/(2(N-1))}}{1 \mp \sqrt{1/(2(N-1))}} .

6.4.3 Likelihood Inference for Several Parameters

We can extend our concept easily to several parameters λk , which we combine to a vector λ = {λ1 , . . . , λK }:

L(\lambda) = \prod_{i=1}^{N} f(x_i|\lambda) , \qquad (6.12)

\ln L(\lambda) = \sum_{i=1}^{N} \ln f(x_i|\lambda) . \qquad (6.13)

To find the maximum of the likelihood function, we set the partial derivatives equal to zero. Those values λ̂k which satisfy the system of equations obtained this way are the MLEs of the parameters λk :

\frac{\partial \ln L}{\partial \lambda_k}\bigg|_{\hat\lambda_1,\ldots,\hat\lambda_K} = 0 . \qquad (6.14)

The error interval is now replaced by an error volume with its surface defined again by the drop of ln L by 1/2:

\ln L(\hat{\lambda}) - \ln L(\lambda) = 1/2 .
Fig. 6.7. MLE of the parameters of a normal distribution and lines of constant log-likelihood. The numbers indicate the values of the log-likelihood relative to the maximum.

We have to assume that this defines a closed surface in the parameter space,
in two dimensions just a closed contour, as shown in the next example.

Example 81. MLEs of the mean value and the width of a normal distribution
Given are N observations xi which follow a normal distribution where now both the width σ and the mean value µ are unknown. As above, the log-likelihood is

\ln L(\mu, \sigma) = -N\left[\ln\sigma + \frac{\overline{(x-\mu)^2}}{2\sigma^2}\right] + const .

Differentiation with respect to the parameters leads to the results:

\hat\mu = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x} ,

\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat\mu)^2 = \overline{(x-\bar{x})^2} = v^2 .

The MLE and log-likelihood contours for a sample of 10 events with empirical mean values \bar{x} = 1 and \overline{x^2} = 5 are depicted in Fig. 6.7. The innermost line
encloses the standard error area. If one of the parameters, for instance µ = µ1
is given, the log-likelihood of the other parameter, here σ, is obtained by the
cross section of the likelihood function at µ = µ1 .

Similarly any other relation between µ and σ defines a curve in Fig. 6.7
along which a one-dimensional likelihood function is defined.
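These relations are easy to verify numerically. A minimal sketch (hypothetical sample of 10 events, not the one underlying Fig. 6.7) compares the closed-form MLEs with a brute-force grid scan of the log-likelihood:

```python
import math
import random

random.seed(3)
sample = [random.gauss(1.0, 2.0) for _ in range(10)]
N = len(sample)

x_bar = sum(sample) / N
v2 = sum((x - x_bar) ** 2 for x in sample) / N
mu_hat, sigma_hat = x_bar, math.sqrt(v2)   # the closed-form MLEs from above

def log_l(mu, sigma):
    # ln L(mu, sigma) = -N [ln sigma + mean((x-mu)^2)/(2 sigma^2)] + const
    m2 = sum((x - mu) ** 2 for x in sample) / N
    return -N * (math.log(sigma) + m2 / (2.0 * sigma ** 2))

# brute-force grid scan around the estimates; the grid contains them exactly
best_val, best_mu, best_sigma = -math.inf, None, None
for i in range(-100, 101):
    for j in range(-100, 101):
        mu = mu_hat + 0.02 * i
        sigma = sigma_hat * (1.0 + 0.005 * j)
        val = log_l(mu, sigma)
        if val > best_val:
            best_val, best_mu, best_sigma = val, mu, sigma
```

The scan singles out exactly the analytic estimates, since (µ̂, σ̂) is the unique maximum of ln L.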
Remark: Frequently, we are interested only in one of the parameters, and
we want to eliminate the others, the nuisance parameters. How to achieve this will be discussed in Sect. 7.8. Generally, it is not allowed to use the MLE of
a single parameter in the multi-parameter case separately, ignoring the other
parameters. While in the previous example σ̂ is the correct estimate of σ if µ̂
applies, the solution for the estimate and its likelihood function independent
of µ has been given in example 80 and that of µ independent of σ in example
79.

Example 82. Determination of the axis of a given distribution of directions


(This example has been borrowed from the book of L. Lyons [7].) Given are the directions of N tracks by the unit vectors ei . The distribution of the direction cosines cos αi with respect to an axis u corresponds to

f(\cos\alpha) = \frac{3}{8}(1 + \cos^2\alpha) .
We search for the direction of the axis. The axis u = (u1 , u2 , u3 ) is parameterized by its components, the direction cosines uk . (There are only two independent parameters u1 , u2 because u3 = \sqrt{1 - u_1^2 - u_2^2} depends on u1 and u2 .) The log-likelihood function is

\ln L = \sum_{i=1}^{N} \ln(1 + \cos^2\alpha_i) ,

where the values cos αi = u · ei depend on the parameters of interest, the direction cosines. Maximizing ln L yields the parameters u1 , u2 . We omit the calculation.
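A minimal numerical sketch of this estimation (simulated tracks around an assumed true axis along z; a coarse grid search stands in for the omitted maximization):

```python
import math
import random

random.seed(4)

def sample_cos_alpha():
    # rejection sampling from f(c) = (3/8)(1 + c^2) on [-1, 1]
    while True:
        c = random.uniform(-1.0, 1.0)
        if random.uniform(0.0, 0.75) < 0.375 * (1.0 + c * c):
            return c

# simulated tracks scattered around an assumed true axis (0, 0, 1)
tracks = []
for _ in range(1000):
    c = sample_cos_alpha()
    s, phi = math.sqrt(1.0 - c * c), random.uniform(0.0, 2.0 * math.pi)
    tracks.append((s * math.cos(phi), s * math.sin(phi), c))

def log_l(u):
    # ln L = sum_i ln(1 + (u . e_i)^2); the factor 3/8 is an additive constant
    return sum(math.log(1.0 + (u[0] * e[0] + u[1] * e[1] + u[2] * e[2]) ** 2)
               for e in tracks)

# coarse grid search over the upper hemisphere (an axis carries no sign)
best_u, best_val = (0.0, 0.0, 1.0), -math.inf
for i in range(31):              # polar angle 0..90 degrees in 3-degree steps
    th = math.radians(3.0 * i)
    for j in range(36):          # azimuth 0..350 degrees in 10-degree steps
        ph = math.radians(10.0 * j)
        u = (math.sin(th) * math.cos(ph), math.sin(th) * math.sin(ph),
             math.cos(th))
        val = log_l(u)
        if val > best_val:
            best_u, best_val = u, val
```

Despite the rather flat distribution, a sample of 1000 tracks localizes the axis to a few degrees.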

Example 83. Likelihood analysis for a signal with a linear background


We want to fit a normal distribution with a linear background to a given
sample. (The procedure for a background described by a higher order poly-
nomial is analogous.) The p.d.f. is
f (x) = θ1 x + θ2 + θ3 N(x|µ, σ) .

Here N is the normal distribution with unknown mean µ and standard de-
viation σ. The other parameters are not independent because f has to be
normalized in the given interval xmin < x < xmax . Thus we can eliminate one
parameter. Assuming that the normal distribution is negligible outside the
interval, the norm D is:
D = \frac{1}{2}\theta_1 (x_{max}^2 - x_{min}^2) + \theta_2 (x_{max} - x_{min}) + \theta_3 .
The normalized p.d.f. is therefore

f(x) = \frac{\theta_1' x + \theta_2' + N(x|\mu, \sigma)}{\frac{1}{2}\theta_1'(x_{max}^2 - x_{min}^2) + \theta_2'(x_{max} - x_{min}) + 1} ,

with the new parameters θ1′ = θ1 /θ3 and θ2′ = θ2 /θ3 . The likelihood function is obtained in the usual way by inserting the observations of the sample into \ln L = \sum \ln f(x_i|\theta_1', \theta_2', \mu, \sigma). Maximizing this expression, we obtain the four parameters and from those the fraction of signal events S = θ3 /D:

S = \left[1 + \frac{1}{2}\theta_1'(x_{max}^2 - x_{min}^2) + \theta_2'(x_{max} - x_{min})\right]^{-1} .
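As a hedged numerical illustration, the sketch below performs a reduced version of this fit: µ and σ are held fixed at their generating values and only the two background parameters θ1′, θ2′ are scanned on a grid; the sample and all numbers are invented:

```python
import math
import random

random.seed(5)
x_min, x_max, mu, sg = 0.0, 4.0, 2.0, 0.3

# invented sample: about half signal N(2, 0.3), half flat background
data = []
for _ in range(1000):
    if random.random() < 0.5:
        x = random.gauss(mu, sg)
        if x_min < x < x_max:
            data.append(x)
    else:
        data.append(random.uniform(x_min, x_max))

# precompute the Gaussian density at every observation
gpdf = [math.exp(-0.5 * ((x - mu) / sg) ** 2) / (math.sqrt(2.0 * math.pi) * sg)
        for x in data]

def log_l(t1, t2):
    # normalized p.d.f. (t1*x + t2 + N(x|mu,sg)) / D with
    # D = t1*(x_max^2 - x_min^2)/2 + t2*(x_max - x_min) + 1
    D = 0.5 * t1 * (x_max ** 2 - x_min ** 2) + t2 * (x_max - x_min) + 1.0
    return sum(math.log((t1 * x + t2 + g) / D) for x, g in zip(data, gpdf))

# grid scan over the two background parameters
best = max(((log_l(0.005 * i, 0.01 * j), 0.005 * i, 0.01 * j)
            for i in range(21) for j in range(51)), key=lambda t: t[0])
_, t1_hat, t2_hat = best
D_hat = 0.5 * t1_hat * (x_max ** 2 - x_min ** 2) + t2_hat * (x_max - x_min) + 1.0
S_hat = 1.0 / D_hat  # estimated fraction of signal events
```

The estimated signal fraction S comes out close to the generated value of one half.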

6.4.4 Complicated Likelihood Functions


If the likelihood function deviates considerably from a normal distribution in
the vicinity of its maximum, e.g. contains several significant maxima, then
it is not appropriate to parametrize it by the maximum and error limits.
In this situation the full function or a likelihood map should be presented.
Such a map is shown in Fig. 6.8. The presentation reflects very well which
combinations of the parameters are supported by the data. Under certain
conditions, with more than two parameters, several projections have to be
considered.

6.4.5 Combining Measurements


When parameters are determined in independent experiments, we obtain ac-
cording to the definition of the likelihood the combined likelihood by multi-
plication of the likelihoods of the individual experiments.
L(\lambda) = \prod_i L_i(\lambda) ,

\ln L = \sum_i \ln L_i .
Fig. 6.8. Likelihood contours.

The likelihood method makes it possible to combine experimental results in an extremely simple and at the same time optimal way. The experimental
data can originate from completely heterogeneous experiments because no
assumptions about the p.d.f.s of the individual experiments enter, except
that they are independent of each other.
For the combination of experimental results it is convenient to use the
logarithmic presentation. In case the log-likelihoods can be approximated by
quadratic parabolas, the addition again produces a parabola.
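For parabolic log-likelihoods the combination reduces to the familiar inverse-variance weighting, which can be checked numerically (illustrative numbers):

```python
# Two measurements of the same parameter with approximately parabolic
# log-likelihoods (illustrative estimates and standard errors):
est1, err1 = 5.0, 0.4
est2, err2 = 5.6, 0.3

def log_l(lam):
    # sum of the two parabolas, ln L_i = -(lam - est_i)^2 / (2 err_i^2)
    return (-(lam - est1) ** 2 / (2.0 * err1 ** 2)
            - (lam - est2) ** 2 / (2.0 * err2 ** 2))

# the sum is again a parabola: inverse-variance weighted mean and error
w1, w2 = 1.0 / err1 ** 2, 1.0 / err2 ** 2
combined = (w1 * est1 + w2 * est2) / (w1 + w2)
combined_err = (w1 + w2) ** -0.5

# crude numerical maximum of the summed log-likelihood for comparison
lam_best = max((log_l(4.0 + 0.0001 * k), 4.0 + 0.0001 * k)
               for k in range(20000))[1]
```

The numerical maximum agrees with the weighted mean, and the summed parabola drops by exactly 1/2 one combined standard error away from it.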

6.5 Likelihood and Information


6.5.1 Sufficiency

In a previous example, we have seen that the likelihood function for a sam-
ple of exponentially distributed decay times is a function only of the sam-
ple mean. In fact, in many cases, the i.i.d. individual elements of a sample
{x1 , . . . , xN } can be combined to fewer quantities, ideally to a single one
without affecting the estimation of the interesting parameters. The set of
these quantities, which are functions of the observations, is called a sufficient statistic. The sample itself is of course a sufficient, though uninteresting, statistic.
According to R. A. Fisher, a statistic is sufficient for one or several param-
eters, if by addition of arbitrary other statistics of the same data sample, the
parameter estimation cannot be improved. More precise is the following definition [1]: A statistic t(x1 , . . . , xN ) ≡ {t1 (x1 , . . . , xN ), . . . , tM (x1 , . . . , xN )} is sufficient for a parameter set θ, if the distribution of a sample {x1 , . . . , xN }, given t, does not depend on θ:

f(x_1, \ldots, x_N|\theta) = g(t_1, \ldots, t_M|\theta)\, h(x_1, \ldots, x_N) . \qquad (6.15)

The distribution g(t|θ) then contains all the information which is relevant
for the parameter estimation. This means that for the estimation process
we can replace the sample by the sufficient statistic. In this way we may
reduce the amount of data considerably. In the standard situation where all
parameter components are constrained by the data, the dimension of t must be greater than or equal to the dimension of the parameter vector θ. Every set of
uniquely invertible functions of t is also a sufficient statistic.
The relevance of sufficiency is expressed in a different way in the so-called
sufficiency principle:
If two different sets of observations have the same values of a sufficient
statistic, then the inference about the unknown parameter should be the same.
Of special interest is a minimal sufficient statistic. It consists of a minimal
number of components, ideally only of one element per parameter.
In what follows, we consider the case of a one-dimensional sufficient statis-
tic t(x1 , . . . , xN ) and a single parameter θ. The likelihood function can ac-
cording to (6.15) be written in the following way:

L = L1 (θ|t(x)) · L2 (x) . (6.16)

It is easy to realize that the second factor L2 , which is independent of θ6 , has no bearing on the likelihood ratios of different values of θ. We obtain a
data reduction of N to 1. This means that all samples of size N which have
the same value of the statistic t lead to the same likelihood function and thus
to the same MLE and the same likelihood ratio interval.
If a minimal sufficient statistic of one element per parameter exists, then
the MLE itself is a minimal sufficient statistic and the MLE together with
the sample size N fix the likelihood function up to an irrelevant factor. (For
the Cauchy distribution the full sample is a minimal sufficient statistic. No
further reduction in size is possible. Thus its MLE is not sufficient.)
If in the general situation with P parameters a minimal sufficient statistic
t of P components exists, the data reduction is N to P and the MLE for the
P parameters will be a unique function of t and is therefore itself a sufficient
statistic.

Example 84. Sufficient statistic and expected value of a normal distribution

6 Note that also the domain of x has to be independent of θ.

Let x1 , . . . , xN be N normally distributed observations with width s. The parameter of interest is the expected value µ of the distribution. The likelihood function is

L(\mu|x_1, \ldots, x_N) = c \prod_{i=1}^{N} \exp[-(x_i-\mu)^2/(2s^2)]

= c\, \exp\left[-\sum_{i=1}^{N} (x_i-\mu)^2/(2s^2)\right] ,

with c = (\sqrt{2\pi}\, s)^{-N}. The exponent can be expressed in the following way:

\sum_{i=1}^{N} (x_i-\mu)^2/(2s^2) = N(\overline{x^2} - 2\bar{x}\mu + \mu^2)/(2s^2) .
i=1

Now the likelihood L factorizes:

L(\mu|x_1, \ldots, x_N) = c\, \exp[-N(-2\bar{x}\mu + \mu^2)/(2s^2)] \cdot \exp[-N\overline{x^2}/(2s^2)] . \qquad (6.17)

Only the first factor depends on µ. Consequently the experimental quantity \bar{x} contains the full information on µ and thus is a one-dimensional sufficient statistic. Setting the derivative of the first factor equal to zero, we obtain the MLE µ̂ = \bar{x}.
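A small numerical sketch (invented samples) of this data reduction: two different samples with equal size and equal mean x̄ produce log-likelihood functions for µ that differ only by a µ-independent constant, hence identical MLEs and likelihood-ratio intervals:

```python
s = 1.0  # known width of the normal distribution

def log_l(mu, sample):
    # ln L(mu) = -sum_i (x_i - mu)^2 / (2 s^2), constants dropped
    return -sum((x - mu) ** 2 for x in sample) / (2.0 * s ** 2)

# two different samples with the same size N = 3 and the same mean 1.0
a = [0.0, 1.0, 2.0]
b = [0.5, 1.0, 1.5]

# the two log-likelihood curves differ only by a mu-independent offset
offset = log_l(0.0, a) - log_l(0.0, b)
diffs = [log_l(mu, a) - log_l(mu, b) for mu in (-1.0, 0.3, 1.0, 2.7)]
```

Since the offset is constant in µ, both samples carry exactly the same information about the mean.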

In the following example we show that a sufficient two-dimensional statistic can be found when, besides the expected value, also the width σ is to be estimated.

Example 85. Sufficient statistic for mean value and width of a normal distribution
Let x1 , . . . , xN be N normally distributed observations. The mean value µ and the width σ are the parameters of interest. From (6.17)

L(\mu, \sigma|x_1, \ldots, x_N) = \left(\frac{1}{\sqrt{2\pi}\, \sigma}\right)^{N} \exp\left[-N\, \frac{\overline{x^2} - 2\bar{x}\mu + \mu^2}{2\sigma^2}\right]

we deduce that \bar{x} and \overline{x^2} together form a sufficient statistic \{\bar{x}, \overline{x^2}\}. Alternatively, also \bar{x} and v^2 = \overline{(x-\bar{x})^2} form a sufficient statistic. The MLE in the two-dimensional parameter space µ, σ^2 is

\hat\mu = \bar{x} , \qquad \hat\sigma^2 = \overline{x^2} - \bar{x}^2 .
There is no one-dimensional sufficient statistic for σ.


Remark: In the examples which we have discussed, the likelihood function is fixed up to an irrelevant multiplicative factor if we consider the sample size N as a constant. In case N is also a random variable, N is part of the sufficient statistic, e.g. in the last example it is \{\bar{x}, \overline{x^2}, N\}. Usually N is given and is then an ancillary statistic.
Definition: A statistic y is called ancillary, if f (y|θ) = f (y), i.e. the p.d.f.
of y is independent of the parameter of interest7 .
The value of the ancillary statistic has no influence on the MLE but can
be relevant for the shape of the likelihood function and thus for the precision
of the estimation. The sample size is in most cases an ancillary statistic and
responsible for the accuracy of the estimation.

6.5.2 The Conditionality Principle

Imagine that a measurement is performed either with the precise device A or with the imprecise device B. The device is selected by a stochastic process.
After the measurement has been realized, we know the device which had been
selected. Let us assume this was device B. The conditionality principle tells
us that for the parameter inference we are allowed to use this information
which means that we may act as if device A had not existed. The analysis is
not “blind”. Stochastic results influence the way we evaluate the parameters.
More generally, the conditionality principle states:
If an experiment concerning the inference about θ is chosen randomly from
a collection of possible experiments, independently of θ, then any experiment
not chosen is irrelevant to the inference.

Example 86. Conditionality


We measure the position coordinate of the trajectory of an ionizing par-
ticle passing a drift chamber. A certain wire responds. Its position provides a
rough coordinate. In 90 % of all cases a drift time is registered and we obtain
a much more precise value of the coordinate. The conditionality principle tells
us that in this case we are allowed to use the drift time information without
considering the worse resolution of a possible but not realized failure of the
time measurement.

The conditionality principle seems to be trivial. Nevertheless, the belief in its validity is not shared by all statisticians because it leads to the likelihood
principle with its far reaching consequences which are not always intuitively
obvious.

7 Note that the combination of two ancillary statistics is not necessarily ancillary.
6.5.3 The Likelihood Principle

We now discuss a principle which concerns the foundations of statistical inference and which plays a central role in Bayesian statistics.
The likelihood principle (LP) states the following:
Given a p.d.f. f (x|θ) containing an unknown parameter of interest θ and
an observation x, all information relevant for the estimation of the parameter
θ is contained in the likelihood function L(θ|x) = f (x|θ).
Furthermore, two likelihood functions which are proportional to each
other, contain the same information about θ. The general form of the p.d.f.
is considered as irrelevant. The p.d.f. at variate values which have not been
observed, has no bearing for the parameter inference.
Correspondingly, for discrete hypotheses Hi the full experimental informa-
tion relevant for discriminating between them is contained in the likelihoods
Li .
The following examples are intended to make plausible the LP.

Example 87. Likelihood principle, dice


We have a bag of two biased dice A and B. Die A produces the numbers
1 to 6 with probabilities 1/12, 1/6, 1/6, 1/6, 1/6, 3/12. The corresponding
probabilities for die B are 3/12, 1/6, 1/6, 1/6, 1/6, 1/12. The result of an
experiment where one of the dice is selected randomly is “3”. We are asked
to bet for A or B. We are unable to draw a conclusion from the observed
result because both dice produce this number with the same probability, the
likelihood ratio is equal to one. The LP tells us – what intuitively is clear
– that for a decision the additional information, i.e. the probabilities of the
two dice to yield values different from “3”, are irrelevant.

Example 88. Likelihood principle, V − A


We come back to an example which we had discussed already in Sect.
6.3. An experiment investigates τ − → µ− ντ ν̄µ , µ− → e− νµ ν̄e decays and
measures the slope α̂ of the cosine of the electron direction with respect to
the muon direction in the muon center-of-mass. The parameter α depends on
the τ −µ coupling. Is the τ decay proceeding through V −A or V +A coupling?
The LP implies that the probabilities f− (α), f+ (α) of the two hypotheses to
produce values α different from the observed value α̂ do not matter. When
we now allow that the decay proceeds through a mixture r = gV /gA of V
and A interaction, the inference of the ratio r is based solely on the observed
value α̂, i.e. on L(r|α̂).
The LP follows inevitably from the sufficiency principle and the conditionality principle. It goes back to Fisher and has been reformulated and derived several times [45, 46, 47, 48]. Some of the early promoters of the LP (Barnard, Birnbaum) later came close to rejecting it or to restricting its applicability.
The reason for the refusal of the LP probably has its origin in its incompatibility with some concepts of classical statistics. A frequently expressed
argument against the LP is that the confidence intervals of the frequentist
statistics cannot be derived from the likelihood function alone and thus con-
tradict the LP. But this fact merely shows that certain statistical methods do
not use the full information content of a measurement and/or use irrelevant
information. Another reason lies in problems which sometimes occur if the
LP is applied in social sciences, in medicine or biology. There it is often not
possible to parameterize the empirical models in a stringent way. But uncer-
tainties in the model prohibit the application of the LP. The exact validity
of the model is a basic requirement for the application of the LP.
In the literature, examples are presented which are claimed to contradict the LP. These examples are not really convincing and rather strengthen the LP. Anyway, they often contain quite exotic distributions which are irrelevant in physics applications and which, when treated in a frequentist way, lead to unacceptable results [48].
We abstain from a reproduction of the rather abstract proof of the LP and limit ourselves to presenting a simple and transparent illustration of it:
The quantity which contains all the information we have on θ after the
measurement is the p.d.f. of θ,

g(\theta) = \frac{L(\theta|x)\, \pi(\theta)}{\int L(\theta|x)\, \pi(\theta)\, d\theta} .

It is derived from the prior density and the likelihood function. The prior
does not depend on the data, thus the complete information that can be
extracted from the data, and which is relevant for g(θ), must be contained in
the likelihood function.
A direct consequence of the LP is that in the absence of prior informa-
tion, optimal parameter inference has to be based solely upon the likelihood
function. It is then logical to select for the estimate the value of the parame-
ter which maximizes the likelihood function and to choose the error interval
such that the likelihood is constant at the border, i.e. is smaller everywhere outside than inside (see Chap. 8). All approaches which are not based on the
likelihood function are inferior to the likelihood method or at best equivalent
to it.

6.5.4 Stopping Rules

An experiment searches for a rare reaction. Just after the first successful ob-
servation at time t the experiment is stopped. Do we have to consider the
stopping rule in the inference process? The answer is “no” but many scien-
tists have a different opinion. This is the reason why we find the expression
stopping rule paradox in the literature.
The possibility to stop an experiment without compromising the data
analysis, for instance because a detector failed, no money was left or because
the desired precision has been reached, means a considerable simplification
of the data analysis.
In this context we examine a simple example.

Example 89. Stopping rule: four decays in a fixed time interval


In two similar experiments the lifetime of the same unstable particle is
measured. In experiment A the time interval t is fixed and 4 decays are
observed. In experiment B the time t is measured which is required to observe
4 decays. Likelihood functions obtained for 20 experiments are displayed in
Fig. 6.9. Let us assume that in both experiments accidentally the two times
coincide. Thus in both experiments 4 decays are registered in the time interval
t but in experiment A the number n of decays is the random variable while
in experiment B it is the time t. Do both experiments find the same rate,
namely θ = 4/t and the same error interval? We could think “no” because in
the first experiment the fourth decay has happened earlier than in the second.
The likelihood functions for the two situations are deduced for experiment
A from the Poisson distribution and for experiment B from the exponential
time distribution:

L_A(\theta|n) = P_{\theta t}(n) = \frac{e^{-\theta t}(\theta t)^4}{4!} \sim \theta^4 e^{-\theta t} ,

L_B(\theta|t) = \theta^4 e^{-\theta t} \sim L_A(\theta|n) .

The likelihood functions are equal up to an irrelevant factor and consequently also the results are the same. The stopping rule does not influence the analysis. The only relevant data are the number of decays and the length of the
time interval. The likelihood principle does not claim that experiments A
and B are equivalent. In fact, if we fix the length of the time interval, we
might observe no decay and rate θ = 0 would not be excluded contrary to
experiment B. The LP only states that for the estimation of parameters and
their uncertainty only the observed likelihood function is relevant.

The fact that an arbitrary sequential stopping rule does not change the
expectation value is illustrated with an example given in Fig. 6.10. A rate
is determined. The measurement is stopped if a sequence of 3 decays oc-
curs within a short time interval of only one second. It is probable that the
Fig. 6.9. Likelihood functions for 20 experiments. Left-hand: time for 4 events. Right-hand: number of events in a fixed time interval. The dashed curve is the average of an infinite number of experiments.

observed rate is higher than the true one, the estimate is too high in most
cases. However, if we perform many such experiments one after the other,
their combination is equivalent to a single very long experiment where the
stopping rule does not influence the result and from which we can estimate
the mean value of the rate with high precision. Since the log-likelihood of
the long experiment is equal to the sum of the log-likelihoods of the short
experiments, the log-likelihoods of the short experiments obviously represent
correctly the measurements.
Why does the fact that neglecting the stopping rule is justified contradict our intuition? Well, most of the sequences indeed lead to too high rates, but when we combine measurements, the few long sequences get a higher weight and tend to produce lower rates, and the average is correct. On the other hand, one might argue that the LP ignores the information that in most cases the true value of the rate is lower than the MLE. This information clearly matters if we were to bet on this property, but it is irrelevant for estimating the parameter value. A bias correction would somewhat improve the not very precise estimate for short sequences, but would be very unfavorable for the fewer but more precise long sequences; since we have no prior information, we cannot know whether our sequence is short or long (see also Appendix 13.7).
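The equivalence of the two likelihood functions can be verified numerically. The following sketch (Python; the values n = 4 and t = 2 in arbitrary units, and the function names, are our own choices) computes both likelihoods on a grid of rates and shows that their ratio is a θ-independent constant:

```python
import math

# Hypothetical numbers: n = 4 decays observed, interval length t = 2 (arbitrary units).
n, t = 4, 2.0

def lik_A(theta):
    """Likelihood of experiment A: Poisson probability of n decays in a fixed time t."""
    return math.exp(-theta * t) * (theta * t) ** n / math.factorial(n)

def lik_B(theta):
    """Likelihood of experiment B: density of the waiting time t for the n-th decay
    (gamma distribution in t, viewed here as a function of theta)."""
    return theta ** n * t ** (n - 1) * math.exp(-theta * t) / math.factorial(n - 1)

# The two likelihood functions differ only by a theta-independent factor t/n:
ratios = [lik_A(th) / lik_B(th) for th in (0.5, 1.0, 2.0, 3.0)]
print(ratios)  # all entries equal t/n = 0.5 up to rounding
```

Since the ratio does not depend on θ, both experiments yield the same MLE θ̂ = n/t and the same likelihood-ratio intervals.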

6.6 The Moments Method

The moments of a distribution which depends on a parameter θ usually also depend on θ:

Fig. 6.10. An experiment is stopped when 3 observations are registered within a short time interval (indicated by a box). An arbitrarily long experiment can be subdivided into many such subexperiments following the stopping rule.

µ_n(θ) = ∫ x^n f(x|θ) dx . (6.18)

The empirical moments

µ̂_n = (1/N) Σ_i x_i^n ,

e.g. the sample mean or the mean of squares, which we can extract trivially from a sample, are estimators of the moments of the distribution. From the inverse function µ^{−1} we obtain a consistent estimate of the parameter,

θ̂ = µ^{−1}(µ̂) ,

because according to the law of large numbers we have (see Appendix 13.1)

lim_{N→∞} P{|µ̂ − µ| > ε} = 0 .

It is clear that any function u(x), for which expected value and variance exist, and where ⟨u⟩ is an invertible function of θ, can be used instead of x^n.
Therefore the method is somewhat more general than suggested by its name.
If the distribution has several parameters to be estimated, we must use
several moments or expected values, approximate them by empirical averages,
and solve the resulting system of – in general non-linear – equations for the
unknown parameters.

The estimators derived from the lower moments are usually more precise than those computed from the higher ones. Parameter estimation from the moments is usually inferior to that of the ML method. Only if the moments used form a sufficient statistic do the two approaches produce the same result. The uncertainties of the fitted parameters have to be estimated from the covariance matrix of the corresponding moments and subsequent error propagation, or alternatively by a Monte Carlo simulation, generating the measurement several times. Also the bootstrap method, which will be introduced in Chap. 12, can be employed. Sometimes the error calculation is a bit tedious and reproduces the ML error intervals only in the large sample limit.

Example 90. Moments method: Mean and variance of the normal distribution
We come back to the example from Sect. 6.4.2. For a sample {x_1, . . . , x_N }, following the distribution

f(x|µ, σ) = 1/(σ√(2π)) exp[−(x − µ)²/(2σ²)] ,

we determine independently the parameters µ and σ. We use again the abbreviations x̄ for the sample mean and \overline{x²} for the mean of the squares and v² = \overline{(x − x̄)²} = \overline{x²} − x̄² for the empirical variance. The relation between the moment µ_1 and the parameter of the distribution µ is simply µ_1 = µ, therefore

µ̂ = µ̂_1 = x̄ .

In Chap. 3, we have derived the relation (3.15), ⟨v²⟩ = σ²(N − 1)/N, between the expectation of the empirical variance and the variance of the distribution; inverting it, we get

σ̂ = v √(N/(N − 1)) .

The two estimates are uncorrelated. The error of µ̂ is derived from the estimated variance,

δµ = σ̂/√N ,

and the error of σ̂ is determined from the expected variance of v. We omit the calculation; the result is:

δσ = σ̂/√(2(N − 1)) .

In the special case of the normal distribution, the independent point estimates of µ and σ of the moments method are identical to those of the maximum likelihood method. The errors differ for small samples but coincide in the limit N → ∞.
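The formulas of this example are easily coded. The sketch below (Python; the helper name and the toy sample are our own) returns the moments estimates of µ and σ together with their errors:

```python
import math

def normal_moments_fit(sample):
    """Moments estimates of mu and sigma with their errors (formulas of Example 90)."""
    N = len(sample)
    xbar = sum(sample) / N                                  # estimate of mu
    v2 = sum((x - xbar) ** 2 for x in sample) / N           # empirical variance v^2
    sigma_hat = math.sqrt(v2 * N / (N - 1))                 # sigma = v * sqrt(N/(N-1))
    mu_err = sigma_hat / math.sqrt(N)                       # delta(mu)
    sigma_err = sigma_hat / math.sqrt(2 * (N - 1))          # delta(sigma)
    return xbar, mu_err, sigma_hat, sigma_err

# toy sample with mean 1.0
mu, mu_err, s, s_err = normal_moments_fit([1.2, 0.8, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0])
print(mu, s)
```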

The moments method has the advantage that it is very simple, especially
in the case of distributions which depend linearly on the parameters – see
the next example below:

Example 91. Moments method: Asymmetry of an angular distribution

Suppose we have to determine the asymmetry parameter α of a distribution f(x) = (1 + αx)/2, linear in x = cos β, from a sample of N measurements. The first moment of the distribution is µ_1 = α/3. Thus we can compute the parameter from the sample mean x̄ = Σ x_i/N:

α̂ = 3 x̄ .

The mean squared error from an individual measurement x is proportional to the variance of the distribution:

var(α̂) = 9 var(x̄) = (3 − α²)/N . (6.19)

Using instead of α its estimate, we get

δα̂ = 3 δx̄ = √((3 − 9x̄²)/N) .

A likelihood fit, according to the likelihood principle, is more accurate and reflects much better the result of the experiment which, because of the kinematical constraint |α| < 1, cannot be described very well by symmetric errors, especially when the sample size is small and the estimate happens to lie near the boundary. In this case the maximum likelihood method should be applied. In the asymptotic limit N → ∞ the variance of the moments estimate α̂ does not approach the limit that is provided by the Cramér–Rao inequality, see (13.6) in Appendix 13.2, which is achieved by the MLE:

var(α̂_ML) ≈ (α²/N) [ (1/(2α)) ln((1 + α)/(1 − α)) − 1 ]^{−1} .

A comparison with (6.19) shows that the asymptotic efficiency of the moments method, defined as

ε = var(α̂_ML)/var(α̂) ,

is unity only for α = 0. It is 0.92 for α = 0.5 and drops to 0.73 for α = 0.8. (At the boundary |α| = 1 the Cramér–Rao relation cannot be applied.) Note that the p.d.f. of our example is a special case of the usual expansion of an angular distribution into Legendre polynomials P_l(cos β):

f(x|θ) = (1 + Σ_{l=1}^{L} θ_l P_l(x))/2 .

Fig. 6.11. Fit of a curve to measurements.

From the orthogonality of the P_l with the usual normalization

∫_{−1}^{1} P_l(x) P_m(x) dx = (2/(2l + 1)) δ_{l,m}

it is easy to see that θ_l = (2l + 1)⟨P_l⟩. In the case l = 1, P_1 = x, this is the first moment of the distribution and we reproduce µ_1 = α/3.
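The moments estimate α̂ = 3x̄ and its error can be checked by a small simulation. In the sketch below (Python; sampling by inversion of the cumulative distribution, valid for α ≠ 0; the assumed true value α = 0.5 and the seed are our own choices):

```python
import math
import random

random.seed(1)
alpha_true = 0.5

def sample_linear(alpha, n):
    """Draw n values from f(x) = (1 + alpha*x)/2 on [-1, 1] by inverting the CDF.
    CDF: F(x) = (x + 1)/2 + alpha*(x^2 - 1)/4; solving the quadratic for x
    (this inversion assumes alpha != 0)."""
    out = []
    for _ in range(n):
        u = random.random()
        x = (-1.0 + math.sqrt(1.0 - alpha * (2.0 - alpha - 4.0 * u))) / alpha
        out.append(x)
    return out

xs = sample_linear(alpha_true, 5000)
alpha_hat = 3.0 * sum(xs) / len(xs)                 # moments estimate: alpha = 3 <x>
err = math.sqrt((3.0 - alpha_hat ** 2) / len(xs))   # from var(alpha_hat) = (3 - alpha^2)/N
print(alpha_hat, err)
```

For this sample size the statistical error is about 0.02, so α̂ should reproduce the true value within a few hundredths.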

6.7 The Least Square Method

A frequently occurring problem is that a curve has to be fitted to given measured points with error margins as shown in Fig. 6.11. The standard solution of this regression problem is provided by the least square method which fixes the parameters of a given function by minimizing the sum of the normalized square deviations of the function from the measured points.
Given N measured points x_i, y_i ± δ_i, and a function t(x, θ), known up to some free parameters θ, the latter are determined such that

χ² = Σ_{i=1}^{N} (y_i − t(x_i, θ))²/δ_i² (6.20)

takes its minimal value.



The least square method goes back to Gauss. Historically it has successfully been applied to astronomical problems and is still the best method we have to adjust parameters of a curve to measured points if only the variance of the error distribution is known. It is closely related to the likelihood method if the errors are normally distributed. Then we can write the p.d.f. of the measurements in the following way:

f(y_1, . . . , y_N |θ) ∝ exp[ −Σ_{i=1}^{N} (y_i − t(x_i, θ))²/(2δ_i²) ] ,

and the log-likelihood is

ln L(θ|y) = −(1/2) Σ_{i=1}^{N} (y_i − t(x_i, θ))²/δ_i² = −χ²/2 . (6.21)
2
Thus minimizing χ2 is equivalent to maximizing the likelihood if the errors
are normally distributed and independent of the free parameters, a condition
which frequently is approximately satisfied. From (6.21) we conclude that the
standard deviation errors of the parameters in a least square fit correspond to
one unit, ∆χ2 = 1, twice the value 1/2 of the maximum likelihood method.
In Sect. 3.6.7 we have seen that χ2 follows approximately a χ2 distribution of
f = N − Z (Z is the number of free parameters) degrees of freedom, provided
the normality of the errors is satisfied. Thus we expect χ2 to be of the order of
f , large values indicate possible problems with the data or their description.
We will investigate this in Chapter 10.
The standard deviation of the χ² distribution for f degrees of freedom is σ = √(2f), which, for example, is equal to 10 for 50 degrees of freedom.
With such large fluctuations of the value of χ2 from one sample to the other,
it appears paradoxical at first sight that a parameter error of one standard
deviation corresponds to such a small change of χ2 as one unit, while a
variation of χ2 by 10 is compatible with the prediction. The obvious reason
for the good resolution is that the large fluctuations from sample to sample
are unrelated to the value of the parameter. If we were to compare the prediction to a new measurement sample after each parameter change in a minimum searching routine, we would not be able to obtain a precise result for the estimate of the parameter θ.
That the least square method can lead to false results if the condition of Gaussian uncertainties is not fulfilled is illustrated in the following example.

Example 92. Counter example to the least square method: Gauging a digital
clock

Fig. 6.12. χ²-fit (dashed) of a straight line to digital measurements.

A digital clock has to be gauged. Fig. 6.12 shows the time channel as a
function of the true time and a least square fit by a straight line. The error
bars in the figure are not error bars in the usual sense but indicate the channel
width. The fit fails to meet the allowed range of the fifth point and therefore
is not compatible with the data. All straight lines which meet all “error bars”
have the same likelihood. One correct solution is indicated in the figure.

We can easily generalize the expression (6.20) to the case of correlated errors. Then we have, with t_i = t(x_i, θ),

χ² = Σ_{i,j=1}^{N} (y_i − t_i) V_{ij} (y_j − t_j) ,

where V, the weight matrix, is the inverse of the covariance matrix. The quantity χ² is up to a factor two equal to the negative log-likelihood of a multivariate normal distribution,

f(y_1, . . . , y_N |θ) ∝ exp[ −(1/2) Σ_{i,j=1}^{N} (y_i − t_i) V_{ij} (y_j − t_j) ] ,

see Sect. 7.1.1. Maximizing the likelihood is again equivalent to minimizing χ² if the errors are normally distributed.
The sum χ² is not invariant under a non-linear variable transformation y′(y).

Example 93. Least square method: Fit of a straight line

We fit the parameters a, b of the straight line

y(x) = ax + b (6.22)

to a sample of points (x_i, y_i) with uncertainties δ_i of the ordinates. We minimize χ²:

χ² = Σ_i (y_i − a x_i − b)²/δ_i² ,
∂χ²/∂a = Σ_i 2x_i(−y_i + a x_i + b)/δ_i² ,
∂χ²/∂b = Σ_i 2(−y_i + a x_i + b)/δ_i² .

We set the derivatives to zero and introduce the following abbreviations (in parentheses we put the expressions for the special case where all uncertainties are equal, δ_i = δ):

x̄ = Σ_i (x_i/δ_i²) / Σ_i (1/δ_i²)   ( = Σ_i x_i/N ) ,
ȳ = Σ_i (y_i/δ_i²) / Σ_i (1/δ_i²)   ( = Σ_i y_i/N ) ,
\overline{x²} = Σ_i (x_i²/δ_i²) / Σ_i (1/δ_i²)   ( = Σ_i x_i²/N ) ,
\overline{xy} = Σ_i (x_i y_i/δ_i²) / Σ_i (1/δ_i²)   ( = Σ_i x_i y_i/N ) .

We obtain

b̂ = ȳ − â x̄ ,
\overline{xy} − â \overline{x²} − b̂ x̄ = 0 ,

and

â = (\overline{xy} − x̄ ȳ) / (\overline{x²} − x̄²) ,
b̂ = (\overline{x²} ȳ − x̄ \overline{xy}) / (\overline{x²} − x̄²) .
The problem is simplified when we put the origin of the abscissa at the center of gravity x̄:

x′ = x − x̄ ,
â′ = \overline{x′y} / \overline{x′²} ,
b̂′ = ȳ .

Now the equation of the straight line reads

y = â′ (x − x̄) + b̂′ . (6.23)

We gain an additional advantage: the errors of the estimated parameters are no longer correlated,

δ²(â′) = 1 / Σ_i (x_i′²/δ_i²) ,
δ²(b̂′) = 1 / Σ_i (1/δ_i²) .

We recommend always using the form (6.23) instead of (6.22).
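The centered form (6.23) translates into a few lines of code. A sketch (Python; the function name is our own) of the weighted straight-line fit with uncorrelated parameter errors:

```python
import math

def fit_line(xs, ys, dys):
    """Weighted least-squares straight-line fit in the centered form (6.23):
    y = a'*(x - xbar) + b', with uncorrelated errors on a' and b'."""
    w = [1.0 / d ** 2 for d in dys]                    # weights 1/delta_i^2
    W = sum(w)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / W     # weighted mean of x
    ybar = sum(wi * y for wi, y in zip(w, ys)) / W     # weighted mean of y
    xp = [x - xbar for x in xs]                        # x' = x - xbar
    sxx = sum(wi * xi ** 2 for wi, xi in zip(w, xp))
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xp, ys))
    a = sxy / sxx                                      # slope a'
    b = ybar                                           # intercept b' at x = xbar
    da = math.sqrt(1.0 / sxx)                          # delta(a')
    db = math.sqrt(1.0 / W)                            # delta(b')
    return a, b, da, db, xbar

# exact data on the line y = 2x + 1 with equal errors: the fit must reproduce it
a, b, da, db, xbar = fit_line([0, 1, 2, 3], [1, 3, 5, 7], [0.1] * 4)
print(a, b + a * (0 - xbar))   # slope and intercept at x = 0
```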

6.7.1 Linear Regression

If the prediction depends only linearly on the parameters, we can compute the parameters which minimize χ² analytically. We put

y(θ) = Aθ + e . (6.24)

Here θ is the P-dimensional parameter vector, e is an N-dimensional error vector with expectation zero, and ⟨y⟩ = t is the N-dimensional vector of predictions. For simplification, it is usual to consider y − e as a random vector and call it again y, of course now with expectation

⟨y⟩ = Aθ . (6.25)

A, also called the design matrix, is a rectangular matrix of given elements with P columns and N rows, defining the above mentioned linear mapping from the P-dimensional parameter space into the N-dimensional sample space.
The straight line fit discussed in Example 93 is a special case of (6.24) with P = 2, E(y_i) = Σ_{j=1}^{2} A_{ij} θ_j = θ_1 x_i + θ_2, i = 1, . . . , N, and

A = ( x_1 ··· x_N ; 1 ··· 1 )^T ,

i.e. row i of A is (x_i, 1).

We have to find the minimum in θ of

χ² = (y − Aθ)^T V_N (y − Aθ) ,

where, as usual, V_N is the weight matrix of y, the inverse of the covariance matrix: V_N = C_N^{−1}. In the example above it is a diagonal N × N matrix with elements 1/δ_i², where δ_i is the standard deviation of the observed value y_i^8. We differentiate χ² with respect to the parameters θ and set the derivatives equal to zero:

(1/2) ∂χ²/∂θ |_{θ̂} = 0 = −A^T V_N (y − Aθ̂) . (6.26)

From these so-called normal equations we get the estimate for the P parameters θ̂:

θ̂ = (A^T V_N A)^{−1} A^T V_N y . (6.27)

Note that A^T V_N A is a symmetric P × P matrix which turns out to be the inverse of the error or covariance matrix E_θ ≡ C_P of θ̂. This matrix is (see Sect. 4.4, relation (4.13)) C_P = D C_N D^T with D the derivative matrix

D = (A^T V_N A)^{−1} A^T V_N

derived from (6.27). After some simplifications we obtain:

C_P = (A^T V_N A)^{−1} .

A feature of the linear model is that the result (6.27) for θ̂ turns out to be linear in the measurements y. Using it together with (6.25), one easily finds E(θ̂) = θ, i.e. the estimate is unbiased^9. The Gauss–Markov theorem states that any other estimate obeying these two assumptions will have an error matrix with larger or equal diagonal elements than the above estimate (also called BLUE: best linear unbiased estimate).
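For illustration, the normal equations (6.26), (6.27) can be solved explicitly for a two-parameter model. The sketch below (Python; our own helper, restricted to P = 2 so that the matrix inversion can be written out by hand) also returns the covariance matrix C_P = (A^T V_N A)^{−1}:

```python
def fit_linear_model(A, y, dys):
    """Least-squares estimate theta = (A^T V A)^{-1} A^T V y for a 2-parameter model,
    with V the diagonal weight matrix with entries 1/delta_i^2 (sketch for P = 2 only)."""
    w = [1.0 / d ** 2 for d in dys]
    # build the 2x2 matrix M = A^T V A and the vector r = A^T V y
    M = [[sum(wi * a[j] * a[k] for wi, a in zip(w, A)) for k in range(2)]
         for j in range(2)]
    r = [sum(wi * a[j] * yi for wi, a, yi in zip(w, A, y)) for j in range(2)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    # invert the 2x2 matrix explicitly; C = M^{-1} is the covariance matrix of theta
    C = [[M[1][1] / det, -M[0][1] / det],
         [-M[1][0] / det, M[0][0] / det]]
    theta = [C[j][0] * r[0] + C[j][1] * r[1] for j in range(2)]
    return theta, C

# straight line: rows of the design matrix are (x_i, 1)
xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
A = [(x, 1.0) for x in xs]
theta, C = fit_linear_model(A, ys, [0.1] * 4)
print(theta)   # slope and intercept of y = 2x + 1
```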
Linear regression provides an optimal solution only for normally dis-
tributed, known errors. Often, however, the latter depend on the parameters.
Strictly linear problems are therefore rare. When the prediction is a non-
linear function of the parameters, the problem can be linearized by a Taylor
expansion as a first rough approximation. By iteration the precision can be
improved.
The importance of non-linear parameter inference by iterative linear re-
gression has decreased considerably. The minimum searching routines which
we find in all computer libraries are more efficient and easier to apply. Some
basic minimum searching approaches are presented in Appendix 13.12.
^8 We keep here the notation χ², which is strictly justified only in case of Gaussian error distributions or asymptotically for large N. Only then does it obey a χ² distribution with N − P degrees of freedom. The index of a quadratic matrix indicates its dimension.
^9 This is true for any N, not only asymptotically.

6.8 Properties of estimators

The content of this section is summarized in the Appendices 13.2 and 13.2.2.

6.8.1 Consistency

An estimator is consistent, loosely speaking, if in the large number limit the estimator approaches the true parameter value. More precisely, consistency requires that the probability that the absolute difference between the estimated parameter value and its true value is larger than an arbitrarily small value ε tends to zero as N tends to infinity:

lim_{N→∞} P(|θ̂ − θ| > ε) = 0 .

Consistency is a necessary condition for a useful estimator. The MLE is consistent (see Appendix 13.2.2).
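Consistency can be illustrated by simulation: for the sample mean of a unit normal distribution, the probability of a deviation larger than ε shrinks with growing N. A small sketch (Python; the values of ε, the sample sizes and the number of trials are arbitrary choices of ours):

```python
import random

random.seed(2)
mu, eps = 0.0, 0.2

def prob_large_dev(N, trials=2000):
    """Estimate P(|sample mean - mu| > eps) for samples of size N from a unit normal."""
    hits = 0
    for _ in range(trials):
        m = sum(random.gauss(mu, 1.0) for _ in range(N)) / N
        if abs(m - mu) > eps:
            hits += 1
    return hits / trials

p10, p100, p1000 = prob_large_dev(10), prob_large_dev(100), prob_large_dev(1000)
print(p10, p100, p1000)  # decreasing towards zero
```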

6.8.2 Transformation Invariance

We require that the estimate θ̂ and the estimate of a function of θ, \widehat{f(θ)}, satisfy the relation \widehat{f(θ)} = f(θ̂). For example, the mean lifetime τ and the decay rate γ of a particle are related by γ = 1/τ. Therefore their estimates from a sample of observations have to be related by γ̂ = 1/τ̂. If they were different, the prediction for the number of decays in a given time interval would depend on the choice of τ̂ or γ̂ used to evaluate the number. Similarly, in the computation of a cross section which depends on different powers of a coupling constant g, we would get inconsistent results unless ĝ^n = \widehat{g^n}. Estimators applied to constants of nature have to be transformation invariant. The MLE and the likelihood ratio error intervals satisfy this condition.
Note that transformation invariance is not important in most statistical applications outside the natural sciences. This is why it is not always considered as necessary.

6.8.3 Accuracy and Bias of Estimators

The bias b of an estimate θ̂ is the deviation of its expectation value from the
true value θ of the parameter:

b = E(θ̂) − θ .

Example 94. Bias of the estimate of a decay parameter

We estimate the decay parameter γ from 5 observed decays of an unstable particle. We have seen in a previous example that the MLE γ̂ is the inverse of the average of the individual decay times, γ̂ = 1/t̄. The mean value t̄ follows a gamma distribution (see Sect. 3.6.8),

f(t̄|γ) = (5γ)^5/4! · t̄^4 exp(−5γt̄) ,

and thus the expectation value E(γ̂) of γ̂ is

E(γ̂) = ∫_0^∞ (1/t̄) f(t̄|γ) dt̄ = ∫_0^∞ (5γ)^5/4! · t̄^3 exp(−5γt̄) dt̄ = (5/4) γ .

When in a large number of similar experiments with 5 observed events the MLE of the decay rate is determined, the arithmetic mean differs from the true value by 25 %; the bias of the MLE is b = E(γ̂) − γ = γ/4. For a single decay the bias is infinite.
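The bias E(γ̂) = (5/4)γ is easily reproduced by a small Monte Carlo study (Python sketch; the number of experiments and the seed are arbitrary choices of ours):

```python
import random

random.seed(7)
gamma_true, N, experiments = 1.0, 5, 20000

total = 0.0
for _ in range(experiments):
    # mean of N exponential decay times with rate gamma_true
    tbar = sum(random.expovariate(gamma_true) for _ in range(N)) / N
    total += 1.0 / tbar          # MLE of the decay rate from this experiment

print(total / experiments)       # close to 5/4 * gamma_true, as derived above
```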

The MLE of the decay constant of an exponential decay distribution is biased while the MLE of the mean lifetime is unbiased.
Similarly, we may define as a measure of accuracy a the expected value of the squared deviation of the estimate from the true value,

a = E[(θ̂ − θ)²] . (6.28)

In both cases the estimate is considered as a random variable. An estimator with the property that a in the large sample limit, N → ∞, is smaller than the result of any other estimator is called efficient (see Appendix 13.2). Efficient estimators have to be unbiased. In an exponential decay the MLE τ̂ of the lifetime is both unbiased and efficient, while the MLE of the decay rate γ = 1/τ is neither unbiased nor efficient.
Biases occur quite frequently for small samples. With increasing number of observations the bias decreases (see Appendix 13.2.2).
The word bias somehow suggests that something is wrong and thus it
appears quite disturbing at first sight that estimates may be systematically
biased. In fact in most of the conventional statistical literature it is recom-
mended to correct for the bias. One reason given for the correction is the
expectation that by averaging many biased results the error would decrease, but the bias would remain. However, there is no obvious reason for a correction, and a closer study reveals that bias corrections rather lead to difficulties when we combine different measurements θ̂_i in the usual way, weighting the results

Table 6.1. Expected weighted mean of 10 decay time measurements.

method                          τ     γ
mean                            1.00  1.11
weighted mean                   0.80  0.91
weighted mean, bias corrected   0.80  0.82
weighted mean PDG               0.88  0.95
weighted mean extended          0.93  0.97

by the inverse covariance matrix, or in the one-dimensional case, according to (4.6), simply by the inverse error squared:

θ̄ = ( Σ θ̂_i/δ_i² ) / ( Σ 1/δ_i² ) .
Since usually the estimated errors depend on the value of the MLE, the
weighting introduces a bias which may partially compensate a bias of the
MLE or it may increase it.
Let us return to our last example and assume that many experiments measure the decay rate from samples of size N = 5. The estimates γ̂_i will vary from experiment to experiment. Each experiment will, apart from the estimate, evaluate the error δ_i, which will turn out to be proportional to γ̂_i, namely δ_i = γ̂_i/√5. Averaging without bias correction according to our prescription, we will obtain E(γ̄) = (5/6) γ, thus the bias is reduced, while averaging the bias-corrected estimates would lead to the expectation E(γ̄) = (2/3) γ, a result which is considerably worse. Table 6.1 summarizes the expected mean values from 10 observed decay times from particles with true lifetime 1. (The table includes results from averaging procedures that will be discussed in Sect. 8.)
We have to conclude that bias corrections should not be applied to MLEs.
The accuracy as defined by (6.28) and the bias are important quantities
in frequentist statistics, but are less relevant in Bayesian and likelihood based
statistics. Why is this so?
The frequentist statistics uses properties like a and b of the estimate
given the true parameter value, while we are interested in the properties of
the unknown true value given the measurement. The inversion of probabilities
can lead to contradictions as becomes obvious in our lifetime example. As the
decay rate is biased towards high values, one might conclude that the true
value is likely to be located below the estimate. As a consequence the true
value of τ = 1/γ, should be located above its estimate τ̂ , the estimate should
be negatively biased, however it is unbiased.
The requirements of unbiasedness and maximal efficiency violate transformation invariance and for this reason are not relevant for the estimates of constants of nature. If bias corrections were applied, for instance in a power expansion of the strong coupling constant α, a different value of α would have to be inserted in each power term, because the bias correction depends on the power. Biases occur if the number of events is small. In this situation the uncertainty of measurements should be represented by asymmetric errors, or even better, the full likelihood function should be recorded.
In the following we discuss some examples where the likelihood function
is very asymmetric.

Example 95. Bias of the estimate of a Poisson rate with observation zero
We search for a rare decay but do not observe any. The likelihood for the mean rate λ is, according to Poisson statistics,

L(λ) = e^{−λ}λ^0/0! = e^{−λ} .

When we normalize the likelihood function to obtain the Bayesian p.d.f. with a uniform prior, we obtain the expectation value ⟨λ⟩ = 1, while the value λ̂ = 0 corresponds to the maximum of the likelihood function. (It may seem astonishing that an expectation value of one follows from a null measurement. This result is a consequence of the assumption of a uniform prior, which is not unreasonable because had we not anticipated the possibility of a decay, we would not have performed the measurement. Since also mean rates different from zero may lead to the observation zero, it is natural that the expectation value of λ is different from zero.) Now if none of 10 similar experiments observed a decay, naively averaging the expected values alone would again result in a mean of one, a crazy value. Strictly speaking, the likelihoods of the individual experiments should be multiplied, or, equivalently, the null rate would have to be normalized to ten times the original time, with the Bayesian result 1/10.

We study a further example.

Example 96. Bias of the measurement of the width of a uniform distribution

Let x_1, . . . , x_N be N observations of a sample following a uniform distribution f(x) = 1/θ with 0 < x < θ. We estimate the parameter θ. Figure 6.13 shows the observations and the likelihood function for N = 12. The likelihood function is

L = 0 for θ < x_max ,
L = 1/θ^N for θ ≥ x_max .

Obviously, the likelihood has its maximum when θ coincides with the largest observation x_max of the sample: θ̂ = x_max. (Here x_max is a sufficient statistic.)

Fig. 6.13. Likelihood function of the width of a uniform distribution for 12 observations.

At smaller values of θ, the likelihood is zero. The estimate is biased towards small values. Given a sample size of N, we obtain N + 1 gaps between the observations and the borders [0, θ]. The average distance of the largest observation from θ thus is θ/(N + 1). The bias is −θ̂/N. There is no reason to correct for the bias. We rather prefer to present the biased result with a one-sided error,

θ = x_max^{+x_max/N}_{−0} ,

or, alternatively, the full likelihood function.
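A quick Monte Carlo check (Python sketch; the concrete numbers and the seed are our own choices) confirms E(x_max) = θN/(N + 1), i.e. a negative bias of θ/(N + 1):

```python
import random

random.seed(3)
theta_true, N, experiments = 1.0, 12, 20000

total = 0.0
for _ in range(experiments):
    # MLE of theta is the largest of N uniform observations on [0, theta]
    total += max(random.uniform(0.0, theta_true) for _ in range(N))

mean_est = total / experiments
print(mean_est)   # close to theta * N/(N+1) = 12/13, i.e. biased low by theta/(N+1)
```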

A further, more general discussion of the bias problem is given in Appendix 13.7.

6.9 Comparison of Estimation Methods

The following table contains an evaluation of the virtues and properties of the estimation approaches which we have been discussing.
the estimation approaches which we have been discussing.
Whenever possible, the likelihood method should be applied. It requires
a sample of observations and a p.d.f. in analytic or well defined numerical
form and is very sensitive to wrongly assigned observations in the sample.
When the theoretical description of the data is given in form of a simulated
histogram, the Poisson likelihood adjustment of the simulation to the bin

Table 6.2. Virtues and caveats of different methods of parameter estimation.

                          moments                  χ²           max. likelihood
simplicity                ++                       +            −
precision                 −                        +            ++
individual observations   +                        −            +
measured points           −                        +            −
histograms                +                        +            +
upper and lower limits    −                        −            +
external constraints      −                        +            +
background included       +                        +            −
error assignment          from error propagation   χ²_min + 1   ln L_max − 0.5
requirement               full p.d.f.              only variance   full p.d.f.

content should be chosen, see the following section. When we have to fit a function to measured data points, we use the least square method. If computing time is a limitation, as in some on-line applications, the moments method lends itself. In many situations all three methods are equivalent.
All methods are sensitive to spurious background. Especially robust methods have been invented to solve this problem. An introduction and references are given in Appendix 13.18. For completeness we present in Appendix 13.3.1 some frequentist criteria of point and interval estimation which are relevant when parameters of many objects of the same type, for instance particle tracks, are measured. In Appendix 13.7 we discuss the virtues of different point and interval inference approaches. Algorithms for minimum search are sketched in Appendix 13.12.
7 Estimation II

7.1 Likelihood of Histograms


For large samples it is more efficient to analyze the data in form of histograms
than to compute the likelihood for many single observations. The individual
observations are classified and collected into bins where all events of a bin have
approximately the same likelihood. We then compare the number of entries
of a bin with the parameter dependent prediction. Often the prediction is
available only as a Monte Carlo simulation in form of a histogram. We will
discuss the comparison of data to a Monte Carlo simulation in some detail
in the following section.
We denote the total number of events by N , the number of events in bin
i by di and the number of bins by B. In the following all sums run over all
bins, i = 1, ..., B.
We have to distinguish different situations:
i) We have an absolute prediction ti (θ) for the number of events di in bin
i. The numbers di are described by Poisson distributions with mean ti .
ii) The absolute particle flux is not known. The prediction cti (θ) of the
number of events in bin i contains an unknown normalization factor c. The
numbers di are described by Poisson distributions with mean cti . The pa-
rameter c is a free parameter in the fit.
The second case is much more frequent than the first. Think for instance
of the measurement of a particle lifetime from a sample of events where the
flux is not predicted.
Remark : The case with unknown normalization can also be formulated
in the following way: The relative probabilities pi (θ) = ti (θ)/Σi ti (θ) for the
number of events of the bins are predicted. Then the observed data follow
a multinomial distribution where N events are distributed into the B bins
with probabilities pi and the normalization parameter is eliminated. As a
consequence, in case the normalization c is kept as a free parameter of the
fit, it is not correlated with θ. The multinomial treatment and the Poissonian
treatment with c as a free parameter are equivalent, see Appendix 13.11.1.
The latter is preferable because the error treatment is much simpler than in
the multinomial case where the constraint Σi pi = 1 has to be satisfied.
We start with case i):

The likelihood for t_i expected and d_i observed entries according to the Poisson distribution is given by

L_i(θ) = e^{−t_i} t_i^{d_i} / d_i! ,
ln L_i(θ) = −t_i + d_i ln t_i − ln(d_i!) .

Since factors not depending on θ are irrelevant for the likelihood inference (see Sect. 6.4.1), we are allowed to omit the term with the factorial. The log-likelihood of the complete histogram with B bins is then

ln L(θ) = Σ_{i=1}^{B} (−t_i + d_i ln t_i) . (7.1)

The parameter dependence is hidden in the quantities ti . The maximum


of this function is determined by numerical methods.
For the determination of the maximum, the sum (7.1) has to be re-
computed after each modification of the parameters. Since the sum runs
only over the bins but not over all individual observations as in the normal
likelihood method, the computation for histograms is relatively fast.
In the second case with unknown normalization we have to replace t_i by ct_i:

ln L(θ) = Σ_{i=1}^{B} (−ct_i + d_i ln(ct_i)) . (7.2)

Deriving the log-likelihood with respect to c and setting the derivative equal to zero, we obtain the proper estimate for the normalization: ĉ = Σd_i/Σt_i. The parameter c is not correlated with the parameters of interest θ. Therefore the error estimates of θ are independent of c.

Example 97. Adjustment of a linear distribution to a histogram

Let the cosine u = cos α of an angle α be linearly distributed according to

f(u|λ) = (1 + λu)/2 , −1 ≤ u ≤ 1 , |λ| < 1 .

We want to determine the parameter λ which best describes the observed distribution of 500 entries d_i into 20 bins (Fig. 7.1). In the Poisson approximation we expect t_i entries for bin i, corresponding to the value u_i = −1 + (i − 0.5)/10 of the cosine at the center of the bin:

t_i = (500/20)(1 + λu_i) .

We obtain the likelihood function by inserting this expression into (7.1). The likelihood function and the MLE are indicated in Fig. 7.1.
Fig. 7.1. Linear distribution with adjusted straight line (left) and likelihood function (right).
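The fit of Example 97 can be sketched end to end: generate a sample, histogram it, and maximize the binned log-likelihood (7.1) by a crude grid search. All concrete numbers below (true λ, seed, grid) are our own choices:

```python
import math
import random

random.seed(5)
lam_true, N, B = 0.7, 500, 20

# simulate N events of f(u) = (1 + lam*u)/2 on [-1, 1] by accept-reject
data = []
while len(data) < N:
    u = random.uniform(-1.0, 1.0)
    if random.random() < (1 + lam_true * u) / (1 + abs(lam_true)):
        data.append(u)

# histogram the data into B bins over [-1, 1]
counts = [0] * B
for u in data:
    i = min(int((u + 1.0) * B / 2.0), B - 1)
    counts[i] += 1

def log_lik(lam):
    """Binned Poisson log-likelihood (7.1), with t_i evaluated at the bin centers."""
    s = 0.0
    for i, d in enumerate(counts):
        u_c = -1.0 + (i + 0.5) * 2.0 / B
        t = N / B * (1 + lam * u_c)
        s += -t + d * math.log(t)
    return s

# crude grid-search maximization over |lambda| < 1
lam_hat = max((l / 1000 for l in range(-950, 951)), key=log_lik)
print(lam_hat)
```

With 500 events the statistical uncertainty on λ is of the order of 0.07, so the estimate should land close to the true value.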

7.1.1 The χ² Approximation

We have seen in Sect. 3.6.3 that with increasing mean value t, the Poisson distribution asymptotically approaches a normal distribution with variance t. Thus for high statistics histograms the number of events d in a bin with prediction t(θ) is described by

f(d) = 1/√(2πt) exp[−(d − t)²/(2t)] .

Contrary to the case of relation (6.20), the denominator of the exponent and the normalization now depend on the parameters.
The corresponding log-likelihood is

ln L = −(d − t)²/(2t) − (1/2) ln(2π) − (1/2) ln t .

For large t, the logarithmic term is an extremely slowly varying function of t. In situations where the Poisson distribution can be approximated by a normal distribution, it can safely be neglected. Omitting it and the constant term, we find for the whole histogram

ln L = −(1/2) Σ_{i=1}^{B} (d_i − t_i)²/t_i = −χ²/2
198 7 Estimation II

with
B
X (di − ti )2
χ2 = . (7.3)
i=1
ti
If the approximation of the Poisson distribution by a normal distribution is justified, the likelihood estimation of the parameters is equivalent to a least squares fit, and the standard errors are given by an increase of χ² by one unit. Often histograms contain some bins with few entries. Then a binned likelihood fit is to be preferred to a χ² fit, since the above condition of large t_i is violated. It is therefore recommended always to perform a likelihood adjustment.

7.2 Extended Likelihood


When we record N independent multi-dimensional observations {x_i}, i = 1, ..., N, of a distribution depending on a set of parameters θ, it may happen that these parameters also determine the rate, i.e. the expected rate λ(θ) is a function of θ. In this situation N is no longer a fixed parameter but a random variable like the x_i¹. This means that we have to multiply two probabilities, the probability to find N observations which follow the Poisson statistics P_λ(N) and the probability to observe a certain distribution of the variates x_i. Given a p.d.f. f(x|θ) for a single observation, we obtain the extended likelihood function [34, 35]

\[
L(\theta) = \frac{e^{-\lambda(\theta)} \lambda(\theta)^N}{N!} \prod_{i=1}^{N} f(x_i|\theta)
\]

and its logarithm

\[
\ln L(\theta) = -\lambda(\theta) + N \ln \lambda(\theta) + \sum_{i=1}^{N} \ln f(x_i|\theta) - \ln N! . \tag{7.4}
\]

Again we can omit the last term in the likelihood analysis, because it does not depend on θ.

Example 98. Fit of the particle composition of an event sample (1) [36]
We consider the distribution f(x) of a mixture of K different types of particles. Let the p.d.f. of the identification variable x (this could be, for example, the energy loss) for particles of type k be f_k(x). The task is to determine the numbers λ_k of the different particle species in the sample from the observed values x_i of N detected particles. The p.d.f. of x is

¹ In the statistical literature this is called a compound distribution, see Sect. 3.60.

\[
f(x) = \sum_{k=1}^{K} \lambda_k f_k(x) \Big/ \sum_{k=1}^{K} \lambda_k
\]

and the probability to observe N events is

\[
\exp\left( -\sum_{k=1}^{K} \lambda_k \right) \frac{\left( \sum_{k=1}^{K} \lambda_k \right)^{N}}{N!} .
\]

The extended log-likelihood is

\[
\ln L = -\sum_{k=1}^{K} \lambda_k + N \ln \sum_{k=1}^{K} \lambda_k + \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \lambda_k f_k(x_i) - N \ln \sum_{k=1}^{K} \lambda_k
= -\sum_{k=1}^{K} \lambda_k + \sum_{i=1}^{N} \ln \sum_{k=1}^{K} \lambda_k f_k(x_i) . \tag{7.5}
\]

To find the MLE, we differentiate ln L:

\[
\frac{\partial \ln L}{\partial \lambda_m} = -1 + \sum_{i=1}^{N} \frac{f_m(x_i)}{\sum_{k=1}^{K} \lambda_k f_k(x_i)} = 0 ,
\]
\[
1 = \sum_{i=1}^{N} \frac{f_m(x_i)}{\sum_{k=1}^{K} \lambda_k f_k(x_i)} . \tag{7.6}
\]

The solution of (7.6) can be obtained iteratively [36],

\[
\lambda_m^{(n)} = \sum_{i=1}^{N} \frac{\lambda_m^{(n-1)} f_m(x_i)}{\sum_{k=1}^{K} \lambda_k^{(n-1)} f_k(x_i)} ,
\]

or with a standard maximum searching program applied to (7.5). Alternatively, we can base the fit on (7.5) with the parameters constrained by the requirement Σλ_k = N. This solution will be explained in Sect. 7.5.
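The iteration rule above can be sketched as follows (Python; the two Gaussian identification densities and all event numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical Gaussian identification densities for two particle species
def f1(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

def f2(x):
    return np.exp(-0.5 * (x - 2.0) ** 2) / np.sqrt(2.0 * np.pi)

# sample: 300 particles of species 1 and 200 of species 2
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(2.0, 1.0, 200)])

lam = np.array([250.0, 250.0])                 # starting values lambda_k^(0)
F = np.column_stack([f1(x), f2(x)])            # f_k(x_i)
for _ in range(200):
    denom = F @ lam                            # sum_k lambda_k f_k(x_i)
    lam = (F * lam).T @ (1.0 / denom)          # iteration rule for lambda_m
```

After the first step the estimates satisfy Σλ_k = N exactly, and the λ_m converge towards the MLE defined by (7.6).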

As a special case, let us assume that the cross section for a certain reaction is equal to g(x|θ). Then we get the p.d.f. by normalizing g:

\[
f(x|\theta) = \frac{g(x|\theta)}{\int g(x|\theta)\,dx} . \tag{7.7}
\]

The production rate λ is equal to the normalization factor multiplied by the luminosity S, which is a constant:

\[
\lambda(\theta) = S \int g(x|\theta)\,dx . \tag{7.8}
\]

The relations (7.7) and (7.8) have to be inserted into (7.4).

7.3 Comparison of Observations to a Monte Carlo Simulation

7.3.1 Motivation

Measurements usually suffer from event losses due to a limited acceptance and limited efficiency of the detectors, and from distortions due to the limited resolution of the detectors. Modern research in the natural sciences requires more and more complex experimental setups, with the consequence that these effects cannot be corrected for analytically. Therefore we simulate the data taking. The correction for the acceptance losses is straightforward, but the correction of smearing effects is more complex. The general problem of unfolding is treated in Chapter 9. In this section we concentrate on the problem of parameter inference from distorted data. We follow Ref. [37].

7.3.2 The Likelihood Function

The theoretical models are represented by Monte Carlo samples and the parameter inference is carried out by a comparison of experimental and simulated histograms of the observed variable x′. For d_i observed and m_i Monte Carlo events in bin i and a normalization parameter c_m, we get for the likelihood, instead of (7.1):

\[
\ln L = \sum_{i=1}^{B} \left( -c_m m_i + d_i \ln(c_m m_i) \right) . \tag{7.9}
\]

We assume that the statistical error of the simulation can be neglected, i.e. M ≫ N applies, with M simulated events and N observed events. In some rare cases the normalization c_m is known; if not, it is a free parameter in the likelihood fit. The parameters of interest are hidden in the Monte Carlo predictions m_i(θ).

7.3.3 The χ2 Approximation

If the number of entries in all bins is large enough to approximate the Poisson distribution by the normal distribution, we can as well minimize the corresponding χ² expression (7.3):

\[
\chi^2 = \sum_{i=1}^{B} \frac{(d_i - c_m m_i)^2}{c_m m_i} . \tag{7.10}
\]

The simulation programs usually consist of two different parts. The first
part describes the physical process which depends on the parameters of inter-
est. The second models the detection process. Both parts often require large
program packages, the so-called event generators and the detector simula-
tors. The latter usually consume considerable computing power. Limitations
in the available computing time then may result in non-negligible statistical
fluctuations of the simulation.

7.3.4 Weighting the Monte Carlo Observations

When we fit parameters, every parameter change obviously entails a modification of the Monte Carlo prediction. Now we do not want to repeat the full simulation with every fitting step. Apart from the fact that we want to avoid the computational effort, there is another, more important reason: with the χ² fit, we find the standard error interval by letting χ² vary by one unit. On the other hand, when we compare experimental data with an optimal simulation, we expect a contribution to χ² from the simulation of the order of \(\sqrt{2BN/M}\) for B histogram bins. Even with a simulation sample which is a hundred times larger than the data sample, this value is of the order of one. This means that a repetition of the simulation causes considerable fluctuations of the χ² value which have nothing to do with parameter changes. These fluctuations can only be reduced if the same Monte Carlo sample is used for all parameter values. We have to adjust the simulation to the modified parameters by weighting its observations.
Re-weighting also produces additional fluctuations. These, however, should be tolerable if the weights do not vary too much and if the Monte Carlo sample is much larger than the data sample. If we are not sure that this assumption is justified, we can verify it: we reduce the number of Monte Carlo observations and check whether the result remains stable. We know that the contribution of the simulation to the parameter errors scales with the inverse square root of the number of simulated events. Alternatively, we can also estimate the Monte Carlo contribution to the error by repeating the full estimation process with bootstrap samples, see Sect. 13.11.3.
The weights are computed in the following way: for each Monte Carlo observation x′ we know the true values x of the variates and the corresponding p.d.f. f(x|θ_0) for the parameter θ_0 which had been used in the generation. When we modify the parameter, we weight each observation by w(θ) = f(x|θ)/f(x|θ_0). The weighted distribution of x′ then describes the modified prediction.
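A sketch of this weighting step (Python; the exponential p.d.f., the Gaussian smearing and all numbers are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo generated once at theta0; illustrative exponential p.d.f.
# f(x|theta) = theta * exp(-theta * x)
theta0 = 1.0
x_true = rng.exponential(1.0 / theta0, 100_000)         # true variates x
x_obs = x_true + rng.normal(0.0, 0.1, x_true.size)      # smeared variable x'

def weights(theta):
    # w(theta) = f(x|theta) / f(x|theta0), evaluated at the true x
    return (theta / theta0) * np.exp(-(theta - theta0) * x_true)

# a weighted histogram of the observed variable x' predicts the smeared
# distribution for the modified parameter without re-simulating
h_pred, edges = np.histogram(x_obs, bins=50, range=(-0.5, 5.0),
                             weights=weights(1.2))
```

Since E[w] = 1 under f(x|θ_0), the sample of weights stays normalized on average, and the weighted distribution of x′ reproduces what a fresh simulation at the modified parameter would give, up to the additional fluctuations discussed above.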

7.3.5 Including the Monte Carlo Uncertainty

In rare cases it is necessary to include the statistical error of the Monte Carlo simulation. The formulas are derived in Appendix 13.10 and the problem is discussed in detail in Ref. [26]. We summarize here the relevant relations. The Monte Carlo prediction for a histogram bin is, up to a normalization constant, \(m_i = \sum_k w_{ik}\), where the sum runs over all K_i weights w_{ik} of the events of the bin. We omit in the following three formulas the bin index i. The quantities m̃, s and c̃_m have to be evaluated for each bin. We define a scaled number m̃,

\[
\tilde m = s m \quad \text{with} \quad s = \frac{\sum w_k}{\sum w_k^2} ,
\]

and a normalization constant c̃ specific for the bin,

\[
\tilde c_m = c_m / s .
\]

The χ² expression to be minimized with respect to θ and c_m is then

\[
\chi^2 = \sum_{i=1}^{B} \left[ \frac{(n - \tilde c_m \tilde m)^2}{\tilde c_m (n + \tilde m)} \right]_i .
\]

If resolution effects are absent and only acceptance losses have to be taken care of, all weights in bin i are equal, w_i. The above relation simplifies with K_i Monte Carlo entries in bin i to

\[
\chi^2 = \sum_{i=1}^{B} \frac{(n_i - c_m m_i)^2}{c_m w_i (n_i + K_i)} .
\]

7.3.6 Solution for a large number of Monte Carlo events

Statistical problems decrease with increasing event numbers, but computational requirements may increase. The numerical minimum search that is required to estimate the wanted parameters can become quite slow. It may happen that we have of the order of 10^6 or more simulated events. This means that for, say, 10^3 changes of a parameter value during the extremum search, 10^9 weights have to be computed. This is feasible, but we may want to speed up the fitting procedure. This can be achieved in situations where the Monte Carlo uncertainties can be neglected. We represent the prediction by a superposition of Monte Carlo histograms with factors that depend on the parameters. To this end it is useful to expand the p.d.f. f(x|θ) in a Taylor series in the parameter around some preliminary estimate θ_0:

\[
f(x|\theta) = f(x|\theta_0) + \Delta\theta \left. \frac{df(x|\theta)}{d\theta} \right|_{\theta_0} + \frac{(\Delta\theta)^2}{2!} \left. \frac{d^2 f(x|\theta)}{d\theta^2} \right|_{\theta_0} + \cdots \tag{7.11}
\]
\[
= f(x|\theta_0) \left[ 1 + \Delta\theta \, \frac{1}{f_0} \left. \frac{df(x|\theta)}{d\theta} \right|_{\theta_0} + \frac{(\Delta\theta)^2}{2!} \, \frac{1}{f_0} \left. \frac{d^2 f(x|\theta)}{d\theta^2} \right|_{\theta_0} + \cdots \right] . \tag{7.12}
\]

We generate events according to f_0(x) = f(x|θ_0) and obtain simulated events with the observed kinematic variable x′. We histogram x′ and obtain the histogram m_{0i}. Weighting each event by ω_1(x), we obtain the histogram m_{1i}, and weighting by ω_2(x) the histogram m_{2i}, with the weights

\[
\omega_1(x) = \frac{1}{f_0} \frac{df}{d\theta}(x|\theta_0) , \tag{7.13}
\]
\[
\omega_2(x) = \frac{1}{2 f_0} \frac{d^2 f}{d\theta^2}(x|\theta_0) . \tag{7.14}
\]

The parameter inference of ∆θ is performed by comparing m_i = m_{0i} + ∆θ m_{1i} + (∆θ)² m_{2i} with the experimental histogram d_i as explained in Sect. 7.3.3:

\[
\chi^2 = \sum_{i=1}^{B} \frac{(d_i - c m_i)^2}{c m_i} . \tag{7.15}
\]

In many cases the quadratic term can be omitted. In other situations it might be necessary to iterate the procedure.
We illustrate the method with two examples.

Example 99. Fit of the slope of a linear distribution with Monte Carlo correction
The p.d.f. is

\[
f(x|\theta) = \frac{1 + \theta x}{1 + \theta/2} , \quad 0 \le x \le 1 .
\]

We generate observations x uniformly distributed in the interval 0 ≤ x ≤ 1, simulate the experimental resolution and the acceptance, and histogram the distorted variable x′ into bins i to obtain the contents m_{0i}. The same observations are weighted by x and summed up to the histogram m_{1i}. These two distributions are shown in Fig. 7.2 a, b. The dotted histograms correspond to

Fig. 7.2. The superposition of two Monte Carlo distributions, a) flat and b) triangular, is adjusted to the experimental data.

the distributions before the distortion by the measurement process. Fig. 7.2 c also shows the experimental distribution. It should be possible to describe it by a superposition m_i of the two Monte Carlo distributions:

\[
d_i \sim m_i = m_{0i} + \theta m_{1i} . \tag{7.16}
\]

We optimize the parameter θ such that the histogram d_i is described, up to a normalization constant, as well as possible by a superposition of the two Monte Carlo histograms. We have to insert m_i from (7.16) into (7.9) and set \(c_m = N / \sum_i m_i\).
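A rough implementation of this example might look as follows (Python; the Gaussian smearing, the event numbers, the use of the χ² form (7.15) instead of (7.9), and the grid search are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n_mc, n_data, bins = 200_000, 5_000, 20

def smear(x):
    # illustrative detector resolution
    return x + rng.normal(0.0, 0.1, x.size)

# Monte Carlo: uniform in [0,1]; m0 unweighted, m1 weighted by the true x
x_mc = rng.random(n_mc)
xp_mc = smear(x_mc)
m0, _ = np.histogram(xp_mc, bins=bins, range=(0.0, 1.0))
m1, _ = np.histogram(xp_mc, bins=bins, range=(0.0, 1.0), weights=x_mc)

# "data": f(x|theta) = (1 + theta*x)/(1 + theta/2) with theta_true = 0.8,
# sampled by accept-reject and then smeared
theta_true = 0.8
cand = rng.random(40_000)
keep = rng.random(cand.size) * (1.0 + theta_true) < 1.0 + theta_true * cand
x_data = smear(cand[keep][:n_data])
d, _ = np.histogram(x_data, bins=bins, range=(0.0, 1.0))

def chi2(theta):
    m = m0 + theta * m1                    # superposition (7.16)
    t = m * d.sum() / m.sum()              # normalization c_m = N / sum(m_i)
    return np.sum((d - t) ** 2 / t)

grid = np.linspace(0.0, 2.0, 401)
theta_hat = grid[np.argmin([chi2(th) for th in grid])]
```

The same pair of histograms m0, m1 serves for all trial values of θ, so the simulation is never repeated during the fit.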

Fig. 7.3. Lifetime fit. The dotted histogram in b) is the superposition of the three histograms of a) with weights depending on ∆γ.

Example 100. Lifetime fit with Monte Carlo correction
We expand the p.d.f.

\[
f(t|\gamma) = \gamma e^{-\gamma t}
\]

in a Taylor series at γ_0, a first guess of the decay rate γ:

\[
f(t|\gamma) = \gamma_0 e^{-\gamma_0 t} \left[ 1 + \frac{\Delta\gamma}{\gamma_0}\,(1 - \gamma_0 t) + \left( \frac{\Delta\gamma}{\gamma_0} \right)^2 \left( -\gamma_0 t + \frac{\gamma_0^2 t^2}{2} \right) + \cdots \right] .
\]

The Monte Carlo simulation follows the distribution f_0 = γ_0 e^{−γ_0 t}. Weighting the events by (1/γ_0 − t) and (−t/γ_0 + t²/2), we obtain the distributions f_1 = (1 − γ_0 t) e^{−γ_0 t}, f_2 = (−t + γ_0 t²/2) e^{−γ_0 t} and

\[
f(t|\gamma) = f_0(t) + \Delta\gamma f_1(t) + (\Delta\gamma)^2 f_2(t) + \cdots .
\]

If it is justified to neglect the higher powers of ∆γ/γ_0, we can again describe our experimental distribution, this time by a superposition of three distributions f_0′(t′), f_1′(t′), f_2′(t′), which are the distorted versions of f_0, f_1, f_2. The parameter ∆γ is determined by a χ² or likelihood fit. In our special case it is even simpler to weight f_0 by t and t², respectively, and to superpose the corresponding distributions f_0, g_1 = t f_0, g_2 = t² f_0 with the factors given in the following expression:

\[
f(t|\gamma) = f_0(t) \left( 1 + \frac{\Delta\gamma}{\gamma_0} \right) - \gamma_0\, g_1(t) \left[ \frac{\Delta\gamma}{\gamma_0} + \left( \frac{\Delta\gamma}{\gamma_0} \right)^2 \right] + g_2(t)\, \gamma_0^2\, \frac{1}{2} \left( \frac{\Delta\gamma}{\gamma_0} \right)^2 .
\]

The parameter ∆γ is then modified until the correspondingly weighted sum of the distorted histograms agrees optimally with the data. Figure 7.3 shows an example. In case the quadratic term can be neglected, two histograms are sufficient. The general case is treated in an analogous manner. The Taylor expansion is

\[
f(\theta) = f(\theta_0) + \Delta\theta \frac{df}{d\theta}(\theta_0) + \frac{(\Delta\theta)^2}{2!} \frac{d^2 f}{d\theta^2}(\theta_0) + \cdots
= f(\theta_0) \left[ 1 + \Delta\theta \frac{1}{f_0} \frac{df}{d\theta}(\theta_0) + \frac{(\Delta\theta)^2}{2!} \frac{1}{f_0} \frac{d^2 f}{d\theta^2}(\theta_0) + \cdots \right] .
\]

The observations x′ of the distribution f_0(x|θ_0) provide the histogram m_0. Weighting with w_1 and w_2, where

\[
w_1 = \frac{1}{f_0} \frac{df}{d\theta}(x|\theta_0) , \qquad
w_2 = \frac{1}{2 f_0} \frac{d^2 f}{d\theta^2}(x|\theta_0) ,
\]

we obtain two further histograms m_{1i}, m_{2i}. The parameter inference of ∆θ is performed by comparing m_i = m_{0i} + ∆θ m_{1i} + (∆θ)² m_{2i} with the experimental histogram d_i. In many cases the quadratic term can be omitted. In other situations it might be necessary to iterate the procedure.

7.4 Parameter Estimation of a Signal Contaminated by Background

7.4.1 Introduction

Frequently, an interesting signal is located above a continuum produced by an uninteresting or unknown physics source. If this background follows Poisson statistics with known mean b_i in bin i of a histogram with B bins, we simply have to modify the expression (7.1) for the log-likelihood to

\[
\ln L = \sum_{i=1}^{B} \left[ -\left( t_i(\theta) + b_i \right) + d_i \ln\left( t_i(\theta) + b_i \right) \right] .
\]

The corresponding formula in the LS formulation is

\[
\chi^2 = \sum_{i=1}^{B} \frac{\left[ t_i(\theta) + b_i - d_i \right]^2}{t_i(\theta) + b_i} .
\]

A simple subtraction of the average background from the data d_i would underestimate the uncertainties. If we are lucky, we have independent experimental information about the background from a separate experiment; if not, we have to parametrize the background distribution.

7.4.2 Parametrization of the Background

We have to estimate the background from the shape of the distribution. For a signal consisting of a narrow peak, a possibility is to interpolate the background from the two sides of the peak and to subtract it. A common procedure is to use side bands. This method is rather crude, however. Instead, we should fit the peak together with a linear background distribution. The result is more precise and the error is automatically provided by the fit. Depending on the shape of the distribution, it may be necessary to adjust quadratic or higher order polynomials. There is no absolutely safe way to estimate the background. We have to accept that there are systematic uncertainties.

Example 101. Fit of the parameters of a peak above background
Fig. 7.4 shows a normally distributed peak superposed on a smooth background. The parameters of interest are the number of events α in the peak, its location µ and the corresponding standard deviation σ. We parametrize the background distribution by a quadratic polynomial and fit the parameters of the function

Fig. 7.4. Normally distributed signal contaminated by background.

 
\[
f(x) = \frac{\alpha}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) + \beta_0 + \beta_1 (x - 0.5) + \beta_2 (x - 0.5)^2
\]

to the observed histogram. The following table summarizes the results for linear and quadratic background subtractions and different ranges of x:

background   range          α̂          µ̂           σ̂           χ²/NDF
quadratic    [0.2, 0.8]     5033(122)   0.4994(10)   0.0515(11)   0.82
quadratic    [0.3, 0.7]     4975(242)   0.4996(10)   0.0512(16)   1.10
linear       [0.3, 0.7]     5131(104)   0.4996(10)   0.0521(10)   1.14
linear       [0.34, 0.66]   5165(133)   0.5006(11)   0.0524(12)   0.88
For the linear background subtraction the fitted amount of background and the width of the peak are larger than for the quadratic background interpolation. The quadratic background function leaves more freedom to the fit than the linear one, and consequently the errors become larger. We choose the conservative solution with quadratic background shape and narrow range, α̂ = 4975 ± 242, µ̂ = 0.4996 ± 0.0010, σ̂ = 0.0512 ± 0.0016. The error margins cover also the results of the linear background subtraction. Part of the errors is of systematic type, caused by the uncertainties in the background parametrization. The purely statistical errors can be estimated by fixing the parameters of the background function. They are δα = 73, δµ = 0.0008, δσ = 0.0008. As expected, the precision of the number of events suffers primarily from the uncertain shape of the background. As the statistical and the systematic errors squared add up to the total error squared, we can calculate the systematic contributions δα^{(sys)} = 231, δµ^{(sys)} = 0.0006, δσ^{(sys)} = 0.0014. Except for the location µ, the systematic errors dominate. If different parametrizations of the background produce significantly different results, the systematic error has to be increased. The values of χ² are acceptable in all cases. (The χ² goodness-of-fit test will be discussed in Chap. 10.) In our Monte Carlo experiment we know the true parameter values α = 5000, µ = 0.5, σ = 0.05. The linear fits underestimate the background contribution and therefore lead to too large values of σ.
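A possible numerical sketch of such a fit (Python; all numbers are invented, the background is linear rather than quadratic, and instead of a full nonlinear fit we exploit that the model is linear in α, β_0, β_1 once µ and σ are fixed, scanning the latter on a grid):

```python
import numpy as np

rng = np.random.default_rng(11)

# pseudo-data: Gaussian peak (5000 events, mu=0.5, sigma=0.05) on top of
# a uniform background (20000 events on [0,1])
x = np.concatenate([rng.normal(0.5, 0.05, 5_000), rng.random(20_000)])
d, edges = np.histogram(x, bins=50, range=(0.0, 1.0))
c = 0.5 * (edges[:-1] + edges[1:])
w = edges[1] - edges[0]

def peak(mu, sigma):
    # expected fraction of peak events per bin (bin width * density)
    return w * np.exp(-0.5 * ((c - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

best = (np.inf, 0.0, 0.0, 0.0)
wt = 1.0 / np.maximum(d, 1.0)              # ~ 1/var_i for Poisson bins
for mu in np.linspace(0.45, 0.55, 21):
    for sigma in np.linspace(0.03, 0.08, 26):
        # model alpha*peak_i + b0 + b1*(c_i-0.5) is linear in (alpha, b0, b1)
        A = np.column_stack([peak(mu, sigma), np.ones_like(c), c - 0.5])
        coef, *_ = np.linalg.lstsq(A * np.sqrt(wt)[:, None],
                                   d * np.sqrt(wt), rcond=None)
        chi2 = np.sum(wt * (d - A @ coef) ** 2)
        if chi2 < best[0]:
            best = (chi2, coef[0], mu, sigma)

alpha_hat, mu_hat, sigma_hat = best[1], best[2], best[3]
```

Separating the linear from the nonlinear parameters keeps the scan cheap; in practice one would instead hand the full five-parameter model to a standard minimizer.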

7.4.3 Histogram Fits with Separate Background Measurement

In rare cases we have the chance to record independently from a signal sample
also a reference sample containing pure background. The measuring times or
fluxes, i.e. the relative normalization of the two experiments are either known
or to be determined from the data distributions. In this lucky situation,
we do not need to parameterize the background distribution and thus are
independent of assumptions about its shape.
We introduce B additional parameters β_i for the unknown background prediction. The relative flux normalization c can either be known or be an unknown parameter in the fit. Our model predicts t_i(θ) + β_i for bin i of the signal histogram and β_i/c for the background histogram. Our LS statistic is

\[
\chi^2 = \sum_{i=1}^{B} \left\{ \frac{\left[ t_i(\theta) + \beta_i - d_i \right]^2}{t_i(\theta) + \beta_i} + \frac{(\beta_i/c - b_i)^2}{\beta_i/c} \right\}
\]

for Poisson distributed numbers d_i and b_i.


Especially in low statistics experiments it is better to avoid the normal approximation and to switch to the Poisson likelihood formalism. The log-likelihood is, up to constants,

\[
\ln L = \sum_{i=1}^{B} \left[ d_i \ln\left( t_i(\theta) + \beta_i \right) - \left( t_i(\theta) + \beta_i \right) + b_i \ln(\beta_i/c) - \beta_i/c \right] .
\]

7.4.4 The Binning-Free Likelihood Approach

If the number of events is very small, we may apply a binning-free likelihood fit, following a suggestion found in the Russian translation of the book by Eadie et al. [8], which was probably introduced by the Russian editors [38].

The idea behind the method is simple: the log-likelihood of the wanted signal parameter as derived for the full signal sample is a superposition of the log-likelihood of the genuine signal events and the log-likelihood of the background events. The latter can be estimated from the reference sample and subtracted from the full log-likelihood.

To illustrate the procedure, imagine we want to measure the signal response of a radiation detector by recording a sample of signal heights x_1, ..., x_N from a mono-energetic source. For a pure signal, the x_i would follow a normal distribution with resolution σ:

\[
f(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)} .
\]

The unknown parameter µ is to be estimated. After removing the source, we can – under identical conditions – take a reference sample x′_1, ..., x′_M of background events. They follow a distribution which is of no interest to us. If we knew which observations x_i in our signal sample were signal (x_i^{(S)}) and which were background (x_i^{(B)}) events, we could take only the S signal events and calculate the correct log-likelihood function, which is, up to constants,

\[
\ln L = \sum_{i=1}^{S} \ln f(x_i^{(S)}|\mu) = -\sum_{i=1}^{S} \frac{(x_i^{(S)} - \mu)^2}{2\sigma^2}
= -\sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2} + \sum_{i=1}^{B} \frac{(x_i^{(B)} - \mu)^2}{2\sigma^2} ,
\]

with S + B = N. The second, unknown term can be estimated from the control sample:

\[
\sum_{i=1}^{B} \frac{(x_i^{(B)} - \mu)^2}{2\sigma^2} \approx \sum_{i=1}^{M} \frac{(x'_i - \mu)^2}{2\sigma^2} .
\]

Our corrected log-likelihood thus becomes

\[
\ln \tilde L = -\sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^2} + \sum_{i=1}^{M} \frac{(x'_i - \mu)^2}{2\sigma^2} .
\]

We call it a pseudo log-likelihood, ln L̃, to distinguish it from a genuine log-likelihood. To obtain the estimate µ̂ of our parameter, we look for the parameter which maximizes ln L̃ and find the expected simple function of the mean values x̄, x̄′:

\[
\hat\mu = \frac{\sum_{i=1}^{N} x_i - \sum_{i=1}^{M} x'_i}{N - M}
= \frac{N \bar x - M \bar x'}{N - M} . \tag{7.17}
\]

signal sample background

30 30 reference sample

number of events
20 20

10 10

0 0
-4 -2 0 2 4 -4 -2 0 2 4

x x

Fig. 7.5. Experimental distributionof a normally distributed signal over back-


ground (left) and background reference sample (right). The lower histogram is
scaled to the signal flux.

The general problem, where the sample and parameter spaces may be multi-dimensional and the fluxes of the signal and the reference sample may differ, is solved in complete analogy to our example: given a contaminated signal distribution of size N and a reference distribution of size M with 1/r times the flux of the signal sample, we put

\[
\ln \tilde L = \sum_{i=1}^{N} \ln f(x_i|\theta) - r \sum_{i=1}^{M} \ln f(x'_i|\theta) . \tag{7.18}
\]

The range of x has to be covered by f(x|θ) and has to be independent of the parameter. Under these conditions the estimate of θ obtained from (7.18) is consistent, as shown in 13.5. The formula (7.18) is completely general and avoids histogramming, which is problematic for low event counts. The LS method with subtraction of a background histogram from the signal histogram often fails in such situations.

The shape of the pseudo likelihood itself cannot be used directly to estimate the parameter errors. They have to be determined by error propagation or alternatively with the bootstrap method, where we take a large number of bootstrap samples from the experimental distributions of both the signal and the control experiment and calculate the background corrected parameter estimate for each pair of samples, see Sect. 12.2.

Example 102. Fit of the parameters of a peak with a background reference sample
Fig. 7.5 shows an experimental histogram of a normally distributed signal of width σ = 1 contaminated by background, with altogether 95 events with mean x̄ = −0.61 and empirical variance v² = 3.00. The right hand side is the distribution of a background reference sample with 1/r = 2.5 times the flux of the signal sample, containing 91 events with mean x̄′ = −1.17 and variance v′² = 4.79. The mean of the signal is obtained from the flux corrected version of (7.17), which follows from (7.18):

\[
\hat\mu = \frac{N \bar x - r M \bar x'}{N - r M}
= \frac{95 \cdot (-0.61) - 0.4 \cdot 91 \cdot (-1.17)}{95 - 0.4 \cdot 91} = -0.26 \pm 0.33 .
\]

The error is estimated by linear error propagation. The result is indicated in Fig. 7.5. The distributions were generated with nominally 60 pure signal plus 40 background events and 100 background reference events. The signal corresponds to a normal distribution, N(x|0, 1), and the background to an exponential, ∼ exp(−0.2x).
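The flux corrected estimate can be sketched numerically as follows (Python; the uniform background shape, the seed and the event numbers are illustrative assumptions, not the configuration used in the figure):

```python
import numpy as np

rng = np.random.default_rng(2)

# signal sample: 60 signal events N(0,1) plus 40 background events;
# reference sample: pure background with 2.5x the flux, i.e. r = 0.4
def bkg(n):
    # illustrative background shape, uniform on [-4, 4]
    return rng.uniform(-4.0, 4.0, n)

x = np.concatenate([rng.normal(0.0, 1.0, 60), bkg(40)])
xp = bkg(100)
r = 0.4

N, M = x.size, xp.size
# flux corrected version of (7.17)
mu_hat = (N * x.mean() - r * M * xp.mean()) / (N - r * M)
```

With a symmetric background the correction is small here; for an asymmetric background such as the exponential of the example, the subtraction of the r-scaled reference mean is what removes the bias of the raw sample mean.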

A different method, where the shape of the background distribution is approximated by probability density estimation (PDE), will be given in Sect. 12.1.2.

7.5 Inclusion of Constraints

7.5.1 Introduction

The interesting parameters are not always independent of each other but are
often constrained by physical or geometrical laws.
As an example let us look at the decay of a Λ particle into a proton and
a pion, Λ → p + π, where the direction of flight of the Λ hyperon and
the momentum vectors of the decay products are measured. The momentum
vectors of the three particles which participate in the reaction are related
through the conservation laws of energy and momentum. Taking into account
the conservation laws, we add information and can improve the precision of
the momentum determination.
In the following we assume that we have N direct observations xi which
are predicted by functions ti (θ) of a parameter vector θ with P components
as well as K constraints of the form hk (θ) = 0. Let us assume further that

the uncertainties ∆_i of the observations are normally distributed and that the constraints are fulfilled with the precision δ_k,

\[
\left( t_i(\theta) - x_i \right)^2 = \Delta_i^2 , \qquad h_k^2(\theta) = \delta_k^2 . \tag{7.19}
\]

Then χ² can be written in the form

\[
\chi^2 = \sum_{i=1}^{N} \frac{\left[ x_i - t_i(\theta) \right]^2}{\Delta_i^2} + \sum_{k=1}^{K} \frac{h_k^2(\theta)}{\delta_k^2} . \tag{7.20}
\]

We minimize χ² by varying the parameters and obtain their best estimates at the minimum of χ². The procedure works also when the constraints contain more than N parameters, as long as the number of parameters P does not exceed the number of terms N + K. We assume that there is a single minimum.
A corresponding likelihood fit would maximize

\[
\ln L = \sum_{i=1}^{N} \ln f(x_i|\theta) - \frac{1}{2} \sum_{k=1}^{K} \frac{h_k^2(\theta)}{\delta_k^2} .
\]

In most cases the constraints are obeyed exactly, δ_k = 0, and the second term in (7.20) diverges. This difficulty is avoided in the following three procedures:
1. The constraints are used to reduce the number of parameters.
2. The constraints are approximated by narrow Gaussians and the limit δ_k → 0 is approached.
3. Lagrange multipliers are adjusted to satisfy the constraints.
We will discuss the problem in terms of a χ² minimization. The solutions can also be applied to likelihood fits.

7.5.2 Eliminating Redundant Parameters

Sometimes it is possible to eliminate parameters by expressing them through an unconstrained subset.

Example 103. Fit with constraint: Two pieces of a rope
A rope of exactly 1 m length is cut into two pieces. A measurement of both pieces yields l_1 = 35.3 cm and l_2 = 64.3 cm, both with the same Gaussian uncertainty of δ = 0.3 cm. We have to find the estimates λ̂_1, λ̂_2 of the lengths. We minimize
\[
\chi^2 = \frac{(l_1 - \lambda_1)^2}{\delta^2} + \frac{(l_2 - \lambda_2)^2}{\delta^2} ,
\]

including the constraint λ_1 + λ_2 = l = 100 cm. We simply replace λ_2 by l − λ_1 and adjust λ_1, minimizing

\[
\chi^2 = \frac{(l_1 - \lambda_1)^2}{\delta^2} + \frac{(l - l_2 - \lambda_1)^2}{\delta^2} .
\]

The minimization relative to λ_1 leads to the result

\[
\hat\lambda_1 = \frac{l}{2} + \frac{l_1 - l_2}{2} = 35.5 \pm 0.2\ \text{cm} ,
\]

and the corresponding estimate of λ_2 is just the complement of λ̂_1 with respect to the full length. Note that due to the constraint the error of λ_i is reduced by a factor √2, as can easily be seen from error propagation. The constraint has the same effect as a double measurement, but with the modification that now the results are (maximally) anti-correlated: one finds cov(λ_1, λ_2) = −var(λ_i).
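The example can be checked numerically; the sketch below (Python) scans χ² over λ_1 on a grid and reads the standard error off the interval where χ² exceeds its minimum by one unit:

```python
import numpy as np

l1, l2, delta, l_tot = 35.3, 64.3, 0.3, 100.0

def chi2(lam1):
    lam2 = l_tot - lam1                  # constraint eliminates lambda_2
    return ((l1 - lam1) ** 2 + (l2 - lam2) ** 2) / delta ** 2

grid = np.linspace(30.0, 40.0, 100_001)
c2 = chi2(grid)
lam_hat = grid[np.argmin(c2)]
# standard error from the chi2_min + 1 interval
inside = grid[c2 <= c2.min() + 1.0]
err = 0.5 * (inside[-1] - inside[0])
```

The scan reproduces λ̂_1 = 35.5 cm with error δ/√2 ≈ 0.21 cm, confirming the √2 reduction quoted above.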

Example 104. Fit of the particle composition of an event sample (2)
A particle identification variable x has different distributions f_m(x) for different particles. The p.d.f., given the relative particle abundances λ_m for particle species m out of M different particles, is

\[
f(x|\lambda_1, \ldots, \lambda_M) = \sum_{m=1}^{M} \lambda_m f_m(x) , \qquad \sum_{m=1}^{M} \lambda_m = 1 . \tag{7.21}
\]

As the constraint relation is linear, we can easily eliminate the parameter λ_M to get rid of the constraint Σλ_m = 1:

\[
f'(x|\lambda_1, \ldots, \lambda_{M-1}) = \sum_{m=1}^{M-1} \lambda_m f_m(x) + \left( 1 - \sum_{m=1}^{M-1} \lambda_m \right) f_M(x) .
\]

The log-likelihood for N particles is

\[
\ln L = \sum_{i=1}^{N} \ln \left[ \sum_{m=1}^{M-1} \lambda_m f_m(x_i) + \left( 1 - \sum_{m=1}^{M-1} \lambda_m \right) f_M(x_i) \right] .
\]

From the MLE we obtain in the usual way the first M − 1 parameters and their error matrix E. The remaining parameter λ_M and the related error matrix elements E_{Mj} are derived from the constraint (7.21) and the corresponding relation Σ ∆λ_m = 0. The diagonal error is the expected value of (∆λ_M)²:

\[
\Delta\lambda_M = -\sum_{m=1}^{M-1} \Delta\lambda_m ,
\]
\[
(\Delta\lambda_M)^2 = \left[ \sum_{m=1}^{M-1} \Delta\lambda_m \right]^2 ,
\]
\[
E_{MM} = \sum_{m=1}^{M-1} E_{mm} + \sum_{m=1}^{M-1} \sum_{l \neq m}^{M-1} E_{ml} .
\]

The remaining elements are computed analogously:

\[
E_{Mj} = E_{jM} = -E_{jj} - \sum_{m \neq j}^{M-1} E_{mj} .
\]

An iterative method, called channel likelihood, to find the particle contributions is given in [34].

These trivial examples are not really representative of the typical problems we have to solve in particle or astrophysics. Indeed, it is often complicated or even impossible to reduce the parameter set analytically to an unconstrained subset, but we can introduce a new unconstrained parameter set which then predicts the measured quantities. To find such a set is straightforward in the majority of problems: we just have to think how we would simulate the corresponding experimental process. A simulation is always based on a minimum set of parameters; the constraints are satisfied automatically.

Example 105. Kinematical fit with constraints: eliminating parameters
A neutral particle c decays into two charged particles a and b, for instance Λ → p + π⁻. The masses m_c, m_a, m_b are known. Measured are the decay vertex ρ and the momenta p_a, p_b of the decay products. The measurements of the components of the momentum vectors are correlated; let the inverse error matrices be V_a and V_b. The decaying particle is produced at the origin of the coordinate system. Thus we have 9 measurements (ρ, p_a, p_b), 10 parameters, namely the 3 momentum vectors and the distance (π_c, π_a, π_b, ρ), and 4 constraints from momentum and energy conservation:

\[
\boldsymbol{\pi}(\boldsymbol{\pi}_c, \boldsymbol{\pi}_a, \boldsymbol{\pi}_b) \equiv \boldsymbol{\pi}_c - \boldsymbol{\pi}_a - \boldsymbol{\pi}_b = 0 ,
\]
\[
\varepsilon(\boldsymbol{\pi}_c, \boldsymbol{\pi}_a, \boldsymbol{\pi}_b) \equiv \sqrt{\pi_c^2 + m_c^2} - \sqrt{\pi_a^2 + m_a^2} - \sqrt{\pi_b^2 + m_b^2} = 0 .
\]

The corresponding χ² expression is

\[
\chi^2 = \sum_{i=1}^{3} \left( \frac{r_i - \rho_i}{\delta_{r_i}} \right)^2
+ \sum_{i,j=1}^{3} (p_{ai} - \pi_{ai}) V_{aij} (p_{aj} - \pi_{aj})
+ \sum_{i,j=1}^{3} (p_{bi} - \pi_{bi}) V_{bij} (p_{bj} - \pi_{bj}) .
\]

A correlation of the cartesian components of the momenta of particles a


and b are taken into account by the weight matrices Va and Vb . The vertex
parameters ρi are fixed by the vector relation ρ = ρπc /|πc |. Now we would
like to remove 4 out of the 10 parameters using the 4 constraints. A Monte
Carlo simulation of the Λ decay would proceed as follows: First we would
select the Λ momentum vector (3 parameters). Next the decay length would
be generated (1 parameter). The decay of the Λ hyperon into proton and
pion is fully determined when we choose the proton direction in the lambda
center of mass system (2 parameters). All measured laboratory quantities
and thus also χ2 can then be expressed analytically by these 6 unconstrained
quantities (we omit here the corresponding relations) which are varied in the
fitting procedure until χ2 is minimal. Of course in the fit we would not select
random starting values for the parameters but the values which we compute
from the experimental decay length and the measured momentum vectors.
Once the reduced parameter set has been adjusted, it is straightforward
to compute also the remaining laboratory momenta and their errors which,
obviously, are strongly correlated.

Often the reduced parameter set is more relevant than the set correspond-
ing to the measurement, because a simulation usually is based on parameters
which are of scientific interest. For example, the investigation of the Λ decay
might have the goal to determine the Λ decay parameter which depends on
the center of mass direction of the proton relative to the Λ polarization, i.e.
on one of the directly fitted quantities.

7.5.3 Gaussian Approximation of Constraints


The direct inclusion of the constraint through a penalty term in the fit is
technically very simple and efficient.
We have to minimize:
\[
\chi^2 = \lim_{\delta_k \to 0,\; k=1,\dots,K} \left[ \sum_{i=1}^{N} \frac{[x_i - t_i(\theta)]^2}{\Delta_i^2} + \sum_{k=1}^{K} \frac{h_k^2(\theta)}{\delta_k^2} \right] . \qquad (7.22)
\]

The exact limit will not be obtained, but it is sufficient to choose the pa-
rameters δk small compared to the experimental resolution of the constraint.
Parameter estimation is performed by numerical approximation in computer
programs following methods like Simplex. The required precision is steered by
a parameter provided by the user. The parameter δ plays a similar role.
To estimate the resolution, the constraint is evaluated from the observed
data, h̃(x_1, . . . , x_N), and we require δ_k² ≪ h̃². The precise choice of the con-
straint precision δ_k is not at all critical, but extremely small values of δ_k could
lead to numerical problems. In case the minimum search is slow, or does not
converge, one should start initially with loose constraints which subsequently
could be tightened.
The value of χ2 in the major part of the parameter space is dominated
by the contributions from the constraint terms. In the minimum searching
programs the parameter point will therefore initially move quickly from its
starting value towards the subspace defined by the constraint equations and
then proceed towards the minimum of χ2 .
Note that the minimum of χ2 is found at parameter values that satisfy
the constraints much better than naively expected from the set constraint
tolerances. The reason is the following: Once the parameters are close to
their estimates, small changes which reduce the χ2 contribution of the penalty
terms, will not sizably affect the remaining terms. Thus the minimum will be
observed very close to hk = 0. As a consequence, the contribution of the K
constraint terms in (7.20) to the minimum value of χ2 is negligible.

Example 106. Example 103 continued


Minimizing

\[
\chi^2 = \frac{(l_1 - \lambda_1)^2}{\delta^2} + \frac{(l_2 - \lambda_2)^2}{\delta^2} + \frac{(\lambda_1 + \lambda_2 - l)^2}{(10^{-5}\delta)^2}
\]

produces the same result as the fit presented above. The value δ_k² = 10⁻¹⁰ δ²
is chosen small compared to δ².
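As a numerical illustration we can code this penalty fit directly; the sketch below (Python, with invented measured values l1 = 4.9, l2 = 5.2, l = 10 and δ = 0.1, and a tolerance of 10⁻³ δ rather than 10⁻⁵ δ to keep the arithmetic well conditioned) exploits that the χ² is quadratic, so its minimum solves a 2 × 2 linear system; a general minimizer like Simplex would find the same point.

```python
# Constraint via a narrow Gaussian penalty term (toy numbers, not from the text).
l1, l2, l = 4.9, 5.2, 10.0   # two measured pieces and the known total length
delta = 0.1                  # measurement error of l1 and l2
eps = 1e-3 * delta           # constraint tolerance, small compared to delta

# chi2(lam1, lam2) = (l1-lam1)^2/delta^2 + (l2-lam2)^2/delta^2
#                  + (lam1+lam2-l)^2/eps^2
# Setting both derivatives to zero gives a 2x2 linear system:
a = 1.0 / delta ** 2 + 1.0 / eps ** 2   # diagonal coefficient
b = 1.0 / eps ** 2                      # off-diagonal coefficient
r1 = l1 / delta ** 2 + l / eps ** 2
r2 = l2 / delta ** 2 + l / eps ** 2

det = a * a - b * b
lam1 = (r1 * a - r2 * b) / det
lam2 = (r2 * a - r1 * b) / det
```

The penalty drives λ̂1 + λ̂2 to l and reproduces λ̂1 = (l + l1 − l2)/2 to high accuracy, even though the tolerance is far from the exact limit δ_k → 0.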

For a numerical test we consider the decay of a Λ hyperon into a proton


and a pion and simplify example 105.

Example 107. Example 105 continued


To simplify the equations, we can fix the decay length z of the lambda
hyperon and the absolute value of its momentum because these two quantities
are not related to the constraint equations. In the simulation the Λ particle

moves along the z axis. The measured coordinates x, y are small compared to
the decay length r ≈ z which is fixed. The χ2 expression is

\[
\chi^2 = \frac{(x - \xi)^2}{\delta_x^2} + \frac{(y - \theta)^2}{\delta_y^2}
+ \sum_{i,j=1}^{3} (p_{ai} - \pi_{ai}) V_{aij} (p_{aj} - \pi_{aj})
+ \sum_{i,j=1}^{3} (p_{bi} - \pi_{bi}) V_{bij} (p_{bj} - \pi_{bj})
\]
\[
+ \frac{(\xi/z - \pi_{cx}/\pi_{cz})^2}{\delta_\alpha^2} + \frac{(\theta/z - \pi_{cy}/\pi_{cz})^2}{\delta_\alpha^2} + \frac{(m_{p\pi} - m_\Lambda)^2}{\delta_m^2} . \qquad (7.23)
\]

The first two terms of (7.23) compare the x and y components of the Λ
path vector with the corresponding parameters ξ and θ. The next two terms
measure the difference between the observed and the fitted momentum com-
ponents of the proton and the pion. The following two terms constrain the
direction of the Λ hyperon flight path to the direction of the momentum vec-
tor π = π a + π b and the last term constrains the invariant mass mpπ (π a , πb )
of the decay products to the Λ mass. We generate 104 events, all with the
same nominal parameter values but different normally distributed measure-
ment errors. The velocity of the Λ particle is parallel to the z axis with a
Lorentz factor γ = 9. The decay length is fixed to 1 m. The direction of the
proton in the Λ center of mass is defined by the polar and azimuthal an-
gles θ = 1.5, φ = 0.1. The measurement errors of the x and y coordinates
are δx = δy = 1 cm. The momentum error is assumed to be the sum of a
term proportional to the momentum p squared, δpr = 2p2 /(GeV )2 and a
constant term δp0 = 0.02 GeV added to each momentum component. The
tolerances for the constraints are δα = 0.001 and δm = 0.1 MeV, i.e. about
10−3 times the experimental uncertainty. The minimum search is performed
with a combination of a simplex and a parabolic minimum searching routine.
The starting values for the parameters are the measured values. The fit starts
with a typical value of χ²₀ of 2 × 10⁸ and converges for all events with a mean
value of χ² of 2.986 and a standard deviation of 2.446, to be compared to the
nominal values 3 and √6 ≈ 2.449. The contribution from each of the three
constraint terms to χ² is 10⁻⁴. Thus the deviation from the constraints is
about 10⁻⁷ times the experimental uncertainty.

7.5.4 The Method of Lagrange Multipliers


This time we choose the likelihood presentation of the problem. The likelihood
function is extended to
\[
\ln L = \sum_{i=1}^{N} \ln f(x_i|\theta) + \sum_k \alpha_k h_k(\theta) . \qquad (7.24)
\]

We have appended expressions that in the end should be equal to zero:
the constraint functions multiplied by the Lagrange multipliers. The MLE
obtained by setting ∂ ln L/∂θj = 0 yields parameters that depend on the
Lagrange multipliers α. We can now use the free parameters αk to fulfil
the constraints, or in other words, we use the constraints to eliminate the
Lagrange multiplier dependence of the MLE.

Example 108. Example 103 continued


Our full likelihood function is now
\[
\ln L = -\frac{(l_1 - \lambda_1)^2}{2\delta^2} - \frac{(l_2 - \lambda_2)^2}{2\delta^2} + \alpha(\lambda_1 + \lambda_2 - l)
\]
with the MLE λ̂₁,₂ = l₁,₂ − δ²α. Using λ̂₁ + λ̂₂ = l we find δ²α = (l₁ + l₂ − l)/2
and, as before, λ̂₁ = (l + l₁ − l₂)/2, λ̂₂ = (l + l₂ − l₁)/2.
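For this toy problem the Lagrange multiplier solution can be written down in closed form; a minimal Python check (same invented numbers as might be used for the constrained length measurement, following the sign convention of the text, λ̂ᵢ = lᵢ − δ²α):

```python
# Lagrange multiplier elimination for the two-lengths example (toy numbers).
l1, l2, l = 4.9, 5.2, 10.0   # invented measured lengths and known total
delta = 0.1                  # measurement error

# MLE at fixed multiplier: lam_i = l_i - delta^2 * alpha;
# inserting it into the constraint lam1 + lam2 = l fixes alpha:
alpha = (l1 + l2 - l) / (2.0 * delta ** 2)
lam1 = l1 - delta ** 2 * alpha
lam2 = l2 - delta ** 2 * alpha
```

The multiplier only serves to enforce the constraint; the estimates agree with those of the penalty method.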

Of course, in general the situation is much more complicated than that


of our trivial example. An analytic solution will hardly be possible. Instead
we can set the derivative of the log-likelihood not only with respect to the
parameters θ but also with respect to the multipliers αk equal to zero,
∂ ln L/∂αk = 0, which, as seen from (7.24), automatically implies that the constraints
are satisfied. Unfortunately, the zero of the derivative corresponds to a sad-
dle point and cannot be found by a maximum searching routine. More subtle
numerical methods have to be applied.
Most methods avoid this complication and limit themselves to linear re-
gression models which require a linear dependence of the observations on
the parameters and linear constraint relations. Non-linear problems are then
solved iteratively. The solution then is obtained by a simple matrix calculus.
Linear regression has been sketched in Sect. 6.7.1 and the inclusion of con-
straints in Appendix 13.13. For a detailed discussion see Ref. [39].

7.5.5 Conclusion

By far the simplest method is the one where the constraint is directly included
and approximated by a narrow Gaussian. With conventional minimizing pro-
grams the full error matrix is produced automatically.
The approach using a reduced parameter set is especially interesting when
we are primarily interested in the parameters of the reduced set. This is the
case in most kinematical fits. Due to the reduced dimension of the parameter
space, it is faster than the other methods. The determination of the errors of
the original parameters through error propagation is sometimes tedious, but
in most applications only the reduced set is of interest.

It is recommended to either eliminate redundant parameters or to use the


simple method where we represent constraints by narrow Gaussians. The ap-
plication of Lagrange multipliers is unnecessarily complicated and the linear
approximation requires additional assumptions and iterations.

7.6 Reduction of the Number of Variates


7.6.1 The Problem

A statistical analysis of a univariate sample is obviously much simpler than


that of a multidimensional one. This is not only true for the qualitative
comparison of a sample with a parameter dependent p.d.f. but also for the
quantitative parameter inference. Especially when the p.d.f. is distorted by
the measurement process and a Monte Carlo simulation is required, the di-
rect ML method cannot be applied as we have seen above. The parameter
inference then happens by comparing histograms with the problem that in
multidimensional spaces the number of entries can be quite small in some
bins. Therefore, we have an interest to reduce the dimensionality of the vari-
able space by appropriate transformations, of course, if possible, without loss
of information. However, it is not always easy to find out which variable or
which variable combination is especially important for the parameter estima-
tion.

7.6.2 Two Variables and a Single Linear Parameter

A p.d.f. f (x, y|θ) of two variates with a linear parameter dependence can
always be written in the form

f (x, y|θ) = v(x, y)[1 + u(x, y)θ] .

From the distribution g(u, v),

\[
g(u, v) = v\,(1 + u\theta)\, \frac{\partial(x, y)}{\partial(u, v)} ,
\]
we derive the log-likelihood of θ
\[
\ln L(\theta) = \sum_i \ln(1 + u_i \theta) + \mathrm{const.}
\]

which depends only on u.


According to the likelihood principle, the full information relative to the
parameter of interest is contained in the distribution of u. This property
is very convenient, because we can compare the data to a prediction in a
one-dimensional histogram. A ML fit can as well be performed in the x−y
space, but as soon as we have to compare the data to a simulation in
form of histograms, the reduction to one dimension simplifies the analysis
considerably.
The analytic variable transformation and reduction is possible only in
rare cases, but it is not necessary because it is performed implicitly by the
Monte Carlo simulation. To estimate θ the following recipe can be applied:
1. Compute for each observation xi , yi the variable ui = u(xi , yi ) and build
the histogram d of u.
2. Select two values θ1 , θ2 of the parameter, generate events, compute u and
construct histograms t1 (u), t2 (u).
3. Perform a LS fit of the superposition of the two simulated histograms to
the observed histogram,
\[
\chi^2 = \sum_i \frac{[d_i - (\alpha t_{1i} + \beta t_{2i})]^2}{\delta_i^2} ,
\]

with δi the error of the bracket in the numerator. Estimate α̂, β̂ and their
errors.
4. Compute
\[
\hat\theta = \frac{\hat\alpha \theta_1 + \hat\beta \theta_2}{\hat\alpha + \hat\beta} .
\]
The steps 3 and 4 can be combined. The two parameters α̂, β̂ can be elim-
inated and we obtain χ² as a function of θ and the normalization parameter
c:
\[
\chi^2 = \sum_i \frac{\left[ d_i - c\, \dfrac{(\theta - \theta_2)\, t_{1i} - (\theta - \theta_1)\, t_{2i}}{\theta_1 - \theta_2} \right]^2}{\delta_i^2} .
\]

Alternatively we can perform a Poisson likelihood fit, as in an example below


and apply the methods discussed in Sect. 7.1.
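The recipe can be illustrated with a small deterministic Python sketch: we take f(x|θ) = (1 + xθ)/2 on [−1, 1], i.e. u(x) = x, and fill the histograms with the exact expected bin contents instead of Monte Carlo samples (the binning and the θ values are invented for illustration). The least-squares step then recovers θ exactly.

```python
# Deterministic illustration of the histogram-template recipe (steps 1-4).
NBINS = 10
edges = [-1.0 + 2.0 * k / NBINS for k in range(NBINS + 1)]

def bin_contents(theta):
    """Expected content of each u bin for f(x|theta) = 0.5*(1 + x*theta)."""
    out = []
    for a, b in zip(edges, edges[1:]):
        out.append(0.5 * (b - a) + 0.25 * theta * (b * b - a * a))
    return out

theta1, theta2, theta_true = -1.0, 1.0, 0.3
t1 = bin_contents(theta1)        # "simulated" template for theta1
t2 = bin_contents(theta2)        # "simulated" template for theta2
d = bin_contents(theta_true)     # "observed" histogram

# least-squares fit d_i ~ alpha*t1_i + beta*t2_i (unit bin errors):
s11 = sum(x * x for x in t1)
s12 = sum(x * y for x, y in zip(t1, t2))
s22 = sum(y * y for y in t2)
b1 = sum(x * y for x, y in zip(d, t1))
b2 = sum(x * y for x, y in zip(d, t2))
det = s11 * s22 - s12 * s12
alpha = (b1 * s22 - b2 * s12) / det
beta = (b2 * s11 - b1 * s12) / det

theta_hat = (alpha * theta1 + beta * theta2) / (alpha + beta)
```

With real samples the bin contents fluctuate and the fit returns α̂, β̂ with errors, from which the error of θ̂ follows by propagation.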

7.6.3 Generalization to Several Variables and Parameters

The generalization to N variates which we combine to a vector x is trivial:

f (x|θ) = v(x) [1 + u(x)θ] .


Again we can reduce the variate space to a single significant variate u
without losing relevant information. If simultaneously P parameters have
to be determined, we usually will need also P new variates u1 , . . . , uP :
\[
f(x|\theta) = v(x) \left[ 1 + \sum_p u_p(x)\, \theta_p \right] .
\]

Thus our procedure makes sense only if the number of parameters is smaller
than the dimension of the variate space.


Fig. 7.6. Simulated p.d.f.s of the reduced variable u for the parameter values
θ = ±1.

Example 109. Reduction of the variate space


We consider the p.d.f.
\[
f(x, y, z|\theta) = \frac{1}{\pi} \left[ (x^2 + y^2 + z^2)^{1/2} + (x + y^3)\theta \right] , \quad x^2 + y^2 + z^2 \le 1 , \qquad (7.25)
\]
which depends on three variates and one parameter. For a given sample
of observations in the three dimensional cartesian space we determine the
parameter θ. The substitutions

\[
u = \frac{x + y^3}{(x^2 + y^2 + z^2)^{1/2}} , \quad |u| \le \sqrt{2} ,
\]
\[
v = (x^2 + y^2 + z^2)^{1/2} , \quad 0 \le v \le 1 ,
\]
\[
z = z
\]

lead to the new p.d.f. g′(u, v, z|θ),
\[
g'(u, v, z|\theta) = \frac{v}{\pi}\, [1 + u\theta]\, \frac{\partial(x, y, z)}{\partial(u, v, z)} ,
\]

which after integrating out v and z yields the p.d.f. g(u|θ):



\[
g(u|\theta) = \int dz\, dv\, g'(u, v, z|\theta) .
\]

This operation is not possible analytically but we do not need to compute g


explicitly. We are able to determine the MLE and its error from the simple
log-likelihood function of θ
\[
\ln L(\theta) = \sum_i \ln(1 + u_i \theta) .
\]

In case we have to account for acceptance effects, we have to simulate the u


distribution. For a Monte Carlo simulation of (7.25) we compute for each ob-
servation xi , yi , zi the value of ui and histogram it. The simulated histograms
g+ and g− of u for the two parameter values θ = ±1 are shown in Fig. 7.6.
(The figure does not include experimental effects. This is irrelevant for the
illustration of the method.) The superposition ti = (1 − θ)g−i + (1 + θ)g+i
has then to be inserted into the likelihood function (7.1).

7.6.4 Non-linear Parameters


The example which we just investigated is especially simple because the p.d.f.
depends linearly on a single parameter. Linear dependencies are quite fre-
quent because distributions often consist of a superposition of several pro-
cesses, and the interesting parameters are the relative weights of those.
For the general, non-linear case we restrict ourselves to a single parameter
to simplify the notation. We expand the p.d.f. into a Taylor series at a first
rough estimate θ0 :
\[
f(x|\theta) = f(x|\theta_0) + \frac{\partial f}{\partial \theta}\bigg|_{\theta_0} \Delta\theta + \frac{1}{2} \frac{\partial^2 f}{\partial \theta^2}\bigg|_{\theta_0} \Delta\theta^2 + \cdots
\]
\[
= f(x|\theta_0) \left[ 1 + u_1 \Delta\theta + u_2 \Delta\theta^2 + \cdots \right] . \qquad (7.26)
\]
As before, we choose the coefficients ui as new variates. Neglecting
quadratic and higher terms, the estimate θ̂ = θ₀ + Δθ̂ depends only on the
new variate u₁,
\[
u_1(x) = \frac{\partial f(x|\theta)/\partial\theta\,\big|_{\theta_0}}{f(x|\theta_0)} ,
\]
which is a simple function of x.
If the linear approximation is insufficient, a second variate u2 should be
added. Alternatively, the solution can be iterated. The generalization to sev-
eral parameters is straightforward.
A more detailed description of the method with application to a physics
process can be found in Refs. [40, 41]. The corresponding choice of the variate
is also known under the name optimal variable method [42].


Fig. 7.7. Measured lifetime distribution. The insert indicates the transformation
of the measured lifetime to the corrected one.

7.7 Approximated Likelihood Estimators


As in Sect. 7.3 we investigate the situation where we have to estimate pa-
rameters in presence of acceptance and resolution effects. The idea of the
method is the following: We fit the parameters ignoring the distortion and
obtain a biased estimate θ′ . The bias is then corrected based on a Monte
Carlo simulation which provides the relation θ(θ′ ) between the parameter of
interest θ and the observed quantity θ′ . The method should become clear in
the following example.

Example 110. Approximated likelihood estimator: Lifetime fit from a dis-


torted distribution
The sample mean t of a sample of N undistorted exponentially distributed
lifetimes ti is a sufficient estimator: It contains the full information related to
the parameter τ , the mean lifetime (see Sect. 6.5.1). In case the distribution
is distorted by resolution and acceptance effects (Fig. 7.7), the mean value
\[
\overline{t'} = \sum t_i' / N
\]

of the distorted sample t′i will usually still contain almost the full information
relative to the mean life τ . The relation τ (t′ ) between τ and its approximation

t′ (see insert of Fig. 7.7) is generated by a Monte Carlo simulation. The


uncertainty δτ is obtained by error propagation from the uncertainty δt′ of
t′ ,
\[
(\delta t')^2 = \frac{\overline{t'^2} - \overline{t'}^{\,2}}{N - 1} , \qquad \text{with } \overline{t'^2} = \frac{1}{N} \sum t_i'^{\,2} ,
\]
using the Monte Carlo relation τ (t′ ).
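The Monte Carlo calibration τ(t̄′) of this example can be sketched in a few lines of Python; the Gaussian resolution, the acceptance cut and all numbers below are invented for illustration.

```python
# Bias correction of the distorted sample mean via a Monte Carlo relation.
import random

random.seed(1)
SIGMA = 0.5    # assumed Gaussian time resolution
T_CUT = 1.0    # assumed acceptance: only observed times above T_CUT survive

def mean_observed(tau, n):
    """Distorted mean lifetime t-bar' for true mean life tau."""
    acc = []
    while len(acc) < n:
        t = random.expovariate(1.0 / tau) + random.gauss(0.0, SIGMA)
        if t > T_CUT:
            acc.append(t)
    return sum(acc) / n

# 1) Monte Carlo relation tau -> t-bar' on a grid of true lifetimes
taus = [1.0 + 0.1 * k for k in range(21)]
tbars = [mean_observed(tau, 20000) for tau in taus]

# 2) the "measurement": a distorted sample generated with tau = 2
t_obs = mean_observed(2.0, 20000)

# 3) invert the monotonic relation by linear interpolation
tau_hat = None
for ta, tb, a, b in zip(taus, taus[1:], tbars, tbars[1:]):
    if a <= t_obs <= b:
        tau_hat = ta + (tb - ta) * (t_obs - a) / (b - a)
        break
```

The corrected estimate τ̂ reproduces the true value within the statistical precision; its uncertainty follows from δt̄′ via the slope of the calibration curve.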

This approach has several advantages:


– We do not need to histogram the observations.
– Problems due to small event numbers for bins in a multivariate space are
avoided.
– It is robust, simple and requires little computing time.
For these reasons the method is especially suited for online applications,
provided that we find an efficient estimator.
If the distortions are not too large, we can use the likelihood estimator
extracted from the observed sample {x′1 , . . . , x′N } and the undistorted distri-
bution f (x|λ):
\[
L(\lambda) = \prod f(x_i'|\lambda) , \qquad \frac{dL}{d\lambda}\bigg|_{\hat\lambda'} = 0 . \qquad (7.27)
\]
This means concretely that we perform the usual likelihood analysis where
we ignore the distortion. We obtain λ̂′ . Then we correct the bias by a Monte
Carlo simulation which provides the relation λ̂(λ̂′ ).
It may happen in rare cases where the experimental resolution is very
bad that f (x|λ) is undefined for some extremely distorted observations. This
problem can be cured by scaling λ̂′ or by eliminating particular observations.
Acceptance losses α(x) alone without resolution effects do not necessarily
entail a reduction in the precision of our approach. For example, as has been
shown in Sect. 6.4.2, cutting an exponential distribution at some maximum
value of the variate, the mean value of the observations is still a sufficient
statistic. But there are cases where sizable acceptance losses have the con-
sequence that our method deteriorates. In these cases we have to take the
losses into account. We only sketch a suitable method. The observed p.d.f.
f ′ (x|λ) for the variate x is
\[
f'(x|\lambda) = \frac{\alpha(x)\, f(x|\lambda)}{\int \alpha(x)\, f(x|\lambda)\, dx} ,
\]

where the denominator is the global acceptance and provides the correct
normalization. We abbreviate it by A(λ). The log-likelihood of N observations
is
\[
\ln L(\lambda) = \sum \ln \alpha(x_i) + \sum \ln f(x_i|\lambda) - N \ln A(\lambda) .
\]
The first term can be omitted. The acceptance A(λ) can be determined by
a Monte Carlo simulation. Again a rough estimation is sufficient, at most it
reduces the precision but does not introduce a bias, since all approximations
are automatically corrected with the transformation λ(λ′ ).
Frequently, the relation (7.27) can only be solved numerically, i.e. we
find the maximum of the likelihood function in the usual manner. We are
also allowed to approximate this relation such that an analytic solution is
possible. The resulting error is compensated in the simulation.

Example 111. Approximated likelihood estimator: Linear and quadratic dis-


tributions
A sample of events xi is distributed linearly inside the interval [−1, 1], i.e.
the p.d.f. is f (x|b) = 0.5 + bx. The slope b , |b| < 1/2, is to be fitted. It is
located in the vicinity of b0 . We expand the likelihood function
\[
\ln L = \sum \ln(0.5 + b x_i)
\]

at b₀ with b = b₀ + β and differentiate with respect to β to find the value β̂
at the maximum:
\[
\sum \frac{x_i}{0.5 + (b_0 + \hat\beta)\, x_i} = 0 .
\]

Neglecting quadratic and higher order terms in β̂ we can solve this equation
for β̂ and obtain
\[
\hat\beta \approx \frac{\sum x_i / f_{0i}}{\sum x_i^2 / f_{0i}^2} \qquad (7.28)
\]
where we have set f0i = f (xi |b0 ). If we allow also for a quadratic term

f (x|a, b) = a + bx + (1.5 − 3a)x2 ,

we write, in obvious notation,

f (x|a, b) = f0 + α(1 − 3x2 ) + βx

and get, after differentiating ln L with respect to α and β and linearizing,
two linear equations for α̂ and β̂:
\[
\hat\alpha \sum A_i^2 + \hat\beta \sum A_i B_i = \sum A_i ,
\]
\[
\hat\alpha \sum A_i B_i + \hat\beta \sum B_i^2 = \sum B_i , \qquad (7.29)
\]

with the abbreviations Ai = (1 − 3x_i²)/f0i , Bi = xi /f0i . From the observed


data using (7.29) we get β̂ ′ (x′ ), α̂′ (x′ ), and the simulation provides the pa-
rameter estimates b̂(β̂ ′ ), â(α̂′ ) and their uncertainties.

The calculation is much faster than a numerical minimum search and


almost as precise. If α̂, β̂ are large we have to iterate.
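The linearized slope estimator (7.28) is easily coded; the Python sketch below uses invented values for the true slope, the starting value b0 and the sample size, draws the sample by inverting the CDF of f(x|b) = 0.5 + bx, and iterates the step once as suggested above.

```python
# Approximated likelihood estimator for the slope of a linear p.d.f.
import math
import random

random.seed(7)
b_true, b0, N = 0.25, 0.20, 20000   # invented slope, rough estimate, sample size

def sample(b):
    """Draw x from f(x|b) = 0.5 + b*x on [-1, 1] by inverting the CDF."""
    u = random.random()
    disc = 0.25 - 2.0 * b * (0.5 - 0.5 * b - u)
    return (-0.5 + math.sqrt(disc)) / b

xs = [sample(b_true) for _ in range(N)]

b_hat = b0
for _ in range(2):                  # linearized step (7.28), iterated once
    f0 = [0.5 + b_hat * x for x in xs]
    beta = (sum(x / f for x, f in zip(xs, f0))
            / sum(x * x / (f * f) for x, f in zip(xs, f0)))
    b_hat = b_hat + beta
```

No numerical minimum search is needed; the estimate agrees with the full ML fit within its statistical error.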

7.8 Nuisance Parameters


Frequently a p.d.f. f (x|θ, ν) contains several parameters from which only
some, namely θ, are of interest, while the other parameters ν are not, but
influence the estimate of the former. Those are called nuisance parameters.
A typical example is the following.

Example 112. Nuisance parameter: Decay distribution with background


We want to infer the decay rate γ of a certain particle from the decay times
ti of a sample of N events. Unfortunately, the sample contains an unknown
amount of background. Let the decay rate γb of the background particles be
known. The nuisance parameter is the number of background events η. For
a fraction of background events of η/N , the p.d.f. for a single event with
lifetime t is
\[
f(t|\gamma, \eta) = \left(1 - \frac{\eta}{N}\right) \gamma e^{-\gamma t} + \frac{\eta}{N}\, \gamma_b e^{-\gamma_b t} , \quad \eta \le N ,
\]
from which we derive the likelihood for the sample:
\[
L(\gamma, \eta) = \prod_{i=1}^{N} \left[ \left(1 - \frac{\eta}{N}\right) \gamma e^{-\gamma t_i} + \frac{\eta}{N}\, \gamma_b e^{-\gamma_b t_i} \right] .
\]

A contour plot of the log-likelihood of a specific data sample of 20 events and


γb = 0.2 is depicted in Fig. 7.8. The two parameters γ and η are correlated.
The question is then: What do we learn about γ, what is a sensible point
estimate of γ and how should we determine its uncertainty?

We will re-discuss this example in the next subsection and present in the
following some approaches which permit to eliminate the nuisance parameters.
First we will investigate exact methods and then we will turn to the
more problematic part where we have to apply approximations.

7.8.1 Nuisance Parameters with Given Prior

If we know the p.d.f. π(ν) of a nuisance parameter vector ν, the prior of ν,


then we can eliminate ν simply by integrating it out, weighting ν with its
probability π(ν) to occur.
\[
f_\theta(x|\theta) = \int f(x|\theta, \nu)\, \pi(\nu)\, d\nu .
\]

In this way we obtain a p.d.f. depending solely on the parameters of interest


θ. The corresponding likelihood function of θ is
\[
L_\theta(\theta|x) = \int L(\theta, \nu|x)\, \pi(\nu)\, d\nu = \int f(x|\theta, \nu)\, \pi(\nu)\, d\nu . \qquad (7.30)
\]

Example 113. Nuisance parameter: Measurement of a Poisson rate with a


digital clock
An automatic monitoring device measures a Poisson rate θ with a digital
clock counting in units of ∆. For n observed reactions within a time interval ν
the p.d.f. is given by the Poisson distribution Pθν (n). If we consider both, the
rate parameter θ and the length of the time interval ν as unknown parameters,
the corresponding likelihood function is
\[
L(\theta, \nu) = \frac{e^{-\theta\nu}\, [\theta\nu]^n}{n!} .
\]
For a clock reading t0 , the true measurement time is contained in the time
interval t0 ± ∆/2. We can assume that all times ν within that interval are
equally probable and thus the prior of ν is π(ν) = 1/∆ for ν in the interval
[t0 − ∆/2 , t0 + ∆/2] and equal to zero elsewhere. We eliminate constant
factors, and, integrating over ν,
\[
L_\theta(\theta) = \int_{t_0 - \Delta/2}^{t_0 + \Delta/2} e^{-\theta\nu}\, [\theta\nu]^n\, d\nu ,
\]

we get rid of the nuisance parameter.
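The remaining one-dimensional integral is easily evaluated numerically; a Python sketch (the counts n, the clock reading t0 and the granularity Δ are invented for illustration) integrates with the trapezoidal rule and scans for the maximum of the marginal likelihood.

```python
# Marginal likelihood of a Poisson rate with an uncertain measurement time.
import math

n, t0, delta = 5, 10.0, 1.0   # invented: counts, clock reading, clock granularity

def L(theta, steps=200):
    """Trapezoidal integral of exp(-theta*nu) * (theta*nu)^n over the clock bin."""
    a, b = t0 - delta / 2.0, t0 + delta / 2.0
    h = (b - a) / steps
    s = 0.5 * (math.exp(-theta * a) * (theta * a) ** n
               + math.exp(-theta * b) * (theta * b) ** n)
    for k in range(1, steps):
        nu = a + k * h
        s += math.exp(-theta * nu) * (theta * nu) ** n
    return s * h

# scan for the maximum of the marginal likelihood
thetas = [0.3 + 0.001 * k for k in range(401)]
theta_hat = max(thetas, key=L)
```

For Δ ≪ t0 the correction relative to the naive estimate θ̂ = n/t0 is tiny.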




Fig. 7.8. Log-likelihood contour as a function of the decay rate and the number of
background events. For better visualization the discrete values of the event numbers
are connected.

Example 114. Nuisance parameter: Decay distribution with background sam-


ple
Let us resume the problem discussed in the introduction. We now assume
that we have prior information on the amount of background: The background
expectation had been determined in an independent experiment to be 10
with sufficient precision to neglect its uncertainty. The actual number of
background events follows a binomial distribution. The likelihood function is
\[
L(\gamma) = \sum_{\eta=0}^{20} B_{0.5}^{20}(\eta) \prod_{i=1}^{20} \left[ \left(1 - \frac{\eta}{20}\right) \gamma e^{-\gamma t_i} + \frac{\eta}{20}\, 0.2\, e^{-0.2 t_i} \right] .
\]

Since our nuisance parameter η is discrete, we have replaced the integration


in (7.30) by a sum.

7.8.2 Factorizing the Likelihood Function

Very easy is the elimination of the nuisance parameter if the p.d.f. is of the
form

f (x|θ, ν) = fθ (x|θ)fν (x|ν) , (7.31)


i.e. only the first factor fθ depends on θ. Then we can write the likelihood as
a product
\[
L(\theta, \nu) = L_\theta(\theta)\, L_\nu(\nu)
\]
with
\[
L_\theta = \prod f_\theta(x_i|\theta) ,
\]
independent of the nuisance parameter ν.

Example 115. Elimination of a nuisance parameter by factorization of a two-


dimensional normal distribution
A sample of space points (xi , yi ), i = 1, . . . , N follows a normal distribu-
tion
 
\[
f(x, y|\theta, \nu) = \frac{ab}{2\pi} \exp\left[ -\frac{1}{2}\left( a^2 (x - \theta)^2 + b^2 (y - \nu)^2 \right) \right]
\]
\[
= \frac{ab}{2\pi} \exp\left[ -\frac{a^2}{2} (x - \theta)^2 \right] \exp\left[ -\frac{b^2}{2} (y - \nu)^2 \right] .
\]

with θ the parameter which we are interested in. The normalized x distribu-
tion depends only on θ. Whatever value ν takes, the shape of this distribution
remains always the same. Therefore we can estimate θ independently of ν.
The likelihood function is proportional to a normal distribution of θ,
\[
L(\theta) \sim \exp\left[ -\frac{a^2}{2} (\theta - \hat\theta)^2 \right] ,
\]
with the estimate θ̂ = x̄ = Σ xᵢ /N.

7.8.3 Parameter Transformation, Restructuring [19]

Sometimes we manage by means of a parameter transformation ν ′ = ν ′ (θ, ν)


to bring the p.d.f. into the desired form (7.31) where the p.d.f. factorizes
into two parts which depend separately on the parameters θ and ν ′ . We
have already sketched an example in Sect. 4.4.7: When we are interested in
the slope θ and not in the intersection ν with the y-axis of a straight line
y = θx + ν which should pass through measured points, then we are able to
eliminate the correlation between the two parameters. To this end we express
the equation of the straight line by the slope and the ordinate at the center
of gravity, see Example 93 in Sect. 6.4.5.
A simple transformation ν ′ = c1 ν +c2 θ also helps to disentangle correlated
parameters of a Gaussian likelihood


Fig. 7.9. Log-likelihood function of an absorption factor.

\[
L(\theta, \nu) \sim \exp\left( -\frac{a^2 (\theta - \hat\theta)^2 - 2ab\rho\, (\theta - \hat\theta)(\nu - \hat\nu) + b^2 (\nu - \hat\nu)^2}{2(1 - \rho^2)} \right) .
\]

With suitably chosen constants c₁, c₂ it produces a likelihood function that
factorizes in the new parameter pair θ, ν′. In the notation where the quantities
θ̂, ν̂ maximize the likelihood function, the transformation produces the result
\[
L_\theta(\theta) \sim \exp\left( -\frac{a^2}{2} (\theta - \hat\theta)^2 \right) .
\]

We omit the proof of this assertion.


It turns out that this procedure yields the same result as simply integrat-
ing out the nuisance parameter and as the profile likelihood method which we
will discuss below. This is an important observation in the following respect:
In many situations the likelihood function is nearly of Gaussian shape. As is
shown in Appendix 13.3, the likelihood function approaches a Gaussian with
increasing number of observations. Therefore, integrating out the nuisance
parameter, or better to apply the profile likelihood method, is a sensible ap-
proach in many practical situations. Thus nuisance parameters are a problem
only if the sample size is small.
The following example is frequently discussed in the literature [19].

Example 116. Elimination of a nuisance parameter by restructuring: absorp-


tion measurement

The absorption factor θ for radioactive radiation by a plate is determined


from the numbers of events r1 and r2 , which are observed with and with-
out the absorber within the same time intervals. The numbers r1 , r2 follow
Poisson distributions with mean values ρ1 and ρ2 :

\[
f_1(r_1|\rho_1) = \frac{e^{-\rho_1} \rho_1^{r_1}}{r_1!} , \qquad
f_2(r_2|\rho_2) = \frac{e^{-\rho_2} \rho_2^{r_2}}{r_2!} .
\]
The interesting parameter is the expected absorption θ = ρ2 /ρ1 . In first
approximation we can use the estimates r1 , r2 of the two independent pa-
rameters ρ1 and ρ2 and their errors to calculate in the usual way through
error propagation θ and its uncertainty:
\[
\hat\theta = \frac{r_2}{r_1} , \qquad \frac{(\delta\hat\theta)^2}{\hat\theta^2} = \frac{1}{r_1} + \frac{1}{r_2} .
\]
For large numbers r1 , r2 this method is justified but the correct way is to
transform the parameters ρ1 , ρ2 of the combined distribution

\[
f(r_1, r_2|\rho_1, \rho_2) = \frac{e^{-(\rho_1 + \rho_2)}\, \rho_1^{r_1} \rho_2^{r_2}}{r_1!\, r_2!}
\]
into the independent parameters θ = ρ2 /ρ1 and ν = ρ1 + ρ2 . The transfor-
mation yields:

\[
\tilde f(r_1, r_2|\theta, \nu) = e^{-\nu} \nu^{r_1 + r_2}\, \frac{(1 + 1/\theta)^{-r_2} (1 + \theta)^{-r_1}}{r_1!\, r_2!} ,
\]
\[
L(\theta, \nu|r_1, r_2) = L_\nu(\nu|r_1, r_2)\, L_\theta(\theta|r_1, r_2) .
\]

Thus the log-likelihood function of θ is

ln Lθ (θ|r1 , r2 ) = −r2 ln(1 + 1/θ) − r1 ln(1 + θ) .

It is presented in Fig. 7.9 for the specific values r1 = 10, r2 = 20. The
maximum is located at θ̂ = r2 /r1 , as obtained with the simple estimation
above. However the errors are asymmetric.
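The asymmetric errors can be read off a simple scan of ln Lθ; the Python sketch below uses the values r1 = 10, r2 = 20 from the text and takes the error interval where the log-likelihood has dropped by 1/2 from its maximum.

```python
# Asymmetric errors of the absorption factor from the restructured likelihood.
import math

r1, r2 = 10, 20   # counts as in the text

def lnL(theta):
    return -r2 * math.log(1.0 + 1.0 / theta) - r1 * math.log(1.0 + theta)

grid = [0.5 + 0.001 * k for k in range(5001)]   # theta from 0.5 to 5.5
theta_hat = max(grid, key=lnL)                  # maximum at r2/r1 = 2
target = lnL(theta_hat) - 0.5                   # Delta(lnL) = 1/2 limits
inside = [t for t in grid if lnL(t) >= target]
err_minus = theta_hat - inside[0]
err_plus = inside[-1] - theta_hat
```

The interval is visibly asymmetric, err_plus > err_minus, in contrast to the symmetric result of naive error propagation.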

Example 117. Eliminating a nuisance parameter by restructuring: Slope of a


straight line with the y-axis intercept as nuisance parameter
We come back to one of our standard examples which can, as we have
indicated, be solved by a parameter transformation. Now we solve it in a
simpler way. Points (xi , yi ) are distributed along a straight line. The x co-
ordinates are exactly known, the y coordinates are the variates. The p.d.f.
f (y1 , . . . , yn |θ, ν) contains the slope parameter θ and the uninteresting inter-
cept ν of the line with the y axis. It is easy to recognize that the statistic
{ỹ1 = y1 −yn , ỹ2 = y2 −yn , . . . , ỹn−1 = yn−1 −yn } is independent of ν. In this
specific case the new statistic is also sufficient relative to the slope θ which
clearly depends only on the differences of the ordinates. We leave the details
of the solution to the reader.

Further examples for the elimination of a nuisance parameter by restruc-


turing have been given already in Sect. 6.4.2, Examples 79 and 80.

7.8.4 Conditional Likelihood

This method is closely related to the restructuring method.


In case we can find a sufficient statistic S of the nuisance parameter ν,
we may condition f (x|θ, ν) on S. If S does not depend on θ, we can fix ν to
the value required to satisfy S.

Example 118. Fitting the width of a normal distribution with the mean as
nuisance parameter
The sample mean x̄ of the measurements is a sufficient statistic for the mean
µ of the normal distribution N (x|µ, σ). We can replace µ by x̄ in the Gaussian
and are left with the wanted parameter only, see also Example 80.
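A minimal Python sketch of this conditional fit (the data values are invented): the mean µ is replaced by the observed x̄ and the remaining one-parameter likelihood is scanned in σ; the scan reproduces the closed-form maximum √(Σ(xᵢ − x̄)²/N).

```python
# Conditional likelihood: fit sigma of a Gaussian with mu fixed to the sample mean.
import math

xs = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.4]   # invented sample
n = len(xs)
xbar = sum(xs) / n
ssq = sum((x - xbar) ** 2 for x in xs)

def lnL(sigma):
    """Log-likelihood of N(x | xbar, sigma), mu replaced by the sufficient statistic."""
    return -n * math.log(sigma) - ssq / (2.0 * sigma ** 2)

grid = [0.05 + 0.0005 * k for k in range(2000)]
sigma_hat = max(grid, key=lnL)

closed = math.sqrt(ssq / n)   # closed-form maximum of the conditional likelihood
```

The scan is of course unnecessary here, but the same pattern applies when no closed form exists.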

If S(ν, θ) depends on θ, we can again condition on S, but in this


situation, we prefer to switch to the profile likelihood which is described in
the following subsection.

7.8.5 Profile Likelihood

We now turn to approximate solutions.


Some scientists propose to replace the nuisance parameter by its estimate.
This corresponds to a delta function for the prior of the nuisance parameter
and is for that reason quite exotic and dangerous. It leads to an illegitimate
reduction of the error limits whenever the nuisance parameter and the in-
teresting parameter are correlated. Note that a correlation always exists


Fig. 7.10. Profile likelihood (solid curve, right hand scale) and ∆(ln L) = −1/2,
θ − ν contour (left-hand scale). The dashed curve is ν̂(θ).

unless a factorization is possible. In the extreme case of full correlation the


error would shrink to zero.
A much more sensible approach to eliminate the nuisance parameter uses
the so-called profile likelihood [43]. To explain it, we choose an example with
a single nuisance parameter.
The likelihood function is maximized with respect to the nuisance pa-
rameter ν as a function of the wanted parameter θ. The function ν̂(θ) which
maximizes L then satisfies the relation
\[
\frac{\partial L(\theta, \nu|x)}{\partial \nu}\bigg|_{\hat\nu} = 0 \;\rightarrow\; \hat\nu(\theta) .
\]
It is inserted into the likelihood function and provides the profile likelihood Lₚ,
\[
L_p = L\left(\theta, \hat\nu(\theta)|x\right) ,
\]
which depends solely on θ.
This method has the great advantage that only the likelihood function enters and no assumptions about priors have to be made. It also takes correlations into account. Graphically, we can visualize the error interval of the profile likelihood, ∆ ln Lp = −1/2, by drawing the tangents to the curve ∆ ln L = −1/2 parallel to the ν axis. These tangents enclose the standard error interval.

Example 119. Profile likelihood, absorption measurement
We reformulate the absorption example 116 with the nuisance parameter ρ1 and the parameter of interest θ = ρ2/ρ1. The log-likelihood, up to constants, is

ln L(ρ1, θ) = −ρ1(1 + θ) + (r1 + r2) ln ρ1 + r2 ln θ .

The value of ρ1 that maximizes L as a function of θ is ρ̂1 = (r1 + r2)/(1 + θ) and the profile likelihood becomes

ln Lp(θ) = −(r1 + r2) ln(1 + θ) + r2 ln θ .

The function ρ̂1, the profile likelihood and the one standard deviation error contour are depicted in Fig. 7.10. The result coincides with that of the exact factorization. (In the figure the nuisance parameter is denoted by ν.)

In the literature we find methods which orthogonalize the parameters at the maximum of the likelihood function [44], which amounts to diagonalizing a multi-dimensional Gaussian. The result is similar to that of the profile likelihood approach. Errors derived from the profile likelihood are computed in the program MINUIT [51].
In the limit of a large number of observations where the likelihood function
approaches the shape of a normal distribution, the profile likelihood method is
identical to restructuring and factorizing the likelihood. The profile likelihood
is not a genuine likelihood. For instance, it does not always have the property
that the product of the likelihoods of subsamples is equal to the likelihood
of the full sample.

7.8.6 Resampling Methods

The point estimate is a statistic that depends on the input data. The uncertainties of the data determine the error that we can associate with the estimate. We distinguish between two input situations: a) we are given a set of i.i.d. observations; b) we have measurements with associated error distributions. In the first case we can apply the bootstrap method, in the second we resample the input variables from the error distribution. The simulated data can be used to generate distributions of the wanted parameter from which moments and confidence limits can be derived.

Bootstrap Resampling

Fig. 7.11. Histogram generated by resampling, compared to a likelihood solution (curve).

This method will be sketched in Sect. 12.2. We draw observations randomly and with replacement from the given set x1, x2, . . . , xN and obtain a bootstrap sample x∗1, x∗2, . . . , x∗N. (The bootstrap sample may contain the same observation
several times.) We use the bootstrap sample to estimate θ, ν. The procedure is repeated many times and produces a distribution of θ from which we can derive arbitrary moments and confidence intervals. The bootstrap method also permits estimating the uncertainties of the estimates.
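As a minimal sketch of the procedure just described (our own illustration; the exponential sample and the sample mean as estimator are invented assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical i.i.d. sample: 200 exponential decay times with true mean 1.0.
sample = rng.exponential(scale=1.0, size=200)

def estimator(x):
    return x.mean()                       # MLE of the mean lifetime

# Draw bootstrap samples with replacement and collect the estimates.
B = 2000
boot = np.array([estimator(rng.choice(sample, size=sample.size, replace=True))
                 for _ in range(B)])

std_error = boot.std(ddof=1)              # bootstrap estimate of the standard error
low, up = np.percentile(boot, [16, 84])   # central 68% interval of the estimates
print(std_error, low, up)
```

The spread of the bootstrap estimates approximates the sampling distribution of the estimator; here it should be close to the analytic standard error τ̂/√N.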

Error Propagation by Resampling

For given distributions of the measurements, we can simulate new measurements and for each generated set derive the point estimate. Similar to the bootstrap method, we obtain a distribution of the parameter of interest which permits deriving errors and moments. If only the standard deviation of the measurements is given, we may approximate the error distribution by a normal distribution.

Example 120. Eliminating a nuisance parameter by resampling: absorption measurement
We resume Example 116. Let the observed numbers of events with and without absorber be n10 = 40 and n20 = 80. We generate 10⁶ pairs of Poisson distributed numbers n1 and n2 with mean values n10 and n20 and form each time the ratio θ̂ = n1/n2. The result is displayed in Fig. 7.11 and compared to the likelihood function derived from n10 and n20. The histogram is normalized to the likelihood function. The bin-to-bin fluctuations of the histogram reflect the discrete nature of the Poisson distribution.
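A short simulation of this kind might look as follows (our own sketch; seed and number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(7)

n10, n20 = 40, 80                # observed counts with and without absorber
trials = 10**6

# Resample the two Poisson counts and form the ratio each time.
n1 = rng.poisson(n10, size=trials)
n2 = rng.poisson(n20, size=trials)
ok = n2 > 0                      # guard against the (very rare) case n2 = 0
theta = n1[ok] / n2[ok]

print(theta.mean(), theta.std())
```

A histogram of `theta` reproduces the distribution of Fig. 7.11; its standard deviation agrees with the linear propagation estimate θ √(1/n10 + 1/n20).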

7.8.7 Integrating out the Nuisance Parameter

If the methods fail which we have discussed so far, we are left with only two possibilities: either we give up the elimination of the nuisance parameter or we integrate it out. The simple integration

Lθ(θ|x) = ∫_{−∞}^{+∞} L(θ, ν|x) dν

implicitly contains the assumption of a uniform prior for ν and therefore depends to some extent on the validity of this assumption. However, in most cases it is a reasonable approximation. The effect of varying the prior is usually negligible, except when the likelihood function is very asymmetric. A linear term in the prior usually does not matter either. It is interesting to note that in many cases integrating out the nuisance parameter assuming a uniform prior leads to the same result as restructuring the problem.

7.8.8 Explicit Declaration of the Parameter Dependence

It is not always possible to eliminate the nuisance parameter in such a way that the influence of the method on the result can be neglected. When the likelihood function has a complex structure, we are obliged to document the full likelihood function. In many cases it is possible to indicate the dependence of the estimate θ̂ and its error limits θ1, θ2 on the nuisance parameter ν explicitly by a simple linear function,

θ̂ = θ̂0 + c(ν − ν̂) ,
θ1,2 = θ1,2,0 + c(ν − ν̂) .

Usually the error limits will show the same dependence as the MLE which
means that the width of the interval is independent of ν.
However, publishing a dependence of the parameter of interest on the
nuisance parameter is useful only if ν corresponds to a physical constant and
not to an internal parameter of an experiment like efficiency or background.

7.8.9 Recommendation

If it is impossible to eliminate the nuisance parameter explicitly and if the


shape of the likelihood function does not differ dramatically from that of a
Gaussian, the profile likelihood approach should be used for the parameter

and interval estimation. In case the deviation from a Gaussian is considerable,


the dependence of the ML estimate of the parameter of interest and its error
on the nuisance parameter should be given. It is always sensible to publish
the full likelihood function of both the wanted and the unwanted parameters,
either graphically or in the form of a table. If enough data are available, the bootstrap method provides a straightforward way to estimate the standard error of the parameter of interest and its distribution.
8 Interval Estimation

In Chap. 4 we have introduced the error calculus based on probability theory. In principle, error estimation is an essential part of statistics and of similar importance as parameter estimation. Measurements result from point estimation of one or several parameters, measurement errors from interval¹ estimation. These two parts form an ensemble and have to be defined in a consistent way.
As we have already mentioned, the notation measurement error used by scientists is somewhat misleading; more precise is the term measurement uncertainty. In the field of statistics the common term is confidence intervals, an expression which is often restricted to the specific frequentist intervals introduced by Neyman, which are sketched in the Appendix.
It is in no way obvious how we ought to define error or confidence intervals, and this is why statisticians have very different opinions on this subject. There are various conventions in different fields of physics, and particle physicists have not yet adopted a common solution.
Let us start with a wish list which summarizes the properties we would like to realize in the single parameter case. The extrapolation to several parameters is straightforward.
1. The interval is a connected region.
2. The error interval should represent the mean square spread of measure-
ments around the true parameter value. In allusion to the corresponding
probability term, we talk about standard deviation errors.
3. Error intervals should contain the wanted true parameter with a fixed
probability.
4. For a given probability, the interval should be as short as possible.
5. The definition has to be consistent, i.e. observations containing identical
information about the parameters should lead to identical intervals. More
precise measurements should have shorter intervals than less precise ones.
The error interval has to contain the point estimate.
6. Error intervals should be invariant under transformation of the estimated
parameter.

¹ The term interval is not restricted to a single dimension. In n dimensions it describes an n-dimensional volume.

7. The computation of the intervals should be independent of subjective


model dependent assumptions.
8. A consistent method for the combination of measurements and for error
propagation has to exist.
9. The error intervals should contain the true parameter value in a fixed
fraction of measurements.
10. The definition has to be simple and transparent.
11. The definition should be independent of the dimension of the parameter
space.
Unfortunately it is impossible to fulfil all these conditions simultaneously, since they partially contradict each other. We will have to set priorities, and sometimes we will have to use ad hoc solutions which are justified only by experience and common sense. Under all circumstances we will satisfy point 5, i.e. consistency. As far as possible, we will follow the likelihood principle and derive the interval limits solely from the likelihood function.
We distinguish between three different interval definitions:
– Coverage intervals: the true value of the parameter is contained in a fixed fraction of a large number of identical experiments.
– Likelihood ratio intervals: the interval is limited by a surface of fixed likelihood ratio.
– Credible intervals: a prior probability for the parameter is chosen and limits are derived from the resulting p.d.f. of the parameter.
It turns out that not always the same procedure is optimal for the interval estimation. For instance, if we measure the size or the weight of an object, precision is the dominant requirement, i.e. properties denoting the reliability or reproducibility of the data. Here, a quantity like the variance, corresponding to the mean quadratic deviation, is appropriate to describe the error or uncertainty intervals. In contrast, limits, for instance on the mass of a hypothetical particle like the Higgs particle, serve to verify theoretical predictions. Here the dominant aspect is probability, and we talk about confidence or credibility intervals². Confidence intervals are usually defined such that they contain a parameter with high probability, e.g. 90% or 95%, while error intervals comprise one standard deviation or something equivalent. The exact calculation of the standard deviation, as well as that of the probability that a parameter is contained inside an interval, requires the knowledge of its p.d.f., which depends not only on the likelihood function but in addition on the prior density, which in most cases is unknown. To introduce a subjective prior, however, is something which we want to avoid.
The coverage requirement 9 is sometimes relevant in classification procedures or if the same parameter is measured for a number of different objects

² The term credibility interval is used for Bayesian intervals.

and if in addition the measurement is biased. For example, particles produced in high energy reactions have predominantly low momenta. If this feature is taken into account by a prior density, the estimated momenta are biased toward low momenta. High momentum particles, which are especially interesting from the physics point of view, may be lost or even excluded. There exist different definitions of coverage intervals, no relation to point estimation exists, and consequently a consistent combination of measurements is not possible. Since coverage does not play a significant role in the large majority of the issues of particle physics, we will not discuss it further, with the exception of a short section in Appendix 13.7.
When we measure a constant of nature several times, coverage is not
relevant. Instead of contemplating the fact that, say, two thirds of the mea-
surements contain the true value, we would rather combine the results. It is
not possible to associate a probability to likelihood ratio intervals, except in
the limit of infinite statistics.
In the first part of this chapter we treat standard error intervals. In the
second part we will deal mainly with limits on important parameters and
hypothetical quantities, like masses of SUSY particles. There it is sometimes
sensible to include prior densities.

8.1 Error Intervals


In Sect. 6.4.1 we have defined the statistical error limits through the likelihood
ratio which decreases within one standard deviation by a factor e1/2 from
the maximum. This definition is invariant against variable transformations.
This means in the one-dimensional case that for a parameter λ(θ) which is a
monotonic function of θ that the limits λ1 , λ2 , θ1 , θ2 fulfill the relations λ1 =
λ(θ1 ) and λ2 = λ(θ2 ). It does not matter whether we write the likelihood as a
function of θ or of λ. The limits permit to parametrize the likelihood function
and thus to combine results from different experiments. It is consistent with
the point estimate (MLE).
In large experiments usually there are many different effects which influ-
ence the final result and consequently also many different independent sources
of uncertainty, most of which are of the systematic type. Systematic errors
(see Sect. 4.3) such as calibration uncertainties can only be treated in the
Bayesian formalism. We have to estimate their p.d.f. or at least a mean value
and a standard deviation.

8.1.1 Parabolic Approximation

The error assignment is problematic only for small samples. As is shown in


Appendix 13.3, the likelihood function approaches a Gaussian with increasing
size of the sample. At the same time its width decreases, and we can neglect
possible variations of the prior density in the region where the likelihood is

significant. Under this condition we obtain a normally distributed p.d.f. for


the parameters. The standard deviation error includes the parameter with probability 68.3% (see Sect. 4.6). The log-likelihood then is parabolic and the error interval corresponds to the region within which it decreases from its maximum by a value of 1/2, as we have already fixed it previously. This situation is certainly realized for the large majority of all measurements which are published in the Particle Data Book [31].
In the parabolic approximation the MLE and the expectation value co-
incide, as well as the likelihood ratio error squared and the variance. Thus
we can also derive the standard deviation δθ from the curvature of the like-
lihood function at its maximum. For a single parameter we can approximate
the likelihood function by the expression

− ln Lpar = (1/2) V (θ − θ̂)² + const. (8.1)

Consequently, a change of ln Lpar by 1/2 corresponds to the second derivative of ln L at θ̂:

(δθ)² = V⁻¹ = − (d² ln L/dθ²)⁻¹ |θ̂ .
For several parameters the parabolic approximation can be expressed by

− ln Lpar = (1/2) Σi,j (θi − θ̂i) Vij (θj − θ̂j) + const.

We obtain the symmetric weight matrix³ V from the derivatives

Vij = − ∂² ln L/∂θi ∂θj |θ̂

and the covariance or error matrix from its inverse, C = V⁻¹.


If we are interested only in part of the parameters, we can eliminate the remaining nuisance parameters by simply omitting the part of the covariance matrix which contains the corresponding elements. This is a consequence of the considerations of Sect. 7.8.
In most cases the likelihood function is not known analytically. Usually, we
have a computer program which provides the likelihood function for arbitrary
values of the parameters. Once we have determined the maximum, we are
able to estimate the second derivative and the weight matrix V computing
the likelihood function at parameter points close to the MLE. To ensure that
the parabolic approximation is valid, we should increase the distance of the
points and check whether the result remains consistent.
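This numerical determination of the error from the curvature can be sketched as follows. The code is our own illustration: the exponential log-likelihood and the values N = 100, t̄ = 2 are invented, and the step size is varied to check the parabolic approximation, as recommended above.

```python
import numpy as np

# Illustrative log-likelihood: N exponential decay times with sample mean t_bar.
N, t_bar = 100, 2.0

def log_l(tau):
    return -N * (np.log(tau) + t_bar / tau)

tau_hat = t_bar                   # analytic MLE for this likelihood

def std_error(h):
    # Second derivative by central differences; (delta)^2 = -1/(d2 lnL/dtau2)
    d2 = (log_l(tau_hat + h) - 2.0 * log_l(tau_hat) + log_l(tau_hat - h)) / h**2
    return np.sqrt(-1.0 / d2)

# The result should be stable under a change of the step size.
print(std_error(1e-2), std_error(1e-1))
```

For this likelihood the analytic answer is δτ = t̄/√N = 0.2; the two step sizes agree with it to well below a percent, confirming the parabolic approximation.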
³ It is also called Fisher information.

In the literature we frequently find statements like “The measurement excludes the theoretical prediction by four standard deviations.” These kinds

of statements have to be interpreted with caution. Their validity relies on the


assumption that the log-likelihood is parabolic over a very wide parameter
range. Neglecting tails can lead to completely wrong conclusions. We have also
to remember that for a given number of standard deviations the probability
decreases with the number of dimensions (see Tab. 4.1 in Sect. 4.6).
In the following section we address more problematic situations which
usually occur with small data samples where the asymptotic solutions are
not appropriate. Fortunately, they are rather the exception. We keep in mind
that a relatively rough estimate of the error often is sufficient such that
approximate methods in most cases are justified.

8.1.2 General Situation

As above, we again use the likelihood ratio to define the error limits which
now usually are asymmetric. In the one-dimensional case the two errors δ−
and δ+ satisfy

ln L(θ̂) − ln L(θ̂ − δ− ) = ln L(θ̂) − ln L(θ̂ + δ+ ) = 1/2 . (8.2)

If the log-likelihood function deviates considerably from a parabola it makes


sense to supplement the one standard deviation limits ∆ ln L = −1/2 with the
two standard deviation limits ∆ ln L = −2 to provide a better documentation
of the shape of the likelihood function. This complication can be avoided if we
can obtain an approximately parabolic likelihood function by an appropriate
parameter transformation. In some situations it is useful to document in
addition to the mode of the likelihood function and the asymmetric errors, if
available, also the mean and the standard deviation which are relevant, for
instance, in some cases of error propagation which we will discuss below.

Example 121. Error of a lifetime measurement


We return to one of our standard examples. The likelihood function for the mean lifetime τ of a particle from a sample of N observed decay times is

Lτ = ∏_{i=1}^{N} (1/τ) e^{−ti/τ} = (1/τ^N) e^{−N t̄/τ} . (8.3)

The corresponding likelihood for the decay rate λ is

Lλ = ∏_{i=1}^{N} λ e^{−λti} = λ^N e^{−N t̄λ} .

The values of the functions are equal at equivalent values of the two parameters τ and λ, i.e. for λ = 1/τ:

Lλ(λ) = Lτ(τ) .

Fig. 8.1 shows the two log-likelihoods for a small sample of ten events with mean value t̄ = 0.5. The lower curves for the parameter τ are strongly asymmetric. This is also visible in the limits for changes of the log-likelihood by 0.5 or 2 units, which are indicated in the cut-outs on the right hand side. The likelihood with the decay rate as parameter (upper figures) is much more symmetric than that of the mean life. This means that the decay rate is the more appropriate parameter to document the shape of the likelihood function, to average different measurements and to perform error propagation, see below. On the other hand, we can of course transform the maximum likelihood estimates and errors of the two parameters into each other without knowing the likelihood function itself.

Generally, it does not matter whether we use one or the other parameter
to present the result but for further applications it is always simpler and
more precise to work with approximately symmetric limits. For this reason
usually 1/p (p is the absolute value of the momentum) instead of p is used as
parameter when charged particle trajectories are fitted to the measured hits
in a magnetic spectrometer.
In the general case we satisfy conditions 4 to 7, 10 and 11 of our wish list, but 2, 3, 8 and 9 are only approximately valid. We can neither associate an exact probability content to the intervals, nor do the limits correspond to moments of a p.d.f.

8.2 Error Propagation


In many situations we have to evaluate a quantity which depends on one or
several measurements with individual uncertainties. We thus have a problem
of point estimation and of interval estimation. We look for the parameter
which is best supported by the different measurements and determine its
uncertainty. Ideally, we are able to construct the likelihood function. In most
cases this is not necessary and approximate procedures are adequate.

8.2.1 Averaging Measurements

In Chap. 4 we have shown that the mean of measurements with Gaussian errors δi which are independent of the measurements is given by the weighted sum of the individual measurements (4.6), with weights proportional to the inverse errors squared, 1/δi². In case the errors are correlated with the measurements, which occurs frequently with small event numbers, this procedure introduces a bias (see Example 56 in Chap. 4). From (6.7) we conclude that

Fig. 8.1. Likelihood functions for the parameters decay rate (top) and lifetime (bottom). The standard deviation limits are shown in the cut-outs on the right hand side.

the exact method is to add the log-likelihoods of the individual measure-


ments. Adding the log-likelihoods is equivalent to combining the raw data as
if they were obtained in a single experiment. There is no loss of information
and the method is not restricted to specific error conditions.

Example 122. Averaging lifetime measurements


N experiments quote lifetimes τ̂i ± δi of the same unstable particle. The estimates and their errors are computed from the individual measurements tij of the i-th experiment according to τ̂i = Σ_{j=1}^{ni} tij/ni and δi = τ̂i/√ni, where ni is the number of observed decays. We can reconstruct the individual log-likelihood functions and their sum ln L, with n = Σ_{i=1}^{N} ni the overall event number:

ln L(τ) = Σ_{i=1}^{N} [−ni (ln τ + τ̂i/τ)]
= −n ln τ − (Σ ni τ̂i)/τ

with the maximum at

τ̂ = (Σ ni τ̂i)/n

and its error

δ = τ̂/√n .
The individual measurements are weighted by their event numbers, instead of by weights proportional to 1/δi². As the errors are correlated with the measurements, the standard weighted mean (4.6) with weights proportional to 1/δi² would be biased. In our specific example the correlation of the errors with the parameter values is known and we could use weights proportional to (τi/δi)².
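A small numeric sketch (our own illustration; the quoted lifetimes and event numbers are invented) contrasts the likelihood-based average with the naive weighted mean:

```python
import numpy as np

# Hypothetical published lifetimes tau_i with their event numbers n_i.
tau = np.array([1.10, 0.95, 1.25])
n = np.array([50, 200, 80])

# Adding the log-likelihoods weights each result by its event number.
tau_comb = np.sum(n * tau) / np.sum(n)
delta = tau_comb / np.sqrt(np.sum(n))

# Naive weighted mean with w_i = 1/delta_i^2 = n_i/tau_i^2, biased toward
# measurements that fluctuated low:
w = n / tau**2
tau_naive = np.sum(w * tau) / np.sum(w)
print(tau_comb, delta, tau_naive)
```

The naive mean comes out below the likelihood average, illustrating the bias caused by errors that are correlated with the measured values.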

Example 123. Averaging ratios of Poisson distributed numbers
In absorption measurements and many other situations we are interested in a parameter which is the ratio of two numbers which follow the Poisson distribution. Naively averaging these ratios θ̂i = mi/ni with the weighted mean (4.6) can lead to strongly biased results. Instead we add the log-likelihood functions which we have derived in Sect. 7.8.3,

ln L = Σi [mi ln θ − (ni + mi) ln(1 + θ)]
= m ln θ − (n + m) ln(1 + θ) ,

with m = Σ mi and n = Σ ni. The MLE is θ̂ = m/n, and the error limits have to be computed numerically in the usual way or, for not too small n, m, by linear error propagation: δθ²/θ² = 1/n + 1/m.

In the common situation where we do not know the full likelihood func-
tion but only the MLE and the error limits, we have to be content with an
approximate procedure. If the likelihood functions which have been used to
extract the error limits are parabolic, then the standard weighted mean (4.6)
is exactly equal to the result which we obtain when we add the log-likelihood
functions and then extract the estimate and the error.
Proof: A sum of terms of the form (8.1) can be written in the following way:

(1/2) Σ Vi (θ − θi)² = (1/2) Ṽ (θ − θ̃)² + const.

Since the right hand side is the most general form of a polynomial of second order, a comparison of the coefficients of θ² and θ yields

Ṽ = Σ Vi ,
Ṽ θ̃ = Σ Vi θi ,

that is just the weighted mean including its error. Consequently, we should
aim at approximately parabolic log-likelihood functions when we present ex-
perimental results. Sometimes this is possible by a suitable choice of the
parameter. For example, we are free to quote either the estimate of the mass
or of the mass squared.

8.2.2 Approximating the Likelihood Function

We also need a method to average statistical data with asymmetric errors if the exact shape of the likelihood function is not known. To this end we try to reconstruct the log-likelihood functions approximately, add them, and extract the parameter which maximizes the sum as well as the likelihood ratio errors. The approximation has to satisfy the constraints that the derivative at the MLE is zero and that the error relation (8.2) holds.
The simplest parametrization uses two different parabola branches,

− ln L(θ) = (1/2) (θ − θ̂)²/δ±² (8.4)

with

δ± = (1/2) δ+ [1 + sgn(θ − θ̂)] + (1/2) δ− [1 − sgn(θ − θ̂)] ,

i.e. the parabolas meet at the maximum and obey (8.2). Adding functions of this type produces again a piecewise parabolic function which fixes the mean value and its asymmetric errors. The solution for both the mean value and the limits is unique.
Parametrizations [52] varying the width σ of a parabola linearly or quadratically with the parameter are usually superior to the simple two-branch approximation. We set

− ln L(θ) = (1/2) [(θ − θ̂)/σ(θ)]²

with

σ(θ) = 2δ+δ−/(δ+ + δ−) + [(δ+ − δ−)/(δ+ + δ−)] (θ − θ̂) (8.5)

Fig. 8.2. Asymmetric likelihood functions and parametrizations.

or

σ²(θ) = δ+δ− + (δ+ − δ−)(θ − θ̂) , (8.6)
respectively. The log-likelihood function has poles at locations of θ where the
width becomes zero, σ(θ) = 0. Thus our approximations are justified only in
the range of θ which excludes the corresponding parameter values.
In Fig. 8.2 we present four typical examples of asymmetric likelihood functions. The log-likelihood function of the mean life of four exponentially distributed times is shown in Fig. 8.2a. Fig. 8.2b is the corresponding log-likelihood function of the decay rate⁴. Figs. 8.2c, d have been derived by a parameter transformation from normally distributed observations, where the new parameter is one over the mean in one case and the square root of the mean⁵ in the other case. A method which is optimal for all cases does not exist. All three approximations fit very well inside the one standard deviation limits. Outside, the two parametrizations (8.5) and (8.6) are superior to the two-parabola approximation.

⁴ The likelihood function of a Poisson mean has the same shape.
⁵ An example of such a situation is a fit of a particle mass from normally distributed mass squared observations.
We propose to use one of the two parametrizations (8.5, 8.6) but to be
careful if σ(θ) becomes small.
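As an illustration (our own sketch, not part of the original text), the following code combines two invented measurements with asymmetric errors using parametrization (8.5); a simple grid scan stands in for a proper minimizer:

```python
import numpy as np

def neg_log_l(theta, theta_hat, d_plus, d_minus):
    # Parametrization (8.5): parabola width varying linearly with theta.
    s = (2.0 * d_plus * d_minus
         + (d_plus - d_minus) * (theta - theta_hat)) / (d_plus + d_minus)
    return 0.5 * ((theta - theta_hat) / s) ** 2

# Two hypothetical results: 1.2 +0.4 -0.3 and 1.5 +0.3 -0.2.
grid = np.linspace(0.5, 3.0, 100001)
total = neg_log_l(grid, 1.2, 0.4, 0.3) + neg_log_l(grid, 1.5, 0.3, 0.2)

i = int(np.argmin(total))
theta_comb = grid[i]
inside = grid[total <= total[i] + 0.5]    # Delta ln L = -1/2 interval
low, up = inside[0], inside[-1]
print(theta_comb, low, up)
```

Note that the scan range must stay away from the points where σ(θ) reaches zero, in line with the warning above.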

8.2.3 Incompatible Measurements

Before we rely on a mean value computed from the results of different ex-
periments we should make sure that the various input data are statistically
compatible. What we mean by compatible is not obvious at this point. It
will become clearer in Chap. 10, where we discuss significance tests which
lead to the following plausible procedure that has proven to be quite useful
in particle physics [31].
We compute the weighted mean value θ̃ of the N results and form the sum of the quadratic deviations of the individual measurements from their average, normalized to their expected errors squared:

χ² = Σ (θi − θ̃)²/δi² .

The expectation value of this quantity is N − 1 if the deviations are normally distributed with variances δi². If χ² is sizably (e.g. by 50%) higher than N − 1, then we can suspect that at least one of the experiments has published a wrong value or, what is more likely, has underestimated the error, for instance when systematic errors have not been detected. Under the premise that none of the experiments can be discarded a priori, we scale up all declared errors by a common scaling factor S = √(χ²/(N − 1)) and publish this factor together with the mean value and the scaled error. Large scaling factors indicate problems in one or several experiments. After scaling, χ² has the expected value of the χ² distribution with N − 1 degrees of freedom. A similar procedure is applied if the errors are asymmetric, even though the condition of normality then obviously is violated. We form
obviously is violated. We form
χ² = Σ (θi − θ̃)²/δi±² ,

where δi+ and δi−, respectively, are used for θi < θ̃ and θi > θ̃.
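For symmetric errors the full procedure reads as follows (our own sketch; the four input values are invented):

```python
import numpy as np

# Hypothetical measurements of the same quantity with symmetric errors.
theta = np.array([1.05, 0.98, 1.30, 1.02])
delta = np.array([0.05, 0.04, 0.06, 0.05])

w = 1.0 / delta**2
mean = np.sum(w * theta) / np.sum(w)
err = 1.0 / np.sqrt(np.sum(w))

chi2 = np.sum((theta - mean) ** 2 / delta**2)
ndf = len(theta) - 1
scale = np.sqrt(chi2 / ndf)               # scale factor S
if scale > 1.0:                           # scale up the error only if S > 1
    err *= scale
print(mean, err, scale)
```

Here the third measurement pulls χ² well above N − 1, so the combined error is inflated by S, signalling a possible underestimated systematic error.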

8.2.4 Error Propagation for a Scalar Function of a Single


Parameter

If we have to propagate the MLE and its error limits of a parameter θ to


another parameter θ′ = θ′ (θ), we should apply the direct functional relation
which is equivalent to a transformation of the likelihood function:
θ̂′ = θ′(θ̂) ,
θ̂′ + δ+′ = θ′(θ̂ + δ+) ,
θ̂′ − δ−′ = θ′(θ̂ − δ−) .

Here we have assumed that θ′(θ) is monotonically increasing. If it is decreasing, the arguments of θ′ have to be interchanged.
The errors of the output quantity are asymmetric either because the input errors are asymmetric or because the functional dependence is non-linear. For instance, an angular measurement α = 87° ± 1° would transform into sin α = 0.9986 +0.0008 −0.0010.

8.2.5 Error Propagation for a Function of Several Parameters

A difficult problem is the determination of the error of a scalar quantity θ(µ)


which depends on several measured input parameters µ with asymmetric
errors. We have to eliminate nuisance parameters.
If the complete likelihood function ln L(µ) of the input parameters is
available, we derive the error limits from the profile likelihood function of θ
as proposed in Sect. 6.4.1.
The MLE of θ is simply θ̂ = θ(µ̂). The profile likelihood of θ has to fulfil the relation ∆ ln L(θ) = ln L(θ̂) − ln L(θ) = ln L(µ̂) − ln L(µ). To find the two values of θ for the given ∆ ln L, we have to find the maximum and the minimum of θ fulfilling the constraint. The one standard deviation limits are the two extreme values of θ located on the surface ln L(µ̂) − ln L(µ) = 1/2 in the µ space. Since we assumed likelihood functions with a single maximum, this is a closed surface, in two dimensions a closed line. There are various numerical methods to compute these limits. Constrained problems are usually solved with the help of Lagrange multipliers. A simpler method is the one which has been followed when we discussed constrained fits (see Sect. 7.5): with an extremum finding program, we minimize

θ(µ) + c [ln L(µ̂) − ln L(µ) − 1/2]²

where c is a number which has to be large compared to the absolute change of θ within the ∆ ln L = 1/2 region. We obtain µlow and θlow = θ(µlow), and by maximizing

θ(µ) − c [ln L(µ̂) − ln L(µ) − 1/2]²

we get θup.
If the likelihood functions are not known, the only practical way is to
resort to a Bayesian treatment, i.e. to make assumptions about the p.d.f.s
of the input parameters. In many cases part of the input parameters have
systematic uncertainties. Then, anyway, the p.d.f.s of those parameters have
to be constructed. Once we have established the complete p.d.f. f (µ), we

can also determine the distribution of θ. The analytic parameter transforma-


tion and reduction described in Chap. 3 will fail in most cases and we will
adopt the simple Monte Carlo solution where we generate a sample of events
distributed according to f (µ) and where θ(µ) provides the θ distribution in
form of a histogram and the uncertainty of this parameter. To remain con-
sistent with our previously adopted definitions we would then interpret this
p.d.f. of θ as a likelihood function and derive from it the MLE θ̂ and and the
likelihood ratio error limits.
We will not discuss the general scheme in more detail but add a few
remarks related to special situations and discuss two simple examples.

Sum of Many Measurements


If the output parameter θ = Σ ξi is a sum of many input quantities ξi with variances σi² of similar size, and their mean values and variances are known, then due to the central limit theorem we have

θ̂ = ⟨θ⟩ = Σ⟨ξi⟩ ,
δθ² = σθ² ≈ Σσi² ,

independent of the shape of the distributions of the input parameters and


the error of θ is approximately normally distributed. This situation occurs in
experiments where many systematic uncertainties of similar magnitude enter
in a measurement.
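This statement is easy to check with a toy simulation (made-up numbers): the spread of the sum of ten uniformly distributed input quantities agrees with √(Σσᵢ²).

```python
import random

random.seed(1)

# Ten input quantities, uniform on [m - a, m + a]; the variance of each is a^2/3
means = [1.0, 2.0, 0.5, 1.5, 3.0, 0.2, 2.5, 1.0, 0.8, 1.2]
half_widths = [0.3, 0.2, 0.4, 0.25, 0.35, 0.3, 0.2, 0.4, 0.3, 0.25]
var_pred = sum(a * a / 3.0 for a in half_widths)   # predicted variance of theta

n = 200_000
thetas = [sum(random.uniform(m - a, m + a) for m, a in zip(means, half_widths))
          for _ in range(n)]

mean = sum(thetas) / n
std = (sum((t - mean) ** 2 for t in thetas) / n) ** 0.5
print(mean, std, var_pred ** 0.5)   # the spread matches sqrt(sum sigma_i^2)
```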

Product of Many Measurements


If the output parameter θ = Π ξᵢ is a product of many positive input quantities
ξᵢ with relative uncertainties σᵢ/ξᵢ of similar size then due to the central
limit theorem

    ⟨ln θ⟩ = Σ⟨ln ξᵢ⟩ ,  σ_ln θ ≈ √(Σ σ²_ln ξᵢ) ,

independent of the shape of the distributions of the input parameters, and
the error of ln θ is approximately normally distributed which means that θ
follows a log-normal distribution (see Sect. 3.6.10). Such a situation may be
realized if several multiplicative efficiencies with similar uncertainties enter
into a measurement. The distribution of θ is fully specified only once we
know the quantities ⟨ln ξᵢ⟩ and σ_ln ξᵢ. The latter condition will usually not
be fulfilled and ⟨ln ξᵢ⟩, σ_ln ξᵢ have to be set by some educated guess. In most
cases, however, the approximations ⟨θ⟩ = Π⟨ξᵢ⟩ and δθ²/θ² = Σ δᵢ²/ξᵢ² may
be adequate.

Fig. 8.3. Distribution of the product of 10 variates with mean 1 and standard
deviation 0.2.

These two quantities fix the log-normal distribution from which we can derive
the maximum and the asymmetric errors. If the relative errors are sufficiently
small, the log-normal distribution approaches a normal distribution and we
can simply use the standard linear error propagation with symmetric errors.
As always, it is useful to check approximations by a simulation.

Example 124. Distribution of a product of measurements

We simulate the distribution of θ = Π ξᵢ of 10 measured quantities with
mean equal to 1 and standard deviation of 0.2, all normally distributed. The
result is very different from a Gaussian and is well described by a log-normal
distribution as is shown in Fig. 8.3. The mean is compatible with hθi = 1 and
the standard deviation is 0.69, slightly larger than the prediction from simple
error propagation of 0.63. These results remain the same when we replace
the Gaussian errors by uniform ones with the same standard deviation. Thus
details of the distributions of the input parameters are not important.
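The example can be reproduced with a few lines of Python (toy simulation; 0.69 and 0.63 are the values quoted above):

```python
import math
import random

random.seed(7)

n = 100_000
products = []
for _ in range(n):
    p = 1.0
    for _ in range(10):              # ten factors, each N(1, 0.2)
        p *= random.gauss(1.0, 0.2)
    products.append(p)

mean = sum(products) / n
std = math.sqrt(sum((p - mean) ** 2 for p in products) / n)
linear = math.sqrt(10) * 0.2         # naive linear error propagation
print(mean, std, linear)             # std comes out near 0.69, linear gives 0.63
```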

Sum of Weighted Poisson Numbers


If θ = Σ wᵢ ηᵢ is a sum of Poisson numbers ηᵢ weighted with wᵢ then we can
apply the simple linear error propagation rule:

    θ̂ = Σ wᵢ ηᵢ ,  δθ² = Σ wᵢ² ηᵢ .
The reason for this simple relation is that a sum of weighted Poisson
numbers can be approximated by the Poisson distribution of the equivalent
number of events (see Sect. 3.7.3). A condition for the
validity of this approximation is that the number of equivalent events is large
enough to use symmetric errors. If this number is low we derive the limits
from the Poisson distribution of the equivalent number of events which then
will be asymmetric.

Example 125. Sum of weighted Poisson numbers


Particles are detected in three detectors with efficiencies ε1 = 0.7, ε2 =
0.5, ε3 = 0.9. The observed event counts are n1 = 10, n2 = 12, n3 = 8.
A background contribution is estimated in a separate counting experiment
as b = 9 with a reduction factor of r = 2. The estimate n̂ for the total number
of particles which traverse the detectors is n̂ = Σ nᵢ/εᵢ − b/r = 43. From
linear error propagation we obtain the uncertainty δn = 9. A more precise
calculation based on the Poisson distribution of the equivalent number of
events would yield asymmetric errors, n̂ = 43 +10/−8.
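The numbers of this example follow directly from the rule above; a minimal sketch:

```python
import math

eff = [0.7, 0.5, 0.9]      # detector efficiencies
n_obs = [10, 12, 8]        # observed event counts
b, r = 9, 2                # background count and reduction factor

# Estimate: acceptance-corrected counts minus the scaled background
n_hat = sum(n / e for n, e in zip(n_obs, eff)) - b / r

# Linear error propagation with weights 1/eff and 1/r: delta^2 = sum w_i^2 eta_i
delta = math.sqrt(sum(n / e**2 for n, e in zip(n_obs, eff)) + b / r**2)
print(round(n_hat), round(delta, 1))   # 43 and 9.0
```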

Averaging Correlated Measurements

The following example is a warning that naive linear error propagation may
lead to false results.

Example 126. Average of correlated cross section measurements, Peelle's
pertinent puzzle

The result of a cross section measurement is ξ₁, with uncertainties due
to the event count, δ₁₀, and to the beam flux. The latter leads to an error δ_f ξ
which is proportional to the cross section ξ. The two contributions are independent
and thus the estimated error squared in the Gaussian approximation
is δ₁² = δ₁₀² + δ_f² ξ₁². A second measurement ξ₂ with different statistics but the
same uncertainty on the flux has an uncertainty δ₂² = δ₂₀² + δ_f² ξ₂². Combining
the two measurements we have to take into account the correlation of the
errors. In the literature [53] the following covariance matrix is discussed:

    C = ( δ₁₀² + δ_f² ξ₁²   δ_f² ξ₁ ξ₂      )
        ( δ_f² ξ₁ ξ₂        δ₂₀² + δ_f² ξ₂² ) .

It can lead to the strange result that the least square estimate ξ̂ of the two
cross sections is located outside the range defined by the individual results
[54], e.g. ξ̂ < ξ₁, ξ₂. This anomaly is known as Peelle's Pertinent Puzzle [55].
Its reason is that the normalization error is proportional to the true cross
section and not to the observed one and thus has to be the same for the two
measurements, i.e. in first approximation proportional to the estimate ξb of
the true cross section. The correct covariance matrix is
    C = ( δ₁₀² + δ_f² ξ̂²   δ_f² ξ̂²        )
        ( δ_f² ξ̂²          δ₂₀² + δ_f² ξ̂² ) .      (8.7)

Since the best estimate of ξ cannot depend on the common scaling error, it is
given by the weighted mean

    ξ̂ = (δ₁₀⁻² ξ₁ + δ₂₀⁻² ξ₂) / (δ₁₀⁻² + δ₂₀⁻²) .      (8.8)

The error δ is obtained by the usual linear error propagation,

    δ² = 1/(δ₁₀⁻² + δ₂₀⁻²) + δ_f² ξ̂² .      (8.9)
Proof: The weighted mean for ξ̂ is defined as the combination

    ξ̂ = w₁ ξ₁ + w₂ ξ₂

which, under the condition w₁ + w₂ = 1, has minimal variance (see Sect. 4.4):

    var(ξ̂) = w₁² C₁₁ + w₂² C₂₂ + 2 w₁ w₂ C₁₂ = min .

Using the correct C (8.7), this can be written as

    var(ξ̂) = w₁² δ₁₀² + (1 − w₁)² δ₂₀² + δ_f² ξ̂² .

Setting the derivative with respect to w₁ to zero, we get the usual result

    w₁ = δ₁₀⁻² / (δ₁₀⁻² + δ₂₀⁻²) ,  w₂ = 1 − w₁ ,

    δ² = min[var(ξ̂)] = δ₁₀⁻² / (δ₁₀⁻² + δ₂₀⁻²)² + δ₂₀⁻² / (δ₁₀⁻² + δ₂₀⁻²)² + δ_f² ξ̂² ,

proving the above relations (8.8), (8.9).
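The puzzle is easy to reproduce numerically. With illustrative numbers (not from the text): ξ₁ = 1.5, ξ₂ = 1.0, 10 % statistical errors and a common 20 % normalization error, the least square average built with the covariance matrix quoted from the literature falls below both measurements, while the weighted mean (8.8) stays between them:

```python
# Illustrative numbers (not from the text): two cross section measurements
# with 10 % statistical errors and a common 20 % normalization error
xi1, xi2 = 1.5, 1.0
d10, d20 = 0.10 * xi1, 0.10 * xi2
df = 0.20

def ls_average(c11, c22, c12):
    """Least square average of xi1, xi2 for a given 2x2 covariance matrix."""
    det = c11 * c22 - c12 * c12
    v11, v22, v12 = c22 / det, c11 / det, -c12 / det   # weight matrix C^-1
    return (v11 * xi1 + v22 * xi2 + v12 * (xi1 + xi2)) / (v11 + v22 + 2 * v12)

# Normalization error evaluated at the observed values: Peelle's puzzle
naive = ls_average(d10**2 + (df * xi1)**2, d20**2 + (df * xi2)**2,
                   df**2 * xi1 * xi2)

# Correct treatment, relation (8.8): the common scale error drops out
correct = (xi1 / d10**2 + xi2 / d20**2) / (1 / d10**2 + 1 / d20**2)

print(naive, correct)   # the naive average ~0.88 lies below both measurements
```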



8.3 One-sided Confidence Limits

8.3.1 General Case

Frequently, we cannot achieve the precision which is necessary to resolve a
small physical quantity. If we do not obtain a value which is significantly
different from zero, we usually present an upper limit. A typical example is
the measurement of the lifetime of a very short-lived particle which cannot
be resolved by the measurement. The result of such a measurement is then
quoted by a phrase like “The lifetime of the particle is smaller than ... with 90
% confidence.” Upper limits are often quoted for rates of rare reactions if no
reaction has been observed or the observation is compatible with background.
For masses of hypothetical particles postulated by theory but not observed
with the limited energy of present accelerators, experiments provide lower
limits.
In this situation we are interested in probabilities. Thus we have to in-
troduce prior densities or to remain with likelihood ratio limits. The latter
are not very popular. As a standard, we fix the prior to be constant in order
to achieve a uniform procedure allowing to compare and to combine mea-
surements from different experiments. This means that a priori all values of
the parameter are considered as equally likely. As a consequence, the results
of such a procedure depend on the choice of the variable. For instance, lower
limits u_m of a mass and u_{m²} of a mass squared, respectively, would not obey
the relation u_{m²} = (u_m)². Unfortunately we cannot avoid this property when we
want to present probabilities. Knowing that a uniform prior has been applied,
the reader of a publication can interpret the limit as a sensible parametriza-
tion of the experimental result and draw his own conclusions. Of course, it
is also useful to present the likelihood function which fully documents the
result. Also likelihood ratio limits are useful.
To obtain the p.d.f. of the parameter of interest, we just have to normalize
the likelihood function6 to the allowed range of the parameter θ. The proba-
bility P {θ < θ0 } computed from this density is the confidence level C for the
upper limit θ0 :
R θo
L(θ) dθ
C(θ0 ) = R−∞
∞ . (8.10)
−∞ L(θ) dθ

Lower limits are computed in an analogous way:


    C_low(θ₀) = ∫_{θ₀}^{∞} L(θ) dθ / ∫_{−∞}^{∞} L(θ) dθ .      (8.11)
Here the confidence level C is given and the relations (8.10), (8.11) have
to be solved for θ0 .
⁶ In case the likelihood function cannot be normalized, we have to renounce
producing a p.d.f. and present only the likelihood function.
256 8 Interval Estimation

8.3.2 Upper Poisson Limits, Simple Case

When, in an experimental search for a certain reaction, we do not find the
corresponding events, we quote an upper limit for its existence. Similarly in
some cases where an experiment records one or two candidate events but
where strong theoretical reasons speak against accepting those as real, it is
common practice not to quote a rate but rather an upper limit. The result is
then expressed in the following way: The rate for the reaction x is less than
µ0 with 90 % confidence.
The upper limit is again obtained as above by integration of the normal-
ized likelihood function.
For k observed events, we want to determine an upper limit µ0 with
C = 90% confidence for the expectation value of the Poisson rate. The
normalization integral over the parameter µ of the Poisson distribution
P(k|µ) = e−µ µk /k! is equal to one. Thus we obtain:
    C = ∫₀^{µ₀} P(k|µ) dµ = ∫₀^{µ₀} e^{−µ} µ^k dµ / k! .      (8.12)
The integral is solved by partial integration,

    C = 1 − Σᵢ₌₀ᵏ e^{−µ₀} µ₀ⁱ / i! = 1 − Σᵢ₌₀ᵏ P(i|µ₀) .

However, the sum over the Poisson probabilities cannot be solved analytically
for µ0 . It has to be solved numerically, or (8.12) is evaluated with the help
of tables of the incomplete gamma function.
The case k = 0, i.e. when no event has been observed, plays a special role.
The integral simplifies to:

    C = 1 − e^{−µ₀} ,
    µ₀ = − ln(1 − C) .

For C = 0.9 this relation is fulfilled for µ0 ≈ 2.3.


Note that for Poisson limits of rates without background, frequentist
statistics (see Appendix 13.6) and Bayesian statistics with a uniform
prior give the same results. For the following more general situations, this
does not hold anymore.
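For general k, the relation C = 1 − Σᵢ₌₀ᵏ P(i|µ₀) has to be solved numerically; a sketch with a few lines of bisection:

```python
import math

def poisson_upper_limit(k, conf=0.9):
    """Solve conf = 1 - sum_{i=0}^{k} P(i|mu0) for mu0 by bisection."""
    def coverage(mu):
        term, cum = math.exp(-mu), 0.0    # term starts as P(0|mu)
        for i in range(k + 1):
            cum += term
            term *= mu / (i + 1)          # P(i+1|mu) from P(i|mu)
        return 1.0 - cum
    lo, hi = 0.0, 100.0                   # coverage is monotone in mu
    while hi - lo > 1e-9:
        mid = 0.5 * (lo + hi)
        if coverage(mid) < conf:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(poisson_upper_limit(0))   # -ln(0.1), about 2.30
print(poisson_upper_limit(2))   # about 5.32
```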
Fig. 8.4. Upper limits for Poisson rates. The dashed lines are likelihood ratio limits
(decrease by e⁻²).

8.3.3 Poisson Limit for Data with Background

When we find in an experiment events which can be explained by a background
reaction with expected mean number b, we have to modify (8.12)
correspondingly. The expectation value of k is then µ + b and the confidence
C is

    C = ∫₀^{µ₀} P(k|µ + b) dµ / ∫₀^{∞} P(k|µ + b) dµ .

Again the integrals can be replaced by sums:

    C = 1 − Σᵢ₌₀ᵏ P(i|µ₀ + b) / Σᵢ₌₀ᵏ P(i|b) .

Example 127. Upper limit for a Poisson rate with background


Expected are two background events and observed are also two events.
Thus the mean signal rate µ is certainly small. We obtain an upper limit µ₀
for the signal with 90 % confidence by solving numerically the equation

    0.9 = 1 − Σᵢ₌₀² P(i|µ₀ + 2) / Σᵢ₌₀² P(i|2) .

We find µ₀ = 3.88. The Bayesian probability that the mean rate µ is larger
than 3.88 is 10 %. Fig. 8.4 shows the likelihood functions for the two cases
b = 2 and b = 0 together with the limits. For comparison, the likelihood
ratio limits are also given; they correspond to a decrease from the maximum
by e⁻². (For a normal distribution this would be equivalent to two standard
deviations.)

Fig. 8.5. Log-likelihood function for a Poisson signal with uncertainty in background
and acceptance. The arrow indicates the upper 90 % limit. Also shown is
the likelihood ratio limit (decrease by e⁻², dashed lines).
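The equation of Example 127 can be solved with a short bisection, reproducing µ₀ = 3.88 (a sketch in plain Python):

```python
import math

def pois_cum(k, lam):
    """Sum_{i=0}^{k} P(i|lam) for the Poisson distribution."""
    term, cum = math.exp(-lam), 0.0
    for i in range(k + 1):
        cum += term
        term *= lam / (i + 1)
    return cum

def upper_limit_with_background(k, b, conf=0.9):
    """Solve conf = 1 - pois_cum(k, mu0 + b) / pois_cum(k, b) for mu0."""
    lo, hi = 0.0, 100.0
    while hi - lo > 1e-9:
        mid = 0.5 * (lo + hi)
        if 1.0 - pois_cum(k, mid + b) / pois_cum(k, b) < conf:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(upper_limit_with_background(2, 2))   # about 3.88, as in the example
print(upper_limit_with_background(2, 0))   # without background: about 5.32
```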

We now investigate the more general case that both the acceptance ε
and the background are not perfectly known, and that the p.d.f.s of the
background and the acceptance fb , fε are given. For a mean Poisson signal
µ the probability to observe k events is
    g(k|µ) = ∫∫ db dε P(k|εµ + b) f_b(b) f_ε(ε) = L(µ|k) .

For k observations this is also the likelihood function of µ. According to
our scheme, we obtain the upper limit µ₀ by normalization and integration,

    C = ∫₀^{µ₀} L(µ|k) dµ / ∫₀^{∞} L(µ|k) dµ ,

which is solved numerically for µ0 .


Fig. 8.6. Renormalized likelihood function and upper limit.

Example 128. Upper limit for a Poisson rate with uncertainty in background
and acceptance
Observed are 2 events, expected are background events following a normal
distribution N (b|2.0, 0.5) with mean value b0 = 2 and standard deviation
σb = 0.5. The acceptance is assumed to follow also a normal distribution
with mean ε₀ = 0.5 and standard deviation σ_ε = 0.1. The likelihood function
is

    L(µ|2) = ∫∫ dε db P(2|εµ + b) N(ε|0.5, 0.1) N(b|2.0, 0.5) .

We solve this integral numerically for values of µ in the range of µ_min = 0 to
µ_max = 20, in which the likelihood function is noticeably different from zero
(see Fig. 8.5). Subsequently we determine µ₀ such that the fraction C = 0.9
of the normalized likelihood function is located left of µ₀. Since negative
values of the normal distributions are unphysical, we cut these distributions
and renormalize them. The computation in our case yields the upper limit
µ₀ = 7.7. In the figure we also indicate the e⁻² likelihood ratio limit.

8.3.4 Unphysical Parameter Values

Sometimes the allowed range of a parameter is restricted by physical or mathematical
boundaries; for instance, it may happen that from the experimental
data we infer a negative mass. In these circumstances the parameter range
will be cut and the likelihood function will be normalized to the allowed
region. This is illustrated in Fig. 8.6. The integral of the likelihood in the
physical region is one. The shaded area is equal to α. The parameter θ is less
than θmax with confidence C = 1 − α.
We have to treat observations which are outside the allowed physical re-
gion with caution and check whether the errors have been estimated correctly
and no systematic uncertainties have been neglected.

8.4 Summary

Measurements are described by the likelihood function.


– The standard likelihood ratio limits are used to represent the precision of
the measurement.
– If the log-likelihood function is parabolic and the prior can be approx-
imated by a constant, e.g. the likelihood function is very narrow, the
likelihood function is proportional to the p.d.f. of the parameter, error
limits represent one standard deviation and a 68.3 % probability interval.
– If the likelihood function is asymmetric, we derive asymmetric errors from
the likelihood ratio. The variance of the measurement or probabilities can
only be derived if the prior is known or if additional assumptions are
made. The likelihood function should be published.
– Nuisance parameters are eliminated by the methods described in Chap.
6.4.5, usually using the profile likelihood. If the nuisance parameters can-
not be eliminated, the dependence of the result on the values of the nui-
sance parameters should be documented.
– Error propagation is performed using the direct functional dependence of
the parameters.
– Confidence intervals, upper and lower limits are computed from the nor-
malized likelihood function, i.e. using a flat prior. These intervals usually
correspond to 90% or 95% probability.
– In many cases it is not possible to assign errors or confidence intervals
to parameters without making assumptions which are not uniquely based
on experimental data. Then the results have to be presented such that
the reader of a publication is able to insert his own assumptions and the
procedure used by the author has to be documented.
9 Unfolding

9.1 Introduction
In many experiments the measurements are deformed by limited acceptance,
sensitivity, or resolution of the detectors. Knowing the properties of the de-
tector, we are able to simulate these effects, but is it possible to invert this
process, to reconstruct from a distorted event sample the original distribution
from which the undistorted sample has been drawn?
There is no simple answer to this question. Apart from the unavoidable
statistical uncertainties, the correction of losses is straightforward, but
unfolding the effects caused by the limited resolution is difficult and feasible
only by introducing a priori assumptions about the shape of the original
distribution or by grouping the data in histogram bins. Therefore, we should
ask ourselves, whether unfolding is really a necessary step of our analysis. If
we want to verify a theoretical prediction for a distribution f (x), it is much
easier and more accurate to fold f with the known resolution and then to
compare the smeared prediction and the experimental distributions with the
methods discussed in Chap. 10. If a prediction contains interesting parame-
ters, also those should be estimated by comparing the smeared distribution
with the observed data. When we study, for instance, a sharp resonance peak
above a slowly varying background, it will be very difficult, if not impossible,
to determine the relevant parameters from an unfolded spectrum, while it is
easy to fit them directly to the observed distribution, see Sect. 7.3 and Ref.
[57]. However, in situations where a reliable theoretical description is missing,
or where the measurement is to be compared with a distribution obtained
in another experiment with different experimental conditions, unfolding of
the data cannot be avoided. Examples are the determination of structure
functions in deep inelastic scattering or transverse momentum distributions
from the Large Hadron Collider at CERN where an obvious parametrization
is missing.
The choice of the unfolding procedure depends on the goal one is aiming
for. We either can try to optimize the reconstruction of the distribution, with
the typical trade-off between resolution and bias where we have a kind of
probability density estimation (PDE) problem (see Chapt. 12), or we can
treat unfolding as an inference problem where the errors should contain the
unknown result with a reasonable coverage probability¹. The former approach
dominates in most applications outside the natural sciences, for instance in
picture unblurring, but is also adopted in particle physics and astronomy.
We will follow both lines: the first provides the most likely shape of the
distribution but is not suited as a basis for a quantitative analysis, while
the second permits combining results and comparing them quantitatively
to theoretical predictions. We will consider mainly histograms and spline
approximations but sketch also binning free methods which may become more
popular with increased computer power.
General unfolding studies are found in Refs. [58, 59, 60, 62, 63, 56]. Specific
methods are presented in Refs. [64, 65, 66, 67, 68, 61, 69].

9.2 Discrete Inverse Problems and the Response matrix


9.2.1 Introduction and definition

Folding is described by the integral

    g(x′) = ∫_{−∞}^{∞} h(x′, x) f(x) dx .      (9.1)

The function f (x) is folded with a response function h(x′ , x), resulting in
the smeared function g(x′ ). We call f (x) the true distribution and g(x′ ) the
smeared distribution or the observed distribution. The three functions g, h, f
can have discontinuities but of course the integral has to exist. The integral
equation (9.1) is called Fredholm equation of the first kind with the kernel
h(x′ , x). If the function h(x′ , x) is a function of the difference x′ −x only, (9.1)
is denoted convolution integral, but often the terms convolution and folding
are not distinguished. The relation (9.1) describes the direct process of folding.
We are interested in the inverse problem: Knowing g and h we want to infer
f (x). This inverse problem is classified by the mathematicians as ill posed
because it has no unique solution. In the direct process high frequencies are
washed out. The damping of strongly oscillating contributions in turn means
that in mapping g to f high frequencies are amplified, and the higher the
frequency, the stronger is the amplification. In fact, in practical applications
we do not really know g, the information we have consists only of a sample of
observations with the unavoidable statistical fluctuations2 . The fluctuations
of g correspond to large perturbations of f and consequently to ambiguities.
The response function often, but not always, describes a simple resolution
effect and is then called a point spread function (PSF). There exist
¹ We base our errors on the likelihood function. In most cases with not too
small event numbers, the definition coincides to a good approximation with the
error definition derived from the coverage paradigm, see Appendix 13.6. In this
chapter arguing with coverage is more convenient than with the likelihood ratio.
2
In the statistical literature the fluctuations are called noise.

Fig. 9.1. Relations between the histograms involved in the unfolding process.

also more complex situations, like in positron emission tomography (PET),
where the relation between the observed distribution of two photons and the
interesting distribution of their origin is more involved. In PET and many
other applications the variables x and x′ are multi-dimensional.

9.2.2 The Histogram Representation

Discretization and the Response Matrix

The disease of the inverse problem can partially be cured by discretization,
which essentially means that we construct a parametric model. We usually
replace the continuous functions by histograms which can be written as vec-
tors θ for the true histogram and d for the observed histogram. The two
histograms are connected by the response function, here by a matrix A. We
get for the direct process:

    E(d) = Aθ ,      (9.2)

or, written out in components,

    E(dᵢ) = Σⱼ₌₁ᴹ Aᵢⱼ θⱼ ,  i = 1, . . . , N .
Here dᵢ is the content of bin i of an observed histogram. E(d) is the
expected value. A is called response or folding matrix and θj is the content
of bin j of the undistorted true histogram that we want to determine. The
following relations define the matrix A:
    θⱼ = ∫_{bin j} f(x) dx ,

    E(dᵢ) = ∫_{bin i} dx′ ∫_{−∞}^{∞} h(x′, x) f(x) dx ,

    Aᵢⱼ = E(dᵢⱼ)/θⱼ ,      (9.3)

    E(dᵢⱼ) = ∫_{bin i} dx′ ∫_{bin j} h(x′, x) f(x) dx .

E(dij ) is the expected number of observed events in bin i that originate from
true bin j. In the following we will often abbreviate E(dᵢ) by tᵢ = Σⱼ Aᵢⱼ θⱼ. The
value Aij represents the probability that the detector registers an event in
bin i that belongs to the true histogram bin j. This interpretation assumes
that all elements of d, A and θ are positive. The number of columns M
is the number of bins in the true histogram and the number of parameters
that have to be determined. The number of rows N is the number of bins
in the observed histogram. We do not want to have more parameters than
measurements and require N ≥ M . Normally we constrain the unknown true
histogram, requiring N > M . With N bins of the observed histogram and M
bins of the true histogram we have N − M constraints. The relation between
the various histograms is shown in Fig. 9.1. We follow the simpler right-hand
path where multinomial errors need not be handled.
We require that A is rank efficient which means that the rank is equal
to the number of columns M . Formally, this means that all columns are
linearly independent and at least M rows are linearly independent: No two
bins of the true histogram should produce observed distributions that are
proportional to each other. Unfolding would be ambiguous in this situation,
but a simple solution of the problem is to combine the bins. More complex
cases that lead to a rank deficiency never occur in practice. A more serious
requirement is the following: By definition of A, the observed histogram must
not contain events that originate from other sources than the M true bins. In
other words, the range of the true histogram has to cover all observed events.
This requirement often entails that only a small fraction of the events that
are contained in the border bins of the true histogram are found in the observed
histogram. The correspondingly low efficiency leads to large errors of the
reconstructed number of events in these bins. Published simulation studies
often avoid this complication by restricting the range of the true variable.
Some publications refer to a null space of the matrix A. The null space is
spanned by vectors that fulfill Aθ = 0. With our definitions and the restric-
tions that we have imposed, the null space is empty.
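The Monte Carlo construction of A according to (9.3) can be sketched with a toy setup (assumed, not from the text): a uniform true distribution on [0, 1], Gaussian smearing with σ = 0.1, and M = N = 10 bins over the same range; events smeared outside [0, 1] are lost, which produces the low efficiency of the border bins mentioned above.

```python
import random

random.seed(3)

M = N = 10                    # bins of the true and the observed histogram
nmc = 200_000                 # Monte Carlo events
counts = [[0] * M for _ in range(N)]   # counts per (observed bin i, true bin j)
gen = [0] * M                          # generated events per true bin

for _ in range(nmc):
    x = random.random()                # true value, uniform on [0, 1]
    j = min(int(M * x), M - 1)         # true bin
    gen[j] += 1
    xp = random.gauss(x, 0.1)          # observed value after Gaussian smearing
    if 0.0 <= xp < 1.0:                # events outside the range are lost
        counts[int(N * xp)][j] += 1

# A_ij: probability to observe an event in bin i, given an event in true bin j
A = [[counts[i][j] / gen[j] for j in range(M)] for i in range(N)]

# Column sums are the efficiencies; the border bins lose events by smearing
eff = [sum(A[i][j] for i in range(N)) for j in range(M)]
print([round(e, 2) for e in eff])
```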
Fig. 9.2. Folded distributions (left) for two different distributions (right).

Fig. 9.3. Naive unfolding result obtained by matrix inversion. The curve corresponds
to the true distribution.

In particle physics the experimental setups are mostly quite complex, and
for this reason they are simulated with Monte Carlo programs. To construct
the matrix A, we generate events following an assumed true distribution f (x)
characterized by the true variable x and a corresponding true bin j. The
detector simulation produces the observed variable x′ and the corresponding
observed bin i. We will assume for the moment that we can generate an in-
finitely large amount of so-called Monte Carlo events such that we do not
have to care about statistical fluctuations of the elements of A. The statistical
fluctuations of the observed event numbers are described by the Poisson
distribution.

Fig. 9.4. Unfolding by matrix inversion with different binnings.
There is another problem that we neglect but that we have to resume
later: The matrix A depends to some extent on the true distribution which
is not known in the Monte Carlo simulation. The dependence is small if the
bins of the true distribution are narrow enough to neglect the fluctuations of
f (x) within a bin. This condition cannot always be maintained.

The Need for Regularization

The discrete model avoids the ambiguity of the continuous ill-posed problem
but especially if the observed bins are narrow compared to the resolution, the
matrix is badly conditioned which means that the inverse or pseudo-inverse of
A contains large components. This is illustrated in Fig. 9.2 which shows two
different original distributions and the corresponding distributions smeared
with a Gaussian N (x − x′ |0, 1). In spite of the extremely different original
distributions, the smeared distributions of the samples are practically indis-
tinguishable. This demonstrates the sizeable information loss that is caused
by the smearing, especially in the case of the distribution with four peaks.
Sharp structures are washed out and can hardly be reconstructed. Given the
observed histogram with some additional noise, it will be almost impossible
to exclude one of the two candidates for the true distribution even with a huge
amount of data. Naive unfolding by matrix inversion can produce oscillations
as shown in Fig. 9.3.
If the matrix A is quadratic, we can simply invert (9.2) and get an estimate
θ̂ of the true histogram:

    θ̂ = A⁻¹ d .      (9.4)

In practice this simple solution usually does not work because, as mentioned,
our observations suffer from statistical fluctuations.
In Fig. 9.4 the result of a simple matrix inversion of the data vector
is depicted. The left-hand plot is realized with 10 bins. It is clear that either
fewer bins have to be chosen, see Fig. 9.4 right-hand plot, or some smoothing
has to be applied.
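The oscillations produced by naive matrix inversion are easy to demonstrate with a toy response matrix (a sketch with assumed numbers: 10 bins, 60 % of the events stay in their bin, 20 % migrate to each neighbour). A single one-standard-deviation fluctuation of one observed bin is amplified by the inversion and produces anti-correlated oscillations in the neighbouring bins:

```python
M = 10
# Toy response matrix: 60 % of the events stay in their bin, 20 % migrate to
# each neighbouring bin (events migrate out of range at the borders)
A = [[0.0] * M for _ in range(M)]
for j in range(M):
    A[j][j] = 0.6
    if j > 0:
        A[j - 1][j] = 0.2
    if j < M - 1:
        A[j + 1][j] = 0.2

def solve(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    a = [row[:] for row in a]
    y = y[:]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        y[c], y[p] = y[p], y[c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            y[r] -= f * y[c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (y[r] - sum(a[r][k] * x[k] for k in range(r + 1, n))) / a[r][r]
    return x

theta_true = [100.0] * M
t = [sum(A[i][j] * theta_true[j] for j in range(M)) for i in range(M)]

exact = solve(A, t)     # without fluctuations the inversion recovers theta

# A single upward fluctuation of one bin by its standard deviation (~10 events)
d = t[:]
d[5] += 10.0
shaky = solve(A, d)     # the perturbation is amplified and oscillates in sign
print([round(v, 1) for v in exact])
print([round(v, 1) for v in shaky])
```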

9.2.3 Expansion of the True Distribution

Instead of representing the distribution f(x) by a histogram, we can expand
it into a sum of functions Bⱼ. The Bⱼ are normalized, ∫_{−∞}^{∞} Bⱼ(x) dx = 1:

    f(x) ≈ Σⱼ₌₁ᴹ βⱼ Bⱼ(x) .      (9.5)

We get

    E(dᵢ) = ∫_{bin i} dx′ Σⱼ₌₁ᴹ βⱼ ∫_{−∞}^{∞} h(x′, x) Bⱼ(x) dx
          = Σⱼ₌₁ᴹ Aᵢⱼ βⱼ      (9.6)

with

    Aᵢⱼ = ∫_{bin i} dx′ ∫_{−∞}^{∞} h(x′, x) Bⱼ(x) dx .
The response matrix element Aij now is the probability to observe an
event in bin i of the observed histogram that originates from the distribution
Bj . In other words, the observed histogram is approximated by a superposi-
tion of the histograms produced by folding the functions Bj . Unfolding means
to determine the amplitudes βj of the functions Bj .
Below we will discuss the approximation of f (x) by a superposition of ba-
sic spline functions (b-splines). For our applications the b-splines of order 2
(linear), 3 (quadratic) or 4 (cubic) are appropriate (see Appendix 13.15). Un-
folding then produces a smooth function which normally is closer to the true
distribution than a histogram. The disadvantage of spline approximations
compared to the histogram representation is that a quantitative comparison
with predictions or the combination of several results is more difficult.
Remark: In probability density estimation (PDE) a histogram is consid-
ered as a first order spline function. The spline function corresponds to the
line that limits the top of the histogram bins. The interpretation of a his-
togram in experimental sciences is different from that in PDE. Observations
are collected in bins and then the content of the bin measures the integral
of the function g over the bin and the bin content of the unfolded histogram
is an estimate of the integral of f over that bin. A function can always be
described correctly by a histogram, while the description by spline functions
is an approximation. This has to be kept in mind when we compare the
unfolding result to a prediction.

9.2.4 The Least Square Solution and the Eigenvector Decomposition

The Least Square Solution

As mentioned, for a square matrix A, M = N, the solution for θ is simply
obtained by matrix inversion, θ̂ = A⁻¹d. The error matrix C_θ = A⁻¹ C_d (A⁻¹)ᵀ
is derived by error propagation. We omit the calculation. In the limit where
there is no smearing, A is diagonal and describes only acceptance losses.
The choice M = N is not recommended. For M ≤ N the least square
function χ²_stat is given by the following relation:

    χ²_stat = Σᵢ₌₁ᴺ (tᵢ − dᵢ)² / tᵢ = Σᵢ₌₁ᴺ (Σₖ₌₁ᴹ Aᵢₖ θₖ − dᵢ)² / Σₖ₌₁ᴹ Aᵢₖ θₖ .      (9.7)

If the numbers dᵢ are not described by a simple Poisson distribution, we
have to insert the weight matrix³ V = C_d⁻¹, the inverse of the error matrix C_d:

    χ²_stat = Σᵢ,ₖ₌₁ᴺ (tᵢ − dᵢ) Vᵢₖ (tₖ − dₖ) .      (9.8)

If the data follow a Poisson distribution where the statistics is high enough
to approximate it by a normal distribution and where the denominator of
(9.7) can be approximated by dᵢ, the least square minimum can be evaluated
by a simple linear matrix calculus. (The linear LS solution is treated in Sect.
6.7.)

    χ²_stat = Σᵢ₌₁ᴺ (tᵢ − dᵢ)² / dᵢ = Σᵢ₌₁ᴺ (Σₖ₌₁ᴹ Aᵢₖ θₖ − dᵢ)² / dᵢ .      (9.9)
We apply the transformations

    d ⇒ b = AᵀVd ,      (9.10)
    A ⇒ Q = AᵀVA .      (9.11)

³ In the literature the error matrix or covariance matrix is frequently denoted by
V and the weight matrix by V⁻¹.
We call Q the least square matrix. We get for the expected value of b

E(b) = Qθ   (9.12)

with the LS solution

θ̂ = Q⁻¹b   (9.13)

and the error matrix Cθ of the solution

Cθ = Q⁻¹ .

We have simply replaced A by Q and d by b. Both quantities are then
known. The matrix Q is square and can be inverted if the LS solution
exists.
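The linear calculus of (9.10)-(9.13) can be sketched numerically. In the following Python fragment the response matrix, the true histogram and the event numbers are toy assumptions, not values taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

M, N = 5, 10                                   # true bins, observed bins
A = np.abs(rng.normal(0.5, 0.2, size=(N, M)))  # toy response matrix
theta_true = np.array([100., 200., 300., 200., 100.])
d = rng.poisson(A @ theta_true).astype(float)  # observed histogram

V = np.diag(1.0 / np.maximum(d, 1.0))          # weight matrix V = C_d^{-1}
b = A.T @ V @ d                                # transformed data vector (9.10)
Q = A.T @ V @ A                                # least square matrix (9.11)

theta_hat = np.linalg.solve(Q, b)              # LS estimate (9.13)
C_theta = np.linalg.inv(Q)                     # error matrix C_theta = Q^{-1}
errors = np.sqrt(np.diag(C_theta))
```

With more observed than true bins, Q is a square M × M matrix and can be inverted whenever the LS solution exists.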

Eigenvector Decomposition of the Least Square Matrix

To understand better the origin of the fluctuations of the LS solution (9.13),
we factorize the matrix Q in the following way: the matrix⁴ Q = UΛU⁻¹ is
composed of the diagonal matrix Λ which contains the eigenvalues of Q and
the matrix U whose columns consist of the eigenvectors ui of Q:

Q = \begin{pmatrix} u_1 & u_2 & \cdots & u_M \end{pmatrix}
    \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_M \end{pmatrix}
    \begin{pmatrix} u_1 & u_2 & \cdots & u_M \end{pmatrix}^{T} ,

Q u_i = \lambda_i u_i = v_i , \quad i = 1, \ldots, M .   (9.14)
Software to produce the eigenvector decomposition can be found in most
mathematical computer libraries.
In case of eigenvalues that appear more than once, the eigenvectors are not
uniquely defined. Linear orthogonal combinations can be created by rotations
in the corresponding subspace, but they produce the same LS solution.
The solution θ can be expanded into the orthogonal unit eigenvectors ui:

\theta = \sum_{i=1}^{M} a_i u_i , \qquad \theta_k = \sum_{i=1}^{M} a_i u_{ik} ,

a_i = \theta \cdot u_i , \qquad a_i = \sum_{k=1}^{M} \theta_k u_{ik} .
⁴ We require that the square M × M matrix Q has M linearly independent
eigenvectors and that all eigenvalues are real and positive. These conditions are
satisfied if a LS solution exists.
Fig. 9.5. Set of eigenvectors ordered according to decreasing eigenvalues. A
contribution ui in the true histogram corresponds to a contribution vi to the
observed histogram.

By construction, the amplitudes ai are uncorrelated, and the norm
||θ||² = Σi θi² of the solution is given by

||\theta||^2 = \sum_{i=1}^{M} a_i^2 .

The transformed observed vector b is

b = \sum_{i=1}^{M} a_i \lambda_i u_i = \sum_{i=1}^{M} a_i v_i .
In Fig. 9.5 we present a schematic example of a set of eigenvectors. A
contribution ui to the true histogram as shown on the left-hand side will
produce a contribution vi = λi ui to the observed histogram. It is of the
same shape but reduced by the factor λi as shown on the right-hand side.
The eigenvalues decrease from top to bottom.

Fig. 9.6. Eigenvectors of the modified LS matrix ordered with decreasing eigenvalues.

Strongly oscillating components of
the true histogram correspond to small eigenvalues. They are hardly visible
in the observed data, and in turn, small contributions vi to the observed
data caused by statistical fluctuations can lead to rather large oscillating
contributions ui = v i /λi to the unfolded histogram if the eigenvalues are
small. Eigenvector contributions with eigenvalues below a certain value can-
not be reconstructed, because they cannot be distinguished from noise in the
observed histogram.
The eigenvector decomposition is equivalent to the singular value decom-
position (SVD). In the following we will often refer to the term SVD instead
of the eigenvector decomposition, because the former is commonly used in
the unfolding literature.
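The role of the small eigenvalues can be illustrated with a toy least square matrix (the matrix and all numbers below are hypothetical): the expansion of θ into eigenvectors is exact, and an observed perturbation along the eigenvector with the smallest eigenvalue is amplified by the factor 1/λ in the solution.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 8
Araw = np.abs(rng.normal(0.5, 0.2, size=(2 * M, M)))
Q = Araw.T @ Araw                         # toy least square matrix (symmetric)

lam, U = np.linalg.eigh(Q)                # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]            # reorder to decreasing eigenvalues

theta = rng.normal(size=M)
a = U.T @ theta                           # amplitudes a_i = theta . u_i
assert np.allclose(U @ a, theta)          # theta = sum_i a_i u_i
assert np.isclose(theta @ theta, a @ a)   # ||theta||^2 = sum_i a_i^2

# a perturbation of norm eps of the observed vector along the direction of
# the smallest-eigenvalue eigenvector is amplified by 1/lambda_M
eps = 1e-3
delta_theta = np.linalg.solve(Q, eps * U[:, -1])
assert np.isclose(np.linalg.norm(delta_theta), eps / lam[-1])
```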

Example 129. Eigenvector decomposition of the LS matrix
Fig. 9.7. Observed eigenvectors 1 (top left) and 20 (top right), eigenvalues (bottom
left) and significance of eigenvector amplitudes (bottom right).

In Fig. 9.6 the 20 eigenvectors of a LS matrix ordered with decreasing
eigenvalue are displayed. The response matrix has 20 true and 40 observed
bins. The graph is generated from a sample of 100 000 uniformly distributed
events in the range of the observed and the true variables 0 < x, x′ < 1.
The response function is a Gaussian with standard deviation σs = 0.04. The
eigenvectors show an oscillatory behavior where the number of clusters cor-
responds roughly to the eigenvector index. In Fig. 9.7 top the eigenvectors 1
and 20 folded with the response matrix are shown. A contribution of eigenvec-
tor 20 to the observed histogram is similar to that of noise. The eigenvalues
shown in the bottom left graph vary by about three orders of magnitude.
This means that a contribution of the eigenvector 20 to the true distribution
is suppressed in the observed distribution by a factor of 1000 with respect
to a contribution of eigenvector 1. The bottom right-hand graph shows the
significance of the amplitudes that are attributed to the eigenvectors. Signif-
icance is defined as the absolute value of the amplitude divided by its error.
As we have indicated above, the significance is expected to decrease with
decreasing eigenvalue. Due to the symmetry of the problem, the amplitudes
with even index should vanish. Statistical fluctuations in the simulation
partially destroy the symmetry. Eigenvector contributions where the
significance is below one are compatible with being absent within less than
one standard deviation.

Fig. 9.8. Left hand: parameter significance as a function of the eigenvalue index.
The effective number of parameters is 17. Right hand: fitted parameter values as
a function of the eigenvalue index.

The significance of the amplitudes deteriorates strongly with increasing
smearing. It is difficult to compensate for a bad resolution of an experiment
by increasing the statistics! We should always make an effort to avoid large
smearing effects not only because large event numbers are required but also
because the unfolding results then depend strongly on a precise knowledge of
the response function.

The Effective Number of Parameters

When we unfold a histogram, the number of bins of the unfolded histogram
is the number of free parameters in the fit. The previous example indicates
that the number of independent parameters that we can determine in a given
problem is rather limited. Below a certain eigenvalue λk, all parameters have
a significance close to or below one. We can define an effective number of
parameters Neff = k as the number of parameters with eigenvalues above or
equal to this limit. For the example of Fig. 9.8 with a uniform distribution the
effective number of parameters is Neff = 17. There are also parameters to the
left of index 17 that are compatible with being zero. We should not exclude
the corresponding contributions, because the reason for the small values of
the significance is not large errors but small values of the fitted amplitudes, as
is indicated in the right-hand graph. This graph shows that some amplitudes
that correspond to small eigenvalues become rather large. This is due to the
amplification of high frequency noise in the unfolding. The number of bins in
the unfolded distribution should not be much larger than the effective number
of parameters, because then we keep too much redundant information, but on
the other hand it has to be large enough to represent the smallest significant
eigenvector. A reasonable choice for the number of bins is about twice Neff.
The optimal number will also depend on the shape of the distribution.
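A minimal sketch of such a counting rule, assuming that the amplitudes are uncorrelated with variances 1/λi (which follows from Cθ = Q⁻¹ in the eigenbasis); all numbers below are hypothetical:

```python
import numpy as np

def n_eff(amplitudes, eigenvalues, threshold=1.0):
    """Count eigenvector amplitudes whose significance |a_i| / sigma(a_i)
    reaches the threshold, with sigma(a_i) = 1 / sqrt(lambda_i)."""
    errors = 1.0 / np.sqrt(eigenvalues)
    significance = np.abs(amplitudes) / errors
    return int(np.sum(significance >= threshold))

# hypothetical amplitudes, ordered by decreasing eigenvalue
lam = np.array([10.0, 5.0, 2.0, 0.5, 0.1, 0.01])
a   = np.array([ 3.0, 2.0, 1.5, 1.0, 0.5,  0.2])
print(n_eff(a, lam))   # → 3
```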

9.2.5 The Maximum Likelihood Approach

Histogram Representation

Whenever possible, we should apply a MLF instead of a LSF. With Poisson
distributed event numbers di with expected values E(di) = ti = Σj Aij θj,
the probability to obtain di is

P(d_i) = \frac{e^{-t_i}\, t_i^{d_i}}{d_i!}

and the corresponding log-likelihood is, up to an irrelevant constant,

\ln L_{stat} = \sum_{i=1}^{N} \left[ d_i \ln t_i - t_i \right]
             = \sum_{i=1}^{N} \left[ d_i \ln\Bigl(\sum_{j=1}^{M} A_{ij}\theta_j\Bigr) - \sum_{j=1}^{M} A_{ij}\theta_j \right] .   (9.15)

Maximizing ln Lstat we obtain an estimate θ̂ of the true histogram.
Usually we have of the order of 20 bins and of course the same number
of correlated parameters which have to be adjusted. In this situation the fit
often does not converge very well. Instead of maximizing the log-likelihood
with methods like Simplex, we can compute the solution iteratively with the
expectation maximization (EM) method [70], see Appendix 13.4.
The following alternating steps are executed:
– Folding step:

d_i^{(k)} = \sum_{j} A_{ij} \theta_j^{(k)} .   (9.16)

– Unfolding step:

\theta_j^{(k+1)} = \frac{\theta_j^{(k)}}{\alpha_j} \sum_{i=1}^{N} A_{ij} \frac{d_i}{d_i^{(k)}} .   (9.17)

Usually uniform starting values θ^(0) = 1 are selected. The efficiency
parameter α corrects for acceptance losses, α_j = \sum_{i=1}^{N} A_{ij}.
Before the EM method was invented, the iterative procedure had
been introduced independently by Richardson and Lucy [71, 72], specifically
for the solution of unfolding problems. Later it was re-invented several times
[73, 74, 64, 67]. That the result of the iteration converges to the maximum
likelihood solution is a general property of the EM method, but it was also
proven by Vardi et al. [75] and later independently by Mülthei and Schorr
[64]. For a discussion of the application to unfolding see [77].
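The two alternating steps can be implemented in a few lines. The sketch below assumes a response matrix A with observed bins in rows and true bins in columns, and a uniform starting histogram; the demonstration inputs are toy values:

```python
import numpy as np

def em_unfold(A, d, n_iter=100):
    """EM (Richardson-Lucy) iteration of the folding and unfolding steps."""
    N, M = A.shape
    alpha = A.sum(axis=0)                       # acceptance of each true bin
    theta = np.full(M, d.sum() / M)             # uniform starting values
    for _ in range(n_iter):
        t = A @ theta                           # folding step (9.16)
        theta = theta * (A.T @ (d / t)) / alpha # unfolding step (9.17)
    return theta

# without smearing (A = identity) a single step reproduces the data
print(em_unfold(np.eye(3), np.array([10., 20., 30.]), n_iter=1))  # → [10. 20. 30.]
```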

Spline Approximation

As we have seen, the true distribution can also be represented by a spline
approximation

f(x) = \sum_{j=1}^{N} \theta_j B_j(x) .

The EM method can be used to perform a MLF of the spline coefficients
θj of the basis spline functions Bj. The relations (9.16) and (9.17) remain
valid. Instead of the basis spline functions any other set of functions can be
used to approximate f(x).

9.3 Unfolding with Explicit Regularization


9.3.1 General considerations

The main field where professional unfolding methods are applied is image
reconstruction. In medicine, the unblurring of tomographic pictures of
arterial stenoses, of tumors or of orthopedic objects is important. Other areas
of interest are the unblurring of images of astronomical surveys, of geograph-
ical maps, and of tomographic pictures of tools or mechanical components
like train wheels to detect damages. Also pattern recognition, for example of
fingerprints or of the iris, is an important field of interest.
out hidden or fuzzy structures from blurred images, to remove noise and to
improve the contrast. Also in physics applications it is of interest to visualize
hidden structures and to reconstruct distributions which may be used for
instance to simulate experimental situations. We want to take advantage of
the fact that physics distributions are rather smooth. Often we can remove
the roughness of an unfolding result without affecting very much the real
structures of the true distribution.
To obtain a smooth distribution, several different regularization mecha-
nisms are available.
In particle- and astrophysics the following methods are applied:
1. Truncation methods: In the eigenvector decomposition of the LS matrix
(equivalent to the SVD), low eigenvalue contributions to the unfolding
solution are suppressed or eliminated.
2. Penalty methods: A penalty term which is sensitive to unwanted fluctua-
tions is introduced in the LS or ML fit of the unfolding solution. Typically,
deviations from a uniform or a linear distribution are suppressed. Stan-
dard methods penalize curvature, low entropy or a large norm of the
unfolding distribution.
3. Iterative fitting with early stopping: A smooth distribution is iteratively
modified and adjusted to the observation. The iteration process based
on the EM method is stopped before “unacceptable” oscillations emerge.
Alternatively, the iterative unfolding result is smoothed after each itera-
tion by a soft smoothing function. Then the iteration sequence converges
automatically.
A simple bin-by-bin correction method has been used in the past in some
particle physics experiments. The ratio of the numbers in the observed and
the true histogram in the simulation is used to correct the observed histogram.
This approach generates a smooth distribution, but should be discarded be-
cause it often produces strongly biased results.
In the following, we first discuss some general properties of regularization
approaches and then we describe the standard methods. We assume that the
observed data follow Poisson distributions.

9.3.2 Variable Dependence and Correlations

We have already stressed that smoothness criteria cannot be derived from
basic principles. Smoothness is not well defined and the standard methods
are not invariant against variable transformations.
Most unfolding methods have the convenient property that the unfolding
result does not depend on the ordering of the bins in the unfolded distribu-
tion. This means that multi-dimensional distributions can be represented by
one-dimensional histograms. An exception is regularization with a curvature
penalty.
The dependence of the smoothness criteria on the chosen variable can
be used to adapt the regularization approaches to specific problems. If, for
instance, penalties favor a uniform distribution, we can choose a variable in
which we expect the true distribution to be roughly uniform, but in most
cases it is better to adapt the penalty function, because then usually the
resolution is approximately constant and equally sized bins are appropriate.
One might for instance not penalize the deviations from uniformity for a
nearly exponential distribution but the deviation from an exponential. The
corresponding procedure in the iterative EM method is to select the starting
distribution such that it corresponds to our expectation of the true distribu-
tion.

9.3.3 Choice of the Regularization Strength

A critical parameter in all unfolding procedures is the regularization strength,
which regulates the smoothness of the unfolding result and which determines
bias and precision. The optimal value of the corresponding regularization pa-
rameter depends on the specific application. To fix it, we must have an idea
about the shape of the true distribution. We might choose it differently for a
structure function, a Drell-Yan distribution with possible spikes, a transverse
momentum distribution and the distribution of the cosmic background radi-
ation. From the data we can only derive an upper limit of the regularization
parameter: The unfolded distribution has to be compatible within the sta-
tistical uncertainties with the observed histogram. Most unfolding methods
try to approach a corresponding limit and eliminate fluctuations that are
compatible with noise. There is no scientific justification for this pragmatic
choice and one has to be aware of the fact that in this way real, interesting
structures may be eliminated that could be resolved with higher statistics.
There exist many ways to fix the regularization parameter [78, 56]. We
restrict the discussion to three common methods.

Visual Inspection

If we give up the idea of using the unfolded distribution for parameter fits, it
seems tolerable to apply subjective criteria for the choice of the regulariza-
tion strength. By inspection of the unfolding results obtained with increasing
regularization, we are to some extent able to distinguish fluctuations caused
by noise from structures in the true distribution and to select a reasonable
value of the regularization parameter. Probably, this method is in most cases
as good as the following approaches, which are partially quite involved.

Truncation of the Eigenvector Solution

We have seen that the unfolding result can be expanded into orthogonal
components which are statistically independent.
We have studied above the eigenvector decomposition of the modified
least square matrix and realized that the small eigenvalue components λi
cause the unwanted oscillations. A smooth result is obtained by cutting all
contributions with eigenvalues λi below a cut value λk. This procedure is
called truncated singular value decomposition (TSVD). The value λk is
chosen such that eigenvectors with statistically insignificant amplitudes are
excluded. The truncation in the framework of the LSF has its equivalent in
the ML method. We can order the eigenvectors of the covariance matrix of
a MLF according to decreasing errors and retain only the dominant
components.
The physicist community is still attached to the, for historical reasons,
popular linear matrix calculus. Nowadays computers are fast, and truncation

based on the diagonalized covariance matrix derived from a non-linear LS or
a ML fit is probably a better choice than TSVD.

Minimization of the Integrated Square Error ISE

A common measure of the agreement of a PDE with the true distribution is
the integrated square error ISE. For a function f(x) and its PDE f̂(x) it is
defined by

ISE = \int_{-\infty}^{\infty} \left[ \hat{f}(x) - f(x) \right]^2 dx .   (9.18)

ISE is not defined for histograms in the way physicists interpret them.
To adapt the ISE concept to our needs, we modify the definition such that
it measures the difference between the estimated histogram content θ̂i and
the prediction θi. In addition we normalize it to the total number of events
n and the number of true bins M:

ISE' = \sum_{i=1}^{M} \left( \hat{\theta}_i - \theta_i \right)^2 / (nM) .   (9.19)

ISE ′ depends mainly on the resolution, i.e. the response matrix, and less on
the shape of the true distribution. A crude guess of the latter can be used
to estimate the regularization parameter r. (Here r is a generic name for
the number of iterations in the EM method, the penalty strength or the cut
in truncation approaches.) The distribution is unfolded with a preliminarily
chosen regularization parameter. The unfolding result is then used to find
the regularization parameter that minimizes ISE ′. The process can be
iterated, but since the shape of the true distribution is not that critical, this
will not be necessary in the majority of cases. The procedure consists of the
following steps:
1. Unfold d with varying regularization strength r and select the “best”
value r̃ and θ̃(0) by visual inspection of the unfolded histograms.
2. Use θ̃(0) as input for typically n = 100 simulations of “observed” distri-
butions d̃i, i = 1, . . . , n.
3. Unfold each “observed” distribution with varying r and select the value
r̃i that corresponds to the smallest value of ISE ′. The value of ISE ′ is
computed by comparing the unfolded histogram with θ̃(0).
4. Take the mean value r̄ of the regularization strengths r̃i, unfold the ex-
perimental distribution and obtain θ̃(1). If necessary, go back to 2, replace
θ̃(0) by θ̃(1) and iterate.
The procedure is independent of the regularization method.
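The quantity ISE ′ itself is a one-line computation; the histograms in the demonstration are hypothetical:

```python
import numpy as np

def ise_prime(theta_hat, theta):
    """ISE' = sum_i (theta_hat_i - theta_i)^2 / (n * M), eq. (9.19),
    with n the total number of events and M the number of true bins."""
    n, M = theta.sum(), len(theta)
    return float(np.sum((theta_hat - theta) ** 2) / (n * M))

theta = np.array([100., 300., 100.])
theta_hat = np.array([110., 280., 110.])
print(ise_prime(theta_hat, theta))   # → 0.4
```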

9.3.4 Error Assignment to Unfolded Distributions

The regularization introduces a bias and decreases the error δs obtained in
the fit. The height of peaks is reduced, the width is increased, and valleys are
partially filled. The true uncertainty δ depends on the nominal error δs and
the bias b, δ² = δs² + b². Increasing the regularization strength reduces δs
but increases the bias b. Due to the unavoidable bias, the nominal error does
not cover all distributions that are compatible with the observed data. The
nominal errors depend on the selected regularization parameter and do not
conform to the requirements stated in Sect. 4. The diagonal errors given
in plots of the unfolded histogram are often misleading because the errors
are correlated. Nevertheless, they may indicate qualitatively the range of
acceptable true distributions.

Calculation of the Nominal Error

There are several ways to calculate the errors:
1. A common method is to apply error propagation starting from the
observed data d. To be consistent⁵ with the point estimate, the best
estimate of the folded distribution d̂i = Σk Aik θ̂k should be used instead.
Error propagation is quite sensitive to non-linear relations, which occur
with low event numbers.
2. The errors can be derived from the curvature matrix at the LS or ML
estimates. This is the standard way in which symmetric error limits are
computed in the common fitting programs. In principle also likelihood
ratio or χ² contours can be computed. The parameter vector θ is varied,
the corresponding histogram d = Aθ is computed and compared to
d̂ = Aθ̂. Values of θ that change the difference ln L(θ̂) − ln L(θ) by 1/2
fix the standard likelihood ratio error bounds. In the EM method with
early stopping, θ can be re-fitted starting from d̂ and the errors can be
provided by the fit program.
3. We can use bootstrap resampling techniques [79], see Sect. 12.2. In short,
the data sample is considered as representative of the true folded distribu-
tion. From the N observed events, N events are drawn with replacement.
They form a bootstrap sample d∗ which is histogrammed and unfolded.
This procedure is repeated many times, and in this way a set of unfolded
distributions is generated, from which the fluctuations, confidence inter-
vals and correlations can be extracted. For example, in a selected bin the
standard 68 % confidence interval contains 68 % of the bootstrap results.

⁵ Error propagation starting from the observed data instead of the best estimate
can lead to inconsistent results. A striking example is known as Peelle’s pertinent
puzzle [55].

A more detailed discussion of the error estimation with bootstrap methods
is presented in [80].
Usually the statistical uncertainty of the response matrix can be neglected.
If this is not the case, the simplest way to include it is to generate bootstrap
samples of it.
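The bootstrap recipe of method 3 can be sketched as follows, assuming the unfolding procedure is available as a function `unfold` acting on an observed histogram; the function name and all numbers are illustrative assumptions:

```python
import numpy as np

def bootstrap_intervals(events, bin_edges, unfold, n_boot=200, seed=0):
    """Resample the events with replacement, unfold each bootstrap histogram,
    and return the 68% interval [16th, 84th percentile] per unfolded bin."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_boot):
        sample = rng.choice(events, size=len(events), replace=True)
        d_star, _ = np.histogram(sample, bins=bin_edges)
        results.append(unfold(d_star.astype(float)))
    results = np.asarray(results)
    return np.percentile(results, [16, 84], axis=0)

# trivial check with an identity "unfolding"
events = np.random.default_rng(1).uniform(0, 1, 1000)
edges = np.linspace(0, 1, 6)
lo, hi = bootstrap_intervals(events, edges, unfold=lambda d: d)
```

With a real unfolding routine in place of the identity, `lo` and `hi` give per-bin 68 % bounds, and correlations between bins can be read off the collected `results`.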

9.3.5 EM Unfolding with Early Stopping

In a comparison [56] of the different regularization methods, the EM method
performed better than competing approaches.
We have seen that the EM algorithm produces the MLE of the unfolded
histogram. We start with a smooth distribution and suppress the fluctuations
that emerge when we approach the MLE by stopping the iteration once the
result is compatible with the data. We have to fix the starting distribution
and the stopping condition.

Example 130. Unfolding with the EM method


As standard example we select a Gaussian peak above a uniform back-
ground. The resolution is σs = 0.08, equal to the width σb of the bump. The
starting distribution is uniform. The results for a sample of 50000 events are
presented in Fig. 9.9. The unfolded histograms are compared to the true
histogram indicated by squares. The number of iterations varies between 1
and 100000 and is indicated in each plot. The last plot with extreme
fluctuations corresponds to the MLE. In Fig. 9.10 the test quantity ISE ′ is
plotted as a function of the number of applied iterations. The optimal
number of iterations that minimizes ISE ′ for the given data set is 30, but
the result varies slowly with the number of iterations.

Generally, the optimal number of iterations increases with the Gaussian
smoothing parameter σs; for σs = 0 a single unfolding step would be
sufficient and an iteration would not be necessary.

Introducing a Final Smoothing Step

It has been proposed [67, 81, 82] to apply after the iteration sequence a final
smoothing step: after iteration i the result θ^(i) is folded with a smoothing
matrix g, yielding θ^(i)′ with components

\theta_k^{(i)\prime} = \sum_l g_{kl}\, \theta_l^{(i)} .

If θ_k^(i)′ agrees with θ_k^(i−1)′ within given limits, the iteration sequence is
terminated. In this way, convergence to a smooth result is imposed. In [81]
it is proposed to add after the convergence one further iteration, yielding
θ^(i+1)′.
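What such a smoothing matrix g might look like is sketched below with a simple three-bin moving average; the weight w is a hypothetical choice, not one proposed in the cited references:

```python
import numpy as np

def smoothing_matrix(M, w=0.25):
    """Tridiagonal moving-average smoothing matrix g; rows sum to one so
    the total number of events is conserved."""
    g = np.diag(np.full(M, 1.0 - 2.0 * w))
    g += np.diag(np.full(M - 1, w), 1) + np.diag(np.full(M - 1, w), -1)
    g[0, 0] += w       # re-absorb the missing neighbor at the borders
    g[-1, -1] += w
    return g

g = smoothing_matrix(5)
theta_smoothed = g @ np.array([0., 0., 100., 0., 0.])  # spike spreads to [0, 25, 50, 25, 0]
```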
The parameters of the smoothing matrix which define the regularization
strength have to be adjusted to the specific properties of the problem that
has to be solved. The approach may be very successful in problems where
prior knowledge about the shape of the true distribution is available, but in
the general case it is not obvious how to choose the smoothing step. The
intention of the additional iteration is to avoid a too strong influence of the
smoothing step on the final result [81].

Fig. 9.9. Unfolding with the EM method for different numbers of iterations. The
experimental resolution is σs = 0.08. The squares correspond to the true
distribution.

Fig. 9.10. ISE ′ as a function of the number of iterations.

Fig. 9.11. Iterative unfolding with two different starting histograms, left uniform
and right experimental.

Dependence on the starting distribution


So far we have used a uniform starting distribution for the EM iteration. If
there is prior knowledge of the approximate shape of the true distribution,
for instance from previous experiments, then the uniform histogram can be
replaced by a better estimate. Experience shows that the influence of the
starting histogram on the unfolding result is usually rather weak.

Example 131. EM unfolding with different starting distributions


We repeat the unfolding of the distribution with 50000 events and ex-
perimental resolution σs = 0.08, starting with the histogram of the observed
events. (The choice of the example with a large number of events is less sensi-
tive to statistical fluctuations than an example with low statistics and should
indicate possible systematic effects.) The two results displayed in Fig. 9.11
are qualitatively indistinguishable. Starting with the uniform histogram, the
lowest ISE ′ = 0.0964 is obtained after 40 iterations, with χ2 = 35.1. With the
observed histogram the values obtained after 38 iterations are ISE ′ = 0.0940,
with the same value χ2 = 35.1. In the low statistics example with 500 events
and resolution σs = 0.04, the minimum is reached already after 2 iterations,
with ISE ′ = 0.0488 and 0.0487, respectively, and values χ2 = 36.0 and
36.3.
The influence of the starting distribution on the unfolding result should
be checked, but in the majority of cases it is not necessary to deviate from a
uniform distribution.

9.3.6 SVD based methods [68, 78]

Truncated SVD

The SVD decomposes the unfolded histogram into statistically independent
vectors, θ^(0) = Σ_{i=1}^{M} ai ui, and provides an ordering of the vectors
according to their sensitivity to noise. In this way it offers the possibility to
obtain a stable solution by chopping off eigenvectors with low eigenvalues.
Only contributions with eigenvector indices less than or equal to the index
m are kept:

\theta_{reg} = \sum_{i=1}^{m} a_i u_i .

The choice of the cut-off m is based on the significance Si = ai/δi of the
eigenvector contributions ai, which is provided by the LS fit. The amplitudes
of the eliminated eigenvectors should be compatible with zero within one or
two standard deviations.
The application of the method, called truncated SVD (TSVD) is simple
and computationally fast. The idea behind TSVD is attractive but it has
some limitations:
1. The SVD solution is obtained by a linear LS fit. This implies that low
event numbers in the observed histogram are not treated correctly. Com-
bining bins with low event numbers can reduce the problem.
2. The eigenvalue decomposition is essentially related to the properties of
the response matrix and does not sufficiently take into account the shape
of the true distribution. Small eigenvalues may correspond to significant
structures in the true distribution and the corresponding eigenvectors
may be eliminated by the truncation. The combination of the vectors be-
longing to several “insignificant” amplitudes may contribute significantly
to the true distribution.
Fig. 9.12 shows the unfolding results from TSVD for the same data that
have been used to test the EM method. From the dependence of ISE ′ on the
number of eigenvectors, Fig. 9.13, we find that the optimal number is 10. The
agreement of the unfolded histogram with the true histogram is significantly
worse than in the EM method. For low event numbers the method has the
tendency to lose events in the unfolded histogram [56].
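The truncation itself is a one-line operation once the amplitudes and eigenvectors are available; both are toy inputs in the sketch below:

```python
import numpy as np

def tsvd_solution(a, U, m):
    """theta_reg = sum of the first m eigenvector contributions a_i u_i,
    with the columns of U ordered by decreasing eigenvalue."""
    return U[:, :m] @ a[:m]

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(6, 6)))  # orthonormal columns
a = rng.normal(size=6)
theta_reg = tsvd_solution(a, U, m=4)          # drop the two noisiest terms
```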
Fig. 9.12. Unfolding result with SVD and different number of eigenvector contri-
butions, resolution σs = 0.08 and 50000 events.

Fig. 9.13. ISE ′ (arbitrary units) as a function of the number of eigenvectors.

Smooth truncation

It has been proposed [78, 62] to replace the brute force chopping off of the
noise dominated components by a smooth cut. This is accomplished by filter
factors

\varphi(\lambda) = \frac{\lambda^2}{\lambda^2 + \lambda_0^2}   (9.20)

where λ0 is the eigenvalue which fixes the degree of smoothing and λ is the
eigenvalue corresponding to the coefficient which is to be multiplied by ϕ(λ).
The solution is then
\theta_{reg} = \sum_{i=1}^{M} \varphi(\lambda_i)\, a_i u_i .

Fig. 9.14. Filter factor as a function of the eigenvector index.

The function (9.20) is displayed in Fig. 9.14. The amplitude of the eigenvector
with eigenvalue λ = λ0 is reduced by a factor of 2. For large eigenvalues λ the
filter factor is close to one, and for small values it is close to zero. It is not
obvious why a reduction of the amplitude of a component m and the
inclusion of a fraction of the amplitude of a less significant component n > m
should improve the performance.
In [78] it is shown that the filtered SVD solution is equivalent to Tikhonov’s
norm regularization under the condition that the uncertainties of the obser-
vations correspond to white noise (normally distributed fluctuations with
constant variance). We will come back to the norm regularization below.
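The filter factors of (9.20) are straightforward to evaluate; the eigenvalues in the demonstration are hypothetical:

```python
import numpy as np

def filter_factor(lam, lam0):
    """phi(lambda) = lambda^2 / (lambda^2 + lambda0^2), eq. (9.20)."""
    lam = np.asarray(lam, dtype=float)
    return lam**2 / (lam**2 + lam0**2)

print(filter_factor([10.0, 1.0, 0.1], lam0=1.0))  # ≈ [0.99, 0.5, 0.01]
```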

9.3.7 Penalty regularization

The EM and truncated SVD methods are very intuitive and general. If we
have specific ideas about what we consider as smooth, we can penalize devi-
ations from the wanted features by introducing a penalty term R in the
likelihood or LS expression:

\ln L = \ln L_{stat} - R ,   (9.21)
\chi^2 = \chi^2_{stat} + R .   (9.22)

Here ln Lstat and χ2stat are the expressions given in (9.15) and (9.7). The
sign of R is positive such that with increasing R the unfolded histogram
becomes smoother. If we prefer a uniform distribution, R could be chosen
proportional to the norm ||θ||² = Σ_{i=1}^{M} θi². This is the simple Tikhonov
regularization [83]. Popular are also the entropy regularization, which again
favors a uniform solution, and the curvature regularization, which prefers a
linear distribution. Entropy regularization is frequently applied in astronomy
and was introduced to particle physics in Ref. [65]. All three methods have
the tendency to reduce the height of peaks and to fill up valleys, a common
feature of all regularization approaches. More sophisticated penalty functions
can be invented if a priori knowledge about the true distribution is available.
In particle physics, distributions often have a nearly exponential shape.
Then one would select a penalty term which is sensitive to deviations from
an exponential distribution.

Curvature regularization

An often applied regularization function R is

R(x) = r_c \left( \frac{d^2 f}{dx^2} \right)^2 .   (9.23)

It increases with the curvature of f and favors a linear unfolded distribution.
The regularization constant rc determines the power of the regularization.
For a histogram of M bins with constant bin width we approximate (9.23)
by

R = r_c \sum_{i=2}^{M-1} \frac{(2\theta_i - \theta_{i-1} - \theta_{i+1})^2}{n^2} ,   (9.24)

with n the total number of events and rc the parameter that fixes the
regularization strength.
The curvature penalty is a function of the content of three adjacent bins.
It is not very efficient at the two border bins of the histogram. In the field of
PDE specific methods have been developed to avoid the problem [84]. More
smoothing at the edges of the histogram can be achieved by increasing the
bin size of the border bins or by increasing the penalty. The latter solution
is adopted in [80].
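As an illustration, the discrete penalty (9.24) amounts to only a few lines of code. The following sketch is ours; the function name and the test histograms are invented for illustration:

```python
import numpy as np

def curvature_penalty(theta, r_c):
    """Curvature penalty of Eq. (9.24): sum of squared second differences
    of the bin contents theta, normalized by the squared total number of
    events n; only the interior bins i = 2, ..., M-1 contribute."""
    theta = np.asarray(theta, dtype=float)
    n = theta.sum()
    second_diff = 2.0 * theta[1:-1] - theta[:-2] - theta[2:]
    return r_c * np.sum(second_diff ** 2) / n ** 2

# A linear histogram has zero curvature and is therefore not penalized,
# while a peaked histogram is:
linear = 100.0 * np.arange(1, 11)
peaked = np.array([10, 10, 10, 500, 10, 10, 10, 10, 10, 10], dtype=float)
print(curvature_penalty(linear, r_c=1.0))         # -> 0.0
print(curvature_penalty(peaked, r_c=1.0) > 0.0)   # -> True
```

Note that, as stated above, the two border bins appear only through their neighbors, which is why the penalty smooths less at the histogram edges.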

Entropy regularization

We borrow the entropy concept from thermodynamics, where the entropy
S measures the randomness of a state and the maximum of S corresponds
to the equilibrium state which is the state with the highest probability. It
has also been introduced into information theory and into Bayesian statistics
to fix prior probabilities. However, there is no intuitive argument why the
entropy should be especially suited to cure the fake fluctuations caused by
the noise. It is probably the success of the entropy concept in other fields

and its relation to probability which have been at the origin of its application
in unfolding problems. We penalize a low entropy and thus favor a uniform
distribution.
The entropy S of a discrete distribution with probabilities pi, i =
1, . . . , M is defined through the relation

S = − Σ_{i=1}^{M} pi ln pi .

For a random distribution the probability for one of the n = Σθi events to
fall into true bin i is given by θi/n. The maximum of the entropy corresponds
to a uniform population of the bins, i.e. θi = const. = n/M, and equals
Smax = ln M, while its minimum Smin = 0 is found for the one-point
distribution (all events in the same bin j), θi = nδi,j. We define the entropy
regularization penalty with regularization strength re by

R = re Σ_{i=1}^{M} (θi/n) ln(θi/n) .   (9.25)

Adding R to χ² or subtracting it from ln L can be used to smooth a
distribution.
A drawback of a regularization based on the entropy or the norm is that
distant bins are related, while smearing is usually a local effect. Entropy reg-
ularization is popular in astronomy [85, 86] and has been adopted in particle
physics [65, 87].
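The penalty (9.25) can be sketched as follows; the function name and the example histograms are our own illustration:

```python
import numpy as np

def entropy_penalty(theta, r_e):
    """Entropy penalty of Eq. (9.25): R = r_e * sum_i (theta_i/n) ln(theta_i/n).
    R equals -r_e * S; it is most negative (-r_e ln M) for a uniform
    histogram and zero for a one-point distribution, so adding R to chi^2
    (or subtracting it from ln L) favors uniform solutions."""
    p = np.asarray(theta, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # treat 0 * ln 0 as 0
    return r_e * np.sum(p * np.log(p))

uniform = np.full(10, 500.0)
one_point = np.array([0, 0, 0, 5000, 0, 0, 0, 0, 0, 0], dtype=float)
print(entropy_penalty(uniform, 1.0))    # -> -ln 10 ≈ -2.3026
print(entropy_penalty(one_point, 1.0))  # -> 0.0
```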

Tikhonov or Norm Regularization

The most obvious and simplest way to regularize unfolding results is to
penalize a large value of the norm squared ||θ||² of the solution:

R = (rn/n²) Σ_{i=1}^{M} θi² .   (9.26)

The norm regularization was first proposed by Tikhonov [83]. Minimizing
the norm implies a bias towards a small number of events in the unfolded
distribution. To reduce this effect, contrary to the originally proposed penalty,
we normalize the norm to the number of events squared, n².
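A corresponding sketch for (9.26), again with invented names: for a fixed number of events the normalized norm ranges from 1/M (uniform histogram) to 1 (one-point distribution), so minimizing it also favors uniform solutions:

```python
import numpy as np

def norm_penalty(theta, r_n):
    """Tikhonov (norm) penalty of Eq. (9.26), normalized by n^2 so that
    the penalty does not simply favor a small total number of events."""
    theta = np.asarray(theta, dtype=float)
    return r_n * np.sum(theta ** 2) / theta.sum() ** 2

uniform = np.full(10, 500.0)
one_point = np.array([0, 0, 0, 5000, 0, 0, 0, 0, 0, 0], dtype=float)
print(norm_penalty(uniform, 1.0))    # -> 0.1  (= 1/M)
print(norm_penalty(one_point, 1.0))  # -> 1.0
```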

9.3.8 Comparison of the Methods

To compare the performance of the five methods we take an example from
[56] where a more detailed comparison is presented.
Fig. 9.15. Unfolding results from different methods. The top left-hand plot shows
the true distribution and its smeared version. The squares correspond to the true
distribution.

Example 132. Comparison of unfolding methods


We simulate the distribution

f (x) = 0.2N (−2, 1) + 0.5N (2, 1) + 0.3U

defined in the interval [−7, 7] with smearing σs = 1 which has been used
in [80]. The function and its smeared version are displayed in Fig. 9.15 top
left. The unfolded distributions obtained with the EM, the truncated SVD
and three penalty methods for the first of 10 samples with 5000 events are
depicted in the same figure. The optical inspection does not reveal large
differences between the results. The mean values of ISE ′ from the 10 samples
are presented in Fig. 9.16. They indicate that the EM method and entropy
regularization perform better than the other approaches. From a repetition
with 100 simulated experiments, we find that the mean value of ISE ′ is
smaller for the EM method compared to the entropy regularization by a
factor of 2.00 ± 0.16.

The superiority of the EM method has been confirmed for different dis-
tributions [56].

Fig. 9.16. ISE ′ averaged over 10 experiments.

9.3.9 Spline approximation

Simulations of particle experiments often are based on PDFs. For instance,
unfolded proton structure functions are required to predict cross sections in
proton-proton collisions at the Large Hadron Collider at CERN. For this
kind of simulation, coarse binned histograms are not optimal and smooth
unfolding results are preferred which can be obtained with spline approxi-
mations [59, 80, 87]. Spline approximations are sketched in Sect. 11.2.4 and
formulas are given in Appendix 13.15. Monotone distributions like transverse
momentum distributions in particle physics can be described by linear splines.
Distributions with bumps are better approximated with quadratic or cubic
splines. The higher the order of the spline approximation is, the more difficult
it is to adjust the spline function at the borders of the distribution.
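The basis functions Bi(x) of such a superposition Σi βi Bi(x) can be generated with the Cox-de Boor recursion. The following sketch is entirely ours (the formulas of Appendix 13.15 are not reproduced here); it demonstrates the partition-of-unity property that makes the representation convenient:

```python
import numpy as np

def bspline_basis(t, k, i, x):
    """Value of the i-th B-spline of degree k on the knot vector t,
    evaluated via the Cox-de Boor recursion."""
    if k == 0:
        return np.where((t[i] <= x) & (x < t[i + 1]), 1.0, 0.0)
    left = right = 0.0
    if t[i + k] > t[i]:
        left = (x - t[i]) / (t[i + k] - t[i]) * bspline_basis(t, k - 1, i, x)
    if t[i + k + 1] > t[i + 1]:
        right = ((t[i + k + 1] - x) / (t[i + k + 1] - t[i + 1])
                 * bspline_basis(t, k - 1, i + 1, x))
    return left + right

# Cubic splines on a uniform knot grid: in the interior the basis
# functions sum to one, so f(x) = sum_i beta_i B_i(x) can represent a
# smooth density with the coefficients beta_i as fit parameters.
knots = np.linspace(0.0, 1.0, 12)     # 12 knots -> 8 cubic basis splines
x = np.linspace(0.3, 0.7, 5)
total = sum(bspline_basis(knots, 3, i, x) for i in range(len(knots) - 4))
print(np.allclose(total, 1.0))        # -> True
```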
The representation of the unfolded function by a superposition of spline
functions reduces the dependence of the unfolding result on the function
used in the simulation of the response matrix. In the methods with penalty
regularization, the construction of a response matrix and the dependence of
the unfolding result on the distribution used in the Monte Carlo simulation of
the response matrix can be avoided altogether with the parameter estimation
method explained in Chap. 6.
It has to be noted that independently of the regularization a systematic
error is introduced by the fact that the true distribution is approximated by
a spline curve. It can happen that this approximation is poor, but normally
it is excellent within the statistical uncertainties.
Once the elements of the response matrix have been computed, the unfold-
ing proceeds in the same way as with histograms. The unfolding procedure

Fig. 9.17. EM unfolding to a spline function. Three examples are presented. The
top graphs show the true distribution together with the unfolding results. The
bottom graphs contain the corresponding histograms.

is completely analogous. The coefficients βi are fitted to the observed data
vector d in the same way as in the histogram representation.

Example 133. Unfolding to a spline curve


We apply the EM method to three data samples of our standard one-peak
example with 5000 events and smearing resolution σs = 0.08. Twenty cubic
b-splines are fitted to the data. Each time the number of iterations is selected
which minimizes ISE ′ . In Fig. 9.17 the results are depicted. The convergence
of the log-likelihood is displayed in Fig. 9.18. The convergence is initially very
fast and then the residual value decays exponentially.

More detailed studies are necessary to establish the promising perfor-
mance of the EM unfolding into a superposition of b-splines.

9.3.10 Statistical and Systematic Uncertainties of the Response Matrix

Until now we have assumed that we know exactly the probability Aij for
observing elements in bin i which originally were produced in bin j. This is,
at least in principle, impossible, as we have to average the true distribution
f (x) – which we do not know – over the respective bin interval


Fig. 9.18. Difference of the log-likelihood from the value at 106 iterations as a
function of the number of iterations.

Aij = ∫_{x′ ∈ bin i} ∫_{x ∈ bin j} h(x, x′) f(x) dx dx′ / ∫_{bin j} f(x) dx .   (9.27)

Therefore A depends on f. Only if f can be approximated by a constant in
each bin is the dependence negligible. This condition is satisfied if the width of
the response function, i.e. the smearing, is large compared to the bin width
in the true histogram. On the other hand, small bins mean strong oscillations
and correlations between neighboring bins. They suggest a measurement res-
olution which does not really exist. We have two contradicting conditions: To
be independent of the shape of f (x) we would like to choose small bins, to
avoid strong correlations we want wide bins. A way out in situations where
the statistics is relatively large, could be to unfold with narrow bins and to
combine bins after the unfolding. With little statistics this procedure is dif-
ficult to follow [56] because then the errors are asymmetric and the linear
error propagation used in combining bins is a bad approximation. Iteration
of the Monte Carlo input distribution can improve the precision of the re-
sponse matrix and in most cases leads to satisfactory results. To use a spline
approximation is to be preferred to the histogram representation. Eventually,
the dependence of the result on the assumed shape of the Monte Carlo input
distribution has to be investigated and documented by a systematic error.
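A Monte Carlo estimate of (9.27) can be sketched as follows; the function names, the uniform input distribution, and the smearing parameters are ours for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def response_matrix(sample_true, smear, edges, n_mc=500_000):
    """Monte Carlo estimate of A_ij, Eq. (9.27): fraction of the events
    generated in true bin j that are observed in bin i.  Events smeared
    outside the histogram count as acceptance losses."""
    x = sample_true(n_mc)
    xp = smear(x)
    M = len(edges) - 1
    j = np.digitize(x, edges) - 1          # true bin of each event
    i = np.digitize(xp, edges) - 1         # observed bin (may fall outside)
    gen = (j >= 0) & (j < M)
    obs = gen & (i >= 0) & (i < M)
    A = np.zeros((M, M))
    np.add.at(A, (i[obs], j[obs]), 1.0)    # count migrations j -> i
    return A / np.bincount(j[gen], minlength=M)

# Uniform true distribution on [0, 1], Gaussian smearing sigma_s = 0.04:
edges = np.linspace(0.0, 1.0, 11)
A = response_matrix(lambda n: rng.uniform(0.0, 1.0, n),
                    lambda x: x + rng.normal(0.0, 0.04, x.size),
                    edges)
print(A.shape)                         # -> (10, 10)
print(np.all(A.sum(axis=0) <= 1.0))    # -> True (losses at the borders)
```

With the default 500 000 events each column is based on roughly 50 000 generated events, so the multinomial fluctuations of the matrix elements mentioned above are small compared to typical sample sizes.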
The response matrix (9.27) is obtained by a Monte Carlo simulation and
the statistical fluctuations of the simulation have eventually to be taken into
account. This leads to multinomial errors of the transfer matrix elements.
The correct treatment of these errors is rather involved. Thus, if possible,
one should generate a number of simulated observations which is much larger

Fig. 9.19. Effect of deconvolution with a resolution wrong by 10%.

than the experimental sample such that the fluctuations can be neglected.
A rough estimate shows that for a factor of ten more simulated observa-
tions the contribution of the simulation to the statistical error of the result is
only about 5% and then certainly tolerable. When this condition cannot be
fulfilled, bootstrap methods (see Chap. 11) can be used to estimate the uncer-
tainties caused by the statistical fluctuations of A. Apart from the statistical
error of the response matrix, the precision of the reconstruction of f depends
on the size of the experimental sample and on the accuracy with which we
know the resolution. In nuclear and particle physics the sample size is often
the limiting factor, in other fields, like optics, the difficulties frequently are
related to a limited knowledge of the resolution of the measurement.
Fig. 9.19 shows the effect of using a wrong resolution function. The dis-
tribution in the middle is produced from that on the left hand side by convo-
lution with a Gaussian with width σf . Unfolding produces the distribution
on the right hand side, where the assumed width σf′ was taken too low by
10%. For a relative error δ,

δ = (σf − σf′) / σf ,

we obtain an artificial broadening of a Gaussian line after unfolding by

σart² = σf² − σf′² ,
σart = σf (2δ − δ²)^{1/2} ≈ √(2δ) σf ,

where σart² has to be added to the squared width of the original line. Thus
a Dirac δ-function becomes a normal distribution of width σart. Even small
deviations in the resolution can lead to a substantial artificial broadening of

sharp structures if the width of the smearing function is larger than that of
peaks in the true distribution.
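Numerically (our own worked example, using the sign convention that the assumed width is too low by the relative amount δ):

```python
import math

def sigma_art(sigma_f, delta):
    """Artificial width acquired by a delta-peak when unfolding with an
    assumed resolution sigma_f' = (1 - delta) * sigma_f that is too small
    by the relative amount delta."""
    return sigma_f * math.sqrt(2.0 * delta - delta ** 2)

# The 10% error of Fig. 9.19 already produces a broadening of almost
# half the smearing width:
print(round(sigma_art(1.0, 0.10), 3))   # -> 0.436
```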

9.4 Unfolding with Implicit Regularization


If we want to document the experimental information in such a way that it is
conserved for a comparison with a theory that might be developed in future or
if we want to compare or combine the data with those of another experiment,
we must avoid the bias introduced by the explicit regularization. This can be
achieved by retaining the distorted data together with the resolution function.
Such a procedure is optimal because no information is wasted, but it has
severe drawbacks: Two large datasets, the experimental data and the Monte
Carlo sample would have to be published and the whole analysis work would
be left to the scientist who wants to use the data. A less perfect but simple
and more practical way is to unfold the experimental effects and to present
the data in form of a histogram together with an error matrix which then
can be used in a future analysis. To avoid the unpleasant oscillation that we
have discussed in the previous section, we have to choose wide enough bins.
Additional explicit smoothing would bias the data and has to be omitted. We
have to accept that some information will be lost. In most experiments the
experimental smearing is small and the loss is minimal.
A third possibility which preserves the information that is necessary for
a future quantitative analysis is to unfold the data with a simple explicit
smoothing step and to document the smoothing function. A comparison of the
experimental result with a prediction is then possible because the smearing
can be applied to the theoretical prediction, but data of different experiments
cannot be combined.
In the following we turn to the simple and efficient approach where os-
cillations are suppressed by using wide bins in the unfolded histogram. We
call this procedure implicit regularization. The bin contents are fitted either
by minimizing the sum of the least squares or by maximizing the likelihood.
With the LS method also complex situations can be handled, for instance
when background has to be taken care of while the ML method requires
Poisson distributed event numbers. In the following example we assume that
the errors follow the Poisson distribution and determine the MLE with the
EM method.
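The EM (Richardson-Lucy) iteration for Poisson-distributed bin contents can be sketched in a few lines. The small response matrix below is an invented illustration of ours, not the setup of the following example:

```python
import numpy as np

def em_unfold(d, A, n_iter=5000):
    """Richardson-Lucy / EM iteration for Poisson-distributed bin contents:
    theta_j <- (theta_j / eps_j) * sum_i A_ij d_i / (A theta)_i ,
    where eps_j = sum_i A_ij is the acceptance of true bin j."""
    d = np.asarray(d, dtype=float)
    eps = A.sum(axis=0)                                # acceptance per true bin
    theta = np.full(A.shape[1], d.sum() / A.shape[1])  # uniform start
    for _ in range(n_iter):
        pred = A @ theta                               # fold current estimate
        theta = theta / eps * (A.T @ (d / pred))
    return theta

# Noise-free check: with d = A theta_true the iteration recovers theta_true.
A = np.array([[0.8, 0.1, 0.0],
              [0.2, 0.8, 0.2],
              [0.0, 0.1, 0.8]])
theta_true = np.array([100.0, 300.0, 100.0])
theta = em_unfold(A @ theta_true, A)
print(np.allclose(theta, theta_true, atol=0.5))   # -> True
```

With real, fluctuating data the iteration would be stopped early (or the bins chosen wide, as in this section) instead of being run to convergence.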

Example 134. Unfolding with implicit regularization


We simulate data according to the one-peak distribution that we have
used before. The location of the peak is µ = 0.5 and the standard deviation is
0.08. The Gaussian response function has a standard deviation σs = 0.04.
A total of 100000 events is generated. The observed smeared distribution


Fig. 9.20. Unfolding without explicit regularization. The left-hand plot shows
the observed distribution with resolution σs = 0.04, the central plot the unfolded
histogram and the right-hand plot indicates the correlation of bin 10 with the other
bins of the histogram. The curve represents the true distribution.


Fig. 9.21. Unfolding without explicit regularization. The left-hand plot shows the
observed histogram with resolution σs = 0.08, the central plot is the unfolded
histogram and the right hand plot indicates the correlation of bin 4 with the other
bins of the histogram. The curve represents the true distribution.

with 40 bins and the unfolded distribution with 18 bins are shown in Fig.
9.20. The true distribution is not much modified by the smearing. The height
of the peak is slightly reduced, the peak is a bit wider and at the borders
there are acceptance losses. The central plot shows the unfolded distribution
with the diagonal errors. Due to the strong correlation between neighboring
bins, the errors are about a factor of five larger than √θi. In the right-hand
plot the correlation coefficients of bin 10 relative to the other bins are given.
The correlation with the two adjacent bins is negative. It oscillates with the

distance to the considered bin. The correlation coefficients depend only on
the bin width and the smearing function and are independent of the shape
of the distribution. If we reduce the number of events to 5000 and increase
the smearing parameter to σs = 0.08, we have to increase the bin width to
suppress the fluctuations. The result is presented in Fig. 9.21.

Wide bins have the disadvantage that the response matrix depends
strongly on the distribution that is used to generate it. To cure the prob-
lem, the distribution can be approximated by the unfolding result obtained
with explicit regularization to a spline approximation. The remaining bias is
usually negligible.

9.5 Inclusion of Background


We distinguish two different situations.
In situation a) the background is generated by a Poisson random process.
This is by far the dominant case. We unfold the observed histogram as usual
and subtract the background from the unfolded histogram. Either its shape
and amount is known, then it can be subtracted directly, or the background
has to be evaluated from the histogram. It has to be parametrized and its
amount and the parameters have to be fitted in regions of the histogram,
where the background dominates. Background subtraction is treated in Sect.
7.4.
In situation b) the background is due to some malfunction of the detector
and not Poisson distributed. Then it has to be estimated and subtracted in
the observed histogram. Iterative unfolding is no longer possible because it
relies on the Poisson distribution of the number of events in the observed
histogram. The LS fit has to be applied with penalty regularization.

9.6 Summary and Recommendations for the Unfolding of Histograms
Let us summarize our conclusions:
– Whenever an existing theory has to be verified, or parameters of it have
to be estimated, the prediction should be folded and compared to the
observed data. The results then are independent of the distribution used
in the Monte Carlo simulation and the unavoidable information losses of
the unfolding procedures are avoided.
– If this is not the case, the experimental results have to be published in
a way that unknown biases are excluded. This is achieved by unfolding
with bins large enough to avoid excessive oscillations.

– It should be attempted to generate enough Monte Carlo events such that


the statistical uncertainty introduced by the response matrix is negligi-
ble. If this is not possible, its contribution to the error of the unfolded
distribution can be estimated by bootstrap techniques or by variation of
the Monte Carlo statistics.
– Uncertainties in the shape of the distribution used to generate the re-
sponse matrix have to be estimated and taken into account by adding a
systematic error. To keep the uncertainties small, the distribution can be
approximated by a spline function that is determined by unfolding the
data with explicit regularization.
– To visualize the true distribution, it is sensible to unfold with explicit
regularization.
– The preferred choice is R-L iterative unfolding. It is technically very
simple. The standard starting distribution is uniform, but it can be
adapted to specific problems. The method is independent of the di-
mension of the histogram, the size and the ordering of the bins.
– The eigenvector decomposition provides insight into the origin of the
unfolding problem. Regularization by truncation of low eigenvalue
components (TSVD) in many cases is not competitive.
– Regularization with roughness penalties produces results similar to,
but often slightly worse than, those of the EM method. Curvature
penalties are difficult to apply in three or more dimensions. In typical
examples entropy regularization performs better than norm and
curvature regularization.
– It is recommended to adjust the regularization strength by minimizing
the parameter ISE ′ .
– The regularization biases the results such that error estimates underes-
timate the uncertainties and exclude distributions that are compatible
with the data.
Algorithms of available unfolding computer programs are often based on
the experience from only a few simple examples. The results and especially
the quoted error estimates should be used with great care.

9.7 Binning-free Methods


We now turn to binning-free methods. The goal is to reconstruct the sample
that an ideal detector would have observed. The advantage of this approach is
that arbitrary histograms under various selection criteria can be constructed
afterwards. It is especially suited for low statistics distributions in high di-
mensional spaces where histogram bins would suffer from too few events.
A drawback of these methods lies in the absence of a simple error handling
which includes correlations. So far, there is little experience with binning-free
methods.

9.7.1 Iterative Unfolding Based on Probability Density Estimation
We can realize the EM method also in a binning-free way [66].
We start with a Monte Carlo sample of events, each event being defined
by the true coordinate x and the observation x′ . During the iteration process
we modify at each step a weight which we associate to the events such that
the densities in the observation space of simulated and real events approach
each other. Initially all weights are equal to one. At the end of the proce-
dure we have a sample of weighted events which corresponds to the unfolded
distribution.
To this end, we estimate a local density d′ (x′i ) in the vicinity of any
point x′i in the observation space. (For simplicity, we restrict ourselves again
to a one-dimensional space since the generalization to several dimensions
is trivial.) The following density estimation methods (see Chap. 12) lend
themselves:
1. The density is taken as the number of observations within a certain fixed
region around x′i , divided by the length of the region. The length should
correspond roughly to the resolution, if the region contains a sufficient
number of entries.
2. The density is chosen proportional to the inverse length of that interval
which contains the K nearest neighbors, where K should be not less than
about 10 and should be adjusted by the user to the available resolution
and statistics.
We denote by t(x) the simulated density in the true space at location x, by
t′(x′) the folded simulated density at x′, and by d′(x′) the corresponding data
density. The density d′(x′) is estimated from the length of the interval
containing K events, t′ (x′ ) from the number of simulated events M (x′ ) in
the same interval. The simulated densities are updated in each iteration step
k. We associate a preliminary weight

wi(1) = d′(x′i) / t′(0)(x′i) = K / M(x′i)
to the Monte Carlo event i. The weighted events in the vicinity of x represent
a new density t(1) (x) in the true space. We now associate a true weight wi
to the event which is just the average over the preliminary weights of all K
events in the neighborhood of xi, wi = Σj wj′ / K. With the smoothed weight
wi a new observed simulated density t′(1) is computed. In the k-th iteration
the preliminary weight is given by

wi′(k+1) = [ d′(x′i) / t′(k)(x′i) ] wi(k) .


The weight will remain constant once the densities t′ and d′ agree. As a result
we obtain a discrete distribution of coordinates xi with appropriate weights

wi, which represents the unfolded distribution. The degree of regularization
depends on the parameter K used for the density estimation.
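A one-dimensional sketch of this weighting iteration, using the fixed-window density estimate (method 1 above): for brevity the smearing is taken as the identity, so the weights simply reweight a flat Monte Carlo sample to the data density; all names and parameter values are ours.

```python
import numpy as np

def unfold_weights(x_mc, xp_mc, xp_data, h=0.1, n_iter=10):
    """Binning-free iterative reweighting (sketch).  x_mc / xp_mc are the
    true / observed coordinates of the Monte Carlo events, xp_data the
    observed data.  Densities are estimated by counts in a window of
    half-width h; the preliminary weights are then smoothed by averaging
    over the neighbors in the true space."""
    w = np.ones(len(x_mc))
    for _ in range(n_iter):
        w_pre = np.empty_like(w)
        for i, xp in enumerate(xp_mc):
            n_data = np.sum(np.abs(xp_data - xp) < h)    # ~ d'(x'_i)
            w_mc = np.sum(w[np.abs(xp_mc - xp) < h])     # ~ t'(x'_i)
            w_pre[i] = w[i] * n_data / max(w_mc, 1e-12)
        for i, x in enumerate(x_mc):                     # smooth in true space
            w[i] = w_pre[np.abs(x_mc - x) < h].mean()
        w *= len(xp_data) / w.sum()                      # keep normalization
    return w

rng = np.random.default_rng(2)
x_mc = rng.uniform(0.0, 1.0, 500)        # flat Monte Carlo sample
xp_mc = x_mc.copy()                      # identity response for brevity
xp_data = rng.uniform(0.0, 0.5, 500)     # data populate only the left half
w = unfold_weights(x_mc, xp_mc, xp_data)
print(w[x_mc < 0.4].mean() > 5.0 * w[x_mc > 0.6].mean())   # -> True
```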
The method is obviously not restricted to one-dimensional distributions,
and is indeed useful in multi-dimensional cases, where histogram bins suffer
from small numbers of entries. We have to replace the scalars xi, x′i by the
corresponding vectors, and the regions for the density estimation are multi-dimensional.

9.7.2 The Satellite Method

The basic idea of this method [89] is the following: We generate a Monte
Carlo sample of the same size as the experimental data sample. We let the
Monte Carlo events migrate until the distribution of their observed positions
is compatible with the observed data. With the help of a test variable φ, which
could for example be the negative log likelihood and which we will specify
later, we have the possibility to judge quantitatively the compatibility. When
the process has converged, i.e. φ has reached its minimum, the Monte Carlo
sample represents the unfolded distribution.
We proceed as follows:
We denote by {x′1, . . . , x′N} the locations of the points of the experimental
sample and by {y1, . . . , yN} those of the Monte Carlo sample. The observed
density of the simulation is f(y′) = Σi t(yi, y′), where t is the response
function. The test variable φ[x′1, . . . , x′N; f(y′)] is a function of the sample
coordinates x′i and the density expected for the simulation. We execute the
following steps:
1. The points of the experimental sample {x′1 , . . . , x′N } are used as a first
approximation to the true locations y 1 = x′1 , . . . , y N = x′N .
2. We compute the test statistic φ of the system.
3. We select randomly a Monte Carlo event and let it migrate by a random
amount ∆y i into a randomly chosen direction, y i → y i + ∆y i .
4. We recompute φ. If φ has decreased, we keep the move, otherwise we
reject it. If φ has reached its minimum, we stop, if not, we return to step
3.
The resolution or smearing function t is normally not a simple analytic
function, but only numerically available through a Monte Carlo simulation.
Thus we associate to each true Monte Carlo point i a set of K generated
observations {y′i1 , . . . , y ′iK }, which we call satellites and which move together
with y i . The test quantity φ is now a function of the N experimental positions
and the N × K smeared Monte Carlo positions.
Choices of the test statistic φ are presented in Chap. 10. We recommend
to use the variable energy.
The migration distances ∆y i should be taken from a distribution with
a width somewhat larger than the measurement resolution, while the exact
shape of the distribution is not relevant. We therefore recommend to use a

uniform distribution, for which the generation of random numbers is faster


than for a normal or other distributions. The unfolding result is independent
from these choices, but the number of iteration steps can raise appreciably
for a bad choice of parameters.
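A one-dimensional sketch of the migration loop: here we use the negative log-likelihood of the data under a Gaussian kernel density built from the satellites as φ (the energy statistic recommended above would be substituted in practice), and all names, seeds and parameter values are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA = 1.0                               # assumed known Gaussian resolution

def satellite_unfold(x_data, K=10, n_steps=3000, step=2.0):
    """Satellite-method migration (sketch): phi is the negative
    log-likelihood of the observed points under a kernel density centered
    on the satellites; moves that decrease phi are kept.  Returns the
    migrated true points and the total decrease of phi."""
    N = len(x_data)
    y = x_data.copy()                                    # step 1: start at the data
    eta = rng.normal(0.0, SIGMA, (N, K))                 # fixed satellite offsets
    def kern(s):
        s = np.atleast_1d(s)
        return np.exp(-0.5 * ((x_data[:, None] - s[None, :]) / SIGMA) ** 2)
    dens = kern((y[:, None] + eta).ravel()).sum(axis=1)  # density sum per data point
    f0 = f = -np.log(dens).sum()                         # step 2: phi of the system
    for _ in range(n_steps):
        i = rng.integers(N)                              # step 3: migrate one event
        dy = rng.uniform(-step, step)
        dens_new = (dens - kern(y[i] + eta[i]).sum(axis=1)
                         + kern(y[i] + dy + eta[i]).sum(axis=1))
        f_new = -np.log(dens_new).sum()
        if f_new < f:                                    # step 4: keep improving moves
            y[i] += dy
            dens, f = dens_new, f_new
    return y, f0 - f

x_true = rng.normal(0.0, 0.5, 200)
x_data = x_true + rng.normal(0.0, SIGMA, 200)            # smeared observations
y, improvement = satellite_unfold(x_data)
print(improvement > 0.0, y.std() < x_data.std())
```

Updating only the contribution of the moved satellites keeps each step cheap, which matters because, as noted below, the method can be slow for large samples.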

Example 135. Deconvolution of a blurred picture


Figure 9.22 shows a two-dimensional application. The observed picture
consisted of lines and points which are convoluted with a two-dimensional
normal distribution. In the Monte Carlo simulation for each true point K =
25 satellites have been generated. The energy φ is minimized. The resolution
of the lines in the deconvoluted figure on the right hand side is restricted by
the low experimental statistics. For the eyes the restriction is predominantly
due to the low Monte Carlo factor K. Each eye has N = 60 points. The
maximal resolution for a point measured N times is obtained for measurement
error σf as

∆x = σf √(1/N + 1/K)
   = σf √(1/60 + 1/25) = 0.24 σf .

Measurement resolution and acceptance should stay approximately con-


stant in the region in which the events migrate. When we start with a rea-
sonably good approximation of the true distribution, this condition is usually
satisfied. In exceptional cases it would be necessary to update the distribu-
tion of the satellites y′ik after each move, i.e. to simulate or correct them once
again. It is more efficient, however, to perform the adaptation for all elements
periodically after a certain number of migration steps.
The number K determines the maximal resolution of the unfolded distri-
bution; it therefore has a regularization effect. E.g., for a measurement resolu-
tion σf and K = 16 the minimal sampling interval is σT = σf/√K = σf/4.
If the true p.d.f. has several maxima, we may find several relative minima
of the energy. In this case a new stochastic element has to be introduced into
the minimization (see Sect. 5.2.7): a move towards a position with smaller
energy is then not performed automatically, but only preferred statistically.
We have not yet explained how acceptance losses are taken into account.
The simplest possibility is the following: If there are acceptance losses, we
need Ki0 > K trials to generate the K satellites of the event yi. Consequently,
we associate the weight wi = Ki0/K to the element yi. At the end of the unfolding
procedure we obtain a weighted sample.
A more detailed description of the satellite method is found in [89].

Fig. 9.22. Deconvolution of a blurred picture with the satellite method.

9.7.3 The Maximum Likelihood Method

In the rare cases where the transfer function t(x, x′ ) is known analytically
or easily calculable otherwise, we can maximize the likelihood where the
parameters are the locations of the true points. Neglecting acceptance losses,
the p.d.f. for an observation x′ , with the true values x1 , . . . , xN as parameters
is
fN(x′|x1, . . . , xN) = (1/N) Σ_{i=1}^{N} t(xi, x′) ,
where t is assumed to be normalized with respect to x′ . The log likelihood
then is given, up to a constant, by

ln L(x|x′) = Σ_{k=1}^{N} ln Σ_{i=1}^{N} t(xi, x′k) .


The maximum can either be found using the well known minimum search-
ing procedures or the migration method which we have described above and
which is not restricted to low event numbers. Of course maximizing the likeli-
hood leads to the same artifacts as observed in the histogram based methods.
The true points form clusters which, eventually, degenerate into discrete dis-
tributions. A smooth result is obtained by stopping the maximizing process
before the maximum has been reached. For definiteness, similar to the case
of histogram unfolding, a fixed difference of the likelihood from its maximum
value should be chosen to stop the maximization process. Similarly p to the
histogram case, this difference should be of the order of ∆L ≈ N DF/2


Fig. 9.23. Deconvolution of point locations. The middle plot on the left hand side
is deconvoluted and shown in the bottom plot. The true point distribution is given
in the top plot. The right hand side shows the corresponding projections onto the
x axis in form of histograms.

where the number of degrees of freedom NDF is equal to the number of
points times the dimension of the space.
There may be applications, for instance in astronomy, where we are inter-
ested in finding point sources and their intensity. Then the described unfolding
procedure could be used without regularization.
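A one-dimensional sketch of this likelihood maximization by migration, assuming a Gaussian transfer function; stopping after a fixed number of accepted moves stands in here for the early stop discussed above, and all names and parameters are our own choices:

```python
import numpy as np

rng = np.random.default_rng(4)
SIGMA = 1.0          # assumed analytic Gaussian transfer function t(x, x')

def kernel(xp, xi):
    return np.exp(-0.5 * ((xp - xi) / SIGMA) ** 2)

def ml_unfold(xp, n_accept=800, max_steps=50_000, step=2.0):
    """Maximize ln L(x|x') = sum_k ln sum_i t(x_i, x'_k) by migrating the
    true points x; returns the points and the gain in ln L."""
    x = xp.copy()                                     # start at the observations
    S = kernel(xp[:, None], x[None, :]).sum(axis=1)   # inner sums, one per x'_k
    ll0 = ll = np.log(S).sum()
    accepted = 0
    for _ in range(max_steps):
        i = rng.integers(len(x))
        xi_new = x[i] + rng.uniform(-step, step)
        S_new = S - kernel(xp, x[i]) + kernel(xp, xi_new)   # update one term
        ll_new = np.log(S_new).sum()
        if ll_new > ll:                               # keep improving moves only
            x[i], S, ll = xi_new, S_new, ll_new
            accepted += 1
            if accepted >= n_accept:                  # crude regularization
                break
    return x, ll - ll0

xp = rng.normal(0.0, np.hypot(0.8, SIGMA), 300)       # toy smeared observations
x, gain = ml_unfold(xp)
print(len(x), gain > 0.0)
```

Because t is evaluated analytically, no satellite sample is needed, which is why this variant is much faster than the satellite method.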

Example 136. Deconvolution by fitting the true event locations


Fig. 9.23 top shows 2000 points randomly generated according to the
superposition of two normal distributions denoted as N (x′ , y ′ |µx , µy , σx , σy ):

f (x′ , y ′ ) = 0.6 N (x′ , y ′ | − 2, 0, 1, 1) + 0.4 N (x′ , y ′ | + 2, 0, 1, 1) .

The transfer function again is a normal distribution centered at the true


points with symmetric standard deviations of one unit. It is used to convo-
lute the original distribution with the result shown in Fig. 9.23 middle. The
starting values of the parameters x̂i , ŷi are set equal to the observed locations
x′i , yi′ . Randomly selected points are then moved within squares of size 4 × 4
units and moves that improve the likelihood are kept. After 5000 successful
moves the procedure is stopped to avoid clustering of the true points. The
result is shown in the lower plot of Fig. 9.23. On the right hand side of the
same figure the projections of the distribution onto the x axis in form of
histograms are presented.

9.7.4 Summary for Binning-free Methods

The advantage of binning-free methods is that there are no approximations


related to the binning. Unfolding produces again single points in the obser-
vation space which can be subjected to selection criteria and collected into
arbitrary histograms, while methods working with histograms have to decide
on the corresponding parameters before the unfolding is performed.
The binning-free, iterative method based on PDE has the disadvantage
that the user has to choose some parameters. It requires sufficiently high
statistics in all regions of the observation space.
The satellite method is especially well suited for small samples and mul-
tidimensional distributions, where other methods have difficulties. For large
samples it is rather slow even on large computers.
The binning-free likelihood method requires an analytic response function.
It is much faster than the satellite method.
10 Hypothesis Tests and Significance of Signals

10.1 Introduction
So far we treated problems where a data sample was used to discriminate between
completely fixed competing hypotheses or to estimate free parameters
of a given distribution. Now we turn to the task to quantify the compatibility
of observed data with a given hypothesis. We distinguish between the
following topics:
a) Classification, for example event selection in particle reactions.
b) Testing the compatibility of a distribution with a theoretical prediction,
i.e. goodness-of-fit tests.
c) Testing whether two samples originate from the same population, two-
sample tests.
d) Quantification of the significance of signals, like the Higgs signal.
The hypothesis that we intend to test is called the null hypothesis
H0 . In most tests the alternative hypothesis H1 is simply “H0 is false”. Often,
additional characteristics of H1 are very vague and cannot be quantified, but
a test makes sense only if we have an idea about H1 . Otherwise a sensible
formulation of the test is not possible. Formally, a test is associated with a
decision: accept or reject. This is obvious for classification problems, while in
the other cases we are mostly satisfied with the quotation of the so-called
p-value, introduced by R. Fisher, which measures the compatibility of a given
data sample with the null hypothesis.
There is some confusion about the terms significance test and hypothesis
test which has its origin in a controversy between R. Fisher on one side
and J. Neyman1 and E. Pearson2 on the other side. Fisher had a more pragmatic
view while Neyman-Pearson emphasized a strictly formal treatment
with prefixed criteria whether to accept or reject the hypothesis. We will not
distinguish between the two terms but use the term significance mainly for
the analysis of small signals. We will talk about tests even when we do not
decide on the acceptance of H0 .

1 Jerzy Neyman (1894-1981), Polish mathematician
2 Egon Sharpe Pearson (1895-1980), English statistician

The test procedure has to be fixed before looking at the data3 . To base the
selection of a test and its parameters on the data which we want to analyze,
to optimize a test on the basis of the data or to terminate the running time
of an experiment as a function of the output of a test would bias the result.
Goodness-of-fit (GOF) tests and two-sample tests are closely related.
Goodness-of-fit tests are often applied after a parameter of a distribution
has been adjusted to the observed data. In this case the hypothesis depends
on one or several free parameters. We have a composite hypothesis and a
composite test. Two-sample tests are applied if data are to be compared to
a prediction that is modeled by a Monte Carlo sample. Sometimes it is of
interest to check whether experimental conditions have changed. To test the
hypothesis that this is not the case, samples taken at different times are
compared.
At the end of this chapter we will treat another case in which we have a
partially specified alternative and which plays an important role in physics.
There the goal is to investigate whether a small signal is significant or ex-
plainable by a fluctuation of a background distribution. We call this procedure
signal test.

10.2 Some Definitions


Before addressing goodness-of-fit tests, we introduce some notations used in
the statistical literature.

10.2.1 Single and Composite Hypotheses

We distinguish between simple and composite hypotheses. The former fix the
population uniquely. Thus H0 : “The sample is drawn from a normal distribution
with mean zero and variance one, i.e. N (0, 1).” is a simple hypothesis.
If the alternative is also simple, e.g. H1 : “N (5, 1)”, then we have the task
to decide between two simple hypotheses which we have already treated in
Chap. 6, Sect. 6.3. In this case there exists an optimal test, the likelihood
ratio test.
Composite hypotheses are characterized by free parameters, like H0 : “The
sample is drawn from a normal distribution.”. The user will adjust mean and
variance of the normal distribution and test with a goodness-of-fit comparison
whether the adjusted Gaussian is compatible with the data.

10.2.2 Test Statistic, Critical Region and Significance Level

After we have fixed the null hypothesis and the admitted alternative H1 , we
must choose a test statistic t(x), which is a function of the sample values,
3 Scientists often call this a blind analysis.

Fig. 10.1. Relation between the critical value n of a Poisson experiment with mean
100 and the significance level. Observations n > 124 are excluded at a significance
level of 1%.

x ≡ {x1 , . . . , xN }, that discriminates between f0 (t|H0 ) and the distribution
of H1 .
When we test, for instance, the hypothesis that a coordinate is distributed
according to N (0, 1), then for a sample consisting of a single measurement
x, a reasonable test statistic is the absolute value |x|. We assume that if H0
is wrong then |x| would be large. A typical test statistic is the χ2 deviation
of a histogram from a prediction. Large values of χ2 indicate that something
might be wrong with the prediction.
Before we apply the test, we have to fix a critical region K which leads
to the rejection of H0 if t is located inside of it. Under the condition that H0
is true, the probability of rejecting H0 is P {t ∈ K|H0 } = α where 0 ≤ α ≤ 1
normally is a small quantity (e.g. 5%). It is called significance level or size of
the test. For a test based on the χ2 statistic, the critical region is defined by
χ2 > χ2max (α) where the critical value χ2max is a function of the significance
level α. It fixes the range of the critical region, χ2 > χ2max .
To compute rejection probabilities we have to compute the p.d.f. g(t) of
the test statistic. In some cases it is known as we will see below, but in other
cases it has to be obtained by a Monte Carlo simulation. The distribution g
has to include all experimental conditions under which t is determined like
the acceptance and the measurement uncertainty of t.

Example 137. Test of a predicted counting rate


A theory H0 predicts n0 = 100 rare events in an experiment. Observed
are n̂ = 130. We test whether there is an excess of events due to a process
not considered in H0 . The significance level set is α = 0.01 which means
that with 1% probability H0 will be excluded if it is true. Fig. 10.1 shows
the significance level α = 1 − F (n), with F (n) the distribution function, as a
function of n. The critical region n ≥ nc starts at nc = 125 and extends to
infinity; since n̂ = 130 lies inside it, H0 is excluded at a significance level of
1%. (The p-value that will be defined below is 0.0023.)
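These numbers are easy to reproduce with a few lines of plain Python; the direct summation of the Poisson probabilities and the helper name poisson_tail are ours:

```python
import math

def poisson_tail(n, mu):
    """P(N >= n) for N ~ Poisson(mu), by direct summation of the pmf."""
    pmf = math.exp(-mu)            # P(N = 0)
    cdf = pmf
    for k in range(1, n):
        pmf *= mu / k              # recurrence: P(N=k) = P(N=k-1) * mu / k
        cdf += pmf
    return 1.0 - cdf               # P(N >= n) = 1 - P(N <= n-1)

mu, alpha = 100.0, 0.01
# critical value: smallest nc with P(N >= nc) <= alpha
nc = next(n for n in range(100, 200) if poisson_tail(n, mu) <= alpha)
print(nc)                                 # 125
print(round(poisson_tail(130, mu), 4))    # p-value of the observation: 0.0023
```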

Example 138. Particle selection based on the invariant mass

In an experiment with a magnetic spectrometer K 0 → π + π − events are
selected. An event is accepted if the mass mππ reconstructed from the decay
particles agrees with the mass mK of the kaon within 2 standard deviations.
The measurement errors are normally distributed. The null hypothesis is
that x = (mππ − mK )/δ follows a normal distribution with mean zero and
standard deviation one, x ∼ N (0, 1). The test statistic is |x|, the critical
region is |x| ≥ 2 and the size of the test is

α = (2/√2π) ∫_2^∞ exp(−x²/2) dx = 0.0455 .
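The size of this test is just the two-sided tail of the standard normal distribution beyond two standard deviations, which can be evaluated with the complementary error function:

```python
import math

# Size of the test: P(|x| >= 2) for x ~ N(0, 1), i.e. the two-sided tail
alpha = math.erfc(2 / math.sqrt(2))
print(round(alpha, 4))    # 0.0455
```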

Errors of the First and Second Kind, Power of a Test

After the test parameters are selected, we can apply the test to our data. If
the actually obtained value of t is outside the critical region, t ∉ K, then
we accept H0 , otherwise we reject it. This procedure implies four different
outcomes with the following a priori probabilities:
1. H0 ∩ t ∈ K, P {t ∈ K|H0 } = α: error of the first kind (H0 is true but
rejected),
2. H0 ∩ t ∉ K, P {t ∉ K|H0 } = 1 − α (H0 is true and accepted),
3. H1 ∩ t ∈ K, P {t ∈ K|H1 } = 1 − β (H0 is false and rejected),
4. H1 ∩ t ∉ K, P {t ∉ K|H1 } = β: error of the second kind (H0 is false but
accepted).
When we apply the test to a large number of data sets or events, then
the rate α, the error of the first kind, is the inefficiency in the selection of
H0 events, while the rate β, the error of the second kind, represents the
background of H1 events with which the selected events are contaminated.
Of course, for α given, we would like to have β as small as possible. Given the

rejection region K which depends on α, also β is fixed. However, we usually
have only a vague idea about the properties of H1 and cannot compute β.
For a reasonable test we expect that β is monotonically decreasing with α
increasing. The more we restrict the region where H0 is accepted, the more
the background should be reduced. With α → 0 also the critical region K is
shrinking, while the power 1 − β must decrease, and the background is less
suppressed. For fixed α, the power indicates the quality of a test, i.e. how
well alternatives to H0 can be rejected.
The power is a function, the power function, of the significance level α.
Tests which provide maximum power 1−β with respect to H1 for all values of
α are called uniformly most powerful (UMP) tests. Only in rare cases where
H1 is restricted in some way, there exists an optimum, i.e. a UMP test. If
both hypotheses are simple, then as already mentioned in Chap. 6, Sect. 6.3,
according to a lemma of Neyman and E. S. Pearson, the likelihood ratio can
be used as test statistic to discriminate between H0 and H1 and provides a
uniformly most powerful test.
The interpretation of α and β as error rates makes sense when many
experiments or data sets of the same type are investigated. In a search
experiment where we want to find out whether a certain physical process or
a phenomenon exists, or in an isolated GOF test, they refer to virtual
experiments and it is not obvious which conclusions we can draw from their
values.

10.2.3 Consistency and Bias of Tests

A test is called consistent if its power tends to unity as the sample size tends
to infinity. In other words: If we have an infinitely large data sample, we
should always be able to decide between H0 and the alternative H1 .
We also require that, independent of α, the rejection probability for H1 be
higher than for H0 , i.e. α < 1 − β. Tests that violate this condition are called
biased. Consistent tests are asymptotically unbiased.
When H1 represents a family of distributions, consistency and unbiased-
ness are valid only if they apply to all members of the family. Thus in case
that the alternative H1 is not specified, a test is biased if there is an arbi-
trary hypothesis different from H0 with rejection probability less than α and
it is inconsistent if we can find a hypothesis different from H0 which is not
rejected with power unity in the large sample limit.

Example 139. Bias and inconsistency of a test


Assume, we select in an experiment events of the type K 0 → π + π − . The
invariant mass mππ of the pion pairs has to match the K 0 mass. Due to
the finite experimental resolution the experimental masses of the pairs are
normally distributed around the kaon mass mK with variance σ 2 . With the
null hypothesis H0 that we observe only K 0 → π + π − decays, we may apply to
our sample a test with the test quantity t = (mππ − mK )2 /σ 2 , the normalized
mean quadratic difference between the observed masses of N pairs and the
nominal K 0 mass. Our sample is accepted if it satisfies t < t0 where t0 is
the critical quantity which determines the error of the first kind α and the
acceptance 1 − α. The distribution of N t under H0 is a χ2 distribution with
N degrees of freedom. Clearly, the test is biased, because we can imagine
mass distributions with acceptance larger than 1 − α, for instance a uniform
distribution in the range t ≤ t0 . This test is also inconsistent, because it
would favor this specific realization of H1 also for infinitely large samples.
Nevertheless it is not unreasonable for very small samples in the considered
case and for N = 1 there is no alternative. The situation is different for
large samples where more powerful tests exist which take into account the
Gaussian shape of the expected distribution under H0 .
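The bias can be demonstrated by a small simulation. The sketch below (plain Python; the nominal mass of 0.498 GeV, the resolution σ = 5 MeV and the sample sizes are made-up illustration values) estimates the critical value t0 under H0 and then shows that samples from a narrow uniform mass distribution, one possible realization of H1, are accepted with probability larger than 1 − α:

```python
import random

random.seed(1)
N, alpha = 10, 0.05
MK, SIGMA = 0.498, 0.005      # nominal mass and resolution (illustrative values)

def t_stat(masses):
    # normalized mean quadratic deviation from the nominal mass
    return sum((m - MK) ** 2 for m in masses) / (N * SIGMA ** 2)

# critical value t0: (1 - alpha) quantile of t under H0 (Gaussian smearing)
t_h0 = sorted(t_stat([random.gauss(MK, SIGMA) for _ in range(N)])
              for _ in range(20000))
t0 = t_h0[int((1 - alpha) * len(t_h0))]

# alternative: masses uniform in a band so narrow that t stays below t0
accepted = sum(
    t_stat([random.uniform(MK - 0.001, MK + 0.001) for _ in range(N)]) < t0
    for _ in range(20000)
)
print(accepted / 20000 > 1 - alpha)    # True: the test is biased
```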

While consistency is a necessary condition for a sensible test, bias of a test
applied to a small sample cannot always be avoided and is tolerable under
certain circumstances.
The formal definitions of this section are important in many applications,
outside physics, like in drug or fertilizer tests. Physicists talk about efficiency,
purity or contamination. The use of the terms error of the first and second
kind and size of test is rather an exception.

10.2.4 P-Values

Definition

Strictly speaking, the result of a test is that a hypothesis is “accepted” or
“rejected”. In most practical situations it is useful to replace this digital answer
by a continuous parameter, the so-called p-value which is a monotonic
function p(t) of the test statistic t and which measures the compatibility of
the sample with the null hypothesis, a small value of p casting some doubt
on the validity of H0 . To illustrate the meaning of p-values, we return to the
example of a normally distributed measurement x. Here for a measurement
x̂ the p-value is equal to the probability to observe |x| ≥ |x̂| :
p = (2/√2π) ∫_{|x̂|}^∞ e^{−x²/2} dx .
For a measurement x = 0 where the observation coincides exactly with the
prediction of H0 we get p = 1 and for x = ∞ we obtain p = 0. The p-value
of the counting rate example above with a predicted Poisson rate of 100 and
130 observed counts is p = 0.0023, see Fig. 10.1.

Fig. 10.2. Distribution of a test statistic and corresponding p-value curve

To simplify the general definition of the p-value, we assume
that the test statistic t is confined to values between zero and infinity with a
critical region t > tc 4 . Let its distribution under H0 be f0 (t). Then we have

p(t) = 1 − ∫_0^t f0 (t′ ) dt′ = 1 − F0 (t) , (10.1)

with F0 the distribution function. Since p is a unique monotonic function of t,
we can consider p as a normalized test statistic which is completely equivalent
to t.
The relationship between the different quantities which we have intro-
duced is shown in Fig. 10.2. The upper graph represents the p.d.f. of the
test statistic under H0 . The critical region extends from tc to infinity. The
a priori rejection probability for a sample under H0 is α, equal to the in-
tegral of the distribution of the test statistic over the critical region. The
lower graph shows the p-value function. It starts at one and is continuously
decreasing to zero at infinity. The smaller the test statistic is – think of χ2
– the higher is the p-value. At t = tc the p-value is equal to the significance
level α. The condition p < α leads to rejection of H0 in classifications. Due
to its construction, the p.d.f. of the p-value under H0 is uniform.

4 This condition can always be realized for one-sided tests by a variable
transformation. For two-sided tests, p-values cannot be defined.
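The uniformity of p under H0 is easy to verify by simulation; as an arbitrary illustration we take an exponential distribution f0 for the test statistic, for which p = 1 − F0(t) = e^(−t):

```python
import math
import random

random.seed(2)
# draw t from f0 (here exponential with mean 1) and transform to p = e^(-t)
pvals = [math.exp(-random.expovariate(1.0)) for _ in range(100000)]
mean = sum(pvals) / len(pvals)
print(abs(mean - 0.5) < 0.01)    # True: mean consistent with a uniform p.d.f.
```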

Fig. 10.3. Comparison of two experimental histograms (A: p = 0.082, B: p = 0.073)
to a uniform distribution

Interpretation and Use of p-values

Since the distribution of p under H0 is uniform in the interval [0, 1], all values
of p in this interval are equally probable. When we reject a hypothesis under
the condition p < 0.1 we have a probability of 10% to reject H0 if it is true. The rejection
probability would be the same for a rejection region p > 0.9. The reason for
cutting at low p-values is the expectation that distributions of H1 would
produce low p-values.
The name p-value is derived from the word probability, but the p-value is
not the probability that the hypothesis under test is true. It is the probability
under H0 to obtain a value of the test statistic t that is larger than the value
that is actually observed or, equivalently, the probability to obtain a p-value
which is smaller than the observed one. A p-value between zero and p is
expected to occur in the fraction p of experiments if H0 is true.

Example 140. The p-value and the probability of a hypothesis


In Fig. 10.3 we have histogrammed two distributions from two simulated
experiments A and B. Are these uniform distributions? For experiment B
with 10000 observations this is conceivable, while for experiment A with
only 100 observations it is difficult to guess the shape of the distribution.
Alternatives like strongly rising distributions are excluded in B but not in
A. We would therefore attribute a higher probability for the validity of the
hypothesis of a uniform distribution for B than for A, but the p-values based
on the χ2 test are very similar in both cases, namely p ≈ 0.08.

We learn from this example that the p-value is more sensitive to deviations
from H0 in large samples than in small samples. Since in practice small unknown
systematic errors can rarely be excluded, we should not be astonished
that in high statistics experiments often small p-values occur. The systematic
uncertainties which usually are not considered in the null hypothesis then
dominate the purely statistical fluctuation.

Fig. 10.4. Experimental distribution of p-values
Even though we cannot transform significant deviations into probabilities
for the validity of a hypothesis, they provide useful hints for hidden measure-
ment errors or a contamination with background. In our example a linearly
rising distribution is added to uniform distributions. The fractions are 45%
in experiment A and 5% in experiment B.
In classification problems we are able to compare many replicates of mea-
surements to the same hypothesis. In particle physics experiments usually
a huge number of tracks has to be reconstructed. The track parameters are
adjusted by a χ2 fit to measured points assuming normally distributed un-
certainties. The χ2 value of each fit can be used as a test statistic and trans-
formed into a p-value, often called χ2 probability. Histograms of p-values
obtained in such a way are very instructive. They often look like the one
shown in Fig. 10.4. The plot has two interesting features: It is slightly ris-
ing with increasing p-value which indicates that the errors have been slightly
overestimated. The peak at low p-values is due to fake tracks which do not
correspond to particle trajectories and which we would eliminate almost com-
pletely by a cut at about pc = 0.05. We would have to pay for it by a loss of
good tracks of slightly less than 5%. A more precise estimate of the loss can
be obtained by an extrapolation of the smooth part of the p-value distribution
to p = 0.

Combination of p-values

If two p-values p1 , p2 which have been derived from independent test statistics
t1 , t2 are available, we would like to combine them to a single p-value p. The
at first sight obvious idea to set p = p1 p2 suffers from the fact that the
distribution of p will not be uniform. A popular but arbitrary choice is

p = p1 p2 [1 − ln(p1 p2 )] (10.2)

which can be shown to be uniformly distributed [91]. This choice has the
unpleasant feature that the combination of the p-values is not associative,
i.e. p [(p1 , p2 ), p3 ] ≠ p [p1 , (p2 , p3 )]. There is no satisfactory way to combine
p-values.
We propose, if possible, not to use (10.2) but to go back to the original
test statistics and construct from them a combined statistic t and the cor-
responding p distribution. For instance, the obvious combination of two χ2
statistics would be t = χ21 + χ22 .
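A sketch of the prescription (10.2) in plain Python, showing also its lack of associativity:

```python
import math

def combine(p1, p2):
    """Combine two independent p-values according to (10.2)."""
    q = p1 * p2
    return q * (1.0 - math.log(q))

print(round(combine(0.1, 0.2), 4))    # 0.0982

# not associative: the two orders of combination disagree
a = combine(combine(0.1, 0.2), 0.3)
b = combine(0.1, combine(0.2, 0.3))
print(abs(a - b) > 0.01)              # True
```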

10.3 Classification problems

In classification problems we decide whether to accept or reject hypotheses
as a result of a test. Examples are event selection (e.g. B quark production),
particle track selection on the basis of the quality of reconstruction
and particle identification (e.g. electron identification based on calorimeter
or Cerenkov information). Typical for these examples is that we examine a
number of similar objects and accept a certain error rate α. The goal is to
find the optimal test statistic. Its critical value determines the acceptance
of the selected events and its contamination by background and indirectly
the statistical and the systematic uncertainty of the result. The choice
depends on the physics goal and on how well we can estimate the amount of
background. For instance, when we select B particle decays, to determine
the production rate we will probably allow for a higher contamination than
if we want to determine the lifetime of the B mesons. Often it is useful to
transform the test statistic into a p-value where we know that it should be
uniformly distributed under H0 .
Sophisticated classification methods have been developed in the last few
decades along with the increased computing power. We will discuss some of
them in Chap. 11 where we introduce artificial neural networks and decision
trees. The goodness-of-fit-tests that will be treated in the following section
can also be applied to classification problems.

Fig. 10.5. Comparison of an experimental distribution to a prediction.

10.4 Goodness-of-Fit Tests

10.4.1 General Remarks

Goodness-of-fit (GOF) tests check whether a sample is compatible with a
given distribution. An experienced scientist has a quite good feeling for deviations
between two distributions just by looking at a plot. For instance, when
we examine the statistical distribution of Fig. 10.5, we will realize that its
description by an exponential distribution is rather unsatisfactory. The question
is: How can we quantify the disagreement? Without an idea about a possible
alternative description it is difficult to select an efficient test procedure.
To test whether a roulette is behaving correctly, we check whether all
numbers occur with equal probability. We could construct a roulette which
is producing all numbers sequentially. It would pass the test but not the
requirement. However this behavior is not what we imagine for a standard
roulette, we would rather expect that possibly some numbers occur more often
than others and this is what the test should exclude. When we test a random
number generator we would be interested, for example, in a periodicity of the
results or a correlation between subsequent numbers and we would choose a
different test.
GOF tests are not only used to check the validity of a hypothesis but also
serve to detect unknown systematic errors in experimental results. When we
measure the mean life of an unstable particle, we know that the lifetime
distribution is exponential but to apply a GOF test is informative, because
a low p-value may indicate a contamination of the events by background,
an unsatisfactory simulation of the detector properties or problems with the
experimental equipment.

Fig. 10.6. Two different samples and a hypothesis

A typical test quantity is the χ2 -variable which we have introduced to
adjust parameters of functions to experimental histograms or measured points
with known error distributions. In the least square method of parameter
inference, see Chap. 6.4.5, the parameters are fixed such that the sum χ2 of
the normalized quadratic deviations is minimum. Deviating parameter values
produce larger values of χ2 , consequently we expect the same effect when we
compare the data to a wrong hypothesis. If χ2 is abnormally large, it is likely
that the null hypothesis is not correct.
Physicists use almost exclusively the χ2 test, even though for many applications
more powerful tests are available. Scientists also often overestimate
the significance of the χ2 test results. Other tests like the Kolmogorov–
Smirnov test and tests of the Cramer–von Mises family avoid the always
somewhat arbitrary binning of histograms in the χ2 test. These tests are
restricted to univariate distributions, however. Other binning-free methods
can also be applied to multivariate distributions.
Sometimes students think that the likelihood L0 of the null hypothesis
is a powerful test statistic, e.g. for H0 with single event distribution f0 (x)
the product Πi f0 (xi ). That this is not a good idea is demonstrated in Fig.
10.6 where the null hypothesis is represented by a fully specified normal dis-
tribution. From the two samples, the narrow one clearly fits the distribution
worse but it has the higher likelihood. A sample where all observations are
located at the center would by definition maximize the likelihood but such a
sample would certainly not support the null hypothesis but rather a narrow
Gaussian.

While the indicated methods are distribution-free, i.e. applicable to arbitrary
distributions specified by H0 , there are procedures to check the agreement
of data with specific distributions like normal, uniform or exponential
distributions. These methods are of inferior importance for physics applications.
We will deal only with distribution-free methods.
We will not discuss tests related to order statistics. These tests are mainly
used in connection with time series and are not very powerful in most of our
applications.
At the end of this section we want to recall that parameter inference
with a valid hypothesis and GOF tests which question the validity of a hypothesis
touch two completely different problems. Whenever possible deviations can
be parameterized, it is always appropriate to determine the likelihood function
of the parameter and use the likelihood ratio to discriminate between different
parameter values.
A good review of GOF tests can be found in [90], in which, however,
recent developments are missing.

10.4.2 The χ2 Test in Generalized Form

The Idea of the χ2 Comparison

We consider a sample of N observations which are characterized by the values
xi of a variable x and a prediction f0 (x) of their distribution. We subdivide
the range of x into B intervals to which we attach sequence numbers k. The
prediction pk for the probability that an observation is contained in interval
k is:

pk = ∫_k f0 (x) dx ,
with Σpk = 1. The integration extends over the interval k. The number
of sample observations dk found in this bin has to be compared with the
expectation value N pk . To interpret the deviation dk − N pk , we have to
evaluate the expected mean quadratic deviation δk2 under the condition that
the prediction is correct. Since the distribution of the observations into bins
follows a binomial distribution, we have

δk2 = N pk (1 − pk ) .

Usually the observations are distributed into typically more than 10 bins.
Thus the probabilities pk are small compared to unity and the expression in
brackets can be omitted. This is the Poisson approximation of the binomial
distribution. The mean quadratic deviation is equal to the number of expected
observations in the bin:
δk2 = N pk .
We now normalize the observed to the expected mean quadratic deviation,

χ2k = (dk − N pk )2 /(N pk ) ,

and sum over all B bins:

χ2 = Σ_{k=1}^{B} (dk − N pk )2 /(N pk ) . (10.3)

By construction we have:

⟨χ2k ⟩ ≈ 1 , ⟨χ2 ⟩ ≈ B .

If the quantity χ2 is considerably larger than the number of bins, then
obviously the measurement deviates significantly from the prediction.
A significant deviation to small values χ2 ≪ B, even though considered
as unlikely, is tolerated, because we know that alternative hypotheses do not
produce smaller ⟨χ2 ⟩ than H0 .
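A minimal sketch of the comparison (10.3), with made-up bin contents dk and a uniform prediction over B = 5 bins:

```python
d = [22, 18, 25, 15, 20]    # observed entries per bin (made-up numbers)
p = [0.2] * 5               # predicted bin probabilities, summing to one
N = sum(d)                  # total number of observations

chi2 = sum((dk - N * pk) ** 2 / (N * pk) for dk, pk in zip(d, p))
print(round(chi2, 2))       # 2.9, to be compared with the expectation of about B
```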

The χ2 Distribution and the χ2 Test

We now want to be more quantitative. If H0 is valid, the distribution of χ2
follows to a very good approximation the χ2 distribution which we have introduced
in Sect. 3.6.7 and which is displayed in Fig. 3.18. The approximation
relies on the approximation of the distribution of observations per bin by
a normal distribution, a condition which in most applications is sufficiently
well satisfied if the expected number of entries per bin is larger than about 10.
The parameter number of degrees of freedom (NDF) f of the χ2 distribution
is equal to the expectation value and to the number of bins minus one:

⟨χ2 ⟩ = f = B − 1 . (10.4)

Originally we had set ⟨χ2 ⟩ ≈ B but this relation overestimates χ2 slightly.
The smaller value B − 1 is plausible because the individual deviations are
somewhat smaller than one – remember, we had approximated the binomial
distribution by a Poisson distribution. For instance, in the limit of a single
bin, the mean deviation is not one but zero. We will come back to this point
below.
In some cases we have not only a prediction of the shape of the distribution
but also a prediction N0 of the total number of observations. Then the number
of entries in each bin should follow a Poisson distribution with mean N0 pk ,
(10.3) has to be replaced by
χ2 = Σ_{k=1}^{B} (dk − N0 pk )2 /(N0 pk ) . (10.5)

Fig. 10.7. p-value for the observation χ̂2 .

and we have f = B = ⟨χ2 ⟩.


In experiments with low statistics the approximation that the distribution
of the number of entries in each bin follows a normal distribution is sometimes
not justified and the distribution of the χ2 quantity as defined by (10.3) or
(10.5) is not very well described by a χ2 distribution. Then we have the
possibility to determine the distribution of our χ2 variable under H0 by a
Monte Carlo simulation5 .
In Fig. 10.7 we illustrate how we can deduce the p-value or χ2 probability
from the distribution and the experimental value χ̂2 of our test statistic
χ2 . The experimental value χ̂2 divides the χ2 distribution, which is fixed
through the number of degrees of freedom, and which is independent of the
data, into two parts. According to its definition (10.1), the p-value p(χ̂2 ) is
equal to the area of the right hand part. It is the fraction of many imagined
experiments where χ2 is larger than the experimentally observed value χ̂2 –
always assuming that H0 is correct. As mentioned above, high values of χ2
and correspondingly low values of p indicate that the theoretical description
is inadequate to describe the data. The reason is in most cases found in
experimental problems.
The χ2 comparison becomes a test, if we accept the theoretical description
of the data only if the p-value exceeds a critical value, the significance level

5 We have to be especially careful when the significance level α is small.

Fig. 10.8. Critical χ2 values as a function of the number of degrees of freedom with
the significance level as parameter.

α, and reject it for p < α. The χ2 test is also called Pearson test after the
statistician Karl Pearson6 , who introduced it as early as 1900.
Figure 10.8 gives the critical values of χ2 as a function of the number of
degrees of freedom with the significance level as parameter. To simplify the
presentation we have replaced the discrete points by curves. The p-value as
a function of χ2 with NDF as parameter is available in the form of tables or
in graphical form in many books. The internet provides on-line calculation
programs. For large f , about f > 20 and not too small α, the χ2 distribution
can be approximated sufficiently well by a normal distribution with mean
value x0 = f and variance s2 = 2f . We are then able to compute the p-
values from integrals over the normal distribution. Tables can be found in the
literature or alternatively, the computation can be performed with computer
programs like Mathematica or Maple.
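The normal approximation can be checked against an exact expression for the p-value that holds for an even number of degrees of freedom, where the χ² survival probability reduces to a finite Poisson sum. A small Python sketch (function names and the numerical example are ours, not from the text):

```python
import math

def chi2_pvalue_even_ndf(chi2, f):
    """Exact p-value P(chi^2 > observed) for an even number of degrees
    of freedom f = 2m, via p = exp(-c/2) * sum_{k=0}^{m-1} (c/2)^k / k!."""
    assert f % 2 == 0 and f > 0
    c2 = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, f // 2):
        term *= c2 / k
        total += term
    return math.exp(-c2) * total

def chi2_pvalue_normal_approx(chi2, f):
    """Normal approximation N(f, 2f) quoted in the text for large f."""
    z = (chi2 - f) / math.sqrt(2.0 * f)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# f = 20: the critical value at significance level 0.05 is about 31.4
p_exact = chi2_pvalue_even_ndf(31.41, 20)
p_approx = chi2_pvalue_normal_approx(31.41, 20)
```

Comparing the two values at f = 20 shows that the normal approximation is still somewhat crude for small p, in line with the caveat above.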

The Choice of Binning

There is no general rule for the choice of the number and width of the his-
togram bins for the χ² comparison, but we note that the χ² test loses sig-
nificance when the number of bins becomes too large.
6
Karl Pearson (1857–1936), British mathematician.
10.4 Goodness-of-Fit Tests 319

To estimate the effect of fine binning for a smooth deviation, we consider
a systematic deviation which is constant over a certain region with a total
number of entries N0 and which produces an excess of εN0 events. Partition-
ing the region into B bins would add to the statistical χ² in each single bin
the contribution:

χs² = (εN0/B)² / (N0/B) = ε²N0/B .
For B bins we increase χ² by ε²N0 which is to be compared to the purely
statistical contribution χ0² which is on average equal to B. The significance
S, i.e. the systematic deviation in units of the expected fluctuation √(2B), is

S = ε²N0 / √(2B) .

It decreases with the square root of the number of bins.
We recommend a fine binning only if deviations are considered which are
restricted to narrow regions. This could be for instance pick-up spikes. These
are pretty rare in our applications. Rather we have systematic deviations
produced by non-linearity of measurement devices or by background and
which extend over a large region. Then wide intervals are to be preferred.
In [92] it is proposed to choose the number of bins according to the formula
B = 2N^{2/5} as a function of the sample size N.
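The scaling S ∝ 1/√B and the bin-number rule can be illustrated numerically (a Python sketch with made-up numbers; the function names are ours):

```python
import math

def systematic_significance(eps, n0, b):
    """S = eps^2 * N0 / sqrt(2B): the systematic addition eps^2*N0 to chi^2
    in units of the fluctuation sqrt(2B) of the statistical part (mean B)."""
    return eps ** 2 * n0 / math.sqrt(2.0 * b)

def recommended_bins(n):
    """The bin-number rule B = 2 N^(2/5) quoted from [92]."""
    return 2.0 * n ** 0.4

# the same 2% relative excess on N0 = 10000 entries, with 5 vs. 50 bins:
s_coarse = systematic_significance(0.02, 10000.0, 5)
s_fine = systematic_significance(0.02, 10000.0, 50)
```

Increasing the number of bins by a factor of ten reduces the significance of the broad systematic excess by √10.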

Example 141. Comparison of different tests for background under an expo-
nential distribution
In Fig. 10.5 a histogrammed sample is compared to an exponential. The
sample contains, besides observations following this distribution, a small con-
tribution of uniformly distributed events. From Table 10.1 we recognize that
this defect expresses itself by small p-values and that the corresponding de-
crease becomes more pronounced with decreasing number of bins.

Some statisticians propose to adjust the bin parameters such that the
number of events is the same in all bins. In our table this partitioning is
denoted by e.p. (equal probability). In the present example this does not
improve the significance.
The value of χ2 is independent of the signs of the deviations. However, if
several adjacent bins show an excess (or lack) of events like in the left hand
histogram of Fig. 10.9 this indicates a systematic discrepancy which one
would not expect at the same level for the central histogram which produces
the same value for χ2 . Because correlations between neighboring bins do
not enter in the test, a visual inspection is often more effective than the
mathematical test. Sometimes it is helpful to present for every bin the value
of χ2 multiplied by the sign of the deviation either graphically or in form of
a table.

Table 10.1. p-values for χ2 and EDF statistic.


test                 p-value
χ2 , 50 Bins 0.10
χ2 , 50 Bins, e.p. 0.05
χ2 , 20 Bins 0.08
χ2 , 20 Bins, e.p. 0.07
χ2 , 10 Bins 0.06
χ2 , 10 Bins, e.p. 0.11
χ2 , 5 Bins 0.004
χ2 , 5 Bins, e.p. 0.01
Dmax 0.005
W2 0.001
A2 0.0005

150

100
entries

50

0
0.0 0.5 1.0 0.0 0.5 1.0 0.0 0.5 1.0

x x x

Fig. 10.9. The left hand and the central histogram produce the same χ2 p-value,
the left hand and the right hand histograms produce the same Kolmogorov p-value.

Example 142. χ2 comparison for a two-dimensional histogram


In the following table for a two-dimensional histogram the values of χ² accompanied
by the sign are presented. The absolute values are well confined
in the range of our expectation but near the right hand border we observe
an accumulation of positive deviations which point to a systematic effect.
i\j   1    2    3    4    5    6    7    8
1    0.1 -0.5  1.3 -0.3  1.6 -1.1  2.0  1.2
2   -1.9  0.5 -0.4  0.1 -1.2  1.3  1.5
3   -1.2 -0.8  0.2  0.1  1.3  1.9
4    0.2  0.7 -0.6  1.1  2.2

Generalization to Arbitrary Measurements

The Pearson method can be generalized to arbitrary measurements yk with


mean square errors δk². For theoretical predictions tk we compute χ²,

χ² = Σ_{k=1}^{N} (yk − tk)² / δk² ,

where χ2 follows a χ2 distribution of f = N degrees of freedom. A necessary


condition for the validity of the χ2 distribution is that the uncertainties follow
a normal distribution.
A further generalization is given in the Appendix 13.10 where weighted
events and statistical errors of the theoretical predictions, resulting from the
usual Monte Carlo calculation, are considered.
Remark: The quantity δ² has to be calculated under the assumption that
the theoretical description which is to be tested is correct. This means that
normally the raw measurement error cannot be inserted. For example, instead
of ascribing to a measured quantity an error δ′k which is proportional to its
value yk, a corrected error

δk = δ′k tk/yk

should be used.
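The remark can be illustrated with a small sketch (Python; the 10 % relative measurement error and the numbers are our own illustration):

```python
def chi2_corrected(y, t, rel_err):
    """chi^2 with uncertainties evaluated under H0: for an error that is
    proportional to the measured value, delta_k = delta'_k * t_k / y_k
    reduces to rel_err * t_k (since delta'_k = rel_err * y_k)."""
    return sum((yk - tk) ** 2 / (rel_err * tk) ** 2 for yk, tk in zip(y, t))

def chi2_raw(y, t, rel_err):
    """The same sum with the raw errors delta'_k = rel_err * y_k."""
    return sum((yk - tk) ** 2 / (rel_err * yk) ** 2 for yk, tk in zip(y, t))

y = [95.0, 110.0, 88.0]    # measurements (illustrative numbers)
t = [100.0, 100.0, 100.0]  # predictions under H0
```

With raw errors, downward-fluctuating measurements get too small an uncertainty and inflate χ².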
Sometimes extremely small values of χ2 are presented. The reason is in
most cases an overestimation of the errors.
The variable χ2 is frequently used to separate signal events from back-
ground. To this end, the experimental distribution of χ2 is transformed into
a p-value distribution like the one presented in Fig. 10.4. In this situation it
is not required that χ2 follows the χ2 distribution. It is only necessary that
it is a discriminative test statistic.

The χ2 Test for Composite Hypotheses

In most cases measurements do not serve to verify a fixed theory but to


estimate one or more parameters. The method of least squares for parameter
estimation has been discussed in Sect. 6.7. To fit a curve y = t(x, θ) to
measured points yi with Gaussian errors σi , i = 1, . . . , N , we minimize the
quantity
χ² = Σ_{i=1}^{N} (yi − t(xi, θ1, . . . , θZ))² / σi² ,   (10.6)
with respect to the Z free parameters θk .

It is plausible that with increasing number of parameters, which are ad-


justed, the description of the data improves, χ2 decreases. In the extreme
case where the number of parameters is equal to the number N of measured
points or histogram bins it becomes zero. The distribution of χ2 in the gen-
eral case where Z parameters are adjusted follows under conditions to be
discussed below a χ² distribution of f = N − Z degrees of freedom.
Setting in (10.6) zi = (yi − t(xi, θ))/σi we may interpret χ² = Σ_{i=1}^{N} zi² as
the (squared) distance of a point z with normally distributed components
from the origin in an N -dimensional space. If all parameters are fixed except
one, say θ1 , which is left free and adjusted to the data by minimizing χ2 , we
have to set the derivative with respect to θ1 equal to zero:
−(1/2) ∂χ²/∂θ1 = Σ_{i=1}^{N} (zi/σi) ∂t/∂θ1 = 0 .

If t is a linear function of the parameters, an assumption which is often


justified at least approximately7, the derivatives are constants, and we get
a linear relation (constraint) of the form c1 z1 + · · · + cN zN = 0. It defines
a (N − 1)-dimensional subspace, a hyperplane containing the origin, of the
N -dimensional z space. Consequently, the distance in z space is confined to
this subspace and derived from N − 1 components. For Z free parameters
we get Z constraints and a (N − Z)-dimensional subspace. The independent
components (dimensions) of this subspace are called degrees of freedom. The
number of degrees of freedom is f = N − Z as pretended above. Obviously,
the sum of f squared components will follow a χ2 distribution with f degrees
of freedom.
In the case of fitting a normalized distribution to a histogram with B bins
which we have considered above, we had to set (see Sect. 3.6.7) f = B − 1.
This is explained by a constraint of the form z1 + · · · + zB = 0 which is valid
due to the equality of the normalization for data and theory.
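The statement f = N − Z can be verified with a toy simulation: fitting a straight line (a model linear in its Z = 2 parameters) to N normally smeared points and averaging the minimum χ² over many repetitions should give about N − Z. A Python sketch (seed, sample size and true parameters are arbitrary choices of ours):

```python
import random

def fit_line_chi2(x, y, sigma):
    """Weighted least-squares fit of y = a + b*x (linear in a and b)
    via the normal equations; returns the minimum chi^2."""
    w = [1.0 / s ** 2 for s in sigma]
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx ** 2
    a = (swxx * swy - swx * swxy) / det
    b = (sw * swxy - swx * swy) / det
    return sum(wi * (yi - a - b * xi) ** 2 for wi, xi, yi in zip(w, x, y))

random.seed(1)
N, Z = 10, 2                  # 10 points, 2 adjusted parameters
x = [float(i) for i in range(N)]
chi2_values = []
for _ in range(2000):         # many imagined experiments under H0
    y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]
    chi2_values.append(fit_line_chi2(x, y, [1.0] * N))
mean_chi2 = sum(chi2_values) / len(chi2_values)  # expect about N - Z = 8
```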

The χ2 Test for Small Samples


When the number of entries per histogram bin is small, the approximation
that the variations are normally distributed is not justified. Consequently,
the χ2 distribution should no longer be used to calculate the p-value.
Nevertheless we can use in this situation the sum of quadratic deviations
χ2 as test statistic. The distribution f0 (χ2 ) has then to be determined by a
Monte Carlo simulation. The method still works pretty well.

Warning
The assumption that the distribution of the test statistic under H0 is de-
scribed by a χ2 distribution relies on the following assumptions: 1. The en-
7
Note that also the σi have to be independent of the parameters.

tries in all bins of the histogram are normally distributed. 2. The expected
number of entries depends linearly on the free parameters in the considered
parameter range. An indication for a non-linearity are asymmetric errors of
the adjusted parameters. 3. The estimated uncertainties σi in the denomina-
tors of the summands of χ2 are independent of the parameters. Deviations
from these conditions affect mostly the distribution at large values of χ2 and
thus the estimation of small p-values. Corresponding conditions have to be
satisfied when we test the GOF of a curve to measured points. Whenever we
are not convinced about their validity we have to generate the distribution
of χ2 by a Monte Carlo simulation.

10.4.3 The Likelihood Ratio Test


General Form
The likelihood ratio test compares H0 to a parameter dependent alterna-
tive H1 which includes H0 as a special case. The two hypotheses are defined
through the p.d.f.s f (x|θ) and f (x|θ0 ) where the parameter set θ0 is a subset
of θ, often just a fixed value of θ. The test statistic is the likelihood ratio
λ, the ratio of the likelihood of H0 and the likelihood of H1 where the pa-
rameters are chosen such that they maximize the likelihoods for the given
observations x. It is given by the expression
λ = sup L(θ0|x) / sup L(θ|x) ,   (10.7)

or equivalently by

ln λ = sup ln L(θ0|x) − sup ln L(θ|x) .

If θ0 is a fixed value, this expression simplifies to ln λ = ln L(θ0|x) − sup ln L(θ|x).
From the definition (10.7) follows that λ always obeys λ ≤ 1.

Example 143. Likelihood ratio test for a Poisson count


Let us assume that H0 predicts µ0 = 10 decays in an hour, while 8 are
observed. The likelihood to observe 8 for the Poisson distribution is L0 =
P10(8) = e^{−10} 10^8/8!. The likelihood is maximal for µ = 8, it is L = P8(8) =
e^{−8} 8^8/8!. Thus the likelihood ratio is λ = P10(8)/P8(8) = e^{−2}(5/4)^8 = 0.807.
The probability P to observe a ratio smaller than or equal to 0.807 is

P = Σk P10(k) , summed over all k with λ(k, 10) ≤ 0.807 .

Relevant numerical values of λ(k, µ0 ) = Pµ0 (k)/Pk (k) and Pµ0 (k) for µ0 = 10
are given in the following table.

k   8     9     10    11    12    13
λ   0.807 0.950 1.000 0.953 0.829 0.663
P   0.113 0.125 0.125 0.114 0.095 0.073

It is seen that the sum over k runs over all k except k = 9, 10, 11, 12:
p = Σ_{k=0}^{8} P10(k) + Σ_{k=13}^{∞} P10(k) = 1 − Σ_{k=9}^{12} P10(k) = 0.541, which is certainly
acceptable.
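The numbers of the example can be reproduced with a few lines of Python (the helper names are ours; logarithms via lgamma are used for numerical stability):

```python
import math

def log_poisson(k, mu):
    """log P_mu(k), with the convention P_0(0) = 1."""
    if mu == 0.0:
        return 0.0 if k == 0 else float("-inf")
    return -mu + k * math.log(mu) - math.lgamma(k + 1)

def lr_pvalue_poisson(k_obs, mu0, k_max=200):
    """Sum P_mu0(k) over all k whose likelihood ratio
    lambda(k) = P_mu0(k)/P_k(k) does not exceed the observed one."""
    lam_obs = log_poisson(k_obs, mu0) - log_poisson(k_obs, float(k_obs))
    p = 0.0
    for k in range(k_max + 1):
        lam = log_poisson(k, mu0) - log_poisson(k, float(k))
        if lam <= lam_obs + 1e-12:
            p += math.exp(log_poisson(k, mu0))
    return p

lam8 = math.exp(log_poisson(8, 10.0) - log_poisson(8, 8.0))  # about 0.807
p = lr_pvalue_poisson(8, 10.0)                               # about 0.541
```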

The likelihood ratio test in this general form is useful to discriminate be-
tween a specific and a more general hypothesis, a problem which we will study
in Sect. 10.6.2. To apply it as a goodness-of-fit test, we have to histogram the
data.

The Likelihood Ratio Test for Histograms


We have shown that the likelihood L0 = Π_i f0(xi) of a sample cannot be used
as a test statistic, but when we combine the data into bins, a likelihood ratio
can be defined for the histogram and used as test quantity. The test variable is
the ratio of the likelihood for the hypothesis that the bin content is predicted
by H0 and the likelihood for the hypothesis that maximizes the likelihood for
the given sample. The latter is the likelihood for the hypothesis where the
prediction for the bin coincides with its content. If H0 is not simple, we take
the ratio of the maximum likelihood allowed by H0 and the unconstrained
maximum of L.
For a bin with content d, prediction t and p.d.f. f (d|t) this ratio is λ =
f (d|t)/f (d|d) since at t = d the likelihood is maximal. For the histogram we
have to multiply the ratios of the B individual bins. Instead we change to
the log-likelihoods and use as test statistic
V = ln λ = Σ_{i=1}^{B} [ln f(di|ti) − ln f(di|di)] .

If the bin content follows the Poisson statistics we get (see Chap. 6, Sect.
7.1)

V = Σ_{i=1}^{B} [−ti + di ln ti − ln(di!) + di − di ln di + ln(di!)]
  = Σ_{i=1}^{B} [di − ti + di ln(ti/di)] .

The distribution of the test statistic V is not universal, i.e. not inde-
pendent of the distribution to be tested as in the case of χ2 . It has to be

determined by a Monte Carlo simulation. In case parameters of the predic-


tion have been adjusted to data, the parameter adjustment has to be included
in the simulation.
The method can be extended to weighted events and to the case of Monte
Carlo generated predictions with corresponding statistical errors, see Ap-
pendix 13.10.
Asymptotically, N → ∞, the test statistic V approaches −χ²/2 as is seen
from the expansion of the logarithm, ln(1 + x) ≈ x − x²/2. After introducing
xi = (di − ti)/ti which, according to the law of large numbers, becomes small
for large di, we find

V = Σ_{i=1}^{B} [ti xi − ti(1 + xi) ln(1 + xi)]
  ≈ Σ_{i=1}^{B} ti [xi − (1 + xi)(xi − xi²/2)]
  ≈ Σ_{i=1}^{B} (−ti xi²/2) = −(1/2) Σ_{i=1}^{B} (di − ti)²/ti = −(1/2) χ²B ,

and thus −2V is distributed according to a χ² distribution with B degrees
of freedom, but then we may also use directly the χ² test.
If the prediction is normalized to the data, we have to replace the Poisson
distribution by the multinomial distribution. We omit the calculation and
present the result:
V = Σ_{i=1}^{B} di ln(ti/di) .

In this case, −2V approaches asymptotically a χ² distribution with B − 1
degrees of freedom.
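The Poisson form of V and its relation −2V ≈ χ² for small deviations can be sketched as follows (Python; the bin contents are illustrative):

```python
import math

def v_statistic(d, t):
    """V = sum_i [d_i - t_i + d_i ln(t_i/d_i)] for Poisson bin contents;
    a bin with d_i = 0 contributes -t_i."""
    v = 0.0
    for di, ti in zip(d, t):
        v += di - ti
        if di > 0:
            v += di * math.log(ti / di)
    return v

def chi2_pearson(d, t):
    return sum((di - ti) ** 2 / ti for di, ti in zip(d, t))

t = [100.0, 200.0, 300.0]   # predictions (illustrative)
d = [103.0, 192.0, 305.0]   # observed contents, small deviations
v = v_statistic(d, t)
```

V vanishes for perfect agreement, is negative otherwise (λ ≤ 1), and −2V is close to Pearson's χ² for small relative deviations.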

10.4.4 The Kolmogorov–Smirnov Test

The subdivision of a sample into intervals is arbitrary and thus subjective.


Unfortunately some experimenters use the freedom to choose histogram bins
such that the data agree as well as possible with the theoretical description
in which they believe. This problem is excluded in binning-free tests which
have the additional advantage that they are also applicable to small samples.
The Kolmogorov–Smirnov test compares the distribution function
F0(x) = ∫_{−∞}^{x} f0(x) dx

with the corresponding experimental quantity S,

S(x) = (number of observations with xi < x) / (total number) .

Fig. 10.10. Comparison of the empirical distribution function S(x) with the theo-
retical distribution function F(x).

The test statistic is the maximum difference D between the two functions:

D = sup |F(x) − S(x)| = sup(D+, D−) .

The quantities D+ , D− denote the maximum positive and negative dif-


ference, respectively. S(x) is a step function, an experimental approximation
of the distribution function and is called Empirical Distribution Function
(EDF ). It is depicted in Fig. 10.10 for an example and compared to the dis-
tribution function F (x) of H0 . To calculate S(x) we sort all N elements in
ascending order of their values, xi < xi+1 and add 1/N at each location xi
to S(x). Then S(xi ) is the fraction of observations with x values less or equal
to xi,

S(xi) = i/N ,    S(xN) = 1 .

As in the χ2 test we can determine the expected distribution of D, which


will depend on N and transform the experimental value of D into a p-value.
To get rid of the N dependence of the theoretical D distribution we use
D* = √N D. Its distribution under H0 is for not too small N (N ≳ 100)
independent of N and available in form of tables and graphs. For event
numbers larger than 20 the approximation D* = D(√N + 0.12 + 0.11/√N) is
still a very good approximation8. The function p(D*) is displayed in Fig. 10.11.

Fig. 10.11. p-value as a function of the Kolmogorov test statistic D*.
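The computation of D and of the scaled statistic D* can be sketched in a few lines (Python; note that S(x) is a step function, so the supremum must be evaluated on both sides of each step; the sample is made up):

```python
import math

def ks_statistic(sample, F):
    """D = sup |F - S|. Since S is a step function, both the value just
    below and just above each ordered observation must be inspected."""
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - F(x) for i, x in enumerate(xs))  # S above F
    d_minus = max(F(x) - i / n for i, x in enumerate(xs))       # F above S
    return max(d_plus, d_minus)

def ks_scaled(sample, F):
    """D* with the finite-N correction quoted in the text:
    D* = D (sqrt(N) + 0.12 + 0.11/sqrt(N))."""
    n = len(sample)
    rn = math.sqrt(n)
    return ks_statistic(sample, F) * (rn + 0.12 + 0.11 / rn)

# test of uniformity on [0,1], i.e. F0(x) = x:
d = ks_statistic([0.1, 0.2, 0.5, 0.9], lambda x: x)
```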
The Kolmogorov–Smirnov test emphasizes more the center of the distribu-
tion than the tails because there the distribution function is tied to the values
zero and one and thus is little sensitive to deviations at the borders. Since it
is based on the distribution function, deviations are integrated over a certain
range. Therefore it is not very sensitive to deviations which are localized in a
narrow region. In Fig. 10.9 the left hand and the right hand histograms have
the same excess of entries in the region left of the center. The Kolmogorov–
Smirnov test produces in both cases approximately the same value of the
test statistic, even though we would think that the distribution of the right
hand histogram is harder to explain by a statistical fluctuation of a uniform
distribution. This shows again, that the power of a test depends strongly on
the alternatives to H0 . The deviations of the left hand histogram are well
detected by the Kolmogorov–Smirnov test, those of the right hand histogram
much better by the Anderson–Darling test which we will present below.
There exist other EDF tests [90], which in most situations are more ef-
fective than the simple Kolmogorov–Smirnov test.

8

D does not scale exactly with √N because S increases in discrete steps.

10.4.5 Tests of the Kolmogorov–Smirnov and Cramer–von Mises
Families
In the Kuiper test one uses as the test statistic the sum V = D+ + D−
of the two deviations of the empirical distribution function S from F . This
quantity is designed for distributions “on the circle”. These are distributions
where the beginning and the end of the distributed quantity are arbitrary,
like the distribution of the azimuthal angle which can be presented with equal
justification in all intervals [ϕ0 , ϕ0 + 2π] with arbitrary ϕ0 .
The tests of the Cramer–von Mises family are based on the quadratic
difference between F and S. The simple Cramer–von Mises test employs the
test statistic

W² = ∫_{−∞}^{∞} [F(x) − S(x)]² dF .

In most situations the Anderson–Darling test with the test statistic A²
and the test of Watson with the test statistic U²,

A² = N ∫_{−∞}^{∞} [S(x) − F(x)]² / (F(x)[1 − F(x)]) dF ,

U² = N ∫_{−∞}^{∞} [ S(x) − F(x) − ∫_{−∞}^{∞} (S(x) − F(x)) dF ]² dF ,

are superior to the Kolmogorov–Smirnov test.


The test of Anderson emphasizes especially the tails of the distribution
while Watson’s test has been developed for distributions on the circle. The
formulas above look quite complicated at first sight. They simplify consider-
ably when we perform a probability integral transformation (PIT ). This term
stands for a simple transformation of the variate x into the variate z = F0 (x),
which is uniformly distributed in the interval [0, 1] and which has the sim-
ple distribution function H0 (z) = z. With the transformed step distribution
S*(z) of the sample we get

A² = N ∫₀¹ [S*(z) − z]² / (z(1 − z)) dz ,

U² = N ∫₀¹ [ S*(z) − z − ∫₀¹ (S*(z) − z) dz ]² dz .

In the Appendix 13.8 we show how to compute the test statistics. There
also the asymptotic distributions are collected.
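For A² a convenient closed computational form exists, A² = −N − (1/N) Σi (2i−1)[ln z(i) + ln(1 − z(N+1−i))], which we quote here as the standard formula (it is not derived in the text). The Python sketch below evaluates it and cross-checks it against a direct numerical integration of the defining integral; the sample values are made up:

```python
import bisect
import math

def anderson_darling(z):
    """A^2 from the standard computational form (z_(i) are the ordered
    PIT-transformed observations)."""
    zs = sorted(z)
    n = len(zs)
    s = sum((2 * i + 1) * (math.log(zs[i]) + math.log(1.0 - zs[n - 1 - i]))
            for i in range(n))
    return -n - s / n

def anderson_darling_integral(z, steps=100000):
    """Midpoint-rule evaluation of N * int_0^1 (S*(z)-z)^2 / (z(1-z)) dz,
    as a numerical cross-check of the closed form."""
    zs = sorted(z)
    n = len(zs)
    total = 0.0
    for m in range(steps):
        u = (m + 0.5) / steps
        s_star = bisect.bisect_right(zs, u) / n   # empirical step function
        total += (s_star - u) ** 2 / (u * (1.0 - u))
    return n * total / steps

z = [0.1, 0.3, 0.4, 0.7, 0.85]   # PIT-transformed observations (made up)
```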

10.4.6 Neyman’s Smooth Test


This test [95] is different from those discussed so far in that it parameterizes
the alternative hypothesis. Neyman introduced the smooth test in 1937 (for a

discussion by E. S. Pearson see [96]) as an alternative to the χ2 test, in that it


is insensitive to deviations from H0 which are positive (or negative) in several
consecutive bins. He insisted that in hypothesis testing the investigator has to
bear in mind which departures from H0 are possible and thus to fix partially
the p.d.f. of the alternative. The test is called “smooth” because, contrary to
the χ2 test, the alternative hypothesis approaches H0 smoothly for vanishing
parameter values. The hypothesis under test H0 is again that the sample after
the PIT, zi = F0 (xi ), follows a uniform distribution in the interval [0, 1].
The smooth test excludes alternative distributions of the form
gk(z) = Σ_{i=0}^{k} θi πi(z) ,   (10.8)

where θi are parameters and the functions πi (z) are modified orthogonal
Legendre polynomials that are normalized to the interval [0, 1] and symmetric
or antisymmetric with respect to z = 1/2:

π0(z) ≡ 1 ,

π1(z) = √3 (2z − 1) ,

πi(z) = √(2i + 1) Pi(2z − 1) .

Here Pi (x) is the Legendre polynomial in the usual form. The first parameter
θ0 is fixed, θ0 = 1, and the other parameters are restricted such that gk is
positive. The user has to choose the parameter k which limits the degree of
the polynomials. If the alternative hypothesis is suspected to contain narrow
structures, we have to admit large k. The test with k = 1 rejects a linear
contribution, k = 2 in addition a quadratic component and so on. Obviously,
the null hypothesis H0 corresponds to θ1 = · · · = θk = 0, or equivalently to
Σ_{i=1}^{k} θi² = 0. We have to look for a test statistic which increases with the
value of this sum.


For a sample of size N the test statistic proposed by Neyman is
 2
Xk Xk XN
1 1 
rk2 = t2 = πi (zj ) . (10.9)
N i=1 i N i=1 j=1

This choice is plausible, because a large absolute value of ti is due to a strong


contribution of the polynomial πi to the observed distribution and thus also
to a large value of θi², while under H0 we have for i ≥ 1

⟨ti⟩ = N ⟨πi(z)⟩ = 0 ,

because

∫₀¹ πi(z) dz = 0 .

Asymptotically, N → ∞, under H0 the test statistic rk² follows a χ²
distribution with k degrees of freedom (see 3.6.7). This is a consequence of
the orthonormality of the polynomials πi and the central limit theorem: We
have

var(ti) = ⟨ti²⟩ = N ∫₀¹ πi²(z) dz = N

and as a sum of N random variables the statistic ti/√N is normally dis-
tributed for large N, with expectation value zero and variance one. Due to
the orthogonality of the πi, the ti are uncorrelated. For small N the distribu-
tion of the test statistic rk² has to be obtained by a Monte Carlo simulation.
In any case, large values of rk2 indicate bad agreement of the data with
H0 , but for a fixed value of k the smooth test is not consistent9 . Its power
approaches unity for N → ∞ only for the class of alternatives Hk having
a PIT which is represented by an expansion in Legendre polynomials up to
order k. Hence with respect to these, while usually uninteresting, restricted
alternatives it is consistent. Thus for large samples and especially for the
exclusion of narrow structures k should not be chosen too small. The value
of k in the smooth test corresponds roughly to the number of bins in the
χ2 -test.
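A minimal implementation of rk² only needs the shifted, renormalized Legendre polynomials, which follow from the usual recurrence. The sketch below (Python; function names are ours) also verifies their orthonormality on [0, 1] numerically:

```python
import math

def pi_polys(k, z):
    """pi_1 .. pi_k at z: pi_i(z) = sqrt(2i+1) P_i(2z-1), built with the
    Legendre recurrence i P_i = (2i-1) x P_{i-1} - (i-1) P_{i-2}."""
    x = 2.0 * z - 1.0
    p_prev, p = 1.0, x                      # P_0 and P_1
    out = [math.sqrt(3.0) * p]
    for i in range(2, k + 1):
        p_prev, p = p, ((2 * i - 1) * x * p - (i - 1) * p_prev) / i
        out.append(math.sqrt(2.0 * i + 1.0) * p)
    return out

def smooth_statistic(z_sample, k):
    """Neyman's r_k^2 = (1/N) sum_i t_i^2 with t_i = sum_j pi_i(z_j)."""
    n = len(z_sample)
    t = [0.0] * k
    for z in z_sample:
        for i, val in enumerate(pi_polys(k, z)):
            t[i] += val
    return sum(ti * ti for ti in t) / n

# numerical check of orthonormality on a midpoint grid:
steps = 20000
grid = [(m + 0.5) / steps for m in range(steps)]
norm1 = sum(pi_polys(1, z)[0] ** 2 for z in grid) / steps       # -> 1
overlap = sum(pi_polys(2, z)[0] * pi_polys(2, z)[1]
              for z in grid) / steps                            # -> 0
```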
The smooth test is in most cases superior to the χ2 test. This can be
understood in the following way: The smooth test scrutinizes not only for
structures of a fixed frequency but for all frequencies up to k while the χ2
test with B ≫ 1 bins is rather insensitive to low frequency variations.
Remark: The alternative distribution quoted by Neyman was the expo-
nential

gk(z) = C exp( Σ_{i=0}^{k} θi πi(z) )   (10.10)

where C(θ) is a normalization constant. Neyman probably chose the expo-


nential form, because it guaranties positivity without further restrictions of
the parameters θi . Moreover, with this class of alternatives, it has been shown
by E. S. Pearson [96] that the smooth test can be interpreted as a likelihood
ratio test. Anyway, (10.8) or (10.10) serve only as a motivation to choose the
test statistic (10.9) which is the relevant quantity.

10.4.7 The L2 Test

The binning-free tests discussed so far are restricted to one dimension, i.e. to
univariate distributions. We now turn to multivariate tests.
A very obvious way to express the difference between two distributions f
and f0 is the integrated quadratic difference

9
For k = 1, for instance, the test cannot exclude distributions concentrated near
z = 1/2.
L² = ∫ [f(r) − f0(r)]² dr .   (10.11)

Unfortunately, we cannot use this expression for the comparison of a sample


{r1 , . . . , rN } with a continuous function f0 , but we can try to derive from
our sample an approximation of f . Such a procedure is called probability den-
sity estimation (PDE ). A common approach (see Chap. 12) is the Gaussian
smearing or smoothing. The N discrete observations at the locations ri are
transformed into the function
fG(r) = (1/N) Σ_i e^{−α(ri−r)²} .
The smearing produces a broadening which has also to be applied to f0 :
f0G(r) = ∫ f0(r′) e^{−α(r−r′)²} dr′ .

We now obtain a useful test statistic L2G ,


L²G = ∫ [fG(r) − f0G(r)]² dr .

So far the L2 test [97] has not found as much attention as it deserves
because the calculation of the integral is tedious. However its Monte Carlo
version is pretty simple. It offers the possibility to adjust the width of the
smearing function to the density f0 . Where we expect large distances of
observations, the Gaussian width should be large, α ∼ f0².
A more sophisticated version of the L2 test is presented in [97]. The Monte
Carlo version is included in Sect. 10.4.10, see below.

10.4.8 Comparing a Data Sample to a Monte Carlo Sample and


the Metric

We now turn to tests where we compare our sample not to an analytic


distribution but to a Monte Carlo simulation of f0 . This is not a serious
restriction because anyhow acceptance and resolution effects have to be taken
into account in the majority of all experiments. Thus the null hypothesis is
usually represented by a simulation sample.
To compare two samples we have to construct a relation between obser-
vations of the samples which in the multi-dimensional case has to depend
in some way on the distance between them. We can define the distance in
the usual way using the standard Euclidian metric but since the different di-
mensions often represent completely different physical quantities, e.g. spatial
distance, time, mass etc., we have considerable freedom in the choice of the
metric and we will try to adjust the metric such that the test is powerful.
We usually want that all coordinates enter with about equal weight into
the test. If, for example, the distribution is very narrow in x but wide in y,

then the distance r of points is almost independent of y and it is reasonable


to stretch the distribution in the x direction before we apply a test. Therefore
we propose for the general case to scale linearly all coordinates such that the
empirical variances of the sample are the same in all dimensions. In addition
we may want to get rid of correlations when, for instance, a distribution is
concentrated in a narrow band along the x-y diagonal.
Instead of transforming the coordinates we can use the Mahalanobis dis-
tance10 in order to normalize distances between observations {x1, . . . , xN}
with sample mean x̄. (The bold-face symbols here denote P-dimensional vec-
tors describing different features measured on each of the N sampled objects.)
The Mahalanobis distance dM of two observations x and x′ is

dM = √[ (x − x′)^T C^{−1} (x − x′) ] ,

with

Cij = Σ_{n=1}^{N} (xni − x̄i)(xnj − x̄j) / N .

It is equivalent to the Euclidian distance after a linear transformation of the


vector components which produces a sample with unity covariance matrix. If
the covariance matrix is diagonal, then the resulting distance is the normal-
ized Euclidean distance in the P -dimensional space:
dM = √( Σ_{p=1}^{P} (xp − x′p)² / σp² ) .

In the following tests the choice of the metric is up to the user. In many
situations it is reasonable to use the Mahalanobis distance, even though mod-
erate variations of the metric normally have little influence on the power of
a test.
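A two-dimensional sketch of the Mahalanobis distance (Python; the 2×2 inverse is written out explicitly and the data points are made up):

```python
def sample_covariance(points):
    """C_ij = sum_n (x_ni - mean_i)(x_nj - mean_j)/N for 2-d points;
    returns (Cxx, Cxy, Cyy)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx, cxy, cyy

def mahalanobis(u, v, cov):
    """d_M = sqrt((u-v)^T C^-1 (u-v)) in two dimensions."""
    cxx, cxy, cyy = cov
    det = cxx * cyy - cxy * cxy
    dx, dy = u[0] - v[0], u[1] - v[1]
    return ((cyy * dx * dx - 2.0 * cxy * dx * dy + cxx * dy * dy) / det) ** 0.5

cov = sample_covariance([(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)])
```

For a diagonal covariance matrix the result reduces to the normalized Euclidean distance given above.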

10.4.9 The k-Nearest Neighbor Test

We consider two samples, one generated by a Monte Carlo simulation of a null


distribution f0 and the experimental sample. The test statistic is the number
n(k) of observations of the mixed sample where all of its k nearest neighbors
belong to the same sample as the observation itself. This is illustrated in
Fig. 10.12 for an unrealistically simple configuration. We find n(1) = 4 and
n(2) = 4. The parameter k is a small number to be chosen by the user, in
most cases it is one, two or three.
Of course we expect n to be large if the two parent distributions are very
different. The k-nearest neighbor test is very popular and quite powerful.
It has one caveat: We would like to have the number M of Monte Carlo
10
This is a distance measure introduced by P. C. Mahalanobis in 1936.

Fig. 10.12. k-nearest neighbor test.

observations much larger than the number N of experimental observations.


In the situation with M ≫ N each observation tends to have as next neighbor
a Monte Carlo observation and the test becomes less significant.
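A one-dimensional sketch of the counting of n(k) (Python; the configuration is our own, not the one of Fig. 10.12):

```python
def knn_statistic(data, mc, k=1):
    """n(k): pooled-sample observations whose k nearest neighbors all
    carry the same label as the observation itself (1-d version)."""
    pooled = [(x, 0) for x in data] + [(x, 1) for x in mc]
    n = 0
    for i, (xi, li) in enumerate(pooled):
        neighbors = sorted((abs(xj - xi), lj)
                           for j, (xj, lj) in enumerate(pooled) if j != i)
        if all(lj == li for _, lj in neighbors[:k]):
            n += 1
    return n

# two well separated samples: every neighbor is a same-sample point
n1 = knn_statistic([0.0, 1.0, 2.0], [10.0, 11.0, 12.0], k=1)
```

For well separated parent distributions n(k) takes its maximum, the pooled sample size; for well mixed samples it is small.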

10.4.10 The Energy Test

A very general expression that measures the difference between two distribu-
tions f (r) and f0 (r) in an n dimensional space is
φ = (1/2) ∫∫ dr dr′ [f(r) − f0(r)] [f(r′) − f0(r′)] R(r, r′) .   (10.12)

Here we call R the distance function. The factor 1/2 is introduced to


simplify formulas which we derive later. The special case R = δ(r − r′ ) leads
to the simple integrated quadratic deviation (10.11) of the L2 test
φ = (1/2) ∫ dr [f(r) − f0(r)]² .   (10.13)
However, we do not intend to compare two distributions but rather two
samples A {r1 , . . . , r N }, B {r01 , . . . , r 0M }, which are extracted from the
distributions f and f0 , respectively. For this purpose we start with the more
general expression (10.12) which connects points at different locations. We
restrict the function R in such a way that it is a function of the distance
|r − r ′ | only and that φ is minimum for f ≡ f0 .
The function (10.12) with R = 1/|r−r′ | describes the electrostatic energy
of the sum of two charge densities f and f0 with equal total charge but
different sign of the charge. In electrostatics the energy reaches a minimum

if the charge is zero everywhere, i.e. the two charge densities are equal up to
the sign. Because of this analogy we refer to φ as energy.
For our purposes the logarithmic function R(r) = − ln(r) and the bell
function R(r) ∼ exp(−cr²) are more suitable than 1/r.
We multiply the expressions in brackets in (10.12) and obtain
φ = (1/2) ∫∫ dr dr′ [f(r)f(r′) + f0(r)f0(r′) − 2f(r)f0(r′)] R(|r − r′|) .   (10.14)
A Monte Carlo integration of this expression is obtained when we generate M
random points {r01 . . . r 0M } of the distribution f0 (r) and N random points
{r1 , . . . , rN } of the distribution f (r) and weight each combination of points
with the corresponding distance function. The Monte Carlo approximation
is:

φ ≈ 1/(N(N−1)) Σi Σ_{j>i} R(|ri − rj|) − 1/(NM) Σi Σj R(|ri − r0j|)
      + 1/(M(M−1)) Σi Σ_{j>i} R(|r0i − r0j|)
  ≈ 1/N² Σi Σ_{j>i} R(|ri − rj|) − 1/(NM) Σi Σj R(|ri − r0j|)
      + 1/M² Σi Σ_{j>i} R(|r0i − r0j|) .   (10.15)

This is the energy of a configuration of discrete charges. Alternatively we


can understand this result as the sum of three expectation values which are
estimated by the two samples. The value of φ from (10.15) thus is the estimate
of the energy of two samples that are drawn from the distributions f0 and f
and that have the total charge zero.
We can use the expression (10.15) as test statistic when we compare the
experimental sample to a Monte Carlo sample, the null sample representing
the null distribution f0 . Small energies signify a good, large ones a bad agree-
ment of the experimental sample with H0 . To be independent of statistical
fluctuations of the simulated sample, we choose M large compared to N ,
typically M ≈ 10N .
The test statistic energy φ is composed of three terms φ1 , φ2 , φ3 which
correspond to the interaction of the experimental sample with itself, to its
interaction with the null sample and with the interaction of the null sample
with itself:

φ = φ1 − φ2 + φ3 ,   (10.16)

φ1 = 1/(N(N−1)) Σ_{i<j} R(|ri − rj|) ,   (10.17)

φ2 = 1/(NM) Σ_{i,j} R(|ri − r0j|) ,   (10.18)

φ3 = 1/(M(M−1)) Σ_{i<j} R(|r0i − r0j|) .   (10.19)

The term φ3 is independent of the data and can be omitted but is normally
included to reduce statistical fluctuations.
The distance function R relates sample points and simulated points of the
null hypothesis to each other. The functions
\[ R_l = -\ln(r + \varepsilon) , \tag{10.20} \]
\[ R_s = e^{-r^2/(2s^2)} \tag{10.21} \]
have proven useful.

The small positive constant ε suppresses the pole of the logarithmic distance
function. Its value should be chosen approximately equal to the experimental
resolution¹¹, but variations of ε by a large factor have no sizable influence
on the result. With the function R_1 = 1/r we get the special case of
electrostatics. With the Gaussian distance function R_s the test is very similar
to the χ² test with bin width 2s, but it avoids the arbitrary binning of the
latter. The parameter s has to be adjusted to the application. The logarithmic
distance function is less sensitive to the scale and does not require the tuning
of a parameter.
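As an illustration of (10.15)–(10.19), the statistic can be computed in a few lines of Python; the restriction to one dimension, the function names, and the value ε = 10⁻³ are our choices for this sketch:

```python
import numpy as np

def R_log(r, eps=1e-3):
    """Logarithmic distance function (10.20); eps ~ experimental resolution."""
    return -np.log(r + eps)

def energy_statistic(x, x0, dist=R_log):
    """phi = phi1 - phi2 + phi3, eqs. (10.16)-(10.19), for 1-d samples:
    x is the experimental sample, x0 the Monte Carlo null sample."""
    x, x0 = np.asarray(x, float), np.asarray(x0, float)
    N, M = len(x), len(x0)
    # phi1: interaction of the experimental sample with itself (pairs i < j)
    phi1 = dist(np.abs(x[:, None] - x)[np.triu_indices(N, 1)]).sum() / (N * (N - 1))
    # phi2: interaction with the null sample (all N*M pairs)
    phi2 = dist(np.abs(x[:, None] - x0)).sum() / (N * M)
    # phi3: interaction of the null sample with itself
    phi3 = dist(np.abs(x0[:, None] - x0)[np.triu_indices(M, 1)]).sum() / (M * (M - 1))
    return phi1 - phi2 + phi3

rng = np.random.default_rng(1)
x = rng.uniform(size=100)     # experimental sample, N = 100
x0 = rng.uniform(size=1000)   # null sample, M ~ 10 N
phi = energy_statistic(x, x0)
```

A sample clustered in a narrow region yields a larger φ than a sample that follows the uniform null distribution.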
The distribution of the test statistic under H0 can be obtained by generating
Monte Carlo samples. If the number of events is large and the significance
level α is small, the computation may become tedious.
Resampling techniques can also be applied to construct the distribution of
φ under H0. A data set of 2N observations is generated which, by splitting it,
provides two simulated values of φ. Then we shuffle the elements and compute
two additional values of φ. The shuffling is repeated until the required
statistics is reached. An efficient shuffling technique invented by Fisher and
Yates and improved by Durstenfeld is described in Appendix 13.9. The values
of φ are correlated, but the correlation is mostly negligible. In case of doubt,
the shuffling should be repeated with several independent 2N sets. From the
fluctuation of the resulting p-values their error can be derived.
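The Durstenfeld variant of the Fisher–Yates shuffle amounts to a single backward pass over the array; a minimal sketch (the helper name is ours), followed by the re-splitting of the pooled sample used above:

```python
import random

def durstenfeld_shuffle(a, rng=random):
    """Fisher-Yates shuffle in Durstenfeld's in-place variant: one pass,
    each of the len(a)! permutations equally likely."""
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)   # uniform index in 0..i
        a[i], a[j] = a[j], a[i]
    return a

# permutation resampling: shuffle the pooled 2N observations, then split
pool = list(range(20))             # stand-in for the pooled 2N observations
durstenfeld_shuffle(pool)
half_a, half_b = pool[:10], pool[10:]
```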
The energy test is consistent [98]. It is quite powerful in many situations
and has the advantage that it is not required to sort the sample elements.

¹¹ Distances between two points that are smaller than the resolution are accidental
and thus insignificant.
Fig. 10.13. Different admixtures to a uniform distribution (panels A, B, C).

The energy test with Gaussian distance function is completely equivalent
to the L2 test. It is more general than the latter in that it allows the use of
various distance functions.

10.4.11 Tests Designed for Specific Problems

The power of tests depends on the alternatives. If we have an idea of the
alternative, even if it is crude, we can design a GOF test which is especially
sensitive to the deviations from H0 which we have in mind. The distribution
of the test statistic has to be produced by a Monte Carlo simulation.

Example 144. Designed test: three region test

Experimental distributions often show a local excess of observations which
is either just a statistical fluctuation or stems from a physical process. To
check whether an experimental sample is compatible with the absence of a
bump caused by a physical process, we may use the following three region
test. We subdivide the domain of the variable into three regions with expected
numbers of observations n10, n20, n30 and look for differences to the
corresponding experimental numbers n1, n2, n3. The subdivision is chosen such
that the sum of the differences is maximum. The test statistic R3 is
\[ R_3 = \sup_{n_1,n_2} \left[ (n_1 - n_{10}) + (n_2 - n_{20}) + (n_3 - n_{30}) \right] . \]
Notice that n3 = N − n1 − n2 is a function of n1 and n2. The generalization
to more than three regions is trivial. As in the χ² test we could also divide
the individual squared differences by their expected values:
\[ R_3' = \sup_{n_1,n_2} \left[ \frac{(n_1 - n_{10})^2}{n_{10}} + \frac{(n_2 - n_{20})^2}{n_{20}} + \frac{(n_3 - n_{30})^2}{n_{30}} \right] . \]
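The scan over subdivisions can be sketched as follows; we implement the χ²-like variant R3′ for a one-dimensional sample, placing candidate cuts midway between neighboring observations. The function name, the O(N²) scan, and the requirement of at least one expected entry per region are our choices:

```python
import numpy as np

def three_region_statistic(x, cdf0):
    """R3' variant of the three region test: supremum over two-cut
    subdivisions of the chi2-like sum (n_k - n_k0)^2 / n_k0, with the
    expected counts n_k0 = N * p_k taken from the H0 distribution cdf0."""
    x = np.sort(np.asarray(x, float))
    N = len(x)
    cuts = (x[:-1] + x[1:]) / 2.0     # candidate boundaries between points
    best = 0.0
    for i in range(len(cuts)):
        for j in range(i + 1, len(cuts)):
            n = np.array([i + 1, j - i, N - j - 1])                # observed
            p = np.diff([0.0, cdf0(cuts[i]), cdf0(cuts[j]), 1.0])  # H0 weights
            n0 = N * p                                             # expected
            if n0.min() < 1.0:        # our choice: require n_k0 >= 1
                continue
            best = max(best, float(np.sum((n - n0) ** 2 / n0)))
    return best
```

For a uniform H0 on [0, 1] one would pass `cdf0 = lambda t: t`; a sample with a narrow bump then produces a much larger statistic than a uniform sample.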
Fig. 10.14. Power (fraction of identified distortions) of different tests (Anderson, Kolmogorov, Kuiper, Neyman, Watson, Region, Energy (log), Energy (Gaussian), χ²) for the admixtures 30% B, 20% N(0.5, 1/24), 20% N(0.5, 1/32), 50% linear, 30% A, and 30% C.

10.4.12 Comparison of Tests

Univariate Distributions

Whether a test is able to detect deviations from H0 depends on the distribution f0 and on the kind of distortion. Thus there is no test which is most
powerful in all situations.
Fig. 10.15. Comparison of a normally distributed sample (circles) from H0 with
a linear admixture (triangles) with the normal distribution of H0. The χ² test
yields p = 0.71, the energy test p = 0.02.

To get an idea of the power of different tests, we consider six different ad-
mixtures to a uniform distribution and compute the fraction of cases in which
the distortion of the uniform distribution is detected at a significance level of
5%. For each distribution constructed in this way, we generate stochastically
1000 mixtures with 100 observations each. The distributions which we add
are depicted in Fig. 10.13. One of them is linear, two are normal with differ-
ent widths, and three are parabolic. The χ2 test was performed with 12 bins
following the prescription of Ref. [92], the parameter of Neyman’s smooth
test was k = 2 and the width of the Gaussian of the energy test was s = 1/8.
The sensitivity of different tests is presented in Fig. 10.14.
The histogram of Fig. 10.14 shows that none of the tests is optimum
in all cases. The χ2 test performs only mediocrely. Probably a lower bin
number would improve the result. The tests of Neyman, Anderson–Darling
and Kolmogorov–Smirnov are sensitive to a shift of the mean value while
the Anderson–Darling test reacts especially to changes at the borders of the
distribution. The tests of Watson and Kuiper detect preferentially variations
of the variance. Neyman’s test and the energy test with logarithmic distance
function are rather efficient in most cases.
Multivariate Distributions

The goodness-of-fit of multivariate distributions cannot be tested very well
with simple tests. The χ² test often suffers from the small number of entries
per bin. Here the k-nearest neighbor test and the energy test with the long
range logarithmic distance function are much more efficient.

Example 145. GOF test of a two-dimensional sample

Figure 10.15 shows a comparison of a sample H1 with a two-dimensional
normal distribution (H0). H1 corresponds to the distribution of H0 but
contains an admixture of a linear distribution. The p-value of the energy test is
2%. A χ² test with 9 bins yields a p-value of 71%; it is unable to identify the
deformation of f0.

10.5 Two-Sample Tests


10.5.1 The Problem

A standard situation in particle physics is that H0 cannot be compared
directly to the data but first has to be transformed into a Monte Carlo sample
to take into account acceptance losses and resolution effects. We then have to
compare two samples, a procedure which we already applied in the energy
test. There the distribution of the test statistic needed to compute p-values
can be generated by a simple Monte Carlo program.
In other sciences, a frequently occurring problem is that the effectiveness
of two or several procedures has to be compared. This may concern drugs,
animal feed, or the quality of fabrication methods. A similar problem is to
test whether a certain process is stable or whether its results have changed
over time. Also in the natural sciences we frequently come across the
problem that we observe an interesting phenomenon in one data sample which
apparently has disappeared in another sample taken at a later time. It is
important to investigate whether the two data samples are compatible with
each other. Sometimes it is also of interest to investigate whether a Monte
Carlo sample and an experimental sample are compatible. Thus we are in-
terested in a statistical procedure which tells us whether two samples A and
B are compatible, i.e. drawn from the same parent distribution. Thereby we
assume that the parent distribution itself is unknown. If it were known, we
could apply one of the GOF tests which we have discussed above. We have to
invent procedures to generate the distribution of the test statistic. In some
cases this is trivial. In the remaining cases, we have to use combinatorial
methods.
10.5.2 The χ2 Test

To test whether two samples are compatible, we can apply the χ2 test or the
Kolmogorov–Smirnov test with minor modifications.
When we calculate the χ² statistic, we have to normalize the two samples
A and B of sizes N and M to each other. For ai and bi entries in bin i, the
quantity ai/N − bi/M should be compatible with zero. With the usual error
propagation we obtain the estimate ai/N² + bi/M² of its quadratic error and
\[ \chi^2 = \sum_{i=1}^{B} \frac{(a_i/N - b_i/M)^2}{a_i/N^2 + b_i/M^2} . \tag{10.22} \]

It follows approximately a χ² distribution with B − 1 degrees of freedom,
but not exactly, as we had to replace the expected values by the observed
numbers in the error estimation. We have to be careful if the number of
observations per bin is small.
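A direct transcription of (10.22); skipping bin pairs in which both histograms are empty is our choice:

```python
import numpy as np

def two_sample_chi2(a, b):
    """Two-sample chi2 of eq. (10.22) for two histograms with the same
    binning; returns the statistic and the number of degrees of freedom,
    approximately B - 1. Bin pairs with a_i = b_i = 0 are skipped."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    N, M = a.sum(), b.sum()
    mask = (a + b) > 0
    num = (a[mask] / N - b[mask] / M) ** 2
    den = a[mask] / N**2 + b[mask] / M**2
    return float((num / den).sum()), int(mask.sum()) - 1

t, ndf = two_sample_chi2([25, 35, 40], [60, 60, 80])
```

Proportional histograms yield χ² = 0, since every term ai/N − bi/M vanishes.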

10.5.3 The Likelihood Ratio Test

The likelihood ratio test is less vulnerable to low event numbers than the χ2
test.
Setting r = M/N, we compute the likelihood of observing, in a single
bin, a entries with expectation λ and b entries with expectation ρλ, where the
hypothesis H0 is characterized by ρ = r:
\[ L(\lambda, \rho \,|\, a, b) = \frac{e^{-\lambda}\lambda^a}{a!}\,\frac{e^{-\rho\lambda}(\rho\lambda)^b}{b!} . \]
Leaving out constant factors, the log-likelihood is
\[ \ln L = -\lambda(1+\rho) + (a+b)\ln\lambda + b\ln\rho . \]

We determine the conditional maximum likelihood value of λ under ρ = r
and the corresponding log-likelihood:
\[ 1 + r = \frac{a+b}{\hat\lambda_c} , \qquad \hat\lambda_c = \frac{a+b}{1+r} , \]
\[ \ln L_{c\,\mathrm{max}} = (a+b)\left[-1 + \ln\frac{a+b}{1+r}\right] + b\ln r . \]

The unconditional maximum of the likelihood is found for λ̂ = a and ρ̂ = b/a:
\[ \ln L_{u\,\mathrm{max}} = -(a+b) + a\ln a + b\ln b . \]
Our test statistic is VAB, the logarithm of the likelihood ratio, now summed
over all bins:
\[ V_{AB} = \ln L_{c\,\mathrm{max}} - \ln L_{u\,\mathrm{max}} = \sum_i \left[ (a_i+b_i)\ln\frac{a_i+b_i}{1+r} - a_i\ln a_i - b_i\ln b_i + b_i\ln r \right] . \]

Note that VAB(r) = VBA(1/r), as it should be.
Now we need a method to determine the expected distribution of the test
statistic VAB under the assumption that both samples originate from the
same population.
To generate the distribution of the test statistic V we combine the two
samples into a new sample with M + N elements and form new pairs of samples
with M and N elements. We draw randomly M elements from the combined
sample, associate them to A, and associate the remaining elements to B.
Computationally this is done by shuffling, as described above in the section
dealing with the standard energy test. This is easier than to use systematically
all individual possibilities. For each generated pair i we determine the statistic Vi.
This procedure is repeated many times and the values Vi form the reference
distribution. Our experimental p-value is equal to the fraction of generated
Vi which are larger than VAB:
\[ p = \frac{\text{Number of permutations with } V_i > V_{AB}}{\text{Total number of permutations}} . \]
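The statistic VAB and the permutation procedure can be sketched as follows. Since V vanishes for perfectly proportional histograms and becomes more negative for worse agreement, this sketch counts permutations with Vi ≤ VAB as at least as extreme; the binning helper and the convention 0 · ln 0 = 0 are our choices:

```python
import math
import random

def v_statistic(a_bins, b_bins, r):
    """V_AB = ln L_cmax - ln L_umax summed over bins; V <= 0, with V = 0
    for perfectly proportional histograms. Convention: 0 * ln(0) = 0."""
    xlnx = lambda n: n * math.log(n) if n > 0 else 0.0
    v = 0.0
    for a, b in zip(a_bins, b_bins):
        v += (xlnx(a + b) - (a + b) * math.log(1.0 + r)
              - xlnx(a) - xlnx(b) + b * math.log(r))
    return v

def permutation_pvalue(sample_a, sample_b, edges, n_perm=500, seed=1):
    """Reference distribution of V from shuffling the pooled sample and
    re-splitting it into parts of the original sizes."""
    def hist(s):
        h = [0] * (len(edges) - 1)
        for x in s:
            for i in range(len(edges) - 1):
                if edges[i] <= x < edges[i + 1]:
                    h[i] += 1
                    break
        return h
    N, r = len(sample_a), len(sample_b) / len(sample_a)
    v_obs = v_statistic(hist(sample_a), hist(sample_b), r)
    pool, rng, extreme = list(sample_a) + list(sample_b), random.Random(seed), 0
    for _ in range(n_perm):
        rng.shuffle(pool)
        if v_statistic(hist(pool[:N]), hist(pool[N:]), r) <= v_obs:
            extreme += 1      # at least as extreme as the observed value
    return extreme / n_perm
```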

10.5.4 The Kolmogorov–Smirnov Test

Also the Kolmogorov–Smirnov test can easily be adapted to the comparison of
two samples. We construct the test statistic in an analogous way as above. The
test statistic is \(D^* = D\sqrt{N_{\mathrm{eff}}}\), where D is the maximum difference between
the two empirical distribution functions SA, SB, and Neff is the effective or
equivalent number of events, which is computed from the relation
\[ \frac{1}{N_{\mathrm{eff}}} = \frac{1}{N} + \frac{1}{M} . \]

In a similar way other EDF multi-dimensional tests which we have discussed
above can be adjusted.
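A sketch of the two-sample statistic D* for one-dimensional samples (the function name is ours):

```python
import bisect
import math

def two_sample_ks(a, b):
    """D* = D * sqrt(N_eff), with D the maximum distance between the two
    empirical distribution functions and 1/N_eff = 1/N + 1/M."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    # the maximum of |S_A - S_B| is attained at one of the sample points
    d = max(abs(bisect.bisect_right(a, t) / n - bisect.bisect_right(b, t) / m)
            for t in a + b)
    n_eff = 1.0 / (1.0 / n + 1.0 / m)
    return d * math.sqrt(n_eff)
```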

10.5.5 The Energy Test

For a binning-free comparison of two samples A and B with M and N
observations we can again use the energy test [98], which in the multi-dimensional
case has only a few competitors.
Fig. 10.16. Two-sample test. Left hand: the samples which are to be compared.
Right hand: distribution of the test statistic and its actual value.

We compute the energy φAB in the same way as above, replacing the
Monte Carlo sample by one of the experimental samples. The expected dis-
tribution of the test statistic φAB is computed in the same way as for the like-
lihood ratio test from the combined sample using the permutation technique
by shuffling. Our experimental p-value is equal to the fraction of generated
φi from the bootstrap sample which are larger than φAB :
Number of permutations with φi > φAB
p= .
Total number of permutations

Example 146. Comparison of two samples

We compare two two-dimensional samples with 15 and 30 observations
with the energy test. The two samples are depicted in a scatter plot on the
left hand side of Fig. 10.16. The energy of the system is φAB = −1.480 (the
negative value arises because we have omitted the term φ3). From the mixed
sample 10000 sample combinations have been selected at random. Their energy
distribution is shown as a histogram in the figure. The arrow indicates the
location of φAB. It corresponds to a p-value of 0.06. We can estimate the
error of the p-value by computing it from many permutation sets, each with
a smaller number of permutations. From the variation of p among 100 sets of
100 permutations we find δp = 0.02. The p-value is small, indicating that
the samples belong to different populations. Indeed they have been drawn
from different distributions: a uniform distribution, −1.5 < x, y < 1.5, and a
normal distribution with standard deviations σx = σy = 1.
10.5.6 The k-Nearest Neighbor Test

The k-nearest neighbor test is by construction a two-sample test. The
distribution of the test statistic is obtained in exactly the same way as in the
two-sample energy test which we have discussed in the previous section.
The performance of the k-nearest neighbor test is similar to that of the
energy test. The energy test (and the L2 test which is automatically included
in the former) is more flexible than the k-nearest neighbor test and includes
all observations of the sample in the continuous distance function. The
k-nearest neighbor test, on the other hand, is less sensitive to variations of the
density, which are problematic for the energy test with a Gaussian distance
function of constant width.

10.6 Significance of Signals

10.6.1 Introduction

Tests for signals are closely related to goodness-of-fit tests but their aim is
different. We are not interested in verifying that H0 is compatible with a
sample; rather, we intend to quantify the evidence of signals which are possibly
present in a sample which consists mainly of uninteresting background. Here
not only the distribution of the background has to be known but in addition we
must be able to parameterize the alternative which we search for. The null
hypothesis H0 corresponds to the absence of deviations from the background.
The alternative Hs is not fully specified, otherwise it would be sufficient to
compute the simple likelihood ratio which we have discussed in Chap. 6.
Signal tests are applied when we search for rare decays or reactions like
neutrino oscillations. Another frequently occurring problem is that we want to
interpret a line in a spectrum as indication for a resonance or a new particle.
To establish the evidence of a signal, we usually require a very significant
deviation from the null hypothesis, i.e. the sum of background and signal
has to describe the data much better than the background alone, because
particle physicists look in hundreds of histograms for more or less wide lines
and thus always find candidates¹² which in most cases are just background
fluctuations. For this reason, signals are only accepted by the community if
they have a significance of at least four or five standard deviations. In cases
where the existence of new phenomena is not unlikely, a smaller significance
may be sufficient. A high significance for a signal corresponds to a low p-value
of the null hypothesis.
To quote the p-value instead of the significance as expressed by the number
of standard deviations by which the signal exceeds the background expectation
is to be preferred, because it is a measure which is independent of the
form of the distribution. However, the standard deviation scale is better
suited to indicate the significance than the p-value scale, where very small
values dominate. For this reason it has become customary to transform the
p-value p into the number of Gaussian standard deviations sG, which are
related through
\[ p = \frac{1}{\sqrt{2\pi}} \int_{s_G}^{\infty} \exp(-x^2/2)\, dx \tag{10.23} \]
\[ \phantom{p} = \left[1 - \operatorname{erf}(s_G/\sqrt{2})\right]/2 . \tag{10.24} \]
The function sG(p) is given in Fig. 10.17. Relations (10.23), (10.24) refer to
one-sided tests. For two-sided tests, p has to be multiplied by a factor of two.

¹² This is the so-called look-elsewhere effect.
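The transformation (10.24) and its inverse can be sketched as follows; the inversion by bisection on the monotonically falling function p(sG) is our choice:

```python
import math

def p_from_sigma(s_g):
    """One-sided p-value, eq. (10.24): p = [1 - erf(s_G / sqrt(2))] / 2."""
    return (1.0 - math.erf(s_g / math.sqrt(2.0))) / 2.0

def sigma_from_p(p):
    """Invert (10.24) by bisection; p(s_G) is strictly decreasing."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_from_sigma(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, the conventional "5 sigma" discovery threshold corresponds to a one-sided p-value of about 2.9 · 10⁻⁷.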
When we require very low p-values for H0 to establish signals, we have
to be especially careful in modeling the distribution of the test statistic.
Often the distribution corresponding to H0 is approximated for instance by a
polynomial with some uncertainties in the parameters and assumptions which
are difficult to implement in the test procedure. We then have to be especially
conservative. It is better to underestimate the significance of a signal than to
present evidence for a new phenomenon based on a doubtful number.
To illustrate this problem we return to our standard example where we
search for a line in a one-dimensional spectrum. Usually, the background
under an observed bump is estimated from the number of events outside
but near the bump in the so-called side bands. If the side bands are chosen
too close to the signal they are affected by the tails of the signal, if they are
chosen too far away, the extrapolation into the signal region is sensitive to the
assumed shape of the background distribution which often is approximated
by a linear or quadratic function. This makes it difficult to estimate the size
and the uncertainty of the expected background with sufficient accuracy to
establish the p-value for a large (>4 st. dev.) signal. As numerical example
let us consider an expectation of 1000 background events which is estimated
by the experimenter too low by 2%, i.e. equal to 980. Then a 4.3 st. dev.
excess would be claimed by him as a 5 st. dev. effect and he would find too
low a p-value by a factor of 28. We also have to be careful with numerical
approximations, for instance when we approximate a Poisson distribution by
a Gaussian. These uncertainties have to be included in the simulation of the
distribution of the test statistic.
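The numerical example above can be checked directly; the variable names are ours, and the factor comes out near the quoted value of 28:

```python
import math

def one_sided_p(s):
    # one-sided Gaussian p-value, eq. (10.24)
    return (1.0 - math.erf(s / math.sqrt(2.0))) / 2.0

b_est, b_true = 980.0, 1000.0                    # estimated vs true background
n_obs = b_est + 5.0 * math.sqrt(b_est)           # count that looks like 5 st. dev.
s_true = (n_obs - b_true) / math.sqrt(b_true)    # true significance, about 4.3
factor = one_sided_p(s_true) / one_sided_p(5.0)  # p-value too low by this factor
```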
Usually, the likelihood ratio, i.e. the ratio of the likelihood which max-
imizes Hs and the maximum likelihood for H0 is the most powerful test
statistic. In some situations a relevant parameter which characterizes the
signal strength is more informative.
Fig. 10.17. Transformation of p-values to the one-sided number of standard deviations.


10.6.2 The Likelihood Ratio Test


Definition
An obvious candidate for the test statistic is the likelihood ratio (LR) which
we have introduced and used in Sect. 10.4 to test the goodness-of-fit of
histograms, and in Sect. 10.5 as a two-sample test. We repeat here its general
definition:
\[ \lambda = \frac{\sup\,[L_0(\theta_0|x)]}{\sup\,[L_s(\theta_s|x)]} , \qquad \ln\lambda = \ln\sup\,[L_0(\theta_0|x)] - \ln\sup\,[L_s(\theta_s|x)] , \]
where L0, Ls are the likelihoods under the null hypothesis and the signal
hypothesis, respectively. The supremum is to be evaluated relative to the
parameters, i.e. the likelihoods are to be taken at the MLEs of the parameters.
The vector x represents the sample of the N observations x1, . . . , xN of a
one-dimensional geometric space. The extension to a multi-dimensional space is
trivial but complicates the writing of the formulas. The parameter space of
H0 is assumed to be a subset of that of Hs. Therefore λ will be smaller than
or equal to one.
For example, we may want to find out whether a background distribution
is described significantly better by a cubic than by a linear distribution:
\[ f_0 = \alpha_0 + \alpha_1 x , \tag{10.25} \]
\[ f_s = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 x^3 . \]
We would fit separately the parameters of the two functions to the observed
data and then take the ratio of the corresponding maximized likelihoods.
Frequently the data sample is so large that we better analyze it in the form
of a histogram. Then the distribution of the number of events yi in bin i, i =
1, . . . , B can be approximated by normal distributions around the parameter
dependent predictions ti(θ). As we have seen in Chap. 6, Sect. 7.1, we then
get the log-likelihood
\[ \ln L = -\frac{1}{2}\sum_{i=1}^{B} \frac{[y_i - t_i]^2}{t_i} + \mathrm{const.} \]
which is equivalent to the χ² statistic, χ² ≈ −2 ln L. In this limit the likelihood
ratio statistic is equivalent to the χ² difference, ∆χ² = min χ₀² − min χs²,
of the χ² deviations, min χ₀² with the parameters adjusted to the null hypothesis
H0, and min χs² with its parameters adjusted to the alternative hypothesis
Hs, background plus signal:
\[ \ln\lambda = \ln\sup\,[L_0(\theta_0|y)] - \ln\sup\,[L_s(\theta_s|y)] \tag{10.26} \]
\[ \phantom{\ln\lambda} \approx -\tfrac{1}{2}\left(\min\chi_0^2 - \min\chi_s^2\right) . \tag{10.27} \]
The p-value derived from the LR statistic does not take into account
that a simple hypothesis is a priori more attractive than a composite one
which contains free parameters. Another point of criticism is that the LR
is evaluated only at the parameters that maximize the likelihood, while the
parameters suffer from uncertainties. Thus conclusions should not be based
on the p-value alone.
A Bayesian approach applies so-called Bayes factors to correct for the
mentioned effects but is not very popular because it has other caveats. Its
essentials are presented in Appendix 13.17.

Distribution of the Test Statistic

The distribution of λ under H0 in the general case is not known analytically;
however, if the approximation (10.27) is justified, the distribution of −2 ln λ
under certain additional regularity conditions and the conditions mentioned
at the end of Sect. 10.4.2 will be described by a χ² distribution. In the example
corresponding to relations (10.25) this would be a χ² distribution with 2 degrees
of freedom, since fs compared to f0 has 2 additional free parameters. Knowing
the distribution of the test statistic reduces the computational effort required
for the numerical evaluation of p-values considerably.
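For the two additional free parameters of the example (10.25), the χ² survival function has the simple closed form exp(−∆χ²/2), so the asymptotic p-value is a one-liner:

```python
import math

def delta_chi2_pvalue_2dof(t):
    """Survival function of the chi-square distribution with 2 degrees of
    freedom, P(X > t) = exp(-t/2); applicable to the nested fits of (10.25),
    where f_s has two additional free parameters and the regularity
    conditions discussed below hold."""
    return math.exp(-0.5 * t)
```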
Let us look at a specific problem: we want to check whether an observed
bump above a continuous background can be described by a fluctuation or
whether it corresponds to a resonance. The two hypotheses may be described
by the distributions
\[ f_0 = \alpha_0 + \alpha_1 x + \alpha_2 x^2 , \tag{10.28} \]
\[ f_s = \alpha_0 + \alpha_1 x + \alpha_2 x^2 + \alpha_3 N(x|\mu,\sigma) , \]
and we can again use ln λ or ∆χ² as test statistic. Since we have to define
the test before looking at the data, µ and σ will be free parameters in the fit
of fs to the data. Unfortunately, now ∆χ² no longer follows a χ² distribution
with 3 degrees of freedom and has a significantly larger expectation value than
expected from the χ² distribution. The reason for this dilemma is that for
α3 = 0, which corresponds to H0, the other parameters µ and σ are undefined,
and thus part of the χ² fluctuation in the fit to fs is unrelated to the difference
between fs and f0.
More generally, ∆χ² follows in the large number limit a χ² distribution
with the number of degrees of freedom given by the difference of the numbers
of free parameters of the null and the alternative hypotheses only if the
following conditions are satisfied:
1. The distribution f0 of H0 has to be a special realization of the distribution
fs of Hs.
2. The fitted parameters have to be inside the region, i.e. off the boundary,
allowed by the hypotheses. For example, the MLE of the location of a
Gaussian should not be outside the range covered by the data.
Fig. 10.18. Distributions of the test statistic under H0 and p-value as a function
of the test statistic.

3. All parameters of Hs have to be defined under H0.

If one of these conditions is not satisfied, the distribution of the test
statistic has to be obtained via a Monte Carlo simulation. This means that
we generate many fictitious experiments of H0 and count how many of those
have values of the test statistic that exceed the one which has actually been
observed. The corresponding fraction is the p-value for H0. This is a fairly
involved procedure because each simulation includes fitting of the free
parameters of the two hypotheses. In Ref. [91] it is shown that the asymptotic
behavior of the distribution can be described by an analytical function. In
this way the amount of simulation can be reduced.

Example 147. Distribution of the likelihood ratio statistic

We consider a uniform distribution (H0) of 1000 events in the interval
[0, 1] and as alternative a resonance with Gaussian width, σ = 0.05, and
arbitrary location µ in the range 0.2 ≤ µ ≤ 0.8, superposed on a uniform
distribution. The free parameters are ε, the fraction of resonance events, and
µ. The logarithm of the likelihood ratio statistic is
Fig. 10.19. Histogram of the event sample used for the likelihood ratio test. The curve
is an unbinned likelihood fit to the data.

\[
\begin{aligned}
\ln\lambda &= \ln\sup\,[L_0(\theta_0|x)] - \ln\sup\,[L_s(\theta_s|x)] \\
&= \sum_{i=1}^{1000}\ln(1) - \sum_{i=1}^{1000}\ln\left[1-\hat\varepsilon+\frac{\hat\varepsilon}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\hat\mu)^2}{2\sigma^2}\right)\right] \\
&= -\sum_{i=1}^{1000}\ln\left[1-\hat\varepsilon+\frac{\hat\varepsilon}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i-\hat\mu)^2}{2\sigma^2}\right)\right] ,
\end{aligned}
\]

essentially the negative logarithm of the likelihood of the MLE. Fig. 10.18
shows the results from a million simulated experiments. The distribution
of −ln λ under H0 has a mean value of −1.502, which corresponds to
⟨∆χ²⟩ = 3.004. The p-value as a function of −ln λ follows asymptotically
an exponential, as is illustrated in the right hand plot of Fig. 10.18. Thus it is
possible to extrapolate the function to smaller p-values, which is necessary to
claim large effects. Figure 10.19 displays the result of an experiment where
a likelihood fit finds a resonance at the energy 0.257. It contains a fraction
of 0.0653 of the events. The logarithm of the likelihood ratio is 9.277. The
corresponding p-value for H0 is pLR = 1.8 · 10⁻⁴. Hence it is likely that the
observed bump is a resonance. In fact it had been generated as a 7% contribution
of a Gaussian distribution N(x|0.25, 0.05) to a uniform distribution.
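The statistic −ln λ of this example can be evaluated directly once the fitted values ε̂ and µ̂ are known; a sketch (the function name is ours, and the fit itself is not shown):

```python
import math

def neg_log_lr(xs, eps_hat, mu_hat, sigma=0.05):
    """-ln(lambda) for a uniform H0 on [0, 1] against a uniform distribution
    plus Gaussian bump, evaluated at the fitted fraction eps_hat and
    location mu_hat, as in Example 147."""
    total = 0.0
    for x in xs:
        gauss = (math.exp(-(x - mu_hat) ** 2 / (2.0 * sigma ** 2))
                 / (math.sqrt(2.0 * math.pi) * sigma))
        total += math.log(1.0 - eps_hat + eps_hat * gauss)
    return total
```

For ε̂ = 0 the two hypotheses coincide and −ln λ vanishes; events clustered near µ̂ drive it to large positive values.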
Fig. 10.20. Distributions of the test statistic for H0 and for experiments with a
1.5% resonance contribution. In the lower graph the p-value for H0 is given as a
function of the test statistic.

We have to remember though that the p-value is not the probability that
H0 is true, it is the probability that H0 simulates the resonance of the size
seen in the data or larger. In a Bayesian treatment, see Appendix 13.17, we
find betting odds in favor of H0 of about 2% which is much less impressive.
The two numbers refer to different issues but nonetheless we have to face the
fact that the two different statistical approaches lead to different conclusions
about how evident the existence of a bump really is.
In experiments with a large number of events, the computation of the p-
value distribution based on the unbinned likelihood ratio becomes excessively
slow and we have to turn to histograms and to compute the likelihood ratio
of H0 and Hs from the histogram. Figure 10.20 displays some results from
the simulation of 106 experiments of the same type as above but with 10000
events distributed over 100 bins.
The figure shows the distributions of the LR statistic for H0 and for
experiments with a 1.5% resonance added. The large spread of the signal
distributions reflects the fact that identical experiments by chance may
observe a very significant signal or just a slight indication of a resonance.

General Multi-Channel Case

We now extend the likelihood ratio test to the multi-channel case. We assume
that the observations xk of the channels k = 1, . . . , K are independent of each
other. The overall likelihood is the product of the individual likelihoods. For
the log-likelihood ratio we then have to replace (10.26) by
\[ \ln\lambda = \sum_{k=1}^{K}\left\{\ln\sup\,[L_{0k}(\theta_{0k}|x_k)] - \ln\sup\,[L_{sk}(\theta_{sk}|x_k)]\right\} . \]

As an example, we consider an experiment where we observe bumps at
the same mass in K different decay channels, bumps which are associated to
the same phenomenon, i.e. a particle decaying into different secondaries.
When we denote the decay contribution into channel k by εk, the p.d.f.
of the decay distribution by fk(xk|θk) and the corresponding background
distributions by f0k(xk|θ0k), the distribution under H0 is
\[ f_0(x_1,\ldots,x_K|\theta_{01},\ldots,\theta_{0K}) = \prod_{k=1}^{K} f_{0k}(x_k|\theta_{0k}) \]

and the alternative signal distribution is
\[ f_s(x_1,\ldots,x_K|\theta_{01},\ldots,\theta_{0K};\theta_1,\ldots,\theta_K;\varepsilon_1,\ldots,\varepsilon_K) = \prod_{k=1}^{K}\left[(1-\varepsilon_k)f_{0k}(x_k|\theta_{0k}) + \varepsilon_k f_k(x_k|\theta_k)\right] . \]
The likelihood ratio is then
\[ \ln\lambda = \sum_{k=1}^{K}\left\{\ln f_{0k}(x_k|\hat\theta_{0k}) - \ln\left[(1-\hat\varepsilon_k)f_{0k}(x_k|\hat\theta'_{0k}) + \hat\varepsilon_k f_k(x_k|\hat\theta_k)\right]\right\} . \]

Note that the MLEs of the parameters θ0k depend on the hypothesis.
They are different for the null and the signal hypotheses and, for this reason,
have been marked by a prime in the latter.

10.6.3 Tests Based on the Signal Strength

Instead of using the LR statistic it is often preferable to use a parameter of
Hs as test statistic. In the simple example of (10.25) the test statistic t = α3
would be a sensible choice. When we want to estimate the significance of a
line in a background distribution, the number of events which we associate to
the line (or the parameter α3 in our example (10.28)) is a reasonable test
statistic instead of the likelihood ratio. Compared to the LR statistic it has
the advantage of representing a physical parameter, but usually the
corresponding test is less powerful.

Example 148. Example 147 continued

Using the fitted fraction of resonance events as test statistic, the p-value
for H0 is pf = 2.2 · 10⁻⁴, slightly less stringent than that obtained from the
LR. Often physicists compare the number of observed events directly to the
prediction from H0. In our example we have 243 events within two standard
deviations around the fitted energy of the resonance, compared to the
expectation of 200 from a uniform distribution. The probability to observe ≥ 243 for
a Poisson distribution with mean 200 is pp = 7.3 · 10⁻⁴. This number cannot
be compared directly with pLR and pf because the latter two values include
the look-elsewhere effect, i.e. that the simulated resonance may be located
at an arbitrary energy. A lower number for pp is obtained if the background
is estimated from the side bands, but then the computation becomes more
involved because the error on the expectation has to be included. Primitive
methods are only useful for a first crude estimate.

We learn from this example that the LR statistic provides the most power-
ful test among the considered alternatives. It does not only take into account
the excess of events of a signal but also its expected shape. For this reason
pLR is smaller than pf .
Often the significance of a signal s is stated in units of standard deviations
σ:
s = Ns / √(N0 + δ0²) .
Here Ns is the number of events associated to the signal, N0 is the number
of events in the signal region expected from H0 and δ0 its uncertainty. In
the Gaussian approximation it can be transformed into a p-value via (10.23).
Unless N0 is very large and δ0 is very well known, this p-value has to be
considered as a lower limit or a rough guess.
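As a rough numerical cross-check of this formula (a sketch; we reuse the event numbers of the example above and assume δ0 = 0, which is an idealization):

```python
from math import erfc, sqrt

def significance(N_s, N_0, delta_0):
    # s = Ns / sqrt(N0 + delta0^2)
    return N_s / sqrt(N_0 + delta_0**2)

def gaussian_p_value(s):
    # one-sided Gaussian tail probability for s standard deviations
    return 0.5 * erfc(s / sqrt(2))

# 243 observed events, 200 expected from H0, delta_0 = 0 assumed
s = significance(243 - 200, 200, 0.0)
p = gaussian_p_value(s)
```

The result, s of about 3 standard deviations with a Gaussian p-value of roughly 1.2 · 10⁻³, differs from the exact Poisson value pp = 7.3 · 10⁻⁴, illustrating that the Gaussian shortcut is only a rough guess.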
11 Statistical Learning

11.1 Introduction
In the process of its mental evolution a child learns to classify objects,
persons, animals, and plants. This process partially proceeds through expla-
nations by parents and teachers (supervised learning), but partially also by
cognition of the similarities of different objects (unsupervised learning). But
the process of learning – of children and adults – is not restricted to the de-
velopment of the ability merely to classify but it includes also the realization
of relations between similar objects, which leads to ordering and quantify-
ing physical quantities, like size, time, temperature, etc. This is relatively
easy, when the laws of nature governing a specific relation have been discov-
ered. If this is not the case, we have to rely on approximations, like inter- or
extrapolations.
Also computers, when appropriately programmed, can perform learning
processes in a similar way, though to a rather modest degree. The achieve-
ments of the so-called artificial intelligence are still rather moderate in most
areas; however, substantial progress has been achieved in the fields of supervised
learning and classification, where computers profit from their ability
to handle a large amount of data in a short time and to provide precise
quantitative solutions to well defined specific questions. The techniques and
programs that allow computers to learn and to classify are summarized in
the literature under the term machine learning.
Let us specify the type of problems which we discuss in this chapter: For
an input vector x we want to find an output ŷ. The input is also called predic-
tor, the output response. Usually, each input consists of several components
(attributes, properties), and is written therefore in boldface letters. Normally,
it is a metric (quantifiable) quantity but it could also be a categorical quantity
like a color or a particle type. The output can also contain several components
or consists of a single real or discrete (Yes or No) variable. Like a human
being, a computer program learns from past experience. The teaching pro-
cess, called training, uses a training sample {(x1 , y1 ), (x2 , y2 ), . . . , (xN , yN )},
where for each input vector xi the response y i is known. When we ask for the
response to an arbitrary continuous input x, usually its estimate ŷ(x) will be
more accurate when the distance to the nearest input vector of the training
sample is small than when it is far away. Consequently, the training sample
should be as large as possible or affordable. The region of interest should be
covered with input vectors homogeneously, and we should be aware that the
accuracy of the estimate decreases at the boundary of this region.
Learning which exceeds simple memorizing relies on the existence of more
or less simple relations between input and response: Similar input corresponds
to similar response. In our approach this translates into the requirement that
the responses are similar for input vectors which are close. We cannot learn
much from erratic distributions.

Example 149. Simple empirical relations


The resistance R of a wire is used for a measurement of the temperature
T . In the teaching process which here is called calibration, a sample of cor-
responding values Ri , Ti is acquired. In the application we want to find for a
given input R an estimate of T . Usually a simple interpolation will solve this
problem.

For more complicated relations, approximations with polynomials, higher


spline functions or orthogonal functions are useful.
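For the calibration example, a simple interpolation often suffices. A minimal sketch (the resistance–temperature pairs below are invented for illustration):

```python
import numpy as np

# hypothetical calibration sample (R_i, T_i); the numbers are made up
R_cal = np.array([100.0, 103.9, 107.8, 111.7])  # resistance in ohm
T_cal = np.array([0.0, 10.0, 20.0, 30.0])       # temperature in deg C

# estimate the temperature for a measured resistance R = 105 ohm
T_est = float(np.interp(105.0, R_cal, T_cal))
```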

Example 150. Search for common properties


A certain class of molecules has a positive medical effect. The structure,
physical and chemical properties x of these molecules are known. In order to
find out which combination of the properties is relevant, the distribution of
all attributes of the molecules which represent the training objects is investi-
gated. A linear method for the solution of this task is the principal component
analysis.

Example 151. Two-class classification, SPAM mails


A sizable fraction of electronic mails is of no interest to the addressee and
considered a nuisance. Many mailing systems use filter programs
to eliminate these undesired so-called SPAM mails. (SPAM is an artificial
nonsense word borrowed from a sketch of a British comedy series of Monty
Python’s Flying Circus where in a cafe every meal contains SPAM.) After
evaluation of a training sample where the classification into Yes or No (ac-
cept or reject) is done by the user, the programs are able to take over the
classification job. They identify certain characteristic words, like Viagra, sex,
profit, advantage, meeting, experiment, university and other attributes like
large letters, colors to distinguish between SPAM and serious mails. This
kind of problem is efficiently solved by decision trees and artificial neural
networks.

The attributes are here categorical variables. In the following we will


restrict ourselves mainly to continuous variables.

Example 152. Multi-class classification, pattern recognition


Hand-written letters or figures have to be recognized. Again a sample
for which the relation between the written pixels and the letters is known
is used to train the program. Also this problem can be treated by decision
trees, artificial neural networks, and by kernel methods. Here the attributes
are the pixel coordinates.

As we have observed also previously, multivariate applications suffer from


the curse of dimensionality. There are two reasons: i) With increasing number
d of dimensions, the distance between the input vectors increases and ii) the
surface effects are enhanced. When a fixed number of points is uniformly
distributed over a hyper-cube of dimension d, the mean distance between the
points is proportional to √d. The higher the dimension, the more empty is
the space. At the same time the region where estimates become less accurate
due to surface effects increases. The fraction of the volume taken by a hyper-sphere
inscribed into a hyper-cube is only 5.2% for d = 5, and the fraction of
the volume within a distance to the surface less than 10% of the edge length
increases like 1 − 0.8^d, i.e. from 20% for d = 1 to 67% for d = 5.

Example 153. Curse of dimensionality


A training sample of 1000 five-dimensional inputs is uniformly distributed
over a hyper-cube of edge length a. To estimate the function value at the
center of the region we take all sample elements within a distance of a/4 from
the center. These are on average only one to two (1000 × 0.052 × 0.5⁵ ≈ 1.6),
while in one dimension 500 elements would contribute.
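The numbers of this example are easily checked with a small Monte Carlo simulation (a sketch; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
points = rng.random((1000, 5))               # 1000 uniform points in the 5-dim unit cube
dist = np.linalg.norm(points - 0.5, axis=1)  # distances to the cube centre
n_close = int(np.sum(dist < 0.25))           # elements within a/4 of the centre
# expectation: 1000 * 0.052 * 0.5**5, i.e. about 1.6 points
```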

In the following, we will first discuss the approximation of measurements


afflicted with errors by analytic functions and the interpolation by smooth-
ing techniques. Next we introduce the factor analysis, including the so-called
principal component analysis. The last section deals with classification meth-
ods, based on artificial neural networks, kernel algorithms, and decision trees.
In recent years we observed fast progress in this field due to new developments,
e.g. support vector machines, boosting, and the availability of powerful
general computer algorithms. This book can only introduce these methods,
without claim of completeness. A nice review of the whole field is given in
[16].

11.2 Smoothing of Measurements and Approximation


by Analytic Functions
We start with two simple examples, which illustrate applications:
i) In a sequence of measurements the gas amplification of an ionization
chamber as a function of the applied voltage has been determined. We would
like to describe the dependence in the form of a smooth curve.
ii) With optical probes it is possible to scan a surface profile point-wise.
The objects may be workpieces, tools, or human bodies. The measurements
can be used by milling machines or cutting devices to produce replicates or
clothes. To steer these machines, a complete surface profile of the objects
is needed. The discrete points have to be approximated by a continuous
function. When the surface is sufficiently smooth, this may be achieved by
means of a spline approximation.
More generally, we are given a number N of measurements yi with uncer-
tainties δi at fixed locations xi , the independent variables, but are interested
in the values of the dependent or response variable y at different values of x,
that is, we search for a function f (x) which approximates the measurements,
improves their precision and inter- and extrapolates in x. The simplest way
to achieve this is to smooth the polygon connecting the data points.
More efficient is the approximation of the measurement by a parameter-dependent
analytic function f(x, θ). We then determine the parameters by
a least squares fit, i.e. we minimize the sum of the squared normalized
residuals Σi [yi − f(xi , θ)]²/δi² with respect to θ. The approximation should
be compatible with the measurements within their statistical errors but the
number of free parameters should be as small as possible. The accuracy of
the measurements has a decisive influence on the number of free parameters
which we permit in the fit. For large errors we allow also for large deviations
of the approximation from the measurements. As a criterion for the number
of free parameters, we use statistical tests like the χ2 test. The value of χ2
should then be compatible with the number of constraints, i.e. the number of
measured points minus the number of fitted parameters. Too low a number
of parameters leads to a bias of the predictions, while too many parameters
reduce the accuracy, since we profit less from constraints.
Both approaches rely on the presumption that the true function is simple
and smooth. Experience tells us that these conditions are justified in most
cases.
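A sketch of this procedure for polynomials, using a weighted least squares fit (NumPy's polyfit expects weights 1/δ, not 1/δ²); in practice the degree K would be raised until χ² is compatible with the number of constraints:

```python
import numpy as np

def poly_chi2_fit(xs, ys, dys, K):
    """Weighted least squares fit of a degree-K polynomial; returns the
    coefficients, chi^2, and the number of degrees of freedom N - (K+1)."""
    coeffs = np.polyfit(xs, ys, K, w=1.0 / dys)
    chi2 = float(np.sum(((ys - np.polyval(coeffs, xs)) / dys) ** 2))
    return coeffs, chi2, len(xs) - (K + 1)
```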
The approximation of measurements which all have the same uncertainty
by analytic functions is called regression analysis. Linear regression had been
described in Chap. 6.7.1. In this section we treat the general non-linear case
with arbitrary errors.
In principle, the independent variable may also be multi-dimensional.
Since then the treatment is essentially the same as in the one-dimensional
situation, we will mainly discuss the latter.

11.2.1 Smoothing Methods

We use the measured points in the neighborhood of x to get an estimate


of the value of y(x). We denote the uncertainties of the output vectors of
the training sample by δj for the component j of y. When the points of
the training sample have large errors, we average over a larger region than
in the case of small errors. The better accuracy of the average for a larger
region has to be paid for by a larger bias, due to the possibility of larger
fluctuations of the true function in this region. Weighting methods work
properly if the function is approximately linear. Difficulties arise in regions
with a lot of structure and at the boundaries of the region if the function
there is not approximately constant.

k-Nearest Neighbors

The simplest method for a function approximation is similar to the density


estimation which we treat in Chap. 9 and which uses the nearest neighbors in
the training sample. We define a distance di = |x − xi | and sort the elements
of the training sample in the order of their distances di < di+1 . We choose a
number K of nearest neighbors and average over the corresponding output
vectors:
ŷ(x) = (1/K) Σi=1..K yi .
This relation holds for constant errors. Otherwise for the component j of y
the corresponding weighted mean
ŷj(x) = [ Σi=1..K yij/δij² ] / [ Σi=1..K 1/δij² ]

has to be used. The choice of K depends on the density of points in the


training sample and the expected variation of the true function y(x).
If all individual points in the projection j have mean square errors δj², the
error of the prediction δyj is given by

(δyj)² = δj²/K + ⟨yj(x) − ŷj(x)⟩² .   (11.1)
The first term is the statistical fluctuation of the mean value. The second
term is the bias which is equal to the systematic shift squared, and which
is usually difficult to evaluate. There is the usual trade-off between the two
error components: with increasing K the statistical term decreases, but the
bias increases by an amount depending on the size of the fluctuations of the
true function within the averaging region.
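A minimal sketch of the plain K-nearest-neighbour estimate for equal errors (function and variable names are our own):

```python
import numpy as np

def knn_average(x, xs, ys, K):
    """Estimate y(x) as the mean response of the K nearest training points."""
    nearest = np.argsort(np.abs(xs - x))[:K]
    return ys[nearest].mean()
```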

k-Nearest Neighbors with Linear Approximation

The simple average suffers from the drawback that at the boundary of the
variable space the measurements contributing to the average are distributed
asymmetrically with respect to the point of interest x. If, for instance, the
function falls strongly toward the left-hand boundary of a one-dimensional
space, averaging over points which are predominantly located at the right
hand side of x leads to too large a result. (See also the example at the end of
this section). This problem can be avoided by fitting a linear function through
the K neighboring points instead of using the mean value of y.

Gaussian Kernels

To take all k-nearest neighbors into account with the same weight indepen-
dent of their distance to x is certainly not optimal. Furthermore, its out-
put function is piecewise constant (or linear) and thus discontinuous. Better
should be a weighting procedure, where the weights become smaller with in-
creasing distances. An often used weighting or kernel function1 is the Gaus-
sian. The sum is now taken over all N training inputs:
ŷ(x) = Σi=1..N yi e^{−α|x−xi|²} / Σi=1..N e^{−α|x−xi|²} .

The constant α determines the range of the correlation. Therefore the width
s = 1/√(2α) of the Gaussian has to be adjusted to the density of the points
and to the curvature of the function. If computing time has to be economized,
the sum may of course be truncated and restricted to the neighborhood of x,
for instance to the distance 2s. According to (11.1) the mean squared error
becomes2 :
(δyj)² = δj² Σi e^{−2α|x−xi|²} / ( Σi e^{−α|x−xi|²} )² + ⟨yj(x) − ŷj(x)⟩² .

1
The denotation kernel will be justified later, when we introduce classification
methods.
2
This relation has to be modified if not all errors are equal.
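The Gaussian-weighted estimate can be sketched as follows for a scalar input and equal errors (names are our own; for large samples the sum could be truncated as described above):

```python
import numpy as np

def gaussian_smooth(x, xs, ys, alpha):
    """Kernel-weighted average over all N training points."""
    w = np.exp(-alpha * (x - xs) ** 2)
    return np.sum(w * ys) / np.sum(w)
```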

11.2.2 Approximation by Orthogonal Functions

Complete sets of orthogonal function systems offer three attractive features:


i) The fitted function coefficients are uncorrelated, ii) The function systems
are complete and thus able to approximate any well behaved, i.e. square
integrable, function, iii) They are naturally ordered with increasing oscillation
frequency3 . The function system to be used depends on the specific problem,
i.e. on the domain of the variable and the asymptotic behavior of the function.
Since the standard orthogonal functions are well known to physicists, we will
be very brief and omit all mathematical details; they can be looked up in
mathematical handbooks.
Complete normalized orthogonal function systems {ui (x)} defined on the
finite or infinite interval [a, b] fulfil the orthogonality and the completeness
relations. To simplify the notation, we introduce the inner product (g, h)
(g, h) ≡ ∫_a^b g*(x) h(x) dx

and have

(ui , uj) = δij ,

Σi ui*(x) ui(x′) = δ(x − x′) .

For instance, the functions of the well known Fourier system for the interval
[a, b] = [−L/2, L/2] are un(x) = (1/√L) exp(i2πnx/L).
Every square integrable function can be represented by the series

f(x) = Σi=0..∞ ai ui(x) ,  with ai = (ui , f)

in the sense that the squared difference converges to zero with increasing
number of terms4 :
" K
#2
X
lim f (x) − ai ui (x) = 0 . (11.2)
K→∞
i=0

The coefficients ai become small for large i, if f (x) is smooth as compared


to the ui (x), which oscillate faster and faster for increasing i. Truncation of
the series therefore causes some smoothing of the function.
The approximation of measurements by orthogonal functions works quite
well for very smooth data. When the measurements show strong short range
variations, sharp peaks or valleys, then a large number of functions is required
3
We use the term frequency also for spatial dimensions.
4
At possible discontinuities, f(x) should be taken as [f(x + 0) + f(x − 0)]/2.

Table 11.1. Characteristics of orthogonal polynomials.


Polynomial          Domain      Weight function
Legendre, Pi(x)     [−1, +1]    w(x) = 1
Hermite, Hi(x)      (−∞, +∞)    w(x) = exp(−x²)
Laguerre, Li(x)     [0, ∞)      w(x) = exp(−x)

to describe the data. Neglecting individually insignificant contributions may


lead to a poor approximation. Typically, their truncation may produce spu-
rious oscillations (“ringing”) in regions near to the peaks, where the true
function is already flat.
For large data sets with equidistant points and equal errors the Fast
Fourier Transform, FFT, plays an important role, especially for data smooth-
ing and image processing. Besides the trigonometric functions, other orthogo-
nal systems are useful, some of which are displayed in Table 11.1. The orthog-
onal functions are proportional to polynomials pi(x) of degree i multiplied by
the square root of a weight function w(x): ui(x) = pi(x) √w(x). Specifying
the domain [a, b] and w, and requiring orthogonality for ui , uj ,
(ui , uj ) = ci δij ,
fixes the polynomials up to the somewhat conventional normalization factors

ci .
The most familiar orthogonal functions are the trigonometric functions
used in the Fourier series mentioned above. From electrodynamics and quan-
tum mechanics we are also familiar with Legendre polynomials and spherical
harmonics. These functions are useful for data depending on variables de-
fined on the circle or on the sphere, e.g. angular distributions. For example,
the distribution of the intensity of the microwave background radiation which
contains information about the curvature of the space, the baryon density and
the amount of dark matter in the universe, is usually described as a function of
the solid angle by a superposition of spherical harmonics. In particle physics
the angular distributions of scattered or produced particles can be described
by Legendre polynomials or spherical harmonics. Functions extending to ±∞
are often approximated by the eigenfunctions of the harmonic oscillator, consisting
of Hermite polynomials multiplied by the exponential exp(−x²/2), and
functions bounded to x ≥ 0 by Laguerre polynomials multiplied by e^{−x/2}.
In order to approximate a given measurement by one of the orthogonal
function systems, one usually has to shift and scale the independent variable
x.

Polynomial Approximation

The simplest function approximation is achieved with a simple polynomial
f(x) = Σk ak x^k or, more generally, by f(x) = Σk ak uk , where uk is a polynomial
of order k. Given data yν with uncertainties δν at locations xν we


minimize

χ² = Σν=1..N (1/δν²) [ yν − Σk=0..K ak uk(xν) ]² ,   (11.3)

in order to determine the coefficients ak . To constrain the coefficients, their


number K + 1 has to be smaller than the number N of measurements. All
polynomial systems of the same order describe the data equally well but
differ in the degree to which the coefficients are correlated. The power of the
polynomial is increased until it is compatible within statistics with the data.
The decision is based on a χ2 criterion.
The purpose of this section is to show how we can select polynomials with
uncorrelated coefficients. In principle, these polynomials and their coefficients
can be computed through diagonalization of the error matrix but they can
also be obtained directly with the Gram–Schmidt method. This method has
the additional advantage that the polynomials and their coefficients are given
by simple algebraic relations.
For a given sample of measured points yν = f (xν ) with errors δν , we fix
the weights in the usual way
wν = w(xν) = (1/δν²) / Σj (1/δj²) ,

and now define the inner product of two functions g(x), h(x) by
(g, h) = Σν wν g(xν) h(xν)

with the requirement


(ui , uj ) = δij .
Minimizing χ² is equivalent to minimizing

X² = Σν=1..N wν [ yν − Σk=0..K ak uk(xν) ]² .

For K = N − 1 the square bracket at the minimum of X² is zero,

yν − Σk=0..N−1 ak uk(xν) = 0

for all ν, and forming the inner product with uj we get

(y, uj ) = aj . (11.4)

This relation produces the coefficients also in the interesting case K < N − 1.

To construct the orthogonal polynomials, we set v0 = 1,

ui = vi / √(vi , vi) ,   (11.5)

vi+1 = x^{i+1} − Σj=0..i (uj , x^{i+1}) uj .   (11.6)

The first two terms in the corresponding expansion, a0 u0 and a1 u1 , are


easily calculated. From (11.5), (11.6), (11.4) and the following definition of
the moments of the weighted sample
x̄ = Σν wν xν ,   sx² = Σν wν (xν − x̄)² ,   sxy = Σν wν (xν yν − x̄ȳ)

we find the coefficients and functions which fix the polynomial expansion of


y:
y = ȳ + (sxy/sx²) (x − x̄) .
We recover the well known result for the best fit by a straight line in the
form with independent coefficients: This is of course no surprise, as the func-
tions that are minimized are identical, namely χ2 in both cases, see Example
93 in Chap. 6.4.5. The calculation of higher order terms is straightforward
but tedious. The uncertainties δai of the coefficients are all equal, independent
of i, and given by the simple relation

(δai)² = 1 / Σν=1..N (1/δν²) .

The derivation of this formula is given in the Appendix 13.14 together with
formulas for the polynomials in the special case where all measurements have
equal errors and are uniformly distributed in x.
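The construction (11.5), (11.6) and the coefficient formula (11.4) translate almost literally into code. The following sketch evaluates the orthonormal polynomials only at the measured points (names are our own):

```python
import numpy as np

def orthogonal_poly_fit(xs, ys, dys, K):
    """Gram-Schmidt orthonormal polynomials for the weighted inner product
    (g, h) = sum_nu w_nu g(x_nu) h(x_nu); returns the coefficients a_j = (y, u_j)
    of eq. (11.4) and the basis values u_j(x_nu)."""
    w = 1.0 / dys**2
    w = w / w.sum()                                   # normalized weights w_nu
    basis = []
    for i in range(K + 1):
        v = xs.astype(float) ** i                     # start from the monomial x^i
        for u in basis:
            v = v - np.sum(w * u * v) * u             # subtract projections, cf. (11.6)
        basis.append(v / np.sqrt(np.sum(w * v * v)))  # normalize, cf. (11.5)
    a = np.array([np.sum(w * u * ys) for u in basis]) # coefficients, eq. (11.4)
    return a, np.array(basis)
```

For a straight line and K = 1 the expansion reproduces the data exactly, and the constructed basis functions are orthonormal with respect to the weighted inner product.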

The Gram–Charlier Series

The following example for the application of Hermite functions, strictly


speaking, does not concern the approximation of measurements by a function
but the approximation of an empirical p.d.f. (see Sect. 12.1.1 in the following
Chapter). We discuss it here since it is mathematically closely related to the
subject of this section.
The Gram–Charlier series is used to approximate empirical distributions
which do not differ very much from the normal distribution. It expresses
the quotient of an empirical p.d.f. f (x) to the standard normal distribution
N (x|0, 1) as an expansion in the slightly modified Hermite polynomials H̃i (x)
in the form

f(x) = N(x) Σi=0..∞ ai H̃i(x) .   (11.7)

Here, N(x) ≡ (2π)^{−1/2} exp(−x²/2), the standard normal distribution,
differs somewhat from the weight function exp(−x²) used in the definition of
the Hermite polynomials H(x) given above in Table 11.1. The two definitions
of the polynomials are related by
H̃i(x) = 2^{−i/2} Hi(x/√2) .
The orthogonality relation of the modified polynomials is
(H̃i , H̃j) = ∫_{−∞}^{+∞} N(x) H̃i(x) H̃j(x) dx = i! δij ,   (11.8)

and their explicit form can be obtained by the simple recursion relation:
H̃i+1 = xH̃i − iH̃i−1 .
With H̃0 = 1 , H̃1 = x we get
H̃2 = x² − 1 ,
H̃3 = x³ − 3x ,
H̃4 = x⁴ − 6x² + 3 ,
and so on.
When we multiply both sides of (11.7) with H̃j (x) and integrate, we find,
according to (11.8), the coefficients ai from
ai = (1/i!) ∫ f(x) H̃i(x) dx .
These integrals can be expressed as combinations of moments of f (x),
which are to be approximated by the sample moments of the experimental
distribution. First, the sample mean and the sample variance are used to
shift and scale the experimental distribution such that the transformed mean
and variance equal 0 and 1, respectively. Then a1,2 = 0, and the empirical
skewness and excess of the normalized sample γ1,2 as defined in Sect. 3.2 are
proportional to the parameters a3,4 . The approximation to this order is
f(x) ≈ N(x) [ 1 + (γ1/3!) H̃3(x) + (γ2/4!) H̃4(x) ] .
As mentioned, this approximation is well suited to describe distributions
which are close to normal distributions. This is realized, for instance, when the
variate is a sum of independent variates such that the central limit theorem
applies. It is advisable to check the convergence of the corresponding Gram–
Charlier series and not to truncate the series too early [3].
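As a sketch, the truncated approximation can be computed directly from the sample moments (the helper below is our own; it standardizes the sample as described):

```python
import numpy as np

def gram_charlier4(x, sample):
    """Gram-Charlier approximation including the H3 and H4 terms."""
    z = (sample - sample.mean()) / sample.std()
    g1 = np.mean(z**3)            # empirical skewness gamma_1
    g2 = np.mean(z**4) - 3.0      # empirical excess gamma_2
    N = np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
    H3 = x**3 - 3.0 * x
    H4 = x**4 - 6.0 * x**2 + 3.0
    return N * (1.0 + g1 * H3 / 6.0 + g2 * H4 / 24.0)
```

For a large sample drawn from a normal distribution, γ1 and γ2 are close to zero and the approximation reduces to N(x), as expected.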
Fig. 11.1. Nine orthonormalized wavelets with three different frequencies.

11.2.3 Wavelets

The trigonometric functions used in the Fourier series are discrete in the
frequency domain, but extend from minus infinity to plus infinity in the
spatial domain and thus are not very well suited to describe strongly localized
function variations. To handle this kind of problems, the wavelet system has
been invented. Wavelets are able to describe pulse signals and spikes like
those generated in electrocardiograms, nuclear magnetic resonance (NMR)
records or seismic records, in data transmission, and for the coding of images
and hand-written text. For data reduction and storage they have become an
indispensable tool.
The simplest orthogonal system with the desired properties are the Haar
wavelets shown in Fig. 11.1. The lowest row shows three wavelets which are
orthogonal, because they have no overlap. The next higher row contains again
three wavelets with one half the length of those below. They are orthogonal
to each other and to the wavelets in the lower row. In the same way the higher
frequency wavelets in the following row are constructed. We label them with
two indices j, k indicating length and position. We define a mother function
ψ(x), the bottom left wavelet function of Fig. 11.1.

ψ(x) =   1,  if 0 ≤ x < 1/2
        −1,  if 1/2 ≤ x < 1
         0,  else

and set W00 = ψ(x). The remaining wavelets are then obtained by transla-
tions and dilatations in discrete steps from the mother function ψ(x):

Wjk(x) = 2^{j/2} ψ(2^j x − k) .

The factor 2^{j/2} provides the normalization in the orthonormality relation5.

∫_{−∞}^{+∞} Wik(x) Wjl(x) dx = δij δkl .   (11.9)
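A numerical sketch of the Haar mother function, the translated and dilated wavelets Wjk, and a check of the orthonormality relation (11.9) by summation on a fine grid:

```python
import numpy as np

def psi(x):
    """Haar mother wavelet."""
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

def W(j, k, x):
    """Translated and dilated Haar wavelet W_jk."""
    return 2.0 ** (j / 2) * psi(2.0**j * x - k)

x = np.linspace(0.0, 1.0, 2**14, endpoint=False)
dx = x[1] - x[0]
norm = np.sum(W(1, 0, x) ** 2) * dx             # should be close to 1
overlap = np.sum(W(0, 0, x) * W(1, 0, x)) * dx  # should be close to 0
```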

It is evident that wavelets are much better suited to fit local structures


than the sine and cosine functions of the Fourier expansion, since the wavelet
expansion coefficients cjk contain information on frequency and location of
a signal.
The simple Haar wavelets shown in Fig. 11.1, which we have introduced to
demonstrate the principal properties of wavelets, are rarely used in applications,
as functions with infinitely sharp edges are usually absent in realistic
phenomena. More common are the smoother wavelets

ψ(x) = (1/(√(2π) σ³)) e^{−x²/(2σ²)} (1 − x²/σ²)   (Mexican Hat) ,   (11.10)

ψ(x) = (e^{ix} − c) e^{−x²/(2σ²)}   (Morlet-Wavelet) ,   (11.11)

and many others. The first function, the Mexican hat, is the second derivative
of the Gaussian function, Fig. 11.2. The second, the Morlet function, is a
complex monochromatic wave, modulated by a Gaussian. The constant c =
exp(−σ²/2) in the Morlet function can usually be neglected by choosing a
wide lowest order function, σ ≳ 5. In both functions σ defines the width of
the window.
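The Mexican hat (11.10) in code, together with a numerical check that it integrates to zero, as required of a mother function (a sketch; the grid limits are arbitrary):

```python
import numpy as np

def mexican_hat(x, sigma=1.0):
    """Mexican hat of eq. (11.10): second derivative of a Gaussian."""
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma**3)
    return norm * np.exp(-x**2 / (2.0 * sigma**2)) * (1.0 - x**2 / sigma**2)

x = np.linspace(-20.0, 20.0, 200001)
integral = np.sum(mexican_hat(x)) * (x[1] - x[0])   # should vanish
```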
The mother function ψ has to fulfil, apart from the trivial normalization
property (11.9), also the relation

∫ ψ(x) dx = 0 .

Any square integrable function f(x) fulfilling ∫ f(x) dx = 0 can be expanded
in the discrete wavelet series,

f(x) = Σj,k cjk Wjk(x) .

As usual, in order to regularize the function f (x), the expansion is trun-


cated when the coefficients become insignificant with increasing j, corre-
sponding to small details or large frequencies.

5
The Haar wavelets are real, but some types of wavelets are complex.
Fig. 11.2. Mexican hat wavelet.

The calculation of the coefficients cjk is in principle analogous to the


calculation of Fourier coefficients by convolution of f with the wavelets6 .
For given measurements a least square fit can be applied. The success of
the wavelet applications was promoted by the appearance of fast numerical
algorithms, like the multi-scale analysis. They work on a function f which
need not integrate to zero, sampled at equidistant points, similarly to the
fast Fourier transform (FFT).
An elementary introduction to the wavelet analysis is found in [99]. Programs
are available in program libraries and on the internet.
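To illustrate the idea behind such fast multi-scale algorithms, here is a sketch of a single analysis step of the discrete Haar transform and its exact inverse (not the library routines themselves):

```python
import numpy as np

def haar_analyze(f):
    """One multi-scale step: pairwise averages (smooth) and differences (detail)."""
    a = (f[0::2] + f[1::2]) / np.sqrt(2.0)
    d = (f[0::2] - f[1::2]) / np.sqrt(2.0)
    return a, d

def haar_synthesize(a, d):
    """Invert one analysis step exactly."""
    f = np.empty(2 * len(a))
    f[0::2] = (a + d) / np.sqrt(2.0)
    f[1::2] = (a - d) / np.sqrt(2.0)
    return f
```

Repeating the analysis step on the averages yields the coarser scales; dropping small detail coefficients provides the smoothing and data reduction mentioned above.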

11.2.4 Spline Approximation

The mathematical and numerical treatment of polynomials is especially sim-


ple and effective. Therefore, they are often chosen for the approximation of
experimental data. A disadvantage of polynomials is however that they tend
to infinity for large absolute values of the independent variable. This diffi-
culty is resolved by using piecewise polynomial functions, the splines. The
independent variable space is divided into equal length intervals limited by
so-called knots.
According to the degree of the polynomials used, we distinguish between
linear, quadratic, cubic etc. splines.
The simplest spline approximation is the linear one, consisting of a poly-
gon. The steps in the independent variable x between the knots are constant
(Fig. 11.3). The lower the chosen number of knots and the spline order are,
the larger will be on average the deviations of the points from the fitted curve.
6
The wavelets (11.10), (11.11) are not orthogonal. Thus the coefficients are
correlated.
Fig. 11.3. Linear spline approximation.

Fig. 11.4. Linear, quadratic and cubic B-splines.

A sensible choice should take into account the mean squared dispersion of
the points, i.e. the χ2 -sum should be of the order of the number of degrees of
freedom. When the response values y are exact and equidistant, the points
are simply connected by a polygon.

A smoother approximation with no kinks is obtained with quadratic


splines. A curve with continuous derivatives up to order n is produced with
splines of degree ≥ n + 1. Since a curve with continuous second derivatives
looks smooth to the human eye, splines of degree higher than cubic are rarely
used.
Spline approximations are widely used in technical disciplines. They have
also been successfully applied to the deconvolution problem [12, 69] (Chap.
9). Instead of adapting a histogram to the true distribution, the amplitudes
of spline functions can be fitted. This has the advantage that we obtain a
continuous function which incorporates the desired degree of regularization.
For the numerical computations the so-called B-splines (basis splines)
are especially useful. Linear, quadratic and cubic B-splines are shown in
Fig. 11.4. The superposition of B-splines fulfils the continuity conditions at
the knots. The superposition of the triangular linear B-splines produces a
polygon, that of quadratic and cubic B-splines a curve with continuous slope
and curvature, respectively.
A B-spline of given degree is determined by the step width b and the
position x0 of its center. Their explicit mathematical expressions are given
in Appendix 13.15.
The function is approximated by
f̂(x) = Σk=0..K ak Bk(x) .   (11.12)

The amplitudes ak can be obtained from a least squares fit. For values of the
response function yi and errors δyi at the input points xi , i = 1, . . . , N , we
minimize

χ² = Σi=1..N [ yi − Σk=0..K ak Bk(xi) ]² / (δyi)² .   (11.13)
Of course, the number N of input values has to be at least equal to the
number K + 1 of spline amplitudes. Otherwise the number of degrees of freedom
would become negative and the approximation would be under-determined.
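A minimal sketch of such a fit with linear (triangular) B-splines, using toy data of the kind discussed in Sect. 11.2.6; the function names are our own and numpy's least-squares solver stands in for a full χ² minimization:

```python
import numpy as np

def linear_bspline(x, x0, b):
    """Triangular (degree-1) B-spline centered at x0 with step width b."""
    t = 1.0 - np.abs(x - x0) / b
    return np.where(t > 0.0, t, 0.0)

def fit_amplitudes(x, y, dy, centers, b):
    """Weighted least-squares fit of the amplitudes a_k, minimizing (11.13)."""
    B = np.column_stack([linear_bspline(x, c, b) for c in centers])  # N x (K+1)
    w = 1.0 / dy                                   # each row weighted by 1/dy_i
    a, *_ = np.linalg.lstsq(B * w[:, None], y * w, rcond=None)
    return a, B

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 41)                      # equidistant input points
dy = np.full(x.size, 0.05)                         # measurement errors
y = x * np.exp(-x) + rng.normal(0.0, dy)           # simulated measurements

centers = np.linspace(0.0, 4.0, 9)                 # K + 1 = 9 amplitudes
a, B = fit_amplitudes(x, y, dy, centers, centers[1] - centers[0])
chi2 = np.sum(((y - B @ a) / dy) ** 2)             # of the order of N - (K+1)
```

With a sensible number of amplitudes the resulting χ² is of the order of the number of degrees of freedom, as required above.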

Spline Approximation in Higher Dimensions

In principle, the spline approximation can be generalized to higher dimensions. However, there the difficulty is that a grid of intervals (knots) destroys
the rotation symmetry. It is again advantageous to work with B-splines. Their
definition becomes more complicated: In two dimensions we have instead of
triangular functions pyramids and for quadratic splines also mixed terms
∝ x1 x2 have to be taken into account. In higher dimensions the number of
mixed terms explodes, another example of the curse of dimensionality.

11.2.5 Approximation by a Combination of Simple Functions

There is no general recipe for function approximation. An experienced scientist would try, first of all, to find functions which describe the asymptotic behavior and the rough distribution of the data, and then add further functions
to describe the details. This approach is more tedious than using programs
from libraries but will usually produce results superior to those of the general
methods described above.
Besides polynomials, a_0 + a_1 x + a_2 x² + · · · , rational functions can be
used, i.e. quotients of two polynomials (Padé approximation), the exponential
function e^{αx}, the logarithm b log x, the Gaussian e^{−ax²}, and combinations
like x^a e^{−bx}. In many cases a simple polynomial will do. The results usually
improve when the original data are transformed by translation and dilatation
x → a(x + b) to a normalized form.

11.2.6 Example

In order to compare different methods, we use a set of simulated measurements y_i according to the function xe^{−x} with superimposed Gaussian fluctuations, generated at equidistant values x_i. The measurements are smoothed,
respectively fitted, by different functions. The results are shown in Fig. 11.5.
All eight panels show the original function and the measured points connected
by a polygon.
In the upper two panels smoothing has been performed by weighting.
Typical for both methods are washed-out structures and strong deviations
at the borders. The Gaussian weighting in the left-hand panel performs
better than the nearest-neighbor method on the right-hand side, which also
shows spurious short-range fluctuations typical for this method.
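The Gaussian weighting used here can be sketched in a few lines; this toy version (our own naming, simulated data) computes each smoothed value as a weighted average over all measured points:

```python
import numpy as np

def gaussian_smooth(x, y, s):
    """Smooth measurements (x_i, y_i) by Gaussian weighting with width s:
    each output value is a weighted average over all input points."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / s) ** 2)
    return (w @ y) / w.sum(axis=1)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 4.0, 41)
y = x * np.exp(-x) + rng.normal(0.0, 0.05, size=x.size)
ys = gaussian_smooth(x, y, s=0.2)   # smoother, but biased near the borders
```

The one-sided averaging at the edges of the x range produces exactly the border distortions visible in the figure.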
As expected, also the linear spline approximation is not satisfactory but
the edges are reproduced better than with the weighting methods. Both
quadratic and cubic splines with 10 free parameters describe the measure-
ment points adequately, but the cubic splines show some unwanted oscil-
lations. The structure of the spline intervals is clearly seen. Reducing the
number of free parameters to 5 suppresses the spurious fluctuations, but then
the spline functions can no longer follow the steep rise at small x. There
is only a marginal difference between the quadratic and the cubic spline ap-
proximations.
The approximation by a simple polynomial of fourth order, i.e. with 5 free
parameters, works excellently. By the way, it differs substantially from the
Taylor expansion of the true function. The polynomial can adapt itself much
better to regions of different curvature than the splines with their fixed step
width.

[Fig. 11.5 panels, each with abscissa x from 0 to 4: Gaussian weight (s = 0.2); nearest neighbor (k = 12); linear splines (nparam = 10); parabola (nparam = 5); quadratic splines (nparam = 10 and 5); cubic splines (nparam = 10 and 5).]

Fig. 11.5. Smoothing and function approximation. The measurements are con-
nected by a polygon. The curve corresponding to the original function is dashed.

To summarize: The physicist will usually prefer to construct a clever parametrization with simple analytic functions to describe his data and avoid the more general standard methods available in program libraries.
As we have already mentioned, the approximation of measurements by
the standard set of orthogonal functions works quite well for very smooth
functions where sharp peaks and valleys are absent. Peaks and bumps are
better described with wavelets than with the conventional orthogonal func-
tions. Smoothing results of measurements with the primitive kernel methods
which we have discussed are usually unsatisfactory. A better performance
is obtained with kernels with variable width and corrections for a possible
boundary bias. The reader is referred to the literature [100]. Spline approxi-
mations are useful when the user has no idea about the shape of the function
and when the measurements are able to constrain the function sufficiently to
suppress fake oscillations.

11.3 Linear Factor Analysis and Principal Components


Factor analysis and principal component analysis (PCA) both reduce a multi-
dimensional variate space to lower dimensions. In the literature there is no
clear distinction between the two techniques.
Often several features of an object are correlated or redundant, and we
want to express them by a few uncorrelated components with the hope to
gain deeper insight into latent relations. One would like to reduce the number
of features to as low a number of components, called factors, as possible.
Let us imagine that for 20 cuboids we have determined 6 geometrical and
physical quantities: volume, surface, basis area, sum of edge lengths, mass,
and principal moments of inertia. We submit these data which may be repre-
sented by a 6-dimensional vector to a colleague without further information.
He will look for similarities and correlations, and he might guess that these
data can be derived for each cuboid from only 4 parameters, namely length,
width, height, and density. The search for these basis parameters, the com-
ponents or factors is called factor analysis [101, 102].
A general solution for this problem cannot be given without an ansatz
for the functional relationship between the feature matrix, in our example
built from the 20 six-dimensional data vectors, and its errors. Our example
indicates though, in which direction we might search for a solution of this
problem. Each body is represented by a point in a six-dimensional feature
space. The points are however restricted to a four-dimensional subspace, the
component space. The problem is to find this subspace. This is relatively
simple if it is linear.
In general, and in our example, the subspace is not linear, but a linear
approximation might be justified if the cuboids are very similar such that
the components depend approximately linearly on the deviations of the in-
put vectors from a center of gravity. Certainly in the general situation it is

reasonable to look first for a linear relationship between features and pa-
rameters. Then the subspace is a linear vector space and easy to identify.
In the special situation where only one component exists, all points lie ap-
proximately on a straight line, deviations being due to measurement errors
and non-linearity. To identify the multi-dimensional plane, we have to inves-
tigate the correlation matrix. Its transformation into diagonal form delivers
the principal components – linear combinations of the feature vectors in the
direction of the principal axes. The principal components are the eigenvectors
of the correlation matrix ordered according to decreasing eigenvalues. When
we ignore the principal components with small eigenvalues, the remaining
components form the planar subspace.
Factor analysis or PCA has been developed in psychology, but it is widely
used also in other descriptive fields, and there are numerous applications in
chemistry and biology. Its moderate computing requirements which are at
the expense of the restriction to linear relations, are certainly one of the
historical reasons for its popularity. We sketch it below, because it is still in
use, and because it helps to get a quick idea of hidden structures in multi-
dimensional data. When no dominant components are found, it may help to
disprove expected relations between different observations.
A typical application is the search for factors explaining similar properties
between different objects: Different chemical compounds may act similarly,
e.g. decrease the surface tension of water. The compounds may differ in var-
ious features, as molecular size and weight, electrical dipole moment, and
others. We want to know which parameter or combination of parameters is
relevant for the interesting property. Another application is the search for
decisive factors for a similar curing effect of different drugs. The knowledge
of the principal factors helps to find new drugs with the same positive effect.
In physics factor analysis does not play a central role, mainly because its
results are often difficult to interpret and, as we will see below, not unam-
biguous. It is not easy, therefore, to find examples from our discipline. Here
we illustrate the method with an artificially constructed example taken from
astronomy.

Example 154. Principal component analysis


Galaxies show the well known red-shift of the spectrum which is due to
their recession velocity. Besides the measurement value or feature red-shift x1
we know the brightness x2 of the galaxies. To be independent of scales and
mean values, we transform these quantities in such a way that sample mean
and variance are zero, respectively unity. To demonstrate the concept, we
have invented some data which are shown in Fig. 11.6. The two coordinates
are strongly correlated. The correlation is eliminated in a rotated coordinate
system where the objects have coordinates y1 and y2 which are linear com-
binations of red-shift and brightness in the directions of the principal axes of

Fig. 11.6. Scatter diagram of two attributes of 11 measured objects.

the correlation matrix. Now we consider as important those directions where the observed objects show the largest differences. In our case this is the direction of y1, while y2 has apparently only a minor influence on both features.
We may conclude that red-shift and brightness have mainly one and the same
cause which determines the value of y1 . In our example, we know that this is
the distance, both brightness and red shift depend on it. Since, apparently,
the distance determines y1 , we can use it, after a suitable calibration, as a
measure for the distance.

We will now put these ideas into concrete terms.


The input data for the factor analysis are given in the form of a matrix
X of N rows and P columns. The element xnp is the measured value of the
feature p of the object n, thus X is a rectangular matrix. In a first step
we determine the correlations between the P input attributes. By a simple
transformation, we obtain uncorrelated linear combinations of the features.
The hope is that there are few dominant combinations and that the others
can be neglected. Then the data can be described by a small number of Q < P
linear combinations, the principal components.
We first transform the data Xnp into standardized form where the sam-
ple mean and variance are zero, respectively unity. We get the normalized
variables^7

x_np = (X_np − X̄_p)/δ_p

with

X̄_p = (1/N) Σ_{n=1}^{N} X_np ,

δ_p² = (1/(N − 1)) Σ_{n=1}^{N} (X_np − X̄_p)² .

^7 The normalization (division by δ_p) is not always required.

The quantity xnp is the normalized deviation of the measurement value of


type p for the object n from the average over all objects for this measurement.
In the same way as in Chap. 4 we construct the correlation matrix for our
sample by averaging the P × P products of the components xn1 . . . xnP over
all N objects:

R = X^T X/(N − 1) ,

R_pq = (1/(N − 1)) Σ_{n=1}^{N} x_np x_nq .

It is a symmetric positive definite P × P matrix. Due to the normalization the diagonal elements are equal to unity.
Then this matrix is brought into diagonal form by an orthogonal trans-
formation corresponding to a rotation in the P -dimensional feature space.

R → VT RV = diag(λ1 . . . λP ) .

The uncorrelated feature vectors in the rotated space y_n = {y_n1, . . . , y_nP} are given by

y_n = V^T x_n ,  x_n = V y_n .
To obtain eigenvalues and -vectors we solve the linear equation system

(R − λp I)v p = 0 , (11.14)

where λp is the eigenvalue belonging to the eigenvector v p :

Rv p = λp v p .

The P eigenvalues are found as the solutions of the characteristic equation

det(R − λI) = 0 .

In the simple case described above of only two features, this is a quadratic
equation

| R11 − λ    R12     |
| R21        R22 − λ | = 0 ,
that fixes the two eigenvalues. The eigenvectors are calculated from (11.14) after substituting the respective eigenvalue. As they are fixed only up to an arbitrary factor, they are usually normalized. The rotation matrix V is constructed by taking the eigenvectors v_p as its columns: v_qp = (v_p)_q.
Since the eigenvalues are the diagonal elements in the rotated, diagonal
correlation matrix, they correspond to the variances of the data distribution
with respect to the principal axes. A small eigenvalue means that the pro-
jection of the data on this axis has a narrow distribution. The respective
component is then, presumably, only of small influence on the data, and may
perhaps be ignored in a model of the data. Large eigenvalues belong to the
important principal components.
Factors f_np are obtained by standardization of the transformed variables
y_np by division by the square root of the eigenvalues λ_p:

f_np = y_np /√λ_p .

By construction, these factors represent variates with zero mean and unit
variance. In most cases they are assumed to be normally distributed. Their
relation to the original data xnp is given by a linear (not orthogonal) trans-
formation with a matrix A, the elements of which are called factor loadings.
Its definition is
xn = Af n , or XT = A FT . (11.15)
Its components show, how strongly the input data are influenced by certain
factors.
In the classical factor analysis, the idea is to reduce the number of factors
such that the description of the data is still satisfactory within tolerable
deviations ε:

x_1 = a_11 f_1 + · · · + a_1Q f_Q + ε_1
x_2 = a_21 f_1 + · · · + a_2Q f_Q + ε_2
...
x_P = a_P1 f_1 + · · · + a_PQ f_Q + ε_P

with Q < P , where the “factors” (latent random variables) f1 , . . . , fQ are


considered as uncorrelated and distributed according to N (0, 1), plus uncor-
related zero-mean Gaussian variables εp , with variances σp2 , representing the
residual statistical fluctuations not described by the linear combinations. As
a first guess, Q is taken as the index of the smallest eigenvalue λQ which is
considered to be still significant. In the ideal case Q = 1 only one decisive
factor would be the dominant element able to describe the data.
Generally, the aim is to estimate the loadings apq , the eigenvalues λp , and
the variances σp2 from the sampling data, in order to reduce the number of
relevant quantities responsible for their description.

The same results as we have found above by the traditional method of solving the eigenvalue problem for the correlation matrix^8 can be obtained
directly by using the singular value decomposition (SVD) of the matrix X
(remember that it has N rows and P columns):

X = U D VT ,

where U and V are orthogonal. U is not a square matrix, nevertheless U^T U = I, where the unit matrix I has dimension P. D is a diagonal matrix with elements √λ_p, ordered according to decreasing values, and called here singular values.
The decomposition (11.15) is obtained by setting F = U and A = V D.
The decomposition (11.15) is not unique: If we multiply both F and A
with a rotation matrix R from the right we get an equivalent decomposition:

X = F̃ ÃT = F R(A R)T = F R RT AT = U D VT , (11.16)

which is the same as (11.15), with factors and loadings being rotated.
There exist program packages which perform the numerical calculation of
principal components and factors.
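The whole chain described above — standardization, correlation matrix, eigen-decomposition, principal components, and factors — fits in a few lines of numpy. The data here are an invented toy sample with one dominant latent component, not from the text:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 200, 3
hidden = rng.normal(size=N)                        # one latent component
X = np.column_stack([2.0 * hidden, -hidden, 0.5 * hidden])
X += 0.1 * rng.normal(size=(N, P))                 # residual fluctuations

# standardize: zero sample mean, unit variance per feature
x = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# correlation matrix R = x^T x / (N-1) and its eigen-decomposition
R = x.T @ x / (N - 1)
lam, V = np.linalg.eigh(R)                         # ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]                     # reorder: decreasing

y = x @ V                                          # principal components
f = y / np.sqrt(lam)                               # factors with unit variance

# the same singular values follow from the SVD x = U D V^T, lam_p = d_p^2/(N-1)
U, d, Vt = np.linalg.svd(x, full_matrices=False)
```

For this construction one eigenvalue close to P = 3 dominates, reflecting the single hidden component, while the remaining eigenvalues stay near zero.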
Remarks:
1. The transformation of the correlation matrix to diagonal form makes
sense, as we obtain in this way uncorrelated inputs. The new variables
help to understand better the relations between the various measure-
ments.
2. The silent assumption that the principal components with larger eigenval-
ues are the more important ones is not always convincing, since starting
with uncorrelated measurements, due to the scaling procedure, would
result in eigenvalues which are all identical. An additional difficulty for
interpreting the data comes from the ambiguity (11.16) concerning rota-
tions of factors and loadings.

11.4 Classification
We have come across classification already when we have treated goodness-of-
fit. There the problem was either to accept or to reject a hypothesis without
a clear alternative. Now we consider a situation where we collect events based
upon their features into two or more classes. We assume that we have either
a data set where we know the classification and which is used to train the
classification algorithm or an analytic description of the distributions.
The assignment of an object according to some quality to a class or cat-
egory is described by a so-called categorical variable. For two categories we
^8 Physicists may find the method familiar from the discussion of the inertial momentum tensor and many similar problems.

can label the two possibilities by discrete numbers; usually the values ±1
or 1 and 0 are chosen. In most cases, we replace the strict classification by
weights which indicate the probability that the event should be assigned to
a certain class. The classification into more than two cases can be performed
sequentially by first combining classes such that we have a two class system
and then splitting them further.
Classification is indispensable in data analysis in many areas. Examples
in particle physics are the identification of particles from shower profiles or
from Cerenkov ring images, beauty, top or Higgs particles from kinematics
and secondaries and the separation of rare interactions from frequent ones. In
astronomy the classification of galaxies and other stellar objects is of interest.
But classification is also a precondition for decisions in many scientific fields
and in everyday life.
We start with an example: A patient suffers from certain symptoms:
stomach-ache, diarrhoea, temperature, head-ache. The doctor has to give
a diagnosis. He will consider further factors, as age, sex, earlier diseases, pos-
sibility of infection, duration of the illness, etc.. The diagnosis is based on the
experience and education of the doctor.
A computer program which is supposed to help the doctor in this matter
should be able to learn from past cases, and to compare new inputs in a
sensible way with the stored data. Of course, as opposed to most problems
in science, it is not possible here to provide a functional, parametric relation.
Hence there is a need for suitable methods which interpolate or extrapolate
in the space of the input variables. If these quantities cannot be ordered,
e.g. sex, color, shape, they have to be classified. In a broad sense, all these
problems may be considered as variants of function approximation.
The most important methods for this kind of problems are the discrimi-
nant analysis, artificial neural nets, kernel or weighting methods, and decision
trees. In recent years, remarkable progress in these fields has been achieved
with the development of support vector machines, boosted decision trees, and
random forest classifiers.
Before discussing these methods in more detail let us consider a further
example: The interactions of electrons and hadrons in calorimeter detectors
of particle physics differ in many parameters. Calorimeters consist of a large
number of detector elements, for which the signal heights are evaluated and
recorded. The system should learn from a training sample obtained from test
measurements with known particle beams to classify electrons and hadrons
with minimal error rates.
An optimal classification is possible if the likelihood ratio is available
which then is used as a cut variable. The goal of intelligent classification
methods is to approximate the likelihood ratio or an equivalent variable which
is a unique function of the likelihood ratio. The relation itself need not be
known.


Fig. 11.7. Fraction of wrongly assigned events as a function of the efficiency.

When we optimize a given method, it is not only the percentage of right


decisions which is of interest, but we will also consider the consequences of
the various kinds of errors. It is less serious if a SPAM has not been detected,
as compared to the loss of an important message. In statistics this is taken
into account by a loss function which has to be defined by the user. In the
standard situation where we want to select a certain class of events, we have
to consider efficiency and contamination 9 . The larger the efficiency, the larger
is also the relative contamination by wrongly assigned events. A typical curve
showing this relation is plotted in Fig. 11.7. The user will select the value of
his cut variable on the basis of this curve.
The loss has to be evaluated from the training sample. It is recommended
to use a part of the training sample to develop the method and to reserve a
certain fraction to validate it. As statistics is nearly always too small, also
more economic methods of validation, cross validation and bootstrap ( see
Sect.12.2), have been developed which permit to use the full sample to adjust
the parameters of the method. In an n-fold cross validation the whole sample
is randomly divided into n equal parts of N/n events each. In turn one of
the parts is used for the validation of the training result from the other n − 1
parts. All n validation results are then averaged. Typical choices are n equal
to 5 or 10.
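The n-fold scheme just described can be sketched as follows; the constant predictor and squared-error loss are placeholder choices of our own, standing in for a real training procedure and loss function:

```python
import numpy as np

def cross_validate(x, y, train_fn, loss_fn, n_folds=5, seed=None):
    """n-fold cross validation: each of the n parts is validated with the
    model trained on the other n-1 parts; the n results are averaged."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)  # random division
    losses = []
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = train_fn(x[train], y[train])
        losses.append(loss_fn(model, x[folds[k]], y[folds[k]]))
    return float(np.mean(losses))

# toy usage: a constant predictor with squared-error loss (hypothetical)
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 100)
y = 2.0 + 0.1 * rng.normal(size=100)
cv_loss = cross_validate(x, y,
                         train_fn=lambda xs, ys: ys.mean(),
                         loss_fn=lambda m, xs, ys: np.mean((ys - m) ** 2),
                         n_folds=5, seed=0)
```

Every event is used both for training and, exactly once, for validation, which is the point of the method when statistics is scarce.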

^9 One minus the contamination is called purity.

11.4.1 The Discriminant Analysis

The classical discriminant analysis as developed by Fisher is a special case of the classification method that we introduce in the following. We follow our discussions of Chap. 6, Sect. 6.3.
If we know the p.d.f.s f1 (x) and f2 (x) for two classes of events it is easy
to assign an observation x to one of the two classes in such a way that the
error rate is minimal (case 1):

x → class 1, if f1 (x) > f2 (x) ,
x → class 2, if f1 (x) < f2 (x) .

Normally we will get a different number of wrong assignments for the two
classes: observations originating from the broader distribution will be misassigned more often (see Fig. 11.8) than those of the narrower distribution.
In most cases it will matter whether an input from class 1 or from class
2 is wrongly assigned. An optimal classification is then reached using an
appropriately adjusted likelihood ratio:

x → class 1, if f1 (x)/f2 (x) > c ,
x → class 2, if f1 (x)/f2 (x) < c .

If we want to have the same error rates (case 2), we must choose the
constant c such that the integrals over the densities in the selected regions
are equal:

∫_{f1/f2 > c} f1 (x) dx = ∫_{f1/f2 < c} f2 (x) dx . (11.17)

This assignment has again a minimal error rate, but now under the constraint
(11.17). We illustrate the two possibilities in Fig. 11.8 for univariate functions.
For normal distributions we can formulate the condition for the classifi-
cation explicitly: For case 2 we choose that class for which the observation x
has the smallest distance to the mean measured in standard deviations. This
condition can then be written as a function of the exponents. With the usual
notations we get

(x − µ1 )T V1 (x − µ1 ) − (x − µ2 )T V2 (x − µ2 ) < 0 → class 1 ,
(x − µ1 )T V1 (x − µ1 ) − (x − µ2 )T V2 (x − µ2 ) > 0 → class 2 .

This condition can easily be generalized to more than two classes; the
assignment according to the standardized distances will then, however, no
longer lead to equal error rates for all classes.
The classical discriminant analysis sets V1 = V2 . The left-hand side in
the above relations becomes a linear combination of the xp . The quadratic
terms cancel. Equating it to zero defines a hyperplane which separates the
two classes. The sign of this linear combination thus determines the class


Fig. 11.8. Separation of two classes. The dashed line separates the events such that
the error rate is minimal, the dotted line such that the wrongly assigned events are
the same in both classes.

membership. Note that the separating hyperplane is cutting the line connect-
ing the distribution centers under a right angle only for spherical symmetric
distributions.
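The Gaussian assignment rule above translates directly into code; this is a sketch with invented means and covariance matrices (the function name is ours):

```python
import numpy as np

def classify(x, mu1, C1, mu2, C2):
    """Assign x to the class whose mean is closest in standard deviations,
    i.e. the condition above with V = C^-1 the inverse covariance matrix."""
    d1 = (x - mu1) @ np.linalg.inv(C1) @ (x - mu1)
    d2 = (x - mu2) @ np.linalg.inv(C2) @ (x - mu2)
    return 1 if d1 < d2 else 2

# hypothetical two-dimensional example: a narrow and a broad class
mu1, C1 = np.array([0.0, 0.0]), np.eye(2)
mu2, C2 = np.array([4.0, 0.0]), 4.0 * np.eye(2)
cls = classify(np.array([1.0, 0.0]), mu1, C1, mu2, C2)
```

With C1 = C2 the quadratic terms cancel and the decision boundary becomes the separating hyperplane of the classical discriminant analysis.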
If the distributions are only known empirically from representative sam-
ples, we approximate them by continuous distributions, usually by a normal
distribution, and fix their parameters to reproduce the empirical moments.
In situations where the empirical distributions strongly overlap, for instance
when a narrow distribution is located at the center of a broad one, the simple discriminant analysis no longer works. The classification methods
introduced in the following sections have been developed for this and other
more complicated situations and where the only source of information on
the population is a training sample. The various approaches are all based on
the continuity assumption that observations with similar attributes lead to
similar outputs.

11.4.2 Artificial Neural Networks

Introduction

The application of artificial neural networks, ANN, has seen a remarkable boom in the last decades, parallel to the exploding computing capacities.
From its many variants, in science the most popular are the relatively simple
forward ANNs with back-propagation, to which we will restrict our discus-
sion. The interested reader should consult the broad specialized literature on
this subject, where fascinating self organizing networks are described which

certainly will play a role also in science in the more distant future. It could
e.g. be imagined that a self-organizing ANN would be able to classify a data
set of events produced at an accelerator without human intervention and thus
would be able to discover new reactions and particles.
The species considered here has a comparably more modest aim: The
network is trained in a first step to ascribe a certain output (response) to the
inputs. In this supervised learning scheme, the response is compared with the
target response, and then the network parameters are modified to improve
the agreement. After a training phase the network is able to classify new
data.
ANNs are used in many fields for a broad variety of problems. Examples
are pattern recognition, e.g. for hand-written letters or figures, or the forecast
of stock prices. They are successful in situations where the relations between
many parameters are too complex for an analytical treatment. In particle
physics they have been used, among other applications, to distinguish electron
from hadron cascades and to identify reactions with heavy quarks.
With ANNs, many independent computing steps have to be performed.
Therefore specialized computers have been developed which are able to eval-
uate the required functions very fast by parallel processing.
Primarily, the net approximates an algebraic function which transforms
the input vector x into the response vector y,

y = f (x|w) .

Here w symbolizes a large set of parameters, typically, depending on the
application, 10³ to 10⁴ parameters. The training process corresponds to a
fitting problem. The parameters are adjusted such that the response agrees
within the uncertainties with a target vector y t which is known for the events
of the training sample.
There are two different applications of neural nets, simple function approximation^10 and classification. The net could be trained, for instance, to
estimate the energy of a hadronic shower from the energies deposited in
different cells of the detector. The net could also be trained to separate elec-
tron from hadron showers. Then it should produce a number close to 1 for
electrons and close to 0 for hadrons.
With the large number of parameters it is evident that the solution is
not always unique. Networks with different parameters can perform the same
function within the desired accuracy.
For the fitting of the large number of parameters minimizing programs like
Simplex (see Appendix 13.12) are not suited. The gradient descent method is
much more practicable here. It is able to handle a large number of parameters
and to process the input data sequentially.

^10 Here function approximation is used to perform calculations. In the previous section its purpose was to parametrize data.

Fig. 11.9. Backpropagation. At each knot the sigmoid function of the sum of the
weighted inputs is computed.

A simple but more detailed introduction to the field of ANNs than presented here is given in [103].

Network Structure

Our network consists of two layers of knots (neurons), see Fig. 11.9. Each
component xk of the n-component input vector x is transmitted to all knots,
labeled i = 1, . . . , m, of the first layer. Each individual data line k → i is
ascribed a weight W_ik^(1). In each unit the weighted sum u_i = Σ_k W_ik^(1) x_k
of the data components connected to it is calculated. Each knot symbolizes
a non-linear so-called activation function x′i = s(ui ), which is identical for
all units. The first layer produces a new data vector x′ . The second layer,
with m′ knots, acts analogously on the outputs of the first one. We call the
corresponding m × m′ weight matrix W(2) . It produces the output vector y.
The first layer is called hidden layer, since its output is not observed directly.
In principle, additional hidden layers could be implemented but experience
shows that this does not improve the performance of the net.
The net executes the following functions:

x′_j = s( Σ_k W_jk^(1) x_k ) ,

y_i = s( Σ_j W_ij^(2) x′_j ) .

Fig. 11.10. Sigmoid function.

This leads to the final result:

y_i = s( Σ_j W_ij^(2) s( Σ_k W_jk^(1) x_k ) ) . (11.18)

Sometimes it is appropriate to shift the input of each unit in the first layer
by a constant amount (bias). This is easily realized by specifying an artificial
additional input component x0 ≡ 1.
The number of weights (the parameters to be fitted) is, when we include
the component x0 , (n + 1) × m + mm′ .
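A forward run of this two-layer net, eq. (11.18), including the bias component x0 = 1, can be sketched as follows (random toy weights, our own naming):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (np.exp(-u) + 1.0)

def forward(x, W1, W2):
    """Forward run of the two-layer net, eq. (11.18), with bias input x0 = 1."""
    x = np.append(1.0, x)        # artificial additional component x0 = 1
    xp = sigmoid(W1 @ x)         # hidden layer output x'
    return sigmoid(W2 @ xp)      # response y

rng = np.random.default_rng(4)
n, m, mp = 3, 4, 2                            # inputs, hidden knots, outputs
W1 = rng.normal(scale=0.5, size=(m, n + 1))   # (n+1) * m weights
W2 = rng.normal(scale=0.5, size=(mp, m))      # m * m' weights
y = forward(rng.normal(size=n), W1, W2)       # m' responses, each in (0, 1)
```

The weight count matches the formula above: (n + 1) × m + m m′ parameters.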

Activation Function
The activation function s(x) has to be non-linear, in order to achieve that
the superposition (11.18) is able to approximate widely arbitrary functions.
It is plausible that it should be more sensitive to variations of the arguments
near zero than for very large absolute values. The input bias x0 helps to shift
input parameters which have a large mean value into the sensitive region. The
activation function is usually standardized to vary between zero and one. The
most popular activation function is the sigmoid function

s(u) = 1/(e^{−u} + 1) ,
which is similar to the Fermi function. It is shown in Fig. 11.10.

The Training Process


In the training phase the weights will be adapted after each new input object.
Each time the output vector of the network y is compared with the target

vector y t . We define again the loss function E:

E = (y − y t )2 , (11.19)

which measures for each training object the deviation of the response from
the expected one.
To reduce the error E we walk backward in the weight space. This means that we change each weight component by ∆W, proportional to the sensitivity ∂E/∂W of E to changes of W:

∆W = −(1/2) α ∂E/∂W = −α (y − y_t) · ∂y/∂W .
The proportionality constant α, the learning rate, determines the step width.
We now have to find the derivatives. Let us start with s:

ds/du = e^(−u)/(e^(−u) + 1)² = s (1 − s) . (11.20)

From (11.18) and (11.20) we compute the derivatives with respect to the weight components of the second and the first layer,

∂y_i/∂W^(2)_ij = y_i (1 − y_i) x′_j ,

and

∂y_i/∂W^(1)_jk = y_i (1 − y_i) W^(2)_ij x′_j (1 − x′_j) x_k .
It is seen that the derivatives depend on the same quantities which have
already been calculated for the determination of y (the forward run through
the net). Now we run backwards, change first the matrix W(2) and then with
already computed quantities also W(1) . This is the reason why this process
is called back propagation. The weights are changed in the following way:
W^(1)_jk → W^(1)_jk − α Σ_i (y_i − y_t,i) y_i (1 − y_i) W^(2)_ij x′_j (1 − x′_j) x_k ,

W^(2)_ij → W^(2)_ij − α (y_i − y_t,i) y_i (1 − y_i) x′_j .
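As an illustration of the forward run and the back-propagation updates above, here is a minimal sketch in plain Python (one hidden layer with bias input x0 = 1, sigmoid units, quadratic loss). The class name and training details are invented for illustration; the factor 2 of the gradient is absorbed into α as in the text.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (math.exp(-u) + 1.0)

class SimpleNet:
    """One hidden layer, sigmoid units, quadratic loss; back propagation."""

    def __init__(self, n_in, m_hidden, n_out, seed=1):
        rnd = random.Random(seed)
        # W1[j][k]: weight from input component k (k = 0 is the bias x0 = 1)
        # to hidden unit j; W2[i][j]: weight from hidden unit j to output i
        self.W1 = [[rnd.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(m_hidden)]
        self.W2 = [[rnd.uniform(-1, 1) for _ in range(m_hidden)]
                   for _ in range(n_out)]

    def forward(self, x):
        self.xb = [1.0] + list(x)                 # bias component x0 = 1
        self.xp = [sigmoid(sum(w * v for w, v in zip(row, self.xb)))
                   for row in self.W1]            # hidden layer output x'
        self.y = [sigmoid(sum(w * v for w, v in zip(row, self.xp)))
                  for row in self.W2]
        return self.y

    def train_step(self, x, yt, alpha):
        y = self.forward(x)
        # delta_i = (y_i - y_t,i) y_i (1 - y_i); factor 2 absorbed into alpha
        delta = [(yi - ti) * yi * (1 - yi) for yi, ti in zip(y, yt)]
        # first-layer update, using the not-yet-updated second-layer weights
        for j, xpj in enumerate(self.xp):
            back = sum(d * self.W2[i][j] for i, d in enumerate(delta)) * xpj * (1 - xpj)
            for k, xk in enumerate(self.xb):
                self.W1[j][k] -= alpha * back * xk
        # second-layer update
        for i, d in enumerate(delta):
            for j, xpj in enumerate(self.xp):
                self.W2[i][j] -= alpha * d * xpj
```

With a handful of hidden units this sketch learns smooth one-dimensional functions after a few thousand passes through the data; it is meant to expose the formulas, not to compete with optimized packages.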

Testing and Interpreting

The gradient-descent minimum search has not necessarily reached the minimum after processing the training sample a single time, especially when the available sample is small. In that case the training sample should be used several times (e.g. 10 or 100 times). On the other hand, for too small a training sample it may happen that the net performs correctly for the training data but produces wrong results for new data. The network has, so to speak, learned the training data by heart. Similar to other minimizing concepts, the net interpolates and extrapolates the training data. When the number of fitted parameters (here the weights) becomes of the same order as the number of constraints from the training data, the net will occasionally, after sufficient training time, describe the training data exactly but fail for new input data. This effect is called overfitting and is common to all fitting schemes in which too many parameters are adjusted.
It is therefore indispensable to validate the network function after the
optimization, with data not used in the training phase or to perform a cross
validation. If in the training phase simulated data are used, it is easy to
generate new data for testing. If only experimental data are available with
no possibility to enlarge the sample size, usually a certain fraction of the data
is reserved for testing. If the validation result is not satisfactory, one should
try to solve the problem by reducing the number of network parameters or
the number of repetitions of the training runs with the same data set.
The neural network generates from the input data the response through
the fairly complicated function (11.18). It is impossible by an internal analysis
of this function to gain some understanding of the relation between input
and resulting response. Nevertheless, it is not necessary to regard the ANN
as a “black box”. We have the possibility to display graphically correlations
between input quantities and the result, and all functional relations. In this
way we gain some insight into possible connections. If, for instance, a physicist
would have the idea to train a net with an experimental data sample to
predict for a certain gas the volume from the pressure and the temperature,
he would be able to reproduce, with a certain accuracy, the results of the
van-der-Waals equation. He could display the relations between the three
quantities graphically. Of course the analytic form of the equation and its
interpretation cannot be delivered by the network.
Often a study of the optimized weights makes it possible to simplify the
net. Very small weights can be set to zero, i.e. the corresponding connections
between knots are cut. We can check whether switching off certain neurons
has a sizable influence on the response. If this is not the case, these neurons
can be eliminated. Of course, the modified network has to be trained again.

Practical Hints for the Application

Computer programs for ANNs with back-propagation are relatively simple and available in many places, but the effort to write an ANN program is also not very large. The number of input vector components n and the numbers of knots m and m′ are parameters to be chosen by the user; thus the program is universal, and only the loss function has to be adapted to the specific problem.

– The number of units in each layer should more or less match the number
of input components. Some experts plead for a higher number. The user
should try to find the optimal number.
– The sigmoid function has values only between zero and unity. Therefore
the output or the target value has to be appropriately scaled by the user.
– The raw input components are usually correlated. The net is more efficient
if the user orthogonalizes them. Then often some of the new components
have negligible effect on the output and can be discarded.
– The weights have to be initialized at the beginning of the training phase.
This can be done by a random number generator or they can be set to
fixed values.
– The loss function E (11.19) has to be adjusted to the problem to be solved.
– The learning rate α should be chosen relatively high at the beginning of
a training phase, e.g. α = 10. In the course of fitting it should be reduced
to avoid oscillations.
– The convergence of the minimizing process is slow if the gradient is small. If
this is the case, and the fit is still bad, it is recommended to increase the
learning constant for a certain number of iterations.
– In order to check whether a minimum is only local, one should train the
net with different start values of the weights.
– Other possibilities for the improvement of the convergence and the elim-
ination of local minima can be found in the substantial literature. An
ANN program package that proceeds automatically along many of the
proposed steps is described in [104].

Example: Čerenkov circles

Charged, relativistic particles can emit photons by the Čerenkov effect. The
photons hit a detector plane at points located on a circle. Of interest are
radius and center of this circle, since they provide information on direction
and velocity of the emitting particle. The number of photons and the coor-
dinates where they hit the detector fluctuate statistically and are disturbed
by spurious noise signals. It has turned out that ANNs can reconstruct the
parameters of interest from the available coordinates with good efficiency and
accuracy.
We study this problem by a Monte Carlo simulation. In a simplified model,
we assume that exactly 5 photons are emitted by a particle and that the
coordinate pairs are located on a circle and registered. The center, the radii,
and the hit coordinates are generated stochastically. The input vector of the
net thus consists of 10 components, the 5 coordinate pairs. The output is a
single value, the radius R. The loss function is (R − Rtrue )2 , where the true
value Rtrue is known from the simulation.
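A toy generator for such training events might look as follows. The ranges of the circle center and radius and the smearing width are invented for illustration; only the structure (10 input components, one target value) follows the text.

```python
import math
import random

def cherenkov_event(rnd, n_photons=5, sigma=0.02):
    """Generate one training event: n_photons smeared hits on a circle.
    Returns (input_vector, true_radius). All ranges are illustrative."""
    cx, cy = rnd.uniform(-1.0, 1.0), rnd.uniform(-1.0, 1.0)  # circle center
    radius = rnd.uniform(0.5, 2.0)                           # target value R_true
    inputs = []
    for _ in range(n_photons):
        phi = rnd.uniform(0.0, 2.0 * math.pi)    # hit angle on the circle
        # hit coordinates, smeared by the detector resolution sigma
        inputs.append(cx + radius * math.cos(phi) + rnd.gauss(0.0, sigma))
        inputs.append(cy + radius * math.sin(phi) + rnd.gauss(0.0, sigma))
    return inputs, radius
```

Feeding such 10-component vectors to a net, with the radius as target and the loss (R − Rtrue)², reproduces the setup of the example.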
The relative accuracy of the reconstruction as a function of the iteration step is shown in Fig. 11.11. Different sequences of the learning rate have been tried. Typically, the process proceeds in steps, where a flat phase is followed by a rather abrupt improvement. The number of iterations required to reach the minimum is quite large.

Fig. 11.11. Reconstruction of radii of circles through 5 points by means of an ANN with different sequences of learning constants α.

Hardware Realization

The structure of a back-propagation network can be implemented by a hardware network. The weights are stored locally at the units, which are realized by
rather simple microprocessors. Each microprocessor performs the knot func-
tion, e.g. the sigmoid function. A trained net can then calculate the fitted
function very fast, since all processors are working in parallel. Such processors
can be employed for triggering in experiments where a quick decision is required whether to accept an event and to store the corresponding data.

11.4.3 Weighting Methods

For the decision whether to assign an observation at the location x to a certain class, an obvious option is to do this according to the classification
of neighboring objects of the training sample. One possibility is to consider a
certain region around x and to take a “majority vote” of the training objects

inside this region to decide about the class membership of the input. The
region to be considered here can be chosen in different ways; it can be a fixed
volume around x, or a variable volume defined by requiring that it contains
a fixed number of observations, or an infinite volume, introducing weights for
the training objects which decrease with their distance from x.
In any case we need a metric to define the distance. The choice of a metric
in multi-dimensional applications is often a rather intricate problem, espe-
cially if some of the input components are physically of very different nature.
A way out is to normalize the different quantities to equal variance
and to eliminate global correlations by a linear variable transformation. This
corresponds to the transformation to principal components discussed above
(see Sect. 11.3) with subsequent scaling of the principal components. An al-
ternative but equivalent possibility is to use a direction dependent weighting.
The same result is achieved when we apply the Mahalanobis metric, which
we have introduced in Sect. 10.4.8.
For a large training sample the calculation of all distances is expensive
in computing time. A drastic reduction of the number of distances to be
calculated is in many cases possible by the so-called support vector machines
which we will discuss below. Those are not machines, but programs which
reduce the training sample to a few, but decisive inputs, without impairing
the results.

K-Nearest Neighbors

We choose a number K which of course will depend on the size of the training
sample and the overlap of the classes. For an input x we determine the K
nearest neighbors and the numbers k1 , k2 = K − k1 , of observations that
belong to class I and II, respectively. For a ratio k1 /k2 greater than α, we
assign the new observation to class I, in the opposite case to class II:
k1 /k2 > α =⇒ class I ,
k1 /k2 < α =⇒ class II .
The choice of α depends on the loss function. When the loss function treats
all classes alike, then α will be unity and we get a simple majority vote. To
find the optimal value of K we minimize the average of the loss function
computed for all observations of the training sample.
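The K-nearest-neighbor vote described above can be sketched as follows (Euclidean metric, two classes; all names are illustrative):

```python
import math

def knn_classify(x, sample, K, alpha=1.0):
    """sample: list of (vector, label) pairs with label 'I' or 'II'.
    Returns the class assigned to x by the K-nearest-neighbor rule."""
    # sort the training observations by their distance to x and keep K of them
    neighbors = sorted(sample, key=lambda p: math.dist(x, p[0]))[:K]
    k1 = sum(1 for _, label in neighbors if label == 'I')
    k2 = K - k1
    # ratio test k1/k2 > alpha, written as k1 > alpha*k2 to avoid division by zero
    return 'I' if k1 > alpha * k2 else 'II'
```

For alpha = 1 this is the simple majority vote; an asymmetric loss function would translate into alpha different from unity.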

Distance Dependent Weighting

Instead of treating all training vector inputs x′ within a given region in the
same way, one should attribute a larger weight to those located nearer to the
input x. A sensible choice is again a Gaussian kernel,

K(x, x′) ∼ exp( −(x − x′)²/(2s²) ) .

With this choice we obtain for the class β the weight w_β,

w_β = Σ_i K(x, x_βi) , (11.21)

where xβi are the locations of the training vectors of the class β.
If there are only two classes, writing the training sample as

{x1 , y1 . . . xN , yN }

with the response vector yi = ±1, the classification of a new input x is done
according to the value ±1 of the classifier ŷ(x), given by

ŷ(x) = sign( Σ_{y_i = +1} K(x, x_i) − Σ_{y_i = −1} K(x, x_i) ) = sign( Σ_i y_i K(x, x_i) ) . (11.22)
For a direction dependent density of the training sample, we can use a direction dependent kernel, possibly in the Mahalanobis form mentioned above:

K(x, x′) ∼ exp( −(1/2) (x − x′)ᵀ V (x − x′) ) ,
with the weight matrix V. When we first normalize the sample, this compli-
cation is not necessary. The parameter s of the matrix V, which determines
the width of the kernel function, again is optimized by minimizing the loss
for the training sample.
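A sketch of the kernel classifier (11.22) with the Gaussian kernel above (two classes coded as y = ±1; the function names are illustrative):

```python
import math

def gauss_kernel(x, xp, s):
    """Gaussian kernel, K ~ exp(-(x - x')^2 / (2 s^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2.0 * s * s))

def kernel_classifier(x, sample, s):
    """sample: list of (vector, y) with y = +1 or -1.
    Implements y_hat(x) = sign( sum_i y_i K(x, x_i) ), cf. (11.22)."""
    total = sum(y * gauss_kernel(x, xi, s) for xi, y in sample)
    return 1 if total >= 0 else -1
```

The width s would be optimized by minimizing the loss on the training sample, as described in the text.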

Support Vector Machines

Support vector machines (SVMs) produce results similar to those of ordinary distance-dependent weighting methods, but they require less memory for the storage of learning data and the classification is extremely fast. Therefore, they are especially useful in on-line applications.
The class assignment usually is the same for all elements in large con-
nected regions of the variable x. Very often, in a two case classification, there
are only two regions separated by a hypersurface. For short range kernels it is
obvious then that for the classification of observations, the knowledge of only
those input vectors of the training sample is essential which are located in the
vicinity of the hypersurface. These input vectors are called support vectors
[105]. SVMs are programs which try to determine these vectors, or rather their weights, in an optimal way, setting the weights of all other input vectors to zero.
In the one-dimensional case with non-overlapping classes it is sufficient
to know those inputs of each class which are located nearest to the dividing
limit between the classes. Sums like (11.21) are then running over one element
only. This, of course, makes the calculation extremely fast.

Fig. 11.12. Separation of two classes. Top: learning sample; bottom: wrongly assigned events of a test sample.

In higher dimensional spaces with overlapping classes and for more than
two classes the problem to determine support vectors is of course more compli-
cated. But also in these circumstances the number of relevant training inputs
can be reduced drastically. The success of SVMs is based on the so-called
kernel trick, by which non-linear problems in the input space are treated as
linear problems in some higher-dimensional space by well known optimiza-
tion algorithms. For the corresponding algorithms and proofs we refer to the
literature, e.g. [16, 106, 107]. A short introduction is given in Appendix 13.16.

Example and Discussion

In Fig. 11.12 the top panel shows two overlapping training samples of 500 inputs each. The loss function is the number of wrong assignments, independent of the respective class. Since the distributions are quite similar in both coordinates, we do not change the metric. We use a Gaussian kernel. The optimization of the parameter s by means of the training sample shows
only a small change of the error rate for a change of s by a factor four. The
lower panel displays the result of the classification for a test sample of the
same size (500 inputs per class). Only the wrong assignments are shown.
We see that wrongly assigned training observations occur in two separate, non-overlapping regions which can be separated by a curve or a polygon chain, as indicated in the figure. Obviously all new observations would be assigned to the class corresponding to the region in which they are located. Had we used the k-nearest-neighbors method instead of the distance-dependent weighting, the result would have been almost identical. Contrary to what one might expect, this more primitive method is more expensive in both the programming and the calculation, compared to the weighting with a distance dependent kernel.
Since for the classification only the separation curve between the classes
is required, it must be sufficient to know the class assignment for those train-
ing observations which lie near this curve. They would define the support
vectors of an SVM. Thus the number of inputs needed for the assignment of
new observations would be drastically reduced. However, for a number of
assignments below about 106 the effort to determine support vectors usually
does not pay. The SVMs are useful for large event numbers in applications
where computing time is relevant.

11.4.4 Decision Trees

Simple Trees

We consider the simple case of two-class classification, i.e. the assignment
of inputs to one of two classes I and II, and N observations with P features
x1 , x2 , . . . , xP , which we consider, as before, as the components of an input
vector.
In the first step we consider the first component x11 , x21 , . . . , xN 1 for all
N input vectors of the training sample. We search for a value xc1 which
optimally divides the two classes and obtain a division of the training sample
into two parts A and B. Each of these parts which belong to two different
subspaces, will now be further treated separately. Next we take the subspace
A, look at the feature x2 , and divide it, in the same way as before the full
space, again into two parts. Analogously we treat the subspace B. Now we
can switch to the next feature or return to feature 1 and perform further
splittings. The sequence of divisions leads to smaller and smaller subspaces,
each of them assigned to a certain class. This subdivision process can be
regarded as the development of a decision tree for input vectors for which the
class membership is to be determined. The growing of the tree is stopped by
a pruning rule. The final partitions are called leaves.

Fig. 11.13. Decision tree (bottom) corresponding to the classification shown above (axes: the features x1 and x2).

In Fig. 11.13 we show schematically the subdivision into subspaces and the corresponding decision tree for a training sample of 32 elements with
only two features. The training sample which determines the decisions is
indicated. At the end of the tree (here at the bottom) the decision about the
class membership is taken.
It is not obvious how one should optimize the sequence of partitions and the positions of the cuts, nor under which circumstances the procedure should be stopped.
For the optimization of splits we must again define a loss function, which will depend on the given problem. A simple possibility in the case of two classes is to maximize for each splitting the difference ∆N = Nr − Nf between right and wrong assignments. We used this in our example Fig. 11.13. For the first division this quantity was equal to 20 − 12 = 8. To some extent the position of the splitting hyperplane is still arbitrary; the loss function changes its value only when the plane hits the nearest input. It could, for example, be put at the center between the two nearest points. Often the importance of efficiency and purity is different for the two classes. Then we would choose an asymmetric loss function.

Very popular is the following, slightly more complicated criterion: We define the impurity P_I of class I,

P_I = N_I/(N_I + N_II) , (11.23)
which for optimal classification would be 1 or 0. The quantity

G = PI (1 − PI ) + PII (1 − PII ) (11.24)

the Gini index, should be as small as possible. For each separation of a parent node E with Gini index G_E into two children nodes A and B with Gini indices G_A and G_B, we minimize the sum G_A + G_B.
The difference
D = GE − GA − GB
is taken as stopping or pruning parameter. The quantity D measures the
increase in purity, it is large for a parent node with large G and two children
nodes with small G. When D becomes less than a certain critical value Dc
the branch will not be split further and ends at a leaf. The leaf is assigned to the class which has the majority in it.
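As an illustration, the Gini criterion and the midway placement of the cut can be sketched for a single feature. This is a toy sketch; the function names are invented, and within a node G reduces to 2 P_I (1 − P_I) because P_II = 1 − P_I.

```python
def gini(n1, n2):
    """G = P_I(1-P_I) + P_II(1-P_II) for a node with n1, n2 events of class I, II."""
    n = n1 + n2
    if n == 0:
        return 0.0
    p = n1 / n
    return 2.0 * p * (1.0 - p)  # equals P_I(1-P_I) + P_II(1-P_II)

def best_split(values, labels):
    """Find the cut on one feature minimizing G_A + G_B.
    values: feature values; labels: True for class I, False for class II.
    Returns (G_A + G_B, cut position)."""
    order = sorted(zip(values, labels))
    n1_tot = sum(labels)
    n2_tot = len(labels) - n1_tot
    best = (float('inf'), None)
    n1 = n2 = 0
    for (v, lab), (v_next, _) in zip(order, order[1:]):
        n1 += lab
        n2 += not lab
        cut = 0.5 * (v + v_next)  # place the cut midway between neighboring points
        g = gini(n1, n2) + gini(n1_tot - n1, n2_tot - n2)
        if g < best[0]:
            best = (g, cut)
    return best
```

A tree-growing program would apply this search to every candidate feature of a node and keep the split with the smallest G_A + G_B.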
Besides the Gini index, also other measures for the purity or impurity are used [16]. An interesting quantity is the entropy S = −P_I ln P_I − P_II ln P_II, a well-known measure of disorder, i.e. of impurity.
The purity parameter, e.g. G, is also used to organize the splitting se-
quence. We always choose that input vector component for which the splitting produces the most significant separation.
A further possibility would be to generalize the orthogonal splitting by
allowing also non-orthogonal planes to reach better separations. But in the
standard case all components are treated independently.
Unfortunately, the classification by decision trees is usually not perfect.
The discontinuity at the boundaries and the fixed splitting sequence impair
the accuracy. On the other hand, they are simple, transparent and the cor-
responding computer programs are extremely fast.

Boosted Decision Trees

Boosting [108] is based on a simple idea: By a weighted superposition of many moderately effective classifiers it should be possible to reach a fairly
precise assignment. Instead of only one decision tree, many different trees are
grown. Each time, before the development of a new tree is started, wrongly
assigned training inputs are boosted to higher weights in order to lower their
probability of being wrongly classified in the following tree. The final class
assignment is then done by averaging the decisions from all trees. Obviously,
the computing effort for these boosted decision trees is increased, but the
precision is significantly enhanced. The results of boosted decision trees are

usually as good as those of ANNs. Their algorithm is very well suited for
parallel processing. There are first applications in particle physics [109].
Before the first run, all training inputs have the weight 1. In the following
run each input gets a weight wi , determined by a certain boosting algorithm
(see below) which depends on the particular method. The definition of the
node impurity P for calculating the loss function, see (11.23), (11.24), is
changed accordingly to
P = Σ_I w_i / ( Σ_I w_i + Σ_II w_i ) ,

where the sums Σ_I, Σ_II run over all events in class I or II, respectively.
Again the weights will be boosted and the next run started. Typically M ≈
1000 trees are generated in this way.
If we indicate the decision of a tree m for the input xi by Tm (xi ) = 1 (for
class I) and = −1 (for class II), the final result will be given by the sign of
the weighted sum over the results from all trees
T_M(x_i) = sign( Σ_{m=1}^{M} α_m T_m(x_i) ) .

We proceed in the following way: To the first tree we assign a weight α1 = 1. The weights of the wrongly assigned input vectors are increased. The weight¹¹ α2 of the second tree T2(x) is chosen such that the overall loss from all input vectors of the training sample is minimal for the combination [α1 T1(x) + α2 T2(x)] / [α1 + α2]. We continue in the same way and add further trees. For tree i the weight αi is optimized such that the existing trees are complemented in an optimal way. How this is done depends of course on the loss function.
A well tested recipe for the choice of weights is AdaBoost [108]. The
training algorithm proceeds as follows:
– The i-th input xi gets the weight wi = 1 and the value yi = 1, (= −1), if
it belongs to class I, (II).
– Tm (xi ) = 1 (= −1), if the input ends in a leave belonging to class I (II).
Sm (xi ) = (1 − yi Tm (xi ))/2 = 1 (= 0), if the assignment is wrong (right).
– The fraction of the weighted wrong assignments εm is used to change the
weights for the next iteration:
ε_m = Σ_i w_i S_m(x_i) / Σ_i w_i ,

α_m = ln( (1 − ε_m)/ε_m ) ,

w_i → w_i e^(α_m S_m(x_i)) .
¹¹ We have two kinds of weights: weights of input vectors (w_i) and weights of trees (α_m).

Weights of correctly assigned training inputs thus remain unchanged. For example, for ε_m = 0.1, wrongly assigned inputs will be boosted by a
factor 0.9/0.1 = 9. Note that αm > 0 if ε < 0.5; this is required because
otherwise the replacement Tm (xi ) → −Tm (xi ) would produce a better
decision tree.
– The response for a new input which is to be classified is
T_M(x_i) = sign( Σ_{m=1}^{M} α_m T_m(x_i) ) .

For ε_m = 0.1 the weight of the tree is α_m = ln 9 ≈ 2.20. For certain applications it may be useful to reduce the weight factors α_m somewhat, for instance α_m = 0.5 ln((1 − ε_m)/ε_m) [109].
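One round of the AdaBoost weight update above can be sketched as follows; the classifier `tree` is a placeholder for an already grown decision tree (a sketch of the recipe, not the implementation of [108]):

```python
import math

def adaboost_round(sample, weights, tree):
    """One AdaBoost iteration.
    sample: list of (x, y) with y = +1/-1; tree(x) returns +1/-1.
    Returns (alpha_m, updated weights)."""
    # S_m(x_i) = (1 - y_i T_m(x_i)) / 2 : 1 for a wrong, 0 for a right assignment
    S = [(1 - y * tree(x)) / 2 for x, y in sample]
    # weighted fraction of wrong assignments
    eps = sum(w * s for w, s in zip(weights, S)) / sum(weights)
    alpha = math.log((1 - eps) / eps)
    # boost the weights of the wrongly assigned inputs by exp(alpha)
    new_w = [w * math.exp(alpha * s) for w, s in zip(weights, S)]
    return alpha, new_w
```

For eps = 0.1 this boosts the wrong assignments by a factor 9 and assigns the tree the weight ln 9, as in the numerical example of the text.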

11.4.5 Bagging and Random Forest

Bagging

The concept of bagging was first introduced by Breiman [110]. He has shown
that the performance of unstable classifiers can be improved considerably by
training many classifiers with bootstrap replicates and then using a majority
vote of those: From a training sample containing N input vectors, N vectors
are drawn at random with replacement. Some vectors will be contained sev-
eral times. This bootstrap12 sample is used to train a classifier. Many, 100
or 1000 classifiers are produced in this way. New inputs are run through
all trees and each tree “votes” for a certain classification. The classification
receiving the majority of votes is chosen. In a study of real data [110] a re-
duction of error rates by bagging between 20% and 47% was found. There
the bagging concept had been applied to simple decision trees, however, the
bagging concept is quite general and can be adopted also to other classifiers.
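The bagging recipe, bootstrap replicates plus a majority vote, can be sketched generically; the base learner `fit` is a placeholder supplied by the user (all names here are illustrative):

```python
import random

def bagging(train, fit, n_classifiers=100, seed=1):
    """train: list of (x, y) pairs; fit(sample) returns a classifier x -> label.
    Returns a majority-vote classifier built from bootstrap replicates."""
    rnd = random.Random(seed)
    models = []
    for _ in range(n_classifiers):
        # bootstrap replicate: N draws with replacement from the N training inputs
        boot = [rnd.choice(train) for _ in train]
        models.append(fit(boot))
    def vote(x):
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)  # majority vote
    return vote
```

Any unstable base learner, e.g. a simple decision tree, can be plugged in as `fit`.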

Random Forest

Another new development [111], which includes the bootstrap idea, is the
extension of the decision tree concept to the random forest classifier.
Many trees are generated from bootstrap samples of the training sam-
ple, but now part of the input vector components are suppressed. A tree is
constructed in the following way: First m out of the M components or at-
tributes of the input vectors are selected at random. The tree is grown in a
m-dimensional subspace of the full input vector space. It is not obvious how
m is to be chosen, but the author proposes m ≪ M and says that the results
show little dependence on m. With large m the individual trees are powerful
but strongly correlated. The value of m is the same for all trees.
¹² We will discuss bootstrap methods in the following chapter.

From the N truncated bootstrap vectors, Nb are separated, put into a bag and reserved for testing. A fraction f = Nb/N ≈ 1/3 is proposed. The
remaining ones are used to generate the tree. For each split that attribute
out of the m available attributes is chosen which gives the smallest number
of wrong classifications. Each leave contains only elements of a single class.
There is no pruning.
Following the bagging concept, the classification of new input vectors is
obtained by the majority vote of all trees.
The out-of-the-bag (oob) data are used to estimate the error rate. To this
end, each oob-vector of the k-th sample is run through the k-th tree and
classified. The fraction of wrong classifications from all oob vectors is the
error rate. (For T trees there are in total T × Nb oob vectors.) The oob data
can also be used to optimize the constant m.
The random forest classifier has received quite some interest. The concept
is simple and seems to be similarly powerful as that of other classifiers. It is
especially well suited for large data sets in high dimensions.

11.4.6 Comparison of the Methods

We have discussed various methods for classification. Each of them has its
advantages and its drawbacks. It depends on the special problem, which one
is the most suitable.
The discriminant analysis lends itself to one- or two-dimensional continuous distributions (preferably normal or other unimodal distributions). It is
useful for event selection in simple situations.
Kernel methods are relatively easy to apply. They work well if the division
line between classes is sufficiently smooth and transitions between different
classes are continuous. Categorical variables cannot be treated. The vari-
ant with support vectors reduces computing time and the memory space for
the storage of the training sample. In standard cases with not too exten-
sive statistics one should avoid this additional complication. Kernel methods
can perform event selection in more complicated environments than is possi-
ble with the primitive discriminant analysis. The price for the better performance, however, is a diminished interpretability of the results.
Artificial neural networks are, due to the enormous number of free pa-
rameters, able to solve any problem in an optimal way. They suffer from the
disadvantage that the user usually has to intervene to guide the minimizing
process to a correct minimum. The user has to check and improve the re-
sult by changing the network structure, the learning constant and the start
values of the weights. New program packages are able to take over these tasks partially. ANNs are able to separate classes in very involved situations and to
Decision trees are a very attractive alternative to ANN. One should use
boosted decision trees, random forest or apply bagging though, since those
discriminate much better than simple trees. The advantage of simple trees is

that they are very transparent and that they can be displayed graphically.
Like ANN, decision trees can, with some modifications, also be applied to
categorical variables.
At present, there is a lack of theoretical framework and experimental information on some of the new developments. We would like to know to what extent the different classifiers are equivalent and which classifier should be selected in a given situation. These questions will certainly be answered in the near future.
12 Auxiliary Methods

12.1 Probability Density Estimation


12.1.1 Introduction

In the subsection on function approximation we considered measurements y at fixed locations x, where y deviates from an unknown function due to statistical fluctuations. Now we start from a sample {x1, . . . , xN} that follows an unknown statistical distribution which we want to approximate. We
have to estimate the density fˆ(x) at the location x from the frequency of
observations xi in the vicinity of x. The corresponding technique, probability
density estimation (PDE), is strongly correlated with function approxima-
tion. Both problems are often treated together under the title smoothing
methods. In this section we discuss only non-parametric approaches; a para-
metric method, where parameters are adjusted to approximate Gaussian like
distributions has been described in Sect. 11.2.2. We will essentially present
results and omit the derivations. For details the reader has to consult the
specialized literature.
PDE serves mainly to visualize an empirical frequency distribution. Visu-
alization of data is an important tool of scientific research. It can lead to new
discoveries and often constitutes the basis of experimental decisions. PDE
also helps to classify data and sometimes the density which has been esti-
mated from some ancillary measurement is used in subsequent Monte Carlo
simulations of experiments. However, to solve certain problems like the es-
timation of moments and other characteristic properties of a distribution, it
is preferable to deal directly with the sample instead of performing a PDE.
This path is followed by the bootstrap method which we will discuss in a
subsequent section. When we have some knowledge about the shape of a dis-
tribution, then PDE can improve the precision of the bootstrap estimates.
For instance there may exist good reasons to assume that the distribution
has only one maximum and/or it may be known that the random variable is
restricted to a certain region with known boundary.
The PDE fˆ(x) of the true density f (x) is obtained by a smoothing pro-
cedure applied to the discrete experimental distribution of observations. This
means that some kind of averaging is done, which introduces a bias that is especially large if the distribution f(x) varies strongly in the vicinity of x.

The simplest and most common way to measure the quality of the PDE
is to evaluate the integrated square error (ISE) L2,

L2 = ∫_{−∞}^{∞} [f̂(x) − f(x)]² dx ,

and its expectation value E(L2), the mean integrated square error (MISE)¹.
The mean quadratic difference E( [f̂(x) − f(x)]² ) has two components, according to the usual decomposition:

E( [f̂(x) − f(x)]² ) = var( f̂(x) ) + bias²( f̂(x) ) .

The first term, the variance, caused by statistical fluctuations, decreases with
increasing smoothing and the second term, the bias squared, decreases with
decreasing smoothing. The challenge is to find the optimal balance between
these two contributions.
We will give a short introduction to PDE mainly restricted to one-
dimensional distributions. The generalization of the simpler methods to
multi-dimensional distributions is straightforward but for the more sophis-
ticated ones this is more involved. A rather complete and comprehensive
overview can be found in the books by J.S. Simonoff [112], A.W. Bowman
and A. Azzalini [100], D. W. Scott [113] and W. Härdle et al. [115]. A sum-
mary is presented in an article by D. W. Scott and S. R. Sain [84].

12.1.2 Fixed Interval Methods

Histogram Approximation

The simplest and most popular method of density estimation is histogramming with fixed bin width. For the number νk of N events falling into bin Bk and bin width h the estimated density is

\hat{f}(x) = \frac{\nu_k}{N h} \quad \text{for } x \in B_k .
It is easy to construct, transparent, does not contain hidden parameters which
often are included in other more sophisticated methods and indicates quite
well which distributions are compatible with the data. However it has, as we
have repeatedly stated, the disadvantage of the rather arbitrary binning and
its discontinuity. Fine binning provides a good resolution of structures and
low bias but has to be paid for by large fluctuations. Histograms with wide
bins have the advantage of small statistical errors but are biased. A sensible

1 The estimate f̂(x) is a function of the set {x1, . . . , xN} of random variables and thus also a random variable.
12.1 Probability Density Estimation 401

choice for the bin width h is derived from the requirement that the mean
squared integrated error should be as small as possible. The mean integrated
square error, MISE, for a histogram is

MISE = \frac{1}{Nh} + \frac{1}{12} h^2 \int f'(x)^2\,dx + O\!\left(\frac{h^4}{N}\right) . \qquad (12.1)
The integral ∫ f′(x)² dx = R(f) is called roughness. For a normal density with variance σ² it is R = (4√π σ³)⁻¹. Neglecting the small terms (h → 0) we can derive [84] the optimal bin width h* and the corresponding asymptotic mean integrated square error AMISE:

h^* \approx \left[ \frac{6}{N \int f'(x)^2 dx} \right]^{1/3} \approx 3.5\,\sigma N^{-1/3} , \qquad (12.2)

AMISE \approx \left[ \frac{9 \int f'(x)^2 dx}{16 N^2} \right]^{1/3} \approx 0.43\,N^{-2/3}/\sigma .

The second part of relation (12.2) holds for a Gaussian p.d.f. with variance
σ 2 and is a reasonable approximation for a distribution with typical σ. Even
though the derivative f ′ and the bandwidth2 σ are not precisely known,
they can be estimated from the data. As expected, the optimal bin width is
proportional to the bandwidth, whereas its N −1/3 dependence on the sample
size N is less obvious.
In d dimensions similar relations hold. Of course the N -dependence has
to be modified. For d-dimensional cubical bins the optimal bin width scales
with N −1/(d+2) and the mean square error scales with N −2/(d+2) .
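The histogram estimate together with the bin-width rule of relation (12.2) can be sketched in a few lines of Python. This is our own illustration, not code from the text; the function name and the use of the sample standard deviation as the bandwidth estimate are assumptions:

```python
import math

def histogram_pde(sample, h=None):
    """Histogram density estimate f(x) = nu_k/(N h); if no bin width h is
    given, the Gaussian-reference rule of thumb h* ~ 3.5*sigma*N^(-1/3)
    of relation (12.2) is used, with sigma estimated from the sample."""
    n = len(sample)
    if h is None:
        mean = sum(sample) / n
        sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        h = 3.5 * sigma * n ** (-1.0 / 3.0)
    lo = min(sample)
    nbins = max(1, math.ceil((max(sample) - lo) / h))
    counts = [0] * nbins
    for x in sample:
        counts[min(int((x - lo) / h), nbins - 1)] += 1
    # density nu_k/(N h) per bin; (lo, h) fix the bin edges
    return lo, h, [c / (n * h) for c in counts]
```

By construction the estimate integrates to one; for strongly non-Gaussian distributions the rule of thumb tends to over-smooth, as the text notes.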

Example 155. PDE of a background distribution and signal fit


We analyze a signal sample containing a Gaussian signal N(x|µ, σ) with unknown location and scale parameters µ, σ, plus some unknown background. In addition, a reference sample containing only background is available. From the data taking times and fluxes we know that the background
in the signal sample should nominally be half (r = 0.5) of that in the refer-
ence sample. In Fig. 12.1 we show the two experimental distributions. From
the shape of the experimental background distribution we estimate the slope
y ′ = 0.05 of its p.d.f. h(x), and, using relation 12.2, find a bin width of 2
units. The heights β1 , β2 , β3 , β4 of the 4 equally wide bins of the histogram
distribution are left as free parameters in the fit. Because of the normaliza-
tion, we have β4 = 1 − β1 − β2 − β3 . Further parameters are the expected rate

2 Contrary to what is usually understood by bandwidth, in PDE this term is used to describe the typical width of structures. For a Gaussian it equals the standard deviation.


Fig. 12.1. Experimental signal with some background (left) and background ref-
erence sample with two times longer exposure time (right). The fitted signal and
background functions are indicated.

of background events in the reference sample ρ and the fraction φ of true sig-
nal events in the signal sample. These 7 parameters are to be determined in
a likelihood fit. The log-likelihood function ln L = ln L1 + ln L2 + ln L3 + ln L4
comprises 4 terms, with: 1. L1 , the likelihood of the ns events in the signal
sample (superposition of signal and background distribution):
\ln L_1(\mu, \sigma, \phi, \beta_1, \beta_2, \beta_3) = \sum_{i=1}^{n_s} \ln\left[ \phi N(x_i|\mu, \sigma) + (1 - \phi) h(x_i|\beta_1, \beta_2, \beta_3) \right] .

2. L2 , the likelihood of the nr events of the reference sample (background


distribution):
\ln L_2(\beta_1, \beta_2, \beta_3) = \sum_{i=1}^{n_r} \ln h(x_i|\beta_1, \beta_2, \beta_3) .

3. L3 , the likelihood to observe nr reference events where ρ are expected


(Poisson distribution):
\ln L_3 = -\rho + n_r \ln \rho - \ln n_r! .
12.1 Probability Density Estimation 403

4. L4 , the likelihood to get ns (1 − φ) background events in the signal sample


where rρ are expected:

\ln L_4 = -r\rho + n_s(1 - \phi) \ln(r\rho) - \ln \{ [n_s(1 - \phi)]! \} .

(It is recommended to replace n! here by Γ(n + 1).) The results of the fit are indicated in Fig. 12.1, which is a histogram of the observed events. The MLEs of the interesting parameters are µ = −0.18 ± 0.32, σ = 0.85 ± 0.22, φ = 0.60^{+0.05}_{−0.09}; the correlation coefficients are of the order of 0.3. We abstain from showing the full error matrix. The samples have been generated with the nominal parameter values µ0 = 0, σ0 = 1, φ = 0.6. To check the influence of the background parametrization, we repeat the fit with only two bins. The results change very little, to µ = −0.12 ± 0.34, σ = 0.85 ± 0.22, φ = 0.57^{+0.07}_{−0.11}. When we represent the background p.d.f. by a polygon (see next section) instead of a histogram, the result again remains stable. We then get µ = −0.20 ± 0.30, σ = 0.82 ± 0.21, φ = 0.60^{+0.05}_{−0.09}. The method which we have applied in the present example is more precise than that of Sect. 7.4 but depends to a certain degree on the presumed shape of the background distribution.
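The assembly of the four likelihood terms can be sketched as follows. This is our own illustration: the bin edges, the parameter values in the usage below and all names are assumptions, and the actual minimization (with any numerical optimizer) is omitted:

```python
import math

def neg_log_likelihood(params, signal_events, ref_events, r=0.5,
                       edges=(-4.0, -2.0, 0.0, 2.0, 4.0)):
    """Sketch of the combined negative log-likelihood of the example.
    params = (mu, sigma, phi, rho, b1, b2, b3); the background p.d.f.
    h(x) is a 4-bin histogram with b4 = 1 - b1 - b2 - b3."""
    mu, sigma, phi, rho, b1, b2, b3 = params
    betas = [b1, b2, b3, 1.0 - b1 - b2 - b3]
    width = edges[1] - edges[0]

    def h(x):  # histogram background density, normalized on the range
        k = min(int((x - edges[0]) / width), 3)
        return betas[k] / width

    def gauss(x):  # normal density N(x|mu, sigma)
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (
            sigma * math.sqrt(2 * math.pi))

    ns, nr = len(signal_events), len(ref_events)
    lnL1 = sum(math.log(phi * gauss(x) + (1 - phi) * h(x))
               for x in signal_events)          # signal sample
    lnL2 = sum(math.log(h(x)) for x in ref_events)  # reference sample
    lnL3 = -rho + nr * math.log(rho) - math.lgamma(nr + 1)  # Poisson(rho)
    nb = ns * (1 - phi)                         # background in signal sample
    # ln n! replaced by lgamma(n+1), as recommended in the text
    lnL4 = -r * rho + nb * math.log(r * rho) - math.lgamma(nb + 1)
    return -(lnL1 + lnL2 + lnL3 + lnL4)
```

Minimizing this function over the seven parameters reproduces the fit described above.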

Linear and Higher Order Parabolic Approximation

In the previous chapter we had adjusted spline functions to measurements


with errors. Similarly, we can use them to approximate the probability den-
sity. We will consider here only the linear approximation by a polygon but it
is obvious that the method can be extended to higher order parabolic func-
tions. The discontinuity corresponding to the steps between bins is avoided
when we transform the histogram into a polygon. We just have to connect the
points corresponding to the histogram functions at the center of the bins. It
can be shown that this reduces the M ISE considerably, especially for large
samples. The optimum bin width in the one-dimensional case now depends
on the average second derivative f ′′ of the p.d.f. and is much wider than for a
histogram and the error is smaller [84] than in the corresponding histogram
case:
h^* \approx 1.6 \left[ \frac{1}{N \int f''(x)^2 dx} \right]^{1/5} ,

MISE^* \approx 0.5 \left[ \frac{\int f''(x)^2 dx}{N^4} \right]^{1/5} .

In d dimensions the optimal bin width for polygon bins scales with
N −1/(d+4) and the mean square error scales with N −4/(d+4) .
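Turning a fixed-width histogram into a frequency polygon amounts to linear interpolation between the densities at the bin centers. A minimal sketch (our own names and edge handling; at least two bins are assumed):

```python
def polygon_pde(sample, nbins):
    """Frequency polygon: histogram densities nu_k/(N h) at the bin
    centers, connected by straight lines; constant outside the outermost
    centers. Assumes nbins >= 2 equal-width bins."""
    n, lo, hi = len(sample), min(sample), max(sample)
    h = (hi - lo) / nbins
    counts = [0] * nbins
    for x in sample:
        counts[min(int((x - lo) / h), nbins - 1)] += 1
    centers = [lo + (k + 0.5) * h for k in range(nbins)]
    dens = [c / (n * h) for c in counts]

    def f_hat(x):
        if x <= centers[0]:
            return dens[0]
        if x >= centers[-1]:
            return dens[-1]
        k = min(int((x - centers[0]) / h), nbins - 2)
        t = (x - centers[k]) / h  # fractional position between centers
        return (1 - t) * dens[k] + t * dens[k + 1]
    return f_hat
```

The polygon removes the histogram's discontinuities at no extra cost, which is the source of the faster N^{-4/5}-type convergence quoted above.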
404 12 Auxiliary Methods

12.1.3 Fixed Number and Fixed Volume Methods

To estimate the density at a point x an obvious procedure is to divide the


number k of observations in the neighborhood of x by the volume V which
they occupy, fˆ(x) = k/V . Either we can fix k and compute the corresponding
volume V (x) or we can choose V and count the number of observations
contained in that volume. The quadratic uncertainty is σ² = k + bias², hence
the former emphasizes fixed statistical uncertainty and the latter rather aims
at small variations of the bias.
The k-nearest neighbor method avoids large fluctuations in regions where
the density is low. We obtain a constant statistical error if we estimate the
density from the spherical volume V taken by the k nearest neighbors of point x:

\hat{f}(x) = \frac{k}{V_k(x)} . \qquad (12.3)
Like many other PDE methods, the k-nearest neighbor method is problem-
atic in regions with large curvature of f and at boundaries of x.
Instead of fixing the number of observations k in relation (12.3) we can fix
the volume V and determine k. Strong variations of the bias in the k-nearest
neighbor method are somewhat reduced but both methods suffer from the
same deficiencies, the boundary bias and a loss of precision due to the sharp
cut-off due to either fixing k or V . Furthermore it is not guaranteed that the
estimated density is normalized to one. Hence a renormalization has to be
performed.
The main advantage of fixed number and fixed volume methods is their
simplicity.
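In one dimension the "volume" of relation (12.3) is simply an interval. A sketch of the fixed-number variant (our own illustration; we include the normalization 1/N so that the estimate is a probability density):

```python
def knn_density(x, sample, k):
    """k-nearest-neighbor density estimate at point x: the k nearest
    observations occupy the interval [x - d_k, x + d_k], where d_k is the
    distance to the k-th nearest neighbor, hence V_k(x) = 2*d_k."""
    dists = sorted(abs(x - xi) for xi in sample)
    d_k = dists[k - 1]
    return k / (len(sample) * 2.0 * d_k)
```

The fixed-volume variant is obtained by exchanging the roles of k and V: fix an interval of length V around x and count the observations inside it.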

12.1.4 Kernel Methods

We now generalize the fixed volume method and replace (12.3) by


\hat{f}(x) = \frac{1}{NV} \sum_i K(x - x_i)
where the kernel K is equal to 1 if xi is inside the sphere of volume V
centered at x and 0 otherwise. Obviously, smooth kernel functions are more
attractive than the uniform kernel of the fixed volume method. An obvious
candidate for the kernel function K(u) in the one-dimensional case is the
Gaussian ∝ exp(−u2 /2h2 ). A very popular candidate is also the parabolically
shaped Epanechnikov kernel ∝ (1 − c2 u2 ), for |cu| ≤ 1, and else zero. Here
c is a scaling constant to be adjusted to the bandwidth of f . Under very
general conditions the Epanechnikov kernel minimizes the asymptotic mean
integrated square error AM ISE obtained in the limit where the effective
binwidth tends to zero, but other kernels perform nearly as well. The AM ISE
of the Gaussian kernel is only 5% larger and that of the uniform kernel by
8% [112].

The optimal bandwidth of the kernel function obviously depends on the


true density. For example for a Gaussian true density f (x) with variance σ 2
the optimal bandwidth h of a Gaussian kernel is hG ≈ 1.06 σ N^{−1/5} [112] and the corresponding constant c of the Epanechnikov kernel is c ≈ 2.2/(2hG). In practice, we will have to replace the Gaussian σ in the relation for hG by some
estimate depending on the structure of the observed data. AM ISE of the
kernel PDE is converging at the rate N −4/5 while this rate was only N −2/3
for the histogram.
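A one-dimensional Gaussian-kernel estimate with the rule-of-thumb bandwidth quoted above can be sketched as follows (our own illustration; the function name and the sample-based σ estimate are assumptions):

```python
import math

def gaussian_kde(x, sample, h=None):
    """Fixed-bandwidth Gaussian-kernel PDE at point x; if no bandwidth h
    is given, the rule of thumb h ~ 1.06*sigma*N^(-1/5) is used, with
    sigma estimated from the sample."""
    n = len(sample)
    if h is None:
        m = sum(sample) / n
        sigma = math.sqrt(sum((xi - m) ** 2 for xi in sample) / (n - 1))
        h = 1.06 * sigma * n ** (-0.2)
    # sum of normalized Gaussians centered on the observations
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample)
```

Since each kernel is a normalized density, the estimate integrates to one automatically, in contrast to some of the methods discussed below.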

12.1.5 Problems and Discussion

The simple PDE methods sketched above suffer from several problems, some
of which are unavoidable:
1. The boundary bias: When the variable x is bounded, say x < a, then f̂(x) is biased downwards (unless f(a) = 0) if the averaging process includes the region x > a, where we have no data. When the averaging is
restricted to the region x < a, the bias is positive (negative) for a distribution
decreasing (increasing) towards the boundary. In both cases the size of the
bias can be estimated and corrected for, using so-called boundary kernels.
2. Many smoothing methods do not guarantee normalization of the es-
timated probability density. While this effect can be corrected for easily by
renormalizing fˆ, it indicates some problem of the method.
3. Fixed bandwidth methods over-smooth in regions where the density
is high and tend to produce fake bumps in regions where the density is
low. Variable bandwidth kernels are able to avoid this effect partially. Their
bandwidth is chosen inversely proportional to the square root of the density,
h(xi ) = h0 f (xi )−1/2 . Since the true density is not known, f must be replaced
by a first estimate obtained for instance with a fixed bandwidth kernel.
4. Kernel smoothing corresponds to a convolution of the discrete data
distribution with a smearing function and thus unavoidably tends to flatten
peaks and to fill up valleys. This is especially pronounced where the distribu-
tion shows strong structure, that is where the second derivative f ′′ is large.
Convolution and thus also PDE implies a loss of some information contained
in the original data. This defect may be acceptable if we gain sufficiently
due to knowledge about f that we put into the smoothing program. In the
simplest case this is only the fact that the distribution is continuous and dif-
ferentiable but in some situations also the asymptotic behavior of f may be
given, or we may know that it is unimodal. Then we will try to implement
this information into the smoothing method.
Some of the remedies for the difficulties mentioned above use estimates
of f and its derivatives. Thus iterative procedures seem to be the solution.
However, the iteration process usually does not converge and thus has to be
supervised and stopped before artifacts appear.
In Fig. 12.2 three simple smoothing methods are compared. A sample of 1000 events has been generated from the function shown as a dashed curve in the figure.

Fig. 12.2. Estimated probability density. Left hand: nearest neighbor, center: Gaussian kernel, right hand: polygon.

A k-nearest neighbor approximation of the p.d.f. of a sample is shown in the left hand graph of Fig. 12.2. The value of k chosen
was 100 which is too small to produce enough smoothing but too large to
follow the distribution at the left hand border. The result of the PDE with
a Gaussian kernel with fixed width is presented in the central graph and a
polygon approximation is shown in the right-hand graph. All three graphs
show the typical defects of simple smoothing methods, broadening of the peak
and fake structures in the region where the statistics is low.
As an alternative to the standard smoothing methods, complementary approaches often produce better results. The typical smoothing problems can partially be avoided when the p.d.f. is parametrized and adjusted to the data sample in a likelihood fit. A simple parametrization is the superposition of normal distributions,
f(x) = \sum_i \alpha_i N(x; \mu_i, \Sigma_i) ,

with the free parameters: weights αi, mean values µi and covariance matrices
Σi .
If information about the shape of the distribution is available, more spe-
cific parametrizations which describe the asymptotic behavior can be ap-
plied. Distributions which resemble a Gaussian should be approximated by
the Gram-Charlier series (see last paragraph of Sect. 11.2.2). If the data sam-
ple is sufficiently large and the distribution is unimodal with known asymp-
totic behavior the construction of the p.d.f. from the moments as described
in [114] is quite efficient.
Physicists use PDE mainly for the visualization of the data. Here, in
one dimension, histogramming is the standard method. When the estimated
distribution is used to simulate an experiment, frequency polygons are to be
preferred. Whenever a useful parametrization is at hand, then PDE should be

replaced by an adjustment of the corresponding parameters in a likelihood fit.


Only in rare situations does it pay to construct complicated kernels. For a quick
qualitative illustration of a distribution off-the-shelf programs may do. For
most quantitative evaluations of moments and parameters of the unknown
distribution we recommend to use the bootstrap method which is discussed
in the following section.

12.2 Resampling Techniques


12.2.1 Introduction

In the previous section we have discussed a method to construct a distribution


approximately starting from a sample drawn from it. Knowing the distribu-
tion allows us to calculate certain parameters like moments or quantiles. In
most cases, however, it is preferable to determine the wanted quantities di-
rectly from the sample. A trivial example for this approach is the estimation
of the mean value and the variance from a series of measurements as we have
discussed in
PChap. 4 treating error calculation wherePwe had used
 the sample
mean x = xi /N and the empirical variance s2 = (xi − x)2 /(N − 1). In
a similar way we can also determine higher moments, correlations, quantiles
and other statistical parameters. However, the analytical derivation of the
corresponding expressions is often not as simple as that of the mean value or
the variance. The errors of functions depending on several random input vari-
ables are usually computed by linear error propagation. This approximation
is often not justified. The bootstrap method avoids these problems.
The bootstrap method has been developed systematically by Efron but
was inspired by earlier developments like the jackknife. A comprehensive presentation of the method is given in Ref. [79], which has served as the basis for this section.
The name bootstrap goes back to the famous book of Rudolf Erich Raspe in which
he narrates the adventures of the lying Baron von Münchhausen. Münch-
hausen had pretended to have saved himself out of a swamp by pulling himself
up with his own bootstraps3 . In statistics, the expression bootstrap is used
because from a small sample the quasi complete distribution is generated.
There is not quite as much lying as in Münchhausen’s stories.
The bootstrap concept is based upon a simple idea: The sample itself
replaces the unknown distribution. The sample is the distribution from which
we draw individual observations. In fact, the bootstrap idea is also used when we associate errors with simple measurements, for example √n with a measurement of a Poisson distributed number n, which is only an estimate of the true mean value.
As already mentioned, the bootstrap method permits us, apart from error
estimation, to compute p-values for significance tests and the error rate in
3 In the original version he is pulling himself up by his hair.

classifications. It relies, as will be shown in subsequent examples, on the


combination of randomly selected observations.
A subvariant of the bootstrap technique is called jackknife which is mainly
used to estimate biases from subsets of the data.
In Chap. 10 where we had evaluated the distribution of the energy test
statistic in two-sample problems, we have used another resampling technique.
We had reshuffled the elements of two partitions applying random permuta-
tions. Whereas in the bootstrap method, elements are drawn with replace-
ment, permutations generate samples where every element occurs only a sin-
gle time.
The reason for not using all possible permutations is simply that their
number is in most cases excessively large and a finite random sample pro-
vides sufficiently precise results. While bootstrap techniques are used mainly
to extract parameters of an unknown distribution from a single sample, ran-
domization methods serve to compare two or more samples taken under dif-
ferent conditions.
Remark: We may ask with some justification whether resampling makes sense, since randomly choosing elements from a sample does not seem to be as efficient as a systematic evaluation of the complete sample. Indeed, it should
always be optimal to evaluate the interesting parameter directly using all ele-
ments of the sample with the same weight, either analytically or numerically,
but, as in Monte Carlo simulations, the big advantage of parameter estima-
tion by randomly selecting elements relies on the simplicity of the approach.
Nowadays, lack of computing capacity is not a problem and in the limit of an
infinite number of drawn combinations of observations the complete available
information is exhausted.

12.2.2 Definition of Bootstrap and Simple Examples

We sort the N observations of a given data sample {x1 , x2 , . . . , xN } accord-


ing to their value, xi ≤ xi+1 , and associate to each of them the probability
1/N . We call this discrete distribution P0 (xi ) = 1/N the sample distribution.
We obtain a bootstrap sample {x∗1 , x∗2 , . . . , x∗M } by generating M observations
following P0 . Bootstrap observations are marked with a star "∗ ". A bootstrap
sample may contain the same observation several times. The evaluation of
interesting quantities follows closely that which is used in Monte Carlo sim-
ulations. The p.d.f. used for event generation in Monte Carlo procedures is
replaced by the sample distribution.

Example 156. Bootstrap evaluation of the accuracy of the estimated mean


value of a distribution
We have already discussed an efficient method to estimate the variance. Here
we present an alternative approach in order to introduce and illustrate the

bootstrap method. Given the sample of N = 10 observations {0.83, 0.79, 0.31, 0.09, 0.72, 2.31, 0.11, 0.32, 1.11, 0.75}, the estimate of the mean value is obviously µ̂ = x̄ = Σ xi/N = 0.74. We have also derived in Sect. 3.2 a formula to estimate the uncertainty, δµ = s/√(N − 1) = 0.21. When we treat
the sample as representative of the distribution, we are able to produce an
empirical distribution of the mean value: We draw from the complete sample
sequentially N observations (drawing with replacement ) and get for instance
the bootstrap sample {0.72, 0.32, 0.79, 0.32, 0.11, 2.31, 0.83, 0.83, 0.72, 1.11}.
We compute the sample mean and repeat this procedure B times and obtain in this way B mean values µ∗b. The number of bootstrap replicates should be large compared to N, for example B typically equal to 100 or 1000 for N = 10. From the distribution of the values µ∗b, b = 1, . . . , B we can compute again the mean value µ̂∗ and its uncertainty δµ∗:

\hat{\mu}^* = \frac{1}{B} \sum_b \mu_b^* ,

\delta_\mu^{*2} = \frac{1}{B} \sum_b \left( \mu_b^* - \hat{\mu}^* \right)^2 .
Fig. 12.3 shows the sample distribution corresponding to the 10 observations
and the bootstrap distribution of the mean values. The bootstrap estimates
µ̂∗ = 0.74, δµ∗ = 0.19, agree reasonably well with the directly obtained val-
ues. The larger value of δµ compared to δµ∗ is due to the bias correction in
its evaluation. The bootstrap values correspond to the maximum likelihood
estimates. The distribution of Fig. 12.3 contains further information. We re-
alize that the distribution is asymmetric, the reason being that the sample
was drawn from an exponential. We could, for example, also derive the skew-
ness or the frequency that the mean value exceeds 1.0 from the bootstrap
distribution.
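The resampling procedure of this example translates directly into code. A minimal sketch (function name and seed handling are ours):

```python
import random

def bootstrap_mean_error(sample, B=1000, seed=1):
    """Bootstrap estimate of the mean and its uncertainty: draw B samples
    of size N with replacement from the sample distribution and take the
    mean and the spread of the B bootstrap means."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(B):
        boot = [rng.choice(sample) for _ in range(n)]  # drawing with replacement
        means.append(sum(boot) / n)
    mu = sum(means) / B
    err = (sum((m - mu) ** 2 for m in means) / B) ** 0.5
    return mu, err
```

Applied to the ten observations above, the function reproduces values close to the µ̂* = 0.74, δµ* = 0.19 quoted in the text; the full list of B means also gives access to skewness and tail probabilities, as discussed.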

While we know the exact solution for the estimation of the mean value and
mean squared error of an arbitrary function u(x), it is difficult to compute
the same quantities for more complicated functions like the median or for
correlations.

Example 157. Error of mean distance of stochastically distributed points in


a square
Fig. 12.4 shows 20 points drawn from an unknown p.d.f. distributed in
a square. The mean distance is 0.55. We determine the standard deviation
for this quantity from 104 bootstrap samples and obtain the value 0.10. This
example is rather abstract and has been chosen because it is simple and


Fig. 12.3. Sample distribution (left) and distribution of bootstrap sample mean
values (right).


Fig. 12.4. Distribution of points in a unit square. The right hand graph shows the
bootstrap distribution of the mean distance of the points.

demonstrates that the bootstrap method is able to solve problems which are
hardly accessible with other methods.
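The mean-distance example can be sketched as follows; the statistic depends on all pairs of observations, which makes analytic error propagation awkward but leaves the bootstrap unaffected (names and seeds are our own assumptions):

```python
import math
import random

def mean_pair_distance(points):
    """Mean Euclidean distance over all pairs of 2-d points."""
    n = len(points)
    dists = [math.hypot(points[i][0] - points[j][0],
                        points[i][1] - points[j][1])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)

def bootstrap_error(points, statistic, B=1000, seed=1):
    """Standard deviation of 'statistic' over B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(points)
    vals = [statistic([rng.choice(points) for _ in range(n)])
            for _ in range(B)]
    m = sum(vals) / B
    return math.sqrt(sum((v - m) ** 2 for v in vals) / B)
```

Note that a resampled point set may contain the same point several times; the zero-distance pairs this produces are part of the bootstrap fluctuation, not a bug.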

Example 158. Acceptance of weighted events


We resume an example from Sect. 4.4.7 where we had presented an an-
alytic solution. Now we propose a simpler solution: We have a sample of N
Monte Carlo generated events with weights wi , i = 1, . . . , N , where we know
for each of them whether it is accepted, εi = 1, or not, εi = 0. The mean acceptance is ε = Σ wi εi / Σ wi. Now we draw from the sample randomly B new bootstrap samples {(w1∗, ε1∗), . . . , (wN∗, εN∗)} and compute in each case ε∗. The empirical variance σ² of the distribution of ε∗ is the bootstrap estimate of the error squared, δε² = σ², of the acceptance ε.
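A sketch of this procedure (our own illustration; events are represented as (weight, accepted) pairs):

```python
import random

def acceptance_error(events, B=1000, seed=1):
    """Bootstrap error of the weighted acceptance
    eps = sum(w_i * eps_i) / sum(w_i), where 'events' is a list of
    (weight, accepted) pairs with accepted in {0, 1}."""
    def acc(sample):
        return sum(w * e for w, e in sample) / sum(w for w, _ in sample)
    rng = random.Random(seed)
    n = len(events)
    # B bootstrap replicas of the acceptance
    vals = [acc([rng.choice(events) for _ in range(n)]) for _ in range(B)]
    m = sum(vals) / B
    return acc(events), (sum((v - m) ** 2 for v in vals) / B) ** 0.5
```

The same few lines cover arbitrary weight distributions, for which the analytic treatment of Sect. 4.4.7 becomes tedious.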

Bootstrapping is especially useful for the calculation of errors of quantities


that depend on many input variables. For instance, the uncertainty of the
number of events in an unfolded histogram depends in a complicated way on
the statistical error of both the observed and the simulated events, but can
easily be estimated with the bootstrap method.

12.2.3 Precision of the Error Estimate

Usually we are not interested in the uncertainty σδ of the error estimate


δ. This is a higher order effect, yet we want to know how many bootstrap
samples are required to avoid additional error contributions related to the
method.
The standard deviation has two components, σt , which depends on the
shape of the true distribution and the sample size N , and, σB , which depends
on the number B of bootstrap replicates. Since the two causes are independent
and of purely statistical nature, we can put
\sigma_\delta^2 = \sigma_t^2 + \sigma_B^2 ,

\frac{\sigma_\delta^2}{\delta^2} = \frac{c_1}{N} + \frac{c_2}{B} .
We can only influence the second term, N being given. Obviously it is
sufficient to choose the number B of bootstrap replicates large compared to
the number N of experimental observations. For a normal distribution the
two constants c1 , c2 are both equal to 1/2. (A derivation is given in [79].) For
distributions with long tails, i.e. large excess γ2 , they are larger. (Remember:
γ2 = 0 for the normal distribution.) The value of c2 is in the general case
given by [79]:
c_2 = \frac{\gamma_2 + 2}{4} .
An estimate for γ2 can be derived from the empirical fourth moment of
the sample. Since error estimates are rather crude anyway, we are satisfied
with the choice B ≫ N .

12.2.4 Confidence Limits

To compute confidence limits or the p-value of a parameter we generate its


distribution from bootstrap samples. In a preceding example where we com-
puted the mean distance of random points, we extract from the distribution
of Fig. 12.4 that the probability to find a distance less than 0.4 is approx-
imately equal to 10%. Exact confidence intervals can only be derived from
the exact distribution.

12.2.5 Precision of Classifiers

Classifiers like decision trees and ANNs usually subdivide the learning sample
in two parts, one part is used to train the classifier and a smaller part is
reserved to test the classifier. The precision can be enhanced considerably by
using bootstrap samples for both training and testing.

12.2.6 Random Permutations

In Chap. 10 we have treated the two-sample problem: “Do two experimental


distributions belong to the same population?” In one of the tests, the energy
test, we had used permutations of the observations to determine the distri-
bution of the test quantity. The same method can be applied to an arbitrary
test quantity which is able to discriminate between samples.

Example 159. Two-sample test with a decision tree


Let us assume that we want to test whether the two samples
{x1, . . . , xN1}, {x′1, . . . , x′N2} of sizes N1 and N2 belong to the same pop-
ulation. This is our null hypothesis. Instead of using one of the established
two-sample methods we may train a decision tree to separate the two sam-
ples. As a test quantity serves the number of misclassifications Ñ which of
course is smaller than (N1 + N2 )/2, half the size of the total sample. Now we
combine the two samples, draw from the combined sample two new random
samples of sizes N1 and N2 , train again a decision tree to identify for each ele-
ment the sample index and count the number of misclassifications. We repeat
this procedure many, say 1000, times and obtain this way the distribution
of the test statistic under the null hypothesis. The fraction of cases where
the random selection yields a smaller number of misclassifications than the
original samples is equal to the p-value of the null hypothesis.
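A sketch of this permutation test follows. To keep it free of machine-learning libraries, the decision tree of the example is replaced by a minimal one-node decision stump (a single optimized cut on a one-dimensional variable); with a package such as scikit-learn one would substitute a real tree or an ANN. All names here are our own:

```python
import random

def misclassifications(a, b):
    """Misclassification count of the best single-threshold classifier
    (a one-node 'decision tree'), minimized over all cut positions and
    both label orientations."""
    labeled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    n1 = len(a)
    best = min(n1, len(b))          # trivial cut: everything on one side
    n0_left = 0                     # label-0 events below the cut
    for i, (_, lab) in enumerate(labeled):
        n0_left += (lab == 0)
        # cut after position i: left = sample a, right = sample b
        wrong = (n1 - n0_left) + (i + 1 - n0_left)
        best = min(best, wrong, len(labeled) - wrong)  # or swapped labels
    return best

def permutation_pvalue(a, b, trials=200, seed=1):
    """Fraction of random re-partitions that separate at least as well
    as the observed one (small p-value: samples differ)."""
    rng = random.Random(seed)
    observed = misclassifications(a, b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if misclassifications(pooled[:len(a)], pooled[len(a):]) <= observed:
            count += 1
    return count / trials
```

Well-separated samples yield a small p-value, while identically distributed samples yield a large one.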

Instead of a decision tree we can use any other classifier, for instance a
ANN. The corresponding tests are potentially very powerful but also quite
involved. Even with nowadays computer facilities, training of some 1000 de-
cision trees or artificial neural nets is quite an effort.

12.2.7 Jackknife and Bias Correction

Jackknife is mainly used for bias removal4 . Estimates derived from a sample
of N observations x1 , ..., xN are frequently biased. The bias of a consistent
estimator vanishes in the limit N → ∞.
Let us assume that the bias decreases proportionally to 1/N. This assump-
tion holds in the majority of cases. To infer the size of the bias b = t̂N − t of
the estimate t̂N of the true parameter t, we estimate t̂N −1 for a sample of
N − 1 events and use the 1/N relation to compute the bias. For the expected
values E(t̂N ) and E(t̂N −1 ) we get

\frac{E(\hat{t}_N) - t}{E(\hat{t}_{N-1}) - t} = \frac{N-1}{N}

and obtain

t = N E(\hat{t}_N) - (N-1) E(\hat{t}_{N-1}) ,

E(b) = E(\hat{t}_N) - t = (N-1) \left[ E(\hat{t}_{N-1}) - E(\hat{t}_N) \right] ,

\hat{b} = (N-1)(\hat{t}_{N-1} - \hat{t}_N) .

To estimate the actual bias, we have to replace the expected values by


the observed values. To determine t̂N −1 , we exclude observation xi from the
sample and compute the estimate t̂i from the corresponding subsample that
contains N − 1 observations and average over the results for all values of i:
\hat{t}_{N-1} = \frac{1}{N} \sum_{i=1}^{N} \hat{t}_i .

The remaining bias after the jackknife correction is zero or of order 1/N² or higher. The jackknife was invented in the 1950s by Maurice Quenouille and John Tukey. The name jackknife had been chosen to indicate the simplicity of the statistical tool.

Example 160. Jackknife bias correction


When we estimate the variance σ² of a distribution from a sample x1, . . . , xN of size N, using the formula

\delta_N^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 / N

4 Remember, bias corrections should be applied to MLEs only in exceptional situations.

then δ²N is biased, its expected value is smaller than σ². We remove one observation at a time and compute each time the mean squared error δ²N−1,i and average the results:

\delta_{N-1}^2 = \frac{1}{N} \sum_{i=1}^{N} \delta_{N-1,i}^2 .

Thus an improved estimate is δ²c = N δ²N − (N − 1) δ²N−1. Inserting the known expectation values (see Chap. 3),

E(\delta_N^2) = \frac{N-1}{N} \sigma^2 ,

we confirm the bias corrected result E(δ²c) = σ².
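The calculation of the example can be carried out numerically in a few lines; for the variance, the jackknife-corrected estimate coincides exactly with the usual unbiased estimator with denominator N − 1 (our own sketch):

```python
def jackknife_variance(sample):
    """Jackknife bias-corrected variance estimate of the example:
    delta_c^2 = N*delta_N^2 - (N-1)*delta_{N-1}^2, where delta_{N-1}^2
    is the average of the leave-one-out MLE variances."""
    n = len(sample)

    def mle_var(s):  # biased estimator with denominator len(s)
        m = sum(s) / len(s)
        return sum((x - m) ** 2 for x in s) / len(s)

    d_n = mle_var(sample)
    d_n1 = sum(mle_var(sample[:i] + sample[i + 1:]) for i in range(n)) / n
    return n * d_n - (n - 1) * d_n1
```

For statistics other than the variance the same recipe removes the 1/N part of the bias even when no closed-form correction is known.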
13 Appendix

13.1 Large Number Theorems


13.1.1 Chebyshev Inequality and Law of Large Numbers

For a probability density f(x) with expected value µ, finite variance σ² and
arbitrary given positive δ, the following inequality, known as Chebyshev in-
equality, is valid:

P\{|x - \mu| \ge \delta\} \le \frac{\sigma^2}{\delta^2} . \qquad (13.1)
This very general theorem says that a given, fixed deviation from the
expected value becomes less probable when the variance becomes smaller. It
is also valid for discrete distributions.
To prove the inequality, we use the definition
P_I \equiv P\{x \in I\} = \int_I f(x)\,dx ,

where the domain of integration I is given by 1 ≤ |x − µ|/δ. The assertion


follows from the following inequalities for the integrals:
\int_I f(x)\,dx \le \int_I \left( \frac{x-\mu}{\delta} \right)^2 f(x)\,dx \le \int_{-\infty}^{\infty} \left( \frac{x-\mu}{\delta} \right)^2 f(x)\,dx = \sigma^2/\delta^2 .

Applying (13.1) to the arithmetic mean x̄ of N independent identically distributed random variables x1, . . . , xN results in one of the so-called laws of large numbers:

P\{|\bar{x} - \langle x \rangle| \ge \delta\} \le \mathrm{var}(x)/(N\delta^2) , \qquad (13.2)

with the relations ⟨x̄⟩ = ⟨x⟩, var(x̄) = var(x)/N obtained in Sects. 3.2.2
and 3.2.3. The right-hand side disappears for N → ∞, thus in this limit the
probability to observe an arithmetic mean value outside an arbitrary interval
centered at the expected value approaches zero. We talk about stochastic con-
vergence or convergence in probability, here of the arithmetic mean against
the expected value.

We now apply (13.2) to the indicator function I_I(x) = 1 for x ∈ I, else 0. The sample mean Ī_I = Σ I_I(xi)/N is the observed relative frequency of events x ∈ I in the sample. The expected value and the variance are

\langle I_I \rangle = \int I_I(x) f(x)\,dx = P_I ,

\mathrm{var}(I_I) = \int I_I^2(x) f(x)\,dx - \langle I_I \rangle^2 = P_I(1 - P_I) \le 1/4 ,

where, as above, PI is the probability P {x ∈ I} to find x in the set I. When


we insert these results into (13.2), we obtain

P {|II − PI | ≥ δ} ≤ 1/(4N δ 2) . (13.3)

The relative frequency of events of a certain type in a sample converges with


increasing N stochastically to the probability to observe an event of that
type1 .

13.1.2 Central Limit Theorem

The central limit theorem states that the distribution of the sample mean x̄,

   x̄ = (1/N) Σ_{i=1}^{N} x_i ,

of N i.i.d. variables x_i with finite variance σ² in the limit N → ∞ will approach a normal distribution with variance σ²/N, independent of the form of the distribution f(x). The following proof assumes that its characteristic function exists.
To simplify the notation, we transform the variable to y = (x − µ)/(√N σ), where µ is the mean of x. The characteristic function of a p.d.f. with mean zero and variance 1/N, and thus also that of the p.d.f. of y, is of the form

   φ(t) = 1 − t²/(2N) + c t³/N^{3/2} + · · · .

The characteristic function of the sum z = Σ_{i=1}^{N} y_i is given by the product

   φ_z = (1 − t²/(2N) + c t³/N^{3/2} + · · ·)^N ,
1
This theorem was derived by the Dutch-Swiss mathematician Jakob I. Bernoulli
(1654-1705).

which in the limit N → ∞, where only the first two terms survive, approaches the characteristic function of the standard normal distribution N(0, 1):

   lim_{N→∞} φ_z = lim_{N→∞} (1 − t²/(2N))^N = e^{−t²/2} .

It can be shown that the convergence of characteristic functions implies the convergence of the distributions. The distribution of x̄ for large N is then approximately

   f(x̄) ≈ (√N/(√(2π) σ)) exp(−N(x̄ − µ)²/(2σ²)) .
Remark: The law of large numbers and the central limit theorem can be
generalized to sums of independent but not identically distributed variates.
The convergence is relatively fast when the variances of all variates are of
similar size.
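A small simulation illustrates the theorem; already N = 50 uniform variates give a nearly Gaussian mean (the values of N and of the number of trials chosen here are arbitrary):

```python
import random

random.seed(2)

def clt_fractions(N=50, trials=50000):
    """Standardized mean of N Uniform(0,1) variates; for a normal law the
    fractions within one and two standard deviations are 0.683 and 0.954."""
    sigma_mean = (1.0 / 12.0) ** 0.5 / N ** 0.5      # sigma / sqrt(N)
    within1 = within2 = 0
    for _ in range(trials):
        y = (sum(random.random() for _ in range(N)) / N - 0.5) / sigma_mean
        within1 += abs(y) < 1.0
        within2 += abs(y) < 2.0
    return within1 / trials, within2 / trials

f1, f2 = clt_fractions()
```

The observed fractions agree with the normal values 68.3% and 95.4% to a few per mille, consistent with the statement above that convergence is fast for variates of similar variance.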

13.2 Consistency, Bias and Efficiency of Estimators


The following estimator properties are essential in frequentist statistics. We
will discuss their relevance in Appendix 13.7.
Throughout this chapter we assume that samples consist of N i.i.d. vari-
ables xi , the true parameter value is θ0 , estimates are θ̂ or t.

13.2.1 Consistency
We expect from a useful estimator that it becomes more accurate with increasing size of the sample, i.e. larger deviations from the true value should become more and more improbable.
A sequence of estimators tN of a parameter θ is called consistent, if their
p.d.f.s for N → ∞ are shrinking towards a central value equal to the true
parameter value θ0 , or, expressing it mathematically, if
   lim_{N→∞} P{|t_N − θ0| > ε} = 0   (13.4)

is valid for arbitrary ε. A sufficient condition for consistency, which is easier to check than (13.4), is the combination of the two requirements

   lim_{N→∞} ⟨t_N⟩ = θ0 ,  lim_{N→∞} var(t_N) = 0 ,
where of course the existence of mean value and variance for the estimator
tN has to be assumed.
For instance, as implied by the law of large numbers, the sample moments

   t_m = (1/N) Σ_{i=1}^{N} x_i^m

are consistent estimators for the respective m-th moments µ_m of f(x) if these moments exist.

13.2.2 Bias of Estimates

The bias of an estimate has already been introduced in Sect. 6.8.3: An estimate t_N for θ is unbiased if already for finite N (possibly only for N > N0) and all parameter values considered, the estimator satisfies the condition

   ⟨t_N⟩ = θ .

The bias of an estimate is defined as

   b = ⟨t_N⟩ − θ .

Obviously, consistent estimators are asymptotically unbiased:

   lim_{N→∞} b(N) = 0 .

The bias of a consistent estimator can be removed without affecting the


consistency by multiplying the estimate with a factor like (N + a)/(N + b)
which approaches unity for N → ∞.

13.2.3 Efficiency

An important characteristic is of course the accuracy of the statistical estimate. A useful measure of accuracy is the mean square deviation ⟨(t − θ0)²⟩ of the estimate from the true value of the parameter. According to (3.11) it is related to the variance of the estimator and the bias by

   ⟨(t − θ0)²⟩ = var(t) + b² .   (13.5)

Definition: An estimator t is (asymptotically) efficient for the parameter


θ if for all permitted parameter values it fulfils the following conditions for
N → ∞:

1. √N (t − θ) approaches a normal distribution of constant width and mean equal to zero.
2. var(t) ≤ var(t′ ) for any other estimator t′ which satisfies condition 1.
In other words, an efficient estimate is asymptotically normally distributed and has minimal variance. According to condition 1, its variance decreases as 1/N. An efficient estimator therefore reaches the same accuracy as a competing one with a smaller sample size N, and is therefore economically superior. An efficient estimator does not exist in all situations.

Example 161. Efficiency of different estimates of the location parameter of a Gaussian [116]

Let us compare three methods to estimate the expected value µ of a Gaussian N(x|µ, σ) with given width σ from a sample {x_i}, i = 1, . . . , N. For large N we obtain for var(t):

   Method 1: sample mean            σ²/N
   Method 2: sample median          σ²/N · π/2
   Method 3: (x_min + x_max)/2      σ²/N · Nπ²/(12 ln N)

Obviously methods 2 and 3 are not efficient. Especially the third method, taking the mean of the two extremal values found in the sample, performs badly here. For other distributions, different results will be found. For the rather exotic two-sided exponential distribution (an exponential distribution of the absolute value of the variate, also called Laplace distribution) method 2 would be efficient and equal to the MLE. For a uniform distribution the estimator of method 3 would be efficient and also equal to the MLE.
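The ordering of the three methods for the Gaussian case can be checked by simulation. In this sketch the sample size and the number of repetitions are arbitrary choices; the median-to-mean variance ratio should come out close to the asymptotic value π/2, while the midrange is far worse:

```python
import random, statistics

random.seed(3)

def location_variances(N=100, trials=5000, sigma=1.0):
    """Sampling variances of three location estimators applied to repeated
    N(0, sigma) samples: mean, median and midrange (x_min + x_max)/2."""
    means, medians, midranges = [], [], []
    for _ in range(trials):
        x = [random.gauss(0.0, sigma) for _ in range(N)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
        midranges.append(0.5 * (min(x) + max(x)))
    return (statistics.variance(means),
            statistics.variance(medians),
            statistics.variance(midranges))

v_mean, v_median, v_mid = location_variances()
```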

While it is of interest to find the estimator which provides the smallest variance, it is not obvious how we could prove this property, since a comparison with all thinkable methods is of course impossible. Here a useful tool is the Cramer–Rao inequality. It provides a lower bound for the variance of an estimator. If we reach this minimum, we can be sure that the optimal accuracy is obtained.
The Cramer–Rao inequality states:

   var(t) ≥ [1 + (db/dθ)]² / (N ⟨(∂ ln f /∂θ)²⟩) .   (13.6)
The denominator of the right-hand side is also called, after R. A. Fisher, the information² about the parameter θ from a sample of size N of i.i.d. variates.
To prove this inequality, we define the random variable y = Σ y_i with

   y_i = ∂ ln f_i/∂θ ,  f_i ≡ f(x_i|θ) .   (13.7)

It has the expected value

   ⟨y_i⟩ = ∫ (1/f_i)(∂f_i/∂θ) f_i dx_i = ∫ (∂f_i/∂θ) dx_i = (∂/∂θ) ∫ f_i dx_i = (∂/∂θ) 1 = 0 .   (13.8)
∂θ
2
This is a special use of the word information as a technical term.

Because of the independence of the y_i we have ⟨y_i y_j⟩ = ⟨y_i⟩⟨y_j⟩ = 0 and

   var(y) = N ⟨y_i²⟩ = N ⟨(∂ ln f /∂θ)²⟩ .   (13.9)

Using the definition L = Π f_i, we find for cov(ty) = ⟨(t − ⟨t⟩)(y − ⟨y⟩)⟩:

   cov(ty) = ∫ t (∂ ln L/∂θ) L dx_1 · · · dx_N
           = (∂/∂θ) ∫ t L dx_1 · · · dx_N
           = ∂⟨t⟩/∂θ
           = 1 + db/dθ .   (13.10)

From the Cauchy–Schwarz inequality

   [cov(ty)]² ≤ var(t) var(y)

and (13.9), (13.10) follows (13.6).


The equality sign in (13.6) is valid if and only if the two factors t, y
in the covariance are proportional to each other. In this case t is called a
Minimum Variance Bound (MVB) estimator. It can be shown to be also
minimal sufficient.
In most of the literature efficiency is defined by the stronger condition:
An estimator is called efficient, if it is bias-free and if it satisfies the MVB.
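For a Gaussian with known width the sample mean saturates the bound (13.6) with b = 0: the information per observation is ⟨(∂ ln f/∂µ)²⟩ = 1/σ², so the bound is σ²/N. A small numerical check (the parameter values are arbitrary):

```python
import random, statistics

random.seed(4)

def mvb_check(N=50, trials=20000, mu=2.0, sigma=1.5):
    """Variance of the sample mean versus the Cramer-Rao bound; for
    f = N(mu, sigma) the bound (13.6) with b = 0 is sigma^2/N."""
    estimates = [statistics.fmean(random.gauss(mu, sigma) for _ in range(N))
                 for _ in range(trials)]
    return statistics.variance(estimates), sigma ** 2 / N

var_t, bound = mvb_check()
```

Within the statistical accuracy of the simulation, var(t) equals the bound: the sample mean is an MVB estimator here.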

13.3 Properties of the Maximum Likelihood Estimator


13.3.1 Consistency

The maximum likelihood estimator (MLE) is consistent under mild assumptions.
To prove this, we consider the expected value of

   ln L(θ|x) = Σ_{i=1}^{N} ln f(x_i|θ) ,   (13.11)

which is to be calculated by integration over the variables³ x using the true p.d.f. (with the true parameter θ0). First we prove the inequality

³ We keep the form of the argument list of L, although now x is not considered as fixed to the experimentally sampled values, but as a random vector with given p.d.f.

   ⟨ln L(θ|x)⟩ < ⟨ln L(θ0|x)⟩ ,   (13.12)

for θ ≠ θ0: Since the logarithm is strictly concave, there is always ⟨ln(· · ·)⟩ < ln⟨(· · ·)⟩, hence

   ⟨ln (L(θ|x)/L(θ0|x))⟩ < ln ⟨L(θ|x)/L(θ0|x)⟩ = ln ∫ (L(θ|x)/L(θ0|x)) L(θ0|x) dx = ln 1 = 0 .

In the last step we used

   ∫ L(θ|x) dx = ∫ Π f(x_i|θ) dx_1 · · · dx_N = 1 .

Since ln L(θ|x)/N = Σ ln f(x_i|θ)/N is an arithmetic sample mean which, according to the law of large numbers (13.2), converges stochastically to the expected value for N → ∞, we have also (in the sense of stochastic convergence)

   ln L(θ|x)/N → ⟨ln f(x|θ)⟩ = Σ ⟨ln f(x_i|θ)⟩/N = ⟨ln L(θ|x)⟩/N ,

and from (13.12)

   lim_{N→∞} P{ln L(θ|x) < ln L(θ0|x)} = 1 ,  θ ≠ θ0 .   (13.13)

On the other hand, the MLE θ̂ is defined by the extremum condition

   ln L(θ̂|x) ≥ ln L(θ0|x) .

A contradiction to (13.13) can be avoided only if also

   lim_{N→∞} P{|θ̂ − θ0| < ε} = 1

is valid. This means consistency of the MLE.

13.3.2 Efficiency

Since the MLE is consistent, it is asymptotically unbiased for N → ∞. Under certain assumptions in addition to the usually required regularity⁴ the MLE is also asymptotically efficient.
Proof: With the notations of the last paragraph, with L = Π f_i and using (13.8), the expected value and variance of y = Σ y_i = ∂ ln L/∂θ are given by the following expressions:

⁴ The boundaries of the domain of x must not depend on θ and the maximum of L should not be reached at the boundary of the range of θ.
   ⟨y⟩ = ∫ (∂ ln L/∂θ) L dx = 0 ,   (13.14)

   σ_y² = var(y) = ⟨(∂ ln L/∂θ)²⟩ = −⟨∂² ln L/∂θ²⟩ .   (13.15)

The last relation follows after further differentiation of (13.14) and from the relation

   ∫ (∂² ln L/∂θ²) L dx = −∫ (∂ ln L/∂θ)(∂L/∂θ) dx = −∫ (∂ ln L/∂θ)² L dx .
From the Taylor expansion of ∂ ln L/∂θ|_{θ=θ̂}, which is zero by definition, and with (13.15) we find

   0 = ∂ ln L/∂θ|_{θ=θ̂} ≈ ∂ ln L/∂θ|_{θ=θ0} + (θ̂ − θ0) ∂² ln L/∂θ²|_{θ=θ0}
     ≈ y − (θ̂ − θ0) σ_y² ,   (13.16)

where the consistency of the MLE guarantees the validity of this approximation in the sense of stochastic convergence. Following the central limit theorem, y/σ_y, being the sum of i.i.d. variables, is asymptotically normally distributed with mean zero and variance unity. The same is then true for (θ̂ − θ0)σ_y, i.e. θ̂ follows asymptotically a normal distribution with mean θ0 and asymptotically vanishing variance 1/σ_y² ∼ 1/N, as seen from (13.9).

13.3.3 Asymptotic Form of the Likelihood Function

A similar result as derived in the last paragraph for the p.d.f. of the MLE θ̂ can be derived for the likelihood function itself.
If one considers the Taylor expansion of y = ∂ ln L/∂θ around the MLE θ̂, we get with y(θ̂) = 0

   y(θ) ≈ (θ − θ̂) y′(θ̂) .   (13.17)

As discussed in the last paragraph, we have for N → ∞

   y′(θ̂) → y′(θ0) → ⟨y′⟩ = −σ_y² = const .

Thus y′(θ̂) is independent of θ̂ and higher derivatives disappear. After integration of (13.17) over θ we obtain a parabolic form for ln L:

   ln L(θ) = ln L(θ̂) − (1/2) σ_y² (θ − θ̂)² ,

where the width of the parabola decreases with σ_y^{−2} ∼ 1/N (13.9). Up to the missing normalization, the likelihood function has the same form as the distribution of the MLE with θ̂ − θ0 replaced by θ − θ̂.

13.3.4 Properties of the Maximum Likelihood Estimate for Small


Samples

The criterion of asymptotic efficiency, fulfilled by the MLE for large samples,
is usually extended to small samples, where the normal approximation of
the sampling distribution does not apply, in the following way: A bias-free
estimate t is called a minimum variance (MV) estimate if var(t) ≤ var(t′ )
for any other bias-free estimate t′ . If, moreover, the Cramer–Rao inequality
(13.6) is fulfilled as an equality, one speaks of a minimum variance bound
(MVB) estimate, often also called efficient or most efficient, estimate (not to
be confused with the asymptotic efficiency which we have considered before
in Appendix 13.2). The latter, however, exists only for a certain function
τ (θ) of the parameter θ if it has a one-dimensional sufficient statistic (see
6.5.1). It can be shown [3] that under exactly this condition the MLE for τ
will be this MVB estimate, and therefore bias-free for any N . The MLE for
any non-linear function of τ will in general be biased, but still optimal in the
following sense: if bias-corrected, it becomes an MV estimate, i.e. it will have
the smallest variance among all unbiased estimates.

Example 162. Efficiency of small sample MLEs

The MLE for the variance σ² of a normal distribution with known mean µ,

   σ̂² = (1/N) Σ (x_i − µ)² ,

is unbiased and efficient, reaching the MVB for all N. The MLE for σ is of course

   σ̂ = √(σ̂²) ,

according to the relation between σ and σ². It is biased and thus not efficient in the sense of the above definition. A bias-corrected estimator for σ is (see for instance [117])

   σ̂_corr = √(N/2) · Γ(N/2)/Γ((N+1)/2) · σ̂ .

This estimator can be shown to have the smallest variance of all unbiased estimators, independent of the sample size N.
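The bias of σ̂ and its removal by the Gamma-function factor can be verified numerically; in this sketch the sample size N = 5 and σ = 2 are arbitrary choices:

```python
import math, random

random.seed(5)

def sigma_estimates(N=5, trials=200000, mu=0.0, sigma=2.0):
    """Average of the MLE sigma_hat = sqrt(sum (x_i - mu)^2 / N) (known mu)
    and of the corrected estimator
    sigma_corr = sqrt(N/2) Gamma(N/2) / Gamma((N+1)/2) * sigma_hat."""
    corr = math.sqrt(N / 2.0) * math.gamma(N / 2.0) / math.gamma((N + 1) / 2.0)
    s_mle = 0.0
    for _ in range(trials):
        s_mle += math.sqrt(sum((random.gauss(mu, sigma) - mu) ** 2
                               for _ in range(N)) / N)
    s_mle /= trials
    return s_mle, corr * s_mle

mean_mle, mean_corr = sigma_estimates()
```

For N = 5 the MLE underestimates σ by about 5%, while the corrected estimator reproduces σ = 2 within the statistical accuracy.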

In the above example a one-dimensional sufficient statistic exists. If this is not the case, the question whether the MLE is optimal for small samples cannot be answered from a frequentist point of view.
In summary, also for finite N the MLE for a certain parameter achieves the optimal – from the frequentist point of view – properties of an MVB estimator, if the latter does exist. Of course these properties cannot be preserved for other parametrizations, since variance and bias are not invariant properties.

13.4 The Expectation Maximization (EM) Algorithm


The EM method finds iteratively the MLE in situations where the statistical
model depends on latent variables. The method goes back to the sixties,
has been invented several times and has been made popular by Dempster,
Laird and Rubin [70]. A very comprehensive introduction to the expectation
maximization (EM) algorithm is given in the Wikipedia article Ref. [123].
The EM algorithm exists in many different variants. We will restrict our
discussion to its application to classification problems.
To get an idea of the method, we consider a simple standard example. Let us assume that we have a sample of observations x1, . . . , xN, each drawn from one of M overlapping normal distributions f_m(x|µ_m) ∼ N(µ_m, s) with unknown mean values µ1, . . . , µM and given standard deviation s. The log-likelihood of the parameters is

   ln L(µ1, . . . , µM) = Σ_{m=1}^{M} Σ_{i=1}^{N} z_mi ln f_m(x_i|µ_m) ,

where the classification variable z_mi = 1 if x_i belongs to the normal distribution m and z_mi = 0 otherwise. If we know the classification variables, we get the MLE of the parameter µ_m:

   µ̂_m = Σ_{i=1}^{N} z_mi x_i / Σ_{i=1}^{N} z_mi .

If this is not the case, we can estimate z_mi from the observed distribution. In the EM formalism z_mi is called a missing or latent variable. We can solve our problem iteratively with two alternating steps, an expectation and a maximization step. We start with a first guess µ_m^(1) of the parameters of interest and estimate the missing data. In the expectation step k we compute the probability g_mi^(k) that x_i belongs to subdistribution m. It is proportional to the value of the distribution f_m(x_i|µ_m) at x_i:

   g_mi^(k) = f_m(x_i|µ̂_m^(k)) / Σ_{j=1}^{M} f_j(x_i|µ̂_j^(k)) .

The probability g_mi is the expected value of the latent variable z_mi. The expected log-likelihood is

   Q(µ, µ̂^(k)) = Σ_{m=1}^{M} Σ_{i=1}^{N} g_mi^(k) ln f_m(x_i|µ_m) .

In the maximization step we obtain the MLEs

   µ̂_m^(k+1) = Σ_{i=1}^{N} g_mi^(k) x_i / Σ_{i=1}^{N} g_mi^(k) ,

which are used in the following expectation step. The iteration converges to a maximum of the likelihood.
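The two alternating steps of this example can be sketched in a few lines of Python; the component means, widths and sample sizes below are invented for illustration, and equal component weights are assumed:

```python
import math, random

random.seed(6)

def em_means(x, mu0, s, steps=200):
    """EM iteration for the means of M overlapping Gaussians of common,
    known width s; equal component weights are assumed for simplicity."""
    mu = list(mu0)
    M, N = len(mu), len(x)
    for _ in range(steps):
        # expectation step: g[m][i] ~ f_m(x_i | mu_m), normalized over m
        g = [[math.exp(-0.5 * ((x[i] - mu[m]) / s) ** 2) for i in range(N)]
             for m in range(M)]
        for i in range(N):
            norm = sum(g[m][i] for m in range(M))
            for m in range(M):
                g[m][i] /= norm
        # maximization step: mu_m = sum_i g_mi x_i / sum_i g_mi
        mu = [sum(g[m][i] * x[i] for i in range(N)) / sum(g[m])
              for m in range(M)]
    return mu

# two well separated components with invented means -2 and +3
x = ([random.gauss(-2.0, 1.0) for _ in range(300)]
     + [random.gauss(3.0, 1.0) for _ in range(300)])
mu_hat = sorted(em_means(x, mu0=[-1.0, 1.0], s=1.0))
```

Starting from the deliberately poor guess (−1, 1), the iteration recovers the two component means within the statistical accuracy of the sample.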
Let us generalize this procedure. Given a probability distribution p(x, z|θ) = g(z|x, θ) p1(x|θ) depending on a parameter vector θ and a sample of observations x1, . . . , xN. The distribution g of the latent variables z is a function of θ and x.
– Expectation step: For the parameter vector θ^(k) we compute the distribution g of the hidden variables z. We form the log-likelihood function ln L(θ|x, z) which is a random variable as it depends on the random z. Averaging over z, we obtain the expected value Q(θ, θ̂^(k)) of the log-likelihood:

   Q(θ, θ̂^(k)) = E_{z|x,θ^(k)} [ln L(θ|x, z)] .

  The conditional expectation means that we average over z with the distribution g(z) obtained for fixed values x and θ̂^(k):

   Q(θ, θ̂^(k)) = ∫ ln L(θ|x, z) g(z|x, θ̂^(k)) dz .

  If the values of the vector components z are discrete, the integral is replaced by a sum over all J possible values:

   Q(θ, θ̂^(k)) = Σ_{j=1}^{J} ln L(θ|x, z_j) g(z_j|x, θ̂^(k)) .

  Alternatively, somewhat less efficiently, we can insert the expected values of the latent variables:

   Q(θ, θ̂^(k)) = ln L(θ|x, E(z)) .

– Maximization step: The MLE θ^(k+1) is computed:

   θ^(k+1) = arg max_θ Q(θ, θ̂^(k)) .

The procedure is started with a first guess θ^(1) of the parameters and iterated. It converges to a maximum of the log-likelihood. To avoid that the iteration is caught by a local maximum, different starting values can be selected. The method is especially useful in classification problems in connection with p.d.f.s of the exponential family⁵ where the maximization step is relatively simple.

⁵ To the exponential family belong, among others, the normal, Poisson, exponential, gamma and chi-squared distributions.

Example 163. Unfolding a histogram

Experimental data are collected in form of a histogram with N bins. The number of events in bin i be d_i. The experiment suffers from an imperfect resolution and from acceptance losses which we have to correct for. The "true" histogram with M bins contains θ_j events in bin j. Knowing the measurement device, we can simulate the experimental effects and determine the matrix A which relates θ to the expected values of the numbers d: E(d_i) = Σ_{j=1}^{M} A_ij θ_j. The element A_ij is the probability to observe an event in bin i which belongs to the bin j of the true histogram. The missing information is the number of events d_ij in an observed bin i that belong to the true bin j. Hence there are M missing variables per bin. The number d_ij is Poisson distributed with mean A_ij θ_j. The likelihood depends only on the hidden variables:

   ln L(θ|d_11, . . . , d_NM) = Σ_{j=1}^{M} Σ_{i=1}^{N} [−A_ij θ_j + d_ij ln(A_ij θ_j)] .

The following alternating steps are repeated:

– Expectation step: We have

   Q(θ, θ̂^(k)) = E[ln L] = Σ_{j=1}^{M} Σ_{i=1}^{N} [−A_ij θ_j + E(d_ij^(k)) ln(A_ij θ_j)] .

  The expected value E(d_ij^(k)), conditioned on d_i and θ̂^(k), is given by d_i times the probability that an event of bin i belongs to true bin j:

   E(d_ij^(k)) = d_i A_ij θ̂_j^(k) / Σ_{m=1}^{M} A_im θ̂_m^(k) .

  We get

   Q(θ, θ̂^(k)) = Σ_{j=1}^{M} Σ_{i=1}^{N} [−A_ij θ_j + d_i (A_ij θ̂_j^(k) / Σ_{m=1}^{M} A_im θ̂_m^(k)) ln(A_ij θ_j)] .

– Maximization step:
  The computation of the maximum of Q is easy, because the components of the parameter vector θ appear in independent summands:

   ∂Q/∂θ_j = Σ_{i=1}^{N} [−A_ij + d_i (A_ij θ̂_j^(k) / Σ_{m=1}^{M} A_im θ̂_m^(k)) (1/θ_j)] = 0 ,

   θ̂_j^(k+1) Σ_{i=1}^{N} A_ij = Σ_{i=1}^{N} d_i A_ij θ̂_j^(k) / Σ_{m=1}^{M} A_im θ̂_m^(k) ,

   θ̂_j^(k+1) = (1/α_j) Σ_{i=1}^{N} d_i A_ij θ̂_j^(k) / Σ_{m=1}^{M} A_im θ̂_m^(k) ,

  where α_j = Σ_{i=1}^{N} A_ij is the average acceptance of the events in the true bin j.
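The update rule just derived is easy to code. In the following sketch the 3×3 response matrix and the true bin contents are invented toy numbers, and the observed histogram is set to its exact expectation E(d) = Aθ, so the iteration should return the true values:

```python
def em_unfold(A, d, steps=5000):
    """EM unfolding iteration derived above:
    theta_j <- (1/alpha_j) sum_i d_i A_ij theta_j / sum_m A_im theta_m,
    with alpha_j = sum_i A_ij the acceptance of true bin j."""
    N, M = len(A), len(A[0])
    alpha = [sum(A[i][j] for i in range(N)) for j in range(M)]
    theta = [sum(d) / M] * M                     # flat starting histogram
    for _ in range(steps):
        est = [sum(A[i][m] * theta[m] for m in range(M)) for i in range(N)]
        theta = [sum(d[i] * A[i][j] / est[i] for i in range(N))
                 * theta[j] / alpha[j] for j in range(M)]
    return theta

# invented toy response: 80% stay in their bin, 10% leak to each neighbour
A = [[0.8, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.8]]
true = [100.0, 50.0, 200.0]
d = [sum(A[i][j] * true[j] for j in range(3)) for i in range(3)]  # exact E(d)
theta_hat = em_unfold(A, d)
```

With noisy data the iteration would have to be stopped early or regularized, as discussed in the unfolding chapter; here, with exact expectations, it converges to the true histogram.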

13.5 Consistency of the Background Contaminated


Parameter Estimate and its Error

In order to calculate the additional uncertainty of a parameter estimate due to the presence of background, if the latter is taken from a reference experiment in the way described in Sect. 7.4, we consider the general definition of the pseudo log-likelihood

   ln L̃ = Σ_{i=1}^{N} ln f(x_i|θ) − r Σ_{i=1}^{M} ln f(x′_i|θ) ,

restricting ourselves at first to a single parameter θ, see (7.18). The generalization to multi-dimensional parameter spaces is straightforward. From ∂ ln L̃/∂θ|_θ̂ = 0, we find

   [Σ_{i=1}^{S} ∂ ln f(x_i^(S)|θ)/∂θ + Σ_{i=1}^{B} ∂ ln f(x_i^(B)|θ)/∂θ − r Σ_{i=1}^{M} ∂ ln f(x′_i|θ)/∂θ]_θ̂ = 0 .

This formula defines the background-corrected estimate θ̂. It differs from the “ideal” estimate θ̂^(S) which would be obtained in the absence of background, i.e. by equating to zero the first sum on the left hand side. Writing θ̂ = θ̂^(S) + ∆θ̂ in the first sum, and Taylor expanding it up to the first order, we get

   Σ_{i=1}^{S} ∂² ln f(x_i^(S)|θ)/∂θ²|_{θ̂^(S)} ∆θ̂ + [Σ_{i=1}^{B} ∂ ln f(x_i^(B)|θ)/∂θ − r Σ_{i=1}^{M} ∂ ln f(x′_i|θ)/∂θ]_θ̂ = 0 .   (13.18)
The first sum, if taken with a minus sign, is the Fisher information of the signal sample on θ; the sum itself equals −1/var(θ̂^(S)) asymptotically. The approximation relies on the assumption that Σ ln f(x_i^(S)|θ) is parabolic in the region θ̂^(S) ± ∆θ̂. Then we have

   ∆θ̂ ≈ var(θ̂^(S)) [Σ_{i=1}^{B} ∂ ln f(x_i^(B)|θ)/∂θ − r Σ_{i=1}^{M} ∂ ln f(x′_i|θ)/∂θ]_θ̂ .   (13.19)

We take the expected value with respect to the background distribution and obtain

   ⟨∆θ̂⟩ = var(θ̂^(S)) ⟨B − rM⟩ ⟨∂ ln f(x|θ)/∂θ|_θ̂⟩ .

Since ⟨B − rM⟩ = 0, the background correction is asymptotically bias-free.
Squaring (13.19), and writing the summands in short as y_i, y′_i, we get

   (∆θ̂)² = (var(θ̂^(S)))² [Σ_{i=1}^{B} y_i − r Σ_{i=1}^{M} y′_i]² ,

   [· · ·]² = Σ_{i}^{B} Σ_{j}^{B} y_i y_j + r² Σ_{i}^{M} Σ_{j}^{M} y′_i y′_j − 2r Σ_{i}^{B} Σ_{j}^{M} y_i y′_j
           = Σ_{i}^{B} y_i² + r² Σ_{i}^{M} y′_i² + Σ_{i}^{B} Σ_{j≠i}^{B} y_i y_j + r² Σ_{i}^{M} Σ_{j≠i}^{M} y′_i y′_j − 2r Σ_{i}^{B} Σ_{j}^{M} y_i y′_j ,

   ⟨(∆θ̂)²⟩ = (var(θ̂^(S)))² [⟨B + r²M⟩⟨(y_i − ⟨y_i⟩)²⟩ + ⟨(B − rM)²⟩⟨y_i⟩²] .   (13.20)

We have approximated M − 1 by M and B − 1 by B.
In physics experiments, the event numbers M, B, S fluctuate independently according to Poisson distributions with expected values ⟨M⟩ = ⟨B⟩/r and ⟨S⟩. Then ⟨B + r²M⟩ = ⟨B⟩(1 + r) and

   ⟨(B − rM)²⟩ = ⟨B²⟩ + r²⟨M²⟩ − 2r⟨B⟩⟨M⟩ = ⟨B⟩ + r²⟨M⟩ = ⟨B⟩(1 + r) .

Adding the contribution from the uncontaminated estimate, var(θ̂^(S)), to (13.20) leads to the final result

   var(θ̂) = var(θ̂^(S)) + ⟨(∆θ̂)²⟩
           = var(θ̂^(S)) + (1 + r)(var(θ̂^(S)))² ⟨B⟩⟨y²⟩   (13.21)
           = var(θ̂^(S)) + r(1 + r)(var(θ̂^(S)))² ⟨M⟩⟨y²⟩ .

The factor (var(θ̂^(S)))² is proportional to 1/S². Thus asymptotically we get var(θ̂) = var(θ̂^(S)): the estimate obtained from the pseudo-likelihood is consistent.
To estimate the uncertainty of θ̂ we replace the expected values of M, y and y² by their empirical values:

   ⟨M⟩ → M ,  ⟨y⟩ → Σ_{i=1}^{M} y_i/M ,  ⟨y²⟩ → Σ_{i=1}^{M} y_i²/M ,

where y_i = ∂ ln f(x′_i|θ)/∂θ. As usual in error calculation, the dependence of y_i on the true value of θ has to be approximated by a dependence on the estimated value θ̂. Similarly, we approximate var(θ̂^(S)):

   −1/var(θ̂^(S)) = Σ_{i=1}^{S} ∂² ln f(x_i|θ)/∂θ²|_{θ̂^(S)}
                 ≈ [Σ_{i=1}^{N} ∂² ln f(x_i|θ)/∂θ² − r Σ_{i=1}^{M} ∂² ln f(x′_i|θ)/∂θ²]_θ̂ .

We realize from (13.21) that it is advantageous to take a large reference sample, i.e. r small. The variance ⟨(∆θ̂)²⟩ increases with the square of the error of the uncontaminated sample. Via the quantity ⟨y²⟩ it depends also on the shape of the background distribution.
For a P-dimensional parameter space θ we see from (13.18) that the first sum is given by the weight matrix V^(S) of the estimated parameters in the absence of background:

   −Σ_{l=1}^{P} Σ_{i=1}^{S} ∂² ln f(x_i^(S)|θ)/(∂θ_k ∂θ_l)|_{θ̂^(S)} ∆θ̂_l = Σ_{l=1}^{P} (V^(S))_kl ∆θ̂_l .

Solving the linear equation system for ∆θ̂ and constructing from its components the error matrix E, we find in close analogy to the one-dimensional case

   E = C^(S) Y C^(S) ,

with C^(S) = (V^(S))^{−1} being the covariance matrix of the background-free estimates and Y defined as

   Y_kl = r(1 + r)⟨M⟩⟨y_k y_l⟩ ,


with y_k = y_k(x′_i) shorthand for ∂ ln f(x′_i|θ)/∂θ_k. As in the one-dimensional case, the total covariance matrix of the estimated parameters is the sum

   cov(θ̂_k, θ̂_l) = C_kl^(S) + E_kl .
The following example illustrates the error due to background contami-
nation for the above estimation method.

Example 164. Parameter uncertainty for background contaminated signals


We investigate how well our asymptotic error formula works in a specific
example. To this end, we consider a Gaussian signal distribution with width
unity and mean zero over a background modeled by an exponential distribu-
tion with decay constant γ = 0.2 of the form c exp[−γ(x + 4)] where both
distributions are restricted to the range [−4, 4]. The numbers of signal events
S, background events B and reference events M follow Poisson distributions
with mean values hSi = 60, hBi = 40 and hM i = 100. This implies a cor-
rection factor r = hBi/hM i = 0.4 for the reference experiment. From 104
MC experiments we obtain a distribution of µ̂, with mean value and width
0.019 and 0.34, respectively. The pure signal µ̂^(S) has mean and width 0.001 and 0.13 (= 1/√60). From our asymptotic error formula (13.21) we derive an error of 0.31, slightly smaller than the MC result. The discrepancy, which is typical for Poisson fluctuations, will be larger for lower statistics.
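The setup of this example can be simulated compactly when the maximum of ln L̃ is computed in closed form, which is possible for a Gaussian f if the truncation of the signal density to [−4, 4] is neglected (an approximation; the event numbers follow the example, the seed and number of toy experiments are arbitrary):

```python
import math, random, statistics

random.seed(7)

def poisson(lam):
    """Knuth's product method, adequate for these small means."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def bkg():
    """Background ~ exp(-0.2 (x + 4)) truncated to [-4, 4]."""
    while True:
        x = -4.0 + random.expovariate(0.2)
        if x <= 4.0:
            return x

def one_experiment(mean_s=60.0, mean_b=40.0, mean_m=100.0):
    """Pseudo-likelihood estimate of the Gaussian signal mean. For a
    Gaussian f (truncation neglected) the maximum of ln L-tilde is linear
    in the data: mu_hat = (sum x_i - r sum x'_i) / (N - r M)."""
    r = mean_b / mean_m
    S, B, M = poisson(mean_s), poisson(mean_b), poisson(mean_m)
    x = [random.gauss(0.0, 1.0) for _ in range(S)] + [bkg() for _ in range(B)]
    xr = [bkg() for _ in range(M)]
    return (sum(x) - r * sum(xr)) / (len(x) - r * M)

mu_hats = [one_experiment() for _ in range(2000)]
mu_mean = statistics.fmean(mu_hats)
mu_std = statistics.stdev(mu_hats)
```

The resulting mean and width of the µ̂ distribution are of the same size as the values quoted in the example, illustrating both the near absence of bias and the inflation of the error by the background subtraction.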

13.6 Frequentist Confidence Intervals


We associate error intervals with measurements to indicate that the parameter of interest has a reasonably high probability to be located inside the interval. However, to compute this probability a prior probability has to be introduced, with the problems which we have discussed in Sect. 6.1. To circumvent this problem, J. Neyman has proposed a method to construct intervals without using prior probabilities. Unfortunately, as is often the case, one problem is traded for another one.
Neyman's confidence intervals have the following defining property: The true parameter lies in the interval on the average in the fraction C of intervals of confidence level C. In other words: Given a true value θ, a measurement t will include it in its associated confidence interval [t1, t2] – “cover” it – with probability C. (Note that this does not necessarily imply that given a certain confidence interval the true value is included in it with probability C.)
Traditionally chosen values for the confidence level are 68.3%, 90%, 95%
– the former corresponds to the standard error interval of the normal distri-
bution.


Fig. 13.1. Confidence belt. The shaded area is the confidence belt, consisting of
the probability intervals [t1 (θ), t2 (θ)] for the estimator t. The observation t = 4
leads to the confidence interval [θmin , θmax ].

Confidence intervals are constructed in the following way:
For each parameter value θ a probability interval [t1(θ), t2(θ)] is defined, such that the probability that the observed value t of θ is located in the interval is equal to the confidence level C:

   P{t1(θ) ≤ t ≤ t2(θ)} = ∫_{t1}^{t2} f(t|θ)dt = C .   (13.22)

Of course the p.d.f. f(t|θ), or error distribution of the estimator t, must be known. To fix the interval completely, an additional condition is applied. In the univariate case, a common procedure is to choose central intervals,

   P{t < t1} = P{t > t2} = (1 − C)/2 .
Other conventions are minimum length and equal probability intervals de-
fined by f (t1 ) = f (t2 ). The confidence interval consists of those parameter
values which include the measurement t̂ within their probability intervals.
Somewhat simplified: Parameter values are accepted, if the observation is
compatible with them.
The one-dimensional case is illustrated in Fig. 13.1. The pair of curves t = t1(θ), t = t2(θ) in the (t, θ)-plane comprise the so-called confidence belt. To the measurement t̂ = 4 then corresponds the confidence interval [θmin, θmax], obtained by inverting the relations t1,2(θmax,min) = t̂, i.e. the section of the straight line t = t̂ parallel to the θ axis.
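The defining coverage property of (13.22) can be checked by simulation. For a Gaussian estimator with known width σ the central 68.3% probability interval is t ± σ, so the confidence interval [t − σ, t + σ] should cover the true value in that fraction of experiments (the true value below is an arbitrary choice):

```python
import random

random.seed(8)

def coverage(theta=3.7, sigma=1.0, trials=100000):
    """Fraction of experiments in which the central interval
    [t - sigma, t + sigma] covers the true value theta; for a Gaussian
    estimator this is by construction the confidence level C = 68.3%."""
    covered = sum(1 for _ in range(trials)
                  if abs(random.gauss(theta, sigma) - theta) <= sigma)
    return covered / trials

cov = coverage()
```

The result does not depend on the chosen true value, which is exactly the frequentist coverage guarantee described above.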

Fig. 13.2. Confidence interval. The shaded area is the confidence region for the two-
dimensional measurement (θ̂1, θ̂2 ). The dashed curves indicate probability regions
associated to the locations denoted by capital letters.

The construction shown in Fig. 13.1 is not always feasible: It has to be


assumed that t1,2 (θ) are monotone functions. If the curve t1 (θ) has a maxi-
mum say at θ = θ0 , then the relation t1 (θ) = t̂ cannot always be inverted: For
t̂ > t1 (θ0 ) the confidence belt degenerates into a region bounded from below,
while for t̂ < t1 (θ0 ) there is no unique solution. In the first case one usually
declares a lower confidence bound as an infinite interval bounded from below.
In the second case one could construct a set of disconnected intervals, some
of which may be excluded by other arguments.
The construction of the confidence contour in the two-parameter case is
illustrated in Fig. 13.2 where for simplicity the parameter and the observa-
tion space are chosen such that they coincide. For each point θ1 , θ2 in the
parameter space we fix a probability contour which contains a measurement
of the parameters with probability C. Those parameter points with proba-
bility contours passing through the actual measurement θ̂1 , θ̂2 are located at
the confidence contour. All parameter pairs located inside the shaded area
contain the measurement in their probability region.
Frequentist statistics avoids prior probabilities. This feature, while desirable in general, can have negative consequences if prior information exists. This is the case if the parameter space is constrained by mathematical


or physical conditions. In frequentist statistics it is not possible to exclude
unphysical parameter values without introducing additional complications.
Thus, for instance, a measurement could lead for a mass to a 90% confidence
interval which is situated completely in the negative region, or for an angle
to a complex angular region. The problem is mitigated somewhat by a newer
method [118], but not without introducing other complications [119], [120].

13.7 Comparison of Different Inference Methods


13.7.1 Examples

Before we compare the different statistical philosophies let us look at a few


examples.

Example 165. Performance of magnets


A company produces magnets which have to satisfy the specified field strength within certain tolerances. The various measurements performed by the company are fed into a fitting procedure producing 99% confidence intervals which are used to accept or reject the product before sending it off. The client is able to repeat the measurement with high precision and accepts only magnets within the agreed specification. To calculate the price, the company must rely on the condition that the confidence interval in fact covers the nominal value with the presumed confidence level.

Example 166. Bias in the mass determination of a resonance


The mass and the width of a strongly decaying particle are determined from the mass distribution of many events. Somewhat simplified, the mass is computed in each event from the energy E and the momentum p, mc² = √(E² − p²c²). A bias in the momentum fit has to be avoided, because it would lead to a systematic shift of the resulting mass estimate.

Example 167. Inference with known prior


We repeat an example presented in Sect. 6.2.2. In the reconstruction of
a specific, very interesting event, for instance a SUSY candidate, we have to
infer the distance θ between the production and decay vertices of an unstable
particle produced in the reaction. From its momentum and its known mean
434 13 Appendix

life we calculate its expected decay length λ. The prior density for the actual
decay length θ is π(θ) = exp(−θ/λ)/λ. The experimental distance measure-
ment which follows a Gaussian with standard deviation s yields d. According
to (6.2.2), the p.d.f. for the actual distance is given by
2 2
e(−(d−θ) )/(2s ) e−θ/λ
f (θ|d) = R ∞ −(d−θ)2 /(2s2 ) −θ/λ .
0
e e dθ

This is an ideal situation. We can determine for instance the mean value and
the standard deviation or the mode of the θ distribution and an asymmetric
error interval with well defined probability content, for instance 68.3%. The
confidence level is of no interest and due to the application of the prior the
estimate of θ is biased, but this is irrelevant.
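The posterior density above can be evaluated numerically on a grid. A minimal sketch with illustrative values for λ, s and d (these numbers are assumptions, not from the text):

```python
import numpy as np

# Posterior f(theta|d) is proportional to
# exp(-(d - theta)^2 / (2 s^2)) * exp(-theta / lam) for theta >= 0.
# Illustrative values: expected decay length lam, Gaussian resolution s,
# measured distance d.
lam, s, d = 10.0, 2.0, 5.0

theta = np.linspace(0.0, 100.0, 200001)            # integration grid
dtheta = theta[1] - theta[0]
w = np.exp(-(d - theta) ** 2 / (2 * s ** 2) - theta / lam)
post = w / (w.sum() * dtheta)                      # normalized posterior density

mode = theta[np.argmax(post)]                      # posterior mode
mean = (theta * post).sum() * dtheta               # posterior mean
```

For a Gaussian likelihood and an exponential prior, the posterior is a Gaussian truncated at θ = 0 with its center shifted to d − s²/λ, so the mode here lies at 4.6 and the mean slightly above it; both are shifted to lower values relative to d, illustrating the bias introduced by the prior.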

Example 168. Bias introduced by a prior


We now modify and extend our example. Instead of the decay length we
discuss the lifetime of the particle. The reasoning is the same, we can apply
the prior and determine an estimate and an error interval. We now study
N decays, to improve our knowledge of the mean lifetime τ of the particle
species. For each individual decay we use a prior with an estimate of τ as
known from previous experiments, determine each time the lifetime t̂i and the mean value t̄ = Σ t̂i /N from all measurements. Even though the individual time estimates are improved by applying the prior, the average t̄ is a very bad estimate of τ because the t̂i are biased towards low values and consequently also their mean value is shifted. (Note that in this and in the second example we have two types of parameters which have to be distinguished. Here we discuss the effect of a bias on the primary parameter set.)

Example 169. Comparing predictions with strongly differing accuracies: earthquake

Two theories H1, H2 predict the time θ of an earthquake. The predictions differ in the expected values as well as in the size of the Gaussian errors:
H1 : θ1 = (7.50 ± 2.25) h ,
H2 : θ2 = (50 ± 100) h .
To keep the discussion simple, we do not exclude negative times t. The earthquake then takes place at time t = 10 h. In Fig. 13.3 both hypothetical distributions are shown in logarithmic form together with the actually observed
Fig. 13.3. Two hypotheses compared to an observation. The likelihood ratio supports hypothesis 1 while the distance in units of st. dev. supports hypothesis 2.

time. The first prediction H1 differs by more than one standard deviation from
the observation, prediction H2 by less than one standard deviation. Is then
H2 the more probable theory? Well, we cannot attribute probabilities to the theories, but the likelihood ratio R, which here has the value R = 26, strongly supports hypothesis H1. We could, however, also consider both hypotheses
as special cases of a third general theory with the parametrization

    f(t) = \frac{25}{\sqrt{2\pi}\,\theta^2} \exp\left( -\frac{625\,(t-\theta)^2}{2\theta^4} \right)
and now try to infer the parameter θ and its error interval. The observation
produces the likelihood function shown in the lower part of Fig. 13.3. The
usual likelihood ratio interval contains the parameter θ1 and excludes θ2
while the frequentist standard confidence interval [7.66, ∞] would lead to the reverse conclusion, which contradicts the likelihood ratio result and also our intuitive conclusion.
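The likelihood ratio quoted above can be verified directly; a small sketch:

```python
import math

def gauss(x, mu, sigma):
    """Gaussian density N(x | mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

t_obs = 10.0                          # observed time of the earthquake
L1 = gauss(t_obs, 7.5, 2.25)          # likelihood under H1
L2 = gauss(t_obs, 50.0, 100.0)        # likelihood under H2
R = L1 / L2                           # likelihood ratio, about 26 in favor of H1
```

The observation lies 1.11 standard deviations from the H1 prediction but only 0.4 from H2, yet the likelihood ratio favors H1 by a factor of about 26, because the broad H2 prediction assigns a small density everywhere.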

The presented examples indicate that, depending on the kind of problem, different statistical methods are to be applied.

13.7.2 The Frequentist Approach

The frequentist approach emphasizes efficiency, unbiasedness and coverage. These quantities are defined through the expected fluctuations of the parameter of interest given its true value. To compute these quantities we need to know the full p.d.f. Efficiency and bias are related to point estimation.
A bias has to be avoided whenever we average over several estimates like in
the second and the fourth example. Frequentist interval estimation guarantees
coverage6. A producer of technical items has to guarantee that a certain frac-
tion of a sample fulfils given tolerances. He will choose, for example, a 99 %
confidence interval for the lifetime of a bulb and then be sure that complaints
will occur only in 1 % of the cases. Insurance companies want to estimate
their risk and thus have to know, how frequently damage claims will occur.
Common to frequentist parameter inference (examples 1, 2 and 4) is that
we are interested in the properties of a set of parameter values. The parameters are associated with many different objects, events or accidents, e.g. the magnet strengths, momenta of different events or individual lifetimes. Here coverage and unbiasedness are essential and efficiency is an important quantity. As seen in the fourth example, the application of prior information – even when it is known exactly – would be destructive. Physicists usually impose transformation invariance on important parameters (the estimates of the lifetime τ̂ and the decay rate γ̂ of a particle should satisfy γ̂ = 1/τ̂, but only one of these two parameters can be unbiased), but in many situations the fact that bias and efficiency are not invariant under parameter transformations does not matter. In a business contract in the bulb example, the agreement would be on the lifetime and the decay rate would be of no interest. The combination of the results from different measurements is difficult but mostly of minor interest.

13.7.3 The Bayesian Approach

Bayesian statistics defines probability or credibility intervals. The interest is directed towards the true value given the observed data. As the probability of data that are not observed is irrelevant, the p.d.f. is not needed; the likelihood principle applies, and only the prior and the likelihood function are
6 In frequentist statistics, point and interval estimation are unrelated.
relevant. The Bayesian approach is justified if we are interested in a constant of nature, a particle mass, a coupling constant or in a parameter describing
a unique event like in examples three and five. In these situations we have to
associate to the measurement an error in a consistent way. Point and interval
estimation cannot be treated independently. Coverage and bias are of no im-
portance – in fact it does not make much sense to state that a certain fraction
of physical constants are covered by their error intervals and it is of no use to
know that out of 10 measurements of a particle mass one has to expect that
about 7 contain the true value within their error intervals. Systematic errors
and nuisance parameters for which no p.d.f. is available can only be treated
in the Bayesian framework.
The drawback of the Bayesian method is the need to invent a prior prob-
ability. In example three the prior is known but this is one of the rare cases.
In the fifth example, like in many other situations, a uniform prior would be
acceptable to most scientists and then the Bayesian interval would coincide
with a likelihood ratio interval.

13.7.4 The Likelihood Ratio Approach

To avoid the introduction of prior probabilities, physicists are usually satisfied with the information contained in the likelihood function. In most cases
the MLE and the likelihood ratio error interval are sufficient to summarize
the result. Contrary to the frequentist confidence interval this concept is
compatible with the maximum likelihood point estimation as well as with
the likelihood ratio comparison of discrete hypotheses and allows combining results in a consistent way. As in the Bayesian method, parameter transformation invariance holds. However, there is no coverage guarantee and an
interpretation in terms of probability is possible only for small error intervals,
where prior densities can be assumed to be constant within the accuracy of
the measurement.

13.7.5 Conclusion

The choice of the statistical method has to be adapted to the concrete application. The frequentist reasoning is relevant in rare situations like event selection, where coverage could be of some importance, or when secondary statistics is performed with estimated parameters. In some situations Bayesian tools are required to arrive at sensible results. In all other cases the likelihood function or, as a summary of it, the MLE and a likelihood ratio interval are the best choice.

13.7.6 Consistency, Efficiency, Bias

These properties are related to important issues in frequentist statistics and of limited interest in the Bayesian and the likelihood ratio approaches. Since the latter rely on the likelihood principle, they base parameter and interval inference solely on the likelihood function, and these properties cannot and need not be considered. Nevertheless it is of some interest to investigate how classical statistics reacts to the MLE, and it is reassuring that asymptotically, for large samples, the frequentist approach is in accordance with the
likelihood ratio method. This manifests itself in the consistency of the MLE.
Also for small samples, the MLE has certain optimal frequentist properties,
but there the methods provide different solutions.
Efficiency is defined through the variance of the estimator for given values
of the true parameter (independent of the measured value). In inference prob-
lems, however, the true value is unknown and of interest is the deviation of
the true parameter from a given estimate. Efficiency is not invariant against
parameter transformation. For example, the MLE of the lifetime θ̂ with an
exponential decay distribution is an efficient estimator while the MLE of the
decay rate γ̂ = 1/θ̂ is not.
Similar problems exist for the bias which also depends on the parameter
metric. Frequentists usually correct estimates for a bias. This is justified
again in commercial applications, where many replicates are considered. If in
a long-term business relation the price for a batch of some goods is agreed
to be proportional to some product quality (weight, mean lifetime...) which
is estimated for each delivery from a small sample, this estimate should be
unbiased, as otherwise gains and losses due to statistical fluctuations would
not cancel in the long run. It does not matter here that the quantity bias
is not invariant against parameter transformations. In the business example
the mentioned agreement would be on weight or on size and not on both.
In the usual physics application where experiments determine constants of
nature, the situation is different, there is no justification for bias corrections,
and invariance is an important issue.
Somewhat inconsistent in the frequentist approach is that confidence in-
tervals are invariant against parameter transformations while efficiency and
bias are not and that the aim for efficiency supports the MLE for point esti-
mation which goes along with likelihood ratio intervals and not with coverage
intervals.

13.8 p-values for EDF-Statistics


The formulas reviewed here are taken from the book of D’Agostino and
Stephens [121] and generalized to include the case of the two-sample com-
parison.

Calculation of the Test Statistics

The calculation of the supremum statistics D and of V = D⁺ + D⁻ is simple enough, so we will skip a further discussion.
13.8 p-values for EDF-Statistics 439

The quadratic statistics W², U² and A² are calculated after a probability integral transformation (PIT). The PIT transforms the expected theoretical distribution of x into a uniform distribution. The new variate z is found from the relation z = F(x), where F is the integral distribution function of x.
With the transformed observations zi, ordered according to increasing values, we get for W², U² and A²:

    W^2 = \frac{1}{12N} + \sum_{i=1}^{N} \left( z_i - \frac{2i-1}{2N} \right)^2 ,    (13.23)

    U^2 = W^2 - N \left( \bar{z} - \frac{1}{2} \right)^2 ,

    A^2 = -N - \frac{1}{N} \sum_{i=1}^{N} (2i-1) \left( \ln z_i + \ln(1 - z_{N+1-i}) \right) .    (13.24)

If we know the distribution function only through a Monte Carlo simulation but not analytically, the z-value for an observation x is approximately z ≈ (number of Monte Carlo observations with xMC < x)/(total number of Monte Carlo observations). (Somewhat more accurate is an interpolation.) For the comparison with a simulated distribution, N is to be taken as the equivalent number of observations,

    \frac{1}{N} = \frac{1}{N_{exp}} + \frac{1}{N_{MC}} .

Here Nexp and NMC are the experimental and the simulated sample sizes, respectively.
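A sketch of the computation of W² and A² from the PIT-transformed observations, following the conventions of (13.23) and (13.24); the function name is illustrative:

```python
import math

def edf_statistics(z):
    """W^2 (Cramer-von Mises) and A^2 (Anderson-Darling) for PIT-transformed
    observations z_i = F(x_i); the values are sorted internally."""
    z = sorted(z)
    n = len(z)
    # eq. (13.23); index i runs from 0, hence (2i+1)/(2n)
    w2 = 1.0 / (12 * n) + sum((z[i] - (2 * i + 1) / (2.0 * n)) ** 2 for i in range(n))
    # eq. (13.24)
    s = sum((2 * i + 1) * (math.log(z[i]) + math.log(1.0 - z[n - 1 - i]))
            for i in range(n))
    a2 = -n - s / n
    return w2, a2
```

Both statistics depend only on the ordered set of z-values, so the input may be given in any order.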

Calculation of p-values

After normalizing the test variables with appropriate powers of N, they follow p.d.f.s which are independent of N. The test statistics D*, W²*, A²* modified in this way are defined by the following empirical relations:

    D^* = D \left( \sqrt{N} + 0.12 + \frac{0.11}{\sqrt{N}} \right) ,    (13.25)

    W^{2*} = \left( W^2 - \frac{0.4}{N} + \frac{0.6}{N^2} \right) \left( 1.0 + \frac{1.0}{N} \right) ,    (13.26)

    A^{2*} = A^2 .    (13.27)

The relation between these modified statistics and the p-values is given
in Fig. 13.4.
Fig. 13.4. p-values for empirical test statistics.



13.9 Fisher–Yates shuffle


Given a set of n numbered elements, an element j is randomly selected from the first n elements (j = n is allowed, so an element may also keep its place). The elements n and j are exchanged. Then a new element j′ is randomly selected from the first n − 1 elements of the modified arrangement and exchanged with the element n − 1. The procedure is continued until the beginning of the queue is reached.
The time for the shuffle is O(n).
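The procedure can be written down in a few lines; a sketch:

```python
import random

def fisher_yates(a, rng=random):
    """In-place Fisher-Yates shuffle, O(n): walk from the last position down
    to the second and exchange element i with a randomly selected element
    j <= i (j = i is allowed, so an element may also stay in place)."""
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)          # uniform in {0, ..., i}
        a[i], a[j] = a[j], a[i]
    return a
```

Allowing j to equal the current position is essential: drawing j only from the positions before it would never leave an element in place and would produce only cyclic permutations, i.e. a biased shuffle.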

13.10 Comparison of Histograms Containing Weighted Events
In the main text we have treated goodness-of-fit and parameter estimation
from a comparison of two histograms in the simple situation where the statistical errors of one of the histograms (generated by Monte Carlo simulation) were negligible compared to the uncertainties of the other histogram. Here
we take the errors of both histograms into account and also permit that the
histogram bins contain weighted entries.

13.10.1 Comparison of two Poisson Numbers with Different Normalization

We compare cn n with cm m, where the normalization constants cn, cm are known and n, m are Poisson distributed. Only cn/cm matters, and for example cn could be set equal to one, but we prefer to keep both constants because then the formulas are more symmetric. The null hypothesis H0 is that n is drawn from a distribution with mean λ/cn and m from a distribution with mean λ/cm. We form a χ² expression

    \chi^2 = \frac{(c_n n - c_m m)^2}{\delta^2}    (13.28)

where the denominator δ² is the expected variance of the parenthesis in the numerator under the null hypothesis. To compute δ we have to estimate λ. The p.d.f. of n and m is P(n|λ/cn) P(m|λ/cm), leading to the corresponding log-likelihood of λ,

    \ln L(\lambda) = n \ln\frac{\lambda}{c_n} - \frac{\lambda}{c_n} + m \ln\frac{\lambda}{c_m} - \frac{\lambda}{c_m} + \mathrm{const.}

with the MLE

    \hat\lambda = \frac{n+m}{1/c_n + 1/c_m} = c_n c_m \frac{n+m}{c_n + c_m} .    (13.29)
Assuming now that n is distributed according to a Poisson distribution with mean n̂ = λ̂/cn and m, respectively, with mean m̂ = λ̂/cm, we find

    \delta^2 = c_n^2 \hat{n} + c_m^2 \hat{m} = (c_n + c_m)\,\hat\lambda = c_n c_m (n + m)

and inserting the result into (13.28), we obtain

    \chi^2 = \frac{1}{c_n c_m} \frac{(c_n n - c_m m)^2}{n + m} .    (13.30)
As mentioned, only the relative normalization cn /cm is relevant.
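Eq. (13.30) in code; a minimal sketch:

```python
def poisson_chi2(n, m, cn=1.0, cm=1.0):
    """Chi-square comparison of two Poisson numbers n, m with known
    normalization constants cn, cm, eq. (13.30)."""
    return (cn * n - cm * m) ** 2 / (cn * cm * (n + m))
```

Scaling both normalization constants by a common factor leaves χ² unchanged, reflecting that only the ratio cn/cm is relevant.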

13.10.2 Comparison of Weighted Sums

When we compare experimental data to a Monte Carlo simulation, the simulated events frequently are weighted. We generalize our result to the situation where both numbers n, m consist of sums of weights vk and wk, respectively: n = Σ vk, m = Σ wk. In Appendix 13.11.1 it is shown that the sum of weights for not too small event numbers can be approximated by a scaled Poisson distribution and that this approximation is superior to the approximation with a normal distribution. Now the equivalent numbers of unweighted events ñ and m̃,

    \tilde{n} = \frac{\left[\sum v_k\right]^2}{\sum v_k^2} , \qquad \tilde{m} = \frac{\left[\sum w_k\right]^2}{\sum w_k^2} ,    (13.31)
are approximately Poisson distributed. We simply have to replace (13.30) by

    \chi^2 = \frac{1}{\tilde{c}_n \tilde{c}_m} \frac{(\tilde{c}_n \tilde{n} - \tilde{c}_m \tilde{m})^2}{\tilde{n} + \tilde{m}}    (13.32)
where now c̃n, c̃m are the relative normalization constants for the equivalent numbers of events. We briefly summarize the relevant relations, assuming as before that cn n is supposed to agree with cm m. As discussed in 3.7.3 we find, with c̃n ñ = cn n, c̃m m̃ = cm m:

    \tilde{c}_n = c_n \frac{\sum v_k^2}{\sum v_k} , \qquad \tilde{c}_m = c_m \frac{\sum w_k^2}{\sum w_k} .    (13.33)

13.10.3 χ2 of Histograms

We have to evaluate the expression (13.32) for each bin and sum over all B bins:

    \chi^2 = \sum_{i=1}^{B} \left[ \frac{1}{\tilde{c}_n \tilde{c}_m} \frac{(\tilde{c}_n \tilde{n} - \tilde{c}_m \tilde{m})^2}{\tilde{n} + \tilde{m}} \right]_i    (13.34)

where the prescription indicated by the index i means that all quantities
in the bracket have to be evaluated for bin i. In case the entries are not
weighted the tilde is obsolete. The constants cn , cm in (13.33) usually are
overall normalization constants and equal for all bins of the corresponding
histogram. If the histograms are normalized with respect to each other, we
have cn Σni = cm Σmi and we can set cn = Σmi = M and cm = Σni = N .
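Eqs. (13.31), (13.33) and (13.34) can be combined into a short routine. A sketch in which each histogram is passed as a per-bin list of event weights; the function names are illustrative:

```python
def equivalent_number(weights):
    """Equivalent number of unweighted events, eq. (13.31), and the sum of weights."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2, s1

def weighted_hist_chi2(bins_v, bins_w, cn=1.0, cm=1.0):
    """Chi-square of two histograms with weighted entries, eq. (13.34).
    bins_v, bins_w: lists (one entry per bin) of lists of event weights."""
    chi2 = 0.0
    for v, w in zip(bins_v, bins_w):
        nt, n = equivalent_number(v)
        mt, m = equivalent_number(w)
        ctn = cn * n / nt                 # eq. (13.33), using c~_n n~ = c_n n
        ctm = cm * m / mt
        chi2 += (ctn * nt - ctm * mt) ** 2 / (ctn * ctm * (nt + mt))
    return chi2
```

For unit weights this reduces to (13.30) bin by bin; with a common weight per event the equivalent statistics, and with it χ², is reduced accordingly.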

χ2 Goodness-of-Fit Test

This expression can be used for goodness-of-fit tests. In case the normalization
constants are given externally, for instance through the luminosity, χ2 follows
approximately a χ2 distribution of B degrees of freedom. Frequently the
histograms are normalized with respect to each other. Then we have one
degree of freedom less, i.e. B − 1. If P parameters have been adjusted in
addition, then we have B − P − 1 degrees of freedom.

Likelihood Ratio Test

In Chap. 10, Sect. 10.4.3 we have introduced the likelihood ratio test for
histograms. For a pair of Poisson numbers n, m the likelihood ratio is the
ratio of the maximal likelihood under the condition that the two numbers
are drawn from the same distribution to the unconditioned maximum of
the likelihood for the observation of n. The corresponding difference of the
logarithms is our test statistic V (see the likelihood ratio test for histograms):

    V = n \ln\frac{\lambda}{c_n} - \frac{\lambda}{c_n} - \ln n! + m \ln\frac{\lambda}{c_m} - \frac{\lambda}{c_m} - \ln m! - \left[ n \ln n - n - \ln n! \right]

      = n \ln\frac{\lambda}{c_n} - \frac{\lambda}{c_n} + m \ln\frac{\lambda}{c_m} - \frac{\lambda}{c_m} - \ln m! - n \ln n + n .
We now turn to weighted events and perform the same replacements as above:

    V = \tilde{n} \ln\frac{\tilde\lambda}{\tilde{c}_n} - \frac{\tilde\lambda}{\tilde{c}_n} + \tilde{m} \ln\frac{\tilde\lambda}{\tilde{c}_m} - \frac{\tilde\lambda}{\tilde{c}_m} - \ln \tilde{m}! - \tilde{n} \ln \tilde{n} + \tilde{n} .

Here the parameter λ̃ is the MLE corresponding to (13.29) for weighted events:

    \tilde\lambda = \tilde{c}_n \tilde{c}_m \frac{\tilde{n} + \tilde{m}}{\tilde{c}_n + \tilde{c}_m}    (13.35)
The test statistic of the full histogram is the sum of the contributions from all bins:

    V = \sum_{i=1}^{B} \left[ \tilde{n} \ln\frac{\tilde\lambda}{\tilde{c}_n} - \frac{\tilde\lambda}{\tilde{c}_n} + \tilde{m} \ln\frac{\tilde\lambda}{\tilde{c}_m} - \frac{\tilde\lambda}{\tilde{c}_m} - \ln \tilde{m}! - \tilde{n} \ln \tilde{n} + \tilde{n} \right]_i .

The variables and parameters of this formula are given in relations (13.35), (13.31) and (13.33). They depend on cn, cm. As stated above, only the ratio cn/cm matters. The ratio is either given or obtained from the normalization cn Σni = cm Σmi.
The distribution of the test statistic under H0 for large event number
follows approximately a χ2 distribution of B degrees of freedom if the nor-
malization is given or of B − 1 degrees of freedom in the usual case where
the histograms are normalized to each other. For small event numbers the
distribution of the test statistic has to be obtained by simulation.
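The per-bin contribution to V can be evaluated directly; since the equivalent numbers are in general not integers, ln m̃! is computed via the log-gamma function. A sketch with illustrative names:

```python
import math

def v_bin(nt, mt, ctn=1.0, ctm=1.0):
    """Per-bin likelihood-ratio statistic V for (equivalent) event numbers
    nt, mt with normalization constants ctn, ctm; lam from eq. (13.35).
    lgamma(mt + 1) generalizes ln(mt!) to non-integer equivalent numbers."""
    lam = ctn * ctm * (nt + mt) / (ctn + ctm)
    return (nt * math.log(lam / ctn) - lam / ctn
            + mt * math.log(lam / ctm) - lam / ctm
            - math.lgamma(mt + 1.0)
            - nt * math.log(nt) + nt)
```

The full test statistic is the sum of these contributions over all bins.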

13.10.4 Parameter Estimation

When we compare experimental data to a parameter dependent Monte Carlo simulation, one of the histograms depends on the parameter, e.g. m(θ), and the comparison is used to determine the parameter. During the fitting procedure, the parameter is modified and this implies a change of the weights of the Monte Carlo events. The experimentally observed events are not weighted. Then (13.34) simplifies with ñi = ni, c̃n = cn, c̃m = cm w, and m̃i is just the number of unweighted Monte Carlo events in bin i:

    \chi^2 = \sum_{i=1}^{B} \left[ \frac{1}{c_n \tilde{c}_m} \frac{(c_n n - \tilde{c}_m \tilde{m})^2}{n + \tilde{m}} \right]_i

For each minimization step we have to recompute the weights and with
(13.31) and (13.33) the LS parameter χ2 . If the relative normalization of the
simulated and observed data is not known the ratio cn /cm is a free parameter
in the fit. As only the ratio matters, we can set for instance cm = 1.
We do not recommend applying a likelihood fit, because the approximation
of the distribution of the sum of weights by a scaled Poisson distribution is
not valid for small event numbers where the statistical errors of the simulation
are important.

13.11 The Compound Poisson Distribution and Approximations of it
This section is based on Ref. [26].

13.11.1 Equivalence of two Definitions of the CPD

The CPD describes
i) the sum x = Σ_{i=1}^{N} ki wi, with a given discrete, positive weight distribution wi, i = 1, 2, ..., N, and Poisson distributed numbers ki with mean values λi;
ii) the sum x = Σ_{i=1}^{k} wi of a Poisson distributed number k of independent and identically distributed positive weights wi.
The equivalence of the two definitions is related to the following identity:

    \prod_{i=1}^{N} P_{\lambda_i}(k_i) = P_{\lambda}(k)\, M^{k}_{\varepsilon_1,...,\varepsilon_N}(k_1, ..., k_N) .    (13.36)

The left hand side describes N independent Poisson processes with mean values λi and random variables ki, and the right hand side corresponds to a single Poisson process with λ = Σλi and the random variable k = Σki, where the numbers ki follow a multinomial distribution

    M^{k}_{\varepsilon_1,...,\varepsilon_N}(k_1, ..., k_N) = \frac{k!}{\prod_{i=1}^{N} k_i!} \prod_{i=1}^{N} \varepsilon_i^{k_i} .

Here k is distributed among the N different classes with probabilities εi = λi/λ. The validity of (13.36) for the binomial case,

    P_{\lambda}\, M^{k}_{\lambda_1/\lambda,\,\lambda_2/\lambda} = \frac{e^{-\lambda}\lambda^{k}}{k!}\, \frac{k!}{k_1!\, k_2!}\, \frac{\lambda_1^{k_1}\lambda_2^{k_2}}{\lambda^{k_1}\lambda^{k_2}} = \frac{e^{-(\lambda_1+\lambda_2)}\lambda_1^{k_1}\lambda_2^{k_2}}{k_1!\, k_2!} = P_{\lambda_1} P_{\lambda_2} ,    (13.37)
can easily be generalized to several Poisson processes. The multinomial distribution describes a random distribution of k events into N classes. If a weight wi is attributed to each class i, then weights wi are randomly associated to the k events with probabilities λi/λ.
If all probabilities are equal, εi = 1/N , the multinomial distribution de-
scribes a random selection of the weights wi out of the N weights with equal
probabilities 1/N . It does not matter whether we describe the distribution
of x = Σwi by independent Poisson distributions or by the product of a
Poisson distribution with a multinomial distribution. To describe a continu-
ous weight distribution f (w), the limit N → ∞ has to be considered. The
formulas (3.64), (3.65) remain valid with εN = 1.

13.11.2 Approximation by a Scaled Poisson Distribution

The scaled Poisson distribution (SPD) is fixed by the requirement that the
first two moments of the weighted sum have to be reproduced. We define an
equivalent mean value λ̃,

    \tilde\lambda = \frac{\lambda E(w)^2}{E(w^2)} ,    (13.38)
446 13 Appendix

an equivalent random variable k̃ ∼ Pλ̃ and a scale factor s,

    s = \frac{E(w^2)}{E(w)} ,    (13.39)

such that the expected value E(sk̃) = µ and var(sk̃) = σ 2 . The cumulants of
the scaled distribution are κ̃m = sm λ̃.
We compare the cumulants of the two distributions and form the ratios κm/κ̃m. By definition the ratios for m = 1, 2 agree because the two lowest moments agree.
The skewness and excess of the two distributions are, in terms of the moments E(w^m) of w:

    \gamma_1 = \frac{\lambda E(w^3)}{\sigma^3} = \frac{E(w^3)}{\lambda^{1/2} E(w^2)^{3/2}} ,    (13.40)

    \gamma_2 = \frac{\lambda E(w^4)}{\sigma^4} = \frac{E(w^4)}{\lambda E(w^2)^2} ,    (13.41)

    \tilde\gamma_1 = \left( \frac{E(w^2)}{\lambda E(w)^2} \right)^{1/2} ,    (13.42)

    \tilde\gamma_2 = \frac{E(w^2)}{\lambda E(w)^2} ,    (13.43)

and the ratios are

    \frac{\gamma_1}{\tilde\gamma_1} = \frac{E(w^3)\, E(w)}{E(w^2)^2} \ge 1 ,    (13.44)

    \frac{\gamma_2}{\tilde\gamma_2} = \frac{E(w^4)\, E(w)^2}{E(w^2)^3} \ge 1 .    (13.45)

To prove these relations, we use Hölder's inequality,

    \sum_i a_i b_i \le \left( \sum_i a_i^p \right)^{1/p} \left( \sum_i b_i^{p/(p-1)} \right)^{(p-1)/p} ,

where ai, bi are non-negative and p > 1. For p = 2 one obtains the Cauchy–Schwarz inequality. Setting a_i = w_i^{3/2} and b_i = w_i^{1/2}, we get immediately the relation (13.44) for the skewness:

    \left( \sum_i w_i^2 \right)^2 \le \sum_i w_i^3 \sum_i w_i .

In general, with p = n − 1, a_i = w_i^{n/(n-1)} and b_i = w_i^{(n-2)/(n-1)}, the inequality becomes
Fig. 13.5. Comparison of a CPD with a scaled Poisson distribution (dotted) and a normal approximation (dashed); left panel: µ = 20, exponential weights f(w) = exp(−w); right panel: µ = 25, truncated normal weight distribution.

    \left( \sum_i w_i^2 \right)^{n-1} \le \sum_i w_i^n \left( \sum_i w_i \right)^{n-2} ,

which includes (13.45).


The values γ̃1 , γ̃2 of the SPD lie between those of the CPD and the normal
distribution. Thus, the SPD is expected to be a much better approximation
of the CPD than the normal distribution [26].

Example 170. Comparison of the CPD with the SPD approximation and the
normal distribution
In Fig. 13.5 the results of a simulation of CPDs with two different weight distributions are shown. The simulated events are collected into histogram
bins but the histograms are displayed as line graphs which are easier to read
than column graphs. Corresponding SPD distributions are generated with the
parameters chosen according to the relations (13.38) and (13.39). They are
indicated by dotted lines. The approximations by normal distributions are
shown as dashed lines. In the lefthand graph the weights are exponentially
distributed and the weight distribution of the righthand graph is a truncated,
renormalized normal distribution Nt (x|1, 1) = cN (x|1, 1), x > 0 with mean
and variance equal to 1 where negative values are cut. In this case the approx-
imation by the SPD is hardly distinguishable from the CPD. The exponential
weight distribution includes large weights with low frequency where the ap-
proximation by the SPD is less good. Still it models the CPD reasonably
well. The examples show that the approximation by the SPD is close to the CPD and superior to the approximation by the normal distribution.
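A short simulation along the lines of the example above; it uses exponential weights, for which E(w) = 1 and E(w²) = 2, so that λ̃ = λ/2 and s = 2 by (13.38) and (13.39). Sample size and seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 20.0
lam_t, s = lam / 2.0, 2.0            # eqs. (13.38), (13.39) for Exp(1) weights

N = 50000
k = rng.poisson(lam, N)              # number of weights per CPD outcome
cpd = np.array([rng.exponential(1.0, ki).sum() for ki in k])
spd = s * rng.poisson(lam_t, N)      # scaled Poisson approximation

# Both reproduce the CPD mean lam*E(w) = 20 and variance lam*E(w^2) = 40
# by construction; they differ only in the higher cumulants.
```

Histogramming `cpd` and `spd` reproduces the qualitative picture of Fig. 13.5 for the exponential case.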

13.11.3 The Poisson Bootstrap

In the standard bootstrap [119], samples are drawn with replacement from the observations xi, i = 1, 2, ..., n. The Poisson bootstrap is a special re-sampling technique where Poisson distributed numbers ki ∼ P1(ki) = 1/(e ki!) are associated with the n observations xi. More precisely, for a bootstrap sample the value xi is taken ki times, where ki is randomly chosen from the Poisson distribution with mean equal to one. Samples where the sum of the multiplicities differs from the observed sample size n, i.e. Σ_{i=1}^{n} ki ≠ n, are rejected.
Poisson bootstrap is completely equivalent to the standard bootstrap. It has
attractive theoretical properties [122].
In applications of the CPD the situation is different. One does not have a sample of CPD outcomes but only a single observed value of x, which is accompanied by a sample of weights. As the distribution of the number of weights is known up to the Poisson mean, the bootstrap technique is used to infer parameters depending on the weight distribution. To generate observations xk, we have to generate the numbers ki ∼ P1(ki) and form the sum x = Σ ki wi. All results are kept. The resulting Poisson bootstrap distribution (PBD) permits the estimation of uncertainties of parameters and quantiles of the CPD.
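A sketch of the Poisson bootstrap for a single CPD observation, given its accompanying sample of weights; the weights here are simulated for illustration:

```python
import numpy as np

def poisson_bootstrap(weights, n_boot, rng):
    """Poisson bootstrap replicas of x = sum(weights): every weight enters
    each replica k_i ~ Poisson(1) times; all replicas are kept."""
    w = np.asarray(weights, dtype=float)
    k = rng.poisson(1.0, size=(n_boot, len(w)))   # multiplicities per replica
    return k @ w                                  # replica values of x

rng = np.random.default_rng(7)
weights = rng.exponential(1.0, 50)                # hypothetical observed weights
boot = poisson_bootstrap(weights, 20000, rng)
sigma_x = boot.std()                              # bootstrap error estimate for x
```

Since var(ki) = 1, the bootstrap variance of x converges to Σ wi², which is exactly the CPD variance estimate for the observed sum of weights.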

13.12 Extremum Search


If we apply the maximum-likelihood method for parameter estimation, we have to find the maximum of a function in the parameter space. This is, as a rule, not possible without numerical tools. An analogous problem is posed by the method of least squares. Minimum and maximum search are in principle not different problems, since we can invert the sign of the function. We restrict ourselves to the minimum search.
Before we employ off-the-shelf computer programs, we should obtain some rough idea of the function to be minimized. The best way in most cases is a graphical presentation. It is not important for the user to know the details of the programs, but some knowledge of their underlying principles is helpful.

13.12.1 Monte Carlo Search

In order to obtain a rough impression of the function to be investigated, and of the approximate location of its minimum, we may sample the parameter
Fig. 13.6. Simplex algorithm.

stochastically. A starting region has to be selected. Usual programs will then further restrict the parameter space depending on the search results. An advantage of this method is that the probability to end up in a relative minimum is rather small. In the literature this rather simple and not very effective method is sometimes sold under the somewhat pretentious name genetic algorithm. Since it is fairly inefficient, it should be used only for the first step of a minimization procedure.

13.12.2 The Simplex Algorithm

Simplex is a quite slow but robust algorithm, as it needs no derivatives. For an n-dimensional parameter space, n + 1 starting points are selected and the function value is calculated for each point. The point which delivers the largest function value is rejected and replaced by a new point. How this point is found is demonstrated in two dimensions.
Fig. 13.6 shows three points in the upper picture. Let us assume that A has the lowest function value and point C the largest, f(xC, yC). We want to eliminate C and to replace it by a superior point C′. We take its mirror image with respect to the center of gravity of points A, B and obtain the test point CT. If f(xCT, yCT) < f(xC, yC) we have found a better point; thus we replace C by CT and continue with the new triplet. In the opposite case we double the step width (13.6b) with respect to the center of gravity and find C′. Again we accept C′ if it is superior to C. If not, we compare it with the test point CT, and if f(xCT, yCT) < f(xC′, yC′) holds, the step width is halved and reversed in direction (13.6a). The point C′ now moves to the inner region of the simplex triangle. If it is superior to C, it replaces C as above. In all other cases the original simplex is shrunk by a factor of two in the direction of
450 13 Appendix

the best point A (13.6c). In each case one of the four configurations is chosen
and the iteration continued.
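The four operations just described (reflection of the worst point, doubled step, halved and reversed step, shrinking toward the best point) can be cast into a compact routine. A minimal sketch, not the implementation used for the fits in this book; the function name and defaults are illustrative:

```python
import numpy as np

def simplex_minimize(f, x0, step=1.0, tol=1e-10, max_iter=5000):
    """Minimal simplex (Nelder-Mead style) minimizer: reflect the worst point
    at the center of gravity of the others, try a doubled step, otherwise a
    halved and reversed step, and shrink toward the best point if all fails."""
    x0 = np.asarray(x0, dtype=float)
    n = len(x0)
    pts = [x0] + [x0 + step * np.eye(n)[i] for i in range(n)]  # n + 1 points
    vals = [f(p) for p in pts]
    for _ in range(max_iter):
        order = np.argsort(vals)
        pts = [pts[i] for i in order]
        vals = [vals[i] for i in order]
        if vals[-1] - vals[0] < tol:                  # simplex has collapsed
            break
        cog = np.mean(pts[:-1], axis=0)               # center of gravity without worst
        xr = cog + (cog - pts[-1])                    # mirror image (test point)
        fr = f(xr)
        if fr < vals[0]:
            xe = cog + 2.0 * (cog - pts[-1])          # doubled step
            fe = f(xe)
            pts[-1], vals[-1] = (xe, fe) if fe < fr else (xr, fr)
        elif fr < vals[-2]:
            pts[-1], vals[-1] = xr, fr
        else:
            xc = cog - 0.5 * (cog - pts[-1])          # halved, reversed step
            fc = f(xc)
            if fc < vals[-1]:
                pts[-1], vals[-1] = xc, fc
            else:                                     # shrink toward best point
                pts = [pts[0]] + [0.5 * (p + pts[0]) for p in pts[1:]]
                vals = [vals[0]] + [f(p) for p in pts[1:]]
    best = int(np.argmin(vals))
    return pts[best], vals[best]
```

On a smooth convex function the routine homes in on the minimum without any derivative information, which is the robustness referred to above.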
There exist many variants (see refs. in [126]) of the original version of Nelder and Mead [125]. Standard Simplex [125] has been used in most fits
of this book. If the number of parameters is large, and especially if the param-
eters are correlated, Simplex fits have the tendency to stop without having
reached the function minimum [126]. This situation occurs in fits of unfolded
histograms. Simplex may choose shrinkage while a reflection of the worst pa-
rameter point could be the optimal choice. Finally, all points have almost
equal parameter coordinates such that the convergence criterion is fulfilled.
Further improvement steps are so small that reducing the convergence pa-
rameter does not change the result. The convergence problem is studied in
great detail in [126] and a solution which introduces stochastic elements in
the stepping process is proposed.
In this book a different approach is followed. After Simplex signals convergence, the fit is repeated, where the best point so far obtained is kept and the remaining points are initialized in the same way as before. Alternatively, these points are chosen randomly, centered at the best value.

13.12.3 Parabola Method


Again we begin with starting points in parameter space. In the one-dimensional
case we choose 3 points and put a parabola through them. The point with
the largest function value is dropped and replaced by the minimum of the
parabola and a new parabola is computed. In the general situation of an n-
dimensional space, 2n + 1 points are selected which determine a paraboloid.
Again the worst point is replaced by the vertex of the paraboloid. The iter-
ation converges for functions which are convex in the search region.
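In one dimension the vertex of the parabola through three points has a closed form; the following sketch implements one interpolation step and the iteration described above (function names are illustrative):

```python
def parabola_vertex(xa, ya, xb, yb, xc, yc):
    """x-coordinate of the vertex of the parabola through three points,
    or None if the points are (nearly) collinear."""
    den = 2.0 * ((xc - xb) * ya + (xa - xc) * yb + (xb - xa) * yc)
    if abs(den) < 1e-14:
        return None
    num = ((xc ** 2 - xb ** 2) * ya + (xa ** 2 - xc ** 2) * yb
           + (xb ** 2 - xa ** 2) * yc)
    return num / den

def parabola_search(f, x1, x2, x3, n_iter=50):
    """Iterated parabola method: the point with the largest function value
    is replaced by the vertex of the parabola through the three points."""
    pts = [(x1, f(x1)), (x2, f(x2)), (x3, f(x3))]
    for _ in range(n_iter):
        xv = parabola_vertex(*[c for p in pts for c in p])
        if xv is None or any(abs(xv - x) < 1e-12 for x, _ in pts):
            break                      # degenerate geometry or converged
        pts.sort(key=lambda p: p[1])
        pts[-1] = (xv, f(xv))
    return min(pts, key=lambda p: p[1])[0]
```

For a function that is exactly parabolic, the first vertex already is the minimum; for other convex functions the iteration converges quickly near the minimum.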

13.12.4 Method of Steepest Descent


A traveler, walking in a landscape unknown to him, who wants to find a lake, will choose a direction downhill, perpendicular to the curves of equal height (if there are no insurmountable obstacles). The same method is applied when
searching for a minimum by the method of steepest descent. We consider this
local method in more detail, as in some cases it has to be programmed by
the user himself.
We start from a certain point λ0 in the parameter space, calculate the
gradient ∇λ f (λ) of the function f (λ) which we want to minimize and move
by ∆λ downhill.
∆λ = −α∇λ f (λ) .
The step length depends on the learning constant α which is chosen by the
user. This process is iterated until the function remains essentially constant.
The method is sketched in Fig. 13.7.
The method of steepest descent has advantages as well as drawbacks:
Fig. 13.7. Method of steepest descent.

– The decisive advantage is its simplicity, which permits handling a large number of parameters at the same time. If convenient, rough approximations can be used for the calculation of the gradient. All that matters is that the function decreases with each step. As opposed to the simplex and parabola methods, its complexity increases only linearly with the number of parameters. Therefore problems with huge parameter sets can be handled.
– It is possible to evaluate a sample sequentially, element by element, which
is especially useful for the back-propagation algorithm of neural networks.
– Unsatisfactory is that the learning constant is not dimensionless. In other words, the method is not independent of the parameter scales. For a space-time parameter set the gradient path will depend, for instance, on the choice whether to measure the parameters in meters or millimeters, respectively hours or seconds.
– In regions where the function is flat the convergence is slow. In a narrow valley oscillations may appear. For too large values of α oscillations make exact minimization difficult.
The last-mentioned problems can be reduced by various measures in which the step length and direction partially depend on the results of previous steps. When the function change is small and similar in successive steps, α is increased. Oscillations in a valley can be avoided by adding to the gradient in step i a fraction of the gradient of step i − 1:

∆λi = −α (∇λ f(λi) + 0.5 ∇λ f(λi−1)) .

Oscillations near the minimum are easily recognized and removed by decreasing α.
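A minimal sketch of this damped gradient descent (keeping the factor 0.5 of the previous gradient used above; the test function and the value of α are example choices):

```python
def descend(grad, lam, alpha=0.1, steps=200):
    """Steepest descent with a momentum term:
    delta_lambda_i = -alpha * (grad f(lambda_i) + 0.5 * grad f(lambda_{i-1}))."""
    g_prev = [0.0] * len(lam)
    for _ in range(steps):
        g = grad(lam)
        lam = [lj - alpha * (gj + 0.5 * gpj)
               for lj, gj, gpj in zip(lam, g, g_prev)]
        g_prev = g
    return lam
```

With the plain method the previous-gradient term is simply absent; the momentum term damps the zigzag motion in narrow valleys.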

Fig. 13.8. Stochastic annealing. A local minimum can be left with a certain prob-
ability.

The method of steepest descent is applied in artificial neural networks and is useful for the alignment of tracking detectors [124].

13.12.5 Stochastic Elements in Minimum Search

A physical system which is cooled down to the absolute zero point will principally occupy an energetic minimum. When cooled down fast it may, though, be captured in a local (relative) minimum. An example is a particle in a potential well. At somewhat higher temperature it may leave the local minimum, thanks to the statistical energy distribution (Fig. 13.8). This is used, for instance, in the simulated annealing of defects in solid matter.
This principle can be applied to minimum search in general. A step in the wrong direction, where the function increases by ∆f, can be accepted, when using the method of steepest descent, e.g. with a probability

P(∆f) = 1 / (1 + e^{∆f/T}) .

The scale factor T (“temperature”) steers the strength of the effect. It has been shown that for successively decreasing T the absolute minimum will be reached.
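The acceptance rule can be sketched as follows; the double-well test function, the step size and the geometric cooling schedule are illustrative choices of ours:

```python
import math
import random

def annealed_minimum(f, x0, t0=1.0, cooling=0.995, steps=2000):
    """Minimum search that accepts uphill steps (increase df > 0) with
    probability P(df) = 1/(1 + exp(df/T)), while the temperature T is
    lowered geometrically; downhill steps are always accepted."""
    rng = random.Random(0)
    x, fx = x0, f(x0)
    x_best, f_best = x, fx
    t = t0
    for _ in range(steps):
        x_new = x + rng.uniform(-0.5, 0.5)
        df = f(x_new) - fx
        if df <= 0.0 or rng.random() < 1.0 / (1.0 + math.exp(min(df / t, 50.0))):
            x, fx = x_new, fx + df
            if fx < f_best:
                x_best, f_best = x, fx
        t *= cooling
    return x_best, f_best

# double well: local minimum near x = 0.65, global one near x = -0.75
f = lambda x: x**4 - x**2 + 0.2 * x
x_best, f_best = annealed_minimum(f, 0.65)
```

Started in the local minimum on the right, the stochastic acceptance lets the search cross the barrier and settle in the deeper well on the left.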

13.13 Linear Regression with Constraints


We consider N measurements y at known locations x, with an N × N covariance matrix C_N and a corresponding weight matrix V_N = C_N^{−1}. (We indicate the dimensions of quadratic matrices with an index.)
In the linear model the measurements are described by P < N parameters θ in the form of linear relations

⟨y⟩ = A(x) θ , (13.46)

with the rectangular N × P “design” matrix A.


In 6.7.1 we have found that the corresponding χ² expression is minimized by

θ̂ = (A^T V_N A)^{−1} A^T V_N y .
We now include constraints between the parameters, expressed by K < P linear relations:

H θ = ρ ,

with H(x) a given rectangular K × P matrix and ρ a K-dimensional vector. This problem is solved by introducing K Lagrange multipliers α and looking for a stationary point of the Lagrangian

Λ = (y − Aθ)^T V_N (y − Aθ) + 2 α^T (Hθ − ρ) .

Differentiating with respect to θ and α gives the normal equations

A^T V_N A θ + H^T α = A^T V_N y , (13.47)
H θ = ρ , (13.48)

to be solved for θ̂ and α̂. Note that Λ is minimized only with respect to θ, but maximized with respect to α: the stationary point is a saddle point, which complicates a direct extremum search. Solving (13.47) for θ and inserting it into (13.48), we find

α̂ = C_K^{−1} (H C_P A^T V_N y − ρ)

and, re-inserting the estimates into (13.47), we obtain

θ̂ = C_P [A^T V_N y − H^T C_K^{−1} (H C_P A^T V_N y − ρ)] ,

where the abbreviations C_P = (A^T V_N A)^{−1}, C_K = H C_P H^T have been used.


As in the case without constraints (6.7.1), the estimate θ̂ is linear in y and unbiased, which is easily seen by taking the expectation value in the above equation and using (13.46) and (13.48).
The covariance matrix is found from linear error propagation, after a somewhat lengthy calculation, as

cov(θ̂) = D C_N D^T = (I_P − C_P H^T C_K^{−1} H) C_P ,

where

D = C_P (I_P − H^T C_K^{−1} H C_P) A^T V_N

has been used.

The covariance matrix is symmetric and positive definite. Without constraints it equals C_P; the negative term is absent. Of course, the introduction of constraints reduces the errors and thus improves the parameter estimation.
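Numerically it is often simpler to solve the normal equations (13.47), (13.48) directly as one linear system for (θ, α) than to evaluate the closed-form expressions. A sketch for unit weights (V_N = I); the solver, the example data and the constraint are our own choices:

```python
def solve(m, v):
    """Gauss-Jordan elimination with partial pivoting (small systems)."""
    n = len(v)
    a = [row[:] + [v[i]] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        for r in range(n):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [x - f * y for x, y in zip(a[r], a[c])]
    return [a[i][n] / a[i][i] for i in range(n)]

def constrained_lsq(A, y, H, rho):
    """Least squares with linear constraints H theta = rho, unit weights:
    the saddle point of Lambda solves the symmetric linear system
        [A^T A  H^T] [theta]   [A^T y]
        [H      0  ] [alpha] = [rho  ]."""
    n, p, k = len(A), len(A[0]), len(H)
    ata = [[sum(A[i][r] * A[i][s] for i in range(n)) for s in range(p)]
           for r in range(p)]
    aty = [sum(A[i][r] * y[i] for i in range(n)) for r in range(p)]
    m = [ata[r] + [H[j][r] for j in range(k)] for r in range(p)]
    m += [list(H[j]) + [0.0] * k for j in range(k)]
    return solve(m, aty + list(rho))[:p]
```

The saddle-point character of Λ is no obstacle here: the linear solver does not search for an extremum but solves the stationarity conditions directly.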

13.14 Formulas Related to the Polynomial Approximation

Errors of the Expansion Coefficients

In Sect. 11.2.2 we have discussed the approximation of measurements by orthogonal polynomials and given the following formula for the error of the expansion coefficients a_k,

var(a_k) = 1 / Σ_{ν=1}^{N} (1/δ_ν²) ,

which is valid for all k = 1, . . . , K. Thus all errors are equal to the error of the weighted mean of the measurements y_ν.
Proof: from linear error propagation we have, for independent measurements y_ν,

var(a_k) = var( Σ_ν w_ν u_k(x_ν) y_ν )
         = Σ_ν w_ν² u_k²(x_ν) δ_ν²
         = Σ_ν w_ν u_k²(x_ν) / Σ_ν (1/δ_ν²)
         = 1 / Σ_ν (1/δ_ν²) ,

where in the third step we used the definition of the weights, and in the last step the normalization of the polynomials u_k.

Polynomials for Data with Uniform Errors

If the errors δ_1, . . . , δ_N are uniform, the weights become equal to 1/N, and for certain patterns of the locations x_1, . . . , x_N, for instance for an equidistant distribution, the orthogonalized polynomials u_k(x) can be calculated. They are given in mathematical handbooks, for instance in Ref. [127]. Although the general expression is quite involved, we reproduce it here for the convenience of the reader. For x defined in the domain [−1, 1] (eventually after some linear transformation and shift), and N = 2M + 1 equidistant (with distance ∆x = 1/M) measured points x_ν = ν/M, ν = 0, ±1, . . . , ±M, they are given by

u_k(x) = [ (2M+1)(2k+1)[(2M)!]² / ((2M+k+1)!(2M−k)!) ]^{1/2} Σ_{i=0}^{k} (−1)^{i+k} (i+k)^{[2i]} (M+t)^{[i]} / [ (i!)² (2M)^{[i]} ] ,

for k = 0, 1, 2, . . . , 2M, where we used the notation t = x/∆x = xM and the definitions

z^{[i]} = z(z − 1)(z − 2) · · · (z − i + 1) ,
z^{[0]} = 1 for z ≥ 0 , 0^{[i]} = 0 for i = 1, 2, . . . .

13.15 Formulas for B-Spline Functions


13.15.1 Linear B-Splines

Linear B-splines cover an interval of width 2b and overlap with both neighbors:

B(x; x_0) = (x − x_0 + b)/b²    for x_0 − b ≤ x ≤ x_0 ,
          = −(x − x_0 − b)/b²   for x_0 ≤ x ≤ x_0 + b ,
          = 0                   else .

They are normalized to unit area. Since the central values are equidistant, we fix them by the lower limit x_min of the x-interval and count them as x_0(k) = x_min + kb, with the index k running from k_min = 0 to k_max = (x_max − x_min)/b = K.
At the borders only half of a spline is used.
Remark: the border splines are defined in the same way as the other splines. After the fit, the part of the function outside its original domain is ignored. In the literature the definition of the border splines is often different.

13.15.2 Quadratic B-Splines

The definition of quadratic splines is analogous:

B(x; x_0) = (1/(2b)) ((x − x_0 + 3b/2)/b)²     for x_0 − 3b/2 ≤ x ≤ x_0 − b/2 ,
          = (1/(2b)) [3/2 − 2((x − x_0)/b)²]   for x_0 − b/2 ≤ x ≤ x_0 + b/2 ,
          = (1/(2b)) ((x − x_0 − 3b/2)/b)²     for x_0 + b/2 ≤ x ≤ x_0 + 3b/2 ,
          = 0                                  else .

The supporting points x_0 = x_min + (k − 1/2)b now lie partly outside of the x-domain. The index k runs from 0 to k_max = (x_max − x_min)/b + 2. Thus the number K of splines is larger by two than the number of intervals. The relations (11.13) and (11.12) are valid as before.

13.15.3 Cubic B-Splines

Cubic B-splines are defined as follows:

B(x; x_0) = (1/(6b)) (2 + (x − x_0)/b)³                        for x_0 − 2b ≤ x ≤ x_0 − b ,
          = (1/(6b)) [−3((x − x_0)/b)³ − 6((x − x_0)/b)² + 4]  for x_0 − b ≤ x ≤ x_0 ,
          = (1/(6b)) [3((x − x_0)/b)³ − 6((x − x_0)/b)² + 4]   for x_0 ≤ x ≤ x_0 + b ,
          = (1/(6b)) (2 − (x − x_0)/b)³                        for x_0 + b ≤ x ≤ x_0 + 2b ,
          = 0                                                  else .

The shift of the center of the spline is performed as before: x_0 = x_min + (k − 1)b. The index k runs from 0 to k_max = (x_max − x_min)/b + 3. The number k_max + 1 of splines is equal to the number of intervals plus 3.
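The cubic definition above translates directly into code; as a check of the normalization, its unit area can be verified numerically (the integration grid is an arbitrary choice of ours):

```python
def cubic_bspline(x, x0, b):
    """Cubic B-spline centered at x0 with interval width b,
    following the piecewise definition of Sect. 13.15.3."""
    t = (x - x0) / b
    if -2.0 <= t <= -1.0:
        return (2.0 + t) ** 3 / (6.0 * b)
    if -1.0 < t <= 0.0:
        return (-3.0 * t**3 - 6.0 * t**2 + 4.0) / (6.0 * b)
    if 0.0 < t <= 1.0:
        return (3.0 * t**3 - 6.0 * t**2 + 4.0) / (6.0 * b)
    if 1.0 < t <= 2.0:
        return (2.0 - t) ** 3 / (6.0 * b)
    return 0.0
```

The pieces join continuously at x_0 ± b (both giving 1/(6b) for b = 1), the central value is 2/(3b), and a midpoint-rule integral over [x_0 − 2b, x_0 + 2b] reproduces the unit area.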

13.16 Support Vector Classifiers


Support vector machines are described in some detail in Refs. [16, 107, 106, 105].

13.16.1 Linear Classifiers

Linear classifiers⁷ separate the two training samples by a hyperplane. Let


us initially assume that in this way a complete separation is possible. Then
an optimal hyperplane is the plane which divides the two samples with the
largest margin. This is shown in Fig. 13.9. The hyperplane can be constructed
in the following way: the shortest connection ∆ between the convex hulls⁸
of the two non-overlapping classes determines the direction w/|w| of the
normal w of this plane which cuts the distance at its center. We represent
the hyperplane in the form

w·x+b=0, (13.49)

where b fixes its distance from the origin. Note that w is not normalized, a
common factor in w and b does not change condition (13.49). Once we have
found the hyperplane {w, b} which separates the two classes yi = ±1 of the
training sample {(x1 , y1 ), . . . , (xN , yN )} we can use it to classify new input:

ŷ = f (x) = sign(w · x + b) . (13.50)



Fig. 13.9. The red hyperplane separates squares from circles. Shown are the convex
hulls and the support vectors in red.

To find the optimal hyperplane which divides ∆ into equal parts, we define
the two marginal planes which touch the hulls:

w · x + b = ±1 .

If x₊, x₋ are located on the two marginal hyperplanes, the following relations hold, which also fix the norm of w:

w · (x₊ − x₋) = 2 ⇒ ∆ = (w/|w|) · (x₊ − x₋) = 2/|w| .

The optimal hyperplane is now found by solving the constrained quadratic optimization problem

|w|² = minimum , subject to y_i (w · x_i + b) ≥ 1 , i = 1, . . . , N .

For the solution, only the constraints with equality sign are relevant. The vectors corresponding to points on the marginal planes form the so-called active set and are called support vectors (see Fig. 13.9). The optimal solution can be written as

w = Σ_i α_i y_i x_i

with α_i > 0 for the active set, else α_i = 0, and furthermore Σ_i α_i y_i = 0. The last condition ensures translation invariance: w(x_i − a) = w(x_i). Together with the active constraints, after substituting the above expression for w, it provides just the required number of linear equations to fix α_i and b. Of
⁷ A linear classification scheme was already introduced in Sect. 11.4.1.
⁸ The convex hull is the smallest polyhedron which contains all points and their connecting straight lines.

course, the main problem is to find the active set. For realistic cases this
requires the solution of a large quadratic optimization problem, subject to
linear inequalities. For this purpose an extended literature as well as program
libraries exist.
This picture can be generalized to the case of overlapping classes. Assuming that the optimal separation is still given by a hyperplane, the picture remains essentially the same, but the optimization process is substantially more complex. The standard way is to introduce so-called soft margin classifiers. Here some points on the wrong side of their marginal plane are tolerated, but with a certain penalty in the optimization process. The penalty is chosen proportional to the sum of the distances, or squared distances, of these points from their own territory. The proportionality constant is adjusted to the given problem.

13.16.2 General Kernel Classifiers

All quantities determining the linear classifier ŷ (13.50) depend only on inner products of vectors of the input space. This concerns not only the dividing hyperplane, given by (13.49), but also the expressions for w, b and the factors α_i associated with the support vectors. The inner product x · x′, a bilinear symmetric scalar function of two vectors, is now replaced by another scalar function K(x, x′) of two vectors, the kernel, which need not be bilinear, but should be symmetric and is usually required to be positive definite. In this way a linear problem in an inner product space is mapped into a very non-linear problem in the original input space where the kernel is defined. We are then able to separate the classes by a hyperplane in the inner product space that may correspond to a very complicated hypersurface in the input space. This is the so-called kernel trick.
To illustrate how a non-linear surface can be mapped into a hyperplane, we consider a simple example. In order to work with a linear cut, i.e. with a dividing hyperplane, we transform our input variables x into new variables: x → X(x). For instance, if x_1, x_2, x_3 are momentum components and a cut in energy, x_1² + x_2² + x_3² < r², is to be applied, we could transform the momentum space into a space

X = {x_1², x_2², x_3², . . .} ,

where the cut corresponds to the hyperplane X_1 + X_2 + X_3 = r². Such a mapping can be realized by substituting the inner product by a kernel:

x · x′ → K(x, x′) = X(x) · X(x′) .

In our example a kernel of the so-called monomial form is appropriate:

K(x, x′) = (x · x′)^d with d = 2 ,
(x · x′)² = (x_1 x′_1 + x_2 x′_2 + x_3 x′_3)² = X(x) · X(x′) , (13.51)

with

X(x) = {x_1², x_2², x_3², √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3} .

The sphere x_1² + x_2² + x_3² = r² in x-space is mapped into the 5-dimensional hyperplane X_1 + X_2 + X_3 = r² in the 6-dimensional X-space. (A kernel inducing, instead of monomials of order d (13.51), polynomials of all orders up to order d, is K(x, x′) = (1 + x · x′)^d.)
The most common kernel used for classification is the Gaussian (see Sect. 11.2.1):

K(x, x′) = exp( −(x − x′)² / (2s²) ) .
It can be shown that it induces a mapping into a space of infinite dimensions
[107] and that nevertheless the training vectors can in most cases be replaced
by a relatively small number of support vectors. The only free parameter
is the penalty constant which regulates the degree of overlap of the two
classes. A high value leads to a very irregular shape of the hypersurface
separating the training samples of the two classes to a high degree in the
original space whereas for a low value its shape is much smoother and more
minority observations are tolerated.
In practice, this mapping into the inner product space is not performed explicitly; in fact it is even rarely known. All calculations are performed in x-space, especially the determination of the support vectors and their weights α. The kernel trick merely serves to prove that a classification with support vectors is feasible. The classification of new input then proceeds with the kernel K and the support vectors directly:

ŷ = sign( Σ_{y_i=+1} α_i K(x, x_i) − Σ_{y_i=−1} α_i K(x, x_i) ) .

The use of a relatively small number of support vectors (typically only about 5% of all α_i are different from zero) drastically reduces the storage requirement and the computing time for the classification. Note that the result of the support vector classifier is not identical to that of the original kernel classifier, but very similar.
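The equality (13.51) between the monomial kernel and the explicit mapping X(x) is easily verified numerically; the test vectors below are arbitrary choices:

```python
import math

def monomial_kernel(x, xp, d=2):
    """K(x, x') = (x . x')^d."""
    return sum(a * b for a, b in zip(x, xp)) ** d

def feature_map(x):
    """Explicit map X(x) for d = 2 in three dimensions, as in (13.51)."""
    x1, x2, x3 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, x3 * x3, r2 * x1 * x2, r2 * x1 * x3, r2 * x2 * x3]
```

For any pair of vectors the kernel value and the inner product of the mapped vectors agree, which is all the kernel trick requires.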

13.17 Bayes Factor


In Chap. 6 we have introduced the likelihood ratio to discriminate between simple hypotheses. For two composite hypotheses H1 and H2 with free parameters, in the Bayesian approach the simple ratio is replaced by the so-called Bayes factor.
Let us assume for a moment that H1 applies. Then the actual parameters will follow a p.d.f. proportional to L_1(θ_1|x) π_1(θ_1), where L_1(θ_1|x) is the likelihood function and π_1(θ_1) the prior density of the parameters. The same reasoning is valid for H2. The probability that H1 (H2) is true is proportional to the integral over the parameter space, ∫ L_1(θ_1|x) π_1(θ_1) dθ_1 (∫ L_2(θ_2|x) π_2(θ_2) dθ_2). The relative betting odds thus are given by the Bayes factor B,

B = ∫ L_1(θ_1|x) π_1(θ_1) dθ_1 / ∫ L_2(θ_2|x) π_2(θ_2) dθ_2 .
In the case with no free parameters, B reduces to the simple likelihood ratio
L1 /L2 .
The two terms forming the ratio are called marginal likelihoods. The integration automatically introduces a penalty for additional parameters and related overfitting: the higher the dimensionality of the parameter space, the larger on average is the contribution of low-likelihood regions to the integral. In this way the concept follows the philosophy of Ockham's razor⁹, which in short states that of different competing theories, the one with the fewest assumptions, i.e. the simplest, should be preferred.
The Bayes factor is intended to replace the p-value of frequentist statistics.
H. Jeffreys [22] has suggested a classification of Bayes factors into different
categories ranging from < 3 (barely worth mentioning) to > 100 (decisive).
For the example of Chap. 10, Sect. 10.6, Fig. 10.19, with a resonance above a uniform background and uniform prior densities in the signal fraction t, 0 ≤ t ≤ 0.5, and the location µ, 0.2 ≤ µ ≤ 0.8, the Bayes factor is B = 54, which is considered very significant. This result is inversely proportional to the range in µ, as expected, because the probability to find a fake signal in a flat background is proportional to this range. In the cited example we had found a likelihood ratio of 1.1 · 10⁴ taken at the MLE. The corresponding p-value was p = 1.8 · 10⁻⁴ for the hypothesis of a flat background, much smaller than the betting odds of 1/54 for this hypothesis. While the Bayes factor takes into account the uncertainty of the parameter estimate, this uncertainty is completely neglected in the p-value derived from the likelihood ratio taken simply at the MLE. On the other hand, the calculation of the Bayes factor requires the inclusion of an at least partially subjective prior probability.
For the final rating the Bayes factor has to be multiplied by the prior factors of the competing hypotheses:

R = B π_{H1}/π_{H2} = [π_{H1} ∫ L_1(θ_1|x) π_1(θ_1) dθ_1] / [π_{H2} ∫ L_2(θ_2|x) π_2(θ_2) dθ_2] .

The posterior rating is equal to the prior rating times the Bayes factor. The Bayes factor is a very reasonable and conceptually attractive concept which requires little computational effort. It is to be preferred to the frequentist p-value approach in decision making. However, for the documentation of a measurement it has the typical Bayesian drawback that it depends on prior densities, and unfortunately there is no objective way to fix those.

⁹ Postulated by William of Ockham, English logician in the 14th century.
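As a toy illustration of the marginal-likelihood integrals, consider 14 successes in 20 binomial trials, with H2 (free success probability p, uniform prior) tested against H1 (p fixed at 0.5, no free parameter). The example and all numbers are our own, not taken from the text:

```python
def marginal_likelihood(like, lo, hi, n=2000):
    """Integral of L(theta) * pi(theta) with a uniform prior
    pi = 1/(hi - lo), evaluated with the trapezoidal rule."""
    h = (hi - lo) / n
    s = 0.5 * (like(lo) + like(hi)) + sum(like(lo + i * h) for i in range(1, n))
    return s * h / (hi - lo)

# H1: binomial with p fixed at 0.5; H2: p free, uniform prior on [0, 1].
def like(p):
    # the binomial coefficient is common to both hypotheses and cancels
    return p**14 * (1.0 - p)**6

bayes_factor = marginal_likelihood(like, 0.0, 1.0) / like(0.5)
```

Here B ≈ 1.29, "barely worth mentioning" on the Jeffreys scale: the better fit of the free parameter is almost completely offset by the integration penalty.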

13.18 Robust Fitting Methods


13.18.1 Introduction

If one or a few observations in a sample are separated from the bulk of the data, we speak of outliers. The reasons for their existence range from trivial mistakes or detector failures to important physical effects. In any case, the assumed statistical model has to be questioned if one is not willing to admit that a large and very improbable fluctuation occurred.
Outliers are quite disturbing: they can change parameter estimates by large amounts and increase their errors drastically.
Frequently outliers can be detected simply by inspection of appropriate plots. It goes without saying that simply dropping them is not good practice; in any case, at least a complete documentation of such an event is required. Clearly, objective methods for their detection and treatment are preferable.
In the following, we restrict our treatment to the simple one-dimensional
case of Gaussian-like distributions, where outliers are located far from the
average, and where we are interested in the mean value. If a possible outlier
is contained in the allowed variate range of the distribution – which is always
true for a Gaussian – a statistical fluctuation cannot be excluded as a logical
possibility. Since the outliers are removed on the basis of a statistical proce-
dure, the corresponding modification of results due to the possible removal
of correct observations can be evaluated.
We distinguish three cases:
1. The standard deviations of the measured points are known.
2. The standard deviations of the measured points are unknown but known
to be the same for all points.
3. The standard deviations are unknown and different.
It is obvious that case 3 of unknown and unequal standard deviations
cannot be treated.
The treatment of outliers, especially in situations like case 2, within the LS formalism is not really satisfying. If the data are of bad quality we may expect a sizeable fraction of outliers with large deviations. These may distort the LS fit to such an extent that outliers become difficult to define (masking of outliers). This kind of fragility of the LS method, and the fact that in higher dimensions the outlier detection becomes even more critical, has led statisticians to look for estimators which are less disturbed by data not obeying the assumed statistical model (typical are deviations from the assumed normal distribution), even when the efficiency suffers. In a second – not robust – fit procedure with cleaned data it is always possible to optimize the efficiency.
In particle physics, a typical problem is the reconstruction of particle tracks from hits in wire or silicon detectors. Here outliers due to other tracks or noise are a common difficulty, and for a first rough estimate of the track

parameters and the associated hit selection for the pattern recognition, robust
methods are useful.

13.18.2 Robust Methods

Truncated Least Square Fits

The simplest method to remove outliers is to eliminate those measurements which contribute excessively to the χ² of a least square (LS) fit. In this truncated least square fit (LST) all observations that deviate by more than a certain number of standard deviations from the mean are excluded. Reasonable values lie between 1.5 and 2 standard deviations, corresponding to a χ² cut χ²_max = 2.25 to 4. The optimal value of this cut depends on the expected amount of background or false measurements and on the number of observations. In case 2 the variance has to be estimated from the data; the estimated variance δ̂² is, according to Chap. 3.2.3, given by

δ̂² = Σ_i (y_i − µ̂)² / (N − 1) .

This method can be improved by removing outliers sequentially (LSTS). In a first step we use all measurements y_1, . . . , y_N, with standard deviations δ_1, . . . , δ_N, to determine the mean value µ̂, which in our case is just the weighted mean. Then we compute the normalized residuals, also called pulls, r_i = (y_i − µ̂)/δ_i, and select the measurement with the largest value of r_i². The value of χ² is computed with respect to the mean and variance of the remaining observations, and the measurement is excluded if it exceeds the parameter χ²_max¹⁰. The fit is repeated until all measurements are within the margin. In case all measurements are genuine Gaussian measurements, this procedure reduces the precision of the fit only marginally.
In both methods, LST and LSTS, a minimum fraction of measurements has to be retained. A reasonable value is 50 %, but depending on the problem other values may be appropriate.
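A simplified variant of the sequential truncation (using pulls with respect to the current weighted mean, with known standard deviations) can be sketched as:

```python
def sequentially_truncated_mean(y, delta, chi2_max=4.0, keep_min=0.5):
    """Weighted mean with sequential outlier removal: the point with the
    largest squared pull is dropped while it exceeds chi2_max, keeping
    at least a fraction keep_min of the measurements."""
    pts = list(zip(y, delta))
    n_min = max(2, int(round(keep_min * len(pts))))
    while len(pts) > n_min:
        wsum = sum(1.0 / d**2 for _, d in pts)
        mean = sum(v / d**2 for v, d in pts) / wsum
        pulls2 = [((v - mean) / d) ** 2 for v, d in pts]
        worst = max(range(len(pts)), key=lambda i: pulls2[i])
        if pulls2[worst] <= chi2_max:
            break
        del pts[worst]
    wsum = sum(1.0 / d**2 for _, d in pts)
    return sum(v / d**2 for v, d in pts) / wsum, len(pts)
```

For a sample of Gaussian measurements around 10 with one gross outlier, the outlier is removed in the first pass and the weighted mean of the remaining points is returned.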

The Sample Median

A first step (already proposed by Laplace) in the direction of estimators more robust than the sample mean is the introduction of the sample median as estimator for location parameters. While the former follows an extremely outlying observation up to ±∞, the latter stays nearly unchanged in this case. This change can be expressed as a change of the objective function, i.e. the function to be minimized with respect to µ, from Σ_i (y_i − µ)² to Σ_i |y_i − µ|,

¹⁰ If the variance has to be estimated from the data, its value is biased towards smaller values, because for a genuine Gaussian distribution eliminating the measurement with the largest pull reduces the expected variance.

which is indeed minimized if µ̂ coincides with the sample median for N odd. For even N, µ̂ is the mean of the two innermost points. Besides the slightly more involved computation (sorting instead of summing), the median is not an optimal estimator for a pure Gaussian distribution:

var(median) = (π/2) var(mean) = 1.571 var(mean) ,

but it weights large residuals less and therefore performs better than the arithmetic mean for distributions which have longer tails than the Gaussian. Indeed, for large N we find for the Cauchy distribution var(median) = π²/(4N), while var(mean) = ∞ (see 3.6.9), and for the two-sided exponential (Laplace) distribution var(median) = var(mean)/2.

M-Estimators

The objective function of the LS approach can be generalized to

Σ_i ρ( (y_i − t(x_i, θ)) / δ_i ) (13.52)

with ρ(z) = z² for the LS method, which is optimal for Gaussian errors. For the Laplace distribution mentioned above the optimal objective function is based on ρ(z) = |z|, derived from the likelihood analog which suggests ρ ∝ −ln f. To obtain a more robust estimation, the function ρ can be modified in various ways, but we have to retain the symmetry ρ(z) = ρ(−z) and to require a single minimum at z = 0. Estimators with objective functions ρ different from z² are called M-estimators, “M” reminding of maximum likelihood. The best known example is due to Huber [128]. His proposal is a kind of mixture of the appropriate functions of the Gauss and the Laplace cases:

ρ(z) = z²/2 if |z| ≤ c ,
     = c(|z| − c/2) if |z| > c .

The constant c has to be adapted to the given problem. For a normal population the estimate is of course not efficient: for example, with c = 1.345 the inverse of the variance is reduced to 95% of the standard value. Obviously, the fitted objective function (13.52) no longer follows a χ² distribution with the appropriate number of degrees of freedom.
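Huber's location estimate can be computed by iteratively reweighted least squares: points inside |z| ≤ c keep weight 1, points outside get the reduced weight c/|z|. A sketch for unit standard deviations, with invented data:

```python
def huber_location(y, c=1.345, steps=50):
    """Location M-estimate with Huber's rho via iteratively reweighted
    least squares; unit standard deviations are assumed."""
    mu = sorted(y)[len(y) // 2]            # start near the sample median
    for _ in range(steps):
        # psi(z)/z: weight 1 inside |z| <= c, c/|z| outside
        w = [1.0 if abs(v - mu) <= c else c / abs(v - mu) for v in y]
        mu = sum(wi * vi for wi, vi in zip(w, y)) / sum(w)
    return mu
```

A single gross outlier pulls the ordinary mean of {1, 2, 3, 4, 100} up to 22, while the Huber estimate stays close to the bulk of the data.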

Estimators with High Breakdown Point

In order to compare different estimators with respect to their robustness, the concept of the breakdown point has been introduced. It is the smallest fraction ε of corrupted data points which can lead the fitted values to differ by an arbitrarily large amount from the correct ones. For LS, ε approaches

zero, but for M-estimators or truncated fits, changing a single point would not be sufficient to shift the fitted parameter by a large amount. The maximal value of ε is smaller than 50% if the outliers are the minority. It is not difficult to construct estimators which approach this limit, see [129]. This is achieved, for instance, by ordering the residuals according to their absolute value (or ordering the squared residuals, resulting in the same ranking) and retaining only a certain fraction, at least 50%, for the minimization. This so-called least trimmed squares (LTS) fit is to be distinguished from the truncated least square fits (LST, LSTS) with a fixed cut against large residuals.
Another method relying on rank order statistics is the so-called least median of squares (LMS) method. It is defined as follows: instead of minimizing with respect to the parameters µ the sum of squared residuals, Σ_i r_i², one searches the minimum of the sample median of the squared residuals:

minimize_µ median(r_i²(µ)) .
This definition implies that for N data points, N/2 + 1 points enter for N even and (N + 1)/2 for N odd. Assuming equal errors, this definition can be illustrated geometrically in the one-dimensional case considered here: µ̂ is the center of the smallest interval (vertical strip in Fig. 13.10) which covers half of all x values. The width 2∆ of this strip can be used as an estimate of the error. Many variations are of course possible: instead of requiring 50% of the observations to be covered, a larger fraction can be chosen. Usually, in a second step, an LS fit is performed with the retained observations, thus using the LMS only for outlier detection. This procedure is chosen since it can be shown that, at least in the case of normal distributions, ranking methods are statistically inferior to LS fits.
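In one dimension the LMS estimate is the center of the shortest strip covering half of the points, which can be found directly after sorting. A sketch (the data are invented; h = N/2 + 1 for N even, (N + 1)/2 for N odd, as above):

```python
def lms_location(y):
    """One-dimensional least median of squares: center and half width of
    the shortest interval covering h = N//2 + 1 of the sorted points."""
    ys = sorted(y)
    n, h = len(ys), len(ys) // 2 + 1
    best = min(range(n - h + 1), key=lambda i: ys[i + h - 1] - ys[i])
    center = 0.5 * (ys[best] + ys[best + h - 1])
    return center, 0.5 * (ys[best + h - 1] - ys[best])
```

For a sample with a clump of outliers near 3 and the bulk near 10, the shortest strip lies in the dense region, so the estimate is essentially unaffected by the outliers.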

Example 171. Fitting a mean value in the presence of outliers

In Fig. 13.10 a simple example is presented. Three data points, representing the outliers, are taken from N(3, 1) and seven from N(10, 1). The LS fit (7.7 ± 1.1) is quite disturbed by the outliers. The sample median is initially 9.5 and becomes 10.2 after excluding the outliers; it is less disturbed by the outliers. The LMS fit corresponds to the thick line, and the minimal strip of width 2∆ to the dashed lines. It prefers the region with the largest point density and is therefore a kind of mode estimator. While the median is a location estimate which is robust against large symmetric tails, the mode is also robust against asymmetric tails, i.e. skew distributions of outliers. A more quantitative comparison of different fitting methods is presented in the following table.

method                        background
                              asymm.   uniform   none
simple LS                     2.12     0.57      0.32
median                        0.72     0.49      0.37
LS trimmed                    0.60     0.52      0.37
LS sequentially truncated     0.56     0.62      0.53
least median of squares       0.55     0.66      0.59

Fig. 13.10. Estimates of the location parameter for a sample with three outliers.

We have generated 100 000 samples with a 7-point signal given by N(10, 1) and 3 points of background: a) asymmetric, N(3, 1) (the same parameters as used in the example above); b) uniform in the interval [5, 15]; and in addition c) pure normally distributed points following N(10, 1) without background. The table contains the root mean squared deviation of the mean values from the nominal value of 10. To make the comparison fair, as in the LMS method, also in the trimmed LS fit 6 points have been retained, and in the sequentially truncated LS fit a minimum of 6 points was used. With the asymmetric background, the first three methods lead to biased mean values (7.90 for the simple LS, 9.44 for the median and 9.57 for the trimmed LS) and thus the corresponding r.m.s. values are relatively large. As expected, the median suffers much less from the background than a standard LS fit. The other two methods, LMS and sequentially truncated LS, perform reasonably in this situation; they succeed in eliminating the background completely without biasing the result, but are rather weak when little or no background is present. The result of LMS is not improved in our example when a least square fit is performed with the retained data.

The methods can be generalized to the multi-parameter case. Essentially, the r.m.s. deviation is replaced by χ². In the least square fits, truncated or trimmed, the measurements with the largest χ² values are excluded. The LMS method searches for the parameter set for which the median of the χ² values is minimal.

More information than presented in this short and simplified introduction to the field of robust methods can be found in the literature cited above and in the newer book of R. Maronna, D. Martin and V. Yohai [130].
References

1. M. G. Kendall and W. R. Buckland, A Dictionary of Statistical Terms, Long-


man, London (1982).
2. L. Lyons, Bayes and frequentism: A particle physicist’s perspective,
arXiv:1301.1273v1 (2013).
3. M. G. Kendall and A. Stuart, The Advanced Theory of Statistics, Griffin, Lon-
don (1979).
4. S. Brandt, Data Analysis, Springer, Berlin (1999).
5. A. G. Frodesen, O. Skjeggestad, M. Tofte, Probability and Statistics in Particle
Physics, Universitetsforlaget, Bergen (1979).
6. R. Barlow, Statistics, Wiley, Chichester (1989).
7. L. Lyons, Statistics for Nuclear and Particle Physicists, Cambridge University
Press (1992).
8. W. T. Eadie et al., Statistical Methods in Experimental Physics, North-Holland,
Amsterdam, (1982).
9. F. James, Statistical Methods in Experimental Physics, World Scientific Pub-
lishing, Singapore (2007).
10. Data Analysis in High Energy Physics: A Practical Guide to Statistical Methods,
ed. O. Behnke et al., J. Wiley & Sons (2013).
11. I. Narsky and F. C. Porter, Statistical Analysis Techniques in Particle Physics: Fits,
Density Estimation and Supervised Learning, J. Wiley & Sons (2013).
12. V. Blobel, E. Lohrmann, Statistische und numerische Methoden der Datenanal-
yse, Teubner, Stuttgart (1998).
13. B. P. Roe, Probability and Statistics in Experimental Physics, Springer, Berlin
(2001).
14. G. Cowan, Statistical Data Analysis, Clarendon Press, Oxford (1998).
15. G. D’Agostini, Bayesian Reasoning in Data Analysis: A Critical Introduction,
World Scientific Pub., Singapore (2003).
16. T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning,
Springer, Berlin (2001).
17. G. E. P. Box and G. C. Tiao, Bayesian Inference in Statistical Analysis,
Addison-Weseley, Reading (1973).
18. R. A. Fisher, Statistical Methods, Experimental Design and Scientific Inference,
Oxford University Press (1990). (First publication 1925).
19. A. W. F. Edwards, Likelihood, The John Hopkins University Press, Baltimore
(1992).
20. I. J. Good, Good Thinking, The Foundations of Probability and its Applications,
Univ. of Minnesota Press, Minneapolis (1983).
21. L. J. Savage, The Writings of Leonard Jimmie Savage - A Memorial Selection,
ed. American Statistical Association, Washington (1981).

22. H. Jeffreys, Theory of Probability, Clarendon Press, Oxford (1983).


23. L. J. Savage, The Foundation of Statistical Inference, Dover Publ., New York
(1972).
24. Proceedings of PHYSTAT03, Statistical Problems in Particle Physics, Astro-
physics and Cosmology, ed. L. Lyons et al., SLAC, Stanford (2003); Proceedings
of PHYSTAT05, Statistical Problems in Particle Physics, Astrophysics and Cos-
mology, ed. L. Lyons et al., Imperial College Press, Oxford (2005).
25. M. Abramowitz, I. A. Stegun, Handbook of Mathematical Functions, Dover Pub-
lications, Inc., New York (1970).
26. G. Bohm, G. Zech, Statistics of weighted Poisson events and its applications,
Nucl. Instr. and Meth. A 748 (2014) 1.
27. Bureau International des Poids et Mesures, Rapport du Groupe de travail sur
l'expression des incertitudes, P.V. du CIPM (1981) 49; P. Giacomo, On the ex-
pression of uncertainties, in Quantum Metrology and Fundamental Physical Con-
stants, ed. P. H. Cutler and A. A. Lucas, Plenum Press (1983); International
Organization for Standardization (ISO), Guide to the expression of uncertainty
in measurement, Geneva (1993).
28. P. Sinervo, Definition and Treatment of Systematic Uncertainties in High En-
ergy Physics and Astrophysics, Proceedings of PHYSTAT2003, p. 123, ed. L.
Lyons, R. Mount, R. Reitmeyer, Stanford, Ca (2003).
29. R. J. Barlow, Systematic Errors, Fact and Fiction, hep-ex/0207026 (2002).
30. R. Wanke, How to Deal with Systematic Uncertainties, in Data Analysis in High
Energy Physics: A Practical Guide to Statistical Methods, ed. O. Behnke et al.,
J. Wiley & Sons (2013).
31. A. B. Balantekin et al., Review of Particle Physics, J. of Phys. G 33 (2006),1.
32. R. M. Neal, Probabilistic Inference Using Markov Chain Monte Carlo Methods,
University of Toronto, Department of Computer Science, Tech. Rep. CRG-TR-93-1
(1993).
33. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller,
Equation of state calculations by fast computing machines, J. Chem. Phys. 21
(1953) 1087.
34. P. E. Condon and P. L. Cowell, Channel likelihood: An extension of maximum
likelihood to multibody final states, Phys. Rev. D 9 (1974) 2558.
35. R. J. Barlow, Extended maximum likelihood, Nucl. Instr. and Meth. A 297
(1990) 496.
36. M. Casarsa et al., A statistical prescription to estimate properly normalized
distributions of different particle species, Nucl. Instr. and Meth. A640 (2010)
219.
37. G. Bohm and G. Zech, Comparing statistical data to Monte Carlo simulation
with weighted events, Nucl. Instr. and Meth. A691 (2012) 171.
38. V. S. Kurbatov and A. A. Tyapkin, in Russian edition of W. T. Eadie et al.,
Statistical Methods in Experimental Physics, Atomisdat, Moscow (1976).
39. B. List, Constrained Fits in Data Analysis in High Energy Physics: A Practical
Guide to Statistical Methods, ed. O. Behnke et. al, J. Wiley & Sons (2013).
40. G. Zech, Reduction of Variables in Parameter Inference, Proceedings of PHY-
STAT2005, ed. L. Lyons, M. K. Unel, Oxford (2005).
41. G. Zech, A Monte Carlo Method for the Analysis of Low Statistic Experiments,
Nucl. Instr. and Meth. 137 (1978) 551.
42. M. Diehl and O. Nachtmann, Optimal observables for the measurement of the
three gauge boson couplings in e+ e− → W + W − , Z. Phys. C 62 (1994) 397.
43. O. E. Barndorff-Nielsen, On a formula for the distribution of a maximum likelihood
estimator, Biometrika 70 (1983) 343.
44. D. R. Cox and N. Reid, Parameter Orthogonality and Approximate Conditional
Inference, J. R. Statist. Soc. B 49, No 1 (1987) 1; D. Fraser and N. Reid,
Likelihood inference with nuisance parameters, Proc. of PHYSTAT2003, ed. L.
Lyons, R. Mount, R. Reitmeyer, SLAC, Stanford (2003) 265.
45. G. A. Barnard, G. M. Jenkins and C. B. Winstein, Likelihood inference and
time series, J. Roy. Statist. Soc. A 125 (1962).
46. A. Birnbaum, More on the concepts of statistical evidence, J. Amer. Statist.
Assoc. 67 (1972), 858.
47. D. Basu, Statistical Information and Likelihood, Lecture Notes in Statistics 45,
ed. J. K. Ghosh, Springer, Berlin (1988).
48. J. O. Berger and R. L. Wolpert, The Likelihood Principle, Lecture Notes of
Inst. of Math. Stat., Hayward, Ca, ed. S. S. Gupta (1984).
49. L. J. Savage, The Foundations of Statistics Reconsidered, Proceedings of the
Fourth Berkeley Symposium on Mathematical Statistics and Probability, ed. J.
Neyman (1961) 575.
50. C. Stein, A remark on the likelihood principle, J. Roy. Statist. Soc. A 125 (1962)
565.
51. CERN computer library, root.cern.ch.
52. R. Barlow, Asymmetric Errors, arXiv:physics/0401042v1 (2004); Proceedings of
PHYSTAT2005, ed. L. Lyons, M. K. Unel, Oxford (2005) 56.
53. S. Chiba and D. L. Smith, Impacts of data transformations on least-squares
solutions and their significance in data analysis and evaluation, J. Nucl. Sci.
Technol. 31 (1994) 770.
54. H. J. Behrendt et al., Determination of αs and sin2 θ from measurements of
the total hadronic cross section in e+ e− annihilation at PETRA, Phys. Lett.
183B (1987) 400.
55. R. W. Peelle, Peelle's Pertinent Puzzle, informal memorandum dated October
13, 1987, ORNL, USA (1987).
56. G. Zech, Analysis of distorted measurements - parameter estimation and un-
folding, arXiv (2016).
57. G. Bohm, G. Zech, Comparison of experimental data to Monte Carlo simulation
- Parameter estimation and goodness-of-fit testing with weighted events, Nucl.
Instr. and Meth. A691 (2012), 171.
58. V. B. Anykeyev, A. A. Spiridonov and V. P. Zhigunov, Comparative investiga-
tion of unfolding methods, Nucl. Instr. and Meth. A303 (1991) 350.
59. V. Blobel, Unfolding methods in high-energy physics experiments, CERN Yellow
Report 85-09 (1985) 88.
60. G. Zech, Comparing statistical data to Monte Carlo simulation - parameter
fitting and unfolding, Desy 95-113 (1995).
61. Proceedings of the PHYSTAT 2011 Workshop on Statistical Issues Related
to Discovery Claims in Search Experiments and Unfolding, CERN, Geneva,
Switzerland, ed. H. B. Prosper and L. Lyons (2011).
62. V. Blobel, Unfolding methods in particle physics, Proceedings of the PHYSTAT
2011 Workshop on Statistical Issues Related to Discovery Claims in Search
Experiments and Unfolding, CERN, Geneva, Switzerland, ed. H. B. Prosper
and L. Lyons (2011).
63. V. Blobel, Unfolding in Data Analysis in High Energy Physics: A Practical
Guide to Statistical Methods, ed. O. Behnke et. al, J. Wiley and Sons (2013).

64. H. N. Mülthei and B. Schorr, On an iterative method for the unfolding of spectra,
Nucl. Instr. and Meth. A257 (1986) 371.
65. M. Schmelling, The method of reduced cross-entropy - a general approach to
unfold probability distributions, Nucl. Instr. and Meth. A340 (1994) 400.
66. L. Lindemann and G. Zech, Unfolding by weighting Monte Carlo events, Nucl.
Instr. and Meth. A354 (1994) 516.
67. G. D’Agostini, A multidimensional unfolding method based on Bayes’ theorem,
Nucl. Instr. and Meth. A 362 (1995) 487.
68. A. Hoecker and V. Kartvelishvili, SVD approach to data unfolding, Nucl. Instr.
and Meth. A 372 (1996), 469.
69. N. Milke et al., Solving inverse problems with the unfolding program TRUEE:
Examples in astroparticle physics, Nucl. Instr. and Meth. A697 (2013) 133.
70. A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum Likelihood from Incom-
plete Data via the EM Algorithm, J. R. Statist.Soc. B 39 (1977) 1.
71. W. H. Richardson, Bayesian-Based Iterative Method of Image Restoration,
Journal of the Optical Society of America 62 (1972) 55.
72. L. B. Lucy, An iterative technique for the rectification of observed distributions,
Astron. Journ. 79 (1974) 745.
73. L. A. Shepp and Y. Vardi, Maximum likelihood reconstruction for emission
tomography, IEEE trans. Med. Imaging MI-1 (1982) 113.
74. A. Kondor, Method of converging weights - an iterative procedure for solving
Fredholm’s integral equations of the first kind, Nucl. Instr. and Meth. 216 (1983)
177.
75. Y. Vardi, L. A. Shepp and L. Kaufmann, A statistical model for positron emis-
sion tomography, J. Am. Stat. Assoc. 80 (1985) 8; Y. Vardi and D. Lee, From
image deblurring to optimal investments: Maximum likelihood solution for pos-
itive linear inverse problems (with discussion), J. R. Statist. Soc. B55, 569
(1993).
76. H. N. Mülthei, B. Schorr, On properties of the iterative maximum likelihood
reconstruction method, Math. Meth. Appl. Sci. 11 (2005) 331.
77. D. M. Titterington, Some aspects of statistical image modeling and restoration,
Proceedings of PHYSTAT 05, ed. L. Lyons and M. K. Ünel, Oxford (2005).
78. P. C. Hansen, Discrete Inverse Problems: Insight and Algorithms, SIAM (2009).
79. B. Efron and R. T. Tibshirani, An Introduction to the Bootstrap, Chapman &
Hall, London (1993).
80. M. Kuusela and V. M. Panaretos, Statistical unfolding of elementary particle
spectra: Empirical Bayes estimation and bias-corrected uncertainty quantifica-
tion, Annals of Applied Statistics 9 (2015) 1671; M. Kuusela, Statistical Issues
in Unfolding Methods for High Energy Physics, Master's thesis, Aalto Univer-
sity, Finland (2012).
81. G. D’Agostini, Improved iterative Bayesian unfolding, arXiv:1010.632v1 (2010).
82. I. Volobouev, On the Expectation-Maximization Unfolding with smoothing,
arXiv:1408.6500v2 (2015).
83. A. N. Tichonoff, Solution of incorrectly formulated problems and the regulariza-
tion method, translation of the original article (1963) in Soviet Mathematics 4,
1035.
84. D. W. Scott, and S. R. Sain, Multi-Dimensional Density Estimation, in Hand-
book of Statistics, Vol 24: Data Mining and Computational Statistics, ed. C.R.
Rao and E.J. Wegman, Elsevier, Amsterdam (2004).
85. R. Narayan, R. Nityananda, Maximum entropy image restoration in astronomy,
Ann. Rev. Astrophys. 24 (1986) 127.
86. P. Magan, F. Courbin and S. Sohy, Deconvolution with correct sampling, As-
trophys. J. 494 (1998) 472.
87. H. P. Dembinski, M. Roth, An algorithm for automatic unfolding of one-
dimensional distributions, Nucl. Instr. and Meth. A 729 (2013) 725.
88. G. Zech, Iterative unfolding with the Richardson-Lucy algorithm, Nucl. Instr.
and Meth. A716 (2013) 1.
89. B. Aslan and G. Zech, Statistical energy as a tool for binning-free, multivariate
goodness-of-fit tests, two-sample comparison and unfolding, Nucl. Instr. and
Meth. A 537 (2005) 626.
90. R. B. D’Agostino and M. A. Stephens (editors), Goodness of Fit Techniques,
M. Dekker, New York (1986).
91. L. Demortier, P Values: What They Are and How to Use Them,
CDF/MEMO/STATISTICS/PUBLIC (2006).
92. D. S. Moore, Tests of Chi-Squared Type in Goodness of Fit Techniques, ed. R.
B. D’Agostino and M. A. Stephens, M. Dekker, New York (1986).
93. T. W. Anderson, D. A. Darling, Asymptotic Theory of Certain Goodness of Fit
Criteria Based on Stochastic Processes, Ann. of Math. Stat. 23 (2) (1952) 193.
94. N. H. Kuiper, Tests concerning random points in a circle, Proceedings of the
Koninklijke Nederlandse Akademie van Wetenschappen, A63 (1960) 38.
95. J. Neyman, "Smooth test" for goodness of fit, Skandinavisk Aktuarietidskrift
11 (1937), 149.
96. E. S. Pearson, The Probability Integral Transformation for Testing Goodness
of Fit and Combining Independent Tests of Significance, Biometrika 30 (1938),
134.
97. A. W. Bowman, Density based tests for goodness-of-fit, J. Statist. Comput.
Simul. 40 (1992) 1.
98. B. Aslan and G. Zech, New Test for the Multivariate Two-Sample Problem based
on the concept of Minimum Energy, J. Statist. Comput. Simul. 75, 2 (2004),
109.
99. S. G. Mallat, A Wavelet Tour of Signal Processing, Academic Press, New
York (1999); A. Graps, An introduction to wavelets, IEEE Computational
Science and Engineering 2 (1995) 50 and www.amara.com/IEEEwave/IEEEwavelet.html.
100. A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Anal-
ysis: The Kernel Approach with S-Plus Illustration, Oxford University Press
(1997).
101. I. T. Jolliffe, Principal Component Analysis, Springer, Berlin (2002).
102. P. M. Schulze, Beschreibende Statistik, Oldenbourg Verlag (2003).
103. R. Rojas, Theorie der neuronalen Netze, Springer, Berlin (1991).
104. M. Feindt and U. Kerzel, The NeuroBayes neural network package, Nucl. Instr.
and Meth A 559 (2006) 190.
105. V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin (1995);
V. Vapnik, Statistical Learning Theory, Wiley, New York (1998); B. Schölkopf
and A. Smola, Learning with Kernels, MIT Press, Cambridge, MA (2002).
106. B. Schölkopf, Statistical Learning and Kernel Methods,
http://research.microsoft.com.
107. J. H. Friedman, Recent Advances in Predictive (Machine) Learning, Proceedings
of PHYSTAT03, Statistical Problems in Particle Physics, Astrophysics and
Cosmology, ed. L. Lyons et al., SLAC, Stanford (2003).
108. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm,
Proc. COLT, Academic Press, New York (1996) 209.
109. B. P. Roe et. al., Boosted decision trees as an alternative to artificial neural
networks for particle identification, Nucl. Instr. and Meth. A543 (2005) 577.
110. L. Breiman, Bagging predictors, Technical Report No. 421, Department of
Statistics, University of California, Berkeley, Ca, (1994).
111. L. Breiman, Random Forests, Technical Report, Department of Statistics, Uni-
versity of California, Berkeley, Ca (2001).
112. J. S. Simonoff, Smoothing Methods in Statistics, Springer, Berlin (1996).
113. D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visual-
ization, John Wiley, New York (1992).
114. E. L. Lehmann, Elements of Large-Sample Theory, Springer, New York (1999).
115. W. Härdle, M. Müller, S. Sperlich, A. Werwatz, Nonparametric and Semipara-
metric Models, Springer, Berlin (2004).
116. T. A. Bancroft and Chien-Pai Han, Statistical Theory and Inference in Re-
search, ed. D. B. Owen, Dekker, New York (1981).
117. M. Fisz, Probability Theory and Mathematical Statistics, R. E. Krieger Pub-
lishing Company, Malabar, Florida (1980) 464.
118. G. J. Feldman, R. D. Cousins, Unified approach to the classical statistical
analysis of small signals, Phys. Rev. D 57 (1998) 3873.
119. B. Efron, Why isn’t Everyone a Bayesian? Am. Stat. 40 (1986) 1,
R. D. Cousins, Why Isn’t Every Physicist a Bayesian? Am. J. Phys. 63 (1995)
398,
D. V. Lindley, Wald Lectures: Bayesian Statistics, Statistical Science, 5 (1990)
44.
120. G. Zech, Frequentist and Bayesian confidence intervals, EPJdirect C12 (2002)
1.
121. M. A. Stephens, Tests based on EDF Statistics, in Goodness of Fit Techniques,
ed. R. B. d’Agostino and M. A. Stephens, Dekker, New York (1986).
122. G. J. Babu et al., Second-order correctness of the Poisson bootstrap, The Annals
of Statistics 27, No. 5 (1999) 1666.
123. https://en.wikipedia.org/wiki/Expectation-maximization-algorithm.
124. G. Zech, A Simple Iterative Alignment Method using Gradient Descending
Minimum Search, Proceedings of PHYSTAT03, Statistical Problems in Parti-
cle Physics, Astrophysics and Cosmology, ed. L. Lyons et al., SLAC, Stanford
(2003), 226.
125. J. A. Nelder and R. Mead, A simplex method for function minimization, The
Computer Journal, 7 (1965) 308.
126. J. J. Tomik, On Convergence of the Nelder-Mead Simplex algorithm for uncon-
strained stochastic optimization, PhD Thesis, Pennsylvania State University,
Department of Statistics (1995).
127. G. A. Korn and Th. M. Korn, Mathematical Handbook for Scientists and En-
gineers, McGraw-Hill, New York (1961).
128. P. J. Huber, Robust Statistics, John Wiley, New York (1981).
129. P. J. Rousseeuw, Robust Regression and Outlier Detection, John Wiley, New
York (1987).

130. R. Maronna, D. Martin and V. Yohai, Robust Statistics – Theory and Methods,
John Wiley, New York (2006).
131. Some useful internet links:
- http://www.stats.gla.ac.uk/steps/glossary/basic-definitions.html, Statistics
Glossary (V. J. Easton and J. H. McColl).
- http://www.nu.to.infn.it/Statistics/, Useful Statistics Links for Particle
Physicists.
- http://www.statsoft.com/textbook/stathome.html, Electronic Textbook Statsoft.
- http://wiki.stat.ucla.edu/socr/index.php/EBook, Electronic Statistics Book.
- http://www.york.ac.uk/depts/maths/histstat/lifework.htm, Life and Work of
Statisticians (University of York, Dept. of Mathematics).
- http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2000/Geu00/geurts-pkdd2000-bagging.pdf,
Some enhancements of Decision Tree Bagging (P. Geurts).
- http://www.math.ethz.ch/ blatter/Waveletsvortrag.pdf, Wavelet script (in
German).
- http://www.xmarks.com/site/www.umiacs.umd.edu/ joseph/support-vector-
machines4.pdf, A Tutorial on Support Vector Machines for Pattern Recognition
(Ch. J. C. Burges).
- http://www-stat.stanford.edu/∼jhf/ftp/machine.pdf, Recent Advances in
Predictive (Machine) Learning (J. H. Friedman).
Table of Symbols

Symbol Explanation
A,B Events
Ā Negation of A
Ω/∅ Certain / impossible event
A ∪ B, A ∩ B, A ⊂ B A OR B, A AND B, A implies B, etc.
P {A} Probability of A
P {A|B} Conditional probability of A
(for given B)
x, y, z; k, l, m (Continuous; discrete) random variable (variate)
θ, µ, σ Parameters of distributions
f (x) , f (x|θ) Probability density function
F (x) , F (x|θ) Integral (probability-) distribution function
(for parameter value θ, respectively) (p. 16)
f (x) , f (x|θ) Respective multidimensional generalizations (p. 46)
A , Aji Matrix, matrix element in column i, row j
AT , ATji = Aij Transposed matrix
a , a · b ≡ aT b Column vector, inner (dot-) product
L(θ) , L(θ|x1 , . . . , xN ) Likelihood function (p. 155)
L(θ|x1 , . . . , xN ) Generalization to a multidimensional parameter space
θ̂ Statistical estimate of the parameter θ (p. 161)
E(u(x)) , hu(x)i Expected value of u(x)
u(x) Arithmetic sample mean, average (p. 21)
δx Measurement error of x (p. 92)
σx Standard deviation of x
σx2 , var(x) Variance (dispersion) of x (p. 22)
cov(x, y) , σxy Covariance (p. 50)
ρxy Correlation coefficient
µi Moment of order i with respect to origin 0, initial moment
µ′i Central moment (p. 34)
µij , µ′ij etc. Two-dimensional generalizations (p. 48)
γ1 , β2 , γ2 Skewness , kurtosis , excess (p. 26)
κi Semiinvariants (cumulants) of order i (p. 37)
Index

k-nearest neighbors, 357, 388 weighted observations, 65


blind analysis, 304
acceptance fluctuations, 65 boosting, 393
activation function, 383 bootstrap, 407
AdaBoost, 394 confidence limits, 412
ancillary statistic, 175 estimation of variance, 408
Anderson–Darling statistic, 439 jackknife, 413
Anderson–Darling test, 328 precision, 411
angular distribution, 59 two-sample test, 412
generation, 128 breakdown point, 463
ANN, see artificial neural network Breit-Wigner distribution, 79
approximation of functions, see generation, 128
function approximation Brownian motion, 31
artificial neural network, see neural
network categorical variables, 376
asymptotic mean integrated square Cauchy distribution, 79
error generation of, 128
of histogram approximation, 401 central limit theorem, 70, 79, 416
attributes, 353 characteristic function, 34
averaging measurements, 244 of binomial distribution, 63
of Cauchy distribution, 80
B-splines, 368 of exponential distribution, 40
back propagation of ANN, 384 of extremal value distribution, 84
background, 207 of normal distribution, 72
bagging, 395 of Poisson distribution, 38
Bayes factor, 347, 459 of uniform distribution, 69
Bayes’ postulate, 6 Chebyshev inequality, 415
Bayes’ probability, 6 chi-square, 184
Bayes’ theorem, 11, 151–153 of histograms, 197
for probability densities, 47 of histograms of weighted events,
Bayesian statistics, 4 442, 443
Bernoulli distribution, 63 chi-square distribution, 44, 75
bias chi-square probability, 311
bias of estimate, 190 chi-square test, 315
of estimate, 418, 436 binning, 318
of measurement, 115 composite hypothesis, 321
binomial distribution, 62 generalized, 321
Poisson limit, 68 small samples, 322

two-sample, 340 discriminant analysis, 379


classification, 354, 376 distribution
k-nearest neighbors, 388 angular, 59
decision tree, 391 continuous, 17
kernel methods, 387 discrete, 16
support vector machines, 389 multivariate, 57
weighting, 388 sample width, 76
classifiers distribution function, 16
training and testing, 412
combining measurements, 171, 244 EDF statistics, 438
compound distribution, 85 effective number of parameters, 273
compound Poisson distribution, 87, 444 efficiency
conditional probability, 11 of estimator, 436
conditionality principle, 175 of estimators, 190, 418
confidence belt, 431 efficiency fluctuations, 64
confidence interval, 116, 240 EM method, 424
classical, 430 empirical distribution function, 326
unphysical parameter values, 259 empirical moments, 98
upper limits, 255 energy test, 333
confidence level, 430 distance function, 333, 335
confidence region, 432 two-sample, 341
consistency Epanechnikov kernel, 404
of estimate, 417, 436 error, 91, 239
of test, 307 declaration of, 92
constrained fit, 212 definition, 92, 243
constraints, 452 determination of, 95
convolution integral, 53 of a product, 251
correlation, 50, 58 of a sum, 251, 252
coefficient, 50, 105 of average, 106
covariance, 50 of correlated measurements, 108
covariance matrix, 105 of empirical variance, 98
coverage probability, 430 of error, 98
Cramer–Rao inequality, 419 of ratio, 246
Cramer–von Mises test, 328 of weighted sum, 114
Cramer–von-Mises statistic, 439 one-sided, 255
credibility interval, 240 parabolic, 241
critical region, 304, 305 propagation of, 103, 103, 244, 250
cross validation, 378 relative, 92
cumulants, 37 several variables, 112
curse of dimensionality, 355 statistical, 94
unphysical parameter values, 259
decision tree, 355, 391, 396 verification of, 101
boosted, 393 error ellipsoid, 106
deconvolution, see unfolding, 261 error interval, 241
degree of belief, 3 error matrix, 105
degrees of freedom, 76, 77, 316 error of the first kind, 306
diffusion, 31 error of the second kind, 306
digital measurements, 32 error propagation, 103, 103, 244, 250
direct probability, 156 estimate, 3

estimator Hermite polynomial, 360


minimum variance bound, 420 histogram, comparison of, 441
event, 3, 9 hypothesis
excess, 27 composite, 304
expectation maximization, 424 simple, 304
expected value, 20 hypothesis test, 303
definition, 21 multivariate, 330
exponential distribution, 74
generation, 127 i.i.d., 59
generation from uniform distribution, importance sampling, 130
46 incompatible measurements, 249
moments of, 29 independence, 58, 59
extended likelihood, 198 independence of variates, 50
extreme value distribution, 83 independent, identically distributed
generation, 128 variables, 59
extremum search, 448 information, 176
method of steepest descent, 450 input vector, 353
Monte Carlo methods, 448 integrated square error, 400
parabola method, 450 interval estimation, 239, 433
Simplex algorithm, 449 inverse probability, 156
stochastic, 452 inverse problem, 262
ISE, see integrated square error
factor analysis, 371
Fisher information, 419 jackknife, 413
Fisher’s spherical distribution, 61
Fisher–Tippett distribution, 84 k-nearest neighbor test, 343
folding matrix, 290 two-sample, 332
frequentist confidence intervals, 430 kernel method, 396
frequentist statistics, 4 kernel methods, 355, 404
function approximation, 356 classification, 387
k-nearest neighbors, 357 kernel trick, 458
adapted functions, 369 kinematical fit, 215
Gaussian weighting, 358 Kolmogorov–Smirnov test, 325, 341
orthogonal functions, 359 Kuiper test, 328
polynomial, 360, 454 kurtosis, 27
splines, 366 coefficient of, 27
wavelets, 364
weighting methods, 357 L2 test, 330
Laguerre polynomial, 360
gamma distribution, 78 law of large numbers, 79, 415
Gauss distribution, 70 learning, 353
Gauss–Markov theorem, 188 least square fit, 183, 444
Gini-index, 393 truncated, 462
goodness-of-fit test, 313, 443 least square method, 183
Gram–Charlier series, 362 counter example, 184
Gram–Schmidt method, 361 Legendre polynomial, 360
Gumbel distribution, 84 likelihood, 155
definition, 155
Haar wavelet, 364 extended, 198

histograms, 195 mean value, 22


histograms with background, 207 measurement, 3
map, 171 average, 244
likelihood function, 155 bias, 115
approximation, 247 combination of correlated results, 108
asymptotic form, 422 combining, 106, 171, 244
parametrization, 247 measurement error, see error
transformation invariance, 161 measurement uncertainty, see error
likelihood principle, 176 median, 28, 462
likelihood ratio, 155, 155 method of steepest descent, 450
examples, 158 Mexican hat wavelet, 365
likelihood ratio test, 323, 346 minimal sufficient statistic, 173
for histograms, 324, 443 minimum search, 448
two-samples, 340 minimum variance estimate, 423
linear distribution MISE, see mean integrated square error
generation, 127 mixed distribution, 85
linear regression, 187, 356 MLE, see maximum likelihood estimate
with constraints, 452 Monte Carlo adjustment, 205
literature, 7 mode, 28
loadings, 375 moments, 33
location parameter, 27 exponential distribution, 41
log-likelihood, 157
higher-dimensional distributions, 48
log-normal distribution, 80, 251
of Poisson distribution, 39
log-Weibull distribution, 84
Monte Carlo integration, 140
generation, 128
accuracy, 64
look-else-where effect, 343, 352
advantages of, 146
Lorentz distribution, 79
expected values, 145
generation, 128
importance sampling, 143
loss function
selection method, 140
decision tree, 394
stratified sampling, 146
machine learning, 353 subtraction method, 145
Mahalanobis distance, 332 weighting method, 144
marginal distribution, 47 with improved selection, 142
marginal likelihood, 460 Monte Carlo search, 448
Markov chain Monte Carlo, 136 Monte Carlo simulation, 121
maximum likelihood estimate, 161 additive distributions, 134
bias of, 190 by variate transformation, 126
consistency, 420 discrete distributions, 129
efficiency, 421 generation of distributions, 123
small sample properties, 423 histogram distributions, 130
maximum likelihood method, 160 importance sampling, 130
examples, 163 Markov chain Monte Carlo, 136
recipe, 161 Metropolis algorithm, 137
several parameters, 168 parameter inference, 200, 201
MCMC, see Markov chain Monte Carlo Planck distribution, 133
mean integrated square error, 400, 403 rejection sampling, 130
of histogram approximation, 401 with weights, 135
of linear spline approximation, 403 Morlet wavelet, 365

multinomial distribution, 65 with constraints, 212


multivariate distributions with given prior, 151, 153
correlation, 58 PDE, see probability density estimation
correlaton matrix, 58 Pearson test, 318
covariance matrix, 58 Peelle’s pertinent puzzle, 253
expected values, 58 PIT, 328, 439
independence, 58 Planck distribution
transformation, 58 generation, 133
Poisson distribution, 66
neural network, 355, 380, 396 weighted observations, 87
activation function, 383 polynomial approximation, 360
loss function, 384 population, 3
testing, 384 power law distribution
training, 383 generation, 127
Neyman’s smooth test, 328 principal component analysis, 354, 371
normal distribution, 70 principal components, 373
generation, 127 prior probability, 152
generation from uniform p.d.f., 56 for particle mass, 6
in polar coordinates, 52 probability, 3
two-dimensional, 72 assignment of, 5
two-dimensional rotation, 73 axioms, 10
nuisance parameter, 227 conditional, 11
dependence on, 237 independent, 11
elimination, 227 probability density, 17
elimination by bootstrap, 235 conditional, 47
elimination by factorization, 229 two-dimensional, 46
elimination by integration, 237 probability density estimation, 331,
elimination by resampling, 235 399
elimination by restructuring, 230 k-nearest neighbors, 404
profile likelihood, 233 by Gram–Charlier series, 362
null hypothesis, 303 fixed volume, 404
number of degrees of freedom, 76, 77, histogram approximation, 400
316 kernel methods, 404
linear spline approximation, 403
Ockham’s razor, 460 probability integral transformation,
optimal variable method, 223 328, 328, 439
orthogonal functions, 359 probability of causes, 156
profile likelihood, 233
p-value, 308 propagation of errors, 103, 103
combination of, 312 linear, 103
parameter inference, 149 several variables, 104
approximated likelihood estimator, pseudo random number, 124
224
least square method, 183 quantile, 28
moments method, 179
Monte Carlo simulation, 200 random event, 3, 9
optimal variable method, 223 random forest, 395
reduction of number of variates, 220 random number, 124
weighted Monte Carlo , 201 random variable, 10

random walk, 31 minimal sufficient, 173


reduction of number of variables, 52 sufficient, 172
regression, 183 statistical learning, 353
regression analysis, 356 statistics
regularization, 266 Bayesian, 4
resampling techniques, 407 frequentist, 4
response, 353 goal of, 1
response matrix, 262, 290 simulated annealing, 452
robust fitting methods, 461 stopping rule paradox, 178
breakdown point, 463 stopping rules, 177
least median of squares, 464 straight line fit, 186, 233
least trimmed squares, 464 Student’s t distribution, 82
M-estimator, 463 sufficiency, 164, 172
sample median, 462 sufficiency principle, 173
truncated least square fit, 462 sufficient statistic, 172
support vector, 391
sample, 1 support vector machine, 355, 389, 456
sample mean, 22 SVD, see singular value decomposition
sample width, 25, 76 SVM, see support vector machine
relation to variance, 25 systematic error, 98
scale parameter, 27 definition, 99
scaled Poisson distribution, 89, 445 detection of, 100
shape parameter, 28
sigmoid function, 383 test, 303
signal test, 304 bias, 307
multi-channel, 351 comparison, 337
signal with background, 68 consistency, 307
significance level, 304 distribution-free, 315
significance level, 305 goodness-of-fit, 313, 443
significance test, 303 power, 307
small signals, 343 significance, 303
Simplex, 449 uniformly most powerful, 307
singular value decomposition, 271, 376 test statistic, 304
size of a test, 304 test, significance level, 305
skewness, 26 test, size, 305
coefficient of, 26
soft margin classifier, 458
soft margin classifier, 458 training sample, 353
solid angle, 61 transformation of variables, 41
spline approximation, 366 multivariate, 51
spline functions, 455 transformation function, 56
cubic, 456 truncated least square fit, 462
linear, 455 two-point distribution, 63
normalized, 368 two-sample test, 304, 339
quadratic, 455 k-nearest neighbor test, 343
stability, 40 chi-square test, 340
standard deviation, 23 energy test, 341
statistic, 163 Kolmogorov–Smirnov test, 341
ancillary, 175 likelihood ratio, 340

UMP test, see test, uniformly most wide bin regularization, 293
powerful with background, 295
unfolding uniform distribution, 33, 69
expectation maximization method, upper limit, 255
274 Poisson statistics with background,
spline approximation, 275 257
unfolding, 261 Poisson statistics, 256
binning, 290
binning-free, 296 v. Mises distribution, 60
curvature regularization, 286 variables
eigendecomposition, see deconvolution independent, identically distributed, 59
eigenvector decomposition, 268 variance, 22
EM method, 280 estimation by bootstrap, 408
entropy regularization, 286 of a sum, 23
error assignment, 279 of a sum of distributions, 26
explicit regularization, 275 of sample mean, 24
implicit regularization, 293 variate, 10
integrated square error, 278 transformation, 45
iterative, 280 Venn diagram, 10, 152
least square method, 268
migration method, 298 Watson statistic, 439
ML approach, 274 Watson test, 328
norm regularization, 287 wavelets, 364
penalty regularization, 285 Weibull distribution, 83
regularization strength, 277 weight matrix, 74
response matrix, 290 weighted events, 87
Richardson-Lucy method, 280 weighted observations, 87
spline approximation, 289 width of sample, 25
truncated SVD, 283 relation to variance, 25
List of Examples

Chapter 1
1. Uniform prior for a particle mass
Chapter 2
2. Card game, independent events
3. Random coincidences, measuring the efficiency of a counter
4. Bayes’ theorem, fraction of women among students
5. Bayes’ theorem, beauty filter
Chapter 3
6. Discrete probability distribution (dice)
7. Probability density of an exponential distribution
8. Probability density of the normal distribution
9. Variance of the convolution of two distributions
10. Expected values, dice
11. Expected values, lifetime distribution
12. Mean value of the volume of a sphere with a normally distributed radius
13. Playing poker until the bitter end
14. Diffusion
15. Mean kinetic energy of a gas molecule
16. Reading accuracy of a digital clock
17. Efficiency fluctuations of a detector
18. Characteristic function of the Poisson distribution
19. Distribution of a sum of independent, Poisson distributed variates
20. Characteristic function and moments of the exponential distribution
21. Calculation of the p.d.f. for the volume of a sphere from the p.d.f. of the
radius
22. Distribution of the quadratic deviation
23. Distribution of kinetic energy in the one-dimensional ideal gas
24. Generation of an exponential distribution starting from a uniform distri-
bution
25. Superposition of two two-dimensional normal distributions
26. Correlated variates
27. Dependent variates with correlation coefficient zero
28. Transformation of a normal distribution from cartesian into polar coor-
dinates
29. Distribution of the difference of two digitally measured times
30. Distribution of the transverse momentum squared of particle tracks
31. Quotient of two normally distributed variates
32. Generation of a two-dimensional normal distribution starting from uni-
form distributions
33. The v. Mises distribution
34. Fisher’s spherical distribution
35. Efficiency fluctuations of a Geiger counter
36. Accuracy of a Monte Carlo integration
37. Acceptance fluctuations for weighted events
38. Poisson limit of the binomial distribution
39. Fluctuation of a counting rate minus background
40. Distribution of the mean value of decay times
41. Measurement of a decay time distribution with Gaussian resolution
Chapter 4
42. Scaling error
43. Low decay rate
44. Poisson distributed rate
45. Digital measurement (uniform distribution)
46. Efficiency of a detector (binomial distribution)
47. Calorimetric energy measurement (normal distribution)
48. Average from 5 measurements
49. Average of measurements with common off-set error
50. Average outside the range defined by the individual measurements
51. Average of Z 0 mass measurements
52. Error propagation: Velocity of a sprinter
53. Error propagation: Area of a rectangular table
54. Straight line through two measured points
55. Error of a sum of weighted measurements
56. Bias in averaging measurements
57. Confidence levels for the mean of normally distributed measurements
Chapter 5
58. Area of a circle of diameter d
59. Volume of the intersection of a cone and a torus
60. Correction of decay times
61. Efficiency of particle detection
62. Measurement of a cross section in a collider experiment
63. Reaction rates of gas mixtures
64. Importance sampling
65. Generation of the Planck distribution
66. Generation of an exponential distribution with constant background
67. Mean distance of gas molecules
68. Photon yield for a particle crossing a scintillating fiber
69. Determination of π
Chapter 6
70. Bayes’ theorem: Pion- or kaon decay?
71. Time of a decay with exponential prior
72. Likelihood ratio: V + A or V − A reaction?
73. Likelihood ratio of Poisson frequencies
74. Likelihood ratio of normal distributions
75. Likelihood ratio for two decay time distributions
76. MLE of the mean life of an unstable particle
77. MLE of the mean value of a normal distribution with known width
78. MLE of the width of a normal distribution with given mean
79. MLE of the mean of a normal distribution with unknown width
80. MLE of the width of a normal distribution with unknown mean
81. MLEs of the mean value and the width of a normal distribution
82. Determination of the axis of a given distribution of directions
83. Likelihood analysis for a signal with a linear background
84. Sufficient statistic and expected value of a normal distribution
85. Sufficient statistic for mean value and width of a normal distribution
86. Conditionality
87. Likelihood principle, dice
88. Likelihood principle, V − A
89. Stopping rule: Four decays in a fixed time interval
90. Moments method: Mean and variance of the normal distribution
91. Moments method: Asymmetry of an angular distribution
92. Counter example to the least square method: Gauging a digital clock
93. Least square method: Fit of a straight line
94. Bias of the MLE of the decay parameter
95. Bias of the estimate of a Poisson rate with observation zero
96. Bias of the measurement of the width of a uniform distribution
Chapter 7
97. Adjustment of a linear distribution to a histogram
98. Fit of the particle composition of an event sample (1)
99. Fit of the slope of a linear distribution with Monte Carlo correction
100. Lifetime Fit with Monte Carlo correction
101. Fit of the parameters of a peak over background
102. Fit of the parameters of a peak with a background reference sample
103. Fit with constraint: Two pieces of a rope
104. Fit with constraint: Particle composition of an event sample (2)
105. Kinematical fit with constraints: Eliminating parameters
106. Example 103 continued
107. Example 105 continued
108. Example 103 continued
109. Reduction of the variate space
110. Approximated likelihood estimator: Lifetime fit from a distorted distri-
bution
111. Approximated likelihood estimator: Linear and quadratic distributions
112. Nuisance parameter: Decay distribution with background
113. Nuisance parameter: Measurement of a Poisson rate with a digital clock
114. Nuisance parameter: Decay distribution with background sample
115. Elimination of a nuisance parameter by factorization of a two-dimensional
normal distribution
116. Elimination of a nuisance parameter by restructuring: Absorption mea-
surement
117. Eliminating a nuisance parameter by restructuring: Slope of a straight
line with the y-axis intercept as nuisance parameter
118. Fitting the width of a normal distribution with the mean as nuisance
parameter
119. Profile likelihood, absorption measurement
120. Eliminating a nuisance parameter by resampling, absorption measure-
ment
Chapter 8
121. Error of a lifetime measurement
122. Averaging lifetime measurements
123. Averaging ratios of Poisson distributed numbers
124. Distribution of a product of measurements
125. Sum of weighted Poisson numbers
126. Average of correlated cross section measurements, Peelle’s pertinent puz-
zle
127. Upper limit for a Poisson rate with background
128. Upper limit for a Poisson rate with uncertainty in background and ac-
ceptance
Chapter 9
129. Eigenvector decomposition of the LS matrix
130. Unfolding with the expectation maximization (EM) method
131. EM unfolding with different starting distributions
132. Comparison of unfolding methods
133. Unfolding to a spline curve
134. Unfolding with implicit regularization
135. Deconvolution of a blurred picture
136. Deconvolution by fitting the true event locations
Chapter 10
137. Test of a predicted counting rate
138. Particle selection based on the invariant mass
139. Bias and inconsistency of a test
140. The p-value and the probability of a hypothesis
141. Comparison of different tests for background under an exponential dis-
tribution
142. χ2 comparison for a two-dimensional histogram
143. Likelihood ratio test for a Poisson count
144. Designed test: Three region test
145. GOF test of a two-dimensional sample
146. Comparison of two samples
147. Significance test: Signal over background, distribution of the likelihood
ratio statistic
148. Previous example continued
Chapter 11
149. Simple empirical relations
150. Search for common properties
151. Two-class classification, SPAM mails
152. Multiclass classification, pattern recognition
153. Curse of dimensionality
154. Principal component analysis
Chapter 12
155. PDE of a background distribution and signal fit
156. Bootstrap evaluation of the accuracy of the estimated mean value of a
distribution
157. Error of mean distance of stochastically distributed points in a square
158. Acceptance of weighted events
159. Two-sample test with a decision tree
160. Jackknife bias correction
Appendix
161. Efficiencies of different estimates of the location parameter of a Gaussian
[116]
162. Efficiency of small sample MLEs
163. EM algorithm: Unfolding a histogram
164. Parameter uncertainty of background contaminated signals
165. Coverage: Performance of magnets
166. Bias in the mass determination from energy and momentum
167. Inference with known prior
168. Bias introduced by a prior
169. Comparing predictions with strongly differing accuracies: Earth quake
170. Comparison of the CPD with the SPD approximation and the normal
distribution
171. Fitting a mean value in the presence of outliers
