Bounds for the Bayes error in classification: a Bayesian approach
DOI 10.1007/s10260-006-0012-x
Abstract We study two of the classical bounds for the Bayes error Pe , Lissack
and Fu’s separability bounds and Bhattacharyya’s bounds, in the classification of
an observation into one of the two determined distributions, under the hypothesis
that the prior probability χ itself has a probability distribution. The effectiveness
of this distribution can be measured in terms of the ratio of two mean values. On
the other hand, a discriminant analysis-based optimal classification rule allows
us to derive the posterior distribution of χ , together with the related posterior
bounds of Pe .
Keywords Overlapping coefficient · Discriminant analysis · Misclassification ·
Lissack and Fu bounds · Bhattacharyya bounds · Hypergeometric functions ·
Bernoulli · Beta distribution
1 Introduction
The Bayes error $P_e$ is given by $P_e = \int_{\mathbb{R}} \min\{\chi f_1(x), (1-\chi) f_2(x)\}\,dx$, where f_1 and f_2 are the
densities of the two populations, usually called class-conditional probabilities in
the pattern recognition literature, while χ, the known constant weight attached to
f_1, is the prior probability given to it, based on our knowledge of the problem.
But χ is also the proportion of H_1 in the mixture of the two above densities, i.e.
g(x) = χ f_1(x) + (1 − χ) f_2(x). Because of the difficulties in getting the exact
value of Pe other than by numerical methods, bounds for Pe are important to ob-
tain. One type of such bounds, based on the mixture g(x), has been obtained by
Lissack and Fu (1976). They are called the posterior separability bounds in L^p-spaces,
denoted by J_p(H_1, H_2|χ), and are obtained, using Bayes' theorem, from
the posterior distributions of H_1 and H_2, given the observation X = x. On the other
hand, Bhattacharyya's bounds are defined as
$$1 - \sqrt{1 - 4\chi(1-\chi)\rho^2} \;\le\; 2P_e \;\le\; 2\sqrt{\chi(1-\chi)}\,\rho,$$
where $\rho = \int_{-\infty}^{\infty}\sqrt{f_1(x)f_2(x)}\,dx$ is the L^1-norm of the geometric mean of the two densities, with $\sqrt{1-\rho}$ being the classical Hellinger distance
between them. Naturally, other bounds encountered in the literature could also be
used here. For example, the Nearest Neighbor error L_NN can be considered, but
since only its asymptotic value can be obtained analytically, it would be difficult to
use it in a finite sampling scheme. But, since this asymptotic value lies between P_e
and the Bhattacharyya bound, we still have some information on L_NN for a large
sample.
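For concreteness, the short sketch below (not taken from the paper) evaluates P_e, ρ and the resulting Bhattacharyya bounds by numerical quadrature; the two normal class-conditional densities and the value of χ are purely illustrative assumptions.

```python
# Illustrative sketch (not from the paper): Bayes error and Bhattacharyya
# bounds for two hypothetical normal class-conditional densities.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f1 = norm(loc=0.0, scale=1.0).pdf    # hypothetical density of H1
f2 = norm(loc=2.0, scale=1.5).pdf    # hypothetical density of H2
chi = 0.3                            # assumed fixed prior probability of H1

# Bayes error: Pe = integral of min{chi*f1, (1-chi)*f2} over the real line
Pe, _ = quad(lambda x: min(chi * f1(x), (1 - chi) * f2(x)), -np.inf, np.inf)

# Bhattacharyya coefficient: rho = integral of sqrt(f1*f2)
rho, _ = quad(lambda x: np.sqrt(f1(x) * f2(x)), -np.inf, np.inf)

lower = (1 - np.sqrt(1 - 4 * chi * (1 - chi) * rho**2)) / 2   # lower bound on Pe
upper = np.sqrt(chi * (1 - chi)) * rho                        # upper bound on Pe

print(f"Pe = {Pe:.4f},  bounds: {lower:.4f} <= Pe <= {upper:.4f}")
```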
The Bayes error can also be related to other concepts used in Information The-
ory, such as Entropy, or Divergence (Devroye et al. 1996), and a study along the
lines of this article can be carried out with these concepts. Of related interest is
the classification problem from the L1 -norm viewpoint by Pham-Gia et al. (2005).
For clarity, we will consider the univariate case only. The multivariate case, which,
however, has to include different approaches, will be treated in a subsequent paper.
In this article, the two densities f_1 and f_2 are supposed known but, in practice,
they can be estimated from the two available training samples, if these samples are
taken from H_1 and H_2 separately. A variety of methods, ranging from the method
of moments to kernel density estimation, can be used for this purpose.
Very often, however, the precise value of χ is not known, making the above-
mentioned bounds of quite limited value. But, as in most phenomena, the range of
values for χ is often known, and sometimes even its distribution, based on experts’
opinion, as well as on past cases. In this article, we will give a distribution to the
unknown χ , and study its effects on the above two types of bounds. This infor-
mative prior should be quite useful in real applications, but even when it is not
available because of a lack of adequate information, non-informative priors, which
are well discussed in the Bayesian literature, can be used instead. Furthermore,
using rules provided by Discriminant analysis on a sample taken from the mixture
g(x), we will derive the posterior distribution of χ , which will then serve to update
these bounds. This process can be repeated, to refine the estimation of χ . Also,
effectiveness measures will be introduced to evaluate the performance of a distribution,
making comparisons between two prior distributions possible and hence
permitting useful conclusions and decisions.
The general merits of the Bayesian approach in applied statistics, over those of
classical statistics, have been, and continue to be, a subject of hot debate, and we do
not wish to address them here. The interested reader is referred to Bernardo and
Smith (1994) for a fuller discussion. Concerning the specific method presented
here, our numerical example in section 5 highlights the different steps. On the
Bayesian approach to the mixture problem, Titterington et al. (1985) give an
overview of the sequential approaches available prior to the use of computer-intensive
methods, while Robert (1996) provides a comprehensive survey of methods
based on Markov chain Monte Carlo (MCMC). McLachlan and Basford (1988)
can be consulted for basic results in mixture distribution analysis. It should be
mentioned that in the case where χ, f_1 and f_2 are all unknown, and data are obtained
from the mixture H_3, the estimation of χ is usually carried out empirically, frequently
using the Expectation-Maximization (EM) algorithm when the maximum
likelihood approach is used. A Bayesian approach for the general case has been
suggested in McLachlan and Peel (2000), and a recent survey of the MCMC methods
used in this domain, and the usual problems encountered, is given by Jasra et al.
(2005).
In section 2, with a distribution assigned to the prior probability χ , bounds are
obtained for the expected Bayes error in classification, and effectiveness measures
for that prior distribution are derived. In section 3, parameters involved in the sam-
pling stage are clearly identified, and the classification of an observation taken from
the mixture is discussed. In section 4, the posterior distribution of χ is derived in
closed form, based on results from a sample taken from the mixture, and bounds for
the Bayes error are recomputed. Conversely, for χ constant but unknown, the distribution
of J_1(H_1, H_2|χ), the L^1-distance between the two populations, will be derived,
under some hypotheses on the two mutually complementary components of P_e.
Finally, a numerical example in section 5 provides concrete illustrations of the
results obtained in previous sections, and clearly presents the unified approach to
the study of the mean Bayes error bounds proposed by our paper. In the Appendix,
we first recall some basic Discriminant analysis procedures applied to two nor-
mal populations, where the overlapping coefficient, which represents the sum of
misclassification probabilities, plays a crucial role. Some results in hypergeometric
functions of several variables, needed to derive our results, are also presented there.
Let us now consider a mixture of the form g(x) = χ f_1(x) + (1 − χ) f_2(x), where
the weight χ is constant, 0 ≤ χ ≤ 1.
For p > 0, let
$$J_p(H_1, H_2|\chi) = \int_{\mathbb{R}} |P(H_1|t) - P(H_2|t)|^p\, g(t)\,dt$$
be the L^p-posterior distance between the two distributions, using the measure dμ(t) = g(t)dt.
Here P(H_i|t) is the probability that an element belongs to population H_i, i = 1, 2,
after the observation t is made. By Bayes' theorem, we have
$$\int_{\mathbb{R}} |P(H_1|t) - P(H_2|t)|^p\, g(t)\,dt = \int_{\mathbb{R}} \frac{|\chi f_1(t) - (1-\chi) f_2(t)|^p}{[g(t)]^{p-1}}\,dt,$$
and hence the (posterior) L^p-distance between the two populations can be expressed as an
L^p-distance (with the measure dt/[g(t)]^{p-1}) between the two functions χ f_1(t) and
(1 − χ) f_2(t), also called the L^p(χ) distance between the two populations.
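As a quick numerical illustration of this definition (a sketch with assumed normal densities, not the paper's example), the following code evaluates J_p(H_1, H_2|χ) by quadrature and checks the identity J_1 = 1 − 2P_e used below.

```python
# Sketch: the posterior L^p distance J_p(H1,H2|chi) and the identity
# J_1 = 1 - 2*Pe, for illustrative (assumed) normal densities.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

f1, f2 = norm(0.0, 1.0).pdf, norm(2.0, 1.5).pdf   # hypothetical densities
chi = 0.3                                          # assumed fixed weight

def J_p(p):
    """J_p = int |chi f1 - (1-chi) f2|^p / g^{p-1} dt, with g = chi f1 + (1-chi) f2."""
    def integrand(t):
        k1, k2 = chi * f1(t), (1 - chi) * f2(t)
        return abs(k1 - k2) ** p / (k1 + k2) ** (p - 1)
    return quad(integrand, -np.inf, np.inf)[0]

Pe = quad(lambda t: min(chi * f1(t), (1 - chi) * f2(t)), -np.inf, np.inf)[0]
print(f"J_1 = {J_p(1):.4f},  1 - 2*Pe = {1 - 2 * Pe:.4f}")   # should agree
print(f"J_2 = {J_p(2):.4f}")
```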
Let us now consider χ as random, with distribution ξ(χ), which we will suppose
from now on to be of beta type, denoted beta(χ; α, β), although any distribution on
(0, 1) can be used. Since we have E_χ[J_1(H_1, H_2|χ)] = 1 − 2E_χ[P_e] (see also
section 4.2), the expected value of the Bayes error can also have as bounds simple
expressions of the distances between these distributions, as seen below.
Lemma Let χ have beta(χ; α, β) as prior distribution. Then
(1) The distance W between χ and 1 − χ has as prior density:
$$f(w) = \frac{(1-w)^{\alpha-1}(1+w)^{\beta-1} + (1-w)^{\beta-1}(1+w)^{\alpha-1}}{2^{\alpha+\beta-1}\,B(\alpha,\beta)}, \quad 0 \le w \le 1. \qquad (1)$$
(2) For 1 ≤ p ≤ ∞, we have, as posterior separability bounds for E_χ[J_1(H_1, H_2|χ)] (and the mean value of 2P_e):
$$E_\chi\!\left[J_p(H_1,H_2|\chi)\right] \le E_\chi\!\left[J_1(H_1,H_2|\chi)\right] \le E_\chi\!\left[\{J_p(H_1,H_2|\chi)\}^{1/p}\right], \qquad (2)$$
and for 0 < p ≤ 1,
$$E_\chi\!\left[\{J_p(H_1,H_2|\chi)\}^{1/p}\right] \le E_\chi\!\left[J_1(H_1,H_2|\chi)\right] \le E_\chi\!\left[J_p(H_1,H_2|\chi)\right], \qquad (3)$$
where the expectations are taken w.r.t. the prior density of χ.
(3) For Bhattacharyya's bounds, we have:
$$1 - E_\chi\!\left[\sqrt{1 - 4\chi(1-\chi)\rho^2}\right] \le 2E_\chi[P_e] \le 2E_\chi\!\left[\sqrt{\chi(1-\chi)}\right]\rho, \qquad (4)$$
where $\rho = \int_0^{\infty}\sqrt{f_1(x)f_2(x)}\,dx$.
Proof
and θ_ξ as the corresponding effectiveness measure in L^1, where λ_ξ^{(p)} is not defined.
Definition 2 For two distinct prior distributions, ξ_1(.) and ξ_2(.), and for a fixed
value of p, we define ξ_1 as more separation-effective than ξ_2 (or more S-effective)
in the Lissack-Fu sense, denoted by ξ_1 >_{LF(p)} ξ_2, if λ_{ξ_1}^{(p)} > λ_{ξ_2}^{(p)}.
The above inequality means that ξ_1(.) has the ratio between the two respective
discriminating measures, for output and input, larger than the same ratio for ξ_2(.),
resulting in a better return. However, it should also be borne in mind that an informative
prior must be chosen because it best reflects the distribution of the possible
values of χ, and not because it gives a better return, i.e. is more S-effective. For the
family of beta(α, β) priors, Figure 1 gives the level curves for λ_ξ^{(p)}, for p = 2,
and for θ_ξ, as functions of α and β. We can see that they have similar shapes, with
maximum values along the diagonal α = β, and with these values increasing as α and
β increase.
This fact can be easily understood since, for a symmetrical beta, the mean is 1/2,
which represents indifference, as explained before. In the limiting case, when α =
β → ∞, we have the degenerate distribution at 1/2, with W = 0, and λ_ξ^{(p)} = θ_ξ =
∞ (since the two related numerators remain finite). Taking a symmetrical beta as
prior usually provides a better return, but other non-symmetrical beta densities can
also provide the same return, as shown by the level curves. Similarly, ξ_1 is more
effective than ξ_2 in the Bhattacharyya sense, denoted by ξ_1 >_B ξ_2, if θ_{ξ_1} > θ_{ξ_2}.
Remark As we can see, both bounds diverge to infinity when E_ξ(χ) → 1/2. This
situation happens when χ ∼ beta(α, α), with α → ∞. A question raised by the
referees is whether, in view of the ordering of the prior distributions of χ by λ_ξ^{(p)}, for
a family of priors such as the beta, we can determine the parameters (α*, β*) that
would render λ_{ξ*}^{(p)} maximum.
This question does not have a positive answer, since the upper bound for λ_ξ^{(p)}
given by Theorem 1, namely $UB = (1 - p^{-1})\big/\!\left(2\,p^{1/(p-1)}\,|1 - 2E_\xi(\chi)|\right)$, for E_ξ(χ) ≠ 1/2, is strict. The equation λ_ξ^{(p)} = UB does not have a solution in (α*, β*), and
although we can find (α′, β′) such that λ_ξ^{(p)} < λ_{ξ′}^{(p)} < UB, the upper bound is never
reached. This can also be seen from Figure 1, and the same answer applies to θ_ξ.
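The bound is straightforward to evaluate for any beta(α, β) prior; the following sketch uses the formula in the form reconstructed above and prints its value for the two priors considered later in section 5, which can be compared with the values 0.2083 and 0.1565 quoted there.

```python
# Sketch: the (strict) upper bound on the S-effectiveness measure lambda_xi^(p),
# UB = (1 - 1/p) / (2 p^{1/(p-1)} |1 - 2 E_xi(chi)|), for a beta(alpha, beta) prior.
def lambda_upper_bound(p, alpha, beta):
    e_chi = alpha / (alpha + beta)           # prior mean of chi
    if abs(1 - 2 * e_chi) < 1e-12:
        return float("inf")                  # bound diverges when E(chi) = 1/2
    return (1 - 1 / p) / (2 * p ** (1 / (p - 1)) * abs(1 - 2 * e_chi))

for a, b in [(4, 16), (2, 18)]:
    print(f"beta({a},{b}):  UB(p=2) = {lambda_upper_bound(2, a, b):.4f}")
```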
Fig. 1 Level curves for λ_ξ^{(2)} and θ_ξ, as functions of α and β
3 Sampling stage
We suppose f_1 and f_2 are two independent normal densities and refer to the basic
notations and procedures for discriminant analysis in the univariate case, given in
the Appendix. The optimal cut-off points, which minimize the Bayes error P_e,
are the intersection point(s) between k_1(t) = χ f_1(t) and k_2(t) = (1 − χ) f_2(t).
When the value of χ is a known constant, we have explicit expressions for the
solutions:
$$x_0 = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \,\mathrm{Log}\, D}{\mu_1 - \mu_2}, \qquad (8)$$
or
$$x_1, x_2 = \frac{\mu_1\sigma_2^2 - \mu_2\sigma_1^2 \pm \sigma_1\sigma_2\sqrt{(\mu_1-\mu_2)^2 + 2(\sigma_2^2-\sigma_1^2)\left[\log(\sigma_2/\sigma_1) - \log D\right]}}{\sigma_2^2 - \sigma_1^2}, \qquad (9)$$
where D = (1 − χ )/χ .
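A minimal sketch of (8) and (9), with purely illustrative normal parameters (the example's actual parameters are specified elsewhere in the paper):

```python
# Sketch of (8)-(9): optimal cut-off point(s) between k1 = chi*f1 and
# k2 = (1-chi)*f2 for two normal densities, with D = (1-chi)/chi.
import numpy as np

def cutoff_points(mu1, s1, mu2, s2, chi):
    D = (1 - chi) / chi
    if np.isclose(s1, s2):
        # single cut-off point, formula (8)
        return ((mu1 + mu2) / 2 + s1**2 * np.log(D) / (mu1 - mu2),)
    # two cut-off points, formula (9); assumes the two weighted curves intersect
    disc = (mu1 - mu2) ** 2 + 2 * (s2**2 - s1**2) * (np.log(s2 / s1) - np.log(D))
    root = s1 * s2 * np.sqrt(disc)
    x1 = (mu1 * s2**2 - mu2 * s1**2 - root) / (s2**2 - s1**2)
    x2 = (mu1 * s2**2 - mu2 * s1**2 + root) / (s2**2 - s1**2)
    return tuple(sorted((x1, x2)))

print(cutoff_points(mu1=25.0, s1=8.0, mu2=50.0, s2=15.0, chi=0.2))  # illustrative
```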
When χ is random, with prior density ξ(χ), using the same argument as previously,
the optimal decision cut-off point(s) is (are) now the intersection point(s)
of E(χ) f_1(x) and (1 − E(χ)) f_2(x), denoted x_0^*, or x_1^* and x_2^*, which have the
same expressions (8) and (9), with χ replaced by E(χ). These points, in turn, will
determine the misclassification probabilities τ* and δ*, with ε* = τ* + δ*, as
in (19) and (20) in the Appendix.
The posterior distribution of χ is based on sampling results that are considered
valid and reliable as a second source of information on χ, in order to update its
distribution. As such, they are associated with an unknown value χ_0 of χ, within
the domain of its distribution ξ(χ), which is a realisation of the random variable
χ. Hence, let H_3 be a population whose density is the mixture g(x) already mentioned,
with χ = χ_0. A sample of size n is now taken from H_3, either by simulating
n observations from g(x), or by taking a physical random sample from H_3
(i.e. creating a mixture of physical elements from H_1 and H_2, in the
proportions χ_0 and (1 − χ_0) respectively, and taking a random sample from that
population). It will serve to update the distribution of χ, if we accept the evidence
provided by this sample originating from the value χ_0. But, whereas the probability
for an element x_i from a sample taken from H_3 to come from H_1
is still χ, its probability to be classified as such, in the sampling phase, is now
θ = χ(1 − τ*) + (1 − χ)δ* = δ* + Mχ, instead of χ, where τ* = P(H_2|H_1)
and δ* = P(H_1|H_2) are the two misclassification probabilities obtained earlier,
and M = 1 − ε*. By using the above cut-off point(s) to classify x_i, we now know
exactly to which distribution an observation will be assigned. Hence, in considering
the whole sample {x_1, . . . , x_n} taken from H_3, we have j* observations classified
into H_1 and n − j* into H_2, and we can derive the exact posterior distribution of χ.
This is the decision-directed learning approach, as outlined in Titterington et al.
(1985), but in a direct, non-sequential sampling context.
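The sampling mechanism just described is easy to simulate; in the sketch below, the values of χ_0, τ* and δ* are assumed for illustration, and j*, the number of observations classified into H_1, has a binomial distribution with success probability θ.

```python
# Sketch of the sampling stage: with assumed misclassification probabilities
# tau*, delta* and an assumed true proportion chi0, the number j* of sample
# points classified into H1 is Binomial(n, theta), theta = delta* + (1-eps*)*chi0.
import numpy as np

rng = np.random.default_rng(0)
tau_star, delta_star = 0.25, 0.13        # assumed misclassification probabilities
eps_star = tau_star + delta_star
chi0, n = 0.2, 20                        # assumed true proportion and sample size

theta = chi0 * (1 - tau_star) + (1 - chi0) * delta_star   # = delta* + (1-eps*)*chi0
labels_from_H1 = rng.random(n) < chi0                     # which component generated x_i
# an element from H1 is classified into H1 w.p. 1-tau*, one from H2 w.p. delta*
classified_H1 = np.where(labels_from_H1,
                         rng.random(n) < 1 - tau_star,
                         rng.random(n) < delta_star)
j_star = int(classified_H1.sum())
print(f"theta = {theta:.4f},  simulated j* = {j_star} out of n = {n}")
```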
4 Posterior analysis
Theorem 2 Let χ be the mixing proportion in the mixture density g(x) = χ f_1(x) +
(1 − χ) f_2(x), and let misclassification into H_2, or into H_1, have probabilities τ*
and δ* respectively. If χ has beta(χ; α, β) as prior, and if j* out of n observations
from the mixture are found to be from H_1, then
(1) χ has as posterior distribution
$$\phi^{(n,j^*)}(\chi) = \mathrm{beta}(\chi;\alpha,\beta)\,[1 - A\chi]^{j^*}[1 - B\chi]^{n-j^*}\big/P_0^{(n,j^*)}(A,B), \qquad (10)$$
where A = (ε* − 1)/δ*, B = (1 − ε*)/(1 − δ*), and P_0^{(n,j*)} is a polynomial
of degree n in A and B.
where (a, n) is the Pochhammer notation for the product a(a + 1) · · · (a + n − 1).
But this series has only a finite number of terms here, since b_1 and b_2 are negative
integers. If we define the polynomial
$$P_k^{(n,j)}(A,B) = \sum_{m_2=0}^{n-j}\sum_{m_1=0}^{j}\frac{(\alpha+k,\,m_1+m_2)\,(-j,\,m_1)\,(j-n,\,m_2)}{(\alpha+k+\beta,\,m_1+m_2)}\cdot\frac{A^{m_1}}{m_1!}\cdot\frac{B^{m_2}}{m_2!}, \qquad (13)$$
for k = 0, 1, 2, . . ., the posterior density of χ is then:
$$\phi^{(n,j^*)}(\chi) = \mathrm{beta}(\chi;\alpha,\beta)\,[1 - A\chi]^{j^*}[1 - B\chi]^{n-j^*}\big/P_0^{(n,j^*)}(A,B).$$
When j* = n or j* = 0, we have L = B(α, β)·$_2F_1$(α, −n; α + β; A) or
L = B(α, β)·$_2F_1$(α, −n; α + β; B) respectively, where $_2F_1$ is Gauss' hypergeometric function. Using
(13), we then have P_0^{(n,n)} and P_0^{(n,0)} as a polynomial in A, or in B, only.
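Formulas (10) and (13) can be implemented directly; the sketch below uses assumed values of n, j*, τ* and δ* (the Pochhammer symbol (a, m) is available as scipy.special.poch), and the quadrature check confirms that P_0^{(n,j*)}(A, B) normalizes the posterior.

```python
# Sketch of (10) and (13): posterior density of chi after j* of n observations
# are classified into H1; tau*, delta*, n, j* are assumed illustrative values.
from math import factorial
from scipy.special import poch
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

alpha, bet = 4.0, 16.0                 # beta prior parameters
tau_s, delta_s = 0.25, 0.13            # assumed misclassification probabilities
eps_s = tau_s + delta_s
n, j = 20, 6                           # sample size and number classified into H1
A = (eps_s - 1) / delta_s              # note A < 0
B = (1 - eps_s) / (1 - delta_s)

def P(k):
    """Polynomial P_k^{(n,j)}(A, B) of formula (13)."""
    return sum(poch(alpha + k, m1 + m2) * poch(-j, m1) * poch(j - n, m2)
               / poch(alpha + k + bet, m1 + m2)
               * A**m1 / factorial(m1) * B**m2 / factorial(m2)
               for m2 in range(n - j + 1) for m1 in range(j + 1))

P0 = P(0)

def posterior(chi):
    """Posterior density (10) of chi."""
    return (beta_dist.pdf(chi, alpha, bet)
            * (1 - A * chi)**j * (1 - B * chi)**(n - j) / P0)

print("normalization check:", quad(posterior, 0, 1)[0])   # should be close to 1
```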
The above Lissack and Fu and Bhattacharyya bounds, and the two effectiveness
measures for the distributions of χ, λ_ξ^{(p)}, for 0 < p, p ≠ 1, and θ_ξ, can now be
computed again, as in (2), (3) and (4), with φ^{(n,j*)}(χ) replacing ξ(χ), and the
posterior distribution of W, given by (11), replacing its prior.
Case of J_1(H_1, H_2|χ): For p = 1, and χ a constant, we have equality, i.e. 2P_e =
1 − J_1(H_1, H_2|χ), with J_1(H_1, H_2|χ) being the L^1-distance between χ f_1 and
(1 − χ) f_2 (with the Lebesgue measure), also called the L^1(χ) distance between
the two populations. Similarly, when χ has distribution ξ(χ), we have
2E_χ[P_e] = 1 − E_χ[J_1(H_1, H_2|χ)]. The first equality hence allows us to derive the
distribution of J_1(H_1, H_2|χ), once the distribution of P_e is known, in a situation
reverse to the previous ones.
This situation occurs when we do not know the two normal densities f_1 and
f_2, and are unsure of the value of χ itself, but, by monitoring the misclassification
results, we have some information on the two misclassification probabilities, τ
and δ, which can be considered independent random variables with separate
distributions. For example, they can be two beta variables, each on the interval
[0, 1/4]. We can then derive the distribution of J_1(H_1, H_2|χ). We have:
Proof First, we have the density of ε = τ + δ as the density of the sum of two
independent general beta r.v.'s, each defined on [0, 1/4]. The general expression of
that last density is given in Pham-Gia and Turkkan (1998), for the case c = e = 0
and d = f = 0.25. The density of P_e is hence: for 0 ≤ y ≤ 1/4,
$$f(y) = K_1\, 4^{\alpha_1+\alpha_2}\, y^{\alpha_1+\alpha_2-1}(1-4y)^{\beta_1-1}\cdot F_D^{(2)}\!\left(\alpha_2, 1-\beta_1, 1-\beta_2;\ \alpha_1+\alpha_2;\ \frac{4y}{4y-1},\, 4y\right),$$
with $K_1 = \Gamma(\alpha_1+\beta_1)\Gamma(\alpha_2+\beta_2)/[\Gamma(\alpha_1+\alpha_2)\Gamma(\beta_1)\Gamma(\beta_2)]$, and for 1/4 ≤ y ≤ 1/2,
$$f(y) = K_2\, 2^{\beta_1+\beta_2+1}(1-2y)^{\beta_1+\beta_2-1}(4y-1)^{\alpha_2-1}\cdot F_D^{(2)}\!\left(\beta_2, 1-\alpha_1, 1-\alpha_2;\ \beta_1+\beta_2;\ 2(1-2y),\, \frac{2(2y-1)}{4y-1}\right),$$
with $K_2 = \Gamma(\alpha_1+\beta_1)\Gamma(\alpha_2+\beta_2)/[\Gamma(\beta_1+\beta_2)\Gamma(\alpha_1)\Gamma(\alpha_2)]$. The above density
of Z = J_1(H_1, H_2) = 1 − 2P_e can then be obtained by a change of variable and
some computations.
5 A numerical example
This numerical example, taken under very general conditions, will illustrate the
approach and methods presented in the paper.
5.1 Problem
(a) We can see that the two normal densities f_1 and f_2 intersect at x_1 = 11.198
and x_2 = 45.602, and the two misclassification probabilities are, according to
(A2), τ = 0.2455 and δ = 0.1284. Hence, we have the overlapping coefficient
ε_0 = 0.3739, and the L^1-distance between the two distributions is J_1(H_1, H_2) =
1.2522.
In the absence of χ, classification is directly based on the ratio of likelihoods
of the two normal densities. For a new observation y, the rule would then be:
classify y as belonging to H_1 if 11.198 ≤ y ≤ 45.602, and as belonging to
H_2 otherwise.
(b) If only the distribution of χ is known, all the above quantities become random
variables. Taking the two functions E(χ ) f 1 (x) and (1 − E(χ )) f 2 (x), the two
cut-off points are x1∗ and x2∗ , and we can proceed as presented in previous
sections. The detailed analysis will be shown in the following sub-sections.
(a) The prior distribution of χ being ξ_1(χ) = beta(4, 16), we have E[χ_prior] = 0.20
and Var(χ_prior) = 7.619 × 10^{-3} (or σ_χ = 0.08728). The absolute distance W
between χ and 1 − χ then has as prior density, as given by (1): for 0 ≤ w ≤ 1,
$$f(w) = \frac{(1-w)^3(1+w)^{15} + (1-w)^{15}(1+w)^3}{2^{19}\,B(4,16)} = \frac{(1-w)^3(1+w)^{15} + (1-w)^{15}(1+w)^3}{33.82514}.$$
Hence, we have E_χ[W_prior] = 0.6003 and Var_χ[W_prior] = 0.0301.
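These two moments can be checked numerically from density (1); the following sketch (an independent check, not part of the paper's computations) integrates the density directly.

```python
# Sketch: check E[W] and Var[W] for W = |1 - 2*chi|, chi ~ beta(4,16),
# using the prior density (1) of W.
from scipy.integrate import quad
from scipy.special import beta as beta_fn

a, b = 4, 16
def f_W(w):
    # density (1)
    return ((1 - w)**(a - 1) * (1 + w)**(b - 1)
            + (1 - w)**(b - 1) * (1 + w)**(a - 1)) / (2**(a + b - 1) * beta_fn(a, b))

m1 = quad(lambda w: w * f_W(w), 0, 1)[0]
m2 = quad(lambda w: w**2 * f_W(w), 0, 1)[0]
print(f"E[W] = {m1:.4f},  Var[W] = {m2 - m1**2:.4f}")   # about 0.6003 and 0.0301
```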
(b) To obtain the Lissack–Fu and Bhattacharyya bounds for 2E_χ(P_e), we compute
$$E_\chi[J_p(H_1,H_2|\chi)] = \int_0^1\!\!\int_{\mathbb{R}} |\chi f_1(t) - (1-\chi) f_2(t)|^p\cdot\frac{\chi^{\alpha-1}(1-\chi)^{\beta-1}}{B(\alpha,\beta)\,[\chi f_1(t) + (1-\chi) f_2(t)]^{p-1}}\,dt\,d\chi, \qquad (16)$$
$$\rho = \int_0^{\infty}\sqrt{f_1(x)f_2(x)}\,dx \quad\text{and}\quad E_\chi\!\left[\sqrt{1-4\chi(1-\chi)\rho^2}\right] = \int_0^1\sqrt{1-4\chi(1-\chi)\rho^2}\;\frac{\chi^{\alpha-1}(1-\chi)^{\beta-1}}{B(\alpha,\beta)}\,d\chi, \qquad (17)$$
and
$$E_\chi\!\left[\sqrt{\chi(1-\chi)}\right]\rho = \rho\int_0^1\sqrt{\chi(1-\chi)}\;\frac{\chi^{\alpha-1}(1-\chi)^{\beta-1}}{B(\alpha,\beta)}\,d\chi. \qquad (18)$$
For example, for p = 2 and the beta(4, 16) prior, these bounds are 0.0896
and 0.1616. On the other hand, Bhattacharyya's bounds for P_e are 0.0705 and
0.251, as given by (4). Also, for the degenerate distribution at 1/2, we have:
$$J_2(H_1, H_2|\chi = 1/2) = \frac{1}{2}\int_{\mathbb{R}}\frac{|f_1(t) - f_2(t)|^2}{f_1(t) + f_2(t)}\,dt = 0.4670.$$
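The integrals (16)–(18) can be approximated by nested quadrature; the sketch below does so for the beta(4, 16) prior, with placeholder normal densities since the example's parameters are specified elsewhere in the paper, so the printed values are illustrative only.

```python
# Sketch of (16)-(18): expected Lissack-Fu and Bhattacharyya bounds for E[Pe]
# under a beta(alpha,beta) prior on chi; the normal densities are placeholders.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, beta as beta_dist

f1, f2 = norm(25.0, 8.0).pdf, norm(50.0, 15.0).pdf   # hypothetical densities
alpha, bet = 4.0, 16.0
prior = lambda c: beta_dist.pdf(c, alpha, bet)

def J_p(p, chi):
    """J_p = int |chi f1 - (1-chi) f2|^p / g^{p-1} dt."""
    def integrand(t):
        k1, k2 = chi * f1(t), (1 - chi) * f2(t)
        return abs(k1 - k2) ** p / (k1 + k2) ** (p - 1)
    return quad(integrand, -np.inf, np.inf, limit=200)[0]

def E_prior(h):
    """Expectation of h(chi) under the beta prior."""
    return quad(lambda c: h(c) * prior(c), 0, 1)[0]

p = 2
EJp = E_prior(lambda c: J_p(p, c))
EJp_root = E_prior(lambda c: J_p(p, c) ** (1 / p))
rho = quad(lambda x: np.sqrt(f1(x) * f2(x)), -np.inf, np.inf)[0]

print("Lissack-Fu bounds for E[Pe]:", (1 - EJp_root) / 2, "to", (1 - EJp) / 2)
print("Bhattacharyya bounds for E[Pe]:",
      (1 - E_prior(lambda c: np.sqrt(1 - 4 * c * (1 - c) * rho**2))) / 2,
      "to", E_prior(lambda c: np.sqrt(c * (1 - c))) * rho)
```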
(c) S-effectiveness of a prior and comparison between two priors for S-effectiveness:
We have the S-effectiveness coefficient for ξ_1, the beta(4, 16) prior, given by
(5), for p > 1, and by (6), for p < 1. For p = 2, for example, λ_{ξ_1}^{(2)} = 0.1198.
If another prior is considered, ξ_2 = beta(2, 18) for example, we then have λ_{ξ_2}^{(2)} =
0.0560. Hence, for p = 2, ξ_1 >_{LF(2)} ξ_2. This means that, although beta(2, 18)
is a more discriminating prior than beta(4, 16), in the sense that it better separates
χ and 1 − χ on input, its return (when we also consider the distance between the two
Lissack–Fu bounds for p = 2) is inferior to that of beta(4, 16). This should not,
however, be the only reason why we should use beta(4, 16) instead of beta(2, 18)
as prior for χ.
Similarly, we can compute Bhattacharyya's S-effectiveness and make the
corresponding comparison. We have θ_{ξ_1} = 0.3006 and θ_{ξ_2} = 0.1800, and again
we can write ξ_1 >_B ξ_2. It is noted that the upper bound for λ_{ξ_1}^{(2)}, provided by
Theorem 1, is 0.2083, while it is 0.1565 for λ_{ξ_2}^{(2)}. Similarly, the upper bound for
θ_{ξ_1} is 0.3451, and for θ_{ξ_2} it is 0.2588.
For p = 1, we have E_χ[J_1(H_1, H_2|χ)] = 0.7900 for χ ∼ beta(4, 16).
Fig. 2 Prior and posterior densities of χ
Fig. 3 Prior and posterior densities of the distance W between χ and 1 − χ
to the prior L–F bounds, and are hence represented by the same curves in Figure 3.
(For illustration, we have also considered the case where j* = 6, and graphed the
corresponding L–F bounds in Figure 4. We can see that the two curves have now
shifted upwards, and for p = 2, the posterior L–F bounds are 0.0953 and 0.1712,
which represent slight increases from the prior L–F bounds.) Bhattacharyya's
posterior bounds have also slightly increased and are now 0.0710 and 0.254.
Fig. 4 Prior and posterior Lissack–Fu bounds for P_e, as functions of p
The S-effectiveness measures of the posterior distribution $\phi_{\xi_1}^{(20,6)}$, i.e. the posterior S-effectiveness
measures, are now $\lambda^{(2)}_{\phi_{\xi_1}^{(20,6)}} = 0.1246$, a slight increase from its prior value, and
$\theta_{\phi_{\xi_1}^{(20,6)}} = 0.2971$, which represents a slight decrease.
Let f_1, f_2 and χ be unknown, and let us suppose that the two misclassification
probabilities have been monitored and can be given beta distributions. More
precisely, these two components τ and δ of P_e are now considered random, with
τ ∼ beta(5, 22; 0, 1/4) and δ ∼ beta(3, 18; 0, 1/4). Considering now the variable
J_1(H_1, H_2|χ), we have, by (14) and (15), the density of Z defined on (0, 1/2), and
on (1/2, 1), by different expressions.
The density of Z, the L^1(χ) distance between the two populations, is hence
unimodal, defined on [0, 1], and its graph is given in Figure 5. Its mean is 0.836
and its variance is 0.00274.
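A quick Monte Carlo sketch (an independent check, not the paper's derivation) of Z = 1 − 2P_e with P_e = τ + δ and the two scaled beta components above reproduces a mean near 0.836 and a variance near 0.00274.

```python
# Monte Carlo sketch: Z = 1 - 2*Pe with Pe = tau + delta,
# tau ~ beta(5,22) and delta ~ beta(3,18), each rescaled to [0, 1/4].
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
tau = 0.25 * rng.beta(5, 22, N)
delta = 0.25 * rng.beta(3, 18, N)
Z = 1 - 2 * (tau + delta)
print(f"mean(Z) = {Z.mean():.3f},  var(Z) = {Z.var():.5f}")   # about 0.836 and 0.00274
```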
6 Conclusion
The Bayesian approach to the study of the bounds for the Bayes error P_e is an
effective approach to consider in the case where the precise value of χ is not known
and only its probability distribution can be determined, as is frequently the case
in applications. Then, bounds for P_e, coupled with S-effectiveness measures for
the prior distribution, will lead to a very pertinent choice of that prior. The converse
problem, obtaining the L^1(χ) distance between the two populations, is also of
interest.
Fig. 5 Distribution of J_1(H_1, H_2|χ), with beta-distributed components, τ and δ, of P_e
Appendix
We will first look at the intersection point(s) of two densities, and then present the
normal case in detail.
(α) Two univariate densities f_1(x) and f_2(x) can intersect at one or several
points, which are the roots of the equation f_1(x) − f_2(x) = 0. Hence, their common
region, called the overlapping region, can consist of a single region or of several
disjoint regions, but integration of h(x) = min{f_1(x), f_2(x)} gives the value of
ε_0, the area of the overlapping region.
Let us consider the case where f_1(x) and f_2(x) are unimodal densities. Then they can
intersect at one or two points.
(a) One intersection point I: Then the overlapping region is a singly connected region, which consists
of two parts, whose measures are given respectively by:
$$\tau = \int_{x \ge x_0} h(x)\,dx \quad\text{and}\quad \delta = \int_{x \le x_0} h(x)\,dx. \qquad (19)$$
(b) Two intersection points I_1 and I_2, at abscissas x_1 and x_2: Then the overlapping region is
now the union of two regions. The first region is a simply connected one, limited
by the two points x_1 and x_2 and the lower of the two density functions between
these points. The second region consists of two disjoint areas, to the left and to the
right of x_1 and x_2, respectively, and below the other density function.
Again, we have
$$\tau = \int_{\{x \ge x_2\}\cup\{x \le x_1\}} h(x)\,dx \quad\text{and}\quad \delta = \int_{x_1 \le x \le x_2} h(x)\,dx. \qquad (20)$$
$$\mathrm{Log}\,\varphi(x) = \frac{1}{2}\left[-x^2(\sigma_1^{-2} - \sigma_2^{-2}) + 2x(\mu_1\sigma_1^{-2} - \mu_2\sigma_2^{-2}) - (\mu_1^2\sigma_1^{-2} - \mu_2^2\sigma_2^{-2}) + 2\,\mathrm{Log}(\sigma_2/\sigma_1)\right], \qquad (21)$$
where $\varphi(x) = f_1(x)/f_2(x)$ is the likelihood ratio and $f_i(x) = \exp[-(x-\mu_i)^2/2\sigma_i^2]/(\sigma_i\sqrt{2\pi})$, i = 1, 2, −∞ < x < ∞.
Solving Log ϕ(x) = 0, we obtain:
(1) If σ1 = σ2 = σ , the above equation becomes linear, and the only intersec-
tion point I has x0 = (µ1 + µ2 )/2 as abscissa.
We then have, as given by (19),
$$\tau = \delta = 1 - \Phi(\xi), \qquad (22)$$
where ξ = (μ_2 − μ_1)/2σ, and Φ is the cumulative distribution function of the
standard normal.
(2) If σ_1 ≠ σ_2, since we always have K = 2[σ_2^2 − σ_1^2] log(σ_2/σ_1) ≥ 0, there
are two intersection points I_1 and I_2, at abscissas x_1 and x_2 with values:
$$\frac{(\mu_1\sigma_2^2 - \mu_2\sigma_1^2) \pm \sigma_1\sigma_2\sqrt{(\mu_1-\mu_2)^2 + K}}{\sigma_2^2 - \sigma_1^2}. \qquad (23)$$
Let us suppose x_1 ≤ x_2. We then have, as given by (20):
$$\tau = 1 - \Phi[(x_2-\mu_1)/\sigma_1] + \Phi[(x_1-\mu_1)/\sigma_1], \qquad (24)$$
while
$$\delta = \Phi[(x_2-\mu_2)/\sigma_2] - \Phi[(x_1-\mu_2)/\sigma_2]. \qquad (25)$$
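A short sketch of (23)–(25), with illustrative (assumed) normal parameters, returning the intersection abscissas and the misclassification probabilities τ and δ:

```python
# Sketch of (23)-(25): intersection points and misclassification probabilities
# tau, delta for two normal densities with unequal variances (illustrative values).
import numpy as np
from scipy.stats import norm

def overlap_normals(mu1, s1, mu2, s2):
    K = 2 * (s2**2 - s1**2) * np.log(s2 / s1)              # always >= 0
    root = s1 * s2 * np.sqrt((mu1 - mu2) ** 2 + K)
    x1 = ((mu1 * s2**2 - mu2 * s1**2) - root) / (s2**2 - s1**2)
    x2 = ((mu1 * s2**2 - mu2 * s1**2) + root) / (s2**2 - s1**2)
    x1, x2 = min(x1, x2), max(x1, x2)
    tau = 1 - norm.cdf((x2 - mu1) / s1) + norm.cdf((x1 - mu1) / s1)   # (24)
    delta = norm.cdf((x2 - mu2) / s2) - norm.cdf((x1 - mu2) / s2)     # (25)
    return x1, x2, tau, delta

x1, x2, tau, delta = overlap_normals(mu1=25.0, s1=8.0, mu2=50.0, s2=15.0)
print(f"x1 = {x1:.3f}, x2 = {x2:.3f}, tau = {tau:.4f}, delta = {delta:.4f}")
```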
(B) Equal means μ_1 = μ_2 = μ, with σ_1 ≠ σ_2: We have two normal distributions
with the same mean, and symmetrical intersection points $\mu \pm \sigma_1\sigma_2\sqrt{E}$,
where $E = 2\log(\sigma_2/\sigma_1)/(\sigma_2^2 - \sigma_1^2) \ge 0$. Then τ and δ can be obtained as in the
two-intersection-points case above, i.e. $\tau = 1 - \Phi(\sigma_2\sqrt{E}) + \Phi(-\sigma_2\sqrt{E})$ and
$\delta = \Phi(\sigma_1\sqrt{E}) - \Phi(-\sigma_1\sqrt{E})$. If, moreover, σ_1 = σ_2, the two distributions are
identical and we can interpret, for this case, ε = 1, with one of the two probabilities
τ or δ equal to 1, and the other zero.
(γ) Let us consider two populations H_1 and H_2, with univariate densities f_1
and f_2, which have as common area the overlapping region considered above. In Pattern
Classification, a known constant prior probability χ is assigned to f_1 and (1 − χ) to
f_2, and this is equivalent to using k_1(x) = χ f_1(x) and k_2(x) = (1 − χ) f_2(x) in
classification.
For any adopted decision rule, by which a new observation is classified as
belonging either to H_1 or to H_2, the two misclassification probabilities, P(1|2) and P(2|1),
are present. In ℝ, the classification regions, denoted Ω_1 and Ω_2 respectively, can
be determined arbitrarily by any one, or two, cut-off point(s), but it can be shown
that the regions that minimize the total misclassification probability (TMP) are
obtained by considering the common region(s) of the two curves k_1(x) and k_2(x).
Alternately, we can solve the equation Log φ(x) = log D, where φ(x) is the ratio
of the two densities and D = (1 − χ)/χ. Hence, let Ω_1^O = {x | k_1(x) ≥ k_2(x)}
and Ω_2^O = {x | k_1(x) < k_2(x)}. If we take Ω_1 = Ω_1^O and Ω_2 = Ω_2^O, we now have,
as misclassification probabilities,
$$\tau = P(2|1) = \int_{\Omega_2^O} k_1(x)\,dx \quad\text{and}\quad \delta = P(1|2) = \int_{\Omega_1^O} k_2(x)\,dx,$$
with their sum ε equal to the measure of their common region(s).
This common region is hence a deformation of the overlapping region between the two densities
f_1(x) and f_2(x) in section (α) above, with the corresponding change in its measure.
The two functions k_1(x) and k_2(x) can have one intersection point x_0, or two
intersection points x_1 and x_2, which determine the optimal regions Ω_1^O and Ω_2^O
above. Any arbitrary decision cut-off point(s) x_c (resp. x_{c1} and x_{c2}), different from
x_0 (resp. from x_1 and x_2), will lead to a TMP larger than the area of this common region, which is
its minimum value, also called the Bayes error, and traditionally denoted by P_e.
We have:
$$P_e = \int_{\mathbb{R}} \min\{k_1(x), k_2(x)\}\,dx. \qquad (26)$$
$$F_D^{(n)}(a, b_1, \ldots, b_n; c; x_1, \ldots, x_n) = \frac{\Gamma(c)}{\Gamma(a)\Gamma(c-a)}\int_0^1 u^{a-1}(1-u)^{c-a-1}(1-ux_1)^{-b_1}\cdots(1-ux_n)^{-b_n}\,du. \qquad (28)$$
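When c > a > 0, the integral representation (28) can be evaluated by one-dimensional quadrature; a minimal sketch, with arbitrary illustrative arguments, follows.

```python
# Sketch: numerical evaluation of Lauricella's F_D^(n) through the integral
# representation (28), valid here for real parameters with c > a > 0.
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def lauricella_FD(a, b, c, x):
    """F_D^(n)(a, b_1..b_n; c; x_1..x_n) via the Euler-type integral (28)."""
    b, x = np.asarray(b, float), np.asarray(x, float)
    def integrand(u):
        return (u**(a - 1) * (1 - u)**(c - a - 1)
                * np.prod((1 - u * x) ** (-b)))
    val = quad(integrand, 0, 1)[0]
    return gamma(c) / (gamma(a) * gamma(c - a)) * val

# illustrative call with two variables
print(lauricella_FD(a=1.5, b=[0.5, 0.7], c=3.0, x=[0.2, -0.3]))
```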
References