
Statistical Methods and Applications (2007) 16: 7–26

DOI 10.1007/s10260-006-0012-x

ORIGINAL ARTICLE

T. Pham-Gia∗ · N. Turkkan · A. Bekker

Bounds for the Bayes error in classification: a Bayesian approach using discriminant analysis

Accepted: 21 February 2006 / Published online: 19 April 2006


© Springer-Verlag 2006

Abstract We study two of the classical bounds for the Bayes error Pe , Lissack
and Fu’s separability bounds and Bhattacharyya’s bounds, in the classification of
an observation into one of the two determined distributions, under the hypothesis
that the prior probability χ itself has a probability distribution. The effectiveness
of this distribution can be measured in terms of the ratio of two mean values. On
the other hand, a discriminant analysis-based optimal classification rule allows
us to derive the posterior distribution of χ , together with the related posterior
bounds of Pe .
Keywords Overlapping coefficient · Discriminant analysis · Misclassification ·
Lissack and Fu bounds · Bhattacharyya bounds · Hypergeometric functions ·
Bernoulli · Beta distribution

1 Introduction

In classical univariate Discriminant analysis, the Bayes error represents the minimum error of misclassification when classifying an observation into either one of the two populations H1 and H2, and it has been the subject of numerous studies.
∗ Research partially supported by NSERC grant A 9249 (Canada). The authors wish to thank two referees for their very pertinent comments and suggestions, which have helped to improve the quality and the presentation of the paper; whenever possible, their concerns have been addressed.
T. Pham-Gia (B) · N. Turkkan
Université de Moncton,
Moncton, Canada
E-mail: phamgit@umoncton.ca

A. Bekker
University of South Africa (UNISA),
Pretoria, South Africa


It is given by $P_e = \int_{\mathbb{R}} \min\{\chi f_1(x), (1-\chi) f_2(x)\}\,dx$, where f1 and f2 are the densities of the two populations, usually called class-conditional probabilities in the Pattern recognition literature, while χ, the known constant weight attached to f1, is the prior probability given to it, based on our knowledge of the problem. But χ is also the proportion of H1 in the mixture of the two above densities, i.e. $g(x) = \chi f_1(x) + (1-\chi) f_2(x)$. Because of the difficulties in getting the exact value of Pe other than by numerical methods, bounds for Pe are important to obtain. One type of such bounds, based on the mixture g(x), has been obtained by Lissack and Fu (1976). They are called the posterior separability bounds in $L^p$-spaces, denoted by $J_p(H_1, H_2|\chi)$, and are obtained, using Bayes' theorem, from the posterior distributions of H1 and H2, given the observation X = x. On the other hand, Bhattacharyya's bounds are defined as
$$1 - \sqrt{1 - 4\chi(1-\chi)\rho^2} \;\le\; 2P_e \;\le\; 2\sqrt{\chi(1-\chi)}\,\rho,$$
where $\rho = \int_{-\infty}^{\infty}\sqrt{f_1(x) f_2(x)}\,dx$ is the $L^1$-norm of the geometric mean of the two densities, with 1 − ρ being the classical Hellinger distance between them. Naturally, other bounds encountered in the literature could also be used here. For example, the Nearest Neighbor error $L_{NN}$ can be considered, but since only its asymptotic value can be obtained analytically, it would be difficult to use it in a finite sampling scheme. But, since this asymptotic value lies between Pe and the Bhattacharyya bound, we still have some information on $L_{NN}$ for a large sample.
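As a quick illustration of these quantities (not part of the original text), the coefficient ρ and the resulting Bhattacharyya bounds on Pe can be evaluated by numerical quadrature for two given normal densities and a fixed weight χ. The sketch below uses Python with SciPy; the density parameters are borrowed from the numerical example of section 5, while the value χ = 0.3 is an arbitrary placeholder.

```python
# Sketch: Bhattacharyya bounds on the Bayes error Pe for a fixed prior weight chi.
# The two normal densities and the value of chi are illustrative choices only.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

f1 = norm(loc=5.0, scale=9.0).pdf    # class-conditional density of H1
f2 = norm(loc=18.0, scale=6.0).pdf   # class-conditional density of H2
chi = 0.3                            # fixed prior probability of H1 (placeholder)

# rho = integral of sqrt(f1 * f2): L1-norm of the geometric mean of the densities
rho, _ = quad(lambda x: np.sqrt(f1(x) * f2(x)), -np.inf, np.inf)

# Exact Bayes error by direct quadrature, for comparison with the bounds
Pe, _ = quad(lambda x: min(chi * f1(x), (1 - chi) * f2(x)), -np.inf, np.inf)

lower = (1 - np.sqrt(1 - 4 * chi * (1 - chi) * rho**2)) / 2   # lower bound on Pe
upper = np.sqrt(chi * (1 - chi)) * rho                        # upper bound on Pe

print(f"rho = {rho:.4f}, lower = {lower:.4f}, Pe = {Pe:.4f}, upper = {upper:.4f}")
```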
The Bayes error can also be related to other concepts used in Information The-
ory, such as Entropy, or Divergence (Devroye et al. 1996), and a study along the
lines of this article can be carried out with these concepts. Of related interest is
the classification problem from the L1 -norm viewpoint by Pham-Gia et al. (2005).
For clarity, we will consider the univariate case only. The multivariate case, which, however, requires different approaches, will be treated in a subsequent paper.
In this article, the two densities f 1 and f 2 are supposed known, but, in practice,
they can be evaluated from the two available training samples, if these samples are
taken from H1 and H2 separately. A variety of methods, ranging from the method of moments to kernel density estimation, can be used for this purpose.
Very often, however, the precise value of χ is not known, making the above-
mentioned bounds of quite limited value. But, as in most phenomena, the range of
values for χ is often known, and sometimes even its distribution, based on experts’
opinion, as well as on past cases. In this article, we will give a distribution to the
unknown χ , and study its effects on the above two types of bounds. This infor-
mative prior should be quite useful in real applications, but even when it is not
available because of a lack of adequate information, non-informative priors, which
are well discussed in the Bayesian literature, can be used instead. Furthermore,
using rules provided by Discriminant analysis on a sample taken from the mixture
g(x), we will derive the posterior distribution of χ , which will then serve to update
these bounds. This process can be repeated, to refine the estimation of χ . Also,
effectiveness measures will be introduced to evaluate the performance of a distri-
bution, making comparisons between two prior distributions possible, and hence,
permitting useful conclusions and decisions to be drawn.
The general merits of the Bayesian approach in applied statistics, over those of
classical statistics, have been, and continue to be, a subject of hot debate, and we do
not wish to address them here. The interested reader is referred to Bernardo and
Smith (1994) for a fuller discussion. Concerning the specific method presented

here, our numerical example in section 5 will highlight the different steps. On
the Bayesian approach to the mixture problem, Titterington et al. (1985) gives an
overview of the sequential approach available, prior to the use of computer inten-
sive methods, while Robert (1996) provides a comprehensive survey of methods
based on Monte Carlo Markov chains (MCMC). McLachlan and Basford (1988)
can be consulted for basic results in mixture distribution analysis. It should be
mentioned that in the case where χ , f 1 and f 2 are unknown, and data are obtained
from the mixture H3 , the estimation of χ is usually carried out empirically, fre-
quently using the Expectation–Maximization (EM) approach when the maximum
likelihood approach is used. A Bayesian approach for the general case has been
suggested in McLachlan and Peel (2000), and a recent survey of the MCMC meth-
ods used in this domain, and the usual problems encountered, is given by Jasra et al.
(2005).
In section 2, with a distribution assigned to the prior probability χ , bounds are
obtained for the expected Bayes error in classification, and effectiveness measures
for that prior distribution are derived. In section 3, parameters involved in the sam-
pling stage are clearly identified, and the classification of an observation taken from
the mixture is discussed. In section 4, the posterior distribution of χ is derived in
closed form, based on results from a sample taken from the mixture, and bounds for
the Bayes error are recomputed. Conversely, for χ constant but unknown, the distri-
bution of J1(H1, H2|χ), the L1-distance between the two populations, will be derived,
under some hypotheses on the two mutually complementary components of Pe .
Finally, a numerical example in section 5 provides concrete illustrations of the
results obtained in previous sections, and clearly presents the unified approach to
the study of the mean Bayes error bounds proposed by our paper. In the Appendix,
we first recall some basic Discriminant analysis procedures applied to two nor-
mal populations, where the overlapping coefficient, which represents the sum of
misclassification probabilities, plays a crucial role. Some results in hypergeometric
functions of several variables, needed to derive our results, are also presented there.

2 Bounds for mean Bayes error

2.1 Some inequalities on means

Let us now consider a mixture of the form $g(t) = \chi f_1(t) + (1-\chi) f_2(t)$, where the weight χ is constant, 0 ≤ χ ≤ 1.
For p > 0, let $J_p(H_1, H_2|\chi) = \int_{\mathbb{R}} |P(H_1|t) - P(H_2|t)|^p\, g(t)\,dt$ be the $L^p$-posterior distance between the two distributions, using the measure $d\mu(t) = g(t)\,dt$. $P(H_i|t)$ is the probability that an element belongs to population $H_i$, i = 1, 2, after the observation t is made. By Bayes' theorem, we have $\int_{\mathbb{R}} |P(H_1|t) - P(H_2|t)|^p\, g(t)\,dt = \int_{\mathbb{R}} |\chi f_1(t) - (1-\chi) f_2(t)|^p \big/ [g(t)]^{p-1}\,dt$, and hence the (posterior) $L^p$-distance between two populations can be expressed as an $L^p$-distance (with the measure $dt/[g(t)]^{p-1}$) between the two functions $\chi f_1(t)$ and $(1-\chi) f_2(t)$, also called the $L^p(\chi)$ distance between the two populations.
Let us now consider χ as random, with distribution ξ(χ), which we will suppose from now on to be of beta type, denoted beta(χ; α, β), although any distribution on (0, 1) can be used. Since we have $E_\chi[J_1(H_1, H_2|\chi)] = 1 - 2E_\chi[P_e]$ (see also section 4.2), the expected value of the Bayes error can also have as bounds simple expressions of the distances between these distributions, as seen below.
Lemma  Let χ have beta(χ; α, β) as prior distribution. Then

(1) The distance W between χ and 1 − χ has as prior density:
$$f(w) = \frac{(1-w)^{\alpha-1}(1+w)^{\beta-1} + (1-w)^{\beta-1}(1+w)^{\alpha-1}}{2^{\alpha+\beta-1} B(\alpha, \beta)}, \quad 0 \le w \le 1. \tag{1}$$

(2) For 1 ≤ p ≤ ∞, we have, as posterior separability bounds for $E_\chi[J_1(H_1, H_2|\chi)]$ (and the mean value of $2P_e$):
$$E_\chi[J_p(H_1, H_2|\chi)] \;\le\; E_\chi[J_1(H_1, H_2|\chi)] \;\le\; E_\chi\!\left[J_p(H_1, H_2|\chi)^{1/p}\right], \tag{2}$$
and for 0 < p ≤ 1,
$$E_\chi\!\left[J_p(H_1, H_2|\chi)^{1/p}\right] \;\le\; E_\chi[J_1(H_1, H_2|\chi)] \;\le\; E_\chi[J_p(H_1, H_2|\chi)], \tag{3}$$
where the expectations are taken w.r.t. the prior density of χ.

(3) For Bhattacharyya's bounds, we have:
$$1 - E_\chi\!\left[\sqrt{1 - 4\chi(1-\chi)\rho^2}\right] \;\le\; 2E_\chi[P_e] \;\le\; 2E_\chi\!\left[\sqrt{\chi(1-\chi)}\right]\rho, \tag{4}$$
where $\rho = \int_{-\infty}^{\infty}\sqrt{f_1(x) f_2(x)}\,dx$.
Proof
(1) We have W = |1 − 2χ|. Since χ ∼ beta(χ; α, β), by a change of variable we have, after some computation, the above expression (1) for the density of W. Details are available upon request.
(2) We have (see Lissack and Fu 1976): for a value of χ, for 1 < p < ∞,
$$\left[1 - J_p(H_1, H_2|\chi)^{1/p}\right]\!/2 \;\le\; P_e \;\le\; \left[1 - J_p(H_1, H_2|\chi)\right]\!/2,$$
and for 0 < p ≤ 1,
$$\left[1 - J_p(H_1, H_2|\chi)\right]\!/2 \;\le\; P_e \;\le\; \left[1 - J_p(H_1, H_2|\chi)^{1/p}\right]\!/2.$$
We also know that, for p > 1, the distance between the upper and lower bounds of Pe increases with the value of p. Let us consider the prior distribution for χ, for example beta(χ; α, β). Taking expectations w.r.t. χ on both sides of the above expressions, we obtain bounds for $E_\chi[P_e]$, and inequalities (2) and (3) can now be derived, since $E_\chi[J_1(H_1, H_2|\chi)] = 1 - 2E_\chi[P_e]$ (see also section 4.2). Similar arguments lead to Bhattacharyya's bounds for $2E_\chi[P_e]$.
It should be noted that these two types of bounds are complementary, with the first bounds used for p > 0, p ≠ 1, while the second ones apply when p = 1.
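In practice, the expectations appearing in (2)-(4) rarely have closed forms, but they can be approximated by combining quadrature in x with Monte Carlo (or quadrature) over the prior of χ. The following sketch is one possible implementation under assumed normal densities and an assumed beta(4, 16) prior; it is an illustration of the lemma, not code from the paper.

```python
# Sketch: expected Lissack-Fu bounds (2) and Bhattacharyya bounds (4) under a
# beta prior on chi.  The normal densities, the beta(4, 16) prior and the number
# of Monte Carlo draws are illustrative assumptions, not values from the paper.
import numpy as np
from scipy.stats import norm, beta
from scipy.integrate import quad

f1 = norm(5.0, 9.0).pdf     # assumed class-conditional density of H1
f2 = norm(18.0, 6.0).pdf    # assumed class-conditional density of H2

def J_p(chi, p):
    """J_p(H1,H2|chi) = int |chi f1 - (1-chi) f2|^p / g^(p-1) dx, g = chi f1 + (1-chi) f2."""
    def integrand(x):
        k1, k2 = chi * f1(x), (1.0 - chi) * f2(x)
        return abs(k1 - k2) ** p / (k1 + k2) ** (p - 1)
    val, _ = quad(integrand, -np.inf, np.inf, limit=200)
    return val

rho, _ = quad(lambda x: np.sqrt(f1(x) * f2(x)), -np.inf, np.inf)

rng = np.random.default_rng(0)
chis = beta(4, 16).rvs(size=500, random_state=rng)   # draws from the prior of chi
p = 2.0
Jp_vals = np.array([J_p(c, p) for c in chis])
E_Jp, E_Jp_root = Jp_vals.mean(), (Jp_vals ** (1.0 / p)).mean()

# Lemma, eq. (2): E[J_p] <= E[J_1] <= E[J_p^(1/p)] for p > 1, and E[J_1] = 1 - 2 E[Pe]
Pe_lower, Pe_upper = (1.0 - E_Jp_root) / 2.0, (1.0 - E_Jp) / 2.0

# Eq. (4): Bhattacharyya-type bounds on 2 E[Pe]
bh_lower = 1.0 - np.mean(np.sqrt(1.0 - 4.0 * chis * (1.0 - chis) * rho ** 2))
bh_upper = 2.0 * np.mean(np.sqrt(chis * (1.0 - chis))) * rho

print(f"L-F bounds on E[Pe]: [{Pe_lower:.4f}, {Pe_upper:.4f}]")
print(f"Bhattacharyya bounds on E[Pe]: [{bh_lower / 2:.4f}, {bh_upper / 2:.4f}]")
```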



2.2 Separation-effectiveness measures

When χ is constant, a value of χ close to 1 reflects a much larger weight given to f1 than to f2, and the reverse is true if it is close to 0. The value 1/2 reflects indifference, and is used whenever we do not have any information on χ. It is important
to notice that since both χ and its complement 1 − χ intervene in the mixture, they
should be considered together. Hence, their distance W = |1 − 2χ | is of impor-
tance, and reflects the magnitude of the prior information inputted with the prior
distribution. It varies from 0 to 1, with W = 0 when χ = 1/2, and W = 1 when
χ = 0 or 1. These extreme values of χ represent the most discriminating case, in
terms of weight information, for the mixture. When χ becomes a random variable,
the same idea can apply, roughly with E[χ ].
Furthermore, some additional information on χ is provided by its prior dis-
tribution, with the case of the beta studied by Pham-Gia and Turkkan (1994).
This information is to be considered together with W . Extreme cases include the
degenerate prior distribution with mass unity at 1/2, which leads to W = 0 and $J_p(H_1, H_2|\chi = 1/2) = \frac{1}{2}\int_{\mathbb{R}} |f_1(t) - f_2(t)|^p \big/ [f_1(t) + f_2(t)]^{p-1}\,dt < 1$, while for a degenerate distribution with unit mass at either 0 or 1, we have W = 1 and $J_p(H_1, H_2|\chi = 0) = J_p(H_1, H_2|\chi = 1) = 1$ in both cases.
To study one type of effectiveness of a prior we can consider the distance
it generates between the two expected bounds for 2E χ [Pe ], or equivalently, for
E χ [J1 (H1 ,
H2 |χ )], and the distance W that we have, at the start. Their quotient reflects the
ratio of the discriminating output over discriminating input and gives a measure of
the “return” on the prior used. We set:

Definition 1  Let ξ(.) be the prior distribution of χ. We define the separation-effectiveness (noted as S-effectiveness) measures for ξ(.), $\lambda_\xi^{(p)}$ for 0 < p, p ≠ 1, and $\theta_\xi$, as follows:

(1) For 0 < p, p ≠ 1,
$$\lambda_\xi^{(p)} = \frac{E_\chi\!\left[J_p(H_1, H_2|\chi)^{1/p}\right] - E_\chi[J_p(H_1, H_2|\chi)]}{2E_\chi[W]}, \quad \text{for } p > 1, \tag{5}$$
and
$$\lambda_\xi^{(p)} = \frac{E_\chi[J_p(H_1, H_2|\chi)] - E_\chi\!\left[J_p(H_1, H_2|\chi)^{1/p}\right]}{2E_\chi[W]}, \quad \text{for } 0 < p < 1. \tag{6}$$

(2) Similarly, using Bhattacharyya's bounds, we define:
$$\theta_\xi = \frac{2\rho\, E_\chi\!\left[\sqrt{\chi(1-\chi)}\right] + E_\chi\!\left[\sqrt{1 - 4\chi(1-\chi)\rho^2}\right] - 1}{2E_\chi[W]} \tag{7}$$
as the corresponding effectiveness measure in $L^1$, where $\lambda_\xi^{(p)}$ is not defined.

Definition 2  For two distinct prior distributions, ξ1(.) and ξ2(.), and for a fixed value of p, we define ξ1 as more separation-effective than ξ2 (or more S-effective) in the Lissack–Fu sense, denoted by ξ1 >LF(p) ξ2, if $\lambda^{(p)}_{\xi_1} > \lambda^{(p)}_{\xi_2}$.

The above inequality would mean that ξ1 (.) has the ratio between two respective
discriminating measures, for output and input, larger than the same ratio for ξ2 (.),
resulting in a better return. However, it should also be borne in mind that an infor-
mative prior must be chosen because it best reflects the distribution of the possible
values of χ , and not because it gives a better return, i.e. is more S-effective. For the
beta family of beta(α, β) priors, Figure 1 gives the level curves for $\lambda_\xi^{(p)}$, for p = 2, and for $\theta_\xi$, as functions of α and β. We can see that they have similar shapes, with maximum values along the diagonal α = β, and with these values increasing as α and β increase.
This fact can be easily understood since for a symmetrical beta, its mean is 1/2, which represents indifference, as explained before. In the limiting case, when α = β → ∞, we have the degenerate distribution at 1/2, with W = 0, and $\lambda_\xi^{(p)} = \theta_\xi = \infty$ (since the two related numerators remain finite). Taking a symmetrical beta as prior usually provides a better return, but other non-symmetrical beta densities can also provide the same return, as shown by the level curves. Similarly, ξ1 is more effective than ξ2 in the Bhattacharyya sense, denoted by ξ1 >B ξ2, if $\theta_{\xi_1} > \theta_{\xi_2}$.
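Once the expected values entering (5)-(7) are available (for instance from a computation like the sketch in section 2.1 above), the S-effectiveness measures are simple ratios. The helper functions below are a minimal sketch; the inputs E_Jp, E_Jp_root, rho, chis and E_W are assumed to have been obtained beforehand and are not defined in the paper.

```python
# Sketch: S-effectiveness measures of Definition 1, assembled from quantities
# assumed already estimated (E_Jp = E[J_p], E_Jp_root = E[J_p^(1/p)], rho, and
# prior draws 'chis' of chi, e.g. from the sketch in section 2.1).
import numpy as np

def lambda_p(E_Jp, E_Jp_root, E_W, p):
    """Lissack-Fu S-effectiveness, eqs. (5)-(6)."""
    gap = (E_Jp_root - E_Jp) if p > 1 else (E_Jp - E_Jp_root)
    return gap / (2.0 * E_W)

def theta(chis, rho, E_W):
    """Bhattacharyya S-effectiveness, eq. (7), estimated from prior draws of chi."""
    num = (2.0 * rho * np.mean(np.sqrt(chis * (1.0 - chis)))
           + np.mean(np.sqrt(1.0 - 4.0 * chis * (1.0 - chis) * rho ** 2)) - 1.0)
    return num / (2.0 * E_W)

# Example use, with E_W estimated from the same prior draws:
# E_W = np.mean(np.abs(1.0 - 2.0 * chis))
# print(lambda_p(E_Jp, E_Jp_root, E_W, p=2), theta(chis, rho, E_W))
```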

Theorem 1  For any prior distribution ξ(.) of χ, with $E_\xi(\chi) \ne 1/2$:

(1) $\lambda_\xi^{(p)}$ is an increasing function of p, for p > 1. Furthermore, for p fixed, we have:
$$\lambda_\xi^{(p)} < \frac{1 - p^{-1}}{2\,p^{1/(p-1)}\,|1 - 2E_\xi(\chi)|},$$
and
(2) $\theta_\xi < \dfrac{\sqrt{2} - 1}{2\,|1 - 2E_\xi(\chi)|}$.

Proof  Since the distance between the bounds of Pe is an increasing function of p, we have the first property. On the other hand, it is known (Ray 1989a,b) that, for p > 1, the maximum of the difference of the bounds of Pe is $(p^{-1/(p-1)} - p^{-p/(p-1)})/2$ for the L–F bounds and $(\sqrt{2} - 1)/2$ for the Bhattacharyya bounds. For the denominator, we use Jensen's inequality. □

Section 5.2 gives some numerical illustration of the orderings.

Remark  As we can see, both bounds diverge to infinity when $E_\xi(\chi) \to 1/2$. This situation happens when χ ∼ beta(α, α), with α → ∞. A question raised by the referees is whether, in view of the ordering of the prior distributions of χ by $\lambda_\xi^{(p)}$, for a family of priors such as the beta, we can determine the parameters (α*, β*) that would render $\lambda_{\xi^*}^{(p)}$ maximum.
This question does not have a positive answer, since the upper bound for $\lambda_\xi^{(p)}$ given by Theorem 1, namely $\mathrm{UB} = (1 - p^{-1})\big/\left[2\,p^{1/(p-1)}\,|1 - 2E_\xi(\chi)|\right]$, for $E_\xi(\chi) \ne 1/2$, is strict. The equation $\lambda_\xi^{(p)} = \mathrm{UB}$ does not have a solution in (α*, β*), and although, for any given prior ξ, we can find (α′, β′) such that $\lambda_\xi^{(p)} < \lambda_{\xi'}^{(p)} < \mathrm{UB}$, the upper bound is never reached. This can also be seen from Figure 1, and the same answer applies to $\theta_\xi$.
[Figure 1 here: level curves of $\lambda_\xi^{(2)}$ (top panel) and $\theta_\xi$ (bottom panel) as functions of the beta prior parameters α and β]
Fig. 1 Level curves for $\lambda_\xi^{(2)}$ and $\theta_\xi$

3 Sampling stage

We suppose f1 and f2 are two independent normal densities and refer to the basic notations and procedures for discriminant analysis in the univariate case, given in the Appendix. The optimal cut-off points, which minimize the Bayes error Pe, are the intersection point(s) between $k_1(t) = \chi f_1(t)$ and $k_2(t) = (1-\chi) f_2(t)$; when the value of χ is a known constant, we have explicit expressions for the solutions:
$$x_0 = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \log D}{\mu_1 - \mu_2}, \tag{8}$$
or:
$$x_{1,2} = \frac{\mu_1\sigma_2^2 - \mu_2\sigma_1^2 \pm \sigma_1\sigma_2\sqrt{(\mu_1 - \mu_2)^2 + 2(\sigma_2^2 - \sigma_1^2)[\log(\sigma_2/\sigma_1) - \log D]}}{\sigma_2^2 - \sigma_1^2}, \tag{9}$$
where D = (1 − χ)/χ.
When χ is random, with prior density ξ(χ ), using the same argument as pre-
viously, the optimal decision cut-off point(s) is (are) now the intersection point(s)
of E(χ ) f 1 (x) and (1 − E(χ )) f 2 (x), denoted x0∗ , or x1∗ and x2∗ , which have the
same expressions (8) and (9), with χ replaced by E(χ ). These points, in turn, will
determine the probabilities of misclassifications τ ∗ and δ∗, with ε∗ = τ ∗ +δ∗, as
in (19) and (20) ( in Appendix).
The posterior distribution of χ is based on sampling results that are considered
valid and reliable as a second source of information on χ , in order to update its
distribution. As such, they are associated with an unknown value χ0 of χ , within
the domain of its distribution ξ(χ ), which is a realisation of the random variable
χ. Hence, let H3 be a population whose density is the mixture g(x) already mentioned, where χ = χ0. A sample of size n is now taken from H3, either as n simulated observations from g(x), or as a physical random sample from H3 (i.e. by creating a mixture of physical elements from H1 and H2, in the proportions χ0 and (1 − χ0) respectively, and taking a random sample from that
provided by this sample originating from the value χ0 . But, whereas the proba-
bility for an element xi from a sample taken from H3 to be classified as from H1
is still χ , its probability to be classified as such, in the sampling phase, is now
θ = χ (1 − τ ∗) + (1 − χ )δ∗ = δ ∗ +Mχ , instead of χ , where τ ∗ = P(H2 |H1 )
and δ∗ = P(H1 |H2 ) are the two misclassification probabilities obtained earlier,
and M = 1 − ε∗. By using the above cut-off point(s) to classify xi , we now know
exactly to which distribution an observation will be assigned. Hence, in considering
the whole sample {x1 , . . . , xn } taken from H3 , we have j* observations belonging
to H1 and n- j * to H2 , and we can derive the exact posterior distribution of χ .
This is the Decision-directed learning approach, as outlined in Titterington et al.
(1985), but in a direct, non-sequential sampling context.
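The sampling stage just described can be mimicked numerically: compute the cut-off points from (9) with χ replaced by its prior mean, draw a sample from the mixture at some unobserved value χ0, and count j*. The sketch below does this under assumed parameter values; following the convention used in the numerical example of section 5.2.2, j* is counted here as the number of observations falling between the two cut-off points.

```python
# Sketch: decision-directed sampling stage.  Cut-off points from (9) with chi
# replaced by its prior mean; then a sample of size n is drawn from the mixture
# at an arbitrary "true" proportion chi0 and j* is counted.  All numeric values
# (prior mean, chi0, n, seed) are illustrative.
import numpy as np

mu1, s1, mu2, s2 = 5.0, 9.0, 18.0, 6.0
E_chi = 0.20                        # prior mean of chi, e.g. for a beta(4, 16) prior
D = (1.0 - E_chi) / E_chi

# Two cut-off points from (9), case sigma1 != sigma2, with chi replaced by E(chi)
root = np.sqrt((mu1 - mu2) ** 2 + 2.0 * (s2**2 - s1**2) * (np.log(s2 / s1) - np.log(D)))
x_lo, x_hi = sorted((mu1 * s2**2 - mu2 * s1**2 + sgn * s1 * s2 * root) / (s2**2 - s1**2)
                    for sgn in (-1.0, 1.0))

# Sample of size n from the mixture chi0*f1 + (1 - chi0)*f2
rng = np.random.default_rng(1)
chi0, n = 0.22, 20
from_H1 = rng.random(n) < chi0
sample = np.where(from_H1, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

# j*: number of observations classified into H1; counted here as the observations
# falling between the two cut-off points, following the convention of section 5.2.2
j_star = int(np.sum((sample >= x_lo) & (sample <= x_hi)))
print(x_lo, x_hi, j_star)
```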

4 Posterior analysis

4.1 Posterior distributions

We have the following

Theorem 2  Let χ be the mixing proportion in the mixture density g(x) = χf1(x) + (1 − χ)f2(x), and let misclassification into H2, or into H1, have probabilities τ* and δ* respectively. If χ has beta(χ; α, β) as prior, and if j* out of n observations from the mixture are found to be from H1, then
(1) χ has as posterior distribution
$$\phi^{(n,j^*)}(\chi) = \mathrm{beta}(\chi; \alpha, \beta)\,[1 - A\chi]^{j^*}\,[1 - B\chi]^{n-j^*}\big/P_0^{(n,j^*)}(A, B), \tag{10}$$
where $A = (\varepsilon^* - 1)/\delta^*$, $B = (1 - \varepsilon^*)/(1 - \delta^*)$ and $P_0^{(n,j^*)}$ is a polynomial of degree n in A and B.

(2) W has as posterior density
$$f_{\mathrm{post}}(w) = \big\{(1-w)^{\alpha-1}(1+w)^{\beta-1}[(2-A) + Aw]^{j^*}[(2-B) + Bw]^{n-j^*} + (1+w)^{\alpha-1}(1-w)^{\beta-1}[(2-A) - Aw]^{j^*}[(2-B) - Bw]^{n-j^*}\big\}\big/\left[2^{\alpha+\beta+n-1} B(\alpha, \beta)\,P_0^{(n,j^*)}(A, B)\right], \tag{11}$$
and $P_0^{(n,j^*)}(A, B)$ reduces to a polynomial in A alone, or in B alone, in the case j* = n or j* = 0.
Proof  We first establish the posterior density of χ. We write the likelihood function as
$$(\delta^* + M\chi)^{j^*}(1 - \delta^* - M\chi)^{n-j^*} = \delta^{*\,j^*}(1 - \delta^*)^{n-j^*}(1 - A\chi)^{j^*}(1 - B\chi)^{n-j^*}, \tag{12}$$
where $A = (\varepsilon^* - 1)/\delta^*$, $B = (1 - \varepsilon^*)/(1 - \delta^*)$, with $\varepsilon^* = \tau^* + \delta^*$ being the overlapping coefficient.
(1) Hence, Prior × Likelihood is proportional to $\chi^{\alpha-1}(1 - \chi)^{\beta-1}(1 - A\chi)^{j^*}(1 - B\chi)^{n-j^*}$. The integral, denoted L, in the normalizing constant,
$$L = \int_0^1 \chi^{\alpha-1}(1 - \chi)^{\beta-1}(1 - A\chi)^{j^*}(1 - B\chi)^{n-j^*}\,d\chi,$$
is precisely the integral representation of an Appell function $F_D$ by Picard's theorem [see (28) in the Appendix]. We have $L = B(\alpha, \beta)\cdot F_D^{(2)}(\alpha, -j^*, -(n - j^*); \alpha + \beta; A, B)$, where B(α, β) is the beta function.
By definition, $F_D^{(2)}$ is a double series in x and y, i.e.
$$F_D^{(2)}(a, b_1, b_2; c; x, y) = \sum_{m_2=0}^{\infty}\sum_{m_1=0}^{\infty}\frac{(a, m_1 + m_2)(b_1, m_1)(b_2, m_2)}{(c, m_1 + m_2)}\,\frac{x^{m_1}}{m_1!}\,\frac{y^{m_2}}{m_2!},$$
where (a, n) is the Pochhammer notation for the product a(a + 1)⋯(a + n − 1). But this series has only a finite number of terms here, since b1 and b2 are negative integers. If we define the polynomial
$$P_k^{(n,j)}(A, B) = \sum_{m_2=0}^{n-j}\sum_{m_1=0}^{j}\frac{(\alpha + k, m_1 + m_2)(-j, m_1)(j - n, m_2)}{(\alpha + k + \beta, m_1 + m_2)}\cdot\frac{A^{m_1}}{m_1!}\cdot\frac{B^{m_2}}{m_2!}, \tag{13}$$
for k = 0, 1, 2, …, the posterior density of χ is then:
$$\phi^{(n,j^*)}(\chi) = \mathrm{beta}(\chi; \alpha, \beta)\cdot[1 - A\chi]^{j^*}[1 - B\chi]^{n-j^*}\big/P_0^{(n,j^*)}(A, B).$$
When j* = n or j* = 0, we have $L = B(\alpha, \beta)\cdot{}_2F_1(\alpha, -n; \alpha + \beta; A)$ or $L = B(\alpha, \beta)\cdot{}_2F_1(\alpha, -n; \alpha + \beta; B)$ respectively, where $_2F_1$ is Gauss' hypergeometric function. Using (13), $P_0$ then reduces to a polynomial in A alone, or in B alone.

(2) By a change of variable, we can establish the density of |1 − 2χ| from that of χ given by (10), and the posterior density of W follows, as given above. We can now compute $E_{\chi_{\mathrm{post}}}[J_p(H_1, H_2|\chi)]$, and the $\lambda_{\xi_{\mathrm{post}}}^{(p)}$ and $\theta_{\xi_{\mathrm{post}}}$ associated with it. □
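Since j* and n − j* are small integers, the normalizing constant $P_0^{(n,j^*)}(A, B)$ can also be obtained by direct numerical integration of prior × likelihood, which avoids programming the Appell polynomial explicitly. The sketch below takes this route; the values of τ*, δ*, n and j* are illustrative placeholders, assumed to come from the classification rule of section 3.

```python
# Sketch: posterior density (10) of chi by direct numerical normalization of
# prior x likelihood, instead of evaluating the Appell polynomial P0 explicitly.
# alpha, beta_, tau_star, delta_star, n, j_star are illustrative assumptions.
import numpy as np
from scipy.stats import beta as beta_dist
from scipy.integrate import quad

alpha, beta_ = 4.0, 16.0           # prior beta(chi; alpha, beta)
tau_star, delta_star = 0.25, 0.18  # assumed misclassification probabilities
eps_star = tau_star + delta_star   # overlapping coefficient
A = (eps_star - 1.0) / delta_star
B = (1.0 - eps_star) / (1.0 - delta_star)
n, j_star = 20, 5

def kernel(chi):
    """Unnormalized posterior: beta prior times (1 - A*chi)^j* (1 - B*chi)^(n - j*)."""
    return (beta_dist(alpha, beta_).pdf(chi)
            * (1.0 - A * chi) ** j_star * (1.0 - B * chi) ** (n - j_star))

norm_const, _ = quad(kernel, 0.0, 1.0)   # plays the role of P0^(n,j*)(A, B)

def posterior(chi):
    return kernel(chi) / norm_const

# Posterior mean of chi, for instance:
post_mean, _ = quad(lambda c: c * posterior(c), 0.0, 1.0)
print(norm_const, post_mean)
```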



4.2 Distributions of random misclassification probabilities

The above Lissack–Fu and Bhattacharyya bounds, and the two effectiveness measures for the distribution of χ, $\lambda_\xi^{(p)}$, for 0 < p, p ≠ 1, and $\theta_\xi$, can now be computed again, as in (2), (3) and (4), with $\phi^{(n,j^*)}(\chi)$ replacing ξ(χ), and the posterior distribution of W, given by (11), replacing its prior.
Case of $J_1(H_1, H_2|\chi)$: For p = 1, and χ a constant, we have equality, i.e. $2P_e = 1 - J_1(H_1, H_2|\chi)$, with $J_1(H_1, H_2|\chi)$ being the $L^1$-distance between χf1 and (1 − χ)f2 (with the Lebesgue measure), also called the $L^1(\chi)$ distance between the two populations. Similarly, when χ has distribution ξ(χ), we have:
2E χ [Pe ] = 1 − E χ [J1 (H1 , H2 |χ )]. The first equality hence allows us to derive the
distribution of J1 (H1 , H2 |χ ), once the distribution of Pe is known, in a situation
reverse to the previous ones.
This situation occurs when we do not know the two normal densities f1 and
f 2 , and are unsure of the value of χ itself, but by monitoring the misclassification
results, we have some information on the two misclassification probabilities, τ
and δ, which can be considered to be independent random variables, with separate
distributions. For example, they can be two beta variables, each on the interval
[0,1/4]. We can then derive the distribution of J1 (H1 , H2 |χ ). We have:

Theorem 3  Let τ and δ be independent, with τ ∼ beta(α1, β1; 0, 1/4) and δ ∼ beta(α2, β2; 0, 1/4) (with χ being a constant). Then $Z = J_1(H_1, H_2|\chi) = 1 - 2P_e$ has as distribution, on [0, 1]:

(a) For 0 ≤ z ≤ 1/2,
$$f(z) = K_2\, 2^{\beta_1+\beta_2}\, z^{\beta_1+\beta_2-1}(1 - 2z)^{\alpha_2-1}\, F_D^{(2)}\!\left(\beta_2,\, 1-\alpha_1,\, 1-\alpha_2;\, \beta_1+\beta_2;\, 2z,\, \frac{2z}{2z-1}\right), \tag{14}$$

(b) For 1/2 ≤ z ≤ 1,
$$f(z) = K_1\, 2^{\alpha_1+\alpha_2}(1 - z)^{\alpha_1+\alpha_2-1}(2z - 1)^{\beta_1-1}\, F_D^{(2)}\!\left(\alpha_2,\, 1-\beta_1,\, 1-\beta_2;\, \alpha_1+\alpha_2;\, \frac{2(1-z)}{1-2z},\, 2(1-z)\right), \tag{15}$$

with K1 and K2 given below.

Proof  First, we have the density of ε = τ + δ as the density of the sum of two independent general beta r.v.'s, each defined on [0, 1/4]. The general expression of that density is given in Pham-Gia and Turkkan (1998), for the case c = e = 0 and d = f = 0.25. The density of Pe is hence:
For 0 ≤ y ≤ 1/4,
$$f(y) = K_1\, 4^{\alpha_1+\alpha_2}\, y^{\alpha_1+\alpha_2-1}(1 - 4y)^{\beta_1-1}\cdot F_D^{(2)}\!\left(\alpha_2,\, 1-\beta_1,\, 1-\beta_2;\, \alpha_1+\alpha_2;\, \frac{4y}{4y-1},\, 4y\right),$$
with $K_1 = \Gamma(\alpha_1+\beta_1)\Gamma(\alpha_2+\beta_2)\big/[\Gamma(\alpha_1+\alpha_2)\Gamma(\beta_1)\Gamma(\beta_2)]$, and for 1/4 ≤ y ≤ 1/2,
$$f(y) = K_2\, 2^{\beta_1+\beta_2+1}(1 - 2y)^{\beta_1+\beta_2-1}(4y - 1)^{\alpha_2-1}\cdot F_D^{(2)}\!\left(\beta_2,\, 1-\alpha_1,\, 1-\alpha_2;\, \beta_1+\beta_2;\, 2(1-2y),\, \frac{2(2y-1)}{4y-1}\right),$$
with $K_2 = \Gamma(\alpha_1+\beta_1)\Gamma(\alpha_2+\beta_2)\big/[\Gamma(\beta_1+\beta_2)\Gamma(\alpha_1)\Gamma(\alpha_2)]$. The above density of $Z = J_1(H_1, H_2|\chi) = 1 - 2P_e$ can then be obtained by a change of variable and some computations. □


Again, if another source of information on τ and δ is available, the posterior


distributions for both of them can be derived, from which we can obtain the pos-
terior distribution of J1 (H1 , H2 |χ ), as above.

5 A numerical example

This numerical example, taken under very general conditions, will illustrate the
approach and methods presented in the paper.

5.1 Problem

Let H1 ∼ N(5, 9²) and H2 ∼ N(18, 6²) be combined to form a population H3, in which H1 forms (100χ0)%. The exact value of the proportion χ0 is unknown, but
it can be considered as a realisation of χ , with the informative prior distribution
beta(χ ; 4, 16), for example. We wish to adopt a Bayesian approach to the study of
χ , to evaluate the bounds of the Bayes error in classification, and also the effec-
tiveness of this prior . Furthermore, using a sample of size 20 taken from H3 , and
the classification criterion set up in discriminant analysis, we wish to update these
bounds.

5.2 Discriminant analysis results

(a) We can see that the two normal densities f1 and f2 intersect at x1 = 11.198 and x2 = 45.602, and the two misclassification probabilities are, according to the formulas in the Appendix, τ = 0.2455 and δ = 0.1284. Hence, we have the overlapping coefficient ε0 = 0.3739, and the L1-distance between the two distributions is J1(H1, H2) = 1.2522.
In the absence of χ , classification is directly based on the ratio of likelihoods
of the two normal densities. For a new observation y, the rule would then be:
Classify y as belonging to H1 if 11.1982 ≤ y ≤ 45.602, and as belonging to
H2 otherwise.
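The figures quoted in (a) can be checked in a few lines: the intersection abscissas follow from (23) in the Appendix, and the two overlap masses (the portion of f1 lying between the intersection points and the portion of f2 lying outside them, which with σ1 > σ2 correspond to the quoted misclassification probabilities) are obtained from the standard normal cdf. The snippet below is a verification sketch only.

```python
# Sketch: check of the intersection points and misclassification probabilities
# for H1 ~ N(5, 9^2) and H2 ~ N(18, 6^2), using (23) and the standard normal cdf.
import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 5.0, 9.0, 18.0, 6.0
K = 2.0 * (s2**2 - s1**2) * np.log(s2 / s1)
root = s1 * s2 * np.sqrt((mu1 - mu2) ** 2 + K)
x1, x2 = sorted((mu1 * s2**2 - mu2 * s1**2 + sgn * root) / (s2**2 - s1**2)
                for sgn in (-1.0, 1.0))

# Since sigma1 > sigma2 here, f2 is the larger density between x1 and x2, so the
# two overlap masses below are the misclassification probabilities quoted in (a).
tau = norm.cdf((x2 - mu1) / s1) - norm.cdf((x1 - mu1) / s1)            # mass of f1 on [x1, x2]
delta = 1.0 - (norm.cdf((x2 - mu2) / s2) - norm.cdf((x1 - mu2) / s2))  # mass of f2 outside [x1, x2]

print(x1, x2, tau, delta, tau + delta)   # approx. 11.198, 45.602, 0.2455, 0.1284, 0.3739
```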

(b) If only the distribution of χ is known, all the above quantities become random
variables. Taking the two functions E(χ ) f 1 (x) and (1 − E(χ )) f 2 (x), the two
cut-off points are x1∗ and x2∗ , and we can proceed as presented in previous
sections. The detailed analysis will be shown in the following sub-sections.

5.2.1 Prior analysis:

(a) The prior distribution of χ being ξ1(χ) = beta(4, 16), we have E[χ_prior] = 0.20 and Var(χ_prior) = 7.619 × 10⁻³ (or σ_χ = 0.08728). The absolute distance W between χ and 1 − χ then has as prior density, as given by (1), for 0 ≤ w ≤ 1:
$$f(w) = \left[(1-w)^3(1+w)^{15} + (1-w)^{15}(1+w)^3\right]\big/\left[2^{19} B(4, 16)\right] = \left[(1-w)^3(1+w)^{15} + (1-w)^{15}(1+w)^3\right]\big/33.82514.$$
Hence, we have E_χ[W_prior] = 0.6003 and Var_χ[W_prior] = 0.0301.
(b) To obtain the Lissack–Fu and Bhattacharyya bounds for $2E_\chi(P_e)$, we compute
$$E_\chi[J_p(H_1, H_2|\chi)] = \int_0^1\int_{\mathbb{R}} |\chi f_1(t) - (1-\chi) f_2(t)|^p \cdot \frac{\chi^{\alpha-1}(1-\chi)^{\beta-1}}{B(\alpha, \beta)\,[\chi f_1(t) + (1-\chi) f_2(t)]^{p-1}}\, dt\, d\chi, \tag{16}$$

$$\rho = \int_{-\infty}^{\infty}\sqrt{f_1(x) f_2(x)}\,dx \quad\text{and}\quad E_\chi\!\left[\sqrt{1 - 4\chi(1-\chi)\rho^2}\right] = \int_0^1 \sqrt{1 - 4\chi(1-\chi)\rho^2}\;\frac{\chi^{\alpha-1}(1-\chi)^{\beta-1}}{B(\alpha, \beta)}\,d\chi \tag{17}$$

and

$$E_\chi\!\left[\sqrt{\chi(1-\chi)}\right]\rho = \rho\int_0^1 \sqrt{\chi(1-\chi)}\;\frac{\chi^{\alpha-1}(1-\chi)^{\beta-1}}{B(\alpha, \beta)}\,d\chi. \tag{18}$$

For example, for p = 2, and the beta(4, 16) prior, these bounds are 0.0896 and 0.1616. On the other hand, Bhattacharyya's bounds for Pe are 0.0705 and 0.251, as given by (4). Also, for the degenerate distribution at 1/2, we have:
$$J_2(H_1, H_2|\chi = 1/2) = \frac{1}{2}\int_{\mathbb{R}} \frac{|f_1(t) - f_2(t)|^2}{f_1(t) + f_2(t)}\,dt = 0.4670.$$
(c) S-effectiveness of a prior and comparison between two priors for S-effectiveness:
We have the S-effectiveness coefficient for ξ1, the beta(4, 16) prior, given by (5), for p > 1, and by (6), for p < 1. For p = 2, for example, $\lambda^{(2)}_{\xi_1} = 0.1198$. If another prior is considered, ξ2 = beta(2, 18) for example, we then have $\lambda^{(2)}_{\xi_2} = 0.0560$. Hence, for p = 2, ξ1 >LF(2) ξ2. This fact means that, although beta(2, 18) is a more discriminating prior than beta(4, 16), in the sense that it better separates χ and 1 − χ on input, its return (when we also consider the distance between the two Lissack–Fu bounds for p = 2) is inferior to that of beta(4, 16). This should not be, however, the only reason why we should use beta(4, 16) instead of beta(2, 18) as prior for χ.
Similarly, we can compute Bhattacharyya's S-effectiveness and make the corresponding comparison. We have $\theta_{\xi_1} = 0.3006$ and $\theta_{\xi_2} = 0.1800$, and again, we can write ξ1 >B ξ2. It is noted that the upper bound for $\lambda^{(2)}_{\xi_1}$, provided by Theorem 1, is 0.2083, while it is 0.1565 for $\lambda^{(2)}_{\xi_2}$. Similarly, the upper bound for $\theta_{\xi_1}$ is 0.3451, and for $\theta_{\xi_2}$ it is 0.2588.
For p = 1, we have $E_\chi[J_1(H_1, H_2|\chi)] = 0.7900$ for χ ∼ beta(4, 16).

5.2.2 Sampling from the mixture distribution

The probability for an observation in a sample to be classified as from H1 is now θ* = δ* + Mχ = 0.1833 + 0.6261χ, with A = −3.4157 and B = 0.7662.
[Note: Although we have |A| > 1, the Appell function still converges, thanks to a change of variable (see Appendix).]
Let’s consider a sample of size 20, {Y1 , . . . , Y20 } taken from the mixture, where
χ has taken an unknown value χ0 , and the classification adopted earlier, i.e. with
cut-off points x1∗ =6.59 and x2∗ = 50.21, which are intersection points of 0.2 f 1
and 0.8 f 2 , since E(χ ) = 0.2. We then can determine j*, the number of elements
belonging to H1 . We have j* = 5 here, based on our simulation of 20 values from
the mixture 0.22 f 1 + 0.78 f 2 , that gives 5 values between 6.59 and 50.21.

5.2.3 Posterior analysis

We then have the posterior distribution of χ, defined on [0, 1], as:
$$\phi^{(20,5)}(\chi) = \chi^3(1-\chi)^{15}\cdot(1 - A\chi)^5(1 - B\chi)^{15}\big/\left[P_0^{(20,5)}(A, B)\,B(4, 16)\right],$$
as given by (10). This posterior, together with the prior beta(χ; 4, 16), is shown in Figure 2.
The posterior density of W is:
$$f_{\mathrm{post}}(w) = \big\{(1-w)^3(1+w)^{15}[(2-A) + Aw]^5[(2-B) + Bw]^{15} + (1+w)^3(1-w)^{15}[(2-A) - Aw]^5[(2-B) - Bw]^{15}\big\}\big/\left[2^{39} B(4, 16)\,P_0^{(20,5)}(A, B)\right],$$
as given by (11), where $P_0^{(20,5)}(A, B) = 2.5606$.
Figure 3 gives the prior and posterior distributions of W, where the posterior mean of W is 0.6901. Posterior bounds and S-effectiveness measures can now be computed using the posterior density $\phi^{(20,5)}_{\xi_1}$, i.e. using (16), (17) and (18), with $\phi^{(20,5)}_{\xi_1}$ replacing beta(χ; 4, 16) in these expressions. Prior and posterior L–F bounds are given by Figure 4, for all values of p > 1. They are practically identical

[Figure 2 here]
Fig. 2 Prior and posterior densities of χ

[Figure 3 here]
Fig. 3 Prior and posterior densities of the distance W between χ and 1 − χ

to the prior L–F bounds, and are hence represented by the same curves in Figure 4. (For illustration, we have also considered the case where j* = 6, and graphed the corresponding L–F bounds in Figure 4. We can see that, now, the two curves have shifted upwards, and for p = 2, the posterior L–F bounds are 0.0953 and 0.1712, which represent slight increases from the prior L–F bounds.) Bhattacharyya's posterior bounds have also slightly increased and are now 0.0710 and 0.254.

[Figure 4 here: lower and upper Lissack–Fu bounds for E(Pe) plotted against p; the prior and posterior (20,5) curves coincide, while the posterior (20,6) curves lie slightly higher]
Fig. 4 Prior and posterior Lissack–Fu bounds for Pe

The S-effectiveness measures of the distribution $\phi^{(20,6)}_{\xi_1}$, or posterior S-effectiveness measures, are now $\lambda^{(2)}_{\phi^{(20,6)}_{\xi_1}} = 0.1246$, a slight increase from its prior value, and $\theta_{\phi^{(20,6)}_{\xi_1}} = 0.2971$, which represents a slight decrease.

5.2.4 Density of J1(H1, H2|χ) under beta-distributed misclassification probabilities τ and δ

Let f 1 , f 2 and χ be unknown, and let’s suppose that the two misclassification
probabilities have been monitored, and could be given beta distributions. More
precisely, these two components τ and δ of Pe are now considered random, with
τ ∼ beta(5, 22; 0, 1/4) and δ ∼ beta(3, 18; 0, 1/4). Considering now the variable
J1(H1, H2|χ), we have, by (14) and (15), the density of Z, given by different expressions on (0, 1/2) and on (1/2, 1).
The density of Z, the L1(χ) distance between the two populations, is hence unimodal, defined on [0, 1], and its graph is given by Figure 5. Its mean is 0.836 and its variance 0.00274.
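Since Pe = τ + δ in this setting (see the proof of Theorem 3), the reported mean and variance of Z can be checked by a quick Monte Carlo simulation, without evaluating the Appell functions in (14)-(15). The sketch below is such a check.

```python
# Sketch: Monte Carlo check of Z = J1(H1,H2|chi) = 1 - 2*Pe when Pe = tau + delta
# with tau ~ beta(5, 22) and delta ~ beta(3, 18), each rescaled to [0, 1/4].
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
N = 200_000
tau = 0.25 * beta(5, 22).rvs(N, random_state=rng)    # beta(5, 22; 0, 1/4)
delta = 0.25 * beta(3, 18).rvs(N, random_state=rng)  # beta(3, 18; 0, 1/4)
Z = 1.0 - 2.0 * (tau + delta)

print(Z.mean(), Z.var())   # close to the reported mean 0.836 and variance 0.00274
```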

6 Conclusion

The Bayesian approach to the study of the bounds for the Bayes error Pe is an effective approach to consider when the precise value of χ is not known and only its probability distribution can be determined, as is frequently the case in applications. Then, bounds for Pe, coupled with S-effectiveness measures for the prior distribution, will lead to a very pertinent choice of that prior. The converse problem, obtaining the L1(χ) distance between the two populations, is also of interest.

[Figure 5 here]
Fig. 5 Distribution of J1(H1, H2|χ), with beta-distributed components, τ and δ, of Pe

Appendix

In this appendix, we recall the following results, first on statistical discriminant


analysis related to univariate normal distributions, then on hypergeometric func-
tions in several variables.

Discriminant analysis for univariate normal distributions

We will first look at the intersection point(s) of two densities, and then present the
normal case in detail.
(α) Two univariate densities f1(x) and f2(x) can intersect at one or several points, which are the roots of the equation f1(x) − f2(x) = 0. Hence, their common region, called the overlapping region, can consist of a single region or of several disjoint regions, and integration of h(x) = min{f1(x), f2(x)} gives the value ε0 of its area.
Let us consider the case of f1(x) and f2(x) unimodal densities. Then they can intersect at one or two points.
(a) One intersection point I: then the overlapping region is a singly connected region, which consists of two parts, whose measures are given respectively by
$$\tau = \int_{x \ge x_0} h(x)\,dx \quad\text{and}\quad \delta = \int_{x \le x_0} h(x)\,dx, \tag{19}$$
where x0 is the root of the equation f1(x) − f2(x) = 0.


(b) Two intersection points I1 and I2 :
We have, similarly, x1 and x2 , where x1 and x2 are the two roots of the same
equation. Supposing, without loss of generality, x1 ≤ x2 , the overlapping region is
now the union of two regions. The first region is a simply connected one, limited
by the two points x1 and x2 and the lower of the two density functions between
these points. The second region consists of two disjoint areas, to the left and to the
right of x1 and x2 , respectively, and below the other density function.
Again, we have
$$\tau = \int_{\{x \ge x_2\}\cup\{x \le x_1\}} h(x)\,dx \quad\text{and}\quad \delta = \int_{x_1 \le x \le x_2} h(x)\,dx. \tag{20}$$

In both cases, we have: ε0 = τ + δ. Inman and Bradley (1989) have studied


the overlapping coefficient and its estimation.
(β) Case of two univariate normal distributions: For this case we have several of the preceding results in closed form. Let us consider the simple case of two populations having univariate normal distributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$.
(A) Different means μ1 ≤ μ2. Let
$$\mathrm{Log}\,\varphi(x) = \frac{1}{2}\left[-x^2(\sigma_1^{-2} - \sigma_2^{-2}) + 2x(\mu_1\sigma_1^{-2} - \mu_2\sigma_2^{-2}) - (\mu_1^2\sigma_1^{-2} - \mu_2^2\sigma_2^{-2}) + 2\,\mathrm{Log}(\sigma_2/\sigma_1)\right], \tag{21}$$
where $\varphi(x) = f_1(x)/f_2(x)$ is the likelihood ratio and $f_i(x) = \exp[-(x - \mu_i)^2/2\sigma_i^2]\big/(\sigma_i\sqrt{2\pi})$, i = 1, 2, −∞ < x < ∞.
Solving Log φ(x) = 0, we obtain:
(1) If σ1 = σ2 = σ, the above equation becomes linear, and the only intersection point I has x0 = (μ1 + μ2)/2 as abscissa. We then have, as given by (19),
$$\tau = \delta = 1 - \Phi(\xi), \tag{22}$$
where ξ = (μ2 − μ1)/2σ, and Φ is the cumulative distribution function of the standard normal.
(2) If σ1 ≠ σ2, since we always have $K = 2(\sigma_2^2 - \sigma_1^2)\log(\sigma_2/\sigma_1) \ge 0$, there are two intersection points I1 and I2, at abscissas x1 and x2 with values
$$\frac{(\mu_1\sigma_2^2 - \mu_2\sigma_1^2) \pm \sigma_1\sigma_2\sqrt{(\mu_1 - \mu_2)^2 + K}}{\sigma_2^2 - \sigma_1^2}. \tag{23}$$
Let us suppose x1 ≤ x2. We then have, as given by (20),
$$\tau = 1 - \Phi[(x_2 - \mu_1)/\sigma_1] + \Phi[(x_1 - \mu_1)/\sigma_1], \tag{24}$$
while
$$\delta = \Phi[(x_2 - \mu_2)/\sigma_2] - \Phi[(x_1 - \mu_2)/\sigma_2]. \tag{25}$$
(B) Equal means μ1 = μ2 = μ, with σ1 ≠ σ2: We have two normal distributions with the same mean, and symmetrical intersection points $\mu \pm \sigma_1\sigma_2\sqrt{E}$, where $E = 2\log(\sigma_2/\sigma_1)/(\sigma_2^2 - \sigma_1^2) \ge 0$. Then τ and δ can be obtained as in the two-intersection-points case above, i.e. $\tau = 1 - \Phi(\sigma_2\sqrt{E}) + \Phi(-\sigma_2\sqrt{E})$ and $\delta = \Phi(\sigma_1\sqrt{E}) - \Phi(-\sigma_1\sqrt{E})$. If, moreover, σ1 = σ2, the two distributions are identical and we can interpret, for this case, ε = 1, with one of the two probabilities τ or δ equal to 1 and the other zero.
(γ) Let us consider two populations H1 and H2, with univariate densities f1 and f2, which have as common area the overlapping region defined above. In Pattern Classification, a known constant prior probability χ is assigned to f1 and (1 − χ) to f2, and this is equivalent to using k1(x) = χf1(x) and k2(x) = (1 − χ)f2(x) in classification.
For any adopted decision rule, by which a new observation will be classified as belonging either to H1 or to H2, the two misclassification probabilities, P(1|2) and P(2|1), are present. In R, the classification regions, denoted Ω1 and Ω2 respectively, can be determined arbitrarily by any one, or two, cut-off point(s), but it can be shown that the regions that minimize the total misclassification probability (TMP) are obtained by considering the common region(s) of the two curves k1(x) and k2(x). Alternatively, we can solve the equation Log φ(x) = log D, where φ(x) is the ratio of the two densities, while D = (1 − χ)/χ. Hence, let Ω1O = {x | k1(x) ≥ k2(x)} and Ω2O = {x | k1(x) < k2(x)}. If we take Ω1 = Ω1O and Ω2 = Ω2O, we now have, as misclassification probabilities, $\tau = P(2|1) = \int_{\Omega_{2O}} k_1(x)\,dx$ and $\delta = P(1|2) = \int_{\Omega_{1O}} k_2(x)\,dx$, with their sum ε equal to the measure of their common region(s). This common region is hence a deformation of the overlapping region between the two densities f1(x) and f2(x) in (α) above, with the corresponding change in its measure.
The two functions k1(x) and k2(x) can have one intersection point x0, or two intersection points x1 and x2, which determine the optimal regions Ω1O and Ω2O above. Any arbitrary decision cut-off point(s) xc (resp. xc1 and xc2), different from x0 (resp. from x1 and x2), will lead to a TMP larger than the area of this common region, which is its minimum value, also called the Bayes error, and traditionally denoted by Pe. We have:
$$P_e = \int_{\mathbb{R}} \min\{k_1(x), k_2(x)\}\,dx. \tag{26}$$
In the general case there is no relation between Pe and ε0. Hence, a strong motivation is to find bounds for Pe, based on f1(x), f2(x) and χ (or its distribution ξ(χ)).

Hypergeometric functions in several variables

Definition  Let a, b1, …, bn and c be real or complex numbers. The Lauricella D-function in n + 2 parameters and n variables x1, …, xn is:
$$F_D^{(n)}(a, b_1, \ldots, b_n; c; x_1, \ldots, x_n) = \sum_{m_n=0}^{\infty}\cdots\sum_{m_1=0}^{\infty}\frac{(a, m_1 + \cdots + m_n)(b_1, m_1)\cdots(b_n, m_n)}{(c, m_1 + \cdots + m_n)}\,\frac{x_1^{m_1}}{m_1!}\cdots\frac{x_n^{m_n}}{m_n!}, \tag{27}$$

where (a, m) is the Pochhammer coefficient (or ascending factorial): $(a, m) = a(a+1)\cdots(a+m-1) = \Gamma(a+m)/\Gamma(a)$, m > 0, with (a, 0) = 1 and a not a negative integer. We know that the above multiple series converges for |x1|, …, |xn| < 1 and that, for any coefficient $b_i$, i = 1, …, n, having a negative integral value, the corresponding summation becomes finite. For n = 1, $F_D^{(1)}$ is the classical Gauss hypergeometric function $_2F_1(a, b; c; t)$.

Remark  We have the following relations when n = 2:
$$F_D^{(2)}(a, b_1, b_2; c; x, y) = (1-x)^{c-(a+b_1)}(1-y)^{-b_2}\, F_D^{(2)}\!\left(c-a,\, c-(b_1+b_2),\, b_2;\, c;\, x,\, \frac{y-x}{y-1}\right)$$
and
$$F_D^{(2)}(a, b_1, b_2; c; x, y) = (1-x)^{-b_1}(1-y)^{c-(a+b_2)}\, F_D^{(2)}\!\left(c-a,\, b_1,\, c-(b_1+b_2);\, c;\, \frac{x-y}{x-1},\, y\right)$$
(see Exton 1976). Hence, even when |x| or |y| is larger than 1, there can be convergence for $F_D^{(2)}$. An important result obtained by E. Picard allows $F_D^{(n)}$ to be expressed as an integral in one variable.

Theorem (Picard)  If Re(a) and Re(c − a) are positive, then
$$F_D^{(n)}(x_1, \ldots, x_n) = \frac{\Gamma(c)}{\Gamma(a)\Gamma(c-a)}\int_0^1 u^{a-1}(1-u)^{c-a-1}(1 - ux_1)^{-b_1}\cdots(1 - ux_n)^{-b_n}\,du. \tag{28}$$

Proof  See Exton (1976, p. 49). □
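For numerical work, Picard's representation (28) is often the simplest way to evaluate $F_D^{(n)}$, since it reduces the multiple series to a single one-dimensional integral. The sketch below is a generic quadrature-based implementation (not taken from the paper); it requires Re(a) > 0, Re(c − a) > 0 and $1 - u x_i > 0$ on (0, 1), so negative arguments of large magnitude, such as the A of section 5.2.2, pose no problem.

```python
# Sketch: numerical evaluation of the Lauricella F_D function via Picard's
# one-dimensional integral representation (28).
# Valid for Re(a) > 0, Re(c - a) > 0 and 1 - u*x_i > 0 on (0, 1) (i.e. x_i < 1,
# with negative x_i of any magnitude allowed).
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, hyp2f1

def lauricella_FD(a, b, c, x):
    """F_D^(n)(a, b_1..b_n; c; x_1..x_n) by quadrature of Picard's representation."""
    b = np.asarray(b, dtype=float)
    x = np.asarray(x, dtype=float)
    def integrand(u):
        return u ** (a - 1) * (1 - u) ** (c - a - 1) * np.prod((1 - u * x) ** (-b))
    val, _ = quad(integrand, 0.0, 1.0)
    return gamma(c) / (gamma(a) * gamma(c - a)) * val

# Check: for n = 1 this reduces to the Gauss hypergeometric function 2F1
print(lauricella_FD(2.0, [1.5], 4.0, [0.3]), hyp2f1(2.0, 1.5, 4.0, 0.3))
```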




References

Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, Chichester
Devroye L, Gyorfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, Berlin Heidelberg New York
Exton H (1976) Multiple hypergeometric functions and applications. Ellis Horwood, Chichester
Inman HF, Bradley EL (1989) The overlapping coefficient as a measure of agreement between probability distributions and point estimation of the overlap of two normal densities. Commun Stat Theory Methods 18:3851–3874
Jasra A, Holmes CC, Stephens DA (2005) Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Stat Sci 20:50–67
Lissack T, Fu KS (1976) Error estimation in pattern recognition via Lp-distance between posterior density functions. IEEE Trans Inf Theory 22:34–45
McLachlan GJ, Basford K (1988) Mixture models. Marcel Dekker, New York
McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, Chichester
Pham-Gia T, Turkkan N, Bekker A (2005) Bayesian analysis in the L1-norm of the mixing coefficient. Metrika (in press)
Pham-Gia T, Turkkan N (1998) Distribution of the linear combination of two general beta variables, and applications. Commun Stat Theory Methods 27(7):1851–1869
Pham-Gia T, Turkkan N (1994) Value of the prior information. Commun Stat Theory Methods 27(7):1851–1869
Ray S (1989a) On looseness of error bounds provided by the generalized separability measures of Lissack and Fu. Pattern Recogn Lett 9:321–325
Ray S (1989b) On a theoretical property of the Bhattacharyya coefficient as a feature evaluation criterion. Pattern Recogn Lett 9:315–318
Robert CP (1996) Mixtures of distributions: inference and estimation. In: Gilks WR, Richardson S, Spiegelhalter DJ (eds) Markov chain Monte Carlo in practice. Chapman and Hall, London, pp 441–464
Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, Chichester
