Divergence Measures Based on the Shannon Entropy

Jianhua Lin, Member, IEEE

Abstract - A new class of information-theoretic divergence measures based on the Shannon entropy is introduced. Unlike the well-known Kullback divergences, the new measures do not require the condition of absolute continuity to be satisfied by the probability distributions involved. More importantly, their close relationship with the variational distance and the probability of misclassification error is established in terms of bounds. These bounds are crucial in many applications of divergence measures. The new measures are also well characterized by the properties of nonnegativity, finiteness, semiboundedness, and boundedness.

Index Terms - Divergence, dissimilarity measure, discrimination information, entropy, probability of error bounds.

Manuscript received October 24, 1989; revised April 20, 1990. The author is with the Department of Computer Science, Brandeis University, Waltham, MA 02254. IEEE Log Number 9038865.
I. INTRODUCTION
Many information-theoretic divergence measures between two probability distributions have been introduced and extensively studied [2], [7], [12], [15], [17], [19], [20], [30]. The applications of these measures can be found in the analysis of contingency tables [10], in approximation of probability distributions [6], [16], [21], in signal processing [13], [14], and in pattern recognition [3]-[5]. Among the proposed measures, one of the best known is the I directed divergence [17], [19] or its symmetrized measure, the J divergence. Although the I and J measures have many useful properties, they require that the probability distributions involved satisfy the condition of absolute continuity [17]. Also, there are certain bounds that neither I nor J can provide for the variational distance and the Bayes probability of error [28], [31]. Such bounds are useful in many decision-making applications [3], [5], [11], [14], [31].

In this correspondence, we introduce a new directed divergence that overcomes the previous difficulties. We will show that this new measure preserves most of the desirable properties of I and is in fact closely related to I. Both lower and upper bounds for the new divergence will also be established in terms of the variational distance. A symmetric form of the new directed divergence can be defined in the same way as J is defined in terms of I. The behavior of I, J, and the new divergences will be compared.

Based on Jensen's inequality and the Shannon entropy, an extension of the new measure, the Jensen-Shannon divergence, is derived. One of the salient features of the Jensen-Shannon divergence is that we can assign a different weight to each probability distribution. This makes it particularly suitable for the study of decision problems where the weights could be the prior probabilities. In fact, it provides both lower and upper bounds for the Bayes probability of misclassification error.

Most measures of difference are designed for two probability distributions. For certain applications, such as the study of taxonomy in biology and genetics [24], [25], one is required to measure the overall difference of more than two distributions. The Jensen-Shannon divergence can be generalized to provide such a measure for any finite number of distributions. This is also useful in multiclass decision-making. In fact, the bounds provided by the Jensen-Shannon divergence for the two-class case can be extended to the general case.

The generalized Jensen-Shannon divergence is related to the Jensen difference proposed by Rao [23], [24] in a different context. Rao's objective was to obtain different measures of diversity [24], and the Jensen difference can be defined in terms of information measures other than the Shannon entropy function. No specific detailed discussion was provided for the Jensen difference based on the Shannon entropy.

II. THE KULLBACK I AND J DIVERGENCE MEASURES

Let X be a discrete random variable and let p_1 and p_2 be two probability distributions of X. The I directed divergence [17], [19] is defined as

$$I(p_1, p_2) = \sum_{x \in X} p_1(x) \log \frac{p_1(x)}{p_2(x)}. \qquad (2.1)$$

The logarithmic base 2 is used throughout this correspondence unless otherwise stated. It is well known that I(p_1, p_2) is nonnegative and additive but not symmetric [12], [17]. To obtain a symmetric measure, one can define

$$J(p_1, p_2) = I(p_1, p_2) + I(p_2, p_1), \qquad (2.2)$$

which is called the J divergence [22]. Clearly, the I and J divergences share most of their properties.

It should be noted that I(p_1, p_2) is undefined if p_2(x) = 0 and p_1(x) != 0 for any x in X. This means that distribution p_1 has to be absolutely continuous [17] with respect to distribution p_2 for I(p_1, p_2) to be defined. Similarly, J(p_1, p_2) requires that p_1 and p_2 be absolutely continuous with respect to each other. This is one of the problems with these divergence measures.

Effort [18], [27], [28] has been devoted to finding the relationship (in terms of bounds) between the I directed divergence and the variational distance. The variational distance between two probability distributions is defined as

$$V(p_1, p_2) = \sum_{x \in X} |p_1(x) - p_2(x)|, \qquad (2.3)$$

which is a distance measure satisfying the metric properties. Several lower bounds for I(p_1, p_2) in terms of V(p_1, p_2) have been found, among which the sharpest known is given by

$$I(p_1, p_2) \ge \max\{L_V(V(p_1, p_2)),\ L_T(V(p_1, p_2))\},$$

where

$$L_V(v) = \log \frac{2+v}{2-v} - \frac{2v}{2+v},$$

established by Vajda [28], and

$$L_T(v) = \frac{v^2}{2} + \frac{v^4}{36} + \frac{v^6}{288},$$

derived by Toussaint [27].

However, no general upper bound exists for either I(p_1, p_2) or J(p_1, p_2) in terms of the variational distance [28]. This is another difficulty in using the I directed divergence as a measure of difference between probability distributions [16], [31].
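As an illustrative aside (not part of the correspondence itself), the following Python sketch computes I, J, and V for two discrete distributions given as probability lists, and shows how I(p_1, p_2) blows up as soon as p_2(x) = 0 at a point where p_1(x) > 0; the function names and example distributions are our own.

```python
import math

def I_div(p1, p2):
    """Kullback I directed divergence, eq. (2.1), base-2 logs.
    Returns inf when p2(x) == 0 but p1(x) > 0 (absolute continuity violated)."""
    total = 0.0
    for a, b in zip(p1, p2):
        if a == 0:
            continue                      # 0 * log(0/b) is taken as 0
        if b == 0:
            return float("inf")           # p1 not absolutely continuous w.r.t. p2
        total += a * math.log2(a / b)
    return total

def J_div(p1, p2):
    """Symmetrized J divergence, eq. (2.2)."""
    return I_div(p1, p2) + I_div(p2, p1)

def V_dist(p1, p2):
    """Variational distance, eq. (2.3)."""
    return sum(abs(a - b) for a, b in zip(p1, p2))

p1 = [0.5, 0.3, 0.2]
p2 = [0.4, 0.4, 0.2]
print(I_div(p1, p2), J_div(p1, p2), V_dist(p1, p2))

# A single zero mass point in the second argument breaks I (and hence J):
q = [0.7, 0.3, 0.0]
print(I_div(p1, q))   # inf -- p1 is not absolutely continuous with respect to q
```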
III. A NEW DIRECTED DIVERGENCE MEASURE

In an attempt to overcome the problems of the I and J divergences, we define a new directed divergence between two distributions p_1 and p_2 as

$$K(p_1, p_2) = \sum_{x \in X} p_1(x) \log \frac{2 p_1(x)}{p_1(x) + p_2(x)}. \qquad (3.1)$$

This measure turns out to have numerous desirable properties. From the Shannon inequality [1, p. 37], we know that K(p_1, p_2) >= 0 and K(p_1, p_2) = 0 if and only if p_1 = p_2, which is essential for a measure of difference. It is also clear that K(p_1, p_2) is well defined even when p_2(x) = 0 and p_1(x) != 0 for some x in X, so no condition of absolute continuity is required.
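A minimal sketch of (3.1) in the same illustrative Python style (names ours); note that K stays finite for the pair that made I infinite above, since the reference distribution (p_1 + p_2)/2 is positive wherever p_1 is.

```python
import math

def K_div(p1, p2):
    """New directed divergence K of eq. (3.1), base-2 logs."""
    total = 0.0
    for a, b in zip(p1, p2):
        if a == 0:
            continue                       # 0 * log(...) taken as 0
        total += a * math.log2(2 * a / (a + b))
    return total

p1 = [0.5, 0.3, 0.2]
q  = [0.7, 0.3, 0.0]
print(K_div(p1, q))    # finite, even though q(x) = 0 where p1(x) > 0
print(K_div(q, p1))    # K is not symmetric
```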
From the definitions of K and I, it is easy to see that K(p_1, p_2) can be described in terms of I(p_1, p_2):

$$K(p_1, p_2) = I\!\left(p_1, \frac{p_1 + p_2}{2}\right). \qquad (3.2)$$

Since (p_1(x) + p_2(x))/2 >= sqrt(p_1(x) p_2(x)), it follows that

$$K(p_1, p_2) = \sum_{x \in X} p_1(x) \log \frac{2 p_1(x)}{p_1(x) + p_2(x)} \le \sum_{x \in X} p_1(x) \log \frac{p_1(x)}{\sqrt{p_1(x) p_2(x)}} = \frac{1}{2} \sum_{x \in X} p_1(x) \log \frac{p_1(x)}{p_2(x)} = \frac{1}{2} I(p_1, p_2). \qquad (3.3)$$

K(p_1, p_2) is obviously not a symmetric measure; we can define a symmetric divergence based on K as

$$L(p_1, p_2) = K(p_1, p_2) + K(p_2, p_1). \qquad (3.4)$$

The L divergence is related to the J divergence in the same way as K is related to I. From inequality (3.3), we can easily derive the following relationship:

$$L(p_1, p_2) \le \frac{1}{2} J(p_1, p_2). \qquad (3.5)$$

A graphical comparison of the I, J, K, and L divergences is shown in Fig. 1, in which we assume p_1 = (t, 1-t) and p_2 = (1-t, t), 0 <= t <= 1. I and J have a steeper slope than K and L. It is important to note that I and J approach infinity when t approaches 0 or 1. In contrast, K and L are well defined in the entire range 0 <= t <= 1.

Theorem 2: The following lower bound holds for the K directed divergence:

$$K(p_1, p_2) \ge \max\{L_V(V(p_1, p_2)/2),\ L_T(V(p_1, p_2)/2)\}. \qquad (3.6)$$

Since K(p_1, p_2) = I(p_1, (p_1 + p_2)/2) and V(p_1, (p_1 + p_2)/2) = V(p_1, p_2)/2, (3.6) follows immediately from the lower bound of Section II.

In contrast to the situation for the I and J divergences, upper bounds also exist for the L divergence in terms of the variational distance.

Theorem 3: The variational distance and the L divergence satisfy the inequality

$$L(p_1, p_2) \le V(p_1, p_2). \qquad (3.7)$$
The L divergence can also be expressed in terms of the Shannon entropy function as

$$L(p_1, p_2) = 2 H\!\left(\frac{p_1 + p_2}{2}\right) - H(p_1) - H(p_2). \qquad (3.14)$$
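The identities and bounds (3.3)-(3.7) and (3.14) are easy to probe numerically. The sketch below is our illustration (it assumes strictly positive distributions so that I and J are finite); it computes L both as K(p_1,p_2) + K(p_2,p_1) and via the entropy identity (3.14), and checks L <= J/2 and L <= V on a random pair.

```python
import math, random

def H(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def I_div(p1, p2):
    return sum(a * math.log2(a / b) for a, b in zip(p1, p2) if a > 0)

def K_div(p1, p2):
    return sum(a * math.log2(2 * a / (a + b)) for a, b in zip(p1, p2) if a > 0)

def random_dist(n):
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

p1, p2 = random_dist(5), random_dist(5)
m = [(a + b) / 2 for a, b in zip(p1, p2)]

L_sum = K_div(p1, p2) + K_div(p2, p1)            # eq. (3.4)
L_ent = 2 * H(m) - H(p1) - H(p2)                 # eq. (3.14)
J = I_div(p1, p2) + I_div(p2, p1)                # eq. (2.2)
V = sum(abs(a - b) for a, b in zip(p1, p2))      # eq. (2.3)

assert abs(L_sum - L_ent) < 1e-9
assert L_sum <= J / 2 + 1e-9                     # inequality (3.5)
assert L_sum <= V + 1e-9                         # inequality (3.7)
print(L_sum, J / 2, V)
```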
In a two-class decision problem with prior probabilities pi_1 = P(c_1) and pi_2 = P(c_2) and class-conditional distributions p_1(x) = p(x|c_1) and p_2(x) = p(x|c_2), the conditional entropy of the class variable C given the observation X is

$$H(C|X) = -\sum_{x \in X} p(x) \sum_{c \in C} p(c|x) \log p(c|x), \qquad (4.5)$$

which is the equivocation or the conditional entropy [9]. It is also known that

$$H(C|X) = H(C) + H(X|C) - H(X). \qquad (4.6)$$

Since there are only two classes involved, we have

$$H(C) = H(P(c_1), P(c_2)) = H(\pi_1, \pi_2), \qquad (4.7)$$

and

$$H(X|C) = P(c_1) H(X|c_1) + P(c_2) H(X|c_2) = \pi_1 H(p_1) + \pi_2 H(p_2). \qquad (4.8)$$

Since H(C|X) = \sum_{x \in X} p(x) H(C|x), the Cauchy inequality gives

$$H^2(C|X) \le \Big(\sum_{x \in X} p(x)\Big)\Big(\sum_{x \in X} p(x) H^2(C|x)\Big) = \sum_{x \in X} p(x) H^2(C|x). \qquad (4.11)$$

For any 0 <= t <= 1, it can be shown that

$$\frac{1}{2} H(t, 1-t) \le \sqrt{t(1-t)} \qquad (4.12)$$

holds, as depicted in Fig. 2. A rigorous proof of this inequality is given in the Appendix (Theorem 8). Therefore, inequality (4.11) can be rewritten as

$$H^2(C|X) \le 4 \sum_{x \in X} p(x) P(c_1|x) P(c_2|x).$$
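To make the chain (4.5), (4.11)-(4.12) concrete, the following sketch is our illustration (the priors and class-conditional distributions are invented for the example). It computes H(C|X) directly and verifies H^2(C|X) <= 4 sum_x p(x) P(c_1|x) P(c_2|x); since P(c_1|x)P(c_2|x) <= min_i P(c_i|x), that right-hand side is in turn at most 4 P(e), which ties the inequality to the Bayes error.

```python
import math

def H(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

# Illustrative two-class problem: priors and class-conditional distributions.
pi = [0.4, 0.6]                               # pi_1, pi_2
p_x_given_c = [[0.6, 0.3, 0.1],               # p_1(x) = p(x | c_1)
               [0.1, 0.4, 0.5]]               # p_2(x) = p(x | c_2)

X = range(3)
p_x = [sum(pi[c] * p_x_given_c[c][x] for c in range(2)) for x in X]
post = [[pi[c] * p_x_given_c[c][x] / p_x[x] for c in range(2)] for x in X]

H_C_given_X = sum(p_x[x] * H(post[x]) for x in X)                 # eq. (4.5)
rhs = 4 * sum(p_x[x] * post[x][0] * post[x][1] for x in X)        # bound above
bayes_error = sum(p_x[x] * min(post[x]) for x in X)

assert H_C_given_X ** 2 <= rhs + 1e-9
assert rhs <= 4 * bayes_error + 1e-9
print(H_C_given_X ** 2, rhs, 4 * bayes_error)
```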
In certain applications, such as the study of taxonomy in biology and genetics [24], [25], it might be necessary to measure the overall difference of more than two distributions. The Jensen-Shannon divergence can be generalized to provide such a measure for any finite number of distributions. This is useful for the study of decision problems with more than two classes involved.

Let p_1, p_2, ..., p_n be n probability distributions with weights pi_1, pi_2, ..., pi_n, respectively. The generalized Jensen-Shannon divergence can be defined as

$$JS_\pi(p_1, p_2, \cdots, p_n) = H\!\left(\sum_{i=1}^{n} \pi_i p_i\right) - \sum_{i=1}^{n} \pi_i H(p_i),$$

where pi = (pi_1, pi_2, ..., pi_n). Consider the n-class decision problem in which pi_i = P(c_i) is the prior probability of class c_i, H(pi) = -sum_{i=1}^{n} pi_i log pi_i, and p_i(x) = p(x|c_i), i = 1, 2, ..., n.
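A minimal Python sketch (ours) of the generalized Jensen-Shannon divergence just defined; the distributions and weights are arbitrary illustrative choices, and the value can be checked to lie between 0 and H(pi).

```python
import math

def H(p):
    """Shannon entropy in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(dists, weights):
    """Generalized Jensen-Shannon divergence JS_pi(p_1, ..., p_n):
    H(sum_i pi_i * p_i) - sum_i pi_i * H(p_i)."""
    mixture = [sum(w * p[x] for w, p in zip(weights, dists))
               for x in range(len(dists[0]))]
    return H(mixture) - sum(w * H(p) for w, p in zip(weights, dists))

dists = [[0.6, 0.3, 0.1],
         [0.1, 0.4, 0.5],
         [0.3, 0.3, 0.4]]
weights = [0.2, 0.5, 0.3]          # priors pi_i, summing to 1

js = js_divergence(dists, weights)
print(js)
assert 0 <= js <= H(weights) + 1e-9    # 0 <= JS_pi <= H(pi)
```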
Proof: The proof of this inequality is much the same as that of (4.3). □
Theorem 7: For the n-class decision problem, the Bayes probability of error satisfies

$$P(e) \ge \frac{1}{4(n-1)} \big(H(\pi) - JS_\pi(p_1, p_2, \cdots, p_n)\big)^2.$$
Proof: From (4.11) and Theorem 9 in the Appendix, we have

$$H^2(C|X) \le \sum_{x \in X} p(x) \Big(2 \sum_{i=1}^{n-1} \sqrt{P(c_i|x)\,(1 - P(c_i|x))}\Big)^2. \qquad (5.5)$$

By the Cauchy inequality, (5.5) becomes

$$H^2(C|X) \le 4(n-1) \sum_{x \in X} p(x) \sum_{i=1}^{n-1} P(c_i|x)\,(1 - P(c_i|x)). \qquad (5.6)$$

Assume, without loss of generality, that the p(c_i|x) have been reordered in such a way that p(c_n|x) is the largest. Then from

$$\sum_{i=1}^{n-1} P(c_i|x)\,(1 - P(c_i|x)) \le \sum_{i=1}^{n-1} P(c_i|x) = 1 - \max_i\{P(c_i|x)\},$$

we have

$$H^2(C|X) \le 4(n-1) \sum_{x \in X} p(x)\big(1 - \max_i\{P(c_i|x)\}\big) = 4(n-1) P(e),$$

and we immediately obtain the desired result. □

It should be pointed out that the bounds previously presented are in explicit forms and can be computed easily. Implicit lower and upper bounds for the probability of error in terms of the f-divergence can be found in [3]. It would be useful to study the relationship between these bounds, but this will not be done in this correspondence.
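The key inequality in the proof of Theorem 7, H^2(C|X) <= 4(n-1)P(e), can be checked numerically; the sketch below (ours) uses an invented three-class example.

```python
import math

def H(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

# Illustrative n = 3 class problem.
pi = [0.2, 0.5, 0.3]                                # priors pi_i = P(c_i)
p_x_given_c = [[0.6, 0.3, 0.1, 0.0],
               [0.1, 0.4, 0.4, 0.1],
               [0.2, 0.1, 0.3, 0.4]]                # p_i(x) = p(x | c_i)

n, m = len(pi), len(p_x_given_c[0])
p_x = [sum(pi[c] * p_x_given_c[c][x] for c in range(n)) for x in range(m)]
post = [[pi[c] * p_x_given_c[c][x] / p_x[x] for c in range(n)] for x in range(m)]

H_C_given_X = sum(p_x[x] * H(post[x]) for x in range(m))
bayes_error = sum(p_x[x] * (1 - max(post[x])) for x in range(m))

assert H_C_given_X ** 2 <= 4 * (n - 1) * bayes_error + 1e-9
print(H_C_given_X ** 2, 4 * (n - 1) * bayes_error)
```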
VI. CONCLUSION

Based on the Shannon entropy, we were able to give a unified definition and characterization to a class of information-theoretic divergence measures. Some of these measures have appeared earlier in various applications, but their use generally suffered from a lack of theoretical justification. The results presented here not only fill this gap but also provide a theoretical foundation for future applications of these measures. Some of the results, such as those presented in the Appendix, are related to entropy and are useful in their own right.

APPENDIX

Theorem 8: For any 0 <= x <= 1,

$$\frac{1}{2} H(x, 1-x) \le \sqrt{x(1-x)}. \qquad (A.1)$$

Proof: Consider the continuous function f(x) on the closed interval [0, 1]:

$$f(x) = 2\sqrt{x(1-x)} + x \log x + (1-x) \log (1-x). \qquad (A.2)$$

f(x) is twice differentiable in the open interval (0, 1), with

$$f''(x) = \frac{2\sqrt{x(1-x)} - \ln 2}{2(\ln 2)\,\big(x(1-x)\big)^{3/2}}, \qquad (A.3)$$

where ln is the natural logarithm. There are two different real solutions of the equation f''(x) = 0,

$$x_1 = \frac{1 - \sqrt{1-(\ln 2)^2}}{2} \quad \text{and} \quad x_2 = \frac{1 + \sqrt{1-(\ln 2)^2}}{2}.$$

It can be easily shown that 0 < x_1 < 1/2 < x_2 < 1.

From (A.3), it is clear that the function f''(x) is continuous in (0, 1) and the denominator of f''(x) is nonnegative in [0, 1]. Since

$$\lim_{x \to 0^+} \big(2\sqrt{x(1-x)} - \ln 2\big) = -\ln 2,$$

f''(x_1) = 0, and there exists no x in (0, x_1) such that f''(x) = 0, by the continuity of f''(x) it follows that f''(x) < 0 for 0 < x < x_1, and thus the function f(x) is concave in (0, x_1). For x = 1/2 in (x_1, x_2), we obtain

$$2\sqrt{(1/2)(1/2)} - \ln 2 = 1 - \ln 2 > 0,$$

which implies f''(1/2) > 0. Since f''(x_1) = f''(x_2) = 0 and there exists no x in (x_1, x_2) such that f''(x) = 0, we can conclude that f''(x) > 0 for x_1 < x < x_2; f(x) is therefore convex in (x_1, x_2). Similarly, from

$$\lim_{x \to 1^-} \big(2\sqrt{x(1-x)} - \ln 2\big) = -\ln 2,$$

it follows that f''(x) < 0 for x_2 < x < 1. This means that the function f(x) is concave in (x_2, 1). In summary, the function f(x) is concave in both open intervals (0, x_1) and (x_2, 1), and convex in (x_1, x_2); (x_1, f(x_1)) and (x_2, f(x_2)) are the points of inflection of f(x).

Since f'(1/2) = 0 and f(x) is convex in (x_1, x_2), f attains its minimum over [x_1, x_2] at x = 1/2, so

$$f(x) \ge f(1/2) = 0, \qquad \text{for } x \in [x_1, x_2]. \qquad (A.4)$$

Also, since f(x) is continuous in [0, x_1] and concave in (0, x_1), we have

$$f(x) \ge \min\big(f(0), f(x_1)\big) \ge \min\big(f(0), f(1/2)\big) = 0, \qquad \text{for } x \in [0, x_1]. \qquad (A.5)$$

Similarly,

$$f(x) \ge \min\big(f(x_2), f(1)\big) \ge \min\big(f(1/2), f(1)\big) = 0, \qquad \text{for } x \in [x_2, 1]. \qquad (A.6)$$

By combining (A.4), (A.5), and (A.6), we finally obtain

$$f(x) \ge 0, \qquad \text{for } 0 \le x \le 1, \qquad (A.7)$$

from which inequality (A.1) follows immediately. □

Theorem 9: Let q = (q_1, q_2, ..., q_n), 0 <= q_i <= 1, 1 <= i <= n, and sum_{j=1}^{n} q_j = 1. Then

$$\frac{1}{2} H(q) \le \sum_{j=1}^{n-1} \sqrt{q_j(1-q_j)}. \qquad (A.8)$$

Proof: By the recursivity of the entropy function [1, p. 30], we have

$$H(q) = H(q_1, q_2, \cdots, q_{n-2}, q_{n-1} + q_n) + (q_{n-1} + q_n)\, H\!\left(\frac{q_{n-1}}{q_{n-1}+q_n}, \frac{q_n}{q_{n-1}+q_n}\right).$$

By Theorem 8,

$$\frac{1}{2}(q_{n-1} + q_n)\, H\!\left(\frac{q_{n-1}}{q_{n-1}+q_n}, \frac{q_n}{q_{n-1}+q_n}\right) \le \sqrt{q_{n-1} q_n} \le \sqrt{q_{n-1}(1 - q_{n-1})},$$

and the result follows by applying the same argument repeatedly to H(q_1, q_2, ..., q_{n-2}, q_{n-1} + q_n). □
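Theorem 9 (and Theorem 8 as its n = 2 case) can likewise be probed numerically; the following sketch (ours) checks (A.8) on random probability vectors.

```python
import math, random

def H(q):
    return -sum(x * math.log2(x) for x in q if x > 0)

random.seed(0)
for _ in range(1000):
    n = random.randint(2, 6)
    w = [random.random() for _ in range(n)]
    q = [x / sum(w) for x in w]
    lhs = 0.5 * H(q)
    rhs = sum(math.sqrt(q[j] * (1 - q[j])) for j in range(n - 1))   # (A.8)
    assert lhs <= rhs + 1e-9
print("Theorem 9 inequality held on all random samples.")
```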
REFERENCES

[1] J. Aczel and Z. Daroczy, On Measures of Information and Their Characterizations. New York: Academic, 1975.
[2] S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," J. Roy. Statist. Soc., Ser. B, vol. 28, pp. 131-142, 1966.
[3] M. Ben Bassat, "f-entropies, probability of error, and feature selection," Inform. Contr., vol. 39, pp. 227-242, 1978.
[4] C. H. Chen, Statistical Pattern Recognition. Rochelle Park, NJ: Hayden Book Co., 1973, ch. 4.
[5] -, "On information and distance measures, error bounds, and feature selection," Inform. Sci., vol. 10, pp. 159-173, 1976.
[6] C. K. Chow and C. N. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Trans. Inform. Theory, vol. IT-14, no. 3, pp. 462-467, May 1968.
[7] I. Csiszár, "Information-type measures of difference of probability distributions and indirect observations," Studia Sci. Math. Hungar., vol. 2, pp. 299-318, 1967.
[8] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[9] S. Guiasu, Information Theory with Applications. New York: McGraw-Hill, 1977.
[10] D. V. Gokhale and S. Kullback, Information in Contingency Tables. New York: Marcel Dekker, 1978.
[11] M. E. Hellman and J. Raviv, "Probability of error, equivocation, and the Chernoff bound," IEEE Trans. Inform. Theory, vol. IT-16, no. 4, pp. 368-372, July 1970.
[12] R. W. Johnson, "Axiomatic characterization of the directed divergences and their linear combinations," IEEE Trans. Inform. Theory, vol. IT-25, no. 6, pp. 709-716, Nov. 1979.
[13] T. T. Kadota and L. A. Shepp, "On the best finite set of linear observables for discriminating two Gaussian signals," IEEE Trans. Inform. Theory, vol. IT-13, no. 2, pp. 278-284, Apr. 1967.
[14] T. Kailath, "The divergence and Bhattacharyya distance measures in signal selection," IEEE Trans. Commun. Technol., vol. COM-15, no. 1, pp. 52-60, Feb. 1967.
[15] J. N. Kapur, "A comparative assessment of various measures of directed divergence," Advances Manag. Stud., vol. 3, no. 1, pp. 1-16, Jan. 1984.
[16] D. Kazakos and T. Cotsidas, "A decision theory approach to the approximation of discrete probability densities," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-2, no. 1, pp. 61-67, Jan. 1980.
[17] S. Kullback, Information Theory and Statistics. New York: Dover, 1968.
[18] -, "A lower bound for discrimination information in terms of variation," IEEE Trans. Inform. Theory, vol. IT-13, pp. 326-327, Jan. 1967.
[19] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, pp. 79-86, 1951.
[20] U. Kumar, V. Kumar, and J. N. Kapur, "Some normalized measures of directed divergence," Int. J. Gen. Syst., vol. 13, pp. 5-16, 1986.
[21] J. Lin and S. K. M. Wong, "Approximation of discrete probability distributions based on a new divergence measure," Congressus Numerantium, vol. 61, pp. 75-80, 1988.
[22] H. Jeffreys, "An invariant form for the prior probability in estimation problems," Proc. Roy. Soc. London, Ser. A, vol. 186, pp. 453-461, 1946.
[23] C. R. Rao and T. K. Nayak, "Cross entropy, dissimilarity measures, and characterizations of quadratic entropy," IEEE Trans. Inform. Theory, vol. IT-31, no. 5, pp. 589-593, Sept. 1985.
[24] C. R. Rao, "Diversity and dissimilarity coefficients: A unified approach," Theoretical Population Biol., vol. 21, pp. 24-43, 1982.
[25] -, "Diversity: Its measurement, decomposition, apportionment and analysis," Sankhya: Indian J. Statist., Ser. A, vol. 44, pt. 1, pp. 1-22, Feb. 1982.
[26] G. T. Toussaint, "On some measures of information and their application to pattern recognition," in Proc. Conf. Measures of Information and Their Applications, Indian Inst. Technol., Bombay, Aug. 1974.
[27] -, "Sharper lower bounds for discrimination information in terms of variation," IEEE Trans. Inform. Theory, vol. IT-21, no. 1, pp. 99-100, Jan. 1975.
[28] I. Vajda, "Note on discrimination information and variation," IEEE Trans. Inform. Theory, vol. IT-16, pp. 771-773, Nov. 1970.
[29] -, "On the f-divergence and singularity of probability measures," Periodica Mathem. Hungarica, vol. 2, pp. 223-234, 1972.
[30] -, Theory of Statistical Inference and Information. Dordrecht-Boston: Kluwer, 1989.
[31] J. W. Van Ness, "Dimensionality and the classification performance with independent coordinates," IEEE Trans. Syst. Man Cybern., vol. SMC-7, pp. 560-564, July 1977.
[32] A. K. C. Wong and M. You, "Entropy and distance of random graphs with application to structural pattern recognition," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-7, no. 5, pp. 599-609, Sept. 1985.