A Two-Parameter Family of Non-Parametric, Deformed Exponential Manifolds


Information Geometry (2023) 7 (Suppl 1):S171–S186

https://doi.org/10.1007/s41884-022-00079-5

RESEARCH PAPER


Nigel J. Newton¹

Received: 1 September 2022 / Revised: 1 September 2022 / Accepted: 13 November 2022 / Published online: 6 December 2022
© The Author(s) 2022

Abstract
We construct a new family of non-parametric statistical manifolds by means of a two-
parameter class of deformed exponential functions, that includes functions with power-
law, linear and sublinear rates of growth. The manifolds are modelled on weighted,
mixed-norm Sobolev spaces that are especially suited to this purpose, in the sense that
an important class of nonlinear superposition operators (those used in the construction
of divergences and tensors) act continuously on them. We analyse variants of these
operators, that map into “subordinate” Sobolev spaces, and evaluate the associated
gain in regularity. With appropriate choice of parameter values, the manifolds support
a large variety of the statistical divergences and entropies appearing in the literature,
as well as their associated tensors, e.g. the Fisher-Rao metric. Manifolds of finite
measures and probability measures are constructed; the latter are shown to be smoothly
embedded submanifolds of the former.

Keywords Banach manifold · Fisher-Rao metric · Information Theory · Log-Sobolev inequality · Non-parametric statistics · Sobolev spaces

Mathematics Subject Classification 46N30 · 60D05 · 60H15 · 62B10 · 93E11

Communicated by Jürgen Jost.

Nigel J. Newton
nigeljnewton@gmail.com

¹ University of Essex, Colchester CO4 3SQ, UK

1 Introduction

Great progress has been made during the last five decades on the theory of information
geometry, and its application in many scientific fields. The fundamental parametric
theory is well developed, and is treated pedagogically in a number of texts (see, for
example, [1, 3, 5, 9, 12]). The non-parametric theory, on the other hand, is largely
to be found in a series of research papers. A notable exception is the text [2], which
treats parametric and non-parametric theories in a unified way. The step from the
parametric to the non-parametric setting is not an easy one, since it introduces the
infinite-dimensional spaces of Functional Analysis.

The parametric exponential model is arguably the nucleus of the subject. Its exten-
sion to the non-parametric setting was accomplished by G. Pistone and his co-workers
in the fundamental series of papers [4, 7, 18, 19]. The manifolds there constructed are
“maximally inclusive” in a precise sense, and various statistical divergences, including
the Amari α-divergences, for α in the interval [−1, 1], are smooth on them. As with
parametric exponential manifolds, the log of the density is used as a chart. This requires
a model space with a particularly strong topology: the exponential Orlicz space, which
has a number of disadvantages. Since its publication, several variations of the expo-
nential Orlicz manifold have been developed. In [11], the exponential function was
replaced by the Tsallis q-deformed exponential, which has an important interpretation
in statistical mechanics. (See [20], and Chapter 7 in [13].) The model space used is
$L^\infty$, which significantly restricts membership of the manifold constructed. A large
class of deformed exponential functions (the “ϕ-functions”) was used in [21, 22] to
construct inclusive manifolds of probability measures, in which the model spaces are
Musielak-Orlicz spaces.
The constructions in these references begin with the tangent space at a generic
point, P, of a set of measures. A representation of tangent vectors, derived from the
(deformed) logarithm, is then used to construct a local chart, which naturally maps
to a model space defined in terms of P. However, the model spaces required in this
approach can be difficult to use in practice. A different approach was taken in [14, 15],
where a global chart was used with the specific deformed logarithm $\log_d y = y - 1 + \log y$
to construct an inclusive manifold modelled on Lebesgue $L^p$ spaces (including the
Hilbert space $L^2$). The corresponding deformed exponential has bounded derivatives
of all orders; a property that has a number of advantages. Both the probability density
and its (non-deformed) log (objects of central importance to information geometry)
belong to the model space and, considered as superposition operators mapping into
this space, are continuous.
The sample space in all these manifolds is an abstract probability space: a set
of “outcomes”, a class of measurable subsets of these outcomes, and a probability
measure attaching a number between 0 and 1 to each subset. This has the advantage
of generality: the sample space could be $\mathbb{R}^d$, or the path space of a stochastic
process, and so on. However, topologies, metrics and linear structures on the sample space play
important roles in most applications, including the theory of partial differential equa-
tions. A natural direction for research in the non-parametric theory is to specialise
the manifolds outlined above to such problems by incorporating the topology of the
sample space in the manifolds. One way of achieving this is to use model spaces of
Sobolev type. This was carried out in the context of the exponential Orlicz manifold
in [10], where it was applied to the spatially homogeneous Boltzmann equation. It
was carried out in the context of the $L^p$ manifolds in [17]. The resulting fusion of
information and sample space topologies is mutually beneficial. For example, it was
shown in [17] that log-Sobolev embedding strengthens the topology of the raw $L^p$
manifolds in a useful way.


This paper takes the approach of [14, 15, 17] further, by constructing a new class of
non-parametric manifolds based on a two-parameter deformed exponential, dubbed
the η-exponential. The paper has two primary aims: (i) to provide manifolds on which
a wider class of divergences and entropy functions can be accommodated; (ii) to refine
the Sobolev space methods of [17] in order to increase the degree of smoothness they
confer on these quantities. Regarding the first aim, there is a vast literature on the
importance of different divergences and entropies to particular branches of science,
a full review of which is beyond the scope of this article. Let us mention, however,
the special volume (13) of Entropy on applications of the Tsallis q-divergences and
entropies. The review article by Tsallis [20], in particular, cites many applications in
which the q parameter should be strictly greater than or strictly less than the value
q = 1 of the Boltzmann-Gibbs theory. In the context of Amari’s α-divergences, this
translates to values of α both greater than and less than 1. The author was motivated,
in particular, by the study of multi-objective measures of error in nonlinear filtering
that lead naturally to divergences with α < −1, [16].
The two-parameter η-exponential we use corresponds to a deformed exponential
introduced in [8], but is reparametrised into the Amari setting. The manifolds accom-
modate α-divergences and entropies over a range of α values, but are especially suited
to those for which α ∈ [η− , η+ ], for the chosen parameters −∞ < η− < 1 ≤ η+ <
∞. The parameter values ±1 yield the linear-growth deformed exponential of [14,
15, 17]; values of η− other than −1 yield deformed exponentials with power law or
sublinear growth. For a more detailed account of deformed exponentials, and their use
in Statistical Mechanics, the reader is referred to [13].
The paper is structured as follows. Section 2 introduces the model Sobolev spaces,
expanding considerably on the material in [17]. It also introduces a new class of “subor-
dinate” Sobolev spaces, which are later used in the analysis of superposition operators
derived from Amari’s α-embedding maps. (The latter can be used in the analysis of
divergences and entropies.) Section 3 introduces the two-parameter deformed expo-
nential and uses it to construct manifolds of finite measures. Section 4 then shows that
the subsets of probability measures are smoothly embedded submanifolds of those of
Sect. 3.

2 The Sobolev spaces

The manifolds are modelled on mixed-norm, weighted Sobolev spaces generalising
those defined in [17]. The spaces are based on a reference probability measure $\mu$ on
the sample space $\mathbb{R}^d$. This takes the form

\[ \mu(dx) = r(x)\,dx = \exp(l_r(x))\,dx, \tag{1} \]

where $l_r : \mathbb{R}^d \to \mathbb{R}$ is a continuous function such that $\mu(\mathbb{R}^d) = 1$. Stronger results
can be obtained with the following additional hypothesis on $l_r$.
(E) The log-density $l_r$ is constructed as follows. Let $\theta : [0,\infty) \to [0,\infty)$ be a strictly
increasing, convex function that is twice continuously differentiable on $(0,\infty)$, is
such that $\lim_{z\downarrow 0}\theta'(z) < \infty$, $-\sqrt{\theta}$ is convex and, for some $t \in (1,2]$,

\[ \theta(z) = \begin{cases} 0 & \text{if } z = 0 \\ c + z^t & \text{if } z \ge z_0 \end{cases}, \quad \text{where } z_0 \ge 0 \text{ and } c \in \mathbb{R}. \tag{2} \]

$l_r$ then takes the special form

\[ l_r(x) := \sum_i \bigl(C - \theta(|x_i|)\bigr), \tag{3} \]

where $C \in \mathbb{R}$ is such that $\mu(\mathbb{R}^d) = 1$. (Some examples are given in [17], including
the Gaussian case, in which $c = z_0 = 0$ and $t = 2$.)
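
As a small numerical illustration of (2) and (3), the following Python sketch computes the constant $C$ in the one-dimensional Gaussian case ($c = z_0 = 0$, $t = 2$, so that $\theta(z) = z^2$). The function names `theta` and `normalising_constant`, the truncation of the real line to $[-12, 12]$ and the use of the trapezoidal rule are assumptions made purely for illustration; they are not taken from [17]. Since $l_r$ is a sum over coordinates, the same $C$ normalises the product measure on $\mathbb{R}^d$.

```python
import math

def theta(z):
    # Gaussian case of (2): c = z_0 = 0 and t = 2 (illustrative choice).
    return z ** 2

def normalising_constant(theta_fn, half_width=12.0, n=24001):
    """Return C such that the integral of exp(C - theta(|z|)) over R equals 1.

    The integral is approximated by the trapezoidal rule on
    [-half_width, half_width]; the tails are assumed to be negligible.
    """
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for j in range(n):
        z = -half_width + j * h
        weight = 0.5 if j in (0, n - 1) else 1.0
        total += weight * math.exp(-theta_fn(abs(z)))
    return -math.log(total * h)

C = normalising_constant(theta)
# In the Gaussian case the exact value is -log(sqrt(pi)).
print(C, -0.5 * math.log(math.pi))
```

Other choices of $\theta$ satisfying (E) can be passed to the same helper, provided the truncation interval is wide enough for the tails to be negligible.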
The model spaces used in the construction of the manifolds comprise measurable
functions defined on $\mathbb{R}^d$ having weak derivatives of various orders, that belong to
the Lebesgue spaces $L^\lambda = L^\lambda(\mu)$ for various exponents $\lambda$. ($f \in L^\lambda$ if and only if
$E_\mu|f|^\lambda := \int |f(x)|^\lambda\,\mu(dx) < \infty$.) Under (E), $\mu$ is a product measure and the model
spaces admit a log-Sobolev embedding result.
Let $C^\infty(\mathbb{R}^d;\mathbb{R})$ be the space of continuous functions with continuous partial
derivatives of all orders, and let $C_0^\infty(\mathbb{R}^d;\mathbb{R})$ be the subspace of those functions having
compact support. For any $\lambda \in [2,\infty)$, and any $0 \le k \le \lambda$, the space $W^{k,\lambda}$ is the
mixed-norm Sobolev space comprising measurable functions $a \in L^\lambda$ that have weak
partial derivatives up to order $k$, those of order $i$ belonging to the Lebesgue space
$L^{\lambda/i}$. We shall also use the “subordinate” spaces $W^{k,\lambda;l}$, for certain integer values of
$l$. Let $\lambda^\circ$ be the following Lebesgue exponent: if (E) holds and $k \ge 1$ then $\lambda^\circ = \lambda$,
otherwise $\lambda^\circ = \lambda - \epsilon$ for some $0 < \epsilon \ll 1$. For $1 \le l \le \lambda^\circ$, the space $W^{k,\lambda;l}$
comprises measurable functions $a \in L^{\lambda^\circ/l}$ that have weak partial derivatives up to order
$k_l := \min\{k, \lambda^\circ - l\}$, those of order $i$ belonging to the Lebesgue space $L^{\lambda^\circ/(l+i)}$.
(For convenience $W^{k,\lambda;0} := W^{k,\lambda}$.)
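
The bookkeeping of Lebesgue exponents in these definitions can be tabulated mechanically. The sketch below does so under two simplifying assumptions that are not part of the text: (E) holds with $k \ge 1$, so that $\lambda^\circ = \lambda$, and $\lambda$ is an integer. The function name `exponents` is hypothetical.

```python
from fractions import Fraction

def exponents(k, lam, l=0):
    """Map derivative order i to the Lebesgue exponent of the space containing it.

    l = 0 describes W^{k,lam}: the function lies in L^lam and derivatives of
    order i >= 1 lie in L^{lam/i}.
    l >= 1 describes the subordinate space W^{k,lam;l}: derivatives up to
    order k_l = min(k, lam - l) exist, those of order i lying in L^{lam/(l+i)}.
    Assumes lam_circ = lam (hypothesis (E) with k >= 1) and integer lam.
    """
    lam = Fraction(lam)
    if l == 0:
        return {i: lam / max(i, 1) for i in range(k + 1)}
    k_l = min(k, int(lam) - l)
    return {i: lam / (l + i) for i in range(k_l + 1)}

for l in (0, 1, 3):
    print(l, exponents(2, 4, l))
```

For $k = 2$ and $\lambda = 4$ the exponents decrease as $l$ increases, and for $l = 3$ only first-order derivatives survive ($k_3 = 1$), which reflects the weaker topologies of the subordinate spaces.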
Model spaces with more general derivative structures were developed in [17],
including fixed-norm spaces; however, the Lebesgue exponents in $W^{k,\lambda}$ and its
subordinates are especially suited to the deformed logarithms used here. Weak derivatives
are defined in the usual way: for any $\varphi \in C_0^\infty(\mathbb{R}^d;\mathbb{R})$,

\[ \int (\partial_i a)\,\varphi\,dx = -\int a\,(\partial_i\varphi)\,dx, \quad \text{where } \partial_i a \text{ is shorthand for } \frac{\partial a}{\partial x_i}. \tag{4} \]

In order to express higher-order weak derivatives in an efficient way, we use the
following standard “multi-index” notation. Let $S := \{0,\dots,k\}^d$ be the set of $d$-tuples
of integers in the range $0 \le s_i \le k$. For $s \in S$, we define $|s| = \sum_i s_i$, and denote by
$S_i := \{s \in S : 1 \le |s| \le i\}$ the set of $d$-tuples of weight at most $i$. For appropriate $a$,
we define the following:

\[ D^s a = \partial_1^{s_1} \cdots \partial_d^{s_d} a \]

\[ \|a\|^\lambda_{W^{k,\lambda}} = \|a\|^\lambda_{L^\lambda} + \sum_{s\in S_k} \|D^s a\|^\lambda_{L^{\lambda/|s|}} \]

\[ \|a\|^\lambda_{W^{k,\lambda;l}} = \|a\|^\lambda_{L^{\lambda^\circ/l}} + \sum_{s\in S_{k_l}} \|D^s a\|^\lambda_{L^{\lambda^\circ/(l+|s|)}}, \quad 1 \le l \le \lambda^\circ. \tag{5} \]

Theorem 1 (i) For any $1 \le l \le \lambda^\circ$, $W^{k,\lambda;l}$ and $W^{k,\lambda}$ are Banach spaces with
respect to the norms in (5);
(ii) For any $1 \le l \le \lambda^\circ$, $C_0^\infty(\mathbb{R}^d;\mathbb{R})$ is dense in $W^{k,\lambda;l}$ and $W^{k,\lambda}$.

Proof Both parts are proved in Theorem 1 and Lemmas 1 and 2 in [17]. (The only
property required of the log density, $l_r$, is its continuity.) Part (ii) is a consequence of
the non-increasing nature of the Lebesgue exponents in $W^{k,\lambda}$ and $W^{k,\lambda;l}$. □


The spaces admit the following continuous embeddings:

\[ W^{k,\lambda;0} := W^{k,\lambda} \prec W^{k,\lambda;l} \prec W^{k,\lambda;\tilde l}, \quad \text{where } 1 \le l < \tilde l \le \lambda^\circ. \tag{6} \]

The spaces $W^{k,\lambda}$ will be used as model spaces for manifolds of finite measures in Sect.
3, and centred versions of them, as model spaces for manifolds of probability measures
in Sect. 4. The following theorem derives some properties of particular types of map
acting on them. It will be used in the sequel.

Theorem 2 (i) For any $\psi \in C^\infty(\mathbb{R};\mathbb{R})$ having bounded derivatives of all orders,
the nonlinear superposition operator $\Psi : W^{k,\lambda} \to W^{k,\lambda}$, defined by $\Psi(a)(x) =
\psi(a(x))$, is continuous. Its spatial derivatives are given by the Faà di Bruno
formula

\[ D^s\Psi(a) = F_s(a) := \sum_{\pi\in\mathcal{P}(s)} \psi^{(|\pi|)}(a) \prod_{\sigma\in\pi} D^\sigma a, \tag{7} \]

where $\pi = \{\sigma_1,\dots,\sigma_{|\pi|} \in S_{|s|};\ 1 \le |\sigma_j| \le |s|,\ \sum_j \sigma_j = s\}$ is a partition of $s$,
$|\pi|$ is the cardinal of $\pi$, and $\mathcal{P}(s)$ is the set of all such partitions.
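(ii) For appropriate Banach spaces of functions on $\mathbb{R}^d$, $A$, $B$ and $C$, let $\Pi_{A,B} :
A \times B \to C$ be defined by $\Pi_{A,B}(a,b)(x) = a(x)b(x)$. $\Pi_{A,B}$ is a well defined,
continuous, bilinear map in the following instances:

\[ \begin{aligned}
& A = B = C = L^\infty \cap W^{k,\lambda} \text{ normed by } \|\cdot\|_{L^\infty} + \|\cdot\|_{W^{k,\lambda}}; \\
& A = L^\infty \cap W^{k,\lambda},\ B = W^{k,\lambda},\ C = W^{k,\lambda;1}; \\
& A = L^\infty \cap W^{k,\lambda},\ B = C = W^{k,\lambda;l}, \quad 1 \le l \le \lambda^\circ; \\
& A = W^{k,\lambda;l},\ B = W^{k,\lambda;\tilde l},\ C = W^{k,\lambda;l+\tilde l}, \quad 1 \le l, \tilde l;\ l + \tilde l \le \lambda^\circ.
\end{aligned} \tag{8} \]

The spatial derivatives of $\Pi_{A,B}$ are given by the Leibniz formula

\[ D^s\Pi_{A,B}(a,b) = H_s(a,b) := \sum_{\sigma\le s} \frac{s!}{\sigma!(s-\sigma)!}\, D^\sigma a\, D^{s-\sigma} b, \tag{9} \]

where $s! := s_1! \cdots s_d!$, and $\sigma \le s$ if and only if $\sigma_i \le s_i$ for $1 \le i \le d$.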
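
The multi-index coefficients in the Leibniz formula (9) factorise into ordinary binomial coefficients, one per coordinate. The following Python sketch enumerates the terms for a given $s$; the function name `leibniz_terms` is hypothetical and the code is only an illustration of the index set $\{\sigma \le s\}$.

```python
import itertools
import math

def leibniz_terms(s):
    """Yield (sigma, coefficient) for every multi-index sigma <= s in (9)."""
    for sigma in itertools.product(*(range(si + 1) for si in s)):
        coeff = 1
        for si, gi in zip(s, sigma):
            # s!/(sigma!(s - sigma)!) is a product of binomial coefficients.
            coeff *= math.comb(si, gi)
        yield sigma, coeff

terms = dict(leibniz_terms((2, 1)))
for sigma, coeff in sorted(terms.items()):
    print(sigma, coeff)
# The coefficients sum to 2**|s|, as for the classical product rule.
print(sum(terms.values()), 2 ** sum((2, 1)))
```

Each term pairs $D^\sigma a$ with $D^{s-\sigma} b$, so that the orders of the two derivatives always add up to $|s|$; this is what allows Hölder's inequality to be applied with the exponents in (8).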


(iii) For $\psi$ as in part (i), and any $0 \le l \le \lambda^\circ$, the superposition operator $\Psi_l :
W^{k,\lambda} \to W^{k,\lambda;l}$, defined by $\Psi_l(a)(x) = \psi(a(x))$, is of class $C^l$. Its derivatives
are as follows:

\[ \Psi_l^{(i)}(u_1,\dots,u_i)(x) = \psi^{(i)}(a(x))\,u_1(x)\cdots u_i(x), \quad \text{for } 1 \le i \le l. \tag{10} \]

The proof makes use of the following Lemma.

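Lemma 1 Let $a \in W^{k,\lambda}$, let $(a_n \ne a)$ and $(b_n)$ be sequences converging to $a$ in the
sense of $W^{k,\lambda}$, and let $B$ be the unit ball of $W^{k,\lambda}$. For any continuous, bounded function
$f : \mathbb{R} \to \mathbb{R}$,

\[ \|a_n - a\|^{-1}_{W^{k,\lambda}}\, \|(f(b_n) - f(a))(a_n - a)\|_{L^{\lambda^\circ}} \to 0 \tag{11} \]

\[ \text{and} \quad \sup_{u\in B} \|(f(a_n) - f(a))u\|_{L^{\lambda^\circ}} \to 0. \tag{12} \]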

Proof We use the generalised Hölder inequality,

\[ \|(f(b_n) - f(a))(a_n - a)\|_{L^{\lambda^\circ}} \le \|f(b_n) - f(a)\|_A\, \bigl\||a_n - a|^\lambda\bigr\|_E^{1/\lambda}, \tag{13} \]

where $A$ and $E$ are the following Banach spaces. If (E) does not hold or $k = 0$, then
$A = L^{\lambda\lambda^\circ/\epsilon}$ and $E = L^1$; (13) is then the classical Hölder inequality on dual Lebesgue
spaces. If (E) holds and $k \ge 1$ then $A = \exp L^{1/\beta}(\mu)$ and $E = L^1\log^\beta L(\mu)$, where
$\beta = (t-1)/t$; these are Orlicz spaces based on the complementary Young functions:

\[ G_\beta(z) = \int_0^z \bigl(\exp(y^{1/\beta}) - 1\bigr)\,dy \quad \text{and} \quad F_\beta(z) = \int_0^z \log^\beta(y+1)\,dy. \tag{14} \]

It follows from a log-Sobolev embedding theorem (see, for example, Theorem 7.12
in [6]), and the following representation for first-order weak derivatives

\[ \partial_i |a_n - a|^\lambda = \lambda |a_n - a|^{\lambda-1}\,\mathrm{sgn}(a_n - a)\,\partial_i(a_n - a) \in L^1, \quad 1 \le i \le d, \]

that $\||a_n - a|^\lambda\|_E \le K\|a_n - a\|^\lambda_{W^{k,\lambda}}$, for some $K < \infty$. (See the proof of Lemma
4 in [17] for fuller details.) Now $f(b_n) - f(a)$ is bounded and converges to zero in
probability, and so it converges to zero in the sense of $A$ (with either definition of $A$),
and (11) follows. A similar argument establishes (12). □


Proof of Theorem 2 A proof of part (i) is given in Proposition 2 of [17]. It involves a
sequence $f_n \in C^\infty(\mathbb{R}^d;\mathbb{R})$ converging to $a$ in the sense of $W^{k,\lambda}$. $D^s\Psi(f_n)$ is defined
in the classical sense, and is equal to $F_s(f_n)$. It is then shown that $F_s(f_n)$ converges to
$F_s(a)$ in the sense of $L^{\lambda/|s|}$. $F_s(a)$ is then shown to be the weak derivative of $\Psi(a)$ by the
standard procedure of integrating $F_s(f_n)\varphi$ by parts for a $\varphi \in C_0^\infty(\mathbb{R}^d;\mathbb{R})$. Finally,
continuity is established by repeating the argument for a sequence $W^{k,\lambda} \ni a_n \to a$.
The proof of part (ii) is similar. Let $f_n, g_n \in C^\infty(\mathbb{R}^d;\mathbb{R})$ be sequences converging
to $a$ (respectively $b$) in the sense of $W^{k,\lambda;l}$ (respectively $W^{k,\lambda;\tilde l}$). Clearly $f_n b \to ab$
in the sense of $L^{\lambda^\circ/(l+\tilde l)}$. Furthermore, for any $s \in S_{k_{l+\tilde l}}$,

\[ H_s(f_n, g_n) - H_s(a, g_n) = \sum_{\sigma\le s} \frac{s!}{\sigma!(s-\sigma)!}\,(D^\sigma f_n - D^\sigma a)\,D^{s-\sigma} g_n, \]

and so it follows from Hölder’s inequality that

\[ \|H_s(f_n, g_n) - H_s(a, g_n)\|_{L^{\lambda^\circ/(|s|+l+\tilde l)}} \to 0. \]

A similar argument can be applied to $H_s(a, g_n) - H_s(a, b)$, and so $D^s(f_n g_n) \to
H_s(a,b)$ in $L^{\lambda^\circ/(|s|+l+\tilde l)}$. Once again, an argument involving integration by parts
establishes that $H_s(a,b) = D^s\Pi_{A,B}(a,b)$, and an argument involving sequences
$W^{k,\lambda;l} \ni a_n \to a$, and $W^{k,\lambda;\tilde l} \ni b_n \to b$ establishes continuity. Similar arguments
can be used with the other cases.
The case $l = 0$ of part (iii) is established in part (i). Let $(a_n \ne a)$ be a sequence
converging to $a$ in the sense of $W^{k,\lambda}$, and let

\[ \Delta_n := \psi(a_n) - \psi(a) - \psi^{(1)}(a)(a_n - a). \tag{15} \]

Then, according to the mean-value theorem, $\Delta_n = \delta_n(a_n - a)$, where

\[ \delta_n = \psi^{(1)}((1-\beta_n)a + \beta_n a_n) - \psi^{(1)}(a) \quad \text{for some } 0 \le \beta_n(x) \le 1. \]

Lemma 1 shows that $\|a_n - a\|^{-1}_{W^{k,\lambda}}\|\Delta_n\|_{L^{\lambda^\circ}} \to 0$, and so $\Psi_1 : W^{k,\lambda} \to L^{\lambda^\circ}$ is
differentiable, with derivative as in (10) with $i = 1$. That this derivative is continuous
follows from (12).
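For $0 \le i \le l - 1$, let $F_{s,i}(a)$ be as in (7), but with $\psi$ replaced by $\psi^{(i)}$; then

\[ \begin{aligned}
& F_s(a_n) - F_s(a) - F_{s,1}(a)(a_n - a) - \sum_{\pi\in\mathcal{P}(s)} \psi^{(|\pi|)}(a) \sum_j \Gamma_{\pi,j}\, D^{\sigma_j}(a_n - a) \\
& \quad = \sum_{\pi\in\mathcal{P}(s)} \Bigl( \Gamma_{\pi,0}\,\delta_\pi\,(a_n - a) + \sum_j (\Gamma_{\pi,j}\,\zeta_\pi + \Lambda_{\pi,j})\, D^{\sigma_j}(a_n - a) \Bigr),
\end{aligned} \]

where $\{\sigma_1,\dots,\sigma_j,\dots,\sigma_{|\pi|}\}$ is an enumeration of $\pi$,

\[ \begin{aligned}
\Gamma_{\pi,0} &= \prod_{\sigma\in\pi} D^\sigma a, \qquad \Gamma_{\pi,j} = \prod_{m\ne j} D^{\sigma_m} a, \\
\delta_\pi &= \psi^{(|\pi|+1)}(\beta_n a_n + (1-\beta_n)a) - \psi^{(|\pi|+1)}(a), \\
\zeta_\pi &= \psi^{(|\pi|)}(a_n) - \psi^{(|\pi|)}(a), \\
\Lambda_{\pi,j} &= \psi^{(|\pi|)}(a_n) \Bigl( \prod_{m<j} D^{\sigma_m} a_n - \prod_{m<j} D^{\sigma_m} a \Bigr) \prod_{m>j} D^{\sigma_m} a.
\end{aligned} \]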
m< j m< j m> j


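Hölder’s inequality now shows that

\[ \begin{aligned}
R_1 &:= \|\Gamma_{\pi,0}\,\delta_\pi\,(a_n - a)\|_{L^{\gamma_0}} \le \|\Gamma_{\pi,0}\|_{L^{\lambda^\circ/|s|}}\, \|\delta_\pi(a_n - a)\|_{L^{\lambda^\circ}} \\
R_2 &:= \|\Gamma_{\pi,j}\,\zeta_\pi\, D^{\sigma_j}(a_n - a)\|_{L^{\gamma_0}} \le \|\Gamma_{\pi,j}\,\zeta_\pi\|_{L^{\gamma_j}}\, \|D^{\sigma_j}(a_n - a)\|_{L^{\lambda/|\sigma_j|}} \\
R_3 &:= \|\Lambda_{\pi,j}\, D^{\sigma_j}(a_n - a)\|_{L^{\gamma_0}} \le \|\Lambda_{\pi,j}\|_{L^{\gamma_j}}\, \|D^{\sigma_j}(a_n - a)\|_{L^{\lambda/|\sigma_j|}}
\end{aligned} \]

where $\gamma_0 = \lambda^\circ/(|s|+1)$ and $\gamma_j = \lambda^\circ/(|s| - |\sigma_j| + 1)$. We now claim that

\[ \|a_n - a\|^{-1}_{W^{k,\lambda}}\, R_i \to 0 \quad \text{for } i = 1, 2, 3. \]

That this is true of $R_1$ follows from Lemma 1. Regarding $R_2$, $\zeta_\pi$ is bounded and
converges to zero in probability and so, according to the dominated convergence
theorem, $\Gamma_{\pi,j}\,\zeta_\pi \to 0$ in the sense of $L^{\gamma_j}$. Finally, the bracketed term in $\Lambda_{\pi,j}$ can
be expanded as a telescopic sum of products each containing one of the differences
$D^{\sigma_m}(a_n - a)$. Hölder’s inequality then shows that $\Lambda_{\pi,j} \to 0$ in the sense of $L^{\gamma_j}$.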

We have thus shown that $D^s\Psi_1 : W^{k,\lambda} \to L^{\lambda^\circ/(|s|+1)}$ is differentiable, and

\[ (D^s\Psi_1)^{(1)} u = F_{s,1}(a)\,u + \sum_{\pi\in\mathcal{P}(s)} \psi^{(|\pi|)}(a) \sum_j \Gamma_{\pi,j}\, D^{\sigma_j} u. \]

We can now apply the Leibniz and Faà di Bruno formulae to $D^s(\psi^{(1)}(a)u)$ to show
that it is equal to $(D^s\Psi_1)^{(1)} u$, and is continuous in $(a, u)$. This proves (10) for the
case $l = 1$.
We now proceed by induction on $l$. Suppose that (10) is correct for $l$; then, since
$W^{k,\lambda;l} \prec W^{k,\lambda;l+1}$, $\Psi_{l+1}$ is of class $C^l$, with derivatives as in (10). Setting $\Delta_{l,n} =
\psi^{(l)}(a_n) - \psi^{(l)}(a) - \psi^{(l+1)}(a)(a_n - a)$, we can apply the arguments used above on
$\Delta_n$ of (15), and the fact that

\[ \sup_{u_i\in B} \|\Pi_{A,B}(\Delta_{l,n}, u_1\cdots u_l)\|_{W^{k,\lambda;l+1}} \le K\|\Delta_{l,n}\|_{W^{k,\lambda;1}}, \quad \text{for some } K < \infty, \]

where $B$ is the unit ball of $W^{k,\lambda}$, to show that $\Psi^{(l)}_{l+1}$ is of class $C^1$. □


3 The manifolds of finite measures

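The charts of the statistical manifolds developed here are based on a two-parameter
family of η-deformed logarithms. These are defined in terms of Amari’s α-logarithms,
$\ell_\alpha : (0,\infty) \to \mathbb{R}$:

\[ \ell_\alpha(y) = \begin{cases} \dfrac{2}{1-\alpha}\bigl(y^{(1-\alpha)/2} - 1\bigr) & \text{if } \alpha \ne 1 \\ \log y & \text{if } \alpha = 1 \end{cases} \tag{16} \]

The η-logarithm is defined for $\eta = (\eta_-, \eta_+)$, ($-\infty < \eta_- < 1 \le \eta_+ < \infty$) as:

\[ \log_\eta(y) = \ell_{\eta_-}(y) + \ell_{\eta_+}(y) \tag{17} \]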


The deformed logarithm $\log_{(-1,+1)}$ is that used to construct a family of highly inclusive
statistical manifolds in [14, 15, 17]. Setting $\kappa = (\eta_+ - \eta_-)/4$ and $r = (2 - \eta_- - \eta_+)/4$,
$\log_\eta$ is essentially the two-parameter $(\kappa, r)$-logarithm defined in [8]. The different
weightings for the two components of the deformed logarithm used here have no
effect on the membership or properties of the manifolds, but are more convenient in
the context of information geometry.
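
The definitions (16) and (17) translate directly into code. The following Python sketch is an illustration only; the function names `amari_log`, `log_eta` and `kappa_r` are hypothetical. It checks numerically that the choice $\eta = (-1, +1)$ gives $y - 1 + \log y$, the deformed logarithm of [14, 15, 17] mentioned in the Introduction.

```python
import math

def amari_log(alpha, y):
    """Amari alpha-logarithm of (16), for y > 0."""
    if alpha == 1:
        return math.log(y)
    return 2.0 / (1.0 - alpha) * (y ** ((1.0 - alpha) / 2.0) - 1.0)

def log_eta(y, eta_minus, eta_plus):
    """Two-parameter eta-logarithm of (17)."""
    return amari_log(eta_minus, y) + amari_log(eta_plus, y)

def kappa_r(eta_minus, eta_plus):
    """The (kappa, r) parameters of [8], as related to eta in the text."""
    return (eta_plus - eta_minus) / 4.0, (2.0 - eta_minus - eta_plus) / 4.0

for y in (0.1, 1.0, 7.5):
    # eta = (-1, +1): the two printed values should agree.
    print(y, log_eta(y, -1.0, 1.0), y - 1.0 + math.log(y))
print(kappa_r(-1.0, 1.0))
```

Note that $\log_\eta(1) = 0$ for every admissible $\eta$, since each $\alpha$-logarithm vanishes at $y = 1$.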
Now $\inf_y \log_\eta y = -\infty$, $\sup_y \log_\eta y = +\infty$, and $\log_\eta \in C^\infty((0,\infty);\mathbb{R})$ with
strictly positive first derivative $y^{-(1+\eta_-)/2} + y^{-(1+\eta_+)/2}$ and so, according to the inverse
function theorem, $\log_\eta$ is a diffeomorphism from $(0,\infty)$ onto $\mathbb{R}$. Let $\exp_\eta$ be its inverse.
This can be thought of as a deformed exponential function. Using $f^{(n)}$ to denote the
$n$-th derivative of a function $f$, we have

\[ \exp_\eta^{(1)} = \frac{1}{1 + \exp_\eta^{\delta/2}}\, \exp_\eta^{[1+\eta_+]/2} = \frac{\exp_\eta^{\delta/2}}{1 + \exp_\eta^{\delta/2}}\, \exp_\eta^{[1+\eta_-]/2}, \tag{18} \]

where $\delta := \eta_+ - \eta_-$. So $\exp_\eta$ satisfies the differential inequality

\[ \exp_\eta^{(1)} < \exp_\eta^{[1+\eta_-]/2} \tag{19} \]

and, since $\exp_\eta(0) = 1$, there exists a $K_\eta < \infty$ such that

\[ \exp_\eta(z) \le K_\eta\bigl(1 + z^{2/(1-\eta_-)}\bigr) \quad \text{for all } z \ge 0. \tag{20} \]

If η− = −1 (as is the case in [14, 15, 17]) then the exponent is 1, and expη has linear
growth; otherwise it has sublinear or power law growth. The exponent itself grows
without limit as η− approaches 1 from below.
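
Since $\exp_\eta$ is defined only implicitly, as the inverse of $\log_\eta$, it has to be evaluated numerically in general. The sketch below is one possible way of doing this and is not taken from the paper: it inverts $\log_\eta$ by bisection in the variable $t = \log y$ (an implementation choice), and then compares a finite-difference derivative with the closed form (18). All function names are hypothetical, and the bracket expansion is only intended for moderate values of $|z|$.

```python
import math

def amari_log(alpha, y):
    if alpha == 1:
        return math.log(y)
    return 2.0 / (1.0 - alpha) * (y ** ((1.0 - alpha) / 2.0) - 1.0)

def log_eta(y, eta_minus, eta_plus):
    return amari_log(eta_minus, y) + amari_log(eta_plus, y)

def exp_eta(z, eta_minus, eta_plus):
    """Invert log_eta by bisection on t = log(y); log_eta is strictly increasing."""
    def g(t):
        return log_eta(math.exp(t), eta_minus, eta_plus) - z
    lo, hi = -1.0, 1.0
    while g(lo) > 0.0:   # expand the bracket downwards
        lo *= 2.0
    while g(hi) < 0.0:   # expand the bracket upwards
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return math.exp(0.5 * (lo + hi))

def exp_eta_prime(z, eta_minus, eta_plus):
    """First derivative of exp_eta via the first closed form in (18)."""
    y = exp_eta(z, eta_minus, eta_plus)
    delta = eta_plus - eta_minus
    return y ** ((1.0 + eta_plus) / 2.0) / (1.0 + y ** (delta / 2.0))

eta = (-0.5, 1.5)
h = 1e-5
for z in (-3.0, 0.0, 4.0):
    fd = (exp_eta(z + h, *eta) - exp_eta(z - h, *eta)) / (2.0 * h)
    print(z, exp_eta(z, *eta), fd, exp_eta_prime(z, *eta))
```

With $\eta_- = -1$ and $\eta_+ = 1$ the same routine exhibits the linear growth described above: $y - 1 + \log y = z$ with $z \ge 0$ forces $1 \le y \le 1 + z$.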
For any $\alpha \in \mathbb{R}$, the Amari embedding map $\xi_\alpha : \mathbb{R} \to \mathbb{R}$ is as follows:

\[ \xi_\alpha(z) = \ell_\alpha \circ \exp_\eta(z). \tag{21} \]

These maps can be used in the analysis of a large class of divergences, and their
associated tensors. The maps ξ−1 and ξ+1 are especially important since they will
represent the density of a measure and its log, respectively. The following lemma
establishes some of their properties in the context of the η-log.
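Lemma 2 (i) For any $\alpha \in \mathbb{R}$, $\xi_\alpha \in C^\infty(\mathbb{R};\mathbb{R})$; its derivatives are

\[ \xi_\alpha^{(i)} = \frac{f_{\alpha,i}(\exp_\eta)}{\bigl(1 + \exp_\eta^{\delta/2}\bigr)^{2i-1}} \quad \text{for } 1 \le i < \infty, \tag{22} \]

where $f_{\alpha,1}(y) = y^{(\eta_+ - \alpha)/2}$ and

\[ \begin{aligned}
f_{\alpha,i+1}(y) &= (y^{\delta/2} + y^\delta)\,y^{(\eta_- + 1)/2} f_{\alpha,i}^{(1)}(y) - (i - 1/2)\,\delta\, y^\delta y^{(\eta_- - 1)/2} f_{\alpha,i}(y) \\
&= (1 + y^{\delta/2})\,y^{(\eta_+ + 1)/2} f_{\alpha,i}^{(1)}(y) - (i - 1/2)\,\delta\, y^{(2\eta_+ - \eta_- - 1)/2} f_{\alpha,i}(y).
\end{aligned} \tag{23} \]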


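(ii) For any $1 \le i < \infty$, and any $\alpha \in \mathbb{R}$,

\[ \limsup_{z\to\infty} z^{-\beta_i}\,|\xi_\alpha^{(i)}(z)| < \infty, \quad \text{where } \beta_i := \frac{1-\alpha}{1-\eta_-} - i. \tag{24} \]

(iii) For any $1 \le i < \infty$, and any $\alpha \le \eta_+$,

\[ \limsup_{z\to-\infty} |\xi_\alpha^{(i)}(z)| < \infty. \tag{25} \]

(iv) If $\alpha \in [\eta_-, \eta_+]$, then $\xi_\alpha^{(i)}$ is bounded for all $1 \le i < \infty$.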
Proof Part (i) is straightforward.
The power of $y$ in $f_{\alpha,1}(y)$ is $(\eta_+ - \alpha)/2 = \delta/2 + (\eta_- - \alpha)/2$, and so $\xi_\alpha^{(1)}(z)$ grows
as $\exp_\eta(z)^{(\eta_- - \alpha)/2}$ for large $z$. It now follows from (20) that (24) is correct for $i = 1$.
That it is also correct for $i \ge 2$ follows from an induction argument based on the first
representation of $f_{\alpha,i+1}$ in (23).
If $\alpha \le \eta_+$ then the power of $y$ in $f_{\alpha,1}(y)$ is greater than or equal to 0, and so
(25) is correct for $i = 1$. That it is also correct for $i \ge 2$ follows from an induction
argument based on the second representation of $f_{\alpha,i+1}$ in (23). Part (iv) is an immediate
consequence of parts (ii) and (iii). □
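
The case $i = 1$ of Lemma 2 can be explored numerically. The following Python sketch is illustrative only (all names are hypothetical, and $\exp_\eta$ is evaluated by bisection as an implementation choice). It evaluates $\xi_\alpha^{(1)} = f_{\alpha,1}(\exp_\eta)/(1 + \exp_\eta^{\delta/2})$ from (22), and shows that it stays in $(0, 1]$ when $\alpha \in [\eta_-, \eta_+]$ but can be large for negative $z$ when $\alpha > \eta_+$.

```python
import math

def amari_log(alpha, y):
    if alpha == 1:
        return math.log(y)
    return 2.0 / (1.0 - alpha) * (y ** ((1.0 - alpha) / 2.0) - 1.0)

def exp_eta(z, eta_minus, eta_plus):
    """Inverse of log_eta = amari_log(eta_minus, .) + amari_log(eta_plus, .), by bisection."""
    def g(t):
        y = math.exp(t)
        return amari_log(eta_minus, y) + amari_log(eta_plus, y) - z
    lo, hi = -1.0, 1.0
    while g(lo) > 0.0:
        lo *= 2.0
    while g(hi) < 0.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return math.exp(0.5 * (lo + hi))

def xi(alpha, z, eta_minus, eta_plus):
    """Amari embedding map (21)."""
    return amari_log(alpha, exp_eta(z, eta_minus, eta_plus))

def xi_prime(alpha, z, eta_minus, eta_plus):
    """Case i = 1 of (22), with f_{alpha,1}(y) = y^{(eta_plus - alpha)/2}."""
    y = exp_eta(z, eta_minus, eta_plus)
    delta = eta_plus - eta_minus
    return y ** ((eta_plus - alpha) / 2.0) / (1.0 + y ** (delta / 2.0))

eta = (-2.0, 1.5)
for alpha in (-2.0, 0.0, 1.5, 3.0):
    values = [xi_prime(alpha, float(z), *eta) for z in range(-20, 21, 5)]
    print(alpha, min(values), max(values))
```

For $\alpha \in [\eta_-, \eta_+]$ the exponent of $y$ in $f_{\alpha,1}$ lies in $[0, \delta/2]$, so $f_{\alpha,1}(y) \le 1 + y^{\delta/2}$ and the printed values do not exceed 1; the run with $\alpha = 3 > \eta_+$ illustrates why part (iii) requires $\alpha \le \eta_+$.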

Let $\theta_0 := (1 - \eta_-)\lambda/2$. We assume that $\theta_0 > 1$; it then follows from (20) and (24)
that, for any $0 \le i < \lambda/\theta_0$ and any $a \in L^\lambda$,

\[ \exp_\eta^{(i)}(a) \in L^{\theta_0\lambda/(\lambda - i\theta_0)}. \tag{26} \]

(If $i \ge \lambda/\theta_0$ then $\exp_\eta^{(i)}$ is bounded.) We can now construct the manifold $M$ ($= M_\eta^{k,\lambda}$).
This is the set of finite measures on $\mathbb{R}^d$ satisfying the following:

(M1) $P$ is mutually absolutely continuous with respect to Lebesgue measure;
(M2) $\log_\eta p \in G$ ($= G^{k,\lambda} := W^{k,\lambda}$).
Here, $p$ denotes the density of $P$ with respect to the reference probability measure
$\mu$. Its density with respect to Lebesgue measure is $pr$, where $r$ is as in (1). The chart
$\phi : M \to G$ is defined by:

\[ \phi(P) = \log_\eta p. \tag{27} \]

Proposition 1 $\phi$ is a bijection onto $G$. Its inverse is

\[ \phi^{-1}(a) = P(dx) = \exp_\eta(a(x))\,\mu(dx). \tag{28} \]

Proof It follows from (M2) that, for any $P \in M$, $\phi(P) \in G$. Suppose, conversely,
that $a \in G$; since $\exp_\eta(a) \in L^1$, we can define the finite measure $P(dx) =
\exp_\eta(a(x))\,\mu(dx)$. Since $\exp_\eta$ is strictly positive, $P$ satisfies (M1). That it also
satisfies (M2) follows from the fact that $\log_\eta \exp_\eta(a) = a \in G$. We have thus shown that
$P \in M$, and clearly $\phi(P) = a$. □



Remark 1 This proposition shows that M is, in one sense, nothing more than a (whole)
Banach space. Manifold theory enters the picture with the introduction of base-point
dependent tensors such as the Fisher-Rao metric and Amari-Chentsov tensor on the
tangent bundle.

The tangent space at basepoint $P$, $T_P M$, is the linear space of signed measures, $U$,
that are absolutely continuous with respect to Lebesgue measure and take the form

\[ U(dx) = \exp_\eta^{(1)}(a(x))\,u(x)\,\mu(dx), \quad \text{for some } u \in G, \tag{29} \]

where $a = \phi(P)$. $U$ is well defined because of (26). The representation in (29) is
obtained from the tangent map of the chart; $u$ is then the natural representation of
the tangent vector, $U$, in the model space. The tangent bundle is the disjoint union
$TM := \cup_{P\in M}(P, T_P M)$, and is globally trivialised by the chart $\Phi : TM \to G \times G$,
where

\[ \Phi(P, U) = (a, u), \quad \text{and } a \text{ and } u \text{ are as in (29).} \tag{30} \]

We now investigate some of the smoothness properties of Amari’s embedding maps
(21) in the context of the manifold $M$. These, in turn, can be used to analyse the
smoothness properties of divergences and tensors.

Proposition 2 (i) For any $\alpha \in [\eta_-, \eta_+]$, any $0 \le l \le \lambda^\circ$ and any $a \in G$, $\xi_\alpha(a) \in
W^{k,\lambda;l}$. The superposition operator $\Xi_{\alpha,l} : G \to W^{k,\lambda;l}$, defined by $\Xi_{\alpha,l}(a)(x) =
\xi_\alpha(a(x))$, is of class $C^l$, with derivatives:

\[ \Xi_{\alpha,l}^{(i)}(a)(u_1,\dots,u_i)(x) = \xi_\alpha^{(i)}(a(x))\,u_1(x)\cdots u_i(x). \tag{31} \]

(ii) For any $1 \le \theta < \theta_0$ (as defined before (26)), the superposition operator
$\mathrm{Exp}_{\eta,\theta} : G \to L^\theta$, defined by $\mathrm{Exp}_{\eta,\theta}(a)(x) = \exp_\eta(a(x))$ is of class $C^{\lambda-1}$,
with derivatives:

\[ \mathrm{Exp}_{\eta,\theta}^{(i)}(u_1,\dots,u_i)(x) = \exp_\eta^{(i)}(a(x))\,u_1(x)\cdots u_i(x). \tag{32} \]

Proof Part (i) is a special case of Theorem 2(iii). Part (ii) can be proved in a similar way;
the essential differences are that the derivatives of $\exp_\eta$ are not necessarily bounded,
and the range space of the superposition operator has a weaker topology. Let $(a_n \in
G \setminus \{a\})$ be a sequence converging to $a$ in the sense of $G$. For any $1 \le i \le \lambda - 1$ let

\[ \begin{aligned}
\Delta_n &:= \exp_\eta^{(i-1)}(a_n) - \exp_\eta^{(i-1)}(a) - \exp_\eta^{(i)}(a)(a_n - a) \\
\Gamma_n &:= \exp_\eta^{(i)}(a_n) - \exp_\eta^{(i)}(a).
\end{aligned} \tag{33} \]

According to the mean-value theorem $\Delta_n = \delta_n(a_n - a)$, where

\[ \delta_n = \exp_\eta^{(i)}(\beta_n a_n + (1-\beta_n)a) - \exp_\eta^{(i)}(a) \quad \text{for some } 0 \le \beta_n(x) \le 1. \]


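It follows from (26) and Hölder’s inequality that, for any $u_1,\dots,u_i$ in the unit ball of
$G$,

\[ \|\Delta_n u_1\cdots u_{i-1}\|_{L^\theta} \le \|\Delta_n\|_{L^\gamma} \quad \text{and} \quad \|\Gamma_n u_1\cdots u_i\|_{L^\theta} \le \|\Gamma_n u_i\|_{L^\gamma}, \]

where $\gamma := \lambda\theta/(\lambda - i\theta)$. In order to prove part (ii), it thus suffices to show that

\[ \|a_n - a\|_G^{-1}\,\|\Delta_n\|_{L^\gamma} \to 0 \quad \text{and} \quad \sup_{\|u\|_G = 1} \|\Gamma_n u\|_{L^\gamma} \to 0. \tag{34} \]

According to (26) and the de la Vallée-Poussin theorem, $\exp_\eta^{(i)}(a_n)^\gamma$ is uniformly
integrable. Now $\delta_n$ and $\Gamma_n$ both converge to zero in measure, and so (34) follows from
the Lebesgue-Vitali theorem. □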


Remark 2 (i) There is a vast choice of range spaces for superposition operators of this
type, each of which results in operators with different properties. Proposition 2 is
not intended to be exhaustive, but to cover some of the more interesting and useful
cases.
(ii) The case l = 0 is worth special mention since the domain and range spaces of
the superposition operators are then both G. If, for example, η− ≤ −1 then the
density p (= ξ−1 (a) + 1) belongs to the model space and varies continuously on
the manifold, as does the log of the density.

The superposition operators $\Xi_{\alpha,l}$ can be used in the analysis of divergences and
entropies. This analysis was carried out for the α-divergences, $\alpha \in [-1, 1]$, in [15],
where it was shown that they are of class $C^l(M \times M)$, for values of $l$ dependent
on $\lambda$. Although we do not pursue these issues here, it is clear that similar methods
can be used with the manifolds of this paper for any $\alpha \in [\eta_-, \eta_+]$. We would also
expect the $(\kappa, r)$ divergences of [8] to exhibit an equivalent degree of smoothness.
Divergences can be used to define various tensor fields on $M$, which depend naturally
on the superposition operators, $\Xi_{\alpha,l}$. In particular, the Fisher-Rao metric on $M$ can be
expressed in terms of the maps $(\xi_\alpha, \alpha \in [\eta_-, \eta_+])$ in two different ways, according
to the value of $\eta_-$:

\[ \langle U, V\rangle_P = \begin{cases} E_\mu\, \xi_0^{(1)}(a)\,\xi_0^{(1)}(a)\,uv & \text{if } \eta_- \le 0 \\ E_\mu \exp_\eta(a)^{\eta_-}\, \xi_{\eta_-}^{(1)}(a)\,\xi_{\eta_-}^{(1)}(a)\,uv & \text{if } \eta_- > 0, \end{cases} \tag{35} \]

where $(a, u) = \Phi(P, U)$ and $(a, v) = \Phi(P, V)$.
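
The two expressions in (35) can be compared numerically. The sketch below replaces $\mathbb{R}^d$ by a finite sample space with weights `mu`, which is an assumption made only to keep the example small, and evaluates $E_\mu \exp_\eta(a)^\alpha\,\xi_\alpha^{(1)}(a)^2\,uv$ for $\alpha = 0$ and $\alpha = \eta_-$, using the closed form for $\xi_\alpha^{(1)}$ from (22). All names are hypothetical.

```python
import math

def amari_log(alpha, y):
    if alpha == 1:
        return math.log(y)
    return 2.0 / (1.0 - alpha) * (y ** ((1.0 - alpha) / 2.0) - 1.0)

def exp_eta(z, eta_minus, eta_plus):
    """Inverse of the eta-logarithm (17), computed by bisection in t = log(y)."""
    def g(t):
        y = math.exp(t)
        return amari_log(eta_minus, y) + amari_log(eta_plus, y) - z
    lo, hi = -1.0, 1.0
    while g(lo) > 0.0:
        lo *= 2.0
    while g(hi) < 0.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return math.exp(0.5 * (lo + hi))

def fisher_rao(a, u, v, mu, eta_minus, eta_plus, alpha):
    """E_mu[ exp_eta(a)^alpha * xi_alpha'(a)^2 * u * v ] on a finite sample space.

    alpha = 0 gives the first line of (35), alpha = eta_minus the second.
    """
    delta = eta_plus - eta_minus
    total = 0.0
    for ai, ui, vi, mi in zip(a, u, v, mu):
        y = exp_eta(ai, eta_minus, eta_plus)
        xi1 = y ** ((eta_plus - alpha) / 2.0) / (1.0 + y ** (delta / 2.0))
        total += mi * y ** alpha * xi1 ** 2 * ui * vi
    return total

mu = [0.2, 0.3, 0.5]          # weights of a three-point sample space (assumed)
a = [0.6, -0.4, 0.0]          # chart coordinates of P
u = [1.0, 1.0, -1.0]          # chart representations of two tangent vectors
v = [0.5, -2.0, 1.0]
print(fisher_rao(a, u, v, mu, 0.5, 1.0, alpha=0.0))
print(fisher_rao(a, u, v, mu, 0.5, 1.0, alpha=0.5))
```

As numbers the two expressions agree, because both integrands reduce to $(\exp_\eta^{(1)}(a))^2 uv/\exp_\eta(a)$. The case distinction in (35) therefore appears to concern smoothness rather than value: Lemma 2(iv) gives bounded derivatives for $\xi_\alpha$ only when $\alpha \in [\eta_-, \eta_+]$, and $\alpha = 0$ lies in this interval only if $\eta_- \le 0$.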

Corollary 1 The Fisher-Rao metric is of class $C^l$ on $M$, where

\[ l = \begin{cases} \lambda^\circ - 2 & \text{if } \eta_- \le 0 \\ \lambda^\circ - 2\eta_-/(1 - \eta_-) - 2 & \text{if } \eta_- > 0. \end{cases} \tag{36} \]

Proof The case $\eta_- \le 0$ follows from a repeated application of Lemma 1, starting with
$\psi = (\xi_0^{(1)})^2$. A similar technique can be applied if $\eta_- > 0$; the essential difference is
that, at each stage, we must use Hölder’s inequality and Proposition 2(ii) with $\theta = \eta_-$
to remove the term in $\exp_\eta$. □


The Fisher-Rao metric is positive definite and dominated by the chart-induced
norm on $T_P M$. However the norms are not equivalent, and so the metric is a weak
Riemannian metric. The Fisher-Rao metric and higher-order tensor fields, such as the
Amari-Chentsov tensor, become smoother with increasing values of $\lambda$. As Corollary 1
shows, log-Sobolev embedding plays a role in these results for certain integer values
of $\lambda$.
Of course, the use of Sobolev model spaces enables the analysis of quantities
that depend on the weak derivatives of probability densities, such as the Hyvärinen
divergence, and this is one of the motivations for extending the results of [14, 15].
The manifolds may also be useful in the theory of partial differential equations, as
discussed in the final section of [17]. These aspects will be pursued elsewhere.

4 The manifolds of probability measures

Let $M_0 \subset M$ be the subset of the manifold of Sect. 3, whose members are
probability measures, and let $L^\lambda_0$ (respectively $G_0$) be the co-dimension 1 subspaces of $L^\lambda$
(respectively $G$) whose members have zero $\mu$-mean. Let $\phi_0 : M_0 \to G_0$ be defined
by

\[ \phi_0(P) = \phi(P) - E_\mu\phi(P) = \log_\eta p - E_\mu\log_\eta p. \tag{37} \]

Proposition 3 (i) $\phi_0$ is a bijection onto $G_0$.
(ii) $(M_0, G_0)$ is a $C^{\lambda-1}$-embedded submanifold of $(M, G)$. The inclusion map $\rho :
G_0 \to G$ takes the form $\rho(a) = a + Z(a)$, where $Z : G_0 \to \mathbb{R}$ is an (implicitly
defined), additive normalisation constant.
(iii) $\rho$ and all its derivatives are bounded on bounded sets. The first (and if $\lambda > 2$,
second) derivatives of $\rho$ are as follows:

\[ \begin{aligned}
\rho_a^{(1)} u &= u - E_{P_a} u \\
\rho_a^{(2)}(u, v) &= -\frac{E_\mu \exp_\eta^{(2)}(\rho(a))\,(u - E_{P_a}u)(v - E_{P_a}v)}{E_\mu \exp_\eta^{(1)}(\rho(a))},
\end{aligned} \tag{38} \]

where $P_a(dx) := \exp_\eta^{(1)}(\rho(a)(x))\,\mu(dx)/E_\mu \exp_\eta^{(1)}(\rho(a))$, is the escort probability
[13].

Proof Let ϒ : G 0 × R → (0, ∞) be defined by

ϒ(a, z) = Eμ expη (a + z) = Eμ Expη (a + z), (39)

123
S184 N. J. Newton

where Expη is as defined in Proposition 2 with θ = 1. It follows from Proposition 2,


that ϒ is of class C λ−1 and that, for any u ∈ G 0 ,

(1,0)
ϒa,z u = Eμ exp(1) (0,1) (1)
η (a + z)u and ϒa,z = Eμ expη (a + z) > 0. (40)

Since expη is monotone increasing,

ϒ(a, z) ≥ Eμ 1[−1,∞) (a)expη (a + z) ≥ μ(a ≥ −1)expη (z − 1),

and so lim z↑∞ ϒ(a, z) = ∞. Furthermore, the monotone convergence theorem shows
that

lim ϒ(a, z) = Eμ lim ψ(a + z) = 0.


z↓−∞ z↓−∞

So ϒ(a, · ) is a bijection with strictly positive derivative, and the inverse function
theorem shows that it is a C λ−1 -isomorphism. The implicit mapping theorem shows
that Z : G 0 → R, defined by Z (a) = ϒ(a, · )−1 (1), is of class C λ−1 . For some
a ∈ G 0 , let P be the probability measure with density p = expη (a + Z (a)); then
φ0 (P) = a and P ∈ M0 , which proves part (i).
The argument above shows that the inclusion map, ρ, is of class C λ−1 . Let c :
G → G 0 be the (linear) superposition operator defined by c(a)(x) = a(x) − Eμ a;
(1)
then c is continuous, and has derivative ca u = u − Eμ u. Now c ◦ ρ is the identity
map of G 0 , which shows that ρ is homeomorphic onto its image, ρ(G 0 ), endowed
with the relative topology. Furthermore, for any u ∈ G 0 ,

(1)
u = (c ◦ ρ)a(1) u = cρ(a) ρa(1) u,

and so $\rho_a^{(1)}$ is a toplinear isomorphism, and its image, $\rho_a^{(1)} G_0$, is a closed linear subspace of $G$. Let $E_a$ be the one-dimensional subspace of $G$ defined by $E_a = \{y \exp_\eta^{(1)}(\rho(a)) : y \in \mathbb{R}\}$. If $u \in E_a$ and $v \in \rho_a^{(1)} G_0$ then there exist $y \in \mathbb{R}$ and $w \in G_0$ such that

$$E_\mu uv = y E_\mu \exp_\eta^{(1)}(\rho(a))(w - E_{P_a} w) = 0.$$

So $E_a \cap \rho_a^{(1)} G_0 = \{0\}$, and $\rho_a^{(1)}$ splits $G$ into the direct sum $E_a \oplus \rho_a^{(1)} G_0$. We have thus shown that $\rho$ is a $C^{\lambda-1}$-immersion, and this completes the proof of part (ii).
Jensen's inequality shows that there exists a $K_\eta < \infty$ such that

$$-\log E_\mu \exp_\eta^{(1)}(\rho(a)) \le -E_\mu \log \exp_\eta^{(1)}(\rho(a)) \le K_\eta E_\mu |\log p| \le K_\eta \|a\|.$$

So, for bounded $B$, $\inf_{P \in B} E_\mu \exp_\eta^{(1)}(\rho(a)) > 0$. □
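The first formula in (38) is not derived explicitly in the proof. A short sketch of how it can be obtained is given here, under the assumption (consistent with the density $p = \exp_\eta(a + Z(a))$ and with $c \circ \rho$ being the identity) that the inclusion map takes the form $\rho(a) = a + Z(a)$. It uses only the partial derivatives in (40) and the defining relation $\Upsilon(a, Z(a)) = 1$.

```latex
% Differentiate the constraint \Upsilon(a, Z(a)) = 1 in the direction u \in G_0,
% using the partial derivatives in (40) evaluated at z = Z(a), so that a + z = \rho(a):
\begin{align*}
0 &= \Upsilon^{(1,0)}_{a,Z(a)} u + \Upsilon^{(0,1)}_{a,Z(a)}\, Z^{(1)}_a u
   = E_\mu \exp_\eta^{(1)}(\rho(a))\, u
     + E_\mu \exp_\eta^{(1)}(\rho(a))\; Z^{(1)}_a u, \\
Z^{(1)}_a u &= -\frac{E_\mu \exp_\eta^{(1)}(\rho(a))\, u}{E_\mu \exp_\eta^{(1)}(\rho(a))}
   = -E_{P_a} u, \\
\rho^{(1)}_a u &= u + Z^{(1)}_a u = u - E_{P_a} u .
\end{align*}
```

Differentiating the first line once more, in a direction $v \in G_0$, produces the second formula in (38) by the same argument.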



For any $P \in M_0$, the tangent space $T_P M_0$ is a subspace of $T_P M$ of co-dimension 1; in fact, as shown in the proof of Proposition 3(ii),

$$T_P M = T_P M_0 \oplus \{y\hat{U},\ y \in \mathbb{R}\}, \quad\text{where } \hat{U}\phi = \psi^{(1)}(\phi(P)). \tag{41}$$

Let $\Phi_0 : TM_0 \to G_0 \times G_0$ be defined as follows:

$$\Phi_0(P, U) = \Phi(P, U) - E_\mu \Phi(P, U). \tag{42}$$

Then $\Phi \circ \Phi_0^{-1}(a, u) = (\rho(a), \rho_a^{(1)} u)$. For any $(P, U) \in TM_0$, $U\phi = \rho_a^{(1)} u = u - E_{P_a} u$, and so tangent vectors in $T_P M_0$ are distinguished from those merely in $T_P M$ by the fact that their total mass is zero.
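The identity $\rho_a^{(1)} u = u - E_{P_a} u$ and the zero-mass property can be checked numerically on a small example. The sketch below is illustrative only: it assumes a hypothetical three-point sample space, takes $\rho(a) = a + Z(a)$, and substitutes the softplus function for $\exp_\eta$ (an assumption; the paper's deformed exponential is not reproduced here). It compares the closed form with a central finite difference of $\rho$, and evaluates the total mass $E_\mu \psi^{(1)}(\rho(a))\,\rho_a^{(1)} u$ of the corresponding perturbation of the density.

```python
import math

def psi(z):
    """Stand-in for exp_eta (ASSUMPTION): softplus."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def dpsi(z):
    """Derivative of softplus: the logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def rho(a, mu):
    """rho(a) = a + Z(a), with Z(a) found by bisection so that E_mu psi(rho(a)) = 1."""
    lo, hi = -50.0, 50.0
    while hi - lo > 1e-13:
        mid = 0.5 * (lo + hi)
        if sum(m * psi(x + mid) for x, m in zip(a, mu)) < 1.0:
            lo = mid
        else:
            hi = mid
    z = 0.5 * (lo + hi)
    return [x + z for x in a]

def escort_mean(a, mu, u):
    """E_{P_a} u, where P_a has weights proportional to mu * psi'(rho(a))."""
    w = [m * dpsi(r) for m, r in zip(mu, rho(a, mu))]
    return sum(wi * ui for wi, ui in zip(w, u)) / sum(w)

mu = [0.2, 0.3, 0.5]    # hypothetical reference probability weights
a = [1.0, 0.5, -0.7]    # E_mu a = 0
u = [0.5, -1.0, 0.4]    # E_mu u = 0.1 - 0.3 + 0.2 = 0

# Closed form from (38) versus a central finite difference of rho.
closed = [ui - escort_mean(a, mu, u) for ui in u]
h = 1e-5
plus = rho([x + h * ui for x, ui in zip(a, u)], mu)
minus = rho([x - h * ui for x, ui in zip(a, u)], mu)
numeric = [(p - q) / (2 * h) for p, q in zip(plus, minus)]

# Total mass of the density perturbation psi'(rho(a)) * rho'(a)u under mu.
mass = sum(m * dpsi(r) * c for m, r, c in zip(mu, rho(a, mu), closed))
print(closed, numeric, mass)
```

The two derivative vectors should agree up to the finite-difference and bisection tolerances, and `mass` should vanish up to rounding error.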
Any regularity possessed by divergences, entropies and tensors on M involving
fewer than λ derivatives is also enjoyed by their restrictions to M0 .

5 Concluding remarks

This paper has developed a family of non-parametric statistical manifolds that use the
two-parameter deformed logarithm of (17), and a variety of model spaces of Sobolev
type. It has shown that the mixed-norm space $W^{k,\lambda}$ is especially suited to this
application. The Amari embedding maps, $\xi_\alpha$, which are central to the analysis of
divergences, entropies and associated tensors, “lift” to continuous nonlinear
superposition operators acting on the Sobolev model spaces (a rare property in the
theory of such operators). Variants of the superposition operators having Sobolev range
spaces with weaker topologies enjoy greater regularity; they were shown to admit
multiple derivatives on the manifolds, according to the values of the parameters $k$,
$\lambda$ and $\eta$. Of course, this paper takes only the first step in a fuller analysis of
the information geometry of the manifolds constructed. However, for reasons of space,
we shall go no further here.
Data Availability Data sharing is not applicable to this article as no datasets were generated or analysed
during the current study.

Declarations

Conflict of interest The author states that there is no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which
permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give
appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence,
and indicate if changes were made. The images or other third party material in this article are included
in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons licence and your intended use is not permitted
by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the
copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


References
1. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191. American Mathematical Society, Providence (2000)
2. Ay, N., Jost, J., Van Lê, H., Schwachhöfer, L.: Information Geometry. Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics, vol. 64. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-56478-4
3. Barndorff-Nielsen, O.E.: Information and Exponential Families in Statistical Theory. Wiley (1978)
4. Cena, A., Pistone, G.: Exponential statistical manifold. Ann. Inst. Stat. Math. 59, 27–56 (2007)
5. Chentsov, N.N.: Statistical Decision Rules and Optimal Inference. Translations of Mathematical Monographs, vol. 53. American Mathematical Society, Providence (1982)
6. Cianchi, A., Pick, L., Slavíková, L.: Higher-order Sobolev embeddings and isoperimetric inequalities. Adv. Math. 273, 568–650 (2015)
7. Gibilisco, P., Pistone, G.: Connections on non-parametric statistical manifolds by Orlicz space geometry. Infin.-Dimens. Anal. Quantum Probab. Relat. Top. 1, 325–347 (1998)
8. Kaniadakis, G., Lissia, M., Scarfone, A.M.: Two-parameter deformations of logarithm, exponential, and entropy: a consistent framework for generalized statistical mechanics. Phys. Rev. E 71, 046128 (2005)
9. Lauritzen, S.L.: Statistical Manifolds. IMS Lecture Notes Series, vol. 10. Institute of Mathematical Statistics (1987)
10. Lods, B., Pistone, G.: Information geometry formalism for the spatially homogeneous Boltzmann equation. Entropy 17, 4323–4363 (2015)
11. Loaiza, G., Quiceno, H.R.: A q-exponential statistical Banach manifold. J. Math. Anal. Appl. 398, 466–476 (2013)
12. Murray, M.K., Rice, J.W.: Differential Geometry and Statistics. Monographs in Statistics and Applied Probability, vol. 48. Chapman Hall (1993)
13. Naudts, J.: Generalised Thermostatistics. Springer (2011)
14. Newton, N.J.: An infinite-dimensional statistical manifold modelled on Hilbert space. J. Funct. Anal. 263, 1661–1681 (2012)
15. Newton, N.J.: Infinite-dimensional statistical manifolds based on a balanced chart. Bernoulli 22, 711–731 (2016). https://doi.org/10.3150/14-BEJ673
16. Newton, N.J.: Nonlinear filtering and information geometry: a Hilbert manifold approach. In: Ay, N., Gibilisco, P., Matús̆, F. (eds.) Information Geometry and its Applications. Proceedings in Mathematics and Statistics, vol. 252, pp. 189–208. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-97798-0_7
17. Newton, N.J.: A class of non-parametric statistical manifolds modelled on Sobolev space. Information Geometry 2, 283–312 (2019). https://doi.org/10.1007/s41884-019-00024-z
18. Pistone, G., Rogantin, M.P.: The exponential statistical manifold: mean parameters, orthogonality and space transformations. Bernoulli 5, 721–760 (1999)
19. Pistone, G., Sempi, C.: An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 23, 1543–1561 (1995)
20. Tsallis, C.: The nonadditive entropy $S_q$ and its applications in Physics and elsewhere. Entropy 13, 1765–1804 (2011). https://doi.org/10.3390/e13101765
21. Vigelis, R.F., Cavalcante, C.C.: On ϕ-families of probability distributions. J. Theor. Probab. 26, 870–884 (2013)
22. Vieira, F.L.J., de Andrade, L.H.F., Vigelis, R.F., Cavalcante, C.C.: A Deformed Exponential Statistical Manifold. Entropy 21 (2019). https://doi.org/10.3390/e21050496

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
