1. Introduction
The fathers of what we now call information theory were quite clear: information is to be defined in a manner that is objective. Information is provided by an answer to a question. The value of the answer for a particular person is not part of the theory. To represent the uncertainty before the answer is given, one considers a random variable $X$. Take it that there are $n$ different possible answers, $n \geq 2$, and that the random variable $X$ can assume $n$ different values, each one being a possible answer with a probability $p_i$. The uncertainty about the answer (the entropy, $H(X)$, of the random variable $X$) is reduced when additional information is provided. This reduction is the meaning of the term value as used in this paper. Our quantitative examination of the value of information is most influenced by the erudite discussion of this topic by Dunn and Golan [1]. There is, of course, the general result, introduced by Shannon [2], of the reduction in the uncertainty about the random variable $X$ given another random variable $Y$. This concept is central to the celebrated channel capacity theorem of Shannon. In that context, the noisy channel that determines the conditional distribution of the output $Y$ given an input $X$ is fixed. It is the probability distribution of the input that one can vary. This central development of Shannon is extensively discussed in texts on information theory such as Ash [3] or Cover and Thomas [4] and many others. The general expression of Shannon for the information provided by $Y$ about $X$,

$$I(X;Y) = H(X) - H(X|Y), \qquad (1)$$

is not limited to information transmission by a channel. $H(X)$ is the uncertainty (entropy) about the random variable $X$, while $H(X|Y)$ is the remaining uncertainty about $X$ when $Y$ is given.
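As a small numerical illustration of Equation (1), the following sketch computes $H(X)$, $H(X|Y)$, and their difference for a joint distribution given as a table. The function names and the two example tables are hypothetical and chosen only for illustration; entropies are in nats.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def information_about_x(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint table joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    # H(X|Y) is the average over y of the entropy of the conditional P(X|Y=y).
    h_x_given_y = sum(
        p_y[j] * entropy([joint[i][j] / p_y[j] for i in range(len(joint))])
        for j in range(len(p_y)) if p_y[j] > 0.0
    )
    return entropy(p_x) - h_x_given_y

# Hypothetical joint distributions over two answers and two observations.
noisy = [[0.4, 0.1],
         [0.1, 0.4]]
independent = [[0.25, 0.25],
               [0.25, 0.25]]

info_noisy = information_about_x(noisy)
info_independent = information_about_x(independent)
print(info_noisy, info_independent)
```

In the second table $Y$ is independent of $X$, so the remaining uncertainty equals the initial one and the information provided vanishes.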
In this note, I assume that the additional information on $X$ is provided by one or more additional expectation values over the distribution of $X$. Say that we are given one additional expectation value $\langle F_l \rangle$ of an observable that takes the value $F_{li}$ when the answer is $i$,

$$\langle F_l \rangle = \sum_{i=1}^{n} F_{li}\, p_i. \qquad (2)$$

Before we are given this additional information, the random variable $X$ is characterized by $m$ expectation values, $\langle F_j \rangle = \sum_i F_{ji}\, p_i$, $j = 1, \ldots, m$, that we denote as $\langle F_1 \rangle$ to $\langle F_m \rangle$. We take it that the different vectors of $n$ components, $F_j = (F_{j1}, \ldots, F_{jn})$, are linearly independent. Otherwise, some of them are redundant and do not provide additional information about the distribution of $X$. By similar reasoning, the vector $F_l$, whose expectation value, Equation (2), is used to specify the additional information, needs to be linearly independent of the $m$ vectors $F_1$ to $F_m$. Of course, we need to know that the $m$ expectation values $\langle F_1 \rangle$ to $\langle F_m \rangle$ are compatible, meaning that there is at least one distribution for which these are possible expectation values. The same is required of the additional information, meaning the expectation value $\langle F_l \rangle$. Our program, as implemented in the Results section, is to determine the value of the information on $X$ provided by the additional information. I will discuss two views of the result, a differential and an integral value. For either, I take it that the prior information about $X$, namely, the values of the $m$ expectations $\langle F_1 \rangle$ to $\langle F_m \rangle$, is kept constant. The differential form is a partial derivative, an infinitesimal change in the entropy of $X$ upon an infinitesimal change in the value of the new information, the value of $\langle F_l \rangle$, when all the other $m$ expectation values are unchanged. The integral form is a change in the entropy due to a finite change in the value of $\langle F_l \rangle$, again at constant value of the prior information. The distinction between a differential and an integral change is of course familiar in many other contexts of the exact sciences, perhaps most notably in quantum chemistry, where the expression for the differential change in the energy is known as the Hellmann–Feynman theorem [5]. One can also apply the theorem for changes in the dynamics, e.g., [6].
The presentation is organized as follows.
Section 2, Methods, provides essential results of the maximum entropy formalism [7,8,9,10,11]. We use the $m$ expectation values $\langle F_1 \rangle$ to $\langle F_m \rangle$ as constraints on the entropy of the distribution of $X$. Among all distributions over the $n$ different values of $X$ that are consistent with the $m$ expectation values $\langle F_1 \rangle$ to $\langle F_m \rangle$, $m < n$, we determine the (unique) distribution whose entropy is maximal. We take it that there is a feasible solution, meaning that the values of the $m$ expectation values are such that there is at least one distribution with all probabilities nonzero that reproduces these values. The expectation value $\langle F_l \rangle$ is not imposed as a constraint on the distribution. Technically, the constraints are imposed by the Lagrange method of undetermined multipliers. The numerical values of the $m$ thus far undetermined Lagrange multipliers $\lambda_1$ to $\lambda_m$ are determined at the last stage by the condition that the distribution reproduces the $m$ expectation values. For the observable that takes the value $F_{li}$ on the $i$'th outcome, whose probability is $p_i$, the expectation value as determined by the maximal entropy procedure subject to the $m$ constraints is $\sum_i F_{li}\, p_i$. Our purpose, as already introduced above, and as discussed in technical detail in Section 3, Results, is to determine the amount of information provided when the expectation value $\langle F_l \rangle$ is changed from this maximal entropy prediction.
Section 4, The Value of Information as a Potential, provides motivation for the possible implications of the expression of the value of information, specifically for when the new information changes with time, a case of particular relevance for systems not in equilibrium. Also examined is a more formal issue when the additional information depends on a parameter.
2. Methods
To have a compact derivation, we take the variable $X$ to be such that its distribution is uniform when entropy is maximal and no additional information is available. Shannon and then many others have shown that if we take as an axiom that the entropy is maximal when the distribution is uniform, then with a few additional axioms, the entropy of $X$ is $H(X) = -\sum_i p_i \ln p_i$. The use of a natural logarithm just determines the units of entropy, nats in this case. In the physical sciences, there is usually some prior information, e.g., the conservation of energy, and so the distribution at maximal entropy is not uniform. The needed modification is well understood, and so we proceed with the most elementary case as above. As was emphasized early on, e.g., by Tolman [12], this elementary case applies when the index $i$ enumerates individual quantum states.
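The distribution when the entropy is at a constrained maximum, i.e., at a maximum where the search for a maximum is subject to the given $m$ values $\langle F_1 \rangle$ to $\langle F_m \rangle$, $m < n$, is [7,8,9,10,13]

$$p_i = \exp\Big(-\sum_{j=1}^{m} \lambda_j F_{ji}\Big), \qquad i = 1, \ldots, n. \qquad (3)$$

The numerical values of the $m$ Lagrange multipliers $\lambda_1$ to $\lambda_m$ are determined by the $m$ implicit equations

$$\langle F_j \rangle = \sum_{i=1}^{n} F_{ji} \exp\Big(-\sum_{k=1}^{m} \lambda_k F_{ki}\Big), \qquad j = 1, \ldots, m. \qquad (4)$$

The distribution needs to be inherently normalized, meaning that $\sum_i p_i = 1$. So, either one of the observables is the identity, i.e., $F_{ji} = 1$ for all $i$, or some linear combination of observables is the identity. Either way, the normalization is enforced so that $\sum_i \delta p_i = 0$. Since the values of the Lagrange multipliers are determined by the values of the $m$ observables, it follows from $\sum_i \delta p_i = 0$ and Equation (3), which gives $\delta p_i = -p_i \sum_j F_{ji}\, \delta\lambda_j$, that

$$\sum_{j=1}^{m} \langle F_j \rangle\, \delta\lambda_j = 0. \qquad (5)$$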
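The implicit equations for the multipliers generally have to be solved numerically. The following is a minimal sketch for the simplest case of normalization plus one constraint, where the distribution is $p_i = \exp(-\lambda_0 - \lambda F_i)$ with $\lambda_0 = \ln Z$. Bisection is used here only because the mean is a monotonic function of the multiplier; it is an illustrative choice, not a method taken from the text, and the observable values, the target mean, and the search bracket are hypothetical.

```python
import math

def maxent_distribution(f_values, lam):
    """p_i = exp(-lam0 - lam * F_i), with lam0 = ln Z enforcing normalization."""
    weights = [math.exp(-lam * f) for f in f_values]
    z = sum(weights)
    return [w / z for w in weights], math.log(z)

def mean(f_values, probs):
    return sum(f * p for f, p in zip(f_values, probs))

def solve_multiplier(f_values, target, lo=-50.0, hi=50.0, tol=1e-12):
    """Bisection on lam: <F>(lam) decreases monotonically as lam increases.

    The bracket [lo, hi] is an assumption that suits targets not too close
    to the extreme values of F.
    """
    if not min(f_values) < target < max(f_values):
        raise ValueError("target must lie strictly inside the range of F")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        probs, _ = maxent_distribution(f_values, mid)
        if mean(f_values, probs) > target:
            lo = mid   # mean too large: a larger multiplier is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)

f_values = [0.0, 1.0, 2.0, 3.0]   # hypothetical values F_i
target = 1.2                      # hypothetical given expectation value
lam = solve_multiplier(f_values, target)
probs, lam0 = maxent_distribution(f_values, lam)
print(lam, lam0, probs)
```

The feasibility check at the top of the solver mirrors the requirement that the given expectation value be compatible with some distribution having all probabilities nonzero.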
The entropy of the distribution of $X$ is

$$H(X) = -\sum_{i=1}^{n} p_i \ln p_i = \sum_{j=1}^{m} \lambda_j \langle F_j \rangle. \qquad (6)$$

Taking the partial derivative of $H(X)$ with respect to $\langle F_j \rangle$ and using the implication of the normalization, Equation (5), we have the basic identity for the value of information

$$\lambda_j = \frac{\partial H(X)}{\partial \langle F_j \rangle}. \qquad (7)$$

The value $\lambda_j$ is defined as a partial derivative when all the $m-1$ observables that are not $F_j$ have their values kept constant. The value carries a dimension, such that the product $\lambda_j \langle F_j \rangle$ has the dimension of information. It follows from Equation (7) that the entropy of $X$ is a homogeneous first-order function of the $m$ observables that are used to constrain the distribution at its maximal possible entropy

$$H(X) = \sum_{j=1}^{m} \langle F_j \rangle \frac{\partial H(X)}{\partial \langle F_j \rangle}. \qquad (8)$$

Equation (8) generalizes a known result in thermodynamics [10,14,15,16]. The more general result, Equation (6), is that the entropy is the weighted sum of the values of the observables that define the state. It is a macroscale analog of the basic microscale definition of the entropy as the weighted sum of the surprisals, $-\ln p_i$, of the different possible outcomes. We call the value, $\lambda_j$, a potential because, as was just shown, it is the latent ability of the observable conjugate to $\lambda_j$ to change the entropy.
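The identification of the value with a partial derivative of the entropy can be checked numerically. The sketch below again assumes normalization plus a single constraint, $p_i = \exp(-\lambda_0 - \lambda F_i)$, so that moving along the one-parameter family in $\lambda$ automatically keeps the normalization fixed. The observable values, the chosen multiplier, and the finite-difference step are hypothetical.

```python
import math

F = [0.0, 1.0, 2.0, 3.0]   # hypothetical observable values F_i

def state(lam):
    """Return (<F>, H, lam0) for p_i = exp(-lam0 - lam * F_i)."""
    weights = [math.exp(-lam * f) for f in F]
    z = sum(weights)
    probs = [w / z for w in weights]
    mean_f = sum(f * p for f, p in zip(F, probs))
    h = -sum(p * math.log(p) for p in probs)
    return mean_f, h, math.log(z)

lam = 0.7
mean_f, h, lam0 = state(lam)

# Entropy as a weighted sum of the constraint values: the identity observable
# has expectation 1 and multiplier lam0, the observable F has multiplier lam.
h_from_multipliers = lam0 * 1.0 + lam * mean_f

# dH/d<F> by a central difference taken along the maximal entropy family.
step = 1e-5
mean_plus, h_plus, _ = state(lam + step)
mean_minus, h_minus, _ = state(lam - step)
value = (h_plus - h_minus) / (mean_plus - mean_minus)
print(h, h_from_multipliers, value)
```

The finite-difference slope agrees with the multiplier itself, which is the sense in which the multiplier is the value of a change in the corresponding expectation.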
3. Results
So far, we have examined the value of information when the value of one of the observables that defines the state is changed. In Equation (7), it is the change in the value of observable $j$. Next, we consider the value of the information provided when the expectation value of an observable $F_l$ is changed, where that observable was not used previously to characterize the state. As already noted, such an observable was not used because the distribution we assigned at the given values of the $m$ other observables already correctly reproduces its current value. This is a rather common situation. A well-known example is the Boltzmann distribution at thermal equilibrium. It is normalized and subject to the given (expectation) value of the energy of the different quantum states $i$. Given the mean, the distribution of quantum states correctly predicts the variance of the energy, which determines the specific heat. Indeed, that was a very early success of the then new quantum theory.
Given the observable $F_l$, the value of making an infinitesimal change in its expectation is zero. This is because the distribution of $X$ is at maximal entropy subject to the given expectations of the $m$ constraints. These values are to be held constant when we make an infinitesimal change in $\langle F_l \rangle$. But linear variations about a stationary point of a function do not change it. A finite change, from the value that the distribution of Equation (3) predicts to a new value $\langle F_l \rangle$, does lead to a new distribution of maximal entropy, and Equation (3) is replaced by

$$p'_i = \exp\Big(-\sum_{k=1}^{m} \lambda'_k F_{ki} - \lambda'_l F_{li}\Big). \qquad (9)$$

We use primes to denote a distribution of maximal entropy subject to all the previous $m$ constraints, $\langle F_1 \rangle$ to $\langle F_m \rangle$, and to the new constraint $\langle F_l \rangle$, where in Equation (9) $k$ runs from 1 to $m$, excluding $l$. The values of the $m$ expectation values remain unchanged when we go from the distribution $p_i$, Equation (3), to the distribution $p'_i$, Equation (9), but the $m$ Lagrange multipliers can change, and that is why they carry a prime, $\lambda'_k$.
To characterize the value of the new information, we use the Shannon fundamental result, Equation (1), plus the inequality $\sum_i p'_i \ln(p'_i/p_i) \geq 0$, where equality holds iff $p'_i = p_i$ for all $i$. Using the explicit Expressions (3) and (9) and noting that, by construction, the $m$ observables have the same expectation values for $p_i$ and $p'_i$, we have

$$H(X) - H(X|Y) = \sum_{i=1}^{n} p'_i \ln\frac{p'_i}{p_i} \geq 0.$$

$H(X)$ is the entropy of the random variable $X$ before the new information is provided, while $H(X|Y)$ is the entropy of $X$ after the additional information $Y$, which is here the value of $\langle F_l \rangle$, is provided. As is to be expected on general grounds, the value of the new information is positive unless $p'_i = p_i$ for all $i$, meaning that the new information is not really informative as the distribution of $X$ is unchanged.
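The finite value of new information can be illustrated with a small numerical sketch. A prior state is taken to be of maximal entropy subject to normalization and one observable $E$; a second, linearly independent observable $F$ is then given a nonzero multiplier while the multiplier of $E$ is re-solved so that the expectation of $E$ is unchanged. For simplicity the new state is parametrized by the multiplier of $F$ rather than by a target expectation value. All numerical values, names, and the bisection solver are hypothetical choices made for illustration.

```python
import math

E = [0.0, 1.0, 2.0, 3.0]   # prior observable (hypothetical values)
F = [1.0, 0.0, 0.0, 1.0]   # new observable, linearly independent of E and identity

def distribution(beta, mu):
    """p_i = exp(-beta*E_i - mu*F_i) / Z; ln Z is the normalization multiplier."""
    w = [math.exp(-beta * e - mu * f) for e, f in zip(E, F)]
    z = sum(w)
    return [x / z for x in w]

def mean(obs, p):
    return sum(o * q for o, q in zip(obs, p))

def entropy(p):
    return -sum(q * math.log(q) for q in p)

def beta_keeping_mean_e(target, mu, lo=-50.0, hi=50.0, tol=1e-13):
    """Bisection: at fixed mu, <E> decreases monotonically with beta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(E, distribution(mid, mu)) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Prior state: maximal entropy subject to normalization and <E> only.
p = distribution(0.5, 0.0)
mean_e = mean(E, p)

# New state: nonzero multiplier for F, with beta re-solved so <E> is unchanged.
mu = 0.8
p_new = distribution(beta_keeping_mean_e(mean_e, mu), mu)

drop = entropy(p) - entropy(p_new)
kl = sum(q * math.log(q / r) for q, r in zip(p_new, p))
print(mean(F, p), mean(F, p_new), drop, kl)
```

The drop in entropy coincides with the relative entropy of the new distribution with respect to the prior one, and it is positive because the two distributions differ.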
Explicitly, the finite value of the new information is, using the expressions (3) and (9),

$$H(X) - H(X|Y) = \sum_{k=1}^{m} (\lambda_k - \lambda'_k) \langle F_k \rangle - \lambda'_l \langle F_l \rangle.$$

The differential change in the Lagrange multipliers due to the change in the value of $\langle F_l \rangle$ is provided by Equation (10). It is a vector with $m$ components indexed by $j$. It is a vector orthogonal to the constraints, as shown in Equation (5).
Any information that is provided adiabatically has no value. In general, this follows from the Shannon definition, Equation (1), of the information provided by $Y$ about $X$: $I(X;Y) = 0$ when the distribution of $X$ is unchanged when $Y$ is given, $H(X|Y) = H(X)$. On the microscale, that is, in terms of the individual outcomes, this is $p(x_i|y_j) = p(x_i)$ for all $i$ and $j$. This shows that our use of the term adiabatic follows the conventional use in thermodynamics and mechanics: the elementary probabilities of the different outcomes do not change upon a change in the macroscale. But how can that be? We take a clue from a differential form of the first law of thermodynamics to write, for an infinitesimal change in an observable,

$$\delta \langle F \rangle = \sum_i p_i\, \delta F_i + \sum_i F_i\, \delta p_i = \langle \delta F \rangle + \sum_i F_i\, \delta p_i.$$

It follows that if the addition of information is such that $\delta \langle F \rangle = \langle \delta F \rangle$, so that the whole change is carried by the values $F_i$, then the probabilities of the elementary events are unchanged and there is no value to the information provided. In the case of the first law of thermodynamics, $F$ is the energy and the $F_i$ are the energies of the individual states. Then, $\langle \delta F \rangle$ is the work performed on or by the system, while $\sum_i F_i\, \delta p_i$ is the heat transfer, which is zero in an adiabatic change. Performing pure work on or by a system does not change the value of the information of its state. The transfer of heat does, as was early and clearly noted by Clausius [17].
4. The Value of Information as a Potential
The value of information, the Lagrange multiplier that is conjugate to the observable that is changed, is a potential. We here discuss it as a potential, see also [13,18], and then specifically discuss how the value changes with time in the special but important case of Hamiltonian dynamics. Given that the state of the system is of maximal entropy, the set of $m$ values of the constraints and the set of $m$ conjugate Lagrange multipliers are each an equally correct and useful characterization of the state. As discussed by Callen [14] (the first edition is much better in this particular respect), there is a Legendre transform relating the use of the two sets of variables. As clearly discussed by Callen, one can also usefully introduce intermediate characterizations, using some observables, with the other variables being the rest of the Lagrange multipliers. The practical advantages of using certain thermodynamic Lagrange multipliers such as temperature or pressure are well recognized. In chemical problems, there are the chemical potentials of the different species. In general, it is the intensive character of the Lagrange multipliers that often makes them more convenient. We use intensive in its canonical thermodynamic sense: intensive means independent of the actual amount, as opposed to the extensive character of the mean values of the observables, which double when we double the number of systems.
The change in the value in a process is often a useful measure. Already noted is that the value is unchanged in an adiabatic process. How does the value change in time? The problem in giving a definitive answer is that we only have mechanics as an agreed-upon dynamical theory of change, and in mechanics, both classical and quantum mechanical, the processes are reversible. There is no dissipation. Realistically, we all recognize that there is dissipation. Still, what can one say about the rate of change in value in mechanics? I will use quantum mechanics and, as a practical point, I assume a Hilbert space of finite dimension $n$. So, operators, and in particular the density operator, are represented by $n$ by $n$ matrices. It will be useful below to use the $n^2$ operators $E_{ij}$, each defined as a matrix where all elements are zero except the element in position $i,j$, which is unity. Any $n$ by $n$ matrix can be expressed as a linear combination of these matrices. As a side comment, these operators close a Lie algebra, $[E_{ij}, E_{kl}] = \delta_{jk} E_{il} - \delta_{li} E_{kj}$, where the square bracket is the commutator and the delta symbol is the usual Kronecker delta.
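The closure of the elementary matrices under commutation can be checked by brute force. In the sketch below, the label $E_{ij}$ for the matrix with unity in position $i,j$ and zeros elsewhere, and the explicit form $[E_{ij}, E_{kl}] = \delta_{jk} E_{il} - \delta_{li} E_{kj}$ of the commutation relation, are stated as assumptions of the example; the relation itself follows from the product rule $E_{ij} E_{kl} = \delta_{jk} E_{il}$. The dimension is a hypothetical small value.

```python
def unit(n, i, j):
    """n-by-n matrix with 1 at (i, j) and 0 elsewhere (0-based indices)."""
    return [[1 if (r, c) == (i, j) else 0 for c in range(n)] for r in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def commutator(a, b):
    ab, ba = matmul(a, b), matmul(b, a)
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(ab, ba)]

def expected(n, i, j, k, l):
    """delta_jk * E_il - delta_li * E_kj."""
    d_jk, d_li = int(j == k), int(l == i)
    e_il, e_kj = unit(n, i, l), unit(n, k, j)
    return [[d_jk * x - d_li * y for x, y in zip(r1, r2)]
            for r1, r2 in zip(e_il, e_kj)]

n = 3
closed = all(
    commutator(unit(n, i, j), unit(n, k, l)) == expected(n, i, j, k, l)
    for i in range(n) for j in range(n) for k in range(n) for l in range(n)
)
print(closed)
```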
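The most general form of an initial state density matrix is, in terms of the $n^2$ matrices $E_{ij}$ introduced above,

$$\rho(0) = \exp\Big(-\sum_{i,j=1}^{n} \lambda_{ij} E_{ij}\Big),$$

where the multipliers $\lambda_{ii}$ are real, while, since $E_{ij}^{\dagger} = E_{ji}$ and the density needs to be Hermitian, the off-diagonal coefficients must satisfy $\lambda_{ij}^{*} = \lambda_{ji}$. The observables $F_j$ of Equation (3) and following can be expressed as linear combinations of the $E_{ij}$.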
In quantum dynamics, the density matrix of an isolated system evolves in time under the action of a unitary evolution operator $U(t)$,

$$\rho(t) = U(t)\, \rho(0)\, U^{\dagger}(t), \qquad (14)$$

with the initial value $U(0) = I$. Since $U$ is unitary and using a superscript dagger to denote a Hermitian conjugate, we can write

$$\rho(t) = \exp\Big(-\sum_{i,j=1}^{n} \lambda_{ij}\, U(t) E_{ij} U^{\dagger}(t)\Big). \qquad (15)$$

An initial density matrix of maximal entropy that is propagated in time remains a density matrix of maximal entropy with a different, time-dependent set of constraints [19]. These constraints can be shown to be time-dependent constants of the motion (see, e.g., [20]).
In a finite, $n$-dimensional Hilbert space, the time-dependent operators $U(t) E_{ij} U^{\dagger}(t)$ are $n$ by $n$ matrices. So, they can all be written as linear combinations of the time-independent matrices $E_{kl}$ with time-dependent coefficients,

$$U(t) E_{ij} U^{\dagger}(t) = \sum_{k,l=1}^{n} U_{ki}(t)\, U^{*}_{lj}(t)\, E_{kl}.$$

Thereby, the density at time $t$, Equation (15), can be written as

$$\rho(t) = \exp\Big(-\sum_{k,l=1}^{n} \lambda_{kl}(t) E_{kl}\Big),$$

where the multipliers evolve as

$$\lambda_{kl}(t) = \sum_{i,j=1}^{n} U_{ki}(t)\, \lambda_{ij}\, U^{*}_{lj}(t).$$

The multipliers evolve contragrediently to the observables. This is the most general equation of change for the values of the information provided by the basis observables $E_{ij}$. The one cardinal assumption is that the dynamics are unitary, as used in Equation (14). It is a strong assumption because unitary implies reversible, $U^{-1}(t) = U^{\dagger}(t)$. Even if $U(t)$ is not unitary, as long as the dynamics keep the system confined to the $n$-dimensional Hilbert space, it is still possible to expand the surprisal matrix, $-\ln \rho(t)$, in the complete basis of $n^2$ observables, with time-dependent coefficients. Thereby, the $\lambda_{ij}(t)$ retain their significance as the time-changing values of the elementary observables $E_{ij}$, but the system no longer evolves in a reversible manner.
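The transformation of the multipliers under unitary dynamics can be checked on a small example. Because $\exp(U A U^{\dagger}) = U \exp(A) U^{\dagger}$ for unitary $U$, it suffices to compare the exponents. The sketch below assumes the label $E_{ij}$ for the elementary matrices and the explicit rule $\lambda_{kl}(t) = \sum_{ij} U_{ki}\, \lambda_{ij}\, U^{*}_{lj}$; the 2 by 2 unitary and the Hermitian array of initial multipliers are hypothetical.

```python
import cmath
import math

def matmul(a, b):
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

def dagger(a):
    n = len(a)
    return [[a[c][r].conjugate() for c in range(n)] for r in range(n)]

def unit(n, i, j):
    return [[(1.0 + 0j) if (r, c) == (i, j) else 0j for c in range(n)]
            for r in range(n)]

n = 2
theta, phi = 0.6, 0.9
# A hypothetical 2x2 unitary standing in for U(t) at one instant.
U = [[math.cos(theta) + 0j, -cmath.exp(1j * phi) * math.sin(theta)],
     [cmath.exp(-1j * phi) * math.sin(theta), math.cos(theta) + 0j]]

# Hermitian array of initial multipliers: lam[i][j] = conj(lam[j][i]).
lam = [[0.3 + 0j, 0.2 - 0.5j],
       [0.2 + 0.5j, 1.1 + 0j]]

# Propagated exponent: sum_ij lam_ij * (U E_ij U^dagger).
lhs = [[0j] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        term = matmul(matmul(U, unit(n, i, j)), dagger(U))
        for r in range(n):
            for c in range(n):
                lhs[r][c] += lam[i][j] * term[r][c]

# Same exponent in the fixed basis: its (k, l) element is lam_kl(t).
lam_t = [[sum(lam[i][j] * U[k][i] * U[l][j].conjugate()
              for i in range(n) for j in range(n))
          for l in range(n)] for k in range(n)]

max_diff = max(abs(lhs[r][c] - lam_t[r][c]) for r in range(n) for c in range(n))
hermitian = all(abs(lam_t[k][l] - lam_t[l][k].conjugate()) < 1e-12
                for k in range(n) for l in range(n))
print(max_diff, hermitian)
```

The propagated multipliers remain Hermitian, as they must for the density matrix to stay Hermitian at all times.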