Ergodic Theory
CONTENTS
1. Disclaimer
2. Introduction
2.1. Overview
2.2. Spectral invariants
2.3. Entropy
2.4. Examples
3. Mean Ergodic Theorems
3.1. Preliminaries
3.2. Poincaré Recurrence Theorem
3.3. Mean ergodic theorems
3.4. Some remarks on the Mean Ergodic Theorem
3.5. A generalization
4. Ergodic Transformations
4.1. Ergodicity
4.2. Ergodicity via Fourier analysis
4.3. Toral endomorphisms
4.4. Bernoulli Shifts
5. Mixing
5.1. Mixing transformations
5.2. Weakly mixing transformations
5.3. Spectral perspective
5.4. Hyperbolic toral automorphism is mixing
6. Pointwise Ergodic Theorems
6.1. The Radon-Nikodym Theorem
6.2. Expectation
6.3. Birkhoff's Ergodic Theorem
6.4. Some generalizations
6.5. Applications
7. Topological Dynamics
7.1. The space of T-invariant measures
7.2. The ergodic decomposition theorem
8. Unique ergodicity
8.1. Equidistribution
8.2. Examples
8.3. Minimality
9. Spectral Methods
9.1. Spectral isomorphisms
9.2. Ergodic spectra
9.3. Fourier analysis
10. Entropy
10.1. Motivation
10.2. Partition information
10.3. Definition of entropy
10.4. Properties of Entropy
10.5. Sinai's generator theorem
10.6. Examples
11. Measures of maximal entropy
11.1. Examples
12. Solutions to Selected Exercises
Ergodic Theory Math 248, 2014
1. DISCLAIMER
These are notes that I “live-TEXed” during a course offered by Maryam Mirzakhani
at Stanford in the fall of 2014. I have tried to edit the notes somewhat, but there are
undoubtedly still errors and typos, for which I of course take full responsibility.
Only about 80% of the lectures are contained here; some of the remaining classes I
missed, and some parts of the notes towards the end were too incoherent to include. It
is possible (but unlikely) that I will come back and patch those parts at some point in the
future.
2. INTRODUCTION
2.1. Overview. The overarching goal is to understand measurable transformations of a
measure space (X , µ, B). Here µ is usually a probability measure on X and B is the σ-
algebra of measurable subsets.
Definition 2.1. We say a transformation T : X → X preserves µ if for all α ∈ B we have
µ(α) = µ(T^{−1}(α)).
In particular, we require T^{−1}(B) ⊂ B for this to make sense.
Remark 2.2. It is not necessarily the case that µ(α) = µ(T α), or even that T (B) ⊂ B.
We are interested in the following kinds of questions concerning this setup.
Given (X , µ) that are sufficiently nice, can we “classify” all µ-preserving transfor-
mations T : X → X ? Can we find invariants that distinguish them?
A tricky thing about this is that since we are considering measure spaces, we can throw
out sets of measure zero. This means that topological intuition is not so useful here.
Remark 2.3. If µ is a regular measure, e.g. if X ⊂ X̄ where X̄ is metrizable and B is
the Borel σ-algebra, then (X, µ, B) turns out to be isomorphic to a "standard probability
space," which is a disjoint union of intervals with Lebesgue measure and discrete spaces.
In particular, we see that topological ideas like dimension, etc. are useless for distin-
guishing probability spaces.
So then what kinds of invariants can you use? We will discuss two flavors.
2.2. Spectral invariants. A simpler class of invariants are the “spectral invariants,” which
are qualitative features reflected in the “spectral theory” of T (we will explain what we
mean by this later).
Remark 2.4. This is one of several possible definitions of ergodicity. A different one is that
if A is T -invariant and measurable, then µ(A) = 0 or 1 (here µ is a probability measure).
X₀ ──→ Y₀
│T      │S
↓       ↓
X₀ ──→ Y₀
The point is that we can throw away a set of measure 0 and get the natural notion
of isomorphism. In particular, an ergodic transformation will not be isomorphic to a
non-ergodic transformation.
2.2.2. Mixing. Another such invariant is mixing, which says that if A, B are measurable
then
lim_{n→∞} µ(A ∩ T^{−n}B) = µ(A)µ(B).
We don’t want to dwell on the formal definitions now, but it turns out that this is stronger
than ergodicity. There are variants on this: weakly mixing, strongly mixing, exponentially
mixing, etc.
2.2.3. “Spectral” explained. Why do we call these spectral invariants? Because they are
related to the action of T on L 2µ (X ). That is, we have the Hilbert space of square-integrable
functions on X equipped with inner product
⟨f₁, f₂⟩ = ∫_X f₁ f₂ dµ.
The map T : X → X induces by pullback a (unitary) operator
u T : L 2µ (X ) → L 2µ (X ).
The condition of mixing can be interpreted in terms of the spectral theory of the operator
u T . In fact, we will see that you can distinguish between rotations R α , R β based on their
spectral properties. However, many non-equivalent operators have the same action on
L 2µ (X ), so we can’t distinguish them in this way.
2.3. Entropy. To distinguish some operators we will require a different kind of invariant,
which is more refined in the sense that it does not depend only on the spectral properties.
Example 2.7. We’ll now discuss a family of examples, the hyperbolic toral automorphisms.
Let n = 2 for concreteness, although you can do this for n ≥ 2 too. Let A ∈ SL2 (Z) be a
matrix having no eigenvalue of modulus 1 (hence the “hyperbolic”). Then we have the
natural action of A on R2 , which sends integral points to integral points, and hence in-
duces an action on T 2 = R2 /Z2 . Now A preserves the Lebesgue measure on R2 (det = 1)
and hence T 2 . How can we distinguish between (T 2 , A) and (T 2 , B ) for A, B ∈ SL2 (Z)?
This is extremely difficult to do, even though they are non-isomorphic. The spectral
invariants ergodicity and mixing are not enough. The only way we know is to use a very
powerful invariant called entropy, which quantifies “how complicated” the system is.
This roughly measures the growth of the number of periodic points, although periodic
points aren’t useful here (there are only countably many, and we can throw away sets of
measure 0).
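The periodic-point growth can be seen concretely. A minimal Python sketch (the matrix A = [[2,1],[1,1]] is an arbitrary hyperbolic choice; the counting identity #Fix(T_A^n) = |det(A^n − I)|, which the brute-force grid search below verifies, is standard for such A):

```python
def mat_mul(M, N):
    """Product of two 2x2 integer matrices."""
    return [[M[0][0]*N[0][0] + M[0][1]*N[1][0], M[0][0]*N[0][1] + M[0][1]*N[1][1]],
            [M[1][0]*N[0][0] + M[1][1]*N[1][0], M[1][0]*N[0][1] + M[1][1]*N[1][1]]]

def fixed_points(A, n):
    """Count fixed points of T_A^n on the torus by brute force over the candidate grid."""
    An = [[1, 0], [0, 1]]
    for _ in range(n):
        An = mat_mul(An, A)
    B = [[An[0][0] - 1, An[0][1]], [An[1][0], An[1][1] - 1]]   # B = A^n - I
    d = abs(B[0][0]*B[1][1] - B[0][1]*B[1][0])                 # |det(A^n - I)|
    # every fixed point is rational with denominator dividing d, so check the d x d grid
    count = sum(1 for i in range(d) for j in range(d)
                if (B[0][0]*i + B[0][1]*j) % d == 0 and (B[1][0]*i + B[1][1]*j) % d == 0)
    return count, d

A = [[2, 1], [1, 1]]
for n in range(1, 5):
    print(n, fixed_points(A, n))   # count equals |det(A^n - I)| in each case
```

For this A the counts come out as 1, 5, 16, 45, …, growing by a factor approaching the large eigenvalue λ = (3 + √5)/2 ≈ 2.618; the growth rate log λ is exactly the entropy discussed above.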
In many examples, this is the only way we know how to show that the measure spaces
are not the same. The second part of the course will deal with entropy, how to define and
calculate it. In the setting where X is a compact hyperbolic space and T is continuous,
there are some corollaries on counting periodic points and behavior of “long” periodic
points.
This leads into a big open question. The Lebesgue measure is invariant under T_m (the times-m map x ↦ mx mod 1).
There are “many” measures invariant under Tk (the Lebesgue is the “nicest” one)
for any particular k .
Conjecture 2.8. If µ is a probability measure invariant under T2 and T3 then it is
either supported on a finite set or Lebesgue.
This is a huge, difficult open problem. In contrast:
Theorem 2.9 (Furstenberg). A closed subset of S 1 which is invariant under T2 or
T3 is either S 1 or a finite set.
This illustrates the contrast between topology and measure theory. Some-
times something that is hard in one world is easy in another.
It is known in some general situations that if µ has positive entropy under
certain maps, like T2 and T3 , then it is Lebesgue.
(3) The Gauss map T : [0, 1] → [0, 1] defined by
T(x) = 1/x mod 1 for x ≠ 0, and T(0) = 0.
There is a measure µ on [0, 1] invariant under T (not the Lebesgue measure), which has
the form
µ(B) = (1/log 2) ∫_B dx/(1 + x).
It turns out that µ(T^{−1}B) = µ(B).
Let x ∈ R have the continued fraction expansion
x = a₁ + 1/(a₂ + 1/(a₃ + ⋯)).
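The digits can be read off by iterating the Gauss map: for x ∈ (0, 1) one has x = 1/(b₁ + 1/(b₂ + ⋯)) with b_i = ⌊1/T^{i−1}(x)⌋. A minimal Python sketch using exact rational arithmetic (the sample value 93/415 is an arbitrary illustration):

```python
from fractions import Fraction

def gauss_digits(x, n):
    """First n continued fraction digits of x in (0, 1), via the Gauss map T(x) = 1/x mod 1."""
    digits = []
    for _ in range(n):
        if x == 0:          # rational x: the expansion terminates
            break
        a = int(1 / x)      # next digit: floor(1/x), exact for Fraction
        digits.append(a)
        x = 1 / x - a       # apply the Gauss map
    return digits

print(gauss_digits(Fraction(93, 415), 10))  # [4, 2, 6, 7]: 93/415 = 1/(4 + 1/(2 + 1/(6 + 1/7)))
```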
that they are not equivalent, and you can prove this using entropy.
The significance of the question is that periodic points for this transformation
are related to closed geodesics on X .
3. MEAN ERGODIC THEOREMS
3.1. Preliminaries. For a measure-preserving system (X, µ, T), the pullback operator is
u_T(f) := f ∘ T : X → R (or C).
Lemma 3.1. The measure µ on X is T -invariant if and only if for all f ∈ L 1 (X , µ) we have:
∫ f dµ = ∫ f ∘ T dµ.   (1)
Proof. One direction is trivial: assuming (1) for all (almost everywhere) bounded test
functions f , we can set f = χ B where B is measurable and we immediately obtain that
µ(B ) = µ(T −1 (B )).
Conversely, suppose that we know (1) for all χ B where B is measurable. A basic fact
from measure theory is that there exists a sequence f n ↑ f almost everywhere, where f n
is a simple function: a finite linear combination of indicator functions. By dominated
convergence
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
3.2. Poincaré Recurrence Theorem. We now study the Poincaré Recurrence Theorem,
which is a kind of “pigeonhole principle” for measure-preserving transformations. The
idea is that if we consider the sequence of points x , T x , T 2 x , . . . then it should “return
close to x ” infinitely many times (hence “recurrence”).
Theorem 3.3 (Poincaré). For any measurable subset E ⊂ X, for almost every x ∈ E there
exist n₁ < n₂ < ⋯ such that {T^{n₁}(x), T^{n₂}(x), …} ⊂ E.
An interesting question is what can we say about the sequence n 1 (x ) < n 2 (x ) < . . ..
The theorem says that the sequence is infinite, but we might want to quantify whether
or not the recurrence happens “often.” In fact, it does: for “nice” maps T , n i (x ) ∼ αi .
Essentially, there is a finite expected time for recurrence to occur.
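This is easy to watch numerically. A quick sketch (the rotation R_α with α the golden ratio, E = [0, 0.1), and starting point x = 0.05 ∈ E are all arbitrary illustrative choices); the orbit returns to E with frequency roughly µ(E), consistent with a finite expected return time:

```python
import math

alpha = (math.sqrt(5) - 1) / 2      # irrational rotation angle
x = 0.05                            # starting point, chosen inside E
E = (0.0, 0.1)                      # the target set E = [0, 0.1)

returns = []
y = x
for n in range(1, 5001):
    y = (y + alpha) % 1.0           # y = R_alpha^n(x)
    if E[0] <= y < E[1]:
        returns.append(n)

print(len(returns))                  # roughly mu(E) * 5000 = 500 returns
print(returns[:5])                   # the first few return times n_1 < n_2 < ...
```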
Proof. The idea is to try to bound the measure of the set of points that don’t come back
to E . Let
B = {x ∈ E : T^n(x) ∉ E for all n ≥ 1}.
First one has to check that this is measurable:
B = E ∩ T^{−1}(X − E) ∩ T^{−2}(X − E) ∩ ⋯ ∩ T^{−k}(X − E) ∩ ⋯
This is an (admittedly infinite) intersection of measurable sets, hence measurable.
We claim that B, T^{−1}(B), …, T^{−k}(B), … are pairwise disjoint. Indeed, if y ∈ B ∩ T^{−n}(B)
for some n ≥ 1, then T^n(y) ∈ B ⊂ E, so y returns to E, contradicting y ∈ B; in general,
T^{−j}(B) ∩ T^{−j−n}(B) = T^{−j}(B ∩ T^{−n}(B)) = ∅. Since µ(B) = µ(T^{−1}(B)) = ⋯, and we are
dealing with a probability measure, we immediately see that µ(B) = 0. If x ∈ E returns to E
only finitely often, then x ∈ T^{−N}(B) for some N ≥ 0, so we are done.
Question: What can we say about µ(E ∩ T −n E ) if µ(E ) > 0? This is a measure of how
“evenly” T propagates E around.
More generally one might ask about µ(E 1 ∩ T −n E 2 ) for distinct sets E 1 and E 2 . How-
ever, note that µ(E 1 ∩T −n E 2 ) could be zero for all n, e.g. if X is a union of two T -invariant
pieces, so this does not admit an interesting answer without further refinements.
Exercise 3.4. Show that
limsup_{n→∞} µ(E ∩ T^{−n}E) ≥ µ(E)².
To put this in context, one can prove that for some general classes of T (irreducible,
ergodic) this is the average behavior, in the sense that
lim_{n→∞} (1/n) Σ_{i=1}^{n} µ(E ∩ T^{−i}E) = µ(E)².
Remark 3.6. If T is ergodic, then f* is constant almost everywhere and thus equal to
∫ f dµ. If you think of T as describing the evolution of the system in time, then this
means that for ergodic transformations "the space average is equal to the time average."
Proof. The proof is straightforward up to some technical machinery. The key is to ex-
plicitly describe the orthogonal complement to I, so let
B = ⟨u_T g − g : g ∈ L²(X, µ)⟩ (the closure of the span).
We claim that B ⊥ = I . Indeed, if u T f = f then
〈f , u T g − g 〉 = 〈f , u T g 〉 − 〈f , g 〉 = 〈u T f , u T g 〉 − 〈f , g 〉 = 0.
This shows that I ⊂ B ⊥ .
We then have to show that B ⊥ ⊂ I . If f ∈ B ⊥ then by definition 〈u T g , f 〉 = 〈g , f 〉 for all
g . Therefore,
||u T f − f ||2 = 〈u T f − f , u T f − f 〉 = 2||u T f ||2 − 〈f , u T f 〉 − 〈u T f , f 〉 = 0.
So we have established that L 2 (X , µ) = I ⊕ B . Recall that we want to show
lim_{n→∞} (f + u_T(f) + ⋯ + u_T^n(f))/n =: P_T(f) ∈ L²(X, µ).
To do this, we proceed as follows.
(1) Check the result for f ∈ I (which is obvious).
(2) Check it for f = u_T g − g (also obvious, since the average telescopes to
(1/N)(u_T^N g − g), whose L² norm is at most 2‖g‖₂/N).
(3) The result follows for the whole space if we can show that the left hand side is
"continuous" in f, so that it vanishes on all of B. Well, given ε > 0 and h ∈ B, we
can find h′ in the span such that ‖h − h′‖₂ < ε. Then for all sufficiently large N we have
‖(1/N) Σ_{n=1}^{N} u_T^n h′‖₂ < ε.
Therefore,
‖(1/N) Σ_{n=1}^{N} u_T^n h‖₂ ≤ ‖(1/N) Σ_{n=1}^{N} u_T^n (h − h′)‖₂ + ‖(1/N) Σ_{n=1}^{N} u_T^n h′‖₂
< 2ε,
using for the first term that each u_T^n is an isometry.
lim_{n→∞} A_n(f) = f̃ in L¹(X, µ).
What is this function f̃? In the L² case it was projection onto a certain subspace, but
since L¹ is not a Hilbert space, we can't make sense of "projection operators" as we did
before. It turns out that if B̃ denotes the σ-algebra of T-invariant measurable sets, then
f̃ is E(f | B̃). We will elaborate on this later.
Corollary 3.10. Assume that µ(X) < ∞. If µ(B) > 0, then the set {n ∈ N : µ(B ∩
T^{−n}B) > 0} (which is infinite by Poincaré's recurrence theorem) has bounded gaps.
3.4. Some remarks on the Mean Ergodic Theorem. We established the Mean Ergodic
Theorem for a measure-preserving system (X , µ, T ):
lim_{N→∞} (1/N) Σ_{n=1}^{N} u_T^n f = P_T f
where P_T is the projection onto the subspace of T-invariant functions in L²(X, µ). This
holds in general, even if µ(X) = ∞, but one can encounter problems such as P_T f van-
ishing almost everywhere even when ∫_X f dµ > 0. As a simple example, suppose f is the
indicator function of [0, 1] and T is translation by 1 on R.
We would like to have
∫_X f dµ = ∫_X P_T f dµ.
When restricting to a probability space, one has ‖·‖₁ ≤ ‖·‖₂ by Cauchy–Schwarz. There-
fore, if f_n → f in L² then one has
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
Since
∫_X u_T^n f dµ = ∫_X f dµ,
in a probability space we are indeed guaranteed that
∫_X P_T f dµ = ∫_X f dµ.
Suppose f n → f in L 2 and g ∈ L 2 (X , µ). Then
〈f n , g 〉 → 〈f , g 〉.
For measurable sets A, B of (X, µ, T) we apply this with f = χ_A and g = χ_B, and
f_n = A_n(χ_A). By the Mean Ergodic Theorem,
(1/N) Σ_{n=1}^{N} µ(T^{−n}A ∩ B) → ∫_B P_T(χ_A) dµ.
One would like to use this to show that the orbits of A intersect B , but the right hand side
could be 0.
However, if T is ergodic then by definition, the dimension of the space of T -invariant
functions is 1 (i.e. just the constants), so the right hand side is some constant times µ(B ).
Now, in a probability space one has
∫_X P_T(χ_A) dµ = ∫_X χ_A dµ = µ(A),
so the constant value of P_T(χ_A) is µ(A).
We have shown:
Theorem 3.11. If T is ergodic, then
lim_{N→∞} (1/N) Σ_{n=1}^{N} µ(T^{−n}A ∩ B) = µ(A)µ(B).
If T is not ergodic then one can still use the same idea to try and get something (the
result won’t be as strong, of course).
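For the doubling map T₂ and dyadic intervals, the limit in Theorem 3.11 can be verified exactly. A small sketch (the choice A = B = [0, 1/2) is arbitrary); here each individual term µ(T₂^{−n}A ∩ B) already equals µ(A)µ(B) = 1/4 for every n ≥ 1, so in particular the Cesàro averages converge to it:

```python
from fractions import Fraction

def preimage(intervals, n):
    """n-th preimage of a union of intervals in [0,1) under the doubling map x -> 2x mod 1."""
    for _ in range(n):
        intervals = [((a + j) / 2, (b + j) / 2) for (a, b) in intervals for j in (0, 1)]
    return intervals

def meet_measure(ints1, ints2):
    """Lebesgue measure of the intersection of two unions of disjoint intervals."""
    total = Fraction(0)
    for a, b in ints1:
        for c, d in ints2:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                total += hi - lo
    return total

A = [(Fraction(0), Fraction(1, 2))]
B = [(Fraction(0), Fraction(1, 2))]
for n in range(1, 6):
    print(n, meet_measure(preimage(A, n), B))  # each equals 1/4 = mu(A) * mu(B)
```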
Exercise 3.12. Let (X , µ) be a probability space and E ⊂ X a subset of positive measure.
Assume T : X → X is an invertible transformation preserving µ. Show that there exists
x ∈ X such that {n ∈ Z | T n (x ) ∈ E } has positive upper density.
Exercise 3.13. Suppose (X , µ) is a probability space. For any measurable set B and ε > 0,
show that the set
{k ∈ N | µ(T −k B ∩ B ) ≥ µ(B )2 − ε}
has bounded gaps.
3.5. A generalization. The key ingredient to this discussion is the mean ergodic theo-
rem, whose proof is very easy: it’s just basic functional analysis. What if we want to
study more complicated things like
µ(A ∩ T −n A ∩ T −2n A ∩ . . . ∩ T −k n A)
if µ(A) > 0? More generally, what about µ(A ∩ T −p (n ) A) for some polynomial p (t ) ∈ Z[t ]?
More generally still, suppose you have commuting operators T1 , . . . , Tk and want to study
lim_{N→∞} (1/N^k) Σ_{n₁,…,n_k ≤ N} u_{T₁}^{n₁}(f) ⋯ u_{T_k}^{n_k}(f).
In fact, Host-Kra showed that this kind of limit does converge in L 2 (X , µ). Recurrence
statements for this setting were proved by Furstenberg and Katznelson, etc. They are
significantly more challenging. We remark that these do not involve an assumption of
ergodicity.
4. ERGODIC TRANSFORMATIONS
4.1. Ergodicity.
Definition 4.1. Suppose T is a measure-preserving map on (X , µ, B). Then T is ergodic
if B = T −1 B for B ∈ B implies µ(B ) = 0 or µ(X − B ) = 0.
Remark 4.2. This makes sense even when X has infinite measure.
This definition is supposed to capture the notion of irreducibility. Given any T -invariant
measure µ, it is not clear how to obtain a measure µ′ that is T-invariant and ergodic with
respect to T. However, such measures do exist.
Proposition 4.3. The following are equivalent:
(1) T is ergodic.
(2) µ(T^{−1}B ∆ B) = 0 =⇒ µ(B) = 0 or µ(X − B) = 0.
(3) (Assuming µ(X) = 1) For any A ∈ B, if µ(A) > 0 then µ(⋃_n T^{−n}A) = 1.
(4) For any A, B ∈ B such that µ(A)µ(B) > 0 there exists n such that µ(T^{−n}A ∩ B) > 0.
(5) If f : X → C is measurable, then f ∘ T = f almost everywhere implies that f is equal
to a constant almost everywhere.
Remark 4.4. Condition (4) generalizes the earlier remark that µ(T^{−N}A ∩ A) > 0 for all
T-invariant measures. Recall that we said the result could fail for distinct sets if X were a
union of two disjoint T-invariant spaces. We will later prove that if T is ergodic then
lim_{N→∞} (1/N) Σ_{n=1}^{N} µ(T^{−n}A ∩ B) = µ(A)µ(B).
Remark 4.5. The definition makes sense for any group G acting on X .
Proof. Obviously (2) =⇒ (1). For (1) =⇒ (2), start with some B such that µ(B ∆ T^{−1}B) =
0. We want to make B into a T-invariant set somehow, so the most naïve thing to do is to
throw in T^{−1}(B). Of course, we then have to keep going, so we set
C = ⋂_{N=0}^{∞} ⋃_{n=N}^{∞} T^{−n}B.
Next we show that (1) ⟺ (3). For (1) =⇒ (3) observe that ⋃_n T^{−n}(A) is T-invariant
µ(T^{−1}A_{k,n} ∆ A_{k,n}) = 0.
This implies that A_{k,n} has full measure or zero measure for each n, k, and it follows that f
is constant almost everywhere.
Example 4.6. Here are some examples of ergodic and non-ergodic transformations.
(1) R α : S 1 → S 1 is ergodic with respect to the Lebesgue measure if α is irrational, and
not ergodic if α is rational.
(2) T2 : S 1 → S 1 with respect to the Lebesgue measure is ergodic.
(3) The map T : S 1 × S 1 → S 1 × S 1 sending (x , y ) 7→ (x + α, y + α) is not ergodic. For
instance, the function f (x , y ) = e 2πi (x −y ) is T -invariant but not constant.
(4) The map S : T k → T k sending
(x 1 , . . . , x k ) 7→ (x 1 + α, x 2 + x 1 , x 3 + x 2 , . . . , x k + x k −1 )
is ergodic if α is irrational. That is not obvious, although it’s easy to see that this
is measure-preserving for the Lebesgue measure.
There is a nice trick due to Furstenberg to use this to show that {n 2 α}, {n 3 α}, . . . , {n k α}
are dense in S 1 if α is irrational.
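A rough numerical check of the density of {n²α} (the choices α = √2 and decile bins are arbitrary); equidistribution, which is stronger than density, shows up as roughly equal bin frequencies:

```python
import math

alpha = math.sqrt(2)
N = 200_000
counts = [0] * 10
for n in range(1, N + 1):
    frac = (n * n * alpha) % 1.0             # fractional part {n^2 * alpha}
    counts[min(9, int(frac * 10))] += 1      # which decile of [0, 1) it lands in

freqs = [c / N for c in counts]
print([round(f, 3) for f in freqs])          # each decile frequency is close to 0.1
```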
4.2. Ergodicity via Fourier analysis. One approach to ergodicity on S 1 is to use Fourier
analysis on L 2 (X , µ), and study the action of T on the Fourier coefficients. This leads to
perhaps the simplest proofs, but unfortunately they do not generalize too well.
Example 4.7. Let's try applying this idea to the rotation operator R_α. For f ∈ L²(S¹) we
write
f(t) = Σ_{n∈Z} c_n e^{2πint}.
If f ∘ R_α = f, comparing Fourier coefficients gives c_n e^{2πinα} = c_n for all n. Since α is
irrational, e^{2πinα} ≠ 1 for n ≠ 0, so c_n = 0 for all n ≠ 0 and f is constant almost
everywhere. Hence R_α is ergodic.
Example 4.8. Next let's see what happens with the doubling map. For f ∈ L²(S¹) we again
write
f(t) = Σ_{n∈Z} c_n e^{2πint}.
If f ∘ T₂ = f, then c_n = c_{2n} = c_{4n} = ⋯ for every n (and c_m = 0 for m odd, m ≠ 0).
By Parseval, Σ |c_n|² = ‖f‖² < ∞, so an infinite family of equal nonzero coefficients is
impossible; thus c_n = 0 for all n ≠ 0 and T₂ is ergodic.
Now that we are warmed up, let's prove that (4) from Example 4.6 is ergodic. For f ∈
L²(T^k), we have a Fourier expansion
f(x⃗) = Σ_{n⃗∈Z^k} c_{n⃗} e^{2πi n⃗·x⃗}.
Suppose f(x⃗) = f(S(x⃗)).
The trick is that we can write n⃗·S(x⃗) = n₁α + S′(n⃗)·x⃗, where S′(n⃗) = (n₁ + n₂, n₂ +
n₃, …, n_{k−1} + n_k, n_k). The nice thing about S′ is that it induces an automorphism of Z^k,
so
f(S(x⃗)) = Σ c_{n⃗} e^{2πi n₁α} e^{2πi S′(n⃗)·x⃗}.
We conclude that c_{S′(n⃗)} = e^{2πiαn₁} c_{n⃗}. In particular,
|c_{S′(n⃗)}| = |c_{n⃗}|.
Now we claim that the sequence of vectors n⃗, S′(n⃗), (S′)^{∘2}(n⃗), … cannot be all distinct unless
c_{n⃗} = 0. This is for the same reason as before:
‖f‖² = Σ_{n⃗∈Z^k} |c_{n⃗}|².
We conclude that if c_{n⃗} ≠ 0 then there exist p < q such that (S′)^{∘p}(n⃗) = (S′)^{∘q}(n⃗). An easy
analysis shows that this implies n_k = ⋯ = n₂ = 0. Then comparing this with the earlier
equation c_{S′(n⃗)} = e^{2πiαn₁} c_{n⃗} shows that n₁ = 0 as well.
4.3. Toral endomorphisms. If A ∈ GL_n(Z), then it induces a map T_A : T^n → T^n pre-
serving the Lebesgue measure induced on T^n = R^n/Z^n. These are the "toral endomor-
phisms," which we have already encountered.
Theorem 4.9. T_A is ergodic if and only if no eigenvalue of A is a root of unity.
When no eigenvalue of A has modulus 1, we called T_A hyperbolic. For n = 2 the two
conditions coincide, but in general ergodicity is weaker: an algebraic integer can have
modulus 1 without being a root of unity, so in higher dimensions there are ergodic toral
automorphisms that are not hyperbolic.
Proof. (Sketch) We use Fourier analysis again. If f ∈ L²(X, µ) then we write
f = Σ_{n⃗∈Z^n} c_{n⃗} e^{2πi⟨n⃗,x⟩}
and f ∘ T_A has expansion
Σ_{n⃗∈Z^n} c_{n⃗} e^{2πi⟨n⃗,Ax⟩} = Σ_{n⃗∈Z^n} c_{n⃗} e^{2πi⟨n⃗A,x⟩},
so
c_{n⃗} = c_{n⃗A} = ⋯.
Applying Parseval's formula as usual, we conclude that either c_{n⃗} = 0 or {n⃗, n⃗A, …} is
really only a finite set. Then n⃗A^k = n⃗ for some k; for n⃗ ≠ 0 that implies that A has an
eigenvalue which is a k-th root of unity.
Conversely, if n⃗A^k = n⃗ for some n⃗ ≠ 0, then
f(x) = Σ_{j=0}^{k−1} e^{2πi⟨n⃗,A^j x⟩}
is a T_A-invariant function which is not almost everywhere constant, so T_A is not ergodic.
4.4. Bernoulli Shifts. We can give other proofs that the map T_d : S¹ → S¹ is ergodic,
without referencing Fourier analysis, by putting this map into a general context called
Bernoulli shifts.
Here is a general setting that captures all of these ergodic transformations. We have a
finite alphabet S = {s₁, …, s_k} and real numbers {p_{s₁}, …, p_{s_k}} such that p_s ≥ 0 for all
s ∈ S and Σ_{i=1}^{k} p_{s_i} = 1. We define the two-sided Bernoulli space
Σ = {(…, x_{−1}, x₀, x₁, …) : x_i ∈ S}
and the Bernoulli shift σ by σ(x)_i = x_{i+1}. Note that here σ is a bijection.
We also define the one-sided Bernoulli space
Σ⁺ = {(x₀, x₁, …) : x_i ∈ S}
and the left shift operator σ_L on Σ⁺ by
σ_L(x)_i = x_{i+1}.
Notice that here σ_L is surjective but not injective.
We equip Σ, Σ⁺ with the σ-algebra generated by the fundamental "cylinders"
[(i₀, s₀), …, (i_ℓ, s_ℓ)] = {x = (x_i) | x_{i₀} = s₀, …, x_{i_ℓ} = s_ℓ}.
These play the role of intervals (rectangles) for the construction of the Lebesgue measure
on R (R^n). We then define measures µ on Σ by
µ([(i₀, s₀), …, (i_ℓ, s_ℓ)]) = p_{s₀} ⋯ p_{s_ℓ}
and similarly for µ+ on Σ+ . This measure is evidently preserved by σ and σL , respec-
tively. So (Σ, σ, µ) and (Σ+ , σL , µ+ ) are measure-preserving systems. This turns out to be
a robust framework capturing many measure-preserving systems that we have already
encountered.
Example 4.11. The doubling map can be realized as a Bernoulli shift: with S = {0, 1} and
p₀ = p₁ = 1/2, we have (Σ⁺, σ_L, µ⁺) ≅ (S¹, T₂, µ) via binary expansion.
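A minimal sketch of this isomorphism in Python: a point of Σ⁺ is a bit sequence, the shift drops the first bit, and on the interval side the doubling map does the same to the binary expansion (the sample bit string is arbitrary):

```python
from fractions import Fraction

def bits_to_point(bits):
    """Interpret a finite bit string as the dyadic rational 0.b0 b1 b2 ... in [0, 1)."""
    return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

bits = [1, 0, 1, 1, 0, 0, 1, 0]
x = bits_to_point(bits)

# shifting the sequence corresponds to applying the doubling map T2(x) = 2x mod 1
shifted = bits[1:]
assert bits_to_point(shifted) == (2 * x) % 1
print(x, "->", (2 * x) % 1)
```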
The tricky thing about Bernoulli shifts is that they are very difficult to distinguish.
Even for k = 2, we are only choosing p 0 and p 1 such that p 0 + p 1 = 1 and it is already im-
possible to distinguish the different spaces by spectral properties. To do this one needs
to introduce the notion of entropy.
Σ⁺ can be metrized into a compact topological space, with d(x, x′) = 1/k when k is the
largest integer such that x_i = x′_i for i = 1, …, k.
Proof. We begin with a key observation that leverages the specific structure of cylinders.
If E ⊂ Σ is a finite union of cylinders and F = σ^{−N}E, then
µ_p(E ∩ F) = µ_p(E)² for large N.
To see this, think of E as a set where you have restricted the values in a certain (finite) set
of indices. Then σ^{−N} is a "right shift" (technically multivalued), so σ^{−N}E is a set where
you have restricted the values in another finite set of indices, shifted to the right from the
original. If you shift by a large enough amount then eventually the places where you have
restricted the values of E and σ^{−N}E are disjoint, and the measure of the intersection
factors as a product.
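This independence can be computed directly. A small sketch (the alphabet {a, b} with p_a = p_b = 1/2 and the specific cylinder are arbitrary choices): a cylinder is encoded as a finite dictionary of index constraints, and once the constraint sets of E and σ^{−N}E are disjoint the measure of the intersection factors:

```python
from fractions import Fraction

p = {"a": Fraction(1, 2), "b": Fraction(1, 2)}   # Bernoulli weights

def measure(cyl):
    """Measure of a cylinder given as {index: required symbol}; None means the empty set."""
    if cyl is None:
        return Fraction(0)
    result = Fraction(1)
    for sym in cyl.values():
        result *= p[sym]
    return result

def shift_preimage(cyl, N):
    """sigma^{-N} of a cylinder: the same constraints pushed N places to the right."""
    return {i + N: s for i, s in cyl.items()}

def intersect(c1, c2):
    """Intersection of two cylinders, or None if the constraints clash."""
    merged = dict(c1)
    for i, s in c2.items():
        if merged.get(i, s) != s:
            return None
        merged[i] = s
    return merged

E = {0: "a", 1: "b"}                     # the cylinder {x : x_0 = a, x_1 = b}
for N in range(4):
    m = measure(intersect(E, shift_preimage(E, N)))
    print(N, m)                           # N=0: 1/4; N=1: 0; N>=2: 1/16 = (1/4)^2
```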
Let B be a measurable set. We want to show that σ^{−1}B = B =⇒ µ(B) = 0 or 1. Given
ε > 0, there exists a finite union of cylinders E = ⋃_{j=1}^{m} C_j (where each C_j is a cylinder)
such that µ(E ∆ B) < ε, so in particular |µ(B) − µ(E)| < ε. Since µ(B) = µ(σ^{−1}B) = ⋯,
µ(B ∆ σ^{−N}E) = µ(σ^{−N}B ∆ σ^{−N}E) = µ(σ^{−N}(B ∆ E)) < ε.
This holds for all N. Now the point is that B is commensurate with both E and σ^{−N}E,
but by the discussion of the first paragraph these two sets are not commensurate with
each other unless µ(E) is near 0 or 1.
More precisely, we have µ(B ∆ E) < ε and µ(B ∆ σ^{−N}E) < ε. Also,
B ∆ (E ∩ σ^{−N}E) ⊂ (B ∆ E) ∪ (B ∆ σ^{−N}E),
so µ(B ∆ (E ∩ σ^{−N}E)) < 2ε and in particular |µ(B) − µ(E ∩ σ^{−N}E)| < 2ε. For N large,
µ(E ∩ σ^{−N}E) = µ(E)², and µ(E) is within ε of µ(B). Taking ε → 0, we conclude that
µ(B) = µ(B)², so µ(B) = 0 or 1.
5. MIXING
5.1. Mixing transformations. Recall that we proved that a Bernoulli shift system (Σ, σ, µ_p)
(where Σ_i p_i = 1) is ergodic by using the structure of "cylinders," specifically the fact that
µ(σ^{−N}A ∩ B) = µ(A)µ(B) for all sufficiently large N when A and B are finite unions of
cylinders.
By an approximation argument, this shows in fact that for any two measurable sets Ã
and B̃ we have
lim_{N→∞} µ(σ^{−N}Ã ∩ B̃) = µ(Ã)µ(B̃).
This is the prototype of a stronger property of transformations called mixing: (X, µ, T)
is mixing if for all measurable Ã, B̃,
lim_{n→∞} µ(T^{−n}Ã ∩ B̃) = µ(Ã)µ(B̃).
Example 5.2. The proof of Theorem 4.12 shows that (Σ, σ, µp ) is mixing.
Mixing implies ergodic, but not conversely. Indeed, one of our equivalent charac-
terizations of ergodicity in Proposition 4.3 was only that for all Ã, B̃ of positive measure
there exists some n such that µ(T^{−n}Ã ∩ B̃) > 0.
5.2. Weakly mixing transformations. Recall that we used the mean ergodic theorem to
show that if T is ergodic and Ã, B̃ are measurable then
(1/n) Σ_{i=1}^{n} µ(T^{−i}Ã ∩ B̃) → µ(Ã)µ(B̃).
In other words, the “average” of the quantities approaches some expected value. Mixing
says that the quantities themselves approach this value.
Example 5.5. In fact, you can easily see that R_α is not even weakly mixing, since for
suitable sets a positive proportion of the terms µ(R_α^{−n}Ã ∩ B̃) − µ(Ã)µ(B̃) stays bounded
away from 0.
5.3. Spectral perspective. Ergodicity, weak mixing, and mixing are "spectral proper-
ties" of the operator u_T on L²(X, µ). For instance, ergodicity says that for f, g ∈ L²(X, µ)
we have
lim_{n→∞} (1/n) Σ_{i=1}^{n} (u_T^i f, g) = (f, 1)(1, g),
and mixing says that for all f, g ∈ L²(X, µ),
lim_{n→∞} (u_T^n f, g) = (f, 1)(1, g).
A^k(x − y) = αA^k v₁ + βA^k v₂ = αλ₁^k v₁ + βλ₂^k v₂.
In summary, the important ingredients were ergodicity of expanding foliations, and the
existence of expanding/contracting directions. These ideas are the basis of the notion of
entropy. It turns out that the Lebesgue measure has the maximum possible entropy for
T , and this gives information about periodic points, etc.
ν(B) = ∫_B (dν/dµ) dµ.
Other basic properties are justified: if ν₁, ν₂ ≪ µ then
d(ν₁ + ν₂)/dµ = dν₁/dµ + dν₂/dµ,
and if λ ≪ ν ≪ µ then
dλ/dµ = (dλ/dν) · (dν/dµ).
Example 6.6. An example to keep in mind for why the finiteness hypothesis is necessary:
compare with the counting measure
ν(A) = |A| if |A| < ∞, and ν(A) = ∞ otherwise.
Indeed, the Lebesgue measure is absolutely continuous with respect to this counting
measure, but any candidate for dµ/dν would have to vanish almost everywhere, so no
Radon–Nikodym derivative exists.
Example 6.7. If A consists of sets of measure 0 or full measure, then any A -measurable
function is constant almost everywhere, and
E(f | A) = ∫_X f dµ.
So far we have restricted our discussion to non-negative functions f , but we can ex-
tend the definition in the usual way: write f = f + − f − where f + and f − are the positive
and negative parts.
• E (f | B) = f ,
• E (f | A ) ◦ T = E (f ◦ T | T −1 A ).
Theorem 6.9 (Birkhoff’s Ergodic Theorem). Let (X, T, µ) be a system and let A be the σ-
algebra generated by T-invariant measurable sets, i.e. those A such that µ(T^{−1}A ∆ A) = 0.
(So if T is ergodic, then A is the trivial σ-algebra.) If f ∈ L¹(X, µ) then for almost all x,
lim_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1) = E(f | A)(x) =: f*(x).
The limit f*(x) is A-measurable (i.e. T-invariant) and for any T-invariant subset A
(i.e. µ(T^{−1}A ∆ A) = 0),
∫_A f dµ = ∫_A f* dµ.
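A numerical sketch of the theorem (the irrational rotation, f the indicator of [0, 1/2), and starting point x = 0.1 are all arbitrary illustrative choices): the time averages approach the space average ∫ f dµ = 1/2, as predicted for an ergodic T:

```python
import math

alpha = math.sqrt(2) % 1           # irrational rotation angle
x = 0.1
N = 100_000

hits = 0
y = x
for n in range(N):
    if y < 0.5:                    # f = indicator function of [0, 1/2)
        hits += 1
    y = (y + alpha) % 1.0          # advance the orbit: y = T^n(x)

time_avg = hits / N
print(time_avg)                    # close to the space average 1/2
```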
Then
∫_{{x : F_n(x) > 0}} f dµ ≥ 0.
Let E_n = {x : F_n(x) > 0}. The difference between the claim and the lemma is that in the
claim, we are integrating over E = ⋃_n E_n.
We claim that we can replace E_n in the second integral with X, because outside E_n the
function F_n is 0. Since F_n is non-negative, we can also extend the integral to X in the
third integral. So
∫_{E_n} f(x) dµ ≥ ∫_X F_n(x) dµ − ∫_X F_n ∘ T(x) dµ = 0.
Here we are using that F_n is measurable and the T-invariance of the measure.
Having established the claim, let's turn our attention to the ergodic theorem. We want to
analyze
lim_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1),
but we don't know that the limit exists. So instead, we study
f*(x) = limsup_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1),
f_*(x) = liminf_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1).
We want to prove that given a < b, the set
E_{a,b} := {x : f_*(x) < a < b < f*(x)}
has measure 0 in X. Since R is separable, we can let a, b range over Q to deduce the
result.
A useful observation about this is that f* and f_* are both T-invariant, hence E_{a,b} is T-
invariant. This is shown by analyzing the identity
A_n f(Tx) = ((n + 1)/n) A_{n+1} f(x) − f(x)/n,
where A_n f = (1/n)(f + f ∘ T + ⋯ + f ∘ T^{n−1}).
Remark 6.11. If T were ergodic then we would automatically know that µ(E a ,b ) = 0 or 1.
So our goal is to show that E_{a,b} has measure 0. By the corollary applied to the
observation that f*(x) > b on E_{a,b},
∫_{E_{a,b}} f dµ ≥ b µ(E_{a,b}).
Similarly, from f_*(x) < a on E_{a,b} one gets ∫_{E_{a,b}} f dµ ≤ a µ(E_{a,b}). But b > a, so this is
only possible if µ(E_{a,b}) = 0. Thus f* = f_* almost everywhere, so the limit exists and is
T-invariant.
Remark 6.12. We see that the proof so far doesn't use µ(X) < ∞, but without that assump-
tion the limit could be 0, for instance. We need it to show that the limit satisfies
∫_A f̃ = ∫_A f.
We have
∫_X |A_n f(x)| dµ ≤ ∫_X |f(x)| dµ
since µ is T-invariant. That implies (by Fatou's lemma) that f* ∈ L¹(X, µ).
We now want to show that f* = E(f | B_T), i.e. the integrals over any T-invariant set B of
f and f* are equal. We can reduce to showing that
∫_X f = ∫_X f*.
Why? Let's focus on proving the lower bound. Fix ε > 0. Then the maximal inequality
implies that
(k/n − ε) µ(D_k^n) ≤ ∫_{D_k^n} f dµ.
In fact we can replace D_k^n with D_k^n ∩ B for any T-invariant subset B, by restricting all the
results to B. Anyway, this shows that
∫_{D_k^n} f − ∫_{D_k^n} f* ≤ (1/n) µ(D_k^n).
Theorem 6.15 (Hopf). Assume that there exists g ∈ L 1 (X , µ) such that g (x ) > 0 almost
everywhere, and that for almost every x ,
g (x ) + g (T (x )) + . . . + g (T n (x )) → ∞.
Then
lim_{n→∞} (Σ_{i=1}^{n} f(T^i x)) / (Σ_{i=1}^{n} g(T^i x)) =: φ(x) ∈ L¹(X, µ)
and
∫_X f dµ = ∫_X g φ dµ.
The proof uses “only” the maximal inequality, proceeding along the following lines.
(1) First prove that the lim sup = lim inf almost everywhere.
6.5. Applications.
Example 6.18. Recall the "times b" map T_b : S¹ → S¹ sending z ↦ z^b. Write x ∈ [0, 1] in
terms of a "base b expansion" 0.x₀x₁x₂x₃…. Then T_b(x) = 0.x₁x₂x₃…. We proved that
T_b is ergodic. This corresponds to the Bernoulli shift with p₀ = p₁ = ⋯ = p_{b−1} = 1/b.
Given x ∈ [0, 1], we can write x = 0.x₀x₁x₂…. Then x_j = k ⟺ T_b^j(x) ∈ [k/b, (k+1)/b). By
Birkhoff's ergodic theorem for χ_{[k/b,(k+1)/b)} we have that for almost all x,
lim_{n→∞} #{x_i : i ≤ n, x_i = k}/n = 1/b.
This can be generalized to strings of digits: a particular string (k₁…k_ℓ) appears a pro-
portion 1/b^ℓ of the time.
Definition 6.19. A point x is normal if for every base b and every digit k, the base b
expansion 0.x₀x₁… satisfies
lim_{n→∞} #{i : x_i = k, i ≤ n}/n = 1/b.
Birkhoff's ergodic theorem implies that almost all x are normal. However, it is very hard
to verify normality of specific numbers: no naturally occurring constant such as π, e, or
√2 has been proved normal (though artificial examples, like Champernowne's number
0.123456789101112…, are known to be normal in a fixed base).
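An empirical illustration (not a proof of anything): the decimal digits of √2, computed exactly with integer arithmetic, have frequencies close to 1/10, consistent with the expectation that √2 is normal:

```python
from math import isqrt

n = 5000                                    # number of decimal digits to inspect
digits = str(isqrt(2 * 10 ** (2 * n)))[1:]  # floor(sqrt(2) * 10^n) = "1414...", drop the leading 1
freqs = {d: digits.count(d) / len(digits) for d in "0123456789"}
print(freqs)                                # each frequency is empirically close to 0.1
```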
Example 6.20. Consider the Gauss map T(x) = 1/x mod 1. If the continued fraction ex-
pansion of x is
x = x₀ + 1/(x₁ + 1/(x₂ + 1/(x₃ + ⋯)))
then x₁ = [1/T(x)], …, x_n = [1/T^n(x)].
The ergodicity of T is then tied to the distribution of (x₀, x₁, x₂, …).
For instance, consider the interval I_k = (1/(k+1), 1/k). Then T^n(x) ∈ I_k ⟹ x_n = k. An
invariant measure for T is
µ(B) = ∫_B dx/(x + 1).
You can check this on intervals [a, b]:
T^{−1}[a, b] = ⋃_{n=1}^{∞} [1/(b + n), 1/(a + n)].
♠♠♠ TONY: [question: how do you motivate this measure?]
One can prove that T is in fact ergodic with respect to this measure. Then Birkhoff's
Ergodic Theorem implies that for almost every x, the frequency of the digit k is the measure
of (1/(k+1), 1/k) under µ, and the result turns out to be
(1/log 2) · log((k + 1)²/(k(k + 2))).
One can also use the theorem to do “weighted averages.”
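These predicted frequencies are easy to test numerically. The partial quotients of a rational p/q are produced by the Euclidean algorithm, which is exactly the Gauss map orbit, and a random rational with a huge denominator behaves, for this purpose, like a typical real number. (A sketch; the particular random seed and denominator are arbitrary choices.)

```python
import random
from math import log

def cf_digits(p, q):
    """Partial quotients of p/q (0 < p < q): the continued fraction digits
    read off the orbit of p/q under the Gauss map T(x) = 1/x mod 1.
    This is exactly the Euclidean algorithm."""
    digs = []
    while p:
        digs.append(q // p)
        p, q = q % p, p
    return digs

def predicted_freq(k):
    # Frequency of digit k predicted by the Gauss measure:
    # (1/log 2) * log((k+1)^2 / (k(k+2)))
    return log((k + 1) ** 2 / (k * (k + 2))) / log(2)

random.seed(1)
q = 10 ** 2000
digs = cf_digits(random.randrange(1, q), q)
f1 = digs.count(1) / len(digs)   # predicted_freq(1) is about 0.415
f2 = digs.count(2) / len(digs)   # predicted_freq(2) is about 0.170
```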
Example 6.21. Consider the sequence (2^n) = 2, 4, 8, 16, 32, 64, .... We ask: what is the frequency of ℓ as the
first (decimal) digit of 2^n as n → ∞? We claim that the frequency of ℓ is log₁₀(1 + 1/ℓ).
The number 2^m has d as first digit if and only if
    d·10^n ≤ 2^m < (d + 1)·10^n
for some n. Equivalently,
    n + log₁₀ d ≤ m log₁₀ 2 < n + log₁₀(d + 1).
Thus {m log₁₀ 2} ∈ [log₁₀ d, log₁₀(d + 1)). By Birkhoff’s Ergodic Theorem applied to the rotation by α = log₁₀ 2,
    {x + mα} ∈ [log₁₀ d, log₁₀(d + 1))
with proportion log₁₀(1 + 1/d) for almost every x. However, right now we are interested
in the particular value x = 0, so the result does not quite follow from Birkhoff’s Ergodic
Theorem. Therefore, we need a stronger result.
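The claimed frequencies are nevertheless easy to observe numerically; a small Python experiment with exact integer powers of 2:

```python
from math import log10

N = 10000
counts = [0] * 10                # counts[d] = number of m <= N with first digit d
p = 1
for _ in range(N):
    p *= 2                       # p runs over 2^1, ..., 2^N exactly
    counts[int(str(p)[0])] += 1  # leading decimal digit of 2^m

freqs = [counts[d] / N for d in range(10)]
benford = [0.0] + [log10(1 + 1 / d) for d in range(1, 10)]
max_err = max(abs(freqs[d] - benford[d]) for d in range(1, 10))
```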
7. TOPOLOGICAL DYNAMICS
7.1. The space of T -invariant measures. Suppose you have a measure-preserving sys-
tem (X , µ, T ) such that X is “compact” and metrizable (these are not essential assump-
tions, but will be very helpful). The measure is assumed to be Borel (i.e. Borel sets are
measurable). Sometimes we will want to assume that T is continuous.
Example 7.1. The rotation map R α : x 7→ x + α on S 1 and the times d map Td : z 7→ z d on
S 1 satisfy these conditions.
In general, we can consider the “space of finite T-invariant measures on X,” which we
denote by M_T(X). Unfortunately, this “does not help” for understanding measurable
dynamics, in the sense that if (X 1 , T1 , µ1 ) and (X 2 , T2 , µ2 ) are two systems, then M T1 (X 1 )
and M T2 (X 2 ) are not related, since in the measurable setting you can throw away sets of
measure zero, but there could be lots of interesting T -invariant measures supported on
such a set.
There are interesting “classification” theorems that illustrate the disparity between
topological and measure-theoretic results. Any regular, ergodic, measure-preserving
system (X , T, µ) is isomorphic to a measure-preserving system (X 0 , T 0 , µ0 ) such that µ0
is the only ergodic measure invariant under T 0 . Also, it can be shown that the system is
isomorphic to a “nice” measure-preserving system on T 2 . The moral is that topological
dynamics and measure-preserving dynamics are very different.
Let M₁(X) be the space of Borel probability measures on X. This is equipped with the weak*
topology. If C (X ) is the space of continuous maps X → R (or C), then the Riesz Represen-
tation Theorem says that C ∗ (X ) is basically the same as the space of “signed measures”
on X . Furthermore, C (X ) is separable. Then µk → µ in the weak* topology if and only if
for all f ∈ C(X),
    ∫ f dµ_k → ∫ f dµ.
It is a fact that M1 (X ) is compact and convex (closed) with respect to the weak* topology.
If T is continuous, then there is a map T∗ : M₁(X) → M₁(X) sending µ ↦ T∗µ, i.e.
    ∫ f d(T∗µ) = ∫ f ∘ T dµ,
and moreover this map is continuous with respect to the weak* topology.
Proposition 7.2. Let X be a compact metrizable space and T : X → X a continuous map.
Then M1T (X ) is non-empty.
The content of the proposition is that there always exist non-trivial invariant (prob-
ability) measures on a compact metrizable space. How might one construct such an
invariant measure? For any x ∈ X, we can consider the sequence x, Tx, ..., T^n x, ... and
define
    µ_n = (1/n) Σ_{i=1}^n δ_{T^i x} ∈ M₁(X).
Since M₁(X) is compact in the weak* topology, there is a convergent subsequence, and we
will show shortly that any weak* limit is T-invariant.
We claim that any weak* limit is T-invariant. For any continuous function f on X, we
have
    ∫ f∘T dµ_{n_j} − ∫ f dµ_{n_j} = (1/n_j) Σ_{i=1}^{n_j} [f(T^{i+1}x) − f(T^i x)]
                                  = (1/n_j) [f(T^{n_j+1}x) − f(Tx)],
so
    |∫ f∘T dµ_{n_j} − ∫ f dµ_{n_j}| ≤ (2/n_j)·||f||_∞ → 0.
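This averaging construction is easy to watch in action. For an irrational rotation the limit measure is Lebesgue (by unique ergodicity, discussed in Section 8), so integrals of test functions against µ_n tend to their Lebesgue integrals; a minimal sketch:

```python
from math import cos, pi

alpha = 2 ** 0.5 - 1            # irrational rotation number
x0, n = 0.1, 20000

# Integrate f(t) = cos(2*pi*t) against mu_n = (1/n) * sum of deltas at T^i x0,
# where T is rotation by alpha.  Against Lebesgue measure the integral is 0.
s = 0.0
pt = x0
for _ in range(n):
    pt = (pt + alpha) % 1.0     # apply T
    s += cos(2 * pi * pt)
integral_mu_n = s / n            # should be close to 0
```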
Proposition 7.3. Let X be compact and T measurable. The extreme points of M₁^T(X) are
precisely the ergodic measures for T.
An extreme point is a measure µ such that if µ = tµ₁ + (1−t)µ₂ with t ∈ (0, 1) and µ₁, µ₂ ∈ M₁^T(X), then µ₁ = µ₂ = µ.
(Recall that M₁^T(X) is convex.) These intuitively correspond to the vertices of a convex body.
Proof. If µ is not ergodic, then there exists E such that µ(E ∆ T^{−1}E) = 0 and 0 < µ(E) < 1. Then we can write
    µ = µ(E)·(1/µ(E))µ|_E + (1 − µ(E))·(1/µ(X∖E))µ|_{X∖E},
a convex combination of two T-invariant probability measures which are singular with respect to each other, hence µ is not
extremal.
Conversely, suppose µ is ergodic and µ = tµ₁ + (1−t)µ₂ with t ∈ (0, 1). Then µ₁(A) ≤ (1/t)µ(A), so µ₁ is absolutely continuous with respect to µ, and by
the Radon-Nikodym theorem there exists some φ such that
    µ₁(A) = ∫_A φ dµ,
and since µ₁ and µ are both T-invariant, φ must be T-invariant (µ-a.e.). Since µ is ergodic, φ must be constant
almost everywhere, hence µ₁ = µ. Thus µ is extremal.
7.2. The ergodic decomposition theorem. We now want to establish a kind of converse
result, asserting that every T -invariant measure is a “linear combination” of the extremal
ones. This is only true if X is compact.
Theorem 7.4. Let µ ∈ M₁^T(X). Then there exists a measure λ on M₁(X) such that λ(E_T) =
1, where E_T is the set of extremal measures, such that for all f ∈ C(X),
    ∫_X f dµ = ∫_{E_T} ( ∫_X f dν ) dλ(ν).
As a consequence, if |M1T (X )| > 1 i.e. there is more than one invariant measure, then
there exists more than one ergodic invariant measure.
Remark 7.5. This is the only thing reasonable to hope for, because M1T (X ) could be a
really large space. There are (X , T ) where M1T (X ) is “finite-dimensional” (only finitely
many ergodic measures invariant under T ), but in general things are much more com-
plicated, and the set of extremal points may not even be closed. It can even be dense.
Example 7.6. For the time 1 geodesic flow on the unit tangent bundle on a hyperbolic
surface X , one can construct ergodic measures of the following form. One ergodic mea-
sure α is supported on a closed geodesic, and another ergodic measure β is supported
on another closed geodesic. One can then take measures supported on some intertwin-
ing of these geodesics, wrapping around each n times and renormalizing. In the limit this
becomes a convex combination such as (α + β)/2. The set of ergodic measures is dense in the space of all invariant
measures for geodesic flow on T 1 (X ).
8. UNIQUE ERGODICITY
8.1. Equidistribution.
Definition 8.1. Let X be compact and µ ∈ M₁(X). A sequence x_n ∈ X becomes equidistributed with respect to µ
if for all f ∈ C(X),
    (1/n) Σ_{j=1}^n f(x_j) → ∫ f(x) dµ.
Definition 8.4. If (X , T ) is a system in which the above conditions are satisfied, then we
say that it is uniquely ergodic.
Proof. (1) ⟹ (2). Assume that µ is the unique ergodic (hence, by the ergodic decomposition, the unique invariant) measure. For x ∈ X, any weak* limit point of the measures
    (1/n) Σ_{k=1}^n δ_{T^k x}
is a T-invariant probability measure, necessarily equal to µ; hence the whole sequence converges to µ in the weak* topology. That means that for any f ∈ C(X),
    (1/N) Σ_{n=1}^N f(T^n x) → ∫ f dµ.
(2) ⟹ (3). Letting µ_n = (1/n) Σ_{k=1}^n δ_{T^k x} denote the nth measure in the sequence, we have
    ∫ f dµ_n = (1/n) Σ_{k=1}^n f(T^k x).
Supposing that the convergence is not uniform, we may choose g ∈ C(X) and ε > 0 such that
for all N₀, there exist N > N₀ and x_j ∈ X such that
    |(1/N) Σ_{n=1}^N g(T^n x_j) − C(g)| > ε,
a contradiction.
The equivalence of (3) and (4) follows from general approximation arguments.
Let’s show (3) ⟹ (1). If A_N f(x) → C(f), a constant independent of x, then
we want to show that there is only one invariant measure. Indeed, for every T-invariant
measure µ we have (by dominated convergence)
    ∫_X A_N f(x) dµ → ∫_X C(f) dµ = C(f),
and on the other hand, by invariance,
    ∫ A_N f(x) dµ = ∫ f dµ.
So ∫ f dµ = C(f) for every f ∈ C(X), which determines µ uniquely.
8.2. Examples. On S¹, a sequence {x_n}_{n=1}^∞ becomes equidistributed with respect to the Lebesgue
measure m if for any f ∈ C(S¹),
    (1/n) Σ_{i=1}^n f(x_i) → ∫ f dm.
It suffices to check this when f is a character x ↦ e^{2πikx},
because the trigonometric polynomials are dense in C(S¹) (this is Weyl’s criterion). This isn’t necessarily easy to
check: for instance, the question of whether or not ((3/2)^n) is equidistributed mod 1 is still open.
hence {n 2 α} is equidistributed in S 1 .
Using this technique, Furstenberg gave a dynamical proof that if p(t) is any polynomial with at least
one irrational coefficient (other than the constant term), then {p(n)} is equidistributed.
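A numerical check of the quadratic case: the proportion of n ≤ N with {n²√2} falling in [0, 1/4) approaches 1/4, as equidistribution predicts. (The interval and the value of N are arbitrary illustrative choices.)

```python
# Fraction of n <= N for which the fractional part of n^2 * sqrt(2)
# lies in [0, 1/4); equidistribution predicts 1/4.
alpha = 2 ** 0.5
N = 200000
hits = sum(1 for n in range(1, N + 1) if (n * n * alpha) % 1.0 < 0.25)
frac = hits / N
```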
8.3. Minimality. A set equidistributed with respect to the Lebesgue measure must be
dense, but a dense set need not be equidistributed with respect to the Lebesgue measure.
We saw that unique ergodicity is equivalent to every point having equidistributed orbit,
so a natural relaxation is to study dense orbits.
If a system is uniquely ergodic and the unique invariant measure has full support (as the Lebesgue measure does), then it must be minimal by
the observations above. However, if the unique invariant measure does not have full support, then there is no
implication in either direction.
Example 8.11. A circle homeomorphism with exactly one fixed point (a “parabolic” map) is uniquely ergodic, but not minimal. Indeed, such a map has a (unique) fixed point, and it turns out that the only invariant ergodic
measure is the point mass supported at this point. But the orbit of the fixed point is obviously
not dense.
[A commutative diagram appeared here: a map T on a space S, with a map φ : S → S¹ intertwining T with a map S¹ → S¹.]
9. SPECTRAL METHODS
9.1. Spectral isomorphisms. Our goal is to distinguish between R α and R β where α, β
are irrational rotations, by considering the induced actions on L 2m (where m is the Lebesgue
measure on S 1 ).
We have discussed how a triple (X , T, µ) gives an operator UT on L 2 (X , µ).
Definition 9.1. We say that T₁ and T₂ are spectrally isomorphic, and write U_{T₁} ≅ U_{T₂}, if we
can find a surjective W : L²(X₁, µ₁) → L²(X₂, µ₂) such that 〈W f₁, W f₂〉 = 〈f₁, f₂〉 and
    U_{T₂} ∘ W = W ∘ U_{T₁},
i.e. the following diagram commutes:

                    U_{T₁}
    L²(X₁, µ₁) -------------> L²(X₁, µ₁)
        |                         |
        | W                       | W
        v                         v
    L²(X₂, µ₂) -------------> L²(X₂, µ₂)
                    U_{T₂}
10. ENTROPY
10.1. Motivation. We want to motivate the notion of entropy for measure-preserving
maps (X , T, µ). Consider (S 1 , R α ): we mentioned that the operator UT on L 2 (X , µ) has
discrete spectrum. Conversely, any transformation with discrete spectrum looks like ro-
tation on a compact abelian group.
On the other hand, the Bernoulli shifts, which encompass most of the examples we
have seen, are all spectrally isomorphic (they all have countable Lebesgue spectrum), but they are
not all measure-theoretically isomorphic.
What is the difference between (S 1 , R α ) and Bernoulli shifts? The rotation is an isome-
try, and in particular
d (x , x 0 ) < ε =⇒ d (T n x , T n x 0 ) < ε.
The Bernoulli shift is much more “violent.”
Example 10.1. Baker’s transformation is defined on the unit square [0, 1]² by
    T_B(x, y) = (2x, y/2)             if x ≤ 1/2,
    T_B(x, y) = (2x − 1, (y + 1)/2)   if x ≥ 1/2.
One can check that this is the same as the bi-infinite Bernoulli shift on sequences (x_i)_{i=−∞}^∞. Geometrically, this splits the square down the middle (vertically), squashes each half to twice its width and half its height, and then stacks the two halves
vertically.
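The correspondence with the two-sided shift can be checked on dyadic points with exact rational arithmetic: writing x = 0.x₁x₂... and y = 0.y₁y₂... in binary and encoding (x, y) as the bi-infinite string ...y₂y₁.x₁x₂..., the baker’s map moves the “binary point” one place to the right. A small sketch:

```python
from fractions import Fraction as F

def baker(x, y):
    """Baker's transformation on the unit square (the two branches
    agree only up to the measure-zero line x = 1/2)."""
    if x < F(1, 2):
        return 2 * x, y / 2
    return 2 * x - 1, (y + 1) / 2

# x = 0.1011 and y = 0.0110 in binary
x, y = F(0b1011, 16), F(0b0110, 16)
x1, y1 = baker(x, y)
# Shift prediction: new x = 0.0110 drops the leading bit of x,
# and new y = 0.10110 prepends that bit to y.
```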
Now we prepare ourselves to define the entropy. The entropy of a system (X , T, µ) is a
non-negative number such that:
(1) It is invariant under measurable isomorphism. Therefore, it can distinguish be-
tween the Bernoulli shifts (1/2, 1/2) and (1/3, 1/3, 1/3).
(2) Given (X , T ), in “many nice situations” there is a unique measure of maximal en-
tropy µ for (X , T ) (even though there is no way to classify all invariant measures).
However, many interesting measures can have zero entropy.
Example 10.2. For the irrational rotation R_α on S¹, it will be the case that h_µ(R_α) = 0. This
reflects the fact that R_α is an isometry, so orbits of nearby points never separate. So sometimes we get no information from
entropy. However, we’ll see that the map z ↦ z² has non-zero entropy on S¹.
It tends to be the case that if X is compact, in nice cases (e.g. hyperbolic toral automorphisms
such as the one induced by the matrix ( 2 1 ; 1 1 )), the Lebesgue measure has the maximum possible
entropy.
We would also like to establish methods to compute entropy. For instance, if µ =
tµ₁ + (1−t)µ₂ then we want to describe h_µ in terms of h_{µ₁} and h_{µ₂}. There is a relationship, but
it isn’t very simple.
10.2. Partition information. The idea of defining entropy is to ask, how much do you
“gain” from applying T? Entropy should be a measure of “chaos.” So we partition X into
finitely many measurable sets P = {P₁, ..., P_k}; that means that µ(P_i ∩ P_j) = 0 for i ≠ j and
µ(X − ⋃_{i=1}^k P_i) = 0. The idea is that the “information” you get from P should depend
only on the numbers µ(P₁), ..., µ(P_k), i.e. it is a function H(µ(P₁), ..., µ(P_k)).
The entropy of a partition P will eventually be defined as essentially the growth rate
of H µ (P ∨ T −1 P ∨ . . . ∨ T −k P ) as k → ∞.
We say that ψ is strictly convex if the ≤ can be replaced by < unless f (x ) is (almost every-
where) constant.
Corollary 10.5. If P = {P₁, ..., P_k} then H_µ(P) ≤ log k, and equality holds exactly when µ(P₁) =
⋯ = µ(P_k) = 1/k.
Proof. Let φ(x) = x log x. If there exists P_i such that µ(P_i) ≠ 1/k, then by strict convexity
    φ( (1/k) Σ_{i=1}^k µ(P_i) ) < (1/k) Σ_{i=1}^k φ(µ(P_i)).
Tracing through the equality condition gives the result of the conclusion.
Now let’s prove some of the basic properties.
(1) We have
    H_µ(α ∨ β) = −Σ_{i,j} µ(A_i ∩ B_j) log µ(A_i ∩ B_j)
               = −Σ_{i,j} µ(A_i ∩ B_j) log [µ(A_i ∩ B_j)/µ(A_i)] − Σ_{i,j} µ(A_i ∩ B_j) log µ(A_i)
               = H_µ(β | α) + H_µ(α).
(2) We have
    H_µ(β | α) = −Σ_{j=1}^ℓ Σ_{i=1}^k µ(A_i ∩ B_j) log [µ(A_i ∩ B_j)/µ(A_i)]
               = −Σ_{j=1}^ℓ Σ_{i=1}^k µ(A_i) · [µ(A_i ∩ B_j)/µ(A_i)] log [µ(A_i ∩ B_j)/µ(A_i)]
               = −Σ_{j=1}^ℓ Σ_{i=1}^k µ(A_i) · φ(µ(A_i ∩ B_j)/µ(A_i))
               ≤ −Σ_{j=1}^ℓ φ( Σ_{i=1}^k µ(A_i) · µ(A_i ∩ B_j)/µ(A_i) )     (Jensen, φ convex)
               = −Σ_{j=1}^ℓ φ(µ(B_j))
               = H_µ(β).
Remark 10.9. This may seem impossible to compute because one has to check all finite
partitions, but it turns out that if P generates the σ-algebra then h µ (T, P ) = h µ (T ). Thus
in nice situations it suffices to compute the entropy of a single partition.
Example 10.10. Let T : S¹ → S¹ be the squaring map z ↦ z² and µ the Lebesgue measure. Set
P = {[0, 1/2), [1/2, 1)}. Then
    P^{(n)} = P ∨ T^{−1}P ∨ ⋯ ∨ T^{−n+1}P,
and one can check that this is the partition {[i/2^n, (i+1)/2^n)} for i = 0, 1, ..., 2^n − 1. So
    H_µ(P^{(n)}) = −2^n · (1/2^n) log(1/2^n) = n log 2,
so h(T, P) = log 2.
In fact, it is true that h µ (T ) = log 2. It is not clear how to check this now, since the
definition is in terms of all partitions, but we shall later see a criterion for checking that
a given partition suffices to compute the entropy.
(2) For the T_d map S¹ → S¹, any interval together with its complement generates the σ-algebra
of Lebesgue-measurable sets. Here the number of intervals in the nth join grows exponentially,
and each has length about 1/d^n. Then
    h_µ(T_d, P) ≈ lim_{n→∞} (n log d)/n = log d.
Let A, C be two partitions. We should have
    H_µ(A ∨ C) = H_µ(A) + H_µ(C | A).
If A = {A_i} and C = {C_j}, recall that we defined
    H_µ(A | C) = −Σ_{i,j} µ(A_i ∩ C_j) log [µ(A_i ∩ C_j)/µ(C_j)].
which is immediate from the fact that conditioning on a larger partition decreases the
entropy (Lemma 10.17).
(2) We have
    H_µ( ⋁_{i=0}^{n−1} T^{−i}A ) = H_µ(A) + Σ_{j=1}^{n−1} H_µ( A | ⋁_{i=1}^{j} T^{−i}A ).
We will use (1) plus the observation that if lim b_i exists, then
    lim_{n→∞} (1/n) Σ_{j=1}^n b_j = lim_{j→∞} b_j.
We deduce that
    h_µ(T, A) = lim_{n→∞} (1/n) H_µ( ⋁_{i=0}^{n−1} T^{−i}A ) = lim_{n→∞} H_µ( A | ⋁_{i=1}^{n} T^{−i}A ).
Proof. By taking n to be sufficiently large, we may ensure that every part of η is approximated
arbitrarily well by unions of parts of ⋁_{i=0}^n T^{−i}ξ. Intuitively, that means that
H_µ(η | ⋁_{i=0}^n T^{−i}ξ) is very small, since ⋁_{i=0}^n T^{−i}ξ is nearly finer than η.
Exercise 10.24. Prove the result rigorously by analyzing the definition of the conditional
information.
10.6. Examples.
Example 10.25. We consider two-sided Bernoulli shifts σ with k symbols s₁, ..., s_k and parameters
(p₁, ..., p_k). Then we claim that
    h_µ(σ) = −Σ_{i=1}^k p_i log p_i.
Indeed, consider the partition ξ obtained by separating elements according to the value of x₀:
    ξ = { {x : x₀ = s_i} : 1 ≤ i ≤ k }.
Then it is easy to see that this ξ is a two-sided generator, and computation shows that
    h_µ(σ, ξ) = −Σ_i p_i log p_i.
In fact, we have the following classification theorem.
Theorem 10.26 (Ornstein). Entropy is a complete invariant for two-sided Bernoulli shifts.
There are non-obvious numerical identifications, e.g. (1/2, 1/2) ≁ (1/3, 1/3, 1/3) but
(1/4, 1/4, 1/4, 1/4) ∼ (1/2, 1/8, 1/8, 1/8, 1/8), since both have entropy log 4.
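The arithmetic behind these identifications is immediate to verify:

```python
from math import log

def entropy(ps):
    """Entropy -sum p_i log p_i of the Bernoulli shift with weights ps."""
    assert abs(sum(ps) - 1) < 1e-12
    return -sum(p * log(p) for p in ps)

h2 = entropy([1/2, 1/2])                    # log 2
h3 = entropy([1/3, 1/3, 1/3])               # log 3
h4 = entropy([1/4, 1/4, 1/4, 1/4])          # log 4
hm = entropy([1/2, 1/8, 1/8, 1/8, 1/8])     # (1/2)log 2 + (1/2)log 8 = log 4
```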
Remark 10.27. However, for one-sided shifts entropy is not a complete invariant. Intu-
itively, isomorphic one-sided shifts should have the same numbers of symbols, because
if there are k symbols then the map is k : 1.
Remark 10.28. What made the calculation of entropy for the full shift feasible was that
for the full shift, the partitions ξ, T^{−1}ξ, T^{−2}ξ, ... are independent, i.e.
    µ( T^{−j₁}A_{i₁} ∩ ⋯ ∩ T^{−j_k}A_{i_k} ) = Π_m µ(A_{i_m})
for distinct j₁, ..., j_k. In this special setting, you don’t have to calculate anything because the entropy of the
join is automatically the sum of the entropies:
    H_µ( ⋁_{i=0}^{n−1} T^{−i}ξ ) = Σ_{i=0}^{n−1} H_µ(T^{−i}ξ) = n·H_µ(ξ).
Therefore,
    lim_{n→∞} H_µ(ξ ∨ ⋯ ∨ T^{−(n−1)}ξ)/n = H_µ(ξ).
In fact, any invertible map with an “independent generator” is isomorphic to a two-sided Bernoulli shift.
11. MEASURES OF MAXIMAL ENTROPY

11.1. Examples.

Example 11.1. Consider the map T_p : S¹ → S¹ sending z ↦ z^p. The Lebesgue measure has entropy log p. We claim that any other invariant measure has entropy strictly less than log p (so the
Lebesgue measure is the unique measure of maximal entropy).
Indeed, let ξ be the partition {[0, 1/p), [1/p, 2/p), ..., [(p−1)/p, 1)}. Note that ⋁_{i=0}^{n−1} T^{−i}ξ
is precisely the partition consisting of (the p^n) intervals of the form [j/p^n, (j+1)/p^n), so its entropy
with respect to the Lebesgue measure m is log(p^n) = n log p. It is clear that ξ is a generator for m, so
    h_m(T) = lim_{n→∞} H_m( ⋁_{i=0}^{n−1} T^{−i}ξ )/n = log p.
In fact, ξ generates for any T-invariant Borel measure µ, because any measurable set
can be approximated by such intervals. Therefore, h_µ(T) = h_µ(T, ξ), and by definition
    h_µ(T, ξ) = inf_{n≥1} H_µ(ξ ∨ T^{−1}ξ ∨ ⋯ ∨ T^{−(n−1)}ξ)/n ≤ (n log p)/n = log p.
The equality case H_µ(ξ ∨ ⋯ ∨ T^{−(n−1)}ξ) = log(p^n) for all n implies µ([j/p^n, (j+1)/p^n]) = 1/p^n, and
this implies that µ is the Lebesgue measure.
Example 11.2. Let’s consider the hyperbolic toral automorphisms, given by A ∈ SL₂(Z)
with eigenvalues having absolute value different from 1. One can check that T is a homeomorphism of the torus that is expansive, which is morally why it works in this case. (Remark: given such a homeomorphism, it is easy to find a generating partition.
Indeed, take any partition with diameter less than the expansivity constant δ, and it will be a generator.)
Theorem 11.3. If m is the Lebesgue measure, then T has entropy h m (T ) = log ρ where ρ is
the eigenvalue of A greater than 1.
In fact, we will show that for any µ one has h µ (T ) ≤ log ρ, with equality holding for the
Lebesgue measure, so again the Lebesgue measure has maximal entropy.
Proof. The goal is to find a particularly nice partition ξ, from which we can calculate
the entropy. In this case we can choose a partition consisting of rectangles with edges
parallel to the eigenvectors v + and v − with eigenvalues ρ and 1/ρ.
Then T^{−1}ξ consists of rectangles with edges parallel to v₊ and v₋, but contracted by
ρ along the v₊ direction and expanded by ρ along the v₋ direction; Tξ has the opposite
effect, contracting along v₋ and expanding along v₊. Thus ⋁_{i=−n}^{n} T^{−i}ξ consists of a
mesh of rectangles with length and width comparable to ρ^{−n}. In particular, we see that ξ is a two-sided generator. Therefore,
    h_m(T) = h_m(T, ξ) = lim_{n→∞} H_m( ⋁_{i=−n}^{n} T^{−i}ξ )/(2n + 1).
Now, the side lengths of the rectangles in ⋁_{i=−n}^{n} T^{−i}ξ lie in [c₁ρ^{−n}, c₂ρ^{−n}] for some constants c₁, c₂
independent of n, so each rectangle has measure between c₁²ρ^{−2n} and c₂²ρ^{−2n}, and hence
    −2 log c₂ + 2n log ρ ≤ H_m( ⋁_{i=−n}^{n} T^{−i}ξ ) ≤ −2 log c₁ + 2n log ρ.
Dividing by 2n + 1 and taking the limit, we see that necessarily h_m(T) = log ρ.
Again, for equality to hold we need that all these rectangles have essentially the same
measure, which recovers the Lebesgue measure.
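For the standard example A = ( 2 1 ; 1 1 ), the expanding eigenvalue ρ and the resulting entropy log ρ are easy to compute:

```python
from math import log, sqrt

# Eigenvalues of A = [[2, 1], [1, 1]]: trace 3, determinant 1, so
# lambda = (3 +/- sqrt(5)) / 2.  The expanding one is rho = (3 + sqrt(5))/2.
tr, det = 3, 1
rho = (tr + sqrt(tr * tr - 4 * det)) / 2
h = log(rho)                     # entropy of the Lebesgue measure
```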
Therefore,
    sup_{1≤a<b≤N} µ(T^{b−a}E ∩ E) ≥ (N²µ(E)² − N)/(N(N−1)) = (N/(N−1))µ(E)² − 1/(N−1).
Letting N → ∞ gives the desired result.
Solution to Exercise 5.11. Let A_n = T^{−n}(A). We regard u_T^n χ_A = χ_{A_n} ∈ L²(X, µ). By assumption,
    ∫_X χ_{A_n} dµ = µ(A) =: α
for each n, and
    lim_{n→∞} 〈χ_{A_n}, χ_{A_m}〉 = lim_{n→∞} µ(T^{−n}A ∩ T^{−m}A) = α².
Therefore, if we set f_n = χ_{A_n} − α then we have
    lim_{n→∞} 〈f_n, f_m〉 = lim_{n→∞} 〈χ_{A_n}, χ_{A_m}〉 − α² = 0.
We claim that this implies that lim_{n→∞} 〈f_n, g〉 = 0 for all g ∈ L²(X, µ). Indeed, this is true
for g in the closure of the subspace generated by the f_k, and also on its orthogonal complement
by definition. Then taking g = χ_B, we find that
    0 = lim_{n→∞} 〈f_n, g〉 = lim_{n→∞} ( ∫_B χ_{A_n} dµ − α·µ(B) ) = lim_{n→∞} ( µ(T^{−n}A ∩ B) − µ(A)µ(B) ).
and also
    A_{M,N}(f) = (1/N) Σ_{n=M}^{M+N−1} u_T^n(f).
We know that A_N(f) → P_T(f). Actually, observe that
    ||A_{M,N}(f) − P_T(f)|| = ||u_T^M(A_N(f)) − P_T(f)||
                            = ||u_T^M(A_N(f)) − u_T^M(P_T(f))||
                            = ||A_N(f) − P_T(f)||.
It now suffices to show that
    ∫_B P_T(f) dµ ≥ µ(B)².
Indeed, suppose this to be the case. Then for N large enough, we have
    ∫_B A_{M,N}(f) dµ ≥ µ(B)² − ε.
To prove the claim, take f = χ_B. By Cauchy-Schwarz,
    ∫ ( Σ_{n=1}^N u_T^n(f) )² dµ ≥ ( ∫ Σ_{n=1}^N u_T^n(f) dµ )² = N²µ(B)².
Expanding out the left-hand side, we find
    N·µ(B) + 2 Σ_{k=1}^{N−1} (N − k)·µ(T^{−k}(B) ∩ B) ≥ N²µ(B)².
Therefore,
    Σ_{n=1}^{N} n ∫_B A_n(f) dµ ≥ (N²µ(B)² − N·µ(B))/2.
Now we know that A_n(f) → P_T(f), so ∫_B A_n(f) dµ converges; dividing the above by N²/2 and
letting N → ∞ shows that the limit is at least µ(B)², which is the claim.
Solution to Exercise 5.7. (1) We have to show that
    lim_{n→∞} (1/n) Σ_{i=1}^n µ(T^{−i}A ∩ B) = µ(A)µ(B)
given that it holds for all sets in P. For any ε > 0, we can choose A′, B′ such that
µ(A′∆A) < ε and µ(B′∆B) < ε. Then
    (T^{−i}A ∩ B) ∆ (T^{−i}A′ ∩ B′) ⊂ (T^{−i}A ∆ T^{−i}A′) ∪ (B ∆ B′),
so
    |µ(T^{−i}A ∩ B) − µ(T^{−i}A′ ∩ B′)| < 2ε.
Therefore,
    |(1/n) Σ_{i=1}^n µ(T^{−i}A ∩ B) − (1/n) Σ_{i=1}^n µ(T^{−i}A′ ∩ B′)| < 2ε.
So both the left- and right-hand sides of the purported identity behave well under approximation with elements of P.
(2) By the same argument as above, the summand µ(T −i A ∩ B ) − µ(A)µ(B ) behaves
well under approximation by elements of P , and we know that for elements of P the
limit tends to 0.