Ergodic Theory
CONTENTS
1. Disclaimer
2. Introduction
2.1. Overview
2.2. Spectral invariants
2.3. Entropy
2.4. Examples
3. Mean Ergodic Theorems
3.1. Preliminaries
3.2. Poincaré Recurrence Theorem
3.3. Mean ergodic theorems
3.4. Some remarks on the Mean Ergodic Theorem
3.5. A generalization
4. Ergodic Transformations
4.1. Ergodicity
4.2. Ergodicity via Fourier analysis
4.3. Toral endomorphisms
4.4. Bernoulli Shifts
5. Mixing
5.1. Mixing transformations
5.2. Weakly mixing transformations
5.3. Spectral perspective
5.4. Hyperbolic toral automorphism is mixing
6. Pointwise Ergodic Theorems
6.1. The Radon-Nikodym Theorem
6.2. Expectation
6.3. Birkhoff's Ergodic Theorem
6.4. Some generalizations
6.5. Applications
7. Topological Dynamics
7.1. The space of T-invariant measures
7.2. The ergodic decomposition theorem
8. Unique ergodicity
8.1. Equidistribution
8.2. Examples
8.3. Minimality
9. Spectral Methods
9.1. Spectral isomorphisms
9.2. Ergodic spectra
9.3. Fourier analysis
10. Entropy
10.1. Motivation
10.2. Partition information
10.3. Definition of entropy
10.4. Properties of Entropy
10.5. Sinai's generator theorem
10.6. Examples
11. Measures of maximal entropy
11.1. Examples
12. Solutions to Selected Exercises
Ergodic Theory Math 248, 2014
1. DISCLAIMER
These are notes that I “live-TEXed” during a course offered by Maryam Mirzakhani
at Stanford in the fall of 2014. I have tried to edit the notes somewhat, but there are
undoubtedly still errors and typos, for which I of course take full responsibility.
Only about 80% of the lectures are contained here; some of the remaining classes I
missed, and some parts of the notes towards the end were too incoherent to include. It
is possible (but unlikely) that I will come back and patch those parts at some point in the
future.
2. INTRODUCTION
2.1. Overview. The overarching goal is to understand measurable transformations of a
measure space (X , µ, B). Here µ is usually a probability measure on X and B is the σ-
algebra of measurable subsets.
Definition 2.1. We say a transformation T : X → X preserves µ if for all α ∈ B we have
µ(α) = µ(T^{−1}(α)).
In particular, we require T^{−1}(B) ⊂ B for this to make sense.
Remark 2.2. It is not necessarily the case that µ(α) = µ(T α), or even that T (B) ⊂ B.
We are interested in the following kinds of questions concerning this setup.
Given (X , µ) that are sufficiently nice, can we “classify” all µ-preserving transfor-
mations T : X → X ? Can we find invariants that distinguish them?
A tricky thing about this is that since we are considering measure spaces, we can throw
out sets of measure zero. This means that topological intuition is not so useful here.
Remark 2.3. If µ is a regular measure, e.g. if X ⊂ X̄ where X̄ is metrizable and B is
the Borel σ-algebra, then (X, µ, B) turns out to be isomorphic to a "standard probability
space," which is a disjoint union of intervals with Lebesgue measure and discrete spaces.
In particular, we see that topological ideas like dimension, etc. are useless for distin-
guishing probability spaces.
So then what kinds of invariants can you use? We will discuss two flavors.
2.2. Spectral invariants. A simpler class of invariants are the “spectral invariants,” which
are qualitative features reflected in the “spectral theory” of T (we will explain what we
mean by this later).
Remark 2.4. This is one of several possible definitions of ergodicity. A different one is that
if A is T -invariant and measurable, then µ(A) = 0 or 1 (here µ is a probability measure).
X₀ ──→ Y₀
│T      │S
↓       ↓
X₀ ──→ Y₀
The point is that we can throw away a set of measure 0 and get the natural notion
of isomorphism. In particular, an ergodic transformation will not be isomorphic to a
non-ergodic transformation.
2.2.2. Mixing. Another such invariant is mixing, which says that if A, B are measurable
then
lim_{n→∞} µ(A ∩ T^{−n}B) = µ(A)µ(B).
We don’t want to dwell on the formal definitions now, but it turns out that this is stronger
than ergodicity. There are variants on this: weakly mixing, strongly mixing, exponentially
mixing, etc.
2.2.3. “Spectral” explained. Why do we call these spectral invariants? Because they are
related to the action of T on L 2µ (X ). That is, we have the Hilbert space of square-integrable
functions on X equipped with inner product
⟨f₁, f₂⟩ = ∫_X f₁ f₂ dµ.
The map T : X → X induces by pullback a (unitary) operator
u T : L 2µ (X ) → L 2µ (X ).
The condition of mixing can be interpreted in terms of the spectral theory of the operator
u T . In fact, we will see that you can distinguish between rotations R α , R β based on their
spectral properties. However, many non-equivalent operators have the same action on
L 2µ (X ), so we can’t distinguish them in this way.
2.3. Entropy. To distinguish some operators we will require a different kind of invariant,
which is more refined in the sense that it does not depend only on the spectral properties.
Example 2.7. We’ll now discuss a family of examples, the hyperbolic toral automorphisms.
Let n = 2 for concreteness, although you can do this for n ≥ 2 too. Let A ∈ SL2 (Z) be a
matrix having no eigenvalue of modulus 1 (hence the “hyperbolic”). Then we have the
natural action of A on R2 , which sends integral points to integral points, and hence in-
duces an action on T 2 = R2 /Z2 . Now A preserves the Lebesgue measure on R2 (det = 1)
and hence T 2 . How can we distinguish between (T 2 , A) and (T 2 , B ) for A, B ∈ SL2 (Z)?
This is extremely difficult to do, even though they are non-isomorphic. The spectral
invariants ergodicity and mixing are not enough. The only way we know is to use a very
powerful invariant called entropy, which quantifies “how complicated” the system is.
This roughly measures the growth of the number of periodic points, although periodic
points aren’t useful here (there are only countably many, and we can throw away sets of
measure 0).
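The periodic-point growth can be seen concretely. A minimal Python sketch (the matrix A = [[2,1],[1,1]] is an arbitrary hyperbolic choice; the counting identity #Fix(T_A^n) = |det(A^n − I)|, which the brute-force grid search below verifies, is standard for such A):

```python
def mat_mul(M, N):
    """Product of two 2x2 integer matrices."""
    return [[M[0][0]*N[0][0] + M[0][1]*N[1][0], M[0][0]*N[0][1] + M[0][1]*N[1][1]],
            [M[1][0]*N[0][0] + M[1][1]*N[1][0], M[1][0]*N[0][1] + M[1][1]*N[1][1]]]

def fixed_points(A, n):
    """Count fixed points of T_A^n on the torus by brute force over the candidate grid."""
    An = [[1, 0], [0, 1]]
    for _ in range(n):
        An = mat_mul(An, A)
    B = [[An[0][0] - 1, An[0][1]], [An[1][0], An[1][1] - 1]]   # B = A^n - I
    d = abs(B[0][0]*B[1][1] - B[0][1]*B[1][0])                 # |det(A^n - I)|
    # every fixed point is rational with denominator dividing d, so check the d x d grid
    count = sum(1 for i in range(d) for j in range(d)
                if (B[0][0]*i + B[0][1]*j) % d == 0 and (B[1][0]*i + B[1][1]*j) % d == 0)
    return count, d

A = [[2, 1], [1, 1]]
for n in range(1, 5):
    print(n, fixed_points(A, n))   # count equals |det(A^n - I)| in each case
```

For this A the counts come out as 1, 5, 16, 45, …, growing by a factor approaching the large eigenvalue λ = (3 + √5)/2 ≈ 2.618; the growth rate log λ is exactly the entropy discussed above.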
In many examples, this is the only way we know how to show that the measure spaces
are not the same. The second part of the course will deal with entropy, how to define and
calculate it. In the setting where X is a compact hyperbolic space and T is continuous,
there are some corollaries on counting periodic points and behavior of “long” periodic
points.
This leads into a big open question. The Lebesgue measure is invariant under T_m (the times-m map x ↦ mx mod 1).
There are “many” measures invariant under Tk (the Lebesgue is the “nicest” one)
for any particular k .
Conjecture 2.8. If µ is a probability measure invariant under T2 and T3 then it is
either supported on a finite set or Lebesgue.
This is a huge, difficult open problem. In contrast:
Theorem 2.9 (Furstenberg). A closed subset of S 1 which is invariant under T2 or
T3 is either S 1 or a finite set.
This illustrates the contrast between topology and measure theory. Some-
times something that is hard in one world is easy in another.
It is known in some general situations that if µ has positive entropy under
certain maps, like T2 and T3 , then it is Lebesgue.
(3) The Gauss map T : [0, 1] → [0, 1] defined by
T(x) = 1/x mod 1 for x ≠ 0, and T(0) = 0.
There is a measure µ on [0, 1] invariant under T (not the Lebesgue measure), which has
the form
µ(B) = (1/log 2) ∫_B dx/(1 + x).
It turns out that µ(T^{−1}B) = µ(B).
Let x ∈ R have the continued fraction expansion
x = a₁ + 1/(a₂ + 1/(a₃ + ⋯)).
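The digits can be read off by iterating the Gauss map: for x ∈ (0, 1) one has x = 1/(b₁ + 1/(b₂ + ⋯)) with b_i = ⌊1/T^{i−1}(x)⌋. A minimal Python sketch using exact rational arithmetic (the sample value 93/415 is an arbitrary illustration):

```python
from fractions import Fraction

def gauss_digits(x, n):
    """First n continued fraction digits of x in (0, 1), via the Gauss map T(x) = 1/x mod 1."""
    digits = []
    for _ in range(n):
        if x == 0:          # rational x: the expansion terminates
            break
        a = int(1 / x)      # next digit: floor(1/x), exact for Fraction
        digits.append(a)
        x = 1 / x - a       # apply the Gauss map
    return digits

print(gauss_digits(Fraction(93, 415), 10))  # [4, 2, 6, 7]: 93/415 = 1/(4 + 1/(2 + 1/(6 + 1/7)))
```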
that they are not equivalent, and you can prove this using entropy.
The significance of the question is that periodic points for this transformation
are related to closed geodesics on X .
3. MEAN ERGODIC THEOREMS
3.1. Preliminaries. For a measure-preserving system (X, µ, T), the pullback operator is
u_T(f) := f ∘ T : X → R (or C).
Lemma 3.1. The measure µ on X is T -invariant if and only if for all f ∈ L 1 (X , µ) we have:
∫ f dµ = ∫ f ∘ T dµ.   (1)
Proof. One direction is trivial: assuming (1) for all (almost everywhere) bounded test
functions f , we can set f = χ B where B is measurable and we immediately obtain that
µ(B ) = µ(T −1 (B )).
Conversely, suppose that we know (1) for all χ B where B is measurable. A basic fact
from measure theory is that there exists a sequence f n ↑ f almost everywhere, where f n
is a simple function: a finite linear combination of indicator functions. By dominated
convergence
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
3.2. Poincaré Recurrence Theorem. We now study the Poincaré Recurrence Theorem,
which is a kind of “pigeonhole principle” for measure-preserving transformations. The
idea is that if we consider the sequence of points x , T x , T 2 x , . . . then it should “return
close to x ” infinitely many times (hence “recurrence”).
Theorem 3.3 (Poincaré). For any measurable subset E ⊂ X, for almost every x ∈ E there
exist n₁ < n₂ < ⋯ such that {T^{n₁}(x), T^{n₂}(x), …} ⊂ E.
An interesting question is what can we say about the sequence n 1 (x ) < n 2 (x ) < . . ..
The theorem says that the sequence is infinite, but we might want to quantify whether
or not the recurrence happens “often.” In fact, it does: for “nice” maps T , n i (x ) ∼ αi .
Essentially, there is a finite expected time for recurrence to occur.
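This is easy to watch numerically. A quick sketch (the rotation R_α with α the golden ratio, E = [0, 0.1), and starting point x = 0.05 ∈ E are all arbitrary illustrative choices); the orbit returns to E with frequency roughly µ(E), consistent with a finite expected return time:

```python
import math

alpha = (math.sqrt(5) - 1) / 2      # irrational rotation angle
x = 0.05                            # starting point, chosen inside E
E = (0.0, 0.1)                      # the target set E = [0, 0.1)

returns = []
y = x
for n in range(1, 5001):
    y = (y + alpha) % 1.0           # y = R_alpha^n(x)
    if E[0] <= y < E[1]:
        returns.append(n)

print(len(returns))                  # roughly mu(E) * 5000 = 500 returns
print(returns[:5])                   # the first few return times n_1 < n_2 < ...
```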
Proof. The idea is to try to bound the measure of the set of points that don’t come back
to E . Let
B = {x ∈ E : T^n(x) ∉ E for all n ≥ 1}.
First one has to check that this is measurable:
B = E ∩ T^{−1}(X − E) ∩ T^{−2}(X − E) ∩ ⋯ ∩ T^{−k}(X − E) ∩ ⋯
This is an (admittedly infinite) intersection of measurable sets, hence measurable.
We claim that B, T^{−1}(B), …, T^{−k}(B), … are pairwise disjoint. Indeed, if y ∈ B ∩ T^{−n}(B)
for some n ≥ 1, then T^n(y) ∈ B ⊂ E, so y returns to E, contradicting y ∈ B; in general,
T^{−j}(B) ∩ T^{−j−n}(B) = T^{−j}(B ∩ T^{−n}(B)) = ∅. Since µ(B) = µ(T^{−1}(B)) = ⋯, and we are
dealing with a probability measure, we immediately see that µ(B) = 0. If x ∈ E returns to E
only finitely often, then x ∈ T^{−N}(B) for some N ≥ 0, so we are done.
Question: What can we say about µ(E ∩ T −n E ) if µ(E ) > 0? This is a measure of how
“evenly” T propagates E around.
More generally one might ask about µ(E 1 ∩ T −n E 2 ) for distinct sets E 1 and E 2 . How-
ever, note that µ(E 1 ∩T −n E 2 ) could be zero for all n, e.g. if X is a union of two T -invariant
pieces, so this does not admit an interesting answer without further refinements.
Exercise 3.4. Show that
limsup_{n→∞} µ(E ∩ T^{−n}E) ≥ µ(E)².
To put this in context, one can prove that for some general classes of T (irreducible,
ergodic) this is the average behavior, in the sense that
lim_{n→∞} (1/n) Σ_{i=1}^{n} µ(E ∩ T^{−i}E) = µ(E)².
Remark 3.6. If T is ergodic, then f* is constant almost everywhere and thus equal to
∫ f dµ. If you think of T as describing the evolution of the system in time, then this
means that for ergodic transformations "the space average is equal to the time average."
Proof. The proof is straightforward up to some technical machinery. The key is to ex-
plicitly describe the orthogonal complement to I, so let
B = ⟨u_T g − g : g ∈ L²(X, µ)⟩ (the closure of the span).
We claim that B ⊥ = I . Indeed, if u T f = f then
〈f , u T g − g 〉 = 〈f , u T g 〉 − 〈f , g 〉 = 〈u T f , u T g 〉 − 〈f , g 〉 = 0.
This shows that I ⊂ B ⊥ .
We then have to show that B ⊥ ⊂ I . If f ∈ B ⊥ then by definition 〈u T g , f 〉 = 〈g , f 〉 for all
g . Therefore,
||u T f − f ||2 = 〈u T f − f , u T f − f 〉 = 2||u T f ||2 − 〈f , u T f 〉 − 〈u T f , f 〉 = 0.
So we have established that L 2 (X , µ) = I ⊕ B . Recall that we want to show
lim_{n→∞} (f + u_T(f) + ⋯ + u_T^n(f))/n =: P_T(f) ∈ L²(X, µ).
To do this, we proceed as follows.
(1) Check the result for f ∈ I (which is obvious).
(2) Check it for f = u_T g − g (also obvious, since the average telescopes to
(1/N)(u_T^N g − g), whose L² norm is at most 2‖g‖₂/N).
(3) The result follows for the whole space if we can show that the left hand side is
"continuous" in f, so that it vanishes on all of B. Well, given ε > 0 and h ∈ B, we
can find h′ in the span such that ‖h − h′‖₂ < ε. Then for all sufficiently large N we have
‖(1/N) Σ_{n=1}^{N} u_T^n h′‖₂ < ε.
Therefore,
‖(1/N) Σ_{n=1}^{N} u_T^n h‖₂ ≤ ‖(1/N) Σ_{n=1}^{N} u_T^n (h − h′)‖₂ + ‖(1/N) Σ_{n=1}^{N} u_T^n h′‖₂
< 2ε,
using for the first term that each u_T^n is an isometry.
lim_{n→∞} A_n(f) = f̃ in L¹(X, µ).
What is this function f̃? In the L² case it was projection onto a certain subspace, but
since L¹ is not a Hilbert space, we can't make sense of "projection operators" as we did
before. It turns out that if B̃ denotes the σ-algebra of T-invariant measurable sets, then
f̃ is E(f | B̃). We will elaborate on this later.
Corollary 3.10. Assume that µ(X) < ∞. If µ(B) > 0, then the set {n ∈ N : µ(B ∩
T^{−n}B) > 0} (which is infinite by Poincaré's recurrence theorem) has bounded gaps.
3.4. Some remarks on the Mean Ergodic Theorem. We established the Mean Ergodic
Theorem for a measure-preserving system (X , µ, T ):
lim_{N→∞} (1/N) Σ_{n=1}^{N} u_T^n f = P_T f
where P_T is the projection onto the subspace of T-invariant functions in L²(X, µ). This
holds in general, even if µ(X) = ∞, but one can encounter problems such as P_T f van-
ishing almost everywhere even when ∫_X f dµ > 0. As a simple example, suppose f is the
indicator function of [0, 1] and T is translation by 1 on R.
We would like to have
∫_X f dµ = ∫_X P_T f dµ.
When restricting to a probability space, one has ‖·‖₁ ≤ ‖·‖₂ by Cauchy–Schwarz. There-
fore, if f_n → f in L² then one has
lim_{n→∞} ∫ f_n dµ = ∫ f dµ.
Since
∫_X u_T^n f dµ = ∫_X f dµ,
in a probability space we are indeed guaranteed that
∫_X P_T f dµ = ∫_X f dµ.
Suppose f n → f in L 2 and g ∈ L 2 (X , µ). Then
〈f n , g 〉 → 〈f , g 〉.
For measurable sets A, B of (X, µ, T) we apply this with f = χ_A and g = χ_B, and
f_n = A_n(χ_A). By the Mean Ergodic Theorem,
(1/N) Σ_{n=1}^{N} µ(T^{−n}A ∩ B) → ∫_B P_T(χ_A) dµ.
One would like to use this to show that the orbits of A intersect B , but the right hand side
could be 0.
However, if T is ergodic then by definition, the dimension of the space of T -invariant
functions is 1 (i.e. just the constants), so the right hand side is some constant times µ(B ).
Now, in a probability space one has
∫_X P_T(χ_A) dµ = ∫_X χ_A dµ = µ(A),
so the constant value of P_T(χ_A) is µ(A).
We have shown:
Theorem 3.11. If T is ergodic, then
lim_{N→∞} (1/N) Σ_{n=1}^{N} µ(T^{−n}A ∩ B) = µ(A)µ(B).
If T is not ergodic then one can still use the same idea to try and get something (the
result won’t be as strong, of course).
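For the doubling map T₂ and dyadic intervals, the limit in Theorem 3.11 can be verified exactly. A small sketch (the choice A = B = [0, 1/2) is arbitrary); here each individual term µ(T₂^{−n}A ∩ B) already equals µ(A)µ(B) = 1/4 for every n ≥ 1, so in particular the Cesàro averages converge to it:

```python
from fractions import Fraction

def preimage(intervals, n):
    """n-th preimage of a union of intervals in [0,1) under the doubling map x -> 2x mod 1."""
    for _ in range(n):
        intervals = [((a + j) / 2, (b + j) / 2) for (a, b) in intervals for j in (0, 1)]
    return intervals

def meet_measure(ints1, ints2):
    """Lebesgue measure of the intersection of two unions of disjoint intervals."""
    total = Fraction(0)
    for a, b in ints1:
        for c, d in ints2:
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                total += hi - lo
    return total

A = [(Fraction(0), Fraction(1, 2))]
B = [(Fraction(0), Fraction(1, 2))]
for n in range(1, 6):
    print(n, meet_measure(preimage(A, n), B))  # each equals 1/4 = mu(A) * mu(B)
```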
Exercise 3.12. Let (X , µ) be a probability space and E ⊂ X a subset of positive measure.
Assume T : X → X is an invertible transformation preserving µ. Show that there exists
x ∈ X such that {n ∈ Z | T n (x ) ∈ E } has positive upper density.
Exercise 3.13. Suppose (X , µ) is a probability space. For any measurable set B and ε > 0,
show that the set
{k ∈ N | µ(T −k B ∩ B ) ≥ µ(B )2 − ε}
has bounded gaps.
3.5. A generalization. The key ingredient to this discussion is the mean ergodic theo-
rem, whose proof is very easy: it’s just basic functional analysis. What if we want to
study more complicated things like
µ(A ∩ T −n A ∩ T −2n A ∩ . . . ∩ T −k n A)
if µ(A) > 0? More generally, what about µ(A ∩ T −p (n ) A) for some polynomial p (t ) ∈ Z[t ]?
More generally still, suppose you have commuting operators T1 , . . . , Tk and want to study
lim_{N→∞} (1/N^k) Σ_{n₁,…,n_k ≤ N} u_{T₁}^{n₁}(f) ⋯ u_{T_k}^{n_k}(f).
In fact, Host-Kra showed that this kind of limit does converge in L 2 (X , µ). Recurrence
statements for this setting were proved by Furstenberg and Katznelson, etc. They are
significantly more challenging. We remark that these do not involve an assumption of
ergodicity.
4. ERGODIC TRANSFORMATIONS
4.1. Ergodicity.
Definition 4.1. Suppose T is a measure-preserving map on (X , µ, B). Then T is ergodic
if B = T −1 B for B ∈ B implies µ(B ) = 0 or µ(X − B ) = 0.
Remark 4.2. This makes sense even when X has infinite measure.
This definition is supposed to capture the notion of irreducibility. Given any T -invariant
measure µ, it is not clear how to obtain a measure µ′ that is T-invariant and ergodic with
respect to T. However, such measures do exist.
Proposition 4.3. The following are equivalent:
(1) T is ergodic.
(2) µ(T^{−1}B ∆ B) = 0 =⇒ µ(B) = 0 or µ(X − B) = 0.
(3) (Assuming µ(X) = 1) For any A ∈ B, if µ(A) > 0 then µ(⋃_n T^{−n}A) = 1.
(4) For any A, B ∈ B such that µ(A)µ(B) > 0 there exists n such that µ(T^{−n}A ∩ B) > 0.
(5) If f : X → C is measurable, then f ∘ T = f almost everywhere implies that f is equal
to a constant almost everywhere.
Remark 4.4. Condition (4) generalizes the earlier remark that µ(T^{−N}A ∩ A) > 0 for all
T-invariant measures. Recall that we said the result could fail for distinct sets if X were a
union of two disjoint T-invariant spaces. We will later prove that if T is ergodic then
lim_{N→∞} (1/N) Σ_{n=1}^{N} µ(T^{−n}A ∩ B) = µ(A)µ(B).
Remark 4.5. The definition makes sense for any group G acting on X .
Proof. Obviously (2) =⇒ (1). For (1) =⇒ (2), start with some B such that µ(B ∆ T^{−1}B) =
0. We want to make B into a T-invariant set somehow, so the most naïve thing to do is to
throw in T^{−1}(B). Of course, we then have to keep going, so we set
C = ⋂_{N=0}^{∞} ⋃_{n=N}^{∞} T^{−n}B.
Next we show that (1) ⟺ (3). For (1) =⇒ (3) observe that ⋃_n T^{−n}(A) is T-invariant
µ(T^{−1}A_{k,n} ∆ A_{k,n}) = 0.
This implies that A_{k,n} has full measure or zero measure for each n, k, and it follows that f
is constant almost everywhere.
Example 4.6. Here are some examples of ergodic and non-ergodic transformations.
(1) R α : S 1 → S 1 is ergodic with respect to the Lebesgue measure if α is irrational, and
not ergodic if α is rational.
(2) T2 : S 1 → S 1 with respect to the Lebesgue measure is ergodic.
(3) The map T : S 1 × S 1 → S 1 × S 1 sending (x , y ) 7→ (x + α, y + α) is not ergodic. For
instance, the function f (x , y ) = e 2πi (x −y ) is T -invariant but not constant.
(4) The map S : T k → T k sending
(x 1 , . . . , x k ) 7→ (x 1 + α, x 2 + x 1 , x 3 + x 2 , . . . , x k + x k −1 )
is ergodic if α is irrational. That is not obvious, although it’s easy to see that this
is measure-preserving for the Lebesgue measure.
There is a nice trick due to Furstenberg to use this to show that {n 2 α}, {n 3 α}, . . . , {n k α}
are dense in S 1 if α is irrational.
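A rough numerical check of the density of {n²α} (the choices α = √2 and decile bins are arbitrary); equidistribution, which is stronger than density, shows up as roughly equal bin frequencies:

```python
import math

alpha = math.sqrt(2)
N = 200_000
counts = [0] * 10
for n in range(1, N + 1):
    frac = (n * n * alpha) % 1.0             # fractional part {n^2 * alpha}
    counts[min(9, int(frac * 10))] += 1      # which decile of [0, 1) it lands in

freqs = [c / N for c in counts]
print([round(f, 3) for f in freqs])          # each decile frequency is close to 0.1
```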
4.2. Ergodicity via Fourier analysis. One approach to ergodicity on S 1 is to use Fourier
analysis on L 2 (X , µ), and study the action of T on the Fourier coefficients. This leads to
perhaps the simplest proofs, but unfortunately they do not generalize too well.
Example 4.7. Let's try applying this idea to the rotation operator R_α. For f ∈ L²(S¹) we
write
f(t) = Σ_{n∈Z} c_n e^{2πint}.
If f ∘ R_α = f, comparing Fourier coefficients gives c_n e^{2πinα} = c_n for all n. Since α is
irrational, e^{2πinα} ≠ 1 for n ≠ 0, so c_n = 0 for all n ≠ 0 and f is constant almost
everywhere. Hence R_α is ergodic.
Example 4.8. Next let's see what happens with the doubling map. For f ∈ L²(S¹) we again
write
f(t) = Σ_{n∈Z} c_n e^{2πint}.
If f ∘ T₂ = f, then c_n = c_{2n} = c_{4n} = ⋯ for every n (and c_m = 0 for m odd, m ≠ 0).
By Parseval, Σ |c_n|² = ‖f‖² < ∞, so an infinite family of equal nonzero coefficients is
impossible; thus c_n = 0 for all n ≠ 0 and T₂ is ergodic.
Now that we are warmed up, let's prove that (4) from Example 4.6 is ergodic. For f ∈
L²(T^k), we have a Fourier expansion
f(x⃗) = Σ_{n⃗∈Z^k} c_{n⃗} e^{2πi n⃗·x⃗}.
Suppose f(x⃗) = f(S(x⃗)).
The trick is that we can write n⃗·S(x⃗) = n₁α + S′(n⃗)·x⃗, where S′(n⃗) = (n₁ + n₂, n₂ +
n₃, …, n_{k−1} + n_k, n_k). The nice thing about S′ is that it induces an automorphism of Z^k,
so
f(S(x⃗)) = Σ c_{n⃗} e^{2πi n₁α} e^{2πi S′(n⃗)·x⃗}.
We conclude that c_{S′(n⃗)} = e^{2πiαn₁} c_{n⃗}. In particular,
|c_{S′(n⃗)}| = |c_{n⃗}|.
Now we claim that the sequence of vectors n⃗, S′(n⃗), (S′)^{∘2}(n⃗), … cannot be all distinct unless
c_{n⃗} = 0. This is for the same reason as before:
‖f‖² = Σ_{n⃗∈Z^k} |c_{n⃗}|².
We conclude that if c_{n⃗} ≠ 0 then there exist p < q such that (S′)^{∘p}(n⃗) = (S′)^{∘q}(n⃗). An easy
analysis shows that this implies n_k = ⋯ = n₂ = 0. Then comparing this with the earlier
equation c_{S′(n⃗)} = e^{2πiαn₁} c_{n⃗} shows that n₁ = 0 as well.
4.3. Toral endomorphisms. If A ∈ GL_n(Z), then it induces a map T_A : T^n → T^n pre-
serving the Lebesgue measure induced on T^n = R^n/Z^n. These are the "toral endomor-
phisms," which we have already encountered.
Theorem 4.9. T_A is ergodic if and only if no eigenvalue of A is a root of unity.
When no eigenvalue of A has modulus 1, we called T_A hyperbolic. For n = 2 the two
conditions coincide, but in general ergodicity is weaker: an algebraic integer can have
modulus 1 without being a root of unity, so in higher dimensions there are ergodic toral
automorphisms that are not hyperbolic.
Proof. (Sketch) We use Fourier analysis again. If f ∈ L²(X, µ) then we write
f = Σ_{n⃗∈Z^n} c_{n⃗} e^{2πi⟨n⃗,x⟩}
and f ∘ T_A has expansion
Σ_{n⃗∈Z^n} c_{n⃗} e^{2πi⟨n⃗,Ax⟩} = Σ_{n⃗∈Z^n} c_{n⃗} e^{2πi⟨n⃗A,x⟩},
so
c_{n⃗} = c_{n⃗A} = ⋯.
Applying Parseval's formula as usual, we conclude that either c_{n⃗} = 0 or {n⃗, n⃗A, …} is
really only a finite set. Then n⃗A^k = n⃗ for some k; for n⃗ ≠ 0 that implies that A has an
eigenvalue which is a k-th root of unity.
Conversely, if n⃗A^k = n⃗ for some n⃗ ≠ 0, then
f(x) = Σ_{j=0}^{k−1} e^{2πi⟨n⃗,A^j x⟩}
is a T_A-invariant function which is not almost everywhere constant, so T_A is not ergodic.
4.4. Bernoulli Shifts. We can give other proofs that the map T_d : S¹ → S¹ is ergodic,
without referencing Fourier analysis, by putting this map into a general context called
Bernoulli shifts.
Here is a general setting that captures all of these ergodic transformations. We have a
finite alphabet S = {s₁, …, s_k} and real numbers {p_{s₁}, …, p_{s_k}} such that p_s ≥ 0 for all
s ∈ S and Σ_{i=1}^{k} p_{s_i} = 1. We define the two-sided Bernoulli space
Σ = {(…, x_{−1}, x₀, x₁, …) : x_i ∈ S}
and the Bernoulli shift σ by σ(x)_i = x_{i+1}. Note that here σ is a bijection.
We also define the one-sided Bernoulli space
Σ⁺ = {(x₀, x₁, …) : x_i ∈ S}
and the left shift operator σ_L on Σ⁺ by
σ_L(x)_i = x_{i+1}.
Notice that here σ_L is surjective but not injective.
We equip Σ, Σ⁺ with the σ-algebra generated by the fundamental "cylinders"
[(i₀, s₀), …, (i_ℓ, s_ℓ)] = {x = (x_i) | x_{i₀} = s₀, …, x_{i_ℓ} = s_ℓ}.
These play the role of intervals (rectangles) for the construction of the Lebesgue measure
on R (R^n). We then define measures µ on Σ by
µ([(i₀, s₀), …, (i_ℓ, s_ℓ)]) = p_{s₀} ⋯ p_{s_ℓ}
and similarly for µ+ on Σ+ . This measure is evidently preserved by σ and σL , respec-
tively. So (Σ, σ, µ) and (Σ+ , σL , µ+ ) are measure-preserving systems. This turns out to be
a robust framework capturing many measure-preserving systems that we have already
encountered.
Example 4.11. The doubling map can be realized as a Bernoulli shift: with S = {0, 1} and
p₀ = p₁ = 1/2, we have (Σ⁺, σ_L, µ⁺) ≅ (S¹, T₂, µ) via binary expansion.
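A minimal sketch of this isomorphism in Python: a point of Σ⁺ is a bit sequence, the shift drops the first bit, and on the interval side the doubling map does the same to the binary expansion (the sample bit string is arbitrary):

```python
from fractions import Fraction

def bits_to_point(bits):
    """Interpret a finite bit string as the dyadic rational 0.b0 b1 b2 ... in [0, 1)."""
    return sum(Fraction(b, 2 ** (i + 1)) for i, b in enumerate(bits))

bits = [1, 0, 1, 1, 0, 0, 1, 0]
x = bits_to_point(bits)

# shifting the sequence corresponds to applying the doubling map T2(x) = 2x mod 1
shifted = bits[1:]
assert bits_to_point(shifted) == (2 * x) % 1
print(x, "->", (2 * x) % 1)
```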
The tricky thing about Bernoulli shifts is that they are very difficult to distinguish.
Even for k = 2, we are only choosing p 0 and p 1 such that p 0 + p 1 = 1 and it is already im-
possible to distinguish the different spaces by spectral properties. To do this one needs
to introduce the notion of entropy.
Σ⁺ can be metrized into a compact topological space, with d(x, x′) = 1/k when k is the
largest integer such that x_i = x′_i for i = 1, …, k.
Proof. We begin with a key observation that leverages the specific structure of cylinders.
If E ⊂ Σ is a finite union of cylinders and F = σ^{−N}E, then
µ_p(E ∩ F) = µ_p(E)² for large N.
To see this, think of E as a set where you have restricted the values in a certain (finite) set
of indices. Then σ^{−N} is a "right shift" (technically multivalued), so σ^{−N}E is a set where
you have restricted the values in another finite set of indices, shifted to the right from the
original. If you shift by a large enough amount then eventually the places where you have
restricted the values of E and σ^{−N}E are disjoint, and the measure of the intersection
factors as a product.
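This independence can be computed directly. A small sketch (the alphabet {a, b} with p_a = p_b = 1/2 and the specific cylinder are arbitrary choices): a cylinder is encoded as a finite dictionary of index constraints, and once the constraint sets of E and σ^{−N}E are disjoint the measure of the intersection factors:

```python
from fractions import Fraction

p = {"a": Fraction(1, 2), "b": Fraction(1, 2)}   # Bernoulli weights

def measure(cyl):
    """Measure of a cylinder given as {index: required symbol}; None means the empty set."""
    if cyl is None:
        return Fraction(0)
    result = Fraction(1)
    for sym in cyl.values():
        result *= p[sym]
    return result

def shift_preimage(cyl, N):
    """sigma^{-N} of a cylinder: the same constraints pushed N places to the right."""
    return {i + N: s for i, s in cyl.items()}

def intersect(c1, c2):
    """Intersection of two cylinders, or None if the constraints clash."""
    merged = dict(c1)
    for i, s in c2.items():
        if merged.get(i, s) != s:
            return None
        merged[i] = s
    return merged

E = {0: "a", 1: "b"}                     # the cylinder {x : x_0 = a, x_1 = b}
for N in range(4):
    m = measure(intersect(E, shift_preimage(E, N)))
    print(N, m)                           # N=0: 1/4; N=1: 0; N>=2: 1/16 = (1/4)^2
```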
Let B be a measurable set. We want to show that σ^{−1}B = B =⇒ µ(B) = 0 or 1. Given
ε > 0, there exists a finite union of cylinders E = ⋃_{j=1}^{m} C_j (where each C_j is a cylinder)
such that µ(E ∆ B) < ε, so in particular |µ(B) − µ(E)| < ε. Since µ(B) = µ(σ^{−1}B) = ⋯,
µ(B ∆ σ^{−N}E) = µ(σ^{−N}B ∆ σ^{−N}E) = µ(σ^{−N}(B ∆ E)) < ε.
This holds for all N. Now the point is that B is commensurate with both E and σ^{−N}E,
but by the discussion of the first paragraph these two sets are not commensurate with
each other unless µ(E) is near 0 or 1.
More precisely, we have µ(B ∆ E) < ε and µ(B ∆ σ^{−N}E) < ε. Also,
B ∆ (E ∩ σ^{−N}E) ⊂ (B ∆ E) ∪ (B ∆ σ^{−N}E),
so µ(B ∆ (E ∩ σ^{−N}E)) < 2ε and in particular |µ(B) − µ(E ∩ σ^{−N}E)| < 2ε. For N large,
µ(E ∩ σ^{−N}E) = µ(E)², and µ(E) is within ε of µ(B). Taking ε → 0, we conclude that
µ(B) = µ(B)², so µ(B) = 0 or 1.
5. MIXING
5.1. Mixing transformations. Recall that we proved that a Bernoulli shift system (Σ, σ, µ_p)
(where Σ_i p_i = 1) is ergodic by using the structure of "cylinders," specifically the fact that
µ(σ^{−N}A ∩ B) = µ(A)µ(B) for all sufficiently large N when A and B are finite unions of
cylinders.
By an approximation argument, this shows in fact that for any two measurable sets Ã
and B̃ we have
lim_{N→∞} µ(σ^{−N}Ã ∩ B̃) = µ(Ã)µ(B̃).
This is the prototype of a stronger property of transformations called mixing: (X, µ, T)
is mixing if for all measurable Ã, B̃,
lim_{n→∞} µ(T^{−n}Ã ∩ B̃) = µ(Ã)µ(B̃).
Example 5.2. The proof of Theorem 4.12 shows that (Σ, σ, µp ) is mixing.
Mixing implies ergodic, but not conversely. Indeed, one of our equivalent charac-
terizations of ergodicity in Proposition 4.3 was only that for all Ã, B̃ of positive measure
there exists some n such that µ(T^{−n}Ã ∩ B̃) > 0.
5.2. Weakly mixing transformations. Recall that we used the mean ergodic theorem to
show that if T is ergodic and Ã, B̃ are measurable then
(1/n) Σ_{i=1}^{n} µ(T^{−i}Ã ∩ B̃) → µ(Ã)µ(B̃).
In other words, the “average” of the quantities approaches some expected value. Mixing
says that the quantities themselves approach this value.
Example 5.5. In fact, you can easily see that R_α is not even weakly mixing, since for
suitable sets a positive proportion of the terms µ(R_α^{−n}Ã ∩ B̃) − µ(Ã)µ(B̃) stays bounded
away from 0.
5.3. Spectral perspective. Ergodicity, weak mixing, and mixing are "spectral proper-
ties" of the operator u_T on L²(X, µ). For instance, ergodicity says that for f, g ∈ L²(X, µ)
we have
lim_{n→∞} (1/n) Σ_{i=1}^{n} (u_T^i f, g) = (f, 1)(1, g),
and mixing says that for all f, g ∈ L²(X, µ),
lim_{n→∞} (u_T^n f, g) = (f, 1)(1, g).
A^k(x − y) = αA^k v₁ + βA^k v₂ = αλ₁^k v₁ + βλ₂^k v₂.
In summary, the important ingredients were ergodicity of expanding foliations, and the
existence of expanding/contracting directions. These ideas are the basis of the notion of
entropy. It turns out that the Lebesgue measure has the maximum possible entropy for
T , and this gives information about periodic points, etc.
ν(B) = ∫_B (dν/dµ) dµ.
Other basic properties are justified: if ν₁, ν₂ ≪ µ then
d(ν₁ + ν₂)/dµ = dν₁/dµ + dν₂/dµ,
and if λ ≪ ν ≪ µ then
dλ/dµ = (dλ/dν) · (dν/dµ).
Example 6.6. An example to keep in mind for why the finiteness hypothesis is necessary:
compare with the counting measure
ν(A) = |A| if |A| < ∞, and ν(A) = ∞ otherwise.
Indeed, the Lebesgue measure is absolutely continuous with respect to this counting
measure, but any candidate for dµ/dν would have to vanish almost everywhere, so no
Radon–Nikodym derivative exists.
Example 6.7. If A consists of sets of measure 0 or full measure, then any A -measurable
function is constant almost everywhere, and
E(f | A) = ∫_X f dµ.
So far we have restricted our discussion to non-negative functions f , but we can ex-
tend the definition in the usual way: write f = f + − f − where f + and f − are the positive
and negative parts.
• E (f | B) = f ,
• E (f | A ) ◦ T = E (f ◦ T | T −1 A ).
Theorem 6.9 (Birkhoff’s Ergodic Theorem). Let (X, T, µ) be a system and let A be the σ-
algebra generated by T-invariant measurable sets, i.e. those A such that µ(T^{−1}A ∆ A) = 0.
(So if T is ergodic, then A is the trivial σ-algebra.) If f ∈ L¹(X, µ) then for almost all x,
lim_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1) = E(f | A)(x) =: f*(x).
The limit f*(x) is A-measurable (i.e. T-invariant) and for any T-invariant subset A
(i.e. µ(T^{−1}A ∆ A) = 0),
∫_A f dµ = ∫_A f* dµ.
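A numerical sketch of the theorem (the irrational rotation, f the indicator of [0, 1/2), and starting point x = 0.1 are all arbitrary illustrative choices): the time averages approach the space average ∫ f dµ = 1/2, as predicted for an ergodic T:

```python
import math

alpha = math.sqrt(2) % 1           # irrational rotation angle
x = 0.1
N = 100_000

hits = 0
y = x
for n in range(N):
    if y < 0.5:                    # f = indicator function of [0, 1/2)
        hits += 1
    y = (y + alpha) % 1.0          # advance the orbit: y = T^n(x)

time_avg = hits / N
print(time_avg)                    # close to the space average 1/2
```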
Then
∫_{{x : F_n(x) > 0}} f dµ ≥ 0.
Let E_n = {x : F_n(x) > 0}. The difference between the claim and the lemma is that in the
claim, we are integrating over E = ⋃_n E_n.
We claim that we can replace E_n in the second integral with X, because outside E_n the
function F_n is 0. Since F_n is non-negative, we can also extend the integral to X in the
third integral. So
∫_{E_n} f(x) dµ ≥ ∫_X F_n(x) dµ − ∫_X F_n ∘ T(x) dµ = 0.
Here we are using that F_n is measurable and the T-invariance of the measure.
Having established the claim, let's turn our attention to the ergodic theorem. We want to
analyze
lim_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1),
but we don't know that the limit exists. So instead, we study
f*(x) = limsup_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1),
f_*(x) = liminf_{n→∞} (f(x) + f(Tx) + ⋯ + f(T^n x))/(n + 1).
We want to prove that given a < b, the set
E_{a,b} := {x : f_*(x) < a < b < f*(x)}
has measure 0 in X. Since R is separable, we can let a, b range over Q to deduce the
result.
A useful observation about this is that f* and f_* are both T-invariant, hence E_{a,b} is T-
invariant. This is shown by analyzing the identity
A_n f(Tx) = ((n + 1)/n) A_{n+1} f(x) − f(x)/n,
where A_n f = (1/n)(f + f ∘ T + ⋯ + f ∘ T^{n−1}).
Remark 6.11. If T were ergodic then we would automatically know that µ(E a ,b ) = 0 or 1.
So our goal is to show that E_{a,b} has measure 0. By the corollary applied to the
observation that f*(x) > b on E_{a,b},
∫_{E_{a,b}} f dµ ≥ b µ(E_{a,b}).
Similarly, from f_*(x) < a on E_{a,b} one gets ∫_{E_{a,b}} f dµ ≤ a µ(E_{a,b}). But b > a, so this is
only possible if µ(E_{a,b}) = 0. Thus f* = f_* almost everywhere, so the limit exists and is
T-invariant.
Remark 6.12. We see that the proof so far doesn't use µ(X) < ∞, but without that assump-
tion the limit could be 0, for instance. We need it to show that the limit satisfies
∫_A f̃ = ∫_A f.
We have
∫_X |A_n f(x)| dµ ≤ ∫_X |f(x)| dµ
since µ is T-invariant. That implies (by Fatou's lemma) that f* ∈ L¹(X, µ).
We now want to show that f* = E(f | B_T), i.e. the integrals over any T-invariant set B of
f and f* are equal. We can reduce to showing that
∫_X f = ∫_X f*.
Why? Let's focus on proving the lower bound. Fix ε > 0. Then the maximal inequality
implies that
(k/n − ε) µ(D_k^n) ≤ ∫_{D_k^n} f dµ.
In fact we can replace D_k^n with D_k^n ∩ B for any T-invariant subset B, by restricting all the
results to B. Anyway, this shows that
∫_{D_k^n} f − ∫_{D_k^n} f* ≤ (1/n) µ(D_k^n).
Theorem 6.15 (Hopf). Assume that there exists g ∈ L 1 (X , µ) such that g (x ) > 0 almost
everywhere, and that for almost every x ,
g (x ) + g (T (x )) + . . . + g (T n (x )) → ∞.
Then
lim_{n→∞} (Σ_{i=1}^{n} f(T^i x)) / (Σ_{i=1}^{n} g(T^i x)) =: φ(x) ∈ L¹(X, µ)
and
∫_X f dµ = ∫_X g φ dµ.
The proof uses “only” the maximal inequality, proceeding along the following lines.
(1) First prove that the lim sup = lim inf almost everywhere.
6.5. Applications.
Example 6.18. Recall the "times b" map T_b : S¹ → S¹ sending z ↦ z^b. Write x ∈ [0, 1] in
terms of a "base b expansion" 0.x₀x₁x₂x₃…. Then T_b(x) = 0.x₁x₂x₃…. We proved that
T_b is ergodic. This corresponds to the Bernoulli shift with p₀ = p₁ = ⋯ = p_{b−1} = 1/b.
Given x ∈ [0, 1], we can write x = 0.x₀x₁x₂…. Then x_j = k ⟺ T_b^j(x) ∈ [k/b, (k+1)/b). By
Birkhoff's ergodic theorem for χ_{[k/b,(k+1)/b)} we have that for almost all x,
lim_{n→∞} #{x_i : i ≤ n, x_i = k}/n = 1/b.
This can be generalized to strings of digits: a particular string (k₁…k_ℓ) appears a pro-
portion 1/b^ℓ of the time.
Definition 6.19. A point x is normal if for every base b and every digit k, the base b
expansion 0.x₀x₁… satisfies
lim_{n→∞} #{i : x_i = k, i ≤ n}/n = 1/b.
Birkhoff's ergodic theorem implies that almost all x are normal. However, it is very hard
to verify normality of specific numbers: no naturally occurring constant such as π, e, or
√2 has been proved normal (though artificial examples, like Champernowne's number
0.123456789101112…, are known to be normal in a fixed base).
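An empirical illustration (not a proof of anything): the decimal digits of √2, computed exactly with integer arithmetic, have frequencies close to 1/10, consistent with the expectation that √2 is normal:

```python
from math import isqrt

n = 5000                                    # number of decimal digits to inspect
digits = str(isqrt(2 * 10 ** (2 * n)))[1:]  # floor(sqrt(2) * 10^n) = "1414...", drop the leading 1
freqs = {d: digits.count(d) / len(digits) for d in "0123456789"}
print(freqs)                                # each frequency is empirically close to 0.1
```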
Example 6.20. Consider the Gauss map T(x) = 1/x mod 1. If the continued fraction ex-
pansion of x is
x = x₀ + 1/(x₁ + 1/(x₂ + 1/(x₃ + ⋯)))
then x₁ = [1/T(x)], …, x_n = [1/T^n(x)].
The ergodicity of T is then tied to the distribution of (x₀, x₁, x₂, …).
For instance, consider the interval I_k = (1/(k+1), 1/k). Then T^n(x) ∈ I_k ⟹ x_n = k. An
invariant measure for T is
µ(B) = ∫_B dx/(x + 1).
You can check this on intervals [a, b]:
T^{−1}[a, b] = ⋃_{n=1}^{∞} [1/(b + n), 1/(a + n)].
♠♠♠ TONY: [question: how do you motivate this measure?]
One can prove that T is in fact ergodic with respect to this measure. Then Birkhoff's
Ergodic Theorem implies that for almost every x, the frequency of the digit k is the measure
of (1/(k+1), 1/k) under µ, and the result turns out to be
(1/log 2) · log((k + 1)²/(k(k + 2))).
One can also use the theorem to do “weighted averages.”
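These predicted frequencies are easy to test numerically. The partial quotients of a rational p/q are produced by the Euclidean algorithm, which is exactly the Gauss map orbit, and a random rational with a huge denominator behaves, for this purpose, like a typical real number. (A sketch; the particular random seed and denominator are arbitrary choices.)

```python
import random
from math import log

def cf_digits(p, q):
    """Partial quotients of p/q (0 < p < q): the continued fraction digits
    read off the orbit of p/q under the Gauss map T(x) = 1/x mod 1.
    This is exactly the Euclidean algorithm."""
    digs = []
    while p:
        digs.append(q // p)
        p, q = q % p, p
    return digs

def predicted_freq(k):
    # Frequency of digit k predicted by the Gauss measure:
    # (1/log 2) * log((k+1)^2 / (k(k+2)))
    return log((k + 1) ** 2 / (k * (k + 2))) / log(2)

random.seed(1)
q = 10 ** 2000
digs = cf_digits(random.randrange(1, q), q)
f1 = digs.count(1) / len(digs)   # predicted_freq(1) is about 0.415
f2 = digs.count(2) / len(digs)   # predicted_freq(2) is about 0.170
```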
Example 6.21. Consider the sequence (2^n) = 2, 4, 8, 16, 32, 64, .... We ask: what is the frequency of ℓ as the
first (decimal) digit of 2^n as n → ∞? We claim that the frequency of ℓ is log₁₀(1 + 1/ℓ).
The number 2^m has d as first digit if and only if
    d·10^n ≤ 2^m < (d + 1)·10^n
for some n. Equivalently,
    n + log₁₀ d ≤ m log₁₀ 2 < n + log₁₀(d + 1).
Thus {m log₁₀ 2} ∈ [log₁₀ d, log₁₀(d + 1)). By Birkhoff’s Ergodic Theorem applied to the rotation by α = log₁₀ 2,
    {x + mα} ∈ [log₁₀ d, log₁₀(d + 1))
with proportion log₁₀(1 + 1/d) for almost every x. However, right now we are interested
in the particular value x = 0, so the result does not quite follow from Birkhoff’s Ergodic
Theorem. Therefore, we need a stronger result.
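The claimed frequencies are nevertheless easy to observe numerically; a small Python experiment with exact integer powers of 2:

```python
from math import log10

N = 10000
counts = [0] * 10                # counts[d] = number of m <= N with first digit d
p = 1
for _ in range(N):
    p *= 2                       # p runs over 2^1, ..., 2^N exactly
    counts[int(str(p)[0])] += 1  # leading decimal digit of 2^m

freqs = [counts[d] / N for d in range(10)]
benford = [0.0] + [log10(1 + 1 / d) for d in range(1, 10)]
max_err = max(abs(freqs[d] - benford[d]) for d in range(1, 10))
```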
7. TOPOLOGICAL DYNAMICS
7.1. The space of T -invariant measures. Suppose you have a measure-preserving sys-
tem (X , µ, T ) such that X is “compact” and metrizable (these are not essential assump-
tions, but will be very helpful). The measure is assumed to be Borel (i.e. Borel sets are
measurable). Sometimes we will want to assume that T is continuous.
Example 7.1. The rotation map R α : x 7→ x + α on S 1 and the times d map Td : z 7→ z d on
S 1 satisfy these conditions.
In general, we can consider the “space of finite T-invariant measures on X,” which we
denote by M_T(X). Unfortunately, this “does not help” for understanding measurable
dynamics, in the sense that if (X 1 , T1 , µ1 ) and (X 2 , T2 , µ2 ) are two systems, then M T1 (X 1 )
and M T2 (X 2 ) are not related, since in the measurable setting you can throw away sets of
measure zero, but there could be lots of interesting T -invariant measures supported on
such a set.
There are interesting “classification” theorems that illustrate the disparity between
topological and measure-theoretic results. Any regular, ergodic, measure-preserving
system (X , T, µ) is isomorphic to a measure-preserving system (X 0 , T 0 , µ0 ) such that µ0
is the only ergodic measure invariant under T 0 . Also, it can be shown that the system is
isomorphic to a “nice” measure-preserving system on T 2 . The moral is that topological
dynamics and measure-preserving dynamics are very different.
Let M₁(X) be the space of Borel probability measures on X. This is equipped with the weak*
topology. If C (X ) is the space of continuous maps X → R (or C), then the Riesz Represen-
tation Theorem says that C ∗ (X ) is basically the same as the space of “signed measures”
on X . Furthermore, C (X ) is separable. Then µk → µ in the weak* topology if and only if
for all f ∈ C(X),
    ∫ f dµ_k → ∫ f dµ.
It is a fact that M1 (X ) is compact and convex (closed) with respect to the weak* topology.
If T is continuous, then there is a map T∗ : M₁(X) → M₁(X) sending µ ↦ T∗µ, i.e.
    ∫ f d(T∗µ) = ∫ f ∘ T dµ,
and moreover this map is continuous with respect to the weak* topology.
Proposition 7.2. Let X be a compact metrizable space and T : X → X a continuous map.
Then M1T (X ) is non-empty.
The content of the proposition is that there always exist non-trivial invariant (prob-
ability) measures on a compact metrizable space. How might one construct such an
invariant measure? For any x ∈ X, we can consider the sequence x, Tx, ..., T^n x, ... and
define
    µ_n = (1/n) Σ_{i=1}^n δ_{T^i x} ∈ M₁(X).
Since M₁(X) is compact in the weak* topology, there is a convergent subsequence, and we
will show shortly that any weak* limit is T-invariant.
We claim that any weak* limit is T-invariant. For any continuous function f on X, we
have
    ∫ f∘T dµ_{n_j} − ∫ f dµ_{n_j} = (1/n_j) Σ_{i=1}^{n_j} [f(T^{i+1}x) − f(T^i x)]
                                  = (1/n_j) [f(T^{n_j+1}x) − f(Tx)],
so
    |∫ f∘T dµ_{n_j} − ∫ f dµ_{n_j}| ≤ (2/n_j)·||f||_∞ → 0.
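This averaging construction is easy to watch in action. For an irrational rotation the limit measure is Lebesgue (by unique ergodicity, discussed in Section 8), so integrals of test functions against µ_n tend to their Lebesgue integrals; a minimal sketch:

```python
from math import cos, pi

alpha = 2 ** 0.5 - 1            # irrational rotation number
x0, n = 0.1, 20000

# Integrate f(t) = cos(2*pi*t) against mu_n = (1/n) * sum of deltas at T^i x0,
# where T is rotation by alpha.  Against Lebesgue measure the integral is 0.
s = 0.0
pt = x0
for _ in range(n):
    pt = (pt + alpha) % 1.0     # apply T
    s += cos(2 * pi * pt)
integral_mu_n = s / n            # should be close to 0
```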
Proposition 7.3. Let X be compact and T measurable. The extreme points of M₁^T(X) are
precisely the ergodic measures for T.
An extreme point is a measure µ such that if µ = tµ₁ + (1−t)µ₂ with t ∈ (0, 1) and µ₁, µ₂ ∈ M₁^T(X), then µ₁ = µ₂ = µ.
(Recall that M₁^T(X) is convex.) These intuitively correspond to the vertices of a convex body.
Proof. If µ is not ergodic, then there exists E such that µ(E ∆ T^{−1}E) = 0 and 0 < µ(E) < 1. Then we can write
    µ = µ(E)·(1/µ(E))µ|_E + (1 − µ(E))·(1/µ(X∖E))µ|_{X∖E},
a convex combination of two T-invariant probability measures which are singular with respect to each other, hence µ is not
extremal.
Conversely, suppose µ is ergodic and µ = tµ₁ + (1−t)µ₂ with t ∈ (0, 1). Then µ₁(A) ≤ (1/t)µ(A), so µ₁ is absolutely continuous with respect to µ, and by
the Radon-Nikodym theorem there exists some φ such that
    µ₁(A) = ∫_A φ dµ,
and since µ₁ and µ are both T-invariant, φ must be T-invariant (µ-a.e.). Since µ is ergodic, φ must be constant
almost everywhere, hence µ₁ = µ. Thus µ is extremal.
7.2. The ergodic decomposition theorem. We now want to establish a kind of converse
result, asserting that every T -invariant measure is a “linear combination” of the extremal
ones. This is only true if X is compact.
Theorem 7.4. Let µ ∈ M₁^T(X). Then there exists a measure λ on M₁(X) such that λ(E_T) =
1, where E_T is the set of extremal measures, such that for all f ∈ C(X),
    ∫_X f dµ = ∫_{E_T} ( ∫_X f dν ) dλ(ν).
As a consequence, if |M1T (X )| > 1 i.e. there is more than one invariant measure, then
there exists more than one ergodic invariant measure.
Remark 7.5. This is the only thing reasonable to hope for, because M1T (X ) could be a
really large space. There are (X , T ) where M1T (X ) is “finite-dimensional” (only finitely
many ergodic measures invariant under T ), but in general things are much more com-
plicated, and the set of extremal points may not even be closed. It can even be dense.
Example 7.6. For the time 1 geodesic flow on the unit tangent bundle on a hyperbolic
surface X , one can construct ergodic measures of the following form. One ergodic mea-
sure α is supported on a closed geodesic, and another ergodic measure β is supported
on another closed geodesic. One can then take measures supported on some intertwin-
ing of these geodesics, wrapping around each n times and renormalizing. In the limit this
becomes a convex combination such as (α + β)/2. The set of ergodic measures is dense in the space of all invariant
measures for geodesic flow on T 1 (X ).
8. UNIQUE ERGODICITY
8.1. Equidistribution.
Definition 8.1. Let X be compact and µ ∈ M₁(X). A sequence x_n ∈ X becomes equidistributed with respect to µ
if for all f ∈ C(X),
    (1/n) Σ_{j=1}^n f(x_j) → ∫ f(x) dµ.
Definition 8.4. If (X , T ) is a system in which the above conditions are satisfied, then we
say that it is uniquely ergodic.
Proof. (1) ⟹ (2). Assume that µ is the unique ergodic (hence, by the ergodic decomposition, the unique invariant) measure. For x ∈ X, any weak* limit point of the measures
    (1/n) Σ_{k=1}^n δ_{T^k x}
is a T-invariant probability measure, necessarily equal to µ; hence the whole sequence converges to µ in the weak* topology. That means that for any f ∈ C(X),
    (1/N) Σ_{n=1}^N f(T^n x) → ∫ f dµ.
(2) ⟹ (3). Letting µ_n = (1/n) Σ_{k=1}^n δ_{T^k x} denote the nth measure in the sequence, we have
    ∫ f dµ_n = (1/n) Σ_{k=1}^n f(T^k x).
Supposing that the convergence is not uniform, we may choose g ∈ C(X) and ε > 0 such that
for all N₀, there exist N > N₀ and x_j ∈ X such that
    |(1/N) Σ_{n=1}^N g(T^n x_j) − C(g)| > ε,
a contradiction.
The equivalence of (3) and (4) follows from general approximation arguments.
Let’s show (3) ⟹ (1). If A_N f(x) → C(f), a constant independent of x, then
we want to show that there is only one invariant measure. Indeed, for every T-invariant
measure µ we have (by dominated convergence)
    ∫_X A_N f(x) dµ → ∫_X C(f) dµ = C(f),
and on the other hand, by invariance,
    ∫ A_N f(x) dµ = ∫ f dµ.
So ∫ f dµ = C(f) for every f ∈ C(X), which determines µ uniquely.
8.2. Examples. On S¹, a sequence {x_n}_{n=1}^∞ becomes equidistributed with respect to the Lebesgue
measure m if for any f ∈ C(S¹),
    (1/n) Σ_{i=1}^n f(x_i) → ∫ f dm.
It suffices to check this when f is a character x ↦ e^{2πikx},
because the trigonometric polynomials are dense in C(S¹) (this is Weyl’s criterion). This isn’t necessarily easy to
check: for instance, the question of whether or not ((3/2)^n) is equidistributed mod 1 is still open.
hence {n 2 α} is equidistributed in S 1 .
Using this technique, Furstenberg gave a dynamical proof that if p(t) is any polynomial with at least
one irrational coefficient (other than the constant term), then {p(n)} is equidistributed.
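A numerical check of the quadratic case: the proportion of n ≤ N with {n²√2} falling in [0, 1/4) approaches 1/4, as equidistribution predicts. (The interval and the value of N are arbitrary illustrative choices.)

```python
# Fraction of n <= N for which the fractional part of n^2 * sqrt(2)
# lies in [0, 1/4); equidistribution predicts 1/4.
alpha = 2 ** 0.5
N = 200000
hits = sum(1 for n in range(1, N + 1) if (n * n * alpha) % 1.0 < 0.25)
frac = hits / N
```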
8.3. Minimality. A set equidistributed with respect to the Lebesgue measure must be
dense, but a dense set need not be equidistributed with respect to the Lebesgue measure.
We saw that unique ergodicity is equivalent to every point having equidistributed orbit,
so a natural relaxation is to study dense orbits.
If a system is uniquely ergodic and the unique invariant measure has full support (as the Lebesgue measure does), then it must be minimal by
the observations above. However, if the unique invariant measure does not have full support, then there is no
implication in either direction.
Example 8.11. A circle homeomorphism with exactly one fixed point (a “parabolic” map) is uniquely ergodic, but not minimal. Indeed, such a map has a (unique) fixed point, and it turns out that the only invariant ergodic
measure is the point mass supported at this point. But the orbit of the fixed point is obviously
not dense.
[A commutative diagram appeared here: a map T on a space S, with a map φ : S → S¹ intertwining T with a map S¹ → S¹.]
9. SPECTRAL METHODS
9.1. Spectral isomorphisms. Our goal is to distinguish between R α and R β where α, β
are irrational rotations, by considering the induced actions on L 2m (where m is the Lebesgue
measure on S 1 ).
We have discussed how a triple (X , T, µ) gives an operator UT on L 2 (X , µ).
Definition 9.1. We say that T₁ and T₂ are spectrally isomorphic, and write U_{T₁} ≅ U_{T₂}, if we
can find a surjective W : L²(X₁, µ₁) → L²(X₂, µ₂) such that 〈W f₁, W f₂〉 = 〈f₁, f₂〉 and
    U_{T₂} ∘ W = W ∘ U_{T₁},
i.e. the following diagram commutes:

                    U_{T₁}
    L²(X₁, µ₁) -------------> L²(X₁, µ₁)
        |                         |
        | W                       | W
        v                         v
    L²(X₂, µ₂) -------------> L²(X₂, µ₂)
                    U_{T₂}
10. ENTROPY
10.1. Motivation. We want to motivate the notion of entropy for measure-preserving
maps (X , T, µ). Consider (S 1 , R α ): we mentioned that the operator UT on L 2 (X , µ) has
discrete spectrum. Conversely, any transformation with discrete spectrum looks like ro-
tation on a compact abelian group.
On the other hand, the Bernoulli shifts, which encompass most of the examples we
have seen, are all spectrally isomorphic (they all have countable Lebesgue spectrum), but they are
not all measure-theoretically isomorphic.
What is the difference between (S 1 , R α ) and Bernoulli shifts? The rotation is an isome-
try, and in particular
d (x , x 0 ) < ε =⇒ d (T n x , T n x 0 ) < ε.
The Bernoulli shift is much more “violent.”
Example 10.1. Baker’s transformation is defined on the unit square [0, 1]² by
    T_B(x, y) = (2x, y/2)             if x ≤ 1/2,
    T_B(x, y) = (2x − 1, (y + 1)/2)   if x ≥ 1/2.
One can check that this is the same as the bi-infinite Bernoulli shift on sequences (x_i)_{i=−∞}^∞. Geometrically, this splits the square down the middle (vertically), squashes each half to twice its width and half its height, and then stacks the two halves
vertically.
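The correspondence with the two-sided shift can be checked on dyadic points with exact rational arithmetic: writing x = 0.x₁x₂... and y = 0.y₁y₂... in binary and encoding (x, y) as the bi-infinite string ...y₂y₁.x₁x₂..., the baker’s map moves the “binary point” one place to the right. A small sketch:

```python
from fractions import Fraction as F

def baker(x, y):
    """Baker's transformation on the unit square (the two branches
    agree only up to the measure-zero line x = 1/2)."""
    if x < F(1, 2):
        return 2 * x, y / 2
    return 2 * x - 1, (y + 1) / 2

# x = 0.1011 and y = 0.0110 in binary
x, y = F(0b1011, 16), F(0b0110, 16)
x1, y1 = baker(x, y)
# Shift prediction: new x = 0.0110 drops the leading bit of x,
# and new y = 0.10110 prepends that bit to y.
```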
Now we prepare ourselves to define the entropy. The entropy of a system (X , T, µ) is a
non-negative number such that:
(1) It is invariant under measurable isomorphism. Therefore, it can distinguish be-
tween the Bernoulli shifts (1/2, 1/2) and (1/3, 1/3, 1/3).
(2) Given (X , T ), in “many nice situations” there is a unique measure of maximal en-
tropy µ for (X , T ) (even though there is no way to classify all invariant measures).
However, many interesting measures can have zero entropy.
Example 10.2. For the irrational rotation R_α on S¹, it will be the case that h_µ(R_α) = 0. This
reflects the fact that R_α is an isometry, so orbits of nearby points never separate. So sometimes we get no information from
entropy. However, we’ll see that the map z ↦ z² has non-zero entropy on S¹.
It tends to be the case that if X is compact, in nice cases (e.g. hyperbolic toral automorphisms
such as the one induced by the matrix ( 2 1 ; 1 1 )), the Lebesgue measure has the maximum possible
entropy.
We would also like to establish methods to compute entropy. For instance, if µ =
tµ₁ + (1−t)µ₂ then we want to describe h_µ in terms of h_{µ₁} and h_{µ₂}. There is a relationship, but
it isn’t very simple.
10.2. Partition information. The idea of defining entropy is to ask, how much do you
“gain” from applying T? Entropy should be a measure of “chaos.” So we partition X into
finitely many measurable sets P = {P₁, ..., P_k}; that means that µ(P_i ∩ P_j) = 0 for i ≠ j and
µ(X − ⋃_{i=1}^k P_i) = 0. The idea is that the “information” you get from P should depend
only on the numbers µ(P₁), ..., µ(P_k), i.e. it is a function H(µ(P₁), ..., µ(P_k)).
The entropy of a partition P will eventually be defined as essentially the growth rate
of H µ (P ∨ T −1 P ∨ . . . ∨ T −k P ) as k → ∞.
We say that ψ is strictly convex if the ≤ can be replaced by < unless f (x ) is (almost every-
where) constant.
Corollary 10.5. If P = {P₁, ..., P_k} then H_µ(P) ≤ log k, and equality holds exactly when µ(P₁) =
⋯ = µ(P_k) = 1/k.
Proof. Let φ(x) = x log x. If there exists P_i such that µ(P_i) ≠ 1/k, then by strict convexity
    φ( (1/k) Σ_{i=1}^k µ(P_i) ) < (1/k) Σ_{i=1}^k φ(µ(P_i)).
Tracing through the equality condition gives the result of the conclusion.
Now let’s prove some of the basic properties.
(1) We have
    H_µ(α ∨ β) = −Σ_{i,j} µ(A_i ∩ B_j) log µ(A_i ∩ B_j)
               = −Σ_{i,j} µ(A_i ∩ B_j) log [µ(A_i ∩ B_j)/µ(A_i)] − Σ_{i,j} µ(A_i ∩ B_j) log µ(A_i)
               = H_µ(β | α) + H_µ(α).
(2) We have
    H_µ(β | α) = −Σ_{j=1}^ℓ Σ_{i=1}^k µ(A_i ∩ B_j) log [µ(A_i ∩ B_j)/µ(A_i)]
               = −Σ_{j=1}^ℓ Σ_{i=1}^k µ(A_i) · [µ(A_i ∩ B_j)/µ(A_i)] log [µ(A_i ∩ B_j)/µ(A_i)]
               = −Σ_{j=1}^ℓ Σ_{i=1}^k µ(A_i) · φ(µ(A_i ∩ B_j)/µ(A_i))
               ≤ −Σ_{j=1}^ℓ φ( Σ_{i=1}^k µ(A_i) · µ(A_i ∩ B_j)/µ(A_i) )     (Jensen, φ convex)
               = −Σ_{j=1}^ℓ φ(µ(B_j))
               = H_µ(β).
Remark 10.9. This may seem impossible to compute because one has to check all finite
partitions, but it turns out that if P generates the σ-algebra then h µ (T, P ) = h µ (T ). Thus
in nice situations it suffices to compute the entropy of a single partition.
Example 10.10. Let T : S¹ → S¹ be the squaring map z ↦ z² and µ the Lebesgue measure. Set
P = {[0, 1/2), [1/2, 1)}. Then
    P^{(n)} = P ∨ T^{−1}P ∨ ⋯ ∨ T^{−n+1}P,
and one can check that this is the partition {[i/2^n, (i+1)/2^n)} for i = 0, 1, ..., 2^n − 1. So
    H_µ(P^{(n)}) = −2^n · (1/2^n) log(1/2^n) = n log 2,
so h(T, P) = log 2.
In fact, it is true that h µ (T ) = log 2. It is not clear how to check this now, since the
definition is in terms of all partitions, but we shall later see a criterion for checking that
a given partition suffices to compute the entropy.
(2) For the T_d map S¹ → S¹, any interval together with its complement generates the σ-algebra
of Lebesgue-measurable sets. Here the number of intervals in the nth join grows exponentially,
and each has length about 1/d^n. Then
    h_µ(T_d, P) ≈ lim_{n→∞} (n log d)/n = log d.
Let A, C be two partitions. We should have
    H_µ(A ∨ C) = H_µ(A) + H_µ(C | A).
If A = {A_i} and C = {C_j}, recall that we defined
    H_µ(A | C) = −Σ_{i,j} µ(A_i ∩ C_j) log [µ(A_i ∩ C_j)/µ(C_j)].
which is immediate from the fact that conditioning on a larger partition decreases the
entropy (Lemma 10.17).
(2) We have
    H_µ( ⋁_{i=0}^{n−1} T^{−i}A ) = H_µ(A) + Σ_{j=1}^{n−1} H_µ( A | ⋁_{i=1}^{j} T^{−i}A ).
We will use (1) plus the observation that if lim b_i exists, then
    lim_{n→∞} (1/n) Σ_{j=1}^n b_j = lim_{j→∞} b_j.
We deduce that
    h_µ(T, A) = lim_{n→∞} (1/n) H_µ( ⋁_{i=0}^{n−1} T^{−i}A ) = lim_{n→∞} H_µ( A | ⋁_{i=1}^{n} T^{−i}A ).
Proof. By taking n to be sufficiently large, we may ensure that every part of η is approximated
arbitrarily well by unions of parts of ⋁_{i=0}^n T^{−i}ξ. Intuitively, that means that
H_µ(η | ⋁_{i=0}^n T^{−i}ξ) is very small, since ⋁_{i=0}^n T^{−i}ξ is nearly finer than η.
Exercise 10.24. Prove the result rigorously by analyzing the definition of the conditional
information.
10.6. Examples.
Example 10.25. We consider two-sided Bernoulli shifts σ with k symbols s₁, ..., s_k and parameters
(p₁, ..., p_k). Then we claim that
    h_µ(σ) = −Σ_{i=1}^k p_i log p_i.
Indeed, consider the partition ξ obtained by separating elements according to the value of x₀:
    ξ = { {x : x₀ = s_i} : 1 ≤ i ≤ k }.
Then it is easy to see that this ξ is a two-sided generator, and computation shows that
    h_µ(σ, ξ) = −Σ_i p_i log p_i.
In fact, we have the following classification theorem.
Theorem 10.26 (Ornstein). Entropy is a complete invariant for two-sided Bernoulli shifts.
There are non-obvious numerical identifications, e.g. (1/2, 1/2) ≁ (1/3, 1/3, 1/3) but
(1/4, 1/4, 1/4, 1/4) ∼ (1/2, 1/8, 1/8, 1/8, 1/8), since both have entropy log 4.
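The arithmetic behind these identifications is immediate to verify:

```python
from math import log

def entropy(ps):
    """Entropy -sum p_i log p_i of the Bernoulli shift with weights ps."""
    assert abs(sum(ps) - 1) < 1e-12
    return -sum(p * log(p) for p in ps)

h2 = entropy([1/2, 1/2])                    # log 2
h3 = entropy([1/3, 1/3, 1/3])               # log 3
h4 = entropy([1/4, 1/4, 1/4, 1/4])          # log 4
hm = entropy([1/2, 1/8, 1/8, 1/8, 1/8])     # (1/2)log 2 + (1/2)log 8 = log 4
```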
Remark 10.27. However, for one-sided shifts entropy is not a complete invariant. Intu-
itively, isomorphic one-sided shifts should have the same numbers of symbols, because
if there are k symbols then the map is k : 1.
Remark 10.28. What made the calculation of entropy for the full shift feasible was that
for the full shift, the partitions ξ, T^{−1}ξ, T^{−2}ξ, ... are independent, i.e.
    µ( T^{−j₁}A_{i₁} ∩ ⋯ ∩ T^{−j_k}A_{i_k} ) = Π_m µ(A_{i_m})
for distinct j₁, ..., j_k. In this special setting, you don’t have to calculate anything because the entropy of the
join is automatically the sum of the entropies:
    H_µ( ⋁_{i=0}^{n−1} T^{−i}ξ ) = Σ_{i=0}^{n−1} H_µ(T^{−i}ξ) = n·H_µ(ξ).
Therefore,
    lim_{n→∞} H_µ(ξ ∨ ⋯ ∨ T^{−(n−1)}ξ)/n = H_µ(ξ).
In fact, any invertible map with an “independent generator” is isomorphic to a two-sided Bernoulli shift.
11. MEASURES OF MAXIMAL ENTROPY

11.1. Examples.

Example 11.1. Consider the map T_p : S¹ → S¹ sending z ↦ z^p. The Lebesgue measure has entropy log p. We claim that any other invariant measure has entropy strictly less than log p (so the
Lebesgue measure is the unique measure of maximal entropy).
Indeed, let ξ be the partition {[0, 1/p), [1/p, 2/p), ..., [(p−1)/p, 1)}. Note that ⋁_{i=0}^{n−1} T^{−i}ξ
is precisely the partition consisting of (the p^n) intervals of the form [j/p^n, (j+1)/p^n), so its entropy
with respect to the Lebesgue measure m is log(p^n) = n log p. It is clear that ξ is a generator for m, so
    h_m(T) = lim_{n→∞} H_m( ⋁_{i=0}^{n−1} T^{−i}ξ )/n = log p.
In fact, ξ generates for any T-invariant Borel measure µ, because any measurable set
can be approximated by such intervals. Therefore, h_µ(T) = h_µ(T, ξ), and by definition
    h_µ(T, ξ) = inf_{n≥1} H_µ(ξ ∨ T^{−1}ξ ∨ ⋯ ∨ T^{−(n−1)}ξ)/n ≤ (n log p)/n = log p.
The equality case H_µ(ξ ∨ ⋯ ∨ T^{−(n−1)}ξ) = log(p^n) for all n implies µ([j/p^n, (j+1)/p^n]) = 1/p^n, and
this implies that µ is the Lebesgue measure.
Example 11.2. Let’s consider the hyperbolic toral automorphisms, given by A ∈ SL₂(Z)
with eigenvalues having absolute value different from 1. One can check that T is a homeomorphism of the torus that is expansive, which is morally why it works in this case. (Remark: given such a homeomorphism, it is easy to find a generating partition.
Indeed, take any partition with diameter less than the expansivity constant δ, and it will be a generator.)
Theorem 11.3. If m is the Lebesgue measure, then T has entropy h m (T ) = log ρ where ρ is
the eigenvalue of A greater than 1.
In fact, we will show that for any µ one has h µ (T ) ≤ log ρ, with equality holding for the
Lebesgue measure, so again the Lebesgue measure has maximal entropy.
Proof. The goal is to find a particularly nice partition ξ, from which we can calculate
the entropy. In this case we can choose a partition consisting of rectangles with edges
parallel to the eigenvectors v + and v − with eigenvalues ρ and 1/ρ.
Then T^{−1}ξ consists of rectangles with edges parallel to v₊ and v₋, but contracted by
ρ along the v₊ direction and expanded by ρ along the v₋ direction; Tξ has the opposite
effect, contracting along v₋ and expanding along v₊. Thus ⋁_{i=−n}^{n} T^{−i}ξ consists of a
mesh of rectangles with length and width comparable to ρ^{−n}. In particular, we see that ξ is a two-sided generator. Therefore,
    h_m(T) = h_m(T, ξ) = lim_{n→∞} H_m( ⋁_{i=−n}^{n} T^{−i}ξ )/(2n + 1).
Now, the side lengths of the rectangles in ⋁_{i=−n}^{n} T^{−i}ξ lie in [c₁ρ^{−n}, c₂ρ^{−n}] for some constants c₁, c₂
independent of n, so each rectangle has measure between c₁²ρ^{−2n} and c₂²ρ^{−2n}, and hence
    −2 log c₂ + 2n log ρ ≤ H_m( ⋁_{i=−n}^{n} T^{−i}ξ ) ≤ −2 log c₁ + 2n log ρ.
Dividing by 2n + 1 and taking the limit, we see that necessarily h_m(T) = log ρ.
Again, for equality to hold we need that all these rectangles have essentially the same
measure, which recovers the Lebesgue measure.
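For the standard example A = ( 2 1 ; 1 1 ), the expanding eigenvalue ρ and the resulting entropy log ρ are easy to compute:

```python
from math import log, sqrt

# Eigenvalues of A = [[2, 1], [1, 1]]: trace 3, determinant 1, so
# lambda = (3 +/- sqrt(5)) / 2.  The expanding one is rho = (3 + sqrt(5))/2.
tr, det = 3, 1
rho = (tr + sqrt(tr * tr - 4 * det)) / 2
h = log(rho)                     # entropy of the Lebesgue measure
```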
Therefore,
    sup_{1≤a<b≤N} µ(T^{b−a}E ∩ E) ≥ (N²µ(E)² − N)/(N(N−1)) = (N/(N−1))µ(E)² − 1/(N−1).
Letting N → ∞ gives the desired result.
Solution to Exercise 5.11. Let A_n = T^{−n}(A). We regard u_T^n χ_A = χ_{A_n} ∈ L²(X, µ). By assumption,
    ∫_X χ_{A_n} dµ = µ(A) =: α
for each n, and
    lim_{n→∞} 〈χ_{A_n}, χ_{A_m}〉 = lim_{n→∞} µ(T^{−n}A ∩ T^{−m}A) = α².
Therefore, if we set f_n = χ_{A_n} − α then we have
    lim_{n→∞} 〈f_n, f_m〉 = lim_{n→∞} 〈χ_{A_n}, χ_{A_m}〉 − α² = 0.
We claim that this implies that lim_{n→∞} 〈f_n, g〉 = 0 for all g ∈ L²(X, µ). Indeed, this is true
for g in the closure of the subspace generated by the f_k, and also on its orthogonal complement
by definition. Then taking g = χ_B, we find that
    0 = lim_{n→∞} 〈f_n, g〉 = lim_{n→∞} ( ∫_B χ_{A_n} dµ − α·µ(B) ) = lim_{n→∞} ( µ(T^{−n}A ∩ B) − µ(A)µ(B) ).
and also
    A_{M,N}(f) = (1/N) Σ_{n=M}^{M+N−1} u_T^n(f).
We know that A_N(f) → P_T(f). Actually, observe that
    ||A_{M,N}(f) − P_T(f)|| = ||u_T^M(A_N(f)) − P_T(f)||
                            = ||u_T^M(A_N(f)) − u_T^M(P_T(f))||
                            = ||A_N(f) − P_T(f)||.
It now suffices to show that
    ∫_B P_T(f) dµ ≥ µ(B)².
Indeed, suppose this to be the case. Then for N large enough, we have
    ∫_B A_{M,N}(f) dµ ≥ µ(B)² − ε.
To prove the claim, take f = χ_B. By Cauchy-Schwarz,
    ∫ ( Σ_{n=1}^N u_T^n(f) )² dµ ≥ ( ∫ Σ_{n=1}^N u_T^n(f) dµ )² = N²µ(B)².
Expanding out the left-hand side, we find
    N·µ(B) + 2 Σ_{k=1}^{N−1} (N − k)·µ(T^{−k}(B) ∩ B) ≥ N²µ(B)².
Therefore,
    Σ_{n=1}^{N} n ∫_B A_n(f) dµ ≥ (N²µ(B)² − N·µ(B))/2.
Now we know that A_n(f) → P_T(f), so ∫_B A_n(f) dµ converges; dividing the above by N²/2 and
letting N → ∞ shows that the limit is at least µ(B)², which is the claim.
Solution to Exercise 5.7. (1) We have to show that
    lim_{n→∞} (1/n) Σ_{i=1}^n µ(T^{−i}A ∩ B) = µ(A)µ(B)
given that it holds for all sets in P. For any ε > 0, we can choose A′, B′ such that
µ(A′∆A) < ε and µ(B′∆B) < ε. Then
    (T^{−i}A ∩ B) ∆ (T^{−i}A′ ∩ B′) ⊂ (T^{−i}A ∆ T^{−i}A′) ∪ (B ∆ B′),
so
    |µ(T^{−i}A ∩ B) − µ(T^{−i}A′ ∩ B′)| < 2ε.
Therefore,
    |(1/n) Σ_{i=1}^n µ(T^{−i}A ∩ B) − (1/n) Σ_{i=1}^n µ(T^{−i}A′ ∩ B′)| < 2ε.
So both the left- and right-hand sides of the purported identity behave well under approximation with elements of P.
(2) By the same argument as above, the summand µ(T −i A ∩ B ) − µ(A)µ(B ) behaves
well under approximation by elements of P , and we know that for elements of P the
limit tends to 0.