
Reversible Markov Chains and Random Walks on Graphs

David Aldous and James Allen Fill

Unfinished monograph, 2002 (this is a recompiled version, 2014)


Contents

1 Introduction (July 20, 1999) 13


1.1 Word problems . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.1.1 Random knight moves . . . . . . . . . . . . . . . . . . 13
1.1.2 The white screen problem . . . . . . . . . . . . . . . . 13
1.1.3 Universal traversal sequences . . . . . . . . . . . . . . 14
1.1.4 How long does it take to shuffle a deck of cards? . . . 15
1.1.5 Sampling from high-dimensional distributions: Markov
chain Monte Carlo . . . . . . . . . . . . . . . . . . . . 15
1.1.6 Approximate counting of self-avoiding walks . . . . . . 16
1.1.7 Simulating a uniform random spanning tree . . . . . . 17
1.1.8 Voter model on a finite graph . . . . . . . . . . . . . . 17
1.1.9 Are you related to your ancestors? . . . . . . . . . . . 17
1.2 So what’s in the book? . . . . . . . . . . . . . . . . . . . . . . 18
1.2.1 Conceptual themes . . . . . . . . . . . . . . . . . . . . 18
1.2.2 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . 19
1.2.3 Contents and alternate reading . . . . . . . . . . . . . 19

2 General Markov Chains (September 10, 1999) 23


2.1 Notation and reminders of fundamental results . . . . . . . . 23
2.1.1 Stationary distribution and asymptotics . . . . . . . . 24
2.1.2 Continuous-time chains . . . . . . . . . . . . . . . . . 25
2.2 Identities for mean hitting times and occupation times . . . . 27
2.2.1 Occupation measures and stopping times . . . . . . . 27
2.2.2 Mean hitting time and related formulas . . . . . . . . 29
2.2.3 Continuous-time versions . . . . . . . . . . . . . . . . 34
2.3 Variances of sums . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Two metrics on distributions . . . . . . . . . . . . . . . . . . 36
2.4.1 Variation distance . . . . . . . . . . . . . . . . . . . . 36
2.4.2 L2 distance . . . . . . . . . . . . . . . . . . . . . . . . 39


2.4.3 Exponential tails of hitting times . . . . . . . . . . . . 41


2.5 Distributional identities . . . . . . . . . . . . . . . . . . . . . 42
2.5.1 Stationarity consequences . . . . . . . . . . . . . . . . 42
2.5.2 A generating function identity . . . . . . . . . . . . . 43
2.5.3 Distributions and continuization . . . . . . . . . . . . 44
2.6 Matthews’ method for cover times . . . . . . . . . . . . . . . 45
2.7 New chains from old . . . . . . . . . . . . . . . . . . . . . . . 47
2.7.1 The chain watched only on A . . . . . . . . . . . . . . 47
2.7.2 The chain restricted to A . . . . . . . . . . . . . . . . 48
2.7.3 The collapsed chain . . . . . . . . . . . . . . . . . . . 48
2.8 Miscellaneous methods . . . . . . . . . . . . . . . . . . . . . . 49
2.8.1 Martingale methods . . . . . . . . . . . . . . . . . . . 49
2.8.2 A comparison argument . . . . . . . . . . . . . . . . . 51
2.8.3 Wald equations . . . . . . . . . . . . . . . . . . . . . . 52
2.9 Notes on Chapter 2. . . . . . . . . . . . . . . . . . . . . . . . 52
2.10 Move to other chapters . . . . . . . . . . . . . . . . . . . . . . 55
2.10.1 Attaining distributions at stopping times . . . . . . . 55
2.10.2 Differentiating stationary distributions . . . . . . . . . 55

3 Reversible Markov Chains (September 10, 2002) 57


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.1.1 Time-reversals and cat-and-mouse games . . . . . . . 59
3.1.2 Entrywise ordered transition matrices . . . . . . . . . 62
3.2 Reversible chains and weighted graphs . . . . . . . . . . . . . 63
3.2.1 The fluid model . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Electrical networks . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.1 Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3.2 The analogy . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.3 Mean commute times . . . . . . . . . . . . . . . . . . 70
3.3.4 Foster’s theorem . . . . . . . . . . . . . . . . . . . . . 71
3.4 The spectral representation . . . . . . . . . . . . . . . . . . . 72
3.4.1 Mean hitting times and reversible chains . . . . . . . . 75
3.5 Complete monotonicity . . . . . . . . . . . . . . . . . . . . . 77
3.5.1 Lower bounds on mean hitting times . . . . . . . . . . 79
3.5.2 Smoothness of convergence . . . . . . . . . . . . . . . 81
3.5.3 Inequalities for hitting time distributions on subsets . 83
3.5.4 Approximate exponentiality of hitting times . . . . . . 85
3.6 Extremal characterizations of eigenvalues . . . . . . . . . . . 87
3.6.1 The Dirichlet formalism . . . . . . . . . . . . . . . . . 87
3.6.2 Summary of extremal characterizations . . . . . . . . 89

3.6.3 The extremal characterization of relaxation time . . . 89


3.6.4 Simple applications . . . . . . . . . . . . . . . . . . . . 91
3.6.5 Quasistationarity . . . . . . . . . . . . . . . . . . . . . 95
3.7 Extremal characterizations and mean hitting times . . . . . . 98
3.7.1 Thompson’s principle and leveling networks . . . . . . 100
3.7.2 Hitting times and Thompson’s principle . . . . . . . . 102
3.8 Notes on Chapter 3 . . . . . . . . . . . . . . . . . . . . . . . . 108

4 Hitting and Convergence Time, and Flow Rate, Parameters for Reversible Markov Chains (October 11, 1994) 113

4.1 The maximal mean commute time τ . . . . . . . . . . . . . . 115
4.2 The average hitting time τ0 . . . . . . . . . . . . . . . . . . . 117
4.3 The variation threshold τ1 . . . . . . . . . . . . . . . . . . . . 119
4.3.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 119
4.3.2 Proof of Theorem 6 . . . . . . . . . . . . . . . . . . . 123
4.3.3 τ1 in discrete time, and algorithmic issues . . . . . . . 126
4.3.4 τ1 and mean hitting times . . . . . . . . . . . . . . . . 128
4.3.5 τ1 and flows . . . . . . . . . . . . . . . . . . . . . . . . 130
4.4 The relaxation time τ2 . . . . . . . . . . . . . . . . . . . . . . 131
4.4.1 Correlations and variances for the stationary chain . . 134
4.4.2 Algorithmic issues . . . . . . . . . . . . . . . . . . . . 137
4.4.3 τ2 and distinguished paths . . . . . . . . . . . . . . . . 139
4.5 The flow parameter τc . . . . . . . . . . . . . . . . . . . . . . 142
4.5.1 Definition and easy inequalities . . . . . . . . . . . . . 142
4.5.2 Cheeger-type inequalities . . . . . . . . . . . . . . . . 145
4.5.3 τc and hitting times . . . . . . . . . . . . . . . . . . . 146
4.6 Induced and product chains . . . . . . . . . . . . . . . . . . . 148
4.6.1 Induced chains . . . . . . . . . . . . . . . . . . . . . . 148
4.6.2 Product chains . . . . . . . . . . . . . . . . . . . . . . 149
4.6.3 Efron-Stein inequalities . . . . . . . . . . . . . . . . . 152
4.6.4 Why these parameters? . . . . . . . . . . . . . . . . . 153
4.7 Notes on Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . 154

5 Examples: Special Graphs and Trees (April 23 1996) 159


5.1 One-dimensional chains . . . . . . . . . . . . . . . . . . . . . 160
5.1.1 Simple symmetric random walk on the integers . . . . 160
5.1.2 Weighted linear graphs . . . . . . . . . . . . . . . . . . 162
5.1.3 Useful examples of one-dimensional chains . . . . . . . 165
5.2 Special graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.2.1 Biased walk on a balanced tree . . . . . . . . . . . . . 195

5.3 Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197


5.3.1 Parameters for trees . . . . . . . . . . . . . . . . . . . 200
5.3.2 Extremal trees . . . . . . . . . . . . . . . . . . . . . . 203
5.4 Notes on Chapter 5 . . . . . . . . . . . . . . . . . . . . . . . . 205

6 Cover Times (October 31, 1994) 207


6.1 The spanning tree argument . . . . . . . . . . . . . . . . . . . 208
6.2 Simple examples of cover times . . . . . . . . . . . . . . . . . 212
6.3 More upper bounds . . . . . . . . . . . . . . . . . . . . . . . . 214
6.3.1 Simple upper bounds for mean hitting times . . . . . . 215
6.3.2 Known and conjectured upper bounds . . . . . . . . . 216
6.4 Short-time bounds . . . . . . . . . . . . . . . . . . . . . . . . 217
6.4.1 Covering by multiple walks . . . . . . . . . . . . . . . 219
6.4.2 Bounding point probabilities . . . . . . . . . . . . . . 221
6.4.3 A cat and mouse game . . . . . . . . . . . . . . . . . . 222
6.5 Hitting time bounds and connectivity . . . . . . . . . . . . . 223
6.5.1 Edge-connectivity . . . . . . . . . . . . . . . . . . . . 224
6.5.2 Equivalence of mean cover time parameters . . . . . . 226
6.6 Lower bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.6.1 Matthews’ method . . . . . . . . . . . . . . . . . . . . 227
6.6.2 Balanced trees . . . . . . . . . . . . . . . . . . . . . . 227
6.6.3 A resistance lower bound . . . . . . . . . . . . . . . . 228
6.6.4 General lower bounds . . . . . . . . . . . . . . . . . . 229
6.7 Distributional aspects . . . . . . . . . . . . . . . . . . . . . . 231
6.8 Algorithmic aspects . . . . . . . . . . . . . . . . . . . . . . . 232
6.8.1 Universal traversal sequences . . . . . . . . . . . . . . 232
6.8.2 Graph connectivity algorithms . . . . . . . . . . . . . 233
6.8.3 A computational question . . . . . . . . . . . . . . . . 233
6.9 Notes on Chapter 6 . . . . . . . . . . . . . . . . . . . . . . . . 233

7 Symmetric Graphs and Chains (January 31, 1994) 237


7.1 Symmetric reversible chains . . . . . . . . . . . . . . . . . . . 238
7.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 238
7.1.2 This section goes into Chapter 3 . . . . . . . . . . . . 240
7.1.3 Elementary properties . . . . . . . . . . . . . . . . . . 240
7.1.4 Hitting times . . . . . . . . . . . . . . . . . . . . . . . 241
7.1.5 Cover times . . . . . . . . . . . . . . . . . . . . . . . . 242
7.1.6 Product chains . . . . . . . . . . . . . . . . . . . . . . 246
7.1.7 The cutoff phenomenon and the upper bound lemma . 248
7.1.8 Vertex-transitive graphs and Cayley graphs . . . . . . 249

7.1.9 Comparison arguments for eigenvalues . . . . . . . . . 252


7.2 Arc-transitivity . . . . . . . . . . . . . . . . . . . . . . . . . . 254
7.2.1 Card-shuffling examples . . . . . . . . . . . . . . . . . 255
7.2.2 Cover times for the d-dimensional torus Z_N^d . . . . . . 257
7.2.3 Bounds for the parameters . . . . . . . . . . . . . . . 259
7.2.4 Group-theory set-up . . . . . . . . . . . . . . . . . . . 259
7.3 Distance-regular graphs . . . . . . . . . . . . . . . . . . . . . 259
7.3.1 Exact formulas . . . . . . . . . . . . . . . . . . . . . . 260
7.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 262
7.3.3 Monotonicity properties . . . . . . . . . . . . . . . . . 262
7.3.4 Extremal distance-regular graphs . . . . . . . . . . . . 263
7.3.5 Gelfand pairs and isotropic flights . . . . . . . . . . . 263
7.4 Notes on Chapter 7 . . . . . . . . . . . . . . . . . . . . . . . . 263

8 Advanced L2 Techniques for Bounding Mixing Times (May 19, 1999) 267
8.1 The comparison method for eigenvalues . . . . . . . . . . . . 270
8.2 Improved bounds on L2 distance . . . . . . . . . . . . . . . . 278
8.2.1 Lq norms and operator norms . . . . . . . . . . . . . . 278
8.2.2 A more general bound on L2 distance . . . . . . . . . 280
8.2.3 Exact computation of N (s) . . . . . . . . . . . . . . . 284
8.3 Nash inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 287
8.3.1 Nash inequalities and mixing times . . . . . . . . . . . 288
8.3.2 The comparison method for bounding N (·) . . . . . . 290
8.4 Logarithmic Sobolev inequalities . . . . . . . . . . . . . . . . 292
8.4.1 The log-Sobolev time τl . . . . . . . . . . . . . . . . . 292
8.4.2 τl , mixing times, and hypercontractivity . . . . . . . . 294
8.4.3 Exact computation of τl . . . . . . . . . . . . . . . . . 298
8.4.4 τl and product chains . . . . . . . . . . . . . . . . . . 302
8.4.5 The comparison method for bounding τl . . . . . . . . 304
8.5 Combining the techniques . . . . . . . . . . . . . . . . . . . . 306
8.6 Notes on Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . 307

9 A Second Look at General Markov Chains (April 21, 1995) 309


9.1 Minimal constructions and mixing times . . . . . . . . . . . . 309
9.1.1 Strong stationary times . . . . . . . . . . . . . . . . . 311
9.1.2 Stopping times attaining a specified distribution . . . 312
9.1.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . 314
9.2 Markov chains and spanning trees . . . . . . . . . . . . . . . 316
9.2.1 General Chains and Directed Weighted Graphs . . . . 316

9.2.2 Electrical network theory . . . . . . . . . . . . . . . . 319


9.3 Self-verifying algorithms for sampling from a stationary dis-
tribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
9.3.1 Exact sampling via the Markov chain tree theorem . . 322
9.3.2 Approximate sampling via coalescing paths . . . . . . 323
9.3.3 Exact sampling via backwards coupling . . . . . . . . 324
9.4 Making reversible chains from irreversible chains . . . . . . . 326
9.4.1 Mixing times . . . . . . . . . . . . . . . . . . . . . . . 326
9.4.2 Hitting times . . . . . . . . . . . . . . . . . . . . . . . 327
9.5 An example concerning eigenvalues and mixing times . . . . . 329
9.6 Miscellany . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
9.6.1 Mixing times for irreversible chains . . . . . . . . . . . 331
9.6.2 Balanced directed graphs . . . . . . . . . . . . . . . . 331
9.6.3 An absorption time problem . . . . . . . . . . . . . . . 332
9.7 Notes on Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . 332

10 Some Graph Theory and Randomized Algorithms (September 1, 1999) 335
10.1 Expanders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
10.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . 336
10.1.2 Random walk on expanders . . . . . . . . . . . . . . . 337
10.1.3 Counter-example constructions . . . . . . . . . . . . . 338
10.2 Eigenvalues and graph theory . . . . . . . . . . . . . . . . . . 339
10.2.1 Diameter of a graph . . . . . . . . . . . . . . . . . . . 339
10.2.2 Paths avoiding congestion . . . . . . . . . . . . . . . . 340
10.3 Randomized algorithms . . . . . . . . . . . . . . . . . . . . . 342
10.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . 342
10.3.2 Overview of randomized algorithms using random walks
or Markov chains . . . . . . . . . . . . . . . . . . . . . 344
10.4 Miscellaneous graph algorithms . . . . . . . . . . . . . . . . . 344
10.4.1 Amplification of randomness . . . . . . . . . . . . . . 344
10.4.2 Using random walk to define an objective function . . 346
10.4.3 Embedding trees into the d-cube . . . . . . . . . . . . 347
10.4.4 Comparing on-line and off-line algorithms . . . . . . . 349
10.5 Approximate counting via Markov chains . . . . . . . . . . . 351
10.5.1 Volume of a convex set . . . . . . . . . . . . . . . . . . 353
10.5.2 Matchings in a graph . . . . . . . . . . . . . . . . . . 353
10.5.3 Simulating self-avoiding walks . . . . . . . . . . . . . . 354
10.6 Notes on Chapter 10 . . . . . . . . . . . . . . . . . . . . . 355
10.7 Material belonging in other chapters . . . . . . . . . . . . . . 358

10.7.1 Large deviation bounds . . . . . . . . . . . . . . . . . 358


10.7.2 The probabilistic method in combinatorics . . . . . . . 358
10.7.3 copied to Chapter 4 section 6.5 . . . . . . . . . . . . . 358

11 Markov Chain Monte Carlo (January 8 2001) 361


11.1 Overview of Applied MCMC . . . . . . . . . . . . . . . . . . 361
11.1.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 361
11.1.2 Further aspects of applied MCMC . . . . . . . . . . . 366
11.2 The two basic schemes . . . . . . . . . . . . . . . . . . . . . . 369
11.2.1 Metropolis schemes . . . . . . . . . . . . . . . . . . . . 369
11.2.2 Line-sampling schemes . . . . . . . . . . . . . . . . . . 370
11.3 Variants of basic MCMC . . . . . . . . . . . . . . . . . . . . . 371
11.3.1 Metropolized line sampling . . . . . . . . . . . . . . . 371
11.3.2 Multiple-try Metropolis . . . . . . . . . . . . . . . . . 372
11.3.3 Multilevel sampling . . . . . . . . . . . . . . . . . . . 373
11.3.4 Multiparticle MCMC . . . . . . . . . . . . . . . . . . . 375
11.4 A little theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 375
11.4.1 Comparison methods . . . . . . . . . . . . . . . . . . . 376
11.4.2 Metropolis with independent proposals . . . . . . . . . 377
11.5 The diffusion heuristic for optimal scaling of high dimensional
Metropolis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
11.5.1 Optimal scaling for high-dimensional product distri-
bution sampling . . . . . . . . . . . . . . . . . . . . . 378
11.5.2 The diffusion heuristic. . . . . . . . . . . . . . . . . . . 380
11.5.3 Sketch proof of Theorem . . . . . . . . . . . . . . . . . 381
11.6 Other theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
11.6.1 Sampling from log-concave densities . . . . . . . . . . 382
11.6.2 Combining MCMC with slow exact sampling . . . . . 383
11.7 Notes on Chapter 11 . . . . . . . . . . . . . . . . . . . . . 383
11.8 Belongs in other chapters . . . . . . . . . . . . . . . . . . . . 385
11.8.1 Pointwise ordered transition matrices . . . . . . . . . 385

12 Coupling Theory and Examples (October 11, 1999) 387


12.1 Using coupling to bound variation distance . . . . . . . . . . 387
12.1.1 The coupling inequality . . . . . . . . . . . . . . . . . 388
12.1.2 Comments on coupling methodology . . . . . . . . . . 388
12.1.3 Random walk on a dense regular graph . . . . . . . . 390
12.1.4 Continuous-time random walk on the d-cube . . . . . 391
12.1.5 The graph-coloring chain . . . . . . . . . . . . . . . . 392
12.1.6 Permutations and words . . . . . . . . . . . . . . . . . 393

12.1.7 Card-shuffling by random transpositions . . . . . . . . 395


12.1.8 Reflection coupling on the n-cycle . . . . . . . . . . . 396
12.1.9 Card-shuffling by random adjacent transpositions . . . 397
12.1.10 Independent sets . . . . . . . . . . . . . . . . . . . . . 398
12.1.11 Two base chains for genetic algorithms . . . . . . . . . 400
12.1.12 Path coupling . . . . . . . . . . . . . . . . . . . . . . . 402
12.1.13 Extensions of a partial order . . . . . . . . . . . . . . 404
12.2 Notes on Chapter 12 . . . . . . . . . . . . . . . . . . . . . 405

13 Continuous State, Infinite State and Random Environment (June 23, 2001) 409
13.1 Continuous state space . . . . . . . . . . . . . . . . . . . . . . 409
13.1.1 One-dimensional Brownian motion and variants . . . . 409
13.1.2 d-dimensional Brownian motion . . . . . . . . . . . . . 413
13.1.3 Brownian motion in a convex set . . . . . . . . . . . . 413
13.1.4 Discrete-time chains: an example on the simplex . . . 416
13.1.5 Compact groups . . . . . . . . . . . . . . . . . . . . . 419
13.1.6 Brownian motion on a fractal set . . . . . . . . . . . . 420
13.2 Infinite graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 421
13.2.1 Set-up . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
13.2.2 Recurrence and Transience . . . . . . . . . . . . . . . 423
13.2.3 The finite analog of transience . . . . . . . . . . . . . 425
13.2.4 Random walk on Z^d . . . . . . . . . . . . . . . . . . 425
13.2.5 The torus Z_m^d . . . . . . . . . . . . . . . . . . . . . 427

13.2.6 The infinite degree-r tree . . . . . . . . . . . . . . . . 431


13.2.7 Generating function arguments . . . . . . . . . . . . . 432
13.2.8 Comparison arguments . . . . . . . . . . . . . . . . . . 433
13.2.9 The hierarchical tree . . . . . . . . . . . . . . . . . . . 435
13.2.10 Towards a classification theory for sequences of finite
chains . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
13.3 Random Walks in Random Environments . . . . . . . . . . . 441
13.3.1 Mixing times for some random regular graphs . . . . . 441
13.3.2 Randomizing infinite trees . . . . . . . . . . . . . . . . 444
13.3.3 Bias and speed . . . . . . . . . . . . . . . . . . . . . . 446
13.3.4 Finite random trees . . . . . . . . . . . . . . . . . . . 447
13.3.5 Randomly-weighted random graphs . . . . . . . . . . . 449
13.3.6 Random environments in d dimensions . . . . . . . . . 450
13.4 Notes on Chapter 13 . . . . . . . . . . . . . . . . . . . . . . . 451

14 Interacting Particles on Finite Graphs (March 10, 1994) 455


14.1 Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
14.1.1 The coupling inequality . . . . . . . . . . . . . . . . . 457
14.1.2 Examples using the coupling inequality . . . . . . . . 458
14.1.3 Comparisons via couplings . . . . . . . . . . . . . . . . 460
14.2 Meeting times . . . . . . . . . . . . . . . . . . . . . . . . . . . 461
14.3 Coalescing random walks and the voter model . . . . . . . . . 464
14.3.1 A bound using the voter model . . . . . . . . . . . . . 466
14.3.2 A bound using the coalescing random walk model . . 468
14.3.3 Conjectures and examples . . . . . . . . . . . . . . . . 469
14.3.4 Voter model with new opinions . . . . . . . . . . . . . 470
14.3.5 Large component sizes in the voter model with new
opinions . . . . . . . . . . . . . . . . . . . . . . . . . . 473
14.3.6 Number of components in the voter model with new
opinions . . . . . . . . . . . . . . . . . . . . . . . . . . 474
14.4 The antivoter model . . . . . . . . . . . . . . . . . . . . . . . 474
14.4.1 Variances in the antivoter model . . . . . . . . . . . . 475
14.4.2 Examples and Open Problems . . . . . . . . . . . . . 478
14.5 The interchange process . . . . . . . . . . . . . . . . . . . . . 481
14.5.1 Card-shuffling interpretation . . . . . . . . . . . . . . 483
14.6 Other interacting particle models . . . . . . . . . . . . . . . . 483
14.6.1 Product-form stationary distributions . . . . . . . . . 483
14.6.2 Gaussian families of occupation measures . . . . . . . 484
14.7 Other coupling examples . . . . . . . . . . . . . . . . . . . . . 485
14.7.1 Markov coupling may be inadequate . . . . . . . . . . 488
14.8 Notes on Chapter 14 . . . . . . . . . . . . . . . . . . . . . . . 489
Chapter 1

Introduction (July 20, 1999)

We start in section 1.1 with some “word problems”, intended to provide some extrinsic motivation for the general field of this book. In section 1.2 we talk about the broad conceptual themes of the book, partly illustrated by the word problems, and then outline the actual contents and give advice to readers.

1.1 Word problems


1.1.1 Random knight moves
Imagine a knight on a corner square of an otherwise empty chessboard.
Move the knight by choosing at random from the legal knight-moves. What
is the mean time until the knight first returns to the starting square?
At first sight this looks like a messy problem which will require numerical
calculations. But the knight is moving as random walk on a finite graph
(rather than just some more general Markov chain), and elementary theory
reduces the problem to counting the number of edges of the graph, giving the
answer of 168 moves. See Chapter 3 yyy.
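The reduction is easy to check by direct computation: for random walk on a graph, the stationary probability of a vertex v is deg(v)/2|E|, so the mean return time to v is 2|E|/deg(v). A minimal sketch, assuming the standard 8 × 8 board:

```python
# Mean return time for random walk on a graph is 2|E| / deg(v).
# Vertices are the 64 squares; edges are legal knight moves.
MOVES = [(1, 2), (2, 1), (2, -1), (1, -2),
         (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def degree(r, c):
    """Number of legal knight moves from square (r, c)."""
    return sum(1 for dr, dc in MOVES
               if 0 <= r + dr < 8 and 0 <= c + dc < 8)

# Summing degrees over all squares counts each edge twice, giving 2|E|.
two_E = sum(degree(r, c) for r in range(8) for c in range(8))
corner = degree(0, 0)          # a corner square has exactly 2 knight moves
print(two_E // corner)         # 336 / 2 = 168
```

The total degree of the knight graph is 336, and a corner square has degree 2, recovering the answer of 168 moves.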

1.1.2 The white screen problem


Around 1980 I wrote a little Basic program that would display
a random walk on the screen of my home computer. First, a
pixel in the middle of the screen was lit up. Then one of the
four directions N,E,W,S was selected uniformly at random and
the walk proceeded one step in the chosen direction. That new
pixel was lit up on the screen, and the process was repeated


from the new point, etc. For a while, the walk is almost always
quickly visiting pixels it hasn’t visited before, so one sees an
irregular pattern that grows in the center of the screen. After
quite a long while, when the screen is perhaps 95% illuminated,
the growth process will have slowed down tremendously, and the
viewer can safely go read War and Peace without missing any
action. After a minor eternity every cell will have been visited.
Any mathematician will want to know how long, on the average,
it takes until each pixel has been visited. Edited from Wilf [336].

Taking the screen to be m × m pixels, we have a random walk on the discrete two-dimensional torus Z_m^2, and the problem asks for the mean cover time, that is, the time to visit every vertex of the graph. Such questions have been studied for general graphs (see Chapter 6), though ironically this particular case of the two-dimensional torus is the hardest special graph. It is known that the mean cover time is asymptotically at most 4π^{-1} m^2 log^2 m, and conjectured that this is asymptotically correct (see Chapter 7 yyy). For m = 512 this works out to be about 13 million. Of course, in accordance with Moore’s Law, what took a minor eternity in 1980 takes just a few seconds today.
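The walk itself is trivial to simulate; the sketch below (with a hypothetical `cover_time` helper and a small m, since the simulation is slow for m = 512) measures the cover time of the torus Z_m^2 and prints the 4π^{-1} m^2 log^2 m value for comparison:

```python
import math
import random

def cover_time(m, rng):
    """Steps for simple random walk on the m x m torus Z_m^2
    to visit every vertex at least once."""
    x, y = 0, 0
    visited = {(0, 0)}
    steps = 0
    while len(visited) < m * m:
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        x, y = (x + dx) % m, (y + dy) % m
        visited.add((x, y))
        steps += 1
    return steps

m = 16
bound = 4 / math.pi * m ** 2 * math.log(m) ** 2
print(cover_time(m, random.Random(0)), round(bound))
```

A single run is of course random; averaging many runs brings the empirical mean toward the asymptotic value.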

1.1.3 Universal traversal sequences


Let S(n, d) be the set of all d-regular graphs G with n vertices and with
the edges at each vertex labeled (1, 2, . . . , d). A universal traversal sequence
i_1, i_2, . . . , i_u ∈ {1, . . . , d} is a sequence that satisfies

for each G ∈ S(n, d) and each initial vertex of G, the deterministic walk “at step t choose edge i_t” visits every vertex.

What is the shortest length u = u(n, d) of such a sequence?


To get a partial answer, instead of trying to be clever about picking the
sequence, consider what happens if we just choose i_1, i_2, . . . uniformly at
random. Then the walk on a graph G is just simple random walk on G.
Using a result that the mean cover time on a regular graph is O(n^2) one
can show (see Chapter 6 yyy) that most sequences of length O(dn^3 log n)
are universal traversal sequences.
Paradoxically, no explicit example of a universal traversal sequence this
short is known. The argument above fits a general theme that probabilistic
methods can be useful in combinatorics to establish the existence of objects
which are hard to exhibit constructively: numerous examples are in the
monograph by Alon and Spencer [29].

1.1.4 How long does it take to shuffle a deck of cards?


Repeated random shuffles of a d-card deck may be modeled as a Markov
chain on the space of all d! possible configurations of the deck. Different
physical methods of shuffling correspond to different chains. The model for
the most common method, riffle shuffle, is described carefully in Chapter
9 (xxx section to be written). A mathematically simpler method is top-
to-random, in which the top card is reinserted at one of the d possible
positions, chosen uniformly at random (Chapter 9 section yyy). Giving a
precise mathematical interpretation to the question
how many steps of the chain (corresponding to a specified phys-
ical shuffle) are needed until the distribution of the deck is ap-
proximately uniform (over all d! configurations)?
is quite subtle; we shall formalize different interpretations as different mixing
times, and relations between mixing times are discussed in Chapter 4 for
reversible chains and in Chapter 8 (xxx section to be written) for general
chains. Our favorite formalization is via the variation threshold time τ1 , and
it turns out that
τ1 ∼ (3/2) log_2 d    (riffle shuffle)    (1.1)
τ1 ∼ d log d    (top-to-random shuffle).

For the usual deck with d = 52 these suggest 8 and 205 shuffles respectively.
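Plugging d = 52 into (1.1) is a one-line computation; the figures 8 and 205 come out as:

```python
import math

d = 52
riffle = 1.5 * math.log2(d)        # tau_1 ~ (3/2) log_2 d
top_to_random = d * math.log(d)    # tau_1 ~ d log d
print(round(riffle, 2), round(top_to_random))   # 8.55 205
```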

1.1.5 Sampling from high-dimensional distributions: Markov


chain Monte Carlo
Suppose you have a function f : R^d → [0, ∞) with κ := ∫_{R^d} f(x) dx < ∞, where f is given by some explicit but maybe complicated formula. How can you devise a scheme to sample a random point in R^d with the normalized probability density f(x)/κ?
For d = 1 the elementary “inverse distribution function” trick is avail-
able, and for small d simple acceptance/rejection methods are often prac-
tical. For large d the most popular method is some form of Markov chain
Monte Carlo (MCMC) method, and this specific d-dimensional sampling
problem is a prototype problem for MCMC methods. The scheme is to de-
sign a chain to have stationary distribution f (x)/κ. A simple such chain is
as follows. From a point x, the next point X_1 is chosen by a two-step procedure. First choose Y from some reference distribution (e.g. multivariate Normal with specified variance, or uniform on a sphere of specified radius) on R^d; then set X_1 = x + Y with probability min(1, f(x + Y)/f(x)) and set X_1 = x with the remaining probability.
Routine theory says that the stationary density is indeed f (x)/κ and
that as t → ∞ the distribution of the chain after t steps converges to this
stationary distribution. So a heuristic algorithm for the sampling problem
is
Choose a starting point, a reference distribution and a number
t of steps, simulate the chain for t steps, and output the state of
the chain after t steps.
To make a rigorous algorithm one needs to know how many steps are needed
to guarantee closeness to stationarity; this is a mixing time question. The
conceptual issues here are discussed in Chapter 11. Despite a huge litera-
ture on methodology and applications of MCMC in many different settings,
rigorous results are rather scarce. A notable exception is the sampling setting above where log f is a concave function, for which there exist complicated results (outlined in Chapter 11 xxx to be written) proving that a polynomial (in d) number of steps suffices.
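The two-step procedure described above is the classical random-walk Metropolis scheme; a minimal sketch, assuming a Gaussian reference distribution and a user-supplied unnormalized density f:

```python
import math
import random

def metropolis_step(x, f, rng, scale=1.0):
    """One step of random-walk Metropolis targeting the density f(x)/kappa.
    Propose Y with independent Normal(0, scale^2) coordinates; accept
    x + Y with probability min(1, f(x + Y)/f(x)), else stay at x."""
    y = [xi + rng.gauss(0.0, scale) for xi in x]
    if rng.random() < min(1.0, f(y) / f(x)):
        return y
    return x

# Example: sample (approximately) from a standard 2-d Gaussian.
f = lambda x: math.exp(-0.5 * sum(xi * xi for xi in x))
rng = random.Random(0)
x = [5.0, 5.0]                     # a deliberately bad starting point
for _ in range(10_000):
    x = metropolis_step(x, f, rng)
print(x)                           # should resemble a typical N(0, I) draw
```

How large the step count must be for the output to be trustworthy is exactly the mixing time question raised above.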

1.1.6 Approximate counting of self-avoiding walks


A self-avoiding walk (SAW) of length l in the lattice Z^d is a walk 0 = v_0, v_1, v_2, . . . , v_l for which the v_i are distinct and successive pairs (v_i, v_{i+1}) are adjacent. Understanding the l → ∞ asymptotics of the cardinality |S_l| of the set S_l of SAWs of length l (dimension d = 3 is the most interesting case) is a famous open problem. A conceptual insight is that, for large l, the problem

find an algorithm which counts |S_l| approximately

can be reduced to the problem

find an algorithm which gives an approximately uniform random sample from S_l.

To explain, note that each walk in S_{l+1} is a one-step extension of some walk in S_l. So the ratio |S_{l+1}|/|S_l| equals the mean number of extensions of a uniform random SAW from S_l, which of course can be estimated from the empirical average of the number of extensions of a large sample of SAWs from S_l.
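The identity |S_{l+1}|/|S_l| = mean number of one-step extensions can be verified by brute force for tiny l in Z^2 (a sketch only; the enumeration below is exponential in l):

```python
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def saws(l):
    """All self-avoiding walks of length l in Z^2 starting at the origin."""
    walks = [[(0, 0)]]
    for _ in range(l):
        walks = [w + [(w[-1][0] + dx, w[-1][1] + dy)]
                 for w in walks for dx, dy in STEPS
                 if (w[-1][0] + dx, w[-1][1] + dy) not in w]
    return walks

def extensions(w):
    """Number of one-step self-avoiding extensions of walk w."""
    return sum(1 for dx, dy in STEPS
               if (w[-1][0] + dx, w[-1][1] + dy) not in w)

S3 = saws(3)
mean_ext = sum(extensions(w) for w in S3) / len(S3)
print(len(S3), len(saws(4)), mean_ext)   # 36 100 2.777...
```

Here |S_4|/|S_3| = 100/36 agrees exactly with the mean number of extensions over S_3, because each walk in S_4 extends a unique walk in S_3.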
Similar schemes work for various other families (S_l) of combinatorial sets of increasing size, provided one has some explicit connection between S_l and S_{l+1}. As in the previous word problem, one can get an approximately uniform random sample by MCMC, i.e. by designing a chain whose stationary distribution is uniform and simulating a sufficiently large number of steps of the chain: in making a rigorous algorithm, the issue again reduces to bounding the mixing time of the chain. The case of SAWs is outlined in Chapter 11 section yyy.

1.1.7 Simulating a uniform random spanning tree


The last two word problems hinted at large classes of algorithmic problems;
here is a different, more specific problem. A finite connected graph G has a
finite number of spanning trees, and so it makes sense to consider a uniform
random spanning tree of G. How can one simulate this random tree?
It turns out there is an exact method, which involves running random
walk on G until every vertex v has been visited; then for each v (other than
the initial vertex) let the tree include the edge by which the walk first visited
v. This gives some kind of random spanning tree; it seems non-obvious that
the distribution is uniform, but that is indeed true. See Chapter 8 section
yyy.
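This first-entrance construction (the Aldous–Broder algorithm) is short enough to sketch; here `adj` is a hypothetical adjacency-list representation of G:

```python
import random

def uniform_spanning_tree(adj, start=0, rng=random):
    """Run random walk from `start` until every vertex is visited; for each
    vertex other than `start`, keep the edge by which it was first entered.
    The resulting tree is uniform over all spanning trees of the graph."""
    tree = []
    visited = {start}
    v = start
    while len(visited) < len(adj):
        w = rng.choice(adj[v])
        if w not in visited:
            visited.add(w)
            tree.append((v, w))
        v = w
    return tree

# Example: the complete graph K4 as adjacency lists.
K4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
print(uniform_spanning_tree(K4, rng=random.Random(0)))
```

Note the walk may take far longer than the cover time bound suggests on badly connected graphs, which is the cost of exactness here.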

1.1.8 Voter model on a finite graph


Consider a graph where each vertex is colored, initially with different colors.
Each vertex from time to time (precisely, at times of independent Poisson
processes of rate 1) picks an adjacent vertex at random and changes its color
to the color of the picked neighbor. Eventually, on a finite graph, all vertices
will have the same color: how long does this take?
This question turns out to be related (via a certain notion of duality) to
the following question. Imagine particles, initially one at each vertex, which
perform continuous-time random walk on the graph, but which coalesce
when they meet. Eventually they will all coalesce into one particle: how
long does this take? On the complete graph on n vertices, the mean time in
each question is ∼ n. See Chapter 10 section yyy.
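Here is a back-of-envelope version of the complete-graph calculation (a heuristic sketch, not an argument from the text: it treats the passage from k particles to k-1 as a single exponential event of rate about k(k-1)/n, which is only asymptotically right). The estimate telescopes to n - 1, consistent with the ∼ n answer.

```python
from fractions import Fraction

def heuristic_mean_coalescence_time(n):
    """On K_n, with k particles left, each ordered pair meets at rate roughly 2/n,
    so k particles coalesce to k-1 at rate about k(k-1)/n, taking mean time
    n/(k(k-1)); sum over k = n down to 2."""
    return sum(Fraction(n, k * (k - 1)) for k in range(2, n + 1))

# The sum telescopes: n * sum_{k=2}^{n} 1/(k(k-1)) = n(1 - 1/n) = n - 1.
```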

1.1.9 Are you related to your ancestors?


You have two parents, four grandparents and eight great-grandparents. In
other words, for small g ≥ 1

you have exactly 2^g g'th-generation ancestors, and you are related to each of them.

But what about larger g? Clearly you didn't have 2^120 ≈ 10^36 distinct
120'th-generation ancestors! Even taking g = 10, one can argue it's unlikely
you had 1,024 different 10th-generation ancestors, though the number is
likely only a bit smaller – say 1,000, in round numbers. Whether you are
actually related to these people is a subtle question. At the level of grade-school
genetics, you have 46 chromosomes, each a copy of one parental chromosome,
and hence each a copy of some 10th-generation ancestor’s chromosome. So
you’re genetically related to at most 46 of your 10th-generation ancestors.
Taking account of crossover during chromosome duplication leads to a more
interesting model, in which the issue is to estimate hitting probabilities in
a certain continuous-time reversible Markov chain. It turns out (Chapter
13 yyy) that the number of 10th-generation ancestors who are genetically
related to you is about 340. So you’re unlikely to be related to a particu-
lar 10th-generation ancestor, a fact which presents a curious sidebar to the
principle of hereditary monarchy.

1.2 So what’s in the book?

1.2.1 Conceptual themes

Classical mathematical probability focuses on time-asymptotics, describing


what happens if some random process runs for ever. In contrast, the word
problems each ask “how long until a chain does something?”, and the focus
of this book is on finite-time behavior. More precisely, the word problems
ask about hitting times, the time until a state or a set of states is first visited,
or until each state in a set is visited; or ask about mixing times, the number
of steps until the distribution is approximately the stationary distribution.
The card-shuffling problems (section 1.1.4) provide a very intuitive setting
for such questions; how many shuffles are needed, as a function of the size
of the deck, until the deck is well shuffled? Such size-asymptotic results, of
which (1.1) is perhaps the best-known, are one of the themes of this book.
Thus in one sense our work is in the spirit of the birthday and coupon-
collector’s problems in undergraduate probability; in another sense our goals
are reminiscent of those of computational complexity (P =? NP and all that),
which seeks to relate the time required to solve an algorithmic problem to
the size of the problem.
1.2.2 Prerequisites
The reader who has taken a first-year graduate course in mathematical prob-
ability will have no difficulty with the mathematical content of this book.
Though if the phrase “randomized algorithm” means nothing to you, then
it would be helpful to look at Motwani - Raghavan [265] to get some feeling
for the algorithmic viewpoint.
We have tried to keep much of the book accessible to readers whose math-
ematical background emphasizes discrete math and algorithms rather than
analysis and probability. The minimal background required is an undergrad-
uate course in probability including classical limit theory for finite Markov
chains. Graduate-level mathematical probability is usually presented within
the framework of measure theory, which (with some justification) is often
regarded as irrelevant “general abstract nonsense” by those interested in
concrete mathematics. We will point out as we go the pieces of graduate-
level probability that we use (e.g. martingale techniques, Wald’s identity,
weak convergence). Advice: if your research involves probability then you
should at some time see what’s taught in a good first-year-graduate course,
and we strongly recommend Durrett [133] for this purpose.

1.2.3 Contents and alternate reading


Amongst the numerous introductory accounts of Markov chains, Norris [270]
is closest to our style. That book, like the more concise treatment in Dur-
rett [133] Chapter 5, emphasizes probabilistic methods designed to work in
the countable-state setting. Matrix-based methods designed for the finite-
state setting are emphasised by Kemeny - Snell [214] and by Hunter [186].
We start in Chapter 2 by briskly reviewing standard asymptotic theory of
finite-state chains, and go on to a range of small topics less often empha-
sised: obtaining general identities from the reward-renewal theorem, and
useful metrics on distributions, for instance. Chapter 3 starts our system-
atic treatment of reversible chains: their identification as random walks on
weighted graphs, the analogy with electrical networks, the spectral repre-
sentation and its consequences for the structure of hitting time distribu-
tions, the Dirichlet formalism, extremal characterization of eigenvalues and
various mean hitting times. This material has not been brought together
before. Chen [88] gives a somewhat more advanced treatment of some of the
analytic techniques and their applications to infinite particle systems (also
overlapping partly with our Chapters 10 and 11), but without our finite-
time emphasis. Kelly [213] emphasizes stationary distributions of reversible
stochastic networks, Keilson [212] emphasizes structural properties such as


complete monotonicity, and Doyle - Snell [131] give a delightful elementary
treatment of the electrical network connection. Chapter 4 is the center-
piece of our attempt to create coherent intermediate-level theory. We give a
detailed analysis of different mixing times: the relaxation time (1/spectral
gap), the variation threshold (where variation distance becomes small, uni-
formly in initial state) and the Cheeger time constant (related to weighted
connectivity). We discuss relations between these times and their surpris-
ing connection with mean hitting times; the distinguished paths method for
bounding relaxation time, Cheeger-type inequalities, and how these param-
eters behave under operations on chains (watching only on a subset, taking
product chains). Little of this exists in textbooks, though Chung [93] gives a
more graph-theoretic treatment of Cheeger inequalities and of the advanced
analytic techniques in Chapter 12.
The rather technical Chapter 4 may seem tough going, but the payoff is
that subsequent chapters tend to “branch out” without developing further
theoretical edifices. Chapter 5 gives bare-hands treatments of numerous
examples of random walks on special graphs, and of two classes of chains
with special structure: birth-and-death chains, and random walks on trees.
Chapter 6 treats cover times (times to visit every vertex), which feature in
several of our word problems, and for which a fairly complete theory exists.
Chapter 7 discusses a hierarchy of symmetry conditions for random walks on
graphs and groups, emphasising structural properties. A conspicuous gap is
that we do not discuss how analytic techniques (e.g. group representation
theory, orthogonal polynomials) can be systematically used to derive exact
formulas for t-step transition probabilities or hitting time distributions in the
presence of enough symmetry. Diaconis [112] has material on this topic, but
an updated account would be valuable. Chapter 8 returns to not-necessarily
reversible chains, treating topics such as certain optimal stopping times, the
Markov chain tree theorem, and coupling from the past. Chapter 9 xxx.
Chapter 10 describes the coupling method of bounding the variation thresh-
old mixing time, and then discusses several interacting particle systems on
finite graphs related to random walks. As background, Liggett [231] is the
standard reference for interacting particle systems on infinite lattices. Chap-
ter 11 xxx. Chapter 12 recounts work of Diaconis and Saloff-Coste,
who bring the techniques of Nash inequalities, log-Sobolev inequalities and
local Poincaré inequalities to bear to obtain sharper estimates for reversible
Markov chains. These techniques were originally developed by analysts in
the study of heat kernels, cf. the sophisticated treatment in Varopoulos
et al [332]. Chapter 13 xxx and mentions topics not treated in detail because of mathematical depth or requirements for extraneous mathematical techniques or the authors' exhaustion.
As previously mentioned, our purpose is to provide systematic intermediate-
level discussion of reversible Markov chains and random walks on graphs,
built around the central theme of mixing times and hitting times developed
in Chapter 4. Various topics could be tackled in a more bare-hands way; an
opposite approach by Lovász [237] (N.B. second edition) is to lead the reader
through half a chapter of problems concerning random walk on graphs. Our
approach is to treat random walk on an unweighted graph as a specialization
of reversible chain, which makes it clear where non-trivial graph theory is
being used (basically, not until Chapter 6).
We have not included exercises, though filling in omitted details will
provide ample exercise for a conscientious reader. Of the open problems,
some seem genuinely difficult while others have just not been thought about
before.
Chapter 2

General Markov Chains


(September 10, 1999)

The setting of this Chapter is a finite-state irreducible Markov chain (Xt ),


either in discrete time (t = 0, 1, 2, . . .) or in continuous time (0 ≤ t <
∞). Highlights of the elementary theory of general (i.e. not-necessarily-
reversible) Markov chains are readily available in several dedicated textbooks
and in chapters of numerous texts on introductory probability or stochastic
processes (see the Notes), so we just give a rapid review in sections 2.1 and
2.1.2. Subsequent sections emphasize several specific topics which are useful
for our purposes but not easy to find in any one textbook: using the funda-
mental matrix in mean hitting times and the central limit theorem, metrics
on distributions and submultiplicativity, Matthews’ method for cover times,
and martingale methods.

2.1 Notation and reminders of fundamental re-


sults
We recommend the textbook of Norris [270] for a clear treatment of the
basic theory and a wide selection of applications.
Write I = {i, j, k, . . .} for a finite state space. Write P = (pij ) for the
transition matrix of a discrete-time Markov chain (Xt : t = 0, 1, 2, . . .).
To avoid trivialities let’s exclude the one-state chain (two-state chains are
useful, because surprisingly often general inequalities are sharp for two-state
chains). The t-step transition probabilities are P (Xt = j|X0 = i) = pij^(t) ,
where P^(t) = PP · · · P is the t-fold matrix product. Write Pi (·) and Ei (·)

for probabilities and expectations for the chain started at state i and time
0. More generally, write Pρ (·) and Eρ (·) for probabilities and expectations
for the chain started at time 0 with distribution ρ. Write

Ti = min{t ≥ 0 : Xt = i}

for the first hitting time on state i, and write

Ti+ = min{t ≥ 1 : Xt = i}.

Of course Ti+ = Ti unless X0 = i, in which case we call Ti+ the first return
time to state i. More generally, a subset A of states has first hitting time

TA = min{t ≥ 0 : Xt ∈ A}.

We shall frequently use without comment “obvious” facts like the fol-
lowing.

Start a chain at state i, wait until it first hits j, then wait until
the time (S, say) at which it next hits k. Then Ei S = Ei Tj +
Ej Tk .

The elementary proof sums over the possible values t of Tj . The sophisti-
cated proof appeals to the strong Markov property ([270] section 1.4) of the
stopping time Tj , which implies

Ei (S|Xt , t ≤ Tj ) = Tj + Ej Tk .

Recall that the symbol | is the probabilist’s shorthand for “conditional on”.

2.1.1 Stationary distribution and asymptotics


Now assume the chain is irreducible. A fundamental result ([270] Theorems
1.7.7 and 1.5.6) is that there exists a unique stationary distribution π = (πi :
i ∈ I), i.e. a unique probability distribution satisfying the balance equations

    πj = Σ_i πi pij   for all j.                    (2.1)

One way to prove this existence (liked by probabilists because it extends


easily to the countable state setting) is to turn Lemma 2.6 below into a
definition. That is, fix arbitrary i0 , define π̃(i0 ) = 1, and define

π̃(j) = Ei0 (number of visits to j before time Ti0+ ),  j ≠ i0 .


It can then be checked that πi := π̃(i)/Σ_j π̃(j) is a stationary distribution.
The point of stationarity is that, if the initial position X0 of the chain is
random with the stationary distribution π, then the position Xt at any
subsequent non-random time t has the same distribution π, and the process
(Xt , t = 0, 1, 2, . . .) is then called the stationary chain.
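The excursion construction of π can be checked numerically on a small chain (a sketch; the transition matrix is an illustrative choice, not from the text): compute π by iterating the balance equations, compute π̃ by summing the occupation probabilities of the chain killed on its return to i0, and compare.

```python
# Illustrative 3-state transition matrix (rows sum to 1).
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
n = len(P)

def stationary(P, iters=2000):
    """Power iteration pi <- pi P, which converges for an irreducible aperiodic chain."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return pi

def excursion_measure(P, i0, iters=2000):
    """pi_tilde(j) = E_{i0}(visits to j before T_{i0}^+): sum over t of the
    probability of being at j at time t without having yet returned to i0."""
    mu = [0.0] * len(P)
    mu[i0] = 1.0                 # the visit at time 0 counts
    tilde = [0.0] * len(P)
    for _ in range(iters):
        for j in range(len(P)):
            tilde[j] += mu[j]
        nxt = [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
        nxt[i0] = 0.0            # returning to i0 kills the excursion
        mu = nxt
    return tilde

pi = stationary(P)
tilde = excursion_measure(P, i0=0)
total = sum(tilde)               # equals E_0 T_0^+ = 1/pi_0, by Lemma 2.5 below
```

Normalizing tilde by its total should reproduce pi to within truncation error.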
A highlight of elementary theory is that the stationary distribution plays
the main role in asymptotic results, as follows.
Theorem 2.1 (The ergodic theorem: [270] Theorem 1.10.2) Let Ni (t)
be the number of visits to state i during times 0, 1, . . . , t − 1. Then for any
initial distribution,

    t^{−1} Ni (t) → πi  a.s., as t → ∞.

Theorem 2.2 (The convergence theorem: [270] Theorem 1.8.3) For


any initial distribution,

P (Xt = j) → πj as t → ∞, for all j

provided the chain is aperiodic.

Theorem 2.1 is the simplest illustration of the ergodic principle “time aver-
ages equal space averages”. Many general identities for Markov chains can
be regarded as aspects of the ergodic principle – in particular, in section
2.2.1 we use it to derive expressions for mean hitting times. Such identities
are important and useful.
The most classical topic in mathematical probability is time-asymptotics
for i.i.d. (independent, identically distributed) random sequences. A vast
number of results are known, and (broadly speaking) have simple analogs
for Markov chains. Thus the analog of the strong law of large numbers is
Theorem 2.1, and the analog of the central limit theorem is Theorem 2.17
below. As mentioned in Chapter 1 section 2.1 (yyy 7/20/99 version) this
book has a different focus, on results which say something about the behavior
of the chain over some specific finite time, rather than what happens in the
indefinite future.

2.1.2 Continuous-time chains


The theory of continuous-time Markov chains closely parallels that of the
discrete-time chains discussed above. To the reader with background in
algorithms or discrete mathematics, the introduction of continuous time
may at first seem artificial and unnecessary, but it turns out that certain
results are simpler in continuous time. See Norris [270] Chapters 2 and 3
for details on what follows.
A continuous-time chain is specified by transition rates (q(i, j) = qij , j ≠
i) which are required to be non-negative but have no constraint on the sums.
Given the transition rates, define
    qi := Σ_{j: j≠i} qij                    (2.2)

and extend (qij ) to a matrix Q by putting qii = −qi . The chain (Xt : 0 ≤
t < ∞) has two equivalent descriptions.
1. Infinitesimal description. Given that Xt = i, the chance that
Xt+dt = j is qij dt for each j ≠ i.
2. Jump-and-hold description. Define a transition matrix J by
Jii = 0 and
    Jij := qij /qi ,  j ≠ i.                (2.3)
Then the continuous-time chain may be constructed by the two-step proce-
dure
(i) Run a discrete-time chain X J with transition matrix J.
(ii) Given the sequence of states i0 , i1 , i2 , . . . visited by X J , the durations
spent in states im are independent exponential random variables with rates
qim .
The discrete-time chain X J is called the jump chain associated with Xt .
The results in the previous section go over to continuous-time chains
with the following modifications.
(a) Pi (Xt = j) = Qij^(t) , where Q^(t) := exp(Qt).
(b) The definition of Ti+ becomes

Ti+ = min{t ≥ TI\i : Xt = i}.

(c) If the chain is irreducible then there exists a unique stationary dis-
tribution π characterized by
    Σ_i πi qij = 0   for all j.

(d) In the ergodic theorem we interpret Ni (t) as the total duration of


time spent in state i during [0, t]:
    Ni (t) := ∫_0^t 1(Xs = i) ds.
(e) In the convergence theorem the assumption of aperiodicity is unnecessary. [This fact is one of the technical advantages of continuous time.]
(f) The evolution of P (Xt = j) as a function of time is given by the
forwards equations
    (d/dt) P (Xt = j) = Σ_i P (Xt = i) qij .    (2.4)

Given a discrete-time chain X with some transition matrix P, one can


define the continuized chain X̃ to have transition rates qij = pij , j ≠ i.
In other words, we replace the deterministic time-1 holds between jumps
by holds with exponential(1) distribution. Many quantities are unchanged
by the passage from the discrete time chain to the continuized chain. In
particular the stationary distribution π and mean hitting times Ei TA are
unchanged. Therefore results stated in continuous time can often be imme-
diately applied in discrete time, and vice versa.
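The invariance of mean hitting times under continuization can be checked by simulation (a sketch; the chain, the seed and the run lengths are illustrative choices). The continuized chain is simulated in jump-and-hold style, with exponential(1) holds and jumps by the original matrix P (fictitious self-jumps allowed), and its empirical mean hitting time is compared with the exact discrete-time value.

```python
import random

# Illustrative 3-state transition matrix; target state is 2.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]

def discrete_mean_hitting(P, target, start, iters=4000):
    """Solve h_i = 1 + sum_j p_ij h_j (with h_target = 0) by iteration."""
    h = [0.0] * len(P)
    for _ in range(iters):
        h = [0.0 if i == target else
             1.0 + sum(P[i][j] * h[j] for j in range(len(P)))
             for i in range(len(P))]
    return h[start]

def continuized_hitting_time(P, target, start, rng):
    """One run of the continuized chain: same jumps, exponential(1) holds."""
    t, state = 0.0, start
    while state != target:
        t += rng.expovariate(1.0)             # hold in the current state
        u, cum = rng.random(), 0.0
        for j, p in enumerate(P[state]):      # sample next state from row
            cum += p
            if u < cum:
                state = j
                break
        else:
            state = len(P) - 1                # guard against float round-off
    return t

exact = discrete_mean_hitting(P, target=2, start=0)
rng = random.Random(1)
est = sum(continuized_hitting_time(P, 2, 0, rng) for _ in range(20000)) / 20000
```

The two values should agree up to Monte Carlo error, illustrating that Ei TA is unchanged by continuization.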
In different parts of the book we shall be working with discrete or contin-
uous time as a current convention, mentioning where appropriate how results
change in the alternate setting. Chapter 4 (yyy section to be written) will
give a survey of the differences between these two settings.

2.2 Identities for mean hitting times and occupa-


tion times
2.2.1 Occupation measures and stopping times
The purpose of this section is to give a systematic “probabilistic” treat-
ment of a collection of general identities by deriving them from a single
result, Proposition 2.3. We work in discrete time, but give the correspond-
ing continuous-time results in section 2.2.3. Intuitively, a stopping time is
a random time which can be specified by some on-line algorithm, together
(perhaps) with external randomization.

Proposition 2.3 Consider the chain started at state i. Let 0 < S < ∞ be
a stopping time such that XS = i and Ei S < ∞. Let j be an arbitrary state.
Then
Ei (number of visits to j before time S) = πj Ei S.

In the phrase “number of . . . before time t”, our convention is to include


time 0 but exclude time t.
We shall give two different proofs. The first requires a widely-useful


general theorem in stochastic processes.
Proof. Consider the renewal process whose inter-renewal time is dis-
tributed as S. The reward-renewal theorem (e.g. Ross [299] Thm. 3.6.1)
says that the asymptotic proportion of time spent in state j equals

Ei (number of visits to j before time S)/Ei S.

But this asymptotic average also equals πj , by the ergodic theorem. □


We like that proof for philosophical reasons: a good way to think about
general identities is that they show one quantity calculated in two different
ways. Here is an alternative proof of a slightly more general assertion. We
refer to Propositions 2.3 and 2.4 as occupation measure identities.
Proposition 2.4 Let θ be a probability distribution on I. Let 0 < S < ∞
be a stopping time such that Pθ (XS ∈ ·) = θ(·) and Eθ S < ∞. Let j be an
arbitrary state. Then

Eθ (number of visits to j before time S) = πj Eθ S.

Proof. Write ρj = Eθ (number of visits to j before time S). We will show


    Σ_j ρj pjk = ρk   for all k.            (2.5)

Then by uniqueness of the stationary distribution, ρ(·) = cπ(·) for c =
Σ_k ρk = Eθ S.
Checking (2.5) is just a matter of careful notation.

    ρk = Σ_{t=0}^∞ Pθ (Xt = k, S > t)
       = Σ_{t=0}^∞ Pθ (Xt+1 = k, S > t)          because Pθ (XS = k) = Pθ (X0 = k)
       = Σ_{t=0}^∞ Σ_j Pθ (Xt = j, S > t, Xt+1 = k)
       = Σ_{t=0}^∞ Σ_j Pθ (Xt = j, S > t) pjk    by the Markov property
       = Σ_j ρj pjk .

□
2.2.2 Mean hitting time and related formulas


The following series of formulas arise from particular choices of j and S in
Proposition 2.3. For ease of later reference, we state them all together before
starting the proofs. Some involve the quantity
    Zij = Σ_{t=0}^∞ ( pij^(t) − πj )        (2.6)

In the periodic case the sum may oscillate, so we use the Cesaro limit or
(equivalently, but more simply) the continuous-time limit (2.9). The matrix
Z is called the fundamental matrix (see Notes for alternate standardizations).
Note that from the definition
    Σ_j Zij = 0   for all i.                (2.7)

Lemma 2.5 Ei Ti+ = 1/πi .

Lemma 2.6

Ei (number of visits to j before time Ti+ ) = πj /πi .

Lemma 2.7 For j ≠ i,

Ej (number of visits to j before time Ti ) = πj (Ej Ti + Ei Tj ).

Corollary 2.8 For j ≠ i,

    Pi (Tj < Ti+ ) = 1 / ( πi (Ei Tj + Ej Ti ) ).

Lemma 2.9 For i ≠ l and arbitrary j,

Ei (number of visits to j before time Tl ) = πj (Ei Tl + El Tj − Ei Tj ).

Corollary 2.10 For i ≠ l and j ≠ l,

    Pi (Tj < Tl ) = ( Ei Tl + El Tj − Ei Tj ) / ( Ej Tl + El Tj ).

Lemma 2.11 πi Eπ Ti = Zii .

Lemma 2.12 πj Ei Tj = Zjj − Zij .


Corollary 2.13 Σ_j πj Ei Tj = Σ_j Zjj for each i.

Corollary 2.14 (The random target lemma) Σ_j πj Ei Tj does not depend on i.

Lemma 2.15

    Eπ (number of visits to j before time Ti ) = (πj /πi ) Zii − Zij .

Lemmas 2.11 and 2.12, which will be used frequently throughout the book,
will both be referred to as the mean hitting time formula. See the Remark
following the proofs for a two-line heuristic derivation of Lemma 2.12. A
consequence of the mean hitting time formula is that knowing the matrix
Z is equivalent to knowing the matrix (Ei Tj ), since we can recover Zij as
πj (Eπ Tj − Ei Tj ).
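Both versions of the mean hitting time formula are easy to verify numerically (a sketch; the 3-state chain is an illustrative choice). Z is computed by truncating the sum (2.6), which converges geometrically since the chain is aperiodic, and Ei Tj by iterating the first-step equations hi = 1 + Σ_k pik hk with hj = 0.

```python
# Illustrative 3-state transition matrix.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
n = len(P)

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Stationary distribution by power iteration.
pi = [1.0 / n] * n
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

# Z_ij = sum_t (p_ij^(t) - pi_j), truncated; Pt holds the current power P^t.
Z = [[0.0] * n for _ in range(n)]
Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
for _ in range(2000):
    for i in range(n):
        for j in range(n):
            Z[i][j] += Pt[i][j] - pi[j]
    Pt = mat_mul(Pt, P)

# Mean hitting times: ET[j][i] = E_i T_j, by iterating the first-step equations.
def hit(target):
    h = [0.0] * n
    for _ in range(4000):
        h = [0.0 if i == target else
             1.0 + sum(P[i][k] * h[k] for k in range(n)) for i in range(n)]
    return h

ET = [hit(j) for j in range(n)]
```

The checks pi[j]*ET[j][i] = Z[j][j] − Z[i][j] (Lemma 2.12) and pi[i]*EπTi = Z[i][i] (Lemma 2.11) then hold to within truncation error.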
Proofs. The simplest choice of S in Proposition 2.3 is of course the first
return time Ti+ . With this choice, the Proposition says

Ei (number of visits to j before time Ti+ ) = πj Ei Ti+ .

Setting j = i gives 1 = πi Ei Ti+ , which is Lemma 2.5, and then the case of
general j gives Lemma 2.6.
Another choice of S is “the first return to i after the first visit to j”.
Then Ei S = Ei Tj + Ej Ti and the Proposition becomes Lemma 2.7, because
there are no visits to j before time Tj . For the chain started at i, the number
of visits to i (including time 0) before hitting j has geometric distribution,
and so

Ei (number of visits to i before time Tj ) = 1/Pi (Tj < Ti+ ).

So Corollary 2.8 follows from Lemma 2.7 (with i and j interchanged).


Another choice of S is “the first return to i after the first visit to j after
the first visit to l”, where i, j, l are distinct. The Proposition says

πj (Ei Tl + El Tj + Ej Ti ) = Ei (number of visits to j before time Tl )

+Ej (number of visits to j before time Ti ).


Lemma 2.7 gives an expression for the final expectation, and we deduce that
(for distinct i, j, l)

Ei (number of visits to j before time Tl ) = πj (Ei Tl + El Tj − Ei Tj ).


This is the assertion of Lemma 2.9, and the identity remains true if j = i
(where it becomes Lemma 2.7) or if j = l (where it reduces to 0 = 0). We
deduce Corollary 2.10 by writing
Ei (number of visits to j before time Tl ) =
Pi (Tj < Tl )Ej (number of visits to j before time Tl )
and using Lemma 2.7 to evaluate the final expectation.
We now get slightly more ingenious. Fix a time t0 ≥ 1 and define S as
the time taken by the following 2-stage procedure (for the chain started at
i).
(i) wait time t0
(ii) then wait (if necessary) until the chain next hits i.
Then the Proposition (with j = i) says
    Σ_{t=0}^{t0 −1} pii^(t) = πi (t0 + Eρ Ti )      (2.8)

where ρ(·) = Pi (Xt0 = ·). Rearranging,


    Σ_{t=0}^{t0 −1} ( pii^(t) − πi ) = πi Eρ Ti .

Letting t0 → ∞ we have ρ → π by the convergence theorem (strictly,


we should give a separate argument for the periodic case, but it’s simpler
to translate the argument to continuous time where the periodicity issue
doesn’t arise) and we obtain Lemma 2.11.
For Lemma 2.12, where we may take j ≠ i, we combine the previous
ideas. Again fix t0 and define S as the time taken by the following 3-stage
procedure (for the chain started at i).
(i) wait until the chain hits k.
(ii) then wait a further time t0 .
(iii) then wait (if necessary) until the chain next hits i.
Applying Proposition 2.3 with this S and with j = i gives
    Ei (number of visits to i before time Tk ) + Σ_{t=0}^{t0 −1} pki^(t) = πi (Ei Tk + t0 + Eρ Ti ),

where ρ(·) = Pk (Xt0 = ·). Subtracting the equality of Lemma 2.7 and
rearranging, we get
    Σ_{t=0}^{t0 −1} ( pki^(t) − πi ) = πi (Eρ Ti − Ek Ti ).

Letting t0 → ∞, we have (as above) ρ → π, giving

Zki = πi (Eπ Ti − Ek Ti ).

Appealing to Lemma 2.11 we get Lemma 2.12. Corollary 2.13 follows from
Lemma 2.12 by using (2.7).
To prove Lemma 2.15, consider again the argument for (2.8), but now
apply the Proposition with j ≠ i. This gives
    Σ_{t=0}^{t0 −1} pij^(t) + Eρ (number of visits to j before time Ti ) = πj (t0 + Eρ Ti )

where ρ(·) = Pi (Xt0 = ·). Rearranging,

    Σ_{t=0}^{t0 −1} ( pij^(t) − πj ) + Eρ (number of visits to j before time Ti ) = πj Eρ Ti .

Letting t0 → ∞ gives

Zij + Eπ (number of visits to j before time Ti ) = πj Eπ Ti .

Applying Lemma 2.11 gives Lemma 2.15.


Remark. We promised a two-line heuristic derivation of the mean hitting
time formula, and here it is. Write

    Σ_{t=0}^∞ ( 1(Xt =j) − πj ) = Σ_{t=0}^{Tj −1} ( 1(Xt =j) − πj ) + Σ_{t=Tj}^∞ ( 1(Xt =j) − πj ).

Take Ei (·) of each term to get Zij = −πj Ei Tj + Zjj . Of course this argu-
ment doesn’t make sense because the sums do not converge. Implicit in our
(honest) proof is a justification of this argument by a limiting procedure.

Example 2.16 Patterns in coin-tossing.

This is a classical example for which Z is easy to calculate. Fix n. Toss


a fair coin repeatedly, and let X0 , X1 , X2 , . . . be the successive overlapping
n-tuples. For example (with n = 4)
tosses H T H H T T
X0 = H T H H
X1 = T H H T
X2 = H H T T
So X is a Markov chain on the set I = {H, T }^n of n-tuples i = (i1 , . . . , in ),


and the stationary distribution π is uniform on I. For 0 ≤ d ≤ n − 1 write
I(i, j, d) for the indicator of the set “pattern j, shifted right by d places,
agrees with pattern i where they overlap”: formally, of the set

ju = iu+d , 1 ≤ u ≤ n − d.

For example, with n = 4, i = HHTH and j = HTHH,


d 0 1 2 3
I(i, j, d) 0 1 0 1
Then write
    c(i, j) = Σ_{d=0}^{n−1} 2^{−d} I(i, j, d).

From the definition of Z, and the fact that X0 and Xt are independent for
t ≥ n,
Zij = c(i, j) − n 2^{−n} .
So we can read off many facts about patterns in coin-tossing from the general
results of this section. For instance, the mean hitting time formula (Lemma
2.11) says Eπ Ti = 2^n c(i, i) − n. Note that “time 0” for the chain is the
n’th toss, at which point the chain is in its stationary distribution. So the
mean number of tosses until first seeing pattern i equals 2^n c(i, i). For n = 5
and i = HHTHH, the reader may check this mean number is 38. We leave
the interested reader to explore further — in particular, find three patterns
i, j, k such that

P (pattern i occurs before pattern j) > 1/2

P (pattern j occurs before pattern k) > 1/2


P (pattern k occurs before pattern i) > 1/2.
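The quantity c(i, j) and the mean waiting time 2^n c(i, i) are easy to mechanize, and the value 38 for HHTHH can be rechecked by direct simulation (a sketch; the seed and sample size are arbitrary choices):

```python
import random

def c(i, j):
    """c(i,j) = sum_{d=0}^{n-1} 2^{-d} I(i,j,d), where I(i,j,d) indicates that
    pattern j, shifted right by d places, agrees with pattern i on the overlap."""
    n = len(i)
    return sum(2.0 ** -d for d in range(n)
               if all(j[u] == i[u + d] for u in range(n - d)))

def mean_tosses(pattern):
    """Mean number of fair-coin tosses until the pattern first appears:
    2^n c(i,i), by the mean hitting time formula."""
    return 2 ** len(pattern) * c(pattern, pattern)

def waiting_time(pattern, rng):
    """Simulate fair tosses until the pattern appears; return the toss count."""
    buf, t = "", 0
    while buf != pattern:
        t += 1
        buf = (buf + rng.choice("HT"))[-len(pattern):]
    return t

rng = random.Random(2)
trials = 40000
est = sum(waiting_time("HHTHH", rng) for _ in range(trials)) / trials
```

Here mean_tosses("HHTHH") evaluates to 38 exactly, and the empirical average est should agree up to Monte Carlo error.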
Further results. One can of course obtain expressions in the spirit of
Lemmas 2.5–2.15 for more complicated quantities. The reader may care to
find expressions for
Ei min(Tk , Tl )
Ei (number of visits to j before time min(Tk , Tl ))
Pi (hit j before time min(Tk , Tl )).
Warning. Hitting times TA on subsets will be studied later (e.g. Chapter
3 section 5.3) (yyy 9/2/94 version) in the reversible setting. It is important
to note that results often do not extend simply from singletons to subsets.
For instance, one might guess that Lemma 2.11 could be extended to

    Eπ TA = ZAA /π(A),   ZAA := Σ_{t=0}^∞ ( Pπ (Xt ∈ A|X0 ∈ A) − π(A) ),

but it is easy to make examples where this is false.

2.2.3 Continuous-time versions


Here we record the continuous-time versions of the results of the previous
section. Write

    Zij = ∫_0^∞ ( Pi (Xt = j) − πj ) dt .   (2.9)
This is consistent with (2.6) in that Z is the same for a discrete-time chain
and its continuized chain. Recall from section 2.1.2 the redefinition (b) of
Ti+ in continuous time. In place of “number of visits to i” we use “total
duration of time spent in i”. With this substitution, Proposition 2.3 and
the other results of the previous section extend to continuous time with only
the following changes, which occur because the mean sojourn time in a state
i is 1/qi in continuous time, rather than 1 as in discrete time.
Lemma 2.5. Ei Ti+ = 1/(qi πi ).
Lemma 2.6.

    Ei (duration of time spent in j before time Ti+ ) = πj /(qi πi ).

Corollary 2.8. For j ≠ i,

    Pi (Tj < Ti+ ) = 1 / ( qi πi (Ei Tj + Ej Ti ) ).

2.3 Variances of sums


In discrete time, consider the number Ni (t) of visits to state i before time t.
(Recall our convention is to count a visit at time 0 but not at time t.) For
the stationary chain, we have (trivially)

Eπ Ni (t) = tπi .

It’s not hard to calculate the variance:


    var π Ni (t) = Σ_{r=0}^{t−1} Σ_{s=0}^{t−1} ( Pπ (Xr = i, Xs = i) − πi^2 )

                 = πi ( Σ_{u=0}^{t−1} 2(t − u)( pii^(u) − πi ) − t(1 − πi ) )

setting u = |s − r|. This leads to the asymptotic result

    var π Ni (t) / t → πi (2Zii − 1 + πi ).     (2.10)
The fundamental matrix Z of (2.6) reappears in an apparently different con-
text. Here is the more general result underlying (2.10). Take arbitrary func-
tions f : I → R and g : I → R and center so that Eπ f (X0 ) := Σ_i πi f (i) = 0
and Eπ g(X0 ) = 0. Write


    St^f = Σ_{s=0}^{t−1} f (Xs )

and similarly for St^g . Then
    Eπ St^f St^g = Σ_i Σ_j f (i)g(j) Σ_{r=0}^{t−1} Σ_{s=0}^{t−1} ( Pπ (Xr = i, Xs = j) − πi πj ).

The contribution to the latter double sum from terms r ≤ s equals, putting
u = s − r,
    Σ_{u=0}^{t−1} πi (t − u)( pij^(u) − πj ) ∼ t πi Zij .
Collecting the other term and subtracting the twice-counted diagonal leads
to the following result.
    Eπ St^f St^g / t → f Γg := Σ_i Σ_j f (i) Γij g(j)      (2.11)

where Γ is the symmetric positive-definite matrix


Γij := πi Zij + πj Zji + πi πj − πi δij . (2.12)
As often happens, the formulas simplify in continuous time. The asymp-
totic result (2.10) becomes
    var π Ni (t) / t → 2πi Zii
and the matrix Γ occurring in (2.11) becomes
Γij := πi Zij + πj Zji .
Of course these asymptotic variances appear in the central limit theorem
for Markov chains.
Theorem 2.17 For centered f ,

    t^{−1/2} St^f →d Normal(0, f Γf )   as t → ∞.
The standard proofs (e.g. [133] p. 378) don’t yield any useful finite-time
results, so we won’t present a proof. We return to this subject in Chapter
4 section 4.1 (yyy 10/11/94 version) in the context of reversible chains. In
that context, getting finite-time bounds on the approximation (2.10) for
variances is not hard, but getting informative finite-time bounds on the
Normal approximation remains quite hard.
Remark. Here’s another way of seeing why asymptotic variances should
relate (via Z) to mean hitting times. Regard Ni (t) as counts in a renewal pro-
cess; in the central limit theorem for renewal counts ([133] Exercise 2.4.13)
the variance involves the variance var i (Ti+ ) of the inter-renewal time, and
by (2.22) below this in turn relates to Eπ Ti .

2.4 Two metrics on distributions


A major theme of this book is quantifying the convergence theorem (Theo-
rem 2.2) to give estimates of how close the distribution of a chain is to the
stationary distribution at finite times. Such quantifications require some
explicit choice of “distance” between distributions, and two of the simplest
choices are explained in this section. We illustrate with a trivial
Example 2.18 Rain or shine?
Suppose the true probability of rain tomorrow is 80% whereas we think
the probability is 70%. How far off are we? In other words, what is the
“distance” between π and θ, where
π(rain) = 0.8, π(shine) = 0.2
θ(rain) = 0.7, θ(shine) = 0.3.
Different notions of distance will give different numerical answers. Our first
notion abstracts the idea that the “additive error” in this example is 0.8 −
0.7 = 0.1.

2.4.1 Variation distance


Perhaps the simplest notion of distance between probability distributions is
variation distance, defined as
||θ1 − θ2 || := max_{A⊆I} |θ1 (A) − θ2 (A)|.

So variation distance is just the maximum additive error one can make, in
using the “wrong” distribution to evaluate the probability of an event. In
example 2.18, variation distance is 0.1. Several equivalent definitions are
provided by

Lemma 2.19 For probability distributions θ1 , θ2 on a finite state space I,


(1/2) Σi |θ1 (i) − θ2 (i)| = Σi (θ1 (i) − θ2 (i))^+
                           = Σi (θ1 (i) − θ2 (i))^−
                           = 1 − Σi min(θ1 (i), θ2 (i))
                           = max_{A⊆I} |θ1 (A) − θ2 (A)|
                           = min P (V1 ≠ V2 ),

the minimum taken over random pairs (V1 , V2 ) such that Vm has distribution
θm (m = 1, 2). So each of these quantities equals the variation distance
||θ1 − θ2 ||.

Proof. The first three equalities are clear. For the fourth, set B = {i : θ1 (i) > θ2 (i)}. Then for any A ⊆ I,

θ1 (A) − θ2 (A) = Σ_{i∈A} (θ1 (i) − θ2 (i))
               ≤ Σ_{i∈A∩B} (θ1 (i) − θ2 (i))
               ≤ Σ_{i∈B} (θ1 (i) − θ2 (i))
               = Σi (θ1 (i) − θ2 (i))^+

with equality when A = B. This, and the symmetric form, establish the
fourth equality. In the final equality, the “≤” follows from

|θ1 (A) − θ2 (A)| = |P (V1 ∈ A) − P (V2 ∈ A)| ≤ P (V1 ≠ V2 ).

And equality is attained by the following joint distribution. Let θ(i) := min(θ1 (i), θ2 (i)) and let

P (V1 = i, V2 = i) = θ(i)

P (V1 = i, V2 = j) = (θ1 (i) − θ(i))(θ2 (j) − θ(j)) / (1 − Σk θ(k)), i ≠ j.

(If the denominator is zero, then θ1 = θ2 and the result is trivial.) 2
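The equalities of Lemma 2.19, and the optimal coupling constructed in the proof, are easy to check by brute force on the rain/shine data of Example 2.18; the short sketch below (plain Python, with the two-point state space hard-coded) is ours, not the authors'.

```python
# Checking the equalities of Lemma 2.19 on Example 2.18, plus the optimal
# coupling from the proof.

from itertools import chain, combinations

I = ["rain", "shine"]
theta1 = {"rain": 0.8, "shine": 0.2}
theta2 = {"rain": 0.7, "shine": 0.3}

half_l1 = 0.5 * sum(abs(theta1[i] - theta2[i]) for i in I)
pos_part = sum(max(theta1[i] - theta2[i], 0.0) for i in I)
one_minus_min = 1.0 - sum(min(theta1[i], theta2[i]) for i in I)

# maximum additive error over all events A
subsets = chain.from_iterable(combinations(I, r) for r in range(len(I) + 1))
max_event = max(abs(sum(theta1[i] for i in A) - sum(theta2[i] for i in A))
                for A in subsets)

# optimal coupling: diagonal mass min(theta1, theta2), leftover spread off-diagonal
theta = {i: min(theta1[i], theta2[i]) for i in I}
leftover = 1.0 - sum(theta.values())
joint = {}
for i in I:
    joint[(i, i)] = theta[i]
    for j in I:
        if i != j:
            joint[(i, j)] = (theta1[i] - theta[i]) * (theta2[j] - theta[j]) / leftover
p_diff = sum(p for (i, j), p in joint.items() if i != j)

print(half_l1, pos_part, one_minus_min, max_event, p_diff)  # all equal 0.1
```

All five quantities equal the variation distance 0.1, and the coupling's marginals recover θ1 and θ2.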


In the context of Markov chains we may use

di (t) := ||Pi (Xt = ·) − π(·)|| (2.13)

as a measure of deviation from stationarity at time t, for the chain started at state i. Also define

d(t) := maxi di (t) (2.14)

as the worst-case deviation from stationarity. Finally, it is technically convenient to introduce also

d̄(t) := max_{i,j} ||Pi (Xt = ·) − Pj (Xt = ·)||. (2.15)

In Chapter 4 we discuss, for reversible chains, relations between these “vari-


ation distance” notions and other measures of closeness-to-stationarity, and
discuss parameters τ measuring “time until d(t) becomes small” and their
relation to other parameters of the chain. For now, let’s just introduce a
fundamental technical fact, the submultiplicativity property.

Lemma 2.20
(a) d̄(s + t) ≤ d̄(s)d̄(t), s, t ≥ 0 [the submultiplicativity property].
(b) d(s + t) ≤ 2d(s)d(t), s, t ≥ 0.
(c) d(t) ≤ d̄(t) ≤ 2d(t), t ≥ 0.
(d) d(t) and d̄(t) decrease as t increases.

Proof. We use the characterization of variation distance as

||θ1 − θ2 || = min P (V1 ≠ V2 ), (2.16)

the minimum taken over random pairs (V1 , V2 ) such that Vm has distribution
θm (m = 1, 2).
Fix states i1 , i2 and times s, t, and let Y^1 , Y^2 denote the chains started at i1 , i2 respectively. By (2.16) we can construct a joint distribution for (Y^1_s , Y^2_s ) such that

P (Y^1_s ≠ Y^2_s ) = ||Pi1 (Xs = ·) − Pi2 (Xs = ·)|| ≤ d̄(s).

Now for each pair (j1 , j2 ), we can use (2.16) to construct a joint distribution for (Y^1_{s+t} , Y^2_{s+t} ) given (Y^1_s = j1 , Y^2_s = j2 ) with the property that

P (Y^1_{s+t} ≠ Y^2_{s+t} | Y^1_s = j1 , Y^2_s = j2 ) = ||Pj1 (Xt = ·) − Pj2 (Xt = ·)||.

The right side is 0 if j1 = j2 , and otherwise is at most d̄(t). So unconditionally

P (Y^1_{s+t} ≠ Y^2_{s+t} ) ≤ d̄(s)d̄(t)

and (2.16) establishes part (a) of the lemma. For part (b), the same argument (with Y^2 now being the stationary chain) shows

d(s + t) ≤ d(s)d̄(t) (2.17)

so that (b) will follow from the upper bound d̄(t) ≤ 2d(t) in (c). But this upper bound is clear from the triangle inequality for variation distance. And the lower bound in (c) follows from the fact that µ → ||θ − µ|| is a convex function, so that averaging over j with respect to π in (2.15) can only decrease distance. Finally, the “decreasing” property for d̄(t) follows from (a), and for d(t) follows from (2.17). 2
The assertions of this section hold in either discrete or continuous time. But note that the numerical value of d(t) changes when we switch from a discrete-time chain to the continuized chain. In particular, for a discrete-time chain with period q we have d(t) → (q − 1)/q as t → ∞ (which incidentally implies, taking q = 2, that the factor 2 in Lemma 2.20(b) cannot be reduced), whereas for the continuized chain d(t) → 0.
One often sees slightly disguised corollaries of the submultiplicativity
property in the literature. The following is a typical one.

Corollary 2.21 Suppose there exists a probability measure µ, a real δ > 0 and a time t such that

p_{ij}^{(t)} ≥ δµj ∀i, j.

Then

d(s) ≤ (1 − δ)^{⌊s/t⌋} , s ≥ 0.

Proof. The hypothesis implies d̄(t) ≤ 1 − δ, by the third equality in Lemma 2.19, and then the conclusion follows by submultiplicativity.

2.4.2 L2 distance
Another notion of distance, which is less intuitively natural but often more
mathematically tractable, is L2 distance. This is defined with respect to

some fixed reference probability distribution π on I, which for our purposes


will be the stationary distribution of some irreducible chain under consideration (and so πi > 0 ∀i). The L2 norm of a function f : I → R is

||f ||2 = ( Σi πi f^2 (i) )^{1/2} . (2.18)

We define the L2 norm of a signed measure ν on I by

||ν||2 = ( Σi νi^2 /πi )^{1/2} . (2.19)

This may look confusing, because a signed measure ν and a function f are
in a sense “the same thing”, being determined by values (f (i); i ∈ I) or
(νi ; i ∈ I) which can be chosen arbitrarily. But the measure ν can also be
determined by its density function f (i) = νi /πi , and so (2.18) and (2.19)
say that the L2 norm of a signed measure is defined to be the L2 norm of
its density function.
So ||θ − µ||2 is the “L2 ” measure of distance between probability dis-
tributions θ, µ. In particular, the distance between θ and the reference
distribution π is
||θ − π||2 = ( Σi (θi − πi )^2 /πi )^{1/2} = ( Σi θi^2 /πi − 1 )^{1/2} .

In Example 2.18 we find ||θ − π||2 = 1/4.


Writing θ(t) for the distribution at time t of a chain with stationary
distribution π, it is true (cf. Lemma 2.20(d) for variation distance) that
||θ(t) − π||2 is decreasing with t. Since there is a more instructive proof in
the reversible case (Chapter 3 Lemma 23) (yyy 9/2/94 version) we won’t
prove the general case (see Notes).
Analogous to the L2 norms are the L1 norms

||f ||1 = Σi πi |f (i)|,   ||ν||1 = Σi |νi |.

The Cauchy-Schwarz inequality gives || · ||1 ≤ || · ||2 . Note that in the


definition of ||ν||1 the reference measure π has “cancelled out”. Lemma 2.19

shows that for probability measures θ1 , θ2 the L1 distance is the same as


variation distance, up to a factor of 2:

||θ1 − θ2 || = (1/2) ||θ1 − θ2 ||1 .

As a trivial example in the Markov chain setting, consider

Example 2.22 Take I = {0, 1, . . . , n − 1}, fix a parameter 0 < a < 1 and define a transition matrix

pij = a 1_{(j = i+1 mod n)} + (1 − a)/n .

In this example the t-step transition probabilities are

p_{ij}^{(t)} = a^t 1_{(j = i+t mod n)} + (1 − a^t )/n

and the stationary distribution π is uniform. We calculate (for arbitrary j ≠ i)

d(t) = ||Pi (Xt ∈ ·) − π|| = (1 − n^{−1} ) a^t
d̄(t) = ||Pi (Xt ∈ ·) − Pj (Xt ∈ ·)|| = a^t
||Pi (Xt ∈ ·) − π||2 = (n − 1)^{1/2} a^t .
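These closed forms are easy to confirm numerically: iterate the transition matrix of the cyclic chain and compare the three distances with (1 − 1/n)a^t, a^t and (n − 1)^{1/2} a^t. The parameter values below are arbitrary.

```python
# Numerical check of the closed forms in Example 2.22 (cyclic chain).

n, a, t = 5, 0.6, 7
P = [[a * (j == (i + 1) % n) + (1 - a) / n for j in range(n)] for i in range(n)]

# distribution at time t started from state 0
row = [1.0 if j == 0 else 0.0 for j in range(n)]
for _ in range(t):
    row = [sum(row[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [1.0 / n] * n
d_t = 0.5 * sum(abs(row[j] - pi[j]) for j in range(n))

# for d-bar, compare starts at 0 and 1; by symmetry this is the max over pairs
row1 = [1.0 if j == 1 else 0.0 for j in range(n)]
for _ in range(t):
    row1 = [sum(row1[i] * P[i][j] for i in range(n)) for j in range(n)]
dbar_t = 0.5 * sum(abs(row[j] - row1[j]) for j in range(n))

l2_t = sum((row[j] - pi[j]) ** 2 / pi[j] for j in range(n)) ** 0.5

print(d_t, (1 - 1 / n) * a ** t)        # agree
print(dbar_t, a ** t)                   # agree
print(l2_t, (n - 1) ** 0.5 * a ** t)    # agree
```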

2.4.3 Exponential tails of hitting times


The submultiplicative property of d̄(t) is one instance of a general principle:

because our state space is finite, many quantities which converge to zero as t → ∞ must converge exponentially fast, by iterating over worst-case initial states.

Here’s another instance, tails of hitting time distributions.


Consider the first hitting time TA on a subset A. Define t∗A := maxi Ei TA . For any initial distribution µ, any time s > 0 and any integer m ≥ 1,

Pµ (TA > ms | TA > (m − 1)s) = Pθ (TA > s) for some dist. θ
                             ≤ maxi Pi (TA > s)
                             ≤ t∗A /s,

the last step being Markov’s inequality. So by induction on m

Pµ (TA > js) ≤ (t∗A /s)^j ,

implying

Pµ (TA > t) ≤ (t∗A /s)^{⌊t/s⌋} , t > 0.

In continuous time, a good (asymptotically optimal) choice of s is s = e t∗A , giving the exponential tail bound

sup_µ Pµ (TA > t) ≤ exp( −⌊ t/(e t∗A ) ⌋ ), 0 < t < ∞. (2.20)

A messier bound holds in discrete time, where we have to choose s to be an


integer.
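The chaining argument is easy to see in action. The sketch below (ours; the 3-state chain with A = {2} is an arbitrary example) computes Pi (TA > t) exactly by iterating the substochastic restriction of P to A^c, computes t∗A by tail sums, and checks the bound (t∗A /s)^{⌊t/s⌋} for s = 2t∗A.

```python
# Illustrating the geometric tail bound for hitting times on a made-up chain.

import math

P = [[0.6, 0.3, 0.1],
     [0.4, 0.5, 0.1],
     [0.2, 0.3, 0.5]]
Ac = [0, 1]          # A = {2}

def tail_probs(t):
    # v[i] = P_i(T_A > t): t steps of the substochastic restriction to A^c
    v = {i: 1.0 for i in Ac}
    for _ in range(t):
        v = {i: sum(P[i][j] * v[j] for j in Ac) for i in Ac}
    return v

# E_i T_A = sum_{t >= 0} P_i(T_A > t); tails decay geometrically
E = {i: 0.0 for i in Ac}
v = {i: 1.0 for i in Ac}
while max(v.values()) > 1e-15:
    for i in Ac:
        E[i] += v[i]
    v = {i: sum(P[i][j] * v[j] for j in Ac) for i in Ac}

t_star = max(E.values())                 # here t*_A = 10 exactly
s = int(math.ceil(2 * t_star))           # so t*_A / s <= 1/2
for t in [s, 2 * s, 5 * s, 10 * s]:
    bound = (t_star / s) ** (t // s)
    worst = max(tail_probs(t).values())
    print(t, worst, bound)               # worst <= bound every time
```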

2.5 Distributional identities


It is much harder to get useful information about distributions (rather than
mere expectations). Here are a few general results.

2.5.1 Stationarity consequences


A few useful facts about stationary Markov chains are, to experts, just spe-
cializations of facts about arbitrary (i.e. not-necessarily-Markov) stationary
processes. Here we give a bare-hands proof of one such fact, the relation
between the distribution of return time to a subset A and the distribution
of first hitting time to A from a stationary start. We start in discrete time.
Lemma 2.23 For t = 1, 2, . . .,

Pπ (TA = t − 1) = Pπ (TA+ = t) = π(A)PπA (TA+ ≥ t)

where πA (i) := πi /π(A), i ∈ A.


Proof. The first equality is obvious. Now let (Xt ) be the chain started with
its stationary distribution π. Then

Pπ (TA+ = t) = P (X1 ∉ A, . . . , Xt−1 ∉ A, Xt ∈ A)
            = P (X1 ∉ A, . . . , Xt−1 ∉ A) − P (X1 ∉ A, . . . , Xt ∉ A)
            = P (X1 ∉ A, . . . , Xt−1 ∉ A) − P (X0 ∉ A, . . . , Xt−1 ∉ A)
            = P (X0 ∈ A, X1 ∉ A, . . . , Xt−1 ∉ A)
            = π(A)PπA (TA+ ≥ t),

establishing the Lemma.


We’ll give two consequences of Lemma 2.23. Summing over t gives

Corollary 2.24 (Kac’s formula) π(A)EπA TA+ = 1

which extends the familiar fact Ei Ti+ = 1/πi . Multiplying the identity of
Lemma 2.23 by t and summing gives
Eπ TA + 1 = Σ_{t≥1} t Pπ (TA = t − 1)
          = π(A) Σ_{t≥1} t PπA (TA+ ≥ t)
          = π(A) Σ_{m≥1} (1/2) m(m + 1) PπA (TA+ = m)
          = (π(A)/2) ( EπA TA+ + EπA (TA+ )^2 ).
Appealing to Kac’s formula and rearranging,

EπA (TA+ )^2 = (2Eπ TA + 1)/π(A), (2.21)

varπA (TA+ ) = (2Eπ TA + 1)/π(A) − 1/π^2 (A). (2.22)

More generally, there is a relation between EπA (TA+ )p and Eπ (TA+ )p−1 .
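Kac's formula and (2.21) can be verified exactly on a small example by computing the distribution of TA+ from the πA start directly. The sketch below (ours; the 3-state chain with A = {0, 1} is arbitrary) does this with plain iteration, and computes Eπ TA by tail sums.

```python
# Checking Kac's formula and identity (2.21) on a made-up 3-state chain.

P = [[0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4],
     [0.5, 0.25, 0.25]]
n = 3
A = [0, 1]

def mat_mul(Am, Bm):
    return [[sum(Am[i][k] * Bm[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pk = P
for _ in range(200):
    Pk = mat_mul(Pk, P)
pi = Pk[0]
piA = sum(pi[i] for i in A)

# distribution of T_A^+ started from pi_A: one step, then watch returns to A
w = [sum((pi[i] / piA) * P[i][j] for i in A) for j in range(n)]
E1 = E2 = 0.0
t = 1
while sum(w) > 1e-15:
    p_t = sum(w[j] for j in A)       # P_{pi_A}(T_A^+ = t)
    E1 += t * p_t
    E2 += t * t * p_t
    w = [sum(w[i] * P[i][j] for i in range(n) if i not in A) for j in range(n)]
    t += 1

# E_pi T_A via tail sums over the chain killed on A
Ac = [i for i in range(n) if i not in A]
v = {j: pi[j] for j in Ac}
EpiTA = 0.0
while sum(v.values()) > 1e-15:
    EpiTA += sum(v.values())
    v = {j: sum(v[i] * P[i][j] for i in Ac) for j in Ac}

print(piA * E1)                      # Kac's formula: equals 1
print(E2, (2 * EpiTA + 1) / piA)     # identity (2.21)
```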
In continuous time, the analog of Lemma 2.23 is

Pπ (TA ∈ (t, t + dt)) = Q(A, Ac )PρA (TA > t) dt, t > 0 (2.23)

where

Q(A, Ac ) := Σ_{i∈A} Σ_{j∈Ac} πi qij ,   ρA (j) := Σ_{i∈A} πi qij /Q(A, Ac ), j ∈ Ac .

Integrating over t > 0 gives the analog of Kac’s formula

Q(A, Ac )EρA TA = π(Ac ). (2.24)

2.5.2 A generating function identity


Transform methods are useful in analyzing special examples, though that
is not the main focus of this book. We record below just the simplest
“transform fact”. We work in discrete time and use generating functions
– the corresponding result in continuous time can be stated using Laplace
transforms.

Lemma 2.25 Define

Gij (z) := Σ_{t≥0} Pi (Xt = j) z^t ,   Fij (z) := Σ_{t≥0} Pi (Tj = t) z^t .

Then Fij = Gij /Gjj .


Analysis proof. Conditioning on Tj gives

p_{ij}^{(t)} = Σ_{l=0}^{t} Pi (Tj = l) p_{jj}^{(t−l)}

and so

Σ_{t≥0} p_{ij}^{(t)} z^t = ( Σ_{l≥0} Pi (Tj = l) z^l ) ( Σ_{m≥0} p_{jj}^{(m)} z^m ).

Thus Gij (z) = Fij (z)Gjj (z), and the lemma follows. 2
Probability proof. Let ζ have geometric(z) law P (ζ > t) = z^t , independent of the chain. Then

Gij (z) = Ei (number of visits to j before time ζ)
        = Pi (Tj < ζ) Ej (number of visits to j before time ζ)
        = Fij (z)Gjj (z).

2
Note that, differentiating term by term,

Ei Tj = (d/dz) Fij (z) |_{z=1} .

This and Lemma 2.25 can be used to give an alternative derivation of the
mean hitting time formula, Lemma 2.12.
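The identity Fij = Gij /Gjj is easy to verify numerically by truncating both series at a fixed z < 1. The sketch below (ours; the 3-state chain, the states i, j and the value z = 0.5 are arbitrary choices) computes Gij and Gjj by iterating P, and Fij from the hitting-time distribution of the chain killed at j.

```python
# Numerical check of Lemma 2.25, F_ij(z) = G_ij(z) / G_jj(z), by truncated series.

P = [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.4, 0.4, 0.2]]
n, i0, j0, z = 3, 0, 2, 0.5
T = 200  # truncation level; neglected terms are O(z^T)

Gi = Gj = 0.0
row_i = [1.0 if k == i0 else 0.0 for k in range(n)]
row_j = [1.0 if k == j0 else 0.0 for k in range(n)]
for t in range(T):
    Gi += row_i[j0] * z ** t
    Gj += row_j[j0] * z ** t
    row_i = [sum(row_i[a] * P[a][b] for a in range(n)) for b in range(n)]
    row_j = [sum(row_j[a] * P[a][b] for a in range(n)) for b in range(n)]

# F_ij(z) and E_i T_j from the hitting-time distribution: kill the chain at j0
F = 0.0
EiTj = 0.0
w = [1.0 if k == i0 else 0.0 for k in range(n)]
for t in range(1, 4 * T):
    w = [sum(w[a] * P[a][b] for a in range(n) if a != j0) for b in range(n)]
    F += w[j0] * z ** t          # w[j0] = P_i(T_j = t)
    EiTj += t * w[j0]            # building up E_i T_j = F'_ij(1)

print(F, Gi / Gj)    # agree: F_ij(z) = G_ij(z) / G_jj(z)
print(EiTj)          # mean hitting time, = F'_ij(1)
```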

2.5.3 Distributions and continuization


The distribution at time t of the continuization X̂ of a discrete-time chain X is most simply viewed as a Poisson mixture of the distributions (Xs ). That is, X̂t =d X_{Nt} where Nt has Poisson(t) distribution independent of X. At greater length,

Pi (X̂t = j) = Σ_{s=0}^{∞} (e^{−t} t^s / s!) Pi (Xs = j).

This holds because we can construct X̂ from X by replacing the determin-


istic “time 1” holds by random, exponential(1), holds (ξj ) between jumps,

and then the number Nt of jumps before time t has Poisson(t) distribution. Now write Sn = Σ_{j=1}^{n} ξj for the time of the n’th jump. Then the hitting time T̂A for the continuized chain is related to the hitting time TA of the discrete-time chain by T̂A = S_{TA} . Though these two hitting time distributions are different, their expectations are the same, and their variances are
related in a simple way. To see this, the conditional distribution of T̂A given
TA is the distribution of the sum of TA independent ξ’s, so (using the notion
of conditional expectation given a random variable)

E(T̂A |TA ) = TA , var (T̂A |TA ) = TA .

Thus (for any initial distribution)

E T̂A = EE(T̂A |TA ) = ETA .

And the conditional variance formula ([133] p. 198)

var Z = E var (Z|Y ) + var E(Z|Y )

tells us that

var T̂A = Evar (T̂A |TA ) + var E(T̂A |TA )


= ETA + var TA . (2.25)

2.6 Matthews’ method for cover times


Theorem 2.26 below is the only non-classical result in this Chapter. We
make extensive use of this Matthews’ method in Chapter 6 to analyze cover
times for random walks on graphs.
Consider the cover time C := maxj Tj of the chain, i.e. the time required to visit every state. How can we bound Ei C in terms of the mean hitting times Ei Tj ? To appreciate the cleverness of Theorem 2.26 let us first consider a more routine argument. Write t∗ := max_{i,j} Ei Tj . Since Ei C is unaffected by continuization, we may work in continuous time. By (2.20)

Pi (Tj > ket∗ ) ≤ e^{−k} , k = 1, 2, 3, . . . .

By Boole’s inequality, for an n-state chain

Pi (C > ket∗ ) ≤ ne^{−k} , k = 1, 2, 3, . . . .



One can rewrite this successively as

Pi ( C/(et∗ ) > x ) ≤ ne · e^{−x} , 0 ≤ x < ∞
Pi ( C/(et∗ ) − log(en) > x ) ≤ e^{−x} , 0 ≤ x < ∞.

In words, this says that the distribution of C/(et∗ ) − log(en) is stochastically smaller than the exponential(1) distribution, implying Ei [ C/(et∗ ) − log(en) ] ≤ 1 and hence

maxi Ei C ≤ (2 + log n)et∗ .

This argument does lead to a bound, but one suspects the factors 2 and e
are artifacts of the proof; also, it seems hard to obtain a lower bound this
way. The following result both “cleans up” the upper bound and gives a
lower bound.

Theorem 2.26 (Matthews [256]) For any n-state Markov chain,

maxv Ev C ≤ h_{n−1} max_{i,j} Ei Tj

minv Ev C ≥ h_{n−1} min_{i≠j} Ei Tj

where h_{n−1} := Σ_{m=1}^{n−1} 1/m .

Proof. We’ll prove the lower bound — the upper bound proof is identi-
cal. Let J1 , J2 , . . . , Jn be a uniform random ordering of the states, inde-
pendent of the chain. Define Cm := maxi≤m TJi to be the time until all of
{J1 , J2 , . . . , Jm } have been visited, in some order. The key identity is

E(Cm − Cm−1 |J1 , . . . , Jm ; Xt , t ≤ Cm−1 ) = t(Lm−1 , Jm )1(Lm =Jm ) (2.26)

where t(i, j) := Ei Tj and

Lm is the state amongst {J1 , J2 , . . . , Jm } hit last.

To understand what this says, suppose we are told which are the states
{J1 , J2 , . . . , Jm } and told the path of the chain up through time Cm−1 . Then
we know whether or not Lm = Jm : if not, then Cm = Cm−1 , and if so, then
the conditional distribution of Cm − Cm−1 is the distribution of the time to
hit Jm from the state at time Cm−1 , which we are told is state Lm−1 .

Writing t∗ := min_{i≠j} t(i, j), the right side of (2.26) is ≥ t∗ 1_{(Lm = Jm )} , and so taking expectations

E(Cm − Cm−1 ) ≥ t∗ P (Lm = Jm ).

But obviously P (Lm = Jm ) = 1/m by symmetry. So

Ev C = Ev C1 + Σ_{m=2}^{n} Ev (Cm − Cm−1 ) ≥ Ev C1 + t∗ Σ_{m=2}^{n} 1/m .

Allowing for the possibility J1 = v we see Ev C1 ≥ (1 − 1/n)t∗ , and the lower bound follows.
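On very small chains the cover time can be computed exactly by expanding the state to (position, visited set), so Matthews' bounds can be checked directly. The sketch below (ours) uses random walk on the 4-cycle, where Ei Tj = d(4 − d) for distance d and the expected cover time is 6, squarely between the lower bound h3 · 3 = 5.5 and the upper bound h3 · 4 ≈ 7.33.

```python
# Exact check of Matthews' bounds on random walk on the 4-cycle.

n = 4
P = {s: {(s - 1) % n: 0.5, (s + 1) % n: 0.5} for s in range(n)}

def mean_hit(i, j):
    # E_i T_j by tail sums over the walk killed at j
    v = {i: 1.0}
    e = 0.0
    while sum(v.values()) > 1e-13:
        e += sum(v.values())          # adds P(T_j > t)
        w = {}
        for s, p in v.items():
            for s2, q in P[s].items():
                if s2 != j:
                    w[s2] = w.get(s2, 0.0) + p * q
        v = w
    return e

def mean_cover(v0):
    # E_{v0} C via the chain on (position, visited-set) pairs
    full = (1 << n) - 1
    dist = {(v0, 1 << v0): 1.0}
    e = 0.0
    while dist:
        e += sum(dist.values())       # adds P(C > t)
        new = {}
        for (s, m), p in dist.items():
            for s2, q in P[s].items():
                m2 = m | (1 << s2)
                if m2 != full:
                    new[(s2, m2)] = new.get((s2, m2), 0.0) + p * q
        dist = {k: p for k, p in new.items() if p > 1e-15}
    return e

hits = [mean_hit(i, j) for i in range(n) for j in range(n) if i != j]
h = sum(1.0 / m for m in range(1, n))            # h_{n-1}
covers = [mean_cover(v) for v in range(n)]
print(min(covers), max(covers))                   # both 6 by symmetry
print(h * min(hits), h * max(hits))               # 5.5 and 7.33...
```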

2.7 New chains from old


Consider a chain (Xt ) on state-space I, and fix A ⊆ I. There are many dif-
ferent constructions of new chains whose state space is (exactly or roughly)
just A, and it’s important not to confuse them. Three elementary con-
structions are described here. Anticipating the definition of reversible from
Chapter 3, it is easy to check that if the original chain is reversible then
each new chain is reversible.

2.7.1 The chain watched only on A


This is the chain (Yn ) defined by

S0 = TA = min{t ≥ 0 : Xt ∈ A}

Sn = min{t > Sn−1 : Xt ∈ A}

Yn = XSn .

The chain (Yn ) has state space A and transition matrix

P̄A (i, j) = Pi (XTA = j), i, j ∈ A.

From the ergodic theorem (Theorem 2.1) it is clear that the stationary distribution πA of (Yn ) is just π conditioned on A, that is

πA (i) = πi /π(A), i ∈ A. (2.27)



2.7.2 The chain restricted to A


This is the chain with state space A and transition matrix P̂A defined by

P̂A (i, j) = P (i, j), i, j ∈ A, i ≠ j

P̂A (i, i) = 1 − Σ_{j∈A, j≠i} P (i, j), i ∈ A.

In general there is little connection between this chain and the original chain
(Xt ), and in general it is not true that the stationary distribution is given
by (2.27). However, when the original chain is reversible, it is easy to check
that the restricted chain does have the stationary distribution (2.27).

2.7.3 The collapsed chain


This chain has state space I ∗ = A ∪ {a} where a is a new state. We interpret
the new chain as “the original chain with states Ac collapsed to a single
state a”. Warning. In later applications we switch the roles of A and Ac ,
i.e. we collapse A to a single state a and use the collapsed chain on states
I ∗ = Ac ∪ {a}. The collapsed chain has transition matrix

p∗ij = pij , i, j ∈ A

p∗ia = Σ_{k∈Ac} pik , i ∈ A

p∗ai = (1/π(Ac )) Σ_{k∈Ac} πk pki , i ∈ A

p∗aa = (1/π(Ac )) Σ_{k∈Ac} Σ_{l∈Ac} πk pkl .

The collapsed chain has stationary distribution π ∗ given by

πi∗ = πi , i ∈ A; πa∗ = π(Ac ).

Obviously the P-chain started at i and run until TAc is the same as the
P∗ -chain started at i and run until Ta . This leads to the general collapsing
principle

To prove a result which involves the behavior of the chain only


up to time TAc , we may assume Ac is a singleton.

For we may apply the singleton result to the P∗ -chain run until time Ta ,
and the same result will hold for the P-chain run until time TAc .

It is important to realize that typically (even for reversible chains) all


three constructions give different processes. Loosely, the chain restricted to
A “rebounds off the boundary of Ac where the boundary is hit”, the collapsed
chain “exits Ac at a random place independent of the hitting place”, and
the chain watched only on A “rebounds at a random place dependent on the
hitting place”.
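The claimed stationary distribution π∗ of the collapsed chain is easy to verify numerically: build P∗ from the formulas above and check π∗P∗ = π∗. The sketch below (ours; the 5-state chain with A^c = {3, 4} collapsed is an arbitrary example) does exactly that.

```python
# Checking that pi* is stationary for the collapsed chain, on a made-up chain.

P = [[0.1, 0.3, 0.2, 0.2, 0.2],
     [0.25, 0.25, 0.2, 0.2, 0.1],
     [0.3, 0.3, 0.2, 0.1, 0.1],
     [0.2, 0.2, 0.2, 0.2, 0.2],
     [0.1, 0.2, 0.3, 0.2, 0.2]]
n = 5
A = [0, 1, 2]
Acomp = [3, 4]

def mat_mul(Am, Bm):
    return [[sum(Am[i][k] * Bm[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pk = P
for _ in range(200):
    Pk = mat_mul(Pk, P)
pi = Pk[0]
piAc = sum(pi[k] for k in Acomp)

# collapsed chain on A + {a}; a is the last index
m = len(A) + 1
Pstar = [[0.0] * m for _ in range(m)]
for x, i in enumerate(A):
    for y, j in enumerate(A):
        Pstar[x][y] = P[i][j]
    Pstar[x][m - 1] = sum(P[i][k] for k in Acomp)
for y, j in enumerate(A):
    Pstar[m - 1][y] = sum(pi[k] * P[k][j] for k in Acomp) / piAc
Pstar[m - 1][m - 1] = sum(pi[k] * P[k][l] for k in Acomp for l in Acomp) / piAc

pistar = [pi[i] for i in A] + [piAc]
balance = [sum(pistar[x] * Pstar[x][y] for x in range(m)) for y in range(m)]
print(balance)
print(pistar)   # componentwise equal: pi* P* = pi*
```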

2.8 Miscellaneous methods


2.8.1 Martingale methods
Modern probabilists regard the martingale optional stopping theorem as one
of the most important results in their subject. As propaganda for martin-
gales we give below four quick applications of that theorem, and a few more
will appear later. All of these results could be proved in alternative, elementary ways. For the reader unfamiliar with martingales, Durrett [133] Chapter 4 contains much more than you need to know; Karlin and Taylor [208] Chapter 6 is a gentler introduction.

Lemma 2.27 Given a non-empty subset A ⊂ I and a function f (i) defined for i ∈ A, there exists a unique extension of f to all of I satisfying

f (i) = Σj pij f (j), i ∉ A.

Proof. If f satisfies the equations above then for any initial distribution the
process Mt := f (Xmin(t,TA ) ) is a martingale. So by the optional stopping
theorem
f (i) = Ei f (XTA ) for all i. (2.28)
This establishes uniqueness. Conversely, if we define f by (2.28) then the
desired equations hold by conditioning on the first step.
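The characterization f (i) = Ei f (X_{TA }) can be checked on a small example: compute the hitting-place probabilities Pi (X_{TA } = a) by iterating the chain killed on A, form the extension, and verify it satisfies the averaging equations off A. The sketch below is ours; the 4-state chain, A = {0, 3} and boundary values are arbitrary.

```python
# Illustrating Lemma 2.27: the extension f(i) = E_i f(X_{T_A}) is harmonic off A.

P = [[0.2, 0.3, 0.3, 0.2],
     [0.1, 0.4, 0.3, 0.2],
     [0.3, 0.2, 0.2, 0.3],
     [0.25, 0.25, 0.25, 0.25]]
n = 4
A = [0, 3]
fA = {0: 2.0, 3: -1.0}

def hit_probs(i):
    # h[a] = P_i(X_{T_A} = a), via the chain killed on A
    h = {a: 0.0 for a in A}
    if i in A:
        h[i] = 1.0
        return h
    w = [0.0] * n
    w[i] = 1.0
    while sum(w) > 1e-15:
        w2 = [sum(w[x] * P[x][y] for x in range(n) if x not in A)
              for y in range(n)]
        for a in A:
            h[a] += w2[a]   # mass arriving on A is absorbed
            w2[a] = 0.0
        w = w2
    return h

f = [sum(hit_probs(i)[a] * fA[a] for a in A) for i in range(n)]

for i in range(n):
    if i not in A:
        print(f[i], sum(P[i][j] * f[j] for j in range(n)))  # equal
```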

Corollary 2.28 If h is harmonic, i.e. if

h(i) = Σj pij h(j) for all i

then h is constant.

Proof. Clearly a constant function is harmonic. So the Corollary follows


from the uniqueness assertion of Lemma 2.27, taking A to be some singleton.

Lemma 2.29 (The random target lemma) The sum Σj Ei Tj πj does not depend on i.

Proof. This repeats Corollary 2.14 with a different argument. The first-step recurrence for gj (i) := Ei Tj is

gj (i) = 1_{(i≠j)} + 1_{(i≠j)} Σk pik gj (k).

By Corollary 2.28 it is enough to show that h(i) := Σj πj gj (i) is a harmonic function. We calculate

h(i) = 1 − πi + Σ_{j,k} πj pik gj (k) 1_{(i≠j)}
     = 1 − πi + Σk pik (h(k) − πi gi (k)) by definition of h(k)
     = Σk pik h(k) + 1 − πi ( 1 + Σk pik gi (k) ).

But 1/πi = Ei Ti+ = 1 + Σk pik gi (k), so h is indeed harmonic.
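The lemma itself is easy to confirm numerically: compute all mean hitting times Ei Tj by tail sums and check that Σj πj Ei Tj is the same for every i. The sketch below is ours, on an arbitrary 3-state chain.

```python
# Numerical check of the random target lemma on a made-up 3-state chain.

P = [[0.5, 0.3, 0.2],
     [0.2, 0.4, 0.4],
     [0.3, 0.4, 0.3]]
n = 3

def mat_mul(Am, Bm):
    return [[sum(Am[i][k] * Bm[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

Pk = P
for _ in range(200):
    Pk = mat_mul(Pk, P)
pi = Pk[0]

def mean_hit(i, j):
    # E_i T_j = sum_t P_i(T_j > t), with T_i = 0 when i = j
    if i == j:
        return 0.0
    e = 0.0
    w = [0.0] * n
    w[i] = 1.0
    while sum(w) > 1e-14:
        e += sum(w)
        w = [sum(w[x] * P[x][y] for x in range(n) if x != j) for y in range(n)]
        w[j] = 0.0   # kill mass that has hit j
    return e

targets = [sum(pi[j] * mean_hit(i, j) for j in range(n)) for i in range(n)]
print(targets)  # the three entries are equal
```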

Lemma 2.30 For any stopping time S and any states i, j, k,

Ei (number of transitions j → k starting before time S)

= pjk Ei (number of visits to j before time S).

Proof. Recall that “before” means strictly before. The assertion of the
lemma is intuitively obvious, because each time the chain visits j it has
chance pjk to make a transition j → k, and one can formalize this as in the
proof of Proposition 2.4. A more sophisticated proof is to observe that M (t)
is a martingale, where

Nj (t) := number of visits to j before time t,

Njk (t) := number of transitions j → k starting before time t,

M (t) := Njk (t) − pjk Nj (t).


And the assertion of the lemma is just the optional stopping theorem applied
to the martingale M and the stopping time S.
2.8. MISCELLANEOUS METHODS 51

Lemma 2.31 Let A be a non-empty subset of I and let h : I → R satisfy
(i) h(i) ≥ 0, i ∈ A
(ii) h(i) ≥ 1 + Σj pij h(j), i ∈ Ac .
Then Ei TA ≤ h(i), i ∈ I.

Proof. For arbitrary h, define g by

h(i) = 1 + Σj pij h(j) + g(i)

and then define

Mt = t + h(Xt ) + Σ_{s=0}^{t−1} g(Xs ).

Then Mmin(t,TA ) is a martingale, so the optional sampling theorem says

Ei MTA = Ei M0 = h(i).

But the hypotheses on h imply MTA ≥ TA .

2.8.2 A comparison argument


A theme running throughout the book is the idea of getting inequalities for
a “hard” chain by making a comparison with some “easier” chain for which
we can do exact calculations. Here is a simple example.

Lemma 2.32 Let X be a discrete-time chain on states {0, 1, 2, . . . , n} such that pij = 0 whenever j > i. Write m(i) = i − Ei X1 , and suppose 0 < m(1) ≤ m(2) ≤ . . . ≤ m(n). Then En T0 ≤ Σ_{j=1}^{n} 1/m(j).

Proof. The proof implicitly compares the given chain to the continuous-time chain with qi,i−1 = m(i). Write h(i) = Σ_{j=1}^{i} 1/m(j), and extend h by linear interpolation to real 0 ≤ x ≤ n. Then h is concave and for i ≥ 1

Ei h(X1 ) ≤ h(Ei X1 ) by concavity
         = h(i − m(i))
         ≤ h(i) − m(i)h′ (i) by concavity
         = h(i) − 1

where h′ is the left derivative. The result now follows from Lemma 2.31.
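For a concrete instance (ours, not from the text): the chain that jumps from i to a uniform state in {0, . . . , i − 1} has m(i) = (i + 1)/2, which is increasing; the exact answer En T0 is the harmonic number Hn (by the triangular first-step recursion), and the lemma's bound is Σj 2/(j + 1).

```python
# Checking Lemma 2.32 on the "jump to a uniform lower state" chain.

n = 10

# exact E_i T_0 by the triangular first-step recursion (p_ij = 0 for j > i)
E = [0.0] * (n + 1)
for i in range(1, n + 1):
    E[i] = 1.0 + sum(E[j] for j in range(i)) / i

m = [(i + 1) / 2 for i in range(n + 1)]        # m(i) = i - E_i X_1
bound = sum(1.0 / m[j] for j in range(1, n + 1))

print(E[n], bound)   # H_10 ~ 2.929 <= bound ~ 4.040
```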

2.8.3 Wald equations


As mentioned previously, the results above don’t really require martingales.
Next we record a genuine martingale result, not directly involving Markov
chains but ultimately useful in their analysis. Part (c) is Wald’s equation
and part (b) is Wald’s equation for martingales. The result is a standard
consequence of the optional sampling theorem: see [133] (3.1.6) for (c) and
[133] Theorem 4.7.5 for (a).

Lemma 2.33 (a) Let 0 = Y0 ≤ Y1 ≤ Y2 ≤ . . . be such that

E(Yi+1 − Yi |Yj , j ≤ i) ≤ c, i ≥ 0

for a constant c. Then for any stopping time T ,

EYT ≤ c ET.

(b) If in the hypothesis we replace “≤ c” by “= c”, then EYT = c ET .
(c) In particular, if Yn = Σ_{i=1}^{n} ξi for i.i.d. nonnegative (ξi ), then EYT = (Eξi )(ET ).

2.9 Notes on Chapter 2.


Textbooks on Markov chains.

It is easy to write books on . . . or finite Markov chains, or on any


of the other well-understood topics for which no further exposi-
tions are needed. G.-C. Rota
Your search for the Subject: MARKOV PROCESSES
retrieved 273 records. U.C. Berkeley Library book catalog,
September 1999.

Almost every introductory textbook on stochastic processes has a chapter


or two about Markov chains: among the best are Karlin-Taylor [208, 209],
Grimmett-Stirzaker [177] and, slightly more advanced, Asmussen [34]. In
addition to Norris [270] there are several other undergraduate-level text-
books entirely or mostly devoted to Markov chains: Adke-Manjanuth [1],
Hunter [186], Iosifescu [189], Isaacson-Madsen [191], Kemeny-Snell [214],
Romanovsky [297]. At the graduate level, Durrett [133] has a concise chap-
ter on the modern approach to the basic limit theory. Several more advanced
texts which overlap our material were mentioned in Chapter 1 section 2.3
(yyy 7/20/99 version); other texts are Freedman [154], Anderson [31], and

the treatise of Syski [318] on hitting times. Most textbooks leave an exaggerated impression of the difference between discrete- and continuous-time chains.
Section 2.1.2. Continuized is an ugly neologism, but no-one has collected
my $5 prize for suggesting a better name!
Section 2.2. Elementary matrix treatments of results like those in section
2.2.2 for finite state space can be found in [186, 214]. On more general spaces,
this is part of recurrent potential theory: see [96, 215] for the countable-state
setting and Revuz [289] for continuous space. Our treatment is somewhat novel at the textbook level. Pitman [283] studied occupation measure identities more general than those in section 2.2.1 and their applications to hitting time formulas, and we follow his approach in section MHTF. We are being
slightly dishonest in treating Lemmas 2.5 and 2.6 this way, because these
facts figure in the “right” proof of the ergodic theorems we use. We made
a special effort not to abbreviate “number of visits to j before time S” as
Nj (S), which forces the reader to decode formulas.
Kemeny and Snell [214] call Z + Π the fundamental matrix, and use
(Ei Tj+ ) rather than (Ei Tj ) as the matrix of mean hitting times. Our set-up
seems a little smoother – cf. Meyer [202] who calls Z the group inverse of
I − P.
The name “random target lemma” for Corollary 2.14 was coined by
Lovász and Winkler [241]; the result itself is classical ([214] Theorem 4.4.10).

Open Problem 2.34 Portmanteau theorem for occupation times.

Can the results of section 2.2.2 be formulated as a single theorem? To explain the goal by analogy, consider the use [194] of Feynman diagrams to calculate quantities such as E(A^3 BC^2 ) for dependent mean-zero Gaussian (A, B, C). One rewrites the expectation as E Π_{i=1}^{6} ξi for ξ1 = ξ2 = ξ3 = A; ξ4 = B; ξ5 = ξ6 = C, and then applies the formula

E Π_{i=1}^{6} ξi = Σ_M ν(M )

where the sum is over matchings M = {{u1 , v1 }, {u2 , v2 }, {u3 , v3 }} of {1, 2, 3, 4, 5, 6} and where

ν(M ) = Π_{j=1}^{3} E(ξuj ξvj ).

By analogy, we seek a general rule which associates an expression like

Ei (number of visits to j before time min(Tk , Tl ))

with a combinatorial structure involving {i, j, k, l}; then associates with the combinatorial structure some function of the variables {pv , zvw : v, w ∈ {i, j, k, l}}; then shows that the value of the expression applied to a finite Markov chain equals the function of {πv , Zvw : v, w ∈ {i, j, k, l}}.
Section 2.4.1. Corollary 2.21 and variants are the basis for the theory
of positive-recurrent chains on continuous spaces: see [133] section 5.6 and
Meyn and Tweedie [263].
Section 2.4.2. The fact that ||θ(t) − π||2 is decreasing is a special case (H(u) = u^2 ) of the following result (e.g. [213] Theorem 1.6).

Lemma 2.35 Let H : [0, ∞) → [0, ∞) be concave [convex]. Let θ(t) be the distribution of an irreducible chain with stationary distribution π. Then Σi πi H(θi (t)/πi ) is increasing [decreasing].

Section 2.6. Matthews [256, 257] introduced his method (Theorem 2.26)
to study some highly symmetric walks (cf. Chapter 7) and to study some
continuous-space Brownian motion covering problems.
Section 2.7. A more sophisticated notion is “the chain conditioned never
to hit A”, which can be formalized using Perron-Frobenius theory.
Section 2.8.1. Applying the optional stopping theorem involves checking
side conditions (involving integrability of the martingale or the stopping
time), but these are trivially satisfied in our applications.
Numerical methods. In many applications of non-reversible chains, e.g.
to queueing-type processes, one must resort to numerical computations of
the stationary distribution: see Stewart [314]. We don’t discuss such issues
because in the reversible case we have conceptually simple expressions for
the stationary distribution.
Matrix methods. There is a curious dichotomy between textbooks on
Markov chains which use matrix theory almost everywhere and textbooks
which use matrix theory almost nowhere. Our style is close to the latter;
matrix formalism obscures more than it reveals. For our purposes, the one
piece of matrix theory which is really essential is the spectral decomposition
of reversible transition matrices in Chapter 3. Secondarily useful is the the-
ory surrounding the Perron-Frobenius theorem, quoted for reversible chains
in Chapter 3 section 6.5. (yyy 9/2/94 version)

yyy move both subsections to Chapter 8 “A Second Look . . . ”.

2.10 Move to other chapters


2.10.1 Attaining distributions at stopping times
We quote a result, Theorem 2.36, which may look superficially like the
identities in section 2.2.1 but which in fact is deeper, in that it cannot be
proved by mere matrix manipulations or by Proposition 2.3. The result goes
back to Baxter and Chacon [44] (and is implicit in Rost [301]) in the more
general continuous-space setting: a proof tailored to the finite state space
case has recently been given by Lovász and Winkler [241].
Given distributions σ, µ, consider a stopping time T such that

Pσ (XT ∈ ·) = µ(·). (2.29)

Clearly, for any state j we have Eσ Tj ≤ Eσ T + Eµ Tj , which rearranges to


Eσ T ≥ Eσ Tj − Eµ Tj . So if we define

t̄(σ, µ) = inf{Eσ T : T a stopping time satisfying (2.29)}

then we have shown that t̄(σ, µ) ≥ maxj (Eσ Tj − Eµ Tj ). Surprisingly, this


inequality turns out to be an equality.

Theorem 2.36 t̄(σ, µ) = maxj (Eσ Tj − Eµ Tj ).

2.10.2 Differentiating stationary distributions


From the definition (2.6) of the fundamental matrix Z we can write, in
matrix notation,
(I − P)Z = Z(I − P) = I − Π (2.30)
where Π is the matrix with (i, j)-entry πj . The matrix I − P is not invertible
but (2.30) expresses Z as a “generalized inverse” of I − P, and one can use
matrix methods to verify general identities in the spirit of section 2.2.1. See
e.g. [186, 214]. Here is a setting where such matrix methods work well.

Lemma 2.37 Suppose P (and hence π and Z) depend on a real parameter α, and suppose R = (d/dα)P exists. Then, at each α such that P is irreducible,

(d/dα) π = πRZ.


Proof. Write η = (d/dα)π. Differentiating the balance equations π = πP gives η = ηP + πR, in other words η(I − P) = πR. Right-multiply by Z to get

πRZ = η(I − P)Z = η(I − Π) = η − ηΠ.

But ηΠ = 0 because Σi ηi = (d/dα)(Σi πi ) = 0.
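Lemma 2.37 can be checked numerically against a finite-difference derivative. The sketch below (ours; the one-parameter family of 3-state chains is an arbitrary example) computes Z by truncating the series Σt (P^t − Π) and compares πRZ with a central difference of the stationary distribution.

```python
# Finite-difference check of d(pi)/d(alpha) = pi R Z on a made-up family.

n = 3

def P_of(alpha):
    # rows are probability vectors for alpha near 0.2
    return [[0.5 - alpha, 0.3 + alpha, 0.2],
            [0.2, 0.5, 0.3],
            [0.4, 0.1 + alpha, 0.5 - alpha]]

def mat_mul(Am, Bm):
    return [[sum(Am[i][k] * Bm[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def stationary(P):
    Pk = P
    for _ in range(400):
        Pk = mat_mul(Pk, P)
    return Pk[0]

alpha = 0.2
P = P_of(alpha)
pi = stationary(P)

# Z = sum_t (P^t - Pi), truncated series
Z = [[0.0] * n for _ in range(n)]
Pt = [[float(i == j) for j in range(n)] for i in range(n)]
for _ in range(400):
    for i in range(n):
        for j in range(n):
            Z[i][j] += Pt[i][j] - pi[j]
    Pt = mat_mul(Pt, P)

# R = dP/d(alpha), exact for this family
R = [[-1.0, 1.0, 0.0],
     [0.0, 0.0, 0.0],
     [0.0, 1.0, -1.0]]

piRZ = [sum(pi[i] * R[i][k] * Z[k][j] for i in range(n) for k in range(n))
        for j in range(n)]

# central finite difference of the stationary distribution
h = 1e-6
pi_plus = stationary(P_of(alpha + h))
pi_minus = stationary(P_of(alpha - h))
fd = [(pi_plus[j] - pi_minus[j]) / (2 * h) for j in range(n)]

print(piRZ)
print(fd)   # agree to O(h^2)
```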
Chapter 3

Reversible Markov Chains


(September 10, 2002)

Chapter 2 (9/10/99 version) reviewed some aspects of the elementary theory of general finite irreducible Markov chains. In this chapter we specialize to reversible chains, treating the discrete-time and continuous-time cases in parallel. After Section 3.3 we shall assume that we are dealing with reversible chains without continually repeating this assumption, and shall instead explicitly say “general” to mean not necessarily reversible.

3.1 Introduction
Recall P denotes the transition matrix and π the stationary distribution of
a finite irreducible discrete-time chain (Xt ). Call the chain reversible if
πi pij = πj pji for all i, j. (3.1)
Equivalently, suppose (for given irreducible P) that π is a probability distri-
bution satisfying (3.1). Then π is the unique stationary distribution and the
chain is reversible. This is true because (3.1), sometimes called the detailed
balance equations, implies
Σi πi pij = Σi πj pji = πj   for all j

and therefore π satisfies the balance equations of (1) in Chapter 2.
The name reversible comes from the following fact. If (Xt ) is the sta-
tionary chain, that is, if X0 has distribution π, then
(X0 , X1 , . . . , Xt ) =d (Xt , Xt−1 , . . . , X0 ).


More vividly, given a movie of the chain run forwards and the same movie
run backwards, you cannot tell which is which.
It is elementary that the same symmetry property (3.1) holds for the
t-step transition matrix Pt :
πi pij(t) = πj pji(t)

and thence for the matrix Z of (6) in Chapter 2:

πi Zij = πj Zji . (3.2)

But beware that the symmetry property does not work for mean hitting
times: the assertion
πi Ei Tj = πj Ej Ti
is definitely false in general (see the Notes for one intuitive explanation).
See Chapter 7 for further discussion. The following general lemma will be useful there.

Lemma 3.1 For an irreducible reversible chain, the following are equivalent.
(a) Pi (Xt = i) = Pj (Xt = j), i, j ∈ I, t ≥ 1
(b) Pi (Tj = t) = Pj (Ti = t), i, j ∈ I, t ≥ 1.

Proof. In either case the stationary distribution is uniform—under (a) by


letting t → ∞, and under (b) by taking t = 1, implying pij ≡ pji . So by
reversibility Pi (Xt = j) = Pj (Xt = i) for i ≠ j and t ≥ 1. But recall from Chapter 2 Lemma 25 that the generating functions

Gij (z) := Σt Pi (Xt = j) z^t ,   Fij (z) := Σt Pi (Tj = t) z^t

satisfy
Fij = Gij /Gjj . (3.3)
For i 6= j we have seen that Gij = Gji , and hence by (3.3)

Fij = Fji iff Gjj = Gii ,

which is the assertion of Lemma 3.1.


The discussion above extends to continuous time with only notational
changes, e.g., the detailed balance equation (3.1) becomes

πi qij = πj qji for all i, j. (3.4)



3.1.1 Time-reversals and cat-and-mouse games


For a general chain we can define the time-reversed chain to have transition
matrix P∗ where
πi pij = πj p∗ji
so that the chain is reversible iff P∗ = P. One can check [cf. (3.2)]

πi Zij = πj Z∗ji .   (3.5)

The stationary P∗ -chain is just the stationary P-chain run backwards in


time. Consider Examples 16 and 22 from Chapter 2. In Example 16 (patterns in coin tossing) the time-reversal P∗ just “shifts left” instead of shifting right, i.e., from HTTTT the possible transitions are to HHTTT and THTTT. In Example 22 the time-reversal just reverses the direction of
motion around the n-cycle:
p∗ij = a 1(j=i−1) + (1 − a)/n.
Warning. These examples are simple because the stationary distributions
are uniform. If the stationary distribution has no simple form then typically
P∗ will have no simple form.
A few facts about reversible chains are really specializations of facts
about general chains which involve both P and P∗ . Here is a simple instance.
Lemma 3.2 (The cyclic tour property) For states i0 , i1 , . . . , im of a re-
versible chain,

Ei0 Ti1 + Ei1 Ti2 + · · · + Eim Ti0 = Ei0 Tim + Eim Tim−1 + · · · + Ei1 Ti0 .

The explanation is that in a general chain we have

Ei0 Ti1 + Ei1 Ti2 + · · · + Eim Ti0 = E∗i0 Tim + E∗im Tim−1 + · · · + E∗i1 Ti0   (3.6)

where E ∗ refers to the time-reversed chain P∗ . Equality (3.6) is intu-


itively obvious when we visualize running a movie backwards. But a pre-
cise argument requires a little sophistication (see Notes). It is however
straightforward to verify (3.6) using (3.5) and the mean hitting time for-
mula Ei Tj = (Zjj − Zij )/πj .
We shall encounter several results which have amusing interpretations as
cat-and-mouse games. The common feature of these games is that the cat
moves according to a transition matrix P and the mouse moves according
to the time-reversed transition matrix P∗ .

Cat-and-mouse game 1. Both animals are placed at the same state,


chosen according to the stationary distribution. The mouse makes a jump
according to P∗ , and then stops. The cat starts moving according to P and
continues until it finds the mouse, after M steps.
The notable feature of this game is the simple formula for EM :

EM = n − 1, where n is the number of states. (3.7)

This is simple once you see the right picture. Consider the stationary P-
chain (X0 , X1 , X2 , . . .). We can specify the game in terms of that chain by
taking the initial state to be X1 , and the mouse’s jump to be to X0 , and
the cat’s moves to be to X2 , X3 , . . .. So M = T + − 1 with

T + := min{t ≥ 1 : Xt = X0 }.

And ET+ = Σi πi Ei Ti+ = Σi πi (1/πi ) = n.
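The identity ET+ = n (hence EM = n − 1) can be confirmed without simulation. The sketch below (a hypothetical weighted 4-cycle; numpy assumed available) computes each Ei Ti+ by first-step analysis and averages against π.

```python
import numpy as np

def mean_hitting_times(P, target):
    # h[j] = E_j T_target via the first-step equations (I - P_restricted) h = 1
    n = len(P)
    idx = [j for j in range(n) if j != target]
    h = np.zeros(n)
    h[idx] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)], np.ones(n - 1))
    return h

# Hypothetical reversible example: random walk on a weighted 4-cycle.
n = 4
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 1.5, (3, 0): 0.5}
W = np.zeros((n, n))
for (i, j), we in edges.items():
    W[i, j] = W[j, i] = we
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()            # stationary distribution (3.14)

# Game 1: E M = E T^+ - 1, with E T^+ = sum_i pi_i E_i T_i^+
ET_plus = sum(pi[i] * (1.0 + P[i] @ mean_hitting_times(P, i)) for i in range(n))
assert abs(ET_plus - n) < 1e-10         # so E M = n - 1
```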

Cat-and-mouse game 2. This game, and Proposition 3.3, are rephrasings


of results of Coppersmith et al [101]. Think of the cat and the mouse as
pieces in a solitaire board game. The player sees their positions and chooses
which one to move: if the cat is chosen, it makes a move according to P, and
if the mouse is chosen, it makes a move according to P∗ . Let M denote the
number of moves until the cat and mouse meet. Then one expects the mean
E(x,y) M to depend on the initial positions (x, y) of (cat, mouse) and on the
player’s strategy. But consider the example of asymmetric random walk on an n-cycle, with (say) chance 2/3 of moving clockwise and chance 1/3 of moving
counterclockwise. A moment’s thought reveals that the distance (measured
clockwise from cat to mouse) between the animals does not depend on the
player’s strategy, and hence neither does E(x,y) M . In general EM does
depend on the strategy, but the following result implies that the size of the
effect of strategy changes can be bounded in terms of a measure of non-
symmetry of the chain.
Proposition 3.3 Regardless of strategy,

min_z Eπ Tz ≤ E(x,y) M − (Ex Ty − Eπ Ty ) ≤ max_z Eπ Tz

where hitting times T refer to the P-chain.


Proof. Consider the functions (here :≡ means “defined identically to be”)

f (x, y) :≡ Ex Ty − Eπ Ty
f ∗ (y, x) :≡ E∗y Tx − E∗π Tx .



The first-step recurrences for x ↦ Ex Ty and y ↦ E∗y Tx give

f (x, y) = 1 + Σz pxz f (z, y),   y ≠ x   (3.8)
f ∗ (y, x) = 1 + Σz p∗yz f ∗ (z, x),   y ≠ x.   (3.9)

By the mean hitting time formula

f (x, y) = −Zxy /πy = −Z∗yx /πx = f ∗ (y, x)

so we may rewrite (3.9) as

f (x, y) = 1 + Σz p∗yz f (x, z),   y ≠ x.   (3.10)

Now let (X̂t , Ŷt ) be the positions of (cat, mouse) after t moves according to
some strategy. Consider
Wt ≡ t + f (X̂t , Ŷt ).
Equalities (3.8) and (3.10) are exactly what is needed to verify
(Wt ; 0 ≤ t ≤ M ) is a martingale.
So the optional stopping theorem says E(x,y) W0 = E(x,y) WM , that is,

f (x, y) = E(x,y) M + E(x,y) f (X̂M , ŶM ). (3.11)

But X̂M = ŶM and −f (z, z) = Eπ Tz , so


min_z Eπ Tz ≤ −f (X̂M , ŶM ) ≤ max_z Eπ Tz

and the result follows from (3.11).


Remarks. Symmetry conditions in the reversible setting are discussed
in Chapter 7. Vertex-transitivity forces Eπ Tz to be independent of z, and
hence in the present setting implies E(x,y) M = Ex Ty regardless of strategy.
For a reversible chain without this symmetry condition, consider (x0 , y0 )
attaining the min and max of Eπ Tz . The Proposition then implies Ey0 Tx0 ≤
E(x0 ,y0 ) M ≤ Ex0 Ty0 and the bounds are attained by keeping one animal
fixed. But for general initial states the bounds of the Proposition are not
attained. Indeed, the proof shows that to attain the bounds we need a
strategy which forces the animals to meet at states attaining the extrema
of Eπ Tz . Finally, in the setting of random walk on an n-vertex graph we can combine Proposition 3.3 with mean hitting time bounds from Chapter 6 to show that EM is at worst O(n³).

3.1.2 Entrywise ordered transition matrices


Recall from Chapter 2 Section 3 that for a function f : S → R with Σi πi fi = 0, the asymptotic variance rate is

σ²(P, f ) := lim_t t^{−1} var Σ_{s=1}^{t} f (Xs ) = f Γf   (3.12)

where Γij = πi Zij + πj Zji + πi πj − πi δij . These individual-function variance


rates can be compared between chains with the same stationary distribu-
tion, under a very strong “(off-diagonal) entrywise ordering” of reversible
transition matrices.
Lemma 3.4 (Peskun’s Lemma [280]) Let P(1) and P(2) be reversible with the same stationary distribution π. Suppose p(1)ij ≤ p(2)ij for all j ≠ i. Then σ²(P(1) , f ) ≥ σ²(P(2) , f ) for all f with Σi πi fi = 0.

Proof. Introduce a parameter 0 ≤ α ≤ 1 and write

P = P(α) := (1 − α)P(1) + αP(2) .

Write (·)′ for (d/dα)(·). It is enough to show (σ²(P, f ))′ ≤ 0. By (3.12)

(σ²(P, f ))′ = f Γ′ f = 2 Σi Σj fi πi Z′ij fj .

By Chapter MCMC Lemma 4, Z′ = ZP′ Z. By setting

gi := πi fi ;   aij := Zij /πj ;   wij := πi pij

we find A′ = AW′ A and can rewrite the equality above as

(σ²(P, f ))′ = 2 gAW′ Ag.

Since A is symmetric, it is enough to show that W′ is negative semidefinite. By hypothesis W′ is symmetric with zero row-sums and w′ij ≥ 0 for j ≠ i. Ordering states arbitrarily, we may write

W′ = Σ_{i,j: i<j} w′ij Mij

where Mij is the matrix whose only nonzero entries are m(i, i) = m(j, j) = −1 and m(i, j) = m(j, i) = 1. Plainly Mij is negative semidefinite, hence so is W′.
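As an illustrative check of Peskun’s Lemma (the example matrices below are our own, not from the text; numpy assumed available), compare a symmetric transition matrix P(2) with its lazy version P(1) = (I + P(2))/2, which is entrywise smaller off the diagonal; both are reversible with uniform π.

```python
import numpy as np

# P2 is symmetric, hence reversible with uniform pi; its lazy version P1
# satisfies p1_ij <= p2_ij off-diagonal, so Peskun's Lemma predicts
# sigma^2(P1, f) >= sigma^2(P2, f).
P2 = np.array([[0.0, 0.5, 0.3, 0.2],
               [0.5, 0.0, 0.2, 0.3],
               [0.3, 0.2, 0.0, 0.5],
               [0.2, 0.3, 0.5, 0.0]])
P1 = 0.5 * (np.eye(4) + P2)

def sigma2(P, f):
    n = len(P)
    pi = np.full(n, 1.0 / n)                      # uniform, by symmetry of P2
    Pi = np.outer(np.ones(n), pi)
    Z = np.linalg.inv(np.eye(n) - P + Pi) - Pi    # fundamental matrix
    piZ = pi[:, None] * Z
    Gamma = piZ + piZ.T + np.outer(pi, pi) - np.diag(pi)
    return f @ Gamma @ f                          # asymptotic variance rate (3.12)

f = np.array([3.0, -1.0, -1.0, -1.0])             # sum_i pi_i f_i = 0
assert abs(np.full(4, 0.25) @ f) < 1e-12
assert sigma2(P1, f) >= sigma2(P2, f) - 1e-12
```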

3.2 Reversible chains and weighted graphs


Our convention is that a graph has finite vertex-set V = {v, x, y, . . .} and
edge-set E = {e1 , e2 , . . .}, is connected and undirected, has no multiple edges,
and has no self-loops. In a weighted graph, each edge (v, x) also has a weight 0 < wvx = wxv < ∞, and we allow a weighted graph to have self-loops.
Given a weighted graph, there is a natural definition of a Markov chain
on the vertices. This requires an arbitrary choice of convention: do we want
to regard an absent edge as having weight 0 or weight +∞? In terms of
electrical networks (Section 3.3) the question is whether to regard weights
as conductances or as resistances of wires. Conceptually one can make good
arguments for either choice, but formulas look simpler with the conductance
convention (absent edges have weight 0), so we’ll adopt that convention.
Define discrete-time random walk on a weighted graph to be the Markov
chain with transition matrix

pvx := wvx /wv , x 6= v (3.13)

where wv := Σx wvx ,   w := Σv wv .

Note that w is the total edge-weight, when each edge is counted twice, i.e.,
once in each direction. The fundamental fact is that this chain is automat-
ically reversible with stationary distribution

πv ≡ wv /w (3.14)

because (3.1) is obviously satisfied by πv pvx = πx pxv = wvx /w. Our standing
convention that graphs be connected implies that the chain is irreducible.
Conversely, with our standing convention that chains be irreducible, any
reversible chain can be regarded as a random walk on the weighted graph
with edge-weights wvx := πv pvx . Note also that the “aperiodic” condition for
a Markov chain (occurring in the convergence theorem Chapter 2 Theorem 2)
is just the condition that the graph be not bipartite.
An unweighted graph can be fitted into this setup by simply assigning
weight 1 to each edge. Since we’ll be talking a lot about this case, let’s write
out the specialization explicitly. The transition matrix becomes
pvx = 1/dv   if (v, x) is an edge
pvx = 0       if not

where dv is the degree of vertex v. The stationary distribution becomes

πv = dv /(2|E|)   (3.15)

where |E| is the number of edges of the graph. In particular, on an un-


weighted regular graph the stationary distribution is uniform.
In continuous time there are two different ways to associate a walk with
a weighted or unweighted graph. One way (and we use this way unless oth-
erwise mentioned ) is just to use (3.13) as the definition of the transition
rates qvx . In the language of Chapter 2 this is the continuization of the
discrete-time walk, and has the same stationary distribution and mean hit-
ting times as the discrete-time walk. The alternative definition, which we
call the fluid model , uses the weights directly as transition rates:

qvx := wvx , x 6= v. (3.16)

In this model the stationary distribution is always uniform (cf. Section 3.2.1).
In the case of an unweighted regular graph the two models are identical up to
a deterministic time rescaling, but for non-regular graphs there are typically
no exact relations between numerical quantities for the two continuous-time
models. Note that, given an arbitrary continuous-time reversible chain, we
can define edge-weights (wij ) via

πi qij = πj qji = wij , say

but the weights (wij ) do not completely determine the chain: we can specify
the πi independently and then solve for the q’s.
Though there’s no point in writing out all the specializations of the
general theory of Chapter 2, let us emphasize the simple expressions for mean return times of discrete-time walk obtained from Chapter 2 Lemma 5
and the expressions (3.14)–(3.15) for the stationary distribution.

Lemma 3.5 For random walk on an n-vertex graph,


Ev Tv+ = w/wv   (weighted)
       = 2|E|/dv   (unweighted)
       = n   (unweighted regular).

Example 3.6 Chess moves.

Here is a classic homework problem for an undergraduate Markov chains


course.

Start a knight at a corner square of an otherwise-empty chess-


board. Move the knight at random, by choosing uniformly from
the legal knight-moves at each step. What is the mean number
of moves until the knight returns to the starting square?

It’s a good question, because if you don’t know Markov chain theory it looks
too messy to do by hand, whereas using Markov chain theory it becomes very
simple. The knight is performing random walk on a graph (the 64 squares
are the vertices, and the possible knight-moves are the edges). It is not hard
to check that the graph is connected, so by the elementary Lemma 3.5, for
a corner square v the mean return time is

Ev Tv+ = 1/πv = 2|E|/dv = |E| (since a corner square has knight-degree dv = 2),
and by drawing a sketch in the margin the reader can count the number of
edges |E| to be 168.
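The count |E| = 168 (and the degree-2 corner) is quickly confirmed by a short enumeration; the sketch below is plain Python.

```python
# Enumerate the knight's-move graph on the 8x8 board: squares are vertices,
# legal knight moves are edges. Confirms |E| = 168 and the corner mean
# return time 2|E|/d_v = 168 from Lemma 3.5.
moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]
deg = {}
for r in range(8):
    for c in range(8):
        deg[(r, c)] = sum(0 <= r + dr < 8 and 0 <= c + dc < 8
                          for dr, dc in moves)
E = sum(deg.values()) // 2        # each undirected edge counted at both ends
assert E == 168
assert deg[(0, 0)] == 2           # a corner square has exactly 2 knight moves
mean_return = 2 * E / deg[(0, 0)] # E_v T_v^+ = 2|E|/d_v
assert mean_return == 168
```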
The following cute variation of Lemma 3.5 is sometimes useful. Given
the discrete-time random walk (Xt ), consider the process

Zt = (Xt−1 , Xt )

recording the present position at time t and also the previous position.

Clearly (Zt ) is a Markov chain whose state-space is the set E→ of directed edges, and its stationary distribution (ρ, say) is

ρ(v, x) = wvx /w

in the general weighted case, and hence

ρ(v, x) = 1/|E→| ,   (v, x) ∈ E→

in the unweighted case. Now given an edge (x, v), we can apply Chapter 2
Lemma 5 to (Zt ) and the state (x, v) to deduce the following.

Lemma 3.7 Given an edge (v, x) define

U := min{t ≥ 1 : Xt = v, Xt−1 = x}.



Then

Ev U = w/wvx   (weighted)
     = 2|E|   (unweighted).

Corollary 3.8 (The edge-commute inequality) For an edge (v, x),


Ev Tx + Ex Tv ≤ w/wvx   (weighted)
Ev Tx + Ex Tv ≤ 2|E|   (unweighted).

We shall soon see (Section 3.3.3) this inequality has a natural interpretation
in terms of electrical resistance, but it is worth remembering that the result
is more elementary than that.
Here is another variant of Lemma 3.5.

Lemma 3.9 For random walk on a weighted n-vertex graph,


Σ_{e=(v,x)} we (Ev Tx + Ex Tv ) = w(n − 1)

where the sum is over undirected edges.


Proof. Writing Σv Σx for the sum over directed edges (v, x), the left side equals

(1/2) Σv Σx wvx (Ev Tx + Ex Tv )
= Σv Σx wvx Ex Tv   by symmetry
= w Σv Σx πv pvx Ex Tv
= w Σv πv (Ev Tv+ − 1)
= w Σv πv (1/πv − 1)
= w(n − 1).
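Lemma 3.9 can be checked numerically on a small hypothetical weighted graph (numpy assumed available): compute all mean hitting times by linear solves and sum over edges.

```python
import numpy as np

def mean_hitting_times(P, target):
    # h[j] = E_j T_target via first-step analysis
    n = len(P)
    idx = [j for j in range(n) if j != target]
    h = np.zeros(n)
    h[idx] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)], np.ones(n - 1))
    return h

# Hypothetical weighted graph on 5 vertices (our own example).
n = 5
edges = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 3.0, (3, 4): 0.5,
         (4, 0): 1.5, (1, 3): 2.5}
W = np.zeros((n, n))
for (v, x), we in edges.items():
    W[v, x] = W[x, v] = we
P = W / W.sum(axis=1, keepdims=True)
w_total = W.sum()                        # each edge counted twice

H = np.array([mean_hitting_times(P, j) for j in range(n)]).T  # H[i,j] = E_i T_j
lhs = sum(we * (H[v, x] + H[x, v]) for (v, x), we in edges.items())
assert abs(lhs - w_total * (n - 1)) < 1e-8
```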

3.2.1 The fluid model


Imagine a finite number of identical buckets which can hold unit quantity of
fluid. Some pairs of buckets are connected by tubes through their bottoms.
If a tube connects buckets i and j then, when the quantities of fluid in

buckets i and j are pi and pj , the flow rate through the tube should be
proportional to the pressure difference and hence should be wij (pi − pj ) in
the direction i → j, where wij = wji is a parameter. Neglecting the fluid
in the tubes, the quantities of fluid (pi (t)) at time t will evolve according to
the differential equations

dpj (t)/dt = Σ_{i≠j} wij (pi (t) − pj (t)).

These of course are the same equations as the forward equations [(4) of
Chapter 2] for pi (t) (the probability of being in state i at time t) for the
continuous-time chain with transition rates qij = wij , j 6= i. Hence we call
this particular way of defining a continuous-time chain in terms of a weighted
graph the fluid model . Our main purpose in mentioning this notion is to
distinguish it from the electrical network analogy in the next section. Our
intuition about fluids says that as t → ∞ the fluid will distribute itself
uniformly amongst buckets, which corresponds to the elementary fact that
the stationary distribution of the “fluid model” chain is always uniform.
Our intuition also says that increasing a “specific flow rate” parameter wij
will make the fluid settle faster, and this corresponds to a true fact about
the “fluid model” Markov chain (in terms of the eigenvalue interpretation
of asymptotic convergence rate—see Corollary 3.28). On the other hand
the same assertion for the usual discrete-time chain or its continuization is
simply false.

3.3 Electrical networks


3.3.1 Flows
This is a convenient place to record some definitions. A flow f = (fij ) on a
graph is required only to satisfy the conditions
fij = −fji   if (i, j) is an edge
fij = 0       if not.
So the net flow out of i is f(i) := Σ_{j≠i} fij , and by symmetry Σi f(i) = 0. We
will be concerned with flows satisfying extra conditions. Given disjoint non-
empty subsets A, B of vertices, a unit flow from B to A is a flow satisfying
Σ_{i∈B} f(i) = 1,   f(j) = 0 for all j ∉ A ∪ B   (3.17)

which implies Σ_{i∈A} f(i) = −1. Given a Markov chain X (in particular, given

a weighted graph we can use the random walk) we can define a special flow
as follows. Given v0 6∈ A, define f v0 →A by
fij := Ev0 Σ_{t=1}^{TA} ( 1(Xt−1 = i, Xt = j) − 1(Xt−1 = j, Xt = i) ).   (3.18)

So fij is the mean number of transitions i → j minus the mean number


of transitions j → i, for the chain started at v0 and run until hitting A.
Clearly f v0 →A is a unit flow from v0 to A. Note that the mean net transitions
definition of fij works equally well in continuous time to provide a unit flow
f v0 →A from v0 to A.
In Section 3.7.2 we will define the notion of “a unit flow from v0 to a
probability distribution ρ” and utilize a special unit flow from v0 to the
stationary distribution.

3.3.2 The analogy


Given a weighted graph, consider the graph as an electrical network, where
a wire linking v and x has conductance wvx , i.e., resistance 1/wvx . Fix a
vertex v0 and a subset A of vertices not containing v0 . Apply voltage 1 at v0
and ground (i.e., set at voltage 0) the set A of vertices. As we shall see, this
determines the voltage g(v) at each vertex v; in particular,

g(v0 ) = 1; g(·) = 0 on A. (3.19)

Physically, according to Ohm’s law,

Current = Potential difference / Resistance
for each wire; that is, the current Ivx along each wire (v, x) satisfies

Ivx = (g(v) − g(x))wvx . (3.20)

Clearly, I is a flow, and according to Kirchhoff’s node law

I(v) = 0,   v ∉ {v0 } ∪ A.   (3.21)

Regarding the above as intuition arising from the study of physical elec-
trical networks, we can define an electrical network mathematically as a
weighted graph together with a function g and a flow I, called voltage and
current, respectively, satisfying (3.20)–(3.21) and the normalization (3.19).

As it turns out, these three conditions specify g [and hence also I, by (3.20)]
uniquely since (3.20)–(3.21) imply
g(v) = Σx pvx g(x),   v ∉ {v0 } ∪ A   (3.22)

with pvx defined at (3.13), and Chapter 2 Lemma 27 shows that this equation, together with the boundary conditions (3.19), has a unique solution.
Conversely, if g is the unique function satisfying (3.22) and (3.19), then I de-
fined by (3.20) satisfies (3.21), as required. Thus a weighted graph uniquely
determines both a random walk and an electrical network.
The point of this subsection is that the voltage and current functions can
be identified in terms of the random walk. Recall the flow f v0 →A defined
at (3.18).
Proposition 3.10 Consider a weighted graph as an electrical network, where
a wire linking v and x has conductance wvx . Suppose that the voltage func-
tion g satisfies (3.19). Then the voltage at any vertex v is given in terms of
the associated random walk by

g(v) = Pv (Tv0 < TA ) ∈ [0, 1] (3.23)

and the current Ivx along each wire (v, x) is fvx /r, where f = f v0 →A and
r = 1/( wv0 Pv0 (TA < Tv+0 ) ) ∈ (0, ∞).   (3.24)
Since f is a unit flow from v0 to A and g(v0 ) = 1, we find I(v0 ) = g(v0 )/r.
Since g = 0 on A, it is thus natural in light of Ohm’s law to regard the entire
network as effectively a single conductor from v0 to A with resistance r; for
this reason r is called the effective resistance between v0 and A. Since (3.19)
and (3.21) are clearly satisfied, to establish Proposition 3.10 it suffices by
our previous comments to prove (3.20), i.e.,
fvx /r = (g(v) − g(x)) wvx .   (3.25)
Proof of (3.25). Here is a “brute force” proof by writing everything in terms of mean hitting times. First, there is no loss of generality in assuming that A is a singleton a, by the collapsing principle (Chapter 2 Section 7.3).
Now by the Markov property

fvx = Ev0 (number of visits to v before time Ta ) pvx


−Ev0 (number of visits to x before time Ta ) pxv .

Chapter 2 Lemma 9 gives a formula for the expectations above, and using
πv pvx = πx pxv = wvx /w we get
(w/wvx ) fvx = Ea Tv − Ev0 Tv − Ea Tx + Ev0 Tx .   (3.26)
And Chapter 2 Corollaries 8 and 10 give a formula for g:

g(v) = (Ev Ta + Ea Tv0 − Ev Tv0 ) πv0 Pv0 (Ta < Tv+0 )

which leads to

(g(v) − g(x)) / ( πv0 Pv0 (Ta < Tv+0 ) ) = Ev Ta − Ex Ta − Ev Tv0 + Ex Tv0 .   (3.27)
But the right sides of (3.27) and (3.26) are equal, by the cyclic tour property
(Lemma 3.2) applied to the tour v0 , x, a, v, v0 , and the result (3.25) follows
after rearrangement, using πv0 = wv0 /w.
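Proposition 3.10 can be sanity-checked numerically. The sketch below (a hypothetical 5-vertex weighted graph of our own, with v0 = 0 and A = {4}; numpy assumed available) solves the harmonic equations (3.22) for g, then verifies Kirchhoff’s node law for the resulting current and the consistency of the effective-resistance formula (3.24).

```python
import numpy as np

# Hypothetical weighted graph; v0 = 0, A = {a} with a = 4.
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 4): 1.0, (0, 3): 0.5,
         (3, 4): 2.0, (1, 3): 1.0}
n, v0, a = 5, 0, 4
W = np.zeros((n, n))
for (u, v), we in edges.items():
    W[u, v] = W[v, u] = we
wv = W.sum(axis=1)
P = W / wv[:, None]

# Solve (3.22) with boundary values g(v0) = 1, g = 0 on A.
free = [v for v in range(n) if v not in (v0, a)]
M = np.eye(len(free)) - P[np.ix_(free, free)]
g = np.zeros(n); g[v0] = 1.0
g[free] = np.linalg.solve(M, P[np.ix_(free, [v0])].ravel())

I = (g[:, None] - g[None, :]) * W          # Ohm's law (3.20)
assert np.allclose(I.sum(axis=1)[free], 0) # Kirchhoff's node law (3.21)

# P_{v0}(T_A < T_{v0}^+) = sum_x p_{v0 x} (1 - g(x)); r from (3.24).
escape = P[v0] @ (1 - g)
r = 1.0 / (wv[v0] * escape)
assert np.isclose(I[v0].sum(), g[v0] / r)  # current out of v0 equals 1/r
```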
Remark. Note that, when identifying a reversible chain with an electrical
network, the procedure of collapsing the set A of states of the chain to a
singleton corresponds to the procedure of shorting together the vertices A
of the electrical network.

3.3.3 Mean commute times


The classical use of the electrical network analogy in the mathematical liter-
ature is in the study of the recurrence or transience of infinite-state reversible
chains by comparison arguments (Chapter 13). As discussed in Doyle and
Snell [131], the comparisons involve “cutting or shorting”. Cutting an edge,
or more generally decreasing an edge’s conductance, can only increase an
effective resistance. Shorting two vertices together (i.e., linking them with
an edge of infinite conductance), or more generally increasing an edge’s con-
ductance, can only decrease an effective resistance. These ideas can be for-
malized via the extremal characterizations of Section 3.7 without explicitly
relying on the electrical analogy.
In our context of finite-state chains the key observation is the following. For not-necessarily-reversible discrete-time chains we have (Chapter 2 Corollary 8)

1/( πv Pv (Ta < Tv+ ) ) = Ev Ta + Ea Tv ,   v ≠ a,   (3.28)
where we may call the right side the mean commute time between v and a.
[For continuous-time chains, πv is replaced by qv πv in (3.28).] Comparing
with (3.24) and using πv = wv /w gives

Corollary 3.11 (commute interpretation of resistance) Given two ver-


tices v, a in a weighted graph, the effective resistance rva between v and a is
related to the mean commute time of the associated random walk by

Ev Ta + Ea Tv = wrva .

Note that the Corollary takes a simple form in the case of unweighted graphs:

Ev Ta + Ea Tv = 2|E|rva . (3.29)

Note also that the Corollary does not hold so simply if a and v are both
replaced by subsets—see Corollary 3.37.
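Corollary 3.11 invites a direct numerical check (a sketch under the assumption numpy is available; the graph is a hypothetical example of ours): effective resistance from the Laplacian pseudoinverse, commute time from first-step linear solves.

```python
import numpy as np

# Commute-time identity E_v T_a + E_a T_v = 2|E| r_va on an unweighted
# graph: a 4-cycle plus the diagonal 0-2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4
A = np.zeros((n, n))
for (u, v) in edges:
    A[u, v] = A[v, u] = 1.0
deg = A.sum(axis=1)
P = A / deg[:, None]
L = np.diag(deg) - A                     # graph Laplacian
Lplus = np.linalg.pinv(L)

def hit(P, target):
    idx = [j for j in range(n) if j != target]
    h = np.zeros(n)
    h[idx] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)], np.ones(n - 1))
    return h

v, a = 1, 3
e = np.zeros(n); e[v], e[a] = 1.0, -1.0
r_va = e @ Lplus @ e                     # effective resistance between v and a
commute = hit(P, a)[v] + hit(P, v)[a]   # E_v T_a + E_a T_v
assert abs(commute - 2 * len(edges) * r_va) < 1e-8
```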
Corollary 3.11 apparently was not stated explicitly or exploited until
a 1989 paper of Chandra et al [85], but then rapidly became popular in
the “randomized algorithms” community. The point is that “cutting or
shorting” arguments can be used to bound mean commute times. As the
simplest example, it is obvious that the effective resistance rvx across an edge
(v, x) is at most the resistance 1/wvx of the edge itself, and so Corollary 3.11
implies the edge-commute inequality (Corollary 3.8). Finally, we can use
Corollary 3.11 to get simple exact expressions for mean commute times in
some special cases, in particular for birth-and-death processes (i.e., weighted
linear graphs) discussed in Chapter 5.
As with the infinite-space results, the electrical analogy provides a vivid
language for comparison arguments, but the arguments themselves can be
justified via the extremal characterizations of Section 3.7 without explicit
use of the analogy.

3.3.4 Foster’s theorem


The commute interpretation of resistance allows us to rephrase Lemma 3.9
as the following result about electrical networks, due to Foster [153].

Corollary 3.12 (Foster’s Theorem) In a weighted n-vertex graph, let re


be the effective resistance between the ends (a, b) of an edge e. Then
Σe re we = n − 1.
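Foster’s theorem is easy to verify numerically on any connected weighted graph; here is a hedged sketch (hypothetical graph of ours, numpy assumed) using the weighted Laplacian’s pseudoinverse for effective resistances.

```python
import numpy as np

# Foster's theorem: sum over edges of r_e w_e equals n - 1.
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 0): 0.5, (2, 3): 1.0,
         (3, 4): 2.0, (4, 2): 1.0}
n = 5
W = np.zeros((n, n))
for (a, b), we in edges.items():
    W[a, b] = W[b, a] = we
L = np.diag(W.sum(axis=1)) - W           # weighted Laplacian (conductances)
Lplus = np.linalg.pinv(L)

total = 0.0
for (a, b), we in edges.items():
    e = np.zeros(n); e[a], e[b] = 1.0, -1.0
    total += we * (e @ Lplus @ e)        # w_e times effective resistance r_e
assert abs(total - (n - 1)) < 1e-8
```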

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗∗

CONVENTION.

For the rest of the chapter we make the convention that we are dealing
with a finite-state, irreducible, reversible chain, and we will not repeat the
“reversible” hypothesis in each result. Instead we will say “general chain”
to mean not-necessarily-reversible chain.

∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗∗

3.4 The spectral representation


Use the transition matrix P to define
sij = πi^{1/2} pij πj^{−1/2} .

From definition (3.1), S is a symmetric matrix. So we can apply the elementary diagonalization theorem. The authors find it helpful to distinguish
between the state space I = {i, j, . . .}, of size n say, and the index set of
integers [n] = {1, 2, . . . , n}, the point being that the state space may have a
lot of extra structure, whereas the index set has no obvious structure. The
spectral theorem ([183] Theorem 4.1.5) gives a representation

S = UΛU^T

where U = (uim )i∈I,m∈[n] is an orthonormal matrix, and Λ = (λm,m′ )m,m′∈[n] is a diagonal real matrix. We can write the diagonal entries of Λ as (λm ),
and arrange them in decreasing order. Then

1 = λ1 > λ2 ≥ · · · ≥ λn ≥ −1. (3.30)

The classical fact that |λi | ≤ 1 follows easily from the fact that the entries
of S(t) are bounded as t → ∞ by (3.31) below. These λ’s are the eigenvalues
of P, as well as of S. That is, the solutions (λ; x) with x ≢ 0 of

Σi xi pij = λxj   for all j

are exactly the pairs

(λ = λm ; xi = cm πi^{1/2} uim , i = 1, . . . , n)

for m = 1, . . . , n, where cm ≠ 0 is arbitrary. And the solutions of

Σj pij yj = λyi   for all i

are exactly the pairs

(λ = λm ; yi = cm πi^{−1/2} uim , i = 1, . . . , n).

Note that an eigenvector (ui1 ) of S corresponding to the eigenvalue λ1 = 1 is ui1 = πi^{1/2} .

Uniqueness of the stationary distribution now implies λ2 < 1.


Now consider matrix powers. We have

S^(t) = UΛ^(t) U^T

and

pij(t) = πi^{−1/2} sij(t) πj^{1/2} ,   (3.31)

so

Pi (Xt = j) = πi^{−1/2} πj^{1/2} Σ_{m=1}^{n} λm^t uim ujm .   (3.32)

This is the spectral representation formula. In continuous time, the analogous formula is

Pi (Xt = j) = πi^{−1/2} πj^{1/2} Σ_{m=1}^{n} exp(−λm t) uim ujm .   (3.33)

As before, U is an orthonormal matrix and ui1 = πi^{1/2} , and now the λ’s are the eigenvalues of −Q. In the continuous-time setting, the eigenvalues satisfy

0 = λ1 < λ2 ≤ · · · ≤ λn .   (3.34)

Rather than give the general proof, let us consider the effect of continuizing the discrete-time chain (3.32). The continuized chain (Yt ) can be represented as Yt = X_{N(t)} where N(t) has Poisson(t) distribution, so by conditioning on N(t) = ν,

Pi (Yt = j) = πi^{−1/2} πj^{1/2} Σ_{m=1}^{n} uim ujm Σ_{ν=0}^{∞} λm^ν e^{−t} t^ν /ν!
            = πi^{−1/2} πj^{1/2} Σ_{m=1}^{n} uim ujm exp(−(1 − λm )t).

So when we compare the spectral representations (3.32),(3.33) for a discrete-


time chain and its continuization, the orthonormal matrices are identical,
and the eigenvalues are related by

λm^(c) = 1 − λm^(d)   (3.35)

superscripts (c) and (d) indicating continuous or discrete time. In particular,


this relation holds for the basic discrete and continuous time random walks
on a graph.
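Both the discrete-time formula (3.32) and the eigenvalue relation (3.35) can be checked mechanically. The sketch below (hypothetical 3-vertex weighted-graph walk; numpy assumed available) diagonalizes S and reconstructs the t-step transition probabilities.

```python
import numpy as np

# Random walk on a hypothetical weighted triangle; S = D^{1/2} P D^{-1/2}
# with D = diag(pi) is symmetric, so eigh applies.
W = np.array([[0.0, 2.0, 1.0], [2.0, 0.0, 3.0], [1.0, 3.0, 0.0]])
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

D = np.diag(np.sqrt(pi))
Dinv = np.diag(1.0 / np.sqrt(pi))
S = D @ P @ Dinv
lam, U = np.linalg.eigh(S)               # real spectrum, orthonormal U

t = 5
Pt_spectral = Dinv @ U @ np.diag(lam ** t) @ U.T @ D   # formula (3.32)
assert np.allclose(Pt_spectral, np.linalg.matrix_power(P, t))

# Continuization: -Q = I - P has eigenvalues 1 - lam, as in (3.35);
# I - S is symmetric and similar to I - P.
assert np.allclose(np.sort(1 - lam), np.sort(np.linalg.eigvalsh(np.eye(3) - S)))
```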
Let us point out some interesting simple consequences of the spectral
representation. For these purposes continuous time is simpler. First,

Pi (Xt = j) − πj = cij e^{−λ2 t} + o(e^{−λ2 t} )   as t → ∞   (3.36)


where cij = πi^{−1/2} πj^{1/2} Σ_{m: λm =λ2} uim ujm and where “typically” cij ≠ 0. (A precise statement is this: there exists i such that

Pi (Xt = i) − πi ∼ cii e^{−λ2 t} ,   cii > 0,   (3.37)

by considering i such that ui2 6= 0.) Thus λ2 has the interpretation of


“asymptotic rate of convergence to the stationary distribution”. The au-
thors find it simpler to interpret parameters measuring “time” rather than
“1/time”, and so prefer to work with the relaxation time τ2 defined by

τ2 := 1/λ2 for a continuous-time chain (3.38)


τ2 := 1/(1 − λ2 ) for a discrete-time chain. (3.39)

Note that by (3.35) the value of τ2 is unchanged by continuizing a discrete-


time chain.
Still in continuous time, the spectral representation gives
Pi (Xt = i) = πi + Σ_{m≥2} uim^2 exp(−λm t)   (3.40)

so the right side is decreasing with t, and in fact is completely monotone, a


subject pursued in Section 3.5. Thus Zii defined in Chapter 2 Section 2.3
satisfies
Zii = ∫_0^∞ (Pi (Xt = i) − πi ) dt = Σ_{m≥2} uim^2 λm^{−1}   by (3.40).   (3.41)

Using the orthonormal property of U,

Σi Zii = Σ_{m≥2} λm^{−1} .

Applying Corollary 13 of Chapter 2, we obtain a fundamental result relating


average hitting times to eigenvalues.
Proposition 3.13 (The eigentime identity) For each i,

Σj πj Ei Tj = Σ_{m≥2} λm^{−1}   (continuous time)
Σj πj Ei Tj = Σ_{m≥2} (1 − λm )^{−1}   (discrete time).

[The discrete-time version follows from (3.35).] Proposition 3.13 expands


upon the random target lemma, which said that (even for non-reversible chains) Σj πj Ei Tj does not depend on i.
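The discrete-time eigentime identity can be verified directly on a small example (hypothetical weighted-graph walk; numpy assumed available): mean hitting times come from linear solves, eigenvalues from the symmetrized matrix S, and the weighted hitting-time sum is checked to be the same for every start i.

```python
import numpy as np

# Random walk on a hypothetical weighted 4-vertex graph (not bipartite).
W = np.array([[0.0, 1.0, 2.0, 0.0],
              [1.0, 0.0, 1.0, 3.0],
              [2.0, 1.0, 0.0, 1.0],
              [0.0, 3.0, 1.0, 0.0]])
n = 4
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

def hit(P, target):
    idx = [j for j in range(n) if j != target]
    h = np.zeros(n)
    h[idx] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)], np.ones(n - 1))
    return h

H = np.array([hit(P, j) for j in range(n)]).T      # H[i, j] = E_i T_j
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
lam = np.sort(np.linalg.eigvalsh(S))[::-1]         # lam[0] = 1
rhs = np.sum(1.0 / (1.0 - lam[1:]))                # sum_{m>=2} 1/(1 - lam_m)
for i in range(n):
    assert abs(pi @ H[i] - rhs) < 1e-8             # same value for every i
```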

3.4.1 Mean hitting times and reversible chains


In Chapter 2 Section 2.2 we listed identities for general chains such as the
mean hitting time formulas

Ei Tj = (Zjj − Zij )/πj ; Eπ Tj = Zjj /πj .

There are a number of more complicated identities for general chains in


which one side becomes zero for any reversible chain (by the symmetry
property πi Zij = πj Zji ) and which therefore simplify to give identities for
reversible chains. We have already seen one example, the cyclic tour lemma,
and the following result may be considered an extension of that lemma.
[Indeed, sum the following equation over successive pairs (i, j) along a cycle
to recapture the cyclic tour lemma.]
Corollary 3.14 Eπ Tj − Eπ Ti = Ei Tj − Ej Ti .
This identity follows immediately from the mean hitting time formulas and
the symmetry property. Note the following interpretation of the corollary.
Define an ordering i  j on the states by

i  j iff Eπ Ti ≤ Eπ Tj .

Then Corollary 3.14 implies

Ei Tj ≥ Ej Ti iff i  j.
76 CHAPTER 3. REVERSIBLE MARKOV CHAINS (SEPTEMBER 10, 2002)

Warning. Corollary 3.14 does not imply

max_{i,j} Ei Tj is attained by some pair (i∗ , j∗ ) such that
i∗ attains min_i Eπ Ti and j∗ attains max_j Eπ Tj .

Here is a counterexample. [I haven’t tried to find counterexamples with
more than three states.] Choose 0 < ε < 1/2 arbitrarily and let

       ( 2ε    1 − 2ε    0  )
P :=   ( ε     1 − 2ε    ε  ) .
       ( 0     1 − 2ε    2ε )

We invite the reader to perform the computations necessary to verify that P
is reversible with π = [ε, 1 − 2ε, ε] and

                                ( 0        ε    1      )
(Ei Tj ) = ε^{−1} (1 − 2ε)^{−1} ( 1 − ε    0    1 − ε  ) ,
                                ( 1        ε    0      )

so that (Eπ Ti ) = ε^{−1} (1 − 2ε)^{−1} [1 − 2ε + 2ε^2 , 2ε^2 , 1 − 2ε + 2ε^2 ]. Thus Eπ Ti
is minimized uniquely by i∗ = 2, while max_{i,j} Ei Tj is attained only by the
pairs (1, 3) and (3, 1).
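The counterexample can be verified mechanically. The following sketch (ours,
not from the text) solves for the hitting times at the sample value ε = 0.1 and
checks each claim.

```python
import numpy as np

eps = 0.1                        # any 0 < eps < 1/2 works
P = np.array([[2*eps, 1-2*eps, 0.],
              [eps,   1-2*eps, eps],
              [0.,    1-2*eps, 2*eps]])
pi = np.array([eps, 1-2*eps, eps])

# Mean hitting times, solving (I - P_{-j}) t = 1 for each target j.
ET = np.zeros((3, 3))
for j in range(3):
    idx = [i for i in range(3) if i != j]
    ET[idx, j] = np.linalg.solve(np.eye(2) - P[np.ix_(idx, idx)], np.ones(2))

c = 1.0 / (eps * (1 - 2*eps))
claimed = c * np.array([[0.,    eps, 1.],
                        [1-eps, 0.,  1-eps],
                        [1.,    eps, 0.]])
EpiT = pi @ ET                   # (E_pi T_1, E_pi T_2, E_pi T_3)
```

Detailed balance πi pij = πj pji holds, ET matches the displayed matrix, Eπ Ti
is minimized at the middle state, and the overall maximum c = 1/(ε(1−2ε)) is
attained exactly at the pairs (1, 3) and (3, 1).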
As a second instance of what reversibility implies, note, from (3.33) and
the definition of Zij , that

Zij = πi^{−1/2} πj^{1/2} Σ_{m≥2} λm^{−1} uim ujm .

This implies

the symmetrized matrix πi^{1/2} Zij πj^{−1/2} is positive semidefinite.    (3.42)

Note that a symmetric positive semidefinite matrix (Mij ) has the property
Mij^2 ≤ Mii Mjj . This gives

Zij^2 ≤ Zii Zjj πj /πi ,    (3.43)

which enables us to upper-bound mean hitting times from arbitrary starts
in terms of mean hitting times from stationary starts.

Lemma 3.15 max_{i,j} Ei Tj ≤ 2 max_k Eπ Tk .



Proof. Using (3.43),

(Zij /πj )^2 ≤ (Zii /πi ) (Zjj /πj )

and so

−Zij /πj ≤ max_k Zkk /πk .

So the mean hitting time formula gives the two equalities in

Ei Tj = Zjj /πj − Zij /πj ≤ 2 max_k Zkk /πk = 2 max_k Eπ Tk .
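Lemma 3.15 is easy to spot-check on random reversible chains. This is our own
illustration, not from the text; the symmetric-weights construction guarantees
reversibility.

```python
import numpy as np

rng = np.random.default_rng(0)

def hitting_times(P):
    """E_i T_j for all pairs, via one linear solve per target state."""
    n = P.shape[0]
    ET = np.zeros((n, n))
    for j in range(n):
        idx = [i for i in range(n) if i != j]
        ET[idx, j] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)],
                                     np.ones(n - 1))
    return ET

checks = []
for _ in range(20):
    W = rng.random((6, 6))
    W = W + W.T                          # symmetric weights => reversible walk
    P = W / W.sum(axis=1, keepdims=True)
    pi = W.sum(axis=1) / W.sum()
    ET = hitting_times(P)
    # Lemma 3.15: max_{ij} E_i T_j <= 2 max_k E_pi T_k
    checks.append(ET.max() <= 2 * (pi @ ET).max() + 1e-9)
```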

3.5 Complete monotonicity


One advantage of working in continuous time is to exploit complete mono-
tonicity properties. Abstractly, call f : [0, ∞) → [0, ∞) completely monotone
(CM) if there is a nonnegative measure µ on [0, ∞) such that

f (t) = ∫_{[0,∞)} e^{−θt} µ(dθ),  0 ≤ t < ∞.    (3.44)

Our applications will use only the special case of a finite sum

f (t) = Σ_m am e^{−θm t} ,  for some am > 0, θm ≥ 0,    (3.45)

but finiteness plays no essential role. If f is CM then (provided they exist)
so are

−f′(t),   F̄ (t) := ∫_t^∞ f (s) ds.    (3.46)

A probability distribution ν on [0, ∞) is called CM if its tail distribution
function F̄ (t) :≡ ν(t, ∞) is CM; equivalently, if its density function f is
CM (except that here we must in the general case allow the possibility
f (0) = ∞). In more probabilistic language, ν is CM iff it can be expressed
as the distribution of ξ/Λ, where ξ and Λ are independent random variables
such that

ξ has Exponential(1) distribution;  Λ > 0.    (3.47)

Given a CM function or distribution, the spectral gap λ ≥ 0 can be
defined consistently by

λ := inf{t > 0 : µ[0, t] > 0}   in setting (3.44)
λ := min_m θm                   in setting (3.45)
λ := ess inf Λ                  in setting (3.47).

This λ controls the behavior of f (t) as t → ∞. A key property of CM
functions is that their value at a general time t can be bounded in terms of
their behavior at 0 and at ∞, as follows.

Lemma 3.16 Let f be CM with 0 < f (0) < ∞. Then

exp(f′(0) t/f (0)) ≤ f (t)/f (0) ≤ F̄ (t)/F̄ (0) ≤ exp(−λt),  0 ≤ t < ∞

where λ is the spectral gap.

We might have F̄ (0) = ∞, but then F̄ (t) = ∞ and λ = 0 so the convention
∞/∞ = 1 works.
Proof. By scaling we may suppose f (0) = 1. So we can rewrite (3.44) as

f (t) = E e^{−Θt}    (3.48)

where Θ has distribution µ. Then f′(t) = −E(Θ e^{−Θt} ). Because θ ↦ e^{−θt} is
decreasing, the random variables Θ and e^{−Θt} are negatively correlated (this
fact is sometimes called “Chebyshev’s other inequality”, and makes a nice
exercise [Hint: Symmetrize!]) and so E(Θ e^{−Θt} ) ≤ (EΘ)(E e^{−Θt} ). This says
−f′(t) ≤ −f′(0) f (t), or in other words (d/dt) log f (t) ≥ f′(0). Integrating
gives log f (t) ≥ t f′(0), which is the leftmost inequality. (Recall we scaled
to make f (0) = 1.) For the second inequality,

F̄ (t) = E(Θ^{−1} e^{−Θt} )   by integrating (3.48)
      ≥ (EΘ^{−1} )(E e^{−Θt} )   by positive correlation
      = F̄ (0) f (t).

Finally, from the definition of the spectral gap λ it is clear that f (t)/f (0) ≤
e^{−λt} . But F̄ has the same spectral gap as f .
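The chain of inequalities in Lemma 3.16 can be seen concretely for a finite
mixture of exponentials, the setting (3.45). In this sketch (ours, not from the
text) the coefficients am and rates θm are arbitrary illustrative choices.

```python
import numpy as np

# f(t) = sum_m a_m exp(-theta_m t): a CM function of the form (3.45).
a = np.array([0.5, 1.0, 2.0])            # a_m > 0 (arbitrary)
theta = np.array([0.3, 1.0, 4.0])        # theta_m > 0 (arbitrary)
lam = theta.min()                        # the spectral gap

t = np.linspace(0.0, 10.0, 201)
f = np.exp(-np.outer(t, theta)) @ a                  # f(t)
Fbar = np.exp(-np.outer(t, theta)) @ (a / theta)     # integral of f over (t, inf)
f0 = a.sum()                             # f(0)
fprime0 = -(a * theta).sum()             # f'(0)

lower = np.exp(fprime0 * t / f0)         # exp(f'(0) t / f(0))
mid1 = f / f0                            # f(t)/f(0)
mid2 = Fbar / Fbar[0]                    # Fbar(t)/Fbar(0)
upper = np.exp(-lam * t)                 # exp(-lambda t)
```

Along the whole grid, lower ≤ mid1 ≤ mid2 ≤ upper, exactly as the lemma
asserts.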
Returning to the study of continuous-time reversible chains, the spectral
representation (3.40) says that Pi (Xt = i) is a CM function. It is often
convenient to subtract the limit and say

Pi (Xt = i) − πi is a CM function. (3.49)

More generally, given any function g : I → R the function

ρ(t) :≡ E[g(Xt )g(X0 )] (3.50)



is CM for the stationary chain, because by (3.33)

ρ(t) = Σ_{m=1}^n ( Σ_i πi^{1/2} g(i) uim ) ( Σ_j πj^{1/2} g(j) ujm ) exp(−λm t)
     = Σ_{m=1}^n ( Σ_i πi^{1/2} g(i) uim )^2 exp(−λm t).    (3.51)

Specializing to the case g = 1A and conditioning,

P (Xt ∈ A | X0 ∈ A) is a CM function    (3.52)

again assuming the stationary chain. When A is a singleton, this is (3.49).


Remark. To study directly discrete-time reversible chains, one would
replace CM functions by sequences (fn ) of the form
fn = ∫_{−1}^{1} θ^n µ(dθ).

But analogs of Lemma 3.16 and subsequent results (e.g., Proposition 3.22)
become messier—so we prefer to derive discrete-time results by continuiza-
tion.

3.5.1 Lower bounds on mean hitting times


As a quick application, we give bounds on mean hitting times to a single
state from a stationary start. Recall qi = Σ_{j≠i} qij is the exit rate from i,
and τ2 is the relaxation time of the chain.

Lemma 3.17 For any state i in a continuous-time chain,

(1 − πi )^2 /(qi πi ) ≤ Eπ Ti ≤ τ2 (1 − πi )/πi .

By continuization, the Lemma holds in discrete time, replacing qi by 1 − pii .

Proof. The mean hitting time formula is

πi Eπ Ti = Zii = ∫_0^∞ (Pi (Xt = i) − πi ) dt.

Write f (t) for the integrand. We know f is CM, and here λ ≥ λ2 by (3.40),
and f′(0) = −qi , so the extreme bounds of Lemma 3.16 become, after mul-
tiplying by f (0) = 1 − πi ,

(1 − πi ) exp(−qi t/(1 − πi )) ≤ f (t) ≤ (1 − πi ) e^{−λ2 t} .


Integrating these bounds gives the result.
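In discrete time (replacing qi by 1 − pii and taking τ2 = 1/(1 − λ2 )) the two
bounds of Lemma 3.17 can be checked numerically. This is our illustration,
not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((5, 5))
W = W + W.T                          # reversible random walk, loops allowed
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
n = len(pi)

# relaxation time from the symmetrized matrix
S = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
lam2 = np.sort(np.linalg.eigvalsh(S))[-2]
tau2 = 1.0 / (1.0 - lam2)

# E_pi T_i for each target state i
EpiT = np.empty(n)
for i in range(n):
    idx = [k for k in range(n) if k != i]
    t = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)], np.ones(n - 1))
    EpiT[i] = pi[idx] @ t            # the term from starting at i itself is 0

lower = (1 - pi) ** 2 / ((1 - np.diag(P)) * pi)   # (1-pi_i)^2 / (q_i pi_i)
upper = tau2 * (1 - pi) / pi                      # tau2 (1-pi_i) / pi_i
```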


We can now give general lower bounds on some basic parameters we will
study in Chapter 4.

Proposition 3.18 For a discrete-time chain on n states,

Σ_j πj Eπ Tj ≥ (n − 1)^2 /n    (3.53)

max_{i,j} (Ei Tj + Ej Ti ) ≥ 2(n − 1)    (3.54)

max_{i,j} Ei Tj ≥ n − 1    (3.55)

τ2 ≥ (n − 1)/n.    (3.56)

Remark. These inequalities become equalities for random walk on the com-
plete graph (Chapter 5 Example 9). By examining the proof, it can be
shown that this is the only chain where an equality holds.

Proof. We go to the continuized chain, which has qi = 1 − pii ≤ 1. Then

Σ_j πj Eπ Tj ≥ Σ_j (1 − πj )^2    by Lemma 3.17
            = n − 2 + Σ_j πj^2
            ≥ n − 2 + 1/n
            = (n − 1)^2 /n,

giving (3.53). By the eigentime identity,

Σ_j πj Eπ Tj = Σ_{m≥2} λm^{−1} ≤ (n − 1) τ2

and so (3.56) follows from (3.53).


Now fix i and write τ0 = Σ_j πj Ek Tj , which (by the random target
lemma) doesn’t depend on k. Then

Σ_{j≠i} (πj /(1 − πi )) (Ei Tj + Ej Ti ) = (τ0 + Eπ Ti )/(1 − πi ).    (3.57)

If the right side were strictly less than 2(n − 1) for all i, then

Σ_i πi (τ0 + Eπ Ti ) < 2(n − 1) Σ_i πi (1 − πi ),

which implies

2τ0 < 2(n − 1) (1 − Σ_i πi^2 ) ≤ 2(n − 1) (1 − 1/n) = 2(n − 1)^2 /n,

contradicting (3.53). Therefore there exists an i such that

Σ_{j≠i} (πj /(1 − πi )) (Ei Tj + Ej Ti ) ≥ 2(n − 1)

and so there exists j ≠ i such that Ei Tj + Ej Ti ≥ 2(n − 1). This is (3.54),
and (3.55) follows immediately.
There are several other results in the spirit of Lemma 3.17 and Propo-
sition 3.18. For instance, (22) in Chapter 2 says that for a general discrete-
time chain,

vari Ti+ = (2 Eπ Ti + 1)/πi − 1/πi^2 .

Appealing to Lemma 3.17 gives, after a little algebra,

Corollary 3.19 For any state i in a discrete-time chain,

vari Ti+ ≥ (1 − πi )(1 − 2πi )/πi^2 .

Again, equality holds for random walk on the complete graph.
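Random walk on the complete graph achieves equality in all of these bounds
simultaneously; the following numerical check is our own illustration, using
the Chapter 2 identity quoted above for vari Ti+.

```python
import numpy as np

n = 7
P = (np.ones((n, n)) - np.eye(n)) / (n - 1)    # random walk on K_n
pi = np.full(n, 1.0 / n)

# pi is uniform, so the symmetrized matrix is P itself
lam2 = np.sort(np.linalg.eigvalsh(P))[-2]
tau2 = 1.0 / (1.0 - lam2)

ET = np.zeros((n, n))
for j in range(n):
    idx = [i for i in range(n) if i != j]
    ET[idx, j] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)],
                                 np.ones(n - 1))
EpiT = pi @ ET                   # E_pi T_j for each j

# var_i T_i^+ via the identity (22) of Chapter 2 quoted above
var_plus = (2 * EpiT + 1) / pi - 1 / pi**2
```

Each of (3.53)-(3.56) and Corollary 3.19 holds with equality here.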

3.5.2 Smoothness of convergence


We’re going to build some vague discussion around the following simple
result.

Lemma 3.20

Σ_j pij^2 (t)/πj = pii (2t)/πi    (3.58)

| pik (t + s)/πk − 1 | ≤ √( (pii (2t)/πi − 1)(pkk (2s)/πk − 1) )    (3.59)

pik (t + s)/πk ≤ √( (pii (2t)/πi )(pkk (2s)/πk ) ),
and so  max_{i,k} pik (2t)/πk ≤ max_i pii (2t)/πi .    (3.60)

Proof.

pik (t + s)/πk = Σ_j pij (t) pjk (s)/πk = Σ_j pij (t) pkj (s)/πj

by reversibility. Putting k = i, s = t gives (3.58). Rewriting the above
equality as

pik (t + s)/πk − 1 = Σ_j πj ((pij (t) − πj )/πj ) ((pkj (s) − πj )/πj )

and applying the Cauchy–Schwarz inequality, we get the bound √(ai (t) ak (s)),
where

ai (t) = Σ_j (pij (t) − πj )^2 /πj = Σ_j pij^2 (t)/πj − 1 = pii (2t)/πi − 1.

This proves (3.59). The cruder bound (3.60) is sometimes easier to use
than (3.59) and is proved similarly.
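The proof uses only Chapman–Kolmogorov and reversibility, so (3.58)-(3.59)
hold verbatim for the t-step transition matrices of a discrete-time reversible
chain, which makes a quick check possible. This sketch is our illustration,
not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.random((5, 5))
W = W + W.T                          # symmetric weights => reversible
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

t, s = 3, 4
Pt = np.linalg.matrix_power(P, t)
Pts = np.linalg.matrix_power(P, t + s)
d2t = np.diag(np.linalg.matrix_power(P, 2 * t)) / pi   # p_ii(2t)/pi_i
d2s = np.diag(np.linalg.matrix_power(P, 2 * s)) / pi   # p_kk(2s)/pi_k

# (3.58): sum_j p_ij(t)^2 / pi_j = p_ii(2t) / pi_i
lhs58 = (Pt**2 / pi[None, :]).sum(axis=1)

# (3.59): |p_ik(t+s)/pi_k - 1| <= sqrt((p_ii(2t)/pi_i - 1)(p_kk(2s)/pi_k - 1))
lhs59 = np.abs(Pts / pi[None, :] - 1.0)
rhs59 = np.sqrt(np.outer(np.maximum(d2t - 1, 0), np.maximum(d2s - 1, 0)))
```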
Discussion. Recalling from Chapter 2 Section 4.2 the definition of L2
distance between distributions, (3.58) says

‖Pi (Xt ∈ ·) − π‖2^2 = pii (2t)/πi − 1.    (3.61)

In continuous time, we may regard the assertion “‖Pi (Xt ∈ ·) − π‖2 is de-
creasing in t” as a consequence of the equality in (3.61) and the CM property
of pii (t). This assertion in fact holds for general chains, as pointed out in
Chapter 2 Lemma 35. Loosely, the general result of Chapter 2 Lemma 35
says that in a general chain the ratios (Pρ (Xt = j)/πj , j ∈ I) considered
as an unordered set tend to smooth out as t increases. For a reversible
chain, much more seems to be true. There is some “intrinsic geometry” on
the state space such that, for the chain started at i, the probability dis-
tribution as time increases from 0 “spreads out smoothly” with respect to
the geometry. It’s hard to formalize that idea convincingly. On the other
hand, (3.61) does say convincingly that the rate of convergence of the single
probability pii (t) to π(i) is connected to a rate of convergence of the entire
distribution Pi (Xt ∈ ·) to π(·). This intimate connection between the local
and the global behavior of reversible chains underlies many of the technical
inequalities concerning mixing times in Chapter 4 and subsequent chapters.

3.5.3 Inequalities for hitting time distributions on subsets


We mentioned in Chapter 2 Section 2.2 that most of the simple identities
there for mean hitting times Ei Tj on singletons have no simple analogs for
hitting times TA on subsets. One exception is Kac’s formula (Chapter 2
Corollary 24), which says that for a general discrete-time chain

EπA TA+ = 1/π(A).    (3.62)


It turns out that for reversible chains there are useful inequalities relating the
distributions of TA under different initial distributions. These are simplest in
continuous time as consequences of CM: as always, interesting consequences
may be applied to discrete-time chains via continuization.

Recall πA is the stationary distribution conditioned to A:

πA (i) ≡ π(i)/π(A),  i ∈ A.

Trivially

Pπ (TA > t) = π(Ac ) PπAc (TA > t)    (3.63)
Eπ TA = π(Ac ) EπAc TA .    (3.64)

Define the ergodic exit distribution ρA from A by

ρA (j) := ( Σ_{i∈A} πi qij ) / Q(A, Ac ),  j ∈ Ac ,    (3.65)

where Q(A, Ac ) is the ergodic flow rate out of A:

Q(A, Ac ) := Σ_{i∈A} Σ_{k∈Ac} πi qik .    (3.66)

By stationarity, Q(A, Ac ) = Q(Ac , A).

Proposition 3.21 Fix a subset A in a continuous-time chain.
(i) TA has CM distribution when the initial distribution of the chain is
any of the three distributions π or πAc or ρA .
(ii) The three hitting time distributions determine each other via (3.63)
and

PπAc (TA ∈ (t, t + dt)) = (PρA (TA > t)/EρA TA ) dt.    (3.67)

(iii) Write λA for the spectral gap associated with TA (which is the same
for each of the three initial distributions). Then

PρA (TA > t) ≤ PπAc (TA > t) = Pπ (TA > t)/π(Ac ) ≤ exp(−λA t),  t > 0    (3.68)

and in particular

π(Ac )/Q(A, Ac ) = EρA TA ≤ EπAc TA = Eπ TA /π(Ac ) ≤ 1/λA .    (3.69)

(iv)

Eπ TA ≤ τ2 π(Ac )/π(A).    (3.70)

Remarks. (a) In discrete time we can define ρA and Q(A, Ac ) by replacing
qij by pij in (3.65)–(3.66), and then (3.69) holds in discrete time. The left
equality of (3.69) is then a reformulation of Kac’s formula (3.62), because

EπA TA+ = 1 + PπA (X1 ∈ Ac ) EπA (TA+ − 1 | X1 ∈ Ac )
        = 1 + (Q(A, Ac )/π(A)) EρA TA .

[Concerning (b): The results cited from later require that the chain restricted
to Ac be irreducible, but I think that requirement can be dropped using a
limiting argument.]
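In discrete time the pieces of remark (a) fit together numerically. The sketch
below (ours, not from the text) computes ρA and Q(A, Ac ) with pij in place
of qij and recovers Kac's formula.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.random((6, 6))
W = W + W.T
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

A = [0, 1]
Ac = [i for i in range(6) if i not in A]
piA = pi[A].sum()

# discrete-time ergodic flow and ergodic exit distribution, (3.65)-(3.66)
Q = pi[A] @ P[np.ix_(A, Ac)] @ np.ones(len(Ac))
rhoA = (pi[A] @ P[np.ix_(A, Ac)]) / Q

# E_mu T_A for a start mu on A^c: solve (I - P restricted to A^c) t = 1
t = np.linalg.solve(np.eye(len(Ac)) - P[np.ix_(Ac, Ac)], np.ones(len(Ac)))
E_rho = rhoA @ t                 # E_{rho_A} T_A

kac = 1 + (Q / piA) * E_rho      # should equal E_{pi_A} T_A^+ = 1/pi(A)
```

Both the reformulation above and the left equality of (3.69),
EρA TA = π(Ac )/Q(A, Ac ), check out.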

(b) Equation (3.83) and Corollary 3.34 [together with remark (b) following
Theorem 3.33] later show that 1/λA ≤ τ2 /π(A). So (3.70) can be regarded
as a consequence of (3.69). Reverse inequalities will be studied in Chapter 4.
Proof of Proposition 3.21. First consider the case where A is a single-
ton {a}. Then (3.70) is an immediate consequence of Lemma 3.17. The
equalities in (3.69) and in (3.67) are general identities for stationary pro-
cesses [(24) and (23) in Chapter 2]. We shall prove below that TA is CM
under PπI\{a} . Then by (3.63), (3.67), and (3.46), TA is also CM under the
other two initial distributions. Then the second inequality of (3.68) is the
upper bound in Lemma 3.16, and the first is a consequence of (3.67) and
Lemma 3.16. And (3.69) follows from (3.68) by integrating over t.
To prove that TA is CM under PπI\{a} , introduce a parameter 0 < ε < 1
and consider the modified chain (Xtε ) with transition rates

qijε := qij ,  i ≠ a;    qajε := ε qaj .

The modified chain remains reversible, and its stationary distribution is of
the form

πiε = b1 πi , i ≠ a;    πaε = b2

where the weights b1 , b2 depend only on ε and πa . Now as ε → 0 with t
fixed,

PπI\{a} (Xtε ∈ I \ {a}) → PπI\{a} (Ta > t)    (3.71)
3.5. COMPLETE MONOTONICITY 85

because the chain gets “stuck” upon hitting a. But the left side is CM
by (3.52), so the right side (which does not depend on ε) is CM, because the
class of CM distributions is closed under pointwise limits. (The last asser-
tion is in general the continuity theorem for Laplace transforms [133] p. 83,
though for our purposes we need only the simpler fact that the set of func-
tions of the form (3.45) with at most n summands is closed.)
This completes the proof when A is a singleton. We now claim that the
case of general A follows from the collapsing principle (Chapter 2 Section 7.3),
i.e., by applying the special case to the chain in which the subset A is col-
lapsed into a single state. This is clear for all the assertions of Proposi-
tion 3.21 except for (3.70), for which we need the fact that the relaxation
time τ2A of the collapsed chain is at most τ2 . This fact is proved as Corol-
lary 3.27 below.
Remark. Note that the CM property implies a supermultiplicativity
property for hitting times from stationarity in a continuous-time reversible
chain:

Pπ (TA > s + t) ≥ Pπ (TA > s) Pπ (TA > t).

Contrast with the general submultiplicativity property (Chapter 2 Section 4.3)
which holds when Pπ is replaced by maxi Pi .

3.5.4 Approximate exponentiality of hitting times


In many circumstances, the distribution of the first hitting time TA on a
subset A of states with π(A) small (equivalently, with ETA large) can be
approximated by the exponential distribution with the same mean. As with
the issue of convergence to the stationary distribution, such approximations
can be proved for general chains (see Notes), but it is easier to get ex-
plicit bounds in the reversible setting. If T has a CM distribution, then [as
at (3.47), but replacing 1/Λ by Θ] we may suppose T =d Θξ (equality in
distribution). We calculate

ET = (EΘ)(Eξ) = EΘ;    ET^2 = (EΘ^2 )(Eξ^2 ) = 2 EΘ^2

and so

ET^2 /(2(ET )^2 ) = EΘ^2 /(EΘ)^2 ≥ 1

with equality iff Θ is constant, i.e., iff T has exponential distribution. This
suggests that the difference ET^2 /(2(ET )^2 ) − 1 can be used as a measure of
“deviation from exponentiality”. Let us quote a result of Mark Brown ([72]
Theorem 4.1(iii)) which quantifies this idea in a very simple way.

Proposition 3.22 Let T have CM distribution. Then

sup_t |P (T > t) − e^{−t/ET} | ≤ ET^2 /(2(ET )^2 ) − 1.
So we can use this bound for hitting times TA in a stationary reversible
chain. At first sight the bound seems useful only if we can estimate Eπ TA2
and Eπ TA accurately. But the following remarkable variation shows that for
the hitting time distribution to be approximately exponential it is sufficient
that the mean hitting time be large compared to the relaxation time τ2 .
Proposition 3.23 For a subset A of a continuous-time chain,

sup_t |Pπ (TA > t) − exp(−t/Eπ TA )| ≤ τ2 /Eπ TA .

Proof. By the collapsing principle (Chapter 2 Section 7.3) we may suppose A
is a singleton {j}, because (Corollary 3.27 below) collapsing cannot increase
the relaxation time. Combining the mean hitting time formula with the
expression (3.41) for Zjj in terms of the spectral representation (3.33),

Eπ Tj = πj^{−1} Zjj = πj^{−1} Σ_{m≥2} ujm^2 λm^{−1} .    (3.72)

A similar calculation, exhibited below, shows

(Eπ Tj^2 − 2(Eπ Tj )^2 )/2 = πj^{−1} Σ_{m≥2} ujm^2 λm^{−2} .    (3.73)

But λm^{−2} ≤ λ2^{−1} λm^{−1} = τ2 λm^{−1} for m ≥ 2, so the right side of (3.73) is
bounded by πj^{−1} τ2 Σ_{m≥2} ujm^2 λm^{−1} , which by (3.72) equals τ2 Eπ Tj .
Applying Proposition 3.22 gives Proposition 3.23.
We give a straightforward but tedious verification of (3.73) (see also
Notes). The identity x^2 /2 = ∫_0^∞ (x − t)^+ dt, x ≥ 0, starts the calculation

(1/2) Eπ Tj^2 = ∫_0^∞ Eπ (Tj − t)^+ dt
  = ∫_0^∞ Σ_i Pπ (Xt = i, Tj > t) Ei Tj dt
  = Σ_i Ei Tj Eπ (time spent at i before Tj )
  = Σ_i ((Zjj − Zij )/πj ) ((Zjj πi − Zji πj )/πj )
       by Chapter 2 Lemmas 12 and 15 (continuous-time version)
  = πj^{−2} Σ_i πi (Zjj − Zij )^2 .

Expanding the square, the cross-term vanishes and the first term becomes
(Zjj /πj )^2 = (Eπ Tj )^2 , so

(1/2) Eπ Tj^2 − (Eπ Tj )^2 = πj^{−2} Σ_i πi Zij^2 .

To finish the calculation,

πj^{−1} Σ_i πi Zij^2
  = πj^{−1} Σ_i πi ( ∫ (pij (s) − πj ) ds ) ( ∫ (pij (t) − πj ) dt )
  = Σ_i ( ∫ (pji (s) − πi ) ds ) ( ∫ (pij (t) − πj ) dt )
  = ∫ ∫ (pjj (s + t) − πj ) ds dt
  = ∫ t (pjj (t) − πj ) dt
  = ∫ Σ_{m≥2} ujm^2 t e^{−λm t} dt
  = Σ_{m≥2} ujm^2 λm^{−2} .

See the Notes for a related result, Theorem 3.43.
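Proposition 3.23 can be checked numerically for a continuized chain. This
sketch (ours, not from the text) computes the survival function Pπ (TA > t)
from the spectral decomposition of the generator restricted to Ac.

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.random((8, 8))
W = W + W.T
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

A = [0]
B = [i for i in range(8) if i not in A]
piB = pi[B]

# Continuized chain: P_pi(T_A > t) = pi_B' exp(t(P_B - I)) 1.
# Diagonalize the symmetrized restriction S_B = D^{1/2} P_B D^{-1/2}.
SB = np.sqrt(piB)[:, None] * P[np.ix_(B, B)] / np.sqrt(piB)[None, :]
mu, U = np.linalg.eigh(SB)
coef = (np.sqrt(piB) @ U) ** 2     # P_pi(T_A > t) = sum_m coef_m e^{-(1-mu_m)t}
gaps = 1.0 - mu
E_T = np.sum(coef / gaps)          # = E_pi T_A, integrating the survival function

S = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
tau2 = 1.0 / (1.0 - np.sort(np.linalg.eigvalsh(S))[-2])

ts = np.linspace(0.0, 6 * E_T, 600)
surv = np.exp(-np.outer(ts, gaps)) @ coef
dev = np.max(np.abs(surv - np.exp(-ts / E_T)))   # <= tau2 / E_pi T_A
```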

3.6 Extremal characterizations of eigenvalues


3.6.1 The Dirichlet formalism
A reversible chain has an associated Dirichlet form E, defined as follows.
For functions g : I → R write

E(g, g) := (1/2) Σ_i Σ_{j≠i} πi pij (g(j) − g(i))^2    (3.74)

in discrete time, and substitute qij for pij in continuous time. One can
immediately check the following equivalent definitions. In discrete time

E(g, g) = (1/2) Eπ (g(X1 ) − g(X0 ))^2 = Eπ [g(X0 )(g(X0 ) − g(X1 ))].    (3.75)

In continuous time

E(g, g) = (1/2) lim_{t→0} t^{−1} Eπ (g(Xt ) − g(X0 ))^2
        = lim_{t→0} t^{−1} Eπ [g(X0 )(g(X0 ) − g(Xt ))]
        = − Σ_i Σ_j πi g(i) qij g(j)    (3.76)

where the sum includes j = i. Note also that for random walk on a weighted
graph, (3.74) becomes

E(g, g) := (1/2) Σ_i Σ_{j≠i} (wij /w) (g(j) − g(i))^2 .    (3.77)
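The equality of the two discrete-time forms in (3.74)-(3.75) is a one-line
reversibility computation, and easy to confirm numerically (our sketch, not
from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.random((5, 5))
W = W + W.T                          # symmetric weights => reversible
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
g = rng.standard_normal(5)           # arbitrary test function

# (3.74): (1/2) sum_i sum_{j != i} pi_i p_ij (g(j) - g(i))^2
# (diagonal terms vanish automatically, so we can sum over all i, j)
E_sum = 0.5 * np.sum(pi[:, None] * P * (g[None, :] - g[:, None]) ** 2)

# (3.75): E_pi[ g(X_0) (g(X_0) - g(X_1)) ]
E_prob = np.sum(pi * g * (g - P @ g))
```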

Recall from Chapter 2 Section 6.2 the discussion of L2 norms for functions
and measures. In particular

‖g‖2^2 = Σ_i πi g^2 (i) = Eπ g^2 (X0 )

‖µ − π‖2^2 = Σ_i µi^2 /πi − 1   for a probability distribution µ.

The relevance of E can be seen in the following lemma.

Lemma 3.24 Write ρ(t) = (ρj (t)) for the distribution at time t of a continuous-
time chain, with arbitrary initial distribution. Write fj (t) = ρj (t)/πj . Then

(d/dt) ‖ρ(t) − π‖2^2 = −2 E(f (t), f (t)).

Proof. ‖ρ(t) − π‖2^2 = Σ_j πj^{−1} ρj^2 (t) − 1, so using the forward equations

(d/dt) ρj (t) = Σ_i ρi (t) qij

we get

(d/dt) ‖ρ(t) − π‖2^2 = Σ_j Σ_i 2 πj^{−1} ρj (t) ρi (t) qij
                     = 2 Σ_j Σ_i fj (t) fi (t) πi qij

and the result follows from (3.76).



3.6.2 Summary of extremal characterizations


For ease of comparison we state below three results which will be proved
in subsequent sections. These results are commonly presented “the other
way up” using infs rather than sups, but our presentation is forced by our
convention of consistently defining parameters to have dimensions of “time”
rather than “1/time”. The sups are over functions g : I → R satisfying
specified constraints, and excluding g ≡ 0. The results below are the same
in continuous and discrete time—that is, continuization doesn’t change the
numerical values of the quantities we consider. We shall give the proofs in
discrete time.
Extremal characterization of relaxation time. The relaxation time τ2
satisfies

τ2 = sup{ ‖g‖2^2 /E(g, g) : Σ_i πi g(i) = 0 }.

Extremal characterization of quasistationary mean hitting time.
Given a subset A, let αA be the quasistationary distribution on Ac defined
at (3.82). Then the quasistationary mean exit time is

EαA TA = sup{ ‖g‖2^2 /E(g, g) : g ≥ 0, g = 0 on A }.

Extremal characterization of mean commute times. For distinct


states i, j the mean commute time satisfies

Ei Tj + Ej Ti = sup{1/E(g, g) : 0 ≤ g ≤ 1, g(i) = 1, g(j) = 0}.

Because the state space is finite, the sups are attained, and there are the-
oretical descriptions of the g attaining the extrema in all three cases. An
immediate practical use of these characterizations in concrete examples is
to obtain lower bounds on the parameters by inspired guesswork, that is by
choosing some simple explicit “test function” g which seems qualitatively
right and computing the right-hand quantity. See Chapter 14 Example 32 3/10/94 version
for a typical example. Of course we cannot obtain upper bounds this way,
but extremal characterizations can be used as a starting point for further
theoretical work (see in particular the bounds on τ2 in Chapter 4 Section 4). 10/11/94 version

3.6.3 The extremal characterization of relaxation time


The first two extremal characterizations are in fact just reformulations of
the classical Rayleigh–Ritz extremal characterization of eigenvalues, which
goes as follows ([183] Theorem 4.2.2 and eq. 4.2.7). Let S be a symmetric
matrix with eigenvalues µ1 ≥ µ2 ≥ · · ·. Then

µ1 = sup_x ( Σ_i Σ_j xi sij xj ) / ( Σ_i xi^2 )    (3.78)

and an x attaining the sup is an eigenvector corresponding to µ1 (of course
sups are over x ≢ 0). And

µ2 = sup_{y: Σ_i yi xi = 0} ( Σ_i Σ_j yi sij yj ) / ( Σ_i yi^2 )    (3.79)

and a y attaining the sup is an eigenvector corresponding to µ2 .


As observed in Section 3.4, given a discrete-time chain with transition
matrix P, the symmetric matrix (sij = πi^{1/2} pij πj^{−1/2} ) has maximal eigen-
value 1 with corresponding eigenvector (πi^{1/2} ). So applying (3.79) and writ-
ing yi = πi^{1/2} g(i), the second-largest eigenvalue (of S and hence of P) is
given by

λ2 = sup_{g: Σ_i πi g(i) = 0} ( Σ_i Σ_j πi g(i) pij g(j) ) / ( Σ_i πi g^2 (i) ).

In probabilistic notation the fraction is

Eπ [g(X0 )g(X1 )] / Eπ g^2 (X0 ) = 1 − Eπ [g(X0 )(g(X0 ) − g(X1 ))] / Eπ g^2 (X0 )
                                = 1 − E(g, g)/‖g‖2^2 .

Since τ2 = 1/(1 − λ2 ) in discrete time we have proved the first of our extremal
characterizations.
Since τ2 = 1/(1−λ2 ) in discrete time we have proved the first of our extremal
characterizations.
Theorem 3.25 (Extremal characterization of relaxation time) The re-
laxation time τ2 satisfies

τ2 = sup{ ‖g‖2^2 /E(g, g) : Σ_i πi g(i) = 0 }.
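Numerically, the sup is attained by the λ2-eigenvector, and any π-centered
test function gives a lower bound on τ2. The following is our own discrete-time
illustration, not from the text.

```python
import numpy as np

rng = np.random.default_rng(6)
W = rng.random((6, 6))
W = W + W.T
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

S = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
vals, vecs = np.linalg.eigh(S)
tau2 = 1.0 / (1.0 - vals[-2])

def dirichlet(g):
    # E(g,g) via the form (3.75)
    return np.sum(pi * g * (g - P @ g))

g0 = vecs[:, -2] / np.sqrt(pi)           # right eigenvector of P for lambda_2
ratio0 = np.sum(pi * g0**2) / dirichlet(g0)

g = rng.standard_normal(6)
g -= np.sum(pi * g)                      # pi-centered test function
ratio = np.sum(pi * g**2) / dirichlet(g)
```

`ratio0` equals τ2 exactly, while `ratio` is a valid lower bound; this is the
"inspired guesswork" use of the characterization mentioned above.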

A function g0 , say, attaining the sup in the extremal characterization is,
by examining the argument above, a right eigenvector associated with λ2 :

Σ_j pij g0 (j) = λ2 g0 (i).

(From this point on in the discussion, we assume g0 is normalized so that
‖g0 ‖2 = 1.) The corresponding left eigenvector θ:

Σ_i θi pij = λ2 θj   for all j

is the signed measure θ such that θi = πi g0 (i). To continue a somewhat
informal discussion of the interpretation of g0 , it is convenient to switch to
continuous time (to avoid issues of negative eigenvalues) and to assume λ2
has multiplicity 1. The equation which relates distribution at time t to
initial distribution,

ρj (t) = Σ_i ρi (0) pij (t),

can also be used to define signed measures evolving from an initial signed
measure. For the initial measure θ we have

θ(t) = e^{−t/τ2} θ.

For any signed measure ν = ν(0) with Σ_i νi (0) = 0 we have

ν(t) ∼ c e^{−t/τ2} θ;   c = Σ_i νi (0) θi /πi = Σ_i νi (0) g0 (i).

So θ can be regarded as “the signed measure which relaxes to 0 most slowly”.
For a probability measure ρ(0), considering ρ(0) − π gives

ρ(t) − π ∼ c e^{−t/τ2} θ,   c = Σ_i (ρi (0) − πi ) g0 (i) = Σ_i ρi (0) g0 (i).    (3.80)

So θ has the interpretation of “the asymptotic normalized difference between
the true distribution at time t and the stationary distribution”. Finally,
from (3.80) with ρ(0) concentrated at i (or from the spectral representation)

Pi (Xt ∈ ·) − π ∼ g0 (i) e^{−t/τ2} θ.

So g0 has the interpretation of “the asymptotic normalized size of deviation
from stationarity, as a function of the starting state”. When the state space
has some geometric structure – jumps go to nearby states – one expects g0
to be a “smooth” function, exemplified by the cosine function arising in the
n-cycle (Chapter 5 Example 7).

3.6.4 Simple applications


Here is a fundamental “finite-time” result. [Good name for Lemma 3.26 –
looks good to DA!]
Lemma 3.26 (L2 contraction lemma) Write ρ(t) = (ρj (t)) for the dis-
tribution at time t of a continuous-time chain, with arbitrary initial distri-
bution. Then
‖ρ(t) − π‖2 ≤ e^{−t/τ2} ‖ρ(0) − π‖2 .

Proof. Write fj (t) = ρj (t)/πj . Then

(d/dt) ‖ρ(t) − π‖2^2 = −2 E(f (t), f (t))   by Lemma 3.24
  = −2 E(f (t) − 1, f (t) − 1)
  ≤ −2 ‖f (t) − 1‖2^2 /τ2   by the extremal characterization of τ2
  = −(2/τ2 ) ‖ρ(t) − π‖2^2 .

Integrating, ‖ρ(t) − π‖2^2 ≤ e^{−2t/τ2} ‖ρ(0) − π‖2^2 , and the result follows.
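A numerical check of the contraction bound for a continuized chain, computing
ρ(t) through the spectral decomposition (our sketch, not from the text):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.random((6, 6))
W = W + W.T
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

S = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
vals, vecs = np.linalg.eigh(S)
tau2 = 1.0 / (1.0 - vals[-2])

rho0 = np.zeros(6)
rho0[0] = 1.0                            # start at state 0

def rho_t(t):
    # rho(t) = rho0 exp(t(P-I)), via exp(t(P-I)) = D^{-1/2} U e^{t(vals-1)} U' D^{1/2}
    x = (rho0 / np.sqrt(pi)) @ vecs
    return ((x * np.exp((vals - 1.0) * t)) @ vecs.T) * np.sqrt(pi)

def l2(rho):
    return np.sqrt(max(np.sum(rho**2 / pi) - 1.0, 0.0))

ts = np.linspace(0.0, 5.0, 50)
ok = all(l2(rho_t(t)) <= np.exp(-t / tau2) * l2(rho0) + 1e-9 for t in ts)
```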
Alternatively, Lemma 3.26 follows by observing that

‖ρ(t) − π‖2^2 is CM with spectral gap at least 2λ2 = 2/τ2    (3.81)

and applying Lemma 3.16. The fact (3.81) can be established directly from
the spectral representation, but we will instead apply the observation at
(3.50)–(3.51). Indeed, with g(i) :≡ ρi (0)/πi , we have

Eπ [g(X2t )g(X0 )] = Σ_i ρi (0) Σ_j pij (2t) ρj (0)/πj
  = Σ_i ρi (0) Σ_j Σ_k pik (t) pkj (t) ρj (0)/πj
  = Σ_k (1/πk ) ( Σ_i ρi (0) pik (t) ) ( Σ_j ρj (0) pjk (t) )
  = Σ_k ρk^2 (t)/πk = ‖ρ(t) − π‖2^2 + 1.

Thus by (3.51)

‖ρ(t) − π‖2^2 = Σ_{m=2}^n ( Σ_i πi^{1/2} g(i) uim )^2 exp(−2λm t).

Our main use of the extremal characterization is to compare relaxation


times of different chains on the same (or essentially the same) state space.
Here are three instances. The first is a result we have already exploited in
Section 3.5.
Corollary 3.27 Given a chain with relaxation time τ2 , let τ2A be the relax-
ation time of the chain with subset A collapsed to a singleton {a} (Chapter 2
Section 7.3). Then τ2A ≤ τ2 .

Proof. Any function g on the states of the collapsed chain can be extended to
the original state space by setting g = g(a) on A, and E(g, g) and Σ_i πi g(i)
and ‖g‖2^2 are unchanged. So consider a g attaining the sup in the extremal
characterization of τ2A and use this as a test function in the extremal char-
acterization of τ2 .
Remark. An extension of Corollary 3.27 will be provided by the contrac-
tion principle (Chapter 4 Proposition 44).

Corollary 3.28 Let τ2 be the relaxation time for a “fluid model” continuous-
time chain associated with a graph with weights (we ) [recall (3.16)] and let τ2∗
be the relaxation time when the weights are (we∗ ). If we∗ ≥ we for all edges e
then τ2∗ ≤ τ2 .

Proof. Each stationary distribution is uniform, so ‖g‖2^2 = ‖g‖2^{∗2} while
E ∗ (g, g) ≥ E(g, g). So the result is immediate from the extremal charac-
terization.
The next result is a prototype for more complicated “indirect compari-
son” arguments later. It is convenient to state it in terms of random walk
on a weighted graph. [Refer to the not-yet-written Poincaré chapter.] Recall
(Section 3.2) that a reversible chain specifies a weighted graph with edge-
weights wij = πi pij , vertex-weights wi = πi , and total weight w = 1.

Lemma 3.29 (the direct comparison lemma) Let (we ) and (we∗ ) be edge-
weights on a graph, let (wi ) and (wi∗ ) be the vertex-weights, and let τ2 and τ2∗
be the relaxation times for the associated random walks. Then

min_e (we /we∗ ) / max_i (wi /wi∗ ) ≤ τ2 /τ2∗ ≤ max_i (wi /wi∗ ) / min_e (we /we∗ )

where in min_e we don’t count loops e = (v, v).
Proof. For any g, by (3.77)

w∗ E ∗ (g, g) ≥ w E(g, g) min_e (we∗ /we ).

And since w ‖g‖2^2 = Σ_i wi g^2 (i),

w∗ ‖g‖2^{∗2} ≤ w ‖g‖2^2 max_i (wi∗ /wi ).

So if g has π ∗ -mean 0 and π-mean b then

‖g‖2^{∗2} /E ∗ (g, g) ≤ ‖g − b‖2^{∗2} /E ∗ (g − b, g − b)
  ≤ ( ‖g − b‖2^2 /E(g − b, g − b) ) · max_i (wi∗ /wi ) / min_e (we∗ /we ).

By considering the g attaining the extremal characterization of τ2∗ ,

τ2∗ ≤ τ2 max_i (wi∗ /wi ) / min_e (we∗ /we ).

This is the lower bound in the lemma, and the upper bound follows by
reversing the roles of we and we∗ .
Remarks. Sometimes τ2 is very sensitive to apparently-small changes in
the chain. Consider random walk on an unweighted graph. If we add extra
edges, but keeping the total number of added edges small relative to the
number of original edges, then we might guess that τ2 could not increase or
decrease much. But the examples outlined below show that τ2 may in fact
change substantially in either direction.

Example 3.30 Take two complete graphs on n vertices and join with a
single edge. Then w = 2n(n − 1) + 2 and τ2 ∼ n^2 /2. But if we extend the
single join-edge to an n-edge matching of the vertices in the original two
complete graphs, then w∗ = 2n(n − 1) + 2n ∼ w but τ2∗ ∼ n/2.
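These asymptotics are already visible at moderate n. Our sketch below (not
from the text) compares the two relaxation times numerically, checking the
stated constants only loosely since they are asymptotic.

```python
import numpy as np

def tau2_of(adj):
    """Relaxation time of random walk on a graph given by adjacency matrix."""
    deg = adj.sum(axis=1)
    P = adj / deg[:, None]
    pi = deg / deg.sum()
    S = np.sqrt(pi)[:, None] * P / np.sqrt(pi)[None, :]
    return 1.0 / (1.0 - np.sort(np.linalg.eigvalsh(S))[-2])

n = 30
K = np.ones((n, n)) - np.eye(n)          # complete graph K_n
A1 = np.zeros((2 * n, 2 * n))
A1[:n, :n] = K
A1[n:, n:] = K
A1[0, n] = A1[n, 0] = 1.0                # single join edge

A2 = A1.copy()
for i in range(n):
    A2[i, n + i] = A2[n + i, i] = 1.0    # n-edge matching instead

t_single, t_match = tau2_of(A1), tau2_of(A2)
```

Although the total edge weights are nearly equal, `t_single` is on the order
of n^2/2 while `t_match` is on the order of n/2.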

Example 3.31 Take a complete graph on n vertices. Take k = o(n^{1/2} ) new
vertices and attach each to distinct vertices of the original complete graph.
Then w = n(n − 1) + 2k and τ2 is bounded. But if we now add all edges
within the new k vertices, w∗ = n(n − 1) + 2k + k(k − 1) ∼ w but τ2∗ ∼ k
provided k → ∞.

As these examples suggest, comparison arguments are most effective


when the stationary distributions coincide. Specializing Lemma 3.29 to this
case, and rephrasing in terms of (reversible) chains, gives

Lemma 3.32 (the direct comparison lemma) For transition matrices P
and P∗ with the same stationary distribution π, if

pij ≥ δ pij∗   for all j ≠ i

then τ2 ≤ δ^{−1} τ2∗ .

Remarks. The hypothesis can be rephrased as P = δP∗ + (1 − δ)Q, where Q


is a (maybe not irreducible) reversible transition matrix with stationary
distribution π. When Q = I we have τ2 = δ −1 τ2∗ , so an interpretation of the
lemma is that “combining transitions of P∗ with noise can’t increase mixing
time any more than combining transitions with holds”.

3.6.5 Quasistationarity
Given a subset A of states in a discrete-time chain, let PA be P restricted
to Ac . Then PA will be a substochastic matrix, i.e., the row-sums are
at most 1, and some row-sum is strictly less than 1. Suppose PA is ir-
reducible. As a consequence of the Perron–Frobenius theorem (e.g., [183]
Theorem 8.4.4) for the nonnegative matrix PA , there is a unique 0 < λ < 1
(specifically, the largest eigenvalue of PA ) such that there is a probability
distribution α satisfying
X
α = 0 on A, αi pij = λαj , j ∈ Ac . (3.82)
i

Writing αA and λA to emphasize dependence on A, (3.82) implies that under PαA the hitting time TA has geometric distribution

    PαA(TA > m) = λA^m,   m ≥ 0,

whence

    EαA TA = 1/(1 − λA).

Call αA the quasistationary distribution and EαA TA the quasistationary mean exit time.
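A hedged numerical illustration (the chain below is a hypothetical 4-state walk on a weighted graph, invented for concreteness): the quasistationary distribution is the left Perron eigenvector of the substochastic matrix PA, and the geometric law PαA(TA > m) = λA^m can be checked directly:

```python
import numpy as np

# Illustrative reversible chain: random walk on a 4-vertex weighted graph.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
P = W / W.sum(axis=1, keepdims=True)
Ac = [0, 1, 2]                          # take A = {3}
PA = P[np.ix_(Ac, Ac)]                  # substochastic restriction to A^c

evals, evecs = np.linalg.eig(PA.T)      # left eigenvectors of PA
k = np.argmax(evals.real)
lamA = evals[k].real                    # Perron eigenvalue of PA
alpha = np.abs(evecs[:, k].real)
alpha /= alpha.sum()                    # quasistationary distribution on A^c

# geometric law: P_{alpha_A}(T_A > m) = lamA^m
for m in range(6):
    tail = alpha @ np.linalg.matrix_power(PA, m) @ np.ones(len(Ac))
    assert abs(tail - lamA ** m) < 1e-10

mean_exit = 1.0 / (1.0 - lamA)          # quasistationary mean exit time
print(lamA, mean_exit)
```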
Similarly, for a continuous-time chain let QA be Q restricted to Ac. Assuming irreducibility of the substochastic chain with generator QA, there is a unique λ ≡ λA > 0 such that there is a probability distribution α ≡ αA (called the quasistationary distribution) satisfying

    α = 0 on A,    Σ_i αi qij = −λαj,  j ∈ Ac.

This implies that under PαA the hitting time TA has exponential distribution

    PαA(TA > t) = exp(−λA t),   t > 0,

whence the quasistationary mean exit time is

    EαA TA = 1/λA.        (3.83)
Note that both αA and EαA TA are unaffected by continuization of a discrete-
time chain.
The facts above do not depend on reversibility, but invoking now our
standing assumption that chains are reversible we will show in remark (c)
following Theorem 3.33 that, for continuous-time chains, λA here agrees
with the spectral gap λA discussed in Proposition 3.21, and we can also now
prove our second extremal characterization.
96CHAPTER 3. REVERSIBLE MARKOV CHAINS (SEPTEMBER 10, 2002)

Theorem 3.33 (Extremal characterization of quasistationary mean hitting time) The quasistationary mean exit time satisfies

    EαA TA = sup{ ||g||₂²/E(g, g) : g ≥ 0, g = 0 on A }.        (3.84)

Proof. As usual, we give the proof in discrete time. The matrix (sA_ij = πi^{1/2} pA_ij πj^{−1/2}) is symmetric with largest eigenvalue λA. Putting xi = πi^{1/2} g(i) in the characterization (3.78) gives

    λA = sup_g [ Σ_i Σ_j πi g(i) pA_ij g(j) ] / [ Σ_i πi g²(i) ].

Clearly the sup is attained by nonnegative g, and though the sums above are technically over Ac we can sum over all of I by setting g = 0 on A. So

    λA = sup{ [ Σ_i Σ_j πi g(i) pA_ij g(j) ] / [ Σ_i πi g²(i) ] : g ≥ 0, g = 0 on A }.

As in the proof of Theorem 3.25 this rearranges to

    1/(1 − λA) = sup{ ||g||₂²/E(g, g) : g ≥ 0, g = 0 on A },

establishing Theorem 3.33.


Remarks. (a) These remarks closely parallel the remarks at the end of
Section 3.6.3. The sup in Theorem 3.33 is attained by the function g0 which
is the right eigenvector associated with λA , and by reversibility this is

g0 (i) = αA (i)/πi . (3.85)

It easily follows from (3.82) that

PαA (Xt = j | TA > t) = αA (j) for all j and t,

which explains the name quasistationary distribution for αA . A related


interpretation of αA is as the distribution of the Markov chain conditioned
on having been in Ac for the infinite past. More precisely, one can use
Perron–Frobenius theory to prove that

P (Xt = j | TA > t) → αA (j) as t → ∞ (3.86)

provided PA is aperiodic as well as irreducible.



(b) Relation (3.86) holds in continuous time as well (assuming irreducibility of the chain restricted to Ac), yielding

    exp(−λA t) = PαA(TA > t) = lim_{s→∞} Pπ(TA > t + s | TA > s) = lim_{s→∞} Pπ(TA > t + s)/Pπ(TA > s).

Since by Proposition 3.21 the distribution of TA for the stationary chain is CM with spectral gap (say) σA, the limit here is exp(−σA t). Thus λA = σA, that is, our two uses of λA refer to the same quantity.
(c) We conclude from remark (b), (3.83), and the final inequality in (3.69) that, in either continuous or discrete time,

    EαA TA ≥ Eπ TA/π(Ac) ≥ Eπ TA.        (3.87)

Our fundamental use of quasistationarity is the following.

Corollary 3.34 For any subset A, the quasistationary mean hitting time satisfies

    EαA TA ≤ τ2/π(A).

Proof. As at (3.85) set g(i) = αA(i)/πi, so

    EαA TA = ||g||₂²/E(g, g).        (3.88)

Now Eπ g(X0) = 1, so applying the extremal characterization of relaxation time to g − 1,

    τ2 ≥ ||g − 1||₂²/E(g − 1, g − 1) = (||g||₂² − 1)/E(g, g) = (EαA TA)(1 − 1/||g||₂²),        (3.89)

the last equality using (3.88). Since αA is a probability distribution on Ac we have

    1 = Eπ[1_{Ac}(X0) g(X0)],

and so by Cauchy–Schwarz

    1² ≤ Eπ[1_{Ac}(X0)] × ||g||₂² = (1 − π(A)) ||g||₂².

Rearranging,

    1 − 1/||g||₂² ≥ π(A),

and substituting into (3.89) gives the desired bound.
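The sandwich Eπ TA/π(Ac) ≤ EαA TA ≤ τ2/π(A) given by (3.87) and Corollary 3.34 can be checked numerically; the sketch below (discrete time, on an illustrative 4-state weighted-graph walk of our own choosing) computes all three quantities:

```python
import numpy as np

# Numerical check of E_pi T_A / pi(A^c) <= E_{alpha_A} T_A <= tau2 / pi(A),
# on an illustrative 4-state walk with A = {3}.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
pi = wi / wi.sum()
P = W / wi[:, None]
Ac = [0, 1, 2]
piA = pi[3]

PA = P[np.ix_(Ac, Ac)]
lamA = np.max(np.linalg.eigvals(PA).real)        # Perron eigenvalue of PA
mean_exit = 1.0 / (1.0 - lamA)                   # E_{alpha_A} T_A

h = np.linalg.solve(np.eye(3) - PA, np.ones(3))  # h[i] = E_i T_A on A^c
EpiTA = pi[Ac] @ h

d = np.sqrt(pi)
lam = np.sort(np.linalg.eigvalsh(d[:, None] * P / d[None, :]))[::-1]
tau2 = 1.0 / (1.0 - lam[1])                      # discrete-time relaxation time

assert EpiTA / (1 - piA) <= mean_exit + 1e-9     # lower bound from (3.87)
assert mean_exit <= tau2 / piA + 1e-9            # upper bound, Corollary 3.34
print(EpiTA / (1 - piA), mean_exit, tau2 / piA)
```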
Combining Corollary 3.34 with (3.68) and (3.83) gives the result below.

Lemma 3.35 (continuous time)

    Pπ(TA > t) ≤ exp(−t π(A)/τ2),   t ≥ 0.

3.7 Extremal characterizations and mean hitting times
Theorem 3.36 (Extremal characterization of mean commute times)
For distinct states i and a, the mean commute time satisfies

Ei Ta + Ea Ti = sup{1/E(g, g) : 0 ≤ g ≤ 1, g(i) = 1, g(a) = 0} (3.90)

and the sup is attained by g(j) = Pj(Ti < Ta). In discrete time, for a subset A and a state i ∉ A,

πi Pi (TA < Ti+ ) = inf{E(g, g) : 0 ≤ g ≤ 1, g(i) = 1, g(·) = 0 on A} (3.91)

and the inf is attained by g(j) = Pj (Ti < TA ). Equation (3.91) remains true
in continuous time, with πi replaced by qi πi on the left.

Proof. As noted at (3.28), form (3.90) follows (in either discrete or continuous time) from form (3.91) with A = {a}. To prove (3.91), consider g satisfying the specified boundary conditions. Inspecting (3.74), the contribution to E(g, g) involving a fixed state j is

    Σ_{k≠j} πj pjk (g(k) − g(j))².        (3.92)

As a function of g(j) this is minimized by

    g(j) = Σ_k pjk g(k).        (3.93)

Thus the g which minimizes E subject to the prescribed boundary conditions on A ∪ {i} must satisfy (3.93) for all j ∉ A ∪ {i}, and by Chapter 2 Lemma 27 the unique solution of these equations is g(j) = Pj(Ti < TA). Now apply to this g the general expression (3.75):

    E(g, g) = Σ_j πj g(j) ( g(j) − Σ_k pjk g(k) ).

For j ∉ A ∪ {i} the factor (g(j) − Σ_k pjk g(k)) equals zero, and for j ∈ A we have g(j) = 0, so only the j = i term contributes. Thus

    E(g, g) = πi ( 1 − Σ_k pik g(k) )
            = πi (1 − Pi(Ti+ < TA))
            = πi Pi(TA < Ti+),        (3.94)

giving (3.91).
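To see (3.90) in action, the following sketch (an illustration on a hypothetical 4-vertex weighted graph of our own) solves for the harmonic function g(j) = Pj(Ti < Ta) and checks that 1/E(g, g) reproduces the mean commute time computed from first-step linear systems:

```python
import numpy as np

# Mean commute time vs Dirichlet form, cf. (3.90), for i = 0, a = 3.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
pi = wi / wi.sum()
P = W / wi[:, None]
i, a = 0, 3

def mean_hit(P, target):
    n = len(P)
    idx = [k for k in range(n) if k != target]
    h = np.zeros(n)
    h[idx] = np.linalg.solve(np.eye(n - 1) - P[np.ix_(idx, idx)],
                             np.ones(n - 1))
    return h                                 # h[j] = E_j T_target

commute = mean_hit(P, a)[i] + mean_hit(P, i)[a]

# g(j) = P_j(T_i < T_a): harmonic off {i, a}, g(i) = 1, g(a) = 0
free = [1, 2]
g = np.zeros(4)
g[i] = 1.0
g[free] = np.linalg.solve(np.eye(2) - P[np.ix_(free, free)],
                          P[np.ix_(free, [i])].sum(axis=1))

# Dirichlet form E(g,g) = (1/2) sum_{j,k} pi_j p_jk (g(j) - g(k))^2
E = 0.5 * np.sum(pi[:, None] * P * (g[:, None] - g[None, :]) ** 2)
print(commute, 1.0 / E)                      # these two numbers agree
```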
The analogous result for two disjoint subsets A and B is a little complicated to state. The argument above shows that

    inf{ E(g, g) : g(·) = 0 on A, g(·) = 1 on B }

is attained by g0(j) = Pj(TB < TA) and that this g0 satisfies

    E(g0, g0) = Σ_{i∈B} πi Pi(TA < TB+).        (3.95)

We want to interpret the reciprocal of this quantity as a mean time for the
chain to commute from A to B and back. Consider the stationary chain
(Xt ; −∞ < t < ∞). We can define what is technically called a “marked
point process” which records the times at which A is first visited after a
visit to B and vice versa. Precisely, define Zt taking values in {α, β, δ} by
    Zt := β   if ∃s < t such that Xs ∈ A, Xt ∈ B, and Xu ∉ A ∪ B for all s < u < t;
          α   if ∃s < t such that Xs ∈ B, Xt ∈ A, and Xu ∉ A ∪ B for all s < u < t;
          δ   otherwise.

So the times t when Zt = β are the times of first return to B after visiting A,
and the times t when Zt = α are the times of first return to A after visiting B.
Now (Zt ) is a stationary process. By considering the time-reversal of X, we
see that for i ∈ B

P (X0 = i, Z0 = β) = P (X0 = i, TA < TB+ ) = πi Pi (TA < TB+ ).

So (3.95) shows P(Z0 = β) = E(g0, g0). If we define TBAB, “the typical time to go from B to A and back to B”, to have the conditional distribution of min{t ≥ 1 : Zt = β} given Z0 = β, then Kac's formula for the (non-Markov) stationary process Z (see e.g. [133] Theorem 6.3.3; here the chain is in discrete time) says that ETBAB = 1/P(Z0 = β). So we have proved

Corollary 3.37

ETBAB = sup{1/E(g, g) : 0 ≤ g ≤ 1, g(·) = 0 on A, g(·) = 1 on B}

and the sup is attained by g(i) = Pi (TB < TA ).

As another interpretation of this quantity, define

ρB (·) = P (X0 ∈ · | Z0 = β), ρA (·) = P (X0 ∈ · | Z0 = α).

Interpret ρB and ρA as the distributions of hitting places on B and on A in the commute process. It is intuitively clear, and not hard to verify, that

    PρA(X(TB) ∈ ·) = ρB(·),    PρB(X(TA) ∈ ·) = ρA(·),

    ETBAB = EρB TA + EρA TB.

In particular

    min_{i∈B} Ei TA + min_{i∈A} Ei TB ≤ ETBAB ≤ max_{i∈B} Ei TA + max_{i∈A} Ei TB.
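A small worked check of Corollary 3.37 and the identity ETBAB = EρB TA + EρA TB, on an example of our own devising (A = {0}, B = {2, 3} in a hypothetical 4-vertex weighted graph):

```python
import numpy as np

# Commute process between A = {0} and B = {2, 3} on an illustrative graph.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
pi = wi / wi.sum()
P = W / wi[:, None]
A, B = [0], [2, 3]

# g0(j) = P_j(T_B < T_A): 0 on A, 1 on B, harmonic at the free state 1
g0 = np.array([0., 0., 1., 1.])
g0[1] = (P[1] @ g0) / (1 - P[1, 1])

E = 0.5 * np.sum(pi[:, None] * P * (g0[:, None] - g0[None, :]) ** 2)

def mean_hit(P, S):
    idx = [k for k in range(4) if k not in S]
    h = np.zeros(4)
    h[idx] = np.linalg.solve(np.eye(len(idx)) - P[np.ix_(idx, idx)],
                             np.ones(len(idx)))
    return h                                  # h[j] = E_j T_S

# rho_B(i) proportional to pi_i P_i(T_A < T_B^+), i in B; rho_A = delta_0
rhoB = pi[B] * (P[B] @ (1 - g0))
rhoB /= rhoB.sum()
ET_BAB = rhoB @ mean_hit(P, A)[B] + mean_hit(P, B)[A[0]]
print(ET_BAB, 1.0 / E)                        # both equal 40/7 here
```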

3.7.1 Thompson's principle and leveling networks

Theorem 3.36 was stated in terms of (reversible) Markov chains. Rephrasing in terms of discrete-time random walk on a weighted graph gives the usual "electrical network" formulation of the Dirichlet principle stated below, using (3.77), (3.91) and (3.94). Recall from Proposition 3.10 that the effective resistance r between v0 and A is, in terms of the random walk,

    r = 1/( wv0 Pv0(TA < Tv0+) ).        (3.96)

Proposition 3.38 (The Dirichlet principle) Take a weighted graph and fix a vertex v0 and a subset A of vertices not containing v0. Then the quantity (1/2) Σ_i Σ_j wij (g(j) − g(i))² is minimized, over all functions g : I → [0, 1] with g(v0) = 1 and g(·) = 0 on A, by the function g(i) := Pi(Tv0 < TA) (where probabilities refer to random walk on the weighted graph), and the minimum value equals 1/r, where r is the effective resistance (3.96).
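As an illustration (graph and labels are our own), the Dirichlet principle can be checked by solving for the harmonic function and comparing the quadratic form with wv0 Pv0(TA < Tv0+) from (3.96):

```python
import numpy as np

# Dirichlet principle on an illustrative weighted graph: v0 = 0, A = {3}.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
P = W / wi[:, None]
v0, A, free = 0, [3], [1, 2]

# minimizing g: g(v0) = 1, g = 0 on A, harmonic elsewhere
g = np.zeros(4)
g[v0] = 1.0
g[free] = np.linalg.solve(np.eye(2) - P[np.ix_(free, free)],
                          P[np.ix_(free, [v0])].sum(axis=1))

dirichlet = 0.5 * np.sum(W * (g[:, None] - g[None, :]) ** 2)

# (3.96): 1/r = w_{v0} P_{v0}(T_A < T_{v0}^+), by first-step decomposition
escape = P[v0] @ (1.0 - g)
print(dirichlet, wi[v0] * escape)     # both equal 1/r
```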

There is a dual form of the Dirichlet principle, which following Doyle and
Snell [131] we call

Proposition 3.39 (Thompson's principle) Take a weighted graph and fix a vertex v0 and a subset A of vertices not containing v0. Let f = (fij) denote a unit flow from v0 to A. Then (1/2) Σ_i Σ_j (fij²/wij) is minimized, over all such flows, by the flow f^{v0→A} [defined at (3.18)] associated with the random walk from v0 to A, and the minimum value equals the effective resistance r appearing in (3.96).

Recall that a flow is required to have fij = 0 whenever wij = 0, and interpret sums Σ_i Σ_j as sums over ordered pairs (i, j) with wij > 0.
Proof. Write ψ(f) := (1/2) Σ_i Σ_j (fij²/wij). By formula (3.25) relating the random walk notions of "flow" and "potential", the fact that ψ(f^{v0→A}) = r is immediate from the corresponding equality in the Dirichlet principle. So the issue is to prove that for a unit flow f∗, say, attaining the minimum of ψ(f), we have ψ(f∗) = ψ(f^{v0→A}). To prove this, consider two arbitrary paths (yi) and (zj) from v0 to A, and let f^ε denote the flow f∗ modified by adding flow rates +ε along the edges (yi, yi+1) and by adding flow rates −ε along the edges (zi, zi+1). Then f^ε is still a unit flow from v0 to A. So the function ε → ψ(f^ε) must have derivative zero at ε = 0, and this becomes the condition that

    Σ_i (f∗_{yi,yi+1}/w_{yi,yi+1}) = Σ_i (f∗_{zi,zi+1}/w_{zi,zi+1}).

So the sum is the same for all paths from v0 to A. Fixing x, the sum must be the same for all paths from x to A, because two paths from x to A could be extended to paths from v0 to A by appending a common path from v0 to x. It follows that we can define g∗(x) as the sum Σ_i (f∗_{xi,xi+1}/w_{xi,xi+1}) over some path (xi) from x to A, and the sum does not depend on the path chosen. So

    g∗(x) − g∗(z) = f∗_{xz}/wxz  for each edge (x, z) not contained within A.        (3.97)

(This is essentially the same argument used in Section 3.3.2.) The fact that f∗ is a flow means that, for x ∉ A ∪ {v0},

    0 = Σ_{z: wxz>0} f∗_{xz} = Σ_z wxz (g∗(x) − g∗(z)).

So g∗ is a harmonic function outside A ∪ {v0}, and g∗ = 0 on A. So by the uniqueness result (Chapter 2 Lemma 27) we have that g∗ must be proportional to g, the minimizing function in Proposition 3.38. So f∗ is proportional to f^{v0→A}, because the relationship (3.97) holds for both, and then f∗ = f^{v0→A} because both are unit flows.
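The stationarity-of-ψ argument in the proof can be seen numerically: below (same style of hypothetical 4-vertex graph, our own example) we build the flow associated with the harmonic potential, check ψ(f) = r, and verify that perturbing by a unit cycle flow only increases ψ:

```python
import numpy as np

# Thompson's principle: psi is minimized by the flow derived from the
# harmonic potential; cycle perturbations can only increase it.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
P = W / wi[:, None]
v0, A, free = 0, [3], [1, 2]

g = np.zeros(4)
g[v0] = 1.0
g[free] = np.linalg.solve(np.eye(2) - P[np.ix_(free, free)],
                          P[np.ix_(free, [v0])].sum(axis=1))

current = W * (g[:, None] - g[None, :])   # Ohm's law; antisymmetric
r = 1.0 / current[v0].sum()               # effective resistance
f = r * current                           # rescaled to a unit flow v0 -> A

def psi(f):
    mask = W > 0
    return 0.5 * np.sum(f[mask] ** 2 / W[mask])

assert abs(psi(f) - r) < 1e-10            # the minimum value is r

c = np.zeros((4, 4))                      # unit flow around cycle 0-1-2-0
for x, y in [(0, 1), (1, 2), (2, 0)]:
    c[x, y] += 1.0
    c[y, x] -= 1.0
for eps in (0.1, -0.1):
    assert psi(f + eps * c) > psi(f)      # strict convexity of psi
print(r, psi(f))
```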

A remarkable statistical interpretation was discussed in a monograph of


Borre and Meissl [57]. Imagine a finite set of locations such as hilltops. For
each pair of locations (i, j) with a clear line-of-sight, measure the elevation
difference Dij = (height of j minus height of i). Consider the associated
graph [whose edges are such pairs (i, j)], and suppose it is connected. Take
one location v0 as a benchmark “height 0”. If our measurements were exact
we could determine the height of location x by adding the D’s along a path
from v0 to x, and the sum would not depend on the path chosen. But
suppose our measurements contain random errors. Precisely, suppose Dij equals the true height difference h(j) − h(i) plus an error Yij which has mean 0 and variance 1/wij and is independent for different measurements.
Then it seems natural to estimate the height of x by taking some average
ĥ(x) over paths from v0 to x, and it turns out that the “best” way to average
is to use the random walk from v0 to x and average (over realizations of the
walk) the net height climbed by the walk.
In mathematical terms, the problem is to choose weights fij, not depending on the function h, such that

    ĥ(x) := (1/2) Σ_i Σ_j fij Dij

has E ĥ(x) = h(x) and minimal variance. It is not hard to see that the former "unbiased" property holds iff f is a unit flow from v0 to x. Then

    var ĥ(x) = (1/4) Σ_i Σ_j fij² var(Dij) = (1/4) Σ_i Σ_j fij²/wij

and Proposition 3.39 says this is minimized when we use the flow from v0 to x obtained from the random walk on the weighted graph. But then

    ĥ(x) = Ev0 Σ_{t=1}^{Tx} D_{Xt−1 Xt},

the expectation referring to the random walk.
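A sketch of the leveling-network computation (all data here are simulated; the graph, true heights, and random seed are invented for illustration). The weighted least-squares estimate solves a grounded-Laplacian system, and its exact covariance is the inverse grounded Laplacian, whose diagonal entries are effective resistances to the benchmark:

```python
import numpy as np

# Leveling network: weighted least squares on simulated measurements.
rng = np.random.default_rng(0)
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
n, v0 = 4, 0
h = np.array([0., 2., -1., 5.])          # true heights (unknown in practice)

edges = [(i, j) for i in range(n) for j in range(i + 1, n) if W[i, j] > 0]
D = {(i, j): h[j] - h[i] + rng.normal(0, 1 / np.sqrt(W[i, j]))
     for (i, j) in edges}                # error variance 1/w_ij

L = np.diag(W.sum(axis=1)) - W           # weighted graph Laplacian
keep = [v for v in range(n) if v != v0]  # ground the benchmark v0
b = np.zeros(n)
for (i, j), d in D.items():
    b[j] += W[i, j] * d
    b[i] -= W[i, j] * d
Lred = L[np.ix_(keep, keep)]
hhat = np.zeros(n)
hhat[keep] = np.linalg.solve(Lred, b[keep])   # BLUE of heights, hhat(v0)=0

# exact estimator covariance is Lred^{-1}; its diagonal entries are the
# effective resistances between v0 and the other vertices
cov = np.linalg.inv(Lred)
print(hhat, np.diag(cov))
```

With noiseless data the same linear system recovers the true heights exactly, which is the "sum the D's along any path" statement above.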

3.7.2 Hitting times and Thompson’s principle


Using the commute interpretation of resistance (Corollary 3.11) to translate
Thompson’s principle into an assertion about mean commute times gives
the following.

Corollary 3.40 For random walk on a weighted graph and distinct vertices v and a,

    Ev Ta + Ea Tv = w inf{ (1/2) Σ_i Σ_j (fij²/wij) : f is a unit flow from a to v }

and the inf is attained by the flow f^{a→v} associated with the random walk.
Comparing with Theorem 3.36 we have two different extremal characteriza-
tions of mean commute times, as a sup over potential functions and as an
inf over flows. In practice this “flow” form is less easy to use than the “po-
tential” form, because writing down a flow f is harder than writing down a
function g. But, when we can write down and calculate with some plausible
flow, it gives upper bounds on mean commute times.
One-sided mean hitting times Ei Tj don’t have simple extremal char-
acterizations of the same kind, with the exception of hitting times from
stationarity. To state the result, we need two definitions. First, given a
probability distribution ρ on vertices, a unit flow from a to ρ is a flow f
satisfying
    f(i) = 1_{(i=a)} − ρi  for all i;        (3.98)

more generally, a unit flow from a set A to ρ is defined to satisfy

    Σ_{i∈A} f(i) = 1 − ρ(A)  and  f(i) = −ρi  for all i ∈ Ac.

Now fix a state a and define the special flow f^{a→π} by

    fij := lim_{t0→∞} Ea Σ_{t=1}^{t0} ( 1_{(Xt−1=i, Xt=j)} − 1_{(Xt−1=j, Xt=i)} )        (3.99)

with the usual convention in the periodic case. So fij is the mean excess of transitions i → j compared to transitions j → i, for the chain started at a and run forever. This is a unit flow from a to π, in the above sense.
Equation (6) (the definition of Z) in Chapter 2 and reversibility give the first equality, and Chapter 2 Lemma 12 gives the last equality, in

    fij = Zai pij − Zaj pji
        = Zia πi pij/πa − Zja πj pji/πa
        = (Zia − Zja) πi pij/πa        (3.100)
        = (Ej Ta − Ei Ta) wij/w,        (3.101)

switching to "weighted graphs" notation. Note also that the first-step recurrence for the function i ↦ Zia is

    Zia = ( 1_{(i=a)} − πa ) + Σ_j pij Zja.        (3.102)

Proposition 3.41 For random walk on a weighted graph and a subset A of vertices,

    Eπ TA = w inf{ (1/2) Σ_i Σ_j (fij²/wij) : f is a unit flow from A to π }
          = sup{ 1/E(g, g) : −∞ < g < ∞, g(·) = 1 on A, Σ_i πi g(i) = 0 }.

When A is a singleton {a}, the minimizing flow is the flow f^{a→π} defined above, and the maximizing function g is g(i) = Zia/Zaa. For general A the maximizing function g is g(i) = 1 − Ei TA/Eπ TA.
Proof. Suppose first that A = {a}. We start by showing that the extremizing flow, f∗ say, is the asserted f^{a→π}. By considering adding to f∗ a flow of size ε along a directed cycle, and copying the argument for (3.97) in the proof of Proposition 3.39, there must exist a function g∗ such that

    g∗(x) − g∗(z) = f∗_{xz}/wxz  for each edge (x, z).        (3.103)
The fact that f∗ is a unit flow from a to π says that

    1_{(x=a)} − πx = Σ_z f∗_{xz} = Σ_z wxz (g∗(x) − g∗(z)),

which implies

    (1_{(x=a)} − πx)/wx = Σ_z pxz (g∗(x) − g∗(z)) = g∗(x) − Σ_z pxz g∗(z).

Since wx = wπx and 1/πa = Ea Ta+, this becomes

    g∗(x) = Σ_z pxz g∗(z) − w^{−1} ( 1 − (Ea Ta+) 1_{(x=a)} ).

Now these equations have a unique solution g∗, up to an additive constant, because the difference between two solutions is a harmonic function. (Recall Chapter 2 Corollary 28.) On the other hand, a solution is g∗(x) = −Ex Ta/w, by considering the first-step recurrence for Ex Ta+. So by (3.103) f∗_{xz} = (Ez Ta − Ex Ta) wxz/w, and so f∗ = f^{a→π} by (3.101).
Now consider the function g which minimizes E(g, g) under the constraints Σ_i πi g(i) = 0 and g(a) = 1. By introducing a Lagrange multiplier γ we may consider g as minimizing E(g, g) + γ Σ_i πi g(i) subject to g(a) = 1. Repeating the argument at (3.92), the minimizing g satisfies

    −2 Σ_{k≠j} πj pjk (g(k) − g(j)) + γπj = 0,   j ≠ a.

Rearranging, and introducing a term β 1_{(j=a)} to cover the case j = a, we have

    g(j) = Σ_k pjk g(k) − (γ/2) + β 1_{(j=a)}  for all j,

for some γ, β. Because Σ_j πj g(j) = 0 we have

    0 = 0 − (γ/2) + βπa,

allowing us to rewrite the equation as

    g(j) = Σ_k pjk g(k) + β ( 1_{(j=a)} − πa ).

By the familiar "harmonic function" argument this has a unique solution, and (3.102) shows the solution is g(j) = βZja. Then the constraint g(a) = 1 gives g(j) = Zja/Zaa.
Next consider the relationship between the flow f = f^{a→π} and the function g(i) ≡ Zia/Zaa. We have

    w fij²/wij = (Ej Ta − Ei Ta)² wij/w                        by (3.101)
               = (Eπ Ta)² (wij/w) ((Ej Ta − Ei Ta)/Eπ Ta)²
               = (Eπ Ta)² (wij/w) ((Zia − Zja)/Zaa)²           by Chapter 2 Lemmas 11, 12
               = (Eπ Ta)² (wij/w) (g(i) − g(j))².

Thus it is enough to prove

    w (1/2) Σ_i Σ_j fij²/wij = Eπ Ta        (3.104)

and it will then follow that

    1/E(g, g) = Eπ Ta.

To prove (3.104), introduce a parameter ε (which will later go to 0) and a new vertex z with edge-weights wiz := εwi. Writing superscripts ε to refer to this new graph and its random walk, Corollary 3.40 says

    Ea^ε Tz + Ez^ε Ta = w^ε (1/2) Σ_{i∈I^ε} Σ_{j∈I^ε} (fij^ε)²/wij^ε        (3.105)

where f^ε is the special unit flow from a to z associated with the new graph (which has vertex set I^ε := I ∪ {z}). We want to interpret the ingredients to (3.105) in terms of the original graph. Clearly w^ε = w(1 + 2ε). The new walk has chance ε/(1 + ε) to jump to z from each other vertex, so Ea^ε Tz = (1 + ε)/ε. Starting from z, after one step the new walk has the stationary distribution π on the original graph, and it follows easily that Ez^ε Ta = 1 + Eπ Ta (1 + O(ε)). We can regard the new walk up to time Tz as the old walk sent to z at a random time U^ε with Geometric(ε/(1 + ε)) distribution, so for i ≠ z and j ≠ z the flow fij^ε is the expected net number of transitions i → j by the old walk up to time U^ε. From the spectral representation it follows easily that fij^ε = fij + O(ε). Similarly, for i ≠ z we have −fzi^ε = fiz^ε = Pa(X(U^ε − 1) = i) = πi + O(ε); noting that Σ_{i∈I} fiz^ε = 1, the total contribution of such terms to the double sum in (3.105) is

    Σ_{i∈I} (fiz^ε)²/(εwi) = (1/(wε)) Σ_{i∈I} (πi + (fiz^ε − πi))²/πi
                           = (1/(wε)) ( 1 + Σ_{i∈I} (fiz^ε − πi)²/πi ) = 1/(wε) + O(ε).

So (3.105) becomes

    (1 + ε)/ε + 1 + Eπ Ta + O(ε) = w(1 + 2ε) ( (1/2) Σ_{i∈I} Σ_{j∈I} fij²/wij + 1/(wε) + O(ε) ).

Subtracting (1 + 2ε)/ε from both sides and letting ε → 0 gives the desired (3.104). This concludes the proof for the case A = {a}, once we use the mean hitting time formula to verify

    g(i) ≡ Zia/Zaa = 1 − Ei Ta/Eπ Ta.

Finally, the extension to general A is an exercise in use of the chain, say X∗, in which the set A is collapsed to a single state a. Recall Chapter 2 Section 7.3. In particular,

    w∗_ij = wij, i, j ∈ Ac;   w∗_ia = Σ_{k∈A} wik, i ∈ Ac;   w∗_aa = Σ_{k∈A} Σ_{l∈A} wkl;   w∗ = w.

We now sketch the extension. First, Eπ TA = E∗_{π∗} Ta. Then the natural one-to-one correspondence between functions g on I with g(·) = 1 on A and Σ_{i∈I} πi g(i) = 0 and functions g∗ on I∗ with g∗(a) = 1 and Σ_{i∈I∗} π∗_i g∗(i) = 0 gives a trivial proof of

    Eπ TA = sup{ 1/E(g, g) : −∞ < g < ∞, g(·) = 1 on A, Σ_i πi g(i) = 0 }.

It remains to show that

    inf{ Ψ(f) : f is a unit flow from A to π (for X) }
        = inf{ Ψ∗(f∗) : f∗ is a unit flow from a to π∗ (for X∗) }        (3.106)

where

    Ψ(f) := (1/2) Σ_{i∈I} Σ_{j∈I} (fij²/wij),    Ψ∗(f∗) := (1/2) Σ_{i∈I∗} Σ_{j∈I∗} ((f∗_ij)²/w∗_ij).

Indeed, given a unit flow f from A to π, define f∗_ij as fij if i, j ∈ Ac, and as Σ_{k∈A} fik if i ∈ Ac and j = a. One can check that f∗ is a unit flow from a to π∗ (the key observation being that Σ_{i∈A} Σ_{j∈A} fij = 0) and, using the Cauchy–Schwarz inequality, that Ψ∗(f∗) ≤ Ψ(f). Conversely, given a unit flow f∗ from a to π∗, define fij as f∗_ij if i, j ∈ Ac, as f∗_ia wij/Σ_{k∈A} wik if i ∈ Ac and j ∈ A, and as 0 if i, j ∈ A. One can check that f is a unit flow from A to π and that Ψ(f) = Ψ∗(f∗). We have thus established (3.106), completing the proof.

Corollary 3.42 For chains with transition matrices P, P̃ and the same stationary distribution π,

    min_{i≠j} (pij/p̃ij) ≤ Ẽπ Ta/Eπ Ta ≤ max_{i≠j} (pij/p̃ij).

Proof. Plug the minimizing flow f^{a→π} for the P-chain into Proposition 3.41 for the P̃-chain to get the second inequality. The first follows by reversing the roles of P and P̃.
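The equality case of Corollary 3.42 is easy to check: for the lazy chain P̃ = (I + P)/2 we have pij/p̃ij = 2 for all i ≠ j, so both bounds force Ẽπ Ta = 2 Eπ Ta. A sketch on an illustrative graph of our own choosing:

```python
import numpy as np

# Lazy chain Pt = (I + P)/2: p_ij / pt_ij = 2 off-diagonal, same pi,
# so Corollary 3.42 pins the ratio of stationary mean hitting times at 2.
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
pi = wi / wi.sum()
P = W / wi[:, None]
Pt = 0.5 * (np.eye(4) + P)                # same stationary distribution pi

def Epi_hit(P, a):
    idx = [k for k in range(4) if k != a]
    h = np.linalg.solve(np.eye(3) - P[np.ix_(idx, idx)], np.ones(3))
    return pi[idx] @ h                    # E_pi T_a (the a-term is 0)

for a in range(4):
    assert abs(Epi_hit(Pt, a) - 2 * Epi_hit(P, a)) < 1e-9
print(Epi_hit(P, 3), Epi_hit(Pt, 3))
```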

3.8 Notes on Chapter 3


Textbooks. Almost all the results have long been known to (different groups
of) experts, but it has not been easy to find accessible textbook treatments.
Of the three books on reversible chains at roughly the same level of sophisti-
cation as ours, Kelly [213] emphasizes stationary distributions of stochastic
networks; Keilson [212] emphasizes mathematical properties such as com-
plete monotonicity; and Chen [88] discusses those aspects useful in the study
of interacting particle systems.
Section 3.1. In abstract settings reversible chains are called symmetriz-
able, but that’s a much less evocative term. Elementary textbooks often
give Kolmogorov’s criterion ([213] Thm 1.7) for reversibility, but we have
never found it to be useful.
The following figure may be helpful in seeing why πi Ei Tj ≠ πj Ej Ti for a general reversible chain, even if π is uniform. Run such a chain (Xt) for
a general reversible chain, even if π is uniform. Run such a chain (Xt ) for
−∞ < t < ∞ and record only the times when the chain is in state i or state j.
Then Ei Tj is the long-run empirical average of the passage times from i
to j, indicated by arrows −→; and Ej Ti is the long-run empirical average
of the passage times from j to i in the reverse time direction, indicated by
arrows ←−. One might think these two quantities were averaging the same
empirical intervals, but a glance at the figure shows they are not.
[Figure: a doubly-infinite sample path recorded only at its visits to i and j, with forward arrows marking the passages from i to j and reverse-time arrows marking the passages from j to i.]
Section 3.1.1. Though probabilists would regard the “cyclic tour” Lemma 3.2
as obvious, László Lovász pointed out a complication, that with a careful
definition of starts and ends of tours these times are not invariant under time-
reversal. The sophisticated fix is to use doubly-infinite stationary chains and
observe that tours in reversed time just interleave tours in forward time, so
by ergodicity their asymptotic rates are equal. Tetali [326] shows that the
cyclic tour property implies reversibility. Tanushev and Arratia [321] show
that the distributions of forward and reverse tour times are equal.
Cat-and-mouse game 1 is treated more opaquely in Coppersmith et al [99], whose deeper results are discussed in Chapter 9 Section 4.4. Underlying
the use of the optional sampling theorem in game 2 is a general result about
optimal stopping, but it’s much easier to prove what we want here than to

appeal to general theory. Several algorithmic variations on Proposition 3.3


are discussed in Coppersmith et al [101] and Tetali and Winkler [327].
Section 3.2. Many textbooks on Markov chains note the simple explicit
form of the stationary distribution for random walks on graphs. An histor-
ical note (taken from [107]) is that the first explicit treatment of random
walk on a general finite graph was apparently given in 1935 by Bottema [58],
who proved the convergence theorem. Amongst subsequent papers special-
izing Markov theory to random walks on graphs let us mention Gobel and
Jagers [168], which contains a variety of the more elementary facts given
in this book, for instance the unweighted version of Lemma 3.9. Another
observation from [168] is that for a reversible chain the quantity

    βijl ≡ πj^{−1} Ei(number of visits to j before time Tl)

satisfies βijl = βjil. Indeed, by Chapter 2 Lemma 9 we have

    βijl = (Ei Tl + El Tj + Ej Ti) − (Ej Ti + Ei Tj)

and so the result follows from the cyclic tour property.
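The symmetry βijl = βjil is easy to confirm numerically: the matrix of expected visit counts before Tl is the fundamental matrix (I − P with row/column l deleted)^{−1}. A sketch on an illustrative weighted graph of our own:

```python
import numpy as np

# beta_ijl = pi_j^{-1} E_i(#visits to j before T_l) is symmetric in (i, j).
W = np.array([[0., 1, 2, 0],
              [1, 0, 1, 3],
              [2, 1, 0, 1],
              [0, 3, 1, 0]])
wi = W.sum(axis=1)
pi = wi / wi.sum()
P = W / wi[:, None]

l = 3
idx = [0, 1, 2]
# fundamental matrix: N[i, j] = E_i(#visits to j before T_l),
# counting the visit at time 0
N = np.linalg.inv(np.eye(3) - P[np.ix_(idx, idx)])
beta = N / pi[idx][None, :]
assert np.allclose(beta, beta.T)          # the Gobel-Jagers symmetry
print(beta)
```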


Just as random walks on undirected graphs are as general as reversible
Markov chains, so random walks on directed graphs are as general as general
Markov chains. In particular, one usually has no simple expression like (3.15)
for the stationary distribution. The one tractable case is a balanced directed
graph, where the in-degree dv of each vertex v equals its out-degree.
Section 3.2.1. Yet another way to associate a continuous-time reversible chain with a weighted graph is to set qij = wij/√(wi wj). This construction was used by Chung and Yau [95] as the simplest way to set up discrete analogs of certain results from differential geometry.
Another interpretation of continuous-time random walk on a weighted
graph is to write wij = 1/Lij and interpret Lij as edge-length. Then run
Brownian motion on the edges of the graph. Starting from vertex i, the
chance that j is the first vertex other than i visited is wij/Σ_k wik, so the
embedded discrete chain is the usual discrete random walk. This construc-
tion could be used as an intermediate step in the context of approximating
Brownian motion on a manifold by random walk on a graph embedded in
the manifold.
Section 3.3. Doyle and Snell [131] gave a detailed elementary textbook
exposition of Proposition 3.10 and the whole random walk / electrical net-
work connection. Previous brief textbook accounts were given by Kemeny et
al [215] and Kelly [213]. Our development follows closely that of Chapter 2
in Lyons and Peres [250]. As mentioned in the text, the first explicit use

(known to us) of the mean commute interpretation was given by Chandra et


al [85]. One can combine the commute formula with the general identities
9/10/99 version of Chapter 2 to obtain numerous identities relating mean hitting times and
resistances, some of which are given (using bare-hands proofs instead) in
Tetali [324]. The connection between Foster’s theorem and Lemma 3.9 was
noted in [99].
Section 3.4. The spectral theory is of course classical. In devising a symmetric matrix one could use πi pij or pij πj^{−1} instead of πi^{1/2} pij πj^{−1/2} — there doesn't seem any systematic advantage to a particular choice. We
learned the eigentime identity from Andrei Broder who used it in [66],
and Lemma 3.15 from David Zuckerman who used it in [341]. Apparently
no-one has studied whether Lemma 3.15 holds for general chains. Mark
Brown (personal communication) has noted several variations on the theme
of Lemma 3.15, for example that the unweighted average of (Ei Tj ; i, j ∈ A)
is bounded by the unweighted average of (Eπ Tj ; j ∈ A). The name eigen-
time identity is our own coinage: once we call 1/λ2 the relaxation time it is
natural to start thinking of the other 1/λm as “eigentimes”.
Section 3.5. We regard complete monotonicity as a name for “mixtures
of exponentials”, and have not used the analytic characterization via deriva-
tives of alternating signs. Of course the CM property is implicit in much
analysis of reversible Markov processes, but we find it helpful to exhibit
explicitly its use in obtaining inequalities. This idea in general, and in par-
ticular the “stochastic ordering of exit times” result (Proposition 3.21), were
first emphasized by Keilson [212] in the context of reliability and queueing
models. Brown [73] gives other interesting consequences of monotonicity.
Section 3.5.1. Parts of Proposition 3.18 have been given by several au-
thors, e.g., Broder and Karlin [66] Corollary 18 give (3.53). One can invent
many variations. Consider for instance min_i max_j Ei Tj. On the complete
graph this equals n − 1, but this is not the minimum value, as observed
by Erik Ordentlich in a homework exercise. If we take the complete graph,
distinguish a vertex i0 , let the edges involving i0 have weight ε and the other
edges have weight 1, then as ε → 0 we have (for j ≠ i0)

    Ei0 Tj → 1/(n−1) + ((n−2)/(n−1)) (1 + (n − 2)) = n − 2 + 1/(n−1).

By the random target lemma and (3.53), the quantity under consideration is at least τ0 ≥ n − 2 + 1/n, so the example is close to optimal.
Section 3.5.4. The simple result quoted as Proposition 3.22 is actually
weaker than the result proved in Brown [72]. The ideas in the proof of
Proposition 3.23 are in Aldous [12] and in Brown [73], the latter containing

a shorter Laplace transform argument for (3.73). Aldous and Brown [20]
give a more detailed account of the exponential approximation, including
the following result which is useful in precisely the situation where Proposi-
tion 3.23 is applicable, that is, when Eπ TA is large compared to τ2 .

Theorem 3.43 Let αA be the quasistationary distribution on Ac defined at (3.82). Then

    Pπ(TA > t) ≥ ( 1 − τ2/(EαA TA) ) exp( −t/(EαA TA) ),   t > 0,

    Eπ TA ≥ EαA TA − τ2.

Using this requires only a lower bound on EαA TA, which can often be obtained
using the extremal characterization (3.84). Connections with “interleaving
of eigenvalues” results are discussed in Brown [74].
For general chains, explicit bounds on exponential approximation are
much messier: see Aldous [5] for a bound based upon total variation mixing
and Iscoe and McDonald [192] for a bound involving spectral gaps.
Section 3.6.1. Dirichlet forms were developed for use with continuous-
space continuous-time Markov processes, where existence and uniqueness
questions can be technically difficult—see, e.g., Fukushima [159]. Their
use subsequently trickled down to the discrete world, influenced, e.g., by
the paper of Diaconis and Stroock [124]. Chen [88] is the most accessible
introduction.
Section 3.6.2. Since mean commute times have two dual extremal char-
acterizations, as sups over potential functions and as infs over flows, it is
natural to ask

Open Problem 3.44 Does there exist a characterization of the relaxation


time as exactly an inf over flows?

We will see in Chapter 4 Theorem 32 an inequality giving an upper bound
on the relaxation time in terms of an inf over flows, but it would be more
elegant to derive such inequalities from some exact characterization.
Section 3.6.4. Lemma 3.32 is sometimes used to show, by comparison with the i.i.d. chain,

    if min_{i,j: i≠j} (pij/πj) = δ > 0 then τ2 ≤ δ^{−1}.

But this is inefficient: direct use of submultiplicativity of variation distance gives a stronger conclusion.

Section 3.6.5. Quasistationary distributions for general chains have long


been studied in applied probability, but the topic lacks a good survey article.
Corollary 3.34 is a good example of a repeatedly-rediscovered simple-yet-
useful result which defies attempts at attribution.
Kahale [203] Corollary 6.1 gives a discrete time variant of Lemma 3.35,
that is to say an upper bound on Pπ (TA ≥ t), and Alon et al [27] Proposi-
tion 2.4 give a lower bound in terms of the smallest eigenvalue (both results
are phrased in the context of random walk on an undirected graph). In
studying bounds on TA such as Lemma 3.35 we usually have in mind that
π(A) is small. One is sometimes interested in exit times from a set A with
π(A) small, i.e., hitting times on Ac where π(Ac ) is near 1. In this setting
one can replace inequalities using τ2 or τc (parameters which involve the
whole chain) by inequalities involving analogous parameters for the chain
restricted to A and its boundary. See Babai [36] for uses of such bounds.
Section 3.7.1. Use of Thompson’s principle and the Dirichlet principle
to study transience / recurrence of countably infinite state space chains is
given an elementary treatment in Doyle and Snell [131] and more technical
treatments in papers of Nash-Williams [266], Griffeath and Liggett [173] and
Lyons [251]. Some reformulations of Thompson’s principle are discussed by
Berman and Konsowa [45].
We learned about the work of Borre and Meissl [57] on leveling networks
from Persi Diaconis. Here is another characterization of effective resistance
that is of a similar spirit. Given a weighted graph, assign independent
Normal(0, 1/wij ) random variables Xij to the edges (i, j), with Xji = −Xij .
Then condition on the event that the sum around any cycle vanishes. The
conditional process (Yij ) is still Gaussian. Fix a reference vertex v∗ and
for each vertex v let Sv be the sum of the Y -values along a path from v∗
to v. (The choice of path doesn’t matter, because of the conditioning event.)
Then (obviously) Sv is mean-zero Normal but (not obviously) its variance
is the effective resistance between v∗ and v. This is discussed and proved in
Janson [194] Section 9.4.
Section 3.7.2. We have never seen in the literature an explicit statement
of the extremal characterizations for mean hitting times from a stationary
start (Proposition 3.41), but these are undoubtedly folklore, at least in the
“potential” form. Iscoe et al [193] implicitly contains the analogous charac-
terization of Eπ exp(−θTA ). Steve Evans once showed us an argument for
Corollary 3.42 based on the usual Dirichlet principle, and that motivated us
to present the “natural explanation” given by Proposition 3.41.
Chapter 4

Hitting and Convergence Time, and Flow Rate, Parameters
for Reversible Markov Chains (October 11, 1994)

The elementary theory of general finite Markov chains (cf. Chapter 2) fo-
cuses on exact formulas and limit theorems. My view is that, to the extent
there is any intermediate-level mathematical theory of reversible chains, it is
a theory of inequalities. Some of these were already seen in Chapter 3. This
chapter is my attempt to impose some order on the subject of inequalities.
We will study the following five parameters of a chain. Recall our standing
assumption that chains are finite, irreducible and reversible, with stationary
distribution π.
(i) The maximal mean commute time

τ ∗ = max_{i,j} (Ei Tj + Ej Ti )

(ii) The average hitting time

τ0 = Σ_i Σ_j πj πi Ei Tj .

(iii) The variation threshold time

τ1 = inf{t > 0 : d̄(t) ≤ e−1 }

where as in Chapter 2 section yyy


d̄(t) = max_{i,j} ||Pi (Xt ∈ ·) − Pj (Xt ∈ ·)||

(iv) The relaxation time τ2 , i.e. the time constant in the asymptotic rate
of convergence to the stationary distribution.
(v) A “flow” parameter

τc = sup_A [ π(A)π(Ac ) / Σ_{i∈A} Σ_{j∈Ac} πi pij ] = sup_A [ π(Ac ) / Pπ (X1 ∈ Ac |X0 ∈ A) ]

in discrete time, and

τc = sup_A [ π(A)π(Ac ) / Σ_{i∈A} Σ_{j∈Ac} πi qij ] = sup_A [ π(Ac ) dt / Pπ (Xdt ∈ Ac |X0 ∈ A) ]

in continuous time.
The following table may be helpful. “Average-case” is intended to indi-
cate essential use of the stationary distribution.

                 worst-case   average-case
hitting times    τ ∗          τ0
mixing times     τ1           τ2
flow                          τc

The table suggests there should be a sixth parameter, but I don’t have a
candidate.
The ultimate point of this study, as will be seen in following chapters, is

• For many questions about reversible Markov chains, the way in which
the answer depends on the chain is related to one of these parameters

• so it is useful to have methods for estimating these parameters for


particular chains.

This Chapter deals with relationships between these parameters, simple il-
lustrations of properties of chains which are closely connected to the pa-
rameters, and methods of bounding the parameters. To give a preview,
it turns out that these parameters are essentially decreasing in the order
(τ ∗ , τ0 , τ1 , τ2 , τc ): precisely,

(1/2) τ ∗ ≥ τ0 ≥ τ2 ≥ τc
66 τ0 ≥ τ1 ≥ τ2

and perhaps the constant 66 can be reduced to 1. There are no general


reverse inequalities, but reverse bounds involving extra quantities provide a
rich and sometimes challenging source of problems.
The reader may find it helpful to read this chapter in parallel with the list
of examples of random walks on unweighted graphs in Chapter 5. As another
preview, we point out that on regular n-vertex graphs each parameter may
be as large as Θ(n2 ) but no larger; and τ ∗ , τ0 may be as small as Θ(n) and
the other parameters as small as Θ(1), but no smaller. The property (for a
sequence of chains) “τ0 = O(n)” is an analog of the property “transience”
for a single infinite-state chain, and the property “τ2 = O(poly(log n))” is
an analog of the “non-trivial boundary” property for a single infinite-state
chain. These analogies are pursued in Chapter yyy.
The next five sections discuss the parameters in turn, the relationship
between two different parameters being discussed in the latter’s section.
Except for τ1 , the numerical values of the parameters are unchanged by
continuizing a discrete-time chain. And the results of this Chapter not
involving τ1 hold for either discrete or continuous-time chains.

4.1 The maximal mean commute time τ ∗


We start by repeating the definition

τ ∗ ≡ max_{i,j} (Ei Tj + Ej Ti )                    (4.1)

and recalling what we already know. Obviously

max_{i,j} Ei Tj ≤ τ ∗ ≤ 2 max_{i,j} Ei Tj

and by Chapter 3 Lemma yyy

max_j Eπ Tj ≤ τ ∗ ≤ 4 max_j Eπ Tj .                    (4.2)

Arguably we could have used max_{i,j} Ei Tj as the “named” parameter, but


the virtue of τ ∗ is the resistance interpretation of Chapter 3 Corollary yyy.

Lemma 4.1 For random walk on a weighted graph,

τ ∗ = w max_{i,j} rij

where rij is the effective resistance between i and j.
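Lemma 4.1 is easy to check numerically. The sketch below is our own assumed example, not from the text: a weighted 4-path with edge-weights 1, 2, 5. It computes mean hitting times from the first-step equations with a small Gaussian-elimination solver and compares the commute time against w rij, where the effective resistance of a path is the series sum of 1/we.

```python
# Check Lemma 4.1 on a weighted path 0-1-2-3 (an assumed example):
# the commute time E_0 T_3 + E_3 T_0 should equal w * r_03, where
# w = sum_i w_i and r_03 = 1/w_01 + 1/w_12 + 1/w_23 (series resistance).

def solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

we = [1.0, 2.0, 5.0]                      # weights of edges 0-1, 1-2, 2-3
n = 4
wv = [we[0]] + [we[k - 1] + we[k] for k in range(1, n - 1)] + [we[-1]]
w = sum(wv)                               # total weight, = 16 here

def hit(target):
    """h(i) = E_i T_target from the first-step equations h = 1 + P h."""
    idx = [i for i in range(n) if i != target]
    A = [[float(i == j) for j in idx] for i in idx]
    b = [1.0] * len(idx)
    for r, i in enumerate(idx):
        nbrs = ([(i - 1, we[i - 1])] if i > 0 else []) + \
               ([(i + 1, we[i])] if i < n - 1 else [])
        for nb, ew in nbrs:
            if nb != target:
                A[r][idx.index(nb)] -= ew / wv[i]
    return dict(zip(idx, solve(A, b)))

E03, E30 = hit(3)[0], hit(0)[3]
commute = E03 + E30                       # = 5.2 + 22.0 = 27.2
r03 = sum(1.0 / x for x in we)            # = 1.7
assert abs(commute - w * r03) < 1e-9
```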



In Chapter 3 Proposition yyy we proved lower bounds for any n-state
discrete-time reversible chain:

τ ∗ ≥ 2(n − 1)
max_{i,j} Ei Tj ≥ n − 1
which are attained by random walk on the complete graph. Upper bounds
will be discussed extensively in Chapter 6, but let’s mention two simple
ideas here. Consider a path i = i0 , i1 , . . . , im = j, and let’s call this path
γij (because we’ve run out of symbols whose names begin with “p”!) This
path, considered in isolation, has “resistance”
r(γij ) ≡ Σ_{e∈γij} 1/we

which by the Monotonicity Law is at least the effective resistance rij . Thus
trivially
τ ∗ ≤ w max_{i,j} min_{paths γij} r(γij ).                    (4.3)

A more interesting idea is to combine the max-flow min-cut theorem (see


e.g. [86] sec. 5.4) with Thompson’s principle (Chapter 3 Corollary yyy).
Given a weighted graph, define
c ≡ min_A Σ_{i∈A} Σ_{j∈Ac} wij                    (4.4)

the min over proper subsets A. The max-flow min-cut theorem implies that
for any pair a, b there exists a flow f from a to b of size c such that |fij | ≤ wij
for all edges (i, j). So there is a unit flow from a to b such that |fe | ≤ c−1 we
for all edges e. It is clear that by deleting any flows around cycles we may
assume that the flow through any vertex i is at most unity, and so
Σ_j |fij | ≤ 2 for all i, and = 1 for i = a, b.                    (4.5)

So

Ea Tb + Eb Ta ≤ w Σ_e fe^2 /we      by Thompson’s principle
             ≤ (w/c) Σ_e |fe |
             ≤ (w/c)(n − 1)        by (4.5).
and we have proved

Proposition 4.2 For random walk on an n-vertex weighted graph,

τ ∗ ≤ w(n − 1)/c
for c defined at (4.4).
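As a sanity check (our own assumed example), Proposition 4.2 can be verified on the unweighted n-cycle, where all the quantities are classical.

```python
# Check Proposition 4.2 on the unweighted n-cycle, where everything is
# classical: E_i T_j = d(n - d) for vertices at distance d, the min cut is
# c = 2, and w = 2n, so the bound reads tau* <= n(n - 1).
n = 6
tau_star = max(2 * d * (n - d) for d in range(1, n))   # attained at d = n//2
w, c = 2 * n, 2
assert tau_star <= w * (n - 1) // c                    # 18 <= 30 for n = 6
```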

Lemma 4.1 and the Monotonicity Law also make clear a one-sided bound
on the effect of changing edge-weights monotonically.

Corollary 4.3 Let w̃e ≥ we be edge-weights and let τ̃ ∗ and τ ∗ be the corre-
sponding parameters for the random walks. Then

(Ei T̃j + Ej T̃i ) / (Ei Tj + Ej Ti ) ≤ w̃/w for all i, j

and so
τ̃ ∗ /τ ∗ ≤ w̃/w.

In the case of unweighted graphs the bound in Corollary 4.3 is |Ẽ|/|E|. Ex-
ample yyy of Chapter 3 shows there can be no lower bound of this type, since
in that example w̃/w = 1 + O(1/n) but (by straightforward calculations)
τ̃ ∗ /τ ∗ = O(1/n).

4.2 The average hitting time τ0


As usual we start by repeating the definition
τ0 ≡ Σ_i Σ_j πj πi Ei Tj                    (4.6)

and recalling what we already know. We know (a result not using reversibil-
ity: Chapter 2 Corollary yyy) the random target lemma
Σ_j πj Ei Tj = τ0 for all i                    (4.7)

and we know the eigentime identity (Chapter 3 yyy)

τ0 = Σ_{m≥2} (1 − λm )^{−1} in discrete time                    (4.8)

τ0 = Σ_{m≥2} λm ^{−1} in continuous time                    (4.9)

In Chapter 3 yyy we proved a lower bound for n-state discrete-time chains:

τ0 ≥ (n − 1)^2 /n

which is attained by random walk on the complete graph.
which is attained by random walk on the complete graph.
We can give a flow characterization by averaging over the characteriza-
tion in Chapter 3 yyy. For each vertex a let f^{a→π} = (fij^{a→π} ) be a flow from
a to π of volume πa , that is a unit flow scaled by πa . Then

τ0 = w min { (1/2) Σ_i Σ_j Σ_a (fij^{a→π} )^2 / (πa wij ) }

the min being over families of flows f^{a→π} described above.
By writing

τ0 = (1/2) Σ_i Σ_j πi πj (Ei Tj + Ej Ti ) ≤ (1/2) max_{i,j} (Ei Tj + Ej Ti )

we see that τ0 ≤ (1/2)τ ∗ . It may happen that τ ∗ is substantially larger than τ0 .


A fundamental example is the M/M/1/n queue (xxx) where τ0 is linear in
n but τ ∗ grows exponentially. A simple example is the two-state chain with
p01 = ε, p10 = 1 − ε, π0 = 1 − ε, π1 = ε

for which τ0 = 1 but τ ∗ = 1/ε + 1/(1 − ε). This example shows that (without extra
assumptions) we can’t improve much on the bound

τ ∗ ≤ 2τ0 / min_j πj                    (4.10)

which follows from the observation Ei Tj ≤ τ0 /πj .
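The two-state example can be worked out in a few lines; the sketch below uses the closed forms E_0 T_1 = 1/p01 and E_1 T_0 = 1/p10 and checks both the value τ0 = 1 and the bound (4.10).

```python
# The two-state example in closed form: E_0 T_1 = 1/p01 and E_1 T_0 = 1/p10,
# so tau0 = pi_0*pi_1*(E_0 T_1 + E_1 T_0) = 1 exactly, while tau* blows up
# as eps -> 0; the bound (4.10) is also checked.
eps = 0.01
E01, E10 = 1.0 / eps, 1.0 / (1.0 - eps)
pi = (1.0 - eps, eps)
tau_star = E01 + E10
tau0 = pi[0] * pi[1] * (E01 + E10)   # = sum_i sum_j pi_j pi_i E_i T_j
assert abs(tau0 - 1.0) < 1e-12
assert tau_star <= 2 * tau0 / min(pi)
```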
One can invent examples of random walks on regular graphs in which
also τ ∗ is substantially larger than τ0 . Under symmetry conditions (vertex-
transitivity, Chapter 7) we know a priori that Eπ Ti is the same for all i and
hence by (4.2) τ ∗ ≤ 4τ0 . In practice we find that τ0 and τ ∗ have the same
order of magnitude in most “naturally-arising” graphs, but I don’t know
any satisfactory formalization of this idea.
The analog of Corollary 4.3 clearly holds, by averaging over i and j.
Corollary 4.4 Let w̃e ≥ we be edge-weights and let τ̃0 and τ0 be the corre-
sponding parameters for the random walks. Then
τ̃0 /τ0 ≤ w̃/w.
In one sense this is mysterious, because in the eigentime identity the largest
term in the sum is the first term, the relaxation time τ2 , and Example yyy
of Chapter 3 shows that there is no such upper bound for τ2 .

4.3 The variation threshold τ1 .


4.3.1 Definitions
Recall from Chapter 2 yyy that || || denotes variation distance and

d(t) ≡ max_i ||Pi (Xt ∈ ·) − π(·)||

d̄(t) ≡ max_{i,j} ||Pi (Xt ∈ ·) − Pj (Xt ∈ ·)||

d(t) ≤ d̄(t) ≤ 2d(t)

d̄(s + t) ≤ d̄(s) d̄(t)

We define the parameter

τ1 ≡ min{t : d̄(t) ≤ e−1 }.                    (4.11)

The choice of constant e−1 , and of using d̄(t) instead of d(t), are rather
arbitrary, but this choice makes the numerical constants work out nicely (in
particular, makes τ2 ≤ τ1 – see section 4.4). Submultiplicativity gives

Lemma 4.5 d(t) ≤ d̄(t) ≤ exp(−⌊t/τ1 ⌋) ≤ exp(1 − t/τ1 ), t ≥ 0.
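For a small chain these quantities are easy to compute directly. The sketch below (our own assumed 3-state example) evaluates d̄(t) from matrix powers and spot-checks the submultiplicative property underlying Lemma 4.5.

```python
# Our own 3-state birth-and-death example: compute dbar(t) by matrix powers
# and spot-check submultiplicativity, dbar(s + t) <= dbar(s) * dbar(t).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

def power(P, t):
    """t-step transition matrix P^t (t >= 1), by repeated multiplication."""
    R = [row[:] for row in P]
    for _ in range(t - 1):
        R = [[sum(R[i][k] * P[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    return R

def dbar(t):
    """Max over pairs (i, j) of the variation distance between rows of P^t."""
    Pt = power(P, t)
    return max(0.5 * sum(abs(a - b) for a, b in zip(Pt[i], Pt[j]))
               for i in range(3) for j in range(3))

for s, t in [(1, 1), (1, 2), (2, 3)]:
    assert dbar(s + t) <= dbar(s) * dbar(t) + 1e-12
```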
The point of parameter τ1 is to formalize the idea of “time to approach
stationarity, from worst starting-place”. The fact that variation distance is
just one of several distances one could use may make τ1 seem a very arbitrary
choice, but Theorem 4.6 below says that three other possible quantifications
of this idea are equivalent. Here equivalent has a technical meaning: param-
eters τa and τb are equivalent if their ratio is bounded above and below by
numerical constants not depending on the chain. (Thus (4.2) says τ ∗ and
maxj Eπ Tj are equivalent parameters). More surprisingly, τ1 is also equiva-
lent to two more parameters involving mean hitting times. We now define
all these parameters.
xxx Warning. Parameters τ1^(4) , τ1^(5) in this draft were parameters τ1^(3) , τ1^(4)
in the previous draft.
The first idea is to measure distance from stationarity by using ratios of
probabilities. Define separation from stationarity to be

s(t) ≡ min{s : pij (t) ≥ (1 − s)πj for all i, j}.

Then s(·) is submultiplicative, so we naturally define the separation thresh-
old time to be

τ1^(1) ≡ min{t : s(t) ≤ e−1 }.

The second idea is to consider minimal random times at which the chain has
exactly the stationary distribution. Let

τ1^(2) ≡ max_i min_{Ui} Ei Ui

where the min is over stopping times Ui such that Pi (X(Ui ) ∈ ·) = π(·). As a
variation on this idea, let us temporarily write, for a probability distribution
µ on the state space,

τ (µ) ≡ max_i min_{Ui} Ei Ui

where the min is over stopping times Ui such that Pi (X(Ui ) ∈ ·) = µ(·).
Then define

τ1^(3) = min_µ τ (µ).

Turning to the parameters involving mean hitting times, we define

τ1^(4) ≡ max_{i,k} Σ_j πj |Ei Tj − Ek Tj | = max_{i,k} Σ_j |Zij − Zkj |                    (4.12)

where the equality involves the fundamental matrix Z and holds by the mean
hitting time formula. Parameter τ1^(4) measures variability of mean hitting
times as the starting place varies. The final parameter is

τ1^(5) ≡ max_{i,A} π(A)Ei TA .

Here we can regard the right side as the ratio of Ei TA , the Markov chain
mean hitting time on A, to 1/π(A), the mean hitting time under independent
sampling from the stationary distribution.
The definitions above make sense in either discrete or continuous time,
but the following notational convention turns out to be convenient. For
a discrete-time chain we define τ1 to be the value obtained by applying
the definition (4.11) to the continuized chain, and write τ1^disc for the value
obtained for the discrete-time chain itself. Define similarly τ1^(1) and τ1^(1),disc .
But the other parameters τ1^(2) − τ1^(5) are defined directly in terms of the
discrete-time chain. We now state the equivalence theorem, from Aldous
[6].
Theorem 4.6 (a) In either discrete or continuous time, the parameters
τ1 , τ1^(1) , τ1^(2) , τ1^(3) , τ1^(4) and τ1^(5) are equivalent.
(b) In discrete time, τ1^disc and τ1^(1),disc are equivalent, and τ1^(2) ≤ (e/(e − 1)) τ1^(1),disc .

This will be (partially) proved in section 4.3.2, but let us first give a few
remarks and examples. The parameter τ1 and total variation distance are
closely related to the notion of coupling of Markov chains, discussed in Chap-
ter 14. Analogously (see the Notes), the separation s(t) and the parameter
τ1^(1) are closely related to the notion of strong stationary times Vi for which

Pi (X(Vi ) ∈ · |Vi = t) = π(·) for all t.                    (4.13)

Under our standing assumption of reversibility there is a close connection


between separation and variation distance, indicated by the next lemma.

Lemma 4.7 (a) d̄(t) ≤ s(t).
(b) s(2t) ≤ 1 − (1 − d̄(t))^2 .
Proof. Part (a) is immediate from the definitions. For (b),

pik (2t)/πk = Σ_j pij (t)pjk (t)/πk
           = Σ_j πj pij (t)pkj (t)/πj^2      by reversibility
           ≥ ( Σ_j (pij (t)pkj (t))^{1/2} )^2      by EZ ≥ (EZ^{1/2} )^2
           ≥ ( Σ_j min(pij (t), pkj (t)) )^2
           = (1 − ||Pi (Xt ∈ ·) − Pk (Xt ∈ ·)||)^2
           ≥ (1 − d̄(t))^2 . □
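Both parts of Lemma 4.7 can be spot-checked numerically; the following sketch reuses a reversible 3-state birth-and-death chain (our own assumed example, with π = (1/4, 1/2, 1/4)).

```python
# Spot-check of Lemma 4.7 on a reversible 3-state birth-and-death chain:
# dbar(t) <= s(t), and s(2t) <= 1 - (1 - dbar(t))^2.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = (0.25, 0.5, 0.25)

def power(P, t):
    R = [row[:] for row in P]
    for _ in range(t - 1):
        R = [[sum(R[i][k] * P[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    return R

def dbar(t):
    Pt = power(P, t)
    return max(0.5 * sum(abs(a - b) for a, b in zip(Pt[i], Pt[j]))
               for i in range(3) for j in range(3))

def sep(t):
    # s(t) = min{s : p_ij(t) >= (1-s)*pi_j for all i, j}
    Pt = power(P, t)
    return max(1.0 - Pt[i][j] / pi[j] for i in range(3) for j in range(3))

for t in (1, 2, 3):
    assert dbar(t) <= sep(t) + 1e-12
    assert sep(2 * t) <= 1.0 - (1.0 - dbar(t)) ** 2 + 1e-12
```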

Note also that the definition of s(t) involves lower bounds in the convergence
pij (t)/πj → 1. One can make a definition involving upper bounds

d̂(t) ≡ max_{i,j} (pij (t)/πj − 1) = max_i (pii (t)/πi − 1) ≥ 0                    (4.14)

where the equality (Chapter 3 Lemma yyy) requires in discrete time that t
be even. This yields the following one-sided inequalities, but Example 4.9
shows there can be no such reverse inequality.
Lemma 4.8 (a) 4||Pi (Xt ∈ ·) − π(·)||^2 ≤ pii (2t)/πi − 1, t ≥ 0.
(b) d(t) ≤ (1/2) (d̂(2t))^{1/2} , t ≥ 0.

Proof. Part (b) follows from part (a) and the definitions. Part (a) is essen-
tially just the “|| ||1 ≤ || ||2 ” inequality, but let’s write it out bare-hands.

4||Pi (Xt ∈ ·) − π(·)||^2 = ( Σ_j |pij (t) − πj | )^2
                         = ( Σ_j πj^{1/2} · (pij (t) − πj )/πj^{1/2} )^2
                         ≤ Σ_j (pij (t) − πj )^2 /πj      by Cauchy–Schwarz
                         = −1 + Σ_j pij^2 (t)/πj
                         = −1 + pii (2t)/πi      by Chapter 3 Lemma yyy.
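Lemma 4.8(a) is also easy to verify numerically; this sketch reuses the same assumed reversible 3-state chain as above.

```python
# Check of Lemma 4.8(a), 4*||P_i(X_t in .) - pi||^2 <= p_ii(2t)/pi_i - 1,
# for a reversible 3-state chain (an assumed small example).
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
pi = (0.25, 0.5, 0.25)

def power(P, t):
    R = [row[:] for row in P]
    for _ in range(t - 1):
        R = [[sum(R[i][k] * P[k][j] for k in range(3)) for j in range(3)]
             for i in range(3)]
    return R

for t in (1, 2, 3):
    Pt, P2t = power(P, t), power(P, 2 * t)
    for i in range(3):
        lhs = 4 * (0.5 * sum(abs(Pt[i][j] - pi[j]) for j in range(3))) ** 2
        assert lhs <= P2t[i][i] / pi[i] - 1 + 1e-12
```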
Example 4.9 Consider the continuous-time 3-state chain on states a, b, c
with transition rates qab = 1, qba = ε, qbc = 1, qcb = 1 (all other rates zero).
Here πa = ε/(2 + ε), πb = πc = 1/(2 + ε). It is easy to check that τ1 is bounded as
ε → 0. But paa (t) → e−t as ε → 0, and so by considering state a we have
d̂(t) → ∞ as ε → 0 for any fixed t.
Remark. In the nice examples discussed in Chapter 5 we can usually
find a pair of states (i0 , j0 ) such that
d̄(t) = ||Pi0 (Xt ∈ ·) − Pj0 (Xt ∈ ·)|| for all t.

The next example shows this is false in general.


Example 4.10 Consider random walk on the weighted graph

[Figure: vertices 0, 1, 2, 3 and a center vertex ∗; the outer edges among
0, 1, 2, 3 have weight 1, and each edge joining ∗ to an outer vertex has
weight ε]

for suitably small ε. As t → 0 we have 1 − d̄(t) ∼ cε t^2 , the max attained by
pairs (0, 2) or (1, 3). But as t → ∞ we have d̄(t) ∼ aε exp(−t/τ2 (ε)) where
τ2 (ε) = Θ(1/ε) and where the max is attained by pairs (i, ∗). □
As a final comment, one might wonder whether the minimizing distribu-
tion µ in the definition of τ1^(3) were always π, i.e. whether τ1^(3) = τ1^(2) always.
But a counter-example is provided by random walk on the n-star (Chapter
5 yyy) where τ1^(3) = 1 (by taking µ to be concentrated on the center vertex)
but τ1^(2) → 3/2.

4.3.2 Proof of Theorem 4.6


We will prove

Lemma 4.11 τ1 ≤ τ1^(1) ≤ 4τ1

Lemma 4.12 τ1^(3) ≤ τ1^(2) ≤ (e/(e − 1)) τ1^(1) .

Lemma 4.13 τ1^(4) ≤ 4τ1^(3)

Lemma 4.14 τ1^(5) ≤ τ1^(4)

These lemmas hold in discrete and continuous time, interpreting τ1 , τ1^(1)
as τ1^disc , τ1^(1),disc in discrete time. Incidentally, Lemmas 4.12, 4.13 and 4.14
do not depend on reversibility. To complete the proof of Theorem 4.6 in
continuous time we would need to show

τ1 ≤ K τ1^(5) in continuous time                    (4.15)

for some absolute constant K. The proof I know is too lengthy to repeat
here – see [6]. Note that (from its definition) τ1^(2) ≤ τ0 , so that (4.15) and
the lemmas above imply τ1 ≤ 2Kτ0 in continuous time. We shall instead
give a direct proof of a result weaker than (4.15):
Lemma 4.15 τ1 ≤ 66τ0 .
Turning to the assertions of Theorem 4.6 in discrete time, (b) is given by
the discrete-time versions of Lemmas 4.11 and 4.12. To prove (a), it is
enough to show that the numerical values of the parameters τ1^(2) − τ1^(5) are
unchanged by continuizing the discrete-time chain. For τ1^(5) and τ1^(4) this is
clear, because continuization doesn’t affect mean hitting times. For τ1^(3) and
τ1^(2) it reduces to the following lemma.

Lemma 4.16 Let Xt be a discrete-time chain and Yt be its continuization,


both started with the same distribution. Let T be a randomized stopping
time for Y . Then there exists a randomized stopping time T̂ for X such
that P (X(T̂ ) ∈ ·) = P (Y (T ) ∈ ·) and E T̂ = ET .

Proof of Lemma 4.11. The left inequality is immediate from Lemma
4.7(a), and the right inequality holds because

s(4τ1 ) ≤ 1 − (1 − d̄(2τ1 ))^2      by Lemma 4.7(b)
        ≤ 1 − (1 − e−2 )^2        by Lemma 4.5
        ≤ e−1 .

Proof of Lemma 4.12. The left inequality is immediate from the defini-
tions. For the right inequality, fix i. Write u = τ1^(1) , so that

pjk (u) ≥ (1 − e−1 )πk for all j, k.

We can construct a stopping time Ui ∈ {u, 2u, 3u, . . .} such that

Pi (XUi ∈ ·, Ui = u) = (1 − e−1 )π(·)

and then by induction on m such that

Pi (XUi ∈ ·, Ui = mu) = e−(m−1) (1 − e−1 )π(·), m ≥ 1.

Then Pi (XUi ∈ ·) = π(·) and Ei Ui = u(1 − e−1 )^{−1} . So τ1^(2) ≤ (1 − e−1 )^{−1} τ1^(1) .
Remark. What the argument shows is that we can construct a strong
stationary time Vi (in the sense of (4.13)) such that

Ei Vi = (1 − e−1 )^{−1} τ1^(1) .                    (4.16)
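The mean computed in this proof is a geometric-series calculation; the sketch below checks it by truncated summation.

```python
# The mean of the stopping time built in the proof of Lemma 4.12: U_i takes
# value m*u with probability e^{-(m-1)}(1 - e^{-1}), so E U_i / u =
# sum_m m e^{-(m-1)}(1 - e^{-1}) = (1 - e^{-1})^{-1}; check by truncated sum.
import math
q = math.exp(-1.0)
mass = sum(q ** (m - 1) * (1 - q) for m in range(1, 200))
mean = sum(m * q ** (m - 1) * (1 - q) for m in range(1, 200))
assert abs(mass - 1.0) < 1e-12          # the probabilities sum to 1
assert abs(mean - 1.0 / (1 - q)) < 1e-12
```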

Proof of Lemma 4.13. Consider the probability distribution µ attaining
the min in the definition of τ1^(3) , and the associated stopping times Ui . Fix
i. Since Pi (X(Ui ) ∈ ·) = µ(·),

Ei Tj ≤ Ei Ui + Eµ Tj ≤ τ1^(3) + Eµ Tj .

The random target lemma (4.7) says Σ_j Ei Tj πj = Σ_j Eµ Tj πj and so

Σ_j πj |Ei Tj − Eµ Tj | = 2 Σ_j πj (Ei Tj − Eµ Tj )^+ ≤ 2τ1^(3) .

Writing b(i) for the left sum, the definition of τ1^(4) and the triangle inequality
give τ1^(4) ≤ max_{i,k} (b(i) + b(k)), and the Lemma follows.
Proof of Lemma 4.14. Fix a subset A and a starting state i ∉ A. Then
for any j ∈ A,

Ei Tj = Ei TA + Eρ Tj

where ρ is the hitting place distribution Pi (XTA ∈ ·). So

π(A)Ei TA = Σ_{j∈A} πj Ei TA = Σ_{j∈A} πj (Ei Tj − Eρ Tj )
          ≤ max_k Σ_{j∈A} πj (Ei Tj − Ek Tj ) ≤ τ1^(4) .

Proof of Lemma 4.15. For small δ > 0 to be specified later, define

A = {j : Eπ Tj ≤ τ0 /δ}.

Note that Markov’s inequality and the definition of τ0 give

π(Ac ) = π{j : Eπ Tj > τ0 /δ} ≤ ( Σ_j πj Eπ Tj ) / (τ0 /δ) = τ0 / (τ0 /δ) = δ.                    (4.17)

Next, for any j

Eπ Tj = ∫_0^∞ ( pjj (s)/πj − 1 ) ds      by Chapter 2 Lemma yyy
      ≥ t ( pjj (t)/πj − 1 )           for any t

by monotonicity of pjj (t). Thus for j ∈ A we have

pjj (t)/πj − 1 ≤ Eπ Tj /t ≤ τ0 /(δt)

and applying Chapter 3 Lemma yyy (b)

pjk (t)/πk ≥ 1 − τ0 /(δt), j, k ∈ A.
Now let i be arbitrary and let k ∈ A. For any 0 ≤ s ≤ u,

Pi (Xu+t = k|TA = s)/πk ≥ min_{j∈A} Pj (Xu+t−s = k)/πk ≥ 1 − τ0 /(δ(u + t − s)) ≥ 1 − τ0 /(δt)

and so

pik (u + t)/πk ≥ (1 − τ0 /(δt))^+ Pi (TA ≤ u).                    (4.18)
Now

Pi (TA > u) ≤ Ei TA /u ≤ τ1^(5) /(uπ(A))

using Markov’s inequality and the definition of τ1^(5) . And τ1^(5) ≤ τ1^(4) ≤ 2τ0 ,
the first inequality being Lemma 4.14 and the second being an easy conse-
quence of the definitions. Combining (4.18) and the subsequent inequalities
shows that, for k ∈ A and arbitrary i

pik (u + t)/πk ≥ (1 − τ0 /(δt))^+ (1 − 2τ0 /(uπ(A)))^+ ≡ η, say.

Applying this to arbitrary i and j we get

d̄(u + t) ≤ 1 − ηπ(A) ≤ 1 − (1 − τ0 /(δt))^+ (π(A) − 2τ0 /u)^+
         ≤ 1 − (1 − τ0 /(δt))^+ (1 − δ − 2τ0 /u)^+      by (4.17).

Putting t = 49τ0 , u = 17τ0 , δ = 1/7 makes the bound = 305/833 < e−1 .
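The closing arithmetic can be reproduced in exact rational arithmetic, as a check that the chosen constants really push the bound below e−1.

```python
# The closing arithmetic of the proof, in exact fractions: with t = 49*tau0,
# u = 17*tau0 and delta = 1/7 the bound is 1 - (1 - 1/7)(1 - 1/7 - 2/17).
from fractions import Fraction
import math
delta = Fraction(1, 7)
term1 = 1 - Fraction(1, 7)               # 1 - tau0/(delta*t), t = 49*tau0
term2 = 1 - delta - Fraction(2, 17)      # 1 - delta - 2*tau0/u, u = 17*tau0
bound = 1 - term1 * term2
assert bound == Fraction(305, 833)
assert float(bound) < math.exp(-1)
```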
Remark. The ingredients of the proof above are complete monotonicity
and conditioning on carefully chosen hitting times. The proof of (4.15) in
[6] uses these ingredients, plus the minimal hitting time construction in the
recurrent balayage theorem (Chapter 2 yyy).
Outline proof of Lemma 4.16. The observant reader will have noticed
(Chapter 2 yyy) that we avoided writing down a careful definition of stopping
times in the continuous setting. The definition involves measure-theoretic is-
sues which I don’t intend to engage, and giving a rigorous proof of the lemma
is a challenging exercise in the measure-theoretic formulation of continuous-
time chains. However, the underlying idea is very simple. Regard the
chain Yt as constructed from the chain (X0 , X1 , X2 , . . .) and exponential(1)
holds (ξi ). Define T̂ = N (T ), where N (t) is the Poisson counting process
N (t) = max{m : ξ1 +. . .+ξm ≤ t}. Then X(T̂ ) = Y (T ) by construction and
E T̂ = ET by the optional sampling theorem for the martingale N (t) − t. 2

4.3.3 τ1 in discrete time, and algorithmic issues


Of course for period-2 chains we don’t have convergence to stationarity in
discrete time, so we regard τ1^disc = τ1^(1),disc = ∞. Such chains – random

walks on bipartite weighted graphs – include several simple examples of


unweighted graphs we will discuss in Chapter 5 (e.g. the n-path and n-cycle
for even n, and the d-cube) and Chapter 7 (e.g. card-shuffling by random
transpositions, if we insist on transposing distinct cards).
As mentioned in Chapter 1 xxx, a topic of much recent interest has
been “Markov Chain Monte Carlo”, where one constructs a discrete-time
reversible chain with specified stationary distribution π and we wish to use
the chain to sample from π. We defer systematic discussion to xxx, but a few
comments are appropriate here. We have to start a simulation somewhere.
In practice one might use as initial distribution some distribution which is
feasible to simulate and which looks intuitively “close” to π, but this idea
is hard to formalize and so in theoretical analysis we seek results which
hold regardless of the initial distribution, i.e. “worst-case start” results. In
(2)
this setting τ1 is, by definition, the minimum expected time to generate a
(2)
sample with distribution π. But the definition of τ1 merely says a stopping
time exists, and doesn’t tell us how to implement it algorithmically. For
algorithmic purposes we want rules which don’t involve detailed structure
of the chain. The most natural idea – stopping at a deterministic time –
requires one to worry unnecessarily about near-periodicity. One way to avoid
this worry is to introduce holds into the discrete-time chain, i.e. simulate
(P+I)/2 instead of P. As an alternative, the distribution of the continuized
chain at time t can be obtained by simply running the discrete-time chain
for a Poisson(t) number of steps. “In practice” there is little difference
between these alternatives. But the continuization method, as well as being
mathematically less artificial, allows us to avoid the occasional messiness
of discrete-time theory (see e.g. Proposition 4.29 below). In this sense our
use of τ1 for discrete-time chains as the value for continuous-time chains is
indeed sensible: it measures the accuracy of a natural algorithmic procedure
applied to a discrete-time chain.
Returning to technical matters, the fact that a periodic (reversible, by
our standing assumption) chain can only have period 2 suggests that the
discrete-time periodicity effect could be eliminated by averaging over times
t and t + 1 only, as follows.
Open Problem 4.17 Show there exist ψ(x) ↓ 0 as x ↓ 0 and φ(t) ∼ t as
t → ∞ such that, for any discrete-time chain,
max_i || ( Pi (Xt ∈ ·) + Pi (Xt+1 ∈ ·) )/2 − π(·) || ≤ ψ(d(φ(t))), t = 0, 1, 2, . . .
where d(·) refers to the continuized chain.

See the Notes for some comments on this problem.


If one does wish to study distributions of discrete-time chains at deter-
ministic times, then in place of τ2 one needs to use

β ≡ max(|λm | : 2 ≤ m ≤ n) = max(λ2 , −λn ). (4.19)

The spectral representation then implies

|Pi (Xt = i) − πi | ≤ β^t , t = 0, 1, 2, . . . .                    (4.20)
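The bound (4.20) can be checked on a two-state chain, where P^t is available in closed form (our own near-periodic example below, with negative second eigenvalue).

```python
# Check of (4.20) on a two-state chain, where P^t is closed-form:
# p_00(t) = pi_0 + pi_1*lam^t with lam = 1 - p - q, and beta = |lam|.
p, q = 0.3, 0.8                       # a near-periodic example: lam = -0.1
pi = (q / (p + q), p / (p + q))
lam = 1.0 - p - q
beta = abs(lam)
for t in range(6):
    p00 = pi[0] + pi[1] * lam ** t
    p11 = pi[1] + pi[0] * lam ** t
    assert abs(p00 - pi[0]) <= beta ** t + 1e-12
    assert abs(p11 - pi[1]) <= beta ** t + 1e-12
```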

4.3.4 τ1 and mean hitting times


In general τ1 may be much smaller than τ ∗ or τ0 . For instance, random walk
on the complete graph has τ0 ∼ n while τ1 → 1. So we cannot (without
extra assumptions) hope to improve much on the following result.

Lemma 4.18 For an n-state chain, in discrete or continuous time,

τ0 ≤ nτ1^(2)

τ ∗ ≤ 2τ1^(2) / min_j πj .
Lemmas 4.24 and 4.25 later are essentially stronger, giving corresponding
upper bounds in terms of τ2 instead of τ1 . But a proof of Lemma 4.18 is
interesting for comparison with the cat-and-mouse game below.
Proof of Lemma 4.18. By definition of τ1^(2) , for the chain started at i0
we can find stopping times U1 , U2 , . . . such that

E(Us+1 − Us |Xu , u ≤ Us ) ≤ τ1^(2)

(X(Us ); s ≥ 1) are independent with distribution π.

So Sj ≡ min{s : X(Us ) = j} has Ei0 Sj = 1/πj , and so

Ei0 Tj ≤ Ei0 USj ≤ τ1^(2) /πj

where the second inequality is justified below. The second assertion of the
lemma is now clear, and the first holds by averaging over j.
The second inequality is justified by the following martingale result,
which is a simple application of the optional sampling theorem. The “equal-
ity” assertion is sometimes called Wald’s equation for martingales.

Lemma 4.19 Let 0 = Y0 ≤ Y1 ≤ Y2 . . . be such that

E(Yi+1 − Yi |Yj , j ≤ i) ≤ c, i ≥ 0

for a constant c. Then for any stopping time T ,

EYT ≤ cET.

If in the hypothesis we replace “≤ c” by “= c”, then EYT = cET .

Cat-and-Mouse Game. Here is another variation on the type of game


described in Chapter 3 section yyy. Fix a graph. The cat starts at some
vertex vc and follows a continuous-time simple random walk. The mouse
starts at some vertex vm and is allowed an arbitrary strategy. Recall the
mouse can’t see the cat, so it must use a deterministic strategy, or a random
strategy independent of the cat’s moves. The mouse seeks to maximize
EM , the time until meeting. Write m∗ for the sup of EM over all starting
positions vc , vm and all strategies for the mouse. So m∗ just depends on the
graph. Clearly m∗ ≥ maxi,j Ei Tj , since the mouse can just stand still.

Open Problem 4.20 Does m∗ = maxi,j Ei Tj ? In other words, is it never


better to run than to hide?

Here’s a much weaker upper bound on m∗ . Consider for simplicity a regular


n-vertex graph. Then

m∗ ≤ e n τ1^(1) /(e − 1).                    (4.21)

Because as remarked at (4.16), we can construct a strong stationary time V
such that EV = e τ1^(1) /(e − 1) = c, say. So we can construct 0 = V0 < V1 < V2 . . .
such that
E(Vi+1 − Vi |Vj , j ≤ i) ≤ c, i ≥ 0

(X(Vi ), i ≥ 1) are independent with the uniform distribution π

(X(Vi ), i ≥ 1) are independent of (Vi , i ≥ 1).

So regardless of the mouse’s strategy, the cat has chance 1/n to meet the
mouse at time Vi , independently as i varies, so the meeting time M satisfies
M ≤ VT where T is a stopping time with mean n, and (4.21) follows from
Lemma 4.19. This topic will be pursued in Chapter 6 yyy.

4.3.5 τ1 and flows


Since discrete-time chains can be identified with random walks on weighted
graphs, relating properties of the chain to properties of “flows” on the graph
is a recurring theme. Thompson’s principle (Chapter 3 yyy) identified mean
commute times and mean hitting times from stationarity as infs over flows
of certain quantities. Sinclair [308] noticed that τ1 could be related to “mul-
ticommodity flow” issues, and we give a streamlined version of his result
(essentially Corollary 4.22) here. Recall from Chapter 3 section yyy the
general notation of a unit flow from a to π, and the special flow f^{a→π} in-
duced by the Markov chain.

Lemma 4.21 Consider a family f = (f (a) ), where, for each state a, f (a) is
a unit flow from a to the stationary distribution π. Define

ψ(f ) = max_{edges (i,j)} Σ_a πa |fij^(a) | / (πi pij )

in discrete time, and substitute qij for pij in continuous time. Let f^{a→π} be
the special flow induced by the chain. Then

ψ(f^{a→π} ) ≤ τ1^(4) ≤ ∆ψ(f^{a→π} )

where ∆ is the diameter of the transition graph.

Proof. We work in discrete time (the continuous case is similar). By Chapter
3 yyy

fij^{a→π} / (πi pij ) = (Zia − Zja ) / πa

and so

Σ_a πa |fij^{a→π} | / (πi pij ) = Σ_a |Zia − Zja |.

Thus

ψ(f^{a→π} ) = max_{edge (i,j)} Σ_a |Zia − Zja |.

The result now follows because by (4.12)

τ1^(4) = max_{i,k} Σ_a |Zia − Zka |

where i and k are not required to be neighbors. □


Using Lemmas 4.11 - 4.13 to relate τ1^(4) to τ1 , we can deduce a lower
bound on τ1 in terms of flows.

Corollary 4.22 τ1 ≥ ((e − 1)/16e) inf_f ψ(f ).

Unfortunately it seems hard to get analogous upper bounds. In particular,
it is not true that

τ1 = O( ∆ inf_f ψ(f ) ).

To see why, consider first random walk on the n-cycle (Chapter 5 Example
yyy). Here τ1 = Θ(n^2 ) and ψ(f^{a→π} ) = Θ(n), so the upper bound in Lemma
4.21 is the right order of magnitude, since ∆ = Θ(n). Now modify the
chain by allowing transitions between arbitrary pairs (i, j) with equal chance
o(n^{−3} ). The new chain will still have τ1 = Θ(n^2 ), and by considering the
special flow in the original chain we have inf_f ψ(f ) = O(n), but now the
diameter ∆ = 1.

4.4 The relaxation time τ2


The parameter τ2 is the relaxation time, defined in terms of the eigenvalue
λ2 (Chapter 3 section yyy) as

τ2 = (1 − λ2 )^{−1} in discrete time
   = λ2^{−1} in continuous time.

In Chapter 3 yyy we proved a lower bound for an n-state discrete-time chain:

    τ2 ≥ 1 − 1/n

which is attained by random walk on the complete graph. We saw in Chapter
3 Theorem yyy the extremal characterization

    τ2 = sup{ ||g||²_2 / E(g, g) : Σ_i π_i g(i) = 0 }.        (4.22)
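The sup in (4.22) is easy to probe numerically. The following sketch (our illustration, not part of the text; the 3-path walk is an arbitrary toy chain) computes τ2 from the eigenvalues of the symmetrized transition matrix and checks that Rayleigh quotients of centered test functions never exceed it:

```python
import numpy as np

# Hypothetical example: discrete-time random walk on the 3-path 0-1-2.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
pi = np.array([0.25, 0.5, 0.25])              # stationary distribution
assert np.allclose(pi @ P, pi)

# Reversibility makes D^{1/2} P D^{-1/2} symmetric; its eigenvalues are real.
S = np.diag(np.sqrt(pi)) @ P @ np.diag(1.0 / np.sqrt(pi))
lam = np.sort(np.linalg.eigvalsh(S))[::-1]    # 1 = lam[0] > lam[1] >= ...
tau2 = 1.0 / (1.0 - lam[1])                   # here lam[1] = 0, so tau2 = 1

def dirichlet(g):
    # E(g,g) = (1/2) sum_{i,j} pi_i p_ij (g(j) - g(i))^2
    diff = g[None, :] - g[:, None]
    return 0.5 * float(np.sum(pi[:, None] * P * diff ** 2))

# Rayleigh quotients of centered test functions never exceed tau2.
rng = np.random.default_rng(0)
for _ in range(100):
    g = rng.normal(size=3)
    g = g - (pi @ g)                          # center: sum_i pi_i g(i) = 0
    assert (pi @ g ** 2) / dirichlet(g) <= tau2 + 1e-9
```

The eigenvector attaining λ2 achieves the sup, so the inequality is sharp.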

The next three lemmas give inequalities between τ2 and the parameters
studied earlier in this chapter. Write π∗ ≡ min_i π_i.

Lemma 4.23 In continuous time,

    τ2 ≤ τ1 ≤ τ2 (1 + ½ log(1/π∗)).

In discrete time,

    τ1^(2) ≤ (4e/(e − 1)) τ2 (1 + ½ log(1/π∗)).
132CHAPTER 4. HITTING AND CONVERGENCE TIME, AND FLOW RATE, PARAMETER

Lemma 4.24 τ2 ≤ τ0 ≤ (n − 1)τ2 .


Lemma 4.25  τ∗ ≤ 2τ2/π∗.

Proof of Lemma 4.23. Consider first the continuous time case. By the
spectral representation, as t → ∞ we have p_ii(t) − π_i ∼ c_i exp(−t/τ2) with
some c_i ≠ 0. But by Lemma 4.5 we have |p_ii(t) − π_i| = O(exp(−t/τ1)). This
shows τ2 ≤ τ1. For the right inequality, the spectral representation gives

    p_ii(t) − π_i ≤ e^{−t/τ2}.        (4.23)

Recalling the definition (4.14) of d̂,

    d̄(t) ≤ 2 d(t)
         ≤ √( d̂(2t) )                      by Lemma 4.8(b)
         ≤ √( max_i e^{−2t/τ2}/π_i )       by (4.14) and (4.23)
         = π∗^{−1/2} e^{−t/τ2}

and the result follows. The upper bound on τ1^(2) holds in continuous time
by Lemmas 4.11 and 4.12, and so holds in discrete time because τ1^(2) and τ2
are unaffected by continuization.
Proof of Lemma 4.24. τ2 ≤ τ0 because τ2 is the first term in the eigentime
identity for τ0. For the other bound, Chapter 3 Lemma yyy gives the
inequality in

    τ0 = Σ_j π_j E_π T_j ≤ Σ_j (1 − π_j) τ2 = (n − 1)τ2.

Proof of Lemma 4.25. Fix states a, b such that E_a T_b + E_b T_a = τ∗ and
fix a function 0 ≤ g ≤ 1 attaining the sup in the extremal characterization
(Chapter 3 Theorem yyy), so that

    τ∗ = 1/E(g, g),   g(a) = 0,   g(b) = 1.

Write c = Σ_i π_i g(i). Applying the extremal characterization of τ2 to the
centered function g − c,

    τ2 ≥ ||g − c||²_2 / E(g − c, g − c) = var_π g(X0) / E(g, g) = τ∗ var_π g(X0).

But

    var_π g(X0) ≥ π_a c² + π_b (1 − c)²
                ≥ inf_{0≤y≤1} ( π_a y² + π_b (1 − y)² )
                = π_a π_b / (π_a + π_b)
                ≥ ½ min(π_a, π_b)
                ≥ π∗/2

establishing the lemma. 2
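As a numerical sanity check (ours) of Lemma 4.24, the sketch below computes τ0 for a small random reversible chain both from mean hitting times and from the eigentime identity τ0 = Σ_{m≥2} 1/(1 − λ_m), and verifies τ2 ≤ τ0 ≤ (n − 1)τ2. The 4-state weighted graph is an arbitrary assumption:

```python
import numpy as np

# Hypothetical example: random walk on a random 4-vertex weighted graph.
rng = np.random.default_rng(1)
n = 4
W = rng.random((n, n)); W = W + W.T; np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

# Mean hitting times: for target j, solve (I - P off j) h = 1 on states != j.
E = np.zeros((n, n))                       # E[i, j] = E_i T_j
for j in range(n):
    keep = [i for i in range(n) if i != j]
    A = np.eye(n - 1) - P[np.ix_(keep, keep)]
    E[keep, j] = np.linalg.solve(A, np.ones(n - 1))

tau0_hit = sum(pi[j] * (pi @ E[:, j]) for j in range(n))   # sum_j pi_j E_pi T_j
lam = np.sort(np.linalg.eigvalsh(
    np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))))[::-1]
tau2 = 1 / (1 - lam[1])
tau0_eig = np.sum(1 / (1 - lam[1:]))       # eigentime identity

assert np.isclose(tau0_hit, tau0_eig)
assert tau2 <= tau0_eig + 1e-9 and tau0_eig <= (n - 1) * tau2 + 1e-9
```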


Simple examples show that the bounds in these Lemmas cannot be much
improved in general. Specifically,
(a) on the complete graph (Chapter 5 Example yyy), τ0 = (n − 1)τ2 and
τ∗ = 2τ2/π∗;
(b) on the barbell (Chapter 5 Example yyy), τ2, τ1 and τ0 are asymptotic
to each other;
(c) in the M/M/1/n queue, τ1/τ2 = Θ(log 1/π∗) as n → ∞. 2
In the context of Lemma 4.23, if we want to relate τ1^disc itself to eigen-
values in discrete time we need to take almost-periodicity into account and
use β = max(λ2, −λn) in place of λ2. Rephrasing the proof of Lemma 4.23
gives

Lemma 4.26 In discrete time,

    ⌈ 1/log(1/β) ⌉ ≤ τ1^disc ≤ ⌈ (1 + ½ log(1/π∗)) / log(1/β) ⌉.

Regarding a discrete-time chain as random walk on a weighted graph, let ∆
be the diameter of the graph. By considering the definition of the variation
distance d̄(t) and initial vertices i, j at distance ∆, it is obvious that d̄(t) = 1
for t < ∆/2, and hence τ1^disc ≥ ⌈∆/2⌉. Combining with the upper bound in
Lemma 4.26 leads to a relationship between the diameter and the eigenvalues
of a weighted graph.
Corollary 4.27

    log(1/β) ≤ ( 2 + log(1/π∗) ) / ∆.

This topic will be discussed further in Chapter yyy.



4.4.1 Correlations and variances for the stationary chain


Perhaps the most natural probabilistic interpretation of τ2 is as follows.
Recall that the correlation between random variables Y, Z is

    cor(Y, Z) ≡ ( E(YZ) − (EY)(EZ) ) / √( var Y · var Z ).

For a stationary Markov chain define the maximal correlation function

    ρ(t) ≡ max_{h,g} cor( h(X0), g(Xt) ).

This makes sense for general chains (see Notes for further comments), but
under our standing assumption of reversibility we have

Lemma 4.28 In continuous time,

    ρ(t) = exp(−t/τ2),   t ≥ 0.

In discrete time,

    ρ(t) = β^t,   t ≥ 0

where β = max(λ2, −λn).

This is a translation of the Rayleigh-Ritz characterization of eigenvalues


(Chapter 3 yyy) – we leave the details to the reader.
Now consider a function g with E_π g(X0) = 0 and ||g||²_2 ≡ E_π g²(X0) > 0.
Write

    S_t ≡ ∫_0^t g(X_s) ds           in continuous time
    S_t ≡ Σ_{s=0}^{t−1} g(X_s)      in discrete time.

Recall from Chapter 2 yyy that for general chains there is a limit variance
σ 2 = limt→∞ t−1 var St . Reversibility gives extra qualitative and quantita-
tive information. The first result refers to the stationary chain.

Proposition 4.29 In continuous time, t^{−1} var_π S_t ↑ σ², where

    0 < σ² ≤ 2τ2 ||g||²_2.

And A(t/τ2) t σ² ≤ var_π S_t ≤ t σ², where

    A(u) ≡ ∫_0^u (1 − s/u) e^{−s} ds = 1 − u^{−1}(1 − e^{−u}) ↑ 1  as u ↑ ∞.

In discrete time,

    t^{−1} var_π S_t → σ² ≤ 2τ2 ||g||²_2

    σ² t (1 − 2τ2/t) ≤ var_π S_t ≤ σ² t + ||g||²_2

and so in particular

    var_π S_t ≤ t ||g||²_2 (2τ2 + 1/t).        (4.24)
Proof. Consider first the continuous time case. A brief calculation using the
spectral representation (Chapter 3 yyy) gives

    E_π g(X0)g(Xt) = Σ_{m≥2} g_m² e^{−λ_m t}        (4.25)

where g_m = Σ_i π_i^{1/2} u_im g(i). So

    t^{−1} var_π S_t = t^{−1} ∫_0^t ∫_0^t E_π g(X_u)g(X_s) ds du
                     = 2 t^{−1} ∫_0^t (t − s) E_π g(X0)g(X_s) ds
                     = 2 ∫_0^t (1 − s/t) Σ_{m≥2} g_m² e^{−λ_m s} ds        (4.26)
                     = 2 Σ_{m≥2} (g_m²/λ_m) A(λ_m t)        (4.27)

by change of variables in the integral defining A(u). The right side increases
with t to

    σ² ≡ 2 Σ_{m≥2} g_m²/λ_m,        (4.28)

and the sum here is at most Σ_{m≥2} g_m²/λ2 = ||g||²_2 τ2. On the other hand,
A(·) is increasing, so

    t^{−1} var_π S_t ≥ 2 Σ_{m≥2} (g_m²/λ_m) A(λ2 t) = σ² A(t/τ2).

In discrete time the arguments are messier, and we will omit details of
calculations. The analog of (4.26) becomes

    t^{−1} var_π S_t = Σ_{s=−(t−1)}^{t−1} (1 − |s|/t) Σ_{m≥2} g_m² λ_m^{|s|}.

In place of the change of variables argument for (4.27), one needs an
elementary calculation to get

    t^{−1} var_π S_t = 2 Σ_{m≥2} ( g_m²/(1 − λ_m) ) B(λ_m, t)        (4.29)

    where B(λ, t) = (1 + λ)/2 − λ(1 − λ^t) / ( t(1 − λ) ).

This shows

    t^{−1} var_π S_t → σ² ≡ Σ_{m≥2} g_m² (1 + λ_m)/(1 − λ_m)

and the sum is bounded above by

    ( (1 + λ2)/(1 − λ2) ) Σ_{m≥2} g_m² = ( (1 + λ2)/(1 − λ2) ) ||g||²_2 ≤ 2τ2 ||g||²_2.

Next, rewrite (4.29) as

    var_π S_t − σ² t = −2 Σ_{m≥2} g_m² λ_m (1 − λ_m^t) / (1 − λ_m)².

Then the upper bound for var_π S_t follows by checking

    inf_{−1≤λ<1} λ(1 − λ^t)/(1 − λ)² ≥ −1/2.

For the lower bound, one has to verify

    sup_{−1≤λ≤λ2} 2λ(1 − λ^t) / ( (1 − λ)(1 + λ) )  is attained at λ2 (and equals C, say)

where in the sequel we may assume λ2 > 0. Then

    B(λ_m, t) ≥ ( (1 + λ_m)/2 )(1 − C/t),   m ≥ 2

and so

    t^{−1} var_π S_t ≥ σ² (1 − C/t).

But

    C/t = 2λ2(1 − λ2^t) / ( t(1 − λ2)(1 + λ2) ) ≤ 2/( t(1 − λ2) ) = 2τ2/t

giving the lower bound. 2
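The discrete-time bound (4.24) can be checked exactly on a small chain, since var_π S_t = t c_0 + 2 Σ_{s=1}^{t−1} (t − s) c_s with c_s = cov(g(X0), g(X_s)). A sketch of ours (the 5-state chain is a toy choice):

```python
import numpy as np

# Hypothetical example: exact variance of S_t = g(X_0) + ... + g(X_{t-1})
# for a small reversible chain, compared against the bound (4.24).
rng = np.random.default_rng(2)
n, t = 5, 50
W = rng.random((n, n)); W = W + W.T; np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
lam = np.sort(np.linalg.eigvalsh(
    np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))))[::-1]
tau2 = 1 / (1 - lam[1])

g = rng.normal(size=n)
g = g - pi @ g                          # center: E_pi g(X_0) = 0
norm2 = float(pi @ g ** 2)              # ||g||_2^2

# Stationary covariances cov(g(X_0), g(X_s)) = sum_i pi_i g(i) (P^s g)(i)
cov = []
Psg = g.copy()
for s in range(t):
    cov.append(float(pi @ (g * Psg)))
    Psg = P @ Psg

# var_pi S_t = t c_0 + 2 sum_{s=1}^{t-1} (t - s) c_s
var_St = t * cov[0] + 2 * sum((t - s) * cov[s] for s in range(1, t))
assert var_St <= t * norm2 * (2 * tau2 + 1 / t) + 1e-9
```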

Note that even in discrete time it is τ2 that matters in Proposition 4.29.
Eigenvalues near −1 are irrelevant, except that for a periodic chain we have
σ = 0 for one particular function g (which?).

Continuing the study of S_t ≡ ∫_0^t g(X_s) ds, or its discrete analog, for a
stationary chain, standardize to the case where E_π g(X0) = 0, E_π g²(X0) = 1.


Proposition 4.29 provides finite-time bounds for the asymptotic approxima-
tion of variance. One would like a similar finite-time bound for the asymp-
totic Normal approximation of the distribution of St .

Open Problem 4.30 Is there some explicit function ψ(b, s) → 0 as s → ∞,
not depending on the chain, such that for standardized g and continuous-
time chains,

    sup_x | P_π( S_t/(σ t^{1/2}) ≤ x ) − P(Z ≤ x) | ≤ ψ( ||g||_∞, t/τ2 )

where ||g||_∞ ≡ max_i |g(i)| and Z has Normal(0, 1) distribution?

See the Notes for further comments. For the analogous result about large
deviations see Chapter yyy.

4.4.2 Algorithmic issues


Suppose we want to estimate the average ḡ ≡ Σ_i π_i g(i) of a function g
defined on the state space. If we could sample i.i.d. from π we would need
order ε^{−2} samples to get an estimator with error about ε (var_π g)^{1/2}. Now
consider the setting where we cannot directly sample from π but instead use
the "Markov Chain Monte Carlo" method of setting up a reversible chain with
stationary distribution π. How many steps of the chain do we need to get
the same accuracy? As in section 4.3.3, because we typically can't quantify
the closeness to π of a feasible initial distribution, we consider bounds which
hold for arbitrary initial states. In assessing the number of steps required,
there are two opposite traps to avoid. The first is to say (cf. Proposition
4.29) that ε^{−2} τ2 steps suffice. This is wrong because the relaxation time
bounds apply to the stationary chain and cannot be directly applied to a
non-stationary chain. The second trap is to say that because it takes Θ(τ1)
steps to obtain one sample from the stationary distribution, we therefore
need order ε^{−2} τ1 steps in order to get ε^{−2} independent samples. This is
wrong because we don't need independent samples. The correct answer is
order (τ1 + ε^{−2} τ2) steps. The conceptual idea (cf. the definition of τ1^(2))
is to find a stopping time achieving distribution π and use it as an initial

state for simulating the stationary chain. More feasible to implement is the
following algorithm.
Algorithm. For a specified real number t1 > 0 and an integer m2 ≥ 1,
generate M(t1) with Poisson(t1) distribution. Simulate the chain X_t from an
arbitrary initial distribution for M(t1) + m2 − 1 steps and calculate

    A(g, t1, m2) ≡ (1/m2) Σ_{t=M(t1)}^{M(t1)+m2−1} g(X_t).

Corollary 4.31

    P( |A(g, t1, m2) − ḡ| > ε ||g||_2 ) ≤ s(t1) + (2τ2 + 1/m2) / (ε² m2)

where s(t) is separation (recall section 4.3.1) for the continuized chain.

To make the right side approximately δ we may take

    t1 = τ1^(1) ⌈log(2/δ)⌉;   m2 = ⌈ 4τ2/(ε²δ) ⌉.

Since the mean number of steps is t1 + m2 − 1, this formalizes the idea that
we can estimate ḡ to within ε||g||2 in order (τ1 + ε−2 τ2 ) steps.
xxx if don’t know tau‘s
Proof. We may suppose ḡ = 0. Since X_{M(t1)} has the distribution of the
continuized chain at time t1, we may use the definition of s(t1) to write

    P( X_{M(t1)} ∈ · ) = (1 − s(t1)) π + s(t1) ρ

for some probability distribution ρ. It follows that

    P( |A(g, t1, m2)| > ε ||g||_2 ) ≤ s(t1) + P_π( | (1/m2) Σ_{t=0}^{m2−1} g(X_t) | > ε ||g||_2 )

                                   ≤ s(t1) + ( 1/(m2² ε² ||g||²_2) ) var_π( Σ_{t=0}^{m2−1} g(X_t) ).

Apply (4.24).
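The Algorithm can be sketched in a few lines. Everything below is an illustrative assumption of ours (a lazy walk on the 4-path as the chain, and a particular g); it is not the text's example:

```python
import numpy as np

def estimate_gbar(P, g, t1, m2, x0=0, rng=None):
    """Burn in for a Poisson(t1) number of steps M, then average g over
    the next m2 positions X_M, ..., X_{M+m2-1}."""
    if rng is None:
        rng = np.random.default_rng()
    M = rng.poisson(t1)
    x, samples = x0, []
    for s in range(M + m2):
        if s >= M:
            samples.append(g[x])
        x = rng.choice(len(g), p=P[x])
    return float(np.mean(samples))

# Hypothetical example: lazy random walk on the 4-path (lazy to kill periodicity).
P = 0.5 * np.eye(4) + 0.5 * np.array([[0.0, 1.0, 0.0, 0.0],
                                      [0.5, 0.0, 0.5, 0.0],
                                      [0.0, 0.5, 0.0, 0.5],
                                      [0.0, 0.0, 1.0, 0.0]])
pi = np.array([1, 2, 2, 1]) / 6.0       # stationary distribution
g = np.array([0.0, 1.0, 2.0, 3.0])
gbar = float(pi @ g)                    # = 1.5
est = estimate_gbar(P, g, t1=20, m2=50000, rng=np.random.default_rng(3))
# est should agree with gbar to within a few hundredths
```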

4.4.3 τ2 and distinguished paths


The extremal characterization (4.22) can be used to get lower bounds on τ2
by considering a tractable test function g. (xxx list examples). As mentioned
in Chapter 3, it is an open problem to give an extremal characterization of τ2
as exactly an inf over flows or similar constructs. As an alternative, Theorem
4.32 gives a non-exact upper bound on τ2 involving quantities derived from
arbitrary choices of paths between states. An elegant exposition of this idea,
expressed by the first inequality in Theorem 4.32, was given by Diaconis and
Stroock [124], and Sinclair [308] noted the second inequality. We copy their
proofs.
We first state the result in the setting of random walk on a weighted
graph. As in section 4.1, consider a path x = i0, i1, . . . , im = y, and call this
path γ_xy. This path has length |γ_xy| = m and has "resistance"

    r(γ_xy) ≡ Σ_{e∈γ_xy} 1/w_e

where here and below e denotes a directed edge.

Theorem 4.32 For each ordered pair (x, y) of vertices in a weighted graph,
let γ_xy be a path from x to y. Then for discrete-time random walk,

    τ2 ≤ w max_e Σ_x Σ_y π_x π_y r(γ_xy) 1(e∈γ_xy)

    τ2 ≤ w max_e (1/w_e) Σ_x Σ_y π_x π_y |γ_xy| 1(e∈γ_xy).

Note that the two inequalities coincide on an unweighted graph.


Proof. For an edge e = (i, j) write ∆g(e) = g(j) − g(i). The first
equality below is the fact 2 var(Y1) = E(Y1 − Y2)² for i.i.d. Y's, and the
first inequality is Cauchy-Schwarz.

    2||g||²_2 = Σ_x Σ_y π_x π_y ( g(y) − g(x) )²

             = Σ_x Σ_y π_x π_y ( Σ_{e∈γ_xy} ∆g(e) )²

             = Σ_x Σ_y π_x π_y r(γ_xy) ( Σ_{e∈γ_xy} ( w_e r(γ_xy) )^{−1/2} √w_e ∆g(e) )²        (4.30)

             ≤ Σ_x Σ_y π_x π_y r(γ_xy) Σ_{e∈γ_xy} w_e (∆g(e))²        (4.31)

             = Σ_x Σ_y π_x π_y r(γ_xy) Σ_e w_e (∆g(e))² 1(e∈γ_xy)

             ≤ κ Σ_e w_e (∆g(e))² = κ 2w E(g, g)        (4.32)

where κ is the max in the first inequality in the statement of the Theorem.
The first inequality now follows from the extremal characterization (4.22).
The second inequality makes a simpler use of the Cauchy-Schwarz inequality,
in which we replace (4.30,4.31,4.32) by

    = Σ_x Σ_y π_x π_y ( Σ_{e∈γ_xy} 1 · ∆g(e) )²

    ≤ Σ_x Σ_y π_x π_y |γ_xy| Σ_{e∈γ_xy} (∆g(e))²        (4.33)

    ≤ κ′ Σ_e w_e (∆g(e))² = κ′ 2w E(g, g)

where κ′ is the max in the second inequality in the statement of the Theorem.
Remarks. (a) Theorem 4.32 applies to continuous-time (reversible) chains
by setting w_ij = π_i q_ij.
(b) One can replace the deterministic choice of paths γ_xy by random
paths Γ_xy of the form x = V0, V1, . . . , VM = y of random length M = |Γ_xy|.
The second inequality extends in the natural way, by taking expectations in
(4.33) to give

    ≤ Σ_x Σ_y π_x π_y Σ_e E( |Γ_xy| 1(e∈Γ_xy) ) (∆g(e))²,

and the conclusion is


Corollary 4.33

    τ2 ≤ w max_e (1/w_e) Σ_x Σ_y π_x π_y E( |Γ_xy| 1(e∈Γ_xy) ).

(c) Inequalities in the style of Theorem 4.32 are often called Poincaré
inequalities because, to quote [124], they are "the discrete analog of the
classical method of Poincaré for estimating the spectral gap of the Laplacian
on a domain (see e.g. Bandle [39])". I prefer the descriptive name the
distinguished path method. This method has the same spirit as the coupling
method for bounding τ1 (see Chapter yyy), in that we get to use our skill
and judgement in making wise choices of paths in specific examples. xxx
list examples. Though its main utility is in studying hard examples, we give
some simple illustrations of its use below.
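Here is a small brute-force illustration (ours) of the distinguished path method: random walk on the unweighted n-cycle, with γ_xy a shortest arc. On an unweighted graph the two inequalities of Theorem 4.32 coincide, and the computed bound has the same Θ(n²) order as τ2:

```python
import numpy as np

# Hypothetical example: Theorem 4.32 on the unweighted n-cycle, taking
# gamma_xy to be (one of) the shortest arcs from x to y.
n = 12
pi = np.full(n, 1.0 / n)
w = 2 * n                                 # total weight = # directed edges

def arc(x, y):
    """Directed edges of a shortest arc from x to y (ties broken clockwise)."""
    d = (y - x) % n
    step = 1 if d <= n // 2 else -1
    path, v = [], x
    while v != y:
        path.append((v, (v + step) % n))
        v = (v + step) % n
    return path

# For each directed edge e: sum_{x,y} pi_x pi_y |gamma_xy| 1(e in gamma_xy)
load = {}
for x in range(n):
    for y in range(n):
        if x != y:
            p = arc(x, y)
            for e in p:
                load[e] = load.get(e, 0.0) + pi[x] * pi[y] * len(p)
kappa = max(load.values())
bound = w * kappa                         # w_e = 1 on an unweighted graph

tau2_exact = 1.0 / (1.0 - np.cos(2 * np.pi / n))   # lambda_2 = cos(2 pi / n)
assert tau2_exact <= bound                # both sides are Theta(n^2)
```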
Write the conclusion of Corollary 4.33 as τ2 ≤ w max_e (1/w_e) F(e). Consider
a regular unweighted graph, and let Γ_xy be chosen uniformly from the set
of minimum-length paths from x to y. Suppose that F(e) takes the same
value F for every directed edge e. A sufficient condition for this is that the
graph be arc-transitive (see Chapter 8 yyy). Then, summing over edges in
Corollary 4.33,

    τ2 |E⃗| ≤ w Σ_e Σ_x Σ_y π_x π_y E |Γ_xy| 1(e∈Γ_xy) = w Σ_x Σ_y π_x π_y E |Γ_xy|²

where |E⃗| is the number of directed edges. Now w = |E⃗|, so we may
reinterpret this inequality as follows.

Corollary 4.34 For random walk on an arc-transitive graph, τ2 ≤ ED²,
where D = d(ξ1, ξ2) is the distance between independent uniform random
vertices ξ1, ξ2.
In the context of the d-dimensional torus Z_N^d, the upper bound is asymptotic
(as N → ∞) to N² E( Σ_{i=1}^d U_i )², where the U_i are independent uniform on
[0, 1/2]. This bound is asymptotic to d(d + 1/3)N²/16. Here (Chapter 5
Example yyy) in fact τ2 ∼ dN²/(2π²), so for fixed d the bound is off by only
a constant. On the d-cube (Chapter 5 Example yyy), D has Binomial(d, 1/2)
distribution and so the bound is ED² = (d² + d)/4, while in fact τ2 = d/2.
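The d-cube arithmetic quoted above is easy to confirm (a check of ours):

```python
from math import comb

# D ~ Binomial(d, 1/2), so E D^2 = var + mean^2 = d/4 + d^2/4,
# which dominates tau2 = d/2 for every d >= 1.
for d in range(1, 10):
    ED2 = sum(comb(d, k) * 0.5 ** d * k ** 2 for k in range(d + 1))
    assert abs(ED2 - (d * d + d) / 4) < 1e-9
    assert d / 2 <= ED2
```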
Intuitively one feels that the bound in Corollary 4.34 should hold for
more general graphs, but the following example illustrates a difficulty.

Example 4.35 Consider the graph on n = 2m vertices obtained from two


complete graphs on m vertices by adding m edges comprising a matching of
the two vertex-sets.

Here a straightforward implementation of Theorem 4.32 gives an upper


bound of 2m, while in fact τ2 = m/2. On the other hand the conclusion
of Corollary 4.34 would give an O(1) bound. Thus even though this exam-
ple has a strong symmetry property (vertex-transitivity: Chapter 8 yyy) no
bound like Corollary 4.34 can hold.

4.5 The flow parameter τc


In this section it’s convenient to work in continuous time, but the numerical
quantities involved here are unchanged by continuization.

4.5.1 Definition and easy inequalities


Define

    τc = sup_A π(A)π(Ac) / Q(A, Ac)        (4.34)

where

    Q(A, Ac) ≡ Σ_{i∈A} Σ_{j∈Ac} π_i q_ij

and where such sups are always over proper subsets A of states. This param-
eter can be calculated exactly in only very special cases, where the following
lemma is helpful.

Lemma 4.36 The sup in (4.34) is attained by some split {A, Ac } in which
both A and Ac are connected (as subsets of the graph of permissible transi-
tions).

Proof. Consider a split {A, Ac} in which A is the union of m ≥ 2 connected
components (B_i). Write γ = min_i Q(B_i, B_i^c) / ( π(B_i)π(B_i^c) ). Then

    Q(A, Ac) = Σ_i Q(B_i, B_i^c)
             ≥ γ Σ_i π(B_i)π(B_i^c)
             = γ Σ_i ( π(B_i) − π²(B_i) )
             = γ ( π(A) − Σ_i π²(B_i) )

and so

    Q(A, Ac) / ( π(A)π(Ac) ) ≥ γ · ( π(A) − Σ_i π²(B_i) ) / ( π(A) − π²(A) ).

But for m ≥ 2 we have Σ_i π²(B_i) < ( Σ_i π(B_i) )² = π²(A), which implies
Q(A, Ac) / ( π(A)π(Ac) ) > γ. 2
4.5. THE FLOW PARAMETER τC 143

To see how τc arises, note that the extremal characterization of τ2 (4.22)
applied to g = 1_A implies

    π(A)π(Ac) / Q(A, Ac) ≤ τ2

for any subset A. But much more is true: Chapter 3 yyy may be rephrased
as follows. For any subset A,

    π(A)π(Ac)/Q(A, Ac) ≤ π(A) E_π T_A / π(Ac) ≤ π(A) E_{αA} T_A ≤ τ2

where αA is the quasistationary distribution on Ac defined at Chapter 3 yyy.
So taking sups gives

Corollary 4.37

    τc ≤ sup_A π(A) E_π T_A / π(Ac) ≤ sup_A π(A) E_{αA} T_A ≤ τ2.

In a two-state chain these inequalities all become equalities. This seems a
good justification for our choice of definition of τc, instead of the alternative

    sup_{A: π(A)≤1/2} π(A) / Q(A, Ac)

which has been used in the literature but which would introduce a spurious
factor of 2 into the inequality τc ≤ τ2.
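For a chain small enough to enumerate subsets, τc can be computed by brute force directly from definition (4.34); the sketch below (ours, on an arbitrary 6-state reversible chain) also confirms the inequality τc ≤ τ2 from Corollary 4.37:

```python
import numpy as np
from itertools import combinations

# Hypothetical example: brute-force tau_c over all proper subsets A of a
# small reversible chain (continuized: q_ij = p_ij for i != j).
rng = np.random.default_rng(4)
n = 6
W = rng.random((n, n)); W = W + W.T; np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()
Q = pi[:, None] * P                      # stationary flow rates pi_i q_ij

tau_c = 0.0
for r in range(1, n):
    for A in combinations(range(n), r):
        A = list(A)
        Ac = [i for i in range(n) if i not in A]
        flow = Q[np.ix_(A, Ac)].sum()    # Q(A, A^c)
        tau_c = max(tau_c, pi[A].sum() * pi[Ac].sum() / flow)

lam = np.sort(np.linalg.eigvalsh(
    np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))))[::-1]
tau2 = 1 / (1 - lam[1])                  # unchanged by continuization
assert 0 < tau_c <= tau2 + 1e-9          # Corollary 4.37
```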
Lemma 4.39 below shows that the final inequality of Corollary 4.37 can
be reversed. In contrast, on the n-cycle τc = Θ(n) whereas the other quan-
tities in Corollary 4.37 are Θ(n²). This shows that the "square" in Theorem
4.40 below cannot be omitted in general. It also suggests the following
question (cf. τ1 and τ1^(5)).

Open Problem 4.38 Does there exist a constant K such that

    τ2 ≤ K sup_A π(A) E_π T_A / π(Ac)

for every reversible chain?

A positive answer would provide, via Chapter 3 yyy, a correct order-of-


magnitude extremal characterization of τ2 in terms of flows.

Lemma 4.39

    τ2 ≤ sup_{A: π(A)≥1/2} E_{αA} T_A

and so in particular

    τ2 ≤ 2 sup_A π(A) E_{αA} T_A.

Proof. τ2 = ||h||²_2 / E(h, h) for the eigenvector h associated with λ2. Put

    A = {x : h(x) ≤ 0}

and assume π(A) ≥ 1/2, by replacing h by −h if necessary. Write h+ =
max(h, 0). We shall show

    τ2 ≤ ||h+||²_2 / E(h+, h+)        (4.35)

and then the extremal characterization Chapter 3 yyy

    E_{αA} T_A = sup{ ||g||²_2 / E(g, g) : g ≥ 0, g = 0 on A }        (4.36)

implies τ2 ≤ E_{αA} T_A for this specific A.


The proof of (4.35) requires us to delve slightly further into the calculus
of Dirichlet forms. Write P_t f for the function (P_t f)(i) = E_i f(X_t) and write
⟨f, g⟩ for the inner product Σ_i π_i f(i)g(i). Write ∂(·) for (d/dt)(·)|_{t=0}. Then

    ∂⟨f, P_t g⟩ = −E(f, g)

where

    E(f, g) = ½ Σ_i Σ_j ( f(j) − f(i) )( g(j) − g(i) ) π_i q_ij.

Now consider ∂⟨h+, P_t h⟩. On the one hand

    ∂⟨h+, P_t h⟩ = −E(h+, h) ≤ −E(h+, h+)

where the inequality follows from the inequality (a+ − b+)² ≤ (a+ − b+)(a − b)
for real a, b. On the other hand, ⟨h+, h⟩ ≤ ⟨h+, h+⟩ = ||h+||²_2, and the
eigenvector h satisfies ∂(P_t h) = −λ2 h, so

    ∂⟨h+, P_t h⟩ ≥ −λ2 ||h+||²_2.

Combining these inequalities leads to (4.35).



4.5.2 Cheeger-type inequalities


A lot of attention has been paid to reverse inequalities which upper bound
τ2 in terms of τc or related “flow rate” parameters. Motivation for such
results will be touched upon in Chapter yyy. The prototype for such results
is

Theorem 4.40 (Cheeger’s inequality) τ2 ≤ 8 q∗ τc², where q∗ ≡ max_i q_i.

This result follows by combining Lemma 4.39 above with Lemma 4.41 below.
In discrete time these inequalities hold with q ∗ deleted (i.e. replaced by 1),
by continuization. Our treatment of Cheeger’s inequality closely follows
Diaconis and Stroock [124] – see Notes for more history.

Lemma 4.41 For any subset A,

    E_{αA} T_A ≤ 2 q∗ τc² / π²(A).

Proof. Fix A and g with g ≥ 0 and g = 0 on A.

    ( Σ_{x≠y} |g²(x) − g²(y)| π_x q_xy )²

    ≤ Σ_{x≠y} ( g(x) − g(y) )² π_x q_xy  ×  Σ_{x≠y} ( g(x) + g(y) )² π_x q_xy
          by the Cauchy-Schwarz inequality

    = 2 E(g, g) Σ_{x≠y} ( g(x) + g(y) )² π_x q_xy

    ≤ 4 E(g, g) Σ_{x≠y} ( g²(x) + g²(y) ) π_x q_xy

    = 8 E(g, g) Σ_x π_x q_x g²(x)

    ≤ 8 q∗ E(g, g) ||g||²_2.

On the other hand

    Σ_{x≠y} |g²(x) − g²(y)| π_x q_xy

    = 2 Σ_{g(x)>g(y)} ( g²(x) − g²(y) ) π_x q_xy

    = 4 Σ_{g(x)>g(y)} ( ∫_{g(y)}^{g(x)} t dt ) π_x q_xy

    = 4 ∫_0^∞ t ( Σ_{g(y)≤t<g(x)} π_x q_xy ) dt

    = 4 ∫_0^∞ t Q(B_t, B_t^c) dt        where B_t ≡ {x : g(x) > t}

    ≥ 4 ∫_0^∞ t π(B_t) π(B_t^c) / τc dt        by definition of τc

    ≥ 4 ∫_0^∞ t π(B_t) π(A) / τc dt        because g = 0 on A

    = 2 π(A) ||g||²_2 / τc.

Rearranging,

    ||g||²_2 / E(g, g) ≤ 2 q∗ τc² / π²(A)

and the assertion of the lemma follows from the extremal characterization
(4.36) of E_{αA} T_A.

4.5.3 τc and hitting times


Lemma 4.25 and Theorem 4.40 imply a bound on τ ∗ in terms of τc . But a
direct argument, using ideas similar to those in the proof of Lemma 4.41,
does better.
Proposition 4.42

    τ∗ ≤ ( 4(1 + log n) / min_j π_j ) τc.

Example 4.43 below will show that the log term cannot be omitted. Compare
with graph-theoretic bounds in Chapter 6 section yyy.
Proof. Fix states a, b. We want to use the extremal characterization
(Chapter 3 yyy). So fix a function 0 ≤ g ≤ 1 with g(a) = 0, g(b) = 1. Order
the states as a = 1, 2, 3, . . . , n = b so that g(·) is increasing.

    E(g, g) = Σ_{i<k} π_i q_ik ( g(k) − g(i) )²

            ≥ Σ_{i<k} π_i q_ik Σ_{i≤j<k} ( g(j + 1) − g(j) )²

            = Σ_j ( g(j + 1) − g(j) )² Q(A_j, A_j^c),        where A_j ≡ [1, j]

            ≥ Σ_j ( g(j + 1) − g(j) )² π(A_j)π(A_j^c) / τc        (4.37)

But

    1 = Σ_j ( g(j+1) − g(j) )
      = Σ_j ( g(j+1) − g(j) ) · ( π^{1/2}(A_j) π^{1/2}(A_j^c) / τc^{1/2} ) · ( τc^{1/2} / ( π^{1/2}(A_j) π^{1/2}(A_j^c) ) ).

So by Cauchy-Schwarz and (4.37)

    1 ≤ τc E(g, g) Σ_j 1 / ( π(A_j) π(A_j^c) ).        (4.38)

But π(A_j) ≥ j π∗, where π∗ ≡ min_i π_i, so

    Σ_{j: π(A_j)≤1/2} 1/( π(A_j)π(A_j^c) ) ≤ Σ_j 2/(j π∗) ≤ (2/π∗)(1 + log n).

The same bound holds for the sum over {j : π(A_j) ≥ 1/2}, so applying
(4.38) we get

    1/E(g, g) ≤ τc (4/π∗)(1 + log n)

and the Proposition follows from the extremal characterization.

Example 4.43 Consider the weighted linear graph with loops on vertices
{0, 1, 2, . . . , n − 1}, with edge-weights

    w_{i−1,i} = i,  1 ≤ i ≤ n − 1;    w_ii = 2n − i·1(i≠0) − (i + 1)·1(i≠n−1).

This gives vertex-weights w_i = 2n, and so the stationary distribution is
uniform. By the commute interpretation of resistance,

    τ∗ = E_0 T_{n−1} + E_{n−1} T_0 = w r_{0,n−1} = 2n² Σ_{i=1}^{n−1} 1/i ∼ 2n² log n.

Using Lemma 4.36, the value of τc is attained by a split of the form
{[0, j], [j + 1, n − 1]}, and a brief calculation shows that the maximizing
value is j = 0 and gives

    τc = 2(n − 1).

So in this example, the bound in Proposition 4.42 is sharp up to the
numerical constant.
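A numerical check (ours) of the computations in Example 4.43, including the bound of Proposition 4.42 with π∗ = 1/n:

```python
import numpy as np

# Weighted linear graph with loops: w_{i-1,i} = i, vertex-weights all 2n.
n = 30
w_edge = np.arange(1, n)                 # w_{i-1,i} = i for 1 <= i <= n-1
w_total = 2 * n * n                      # each vertex-weight is 2n, so w = 2n^2

# tau_c: by Lemma 4.36 only splits {[0,j], [j+1,n-1]} matter; pi is uniform,
# and the only crossing edge is (j, j+1) with weight j+1.
vals = []
for j in range(n - 1):
    piA, piAc = (j + 1) / n, (n - 1 - j) / n
    QAAc = w_edge[j] / w_total
    vals.append(piA * piAc / QAAc)       # each value equals 2(n-1-j)
tau_c = max(vals)
assert np.isclose(tau_c, 2 * (n - 1)) and int(np.argmax(vals)) == 0

# tau* via the commute interpretation of resistance: tau* = w r_{0,n-1}
tau_star = w_total * float(np.sum(1.0 / w_edge))
bound = (4 * (1 + np.log(n)) / (1.0 / n)) * tau_c   # Proposition 4.42, pi* = 1/n
assert tau_star <= bound
```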

4.6 Induced and product chains


Here we record the behavior of our parameters under two natural operations
on chains.

4.6.1 Induced chains


Given a Markov chain X_t on state space I and a function f : I → Î, the
process f(X_t) is typically not a Markov chain. But we can invent a chain
which substitutes. In discrete time (the continuous case is similar) define
the induced chain X̂_t to have transition matrix

    p̂_{î,ĵ} = P_π( f(X_1) = ĵ | f(X_0) = î ) = Σ_{i,j: f(i)=î, f(j)=ĵ} π_i p_ij / Σ_{i: f(i)=î} π_i.        (4.39)

More informatively, we are matching the stationary flow rates:

    P_π̂( X̂_0 = î, X̂_1 = ĵ ) = P_π( f(X_0) = î, f(X_1) = ĵ ).        (4.40)

The reader may check that (4.39) and (4.40) are equivalent. Under our
standing assumption that Xt is reversible, the induced chain is also reversible
(though the construction works for general chains as well). In the electrical
network interpretation, we are shorting together vertices with the same f -
values. It seems intuitively plausible that this “shorting” can only decrease
our parameters describing convergence and mean hitting time behavior.
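Definition (4.39) translates directly into a few lines of linear algebra. The sketch below (ours; the 6-state chain and the lumping function f are arbitrary choices) builds the induced chain, checks that stationarity and reversibility are inherited, and confirms the contraction principle for τ2:

```python
import numpy as np

def induced_chain(P, pi, f, k):
    """f maps states 0..n-1 onto 0..k-1; returns (P_hat, pi_hat) per (4.39)."""
    n = len(pi)
    F = np.zeros((n, k)); F[np.arange(n), f] = 1.0   # indicator of f(i) = i_hat
    pi_hat = F.T @ pi
    flow = F.T @ (pi[:, None] * P) @ F               # matched stationary flow rates
    return flow / pi_hat[:, None], pi_hat

def tau2(P, pi):
    S = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
    return 1 / (1 - np.sort(np.linalg.eigvalsh(S))[::-1][1])

rng = np.random.default_rng(5)
n = 6
W = rng.random((n, n)); W = W + W.T; np.fill_diagonal(W, 0)
P = W / W.sum(axis=1, keepdims=True)
pi = W.sum(axis=1) / W.sum()

f = np.array([0, 0, 1, 1, 2, 2])                     # lump states in pairs
P_hat, pi_hat = induced_chain(P, pi, f, 3)

assert np.allclose(P_hat.sum(axis=1), 1)             # stochastic
assert np.allclose(pi_hat @ P_hat, pi_hat)           # pi_hat is stationary
flow_hat = pi_hat[:, None] * P_hat
assert np.allclose(flow_hat, flow_hat.T)             # reversibility inherited
assert tau2(P_hat, pi_hat) <= tau2(P, pi) + 1e-9     # contraction principle
```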

Proposition 4.44 (The contraction principle) The values of τ∗, τ0, τ2
and τc in an induced chain are less than or equal to the corresponding values
in the original chain.

Proof. A function ĝ : Î → R pulls back to a function g ≡ ĝ(f(·)) : I → R.


So the Dirichlet principle (Chapter 3 yyy) shows that mean commute times
can only decrease when passing to an induced chain:

Ef (i) T̂f (j) + Ef (j) T̂f (i) ≤ Ei Tj + Ej Ti .

This establishes the assertion for τ ∗ and τ0 , and the extremal characteri-
zation of relaxation time works similarly for τ2 . The assertion about τc is
immediate from the definition, since a partition of Iˆ pulls back to a partition
of I. 2
On the other hand, it is easy to see that shorting may increase a one-
sided mean hitting time. For example, random walk on the unweighted

graph on the left has E_a T_b = 1, but when we short {a, d} together to form
vertex â in the graph on the right, E_â T̂_b = 2.

[Figure: left, the path a − b − c − d; right, the triangle on vertices â, b, c
obtained by shorting a and d together.]

Finally, the behavior of the τ1 -family under shorting is unclear.


Open Problem 4.45 Is the value of τ1^(2) in an induced chain bounded by
K times the value of τ1^(2) in the original chain, for some absolute constant
K? For K = 1?

4.6.2 Product chains


Given Markov chains on state spaces I (1) and I (2) , there is a natural concept
of a “product chain” on state space I (1) ×I (2) . It is worth writing this concept
out in detail for two reasons. First, to prevent confusion between several
different possible definitions in discrete time. Second, because the behavior
of relaxation times of product chains is relevant to simple examples and has
a surprising application (section 4.6.3).
As usual, things are simplest in continuous time. Define the product
chain to be

    X_t = (X_t^(1), X_t^(2))

where the components X_t^(1) and X_t^(2) are independent versions of the given
chains. So

    P_{i1,i2}( X_t = (j1, j2) ) = P_{i1}( X_t^(1) = j1 ) P_{i2}( X_t^(2) = j2 ).        (4.41)

Using the interpretation of relaxation time as asymptotic rate of conver-
gence of transition probabilities (Chapter 3 yyy), it is immediate that X
has relaxation time

    τ2 = max( τ2^(1), τ2^(2) ).        (4.42)
In discrete time there are two different general notions of "product
chain". One could consider the chain (X_t^(1), X_t^(2)) whose coordinates are
the independent chains. This is the chain with transition probabilities

    (i1, i2) → (j1, j2) : probability P^(1)(i1, j1) P^(2)(i2, j2)

and has relaxation time

    τ2 = max( τ2^(1), τ2^(2) ).

But it is more natural to define the product chain X_t to be the chain with
transition probabilities

    (i1, i2) → (j1, i2) : probability ½ P^(1)(i1, j1)

    (i1, i2) → (i1, j2) : probability ½ P^(2)(i2, j2).

This is the jump chain derived from the product of the continuized chains,
and has relaxation time

    τ2 = 2 max( τ2^(1), τ2^(2) ).        (4.43)

Again, this can be seen without need for calculation: the continuized chain
is just the continuous-time product chain run at half speed.
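Relation (4.43) can be confirmed numerically: the sketch below (ours, with two arbitrary small reversible chains) builds the coordinate-picking product chain as ½(P^(1) ⊗ I) + ½(I ⊗ P^(2)) and compares relaxation times:

```python
import numpy as np

# Hypothetical example: the "natural" discrete-time product, which picks
# one of the two coordinates with probability 1/2 and moves it.
def rw(seed, n):
    rng = np.random.default_rng(seed)
    W = rng.random((n, n)); W = W + W.T; np.fill_diagonal(W, 0)
    P = W / W.sum(axis=1, keepdims=True)
    return P, W.sum(axis=1) / W.sum()

def tau2(P, pi):
    S = np.diag(np.sqrt(pi)) @ P @ np.diag(1 / np.sqrt(pi))
    return 1 / (1 - np.sort(np.linalg.eigvalsh(S))[::-1][1])

P1, pi1 = rw(6, 3)
P2, pi2 = rw(7, 4)
Pprod = 0.5 * np.kron(P1, np.eye(4)) + 0.5 * np.kron(np.eye(3), P2)
piprod = np.kron(pi1, pi2)

# Eigenvalues of Pprod are (lam_i + mu_j)/2, so the spectral gap is halved:
assert np.isclose(tau2(Pprod, piprod), 2 * max(tau2(P1, pi1), tau2(P2, pi2)))
```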
This definition and (4.43) extend to d-fold products in the obvious way.
Random walk on Z^d is the product of d copies of random walk on Z^1, and
random walk on the d-cube (Chapter 5 yyy) is the product of d copies of
random walk on {0, 1}.

Just to make things more confusing, given graphs G^(1) and G^(2) the
Cartesian product graph is defined to have edges

    (v1, w1) ↔ (v2, w1)  for v1 ↔ v2

    (v1, w1) ↔ (v1, w2)  for w1 ↔ w2.

If both G^(1) and G^(2) are r-regular then random walk on the product graph
is the product of the random walks on the individual graphs. But in general,
discrete-time random walk on the product graph is the jump chain of the
product of the fluid model (Chapter 3 yyy) continuous-time random walks.
So if the graphs are r1- and r2-regular then the discrete-time random walk on
the product graph has the product distribution as its stationary distribution
and has relaxation time

    τ2 = (r1 + r2) max( τ2^(1)/r1, τ2^(2)/r2 ).

But for non-regular graphs, neither assertion is true.



Let us briefly discuss the behavior of some other parameters under prod-
ucts. For the continuous-time product (4.41), the total variation distance d̄
of section 4.3 satisfies

    d̄(t) = 1 − (1 − d̄^(1)(t))(1 − d̄^(2)(t))

and we deduce the crude bound

    τ1 ≤ 2 max( τ1^(1), τ1^(2) )

where superscripts refer to the graphs G^(1), G^(2) and not to the parameters
in section 4.3.1. For the discrete-time chain, there is an extra factor of 2
from "slowing down" (cf. (4.42),(4.43)), leading to

    τ1 ≤ 4 max( τ1^(1), τ1^(2) ).

Here our conventions are a bit confusing: this inequality refers to the discrete-
time product chain, but as in section 4.3 we define τ1 via the continuized
chain – we leave the reader to figure out the analogous result for τ1^disc
discussed in section 4.3.3.
To state a result for τ0, consider the continuous-time product (X_t^(1), X_t^(2))
of independent copies of the same n-state chain. If the underlying chain
has eigenvalues (λ_i; 1 ≤ i ≤ n) then the product chain has eigenvalues
(λ_i + λ_j; 1 ≤ i, j ≤ n) and so by the eigentime identity

    τ0^product = Σ_{i,j≥1; (i,j)≠(1,1)} 1/(λ_i + λ_j)

               = 2τ0 + Σ_{i,j≥2} 1/(λ_i + λ_j)

               ≤ 2τ0 + 2 Σ_{i=2}^n Σ_{j=i}^n 1/(λ_i + λ_j)

               ≤ 2τ0 + Σ_{i=2}^n 2(n − i + 1)/λ_i

               ≤ 2τ0 + (n − 1) 2τ0 = 2nτ0.

Thus in discrete time

    τ0^product ≤ 4nτ0.        (4.44)

4.6.3 Efron-Stein inequalities


The results above concerning relaxation times of product chains are essen-
tially obvious using the interpretation of relaxation time as asymptotic rate
of convergence of transition probabilities, but they are much less obvious
using the extremal interpretation. Indeed, consider the n-fold product of a
single chain X with itself. Write (X0, X1) for the distribution at times 0
and 1 of X, and τ2 for the relaxation time of X. Combining (4.43) with
the extremal characterization (4.22) of the relaxation time for the product
chain, a brief calculation gives the following result.

Corollary 4.46 Let f : I^n → R be arbitrary. Let (X^(i), Y^(i)), i = 1, . . . , n
be independent copies of (X0, X1). Let Z = f(X^(1), . . . , X^(n)) and let Z^(i) =
f(X^(1), . . . , X^(i−1), Y^(i), X^(i+1), . . . , X^(n)). Then

    var(Z) / ( (1/(2n)) Σ_{i=1}^n E(Z − Z^(i))² ) ≤ n τ2.

To appreciate this, consider the “trivial” case where the underlying Markov
chain is just an i.i.d. sequence with distribution π on I. Then τ2 = 1 and
the 2n random variables (X (i) , Y (i) ; 1 ≤ i ≤ n) are i.i.d. with distribution
π. And this special case of Corollary 4.46 becomes (4.45) below, because for
each i the distribution of Z − Z (i) is unchanged by substituting X0 for Y (i) .
Corollary 4.47 Let f : I^n → R be arbitrary. Let (X0, X1, . . . , Xn) be i.i.d.
with distribution π. Let Z^(i) = f(X1, . . . , X_{i−1}, X0, X_{i+1}, . . . , Xn) and let
Z = f(X1, . . . , Xn). Then

    var(Z) ≤ ½ Σ_{i=1}^n E(Z − Z^(i))².        (4.45)

If f is symmetric then

    var(Z) ≤ Σ_{i=0}^n E(Z^(i) − Z̄)²        (4.46)

where Z^(0) = Z and Z̄ = (1/(n + 1)) Σ_{i=0}^n Z^(i).

Note that in the symmetric case we may rewrite

    Z^(i) = f(X0, X1, . . . , X_{i−1}, X_{i+1}, . . . , Xn).

This reveals (4.46) to be the celebrated Efron-Stein inequality in statistics,


and in fact (4.45) is a known variant (see Notes).

Proof. As observed above, (4.45) is a special case of Corollary 4.46. So
it is enough to show that for symmetric f the right sides of (4.45) and (4.46)
are equal. Note that by symmetry a = E(Z^(i))² does not depend on i, and
b = E Z^(i) Z^(j) does not depend on (i, j), for j ≠ i. So the right side of (4.45)
is

    ½ n(a − 2b + a) = n(a − b).

But it is easy to calculate

    E Z^(i) Z̄ = E Z̄² = a/(n + 1) + nb/(n + 1)

and then the right side of (4.46) equals

    (n + 1)( a − 2 E Z^(i) Z̄ + E Z̄² ) = na − nb.
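In the i.i.d. case, inequality (4.45) can be verified exactly by enumeration. A sketch of ours (Bernoulli(p) coordinates and a random table of f-values are arbitrary assumptions):

```python
import numpy as np
from itertools import product as iproduct

# Exact check of (4.45) for i.i.d. Bernoulli(p) coordinates and an
# arbitrary f on {0,1}^3, enumerating all configurations.
p, n = 0.3, 3
rng = np.random.default_rng(8)
f = {x: float(rng.normal()) for x in iproduct((0, 1), repeat=n)}

def prob(x):
    out = 1.0
    for xi in x:
        out *= p if xi else 1 - p
    return out

cube = list(iproduct((0, 1), repeat=n))
EZ = sum(prob(x) * f[x] for x in cube)
varZ = sum(prob(x) * (f[x] - EZ) ** 2 for x in cube)

# (1/2) sum_i E (Z - Z^(i))^2, where Z^(i) resamples coordinate i
rhs = 0.0
for i in range(n):
    for x in cube:
        for y in (0, 1):
            xi = x[:i] + (y,) + x[i + 1:]
            rhs += 0.5 * prob(x) * (p if y else 1 - p) * (f[x] - f[xi]) ** 2
assert varZ <= rhs + 1e-12
```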

4.6.4 Why these parameters?


The choice of parameters studied in this chapter is partly arbitrary, but our
choice has been guided by two criteria, one philosophical and one technical.
The philosophical criterion is
when formalizing a vague idea, choose a definition which has
several equivalent formulations.
This is why we used the maximal mean hitting time parameter max_{i,j}(E_i T_j + E_j T_i) instead of max_{i,j} E_i T_j, because the former permits the equivalent “resistance” interpretation.
Here is the technical criterion. Given a continuous-time chain Xt and a
state i, create a new chain Xt∗ by splitting i into two states i1 , i2 and setting
q*_{i1,j} = q*_{i2,j} = q_{ij},   j ≠ i
q*_{j,i1} = q*_{j,i2} = (1/2) q_{ji},   j ≠ i
q*_{i1,i2} = q*_{i2,i1} = ρ,

with q* = q elsewhere. Then π*(i1) = π*(i2) = (1/2) π(i), with π* = π elsewhere. As ρ → ∞, we may regard the new chain as converging to the old
chain in a certain sense. So our technical criterion for parameters τ is that
the value of τ for X ∗ should converge, as ρ → ∞, to the value for X. It is
easy to check this holds for the τ ’s we have studied, but it does not hold for,
say,

τ ≡ max_{ij} π_j E_i T_j,

which at first sight might seem a natural parameter.
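The splitting construction can be checked mechanically. The following Python sketch (our addition, not in the original; the 3-state rates are an arbitrary illustration) confirms that π*(i1) = π*(i2) = (1/2)π(i) is stationary for the split chain, for any ρ:

```python
from fractions import Fraction as F

def residual(pi, Q):
    """Components of pi @ Q; all zero iff pi is stationary for rate matrix Q."""
    n = len(Q)
    return [sum(pi[i] * Q[i][j] for i in range(n)) for j in range(n)]

def rate_matrix(rates, n):
    """Build a rate matrix from off-diagonal rates; diagonals = -row sums."""
    Q = [[F(0)] * n for _ in range(n)]
    for (i, j), r in rates.items():
        Q[i][j] = F(r)
    for i in range(n):
        Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)
    return Q

# Birth-and-death chain on {0,1,2}; pi = (6,3,1)/10 by detailed balance.
Q = rate_matrix({(0, 1): 1, (1, 0): 2, (1, 2): 1, (2, 1): 3}, 3)
pi = [F(6, 10), F(3, 10), F(1, 10)]
assert residual(pi, Q) == [0, 0, 0]

# Split old state 0 into new states 0 and 1 (old states 1,2 -> new 2,3).
rho = F(7)                                        # arbitrary; result is rho-free
Qs = rate_matrix({(0, 2): 1, (1, 2): 1,           # q*_{i1 j} = q*_{i2 j} = q_{ij}
                  (2, 0): 1, (2, 1): 1,           # q*_{j i1} = q*_{j i2} = q_{ji}/2
                  (2, 3): 1, (3, 2): 3,           # rates away from i unchanged
                  (0, 1): rho, (1, 0): rho}, 4)   # q*_{i1 i2} = q*_{i2 i1} = rho
pis = [F(3, 10), F(3, 10), F(3, 10), F(1, 10)]    # pi*(i1) = pi*(i2) = pi(i)/2
assert residual(pis, Qs) == [0, 0, 0, 0]
```

Changing rho leaves the assertions true, reflecting that the stationary distribution of X* does not depend on ρ.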

4.7 Notes on Chapter 4


Section 4.3.1. The definition of τ1^(2) involves the idea of a stopping time
U such that XU has distribution π and is independent of the starting posi-
tion. This idea is central to the standard modern theory of Harris-recurrent
Markov chains, i.e. chains on continuous space which mimic the asymptotic
behavior of discrete recurrent chains, and does not require reversibility. See
[133] sec. 5.6 for an introduction, and [34, 263] for more comprehensive
treatments. In that field, researchers have usually been content to obtain
some finite bound on EU , and haven’t faced up to our issue of the quanti-
tative dependence of the bound on the underlying chain.
Separation and strong stationary times were introduced in Aldous and
Diaconis [22], who gave some basic theory. These constructions can be used
to bound convergence times in examples, but in practice are used in examples
with much special structure, e.g. non-necessarily-symmetric random walks
on groups. Examples can be found in [21, 22] and Matthews [258]. Develop-
ment of theory, mostly for stochastically monotone chains on 1-dimensional
state space, is in Diaconis and Fill [114, 115], Fill [150, 151] and Matthews
[260].
The recurrent balayage theorem (Chapter 1 yyy) can be combined with
the mean hitting time formula to get

τ1^(2) = max_{ij} (−Z_{ij} / π_j).   (4.47)

Curiously, this elegant result doesn’t seem to help much with the inequalities
in Theorem 4.6.
What happens with the τ1 -family of parameters for general chains re-
mains rather obscure. Some counter-examples to equivalence, and weaker
inequalities containing log 1/π∗ factors, can be found in [6]. Recently, Lovász and Winkler [241] initiated a detailed study of τ1^(2) for general chains which promises to shed more light on this question.
Our choice of τ1 as the “representative” of the family of τ1^(i)’s is somewhat
arbitrary. One motivation was that it gives the constant “1” in the inequality
τ2 ≤ τ1 . It would be interesting to know whether the constants in other basic
inequalities relating the τ1 -family to other parameters could be made “1”:

Open Problem 4.48 (a) Is τ1 ≤ τ0?

(b) Is τ2 ≤ τ1^(2)?

Much of recent sophisticated theory xxx refs bounds d(t) by bounding d̂(t) and appealing to Lemma 4.7(b). But it is not clear whether there is an analog of Theorem 4.6 relating the d̂-threshold to other quantities.
Section 4.3.2. The parts of Theorem 4.6 involving τ1^(1) and τ1^(3) are
implicit rather than explicit in [6]. That paper had an unnecessarily com-
plicated proof of Lemma 4.13. The proof of (4.15) in [6] gives a constant
K ≈ e^13. It would be interesting to obtain a smaller constant! Failing this, a small constant in the inequality τ1^(1) ≤ K τ1^(3) would be desirable. As a
weaker result, it is easy to show
τ1^(1) ≤ 10 min_j max_i E_i T_j   (4.48)

which has some relevance to later examples (yyy).


Section 4.3.3. The analog of Open Problem 4.17 in which we measure
distance from stationarity by dˆ instead of d(t) is straightforward, using the
“CM proxy” property of discrete time chains:
Pi (X2t = i) + Pi (X2t+1 = i) ↓ 0 as t → ∞.
Open Problem 4.17 itself seems deeper, though the weaker form in which
we require only that φ(t) = O(t) can probably be proved by translating the
proof of (4.15) into discrete time and using the CM proxy property.
Section 4.3.4. The cat-and-mouse game was treated briefly in Aleliunas
et al [25], who gave a bare-hands proof of a result like (4.21). Variations in
which the cat is also allowed an arbitrary strategy have been called “princess
and monster” games – see Isaacs [190] for results in a different setting.
Section 4.3.5. Sinclair [308] points out that “hard” results of Leighton
and Rao [223] on multicommodity flow imply
inf_f ψ(f) ≤ K τ2 log n.   (4.49)

This follows from Corollary 4.22 and Lemma 4.23 when π is uniform, but
Sinclair posed
Open Problem 4.49 (i) Is there a simple proof of (4.49) in general?
(ii) Does (4.49) hold with the diameter ∆ in place of log n?
Section 4.4. As an example of historical interest, before this topic became
popular Fiedler [146] proved
Proposition 4.50 For random walk on an n-vertex weighted graph where the stationary distribution is uniform,

τ2 ≤ w / (4nc sin^2(π/(2n))) ∼ wn/(π^2 c)

where c is the minimum cut defined at (4.4).

This upper bound is sharp. On the other hand, Proposition 4.2 gave the
same upper bound (up to the numerical constant) for the a priori larger
quantity τ ∗ , and so is essentially a stronger result.
Section 4.4.1. In the non-reversible case the definition of the maximal
correlation ρ(t) makes sense, and there is similar asymptotic behavior:

ρ(t) ∼ c exp(−λt) as t → ∞

where λ is the “spectral gap”. But we cannot pull back from asymptotia
to the real world so easily: it is not true that ρ(t) can be bounded by
K exp(−λt) for universal K. A dramatic example from Aldous [11] section
4 has for each n an n-state chain with spectral gap bounded away from 0 but
with ρ(n) also bounded away from 0, instead of being exponentially small.
So implicit claims in the literature that estimates of the spectral gap for
general chains have implications for finite-time behavior should be treated
with extreme skepticism.
It is not surprising that the classical Berry-Esseen Theorem for i.i.d.
sums ([133] Thm. 2.4.10) has an analog for chains. Write σ 2 for the asymp-
totic variance rate in Proposition 4.29 and write Z for a standard Normal
r.v.
Proposition 4.51 There is a constant K, depending on the chain, such that

sup_x |P_π(S_t/(σ t^{1/2}) ≤ x) − P(Z ≤ x)| ≤ K t^{−1/2}

for all t ≥ 1 and all standardized g.
This result is usually stated for infinite-state chains satisfying various mix-
ing conditions, which are automatically satisfied by finite chains. See e.g.
Bolthausen [55]. At first sight the constant K depends on the function g
as well as the chain, but a finiteness argument shows that the dependence
on g can be removed. Unfortunately the usual proofs don’t give any useful
indications of how K depends on the chain, and so don’t help with Open
Problem 4.30.
The variance results in Proposition 4.29 are presumably classical, being
straightforward consequences of the spectral representation. Their use in
algorithmic settings such as Corollary 4.31 goes back at least to [10].
Section 4.4.3. Systematic study of the optimal choice of weights in the
Cauchy-Schwarz argument for Theorem 4.32 may lead to improved bounds
in examples. Alan Sokal has unpublished notes on this subject.

Section 4.5.1. The quantity 1/τc , or rather this quantity with the al-
ternate definition of τc mentioned in the text, has been called conductance.
I avoid that term, which invites unnecessary confusion with the electrical
network terminology. However, the subscript c can be regarded as standing
for “Cheeger” or “conductance”.
In connection with Open Problem 4.38 we mention the following result.
Suppose that in the definition (section 4.4.1) of the maximal correlation
function ρ(t) we considered only events, i.e. suppose we defined

ρ̃(t) ≡ sup_{A,B} cor(1_{(X0 ∈ A)}, 1_{(Xt ∈ B)}).

Then ρ̃(t) ≤ ρ(t), but in fact the two definitions are equivalent in the sense
that there is a universal function ψ(x) ↓ 0 as x ↓ 0 such that ρ(t) ≤ ψ(ρ̃(t)).
This is a result about “measures of dependence” which has nothing to do
with Markovianness – see e.g. Bradley et al [59].
Section 4.5.2. The history of Cheeger-type inequalities up to 1987 is
discussed in [222] section 6. Briefly, Cheeger [87] proved a lower bound for
the eigenvalues of the Laplacian on a compact Riemannian manifold, and
this idea was subsequently adapted to different settings – in particular, by
Alon [26] to the relationship between eigenvalues and expansion properties
of graphs. Lawler and Sokal [222], and independently Jerrum and Sinclair
[307], were the first to discuss the relationship between τc and τ2 at the
level of reversible Markov chains. Their work was modified by Diaconis
and Stroock [124], whose proof we followed for Lemmas 4.39 and 4.41. The
only novelty in my presentation is talking explicitly about quasistationary
distributions, which makes the relationships easier to follow.
xxx give forward pointer to results of [238, 158].
Section 4.6.2. See Efron–Stein [140] for the origin of their inequality. Inequality (4.45), or rather the variant mentioned above Corollary 4.47 involving the 2n i.i.d. variables
Chapter 5

Examples: Special Graphs and Trees (April 23 1996)

There are two main settings in which explicit calculations for random walks
on large graphs can be done. One is where the graph is essentially just 1-
dimensional, and the other is where the graph is highly symmetric. The main
purpose of this chapter is to record some (mostly) bare-hands calculations for
simple examples, in order to illuminate the general inequalities of Chapter 4.
Our focus is on natural examples, but there are a few artificial examples
devised to make a mathematical point. A second purpose is to set out some
theory for birth-and-death chains and for trees.
Lemma 5.1 below is useful in various simple examples, so let’s record
it here. An edge (v, x) of a graph is essential (or a bridge) if its removal
would disconnect the graph, into two components A(v, x) and A(x, v), say,
containing v and x respectively. Recall that E is the set of (undirected)
edges, and write E(v, x) for the set of edges of A(v, x).
Lemma 5.1 (essential edge lemma) For random walk on a weighted graph with essential edge (v, x),

E_v T_x = (2 Σ_{(i,j)∈E(v,x)} w_{ij}) / w_{vx} + 1   (5.1)

E_v T_x + E_x T_v = w / w_{vx},   where w = Σ_i Σ_j w_{ij}.   (5.2)

Specializing to the unweighted case,


Ev Tx = 2|E(v, x)| + 1 (5.3)
Ev Tx + Ex Tv = 2|E|. (5.4)


Proof. It is enough to prove (5.1), since (5.2) follows by adding the two
expressions of the form (5.1). Because (v, x) is essential, we may delete all
vertices of A(x, v) except x, and this does not affect the behavior of the
chain up until time Tx , because x must be the first visited vertex of A(x, v).
After this deletion, π_x^{−1} = E_x T_x^+ = 1 + E_v T_x by considering the first step from x, and π_x = w_{vx} / (2 w_{vx} + 2 Σ_{(i,j)∈E(v,x)} w_{ij}), giving (5.1).
Remarks. Of course Lemma 5.1 is closely related to the edge-commute in-
equality of Chapter 3 Lemma yyy. We can also regard (5.2), and hence (5.4),
as consequences of the commute interpretation of resistance (Chapter 3 yyy),
because the effective resistance across an essential edge (v, x) is obviously
1/wvx .
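As a numerical illustration (a Python sketch added in editing; the small "triangle plus tail" graph and the helper names are ours), the lemma can be cross-checked against a direct first-step-analysis computation of mean hitting times:

```python
from fractions import Fraction as F

def solve(A, b):
    """Gaussian elimination over the rationals."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def mean_hitting_time(adj, v, x):
    """E_v T_x for random walk on an unweighted graph, by first-step analysis:
    h(x) = 0 and h(u) = 1 + (mean of h over the neighbours of u)."""
    states = [u for u in adj if u != x]
    idx = {u: k for k, u in enumerate(states)}
    n = len(states)
    A = [[F(0)] * n for _ in range(n)]
    b = [F(1)] * n
    for u in states:
        A[idx[u]][idx[u]] = F(1)
        for nb in adj[u]:
            if nb != x:
                A[idx[u]][idx[nb]] -= F(1, len(adj[u]))
    return solve(A, b)[idx[v]]

# Triangle {0,1,2} with pendant path 2-3-4; edge (2,3) is essential.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
E = 5                                          # total number of edges
# (5.3): E_2 T_3 = 2|E(2,3)| + 1, with |E(2,3)| = 3 edges on the triangle side
assert mean_hitting_time(adj, 2, 3) == 2 * 3 + 1
# (5.4): E_2 T_3 + E_3 T_2 = 2|E|
assert mean_hitting_time(adj, 2, 3) + mean_hitting_time(adj, 3, 2) == 2 * E
```

Here edge (2, 3) is essential with |E(2, 3)| = 3 and |E| = 5, matching (5.3) and (5.4) exactly.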

5.1 One-dimensional chains


5.1.1 Simple symmetric random walk on the integers
It is useful to record some elementary facts about simple symmetric random
walk (Xt ) on the (infinite) set of all integers. As we shall observe, these may
be derived in several different ways.
A fundamental formula gives exit probabilities:
P_b(T_c < T_a) = (b − a)/(c − a),   a < b < c.   (5.5)
c−a
An elementary argument is that g(i) ≡ P_i(T_c < T_a) satisfies the 1-step recurrence

g(i) = (1/2) g(i + 1) + (1/2) g(i − 1),   a < i < c,
g(a) = 0, g(c) = 1,

whose solution is g(i) = (i − a)/(c − a). At a more sophisticated level, (5.5)
is a martingale result. The quantity p ≡ Pb (Tc < Ta ) must satisfy
b = Eb X(Ta ∧ Tc ) = pc + (1 − p)a,
where the first equality is the optional sampling theorem for the martingale
X, and solving this equation gives (5.5).
For a < c, note that Ta ∧ Tc is the “exit time” from the open interval
(a, c). We can use (5.5) to calculate the “exit before return” probability
P_b(T_b^+ > T_a ∧ T_c) = (1/2) P_{b+1}(T_c < T_b) + (1/2) P_{b−1}(T_a < T_b)
= (1/2) · 1/(c − b) + (1/2) · 1/(b − a)
= (c − a) / (2(c − b)(b − a)).   (5.6)
5.1. ONE-DIMENSIONAL CHAINS 161

For the walk started at b, let m(b, x; a, c) be the mean number of visits to
x before the exit time Ta ∧ Tc . (Recall from Chapter 2 our convention that
“before time t” includes time 0 but excludes time t). The number of returns
to b clearly has a Geometric distribution, so by (5.6)
m(b, b; a, c) = 2(c − b)(b − a)/(c − a),   a ≤ b ≤ c.   (5.7)
To get the analog for visits to x we consider whether or not x is hit at all
before exiting; this gives
m(b, x; a, c) = Pb (Tx < Ta ∧ Tc ) m(x, x; a, c).
Appealing to (5.5) and (5.7) gives the famous mean occupation time formula
m(b, x; a, c) = 2(x − a)(c − b)/(c − a)   if a ≤ x ≤ b ≤ c
             = 2(c − x)(b − a)/(c − a)   if a ≤ b ≤ x ≤ c.   (5.8)
Now the (random) time to exit must equal the sum of the (random)
times spent at each state. So, taking expectations,
E_b(T_a ∧ T_c) = Σ_{x=a}^{c} m(b, x; a, c),

and after a little algebra we obtain


Lemma 5.2 Eb (Ta ∧ Tc ) = (b − a)(c − b), a < b < c.
This derivation of Lemma 5.2 from (5.8) has the advantage of giving the
mean occupation time formula (5.8) on the way. There are two alternative
ways to prove Lemma 5.2. An elementary proof is to set up and solve the
1-step recurrence for h(i) ≡ E_i(T_a ∧ T_c):

h(i) = 1 + (1/2) h(i + 1) + (1/2) h(i − 1),   a < i < c,
h(a) = h(c) = 0.
The more elegant proof uses a martingale argument. Taking b = 0 without
loss of generality, the first equality below is the optional sampling theorem
for the martingale (X^2(t) − t):

E_0(T_a ∧ T_c) = E_0 X^2(T_a ∧ T_c)
= a^2 P_0(T_a < T_c) + c^2 P_0(T_c < T_a)
= a^2 · c/(c − a) + c^2 · (−a)/(c − a)   by (5.5)
= −ac.
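Since each boundary-value recurrence above has a unique solution, (5.5) and Lemma 5.2 can be confirmed by checking that the closed forms satisfy the recurrences; a short Python sketch (added in editing, with an arbitrary choice of a and c):

```python
from fractions import Fraction as F

a, c = -3, 8   # any integers a < c work here

# (5.5): g(i) = (i - a)/(c - a) solves g(i) = (g(i+1) + g(i-1))/2,
# with g(a) = 0, g(c) = 1.
g = lambda i: F(i - a, c - a)
assert g(a) == 0 and g(c) == 1
assert all(g(i) == (g(i + 1) + g(i - 1)) / 2 for i in range(a + 1, c))

# Lemma 5.2: h(i) = (i - a)(c - i) solves h(i) = 1 + (h(i+1) + h(i-1))/2,
# with h(a) = h(c) = 0.
h = lambda i: F((i - a) * (c - i))
assert h(a) == 0 and h(c) == 0
assert all(h(i) == 1 + (h(i + 1) + h(i - 1)) / 2 for i in range(a + 1, c))
```

Uniqueness of the solution to each recurrence then gives the formulas for all a < b < c, not just the tested values.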

The preceding discussion works in discrete or continuous time. Exact


distributions at time t will of course differ in the two cases. In discrete time
we appeal to the Binomial distribution for the number of +1 steps, to get
P_0(X_{2t} = 2j) = 2^{−2t} (2t)! / ((t + j)! (t − j)!),   −t ≤ j ≤ t   (5.9)

and a similar expression for odd times t. In continuous time, the numbers of +1 and of −1 steps in time t are independent Poisson(t) variables, so

P_0(X_t = −j) = P_0(X_t = j) = e^{−2t} Σ_{i=0}^{∞} t^{2i+j} / (i! (i + j)!),   j ≥ 0.   (5.10)

5.1.2 Weighted linear graphs


Consider the n-vertex linear graph 0 – 1 – 2 – · · · – (n − 1) with arbitrary
edge-weights (w1 , . . . , wn−1 ), where wi > 0 is the weight on edge (i − 1, i).
Set w0 = wn = 0 to make some later formulas cleaner. The corresponding
discrete-time random walk has transition probabilities
p_{i,i+1} = w_{i+1}/(w_i + w_{i+1}),   p_{i,i−1} = w_i/(w_i + w_{i+1}),   0 ≤ i ≤ n − 1,

and stationary distribution

π_i = (w_i + w_{i+1})/w,   0 ≤ i ≤ n − 1,

where w = 2 Σ_i w_i. In probabilistic terminology, this is a birth-and-death
process, meaning that a transition cannot alter the state by more than 1.
It is elementary that such processes are automatically reversible (xxx spells
out the more general result for trees), so as discussed in Chapter 3 yyy
the set-up above with weighted graphs gives the general discrete-time birth-
and-death process with pii ≡ 0. But note that the continuization does
not give the general continuous-time birth-and-death process, which has
2(n − 1) parameters (qi,i−1 , qi,i+1 ) instead of just n − 1 parameters (wi ).
The formulas below could all be extended to this general case (the analog of
Proposition 5.3 can be found in undergraduate textbooks, e.g., Karlin and
Taylor [208] Chapter 4) but our focus is on the simplifications which occur
in the “weighted graphs” case.

Proposition 5.3 (a) For a < b < c,

P_b(T_c < T_a) = Σ_{i=a+1}^{b} w_i^{−1} / Σ_{i=a+1}^{c} w_i^{−1}.

(b) For b < c,

E_b T_c = (c − b) + 2 Σ_{j=b+1}^{c} Σ_{i=1}^{j−1} w_i w_j^{−1}.

(c) For b < c,

E_b T_c + E_c T_b = w Σ_{i=b+1}^{c} w_i^{−1}.

Note that we can obtain an expression for Ec Tb , b < c, by reflecting the


weighted graph about its center.
Proof. These are extensions of (5.5), (5.1), (5.2) and recycle some of the previous arguments. Writing h(j) = Σ_{i=1}^{j} w_i^{−1}, we have that (h(X_t)) is a martingale, so

h(b) = E_b h(X(T_a ∧ T_c)) = p h(c) + (1 − p) h(a)

for p ≡ P_b(T_c < T_a). Solving this equation gives p = (h(b) − h(a)) / (h(c) − h(a)), which is (a).
The mean hitting time formula (b) has four different proofs! Two that
we will not give are as described below Lemma 5.2: Set up and solve a
recurrence equation, or use a well-chosen martingale. The slick argument is
to use the essential edge lemma (Lemma 5.1) to show
E_{j−1} T_j = 1 + 2 Σ_{i=1}^{j−1} w_i / w_j.

Then

E_b T_c = Σ_{j=b+1}^{c} E_{j−1} T_j,
establishing (b). Let us also write out the non-slick argument, using mean
occupation times. By considering mean time spent at i,
E_b T_c = Σ_{i=0}^{b−1} P_b(T_i < T_c) m(i, i, c) + Σ_{i=b}^{c−1} m(i, i, c),   (5.11)

where m(i, i, c) is the expectation, starting at i, of the number of visits to i


before Tc . But
m(i, i, c) = 1 / P_i(T_c < T_i^+)
= 1 / (p_{i,i+1} P_{i+1}(T_c < T_i))
= (w_i + w_{i+1}) Σ_{j=i+1}^{c} w_j^{−1},   using (a).

Substituting this and (a) into (5.11) leads to the formula stated in (b).
Finally, (c) can be deduced from (b), but it is more elegant to use the
essential edge lemma to get

Ei−1 Ti + Ei Ti−1 = w/wi (5.12)

and then use


E_b T_c + E_c T_b = Σ_{i=b+1}^{c} (E_{i−1} T_i + E_i T_{i−1}).
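Part (b) can be cross-checked numerically (a Python sketch added in editing; the edge weights are an arbitrary example). The one-step recursion e_j = (1 + (1 − p) e_{j−1})/p for e_j = E_{j−1} T_j, with p = w_j/(w_{j−1} + w_j), follows from first-step analysis and is independent of the closed form:

```python
from fractions import Fraction as F

w = [F(x) for x in [0, 3, 1, 4, 1, 5]]   # w[0] = 0; edge weights w_1..w_{n-1}
n = len(w)                                # vertices 0..n-1

# First-step analysis at j-1: e_j = 1 + (1-p)(e_{j-1} + e_j) with
# p = w_j/(w_{j-1} + w_j), hence e_j = (1 + (1-p) e_{j-1}) / p.
e = [None, F(1)]                          # e_1 = 1 since p_{0,1} = 1
for j in range(2, n):
    p = w[j] / (w[j - 1] + w[j])
    e.append((1 + (1 - p) * e[j - 1]) / p)

def EbTc(b, c):
    """E_b T_c for b < c, as a sum of one-step mean hitting times."""
    return sum(e[b + 1:c + 1])

# Proposition 5.3(b): E_b T_c = (c-b) + 2 sum_{j=b+1}^{c} sum_{i=1}^{j-1} w_i/w_j
for b in range(n):
    for c in range(b + 1, n):
        closed = (c - b) + 2 * sum(sum(w[1:j]) / w[j] for j in range(b + 1, c + 1))
        assert EbTc(b, c) == closed
```

The agreement for every pair b < c on an asymmetric weight vector is a fairly stringent check of the algebra in (b).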

We now start some little calculations relating to the parameters discussed


in Chapter 4. Plainly, from Proposition 5.3
τ* = w Σ_{i=1}^{n−1} w_i^{−1}.   (5.13)

Next, consider calculating Eπ Tb . We could use Proposition 5.3(b), but in-


stead let us apply Theorem yyy of Chapter 3, giving Eπ Tb in terms of unit
flows from b to π. In a linear graph there is only one such flow, which
Pn−1
for i ≥ b has fi,i+1 = π[i + 1, n − 1] = j=i+1 πj , and for i ≤ b − 1 has
fi,i+1 = −π[0, i], and so the Proposition implies
n−1 b
X π 2 [i, n − 1] X π 2 [0, i − 1]
Eπ Tb = w +w . (5.14)
i=b+1
wi i=1
wi

There are several ways to use the preceding results to compute the av-
erage hitting time parameter τ0 . Perhaps the most elegant is
τ0 = Σ_i Σ_{j>i} π_i π_j (E_i T_j + E_j T_i)
= Σ_{k=1}^{n−1} π[0, k − 1] π[k, n − 1] (E_{k−1} T_k + E_k T_{k−1})
= Σ_{k=1}^{n−1} π[0, k − 1] π[k, n − 1] w/w_k   by (5.12)
= w^{−1} Σ_{k=1}^{n−1} w_k^{−1} (w_k + 2 Σ_{j=1}^{k−1} w_j)(w_k + 2 Σ_{j=k+1}^{n−1} w_j).   (5.15)

There are sophisticated methods (see Notes) of studying τ1 , but let us


just point out that Proposition 5.23 later (proved in the more general context
of trees) holds in the present setting, giving

(1/K_1) min_x max(E_0 T_x, E_{n−1} T_x) ≤ τ1 ≤ K_2 min_x max(E_0 T_x, E_{n−1} T_x).   (5.16)

We do not know an explicit formula for τ2, but we can get an upper bound easily from the “distinguished paths” result Chapter 4 yyy. For x < y the path γ_{xy} has r(γ_{xy}) = Σ_{u=x+1}^{y} 1/w_u and hence the bound is

τ2 ≤ (1/w) max_j Σ_{x=0}^{j−1} Σ_{y=j}^{n−1} (w_x + w_{x+1})(w_y + w_{y+1}) Σ_{u=x+1}^{y} 1/w_u.   (5.17)

jjj This uses the Diaconis–Stroock version. The Sinclair version is

τ2 ≤ (1/w) max_j (1/w_j) Σ_{x=0}^{j−1} Σ_{y=j}^{n−1} (w_x + w_{x+1})(w_y + w_{y+1})(y − x).

xxx literature on τ2 (van Doorn, etc.)


jjj Also relevant is work of N. Kahale (and others) on how optimal
choice of weights in use of Cauchy–Schwarz inequality for Diaconis–Stroock–
Sinclair leads to equality in case of birth-and-death chains.
jjj See also Diaconis and Saloff-Coste Metropolis paper, which mentions
work of Diaconis students on Metropolizing birth-and-death chains.
xxx examples of particular w. jjj might just bring up as needed?
xxx contraction principle and lower bounds on τ2 (relating to current
Section 6 of Chapter 4)
By Chapter 4 Lemma yyy,

τc = max_{1≤i≤n−1} w π[0, i − 1] π[i, n − 1] / w_i.   (5.18)

5.1.3 Useful examples of one-dimensional chains


Example 5.4 The two-state chain.

This is the birth-and-death chain on {0, 1} with p01 = 1 − p00 = p and


p10 = 1 − p11 = q, where 0 < p < 1 and 0 < q < 1 are arbitrarily specified.
Since p00 and p11 are positive, this does not quite fit into the framework of
Section 5.1.2, but everything is nonetheless easy to calculate. The stationary
distribution is given by

π0 = q/(p + q), π1 = p/(p + q).



In discrete time, the eigenvalues are λ1 = 1 and λ2 = 1 − p − q, and in


the notation of Chapter 3, Section yyy for the spectral representation, the matrix S has s_{11} = 1 − p, s_{22} = 1 − q, and s_{12} = s_{21} = (pq)^{1/2}, with normalized right eigenvectors

u_1 = [(q/(p+q))^{1/2}, (p/(p+q))^{1/2}]^T,   u_2 = [(p/(p+q))^{1/2}, −(q/(p+q))^{1/2}]^T.

The transition probabilities are given by

P_0(X_t = 1) = 1 − P_0(X_t = 0) = (p/(p+q)) [1 − (1 − p − q)^t],
P_1(X_t = 0) = 1 − P_1(X_t = 1) = (q/(p+q)) [1 − (1 − p − q)^t]

in discrete time and by

P_0(X_t = 1) = 1 − P_0(X_t = 0) = (p/(p+q)) [1 − e^{−(p+q)t}],
P_1(X_t = 0) = 1 − P_1(X_t = 1) = (q/(p+q)) [1 − e^{−(p+q)t}]
in continuous time. It is routine to calculate E0 T1 = 1/p, E1 T0 = 1/q, and
d̄(t) = e^{−(p+q)t},   d(t) = max(p/(p+q), q/(p+q)) e^{−(p+q)t},

and then

max_{ij} E_i T_j = max(E_0 T_1, E_1 T_0) = 1/min(p, q),   τ* = E_0 T_1 + E_1 T_0 = (p+q)/(pq),
and
τ0 = τ1 = τ2 = τc = 1/(p + q).
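The discrete-time formulas can be verified against exact matrix powers (a Python sketch added in editing; the particular values of p and q are arbitrary):

```python
from fractions import Fraction as F

p, q = F(1, 3), F(1, 5)
P = [[1 - p, p], [q, 1 - q]]          # two-state transition matrix

def matpow_row0(P, t):
    """Exact distribution of X_t for the chain started in state 0."""
    row = [F(1), F(0)]
    for _ in range(t):
        row = [row[0] * P[0][j] + row[1] * P[1][j] for j in range(2)]
    return row

# compare with P_0(X_t = 1) = (p/(p+q)) [1 - (1-p-q)^t] for several t
for t in range(8):
    predicted = p / (p + q) * (1 - (1 - p - q) ** t)
    assert matpow_row0(P, t)[1] == predicted
```

Exact rational arithmetic makes the comparison an identity rather than a floating-point approximation.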

Example 5.5 Biased random walk with reflecting barriers.

We consider the chain on {0, 1, . . . , n − 1} with reflecting barriers at 0 and


n − 1 that at each unit of time moves distance 1 rightward with probability p and distance 1 leftward with probability q = 1 − p. Formally, the setting is that of Section 5.1.2 with

w_i = ρ^{i−1},   w = 2(1 − ρ^{n−1})/(1 − ρ) → 2/(1 − ρ),

where we assume ρ ≡ p/q < 1 and all asymptotics developed for this example are for fixed ρ and large n. If ρ ≠ 1, there is by symmetry no loss of generality in assuming ρ < 1, and the case ρ = 1 will be treated later in Example 5.8.

Specializing the results of Section 5.1.2 to the present example, one can easily derive the asymptotic results

max_{ij} E_i T_j ∼ τ* ∼ E_π T_{n−1} ∼ 2ρ^{−(n−2)}/(1 − ρ)^2   (5.19)

and, by use of (5.15),

τ0 ∼ ((1 + ρ)/(1 − ρ)) n.   (5.20)

For τc, the maximizing i in (5.18) equals (1 + o(1)) n/2, and this leads to

τc → (1 + ρ)/(1 − ρ).   (5.21)

The spectral representation can be obtained using the orthogonal polynomial techniques described in Karlin and Taylor [209] Chapter 10; see especially Section 5(b) there. The reader may verify that the eigenvalues of P in discrete time are 1, −1, and, for m = 1, . . . , n − 2,

(2ρ^{1/2}/(1 + ρ)) cos θ_m,   where θ_m ≡ mπ/(n − 1),

with (unnormalized) right eigenvector

ρ^{−i/2} (2 cos(iθ_m) − (1 − ρ) sin((i + 1)θ_m)/sin(θ_m)),   i = 0, . . . , n − 1.

In particular,

τ2 = [1 − (2ρ^{1/2}/(1 + ρ)) cos(π/(n − 1))]^{−1} → (1 + ρ)/(1 − ρ^{1/2})^2.   (5.22)
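The eigenvalue/eigenvector pair can be verified directly, row by row, since both the cosine and sine parts of the eigenvector satisfy the same three-term recurrence; a Python sketch (added in editing; the parameter values are arbitrary, and the boundary rows use p_{0,1} = p_{n−1,n−2} = 1 as in Section 5.1.2 with w_0 = w_n = 0):

```python
import math

def check_biased_walk_spectrum(n, rho, m, tol=1e-9):
    """Check that lambda = (2 rho^(1/2)/(1+rho)) cos(theta_m), with the stated
    right eigenvector, satisfies P v = lambda v for the reflecting biased walk."""
    p, q = rho / (1 + rho), 1 / (1 + rho)   # right / left step probabilities
    theta = m * math.pi / (n - 1)
    lam = 2 * math.sqrt(rho) * math.cos(theta) / (1 + rho)
    def v(i):
        return rho ** (-i / 2) * (2 * math.cos(i * theta)
                - (1 - rho) * math.sin((i + 1) * theta) / math.sin(theta))
    for i in range(n):
        if i == 0:
            Pv = v(1)                       # reflecting: p_{0,1} = 1
        elif i == n - 1:
            Pv = v(n - 2)                   # reflecting: p_{n-1,n-2} = 1
        else:
            Pv = q * v(i - 1) + p * v(i + 1)
        assert abs(Pv - lam * v(i)) < tol
    return lam

lam = check_biased_walk_spectrum(n=12, rho=0.5, m=1)
```

The same check passes for every m = 1, . . . , n − 2, confirming in particular the formula (5.22) for τ2.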

The random walk has drift p − q = −(1 − ρ)/(1 + ρ) ≡ −µ. It is not hard to show for fixed t > 0 that the distances d̄_n(tn) and d_n(tn) of Chapter 4 yyy converge to 1 if t < µ and to 0 if t > µ.
jjj include details? In fact, the cutoff occurs at µn + c_ρ n^{1/2}: cf. (e.g.) Example 4.46 in [115]. Continue same paragraph:
In particular,

τ1 ∼ ((1 − ρ)/(1 + ρ)) n.   (5.23)

Example 5.6 The M/M/1 queue.



We consider the M/M/1/(n − 1) queue. Customers queue up at a facility


to wait for a single server (hence the “1”) and are handled according to a
“first come, first served” queuing discipline. The first “M” specifies that
the arrival point process is Markovian, i.e., a Poisson process with intensity
parameter λ (say); likewise, the second “M” reflects our assumption that
the service times are exponential with parameter µ (say). The parameter
n − 1 is the queue size limit; customers arriving when the queue is full are
turned away.
We have described a continuous-time birth-and-death process with con-
stant birth and death rates λ and µ, respectively. If λ + µ = 1, this is nearly
the continuized biased random walk of Example 5.5, the only difference being
in the boundary behavior. In particular, one can check that the asymptotics
in (5.19)–(5.23) remain unchanged, where ρ ≡ λ/µ, called the traffic inten-
sity, remains fixed and n becomes large. For the M/M/1/(n − 1) queue, the
stationary distribution is the conditional distribution of G − 1 given G ≤ n,
where G has the Geometric(1 − ρ) distribution. The eigenvalues are 1 and,
for m = 1, . . . , n − 1,

(2ρ^{1/2}/(1 + ρ)) cos θ_m,   where now θ_m ≡ mπ/n,

with (unnormalized) right eigenvector

(2ρ^{−i/2}/(1 + ρ)) (cos(iθ_m) + (ρ^{1/2} cos θ_m − 1) sin((i + 1)θ_m)/sin(θ_m)),   i = 0, . . . , n − 1.

5.2 Special graphs


In this section we record results about some specific easy-to-analyze graphs.
As in Section 5.1.3, we focus on the parameters τ ∗ , τ0 , τ1 , τ2 , τc discussed
in Chapter 4; orders of magnitudes of these parameters (in the asymptotic
setting discussed with each example) are summarized in terms of n, the
number of vertices, in the following table. A minor theme is that some of
the graphs are known or conjectured to be extremal for our parameters.
In the context of extremality we ignore the parameter τ1 since its exact
definition is a little arbitrary.
jjj David: (1) Shall I add complete bipartite to table? (2) Please fill in
missing entries for torus.

Orders of magnitude of parameters [τ = Θ(entry)] for special graphs.



Example                      τ*       τ0        τ1        τ2        τc
5.7. cycle                   n^2      n^2       n^2       n^2       n
5.8. path                    n^2      n^2       n^2       n^2       n
5.9. complete graph          n        n         1         1         1
5.10. star                   n        n         1         1         1
5.11. barbell                n^3      n^3       n^3       n^3       n^2
5.12. lollipop               n^3      n^2       n^2       n^2       n
5.13. necklace               n^2      n^2       n^2       n^2       n
5.14. balanced r-tree        n log n  n log n   n         n         n
5.15. d-cube (d = log_2 n)   n        n         d log d   d         d
5.16. dense regular graphs   n        n         1         1         1
5.17. d-dimensional torus
      d = 2                  jjj?     n log n   n^{2/d}   n^{2/d}   jjj? n^{1/d}
      d ≥ 3                  jjj?     n         n^{2/d}   n^{2/d}   jjj? n^{1/d}
5.19. rook’s walk            n        n         1         1         1

In simpler cases we also record the t-step transition probabilities Pi (Xt =


j) in discrete and continuous time. In fact one could write out exact expres-
sions for Pi (Xt = j) and indeed for hitting time distributions in almost all
these examples, but complicated exact expressions are seldom very illumi-
nating.
qqq names of graphs vary—suggestions for “standard names” from read-
ers of drafts are welcome.

Example 5.7 The n-cycle.

This is just the graph 0 – 1 – 2 – · · · – (n−1) – 0 on n vertices. By rotational


symmetry, it is enough to give formulas for random walk started at 0. If
(X̂t ) is random walk on (all) the integers, then Xt = φ(X̂t ) is random walk
on the n-cycle, for
φ(i) = i mod n.
Thus results for the n-cycle can be deduced from results for the integers.
For instance,
E0 Ti = i(n − i) (5.24)
by Lemma 5.2, because this is the mean exit time from (i − n, i) for random
walk on the integers. We can now calculate

max_{ij} E_i T_j = ⌊n^2/4⌋

τ* ≡ max_{ij} (E_i T_j + E_j T_i) = 2⌊n^2/4⌋   (5.25)

τ0 = n^{−1} Σ_j E_0 T_j = (n^2 − 1)/6   (5.26)

where for the final equality we used the formula


Σ_{m=1}^{n} m^2 = n^3/3 + n^2/2 + n/6.

As at (5.9) and (5.10) we can get an expression for the distribution


at time t from the Binomial distribution (in discrete time) or the Poisson
distribution (in continuous time). The former is

P_0(X_t = i) = Σ_{j: 2j−t ≡ i (mod n)} 2^{−t} t! / (j! (t − j)!).

A more useful expression is obtained from the spectral representation. The


n eigenvalues of the transition matrix are cos(2πm/n), 0 ≤ m ≤ n − 1. That is, 1 and (if n is even) −1 are simple eigenvalues, with respective normalized eigenvectors u_{i0} = 1/√n and u_{i,n/2} = (−1)^i/√n (0 ≤ i ≤ n − 1). The multiplicity of cos(2πm/n) is 2 for 0 < m < n/2; the corresponding normalized eigenvectors are u_{im} = √(2/n) cos(2πim/n) and u_{i,−m} = √(2/n) sin(2πim/n) (0 ≤ i ≤ n − 1). Thus

P_0(X_t = i) = (1/n) Σ_{m=0}^{n−1} (cos(2πm/n))^t cos(2πim/n),

a fact most easily derived using Fourier analysis.


jjj Cite Diaconis book [112]? Continue same paragraph:
So the relaxation time is

τ2 = 1/(1 − cos(2π/n)) ∼ n^2/(2π^2).

As an aside, note that the eigentime identity (Chapter 3 yyy) gives the curious identity

(n^2 − 1)/6 = Σ_{m=1}^{n−1} 1/(1 − cos(2πm/n))

whose n → ∞ limit is the well-known formula Σ_{m=1}^{∞} m^{−2} = π^2/6.
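The identity is easy to confirm numerically (a short sketch added in editing):

```python
import math

# Eigentime identity for the n-cycle:
#   (n^2 - 1)/6 = sum_{m=1}^{n-1} 1/(1 - cos(2 pi m / n))
for n in [3, 4, 5, 10, 101]:
    lhs = (n * n - 1) / 6
    rhs = sum(1 / (1 - math.cos(2 * math.pi * m / n)) for m in range(1, n))
    assert abs(lhs - rhs) < 1e-7 * n * n
```

For n = 4, for instance, the right side is 1 + 1/2 + 1 = 5/2 = (16 − 1)/6.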

If n is even, the discrete-time random walk is periodic. This parity


problem vanishes in continuous time, for which we have the formula

P_0(X(t) = i) = (1/n) Σ_{m=0}^{n−1} exp(−t(1 − cos(2πm/n))) cos(2πim/n).   (5.27)

Turning to total variation convergence, we remain in continuous time


and consider the distance functions d̄_n(t) and d_n(t) of Chapter 4 yyy. The
reader familiar with the notion of weak convergence of random walks to
Brownian motion (on the circle, in this setting) will see immediately that

d̄_n(tn^2) → d̄_∞(t)

where the limit is “d̄ for Brownian motion on the circle”, which can be written as

d̄_∞(t) ≡ 1 − 2P((t^{1/2} Z) mod 1 ∈ (1/4, 3/4))

where Z has the standard Normal distribution. So

τ1 ∼ cn^2

for the constant c such that d̄_∞(c) = e^{−1}, whose numerical value c ≐ 0.063 has no real significance.
jjj David: You got 0.054. Please check. Continue same paragraph:
Similarly

d_n(tn^2) → d_∞(t) ≡ (1/2) ∫_0^1 |f_t(u) − 1| du,

where f_t is the density of (t^{1/2} Z) mod 1.
As for τc, the sup in its definition is attained by some A of the form [0, i − 1], so

τc = max_i [ (i/n)(1 − i/n) / (1/n) ] = (1/n) ⌊n^2/4⌋ ∼ n/4.
As remarked in Chapter 4 yyy, this provides a counter-example to reversing inequalities in Theorem yyy. But if we consider max_A(π(A) E_π T_A), the max is attained with A = [n/2 − αn, n/2 + αn], say, where 0 ≤ α < 1/2. By Lemma 5.2, for x ∈ (−1/2 + α, 1/2 − α),

E_{⌊(x mod 1)n⌋} T_A ∼ (1/2 − α − x)(1/2 − α + x) n^2,

and so

E_π T_A ∼ n^2 ∫_{−1/2+α}^{1/2−α} (1/2 − α − x)(1/2 − α + x) dx = 4(1/2 − α)^3 n^2 / 3.

Thus

max_A (π(A) E_π T_A) ∼ n^2 sup_{0<α<1/2} 2α · 4(1/2 − α)^3 / 3 = 9n^2/512,
consistent with Chapter 4 Open Problem yyy.
xxx level of detail for d̄ results, here and later.
Remark. Parameters τ*, τ0, τ1, and τ2 are all Θ(n^2) in this example, and in Chapter 6 we’ll see that they are O(n^2) over the class of regular graphs. However, the exact maximum values over all n-vertex regular graphs (or the constants c in the ∼ cn^2 asymptotics) are not known. See Chapter 6 for the
natural conjectures.
Example 5.8 The n-path.
This is just the graph 0 – 1 – 2 – · · · – (n − 1) on n vertices. If (X̂_t) is random walk on (all) the integers, then X_t = φ(X̂_t) is random walk on the n-path, for the “concertina” map

φ(i) = i mod 2(n − 1),                  if i mod 2(n − 1) ≤ n − 1
φ(i) = 2(n − 1) − (i mod 2(n − 1)),     otherwise.

Of course the stationary distribution is not quite uniform:

π_i = 1/(n − 1),   1 ≤ i ≤ n − 2;   π_0 = π_{n−1} = 1/(2(n − 1)).
Again, results for the n-path can be deduced from results for the integers.
Using Lemma 5.2,
Ei Tj = (j − i)(j + i), 0 ≤ i < j ≤ n − 1. (5.28)
From this, or from the more general results in Section 5.1.2, we obtain

max_{ij} E_i T_j = (n − 1)^2   (5.29)

τ* ≡ max_{ij} (E_i T_j + E_j T_i) = 2(n − 1)^2   (5.30)

τ0 = Σ_j π_j E_0 T_j = (1/3)(n − 1)^2 + 1/6   (5.31)

We can also regard Xt as being derived from random walk X̃t on the
(2n − 2)-cycle via Xt = min(X̃t , 2n − 2 − X̃t ). So we can deduce the spectral
representation from that in the previous example:
P_i(X_t = j) = √(π_j/π_i) Σ_{m=0}^{n−1} λ_m^t u_{im} u_{jm}

where, for 0 ≤ m ≤ n − 1,

λm = cos(πm/(n − 1))

and
√ √
u0m = πm ; un−1,m = πm (−1)m ;

uim = 2πm cos(πim/(n − 1)), 1 ≤ i ≤ n − 2.
In particular, the relaxation time is

1 2n2
τ2 = ∼ 2 .
1 − cos(π/(n − 1)) π
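The spectral representation above can be checked against brute-force matrix powers; the following added sketch does so for n = 7 (all helper names are ours):

```python
# Added check: spectral representation for the n-path vs direct powers of P.
import math

n = 7
pi = [1 / (2 * (n - 1)) if i in (0, n - 1) else 1 / (n - 1) for i in range(n)]
lam = [math.cos(math.pi * m / (n - 1)) for m in range(n)]

def u(i, m):
    if i == 0:
        return math.sqrt(pi[m])
    if i == n - 1:
        return math.sqrt(pi[m]) * (-1) ** m
    return math.sqrt(2 * pi[m]) * math.cos(math.pi * i * m / (n - 1))

def p_spec(i, j, t):
    s = sum(lam[m] ** t * u(i, m) * u(j, m) for m in range(n))
    return math.sqrt(pi[j] / pi[i]) * s

P = [[0.0] * n for _ in range(n)]
for i in range(n):
    nbrs = [k for k in (i - 1, i + 1) if 0 <= k < n]
    for k in nbrs:
        P[i][k] = 1 / len(nbrs)

def p_exact(i, t):
    row = [1.0 if k == i else 0.0 for k in range(n)]
    for _ in range(t):
        row = [sum(row[a] * P[a][b] for a in range(n)) for b in range(n)]
    return row

t = 6
err = max(abs(p_spec(i, j, t) - p_exact(i, t)[j])
          for i in range(n) for j in range(n))
assert err < 1e-9
```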

Furthermore, d̄n (t) = d̄2n−2 (t) and dn (t) = d2n−2 (t) for all t (the right
sides being the corresponding functions for the (2n − 2)-cycle), so

d̄n (t(2n)²) → d̄∞ (t),   dn (t(2n)²) → d∞ (t)

where the limits are those in the previous example. Thus τ1 ∼ cn², where
c ≐ 0.25 is 4 times the corresponding constant for the n-cycle.
xxx explain: BM on [0, 1] and circle described in Chapter 16.
It is easy to see that

τc = (n − 1)/2                        if n is even
   = (n − 1)/2 − 1/(2(n − 1))         if n is odd.

In Section 5.3.2 we will see that the n-path attains the exact maximum
values of our parameters over all n-vertex trees.

Example 5.9 The complete graph.

For the complete graph on n vertices, the t-step probabilities for the chain
started at i can be calculated by considering the induced 2-state chain which
indicates whether or not the walk is at i. This gives, in discrete time,

Pi (Xt = i) = 1/n + (1 − 1/n)(−1/(n − 1))^t

Pi (Xt = j) = 1/n − (1/n)(−1/(n − 1))^t ,   j ≠ i          (5.32)

and, in continuous time,

Pi (Xt = i) = 1/n + (1 − 1/n) exp(−nt/(n − 1))

Pi (Xt = j) = 1/n − (1/n) exp(−nt/(n − 1)),   j ≠ i        (5.33)

It is clear that the hitting time to j ≠ i has Geometric(1/(n − 1)) distri-


bution (in continuous time, Exponential(1/(n − 1)) distribution), and so in
particular
Ei Tj = n − 1,   j ≠ i.                                     (5.34)

Thus we can calculate the parameters

τ ∗ ≡ max_{ij} (Ej Ti + Ei Tj ) = 2(n − 1)                  (5.35)

max_{ij} Ei Tj = n − 1                                      (5.36)

τ0 = n^{−1} Σ_j Ei Tj = (n − 1)²/n.                         (5.37)

From (5.32) the discrete-time eigenvalues are λ2 = λ3 = · · · = λn = −1/(n − 1),
so the relaxation time is

τ2 = (n − 1)/n.                                             (5.38)
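Formula (5.32) comes from the two-state induced chain and is easy to confirm by matrix powering; this added snippet (purely illustrative) checks it for n = 6:

```python
# Added check of the t-step formula (5.32) on the complete graph K_n.
n, t = 6, 5
P = [[0.0 if i == j else 1 / (n - 1) for j in range(n)] for i in range(n)]
row = [1.0] + [0.0] * (n - 1)          # start at vertex 0
for _ in range(t):
    row = [sum(row[a] * P[a][b] for a in range(n)) for b in range(n)]
same = 1 / n + (1 - 1 / n) * (-1 / (n - 1)) ** t
other = 1 / n - (1 / n) * (-1 / (n - 1)) ** t
assert abs(row[0] - same) < 1e-12 and abs(row[1] - other) < 1e-12
```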

The total variation functions are

d̄(t) = exp(−nt/(n − 1)),   d(t) = ((n − 1)/n) exp(−nt/(n − 1)),

so

τ1 = (n − 1)/n.                                             (5.39)

It is easy to check
τc = (n − 1)/n.

We have already proved (Chapter 3 yyy) that the complete graph attains
the exact minimum of τ ∗ , maxij Ei Tj , τ0 , and τ2 over all n-vertex graphs.
The same holds for τc , by considering (in an arbitrary graph) a vertex of
minimum degree.

Example 5.10 The n-star.



This is the graph on n ≥ 3 vertices {0, 1, 2, . . . , n − 1} with edges 0 – 1, 0 –


2, 0 – 3, . . . , 0 – (n − 1). The stationary distribution is
π0 = 1/2, πi = 1/(2(n − 1)), i ≥ 1.
In discrete time the walk is periodic. Starting from the leaf 1, the walk at
even times is simply independent and uniform on the leaves, so
P1 (X2t = i) = 1/(n − 1), i ≥ 1
for t ≥ 1. At odd times, the walk is at 0. Writing these t-step probabilities
as

P1 (Xt = i) = (1/(2(n − 1)))(1 + (−1)^t ) 1(i≥1) + (1/2)(1 + (−1)^{t+1} ) 1(i=0) ,   t ≥ 1,
we see that the discrete-time eigenvalues are λ2 = · · · = λn−1 = 0, λn = −1
and hence the relaxation time is
τ2 = 1.
The mean hitting times are
E1 T0 = 1
E1 Tj = 2(n − 1), j ≥ 2,
where the latter comes from the fact that Tj /2 has Geometric(1/(n−1)) dis-
tribution, using the “independent uniform on leaves at even times” property.
Then
E0 T1 = 2n − 3.
We can calculate the parameters

τ ∗ ≡ max_{ij} (Ei Tj + Ej Ti ) = 4n − 4                    (5.40)

max_{ij} Ei Tj = 2n − 2                                     (5.41)

τ0 = Σ_j πj E0 Tj = n − 3/2.                                (5.42)
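These n-star hitting times all follow from one-step recurrences, which the following added sketch verifies exactly (star_quantities is an illustrative helper of ours):

```python
# Added exact verification of the n-star mean hitting times and (5.42).
from fractions import Fraction

def star_quantities(n):
    # E_0 T_1 satisfies x = 1 + ((n-2)/(n-1))(1 + x), with solution 2n - 3.
    x = Fraction(2 * n - 3)
    assert x == 1 + Fraction(n - 2, n - 1) * (1 + x)   # check the recurrence
    e0t1 = x
    e1t2 = 1 + e0t1                    # leaf -> centre (1 step) -> other leaf
    pi_leaf = Fraction(1, 2 * (n - 1))
    tau0 = (n - 1) * pi_leaf * e0t1    # E_0 T_j = 2n - 3 for every leaf j
    return e0t1, e1t2, tau0

e0t1, e1t2, tau0 = star_quantities(6)
assert (e0t1, e1t2) == (9, 10)         # 2n - 3 and 2n - 2 for n = 6
assert tau0 == Fraction(9, 2)          # n - 3/2 for n = 6
```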
In continuous time we find

P1 (Xt = 1) = (1/(2(n − 1)))(1 + e^{−2t} ) + ((n − 2)/(n − 1)) e^{−t}

P1 (Xt = i) = (1/(2(n − 1)))(1 + e^{−2t} ) − (1/(n − 1)) e^{−t} ,   i > 1

P1 (Xt = 0) = (1/2)(1 − e^{−2t} )

P0 (Xt = 0) = (1/2)(1 + e^{−2t} )

P0 (Xt = 1) = (1/(2(n − 1)))(1 − e^{−2t} ).

This leads to

d̄(t) = e^{−t} ,   d(t) = (1/(2(n − 1))) e^{−2t} + ((n − 2)/(n − 1)) e^{−t} ,

from which

τ1 = 1.

Clearly (isolate a leaf)

τc = 1 − 1/(2(n − 1)).
We shall see in Section 5.3.2 that the n-star attains the exact minimum
of our parameters over all n-vertex trees.

Example 5.11 The barbell.

Here is a graph on n = 2m1 + m2 vertices (m1 ≥ 2, m2 ≥ 0). Start with


two complete graphs on m1 vertices. Distinguish vertices vl ≠ vL in one
graph (“the left bell”) and vertices vR ≠ vr in the other graph (“the right
bell”). Then connect the graphs via a path vL – w1 – w2 – · · · – wm2 – vR
containing m2 new vertices.
xxx picture
A point of the construction is that the mean time to go from a typical
point vl in the left bell to a typical point vr in the right bell is roughly m1²m2 .
To argue this informally, it takes mean time about m1 to hit vL ; then there
is chance 1/m1 to hit w1 , so it takes mean time about m1² to hit w1 ; and
from w1 there is chance about 1/m2 to hit the right bell before returning
into the left bell, so it takes mean time about m1²m2 to enter the right bell.
The exact result, argued below, is

max_{ij} Ei Tj = Evl Tvr = m1 (m1 − 1)(m2 + 1) + (m2 + 1)² + 4(m1 − 1) + 4(m2 + 1)/m1 .   (5.43)
It is cleaner to consider asymptotics as

n → ∞, m1 /n → α, m2 /n → 1 − 2α

with 0 < α < 1/2. Then

max_{ij} Ei Tj ∼ α²(1 − 2α)n³ ∼ n³/27 for α = 1/3

where α = 1/3 is the asymptotic maximizer here and for the other parame-
ters below. Similarly

τ ∗ ∼ 2α²(1 − 2α)n³ ∼ 2n³/27 for α = 1/3.
The stationary distribution π puts mass → 1/2 on each bell. Also, by (5.45)–
(5.47) below, Evl Tv = o(Evl Tvr ) uniformly for vertices v in the left bell and
Evl Tv ∼ Evl Tvr ∼ α2 (1 − 2α)n3 uniformly for vertices v in the right bell.
Hence

τ0 ≡ Σ_v πv Evl Tv ∼ (1/2) Evl Tvr ∼ (1/2) α²(1 − 2α)n³

and so we have proved the “τ0 ” part of

each of {τ0 , τ1 , τ2 } ∼ (1/2) α²(1 − 2α)n³ ∼ n³/54 for α = 1/3.   (5.44)
Consider the relaxation time τ2 . For the function g defined to be +1 on the
left bell, −1 on the right bell and linear on the bar, the Dirichlet form gives

E(g, g) = 2/((m2 + 1)(m1 (m1 − 1) + m2 + 1)) ∼ 2/(α²(1 − 2α)n³).

Since the variance of g tends to 1, the extremal characterization of τ2 shows


that (1/2)α²(1 − 2α)n³ is an asymptotic lower bound for τ2 . But in general
τ2 ≤ τ0 , so having already proved (5.44) for τ0 we must have the same
asymptotics for τ2 . Finally, without going into details, it is not hard to
show that for fixed t > 0,

d̄n ((1/2)α²(1 − 2α)n³ t) → e^{−t} ,   dn ((1/2)α²(1 − 2α)n³ t) → e^{−t} ,

from which the “τ1 ” assertion of (5.44) follows.


jjj Proof? (It’s not so terrifically easy, either! How much do we want
to include?) I’ve (prior to writing this) carefully written out an argument
similar to the present one, also involving approximate exponentiality of a
hitting time distribution, for the balanced r-tree below. Here is a rough
sketch for the argument for d¯ here; note that it uses results about the next
example (the lollipop). (The argument for d is similar.) The pair (vl , vr )
of initial states achieves d̄(t) for every t (although the following can be

made to work without knowing this “obvious fact” a priori). Couple chains
starting in these states by having them move symmetrically in the obvious
fashion. Certainly these copies will couple by the time T the copy started
at vl has reached the center vertex wm2 /2 of the bar. We claim that the
distribution of T is approximately exponential, and its expected value is
∼ (1/2)m1²m2 ∼ (1/2)α²(1 − 2α)n³ by the first displayed result for the lollipop
example, with m2 changed to m2 /2 there. (In keeping with this observation,
I’ll refer to the “half-stick” lollipop in the next paragraph.)
jjj (cont.) To get approximate exponentiality for the distribution of
T , first argue easily that it’s approximately the same as that of Twm2 /2 for
the half-stick lollipop started in stationarity. But that distribution is, in
turn, approximately exponential by Chapter 3 Proposition yyy, since τ2 =
Θ(n2 ) = o(n3 ) for the half-stick lollipop.
Proof of (5.43). The mean time in question is the sum of the following
mean times:

Evl TvL = m1 − 1                                            (5.45)

EvL TvR = m1 (m1 − 1)(m2 + 1) + (m2 + 1)²                   (5.46)

EvR Tvr = 3(m1 − 1) + 4(m2 + 1)/m1 .                        (5.47)
Here (5.45) is just the result (5.34) for the complete graph. And (5.46) is
obtained by summing over the edges of the “bar” the expression

Ewi Twi+1 = m1 (m1 − 1) + 2i + 1, i = 0, . . . , m2 (5.48)

obtained from the general formula for mean hitting time across an essen-
tial edge of a graph (Lemma 5.1), where w0 = vL and wm2 +1 = vR . To
argue (5.47), we start with the 1-step recurrence
EvR Tvr = 1 + (1/m1 ) Ewm2 Tvr + ((m1 − 2)/m1 ) Ex Tvr
where x denotes a vertex of the right bell distinct from vR and vr . Now

Ewm2 Tvr = m1 (m1 − 1) + 2m2 + 1 + EvR Tvr

using the formula (5.48) for the mean passage time from wm2 to vR . Starting
from x, the time until a hit on either vR or vr has Geometric(2/(m1 − 1))
distribution, and the two vertices are equally likely to be hit first. So

Ex Tvr = (m1 − 1)/2 + (1/2) EvR Tvr .



The last three expressions give an equation for EvR Tvr whose solution is (5.47).
And it is straightforward to check that Evl Tvr does achieve the maximum,
using (5.45)–(5.47) to bound the general Ei Tj .
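Formula (5.43), and the three displays (5.45)–(5.47) above, can be confirmed exactly on a small barbell. This added sketch (hit_times is our helper) takes m1 = 4, m2 = 3, where (5.43) evaluates to 80:

```python
# Added exact verification of (5.43) and (5.45)-(5.47) on a small barbell.
from fractions import Fraction

def hit_times(adj, target):
    states = [v for v in adj if v != target]
    ix = {v: k for k, v in enumerate(states)}
    mm = len(states)
    A = [[Fraction(0)] * (mm + 1) for _ in range(mm)]
    for v in states:
        row = A[ix[v]]
        row[ix[v]] += 1
        row[mm] = Fraction(1)
        for w in adj[v]:
            if w != target:
                row[ix[w]] -= Fraction(1, len(adj[v]))
    for c in range(mm):                      # exact Gauss-Jordan elimination
        p = next(r for r in range(c, mm) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        for r in range(mm):
            if r != c and A[r][c] != 0:
                f = A[r][c] / A[c][c]
                for k in range(c, mm + 1):
                    A[r][k] -= f * A[c][k]
    return {target: Fraction(0),
            **{v: A[ix[v]][mm] / A[ix[v]][ix[v]] for v in states}}

m1, m2 = 4, 3
n = 2 * m1 + m2
adj = {v: set() for v in range(n)}
def link(a, b):
    adj[a].add(b); adj[b].add(a)
for bell in (range(m1), range(m1 + m2, n)):          # the two complete bells
    for a in bell:
        for b in bell:
            if a < b:
                link(a, b)
bar = [m1 - 1] + list(range(m1, m1 + m2)) + [m1 + m2]  # vL - w1 ... wm2 - vR
for a, b in zip(bar, bar[1:]):
    link(a, b)

vl, vL, vR, vr = 0, m1 - 1, m1 + m2, n - 1
assert hit_times(adj, vL)[vl] == m1 - 1                                   # (5.45)
assert hit_times(adj, vR)[vL] == m1 * (m1 - 1) * (m2 + 1) + (m2 + 1) ** 2  # (5.46)
h = hit_times(adj, vr)
assert h[vR] == 3 * (m1 - 1) + Fraction(4 * (m2 + 1), m1)                 # (5.47)
assert h[vl] == (m1 * (m1 - 1) * (m2 + 1) + (m2 + 1) ** 2
                 + 4 * (m1 - 1) + Fraction(4 * (m2 + 1), m1))             # (5.43)
```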
It is straightforward to check

τc ∼ α²n²/2.
Example 5.12 The lollipop.

xxx picture
This is just the barbell without the right bell. That is, we start with a
complete graph on m1 vertices and add m2 new vertices in a path. So there
are n = m1 + m2 vertices, and wm2 is now a leaf. In this example, by (5.45)
and (5.46), with m2 in place of m2 + 1, we have

max Ei Tj = Evl Twm2 = m1 (m1 − 1)m2 + (m1 − 1) + m22 .


ij

In the asymptotic setting with

n → ∞, m1 /n → α, m2 /n → 1 − α

where 0 < α < 1, we have

max_{ij} Ei Tj ∼ α²(1 − α)n³ ∼ 4n³/27 for α = 2/3,          (5.49)
where α = 2/3 gives the asymptotic maximum.
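The displayed formula for max_{ij} Ei Tj can be checked numerically on a small lollipop. This added sketch (helper names ours; Gauss–Seidel sweeps on the hitting-time equations) takes m1 = 5, m2 = 4, where the formula gives 100:

```python
# Added numerical check of max_ij E_i T_j on a small lollipop.
m1, m2 = 5, 4
n = m1 + m2
adj = {v: [] for v in range(n)}
for a in range(m1):                     # complete graph on 0..m1-1
    for b in range(m1):
        if a != b:
            adj[a].append(b)
prev = m1 - 1                           # vL = m1 - 1
for w in range(m1, n):                  # the bar vL - w1 - ... - wm2
    adj[prev].append(w)
    adj[w].append(prev)
    prev = w

def hit_times(adj, target):
    """Gauss-Seidel solve of h(v) = 1 + avg over neighbours, h(target) = 0."""
    ht = {v: 0.0 for v in adj}
    for _ in range(100000):
        delta = 0.0
        for v in adj:
            if v == target:
                continue
            new = 1 + sum(ht[w] for w in adj[v]) / len(adj[v])
            delta = max(delta, abs(new - ht[v]))
            ht[v] = new
        if delta < 1e-12:
            break
    return ht

best = max(max(hit_times(adj, t).values()) for t in adj)
claimed = m1 * (m1 - 1) * m2 + (m1 - 1) + m2 ** 2        # = 100 here
assert abs(best - claimed) < 1e-6
assert abs(hit_times(adj, m1 - 1)[n - 1] - m2 ** 2) < 1e-6   # E_{wm2} T_{vL} = m2^2
```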
Let us discuss the other parameters only briefly, in the asymptotic set-
ting. Clearly Ewm2 TvL = m2² ∼ (1 − α)²n² and it is not hard to check

Ewm2 Tvl = max_v Ev Tvl ∼ (1 − α)²n² ,                      (5.50)

whence

τ ∗ = max_{ij} (Ei Tj + Ej Ti ) = Evl Twm2 + Ewm2 Tvl ∼ α²(1 − α)n³ .

Because the stationary distribution puts mass Θ(1/n) on the “bar”, (5.50) is
also enough to show that τ0 = O(n2 ). So by the general inequalities between
our parameters, to show

each of {τ0 , τ1 , τ2 } = Θ(n²)                             (5.51)


it is enough to show that τ2 = Ω(n2 ). But for the function g defined to be


0 on the “bell”, 1 at the end wm2 of the “bar,” and linear along the bar, a
brief calculation gives

E(g, g) = Θ(n^{−3} ),   var_π g = Θ(n^{−1} ),

so that τ2 ≥ (var_π g)/E(g, g) = Ω(n²), as required.


Finally, in the asymptotic setting it is straightforward to check that τc
is achieved by A = {w1 , . . . , wm2 }, giving

τc ∼ 2(1 − α)n.

Remark. The barbell and lollipop are the natural candidates for the n-
vertex graphs which maximize our parameters. The precise conjectures and
known results will be discussed in Chapter 6.
jjj We need to put somewhere—Chapter 4 on τc ? Chapter 6 on max
parameters over n-vertex graphs? in the barbell example?—the fact that
max τc is attained, when n is even, by the barbell with m2 = 0, the max
value being (n² − 2n + 4)/8 ∼ n²/8. Similarly, when n is odd, the maximizing
graph is formed by joining complete graphs on ⌊n/2⌋ and ⌈n/2⌉ vertices
respectively by a single edge, and the max value is easy to write down (I’ve
kept a record) but not so pretty; however, this value too is ∼ n2 /8, which
is probably all we want to say anyway. Here is the first draft of a proof:
For random walk on an unweighted graph, τc is the maximum over
nonempty proper subsets A of the ratio

(deg A)(deg Ac ) / (2|E| (A, Ac )),                         (5.52)

where deg A is defined to be the sum of the degrees of vertices in A and
(A, Ac ) is the number of directed cut edges from A to Ac .
jjj Perhaps it would be better for exposition to stick with undirected
edges and introduce factor 1/2?
Maximizing now over choice of graphs, the max in question is no larger
than the maximum M , over all choices of n1 > 0, n2 > 0, e1 , e2 , and e0
satisfying n1 + n2 = n and 0 ≤ ei ≤ ni (ni − 1)/2 for i = 1, 2 and 1 ≤ e0 ≤ n1 n2 ,
of the ratio

(2e1 + e0 )(2e2 + e0 ) / (2(e1 + e2 + e0 )e0 ).             (5.53)
(We don’t claim equality because we don’t check that each ni -graph is con-
nected. But we’ll show that M is in fact achieved by the connected graph
claimed above.)

Simple calculus shows that the ratio (5.53) is (as one would expect)
increasing in e1 and e2 and decreasing in e0 . Thus, for given n1 , (5.53) is
maximized by considering complete graphs of size n1 and n2 = n − n1 joined
by a single edge. Call the maximum value M (n1 ). If n is even, it is then
easy to see that M (n1 ) is maximized by n1 = n/2, giving M = (n² − 2n + 4)/8,
as desired.
For the record, here are the slightly tricky details if n is odd. Write
ν = n/2 and n1 = ν − y and put x = y². A short calculation gives M (n1 ) =
1 + g(x), where g(x) ≡ [(a − x)(b − x) − 1]/(2x + c) with a = ν², b = (ν − 1)²,
and c = 2ν(ν − 1) + 2. Easy calculus shows that g is U -shaped over [0, ν²] and
then that g(1/4) ≥ g(ν²). Thus M (n1 ) is maximized when n1 = ν − 1/2 =
⌊n/2⌋.
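The maximization of (5.53) is easy to confirm by brute force. The following added sketch fixes e0 = 1 and complete bells (justified by the monotonicity noted above) and checks that n1 = n/2 is optimal for even n, the maximum working out to (n² − 2n + 4)/8 ∼ n²/8:

```python
# Added brute-force check of the maximization of the ratio (5.53).
from fractions import Fraction

def M(n):
    """Best n1 and the maximum of (5.53), with e0 = 1 and complete bells."""
    vals = {}
    for n1 in range(1, n):
        e1 = n1 * (n1 - 1) // 2
        e2 = (n - n1) * (n - n1 - 1) // 2
        e0 = 1
        vals[n1] = Fraction((2 * e1 + e0) * (2 * e2 + e0),
                            2 * (e1 + e2 + e0) * e0)
    best = max(vals, key=vals.get)
    return best, vals[best]

for n in (4, 10, 20):
    n1, val = M(n)
    assert n1 == n // 2
    assert val == Fraction(n * n - 2 * n + 4, 8)
```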

Example 5.13 The necklace.

This graph, pictured below, is 3-regular with n = 4m + 2 vertices, consisting


of m subgraphs linked in a line, the two end subgraphs being different from
the intervening ones. This is an artificial graph designed to mimic the n-
path while being regular, and this construction (or some similar one) is the
natural candidate for the n-vertex graph which asymptotically maximizes
our parameters over regular graphs.

xxx picture (vertices a, b, c, d, e in the left end subgraph; f , g in the
linking subgraphs; h at the far end; m − 2 repeats of the middle subgraph)
This example affords a nice illustration of use of the commute interpre-
tation of resistance. Applying voltage 1 at vertex a and voltage 0 at e, a
brief calculation gives the potentials at intervening vertices as

g(b) = 4/7, g(c) = 5/7, g(d) = 4/7

and gives the effective resistance rae = 7/8. Since the effective resistance
between f and g equals 1, we see the maximal effective resistance is
rah = 7/8 + (2m − 3) + 7/8 = 2m − 5/4.

So

τ ∗ = Ea Th + Eh Ta = 3 × (4m + 2) × (2m − 5/4) ∼ 3n²/2.

One could do elementary exact calculations of other parameters, but it is


simpler to get asymptotics from the Brownian motion limit, which implies
that the asymptotic ratio of each parameter (excluding τc ) in this example
and the n-path is the same for each parameter. So

τ0 ∼ n²/4,   τ2 ∼ 3n²/(2π²).
jjj I haven’t checked this carefully, and I also have abstained from writing
anything further about τ1 .
Finally, it is clear that τc ∼ 3n/4, achieved by breaking a “link” between
“beads” in the middle of the necklace.

Example 5.14 The balanced r-tree.

Take r ≥ 2 and h ≥ 1. The balanced r-tree is the rooted tree where all leaves
are at distance h from the root, where the root has degree r, and where the
other internal vertices have degree r + 1. Call h the height of the tree. For
h = 1 we have the (r + 1)-star, and for r = 2 we have the balanced binary
tree. The number of vertices is

n = 1 + r + r² + · · · + r^h = (r^{h+1} − 1)/(r − 1).

The chain X̂ induced (in the sense of Chapter 4 Section yyy) by the
function
f (i) = h − (distance from i to the root)
is random walk on {0, . . . , h}, biased towards 0, with reflecting barriers, as
in Example 5.5 with
ρ = 1/r.
In fact, the transition probabilities for X can be expressed in terms of X̂
as follows. Given vertices v1 and v2 with f (v1 ) = f1 and f (v2 ) = f2 ,
the paths [root, v1 ] and [root, v2 ] intersect in the path [root, v3 ], say, with
f (v3 ) = f3 ≥ f1 ∨ f2 . Then

Pv1 (Xt = v2 ) = Σ_{m=f3}^{h} Pf1 ( max_{0≤s≤t} X̂s = m, X̂t = f2 ) r^{−(m−f2 )} .

As a special case, suppose that v1 is on the path from the root to v2 ; in


this case v3 = v1 . Using the essential edge lemma (or Theorem 5.20 below)
we can calculate

Ev2 Tv1 = 2(r − 1)^{−2} (r^{f1 +1} − r^{f2 +1} ) − 2(r − 1)^{−1} (f1 − f2 ) − (f1 − f2 ),

Ev1 Tv2 = 2(n − 1)(f1 − f2 ) − Ev2 Tv1 . (5.54)


Using this special case we can deduce the general formula for mean hitting
times. Indeed, Ev1 Tv2 = Ev1 Tv3 + Ev3 Tv2 , which leads to

Ev1 Tv2 = 2(n − 1)(f3 − f2 ) + 2(r − 1)^{−2} (r^{f2 +1} − r^{f1 +1} )
          − 2(r − 1)^{−1} (f2 − f1 ) − (f2 − f1 ).          (5.55)

The maximum value 2h(n − 1) is attained when v1 and v2 are leaves and v3
is the root. So

(1/2)τ ∗ = max_{v,x} Ev Tx = 2(n − 1)h.                     (5.56)
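The value 2(n − 1)h in (5.56), and (5.57) below, can be checked numerically on the binary tree of height 3 (r = 2, h = 3, n = 15). This added sketch solves the hitting-time equations by Gauss–Seidel sweeps (helper names ours):

```python
# Added numerical check of (5.56) and (5.57) on the balanced binary tree.
r, h = 2, 3
n = (r ** (h + 1) - 1) // (r - 1)      # 15 vertices, root = 0, heap indexing
adj = {v: [] for v in range(n)}
for v in range(1, n):
    parent = (v - 1) // r
    adj[v].append(parent)
    adj[parent].append(v)

def hit_times(adj, target):
    """Gauss-Seidel solve of h(v) = 1 + avg over neighbours, h(target) = 0."""
    ht = {v: 0.0 for v in adj}
    for _ in range(100000):
        delta = 0.0
        for v in adj:
            if v == target:
                continue
            new = 1 + sum(ht[w] for w in adj[v]) / len(adj[v])
            delta = max(delta, abs(new - ht[v]))
            ht[v] = new
        if delta < 1e-11:
            break
    return ht

best = max(max(hit_times(adj, t).values()) for t in adj)
assert abs(best - 2 * (n - 1) * h) < 1e-6       # (5.56): 2(n-1)h = 84
leaf_to_root = hit_times(adj, 0)[n - 1]
assert abs(leaf_to_root - 19) < 1e-6            # (5.57) evaluates to 19 here
```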

(The τ∗ part is simpler via (5.88) below.) Another special case is that, for
a leaf v,

Ev Troot = 2(r − 1)^{−2} (r^{h+1} − r) − 2h(r − 1)^{−1} − h ∼ 2n/(r − 1),   (5.57)

Eroot Tv = 2(n − 1)h − Ev Troot ∼ 2nh (5.58)


where asymptotics are as h → ∞ for fixed r. Since Eroot Tw is decreasing in
f (w), it follows that
τ0 = Σ_w πw Eroot Tw ≤ (1 + o(1)) 2nh.

On the other hand, we claim τ0 ≥ (1 + o(1))2nh, so that

τ0 ∼ 2nh.

To sketch the proof, given a vertex w, let v be a leaf such that w lies on
the path from v to the root. Then

Eroot Tw = Eroot Tv − Ew Tv ,

and Ew Tv ≤ 2(n − 1)f (w) by (5.54). But the stationary distribution puts
nearly all its mass on vertices w with f (w) of constant order, and n = o(nh).
We claim next that

τ1 ∼ τ2 ∼ 2n/(r − 1).

Since τ2 ≤ τ1 , it is enough to show

τ1 ≤ (1 + o(1)) 2n/(r − 1)                                  (5.59)

and

τ2 ≥ (1 + o(1)) 2n/(r − 1).                                 (5.60)
Proof of (5.59). Put

tn ≡ 2n/(r − 1)

for brevity. We begin the proof by recalling the results (5.22) and (5.19) for
the induced walk X̂:

τ̂2 → (r + 1)/(r^{1/2} − 1)² ,

Eπ̂ T̂h ∼ 2r^{h+1}/(r − 1)² ∼ tn .                           (5.61)

By Proposition yyy of Chapter 3,

sup_t | Pπ̂ (T̂h > t) − exp(−t/Eπ̂ T̂h ) | ≤ τ̂2 /Eπ̂ T̂h = Θ(n^{−1} ) = o(1).   (5.62)

For X̂ started at 0, let Ŝ be a stopping time at which the chain has exactly
the stationary distribution. Then, for 0 ≤ s ≤ t,

P0 (T̂h > t) ≤ P0 (Ŝ > s) + Pπ̂ (T̂h > t − s).

Since τ̂1^{(2)} = O(h) = O(log n) by (5.23), we can arrange to have E0 Ŝ =
O(log n). Fixing ε > 0 and choosing t = (1 + ε)tn and (say) s = (log n)², (5.62)
and (5.61) in conjunction with Markov’s inequality yield

P0 (T̂h > (1 + ε)tn ) = exp(−[(1 + ε)tn − (log n)²]/Eπ̂ T̂h ) + O((log n)^{−1} ) + O(n^{−1} )
                     → e^{−(1+ε)} .

Returning to the continuous-time walk on the tree, for n sufficiently large
we have

Pv (Troot > (1 + ε)tn ) ≤ P0 (T̂h > (1 + ε)tn ) ≤ e^{−1}

for every vertex v. Now a simple coupling argument (jjj spell out details?:
Couple the induced walks and the tree-walks will agree when the induced
walk starting farther from the origin has reached the origin) shows that

d̄n ((1 + ε)tn ) ≤ e^{−1}

for all large n. Hence τ1 ≤ (1 + ε)tn for all large n, and (5.59) follows.
Proof of (5.60).
jjj [This requires further exposition in both Chapters 3 and 4-1. In Chap-
ter 3, it needs to be made clear that one of the inequalities having to do with
CM hitting time distributions says precisely that EαA TA ≥ Eπ TA /π(Ac ) ≥
Eπ TA . In Chapter 4-1 (2/96 version), it needs to be noted that Lemma 2(a)
(concerning τ2 for the joining of two copies of a graph) extends to the joining
of any finite number of copies.]
Let G denote a balanced r-tree of height h. Let G′′ denote a balanced
r-tree of height h − 1 with root y and construct a tree G′ from G′′ by adding
an edge from y to an additional vertex z. We can construct G by joining r
copies of G′ at the vertex z, which becomes the root of G. Let π ′ and π ′′
denote the respective stationary distributions for the random walks on G′
and G′′ , and use the notation T ′ and T ′′ , respectively, for hitting times on
these graphs. By Chapter 4 jjj,

τ2 = Eα′ Tz′                                                (5.63)

where α′ is the quasistationary distribution on G′ associated with the hitting
time Tz′ . By Chapter 3 jjj, the expectation (5.63) is no smaller than Eπ′ Tz′ ,
which by the collapsing principle equals

π ′ (G′′ )(Eπ′′ Ty′′ + Ey Tz′ ) = π ′ (G′′ )(Eπ′′ Ty′′ + Ey Tz ).

But it is easy to see that this last quantity equals (1 + o(1))Eπ Tz , which is
asymptotically equivalent to 2n/(r − 1) by (5.61).
From the discussion at the beginning of Section 5.3.1, it follows that τc
is achieved at any of the r subtrees of the root. This gives

τc = (2r^h − r − 1)(2r^h − 1)/(2r^h (r − 1)) ∼ 2n/r.
An extension of the balanced r-tree example is treated in Section 5.2.1
below.

Example 5.15 The d-cube.

This is a graph with vertex-set I = {0, 1}^d and hence with n = 2^d vertices.
Write i = (i1 , . . . , id ) for a vertex, and write |i − j| = Σ_u |iu − ju |. Then (i, j)
is an edge if and only if |i − j| = 1, and in general |i − j| is the graph-distance
between i and j. Thus discrete-time random walk proceeds at each step by
choosing a coordinate at random and changing its parity.

It is easier to use the continuized walk X(t) = (X1 (t), . . . , Xd (t)), because
the component processes (Xu (t)) are independent as u varies, and each is
in fact just the continuous-time random walk on the 2-path with transition
rate 1/d. This follows from an elementary fact about the superposition of
(marked) Poisson processes.
Thus, in continuous time,

Pi (X(t) = j) = Π_{u=1}^{d} (1/2)(1 + (−1)^{|iu −ju |} e^{−2t/d} )

             = 2^{−d} (1 − e^{−2t/d} )^{|i−j|} (1 + e^{−2t/d} )^{d−|i−j|} .   (5.64)

By expanding the right side, we see that the continuous-time eigenvalues are

λk = 2k/d with multiplicity (d choose k),   k = 0, 1, . . . , d.   (5.65)

(Of course, this is just the general fact that the eigenvalues of a d-fold
product of continuous-time chains are

(λi1 + · · · + λid ; 1 ≤ i1 , . . . , id ≤ n)                (5.66)
where (λi ; 1 ≤ i ≤ n) are the eigenvalues of the marginal chain.)
In particular,
τ2 = d/2. (5.67)
By the eigentime identity (Chapter 3 yyy)

τ0 = Σ_{m≥2} 1/λm = (d/2) Σ_{k=1}^{d} (d choose k) k^{−1} = 2^d (1 + d^{−1} + O(d^{−2} )),   (5.68)

the asymptotics being easy analysis.
From (5.64) it is also straightforward to derive the discrete-time t-step
transition probabilities:

Pi (Xt = j) = 2^{−d} Σ_{m=0}^{d} (1 − 2m/d)^t Σ_r (−1)^r (|i − j| choose r)(d − |i − j| choose m − r).

Starting the walk at 0, let Yt = |X(t)|. Then Y is the birth-and-death


chain on states {0, 1, . . . , d} with transition rates (transition probabilities,
in discrete time)

qi,i+1 = (d − i)/d,   qi,i−1 = i/d,   0 ≤ i ≤ d.

xxx box picture


This is the Ehrenfest urn model mentioned in many textbooks. In our
terminology we may regard Y as random walk on the weighted linear graph
(Section 5.1.2) with weights

wi = (d − 1 choose i − 1),   w = 2^d .

In particular, writing T^Y for hitting times for Y , symmetry and (5.13) give

(1/2)τ^{∗Y} = (1/2)(E0 Td^Y + Ed T0^Y ) = E0 Td^Y = 2^{d−1} Σ_{i=1}^{d} 1/(d − 1 choose i − 1).

On the d-cube, it is “obvious” that E0 Tj is maximized by j = 1, and this
can be verified by observing in (5.64) that P0 (X(t) = j) is minimized by
j = 1, and hence Z0j is minimized by j = 1, so we can apply Chapter 2 yyy.
Thus

(1/2)τ ∗ = max_{ij} Ei Tj = E0 T1 = 2^{d−1} Σ_{i=1}^{d} 1/(d − 1 choose i − 1) ∼ 2^d (1 + 1/d + O(1/d²)).   (5.69)
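Both (5.68) and (5.69) can be confirmed exactly on the 4-cube. Since the XOR automorphism gives E0 Tj = Ej T0 , a single linear solve suffices; this sketch is an added illustration (hit_times is our helper):

```python
# Added exact verification of (5.68) and (5.69) for d = 4.
from fractions import Fraction

def hit_times(adj, target):
    states = [v for v in adj if v != target]
    ix = {v: k for k, v in enumerate(states)}
    mm = len(states)
    A = [[Fraction(0)] * (mm + 1) for _ in range(mm)]
    for v in states:
        row = A[ix[v]]
        row[ix[v]] += 1
        row[mm] = Fraction(1)
        for w in adj[v]:
            if w != target:
                row[ix[w]] -= Fraction(1, len(adj[v]))
    for c in range(mm):                      # exact Gauss-Jordan elimination
        p = next(r for r in range(c, mm) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        for r in range(mm):
            if r != c and A[r][c] != 0:
                f = A[r][c] / A[c][c]
                for k in range(c, mm + 1):
                    A[r][k] -= f * A[c][k]
    return {target: Fraction(0),
            **{v: A[ix[v]][mm] / A[ix[v]][ix[v]] for v in states}}

d = 4
cube = {i: [i ^ (1 << b) for b in range(d)] for i in range(2 ** d)}
h = hit_times(cube, 0)                       # E_j T_0 = E_0 T_j by symmetry
opposite = 2 ** d - 1
assert h[opposite] == Fraction(64, 3)        # (5.69): 2^{d-1} sum 1/C(d-1,i-1)
assert h[1] == 2 ** d - 1                    # |i - j| = 1 case
tau0 = sum(h.values()) / Fraction(2 ** d)    # sum_j pi_j E_0 T_j
assert tau0 == Fraction(103, 6)              # (5.68): (d/2) sum_k C(d,k)/k, d = 4
```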

The asymptotics are the same as in (5.68). In fact it is easy to use (5.64) to
show
Zii = 2^{−d} τ0 = 1 + d^{−1} + O(d^{−2} ),   Zij = O(d^{−2} ) uniformly over |i − j| ≥ 2,

and then by Chapter 2 yyy

Ei Tj = 2^d (1 + d^{−1} + O(d^{−2} )) uniformly over |i − j| ≥ 2.

Since

1 + E1 T0^Y = E0 T1^Y + E1 T0^Y = w/w1 = 2^d ,

it follows that

Ei Tj = 2^d − 1 if |i − j| = 1.
xxx refrain from write out exact Ei Tj —refs
To discuss total variation convergence, we have by symmetry (writing d
for the total variation function, to distinguish it from the dimension d)

d̄(t) = ‖P0 (X(t) ∈ ·) − P1 (X(t) ∈ ·)‖,   d(t) = ‖P0 (X(t) ∈ ·) − π(·)‖.



Following Diaconis et al [116] we shall sketch an argument leading to

d((1/4) d log d + sd) → L(s) ≡ P( |Z| ≤ (1/2) e^{−2s} ),   −∞ < s < ∞,   (5.70)

where Z has the standard Normal distribution. This implies

τ1 ∼ (1/4) d log d.                                         (5.71)
For the discrete-time walk made aperiodic by incorporating chance 1/(d +
1) of holding, (5.70) and (5.71) remain true, though rigorous proof seems
complicated: see [116].
Fix u, and consider j = j(u) such that |j| − d/2 ∼ ud^{1/2} /2. Using
1 − exp(−δ) ≈ δ − (1/2)δ² as δ → 0 in (5.64), we can calculate for t = t(d) =
(1/4) d log d + sd with s fixed that

2^d P0 (X(t) = j) → exp(−(1/2)e^{−4s} − ue^{−2s} ).
Note the limit is > 1 when u < u0 (s) ≡ −e^{−2s} /2. Now

d(t) = (1/2) Σ_j |P0 (X(t) = j) − 2^{−d} | ∼ Σ′ (P0 (X(t) = j) − 2^{−d} )

where the second sum is over j(u) with u < u0 (s). But from (5.64) we can
write this sum as
       
P( B((1/2)(1 − d^{−1/2} e^{−2s} )) ≤ |j(u0 (s))| ) − P( B(1/2) ≤ |j(u0 (s))| )

where B(p) denotes a Binomial(d, p) random variable. By the Normal ap-


proximation to Binomial, this converges to
P (Z ≤ −u0 (s)) − P (Z ≤ u0 (s))
as stated.
As an aside, symmetry and Chapter 4 yyy give

τ0 ≤ E0 T1 ≤ τ1^{(2)} + τ0

and so the difference E0 T1 − τ0 is O(d log d), which is much smaller than
what the series expansions (5.68) and (5.69) imply.
The fact that the “half-cube” A = {i ∈ I : id = 0}, yielding
τc = d/2,
achieves the sup in the definition of τc can be proved using a slightly tricky
induction argument. However, the result follows immediately from (5.67)
together with the general inequality τ2 ≥ τc .

Example 5.16 Dense regular graphs.

Consider an r-regular n-vertex graph with r > n/2. Of course here we are
considering a class of graphs rather than a specific example. The calculations
below show that these graphs necessarily mimic the complete graph (as far
as smallness of the random walk parameters is concerned) in the asymptotic
setting r/n → c > 1/2.
The basic fact is that, for any pair i, j of vertices, there must be at least
2r − n other vertices k such that i − k − j is a path. To prove this, let a1
(resp., a2 ) be the number of vertices k 6= i, j such that exactly 1 (resp., 2)
of the edges (k, i), (k, j) exist. Then a1 + a2 ≤ n − 2 by counting vertices,
and a1 + 2a2 ≥ 2(r − 1) by counting edges, and these inequalities imply
a2 ≥ 2r − n.
Thus, by Thompson’s principle (Chapter 3, yyy) the effective resistance
rij ≤ 2/(2r − n) and so the commute interpretation of resistance implies

τ ∗ ≤ 2rn/(2r − n) ∼ 2cn/(2c − 1).                          (5.72)
A simple “greedy coupling” argument (Chapter 14, Example yyy) shows

τ1 ≤ r/(2r − n) ∼ c/(2c − 1).                               (5.73)
This is also a bound on τ2 and on τc , because τc ≤ τ2 ≤ τ1 always, and special
case 2 below shows that this bound on τc cannot be improved asymptotically
(nor hence can the bound on τ1 or τ2 ). Because Eπ Tj ≤ nτ2 for regular
graphs (Chapter 3 yyy), we get
Eπ Tj ≤ nr/(2r − n).

This implies

τ0 ≤ nr/(2r − n) ∼ cn/(2c − 1),
which also follows from (5.72) and τ0 ≤ τ ∗ /2. We can also argue, in the
notation of Chapter 4 yyy, that

max_{ij} Ei Tj ≤ τ1^{(2)} + max_j Eπ Tj ≤ (4e/(e − 1))τ1 + nτ1 ≤ (1 + o(1)) nr/(2r − n) ∼ cn/(2c − 1).

Special case 1. The orders of magnitude may change for c = 1/2. Take
two complete (n/2)-graphs, break one edge in each (say edges (v1 , v2 ) and

(w1 , w2 )) and add edges (v1 , w1 ) and (v2 , w2 ). This gives an n-vertex ((n/2)−
1)-regular graph for which all our parameters are Θ(n2 ).
jjj I haven’t checked this.
Special case 2. Can the bound τc ≤ r/(2r − n) ∼ c/(2c − 1) be asymptoti-
cally improved? Eric Ordentlicht has provided the following natural counter-
example. Again start with two (n/2)-complete graphs on vertices (vi ) and
(wi ). Now add the edges (vi , wj ) for which 0 ≤ (j − i) mod (n/2) ≤ r − (n/2).
This gives an n-vertex r-regular graph. By considering the set A consisting
of the vertices vi , a brief calculation gives

τc ≥ r/(2r − n + 2) ∼ c/(2c − 1).
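The construction is easy to check by computer. This added sketch builds the graph for n = 16, r = 10, verifies regularity and the common-neighbour bound a2 ≥ 2r − n from the basic fact above, and evaluates the cut ratio (5.52) for A = {vi } (helper names ours):

```python
# Added check of the special case 2 construction (n = 16, r = 10).
n, r = 16, 10                        # n/2 = 8, r - n/2 = 2
half = n // 2
adj = {v: set() for v in range(n)}
def link(a, b):
    adj[a].add(b); adj[b].add(a)
for bell in (range(half), range(half, n)):     # two complete halves
    for a in bell:
        for b in bell:
            if a < b:
                link(a, b)
for i in range(half):                          # cross edges (v_i, w_j)
    for j in range(half):
        if (j - i) % half <= r - half:
            link(i, half + j)

assert all(len(adj[v]) == r for v in adj)      # r-regular
common = min(len(adj[i] & adj[j]) for i in adj for j in adj if i < j)
assert common >= 2 * r - n                     # the "basic fact", a2 >= 2r - n
deg_A = sum(len(adj[v]) for v in range(half))
deg_Ac = sum(len(adj[v]) for v in range(half, n))
cut = sum(1 for i in range(half) for j in adj[i] if j >= half)
ratio = deg_A * deg_Ac / ((deg_A + deg_Ac) * cut)   # (5.52); 2|E| = deg A + deg Ac
assert abs(ratio - r / (2 * r - n + 2)) < 1e-12     # equality here: 5/3
```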
Example 5.17 The d-dimensional torus Zm^d .
The torus is the set of d-dimensional integers i = (i1 , . . . , id ) modulo m,
considered in the natural way as a 2d-regular graph on n = md vertices. It
is much simpler to work with the random walk in continuous time, X(t) =
(X1 (t), . . . , Xd (t)), because its component processes (Xu (t)) are independent
as u varies; and each is just continuous-time random walk on the m-cycle,
slowed down by a factor 1/d. Thus we can immediately write the time-t
transition probabilities for X in terms of the corresponding probabilities
p0,j (t) for continuous-time random walk on the m-cycle (see Example 5.7
above) as
p0,j (t) = Π_{u=1}^{d} p0,ju (t/d).
Since the eigenvalues on the m-cycle are (1 − cos(2πk/m), 0 ≤ k ≤ m − 1),
by (5.66) the eigenvalues of X are
λ(k1 ...kd ) = (1/d) Σ_{u=1}^{d} (1 − cos(2πku /m)),   0 ≤ ku ≤ m − 1.
In particular, we see that the relaxation time satisfies
τ2 ∼ dm²/(2π²) = dn^{2/d}/(2π²)
where here and below asymptotics are as m → ∞ for fixed d. This relaxation
time could more simply be derived from the N -cycle result via the general
“product chain” result of Chapter 4 yyy. But writing out all the eigenvalues
enables us to use the eigentime identity.
τ0 = Σ_{k1} · · · Σ_{kd} 1/λ(k1 ,...,kd )

(the sum excluding (0, . . . , 0)), and hence

τ0 ∼ m^d Rd                                                 (5.74)

where

Rd ≡ ∫_0^1 · · · ∫_0^1 [ (1/d) Σ_{u=1}^{d} (1 − cos(2πxu )) ]^{−1} dx1 · · · dxd   (5.75)

provided the integral converges. The reader who is a calculus whiz will see
that in fact Rd < ∞ for d ≥ 3 only, but this is seen more easily in the
alternative approach of Chapter 15, Section yyy.
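The eigenvalue expression and the eigentime identity can be cross-checked on a small torus. This added sketch compares Σ 1/λ with Σ_j πj E0 Tj on Z4² (d = 2, m = 4), using Gauss–Seidel sweeps for the hitting times (helper names ours):

```python
# Added cross-check of the torus eigenvalues via the eigentime identity.
import math

m, d = 4, 2
n = m ** d
eig_sum = 0.0
for k1 in range(m):
    for k2 in range(m):
        if (k1, k2) != (0, 0):
            lam = ((1 - math.cos(2 * math.pi * k1 / m))
                   + (1 - math.cos(2 * math.pi * k2 / m))) / d
            eig_sum += 1 / lam

adj = {}
for x in range(m):
    for y in range(m):
        adj[(x, y)] = [((x + 1) % m, y), ((x - 1) % m, y),
                       (x, (y + 1) % m), (x, (y - 1) % m)]

def hit_times(adj, target):
    """Gauss-Seidel solve of h(v) = 1 + avg over neighbours, h(target) = 0."""
    ht = {v: 0.0 for v in adj}
    for _ in range(100000):
        delta = 0.0
        for v in adj:
            if v == target:
                continue
            new = 1 + sum(ht[w] for w in adj[v]) / len(adj[v])
            delta = max(delta, abs(new - ht[v]))
            ht[v] = new
        if delta < 1e-12:
            break
    return ht

tau0 = sum(hit_times(adj, t)[(0, 0)] for t in adj) / n   # sum_j pi_j E_0 T_j
assert abs(tau0 - eig_sum) < 1e-6
```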
xxx more stuff: connection to transience, recurrent potential, etc
xxx new copy from lectures
xxx τ1 , τc
jjj David: I will let you develop the rest of this example. Note that τ1
is considered very briefly in Chapter 15, eq. (17) in 3/6/96 version. Here
are a few comments for τc . First suppose that m > 2 is even and d ≥ 2.
Presumably, τc is achieved by the following half-torus:
A := {i = (i1 , . . . , id ) ∈ Zm^d : 0 ≤ id < m/2}.

In the notation of (5.52) observe

|E| = dn,   deg A = dn,   deg Ac = dn,   (A, Ac ) = 2m^{d−1} = 2n/m,

whence

τ (A) = (d/4) n^{1/d} .

[By Example 5.15 (the d-cube) this last result is also true for m = 2, and
(for even m ≥ 2) it is by Example 5.7 (the n-cycle) also true for d = 1.] If
we have correctly conjectured the maximizing A, then

τc = (d/4) n^{1/d} if m is even,

and presumably(??)

τc ∼ (d/4) n^{1/d}

in any case.

Example 5.18 Chess moves.

Here is a classic homework problem for an undergraduate Markov chains


course.

Start a knight at a corner square of an otherwise-empty chess-


board. Move the knight at random, by choosing uniformly from
the legal knight-moves at each step. What is the mean number
of moves until the knight returns to the starting square?
It’s a good question, because if you don’t know Markov chain theory it looks
too messy to do by hand, whereas using Markov chain theory it becomes very
simple. The knight is performing random walk on a graph (the 64 squares
are the vertices, and the possible knight-moves are the edges). It is not hard
to check that the graph is connected, so by the elementary Chapter 3 yyy
for a corner square v the mean return time is

Ev Tv+ = 1/πv = 2|E|/dv = |E|,

since a corner square has degree dv = 2; and by drawing a sketch in the
margin the reader can count the number of edges |E| to be 168.
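The edge count is quickly confirmed by computer; this added snippet builds the degree sequence of the knight's graph:

```python
# Added check: edges and corner degree of the knight's graph on 8 x 8.
moves = [(1, 2), (2, 1), (-1, 2), (-2, 1), (1, -2), (2, -1), (-1, -2), (-2, -1)]
deg = {(r, c): sum(0 <= r + dr < 8 and 0 <= c + dc < 8 for dr, dc in moves)
       for r in range(8) for c in range(8)}
edges = sum(deg.values()) // 2
assert edges == 168
assert deg[(0, 0)] == 2                       # a corner square
mean_return = 2 * edges / deg[(0, 0)]         # 1/pi_v = 2|E|/d_v
assert mean_return == 168
```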
Other chess pieces—queen, king, rook—define different graphs (the bish-
op’s is of course not connected, and the pawn’s not undirected). One might
expect that the conventional ordering of the “strength” of the pieces as
(queen, rook, knight, king) is reflected in parameters τ0 and τ2 (jjj how
about the other taus?) being increasing in this ordering. The reader is
invited to perform the computations. (jjj: an undergraduate project?) We
have done so only for the rook’s move, treated in the next example.
The computations for the queen, knight, and king are simplified if the
walks are made on a toroidal chessboard. (There is no difference for the
rook.)
jjj Chess on a bagel, anyone? Continue same paragraph:
Then Fourier analysis (see Diaconis [112]) on the abelian group Zm^2 (with
m = 8) can be brought to bear, and the eigenvalues are easy to compute.
We omit the details, but the results for (queen, rook, knight, king) are
asymptotically

τ0 = (m² + (7/12)m + O(1), m² + m + O(1),
      jjj?(1 + o(1))cknight m² log m, jjj?(1 + o(1))cking m² log m)

τ2 ∼ (4/3, 2, m²/(5π²), 2m²/(3π²))
as m → ∞, in conformance with our expectations, and numerically
τ0 = (65.04, 67.38, 69.74, 79.36)
τ2 = (1.29, 1.75, 1.55, 4.55)
for m = 8. The only surprise is the inverted τ2 ordering for (rook, knight).

Example 5.19 Rook’s random walk on an m-by-m chessboard.

jjj Do we want to do this also on a d-dimensional grid? We need to


mention how this is a serious example, used with Metropolis for sampling
from log concave distributions; reference is [32]? [33]?
Number the rows and columns of the chessboard each 0 through m − 1
in arbitrary fashion, and denote the square of the chessboard at row i1 and
column i2 by i = (i1 , i2 ). In continuous time, the rook’s random walk (X(t))
is the product of two continuous-time random walks on the complete graph
Km on m vertices, each run at rate 1/2. Thus (cf. Example 5.9)
Pi(X(t) = j) = ∏_{u=1}^{2} [ 1/m + (δ_{iu,ju} − 1/m) exp(−mt/(2(m−1))) ],     (5.76)

which can be expanded to get the discrete-time multistep transition probabilities,
if desired. We recall that the eigenvalues for discrete-time random
walk on Km are 1 with multiplicity 1 and −1/(m−1) with multiplicity m−1.
It follows [recall (5.66)] that the eigenvalues for the continuous-time rook’s
walk are
0,   m/(2(m−1)),   m/(m−1),   with resp. multiplicities 1, 2(m − 1), (m − 1)².

In particular,
τ2 = 2(m − 1)/m,     (5.77)
which equals 1.75 for m = 8 and converges to 2 as m grows. Applying the
eigentime identity, a brief calculation gives

τ0 = (m − 1)²(m + 3)/m,     (5.78)
which equals 67.375 for m = 8 and m2 + m + O(1) for m large.
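Both numerical values can be reproduced from the spectrum directly; the following sketch (ours) builds the discrete-time transition matrix and applies the eigentime identity numerically:

```python
import numpy as np

# Numerical check (illustrative) of tau2 = 2(m-1)/m and tau0 = (m-1)^2 (m+3)/m
# via the spectrum of I - P for the discrete-time rook's walk.
def rook_P(m):
    P = np.zeros((m * m, m * m))
    for i1 in range(m):
        for i2 in range(m):
            i = i1 * m + i2
            for j in range(m):
                if j != i1:
                    P[i, j * m + i2] += 1 / (2 * (m - 1))  # rook move along the column
                if j != i2:
                    P[i, i1 * m + j] += 1 / (2 * (m - 1))  # rook move along the row
    return P

m = 8
lam = np.sort(np.linalg.eigvalsh(np.eye(m * m) - rook_P(m)))  # P is symmetric here
tau2, tau0 = 1 / lam[1], (1 / lam[1:]).sum()                  # eigentime identity
print(tau2, tau0)  # 1.75 and 67.375 (up to rounding)
```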
Starting the walk X at 0 = (0, 0), let Y (t) denote the Hamming distance
H(X(t), 0) of X(t) from 0, i.e., the number of coordinates (0, 1, or 2) in
which X(t) differs from 0. Then Y is a birth-and-death chain with transition
rates
q01 = 1,   q10 = 1/(2(m − 1)),   q12 = 1/2,   q21 = 1/(m − 1).
This is useful for computing mean hitting times. Of course

Ei Tj = 0 if H(i, j) = 0.
CHAPTER 5. EXAMPLES: SPECIAL GRAPHS AND TREES (APRIL 23 1996)

Since
1 + E1 T0^Y = E0 T1^Y + E1 T0^Y = m²,
it follows that
Ei Tj = m² − 1 if H(i, j) = 1.
Finally, it is clear that E2 T1^Y = m − 1, so that

E2 T0^Y = E2 T1^Y + E1 T0^Y = m² + m − 2,

whence
Ei Tj = m² + m − 2 if H(i, j) = 2.
These calculations show
(1/2) τ* = max_{i,j} Ei Tj = m² + m − 2,

which equals 70 for m = 8, and they provide another proof of (5.78).


From (5.76) it is easy to derive

d̄_m(t) = (2 − 2/m) exp(−mt/(2(m−1))) − (1 − 2/m) exp(−mt/(m−1))

and thence
" 1/2 ! #
m−1 m(m − 2) m−1
 
τ1 = −2 ln 1 − 1 − e−1 + ln ,
m (m − 1)2 m−2
.
which rounds to 2.54 for m = 8 and converges to −2 ln(1 − (1 − e−1 )1/2 ) =
3.17 as m becomes large.
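As a cross-check of the displayed formula as we have transcribed it (ours, assuming the convention τ1 = min{t : d̄(t) ≤ e^{−1}}), one can solve d̄_m(t) = e^{−1} by bisection:

```python
import math

# Cross-check (illustrative): tau1 solves dbar_m(t) = 1/e and should agree
# with the closed form transcribed above.
m = 8
rate = m / (2 * (m - 1))
dbar = lambda t: (2 - 2 / m) * math.exp(-rate * t) - (1 - 2 / m) * math.exp(-2 * rate * t)

lo, hi = 0.0, 50.0      # bisection: dbar decreases from dbar(0) = 1
for _ in range(100):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if dbar(mid) > math.exp(-1) else (lo, mid)

closed = -2 * ((m - 1) / m) * (
    math.log(1 - math.sqrt(1 - m * (m - 2) / (m - 1) ** 2 * math.exp(-1)))
    + math.log((m - 1) / (m - 2)))
print(round(lo, 2), round(closed, 2))  # both round to 2.54
```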
Any set A of the form {(i1 , i2 ) : iu ∈ J} with either u = 1 or u = 2 and
J a nonempty proper subset of {0, . . . , m − 1} achieves the value

τc = 2(m − 1)/m.
A direct proof is messy, but this follows immediately from the general in-
equality τc ≤ τ2 , (5.77), and a brief calculation that the indicated A indeed
gives the indicated value.
xxx other examples left to reader? complete bipartite; ladders
jjj Note: I’ve worked these out and have handwritten notes. How much
do we want to include, if at all? (I could at least put the results in the
table.)

5.2.1 Biased walk on a balanced tree


Consider again the balanced r-tree setup of Example 5.14. Fix a parameter
0 < λ < ∞. We now consider biased random walk (Xt ) on the tree, where
from each non-leaf vertex other than the root the transition goes to the
parent with probability λ/(λ+r) and to each child with probability 1/(λ+r).
As in Example 5.14 (the case λ = 1), the chain X̂ induced by the function
f (i) = h − (distance from i to the root)
is (biased) reflecting random walk on {0, . . . , h} with respective probabilities
λ/(λ + r) and r/(λ + r) of moving to the right and left from any i ≠ 0, h;
the ratio of these two transition probabilities is
ρ = λ/r.
The stationary distribution π̂ for X̂ is a modified geometric:

π̂_m = (1/ŵ) × { 1                 if m = 0
              { (1 + ρ) ρ^{m−1}   if 1 ≤ m ≤ h − 1
              { ρ^{h−1}           if m = h

where

ŵ = 2 Σ_{m=0}^{h−1} ρ^m = { 2(1 − ρ^h)/(1 − ρ)   if ρ ≠ 1
                           { 2h                    if ρ = 1.
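A quick sanity check (ours) that the modified-geometric π̂ is indeed stationary for X̂, via detailed balance across each edge of {0, . . . , h}:

```python
# Sanity check (illustrative): the modified-geometric pi-hat satisfies detailed
# balance for the induced birth-and-death chain X-hat.
def pi_hat(rho, h):
    w = 2 * sum(rho ** m for m in range(h))
    return ([1 / w] + [(1 + rho) * rho ** (m - 1) / w for m in range(1, h)]
            + [rho ** (h - 1) / w])

def is_stationary(lam, r, h, tol=1e-12):
    rho = lam / r
    p_up, p_down = lam / (lam + r), r / (lam + r)  # toward root / toward leaves
    pi = pi_hat(rho, h)
    ok = abs(sum(pi) - 1) < tol
    for m in range(h):  # flow balance across each edge (m, m+1)
        up = pi[m] * (1 if m == 0 else p_up)
        down = pi[m + 1] * (1 if m + 1 == h else p_down)
        ok = ok and abs(up - down) < tol
    return ok

print(is_stationary(lam=2.0, r=3, h=5), is_stationary(lam=1.0, r=2, h=4))  # True True
```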
Since the stationary distribution π for X assigns the same probability to
each of the r^{h−f(v)} vertices v with a given value of f(v), a brief calculation
shows that πv pvx = λ^{f(v)}/(ŵ r^h) for any edge (v = child, x = parent) in the
tree. In the same notation, it follows that X is random walk on the balanced
r-tree with edge weights wvx = λ^{f(v)} and total weight w = Σ_{v,x} wvx = ŵ r^h.

The distribution π̂ concentrates near the root-level if ρ < 1 and near the
leaves-level if ρ > 1; it is nearly uniform on the h levels if ρ = 1. On the
other hand, the weight assigned by the distribution π to an individual vertex
v is a decreasing function of f (v) (thus favoring vertices near the leaves) if
λ < 1 (i.e., ρ < 1/r) and is an increasing function (thus favoring vertices
near the root) if λ > 1; it is uniform on the vertices in the unbiased case
λ = 1.
The mean hitting time calculations of Example 5.14 can all be extended
to the biased case. For example, for λ ≠ 1 the general formula (5.55)
becomes [using the same notation as at (5.55)]
Ev1 Tv2 = ŵ r^h (λ^{−f3} − λ^{−f2})/(λ^{−1} − 1) + 2(ρ^{−1} − 1)^{−2} (ρ^{−(f2+1)} − ρ^{−(f1+1)})
          − 2(ρ^{−1} − 1)^{−1} (f2 − f1) − (f2 − f1)     (5.79)

if ρ ≠ 1, and

Ev1 Tv2 = ŵ r^h (λ^{−f3} − λ^{−f2})/(λ^{−1} − 1) + f2² − f1²
if ρ = 1. The maximum value is attained when v1 and v2 are leaves and v3
is the root. So if λ ≠ 1,

(1/2) τ* = max_{v,x} Ev Tx = ŵ r^h (λ^{−h} − 1)/(λ^{−1} − 1).     (5.80)
The orders of magnitude for all of the τ -parameters (with r and λ, and
hence ρ, fixed as h, and hence n, becomes large) are summarized on a
case-by-case basis in the next table. Following are some of the highlights
in deriving these results; the details, and derivation of exact formulas and
more detailed asymptotic results, are left to the reader.

Orders of magnitude of parameters [τ = Θ(entry)]


for λ-biased walk on a balanced r-tree of height h (ρ = λ/r).

Value of ρ                      τ*      τ0      τ1      τ2      τc
ρ < 1/r                         ρ^{−h}  ρ^{−h}  ρ^{−h}  ρ^{−h}  ρ^{−h}
ρ = 1/r (≡ Example 5.14)        nh      nh      n       n       n
1/r < ρ < 1                     n       n       ρ^{−h}  ρ^{−h}  ρ^{−h}
ρ = 1                           nh      n       h       h       h
ρ > 1                           n       n       h       1       1

For τ0 = Σ_x πx Eroot Tx we have τ0 ≤ Eroot Tleaf. If ρ < 1/r, this bound
is tight:

τ0 ∼ Eroot Tleaf ∼ (2ρ^{−h}/((1 − ρ)²(1 − λ))) (λ − ρ);
for ρ > 1/r a more careful calculation is required.
If ρ < 1, then the same arguments as for the unbiased case (ρ = 1/r)
show
τ1 ∼ τ2 ∼ 2ρ^{−(h−1)}/(1 − ρ)².
In this case it is not hard to show that

τc = Θ(ρ^{−h})

as well. If ρ = 1, then it is not hard to show that

τ1 = Θ(h),   τc ∼ 2(1 − 1/r)h

with τc achieved at a branch of the root (excluding the root), and so

τ2 = Θ(h)

as well. If ρ > 1, then since X̂ has positive drift equal to (ρ − 1)/(ρ + 1), it
follows that
τ1 ∼ ((ρ + 1)/(ρ − 1)) h.
The value τc is achieved by isolating a leaf, giving

τc → 1,

and so, by the inequalities τc ≤ τ2 ≤ 8τc² of Chapter 4, Section yyy,

τ2 = Θ(1)

as well.
jjj Limiting value of τ2 when ρ > 1 is that of τ2 for biased infinite tree?
Namely?

5.3 Trees
For random walk on a finite tree, we can develop explicit formulas for means
and variances of first passage times, and for distributions of first hitting
places. We shall only treat the unweighted case, but the formulas can be
extended to the weighted case without difficulty.
xxx notation below —change w to x ? Used i, j, v, w, x haphazardly for
vertices.
In this section we’ll write rv for the degree of a vertex v, and d(v, x)
for the distance between v and x. On a tree we may unambiguously write
[v, x] for the path from v to x. Given vertices j, v1 , v2 , . . . in a tree, the
intersection of the paths [j, v1 ], [j, v2 ], . . . is a (maybe trivial) path; write
d(j, v1 ∧ v2 ∧ · · ·) ≥ 0 for the length of this intersection path.
On an n-vertex tree, the random walk’s stationary distribution is
πv = rv/(2(n − 1)).
Recall from the beginning of this chapter that an edge (v, x) of a graph
is essential if its removal would disconnect the graph into two components
A(v, x) and A(x, v), say, containing v and x respectively. Obviously, in a
tree every edge is essential, so we get a lot of mileage out of the essential
edge lemma (Lemma 5.1).

Theorem 5.20 Consider discrete-time random walk on an n-vertex tree.


For each edge (i, j),

Ei Tj = 2|A(i, j)| − 1 (5.81)

Ei Tj + Ej Ti = 2(n − 1). (5.82)


For arbitrary i, j,
Ei Tj = −d(i, j) + 2 Σ_v d(j, i ∧ v) = Σ_v rv d(j, i ∧ v)     (5.83)

Ei Tj + Ej Ti = 2(n − 1)d(i, j). (5.84)


For each edge (i, j),
vari Tj = −Ei Tj + Σ_{v∈A(i,j)} Σ_{w∈A(i,j)} rv rw (2d(j, v ∧ w) − 1).     (5.85)

For arbitrary i, j,
vari Tj = −Ei Tj + Σ_v Σ_w rv rw d(j, i ∧ v ∧ w) [2d(j, v ∧ w) − d(j, i ∧ v ∧ w)].     (5.86)

Remarks. 1. There are several equivalent expressions for the sums above:
we chose the most symmetric-looking ones. We’ve written sums over ver-
tices, but one could rephrase in terms of sums over edges.
2. In continuous time, the terms “−Ei Tj ” disappear from the variance
formulas—see xxx.
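Formulas (5.83) and (5.84) are easy to check numerically; the sketch below (ours; the 6-vertex tree is an arbitrary test case) solves the first-step equations for Ei Tj directly:

```python
import numpy as np
from collections import deque

# Numerical check (illustrative) of (5.83) and (5.84) on a small test tree.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1, 5], 5: [4]}
n = len(adj)

def hit(i, j):  # exact E_i T_j: solve h(v) = 1 + average over neighbors, h(j) = 0
    idx = [v for v in adj if v != j]
    A, b = np.eye(n - 1), np.ones(n - 1)
    for a, v in enumerate(idx):
        for x in adj[v]:
            if x != j:
                A[a, idx.index(x)] -= 1 / len(adj[v])
    h = dict(zip(idx, np.linalg.solve(A, b)))
    return 0.0 if i == j else h[i]

def dist(i, j):  # graph distance by BFS
    d, q = {i: 0}, deque([i])
    while q:
        v = q.popleft()
        for x in adj[v]:
            if x not in d:
                d[x] = d[v] + 1
                q.append(x)
    return d[j]

def meet(j, i, v):  # d(j, i ^ v): distance from j to where paths [j,i], [j,v] split
    return (dist(j, i) + dist(j, v) - dist(i, v)) // 2

for i in adj:
    for j in adj:
        rhs = sum(len(adj[v]) * meet(j, i, v) for v in adj)                  # (5.83)
        assert abs(hit(i, j) - rhs) < 1e-8
        assert abs(hit(i, j) + hit(j, i) - 2 * (n - 1) * dist(i, j)) < 1e-8  # (5.84)
print("ok")
```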
Proof of Theorem 5.20. Equations (5.81) and (5.82) are rephrasings
of (5.3) and (5.4) from the essential edge lemma. Equation (5.84) and the
first equality in (5.83) follow from (5.82) and (5.81) by summing over the
edges in the path [i, j]. Note alternatively that (5.84) can be regarded as
a consequence of the commute interpretation of resistance, since the ef-
fective resistance between i and j is d(i, j). To get the second equality
in (5.83), consider the following deterministic identity (whose proof is obvi-
ous), relating sums over vertices to sums over edges.
Lemma 5.21 Let f be a function on the vertices of a tree, and let j be a
distinguished vertex. Then

Σ_v rv f(v) = Σ_{v≠j} (f(v) + f(v*))

where v* is the first vertex (other than v) in the path [v, j].

To apply to (5.83), note

d(j, i ∧ v*) = d(j, i ∧ v)        if v ∉ [i, j]
             = d(j, i ∧ v) − 1    if v ∈ [i, j], v ≠ j.

The equality in Lemma 5.21 now becomes the equality in (5.83).


We prove (5.85) below. To derive (5.86) from it, sum over the edges in
the path [i, j] = (i = i0 , i1 , . . . , im = j) to obtain
vari Tj = −Ei Tj + Σ_v Σ_w Σ_l (2d(i_{l+1}, v ∧ w) − 1)     (5.87)

where Σ_l denotes the sum over all 0 ≤ l ≤ m − 1 for which A(i_l, i_{l+1})
contains both v and w. Given vertices v and w, there exist unique smallest
values of p and q so that v ∈ A(i_p, i_{p+1}) and w ∈ A(i_q, i_{q+1}). If p ≠ q, then
the sum Σ_l in (5.87) equals
Σ_{l=p∨q}^{m−1} (2d(i_{l+1}, i_{p∨q}) − 1) = Σ_{l=p∨q}^{m−1} (2((l + 1) − (p ∨ q)) − 1)
                                           = (m − (p ∨ q))² = d²(j, v ∧ w)
                                           = d(j, i ∧ v ∧ w) [2d(j, v ∧ w) − d(j, i ∧ v ∧ w)],
as required by (5.86). If p = q, then the sum Σ_l in (5.87) equals

Σ_{l=p}^{m−1} (2d(i_{l+1}, i_p) + 2d(i_p, v ∧ w) − 1)

which again equals d(j, i ∧ v ∧ w) [2d(j, v ∧ w) − d(j, i ∧ v ∧ w)] by a similar


calculation.
So it remains to prove (5.85), for which we may suppose, as in the proof
of Lemma 5.1, that j is a leaf. By considering the first step from j to i we
have
varj Tj+ = vari Tj .
Now yyy of Chapter 2 gives a general expression for varj Tj+ in terms of
Eπ Tj , and in the present setting this becomes

varj Tj+ = 2(n − 1) − (2(n − 1))² + Σ_v 2 rv Ev Tj.
v

Using the second equality in (5.83), we may rewrite the sum as


Σ_{v≠j} Σ_{w≠j} rv rw 2d(j, v ∧ w).

Also,
Σ_{v≠j} rv = 2(n − 1) − 1.

Combining these expressions gives


vari Tj = −(2n − 3) + Σ_{v≠j} Σ_{w≠j} rv rw (2d(j, v ∧ w) − 1).

But by (5.81), Ei Tj = 2n − 3.

5.3.1 Parameters for trees


Here we discuss the five parameters of Chapter 4. Obviously by (5.84)

τ ∗ = 2(n − 1)∆ (5.88)

where ∆ is the diameter of the tree. As for τc , it is clear that the sup in its
definition is attained by A(v, w) for some edge (v, w). Note that

π(A(v, w)) = (2|A(v, w)| − 1)/(2(n − 1)).     (5.89)

This leads to
τc = max_{(v,w)} [ (2|A(v,w)|−1)/(2(n−1)) ] [ (2|A(w,v)|−1)/(2(n−1)) ] / [ 1/(2(n−1)) ]
   = max_{(v,w)} (4|A(v, w)||A(w, v)| − 2n + 1)/(2(n − 1)).     (5.90)

Obviously the max is attained by an edge for which |A(v, w)| is as close as
possible to n/2. This is one of several notions of “centrality” of vertices
and edges which arise in our discussion—see Buckley and Harary [81] for a
treatment of centrality in the general graph context, and for the standard
graph-theoretic terminology.

Proposition 5.22 On an n-vertex tree,

τ0 = 1/2 + (2/n) Σ_{(v,w)} [ |A(v, w)||A(w, v)| − (|A(v, w)|² + |A(w, v)|²)/(2(n − 1)) ]

where Σ_{(v,w)} denotes the sum over all undirected edges (v, w).

Proof. Using the formula for the stationary distribution, for each i
τ0 = (1/(2(n − 1))) Σ_j rj Ei Tj.

Appealing to Lemma 5.21 (with i as the distinguished vertex)

τ0 = (1/(2(n − 1))) Σ_j (2 Ei Tj − a(i, j))

where a(i, i) = 0 and a(i, j) = Ex Tj , where (j, x) is the first edge of the path
[j, i]. Taking the (unweighted) average over i,

τ0 = (1/(2n(n − 1))) Σ_i Σ_j (2 Ei Tj − a(i, j)).

Each term Ei Tj is the sum of terms Ev Tw along the edges (v, w) of the path
[i, j]. Counting how many times a directed edge (v, w) appears,

τ0 = (1/(2n(n − 1))) Σ (2|A(v, w)||A(w, v)| − |A(v, w)|) Ev Tw,

where we sum over directed edges (v, w). Changing to a sum over undirected
edges, using Ev Tw + Ew Tv = 2(n − 1) and Ev Tw = 2|A(v, w)| − 1, gives
2n(n − 1) τ0 = Σ_{(v,w)} [ 2|A(v, w)||A(w, v)| · 2(n − 1)
                − |A(v, w)|(2|A(v, w)| − 1)
                − |A(w, v)|(2|A(w, v)| − 1) ].

This simplifies to the assertion of the Proposition.
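Proposition 5.22 can likewise be checked against the definition τ0 = Σ_j πj Ei Tj on a small test tree (our sketch; the tree is an arbitrary example):

```python
import numpy as np
from collections import deque

# Numerical check (illustrative) of Proposition 5.22 on a small test tree.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1, 5], 5: [4]}
n = len(adj)

def hit(i, j):  # exact E_i T_j by linear solve of the first-step equations
    idx = [v for v in adj if v != j]
    A, b = np.eye(n - 1), np.ones(n - 1)
    for a, v in enumerate(idx):
        for x in adj[v]:
            if x != j:
                A[a, idx.index(x)] -= 1 / len(adj[v])
    h = dict(zip(idx, np.linalg.solve(A, b)))
    return 0.0 if i == j else h[i]

def side(v, w):  # |A(v, w)|: size of the component containing v when (v, w) is cut
    seen, q = {v}, deque([v])
    while q:
        u = q.popleft()
        for x in adj[u]:
            if x != w and x not in seen:
                seen.add(x)
                q.append(x)
    return len(seen)

pi = {v: len(adj[v]) / (2 * (n - 1)) for v in adj}
tau0_direct = sum(pi[j] * hit(0, j) for j in adj)   # random target lemma: any start

edges = {(min(v, w), max(v, w)) for v in adj for w in adj[v]}
tau0_formula = 0.5 + (2 / n) * sum(
    side(v, w) * side(w, v) - (side(v, w) ** 2 + side(w, v) ** 2) / (2 * (n - 1))
    for v, w in edges)
print(abs(tau0_direct - tau0_formula) < 1e-8)  # True
```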


For τ1 we content ourselves with a result “up to equivalence”.

Proposition 5.23 There exist constants K1 , K2 < ∞ such that


(1/K1) min_i max_j Ej Ti ≤ τ1 ≤ K2 min_i max_j Ej Ti.

Of course the expectations can be computed by (5.83).


Proof. We work with the parameter
τ1^{(3)} ≡ max_{i,j} Σ_k πk |Ej Tk − Ei Tk|

which we know is equivalent to τ1 . Write

σ = min_i max_j Ej Ti.

Fix an i attaining the minimum. For arbitrary j we have (the first equality
uses the random target lemma, cf. the proof of Chapter 4 Lemma yyy)
Σ_k πk |Ej Tk − Ei Tk| = 2 Σ_k πk (Ej Tk − Ei Tk)+
                       ≤ 2 Σ_k πk Ej Ti     because Ej Tk ≤ Ej Ti + Ei Tk
                       ≤ 2σ

and so τ1^{(3)} ≤ 4σ.
For the converse, it is elementary that we can find a vertex i such that
the size (n∗ , say) of the largest branch from i satisfies n∗ ≤ n/2. (This is
another notion of “centrality”. To be precise, we are excluding i itself from
the branch.) Fix this i, and consider the j which maximizes Ej Ti , so that
Ej Ti ≥ σ by definition. Let B denote the set of vertices in the branch from
i which contains j. Then

Ej Tk = Ej Ti + Ei Tk,   k ∈ B^c

and so
τ1^{(3)} ≥ Σ_k πk |Ej Tk − Ei Tk| ≥ π(B^c) Ej Ti ≥ π(B^c) σ.

But by (5.89) π(B) = (2n* − 1)/(2(n − 1)) ≤ 1/2, so we have shown τ1^{(3)} ≥ σ/2.
We do not know whether τ2 has a simple expression “up to equivalence”
analogous to Proposition 5.23. It is natural to apply the “distinguished
paths” bound (Chapter 4 yyy). This gives the inequality
τ2 ≤ 2(n − 1) max_{(v,w)} Σ_{x∈A(v,w)} Σ_{y∈A(w,v)} πx πy d(x, y)
   = 2(n − 1) max_{(v,w)} [ π(A(v, w)) E[d(v, V) 1_{(V∈A(w,v))}]
                          + π(A(w, v)) E[d(v, V) 1_{(V∈A(v,w))}] ]

where V has the stationary distribution π and where we got the equality
by writing d(x, y) = d(v, y) + d(v, x). The edge attaining the max gives yet
another notion of “centrality.”
xxx further remarks on τ2 .

5.3.2 Extremal trees


It is natural to think of the n-path (Example 5.8) and the n-star (Exam-
ple 5.10) as being “extremal” amongst all n-vertex trees. The proposition
below confirms that the values of τ ∗ , maxi,j Ei Tj , τ0 , τ2 , and τc in those
examples are the exact extremal values (minimal for the star, maximal for
the path).

Proposition 5.24 For any n-vertex tree with n ≥ 3,


(a) 4(n − 1) ≤ τ* ≤ 2(n − 1)²
(b) 2(n − 1) ≤ max_{i,j} Ei Tj ≤ (n − 1)²
(c) n − 3/2 ≤ τ0 ≤ (2n² − 4n + 3)/6
(d) 1 ≤ τ2 ≤ (1 − cos(π/(n − 1)))^{−1}
(e) 1 − 1/(2(n − 1)) ≤ τc ≤ (4⌊n²/4⌋ − 2n + 1)/(2(n − 1)).

Proof. (a) is obvious from (5.88), because ∆ varies between 2 for the
n-star and (n − 1) for the n-path. The lower bound in (b) follows from the
lower bound in (a). For the upper bound in (b), consider some path i =
v0 , v1 , . . . , vd = j in the tree, where plainly d ≤ (n − 1). Now |A(vd−1 , vd )| ≤
n − 1 and so
|A(vd−i , vd−i+1 )| ≤ n − i for all i
because the left side decreases by at least 1 as i increases. So
Ei Tj = Σ_{m=0}^{d−1} E_{vm} T_{vm+1}
      = Σ_{m=0}^{d−1} (2|A(vm, vm+1)| − 1)     by (5.81)
      ≤ Σ_{m=0}^{d−1} (2(m + n − d) − 1)
      ≤ Σ_{l=1}^{n−1} (2l − 1)
      = (n − 1)².

To prove (c), it is enough to show that the sum in Proposition 5.22 is min-
imized by the n-star and maximized by the n-path. For each undirected
edge (v, w), let

b(v, w) = min(|A(v, w)|, |A(w, v)|) ≤ n/2.



Let b = (b1 , b2 , . . . , bn−1 ) be the non-decreasing rearrangement of these val-


ues b(v, w). The summands in Proposition 5.22 are of the form
a(n − a) − (a² + (n − a)²)/(2(n − 1)),
with a ranging over the bi.
One can check that this quantity is an increasing function of a ≤ n/2.
Thus it is enough to show that the vector b on an arbitrary n-tree dominates
coordinatewise the vector b for the n-star and is dominated by the vector b
for the n-path. The former is obvious, since on the n-star b = (1, 1, . . . , 1).
The latter needs a little work. On the n-path b = (1, 1, 2, 2, 3, 3, . . .). So we
must prove that in any n-tree
bi ≤ ⌊(i + 1)/2⌋ for all i.     (5.91)
Consider a rooted tree on m vertices. Breaking an edge e gives two
components; let a(e) be the size of the component not containing the root.
Let (a1 , a2 , . . .) be the non-decreasing rearrangement of (a(e)). For an m-
path rooted at one leaf, (a1 , a2 , . . .) = (1, 2, 3, . . .). We assert this is extremal,
in that for any rooted tree

ai ≤ i for all i. (5.92)

This fact can be proved by an obvious induction on m, growing trees by


adding leaves.
Now consider an unrooted tree, and let b be as above. There exists some
vertex v, of degree r ≥ 2, such that each of the r branches from v has size
(excluding v) at most n/2. Consider these branches as trees rooted at v,
apply (5.92), and it is easy to deduce (5.91).
For (d), the lower bound is easy. Fix a leaf v and let w be its neighbor.
We want to apply the extremal characterization (Chapter 3 yyy) of τ2 to
the function

g(v) = 1 − πv − πw , g(w) = 0, g(·) = −πv elsewhere.


For this function, Σ_x πx g(x) = 0,

[g, g] = πv(1 − πv − πw)² + (1 − πv − πw)πv²,

and by considering transitions out of w

E(g, g) = πv(1 − πv − πw)² + (πw − πv)πv².



Since πw ≤ 1/2 we have [g, g] ≥ E(g, g) and hence τ2 ≥ [g, g]/E(g, g) ≥ 1.


qqq Anyone have a short proof of upper bound in (d)?
Finally, (e) is clear from (5.90).
Other extremal questions. Several other extremal questions have been
studied. Results on cover time are given in Chapter 6. Yaron [340] shows
that for leaves l the mean hitting time Eπ Tl is maximal on the n-path and
minimal on the n-star. (He actually studies the variance of return times,
but Chapter 2 yyy permits the rephrasing.) Finally, if we are interested in
the mean hitting time Ex TA or the hitting place distribution, we can reduce
to the case where A is the set L of leaves, and then set up recursively-
solvable equations for h(i) ≡ Ei TL or for f (i) = Pi (TA = Tl ) for fixed l ∈ L.
An elementary treatment of such ideas is in Pearce [277], who essentially
proved that (on n-vertex trees) maxx Ex TL is minimized by the n-star and
maximized by the n-path.

5.4 Notes on Chapter 5


Most of the material seems pretty straightforward, so we will give references
sparingly.
Introduction. The essential edge lemma is one of those oft-rediscovered
results which defies attribution.
Section 5.1.2. One can of course use the essential edge lemma to derive
the formula for mean hitting times in the general birth-and-death process.
This approach seems more elegant than the usual textbook derivation. Al-
though we are fans of martingale methods, we didn’t use them in Proposi-
tion 5.3(b), because to define the right martingale requires one to know the
answer beforehand!
For a birth-and-death chain the spectral representation involves orthog-
onal polynomials. This theory was developed by Karlin and McGregor in
the 1950s, and is summarized in Chapter 8 of Anderson [31]. It enables
one to write down explicit formulas for Pi (Xt = j) in special cases. But
it is less clear how to gain qualitative insight, or inequalities valid over all
birth-and-death chains, from this approach.
An alternative approach which is more useful for our purposes is based
on Siegmund duality (see e.g. [31] Section 7.4). Associated with a birth-and-
death process (Xt ) is another birth-and-death process (Yt ) which is “dual”
in the sense that

Pi (Xt ≤ j) = Pj (Yt ≥ i) for all i, j, t



and whose transition rates have a simple specification in terms of those of


(Xt ). It is easy to see that τ1 for (Xt ) is equivalent to maxj Ej T0,n for (Yt ),
for which there is an explicit formula. This gives an alternative to (5.16).
Section 5.2.
That the barbell is a good candidate for an “extremal” graph with re-
spect to random walk properties was realized by Landau and Odlyzko [219],
who computed the asymptotics of τ2 , and by Mazo [261], who computed
the asymptotics of the unweighted average of (Ei Tj ; i, j ∈ I), which in this
example is asymptotically our τ0 . Note we were able to give a one-line
argument for the asymptotics of τ2 by relying on the general fact τ2 ≤ τ0 .
Formulas for quantities associated with random walk on the d-cube and
with the Ehrenfest urn model have been repeatedly rediscovered, and we
certainly haven’t given all the known results. Bingham [51] has an extensive
bibliography. Palacios [275] uses the simple “resistance” argument used in
the text, and notes that the same argument can be used on the Platonic
graphs. Different methods of computing E0 T1 lead to formulas looking
different from our (5.69), for instance
E0 T1 = d Σ_{i=1}^{d} 2^{i−1}/i     [216], eq. (4.27)
      = d Σ_{1≤j≤d, j odd} j^{−1} (d choose j)     [51].

Similarly, one can get different-looking expressions for τ0 . Wilf [337] lists 54
identities involving binomial coefficients—it would be amusing to see how
many could be derived by calculating a random walk on the d-cube quantity
in two different ways!
Comparing our treatment of dense regular graphs (Example 5.16) with
that in [272] should convince the reader of the value of general theory.
Section 5.3. An early reference to formulas for the mean and variance
of hitting times on a tree (Theorem 5.20) is Moon [264], who used less
intuitive generating function arguments. The formulas for the mean have
been repeatedly rediscovered.
Of course there are many other questions we can ask about random walk
on trees. Some issues treated later are
xxx list.
xxx more sophisticated ideas in Lyons [245].
Chapter 6

Cover Times (October 31, 1994)

The maximal mean hitting time maxi,j Ei Tj arises in many contexts. In


Chapter 5 we saw how to compute this in various simple examples, and the
discussion of τ ∗ in Chapter 4 indicated general methods (in particular, the
electrical resistance story) for upper bounding this quantity. But what we’ve
done so far doesn’t answer questions like “how large can τ* be, for random
walk on an n-vertex graph”. Such questions are dealt with in this Chapter, in
parallel with a slightly different topic. The cover time for an n-state Markov
chain is the random time C taken for the entire state-space I to be visited.
Formally,
C ≡ max_j Tj.
It is sometimes mathematically nicer to work with the “cover-and-return”
time
C + ≡ min{t ≥ C : Xt = X0 }.
There are several reasons why cover times are interesting.
• Several applications involve cover times directly: graph connectivity
algorithms (section 6.8.2), universal traversal sequences (section 6.8.1),
the “white screen problem” (Chapter 1 yyy)
• There remains an interesting “computability” open question (section
6.8.3)
• In certain “critical” graphs, the uncovered subset at the time when
the graph is almost covered is believed to be “fractal” (see the Notes
on Chapter 7).


We are ultimately interested in random walks on unweighted graphs,


but some of the arguments have as their natural setting either reversible
Markov chains or general Markov chains, so we sometimes switch to those
settings. Results are almost all stated for discrete-time walks, but we occa-
sionally work with continuized chains in the proofs, or to avoid distracting
complications in statements of results. Results often can be simplified or
sharpened under extra symmetry conditions, but such results and examples
are deferred until Chapter 7.
xxx contents of chapter

6.1 The spanning tree argument


Except for Theorem 6.1, we consider in this section random walk on an n-
vertex unweighted graph. Results can be stated in terms of the number of
edges |E| of the graph, but to aid comparison with results involving minimal
or maximal degree it is helpful to state results in terms of average degree d̄:

d̄ = 2|E|/n;   |E| = d̄n/2.
The argument for Theorem 6.1 goes back to Aleliunas et al [25]. Though
elementary, it can be considered the first (both historically and logically)
result which combines Markov chain theory with graph theory in a non-
trivial way.
Consider random walk on a weighted graph. Recall from Chapter 3 yyy
the edge-commute inequality: for an edge (v, x)

Ev Tx + Ex Tv ≤ w/wvx     (weighted)     (6.1)
             ≤ d̄n        (unweighted).  (6.2)

One can alternatively derive these inequalities from the commute interpre-
tation of resistance (Chapter 3 yyy), since the resistance between x and v
is at most 1/wvx .
Theorem 6.1 For random walk on a weighted graph,
max_v Ev C+ ≤ w min_T Σ_{e∈T} 1/we

where the min is over spanning trees T. In the unweighted case

max_v Ev C+ ≤ d̄n(n − 1).

Proof. Given a spanning tree T and a vertex v, there is a path v =


v0 , v1 , . . . , v2n−2 = v which traverses each edge of the tree once in each
direction, and in particular visits every vertex. So
Ev C+ ≤ Σ_{j=0}^{2n−3} E_{vj} T_{vj+1}
      = Σ_{e=(v,x)∈T} (Ev Tx + Ex Tv)
      ≤ Σ_{e∈T} w/we     by (6.1).

This gives the weighted case, and in the unweighted case w = d̄n and each
spanning tree has Σ_{e∈T} 1/we = n − 1. 2

Note that in the unweighted case, the bound is at most n(n − 1)². On the
barbell (Chapter 5 Example yyy) it is easy to see that min_i Ei C = Ω(n³),
so the maximal value of any formalization of “mean cover time”, over n-
vertex graphs, is Θ(n³). Results and conjectures on the optimal numerical
constants in the Θ(n³) upper bounds are given in section 6.3.
Corollary 6.2 On an unweighted n-vertex tree, Ev Cv+ ≤ 2(n − 1)², with
equality iff the tree is the n-path and v is a leaf.
Proof. The inequality follows from Theorem 6.1. On the n-path with leaves
v, z we have Ev Cv+ = Ev Tz + Ez Tv = 2(n − 1)². 2
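The equality case of Corollary 6.2 can be confirmed exactly by dynamic programming over (current vertex, visited set) — an illustrative computation of ours, feasible only for tiny graphs:

```python
import numpy as np
from itertools import combinations

# Exact E_v C^+ by dynamic programming over (vertex, visited-set) states
# (our illustrative check; feasible only for very small graphs).
def mean_cover_and_return(adj, v0):
    n = len(adj)
    full = (1 << n) - 1
    E = {}  # E[(v, S)] = expected remaining time from v with visited set S
    for size in range(n, 0, -1):  # visited sets, largest first
        for comb in combinations(range(n), size):
            S = sum(1 << v for v in comb)
            verts = list(comb)
            if S == full:
                # everything visited: remaining time is the plain hitting time of v0
                idx = [v for v in range(n) if v != v0]
                A, b = np.eye(n - 1), np.ones(n - 1)
                for a, v in enumerate(idx):
                    for x in adj[v]:
                        if x != v0:
                            A[a, idx.index(x)] -= 1 / len(adj[v])
                h = dict(zip(idx, np.linalg.solve(A, b)))
                h[v0] = 0.0
                for v in verts:
                    E[(v, S)] = h[v]
                continue
            # linear system over states (v, S), v in S; steps leaving S hit known states
            A, b = np.eye(size), np.ones(size)
            for a, v in enumerate(verts):
                for x in adj[v]:
                    p = 1 / len(adj[v])
                    if S >> x & 1:
                        A[a, verts.index(x)] -= p
                    else:
                        b[a] += p * E[(x, S | 1 << x)]
            sol = np.linalg.solve(A, b)
            for a, v in enumerate(verts):
                E[(v, S)] = sol[a]
    return E[(v0, 1 << v0)]

path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(round(mean_cover_and_return(path4, 0), 6))  # 18.0, matching 2(n-1)^2
```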
It is worth dissecting the proof of Theorem 6.1. Two different inequalities
are used in the proof. Inequality (6.2) is an equality iff the edge is essential,
so the second inequality in the proof is an equality iff the graph is a tree.
But the first inequality in the proof bounds C + by the time to traverse a
spanning tree in a particular order, and is certainly not sharp on a general
tree, but only on a path. This explains Corollary 6.2. More importantly,
these remarks suggest that the bound d̄n(n − 1) in Theorem 6.1 will be
good iff there is some fixed “essential path” in the graph, and the dominant
contribution to C is from the time taken to traverse that path (as happens
on the barbell).
There are a number of variations on the theme of Theorem 6.1, and we
will give two. The first (due to Zuckerman [342], whose proof we follow)
provides a nice illustration of probabilistic technique.
Proposition 6.3 Write Ce for the time to cover all edges of an unweighted
graph, i.e. until each edge (v, w) has been traversed in each direction. Then

max_v Ev Ce ≤ 11 d̄n².

Proof. Fix a vertex v and a time t0 . Define “excursions”, starting and ending
at v, as follows. In each excursion, wait until all vertices have been visited,
then wait t0 longer, then end the excursion at the next visit to v. Writing
Si for the time at which the i’th excursion ends, and N for the (random)
number of excursions required to cover each edge in each direction, we have

SN = min{Si : Si ≥ Ce }

and so by Wald’s identity (yyy refs)

Ev Ce ≤ Ev SN = Ev N × Ev S1 . (6.3)

Clearly
Ev S1 ≤ Ev C + t0 + max_w Ew Tv ≤ t0 + 2 max_i Ei C.

To estimate the other factor, we shall first show

Pv(N > 2) ≤ m³/t0²     (6.4)

where m ≡ d̄n is the number of directed edges. Fix a directed edge (w, x),
say. By Chapter 3 Lemma yyy the mean time, starting at x, until (w, x) is
traversed equals m. So the chance, starting at x, that (w, x) is not traversed
before time t0 is at most m/t0 . So using the definition of excursion, the
chance that (v, w) is not traversed during the first excursion is at most
m/t0 , so the chance it is not traversed during the first two excursions is at
most (m/t0 )2 . Since there are m directed edges, (6.4) follows.
Repeating the argument for (6.4) gives
Pv(N > 2j) ≤ (m³/t0²)^j,   j ≥ 0,

and hence, assuming m³ < t0²,

Ev N ≤ 2/(1 − m³/t0²).

Putting t0 = ⌈2m^{3/2}⌉ gives Ev N ≤ 8/3. Substituting into (6.3),

max_v Ev Ce ≤ (8/3)(⌈2m^{3/2}⌉ + 2 max_v Ev C).

Now Theorem 6.1 says max_v Ev C ≤ m(n − 1) ≤ mn − 1, so

max_v Ev Ce ≤ (8/3)(2m^{3/2} + 2mn)
           = (16/3) m(m^{1/2} + n)
           ≤ (32/3) mn,
establishing the Proposition. 2
Another variant of Theorem 6.1, due to Kahn et al [205] (whose proof
we follow), uses a graph-theoretical lemma to produce a “good” spanning
tree in graphs of high degree.
Theorem 6.4 Writing d∗ = min_v dv,

max_v Ev C+ ≤ 6 d̄n²/d∗     (6.5)

and so on a regular graph

max_v Ev C+ ≤ 6n².     (6.6)

To appreciate (6.6), consider


Example 6.5 Take an even number j ≥ 2 of cliques of size d ≥ 3, distinguish
two vertices vi, vi′ in the i’th clique (for each 0 ≤ i < j), remove the edges
(vi, vi′) and add the edges (vi′, v(i+1) mod j). This creates a (d − 1)-regular
graph with n = jd vertices.
Arguing as in the barbell example (Chapter 5 yyy), as d → ∞ with j varying
arbitrarily,
max_{v,w} Ev Tw ∼ (d/2) × d × (j²/4) ∼ n²/8.
Thus the O(n²) bound in (6.6) can’t be improved, even as a bound for
the smaller quantity max_{v,w} Ev Tw. (Note that in the example, d/n ≤ 1/2.
From the results in Chapter 5 Example yyy and Matthews’ method one gets
EC = O(n log n) for regular graphs with d/n bounded above 1/2.)
Here is the graph-theory lemma needed for the proof of Theorem 6.4.
Lemma 6.6 Let G be an n-vertex graph with minimal degree d∗ . There
exists a family of ⌈d∗/2⌉ spanning forests Fi such that
(i) Each edge of G appears in at most 2 forests.
(ii) Each component of each forest has size at least ⌈d∗/2⌉.

Proof. Replace each edge (i, j) of the graph by two directed edges (i →
j), (j → i). Pick an arbitrary v1 and construct a path v1 → v2 → . . . vq on
distinct vertices, stopping when the path cannot be extended. That is the
first stage of the construction of F1 . For the second stage, pick a vertex vq+1
not used in the first stage and construct a path vq+1 → vq+2 → . . . vr in
which no second-stage vertex is revisited, stopping when a first-stage vertex
is hit or when the path cannot be extended. Continue stages until all vertices
have been touched. This creates a directed spanning forest F1 . Note that
all the neighbors of vq must be amongst {v1 , . . . , vq−1 }, and so the size of
the component of F1 containing v1 is at least d∗ + 1, and similarly for the
other components of F1 .
Now delete from the graph all the directed edges used in F1 . Inductively
construct forests F2, F3, . . . , F_{⌈d∗/2⌉} in the same way. The same argument
shows that each component of Fi has size at least d∗ + 2 − i, because at a
“stopping” vertex v at most i − 1 of the directed edges out of v were used
in previous forests.
Proof of Theorem 6.4. Write m for the number of (undirected) edges.
For an edge e = (v, x) write be = Ev Tx + Ex Tv. Chapter 3 Lemma yyy says
Σ_e be = 2m(n − 1). Now consider the ⌈d∗/2⌉ forests Fi given by Lemma 6.6.
Since each edge appears in at most two forests,

Σ_i Σ_{e∈Fi} be ≤ 2 Σ_e be ≤ 4mn,

and so there exists a forest F with Σ_{e∈F} be ≤ 4mn/⌈d∗/2⌉ ≤ 8mn/d∗.


P

But each component of F has size at least dd∗ /2e, so F has at most 2n/d∗
components. So to extend F to a tree T requires adding at most 2n/d∗ − 1
edges (ej ), and for each edge e we have be ≤ 2m by (6.2). This creates a
spanning tree T with e∈T be ≤ 12mn/d∗ . As in the proof of Theorem 6.1,
P

this is an upper bound for Ev C + .

6.2 Simple examples of cover times
There are a few (and only a few) examples where one can study EC by
bare-hands exact calculations. Write h_n for the harmonic sum

h_n = ∑_{i=1}^{n} i^{−1} ∼ log n. (6.7)
(a) The coupon collector's problem. Many textbooks discuss this classical
problem, which involves C for the chain (X_t ; t ≥ 0) whose values are
independent and uniform on an n-element set, i.e. random walk on the
complete graph with self-loops. Write (cf. the proof of Matthews' method,
Chapter 2 yyy) C^m for the first time at which m distinct vertices have been
visited. Then each step following time C^m has chance (n − m)/n to hit a
new vertex, so E(C^{m+1} − C^m) = n/(n − m), and so

EC = ∑_{m=1}^{n−1} E(C^{m+1} − C^m) = n h_{n−1}. (6.8)
(By symmetry, Ev C is the same for each initial vertex, so we just write
EC.) It is also a textbook exercise (e.g. [133] p. 124) to obtain the limit
distribution

n^{−1}(C − n log n) →_d ξ (6.9)

where ξ has the extreme value distribution

P(ξ ≤ x) = exp(−e^{−x}), −∞ < x < ∞. (6.10)

We won't go into the elementary derivations of results like (6.9) here, because
in Chapter 7 yyy we give more general results.
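The telescoping identity (6.8) is easy to verify numerically. Here is a small sketch (our own code, not from the text): it checks EC = n h_{n−1} exactly over the rationals, and estimates EC by seeded simulation.

```python
import random
from fractions import Fraction

def h(k):
    # harmonic sum h_k = sum_{i=1}^k 1/i, as in (6.7)
    return sum(Fraction(1, i) for i in range(1, k + 1))

def exact_mean_cover(n):
    # EC = sum_{m=1}^{n-1} E(C^{m+1} - C^m) = sum_{m=1}^{n-1} n/(n-m)
    return sum(Fraction(n, n - m) for m in range(1, n))

n = 20
assert exact_mean_cover(n) == n * h(n - 1)   # identity (6.8)

def cover_time(n, rng):
    # i.i.d. uniform picks until all n values have been seen
    seen, t = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        t += 1
    return t

rng = random.Random(0)
trials = 2000
est = sum(cover_time(n, rng) for _ in range(trials)) / trials
assert abs(est - float(n * h(n - 1))) < 6.0  # Monte Carlo mean agrees to within noise
```

For n = 20 the exact value n h_{n−1} is about 71, and the simulated mean lands close to it; the wide tolerance reflects the Θ(n) standard deviation of C noted in (6.9).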
(b) The complete graph. The analysis of C for random walk on the
complete graph (i.e. without self-loops) is just a trivial variation of the
analysis above. Each step following time C^m has chance (n − m)/(n − 1) to
hit a new vertex, so

EC = (n − 1) h_{n−1} ∼ n log n. (6.11)

And the distribution limit (6.9) still holds. Because Ev Tw = n − 1 for w ≠ v,
we also have

EC + = EC + (n − 1) = (n − 1)(1 + h_{n−1}) ∼ n log n. (6.12)
(c) The n-star (Chapter 5 Example yyy). Here the visits to the leaves
(every second step) are exactly i.i.d., so we can directly apply the coupon
collector's problem. For instance, writing v for the central vertex and l for
a leaf,

El C = 2(n − 1) h_{n−2} ∼ 2n log n

Ev C + = 1 + El C + 1 = 2(n − 1) h_{n−1} ∼ 2n log n

and C/2 satisfies (6.9). Though we won't give the details, it turns out that
a clever inductive argument shows these are the minima over all trees.
Proposition 6.7 (Brightwell-Winkler [61]) On an n-vertex tree,

min_v Ev C ≥ 2(n − 1) h_{n−2}

min_v Ev C + ≥ 2(n − 1) h_{n−1}.
(d) The n-cycle. Random walk on the n-cycle is also easy to study.
At time C^m the walk has visited m distinct vertices, and the set of visited
vertices must form an interval [j, j + m − 1], say, where we add modulo n. At
time C^m the walk is at one of the endpoints of that interval, and C^{m+1} − C^m
is the time until the first of {j − 1, j + m} is visited, which by Chapter 5
yyy has expectation 1 × m. So

EC = ∑_{m=1}^{n−1} E(C^{m+1} − C^m) = ∑_{i=1}^{n−1} i = (1/2) n(n − 1).

There is also an expression for the limit distribution (see Notes).
The n-cycle also has an unexpected property. Let V denote the last
vertex to be hit. Then

P_0(V = v) = P_0(T_{v−1} < T_{v+1}) P_{v−1}(T_{v+1} < T_v) + P_0(T_{v+1} < T_{v−1}) P_{v+1}(T_{v−1} < T_v)
  = [(n − (v + 1))/(n − 2)] · [1/(n − 1)] + [(v − 1)/(n − 2)] · [1/(n − 1)]
  = 1/(n − 1).

In other words, the n-cycle has the property

For any initial vertex v_0, the last-visited vertex V is uniform on
the states excluding v_0.

Obviously the complete graph has the same property, by symmetry. Lovász
and Winkler [240] gave a short but ingenious proof that these are the only
graphs with that property, a result rediscovered in [179].
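Both facts about the n-cycle, the mean cover time n(n − 1)/2 and the uniformity of the last-visited vertex, are visible in a short simulation. A sketch (our own code; the seed and parameters are arbitrary):

```python
import random

def cycle_cover(n, rng):
    # simple random walk on the n-cycle from 0; return (cover time, last vertex to be hit)
    pos, seen, t, last = 0, {0}, 0, 0
    while len(seen) < n:
        pos = (pos + rng.choice((-1, 1))) % n
        t += 1
        if pos not in seen:
            seen.add(pos)
            last = pos
    return t, last

n, trials = 12, 4000
rng = random.Random(1)
results = [cycle_cover(n, rng) for _ in range(trials)]

mean_cover = sum(t for t, _ in results) / trials
assert abs(mean_cover - n * (n - 1) / 2) < 8   # exact mean cover time is 66

counts = [0] * n
for _, last in results:
    counts[last] += 1
assert counts[0] == 0                          # the starting vertex is never last
for v in range(1, n):
    assert 250 < counts[v] < 480               # each count is near trials/(n-1) = 363.6
```

The last-vertex counts are statistically indistinguishable from uniform over the n − 1 non-starting vertices, matching the displayed calculation.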

6.3 More upper bounds
We remain in the setting of random walk on an unweighted graph. Theorems
6.1 and 6.4 show that the mean cover times, and hence mean hitting times,
are O(n3 ) on irregular graphs and O(n2 ) on regular graphs, and examples
such as the barbell and the n-cycle show these bounds are the right order
of magnitude. Quite a lot of attention has been paid to sharpening the
constants in such bounds. We will not go into details, but will merely
record a very simple argument in section 6.3.1 and the best known results
in section 6.3.2.

6.3.1 Simple upper bounds for mean hitting times
Obviously max_j (Ei Tj + Ej Ti) ≤ Ei C + , so maximizing over i gives

τ∗ ≤ max_i Ei C + (6.13)

and the results of section 6.1 imply upper bounds on τ∗ . But implicit in
earlier results is a direct bound on τ∗ . The edge-commute inequality implies
that, for arbitrary v, x at distance ∆(v, x),

Ev Tx + Ex Tv ≤ d̄ n ∆(v, x) (6.14)

and hence

Corollary 6.8 τ∗ ≤ d̄ n ∆, where ∆ is the diameter of the graph.

It is interesting to compare the implications of Corollary 6.8 with what can
be deduced from (6.13) and the results of section 6.1. To bound ∆ in terms
of n alone, we have ∆ ≤ n − 1, and then Corollary 6.8 gives the same bound
τ∗ ≤ d̄ n(n − 1) as follows from Theorem 6.1. On the other hand, the very
simple graph-theoretic Lemma 6.10 gives (with Corollary 6.8) the following
bound, which removes a factor of 2 from the bound implied by Theorem 6.4.

Corollary 6.9 τ∗ ≤ 3 d̄ n²/d∗ , and so on a regular graph τ∗ ≤ 3n².

Lemma 6.10 ∆ ≤ 3n/d∗ .

Proof. Consider a path v_0, v_1, . . . , v_∆, where vertices v_0 and v_∆ are distance
∆ apart. Write A_i for the set of neighbors of v_i. Then A_i and A_j must be
disjoint when |j − i| ≥ 3. So a given vertex can be in at most 3 of the A's,
giving the final inequality of

(∆ + 1) d∗ ≤ ∑_{i=0}^{∆} d_{v_i} = ∑_{i=0}^{∆} |A_i| ≤ 3n.
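Lemma 6.10 is simple to sanity-check computationally. The following sketch (our own code; the circulant test graphs are an arbitrary choice) computes the exact diameter by breadth-first search and verifies ∆ ≤ 3n/d∗:

```python
from collections import deque

def diameter(adj):
    # exact diameter of a connected graph, by BFS from every vertex
    n = len(adj)
    best = 0
    for s in range(n):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

def circulant(n, k):
    # vertices 0..n-1, with i adjacent to i±1, ..., i±k (mod n); 2k-regular
    return [[(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)]

for n, k in [(30, 1), (30, 2), (40, 3)]:
    adj = circulant(n, k)
    d_star = min(len(a) for a in adj)   # minimal degree, = 2k here
    assert diameter(adj) <= 3 * n / d_star
```

For the plain cycle (k = 1) the diameter ⌊n/2⌋ already sits well under the 3n/d∗ = 3n/2 bound, consistent with the factor-3 slack in the disjointness argument above.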
6.3.2 Known and conjectured upper bounds
Here we record results without giving proofs. Write max for the maximum
over n-vertex graphs. The next result is the only case where the exact
extremal graph is known.

Theorem 6.11 (Brightwell-Winkler [62]) max max_{v,x} Ev Tx is attained
by the lollipop (Chapter 5 Example yyy) with m_1 = ⌊(2n + 1)/3⌋, taking x
to be the leaf.

Note that the implied asymptotic behavior is

max max_{v,w} Ev Tw ∼ (4/27) n³. (6.15)
Further asymptotic results are given by

Theorem 6.12 (Feige [143, 144])

max max_v Ev C + ∼ (4/27) n³ (6.16)
max min_v Ev C + ∼ (3/27) n³ (6.17)
max min_v Ev C ∼ (2/27) n³ (6.18)

The value in (6.16) is asymptotically attained on the lollipop, as in Theorem
6.11. Note that (6.15) and (6.16) imply the same 4n³/27 behavior for
intermediate quantities such as τ∗ and max_v Ev C. The values in (6.17) and
(6.18) are asymptotically attained by the graph consisting of an n/3-path
with a 2n/3-clique attached at the middle of the path.
The corresponding results for τ_0 and τ_2 are not known. We have τ_2 ≤
τ_0 ≤ min_v Ev C, the latter inequality from the random target lemma, and so
(6.18) implies

max τ_0 and max τ_2 ≤ (2/27 + o(1)) n³. (6.19)

But a natural guess is that the asymptotic behavior is that of the barbell,
giving the values below.

Open Problem 6.13 Prove the conjectures

max τ_0 ∼ (1/54) n³, max τ_2 ∼ (1/54) n³.
For regular graphs, none of the asymptotic values are known exactly.
A natural candidate for extremality is the necklace graph (Chapter 5 yyy),
where the time parameters are asymptotically 3/4 times the parameters
for the n-path. So the next conjecture uses the numerical values from the
necklace graph.

Open Problem 6.14 Prove the conjectures that, over the class of regular
n-vertex graphs,

max max_{i,j} Ei Tj ∼ (3/4) n²
max τ∗ ∼ (3/2) n²
max max_v Ev C + ∼ (3/2) n²
max max_v Ev C ∼ (15/16) n²
max min_v Ev C ∼ (3/4) n²
max τ_0 ∼ (1/4) n²
max τ_2 ∼ (3/(2π²)) n²

The best bounds known are those implied by the following result.

Theorem 6.15 (Feige [144]) On a d-regular graph,

max_v Ev C ≤ 2n²

max_v Ev C + ≤ 2n² (1 + (d − 2)/(d + 1)²) ≤ 13n²/6.

6.4 Short-time bounds

It turns out that the bound "τ∗ ≤ 3n² on a regular graph" given by Corollary
6.9 can be used to obtain bounds concerning the short-time behavior
of random walks. Such bounds, and their applications, are the focus of
this section. We haven't attempted to optimize numerical constants (e.g.
Theorem 6.15 implies that τ∗ ≤ 13n²/6 on regular graphs). More elaborate
arguments (see Notes) can be used to improve constants and to deal with the
irregular case, but we'll restrict attention to the regular case for simplicity.
Write N_i(t) for the number of visits to i before time t, i.e. during [0, t − 1].
Proposition 6.16 Consider random walk on an n-vertex regular graph G.
Let A be a proper subset of vertices and let i ∈ A.
(i) Ei T_{A^c} ≤ 4|A|².
(ii) Ei N_i(T_{A^c}) ≤ 5|A|.
(iii) Ei N_i(t) ≤ 5t^{1/2}, 0 ≤ t < 5n².
(iv) P_π(T_i < t) ≥ (1/(5n)) min(t^{1/2}, n).

Remarks. For part (i) we give a slightly fussy argument repeating ingredients
of the proof of Corollary 6.9, since these are needed for (ii). The point of
(iv) is to get a bound for t ≪ E_π T_i. On the n-cycle, it can be shown that
the probability in question really is Θ(min(t^{1/2}/n, 1)), uniformly in n and t.
Proof of Proposition 6.16. Choose a vertex b ∈ A^c at minimum distance
from i, and let i = i_0, i_1, . . . , i_j, i_{j+1} = b be a minimum-length path. Let G∗
be the subgraph on vertex-set A, and let G∗∗ be the subgraph on vertex-set
A together with all the neighbors of i_j. Write superscripts ∗ and ∗∗ for the
random walks on G∗ and G∗∗. Then

Ei T_{A^c} ≤ Ei T∗∗_{A^c} = Ei T∗_{i_j} + E_{i_j} T∗∗_{A^c}.

The inequality holds because we can specify the walk on G in terms of the
walk on G∗∗ with possibly extra chances of jumping to A^c at each step (this
is a routine stochastic comparison argument, written out as an example in
Chapter 14 yyy). The equality holds because the only routes in G∗∗ from i
to A^c are via i_j, by the minimum-length assumption. Now write E, E∗, E∗∗
for the edge-sets. Using the commute interpretation of resistance,

Ei T∗_{i_j} ≤ 2|E∗| j. (6.20)

Writing q ≥ 1 for the number of neighbors of i_j in A^c, the effective resistance
in G∗∗ between i_j and A^c is 1/q, so the commute interpretation of resistance
gives the first equality in

E_{i_j} T∗∗_{A^c} = 2|E∗∗|(1/q) − 1 = 2|E∗|/q + 1 ≤ 2|E∗| + 1 ≤ |A|².

The neighbors of i_0, i_1, . . . , i_{j−1} are all in A, so the proof of Lemma 6.10
implies

j ≤ 3|A|/d (6.21)

where d is the degree of G. Since 2|E∗| ≤ d|A|, the bound in (6.20) is at
most 3|A|², and part (i) follows.
For part (ii), by the electrical network analogy (Chapter 3 yyy) the
quantity in question equals

1/P_i(T_{A^c} < T_i^+) = w_i r(i, A^c) = d r(i, A^c) (6.22)

where r(i, A^c) is the effective resistance in G between i and A^c. Clearly this
effective resistance is at most the distance (j + 1, in the argument above)
from i to A^c, which by (6.21) is at most 3|A|/d + 1. Thus the quantity (6.22)
is at most 3|A| + d, establishing the desired result in the case d ≤ 2|A|. If
d > 2|A| then there are at least d − |A| edges from i to A^c, so r(i, A^c) ≤ 1/(d − |A|)
and the quantity (6.22) is at most d/(d − |A|) ≤ 2 ≤ 5|A|.

For part (iii), fix a state i and an integer time t. Write N_i(t) for the
number of visits to i before time t, i.e. during times {0, 1, . . . , t − 1}. Then

t/n = E_π N_i(t) ≤ P_π(T_i < t) Ei N_i(t) (6.23)

the inequality by conditioning on T_i. Now choose real s such that ns ≥ t.
Since ∑_j Ei N_j(t) = t, the set

A ≡ {j : Ei N_j(t) > s}

has |A| < t/s ≤ n, so part (ii) implies

Ei N_i(T_{A^c}) ≤ 5t/s. (6.24)

Now by regularity we can rewrite A as {j : E_j N_i(t) > s}, and so by
conditioning on T_{A^c}

Ei N_i(t) ≤ Ei N_i(T_{A^c}) + s.

Setting s = (5t)^{1/2} and combining with (6.24) gives (iii). The bound in (iv)
now follows from (iii) and (6.23).

6.4.1 Covering by multiple walks

The first application is a variant of work of Broder et al [67] discussed further
in section 6.8.2.

Proposition 6.17 On a regular n-vertex graph, consider K independent
random walks, each started at a uniform random vertex. Let C^{[K]} be the
time until every vertex has been hit by some walk. Then

E C^{[K]} ≤ (25 + o(1)) n² log² n / K²  as n → ∞ with K ≥ 6 log n.
Remarks. The point is the 1/K² dependence on K. On the n-cycle, for K ∼ εn
it can be shown that initially the largest gap between adjacent walkers is
Θ(log n) and that E C^{[K]} = Θ(log² n), so in this respect the bound is sharp.
Of course, for K ≤ log n the bound would be no improvement over Theorem
6.4.

Proof. As usual write T_i for the hitting time on i for a single walk, and
write T_i^{[K]} for the first time i is visited by some walk. Then

P_π(T_i^{[K]} ≥ t) = (P_π(T_i ≥ t))^K = (1 − P_π(T_i < t))^K
  ≤ exp(−K P_π(T_i < t)) ≤ exp(−K t^{1/2}/(5n))

by Proposition 6.16 (iv), provided t ≤ n². So

P(C^{[K]} ≥ t) ≤ ∑_i P(T_i^{[K]} ≥ t) ≤ n exp(−K t^{1/2}/(5n)), t ≤ n². (6.25)

The bound becomes 1 for t_0 = (25n²/K²) log² n. So

E C^{[K]} = ∑_{t=1}^{∞} P(C^{[K]} ≥ t)
  ≤ ⌈t_0⌉ + ∑_{t=⌈t_0⌉+1}^{n²−1} n exp(−K t^{1/2}/(5n)) + ∑_{t=n²}^{∞} P(C^{[K]} ≥ t)
  = ⌈t_0⌉ + S_1 + S_2, say,

and the issue is to show that S_1 and S_2 are o(t_0). To handle S_2, split the set
of K walks into subsets of sizes K − 1 and 1. By independence, for t ≥ n²
we have P(C^{[K]} ≥ t) ≤ P(C^{[K−1]} ≥ n²) P(C^{[1]} ≥ t). Then

S_2 ≤ P(C^{[K−1]} ≥ n²) E C^{[1]}  (by summing over t)
  ≤ n exp(−(K − 1)/5) · 6n²  (by (6.25) and Theorem 6.4)
  = o(t_0)  (using the hypothesis K ≥ 6 log n).

To bound S_1 we start with a calculus exercise: for u > 1

∫_{u²}^{∞} exp(−x^{1/2}) dx = ∫_{u}^{∞} 2y exp(−y) dy  (by putting x = y²)
  ≤ ∫_{u}^{∞} 2e^{−1} u exp(−((u − 1)/u) y) dy  (using y/u ≤ exp(y/u − 1))
  = 2u² exp(−u)/(u − 1).

The sum S_1 is bounded by the corresponding integral over [t_0, ∞) and
the obvious calculation, whose details we omit, bounds this integral by
2t_0/(log n − 1).
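The 1/K² speedup is visible even in a small simulation. A sketch (our own code; the cycle, seed, and walker counts are arbitrary choices) comparing one walker against four:

```python
import random

def multi_cover_time(n, K, rng):
    # K independent walkers on the n-cycle, each started at a uniform vertex;
    # returns the time C^{[K]} at which every vertex has been hit by some walk
    pos = [rng.randrange(n) for _ in range(K)]
    seen = set(pos)
    t = 0
    while len(seen) < n:
        t += 1
        pos = [(p + rng.choice((-1, 1))) % n for p in pos]
        seen.update(pos)
    return t

n, trials = 24, 300
rng = random.Random(7)
avg = lambda K: sum(multi_cover_time(n, K, rng) for _ in range(trials)) / trials
c1, c4 = avg(1), avg(4)
# a single walker needs on the order of n^2/2 steps; four walkers are far faster
assert c4 < c1
```

On this 24-cycle the single-walker mean is near n²/2 ≈ 288, while four walkers only need to close gaps of typical length n/4, which is where the quadratic saving in K comes from.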

6.4.2 Bounding point probabilities

Our second application is to universal bounds on point probabilities. A quite
different universal bound will be given in Chapter yyy.

Proposition 6.18 For continuous-time random walk on a regular n-vertex
graph,

P_i(X_t = j) ≤ 5t^{−1/2}, t ≤ n²
P_i(X_t = j) ≤ 1/n + (K_1/n) exp(−t/(K_2 n²)), t ≥ n²

where K_1 and K_2 are absolute constants.

In discrete time one can get essentially the same result, but with the bounds
multiplied by 2, though we shall not give details (see Notes).
Proof. P_i(X_t = i) is decreasing in t, so

P_i(X_t = i) ≤ t^{−1} ∫_0^t P_i(X_s = i) ds = t^{−1} Ei N_i(t) ≤ 5t^{−1/2}

where the last inequality is Proposition 6.16 (iii), whose proof is unchanged
in continuous time, and which holds for t ≤ n². This gives the first inequality
when i = j, and the general case follows from Chapter 3 yyy.

For the second inequality, recall the definition of separation s(t) from
Chapter 4 yyy. Given a vertex i and a time t, there exists a probability
distribution θ such that

P_i(X_t ∈ ·) = (1 − s(t)) π + s(t) θ.

Then for u ≥ 0,

P_i(X_{t+u} = j) − 1/n = s(t) (P_θ(X_u = j) − 1/n).
Thus, defining q(t) = max_{i,j} (P_i(X_t = j) − 1/n), we have proved

q(t + u) ≤ s(t) q(u); t, u ≥ 0. (6.26)

Now q(n²) ≤ 4/n by the first inequality of the Proposition, and s(τ_1^{(1)}) = e^{−1}
by definition of τ_1^{(1)} in Chapter 4 yyy, so by iterating (6.26) we have

q(n² + m τ_1^{(1)}) ≤ (4/n) e^{−m}, m ≥ 1. (6.27)

But by Chapter 4 yyy we have τ_1^{(1)} ≤ K τ∗ for an absolute constant K, and
then by Corollary 6.9 we have τ_1^{(1)} ≤ 3Kn². The desired inequality now
follows from (6.27).

6.4.3 A cat and mouse game

Here we reconsider the cat and mouse game discussed in Chapter 4 section
yyy. Recall that the cat performs continuous-time random walk on an
n-vertex graph, and the mouse moves according to some arbitrary deterministic
strategy. Let M be the first meeting time, and let m∗ be the maximum
of EM over all pairs of initial vertices and all strategies for the mouse.

Proposition 6.19 On a regular graph, m∗ ≤ Kn² for some absolute constant K.

Proof. The proof relies on Proposition 6.18, whose conclusion implies there
exists a constant K such that

p∗(t) ≡ max_{x,v} p_{vx}(t) ≤ 1/n + K t^{−1/2}; 0 ≤ t < ∞.

Consider running the process forever. The point is that, regardless of the
initial positions, the chance that the cat and mouse are "together" (i.e. at
the same vertex) at time u is at most p∗(u). So in the case where the cat
starts with the (uniform) stationary distribution, and where f is the density
function of M,

P(together at time s) = ∫_0^s f(u) P(together at time s | M = u) du
  ≤ ∫_0^s f(u) p∗(s − u) du
  ≤ (1/n) P(M ≤ s) + K ∫_0^s f(u) (s − u)^{−1/2} du.
So

t/n = ∫_0^t P(together at time s) ds  (by stationarity)
  ≤ (1/n) ∫_0^t P(M ≤ s) ds + K ∫_0^t f(u) du ∫_u^t (s − u)^{−1/2} ds
  = t/n − (1/n) ∫_0^t P(M > s) ds + 2K ∫_0^t f(u) (t − u)^{1/2} du
  ≤ t/n − (1/n) E min(M, t) + 2K t^{1/2}.

Rearranging, E min(M, t) ≤ 2Kn t^{1/2}. Writing t_0 = (4Kn)², Markov's
inequality gives P(M ≤ t_0) ≥ 1/2. This inequality assumes the cat starts
with the stationary distribution. When it starts at some arbitrary vertex,
we may use the definition of separation s(u) (recall Chapter 4 yyy) to see
P(M ≤ u + t_0) ≥ (1 − s(u))/2. Then by iteration, EM ≤ 2(u + t_0)/(1 − s(u)). So
appealing to the definition of τ_1^{(1)},

m∗ ≤ (2/(1 − e^{−1})) (t_0 + τ_1^{(1)}).

But results from Chapter 4 and this chapter show τ_1^{(1)} = O(τ∗) = O(n²),
establishing the Proposition.

6.5 Hitting time bounds and connectivity

The results so far in this chapter may be misleading in that upper bounds
accommodating extremal graphs are rather uninformative for "typical" graphs.
For a family of n-vertex graphs with n → ∞, consider the property

τ∗ = O(n). (6.28)

(In this order-of-magnitude discussion, τ∗ is equivalent to max_{v,x} Ev Tx.)
Recalling from Chapter 3 yyy that τ∗ ≥ 2(n − 1), we see that (6.28) is
equivalent to τ∗ = Θ(n). By Matthews' method (repeated as Theorem
6.26 below), (6.28) implies EC = O(n log n), and then by Theorem 6.31 we
have EC = Θ(n log n). Thus understanding when (6.28) holds is fundamental
to understanding order-of-magnitude questions about cover times. But
surprisingly, this question has not been studied very carefully. An instructive
example is the d-dimensional torus (Chapter 5 Example yyy), where
(6.28) holds iff d ≥ 3. This example, and other examples of vertex-transitive
graphs satisfying (6.28) discussed in Chapter 8, suggest that (6.28) is frequently
true. More concretely, the torus example suggests that the following
condition ("the isoperimetric property in 2 + ε dimensions") may be sufficient.
Open Problem 6.20 Show that for real 1/2 < γ < 1 and δ > 0, there
exists a constant Kγ,δ with the following property. Let G be a regular n-
vertex graph such that, for any subset A of vertices with |A| ≤ n/2, there
exist at least δ|A|γ edges between A and Ac . Then τ ∗ ≤ Kγ,δ n.
The γ = 1 case is implicit in results from previous chapters. Chapter 3 yyy
gave the bound maxi,j Ei Tj ≤ 2 maxj Eπ Tj , and Chapter 3 yyy gave the
bound Eπ Tj ≤ τ2 /πj . This gives the first assertion below, and the second
follows from Cheeger’s inequality.
Corollary 6.21 On a regular graph,

max_{v,x} Ev Tx ≤ 2n τ_2 ≤ 16n τ_c².

Thus the "expander" property that τ_2 = O(1), or equivalently that τ_c =
O(1), is sufficient for (6.28), and the latter is the γ = 1 case of Open Problem
6.20.

6.5.1 Edge-connectivity
At the other end of the spectrum from expanders, we can consider graphs
satisfying only a little more than connectivity.
xxx more details in proofs – see Fill’s comments.
Recall that a graph is r-edge-connected if for each proper subset A of
vertices there are at least r edges linking A with A^c. By a variant of Menger's
theorem (e.g. [86] Theorem 5.11), for each pair (a, b) of vertices in such a
graph, there exist r paths (a = v_0^i, v_1^i, v_2^i, . . . , v_{m_i}^i = b), i = 1, . . . , r,
for which the edges (v_j^i, v_{j+1}^i) are all distinct.
Proposition 6.22 For an r-edge-connected graph,

τ∗ ≤ d̄ n² ψ(r)/r²

where ψ is defined by

ψ(i(i + 1)/2) = i, with ψ(·) linear on [i(i + 1)/2, (i + 1)(i + 2)/2].

Note ψ(r) ∼ (2r)^{1/2}. So for a d-regular, d-edge-connected graph, the bound
becomes ∼ 2^{1/2} d^{−1/2} n² for large d, improving on the bound from Corollary
6.9. Also, the Proposition improves on the bound implied by Chapter 4 yyy
in this setting.
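To make the definition of ψ concrete, here is a sketch (our own code, not from the text) that interpolates linearly between triangular numbers and confirms the growth ψ(r) = (1 + o(1)) (2r)^{1/2} implied by the definition:

```python
import math

def psi(x):
    # psi(i(i+1)/2) = i, linear in between, as in Proposition 6.22
    i = (math.isqrt(8 * int(x) + 1) - 1) // 2   # largest i with i(i+1)/2 <= x
    lo, hi = i * (i + 1) // 2, (i + 1) * (i + 2) // 2
    return i + (x - lo) / (hi - lo)

# values at triangular numbers
assert psi(1) == 1 and psi(3) == 2 and psi(6) == 3 and psi(10) == 4

# growth check: psi(r) / sqrt(2r) -> 1
for r in (10**3, 10**5, 10**7):
    assert abs(psi(r) / math.sqrt(2 * r) - 1) < 0.05
```

The square-root growth is exactly what turns the r² in the denominator of the Proposition into the d^{−1/2} saving noted above.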
Proof. Given vertices a, b, construct a unit flow from a to b by putting
flow 1/r along each of the r paths (a = v_0^i, v_1^i, v_2^i, . . . , v_{m_i}^i = b). By
Chapter 3 Theorem yyy

Ea Tb + Eb Ta ≤ d̄ n (1/r)² M

where M = ∑_i m_i is the total number of edges in the r paths. So the
issue is bounding M. Consider the digraph of all edges (v_j^i, v_{j+1}^i). If this
digraph contained a directed cycle, we could eliminate the edges on that
cycle, and still create r paths from a to b using the remaining edges. So we
may assume the digraph is acyclic, which implies we can label the vertices
as a = 1, 2, 3, . . . , n = b in such a way that each edge (j, k) has k > j. So
the desired result follows from

Lemma 6.23 In a digraph on vertices {1, 2, . . . , n} consisting of r paths
1 = v_0^i < v_1^i < v_2^i < . . . < v_{m_i}^i = n in which all edges are distinct, the total
number of edges is at most n ψ(r).

Proof.
xxx give proof and picture.

Example 6.24 Take vertices {0, 1, . . . , n − 1} and edges (i, i + u mod n) for
all i and all 1 ≤ u ≤ κ.

This example highlights the "slack" in Proposition 6.22. Regard κ as large
and fixed, and n → ∞. Random walk on this graph is classical random
walk (i.e. sums of independent steps) on the n-cycle, where the steps have
variance σ² = (1/κ) ∑_{u=1}^{κ} u², and it is easy to see

τ∗ = 2 E_0 T_{⌊n/2⌋} ∼ (n/2)²/σ² = Θ(n²/κ²).

This is the bound Proposition 6.22 would give if the graph were Θ(κ²)-edge-connected.
And for a "typical" subset A such as an interval of length
greater than κ there are indeed Ω(κ²) edges crossing the boundary of A. But
by considering a singleton A we see that the graph is really only 2κ-edge-connected,
and Proposition 6.22 gives only the weaker O(n²/κ^{1/2}) bound.
xxx tie up with similar discussion of τ2 and connectivity being affected
by small sets; better than bound using τc only.
6.5.2 Equivalence of mean cover time parameters
Returning to the order-of-magnitude discussion at the start of section 6.5,
let us record the simple equivalence result. Recall (cf. Chapter 4 yyy) we
call parameters equivalent if their ratios are bounded by absolute constants.
Lemma 6.25 The parameters maxi Ei C + , Eπ C + , mini Ei C + , maxi Ei C
and Eπ C are equivalent for reversible chains, but mini Ei C is not equivalent
to these.
Proof. Of the five parameters asserted to be equivalent, it is clear that
max_i Ei C + is the largest, and that either min_i Ei C + or E_π C is the smallest,
so it suffices to prove

max_i Ei C + ≤ 4 E_π C (6.29)

max_j Ej C + ≤ 3 min_i Ei C + . (6.30)

Inequality (6.30) holds by concatenating three "cover-and-return" cycles
starting at i and considering the first hitting time on j in the first and
third cycles. In more detail, write

Γ(s) = min{u > s : (X_t : s ≤ t ≤ u) covers all states}.

For the chain started at i write C ++ = Γ(C + ) and C +++ = Γ(C ++ ). Since
Tj < C + we have Γ(Tj) ≤ C ++ . So the chain started at time Tj has covered
all states and returned to j by time C +++ , implying Ej C + ≤ EC +++ =
3Ei C + . For inequality (6.29), recall the random target lemma: the mean
time to hit a π-random state V equals τ_0, regardless of the initial distribution.
The inequality

Ei C + ≤ τ_0 + E_π C + τ_0 + E_π Ti

follows from the four-step construction:
(i) Start the chain at i and run until hitting a π-random vertex V at time T_V;
(ii) continue until time Γ(T_V);
(iii) continue until hitting an independent π-random vertex V′;
(iv) continue until hitting i.
But E_π Ti ≤ E_π C, and then by the random target lemma τ_0 ≤ E_π C, so
(6.29) follows.
For the final assertion, on the lollipop graph (Chapter 5 Example yyy)
one has min_i Ei C = Θ(n²) while the other quantities are Θ(n³). One can
also give examples on regular graphs (see Notes).
6.6 Lower bounds
6.6.1 Matthews' method
We restate Matthews' method (Chapter 2 yyy) as follows. The upper bound
is widely useful: we have already used it several times in this chapter, and
will use it several more times in the sequel.

Theorem 6.26 For a general Markov chain,

max_v Ev C ≤ h_{n−1} max_{i,j} Ei Tj .

And for any subset A of states,

min_v Ev C ≥ h_{|A|−1} min_{i≠j: i,j∈A} Ei Tj .

In Chapter 2 we proved the lower bound in the case where A was the entire
state space, but the result for general A follows by the same proof, taking
the J's to be a uniform random ordering of the states in A. One obvious
motivation for the more general formulation comes from the case of trees,
where for a leaf l we have min_j El Tj = 1, so the lower bound with A being
the entire state space would be just h_{n−1}. We now illustrate use of the more
general formulation.
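Theorem 6.26 can be checked mechanically on a small chain. The sketch below (our own code, not from the text) computes exact mean hitting times for the 8-cycle by solving the linear system over the rationals, then verifies that the two Matthews bounds bracket the exact mean cover time n(n − 1)/2 from section 6.2.

```python
from fractions import Fraction

def hitting_times(P, j):
    # mean hitting times h_i = E_i T_j: solve h_i = 1 + sum_{k != j} P[i][k] h_k
    n = len(P)
    idx = [i for i in range(n) if i != j]
    m = len(idx)
    # augmented rows of (I - P restricted to states != j | 1), Gauss-Jordan over Q
    A = [[Fraction(int(r == c)) - P[idx[r]][idx[c]] for c in range(m)] + [Fraction(1)]
         for r in range(m)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    h = {j: Fraction(0)}
    for r, i in enumerate(idx):
        h[i] = A[r][m]
    return h

n = 8
P = [[Fraction(0)] * n for _ in range(n)]
for i in range(n):
    P[i][(i - 1) % n] = P[i][(i + 1) % n] = Fraction(1, 2)

ET = {(i, j): hitting_times(P, j)[i] for i in range(n) for j in range(n)}
hm = lambda k: sum(Fraction(1, i) for i in range(1, k + 1))

upper = hm(n - 1) * max(ET.values())
lower = hm(n - 1) * min(v for (i, j), v in ET.items() if i != j)
EC = Fraction(n * (n - 1), 2)   # exact mean cover time of the n-cycle
assert lower <= EC <= upper
```

The solver recovers the classical cycle formula Ei Tj = d(n − d) for vertices at distance d, so for n = 8 the bounds are h_7 · 7 ≈ 18.2 and h_7 · 16 ≈ 41.5, bracketing EC = 28.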

6.6.2 Balanced trees
We are accustomed to finding that problems on trees are simpler than problems
on general graphs, so it is a little surprising to discover that one
of the graphs where studying the mean cover time is difficult is the balanced
r-tree of height H (Chapter 5 Example yyy). Recall this tree has
n = (r^{H+1} − 1)/(r − 1) vertices, and that (by the commute interpretation
of resistance)

Ei Tj = 2m(n − 1) for leaves (i, j) distance 2m apart.

Now clearly Ei Tj is maximized by some pair of leaves, so max_{i,j} Ei Tj =
2H(n − 1). Theorem 6.26 gives

max_v Ev C ≤ 2H(n − 1) h_{n−1} ∼ 2Hn log n.

To get a lower bound, consider the set S_m of r^{H+1−m} vertices at depth
H + 1 − m, and let A_m be a set of leaves consisting of one descendant of
each element of S_m. The elements of A_m are at least 2m apart, so applying
the lower bound in Theorem 6.26

min_v Ev C ≥ max_m 2m(n − 1) h_{r^{H+1−m}}
  ∼ 2n log r · max_m m(H − m)
  ∼ (1/2) H² n log r.
It turns out that this lower bound is asymptotically off by a factor of 4,
while the upper bound is asymptotically correct.

Theorem 6.27 ([16]) On the balanced r-tree, as H → ∞ for arbitrary
starting vertex,

EC ∼ 2Hn log n ∼ 2H² r^{H+1} log r/(r − 1).

Improving the lower bound to obtain this result is not easy. The natural
approach (used in [16]) is to seek a recursion for the cover time distribution
C^{(H+1)} in terms of C^{(H)}. But the appropriate recursion is rather subtle
(we invite the reader to try to find it!) so we won't give the statement or
analysis of the recursion here.

6.6.3 A resistance lower bound
Our use of the commute interpretation of resistance has so far been only to
obtain upper bounds on commute times. One can also use “shorting” ideas
to obtain lower bounds, and here is a very simple implementation of that
idea.

Lemma 6.28 The effective resistance r(v, x) between vertices v and
x in a weighted graph satisfies

1/r(v, x) ≤ w_{v,x} + 1/( 1/(w_v − w_{v,x}) + 1/(w_x − w_{v,x}) ).

In particular, on an unweighted graph

r(v, x) ≥ (d_v + d_x − 2)/(d_v d_x − 1) if (v, x) is an edge
r(v, x) ≥ 1/d_v + 1/d_x if not
and on an unweighted d-regular graph

r(v, x) ≥ 2/(d + 1) if (v, x) is an edge
r(v, x) ≥ 2/d if not.

So on an unweighted d-regular n-vertex graph,

Ev Tx + Ex Tv ≥ 2dn/(d + 1) if (v, x) is an edge
Ev Tx + Ex Tv ≥ 2n if not.

Proof. We need only prove the first assertion, since the others follow by
specialization and by the commute interpretation of resistance. Let A be
the set of vertices which are neighbors of either v or x, but exclude v and x
themselves from A. Short the vertices of A together, to form a single vertex
a. In the shorted graph, the only way current can flow from v to x is directly
v → x or indirectly as v → a → x. So, using ′ to denote the shorted graph,
the effective resistance r′(v, x) in the shorted graph satisfies

1/r′(v, x) = w′_{v,x} + 1/( 1/w′_{v,a} + 1/w′_{x,a} ).

Now w′_{x,v} = w_{x,v}, w′_{v,a} = w_v − w_{v,x} and w′_{x,a} = w_x − w_{v,x}. Since shorting
decreases resistance, r′(v, x) ≤ r(v, x), establishing the first inequality.

6.6.4 General lower bounds
Chapter 3 yyy shows that, over the class of random walks on n-vertex graphs
or the larger class of reversible chains on n states, various mean hitting
time parameters are minimized on the complete graph. So it is natural to
anticipate a similar result for cover time parameters. But the next example
shows that some care is required in formulating conjectures.

Example 6.29 Take the complete graph on n vertices, and add an edge
(v, l) to a new leaf l.

Since random walk on the complete graph has mean cover time (n − 1)hn−1 ,
random walk on the enlarged graph has

El C = 1 + (n − 1)hn−1 + 2µ
where µ is the mean number of returns to l before covering. Now after each
visit to v, the walk has chance 1/n to visit l on the next step, and so the
mean number of visits to l before visiting some other vertex of the complete
graph equals 1/(n − 1). We may therefore write µ in terms of expectations
for random walk on the complete graph as

µ = (1/(n − 1)) Ev(number of visits to v before C)
  = (1/(n − 1)) Ev(number of visits to v before C + )
  = (1/(n − 1)) (1/n) Ev C +  by Chapter 2 Proposition yyy
  = (1 + h_{n−1})/n  by (6.12).

This establishes an expression for El C, which (after a brief calculation) can
be rewritten as

El C = n h_n − (1 − 2/n)(h_n − 1/n).

Now random walk on the complete (n + 1)-graph has mean cover time n h_n,
so El C is smaller in our example than in the complete graph.
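The comparison at the end of the example is pure arithmetic and can be verified directly. A sketch (our own code), using the pre-simplification expression El C = 1 + (n − 1)h_{n−1} + 2µ; note the inequality only takes hold once n is moderately large:

```python
from fractions import Fraction

def h(k):
    # harmonic sum h_k = sum_{i=1}^k 1/i
    return sum(Fraction(1, i) for i in range(1, k + 1))

def E_l_C(n):
    # E_l C = 1 + (n-1) h_{n-1} + 2*mu, with mu = (1 + h_{n-1})/n as derived above
    mu = (1 + h(n - 1)) / n
    return 1 + (n - 1) * h(n - 1) + 2 * mu

# the complete (n+1)-graph has mean cover time n h_n; the enlarged graph beats it
for n in range(4, 60):
    assert E_l_C(n) < n * h(n)
```

For instance at n = 10 this gives El C ≈ 27.2 against 10 h_{10} ≈ 29.3, illustrating why the complete graph cannot be exactly extremal among all n-state chains.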
The example motivates the following as the natural “exact extremal
conjecture”.

Open Problem 6.30 Prove that, for any reversible chain on n states,

Eπ C ≥ (n − 1)hn−1

(the value for random walk on the complete graph).

The related asymptotic question was open for many years, and was finally
proved by Feige [142].

Theorem 6.31 For random walk on an unweighted n-vertex graph,

min_v Ev C ≥ c_n,

where c_n ∼ n log n as n → ∞.

The proof is an intricate mixture of many of the techniques we have already
described.
6.7 Distributional aspects
In many examples one can apply the following result to show that hitting
time distributions become exponential as the size of state space increases.

Corollary 6.32 Let i, j be arbitrary states in a sequence of reversible Markov
chains.
(i) If E_π Tj /τ_2 → ∞ then

P_π( Tj /E_π Tj > x ) → e^{−x}, 0 < x < ∞.

(ii) If Ei Tj /τ_1 → ∞ and Ei Tj ≥ (1 − o(1)) E_π Tj then Ei Tj /E_π Tj → 1 and

P_i( Tj /Ei Tj > x ) → e^{−x}, 0 < x < ∞.

Proof. In continuous time, assertion (i) is immediate from Chapter 3 Proposition
yyy. The result in discrete time now holds by continuization: if
Tj is the hitting time in discrete time and T′j in continuous time, then
E_π Tj = E_π T′j and T′j − Tj is of order √(E_π Tj). For (ii) we have (cf. Chapter
4 section yyy) Tj ≤ Ui + T∗j where Tj is the hitting time started at
i, T∗j is the hitting time started from stationarity, and Ei Ui ≤ τ_1^{(2)}. So
ETj ≤ ET∗j + O(τ_1), and the hypotheses of (ii) force ETj /ET∗j → 1 and
force the limit distribution of Tj /ETj to be the same as the limit distribution
of T∗j /ET∗j , which is the exponential distribution by (i) and the relation
τ_2 ≤ τ_1.
In the complete graph example, C has mean ∼ n log n and s.d. Θ(n),
so that C/EC → 1 in distribution, although the convergence is slow. The
next result shows this “concentration” result holds whenever the mean cover
time is essentially larger than the maximal mean hitting time.

Theorem 6.33 ([17]) For states i in a sequence of (not necessarily reversible)
Markov chains,

if Ei C/τ∗ → ∞ then P_i( |C/Ei C − 1| > ε ) → 0, each ε > 0.

The proof is too long to reproduce.

6.8 Algorithmic aspects

Many of the mathematical results in this chapter arose originally from algorithmic
questions, so let me briefly describe the questions and their relation
to the mathematical results.

6.8.1 Universal traversal sequences
This was one motivation for the seminal paper [25]. Consider an n-vertex
d-regular graph G, with a distinguished vertex v0 , and where for each vertex
v the edges at v are labeled as 1, 2, . . . , d in some way – it is not required that
the labels be the same at both ends of an edge. Now consider a sequence
i = (i1 , i2 , . . . , iL ) ∈ {1, . . . , d}L . The sequence defines a deterministic walk
(xj ) on the vertices of G via
x0 = v0
(xj−1 , xj ) is the edge at xj−1 labeled ij .
Say i is a traversal sequence for G if the walk (xj : 0 ≤ j ≤ L) visits every
vertex of G. Say i is a universal traversal sequence if it is a traversal sequence
for every graph G in the set G n,d of edge-labeled graphs with distinguished
vertices.
Proposition 6.34 (Aleliunas et al [25]) There exists a universal traver-
sal sequence of length (6e + o(1))dn3 log(nd) as n → ∞ with d varying arbi-
trarily.
Proof. It is enough to show that a uniform random sequence of that length
has non-zero chance to be a universal traversal sequence. But for such a
random sequence, the induced walk on a fixed G is just simple random walk
on the vertices of G. Writing t0 = ⌈6en^2 ⌉, Theorem 6.4 implies

Pv (C > t0 ) ≤ Ev C/t0 ≤ 6n^2 /t0 ≤ e^{−1} for all initial v

and so inductively (cf. Chapter 2 section yyy)

Pv0 (C > Kt0 ) ≤ e^{−K} , K ≥ 1 integer.

Thus by taking K sufficiently large that

e^{−K} |G n,d | < 1

there is non-zero chance that the induced walk on every G covers before
time Kt0 . The crude bound |G n,d | ≤ (nd)^{nd} means we may take K =
⌈nd log(nd)⌉.
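To make the definitions concrete, here is a toy sketch (our own construction, not from [25]): an edge-labeled n-cycle in which label 1 always steps clockwise and label 2 counterclockwise, together with a checker for the traversal property.

```python
import random

def walk_visits(labels, v0, seq):
    """Run the deterministic walk driven by the label sequence seq;
    labels[v][i] is the endpoint of the edge at v labeled i."""
    x, visited = v0, {v0}
    for i in seq:
        x = labels[x][i]
        visited.add(x)
    return visited

def is_traversal_sequence(labels, v0, seq):
    return len(walk_visits(labels, v0, seq)) == len(labels)

n = 8   # a labeled n-cycle: d = 2, label 1 clockwise, label 2 counterclockwise
cycle = {v: {1: (v + 1) % n, 2: (v - 1) % n} for v in range(n)}

print(is_traversal_sequence(cycle, 0, [1] * (n - 1)))    # → True
rng = random.Random(0)
print(is_traversal_sequence(cycle, 0, [rng.choice([1, 2]) for _ in range(6 * n * n)]))
```

Note that, just as the definition allows, the labels need not agree at the two ends of an edge: here the edge between v and v + 1 is labeled 1 at v but 2 at v + 1.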

6.8.2 Graph connectivity algorithms


Another motivation for the seminal paper [25] was the time-space tradeoff in
algorithms for determining connectivity in graphs. Here is a highly informal
presentation, illustrated by the two Mathematician graphs. The vertices are
all mathematicians (living or dead). In the first graph, there is an edge be-
tween two mathematicians if they have written a joint paper; in the second,
there is an edge if they have written two or more joint papers. A well known
Folk Theorem asserts that the first graph has a giant component containing
most famous mathematicians; a lesser known and more cynical Folk The-
orem asserts that the second graph doesn’t. Suppose we actually want to
answer a question of that type – specifically, take two mathematicians (say,
the reader and Paul Erdos) and ask if they are in the same component of
the first graph. Suppose we have a database which, given a mathematician’s
name, will tell us information about their papers and in particular will list
all their co-authors.
xxx continue story
Broder et al [67]

6.8.3 A computational question


Consider the question of getting a numerical value for Ei C (up to error factor
1 ± ε, for fixed ε) for random walk on an n-vertex graph. Using Theorem 6.1
it's clear we can do this by Monte Carlo simulation in O(n^3 ) steps.
xxx technically, using s.d./mean bounded by submultiplicativity.

Open Problem 6.35 Can Ei C be deterministically calculated in a poly-


nomial (in n) number of steps?

It’s clear one can compute mean hitting times on a n-step chain in polynomial
time, but to set up the computation of Ei C as a hitting-time problem one has
to incorporate the subset of already-visited states into the “current state”,
and thus work with hitting times for a n × 2n−1 -state chain.
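For tiny n the augmented chain can be solved exactly. The sketch below (our own code, illustrating the construction described above) takes states (current vertex, visited set); since the visited set only grows, the hitting-time equations can be solved one visited-set at a time, from large sets down to small ones:

```python
from fractions import Fraction
from itertools import combinations

def gauss(A, b):
    """Solve A x = b by Gaussian elimination over Fractions."""
    m = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(m):
        p = next(r for r in range(c, m) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(m):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[r][m] for r in range(m)]

def mean_cover_time(P, i0):
    """Exact E_{i0} C via the augmented chain on (vertex, visited set)."""
    n = len(P)
    full = frozenset(range(n))
    h = {(v, full): Fraction(0) for v in range(n)}
    # within a fixed visited set S the unknowns h(v,S), v in S, are coupled,
    # since the walk may loop inside S before first reaching a new vertex
    for size in range(n - 1, 0, -1):
        for S in map(frozenset, combinations(range(n), size)):
            verts = sorted(S)
            idx = {v: k for k, v in enumerate(verts)}
            A = [[Fraction(0)] * len(verts) for _ in verts]
            b = [Fraction(1)] * len(verts)
            for v in verts:
                A[idx[v]][idx[v]] += 1
                for w in range(n):
                    p = Fraction(P[v][w])
                    if p == 0:
                        continue
                    if w in S:
                        A[idx[v]][idx[w]] -= p                 # stays inside S
                    else:
                        b[idx[v]] += p * h[(w, S | {w})]       # grows the set
            x = gauss(A, b)
            for v in verts:
                h[(v, S)] = x[idx[v]]
    return h[(i0, frozenset([i0]))]

half = Fraction(1, 2)
K3 = [[0, half, half], [half, 0, half], [half, half, 0]]   # random walk on K_3
print(mean_cover_time(K3, 0))   # → 3
```

On K_3 this returns 3, matching the coupon-collector value (n − 1)H_{n−1}; the point of Open Problem 6.35 is precisely that this exact method costs exponentially many states.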

6.9 Notes on Chapter 6


Attributions for what I regard as the main ideas were given in the text.
The literature contains a number of corollaries or variations of these ideas,
some of which I’ve used without attribution, and many of which I haven’t
mentioned at all. A number of these ideas can be found in Zuckerman

[341, 343], Palacios [276, 273] and the Ph.D. thesis of Sbihi [306], as well as
papers cited elsewhere.
Section 6.1. The conference proceedings paper [25] proving Theorem 6.1
was not widely known, or at least its implications not realized, for some
years. Several papers subsequently appeared proving results which are con-
sequences (either obvious, or via the general relations of Chapter 4) of The-
orem 6.1. I will spare their authors embarrassment by not listing them all
here!
The spanning tree argument shows, writing be for the mean commute
time across an edge e, that

max_v Ev C^+ ≤ min_T Σ_{e∈T} be

where the min is over spanning trees T. Coppersmith et al [100] give a deeper
study and show that the right side is bounded between γ and 10γ/3, where

γ = (Σ_v dv ) (Σ_v 1/(dv + 1)).

The upper bound is obtained by considering a random spanning tree, cf.
Chapter yyy.
Section 6.2. The calculations in these examples, and the uniformity
property of V on the n-cycle, are essentially classical. For the cover time
Cn on the n-cycle there is a non-degenerate limit distribution: n^{−2} Cn →d C.
From the viewpoint of weak convergence (Chapter yyy), C is just the
cover time for Brownian motion on the circle of unit circumference, and its
distribution is known as part of a large family of known distributions for
maximal-like statistics of Brownian motion: Imhof [187] eq. (2.4) gives the
density as

fC (t) = 2^{3/2} π^{−1/2} t^{−3/2} Σ_{m=1}^∞ (−1)^{m−1} m^2 exp(−m^2 /(2t)).

Sbihi [306] gives a direct derivation of a different representation of fC .
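As a numerical sanity check on this series (our own sketch, not from [187]): truncating each sum once the exponential factor is negligible and integrating by the trapezoid rule over 0.005 ≤ t ≤ 15 should give total mass close to 1, since the density vanishes rapidly at both ends.

```python
from math import exp, pi, sqrt

def f_C(t):
    """Imhof's series for the density of C, truncated once the
    exponential factor is negligible (the series is alternating)."""
    s, m = 0.0, 1
    while m < 4 or m * m / (2.0 * t) < 50.0:
        s += (-1) ** (m - 1) * m * m * exp(-m * m / (2.0 * t))
        m += 1
    return 2.0 ** 1.5 / sqrt(pi) / t ** 1.5 * s

dt = 0.001
grid = [0.005 + k * dt for k in range(15000)]
vals = [f_C(t) for t in grid]
integral = dt * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
print(round(integral, 2))   # total mass close to 1
```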


Section 6.4. Use of Lemma 6.10 in the random walk context goes back
at least to Flatto et al [152].
Barnes and Feige [42] give a more extensive treatment of short-time
bounds in the irregular setting, and their applications to covering with mul-
tiple walks (cf. Proposition 6.17 and section 6.8.2). They also give bounds
on the mean time taken to cover µ different edges or ν different vertices –
their bound for the latter becomes O(ν^2 log ν) on regular graphs.

Proposition 6.18 implies that on an infinite regular graph Pi (Xt = j) ≤
Kt^{−1/2} . Carlen et al [84] Theorem 5.14 prove this as a corollary of results
using more sophisticated machinery. Our argument shows the result is fairly
elementary. In discrete time the analog of the first inequality can be proved
using the "CM proxy" property that Pi (X2t = i) + Pi (X2t+1 = i) is decreasing,
but the analog of the second inequality requires different arguments
because we cannot exploit the τ1^{(1)} inequalities.
Section 6.5. Variations on Corollary 6.21 are given in Broder and Karlin
[66] and Chandra et al. [85].
Upper bounds on mean hitting times imply upper bounds on the relaxation
time τ2 via the general inequalities τ2 ≤ τ0 ≤ (1/2)τ ∗ . In most concrete
examples these bounds are too crude to be useful, but in "extremal" settings
these bounds are essentially as good as results seeking to bound τ2 directly.
For instance, in the setting of a d-regular r-edge-connected graph, a direct
bound (Chapter 4 Proposition yyy) gives

τ2 ≤ d/(4r sin^2 (π/2n)) ∼ dn^2 /(π^2 r).

Up to the numerical constant, the same bound is obtained from Proposition
6.22 and the general inequality τ2 ≤ τ ∗ /2.
xxx contrast with potential and Cheeger-like arguments ?
To sketch an example of a regular graph where mini Ei C has a different
order than maxi Ei C, make a regular (m1 + m2 )-vertex graph from an m1 -vertex
graph with mean cover time Θ(m1 log m1 ) and an m2 -vertex graph
(such as the necklace) with mean cover time Θ(m2^2 ), for suitable values
of the m's. Starting from a typical vertex of the former, the mean cover
time is Θ(m1 log m1 + m1 m2 + m2^2 ) whereas starting from the unattached
end of the necklace the mean cover time is Θ(m1 log m1 + m2^2 ). Taking
m1 log m1 + m2^2 = o(m1 m2 ) gives the desired example.
Section 6.6. The “subset” version of Matthews’ lower bound (Theorem
2.6) and its application to trees were noted by Zuckerman [343], Sbihi [306]
and others. As well as giving a lower bound for balanced trees, these authors
give several lower bounds for more general trees satisfying various constraints
(cf. the unconstrained result, Proposition 6.7). As an illustration, Devroye
- Sbihi [111] show that on a tree
min_v Ev C ≥ (1 + o(1)) n log^2 n / (2 log(d∗ − 1)) if d∗ ≡ max_v dv = n^{o(1)} .

I believe that the recursion set-up in [16] can be used to prove Open
Problem 6.35 on trees, but I haven’t thought carefully about it.

The “shorting” lower bound, Lemma 6.28, was apparently first exploited
by Coppersmith et al [100].
Section 6.7. Corollary 6.32 encompasses a number of exponential limit
results proved in the literature by ad hoc calculations in particular examples.
Section 6.8.1. Proposition 6.34 is one of the neatest instances of “Erdos’s
Probabilistic Method in Combinatorics”, though surprisingly it isn’t in the
recent book [29] on that subject. Constructing explicit universal traversal
sequences is a hard open problem: see Borodin et al [56] for a survey.
Section 6.8.2. See [67] for a more careful discussion of the issues. The
alert reader of our example will have noticed the subtle implication that the
reader has written fewer papers than Paul Erdos, otherwise (why?) it would
be preferable to do the random walk in the other direction.
Miscellaneous. Condon and Hernek [98] study cover times in the follow-
ing setting. The edges of a graph are colored, a sequence (ct ) of colors is
prespecified and the “random walk” at step t picks an edge uniformly at
random from the color-ct edges at the current vertex.
Chapter 7

Symmetric Graphs and Chains (January 31, 1994)

In this Chapter we show how general results in Chapters 3, 4 and 6 can


sometimes be strengthened when symmetry is present. Many of the ideas
are just simple observations. Since the topic has a “discrete math” flavor
our default convention is to work in discrete time, though as always the
continuous-time case is similar. Note that we use the word “symmetry” in
the sense of spatial symmetry (which is the customary use in mathematics
as a whole) and not as a synonym for time-reversibility. Note also our use
of “random flight” for what is usually called “random walk” on a group.
Biggs [48] contains an introductory account of symmetry properties for
graphs, but we use little more than the definitions. I have deliberately not
been overly fussy about giving weakest possible hypotheses. For instance
many results for symmetric reversible chains depend only on the symmetry
of mean hitting times (7.7), but I haven’t spelt this out. Otherwise one
can end up with more definitions than serious results! Instead, we focus on
three different strengths of symmetry condition. Starting with the weakest,
section 7.1 deals with symmetric reversible chains, a minor generalization
of what is usually called “symmetric random walk on a finite group”. In
the graph setting, this specializes to random walk on a Cayley or vertex-
transitive graph. Section 7.2 deals with random walk on an arc-transitive
graph, encompassing what is usually called “random walk on a finite group
with steps uniform on a conjugacy class”. Section 5.16 deals with random
walk on a distance-regular graph, which roughly corresponds to nearest-
neighbor isotropic random walk on a discrete Gelfand pair.
This book focuses on inequalities rather than exact calculations, and the


limitation of this approach is most apparent in this chapter. Group repre-


sentation theory, though of course developed for non-probabilistic reasons,
turns out to be very well adapted to the study of many questions concern-
ing random walks on groups. I lack the space (and, more importantly, the
knowledge) to give a worthwhile treatment here, and in any case an account
which is both introductory and gets to interesting results is available in Di-
aconis [112]. In many concrete examples, eigenvalues are known by group
representation theory, and so in particular our parameters τ2 and τ0 are
known. See e.g. section 7.2.1. In studying a particular example, after inves-
tigating eigenvalues one can seek to study further properties of the chain by
either
(i) continuing with calculations specific to the example; or
(ii) using general inequalities relating other aspects of the chain to τ2 and
τ0 .
The purpose of this Chapter is to develop option (ii). Of course, the more
highly-structured the example, the more likely one can get stronger explicit
results via (i). For this reason we devote more space to the weaker setting
of section 7.1 than to the stronger settings of sections 7.2 and 5.16.
xxx scattering of more sophisticated math in Chapter 10.

7.1 Symmetric reversible chains


7.1.1 Definitions
Consider an irreducible transition matrix P = (pij ) on a finite state space
I. A symmetry of P is a 1 − 1 map γ : I → I such that

pγ(i)γ(j) = pij for all i, j.

The set Γ of symmetries forms a group under composition, and in our (non-
standard) terminology a symmetric Markov transition matrix is one for
which Γ acts transitively, i.e.

for all i, j ∈ I there exists γ ∈ Γ such that γ(i) = j.

Such a chain need not be reversible; a symmetric reversible chain is just a


chain which is both symmetric and reversible. A natural setting is where I is
itself a group under an operation (i, j) → ij which we write multiplicatively.
If µ is a probability distribution on I and (Zt ; t ≥ 1) are i.i.d. I-valued with
distribution µ then
Xt = x0 Z1 Z2 . . . Zt (7.1)

is the symmetric Markov chain with transition probabilities

pij = µ(i^{−1} j)

started at x0 . This chain is reversible iff

µ(i) = µ(i−1 ) for all i. (7.2)

We have rather painted ourselves into a corner over terminology. The


usual terminology for the process (7.1) is “random walk on the group I” and
if (7.2) holds then it is a

“symmetric random walk on the group I” . (7.3)

Unfortunately in this phrase, both “symmetric” and “walk” conflict with


our conventions, so we can’t use the phrase. Instead we will use “random
flight on the group I” for a process (7.1), and “reversible random flight on
the group I” when (7.2) also holds. Note that we always assume chains are
irreducible, which in the case of a random flight holds iff the support of µ
generates the whole group I. Just keep in mind that the topic of this section,
symmetric reversible chains, forms a minor generalization of the processes
usually described by (7.3).
On a graph (V, E), a graph automorphism is a 1 − 1 map γ : V → V
such that
(γ(w), γ(v)) ∈ E iff (w, v) ∈ E.
The graph is called vertex-transitive if the automorphism group acts tran-
sitively on vertices. Clearly, random walk on a (unweighted) graph is a
symmetric reversible chain iff the graph is vertex-transitive. We specialize
to this case in section 7.1.8. A further specialization is to random walk on
a Cayley graph. If G = (gi ) is a set of generators of a group I, which we
always assume to satisfy

g ∈ G implies g −1 ∈ G

then the associated Cayley graph has vertex-set I and edge-set

{(v, vg) : v ∈ I, g ∈ G}.

A Cayley graph is vertex-transitive.


Finally, recall from Chapter 3 yyy that we can identify a reversible chain
with a random walk on a weighted graph. With this identification, a sym-
metric reversible chain is one where the weighted graph is vertex-transitive,
in the natural sense.

7.1.2 This section goes into Chapter 3


Lemma 7.1 For an irreducible reversible chain, the following are equiva-
lent.
(a) Pi (Xt = i) = Pj (Xt = j), i, j ∈ I, t ≥ 1
(b) Pi (Tj = t) = Pj (Ti = t), i, j ∈ I, t ≥ 1.

Proof. In either case the stationary distribution is uniform – under (a), by
letting t → ∞, and under (b) by taking t = 1, implying pij ≡ pji . So by
reversibility Pi (Xt = j) = Pj (Xt = i) for i ≠ j and t ≥ 1. But recall from
Chapter 2 Lemma yyy that the generating functions Gij (z) = Σ_t Pi (Xt = j)z^t
and Fij (z) = Σ_t Pi (Tj = t)z^t satisfy

Fij = Gij /Gjj . (7.4)

For i ≠ j we have seen that Gij = Gji , and hence by (7.4)

Fij = Fji iff Gjj = Gii ,

which is the assertion of Lemma 7.1.
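The identity (7.4) is easy to check numerically. The sketch below (our own code; the 3-state birth-and-death chain is just an arbitrary reversible example) computes G_ij(z) by iterating the transition matrix and F_ij(z) by iterating the chain killed at j:

```python
def chain_series(P, i, j, z, T=400):
    """Truncated G_ij(z) = sum_{t>=0} P_i(X_t=j) z^t and
    F_ij(z) = sum_{t>=1} P_i(T_j=t) z^t; for |z| < 1 the tails vanish."""
    n = len(P)
    mu = [0.0] * n; mu[i] = 1.0     # unrestricted chain
    nu = [0.0] * n; nu[i] = 1.0     # chain killed on hitting j
    Gv, Fv, zt = mu[j], 0.0, 1.0
    for _ in range(T):
        zt *= z
        mu = [sum(mu[k] * P[k][l] for k in range(n)) for l in range(n)]
        nu = [sum(nu[k] * P[k][l] for k in range(n)) for l in range(n)]
        Gv += mu[j] * zt
        Fv += nu[j] * zt            # mass first reaching j at this step
        nu[j] = 0.0                 # ... is then removed (killing)
    return Gv, Fv

P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]   # reversible chain
G01, F01 = chain_series(P, 0, 1, 0.5)
G11, _ = chain_series(P, 1, 1, 0.5)
print(abs(F01 - G01 / G11) < 1e-9)   # → True
```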

7.1.3 Elementary properties


Our standing assumption is that we have an irreducible symmetric reversible
n-state chain. The symmetry property implies that the stationary distribu-
tion π is uniform, and also implies

Pi (Xt = i) = Pj (Xt = j), i, j ∈ I, t ≥ 1. (7.5)

But by Chapter 3 Lemma yyy, under reversibility (7.5) is equivalent to

Pi (Tj = t) = Pj (Ti = t), i, j ∈ I, t ≥ 1. (7.6)

And clearly (7.6) implies

Ei Tj = Ej Ti for all i, j. (7.7)

We make frequent use of these properties. Incidentally, (7.7) is in general


strictly weaker than (7.6): van Slijpe [330] p. 288 gives an example with a
3-state reversible chain.
We also have, from the definition of symmetric, that Eπ Ti is constant in
i, and hence
Eπ Ti = τ0 for all i. (7.8)

So by Chapter 4 yyy
τ ∗ ≤ 4τ0 . (7.9)
The formula for Eπ Ti in terms of the fundamental matrix (Chapter 2 yyy)
can be written as

τ0 /n = 1 + Σ_{t=1}^∞ (Pi (Xt = i) − 1/n). (7.10)

Approximating τ0 by the first few terms is what we call the local transience
heuristic. See Chapter xxx for rigorous discussion.
Lemma 7.2 (i) Ei Tj ≥ n/(1 + p(i, j)), j ≠ i.
(ii) maxi,j Ei Tj ≤ 2τ0 .
Proof. (i) This is a specialization of Chapter 6 xxx.
(ii) For any i, j, k,

Ei Tj ≤ Ei Tk + Ek Tj = Ei Tk + Ej Tk .

Averaging over k, the right side becomes 2τ0 .
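Both parts are easy to verify exactly on a concrete symmetric reversible chain, say random walk on the n-cycle, where by vertex-transitivity max_{i,j} Ei Tj = max_v Ev T0 and τ0 = Eπ T0. A sketch in exact rational arithmetic (our own code):

```python
from fractions import Fraction

def gauss(A, b):
    """Solve A x = b by Gaussian elimination over Fractions."""
    m = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(m):
        p = next(r for r in range(c, m) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(m):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[r][m] for r in range(m)]

n = 9          # random walk on the n-cycle
half = Fraction(1, 2)
# h[v] = E_v T_0 solves h(0) = 0, h(v) = 1 + (h(v-1) + h(v+1))/2
A = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
b = [Fraction(1)] * (n - 1)
for v in range(1, n):
    A[v - 1][v - 1] += 1
    for w in ((v + 1) % n, (v - 1) % n):
        if w != 0:
            A[v - 1][w - 1] -= half
h = [Fraction(0)] + gauss(A, b)
tau0 = sum(h) / n               # = E_pi T_0 (stationary dist is uniform)
print(max(h), tau0, max(h) <= 2 * tau0)   # → 20 40/3 True
```

Here max Ei Tj = 20 and τ0 = (n^2 − 1)/6 = 40/3, comfortably inside the bound of Lemma 7.2(ii).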


Recall that a simple Cauchy-Schwarz argument (Chapter 3 yyy) shows
that, for any reversible chain whose stationary distribution is uniform,

Pi (X2t = j) ≤ √( Pi (X2t = i) Pj (X2t = j) ).

So by (7.5), for a symmetric reversible chain, the most likely place to be
after 2t steps is where you started:

Corollary 7.3 Pi (X2t = j) ≤ Pi (X2t = i), for all i, j ∈ I, t ≥ 1.

This type of result is nicer in continuous time, where the inequality holds
for all times.

7.1.4 Hitting times


Here is our first non-trivial result, from Aldous [12].
Theorem 7.4 Suppose a sequence of symmetric reversible chains satisfies
τ2 /τ0 → 0. Then
(a) For the stationary chain, and for arbitrary j, we have Tj /τ0 →d ξ
and var (Tj /τ0 ) → 1, where ξ has exponential(1) distribution.
(b) maxi,j Ei Tj /τ0 → 1.
(c) If (in , jn ) are such that Ein Tjn /τ0 → 1 then Pin (Tjn /τ0 ∈ ·) →d ξ.

Note that, because τ2 ≤ τ1 + 1 and τ0 ≥ (n − 1)^2 /n, the hypothesis "τ2 /τ0 →
0" is weaker than either "τ2 /n → 0" or "τ1 /τ0 → 0".
Part (a) is a specialization of Chapter 3 Proposition yyy and its proof.
Parts (b) and (c) use refinements of the same technique. Part (b) implies

if τ2 /τ0 → 0 then τ ∗ ∼ 2τ0 .

Because this applies in many settings in this Chapter, we shall rarely need
to discuss τ ∗ further.
xxx give proof
In connection with (b), note that

Ev Tw ≤ τ1^{(2)} + τ0 (7.11)

by definition of τ1^{(2)} and vertex-transitivity. So (b) is obvious under the
slightly stronger hypothesis τ1 /τ0 → 0.
Chapter 3 Proposition yyy actually gives information on hitting times
TA to more general subsets A of vertices. Because (Chapter 3 yyy) Eπ TA ≥
(1 − π(A))^2 /π(A), we get (in continuous time) a quantification of the fact
that TA has approximately exponential distribution when |A| ≪ n/τ2 and
when the chain starts with the uniform distribution:

sup_t |Pπ (TA > t) − exp(−t/Eπ TA )| ≤ (τ2 |A|/n) (1 − |A|/n)^{−2} .

7.1.5 Cover times


Recall the cover time C from Chapter 6. By symmetry, in our present setting
Ei C doesn’t depend on the starting place i, so we can write EC. In this
section we combine results on hitting times with various forms of Matthews
method to obtain asymptotics for cover times in the setting of a sequence of
symmetric reversible chains. Experience, and the informal argument above
(7.15), suggest the principle

EC ∼ τ0 log n, except for chains resembling random walk on the n-cycle .


(7.12)
The results in this chapter concerning cover times go some way towards
formalizing this principle.
Corollary 7.5 For a sequence of symmetric reversible chains
(a) ((1 − o(1))/(1 + p∗ )) n log n ≤ EC ≤ (2 + o(1))τ0 log n, where p∗ ≡ maxj≠i pij .
(b) If τ2 /τ0 → 0 then EC ≤ (1 + o(1))τ0 log n.

(c) If τ2 /τ0 = O(n^{−β} ) for fixed 0 < β < 1 then

EC ≥ (β − o(1))τ0 log n.

Proof. Using the basic form of Matthews method (Chapter 2 yyy), (a)
follows from Lemma 7.2 and (b) from Theorem 7.4. To prove (c), fix a state
j and ε > 0. Using (7.11) and Markov's inequality,

π{i : Ei Tj ≤ (1 − ε)τ0 } ≤ τ1^{(2)} /(ετ0 ) ≡ α, say.

So we can inductively choose ⌈α^{−1} ⌉ vertices ik such that

Eik Til > (1 − ε)τ0 ; 1 ≤ k < l ≤ ⌈α^{−1} ⌉.

By the extended form of Matthews method (Chapter 6 Corollary yyy)

EC ≥ (1 − ε)τ0 h_{⌈α^{−1} ⌉−1} .

From Chapter 4 yyy, τ1 ≤ τ2 (1 + log n) and so the hypothesis implies τ1 /τ0 =
O(n^{ε−β} ). So the asymptotic lower bound for EC becomes (1 − ε)τ0 (β −
ε) log n, and since ε is arbitrary the result follows.
Since the only natural examples where τ1 /τ0 does not tend to 0 are variations
of random walk on the n-cycle, for which EC = Θ(τ0 ) without the "log n"
term, we expect a positive answer to

Open Problem 7.6 In the setting of Corollary 7.5, is EC ≤ (1+o(1))τ0 log n


without further hypotheses?

Here is an artificial example to illustrate the bound in (c).

Example 7.7 Two time scales.

Take m1 = m1 (n), m2 = m2 (n) such that m1 ∼ n^{1−β} , m1 m2 ∼ n. The
underlying idea is to take two continuous-time random walks on the complete
graphs Km1 and Km2 , but with the walks run on different time scales. To set
this up directly in discrete time, take state space {(x, y) : 1 ≤ x ≤ m1 , 1 ≤
y ≤ m2 } and transition probabilities

(x, y) → (x′ , y): chance (m1 − 1)^{−1} (1 − 1/(am1 log m1 )), x′ ≠ x
(x, y) → (x, y ′ ): chance (m2 − 1)^{−1} · 1/(am1 log m1 ), y ′ ≠ y

where a = a(n) ↑ ∞ slowly. It is not hard to formalize the following analy-


sis. Writing the chain as (Xt , Yt ), the Y -component stays constant for time
Θ(am1 log m1 ), during which time every x-value is hit, because the cover
time for Km1 is ∼ m1 log m1 . And m2 log m2 jumps of the Y -component are
required to hit every y-value, so
EC ∼ (m2 log m2 ) × (am1 log m1 ) ∼ an(log m1 )(log m2 ). (7.13)
Now τ2 ∼ am1 log m1 , and because the mean number of returns to the
starting point before the first Y -jump is ∼ a log m1 we can use the local
transience heuristic (7.10) to see τ0 ∼ (a log m1 ) × n. So τ2 /τ0 ∼ m1 /n ∼
n^{−β} , and the lower bound from (c) is

(β − o(1))(a log m1 )n log n.

But this agrees with the exact limit (7.13), because m2 ∼ n^β .
We now turn to sharper distributional limits for C. An (easy) background
fact is that, for independent random variables (Zi ) with exponential,
mean τ , distribution,

(max(Z1 , . . . , Zn ) − τ log n)/τ →d η

where η has the extreme value distribution

P (η ≤ x) = exp(−e^{−x} ), −∞ < x < ∞. (7.14)
Now the cover time C = maxi Ti is the max of the hitting times, and with
the uniform initial distribution the Ti 's have mean τ0 . So if the Ti 's have
approximately exponential distribution and are roughly independent of each
other then we anticipate the limit result

(C − τ0 log n)/τ0 →d η. (7.15)

Theorem 7.4 has already given us a condition for limit exponential distributions,
and we shall build on this result to give (Theorem 7.9) conditions for
(7.15) to hold.
The extreme value distribution (7.14) has transform

E exp(θη) = Γ(1 − θ), −∞ < θ < 1. (7.16)

Classical probability theory (see Notes) says that to prove (7.15) it is enough
to show that transforms converge, i.e. to show

E exp(θC/τ0 ) ∼ n^θ Γ(1 − θ), −∞ < θ < 1. (7.17)
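The transform (7.16) can be spot-checked by Monte Carlo, using the standard inverse-CDF fact that −log(−log U) has distribution (7.14) when U is uniform on (0, 1); a seeded, illustrative sketch:

```python
import random
from math import log, exp, gamma

rng = random.Random(0)
theta = 0.3
N = 200_000
# if U ~ uniform(0,1) then eta = -log(-log U) satisfies P(eta <= x) = exp(-e^-x)
est = sum(exp(theta * -log(-log(rng.random()))) for _ in range(N)) / N
print(round(est, 2), round(gamma(1 - theta), 2))
```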

But Matthews method, which previously we have used on expectations, can
just as well be applied to transforms. By essentially the same argument as
in Chapter 2 Theorem yyy, Matthews [257] obtained

Proposition 7.8 The cover time C in a not-necessarily-reversible Markov
chain with arbitrary initial distribution satisfies

Γ(n + 1)Γ(1/f∗ (β)) / Γ(n + 1/f∗ (β)) ≤ E exp(βC) ≤ Γ(n + 1)Γ(1/f ∗ (β)) / Γ(n + 1/f ∗ (β))

where

f ∗ (β) ≡ max_{j≠i} Ei exp(βTj )
f∗ (β) ≡ min_{j≠i} Ei exp(βTj ).

Substituting into (7.17), and using the fact

Γ(n + 1)/Γ(n + 1 − sn ) ∼ n^s as n → ∞, sn → s

we see that to establish (7.15) it suffices to prove that for arbitrary jn ≠ in
and for each fixed −∞ < θ < 1,

Ein exp(θTjn /τ0 ) → 1/(1 − θ). (7.18)
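The Gamma-ratio asymptotic used in this substitution is easily confirmed numerically via log-gamma (to avoid overflow):

```python
from math import lgamma, exp

def ratio(n, s):
    """Gamma(n+1) / Gamma(n+1-s), computed in log space for stability."""
    return exp(lgamma(n + 1) - lgamma(n + 1 - s))

for n in (10 ** 3, 10 ** 6):
    print(round(ratio(n, 0.3) / n ** 0.3, 6))   # → approaches 1
```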
Theorem 7.9 For a sequence of symmetric reversible chains, if
(a) minj≠i Ei Tj = τ0 (1 − o(1))
(b) τ2 /τ0 = o(1/ log n)
then

(C − τ0 log n)/τ0 →d η.
Proof. By hypothesis (a) and Theorem 7.4 (b,c), for arbitrary jn ≠ in we
have Pin (Tjn /τ0 ∈ ·) →d ξ. This implies (7.18) for θ ≤ 0, and by Fatou's
lemma it also implies lim inf n Ein exp(θTjn /τ0 ) ≥ 1/(1 − θ) for 0 < θ < 1.
Thus it is sufficient to prove

max_{j≠i} Ei exp(θTj /τ0 ) ≤ (1 + o(1))/(1 − θ), 0 < θ < 1. (7.19)

The proof exploits some of our earlier general inequalities. Switch to continuous
time. Fix β > 0. By conditioning on the position at some fixed time s,

Ei exp(β(Tj − s)^+ ) ≤ max_x (nPi (Xs = x)) × Eπ exp(βTj ).

By Corollary 7.3 the max is attained by x = i, and so

Ei exp(βTj ) ≤ (nPi (Xs = i)e^{βs} ) × Eπ exp(βTj ).

We now apply some general inequalities. Chapter 4 yyy says nPi (Xs = i) ≤
1 + n exp(−s/τ2 ). Writing αj for the quasistationary distribution on {j}^c ,
Chapter 3 (yyy) implies Pπ (Tj > t) ≤ exp(−t/Eαj Tj ) and hence

Eπ exp(βTj ) ≤ 1/(1 − βEαj Tj ).

But Chapter 3 Theorem yyy implies Eαj Tj ≤ τ0 + τ2 . So setting β = θ/τ0 ,
these inequalities combine to give

Ei exp(θTj /τ0 ) ≤ (1 + n exp(−s/τ2 )) × exp(θs/τ0 ) × 1/(1 − θ(1 + τ2 /τ0 )).

But by hypothesis (b) we can choose s with s = o(τ0 ) and s/(τ2 log n) → ∞,
so that each of the first two terms in the bound tends to 1, establishing (7.19).
Finally, the effect of continuization is to change C by at most O(√(EC)), so
the asymptotics remain true in discrete time.
Remark. Presumably (c.f. Open Problem 7.6) the Theorem remains true
without hypothesis (b).
In view of Chapter 6 yyy it is surprising that there is no obvious example
to disprove
Open Problem 7.10 Let V denote the last state to be hit. In a sequence
of vertex-transitive graphs with n → ∞, is it always true that V converges
(in variation distance, say) to the uniform distribution?

7.1.6 Product chains


In our collection of examples in Chapter 5 of random walks on graphs, the
examples with enough symmetry to fit into the present setting have in fact
extra symmetry, enough to fit into the arc-transitive setting of section 7.2.
So in a sense, working at the level of generality of symmetric reversible
chains merely serves to illustrate what properties of chains depend only on
this minimal level of symmetry. But let us point out a general construction.
Suppose we have symmetric reversible chains X (1) , . . . , X (d) on state spaces
I (1) , . . . , I (d) . Fix constants a1 , . . . , ad with each ai > 0 and with Σi ai = 1.
Then (c.f. Chapter 4 section yyy) we can define a "product chain" with
state-space I (1) × . . . × I (d) and transition probabilities

(x1 , . . . , xd ) → (x1 , . . . , x′i , . . . , xd ): probability ai P (X1^{(i)} = x′i | X0^{(i)} = xi ).

This product chain is also symmetric reversible. But if the underlying chains
have extra symmetry properties, these extra properties are typically lost
when one passes to the product chain. Thus we have a general method of
constructing symmetric reversible chains which lack extra structure. Ex-
ample 7.14 below gives a case with distinct underlying components, and
Example 7.11 gives a case with a non-uniform product. In general, writing
(λu^{(i)} : 1 ≤ u ≤ |I (i) |) for the continuous-time eigenvalues of X (i) , we have
(Chapter 4 yyy) that the continuous-time eigenvalues of the product chain
are

λu = a1 λu1^{(1)} + . . . + ad λud^{(d)}

indexed by u = (u1 , . . . , ud ) ∈ {1, . . . , |I (1) |} × . . . × {1, . . . , |I (d) |}. So in
particular

τ2 = max_i τ2^{(i)} /ai
τ0 = Σ_{u≠(1,...,1)} 1/(a1 λu1^{(1)} + . . . + ad λud^{(d)})

and of course these parameters take the same values in discrete time.

Example 7.11 Coordinate-biased random walk on the d-cube.

Take I = {0, 1}^d and fix 0 < a1 ≤ a2 ≤ . . . ≤ ad with Σi ai = 1. Then the
chain with transitions

(b1 , . . . , bd ) → (b1 , . . . , 1 − bi , . . . , bd ): probability ai

is the weighted product of two-state chains. Most of the calculations for
simple symmetric random walk on the d-cube done in Chapter 5 Example
yyy extend to this example, with some increase of complexity. In particular,

τ2 = 1/(2a1 )
τ0 = (1/2) Σ_{u∈I, u≠0} 1/(Σ_{i=1}^d ui ai ).

In continuous time we still get the product form for the distribution at time
t:

Pb (Xt = b′ ) = 2^{−d} Π_i (1 + ηi exp(−2ai t)); ηi = 1 if b′i = bi , ηi = −1 if not.
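For small d these formulas can be compared with a direct linear-system computation. With d = 3 and uniform ai = 1/3 the chain is simple random walk on the 3-cube, and both routes give τ0 = 29/4; a sketch in exact arithmetic (our own code):

```python
from fractions import Fraction

def gauss(A, b):
    """Solve A x = b by Gaussian elimination over Fractions."""
    m = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(m):
        p = next(r for r in range(c, m) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(m):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[r][m] for r in range(m)]

d = 3
a = [Fraction(1, 3)] * d      # uniform coordinate probabilities
n = 2 ** d
# direct route: h[v] = E_v T_0 for the walk flipping coordinate i w.p. a_i
A = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
b = [Fraction(1)] * (n - 1)
for v in range(1, n):
    A[v - 1][v - 1] += 1
    for i in range(d):
        w = v ^ (1 << i)
        if w != 0:
            A[v - 1][w - 1] -= a[i]
h = [Fraction(0)] + gauss(A, b)
tau0_direct = sum(h) / n      # = E_pi T_0, pi uniform

# formula route: tau0 = (1/2) sum_{u != 0} 1/(sum_i u_i a_i)
tau0_formula = Fraction(1, 2) * sum(
    1 / sum((a[i] for i in range(d) if (u >> i) & 1), Fraction(0))
    for u in range(1, n))
print(tau0_direct, tau0_formula)   # → 29/4 29/4
```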

So in a sequence of continuous time chains with d → ∞, the "separation"
parameter τ1^{(1)} of Chapter 3 section yyy is asymptotic to the solution t of

Σ_i exp(−2ai t) = − log(1 − e^{−1} ).

More elaborate calculations can be done to study τ1 and the discrete-time
version.

7.1.7 The cutoff phenomenon and the upper bound lemma


Chapter 2 yyy and Chapter 4 yyy discussed quantifications of notions of
“time to approach stationarity” using variation distance. The emphasis in
Chapter 4 yyy was on inequalities which hold up to universal constants.
In the present context of symmetric reversible chains, one can seek to do
sharper calculations. Thus for random walk on the d-cube (Chapter 5 Ex-
ample yyy), with chances 1/(d + 1) of making each possible step or staying
still, writing n = 2^d and cn = (1/4) d log d, we have (as n → ∞) not only the
fact τ1 ∼ cn but also the stronger result

d((1 + ε)cn ) → 0 and d((1 − ε)cn ) → 1, for all ε > 0. (7.20)

We call this the cutoff phenomenon, and when a sequence of chains satisfies
(7.20) we say the sequence has “variation cutoff at cn ”. As mentioned at
xxx, the general theory of Chapter 4 works smoothly using d̄(t), but in
examples it is more natural to use d(t), which we shall do in this chapter.
Clearly, (7.20) implies the same result for d̄ and implies τ1 ∼ cn . Also, our
convention in this chapter is to work in discrete time, whereas the Chapter
4 general theory worked more smoothly in continuous time. (Clearly (7.20)
in discrete time implies the same result for the continuized chains, provided
cn → ∞). Note that, in the context of symmetric reversible chains,

d(t) = di (t) = ||Pi (Xt ∈ ·) − π(·)|| for each i.

We also can discuss separation distance (Chapter 4 yyy) which in this context
is
s(t) = 1 − n min Pi (Xt = j) for each i,
j

and introduce the analogous notion of separation threshold.


It turns out that these cut-offs automatically appear in sequences of
chains defined by repeated products. An argument similar to the analysis
of the d-cube (see [22] for a slightly different version) shows

Lemma 7.12 Fix an aperiodic symmetric reversible chain with m states
and with relaxation time τ2 = 1/(1 − λ2 ). Consider the d-fold product chain
with n = m^d states and transition probabilities

(x1 , . . . , xd ) → (x1 , . . . , yi , . . . , xd ): probability (1/d) pxi ,yi .

As d → ∞, this sequence of chains has variation cutoff (1/2)τ2 d log d and
separation cutoff τ2 d log d.

xxx discuss upper bound lemma


xxx heuristics
xxx mention later examples

7.1.8 Vertex-transitive graphs and Cayley graphs


So far we have worked in the setting of symmetric reversible chains, and
haven’t used any graph theory. We now specialize to the case of random
walk on a vertex-transitive or Cayley graph (V, E). As usual, we won’t write
out all specializations of the previous results, but instead emphasize what
extra we get from graph-theoretic arguments. Let d be the degree of the
graph.

Lemma 7.13 For random walk on a vertex-transitive graph,
(i) Ev Tx ≥ n if (v, x) ∉ E
(ii) 2dn/(d + 1) − d ≥ Ev Tx ≥ dn/(d + 1) if (v, x) ∈ E

Proof. The lower bounds are specializations of Lemma 7.2(i), i.e. of Chapter
6 xxx. For the upper bound in (ii),

n − 1 = (1/d) Σ_{y∼x} Ey Tx (7.21)
      ≥ (1/d) (Ev Tx + (d − 1) dn/(d + 1)) by the lower bound in (ii).

Rearrange.
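The bounds can be verified exactly on a concrete vertex-transitive graph, say the Petersen graph (n = 10, d = 3, so (ii) reads 15/2 ≤ Ev Tx ≤ 12 across edges); a sketch in exact arithmetic (our own code):

```python
from fractions import Fraction

def gauss(A, b):
    """Solve A x = b by Gaussian elimination over Fractions."""
    m = len(b)
    M = [row[:] + [b[r]] for r, row in enumerate(A)]
    for c in range(m):
        p = next(r for r in range(c, m) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [x / M[c][c] for x in M[c]]
        for r in range(m):
            if r != c and M[r][c] != 0:
                M[r] = [x - M[r][c] * y for x, y in zip(M[r], M[c])]
    return [M[r][m] for r in range(m)]

n, d = 10, 3      # Petersen graph: outer 5-cycle, inner pentagram, spokes
def nbrs(v):
    if v < 5:
        return [(v + 1) % 5, (v - 1) % 5, v + 5]
    w = v - 5
    return [5 + (w + 2) % 5, 5 + (w - 2) % 5, w]

A = [[Fraction(0)] * (n - 1) for _ in range(n - 1)]
b = [Fraction(1)] * (n - 1)
for v in range(1, n):
    A[v - 1][v - 1] += 1
    for w in nbrs(v):
        if w != 0:
            A[v - 1][w - 1] -= Fraction(1, d)
h = [Fraction(0)] + gauss(A, b)   # h[v] = E_v T_0

lo, hi = Fraction(d * n, d + 1), Fraction(2 * d * n, d + 1) - d
for v in range(1, n):
    if v in nbrs(0):
        assert lo <= h[v] <= hi   # Lemma 7.13(ii)
    else:
        assert h[v] >= n          # Lemma 7.13(i)
print(h[1], max(h), lo, hi)       # → 9 12 15/2 12
```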
xxx mention general lower bound τ0 ≥ (1−o(1))nd/(d−2) via tree-cover.
It is known (xxx ref) that a Cayley graph of degree d is d-edge-connected,
and so Chapter 6 Proposition yyy gives

τ∗ ≤ n² ψ(d)/d

where ψ(d)/d ≈ √(2/d).
250CHAPTER 7. SYMMETRIC GRAPHS AND CHAINS (JANUARY 31, 1994)

Example 7.14 A Cayley graph where Ev Tw is not the same for all edges
(v, w).

Consider Zm ×Z2 with generators (1, 0), (−1, 0), (0, 1). The figure illustrates
the case m = 4.
[Figure: Zm × Z2 for m = 4; vertices ij (i ∈ Zm, j ∈ Z2), with cycle edges ij — (i±1)j and rung edges i0 — i1.]
Let's calculate E00 T01 using the resistance interpretation. Put unit volt-
age at 01 and zero voltage at 00, and let ai be the voltage at i0. By symmetry
the voltage at i1 is 1 − ai, so we get the equations

ai = (1/3)(ai−1 + ai+1 + (1 − ai)), 1 ≤ i ≤ m − 1,

with a0 = am = 0. But this is just a linear difference equation, and a brief
calculation gives the solution

ai = 1/2 − (1/2) (θ^{m/2−i} + θ^{i−m/2}) / (θ^{m/2} + θ^{−m/2}),

where θ = 2 − √3. The current flow is 1 + 2a1, so the effective resistance is
r = (1 + 2a1)^{−1}. The commute interpretation of resistance gives 2E00 T01 =
3nr, and so

E00 T01 = 3n / (2(1 + 2a1)),

where n = 2m is the number of vertices. In particular,
n^{−1} E00 T01 → γ ≡ 3/(2√3) = √3/2 as n → ∞,

since 1 + 2a1 → 2 − θ = √3. Using the averaging property (7.21),

n^{−1} E00 T10 → γ′ ≡ (3 − γ)/2 = (6 − √3)/4 as n → ∞.
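The computation can be cross-checked numerically. The sketch below (not in the original) solves the hitting-time equations on the graph directly and compares with the resistance formula E00 T01 = 3n/(2(1 + 2a1)). For m = 4 the graph is the 3-cube, which is arc-transitive, so both edge types give Ev Tw = n − 1 = 7; for large m the check pins down the limit via 1 + 2a1 → 2 − θ = √3.

```python
from math import sqrt

def hit_times(adj, tgt):
    """Mean hitting times E_v T_tgt for simple random walk, by Gaussian
    elimination on (I - P) h = 1 with h(tgt) = 0."""
    n = len(adj)
    A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    b = [0.0 if v == tgt else 1.0 for v in range(n)]
    for v in range(n):
        if v != tgt:
            for u in adj[v]:
                A[v][u] -= 1.0 / len(adj[v])
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            b[r] -= f * b[c]
            for cc in range(c, n):
                A[r][cc] -= f * A[c][cc]
    h = [0.0] * n
    for r in range(n - 1, -1, -1):
        h[r] = (b[r] - sum(A[r][c] * h[c] for c in range(r + 1, n))) / A[r][r]
    return h

def prism(m):
    """Adjacency lists of Z_m x Z_2; state (i, j) gets index 2*i + j."""
    return [[2 * ((i + 1) % m) + j, 2 * ((i - 1) % m) + j, 2 * i + (1 - j)]
            for i in range(m) for j in (0, 1)]

# m = 4 is the 3-cube, which IS arc-transitive: both hitting times are n - 1 = 7.
assert abs(hit_times(prism(4), 1)[0] - 7) < 1e-8   # E_00 T_01
assert abs(hit_times(prism(4), 2)[0] - 7) < 1e-8   # E_00 T_10

def a1(m):
    """Solve a_{i+1} = 4 a_i - a_{i-1} - 1, a_0 = a_m = 0, by linearity in a_1."""
    al, be = [0.0, 1.0], [0.0, 0.0]
    for i in range(1, m):
        al.append(4 * al[i] - al[i - 1])
        be.append(4 * be[i] - be[i - 1] - 1)
    return -be[m] / al[m]

m = 40
n = 2 * m
g01 = hit_times(prism(m), 1)[0] / n    # E_00 T_01 / n
g10 = hit_times(prism(m), 2)[0] / n    # E_00 T_10 / n
assert abs(g01 - 3 / (2 * (1 + 2 * a1(m)))) < 1e-6   # matches the resistance formula
assert abs(g01 - sqrt(3) / 2) < 1e-6                 # gamma  = sqrt(3)/2  ~ 0.866
assert abs(g10 - (6 - sqrt(3)) / 4) < 0.03           # gamma' = (6-sqrt(3))/4 ~ 1.067
print(g01, g10)
```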

Turning from hitting times to mixing times, recall the Cheeger constant

τc ≡ sup_A c(A),

where A is a proper subset of vertices and

c(A) ≡ π(Ac) / Pπ(X1 ∈ Ac | X0 ∈ A).

For random walk on a Cayley graph one can use simple “averaging” ideas
to bound c(A). This is Proposition 7.15 below. The result in fact extends
to vertex-transitive graphs by a covering graph argument; see xxx.
Consider an n-vertex Cayley graph with degree d and generators G =
{g1, . . . , gd}, where g ∈ G implies g−1 ∈ G. Then

Pπ(X1 ∈ Ac | X0 ∈ A) = (1/d) Σ_{g∈G} |Ag \ A| / |A|,

where Ag = {ag : a ∈ A}. Lower bounding the sum by its maximal term,
we get

c(A) ≤ (d/n) |A| |Ac| / max_{g∈G} |Ag \ A|.  (7.22)

Proposition 7.15 On a Cayley graph of degree d:

(i) τc ≤ d∆, where ∆ is the diameter of the graph;
(ii) c(A) ≤ 2dρ(A) for all A with ρ(A) ≥ 1, where

ρ(A) ≡ min_{v∈V} max_{w∈A} d(v, w)

is the radius of A.

Note that supA ρ(A) is bounded by ∆ but not in general by ∆/2 (consider
the cycle), so that (ii) implies (i) with an extra factor of 2. Part (i) is from
Aldous [10] and (ii) is from Babai [36].
Proof. (i) Fix A. Because

(1/n) Σ_{v∈V} |A ∩ Av| = |A|²/n,

there exists some v ∈ V such that |A ∩ Av| ≤ |A|²/n, implying

|Av \ A| ≥ |A||Ac|/n.  (7.23)



We can write v = g1 g2 · · · gδ for some sequence of generators (gi) and some
δ ≤ ∆, and

|Av \ A| ≤ Σ_{i=1}^{δ} |Ag1 · · · gi \ Ag1 · · · gi−1| = Σ_{i=1}^{δ} |Agi \ A|.

So there exists g ∈ G with |Ag \ A| ≥ (1/∆) |A||Ac|/n, and so (i) follows from
(7.22). For part (ii), fix A with |A| ≤ n/2, write ρ = ρ(A) and suppose

max_{g∈G} |Ag \ A| < (1/(4ρ)) |A|.  (7.24)
Fix v with max_{w∈A} d(w, v) = ρ. Since |Ag \ A| < (1/(4ρ)) |A| and

A \ Axg ⊆ (A \ Ag) ∪ (A \ Ax)g,

we have by induction

|A \ Ax| < (1/(4ρ)) |A| d(x, v).  (7.25)
Write B^r ≡ {vg1 · · · gi : i ≤ r, gi ∈ G} for the ball of radius r about v. Since
(2ρ + 1)/(4ρ) < 1, inequality (7.25) shows that A ∩ Ax is non-empty for each
x ∈ B^{2ρ+1}, and so B^{2ρ+1} ⊆ A−1A. But by definition of ρ we have A ⊆ B^ρ,
implying B^{2ρ+1} ⊆ B^{2ρ}, which in turn implies B^{2ρ} is the whole group. Now
(7.25) implies that for every x

|Ax \ A| < (1/2) |A| ≤ |A||Ac|/n.

But this contradicts (7.23). So (7.24) is false, i.e.

max_{g∈G} |Ag \ A| ≥ (1/(4ρ)) |A| ≥ (1/(2ρ)) |A||Ac|/n.

By complementation the final inequality remains true when |A| > n/2, and
the result follows from (7.22).
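Proposition 7.15 can also be verified by brute force on a small example. The sketch below (not in the original) checks (i) and (ii) over every proper non-empty subset of the Cayley graph of Z10 with G = {+1, −1} (the 10-cycle), computing c(A) exactly in rational arithmetic.

```python
from fractions import Fraction as F
from itertools import combinations

n, d = 10, 2                 # cycle Z_n, Cayley graph of Z_n with G = {+1, -1}
Delta = n // 2               # diameter
dist = lambda a, b: min((a - b) % n, (b - a) % n)

worst = F(0)
for k in range(1, n):        # proper non-empty subsets
    for A in combinations(range(n), k):
        S = set(A)
        # P_pi(X_1 in A^c | X_0 in A): one step is +-1 with probability 1/2 each
        exits = (sum((a + 1) % n not in S for a in A)
                 + sum((a - 1) % n not in S for a in A))
        c = F(n - k, n) / F(exits, 2 * k)
        worst = max(worst, c)
        rho = min(max(dist(v, a) for a in A) for v in range(n))
        if rho >= 1:
            assert c <= 2 * d * rho          # Proposition 7.15(ii)
assert worst <= d * Delta                    # Proposition 7.15(i): tau_c <= d*Delta
print(worst)                                 # sup_A c(A) for the 10-cycle
```

Here the supremum is attained by an arc of n/2 vertices, giving τc = n/4 = 5/2, well below the bound d∆ = 10.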

7.1.9 Comparison arguments for eigenvalues


The “distinguished paths” method of bounding relaxation times (Chapter
4 yyy) can also be used to compare relaxation times of two random flights
on the same group, and hence to bound one “unknown” relaxation time

in terms of a second “known” relaxation time. This approach has been


developed in great depth in
xxx ref Diaconis Saloff-Coste papers.
Here we give only the simplest of their results, from [117].
Consider generators G of a group I, and consider a reversible random
flight with step-distribution µ supported on G. Write d(x, id) for the distance
from x to the identity in the Cayley graph, i.e. the minimal length of a word

x = g1 g2 . . . gd ; gi ∈ G.

For each x choose some minimal-length word as above and define N(g, x)
to be the number of occurrences of g in the word. Now consider a different
reversible random flight on I with some step-distribution µ̃, not necessarily
supported on G. If we know τ̃2, the next result allows us to bound τ2.

Theorem 7.16

τ2/τ̃2 ≤ K ≡ max_{g∈G} (1/µ(g)) Σ_{x∈I} d(x, id) N(g, x) µ̃(x).

xxx give proof – tie up with L2 discussion


Perhaps surprisingly, Theorem 7.16 gives information even when the
comparison walk is the “trivial” walk whose step-distribution µ̃ is uniform
on the group. In this case, both d(x, id) and N (g, x) are bounded by the
diameter ∆, giving

Corollary 7.17 For a reversible flight with step-distribution µ on a group I,

τ2 ≤ ∆² / min_{g∈G} µ(g),

where G is the support of µ and ∆ is the diameter of the Cayley graph
associated with G.

When µ is uniform on G and |G| = d, the Corollary gives the bound d∆²,
which improves on the bound 8d²∆² which follows from Proposition 7.15 and
Cheeger's inequality (Chapter 4 yyy). The examples of the torus Z_N^d show
that ∆² enters naturally, but one could hope for the following variation.

Open Problem 7.18 Write τ∗ = τ∗(I, G) for the minimum of τ2 over all
symmetric random flights on I with step-distribution supported on G. Is it
true that τ∗ = O(∆²)?

7.2 Arc-transitivity
Example 7.14 shows that random walk on a Cayley graph does not nec-
essarily have the property that Ev Tw is the same for all edges (v, w). It
is natural to consider some stronger symmetry condition which does imply
this property. Call a graph arc-transitive if for each 4-tuple of vertices
(v1, w1, v2, w2) such that (v1, w1) and (v2, w2) are edges, there exists an
automorphism γ such that γ(v1) = v2, γ(w1) = w2. Arc-transitivity is stronger
than vertex-transitivity, and immediately implies that Ev Tw is constant over
edges (v, w).

Lemma 7.19 On an n-vertex arc-transitive graph,

(i) Ev Tw = n − 1 for each edge (v, w);
(ii) Ev Tw ≥ n − 2 + d(v, w) for all w ≠ v.

Proof. (i) follows from Ev T_v^+ = n. For (ii), write N(w) for the set of
neighbors of w. Then

Ev Tw = Ev T_{N(w)} + (n − 1)

and T_{N(w)} ≥ d(v, N(w)) = d(v, w) − 1.
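Lemma 7.19 can be checked on the Petersen graph (n = 10, arc-transitive, girth 5): every edge should give Ev Tw = n − 1 = 9, and every pair at distance 2 should give Ev Tw ≥ n − 2 + 2 = 10 (here the common value is 12, by distance-transitivity). The sketch below (not in the original) solves the hitting-time equations directly.

```python
def hit_times(adj, tgt):
    """Mean hitting times E_v T_tgt for simple random walk, by Gaussian
    elimination on (I - P) h = 1 with h(tgt) = 0."""
    n = len(adj)
    A = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    b = [0.0 if v == tgt else 1.0 for v in range(n)]
    for v in range(n):
        if v != tgt:
            for u in adj[v]:
                A[v][u] -= 1.0 / len(adj[v])
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            b[r] -= f * b[c]
            for cc in range(c, n):
                A[r][cc] -= f * A[c][cc]
    h = [0.0] * n
    for r in range(n - 1, -1, -1):
        h[r] = (b[r] - sum(A[r][c] * h[c] for c in range(r + 1, n))) / A[r][r]
    return h

# Petersen graph: outer 5-cycle 0..4, inner pentagram 5..9, spokes i -- i+5.
adj = [[] for _ in range(10)]
for i in range(5):
    adj[i] = [(i + 1) % 5, (i - 1) % 5, i + 5]
    adj[i + 5] = [5 + (i + 2) % 5, 5 + (i - 2) % 5, i]
h = hit_times(adj, 0)

assert all(abs(h[v] - 9) < 1e-8 for v in adj[0])       # (i): edges give n - 1 = 9
assert all(abs(h[v] - 12) < 1e-8                        # distance-2 value 12 >= 10
           for v in range(1, 10) if v not in adj[0])
print([round(x, 4) for x in h])
```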


In particular, minw6=v Ev Tw = n − 1, which gives the following bounds
on mean cover time EC. The first assertion uses Matthews method for
expectations (Chapter 2 yyy) and the second follows from Theorem 7.9.

Corollary 7.20 On an n-vertex arc-transitive graph, EC ≥ (n − 1)h_{n−1}.
And if τ0/n → 1 and τ2 = o(n/ log n) then

(C − τ0 log n)/τ0 →d η.  (7.26)

Note that the lower bound (n − 1)hn−1 is attained on the complete graph.
It is not known whether this exact lower bound remains true for vertex-
transitive graphs, but this would be a consequence of Chapter 6 Open Prob-
lem yyy. Note also that by xxx the hypothesis τ0 /n → 1 can only hold if
the degrees tend to infinity.
Corollary 7.20 provides easily-checkable conditions for the distributional
limit for cover times, in examples with ample symmetry, such as the card-
shuffling examples in the next section. Note that

(7.26) and τ0 = n(1 + (b + o(1))/log n) imply (C − n log n − bn)/n →d η.

Thus on the d-cube (Chapter 5 yyy) τ0 = n(1 + (1 + o(1))/d) = n(1 + (log 2 + o(1))/log n)
and so

(C − n log n − n log 2)/n →d η.

7.2.1 Card-shuffling examples


These examples are formally random flights on the permutation group,
though we shall describe them informally as models for random shuffles
of an m-card deck. Write Xt for the configuration of the deck after t shuffles,
and write Yt = f1 (Xt ) for the position of card 1 after t shuffles. In most ex-
amples (and all those we discuss) Yt is itself a Markov chain on {1, 2, . . . , m}.
Example 7.21, mentioned in Chapter 1 xxx, has become the prototype for
use of group representation methods.

Example 7.21 Card-shuffling via random transpositions.

The model is

Make two independent uniform choices of cards, and interchange


the positions of the two cards.

With chance 1/m the same card is chosen twice, so the “interchange” has
no effect. This model was studied by Diaconis and Shahshahani [122], and
more concisely in the book Diaconis [112] Chapter 3D. The chain Yt has
transition probabilities

i → j : probability 2/m², j ≠ i;
i → i : probability 1 − 2(m − 1)/m².
This is essentially random walk on the complete m-graph (precisely: the
continuized chains are deterministic time-changes of each other) and it is
easy to deduce that (Yt ) has relaxation time m/2. So by the contraction
principle xxx the card-shuffling process has τ2 ≥ m/2, and group represen-
tation methods show
τ2 = m/2. (7.27)
Since the chance of being in the initial state after 1 step is 1/m and after 2
steps is O(1/m²), the local transience heuristic (7.10) suggests

τ0 = m!(1 + 1/m + O(1/m²))  (7.28)



which can be verified by group representation methods (see Flatto et al


[152]). The general bound on τ1 in terms of τ2 gives only τ1 = O(τ2 log m!) =
O(m² log m). In fact group representation methods ([112]) show

there is a variation cutoff at (1/2) m log m.  (7.29)
Example 7.22 Card-shuffling via random adjacent transpositions.

The model is
With probability 1/(m + 1) do nothing. Otherwise, choose one
pair of adjacent cards (counting the top and bottom cards as ad-
jacent), with probability 1/(m+1) for each pair, and interchange
them.
The chain Yt has transition probabilities
i→ i+1 probability 1/(m + 1)
i→ i−1 probability 1/(m + 1)
i→ i probability (m − 1)/(m + 1)
with i ± 1 counted modulo m. This chain is (in continuous time) just a
time-change of random walk on the m-cycle, so has relaxation time

a(m) ≡ (m + 1)/2 · 1/(1 − cos(2π/m)) ∼ m³/(4π²).

So by the contraction principle xxx the card-shuffling process has τ2 ≥ a(m),
and (xxx unpublished Diaconis work) in fact

τ2 = a(m) ∼ m³/(4π²).

A coupling argument which we shall present in Chapter xxx gives an upper
bound τ1 = O(m³ log m) and (xxx unpublished Diaconis work) in fact

τ1 = Θ(m³ log m).

The local transience heuristic (7.10) again suggests

τ0 = m!(1 + 1/m + O(1/m²))

but this has not been studied rigorously.
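A quick numerical check (not in the original): the eigenvalues of the chain Yt are ((m − 1) + 2 cos(2πk/m))/(m + 1), k = 0, 1, . . . , m − 1, so a(m) is exactly 1/(1 − λ2), and a(m) · 4π²/m³ → 1.

```python
from math import cos, pi

def a(m):
    """a(m) = (m+1)/2 * 1/(1 - cos(2*pi/m))."""
    return (m + 1) / 2 / (1 - cos(2 * pi / m))

def relax(m):
    """Relaxation time 1/(1 - lambda_2) of the chain Y_t directly."""
    lam2 = ((m - 1) + 2 * cos(2 * pi / m)) / (m + 1)
    return 1 / (1 - lam2)

for m in (10, 100, 1000):
    print(m, a(m), a(m) * 4 * pi ** 2 / m ** 3)   # last ratio tends to 1
```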
Many variants of these examples have been studied, and we will mention
a generalization of Examples 7.21 and 7.22 in Chapter xxx. Here is another
example, from Diaconis and Saloff-Coste [117], which illustrates the use of
comparison arguments.

Example 7.23 A slow card-shuffling scheme.


The model is: with probability 1/3 each, either
(i) interchange the top two cards
(ii) move the top card to the bottom
(iii) move the bottom card to the top.
This process is random walk on a certain Cayley graph, which (for m ≥ 3)
is not arc-transitive. Writing d for distances in the graph and writing
β = max(d(σ, id) : σ a transposition),

it is easy to check that β ≤ 3m. Comparing the present chain with the
"random transpositions" chain (Example 7.21), denoted by ˜, Theorem 7.16
implies

τ2/τ̃2 ≤ 3β².

Since τ̃2 = m/2 we get

τ2 ≤ 27m³/2.

7.2.2 Cover times for the d-dimensional torus Z_N^d


This is Example yyy from Chapter 5, with n = N^d vertices, and is clearly
arc-transitive. Consider asymptotics as N → ∞ for d fixed. We studied
mean hitting times in this example in Chapter 5. Here τ0/n does not tend
to 1, so we cannot apply Corollary 7.20. For d = 1 the graph is just the
N-cycle, treated in Chapter 6 yyy. For d ≥ 3, Chapter 5 yyy gave

E0 Ti ∼ nRd as N → ∞, |i| → ∞,

where |i| is Euclidean distance on the torus, i.e.

|(i1, . . . , id)|² = Σ_{u=1}^{d} (min(iu, N − iu))².
So EC has the asymptotic upper bound Rd n log n. Now if we apply the
subset form of Matthews method (Chapter 6 yyy) to the subset

A = {(j1 m, . . . , jd m) : 1 ≤ ji ≤ N/m}  (7.30)

then we get a lower bound for EC asymptotic to

log |A| × nRd.

By taking m = m(n) ↑ ∞ slowly, this agrees with the upper bound, so we
find

Corollary 7.24 On the d-dimensional torus with d ≥ 3,

EC ∼ Rd n log n.

Perhaps surprisingly, the case d = 2 turns out to be the hardest of all
explicit graphs for the purposes of estimating cover times. (Recall this case
is the white screen problem of Chapter 1 xxx.) Loosely, the difficulty is caused
by the fact that τ0 = Θ(n log n) – recall from Chapter 6 yyy that another
example with this property, the balanced tree, is also hard. Anyway, for the
case d = 2 the calculations in Chapter 5 yyy gave

E0 Ti ∼ (2/π) n (log |i| + O(1)).
This leads to the upper bound in Corollary 7.25 below. For the lower bound,
we repeat the d ≥ 3 argument using a subset of the form (7.30) with m → ∞,
and obtain a lower bound for EC asymptotic to

(2/π) n log m × log(n/m²).

The optimal choice is m ∼ n^{1/4}, leading to the lower bound below.

Corollary 7.25 On the 2-dimensional torus Z_N²,

(1/(4π) − o(1)) n log² n ≤ EC ≤ (1/π + o(1)) n log² n.
Lawler [221] has improved the constant in the lower bound to 1/(2π) – see Notes.
It is widely believed that the upper bound is in fact the limit.
Open Problem 7.26 Prove that, on the 2-dimensional torus Z_N²,

EC ∼ (1/π) n log² n.
The usual distributional limit

(C − τ0 log n)/τ0 →d η

certainly fails in d = 1 (see Chapter 6 yyy). It has not been studied in d ≥ 2,
but the natural conjecture is that it is true for d ≥ 3 but false in d = 2.
Note that (by Chapter 6 yyy) the weaker concentration result

C/EC →d 1

holds for all d ≥ 2.



7.2.3 Bounds for the parameters


In Chapter 6 we discussed upper bounds on parameters τ for regular graphs.
One can’t essentially improve these bounds by imposing symmetry condi-
tions, because the bounds are attained (up to constants) by the n-cycles.
But what if we exclude the n-cycles? Example 7.14 shows that one can
invent vertex-transitive graphs which mimic the n-cycle, but it is not clear
whether such arc-transitive graphs exist. So perhaps the next-worst arc-
transitive graph is Z_m².

Open Problem 7.27 Is it true that, over arc-transitive graphs excluding
the n-cycles, τ∗ = O(n log n), τ2 = O(n) and τ∗/(2τ0) = 1 + o(1)?

7.2.4 Group-theory set-up


Recall that the Cayley graph associated with a set G of generators of a group
I has edges
{(v, vg); v ∈ I, g ∈ G}

where we assume G satisfies


(i) g ∈ G implies g −1 ∈ G.
To ensure that the graph is arc-transitive, it is sufficient to add the condition
(ii) for each pair g1 , g2 in G, there exists a group automorphism γ such
that γ(id) = id and γ(g1 ) = g2 .
In words, “the stabilizer acts transitively on G”. This is essentially the
general case: see [71] Prop. A.3.1.
As a related concept, recall that elements x, y of a group I are conjugate
if x = g −1 yg for some group element g. This is an equivalence relation which
therefore defines conjugacy classes. It is easy to check that a conjugacy class
must satisfy condition (ii). Given a conjugacy class C one can consider the
uniform distribution µC on C and then consider the random flight with step
distribution µC . Such random flights fit into the framework of section 7.2,
and Example 7.21 and the torus ZN d are of this form. On the other hand,

Example 7.22 satisfies (i) and (ii) but is not a random flight with steps
uniform on a conjugacy class.

7.3 Distance-regular graphs


A graph is called distance-transitive if for each 4-tuple v1, w1, v2, w2 with
d(v1, w1) = d(v2, w2) there exists an automorphism γ such that γ(v1) = v2,
γ(w1) = w2. Associated with such a graph of diameter ∆ are the intersection
numbers (ai, bi, ci; 0 ≤ i ≤ ∆), defined as follows. For each i choose
(v, w) with d(v, w) = i, and define

ci = number of neighbors of w at distance i−1 from v


ai = number of neighbors of w at distance i from v
bi = number of neighbors of w at distance i+1 from v.

The distance-transitive property ensures that (ai , bi , ci ) does not depend


on the choice of (v, w). A graph for which such intersection numbers ex-
ist is called distance-regular, and distance-regularity turns out to be strictly
weaker than distance-transitivity. An encyclopedic treatment of such graphs
has been given by Brouwer et al [71]. The bottom line is that there is almost
a complete characterization (i.e. list of families and sporadic examples) of
distance-regular graphs. Anticipating a future completion of the character-
ization, one could seek to prove inequalities for random walks on distance-
regular graphs by simply doing explicit calculations with all the examples,
but (to quote Biggs [49]) “this would certainly not find a place in The Erdos
Book of ideal proofs”. Instead, we shall just mention some properties of
random walk which follow easily from the definitions.
Consider random walk (Xt ) on a distance-regular graph started at v0 ,
and define Dt = d(v0 , Xt ). Then (Dt ) is itself a Markov chain on states
{0, 1, . . . , ∆}, and is in fact the birth-and-death chain with transition prob-
abilities
pi,i−1 = ci /r, pi,i = ai /r, pi,i+1 = bi /r.
xxx b-and-d with holds
Finding exact t-step transition probabilities is tantamount to finding
the orthogonal polynomials associated with the distance-regular graph –
references to the latter topic can be found in [71], but we shall not pursue
it.

7.3.1 Exact formulas


A large number of exact formulas can be derived by combining the standard
results for birth-and-death chains in Chapter 5 section yyy with the standard
renewal-theoretic identities of Chapter 2 section yyy. We present only the
basic ones.
Fix a state 0 in a distance-regular graph. Let ni be the number of states
at distance i from 0. The number of edges with one end at distance i and

the other at distance i + 1 is ni bi = ni+1 ci+1, leading to the formula

ni = Π_{j=1}^{i} (bj−1/cj); 0 ≤ i ≤ ∆.

The chain Dt has stationary distribution

ρi = ni/n = n^{−1} Π_{j=1}^{i} (bj−1/cj); 0 ≤ i ≤ ∆.

Switching to the notation of Chapter 5 yyy, the chain Dt is random walk on
a weighted linear graph, where the weight wi on edge (i − 1, i) is

wi = ni−1 bi−1/(2n) = ni ci/(2n), 1 ≤ i ≤ ∆,

and total weight w = 1. This graph may have self-loops, but they don't
affect the formulas. Clearly hitting times on the graph are related to hitting
times of (Dt) by

Ev Tx = h(d(v, x)), where h(i) ≡ Ẽi T0,  (7.31)
and where we write ˜ to refer to expectations for Dt. Clearly h(·) is strictly
increasing. Chapter 5 yyy gives the formula

h(i) = i + 2 Σ_{j=1}^{i} Σ_{k=j+1}^{∆} wk/wj.  (7.32)
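Formula (7.32) can be checked on the d-cube, whose distance chain has ci = i, ai = 0, bi = d − i and ni = C(d, i). The sketch below (not in the original) computes h(i) from (7.32), which involves only the ratios wk/wj and so is insensitive to the overall normalization of the weights, and compares with an independent solution of the first-step recursion ei = (d + bi e_{i+1})/ci for ei = Ẽi T_{i−1}.

```python
from fractions import Fraction as F
from math import comb

def cube_h_via_732(d):
    """h(i) from (7.32), with w_i proportional to n_{i-1} b_{i-1} = n_i c_i."""
    n = 2 ** d
    w = [None] + [F(comb(d, i - 1) * (d - i + 1), 2 * n) for i in range(1, d + 1)]
    return [i + 2 * sum(w[k] / w[j]
                        for j in range(1, i + 1)
                        for k in range(j + 1, d + 1))
            for i in range(1, d + 1)]

def cube_h_direct(d):
    """Independent check: e_i = E_i T_{i-1} solves e_i = (d + b_i e_{i+1}) / c_i,
    and h(i) = e_1 + ... + e_i (exact rational arithmetic)."""
    e = [F(0)] * (d + 2)
    for i in range(d, 0, -1):
        e[i] = F(d + (d - i) * e[i + 1]) / i
    h, acc = [], F(0)
    for i in range(1, d + 1):
        acc += e[i]
        h.append(acc)
    return h

print([int(x) for x in cube_h_via_732(3)])   # [7, 9, 10]: hitting times on the 3-cube
```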

And Chapter 5 yyy gives the last equality in

τ0 = Eπ T0 = Ẽρ T0 = (1/2) Σ_{i=1}^{∆} wi^{−1} (Σ_{j≥i} ρj)².  (7.33)

Finally, Chapter 5 yyy gives

Ẽ0 T∆ + Ẽ∆ T0 = Σ_{i=1}^{∆} 1/wi.  (7.34)

Thus if the graph has the property that there exists a unique vertex 0∗ at
distance ∆ from 0, then we can pull back to the graph to get

τ∗/2 = max_{x≠v} Ex Tv = E0 T0∗ = (1/2) Σ_{i=1}^{∆} 1/wi.  (7.35)

If the graph lacks that property, we can use (7.31) to calculate h(∆).
The general identities of Chapter 3 yyy can now be used to give formulas
for quantities such as Px (Ty < Tz ) or Ex (number of visits to y before Tz ).

7.3.2 Examples
Many treatments of random walk on sporadic examples such as regular
polyhedra have been given, e.g. [227, 228, 275, 319, 320, 330, 331], so I shall
not repeat them here. Of infinite families, the complete graph was discussed
in Chapter 5 yyy, and the complete bipartite graph is very similar. The
d-cube also was treated in Chapter 5. Closely related to the d-cube is a
model arising in several contexts under different names,
Example 7.28 c-subsets of a d-set.
The model has parameters (c, d), where 1 ≤ c ≤ d − 1. Formally, we
have random walk on the distance-transitive graph whose vertices are the
d!/(c!(d−c)!) c-element subsets A ⊂ {1, 2, . . . , d}, and where (A, A′) is an edge iff
|A △ A′| = 2. More vividly, d balls {1, 2, . . . , d} are distributed between a left
urn and a right urn, with c balls in the left urn, and at each stage one ball is
picked at random from each urn, and the two picked balls are interchanged.
The induced birth-and-death chain is often called the Bernoulli-Laplace
diffusion model. The analysis is very similar to that of the d-cube. See
[123, 127] and [112] Chapter 3F for details on convergence to equilibrium
and [110] for hitting and cover times.

7.3.3 Monotonicity properties


The one result about random walk on distance-regular graphs we wish to
highlight is the monotonicity property given in Proposition 7.29 below. Part
(ii) can be viewed as a strengthening of the monotonicity property for mean
hitting times (by integrating over time and using the formula relating mean
hitting times to the fundamental matrix).
Proposition 7.29 For random walk (Xt) on a distance-regular graph in
continuous time, Pv(Xt = w) = q(t, d(v, w)), where the function d → q(t, d)
satisfies
(i) d → q(t, d) is non-increasing, for fixed t;
(ii) q(t, d)/q(t, 0) is non-decreasing in t, for fixed d.
xxx proof – coupling – defer to coupling Chapter ??
Proposition 7.29 is a simple example of what I call a “geometric” result
about a random walk. Corollary 7.3 gave a much weaker result in a more
general setting. It’s natural to ask for intermediate results, e.g.
Open Problem 7.30 Does random walk on an arc-transitive graph have
some monotonicity property stronger than that of Corollary 7.3?

7.3.4 Extremal distance-regular graphs


Any brief look at examples suggests
Open Problem 7.31 Prove that, over distance-regular graphs excluding
the n-cycles, τ0 = O(n).
Of course this would imply τ ∗ = O(n) and EC = O(n log n). As mentioned
earlier, one can try to tackle problems like this by using the list of known
distance-regular graphs in [71]. Biggs [49] considered the essentially equiv-
alent problem of the maximum value of maxi,j Ei Tj /(n − 1), and found the
value 195/101 taken on the cubic graph with 102 vertices, and outlined an
argument that this may be the max over known distance-regular graphs.
xxx in same setting is τ2 = O(log n)?

7.3.5 Gelfand pairs and isotropic flights


On a distance-regular graph, a natural generalization of our nearest-neighbor
random walks is to isotropic random flight on the graph. Here one specifies a
probability distribution (s0 , s1 , . . . , s∆ ) for the step-length S, and each step
moves to a random vertex at distance S from the previous vertex. Precisely,
it is the chain with transition probabilities

p(v, w) = s_{d(v,w)} / n_{d(v,w)}.  (7.36)
The notion of isotropic random flight also makes sense in continuous
space. For an isotropic random flight in Rd , the steps have some arbitrary
specified random length S and a direction θ which is uniform and indepen-
dent of S. A similar definition can be made on the d-dimensional sphere.
The abstract notion which captures distance-regular graphs and their con-
tinuous analogs is a Gelfand pair. Isotropic random flights on Gelfand pairs
can be studied in great detail by analytic methods. Brief accounts can be
found in Letac [224, 225] and Diaconis [112] Chapter 3F, which contains an
extensive annotated bibliography.

7.4 Notes on Chapter 7


Diaconis [112] Chapter 3 discusses random walks on groups, emphasizing
use of the upper bound lemma to establish bounds on τ1 and d(t), and con-
taining extensive references to previous work using group-theoretic meth-
ods. We have only mentioned reversible examples, but many natural non-
reversible examples can also be handled by group representation methods.

Also, in Example 7.21 and related examples, group representation methods


give stronger information about d(t) then we have quoted.
Elementary properties of hitting and cover times on graphs with symme-
try structure have been noted by many authors, a particularly comprehen-
sive treatment being given in the Ph.D. thesis Sbihi [306]. Less extensive
treatments and specific elementary results can be found in many of the pa-
pers cited later, plus [274, 330, 331]
Section 7.1. The phrase “random flight” is classically used for Rd . I have
used it (as did Takacs [319, 320]) in place of “random walk’ to emphasize it
is not necessarily a nearest-neighbor random walk.
Section 7.1.3. Other elementary facts about symmetric reversible chains
are

Eπ min(Ti, Tj) = (n/2)(Zii + Zij);
Pi(X2t = i) + Pi(X2t = j) ≥ 2/n.
Chapter 6 yyy showed that on any regular graph, maxi,j Ei Tj ≤ 3n2 . On
a vertex-transitive graph the constant “3” can be improved to “2”, by an
unpublished argument of the author, but this is still far from the natural
conjecture of 1/4.
Section 7.1.4. Another curious result from [12] is that for a symmetric
reversible chain the first passage time cannot be concentrated around its
mean:

var_i Tj / (Ei Tj)² ≥ (e − 2)/(e − 1) − 1/(Ei Tj).
Section 7.1.5. Before Matthews method was available, a result like Corol-
lary 7.5 (c) required a lot of work – see Aldous [8] for a result in the setting
of non-reversible random flight on a group. The present version of Corollary
7.5 (c) is a slight polishing of ideas in Zuckerman [343] section 6.
The fact that (7.17) implies (7.15) is a slight variation of the usual text-
book forms of the continuity theorem ([133] 2.3.4 and 2.3.11) for Fourier and
Laplace transforms. By the same argument as therein, it is enough for the
limit transform to be continuous at θ = 0, which holds in our setting.
Matthews [255, 257] introduced Proposition 7.8 and used it to obtain
the limiting cover time distribution for the d-cube and for card-shuffling
examples. Devroye and Sbihi [110] applied it to generalized hypercubes and
to Example 7.28. Our implementation in Theorem 7.9 and Corollary 7.20
reduces the need for ad hoc calculations in particular examples.
Section ?? Example 7.11 has been studied in the reliability literature
(e.g. [212]) from the viewpoint of the exponential approximation for hitting
times.

Section 7.1.7. The factor of 2 difference between the variation and sep-
aration cutoffs which appears in Lemma 7.12 is the largest possible – see
Aldous and Diaconis [22].
Section 7.1.8. xxx walk-regular example – McKay paper.
Section 7.1.9. Diaconis and Saloff-Coste [117] give many other applica-
tions of Theorem 7.16. We mention some elsewhere; others include
xxx list.
Section 7.2. The name “arc-transitive” isn’t standard: Biggs [48] writes
“symmetric” and Brouwer et al [71] write “flag-transitive”. Arc-transitivity
is not necessary for the property “Ev Tw is constant over edges”. For in-
stance, a graph which is vertex-transitive and edge-transitive (in the sense
of undirected edges) has the property, but is not necessarily arc-transitive
[182]. Gobel and Jagers [168] observed that the property

Ev Tw + Ew Tv = 2(n − 1) for all edges (v, w)

(equivalently: the effective resistance across each edge is constant) holds for
arc-transitive graphs and for trees.
Section 7.2.2. Sbihi [306] and Zuckerman [343] noted that the subset ver-
sion of Matthews method could be applied to the d-torus to give Corollaries
7.24 and 7.25.
The related topic of the time taken by random walk on the infinite lattice
Z d to cover a ball centered at the origin has been studied independently –
see Revesz [288] Chapter 22 and Lawler [221], who observed that similar
arguments could be applied to the d-torus, improving the lower bound in
Corollary 7.25. It is easy to see an informal argument suggesting that,
for random walk on the 2-torus, when nα vertices are unvisited the set of
unvisited vertices has some kind of fractal structure. No rigorous results are
known, but heuristics are given in Brummelhuis and Hilhorst [75].
Section 7.3.1. Deriving these exact formulas is scarcely more than un-
dergraduate mathematics, so I am amazed to see that research papers have
continued to be published in the 1980s and 1990s claiming various special
or general cases as new or noteworthy.
Section 7.3.5. In the setting of isotropic random flight (7.36) with step-
length distribution q, it is natural to ask what conditions on q and q′ imply
that τ(q) ≥ τ(q′) for our parameters τ. For certain distributions on the
d-cube, detailed explicit calculations by Karlin et al [207] establish an ordering
of the entire eigenvalue sequences, which in particular implies this inequality
for τ2 and τ0 . Establishing results of this type for general Gelfand pairs seems
an interesting project.

Miscellaneous. On a finite field, such as Zp for prime p, one can consider


“random walks” with steps of the form x → αx + β, with a specified joint
distribution for (α, β). Chung et al [94] treat one example in detail.
Chapter 8

Advanced L2 Techniques for


Bounding Mixing Times
(May 19 1999)

xxx In next revision, we should change the definition [in Chapter 4,
yyy:(14)] of d̂(t) so that what is now d̂(2t) becomes d̂(t).
This chapter concerns advanced L2 -based techniques, developed mainly
by Persi Diaconis and Laurent Saloff-Coste [117, 118, 119, 120] for bounding
mixing times for (finite, irreducible) reversible Markov chains. For conve-
nience, we will work in continuous time throughout this chapter, unless
otherwise noted. Many of the results are conveniently expressed in terms of
an “L2 threshold time” τ̂ (xxx use different notation?) defined by

τ̂ := inf{t > 0 : max_i ||Pi (Xt ∈ ·) − π(·)||_2 ≤ e^{−1}}.  (8.1)

xxx For NOTES: Discussion of discrete time, esp. negative eigenvalues.


Several preliminary comments are in order here. First, the definition of
the L2 distance kPi (Xt ∈ ·) − π(·)k2 may be recalled from Chapter 2 section
yyy:6.2, and Chapter 3 yyy:(55) and the spectral representation give useful
reexpressions:
||Pi (Xt ∈ ·) − π(·)||_2² = Σ_j πj (pij(t)/πj − 1)²
                          = pii(2t)/πi − 1  (8.2)
                          = πi^{−1} Σ_{m=2}^{n} exp(−2λm t) u_{im}².

Second, from (8.2) and Chapter 4 yyy:(14) we may also write the maximum
L2 distance appearing in (8.1) using

max_i ||Pi (Xt ∈ ·) − π(·)||_2² = max_i (pii(2t)/πi − 1) = max_{i,j} (pij(2t)/πj − 1) = d̂(2t).

Third, by the application of the Cauchy–Schwarz lemma in Chapter 4 Lemma
yyy:8, variation distance can be bounded by L2 distance:

4di²(t) := 4||Pi (Xt ∈ ·) − π(·)||² ≤ ||Pi (Xt ∈ ·) − π(·)||_2²,  (8.3)

4d²(t) := 4 max_i ||Pi (Xt ∈ ·) − π(·)||² ≤ d̂(2t);  (8.4)

these inequalities are the primary motivation for studying L2 distance.

As argued in Chapter 4 yyy: just following (23),

d̂(2t) ≤ π∗^{−1} e^{−2t/τ2},  (8.5)

where τ2 := λ2^{−1} is the relaxation time and π∗ := min_i πi. Thus if

t ≥ τ2 ((1/2) log(1/π∗) + c),

then

d(t) ≤ (1/2) (d̂(2t))^{1/2} ≤ (1/2) e^{−c},  (8.6)

which is small if c is large; in particular, (8.6) gives the upper bound in

τ2 ≤ τ̂ ≤ τ2 ((1/2) log(1/π∗) + 1),  (8.7)

and the lower bound follows easily.


For many simple chains (see Chapter 5), τ2 can be computed exactly.
Typically, however, τ2 can only be bounded. This can be done using the
“distinguished paths” method of Chapter 4 Section yyy:3. In Section 1 we
will see that that method may be regarded as a special case of a “comparison
method” whereby a chain with “unknown” relaxation time is compared to
a second chain with “known” relaxation time. The greater generality often
leads to improved bounds on τ2 . As a bonus, the comparison method also
gives bounds on the other “unknown” eigenvalues, and such bounds in turn

can sometimes further decrease the time t required to guarantee that d̂(2t),
and hence also d(t), is small.
A second set of advanced techniques, encompassing the notions of Nash
inequalities, moderate growth, and local Poincaré inequaltities, is described
in Section 3. The development there springs from the inequality
‖P_i(X_t ∈ ·) − π(·)‖_2 ≤ N(s) e^{−(t−s)/τ_2},    (8.8)

established for all 0 ≤ s ≤ t in Section 2, where

N(t) = max_i ‖P_i(X_t ∈ ·)‖_2 = max_i √(p_ii(2t)/π_i),  t ≥ 0.    (8.9)
Choosing s = 0 in (8.8) gives
‖P_i(X_t ∈ ·) − π(·)‖_2 ≤ π_i^{−1/2} e^{−t/τ_2},
and maximizing over i recaptures (8.5). The point of Section 3, however,
is that one can sometimes reduce the bound by a better choice of s and
suitable estimates of the decay rate of N (·). Such estimates can be provided
by so-called Nash inequalities, which are implied by (1) moderate growth
conditions and (2) local Poincaré inequalities. Roughly speaking, for chains
satisfying these two conditions, judicious choice of s shows that variation
mixing time and τ̂ are both of order ∆2 , where ∆ is the diameter of the
graph underlying the chain.
xxx Might not do (1) or (2), so need to modify the above.
To outline a third direction of improvement, we begin by noting that nei-
ther of the bounds in (8.7) can be much improved in general. Indeed, ignor-
ing Θ(1) factors as usual, the lower bound is equality for the n-cycle (Chap-
ter 5, Example yyy:7) and the upper bound is equality for the M/M/1/n
queue (Chapter 5, Example yyy:6) with traffic intensity ρ ∈ (0, 1).
In Section 4 we introduce the log-Sobolev time τl defined by
τ_l := sup{ L(g)/E(g, g) : g ≢ constant }    (8.10)

where L(g) is the entropy-like quantity

L(g) := Σ_i π_i g^2(i) log( |g(i)| / ‖g‖_2 ),

recalling ‖g‖_2^2 = Σ_i π_i g^2(i). Notice the similarity between (8.10) and the
extremal characterization of τ_2 (Chapter 3 Theorem yyy:22):

τ_2 = sup{ ‖g‖_2^2 / E(g, g) : Σ_i π_i g(i) = 0, g ≢ 0 }.
We will see that

τ_2 ≤ τ_l ≤ τ_2 · log(π_*^{−1} − 1) / (2(1 − 2π_*))
and that τ̂ is more closely related to τl than to τ2 , in the sense that
τ_l ≤ τ̂ ≤ τ_l ( (1/2) log log(1/π_*) + 2 ).    (8.11)
To illustrate the improvement over (8.7), from the knowledge for the d-
cube (Chapter 5, Example yyy:15) that τ2 = d/2, one can deduce from (8.7)
that
(1/2) d ≤ τ̂ ≤ (1/4)(log 2) d^2 + (1/2) d.    (8.12)
In Section 4.4 (Example 27) we will see that τl = d/2; then from (8.11) we
can deduce the substantial improvement
(1/2) d ≤ τ̂ ≤ (1/4) d log d + ( 1 − (1/4) log(1/log 2) ) d    (8.13)
upon (8.12).
ZZZ!: Recall also the corrections in my notes on pages 8.2.11–12 (and
8.4.27). Continue same paragraph:
The upper bound here is remarkably tight: from Chapter 5 yyy:(65),
τ̂ = (1/4) d log d + (1/4) log( 1/log(1 + e^{−2}) ) d + o(d) as d → ∞.

ZZZ!: In fact, the remainder term is O(1). Continue same paragraph:


Thus log-Sobolev techniques provide another means of improving mixing
time bounds, both in L2 and, because of (8.3)–(8.4), in variation. As will
be seen, these techniques can also be combined usefully with comparison
methods and Nash inequalities.

8.1 The comparison method for eigenvalues


xxx Revise Chapter 7, Sections 1.9 and 4, in light of this section?
The comparison method, introduced by Diaconis and Saloff-Coste [117,
118], generalizes the distinguished path method of Chapter 4, Section yyy:3
for bounding the relaxation time of a reversible Markov chain. As before, we
first (xxx: delete word?) work in the setting of random walks on weighted
graphs. We will proceed for given state space (vertex set) I by comparing
a collection (wij ) of weights of interest to another collection (w̃ij ); the idea
will be to use known results for the random walk with weights (w̃ij ) to derive
corresponding results for the walk of interest. We assume that the graph
is connected under each set of weights. As in Chapter 4, Section yyy:4.3,
we choose (“distinguish”) paths γxy from x to y. Now, however, this need
be done only for those (x, y) with x 6= y and w̃xy > 0, but we impose the
additional constraint we > 0 for each edge e in the path. (Here and below,
e denotes a directed edge in the graph of interest.) In other words, roughly
put, we need to construct a (wij )-path to effect each given (w̃xy )-edge. Recall
from Chapter 3 yyy:(71) the definition of Dirichlet form:
E(g, g) = (1/2) Σ_i Σ_{j≠i} (w_ij/w) (g(j) − g(i))^2,    (8.14)
Ẽ(g, g) = (1/2) Σ_i Σ_{j≠i} (w̃_ij/w̃) (g(j) − g(i))^2.

Theorem 8.1 (comparison of Dirichlet forms) For each ordered pair


(x, y) of distinct vertices with w̃_xy > 0, let γ_xy be a path from x to y with
w_e > 0 for every e ∈ γ_xy. Then the Dirichlet forms (8.14) satisfy

Ẽ(g, g) ≤ A E(g, g) = E(g, g) · (w/w̃) max_e (1/w_e) Σ_x Σ_{y≠x} w̃_xy |γ_xy| 1_{(e ∈ γ_xy)}

for every g.

Proof. For an edge e = (i, j) write ∆g(e) = g(j) − g(i). Then

2w̃ Ẽ(g, g) = Σ_x Σ_{y≠x} w̃_xy (g(y) − g(x))^2
= Σ_x Σ_{y≠x} w̃_xy ( Σ_{e ∈ γ_xy} ∆g(e) )^2
≤ Σ_x Σ_{y≠x} w̃_xy |γ_xy| Σ_{e ∈ γ_xy} (∆g(e))^2    by Cauchy–Schwarz
≤ A (w̃/w) Σ_e w_e (∆g(e))^2 = A · 2w̃ E(g, g).
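Theorem 8.1 can be sanity-checked numerically. The sketch below is my own setup (two randomly weighted complete graphs, and the trivial single-edge paths γ_xy = (x, y), available here because every w_xy > 0); it computes A for those paths and verifies Ẽ(g, g) ≤ A E(g, g) for random test functions g:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
def dirichlet(W, g):            # E(g,g) = (1/2) sum_{i,j} (w_ij/w) (g_j - g_i)^2
    diff = g[None, :] - g[:, None]
    return 0.5 * (W / W.sum() * diff ** 2).sum()

def weights():                  # random symmetric weights, zero diagonal
    W = rng.uniform(0.5, 2.0, (n, n)); W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    return W

W, Wt = weights(), weights()    # chain of interest, benchmark ("tilde") chain
off = ~np.eye(n, dtype=bool)
# single-edge paths gamma_xy = (x, y):  A = (w/w~) max_e  w~_e / w_e
A = (W.sum() / Wt.sum()) * (Wt[off] / W[off]).max()
for _ in range(200):
    g = rng.normal(size=n)
    assert dirichlet(Wt, g) <= A * dirichlet(W, g) + 1e-12
```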

Remarks. (a) Suppose the comparison weights (w̃ij ) are given by


w̃ij = wi wj /w for i, j ∈ I.
The corresponding discrete-time random walk is then the “trivial” walk with
w̃ = w and
w̃i = wi , p̃ij = πj , π̃j = πj
for all i, j, and


Ẽ(g, g) = (1/2) Σ_i Σ_{j≠i} π_i π_j (g(j) − g(i))^2 = var_π g = ‖g‖_2^2,

provided Σ_i π_i g(i) = 0.

In this case the conclusion of Theorem 1 reduces to


‖g‖_2^2 ≤ E(g, g) · w max_e (1/w_e) Σ_x Σ_{y≠x} π_x π_y |γ_xy| 1_{(e ∈ γ_xy)}.

This inequality was established in the proof of the distinguished path the-
orem (Chapter 4 Theorem yyy:32), and that theorem was an immediate
consequence of the inequality. Hence the comparison Theorem 1 may be
regarded as a generalization of the distinguished path theorem.
[xxx For NOTES: We’ve used simple Sinclair weighting. What about
other weighting in use of Cauchy–Schwarz? Hasn’t been considered, as far
as I know.]
(b) When specialized to the setting of reversible random flights on Cay-
ley graphs described in Chapter 7 Section yyy:1.9, Theorem 1 yields The-
orem yyy:14 of Chapter 7. To see this, adopt the setup in Chapter 7 Sec-
tion yyy:1.9, and observe that the word

x = g1 g2 · · · gd (with each gi ∈ G) (8.15)

corresponds uniquely to a path

γid,x = (id, g1 , g1 g2 , . . . , g1 g2 · · · gd = x) (8.16)

in the Cayley graph corresponding to the generating set G of interest. Having


built paths γid,x for each x ∈ I, we then can build paths γyz for y, z ∈ I by
exploiting vertex-transitivity, to wit, by setting

γyz = (y, yg1 , yg1 g2 , . . . , yg1 g2 · · · gd = z)

where y −1 z = x and the path γid,x is given by (8.16). In Theorem 1 we then


have both stationary distributions π and π̃ uniform,

w̃xy = µ̃(x−1 y)/n, w̃ = 1, |γxy | = |γid,x−1 y | = d(id, x−1 y),

and, if e = (v, vg) with v ∈ I and g ∈ G,

we = µ(g)/n, w = 1, 1(e∈γxy ) = 1((x−1 v,x−1 vg)∈γid,x−1 y ) .


Thus A of Theorem 1 equals


max_{v ∈ I, g ∈ G} (1/µ(g)) Σ_x Σ_{y≠x} µ̃(x^{−1}y) d(id, x^{−1}y) 1_{((x^{−1}v, x^{−1}vg) ∈ γ_{id, x^{−1}y})},

which reduces easily to K of Theorem yyy:14 of Chapter 7. Since π and π̃


are both uniform, the extremal characterization
τ_2 = sup{ ‖g‖_2^2 / E(g, g) : Σ_i π_i g(i) = 0 }    (8.17)

gives Theorem yyy:14 of Chapter 7.


Theorem 8.1 compares Dirichlet forms. To compare relaxation times
using the extremal characterization (8.17), we compare L2 -norms using the
same “direct” technique as for Chapter 3 Lemma yyy:26. For any g,

‖g‖_2^2 ≤ ‖g‖~_2^2 max_i (π_i/π̃_i)    (8.18)

where, as usual, πi := wi /w and π̃i = w̃i /w̃. So if g has π-mean 0 and


π̃-mean b, then

‖g‖_2^2 / E(g, g) ≤ ‖g − b‖_2^2 / E(g − b, g − b) ≤ ( A ‖g − b‖~_2^2 / Ẽ(g − b, g − b) ) max_i (π_i/π̃_i).    (8.19)
Thus
Corollary 8.2 (comparison of relaxation times) In Theorem 1,
τ_2 ≤ (A/a) τ̃_2

where

A := (w/w̃) max_e (1/w_e) Σ_x Σ_{y≠x} w̃_xy |γ_xy| 1_{(e ∈ γ_xy)},
a := min_i (π̃_i/π_i).
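A corresponding numerical check of Corollary 8.2, again with single-edge paths on randomly weighted complete graphs (the setup is my own choice):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
def make(W):                    # stationary distribution and relaxation time
    wi = W.sum(axis=1); pi = wi / wi.sum()
    Q = W / wi[:, None]; np.fill_diagonal(Q, -Q.sum(axis=1))
    d = np.sqrt(pi)
    lam = np.linalg.eigvalsh(d[:, None] * Q / d[None, :])
    return pi, -1.0 / lam[-2]

def weights():
    W = rng.uniform(0.5, 2.0, (n, n)); W = (W + W.T) / 2
    np.fill_diagonal(W, 0.0)
    return W

W, Wt = weights(), weights()
pi, tau2 = make(W)
pit, tau2t = make(Wt)
off = ~np.eye(n, dtype=bool)
A = (W.sum() / Wt.sum()) * (Wt[off] / W[off]).max()   # single-edge paths
a = (pit / pi).min()
assert tau2 <= (A / a) * tau2t + 1e-9                 # Corollary 8.2
```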

xxx Perhaps restate as

τ_2 ≤ B τ̃_2

where

B := ( max_e (1/w_e) Σ_x Σ_{y≠x} · · · ) max_i (w_i/w̃_i)

(and similarly for Corollaries 4 and 7)?


xxx Remark about best if π = π̃?


Here is a simple example, taken from [117], showing the improvement in
Corollary 8.2 over Theorem yyy:32 of Chapter 4 provided by the freedom in
choice of benchmark chain.
xxx NOTE: After the fact, I realized that the following example was
already Example yyy:20 of Chapter 7; must reconcile.

Example 8.3 Consider a card shuffle which transposes the top two cards
in the deck, moves the top card to the bottom, or moves the bottom card
to the top, each with probability 1/3. This example fits the specialized
group framework of Chapter 7 Section yyy:1.9 (see also Remark (b) following
Theorem 8.1 above) with I taken to be the symmetric group on m letters
and
G := {(1 2), (m m − 1 m − 2 · · · 1), (1 2 · · · m)}
in cycle notation. [If the order of the deck is represented by a permutation σ
in such a way that σ(i) is the position of the card with label i, and if
permutations are composed left to right, then σ · (m m − 1 m − 2 · · · 1) is
the order resulting from σ by moving the top card to the bottom.]
We obtain a representation (8.15) for any given permutation x by writing

x = hm hm−1 · · · h2

in such a way that

(hm · · · hi )−1 (j) = x−1 (j) for i ≤ j ≤ m (8.20)

(i.e., hm · · · hi and x agree in positions i through m) and each hi is explicitly


represented as a product of generators. To accomplish this, we proceed
inductively. Suppose that (8.20) holds for given i ∈ {3, . . . , m + 1}, and that
(hm · · · hi )(x−1 (i − 1)) = li = l, with 1 ≤ l ≤ i − 1. Then let

h_{i−1} := (m m−1 m−2 · · · 1)^{l−1} [(1 2)(m m−1 m−2 · · · 1)]^{i−l−1} · (m m−1 m−2 · · · 1)^{m−i+2}.

In words, beginning with hm · · · hi , we repeatedly move the top card to


the bottom until card x−1 (i − 1) has risen to the top; then we repeatedly
transpose and shift until the top m − i + 2 cards, in order, are x−1 (i −
1), . . . , x−1 (m); and finally we cut these m − i + 2 cards to the bottom.
xxx Either revise Section 1.9 of Chapter 7 to delete requirement of
geodesic paths, or explain one can erase cycles.
It follows that the diameter ∆ of the Cayley graph associated with G


satisfies
∆ ≤ Σ_{i=2}^{m+1} [ (l_i − 1) + 2(i − l_i − 1) + (m − i + 2) ] ≤ 3 (m choose 2),

and so by Chapter 7 Corollary yyy:15 that τ_2 ≤ 27 (m choose 2)^2 < (27/4) m^4.
To improve this bound on the relaxation time we compare the chain of
interest to the random transposition chain of Chapter 7 Example yyy:18
and employ Corollary 8.2, or rather its specialization, (yyy:Theorem 14) of
Chapter 7.
xxx Continue as in Chapter 7 Example yyy:20 to get
τ_2/τ̃_2 ≤ 3β^2,  τ̃_2 = m/2,  τ_2 ≤ (27/2) m^3.
xxx Test function on page 2139 of [117] shows this is right order.

Corollary 8.2 can be combined with the inequality


τ̂ ≤ τ_2 ( (1/2) log(1/π_*) + 1 )    (8.21)
from (8.7) to bound the L2 threshold parameter τ̂ for the chain of interest,
but Theorem 8.1 sometimes affords a sharper result. From the Courant–
Fischer “min–max” theorem ([183], Theorem 4.2.11) it follows along the
same lines as in Chapter 3 Section yyy:6.3 that
λ_m^{−1} = inf ρ(h_1, h_2, . . . , h_{m−1}),  m = 2, . . . , n,    (8.22)

where h_1 ≡ 1 and xxx Say the conditions better!

ρ(h_1, h_2, . . . , h_{m−1}) := sup{ ‖g‖_2^2 / E(g, g) : Σ_i π_i h_j(i) g(i) = 0 for j = 1, . . . , m − 1 }

and the inf in (8.22) is taken over all vectors h1 , . . . , hm−1 that are orthogonal
in L2 (π) (or, equivalently, that are linearly independent). Using (8.19),
Corollary 8.2 now generalizes to
Corollary 8.4 (comparison of eigenvalues) In Theorem 8.1, the eigen-
values λm and λ̃m in the respective spectral representations satisfy
λ_m^{−1} ≤ (A/a) λ̃_m^{−1}
with A and a as defined in Corollary 8.2.
Here is a simple example [118] not possessing vertex-transitivity:


xxx NOTE: This is a DIRECT comparison!: see Chapter 3 Section 6.4.

Example 8.5 Random walk on a d-dimensional grid.

To keep the notation simple, we let d = 2 and consider the grid I :=


{0, . . . , m1 − 1} × {0, . . . , m2 − 1} as an (unweighted) subgraph of Z2 . The
eigenvalues λl are not known in closed form. However, if we add self-loops
to make a benchmark graph where I is regular with degree 4, then the
eigenvalues λ̃l for the continuous-time walk are
1 − (1/2) ( cos(πr/m_1) + cos(πs/m_2) ),  0 ≤ r ≤ m_1 − 1, 0 ≤ s ≤ m_2 − 1.
xxx Product chain. Add discussion of all eigenvalues to Section yyy:6.2
of Chapter 4?
xxx P.S. See Chapter 5, (66).
In particular, assuming m1 ≥ m2 we have
τ̃_2 = 2 ( 1 − cos(π/m_1) )^{−1}.    (8.23)
Now we apply Corollary 8.4 to bound the eigenvalues λl . In Theorem 8.1,
the two graphs agree except for self-loops, so

A = w/w̃;

furthermore,

a = min_i (π̃_i/π_i) = (w/w̃) min_i (w̃_i/w_i),

so

A/a = max_i (w_i/w̃_i) ≤ 1.

Thus λ_l^{−1} ≤ λ̃_l^{−1} for 1 ≤ l ≤ n := m_1 m_2; in particular,

τ2 ≤ τ̃2 . (8.24)

Comparing the other way around gives


λ̃_l^{−1} ≤ ( max_i (w̃_i/w_i) ) λ_l^{−1} = 2 λ_l^{−1},  1 ≤ l ≤ n,

and in particular

τ_2 ≥ (1/2) τ̃_2.
The result (1/2) λ̃_l^{−1} ≤ λ_l^{−1} ≤ λ̃_l^{−1} extends to general d, for which (for example)

τ̃_2 = d ( 1 − cos(π/m) )^{−1}

where I = {0, . . . , m_1 − 1} × · · · × {0, . . . , m_d − 1} and m := max_i m_i.
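The eigenvalue formula for the benchmark product chain, and (8.23), can be confirmed numerically; the sketch below (grid sizes are my arbitrary choices) builds the product generator from the one-dimensional path-with-loops chain:

```python
import numpy as np

m1, m2 = 5, 4
def Q_path(m):          # continuized walk on the m-path with end self-loops
    P = np.zeros((m, m))
    for i in range(m):
        P[i, max(i - 1, 0)] += 0.5
        P[i, min(i + 1, m - 1)] += 0.5
    return P - np.eye(m)

# each coordinate moves at rate 1/2, so the grid walk is a slowed 2-fold product
Q = 0.5 * (np.kron(Q_path(m1), np.eye(m2)) + np.kron(np.eye(m1), Q_path(m2)))
ev = np.sort(np.linalg.eigvalsh(-Q))      # -Q is symmetric (uniform pi)
r, s = np.meshgrid(np.arange(m1), np.arange(m2), indexing="ij")
formula = np.sort((1 - 0.5 * (np.cos(np.pi * r / m1)
                              + np.cos(np.pi * s / m2))).ravel())
assert np.allclose(ev, formula, atol=1e-9)
assert np.isclose(1.0 / ev[1], 2 / (1 - np.cos(np.pi / max(m1, m2))))  # (8.23)
```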
Example 8.6 Random walk on a thinned grid.
As a somewhat more interesting example, suppose we modify the grid in Z2
in Example 8.5 by deleting at most one edge from each unit square.
xxx Copy picture on page 700 in [118] as example?
Again we can apply Corollary 8.4, using the same benchmark graph as
in Example 8.5. In Theorem 8.1, w̃xy > 0 for x 6= y if and only if x and y are
neighboring vertices in the (unthinned) grid {0, . . . , m1 −1}×{0, . . . , m2 −1}.
We can choose γxy to have length 1 (if the edge joining x and y has not been
deleted) or 3 (if it has). For any directed edge e in the grid, there are at most
two paths of length 3 and at most one path of length 1 passing through e.
Thus A ≤ 7w/w̃, and so A/a ≤ 7 maxi (wi /w̃i ) ≤ 7; comparing the other
way around is even easier (all paths have length 1), and we find
(1/4) λ̃_l^{−1} ≤ λ_l^{−1} ≤ 7 λ̃_l^{−1},  2 ≤ l ≤ n.

xxx REMINDER: NOTES OR ELSEWHERE?: Mention exclusion pro-


cess [149, 118].
Example 8.7 The n-path with end self-loops.
The comparison technique does not always provide results as sharp as those
in the preceding two examples, even when the two chains are “close.” For
example, let the chain of interest be the n-path, with self-loops added at each
end added to make the graph regular with degree 2, and let the benchmark
graph be the n-cycle (Chapter 5, Example yyy:7). Use of Corollary 8.2
gives only τ_2 ≤ n τ̃_2, whereas in fact τ_2 = ( 1 − cos(π/n) )^{−1} ∼ (2/π^2) n^2 and
τ̃_2 = ( 1 − cos(2π/n) )^{−1} ∼ (1/(2π^2)) n^2.
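The two relaxation times quoted here are easy to check directly (n = 12 is an arbitrary choice of mine):

```python
import numpy as np

n = 12
Pp = np.zeros((n, n))           # n-path with end self-loops (degree 2 everywhere)
for i in range(n):
    Pp[i, max(i - 1, 0)] += 0.5
    Pp[i, min(i + 1, n - 1)] += 0.5
Pc = np.zeros((n, n))           # n-cycle
for i in range(n):
    Pc[i, (i - 1) % n] += 0.5
    Pc[i, (i + 1) % n] += 0.5

def relax(P):                   # tau_2 = 1 / (smallest nonzero eigenvalue of I - P)
    return 1.0 / np.sort(np.linalg.eigvalsh(np.eye(n) - P))[1]

assert np.isclose(relax(Pp), 1 / (1 - np.cos(np.pi / n)))      # path with loops
assert np.isclose(relax(Pc), 1 / (1 - np.cos(2 * np.pi / n)))  # cycle
```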
It is difficult in general to use Corollary 8.4 to improve upon (8.21).
However, when both the chain of interest and the benchmark chain are
symmetric reversible chains (as defined in Chapter 7 Section yyy:1.1), it
follows from Chapter 4 yyy:(14) by averaging over i that
d̂(t) ≤ d̂˜( (a/A) t ),  t ≥ 0,
and hence from (8.1) we obtain
Corollary 8.8 (comparison of L2 mixing times) In Theorem 8.1, if both


the graph of interest and the benchmark graph are vertex-transitive, then the
L2 mixing time parameters τ̂ and τ̂˜ satisfy

τ̂ ≤ (A/a) τ̂˜.
Example 8.9 Returning to the slow card-shuffling scheme of Example 8.3
with random transpositions benchmark, it is known from group representa-
tion methods [112, 122] which make essential use of all the eigenvalues λ̃r ,
not just λ̃2 , that
τ̂˜ ∼ (1/2) m log m as m → ∞.
Since a = 1 and A (= K of Chapter 7, Theorem yyy:14) ≤ 27m2 , it follows
that
τ̂ ≤ (1 + o(1)) (27/2) m^3 log m.    (8.25)

This improves upon Example 8.3, which combines with (8.21) to give only

τ̂ ≤ (1 + o(1)) (27/4) m^4 log m.

xxx Show truth is τ̂ = Θ(m^3 log m)?

8.2 Improved bounds on L2 distance


The central theme of the remainder of this chapter is that norms other than
the L1 norm (and closely related variation distance) and L2 norm can be
used to improve substantially upon the bound

‖P_i(X_t ∈ ·) − π(·)‖_2 ≤ π_i^{−1/2} e^{−t/τ_2}.    (8.26)

8.2.1 Lq norms and operator norms


Our discussion here of Lq norms will parallel and extend the discussion in
Chapter 2 Section yyy:6.2 of L1 and L2 norms. Given 1 ≤ q ≤ ∞, both
the Lq norm of a function and the Lq norm of a signed measure are defined
with respect to some fixed reference probability distribution π on I, which
for our purposes will be the stationary distribution of some irreducible but
not necessarily reversible chain under consideration. For 1 ≤ q < ∞, the Lq
norm of a function f : I → R is
‖f‖_q := ( Σ_i π_i |f(i)|^q )^{1/q},
and we define the Lq norm of a signed measure ν on I to be the Lq norm of


its density function with respect to π:
‖ν‖_q := ( Σ_j π_j^{1−q} |ν_j|^q )^{1/q}.

For q = ∞, the corresponding definitions are

‖f‖_∞ := max_i |f(i)|

and
‖ν‖_∞ := max_j ( |ν_j|/π_j ).

Any matrix A := (aij : i, j ∈ I) operates on functions f : I → R by


left-multiplication:

(Af)(i) = Σ_j a_ij f(j),    (8.27)

and on signed measures ν by right-multiplication:


(νA)_j = Σ_i ν_i a_ij.    (8.28)

For (8.27), fix 1 ≤ q1 ≤ ∞ and 1 ≤ q2 ≤ ∞ and regard A as a linear


operator mapping Lq1 into Lq2 . The operator norm kAkq1 →q2 is defined by

kAkq1 →q2 := sup{kAf kq2 : kf kq1 = 1}. (8.29)

The sup in (8.29) is always achieved, and there are many equivalent reex-
pressions, including

kAkq1 →q2 = max{kAf kq2 : kf kq1 ≤ 1}


= max{kAf kq2 /kf kq1 : f 6= 0}.

Note also that

kBAkq1 →q3 ≤ kAkq1 →q2 kBkq2 →q3 , 1 ≤ q1 , q2 , q3 ≤ ∞. (8.30)

For (8.28), we may similarly regard A as a linear operator mapping signed


measures ν, measured by kνkq1 , to signed measures νA, measured by kνAkq2 .
The corresponding definition of operator norm, call it k|Ak|q1 →q2 , is then

k|Ak|q1 →q2 := sup{kνAkq2 : kνkq1 = 1}.


A brief calculation shows that

k|Ak|q1 →q2 = kA∗ kq1 →q2 ,

where A∗ is the matrix with (i, j) entry πj aji /πi , that is, A∗ is the adjoint
operator to A (with respect to π).
Our applications in this chapter will all have A = A∗ , so we will not need
to distinguish between the two operator norms. In fact, all our applications
will take A to be either Pt or Pt − E for some t ≥ 0, where

Pt := (pij (t) : i, j ∈ I)

xxx notation Pt found elsewhere in book?


and E = limt→∞ Pt is the transition matrix for the trivial discrete time
chain that jumps in one step to stationarity:

E = (πj : i, j ∈ I),

and where we assume that the chain for (Pt ) is reversible. Note that E
operates on functions essentially as expectation with respect to π:
(Ef)(i) = Σ_j π_j f(j),  i ∈ I.

The effect of E on signed measures is to map ν to (Σ_i ν_i) π, and

Pt E = E = EPt , t ≥ 0. (8.31)

8.2.2 A more general bound on L2 distance


The following preliminary result, a close relative to Chapter 3, Lemmas
yyy:21 and 23, is used frequently enough in the sequel that we isolate it
for reference. It is the simple identity in part (b) that shows why L2 -based
techniques are so useful.

Lemma 8.10 (a) For any function f ,

(d/dt) ‖P_t f‖_2^2 = −2 E(P_t f, P_t f) ≤ −(2/τ_2) var_π P_t f ≤ 0.

(b)
‖P_t − E‖_{2→2} = e^{−t/τ_2},  t ≥ 0.
Proof. (a) Using the backward equations

(d/dt) p_ij(t) = Σ_k q_ik p_kj(t)

we find

(d/dt) (P_t f)(i) = Σ_k q_ik [(P_t f)(k)]

and so

(d/dt) ‖P_t f‖_2^2 = 2 Σ_i Σ_k π_i [(P_t f)(i)] q_ik [(P_t f)(k)]
= −2 E(P_t f, P_t f)    by Chapter 3 yyy:(70)
≤ −(2/τ_2) var_π P_t f    by the extremal characterization of τ_2.

(b) From (a), for any f we have

(d/dt) ‖(P_t − E)f‖_2^2 = (d/dt) ‖P_t(f − Ef)‖_2^2 ≤ −(2/τ_2) ‖(P_t − E)f‖_2^2,

which yields

‖(P_t − E)f‖_2^2 ≤ ‖(P_0 − E)f‖_2^2 e^{−2t/τ_2} = (var_π f) e^{−2t/τ_2} ≤ ‖f‖_2^2 e^{−2t/τ_2}.

Thus ‖P_t − E‖_{2→2} ≤ e^{−t/τ_2}. Taking f to be the eigenvector

f_i := π_i^{−1/2} u_i2,  i ∈ I,

of P_t − E corresponding to eigenvalue exp(−t/τ_2) demonstrates equality and
completes the proof of (b).
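Lemma 8.10(b) can be verified numerically: on L^2(π) the operator norm of P_t − E is the largest singular value of D^{1/2}(P_t − E)D^{−1/2}, where D = diag(π). A sketch with a randomly weighted graph (my own choice of example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = rng.uniform(0.5, 2.0, (n, n)); W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
wi = W.sum(axis=1); pi = wi / wi.sum()
Q = W / wi[:, None]; np.fill_diagonal(Q, -Q.sum(axis=1))
d = np.sqrt(pi)
lam, V = np.linalg.eigh(d[:, None] * Q / d[None, :])   # symmetrized generator
tau2 = -1.0 / lam[-2]

t = 0.9
Pt = (V * np.exp(lam * t)) @ V.T * d[None, :] / d[:, None]
E = np.tile(pi, (n, 1))                   # E_ij = pi_j
M = d[:, None] * (Pt - E) / d[None, :]    # P_t - E as an operator on L^2(pi)
op_norm = np.linalg.svd(M, compute_uv=False)[0]
assert np.isclose(op_norm, np.exp(-t / tau2))          # Lemma 8.10(b)
```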
The key to all further developments in this chapter is the following result.

Lemma 8.11 For an irreducible reversible chain with arbitrary initial dis-
tribution and any s, t ≥ 0,

kP (Xs+t ∈ ·) − π(·)k2 ≤ kP (Xs ∈ ·)k2 kPt − Ek2→2 = kP (Xs ∈ ·)k2 e−t/τ2 .

Proof. The equality is Lemma 8.10(b), and

kP (Xs+t ∈ ·) − π(·)k2 = kP (Xs ∈ ·)(Pt − E)k2 ≤ kP (Xs ∈ ·)k2 kPt − Ek2→2


proves the inequality.


We have already discussed, in Section 8.1, a technique for bounding τ2
when (as is usually the case) it cannot be computed exactly. To utilize
Lemma 8.11, we must also bound kP (Xs ∈ ·)k2 . Since

kP (Xs ∈ ·)k2 = kP (X0 ∈ ·)Ps k2 (8.32)

xxx For NOTES: By Jensen’s inequality (for 1 ≤ q < ∞), any transition
matrix contracts Lq for any 1 ≤ q ≤ ∞.
and each Pt is contractive on L2 , i.e., kPt k2→2 ≤ 1 (this follows, for
example, from Lemma 8.10(a); and note that kPt k2→2 = 1 by considering
constant functions), it follows that

kP (Xs ∈ ·)k2 decreases monotonically to 1 as s ↑ ∞, (8.33)

and the decrease is strictly monotone unless P (X0 ∈ ·) = π(·). From (8.32)
follows

kP (Xs ∈ ·)k2 ≤ kP (X0 ∈ ·)kq∗ kPs kq∗ →2 for any 1 ≤ q ∗ ≤ ∞, (8.34)

and again

kPs kq∗ →2 decreases monotonically to 1 as s ↑ ∞. (8.35)

The norm kPs kq∗ →2 decreases in q ∗ (for fixed s) and is identically 1 when
q ∗ ≥ 2, but in applications we will want to take q ∗ < 2. The following
duality lemma will then often prove useful. Recall that 1 ≤ q, q ∗ ≤ ∞ are
said to be (Hölder-)conjugate exponents if
1/q + 1/q∗ = 1.    (8.36)

Lemma 8.12 For any operator A, let

A∗ = (πj aji /πi : i, j ∈ I)

denote its adjoint with respect to π. Then, for any 1 ≤ q1 , q2 ≤ ∞,

kAkq1 →q2 = kA∗ kq2∗ →q1∗ .

In particular, for a reversible chain and any 1 ≤ q ≤ ∞ and s ≥ 0,

kPs k2→q = kPs kq∗ →2 . (8.37)


Proof. Classical duality for Lq spaces (see, e.g., Chapter 6 in [303])


asserts that, given 1 ≤ q ≤ ∞ and g on I,
‖g‖_{q∗} = max{ |⟨f, g⟩| : ‖f‖_q = 1 },

where

⟨f, g⟩ := Σ_i π_i f(i) g(i).

Thus

‖A*g‖_{q_1∗} = max{ |⟨f, A*g⟩| : ‖f‖_{q_1} = 1 } = max{ |⟨Af, g⟩| : ‖f‖_{q_1} = 1 },

and also

|⟨Af, g⟩| ≤ ‖Af‖_{q_2} ‖g‖_{q_2∗} ≤ ‖A‖_{q_1→q_2} ‖f‖_{q_1} ‖g‖_{q_2∗},

so

‖A*g‖_{q_1∗} ≤ ‖A‖_{q_1→q_2} ‖g‖_{q_2∗}.
Since this is true for every g, we conclude kA∗ kq2∗ →q1∗ ≤ kAkq1 →q2 . Reverse
roles to complete the proof.
As a corollary, if q ∗ = 1 then (8.34) and (8.37) combine to give
kP (Xs ∈ ·)k2 ≤ kPs k1→2 = kPs k2→∞
and then
kP (Xs+t ∈ ·) − π(·)k2 ≤ kPs k2→∞ e−t/τ2
from Lemma 8.11. Thus
√(d̂(2(s + t))) ≤ ‖P_s‖_{2→∞} e^{−t/τ_2}.    (8.38)
Here is a somewhat different derivation of (8.38):
Lemma 8.13 For 0 ≤ s ≤ t,
√(d̂(2t)) = ‖P_t − E‖_{2→∞} ≤ ‖P_s‖_{2→∞} ‖P_{t−s} − E‖_{2→2} = ‖P_s‖_{2→∞} e^{−(t−s)/τ_2}.
Proof. In light of (8.31), (8.30), and Lemma 8.10(b), we need only es-
tablish the first equality. Indeed, kPi (Xt ∈ ·) − π(·)k2 is the L2 norm of the
function (Pi (Xt ∈ ·)/π(·)) − 1 and so equals
 
max{ |Σ_j (p_ij(t) − π_j) f(j)| : ‖f‖_2 = 1 } = max{ |((P_t − E)f)(i)| : ‖f‖_2 = 1 }.
Taking the maximum over i ∈ I we obtain


√(d̂(2t)) = max_i max{ |((P_t − E)f)(i)| : ‖f‖_2 = 1 }
= max{ ‖(P_t − E)f‖_∞ : ‖f‖_2 = 1 }
= ‖P_t − E‖_{2→∞}.

Choosing s = 0 in Lemma 8.11 recaptures (8.26), and choosing s = 0


in Lemma 8.13 likewise recaptures the consequence (8.5) of (8.26). The
central theme for both Nash and log-Sobolev techniques is that one can
improve upon these results by more judicious choice of s.

8.2.3 Exact computation of N (s)


The proof of Lemma 8.13 can also be used to show that
N(s) := ‖P_s‖_{2→∞} = max_i ‖P_i(X_s ∈ ·)‖_2 = max_i √(p_ii(2s)/π_i),    (8.39)

as at (8.9). In those rare instances when the spectral representation is known


explicitly, this gives the formula
xxx Also useful in conjunction with comparison method—see Section 3.
xxx If we can compute this, we can compute d̂(2t) = N^2(t) − 1. But the
point is to test out Lemma 8.13.
N^2(s) = 1 + max_i π_i^{−1} Σ_{m=2}^{n} u_im^2 exp(−2λ_m s),    (8.40)

and the techniques of later sections are not needed to compute N (s). In
particular, in the vertex-transitive case
N^2(s) = 1 + Σ_{m=2}^{n} exp(−2λ_m s).
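Both the identity (8.39) and the spectral formula (8.40) are easy to confirm numerically; in the sketch below (random weighted graph and the value of s are my own choices) the eigenvectors u_m are the columns of D^{−1/2}V, normalized in L^2(π):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
W = rng.uniform(0.5, 2.0, (n, n)); W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
wi = W.sum(axis=1); pi = wi / wi.sum()
Q = W / wi[:, None]; np.fill_diagonal(Q, -Q.sum(axis=1))
d = np.sqrt(pi)
lam, V = np.linalg.eigh(d[:, None] * Q / d[None, :])

def P(t):
    return (V * np.exp(lam * t)) @ V.T * d[None, :] / d[:, None]

s = 0.4
# (8.39): max_i ||P_i(X_s in .)||_2 equals max_i sqrt(p_ii(2s)/pi_i)
lhs = np.sqrt(((P(s) ** 2) / pi[None, :]).sum(axis=1)).max()
rhs = np.sqrt((np.diag(P(2 * s)) / pi).max())
assert np.isclose(lhs, rhs)
# (8.40): N^2(s) = 1 + max_i pi_i^{-1} sum_{m>=2} u_im^2 exp(-2 lambda_m s)
U = V / d[:, None]              # u_m in L^2(pi); the last column is the constant
Nsq = 1 + ((U[:, :-1] ** 2) * np.exp(2 * lam[:-1] * s)).sum(axis=1).max()
assert np.isclose(Nsq, rhs ** 2)
```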

The norm N (s) clearly behaves nicely under the formation of products:

N(s) = N^{(1)}(s) N^{(2)}(s).    (8.41)

Example 8.14 The two-state chain and the d-cube.


For the two-state chain, the results of Chapter 5 Example yyy:4 show
N^2(s) = 1 + ( max(p, q)/min(p, q) ) e^{−2(p+q)s}.
In particular, for the continuized walk on the 2-path,

N^2(s) = 1 + e^{−4s}.

By the extension of (8.41) to higher-order products, we therefore have

N^2(s) = (1 + e^{−4s/d})^d

for the continuized walk on the d-cube. This result is also easily derived from
the results of Chapter 5 Example yyy:15. For d ≥ 2 and t ≥ (1/4) d log(d − 1),
the optimal choice of s in Lemma 8.13 is therefore

s = (1/4) d log(d − 1)

and this leads in straightforward fashion to the bound

τ̂ ≤ (1/4) d (log d + 3).
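The product formula N^2(s) = (1 + e^{−4s/d})^d for the continuized walk on the d-cube can be confirmed by brute force for small d (here d = 4 and s = 0.8, both arbitrary choices of mine):

```python
import numpy as np

d_dim = 4
n = 2 ** d_dim
Q = np.zeros((n, n))
for x in range(n):
    for k in range(d_dim):
        Q[x, x ^ (1 << k)] = 1.0 / d_dim   # flip coordinate k at rate 1/d
np.fill_diagonal(Q, -Q.sum(axis=1))
lam, V = np.linalg.eigh(Q)                 # Q symmetric; pi is uniform
s = 0.8
p00 = ((V[0, :] ** 2) * np.exp(2 * s * lam)).sum()   # p_00(2s)
Nsq = n * p00                              # vertex-transitive: N^2(s) = 2^d p_00(2s)
assert np.isclose(Nsq, (1 + np.exp(-4 * s / d_dim)) ** d_dim)
```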

While this is a significant improvement on the bound [cf. (8.12)]

τ̂ ≤ (1/4)(log 2) d^2 + (1/2) d

obtained by setting s = 0, i.e., obtained using only information about τ2 , it


is not
xxx REWRITE!, in light of corrections to my notes.
as sharp as the upper bound

τ̂ ≤ (1 + o(1)) (1/4) d log d

in (8.13) that will be derived using log-Sobolev techniques.

Example 8.15 The complete graph.

For this graph, the results of Chapter 5 Example yyy:9 show


N^2(s) = 1 + (n − 1) exp( −2ns/(n − 1) ).
It turns out for this example that s = 0 is the optimal choice in Lemma 8.13.
This is not surprising given the sharpness of the bound in (8.7) in this case.
See Example 8.32 below for further details.
Example 8.16 Product random walk on a d-dimensional grid.

Consider again the benchmark product chain (i.e., the “tilde chain”) in
Example 8.5. That chain has relaxation time
τ_2 = d ( 1 − cos(π/m) )^{−1} ≤ (1/2) d m^2,

so choosing s = 0 in Lemma 8.13 gives

τ̂ ≤ ( d/(1 − cos(π/m)) ) ( (1/2) log n + 1 )
  ≤ (1/4) d m^2 (log n + 2).    (8.42)

This bound can be improved using N (·). Indeed, if we first consider


continuized random walk on the m-path with self-loops added at each end,
the stationary distribution is uniform, the eigenvalues are

λl = 1 − cos(π(l − 1)/m), 1 ≤ l ≤ m,

and the eigenvectors are given by

u_il = (2/m)^{1/2} cos( π(l − 1)(i + 1/2)/m ),  0 ≤ i ≤ m − 1, 2 ≤ l ≤ m.

According to (8.40) and simple estimates, for s > 0


N^2(s) − 1 ≤ 2 Σ_{l=2}^{m} exp[ −2s(1 − cos(π(l − 1)/m)) ] ≤ 2 Σ_{l=1}^{m−1} exp(−4sl^2/m^2)

and

Σ_{l=2}^{m−1} exp(−4sl^2/m^2) ≤ ∫_{x=1}^{∞} exp(−4sx^2/m^2) dx
= m (π/(4s))^{1/2} P( Z ≥ 2(2s)^{1/2}/m )
≤ ( (π/4)/(4s/m^2) )^{1/2} exp(−4s/m^2)
≤ (4s/m^2)^{−1/2} exp(−4s/m^2)

when Z is standard normal; in particular, we have used the well-known (xxx:


point to Ross book exercise) bound
P(Z ≥ z) ≤ (1/2) e^{−z^2/2},  z ≥ 0.
Thus
N^2(s) ≤ 1 + 2[ 1 + (4s/m^2)^{−1/2} ] exp(−4s/m^2),  s > 0.
Return now to the “tilde chain” of Example 8.5, and assume for sim-
plicity that m1 = · · · = md = m. Since this chain is a slowed-down d-fold
product of the path chain, it has
N^2(s) ≤ [ 1 + 2( 1 + (4s/(dm^2))^{−1/2} ) exp(−4s/(dm^2)) ]^d,  s > 0.    (8.43)

In particular, since d̂(2t) = N^2(t) − 1, it is now easy to see that

τ̂ ≤ K m^2 d log d = K n^{2/d} d log d    (8.44)

for a universal constant K.


xxx We’ve improved on (8.42), which gave order m2 d2 log m.
xxx When used to bound d(t) at (8.4), the bound (8.44) is “right”: see
Theorem 4.1, p. 481, in [120].
The optimal choice of s in Lemma 8.13 cannot be obtained explicitly
when (8.43) is used to bound N (s) = kPs k2→∞ . For this reason, and for
later purposes, it is useful to use the simpler, but more restricted, bound

N^2(s) ≤ (4dm^2/s)^{d/2} for 0 < s ≤ dm^2/16.    (8.45)


To verify this bound, simply notice that u + 2u e^{−u^2} + 2e^{−u^2} ≤ 4 for 0 < u ≤ 1/2.
When s = dm^2/16, (8.45), and τ_2 ≤ dm^2/2 are used in Lemma 8.13, we find

τ̂ ≤ (3/4) m^2 d^2 ( log 2 + (3/4) d^{−1} ).

xxx Improvement over (8.42) by factor Θ(log m), but display follow-
ing (8.43) shows still off by factor Θ(d/ log d).

8.3 Nash inequalities


xxx For NOTES: Nash vs. Sobolev
A Nash inequality for a chain is an inequality of the form
‖g‖_2^{2 + 1/D} ≤ C [ E(g, g) + (1/T) ‖g‖_2^2 ] ‖g‖_1^{1/D}    (8.46)

that holds for some positive constants C, D, and T and for all functions g.
We connect Nash inequalities to mixing times in Section 8.3.1, and in Sec-
tion 8.3.2 we discuss a comparison method for establishing such inequalities.
8.3.1 Nash inequalities and mixing times


A Nash inequality implies a useful bound on the quantity

N (t) = kPt k2→∞ (8.47)

appearing in the mixing time Lemma 8.13. This norm is continuous in t and
decreases to kEk2→∞ = 1 as t ↑ ∞. Here is the main result:

Theorem 8.17 If the Nash inequality (8.46) holds for a continuous-time


reversible chain, some C, D, T > 0, and all g, then the norm N (s) at (8.47)
satisfies
N(t) ≤ e (DC/t)^D for 0 < t ≤ T.

Proof. First note N(t) = ‖P_t‖_{1→2} by Lemma 8.12. Thus we seek a
bound on h(t) := ‖P_t g‖_2^2 independent of g satisfying ‖g‖_1 = 1; the square
root of such a bound will also bound N(t).

Substituting P_t g for g in (8.46) and utilizing the identity in Lemma 8.10(a)
and the fact that P_t is contractive on L^1, we obtain the differential inequality

h(t)^{1 + 1/(2D)} ≤ C [ −(1/2) h′(t) + (1/T) h(t) ],  t ≥ 0.

Writing

H(t) := [ (1/2) C h(t) e^{−2t/T} ]^{−1/(2D)},

the inequality can be equivalently rearranged to

H′(t) ≥ [ 2D (C/2)^{1 + 1/(2D)} ]^{−1} e^{t/(DT)},  t ≥ 0.

Since H(0) > 0, it follows that

H(t) ≥ T [ 2 (C/2)^{1 + 1/(2D)} ]^{−1} ( e^{t/(DT)} − 1 ),  t ≥ 0,

or equivalently

h(t) ≤ [ (T/C)(1 − e^{−t/(DT)}) ]^{−2D},  t ≥ 0.

But

e^{−t/(DT)} ≤ 1 − (t/T)(1 − e^{−1/D}) for 0 < t ≤ T,
so for these same values of t we have

h(t) ≤ [ (t/C)(1 − e^{−1/D}) ]^{−2D} = e^2 [ (t/C)(e^{1/D} − 1) ]^{−2D} ≤ [ e (DC/t)^D ]^2,

as desired.
We now return to Lemma 8.13 and, for t ≥ T , set s = T . (Indeed,
using the bound on N (s) in Theorem 8.17, this is the optimal choice of s if
T < Dτ2 .) This gives
xxx NOTE: In next theorem, only need conclusion of Theorem 8.17, not
hypothesis!

Theorem 8.18 In Theorem 8.17, if c ≥ 1 and

t ≥ T + τ_2 ( D log(DC/T) + c ),

then √(d̂(2t)) ≤ e^{1−c}; in particular,

τ̂ ≤ T + τ_2 ( D log(DC/T) + 2 ).

The following converse of sorts to Theorem 8.17 will be very useful in


conjunction with the comparison method.

Theorem 8.19 If a continuous-time reversible chain satisfies

N(t) ≤ C t^{−D} for 0 < t ≤ T,

then it satisfies the Nash inequality

‖g‖_2^{2 + 1/D} ≤ C′ [ E(g, g) + (1/(2T)) ‖g‖_2^2 ] ‖g‖_1^{1/D} for all g

with

C′ := 2(1 + 1/(2D)) [ (1 + 2D)^{1/2} C ]^{1/D} ≤ 2^{2 + 1/(2D)} C^{1/D}.

xxx Detail in proof to be filled in (I have notes): Show E(Pt g, Pt g) ↓ as


t ↑, or at least that it’s maximized at t = 0. Stronger of two statements is
equivalent to assertion that kPt f k22 is convex in t.
Proof. As in the proof of Theorem 8.17, we note N(t) = ‖P_t‖_{1→2}. Hence,
for any g and any 0 < t ≤ T,

‖g‖_2^2 = ‖P_t g‖_2^2 − ∫_{s=0}^{t} (d/ds) ‖P_s g‖_2^2 ds
= ‖P_t g‖_2^2 + 2 ∫_{s=0}^{t} E(P_s g, P_s g) ds    by Lemma 8.10(a)
≤ ‖P_t g‖_2^2 + 2 E(g, g) t    xxx see above
≤ 2 E(g, g) t + C^2 t^{−2D} ‖g‖_1^2.

This gives

‖g‖_2^2 ≤ t [ 2 E(g, g) + (1/T) ‖g‖_2^2 ] + C^2 t^{−2D} ‖g‖_1^2

for any t > 0. The righthand side here is convex in t and minimized (for g ≠ 0) at

t = ( 2DC^2 ‖g‖_1^2 / (2 E(g, g) + T^{−1} ‖g‖_2^2) )^{1/(2D+1)}.

Plugging in this value, raising both sides to the power 1 + 1/(2D), and simplifying
yields the desired Nash inequality. The upper bound for C′ is derived with
a little bit of calculus.

8.3.2 The comparison method for bounding N (·)


In Section 8.1 we compared relaxation times for two chains by comparing
Dirichlet forms and variances. The point of this subsection is that the com-
parison can be extended to the norm function N (·) of (8.47) using Nash
inequalities. Then results on N (·) like those in Section 8.2.3 can be used to
bound mixing times for other chains on the same state space.
xxx For NOTES?: Can even use different spaces. New paragraph:
To see how this goes, suppose that a benchmark chain is known to satisfy

Ñ (t) ≤ C̃t−D̃ for 0 < t ≤ T̃ . (8.48)

By Theorem 8.19, it then satisfies a Nash inequality. The L1 - and L2 -


norms appearing in this inequality can be compared in the obvious fashion
[cf. (8.18)] and the Dirichlet forms can be compared as in Theorem 8.1.
This shows that the chain of interest also satisfies a Nash inequality. But
then Theorem 8.17 gives a bound like (8.48) for the chain of interest, and
Theorem 8.18 can then be used to bound the L2 threshold time τ̂ .
Here is the precise result; the details of the proof are left to the reader.
8.3. NASH INEQUALITIES 291

Theorem 8.20 (comparison of bounds on N(·)) If a reversible benchmark chain satisfies

Ñ(t) ≤ C̃ t^{−D̃} for 0 < t ≤ T̃

for constants C̃, D̃, T̃ > 0, then any other reversible chain on the same state space satisfies

N(t) ≤ e(DC/t)^D for 0 < t ≤ T,

where, with a and A as defined in Corollary 8.2, and with

a′ := max_i (π̃_i/π_i),

we set

D = D̃,
C = a^{−(2+1/D)} (a′)^{1/D} A × 2(1 + 1/(2D))[(1 + 2D)^{1/2} C̃]^{1/D}
  ≤ a^{−(2+1/D)} (a′)^{1/D} A × 2^{2+1/(2D)} C̃^{1/D},
T = (2A/(a′)²) T̃.

xxx Must correct this slightly. Works for any A such that Ẽ ≤ AE, not
just minimal one. This is important since we need a lower bound on T but
generally only have an upper bound on Ẽ/E. The same goes for a0 (only
need upper bound on π̃i /πi ): we also need an upper bound on T .

Example 8.21 Random walk on a d-dimensional grid.

As in Example 8.5, consider the continuized walk on the d-dimensional grid I = {0, . . . , m − 1}^d. In Example 8.5 we compared the Dirichlet form and variance for this walk to the d-fold product of random walk on the m-path with end self-loops to obtain

τ₂ ≤ τ̃₂ = d(1 − cos(π/m))^{−1} ≤ (1/2)dm²;  (8.49)

using the simple bound √(d̂(2t)) ≤ π_*^{−1/2} e^{−t/τ₂} ≤ (2n)^{1/2} exp(−2t/(dm²)) we then get

τ̂ ≤ (1/4)m²d[log(2n) + 2],  (8.50)

which is of order m²d² log m. Here we will see how comparing N(·), too, gives a bound of order m²d² log d. In Example 8.43, we will bring log-Sobolev techniques to bear, too, to lower this bound to order m²d log d

xxx which is correct, at least for TV. New paragraph:


Recalling (8.45), we may apply Theorem 8.20 with

D̃ = d/4,  C̃ = (4dm²)^{d/4},  T̃ = dm²/16,

and, from the considerations in Example 8.5,

a ≥ 1/2,  A = 1,  a′ = 2.

xxx See xxx following Theorem 8.20. Same paragraph:

This gives

D = d/4,  C ≤ 2^{6+10/d} dm² ≤ 2^{16} dm²,  T = dm²/32.

Plugging these into Theorem 8.18 yields

τ̂ ≤ (1/8)m²d² log d + ((19/8) log 2)m²d² + (33/32)m²d,  (8.51)

which is ≤ 3m²d² log d for d ≥ 2.


Other variants of the walk, including the thinned-grid walk of Example 8.6, can be handled in a similar fashion.
xxx Do moderate growth and local Poincaré? Probably not, to keep
length manageable. Also, will need to rewrite intro a little, since not doing
∆2 -stuff (in any detail).

8.4 Logarithmic Sobolev inequalities


xxx For NOTES: For history and literature, see ([119], first paragraph and
end of Section 1).
xxx For NOTES: Somewhere mention relaxing to nonreversible chains.

8.4.1 The log-Sobolev time τl


Given a probability distribution π on a finite set I, define
xxx For NOTES: Persi's L(g) is double ours.

L(g) := Σ_i π_i g²(i) log(|g(i)|/‖g‖₂)  (8.52)

for g ≢ 0, recalling ‖g‖₂² = Σ_i π_i g²(i) and using the convention 0 log 0 = 0. By Jensen's inequality,

L(g) ≥ 0, with equality if and only if |g| is constant.



Given a finite, irreducible, reversible Markov chain with stationary distribution π, define the logarithmic Sobolev (or log-Sobolev) time by
xxx For NOTES: Persi's α is 1/(2τl).
xxx Note τl < ∞. (Show?) See also Corollary 8.27.

τl := sup{L(g)/E(g, g) : g ≢ constant}.  (8.53)

Notice the similarity between (8.53) and the extremal characterization of τ₂ (Chapter 3, Theorem yyy:22):

τ₂ = sup{‖g‖₂²/E(g, g) : Σ_i π_i g(i) = 0, g ≢ 0}.

We discuss exact computation of τl in Section 8.4.3, the behavior of τl for product chains in Section 8.4.4, and a comparison method for bounding τl in Section 8.4.5. In Section 8.4.2 we focus on the connection between τl and mixing times. A first such result asserts that the relaxation time does not exceed the log-Sobolev time:

Lemma 8.22 τ₂ ≤ τl.
xxx Remarks about how "usually" equality?
xxx For NOTES: Proof from [302], via [119].
Proof. Given g ≢ constant and ε > 0, let f := 1 + εg. Then, writing ḡ = Σ_i π_i g(i), and with all asymptotics as ε → 0,

log |f|² = 2εg − ε²g² + O(ε³),
log ‖f‖₂² = 2εḡ + ε²‖g‖₂² − 2ε²ḡ² + O(ε³),
log(|f|²/‖f‖₂²) = 2ε(g − ḡ) + ε²(2ḡ² − ‖g‖₂² − g²) + O(ε³).

Also,

f² = 1 + 2εg + ε²g²;

thus

f² log(|f|²/‖f‖₂²) = 2ε(g − ḡ) + ε²(3g² − ‖g‖₂² − 4gḡ + 2ḡ²) + O(ε³)

and so

L(f) = ε²(‖g‖₂² − ḡ²) + O(ε³) = ε² varπ g + O(ε³).

Furthermore, E(f, f) = ε² E(g, g); therefore

τl ≥ L(f)/E(f, f) = varπ g/E(g, g) + O(ε).

Finish by letting ε → 0 and then taking the supremum over g.
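The ε-expansion in this proof can be sanity-checked numerically: for a randomly chosen π and g, L(1 + εg) should agree with ε² varπ g up to an O(ε³) error. The following sketch (arbitrary random parameters, added here for illustration) does this:

```python
import math, random

random.seed(1)
n = 4
w = [random.random() + 0.1 for _ in range(n)]
pi = [x / sum(w) for x in w]
g = [random.uniform(-1, 1) for _ in range(n)]

def L(f):
    # L(f) = sum_i pi_i f(i)^2 log(|f(i)| / ||f||_2), as in (8.52)
    norm2 = math.sqrt(sum(p * x * x for p, x in zip(pi, f)))
    return sum(p * x * x * math.log(abs(x) / norm2) for p, x in zip(pi, f))

gbar = sum(p * x for p, x in zip(pi, g))
var = sum(p * (x - gbar) ** 2 for p, x in zip(pi, g))

eps = 1e-4
f = [1.0 + eps * x for x in g]
# L(1 + eps*g) = eps^2 * var_pi(g) + O(eps^3)
assert abs(L(f) - eps ** 2 * var) < 100 * eps ** 3
```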

8.4.2 τl, mixing times, and hypercontractivity

In this subsection we discuss the connection between the L² threshold time parameter

τ̂ = inf{t > 0 : √(d̂(2t)) = max_i ‖P_i(X_t ∈ ·) − π(·)‖₂ ≤ e^{−1}}  (8.54)

and the log-Sobolev time τl. As in Section 8.3, we again consider the fundamental quantity

N(s) = ‖P_s‖_{2→∞}

arising in the bound on √(d̂(2t)) in Lemma 8.13, and recall from Section 8.3.1 that

N(s) decreases strictly monotonically from π_*^{−1/2} at s = 0 to 1 as s ↑ ∞.

The function N is continuous. It would be nice (especially for use in conjunction with the comparison technique) if we could characterize, in terms of the Dirichlet form E, the value of s, call it s*, such that N(s) equals 2 (say), but such a characterization is not presently available.
xxx For NOTES?: A partial result is Theorem 3.9 in [119], taking q = ∞.

Open Problem 8.23 Characterize s∗ in terms of E.

To carry on along these general lines, it turns out to be somewhat more convenient to substitute use of

‖P(X_t ∈ ·) − π(·)‖₂ ≤ ‖P(X_0 ∈ ·)‖_{q/(q−1)} ‖P_s‖_{2→q} e^{−(t−s)/τ₂},  2 ≤ q < ∞,  (8.55)

an immediate consequence of Lemmas 8.11 and 8.12 and (8.34), for use of Lemma 8.13. The reason is that, like N(s), ‖P_s‖_{2→q} decreases monotonically to 1 as s ↑ ∞; but, unlike N(s), it turns out that

for each q ≥ 2, ‖P_s‖_{2→q} equals 1 for all sufficiently large s.  (8.56)

The property (8.56) is called hypercontractivity, in light of the facts that, for fixed s, P_s is a contraction on L² and ‖P_s‖_{2→q} is increasing in q. Let

s_q := inf{s ≥ 0 : ‖P_s‖_{2→q} ≤ 1} = inf{s : ‖P_s‖_{2→q} = 1};

then s₂ = 0 < s_q for q > 2, and we will see presently that s_q < ∞ for q ≥ 2. The following theorem affords a connection with the log-Sobolev time τl (and hence with the Dirichlet form E).

Theorem 8.24 For any finite, irreducible, reversible chain,

τl = sup_{2<q<∞} 2s_q / log(q − 1).

Proof. The theorem is equivalently rephrased as follows:

‖P_t‖_{2→q} ≤ 1 for all t ≥ 0 and 2 ≤ q < ∞ satisfying e^{2t/u} ≥ q − 1  (8.57)

if and only if u ≥ τl. The proof will make use of the generalization

L_q(g) := Σ_i π_i |g(i)|^q log(|g(i)|/‖g‖_q)

of (8.52). Fixing 0 ≢ g ≥ 0 and u > 0, we will also employ the notation

q(t) := 1 + e^{2t/u},  G(t) := ‖P_t g‖_{q(t)}^{q(t)},  F(t) := ‖P_t g‖_{q(t)} = exp((1/q(t)) log G(t))  (8.58)

for t ≥ 0.
As a preliminary, we compute the derivative of F. To begin, we can proceed as at the start of the proof of Lemma 8.10(a) to derive

G′(t) = −q(t) E(P_t g, (P_t g)^{q(t)−1}) + (q′(t)/q(t)) Eπ[(P_t g)^{q(t)} log((P_t g)^{q(t)})].

Then

F′(t) = F(t) [ G′(t)/(q(t)G(t)) − q′(t) log G(t)/q²(t) ]
      = F(t)^{−(q(t)−1)} [ (q′(t)/q(t)) L_{q(t)}(P_t g) − E(P_t g, (P_t g)^{q(t)−1}) ].  (8.59)

For the first half of the proof we suppose that (8.57) holds and must prove τl ≤ u, that is, we must establish the log-Sobolev inequality

L(g) ≤ u E(g, g) for every g.  (8.60)

To establish (8.60) it is enough to consider 0 ≢ g ≥ 0,
xxx Do we actually use g ≥ 0 here?
since for arbitrary g we have

L(g) = L(|g|) and E(g, g) ≥ E(|g|, |g|).  (8.61)


Plugging the specific formula (8.58) for q(t) into (8.59) and setting t = 0 gives

F′(0) = ‖g‖₂^{−1} (u^{−1} L(g) − E(g, g)).  (8.62)

Moreover, since

F(t) = ‖P_t g‖_{q(t)} ≤ ‖P_t‖_{2→q(t)} ‖g‖₂ ≤ ‖g‖₂ (by (8.57)) = ‖P₀ g‖₂ = F(0),

the (right-hand) derivative of F at 0 must be nonpositive. The inequality (8.60) now follows from (8.62).
For the second half of the proof, we may assume u = τl and must establish (8.57). For g ≥ 0, (8.53) and Lemma 8.25 (to follow) give

L_q(g) = (2/q) L(g^{q/2}) ≤ (2/q) τl E(g^{q/2}, g^{q/2}) ≤ (q τl/(2(q − 1))) E(g, g^{q−1})  (8.63)

for any 1 < q < ∞. With q(t) := 1 + e^{2t/τl}, we have q′(t) = (2/τl)(q(t) − 1), and replacing g by P_t g in (8.63) we obtain

(q′(t)/q(t)) L_{q(t)}(P_t g) − E(P_t g, (P_t g)^{q(t)−1}) ≤ 0.

From (8.59) we then find F′(t) ≤ 0 for all t ≥ 0. Since F(0) = ‖g‖₂, this implies

‖P_t g‖_{q(t)} ≤ ‖g‖₂.  (8.64)

We have assumed g ≥ 0, but (8.64) now extends trivially to general g, and therefore

‖P_t‖_{2→q(t)} ≤ 1.

This gives the desired hypercontractivity assertion (8.57).
Here is the technical Dirichlet form lemma that was used in the proof of
Theorem 8.24.

Lemma 8.25 E(g, g^{q−1}) ≥ (4(q − 1)/q²) E(g^{q/2}, g^{q/2}) for g ≥ 0 and 1 < q < ∞.

xxx Do we somewhere have the following?:

E(f, g) = (1/2) Σ_i Σ_{j≠i} π_i q_ij (f(i) − f(j))(g(i) − g(j)).  (8.65)

Proof. For any 0 ≤ a < b,

((b^{q/2} − a^{q/2})/(b − a))² = ( (q/(2(b − a))) ∫_a^b t^{q/2−1} dt )²
   ≤ (q²/(4(b − a))) ∫_a^b t^{q−2} dt = (q²/(4(q − 1))) (b^{q−1} − a^{q−1})/(b − a).

This shows that

(b^{q−1} − a^{q−1})(b − a) ≥ (4(q − 1)/q²)(b^{q/2} − a^{q/2})²

and the lemma follows easily from this and (8.65).
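The scalar inequality at the heart of this proof can be spot-checked numerically over random 0 ≤ a < b and 1 < q < ∞; the following sketch (added here, not part of the original text) does so:

```python
import random

random.seed(0)

def gap_ok(a, b, q):
    # (b^(q-1) - a^(q-1))(b - a) >= (4(q-1)/q^2)(b^(q/2) - a^(q/2))^2
    lhs = (b ** (q - 1) - a ** (q - 1)) * (b - a)
    rhs = 4.0 * (q - 1) / q ** 2 * (b ** (q / 2.0) - a ** (q / 2.0)) ** 2
    return lhs >= rhs * (1 - 1e-9) - 1e-12

trials = []
for _ in range(10000):
    a = random.uniform(0, 5)
    b = a + random.uniform(0, 5)
    q = random.uniform(1.01, 20)
    trials.append(gap_ok(a, b, q))
assert all(trials)
```

Note that for q = 2 the two sides coincide, so the constant 4(q − 1)/q² cannot be improved uniformly in q.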
Now we are prepared to bound τ̂ in terms of τl.

Theorem 8.26 (a) If c ≥ 0, then for any state i with π_i ≤ e^{−1},

‖P_i(X_t ∈ ·) − π(·)‖₂ ≤ e^{1−c} for t ≥ (1/2)τl log log(1/π_i) + cτ₂.

(b)
τ̂ ≤ (1/2)τl log log(1/π_*) + 2τ₂ ≤ τl((1/2) log log(1/π_*) + 2).

Proof. Part (b) follows immediately from (8.54), part (a), and Lemma 8.22. To prove part (a), we begin with (8.55):

‖P_i(X_t ∈ ·) − π(·)‖₂ ≤ π_i^{−1/q} ‖P_s‖_{2→q} e^{−(t−s)/τ₂}.

As in the second half of the proof of Theorem 8.24, let q = q(s) := 1 + e^{2s/τl}. Then ‖P_s‖_{2→q(s)} ≤ 1. Thus

‖P_i(X_t ∈ ·) − π(·)‖₂ ≤ π_i^{−1/q(s)} e^{−(t−s)/τ₂},  0 ≤ s ≤ t.

Choosing s = (1/2)τl log log(1/π_i) we have q(s) = 1 + log(1/π_i) and thus

‖P_i(X_t ∈ ·) − π(·)‖₂ ≤ exp(1 − (t − s)/τ₂) for t ≥ s.

We have established the upper bound in the following corollary; for the
lower bound, see Corollary 3.11 in [119].

Corollary 8.27
τl ≤ τ̂ ≤ τl((1/2) log log(1/π_*) + 2).

Examples illustrating the improvement Corollary 8.27 affords over the similar result (8.7) in terms of τ₂ are offered in Examples 8.37 and 8.40.

8.4.3 Exact computation of τl


Exact computation of τl is exceptionally difficult—so difficult, in fact, that τl
is known only for a handful of examples. We present some of these examples
in this subsection.

Example 8.28 Trivial two-state chains.

We consider a discrete-time chain on {0, 1} that jumps in one step to stationarity (or, since the value of τl is unaffected by continuization, the corresponding continuized chain). Thus (p00, p01, p10, p11) = (θ, 1 − θ, θ, 1 − θ) with θ = π0 = 1 − π1. We also assume 0 < θ ≤ 1/2. The claim is that

τl = log[(1 − θ)/θ]/(2(1 − 2θ)) if θ ≠ 1/2;  τl = 1 if θ = 1/2.  (8.66)

Note that this is continuous and decreasing for θ ∈ (0, 1/2].


To prove (8.66), we need to show that L(g)/E(g, g) ≤ τ(θ) for every nonconstant g on {0, 1}, where τ(θ) denotes the righthand side of (8.66), with equality for some g0. First suppose θ ≠ 1/2. For the inequality, as at (8.60)–(8.61) we may suppose g ≥ 0 and, by homogeneity,

Eπ g = θg(0) + (1 − θ)g(1) = 1.

We will work in terms of the single variable

x := 1/(g(0) − g(1)),

so that

g(0) = 1 + (1 − θ)/x,  g(1) = 1 − θ/x

and we must consider x ∈ (−∞, −(1 − θ)] ∪ [θ, ∞). We calculate

E(g, g) = θ(1 − θ)(g(0) − g(1))² = θ(1 − θ)/x²,

‖g‖₂² = θ(1 + (1 − θ)/x)² + (1 − θ)(1 − θ/x)²
      = [θ(x + 1 − θ)² + (1 − θ)(x − θ)²]/x² = 1 + θ(1 − θ)/x²,

ℓ(x) := L(g) = θ(1 + (1 − θ)/x)² log|1 + (1 − θ)/x| + (1 − θ)(1 − θ/x)² log|1 − θ/x|
        − (1/2)(1 + θ(1 − θ)/x²) log(1 + θ(1 − θ)/x²)
     = [θ(x + 1 − θ)² log(x + 1 − θ)² + (1 − θ)(x − θ)² log(x − θ)²
        − (x² + θ(1 − θ)) log(x² + θ(1 − θ))]/(2x²),

r(x) := 2θ(1 − θ) L(g)/E(g, g) = 2x² ℓ(x)
     = θ(x + 1 − θ)² log(x + 1 − θ)² + (1 − θ)(x − θ)² log(x − θ)²
       − (x² + θ(1 − θ)) log(x² + θ(1 − θ)).

From here, a straightforward but very tedious calculus exercise shows that r decreases over (−∞, −(1 − θ)], with r(−∞) = 2θ(1 − θ), and that r is strictly unimodal over [θ, ∞), with r(θ) = 0 and r(∞) = 2θ(1 − θ). It follows that r(x) is maximized over (−∞, −(1 − θ)] ∪ [θ, ∞) by taking x to be the unique root to

0 = r′(x) = 4θ(x + (1 − θ)) log(x + (1 − θ)) + 4(1 − θ)(x − θ) log(x − θ) − 2x log(x² + θ(1 − θ))  (8.67)

over (θ, ∞).
There is no hope for solving (8.67) explicitly unless

x² + θ(1 − θ) = (x + 1 − θ)(x − θ),

i.e., x = 2θ(1 − θ)/(1 − 2θ). Fortunately, this is a solution to (8.67), and it falls in (θ, ∞). The corresponding value of r is (θ(1 − θ)/(1 − 2θ)) log((1 − θ)/θ), so (8.66) follows, and we learn furthermore that the function g maximizing L(g)/E(g, g) is g0, with g0(0) = 1/(2θ) and g0(1) = 1/(2(1 − θ)).
For θ = 1/2, the major change is that now r is increasing, rather than unimodal, over [θ, ∞). Thus r_sup = 2θ(1 − θ) = 1/2, and (8.66) again follows.

Example 8.29 Two-state chains.

Now consider any irreducible chain (automatically reversible) on {0, 1}, with stationary distribution π. Without loss of generality we may suppose π0 ≤ π1. We claim that

τl = π1 log(π1/π0)/(2p01(1 − 2π0)) if π0 ≠ 1/2;  τl = 1/(2p01) if π0 = 1/2.


The proof is easy. The functional L(g) depends only on π and so is unchanged from Example 8.28, and the Dirichlet form changes from E(g, g) = π0π1(g(0) − g(1))² in Example 8.28 to E(g, g) = π0 p01 (g(0) − g(1))² here.
Remark. Recall from Chapter 5, Example yyy:4 that τ₂ = 1/(p01 + p10) = π1/p01. It follows that

τl/τ₂ = log(π1/π0)/(2(1 − 2π0)) if 0 < π0 < 1/2;  τl/τ₂ = 1 if π0 = 1/2

is a continuous and decreasing function of π0. In particular, we have equality in Lemma 8.22 for a two-state chain if and only if π0 = 1/2. Moreover, τl/τ₂ ∼ (1/2) log(1/π0) → ∞ as π0 → 0.
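The claimed closed form can be verified numerically. The sketch below (with an arbitrarily chosen chain, and writing the Dirichlet form as E(g,g) = π₀p₀₁(g(0) − g(1))², consistent with the trivial-chain case) checks that the maximizer g₀ from Example 8.28 attains the claimed value and that random test functions do not exceed it:

```python
import math, random

# two-state chain: pi0 = 0.3, p01 chosen freely; p10 fixed by detailed balance
pi0, pi1, p01 = 0.3, 0.7, 0.6
p10 = p01 * pi0 / pi1

def ratio(g0, g1):
    # L(g)/E(g,g) with E(g,g) = pi0 * p01 * (g0 - g1)^2
    n2 = math.sqrt(pi0 * g0 * g0 + pi1 * g1 * g1)
    L = (pi0 * g0 * g0 * math.log(abs(g0) / n2)
         + pi1 * g1 * g1 * math.log(abs(g1) / n2))
    return L / (pi0 * p01 * (g0 - g1) ** 2)

tau_l = pi1 * math.log(pi1 / pi0) / (2 * p01 * (1 - 2 * pi0))

# the maximizer from Example 8.28: g0(0) = 1/(2*pi0), g0(1) = 1/(2*pi1)
assert abs(ratio(1 / (2 * pi0), 1 / (2 * pi1)) - tau_l) < 1e-9

random.seed(2)
for _ in range(1000):
    g0, g1 = random.uniform(0.01, 3), random.uniform(0.01, 3)
    if abs(g0 - g1) > 1e-6:
        assert ratio(g0, g1) <= tau_l + 1e-9
```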

Example 8.30 Trivial chains.

The proof of Lemma 8.22 and the result of Example 8.28 can be combined to prove the following result: For the "trivial" chain with p_ij ≡ π_j, the log-Sobolev time τl is given (when π_* < 1/2) by

τl = log(1/π_* − 1)/(2(1 − 2π_*)).

We omit the details, referring the reader to Theorem 5.1 of [119].


As an immediate corollary, we get a reverse-inequality complement to Lemma 8.22:

Corollary 8.31 For any reversible chain (with π_* < 1/2, which is automatic for n ≥ 3),

τl ≤ τ₂ log(1/π_* − 1)/(2(1 − 2π_*)).

Proof. The result of Example 8.30 can be written

L(g) ≤ (varπ g) log(1/π_* − 1)/(2(1 − 2π_*)),

and varπ g ≤ τ₂ E(g, g) by the extremal characterization of τ₂.



Example 8.32 The complete graph.

It follows readily from Example 8.30 that the continuized walk on the complete graph has

τl = (n − 1) log(n − 1)/(2(n − 2)) ∼ (1/2) log n.

Since τ₂ = (n − 1)/n, equality holds in Corollary 8.31 for this example.
xxx Move the following warning to follow Corollary 8.27, perhaps?
Warning. Although the ratio of the upper bound on τ̂ to lower bound in Corollary 8.27 is smaller than that in (8.7), the upper bound in Corollary 8.27 is sometimes of larger order of magnitude than the upper bound in (8.7). For the complete graph, (8.7) says

(n − 1)/n ≤ τ̂ ≤ ((n − 1)/n)((1/2) log n + 1)

and Corollary 8.27 yields

(1 + o(1))(1/2) log n ≤ τ̂ ≤ (1 + o(1))(1/4)(log n)(log log n),

while, from Chapter 5, yyy:(33) it follows that

τ̂ = (1/2) log n + O(log n / n).

As another example, the product chain development in the next subsection together with Example 8.29 will give τl exactly for the d-cube. On the other hand, the exact value of τl is unknown even for many of the simplest examples in Chapter 5. For instance,

Open Problem 8.33 Calculate τl for the n-cycle (Chapter 5 Example yyy:7) when n ≥ 4.

xxx For NOTES: n = 3 is complete graph K3, covered by Example 8.32. (τl = log 2 for n = 3.)
Notwithstanding Open Problem 8.33, the value of τl is known up to multiplicative constants. Indeed, it is shown in Section 4.2 in [119] that

n²/(4π²) ≤ τl ≤ 25n²/(16π²).

Here is a similar result we will find useful later in dealing with our running example of the grid.

Example 8.34 The m-path with end self-loops.

For this example, discussed above in Example 8.16, we claim

(2/π²) m² ≤ τl ≤ m².

The lower bound is easy, using Lemma 8.22:

τl ≥ τ₂ = (1 − cos(π/m))^{−1} ≥ (2/π²) m².

For the upper bound we use Corollary 8.27 and estimation of τ̂. Indeed, in Example 8.16 it was shown that

d̂(2t) = N²(t) − 1 ≤ [1 + (4t/m²)^{−1/2}] exp(−4t/m²),  t > 0.

Substituting t = m² gives √(d̂(2t)) ≤ √(3/2) e^{−2} < e^{−1}, so τl ≤ τ̂ ≤ m².

xxx P.S. Persi (98/07/02) points out that H. T. Yau showed τl = Θ(n log n)
for random transpositions by combining τl ≥ τ2 (Lemma 8.22) and τl ≤
L(g0 )/E(g0 , g0 ) with g0 = delta function. I have written notes generalizing
and discussing this and will incorporate them into a later version.
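The spectral fact behind the lower bound, τ₂ = (1 − cos(π/m))^{−1} for the m-path with end self-loops, can be confirmed numerically by computing the eigenvalues of the transition matrix directly (a sketch, with m = 20 chosen arbitrarily):

```python
import numpy as np

m = 20
# m-path with end self-loops: p(i, i±1) = 1/2, plus holding 1/2 at each end
P = np.zeros((m, m))
for i in range(m):
    if i > 0:
        P[i, i - 1] = 0.5
    else:
        P[0, 0] += 0.5
    if i < m - 1:
        P[i, i + 1] = 0.5
    else:
        P[m - 1, m - 1] += 0.5

lam = np.sort(np.linalg.eigvalsh(P))[::-1]   # P is symmetric, so eigvalsh is safe
assert abs(lam[0] - 1.0) < 1e-10
assert abs(lam[1] - np.cos(np.pi / m)) < 1e-10

tau2 = 1.0 / (1.0 - lam[1])
assert tau2 >= (2.0 / np.pi ** 2) * m ** 2   # since 1 - cos(x) <= x^2/2
```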

8.4.4 τl and product chains


xxx Remind reader of definition of product chain in continuous time given in Chapter 4 Section yyy:6.2.
xxx Motivate study as providing benchmark chains for comparison method.
xxx Recall from Chapter 4, yyy:(42):

τ₂ = max(τ₂^{(1)}, τ₂^{(2)}).  (8.68)

xxx Product chain has transition rates equal (off diagonal) to

q_{(i1,i2),(j1,j2)} = q^{(1)}_{i1,j1} if i1 ≠ j1 and i2 = j2;
                   = q^{(2)}_{i2,j2} if i1 = j1 and i2 ≠ j2;
                   = 0 otherwise.  (8.69)

xxx Dirichlet form works out very nicely for products:

Lemma 8.35

E(g, g) = Σ_{i2} π^{(2)}_{i2} E^{(1)}(g(·, i2), g(·, i2)) + Σ_{i1} π^{(1)}_{i1} E^{(2)}(g(i1, ·), g(i1, ·)).

Proof. This follows easily from (8.69) and the definition of E in Chapter 3 Section yyy:6.1 (cf. (68)).
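Lemma 8.35 is easy to check numerically: build two small reversible generators, form the product rates (8.69), and compare the product Dirichlet form against the claimed decomposition. The sketch below (random chains, assumed for illustration only) does this:

```python
import numpy as np

rng = np.random.default_rng(0)

def rev_chain(n):
    # reversible rates: pi_i * q_ij = S_ij with S symmetric
    pi = rng.random(n); pi /= pi.sum()
    S = rng.random((n, n)); S = S + S.T; np.fill_diagonal(S, 0.0)
    return pi, S / pi[:, None]

def dirichlet(pi, Q, g):
    n = len(pi)
    return 0.5 * sum(pi[i] * Q[i, j] * (g[i] - g[j]) ** 2
                     for i in range(n) for j in range(n) if i != j)

pi1, Q1 = rev_chain(3)
pi2, Q2 = rev_chain(4)
g = rng.random((3, 4))

# decomposition claimed by Lemma 8.35
decomp = sum(pi2[i2] * dirichlet(pi1, Q1, g[:, i2]) for i2 in range(4)) \
       + sum(pi1[i1] * dirichlet(pi2, Q2, g[i1, :]) for i1 in range(3))

# direct computation on the 12-state product chain built from the rates (8.69)
pi = np.outer(pi1, pi2).ravel()
Qp = np.zeros((12, 12))
for i1 in range(3):
    for i2 in range(4):
        for j1 in range(3):
            if j1 != i1:
                Qp[i1 * 4 + i2, j1 * 4 + i2] = Q1[i1, j1]
        for j2 in range(4):
            if j2 != i2:
                Qp[i1 * 4 + i2, i1 * 4 + j2] = Q2[i2, j2]
direct = dirichlet(pi, Qp, g.ravel())
assert abs(decomp - direct) < 1e-8
```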
The analogue of (8.68) for the log-Sobolev time is also true:
xxx For NOTES?: Can give analogous proof of (8.68): see my notes, page 8.4.24A.

Theorem 8.36 For a continuous-time product chain,

τl = max(τl^{(1)}, τl^{(2)}).

Proof. The keys to the proof are Lemma 8.35 and the following "law of total L-functional." Given a function g ≢ 0 on the product state space I = I1 × I2, define a function G2 ≢ 0 on I2 by

G2(i2) := ‖g(·, i2)‖₂ = (Σ_{i1} π^{(1)}_{i1} g²(i1, i2))^{1/2}.

Then

L(g) = Σ_{i1,i2} π_{i1,i2} g²(i1, i2) [log(|g(i1, i2)|/G2(i2)) + log(G2(i2)/‖g‖₂)]
     = Σ_{i2} π^{(2)}_{i2} L^{(1)}(g(·, i2)) + L^{(2)}(G2),

where we have used

‖G2‖₂² = ‖g‖₂².

Thus, using the extremal characterization (definition) (8.53) of τl^{(1)} and τl^{(2)},

L(g) ≤ τl^{(1)} Σ_{i2} π^{(2)}_{i2} E^{(1)}(g(·, i2), g(·, i2)) + τl^{(2)} E^{(2)}(G2, G2).  (8.70)

But from

|G2(j2) − G2(i2)| = |‖g(·, j2)‖₂ − ‖g(·, i2)‖₂| ≤ ‖g(·, j2) − g(·, i2)‖₂

follows

E^{(2)}(G2, G2) ≤ Σ_{i1} π^{(1)}_{i1} E^{(2)}(g(i1, ·), g(i1, ·)).  (8.71)

From (8.70), (8.71), Lemma 8.35, and the extremal characterization of τl we conclude τl ≤ max(τl^{(1)}, τl^{(2)}). Testing on functions that depend only on one of the two variables shows that τl = max(τl^{(1)}, τl^{(2)}).
Theorem 8.36 extends in the obvious fashion to higher-dimensional prod-
ucts.

Example 8.37 The d-cube.

The continuized walk on the d-cube (Chapter 5, Example yyy:15) is simply the product of d copies of the continuized walk on the 2-path, each run at rate 1/d. Therefore, since the log-Sobolev time for the 2-path equals 1/2 by Example 8.29, the corresponding time for the d-cube is

τl = d/2 = τ₂.

From this and the upper bound in Corollary 8.27 we can deduce

τ̂ ≤ (1/4)d log d + (1 − (1/4) log(1/log 2))d.

As discussed in this chapter's introduction, this bound is remarkably sharp and improves significantly upon the analogous bound that uses only knowledge of τ₂. xxx Recall corrections marked on pages 8.2.11–12 of my notes.

8.4.5 The comparison method for bounding τl


In Section 8.1 we compared relaxation times for two chains by using the extremal characterization and comparing Dirichlet forms and variances. For comparing variances, we used the characterization

varπ g = min_{c∈R} ‖g − c‖₂².

To extend the comparison method to log-Sobolev times, we need the following similar characterization of L.
xxx For NOTES: Cite [181].

Lemma 8.38 The functional L in (8.52) satisfies

L(g) = min_{c>0} Σ_i π_i L(g(i), c),  g ≢ 0,  (8.72)

with

L(g(i), c) := g²(i) log(|g(i)|/c) − (1/2)(g²(i) − c²) ≥ 0.  (8.73)

Proof. We compute

f(c) := 2 Σ_i π_i L(g(i), c^{1/2}) = Eπ(g² log |g|²) − ‖g‖₂² log c − ‖g‖₂² + c,
f′(c) = 1 − c^{−1}‖g‖₂²,  f″(c) = c^{−2}‖g‖₂² > 0.

Thus f is strictly convex and minimized by the choice c = ‖g‖₂², and so

min_{c>0} Σ_i π_i L(g(i), c) = (1/2) min_{c>0} f(c) = (1/2) f(‖g‖₂²) = L(g).

This proves (8.72). Finally, applying the inequality

x log(x/y) − (x − y) ≥ 0 for all x ≥ 0, y > 0

to x = g²(i) and y = c² gives the inequality in (8.73).
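Lemma 8.38 can be checked numerically: the objective Σ_i π_i L(g(i), c) should be minimized at c = ‖g‖₂ with minimum value L(g). A sketch with randomly chosen π and g (parameters assumed for illustration):

```python
import math, random

random.seed(3)
n = 5
w = [random.random() + 0.1 for _ in range(n)]
pi = [x / sum(w) for x in w]
g = [random.uniform(0.1, 2.0) for _ in range(n)]

norm2sq = sum(p * x * x for p, x in zip(pi, g))
L_of_g = sum(p * x * x * math.log(abs(x) / math.sqrt(norm2sq))
             for p, x in zip(pi, g))

def obj(c):
    # sum_i pi_i L(g(i), c) with L(x, c) = x^2 log(|x|/c) - (x^2 - c^2)/2
    return sum(p * (x * x * math.log(abs(x) / c) - 0.5 * (x * x - c * c))
               for p, x in zip(pi, g))

# the minimum is attained at c = ||g||_2 and equals L(g)
c_star = math.sqrt(norm2sq)
assert abs(obj(c_star) - L_of_g) < 1e-12
for _ in range(200):
    c = random.uniform(0.05, 5.0)
    assert obj(c) >= L_of_g - 1e-12
```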


Now it’s easy to see how to compare log-Sobolev times, since, adopting
the notation of Section 8.1, Lemma 8.38 immediately yields the analogue

L(g) ≤ L̃(g) max(πi /π̃i )


i

of (8.18). In the notation of Corollary 8.2, we therefore have

Corollary 8.39 (comparison of log-Sobolev times)


A
τl ≤ τ̃l .
a
Example 8.40 Random walk on a d-dimensional grid.

xxx Remarked in Example 8.34 that τl ≤ m² for m-path with end self-loops.
xxx So by Theorem 8.36, benchmark product chain has τ̃l ≤ dm².
Recalling A ≤ 1 and a ≥ 1/2 from Example 8.5, we therefore find

τl ≤ 2m²d  (8.74)

for random walk on the grid. Then Theorem 8.26(b) gives

τ̂ ≤ m²d(log log(2n) + 4),

which is of order m²d(log d + log log m). This is an improvement on the τ₂-only bound O(m²d² log m) of (8.50) and may be compared with the Nash-based bound O(m²d² log d) of (8.51). In Example 8.43 we will combine Nash-inequality and log-Sobolev techniques to get a bound of order m²d log d
xxx right for TV.

8.5 Combining the techniques


To get the maximum power out of the techniques of this chapter, it is sometimes necessary to combine the various techniques. Before proceeding to a general result in this direction, we record a simple fact. Recall (8.36).

Lemma 8.41 If q and q* are conjugate exponents with 2 ≤ q ≤ ∞, then

‖f‖_{q*} ≤ ‖f‖₁^{1−2/q} ‖f‖₂^{2/q} for all f.

Proof. Apply Hölder's inequality

‖gh‖₁ ≤ ‖g‖_p ‖h‖_{p*}

with

g = |f|^{(q−2)/(q−1)},  h = |f|^{2/(q−1)},  p = (q − 1)/(q − 2).
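Lemma 8.41 is a weighted-norm interpolation inequality and is easy to spot-check numerically, here with π-weighted norms ‖f‖_p = (Σ_i π_i |f(i)|^p)^{1/p} and randomly chosen f and q (a sketch, not part of the text):

```python
import random

random.seed(4)
n = 6
w = [random.random() + 0.1 for _ in range(n)]
pi = [x / sum(w) for x in w]

def norm(f, p):
    # pi-weighted L^p norm
    return sum(pp * abs(x) ** p for pp, x in zip(pi, f)) ** (1.0 / p)

for _ in range(500):
    f = [random.uniform(-3, 3) for _ in range(n)]
    q = random.uniform(2.0, 40.0)
    qstar = q / (q - 1.0)   # conjugate exponent
    bound = norm(f, 1) ** (1 - 2 / q) * norm(f, 2) ** (2 / q)
    assert norm(f, qstar) <= bound * (1 + 1e-9)
```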
Theorem 8.42 Suppose that a continuous-time reversible chain satisfies

N(t) ≤ Ct^{−D} for 0 < t ≤ T  (8.75)

for some constants C, T, D satisfying CT^{−D} ≥ e. If c ≥ 0, then

√(d̂(2t)) = max_i ‖P_i(X_t ∈ ·) − π(·)‖₂ ≤ e^{2−c}

for

t ≥ T + (1/2)τl log[log(CT^{−D}) − 1] + cτ₂,

where τ₂ is the relaxation time and τl is the log-Sobolev time.

Proof. From Lemma 8.11 and a slight extension of (8.34), for any s, t, u ≥ 0 and any initial distribution we have

‖P(X_{s+t+u} ∈ ·) − π(·)‖₂ ≤ ‖P(X_s ∈ ·)‖_{q*} ‖P_t‖_{q*→2} e^{−u/τ₂}

for any 1 ≤ q* ≤ ∞. Choose q = q(t) = 1 + e^{2t/τl} and q* to be its conjugate. Then, as in the proof of Theorem 8.26(a),

‖P_t‖_{q*→2} = ‖P_t‖_{2→q} ≤ 1.

According to Lemma 8.41, (8.39), and (8.75), if 0 < s ≤ T then

‖P(X_s ∈ ·)‖_{q*} ≤ ‖P(X_s ∈ ·)‖₂^{2/q} ≤ N(s)^{2/q} ≤ (Cs^{−D})^{2/q}.

Now choose s = T. Combining everything so far,

√(d̂(2(T + t + u))) ≤ (CT^{−D})^{2/q(t)} e^{−u/τ₂} for t, u ≥ 0.

The final idea is to choose t so that the first factor is bounded by e². From the formula for q(t), the smallest such t is

(1/2) τl log[log(CT^{−D}) − 1].

With this choice, the theorem follows readily.

Example 8.43 Random walk on a d-dimensional grid.

Return one last time to the walk of interest in Example 8.5. Example 8.21 showed that (8.75) holds with

D = d/4,  C = e(2^{14} d² m²)^{d/4} = e(2^7 dm)^{d/2},  T = dm²/32.

Also recall τ₂ ≤ (1/2)dm² from (8.49) and τl ≤ 2dm² from Example 8.40. Plugging these into Theorem 8.42 with c = 2 yields

τ̂ ≤ (49/32) m²d log[(1/4) d log d + (19/4) d log 2], which is ≤ 5m²d log d for d ≥ 2.

xxx Finally of right order of magnitude.

8.6 Notes on Chapter 8


xxx The chapter notes go here. Currently, they are interspersed throughout
the text.
xxx Also cite and plug [304].
Chapter 9

A Second Look at General Markov Chains (April 21, 1995)

In the spirit of Chapter 2, this is an unsystematic treatment of scattered


topics which are related to topics discussed for reversible chains, but where
reversibility plays no essential role. Section 9.1 treats constructions of stop-
ping times with various optimality properties. Section 9.2 discusses random
spanning trees associated with Markov chains, the probabilistic elaboration
of “the matrix-tree theorem”. Section 9.3 discusses self-verifying algorithms
for sampling from a stationary distribution. Section 9.4 discusses “reversib-
lizations” of irreversible chains. Section 9.5 gives an example to show that
the nonasymptotic interpretation of relaxation time, so useful in the re-
versible setting, may fail completely in the general case. At first sight these
topics may seem entirely unrelated, but we shall see a few subtle connections.
Throughout the chapter, our setting is a finite irreducible discrete-time
Markov chain (Xn ) with transition matrix P = (pij ).

9.1 Minimal constructions and mixing times


Chapter 4 Theorem yyy involved three mixing time parameters: τ₁ related to variation distance to stationarity, τ₁^{(1)} related to "separation" from stationarity, and τ₁^{(2)} related to stationary times (see below). In Chapter 4 these parameters were defined under worst-case initial distributions, and our focus was on "equivalence" of these parameters for reversible chains. Here


we discuss underlying "exact" results. Fix an initial distribution µ. Then associated with each notion of mixing, there is a corresponding construction of a minimal random time T, stated in Theorems 9.1 - 9.3 below.
xxx randomized stopping times
Call a stopping time T a strong stationary time if
Pµ (Xt = j, T = t) = πj Pµ (T = t) for all j, t (9.1)
i.e. if XT has distribution π and is independent of T . Call a stopping time
T a stationary time if
Pµ (XT = j) = πj for all j. (9.2)
Call a random time T a coupling time if we can construct a joint distribution ((Xt, Yt); t ≥ 0) such that (Xt) is the chain with initial distribution µ, (Yt) is the stationary chain, and Xt = Yt, t ≥ T. (A coupling time need not be a stopping time, even w.r.t. the joint process; this is almost the only instance of a random time which is not a stopping time that we encounter in this book.)
Recall from yyy the notion of separation of θ from π:
sep(θ) ≡ min{u : θj ≥ (1 − u)πj ∀j}.
Write sepµ (t) for the separation at time t when the initial distribution was
µ:
sepµ (t) = min{u : Pµ (Xt = j) ≥ (1 − u)πj ∀j}.
Similarly write vdµ (t) for the variation distance from stationarity at time t:
1X
vdµ (t) = |Pµ (Xt = j) − πj |.
2 j

Theorem 9.1 Let T be any strong stationary time for the µ-chain. Then
sepµ (t) ≤ Pµ (T > t) for all t ≥ 0. (9.3)
Moreover there exists a minimal strong stationary time T for which
sepµ (t) = Pµ (T > t) for all t ≥ 0. (9.4)

Theorem 9.2 For any coupling time T ,


vdµ (t) ≤ Pµ (T > t) for all t ≥ 0.
Moreover there exists a minimal coupling time T for which
vdµ (t) = Pµ (T > t) for all t ≥ 0.

Theorem 9.3 For any stationary time T ,

Eµ T ≥ max(Eµ Tj − Eπ Tj ). (9.5)
j

Moreover there exist mean-minimal stationary times T for which

Eµ T = max(Eµ Tj − Eπ Tj ). (9.6)
j

In each case, the first assertion is immediate from the definitions, and
the issue is to carry out a construction of the required T . Despite the
similar appearance of the results, attempts to place them all in a common
framework have not been fruitful. We will prove Theorems 9.1 and 9.3
below, and illustrate with examples. These two proofs involve only rather
simple “greedy” constructions. We won’t give the proof of Theorem 9.2
(the construction is usually called the maximal coupling: see Lindvall [233])
because the construction is a little more elaborate and the existence of the
minimal coupling time is seldom useful, but on the other hand the coupling
inequality in Theorem 9.2 will be used extensively in Chapter 14. In the
context of Theorems 9.1 and 9.2 the minimal times T are clearly unique
in distribution, but in Theorem 9.3 there will generically be many mean-
minimal stationary times T with different distributions.

9.1.1 Strong stationary times


For any stopping time T , define

θj (t) = Pµ (Xt = j, T ≥ t), σj (t) = Pµ (Xt = j, T = t). (9.7)

Clearly these vectors satisfy

0 ≤ σ(t) ≤ θ(t), (θ(t) − σ(t))P = θ(t + 1) ∀t; θ(0) = µ. (9.8)

Conversely, given (θ(t), σ(t); t ≥ 0) satisfying (9.8), we can construct a randomized stopping time T satisfying (9.7) by declaring that P(T = t|Xt = j, T ≥ t, Xs, s < t) = σj(t)/θj(t). The proofs of Theorems 9.1 and 9.3 use different definitions of vectors satisfying (9.8).
Proof of Theorem 9.1. A particular sequence (θ(t), σ(t); t ≥ 0) can be specified inductively by (9.8) and

σ(t) = r_t π, where r_t = min_j θ_j(t)/π_j.  (9.9)

The associated stopping time satisfies

Pµ(Xt = j, T = t) = σj(t) = r_t πj

and so is a strong stationary time with Pµ(T = t) = r_t. One can now verify inductively that

Pµ(Xt ∈ ·) = θ(t) + Pµ(T ≤ t − 1) · π

and so the separation is

sepµ(t) = 1 − min_j Pµ(Xt = j)/πj = Pµ(T ≥ t) − r_t = Pµ(T > t).
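The greedy construction (9.9) and the resulting identity sepµ(t) = Pµ(T > t) can be implemented and checked directly on a small chain. The sketch below (random 4-state chain, assumed for illustration) tracks θ(t), the stopping probabilities r_t, and the separation at each step:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)

# stationary distribution: left eigenvector of P for eigenvalue 1
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

mu = np.zeros(n); mu[0] = 1.0   # start at state 0

theta = mu.copy()    # theta_j(t) = P_mu(X_t = j, T >= t)
stopped = 0.0        # P_mu(T <= t-1)
dist = mu.copy()     # P_mu(X_t in .)
for t in range(200):
    r = (theta / pi).min()          # (9.9): sigma(t) = r_t * pi
    sep = 1.0 - (dist / pi).min()   # separation at time t
    tail = 1.0 - stopped - r        # P_mu(T > t)
    assert abs(sep - tail) < 1e-10
    stopped += r
    theta = (theta - r * pi) @ P    # (9.8)
    dist = dist @ P
```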

9.1.2 Stopping times attaining a specified distribution


For comparison with the other two results, we stated Theorem 9.3 in terms of stopping times at which the stationary distribution is attained, but the underlying result (amplified as Theorem 9.4) holds for an arbitrary target distribution ρ. So fix ρ as well as the initial distribution µ. Call a stopping time T admissible if Pµ(XT ∈ ·) = ρ. Write t̄(µ, ρ) for the inf of EµT over all admissible stopping times T.

Theorem 9.4 (a) t̄(µ, ρ) = max_j (EµTj − EρTj).

(b) The "filling scheme" below defines an admissible stopping time such that EµT = t̄(µ, ρ).
(c) Any admissible stopping time T with the property

∃ k such that Pµ(T ≤ Tk) = 1  (9.10)

satisfies EµT = t̄(µ, ρ).

Part (c) is rather remarkable, and can be rephrased as follows. Call a state
k with property (9.10) a halting state for the stopping time T . In words, the
chain must stop if and when it hits a halting state. Then part (c) asserts
that, to verify that an admissible time T attains the minimum t̄(µ, ρ), it
suffices to show that there exists some halting state. In the next section we
shall see this is very useful in simple examples.
Proof. The greedy construction used here is called a filling scheme.
Recall from (9.7) the definitions

θj (t) = Pµ (Xt = j, T ≥ t), σj (t) = Pµ (Xt = j, T = t).



Write also Σj(t) = Pµ(XT = j, T ≤ t). We now define (θ(t), σ(t); t ≥ 0) and the associated stopping time T̄ inductively via (9.8) and

σj(t) = 0 if Σj(t − 1) = ρj
      = θj(t) if Σj(t − 1) + θj(t) ≤ ρj
      = ρj − Σj(t − 1) otherwise.

In words, we stop at the current state (j, say) provided our "quota" ρj for the chance of stopping at j has not yet been filled. Clearly

Σj(t) ≤ ρj ∀j ∀t.  (9.11)

We now claim that T̄ satisfies property (9.10). To see this, consider

tj ≡ min{t : Σj(t) = ρj} ≤ ∞.

Then (9.10) holds by construction for any k such that tk = max_j tj ≤ ∞. In particular, T̄ ≤ Tk < ∞ a.s., and then by (9.11) Pµ(X_{T̄} ∈ ·) = lim_{t→∞} Σ(t) = ρ. So T̄ is an admissible stopping time.
Remark. Generically we expect tj = ∞ for exactly one state j, though
other possibilities may occur, e.g. in the presence of symmetry.
Now consider an arbitrary admissible stopping time T , and consider the
associated occupation measure x = (xj ):

xj ≡ Eµ (number of visits to j during times 0, 1, . . . , T − 1).

We shall show

xj + ρj = µj + Σi xi pij ∀j. (9.12)

Indeed, by counting the number of visits during 0, 1, . . . , T − 1, T in two
ways,

xj + ρj = µj + Eµ (number of visits to j during 1, 2, . . . , T ).

Chapter 2 Lemma yyy showed the (intuitively obvious) fact

xi pij = Eµ (number of transitions i → j starting before time T ).

So summing over i,

Σi xi pij = Eµ (number of visits to j during 1, 2, . . . , T )
314CHAPTER 9. A SECOND LOOK AT GENERAL MARKOV CHAINS (APRIL 21, 1995)

and (9.12) follows.


Write x̄ for the occupation measure associated with the stopping time
T̄ produced by the filling scheme. By (9.10), mink x̄k = 0. If x and x′
are solutions of (9.12) then the difference d = x − x′ satisfies d = dP and
so is a multiple of the stationary distribution π. In particular, if x is the
occupation measure for some arbitrary admissible time T , then

x ≥ x̄, with equality iff mink xk = 0.

Since Eµ T = Σi xi , we have established parts (b) and (c) of the theorem,
and

t̄(µ, ρ) = Σi x̄i .
To prove (a), choose a state k such that x̄k = 0, that is such that T̄ ≤ Tk .
Then Eµ Tk = Eµ T̄ + Eρ Tk and hence t̄(µ, ρ) ≤ maxj (Eµ Tj − Eρ Tj ). But for
any admissible stopping time T and any state j

Eµ Tj ≤ Eµ T + Eρ Tj

giving the reverse inequality t̄(µ, ρ) ≥ maxj (Eµ Tj − Eρ Tj ). 2
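In code, the filling scheme is a short dynamic-programming loop: push the surviving mass θ(t) forward, stopping as much mass at each state as its quota ρj allows. Here is a minimal numerical sketch (the 3-state chain, start µ and target ρ are arbitrary illustrative choices, not from the text):

```python
# Numerical sketch of the filling scheme (illustrative only).
# theta[j] tracks theta_j(t) = P_mu(X_t = j, T >= t); Sigma[j] accumulates
# Sigma_j(t) = P_mu(X_T = j, T <= t), capped at the quota rho[j].
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
mu = [1.0, 0.0, 0.0]          # start in state 0
rho = [1/3, 1/3, 1/3]         # target distribution (here uniform)
n = 3

theta = mu[:]
Sigma = [0.0] * n
ET = 0.0                      # accumulates E T = sum_t P(T > t)
for _ in range(200):
    # stop as much mass at each j as the remaining quota allows
    sigma = [min(theta[j], rho[j] - Sigma[j]) for j in range(n)]
    Sigma = [Sigma[j] + sigma[j] for j in range(n)]
    survive = [theta[j] - sigma[j] for j in range(n)]
    ET += sum(survive)        # surviving mass pays one more step
    theta = [sum(survive[i] * P[i][j] for i in range(n)) for j in range(n)]

print(Sigma)   # converges to rho
print(ET)      # converges to t-bar(mu, rho)
```

Part (a) of the theorem could then be checked by comparing ET against maxj (Eµ Tj − Eρ Tj ) computed from the hitting times of the same chain.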

Corollary 9.5 The minimal strong stationary time has mean t̄(µ, π), i.e. is
mean-minimal amongst all not-necessarily-strong stationary times, iff there
exists a state k such that

Pµ (Xt = k)/πk = minj Pµ (Xt = j)/πj ∀t.

Proof. From the construction of the minimal strong stationary time, this is
the condition for k to be a halting state.

9.1.3 Examples
Example 9.6 Patterns in coin-tossing.

Recall Chapter 2 Example yyy: (Xt ) is the chain on the set {H, T }^n of n-
tuples i = (i1 , . . . , in ). Start at some arbitrary initial state j = (j1 , . . . , jn ).
Here the deterministic stopping time T = n is a strong stationary time.
Now a state k = (k1 , . . . , kn ) will be a halting state provided it does not
overlap j, that is provided there is no 1 ≤ u ≤ n such that (ju , . . . , jn ) =
(k1 , . . . , k_{n−u+1} ). But the number of overlapping states is at most 1 + 2 +
2^2 + . . . + 2^{n−1} = 2^n − 1 < 2^n , so there exists a non-overlapping state, i.e. a
halting state. So T attains the minimum t̄(j, π) (= n) over all stationary
times (and not just over all strong stationary times).
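The overlap bound is easy to check by brute force for small n; the start pattern j below is an arbitrary illustrative choice:

```python
from itertools import product

n = 6
j = (1, 0, 1, 1, 0, 1)   # arbitrary start pattern

def overlaps(j, k, n):
    # k overlaps j if some suffix of j equals a prefix of k
    return any(j[u - 1:] == k[:n - u + 1] for u in range(1, n + 1))

bad = [k for k in product((0, 1), repeat=n) if overlaps(j, k, n)]
halting = [k for k in product((0, 1), repeat=n) if not overlaps(j, k, n)]
print(len(bad))       # at most 1 + 2 + ... + 2^(n-1) = 2^n - 1
print(len(halting))   # nonempty, so a halting state exists
```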

Example 9.7 Top-to-random card shuffling.

Consider the following scheme for shuffling an n-card deck: the top card is
removed, and inserted in one of the n possible positions, chosen uniformly at
random. Start in some arbitrary order. Let T be the first time that the card
which was originally second-from-bottom has reached the top of the deck.
Then it is not hard to show (Diaconis [112] p. 177) that T + 1 is a strong
stationary time. Now any configuration in which the originally-bottom card
is the top card will be a halting state, and so T + 1 is mean-minimal over
all stationary times. Here E(T + 1) = 1 + Σ_{m=2}^{n−1} n/m = n(hn − 1).

Example 9.8 The winning streak chain.

In a series of games which you win or lose independently with chance 0 <
c < 1, let X̂t be your current “winning streak”, i.e. the number of games won
since your last loss. For fixed n, the truncated process Xt = min(X̂t , n − 1)
is the Markov chain on states {0, 1, 2, . . . , n−1} with transition probabilities

p(i, 0) = 1 − c, p(i, min(i + 1, n − 1)) = c; 0 ≤ i ≤ n − 1

and stationary distribution

πi = (1 − c)ci , 0 ≤ i ≤ n − 2; πn−1 = cn−1 .

We present this chain, started at 0, as an example where it is easy to see there


are different mean-minimal stationary times T . We’ll leave the simplest
construction until last – can you guess it now? First consider TJ , where J
has the stationary distribution. This is a stationary time, and n − 1 is a
halting state, so it is mean-minimal. Now it is easy to show
E0 Tj = 1/((1 − c)c^j ) − 1/(1 − c), 1 ≤ j ≤ n − 1.

(Slick proof: in the not-truncated chain, Chapter 2 Lemma yyy says

1 = Ej (number of visits to j before T0 ) = πj (Ej T0 + E0 Tj ) = πj (1/(1 − c) + E0 Tj ).)

So

t̄(0, π) = E0 TJ = Σ_{j≥1} πj E0 Tj = n − 2 + π_{n−1}/((1 − c)c^{n−1} ) − (1 − π0 )/(1 − c) = n − 1.

Here is another stopping time T which is easily checked to attain the station-
ary distribution, for the chain started at 0. With chance 1 − c stop at time

0. Otherwise, run the chain until either hitting n − 1 (in which case, stop)
or returning to 0. In the latter case, the return to 0 occurs as a transition
to 0 from some state M ≥ 0. Continue until first hitting M + 1, then stop.
Again n − 1 is a halting state, so this stationary time also is mean-minimal.
Of course, the simplest construction is the deterministic time T = n − 1.
This is a strong stationary time (the winning streak chain is a function of
the patterns in coin tossing chain), and again n − 1 is clearly a halting state.
Thus t̄(0, π) = n − 1 without needing the calculation above.
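The identities above can be confirmed exactly in rational arithmetic; this sketch (with the arbitrary instance n = 5, c = 1/3) solves the hitting-time equations directly and checks both the closed form for E0 Tj and t̄(0, π) = n − 1:

```python
from fractions import Fraction as F

n, c = 5, F(1, 3)            # arbitrary instance
# truncated winning streak chain on {0,...,n-1}
P = [[F(0)] * n for _ in range(n)]
for i in range(n):
    P[i][0] += 1 - c
    P[i][min(i + 1, n - 1)] += c
pi = [(1 - c) * c**i for i in range(n - 1)] + [c**(n - 1)]

def hit_from_zero(j):
    # E_0 T_j: solve (I - Q) e = 1 over the states != j by Gauss-Jordan
    idx = [i for i in range(n) if i != j]
    m = len(idx)
    A = [[(F(1) if r == s else F(0)) - P[idx[r]][idx[s]] for s in range(m)]
         + [F(1)] for r in range(m)]
    for col in range(m):
        piv = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        A[col] = [x / A[col][col] for x in A[col]]
        for r in range(m):
            if r != col and A[r][col] != 0:
                fac = A[r][col]
                A[r] = [a - fac * b for a, b in zip(A[r], A[col])]
    return A[idx.index(0)][m]

# t-bar(0, pi) = sum_j pi_j E_0 T_j, which should equal n - 1 exactly
t_bar = sum(pi[j] * hit_from_zero(j) for j in range(1, n))
print(t_bar)
```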
Remark. One could alternatively use Corollary 9.5 to show that the
strong stationary times in Examples 9.6 and 9.7 are mean-minimal sta-
tionary times. The previous examples are atypical: here is a more typical
example in which the hypothesis of Corollary 9.5 is not satisfied and so no
mean-optimal stationary time is a strong stationary time.

Example 9.9 xxx needs a name!

Chapter 2 Example yyy can be rewritten as follows. Let (Ut ) be independent
uniform on {0, 1, . . . , n−1} and let (At ) be independent events with P (At ) =
a. Define a chain X on {0, 1, . . . , n − 1} by

Xt+1 = Ut+1 on A^c_{t+1}
     = Xt + 1 mod n on A_{t+1}.

The stationary distribution is the uniform distribution. Take X0 = 0.
Clearly T ≡ min{t ≥ 1 : A^c_t occurs} is a strong stationary time, with ET =
1/(1 − a), and it is easy to see that T is the minimal strong stationary time.
But T is not a mean-minimal stationary time. The occupation measure
x associated with T is such that minj xj = xn−1 = a^{n−1} + a^{2n−1} + . . . =
a^{n−1}/(1 − a^n ), and so the occupation measure x̄ associated with a mean-
minimal stationary time is x̄ = x − (n a^{n−1}/(1 − a^n ))π, and so t̄(0, π) =
1/(1 − a) − n a^{n−1}/(1 − a^n ).

9.2 Markov chains and spanning trees


9.2.1 General Chains and Directed Weighted Graphs
Let’s jump into the details and defer the discussion until later. Consider
a finite irreducible discrete-time Markov chain (Xn ) with transition matrix
P = (pvw ), and note we are not assuming reversibility. We can identify P
with a weighted directed graph, which has (for each (v, w) with pvw > 0)
a directed edge (v, w) with weight pvw . A directed spanning tree t is a

spanning tree with one vertex distinguished as the root, and with each edge
e = (v, w) of t regarded as being directed towards the root. Write T for the
set of directed spanning trees. For t ∈ T define
ρ̄(t) ≡ Π_{(v,w)∈t} pvw .

Normalizing gives a probability distribution ρ on T :

ρ(t) ≡ ρ̄(t) / Σ_{t′∈T} ρ̄(t′ ).

Now fix n and consider the stationary Markov chain (Xm : −∞ < m ≤ n)
run from time minus infinity to time n. We now use the chain to construct a
random directed spanning tree Tn . The root of Tn is Xn . For each v ≠ Xn
there was a final time, Lv say, before n that the chain visited v:

Lv ≡ max{m ≤ n : Xm = v}.

Define Tn to consist of the directed edges

(v = XLv , XLv +1 ), v ≠ Xn .

So the edges of Tn are the last-exit edges from each vertex (other than the
root Xn ). It is easy to check that Tn is a directed spanning tree.
Now consider what happens as n changes. Clearly the process (Tn :
−∞ < n < ∞) is a stationary Markov chain on T , with a certain transition
matrix Q = (q(t, t′ )), say. The figure below indicates a typical transition
t → t′ . Here t was constructed by the chain finishing at its root v, and t′ is
the new tree obtained when the chain makes a transition v → w.

[Figure: trees t (rooted at v) and t′ (rooted at w): the transition adds the edge (v, w) and deletes the old last-exit edge out of w.]

Theorem 9.10 (The Markov chain tree theorem) The stationary dis-
tribution of (Tn ) is ρ.

Proof. Fix a directed spanning tree t′ . We have to verify

Σ_t ρ̄(t)q(t, t′ ) = ρ̄(t′ ). (9.13)

Write w for the root of t′ . For each vertex x ≠ w there is a tree tx con-
structed from t′ by adding an edge (w, x) and then deleting from the result-
ing cycle the edge (v, w) (say, for some v = v(x)) leading into w. For x = w
set v(x) = x. It is easy to see that the only possible transitions into t′ are
from the trees tx , and that

ρ̄(tx )/ρ̄(t′ ) = pwx /pvw ; q(tx , t′ ) = pvw .

Thus the left side of (9.13) becomes

Σ_x ρ̄(tx )q(tx , t′ ) = ρ̄(t′ ) Σ_x pwx = ρ̄(t′ ). 2
The underlying chain Xn can be recovered from the tree-valued chain
Tn via Xn = root(Tn ), so we can recover the stationary distribution of X
from the stationary distribution of T , as follows.

Corollary 9.11 (The Markov chain tree formula) For each vertex v
define
π̄(v) ≡ Σ_{t: v=root(t)} ρ̄(t), π(v) ≡ π̄(v) / Σ_w π̄(w).

Then π is the stationary distribution of the original chain (Xn ).

See the Notes for comments on this classical result.
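For a small state space, Corollary 9.11 can be verified by brute-force enumeration of all directed spanning trees; the 3-state non-reversible chain below is an arbitrary illustrative choice:

```python
from itertools import product

P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.4, 0.6, 0.0]]
n = 3

def reaches_root(f, v, root):
    # follow parent pointers; True iff v's path ends at the root (no cycle)
    seen = set()
    while v != root and v not in seen:
        seen.add(v)
        v = f[v]
    return v == root

def pibar(root):
    # sum of rho-bar(t) over all directed spanning trees rooted at `root`
    others = [v for v in range(n) if v != root]
    total = 0.0
    for parents in product(range(n), repeat=len(others)):
        f = dict(zip(others, parents))
        if all(reaches_root(f, v, root) for v in others):
            w = 1.0
            for v in others:
                w *= P[v][f[v]]
            total += w
    return total

weights = [pibar(r) for r in range(n)]
pi_tree = [w / sum(weights) for w in weights]

# compare with the stationary distribution obtained by power iteration
pi = [1.0 / n] * n
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print(pi_tree, pi)
```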


Theorem 9.10 and the definition of T0 come close to specifying an al-
gorithm for constructing a random spanning tree with distribution ρ. Of
course the notion of running the chain from time −∞ until time 0 doesn’t
sound very algorithmic, but we can rephrase this notion using time-reversal.
Regarding the stationary distribution π as known, the time-reversed chain
X ∗ has transition matrix p∗vw ≡ πw pwv /πv . Here is the restatement of The-
orem 9.10 in terms of the time-reversed chain.

Corollary 9.12 Let (X*_m : 0 ≤ m ≤ C) be the time-reversed chain, run
until the cover time C. Define T to be the directed spanning tree with root
X*_0 and with edges (v = X*_{Tv} , X*_{Tv −1} ), v ≠ X*_0 . If X*_0 has distribution π
then T has distribution ρ. If X*_0 is deterministically v0 , say, then T has
distribution ρ conditioned on being rooted at v0 .

Thus T consists of the edges by which each vertex is first visited, directed
backwards.
For a reversible chain, we can of course use the chain itself in Corollary
9.12 above, in place of the time-reversed chain. If the chain is random walk
on an unweighted graph G, then

ρ̄(t) = d(root(t)) Π_v 1/d(v)

where d(v) is the degree of v in G. So ρ̄, restricted to the set of spanning
trees with specified root v0 , is uniform on that set. In this setting, Corollary
9.12 specializes as follows.

Corollary 9.13 Let (Xm : 0 ≤ m ≤ C) be random walk on an unweighted
graph G, started at v0 and run until the cover time C. Define T to be the
directed spanning tree with root v0 and with edges (v = X_{Tv} , X_{Tv −1} ), v ≠ v0 .
Then T is uniform on the set of all directed spanning trees of G rooted at
v0 .

We can rephrase this. If we just want “plain” spanning trees without a


root and directions, then the T above, regarded as a plain spanning tree,
is uniform on the set of all plain spanning trees. On the other hand, if
we want a rooted spanning tree which is uniform on all such trees without
prespecified root, the simplest procedure is to construct T as in Corollary
9.13 with deterministic start v0 , and at the end re-root T at a uniform
random vertex. (This is slightly subtle – we could alternatively start with
X0 uniform, which is typically not the stationary distribution π.)
Using the bounds on cover time developed in Chapter 6, we now have
an algorithm for generating a uniform spanning tree of an n-vertex graph in
O(n^3 ) steps (and O(n^2 ) steps on a regular graph). No other known algorithm
achieves these bounds.
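Corollary 9.13 is precisely this run-until-covered construction (often called the Aldous–Broder algorithm); here is a minimal sketch, using the 4-cycle (whose 4 spanning trees each omit one edge) as an arbitrary test graph:

```python
import random

def random_spanning_tree(adj, v0, rng):
    # walk until the cover time, keeping each vertex's first-entrance edge
    # directed back toward the root v0 (Corollary 9.13)
    parent = {}
    visited = {v0}
    x = v0
    while len(visited) < len(adj):
        y = rng.choice(adj[x])
        if y not in visited:
            visited.add(y)
            parent[y] = x     # first entrance to y was along (x, y)
        x = y
    return parent

# 4-cycle: each of its 4 spanning trees should appear with frequency ~1/4
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = random.Random(0)
counts = {}
for _ in range(4000):
    t = random_spanning_tree(adj, 0, rng)
    key = frozenset(frozenset((v, p)) for v, p in t.items())
    counts[key] = counts.get(key, 0) + 1
print(sorted(counts.values()))
```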

9.2.2 Electrical network theory


The ideas in this subsection (and much more) are treated in a long but very
readable survey paper by Pemantle [279], which we encourage the interested

reader to consult. As observed above, in the reversible setting we have the
obvious simplification that we can construct uniform spanning trees using
the chain itself. Deeper results can be found using the electrical network
analogy. Consider random walk on a weighted graph G. The random span-
ning tree T constructed by Corollary 9.12, interpreted as a “plain” spanning
tree, has distribution

ρ(t) = c Π_{e∈t} we
where c is the normalizing constant. If an edge e is essential, it must be
in every spanning tree, so P (e ∈ T) = 1. If the edge is inessential, the
probability will be strictly between 0 and 1. Intuitively, P (e ∈ T) should
provide a measure of “how nearly essential e is”. This should remind the
reader of the inessential edge inequality (yyy). Interpreting the weighted
graph as an electrical network where an edge e = (v, x) has resistance 1/we ,
the effective resistance rvx between v and x satisfies

rvx ≤ 1/wvx , with equality iff (v, x) is essential.
Proposition 9.14 For each edge (v, x),
P ((v, x) ∈ T) = wvx rvx .
Note that in an n-vertex graph, T has exactly n − 1 edges, so Proposition
9.14 implies Foster’s theorem (Chapter 3 yyy)

Σ_{edges (v,x)} wvx rvx = n − 1.

Proof. Consider the random walk started at v and run until the time U
of the first return to v after the first visit to x. Let p be the chance that
XU −1 = x, i.e. that the return to x is along the edge (x, v). We can calculate
p in two ways. In terms of random walk started at x, p is the chance that
the first visit to v is from x, and so by Corollary 9.12 (applied to the walk
started at x) p = P ((x, v) ∈ T). On the other hand, consider the walk
started at v and let S be the first time that the walk traverses (x, v) in that
direction. Then
ES = EU/p.
But by yyy and yyy
ES = w/wvx , EU = wrvx
and hence p = wvx rvx as required. 2
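Proposition 9.14 can be checked by simulation. On the unweighted 4-cycle, the effective resistance across any edge is (1 × 3)/(1 + 3) = 3/4, so each edge should appear in the random spanning tree with probability 3/4 (an illustrative sketch, not from the text):

```python
import random

adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

def first_entrance_tree(v0, rng):
    # run-until-covered construction; return the tree as undirected edges
    parent, visited, x = {}, {v0}, v0
    while len(visited) < len(adj):
        y = rng.choice(adj[x])
        if y not in visited:
            visited.add(y)
            parent[y] = x
        x = y
    return {frozenset((v, p)) for v, p in parent.items()}

rng = random.Random(1)
N = 5000
hits = sum(frozenset((0, 1)) in first_entrance_tree(0, rng) for _ in range(N))
print(hits / N)   # about w_01 * r_01 = 1 * 3/4
```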
The next result indicates the usefulness of the electrical network analogy.

Proposition 9.15 For any two edges e1 ≠ e2 ,

P (e1 ∈ T, e2 ∈ T) ≤ P (e1 ∈ T)P (e2 ∈ T).

Proof. Consider the “shorted” graph Gshort in which the end-vertices (x1 , x2 )
of e1 are shorted into a single vertex x, with edge-weights wxv = wx1 v +wx2 v .
The natural 1 − 1 correspondence t ↔ t ∪ {e1 } between spanning trees of
Gshort and spanning trees of G containing e1 maps the distribution ρshort to
the conditional distribution ρ(·|e1 ∈ T). So, writing Tshort for the random
spanning tree associated with Gshort ,

P (e2 ∈ Tshort ) = P (e2 ∈ T|e1 ∈ T).

But, setting e2 = (z1 , z2 ), Proposition 9.14 shows

P (e2 ∈ Tshort ) = w_{z1 z2} r^{short}_{z1 z2} , P (e2 ∈ T) = w_{z1 z2} r_{z1 z2} .

By Rayleigh’s monotonicity principle, r^{short}_{z1 z2} ≤ r_{z1 z2} , and the result follows.

9.3 Self-verifying algorithms for sampling from a stationary distribution
To start with an analogy, we can in principle compute a mean hitting time
Ei Tj from the transition matrix P, but we could alternatively estimate Ei Tj
by “pure simulation”: simulate m times the chain started at i and run until
hitting j, and then (roughly speaking) the empirical average of these m
hitting times will be (1 ± O(m^{−1/2} ))Ei Tj . In particular, for fixed ε we
can (roughly speaking) estimate Ei Tj to within a factor (1 ± ε) in O(Ei Tj )
steps. Analogously, consider some notion of mixing time τ (say τ1 or τ2 ,
in the reversible setting). The focus in this book has been on theoretical
methods for bounding τ in terms of P, and of theoretical consequences of
such bounds. But can we bound τ by pure simulation? More importantly,
in the practical context of Markov chain Monte Carlo, can we devise a “self-
verifying” algorithm which produces an approximately-stationary sample
from a chain in O(τ ) steps without having prior knowledge of τ ?
xxx tie up with MCMC discussion.
To say things a little more carefully, a “pure simulation” algorithm is
one in which the transition matrix P is unknown to the algorithm. Instead,
there is a list of the states, and at each step the algorithm can obtain, for any
state i, a sample from the jump distribution p(i, ·), independent of previous
samples.

In the MCMC context we typically have an exponentially large state


space and seek polynomial-time estimates. The next lemma (which we leave
to the reader to state and prove more precisely) shows that no pure simula-
tion algorithm can guarantee to do this.

Lemma 9.16 Consider a pure simulation algorithm which, given any irre-
ducible n-state chain, eventually outputs a random state whose distribution
is guaranteed to be within ε of the stationary distribution in variation dis-
tance. Then the algorithm must take Ω(n) steps for every P.

Outline of proof. If there is a state k with the property that 1 − p(k, k) is


extremely small, then the stationary distribution will be almost concentrated
on k; an algorithm which has some chance of terminating without sampling
a step from every state cannot possibly guarantee that no unvisited state k
has this property. 2

9.3.1 Exact sampling via the Markov chain tree theorem


Lovasz and Winkler [242] observed that the Markov chain tree theorem
(Theorem 9.10) could be used to give a “pure simulation” algorithm for
generating exactly from the stationary distribution of an arbitrary n-state
chain. The algorithm takes

O(τ1* n^2 log n) (9.14)

steps, where τ1* is the mixing time parameter defined as the smallest t such
that

Pi (XUσ = j) ≥ (1/2)πj for all i, j ∈ I, σ ≥ t (9.15)

where Uσ denotes a random time uniform on {0, 1, . . . , σ − 1}, independent
of the chain.
xxx tie up with Chapter 4 discussion and [241].
The following two facts are the mathematical ingredients of the algo-
rithm. We quote as Lemma 9.17(a) a result of Ross [300] (see also [53]
Theorem XIV.37); part (b) is an immediate consequence.

Lemma 9.17 (a) Let π be a probability distribution on I and let (Fi ; i ∈ I)
be independent with distribution π. Fix j, and consider the digraph with
edges {(i, Fi ) : i ≠ j}. Then with probability (exactly) πj , the digraph is a
tree with edges directed toward the root j.
(b) So if j is first chosen uniformly at random from I, then the probability
above is exactly 1/n.

As the second ingredient, observe that the Markov chain tree formula (Corol-
lary 9.11) can be rephrased as follows.
Corollary 9.18 Let π be the stationary distribution for a transition matrix
P on I. Let J be random, uniform on I. Let (ξi ; i ∈ I) be independent, with
P (ξi = j) = pij . Consider the digraph with edges {(i, ξi ) : i ≠ J}. Then,
conditional on the digraph being a tree with edges directed toward the root
J, the probability that J = j equals πj .
So consider the special case of a chain with the property

pij ≥ (1/2)^{1/n} πj ∀i, j. (9.16)
The probability of getting any particular digraph under the procedure of
Corollary 9.18 is at least 1/2 the probability of getting that digraph under
the procedure of Lemma 9.17, and so the probability of getting some tree is
at least 1/2n, by Lemma 9.17(b). So if the procedure of Corollary 9.18 is
repeated r = ⌈2n log 4⌉ times, the chance that some repetition produces a
tree is at least 1 − (1 − 1/2n)^{2n log 4} ≥ 3/4, and then the root J of the tree
has distribution exactly π.
Now for any chain, fix σ > τ1*. The submultiplicativity (yyy) property of
separation, applied to the chain with transition probabilities p̃ij = Pi (XUσ =
j), shows that if V denotes the sum of m independent copies of Uσ , and ξi
is the state reached after V steps of the chain started at i, then

P (ξi = j) ≡ Pi (XV = j) ≥ (1 − 2^{−m} )πj ∀i, j.

So putting m = − log2 (1 − (1/2)^{1/n} ) = Θ(log n), the set of probabilities
(P (ξi = j)) satisfy (9.16).
Combining these procedures, we have (for fixed σ > τ1∗ ) an algorithm
which, in a mean number nmσr = O(σn2 log n) of steps, has chance ≥
3/4 to produce an output, and (if so) the output has distribution exactly
π. Of course we initially don’t know the right σ to use, but we simply
try n, 2n, 4n, 8n, . . . in turn until some output appears, and the mean total
number of steps will satisfy the asserted bound (9.14).
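One round of the Lovasz–Winkler procedure looks as follows in code. For illustration the samples ξi are drawn directly from the rows of a known P (in the real algorithm they come from running the unknown chain for V steps); by Corollary 9.18 the accepted roots should have distribution exactly π:

```python
import random

P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.4, 0.6, 0.0]]
n = 3
rng = random.Random(2)

def sample_row(i):
    # illustrative stand-in: draw xi_i directly from row i of a known P
    u, acc = rng.random(), 0.0
    for j in range(n):
        acc += P[i][j]
        if u < acc:
            return j
    return n - 1

def is_tree(parents, root):
    # do the edges (i, parents[i]), i != root, form a tree directed to root?
    for v in parents:
        seen, w = set(), v
        while w != root and w not in seen:
            seen.add(w)
            w = parents[w]
        if w != root:
            return False
    return True

def one_round():
    J = rng.randrange(n)
    parents = {i: sample_row(i) for i in range(n) if i != J}
    return J if is_tree(parents, J) else None

outputs = []
while len(outputs) < 5000:
    r = one_round()
    if r is not None:
        outputs.append(r)
freq = [outputs.count(j) / 5000 for j in range(n)]

pi = [1.0 / n] * n                  # stationary dist by power iteration
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print(freq, pi)
```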

9.3.2 Approximate sampling via coalescing paths


A second approach involves the parameter τ0 = Σ_j πj Ei Tj arising in the
random target lemma (Chapter 2 yyy). Aldous [18] gives an algorithm
which, given P and ε > 0, outputs a random state ξ for which ||P (ξ ∈
·) − π|| ≤ ε, and such that the mean number of steps is at most

81τ0 /ε^2 . (9.17)

The details are messy, so let us just outline the (simple) underlying idea.
Suppose we can define a procedure which terminates in some random number
Y of steps, where Y is an estimate of τ0 : precisely, suppose that for any P

P (Y ≤ τ0 ) ≤ ε; EY ≤ Kτ0 (9.18)

where K is an absolute constant. We can then define an algorithm as follows.

Simulate Y ; then run the chain for U_{Y /ε} steps and output the
final state ξ

where as above Uσ denotes a random time uniform on {0, 1, . . . , σ − 1},


independent of the chain. This works because, arguing as at xxx,

||P (XUσ ∈ ·) − π|| ≤ τ0 /σ

and so

||P (ξ ∈ ·) − π|| ≤ E min(1, τ0 /(Y /ε)) ≤ 2ε.

And the mean number of steps is (1 + 1/(2ε))EY .
So the issue is to define a procedure terminating in Y steps, where Y
satisfies (9.18). Label the states {1, 2, . . . , n} and consider the following
coalescing paths routine.
(i) Pick a uniform random state J.
(ii) Start the chain at state 1, run until hitting state J, and write A1 for
the set of states visited along the path.
(iii) Restart the chain at state min{j : j ∉ A1 }, run until hitting some
state in A1 , and write A2 for the union of A1 and the set of states visited
by this second path.
(iv) Restart the chain at state min{j : j ∉ A2 }, and continue this
procedure until every state has been visited. Let Y be the total number of
steps.
The random target lemma says that the mean number of steps in (ii)
equals τ0 , making this Y a plausible candidate for a quantity satisfying
(9.18). A slightly more complicated algorithm is in fact needed – see [18].
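A sketch of the coalescing paths routine (states relabelled 0, . . . , n − 1; the toy chain is an arbitrary illustrative choice):

```python
import random

P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.4, 0.6, 0.0]]
n = 3
rng = random.Random(3)

def step(i):
    # one step of the chain from state i
    u, acc = rng.random(), 0.0
    for j in range(n):
        acc += P[i][j]
        if u < acc:
            return j
    return n - 1

def coalescing_paths():
    # returns (Y, visited): total step count and the set of visited states
    J = rng.randrange(n)       # (i) uniform random target
    Y = 0
    A = {0}
    x = 0
    while x != J:              # (ii) path from state 0 until hitting J
        x = step(x)
        Y += 1
        A.add(x)
    while len(A) < n:          # (iii)-(iv) restart from smallest unvisited
        x = min(set(range(n)) - A)
        new = {x}
        while x not in A:      # run until hitting the already-visited set
            x = step(x)
            Y += 1
            new.add(x)
        A |= new
    return Y, A

Y, A = coalescing_paths()
print(Y, A)
```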

9.3.3 Exact sampling via backwards coupling


Write U for a r.v. uniform on [0, 1], and (Ut ) for an independent sequence of
copies of U . Given a probability distribution on I, we can find a (far from
unique!) function f : [0, 1] → I such that f (U ) has the prescribed distribu-
tion. So given a transition matrix P we can find a function f : I × [0, 1] → I

such that P (f (i, U ) = j) = pij . Fix such a function. Simultaneously for
each state i, define

X^{(i)}_0 = i; X^{(i)}_t = f (X^{(i)}_{t−1} , Ut ), t = 1, 2, . . . .

xxx tie up with coupling treatment


Consider the (forwards) coupling time

C* = min{t : X^{(i)}_t = X^{(j)}_t ∀i, j} ≤ ∞.

By considering an initial state j chosen according to the stationary distribution π,

max_i ||Pi (Xt ∈ ·) − π|| ≤ P (C* > t).

This can be used as the basis for an approximate sampling algorithm. As
a simple implementation, repeat k times the procedure defining C*; suppose
we get finite values C1*, . . . , Ck* each time, then run the chain from an arbi-
trary initial start for max1≤j≤k Cj* steps and output the final state ξ. Then
the error ||P (ξ ∈ ·) − π|| is bounded by a function δ(k) such that δ(k) → 0
as k → ∞.
Propp and Wilson [286] observed that by using instead a backwards cou-
pling method (which has been exploited in other contexts – see Notes) one
could make an exact sampling algorithm. Regard our i.i.d. sequence (Ut ) as
defined for −∞ < t ≤ 0. For each state i and each time s < 0 define
X^{(i,s)}_s = i; X^{(i,s)}_t = f (X^{(i,s)}_{t−1} , Ut ), t = s + 1, s + 2, . . . , 0.

Consider the backwards coupling time

C = max{t : X^{(i,t)}_0 = X^{(j,t)}_0 ∀i, j} ≥ −∞.

Lemma 9.19 (Backwards coupling lemma) If S is a random time such
that −∞ < S ≤ C a.s. then the random variable X^{(i,S)}_0 does not depend on
i and has the stationary distribution π.
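A minimal sketch of the backwards-coupling (coupling-from-the-past) algorithm: extend the past horizon by doubling, reusing the same Ut each time, and return the common value at time 0 once all starting states have coalesced. The toy chain is an arbitrary choice; the output has distribution exactly π:

```python
import random

P = [[0.0, 0.7, 0.3],
     [0.5, 0.0, 0.5],
     [0.4, 0.6, 0.0]]
n = 3

def f(i, u):
    # update rule with f(i, U) distributed as row i of P
    acc = 0.0
    for j in range(n):
        acc += P[i][j]
        if u < acc:
            return j
    return n - 1

def cftp(rng):
    us = []                              # us[k] plays the role of U_{-k}
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        states = list(range(n))          # start every state at time -T
        for k in range(T - 1, -1, -1):   # apply U_{-T+1}, ..., U_0
            states = [f(x, us[k]) for x in states]
        if len(set(states)) == 1:
            return states[0]             # coalesced value at time 0
        T *= 2                           # go further into the past, reusing us

rng = random.Random(4)
samples = [cftp(rng) for _ in range(4000)]
freq = [samples.count(j) / 4000 for j in range(n)]

pi = [1.0 / n] * n                       # stationary dist by power iteration
for _ in range(2000):
    pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
print(freq, pi)
```

The crucial design point, per Propp and Wilson, is that the randomness us is fixed once and reused as the horizon grows; redrawing it each time would bias the output.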

xxx describe algorithm


xxx poset story
xxx analysis in general setting and in poset setting.
xxx compare the 3 methods

9.4 Making reversible chains from irreversible chains

Let P be an irreducible transition matrix on I with stationary distribution
π. The following straightforward lemma records several general ways in
which to construct from P a transition matrix Q for which the associated
chain still has stationary distribution π but is reversible. These methods all
involve the time-reversed matrix P*, defined by

πi pij = πj p*ji ,

and so in practice can only be used when we know π explicitly (as we have
observed several times previously, in general we cannot write down a useful
explicit expression for π in the irreversible setting).
Lemma 9.20 The following definitions each give a transition matrix Q
which is reversible with respect to π.
The additive reversiblization: Q(1) = (1/2)(P + P*).
The multiplicative reversiblization: Q(2) = PP*.
The Metropolis reversiblization: Q(3)_ij = min(pij , p*ij ), j ≠ i.

Of these three constructions, only Q(1) is automatically irreducible. Consider
for instance the “patterns in coin tossing” example (Chapter 2 Example
yyy). Here are the distributions of a step of the chains from state (i1 , . . . , in ).
(Q(1) ). To (i2 , . . . , in , 0) or (i2 , . . . , in , 1) or (0, i1 , . . . , in−1 ) or (1, i1 , . . . , in−1 ),
with probability 1/4 each.
(Q(2) ). To (0, i2 , . . . , in ) or (1, i2 , . . . , in ), with probability 1/2 each. So
the state space decomposes into 2-element classes.
(Q(3) ). Here a “typical” i is isolated.
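The three constructions are easy to check in exact arithmetic. The sketch below uses a doubly stochastic (hence π uniform) but non-reversible toy chain, and verifies detailed balance πi qij = πj qji for each of Q(1), Q(2), Q(3); note it uses the Metropolis form min(pij, p*ij), which is the version that satisfies detailed balance:

```python
from fractions import Fraction as F

# doubly stochastic (so pi is uniform) but non-reversible 3-state chain
n = 3
P = [[F(0), F(2, 3), F(1, 3)],
     [F(1, 3), F(0), F(2, 3)],
     [F(2, 3), F(1, 3), F(0)]]
pi = [F(1, 3)] * n
Pstar = [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]

Q1 = [[(P[i][j] + Pstar[i][j]) / 2 for j in range(n)] for i in range(n)]
Q2 = [[sum(P[i][k] * Pstar[k][j] for k in range(n)) for j in range(n)]
      for i in range(n)]
Q3 = [[min(P[i][j], Pstar[i][j]) for j in range(n)] for i in range(n)]
for i in range(n):   # put the mass removed off-diagonal onto the diagonal
    Q3[i][i] = 1 - sum(Q3[i][j] for j in range(n) if j != i)

def detailed_balance(Q):
    return all(pi[i] * Q[i][j] == pi[j] * Q[j][i]
               for i in range(n) for j in range(n))

print([detailed_balance(Q) for Q in (Q1, Q2, Q3)])
```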
We shall discuss two aspects of the relationship between irreversible
chains and their reversibilizations.

9.4.1 Mixing times


Because the theory of L2 convergence to stationarity is nicer for reversible
chains, a natural strategy to study an irreversible chain (transition matrix
P) would be to first study a reversibilization Q and then seek some general
result relating properties of the P-chain to properties of the Q-chain. There
are (see Notes) general results relating spectra, but we don’t pursue these
because (cf. section 9.5) there seems no useful way to derive finite-time
results for irreversible chains from spectral gap estimates.
xxx Persi, Fill etc stuff

9.4.2 Hitting times


Here are a matrix-theoretic result and conjecture, whose probabilistic sig-
nificance (loosely relating to mean hitting times and reversiblization) will
be discussed below. As usual Z is the fundamental matrix associated with
P, and P∗ is the time-reversal.

Proposition 9.21 trace Z(P∗ − P) ≥ 0.

Conjecture 9.22 trace Z2 (P∗ − P) ≥ 0.

Proposition 9.21 is essentially due to Fiedler et al [147]. In fact, what
is proved in ([147], p. 91) is that, for a positive matrix V with largest
eigenvalue < 1,

trace (Σ_{m=1}^∞ V^m )(V − V^T ) ≤ 0. (9.19)

Applying this to vij = s πi^{1/2} pij πj^{−1/2} for s < 1 gives

trace (Σ_{m=0}^∞ s^m (p^{(m)}_ij − πj ))(P − P*) = trace (Σ_{m=0}^∞ s^m P^{(m)} )(P − P*)
= s^{−1} trace (Σ_{m=0}^∞ V^m )(V − V^T ) ≤ 0.

Letting s ↑ 1 gives the Proposition as stated. 2
The proof in [147] of (9.19) has no simple probabilistic interpretation,
and it would be interesting to find a probabilistic proof. It is not clear to
me whether Conjecture 9.22 could be proved in a similar way.
Here is the probabilistic interpretation of Proposition 9.21. Recall the
elementary result (yyy) that in an n-state chain

Σ_a Σ_b πa pab Eb Ta = n − 1. (9.20)

The next result shows that replacing Eb Ta by Ea Tb gives an inequality. This


arose as an ingredient in work of Tetali [325] discussed at xxx.

Corollary 9.23 Σ_a Σ_b πa pab Ea Tb ≤ n − 1.

Proof. We argue backwards. By (9.20), the issue is to prove

Σ_a Σ_b πa pab (Eb Ta − Ea Tb ) ≥ 0.

Using Lemma yyy, the quantity in question equals

Σ_a Σ_b πa pab (Zaa /πa − Zba /πa − Zbb /πb + Zab /πb )

= trace Z − trace PZ − trace Z + trace P*Z = trace (P* − P)Z ≥ 0. 2
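Both Proposition 9.21 and Corollary 9.23 can be checked numerically by computing Z from its defining series Z = Σ_t (P^t − 1π), truncated (the series converges geometrically); the doubly stochastic toy chain below is an arbitrary illustrative choice:

```python
# numerical check of Proposition 9.21 and Corollary 9.23 on a toy chain
n = 3
P = [[0.0, 2/3, 1/3],
     [1/3, 0.0, 2/3],
     [2/3, 1/3, 0.0]]
pi = [1.0 / n] * n            # doubly stochastic => uniform pi

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Z = sum_t (P^t - 1 pi), truncated at 500 terms
Z = [[0.0] * n for _ in range(n)]
Pt = [[float(i == j) for j in range(n)] for i in range(n)]
for _ in range(500):
    for i in range(n):
        for j in range(n):
            Z[i][j] += Pt[i][j] - pi[j]
    Pt = matmul(Pt, P)

Pstar = [[pi[j] * P[j][i] / pi[i] for j in range(n)] for i in range(n)]
D = [[Pstar[i][j] - P[i][j] for j in range(n)] for i in range(n)]
tr = sum(matmul(Z, D)[i][i] for i in range(n))
print(tr)    # Proposition 9.21: trace Z(P* - P) >= 0

# Corollary 9.23, using E_a T_b = (Z_bb - Z_ab)/pi_b
lhs = sum(pi[a] * P[a][b] * (Z[b][b] - Z[a][b]) / pi[b]
          for a in range(n) for b in range(n) if a != b)
print(lhs)   # should be <= n - 1
```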
Here is the motivation for Conjecture 9.22. For 0 ≤ λ ≤ 1 let P(λ) =
(1 − λ)P + λP∗ , so that P(1/2) is the “additive reversiblization” in Lemma
9.20. Consider the average hitting time parameters τ0 = τ0 (λ) from Chapter
4.

Corollary 9.24 Assuming Conjecture 9.22 is true, τ0 (λ) ≤ τ0 (1/2) for all
0 ≤ λ ≤ 1.

In other words, making the chain “more reversible” tends to increase mean
hitting times.
Proof. This depends on results about differentiating with respect to
the transition matrix, which we present as slightly informal calculations.
Introduce a “perturbation” matrix Q such that

Σ_j qij = 0 ∀i; qij = 0 whenever pij = 0. (9.21)

Then P + θQ is a transition matrix for θ in some neighborhood of 0. Write
d/dθ for the derivative at θ = 0. Then, writing Ni (t) for the number of visits
to i before time t,

(d/dθ) Ea Tb = Σ_i Ea Ni (Tb ) Σ_j qij Ej Tb .
This holds because the Σ_j term gives the effect on ETb of a Q-step from i.
Using general identities from Chapter 2 yyy, and (9.21), this becomes

(d/dθ) Ea Tb = Σ_i ( πi (zab − zbb )/πb + zbi − zai ) Σ_j qij zjb /πb .

Now specialize to the case where π is the stationary distribution for each
P + θQ, that is where

Σ_i πi qij = 0 ∀j.

Then the expression above simplifies to

(d/dθ) Ea Tb = Σ_i Σ_j (zbi − zai ) qij zjb /πb .

Averaging over a, using Σ_a πa zai = 0,

(d/dθ) Eπ Tb = Σ_i Σ_j zbi qij zjb /πb

and then averaging over b,

(d/dθ) τ0 = trace ZQZ = trace Z^2 Q.

So consider λ < 1/2 in Corollary 9.24. Then

(d/dλ) τ0 (λ) = trace Z^2 (λ)(P* − P) = (1 − 2λ)^{−1} trace Z^2 (λ)(P*(λ) − P(λ))

and Conjecture 9.22 would imply this is ≥ 0, implying the conclusion of
Corollary 9.24.

9.5 An example concerning eigenvalues and mixing times
Here is an example, adapted from Aldous [11]. Let (λu : 1 ≤ u ≤ n) be the
eigenvalues of P with λ1 = 1, and let

β = max{|λu | : 2 ≤ u ≤ n}.

A weak quantification of “mixing” is provided by

α(t) ≡ max_{A,B} |Pπ (X0 ∈ A, Xt ∈ B) − π(A)π(B)|.

By definition, α(t) is less than the maximal correlation ρ(t) discussed in
Chapter 4 yyy, and so by yyy

α(t) ≤ β^t for a reversible chain. (9.22)

The convergence theorem (Chapter 2 yyy) says that α(t) → 0 as t → ∞
provided β < 1. So one might expect some analog of (9.22) to hold in
general. But this is dramatically false: Example 9.26 shows

Lemma 9.25 There exists a family of n-state chains, with uniform station-
ary distributions, such that supn βn < 1 while inf n αn (n) > 0.

Loosely, this implies there is no reasonable hypothesis on the spectrum of an
n-state chain which implies an o(n) mixing time. There is a time-asymptotic
result

α(t) ≤ ρ(t) ≤ Cβ^t ∀t,
for some C depending on the chain. But implicit claims in the literature
that bounding the spectrum of a general chain has some consequence for
finite-time behavior should be treated with extreme skepticism!

Example 9.26 Let (Yt ) be independent r.v.’s taking values in {0, 1, . . . , n−1}
with distribution specified by

P (Y ≤ j) = (j + 1)/(j + 2), 0 ≤ j ≤ n − 2.

Define a Markov chain (Xt ) on {0, 1, . . . , n − 1} by

Xt = max(Xt−1 − 1, Yt ).

This chain has the property (cf. the “patterns in coin-tossing” chain) of
attaining the stationary distribution in finite time. Precisely: for any initial
distribution σ, the distribution of Xn−1 is uniform, and hence Xt is uniform
for all t ≥ n − 1. To prove this, we simply observe that for 0 ≤ j ≤ n − 1,

Pσ (Xn−1 ≤ j) = P (Yn−1 ≤ j, Yn−2 ≤ j + 1, . . . , Y0 ≤ j + n − 1)
             = (j + 1)/(j + 2) × (j + 2)/(j + 3) × · · · × (n − 1)/n × 1 × · · · × 1
             = (j + 1)/n.
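The finite-time convergence just proved is easy to verify numerically. A minimal sketch (assuming NumPy): build the transition matrix of Xt = max(Xt−1 − 1, Yt ) from the distribution of Y and check that every starting state gives exactly the uniform distribution at time n − 1.

```python
import numpy as np

def transition_matrix(n):
    # P(Y <= j) = (j+1)/(j+2) for j <= n-2, and P(Y <= n-1) = 1.
    cdf_Y = np.array([(j + 1) / (j + 2) for j in range(n - 1)] + [1.0])
    p_Y = np.diff(np.concatenate(([0.0], cdf_Y)))
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # X_t = max(i - 1, Y_t) lands at j > i-1 iff Y = j,
            # and at j = i-1 iff Y <= i-1 (only possible for i >= 1).
            if j > i - 1:
                P[i, j] += p_Y[j]
            if j == i - 1:
                P[i, j] += cdf_Y[i - 1]
    return P

n = 10
P = transition_matrix(n)
for start in range(n):
    dist = np.zeros(n)
    dist[start] = 1.0
    for _ in range(n - 1):
        dist = dist @ P
    assert np.allclose(dist, np.full(n, 1.0 / n))  # exactly uniform at time n-1
```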
If X0 is either 0 or 1 then X1 is distributed as Y1 , implying that the vector
v with vi = 1(i=0) − 1(i=1) is an eigenvector of P with eigenvalue 0. By soft
“duality” arguments it can be shown [11] that this is the largest eigenvalue,
in the sense that
R(λu ) ≤ 0 for all 2 ≤ u ≤ n. (9.23)
I believe it is true that

βn = max{|λu | : 2 ≤ u ≤ n}

is bounded away from 1, but we can avoid proving this by considering the “lazy” chain X̂t with transition matrix P̂ = (I + P)/2, for which by (9.23)

β̂n ≤ sup{ |(1 + λ)/2| : |λ| ≤ 1, R(λ) ≤ 0 } = √(1/2).

So the family of lazy chains has the eigenvalue property asserted in Lemma 9.25. But by construction, Xt ≥ X0 − t, and so P (X0 > 3n/4, X_{n/2} < n/4) = 0. For the lazy chains we get

Pπ (X0 > 3n/4, Xn < n/4) → 0 as n → ∞,

while π{x : x > 3n/4} π{x : x < n/4} → 1/16 > 0, establishing the (non)-mixing property asserted in the lemma.

9.6 Miscellany
9.6.1 Mixing times for irreversible chains
In Chapter 4 yyy we discussed equivalences between different definitions of
“mixing time” in the τ1 family. Lovasz and Winkler [241] give a detailed
treatment of analogous results in the non-reversible case.
xxx state some of this ?

9.6.2 Balanced directed graphs


Any Markov chain can be viewed as random walk on a weighted directed
graph, but even on unweighted digraphs it is hard to relate properties on
the walk to graph-theoretic properties, because (as we have often observed)
it is in general hard to get useful information about the stationary distribu-
tion. An exception is the case of a balanced digraph, i.e. when the in-degree
equals the out-degree (= rv , say) at each vertex v. Random walk on a bal-
anced digraph clearly retains the “undirected” property that the stationary
probabilities πv are proportional to rv . Now the proofs of Theorems yyy and yyy in Chapter 6 extend unchanged to the balanced digraph setting, showing that the cover-and-return time C + satisfies

max_v Ev C + ≤ n³ in general;   max_v Ev C + ≤ 6n² on a regular balanced digraph.

(The proofs rely on the edge-commute inequality (Chapter 3 yyy), rather than any “resistance” property.)
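The fact that πv ∝ rv on a balanced digraph can be checked directly. A small sketch (the particular 4-vertex balanced digraph below is an illustrative assumption, not from the text):

```python
import numpy as np

# Balanced digraph on {0,1,2,3}: directed cycle 0->1->2->3->0 plus 0->2 and 2->0.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (2, 0)]
n = 4
out_deg = [sum(1 for (a, _) in edges if a == v) for v in range(n)]
in_deg = [sum(1 for (_, b) in edges if b == v) for v in range(n)]
assert out_deg == in_deg  # balanced: in-degree = out-degree (= r_v) at each v

# Random walk: from v, follow a uniformly chosen outgoing edge.
P = np.zeros((n, n))
for (a, b) in edges:
    P[a, b] += 1.0 / out_deg[a]

pi = np.array(out_deg, dtype=float) / sum(out_deg)  # candidate: pi_v proportional to r_v
assert np.allclose(pi @ P, pi)  # stationary, as claimed
```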

9.6.3 An absorption time problem


Consider a Markov chain on states {1, 2, . . . , n} for which the only possible
transitions are downward, i.e. for i ≥ 2 we have

p(i, j) = 0, j ≥ i

and p(1, 1) = 1. The chain is ultimately absorbed in state 1. A question


posed by Gil Kalai is whether there is a bound on the mean absorption time
involving a parameter similar to that appearing in Cheeger’s inequality. For
each proper subset A of {1, . . . , n} with 1 ∉ A define

c(A) = |A| |A^c| / ( n Σ_{i∈A} Σ_{j∈A^c} p(i, j) )

and then define

κ = max_A c(A).

Open Problem 9.27 Prove that max_i Ei T1 is bounded by a polynomial function of κ log n.

9.7 Notes on Chapter 9


Section 9.1. The idea of a maximal coupling goes back to Goldstein [169]:
see Lindvall [233] for further history. Strong stationary times were studied
in detail by Diaconis - Fill [115, 114] and Fill [150, 151], with particular at-
tention to the case of one-dimensional stochastically monotone chains where
there is some interesting “duality” theory. The special case of random walks
on groups had previously been studied in Aldous - Diaconis [21, 22], and the
idea is implicit in the regenerative approach to time-asymptotics for gen-
eral state space chains, discussed at xxx. The theory surrounding Theorem
9.4 goes back to Rost [301]. This is normally regarded as part of the po-
tential theory of Markov chains, which emphasizes analogous results in the
transient setting, and the recurrent case is rather a sideline in that setting.
See Revuz [289] sec. 2.5 or Dellacherie - Meyer [108] Chapter 9 sec. 3 for
textbook treatments in the general-space setting. The observation that the
theory applied in simple finite examples such as those in section 9.1.3 was
made in Lovasz - Winkler [241], from whom we borrowed the phrase halt-
ing state. Monotonicity properties like that in the statement of Corollary
9.5 were studied in detail by Brown [73] from the viewpoint of approximate
exponentiality of hitting times.

Section 9.2. A slightly more sophisticated and extensive textbook treat-


ment of these topics is in Lyons [250]. The nomenclature reflects my taste:
Theorem 9.10 is “the underlying theorem” which implies “the formula” for
the stationary distribution in terms of weighted spanning trees. Different
textbooks (e.g. [178] p. 340 xxx more refs) give rather different historical
citations for the Markov chain tree formula, and in talks I often call it “the
most often rediscovered result in probability theory”: it would be an inter-
esting project to track down the earliest explicit statement. Of course it can
be viewed as part of a circle of ideas (including the matrix-tree theorem for the number of spanning trees in a graph) which is often traced back to Kirchhoff. The fact that Theorem 9.10 underlies the formula was undoubtedly
folklore for many years (Diaconis attributes it to Peter Doyle, and indeed
it appears in an undergraduate thesis [317] of one of his students), but was
apparently not published until the paper of Anantharam and Tsoucas [30].
The fact that the Markov chain tree theorem can be interpreted as an algo-
rithm for generating uniform random spanning trees was observed by Aldous
[13] and Broder [65], both deriving from conversations with Diaconis. [13]
initiated study of theoretical properties of uniform random spanning trees,
proving e.g. the following bounds on the diameter ∆ of the random tree in
a regular n-vertex graph.

n^{1/2} / (K1 τ2 log n) ≤ E∆ ≤ K2 τ2^{1/2} n^{1/2} log n    (9.24)

where K1 and K2 are absolute constants. Loosely, “in an expander, a ran-


dom spanning tree has diameter n1/2±o(1) ”. Results on asymptotic Poisson
distribution for the degrees in a random spanning tree are given in Aldous
[13], Pemantle [279] and Pemantle and Burton [83]. Pemantle [278] discusses
the analog of uniform random spanning trees on the infinite d-dimensional
lattice, and Aldous and Larget [23] give simulation results on quantitative
behavior on the d-dimensional torus.
Section 9.2.2. As described in Pemantle [279] and Burton and Pemantle
[83], the key to deeper study of random spanning trees is

Theorem 9.28 (Transfer-impedance theorem) Fix a graph G. There


is a symmetric function H(e1 , e2 ) on pairs of edges in G such that for any
edges (e1 , . . . , er )

P (ei ∈ T for all 1 ≤ i ≤ r) = det M (e1 , . . . , er )

where M (e1 , . . . , er ) is the matrix with entries H(ei , ej ), 1 ≤ i, j ≤ r.
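Theorem 9.28 can be illustrated by brute force on a small graph. In the sketch below (assuming NumPy), one concrete choice of H with unit conductances is the transfer-current matrix H(e, f) = b_e^T L^+ b_f, where L^+ is the pseudoinverse of the graph Laplacian and b_e is the signed incidence vector of edge e; the determinant formula is then checked for pairs of edges against direct enumeration of the 16 spanning trees of K4.

```python
import itertools
import numpy as np

verts = list(range(4))
edges = list(itertools.combinations(verts, 2))  # the 6 edges of K4, oriented low -> high
n, m = 4, len(edges)

B = np.zeros((m, n))  # signed incidence: row of edge (a, b) is delta_a - delta_b
for k, (a, b) in enumerate(edges):
    B[k, a], B[k, b] = 1.0, -1.0
L = B.T @ B  # graph Laplacian
H = B @ np.linalg.pinv(L) @ B.T  # transfer-current matrix H(e, f)

def is_spanning_tree(edge_list):
    parent = list(range(n))  # union-find over vertices
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    for (a, b) in edge_list:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False  # found a cycle
        parent[ra] = rb
    return True  # n-1 acyclic edges necessarily span

trees = [t for t in itertools.combinations(range(m), n - 1)
         if is_spanning_tree([edges[k] for k in t])]
assert len(trees) == 16  # Cayley's formula: 4^(4-2)

# P(e1, e2 in T) = det of the 2x2 submatrix of H, for every pair of edges.
for e1, e2 in itertools.combinations(range(m), 2):
    freq = sum(1 for t in trees if e1 in t and e2 in t) / len(trees)
    det = np.linalg.det(H[np.ix_([e1, e2], [e1, e2])])
    assert abs(freq - det) < 1e-9
```

The diagonal entry H(e, e) is the effective resistance across e, recovering Kirchhoff's classical fact that P(e ∈ T) equals the effective resistance for unit conductances.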



Section 9.3. The first “pure simulation” algorithm for sampling exactly
from the stationary distribution was given by Asmussen et al [35], using a
quite different idea, and lacking explicit time bounds.
Section 9.3.1. In our discussion of these algorithms, we are assuming
that we have a list of all states. Lovasz - Winkler [242] gave the argument in
a slightly different setting, where the algorithm can only “address” a single
state, and their bound involved maxij Ei Tj in place of τ1∗ .
Section 9.3.3. Letac [226] gives a survey of the “backwards coupling” method for establishing convergence of continuous-space chains: it suffices to show there exists a r.v. X_{−∞} such that X_0^{(x,s)} → X_{−∞} a.s. as s → −∞, for each state x. This method is especially useful in treating matrix-valued chains of the form Xt = At Xt−1 + Bt , where (At , Bt ), t ≥ 1 are i.i.d. random matrices. See Barnsley and Elton [43] for a popular application.
Section 9.4.1. One result on spectra and reversibilizations is the following. For a transition matrix P write

τ(P) = sup{ 1/(1 − |λ|) : λ ≠ 1 an eigenvalue of P }.

Then for the additive reversibilization Q(1) = (1/2)(P + P∗) we have (e.g. [316] Proposition 1)

τ(P) ≤ 2τ(Q(1)).
Chapter 10

Some Graph Theory and


Randomized Algorithms
(September 1 1999)

Much of the theory of algorithms deals with algorithms on graphs; con-


versely, much of the last twenty years of graph theory research pays atten-
tion to algorithmic issues. Within these large fields random walks play a
comparatively small role, but they do enter in various quite interesting and
diverse ways, some of which are described in this chapter. One theme of
this chapter is properties of random walks on expander graphs, introduced
in sections 10.1.1 and 10.1.2. Some non-probabilistic properties of graphs
can be explained naturally (to a probabilist, anyway!) in terms of random
walk: see section 10.2. Section 10.3 reviews the general idea of randomized
algorithms, and in section 10.4 we treat a diverse sample of randomized
algorithms based on random walks. Section 10.5 describes the particular
setting of approximate counting, giving details of the case of self-avoiding
walks. (xxx details not written in this version).
For simplicity let’s work in the setting of regular graphs. Except where otherwise stated, G is an n-vertex r-regular connected graph,

p_vw := r^{−1} 1((v, w) is an edge)

is the transition matrix for discrete-time random walk on G (so P = r^{−1} A for the adjacency matrix A), 1 = λ1 > λ2 ≥ . . . ≥ λn ≥ −1 are its eigenvalues, and τ2 = 1/(1 − λ2 ).


10.1 Expanders
10.1.1 Definitions
The Cheeger time constant τc discussed in Chapter 4 section 5.1 (yyy 10/11/94 version) becomes, for an r-regular n-vertex graph,

τc = sup_A (r/n) · |A| |A^c| / |E(A, A^c)|

where E(A, A^c) is the set of edges from a proper subset A of vertices to its complement A^c. Our version of Cheeger’s inequality is (Chapter 4 Corollary 37 and Theorem 40) (yyy 10/11/94 version)

τc ≤ τ2 ≤ 8τc².    (10.1)
Definition. An expander family is a sequence Gn of r-regular graphs (for some fixed r > 2), with n → ∞ through some subsequence of integers, such that

sup_n τc (Gn ) < ∞,

or equivalently (by Cheeger’s inequality)

sup_n τ2 (Gn ) < ∞.

One informally says “expander” for a generic graph Gn in the family. The
expander property is stronger than the rapid mixing property exemplified
by the d-cube (Chapter 5 Example 15) (yyy 4/23/96 version). None of the
examples in Chapter 5 is an expander family, and indeed there are no known
elementary examples. Certain random constructions of regular graphs yield
expanders: see Chapter 30 Proposition 1 (yyy 7/9/96 version). Explicit
constructions of expander families, in particular the celebrated Ramanujan
graphs, depend on group- and number-theoretic ideas outside our scope: see
the elegant monograph of Lubotzky [243].
Graph parameters like τc are more commonly presented in inverted form (i.e. like 1/τc ) as coefficients of expansion such as

h := inf_A |E(A, A^c)| / ( r min(|A|, |A^c|) ).    (10.2)

A more familiar version ([93] page 26) of Cheeger’s inequality in graph theory becomes, on regular graphs,

h²/2 ≤ 1 − λ2 ≤ 2h.    (10.3)

Since trivially τc ≤ 1/h ≤ 2τc the two versions agree up to factors of 2.


Inequalities involving coefficients of expansion are often called isoperimetric
inequalities. Expanders and isoperimetric inequalities have been studied
extensively in graph theory and the theory of algorithms, e.g. Chung [93]
Chapters 2 and 6, the conference proceedings [156], and the introduction of
Lubotzky [243].
One algorithmic motivation for Cheeger-type inequalities concerns computational complexity of calculating parameters like τc and h. Using the definition directly requires exponential (in n) time; but because eigenvalues can be calculated in polynomial time, these general inequalities imply that at least crude bounds can be computed in polynomial time.
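On a graph small enough to enumerate subsets, both quantities can be computed directly. A sketch (assuming NumPy; the 8-cycle is an arbitrary illustration) evaluating τc exhaustively, τ2 from the spectrum, and checking (10.1):

```python
import itertools
import numpy as np

# Random walk on the n-cycle (2-regular).
n, r = 8, 2
P = np.zeros((n, n))
for v in range(n):
    P[v, (v - 1) % n] = P[v, (v + 1) % n] = 0.5

lam = sorted(np.linalg.eigvalsh(P))
tau2 = 1.0 / (1.0 - lam[-2])

def tau_c():
    # sup over proper subsets A of (r/n) |A||A^c| / |E(A, A^c)| -- exponential time.
    best = 0.0
    for k in range(1, n):
        for A in itertools.combinations(range(n), k):
            Aset = set(A)
            cut = sum(1 for v in A
                      for w in ((v - 1) % n, (v + 1) % n) if w not in Aset)
            best = max(best, (r / n) * k * (n - k) / cut)
    return best

tc = tau_c()
assert tc <= tau2 + 1e-9 <= 8 * tc ** 2  # Cheeger inequality (10.1)
```

For the 8-cycle the supremum is attained by an arc of 4 vertices (cut of size 2), giving τc = 2, while τ2 = 1/(1 − cos(π/4)) ≈ 3.41.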

10.1.2 Random walk on expanders


If we don’t pay attention to numerical constants, then general results about
reversible chains easily give us the orders of magnitude of other hitting and
mixing time parameters for random walks on expanders.

Theorem 10.1 For random walk on an expander family, as n → ∞

τ1 = Θ(log n)    (10.4)
τ0 = Θ(n)    (10.5)
τ∗ = Θ(n)    (10.6)
sup_v Ev C = Θ(n log n)    (10.7)

Proof. Recall the general inequality between τ1 and τ2 (Chapter 4 Lemma 23) (yyy 10/11/94 version), which on a regular graph becomes

τ1 ≤ τ2 (1 + ½ log n).    (10.8)

This immediately gives the upper bound τ1 = O(log n). For the lower bound, having bounded degree obviously implies that the diameter ∆ of the graph satisfies ∆ = Ω(log n). And since the mean distance between an initial vertex v and the position XT of the walk at a stopping time T is at most ET, the definition of τ1^(2) implies d(v, w) ≤ 2τ1^(2) for any pair of vertices, that is τ1^(2) ≥ ∆/2. This establishes (10.4). The general Markov chain fact τ0 = Ω(n) is Chapter 3 Proposition 14 (yyy 1/26/93 version). Chapter 4 Lemma 25 gives τ0 ≤ 2nτ2 . Combining these and the obvious inequality τ0 ≤ τ∗/2 establishes (10.5, 10.6). Finally, the lower bound in (10.7) follows from the general lower bound in Chapter 6 Theorem 31 (yyy 10/31/94 version), while the upper bound follows from the upper bound on τ∗ combined with Chapter 2 Theorem 36 (yyy 8/18/94 version).
In many ways the important aspect of Theorem 10.1 is that τ1 -type
mixing times are of order log n. We spell out some implications below.
These hold for arbitrary regular graphs, though the virtue of expanders is
that τ2 is bounded.

Proposition 10.2 There exist constants K1 , K2 such that the following inequalities hold on any regular graph.
(i) For each vertex v there exists a stopping time Tv such that Pv (X(Tv ) ∈ ·) is uniform and Ev Tv ≤ K1 τ2 log n.
(ii) For lazy random walk X̃t (with hold-probability 1/2)

Pv (X̃t = w) ≥ (1/n)(1 − 1/2^j)   for all t ≥ j K2 τ2 log n and all vertices v, w.

Proof. Part (i) is just the definition of τ1^(2), combined with (10.8) and the fact τ1^(2) = O(τ1 ).
yyy relate (ii) to Chapter 4 section 3.3
Repeated use of (i) shows that we can get independent samples from π
by sampling at random times T1 , T2 , T3 , . . . with E(Tj+1 − Tj ) ≤ K1 τ2 log n.
Alternatively, repeated use of (ii) shows that we can get almost indepen-
dent samples from π by examining the lazy chain at deterministic times, as
follows.

Corollary 10.3 Fix j and let t0 ≥ j K2 τ2 log n. Write (Y1 , . . . , YL ) = (X_{t0}, . . . , X_{L t0}). Then

P (Y1 = y1 , . . . , YL = yL ) ≥ n^{−L} (1 − L/2^j)   for all L, y1 , . . . , yL ;

in particular, the variation distance between dist(Y1 , . . . , YL ) and π × . . . × π is at most L/2^j.

Examining the lazy chain at deterministic times means sampling the original
walk at random times, but at bounded random times. Thus we can get L
precisely independent samples using (i) in mean number K1 Lτ2 log n of steps,
but without a deterministic upper bound on the number of steps. Using
Corollary 10.3 we get almost independent samples (up to variation distance
ε) in a number of steps deterministically bounded by K2 L log(L/ε)τ2 log n.

10.1.3 Counter-example constructions


Constructions with expanders are often useful in providing counter-examples
to conjectures suggested by inspecting properties of random walk on the

elementary examples of graphs in Chapter 5. For example, consider upper


bounds on τ0 in terms of τ2 and n, in our setting of regular graphs. From general results for reversible chains in Chapter 4 (10/11/94 version: Lemma 24 and below (9))

max(τ2 , (n − 1)²/n) ≤ τ0 ≤ (n − 1)τ2 .

The examples in Chapter 5 are consistent with a conjecture

τ0 =? O(max(n, τ2 ) log n) (10.9)

where the log n term is needed for the 2-dimensional torus. We now outline
a counter-example.
Take m copies of the complete graph on m vertices. Distinguish one vertex vi from each copy i. Add edges to make the (vi ) the vertices of an r-regular expander. For this graph Gm we have, as m → ∞ with fixed r,

n = m²;   τ2 = Θ(m²);   τ0 = Θ(m³)

contradicting conjecture (10.9). We leave the details to the reader: the key point is that random walk on Gm may be decomposed as random walk on the expander, with successive steps in the expander separated by sojourns of times Θ(m²) within a clique.

10.2 Eigenvalues and graph theory


Our treatment of the relaxation time τ2 in Chapter 4 emphasized prob-
abilistic interpretations in the broad setting of reversible Markov chains.
Specializing to random walk on unweighted graphs, there are a range of
non-probabilistic connections between eigenvalues of the adjacency matrix
and other graph-theoretic properties. Such spectral graph theory is the sub-
ject of Chung [93]: we shall just give a few results with clear probabilistic
interpretations.

10.2.1 Diameter of a graph


Implicit in the proof of Theorem 10.1 is that, on a regular graph, the diam-
eter ∆ satisfies ∆ = O(τ ) = O(τ2 log n). By being a little careful we can
produce numerical constants.
Proposition 10.4   ∆/2 ≤ ⌈ (1 + ½ log n) / log( (3 − λ2 )/(1 + λ2 ) ) ⌉.

Proof. The discrete analog τ1disc of variation threshold satisfies

τ1disc ≥ ∆/2, (10.10)

because obviously

if d(v, w) = ∆ and t < ∆/2 then ||Pv (Xt ∈ ·) − Pw (Xt ∈ ·)|| = 1. (10.11)

Chapter 4 Lemma 26 (yyy 10/11/94 version) specializes to

τ1disc ≤ ⌈ (1 + ½ log n) / log(1/β) ⌉,   β := max(λ2 , −λn ).    (10.12)

We can remove dependence on λn by the trick of introducing artificial holds (yyy tie up with Chapter 4 section 3.3). The chain with transition matrix P′ := θI + (1 − θ)P has eigenvalues λ′i = θ + (1 − θ)λi . Choosing θ = (1 − λ2 )/(3 − λ2 ) (this being the value making λ′2 = −λ′n in the worst case λn = −1), we have

(1 + λ2 )/(3 − λ2 ) = β′ = λ′2 ≥ −λ′n .

Since (10.10) still holds for the Markov chain P′, combining (10.10) and (10.12) with β′ establishes the Proposition.
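Proposition 10.4 is easy to sanity-check numerically; a sketch (assuming NumPy, with the n-cycle as an arbitrary test graph):

```python
import math
from collections import deque
import numpy as np

n = 8  # cycle graph C_n, 2-regular
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def eccentricity(src):
    # Breadth-first search distances from src.
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return max(dist.values())

diam = max(eccentricity(v) for v in range(n))

P = np.zeros((n, n))
for v, nbrs in adj.items():
    for w in nbrs:
        P[v, w] = 0.5
lam2 = sorted(np.linalg.eigvalsh(P))[-2]

bound = math.ceil((1 + 0.5 * math.log(n)) / math.log((3 - lam2) / (1 + lam2)))
assert diam / 2 <= bound  # Proposition 10.4
```

On the 8-cycle the true diameter is 4 while the spectral bound is loose, as expected for a graph with λ2 close to 1.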

10.2.2 Paths avoiding congestion


Upper bounding τ2 via the distinguished paths technique (Chapter 4 section 4.3) (yyy 10/11/94 version) is a valuable theoretical technique: the essence is to choose paths which “avoid congestion”. In the opposite direction, one can use upper bounds on mixing times to show the existence of paths which avoid congestion. Here’s the simplest result of this type. The first part of the proof repeats an idea from Chapter 4 Lemma 21 (yyy new part of Lemma to be added).
Proposition 10.5 Let v(1), v(2), . . . , v(n) be any ordering of the vertices 1, 2, . . . , n of an r-regular graph. Then there exists, for each 1 ≤ i ≤ n, a path from i to v(i) such that, writing Nvw for the number of times the directed edge (v, w) is traversed in all the paths,

max_{(v,w)} Nvw ≤ 7 max( e τ1^(2)/r, log n ).

So on an expander the bound is O(log n), using (10.4).


Proof. By definition of τ1^(2), for each vertex i there is a segment of the chain i = X0^(i), X1^(i), . . . , X_{Ui}^(i) such that EUi ≤ τ1^(2) and X_{Ui}^(i) has uniform distribution. Take these segments independent as i varies. Write Ñvw for the (random) number of times that (v, w) is traversed by all these random paths. By considering a uniform random start, by (yyy tie up with Chapter 2 Proposition 3)

(1/n) E Ñvw = (1/(rn)) Σ_i (1/n) EUi .

In particular, E Ñvw ≤ τ1^(2)/r := κ. By erasing loops we may contract each path to a path in which no directed edge is traversed twice. For fixed (v, w) let pi be the chance that (v, w) is traversed by the contracted path from vertex i and let N′vw be the total number of traversals. By independence of paths as i varies,

P (N′vw ≥ m) ≤ (Σ_i pi )^m / m!   (expand the sum)
            ≤ κ^m / m! ≤ (eκ/m)^m .

Choosing m = ⌈3 max(eκ, log n)⌉ makes the bound less than 1/(2n²) and so

P ( max_{(v,w)} N′vw ≥ m ) < 1/2.

Now repeat the entire construction to define another copy (Yt^(i), 0 ≤ t ≤ Vi ) of chain segments with traversal counts N″vw. Since X_{Ui}^(i) and Y_{Vi}^(v(i)) have the same uniform distribution, for each i we can construct the chain segments jointly such that X_{Ui}^(i) = Y_{Vi}^(v(i)). Concatenating paths gives a (non-Markov) random path from i to v(i). Then

P ( max_{(v,w)} (N′vw + N″vw) ≥ 2m ) < 1

and so paths with the maximum ≤ 2m − 1 must exist.


Broder et al. [69] give a more elaborate algorithm for constructing edge-disjoint paths between specified pairs {(ai , bi ), 1 ≤ i ≤ k} of distinct vertices
on an expander, for k = n1−o(1) . The essential idea is to first pick a set S of
4k vertices at random, then use a greedy algorithm to construct (as in the
proof above) paths from each ai and bi to some ãi and b̃i in S, then for each
i construct a bundle of random walk paths from ãi to b̃i , and finally show
that one path may be selected from each bundle so that the set of paths is
edge-disjoint.

10.3 Randomized algorithms


10.3.1 Background
Here we give some background for the mathematician with no knowledge
of the theory of algorithms. Typically there are many different possible
algorithms for a particular problem; the theory of algorithms seeks “optimal”
algorithms according to some notion of “cost”. Cost is usually “time”,
i.e. number of computational steps, but sometimes involves criteria such
as (storage) space or simplicity of coding. The phrase randomized algorithm
refers to settings where the problem itself does not involve randomness but
where randomness is introduced into the running of the algorithm. Why this
is useful is best seen by example; the textbook of Motwani and Raghavan
[265] provides a comprehensive range of examples and classifications of types
of problems where randomized algorithms have proved useful. We give three
standard examples below (not using Markov chains) and then proceed to talk
about algorithms using random walks on graphs.

Example 10.6 Statistical sampling.

Consider a population of n individuals. Suppose we wish to know the pro-


portion q of the population with some attribute, i.e. who answer “Yes” to
some Yes/No question. To calculate q exactly we need to question all n
individuals. But if we can sample uniformly at random from the popula-
tion, then we can estimate q approximately and can bound the size of error
in probability. To do this, we sample independently k random individuals,
question them, and calculate the empirical proportion q̄ of Yes answers. Use
q̄ as our estimate of q, and theory gives error probabilities

P (|q̄ − q| > 2(q̄(1 − q̄))1/2 k −1/2 ) ≈ 5%.

Such 95% confidence intervals are discussed in every freshman statistics


course. Classical statistical sampling is conceptually a bit different from
algorithms, in that the “cost” k here refers to real-world costs of interviewing
human individuals (or experimenting on individual rats or whatever) rather
than to computational cost. However, the key insight in the formula above
is that, for prescribed allowable error ε, the cost of this simple random
sampling is O(ε−2 ) and this cost does not depend on the “problem size”
(i.e. population size) n. The next example is a slightly more subtle use of
sampling in a slightly more algorithmic context.

Example 10.7 Size of a union of sets.



It’s fun to say this as a word problem in the spirit of Chapter 1. Suppose
your new cyberpunk novel has been rejected by all publishers, so you have
published it privately, and seek to sell copies by mailing advertisements to
individuals. So you need to buy mailing lists (from e.g. magazines and
specialist bookshops catering to science fiction). Your problem is that such
mailing lists might have much overlap. So before buying L lists A1 , . . . , AL
(where Ai is a set of |Ai | names and addresses) you would like to know
roughly the size |∪i Ai | of their union. How can you do this without knowing
what the sets Ai are (the vendors won’t give them to you for free)? Statistical
sampling can be used here. Suppose the vendors will allow you to randomly
sample a few names (so you can check accuracy) and will allow you to
“probe” whether a few specified names are on their list. Then you can
sample k times from each list, and for each sampled name Xij probe the
other lists to count the number m(Xij ) ≥ 1 of lists containing that name.
Consider the identity

| ∪i Ai | = Σ_i |Ai | × ( |Ai |^{−1} Σ_{a∈Ai} 1/m(a) ) = Σ_i |Ai | E(1/Mi )

where Mi is the number of lists containing a uniform random name from Ai . You can estimate E(1/Mi ) by k^{−1} Σ_{j=1}^{k} 1/m(Xij ), and the error has standard deviation ≤ k^{−1/2}, and the resulting estimate of | ∪i Ai | has error

±O( (Σ_i |Ai |²/k)^{1/2} ) = ±O( k^{−1/2} L max_i |Ai | ).

As in the previous example, the key point is that the cost of “approximately
counting” ∪i Ai to within a small relative error does not depend on the size
of the sets.
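The scheme in Example 10.7 can be sketched with the standard library alone (the overlapping lists below are synthetic stand-ins for the vendors’ data):

```python
import random

random.seed(0)

# Synthetic overlapping mailing lists (sets of "names").
lists = [set(range(0, 600)), set(range(400, 900)), set(range(700, 1200))]
true_union = len(set().union(*lists))

def estimate_union(lists, k):
    total = 0.0
    for A in lists:
        members = list(A)
        # Estimate E(1/M_i) from k uniform samples out of A,
        # probing every list for each sampled name.
        inv_m = [1.0 / sum(1 for L in lists if x in L)
                 for x in (random.choice(members) for _ in range(k))]
        total += len(A) * sum(inv_m) / k
    return total

est = estimate_union(lists, k=400)
assert abs(est - true_union) / true_union < 0.1  # small relative error
```

The estimator is unbiased, and the number of probes (L · k per list) is independent of the sizes of the sets, which is the point of the example.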

Example 10.8 Solovay-Strassen test of primality [312].

We can’t improve on the concise description given by Babai [37].

Let n > 1 be an odd integer. Call an integer w a Solovay-Strassen witness (of compositeness of n) if 1 ≤ w ≤ n − 1 and either g.c.d.(w, n) > 1 or w^{(n−1)/2} ≢ (w/n) mod n, where (w/n) is the Jacobi symbol (computed via quadratic reciprocity as easily as g.c.d.’s are computed via Euclid’s algorithm). Note that no S-S witness exists if n is prime. On the other hand (this is
S-S witness exists if n is prime. On the other hand (this is

the theorem) if n is composite, then at least half of the integers


1, 2, . . . , n − 1 are S-S witnesses.
Suppose now that we want to decide whether or not a given
odd 200-digit integer n is prime. Pick k integers w1 , . . . , wk inde-
pendently at random from {1, 2, . . . , n − 1}. If any one of the wi
turns out to be a witness, we know that n is composite. If none
of them are, let us conclude that n is prime. Here we may err,
but for any n, the probability that we draw the wrong conclu-
sion is at most ε = 2−k . Setting k = 500 is perfectly realistic, so
we shall have proven the mathematical statement “n is prime”
beyond the shade of doubt.
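Babai’s description translates almost line-by-line into code. A sketch (standard library only; `jacobi` implements the Jacobi symbol via the usual quadratic-reciprocity recurrences, and the witness count k is an arbitrary choice):

```python
import math
import random

def jacobi(a, n):
    # Jacobi symbol (a/n) for odd n > 0, via quadratic reciprocity.
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a  # reciprocity swap
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def is_ss_witness(w, n):
    # Solovay-Strassen witness of compositeness of odd n.
    if math.gcd(w, n) > 1:
        return True
    return pow(w, (n - 1) // 2, n) != jacobi(w, n) % n  # -1 maps to n-1

def probably_prime(n, k=50, rng=random.Random(0)):
    return not any(is_ss_witness(rng.randrange(1, n), n) for _ in range(k))

assert probably_prime(10007)               # 10007 is prime: no witness exists
assert not probably_prime(10007 * 10009)   # composite: a witness is found
```

For prime n the test can never err; for composite n the error probability is at most 2^{−k}, exactly as in the quoted passage.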

10.3.2 Overview of randomized algorithms using random walks


or Markov chains
Our focus is of course on randomized algorithms using random walks or
Markov chains. We will loosely divide these into three categories. Markov
chain Monte Carlo seeks to simulate a random sample from a (usually non-
uniform) given probability distribution on a given set. This is the central
topic of Chapter 11. In section 10.4 below we give a selection of miscella-
neous graph algorithms. Into this category also falls the idea (Chapter 6
section 8.2) (yyy 10/31/94 version; details to be written) of using random
walk as an “undirected graph connectivity” algorithm, and the idea (end of
section 10.2.2) of using random walk paths as an ingredient in constructing
edge-disjoint paths in an expander graph. A third, intermediate category is
the specific topic of approximate counting via Markov chains, to be discussed
in section 10.5.

10.4 Miscellaneous graph algorithms


10.4.1 Amplification of randomness
In practice, Monte Carlo simulations are done using deterministic pseudo-
random number generators. Ideally one would prefer some physical device
which generated “truly random” bits. Presumably any such physical random
number generator would be rather slow compared to the speed of arithmeti-
cal calculations. This thinking has led to an area of theory in which the cost
of a randomized algorithm is taken to be the number of truly random bits
used.

Recall the Solovay-Strassen test of primality in Example 10.8. Philo-


sophically, there is something unsettling about using a deterministic pseudo-
random number generator in this context, so we regard this as a prototype
example where one might want to use a hypothetical source of truly random
bits. To pick a uniform random integer from {1, 2, . . . , n} requires about
log2 n random bits, so the cost of the algorithm as presented above is about
k log2 n = (log2 1/ε) (log2 n) bits, where ε is the prescribed allowable error
probability. But one can use the existence of explicit expanders and results
like Lemma 10.12 to devise an algorithm which requires fewer truly random
bits. Suppose we have a n-vertex r-regular expander, and label the vertices
{1, 2, . . . , n}. To simulate a uniform random starting vertex and t steps
of the random walk requires about log2 n + t log2 r bits. The chance that
such a walk never hits the set A of witnesses is, by Lemma 10.12, at most exp(−t/(2τ2 )). To make this chance ≤ ε we take t = 2τ2 log(1/ε), and the cost becomes log2 n + 2τ2 (log2 r) log(1/ε). Thus granted the existence of
expanders on which we can efficiently list neighbors of any specified vertex
in order to simulate the random walk, the method of simulating (depen-
dent) integers (wi ) via the random walk (instead of independently) reduces
the number of truly random bits required from O((log n) × (log 1/ε)) to
O(max(log n, log 1/ε)).
The idea of using random walks on expanders for such algorithmic pur-
poses is due to Ajtai et al [4]. Following Impagliazzo and Zuckerman [188]
one can abstract the idea to rather general randomized algorithms. Suppose
we are given a randomized algorithm, intended to show whether an object
x ∈ X has a property P by outputting “Yes” or “No”, and that for each x
the algorithm is correct with probability ≥ 2/3 and uses at most b random
bits. Formally, the algorithm is a function A : X × {0, 1}b → {Yes,No} such
that
if x ∈ P then 2−b |{i ∈ {0, 1}b : A(x, i) = YES}| ≥ 2/3

if x 6∈ P then 2−b |{i ∈ {0, 1}b : A(x, i) = YES}| ≤ 1/3


where P ⊂ X is the subset of all objects possessing the property. To make
the probability of incorrect classification be ≤ ε we may simply repeat the
algorithm m times, where m = Θ(log 1/ε) is chosen to make

P (Binomial(m, 2/3) ≤ m/2) ≤ ε,

and output Yes or No according to the majority of the m individual outputs.


This requires bm = Θ(b log 1/ε) random bits. But instead we may take
{0, 1}b as the vertices of a degree-r expander, and simulate a uniform random

starting vertex and m steps of random walk on the expander, using about
b + m log2 r random bits. For each of the m + 1 vertices of {0, 1}b visited by
the walk (Yi , 0 ≤ i ≤ m), compute A(x, Yi ), and output Yes or No according
to the majority of the m + 1 individual outputs. The error probability is at most

max_B Pπ ( N_{m+1}(B)/(m + 1) − π(B) ≥ 1/3 )

where Nm+1 (B) is the number of visits to B by the walk (Yi , 0 ≤ i ≤ m).
By the large deviation bound for occupation measures (Theorem 10.11, yyy
to be moved to other chapter) this error probability is at most

(1 + c1 m/τ2 ) exp(−c2 m/τ2 )

for constants c1 and c2 . To reduce this below ε requires m = O(τ2 log(1/ε)).


Thus the existence of (bounded-degree) expanders implies that the number
of random bits required is only

b + m log2 r = O(max(b, log 1/ε))

compared to O(b log(1/ε)) using independent sampling.
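The independent-repetition baseline can be made concrete with a short computation (standard library only): find the smallest odd m with P(Binomial(m, 2/3) ≤ m/2) ≤ ε, which grows like log(1/ε).

```python
from math import comb

def majority_error(m, p=2/3):
    # P(Binomial(m, p) <= m/2): the probability the majority vote is wrong.
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(m // 2 + 1))

def repetitions_needed(eps):
    m = 1
    while majority_error(m) > eps:
        m += 2  # odd m avoids ties
    return m

ms = [repetitions_needed(10.0 ** -k) for k in (2, 4, 8)]
assert ms == sorted(ms)  # more accuracy costs more repetitions
assert all(majority_error(m) <= 10.0 ** -k for m, k in zip(ms, (2, 4, 8)))
# growth roughly linear in log(1/eps), consistent with m = Theta(log 1/eps)
assert ms[2] < 8 * ms[0]
```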

10.4.2 Using random walk to define an objective function


In Chapter 6 section 8.2 (yyy currently at end of this Chapter; to be moved) we gave a standard use of the probabilistic method. Here is a less standard use, from Aldous [7], where we use the sample path of a random walk to make a construction.
Consider a function h defined on the vertices of a n-vertex graph G.
Constrain h to have no local minima except the global minimum (for sim-
plicity, suppose the values of h are distinct). We seek algorithms to find
the vertex v at which h(v) is minimized. Any deterministic “descent” al-
gorithm will work, but it might work slowly. Could there be some more
sophisticated algorithm which always works quickly? One idea is multi-start descent. Pick n^{1/2} vertices uniformly at random; from these, choose the vertex with minimum h-value, and follow the greedy descent algorithm. On a degree-d graph, the mean time is O(d n^{1/2}). Now specialize to the case where G is the d-cube. One can give examples where single-start (from a uniform random start) descent has mean time Ω(2^{(1−ε)d}), so from a worst-case mean-time viewpoint, multi-start is better. The next theorem shows that (again from a worst-case mean-time viewpoint) one cannot essentially improve on multi-start descent. Consider random walk on the d-cube started
10.4. MISCELLANEOUS GRAPH ALGORITHMS 347

at a uniform random vertex U and let H(v) be the first hitting time on v.
Then H is a random function satisfying the constraint, minimized at v = U ,
but
Theorem 10.9 ([6]) Every algorithm for locating U by examining values
H(v) requires examining a mean number Ω(2^{d/2−ε}) of vertices.
The argument is simple in outline. As a preliminary calculation, consider
random walk on the d-cube of length t0 = O(2^{d/2−ε}), started at 0, and let
Lv be the time of the last visit to v, with Lv = 0 if v is not visited. Then

   E Lv ≤ Σ_{t=1}^{t0} t P0(X(t) = v) = O(1)        (10.13)

where the O(1) bound holds because the worst-case v for the sum is v = 0
and, switching to continuous time,

   ∫_0^{2^{d/2}} t P0(X̃(t) = 0) dt = ∫_0^{2^{d/2}} t ((1 + e^{−2t/d})/2)^d dt = O(1).
Now consider an algorithm which has evaluated H(v1 ), . . . , H(vm ) and
write t0 = min_{i≤m} H(vi) = H(v∗) say. It does no harm to suppose t0 =
O(2^{d/2−ε}). Conditional on the information revealed by H(v1), . . . , H(vm),
the distribution of the walk (X(t); 0 ≤ t ≤ t0 ) is specified by
(a) take a random walk from a uniform random start U , and condition on
X(t0 ) = v ∗ ;
(b) condition further on the walk not hitting {vi } before time t0 .
The key point, which of course is technically hard to deal with, is that the
conditioning in (b) has little effect. If we ignore the conditioning in (b), then
by reversing time we see that the random variables (H(v ∗ ) − H(v))+ have
the same distribution as the random variables Lv (up to vertex relabeling).
So whatever vertex v the algorithm chooses to evaluate next, inequality
(10.13) shows that the mean improvement E(H(v ∗ ) − H(v))+ in objective
value is O(1), and so it takes Ω(2^{d/2−ε}) steps to reduce the objective value
from 2^{d/2−ε} to 0.
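The multi-start descent algorithm discussed above can be sketched as follows; the objective h (Hamming weight, whose only local minimum is the global minimum at vertex 0) is a hypothetical stand-in chosen for illustration, not an example from the text.

```python
import random

def descend(h, v, d):
    """Greedy descent on the d-cube: move to the best strictly smaller
    neighbour until none exists."""
    while True:
        best = min(range(d), key=lambda i: h(v ^ (1 << i)))
        if h(v ^ (1 << best)) >= h(v):
            return v
        v ^= 1 << best

def multistart_descent(h, d, rng):
    """Pick about sqrt(2^d) random starts, keep the one with smallest
    h-value, then run greedy descent from it."""
    n = 2 ** d
    starts = [rng.randrange(n) for _ in range(int(n ** 0.5) + 1)]
    return descend(h, min(starts, key=h), d)

# Hypothetical objective: Hamming weight, whose only local minimum is the
# global minimum at vertex 0.
h = lambda v: bin(v).count("1")
print(multistart_descent(h, 10, random.Random(1)))   # -> 0
```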

10.4.3 Embedding trees into the d-cube


Consider again the d-cube I = {0, 1}^d with Hamming distance d(i, j). Let
B be the vertices of a M-vertex binary tree. For an embedding, i.e. an
arbitrary function ρ : B → I, define

   load = max_{i∈I} |{v ∈ B : ρ(v) = i}|

   dilation = max_{edges (v,w) of B} d(ρ(v), ρ(w)).
How can we choose an embedding which makes both load and dilation small?
This was studied by Bhatt and Cai [47], as a toy model for parallel com-
putation. In the model I represents the set of processors, B represents the
set of tasks being done at a particular time, the tree structure indicating
tasks being split into sub-tasks. To assign tasks to processors, we desire
no one processor to have many tasks (small load) and we desire processors
working on tasks and their sub-tasks to be close (small dilation) to facil-
itate communication. As the computation proceeds the tree will undergo
local changes, as tasks are completed and new tasks started and split into
sub-tasks, and we desire to be able to update the embedding “locally” in
response to local changes in the tree. Bhatt and Cai [47] investigated the
natural random walk embedding, where the root of B is embedded at 0, and
recursively each child w of v is embedded at the vertex ρ(w) found at step L
(for even L) of a random walk started at ρ(v). So by construction, dilation
≤ L, and the mathematical issue is to estimate load. As before, the de-
tails are technically complicated, but let us outline one calculation. Clearly
load = Ω(max(1, M/2^d)), so we would like the mean number of vertices of
B embedded at any particular vertex i to be O(max(1, M/2^d)). In bound-
ing this mean, because p0i(t) ≤ p00(t) for even t (Chapter 7 Corollary 3)
(yyy 1/31/94 version) we see that the worst-case i is 0, and then because
p00(t) is decreasing in t we see that the worst-case M-vertex binary tree is
a maximally balanced tree. Thus we want

   Σ_{k=0}^{log2 M} 2^k p00(kL) = O(max(1, M/2^d)).        (10.14)

From the analysis of random walk on the d-cube (Chapter 5 Example 15)
(yyy 4/23/96 version) one can show

p00(k log d) = O(max(d^{−k}, 2^{−d})), uniformly in k, d ≥ 1.

It follows that (10.14) holds if we take L = ⌈log d⌉.
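The random walk embedding of Bhatt and Cai is straightforward to simulate. The sketch below embeds a complete binary tree into the d-cube with an even step length L of order log d and reports the resulting loads; the particular parameters (d = 10, L = 8, depth 9) are our own choices.

```python
import random
from collections import Counter

def walk(v, steps, d, rng):
    """Simple random walk on the d-cube: flip one uniform coordinate per step."""
    for _ in range(steps):
        v ^= 1 << rng.randrange(d)
    return v

def embed_tree(depth, L, d, rng):
    """Root at 0; each child is embedded at step L of a random walk started
    at its parent's image.  Returns the multiset of occupied vertices."""
    images = Counter({0: 1})
    level = [0]
    for _ in range(depth):
        nxt = []
        for _child in range(2):
            pass  # (loop body below iterates over the current level)
        for pos in level:
            for _child in range(2):
                q = walk(pos, L, d, rng)
                images[q] += 1
                nxt.append(q)
        level = nxt
    return images

d = 10
L = 8                                    # even, of order log d
images = embed_tree(9, L, d, random.Random(2))
M = sum(images.values())                 # 2^10 - 1 = 1023 tree vertices
print(M, max(images.values()))           # the maximal load stays small
```

By construction every parent-child pair is at Hamming distance at most L, so dilation ≤ L; the printout illustrates that the empirical load is far below M.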


Of course to bound the load we need to consider the maximally-loaded
vertex, rather than a typical vertex. Considering M = 2^d for definiteness, if
the M vertices were assigned independently and uniformly, the mean load
at a typical vertex would be 1 and classical arguments show the maximal
load would be Θ(d/ log d). We have shown that with tree-embedding the mean
load at a typical vertex is O(1), so analogously one can show the maximal

load is O(d/ log d). However, [47] shows that by locally redistributing tasks
assigned to the same processor, one can reduce the maximal load to O(1)
while maintaining the dilation at O(log d).

10.4.4 Comparing on-line and off-line algorithms


Here we describe work of Coppersmith et al [99]. As in Chapter 3 section 2
(yyy 9/2/94 version) consider a weighted graph on n vertices, but now write
the edge-weights as cij = cji > 0 and regard them as a matrix C of costs.
Let P be the transition matrix of an irreducible Markov chain whose only
transitions are along edges of the graph. For each i and j let m(i, j) be the
mean cost of the random walk from i to j, when traversing an edge (v, w)
incurs cost cvw . Define the stretch s(P, C) to be the smallest s such that
there exists a < ∞ such that, for arbitrary v0, v1, . . . , vk,

   Σ_{i=0}^{k−1} m(vi, vi+1) ≤ a + s Σ_{i=0}^{k−1} c_{vi vi+1}.        (10.15)

Note that s(P, C) is invariant under scaling of C.


Proposition 10.10 ([99]) (a) s(P, C) ≥ n − 1.
(b) If P is reversible and C is the matrix of mean commute times Ei Tj +
Ej Ti then s(P, C) = n − 1.
(c) For any cost matrix C̃ there exists a reversible transition matrix P
with matrix C of mean commute times such that, for some constant α,

cij ≤ αc̃ij

cij = αc̃ij when pij > 0.


So from (b) and invariance under scaling, s(P, C̃) = n − 1.
We shall prove (a) and (b), which are just variations of the standard theory
of mean hitting times developed in Chapters 2 and 3. The proof of part
(c) involves “convex programming” and is rather outside our scope. The
algorithmic interpretations are also rather too lengthy to give in detail, but
are easy to say in outline. Imagine a problem where it is required to pick a
minimum-cost path, where the cost of a path consists of costs of traversing
edges, together with extra costs and constraints. There is some optimal
off-line solution, which may be hard to calculate. In such a problem, one
may be able to use Proposition 10.10(c) to show that the algorithm which
simply picks a random sample path (with transition matrix P from (c)) has
mean cost not more than n − 1 times the cost of the optimal path.

Proof. Write π for the stationary distribution of P. Write m+(v, v) for
the mean cost of an excursion from v to v, and write c̄ = Σ_v Σ_w πv pvw cvw.
Then m+(v, v) = c̄/πv by the ergodic argument (Chapter 2 Lemma 30) (yyy
8/18/94 version), and so

   nc̄ = Σ_v πv m+(v, v)
       = Σ_v Σ_w πv pvw (cvw + m(w, v))
       = c̄ + Σ_v Σ_w πv pvw m(w, v).

In other words,

   Σ_v Σ_w πw pwv m(v, w) = (n − 1)c̄.        (10.16)

Now apply the definition (10.15) of s(P, C) to the sequence of states visited
by the stationary time-reversed chain P∗; by considering the mean of each
step,

   Σ_v Σ_w πv p∗vw m(v, w) ≤ s(P, C) Σ_v Σ_w πv p∗vw cvw.        (10.17)

But the left sides of (10.16) and (10.17) are equal by definition of P∗ , and
the sum in the right of (10.17) equals c̄ by symmetry of C, establishing (a).
For (b), first note that the definition (10.15) of stretch is equivalent to

   s(P, C) = max_σ [ Σ_i m(vi, vi+1) / Σ_i c_{vi,vi+1} ]        (10.18)

where σ denotes a cycle (v1, v2, . . . , vm, v1). Write t(v, w) = Ev Tw. Fix
a cycle σ and write µ = Σ_i t(vi, vi+1) for the mean time to complete the
cyclic tour. By the ergodic argument (Chapter 2 Lemma 30) (yyy 8/18/94
version), the mean number of traversals of an edge (v, w) during the tour is
µπv pvw, and hence the ratio in (10.18) can be written as

   [ Σ_i t(vi, vi+1) / Σ_i c_{vi,vi+1} ] × Σ_v Σ_w πv pvw cvw.        (10.19)

Now the hypothesis of (b) is that P is reversible and cvw = t(v, w) + t(w, v).
So the second term of (10.19) equals 2(n − 1) by Chapter 3 Lemma 6 (yyy
9/2/94 version) and the first term equals 1/2 by the cyclic tour property
Chapter 3 Lemma 1 (yyy 9/2/94 version). So for each cycle σ the ratio in
(10.18) equals n − 1, establishing (b).

10.5 Approximate counting via Markov chains


For a finite set S, there is a close connection between
(a) having an explicit formula for the size |S|
(b) having a bounded-time algorithm for generating a uniform random
element of S.
As an elementary illustration, we all know that there are n! permutations of
n objects. From a proof of this fact, we could write down an explicit 1–1
mapping f between the set of permutations and the set A = {(a1, a2, . . . , an) :
1 ≤ ai ≤ i}. Then we could simulate a uniform random permutation by
first simulating a uniform random element a of A and then computing f (a).
Conversely, given an algorithm which was guaranteed to produce a uniform
random permutation after k(n) calls to a random number generator, we
could (in principle) analyze the working of the algorithm in order to calcu-
late the chance p of getting the identity permutation. Then we can say that
the number of permutations equals 1/p.
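One concrete choice of the 1–1 mapping f (the text does not specify one, so this particular mapping is our own) is insertion coding: build up a list by inserting item i at position a_i. A sketch:

```python
import random
from itertools import product

def f(a):
    """Map a = (a_1, ..., a_n) with 1 <= a_i <= i to a permutation of
    {1, ..., n} by inserting item i at position a_i of the list built so far."""
    perm = []
    for i, ai in enumerate(a, start=1):
        perm.insert(ai - 1, i)
    return tuple(perm)

def random_permutation(n, rng):
    """A uniform element a of A, pushed through f, is a uniform permutation."""
    return f([rng.randint(1, i) for i in range(1, n + 1)])

# f is a bijection: for n = 4 the 1*2*3*4 = 24 elements of A hit all 24
# permutations.
images = {f(a) for a in product((1,), (1, 2), (1, 2, 3), (1, 2, 3, 4))}
print(len(images))   # -> 24
```

Step i has exactly i legal insertion positions, which is why |A| = n! and why distinct sequences give distinct permutations.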
A more subtle observation is that, in certain settings, having an algo-
rithm for generating an approximately uniform random element of S can
be used to estimate approximately the size |S|. The idea is to estimate
successive ratios by sampling. Suppose we can relate S to smaller sets

   S = S_L ⊃ S_{L−1} ⊃ . . . ⊃ S_2 ⊃ S_1        (10.20)

where |S_1| is known, the ratios p_i := |S_{i−1}|/|S_i| are bounded away from 0,
and where we can sample uniformly from each S_i. Then take k uniform
random samples from each S_i and find the sample proportion W_i which fall
into S_{i−1}. Because |S| = |S_1| Π_{i=2}^L |S_i|/|S_{i−1}|, we use

   N̂ := |S_1| Π_{i=2}^L W_i^{−1}

as an estimate of |S|. To study its accuracy, it is simpler to consider
|S|/N̂ = Π_{i=2}^L W_i/p_i. Clearly E(|S|/N̂) = 1, and we can calculate the
variance by

   var(|S|/N̂) = var( Π_{i=2}^L W_i/p_i )
              = Π_{i=2}^L (1 + var(W_i/p_i)) − 1
              = Π_{i=2}^L (1 + (1 − p_i)/(p_i k)) − 1.

The simplest case is where we know a theoretical lower bound p∗ for the p_i.
Then by taking k = O(ε^{−2} L/p∗) we get

   var(|S|/N̂) ≤ exp(L/(p∗ k)) − 1 = O(ε²).

In other words, with a total number

Lk = O(ε−2 L2 /p∗ ) (10.21)

of random samples, we can statistically estimate |S| to within a factor
1 ± O(ε).
The conceptual point of invoking intermediate sets Si is that the overall
ratio |S1 |/|S| may be exponentially small in some size parameter, so that
trying to estimate this ratio directly by sampling from S would involve order
|S|/|S1 |, i.e. exponentially many, samples. If we can specify the intermediate
sets with ratios pi bounded away from 0 and 1 then L = O(log(|S|/|S1 |))
and so the number of samples required in (10.21) depends on log(|S|/|S1 |)
instead of |S|/|S1 |.
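The telescoping estimator N̂ can be sketched generically, given exact uniform samplers for the S_i. The toy nested family below (S_i = {0, . . . , 2^i − 1}, so every ratio p_i = 1/2) is our own illustration, not an example from the text.

```python
import random

def estimate_size(samplers, member, size_S1, k):
    """N_hat = |S_1| * prod_{i=2}^L (1/W_i), where W_i is the proportion
    of k uniform samples from S_i that land in S_{i-1}."""
    est = float(size_S1)
    for i, sample in enumerate(samplers, start=2):
        hits = sum(member(i - 1, sample()) for _ in range(k))
        est *= k / hits
    return est

rng = random.Random(4)
L = 10
# samplers[j] draws uniformly from S_{j+2}; member(i, x) tests x in S_i.
samplers = [(lambda i=i: rng.randrange(2 ** i)) for i in range(2, L + 1)]
member = lambda i, x: x < 2 ** i
est = estimate_size(samplers, member, size_S1=2, k=4000)
print(est)   # should be close to |S_L| = 2^10 = 1024
```

With k = 4000 samples per level the estimate typically lands within a few percent of 1024, in line with the variance calculation above.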
The discussion so far has not involved Markov chains. From our view-
point, the interesting setting is where we cannot directly get uniform ran-
dom samples from a typical Si , but instead need to use Markov chain Monte
Carlo. That is, on each Si we set up a reversible Markov chain with uniform
stationary distribution (i.e. a chain whose transition matrix is symmetric
in the sense pvw ≡ pwv). Assume we have a bound τ1 on the τ1-values of all
these chains. Then as a small modification of Corollary 10.3, one can get
m samples from the combined chains whose joint distribution is close (in
variation distance) to the distribution of independent samples from the
uniform distributions in O(τ1 m log m) steps. As above, if we can specify the
intermediate sets with ratios pi bounded away from 0 and 1 then we need
m = O(ε^{−2} log²(|S|/|S_1|)) samples, and so (ignoring dependence on ε) the
total number of steps in all the chains is O(τ1 log^{2+o(1)}(|S|/|S_1|)).
In summary, to implement this method of approximate counting via
Markov chains, one needs

• a way to specify the intermediate sets (10.20)

• a way to specify Markov chains on the Si whose mixing times can be
rigorously bounded.

Two particular examples have been studied in detail, and historically these
examples provided major impetus for the development of technical tools to

estimate mixing times. Though the details are too technical for this book,
we outline these examples in the next two sections, and then consider in
detail the setting of self-avoiding walks.

10.5.1 Volume of a convex set


Given a closed convex set K in Rd , for large d, how can we algorithmically
calculate the volume of K? Regard K as being described by an oracle, that
is for any x ∈ Rd we can determine in one step whether or not x ∈ K.
Perhaps surprisingly, there is no known deterministic algorithm which finds
vol(K) approximately (i.e. to within a factor 1 ± ε) in a polynomial in d
number of steps. But this problem is amenable to the “approximate counting
via Markov chains” technique. This line of research was initiated by Dyer
et al [135, 136] who produced an algorithm requiring O(d^{23+o(1)}) steps. A
long sequence of papers (see the Notes) studied variants of both the Markov
chains and the analytic techniques in order to reduce the polynomial degree.
Currently the best bound is O(d^{5+o(1)}), due to Kannan et al [206].
To outline the procedure in this example, suppose we know B(1) ⊂
K ⊂ B(r), where B(r) is the ball of radius r = r(d). (It turns out that
one can transform any convex set into one satisfying these constraints with
r = O(d^{3/2}).) We specify an increasing sequence of convex subsets

   B(1) = K_0 ⊂ K_1 ⊂ . . . ⊂ K_L = K

by setting K_i := B(2^{i/d}) ∩ K. This makes the ratios of successive volumes
bounded by 2 and requires L = O(d log d) intermediate sets. So the issue
is to design and analyze a chain on a typical convex set Ki whose station-
ary distribution is uniform. Various Markov chains have been used: simple
random walk on a fine discrete lattice restricted to Ki , or spherically sym-
metric walks. The analysis of the chains has used Cheeger inequalities for
chains and the refinement of classical isoperimetric inequalities for convex
sets. Recent work of Bubley et al [80] has successfully introduced coupling
methods, and it is a challenging problem to refine these coupling methods.
There is a suggestive analogy with theoretical study of Brownian motion in
a convex set – see Chapter 13 section 1.3 (yyy 7/29/99 version).
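Purely to illustrate the bookkeeping of the nested-sets scheme, here is a 2-dimensional toy in which each K_i = B(2^{i/d}) ∩ K can be sampled directly by rejection — exactly what is impossible in high dimension, where the Markov chains discussed above are needed. The elliptical K is our own choice.

```python
import math, random

def sample_ball(r, rng):
    """Uniform point in the disc of radius r (rejection from the square)."""
    while True:
        x, y = rng.uniform(-r, r), rng.uniform(-r, r)
        if x * x + y * y <= r * r:
            return (x, y)

def estimate_volume(in_K, r_outer, d, k, rng):
    """vol(K) via K_i = B(2^(i/d)) ∩ K, assuming B(1) ⊂ K ⊂ B(r_outer)."""
    L = math.ceil(d * math.log2(r_outer))
    est = math.pi                        # vol(K_0) = vol(B(1)) when d = 2
    for i in range(1, L + 1):
        r_i, r_prev = 2 ** (i / d), 2 ** ((i - 1) / d)
        hits = 0
        for _ in range(k):
            p = sample_ball(r_i, rng)
            while not in_K(p):           # rejection: uniform on K_i
                p = sample_ball(r_i, rng)
            hits += p[0] ** 2 + p[1] ** 2 <= r_prev ** 2
        est *= k / hits                  # multiply by vol(K_i)/vol(K_{i-1})
    return est

# Toy K: the ellipse x^2/16 + y^2/2.25 <= 1, so B(1) ⊂ K ⊂ B(4); area = 6*pi.
in_K = lambda p: p[0] ** 2 / 16 + p[1] ** 2 / 2.25 <= 1
est = estimate_volume(in_K, r_outer=4, d=2, k=3000, rng=random.Random(5))
print(est)   # should be close to 6*pi ≈ 18.85
```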

10.5.2 Matchings in a graph


For a finite, not necessarily regular, graph G0 let M(G0 ) be the set of all
matchings in G0 , where a matching M is a subset of edges such that no vertex
is in more than one edge of M . Suppose we want to count |M(G0 )| (for the

harder setting of counting perfect matchings see the Notes). Enumerate the
edges of G0 as e1 , e2 , . . . , eL , where L is the number of edges of G0 . Write
Gi for the graph G0 with edges e1 , . . . , ei deleted. A matching of Gi can be
identified with a matching of Gi−1 which does not contain ei , so we can
write
M(GL−1 ) ⊂ M(GL−2 ) ⊂ . . . ⊂ M(G1 ) ⊂ M(G0 ).
Since GL−1 has one edge, we know |M(GL−1 )| = 2. The ratio |M(Gi+1 )|/|M(Gi )|
is the probability that a uniform random matching of Gi does not contain
the edge ei+1 . So the issue is to design and analyze a chain on a typical
set M(Gi ) of matchings whose stationary distribution is uniform. Here is a
natural such chain. From a matching M0 , pick a uniform random edge e of
G0 , and construct a new matching M1 from M0 and e as follows.
If e ∈ M0 then set M1 = M0 \ {e}.
If neither end-vertex of e is in an edge of M0 then set M1 = M0 ∪ {e}.
If exactly one end-vertex of e is in an edge (e′ say) of M0 then set
M1 = {e} ∪ M0 \ {e′}.
This construction (the idea goes back to Broder [64]) yields a chain with
symmetric transition matrix, because each possible transition has chance
1/L. An elegant analysis by Jerrum and Sinclair [199], outlined in Jerrum
[197] section 5.1, used the distinguished paths technique to prove that on a
n-vertex L-edge graph
τ2 = O(Ln).
Since the number of matchings can be bounded crudely by n!,

τ1 = O(τ2 log n!) = O(Ln² log n).        (10.22)
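The chain on matchings just described is easy to simulate; since each possible transition has chance 1/L, the transition matrix is symmetric and the stationary distribution uniform. A sketch on the 4-cycle (which has 7 matchings), with parameters of our own choosing:

```python
import random
from collections import Counter

def matching_step(M, edges, rng):
    """One step of the symmetric chain on matchings: pick a uniform edge e
    and add / delete / swap it, as in the three cases of the text."""
    e = rng.choice(edges)
    v, w = e
    if e in M:
        return M - {e}
    touching = {f for f in M if v in f or w in f}
    if not touching:
        return M | {e}
    if len(touching) == 1:
        return (M - touching) | {e}
    return M                             # both endpoints covered: hold

# The 4-cycle has 7 matchings; the chain should visit each equally often.
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
rng = random.Random(6)
M = frozenset()
counts = Counter()
for _ in range(70000):
    M = matching_step(M, edges, rng)
    counts[M] += 1
print(len(counts))                       # 7 matchings, each seen ~10000 times
```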

10.5.3 Simulating self-avoiding walks


xxx to be written

10.6 Notes on Chapter 9


Section 10.1.1. Modern interest in expanders and their algorithmic uses goes
back to the early 1980s, e.g. their use in parallel sorting networks by Ajtai
et al [3], and was increased by Alon’s [26] graph-theoretic formulation of
Cheeger’s inequality. The conference proceedings [156] provides an overview.
Edge-expansion, measured by parameters like h at (10.2), is more rele-
vant to random walk than vertex-expansion. Walters [334] compares defini-
tions. What we call “expander” is often called bounded-degree expander.
Section 10.1.2. Ajtai et al [4], studying the “amplification of random-
ness” problems in section 10.4.1, was perhaps the first explicit use of random
walk on expanders. In Theorem 10.1, the upper bounds on τ ∗ and EC go
back to Chandra et al [85].
Section 10.1.3. With the failure of conjecture (10.9), the next natural
conjecture is: on a r-regular graph

τ0 =? O(max(n, τ2 ) max(log n, r)).

It’s not clear whether such conjectures are worth pursuing.


Section 10.2. More classical accounts of spectral graph theory are in
Cvetkovic et al [107, 106].
On a not-necessarily-regular graph, Chung [93] studies the eigenvalues
of the matrix L defined by

Lvw = 1 for w = v
    = −(dv dw)^{−1/2} for an edge (v, w)        (10.23)
    = 0 else.

In the regular case, −L is the Q-matrix of transition rates for the continuous-
time random walk, and so Chung’s eigenvalues are identical to our continuous-
time eigenvalues. In the non-regular case there is no simple probabilistic in-
terpretation of L and hence no simple probabilistic interpretation of results
involving the relaxation time 1/λ2 associated with L.
Section 10.2.1. Chung [93] Chapter 3 gives more detailed results about
diameter and eigenvalues. One can slightly sharpen the argument for Propo-
sition 10.4 by using (10.11) and the analog of Chapter 4 Lemma 26 (yyy
10/11/94 version) in which the threshold for τ1^{disc} is set at 1 − ε. Such argu-
ments give bounds closer to that of [93] Corollary 3.2: if G is not complete
then

   ∆ ≤ ⌈ log(n − 1) / log((3 − λ2)/(1 + λ2)) ⌉.

Section 10.2.2. Chung [93] section 4.4 analyzes a somewhat related rout-
ing problem. Broder et al [70] analyze a dynamic version of path selection
in expanders.
Section 10.3.1. Example 10.7 (union of sets) and the more general DNF
counting problem were studied systematically by Karp et al [210]; see also
[265] section 11.2.
The Solovay-Strassen test of primality depends on a certain property of
the Jacobi symbol: see [265] section 14.6 for a proof of this property.
Section 10.4.1. Several other uses of random walks on expanders can
be found in Ajtai et al [4], Cohen and Wigderson [97], Impagliazzo and
Zuckerman [188].
Section 10.4.4. Tetali [325] discusses extensions of parts (a,b) of Propo-
sition 10.10 to nonsymmetric cost matrices.
Section 10.5. More extensive treatments of approximate counting are in
Sinclair [309] and Motwani and Raghavan [265] Chapter 12.
Jerrum et al [201] formalize a notion of self-reducibility and show that,
under this condition, approximate counting can be performed in polynomial
time iff approximately uniform sampling can. See Sinclair [309] section 1.4
for a nice exposition.
Abstractly, we are studying randomized algorithms which produce a ran-
dom estimate â(d) of a numerical quantity a(d) (where d measures the “size”
of the problem) together with a rigorous bound of the form

P ((1 − ε)a(d) ≤ â(d) ≤ (1 + ε)a(d)) ≥ 1 − δ.

Such a scheme is a FPRAS (fully polynomial randomized approximation


scheme) if the cost of the algorithm is bounded by a polynomial in d, 1/ε
and log 1/δ. Here the conclusion involving log 1/δ is what emerges from
proofs using large deviation techniques.
Section 10.5.1. Other papers on the volume problem and the related
problem of sampling from a log-concave distribution are Lovász and Si-
monovits [238], Applegate and Kannan [32], Dyer and Frieze [134], Lovász
and Simonovits [239] and Frieze et al [158].
Section 10.5.2. In the background is the problem of approximating the
permanent

   per A := Σ_σ Π_{i=1}^n A_{iσ(i)}

of a n × n non-negative matrix, where the sum is over all permutations σ.


When A is the adjacency matrix of a n + n bipartite graph, per(A) is the
number of perfect matchings. Approximate counting of perfect matchings is

in principle similar to approximate counting of all matchings; one seeks to
use the chain in section 10.5.2 restricted to M_i ∪ M_{i−1}, where M_i is the
set of matchings with exactly i edges. But successful analysis of this chain
requires that we have a dense graph, with minimum degree > n/2. Jerrum
and Sinclair [198] gave the first analysis, using the Cheeger inequality and
estimating expansion via distinguished paths. Sinclair [309] Chapter 3 and
Motwani and Raghavan [265] Chapter 11 give more detailed expositions.
Subsequently it was realized that using the distinguished paths technique
directly to bound τ2 was more efficient. A more general setting is to seek to
sample from the non-uniform distribution on matchings M

π(M) ∝ λ^{|M|}

for a parameter λ > 1. The distinguished paths technique [199, 197] giving
(10.22) works in this setting to give

τ1 = O(Ln²λ log(nλ)).

10.7 Material belonging in other chapters


10.7.1 Large deviation bounds
yyy Somewhere in the book we need to discuss the results on explicit large
deviation bounds for occupation measure / empirical averages: [167, 125,
204, 229]. In section 10.4.1 we used the following bound from Gillman [167]
Theorem 2.1.
Theorem 10.11

   Pµ(Nn(B)/n − π(B) > γ) ≤ (1 + γn/(10τ2)) (Σ_i µ_i²/π_i)^{1/2} exp(−γ²n/(20τ2)).

10.7.2 The probabilistic method in combinatorics


yyy This is to be moved to Chapter 6, where we do the “universal traversal
sequences” example.
Suppose one wants to show the existence of a combinatorial object with
specified properties. The most natural way is to give an explicit construc-
tion of an example. There are a variety of settings where, instead of
giving an explicit construction, it is easier to argue that a randomly-chosen
object has a non-zero chance of having the required properties. The mono-
graph by Alon and Spencer [29] is devoted to this topic, under the name the
probabilistic method. One use of this method is below. Two more examples
occur later in the book: random construction of expander graphs (Chapter
30 Proposition 1) (yyy 7/9/96 version), and the random construction of an
objective function in an optimization problem (Chapter 9 section 4.2) (yyy
this version).

10.7.3 copied to Chapter 4 section 6.5


(yyy 10/11/94 version) Combining Corollary 31 with (62) gives the contin-
uous time result below. Recasting the underlying theory in discrete time
establishes the discrete-time version.
Lemma 10.12

   (continuous time) Pπ(TA > t) ≤ exp(−tπ(A)/τ2), t ≥ 0

   (discrete time) Pπ(TA ≥ t) ≤ (1 − π(A)/τ2)^t, t ≥ 0.

Notes on this section. In studying bounds on TA such as Lemma 10.12
we usually have in mind that π(A) is small. One is sometimes interested

in exit times from a set A with π(A) small, i.e. hitting times on Ac where
π(Ac ) is near 1. In this setting one can replace inequalities using τ2 or τc
(which parameters involve the whole chain) by inequalities involving analo-
gous parameters for the chain restricted to A and its boundary. See Babai
[36] for uses of such bounds.
On several occasions we have remarked that for most properties of ran-
dom walk, the possibility of an eigenvalue near −1 (i.e. an almost-bipartite
graph) is irrelevant. An obvious exception arises when we consider lower
bounds for Pπ (TA > t) in terms of |A|, because in a bipartite graph with
bipartition {A, Ac } we have P (TA > 1) = 0. It turns out (Alon et al
[27] Proposition 2.4) that a corresponding lower bound holds in terms of
τn ≡ 1/(λn + 1):

   Pπ(TA > t) ≥ ( max(0, 1 − (|A|/n) τn) )^t.
Chapter 11

Markov Chain Monte Carlo


(January 8 2001)

This book is intended primarily as “theoretical mathematics”, focusing on


ideas that can be encapsulated in theorems. Markov Chain Monte Carlo
(MCMC), which has grown explosively since the early 1990s, is in a sense
more of an “engineering mathematics” field – a suite of techniques which
attempt to solve applied problems, the design of the techniques being based
on intuition and physical analogies, and their analysis being based on ex-
perimental evaluation. In such a field, the key insights do not correspond
well to theorems.
In section 11.1 we give a verbal overview of the field. Section 11.2 de-
scribes the two basic schemes (Metropolis and line-sampling), and section
11.3 describes a few of the many more complex chains which have been sug-
gested. The subsequent sections are fragments of theory, indicating places
where MCMC interfaces with topics treated elsewhere in this book. Liu
[235] gives a comprehensive textbook treatment of the field.

11.1 Overview of Applied MCMC


11.1.1 Summary
We give a brisk summary here, and expand upon some main ideas (the
boldface phrases) in section 11.1.2.
Abstractly, we start with the following type of problem.

Given a probability distribution π on a space S, and a numer-
ical quantity associated with π (for instance, the mean ḡ :=
Σ_x π(x)g(x) or := ∫ g dπ, for specified g : S → R), how can one
estimate the numerical quantity using Monte Carlo (i.e. ran-
domized algorithm) methods?

Asking such a question implicitly assumes we do not have a solution using


mathematical analysis or efficient deterministic numerical methods. Exact
Monte Carlo sampling presumes the ability to sample exactly from the target
distribution π, enabling one to simulate an i.i.d. sequence (Xi ) and then use
classical statistical estimation, e.g. estimate ḡ by n^{−1} Σ_{i=1}^n g(Xi). Where

implementable, such exact sampling will typically be the best randomized


algorithm. For one-dimensional distributions and a host of special dis-
tributions on higher-dimensional space or combinatorial structures, exact
sampling methods have been devised. But it is unrealistic to expect there
to be any exact sampling method which is effective in all settings. Markov
Chain Monte Carlo sampling is based on the following idea.

First devise a Markov chain on S whose stationary distribu-


tion is π. Simulate n steps X1 , . . . , Xn of the chain. Treat
Xτ ∗ , Xτ ∗ +1 , . . . , Xn as dependent samples from π (where τ ∗ is
some estimate of some mixing time) and then use these samples
in a statistical estimator of the desired numerical quantity, where
the confidence interval takes the dependence into account.

Variations of this basic idea include running multiple chains and introducing
auxiliary variables (i.e. defining a chain on some product space S × A). The
basic scheme and variations are what make up the field of MCMC. Though
there is no a priori reason why one must use reversible chains, in practice
the need to achieve a target distribution π as stationary distribution makes
general constructions using reversibility very useful.
MCMC originated in statistical physics, but mathematical analysis of
its uses there are too sophisticated for this book, so let us think instead of
Bayesian statistics with high-dimensional data as the prototype setting for
MCMC. So imagine a point x ∈ Rd as recording d numerical characteristics
of an individual. So data on n individuals is represented as a n × d matrix
x = (xij ). As a model, we first take a parametric family φ(θ, x) of probability
densities; that is, θ ∈ Rp is a p-dimensional parameter and for each θ the
function x → φ(θ, x) is a probability density on Rd . Finally, to make a
Bayes model we take θ to have some probability density h(θ) on Rp . So
the probability model for the data is: first choose θ according to h(·), then
choose (xi· ) i.i.d. with density φ(θ, x). So there is a posterior distribution
on θ specified by

   fx(θ) := h(θ) Π_{i=1}^n φ(θ, xi·) / zx        (11.1)
where zx is the normalizing constant. Our goal is to sample from fx (·),
for purposes of e.g. estimating posterior means of real-valued parameters.
An explicit instance of (11.1) is the hierarchical Normal model, but the
general form of (11.1) exhibits features that circumscribe the type of chains
it is feasible to implement in MCMC, as follows.
(i) Though the underlying functions φ(·, ·), h(·) which define the model
may be mathematically simple, our target distribution fx (·) depends on
actual numerical data (the data matrix x), so it is hard to predict, and
dangerous to assume, global regularity properties of fx (·).
(ii) The normalizing constant zx is hard to compute, so we want to define
chains which can be implemented without calculating zx .
The wide range of issues arising in MCMC can loosely be classified as “de-
sign” or “analysis” issues. Here “design” refers to deciding which chain to
simulate, and “analysis” involves the interpretation of results. Let us start
by discussing design issues. The most famous general-purpose method is
the Metropolis scheme, of which the following is a simple implementation in
setting (11.1). Fix a length scale parameter l. Define a step θ → θ(1) of a
chain as follows.

Pick i uniformly from {1, 2, . . . , p}.


Pick U uniformly from [θi − l, θi + l].
Let θ′ be the p-vector obtained from θ by changing the i’th co-
ordinate to U.
With probability min(1, fx(θ′)/fx(θ)) set θ(1) = θ′; else set
θ(1) = θ.
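A minimal sketch of this coordinate-wise Metropolis step, working with an unnormalized log-density so that zx never appears. The 3-dimensional standard Normal target is a hypothetical stand-in for fx, chosen only so that the answer is known:

```python
import math, random

def metropolis_step(theta, log_f, l, rng):
    """One coordinate Metropolis step; only ratios f(theta')/f(theta) are
    used, so the normalizing constant z_x is never needed."""
    i = rng.randrange(len(theta))
    proposal = list(theta)
    proposal[i] = rng.uniform(theta[i] - l, theta[i] + l)
    if rng.random() < math.exp(min(0.0, log_f(proposal) - log_f(theta))):
        return proposal
    return theta

# Hypothetical target: unnormalized standard Normal in p = 3 dimensions.
log_f = lambda th: -0.5 * sum(t * t for t in th)
rng = random.Random(7)
theta = [5.0, -5.0, 5.0]                 # a deliberately bad starting point
samples = []
for t in range(20000):
    theta = metropolis_step(theta, log_f, l=1.0, rng=rng)
    if t >= 2000:                        # crude burn-in
        samples.append(theta[0])
mean = sum(samples) / len(samples)
print(round(mean, 2))                    # should be near the true mean 0
```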

The target density enters the definition only via the ratios fx(θ′)/fx(θ),
so the value of zx is not needed. The essence of a Metropolis scheme is
that there is a proposal chain which proposes a move θ → θ′, and then
an acceptance/rejection step which accepts or rejects the proposed move.
See section 11.2.1 for the general definition, and proof that the stationary
distribution is indeed the target distribution. There is considerable flexibility
in the choice of proposal chain. One might replace the uniform proposal
step by a Normal or symmetrized exponential or Cauchy jump; one might
instead choose a random (i.e. isotropic) direction and propose to step some
random distance in that direction (to make an isotropic Normal step, or a
step uniform within a ball, for instance). There is no convincing theory to

say which of these choices is better in general. However, in each proposal


chain there is some length scale parameter l: there is a trade-off between
making l too small (proposals mostly accepted, but small steps imply slow
mixing) and making l too large (proposals rarely accepted), and in section
11.5 we give some theory (admittedly in an artificial setting) which does
give guidance on choice of l.
The other well-known general MCMC method is exemplified by the Gibbs
sampler. In the setting of (11.1), for θ = (θ1 , . . . , θp ) and 1 ≤ j ≤ p write
fx,j,θ (v) = fx (θ1 , . . . , θj−1 , v, θj+1 , . . . , θp ).

A step θ → θ(1) of the Gibbs sampler is defined as follows.

• Pick j uniformly from {1, 2, . . . , p}.
• Pick V from the density on R1 proportional to fx,j,θ (v).
• Let θ(1) be θ with its j'th coordinate replaced by V.
The heuristic appeal of the Gibbs sampler, compared to a Metropolis scheme,
is that in the latter one typically considers only small proposal moves (lest
proposals be almost always rejected) whereas in the Gibbs sampler one sam-
ples over an infinite line, which may permit larger moves. The disadvantage
is that sampling along the desired one-dimensional line may not be easy to
implement (see section 11.1.2). Closely related to the Gibbs sampler is the
hit-and-run sampler, where one takes a random (isotropic) direction line in-
stead of a coordinate line; section 11.2.2 abstracts the properties of such line
samplers, and section 11.3 continues this design topic to discuss more com-
plex designs of chains which attain a specified target distribution as their
stationary distribution.
We now turn to analysis issues, and focus on the simplest type of problem, obtaining an estimate for an expectation ḡ = Σx g(x)π(x) using an irreducible chain (Xt ) designed to have stationary distribution π. How do we obtain an estimate, and how accurate is it? The most straightforward approach is single-run estimation. The asymptotic variance rate is

σ² := lim_{t→∞} t⁻¹ var(Σ_{s=1}^{t} g(Xs )) = Σ_{s=−∞}^{∞} covπ (g(X0 ), g(Xs )).   (11.2)
So simulate a single run of the chain, from some initial state, for some large
number t of steps. Estimate ḡ by
ĝ = (t − t0 )⁻¹ Σ_{i=t0 +1}^{t} g(Xi )   (11.3)
11.1. OVERVIEW OF APPLIED MCMC 365
and estimate the variance of ĝ by (t − t0 )⁻¹ σ̂², and report a confidence interval for ĝ by assuming ĝ has Normal distribution with mean ḡ and the estimated variance. Here σ̂² is an estimate of σ² obtained by treating the sample covariances γ̂s (i.e. the covariances of the data-set ((g(Xi ), g(Xi+s )); 0 ≤ i ≤ t − s)) as estimators of γs = covπ (g(X0 ), g(Xs )). The burn-in time t0 is chosen as a time after which the γ̂s become small.
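The estimator (11.3) and the accompanying variance estimate can be sketched as follows (the truncation rule used for the sum of γ̂s, stopping at the first non-positive value, is one simple choice, not prescribed by the text):

```python
import numpy as np

def single_run_estimate(g_values, t0):
    """Return (g_hat, est_var): the post-burn-in average (11.3) and the
    variance estimate (t - t0)^{-1} sigma_hat^2, where sigma_hat^2 sums
    the sample autocovariances gamma_hat_s."""
    y = np.asarray(g_values, dtype=float)[t0:]
    n = len(y)
    g_hat = y.mean()
    yc = y - g_hat
    gammas = [yc.dot(yc) / n]               # gamma_hat_0
    for s in range(1, n // 10):
        gam = np.dot(yc[:-s], yc[s:]) / n   # gamma_hat_s
        if gam <= 0.0:                      # truncate once gamma_hat_s is small
            break
        gammas.append(gam)
    sigma2_hat = gammas[0] + 2.0 * sum(gammas[1:])   # sum over s in Z
    return g_hat, sigma2_hat / n
```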
Though the practical relevance of theoretical mixing time pa-
rameters is debatable, one can say loosely that single-run estimates based
on t steps will work fine if t is large compared to the relaxation time τ2 .
The difficulty is that in practical MCMC problems we do not know, or have
reasonable upper bounds on, τ2 , nor can we estimate τ2 rigorously from sim-
ulations. The difficulty in diagnosing convergence from simulations is the
possibility of metastability error caused by multimodality. Using statisti-
cal physics imagery, the region around each mode is a potential well, and
the stationary distribution conditioned to a potential well is a metastable
distribution. Believing that a simulation reaches the stationary distribution
when in fact it only reaches a metastable distribution is the metastability
error.
The simplest way to try to guard against metastability error is the mul-
tiple trials diagnostic. Here we run k independent copies of the chain from
different starting states, each for t steps. One diagnostic is to calculate the
k sample averages ĝj , and check that the empirical s.d. of these k averages
is consistent with the estimated s.d. (t − t0 )−1/2 σ̂. Intuitively, one chooses
the initial states to be “overdispersed”, i.e. more spread out than we expect
the target distribution to be; passing the diagnostic test gives us some re-
assurance against metastability error (if there were different potential wells,
we hope our runs would find more than one well, and that different behavior
of g on different wells would be manifest).
Of course, if one intends to perform such diagnostics it makes sense to
start out doing the k multiple runs. A more elaborate procedure is to divide
[0, t] into L successive blocks, and seek to check whether the kL blocks “look
similar”. This can be treated as a classical topic in statistics (“analysis of
variance”). In brief, we compute the sample mean ĝi,j and sample variance σ̂²i,j for the j'th block of the i'th simulation, and see if this data (perhaps after deleting the first few blocks of each simulation) is consistent with the
blocks being i.i.d.. If so, we use the overall average as an estimator of ḡ,
and estimate the accuracy of this estimator by assuming the blocks were
independent.
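The first diagnostic above, comparing the spread of the k run averages with the single-run standard error, can be sketched as a small function (names are ours; `runs` holds the k runs' g-values):

```python
import numpy as np

def multiple_runs_diagnostic(runs, t0, sigma2_hat):
    """Compare the empirical s.d. of the k post-burn-in run averages
    with the single-run standard error (t - t0)^{-1/2} sigma_hat;
    a ratio well above 1 is a warning sign of metastability error."""
    runs = np.asarray(runs, dtype=float)
    k, t = runs.shape
    means = runs[:, t0:].mean(axis=1)       # the k sample averages g_hat_j
    empirical_sd = means.std(ddof=1)
    predicted_sd = np.sqrt(sigma2_hat / (t - t0))
    return empirical_sd / predicted_sd
```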
If a multiple-runs diagnostic fails, or if one lacks confidence in one’s abil-
ity to choose a small number of starting points which might be attracted to
different modes (if such existed), then one can seek schemes specially adapted
to multimodal target densities. Because it is easy to find local maxima of a
target density f , e.g. by a deterministic hill-climbing algorithm, one can find
modes by repeating such an algorithm from many initial states, to try to
find an exhaustive list of modes with relatively high f -values. This is mode-
hunting; one can then design a chain tailored to jump between the wells
with non-vanishing probabilities. Such methods are highly problem-specific;
more general methods (such as the multi-level or multi-particle schemes of
sections 11.3.3 and 11.3.4) seek to automate the search for relevant modes
within MCMC instead of having a separate mode-hunting stage.
In seeking theoretical analysis of MCMC one faces an intrinsic difficulty:
MCMC is only needed on “hard” problems, but such problems are difficult
to study. In comparing effectiveness of different variants of MCMC it is
natural to say “forget about theory – just see what works best on real
examples”. But such experimental evaluation is itself conceptually difficult:
pragmatism is easier in theory than in practice!
11.1.2 Further aspects of applied MCMC

Sampling from one-dimensional distributions. Consider a probability distribution µ on R1 with density function f and distribution function F .
In one sense, sampling from µ is easy, because of the elementary result that
F −1 (U ) has distribution µ, where U is uniform on [0, 1] and x = F −1 (u)
is the inverse function of u = F (x). In cases where we have an explicit
formula for F −1 , we are done. Many other cases can be done using rejection
sampling. Suppose there is some other density g from which we can sample
by the inverse distribution function method, and suppose we know a bound
c ≥ supx f (x)/g(x). Then the algorithm
• propose a sample x from g(·);
• accept x with probability f (x)/(c g(x)); else propose a new sample from g

produces an output with density f (·) after mean c steps. By combining these
two methods, libraries of algorithms for often-encountered one-dimensional
distributions can be built, and indeed exist in statistical software packages.
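For a concrete (invented) instance of the rejection scheme above, take f the Beta(2,2) density 6x(1 − x) on [0, 1], with uniform envelope g ≡ 1 and bound c = sup f /g = 1.5:

```python
import numpy as np

# invented example: f is the Beta(2,2) density on [0,1];
# the envelope g is Uniform[0,1], and c = sup f/g = 1.5
f = lambda x: 6.0 * x * (1.0 - x)
c = 1.5

def rejection_sample(rng):
    """One draw with density f by accept/reject from g = Uniform[0,1];
    the mean number of proposals per accepted draw is c."""
    while True:
        x = rng.uniform()                    # propose a sample x from g
        if rng.uniform() < f(x) / c:         # accept w.p. f(x)/(c g(x))
            return x
```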
But what about a general density f (x)? If we need to sample many times
from the same density, it is natural to use deterministic numerical methods.
First probe f at many values of x. Then either
(a) build up a numerical approximation to F and thence to F −1 ; or
(b) choose from a library a suitable density g and use rejection sampling.
The remaining case, which is thus the only “hard” aspect of sampling from
one-dimensional distributions, is where we only need one sample from a
general distribution. In other words, where we want many samples which
are all from different distributions. This is exactly the setting of the Gibbs
sampler where the target multidimensional density is complicated, and thus
motivates some of the variants we discuss in section 11.3.
Practical relevance of theoretical mixing time parameters. Standard theory
from Chapter 4 (yyy cross-refs) relates τ2 to the asymptotic variance rate
σ 2 (g) at (11.2) for the “worst-case” g:
τ2 = 1/(1 − λ2 ) ≈ (1 + λ2 )/(1 − λ2 ) = sup_g σ²(g)/varπ g.   (11.4)

Moreover Proposition 29 of Chapter 4 (yyy 10/11/94 version) shows that σ²(g) also appears in an upper bound on variances of finite-time averages
from the stationary chain. So in asking how long to run MCMC simulations,
a natural principle (not practical, of course, because we typically don’t know
τ2 ) is
base estimates on t steps, where t is a reasonably large multiple of τ2 .
But this principle can be attacked from opposite directions. It is sometimes
argued that worrying about τ2 (corresponding to the worst-case g) is overly
pessimistic in the context of studying some specific g. For instance, Sokal
[311] p. 8 remarks that in natural statistical physics models on the infinite
lattice near a phase transition in a parameter θ, as θ tends to the critical
point the growth exponent of σ 2 (g) for “interesting” g is typically different
from the growth exponent of τ2 . Madras and Slade [252] p. 326 make similar
remarks in the context of the pivot algorithm for self-avoiding walk. But we
do not know similar examples in the statistical Rd setting. In particular, in
the presence of multimodality such counterexamples would require that g
be essentially “orthogonal” to the differences between modes, which seems
implausible.
Burn-in, the time t0 excluded from the estimator (11.3) to avoid undue
influence of initial state, is conceptually more problematic. Theory says that
taking t0 as a suitable multiple of τ1 would guarantee reliable estimates. The
general fact τ1 ≥ τ2 then suggests that allowing sufficient burn-in time is a
stronger requirement than allowing enough “mixing” for the stationary chain
– so the principle above is overly optimistic. On the other hand, because it
refers to worst-case initial state, requiring a burn-in time of τ1 seems far too
conservative in practice. The bottom line is that one cannot eliminate the
possibility of metastability error; in general, all one gets from multiple-runs
and diagnostics is confidence that one is sampling from a single potential
well, in the imagery below (though section 11.6.2 indicates a special setting
where we can do better).

Statistical physics imagery. Any probability distribution π can be written
as
π(x) ∝ exp(−H(x)).

One can call H a potential function; note that a mode (local maximum) of π
is a local minimum of H. One can envisage a realization of a Markov chain
as a particle moving under the influence of both a potential function (the
particle responds to some “force” pushing it towards lower values of H) and
random noise. Associated with each local minimum y of H is a potential well, which we envisage as the set of points from which, under the influence of the potential only (without noise), the particle would move to y (in terms of π, states from which a “steepest ascent” path leads to y).
A fundamental intuitive picture is that the main reason why a reversible
chain may relax slowly is that there is more than one potential well, and the
chain takes a long time to move from one well to another. In such a case,
π conditioned to a single potential well will be a metastable (i.e. almost-
stationary) distribution. One expects the chain’s distribution, from any
initial state, to reach fairly quickly one (or a mixture) of these metastable
distributions, and then the actual relaxation time to stationarity is domi-
nated by the times taken to move between wells. In more detail, if there are
w wells then one can consider, as a coarse-grained approximation, a w-state
continuous-time chain where the transition rates w1 → w2 are the rates of
moving from well w1 to well w2 . Then τ2 for the original chain should be
closely approximated by τ2 for the coarse-grained chain.

The hierarchical Normal model. As a very simple instance of (11.1), take
d = 1, p = 2 and x → φ(µ, σ², x) the Normal(µ, σ²) density. Then let (µ, σ) be chosen independently for each individual from some joint density h(µ, σ) on R × R+ . The data is an n-vector x = (x1 , . . . , xn ) and the full posterior distribution is

fx (µ1 , . . . , µn , σ1 , . . . , σn ) = zx⁻¹ Π_{i=1}^{n} h(µi , σi ) φ(µi , σi², xi ).
Typically we are interested in a posterior mean of µi for fixed i, that is ḡ for

g(µ1 , . . . , µn , σ1 , . . . , σn ) := µi .
Pragmatism is easier in theory than in practice. In comparing MCMC methods experimentally, one obvious issue is the choice of example to study. Another issue is that, if we measure “time” as “number of steps”, then a step of
one chain may not be comparable with a step of another chain. For instance,
a Metropolis step is typically easier to implement than a Gibbs step. More
subtly, in combinatorial examples there may be different ways to set up
a data structure to represent the current state in a way that permits easy
computation of π-values. The alternative of measuring “time” as CPU time
introduces different problems – details of coding matter.

11.2 The two basic schemes
We will present general definitions and discussion in the context of finite-
state chains on a state space S; translating to continuous state space such
as Rd involves slightly different notation without any change of substance.

11.2.1 Metropolis schemes
Write K = (kxy ) for a proposal transition matrix on S. The simplest case is where K is symmetric (kxy ≡ kyx ). In this case, given π on S we define a step x → x′ of the associated Metropolis chain in words by

• pick y from k(x, ·) and propose a move to y;

• accept the move (i.e. set x′ = y) with probability min(1, πy /πx ), otherwise stay (x′ = x).

This recipe defines the transition matrix P of the Metropolis chain to be

pxy = kxy min(1, πy /πx ), y ≠ x.

Assuming K is irreducible and π strictly positive, clearly P is irreducible. Then since πx pxy = kxy min(πx , πy ), symmetry of K implies P
satisfies the detailed balance equations and so is reversible with stationary
distribution π.
The general case is where K is an arbitrary transition matrix, and the
acceptance rule becomes
• accept a proposed move x → y with probability min(1, (πy kyx )/(πx kxy )).

The transition matrix of the Metropolis chain becomes

pxy = kxy min(1, (πy kyx )/(πx kxy )), y ≠ x.   (11.5)

To ensure irreducibility, we now need to assume connectivity of the graph on S whose edges are the (x, y) such that min(kxy , kyx ) > 0. Again detailed balance holds, because

πx pxy = min(πx kxy , πy kyx ), y ≠ x.

The general case is often called Metropolis-Hastings – see Notes for termi-
nological comments.
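On a finite state space the general recipe (11.5) can be written out in full and its properties checked numerically (an illustration with invented names; the proposal and target below are arbitrary):

```python
import numpy as np

def metropolis_hastings_matrix(K, pi):
    """Assemble the transition matrix (11.5) from a proposal matrix K
    and target distribution pi on a finite state space."""
    n = len(pi)
    P = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if y != x and K[x, y] > 0:
                P[x, y] = K[x, y] * min(1.0, (pi[y] * K[y, x]) / (pi[x] * K[x, y]))
        P[x, x] = 1.0 - P[x].sum()   # rejected proposals stay at x
    return P
```

Given any row-stochastic K with the required connectivity, the resulting P satisfies πx pxy = min(πx kxy, πy kyx) = πy pyx, which the test of this sketch verifies directly.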

11.2.2 Line-sampling schemes

The abstract setup described below comes from Diaconis [113]. Think of each Si as a “line”, i.e. as the set of points on a line.
Suppose we have a collection (Si ) of subsets of state space S, with ∪i Si =
S. Write I(x) := {i : x ∈ Si }. Suppose for each x ∈ S we are given a
probability distribution i → w(i, x) on I(x), and suppose

if x, y ∈ Si then w(i, x) = w(i, y). (11.6)

Write π [i] (·) = π(·|Si ). Define a step x → y of the line-sampling chain in words by

• choose i from w(·, x);

• then choose y from π [i] .

So the chain has transition matrix

pxy = Σ_{i∈I(x)} w(i, x) πy[i] , y ≠ x.

We can rewrite this as

pxy = Σ_{i∈I(x)∩I(y)} w(i, x) πy /π(Si )
and then (11.6) makes it clear that πx pxy = πy pyx . For irreducibility, we
need the condition

the union over i of the edges of the complete graphs on the Si forms a connected graph on S.   (11.7)
Note in particular we want the Si to be overlapping, rather than a partition.
This setting includes many examples of random walks on combinatorial
sets. For instance, card shuffling by random transpositions (yyy cross-ref)
is essentially the case where the collection of subsets consists of all 2-card
subsets. In the Rd setting, with target density f , the Gibbs sampler is the
case where the collection consists of all lines parallel to some axis. Taking
instead all lines in all directions gives the hit-and-run sampler, for which a
step from x is defined as follows.

• Pick a direction uniformly at random, i.e. a point y on the surface of the unit ball.

• Step from x to x + U y, where −∞ < U < ∞ is chosen with density proportional to u^{d−1} f (x + uy).

The term u^{d−1} here arises as a Jacobian; see Liu [235] Chapter 8 for explanation and more examples in Rd .

11.3 Variants of basic MCMC

11.3.1 Metropolized line sampling
Within the Gibbs or hit-and-run scheme, at each step one needs to sample
from a one-dimensional distribution, but a different one-dimensional distri-
bution each time. As mentioned in section 11.1.2, this is in general not easy
to implement efficiently. An alternative is Metropolized line sampling, where
one instead takes a single step of a Metropolis (i.e. propose/accept) chain
with the correct stationary distribution. To say the idea abstractly, in the
general “line sampling” setting of section 11.2.2, assume also:
for each i we have an irreducible transition matrix K i on Si whose sta-
tionary distribution is π [i] .
Then define a step x → y of the Metropolized line sampler as

• choose i from w(·, x);

• then choose y from k i (x, ·).

It is easy to check that the chain has stationary distribution π, and is re-
versible if the K i are reversible, so in particular if the K i are defined by a
Metropolis-type propose-accept scheme. In the simplest setting where the
line sampler is the Gibbs sampler and we use the same one-dimensional pro-
posal step distribution each time, this scheme is Metropolis-within-Gibbs. In that context it seems intuitively natural to use a long-tailed proposal distribution such as the Cauchy distribution, because we might encounter wildly different one-dimensional target densities: e.g. one density with s.d. 1/10 and another with two modes separated by 10, and a U (−L, L) step proposal would be inefficient in the latter case if L is small, and inefficient in the former case if L is large. Intuitively, a long-tailed distribution avoids
these worst cases, at the cost of having the acceptance rate be smaller in
good cases.
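A sketch of Metropolis-within-Gibbs with a Cauchy proposal, as suggested above (illustrative code; the names and the test target are ours):

```python
import numpy as np

def metropolis_within_gibbs_step(theta, log_f, scale, rng):
    """Pick a coordinate line at random, then make a single Metropolis
    step along it with a long-tailed (Cauchy) proposal, instead of
    sampling the one-dimensional conditional exactly."""
    j = rng.integers(len(theta))
    prop = theta.copy()
    prop[j] = theta[j] + scale * rng.standard_cauchy()
    if np.log(rng.uniform()) < log_f(prop) - log_f(theta):
        return prop
    return theta
```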

11.3.2 Multiple-try Metropolis

In the setting (section 11.2.1) of the Metropolis scheme, one might consider
making several draws from the proposal distribution and choosing one of
them to be the proposed move. Here is one way, suggested by Liu et al
[236], to implement this idea. It turns out that to ensure the stationary
distribution is the target distribution π, we need extra samples which are
used only to adjust the acceptance probability of the proposed step.
For simplicity, we take the case of a symmetric proposal matrix K. Fix
m ≥ 2. Define a step from x of the multiple-try Metropolis (MTM) chain as
follows.

• Choose y1 , . . . , ym independently from k(x, ·);

• Choose yi with probability proportional to π(yi );

• Choose x1 , . . . , xm−1 independently from k(yi , ·), and set xm = x;

• Accept the proposed move x → yi with probability min(1, Σi π(yi ) / Σi π(xi )).
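The four steps above can be sketched directly (symmetric proposal assumed, as in the text; the concrete target and proposal used for illustration are our own):

```python
import numpy as np

def mtm_step(x, pi, propose, m, rng):
    """One multiple-try Metropolis step; `propose` is a symmetric
    proposal kernel and pi may be an unnormalized density."""
    ys = [propose(x, rng) for _ in range(m)]
    wy = np.array([pi(y) for y in ys])
    j = rng.choice(m, p=wy / wy.sum())       # choose y_j with prob. prop. to pi(y_j)
    y = ys[j]
    xs = [propose(y, rng) for _ in range(m - 1)] + [x]   # set x_m = x
    wx = np.array([pi(z) for z in xs])
    if rng.uniform() < min(1.0, wy.sum() / wx.sum()):
        return y
    return x

# illustrative target and proposal (ours): unnormalized N(0,1), U(-1,1) steps
pi = lambda z: np.exp(-0.5 * z * z)
propose = lambda x, rng: x + rng.uniform(-1.0, 1.0)
```

Note the extra draws x1, . . . , xm−1 are used only in the acceptance ratio, exactly as in the description above.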

Irreducibility follows from irreducibility of K. To check detailed balance, write the acceptance probability as min(1, q). Then

pxy = m kxy Σ (Π_{i=1}^{m−1} kx,yi ) (Π_{i=1}^{m−1} ky,xi ) (πy / Σi πyi ) min(1, q)
where the first sum is over ordered (2m−2)-tuples (y1 , . . . , ym−1 , x1 , . . . , xm−1 ).
So we can write
πx pxy = m kxy πx πy Σ (Π_{i=1}^{m−1} kx,yi ) (Π_{i=1}^{m−1} ky,xi ) min(1/Σi πyi , q/Σi πyi ).

The choice of q makes the final term become min(1/Σi πyi , 1/Σi πxi ). One can now check πx pxy = πy pyx , by switching the roles of the xj and yj .
To compare MTM with single-try Metropolis, consider the m → ∞ limit,
in which the empirical distribution of y1 , . . . , ym will approach k(x, ·), and
so the distribution of the chosen yi will approach k(x, ·)π(·)/ax for ax := Σy kxy πy . Thus for large m the transition matrix of MTM will approximate

p∞xy = (kxy πy /ax ) min(1, ax /ay ), y ≠ x.
To compare with single-try Metropolis P , rewrite both as
p∞xy = kxy πy min(1/ax , 1/ay ), y ≠ x
pxy = kxy πy min(1/πx , 1/πy ), y ≠ x.

Thinking of a step of the proposal chain as being in a random direction unrelated to the behavior of π, from a π-typical state x we expect a proposed
move to tend to make π decrease, so we expect ax < πx for π-typical x. In
this sense, the equations above show that MTM is an improvement. Of
course, if we judge “cost” in terms of the number of evaluations of πx , then
a step of MTM costs 2m−1 times the cost of single-step Metropolis. By this
criterion it seems implausible that MTM would be cheaper than single-step.
On the other hand one can envisage settings where there is substantial cost
in updating a data structure associated with the current state x, and in such
a setting MTM may be more appealing.

11.3.3 Multilevel sampling

Writing π(x) ∝ exp(−H(x)), as in the statistical physics imagery (section
11.1.2), suggests defining a one-parameter family of probability distributions
by
πθ (x) ∝ exp(−θH(x)).
(In the physics analogy, θ corresponds to 1/temperature). If π is multimodal
we picture πθ , as θ increases from 0 to 1, interpolating between the uniform
distribution and π by making the potential wells grow deeper. Fix a proposal
matrix K, and let Pθ be the transition matrix for the Metropolized chain
(11.5) associated with K and πθ . Now fix L and values 0 = θ1 < θ2 < . . . <
θL = 1. The idea is that for small θ the Pθ -chain should have less difficulty
moving between wells; for θ = 1 we get the correct distribution within each
well; so by varying θ we can somehow sample accurately from all wells.
There are several ways to implement this idea. Simulated tempering [254]
defines a chain on state space S × {1, . . . , L}, where state (x, i) represents
configuration x and parameter θi , and where each step is either of the form
• (x, i) → (x′, i); x → x′ a step of Pθi
or of the form
• (x, i) → (x, i′); where i → i′ is a proposed step of simple random walk on {1, 2, . . . , L}.
However, implementing this idea is slightly intricate, because normalizing
constants zθ enter into the desired acceptance probabilities. A more ele-
gant variation is the multilevel exchange chain suggested by Geyer [163] and
implemented in statistical physics by Hukushima and Nemoto [185]. First consider L independent chains, where the i'th chain Xt(i) has transition matrix Pθi . Then introduce an interaction; propose to switch configurations
X (i) and X (i+1) , and accept with the appropriate probability. Precisely,
take state space S L with states x = (x1 , . . . , xL ). Fix a (small) number
0 < α < 1.
• With probability 1 − α pick i uniformly from {1, . . . , L}, pick x′i according to Pθi (xi , ·) and update x by changing xi to x′i .
• With probability α, pick uniformly an adjacent pair (i, i + 1), and propose to update x by replacing (xi , xi+1 ) by (xi+1 , xi ). Accept this proposed move with probability

min(1, πθi (xi+1 ) πθi+1 (xi ) / (πθi (xi ) πθi+1 (xi+1 ))).
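A sketch of this exchange chain for a scalar configuration at each level, with πθ ∝ exp(−θH); note that, unlike simulated tempering, the normalizing constants zθ never enter (the two-well H below and all parameters are our own illustration):

```python
import numpy as np

def exchange_step(xs, H, thetas, l, alpha, rng):
    """One step of the multilevel exchange chain; level i targets the
    distribution proportional to exp(-thetas[i] * H(x))."""
    L = len(xs)
    if rng.uniform() > alpha:
        # within-level move: one step of the Metropolized chain P_theta_i
        i = rng.integers(L)
        prop = xs[i] + rng.uniform(-l, l)
        if np.log(rng.uniform()) < -thetas[i] * (H(prop) - H(xs[i])):
            xs[i] = prop
    else:
        # propose swapping the configurations of an adjacent pair (i, i+1);
        # log of the acceptance ratio simplifies to the expression below
        i = rng.integers(L - 1)
        log_q = (thetas[i] - thetas[i + 1]) * (H(xs[i]) - H(xs[i + 1]))
        if np.log(rng.uniform()) < log_q:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs

# illustrative two-well potential (ours): modes near -1 and +1
H = lambda x: 8.0 * (x * x - 1.0) ** 2
thetas = [0.1, 0.5, 1.0]
```

The point of the example: a single chain at θ = 1 would cross the barrier at x = 0 only with probability of order exp(−8), while swaps with the flatter θ = 0.1 level let the cold chain visit both wells.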

To check that the product π = πθ1 × . . . × πθL is indeed a stationary distribution, write the acceptance probability as min(1, q). If x and x′ differ only by interchange of (xi , xi+1 ) then

π(x) p(x, x′) / (π(x′) p(x′, x)) = (π(x) (α/(L−1)) min(1, q)) / (π(x′) (α/(L−1)) min(1, q⁻¹)) = (π(x)/π(x′)) q
and the definition of q makes the expression = 1. The case of steps where
only one component changes is easier to check.
11.3.4 Multiparticle MCMC

Consider the setting of section 11.2.2. There is a target distribution π on S
and a collection of subsets (Si ). Write π [i] = π(·|Si ) and I(x) = {i : x ∈ Si }.
Now fix m ≥ 2. We can use the line-sampling scheme of section 11.2.2
to define (recall Chapter 4 section 6.2) (yyy 10/11/94 version) a product
chain on S^m with stationary distribution π × π × . . . × π = π^m. For this
product chain, picture m particles, at each step picking a random particle
and making it move as a step from the line-sampling chain. Now let us
introduce an interaction: the line along which a particle moves may depend
on the positions of the other particles.
Here is a precise construction. Suppose that for each (x, x̂) ∈ S × S^{m−1} we are given a probability distribution w(·, x, x̂) on I(x) satisfying the following analog of (11.6):

if x, y ∈ Si then w(i, x, x̂) = w(i, y, x̂). (11.8)

A step of the chain from (xi ) is defined by

• Pick k uniformly from {1, 2, . . . , m}
• Pick i from w(·, xk , (xj , j ≠ k))
• Pick x′k from π [i] (·)
• Update (xj ) by replacing xk by x′k .

It is easy to check that π^m is indeed a stationary distribution; and the chain is irreducible under condition (11.7). Of course we could, as in section 11.3.1, use a Metropolis step instead of sampling from π [i] .
Constructions of this type in statistical applications on Rd go back to
Gilks et al [166], under the name adaptive directional sampling. In particular
they suggested picking a distinct pair (j, k) of the “particles” and taking the
straight line through xj and xk as the line to sample x′k from. Liu et al [236] suggest combining this idea with mode-hunting. Again pick a distinct pair (j, k) of “particles”; but now use some algorithm to find a local maximum m(xj ) of the target density starting from xj , and sample x′k from the line through xk and m(xj ).

11.4 A little theory

The chains designed for MCMC in previous sections are reversible, and
therefore the theory of reversible chains developed in this book is available.
Unfortunately there is very little extra to say – in that sense, there is no “theory of MCMC”. What follows is a collection of rather fragmentary observations.

11.4.1 Comparison methods

Consider the Metropolis chain

pMetro_xy = kxy min(1, (πy kyx )/(πx kxy )), y ≠ x.

The requirement that a step of a chain be constructible as a proposal from K followed by acceptance/rejection is the requirement that pxy ≤ kxy , y ≠ x. Recall the asymptotic variance rate

σ²(P, f ) := lim_{t→∞} t⁻¹ var Σ_{s=1}^{t} f (Xs ).

Lemma 11.1 (Peskun's Theorem [280]) Given K and π, let P be a reversible chain with pxy ≤ kxy , y ≠ x and with stationary distribution π. Then σ²(P, f ) ≥ σ²(P Metro , f ) ∀f .

Proof. Reversibility of P implies

pxy = πy pyx /πx ≤ πy kyx /πx = kxy (πy kyx )/(πx kxy )

and hence

pxy = pMetro_xy βxy

where βxy = βyx ≤ 1, y ≠ x. So the result follows directly from Peskun's lemma (yyy Lemma 11.5, to be moved elsewhere). 2
This result can be interpreted as saying that the Metropolis rates (11.5)
are the optimal way of implementing a proposal-rejection scheme. Loosely
speaking, a similar result holds in any natural Metropolis-like construction
of a reversible chain using a min(1, ·) acceptance probability.
It is important to notice that Lemma 11.1 does not answer the following
question, which (except for highly symmetric graphs) seems intractable.
Question. Given a connected graph and a probability distribution π on its
vertices, consider the class of reversible chains with stationary distribution
π and with transitions only across edges of the graph. Within that class,
which chain has smallest relaxation time?
Unfortunately, standard comparison theorems don't take us much further in comparing MCMC methods. To see why, consider Metropolis on Rd with isotropic Normal(0, σ²Id ) proposal steps. This has some relaxation time τ2 (f, σ), where f is the target density. For σ1 < σ2 , the Normal densities gσ (x) satisfy gσ2 (x)/gσ1 (x) ≥ (σ1 /σ2 )^d . So the comparison theorem (Chapter 3 Lemma 29) (yyy 9/2/94 version) shows

τ2 (f, σ1 ) ≥ (σ1 /σ2 )^d τ2 (f, σ2 ), σ1 < σ2 .

But this is no help in determining the optimal σ.

11.4.2 Metropolis with independent proposals

Though unrealistic in practical settings, the specialization of the Metropolis
chain to the case where the proposal chain is i.i.d., that is where kxy = ky ,
is mathematically a natural object of study. In this setting the transition
matrix (11.5) becomes

pxy = ky min(1, wy /wx ), y ≠ x

where wx := πx /kx . It turns out there is a simple and sharp coupling


analysis, based on the trick of labeling states as 1, 2, . . . , n so that w1 ≥ w2 ≥
. . . ≥ wn (Liu [234] used this trick to give an eigenvalue analysis, extending
part (b) below). Let ρ be the chance that a proposed step from state 1 is
rejected (count a proposed step from state 1 to 1 as always accepted). So
n
X
wi
ρ= ki (1 − w1 ) < 1.
i=1

Proposition 11.2 For the Metropolis chain over independent proposals, with states ordered as above,
(a) d̄(t) ≤ ρ^t
(b) The relaxation time τ2 = (1 − ρ)⁻¹ .

Proof. For the chain started at state 1, the time T of the first acceptance of
a proposed step satisfies
P (T > t) = ρ^t .
Recall from (yyy Chapter 4-3 section 1; 10/11/99 version) the notion of
coupling. For this chain a natural coupling is obtained by using the same
U (0, 1) random variable to implement the accept/reject step (accept if U <
P (accept)) in two versions of the chain. It is easy to check this coupling (Xt , X′t ) respects the ordering: if X0 ≤ X′0 then Xt ≤ X′t . At time T the fact that a proposed jump from 1 is accepted implies that a jump from any other state must be accepted. So T is a coupling time, and the coupling inequality (yyy Chapter 4-3 section 1.1; 10/11/99 version) implies d̄(t) ≤ P (T > t). This establishes (a), and the general inequality d̄(t) = Ω(λ2^t ) implies λ2 ≤ ρ.
On the other hand, for the chain started at state 1, on {T = 1} the time-1
distribution is π; in other words

P1 (X1 ∈ ·) = ρδ1 (·) + (1 − ρ)π(·).

But this says that ρ is an eigenvalue of P (corresponding to the eigenvector δ1 − π), establishing (b). 2
In the continuous-space setting, with a proposal distribution uniform on
[0, 1] and target density f with f ∗ := maxx f (x), part (b) implies the relax-
ation time τ2 equals f ∗ . So (unsurprisingly) Metropolis-over-independent is
comparable to the basic rejection sampling scheme (section 11.1.2), which
gives an exact sample in mean f ∗ steps.
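Proposition 11.2(b) is easy to check numerically on a small example (our own numbers; states are 0-indexed below, so "state 1" of the text is index 0 after the relabeling):

```python
import numpy as np

k = np.array([0.4, 0.3, 0.2, 0.1])           # independent proposal probabilities k_y
pi = np.array([0.1, 0.2, 0.3, 0.4])          # target distribution
w = pi / k                                   # w_x := pi_x / k_x
order = np.argsort(-w)                       # relabel so that w_1 >= w_2 >= ...
k, pi, w = k[order], pi[order], w[order]

n = len(pi)
P = np.zeros((n, n))
for x in range(n):
    for y in range(n):
        if y != x:
            P[x, y] = k[y] * min(1.0, w[y] / w[x])
    P[x, x] = 1.0 - P[x].sum()

rho = float(np.sum(k * (1.0 - w / w[0])))    # rejection chance from "state 1"
lambda2 = np.sort(np.linalg.eigvals(P).real)[-2]
# Proposition 11.2(b) asserts lambda2 = rho, i.e. tau2 = 1/(1 - rho)
```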

11.5 The diffusion heuristic for optimal scaling of high-dimensional Metropolis
In any Metropolis scheme for sampling from a target distribution on Rd ,
there arises the question of how large to take the steps of the proposal chain.
One can answer this for isotropic-proposal schemes in high dimensions, in
the setting where the target is a product distribution, and the result in
this (very artificial) setting provides a heuristic for more realistic settings
exemplified by (11.1).

11.5.1 Optimal scaling for high-dimensional product distribution sampling

Fix a probability density function f on R1 . For large d consider the i.i.d. product distribution πf (dx) = Π_{i=1}^{d} f (xi ) dxi for x = (xi ) ∈ Rd . Suppose we want to sample from πf using Metropolis or Gibbs; what is the optimal

scaling (as a function of d) for the step size of the proposal chain, and how
does the relaxation time scale?
For the Gibbs sampler this question is straightforward. Consider the one-
dimensional case, and take the proposal step increments to be Normal(0, σ 2 ).
Then (under technical conditions on f – we omit technical conditions here
and in Theorem 11.3) the Gibbs chain will have some finite relaxation time
depending on f and σ, and choosing the optimal σ* gives a relaxation time τ2 (f ), say. The Gibbs sampler chain in which we choose a random coordinate and propose changing only that coordinate (using the optimal σ* above) is a product chain in the sense of Chapter 4 section 6.2 (yyy 10/11/94 version), and so the relaxation time of this product chain is τ2^Gibbs (f ) = τ2 (f ) · d.
Though the argument above is very simple, it is unsatisfactory because
there is no simple expression for relaxation time as a function of σ or for the
optimal σ ∗ . It turns out that this difficulty is eliminated in the isotropic-
proposal Metropolis chain. In the Gibbs sampler above, the variance of the
length of a proposed step is σ^2, so we retain this property by specifying the steps of the proposal chain to have Normal(0, σ^2 d^{-1} I_d) distribution. One
expects the relaxation time to grow linearly in d in this setting also. The
following result of Roberts et al [293] almost proves this, and has other useful
corollaries.

Theorem 11.3 Fix σ > 0. Let (X(t), t = 0, 1, 2, . . .) be the Metropolis


chain for sampling from product measure π_f on R^d based on a proposal random walk with step distribution Normal(0, σ^2 d^{-1} I_d). Write X^(1)(t) for the first coordinate of X(t), and let Y_d(t) := X^(1)(⌊td⌋) be this coordinate process speeded up by a factor d, for continuous 0 ≤ t < ∞. Suppose X(0) has the stationary distribution π_f. Then

(Y_d(t), 0 ≤ t < ∞) → (Y(t), 0 ≤ t < ∞) in distribution as d → ∞    (11.9)

where the limit process is the stationary one-dimensional diffusion

dY_t = θ^{1/2} dW_t + θ µ(Y_t) dt    (11.10)

for standard Brownian motion W_t, where

µ(y) := f'(y) / (2 f(y))
θ := 2σ^2 Φ(−σκ/2), where Φ is the Normal distribution function
κ := ( ∫ (f'(x))^2 / f(x) dx )^{1/2}.

Moreover, as d → ∞ the proportion of accepted proposals in the stationary


chain tends to 2Φ(−σκ/2).

We outline the proof in section 11.5.3. The result may look complicated, so
one piece of background may be helpful. Given a probability distribution on
the integers, there is a Metropolis chain for sampling from it based on the
CHAPTER 11. MARKOV CHAIN MONTE CARLO (JANUARY 8 2001)

simple random walk proposal chain. As a continuous-space analog, given a


density f on R^1 there is a “Metropolis diffusion” with stationary density f based on θ^{1/2} W_t (for arbitrary constant θ) as “proposal diffusion”, and this
Metropolis diffusion is exactly the diffusion (11.10): see Notes to (yyy final
Chapter).
Thus the appearance of the limit diffusion Y is not unexpected; what is
important is the explicit formula for θ in terms of σ and f . Note that the
parameter θ affects the process (Y_t) only as a speed parameter. That is, if Y*_t is the process (11.10) with θ = 1 then the general process can be represented as Y_t = Y*_{θt}. In particular, the relaxation time scales as τ_2(Y) = θ^{-1} τ_2(Y*). Thus we seek to maximize θ as a function of the underlying step scale σ, and a simple numerical calculation shows θ is maximized by taking σ = 2.38/κ, giving θ = 1.3/κ^2.
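The numerical calculation just mentioned is easy to reproduce. The following sketch (ours, not from the text; κ = 1, and the helper names are our own) maximizes θ(σ) = 2σ^2 Φ(−σκ/2) on a grid, recovering σ* ≈ 2.38, θ* ≈ 1.3 and the acceptance proportion 2Φ(−σ*κ/2) ≈ 0.23:

```python
import math

def Phi(x):
    # standard Normal distribution function via the complementary error function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def theta(sigma, kappa=1.0):
    # the speed parameter of Theorem 11.3
    return 2.0 * sigma * sigma * Phi(-sigma * kappa / 2.0)

# crude grid search for the maximizing sigma (take kappa = 1)
sigmas = [i / 1000.0 for i in range(1, 6000)]
sigma_star = max(sigmas, key=theta)
theta_star = theta(sigma_star)
acceptance = 2.0 * Phi(-sigma_star / 2.0)
print(sigma_star, theta_star, acceptance)
```

For general κ the maximizer simply rescales to 2.38/κ, since θ(σ; κ) = θ(σκ; 1)/κ^2.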
Thus Theorem 11.3 suggests that for the Metropolis chain X, the optimal proposal-step variance is 2.38^2 κ^{-2} d^{-1} I_d, and suggests that the relaxation time τ_2(f, d) scales as

τ_2(f, d) ∼ (d κ^2 / 1.3) τ_2(Y*).    (11.11)
In writing (11.11) we are pretending that the Metropolis chain is a product
chain (so that its relaxation time is the relaxation time of its individual
components) and that relaxation time can be passed to the limit in (11.9).
Making a rigorous proof of (11.11) seems hard.

11.5.2 The diffusion heuristic.


Continuing the discussion above, Theorem 11.3 says that the long-run pro-
portion of proposed moves which are accepted is 2Φ(−κσ/2). At the optimal
value σ = 2.38/κ we find this proportion is a “pure number” 0.23, which
does not depend on f . To quote [293]

This result gives rise to the useful heuristic for random walk
Metropolis in practice:
Tune the proposal variance so that the average acceptance
rate is roughly 1/4.

We call this the diffusion heuristic for proposal-step scaling. Intuitively one might hope that the heuristic would be effective for fairly general unimodal target densities on R^d, though it clearly has nothing to say about the problem of passage between modes in a multimodal target. Note also that to
invoke the diffusion heuristic in a combinatorial setting, where the proposal

chain is random walk on a graph, one needs to assume that the target dis-
tribution is “smooth” in the sense that π(v)/π(w) ≈ 1 for a typical edge
(v, w). In this case one can make a Metropolis chain in which the pro-
posal chain jumps σ edges in one step, and seek to optimize σ. See Roberts
[292] for some analysis in the context of smooth distributions on the d-cube.
However, such smoothness assumptions seem inapplicable to most practical
combinatorial MCMC problems.
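The roughly-1/4 rule can be watched in the artificial product setting itself. Here is an illustrative simulation (ours, not from the text): a Metropolis chain for the standard Normal product target, so κ = 1 and the optimal scaling is σ = 2.38, with proposal step standard deviation σ/√d:

```python
import math, random

random.seed(0)
d = 50
sigma = 2.38                    # optimal scaling when kappa = 1
step_sd = sigma / math.sqrt(d)  # proposal increments Normal(0, sigma^2/d)

def logf(x):
    # log-density of the standard Normal target, up to an additive constant
    return -0.5 * x * x

x = [random.gauss(0.0, 1.0) for _ in range(d)]   # start in stationarity
accepts, steps = 0, 20000
for _ in range(steps):
    y = [xi + random.gauss(0.0, step_sd) for xi in x]
    log_ratio = sum(logf(yi) - logf(xi) for xi, yi in zip(x, y))
    if log_ratio >= 0 or random.random() < math.exp(log_ratio):
        x, accepts = y, accepts + 1

rate = accepts / steps
print(rate)   # near 2*Phi(-sigma/2) ~ 0.23 for large d
```

For finite d the observed rate is only approximately 0.23; the point of the heuristic is that tuning σ to hit this rate needs no knowledge of κ.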

11.5.3 Sketch proof of Theorem 11.3


Write a typical step of the proposal chain as

(x1 , x2 , . . . , xd ) → (x1 + ξ1 , x2 + ξ2 , . . . , xd + ξd ).

Write

J = log( f(x_1 + ξ_1) / f(x_1) );   S = log ∏_{i=2}^d f(x_i + ξ_i) / f(x_i).

The step is accepted with probability min(1, ∏_{i=1}^d f(x_i + ξ_i)/f(x_i)) = min(1, e^{J+S}). So the increment of the first coordinate of the Metropolis chain has mean and mean-square E ξ_1 min(1, e^{J+S}) and E ξ_1^2 min(1, e^{J+S}). The essential issue in the proof is to show that, for “typical” values of (x_2, . . . , x_d),

E ξ_1 min(1, e^{J+S}) ∼ θ µ(x_1)/d    (11.12)
E ξ_1^2 min(1, e^{J+S}) ∼ θ/d.    (11.13)

This identifies the asymptotic drift and variance rates of Yd (t) with those of
Y (t).
Write h(u) := E min(1, e^{u+S}). Since

J ≈ log( 1 + (f'(x_1)/f(x_1)) ξ_1 ) ≈ (f'(x_1)/f(x_1)) ξ_1 = 2µ(x_1) ξ_1,

the desired estimates (11.12,11.13) can be rewritten as

E J h(J) ∼ 2θ µ^2(x_1)/d    (11.14)
E J^2 h(J) ∼ 4θ µ^2(x_1)/d.    (11.15)

Now if J has Normal(0, β^2) distribution then for sufficiently regular h(·) we have

E J h(J) ∼ β^2 h'(0);   E J^2 h(J) ∼ β^2 h(0) as β → 0.

Since J has approximately Normal(0, 4µ^2(x_1) var ξ_1) = Normal(0, 4µ^2(x_1)σ^2/d) distribution, proving (11.14, 11.15) reduces to proving

h'(0) → θ/(2σ^2)    (11.16)
h(0) → θ/σ^2.    (11.17)
We shall argue

dist(S) is approximately Normal(−κ^2 σ^2/2, κ^2 σ^2).    (11.18)

Taking the first two terms in the expansion of log(1 + u) gives

log( f(x_i + ξ_i)/f(x_i) ) ≈ (f'(x_i)/f(x_i)) ξ_i − (1/2) (f'(x_i)/f(x_i))^2 ξ_i^2.

Write K(x) = d^{-1} Σ_{i=2}^d (f'(x_i)/f(x_i))^2. Summing the previous approximation over i, the first sum on the right has approximately Normal(0, σ^2 K(x)) distribution, and (using the weighted law of large numbers) the second term is approximately −(1/2) σ^2 K(x). So the distribution of S is approximately Normal(−K(x)σ^2/2, K(x)σ^2). But by the law of large numbers, for a typical x drawn from the product distribution π_f we have K(x) ≈ κ^2, giving (11.18).
To argue (11.17) we pretend S has exactly the Normal distribution at (11.18). By a standard formula, if S has Normal(α, β^2) distribution then

E min(1, e^S) = Φ(α/β) + e^{α+β^2/2} Φ(−β − α/β).

This leads to

h(0) = 2Φ(−κσ/2)

which verifies (11.17). From the definition of h(u) we see

h'(0) = E e^S 1(S≤0) = h(0) − P(S ≥ 0) = h(0) − Φ(−κσ/2) = Φ(−κσ/2)

which verifies (11.16).
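The standard formula for E min(1, e^S) used above is easy to spot-check by Monte Carlo. The sketch below (ours; κ = 1 and σ = 2 chosen arbitrarily) compares the closed form against simulation for the Normal distribution (11.18), and confirms the specialization h(0) = 2Φ(−κσ/2):

```python
import math, random

def Phi(x):
    # standard Normal distribution function
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def emin_exact(alpha, beta):
    # E min(1, e^S) for S ~ Normal(alpha, beta^2)
    return Phi(alpha / beta) + math.exp(alpha + beta * beta / 2.0) * Phi(-beta - alpha / beta)

random.seed(1)
kappa, sigma = 1.0, 2.0
alpha, beta = -(kappa * sigma) ** 2 / 2.0, kappa * sigma   # the distribution (11.18)
n = 400000
mc = sum(min(1.0, math.exp(random.gauss(alpha, beta))) for _ in range(n)) / n
exact = emin_exact(alpha, beta)
print(mc, exact, 2.0 * Phi(-kappa * sigma / 2.0))   # last two agree exactly
```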

11.6 Other theory


11.6.1 Sampling from log-concave densities
As mentioned in Chapter 9 section 5.1 (yyy version 9/1/99) there has been intense theoretical study of the problem of sampling uniformly from a convex set in R^d, in the d → ∞ limit. This problem turns out to be essentially equivalent to the problem of sampling from a log-concave density f, that is a density of the form f(x) ∝ exp(−H(x)) for convex H. The results are not easy to state; see Bubley et al [80] for discussion.

11.6.2 Combining MCMC with slow exact sampling


Here is a special setting in which one can make rigorous inferences from
MCMC without rigorous bounds on mixing times. Suppose we have a guess
τ̂ at the relaxation time of a Markov sampler from a target distribution π;
suppose we have some separate method of sampling exactly from π, but
where the cost of one exact sample is larger than the cost of τ̂ steps of the
Markov sampler. In this setting it is natural to take m exact samples and
use them as initial states of m multiple runs of the Markov sampler. It turns
out (see [19] for precise statement) that one can obtain confidence intervals
for a mean ḡ which are always rigorously correct (without assumptions on
τ2 ) and which, if τ̂ is indeed approximately τ2 , will have optimal length,
that is the length which would be implied by this value of τ2 .

11.7 Notes on Chapter MCMC


Liu [235] provides a nice combination of examples and carefully-described
methodology in MCMC, emphasizing statistical applications but also cov-
ering some statistical physics. Other statistically-oriented books include
[91, 165, 291]. We should reiterate that most MCMC “design” ideas origi-
nated in statistical physics; see the extensive discussion by Sokal [311]. Neal
[267] focuses on neural nets but contains useful discussion of MCMC vari-
ants.
Section 11.1.1. In the single-run setting, the variance of sample means
(11.3) could be estimated by classical methods of time series [63].
The phrase metastability error is our coinage – though the idea is stan-
dard, there seems no standard phrase.
Elaborations of the multiple-runs method are discussed by Gelman and
Rubin [161]. The applied literature has paid much attention to diagnostics:
for reviews see Cowles and Carlin [102] or Robert [290].
Section 11.1.2. Devroye [109] gives the classical theory of sampling from
one-dimensional and other specific distributions.
Section 11.2.1. The phrase “Metropolis algorithm” is useful shorthand
for “MCMC sampling, where the Markov chain is based on a proposal-
acceptance scheme like those in section 11.2.1”. The idea comes from the

1953 paper by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller [262]


in the context of statistical physics, and the variant with general proposal
matrix is from the 1970 paper of Hastings [180]. Of course the word “algorithm” means a definite rule for attaining some goal; the arbitrariness of
proposal matrix, and vagueness about when to stop, makes it an extreme
stretch to use the word for the Metropolis scheme.
The map K → P in the Metropolis-Hastings construction (11.5) has
an interpretation as a minimum-length projection in a certain L1 space of
matrices – see Billera and Diaconis [50].
Section 11.2.2. The Gibbs sampler was popularized in 1984 by Geman
and Geman [162] in the context of Bayesian image analysis. The idea is older
in statistical physics, under the name heat bath. Hit-and-run was introduced
in 1984 by Smith [310]. General line-sampling schemes go back to Goodman
and Sokal [170].
Section 11.3.1. Terminology for this type of construction is not standard.
What we call “Metropolized line sampling” is what Besag and Greene [46]
call an auxiliary variable construction, and this type of construction goes
back to Edwards and Sokal [139] in statistical physics.
Section 11.3.2. One can also define MTM using a general proposal ma-
trix K [236], though (in contrast to Metropolis) the specialization of the
general case to the symmetric case is different from the symmetric case de-
scribed in the text. Liu et al [236] discuss the use of MTM as an ingredient
in other variations of MCMC.
Other MCMC variations. In statistical physics, it is natural to think of
particles in Rd having position and velocity. This suggests MCMC schemes
in which velocity is introduced as an auxiliary variable. In particular one
can use deterministic equations of motion to generate proposal steps for
Metropolis, an idea called hybrid Monte Carlo – see Neal [268].
Section 11.4. The survey by Diaconis and Saloff-Coste [121] has fur-
ther pieces of theory, emphasizing the low-dimensional discrete setting. For
target densities on Rd one needs some regularity conditions to ensure τ2 is
finite; see Roberts and Tweedie [294] for results of this type.
Section 11.4.1. As background to Peskun’s theorem, one might think
(by vague physical analogy) that it would be desirable to have acceptance
probabilities behave as some “smooth” function; e.g. in the symmetric-proposal case, instead of min(1, π_y/π_x) take π_y/(π_x + π_y). Lemma 11.1 shows this
intuition is wrong, at least using asymptotic variance rate or relaxation
time as a criterion. Liu [235] section 12.3 gives further instances where
Peskun’s Theorem can be applied. As usual, it is hard to do such comparison
arguments for τ1 .

Section 11.4.2. The coupling here is an instance of a one-dimensional


monotone coupling, which exists for any stochastically monotone chain.
Section 11.5.2. Discussion of practical aspects of the diffusion heuristic
can be found in Roberts et al [160], and discussion in the more complicated
setting of Gibbs distributions of (X_v; v ∈ Z^d) is in Breyer and Roberts [60].

11.8 Belongs in other chapters


yyy: add to what’s currently sec. 10.2 of Chapter 2, version 9/10/99, but
which may get moved to the new Chapter 8.
Where π does not vary with the parameter α we get a simple expression for (d/dα)Z.

Lemma 11.4 In the setting of (yyy Chapter 2 Lemma 37), suppose π does
not depend on α. Then
(d/dα) Z = Z R Z.

xxx JF: I see this from the series expansion for Z – what to do about a
proof, I delegate to you!

11.8.1 Pointwise ordered transition matrices


yyy: belongs somewhere in Chapter 3.
Recall from Chapter 2 section 3 (yyy 9/10/99 version) that for a function f : S → R with Σ_i π_i f_i = 0, the asymptotic variance rate is

σ^2(P, f) := lim_t t^{-1} var Σ_{s=1}^t f(X_s) = f Γ f    (11.19)

where Γ_ij = π_i Z_ij + π_j Z_ji + π_i π_j − π_i δ_ij. These individual-function variance


rates can be compared between chains with the same stationary distribution,
under a very strong “coordinatewise ordering” of transition matrices.

Lemma 11.5 (Peskun’s Lemma [280]) Let P and Q be reversible with the same stationary distribution π. Suppose p_ij ≤ q_ij ∀ j ≠ i. Then σ^2(P, f) ≥ σ^2(Q, f) for all f with Σ_i π_i f_i = 0.

Proof. Introduce a parameter 0 ≤ α ≤ 1 and write P_α = (1 − α)P + αQ. Write (·)' for (d/dα)(·) at α = 0. It is enough to show

(σ^2(P, f))' ≤ 0.

By (11.19)

(σ^2(P, f))' = f Γ' f = 2 Σ_i Σ_j f_i π_i z'_ij f_j.

By (yyy Lemma 11.4 above) Z' = Z P' Z. By setting

g_i = π_i f_i;   a_ij = z_ij/π_j;   w_ij = π_i p_ij

we can rewrite the equality above as

(σ^2(P, f))' = 2 g A W' A g.

Since A is symmetric with row-sums equal to zero, it is enough to show that W' is non-positive definite. By hypothesis W' is symmetric and w'_ij ≥ 0 for j ≠ i. These properties imply that, ordering states arbitrarily, we may write

W' = Σ_{i<j} w'_ij M_ij

where M_ij is the matrix whose only non-zero entries are m(i, i) = m(j, j) = −1; m(i, j) = m(j, i) = 1. Plainly M_ij is non-positive definite (its quadratic form is −(x_i − x_j)^2), hence so is W', and hence (σ^2(P, f))' = 2 g A W' A g ≤ 0.
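Peskun’s Lemma can be checked numerically on a small example. The sketch below (ours, not from the text; exact rational arithmetic, with Q the random walk on a triangle and P its lazy version, so p_ij ≤ q_ij off the diagonal) computes σ^2(P, f) and σ^2(Q, f) from (11.19) via the fundamental matrix:

```python
from fractions import Fraction as F

def mat_inv(M):
    # Gauss-Jordan inverse for a small matrix of Fractions
    n = len(M)
    A = [list(M[i]) + [F(int(i == j)) for j in range(n)] for i in range(n)]
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]

def asymptotic_var(P, pi, f):
    # sigma^2(P,f) = f Gamma f, Gamma_ij = pi_i Z_ij + pi_j Z_ji + pi_i pi_j - pi_i delta_ij.
    # We use Z = (I - P + 1 pi^T)^{-1}; the extra 1 pi^T term is harmless for centered f.
    n = len(pi)
    Z = mat_inv([[F(int(i == j)) - P[i][j] + pi[j] for j in range(n)] for i in range(n)])
    s = F(0)
    for i in range(n):
        for j in range(n):
            G = pi[i] * Z[i][j] + pi[j] * Z[j][i] + pi[i] * pi[j] - (pi[i] if i == j else F(0))
            s += f[i] * G * f[j]
    return s

half = F(1, 2)
Q = [[F(0), half, half], [half, F(0), half], [half, half, F(0)]]   # walk on a triangle
P = [[Q[i][j] / 2 + (half if i == j else F(0)) for j in range(3)] for i in range(3)]  # lazy version
pi = [F(1, 3)] * 3                      # common stationary distribution
f = [F(1), F(0), F(-1)]                 # centered: sum_i pi_i f_i = 0
sP, sQ = asymptotic_var(P, pi, f), asymptotic_var(Q, pi, f)
print(sP, sQ)   # 10/9 and 2/9
```

The lazy chain P has the larger asymptotic variance, as the lemma predicts.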
Chapter 12

Coupling Theory and Examples (October 11, 1999)

xxx This is intended as a section in a Chapter near Chapter 4; maybe


a new Chapter consisting of this and another section on bounding τ2 via
distinguished paths. Need some preliminary discussion, e.g. on
relations between τ1 and τ2 ;
observe that τ2 is tied to reversibility whereas coupling isn’t.

12.1 Using coupling to bound variation distance


Recall from Chapter 2 section 4.1 (yyy 9/10/99 version) several ways in
which variation distance is used to measure the deviation from stationarity
of the time-t distribution of a Markov chain:

d_i(t) := ||P_i(X_t = ·) − π(·)||
d(t) := max_i d_i(t)
d̄(t) := max_{i,j} ||P_i(X_t = ·) − P_j(X_t = ·)||.

Recall also from yyy the definition of variation threshold time


τ_1 := min{t : d̄(t) ≤ e^{-1}}.

Since τ_1 is affected by continuization, when using the above definition with a discrete-time chain we write τ_1^disc for emphasis.
Coupling provides a methodology for seeking to upper bound d̄(t) and hence τ_1. After giving the (very simple) theory in section 12.1.1 and some


discussion in section 12.1.2, we proceed to give a variety of examples of


its use. We shall say more about theory and applications of coupling in
Chapter 8 (where we discuss the related idea of coupling from the past) and
in Chapter 10 (on interacting random walks).

12.1.1 The coupling inequality


Consider a finite-state chain in discrete or continuous time. Fix states i, j. Suppose we construct a coupling, that is a joint process ((X_t^(i), X_t^(j)), t ≥ 0) such that

(X_t^(i), t ≥ 0) is distributed as the chain started at i
(X_t^(j), t ≥ 0) is distributed as the chain started at j.    (12.1)

And suppose there is a random time T^ij ≤ ∞ such that

X_t^(i) = X_t^(j), T^ij ≤ t < ∞.    (12.2)

Call such a T^ij a coupling time. Then the coupling inequality is

||P_i(X_t ∈ ·) − P_j(X_t ∈ ·)|| ≤ P(T^ij > t), 0 ≤ t < ∞.    (12.3)

The inequality holds because

||P_i(X_t ∈ ·) − P_j(X_t ∈ ·)|| = ||P(X_t^(i) ∈ ·) − P(X_t^(j) ∈ ·)||
≤ P(X_t^(i) ≠ X_t^(j))
≤ P(T^ij > t).

12.1.2 Comments on coupling methodology


The coupling inequality provides a method of bounding the variation distance d̄(t), because if we can construct a coupling for an arbitrary pair (i, j) of initial states then

d̄(t) ≤ max_{i,j} P(T^ij > t).

The reader may wish to look at a few of the examples before reading this
section in detail.
In applying coupling methodology there are two issues. First we need
to specify the coupling, then we need to analyze the coupling time. The
most common strategy for constructing couplings is via Markov couplings,
as follows. Suppose the underlying chain has state space I and (to take

the continuous-time case) transition rate matrix Q = (q(i, k)). Consider a transition rate matrix Q̃ on the product space I × I. Write the entries of Q̃ as q̃(i, j; k, l) instead of the logical-but-fussy q̃((i, j), (k, l)). Suppose that, for each pair (i, j) with j ≠ i,

q̃(i, j; ·, ·) has marginals q(i, ·) and q(j, ·)    (12.4)

in other words Σ_l q̃(i, j; k, l) = q(i, k) and Σ_k q̃(i, j; k, l) = q(j, l). And suppose that

q̃(i, i; k, k) = q(i, k) for all k
q̃(i, i; k, l) = 0 for l ≠ k.

Take (X_t^(i), X_t^(j)) to be the chain on I × I with transition rate matrix Q̃ and initial position (i, j). Then (12.1) must hold, and T^ij := min{t : X_t^(i) = X_t^(j)} is a coupling time. This construction gives a natural Markov coupling, and all the examples where we use the coupling inequality will be of this form. In practice it is much more understandable to define the joint process in words, and we usually do so.
In constructing and analyzing couplings, we often exploit (explicitly or implicitly) some integer-valued metric ρ(i, j) on the state space I. Then with a Markovian coupling,

P(T^ij > t) = P(ρ(X_t^(i), X_t^(j)) ≥ 1)

and it is enough to study the integer-valued process Z_t := ρ(X_t^(i), X_t^(j)). Typically (Z_t) is not Markov, but one can try to compare it with some integer-valued Markov process (Z*_t). Indeed, in defining the coupling one has in mind trying to make such a comparison possible. Often one shows that for any initial (i, j) the random time T^ij is stochastically smaller than the hitting time T*_{a0} for the comparison chain (Z*_t) to reach 0 starting from a := max_{i,j} ρ(i, j). This would imply

d̄(t) ≤ P(T*_{a0} > t).

Finally, one does calculations with the integer-valued chain (Z*_t), either bounding the tail probability P(T*_{a0} > t) directly or (what is often simpler) just bounding the expectation E T*_{a0}, so that by Markov’s inequality and the submultiplicativity property (Chapter 2 Lemma 20) (yyy 9/10/99 version) we have in continuous time

τ_1 ≤ e E T*_{a0};   d̄(t) ≤ exp(1 − t/τ_1).

Here is perhaps the simplest comparison lemma, whose proof is left to the reader.

Lemma 12.1 (Decreasing functional lemma) Let (Y_t) be a Markov chain on S and f : S → {0, 1, 2, . . . , ∆} a function. Suppose that for each 1 ≤ i ≤ ∆ and each initial state y with f(y) = i,
(i) f(Y_1) ≤ i;
(ii) P(f(Y_1) ≤ i − 1) ≥ a_i > 0.
Then

max_{y∈S} E_y T_A ≤ Σ_{i=1}^∆ 1/a_i

where A := {y : f(y) = 0}.
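The lemma can be illustrated on a toy chain of our own construction: a chain on {0, . . . , 5} that can only hold or move down, with f the identity. Solving the one-step recurrence gives the exact mean hitting times of 0, which indeed respect the bound Σ_i 1/a_i:

```python
# Chain on {0,...,5} that can only hold or move down; take f(y) = y.
# From i >= 1: step to i-1 w.p. b[i], to max(i-2, 0) w.p. c[i], else hold.
# The lemma's hypotheses hold with a_i = b[i] + c[i].
Delta = 5
b = {i: 0.2 + 0.05 * i for i in range(1, Delta + 1)}
c = {i: 0.1 for i in range(1, Delta + 1)}
a = {i: b[i] + c[i] for i in range(1, Delta + 1)}

# exact mean hitting times of 0, from E_i = 1 + b_i E_{i-1} + c_i E_{max(i-2,0)} + (1 - a_i) E_i
E = {0: 0.0}
for i in range(1, Delta + 1):
    E[i] = (1.0 + b[i] * E[i - 1] + c[i] * E[max(i - 2, 0)]) / a[i]

bound = sum(1.0 / a[i] for i in range(1, Delta + 1))
print(max(E.values()), bound)   # lemma: max_y E_y T_0 <= sum_i 1/a_i
```

The slack in the bound comes from the two-step moves, which the lemma ignores.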

We now start a series of examples. Note that when presenting a coupling


proof we don’t need to explicitly check irreducibility, because the conclusion
of a bound on coupling time obviously implies irreducibility.

12.1.3 Random walk on a dense regular graph


(Chapter 5 Example 16).
Consider an r-regular n-vertex graph. Write N (v) for the set of neighbors
of v. For any pair v, w we can define a 1 − 1 map θv,w : N (v) → N (w) such
that θv,w (x) = x for x ∈ N (v) ∩ N (w). Consider discrete-time random
walk on the graph. We define a “greedy coupling” by specifying the joint
transition matrix

p̃(v, w; x, θv,w (x)) = 1/r, x ∈ N (v).

That is, from vertices v and w, if the first chain jumps to x then the sec-
ond chain jumps to θv,w (x), and we maximize the chance of the two chains
meeting after a single step. In general one cannot get useful bounds on the
coupling time. But consider the dense case, where r > n/2. As observed
in Chapter 5 Example 16, here |N(v) ∩ N(w)| ≥ 2r − n and so the coupled process (X_t, Y_t) has the property that for w ≠ v

P(X_{t+1} = Y_{t+1} | X_t = v, Y_t = w) = |N(v) ∩ N(w)|/r ≥ (2r − n)/r

implying that the coupling time T (for any initial pair of states) satisfies

P(T > t) ≤ ((n − r)/r)^t.

So the coupling inequality implies d̄(t) ≤ ((n − r)/r)^t. In particular the variation threshold satisfies

τ_1^disc = O(1) as n → ∞, r/n → α > 1/2.

12.1.4 Continuous-time random walk on the d-cube


(Chapter 5 Example 15).
For i = (i1 , . . . , id ) and j = (j1 , . . . , jd ) in I = {0, 1}d , let D(i, j) be the
set of coordinates u where i and j differ. Write iu for the state obtained
by changing the u’th coordinate of i. Recall that in continuous time the
components move independently as 2-state chains with transition rates 1/d.
Define a coupling in words as “run unmatched coordinates independently
until they match, and then run them together”. Formally, the non-zero
transitions of the joint process are

q̃(i, j; i_u, j_u) = 1/d if u ∉ D(i, j)
q̃(i, j; i_u, j) = 1/d if u ∈ D(i, j)
q̃(i, j; i, j_u) = 1/d if u ∈ D(i, j).

For each coordinate which is initially unmatched, it takes exponential (rate 2/d) time until it is matched, and so the coupling time T = T^ij is distributed as

max(ξ_1, . . . , ξ_{d_0})

where the (ξ_u) are independent exponential (rate 2/d) and d_0 = |D(i, j)| is the initial number of unmatched coordinates. So

P(T ≤ t) = (1 − exp(−2t/d))^{d_0}

and the coupling inequality bounds variation distance as

d̄(t) ≤ 1 − (1 − exp(−2t/d))^d.

This leads to an upper bound on the variation threshold time

τ_1 ≤ (1/2 + o(1)) d log d as d → ∞.
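The step from the variation bound to this threshold estimate can be checked numerically. The sketch below (ours) finds the smallest t with 1 − (1 − e^{−2t/d})^d ≤ e^{−1} on a fine grid and compares it with (1/2) d log d:

```python
import math

def coupling_bound(t, d):
    # variation bound from the coordinate coupling: d_bar(t) <= 1 - (1 - e^{-2t/d})^d
    return 1.0 - (1.0 - math.exp(-2.0 * t / d)) ** d

def tau1_upper(d):
    # smallest t on a fine grid with coupling_bound(t, d) <= 1/e
    t, step = 0.0, d / 1000.0
    while coupling_bound(t, d) > math.exp(-1.0):
        t += step
    return t

for d in (64, 256, 1024, 4096):
    print(d, tau1_upper(d) / (0.5 * d * math.log(d)))  # ratios decrease toward 1
```

The convergence is slow (the correction is of order d, against the leading term d log d), which is consistent with the (1/2 + o(1)) form.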

This example is discussed in more detail in Chapter 5 Example 15 (yyy 4/23/96 version) where it is shown that

τ_1 ∼ (1/4) d log d as d → ∞

so the coupling bound is off by a factor of 2.



12.1.5 The graph-coloring chain


Fix an n-vertex graph with maximal degree r. Fix an integer c ≥ r + 2 and
consider [c] := {1, 2, . . . , c} as a set of c colors. Let col(G, c) be the set
of c-colorings of G, where a c-coloring is an assignment of a color to each
vertex, in such a way that no two adjacent vertices have the same color.
One can put a natural “product graph” structure on col(G, c), in which two
colorings are adjacent if they differ at only one vertex. It is not hard to
check that the condition c ≥ r + 2 ensures that col(G, c) is non-empty and
the associated graph is connected. There is a natural discrete-time Markov
chain on col(G, c):

Pick a vertex v of G uniformly at random, pick a color γ uni-


formly at random, assign color γ to vertex v if feasible (i.e. if no
neighbor of v has color γ), else retain existing color of v.

Under certain conditions a simple coupling analysis succeeds in bounding


the mixing time. (The bound is far from sharp – see Notes).

Proposition 12.2 If c > 4r then d̄(t) ≤ n exp(−(c − 4r)t/(cn)) and so τ_1^disc ≤ 1 + (cn/(c − 4r))(1 + log n).

Proof. We couple two versions of the chain by simply using the same v and
γ in both chains at each step. Write Dt for the number of vertices at which
the colors in the two chains differ. Then Dt+1 − Dt ∈ {−1, 0, 1} and the key
estimate is the following.

Lemma 12.3 Conditional on the state of the coupled process at time t,

P(D_{t+1} = D_t + 1) ≤ 2rD_t/(cn)    (12.5)
P(D_{t+1} = D_t − 1) ≥ (c − 2r)D_t/(cn).    (12.6)
Proof. In order that Dt+1 = Dt + 1 it is necessary that the chosen pair (v, γ)
is such that

(*) there exists a neighbor (w, say) of v such that w has color γ
in one chain but not in the other chain.

But the total number of pairs (v, γ) equals nc while the number of pairs
satisfying (*) is at most Dt · 2r. This establishes (12.5). Similarly, for
Dt+1 = Dt − 1 it is sufficient that v is currently unmatched and that no

neighbor of v in either chain has color γ; the number of such pairs (v, γ) is at least D_t · (c − 2r). □
Lemma 12.3 implies E(D_{t+1} − D_t | D_t) ≤ −(c − 4r)D_t/(cn) and so

E D_{t+1} ≤ κ E D_t;   κ := 1 − (c − 4r)/(cn).

Since D_0 ≤ n we have, for any initial pair of states,

P(D_t ≥ 1) ≤ E D_t ≤ κ^t n ≤ n exp(−(c − 4r)t/(cn))

and the coupling lemma establishes the Proposition.
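The coupled coloring chains are easy to simulate. The sketch below (our toy instance, not from the text: the 8-cycle, so r = 2, with c = 9 > 4r colors) drives two proper colorings with the same (v, γ) choices and records the coalescence time:

```python
import random

random.seed(2)
n = 8                                   # cycle C_8: r = 2
colors = list(range(9))                 # c = 9 > 4r
neighbors = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def step(coloring, v, gamma):
    # recolor v with gamma if no neighbor of v already has color gamma
    if all(coloring[w] != gamma for w in neighbors[v]):
        coloring[v] = gamma

x = [1 if v % 2 else 0 for v in range(n)]   # two proper colorings
y = [3 if v % 2 else 2 for v in range(n)]

T = None
for t in range(1, 5001):
    v, gamma = random.randrange(n), random.choice(colors)
    step(x, v, gamma)                   # the same (v, gamma) drives both chains
    step(y, v, gamma)
    if x == y:
        T = t
        break
print(T)   # coalescence time of the coupled chains
```

Proposition 12.2 suggests coalescence in roughly cn(1 + log n)/(c − 4r) ≈ 220 steps here, so the 5000-step cap is generous.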

12.1.6 Permutations and words


The examples in sections 12.1.4 and 12.1.5 were simple prototypes of in-
teracting particle systems, more examples of which appear in Chapter 10,
whose characteristic property is that a step of the chain involves only “local”
change. Chains making “global” changes are often hard to analyze, but here
is a simple example.
Fix a finite alphabet A of size |A|. Fix m, and consider the set Am of
“words” x = (x1 , . . . , xm ) with each xi ∈ A. Consider the discrete-time
Markov chain on Am in which a step x → y is specified by the following
two-stage procedure.
Stage 1. Pick a permutation σ of {1, 2, . . . , m} uniformly at random from
the set of permutations σ satisfying xσ(i) = xi ∀i.
Stage 2. Let (cj (σ); j ≥ 1) be the cycles of σ. For each j, and indepen-
dently as j varies, pick uniformly an element αj of A, and define yi = αj for
every i ∈ cj (σ).
Here is an alternative description. Write Π for the set of permutations of
{1, . . . , m}. Consider the bipartite graph on vertices Am ∪ Π with edge-set
{(x, σ) : xσ(i) = xi ∀i}. Then the chain is random walk on this bipartite
graph, watched every second step (that is, when it is in Am ).
From the second description, it is clear that the stationary probabilities π(x) are proportional to the degree of x in the bipartite graph, giving

π(x) ∝ ∏_a n_a(x)!

where n_a(x) = |{i : x_i = a}|. We shall use a coupling argument to establish


the following bound on variation distance:
t
¯ ≤m 1− 1

d(t) (12.7)
|A|

implying that the variation threshold satisfies

τ_1^disc ≤ 1 + (1 + log m)/(−log(1 − 1/|A|)) ≤ 1 + (1 + log m)|A|.
The construction of the coupling depends on the following lemma.

Lemma 12.4 Given finite sets F^1, F^2 we can construct (for u = 1, 2) a uniform random permutation σ^u of F^u with cycles (C_j^u; j ≥ 1), where the cycles are labeled such that

C_j^1 ∩ F^1 ∩ F^2 = C_j^2 ∩ F^1 ∩ F^2 for all j.

In the equality we interpret the C_j^u as sets.

Proof. Given a permutation σ of a finite set G, there is an induced permutation on a subset G' obtained by deleting from the cycle representation of σ those elements not in G'. It is easy to check that, for a uniform random permutation of G, the induced random permutation of G' is also uniform. In the setting of the lemma, take a uniform random permutation σ of F^1 ∪ F^2, and let σ^u be the induced random permutations of F^u. Then the equality holds because each side is representing the cycles of the induced permutation on F^1 ∩ F^2. □
We construct a step (x^1, x^2) → (Y^1, Y^2) of the coupled processes as follows. For each a ∈ A, set F^{1,a} = {i : x_i^1 = a}, F^{2,a} = {i : x_i^2 = a}. Take random permutations σ^{1,a}, σ^{2,a} as in the lemma, with cycles C_j^{1,a}, C_j^{2,a}. Then (σ^{1,a}, a ∈ A) define a uniform random permutation σ^1 of {1, . . . , m}, and similarly for σ^2. This completes stage 1. For stage 2, for each pair (a, j) pick a uniform random element α_j^a of A and set

Y_i^1 = α_j^a for every i ∈ C_j^{1,a}
Y_i^2 = α_j^a for every i ∈ C_j^{2,a}.

This specifies a Markov coupling. By construction

if x_i^1 = x_i^2 then Y_i^1 = Y_i^2
if x_i^1 ≠ x_i^2 then P(Y_i^1 = Y_i^2) = 1/|A|

because Y_i^1 and Y_i^2 are independent uniform choices from A. So the coupled processes (X^1(t), X^2(t)) satisfy

P(X_i^1(t) ≠ X_i^2(t)) = (1 − 1/|A|)^t P(X_i^1(0) ≠ X_i^2(0)).

In particular P(X^1(t) ≠ X^2(t)) ≤ m(1 − 1/|A|)^t and the coupling inequality (12.3) gives (12.7).
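The stationary-distribution claim π(x) ∝ ∏_a n_a(x)! can be verified exactly for a tiny instance by building the transition matrix of the two-stage chain. The sketch below (ours; m = 3, A = {a, b}, exact rational arithmetic) enumerates the word-preserving permutations and their cycle recolorings:

```python
import itertools
from fractions import Fraction
from math import factorial

A, m = "ab", 3
words = ["".join(w) for w in itertools.product(A, repeat=m)]

def transition_row(x):
    # stage 1: uniform sigma with x[sigma(i)] = x[i]; stage 2: recolor each cycle uniformly
    row = {w: Fraction(0) for w in words}
    perms = [s for s in itertools.permutations(range(m))
             if all(x[s[i]] == x[i] for i in range(m))]
    for s in perms:
        seen, cycles = set(), []
        for i in range(m):          # extract the cycles of s
            if i not in seen:
                cyc, j = [], i
                while j not in seen:
                    seen.add(j); cyc.append(j); j = s[j]
                cycles.append(cyc)
        for letters in itertools.product(A, repeat=len(cycles)):
            y = list(x)
            for cyc, letter in zip(cycles, letters):
                for i in cyc:
                    y[i] = letter
            row["".join(y)] += Fraction(1, len(perms) * len(A) ** len(cycles))
    return row

P = {x: transition_row(x) for x in words}
weight = {x: Fraction(factorial(x.count("a")) * factorial(x.count("b"))) for x in words}
total = sum(weight.values())
pi = {x: w / total for x, w in weight.items()}   # claimed stationary distribution
print(pi["aaa"], pi["aab"])   # 1/4 and 1/12
```

The stationarity check π P = π then holds as an exact identity of rationals, reflecting the random-walk-on-a-bipartite-graph description.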

12.1.7 Card-shuffling by random transpositions


We mentioned in Chapter 1 section 1.4 (yyy 7/20/99 version) that card-
shuffling questions provided a natural extrinsic motivation for the study of
mixing times. The example here and in section 12.1.9 give a first study
of mathematically (if not physically) simple random shuffles, and these
discrete-time chains are prototypes for more complex chains arising in other
contexts.
Consider a d-card deck. The random transpositions shuffle is:

Make two independent uniform choices of cards, and interchange


them.

With chance 1/d the two choices are the same card, so no change results.
To make a coupling analysis, we first give an equivalent reformulation.

Pick a label a and a position i uniformly at random; interchange


the label-a card with the card in position i.

This reformulation suggests the coupling in which the same choice of (a, i)
is used for each chain. In the coupled process (with two arbitrary starting
states) let Dt be the number of unmatched cards (that is, cards whose
positions in the two decks are different) after t steps. Then
(i) D_{t+1} ≤ D_t.
(ii) P(D_{t+1} ≤ j − 1 | D_t = j) ≥ j^2/d^2.
Here (i) is clear, and (ii) holds because whenever the card labeled a and
the card in position i are both unmatched, the step of the coupled chain
creates at least one new match (of the card labeled a).
Noting that D_t cannot take value 1, we can use the decreasing functional lemma (Lemma 12.1) to show that the coupling time T := min{t : D_t = 0} satisfies

E T ≤ Σ_{j=2}^d d^2/j^2 ≤ d^2 (π^2/6 − 1).

In particular, the coupling inequality implies τ_1^disc = O(d^2).
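The coupling is simple to simulate. The sketch below (ours, with d = 10) runs the reformulated shuffle with a common (a, i) choice in two decks and compares the empirical mean coupling time with the bound d^2(π^2/6 − 1):

```python
import math, random

random.seed(3)
d = 10
bound = d * d * (math.pi ** 2 / 6.0 - 1.0)   # the coupling bound on ET

def couple_once():
    deck1 = list(range(d))                   # deck[i] = card at position i
    deck2 = list(range(d))
    random.shuffle(deck2)
    t = 0
    while deck1 != deck2:
        a, i = random.randrange(d), random.randrange(d)
        for deck in (deck1, deck2):          # same (label, position) in both decks
            j = deck.index(a)                # current position of the card labeled a
            deck[i], deck[j] = deck[j], deck[i]
        t += 1
    return t

times = [couple_once() for _ in range(200)]
mean_T = sum(times) / len(times)
print(mean_T, bound)   # empirical mean coupling time vs d^2 (pi^2/6 - 1)
```

The empirical mean falls well under the bound, consistent with (12.8)'s sharper d log d behaviour.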


We revisit this example in Chapter 7 Example 18 (yyy 1/31/94 version) where it is observed that in fact

τ ∼ (1/2) d log d.    (12.8)

An analogous continuous-space chain on the simplex is studied in Chapter 13-4 Example 3 (yyy 7/29/99 version).

12.1.8 Reflection coupling on the n-cycle


Consider continuous-time random walk on the n-cycle I = {0, 1, 2, . . . , n − 1}. That is, the transition rates are

i → i + 1 (rate 1/2);   i → i − 1 (rate 1/2)

where here and below ±1 is interpreted modulo n. One can define a coupling by specifying the following transition rates for the bivariate process:

(i, i) → (i + 1, i + 1) (rate 1/2);   (i, i) → (i − 1, i − 1) (rate 1/2)
(if |j − i| > 1)   (i, j) → (i + 1, j − 1) (rate 1/2);   (i, j) → (i − 1, j + 1) (rate 1/2)
(i, i + 1) → (i, i) (rate 1/2);   (i, i + 1) → (i + 1, i + 1) (rate 1/2);   (i, i + 1) → (i − 1, i + 2) (rate 1/2)    (12.9)
and symmetrically for (i + 1, i). The joint process ((X_t^(0), X_t^(k)), t ≥ 0) started at (0, k) can be visualized as follows. Let φ(i) := k − i mod n. Picture the operation of φ as reflection in a mirror which passes through the points {x_1, x_2} = {k/2, k/2 + n/2 mod n}, each of which is either a vertex or the middle of an edge. In the simplest case, where x_1 and x_2 are vertices, let (X_t^(0)) be the chain started at vertex 0, let T^{0k} = min{t : X_t^(0) ∈ {x_1, x_2}} and define

X_t^(k) = φ(X_t^(0)), t ≤ T^{0k}
       = X_t^(0),    t > T^{0k}.

This constructs a bivariate process with the transition rates specified above,
with coupling time T 0k , and the pre-T 0k path of X (k) is just the reflection
of the pre-T 0k path of X (0) . In the case where a mirror point is the middle
of an edge (j, j + 1) and the two moving particles are at j and j + 1, we
don’t want simultaneous jumps across that edge; instead (12.9) specifies that
attempted jumps occur at independent times, and the process is coupled at
the time of the first such jump.
It’s noteworthy that in this example the coupling inequality
||P0(X_t ∈ ·) − Pk(X_t ∈ ·)|| ≤ P(X_t^{(0)} ≠ X_t^{(k)})

is in fact an equality. Indeed this assertion, at a given time t, is equivalent


to the assertion
P(X_t^{(0)} ∈ ·, T > t) and P(X_t^{(k)} ∈ ·, T > t) have disjoint support.

But the support A0 of the first measure is the set of vertices which can be
reached from 0 without meeting or crossing any mirror point (and similarly
for Ak ); and A0 and Ak are indeed disjoint.
It is intuitively clear that the minimum over k of T^{0k} is attained by k = ⌊n/2⌋: we leave the reader to find the simple non-computational proof.
It follows, taking e.g. the simplest case where n is a multiple of 4, that we
can write
d̄(t) = P(T_{{−n/4, n/4}} > t)   (12.10)
where T{−n/4,n/4} is the hitting time for continuous-time random walk on
the integers.
Parallel results hold in discrete time but only when the chains are suit-
ably lazy. The point is that (12.9) isn’t allowable as transition probabilities.
However, if we fix 0 < a ≤ 1/3 then the chain with transition probabilities
i −→ i + 1 with probability a;  i −→ i − 1 with probability a
(and which holds with the remaining probability) permits a coupling of the
form (12.9) with all transition probabilities being a instead of 1/2. The
analysis goes through as above, leading to (12.10) where T refers to the
discrete-time lazy walk on the integers.
Similar results hold for random walk on the n-path (Chapter 5 Example 8) (yyy 4/23/96 version), and we call couplings of this form reflection couplings. They are simpler in the context of continuous-path Brownian motion
– see Chapter 13-4 section 1 (yyy 7/29/99 version).

12.1.9 Card-shuffling by random adjacent transpositions


As in section 12.1.7 we take a d-card deck; here we define a (lazy) shuffle by
With probability 1/2 make no change; else pick a uniform ran-
dom position i ∈ {1, 2, . . . , d} and interchange the cards in posi-
tions i and i + 1 (interpret d + 1 as 1).
To study this by coupling, consider two decks. In some positions i the decks
match (the label on the card in position i is the same in both decks). Write
D for the set of i such that either position i or position i + 1 or both match.
Specify a step of the coupled chain by:
P(interchange i and i + 1 in each deck) = 1/(2d),  i ∈ D
P(interchange i and i + 1 in first deck, no change in second deck) = 1/(2d),  i ∉ D
P(interchange i and i + 1 in second deck, no change in first deck) = 1/(2d),  i ∉ D
P(no change in either deck) = |D|/(2d).

Consider a particular card a. From the coupling description we see


(a) if the card gets matched then it stays matched;
(b) while unmatched, at each step the card can move in at most one of the
decks.
It follows that the “clockwise” distance D(t) := X_a^1(t) − X_a^2(t) mod d between the positions of card a in the two decks behaves exactly as a lazy random walk on the d-cycle:

p_{j,j+1} = p_{j,j−1} = 1/d,   1 ≤ j ≤ d

until D(t) hits 0. By the elementary formula for mean hitting times on the
cycle (Chapter 5 eq. (24)) (yyy 4/23/96 version), the mean time T (a) until
card a becomes matched satisfies
ET(a) ≤ (d/2) · (d²/4)

uniformly over initial configurations. By submultiplicativity (Chapter 2 section 4.3) (yyy 9/10/99 version)

P(T(a) > m d³/4) ≤ 2^{−m},  m = 1, 2, . . . .

The chains couple at time T := max_a T(a) and so


d̄(m d³/4) ≤ P(T > m d³/4) ≤ d 2^{−m}.

In particular
τ1^disc = O(d³ log d).
In this example it turns out that coupling does give the correct order of
magnitude; the corresponding lower bound

τ1^disc = Ω(d³ log d)

was proved by Wilson [338]. Different generalizations of this example appear


in section 12.1.13 and in Chapter 14 section 5 (yyy 3/10/94 version), where
we discuss relaxation times.
A generalization of this example, the interchange process, is studied in
Chapter 14 section 5 (yyy 3/10/94 version).

12.1.10 Independent sets


Fix a graph G on n vertices with maximal degree r. An independent set
is a set of vertices which does not contain any adjacent vertices. Fix m

and consider the space of all independent sets of size m in G. Picture an


independent set x as a configuration of m particles at distinct vertices, with
no two particles at adjacent vertices. A natural discrete-time chain (Xt ) on
I is

pick a uniform random particle a and a uniform random vertex


v; move particle a to vertex v if feasible, else make no move.
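One step of this chain is simple to implement; the following sketch is our own illustration (names ours), where feasibility just means the proposed configuration is again an independent set of m particles.

```python
import random
from itertools import combinations

def step(state, adj, vertices):
    """One move: pick a uniform random particle a and a uniform random
    vertex v; move a to v if feasible, else make no move."""
    a = random.choice(sorted(state))          # uniform random particle
    v = random.choice(vertices)               # uniform random vertex
    new = (state - {a}) | {v}
    feasible = (len(new) == len(state) and    # v unoccupied (or v == a)
                all(x not in adj[y] for x, y in combinations(new, 2)))
    return frozenset(new) if feasible else state
```

On the 6-cycle with m = 2, for example, the chain moves among the nine independent 2-sets and a short run visits them all.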

To study mixing times, we can define a coupling (Xt , Yt ) by simply making


the same choice of (a, v) in each of the two coupled chains, where at each time
we invoke a matching of particles in the two realizations which is arbitrary
except for matching particles at the same vertex. To analyze the coupling,
let ρ be the natural metric on I: ρ(x, y) = number of vertices occupied
by particles of x but not by particles of y. Clearly Dt := ρ(Xt , Yt ) can
change by at most 1 on each step. Let us show that, for initial states with
ρ(x, y) = d > 0,

P_(x,y)(D1 = d + 1) ≤ ((m − d)/m) · (2d(r + 1)/n)   (12.11)
P_(x,y)(D1 = d − 1) ≥ (d/m) · ((n − (m + d − 2)(r + 1))/n).   (12.12)
For in order that D1 = d + 1 we must first choose a matched particle a (chance (m − d)/m) and then choose a vertex v which is a neighbor of (or the
same as) some vertex v 0 which is in exactly one of {x, y}: there are 2d such
vertices v 0 and hence at most 2d(r + 1) possibilities for v. This establishes
(12.11). Similarly, in order that D1 = d − 1 it is sufficient that we pick an
unmatched particle a (chance d/m) and then choose a vertex v which is not
a neighbor of (or the same as) any vertex v 0 which is occupied in one or
both realizations by some particle other than a: there are m + d − 2 such
forbidden vertices v 0 and hence at most (m+d−2)(r +1) forbidden positions
for v. This establishes (12.12).
From (12.11,12.12) a brief calculation gives

E_(x,y)(D1 − d) ≤ (−d/(mn)) (n − (3m − d − 2)(r + 1))
             ≤ (−d/(mn)) (n − 3(m − 1)(r + 1)).

In other words

E_(x,y) D1 ≤ κd;   κ := 1 − (n − 3(m − 1)(r + 1))/(mn).

If m < 1 + n/(3(r + 1)) then κ < 1. In this case, by copying the end of the analysis of the graph-coloring chain (section 12.1.5)

d̄(t) ≤ m κ^t;   τ1 = O(log m / (1 − κ)).

To clarify the size-asymptotics, suppose m, n → ∞ with m/n → ρ < 1/(3(r + 1)).
Then for fixed ρ
τ1 = O(n log n).

12.1.11 Two base chains for genetic algorithms


One way of motivating study of Markov chains on combinatorial sets with
uniform stationary distributions is as “base chains” on which to base Markov
chain Monte Carlo, that is to create other chains designed to have some
specified distribution as their stationary distributions. Here is a typical
base chain underlying genetic algorithms.
Fix integers K, L ≥ 1 with K even. A state of the chain is a family of words (x^k, 1 ≤ k ≤ K), where each word is a binary L-tuple x^k = (x_l^k, 1 ≤ l ≤ L). A step of the chain is defined as follows.

Use a uniform random permutation π of {1, 2, . . . , K} to partition the words into K/2 pairs {x^{π(1)}, x^{π(2)}}, {x^{π(3)}, x^{π(4)}}, . . . . Create a new pair {y^1, y^2} from {x^{π(1)}, x^{π(2)}} by setting, independently for each 1 ≤ l ≤ L,

P((y_l^1, y_l^2) = (x_l^{π(1)}, x_l^{π(2)})) = P((y_l^1, y_l^2) = (x_l^{π(2)}, x_l^{π(1)})) = 1/2.   (12.13)
Repeat independently for 1 ≤ i ≤ K/2 to create new pairs {y^{2i−1}, y^{2i}} from {x^{π(2i−1)}, x^{π(2i)}}. The new state is the family of words y^k.

Associated with an initial state (x^k) is a vector of column sums m = (m_l, 1 ≤ l ≤ L) where m_l = Σ_k x_l^k. These sums are preserved by the chain, so the proper state space is the space I_m of families with column-sums m. The


transition matrix is symmetric and so the chain is reversible with uniform stationary distribution on I_m.
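The step is straightforward to code, and doing so makes the invariance of the column sums easy to check. The sketch below is our own illustration (names ours).

```python
import random

def ga_step(words):
    """One step of the base chain: pair the words by a uniform random
    permutation, then mix each pair letter-by-letter as in (12.13)."""
    K, L = len(words), len(words[0])
    perm = random.sample(range(K), K)         # uniform random permutation
    new = [None] * K
    for i in range(K // 2):
        x, y = words[perm[2 * i]], words[perm[2 * i + 1]]
        y1, y2 = [], []
        for l in range(L):
            if random.random() < 0.5:         # keep letter l "forwards"
                y1.append(x[l]); y2.append(y[l])
            else:                             # or "backwards"
                y1.append(y[l]); y2.append(x[l])
        new[2 * i], new[2 * i + 1] = y1, y2
    return new
```

Within each pair the multiset of letters in every column is preserved, so the column sums m_l are constant along the run.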
To describe the coupling, first rephrase (12.13) in words as “(y_l^1, y_l^2) is {x_l^{π(1)}, x_l^{π(2)}} in random order, either forwards or backwards”. Now specify
the coupling as follows.
(i) Use the same random permutation π for both chains.
(ii) For each i and each l, in creating the new words (y_l^{2i−1}, y_l^{2i}) from the old words {x_l^{π(2i−1)}, x_l^{π(2i)}} use the same choice (forwards or backwards) in both chains, except when (x_l^{π(2i−1)}, x_l^{π(2i)}) = (1, 0) for one chain and = (0, 1) for the other chain, in which case use opposite choices of (forwards, backwards) in the two chains.
To study the coupled processes (X(t), X̂(t)), fix l and consider the number W(t) := Σ_{k=1}^K |X_l^k(t) − X̂_l^k(t)| of words in which the l’th letter is not matched in the two realizations. Suppose W(0) = w. Consider the creation


of the first two new words in each chain. The only way that the number of
matches changes is when we use opposite choices of (forwards, backwards)
in the two chains, in which case two new matches are created. The chance
that the l’th letter in the two chains is 1 and 0 in the π(1)’th word and is 0 and 1 in the π(2)’th word equals (w/2)/K × (w/2)/(K − 1), and so (taking into account the symmetric case) the mean number of new matches at l in these two words equals w²/(K(K − 1)). Summing over the K/2 pairs,

E(W(1) | W(0) = w) = w − w²/(2(K − 1)).

We can now apply a comparison lemma (Chapter 2 Lemma 32) (yyy 9/10/99 version) which concludes that the hitting time T^l of W(t) to 0 satisfies

ET^l ≤ Σ_{w=2}^K 2(K − 1)/w² ≤ 2K.

Since T := max_l T^l is a coupling time, a now-familiar argument shows that for u = 1, 2, . . .

d̄(4uK) ≤ P(T > 4uK) ≤ L P(T^l > u · 4K) ≤ L 2^{−u}

and so
τ1^disc = O(K log L).

Open Problem 12.5 Show τ1^disc = O(log K × log L).

We expect this bound by analogy with the “random transpositions” shuffle (section 12.1.7). Loosely speaking, the action of the chain on a single
position in words is like the random transpositions chain speeded up by a
factor K/2, so from (12.8) we expect its mixing time to be Θ(log K). It
would be interesting to study this example via the group representation
or strong stationary time techniques which have proved successful for the
random transpositions chain.

To make a metaphor involving biological genetics, the letters represent chromosomes and the words represent the chromosomal structure of a gamete; the process is “sexual reproduction from the viewpoint of gametes”. If
instead we want a word to represent a particular chromosome and the letters
to represent genes within that chromosome, then instead of flipping bits independently it is more natural to model crossover. That is, consider a chain
in which the rule for creating a new pair {y2i−1 , y2i } from {xπ(2i−1) , xπ(2i) }
becomes

Take U_i uniform on {1, 2, . . . , L, L + 1}. Define

(y_l^{2i−1}, y_l^{2i}) = (x_l^{π(2i−1)}, x_l^{π(2i)}),  l < U_i
(y_l^{2i−1}, y_l^{2i}) = (x_l^{π(2i)}, x_l^{π(2i−1)}),  l ≥ U_i.
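A sketch of the crossover step (again our own illustration, names ours): each pair swaps the suffix of letters from position U_i onwards, so the column sums are again preserved.

```python
import random

def crossover_step(words):
    """Base chain with single-point crossover within each random pair."""
    K, L = len(words), len(words[0])
    perm = random.sample(range(K), K)         # uniform random pairing
    new = [None] * K
    for i in range(K // 2):
        x, y = words[perm[2 * i]], words[perm[2 * i + 1]]
        U = random.randint(1, L + 1)          # uniform on {1, ..., L+1}
        # letters l < U keep their word, letters l >= U are swapped
        new[2 * i]     = x[:U - 1] + y[U - 1:]
        new[2 * i + 1] = y[:U - 1] + x[U - 1:]
    return new
```

Taking U = L + 1 leaves the pair unchanged, and U = 1 swaps the two words outright.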

As an exercise (hint in Notes), find a coupling argument to show that for


this chain
τ1^disc = O(KL²).   (12.14)

12.1.12 Path coupling


In certain complicated settings it is useful to know that it is enough to couple
versions of the chain which start in “nearby” states. To say this carefully,
let I be finite and consider a {0, 1, 2, . . .}-valued function ρ(i, j) defined on
some symmetric subset E ⊂ I × I. Call ρ a pre-metric if
(i) ρ(i, j) = 0 iff i = j.
(ii) ρ(i, j) = ρ(j, i).
(iii) ρ(i_0, i_k) ≤ Σ_{u=0}^{k−1} ρ(i_u, i_{u+1}), whenever (i_0, i_k) and each (i_u, i_{u+1}) are in E.
Clearly a pre-metric extends to a metric ρ̄ by defining

ρ̄(i, j) := min Σ_u ρ(i_u, i_{u+1})   (12.15)

the minimum over all paths i = i0 , i1 , . . . , ik = j with each (iu , iu+1 ) ∈ E.


Note ρ̄(i, j) ≤ ρ(i, j) for (i, j) ∈ E.

Lemma 12.6 Let S be a state space. Let (μ_{i,i+1}, 0 ≤ i ≤ d − 1) be probability distributions on S × S such that the second marginal of μ_{i,i+1} coincides with the first marginal of μ_{i+1,i+2} for 0 ≤ i ≤ d − 2. Then there exists an S-valued random sequence (V_i, 0 ≤ i ≤ d) such that μ_{i,i+1} = dist(V_i, V_{i+1}) for 0 ≤ i ≤ d − 1.

Proof. Just take (Vi ) to be the non-homogeneous Markov chain whose tran-
sition probabilities P (Vi+1 ∈ ·|Vi = v) are the conditional probabilities de-
termined by the specified joint distribution µi,i+1 .
Lemma 12.7 (Path-coupling lemma) Take a discrete-time Markov chain (X_t) with finite state space I. Write X_1^{(i)} for the time-1 value of the chain started at state i. Let ρ be a pre-metric defined on some subset E ⊂ I × I. Suppose that for each pair (i, j) in E we can construct a joint law (X_1^{(i)}, X_1^{(j)}) such that

E ρ̄(X_1^{(i)}, X_1^{(j)}) ≤ κ ρ(i, j)   (12.16)

for some constant 0 < κ < 1. Then

d̄(t) ≤ Δ_ρ κ^t   (12.17)

where Δ_ρ := max_{i,j∈I} ρ̄(i, j).
See the Notes for comments on the case κ = 1.
Proof. Fix states i, j and consider a path (i_u) attaining the minimum in (12.15). For each u let (X_1^{(i_u)}, X_1^{(i_{u+1})}) have a joint distribution satisfying (12.16). By Lemma 12.6 there exists a random sequence (X_1^{(i)} = X_1^{(i_0)}, X_1^{(i_1)}, . . . , X_1^{(j)}) consistent with these bivariate distributions. In particular, there is a joint distribution (X_1^{(i)}, X_1^{(j)}) such that

E ρ̄(X_1^{(i)}, X_1^{(j)}) ≤ Σ_u E ρ̄(X_1^{(i_u)}, X_1^{(i_{u+1})}) ≤ κ Σ_u ρ(i_u, i_{u+1}) = κ ρ̄(i, j).

This construction gives one step of a coupling of two copies of the chain started at arbitrary states, and so extends to a coupling ((X_t^{(i)}, X_t^{(j)}), t = 0, 1, 2, . . .) of two copies of the entire processes. The inequality above implies

E(ρ̄(X_{t+1}^{(i)}, X_{t+1}^{(j)}) | X_t^{(i)}, X_t^{(j)}) ≤ κ ρ̄(X_t^{(i)}, X_t^{(j)})

and hence

P(X_t^{(i)} ≠ X_t^{(j)}) ≤ E ρ̄(X_t^{(i)}, X_t^{(j)}) ≤ κ^t ρ̄(i, j) ≤ κ^t Δ_ρ

establishing (12.17). □
Bubley and Dyer [77] introduced Lemma 12.7 and the name path-coupling. It has proved useful in extending the range of applicability of coupling methods in settings such as graph-coloring (Bubley et al [79], Vigoda [333]) and independent sets (Luby and Vigoda [244]). These are too intricate for presentation here, but the following example will serve to illustrate the use of path-coupling.

12.1.13 Extensions of a partial order


Fix a partial order ≼ on an n-element set, and let I be the set of linear extensions of ≼, that is to say total orders consistent with the given partial order. We can define a discrete-time Markov chain on I by re-using the idea in the “random adjacent transpositions” example (section 12.1.9). Let w(·) be a probability distribution on {1, 2, . . . , n − 1}. Define a step of the chain as follows.

Pick position i with probability w(i), and independently pick one


of { stay, move } with probability 1/2 each. If pick “move” then
interchange the elements in positions i and i + 1 if feasible (i.e.
if consistent with the partial order); else make no change.
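For concreteness, here is a sketch of one step (our own illustration; we assume `precedes` holds the full set of ordered pairs u ≺ v of the partial order, so feasibility of an adjacent swap is a single membership test).

```python
import random

def le_step(order, precedes, w):
    """One step of the chain on linear extensions.

    order:    tuple listing the n elements in their current total order
    precedes: set of pairs (u, v) with u strictly below v in the partial order
    w:        weights w(1), ..., w(n-1) on the positions."""
    order = list(order)
    i = random.choices(range(len(order) - 1), weights=w)[0]
    if random.random() < 0.5:                 # picked "move" rather than "stay"
        u, v = order[i], order[i + 1]
        if (u, v) not in precedes:            # swap does not violate the order
            order[i], order[i + 1] = v, u
    return tuple(order)
```

For the partial order 0 ≺ 1, 2 ≺ 3 on four elements there are six linear extensions, and a short run of the chain visits all of them while never violating the constraints.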

The transition matrix is symmetric, so the stationary distribution is uniform on I.
To analyze by coupling, define one step of a bivariate coupled process as
follows.

Make the same choice of i in both chains. Also make the same
choice of { move, stay }, except in the case where the elements
in positions i and i + 1 are the same elements in opposite order
in the two realizations, in which case use the opposite choices of
{ stay, move }.

The coupling is similar (but not identical) to that in section 12.1.9, where
the underlying chain is that corresponding to the “null” partial order. For
a general partial order, the coupling started from an arbitrary pair of states
seems hard to analyze directly. For instance, an element in the same position
in both realizations at time t may not remain so at time t + 1. Instead we
use path-coupling, following an argument of Bubley and Dyer [78]. Call two
states x and y adjacent if they differ by only one (not necessarily adjacent)
transposition; if the transposed cards are in positions i < j then let ρ(x, y) =
j−i. We want to study the increment Φ := ρ(X1 , Y1 )−ρ(x, y) where (X1 , Y1 )
is the coupled chain after one step from (x, y). The diagram shows a typical
pair of adjacent states.
a b c α d e f β g h
a b c β d e f α g h
position · · · i · · · j · ·
Observe first that any choice of position other than i − 1, i, j − 1, j will
have no effect on Φ. If position i and “move” are chosen, then {α, d} are
interchanged in the first chain and {β, d} in the second; both lead to feasible

configurations by examining the relative orders in the other chain’s previous


configuration. This has chance w(i)/2 and leads to Φ = −1. If position i − 1
and “move” are chosen (chance w(i − 1)/2), then if either or both moves are
feasible Φ = 1, while if neither are feasible then Φ = 0. Arguing similarly
for choices j − 1, j leads to

E(Φ) ≤ (1/2)(w(i − 1) − w(i) − w(j − 1) + w(j)).

This estimate remains true if j = i+1 because in that case choosing position
i (chance w(i)) always creates a match. Now specify
w(i) := i(n − i)/w_n,   w_n := Σ_{j=1}^{n−1} j(n − j)

and then EΦ ≤ −(j − i)/w_n. This leads to

E_(x,y) ρ(X1, Y1) ≤ (1 − 1/w_n) ρ(x, y)

for adjacent (x, y). We are thus in the setting of Lemma 12.7, which shows
d̄(t) ≤ Δ_n exp(−t/w_n).

Since Δ_n = O(n²) and w_n ∼ n³/6 we obtain

τ1^disc ≤ (1/3 + o(1)) n³ log n.

12.2 Notes on Chapter 4-3


Coupling has become a standard tool in probability theory. The monograph
of Lindvall [233] contains an extensive treatment, and history. In brief, Doeblin [126] used the idea of running two copies of the chain independently until they meet, in order to prove the convergence theorem (Chapter 2 Theorem 2) for finite-state chains, and this is now the textbook proof ([270] Theorem 1.8.3) of the convergence theorem for countable-state chains. The first wide-ranging applications were in the context of infinite-site interacting particles in the 1970s, where (e.g. Liggett [230]) couplings were used to study uniqueness of invariant distributions and convergence thereto. Theory connecting couplings and variation distance is implicit in Griffeath [171] and Pitman [282], though the first systematic use to bound variation distance in finite-state chains was perhaps Aldous [9], where examples including those in sections 12.1.4, 12.1.7 and 12.1.9 were given.

Section 12.1.2. There may exist Markov couplings which are not of the natural form (12.4), but examples typically rely on very special symmetry properties. For the theoretically-interesting notion of (non-Markov) maximal coupling see Chapter 9 section 1 (yyy 4/21/95 version).
The coupling inequality is often presented using a first chain started from an arbitrary point and a second chain started with the stationary distribution, leading to a bound on d(t) instead of d̄(t). See Chapter 13-4 yyy for an example where this is used in order to exploit distributional properties of the stationary chain.
Section 12.1.5. This chain was first studied by Jerrum [200], who proved
rapid mixing under the weaker assumption c ≥ 2r. His proof involved
a somewhat more careful analysis of the coupling, exploiting the fact that “bad” configurations for the inequalities (12.5,12.6) are different. This problem attracted interest because the same constraint c ≥ 2r appears in proofs of the absence of phase transition in the zero-temperature anti-ferromagnetic Potts model in statistical physics. Proving rapid mixing under weaker hypotheses was first done by Bubley et al [79] in special settings and using computer assistance. Vigoda [333] then showed that rapid mixing still holds when c > (11/6) r: the proof first studies a different chain (still reversible with uniform stationary distribution) and then uses a comparison theorem.
Section 12.1.6. The chain here was suggested by Jerrum [196] in the context of a general question of counting the number of orbits of a permutation group acting on words. More general cases (using a subgroup of permutations instead of the whole permutation group) remain unanalyzed.
Section 12.1.10. See Luby and Vigoda [244] for more detailed study and
references.
Section 12.1.11. Conceptually, the states in these examples are unordered families of words. In genetic algorithms for optimization one has an objective function f : {0, 1}^L → R and accepts or rejects offspring words with probabilities depending on their f-values.
Interesting discussion of some different approaches to genetics and computation is in Rabani et al [287].
Hint for (12.14). First match the L’th letters in each word, using the occasions when U_i = L or L + 1. This takes O(LK) time.
Section 12.1.12. Another setting where path-coupling has been used is
contingency tables: Dyer and Greenhill [137].
In the case where (12.16) holds for κ = 1, one might expect a bound of the form

τ1 = O(Δ_ρ²/α),   (12.18)

where α := min_{(i,j)∈E} P(ρ(X_1^{(i)}, X_1^{(j)}) ≤ ρ(i, j) − 1),

by arguing that, for arbitrary (i, k), the process ρ(X_t^{(i)}, X_t^{(k)}) can be compared to a mean-zero random walk with chance α of making a negative step. Formalizing this idea seems subtle. Consider three states i, j, k with
(i, j) ∈ E and (j, k) ∈ E. Suppose

P(ρ(X_1^{(i)}, X_1^{(j)}) = ρ(i, j) + 1) = P(ρ(X_1^{(i)}, X_1^{(j)}) = ρ(i, j) − 1) = α

and otherwise ρ(·, ·) is unchanged; similarly for (j, k). The changes for the (i, j) process and for the (j, k) process will typically be dependent, and in the extreme case we might have

ρ(X_1^{(i)}, X_1^{(j)}) = ρ(i, j) + 1 iff ρ(X_1^{(j)}, X_1^{(k)}) = ρ(j, k) − 1

and symmetrically, in which case ρ(X_t^{(i)}, X_t^{(k)}) might not change at all. Thus proving a result like (12.18) must require further assumptions.
Section 12.1.13. The Markov chain here (with uniform weights) was first
studied by Karzanov and Khachiyan [211].
Chapter 13

Continuous State, Infinite State and Random Environment (June 23, 2001)

13.1 Continuous state space

We have said several times that the theory in this book is fundamentally a theory of inequalities. “Universal” or “a priori” inequalities for reversible chains on finite state space, such as those in Chapter 4, should extend unchanged to the continuous space setting. Giving proofs of this, or giving the rigorous setup for continuous-space chains, is outside the scope of our intermediate-level treatment. Instead we just mention a few specific processes which parallel or give insight into topics treated earlier.

13.1.1 One-dimensional Brownian motion and variants

Let (B_t, 0 ≤ t < ∞) be one-dimensional standard Brownian motion (BM). Mentally picture a particle moving along an erratic continuous random trajectory. Briefly, for s < t the increment B_t − B_s has Normal(0, t − s) distribution, and for non-overlapping intervals (s_i, t_i) the increments B_{t_i} − B_{s_i} are independent. See Norris [270] section 4.4, Karlin and Taylor [208] Chapter 7, or Durrett [133] Chapter 7 for successively more detailed introductions. One can do explicit calculations, directly in the continuous setting, of distributions of many random quantities associated with BM. A particular


calculation we need ([133] equation 7.8.12) is



G(t) := P( sup_{0≤s≤t} |B_s| < 1 ) = (4/π) Σ_{m=0}^∞ ((−1)^m/(2m + 1)) exp(−(2m + 1)² π² t/8)   (13.1)

where
G^{−1}(1/e) = 1.006.   (13.2)
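The series (13.1) converges rapidly for t bounded away from 0, so the constant in (13.2) is easy to verify numerically. The sketch below is our own illustration (names ours), inverting the strictly decreasing function G by bisection.

```python
import math

def G(t, terms=50):
    """Eq. (13.1): P(sup_{0<=s<=t} |B_s| < 1) for standard Brownian motion."""
    return (4 / math.pi) * sum(
        ((-1) ** m / (2 * m + 1))
        * math.exp(-((2 * m + 1) ** 2) * math.pi ** 2 * t / 8)
        for m in range(terms))

def G_inverse(p, lo=1e-3, hi=10.0):
    """Solve G(t) = p for t in [lo, hi] by bisection (G is decreasing)."""
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if G(mid) > p else (lo, mid)
    return (lo + hi) / 2
```

G_inverse(1/e) returns approximately 1.006, and dividing by 16 recovers the value 0.063 quoted at (13.8) below.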
One can also regard BM as a limit of rescaled random walk, a result which
generalizes the classical central limit theorem. If (Xm , m = 0, 1, 2, . . .) is
simple symmetric random walk on Z, then the central limit theorem implies
m^{−1/2} X_m →_d B_1 and the generalized result is

(m^{−1/2} X_{⌊mt⌋}, 0 ≤ t < ∞) →_d (B_t, 0 ≤ t < ∞)   (13.3)

where the convergence here is weak convergence of processes (see e.g. Ethier
and Kurtz [141] for detailed treatment). For more general random flights
on Z, that is X_m = Σ_{j=1}^m ξ_j with ξ_1, ξ_2, . . . independent and Eξ = 0 and var ξ = σ² < ∞, we have Donsker’s theorem ([133] Theorem 7.6.6)
(m^{−1/2} X_{⌊mt⌋}, 0 ≤ t < ∞) →_d (σB_t, 0 ≤ t < ∞).   (13.4)

Many asymptotic results for random walk on the integers or on the n-cycle
or on the n-path, and their d-dimensional counterparts, can be explained
in terms of Brownian motion or its variants. The variants of interest to us
take values in compact sets and have uniform stationary distributions.
Brownian motion on the circle can be defined by

B_t∘ := B_t mod 1

and then random walk (X_m^{(n)}, m = 0, 1, 2, . . .) on the n-cycle {0, 1, 2, . . . , n − 1} satisfies, by (13.3),

(n^{−1} X^{(n)}_{⌊n²t⌋}, 0 ≤ t < ∞) →_d (B_t∘, 0 ≤ t < ∞) as n → ∞.   (13.5)

The process B∘ has eigenvalues {2π²j², 0 ≤ j < ∞}, with eigenfunction ≡ 1 for j = 0 and two eigenfunctions cos(2πjx) and sin(2πjx) for j ≥ 1. In particular the relaxation time is

τ2 = 1/(2π²).
.

The result for random walk on the n-cycle (Chapter 5 Example 7)

τ2 ∼ n²/(2π²) as n → ∞

can therefore be viewed as a consequence of the n² time-rescaling in (13.5) which takes random walk on the n-cycle to Brownian motion on the circle. This argument is a prototype for the weak convergence paradigm: proving size-asymptotic results for discrete structures in terms of some limiting continuous structure.
Variation distance can be studied via coupling. Construct two Brownian
motions on R started from 0 and x > 0 as follows. Let B^{(1)} be standard Brownian motion, and let

T_{x/2} := inf{t : B_t^{(1)} = x/2}.

Then T_{x/2} < ∞ a.s. and we can define B^{(2)} by

B_t^{(2)} = x − B_t^{(1)},  0 ≤ t ≤ T_{x/2}
B_t^{(2)} = B_t^{(1)},  T_{x/2} ≤ t < ∞.

That is, the segment of B^{(2)} over 0 ≤ t ≤ T_{x/2} is the image of the corresponding segment of B^{(1)} under the reflection which takes 0 to x. It is easy to see that B^{(2)} is indeed Brownian motion started at x. This is the reflection coupling for Brownian motion. We shall study analogous couplings for variant processes. Given Brownian motion on the circle B^{∘1} started at 0, we can construct another Brownian motion on the circle B^{∘2} started at 0 < x ≤ 1/2 via

B_t^{∘2} = x − B_t^{∘1} mod 1,  0 ≤ t ≤ T_{{x/2, x/2 + 1/2}}
B_t^{∘2} = B_t^{∘1},  T_{{x/2, x/2 + 1/2}} ≤ t < ∞

where
T_{{x/2, x/2 + 1/2}} := inf{t : B_t^{∘1} = x/2 or x/2 + 1/2}.

Again, the segment of B^{∘2} over 0 ≤ t ≤ T_{{x/2, x/2 + 1/2}} is the image of the corresponding segment of B^{∘1} under the reflection of the circle which takes 0 to x, so we call it the reflection coupling for Brownian motion on the circle. Because sample paths cannot cross without meeting, it is easy to see that the general coupling inequality (Chapter 4-3 section 1.1) becomes an equality:

||P_0(B_t∘ ∈ ·) − P_x(B_t∘ ∈ ·)|| = P(T_{{x/2, x/2 + 1/2}} > t).

The worst starting point is x = 1/2, and the hitting time in question can be written as the hitting time T_{{−1/4, 1/4}} for standard Brownian motion, so

d̄(t) = P(T_{{−1/4, 1/4}} > t) = G(16t)   (13.6)

by Brownian scaling, that is the property

(B_{c²t}, 0 ≤ t < ∞) =_d (cB_t, 0 ≤ t < ∞).   (13.7)

See the Notes for an alternative formula. Thus for Brownian motion on the
circle
τ1 = (1/16) G^{−1}(1/e) = 0.063.   (13.8)

If simple random walk is replaced by aperiodic random flight with step variance σ² then the asymptotic values of τ2 and τ1 are replaced by τ2/σ² and τ1/σ²; this may be deduced using the local central limit theorem ([133] Theorem 2.5.2).
Reflecting Brownian motion B̄ on the interval [0, 1] is very similar. Intuitively, imagine that upon hitting an endpoint 0 or 1 the particle is instantaneously inserted an infinitesimal distance into the interval. Formally one can construct B̄_t as B̄_t := φ(B_t) for the concertina map

φ(2j + x) = x,  φ(2j + 1 + x) = 1 − x;  0 ≤ x ≤ 1, j = . . . , −2, −1, 0, 1, 2, . . . .
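In code the concertina map is just folding modulo 2 (a trivial sketch of ours):

```python
def concertina(t):
    """phi(2j + x) = x and phi(2j + 1 + x) = 1 - x: fold the line onto [0, 1]."""
    r = t % 2.0          # Python's % maps any real t into [0, 2)
    return r if r <= 1.0 else 2.0 - r
```

Applying concertina pointwise to a sampled Brownian path gives a path of reflecting Brownian motion on [0, 1].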

The process B̄ has eigenvalues {π²j²/2, 0 ≤ j < ∞} with eigenfunctions cos(πjx). In particular the relaxation time is

τ2 = 2/π².

The result for random walk on the n-path (Chapter 5 Example 8)

τ2 ∼ 2n²/π² as n → ∞

is another instance of the weak convergence paradigm, a consequence of the n² time-rescaling which takes random walk on the n-path to reflecting Brownian motion on the interval. The variation distance function d̄(t) for B̄ can be expressed in terms of the corresponding quantity (write as d∘(t)) for B∘. Briefly, it is easy to check

(B̄_t, 0 ≤ t < ∞) =_d (2 min(B∘_{t/4}, 1 − B∘_{t/4}), 0 ≤ t < ∞)

and then to deduce d̄(t) = d∘(t/4). Then using (13.8)

τ1 = (1/4) G^{−1}(1/e) = 0.252.   (13.9)



13.1.2 d-dimensional Brownian motion


Standard d-dimensional Brownian motion can be written as

B_t = (B_t^{(1)}, . . . , B_t^{(d)})

where the component processes (B_t^{(i)}, i = 1, . . . , d) are independent one-dimensional standard Brownian motions. A useful property of B is isotropy: its distribution is invariant under rotations of R^d. In approximating simple random walk (X_m, m = 0, 1, 2, . . .) on Z^d one needs to be a little careful with scaling constants. The analog of (13.3) is
(m^{−1/2} X_{⌊mt⌋}, 0 ≤ t < ∞) →_d (d^{−1/2} B_t, 0 ≤ t < ∞)   (13.10)

where the factor d^{−1/2} arises because the components of the random walk have variance 1/d — see (13.4). Analogous to (13.5), random walk (X_m^{(n)}, m = 0, 1, 2, . . .) on the discrete torus Z_n^d converges to Brownian motion B∘ on the continuous torus [0, 1)^d:

(n^{−1} X^{(n)}_{⌊n²t⌋}, 0 ≤ t < ∞) →_d (d^{−1/2} B∘_t, 0 ≤ t < ∞) as n → ∞.   (13.11)

13.1.3 Brownian motion in a convex set


Fix a convex polyhedron K ⊂ Rd . One can define reflecting Brownian
motion in K; heuristically, when the particle hits a face it is replaced an
infinitesimal distance inside K, orthogonal to the face. As in the previous
examples, the stationary distribution is uniform on K. We will outline a
proof of

Proposition 13.1 For Brownian motion B in a convex polyhedron K which


is a subset of the ball of radius r,
(i) τ1 ≤ G^{−1}(1/e) r²
(ii) τ2 ≤ 8π^{−2} r².

Proof. By the d-dimensional version of Brownian scaling (13.7) we can


reduce to the case r = 1. The essential fact is

Lemma 13.2 Let B̄t be reflecting Brownian motion on [0, 1] started at 1,


and let T_0 be its hitting time on 0. Versions B^{(1)}, B^{(2)} of Brownian motion in K started from arbitrary points of K can be constructed jointly with B̄ such that

|B_t^{(1)} − B_t^{(2)}| ≤ 2 B̄_{min(t,T_0)},  0 ≤ t < ∞.   (13.12)

Granted this fact, d̄(t) for Brownian motion on K satisfies

d̄(t) ≤ max_{starting points} P(B_t^{(1)} ≠ B_t^{(2)}) ≤ P(T_0 > t) = G(t)

where the final equality holds because T_0 has the same distribution as the time for Brownian motion started at 1 to exit the interval (0, 2). This establishes (i) for r = 1. Then from the t → ∞ asymptotics of G(t) in (13.1) we have d̄(t) = O(exp(−π²t/8)), implying τ2 ≤ 8/π² by Lemma ?? and establishing (ii).
Sketch proof of Lemma. Details require familiarity with stochastic calculus, but this outline provides the idea. For two Brownian motions in R^d started from (0, 0, . . . , 0) and from (x, 0, . . . , 0), one can define the reflection coupling by making the first coordinates evolve as the one-dimensional reflection coupling, and making the other coordinate processes be identical in the two motions. Use isotropy to extend the definition of reflection coupling to arbitrary starting points. Note that the distance between the processes evolves as 2 times one-dimensional Brownian motion, until they meet. The desired joint distribution of ((B_t^{(1)}, B_t^{(2)}), 0 ≤ t < ∞) is obtained by specifying that while both processes are in the interior of K, they evolve as the reflection coupling (and each process reflects orthogonally at faces). As the
figure illustrates, the effect of reflection can only be to decrease distance
between the two Brownian particles.

[Figure: reflection at a boundary point a of K.]

For a motion hitting the boundary at a, if the unreflected process is at b or c an infinitesimal time later then the reflected process is at b′ or c′.

By convexity, for any $x \in K$ we have $|b' - x| \le |b - x|$; so reflection can only decrease the distance between coupled particles. To argue the inequality carefully, let $\alpha$ be the unit vector normal to the face. The projection $P_\alpha$ onto the direction of $\alpha$ satisfies $|P_\alpha(b' - x)| \le |P_\alpha(b - x)|$. Further, $b - b'$ is parallel to $\alpha$, implying $P_{\alpha^\perp}(b' - x) = P_{\alpha^\perp}(b - x)$. Therefore by Pythagoras $|b' - x| \le |b - x|$.

We can therefore write, in stochastic calculus notation,
$$d|B^{(1)}_t - B^{(2)}_t| = d(2B_t) - dA_t$$
where $B_t$ is a one-dimensional Brownian motion and $A_t$ is an increasing process (representing the contribution from reflections off faces) which increases only when one process is at a face. But we can construct reflecting Brownian motion $\bar B$ in terms of the same underlying $B_t$ by
$$d(2\bar B_t) = d(2B_t) - dC_t$$
where $C_t$ (representing the contribution from reflections off the endpoint 1) is increasing until $T_0$. At time 0 we have (because $r = 1$)
$$|B^{(1)}_0 - B^{(2)}_0| \le 2 = 2\bar B_0.$$
We have shown
$$d(|B^{(1)}_t - B^{(2)}_t| - 2\bar B_t) = -dA_t + dC_t.$$
If the desired inequality (13.12) fails then it fails at some first time $t$, which can only be a time when $C_t$ is increasing, that is when $\bar B_t = 1$, at which times the inequality holds a priori. □
Proposition 13.1 suggests an approach to the algorithmic question of simulating a uniform random point in a convex set $K \subset R^d$ where $d$ is large, discussed in Chapter 9 section 5.1. If we could simulate the discrete-time chain defined as reflecting Brownian motion $B$ on $K$ examined at time intervals of $h^2/d$ for some small $h$ (so that the length of a typical step is of order $\sqrt{(h^2/d) \times d} = h$), then Proposition 13.1 implies that $O(d/h^2)$ steps are enough to approach the stationary distribution. Since the convex set is available only via an oracle, one can attempt to do the simulation via acceptance/rejection. That is, from $x$ we propose a move to $x' = x + \sqrt{h^2/d}\,Z$, where $Z$ has the standard $d$-variate Normal distribution, and accept the move iff $x' \in K$. While this leads to a plausible heuristic argument, the rigorous difficulty is that it is not clear how close an acceptance/rejection step is to the true step of reflecting Brownian motion. No rigorous argument based directly on Brownian motion has yet been found, though the work of Bubley et al [80] on coupling of random walks has elements in common with reflection coupling.
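The proposal step just described is easy to sketch in code. This is an illustration only, not from the text: for concreteness the oracle is taken to be membership in the Euclidean unit ball, standing in for a general convex body $K$.

```python
import numpy as np

rng = np.random.default_rng(0)

def in_K(x):
    # Oracle membership test; here K is the unit ball, purely for illustration.
    return np.dot(x, x) <= 1.0

def gaussian_walk(x0, h, n_steps):
    """Propose x' = x + sqrt(h^2/d) Z with Z standard d-variate Normal;
    accept the move iff x' lies in K (otherwise stay put)."""
    d = len(x0)
    x = np.array(x0, dtype=float)
    scale = np.sqrt(h * h / d)
    for _ in range(n_steps):
        proposal = x + scale * rng.standard_normal(d)
        if in_K(proposal):
            x = proposal
    return x

x = gaussian_walk(np.zeros(20), h=0.5, n_steps=1000)
print(in_K(x))  # the chain never leaves K
```

Since rejected proposals leave the state unchanged, the chain stays in $K$ by construction; whether it well approximates reflecting Brownian motion is exactly the open question above.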

13.1.4 Discrete-time chains: an example on the simplex


Discrete-time, continuous-space chains arise in many settings, in particular
(Chapter MCMC) in Markov Chain Monte Carlo sampling from a target
distribution on Rd . As discussed in that chapter, estimating mixing times
for such chains with general target distributions is extremely difficult. The
techniques in this book are more directly applicable to chains with (roughly)
uniform stationary distribution. The next example is intended to give the
flavor of how techniques might be adapted to the continuous setting: we will
work through the details of a coupling argument.

Example 13.3 A random walk on the simplex.

Fix $d$ and consider the simplex $\Delta = \{x = (x_1, \dots, x_d) : x_i \ge 0,\ \sum_i x_i = 1\}$. Consider the discrete-time Markov chain $(X(t), t = 0, 1, 2, \dots)$ on $\Delta$ with steps:
from state $x$, pick 2 distinct coordinates $\{i, j\}$ uniformly at random, and replace the 2 entries $\{x_i, x_j\}$ by $\{U, x_i + x_j - U\}$ where $U$ is uniform on $(0, x_i + x_j)$.
The stationary distribution $\pi$ is the uniform distribution on $\Delta$. We will show that the mixing time $\tau_1$ satisfies
$$\tau_1 = O(d^2 \log d) \text{ as } d \to \infty. \eqno(13.13)$$

The process is somewhat reminiscent of card shuffling by random transposi-


tions (Chapter 7 Example 18), so by analogy with that example we expect
that in fact τ1 = Θ(d log d). What we show here is that the coupling analysis
of that example (Chapter 4-3 section 1.7) extends fairly easily to the present
example.
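The chain itself is easy to simulate; here is an illustrative sketch (our code, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def step(x):
    """One step of the simplex walk: pick two distinct coordinates and
    redistribute their total uniformly between them."""
    d = len(x)
    i, j = rng.choice(d, size=2, replace=False)
    u = rng.uniform(0.0, x[i] + x[j])
    x = x.copy()
    x[i], x[j] = u, x[i] + x[j] - u
    return x

d = 10
x = np.full(d, 1.0 / d)      # start at the barycenter of the simplex
for _ in range(1000):
    x = step(x)
print(x.sum())                # stays on the simplex (up to rounding)
```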
As a preliminary, let us specify two distinct couplings (A, B) of the
uniform(0, a) and the uniform(0, b) distributions. In the scaling coupling we
take (A, B) = (aU, bU ) for U with uniform(0, 1) distribution. In the greedy
coupling we make P (A = B) have its maximal value, which is min(a, b)/ max(a, b),
and we say the coupling works if A = B.
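For $a \le b$ the greedy coupling can be realized explicitly: with probability $a/b$ set $A = B$, uniform on $(0,a)$; otherwise draw $A$ uniform on $(0,a)$ and $B$ uniform on $(a,b)$. (Both residual distributions are again uniform because the densities involved are constant.) An illustrative sketch, in our code:

```python
import numpy as np

rng = np.random.default_rng(2)

def greedy_couple(a, b):
    """Maximal coupling of uniform(0,a) and uniform(0,b):
    P(A = B) = min(a,b)/max(a,b)."""
    lo, hi = min(a, b), max(a, b)
    if rng.random() < lo / hi:
        common = rng.uniform(0.0, lo)
        small, big = common, common
    else:
        small = rng.uniform(0.0, lo)   # residual of the shorter uniform
        big = rng.uniform(lo, hi)      # residual of the longer uniform
    return (small, big) if a <= b else (big, small)

samples = [greedy_couple(0.6, 1.0) for _ in range(200000)]
match_freq = np.mean([A == B for A, B in samples])
print(match_freq)  # close to min(a,b)/max(a,b) = 0.6
```

Checking the empirical means confirms that both marginals remain uniform on their intervals.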
Fix $x(0) \in \Delta$. We now specify a coupling $(X(t), Y(t))$ of the chains started with $X(0) = x(0)$ and with $Y(0)$ having the uniform distribution. (This is an atypical coupling argument, in that it matters that one version is the stationary version.)

From state $(x, y)$, choose the same random pair $\{i, j\}$ for each process, and link the new values $x'_i$ and $y'_i$ (which are uniform on different intervals) via the scaling coupling for the first $t_1 = 3d^2 \log d$ steps, then via the greedy coupling for the next $t_2 = Cd^2$ steps.

We shall show that, for any fixed constant $C > 0$,
$$P(X(t_1 + t_2) = Y(t_1 + t_2)) \ge 1 - C^{-1} - o(1) \text{ as } d \to \infty \eqno(13.14)$$
establishing (13.13).
Consider the effect on the $l^1$ distance $\|x - y\| := \sum_i |x_i - y_i|$ of a step of the scaling coupling using coordinates $\{i, j\}$. The change is
$$|U(x_i + x_j) - U(y_i + y_j)| + |(1-U)(x_i + x_j) - (1-U)(y_i + y_j)| - |x_i - y_i| - |x_j - y_j|$$
$$= |(x_i + x_j) - (y_i + y_j)| - |x_i - y_i| - |x_j - y_j|$$
$$= \begin{cases} 0 & \text{if } \mathrm{sgn}(x_i - y_i) = \mathrm{sgn}(x_j - y_j) \\ -2\min(|x_i - y_i|, |x_j - y_j|) & \text{if not.} \end{cases}$$
Thus
$$E_{(x,y)}(\|X(1) - Y(1)\| - \|x - y\|)$$
$$= \frac{-2}{d(d-1)} \sum_i \sum_{j \ne i} \min(|x_i - y_i|, |x_j - y_j|)\, 1_{(\mathrm{sgn}(x_i - y_i) \ne \mathrm{sgn}(x_j - y_j))}$$
$$= \frac{-4}{d(d-1)} \sum_{i \in A} \sum_{j \in B} \min(c_i, d_j)$$
(where $c_i := x_i - y_i$ on $A := \{i : x_i > y_i\}$ and $d_j := y_j - x_j$ on $B := \{j : y_j > x_j\}$)
$$= \frac{-4}{d(d-1)} \sum_{i \in A} \sum_{j \in B} \frac{c_i d_j}{\max(c_i, d_j)}$$
$$\le \frac{-4}{d(d-1)} \sum_{i \in A} \sum_{j \in B} \frac{c_i d_j}{\|x - y\|/2}$$
$$= \frac{-2}{d(d-1)} \|x - y\|$$
because $\sum_{i \in A} c_i = \sum_{j \in B} d_j = \|x - y\|/2$. So
$$E_{(x,y)} \|X(1) - Y(1)\| \le \left(1 - \frac{2}{d(d-1)}\right) \|x - y\|.$$

Because $\|X(0) - Y(0)\| \le 2$, it follows that after $t$ steps of the scaling coupling,
$$E\|X(t) - Y(t)\| \le 2\left(1 - \frac{2}{d(d-1)}\right)^t.$$
So by taking $t_1 \sim 3d^2 \log d$, after $t_1$ steps we have
$$P(\|X(t_1) - Y(t_1)\| \le d^{-5}) = 1 - o(1). \eqno(13.15)$$

Now consider the greedy coupling. If a step works, the $l^1$ distance $\|X(t) - Y(t)\|$ cannot increase. The chance that a step from $(x, y)$ involving coordinates $\{i, j\}$ works is
$$\frac{\min(x_i + x_j, y_i + y_j)}{\max(x_i + x_j, y_i + y_j)} \ge \frac{y_i + y_j - \|x - y\|}{\max(x_i + x_j, y_i + y_j)} \ge \frac{y_i + y_j - \|x - y\|}{y_i + y_j + \|x - y\|} \ge \frac{y_i + y_j - 2\|x - y\|}{y_i + y_j} \ge 1 - \frac{\|x - y\|}{\min(y_i, y_j)}.$$
So unconditionally
$$P_{(x,y)}(\text{greedy coupling works on first step}) \ge 1 - \frac{\|x - y\|}{\min_k y_k}. \eqno(13.16)$$
Now the uniform distribution $(Y^{(d)}_1, \dots, Y^{(d)}_d)$ on the simplex has the property (use [133] Exercise 2.6.10 and the fact that the uniform distribution on the simplex is the joint distribution of spacings between $d - 1$ uniform$(0,1)$ variables and the endpoint 1)
$$\text{if constants } a_d > 0 \text{ satisfy } da_d \to 0 \text{ then } P(Y^{(d)}_1 \le a_d) \sim da_d.$$
Since $(Y(t))$ is the stationary chain and $Y^{(d)}_i \stackrel{d}{=} Y^{(d)}_1$,
$$P\left(\min_{1 \le k \le d} Y_k(t) \le d^{-4.5} \text{ for some } t_1 < t \le t_1 + t_2\right) \le t_2\, d\, P(Y^{(d)}_1 < d^{-4.5})$$
and since $t_2 = O(d^2)$ this bound is $o(1)$. In other words
$$P\left(\min_{1 \le k \le d} Y_k(t) \ge d^{-4.5} \text{ for all } t_1 < t \le t_1 + t_2\right) = 1 - o(1) \text{ as } d \to \infty.$$
Combining this with (13.15), (13.16) and the non-increase of $l^1$ distance, we deduce
$$P(\text{greedy coupling works for all } t_1 < t \le t_1 + t_2) = 1 - o(1). \eqno(13.17)$$
Now consider the number $M(t)$ of unmatched coordinates $i$ at time $t \ge t_1$, that is, the number of $i$ with $X_i(t) \ne Y_i(t)$. Provided the greedy coupling works, this number $M(t)$ cannot increase, and it decreases by at least 1 each time two unmatched coordinates are chosen. So we can compare $(M(t_1 + t), t \ge 0)$ with the chain $(N(t), t \ge 0)$ with $N(0) = d$ and
$$P(N(t+1) = m-1 \mid N(t) = m) = \frac{m(m-1)}{d(d-1)} = 1 - P(N(t+1) = m \mid N(t) = m).$$
As in the analysis of the shuffling example, the time $T = \min\{t : N(t) = 1\}$ has $ET = \sum_{m=2}^d \frac{d(d-1)}{m(m-1)} \le d^2$. When the number $M(t)$ goes strictly below 2 it must become 0, and so
$$P(\text{greedy coupling works for all } t_1 < t \le t_1 + t_2,\ X(t_1 + t_2) \ne Y(t_1 + t_2))$$
$$= P(\text{greedy coupling works for all } t_1 < t \le t_1 + t_2,\ M(t_1 + t_2) > 1)$$
$$\le P(T > t_2) \le 1/C.$$
This and (13.17) establish (13.14).
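The bound on $ET$ in fact telescopes exactly: $\sum_{m=2}^d \frac{d(d-1)}{m(m-1)} = d(d-1)\sum_{m=2}^d \left(\frac{1}{m-1} - \frac{1}{m}\right) = d(d-1)\left(1 - \frac{1}{d}\right) = (d-1)^2 \le d^2$. A quick check (illustrative code, not from the text):

```python
def ET_matching(d):
    """E T for the comparison chain N(t): sum_{m=2}^d d(d-1)/(m(m-1)),
    which telescopes to (d-1)^2."""
    return sum(d * (d - 1) / (m * (m - 1)) for m in range(2, d + 1))

for d in (5, 20, 100):
    print(d, ET_matching(d), (d - 1) ** 2)
```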

13.1.5 Compact groups


Parallel to random flights on finite groups, one can discuss discrete-time random flights on classical (continuous) compact groups such as the orthogonal group $O(d)$ of $d \times d$ real orthogonal matrices. For instance, specify a reflection to be an automorphism which fixes the points in some hyperplane, so that a reflection matrix can be written as
$$A = I - 2xx^T$$
where $I$ is the $d \times d$ identity matrix and $x$ is a unit-length vector in $R^d$. Assigning to $x$ the Haar measure on the $(d-1)$-sphere creates a uniform random reflection, and a sequence of uniform random reflections defines a random flight on $O(d)$. Porod [285] shows that the variation threshold satisfies
$$\tau_1 \sim \tfrac{1}{2} d \log d$$
and that the cut-off phenomenon occurs. The result, and its proof via group representation theory, are reminiscent of card-shuffling via random transpositions (Chapter 7 Example 18).
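Generating a uniform random reflection is straightforward: take $x$ to be a normalized standard Gaussian vector, which is Haar-uniform on the sphere. An illustrative check (our code) that $A = I - 2xx^T$ is orthogonal with determinant $-1$:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # Haar-uniform point on the (d-1)-sphere
A = np.eye(d) - 2.0 * np.outer(x, x)      # reflection fixing the hyperplane x-perp

print(np.allclose(A @ A.T, np.eye(d)))    # orthogonal
print(np.linalg.det(A))                   # approximately -1: A reverses orientation
```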

13.1.6 Brownian motion on a fractal set


Constructions and properties of analogs of Brownian motion taking values
in fractal subsets of Rd have been studied in great detail over the last 15
years. Since these processes are most easily viewed as limits of random
walks on graphs, we shall say a little about the simplest example. The
figure illustrates the first two stages of the construction of the well-known
Sierpinski gasket.

[Figure: the graphs $G_1$ and $G_2$, the first two stages of the Sierpinski gasket construction, with corner vertices $0, a_1, a_2$ and midpoint vertices $b_1, b_2, b_3$.]

In the topological setting one may regard $G_d$ as a closed subset of $R^2$, that is as a set of line segments, and then the closure of $\cup_{d=1}^\infty G_d$ is the Sierpinski gasket $G$ (this is equivalent to the usual construction by "cutting out middle triangles"). In the graph setting, regard $G_d$ as a graph and write $(X^{(d)}_t, t = 0, 1, 2, \dots)$ for discrete-time random walk on $G_d$ started at point 0. Let $M_d$ be the number of steps of $X^{(d)}$ until first hitting point $a_1$ or $a_2$. Using symmetry properties of the graphs, there is a simple relationship between the distributions of $M_1$ and $M_2$. For the walk on $G_2$, the length of the time segment until first hitting $b_1$ or $b_2$ is distributed as $M_1$; successive segments (periods until next hitting one of $\{0, a_1, a_2, b_1, b_2, b_3\}$ other than the current one) are like successive steps of the walk on $G_1$, so the number of segments is distributed as $M_1$. Using the same argument for general $d$ gives

Md is distributed as the d’th generation size in a Galton-Watson


branching process with 1 individual in generation 0 and offspring
distributed as M1 .

It is easy to calculate $EM_1 = 5$; indeed the distribution of $M_1$ is determined by its generating function, which can be calculated to be $Ez^{M_1} = z^2/(4 - 3z)$.
So $EM_d = 5^d$. This suggests the existence of a limit process on $G$ after

rescaling time, that is a limit

$$(X^{(d)}_{\lfloor 5^d t \rfloor}, 0 \le t < \infty) \stackrel{d}{\to} (X^{(\infty)}_t, 0 \le t < \infty).$$

In fact we can be more constructive. Branching process theory ([133] Example 4.4.1) shows that $M_d/5^d \stackrel{d}{\to} W$ where $EW = 1$ and where $W$ has the self-consistency property
$$\sum_{i=1}^{M} W_i \stackrel{d}{=} 5W \eqno(13.18)$$

where $(M; W_1, W_2, \dots)$ are independent, $M \stackrel{d}{=} M_1$ and $W_i \stackrel{d}{=} W$. Now in the topological setting, the vertices of $G_d$ are a subset of $G$. Let $\tilde X^{(d)}_t$ be the process on $G_d \subset G$ whose sequence of jumps is as the jumps of the discrete-time walk $X^{(d)}$ but where the times between jumps are independent with distribution $5^{-d}W$. Using (13.18) we can construct the processes $\tilde X^{(d)}$ jointly for all $d$ such that the process $\tilde X^{(d)}$, watched only at the times of hitting (successively distinct) points of $G_{d-1}$, is exactly the process $\tilde X^{(d-1)}$. These coupled processes specify a process $X^{(\infty)}_t$ on $G$ at a random subset of times $t$. It can be shown that this random subset is dense and that sample paths extend continuously to all $t$, and it is natural to call $X^{(\infty)}$ Brownian motion on the Sierpinski gasket.
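The value $EM_1 = 5$ is easy to corroborate by simulation (our illustrative code; the adjacency list below is our reconstruction of $G_1$, whose three small triangles are $(0, b_1, b_2)$, $(a_1, b_1, b_3)$ and $(a_2, b_2, b_3)$):

```python
import random

random.seed(4)

# Adjacency of G_1: corners 0, a1, a2 and side-midpoints b1, b2, b3.
adj = {
    "0":  ["b1", "b2"],
    "a1": ["b1", "b3"],
    "a2": ["b2", "b3"],
    "b1": ["0", "b2", "a1", "b3"],
    "b2": ["0", "b1", "a2", "b3"],
    "b3": ["a1", "b1", "a2", "b2"],
}

def M1():
    """Steps of the walk on G_1, started at 0, until first hitting a1 or a2."""
    v, steps = "0", 0
    while v not in ("a1", "a2"):
        v = random.choice(adj[v])
        steps += 1
    return steps

n = 100000
est = sum(M1() for _ in range(n)) / n
print(est)   # close to EM1 = 5
```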

13.2 Infinite graphs


There is a huge research literature concerning random walks on infinite dis-
crete groups, and more generally on infinite graphs, and the recent mono-
graph of Woess [339] provides an in-depth treatment. This section focuses
narrowly on two aspects of an issue not emphasized in [339]: what does
study of random walk on infinite graphs tell us about random walks on
finite graphs? One aspect of this issue is that random walks on certain spe-
cific infinite graphs may be used to get approximations or inequalities for
random walks on specific finite graphs. We treat three examples.

• The infinite lattice $Z^d$ as an approximation to the discrete torus $Z^d_N$ for large $N$ (section 13.2.4).

• The infinite degree-r tree Tr and bounds for r-regular expander graphs
of large size (section 13.2.6).

• The hierarchical tree $T^{\mathrm{hier}}_r$ as an approximation to balanced $(r-1)$-ary trees (section 13.2.9).

The second aspect concerns properties such as transience, non-trivial bound-


ary, and “spectral radius < 1”, which have been well-studied as qualitative
properties which an infinite-state chain either possesses or does not possess.
What are the quantitative finite-state analogs of such properties? Here ac-
tual theorems are scarce; we present conceptual discussion in sections 13.2.3
and 13.2.10 as a spur to future research.

13.2.1 Set-up
We assume the reader has some acquaintance with classical theory (e.g., [133]
Chapter 5) for a countable-state irreducible Markov chain, which emphasizes
the trichotomy transient or null-recurrent or positive-recurrent. We use the
phrase general chain to refer to the case of an arbitrary irreducible transition
matrix P, without any reversibility assumption.
Recall from Chapter 3 section 2 the identification, in the finite-state setting, of reversible chains and random walks on weighted graphs. Given a reversible chain we defined edge-weights $w_{ij} = \pi_i p_{ij} = \pi_j p_{ji}$; conversely, given edge-weights we defined random walk as the reversible chain
$$p_{vx} = w_{vx}/w_v; \quad w_v = \sum_x w_{vx}. \eqno(13.19)$$

In the infinite setting it is convenient (for reasons explained below) to take the "weighted graph" viewpoint. Thus the setting of this section is that we are given a connected weighted graph satisfying
$$w_v \equiv \sum_x w_{vx} < \infty\ \ \forall v, \qquad \sum_v w_v = \infty \eqno(13.20)$$
and we study the associated random walk $(X_t)$, i.e., the discrete-time chain with $p_{vx} = w_{vx}/w_v$. So in the unweighted setting ($w_e \equiv 1$), we have nearest-neighbor random walk on a locally finite, infinite graph.
To explain why we adopt this set-up, say $\pi$ is invariant for $\mathbf{P}$ if
$$\sum_i \pi_i p_{ij} = \pi_j\ \ \forall j; \qquad \pi_j > 0\ \ \forall j.$$
Consider asymmetric random walk on $Z$, say
$$p_{i,i+1} = 2/3, \quad p_{i,i-1} = 1/3; \quad -\infty < i < \infty. \eqno(13.21)$$



One easily verifies that each of the two measures $\pi_i = 1$ and $\pi_i = 2^i$ is invariant. Such nonuniqueness makes it awkward to seek to define reversibility of $\mathbf{P}$ via the detailed balance equations
$$\pi_i p_{ij} = \pi_j p_{ji}\ \ \forall i, j \eqno(13.22)$$
without a prior definition of $\pi$. Stating definitions via weighted graphs avoids this difficulty.
The second assumption in (13.20), that $\sum_v w_v = \infty$, excludes the positive-recurrent case (see Theorem 13.4 below); because in that case the questions one asks, such as whether the relaxation time $\tau_2$ is finite, can be analyzed by the same techniques as in the finite-state setting.
Our intuitive interpretation of "reversible" in Chapter 3 was "a movie of the chain looks the same run forwards or run backwards". But the chain corresponding to the weighted graph with weights $w_{i,i+1} = 2^i$, which is the chain (13.21) with $\pi_i = 2^i$, has a particle moving towards $+\infty$ and so certainly doesn't satisfy this intuitive notion. On the other hand, a probabilistic interpretation of an infinite invariant measure $\pi$ is that if we start at time 0 with independent Poisson($\pi_v$) numbers of particles at vertices $v$, and let the particles move independently according to $\mathbf{P}$, then the particle process is stationary in time. So the detailed balance equations (13.22) correspond to the intuitive "movie" notion of reversibility for the infinite particle process, rather than for a single chain.
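Both invariant measures for (13.21) are immediate to check numerically (our illustrative code): the balance equation $\pi_j = \frac{2}{3}\pi_{j-1} + \frac{1}{3}\pi_{j+1}$ holds both for $\pi_i \equiv 1$ and for $\pi_i = 2^i$.

```python
# Check pi_j = (2/3) pi_{j-1} + (1/3) pi_{j+1} for the asymmetric walk (13.21),
# for both invariant measures pi_i = 1 and pi_i = 2^i.
for pi in (lambda i: 1.0, lambda i: 2.0 ** i):
    for j in range(-10, 11):
        flow_in = (2.0 / 3.0) * pi(j - 1) + (1.0 / 3.0) * pi(j + 1)
        assert abs(flow_in - pi(j)) < 1e-9 * max(1.0, pi(j))
print("both measures are invariant for (13.21)")
```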

13.2.2 Recurrence and Transience


The next Theorem summarizes parts of the standard theory of general chains (e.g., [133] Chapter 5). Write $\rho_v := P_v(T^+_v < \infty)$ and let $N_v(\infty)$ be the total number of visits (including time 0) to $v$.

Theorem 13.4 For a general chain, one of the following alternatives holds.
Recurrent: $\rho_v = 1$ and $E_v N_v(\infty) = \infty$ and $P_v(N_w(\infty) = \infty) = 1$ for all $v, w$.
Transient: $\rho_v < 1$ and $E_v N_v(\infty) < \infty$ and $P_v(N_w(\infty) < \infty) = 1$ for all $v, w$.
In the recurrent case there exists an invariant measure $\pi$, unique up to constant multiples, and the chain is either
positive-recurrent: $E_v T^+_v < \infty\ \forall v$ and $\sum_v \pi_v < \infty$; or
null-recurrent: $E_v T^+_v = \infty\ \forall v$ and $\sum_v \pi_v = \infty$.
In the transient and null-recurrent cases, $P_v(X_t = w) \to 0$ as $t \to \infty$ for all $v, w$.

Specializing to random walk on a weighted graph, the measure $(w_v)$ is invariant, and the second assumption in (13.20) implies that the walk cannot be positive-recurrent. By a natural abuse of language we call the weighted graph recurrent or transient. Because $E_v N_v(\infty) = \sum_t p^{(t)}_{vv}$, Theorem 13.4 contains the "classical" method to establish transience or recurrence by considering the $t \to \infty$ behavior of $p^{(t)}_{vv}$. This method works easily for random walk on $Z^d$ (section 13.2.4).
Some of the "electrical network" story from Chapter 3 extends immediately to the infinite setting. Recall the notion of a flow $f$, and the net flow $f(x)$ out of a vertex $x$. Say $f$ is a unit flow from $x$ to infinity if $f(x) = 1$ and $f(v) = 0\ \forall v \ne x$. Thompson's principle (Chapter 3 Proposition 35) extends to the infinite setting, by considering subsets $A_n \downarrow \emptyset$ (the empty set) with $A^c_n$ finite.

Theorem 13.5 Consider a weighted graph satisfying (13.20). For each $v$,
$$\inf\left\{\tfrac{1}{2} \sum_e f_e^2/w_e : f \text{ a unit flow from } v \text{ to infinity}\right\} = \frac{1}{w_v(1 - \rho_v)}.$$
In particular, the random walk is transient iff for some (all) $v$ there exists a unit flow $f$ from $v$ to infinity such that $\sum_e f_e^2/w_e < \infty$.

By analogy with the finite setting, we can regard the inf as the effective resistance between $v$ and infinity, although (see section ??) we shall not attempt an axiomatic treatment of infinite electrical networks.
Theorem 13.5 has the following immediate corollary: of course (a) and
(b) are logically equivalent.
Corollary 13.6 (a) If a weighted graph is recurrent, then so is any sub-
graph.
(b) To show that a weighted graph is transient, it suffices to find a transient
subgraph.
Thus the classical fact that $Z^2$ is recurrent implies that any subgraph of $Z^2$ is recurrent, a fact which is hard to prove by bounding $t$-step transition probabilities. In the other direction, it is possible (but not trivial) to prove that $Z^3$ is transient by exhibiting a flow: indeed Doyle and Snell [131] construct a transient tree-like subgraph of $Z^3$.
Here is a different formulation of the same idea.
Corollary 13.7 The return probability ρv = Pv (Tv+ < ∞) cannot increase
if a new edge (not incident at v) is added, or the weight of an existing edge
(not incident at v) is increased.

13.2.3 The finite analog of transience


Recall the mean hitting time parameter $\tau_0$ from Chapter 4. For a sequence of $n$-state reversible chains, consider the property
$$n^{-1}\tau_0(n) \text{ is bounded as } n \to \infty. \eqno(13.23)$$

We assert, as a conceptual paradigm, that property (13.23) is the analog of


the “transient” property for a single infinite-state chain. The connection is
easy to see algebraically for symmetric chains (Chapter 7), where $\tau_0 = E_\pi T_v$ for each $v$, so that by Chapter 2 Lemma 10
$$n^{-1}\tau_0 = Z_{vv} = \sum_{t=0}^\infty \left(p_{vv}(t) - n^{-1}\right).$$
The boundedness (in $n$) of this sum is a natural analog of the transience condition
$$\sum_{t=0}^\infty p^{(t)}_{vv} < \infty$$
for a single infinite-state chain. So in principle the methods used to determine transience or recurrence in the infinite-state case ([339] Chapter 1)
should be usable to determine whether property (13.23) holds for finite fami-
lies, and indeed Proposition 37 of Chapter 3 provides a tool for this purpose.
In practice these extremal methods haven’t yet proved very successful; early
papers [85] proved (13.23) for expanders in this way, but other methods are
easier (see our proof of Chapter 9 Theorem 1). There is well-developed theory ([339] section 6) which establishes recurrence for infinite planar graphs under mild assumptions. It is natural to conjecture that under similar assumptions, a planar $n$-vertex graph has $\tau_0 = \Theta(n \log n)$, as in the case of $Z^2$ in Proposition 13.8 below.

13.2.4 Random walk on $Z^d$

We consider the lattice $Z^d$ as an infinite $2d$-regular unweighted graph. Write $X_t$ for simple random walk on $Z^d$, and write $\tilde X_t$ for the continuized random walk. Of course, general random flights (i.e. "random walks", in everyone's
terminology except ours) and their numerous variations comprise a well-
studied classical topic in probability theory. See Hughes [184] for a wide-
ranging intermediate-level treatment, emphasizing physics applications. Our
discussion here is very narrow, relating to topics treated elsewhere in this
book.

To start some calculations, for $d = 1$ consider
$$\tilde p(t) \equiv P_0(\tilde X_t = 0) = P(J^+_t = J^-_t), \text{ where } J^+_t \text{ and } J^-_t \text{ are the independent Poisson}(t/2) \text{ numbers of } +1 \text{ and } -1 \text{ jumps}$$
$$= \sum_{n=0}^\infty \left(\frac{e^{-t/2}(t/2)^n}{n!}\right)^2 = e^{-t} I_0(t)$$
where $I_0(t) := \sum_{n=0}^\infty \frac{(t/2)^{2n}}{(n!)^2}$ is the modified Bessel function of the first kind of order 0. Now $\operatorname{var} \tilde X_t = t$, and as a consequence of the local CLT (or by quoting asymptotics of the Bessel function $I_0$) we have
$$\tilde p(t) \sim (2\pi t)^{-1/2} \text{ as } t \to \infty. \eqno(13.24)$$
As discussed in Chapter 4 section 6.2 and Chapter 5 Example 17, a great advantage of working in continuous time in dimensions $d \ge 2$ is that the coordinate processes are independent copies of slowed-down one-dimensional processes, so that $\tilde p^{(d)}(t) \equiv P_0(\tilde X_t = 0)$ in dimension $d$ satisfies
$$\tilde p^{(d)}(t) = (\tilde p(t/d))^d = e^{-t}(I_0(t/d))^d. \eqno(13.25)$$
In particular, from (13.24),
$$\tilde p^{(d)}(t) \sim \left(\tfrac{d}{2\pi}\right)^{d/2} t^{-d/2} \text{ as } t \to \infty. \eqno(13.26)$$
One can do a similar analysis in the discrete-time case. In dimension $d = 1$,
$$p(t) \equiv P_0(X_t = 0) = 2^{-t}\binom{t}{t/2}, \ t \text{ even}$$
$$\sim 2(2\pi t)^{-1/2} \text{ as } t \to \infty, \ t \text{ even}. \eqno(13.27)$$
This agrees with (13.26) but with an extra "artificial" factor of 2 arising from periodicity. A more tedious argument gives the analog of (13.26) in discrete time for general $d$:
$$p^{(d)}(t) \sim 2\left(\tfrac{d}{2\pi}\right)^{d/2} t^{-d/2} \text{ as } t \to \infty, \ t \text{ even}. \eqno(13.28)$$
From the viewpoint of classical probability, one can regard (13.26) and (13.28) as the special case $j = 0$ of the local CLT: in continuous time in dimension $d$,
$$\sup_j \left| P_0(\tilde X_t = j) - \left(\tfrac{d}{2\pi}\right)^{d/2} t^{-d/2} \exp(-d|j|^2/(2t)) \right| = o(t^{-d/2}) \text{ as } t \to \infty$$
where $|j|$ denotes Euclidean norm.


The occupation time N0 (t) satisfies E0 N0 (t) = 0t p̃(s) ds (continuous
R
Pt−1
time) and = s=0 p(s) (discrete time). In either case, as t → ∞,
q
(d = 1) E0 N0 (t) ∼ 2
π t1/2 (13.29)
1
(d = 2) E0 N0 (t) ∼ π log t (13.30)
Z ∞
(d ≥ 3) E0 N0 (t) → Rd ≡ p̃(d) (t)dt
Z ∞0
= e−t (I0 (t/d))d dt (13.31)
0

where Rd < ∞ for d ≥ 3 by (13.26). This is the classical argument for


establishing transience in d ≥ 3 and recurrence in d ≤ 2, by applying The-
orem 13.4. Note that the return probability ρ(d) := P0 (T0+ < ∞) is related
to E0 N0 (∞) by E0 N0 (∞) = 1−ρ1 (d) ; in other words

Rd − 1
ρ(d) = , d ≥ 3.
Rd
Textbooks sometimes give the impression that calculating ρ(d) is hard, but
one can just calculate numerically the integral (13.31). Or see [174] for a
table.
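For instance, for $d = 3$ the integral gives Pólya's return probability. A numpy-only sketch (our illustration, not from the text): evaluate (13.31) in log space, since $I_0(t/3)^3$ on its own overflows long before the integrand is negligible, and add the $t^{-3/2}$ tail from the asymptotic (13.26) analytically.

```python
import numpy as np

# R_3 = int_0^infty e^{-t} I_0(t/3)^3 dt, computed as exp(-t + 3 log I_0(t/3)).
t = np.linspace(0.0, 2000.0, 400001)
f = np.exp(-t + 3.0 * np.log(np.i0(t / 3.0)))
R3 = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))   # trapezoid rule on [0, 2000]

# Beyond t = 2000 the integrand is ~ (3/(2 pi))^{3/2} t^{-3/2} by (13.26),
# whose tail integrates to 2 (3/(2 pi))^{3/2} T^{-1/2}; add it analytically.
R3 += 2.0 * (3.0 / (2.0 * np.pi)) ** 1.5 / np.sqrt(2000.0)

rho3 = (R3 - 1.0) / R3
print(R3, rho3)   # roughly 1.516 and 0.3405 (Polya's return probability)
```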
The quantity ρ(d) has the following sample path interpretation. Let Vt
be the number of distinct vertices visited by the walk before time t. Then

t−1 Vt → 1 − ρ(d) a.s. , d ≥ 3. (13.32)

The proof of this result is a textbook application of the ergodic theorem for
stationary processes: see [133] Theorem 6.3.1.

13.2.5 The torus $Z^d_m$

We now discuss how random walk on $Z^d$ relates to $m \to \infty$ asymptotics for random walk on the finite torus $Z^d_m$, discussed in Chapter 5. We now use superscript $(m)$ to denote the length parameter. From Chapter 5 Example 17 we have
$$\tau_2^{(m)} = \frac{d}{1 - \cos(2\pi/m)} \sim \frac{dm^2}{2\pi^2}$$
$$\tau_1^{(m)} = \Theta(m^2) \eqno(13.33)$$
where asymptotics are as $m \to \infty$ for fixed $d$. One can interpret this as a consequence of the $dm^2$ time rescaling in the weak convergence of rescaled

random walk to Brownian motion on the $d$-dimensional torus, for which (cf. sections 13.1.1 and 13.1.2) $\tau_2 = \frac{1}{2}\pi^{-2}$. At (74)–(75) of Chapter 5 we saw that the eigentime identity gave an exact formula for the mean hitting time parameter $\tau_0^{(m)}$, whose asymptotics are, for $d \ge 3$,
$$m^{-d}\tau_0^{(m)} \to \hat R_d \equiv \int_0^1 \cdots \int_0^1 \frac{dx_1 \cdots dx_d}{\frac{1}{d}\sum_{u=1}^d (1 - \cos(2\pi x_u))} < \infty. \eqno(13.34)$$
Here we give an independent analysis of this result, and of the case $d = 2$.

Proposition 13.8
$$(d = 1)\quad \tau_0^{(m)} \sim \tfrac{1}{6} m^2 \eqno(13.35)$$
$$(d = 2)\quad \tau_0^{(m)} \sim \tfrac{2}{\pi}\, m^2 \log m \eqno(13.36)$$
$$(d \ge 3)\quad \tau_0^{(m)} \sim R_d\, m^d \eqno(13.37)$$
for $R_d$ defined by (13.31). In particular, the expressions for $R_d$ and $\hat R_d$ at (13.31) and (13.34) are equal, for $d \ge 3$.

The $d = 1$ result is from Chapter 5 (26). We now prove the other cases.
Proof. We may construct the continuized random walk $\tilde X^{(m)}_t$ on $Z^d_m$ from the continuized random walk $\tilde X_t$ on $Z^d$ by
$$\tilde X^{(m)}_t = \tilde X_t \bmod m \eqno(13.38)$$
and then $P_0(\tilde X^{(m)}_t = 0) \ge P_0(\tilde X_t = 0)$. So
$$m^{-d}\tau_0^{(m)} = \int_0^\infty \left(P_0(\tilde X^{(m)}_t = 0) - m^{-d}\right) dt \quad \text{(Chapter 2, Corollary 12 and (8))}$$
$$= \int_0^\infty \left(P_0(\tilde X^{(m)}_t = 0) - m^{-d}\right)^+ dt \quad \text{by complete monotonicity}$$
$$\ge \int_0^\infty \left(P_0(\tilde X_t = 0) - m^{-d}\right)^+ dt \eqno(13.39)$$
$$\to \int_0^\infty P_0(\tilde X_t = 0)\,dt = R_d.$$

Consider the case $d \ge 3$. To complete the proof, we need the corresponding upper bound, for which it is sufficient to show
$$\int_0^\infty \left(P_0(\tilde X^{(m)}_t = 0) - m^{-d} - P_0(\tilde X_t = 0)\right)^+ dt \to 0 \text{ as } m \to \infty. \eqno(13.40)$$

To verify (13.40) without detailed calculations, we first establish a one-dimensional bound
$$(d = 1)\quad \tilde p^{(m)}(t) \le \tfrac{1}{m} + \tilde p(t). \eqno(13.41)$$
To obtain (13.41) we appeal to a coupling construction (the reflection coupling, described in continuous space in section 13.1.3 -- the discrete-space setting here is similar) which shows that continuized random walks $\tilde X^{(m)}, \tilde Y^{(m)}$ on $Z_m$ with $\tilde X^{(m)}_0 = 0$ and $\tilde Y^{(m)}_0$ distributed uniformly can be coupled so that
$$\tilde Y^{(m)}_t = 0 \text{ on the event } \{\tilde X^{(m)}_t = 0,\ T \le t\}$$
where $T$ is the first time that $\tilde X^{(m)}$ goes distance $\lfloor m/2 \rfloor$ from 0. And by considering the construction (13.38)
$$P(\tilde X^{(m)}_t = 0) \le P(\tilde X_t = 0) + P(\tilde X^{(m)}_t = 0,\ T \le t)$$
and (13.41) follows, since $P(\tilde Y^{(m)}_t = 0) = 1/m$.
Since the $d$-dimensional probabilities relate to the one-dimensional probabilities via $P_0(\tilde X^{(m)}_t = 0) = \left(\tilde p^{(m)}(t/d)\right)^d$ and similarly on the infinite lattice, we can use inequality (13.41) to bound the integrand in (13.40) as follows.
$$P_0(\tilde X^{(m)}_t = 0) - m^{-d} - P_0(\tilde X_t = 0)$$
$$\le \left(\tfrac{1}{m} + \tilde p(t/d)\right)^d - m^{-d} - (\tilde p(t/d))^d$$
$$= \sum_{j=1}^{d-1} \binom{d}{j} (\tilde p(t/d))^j \left(\tfrac{1}{m}\right)^{d-j}$$
$$= \frac{\tilde p(t/d)}{m} \sum_{j=1}^{d-1} \binom{d}{j} (\tilde p(t/d))^{j-1} \left(\tfrac{1}{m}\right)^{d-1-j}$$
$$\le \frac{\tilde p(t/d)}{m} \sum_{j=1}^{d-1} \binom{d}{j} \left[\max\left(\tilde p(t/d), \tfrac{1}{m}\right)\right]^{d-2}$$
$$= (2^d - 2)\, \frac{\tilde p(t/d)}{m} \max\left((\tilde p(t/d))^{d-2}, \left(\tfrac{1}{m}\right)^{d-2}\right)$$
$$\le (2^d - 2)\, \frac{\tilde p(t/d)}{m} \left[(\tilde p(t/d))^{d-2} + \left(\tfrac{1}{m}\right)^{d-2}\right]$$
$$= (2^d - 2) \left[\frac{(\tilde p(t/d))^{d-1}}{m} + \frac{\tilde p(t/d)}{m^{d-1}}\right].$$

The fact (13.24) that $\tilde p(t) = \Theta(t^{-1/2})$ for large $t$ easily implies that the integral in (13.40) over $0 \le t \le m^3$ tends to zero. But by (13.33) and submultiplicativity of $\bar d(t)$,
$$0 \le P_0(\tilde X^{(m)}_t = 0) - m^{-d} \le d(t) \le \bar d(t) \le B_1 \exp\left(-\frac{t}{B_2 m^2}\right) \eqno(13.42)$$
where $B_1, B_2$ depend only on $d$. This easily implies that the integral in (13.40) over $m^3 \le t < \infty$ tends to zero, completing the proof of (13.37).
In the case $d = 2$, we fix $b > 0$ and truncate the integral in (13.39) at $bm^2$ to get
$$m^{-2}\tau_0^{(m)} \ge -b + \int_0^{bm^2} P_0(\tilde X_t = 0)\,dt$$
$$= -b + (1 + o(1))\, \tfrac{1}{\pi} \log(bm^2) \quad \text{by (13.30)}$$
$$= (1 + o(1))\, \tfrac{2}{\pi} \log m.$$
Therefore
$$\tau_0^{(m)} \ge (1 + o(1))\, \tfrac{2}{\pi}\, m^2 \log m.$$
For the corresponding upper bound, since $\int_0^{m^2} P_0(\tilde X_t = 0)\,dt \sim \frac{2}{\pi} \log m$ by (13.30), and $m^{-2}\tau_0^{(m)} = \int_0^\infty \left(P_0(\tilde X^{(m)}_t = 0) - m^{-2}\right) dt$, it suffices to show that
$$\int_0^{m^2} \left(P_0(\tilde X^{(m)}_t = 0) - m^{-2} - P_0(\tilde X_t = 0)\right)^+ dt + \int_{m^2}^\infty \left(P_0(\tilde X^{(m)}_t = 0) - m^{-2}\right)^+ dt = o(\log m). \eqno(13.43)$$
To bound the first of these two integrals, we observe from (13.41) that $P_0(\tilde X^{(m)}_t = 0) \le (m^{-1} + \tilde p(t/2))^2$, and so the integrand is bounded by $\frac{2}{m}\tilde p(t/2)$. Using (13.24), the first integral is $O(1) = o(\log m)$. To analyze the second integral in (13.43) we consider separately the ranges $m^2 \le t \le m^2 \log^{3/2} m$ and $m^2 \log^{3/2} m \le t < \infty$. Over the first range, we again use (13.41) to bound the integrand by $\frac{2}{m}\tilde p(t/2) + (\tilde p(t/2))^2$. Again using (13.24), the integral is bounded by
$$(1 + o(1))\, \frac{2}{\pi^{1/2} m} \int_{m^2}^{m^2 \log^{3/2} m} t^{-1/2}\,dt \;+\; (1 + o(1))\, \pi^{-1} \int_{m^2}^{m^2 \log^{3/2} m} t^{-1}\,dt$$
$$= \Theta(\log^{3/4} m) + \Theta(\log\log m) = o(\log m).$$


To bound the integral over the second range, we use (13.42) and find
$$\int_{m^2 \log^{3/2} m}^\infty \left(P_0(\tilde X^{(m)}_t = 0) - m^{-2}\right) dt \le B_1 B_2 m^2 \exp\left(-\frac{\log^{3/2} m}{B_2}\right) = o(1) = o(\log m). \quad \square$$
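The identification $R_d = \hat R_d$ can be corroborated numerically: for the continuized walk on $Z^3_m$, the eigentime identity gives $\tau_0^{(m)} = \sum_{k \ne 0} 1/\lambda_k$ with $\lambda_k = 1 - \frac{1}{3}\sum_u \cos(2\pi k_u/m)$, and $m^{-3}\tau_0^{(m)}$ is a Riemann sum for $\hat R_3$. An illustrative sketch (our code; the limit is $R_3 \approx 1.516$):

```python
import numpy as np

m = 100
c = np.cos(2.0 * np.pi * np.arange(m) / m)
# Eigenvalues lambda_k = 1 - (cos k_1 + cos k_2 + cos k_3)/3 over Z_m^3.
lam = 1.0 - (c[:, None, None] + c[None, :, None] + c[None, None, :]) / 3.0
lam[0, 0, 0] = np.inf              # exclude the k = 0 eigenvalue
tau0_scaled = np.sum(1.0 / lam) / m ** 3
print(tau0_scaled)                 # approaches R_3 = 1.516... as m grows
```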

13.2.6 The infinite degree-$r$ tree

Fix $r \ge 3$ and write $T_r$ for the infinite tree of degree $r$. We picture $T_r$ as a "family tree", where the root $\phi$ has $r$ children, and each other vertex has one parent and $r - 1$ children. Being a vertex-transitive graph (recall Chapter 7 section 1.1; for $r$ even, $T_r$ is the Cayley graph of the free group on $r/2$ generators), one can study many more general "random flights" on $T_r$ (see Notes), but we shall consider only the simple random walk $(X_t)$.
We can get some information about the walk without resorting to calculations. The "depth" process $d(X_t, \phi)$ is clearly the "reflecting asymmetric random walk" on $Z^+ := \{0, 1, 2, \dots\}$ with
$$p_{0,1} = 1; \quad p_{i,i-1} = 1/r; \quad p_{i,i+1} = (r-1)/r, \ i \ge 1.$$

By comparison with asymmetric random walk on all of $Z$, which has drift $(r-2)/r$, we see that
$$t^{-1} d(X_t, \phi) \to \frac{r-2}{r} \text{ a.s. as } t \to \infty. \eqno(13.44)$$
In particular, the number of returns to $\phi$ is finite and so the walk is transient. Now consider the return probability $\rho = P_\phi(T^+_\phi < \infty)$ and note that (by considering the first step) $\rho = P_\phi(T_c < \infty)$ where $c$ is a child of $\phi$. Considering the first two steps, we obtain the equation $\rho = \frac{1}{r} + \frac{r-1}{r}\rho^2$, and since by transience $\rho < 1$, we see that
$$\rho := P_\phi(T^+_\phi < \infty) = P_\phi(T_c < \infty) = \frac{1}{r-1}. \eqno(13.45)$$
So
$$E_\phi N_\phi(\infty) = \frac{1}{1-\rho} = \frac{r-1}{r-2}. \eqno(13.46)$$
As at (13.32), $\rho$ has a sample path interpretation: the number $V_t$ of distinct vertices visited by the walk before time $t$ satisfies
$$t^{-1} V_t \to 1 - \rho = \frac{r-2}{r-1} \text{ a.s. as } t \to \infty.$$

By transience, amongst the children of φ there is some vertex L1 which


is visited last by the walk; then amongst the children of L1 there is some
vertex L2 which is visited last by the walk; and so on, to define a “path
to infinity” φ = L0 , L1 , L2 , . . .. By symmetry, given L1 , L2 , . . . , Li−1 the
conditional distribution of Li is uniform over the children of Li−1 , so in the
natural sense we can describe (Li ) as the uniform random path to infinity.

13.2.7 Generating function arguments

While the general qualitative behavior of random walk on $T_r$ is clear from the arguments above, more precise quantitative estimates are most naturally obtained via generating function arguments. For any state $i$ of a Markov chain, the generating functions $G_i(z) := \sum_{t=0}^\infty P_i(X_t = i)\,z^t$ and $F_i(z) := \sum_{t=1}^\infty P_i(T^+_i = t)\,z^t$ are related by
$$G_i = 1 + F_i G_i \eqno(13.47)$$
(this is a small variation on Chapter 2 Lemma 19). Consider simple symmetric reflecting random walk on $Z^+$. Clearly
$$G_0(z) = \sum_{t=0}^\infty 2^{-2t}\binom{2t}{t} z^{2t} = (1 - z^2)^{-1/2},$$

the latter identity being the series expansion of $(1-x)^{-1/2}$. So by (13.47)
$$F_0(z) := \sum_{t=1}^\infty P_0(T^+_0 = 2t)\, z^{2t} = 1 - (1 - z^2)^{1/2}.$$

Consider an excursion of length $2t$, that is, a path $(0 = i_0, i_1, \dots, i_{2t-1}, i_{2t} = 0)$ with $i_j > 0$, $1 \le j \le 2t-1$. This excursion has chance $2^{1-2t}$ for the symmetric walk on $Z^+$, and has chance $((r-1)/r)^{t-1}(1/r)^t$ for the asymmetric walk $d(X_t, \phi)$. So
$$\frac{P_\phi(T^+_\phi = 2t)}{P_0(T^+_0 = 2t)} = \frac{r}{2(r-1)} \left(\frac{4(r-1)}{r^2}\right)^t$$
where the numerator refers to simple RW on the tree, and the denominator refers to simple symmetric reflecting RW on $Z^+$. So on the tree,
$$F_\phi(z) = \frac{r}{2(r-1)}\, F_0\!\left(\frac{2\sqrt{r-1}}{r}\, z\right) = \frac{r}{2(r-1)} \left(1 - \left(1 - \frac{4(r-1)z^2}{r^2}\right)^{1/2}\right).$$

Then (13.47) gives an expression for G_φ(z) which simplifies to

G_φ(z) = 2(r − 1) / (r − 2 + (r^2 − 4(r − 1)z^2)^{1/2}).  (13.48)

In particular, G_φ has radius of convergence 1/β, where

β = 2 (r − 1)^{1/2} / r < 1.  (13.49)
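The simplification (13.48) can be checked mechanically: expand the closed form as a power series (binomial expansion of the square root, then naive series inversion of the denominator) and compare with t-step return probabilities computed directly from the distance chain. The following sketch is our own illustration, not code from the text; `tmax = 20` is an arbitrary choice.

```python
def p00_chain(r, tmax):
    """P_phi(X_t = phi) for t = 0..tmax via the distance chain on Z+:
    from h >= 1 go up with probability (r-1)/r, down with probability 1/r;
    from 0 the walk is forced to 1."""
    M = tmax + 2
    dist = [0.0] * M
    dist[0] = 1.0
    out = [1.0]
    for _ in range(tmax):
        new = [0.0] * M
        new[1] += dist[0]
        for h in range(1, M - 1):
            new[h + 1] += dist[h] * (r - 1) / r
            new[h - 1] += dist[h] / r
        dist = new
        out.append(dist[0])
    return out

def p00_series(r, tmax):
    """Taylor coefficients of G_phi(z) = 2(r-1)/(r-2 + sqrt(r^2-4(r-1)z^2)),
    via the expansion of (1-u)^{1/2} and term-by-term series inversion."""
    n = tmax // 2 + 1
    c = 4.0 * (r - 1) / r**2
    s = [1.0]                         # s[k]: coefficient of u^k in (1-u)^{1/2}
    for k in range(1, n):
        s.append(s[-1] * (0.5 - (k - 1)) / k * (-1.0))
    # denominator as a series in z^2: (r-2) + r * sum_k s[k] (c z^2)^k
    D = [(r - 2) + r * s[0]] + [r * s[k] * c**k for k in range(1, n)]
    Q = [1.0 / D[0]]                  # Q = 1/D, computed coefficient by coefficient
    for m in range(1, n):
        Q.append(-sum(D[j] * Q[m - j] for j in range(1, m + 1)) / D[0])
    out = [0.0] * (tmax + 1)          # odd-time return probabilities are 0
    for k in range(n):
        out[2 * k] = 2 * (r - 1) * Q[k]
    return out

# spot check: the two computations agree for r = 3 and r = 4
for r in (3, 4):
    chain, series = p00_chain(r, 20), p00_series(r, 20)
    assert max(abs(x - y) for x, y in zip(chain, series)) < 1e-10
```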
13.2. INFINITE GRAPHS 433

Without going into details, one can now use standard Tauberian arguments to show

P_φ(X_t = φ) ∼ α t^{−3/2} β^t,  t even  (13.50)

for a computable constant α, and this format (for different values of α and β) remains true for more general radially symmetric random flights on Tr ([339] Theorem 19.30). One can also in principle expand (13.48) as a power series to obtain P_φ(X_t = φ). Again we shall not give details, but according to Giacometti [164] one obtains

P_φ(X_t = φ) = ((r − 1)/r) ((r − 1)^{1/2}/r)^t (Γ(1 + t) / (Γ(2 + t/2) Γ(1 + t/2))) × 2F1((t + 1)/2, 1; 2 + t/2; 4(r − 1)/r^2),  t even  (13.51)

where 2F1 is the Gauss hypergeometric function.
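Formula (13.51) can be spot-checked numerically: implement the 2F1 series by brute force (it converges here since 4(r − 1)/r^2 < 1) and compare with the same direct chain computation. This is our own sketch, not code from any of the cited sources; the truncation at 5000 terms is an arbitrary safe choice.

```python
import math

def hyp2f1(a, b, c, z, terms=5000):
    # brute-force Gauss 2F1 partial sum; fine here since 0 < z < 1
    total, term = 0.0, 1.0
    for n in range(terms):
        total += term
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
    return total

def p00_formula(r, t):
    # the closed form (13.51); t must be even
    pref = ((r - 1) / r) * (math.sqrt(r - 1) / r) ** t
    pref *= math.gamma(1 + t) / (math.gamma(2 + t / 2) * math.gamma(1 + t / 2))
    return pref * hyp2f1((t + 1) / 2, 1.0, 2 + t / 2, 4 * (r - 1) / r ** 2)

def p00_direct(r, tmax):
    # P_phi(X_t = phi) via the distance chain, as an independent cross-check
    M = tmax + 2
    dist = [0.0] * M
    dist[0] = 1.0
    out = [1.0]
    for _ in range(tmax):
        new = [0.0] * M
        new[1] += dist[0]
        for h in range(1, M - 1):
            new[h + 1] += dist[h] * (r - 1) / r
            new[h - 1] += dist[h] / r
        dist = new
        out.append(dist[0])
    return out

for r in (3, 4):
    direct = p00_direct(r, 12)
    assert all(abs(p00_formula(r, t) - direct[t]) < 1e-7 for t in range(0, 13, 2))
```

For instance, at t = 0 the formula reduces to 2F1(1/2, 1; 2; z) = 2(1 − (1 − z)^{1/2})/z with z = 4(r − 1)/r^2, which equals r/(r − 1) and correctly gives P_φ(X_0 = φ) = 1.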


Finally, the β at (13.49) can be interpreted as an eigenvalue for the infinite transition matrix (p_ij), so we anticipate a corresponding eigenfunction f_2 with

Σ_j p_ij f_2(j) = β f_2(i)  ∀i,  (13.52)

and one can verify this holds for

f_2(i) := (1 + ((r − 2)/r) i) (r − 1)^{−i/2}.  (13.53)
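The verification of (13.52)–(13.53) is a short computation. Acting on radial functions, the transition operator sends f(i) to ((r − 1)/r) f(i + 1) + (1/r) f(i − 1) for i ≥ 1, and to f(1) for i = 0 (all r neighbours of φ are at distance 1). A quick numerical confirmation (our sketch; the range and tolerance are arbitrary choices):

```python
import math

def check_eigenfunction(r, imax=40, tol=1e-12):
    """Verify (P f2)(i) = beta f2(i) for the radial transition operator of
    simple random walk on T_r, with f2 and beta as in (13.53) and (13.49)."""
    beta = 2.0 * math.sqrt(r - 1) / r
    f = lambda i: (1.0 + (r - 2) / r * i) * (r - 1) ** (-i / 2.0)
    # i = 0: every neighbour of phi is at distance 1
    assert abs(f(1) - beta * f(0)) < tol
    # i >= 1: one neighbour closer, r - 1 neighbours further away
    for i in range(1, imax):
        Pf = (r - 1) / r * f(i + 1) + 1.0 / r * f(i - 1)
        assert abs(Pf - beta * f(i)) < tol
    return True

assert check_eigenfunction(3) and check_eigenfunction(4) and check_eigenfunction(7)
```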

13.2.8 Comparison arguments


Fix r ≥ 3 and consider a sequence (Gn) of n-vertex r-regular graphs with n → ∞. Write (X_t^n) for the random walk on Gn. We can compare these random walks with the random walk (X_t^∞) on Tr via the obvious inequality

P_v(X_t^n = v) ≥ P_φ(X_t^∞ = φ),  t ≥ 0.  (13.54)

To spell this out, there is a universal cover map γ : Tr → Gn with γ(φ) = v and such that for each vertex w of Tr the r edges at w are mapped to the r edges of Gn at γ(w). Given the random walk X^∞ on Tr, the definition X_t^n = γ(X_t^∞) constructs random walk on Gn, and (13.54) holds because {X_t^n = v} ⊇ {X_t^∞ = φ}.
It is easy to use (13.54) to obtain asymptotic lower bounds on the fundamental parameters discussed in Chapter 4. Instead of the relaxation time τ_2, it is more natural here to deal directly with the second eigenvalue λ_2.

Lemma 13.9 For random walk on n-vertex r-regular graphs, with r ≥ 3 fixed and n → ∞:
(a) lim inf n^{−1} τ_0(n) ≥ (r − 1)/(r − 2);
(b) lim inf τ_1(n)/log n ≥ r/((r − 2) log(r − 1));
(c) lim inf λ_2(n) ≥ β := 2 (r − 1)^{1/2}/r.

Theory concerning expanders (Chapter 9 section 1) shows there exist graphs where the limits above are finite constants (depending on r), so Lemma 13.9 gives the optimal order of magnitude bound.
Proof. For (a), switch to the continuous-time walk, consider an arbitrary vertex v in Gn, and take t_0(n) → ∞ with t_0(n)/n → 0. Then we repeat the argument around (13.39) in the torus setting:

n^{−1} E_π T_v = ∫_0^∞ (P_v(X_t^n = v) − 1/n) dt
  ≥ ∫_0^{t_0} (P_v(X_t^n = v) − 1/n) dt
  ≥ −t_0/n + ∫_0^{t_0} P_φ(X_t^∞ = φ) dt  by (13.54)
  → ∫_0^∞ P_φ(X_t^∞ = φ) dt
  = E_φ N_φ(∞) = (r − 1)/(r − 2),

which is somewhat stronger than assertion (a). Next, the discrete-time spectral representation implies

P_v(X_t^n = v) ≤ 1/n + n β^t(n).

Using (13.54) and (13.50), for any n → ∞, t → ∞ with t even,

t^{−3/2} β^t (α − o(1)) ≤ 1/n + n β^t(n).  (13.55)

For (b), the argument for (13.54) gives a coupling between the process X^n started at v and the process X^∞ started at φ such that

d_n(X_t^n, v) ≤ d_∞(X_t^∞, φ)

where d_n and d_∞ denote graph distance. Fix ε > 0 and write γ = (r − 2)/r + ε. By the coupling and (13.44), P(d_n(X_t^n, v) ≥ γt) → 0 as n, t → ∞. This
remains true in continuous time. Clearly τ_1(n) → ∞, and so by definition of τ_1 we have

lim sup π{w : d_n(w, v) ≥ γ τ_1(n)} ≤ e^{−1}.

But by counting vertices,

π{w : d_n(v, w) ≤ d} ≤ (1 + r + r(r − 1) + · · · + r(r − 1)^{d−1}) / n
  → 0 if d ∼ (1 − ε) log n / log(r − 1).

For these two limit results to be consistent we must have γ τ_1(n) ≥ (1 − ε) log n / log(r − 1) for all large n, establishing (b).
For (c), fix a vertex v_0 of Gn and use the function f_2 at (13.53) to define f(v) := f_2(d(v, v_0)) for all vertices v of Gn. The equality (13.52) for f_2 on the infinite tree easily implies the inequality Pf ≥ βf on Gn. Set f̄ := n^{−1} Σ_v f(v) and write 1 for the unit function. By the Rayleigh–Ritz characterization (Chapter 4 eq. (73)), writing ⟨g, h⟩ := Σ_{ij} π_i g_i p_ij h_j,

λ_2(n) ≥ ⟨f − f̄1, P(f − f̄1)⟩ / ||f − f̄1||_2^2
  = (⟨f, Pf⟩ − f̄^2) / (||f||_2^2 − f̄^2)
  ≥ (β ||f||_2^2 − f̄^2) / (||f||_2^2 − f̄^2).

As n → ∞ we have f̄ → 0 while ||f||_2 tends to a non-zero limit, establishing (c).

13.2.9 The hierarchical tree


Fix r ≥ 2. There is an infinite tree (illustrated for r = 2 in the figure)
specified as follows. Each vertex is at some height 0, 1, 2, . . .. A vertex
at height h has one parent vertex at height h + 1 and (if h ≥ 1) r child
vertices at height h − 1. The height-0 vertices are leaves, and the set L of
leaves has a natural labeling by finite r-ary strings. The figure illustrates
the binary (r = 2) case, where L = {0, 1, 10, 11, 100, 101, . . .}. L forms an
Abelian group under entrywise addition modulo r, e.g. for r = 2 we have
1101 + 110 = 1101 + 0110 = 1011. Adopting a name used for generalizations
of this construction in statistical physics, we call L the hierarchical lattice
and the tree Trhier the hierarchical tree.
[Figure: the hierarchical tree for r = 2; the leaves 0, 1, 10, 11, 100, 101, 110, 111 are at height 0.]

Fix a parameter 0 < λ < r. Consider biased random walk Xt on the tree
Trhier , where from each non-leaf vertex the transition goes to the parent with
probability λ/(λ + r) and to each child with probability 1/(λ + r). Then
consider Y = “X watched only on L”, that is the sequence of (not-necessarily
distinct) successive leaves visited by X. The group L is distance-transitive
(for Hamming distance on L) and Y is a certain isotropic random flight on
L. A nice feature of this example is that without calculation we can see that
Y is recurrent if and only if λ ≤ 1. For consider the path of ancestors of 0.
The chain X must spend an infinite time on that path (side-branches are
finite); on that path X behaves as asymmetric simple random walk on Z + ,
which is recurrent if and only if λ ≤ 1; so X and thence Y visits 0 infinitely
often if and only if λ ≤ 1.
Another nice feature is that we can give a fairly explicit expression for the t-step transition probabilities of Y. Writing H for the maximum height reached by X in an excursion from the leaves, then

P(H ≥ h) = P_1(T̂_h < T̂_0) = ((r/λ) − 1) / ((r/λ)^h − 1),  h ≥ 1

where T̂ denotes hitting time for the height process. Writing Mt for the
maximum height reached in t excursions,
P(M_t < h) = (P(H < h))^t = (1 − ((r/λ) − 1)/((r/λ)^h − 1))^t.
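Both displays are gambler's-ruin computations for the height process (up-probability λ/(λ + r), down-probability r/(λ + r)). The hitting formula can be confirmed by solving the harmonic equations exactly, as in this illustrative sketch (ours; the parameter triples tested are arbitrary):

```python
from fractions import Fraction

def hit_top_first(lam, r, h):
    """P_1(reach height h before height 0) for the height process of the
    biased walk: up with probability lam/(lam+r), down with probability
    r/(lam+r).  Solve u_i = p u_{i+1} + q u_{i-1}, u_0 = 0, u_h = 1 by
    writing u_i = B_i u_1 and scaling so that u_h = 1."""
    p = Fraction(lam, lam + r)
    q = Fraction(r, lam + r)
    B = [Fraction(0), Fraction(1)]
    for _ in range(h - 1):
        B.append((B[-1] - q * B[-2]) / p)
    return 1 / B[h]

# exact agreement with ((r/lam) - 1) / ((r/lam)^h - 1)
for lam, r, h in ((1, 2, 5), (2, 3, 7), (1, 4, 3)):
    theta = Fraction(r, lam)
    assert hit_top_first(lam, r, h) == (theta - 1) / (theta ** h - 1)
```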

It is clear by symmetry that the distribution of Y_t is conditionally uniform on the leaves which are descendants of the maximal-height vertex previously visited by X. So for leaves v, x with branchpoint at height d,

P_v(Y_t = x) = Σ_{h≥d} r^{−h} P(M_t = h).

Since P(M_t = h) = P(M_t < h + 1) − P(M_t < h), we have found the "fairly explicit expression" promised above. A brief calculation gives the following time-asymptotics. Fix s > 0 and consider t ∼ s(r/λ)^j with j → ∞; then

P_v(Y_t = v) ∼ r^{−j} f(s), where

f(s) = Σ_{i=−∞}^∞ r^{−i} (exp(−s((r/λ) − 1)(r/λ)^{−i−1}) − exp(−s((r/λ) − 1)(r/λ)^{−i})).
In particular,

P_v(Y_t = v) = Θ(t^{−d/2}) as t → ∞,  d = 2 log r / (log r − log λ).  (13.56)
Comparing with (13.26), this gives a sense in which Y mimics simple random walk on Z^d, for d defined above. Note that d increases continuously from 0 to ∞ as λ increases from 0 to r, and that Y is recurrent if and only if d ≤ 2.
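As a numerical illustration of (13.56) (our code, not from the text; we pick r = 2 and λ = 1, so that d = 2), the exact expression for P_v(Y_t = v) does exhibit the t^{−d/2} decay: along the subsequence t = 2^j the product t^{d/2} P_v(Y_t = v) levels off.

```python
import math

def p_return(t, r, lam, hmax=400):
    """P_v(Y_t = v) via the exact expression: sum over the maximal height h
    of r^{-h} P(M_t = h), with P(M_t < h) = (1 - (th-1)/(th^h-1))^t."""
    th = r / lam
    def M_lt(h):                     # P(M_t < h), h >= 1
        x = (th - 1.0) / (th ** h - 1.0)
        if x >= 1.0:                 # h = 1: an excursion always reaches height 1
            return 0.0
        return math.exp(t * math.log1p(-x))
    return sum(r ** (-h) * (M_lt(h + 1) - M_lt(h)) for h in range(1, hmax))

# r = 2, lam = 1 gives d = 2, so t * P_v(Y_t = v) should stabilise
vals = [p_return(2 ** j, 2, 1) * 2 ** j for j in (8, 10, 12)]
assert abs(vals[1] / vals[0] - 1.0) < 0.05
assert abs(vals[2] / vals[1] - 1.0) < 0.05
```

We compare at t = 2^j with the same fractional part s = 1, because the constant in the Θ(·) oscillates with log t (the function f above), so the ratio test only makes sense along such a subsequence.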
Though we don’t go into details, random walk on the hierarchical lattice
is a natural infinite-state analog of biased random walk on the balanced
finite tree (Chapter 5 section 2.1). In particular, results in the latter context
showed that, writing n for number of vertices, τ0 (n) = O(n) if and only if
λ/r > 1/r, that is if and only if d > 2. This is the condition for transience
of the infinite-state walk, confirming the paradigm of section 13.2.3.

13.2.10 Towards a classification theory for sequences of finite chains
Three chapters of Woess [339] treat in detail three properties that random
walk on an infinite graph may or may not possess:
• transience

• spectral radius < 1

• non-trivial boundary.
Can these be related to properties for sequences of finite chains? We already
mentioned (section 13.2.3) that the property τ0 (n) = O(n) seems to be the
analog of transience. In this speculative section we propose definitions of
three other properties for sequences of finite chains, which we name

• compactness

• infinite-dimensionality

• expander-like.
Future research will show whether these are useful definitions! Intuitively
we expect that every reasonably “natural” sequence should fall into one of
these three classes.
For simplicity we consider reversible random walks on Cayley graphs. It is also convenient to continuize. The resulting chains are special cases of (reversible) Lévy processes. We define the general Lévy process to be a continuous-time process with stationary independent increments on a (continuous or discrete) group. Thus the setting for the rest of this section is a sequence (X_t^{(n)}) of reversible Lévy processes on finite groups G^{(n)} of size n → ∞ through some subsequence. Because we work in continuous time, the eigenvalues satisfy 0 = λ_1^{(n)} < λ_2^{(n)} ≤ · · ·.
(A): Compactness. Say the sequence (X_t^{(n)}) is compact if there exists a (discrete or continuous) compact set S and a reversible Lévy process X̃_t on S such that
(i) d̃(t) ≡ ||P(X̃_t ∈ ·) − π|| → 0 as t → ∞;
(ii) λ_j(n)/λ_2(n) → λ̃_j as n → ∞, j ≥ 2, where 1 = λ̃_2 ≤ λ̃_3 ≤ · · · are the eigenvalues of (X̃_t);
(iii) d^{(n)}(t τ_2(n)) → d̃(t) as n → ∞, t > 0.
These properties formalize the idea that the random walks form discrete approximations to a limit Lévy process on a compact group, at least as far as mixing times are concerned. Simple random walk on Z_m^d, and the limit Brownian motion on R^d (section 13.1.2), form the obvious example. Properties (i) and (iii) imply, in particular, that

τ_1(n)/τ_2(n) is bounded as n → ∞.  (13.57)

One might hope that a converse is true:


Does every sequence satisfying (13.57) have a compact subse-
quence?
Unfortunately, we are convinced that the answer is "no", for the following reason. Take (X_t^{(n)}) which is compact, where the limit Lévy process has function d̃(t) as at (i). Now consider a product chain (X_t^{(n)}, Y_t^{(n)}), where components run independently, and where Y^{(n)} has the cut-off property (Chapter 7) and τ_1^Y(n) ∼ τ_2^X(n). Note that by Chapter 7-1 Lemma 1 we have τ_2^Y(n) = o(τ_1^Y(n)). If the product chain had a subsequential limit, then its total variation function at (i), say d′(t), must satisfy

d′(t) = d̃(t), t > 1;  d′(t) = 1, t < 1.

But it seems intuitively clear (though we do not know a proof) that every Lévy process on a compact set has continuous d(·). This suggests the following conjecture.
following conjecture.
Conjecture 13.10 For any sequence of reversible Lévy processes satisfying (13.57), there exists a subsequence satisfying the definition of compact except that condition (iii) is replaced by

(iv): ∃ t_0 ≥ 0: d^{(n)}(t τ_2(n)) → 1 for t < t_0;  d^{(n)}(t τ_2(n)) → d̃(t) for t > t_0.

Before describing the other two classes of chains, we need a definition and
some motivating background. In the present setting, the property “trivial
boundary” is equivalent (see Notes) to the property
lim_{t→∞} ||P_v(X_t ∈ ·) − P_w(X_t ∈ ·)|| = 0, ∀ v, w.  (13.58)

This suggests that an analogous finite-state property might involve whether


the variation distance for nearby starts becomes small before time τ1 . Say
that a sequence (Ln (ε)) of subsets is an asymptotic ε-neighborhood if
||Pφ (Xετ1 ∈ ·) − Pv (Xετ1 ∈ ·)|| → 0 as n → ∞
uniformly over v ∈ Ln (ε); here φ is an arbitrary reference vertex. From
Chapter 7-1 Lemma 1(b) we can deduce that, if the cut-off property holds,
such a neighborhood must have size |Ln (ε)| = o(n).
(B): Infinite-dimensional. Say the sequence (X_t^{(n)}) is infinite-dimensional if the following three properties hold:
(i) τ_1(n) = Θ(τ_2 log log n);
(ii) the cut-off property holds;
(iii) there exists some δ(ε), increasing from 0 to 1 as ε increases from 0 to 1, such that a maximal-size asymptotic ε-neighborhood (L_n(ε)) has log |L_n(ε)| = (log n)^{δ(ε)+o(1)} as n → ∞.

This definition is an attempt to abstract the essential properties of random walk on the d-cube (Chapter 5 Example 15), where properties (i) and (ii) were already shown. We outline below a proof of property (iii) in that example. Another fundamental example where (i) and (ii) hold is card-shuffling by random transpositions (Chapter 7 Example 18), and we conjecture that property (iii) also holds there. Conceptually, this class of infinite-dimensional sequences is intended (cf. (13.58)) as the analog of a single random walk with trivial boundary on an infinite-dimensional graph.
Property (iii) for the d-cube. Let (X(t)) be continuous-time random walk on the d-cube, and (X*(t)) continuous-time random walk on the b-cube, where b ≤ d. The natural coupling shows that if d(v, w) = b then

||P_v(X(t) ∈ ·) − P_w(X(t) ∈ ·)|| = ||P_0(X*(tb/d) ∈ ·) − P_1(X*(tb/d) ∈ ·)||.

Take d → ∞ with

b(d) ∼ d^α,  t(d) ∼ (1/4) ε d log d

for some 0 < α, ε < 1, so that

(t(d) b(d)/d) / ((1/4) b(d) log b(d)) → ε/α.

Since the variation cut-off for the b-cube is at (1/4) b log b, we see that for vertices v and w at distance b(d),

||P_v(X(t(d)) ∈ ·) − P_w(X(t(d)) ∈ ·)|| → 1 if ε > α;  → 0 if ε < α.

So a maximal-size asymptotic ε-neighborhood (L_n(ε)) of 0 must be of the form {w : d(w, 0) ≤ d^{ε+o(1)}}. So

log |L_n(ε)| = log (d choose d^{ε+o(1)}) = d^{ε+o(1)} = (log n)^{ε+o(1)}

as required.

Finally, we want an analog of a random walk with non-trivial boundary, expressed using property (ii) below.

(C): Expander-like. Say the sequence (X_t^{(n)}) is expander-like if
(i) τ_1 = Θ(τ_2 log n);
(ii) every asymptotic ε-neighborhood (L_n(ε)) has log |L_n(ε)| = (log n)^{o(1)} as n → ∞;
(iii) the cut-off property holds.

Recall from Chapter 9 section 1 that, for symmetric graphs which are r-regular expanders for fixed r, we have τ_2(n) = Θ(1) and τ_1(n) = Θ(log n). But it is not known whether properties (ii) and (iii) always hold in this setting.

13.3 Random Walks in Random Environments


In talking about random walk on a weighted graph, we have been assuming
the graph is fixed. It is conceptually only a minor modification to consider
the case where the “environment” (the graph or the edge-weights) is itself
first given in some specified random manner. This has been studied in several
rather different contexts, and we will give a brief description of known results
without going into many details.
Quantities like our mixing time parameters τ from Chapter 4 are now random quantities τ. In general we shall use boldface for quantities depending on the realization of the environment but not depending on a realization of the walk.

13.3.1 Mixing times for some random regular graphs


There is a body of work on estimating mixing times for various models of
random regular graph. We shall prove two simple results which illustrate
two basic techniques, and record some of the history in the Notes.
The first result is Proposition 1.2.1 of Lubotzky [243]. This illustrates
the technique of proving expansion (i.e., upper-bounding the Cheeger time
constant τc ) by direct counting arguments in the random graph.

Proposition 13.11 Let G_{k,n} be the 2k-regular random graph on vertices {1, 2, . . . , n} with edges {(i, π_j(i)) : 1 ≤ i ≤ n, 1 ≤ j ≤ k}, where (π_j, 1 ≤ j ≤ k) are independent uniform random permutations of {1, 2, . . . , n}. Write τ_c(k, n) for the Cheeger time constant for random walk on G_{k,n}. Then for fixed k ≥ 7,

P(τ_c(k, n) > 2k) → 0 as n → ∞.

Note that a realization of Gk,n may be disconnected (in which case τc = ∞)


and have self-loops and multiple edges.

Outline of proof. Suppose a realization of the graph has the property

|A| ≤ n/2 ⇒ |∂A| ≥ |A|/2  (13.59)

where ∂A := {edges (i, j) : i ∈ A, j ∈ A^c}. Then

τ_c = sup_{A:1≤|A|≤n/2} k|A|(n − |A|)/(n|∂A|) ≤ sup_{A:1≤|A|≤n/2} k|A|(n − |A|)/(n|A|/2) ≤ 2k.

So we want to show that (13.59) holds with probability → 1 as n → ∞. If (13.59) fails for some A with |A| = a, then there exists B with |B| = ⌊3a/2⌋ = b such that

π_j(A) ⊆ B, 1 ≤ j ≤ k  (13.60)

(just take B = ∪_j π_j(A) plus, if necessary, arbitrary extra vertices). For given A and B, the chance of (13.60) equals ((b)_a/(n)_a)^k, where (n)_a := ∏_{r=0}^{a−1} (n − r). So the chance that (13.59) fails is at most

Σ_{1≤a≤n/2} q(a), where q(a) = (n choose a)(n choose b)((b)_a/(n)_a)^k.

So it suffices to verify Σ_{1≤a≤n/2} q(a) → 0. And this is a routine but tedious verification (see Notes). □


Of course the bound on τ_c gives, via Cheeger's inequality, a bound on τ_2, and thence a bound on τ_1 via τ_1 = O(τ_2 log n). But Proposition 13.11 is unsatisfactory in that these bounds get worse as k increases, whereas intuitively they should get better. For bounds on τ_1 which improve with k we turn to the second technique, which uses the "L^1 ≤ L^2" inequality to bound the variation threshold time τ_1. Specifically, recall (Chapter 3 Lemma 8b) that for an n-state reversible chain with uniform stationary distribution, the variation distance d(t) satisfies

d(t) ≤ 2 max_i (n p_ii(2t) − 1)^{1/2}.  (13.61)

This is simplest to use for random walk on a group, as illustrated by the following result of Roichman [296].
Proposition 13.12 Fix α > 1. Given a group G, let S be a random set of k = ⌊log^α |G|⌋ distinct elements of G, and consider random walk on the associated Cayley graph with edges {(g, gs) : g ∈ G, s ∈ S ∪ S^{−1}}. For any sequence of groups with |G| → ∞,

P(τ_1 > t_1) → 0, where t_1 = ⌈(α/(α − 1)) log |G| / log k⌉.

Proof. We first give a construction of the random walk jointly with the random set S. Write A = {a, b, . . .} for a set of k symbols, and write Ā = {a, a^{−1}, b, b^{−1}, . . .}. Fix t ≥ 1 and let (ξ_s, 1 ≤ s ≤ t) be independent uniform on Ā. Choose (g(a), a ∈ A) by uniform sampling without replacement from G, and set g(a^{−1}) = (g(a))^{−1}. Then the process (X_s; 1 ≤ s ≤ t) constructed via X_s = g(ξ_1)g(ξ_2) · · · g(ξ_s) is distributed as the random walk on the random Cayley graph, started at the identity ι. So P(X_t = ι) = E p_ιι(t) where p_ιι(t) is the t-step transition probability in the random environment, and by (13.61) it suffices to take t = t_1 (for t_1 defined in the statement of the Proposition) and show

|G| P(X_{2t} = ι) − 1 → 0.  (13.62)

To start the argument, let J(2t) be the number of distinct values taken by (⟨ξ_s⟩, 1 ≤ s ≤ 2t), where we define ⟨a⟩ = ⟨a^{−1}⟩ = a. Fix j ≤ t and 1 ≤ s_1 < s_2 < . . . < s_j ≤ 2t. Then

P(J(2t) = j | ⟨ξ_{s_i}⟩ distinct for 1 ≤ i ≤ j) = (j/k)^{2t−j} ≤ (t/k)^t.

By considering the possible choices of (s_i),

P(J(2t) = j) ≤ (2t choose j) (t/k)^t.

Since Σ_j (2t choose j) = 2^{2t} we deduce

P(J(2t) ≤ t) ≤ (4t/k)^t.  (13.63)

Now consider the construction of X_{2t} given above. We claim

P(X_{2t} = ι | ξ_s, 1 ≤ s ≤ 2t) ≤ 1/(|G| − 2t) on {J(2t) > t}.  (13.64)

For if J(2t) > t then there exists some b ∈ A such that ⟨ξ_s⟩ = b for exactly one value of s in 1 ≤ s ≤ 2t. So if we condition also on {g(a); a ∈ A, a ≠ b}, then X_{2t} = g_1 g(b) g_2 or g_1 g(b)^{−1} g_2 where g_1 and g_2 are determined by the conditioning, and then the conditional probability that X_{2t} = ι is the conditional probability of g(b) taking a particular value, which is at most 1/(|G| − 2t).

Combining (13.64) and (13.63),

P(X_{2t} = ι) ≤ (4t/k)^t + 1/(|G| − 2t) ≤ (4t/k)^t + 1/|G| + O(t/|G|^2).

So proving (13.62) reduces to proving

|G|(4t/k)^t + t/|G| → 0

and the definition of t_1 was made to ensure this. □

13.3.2 Randomizing infinite trees


Simple random walk on the infinite regular tree is a fundamental process, already discussed in section 13.2.6. There are several natural ways to randomize the environment; we could take an infinite regular tree and attach random edge-weights; or we could consider a Galton–Watson tree, in which numbers of children are random. Let us start by considering these possibilities simultaneously. Fix a distribution (ξ; W_1, W_2, . . . , W_ξ) where

ξ ≥ 1;  P(ξ ≥ 2) > 0;  W_i > 0, i ≤ ξ.  (13.65)

Note the (W_i) may be dependent. Construct a tree via:

the root φ has ξ children, and the edge (φ, i) to the i-th child has weight W_i; repeat recursively for each child, taking independent realizations of the distribution (13.65).
So the case ξ ≡ r − 1 gives the randomly-weighted r-ary tree (precisely, the modification where the root has degree r − 1 instead of r), and the case W_i ≡ 1 gives a Galton–Watson tree. As in Chapter 3 section 2, to each realization of a weighted graph we associate a random walk with transition probabilities proportional to edge-weights. Since random walk on the unweighted r-ary tree is transient, a natural first issue is to prove transience in this "random environment" setting. In terms of the electrical network analogy (see comment below Theorem 13.5), interpreting W as conductance, we want to know whether the (random) resistance R between φ and ∞ is a.s. finite. By considering the children of φ, it is clear that the distribution of R satisfies

R =^d ( Σ_{i=1}^ξ (R_i + W_i^{−1})^{−1} )^{−1}  (13.66)

where the (Ri ) are independent of each other and of (ξ; W1 , W2 , . . . , Wξ ),


d
and Ri = R. But R̂ ≡ ∞ is a solution of (13.66), so we need some work
to actually prove that R < ∞.
Proposition 13.13 The resistance R between φ and ∞ satisfies R < ∞
a.s..
Proof. Write R^{(k)} for the resistance between φ and height k (i.e. the height-k vertices, all shorted together). Clearly R^{(k)} ↑ R as k → ∞, and analogously to (13.66)

R^{(k+1)} =^d ( Σ_{i=1}^ξ (R_i^{(k)} + W_i^{−1})^{−1} )^{−1}
where the (R_i^{(k)}) are independent of each other and of (ξ; W_1, W_2, . . . , W_ξ), and R_i^{(k)} =^d R^{(k)}.

Consider first the special case ξ ≡ 3. Choose x such that P(W_i^{−1} > x for some i) ≤ 1/16. Suppose inductively that P(R^{(k)} > x) ≤ 1/4 (which holds for k = 0 since R^{(0)} = 0). Then

P(R_i^{(k)} + W_i^{−1} > 2x for at least 2 i's) ≤ 1/16 + 3(1/4)^2 ≤ 1/4.

This implies P(R^{(k+1)} > x) ≤ 1/4, and the induction goes through. Thus P(R > x) ≤ 1/4. By (13.66), p := P(R = ∞) satisfies p = p^3, so p = 0 or 1, and we just eliminated the possibility p = 1. So R < ∞ a.s.
Reducing the general case to the special case involves a comparison idea,
illustrated by the figure.

[Figure: two networks with edge resistances. On the left, φ is linked to a, b, c (each continuing to ∞) via a tree with edge resistances r(1), . . . , r(5); on the right, the tree is replaced by three direct edges from φ to a, b, c, each of resistance r.]
Here the edge-weights are resistances. In the left network, φ is linked to {a, b, c} via an arbitrary tree, and in the right network, this tree is replaced by three direct edges, each with resistance r = 3(r(1) + r(2) + . . . + r(5)). We claim that this replacement can only increase the resistance between φ and ∞. This is a nice illustration of Thompson's principle (Chapter 3 section 7.1) which says that in a realization of either graph, writing r*(e) for resistance and summing over undirected edges e,

R_{φ∞} = inf_f Σ_e r*(e) f^2(e)

where f = (f(e)) is a unit flow from φ to ∞. Let f be the minimizing flow in the right network; use f to define a flow g in the left network by specifying that the flow through a (resp. b, c) is the same in the left network and the right network. It is easy to check

(left network) Σ_e r*(e) g^2(e) ≤ (right network) Σ_e r*(e) f^2(e)
and hence the resistance R_{φ∞} can indeed only be less in the left network.

In the general case, the fact P(ξ ≥ 2) > 0 implies that the number of individuals in generation g tends to ∞ a.s. as g → ∞. So in particular we can find 3 distinct individuals {A, B, C} in some generation G. Retain the edges linking φ with these 3 individuals, and cut all other edges within the first G generations. Repeat recursively for descendants of {A, B, C}. This procedure constructs an infinite subtree, and it suffices to show that the resistance between φ and ∞ in the subtree is a.s. finite. By the comparison argument above, we may replace the network linking φ to {A, B, C} by three direct edges with the same (random) resistance, and similarly for each stage of the construction of the subtree; this gives another tree T, and it suffices to show its resistance is finite a.s. But T fits the special case ξ ≡ 3.
It is not difficult (we won’t give details) to show that the distribution
of R is the unique distribution on (0, ∞) satisfying (13.66). It does seem
difficult to say anything explicit about the distribution of R in Proposition
13.13. One can get a little from comparison arguments. On the binary
tree (ξ ≡ 2), by using the exact potential function and the exact flows
from the unweighted case as “test functions” in the Dirichlet principle and
Thompson’s principle, one obtains

E R ≤ E W^{−1};  E R^{−1} ≤ E W.
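These bounds can be probed by Monte Carlo iteration of the distributional recursion (13.66). The sketch below is ours, not from the text: we take ξ ≡ 2 with W uniform on {1, 2} as an arbitrary test distribution, and also check the deterministic W ≡ 1 case, where the recursion collapses to R ← (R + 1)/2 and converges to resistance 1.

```python
import random

def unit_tree_resistance(depth=60):
    # xi = 2, W = 1: recursion (13.66) is deterministic, R <- (R + 1)/2 -> 1
    rr = 0.0
    for _ in range(depth):
        rr = (rr + 1.0) / 2.0
    return rr

def resistance_sample(num=20000, depth=25, seed=1):
    """Population Monte Carlo for (13.66) on the binary tree, W uniform on
    {1, 2}: draw two parent resistances and two conductances, combine each
    branch in series with 1/W, then combine the branches in parallel."""
    rng = random.Random(seed)
    pop = [0.0] * num                # R^(0) = 0: level-0 vertices shorted
    for _ in range(depth):
        new = []
        for _ in range(num):
            r1 = rng.choice(pop) + 1.0 / rng.choice((1.0, 2.0))
            r2 = rng.choice(pop) + 1.0 / rng.choice((1.0, 2.0))
            new.append(1.0 / (1.0 / r1 + 1.0 / r2))   # series, then parallel
        pop = new
    return pop

assert abs(unit_tree_resistance() - 1.0) < 1e-12
pop = resistance_sample()
mean_R = sum(pop) / len(pop)
mean_invR = sum(1.0 / x for x in pop) / len(pop)
# here E[1/W] = 3/4 and EW = 3/2, so the sample means should respect the bounds
assert mean_R < 0.75 and mean_invR < 1.5
```

Note the W ≡ 1 case shows both bounds can be attained with equality, so the Monte Carlo slack in the random-W case comes entirely from the Jensen-type effect of the variability of W and R.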

13.3.3 Bias and speed


Lyons et al [246, 247, 248], summarized in [250] Chapter 10, have studied
in detail questions concerning a certain model of biased random walk on
deterministic and random infinite trees. Much of their focus is on topics too
sophisticated (boundary theory, dimension) to recount here, but let us give
one simple result.
Consider the unweighted Galton–Watson tree with offspring distribution
µ = dist (ξ), i.e., the case Wi ≡ 1 of (13.65). Fix a parameter 0 ≤ λ < ∞.
In the biased random walk Xt , from a vertex with r children the walker goes
to any particular child with probability 1/(λ + r), and to the parent with
probability λ/(λ + r). It turns out [248] that the biased random walk is
recurrent for λ ≥ Eξ and transient for λ < Eξ. We will just prove one half
of that result.

Proposition 13.14 The biased random walk is a.s. recurrent for λ ≥ Eξ.

Proof. We use a “method of fictitious roots”. That is, to the root φ of


the Galton-Watson tree we append an extra edge to a “fictitious” root φ∗ ,
and we consider random walk on this extended tree (rooted at φ*). Write q for the probability (conditional on the realization of the tree) that the walk started at φ never hits φ*. It will suffice to prove P(q = 0) = 1. Fix a realization of the tree, in which φ has z children. Then

q = Σ_{i=1}^z (1/(λ + z)) (q_i + (1 − q_i) q)

where q_i is the probability (on this realization) that the walk started at the i'th child of φ never hits φ. Rearrange to see q = (Σ_i q_i)/(λ + Σ_i q_i). So on the random tree we have

q =^d (Σ_{i=1}^ξ q_i) / (λ + Σ_{i=1}^ξ q_i)

where the (q_i) are independent of each other and ξ, and q_i =^d q. Applying Jensen's inequality to the concave function x ↦ x/(x + λ) shows

E q ≤ (Eξ)(Eq) / (λ + (Eξ)(Eq)).

By considering the relevant quadratic equation, one sees that for λ ≥ Eξ this inequality has no solution with Eq > 0. So Eq = 0, as required.
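The dichotomy in Proposition 13.14 (and its transient complement) can be seen in the deterministic special case ξ ≡ 2, where Eξ = 2 and the fixed-point relation from the proof becomes q = 2q/(λ + 2q). The following sketch is our illustration, not code from the text:

```python
def escape_prob(lam, steps=200000):
    """Iterate q <- 2q/(lam + 2q) from q = 1 (binary tree, xi = 2).
    For lam >= 2 the iterates go to 0 (recurrence); for lam < 2 they
    converge to the positive root q = (2 - lam)/2 (transience)."""
    q = 1.0
    for _ in range(steps):
        q = 2.0 * q / (lam + 2.0 * q)
    return q

assert escape_prob(3.0) < 1e-10                 # lam > E(xi): recurrent
assert escape_prob(2.0) < 1e-4                  # lam = E(xi): still recurrent
assert abs(escape_prob(1.5) - 0.25) < 1e-6      # lam < E(xi): transient, q > 0
```

At the critical value λ = 2 the iteration is q ← q/(1 + q), which decays only like 1/k; this is why the boundary case needs so many iterations, mirroring the borderline nature of λ = Eξ in the proposition.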
In the transient case, we expect there to exist a non-random speed
s(λ, µ) ≤ 1 such that

t−1 d(Xt , φ) → s(λ, µ) a.s. as t → ∞. (13.67)

Lyons et al [248] show that, when Eξ < ∞, (13.67) is indeed true and that s(λ, µ) > 0 for all 1 ≤ λ < Eξ. Moreover in the unbiased (λ = 1) case there is a simple formula [247]

s(1, µ) = E[(ξ − 1)/(ξ + 1)].

There is apparently no such simple formula for s(λ, µ) in general. See Lyons
et al [249] for several open problems in this area.

13.3.4 Finite random trees


Cayley’s formula ([313] p. 25) says there are nn−2 different trees on n ≥ 2
labeled vertices {1, 2, . . . , n}. Assuming each such tree to be equally likely
gives one tractable definition (there are others) of random n-tree Tn . One
can combine the formulas from Chapter 5 section 3 for random walks on
general trees with known distributional properties of Tn to get a variety of


formulas for random walk on Tn , an idea going back to Moon [264].
As an illustration it is known [264] that the distance d(1, 2) between vertex 1 and vertex 2 in T_n has distribution

P(d(1, 2) = k) = (k + 1) n^{−k} (n − 2)_{k−1},  1 ≤ k ≤ n − 1

where (m)_s := m(m − 1) · · · (m − s + 1). Routine calculus gives

E d(1, 2) ∼ (π/2)^{1/2} n^{1/2}.  (13.68)
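Moon's distribution for d(1, 2) can be verified exactly for small n by brute force: each of the n^{n−2} Prüfer sequences encodes one labelled tree, so enumerating them gives the exact law. This check is our own illustration, not part of the text.

```python
import heapq
from itertools import product

def prufer_to_edges(seq, n):
    # standard Prufer decoding; vertices 1..n, len(seq) == n - 2
    deg = [1] * (n + 1)
    for a in seq:
        deg[a] += 1
    leaves = [v for v in range(1, n + 1) if deg[v] == 1]
    heapq.heapify(leaves)
    edges = []
    for a in seq:
        leaf = heapq.heappop(leaves)
        edges.append((leaf, a))
        deg[a] -= 1
        if deg[a] == 1:
            heapq.heappush(leaves, a)
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges

def distance_law(n):
    """Exact distribution of d(1, 2) over all n^(n-2) labelled trees."""
    counts = {}
    for seq in product(range(1, n + 1), repeat=n - 2):
        adj = {v: [] for v in range(1, n + 1)}
        for u, v in prufer_to_edges(seq, n):
            adj[u].append(v)
            adj[v].append(u)
        dist, frontier = {1: 0}, [1]     # BFS from vertex 1 to vertex 2
        while 2 not in dist:
            nxt = []
            for u in frontier:
                for w in adj[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        nxt.append(w)
            frontier = nxt
        counts[dist[2]] = counts.get(dist[2], 0) + 1
    return counts

# for n = 5 the formula (k+1) n^{-k} (n-2)_{k-1} predicts 50, 45, 24, 6 of 125
assert distance_law(5) == {1: 50, 2: 45, 3: 24, 4: 6}
```

For example, the k = 1 count 50 matches the classical fact that 2n^{n−3} labelled trees contain a given edge.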

Now on any n-vertex tree, the mean hitting time t(i, j) = E_i T_j satisfies

t(i, j) + t(j, i) = 2(n − 1) d(i, j)  (13.69)

(Chapter 5 (84)), and so

E t(1, 2) = (n − 1) E d(1, 2).

Combining with (13.68),

E t(1, 2) ∼ (π/2)^{1/2} n^{3/2}.  (13.70)

Instead of deriving more formulas of this type for random walk on T_n, let's jump to the bottom line. It turns out that all the mixing and hitting time parameters τ_u^{(n)} of Chapter 4, and the analogous "mean cover time" parameters of Chapter 6, are of order n^{3/2} but are random to first order: that is,

n^{−3/2} τ_u^{(n)} →^d τ_u^{(∞)} as n → ∞  (13.71)

for non-deterministic limits τ_u^{(∞)}. The fact that all these parameters have the same order is of course reminiscent of the cases of the n-cycle and n-path (Chapter 5 Examples 7 and 8), where all the parameters are Θ(n^2). And the sophisticated explanation is the same: one can use the weak convergence paradigm (section 13.1.1). In the present context, the random tree T_n rescales to a limit continuum random tree T_∞, and the random walk converges (with time rescaled by n^{3/2} and space rescaled by n^{1/2}) to Brownian motion on T_∞, and (analogously to section 13.1.1) the rescaled limits of the parameters are just the corresponding parameters for the Brownian motion. See the Notes for further comments.

13.3.5 Randomly-weighted random graphs


Fix a distribution W on (0, ∞) with EW < ∞. For each n consider the random graph G(n, p(n)), that is the graph on n vertices where each possible edge has chance p(n) to be present. Attach independent random conductances, distributed as W, to the edges. Aspects of this model were studied by Grimmett and Kesten [176]. As they observe, much of the behavior is intuitively rather clear, but technically difficult to prove: we shall just give the intuition.
Case (i): p(n) = µ/n for fixed 1 < µ < ∞. Here the number of edges at vertex 1 is asymptotically Poisson(µ), and the part of the graph within a fixed distance d of vertex 1 is asymptotically like the first d generations in the random family tree T^∞ of a Galton–Watson branching process with Poisson(µ) offspring distribution, with independent edge-weights attached. This tree essentially fits the setting of Proposition 13.13, except that the number of offspring may be zero and so the tree may be finite, but it is not hard to show (modifying the proof of Proposition 13.13) that the resistance R in T^∞ between the root and ∞ satisfies {R < ∞} = {T^∞ is infinite} and its distribution is characterized by the analog of (13.66). The asymptotic approximation implies that, for d(n) → ∞ slowly, the resistance R_{n,d(n)} between vertex 1 and the depth-d(n) vertices of G(n, p(n)) satisfies R_{n,d(n)} →^d R. We claim that the resistance R_{1,2}^{(n)} between vertices 1 and 2 of G(n, p(n)) satisfies

R_{1,2}^{(n)} →^d R^{(1)} + R^{(2)}, where R^{(1)} and R^{(2)} are independent copies of R.

The lower bound is clear by shorting, but the upper bound requires a complicated construction to connect the two sets of vertices at distances d(n) from vertices 1 and 2 in such a way that the effective resistance of this connecting network tends to zero.
The number of edges of the random graph is asymptotic to $n\mu/2$. So the
total edge weight $\sum_i \sum_j W_{ij}$ is asymptotic to $n\mu EW$, and by the commute
interpretation of resistance the mean commute time $C^{(n)}_{1,2}$ for random walk
on a realization of the graph satisfies

$$n^{-1} C^{(n)}_{1,2} \xrightarrow{d} \mu EW \,\bigl(R^{(1)} + R^{(2)}\bigr).$$
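The commute interpretation can be checked numerically on small instances. The sketch below is our own illustration (not from [176]): it samples G(n, µ/n) with i.i.d. Exponential(1) conductances as a concrete choice of W, computes the effective resistance between two fixed vertices from the pseudoinverse of the weighted Laplacian, and multiplies by the total edge weight. The function name `commute_time` is ours, and the code silently returns a finite value even when the two vertices happen to lie in different components, so it is only a sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def commute_time(n, mu):
    # Sample G(n, mu/n) with i.i.d. Exponential(1) edge conductances (EW = 1).
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < mu / n:
                A[i, j] = A[j, i] = rng.exponential(1.0)
    # Effective resistance between vertices 0 and 1 via the Laplacian
    # pseudoinverse: R = L+[0,0] + L+[1,1] - 2 L+[0,1].
    L = np.diag(A.sum(axis=1)) - A
    Lp = np.linalg.pinv(L)
    R = Lp[0, 0] + Lp[1, 1] - 2.0 * Lp[0, 1]
    # Commute interpretation: mean commute time = (total edge weight) * R,
    # where sum_i sum_j W_ij counts each edge in both directions.
    return A.sum() * R

print(commute_time(200, 3.0) / 200)  # order-1 quantity, per the n^{-1} scaling
```

Dividing by n illustrates the scaling in the display above; averaging over many realizations would estimate the limit constant.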

Case (ii): $p(n) = o(1)$ and $p(n) = \Omega(n^{\varepsilon - 1})$ for some $\varepsilon > 0$. Here the degree
of vertex 1 tends to $\infty$, and it is easy to see that the (random) stationary
probability $\pi_1$ and the (random) transition probabilities of the random walk satisfy

$$\max_j p_{1,j} \xrightarrow{p} 0, \qquad n\pi_1 \xrightarrow{p} 1 \quad \text{as } n \to \infty.$$
j

So for fixed $k \ge 1$, the $k$-step transition probabilities satisfy $p^{(k)}_{11} \xrightarrow{p} 0$ as
$n \to \infty$. This suggests, but it is technically hard to prove, that the (random)
fundamental matrix $Z$ satisfies

$$Z_{11} \xrightarrow{p} 1 \quad \text{as } n \to \infty. \tag{13.72}$$

Granted (13.72), we can apply Lemma 11 of Chapter 2 and deduce that the
mean hitting time $t(\pi, 1) = E_\pi T_1$ on a realization of the random graph
satisfies

$$n^{-1} t(\pi, 1) = \frac{Z_{11}}{n\pi_1} \xrightarrow{p} 1 \quad \text{as } n \to \infty. \tag{13.73}$$

13.3.6 Random environments in d dimensions


The phrase random walk in random environment (RWRE) is mostly used
to denote variations of the classical “random flight in d dimensions” model.
Such variations have been studied extensively in mathematical physics as
well as theoretical probability, and the monograph of Hughes [184] provides
thorough coverage. To give the flavor of the subject we quote one result,
due to Boivin [52].

Theorem 13.15 Assign random conductances (we ) to the edges of the two-
dimensional lattice Z2 , where
(i) the process (we ) is stationary ergodic.
(ii) c1 ≤ we ≤ c2 a.s., for some constants 0 < c1 < c2 < ∞.
Let (Xt ; t ≥ 0) be the associated random walk on this weighted graph, X0 = 0.
Then $t^{-1/2} X_t \xrightarrow{d} Z$ where $Z$ is a certain two-dimensional Normal distribution, and moreover this convergence holds for the conditional distribution
of $X_t$ given the environment, for almost all environments.

13.4 Notes on Chapter 13


Section 13.1. Rigorous setup for discrete-time continuous-space Markov
chains is given concisely in Durrett [133] section 5.6 and in detail in Meyn
and Tweedie [263]. For the more sophisticated continuous-time setting see
e.g. Rogers and Williams [295]. Aldous et al [24] prove some of the Chapter
4 mixing time inequalities in the discrete-time continuous-space setting.
The central limit theorem (for sums of functions of a Markov chain) does
not automatically extend from the finite-space setting (Chapter 2 Theorem
17) to the continuous-space setting: regularity conditions are required. See
[263] Chapter 17. But a remarkable result of Kipnis–Varadhan [217] shows
that for stationary reversible chains the central limit theorem remains true
under very weak hypotheses.
Sections 13.1.1 - 13.1.3. The eigenvalue analysis is classical. The re-
flection coupling goes back to folklore; see e.g. Lindvall [233] Chapter 6 for
applications to multidimensional diffusions and Matthews [259] for Brown-
ian motion in a polyhedron. Burdzy and Kendall [82] give a careful study
of coupling for Brownian motion in a triangle. Chen [89] surveys use of
coupling to estimate spectral gaps for diffusions on manifolds.
Here is a more concise though less explicit expression for $\bar{d}(t)$ at (13.6)
(and hence for $G(t)$ at (13.1)). Consider Brownian motions $B^\circ$ on the circle
started at 0 and at 1/2. At any time $t$, the former distribution dominates
the latter on the interval $(-1/4, 1/4)$ only, and so

$$\begin{aligned}
\bar{d}(t) &= P_0(B^\circ_t \in (-1/4, 1/4)) - P_{1/2}(B^\circ_t \in (-1/4, 1/4)) \\
&= P_0(B^\circ_t \in (-1/4, 1/4)) - P_0(B^\circ_t \in (1/4, 3/4)) \\
&= 2P_0(B^\circ_t \in (-1/4, 1/4)) - 1 \\
&= 2P\bigl((t^{1/2} Z) \bmod 1 \in (-1/4, 1/4)\bigr) - 1
\end{aligned}$$

where $Z$ has Normal(0, 1) law. We quoted this expression in the analysis of
Chapter 5 Example 7.
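The final expression is easy to evaluate numerically, by wrapping the Normal$(0, t)$ distribution around the circle as a sum over integer translates. A quick sketch (function names ours):

```python
import math

def Phi(x):
    # Standard Normal distribution function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dbar(t):
    # dbar(t) = 2 P((t^{1/2} Z) mod 1 in (-1/4, 1/4)) - 1, computed by
    # summing P(t^{1/2} Z in (k - 1/4, k + 1/4)) over integer translates k.
    s = math.sqrt(t)
    p = sum(Phi((k + 0.25) / s) - Phi((k - 0.25) / s) for k in range(-50, 51))
    return 2.0 * p - 1.0

print(dbar(0.001), dbar(0.05), dbar(1.0))  # decreases from near 1 toward 0
```

The rapid decay for moderate t reflects the exponentially small second Fourier mode of the wrapped Normal density.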
Section 13.1.5. Janvresse [195], Porod [285] and Rosenthal [298] study
mixing times for other flights on matrix groups involving rotations and re-
flections; Porod [284] also discusses more general Lie groups.
Section 13.1.6. The mathematical theory has mostly been developed
for classes of nested fractals, of which the Sierpinski gasket is the simplest.
See Barlow [40], Lindstrøm [232], Barlow [41] for successively more detailed
treatments. Closely related is Brownian motion on the continuum random
tree, mentioned in section 13.3.4.

One-dimensional diffusions. The continuous-space analog of a birth-and-


death process is a one-dimensional diffusion (Xt ), described by a stochastic
differential equation

dXt = µ(Xt )dt + σ(Xt )dBt

where Bt is standard Brownian motion and µ(·) and σ(·) are suitably regular
specified functions. See Karlin and Taylor [209] for non-technical introduc-
tion. Theoretical treatments standardize (via a one-to-one transformation
R → R) to the case µ(·) = 0, though for our purposes the standardization
to σ(·) = 1 is perhaps more natural. In this case, if the formula
$$f(x) \propto \exp\left( \int^x 2\mu(y)\,dy \right)$$
can give a density function f (x) then f is the stationary density. Such
diffusions relate to two of our topics.
(i) For MCMC, to estimate a density f (x) ∝ exp(−H(x)), one can in
principle simulate the diffusion with σ(x) = 1 and µ(x) = −H 0 (x)/2. This
idea was used in Chapter MCMC section 5.
(ii) Techniques for bounding the relaxation time for one-dimensional
diffusions parallel techniques for birth-and-death chains [90].
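As a sanity check of the stationary-density formula, here is a worked example of our own (not from the text): take $\sigma \equiv 1$ and the Ornstein–Uhlenbeck-type drift $\mu(x) = -x/2$, which is the MCMC prescription with $H(x) = x^2/2$. Then

```latex
f(x) \;\propto\; \exp\!\Big(\int^{x} 2\mu(y)\,dy\Big)
     \;=\; \exp\!\Big(\int^{x} (-y)\,dy\Big)
     \;=\; e^{-x^{2}/2},
```

the standard Normal density, consistent with $\mu = -H'/2$ giving stationary density $\propto e^{-H}$.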
Section 13.2. We again refer to Woess [339] for systematic treatment of
random walks on infinite graphs.
Our general theme of using the infinite case to obtain limits for finite
chains goes back at least to [8], in the case of $Z^d$; similar ideas occur in
the study of interacting particle systems, relating properties of finite and
infinite-site models.
Section 13.2.2. There is a remarkable connection between recurrence of
reversible chains and a topic in Bayesian statistics: see Eaton [138]. Prop-
erties of random walk on fractal-like infinite subsets of $Z^d$ are studied by
Telcs [322, 323].
Section 13.2.9. One view of (Yt ) is as one of several “toy models” for
the notion of random walk on fractional-dimensional lattice. Also, when we
seek to study complicated variations of random walk, it is often simpler to
use the hierarchical lattice than $Z^d$ itself. See for instance the sophisticated
study of self-avoiding walks by Brydges et al [76]; it would be interesting to
see whether direct combinatorial methods could reproduce their results.
Section 13.2.10. Another class of sequences could be defined as follows.
There are certain continuous-time, continuous-space reversible processes on
compact spaces which “hit points” and for which τ0 < ∞; for example

(i) Brownian motion on the circle (section 13.1.1)


(ii) Brownian motion on certain fractals (section 13.1.6)
(iii) Brownian motion on the continuum random tree (section 13.3.4).
So for a sequence of finite-state chains one can define the property

τ0 (n)/τ2 (n) is bounded

as the finite analog of “diffusions which hit points”. This property holds for
the discrete approximations to the examples above:
(i) random walk on the n-cycle
(ii) random walk on graphs approximating fractals (section 13.1.6)
(iii) random walk on random n-vertex trees (section 13.3.4).
Equivalence (13.58) is hard to find in textbooks. The property “trivial
boundary” is equivalent to “no non-constant bounded harmonic functions”
([339] Corollary 24.13), which is equivalent ([328] Theorem 6.5.1) to ex-
istence of successful shift-coupling of two versions of the chain started at
arbitrary points. The property (13.58) is equivalent ([328] Theorem 4.9.4)
to existence of successful couplings. In the setting of interest to us (con-
tinuized chains on countable space), existence of a shift-coupling (a priori
weaker than existence of a coupling) for the discrete-time chain implies ex-
istence of a coupling for the continuous-time chain, by using independence
of jump chain and hold times.
Section 13.3. Grimmett [175] surveys “random graphical networks” from
a somewhat different viewpoint, emphasising connections with statistical
physics models.
Section 13.3.1. More precise variants of Proposition 3.35 were developed
in the 1970s, e.g. [281, 92]. Lubotzky [243], who attributes this method of
proof of Proposition 13.11 to Sarnak [305], asserts the result for k ≥ 5 but
our own calculations give only k ≥ 7. Note that Proposition 13.11 uses
the permutation model of a 2k-regular random graph. In the alternative
uniform model we put 2k balls labeled 1, 2k balls labeled 2, . . . , and 2k
balls labeled n into a box; then repeatedly draw two balls at a time without
replacement, putting an edge between the two vertices whose labels were drawn.
may be improper (multiple edges or self-loops) and unconnected, but are in
fact proper with probability Ω(1) and connected with probability 1 − o(1)
as n → ∞ for fixed k. Behavior of τc in the uniform model is implicitly
studied in Bollobás [54]. The $L^2$ ideas underlying the proof of Proposition
13.12 were used by Broder and Shamir [68], Friedman [155] and Kahn and
Szemerédi [157] in the setting of the permutation model of random r-regular
graphs. One result is that $\beta \equiv \max(\lambda_2, -\lambda_n) = O\bigl(2\sqrt{2r-1}/r\bigr)$ with probability

1 − o(1). Further results in the “random Cayley graph” spirit of Proposition


13.12 can be found in [28, 130, 271].
Section 13.3.2. The monograph of Lyons and Peres [250] contains many
more results concerning random walks on infinite deterministic and Galton–
Watson trees. A challenging open problem noted in [249] is to prove that R
has absolutely continuous distribution when ξ is non-constant. The method
of fictitious roots used in Proposition 13.14 is also an ingredient in the
analysis of cover times on trees [16].
Section 13.3.4. Moon [264] gives further results in the spirit of (13.70),
e.g. for variances of hitting times. The fact that random walk on Tn rescales
to Brownian motion on a “continuum random tree” $T_\infty$ was outlined in Aldous [14] section 5 and proved in Krebs [218]. While this makes the “order
$n^{3/2}$” property (13.71) of the parameters essentially obvious, it is still difficult
to get explicit information about the limit distributions $\tau^{(\infty)}$. What’s known [14] is
(a) $E\tau_0^{(\infty)} = \sqrt{\pi/2}$, as suggested by (13.70);
(b) $\tau^{(\infty)*} = \frac{8}{3}\sqrt{2\pi}$, from (13.69) and the known asymptotics for the diameter of $T_n$;
(c) the “cover and return” time $C_n^+$ appearing in Chapter 6 satisfies $n^{-3/2} EC_n^+ \to 6\sqrt{2\pi}$, modulo some technical issues.
Section 13.3.5. Grimmett and Kesten [176] present their results in terms
of resistances, without explicitly mentioning random walk, so that results
like (13.73) are only implicit in their work.
Chapter 14

Interacting Particles on Finite Graphs (March 10, 1994)

There is a well-established topic “interacting particle systems”, treated in


the books by Griffeath [172], Liggett [231], and Durrett [132], which studies
different models for particles on the infinite lattice $Z^d$. All these models
make sense, but mostly have not been systematically studied, in the con-
text of finite graphs. Some of these models – the voter model, the antivoter
model, and the exclusion process – are related (either directly or “via du-
ality”) to interacting random walks, and setting down some basic results
for these models on finite graphs (sections 14.3 - 14.5) is the main purpose
of this chapter. Our focus is on applying results developed earlier in the
book. With the important exception of duality, we do not use the deeper
theory developed in the infinite setting. As usual, whether the deeper the-
ory is applicable to the type of questions we ask in the finite setting is an
interesting open question. These models are most naturally presented in
continuous time, so our default convention is to work with continuous-time
random walk.

We have already encountered results whose natural proofs were “by cou-
pling”, and this is a convenient place to discuss couplings in general.


14.1 Coupling
If X and Y are random variables with Binomial (n, p1 ) and (n, p2 ) distribu-
tions respectively, and if p1 ≤ p2 , then it is intuitively obvious that

P (X ≥ x) ≤ P (Y ≥ x) for all x. (14.1)

One could verify this from the exact formulas, but there is a more elegant
non-computational proof. For 1 ≤ i ≤ n define events (Ai , Bi , Ci ), indepen-
dent as i varies, with P (Ai ) = p1 , P (Bi ) = p2 − p1 , P (Ci ) = 1 − p2 . And
define

$$X' = \sum_i 1_{A_i} = \text{number of } A\text{'s which occur}$$
$$Y' = \sum_i 1_{A_i \cup B_i} = \text{number of } A\text{'s and } B\text{'s which occur.}$$

Then $X' \le Y'$, so (14.1) holds for $X'$ and $Y'$; but then because $X' \stackrel{d}{=} X$
and $Y' \stackrel{d}{=} Y$ we have proved that (14.1) holds for $X$ and $Y$. This is the
prototype of a coupling argument, which (in its wide sense) means

to prove some distributional inequality relating two random pro-


cesses X, Y by constructing versions X 0 , Y 0 which satisfy some
sample path inequality.
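The binomial prototype can be realized with a single uniform variable per trial: take $A_i = \{U_i < p_1\}$ and $A_i \cup B_i = \{U_i < p_2\}$. A minimal sketch of this construction (function name ours), checking the sample-path inequality on every draw:

```python
import random

random.seed(1)

def coupled_binomials(n, p1, p2):
    # One joint sample (X', Y'): A_i = {U_i < p1}, A_i ∪ B_i = {U_i < p2},
    # so X' counts the A_i that occur and Y' counts the A_i or B_i.
    x = y = 0
    for _ in range(n):
        u = random.random()
        x += u < p1
        y += u < p2
    return x, y

# The sample-path inequality X' <= Y' holds on every single draw.
assert all(x <= y for x, y in (coupled_binomials(20, 0.3, 0.6) for _ in range(10000)))
```

Marginally X' is Binomial(n, p1) and Y' is Binomial(n, p2), so the pathwise inequality immediately gives (14.1).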

Our first “process” example is a somewhat analogous proof of part (a)


of the following result, which abstracts slightly a result stated for random
walk on distance-regular graphs (Chapter 7 Proposition yyy).

Proposition 14.1 Let $(X_t)$ be an irreducible continuous-time birth-and-death chain on states $\{0, 1, \ldots, \Delta\}$.
(a) $P_0(X_t = i)/\pi_i$ is non-increasing in $i$, for fixed $t$.
(b) $P_0(X_t = i)/P_0(X_t = 0)$ is non-decreasing in $t$, for fixed $i$.

Proof. Fix $i_1 \le i_2$. Suppose we can construct processes $Y_t$ and $Z_t$, distributed as the given chain started at $i_1$ and $i_2$ respectively, such that

$$Y_t \le Z_t \quad \text{for all } t. \tag{14.2}$$

Then
$$P_{i_1}(X_t = 0) = P(Y_t = 0) \ge P(Z_t = 0) = P_{i_2}(X_t = 0).$$

But by reversibility
$$P_{i_1}(X_t = 0) = \frac{\pi_0}{\pi_{i_1}} P_0(X_t = i_1)$$
and similarly for $i_2$, establishing (a).
Existence of processes satisfying (14.2) is a consequence of the Doeblin
coupling discussed below. The proof of part (b) involves a different technique
and is deferred to section 14.1.3.

14.1.1 The coupling inequality


Consider a finite-state chain in discrete or continuous time. Fix states $i, j$.
Suppose we construct a joint process $(X^{(i)}_t, X^{(j)}_t;\ t \ge 0)$ such that

$(X^{(i)}_t, t \ge 0)$ is distributed as the chain started at $i$
$(X^{(j)}_t, t \ge 0)$ is distributed as the chain started at $j$. (14.3)

And suppose there is a random time $T \le \infty$ such that

$$X^{(i)}_t = X^{(j)}_t, \quad T \le t < \infty. \tag{14.4}$$

Call such a $T$ a coupling time. Then the coupling inequality is

$$\|P_i(X_t \in \cdot) - P_j(X_t \in \cdot)\| \le P(T > t), \quad 0 \le t < \infty. \tag{14.5}$$

The inequality is clear once we observe $P(X^{(i)}_t \in \cdot,\ T \le t) = P(X^{(j)}_t \in \cdot,\ T \le t)$.
The coupling inequality provides a method of bounding the variation
distance $\bar{d}(t)$ of Chapter 2 section yyy.
The most common strategy for constructing a coupling satisfying (14.3)
is via Markov couplings, as follows. Suppose the underlying chain has
state space $I$ and (to take the continuous-time case) transition rate matrix $Q = (q(i, j))$. Consider a transition rate matrix $\tilde{Q}$ on the product space
$I \times I$. Write the entries of $\tilde{Q}$ as $\tilde{q}(i, j; k, l)$ instead of the logical-but-fussy
$\tilde{q}((i, j), (k, l))$. Suppose that, for each pair $(i, j)$ with $j \ne i$,

$$\tilde{q}(i, j; \cdot, \cdot) \text{ has marginals } q(i, \cdot) \text{ and } q(j, \cdot) \tag{14.6}$$

in other words $\sum_l \tilde{q}(i, j; k, l) = q(i, k)$ and $\sum_k \tilde{q}(i, j; k, l) = q(j, l)$. And
suppose that

$$\tilde{q}(i, i; k, k) = q(i, k) \text{ for all } k$$
$$\tilde{q}(i, i; k, l) = 0 \text{ for } l \ne k.$$

Take $(X^{(i)}_t, X^{(j)}_t)$ to be the chain on $I \times I$ with transition rate matrix $\tilde{Q}$ and
initial position $(i, j)$. Then (14.3) must hold, and $T \equiv \min\{t : X^{(i)}_t = X^{(j)}_t\}$
is a coupling time. This construction gives a Markov coupling, and all the
examples where we use the coupling inequality will be of this form. In
practice it is much more understandable to define the joint process in words,
say in terms of red and black particles.
A particular choice of $\tilde{Q}$ is

$$\tilde{q}(i, j; k, l) = q(i, k)\,q(j, l), \quad j \ne i \tag{14.7}$$

in which case the joint process is called the Doeblin coupling. In words, the
Doeblin coupling consists of starting one particle at $i$ and the other particle
at $j$, and letting the two particles move independently until they meet, at
time $M_{i,j}$ say, and thereafter letting them stick together. In the particular
case of a birth-and-death process, the particles cannot cross without meeting
(in continuous time), and so if $i < j$ then $X^{(i)}_t \le X^{(j)}_t$ for all $t$, the property
we used at (14.2).
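The no-crossing property can be seen in simulation. The sketch below is our own; for simplicity it assumes a chain with equal jump rates at every state (a reflecting simple walk), so that at each event an independent fair coin decides which walker jumps. It runs the jump chain of the Doeblin coupling and checks that the order of the two walkers is preserved until they meet.

```python
import random

random.seed(2)

D = 10  # birth-and-death chain on {0, 1, ..., D}, reflecting at 0 and D

def jump(x):
    # One jump of the simple reflecting birth-and-death chain.
    if x == 0:
        return 1
    if x == D:
        return D - 1
    return x + random.choice((-1, 1))

def doeblin_jump_chain(i, j, nevents=2000):
    # Jump-chain view of the Doeblin coupling: before meeting, a fair coin
    # decides which walker jumps (equal rates assumed); afterwards they stick.
    y, z = min(i, j), max(i, j)
    path = []
    for _ in range(nevents):
        if y == z:
            y = z = jump(y)
        elif random.random() < 0.5:
            y = jump(y)
        else:
            z = jump(z)
        path.append((y, z))
    return path

# The lower walker never overtakes the upper one: they meet before crossing.
assert all(y <= z for y, z in doeblin_jump_chain(2, 7))
```

Since only one walker moves at each event, and each move is by ±1, the walkers must occupy the same state before their order can reverse.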

14.1.2 Examples using the coupling inequality


Use of the coupling inequality has nothing to do with reversibility. In fact it
finds more use in the irreversible setting, where fewer alternative methods
are available for quantifying convergence to stationarity. In the reversible
setting, coupling provides a quick way to get bounds which usually (but not
always) can be improved by other methods. Here are two examples we have
seen before.

Example 14.2 Random walk on the d-cube (Chapter 5 Example yyy).

For $i = (i_1, \ldots, i_d)$ and $j = (j_1, \ldots, j_d)$ in $I = \{0, 1\}^d$, let $D(i, j)$ be the
set of coordinates $u$ where $i$ and $j$ differ. Write $i^u$ for the state obtained
by changing the $u$'th coordinate of $i$. Recall that in continuous time the
components move independently as 2-state chains with transition rates $1/d$.
In words, the coupling is “run unmatched coordinates independently until
they match, and then run them together”. Formally, the non-zero transitions
of the joint process are

$$\tilde{q}(i, j; i^u, j^u) = 1/d \text{ if } i_u = j_u$$
$$\tilde{q}(i, j; i^u, j) = 1/d \text{ if } i_u \ne j_u$$
$$\tilde{q}(i, j; i, j^u) = 1/d \text{ if } i_u \ne j_u.$$

For each coordinate which is initially unmatched, it takes exponential (rate
$2/d$) time until it is matched, and so the coupling time $T$ satisfies

$$T \stackrel{d}{=} \max(\xi_1, \ldots, \xi_{d_0})$$

where the $(\xi_u)$ are independent exponential (rate $2/d$) and $d_0 = d(i, j)$ is the
initial number of unmatched coordinates. So

$$P(T \le t) = (1 - \exp(-2t/d))^{d_0}$$

and the coupling inequality bounds variation distance as

$$\bar{d}(t) \le 1 - (1 - \exp(-2t/d))^d.$$

This leads to an upper bound on the variation threshold time

$$\tau_1 \le \bigl(\tfrac{1}{2} + o(1)\bigr) d \log d \quad \text{as } d \to \infty.$$

In this example we saw in Chapter 5 that in fact

$$\tau_1 \sim \tfrac{1}{4}\, d \log d \quad \text{as } d \to \infty$$

so the coupling bound is off by a factor of 2.
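Since $P(T > t) = 1 - (1 - e^{-2t/d})^{d_0}$ is largest when $d_0 = d$, the coupling bound is explicit and can be probed numerically. The sketch below (function names ours) bisects for the smallest $t$ at which the bound drops to $1/e$ and compares it with $\frac{1}{2} d \log d$; the ratio approaches 1 slowly as $d$ grows.

```python
import math

def dbar_bound(t, d):
    # Coupling bound on variation distance, worst initial state (d0 = d).
    return 1.0 - (1.0 - math.exp(-2.0 * t / d)) ** d

def threshold(d):
    # Bisect for the smallest t with dbar_bound(t, d) <= 1/e, a proxy for
    # the variation threshold implied by the coupling bound.
    lo, hi = 0.0, 10.0 * d * math.log(d)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if dbar_bound(mid, d) <= math.e ** -1:
            hi = mid
        else:
            lo = mid
    return hi

for d in (100, 1000, 10000):
    print(d, threshold(d) / (0.5 * d * math.log(d)))  # ratio tends to 1
```

The first-order term $\frac{1}{2} d \log d$ dominates, with a lower-order correction of size $O(d)$.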
Example 14.3 Random walk on a dense regular graph (Chapter 5 Example
yyy).

Consider an r-regular n-vertex graph. Write $N(v)$ for the set of neighbors
of $v$. For any pair $v, w$ we can define a one-to-one map $\theta_{v,w} : N(v) \to N(w)$
such that $\theta_{v,w}(x) = x$ for $x \in N(v) \cap N(w)$. We can now define a “greedy
coupling” by

$$\tilde{q}(v, w; x, \theta_{v,w}(x)) = 1/r, \quad x \in N(v).$$

In general one cannot get useful bounds on the coupling time $T$. But consider
the dense case, where $r > n/2$. As observed in Chapter 5 Example yyy, here
$|N(v) \cap N(w)| \ge 2r - n$ and so the coupled processes $(X_t, Y_t)$ have the
property that for $w \ne v$

$$P(X_{t+dt} = Y_{t+dt} \mid X_t = v, Y_t = w) = \frac{|N(v) \cap N(w)|}{r}\, dt \ge \frac{2r - n}{r}\, dt$$

implying that $T$ satisfies

$$P(T > t) \le \exp(-(2r - n)t/r).$$

So the coupling inequality implies $\bar{d}(t) \le \exp(-(2r-n)t/r)$, and in particular
the variation threshold satisfies

$$\tau_1 \le \frac{r}{2r - n}.$$

14.1.3 Comparisons via couplings


We now give two examples of coupling in the wide sense, to compare different
processes. The first is a technical result (inequality (14.8) below) which we
needed in Chapter 6 yyy. The second is the proof of Proposition 14.1(b).
Example 14.4 Exit times for constrained random walk.
Let $(X_t)$ be discrete-time random walk on a graph $G$, let $A$ be a subset of
the vertices of $G$ and let $(Y_t)$ be random walk on the subgraph induced by
$A$. Given $B \subset A$, let $S$ be the first hitting time of $(Y_t)$ on $B$, and let $T$ be
the first hitting time of $(X_t)$ on $B \cup A^c$. Then

$$E_i T \le E_i S, \quad i \in A. \tag{14.8}$$

This is “obvious”, and the reason it’s obvious is by coupling. We can construct coupled processes $(X', Y')$ with the property that, if both particles
are at the same position $a$ in $A$, and if $X'$ jumps to another state $b$ in $A$,
then $Y'$ jumps to the same state $b$. This property immediately implies that,
for the coupled processes started at the same state in $A$, we have $T' \le S'$
and hence (14.8).
In words, here is the coupling $(X', Y')$. When the particles are at different positions they jump independently. When they are at the same position,
first let $X'$ jump; if $X'$ jumps to a vertex in $A$ let $Y'$ jump to the same vertex,
and otherwise let $Y'$ jump to a uniform random neighbor in $A$. Formally,
the coupled process moves according to the transition matrix $\tilde{P}$ on $G \times A$
defined by

$$\tilde{p}(x, a; y, b) = p^G(x, y)\, p^A(a, b) \quad \text{if } x \notin A \text{ or } x \ne a$$
$$\tilde{p}(a, a; b, b) = p^G(a, b), \quad b \in A$$
$$\tilde{p}(a, a; y, b) = p^G(a, y)\, p^A(a, b), \quad b \in A,\ y \in A^c$$

where $p^A$ and $p^G$ refer to transition probabilities for the original random
walks on $A$ and $G$.
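A concrete instance of this coupling can be simulated. The choices below are our own illustration: G is the path 0–9, A = {3, ..., 8} (inducing the path 3–8), and B = {8}. On every run the coupled construction gives T' ≤ S' pathwise.

```python
import random

random.seed(5)

# G is the path 0-1-...-9; A = {3,...,8} induces the path 3-...-8; B = {8}.
A = set(range(3, 9))
B = {8}

def nbrs_G(v):
    return [u for u in (v - 1, v + 1) if 0 <= u <= 9]

def nbrs_A(v):
    return [u for u in (v - 1, v + 1) if u in A]

def coupled_run(start):
    # X' walks on G until it hits B ∪ A^c (time T'); Y' walks on the
    # subgraph induced by A until it hits B (time S').  While the walkers
    # coincide and X' jumps inside A, Y' copies the jump; once X' leaves A
    # (or hits B), T' is recorded and Y' finishes on its own.
    x = y = start
    T = None
    t = 0
    while True:
        t += 1
        if T is None:  # walkers are still at the same position in A
            x = random.choice(nbrs_G(x))
            y = x if x in A else random.choice(nbrs_A(y))
            if x in B or x not in A:
                T = t
        else:
            y = random.choice(nbrs_A(y))
        if y in B:
            return T, t  # t is S'

for _ in range(2000):
    T, S = coupled_run(random.choice(sorted(A - B)))
    assert T <= S  # pathwise version of E_i T <= E_i S
```

The walkers can only separate at the instant X' exits A, and at that instant T' has already occurred, which is why the pathwise inequality is automatic.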
Proof of Proposition 14.1(b). Fix $i \ge 1$. By reversibility it is sufficient
to prove

$$\frac{P_0(X_t = i)}{P_0(X_t = 0)} \text{ is non-decreasing in } t.$$

Consider the Doeblin coupling $(X^{(0)}_t, X^{(i)}_t)$ of the processes started at 0 and
at $i$, with coupling time $T$. Since $X^{(0)}_t \le X^{(i)}_t$ we have

$$P(X^{(i)}_t = 0) = P(X^{(0)}_t = 0,\ T \le t)$$

and so we have to prove

$$P(T \le t \mid X^{(0)}_t = 0) \text{ is non-decreasing in } t.$$

It suffices to show that, for $t_1 > t$,

$$P(T \le t \mid X^{(0)}_{t_1} = 0) \ge P(T \le t \mid X^{(0)}_t = 0)$$

and thus, by considering the conditional distribution of $X^{(0)}_t$ given $X^{(0)}_{t_1} = 0$,
it suffices to show that

$$P(T \le t \mid X^{(0)}_t = j) \ge P(T \le t \mid X^{(0)}_t = 0) \tag{14.9}$$

for $j \ge 0$. So fix $j$ and $t$. Write $(X^{(0,j)}_s, 0 \le s \le t)$ for the process conditioned
on $X_0 = 0$, $X_t = j$. By considering time running backwards from $t$ to 0,
the processes $X^{(0,0)}$ and $X^{(0,j)}$ are the same non-homogeneous Markov chain
started at the different states 0 and $j$, and we can use the Doeblin coupling in
this non-homogeneous setting to construct versions of these processes with

$$X^{(0,0)}_s \le X^{(0,j)}_s, \quad 0 \le s \le t.$$

Now introduce an independent copy of the original process, started at time
0 in state $i$. If this process meets $X^{(0,0)}$ before time $t$ then it must also meet
$X^{(0,j)}$ before time $t$, establishing (14.9).

14.2 Meeting times


Given a Markov chain, the meeting time Mi,j is the time at which indepen-
dent copies of the chain started at i and at j first meet. Meeting times arose
in the Doeblin coupling and arise in several other contexts later, so deserve
a little study. It is natural to try to relate meeting times to properties such
as hitting times for a single copy of the chain. One case is rather simple.
Consider a distribution dist($\xi$) on a group $G$ such that

$$\xi \stackrel{d}{=} \xi^{-1}; \qquad g\xi \stackrel{d}{=} \xi g \text{ for all } g \in G.$$

Now let $X_t$ and $Y_t$ be independent copies of the continuization of random
flight on $G$ with step-distribution $\xi$. Then if we define $Z_t = X_t^{-1} Y_t$, it is
easy to check that $Z$ is itself the continuization of the random flight, but
run at twice the speed, i.e. with transition rates

$$q_Z(g, h) = 2P(g\xi = h).$$



It follows that $EM_{i,j} = \frac{1}{2} E_i T_j$. The next result shows this equality holds under less symmetry, and (more importantly) that an inequality holds without any symmetry.
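The relation $EM_{i,j} = \frac{1}{2} E_i T_j$ can be checked by simulation on a symmetric example; the sketch below is our own, using continuous-time simple random walk on the 12-cycle, where $E_0 T_k = k(n-k)$ by the gambler's-ruin formula. Two independent rate-1 walkers jump at total rate 2, with a fair coin deciding which one moves.

```python
import random

random.seed(3)

n = 12  # continuous-time simple random walk on the n-cycle

def hit_time(i, j):
    # Time for a single rate-1 walker started at i to hit j.
    t, x = 0.0, i
    while x != j:
        t += random.expovariate(1.0)
        x = (x + random.choice((-1, 1))) % n
    return t

def meet_time(i, j):
    # Two independent rate-1 walkers; one of them jumps at total rate 2.
    t, x, y = 0.0, i, j
    while x != y:
        t += random.expovariate(2.0)
        if random.random() < 0.5:
            x = (x + random.choice((-1, 1))) % n
        else:
            y = (y + random.choice((-1, 1))) % n
    return t

N = 20000
mean_hit = sum(hit_time(0, 4) for _ in range(N)) / N   # E_0 T_4 = 4(12-4) = 32
mean_meet = sum(meet_time(0, 4) for _ in range(N)) / N # should be close to 16
print(mean_hit, mean_meet)
```

Here the difference walk is again simple random walk run at twice the speed, which is exactly the group-symmetry argument above specialized to $Z_{12}$.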

Proposition 14.5 For a continuous-time reversible Markov chain, let $T_j$ be
the usual first hitting time and let $M_{i,j}$ be the meeting time of independent
copies of the chain started at $i$ and $j$. Then $\max_{i,j} EM_{i,j} \le \max_{i,j} E_i T_j$. If
moreover the chain is symmetric (recall the definition from Chapter 7 yyy)
then $EM_{i,j} = \frac{1}{2} E_i T_j$.

Proof. This is really just a special case of the cat and mouse game of Chapter
3 section yyy, where the player is using a random strategy to decide which
animal to move. Write $X_t$ and $Y_t$ for the chains started at $i$ and $j$. Write
$f(x, y) = E_x T_y - E_\pi T_y$. Follow the argument in Chapter 3 yyy to verify

$$S_t \equiv (2t + f(X_t, Y_t);\ 0 \le t \le M_{i,j}) \text{ is a martingale.}$$

Then

$$\begin{aligned}
E_i T_j - E_\pi T_j = ES_0 &= ES_{M_{i,j}} \quad \text{by the optional sampling theorem} \\
&= 2EM_{i,j} + Ef(X_{M_{i,j}}, Y_{M_{i,j}}) \\
&= 2EM_{i,j} - E\bar{t}(X_{M_{i,j}}), \quad \text{where } \bar{t}(k) = E_\pi T_k.
\end{aligned}$$

In the symmetric case we have $\bar{t}(k) = \tau_0$ for all $k$, establishing the desired
equality. In general we have $\bar{t}(k) \le \max_{i,j} E_i T_j$ and the stated inequality
follows.
Remarks. Intuitively the bound in Proposition 14.5 should be reasonable
for “not too asymmetric” graphs. But on the n-star (Chapter 5 yyy), for ex-
ample, we have maxi,j EMi,j = Θ(1) while maxi,j Ei Tj = Θ(n). The “Θ(1)”
in that example comes from concentration of the stationary distribution,
and on a regular graph we can use Chapter 3 yyy to obtain

$$\sum_i \sum_j \pi_i \pi_j EM_{i,j} \ge \frac{(n-1)^2}{2n}.$$

But we can construct regular graphs which mimic the n-star in the sense
that maxi,j EMi,j = o(τ0 ). A more elaborate result, which gives the correct
order of magnitude on the n-star, was given in Aldous [15].

Proposition 14.6 For a continuous-time reversible chain,

$$\max_{i,j} EM_{i,j} \le K \left( \sum_i \frac{\pi_i}{\max(E_\pi T_i, \tau_1)} \right)^{-1}$$

for an absolute constant $K$.

The proof is too lengthy to reproduce, but let us observe as a corollary that
we can replace the maxi,j Ei Tj bound in Proposition 14.5 by the a priori
smaller quantity τ0 , at the expense of some multiplicative constant.

Corollary 14.7 For a continuous-time reversible chain,

$$\max_{i,j} EM_{i,j} \le K\tau_0$$

for an absolute constant $K$.

Proof of Corollary 14.7. First recall from Chapter 4 yyy the inequality

τ1 ≤ 66τ0 . (14.10)

“Harmonic mean ≤ arithmetic mean” gives the first inequality in

$$\begin{aligned}
\left( \sum_i \frac{\pi_i}{\max(E_\pi T_i, \tau_1)} \right)^{-1} &\le \sum_i \pi_i \max(E_\pi T_i, \tau_1) \\
&\le \sum_i \pi_i (E_\pi T_i + \tau_1) \\
&\le \tau_0 + \tau_1 \\
&\le 67\tau_0 \quad \text{by (14.10)}
\end{aligned}$$

and so the result is indeed a corollary of Proposition 14.6.


Two interesting open problems remain. First, does Proposition 14.6
always give the right order of magnitude, i.e.

Open Problem 14.8 In the setting of Proposition 14.6, does there exist
an absolute constant $K$ such that

$$K \max_{i,j} EM_{i,j} \ge \left( \sum_i \frac{\pi_i}{\max(E_\pi T_i, \tau_1)} \right)^{-1}?$$

The other open problem is whether some modification of the proof of Proposition 14.5 would give a small constant $K$ in Corollary 14.7. To motivate
this question, note that the coupling inequality applied to the Doeblin coupling shows that for any chain $\bar{d}(t) \le \max_{i,j} P(M_{i,j} > t)$. Then Markov’s
inequality shows that the variation threshold satisfies $\tau_1 \le e \max_{i,j} EM_{i,j}$.
In the reversible setting, Corollary 14.7 now implies $\tau_1 \le eK\tau_0$ where $K$
is the constant in Corollary 14.7. So a direct proof of Corollary 14.7 with
small $K$ would improve the numerical constant in inequality (14.10).

14.3 Coalescing random walks and the voter model


Sections 14.3 and 14.4 treat some models whose behavior relates “by dual-
ity” to random-walk-type processes. It is possible (see Notes) to fit all our
examples into an abstract duality framework, but for the sake of concrete-
ness I haven’t done so. Note that for simplicity we work in the setting of
regular graphs, though the structural results go over to general graphs and
indeed to weighted graphs.
Fix an r-regular n-vertex graph G. In the voter model we envisage a person
at each vertex. Initially each person has a different opinion (person i has
opinion i, say). As time passes, opinions change according to the following
rule. For each person i and each time interval [t, t + dt], with chance dt
the person chooses uniformly at random a neighbor (j, say) and changes (if
necessary) their opinion to the current opinion of person j. Note that the
total number of existing opinions can only decrease with time, and at some
random time Cvm there will be only one “consensus” opinion.
In the coalescing random walk process, at time 0 there is one particle at
each vertex. These particles perform independent continuous-time random
walks on the graph, but when particles meet they coalesce into clusters and
the cluster thereafter sticks together and moves as a single random walk. So
at time t there are clusters, composed of one or more particles, at distinct
vertices, and during [t, t+dt] each cluster has chance dt to move to a random
neighbor and (if that neighbor is occupied by another cluster) to coalesce
with that other cluster. Note that the total number of clusters can only
decrease with time, and at some random time Ccrw the particles will have
all coalesced into a single cluster.
Remarkably, the two random variables Cvm and Ccrw associated with
the two models turn out to have the same distribution, depending only on
the graph G. The explanation is that the two processes can be obtained by
looking at the same picture in two different ways. Here’s the picture. For

each edge e and each direction on e, create a Poisson process of rate 1/r. In
the figure, G is the 8-cycle, “time” is horizontal and an event of the Poisson
process for edge (i, j) at time t is indicated by a vertical arrow i → j at time
t.

[Figure: graphical construction on the 8-cycle. Vertices 0–8 (with 8 = 0) run vertically; time runs horizontally, from 0 to t0 left to right for the voter model and right to left for the coalescing random walk. Vertical arrows i → j mark events of the rate-1/r Poisson process for the directed edge (i, j).]

In the voter model, we interpret time as increasing left-to-right from 0


to t0 , and we interpret an arrow j → i at time t as meaning that person
j adopts i’s opinion a time t. In the coalescing random walk model, we
interpret time as increasing right-to-left from 0 to t0 , and we interpret an
arrow j → i at time t as meaning that the cluster (if any) at state j at time
t jumps to state i, and coalesces with the cluster at i (if any).
So for fixed t0 , we can regard both processes as constructed from the
same Poisson process of “arrows”. For any vertices i, j, k the event (for the
voter model)
The opinions of persons i and j at time t0 are both the opinion

initially held by k

is exactly the same as the event (for the coalescing random walk process)

The particles starting at i and at j have coalesced before time


t0 and their cluster is at vertex k at time t0 .

The horizontal lines in the figure indicate part of the trajectories. In terms
of the coalescing random walks, the particles starting at 5 and 7 coalesce,
and the cluster is at 4 at time t0 . In terms of the voter model, the opinion
initially held by person 4 is held by persons 5 and 7 at time t0 . The reader
may (provided this is not a library book) draw in the remaining trajectories,
and will find that exactly 3 of the initial opinions survive, i.e. that the
random walks coalesce into 3 clusters.
In particular, the event (for the voter model)

By time t0 everyone’s opinion is the opinion initially held by


person k

is exactly the same as the event (for the coalescing random walk process)

All particles have coalesced by time t0 , and the cluster is at k at


time t0 .

So P (Cvm ≤ t0 ) = P (Ccrw ≤ t0 ), and these two times (which we shall now


call just C) do indeed have the same distribution.
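The graphical construction can be coded directly, and the pathwise identity checked: run the voter model forward over a fixed stream of arrows, and the coalescing walks over the same stream reversed; the particle started at v ends exactly at the vertex whose initial opinion v holds at time t0. A sketch on the 8-cycle (variable names and the choices n = 8, t0 = 6 are ours):

```python
import random

random.seed(4)

n = 8     # vertices of the n-cycle; each vertex has r = 2 neighbors
t0 = 6.0  # time horizon

def arrow_stream():
    # For each directed edge (j, i), a Poisson (rate 1/r = 1/2) stream of
    # arrows j -> i on [0, t0], merged and sorted by time.
    evs = []
    for j in range(n):
        for i in ((j - 1) % n, (j + 1) % n):
            t = random.expovariate(0.5)
            while t < t0:
                evs.append((t, j, i))
                t += random.expovariate(0.5)
    return sorted(evs)

evs = arrow_stream()

# Voter model, time left to right: arrow j -> i means j adopts i's opinion.
opinion = list(range(n))
for t, j, i in evs:
    opinion[j] = opinion[i]

# Coalescing walks, same arrows, time right to left: any particle at j jumps
# to i (coalescing with whatever is there).  cluster[v] = final position of
# the particle started at v.
cluster = list(range(n))
for t, j, i in reversed(evs):
    cluster = [i if pos == j else pos for pos in cluster]

# Pathwise duality: v's opinion at t0 is the initial opinion of the vertex
# where v's particle ends up, so the two arrays coincide entry by entry.
assert opinion == cluster
print(len(set(opinion)), "surviving opinions =", len(set(cluster)), "clusters")
```

In particular the number of surviving opinions equals the number of clusters for every realization of the arrows, which is the distributional identity Cvm = Ccrw in its strongest, pathwise form.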
We now discuss bounds on EC. It is interesting that the two models give
us quite different ways to prove bounds. Bounding EC here is somewhat
analogous to the problem of bounding mean cover time, discussed in Chapter
6.

14.3.1 A bound using the voter model


Recall from Chapter 4 yyy the definition of the Cheeger time constant τc .
In the present setting of an r-regular graph, the definition implies that for
any subset $A$ of vertices

$$\text{number of edges linking } A \text{ and } A^c \ge \frac{r|A|(n - |A|)}{n\tau_c}. \tag{14.11}$$

Proposition 14.9 (a) If $G$ is $s$-edge-connected then $EC \le \frac{rn^2}{4s}$.
(b) $EC \le (2 \log 2)\, \tau_c n$.

Proof. The proof uses two ideas. The first is a straightforward compari-
son lemma.

Lemma 14.10 Let (Xt ) be a continuous-time chain on states I. Let f :


I → {0, 1, . . . , n} be such that f (Xt ) never jumps by more than 1, and such
that there exist strictly positive constants γ, a(1), . . . , a(n − 1) such that, for
each 1 ≤ i ≤ n − 1 and each state x with f (x) = i,

P (f (Xt+dt ) = i + 1|Xt = x)/dt = P (f (Xt+dt ) = i − 1|Xt = x)/dt ≥ γa(i).
Then
Ex T{f −1 (0),f −1 (n)} ≤ γ −1 E∗f (x) T∗{0,n}

where E ∗ T ∗ refers to mean hitting time for the chain X ∗ on states {0, 1, . . . , n}
with transition rates
qi,i+1 = qi,i−1 = a(i).

The second idea is that our voter model can be used to define a less-
informative “two-party” model. Fix an initial set B of vertices, and group
the opinions of the individuals in B into one political party (“Blues”) and
group the remaining opinions into a second party (“Reds”). Let NtB be
the number of Blues at time t and let C B ≤ C be the first time at which
everyone belongs to the same party. Then
P (Nt+dtB = NtB + 1 | configuration at time t)
= P (Nt+dtB = NtB − 1 | configuration at time t)
= ((number of edges linking Blue and Red vertices at time t)/r) dt.    (14.12)
Cases (a) and (b) now use Lemma 14.10 with different comparison chains.
For (a), while both parties coexist, the number of edges being counted in
(14.12) is at least s. To see this, fix two vertices v, x of different parties,
and consider (c.f. Chapter 6 yyy) a collection of s edge-disjoint paths from
v to x. Each path must contain at least one edge linking Blue to Red. Thus
the quantity (14.12) is at least (s/r) dt. If that quantity were (1/2) dt then NtB
would be a continuous-time random walk on {0, . . . , n} and the quantity EC B
would be the mean time, starting at |B|, for simple random walk to hit 0 or
n, which by Chapter 5 yyy we know equals |B|(n − |B|). So using Lemma
14.10
EC B ≤ (r/(2s)) |B|(n − |B|) ≤ rn²/(8s).    (14.13)
CHAPTER 14. INTERACTING PARTICLES ON FINITE GRAPHS (MARCH 10, 1994)

For (b), use (14.11) to see that the quantity (14.12) must be at least
NtB (n − NtB )/(nτc ) dt. Consider for comparison the chain on {0, . . . , n} with
transition rates qi,i+1 = qi,i−1 = i(n − i)/n. For this chain
Ei∗ T{0,n} = Σ_{j=1}^{n−1} Ei∗ (time spent in j before T{0,n} )
           = Σ_{j=1}^{n−1} (1/2) mi (j) / (j(n − j)/n)

where mi (j) is the mean occupation time for simple symmetric random walk
and the second term is the speed-up factor for the comparison chain under
consideration. Using the formula for mi (j) from Chapter 5 yyy,
Ei∗ T{0,n} = i Σ_{j=i}^{n−1} (1/j) + (n − i) Σ_{j=1}^{i−1} (1/(n − j)) ≤ n log 2.

So using Lemma 14.10


EC B ≤ τc n log 2. (14.14)
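The closed form for Ei∗ T{0,n} used above is easy to check mechanically. The following sketch (ours, not part of the text; plain Python with exact rational arithmetic) verifies that the stated formula satisfies the one-jump recurrence for the comparison chain with rates qi,i±1 = i(n − i)/n, and that it never exceeds n log 2:

```python
from fractions import Fraction
from math import log

def hitting(n, i):
    # closed form from the text for E*_i T_{0,n}
    if i == 0 or i == n:
        return Fraction(0)
    return (i * sum(Fraction(1, j) for j in range(i, n))
            + (n - i) * sum(Fraction(1, n - j) for j in range(1, i)))

for n in (2, 3, 10, 40):
    for i in range(1, n):
        # one-jump recurrence: the total jump rate at i is 2i(n-i)/n, and
        # the chain then moves to i-1 or i+1 with probability 1/2 each
        expected = (Fraction(n, 2 * i * (n - i))
                    + (hitting(n, i - 1) + hitting(n, i + 1)) / 2)
        assert hitting(n, i) == expected
        assert float(hitting(n, i)) <= n * log(2)
```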
Finally, imagine choosing B at random by letting each individual ini-
tially be Blue or Red with probability 1/2 each, independently for different
vertices. Then by considering some two individuals with different opinions
at time t,
P (C B > t) ≥ (1/2) P (C > t).
Integrating over t gives EC ≤ 2EC B . But EC B ≤ maxB EC B , so the
Proposition follows from (14.13) and (14.14).

14.3.2 A bound using the coalescing random walk model


The following result bounds the mean coalescing time in terms of mean
hitting times of a single random walk.

Proposition 14.11 EC ≤ e(log n + 2) maxi,j Ei Tj .

Proof. We can construct the coalescing random walk process in two steps.
Order the vertices arbitrarily as i1 , . . . , in . First let the n particles perform
independent random walks for ever, with the particles starting at i, j first
meeting at time Mi,j , say. Then when two particles meet, let them cluster
and follow the future path of the lower-labeled particle. Similarly, when

two clusters meet, let them cluster and follow the future path of the lowest-
labeled particle in the combined cluster. Using this construction, we see

Ccrw ≤ maxj Mi1 ,j .    (14.15)

Now let m∗ ≡ maxi,j EMi,j . Using subexponentiality as in Chapter 2 section
yyy,
P (Mi,j > t) ≤ exp(−⌊t/(em∗ )⌋).    (14.16)
and so
EC = ∫_0^∞ P (C > t) dt
   ≤ ∫_0^∞ min(1, Σj P (Mi1 ,j > t)) dt    by (14.15)
   ≤ ∫_0^∞ min(1, ne exp(−t/(em∗ ))) dt    by (14.16)
   = em∗ (2 + log n)

where the final equality is the calculus fact


∫_0^∞ min(1, Ae−at ) dt = a−1 (1 + log A),    A ≥ 1.

The result now follows from Proposition 14.5.
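Proposition 14.11 lends itself to a quick Monte Carlo sanity check (ours, not in the original text). The sketch below simulates coalescing random walks on the n-cycle, where the clusters behave as independent rate-1 walkers that merge on meeting, and compares the empirical mean coalescing time to the bound; the choice of graph and parameters is illustrative only.

```python
import math
import random

def coalescing_time_cycle(n, rng):
    """One realization of the coalescing random walk on the n-cycle,
    started with a particle on every vertex (continuous time)."""
    clusters = list(range(n))
    t = 0.0
    while len(clusters) > 1:
        k = len(clusters)
        t += rng.expovariate(k)            # next jump among k rate-1 clusters
        i = rng.randrange(k)
        clusters[i] = (clusters[i] + rng.choice((-1, 1))) % n
        clusters = list(set(clusters))     # clusters on the same vertex merge
    return t

rng = random.Random(1)
n = 12
mean_C = sum(coalescing_time_cycle(n, rng) for _ in range(200)) / 200
# On the n-cycle, max_{i,j} E_i T_j = floor(n/2)*ceil(n/2) for the
# continuized walk, so Proposition 14.11 gives the bound below.
bound = math.e * (math.log(n) + 2) * (n // 2) * ((n + 1) // 2)
assert 0 < mean_C < bound
```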

14.3.3 Conjectures and examples


The complete graph. On the complete graph, the number Kt of clusters at
time t in the coalescing random walk model is itself the continuous-time
chain with transition rates

qk,k−1 = k(k − 1)/(n − 1); n ≥ k ≥ 2.

Since Ccrw is the time taken for Kt to reach state 1,


EC = Σ_{k=2}^{n} (n − 1)/(k(k − 1)) = (n − 1)²/n ∼ n.
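The sum telescopes, since 1/(k(k − 1)) = 1/(k − 1) − 1/k; the following one-line check (ours) confirms the closed form exactly:

```python
from fractions import Fraction

n = 30
# 1/(k(k-1)) = 1/(k-1) - 1/k, so the sum telescopes to (n-1)(1 - 1/n)
EC = sum(Fraction(n - 1, k * (k - 1)) for k in range(2, n + 1))
assert EC == Fraction((n - 1) ** 2, n)
```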

Recall from Chapter 7 yyy that in a vertex-transitive graph with τ2 /τ0


small, the first hitting time to a typical vertex has approximately exponential
distribution with mean τ0 . Similarly, the meeting time Mi,j for typical

i, j has approximately exponential distribution with mean τ0 /2. It seems


intuitively clear that, for fixed small k, when the number of clusters first
reaches k these clusters should be approximately uniformly distributed, so
that the mean further time until one of the k(k − 1)/2 pairs coalesces should
be about τ0 /(k(k − 1)). Repeating the analysis of the complete graph suggests

Open Problem 14.12 Prove that for a sequence of vertex-transitive graphs


with τ2 /τ0 → 0, we have EC ∼ τ0 .

In the general setting, there is good reason to believe that the log term in
Proposition 14.11 can be removed.

Open Problem 14.13 Prove there exists an absolute constant K such that
on any graph
EC ≤ K max Ev Tw .
v,w

The assertion of Open Problem 14.12 in the case of the torus Z^d_m for

d ≥ 2 was proved by Cox [103]. A detailed outline is given in [132] Chapter


10b, so we will not repeat it here, but see the remark in section 14.3.5 below.
xxx discuss d = 1?

14.3.4 Voter model with new opinions


For a simple variation of the voter model, fix a parameter 0 < λ < ∞
and suppose that each individual independently decides at rate λ (i.e. with
chance λdt in each time interval [t, t + dt]) to adopt a new opinion, not
previously held by anyone. For this process we may take as state space
the set of partitions A = {A1 , A2 , . . .} of the vertex-set of the underlying
graph G, where two individuals have the same opinion iff they are in the
same component A of A. The duality relationship holds with the following
modification. In the dual process of coalescing random walks, each cluster
“dies” at rate λ. Thus in the dual process run forever, each “death” of a
cluster involves particles started at some set A of vertices, and this partition
A = {Ai } of vertices into components is (by duality) distributed as the
stationary distribution of the voter model with new opinions. This is the
unique stationary distribution, even though (e.g. on the n-cycle) the Markov
chain may not be irreducible because of the existence of transient states.
The time to approach stationarity in this model is controlled by the time
C̃ for the dual process to die out completely. Clearly E C̃ ≤ EC +1/λ, where
C is the coalescing time discussed in previous sections, and we do not have
anything new to say beyond what is implied by previous results. Instead,

we study properties of the stationary distribution A = {Ai }. A natural


parameter is the chance, γ say, that two random individuals have the same
opinion, i.e.
γ ≡ E Σi |Ai |²/n².    (14.17)

Lemma 14.14
γ = 2EE/(λrn²) + 1/n,
where E is the number of edges with endpoints in different components, under
the stationary distribution.

Proof. Run the stationary process, and let A(t) and E(t) be the partition
and the number of edges linking distinct components, at time t, and let
S(t) = Σi |Ai (t)|². Then

E(S(t + dt) − S(t) | configuration at time t)/dt
= (4/r) E(t) + 2λ Σi |Ai (t)|(1 − |Ai (t)|).    (14.18)
The first term arises from the “voter” dynamics. If an opinion change in-
volves an edge linking components of sizes a and b, then the change in S
has expectation

((a + 1)² + (b − 1)² + (a − 1)² + (b + 1)²)/2 − (a² + b²) = 2
and for each of the E(t) edges linking distinct components, opinion changes
occur at rate 2/r. The second term arises from new opinions. A new opinion
occurs in a component of size a at rate λa, and the resulting change in S is

(a − 1)² + 1² − a² = 2(1 − a).

Stationarity implies that the expectation of (14.18) equals zero, and so

(4/r) EE = 2λ E Σi |Ai |(|Ai | − 1) = 2λ(n²γ − n)

and the lemma follows.


Corollary 14.15 (1 + λτc )/(1 + λnτc ) ≤ γ ≤ (λ + 1)/(λn).

Proof. Clearly EE is at most the total number of edges, nr/2, so the upper
bound follows from the lemma. For the lower bound, (14.11) implies

E ≥ r Σi |Ai |(n − |Ai |)/(2nτc )

and hence

EE ≥ (r/(2nτc ))(n² − n²γ)
and the bound follows from the lemma after brief manipulation.
We now consider bounds on γ obtainable by working with the dual pro-
cess. Consider the meeting time M of two independent random walks started
with the stationary distribution. Then by duality (xxx explain)

γ = P (M < ξ(2λ) )

where ξ(2λ) denotes a random variable with exponential (2λ) distribution in-
dependent of the random walks. Now M is the hitting time of the stationary
“product chain” (i.e. two independent continuous-time random walks) on
the diagonal A = {(v, v)}, so by Chapter 3 yyy M has completely monotone
distribution, and we shall use properties of complete monotonicity to get
Corollary 14.16

1/(1 + 2λEM ) ≤ γ ≤ 1/(1 + 2λEM ) + τ2 /EM.
Proof. We can write M =d Rξ(1) (equality in distribution), where ξ(1) has
exponential(1) distribution and R is independent of ξ(1) . Then

γ = P (Rξ(1) < ξ(2λ) )
  = E P (Rξ(1) < ξ(2λ) | R)
  = E [1/(1 + 2λR)]
  ≥ 1/(1 + 2λER)    by Jensen’s inequality
  = 1/(1 + 2λEM ).
For the upper bound, apply Chapter 3 yyy to the product chain to obtain

P (M > t) ≥ exp(−t/EM ) − τ2 /EM



(recall that τ2 is the same for the product chain as for the underlying random
walk). So
1 − γ = P (M ≥ ξ(2λ) )
      = ∫_0^∞ P (M ≥ t) 2λe−2λt dt
      ≥ 2λEM/(1 + 2λEM ) − τ2 /EM

and the upper bound follows after rearrangement.
Note that on a vertex-transitive graph Proposition 14.5 implies EM =
τ0 /2. So on a sequence of vertex-transitive graphs with τ2 /τ0 → 0 and with
λτ0 → θ, say, Corollary 14.16 implies γ → 1/(1 + θ). But in this setting we can
say much more, as the next section will show.

14.3.5 Large component sizes in the voter model with new


opinions
xxx discuss coalescent, GEM and population genetics.
xxx genetics already implicit in xxx
Fix 0 < θ < ∞. Take independent random variables (ξi ) with distribution

P (ξ > x) = (1 − x)^θ ,    0 < x < 1

and define

(X1(θ) , X2(θ) , X3(θ) , . . .) = (ξ1 , (1 − ξ1 )ξ2 , (1 − ξ1 )(1 − ξ2 )ξ3 , . . .)

so that Σi Xi(θ) = 1.
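This is the familiar stick-breaking (GEM) construction. A small sketch of ours (not from the text): since P (ξ > x) = (1 − x)^θ, inversion gives ξ = 1 − U^(1/θ) for uniform U, and the weights can be sampled and checked directly.

```python
import random

def gem_weights(theta, rng, tol=1e-12):
    """Stick-breaking sample of (X1, X2, ...): each xi_i has
    P(xi > x) = (1 - x)^theta, sampled by inversion as 1 - U**(1/theta)."""
    remaining, weights = 1.0, []
    while remaining > tol:
        xi = 1.0 - rng.random() ** (1.0 / theta)
        weights.append(remaining * xi)
        remaining *= 1.0 - xi
    return weights

rng = random.Random(0)
x = gem_weights(2.0, rng)
assert abs(sum(x) - 1.0) < 1e-9        # the weights sum to 1
# E X1 = E xi = 1/(1 + theta); a crude Monte Carlo check at theta = 2
m = sum(gem_weights(2.0, rng)[0] for _ in range(20000)) / 20000
assert abs(m - 1.0 / 3.0) < 0.02
```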
Proposition 14.17 Consider a sequence of vertex-transitive graphs for which
τ2 /τ0 → 0. Consider the stationary distribution A of the voter model with
new opinions, presented in size-biased order. If λτ0 → θ then
(|A1 |/n, . . . , |Ak |/n) →d (X1(θ) , . . . , Xk(θ) )    for all fixed k.
xxx proof
Remark. The same argument goes halfway to proving Open Problem
14.12, by showing
Corollary 14.18 Consider a sequence of vertex-transitive graphs for which
τ2 /τ0 → 0. Let C (k) be the coalescing time for k walks started at independent
uniform positions. Then, for fixed k, EC (k) ∼ τ0 (1 − k −1 ).
xxx argument similar (?) to part of the proof in Cox [103] for the torus.

14.3.6 Number of components in the voter model with new


opinions
xxx τc result

14.4 The antivoter model


Recall from section 14.3 the definition of the voter model on an r-regular
n-vertex graph. We now change this in two ways. First, we suppose there
are only two different opinions, which it is convenient to call ±1. Second,
the evolution rule is

For each person i and each time interval [t, t + dt], with chance
dt the person chooses uniformly at random a neighbor (j, say)
and changes (if necessary) their opinion to the opposite of the
opinion of person j.

The essential difference from the voter model is that opinions don’t disap-
pear. Writing ηv (t) for the opinion of individual v at time t, the process
η(t) = (ηv (t), v ∈ G) is a continuous-time Markov chain on state-space
{−1, 1}G . So, provided this chain is irreducible, there is a unique stationary
distribution (ηv , v ∈ G) for the antivoter model.
This model on infinite lattices was studied in the “interacting particle
systems” literature [172, 231], and again the key idea is duality. In this
model the dual process consists of annihilating random walks. We will not
go into details about the duality relation, beyond the following definition we
need later. For vertices v, w, consider independent continuous-time random
walks started at v and at w. We have previously studied Mv,w , the time
at which the two walks first meet, but now we define Nv,w to be the total
number of jumps made by the two walks, up to and including the time Mv,w .
Set Nv,v = 0.
Donnelly and Welsh [129] considered our setting of a finite graph, and
showed that Proposition 14.19 is a simple consequence of the duality relation.

Proposition 14.19 The antivoter process has a unique stationary distri-


bution (ηv ), which satisfies
(i) Eηv = 0
(ii) c(v, w) ≡ Eηv ηw = P (Nv,w is even ) − P (Nv,w is odd ).
If G is neither bipartite nor the n-cycle, then the set of all 2^n − 2 non-
unanimous configurations is irreducible, and the support of the stationary
distribution is that set.

In particular, defining

S ≡ Σv ηv
so that S or −S is the “margin of victory” in an election, we have ES = 0
and

var S = Σv Σw c(v, w).    (14.19)
On a bipartite graph with bipartition (A, Ac ) the stationary distribution
is

P (ηv = 1∀v ∈ A, ηv = −1∀v ∈ Ac ) = P (ηv = −1∀v ∈ A, ηv = 1∀v ∈ Ac ) = 1/2

and c(v, w) = −1 for each edge. Otherwise c(v, w) > −1 for every edge.
The antivoter process is in general a non-reversible Markov chain, be-
cause it can transition from a configuration in which v has the same opinion
as all its neighbors to the configuration where v has the opposite opinion,
but the reverse transition is impossible. Nevertheless we could use duality
to discuss convergence time. But, following [129], the spatial structure of
the stationary distribution is a more novel and hence more interesting ques-
tion. Intuitively we expect neighboring vertices to be negatively correlated
and the variance of S to be smaller than n (the variance if opinions were
independent). In the case of the complete graph on n vertices, Nv,w has (for
w ≠ v) the geometric distribution

P (Nv,w > m) = (1 − 1/(n − 1))^m ,    m ≥ 0

from which we calculate c(v, w) = −1/(2n − 3) and var S = n(n − 2)/(2n − 3) < n/2.
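This calculation can be checked numerically (our sketch, not part of the text): sum E(−1)^N directly for the geometric law and assemble var S from (14.19).

```python
n = 25
p = 1.0 / (n - 1)
# E(-1)^N for the geometric law P(N = m) = (1-p)**(m-1) * p, m >= 1,
# summed directly (the tail beyond 3000 terms is negligible)
c = sum((-1) ** m * (1 - p) ** (m - 1) * p for m in range(1, 3000))
assert abs(c - (-1.0 / (2 * n - 3))) < 1e-10
# var S = sum_{v,w} c(v,w): n diagonal terms c(v,v) = 1, the rest equal c
var_S = n + n * (n - 1) * c
assert abs(var_S - n * (n - 2) / (2 * n - 3)) < 1e-6
assert var_S < n / 2
```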
We next investigate var S in general.

14.4.1 Variances in the antivoter model


Write ξ = (ξv ) for a configuration of the antivoter process and write
S(ξ) = Σv ξv

a(ξ) = number of edges (v, w) with ξv = ξw = 1


b(ξ) = number of edges (v, w) with ξv = ξw = −1.
A simple counting argument gives

2(a(ξ) − b(ξ)) = rS(ξ). (14.20)



Lemma 14.20 var S = (2/r) E(a(η) + b(η)), where η is the stationary distri-
bution.

Proof. Writing (ηt ) for the stationary process and dSt = S(ηt+dt ) − S(ηt ),
we have

P (dSt = +2|ηt ) = b(ηt )dt


P (dSt = −2|ηt ) = a(ηt )dt

and so

0 = ES²(ηt+dt ) − ES²(ηt )    by stationarity
  = 2E[S(ηt ) dSt ] + E[(dSt )²]
  = 4E[S(ηt )(b(ηt ) − a(ηt ))] dt + 4E[a(ηt ) + b(ηt )] dt
  = −2r ES²(ηt ) dt + 4E[a(ηt ) + b(ηt )] dt    by (14.20)

establishing the Lemma.


Since the total number of edges is nr/2, Lemma 14.20 gives the upper
bound which follows, and the lower bound is also clear.

Corollary 14.21 Let κ = κ(G) be the largest integer such that, for any
subset A of vertices, the number of edges with both ends in A or both ends
in Ac is at least κ. Then

2κ/r ≤ var S ≤ n.
Here κ is a natural measure of “non-bipartiteness” of G. We now show how
to improve the upper bound by exploiting duality. One might expect some
better upper bound for “almost-bipartite” graphs, but Examples 14.27 and
14.28 indicate this may be difficult.

Proposition 14.22 var S < n/2.

Proof. Take two independent stationary continuous-time random walks on
the underlying graph G, and let (Xt(1) , Xt(2) ; t = . . . , −1, 0, 1, 2, . . .) be the
jump chain, i.e. at each time we choose at random one component to make
a step of the random walk on the graph. Say an “event” happens at t if
Xt(1) = Xt(2) , and consider the inter-event time distribution L:

P (L = l) = P (min{t > 0 : Xt(1) = Xt(2) } = l | X0(1) = X0(2) ).

In the special case where G is vertex-transitive the events form a renewal


process, but we use only stationarity properties (c.f. Chapter 2 yyy) which
hold in the general case. Write
T = min{t ≥ 0 : Xt(1) = Xt(2) }

where the stationary chain is used. Then

Pv,w (T = t) ≡ P (T = t | X0(1) = v, X0(2) = w) = P (Nv,w = t)

and so by (14.19) and Proposition 14.19(ii),


var S = Σv Σw (Pv,w (T is even) − Pv,w (T is odd))
      = n² (P (T is even) − P (T is odd)).

If successive events occur at times t0 and t1 , then

|{s : t0 < s ≤ t1 : t1 − s is even}| − |{s : t0 < s ≤ t1 : t1 − s is odd}|
  = 0 if |t1 − t0 | is even
  = 1 if |t1 − t0 | is odd

and an ergodic argument gives

P (T is even) − P (T is odd) = P (L is odd)/EL.

But EL = 1/P (event) = n, so we have established

Lemma 14.23 n−1 var S = P (L is odd).

Now consider
T − = min{t ≥ 0 : X−t(1) = X−t(2) }.
If successive events occur at t0 and t1 , then there are t1 − t0 − 1 times s with
t0 < s < t1 , and another ergodic argument shows
P (T + T − = l) = (l − 1)P (L = l)/EL,    l ≥ 2.
So
n−1 (P (L is even) − P (L is odd)) = (1/EL) Σ_{l≥2} (−1)^l P (L = l)    since EL = n
                                  = Σ_{l≥2} ((−1)^l /(l − 1)) P (T + T − = l).    (14.21)

Now let φ(z) be the generating function of a distribution on {1, 2, 3, . . .} and


let Z, Z − be independent random variables with that distribution. Then
Σ_{l≥2} ((−1)^l /(l − 1)) P (Z + Z − = l) = ∫_{−1}^0 (φ²(z)/z²) dz > 0.    (14.22)

Conditional on (X0(1) , X0(2) ) = (v, w) with w ≠ v, we have that T and T − are
independent and identically distributed. So the sum in (14.21) is positive,
implying P (L is odd) < 1/2, so the Proposition follows from the Lemma.
Implicit in the proof are a corollary and an open problem. The open
problem is to show that var S is in fact maximized on the complete graph.
This might perhaps be provable by sharpening the inequality in (14.22).
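Identity (14.22) can be sanity-checked numerically for a particular distribution (a check of ours, not from the text). Taking Z geometric on {1, 2, . . .} with parameter q, we have P (Z + Z − = l) = (l − 1)q²(1 − q)^(l−2) and φ(z) = qz/(1 − (1 − q)z), so φ²(z)/z² has no singularity at 0:

```python
q = 0.4
# left side of (14.22): sum_{l>=2} ((-1)^l/(l-1)) P(Z + Z' = l)
lhs = sum(((-1) ** l / (l - 1)) * (l - 1) * q * q * (1 - q) ** (l - 2)
          for l in range(2, 2000))
# right side: integrate phi(z)^2/z^2 = q^2/(1-(1-q)z)^2 over [-1, 0]
# by the midpoint rule
N = 100000
rhs = sum(q * q / (1 - (1 - q) * (-1.0 + (k + 0.5) / N)) ** 2
          for k in range(N)) / N
assert lhs > 0
assert abs(lhs - rhs) < 1e-6
```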

Corollary 14.24 On an edge-transitive graph, write cedge = c(v, w) =
Eηv ηw for an arbitrary edge (v, w). Then

var S = n(1 + cedge )/2

and cedge < 0.

Proof. In an edge-transitive graph, conditioning on the first jump from (v, v)


gives
P (L is odd) = P (Nv,w is even)
for an edge (v, w). But P (Nv,w is even ) = (1 + cedge )/2 by Proposition
14.19(ii), so the result follows from Lemma 14.23 and Proposition 14.22.

14.4.2 Examples and Open Problems


In the case of the complete graph, the number of +1 opinions evolves as the
birth-and-death chain on states {1, 2, . . . , n − 1} with transition rates
i → i + 1    rate (n − i)(n − 1 − i)/(n(n − 1))
i → i − 1    rate i(i − 1)/(n(n − 1)).

From the explicit form of the stationary distribution we can deduce that
as n → ∞ the asymptotic distribution of S is Normal. As an exercise in
technique (see Notes) we ask

Open Problem 14.25 Find sufficient conditions on a sequence of graphs


which imply S has asymptotic Normal distribution.

Example 14.26 Distance-regular graphs.

On a distance-regular graph of diameter ∆, define 1 = f (0), f (1), . . . , f (∆)


by

f (i) = c(v, w) = P (Nv,w is even ) − P (Nv,w is odd ),    where d(v, w) = i.

Conditioning on the first step of the random walks,

f (i) = − (pi,i+1 f (i + 1) + pi,i f (i) + pi,i−1 f (i − 1)) , 1 ≤ i ≤ ∆ (14.23)

where (c.f. Chapter 7 yyy) the pi,j are the transition probabilities for the
birth-and-death chain associated with the discrete-time random walk. In
principle we can solve these equations to determine f (1) = cedge . Note
that the bipartite case is the case where pi,i ≡ 0, which is the case where
f (i) ≡ (−1)i and cedge = −1. A simple example of a non-bipartite distance-
regular graph is the “2-subsets of a d-set” example (Chapter 7 yyy) for d ≥ 4.
Here ∆ = 2 and

p1,0 = 1/(2(d − 2)),    p1,1 = (d − 2)/(2(d − 2)),    p1,2 = (d − 3)/(2(d − 2)),
p2,1 = 4/(2(d − 2)),    p2,2 = (2d − 8)/(2(d − 2)).

Solving equations (14.23) gives cedge = −1/(3d − 7).
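For this example, equations (14.23) reduce to a 2×2 linear system, which the sketch below (ours, assuming the transition probabilities listed above) solves exactly:

```python
from fractions import Fraction

def c_edge(d):
    """Solve (14.23) for the 2-subsets-of-a-d-set graph (diameter 2),
    with f(0) = 1 and the p_{i,j} as above."""
    c = Fraction(2 * (d - 2))
    p10, p11, p12 = 1 / c, (d - 2) / c, (d - 3) / c
    p21, p22 = 4 / c, (2 * d - 8) / c
    # f(2) = -(p21 f(1) + p22 f(2))  =>  f(2) = -p21 f(1)/(1 + p22)
    # f(1) = -(p12 f(2) + p11 f(1) + p10)
    f1 = -p10 / (1 + p11 - p12 * p21 / (1 + p22))
    return f1

for d in range(4, 12):
    assert c_edge(d) == Fraction(-1, 3 * d - 7)
```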

Corollary 14.24 said that in an edge-transitive graph, cedge < 0. On


a vertex-transitive graph this need not be true for every edge, as the next
example shows.

Example 14.27 An almost bipartite vertex-transitive graph.

Consider the (m + 2)-regular graph on 2m vertices, made by taking m-cycles


(v1 , . . . , vm ) and (w1 , . . . , wm ) and adding all edges (vi , wj ) between the two
“classes”. One might guess that, under the stationary distribution, almost
all individuals in a class would have the same opinion, different for the two
classes. But in fact the tendency for agreement between individuals in the
same class is bounded: as m → ∞
c(vi , wj ) → −1/9
c(vi , vj ) → 1/9,    j ≠ i.    (14.24)

To prove this, consider two independent continuous-time random walks,


started from opposite classes. Let N be the number of jumps before meeting
and let M ≥ 1 be the number of jumps before they are again in opposite
classes. Then
P (M is odd ) = 4/m + O(m−2 );    P (N < M ) = 1/m + O(m−2 ).
So writing M1 = M, M2 , M3 , . . . for the cumulative numbers of jumps each
time the two walks are in opposite components, and writing

Q ≡ min{j : Mj is odd},

we have
P (walks meet before Q) = 1/5 + O(m−1 ).
Writing Q1 = Q, Q2 , Q3 , . . . for the successive j’s at which Mj changes parity,
and
L ≡ max{k : MQk < meeting time}
for the number of parity changes before meeting,
P (L = l) = (1/5)(4/5)^l + O(m−1 ),    l ≥ 0
So P (Nvi ,wj is odd) = P (L is even) → 5/9 and (14.24) follows easily.

Example 14.28 Another almost-bipartite graph.

Consider the torus Z^d_m with d ≥ 2 and with even m ≥ 4, and make the

graph non-bipartite by moving two edges as shown.

[Figure omitted: two edges of the torus are rerouted, creating an odd cycle.]
Let m → ∞ and consider the covariance c(vm , wm ) across edges (vm , wm )


whose distance from the modified edges tends to infinity. One might suspect

that the modification had only “local” effect, in that c(vm , wm ) → −1. In
fact,
c(vm , wm ) → −1, d = 2
→ β(d) > −1, d ≥ 3.
We don’t give details, but the key observation is that in d ≥ 3 there is a
bounded-below chance that independent random walks started from vm and
wm will traverse one of the modified edges before meeting.

14.5 The interchange process


xxx notation: X̃ for process or underlying RW?
Fix a graph on n vertices. Given n distinguishable particles, there are n!
“configurations” with one particle at each vertex. The interchange process
is the following continuous-time reversible Markov chain on configurations.
On each edge there is a Poisson, rate 1, process of “switch times”,
at which times the particles at the two ends of the edge are
interchanged.
The stationary distribution is uniform on the n! configurations. We want to
study the time taken to approach the uniform distribution, as measured by
the parameters τ2 and τ1 .
As with the voter model, there is an induced process obtained by declar-
ing some subset of particles to be “visible”, regarding the visible particles
as indistinguishable, and ignoring the invisible particles. Interchanging two
visible particles has no effect, so the dynamics of the induced process are as
follows.
On each edge there is a Poisson, rate 1, process of “switch times”.
At a switch time, if one endpoint is unoccupied and the other
endpoint is occupied by a (visible) particle, then the particle
moves to the other endpoint.
This is the finite analog of the exclusion process studied in the interacting
particle systems literature. But in the finite setting, the interchange process
seems more fundamental.
If we follow an individual particle, we see a certain continuous-time
Markov chain X̃t , say, with transition rate 1 along each edge. In the termi-
nology of Chapter 3 yyy this is the fluid model random walk, rather than
the usual continuized random walk. Write τ̃2 for the relaxation time of X̃.
The contraction principle (Chapter 4 yyy) implies τ2 ≥ τ̃2 .

Open Problem 14.29 Does τ2 = τ̃2 in general?

If the answer is “yes”, then the general bound of Chapter 4 yyy will give
τ1 ≤ τ̃2 (1 + (1/2) log n!) = O(τ̃2 n log n)
but the following bound is typically better.

Proposition 14.30 τ1 ≤ (2 + log n)e maxv,w Ẽv T̃w .

Proof. We use a coupling argument. Start two versions of the interchange


process in arbitrary initial configurations. Set up independent Poisson pro-
cesses N e and N ∗e for each edge e. Say edge e is special at time t if the
particles at the end-vertices in process 1 are the same two particles as in
process 2, but in transposed position. The evolution rule for the coupled
processes is

Use the same Poisson process N e to define simultaneous switch


times for both interchange processes, except for special edges
where we use N e for process 1 and N ∗e for process 2.

Clearly, once an individual particle is matched (i.e. at the same vertex in


both processes), it remains matched thereafter. And if we watch the process
(Xt , Yt ) recording the positions of particle i in each process, it is easy to
check this process is the same as watching two independent copies of the
continuous-time random walk, run until they meet, at time Ui , say. Thus
maxi Ui is a coupling time and the coupling inequality (14.5) implies
d̄(t) ≤ P (maxi Ui > t).

Now Ui is distributed as Mv(i),w(i) , where v(i) and w(i) are the initial po-
sitions of particle i in the two versions and where Mv,w denotes meeting
time for independent copies of the underlying random walk X̃t . Writing
m∗ = maxv,w EMv,w , we have by subexponentiality (as at (14.16))
P (Mv,w > t) ≤ exp(1 − t/(em∗ ))
and so
d̄(t) ≤ n exp(1 − t/(em∗ )).
This leads to τ1 ≤ (2 + log n)em∗ and the result follows from Proposition
14.5.

14.5.1 Card-shuffling interpretation


Taking the underlying graph to be the complete graph on n vertices, the
discrete-time jump chain of the interchange process is just the “card shuf-
fling by random transpositions” model from Chapter 7 yyy. On any graph
G, the jump chain can be viewed as a card-shuffling model, but note that
parameters τ are multiplied by |E| (the number of edges in G) when passing
from the interchange process to the card-shuffling model. On the complete
graph we have maxv,w Ẽv T̃w = Θ(1) and |E| = Θ(n²), and so Proposition
14.30 gives the bound τ1 = O(n² log n) for card shuffling by random trans-
positions, which is crude in view of the exact result τ1 = Θ(n log n). In
contrast, consider the n-cycle, where maxv,w Ẽv T̃w = Θ(n²) and |E| = n.
Here the jump process is the “card shuffling by random adjacent transposi-
tions” model from Chapter 7 yyy. In this model, Proposition 14.30 gives the
bound τ1 = O(n³ log n) which as mentioned in Chapter 7 yyy is the correct
order of magnitude.
Diaconis and Saloff-Coste [117] studied the card-shuffling model as an ap-
plication of more sophisticated techniques of comparison of Dirichlet forms.
xxx talk about their results.

14.6 Other interacting particle models


As mentioned at the start of the chapter, the models discussed in sections
14.3 - 14.5 are special in that their behavior relates to the behavior of pro-
cesses built up from independent random walks on the underlying graph. In
other models this is not necessarily true, and the results in this book have
little application.
xxx mention Ising model and contact process.

14.6.1 Product-form stationary distributions


Consider a continuous-time particle process whose state space is the collec-
tion of subsets of vertices of a finite graph (representing the subset of vertices
occupied by particles), and where only one state can change occupancy at
a time. The simplest stationary distribution would be of the form
each vertex v is occupied independently with probability θ/(1 + θ)
(14.25)
where 0 < θ < ∞ is a parameter. By considering the detailed balance
equations (Chapter 3 yyy), such a process will be reversible with stationary
distribution (14.25) iff its transition rates satisfy

For configurations x0 , x1 which coincide except that vertex v is
unoccupied in x0 and occupied in x1 , we have q(x0 , x1 )/q(x1 , x0 ) = θ.

There are many ways to set up such transition rates. Here is one way,
observed by Neuhauser and Sudbury [269]. For each edge (w, v) at time t
with w occupied,
if v is occupied at time t, then with chance dt it becomes unoccupied by
time t + dt
if v is unoccupied at time t, then with chance θdt it becomes occupied
by time t + dt.
If we exclude the empty configuration (which cannot be reached from other
configurations) the state space is irreducible and the stationary distribution
is given by (14.25) conditioned on being non-empty.
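On a small graph the Neuhauser–Sudbury rates and the product-form stationary distribution can be verified exhaustively. The sketch below (ours; the 3-vertex path is an arbitrary illustration) builds the rates and checks that the measure θ^|s| on non-empty occupied sets has zero net probability flow:

```python
import itertools

theta = 0.7
V = range(3)
edges = [(0, 1), (1, 2)]                  # a 3-vertex path, for illustration
states = [s for r in range(1, 4) for s in itertools.combinations(V, r)]

def rates(s):
    """Transition rates out of occupied set s, per the rule above."""
    out = {}
    for v in V:
        k = sum(1 for a, b in edges
                if (a == v and b in s) or (b == v and a in s))
        if k == 0:
            continue                      # v has no occupied neighbor
        if v in s:                        # becomes unoccupied: rate 1 per edge
            t = tuple(u for u in s if u != v)
            out[t] = out.get(t, 0.0) + k
        else:                             # becomes occupied: rate theta per edge
            t = tuple(sorted(s + (v,)))
            out[t] = out.get(t, 0.0) + theta * k
    return out

pi = {s: theta ** len(s) for s in states}  # claimed stationary measure
net = {s: 0.0 for s in states}
for s in states:
    for t, q in rates(s).items():
        net[s] -= pi[s] * q               # probability flow out of s ...
        net[t] += pi[s] * q               # ... is flow into t
assert all(abs(x) < 1e-12 for x in net.values())
```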
Convergence times for this model have not been studied, so we ask

Open Problem 14.31 Give bounds on the relaxation time τ2 in this model.

14.6.2 Gaussian families of occupation measures


We mentioned in Chapter 3 yyy that, in the setting of a finite irreducible
reversible chain (Xt ), the fundamental matrix Z has the property

πi Zij is symmetric and positive-definite .

So by a classical result (e.g. [145] Theorem 3.6.4) there exists a mean-zero


Gaussian family (γi ) such that

Eγi γj = πi Zij for all i, j. (14.26)

What do such Gaussian random variables represent? It turns out there is a


simple interpretation involving occupation measures of “charged particles”.
Take two independent copies (Xt+ : −∞ < t < ∞) and (Xt− : −∞ < t < ∞)
of the stationary chain, in continuous time for simplicity. For fixed u > 0
consider the random variables
Z 0 
(u) 1 
γi ≡ 1(X + =i) − 1(X − =i) dt.
2 −u t t

Picture one particle with charge +1/2 and the other particle with charge
−1/2, and then γi(u) has units “charge × time”. Clearly Eγi(u) = 0 and it is

easy to calculate
Eγi(u) γj(u) = (1/2) ∫_{−u}^0 ∫_{−u}^0 (P (Xs = i, Xt = j) − πi πj ) ds dt
           = (1/2) ∫_{−u}^0 ∫_{−u}^0 πi (P (Xt = j | Xs = i) − πj ) ds dt
           = u πi ∫_0^u (1 − r/u)(pij (r) − πj ) dr

and hence

u−1 Eγi(u) γj(u) → πi Zij as u → ∞.    (14.27)
The central limit theorem for Markov chains (Chapter 2 yyy) implies that
(u)
the u → ∞ distributional limit of (u−1/2 γi ) is some mean-zero Gaussian
family (γi ), and so (14.27) identifies the limit as the family with covariances
(14.26).
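The defining property of (14.26), that (πi Zij ) is symmetric and positive semidefinite, is easy to verify numerically for a concrete chain. The sketch below (ours, using numpy; the 4-vertex graph is an arbitrary choice) computes the fundamental matrix of a continuized random walk and checks these properties:

```python
import numpy as np

# Continuized random walk on a small graph; check that the matrix
# (pi_i Z_ij) appearing in (14.26) is symmetric positive semidefinite.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], float)
deg = A.sum(axis=1)
P = A / deg[:, None]                     # discrete-time transition matrix
Q = P - np.eye(4)                        # generator of the continuized walk
pi = deg / deg.sum()                     # reversible stationary distribution
one_pi = np.outer(np.ones(4), pi)
# Z = int_0^inf (e^{Qt} - 1 pi) dt = (1 pi - Q)^{-1} - 1 pi
Z = np.linalg.inv(one_pi - Q) - one_pi
M = np.diag(pi) @ Z
assert np.allclose(M, M.T, atol=1e-12)              # symmetric
assert np.linalg.eigvalsh(M).min() > -1e-12         # positive semidefinite
assert np.allclose(Z @ np.ones(4), 0, atol=1e-10)   # Z has zero row sums
```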
As presented here the construction may seem an isolated curiosity, but
in fact it relates to deep ideas developed in the context of continuous-time-
and-space reversible Markov processes. In that context, the Dynkin iso-
morphism theorem relates continuity of local times to continuity of sample
paths of a certain Gaussian process. See [253] for a detailed account. And
various interesting Gaussian processes can be constructed via “charged par-
ticle” models – see [2] for a readable account of such constructions. Whether
these sophisticated ideas can be brought to bear upon the kinds of finite-
state problems in this book is a fascinating open problem.

14.7 Other coupling examples


Example 14.32 An m-particle process on the circle.

Fix m < K. Consider m indistinguishable balls distributed amongst K


boxes, at most one ball to a box, and picture the boxes arranged in a circle.
At each step, pick uniformly at random a box, say box i. If box i is occupied,
do nothing. Otherwise, pick uniformly at random a direction (clockwise or
counterclockwise), search from i in that direction until encountering a ball,
and move that ball to box i. This specifies a Markov chain on the (K choose m)
possible configurations of balls. The chain is reversible and the stationary
distribution is uniform. Can we estimate the “mixing time” parameters τ1
and τ2 ? Note that as K → ∞ there is a limit process involving m particles
on the continuous circle, so we seek bounds which do not depend on K.
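The dynamics of Example 14.32 are straightforward to simulate; the sketch below is illustrative (function and parameter names are my own, with example values K = 10, m = 3):

```python
import random

def step(occupied, K, rng=random):
    """One move of the m-particle chain on a circle of K boxes.

    `occupied` is a set of occupied box indices in {0,...,K-1}.
    A sketch of the rule described in Example 14.32.
    """
    i = rng.randrange(K)
    if i in occupied:
        return occupied                  # occupied box chosen: do nothing
    d = rng.choice((1, -1))              # clockwise or counterclockwise
    j = i
    while j not in occupied:             # search until a ball is encountered
        j = (j + d) % K
    return (occupied - {j}) | {i}        # move that ball to box i

K, m = 10, 3
rng = random.Random(0)
state = set(range(m))
for _ in range(1000):
    state = step(state, K, rng)
print(sorted(state))
```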
CHAPTER 14. INTERACTING PARTICLES ON FINITE GRAPHS (MARCH 10, 1994)

There is a simple-to-describe coupling, where for each of the two versions we pick at each time the same box and the same direction.
The coupling has the usual property (cf. the proof of Proposition 14.30) that the number of “matched” balls (i.e., balls in the same box in both processes) can only increase. But analyzing the coupling time seems very difficult. Cuéllar-Montoya [105] carries through a lengthy analysis to show that τ1 = O(m^{10}).
In the other direction, the bound
$$\tau_2 \ge \frac{m^3}{8\pi^2}$$
is easily established by applying the extremal characterization (Chapter 3 yyy) to the function
$$g(x) = \sum_{i=1}^{m} \sin(2\pi x_i/m)$$
where $x = (x_1, \ldots, x_m)$ denotes the configuration with occupied boxes $\{x_1, \ldots, x_m\}$. It is natural to conjecture τ2 = Θ(m^3) and τ1 = O(m^3 log m).
The next example, from Jerrum [196] (xxx cite final version), uses a
coupling whose construction is not quite obvious.

Example 14.33 Permutations and words.

Fix a finite alphabet A of size |A|. Fix m, and consider the set $A^m$ of “words” $x = (x_1, \ldots, x_m)$ with each $x_i \in A$. Consider the Markov chain on $A^m$ in which a step x → y is specified by the following two-stage procedure.
Stage 1. Pick a permutation σ of {1, 2, ..., m} uniformly at random from the set of permutations σ satisfying $x_{\sigma(i)} = x_i$ for all i.
Stage 2. Let $(c_j(\sigma);\ j \ge 1)$ be the cycles of σ. For each j, and independently as j varies, pick uniformly an element $\alpha_j$ of A, and define $y_i = \alpha_j$ for every $i \in c_j(\sigma)$.
Here is an alternative description. Write Π for the set of permutations of {1, ..., m}. Consider the bipartite graph on vertices $A^m \cup \Pi$ with edge-set $\{(x, \sigma) : x_{\sigma(i)} = x_i\ \forall i\}$. Then the chain is random walk on this bipartite graph, watched every second step when it is in $A^m$.
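The two-stage step can be sketched in code. A uniform permutation σ with $x_{\sigma(i)} = x_i$ for all i is exactly a product of independent uniform permutations of the letter classes {i : x_i = a}, so stage 1 reduces to shuffling within classes. The helper below is an illustrative sketch (names are my own):

```python
import random

def chain_step(x, alphabet, rng=random):
    """One step of the Example 14.33 chain on words (a sketch).

    Stage 1: a uniform permutation sigma with x[sigma[i]] == x[i] for all i,
    built as independent uniform shuffles within each letter class.
    Stage 2: recolour each cycle of sigma with a fresh uniform letter.
    """
    m = len(x)
    sigma = [None] * m
    classes = {}
    for i, a in enumerate(x):
        classes.setdefault(a, []).append(i)
    for idxs in classes.values():
        targets = idxs[:]
        rng.shuffle(targets)             # uniform bijection of this class
        for i, j in zip(idxs, targets):
            sigma[i] = j
    # walk the cycles of sigma, recolouring each with a uniform letter
    y, seen = [None] * m, [False] * m
    for i in range(m):
        if not seen[i]:
            a = rng.choice(alphabet)
            j = i
            while not seen[j]:
                seen[j] = True
                y[j] = a
                j = sigma[j]
    return y

rng = random.Random(1)
w = list("abbaab")
for _ in range(50):
    w = chain_step(w, "ab", rng)
print("".join(w))
```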
From the second description, it is clear that the stationary probabilities π(x) are proportional to the degree of x in the bipartite graph, giving
$$\pi(x) \propto \prod_a n_a(x)!$$
where $n_a(x) = |\{i : x_i = a\}|$. We shall use a coupling argument to establish the following bound on variation distance:
$$\bar{d}(t) \le m\left(1 - \frac{1}{|A|}\right)^{t} \tag{14.28}$$
implying that the variation threshold satisfies
$$\tau_1 \le 1 + \frac{1 + \log m}{-\log(1 - |A|^{-1})} \le 1 + (1 + \log m)\,|A|.$$
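The second inequality above rests on the elementary bound $-\log(1 - 1/|A|) \ge 1/|A|$ (from $\log(1-x) \le -x$); a quick numerical check, with illustrative values of m and |A|:

```python
import math

# Check the elementary bound used above: since -log(1 - 1/A) >= 1/A,
# (1 + log m) / (-log(1 - 1/A)) <= (1 + log m) * A.
for A in range(2, 50):
    assert 1.0 / (-math.log(1.0 - 1.0 / A)) <= A

# Example values (illustrative only): m = 100 words of length over |A| = 4
m, A = 100, 4
bound = 1 + (1 + math.log(m)) / (-math.log(1 - 1 / A))
print(bound, 1 + (1 + math.log(m)) * A)   # the first is the sharper bound
```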

The construction of the coupling depends on the following lemma, whose proof is deferred.

Lemma 14.34 Given finite sets $F^1, F^2$ we can construct (for u = 1, 2) a uniform random permutation $\sigma^u$ of $F^u$ with cycles $(C_j^u;\ j \ge 1)$, where the cycles are labeled such that
$$C_j^1 \cap F^1 \cap F^2 = C_j^2 \cap F^1 \cap F^2 \quad \text{for all } j.$$

We construct a step $(x^1, x^2) \to (Y^1, Y^2)$ of the coupled processes as follows. For each $a \in A$, set $F^{1,a} = \{i : x_i^1 = a\}$, $F^{2,a} = \{i : x_i^2 = a\}$. Take random permutations $\sigma^{1,a}, \sigma^{2,a}$ as in the lemma, with cycles $C_j^{1,a}, C_j^{2,a}$. Then $(\sigma^{1,a},\ a \in A)$ define a uniform random permutation $\sigma^1$ of {1, ..., m}, and similarly for $\sigma^2$. This completes stage 1. For stage 2, for each pair (a, j) pick a uniform random element $\alpha_j^a$ of A and set
$$Y_i^1 = \alpha_j^a \text{ for every } i \in C_j^{1,a}, \qquad Y_i^2 = \alpha_j^a \text{ for every } i \in C_j^{2,a}.$$
This specifies a Markov coupling. By construction,
if $x_i^1 = x_i^2$ then $Y_i^1 = Y_i^2$;
if $x_i^1 \ne x_i^2$ then $P(Y_i^1 = Y_i^2) = 1/|A|$.
So the coupled processes $(X^1(t), X^2(t))$ satisfy
$$P(X_i^1(t) \ne X_i^2(t)) = \left(1 - \frac{1}{|A|}\right)^t 1_{(X_i^1(0) \ne X_i^2(0))}.$$
In particular $P(X^1(t) \ne X^2(t)) \le m(1 - 1/|A|)^t$ and the coupling inequality (14.5) gives (14.28).
xxx proof of Lemma – tie up with earlier discussion.

14.7.1 Markov coupling may be inadequate


Recall the discussion of the coupling inequality in section 14.1.1. Given a Markov chain and states i, j, theory (e.g., [233] section 3.3) says there exists a maximal coupling $(X_t^{(i)}, X_t^{(j)})$ with a coupling time T for which the coupling inequality (14.5) holds with equality. But this need not be a Markov coupling, i.e., of form (14.6), as the next result implies. The point is that there exist fixed-degree expander graphs with τ2 = O(1) and so τ1 = O(log n), but whose girth (minimal cycle length) is Ω(log n). On such a graph, the upper bound on τ1 obtained by a Markov coupling argument would be Θ(ET), which the Proposition shows is $n^{\Omega(1)}$.

Proposition 14.35 Fix vertices i, j in an r-regular graph (r ≥ 3) with girth g. Let $(X_t^{(i)}, X_t^{(j)})$ be any Markov coupling of discrete-time random walks started at i and j. Then the coupling time T satisfies
$$ET \ge \frac{1 - (r-1)^{-d(i,j)/2}}{r-2}\,(r-1)^{\frac{g}{4}-\frac{1}{2}}.$$
Proof. We quote a simple lemma, whose proof is left to the reader.

Lemma 14.36 Let $\xi_1, \xi_2$ be (dependent) random variables with $P(\xi_u = 1) = \frac{r-1}{r}$, $P(\xi_u = -1) = \frac{1}{r}$. Then
$$E\theta^{\xi_1+\xi_2} \le \frac{r-1}{r}\,\theta^{2} + \frac{1}{r}\,\theta^{-2}, \qquad 0 < \theta < 1.$$
In particular, setting $\theta = (r-1)^{-1/2}$, we have
$$E\theta^{\xi_1+\xi_2} \le 1.$$
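The lemma is easy to verify numerically: with the given ±1 marginals, a joint law is determined by $a = P(\xi_1 = 1, \xi_2 = 1)$, and $E\theta^{\xi_1+\xi_2}$ is linear in a, so a grid scan over couplings confirms the stated bound (an illustrative check, not the deferred proof):

```python
import numpy as np

# With marginals P(xi_u = 1) = q, P(xi_u = -1) = 1 - q, a joint law has
# P(1,1) = a, P(1,-1) = P(-1,1) = q - a, P(-1,-1) = 1 - 2q + a.
for r in (3, 4, 7):
    q = (r - 1) / r
    for theta in np.linspace(0.05, 0.95, 19):
        bound = q * theta**2 + (1 - q) * theta**-2
        for a in np.linspace(max(0.0, 2 * q - 1), q, 101):
            E = a * theta**2 + 2 * (q - a) + (1 - 2 * q + a) * theta**-2
            assert E <= bound + 1e-12     # Lemma 14.36 bound holds
    # and theta = (r-1)^{-1/2} makes the bound exactly 1
    t = (r - 1) ** -0.5
    assert abs(q * t**2 + (1 - q) * t**-2 - 1) < 1e-12
print("Lemma 14.36 bound verified on a grid")
```

The bound is attained at a = q, i.e., when the two variables are maximally positively correlated.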
Now consider the distance $D_t \equiv d(X_t^{(i)}, X_t^{(j)})$ between the two particles. The key idea is
$$E(\theta^{D_{t+1}} - \theta^{D_t} \mid X_t^{(i)}, X_t^{(j)}) \le 0 \quad \text{if } D_t \le \lfloor g/2\rfloor - 1, \qquad \le (\theta^{-2}-1)\,\theta^{\lfloor g/2\rfloor} \quad \text{else.} \tag{14.29}$$
The second inequality follows from the fact $D_{t+1} - D_t \ge -2$. For the first inequality, if $D_t \le \lfloor g/2\rfloor - 1$ then the incremental distance $D_{t+1} - D_t$ is distributed as $\xi_1 + \xi_2$ in the lemma, so the conditional expectation of $\theta^{D_{t+1}-D_t}$ is ≤ 1. Now define a martingale $(M_t)$ via $M_0 = 0$ and
$$M_{t+1} - M_t = \theta^{D_{t+1}} - \theta^{D_t} - E(\theta^{D_{t+1}} - \theta^{D_t} \mid X_t^{(i)}, X_t^{(j)}).$$

Rearranging,
$$\theta^{D_t} - \theta^{D_0} = M_t + \sum_{s=0}^{t-1} E(\theta^{D_{s+1}} - \theta^{D_s} \mid X_s^{(i)}, X_s^{(j)}) \le M_t + (\theta^{-2}-1)\,\theta^{\lfloor g/2\rfloor}\, t \quad \text{by (14.29)}.$$
Apply this inequality at the coupling time T and take expectations: we have $EM_T = 0$ by the optional sampling theorem (Chapter 2 yyy) and $D_T = 0$, so
$$1 - \theta^{d(i,j)} \le (\theta^{-2}-1)\,\theta^{\lfloor g/2\rfloor}\, ET$$
and the Proposition follows.

14.8 Notes on Chapter 14


Section 14.1. Coupling has become a standard tool in probability theory.
Lindvall [233] contains an extensive treatment, emphasizing its use to prove
limit theorems. Stoyan [315] emphasizes comparison results in the context
of queueing systems.
Birth-and-death chains have more monotonicity properties than stated in Proposition 14.1 – see van Doorn [329] for an extensive treatment. The coupling (14.2) of a birth-and-death process is better viewed as a specialization of couplings of stochastically monotone processes, cf. [233] Chapter 4.3.
Section 14.1.1. Using the coupling inequality to prove convergence to stationarity (i.e., the convergence theorem, Chapter 2 yyy) and the analogs for continuous-space processes is called the coupling method. See [233] p. 233 for some history. Systematic use to bound variation distance in finite-state chains goes back to Aldous [9] and is repeated here. The coupling inequality is often presented as involving the chain started from an arbitrary point and the stationary chain, leading to a bound on d(t) instead of $\bar{d}(t)$.
Section 14.3. The voter model on $Z^d$, and its duality with coalescing random walk, has been extensively studied – see [132, 231] for textbook treatments. The general notion of duality is discussed in [231] section 2.3. The voter model on general finite graphs has apparently been studied only once, by Donnelly and Welsh [128]. They studied the two-party model, and obtained the analog of Proposition 14.9(a) and some variations.
In the context of Open Problem 14.13 one can seek to use the randomization idea in Matthews’ method, and the problem reduces to proving that, in the coalescing of k randomly-started particles, the chance that the final join is between a (k − 1)-cluster and a 1-cluster is small.
Section 14.3.5. On the infinite two-dimensional lattice, the meeting time M of independent random walks is such that log M has approximately an exponential distribution. Rather surprisingly, with a logarithmic time transformation one can get an analog of Proposition 14.17 on the infinite lattice – see Cox and Griffeath [104].
Section 14.4. Donnelly and Welsh [129] obtained Proposition 14.19 and a few other results, e.g. that, over edge-transitive graphs, $c_{\text{edge}}$ is uniquely maximized on the complete graph.
In the context of Open Problem 14.25, there are many known Normal
limits in the context of interacting particle systems on the infinite lattice,
but it is not clear how well those techniques extend to general finite graphs.
It would be interesting to know whether Stein’s method could be used here
(see Baldi and Rinott [38] for different uses of Stein’s method on graphs).
Section 14.5. The name “interchange process” is my coinage: the pro-
cess, in the card-shuffling interpretation, was introduced by Diaconis and
Saloff-Coste [117].
The interchange process can of course be constructed from a Poisson
process of directed edges, as was the voter model in section 14.3. On the
n-path, this “graphical representation” has an interpretation as a method
to create a pseudo-random permutation with paper and pencil – see Lange
and Miller [220] for an entertaining elementary exposition.
Miscellaneous. One can define a wide variety of “growth and coverage”
models on a finite graph, where there is some prescribed rule for growing a
random subset St of vertices, starting with a single vertex, and the quantity
of interest is the time T until the subset has grown to be the complete
graph. Such processes have been studied as models for rumors, broadcast
information and percolation – see e.g. Weber [335] and Fill and Pemantle
[148].
Bibliography

[1] S.R. Adke and S.M. Manjunath. An Introduction to Finite Markov


Processes. Wiley, 1984.

[2] R.J. Adler and R. Epstein. Some central limit theorems for Markov
paths and some properties of Gaussian random fields. Stochastic Pro-
cess. Appl., 24:157–202, 1987.

[3] M. Ajtai, J. Komlós, and E. Szemerédi. Sorting in c log n parallel


steps. Combinatorica, 3:1–19, 1983.

[4] M. Ajtai, J. Komlós, and E. Szemerédi. Deterministic simulation in


logspace. In Proc. 19th ACM Symp. Theory of Computing, pages 132–
140, 1987.

[5] D.J. Aldous. Markov chains with almost exponential hitting times.
Stochastic Process. Appl., 13:305–310, 1982.

[6] D.J. Aldous. Some inequalities for reversible Markov chains. J. London
Math. Soc. (2), 25:564–576, 1982.

[7] D.J. Aldous. Minimization algorithms and random walk on the d-cube.
Ann. Probab., 11:403–413, 1983.

[8] D.J. Aldous. On the time taken by random walks on finite groups to
visit every state. Z. Wahrsch. Verw. Gebiete, 62:361–374, 1983.

[9] D.J. Aldous. Random walks on finite groups and rapidly mixing
Markov chains. In Seminaire de Probabilites XVII, pages 243–297.
Springer-Verlag, 1983. Lecture Notes in Math. 986.

[10] D.J. Aldous. On the Markov chain simulation method for uniform
combinatorial distributions and simulated annealing. Probab. Engi-
neering Inform. Sci., 1:33–46, 1987.


[11] D.J. Aldous. Finite-time implications of relaxation times for stochas-


tically monotone processes. Probab. Th. Rel. Fields, 77:137–145, 1988.
[12] D.J. Aldous. Hitting times for random walks on vertex-transitive
graphs. Math. Proc. Cambridge Philos. Soc., 106:179–191, 1989.
[13] D.J. Aldous. The random walk construction of uniform spanning trees
and uniform labelled trees. SIAM J. Discrete Math., 3:450–465, 1990.
[14] D.J. Aldous. The continuum random tree II: an overview. In M.T.
Barlow and N.H. Bingham, editors, Stochastic Analysis, pages 23–70.
Cambridge University Press, 1991.
[15] D.J. Aldous. Meeting times for independent Markov chains. Stochastic
Process. Appl., 38:185–193, 1991.
[16] D.J. Aldous. Random walk covering of some special trees. J. Math.
Analysis Appl., 157:271–283, 1991.
[17] D.J. Aldous. Threshold limits for cover times. J. Theoretical Probab.,
4:197–211, 1991.
[18] D.J. Aldous. On simulating a Markov chain stationary distribution
when transition probabilities are unknown. In D.J. Aldous, P. Dia-
conis, J. Spencer, and J. M. Steele, editors, Discrete Probability and
Algorithms, volume 72 of IMA Volumes in Mathematics and its Ap-
plications, pages 1–9. Springer-Verlag, 1995.
[19] D.J. Aldous and A. Bandyopadhyay. How to combine fast heuristic
Markov chain Monte Carlo with slow exact sampling. Unpublished,
2001.
[20] D.J. Aldous and M. Brown. Inequalities for rare events in time-
reversible Markov chains I. In M. Shaked and Y.L. Tong, editors,
Stochastic Inequalities, volume 22 of Lecture Notes, pages 1–16. Insti-
tute of Mathematical Statistics, 1992.
[21] D.J. Aldous and P. Diaconis. Shuffling cards and stopping times.
Amer. Math. Monthly, 93:333–348, 1986.
[22] D.J. Aldous and P. Diaconis. Strong uniform times and finite random
walks. Adv. in Appl. Math., 8:69–97, 1987.
[23] D.J. Aldous and B. Larget. A tree-based scaling exponent for random
cluster models. J. Phys. A: Math. Gen., 25:L1065–L1069, 1992.

[24] D.J. Aldous, L. Lovász, and P. Winkler. Mixing times for uniformly
ergodic Markov chains. Stochastic Process. Appl., 71:165–185, 1997.

[25] R. Aleliunas, R.M. Karp, R.J. Lipton, L. Lovász, and C. Rackoff.


Random walks, universal traversal sequences, and the complexity of
maze traversal. In Proc. 20th IEEE Symp. Found. Comp. Sci., pages
218–233, 1979.

[26] N. Alon. Eigenvalues and expanders. Combinatorica, 6:83–96, 1986.

[27] N. Alon, U. Feige, A. Wigderson, and D. Zuckerman. Derandomized


graph products. Comput. Complexity, 5:60–75, 1995.

[28] N. Alon and Y. Roichman. Random Cayley graphs and expanders.


Random Struct. Alg., 5:271–284, 1994.

[29] N. Alon and J. H. Spencer. The Probabilistic Method. Wiley, 1992.

[30] V. Anantharam and P. Tsoucas. A proof of the Markov chain tree


theorem. Stat. Probab. Letters, 8:189–192, 1989.

[31] W.J. Anderson. Continuous-Time Markov Chains. Springer–Verlag,


1991.

[32] D. Applegate and R. Kannan. Sampling and integration of log-concave


functions. In Proc. 23rd ACM Symp. Theory of Computing, pages 156–
163, 1991.

[33] D. Applegate, R. Kannan, and N. Polson. Random polynomial time


algorithms for sampling from joint distributions. Technical report,
Carnegie-Mellon, 1990.

[34] S. Asmussen. Applied Probability and Queues. Wiley, 1987.

[35] S. Asmussen, P.W. Glynn, and H. Thorisson. Stationarity detection


in the initial transient problem. ACM Trans. Modeling and Computer
Sim., 2:130–157, 1992.

[36] L. Babai. Local expansion of vertex-transitive graphs and random


generation in finite groups. In Proc. 23rd ACM Symp. Theory of Com-
puting, pages 164–174, 1991.

[37] L. Babai. Probably true theorems, cry wolf? Notices Amer. Math.
Soc., 41:453–454, 1994.

[38] P. Baldi and Y. Rinott. Asymptotic normality of some graph-related


statistics. J. Appl. Probab., 26:171–175, 1989.

[39] C. Bandle. Isoperimetric Inequalities and Applications. Pitman,


Boston MA, 1980.

[40] M.T. Barlow. Random walks and diffusions on fractals. In Proc. ICM
Kyoto 1990, pages 1025–1035. Springer–Verlag, 1991.

[41] M.T. Barlow. Diffusions on fractals. In P. Bernard, editor, Lectures


on Probability and Statistics, volume 1690 of Lecture Notes in Math.
Springer–Verlag, 1998.

[42] G. Barnes and U. Feige. Short random walks on graphs. SIAM J.


Discrete Math., 9:19–28, 1996.

[43] M.F. Barnsley and J.H. Elton. A new class of Markov processes for
image encoding. Adv. in Appl. Probab., 20:14–32, 1988.

[44] J.R. Baxter and R.V. Chacon. Stopping times for recurrent Markov
processes. Illinois J. Math., 20:467–475, 1976.

[45] K.A. Berman and M. Konsowa. Random paths, electrical networks


and reversible Markov chains. SIAM J. Discrete Math., 3:311–319,
1990.

[46] J. Besag and P.J. Green. Spatial statistics and Bayesian computation.
J. Royal Statist. Soc. (B), 55:25–37, 1993. Followed by discussion.

[47] S. Bhatt and J-Y Cai. Taking random walks to grow trees in hyper-
cubes. J. Assoc. Comput. Mach., 40:741–764, 1993.

[48] N. L. Biggs. Algebraic Graph Theory. Cambridge University Press,


1974.

[49] N. L. Biggs. Potential theory on distance-regular graphs. Combin.


Probab. Comput., 2:243–255, 1993.

[50] L.J. Billera and P. Diaconis. A geometric interpretation of the


Metropolis algorithm. Unpublished, 2000.

[51] N. H. Bingham. Fluctuation theory for the Ehrenfest urn. Adv. in


Appl. Probab., 23:598–611, 1991.

[52] D. Boivin. Weak convergence for reversible random walks in a random


environment. Ann. Probab., 21:1427–1440, 1993.

[53] B. Bollobás. Random Graphs. Academic Press, London, 1985.

[54] B. Bollobás. The isoperimetric number of random regular graphs.


European J. Combin., 9:241–244, 1988.

[55] E. Bolthausen. The Berry-Esseen theorem for functionals of discrete


Markov chains. Z. Wahrsch. Verw. Gebiete, 54:59–73, 1980.

[56] A. Borodin, W.L. Ruzzo, and M. Tompa. Lower bounds on the length
of universal traversal sequences. J. Computer Systems Sci., 45:180–
203, 1992.

[57] K. Borre and P. Meissl. Strength Analysis of Leveling-type Networks.


Geodaetisk Institut, Copenhagen, 1974. Volume 50.

[58] O. Bottema. Über die Irrfahrt in einem Straßennetz. Math. Z.,


39:137–145, 1935.

[59] R.C. Bradley, W. Bryc, and S. Janson. On dominations between mea-


sures of dependence. J. Multivariate Anal., 23:312–329, 1987.

[60] L.A. Breyer and G.O. Roberts. From Metropolis to diffusions: Gibbs
states and optimal scaling. Technical report, Statistical Lab., Cam-
bridge U.K., 1998.

[61] G. Brightwell and P. Winkler. Extremal cover times for random walks
on trees. J. Graph Theory, 14:547–554, 1990.

[62] G. Brightwell and P. Winkler. Maximum hitting time for random


walks on graphs. Random Struct. Alg., 1:263–276, 1990.

[63] P.J. Brockwell and R.A. Davis. Time Series: Theory and Methods.
Springer–Verlag, 1987.

[64] A. Broder. How hard is it to marry at random? (on the approximation


of the permanent). In Proc. 18th ACM Symp. Theory of Computing,
pages 50–58, 1986.

[65] A. Broder. Generating random spanning trees. In Proc. 30’th IEEE


Symp. Found. Comp. Sci., pages 442–447, 1989.

[66] A. Broder and A.R. Karlin. Bounds on the cover time. J. Theoretical
Probab., 2:101–120, 1989.

[67] A. Broder, A.R. Karlin, P. Raghavan, and E. Upfal. Trading space


for time in undirected s − t connectivity. In Proc. 21st ACM Symp.
Theory of Computing, pages 543–549, 1989.

[68] A. Broder and E. Shamir. On the second eigenvalue of random regular


graphs. In Proc. 28th Symp. Foundations of Computer Sci., pages 286–
294, 1987.

[69] A. Z. Broder, A. M. Frieze, and E. Upfal. Existence and construction


of edge disjoint paths on expander graphs. SIAM J. Comput., 23:976–
989, 1994.

[70] A. Z. Broder, A. M. Frieze, and E. Upfal. Static and dynamic path


selection on expander graphs: a random walk approach. Random
Struct. Alg., 14:87–109, 1999.

[71] A.E. Brouwer, A.M. Cohen, and A. Neumaier. Distance Regular


Graphs. Springer–Verlag, 1989.

[72] M. Brown. Approximating IMRL distributions by exponential dis-


tributions, with applications to first passage times. Ann. Probab.,
11:419–427, 1983.

[73] M. Brown. Consequences of monotonicity for Markov transition func-


tions. Technical report, City College, CUNY, 1990.

[74] M. Brown. Interlacing eigenvalues in time reversible Markov chains.


Math. Oper. Res., 24:847–864, 1999.

[75] M.J.A.M. Brummelhuis and H.J. Hilhorst. Covering of a finite lattice


by a random walk. Physica A, 176:387–408, 1991.

[76] D. Brydges, S.N. Evans, and J.Z. Imbrie. Self-avoiding walk on a


hierarchical lattice in four dimensions. Ann. Probab., 20:82–124, 1992.

[77] R. Bubley and M. Dyer. Path coupling: a technique for proving rapid
mixing in Markov chains. In Proc. 38’th IEEE Symp. Found. Comp.
Sci., pages 223–231, 1997.

[78] R. Bubley and M. Dyer. Faster random generation of linear extensions.


Discrete Math., 201:81–88, 1999.

[79] R. Bubley, M. Dyer, and C. Greenhill. Beating the 2∆ bound for


approximately counting colorings: A computer-assisted proof of rapid
mixing. In Proc. 9’th ACM-SIAM Symp. Discrete Algorithms, pages
355–363, New York, 1998. ACM.

[80] R. Bubley, M. Dyer, and M. Jerrum. An elementary analysis of a


procedure for sampling points in a convex body. Random Struct. Alg.,
12:213–235, 1998.

[81] F. Buckley and F. Harary. Distance in Graphs. Addison-Wesley, 1990.

[82] K. Burdzy and W.S. Kendall. Efficient Markovian couplings: Exam-


ples and counterexamples. Ann. Appl. Probab., 10:362–409, 2000.

[83] R. Burton and R. Pemantle. Local characteristics, entropy and


limit theorems for spanning trees and domino tilings via transfer-
impedances. Ann. Probab., 21:1329–1371, 1993.

[84] E.A. Carlen, S. Kusuoka, and D.W. Stroock. Upper bounds for sym-
metric Markov transition functions. Ann. Inst. H. Poincaré Probab.
Statist., Suppl. 2:245–287, 1987.

[85] A.K. Chandra, P. Raghavan, W.L. Ruzzo, R. Smolensky, and P. Ti-


wari. The electrical resistance of a graph captures its commute and
cover times. Comput. Complexity, 6:312–340, 1996/7. Extended ab-
stract originally published in Proc. 21st ACM Symp. Theory of Com-
puting (1989) 574-586.

[86] G. Chartrand and L. Lesniak. Graphs and Digraphs. Wadsworth,


1986.

[87] J. Cheeger. A lower bound for the lowest eigenvalue for the Laplacian.
In R. C. Gunning, editor, A Symposium in Honor of S. Bochner, pages
195–199. Princeton Univ. Press, 1970.

[88] M. F. Chen. From Markov Chains to Non-Equilibrium Particle Sys-


tems. World Scientific, Singapore, 1992.

[89] M. F. Chen. Trilogy of couplings and general formulas for lower bound
of spectral gap. In L. Accardi and C. Heyde, editors, Probability To-
wards 2000, number 128 in Lecture Notes in Statistics, pages 123–136.
Springer–Verlag, 1996.

[90] M. F. Chen. Eigenvalues, inequalities and ergodic theory II. Adv.


Math. (China), 28:481–505, 1999.

[91] M.-H. Chen, Q.-M. Shao, and J.G. Ibrahim. Monte Carlo Methods in
Bayesian Computation. Springer–Verlag, 2000.

[92] F.R.K. Chung. On concentrators, superconcentrators, and nonblock-


ing networks. Bell System Tech. J., 58:1765–1777, 1979.

[93] F.R.K. Chung. Spectral Graph Theory, volume 92 of CBMS Regional


Conference Series in Mathematics. Amer. Math. Soc., 1997.

[94] F.R.K. Chung, P. Diaconis, and R. L. Graham. A random walk prob-


lem arising in random number generation. Ann. Probab., 15:1148–
1165, 1987.

[95] F.R.K. Chung and S.-T. Yau. Eigenvalues of graphs and Sobolev
inequalities. Combin. Probab. Comput., 4:11–25, 1995.

[96] K.L. Chung. Markov Chains with Stationary Transition Probabilities.


Springer–Verlag, second edition, 1967.

[97] A. Cohen and A. Wigderson. Dispensers, deterministic amplification,


and weak random sources. In Proc. 30’th IEEE Symp. Found. Comp.
Sci., pages 14–19, 1989.

[98] A. Condon and D. Hernek. Random walks on colored graphs. Random


Struct. Alg., 5:285–303, 1994.

[99] D. Coppersmith, P. Doyle, P. Raghavan, and M. Snir. Random walks


on weighted graphs and applications to on-line algorithms. J. Assoc.
Comput. Mach., 40:421–453, 1993.

[100] D. Coppersmith, U. Feige, and J. Shearer. Random walks on regular


and irregular graphs. SIAM J. Discrete Math., 9:301–308, 1996.

[101] D. Coppersmith, P. Tetali, and P. Winkler. Collisions among random


walks on a graph. SIAM J. Discrete Math., 6:363–374, 1993.

[102] M.K. Cowles and B.P. Carlin. Markov chain Monte Carlo convergence
diagnostics: A comparative review. J. Amer. Statist. Assoc., 91:883–
904, 1996.

[103] J.T. Cox. Coalescing random walks and voter model consensus times
on the torus in $Z^d$. Ann. Probab., 17:1333–1366, 1989.

[104] J.T. Cox and D. Griffeath. Mean field asymptotics for the planar
stepping stone model. Proc. London Math. Soc., 61:189–208, 1990.

[105] S. L. Cuéllar-Montoya. A rapidly mixing stochastic system of finite


interacting particles on the circle. Stochastic Process. Appl., 67:69–99,
1997.

[106] D.M. Cvetkovic, M. Doob, I. Gutman, and A. Torgasev. Recent Re-


sults in the Theory of Graph Spectra. North-Holland, 1988. Annals of
Discrete Math. 36.

[107] D.M. Cvetkovic, M. Doob, and H. Sachs. Spectra of Graphs. Academic


Press, 1980.

[108] C. Dellacherie and P.-A. Meyer. Probabilités et Potentiel: Théorie


Discrète du Potentiel. Hermann, Paris, 1983.

[109] L. Devroye. Nonuniform Random Number Generation. Springer–


Verlag, 1986.

[110] L. Devroye and A. Sbihi. Random walks on highly-symmetric graphs.


J. Theoretical Probab., 3:497–514, 1990.

[111] L. Devroye and A. Sbihi. Inequalities for random walks on trees. In


A. Frieze and T. Luczak, editors, Random Graphs, volume 2, pages
35–45. Wiley, 1992.

[112] P. Diaconis. Group Representations in Probability and Statistics. In-


stitute of Mathematical Statistics, Hayward CA, 1988.

[113] P. Diaconis. Notes on the hit-and-run algorithm. Unpublished, 1996.

[114] P. Diaconis and J.A. Fill. Examples for the theory of strong stationary
duality with countable state spaces. Prob. Engineering Inform. Sci.,
4:157–180, 1990.

[115] P. Diaconis and J.A. Fill. Strong stationary times via a new form of
duality. Ann. Probab., 18:1483–1522, 1990.

[116] P. Diaconis, R.L. Graham, and J.A. Morrison. Asymptotic analysis


of a random walk on a hypercube with many dimensions. Random
Struct. Alg., 1:51–72, 1990.

[117] P. Diaconis and L. Saloff-Coste. Comparison theorems for random


walk on finite groups. Ann. Probab., 21:2131–2156, 1993.

[118] P. Diaconis and L. Saloff-Coste. Comparison theorems for reversible


Markov chains. Ann. Appl. Probab., 3:696–730, 1993.

[119] P. Diaconis and L. Saloff-Coste. Logarithmic Sobolev inequalities for


finite Markov chains. Ann. Appl. Probab., 6:695–750, 1996.

[120] P. Diaconis and L. Saloff-Coste. Nash inequalities for finite Markov


chains. J. Theoretical Probab., 9:459–510, 1996.

[121] P. Diaconis and L. Saloff-Coste. What do we know about the Metropo-


lis algorithm? J. Comput. System Sci., 57:20–36, 1998.

[122] P. Diaconis and M. Shahshahani. Generating a random permutation


with random transpositions. Z. Wahrsch. Verw. Gebiete, 57:159–179,
1981.

[123] P. Diaconis and M. Shahshahani. Time to reach stationarity in the


Bernoulli-Laplace diffusion model. SIAM J. Math. Anal., 18:208–218,
1986.

[124] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of


Markov chains. Ann. Appl. Probab., 1:36–61, 1991.

[125] I.H. Dinwoodie. A probability inequality for the occupation measure


of a reversible Markov chain. Ann. Appl. Probab., 5:37–43, 1995.

[126] W. Doeblin. Exposé de la theorie des chaı̂nes simples constantes de


Markov à un nombre fini d’états. Rev. Math. Union Interbalkanique,
2:77–105, 1938.

[127] P. Donnelly, P. Lloyd, and A. Sudbury. Approach to stationarity of the


Bernoulli-Laplace diffusion model. Adv. in Appl. Probab., 26:715–727,
1994.

[128] P. Donnelly and D. Welsh. Finite particle systems and infection mod-
els. Math. Proc. Cambridge Philos. Soc., 94:167–182, 1983.

[129] P. Donnelly and D. Welsh. The antivoter problem: Random 2-


colourings of graphs. In B. Bollobás, editor, Graph Theory and Com-
binatorics, pages 133–144. Academic Press, 1984.

[130] C. Dou and M. Hildebrand. Enumeration and random random walks


on finite groups. Ann. Probab., 24:987–1000, 1996.

[131] P.G. Doyle and J.L. Snell. Random Walks and Electrical Networks.
Mathematical Association of America, Washington DC, 1984.
[132] R. Durrett. Lecture Notes on Particle Systems and Percolation.
Wadsworth, Pacific Grove CA, 1988.
[133] R. Durrett. Probability: Theory and Examples. Wadsworth, Pacific
Grove CA, 1991.
[134] M. Dyer and A. Frieze. Computing the volume of convex bodies: A
case where randomness provably helps. In B. Bollobás, editor, Proba-
bilistic Combinatorics And Its Applications, volume 44 of Proc. Symp.
Applied Math., pages 123–170. American Math. Soc., 1991.
[135] M. Dyer, A. Frieze, and R. Kannan. A random polynomial time algo-
rithm for approximating the volume of convex bodies. In Proc. 21st
ACM Symp. Theory of Computing, pages 375–381, 1989.
[136] M. Dyer, A. Frieze, and R. Kannan. A random polynomial time al-
gorithm for approximating the volume of convex bodies. J. Assoc.
Comput. Mach., 38:1–17, 1991.
[137] M. Dyer and C. Greenhill. A genuinely polynomial-time algorithm for
sampling two-rowed contingency tables. Technical report, University
of Leeds, U.K., 1998. Unpublished.
[138] M.L. Eaton. Admissibility in quadratically regular problems and recurrence of symmetric Markov chains: Why the connection? J. Statist.
Plann. Inference, 64:231–247, 1997.
[139] R.G. Edwards and A.D. Sokal. Generalization of the Fortuin-
Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm.
Phys. Rev. D, 38:2009–2012, 1988.
[140] B. Efron and C. Stein. The jackknife estimate of variance. Ann.
Statist., 10:586–596, 1981.
[141] S.N. Ethier and T. G. Kurtz. Markov Processes: Characterization and
Convergence. Wiley, New York, 1986.
[142] U. Feige. A tight lower bound on the cover time for random walks on
graphs. Random Struct. Alg., 6:433–438, 1995.
[143] U. Feige. A tight upper bound on the cover time for random walks on
graphs. Random Struct. Alg., 6, 1995.

[144] U. Feige. Collecting coupons on trees, and the cover time of random
walks. Comput. Complexity, 6:341–356, 1996/7.

[145] W. Feller. An Introduction to Probability Theory and its Applications,


volume II. Wiley, 2nd edition, 1971.

[146] M. Fiedler. Bounds for eigenvalues of doubly stochastic matrices. Lin-


ear Algebra Appl., 5:299–310, 1972.

[147] M. Fiedler, C.R. Johnson, T.L. Markham, and M. Neumann. A trace


inequality for M -matrices and the symmetrizability of a real matrix
by a positive diagonal matrix. Linear Alg. Appl., 71:81–94, 1985.

[148] J. A. Fill and R. Pemantle. Percolation, first-passage percolation and


covering times for Richardson’s model on the n-cube. Ann. Appl.
Probab., 3:593–629, 1993.

[149] J.A. Fill. Eigenvalue bounds on convergence to stationarity for nonre-


versible Markov chains, with an application to the exclusion process.
Ann. Appl. Probab., 1:62–87, 1991.

[150] J.A. Fill. Time to stationarity for a continuous-time Markov chain.


Prob. Engineering Inform. Sci., 5:45–70, 1991.

[151] J.A. Fill. Strong stationary duality for continuous-time Markov chains.
part I: Theory. J. Theoretical Probab., 5:45–70, 1992.

[152] L. Flatto, A.M. Odlyzko, and D.B. Wales. Random shuffles and group
representations. Ann. Probab., 13:154–178, 1985.

[153] R.M. Foster. The average impedance of an electrical network. In


Contributions to Applied Mechanics, pages 333–340, Ann Arbor, MI,
1949. Edwards Brothers, Inc.

[154] D. Freedman. Markov Chains. Springer–Verlag, 1983. Reprint of 1971


Holden-Day edition.

[155] J. Friedman. On the second eigenvalue and random walk in random


regular graphs. Combinatorica, 11:331–362, 1991.

[156] J. Friedman, editor. Expanding Graphs. Amer. Math. Soc., 1993. DI-
MACS volume 10.

[157] J. Friedman, J. Kahn, and E. Szemerédi. On the second eigenvalue in


random regular graphs. In Proc. 21st ACM Symp. Theory of Comput-
ing, pages 587–598, 1989.

[158] A. Frieze, R. Kannan, and N. Polson. Sampling from log-concave


distributions. Ann. Appl. Probab., 4:812–837, 1994.

[159] M. Fukushima. Dirichlet Forms and Markov Processes. North-


Holland, 1980.

[160] A. Gelman, G.O. Roberts, and W.R. Gilks. Efficient Metropolis jump-
ing rules. In Bayesian Statistics, volume 5, pages 599–608. Oxford
University Press, 1996.

[161] A. Gelman and D.B. Rubin. Inference from iterative simulation us-
ing multiple sequences. Statistical Science, 7:457–472, 1992. With
discussion.

[162] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions


and the Bayesian restoration of images. IEEE Trans. Pattern Anal.
Mach. Intell., 6:721–741, 1984.

[163] C.J. Geyer. Markov chain Monte Carlo maximum likelihood. In


E. Keramigas, editor, Computing Science and Statistics. Proceedings of
the 23rd Symposium on the Interface, pages 156–163. Interface Foun-
dation, 1991.

[164] A. Giacometti. Exact closed form of the return probability on the


Bethe lattice. J. Phys. A: Math. Gen., 28:L13–L17, 1995.

[165] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter, editors. Markov


Chain Monte Carlo in Practice, London, 1996. Chapman and Hall.

[166] W.R. Gilks, G.O. Roberts, and E.I. George. Adaptive direction sam-
pling. The Statistician, 43:179–189, 1994.

[167] D. Gillman. A Chernoff bound for random walks on expander graphs.


SIAM J. Comput., 27:1203–1220, 1998.

[168] F. Gobel and A.A. Jagers. Random walks on graphs. Stochastic Pro-
cess. Appl., 2:311–336, 1974.

[169] S. Goldstein. Maximal coupling. Z. Wahrsch. Verw. Gebiete, 46:193–


204, 1979.

[170] J. Goodman and A. Sokal. Multigrid Monte Carlo method: Conceptual foundations. Phys. Rev. D, 40:2037–2071, 1989.

[171] D. Griffeath. A maximal coupling for Markov chains. Z. Wahrsch. Verw. Gebiete, pages 95–106, 1974-75.

[172] D. Griffeath. Additive and Cancellative Interacting Particle Systems, volume 724 of Lecture Notes in Math. Springer–Verlag, 1979.

[173] D. Griffeath and T.M. Liggett. Critical phenomena for Spitzer's reversible nearest particle systems. Ann. Probab., 10:881–895, 1982.

[174] P. Griffin. Accelerating beyond the third dimension: Returning to the origin in a simple random walk. Mathematical Scientist, 15:24–35, 1990.

[175] G. Grimmett. Random graphical networks. In Networks and Chaos — Statistical and Probabilistic Aspects, pages 288–301. Chapman and Hall, London, 1993. Monogr. Statist. Appl. Prob. 50.

[176] G. Grimmett and H. Kesten. Random electrical networks on complete graphs. J. London Math. Soc. (2), 30:171–192, 1984.

[177] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, 1982.

[178] H. Haken. Synergetics. Springer–Verlag, 1978.

[179] K.J. Harrison and M.W. Short. The last vertex visited in a random walk on a graph. Technical report, Murdoch University, Australia, 1992.

[180] W.K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.

[181] R. Holley and D. Stroock. Logarithmic Sobolev inequalities and stochastic Ising models. J. Statist. Phys., 46:1159–1194, 1987.

[182] D.F. Holt. A graph which is edge-transitive but not arc-transitive. J. Graph Theory, 5:201–204, 1981.

[183] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[184] B.D. Hughes. Random Walks and Random Environments. Oxford University Press, 1995.

[185] K. Hukushima and K. Nemoto. Exchange Monte Carlo method and application to spin glass simulations. J. Physics Soc. Japan, 65:1604–1608, 1996.

[186] J.J. Hunter. Mathematical Techniques of Applied Probability. Academic Press, 1983.

[187] J.P. Imhof. On the range of Brownian motion and its inverse process. Ann. Probab., 13:1011–1017, 1985.

[188] R. Impagliazzo and D. Zuckerman. How to recycle random bits. In Proc. 30th IEEE Symp. Found. Comp. Sci., pages 248–253, 1989.

[189] M. Iosifescu. Finite Markov Processes and Their Applications. Wiley, 1980.

[190] R. Isaacs. Differential Games. Wiley, 1965.

[191] D. Isaacson and R. Madsen. Markov Chains: Theory and Applications. Wiley, 1976.

[192] I. Iscoe and D. McDonald. Asymptotics of exit times for Markov jump processes I. Ann. Probab., 22:372–397, 1994.

[193] I. Iscoe, D. McDonald, and K. Qian. Capacity of ATM switches. Ann. Appl. Probab., 3:277–295, 1993.

[194] S. Janson. Gaussian Hilbert Spaces. Number 129 in Cambridge Tracts in Mathematics. Cambridge University Press, 1997.

[195] E. Janvresse. Spectral gap for Kac's model of the Boltzmann equation. Ann. Probab., 29:288–304, 2001.

[196] M. Jerrum. Uniform sampling modulo a group of symmetries using Markov chain simulation. In J. Friedman, editor, Expanding Graphs, pages 37–48. A.M.S., 1993. DIMACS, volume 10.

[197] M. Jerrum. Mathematical foundations of the Markov chain Monte Carlo method. In Probabilistic Methods for Algorithmic Discrete Mathematics, number 16 in Algorithms and Combinatorics, pages 116–165. Springer–Verlag, 1998.

[198] M. Jerrum and A. Sinclair. Approximating the permanent. SIAM J. Comput., 18:1149–1178, 1989.

[199] M. Jerrum and A. Sinclair. The Markov chain Monte Carlo method: an approach to approximate counting and integration. In D. Hochbaum, editor, Approximation Algorithms for NP-Hard Problems, pages 482–520, Boston MA, 1996. PWS.

[200] M.R. Jerrum. A very simple algorithm for estimating the number of k-colorings of a low-degree graph. Random Struct. Alg., 7:157–165, 1995.

[201] M.R. Jerrum, L.G. Valiant, and V.V. Vazirani. Random generation of combinatorial structures from a uniform distribution. Theor. Computer Sci., 43:169–188, 1986.

[202] C.D. Meyer Jr. The role of the group generalized inverse in the theory of finite Markov chains. SIAM Review, 17:443–464, 1975.

[203] N. Kahale. Eigenvalues and expansion of regular graphs. J. ACM, 42:1091–1106, 1995.

[204] N. Kahale. Large deviation bounds for Markov chains. Combin. Probab. Comput., 6:465–474, 1997.

[205] J.D. Kahn, N. Linial, N. Nisan, and M.E. Saks. On the cover time of random walks on graphs. J. Theoretical Probab., 2:121–128, 1989.

[206] R. Kannan, L. Lovász, and M. Simonovits. Random walks and an O∗(n^5) volume algorithm for convex bodies. Random Struct. Alg., 11:1–50, 1997.

[207] S. Karlin, B. Lindquist, and Y.-C. Yao. Markov chains on hypercubes: Spectral representations and several majorization relations. Random Struct. Alg., 4:1–36, 1993.

[208] S. Karlin and H.M. Taylor. A First Course in Stochastic Processes. Academic Press, second edition, 1975.

[209] S. Karlin and H.M. Taylor. A Second Course in Stochastic Processes. Academic Press, 1981.

[210] R.M. Karp, M. Luby, and N. Madras. Monte Carlo approximation algorithms for enumeration problems. J. Algorithms, 10:429–448, 1989.

[211] A. Karzanov and L. Khachiyan. On the conductance of order Markov chains. Order, 8:7–15, 1991.

[212] J. Keilson. Markov Chain Models - Rarity and Exponentiality. Springer–Verlag, 1979.

[213] F.P. Kelly. Reversibility and Stochastic Networks. Wiley, 1979.

[214] J.G. Kemeny and J.L. Snell. Finite Markov Chains. Van Nostrand, 1960.

[215] J.G. Kemeny, J.L. Snell, and A.W. Knapp. Denumerable Markov Chains. Springer–Verlag, 2nd edition, 1976.

[216] J.H.B. Kemperman. The Passage Problem for a Stationary Markov Chain. University of Chicago Press, 1961.

[217] C. Kipnis and S.R.S. Varadhan. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Comm. Math. Phys., 104:1–19, 1986.

[218] W.B. Krebs. Brownian motion on the continuum tree. Probab. Th. Rel. Fields, 101:421–433, 1995.

[219] H.J. Landau and A.M. Odlyzko. Bounds for eigenvalues of certain stochastic matrices. Linear Algebra Appl., 38:5–15, 1981.

[220] L. Lange and J.W. Miller. A random ladder game: Permutations, eigenvalues, and convergence of Markov chains. College Math. Journal, 23:373–385, 1992.

[221] G. Lawler. On the covering time of a disc by simple random walk in two dimensions. In Seminar in Stochastic Processes 1992, pages 189–208. Birkhauser, 1993.

[222] G.F. Lawler and A.D. Sokal. Bounds on the L2 spectrum for Markov chains and Markov processes. Trans. Amer. Math. Soc., 309:557–580, 1988.

[223] T. Leighton and S. Rao. An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In Proc. 29th IEEE Symp. Found. Comp. Sci., pages 422–431, 1988.

[224] G. Letac. Problèmes classiques de probabilité sur un couple de Gelfand. In Analytic Methods in Probability Theory. Springer–Verlag, 1981. Lecture Notes in Math. 861.

[225] G. Letac. Les fonctions sphériques d'un couple de Gelfand symétrique et les chaînes de Markov. Adv. in Appl. Probab., 14:272–294, 1982.

[226] G. Letac. A contraction principle for certain Markov chains and its applications. In Random Matrices and Their Applications, volume 50 of Contemp. Math., pages 263–273. American Math. Soc., 1986.

[227] G. Letac and L. Takács. Random walks on a 600-cell. SIAM J. Alg. Discrete Math., 1:114–120, 1980.

[228] G. Letac and L. Takács. Random walks on a dodecahedron. J. Appl. Probab., 17:373–384, 1980.

[229] P. Lezaud. Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8:849–867, 1998.

[230] T.M. Liggett. Coupling the simple exclusion process. Ann. Probab., 4:339–356, 1976.

[231] T.M. Liggett. Interacting Particle Systems. Springer–Verlag, 1985.

[232] T. Lindstrøm. Brownian motion on nested fractals. Memoirs of the A.M.S., 420, 1989.

[233] T. Lindvall. Lectures on the Coupling Method. Wiley, 1992.

[234] J.S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6:113–119, 1996.

[235] J.S. Liu. Monte Carlo Strategies in Scientific Computing. Springer–Verlag, 2001.

[236] J.S. Liu, F. Liang, and W.H. Wong. The use of multiple-try method and local optimization in Metropolis sampling. JASA, xxx:xxx, xxx.

[237] L. Lovász. Combinatorial Problems and Exercises. North-Holland, 1993. Second Edition.

[238] L. Lovász and M. Simonovits. The mixing rate of Markov chains, an isoperimetric inequality, and computing the volume. In Proc. 31st IEEE Symp. Found. Comp. Sci., pages 346–355, 1990.

[239] L. Lovász and M. Simonovits. Random walks in a convex body and an improved volume algorithm. Random Struct. Alg., 4:359–412, 1993.

[240] L. Lovász and P. Winkler. A note on the last new vertex visited by a random walk. J. Graph Theory, 17:593–596, 1993.

[241] L. Lovász and P. Winkler. Efficient stopping rules for Markov chains. In Proc. 27th ACM Symp. Theory of Computing, pages 76–82, 1995.

[242] L. Lovász and P. Winkler. Exact mixing in an unknown Markov chain. Electronic J. Combinatorics, 2:#R15, 1995.

[243] A. Lubotzky. Discrete Groups, Expanding Graphs and Invariant Measures. Birkhauser, 1994. Progress in Mathematics, vol. 125.

[244] M. Luby and E. Vigoda. Fast convergence of the Glauber dynamics for sampling independent sets. Technical report, ICSI, Berkeley CA, 1999.

[245] R. Lyons. Random walks, capacity, and percolation on trees. Ann. Probab., 20:2043–2088, 1992.

[246] R. Lyons and R. Pemantle. Random walk in a random environment and first-passage percolation on trees. Ann. Probab., 20:125–136, 1992.

[247] R. Lyons, R. Pemantle, and Y. Peres. Ergodic theory on Galton–Watson trees: Speed of random walk and dimension of harmonic measure. Ergodic Theory Dynam. Systems, 15:593–619, 1995.

[248] R. Lyons, R. Pemantle, and Y. Peres. Biased random walks on Galton–Watson trees. Probab. Th. Rel. Fields, 106:249–264, 1996.

[249] R. Lyons, R. Pemantle, and Y. Peres. Unsolved problems concerning random walks on trees. In K. Athreya and P. Jagers, editors, Classical and Modern Branching Processes, pages 223–238. Springer–Verlag, 1996.

[250] R. Lyons and Y. Peres. Probability on Trees and Networks. Cambridge University Press, 2002. In preparation.

[251] T.J. Lyons. A simple criterion for transience of a reversible Markov chain. Ann. Probab., 11:393–402, 1983.

[252] N. Madras and G. Slade. The Self-Avoiding Walk. Birkhauser, 1993.

[253] M.B. Marcus and J. Rosen. Sample path properties of the local times of strongly symmetric Markov processes via Gaussian processes. Ann. Probab., 20:1603–1684, 1992.

[254] E. Marinari and G. Parisi. Simulated tempering: a new Monte Carlo scheme. Europhysics Letters, 19:451–458, 1992.

[255] P.C. Matthews. Covering Problems for Random Walks on Spheres and Finite Groups. PhD thesis, Statistics, Stanford, 1985.

[256] P.C. Matthews. Covering problems for Brownian motion on spheres. Ann. Probab., 16:189–199, 1988.

[257] P.C. Matthews. Covering problems for Markov chains. Ann. Probab., 16:1215–1228, 1988.

[258] P.C. Matthews. A strong uniform time for random transpositions. J. Theoretical Probab., 1:411–423, 1988.

[259] P.C. Matthews. Mixing rates for Brownian motion in a convex polyhedron. J. Appl. Probab., 27:259–268, 1990.

[260] P.C. Matthews. Strong stationary times and eigenvalues. J. Appl. Probab., 29:228–233, 1992.

[261] J.E. Mazo. Some extremal Markov chains. Bell System Tech. J., 61:2065–2080, 1982.

[262] N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, and E. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21:1087–1091, 1953.

[263] S.P. Meyn and R.L. Tweedie. Markov Chains and Stochastic Stability. Springer–Verlag, 1993.

[264] J.W. Moon. Random walks on random trees. J. Austral. Math. Soc., 15:42–53, 1973.

[265] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[266] C. St. J. A. Nash-Williams. Random walks and electric currents in networks. Proc. Cambridge Philos. Soc., 55:181–194, 1959.

[267] R.M. Neal. Bayesian Learning for Neural Networks. Number 118 in Lecture Notes in Statistics. Springer–Verlag, 1996.

[268] R.M. Neal. Sampling from multimodal distributions using tempered transitions. Statistics and Computing, 6:353–366, 1996.

[269] C. Neuhauser and A. Sudbury. The biased annihilating branching process. Adv. in Appl. Probab., 25:24–38, 1993.

[270] J.R. Norris. Markov Chains. Cambridge University Press, 1997.

[271] I. Pak. Random walk on finite groups with few random generators. Electron. J. Probab., 4(1):1–11, 1999.

[272] J.L. Palacios. Bounds on expected hitting times for a random walk on a connected graph. Linear Algebra Appl., 141:241–252, 1990.

[273] J.L. Palacios. A bound for the covering time of random walks on graphs. Statistics and Probab. Lett., 14:9–11, 1992.

[274] J.L. Palacios. Expected cover times of random walks on symmetric graphs. J. Theoretical Probab., 5:597–600, 1992.

[275] J.L. Palacios. Fluctuation theory for the Ehrenfest urn model via electrical networks. Adv. in Appl. Probab., 25:472–476, 1993.

[276] J.L. Palacios. On a result of Aleliunas et al. concerning random walks on graphs. Prob. Engineering Inform. Sci., 4:489–492, 1990.

[277] L. Pearce. Random walks on trees. Discrete Math., 30:269–276, 1980.

[278] R. Pemantle. Choosing a spanning tree for the integer lattice uniformly. Ann. Probab., 19:1559–1574, 1991.

[279] R. Pemantle. Uniform random spanning trees. In J.L. Snell, editor, Topics in Contemporary Probability, pages 1–54, Boca Raton, FL, 1995. CRC Press.

[280] P. Peskun. Optimal Monte Carlo sampling using Markov chains. Biometrika, 60:607–612, 1973.

[281] N. Pippenger. Superconcentrators. SIAM J. Comput., 6:298–304, 1977.

[282] J.W. Pitman. On coupling of Markov chains. Z. Wahrsch. Verw. Gebiete, 35:313–322, 1976.

[283] J.W. Pitman. Occupation measures for Markov chains. Adv. in Appl. Probab., 9:69–86, 1977.

[284] U. Porod. L2 lower bounds for a special class of random walks. Probab. Th. Rel. Fields, 101:277–289, 1995.

[285] U. Porod. The cut-off phenomenon for random reflections. Ann. Probab., 24:74–96, 1996.

[286] J. Propp and D. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Struct. Alg., 9:223–252, 1996.

[287] Y. Rabani, Y. Rabinovich, and A. Sinclair. A computational view of population genetics. In Proc. 36th IEEE Symp. Found. Comp. Sci., pages 83–92, 1995.

[288] P. Révész. Random Walk in Random and Non-Random Scenery. World Scientific, Singapore, 1990.

[289] D. Revuz. Markov Chains. North-Holland, second edition, 1984.

[290] C.P. Robert, editor. Discretization and MCMC Convergence Assessment. Number 135 in Lecture Notes in Statistics. Springer–Verlag, 1998.

[291] C.P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer–Verlag, 2000.

[292] G.O. Roberts. Optimal Metropolis algorithms for product measures on the vertices of a hypercube. Stochastics Stochastic Rep., 62:275–283, 1998.

[293] G.O. Roberts, A. Gelman, and W.R. Gilks. Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab., 7:110–120, 1997.

[294] G.O. Roberts and R.L. Tweedie. Geometric convergence and central limit theorems for multidimensional Hastings and Metropolis algorithms. Biometrika, 83:95–110, 1996.

[295] L.C.G. Rogers and D. Williams. Diffusions, Markov Processes and Martingales: Foundations, volume 1. Wiley, second edition, 1994.

[296] Y. Roichman. On random random walks. Ann. Probab., 24:1001–1011, 1996.

[297] V.I. Romanovsky. Discrete Markov Chains. Wolters-Noordhoff, 1970. English translation of Russian original.

[298] J.S. Rosenthal. Random rotations: Characters and random walks on SO(N). Ann. Probab., 22:398–423, 1994.

[299] S. Ross. Stochastic Processes. Wiley, 1983.

[300] S.M. Ross. A random graph. J. Appl. Probab., 18:309–315, 1981.

[301] H. Rost. The stopping distributions of a Markov process. Inventiones Math., 14:1–16, 1971.

[302] O.S. Rothaus. Diffusion on compact Riemannian manifolds and logarithmic Sobolev inequalities. J. Funct. Anal., 42:102–109, 1981.

[303] W. Rudin. Real and Complex Analysis. McGraw–Hill Book Co., New York, 3rd edition, 1987.

[304] L. Saloff-Coste. Lectures on finite Markov chains. In Lectures on probability theory and statistics (Saint-Flour, 1996), pages 301–413. Springer, Berlin, 1997.

[305] P. Sarnak. Some Applications of Modular Forms. Cambridge University Press, 1990. Cambridge Tracts in Math. 99.

[306] A.M. Sbihi. Covering Times for Random Walks on Graphs. PhD thesis, McGill University, 1990.

[307] A. Sinclair and M. Jerrum. Approximate counting, uniform generation and rapidly mixing Markov chains. Information and Computation, 82:93–133, 1989.

[308] A.J. Sinclair. Improved bounds for mixing rates of Markov chains and multicommodity flow. Combin. Probab. Comput., 1:351–370, 1992.

[309] A.J. Sinclair. Algorithms for Random Generation and Counting. Birkhauser, 1993.

[310] R.L. Smith. Efficient Monte Carlo procedures for generating points uniformly distributed over bounded regions. Operations Research, 32:1296–1308, 1984.

[311] A.D. Sokal. Monte Carlo methods in statistical mechanics: Foundations and new algorithms. In Cours de Troisième Cycle de la Physique en Suisse Romande, Lausanne, 1989.

[312] R. Solovay and V. Strassen. A fast Monte-Carlo test for primality. SIAM J. Comput., 6:84–85, 1977.

[313] R. Stanley. Enumerative Combinatorics, Vol. 2. Cambridge University Press, 1999.

[314] W.J. Stewart. Introduction to the Numerical Solution of Markov Chains. Princeton University Press, 1995.

[315] D. Stoyan. Comparison Methods for Queues and Other Stochastic Models. Wiley, 1983.

[316] W.G. Sullivan. L2 spectral gap and jump processes. Z. Wahrsch. Verw. Gebiete, 67:387–398, 1984.

[317] D.E. Symer. Expanded ergodic Markov chains and cycling systems. Senior thesis, Dartmouth College, 1984.

[318] R. Syski. Passage Times for Markov Chains. IOS Press, Amsterdam, 1992.

[319] L. Takács. Random flights on regular polytopes. SIAM J. Alg. Discrete Meth., 2:153–171, 1981.

[320] L. Takács. Random flights on regular graphs. Adv. in Appl. Probab., 16:618–637, 1984.

[321] M. Tanushev and R. Arratia. A note on distributional equality in the cyclic tour property for Markov chains. Combin. Probab. Comput., 6:493–496, 1997.

[322] A. Telcs. Spectra of graphs and fractal dimensions I. Probab. Th. Rel. Fields, 85:489–497, 1990.

[323] A. Telcs. Spectra of graphs and fractal dimensions II. J. Theoretical Probab., 8:77–96, 1995.

[324] P. Tetali. Random walks and the effective resistance of networks. J. Theoretical Probab., 4:101–109, 1991.

[325] P. Tetali. Design of on-line algorithms using hitting times. Bell Labs, 1994.

[326] P. Tetali. An extension of Foster's network theorem. Combin. Probab. Comput., 3:421–427, 1994.

[327] P. Tetali and P. Winkler. Simultaneous reversible Markov chains. In Combinatorics: Paul Erdős is Eighty, volume 1, pages 433–451. Bolyai Society Mathematical Studies, 1993.

[328] H. Thorisson. Coupling, Stationarity and Regeneration. Springer–Verlag, 2000.

[329] E. van Doorn. Stochastic Monotonicity and Queueing Applications of Birth-Death Processes, volume 4 of Lecture Notes in Statistics. Springer–Verlag, 1981.

[330] A.R.D. van Slijpe. Random walks on regular polyhedra and other distance-regular graphs. Statist. Neerlandica, 38:273–292, 1984.

[331] A.R.D. van Slijpe. Random walks on the triangular prism and other vertex-transitive graphs. J. Comput. Appl. Math., 15:383–394, 1986.

[332] N. Th. Varopoulos, L. Saloff-Coste, and T. Coulhon. Analysis and Geometry on Groups. Cambridge University Press, 1992.

[333] E. Vigoda. Improved bounds for sampling colorings. C.S. Dept., U.C. Berkeley, 1999.

[334] I.C. Walters. The ever expanding expander coefficients. Bull. Inst. Combin. Appl., 17:79–86, 1996.

[335] K. Weber. Random spread of information, random graph processes and random walks. In M. Karoński, J. Jaworski, and A. Ruciński, editors, Random Graphs '87, pages 361–366. Wiley, 1990.

[336] H.S. Wilf. The editor's corner: The white screen problem. Amer. Math. Monthly, 96:704–707, 1989.

[337] H.S. Wilf. Computer-generated proofs of 54 binomial coefficient identities. U. Penn., 1990.

[338] D.B. Wilson. Mixing times of lozenge tiling and card shuffling Markov chains. Ann. Appl. Probab., 14:274–325, 2004.

[339] W. Woess. Random Walks on Infinite Graphs and Groups. Number 138 in Cambridge Tracts Math. Cambridge Univ. Press, 2000.

[340] O. Yaron. Random walks on trees. Technical report, Hebrew University, Jerusalem, 1988.

[341] D. Zuckerman. Covering times of random walks on bounded degree trees and other graphs. J. Theoretical Probab., 2:147–157, 1989.

[342] D. Zuckerman. On the time to traverse all edges in a graph. Information Proc. Letters, 38:335–337, 1991.

[343] D. Zuckerman. A technique for lower bounding the cover time. SIAM J. Discrete Math., 5:81–87, 1992.
