Rig Notes 17

High Dimensional Statistics
Lecture Notes
(This version: November 5, 2019)
Philippe Rigollet and Jan-Christian Hütter
Spring 2017
Preface
These lecture notes were written for the course 18.657, High Dimensional
Statistics at MIT. They build on a set of notes that was prepared at Prince-
ton University in 2013-14 that was modified (and hopefully improved) over the
years.
Over the past decade, statistics have undergone drastic changes with the
development of high-dimensional statistical inference. Indeed, on each indi-
vidual, more and more features are measured to a point that their number
usually far exceeds the number of observations. This is the case in biology and
specifically genetics where millions of (combinations of) genes are measured
for a single individual. High resolution imaging, finance, online advertising,
climate studies . . . the list of intensive data producing fields is too long to be
established exhaustively. Clearly not all measured features are relevant for a
given task and most of them are simply noise. But which ones? What can be
done with so little data and so much noise? Surprisingly, the situation is not
that bad and on some simple models we can assess to which extent meaningful
statistical methods can be applied. Regression is one such simple model.
Regression analysis can be traced back to 1632 when Galileo Galilei used
a procedure to infer a linear relationship from noisy data. It was not until
the early 19th century that Gauss and Legendre developed a systematic pro-
cedure: the least-squares method. Since then, regression has been studied
in so many forms that much insight has been gained and recent advances on
high-dimensional statistics would not have been possible without standing on
the shoulders of giants. In these notes, we will explore one, obviously subjec-
tive giant on whose shoulders high-dimensional statistics stand: nonparametric
statistics.
The works of Ibragimov and Has’minskii in the seventies followed by many
researchers from the Russian school have contributed to developing a large
toolkit to understand regression with an infinite number of parameters. Much
insight from this work can be gained to understand high-dimensional or sparse
regression and it comes as no surprise that Donoho and Johnstone have made
the first contributions on this topic in the early nineties.
Therefore, while not obviously connected to high dimensional statistics, we
will talk about nonparametric estimation. I borrowed this disclaimer (and the
template) from my colleague Ramon van Handel. It does apply here.
I have no illusions about the state of these notes—they were written
i
Preface ii
rather quickly, sometimes at the rate of a chapter a week. I have

no doubt that many errors remain in the text; at the very least
many of the proofs are extremely compact, and should be made a
little clearer as is befitting of a pedagogical (?) treatment. If I have
another opportunity to teach such a course, I will go over the notes
again in detail and attempt the necessary modifications. For the
time being, however, the notes are available as-is.
As any good set of notes, they should be perpetually improved and updated
but a two or three year horizon is more realistic. Therefore, if you have any
comments, questions, suggestions, omissions, and of course mistakes, please let
me know. I can be contacted by e-mail at rigollet@math.mit.edu.
Acknowledgements. These notes were improved thanks to the careful read-
ing and comments of Mark Cerenzia, Youssef El Moujahid, Georgina Hall,
Gautam Kamath, Hengrui Luo, Kevin Lin, Ali Makhdoumi, Yaroslav Mukhin,
Mehtaab Sawhney, Ludwig Schmidt, Bastian Schroeter, Vira Semenova, Math-
ias Vetter, Yuyan Wang, Jonathan Weed, Chiyuan Zhang and Jianhao Zhang.
These notes were written under the partial support of the National Science
Foundation, CAREER award DMS-1053987.
Required background. I assume that the reader has had basic courses in
probability and mathematical statistics. Some elementary background in anal-
ysis and measure theory is helpful but not required. Some basic notions of
linear algebra, especially spectral decomposition of matrices is required for the
later chapters.
Since the first version of these notes was posted a couple of manuscripts on
high-dimensional probability by Ramon van Handel [vH17] and Roman Ver-
shynin [Ver18] were published. Both are of outstanding quality—much higher
than the present notes—and very related to this material. I strongly recom-
mend the reader to learn about this fascinating topic in parallel with high-
dimensional statistics.
Notation
Functions, sets, vectors

[n] Set of integers [n] = {1, . . . , n}
S d−1 Unit sphere in dimension d
1I( · ) Indicator function
q1
|xi |q
P
|x|q `q norm of x defined by |x|q = i for q > 0
|x|0 `0 norm of x defined to be the number of nonzero coordinates of x
(k)
f k-th derivative of f
ej j-th vector of the canonical basis
c
A complement of set A
conv(S) Convex hull of set S.
a n . bn an ≤ Cbn for a numerical constant C > 0
Sn symmetric group on n elements
Matrices
Ip Identity matrix of IRp
Tr(A) trace of a square matrix A
Sd Symmetric matrices in IRd×d
Sd+ Symmetric positive semi-definite matrices in IRd×d
Sd++ Symmetric positive definite matrices in IRd×d
AB Order relation given by B − A ∈ S +
A≺B Order relation given by B − A ∈ S ++
M† Moore-Penrose pseudoinverse of M
∇x f (x) Gradient of f at x
∇x f (x)|x=x0 Gradient of f at x0
Distributions
iii
Preface iv
N (µ, σ 2 ) Univariate Gaussian distribution with mean µ ∈ IR and variance σ 2 > 0

Nd (µ, Σ) d-variate distribution with mean µ ∈ IRd and covariance matrix Σ ∈ IRd×d
subG(σ 2 ) Univariate sub-Gaussian distributions with variance proxy σ 2 > 0
subGd (σ 2 ) d-variate sub-Gaussian distributions with variance proxy σ 2 > 0
subE(σ 2 ) sub-Exponential distributions with variance proxy σ 2 > 0
Ber(p) Bernoulli distribution with parameter p ∈ [0, 1]
Bin(n, p) Binomial distribution with parameters n ≥ 1, p ∈ [0, 1]
Lap(λ) Double exponential (or Laplace) distribution with parameter λ > 0
PX Marginal distribution of X
Function spaces
W (β, L) Sobolev class of functions
Θ(β, Q) Sobolev ellipsoid of `2 (IN)
Contents
Preface i
Notation iii
Contents v
Introduction 1
1 Sub-Gaussian Random Variables 14

1.1 Gaussian tails and MGF . . . . . . . . . . . . . . . . . . . . . . 14
1.2 Sub-Gaussian random variables and Chernoff bounds . . . . . . 16
1.3 Sub-exponential random variables . . . . . . . . . . . . . . . . . 22
1.4 Maximal inequalities . . . . . . . . . . . . . . . . . . . . . . . . 25
1.5 Sums of independent random matrices . . . . . . . . . . . . . . 30
1.6 Problem set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 Linear Regression Model 40

2.1 Fixed design linear regression . . . . . . . . . . . . . . . . . . . 40
2.2 Least squares estimators . . . . . . . . . . . . . . . . . . . . . . 42
2.3 The Gaussian Sequence Model . . . . . . . . . . . . . . . . . . 49
2.4 High-dimensional linear regression . . . . . . . . . . . . . . . . 54
2.5 Problem set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3 Misspecified Linear Models 72

3.1 Oracle inequalities . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.2 Nonparametric regression . . . . . . . . . . . . . . . . . . . . . 81
3.3 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4 Minimax Lower Bounds 92

4.1 Optimality in a minimax sense . . . . . . . . . . . . . . . . . . 92
4.2 Reduction to finite hypothesis testing . . . . . . . . . . . . . . 94
4.3 Lower bounds based on two hypotheses . . . . . . . . . . . . . 95
4.4 Lower bounds based on many hypotheses . . . . . . . . . . . . 101
4.5 Application to the Gaussian sequence model . . . . . . . . . . . 104
4.6 Lower bounds for sparse estimation via χ2 divergence . . . . . 109
v
Contents vi
4.7 Lower bounds for estimating the `1 norm via moment matching 112
4.8 Problem Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5 Matrix estimation 121

5.1 Basic facts about matrices . . . . . . . . . . . . . . . . . . . . . 121
5.2 Multivariate regression . . . . . . . . . . . . . . . . . . . . . . . 124
5.3 Covariance matrix estimation . . . . . . . . . . . . . . . . . . . 132
5.4 Principal component analysis . . . . . . . . . . . . . . . . . . . 134
5.5 Graphical models . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.6 Problem set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
Bibliography 157
Introduction
This course is mainly about learning a regression function from a collection

of observations. In this chapter, after defining this task formally, we give
an overview of the course and the questions around regression. We adopt
the statistical learning point of view where the task of prediction prevails.
Nevertheless many interesting questions will remain unanswered when the last
page comes: testing, model selection, implementation,. . .
REGRESSION ANALYSIS AND PREDICTION RISK
Model and definitions

Let (X, Y ) ∈ X ×Y where X is called feature and lives in a topological space X
and Y ∈ Y ⊂ IR is called response or sometimes label when Y is a discrete set,
e.g., Y = {0, 1}. Often X ⊂ IRd , in which case X is called vector of covariates
or simply covariate. Our goal will be to predict Y given X and for our problem
to be meaningful, we need Y to depend non-trivially on X. Our task would be
done if we had access to the conditional distribution of Y given X. This is the
world of the probabilist. The statistician does not have access to this valuable
information but rather has to estimate it, at least partially. The regression
function gives a simple summary of this conditional distribution, namely, the
conditional expectation.
Formally, the regression function of Y onto X is defined by:
f (x) = IE[Y |X = x] , x∈X.
As we will see, it arises naturally in the context of prediction.
Best prediction and prediction risk

Suppose for a moment that you know the conditional distribution of Y given
X. Given the realization of X = x, your goal is to predict the realization of
Y . Intuitively, we would like to find a measurable1 function g : X → Y such
that g(X) is close to Y , in other words, such that |Y − g(X)| is small. But
|Y − g(X)| is a random variable, so it not clear what “small” means in this
context. A somewhat arbitrary answer can be given by declaring a random
1 all topological spaces are equipped with their Borel σ-algebra
1
Introduction 2
variable Z small if IE[Z 2 ] = [IEZ]2 + var[Z] is small. Indeed in this case, the
expectation of Z is small and the fluctuations of Z around this value are also
small. The function R(g) = IE[Y − g(X)]2 is called the L2 risk of g defined for
IEY 2 < ∞.
For any measurable function g : X → IR, the L2 risk of g can be decom-
posed as
IE[Y − g(X)]2 = IE[Y − f (X) + f (X) − g(X)]2

= IE[Y − f (X)]2 + IE[f (X) − g(X)]2 + 2IE[Y − f (X)][f (X) − g(X)]
The cross-product term satisfies

IE[Y − f (X)][f (X) − g(X)] = IE IE [Y − f (X)][f (X) − g(X)] X

= IE [IE(Y |X) − f (X)][f (X) − g(X)]

= IE [f (X) − f (X)][f (X) − g(X)] = 0 .
The above two equations yield
IE[Y − g(X)]2 = IE[Y − f (X)]2 + IE[f (X) − g(X)]2 ≥ IE[Y − f (X)]2 ,
with equality iff f (X) = g(X) almost surely.

We have proved that the regression function f (x) = IE[Y |X = x], x ∈ X ,
enjoys the best prediction property, that is
IE[Y − f (X)]2 = inf IE[Y − g(X)]2 ,

g
where the infimum is taken over all measurable functions g : X → IR.
Prediction and estimation

As we said before, in a statistical problem, we do not have access to the condi-
tional distribution of Y given X or even to the regression function f of Y onto
X. Instead, we observe a sample Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} that consists
of independent copies of (X, Y ). The goal of regression function estimation is
to use this data to construct an estimator fˆn : X → Y that has small L2 risk
R(fˆn ).
Let PX denote the marginal distribution of X and for any h : X → IR,
define Z
khk22 = h2 dPX .
X
Note that khk22 is the Hilbert norm associated to the inner product
Z
hh, h0 i2 = hh0 dPX .
X
When the reference measure is clear from the context, we will simply write
khk2 = khkL2 (PX ) and hh, h0 i2 := hh, h0 iL2 (PX ) .
Introduction 3
It follows from the proof of the best prediction property above that
R(fˆn ) = IE[Y − f (X)]2 + kfˆn − f k22

= inf IE[Y − g(X)]2 + kfˆn − f k2
2
g
In particular, the prediction risk will always be at least equal to the positive
constant IE[Y − f (X)]2 . Since we tend to prefer a measure of accuracy to be
able to go to zero (as the sample size increases), it is equivalent to study the
estimation error kfˆn − f k22 . Note that if fˆn is random, then kfˆn − f k22 and
R(fˆn ) are random quantities and we need deterministic summaries to quantify
their size. It is customary to use one of the two following options. Let {φn }n
be a sequence of positive numbers that tends to zero as n goes to infinity.
1. Bounds in expectation. They are of the form:
IEkfˆn − f k22 ≤ φn ,
where the expectation is taken with respect to the sample Dn . They

indicate the average behavior of the estimator over multiple realizations
of the sample. Such bounds have been established in nonparametric
statistics where typically φn = O(n−α ) for some α ∈ (1/2, 1) for example.
Note that such bounds do not characterize the size of the deviation of the
random variable kfˆn − f k22 around its expectation. As a result, it may be
therefore appropriate to accompany such a bound with the second option
below.
2. Bounds with high-probability. They are of the form:
IP kfˆn − f k22 > φn (δ) ≤ δ , ∀δ ∈ (0, 1/3) .

Here 1/3 is arbitrary and can be replaced by another positive constant.

Such bounds control the tail of the distribution of kfˆn − f k22 . They show
how large the quantiles of the random variable kf − fˆn k22 can be. Such
bounds are favored in learning theory, and are sometimes called PAC-
bounds (for Probably Approximately Correct).
Often, bounds with high probability follow from a bound in expectation and a
concentration inequality that bounds the following probability
IP kfˆn − f k22 − IEkfˆn − f k22 > t

by a quantity that decays to zero exponentially fast. Concentration of measure

is a fascinating but wide topic and we will only briefly touch it. We recommend
the reading of [BLM13] to the interested reader. This book presents many
aspects of concentration that are particularly well suited to the applications
covered in these notes.
Introduction 4
Other measures of error

We have chosen the L2 risk somewhat arbitrarily. Why not the Lp risk defined
by g 7→ IE|Y − g(X)|p for some p ≥ 1? The main reason for choosing the L2
risk is that it greatly simplifies the mathematics of our problem: it is a Hilbert
space! In particular, for any estimator fˆn , we have the remarkable identity:
R(fˆn ) = IE[Y − f (X)]2 + kfˆn − f k22 .
This equality allowed us to consider only the part kfˆn − f k22 as a measure of
error. While this decomposition may not hold for other risk measures, it may
be desirable to explore other distances (or pseudo-distances). This leads to two
distinct ways to measure error. Either by bounding a pseudo-distance d(fˆn , f )
(estimation error ) or by bounding the risk R(fˆn ) for choices other than the L2
risk. These two measures coincide up to the additive constant IE[Y − f (X)]2
in the case described above. However, we show below that these two quantities
may live independent lives. Bounding the estimation error is more customary
in statistics whereas risk bounds are preferred in learning theory.
Here is a list of choices for the pseudo-distance employed in the estimation
error.
• Pointwise error. Given a point x0 , the pointwise error measures only
the error at this point. It uses the pseudo-distance:
d0 (fˆn , f ) = |fˆn (x0 ) − f (x0 )| .
• Sup-norm error. Also known as the L∞ -error and defined by
d∞ (fˆn , f ) = sup |fˆn (x) − f (x)| .

x∈X
It controls the worst possible pointwise error.

• Lp -error. It generalizes both the L2 distance and the sup-norm error by
taking for any p ≥ 1, the pseudo distance
Z
dp (fˆn , f ) = |fˆn − f |p dPX .
X
The choice of p is somewhat arbitrary and mostly employed as a mathe-

matical exercise.
Note that these three examples can be split into two families: global (Sup-norm
and Lp ) and local (pointwise).
For specific problems, other considerations come into play. For example,
if Y ∈ {0, 1} is a label, one may be interested in the classification risk of a
classifier h : X → {0, 1}. It is defined by
R(h) = IP(Y 6= h(X)) .

Introduction 5
We will not cover this problem in this course.

Finally, we will devote a large part of these notes to the study of linear
models. For such models, X = IRd and f is linear (or affine), i.e., f (x) = x> θ
for some unknown θ ∈ IRd . In this case, it is traditional to measure error
directly on the coefficient θ. For example, if fˆn (x) = x> θ̂n is a candidate
linear estimator, it is customary to measure the distance of fˆn to f using a
(pseudo-)distance between θ̂n and θ as long as θ is identifiable.
MODELS AND METHODS
Empirical risk minimization

In our considerations on measuring the performance of an estimator fˆn , we
have carefully avoided the question of how to construct fˆn . This is of course
one of the most important task of statistics. As we will see, it can be carried
out in a fairly mechanical way by following one simple principle: Empirical
Risk Minimization (ERM2 ). Indeed, an overwhelming proportion of statistical
methods consist inP replacing an (unknown) expected value (IE) by a (known)
n
empirical mean ( n1 i=1 ). For example, it is well known that a good candidate
to estimate the expected value IEX of a random variable X from a sequence of
i.i.d copies X1 , . . . , Xn of X, is their empirical average
n
1X
X̄ = Xi .
n i=1
In many instances, it corresponds to the maximum likelihood estimator of IEX.

Another example is the sample variance, where IE(X − IE(X))2 is estimated by
n
1X
(Xi − X̄)2 .
n i=1
It turns out that this principle can be extended even if an optimization follows
the substitution. Recall that the L2 risk is defined by R(g) = IE[Y −g(X)]2 . See
the expectation? Well, it can be replaced by an average to form the empirical
risk of g defined by
n
1X 2
Rn (g) = Yi − g(Xi ) .
n i=1
We can now proceed to minimizing this risk. However, we have to be careful.
Indeed, Rn (g) ≥ 0 for all g. Therefore any function g such that Yi = g(Xi ) for
all i = 1, . . . , n is a minimizer of the empirical risk. Yet, it may not be the best
choice (Cf. Figure 1). To overcome this limitation, we need to leverage some
prior knowledge on f : either it may belong to a certain class G of functions (e.g.,
linear functions) or it is smooth (e.g., the L2 -norm of its second derivative is
2 ERM may also mean Empirical Risk Minimizer
Introduction 6
1.5
1.5
1.5
1.0
1.0
1.0
0.5
0.5
0.5
0.0
0.0
0.0
y
y
−0.5
−0.5
−0.5
−1.0
−1.0
−1.0
−1.5
−1.5
−1.5
−2.0
−2.0
−2.0
0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8
x x x
Figure 1. It may not be the best choice idea to have fˆn (Xi ) = Yi for all i = 1, . . . , n.
small). In both cases, this extra knowledge can be incorporated to ERM using
either a constraint:
min Rn (g)
g∈G
or a penalty: n o
min Rn (g) + pen(g) ,
g
or both n o
min Rn (g) + pen(g) ,
g∈G
These schemes belong to the general idea of regularization. We will see many
variants of regularization throughout the course.
Unlike traditional (low dimensional) statistics, computation plays a key role
in high-dimensional statistics. Indeed, what is the point of describing an esti-
mator with good prediction properties if it takes years to compute it on large
datasets? As a result of this observation, much of the modern estimators, such
as the Lasso estimator for sparse linear regression can be computed efficiently
using simple tools from convex optimization. We will not describe such algo-
rithms for this problem but will comment on the computability of estimators
when relevant.
In particular computational considerations have driven the field of com-
pressed sensing that is closely connected to the problem of sparse linear regres-
sion studied in these notes. We will only briefly mention some of the results and
refer the interested reader to the book [FR13] for a comprehensive treatment.
Introduction 7
Linear models
When X = IRd , an all time favorite constraint G is the class of linear functions
that are of the form g(x) = x> θ, that is parametrized by θ ∈ IRd . Under
this constraint, the estimator obtained by ERM is usually called least squares
estimator and is defined by fˆn (x) = x> θ̂, where
n
1X
θ̂ ∈ argmin (Yi − Xi> θ)2 .
θ∈IRd n i=1
Note that θ̂ may not be unique. In the case of a linear model, where we assume
that the regression function is of the form f (x) = x> θ∗ for some unknown
θ∗ ∈ IRd , we will need assumptions to ensure identifiability if we want to prove
bounds on d(θ̂, θ∗ ) for some specific pseudo-distance d(· , ·). Nevertheless, in
other instances such as regression with fixed design, we can prove bounds on
the prediction error that are valid for any θ̂ in the argmin. In the latter case,
we will not even require that f satisfies the linear model but our bound will
be meaningful only if f can be well approximated by a linear function. In this
case, we talk about a misspecified model, i.e., we try to fit a linear model to
data that may not come from a linear model. Since linear models can have
good approximation properties especially when the dimension d is large, our
hope is that the linear model is never too far from the truth.
In the case of a misspecified model, there is no hope to drive the estimation
error d(fˆn , f ) down to zero even with a sample size that tends to infinity.
Rather, we will pay a systematic approximation error. When G is a linear
subspace as above, and the pseudo distance is given by the squared L2 norm
d(fˆn , f ) = kfˆn − f k22 , it follows from the Pythagorean theorem that
kfˆn − f k22 = kfˆn − f¯k22 + kf¯ − f k22 ,
where f¯ is the projection of f onto the linear subspace G. The systematic

approximation error is entirely contained in the deterministic term kf¯ − f k22
and one can proceed to bound kfˆn − f¯k22 by a quantity that goes to zero as n
goes to infinity. In this case, bounds (e.g., in expectation) on the estimation
error take the form
IEkfˆn − f k22 ≤ kf¯ − f k22 + φn .
The above inequality is called an oracle inequality. Indeed, it says that if φn
is small enough, then fˆn the estimator mimics the oracle f¯. It is called “oracle”
because it cannot be constructed without the knowledge of the unknown f . It
is clearly the best we can do when we restrict our attention to estimator in the
class G. Going back to the gap in knowledge between a probabilist who knows
the whole joint distribution of (X, Y ) and a statistician who only sees the data,
the oracle sits somewhere in-between: it can only see the whole distribution
through the lens provided by the statistician. In the case above, the lens is
that of linear regression functions. Different oracles are more or less powerful
and there is a tradeoff to be achieved. On the one hand, if the oracle is weak,
Introduction 8
then it’s easy for the statistician to mimic it but it may be very far from the
true regression function; on the other hand, if the oracle is strong, then it is
harder to mimic but it is much closer to the truth.
Oracle inequalities were originally developed as analytic tools to prove adap-
tation of some nonparametric estimators. With the development of aggregation
[Nem00, Tsy03, Rig06] and high dimensional statistics [CT07, BRT09, RT11],
they have become important finite sample results that characterize the inter-
play between the important parameters of the problem.
In some favorable instances, that is when the Xi s enjoy specific properties,
it is even possible to estimate the vector θ accurately, as is done in parametric
statistics. The techniques employed for this goal will essentially be the same
as the ones employed to minimize the prediction risk. The extra assumptions
on the Xi s will then translate in interesting properties on θ̂ itself, including
uniqueness on top of the prediction properties of the function fˆn (x) = x> θ̂.
High dimension and sparsity

These lecture notes are about high dimensional statistics and it is time they
enter the picture. By high dimension, we informally mean that the model has
more “parameters” than there are observations. The word “parameter” is used
here loosely and a more accurate description is perhaps degrees of freedom. For
example, the linear model f (x) = x> θ∗ has one parameter θ∗ but effectively d
degrees of freedom when θ∗ ∈ IRd . The notion of degrees of freedom is actually
well defined in the statistical literature but the formal definition does not help
our informal discussion here.
As we will see in Chapter 2, if the regression function is linear f (x) = x> θ∗ ,
∗
θ ∈ IRd , and under some assumptions on the marginal distribution of X, then
the least squares estimator fˆn (x) = x> θ̂n satisfies
d
IEkfˆn − f k22 ≤ C , (1)
n
where C > 0 is a constant and in Chapter 4, we will show that this cannot
be improved apart perhaps for a smaller multiplicative constant. Clearly such
a bound is uninformative if d n and actually, in view of its optimality,
we can even conclude that the problem is too difficult statistically. However,
the situation is not hopeless if we assume that the problem has actually less
degrees of freedom than it seems. In particular, it is now standard to resort to
the sparsity assumption to overcome this limitation.
A vector θ ∈ IRd is said to be k-sparse for some k ∈ {0, . . . , d} if it has
at most k non-zero coordinates. We denote by |θ|0 the number of nonzero
coordinates of θ, which is also known as sparsity or “`0 -norm” though it is
clearly not a norm (see footnote 3). Formally, it is defined as
d
X
|θ|0 = 1I(θj 6= 0) .
j=1
Introduction 9
Sparsity is just one of many ways to limit the size of the set of potential
θ vectors to consider. One could consider vectors θ that have the following
structure for example (see Figure 2):
• Monotonic: θ1 ≥ θ2 ≥ · · · ≥ θd
• Smooth: |θi − θj | ≤ C|i − j|α for some α > 0
Pd−1
• Piecewise constant: j=1 1I(θj+1 6= θj ) ≤ k
• Structured in another basis: θ = Ψµ, for some orthogonal matrix and µ
is in one of the structured classes described above.
Sparsity plays a significant role in statistics because, often, structure trans-
lates into sparsity in a certain basis. For example, a smooth function is sparse
in the trigonometric basis and a piecewise constant function has sparse incre-
ments. Moreover, real images are approximately sparse in certain bases such as
wavelet or Fourier bases. This is precisely the feature exploited in compression
schemes such as JPEG or JPEG-2000: only a few coefficients in these images
are necessary to retain the main features of the image.
We say that θ is approximately sparse if |θ|0 may be as large as d but many
coefficients |θj | are small rather than exactly equal to zero. There are several
mathematical ways to capture this phenomena, including `q -“balls” for q ≤ 1.
For q > 0, the unit `q -ball of IRd is defined as
d
X
Bq (R) = θ ∈ IRd : |θ|qq = |θj |q ≤ 1

j=1
where |θ|q is often called `q -norm3 . As we will see, the smaller q is, the better
vectors in the unit `q ball can be approximated by sparse vectors.
Pk
Note that the set of k-sparse vectors of IRd is a union of j=0 dj linear

subspaces with dimension at most k and that are spanned by at most k vectors
in the canonical basis of IRd . If we knew that θ∗ belongs to one of these
subspaces, we could simply drop irrelevant coordinates and obtain an oracle
inequality such as (1), with d replaced by k. Since we do not know what
subspace θ∗ lives exactly, we will have to pay an extra term to find in which
subspace θ∗ lives. It turns out that this term is exactly of the order of
P
k d
log k log ed

j=0 j k
'C
n n
Therefore, the price to pay for not knowing which subspace to look at is only
a logarithmic factor.
3 Strictly speaking, |θ|q is a norm and the `q ball is a ball only for q ≥ 1.
Introduction 10
y1 Monotone Smooth
y2
x x
Constant Basis
y3
y4
x x
Figure 2. Examples of structures vectors θ ∈ IR50
Nonparametric regression
Nonparametric does not mean that there is no parameter to estimate (the
regression function is a parameter) but rather that the parameter to estimate
is infinite dimensional (this is the case of a function). In some instances, this
parameter can be identified with an infinite sequence of real numbers, so that
we are still in the realm of countable infinity. Indeed, observe that since L2 (PX )
equipped with the inner product h· , ·i2 is a separable Hilbert space, it admits an
orthonormal basis {ϕk }k∈Z and any function f ∈ L2 (PX ) can be decomposed
as X
f= αk ϕk ,
k∈Z
where αk = hf, ϕk i2 .
Therefore estimating a regression function f amounts to estimating the
infinite sequence {αk }k∈Z ∈ `2 . You may argue (correctly) that the basis
{ϕk }k∈Z is also unknown as it depends on the unknown PX . This is absolutely
Introduction 11
correct but we will make the convenient assumption that PX is (essentially)

known whenever this is needed.
Even if infinity is countable, we still have to estimate an infinite number
of coefficients using a finite number of observations. It does not require much
statistical intuition to realize that this task is impossible in general. What if
we know something about the sequence {αk }k ? For example, if we know that
αk = 0 for |k| > k0 , then there are only 2k0 + 1 parameters to estimate (in
general, one would also have to “estimate” k0 ). In practice, we will not exactly
see αk = 0 for |k| > k0 , but rather that the sequence {αk }k decays to 0 at
a certain polynomial rate. For example |αk | ≤ C|k|−γ for some γ > 1/2 (we
need this sequence to be in `2 ). It corresponds to a smoothness assumption on
the function f . In this case, the sequence {αk }k can be well approximated by
a sequence with only a finite number of non-zero terms.
We can view this problem as a misspecified model. Indeed, for any cut-off
k0 , define the oracle X
f¯k0 = αk ϕk .
|k|≤k0
Note that it depends on the unknown αk and define the estimator

X
fˆn = α̂k ϕk ,
|k|≤k0
where α̂k are some data-driven coefficients (obtained by least-squares for ex-
ample). Then by the Pythagorean theorem and Parseval’s identity, we have
kfˆn − f k22 = kf¯ − f k22 + kfˆn − f¯k22

X X
= αk2 + (α̂k − αk )2
|k|>k0 |k|≤k0
We can even work further on this oracle inequality using the fact that |αk | ≤
C|k|−γ . Indeed, we have4
X X
αk2 ≤ C 2 k −2γ ≤ Ck01−2γ .
|k|>k0 |k|>k0
− αk )2 clearly increases with k0

P
The so called stochastic term IE |k|≤k0 (α̂k
(more parameters to estimate) whereas the approximation term Ck01−2γ de-
creases with k0 (less terms discarded). We will see that we can strike a com-
promise called bias-variance tradeoff.
The main difference between this and oracle inequalities is that we make
assumptions on the regression function (here in terms of smoothness) in order
4 Here we illustrate a convenient notational convention that we will be using through-
out these notes: a constant C may be different from line to line. This will not affect the
interpretation of our results since we are interested in the order of magnitude of the error
bounds. Nevertheless we will, as much as possible, try to make such constants explicit. As
an exercise, try to find an expression of the second C as a function of the first one and of γ.
Introduction 12
to control the approximation error. Therefore oracle inequalities are more

general but can be seen on the one hand as less quantitative. On the other
hand, if one is willing to accept the fact that approximation error is inevitable
then there is no reason to focus on it. This is not the final answer to this rather
philosophical question. Indeed, choosing the right k0 can only be done with
a control of the approximation error. Indeed, the best k0 will depend on γ.
We will see that even if the smoothness index γ is unknown, we can select k0
in a data-driven way that achieves almost the same performance as if γ were
known. This phenomenon is called adaptation (to γ).
It is important to notice the main difference between the approach taken in
nonparametric regression and the one in sparse linear regression. It is not so
much about linear vs. non-linear models as we can always first take nonlinear
transformations of the xj ’s in linear regression. Instead, sparsity or approx-
imate sparsity is a much weaker notion than the decay of coefficients {αk }k
presented above. In a way, sparsity only imposes that after ordering the coef-
ficients present a certain decay, whereas in nonparametric statistics, the order
is set ahead of time: we assume that we have found a basis that is ordered in
such a way that coefficients decay at a certain rate.
Matrix models
In the previous examples, the response variable is always assumed to be a scalar.
What if it is a higher dimensional signal? In Chapter 5, we consider various
problems of this form: matrix completion a.k.a. the Netflix problem, structured
graph estimation and covariance matrix estimation. All these problems can be
described as follows.
Let M, S and N be three matrices, respectively called observation, signal
and noise, and that satisfy
M =S+N.
Here N is a random matrix such that IE[N ] = 0, the all-zero matrix. The goal
is to estimate the signal matrix S from the observation of M .
The structure of S can also be chosen in various ways. We will consider the
case where S is sparse in the sense that it has many zero coefficients. In a way,
this assumption does not leverage much of the matrix structure and essentially
treats matrices as vectors arranged in the form of an array. This is not the case
of low rank structures where one assumes that the matrix S has either low rank
or can be well approximated by a low rank matrix. This assumption makes
sense in the case where S represents user preferences as in the Netflix example.
In this example, the (i, j)th coefficient Sij of S corresponds to the rating (on a
scale from 1 to 5) that user i gave to movie j. The low rank assumption simply
materializes the idea that there are a few canonical profiles of users and that
each user can be represented as a linear combination of these users.
At first glance, this problem seems much more difficult than sparse linear
regression. Indeed, one needs to learn not only the sparse coefficients in a given
Introduction 13
basis, but also the basis of eigenvectors. Fortunately, it turns out that the latter
task is much easier and is dominated by the former in terms of statistical price.
Another important example of matrix estimation is high-dimensional co-
variance estimation, where the goal is to estimate the covariance matrix of a
random vector X ∈ IRd , or its leading eigenvectors, based on n observations.
Such a problem has many applications including principal component analysis,
linear discriminant analysis and portfolio optimization. The main difficulty is
that n may be much smaller than the number of degrees of freedom in the
covariance matrix, which can be of order d2 . To overcome this limitation,
assumptions on the rank or the sparsity of the matrix can be leveraged.
Optimality and minimax lower bounds

So far, we have only talked about upper bounds. For a linear model, where
f (x) = x> θ∗ , we will prove in Chapter 2 the following bound for a modified
least squares estimator fˆn = x> θ̂
d
IEkfˆn − f k22 ≤ C .
n
Is this the right dependence in d and n? Would it√be possible to obtain as an
upper bound something like C(log d)/n, C/n or d/n2 , by either improving
our proof technique or using another estimator altogether? It turns out that
the answer to this question is negative. More precisely, we can prove that for
any estimator f˜n , there exists a function f of the form f (x) = x> θ∗ such that
d
IEkfˆn − f k22 > c
n
for some positive constant c. Here we used a different notation for the constant
to emphasize the fact that lower bounds guarantee optimality only up to a
constant factor. Such a lower bound on the risk is called minimax lower bound
for reasons that will become clearer in chapter 4.
How is this possible? How can we make a statement for all estimators?
We will see that these statements borrow from the theory of tests where we
know that it is impossible to drive both the type I and the type II error to
zero simultaneously (with a fixed sample size). Intuitively this phenomenon
is related to the following observation: Given n observations X1 , . . . , Xn , it is
hard to tell if they are distributed according to N (θ, 1) or to N (θ0 , 1) for a
Euclidean distance |θ − p θ0 |2 is small enough. We will see that it is the case for
0
example if |θ − θ |2 ≤ C d/n, which will yield our lower bound.
Chapter
1
Sub-Gaussian Random Variables
1.1 GAUSSIAN TAILS AND MGF
Recall that a random variable X ∈ IR has Gaussian distribution iff it has a

density p with respect to the Lebesgue measure on IR given by
1 (x − µ)2
p(x) = √ exp − , x ∈ IR ,
2πσ 2 2σ 2
where µ = IE(X) ∈ IR and σ 2 = var(X) > 0 are the mean and variance of
X. We write X ∼ N (µ, σ 2 ). Note that X = σZ + µ for Z ∼ N (0, 1) (called
standard Gaussian) and where the equality holds in distribution. Clearly, this
distribution has unbounded support but it is well known that it has almost
bounded support in the following sense: IP(|X − µ| ≤ 3σ) ' 0.997. This is due
to the fast decay of the tails of p as |x| → ∞ (see Figure 1.1). This decay can
be quantified using the following proposition (Mill’s inequality).
Proposition 1.1. Let X be a Gaussian random variable with mean µ and

variance σ 2 then for any t > 0, it holds
t2
σ e− 2σ2
IP(X − µ > t) ≤ √ .
2π t
By symmetry we also have
t2
σ e− 2σ2
IP(X − µ < −t) ≤ √
2π t
14
1.1. Gaussian tails and MGF 15
68%
95%
99.7%
µ − 3σ µ − 2σ µ−σ µ µ+σ µ + 2σ µ + 3σ
Figure 1.1. Probabilities of falling within 1, 2, and 3 standard deviations close to the
mean in a Gaussian distribution. Source http://www.openintro.org/
and
t2
2 e− 2σ2
r
IP(|X − µ| > t) ≤ σ .
π t
Proof. Note that it is sufficient to prove the theorem for µ = 0 and σ 2 = 1 by
simple translation and rescaling. We get for Z ∼ N (0, 1),
Z ∞ x2
1
IP(Z > t) = √ exp − dx
2π t 2
Z ∞ x2
1 x
≤√ exp − dx
2π t t 2
Z ∞ x2
1 ∂
= √ − exp − dx
t 2π t ∂x 2
1
= √ exp(−t2 /2) .
t 2π
The second inequality follows from symmetry and the last one using the union
bound:
IP(|Z| > t) = IP({Z > t} ∪ {Z < −t}) ≤ IP(Z > t) + IP(Z < −t) = 2IP(Z > t) .
The fact that a Gaussian random variable Z has tails that decay to zero
exponentially fast can also be seen in the moment generating function (MGF)
M : s 7→ M (s) = IE[exp(sZ)] .
1.2. Sub-Gaussian random variables and Chernoff bounds 16
Indeed, in the case of a standard Gaussian random variable, we have

Z
1 z2
M (s) = IE[exp(sZ)] = √ esz e− 2 dz
2π
Z
1 (z−s)2 s2
=√ e− 2 + 2 dz
2π
s2
=e 2 .
σ 2 s2

It follows that if X ∼ N (µ, σ 2 ), then IE[exp(sX)] = exp sµ + 2 .
1.2 SUB-GAUSSIAN RANDOM VARIABLES AND CHERNOFF BOUNDS
Definition and first properties

Gaussian tails are practical when controlling the tail of an average of indepen-
dent random variables. Indeed,Pn recall that if X1 , . . . , Xn are
iid N (µ, σ 2 ), then X̄ = n1 i=1 Xi ∼ N (µ, σ 2 /n). Using Lemma 1.3 below for
example, we get
nt2
IP(|X̄ − µ| > t) ≤ 2 exp − 2 .
2σ
Equating the right-hand side with some confidence level δ > 0, we find that
with probability at least1 1 − δ,
r r
h 2 log(2/δ) 2 log(2/δ) i
µ ∈ X̄ − σ , X̄ + σ , (1.1)
n n
This is almost the confidence interval that you used in introductory statistics.
The only difference is that we used an approximation for the Gaussian tail
whereas statistical tables or software use a much more accurate computation.
Figure 1.2 shows the ratio of the width of the confidence interval to that of the
confidence interval computed by the software R. It turns out that intervals of
the same form can be also derived for non-Gaussian random variables as long
as they have sub-Gaussian tails.
Definition 1.2. A random variable X ∈ IR is said to be sub-Gaussian with
variance proxy σ 2 if IE[X] = 0 and its moment generating function satisfies
σ 2 s2
IE[exp(sX)] ≤ exp , ∀ s ∈ IR . (1.2)
2
In this case we write X ∼ subG(σ 2 ). Note that subG(σ 2 ) denotes a class of
distributions rather than a distribution. Therefore, we abuse notation when
writing X ∼ subG(σ 2 ).
More generally, we can talk about sub-Gaussian random vectors and matri-
ces. A random vector X ∈ IRd is said to be sub-Gaussian with variance proxy
1 We will often commit the statement “at least” for brevity
7
6
5
y
4
3
0 5 10 15 20
x
Figure 1.2. Width of confidence intervals from exact computation in R (red dashed)
and from the approximation (1.1) (solid black). Here x = δ and y is the width of the
confidence intervals.
σ 2 if IE[X] = 0 and u> X is sub-Gaussian with variance proxy σ 2 for any vector
u ∈ S d−1 . In this case we write X ∼ subGd (σ 2 ). Note that if X ∼ subGd (σ 2 ),
then for any v such that |v|2 ≤ 1, we have v > X ∼ subG(σ 2 ). Indeed, denoting
u = v/|v|2 ∈ S d−1 , we have
> > σ 2 s2 |v|2
2 σ 2 s2
IE[esv X
] = IE[es|v|2 u X
]≤e 2 ≤e 2 .
A random matrix X ∈ IRd×T is said to be sub-Gaussian with variance proxy
σ 2 if IE[X] = 0 and u> Xv is sub-Gaussian with variance proxy σ 2 for any unit
vectors u ∈ S d−1 , v ∈ S T −1 . In this case we write X ∼ subGd×T (σ 2 ).
This property can equivalently be expressed in terms of bounds on the tail
of the random variable X.
Lemma 1.3. Let X ∼ subG(σ 2 ). Then for any t > 0, it holds
t2 t2
IP[X > t] ≤ exp − 2 , and IP[X < −t] ≤ exp − 2 . (1.3)
2σ 2σ
Proof. Assume first that X ∼ subG(σ 2 ). We will employ a very useful tech-
nique called Chernoff bound that allows to translate a bound on the moment
generating function into a tail bound. Using Markov’s inequality, we have for
any s > 0,
sX st
IE esX
IP(X > t) ≤ IP e > e ≤ .
est
Next we use the fact that X is sub-Gaussian to get

σ 2 s2
−st
IP(X > t) ≤ e 2 .
The above inequality holds for any s > 0 so to make it the tightest possible, we
2 2
minimize with respect to s > 0. Solving φ0 (s) = 0, where φ(s) = σ 2s − st, we
t2
find that inf s>0 φ(s) = − 2σ 2 . This proves the first part of (1.3). The second
inequality in this equation follows in the same manner (recall that (1.2) holds
for any s ∈ IR).
Moments
Recall that the absolute moments of Z ∼ N (0, σ 2 ) are given by
1 k + 1
IE[|Z|k ] = √ (2σ 2 )k/2 Γ
π 2
where Γ(·) denotes the Gamma function defined by
Z ∞
Γ(t) = xt−1 e−x dx , t > 0 .
0
The next lemma shows that the tail bounds of Lemma 1.3 are sufficient to
show that the absolute moments of X ∼ subG(σ 2 ) can be bounded by those of
Z ∼ N (0, σ 2 ) up to multiplicative constants.
Lemma 1.4. Let X be a random variable such that
t2
IP[|X| > t] ≤ 2 exp − 2 ,
2σ
then for any positive integer k ≥ 1,
IE[|X|k ] ≤ (2σ 2 )k/2 kΓ(k/2) .
In particular, √
IE[|X|k ])1/k ≤ σe1/e k , k ≥ 2.
√
and IE[|X|] ≤ σ 2π .
Proof.
Z ∞
IE[|X|k ] = IP(|X|k > t)dt
0
Z ∞
= IP(|X| > t1/k )dt
0
Z ∞
t2/k
≤2 e− 2σ2 dt
0
∞
t2/k
Z
= (2σ 2 )k/2 k e−u uk/2−1 du , u=
0 2σ 2
2 k/2
= (2σ ) kΓ(k/2)
The second statement follows from Γ(k/2) ≤ (k/2)k/2 and k 1/k ≤ e1/e for any
k ≥ 2. It yields
r
2 k/2
1/k 1/k 2σ 2 k √
(2σ ) kΓ(k/2) ≤k ≤ e1/e σ k .
2
√ √
Moreover, for k = 1, we have 2Γ(1/2) = 2π.
Using moments, we can prove the following reciprocal to Lemma 1.3.
Lemma 1.5. If (1.3) holds and IE[X] = 0, then for any s > 0, it holds
2 2
IE[exp(sX)] ≤ e4σ s
.
As a result, we will sometimes write X ∼ subG(σ 2 ) when it satisfies (1.3) and

IE[X] = 0.
Proof. We use the Taylor expansion of the exponential function as follows.

Observe that by the dominated convergence theorem
∞
X sk IE[|X|k ]
IE esX ≤ 1 +

k!
k=2
∞
X (2σ 2 s2 )k/2 kΓ(k/2)
≤1+
k!
k=2
∞ ∞
X (2σ 2 s2 )k 2kΓ(k) X (2σ 2 s2 )k+1/2 (2k + 1)Γ(k + 1/2)
=1+ +
(2k)! (2k + 1)!
k=1 k=1
√ ∞
X (2σ 2 s2 )k k!
≤ 1 + 2 + 2σ 2 s2
(2k)!
k=1
∞
r
σ 2 s2 X (2σ 2 s2 )k
≤1+ 1+ 2(k!)2 ≤ (2k)!
2 k!
k=1
r
2 2 σ 2 s2 2σ2 s2
= e2σ s + (e − 1)
2
2 2
≤ e4σ s .
From the above Lemma, we see that sub-Gaussian random variables can
be equivalently defined from their tail bounds and their moment generating
functions, up to constants.
Sums of independent sub-Gaussian random variables

Recall that if X1 , . . . , Xn are
iid N (0, σ 2 ), then for any a ∈ IRn ,
n
X
ai Xi ∼ N (0, |a|22 σ 2 ).
i=1
If we only care about the tails, this property is preserved for sub-Gaussian
random variables.
Theorem 1.6. Let X = (X1 , . . . , Xn ) be a vector of independent sub-Gaussian

random variables that have variance proxy σ 2 . Then, the random vector X is
sub-Gaussian with variance proxy σ 2 .
Proof. Let u ∈ S n−1 be a unit vector, then

n n
> Y Y σ 2 s2 u 2
i σ 2 s2 |u|2
2 σ 2 s2
IE[esu X
]= IE[esui Xi ] ≤ e 2 =e 2 =e 2 .
i=1 i=1
Using a Chernoff bound, we immediately get the following corollary

Corollary 1.7. Let X1 , . . . , Xn be n independent random variables such that
Xi ∼ subG(σ 2 ). Then for any a ∈ IRn , we have
n
hX i t2
IP ai Xi > t ≤ exp − ,
i=1
2σ 2 |a|22
and
n
hX i t2
IP ai Xi < −t ≤ exp −
i=1
2σ 2 |a|22
Of special interest
Pnis the case where ai = 1/n for all i. Then, we get that
the average X̄ = n1 i=1 Xi , satisfies
nt2 nt2
IP(X̄ > t) ≤ e− 2σ2 and IP(X̄ < −t) ≤ e− 2σ2
just like for the Gaussian average.
Hoeffding’s inequality
The class of sub-Gaussian random variables is actually quite large. Indeed,
Hoeffding’s lemma below implies that all random variables that are bounded
uniformly are actually sub-Gaussian with a variance proxy that depends on the
size of their support.
Lemma 1.8 (Hoeffding’s lemma (1963)). Let X be a random variable such

that IE(X) = 0 and X ∈ [a, b] almost surely. Then, for any s ∈ IR, it holds
s2 (b−a)2
IE[esX ] ≤ e 8 .
2
In particular, X ∼ subG( (b−a)
4 ).
Proof. Define ψ(s) = log IE[esX ], and observe that and we can readily compute
2
IE[XesX ] IE[X 2 esX ] IE[XesX ]

0 00
ψ (s) = , ψ (s) = − .
IE[esX ] IE[esX ] IE[esX ]
Thus ψ 00 (s) can be interpreted as the variance of the random variable X under
esX
the probability measure dQ = IE[e sX ] dIP. But since X ∈ [a, b] almost surely,
we have, under any probability,

a + b h a + b 2 i (b − a)2
var(X) = var X − ≤ IE X − ≤ .
2 2 4
The fundamental theorem of calculus yields
Z sZ µ
s2 (b − a)2
ψ(s) = ψ 00 (ρ) dρ dµ ≤
0 0 8
using ψ(0) = log 1 = 0 and ψ 0 (0) = IEX = 0.
Using a Chernoff bound, we get the following (extremely useful) result.
Theorem 1.9 (Hoeffding’s inequality). Let X1 , . . . , Xn be n independent ran-
dom variables such that almost surely,
Xi ∈ [ai , bi ] , ∀ i.
1
Pn
Let X̄ = n i=1 Xi , then for any t > 0,
2n2 t2
IP(X̄ − IE(X̄) > t) ≤ exp − Pn 2
,
i=1 (bi − ai )
and 2n2 t2
IP(X̄ − IE(X̄) < −t) ≤ exp − Pn 2
.
i=1 (bi − ai )
Note that Hoeffding’s lemma holds for any bounded random variable. For
example, if one knows that X is a Rademacher random variable,
es + e−s s2
IE(esX ) = = cosh(s) ≤ e 2 .
2
Note that 2 is the best possible constant in the above approximation. For such
variables, a = −1, b = 1, and IE(X) = 0, so Hoeffding’s lemma yields
s2
IE(esX ) ≤ e 2 .
1.3. Sub-exponential random variables 22
Hoeffding’s inequality is very general but there is a price to pay for this gen-
erality. Indeed, if the random variables have small variance, we would like to
see it reflected in the exponential tail bound (as for the Gaussian case) but the
variance does not appear in Hoeffding’s inequality. We need a more refined
inequality.
1.3 SUB-EXPONENTIAL RANDOM VARIABLES
What can we say when a centered random variable is not sub-Gaussian?

A typical example is the double exponential (or Laplace) distribution with
parameter 1, denoted by Lap(1). Let X ∼ Lap(1) and observe that
IP(|X| > t) = e−t , t ≥ 0.
In particular, the tails of this distribution do not decay as fast as the Gaussian
2
ones (that decays as e−t /2 ). Such tails are said to be heavier than Gaussian.
This tail behavior is also captured by the moment generating function of X.
Indeed, we have
1
IE esX =

if |s| < 1 ,
1 − s2
and it is not defined for s ≥ 1. It turns out that a rather weak condition on
the moment generating function is enough to partially reproduce some of the
bounds that we have proved for sub-Gaussian random variables. Observe that
for X ∼ Lap(1)
2
IE esX ≤ e2s if |s| < 1/2 ,

In particular, the moment generating function of the Laplace distribution is

bounded by that of a Gaussian in a neighborhood of 0 but does not even exist
away from zero. It turns out that all distributions that have tails at most as
heavy as that of a Laplace distribution satisfy such a property.
Lemma 1.10. Let X be a centered random variable such that IP(|X| > t) ≤
2e−2t/λ for some λ > 0. Then, for any positive integer k ≥ 1,
IE[|X|k ] ≤ λk k! .
Moreover,
IE[|X|k ])1/k ≤ 2λk ,
and the moment generating function of X satisfies
2 2 1
IE esX ≤ e2s λ ,

∀|s| ≤ .
2λ
Proof.
Z ∞
k
IE[|X| ] = IP(|X|k > t)dt
0
Z ∞
= IP(|X| > t1/k )dt
Z0 ∞
2t1/k
≤ 2e− λ dt
0
∞
2t1/k
Z
= 2(λ/2)k k e−u uk−1 du , u=
0 λ
≤ λk kΓ(k) = λk k!
The second statement follows from Γ(k) ≤ k k and k 1/k ≤ e1/e ≤ 2 for any
k ≥ 1. It yields
1/k
λk kΓ(k) ≤ 2λk .
To control the MGF of X, we use the Taylor expansion of the exponential
function as follows. Observe that by the dominated convergence theorem, for
any s such that |s| ≤ 1/2λ
∞
X |s|k IE[|X|k ]
IE esX ≤ 1 +

k!
k=2
X∞
≤1+ (|s|λ)k
k=2
∞
X
= 1 + s2 λ2 (|s|λ)k
k=0
2 2 1
≤ 1 + 2s λ |s| ≤
2λ
2
λ2
≤ e2s
This leads to the following definition

Definition 1.11. A random variable X is said to be sub-exponential with
parameter λ (denoted X ∼ subE(λ)) if IE[X] = 0 and its moment generating
function satisfies
2 2 1
IE esX ≤ es λ /2 ,

∀|s| ≤ .
λ
A simple and useful example of of a sub-exponential random variable is
given in the next lemma.
Lemma 1.12. Let X ∼ subG(σ 2 ) then the random variable Z = X 2 − IE[X 2 ]
is sub-exponential: Z ∼ subE(16σ 2 ).
Proof. We have, by the dominated convergence theorem,

∞ k
sZ
X sk IE X 2 − IE[X 2 ]
IE[e ]=1+
k!
k=2
∞
X sk 2k−1 IE[X 2k ] + (IE[X 2 ])k
≤1+ (Jensen)
k!
k=2
∞
X sk 4k IE[X 2k ]
≤1+ (Jensen again)
2(k!)
k=2
∞
X sk 4k 2(2σ 2 )k k!
≤1+ (Lemma 1.4)
2(k!)
k=2
∞
X
= 1 + (8sσ 2 )2 (8sσ 2 )k
k=0
1
= 1 + 128s2 σ 4 for |s| ≤
16σ 2
2
σ4
≤ e128s .
Sub-exponential random variables also give rise to exponential deviation

inequalities such as Corollary 1.7 (Chernoff bound) or Theorem 1.9 (Hoeffd-
ing’s inequality) for weighted sums of independent sub-exponential random
variables. The significant difference here is that the larger deviations are con-
trolled in by a weaker bound.
Bernstein’s inequality
Theorem 1.13 (Bernstein’s inequality). Let X1 , . . . , Xn be independent ran-
dom variables such that IE(Xi ) = 0 and Xi ∼ subE(λ). Define
n
1X
X̄ = Xi ,
n i=1
Then for any t > 0 we have
n t2

t
IP(X̄ > t) ∨ IP(X̄ < −t) ≤ exp − ( 2 ∧ ) .
2 λ λ
Proof. Without loss of generality, assume that λ = 1 (we can always replace
Xi by Xi /λ and t by t/λ). Next, using a Chernoff bound, we get for any s > 0
n
Y
IE esXi e−snt .

IP(X̄ > t) ≤
i=1
1.4. Maximal inequalities 25
2
Next, if |s| ≤ 1, then IE esXi ≤ es /2 by definition of sub-exponential distri-
butions. It yields
ns2
−snt
IP(X̄ > t) ≤ e 2
Choosing s = 1 ∧ t yields
n 2
IP(X̄ > t) ≤ e− 2 (t ∧t)
We obtain the same bound for IP(X̄ < −t) which concludes the proof.
Note that usually, Bernstein’s inequality refers to a slightly more precise

result that is qualitatively the same as the one above: it exhibits a Gaussian
2 2
tail e−nt /(2λ ) and an exponential tail e−nt/(2λ) . See for example Theorem 2.10
in [BLM13].
1.4 MAXIMAL INEQUALITIES
The exponential inequalities of the previous section are valid for linear com-
binations of independent random variables, and, in particular, for the average
X̄. In many instances, we will be interested in controlling the maximum over
the parameters of such linear combinations (this is because of empirical risk
minimization). The purpose of this section is to present such results.
Maximum over a finite set

We begin by the simplest case possible: the maximum over a finite set.
Theorem 1.14. Let X1 , . . . , XN be N random variables such that Xi ∼ subG(σ 2 ).

Then
p p
IE[ max Xi ] ≤ σ 2 log(N ) , and IE[ max |Xi |] ≤ σ 2 log(2N )
1≤i≤N 1≤i≤N
Moreover, for any t > 0,

t2 t2
max Xi > t ≤ N e− 2σ2 , max |Xi | > t ≤ 2N e− 2σ2

IP and IP
1≤i≤N 1≤i≤N
Note that the random variables in this theorem need not be independent.
Proof. For any s > 0,

1
IE log es max1≤i≤N Xi

IE[ max Xi ] =
1≤i≤N s
1
≤ log IE es max1≤i≤N Xi

(by Jensen)
s
1
= log IE max esXi

s 1≤i≤N
1 X
IE esXi

≤ log
s
1≤i≤N
1 X σ 2 s2
≤ log e 2
s
1≤i≤N
log N σ2 s
= + .
s 2
p
Taking s = 2(log N )/σ 2 yields the first inequality in expectation.
The first inequality in probability is obtained by a simple union bound:
[
IP max Xi > t = IP {Xi > t}
1≤i≤N
1≤i≤N
X
≤ IP(Xi > t)
1≤i≤N
t2
≤ N e− 2σ2 ,
where we used Lemma 1.3 in the last inequality.

The remaining two inequalities follow trivially by noting that
max |Xi | = max Xi ,

1≤i≤N 1≤i≤2N
where XN +i = −Xi for i = 1, . . . , N .
Extending these results to a maximum over an infinite set may be impossi-

ble. For example, if one is given an infinite sequence of
iid N (0, σ 2 ) random variables X1 , X2 , . . ., then for any N ≥ 1, we have for any
t > 0,
IP( max Xi < t) = [IP(X1 < t)]N → 0 , N → ∞ .
1≤i≤N
On the opposite side of the picture, if all the Xi s are equal to the same random
variable X, we have for any t > 0,
IP( max Xi < t) = IP(X1 < t) > 0 ∀ N ≥ 1.

1≤i≤N
In the Gaussian case, lower bounds are also available. They illustrate the effect
of the correlation between the Xi s.
Examples from statistics have structure and we encounter many examples

where a maximum of random variables over an infinite set is in fact finite. This
is due to the fact that the random variables that we are considering are not
independent from each other. In the rest of this section, we review some of
these examples.
Maximum over a convex polytope

We use the definition of a polytope from [Gru03]: a convex polytope P is a
compact set with a finite number of vertices V(P) called extreme points. It
satisfies P = conv(V(P)), where conv(V(P)) denotes the convex hull of the
vertices of P.
Let X ∈ IRd be a random vector and consider the (infinite) family of random
variables
F = {θ> X : θ ∈ P} ,
where P ⊂ IRd is a polytope with N vertices. While the family F is infinite, the
maximum over F can be reduced to the a finite maximum using the following
useful lemma.
Lemma 1.15. Consider a linear form x 7→ c> x, x, c ∈ IRd . Then for any
convex polytope P ⊂ IRd ,
max c> x = max c> x
x∈P x∈V(P)
where V(P) denotes the set of vertices of P.

Proof. Assume that V(P) = {v1 , . . . , vN }. For any x ∈ P = conv(V(P)), there
exist nonnegative numbers λ1 , . . . λN that sum up to 1 and such that x =
λ1 v1 + · · · + λN vN . Thus
N
X XN N
X
c> x = c> λi vi = λ i c > vi ≤ λi max c> x = max c> x .
x∈V(P) x∈V(P)
i=1 i=1 i=1
It yields
max c> x ≤ max c> x ≤ max c> x
x∈P x∈V(P) x∈P
so the two quantities are equal.

It immediately yields the following theorem
Theorem 1.16. Let P be a polytope with N vertices v (1) , . . . , v (N ) ∈ IRd and let
X ∈ IRd be a random vector such that [v (i) ]> X, i = 1, . . . , N , are sub-Gaussian
random variables with variance proxy σ 2 . Then
p p
IE[max θ> X] ≤ σ 2 log(N ) , and IE[max |θ> X|] ≤ σ 2 log(2N ) .
θ∈P θ∈P
Moreover, for any t > 0,

t2 t2
IP max θ> X > t ≤ N e− 2σ2 , IP max |θ> X| > t ≤ 2N e− 2σ2

and
θ∈P θ∈P
Of particular interest are polytopes that have a small number of vertices.

A primary example is the `1 ball of IRd defined by
n d
X o
B1 = x ∈ IRd : |xi | ≤ 1 .
i=1
Indeed, it has exactly 2d vertices.
Maximum over the `2 ball

Recall that the unit `2 ball of IRd is defined by the set of vectors u that have
Euclidean norm |u|2 at most 1. Formally, it is defined by
n d
X o
B2 = x ∈ IRd : x2i ≤ 1 .
i=1
Clearly, this ball is not a polytope and yet, we can control the maximum of
random variables indexed by B2 . This is due to the fact that there exists a
finite subset of B2 such that the maximum over this finite set is of the same
order as the maximum over the entire ball.
Definition 1.17. Fix K ⊂ IRd and ε > 0. A set N is called an ε-net of K

with respect to a distance d(·, ·) on IRd , if N ⊂ K and for any z ∈ K, there
exists x ∈ N such that d(x, z) ≤ ε.
Therefore, if N is an ε-net of K with respect to a norm k · k, then every

point of K is at distance at most ε from a point in N . Clearly, every compact
set admits a finite ε-net. The following lemma gives an upper bound on the
size of the smallest ε-net of B2 .
Lemma 1.18. Fix ε ∈ (0, 1). Then the unit Euclidean ball B2 has an ε-net N
with respect to the Euclidean distance of cardinality |N | ≤ (3/ε)d
Proof. Consider the following iterative construction if the ε-net. Choose x1 =

0. For any i ≥ 2, take xi to be any x ∈ B2 such that |x − xj |2 > ε for all j < i.
If no such x exists, stop the procedure. Clearly, this will create an ε-net. We
now control its size.
Observe that since |x−y|2 > ε for all x, y ∈ N , the Euclidean balls centered
at x ∈ N and with radius ε/2 are disjoint. Moreover,
[ ε ε
{z + B2 } ⊂ (1 + )B2
2 2
z∈N
where {z + εB2 } = {z + εx , x ∈ B2 }. Thus, measuring volumes, we get

ε [ ε X ε
vol (1 + )B2 ≥ vol {z + B2 } = vol {z + B2 }
2 2 2
z∈N z∈N
This is equivalent to
ε d
ε d
≥ |N |
1+ .
2 2
Therefore, we get the following bound
2 d 3 d
|N | ≤ 1 + ≤ .
ε ε
Theorem 1.19. Let X ∈ IRd be a sub-Gaussian random vector with variance

proxy σ 2 . Then
√
IE[max θ> X] = IE[max |θ> X|] ≤ 4σ d .
θ∈B2 θ∈B2
Moreover, for any δ > 0, with probability 1 − δ, it holds

√ p
max θ> X = max |θ> X| ≤ 4σ d + 2σ 2 log(1/δ) .
θ∈B2 θ∈B2
Proof. Let N be a 1/2-net of B2 with respect to the Euclidean norm that

satisfies |N | ≤ 6d . Next, observe that for every θ ∈ B2 , there exists z ∈ N and
x such that |x|2 ≤ 1/2 and θ = z + x. Therefore,
max θ> X ≤ max z > X + max

1
x> X
θ∈B2 z∈N x∈ 2 B2
But
1
max x> X = max x> X
1
x∈ 2 B2 2 x∈B2
Therefore, using Theorem 1.14, we get
p p √
IE[max θ> X] ≤ 2IE[max z > X] ≤ 2σ 2 log(|N |) ≤ 2σ 2(log 6)d ≤ 4σ d .
θ∈B2 z∈N
The bound with high probability then follows because

t2 t2
IP max θ> X > t ≤ IP 2 max z > X > t ≤ |N |e− 8σ2 ≤ 6d e− 8σ2 .

θ∈B2 z∈N
To conclude the proof, we find t such that

t2
e− 8σ2 +d log(6) ≤ δ ⇔ t2 ≥ 8 log(6)σ 2 d + 8σ 2 log(1/δ) .
p √ p
Therefore, it is sufficient to take t = 8 log(6)σ d + 2σ 2 log(1/δ) .
1.5. Sums of independent random matrices 30
1.5 SUMS OF INDEPENDENT RANDOM MATRICES
In this section, we are going to explore how concentration statements can be

extended to sums of matrices. As an example, we are going to show a version of
Bernstein’s inequality, 1.13, for sums of independent matrices, closely following
the presentation in [Tro12], which builds on previous work in [AW02]. Results
of this type have been crucial in providing guarantees for low rank recovery,
see for example [Gro11].
In particular, we want to control the maximum P eigenvalue of a sum of
independent random symmetric matrices IP(λmax ( i Xi ) > t). The tools in-
volved will closely mimic those used for scalar random variables, but with the
caveat that care must be taken when handling exponentials of matrices because
eA+B 6= eA eB if A, B ∈ IRd and AB 6= BA.
Preliminaries
We denote the set of symmetric, positive semi-definite, and positive definite
matrices in IRd×d by Sd , Sd+ , and Sd++ , respectively, and will omit the subscript
when the dimensionality is clear. Here, positive semi-definite means that for
a matrix X ∈ S + , v > Xv ≥ 0 for all v ∈ IRd , |v|2 = 1, and positive definite
means that the equality holds strictly. This is equivalent to all eigenvalues of
X being larger or equal (strictly larger, respectively) than 0, λj (X) ≥ 0 for all
j = 1, . . . , d.
The cone of positive definite matrices induces an order on matrices by set-
ting A B if B − A ∈ S + .
Since we want to extend the notion of exponentiating random variables that
was essential to our derivation of the Chernoff bound to matrices, we will make
use of the matrix exponential and the matrix logarithm. For a symmetric matrix
X, we can define a functional calculus by applying a function f : IR → IR to
the diagonal elements of its spectral decomposition, i.e., if X = QΛQ> , then
f (X) := Qf (Λ)Q> , (1.4)
where
[f (Λ)]i,j = f (Λi,j ), i, j ∈ [d], (1.5)
is only non-zero on the diagonal and f (X) is well-defined because the spectral
decomposition is unique up to the ordering of the eigenvalues and the basis of
the eigenspaces, with respect to both of which (1.4) is invariant, as well. From
the definition, it is clear that for the spectrum of the resulting matrix,
σ(f (X)) = f (σ(X)). (1.6)
While inequalities between functions not generally carry over to matrices, we

have the following transfer rule:
f (a) ≤ g(a) for a ∈ σ(X) =⇒ f (X) g(X). (1.7)

We can now define the matrix exponential by (1.4) which is equivalent to

defining it via a power series,
∞
X 1 k
exp(X) = X .
k!
k=1
Similarly, we can define the matrix logarithm as the inverse function of exp on
S, log(eA ) = A, which defines it on S + .
Some nice properties of these functions on matrices are their monotonicity:
For A B,
Tr exp A ≤ Tr exp B, (1.8)
and for 0 ≺ A B,
log A log B. (1.9)
Note that the analogue of (1.9) is in general not true for the matrix exponential.
The Laplace transform method

For the remainder of this section, let X1 , . . . , Xn ∈ IRd×d be independent
random symmetric matrices.
As a first step that can lead to different types of matrix concentration
inequalities, we give a generalization of the Chernoff bound for the maximum
eigenvalue of a matrix.
Proposition 1.20. Let Y be a random symmetric matrix. Then, for all t ∈ IR,
IP(λmax (Y) ≥ t) ≤ inf {e−θt IE[Tr eθY ]}

θ>0
Proof. We multiply both sides of the inequality λmax (Y) ≥ t by θ, take ex-
ponentials, apply the spectral theorem (1.6) and then estimate the maximum
eigenvalue by the sum over all eigenvalues, the trace.
IP(λmax (Y) ≥ t) = IP(λmax (θY) ≥ θt)

= IP(eλmax (θY) ≥ eθt )
≤ e−θt IE[eλmax (θY) ]
= e−θt IE[λmax (eθY )]
≤ e−θt IE[Tr(eθY )].
Recall that the crucial step in proving Bernstein’s and Hoeffding’s inequal-
ity was to exploit the independence of the summands by the fact that the
exponential function turns products into sums.
P Y Y
IE[e i θXi ] = IE[ eθXi ] = IE[eθXi ].
i i
This property, eA+B = eA eB , no longer holds true for matrices, unless they
commute.
We could try to replace it with a similar property, the Golden-Thompson
inequality,
Tr[eθ(X1 +X2 ) ] ≤ Tr[eθX1 eθX2 ].
Unfortunately, this does not generalize to more than two matrices, and when
trying to peel off factors, we would have to pull a maximum eigenvalue out of
the trace,
Tr[eθX1 eθX2 ] ≤ λmax (eθX2 )Tr[eθX1 ].
This is the approach followed by Ahlswede-Winter [AW02], which leads to
worse constants for concentration than the ones we obtain below.
Instead, we are going to use the following deep theorem due to Lieb [Lie73].
A sketch of the proof can be found in the appendix of [Rus02].
Theorem 1.21 (Lieb). Let H ∈ IRd×d be symmetric. Then,
Sd+ → IR, A 7→ TreH+log A
is a concave function.
Corollary 1.22. Let X, H ∈ Sd be two symmetric matrices such that X is

random and H is fixed. Then,
X
IE[TreH+X ] ≤ TreH+log IE[e ] .
Proof. Write Y = eX and use Jensen’s inequality:
IE[TreH+X ] = IE[TreH+log Y ]
≤ TreH+log(IE[Y])
X
= TreH+log IE[e ]
With this, we can establish a better bound on the moment generating func-
tion of sums of independent matrices.
Lemma 1.23. Let X1 , . . . , Xn be n independent, random symmetric matrices.

Then, X X
IE[Tr exp( θXi )] ≤ Tr exp( log IEeθXi )
i i
Proof. Without loss of generality, we can assume θ = 1. Write IEk for the
expectation conditioned on X1 , . . . , Xk , IEk [ . ] = IE[ . |X1 , . . . , Xk ]. By the
tower property,
Xn Xn
IE[Tr exp( Xi )] = IE0 . . . IEn−1 Tr exp( Xi )
i=1 i=1
n−1
X
Xi + log IEn−1 eXn

≤ IE0 . . . IEn−2 Tr exp
| {z }
i=1
=IEeXn
..
.
Xn
≤ Tr exp( log IE[eXi ]),
i=1
where we applied Lieb’s theorem, Theorem 1.21 on the conditional expectation

and then used the independence of X1 , . . . , Xn .
Tail bounds for sums of independent matrices

Theorem 1.24 (Master tail bound). For all t ∈ IR,
X X
IP(λmax ( Xi ) ≥ t) ≤ inf {e−θt Tr exp( log IEeθXi )}
θ>0
i i
Proof. Combine Lemma 1.23 with Proposition 1.20.

Corollary 1.25. Assume that there is a function g : (0, ∞) → [0, ∞] and fixed
symmetric matrices Ai such that IE[eθXi ] eg(θ)Ai for all i. Set
X
ρ = λmax ( Ak ).
i
Then, for all t ∈ IR,

X
IP(λmax ( Xi ) ≥ t) ≤ d inf {e−θt+g(θ)ρ }.
θ>0
i
Proof. Using the operator monoticity of log, (1.9), and the monotonicity of
Tr exp, (1.8), we can plug the estimates for the matrix mgfs into the master
inequality, Theorem 1.24, to obtain
X X
IP(λmax ( Xi ) ≥ t) ≤ e−θt Tr exp(g(θ) Ai )
i i
X
−θt
≤ de λmax (exp(g(θ) Ai ))
i
X
= de−θt exp(g(θ)λmax ( Ai )),
i
Pd
where we estimated Tr(X) = j=1 λj ≤ dλmax and used the spectral theorem.
Now we are ready to prove Bernstein’s inequality We just need to come up

with a dominating function for Corollary 1.25.
Lemma 1.26. If IE[X] = 0 and λmax (X) ≤ 1, then
IE[eθX ] exp((eθ − θ − 1)IE[X2 ]).
Proof. Define  θx
e − θx − 1
, x 6= 0,


x2

f (x) =
 θ2
, x = 0,


2
and verify that this is a smooth and increasing function on IR. Hence, f (x) ≤
f (1) for x ≤ 1. By the transfer rule, (1.7), f (X) f (I) = f (1)I. Therefore,
eθX = I + θX + Xf (X)X I + θX + f (1)X2 ,
and by taking expectations,
IE[eθX ] I + f (1)IE[X2 ] exp(f (1)IE[X2 ]) = exp((eθ − θ − 1)IE[X2 ]).
Theorem 1.27 (Matrix Bernstein). Assume IEXi = 0 and λmax (Xi ) ≤ R

almost surely for all i and set
X
σ2 = k IE[X2i ]kop .
i
Then, for all t ≥ 0,

X σ2
IP(λmax ( Xi ) ≥ t) ≤ d exp(− 2 h(Rt/σ 2 ))
i
R
−t2 /2

≤ d exp
σ 2 + Rt/3
3t2 σ2

d exp − 8σ 2 , t ≤ ,


R
≤
σ2

 d exp − 3t , t ≥

 ,
8R R
where h(u) = (1 + u) log(1 + u) − u, u > 0.
Proof. Without loss of generality, take R = 1. By Lemma 1.26,
IE[eθXi ] exp(g(θ)IE[X2i ]), with g(θ) = eθ − θ − 1, θ > 0.
By Corollary 1.25,
X
IP(λmax ( Xi ) ≥ t) ≤ d exp(−θt + g(θ)σ 2 ).
i
We can verify that the minimal value is attained at θ = log(1 + t/σ 2 ).

u2 /2
The second inequality then follows from the fact that h(u) ≥ h2 (u) = 1+u/3
for u ≥ 0 which can be verified by comparing derivatives.
The third inequality follows from exploting the properties of h2 .
With similar techniques, one can also prove a version of Hoeffding’s inequal-
ity for matrices, see [Tro12, Theorem 1.3].
1.6. Problem set 36
1.6 PROBLEM SET
Problem 1.1. Let X1 , . . . , Xn be independent random variables such that

IE(Xi ) = 0 and Xi ∼ subE(λ). For any vector a = (a1 , . . . , an )> ∈ IRn , define
the weighted sum
Xn
S(a) = ai Xi ,
i=1
Show that for any t > 0 we have
t2

t
IP(|S(a)| > t) ≤ 2 exp −C ∧ .
λ2 |a|22 λ|a|∞
for some positive constant C.
Problem 1.2. A random variable X has χ2n (chi-squared with n degrees of

freedom) if it has the same distribution as Z12 + . . . + Zn2 , where Z1 , . . . , Zn are
i.i.d. N (0, 1).
(a) Let Z ∼ N (0, 1). Show that the moment generating function of Y =
Z 2 − 1 satisfies
−s

sY  √ e if s < 1/2
φ(s) := E e =
 1 − 2s
∞ otherwise
(b) Show that for all 0 < s < 1/2,

s2
φ(s) ≤ exp .
1 − 2s
(c) Conclude that √

IP(Y > 2t + 2 t) ≤ e−t
√
[Hint: you can use the convexity inequality 1 + u ≤ 1+u/2].
(d) Show that if X ∼ χ2n , then, with probability at least 1 − δ, it holds
p
X ≤ n + 2 n log(1/δ) + 2 log(1/δ) .
Problem 1.3. Let X1 , X2 . . . be an infinite sequence of sub-Gaussian random

variables with variance proxy σi2 = C(log i)−1 . Show that for C large enough,
we get
IE max Xi < ∞ .
i≥2
1.6. Problem set 37
Problem 1.4. Let A = {Ai,j } 1≤i≤n be a random matrix such that its entries
1≤j≤m
are i.i.d. sub-Gaussian random variables with variance proxy σ 2 .
(a) Show that the matrix A is sub-Gaussian. What is its variance proxy?
(b) Let kAk denote the operator norm of A defined by
|Ax|2
maxm .
x∈IR |x|2
Show that there exits a constant C > 0 such that
√ √
IEkAk ≤ C( m + n) .
Problem 1.5. Recall that for any q ≥ 1, the `q norm of a vector x ∈ IRn is
defined by
Xn q1
|x|q = |xi |q .
i=1
Let X = (X1 , . . . , Xn ) be a vector with independent entries such that Xi is

sub-Gaussian with variance proxy σ 2 and IE(Xi ) = 0.
(a) Show that for any q ≥ 2, and any x ∈ IRd ,
1 1
|x|2 ≤ |x|q n 2 − q ,
and prove that the above inequality cannot be improved

(b) Show that for for any q > 1,
1 √
IE|X|q ≤ 4σn q q
(c) Recover from this bound that

p
IE max |Xi | ≤ 4eσ log n .
1≤i≤n
Problem 1.6. Let K be a compact subset of the unit sphere of IRp that
admits an ε-net Nε with respect to the Euclidean distance of IRp that satisfies
|Nε | ≤ (C/ε)d for all ε ∈ (0, 1). Here C ≥ 1 and d ≤ p are positive constants.
Let X ∼ subGp (σ 2 ) be a centered random vector.
Show that there exists positive constants c1 and c2 to be made explicit such
that for any δ ∈ (0, 1), it holds
p p
max θ> X ≤ c1 σ d log(2p/d) + c2 σ log(1/δ)
θ∈K
with probability at least 1−δ. Comment on the result in light of Theorem 1.19 .
1.6. Problem set 38
Problem 1.7. For any K ⊂ IRd , distance d on IRd and ε > 0, the ε-covering
number C(ε) of K is the cardinality of the smallest ε-net of K. The ε-packing
number P (ε) of K is the cardinality of the largest set P ⊂ K such that
d(z, z 0 ) > ε for all z, z 0 ∈ P, z 6= z 0 . Show that
C(2ε) ≤ P (2ε) ≤ C(ε) .
Problem 1.8. Let X1 , . . . , Xn be n independent and random variables such

that IE[Xi ] = µ and var(Xi ) ≤ σ 2 . Fix δ ∈ (0, 1) and assume without loss of
generality that n can be factored into n = K · G where G = 8 log(1/δ) is a
positive integers.
For g = 1, . . . , G, let X̄g denote the average over the gth group of k variables.
Formally
gk
1 X
X̄g = Xi .
k
i=(g−1)k+1
1. Show that for any g = 1, . . . , G,

2σ 1
IP X̄g − µ > √ ≤ .
k 4
2. Let µ̂ be defined as the median of {X̄1 , . . . , X̄G }. Show that

2σ G
IP µ̂ − µ > √ ≤ IP B ≥ ,
k 2
where B ∼ Bin(G, 1/4).
3. Conclude that r
2 log(1/δ)
IP µ̂ − µ > 4σ ≤δ
n
4. Compare this result with Corollary 1.7 and Lemma 1.3. Can you conclude
that µ̂ − µ ∼ subG(σ̄ 2 /n) for some σ̄ 2 ? Conclude.
Problem 1.9. The goal of this problem is to prove the following theorem:
Theorem 1.28 (Johnson-Lindenstrauss Lemma). Given n points X = {xq 1 , . . . , xn }
d k×d d
in IR , let Q ∈ IR be a random projection operator and set P := k Q.
There is a constant C > 0 such that if
C
k≥ log n,
ε2
P is an ε-isometry for X, i.e.,
(1 − ε)kxi − xj k22 ≤ kP xi − P xj k22 ≤ (1 + ε)kxi − xj k22 , for all i, j
with propability at least 1 − 2 exp(−cε2 k).

1.6. Problem set 39
1. Convince yourself that if d > n, there is a projection P ∈ IRn×d to an n

dimensional subspace such that kP xi − P xj k2 = kxi − xj k2 , i.e., pairwise
distances are exactly preserved.
Let k ≤ d be two integers, Y = (y1 , . . . , yd ) ∼ N (0, Id×d ) independent and

identically distributed Gaussians and Q ∈ IRd×k the projection onto the first k
coordinates, i.e., Qy = (y1 , . . . , yk ). Define Z = kY1 k QY = kY1 k (y1 , . . . , yk ) and
L = kZk2 .
2. Show that IE[L] = kd .

3. Show that for all t > 0 such that 1 − 2t(kβ − d) > 0 and 1 − 2tβk > 0,
k d
!
X k X
IP yi2 ≤ β y 2 ≤ (1 − 2t(kβ − d))−k/2 (1 − 2tβk)−(d−k)/2
i=1
d i=1 i
2
(Hint: Show that IE[eλX ] = √ 1 for λ < 1
if X ∼ N (0, 1).)
1−2λ 2
4. Conclude that for β < 1,

k k
IP L ≤ β ≤ exp (1 − β + log β) .
d 2
5. Similarly, show that for β > 1,

k k
IP L ≥ β ≤ exp (1 − β + log β) .
d 2
6. Show that for a random projection operator Q ∈ IRk×d and a fixed vector
x ∈ IRd ,
a) IE[kQxk2 ] = kd kxk2 .
b) For ε ∈ (0, 1), there is a constant c > 0 such that with probability
at least 1 − 2 exp(−ckε2 ),
k k
(1 − ε) kxk2 ≤ kQxk22 ≤ (1 + ε) kxk22 .
d d
(Hint: Think about how to apply the previous results in this case
and use the inequalities log(1 − ε) ≤ −ε − ε2 /2 and log(1 + ε) ≤
ε − ε2 /2 + ε3 /3.)
7. Prove Theorem 1.28.
Chapter
2
Linear Regression Model
In this chapter, we consider the following regression model:
Yi = f (Xi ) + εi , i = 1, . . . , n , (2.1)
where ε = (ε1 , . . . , εn )> is sub-Gaussian with variance proxy σ 2 and such that
IE[ε] = 0. Our goal is to estimate the function f under a linear assumption.
Namely, we assume that x ∈ IRd and f (x) = x> θ∗ for some unknown θ∗ ∈ IRd .
2.1 FIXED DESIGN LINEAR REGRESSION
Depending on the nature of the design points X1 , . . . , Xn , we will favor a

different measure of risk. In particular, we will focus either on fixed or random
design.
Random design
The case of random design corresponds to the statistical learning setup. Let
(X1 , Y1 ), . . . , (Xn+1 , Yn+1 ) be n + 1 i.i.d. random couples. Given the pairs
(X1 , Y1 ), . . . , (Xn , Yn ), the goal is construct a function fˆn such that fˆn (Xn+1 )
is a good predictor of Yn+1 . Note that when fˆn is constructed, Xn+1 is still
unknown and we have to account for what value it is likely to take.
Consider the following example from [HTF01, Section 3.2]. The response
variable Y is the log-volume of a cancerous tumor, and the goal is to predict
it based on X ∈ IR6 , a collection of variables that are easier to measure (age
of patient, log-weight of prostate, . . . ). Here the goal is clearly to construct f
for prediction purposes. Indeed, we want to find an automatic mechanism that
40
2.1. Fixed design linear regression 41
outputs a good prediction of the log-weight of the tumor given certain inputs
for a new (unseen) patient.
A natural measure of performance here is the L2 -risk employed in the in-
troduction:
R(fˆn ) = IE[Yn+1 − fˆn (Xn+1 )]2 = IE[Yn+1 − f (Xn+1 )]2 + kfˆn − f k2L2 (PX ) ,
where PX denotes the marginal distribution of Xn+1 . It measures how good

the prediction of Yn+1 is in average over realizations of Xn+1 . In particular,
it does not put much emphasis on values of Xn+1 that are not very likely to
occur.
Note that if the εi are random variables with variance σ 2 , then one simply
has R(fˆn ) = σ 2 + kfˆn − f k2L2 (PX ) . Therefore, for random design, we will focus
on the squared L2 norm kfˆn − f k2 2 L (PX )as a measure of accuracy. It measures
how close fˆn is to the unknown f in average over realizations of Xn+1 .
Fixed design
In fixed design, the points (or vectors) X1 , . . . , Xn are deterministic. To em-
phasize this fact, we use lowercase letters x1 , . . . , xn to denote fixed design. Of
course, we can always think of them as realizations of a random variable but
the distinction between fixed and random design is deeper and significantly
affects our measure of performance. Indeed, recall that for random design, we
look at the performance in average over realizations of Xn+1 . Here, there is no
such thing as a marginal distribution of Xn+1 . Rather, since the design points
x1 , . . . , xn are considered deterministic, our goal is estimate f only at these
points. This problem is sometimes called denoising since our goal is to recover
f (x1 ), . . . , f (xn ) given noisy observations of these values.
In many instances, fixed designs exhibit particular structures. A typical
example is the regular design on [0, 1], given by xi = i/n, i = 1, . . . , n. Inter-
polation between these points is possible under smoothness assumptions.
>
Note that in fixed design, we observe µ∗ +ε, where µ∗ = f (x1 ), . . . , f (xn ) ∈
IRn and ε = (ε1 , . . . , εn )> ∈ IRn is sub-Gaussian with variance proxy σ 2 . In-
stead of a functional estimation problem, it is often simpler to view this problem
as a vector problem in IRn . This point of view will allow us to leverage the
Euclidean geometry of IRn .
In the case of fixed design, we will focus on the Mean Squared Error (MSE)
as a measure of performance. It is defined by
n
1X ˆ 2
MSE(fˆn ) = fn (xi ) − f (xi ) .
n i=1
Equivalently, if we view our problem as a vector problem, it is defined by

n
1X 2 1
MSE(µ̂) = µ̂i − µ∗i = |µ̂ − µ∗ |22 .
n i=1 n
2.2. Least squares estimators 42
Often, the design vectors x1 , . . . , xn ∈ IRd are stored in an n × d design matrix

X, whose jth row is given by x> j . With this notation, the linear regression
model can be written as
Y = Xθ∗ + ε , (2.2)
where Y = (Y1 , . . . , Yn )> and ε = (ε1 , . . . , εn )> . Moreover,
1 X> X
MSE(Xθ̂) = |X(θ̂ − θ∗ )|22 = (θ̂ − θ∗ )> (θ̂ − θ∗ ) . (2.3)
n n
A natural example of fixed design regression is image denoising. Assume
that µ∗i , i ∈ 1, . . . , n is the grayscale value of pixel i of an image. We do not
get to observe the image µ∗ but rather a noisy version of it Y = µ∗ + ε. Given
a library of d images {x1 , . . . , xd }, xj ∈ IRn , our goal is to recover the original
image µ∗ using linear combinations of the images x1 , . . . , xd . This can be done
fairly accurately (see Figure 2.1).
Figure 2.1. Reconstruction of the digit “6”: Original (left), Noisy (middle) and Recon-
struction (right). Here n = 16 × 16 = 256 pixels. Source [RT11].
As we will see in Remark 2.3, properly choosing the design also ensures that
if MSE(fˆ) is small for some linear estimator fˆ(x) = x> θ̂, then |θ̂ − θ∗ |22 is also
small.
In this chapter we only consider the fixed design case.
2.2 LEAST SQUARES ESTIMATORS
Throughout this section, we consider the regression model (2.2) with fixed
design.
Unconstrained least squares estimator

Define the (unconstrained) least squares estimator θ̂ls to be any vector such
that
θ̂ls ∈ argmin |Y − Xθ|22 .
θ∈IRd
Note that we are interested in estimating Xθ∗ and not θ∗ itself, so by exten-
sion, we also call µ̂ls = Xθ̂ls least squares estimator. Observe that µ̂ls is the
projection of Y onto the column span of X.
It is not hard to see that least squares estimators of θ∗ and µ∗ = Xθ∗ are
maximum likelihood estimators when ε ∼ N (0, σ 2 In ).
Proposition 2.1. The least squares estimator µ̂ls = Xθ̂ls ∈ IRn satisfies
X> µ̂ls = X> Y .
Moreover, θ̂ls can be chosen to be
θ̂ls = (X> X)† X> Y ,
where (X> X)† denotes the Moore-Penrose pseudoinverse of X> X.
Proof. The function θ 7→ |Y − Xθ|22 is convex so any of its minima satisfies
∇θ |Y − Xθ|22 = 0,
where ∇θ is the gradient operator. Using matrix calculus, we find
∇θ |Y − Xθ|22 = ∇θ |Y |22 − 2Y > Xθ + θ> X> Xθ = −2(Y > X − θ> X> X)> .

Therefore, solving ∇θ |Y − Xθ|22 = 0 yields
X> Xθ = X> Y .
It concludes the proof of the first statement. The second statement follows
from the definition of the Moore-Penrose pseudoinverse.
We are now going to prove our first result on the finite sample performance
of the least squares estimator for fixed design.
Theorem 2.2. Assume that the linear model (2.2) holds where ε ∼ subGn (σ 2 ).
Then the least squares estimator θ̂ls satisfies
1 r
IE MSE(Xθ̂ls ) = IE|Xθ̂ls − Xθ∗ |22 . σ 2 ,

n n
where r = rank(X> X). Moreover, for any δ > 0, with probability at least 1 − δ,
it holds
r + log(1/δ)
MSE(Xθ̂ls ) . σ 2 .
n
Proof. Note that by definition
|Y − Xθ̂ls |22 ≤ |Y − Xθ∗ |22 = |ε|22 . (2.4)
Moreover,
|Y − Xθ̂ls |22 = |Xθ∗ + ε − Xθ̂ls |22 = |Xθ̂ls − Xθ∗ |22 − 2ε> X(θ̂ls − θ∗ ) + |ε|22 .
Therefore, we get
ε> X(θ̂ls − θ∗ )
|Xθ̂ls − Xθ∗ |22 ≤ 2ε> X(θ̂ls − θ∗ ) = 2|Xθ̂ls − Xθ∗ |2 (2.5)
|X(θ̂ls − θ∗ )|2
Note that it is difficult to control
ε> X(θ̂ls − θ∗ )
|X(θ̂ls − θ∗ )|2
as θ̂ls depends on ε and the dependence structure of this term may be compli-
cated. To remove this dependency, a traditional technique is to “sup-out” θ̂ls .
This is typically where maximal inequalities are needed. Here we have to be a
bit careful.
Let Φ = [φ1 , . . . , φr ] ∈ IRn×r be an orthonormal basis of the column span
of X. In particular, there exists ν ∈ IRr such that X(θ̂ls − θ∗ ) = Φν. It yields
ε> X(θ̂ls − θ∗ ) ε> Φν ε> Φν ν
= = = ε̃> ≤ sup ε̃> u ,
|X(θ̂ls − θ∗ )|2 |Φν|2 |ν|2 |ν|2 u∈B2
where B2 is the unit ball of IRr and ε̃ = Φ> ε. Thus

|Xθ̂ls − Xθ∗ |22 ≤ 4 sup (ε̃> u)2 ,
u∈B2
Next, let u ∈ S r−1

and note that |Φu|22 = u> Φ> Φu = u> u = 1, so that for any
s ∈ IR, we have
> > s2 σ 2
IE[esε̃ u
] = IE[esε Φu
]≤e 2 .
2
Therefore, ε̃ ∼ subGr (σ ).
To conclude the bound in expectation, observe that Lemma 1.4 yields
r
X
4IE sup (ε̃> u)2 = 4 IE[ε̃2i ] ≤ 16σ 2 r .

u∈B2 i=1
Moreover, with probability 1 − δ, it follows from the last step in the proof1 of
Theorem 1.19 that
sup (ε̃> u)2 ≤ 8 log(6)σ 2 r + 8σ 2 log(1/δ) .
u∈B2
X> X
Remark 2.3. If d ≤ n and B := n has rank d, then we have
MSE(Xθ̂ls )
|θ̂ls − θ∗ |22 ≤ ,
λmin (B)
and we can use Theorem 2.2 to bound |θ̂ls − θ∗ |22 directly. By contrast, in the
high-dimensional case, we will need more structural assumptions to come to a
similar conclusion.
1 we could use Theorem 1.19 directly here but at the cost of a factor 2 in the constant.
Constrained least squares estimator

Let K ⊂ IRd be a symmetric convex set. If we know a priori that θ∗ ∈ K, we
ls
may prefer a constrained least squares estimator θ̂K defined by
ls
θ̂K ∈ argmin |Y − Xθ|22 .
θ∈K
The fundamental inequality (2.4) would still hold and the bounds on the MSE
may be smaller. Indeed, (2.5) can be replaced by
ls
|Xθ̂K − Xθ∗ |22 ≤ 2ε> X(θ̂K
ls
− θ∗ ) ≤ 2 sup (ε> Xθ) ,
θ∈K−K
where K − K = {x − y : x, y ∈ K}. It is easy to see that if K is symmetric

(x ∈ K ⇔ −x ∈ K) and convex, then K − K = 2K so that
2 sup (ε> Xθ) = 4 sup (ε> v)
θ∈K−K v∈XK
where XK = {Xθ : θ ∈ K} ⊂ IRn . This is a measure of the size (width) of

XK. If ε ∼ N (0, Id ), the expected value of the above supremum is actually
called Gaussian width of XK. While ε is not Gaussian but sub-Gaussian here,
similar properties will still hold.
`1 constrained least squares

Assume here that K = B1 is the unit `1 ball of IRd . Recall that it is defined by
n d
X
B1 = x ∈ IRd : |xi | ≤ 1 ,
i=1
and it has exactly 2d vertices V = {e1 , −e1 , . . . , ed , −ed }, where ej is the j-th
vector of the canonical basis of IRd and is defined by
ej = (0, . . . , 0, 1
|{z} , 0, . . . , 0)> .
jth position
It implies that the set XK = {Xθ, θ ∈ K} ⊂ IRn is also a polytope with at

most 2d vertices that are in the set XV = {X1 , −X1 , . . . , Xd , −Xd } where Xj is
the j-th column of X. Indeed, XK is a obtained by rescaling and embedding
(resp. projecting) the polytope K when d ≤ n (resp., d ≥ n). Note that some
columns of X might not be vertices of XK so that XV might be a strict superset
of the set of vertices of XK.
Theorem 2.4. Let K = B1 be the unit `1 ball of IRd , d ≥ 2 and assume that
θ∗ ∈ B1 . Moreover, assume the conditions of Theorem√2.2 and that the columns
of X are normalized in such a way that maxj |Xj |2 ≤ n. Then the constrained
least squares estimator θ̂Bls1 satisfies
r
ls
1 ls ∗ 2 log d
IE MSE(Xθ̂B1 ) = IE|Xθ̂B1 − Xθ |2 . σ ,
n n
Moreover, for any δ > 0, with probability 1 − δ, it holds

r
ls log(d/δ)
MSE(Xθ̂B1 ) . σ .
n
Proof. From the considerations preceding the theorem, we get that
|Xθ̂Bls1 − Xθ∗ |22 ≤ 4 sup (ε> v).

v∈XK
Observe now that since ε ∼ subGn (σ 2 ), for any column Xj such that |Xj |2 ≤
√
n, the random variable ε> Xj ∼ subG(nσ 2 ). Therefore, applying Theo-

rem 1.16, we get the bound on IE MSE(Xθ̂K ls
) and for any t ≥ 0,
nt2
) > t ≤ IP sup (ε> v) > nt/4 ≤ 2de− 32σ2
ls

IP MSE(Xθ̂K
v∈XK
To conclude the proof, we find t such that

nt2 log(2d) log(1/δ)
2de− 32σ2 ≤ δ ⇔ t2 ≥ 32σ 2 + 32σ 2 .
n n
Note that the proof of Theorem 2.2 also applies to θ̂Bls1 (exercise!) so that
Xθ̂Bls1
benefits from the best of both rates,
r r
log d
IE MSE(Xθ̂Bls1 ) . min σ 2 , σ

.
n n
√
This is called an elbow effect. The elbow takes place around r ' n (up to
logarithmic terms).
`0 constrained least squares

By an abuse of notation, we call the number of non-zero coefficients of a vector
θ ∈ IRd its `0 norm (it is not actually a norm). It is denoted by
d
X
|θ|0 = 1I(θj 6= 0) .
j=1
We call a vector θ with “small” `0 norm a sparse vector. More precisely, if

|θ|0 ≤ k, we say that θ is a k-sparse vector. We also call support of θ the set

supp(θ) = j ∈ {1, . . . , d} : θj 6= 0
so that |θ|0 = card(supp(θ)) =: | supp(θ)| .

Remark 2.5. The `0 terminology and notation comes from the fact that
d
X
lim |θj |q = |θ|0
q→0+
j=1
Therefore it is really limq→0+ |θ|qq but the notation |θ|00 suggests too much that
it is always equal to 1.
By extension, denote by B0 (k) the `0 ball of IRd , i.e., the set of k-sparse
vectors, defined by
B0 (k) = {θ ∈ IRd : |θ|0 ≤ k} .
ls
In this section, our goal is to control the MSE of θ̂K when K = B0 (k). Note that
d
ls

computing θ̂B0 (k) essentially requires computing k least squares estimators,
which is an exponential number in k. In practice this will be hard (or even
impossible) but it is interesting to understand the statistical properties of this
estimator and to use them as a benchmark.
Theorem 2.6. Fix a positive integer k ≤ d/2. Let K = B0 (k) be set of
k-sparse vectors of IRd and assume that θ∗ ∈ B0 (k). Moreover, assume the
conditions of Theorem 2.2. Then, for any δ > 0, with probability 1 − δ, it holds
σ2 σ2 k σ2

ls d
MSE(Xθ̂B0 (k) ) . log + + log(1/δ) .
n 2k n n
Proof. We begin as in the proof of Theorem 2.2 to get (2.5):
ε> X(θ̂K
ls
− θ∗ )
ls
|Xθ̂K − Xθ∗ |22 ≤ 2ε> X(θ̂K
ls
− θ∗ ) = 2|Xθ̂K
ls
− Xθ∗ |2 .
ls − θ ∗ )|
|X(θ̂K 2
We know that both θ̂K ls

and θ∗ are in B0 (k) so that θ̂K
ls
− θ∗ ∈ B0 (2k). For
any S ⊂ {1, . . . , d}, let XS denote the n × |S| submatrix of X that is obtained
from the columns of Xj , j ∈ S of X. Denote by rS ≤ |S| the rank of XS and
let ΦS = [φ1 , . . . , φrS ] ∈ IRn×rS be an orthonormal basis of the column span
of XS . Moreover, for any θ ∈ IRd , define θ(S) ∈ IR|S| to be the vector with
coordinates θj , j ∈ S. If we denote by Ŝ = supp(θ̂K ls
− θ∗ ), we have |Ŝ| ≤ 2k
rŜ
and there exists ν ∈ IR such that
ls
X(θ̂K − θ∗ ) = XŜ (θ̂K
ls
(Ŝ) − θ∗ (Ŝ)) = ΦŜ ν .
Therefore,
ε> X(θ̂K
ls
− θ∗ ) ε> ΦŜ ν
= ≤ max sup [ε> ΦS ]u
ls − θ ∗ )|
|X(θ̂K 2 |ν|2 |S|=2k u∈BrS
2
where B2rS rS
is the unit ball of IR . It yields
ls
|Xθ̂K − Xθ∗ |22 ≤ 4 max sup (ε̃> 2
S u) ,
|S|=2k u∈BrS
2
with ε̃S = Φ> 2

S ε ∼ subGrS (σ ).
Using a union bound, we get for any t > 0,
X
IP max sup (ε̃> u)2 > t ≤ sup (ε̃> u)2 > t

IP
|S|=2k r
u∈B2S
r
u∈B2S
|S|=2k
It follows from the proof of Theorem 1.19 that for any |S| ≤ 2k,
t t
sup (ε̃> u)2 > t ≤ 6|S| e− 8σ2 ≤ 62k e− 8σ2 .

IP
r
u∈B2S
Together, the above three displays yield

d t
ls
IP(|Xθ̂K − Xθ∗ |22 > 4t) ≤ 62k e− 8σ2 . (2.6)
2k
To ensure that the right-hand side of the above inequality is bounded by δ, we

need
n d o
t ≥ Cσ 2 log + k log(6) + log(1/δ) .
2k
d

How large is log 2k ? It turns out that it is not much larger than k.
Lemma 2.7. For any integers 1 ≤ k ≤ n, it holds

n en k
≤ .
k k
Proof. Observe first that if k = 1, since n ≥ 1, it holds,

en 1
n
= n ≤ en =
1 1
Next, we proceed by induction and assume that it holds for some k ≤ n − 1

that
n en k
≤ .
k k
Observe that
ek nk+1

n n n − k en k n − k 1 k
= ≤ ≤ 1 + ,
k+1 k k+1 k k+1 (k + 1)k+1 k
where we used the induction hypothesis in the first inequality. To conclude, it

suffices to observe that k
1
1+ ≤ e.
k
It immediately leads to the following corollary:
2.3. The Gaussian Sequence Model 49
Corollary 2.8. Under the assumptions of Theorem 2.6, for any δ > 0, with
probability at least 1 − δ, it holds
σ2 k ed σ 2 k σ2
MSE(Xθ̂Bls0 (k) ) . log + log(6) + log(1/δ) .
n 2k n n
Note that for any fixed δ, there exits a constant Cδ > 0 such that for any
n ≥ 2k, with high probability,
σ2 k ed
MSE(Xθ̂Bls0 (k) ) ≤ Cδ log .
n 2k
Comparing this result with Theorem 2.2 with r = k, we see that the price to
pay for not knowing the support of θ∗ but only its size, is a logarithmic factor
in the dimension d.
This result immediately leads the following bound in expectation.
Corollary 2.9. Under the assumptions of Theorem 2.6,
σ2 k ed
IE MSE(Xθ̂Bls0 (k) ) . log .
n k
Proof. It follows from (2.6) that for any H ≥ 0,
Z ∞
− Xθ∗ |22 > nu)du

IE MSE(Xθ̂Bls0 (k) ) = ls
IP(|Xθ̂K
0
Z ∞
≤H+ ls
IP(|Xθ̂K − Xθ∗ |22 > n(u + H))du
0
2k Z ∞
X d n(u+H)
≤H+ 62k
e− 32σ 2 du
j=1
j 0
2k
X d nH 32σ 2
=H+ 62k e− 32σ2 .
j=1
j n
Next, take H to be such that

2k
X d nH
62k e− 32σ2 = 1 .
j=1
j
This yields
σ2 k ed
H. log ,
n k
which completes the proof.
2.3 THE GAUSSIAN SEQUENCE MODEL
The Gaussian Sequence Model is a toy model that has received a lot of
attention, mostly in the eighties. The main reason for its popularity is that
it carries already most of the insight of nonparametric estimation. While the

model looks very simple it allows to carry deep ideas that extend beyond its
framework and in particular to the linear regression model that we are inter-
ested in. Unfortunately we will only cover a small part of these ideas and
the interested reader should definitely look at the excellent books by A. Tsy-
bakov [Tsy09, Chapter 3] and I. Johnstone [Joh11].
The model is as follows:
Yi = θi∗ + εi , i = 1, . . . , d , (2.7)
where ε1 , . . . , εd are i.i.d. N (0, σ 2 ) random variables. Note that often, d is
taken equal to ∞ in this sequence model and we will also discuss this case. Its
links to nonparametric estimation will become clearer in Chapter 3. The goal
here is to estimate the unknown vector θ∗ .
The sub-Gaussian Sequence Model

Note first that the model (2.7) is a special case of the linear model with fixed
design (2.1) with n = d and f (xi ) = x> ∗
i θ , where x1 , . . . , xn form the canonical
n
basis e1 , . . . , en of IR and ε has a Gaussian distribution. Therefore, n = d is
both the dimension of the parameter θ and the number of observations and it
looks like we have chosen to index this problem by d rather than n somewhat
arbitrarily. We can bring n back into the picture, by observing that this model
encompasses slightly more general choices for the design matrix X as long as
it satisfies the following assumption.
Assumption ORT The design matrix satisfies
X> X
= Id ,
n
where Id denotes the identity matrix of IRd .
Assumption ORT allows for cases where d ≤ n but not d > n (high dimensional
case) because of obvious rank constraints. In particular,
√ it means that the d
columns of X are orthogonal in IRn and all have norm n.
Under this assumption, it follows from the linear regression model (2.2)
that
1 > X> X ∗ 1 >
y := X Y = θ + X ε
n n n
= θ∗ + ξ ,
where ξ = (ξ1 , . . . , ξd ) ∼ subGd (σ 2 /n) (If ε is Gaussian, we even have ξ ∼
Nd (0, σ 2 /n)). As a result, under the assumption ORT, when ξ is Gaussian, the
linear regression model (2.2) is equivalent to the Gaussian Sequence Model (2.7)
up to a transformation of the data Y and a rescaling of the variance. Moreover,
for any estimator θ̂ ∈ IRd , under ORT, it follows from (2.3) that
X> X
MSE(Xθ̂) = (θ̂ − θ∗ )> (θ̂ − θ∗ ) = |θ̂ − θ∗ |22 .
n
Furthermore, for any θ ∈ IRd , the assumption ORT yields,

1
|y − θ|22 = | X> Y − θ|22
n
2 1
= |θ|22 − θ> X> Y + 2 Y > XX> Y
n n
1 2 1
= |Xθ|22 − (Xθ)> Y + |Y |22 + Q
n n n
1
= |Y − Xθ|22 + Q , (2.8)
n
where Q is a constant that does not depend on θ and is defined by
1 > > 1
Q= Y XX Y − |Y |22
n2 n
This implies in particular that the least squares estimator θ̂ls is equal to y.
We introduce a sightly more general model called sub-Gaussian sequence

model :
y = θ∗ + ξ ∈ IRd , (2.9)
2
where ξ ∼ subGd (σ /n).
In this section, we can actually completely “forget” about our original
model (2.2). In particular we can define this model independently of Assump-
tion ORT and thus for any values of n and d.
The sub-Gaussian sequence model and the Gaussian sequence model are
called direct (observation) problems as opposed to inverse problems where the
goal is to estimate the parameter θ∗ only from noisy observations of its image
through an operator. The linear regression model is one such inverse problem
where the matrix X plays the role of a linear operator. However, in these notes,
we never try to invert the operator. See [Cav11] for an interesting survey on
the statistical theory of inverse problems.
Sparsity adaptive thresholding estimators

If we knew a priori that θ was k-sparse, we could directly employ Corollary 2.8
to obtain that with probability at least 1 − δ, we have
σ2 k ed
MSE(Xθ̂Bls0 (k) ) ≤ Cδ log .
n 2k
As we will see, the assumption ORT gives us the luxury to not know k and yet
adapt to its value. Adaptation means that we can construct an estimator that
does not require the knowledge of k (the smallest such that |θ∗ |0 ≤ k) and yet
perform as well as θ̂Bls0 (k) , up to a multiplicative constant.
Let us begin with some heuristic considerations to gain some intuition.
Assume the sub-Gaussian sequence model (2.9). If nothing is known about θ∗
it is natural to estimate it using the least squares estimator θ̂ls = y. In this

case,
σ2 d
MSE(Xθ̂ls ) = |y − θ∗ |22 = |ξ|22 ≤ Cδ ,
n
where the last inequality holds with probability at least 1 − δ. This is actually
what we are looking for if k = Cd for some positive constant C ≤ 1. The
problem with this approach is that it does not use the fact that k may be much
smaller than d, which happens when θ∗ has many zero coordinates.
If θj∗ = 0, then, yj = ξj , which is a sub-Gaussian random variable with vari-
ance proxy σ 2 /n. In particular, we know from Lemma 1.3 that with probability
at least 1 − δ, r
2 log(2/δ)
|ξj | ≤ σ =τ. (2.10)
n
The consequences of this inequality are interesting. One the one hand, if we
observe |yj | τ , then it must correspond to θj∗ 6= 0. On the other hand, if
|yj | ≤ τ is smaller, then, θj∗ cannot be very large. In particular, by the triangle
inequality, |θj∗ | ≤ |yj | + |ξj | ≤ 2τ . Therefore, we loose at most 2τ by choosing
θ̂j = 0. It leads us to consider the following estimator.
Definition 2.10. The hard thresholding estimator with threshold 2τ > 0
is denoted by θ̂hrd and has coordinates

hrd yj if |yj | > 2τ ,
θ̂j =
0 if |yj | ≤ 2τ ,
for j = 1, . . . , d. In short, we can write θ̂jhrd = yj 1I(|yj | > 2τ ).

From our above consideration, we are tempted to choose τ as in (2.10).
Yet, this threshold is not large enough. Indeed, we need to choose τ such that
|ξj | ≤ τ simultaneously for all j. This can be done using a maximal inequality.
Namely, Theorem 1.14 ensures that with probability at least 1 − δ,
r
2 log(2d/δ)
max |ξj | ≤ σ .
1≤j≤d n
It yields the following theorem.
Theorem 2.11. Consider the linear regression model (2.2) under the assump-
tion ORT or, equivalenty, the sub-Gaussian sequence model (2.9). Then the
hard thresholding estimator θ̂hrd with threshold
r
2 log(2d/δ)
2τ = 2σ (2.11)
n
enjoys the following two properties on the same event A such that IP(A) ≥ 1−δ:
(i) If |θ∗ |0 = k,
k log(2d/δ)
MSE(Xθ̂hrd ) = |θ̂hrd − θ∗ |22 . σ 2 .
n
(ii) if minj∈supp(θ∗ ) |θj∗ | > 3τ , then
supp(θ̂hrd ) = supp(θ∗ ) .
Proof. Define the event n

A= max |ξj | ≤ τ ,
j
and recall that Theorem 1.14 yields IP(A) ≥ 1 − δ. On the event A, the
following holds for any j = 1, . . . , d.
First, observe that
|yj | > 2τ ⇒ |θj∗ | ≥ |yj | − |ξj | > τ (2.12)
and
|yj | ≤ 2τ ⇒ |θj∗ | ≤ |yj | + |ξj | ≤ 3τ. (2.13)
It yields
|θ̂jhrd − θj∗ | = |yj − θj∗ |1I(|yj | > 2τ ) + |θj∗ |1I(|yj | ≤ 2τ )
≤ τ 1I(|yj | > 2τ ) + |θj∗ |1I(|yj | ≤ 2τ )
≤ τ 1I(|θj∗ | > τ ) + |θj∗ |1I(|θj∗ | ≤ 3τ ) by (2.12) and (2.13)
≤ 4 min(|θj∗ |, τ )
It yields
d
X d
X
|θ̂hrd − θ∗ |22 = |θ̂jhrd − θj∗ |2 ≤ 16 min(|θj∗ |2 , τ 2 ) ≤ 16|θ∗ |0 τ 2 .
j=1 j=1
This completes the proof of (i).

To prove (ii), note that if θj∗ 6= 0, then |θj∗ | > 3τ so that
|yj | = |θj∗ + ξj | > 3τ − τ = 2τ .
Therefore, θ̂jhrd 6= 0 so that supp(θ∗ ) ⊂ supp(θ̂hrd ).

Next, if θ̂jhrd 6= 0, then |θ̂jhrd | = |yj | > 2τ . It yields
|θj∗ | ≥ |yj | − τ > τ.
Therefore, |θj∗ | =
6 0 and supp(θ̂hrd ) ⊂ supp(θ∗ ).
Similar results can be obtained for the soft thresholding estimator θ̂sft
defined by 
 yj − 2τ if yj > 2τ ,
θ̂jsft = yj + 2τ if yj < −2τ ,
0 if |yj | ≤ 2τ ,

In short, we can write

2τ
θ̂jsft = 1 − yj
|yj | +
2.4. High-dimensional linear regression 54
Hard Soft
2
2
1
1
0
0
y
y
−2 −1
−2 −1
−2 −1 0 1 2 −2 −1 0 1 2
x x
Figure 2.2. Transformation applied to yj with 2τ = 1 to obtain the hard (left) and soft
(right) thresholding estimators
2.4 HIGH-DIMENSIONAL LINEAR REGRESSION
The BIC and Lasso estimators

It can be shown (see Problem 2.6) that the hard and soft thresholding es-
timators are solutions of the following penalized empirical risk minimization
problems:
n o
θ̂hrd = argmin |y − θ|22 + 4τ 2 |θ|0
θ∈IRd
n o
θ̂sft = argmin |y − θ|22 + 4τ |θ|1
θ∈IRd
In view of (2.8), under the assumption ORT, the above variational definitions
can be written as
n1 o
θ̂hrd = argmin |Y − Xθ|22 + 4τ 2 |θ|0
θ∈IRd n
n1 o
θ̂sft = argmin |Y − Xθ|22 + 4τ |θ|1
θ∈IRd n
When the assumption ORT is not satisfied, they no longer correspond to thresh-
olding estimators but can still be defined as above. We change the constant in
the threshold parameters for future convenience.
Definition 2.12. Fix τ > 0 and assume the linear regression model (2.2). The
BIC2 estimator of θ∗ is defined by any θ̂bic such that

n1 o
θ̂bic ∈ argmin |Y − Xθ|22 + τ 2 |θ|0 .
θ∈IRd n
Moreover, the Lasso estimator of θ∗ is defined by any θ̂L such that

n1 o
θ̂L ∈ argmin |Y − Xθ|22 + 2τ |θ|1 .
θ∈IRd n
Remark 2.13 (Numerical considerations). Computing the BIC estimator can

be proved to be NP-hard in the worst case. In particular, no computational
method is known to be significantly faster than the brute force search among
all 2d sparsity patterns. Indeed, we can rewrite:
n1 o n 1 o
min |Y − Xθ|22 + τ 2 |θ|0 = min min |Y − Xθ|22 + τ 2 k
θ∈IRd n 0≤k≤d θ : |θ|0 =k n
To compute minθ : |θ|0 =k n1 |Y − Xθ|22 , we need to compute kd least squares

estimators on a space of size k. Each costs O(k 3 ) (matrix inversion). Therefore

the total cost of the brute force search is
d
X d 3
C k = Cd3 2d .
k
k=0
By contrast, computing the Lasso estimator is a convex problem and there

exist many efficient algorithms to solve it. We will not describe this optimiza-
tion problem in details but only highlight a few of the best known algorithms:
1. Probably the most popular method among statisticians relies on coor-
dinate gradient descent. It is implemented in the glmnet package in R
[FHT10],
2. An interesting method called LARS [EHJT04] computes the entire reg-
ularization path, i.e., the solution of the convex problem for all values
of τ . It relies on the fact that, as a function of τ , the solution θ̂L is a
piecewise linear function (with values in IRd ). Yet this method proved
to be too slow for very large problems and has been replaced by glmnet
which computes solutions for values of τ on a grid much faster.
3. The optimization community has made interesting contributions to this
field by using proximal methods to solve this problem. These methods
exploit the structure of the objective function which is of the form smooth
(sum of squares) + simple (`1 norm). A good entry point to this literature
is perhaps the FISTA algorithm [BT09].
2 Note that it minimizes the Bayes Information Criterion (BIC) employed in the tradi-
p
tional literature of asymptotic statistics if τ = log(d)/n. We will use the same value below,
up to multiplicative constants (it’s the price to pay to get non asymptotic results).
4. Recently there has been a lot of interest around this objective for very
large d and very large n. In this case, even computing |Y − Xθ|22 may
be computationally expensive and solutions based on stochastic gradient
descent are flourishing.
Note that by Lagrange duality, computing θ̂L is equivalent to solving an

`1 constrained least squares. Nevertheless, the radius of the `1 constraint is
unknown. In general it is hard to relate Lagrange multipliers to the size con-
straints. The name “Lasso” was originally given to the constrained version this
estimator in the original paper of Robert Tibshirani [Tib96].
Analysis of the BIC estimator

While computationally hard to implement, the BIC estimator gives us a good
benchmark for sparse estimation. Its performance is similar to that of θ̂hrd but
without requiring the assumption ORT.
Theorem 2.14. Assume that the linear model (2.2) holds, where ε ∼ subGn (σ 2 )
and that |θ∗ |0 ≥ 1. Then, the BIC estimator θ̂bic with regularization parameter
σ2 σ 2 log(ed)
τ 2 = 16 log(6) + 32 . (2.14)
n n
satisfies
1 log(ed/δ)
MSE(Xθ̂bic ) = |Xθ̂bic − Xθ∗ |22 . |θ∗ |0 σ 2 (2.15)
n n
with probability at least 1 − δ.
Proof. We begin as usual by noting that

1 1
|Y − Xθ̂bic |22 + τ 2 |θ̂bic |0 ≤ |Y − Xθ∗ |22 + τ 2 |θ∗ |0 .
n n
It implies
|Xθ̂bic − Xθ∗ |22 ≤ nτ 2 |θ∗ |0 + 2ε> X(θ̂bic − θ∗ ) − nτ 2 |θ̂bic |0 .
First, note that

Xθ̂bic − Xθ∗
2ε> X(θ̂bic − θ∗ ) = 2ε> |Xθ̂bic − Xθ∗ |2
|Xθ̂bic − Xθ∗ |2
h Xθ̂bic − Xθ∗ i2 1
≤ 2 ε> + |Xθ̂bic − Xθ∗ |22 ,
|Xθ̂bic − Xθ∗ |2 2
where we use the inequality 2ab ≤ 2a2 + 12 b2 . Together with the previous
display, it yields
2
|Xθ̂bic − Xθ∗ |22 ≤ 2nτ 2 |θ∗ |0 + 4 ε> U(θ̂bic − θ∗ ) − 2nτ 2 |θ̂bic |0

(2.16)
where
z
U(z) =
|z|2
Next, we need to “sup out” θ̂bic . To that end, we decompose the sup into a
max over cardinalities as follows:
sup = max max sup .

θ∈IRd 1≤k≤d |S|=k supp(θ)=S
Applied to the above inequality, it yields

2
4 ε> U(θ̂bic − θ∗ ) − 2nτ 2 |θ̂bic |0

2
≤ max max sup 4 ε> U(θ − θ∗ ) − 2nτ 2 k

1≤k≤d |S|=k supp(θ)=S
2
sup 4 ε> ΦS,∗ u − 2nτ 2 k ,

≤ max max
1≤k≤d |S|=k r
u∈B2S,∗
where ΦS,∗ = [φ1 , . . . , φrS,∗ ] is an orthonormal basis of the set {Xj , j ∈ S ∪

supp(θ∗ )} of columns of X and rS,∗ ≤ |S| + |θ∗ |0 is the dimension of this
column span.
Using union bounds, we get for any t > 0,
2
IP max max sup 4 ε> ΦS,∗ u − 2nτ 2 k ≥ t

1≤k≤d |S|=k r
u∈B2S,∗
d X
X > 2 t 1
≤ IP sup ε ΦS,∗ u ≥ + nτ 2 k
r
u∈B S,∗
4 2
k=1 |S|=k 2
Moreover, using the ε-net argument from Theorem 1.19, we get for |S| = k,
t
2 t 1 + 1 nτ 2 k
sup ε> ΦS,∗ u ≥ + nτ 2 k ≤ 2 · 6rS,∗ exp − 4 2 2

IP
r
u∈B2S,∗
4 2 8σ
t nτ 2 k
≤ 2 exp − 2
− + (k + |θ∗ |0 ) log(6)
32σ 16σ 2
t ∗

≤ exp − − 2k log(ed) + |θ |0 log(12)
32σ 2
where, in the last inequality, we used the definition (2.14) of τ .
Putting everything together, we get

IP |Xθ̂bic − Xθ∗ |22 ≥ 2nτ 2 |θ∗ |0 + t ≤
d X
X t ∗

exp − − 2k log(ed) + |θ | 0 log(12)
32σ 2
k=1 |S|=k
d
X d t ∗

= exp − − 2k log(ed) + |θ | 0 log(12)
k 32σ 2
k=1
d
X t
≤ exp − 2
− k log(ed) + |θ∗ |0 log(12) by Lemma 2.7
32σ
k=1
d
X t
= (ed)−k exp − + |θ ∗
| 0 log(12)
32σ 2
k=1
t
≤ exp − + |θ∗ |0 log(12) .
32σ 2
To conclude the proof, choose t = 32σ 2 |θ∗ |0 log(12)+32σ 2 log(1/δ) and observe
that combined with (2.16), it yields with probability 1 − δ,
|Xθ̂bic − Xθ∗ |22 ≤ 2nτ 2 |θ∗ |0 + t

= 64σ 2 log(ed)|θ∗ |0 + 64 log(12)σ 2 |θ∗ |0 + 32σ 2 log(1/δ)
≤ 224|θ∗ |0 σ 2 log(ed) + 32σ 2 log(1/δ) .
It follows from Theorem 2.14 that θ̂bic adapts to the unknown sparsity of
∗
θ , just like θ̂hrd . Moreover, this holds under no assumption on the design
matrix X.
Analysis of the Lasso estimator

Slow rate for the Lasso estimator
The properties of the BIC estimator are quite impressive. It shows that under
no assumption on X, one can mimic two oracles: (i) the oracle that knows the
support of θ∗ (and computes least squares on this support), up to a log(ed)
term and (ii) the oracle that knows the sparsity |θ∗ |0 of θ∗ , up to a smaller loga-
rithmic term log(ed/|θ∗ |0 ), which is subsumed by log(ed). Actually, the log(ed)
can even be improved to log(ed/|θ∗ |0 ) by using a modified BIC estimator (see
Problem 2.7).
The Lasso estimator is a bit more difficult to analyze because, by construc-
tion, it should more naturally adapt to the unknown `1 -norm of θ∗ . This can
be easily shown as in the next theorem, analogous to Theorem 2.4.
Theorem 2.15. Assume that the linear model (2.2) holds where ε ∼ subGn (σ 2 ).
Moreover, assume
√ that the columns of X are normalized in such a way that
maxj |Xj |2 ≤ n. Then, the Lasso estimator θ̂L with regularization parameter
r r
2 log(2d) 2 log(1/δ)
2τ = 2σ + 2σ (2.17)
n n
satisfies
r r
L 1 L ∗ 2 ∗ 2 log(2d) ∗ 2 log(1/δ)
MSE(Xθ̂ ) = |Xθ̂ − Xθ |2 ≤ 4|θ |1 σ + 4|θ |1 σ
n n n
Proof. From the definition of θ̂L , it holds
1 1
|Y − Xθ̂L |22 + 2τ |θ̂L |1 ≤ |Y − Xθ∗ |22 + 2τ |θ∗ |1 .
n n
Using Hölder’s inequality, it implies
|Xθ̂L − Xθ∗ |22 ≤ 2ε> X(θ̂L − θ∗ ) + 2nτ |θ∗ |1 − |θ̂L |1

≤ 2|X> ε|∞ |θ̂L |1 − 2nτ |θ̂L |1 + 2|X> ε|∞ |θ∗ |1 + 2nτ |θ∗ |1
= 2(|X> ε|∞ − nτ )|θ̂L |1 + 2(|X> ε|∞ + nτ )|θ∗ |1
Observe now that for any t > 0,
t2
− 2nσ2
IP(|X> ε|∞ ≥ t) = IP( max |X>
j ε| > t) ≤ 2de
1≤j≤d
p p
Therefore, taking t = σ 2n log(2d) + σ 2n log(1/δ) = nτ , we get that with
probability at least 1 − δ,
|Xθ̂L − Xθ∗ |22 ≤ 4nτ |θ∗ |1 .
Notice that the regularization parameter (2.17) depends on the confidence
level δ. This not the case for the BIC estimator (see (2.14)).
p
The rate in Theorem 2.15 is of order (log d)/n (slow rate), which is much
slower than the rate of order (log d)/n (fast rate) for the BIC estimator. Here-
after, we show that fast rates can also be achieved by the computationally
efficient Lasso estimator, but at the cost of a much stronger condition on the
design matrix X.
Incoherence
Assumption INC(k) We say that the design matrix X has incoherence k for
some integer k > 0 if
X> X 1
− Id ∞ ≤
n 32k
where the |A|∞ denotes the largest element of A in absolute value. Equivalently,
1. For all j = 1, . . . , d,
|Xj |22 1
−1 ≤ .
n 32k
2. For all 1 ≤ i, j ≤ d, i 6= j, we have
|X>
i Xj | 1
≤ .
n 32k
Note that Assumption ORT arises as the limiting case of INC(k) as k → ∞.

However, while Assumption ORT requires d ≤ n, here we may have d n as
illustrated in Proposition 2.16 below. To that end, we simply have to show
that there exists a matrix that satisfies INC(k) even for d > n. We resort to
the probabilistic method [AS08]. The idea of this method is that if we can find
a probability measure that assigns positive probability to objects that satisfy
a certain property, then there must exist objects that satisfy said property.
In our case, we consider the following probability distribution on random
matrices with entries in {±1}. Let the design matrix X have entries that are
i.i.d. Rademacher (±1) random variables. We are going to show that most
realizations of this random matrix satisfy Assumption INC(k) for large enough
n.
Proposition 2.16. Let X ∈ IRn×d be a random matrix with entries Xij , i =
1, . . . , n, j = 1, . . . , d, that are i.i.d. Rademacher (±1) random variables. Then,
X has incoherence k with probability 1 − δ as soon as
n ≥ 211 k 2 log(1/δ) + 213 k 2 log(d) .
It implies that there exists matrices that satisfy Assumption INC(k) for
n ≥ Ck 2 log(d) ,
for some numerical constant C.
Proof. Let εij ∈ {−1, 1} denote the Rademacher random variable that is on
the ith row and jth column of X.
Note first that the jth diagonal entries of X> X/n are given by
n
1X 2
ε = 1.
n i=1 i,j
Moreover, for j 6= k, the (j, k)th entry of the d × d matrix X> X/n is given by
n n
1X 1 X (j,k)
εi,j εi,k = ξ ,
n i=1 n i=1 i
(j,k) (j,k) (j,k)

where for each pair (j, k), ξi = εi,j εi,k , so that ξ1 , . . . , ξn are i.i.d.
Rademacher random variables.
Therefore, we get that for any t > 0,

n
X> X
1 X (j,k)
IP − Id > t = IP max ξi >t
n ∞ j6=k n
i=1
1X n
(j,k)
X
≤ IP ξi >t (Union bound)
n i=1
j6=k
nt2
X
≤ 2e− 2 (Hoeffding: Theorem 1.9)
j6=k
nt2
≤ 2d2 e− 2 .
Taking now t = 1/(32k) yields
X> X 1 n
IP − Id ∞
> ≤ 2d2 e− 211 k2 ≤ δ
n 32k
for
n ≥ 211 k 2 log(1/δ) + 213 k 2 log(d) .
For any θ ∈ IRd , S ⊂ {1, . . . , d}, define θS to be the vector with coordinates

θj if j ∈ S ,
θS,j =
0 otherwise .
In particular |θ|1 = |θS |1 + |θS c |1 .

The following lemma holds
Lemma 2.17. Fix a positive integer k ≤ d and assume that X satisfies as-
sumption INC(k). Then, for any S ∈ {1, . . . , d} such that |S| ≤ k and any
θ ∈ IRd that satisfies the cone condition
|θS c |1 ≤ 3|θS |1 , (2.18)
it holds
|Xθ|22
|θ|22 ≤ 2 .
n
Proof. We have
|Xθ|22 |XθS |22 |XθS c |22 X> X

= + + 2θS> θS c .
n n n n
We bound each of the three terms separately.
(i) First, if follows from the incoherence condition that
X> X
>
|XθS |22 |θS |21

X X
= θS> θS = |θS |22 + θS> − Id θS ≥ |θS |22 − ,
n n n 32k
(ii) Similarly,
|XθS c |22 |θS c |21 9|θS |21
≥ |θS c |22 − ≥ |θS c |22 − ,
n 32k 32k
where, in the last inequality, we used the cone condition (2.18).
(iii) Finally,
X> X 2 6
2 θS> θS c ≤ |θS |1 |θS c |1 ≤ |θS |21 .
n 32k 32k
where, in the last inequality, we used the cone condition (2.18).
Observe now that it follows from the Cauchy-Schwarz inequality that
|θS |21 ≤ |S||θS |22 .
Thus for |S| ≤ k,
|Xθ|22 16|S| 1
≥ |θS |22 + |θS c |22 − |θS |22 ≥ |θ|22 .
n 32k 2
Fast rate for the Lasso

Theorem 2.18. Fix n ≥ 2. Assume that the linear model (2.2) holds where ε ∼
subGn (σ 2 ). Moreover, assume that |θ∗ |0 ≤ k and that X satisfies assumption
INC(k). Then the Lasso estimator θ̂L with regularization parameter defined by
r r
log(2d) log(1/δ)
2τ = 8σ + 8σ
n n
satisfies
1 log(2d/δ)
MSE(Xθ̂L ) = |Xθ̂L − Xθ∗ |22 . kσ 2 (2.19)
n n
and
log(2d/δ)
|θ̂L − θ∗ |22 . kσ 2 . (2.20)
n
Proof. From the definition of θ̂L , it holds
1 1
|Y − Xθ̂L |22 ≤ |Y − Xθ∗ |22 + 2τ |θ∗ |1 − 2τ |θ̂L |1 .
n n
Adding τ |θ̂L − θ∗ |1 on each side and multiplying by n, we get
|Xθ̂L −Xθ∗ |22 +nτ |θ̂L −θ∗ |1 ≤ 2ε> X(θ̂L −θ∗ )+nτ |θ̂L −θ∗ |1 +2nτ |θ∗ |1 −2nτ |θ̂L |1 .
Applying Hölder’s inequality and using the same steps as in the proof of The-
orem 2.15, we get that with probability 1 − δ, we get
ε> X(θ̂L − θ∗ ) ≤ |ε> X|∞ |θ̂L − θ∗ |1
nτ L
≤ |θ̂ − θ∗ |1 ,
2
where we used the fact that |Xj |22 ≤ n + 1/(32k) ≤ 2n. Therefore, taking
S = supp(θ∗ ) to be the support of θ∗ , we get
|Xθ̂L − Xθ∗ |22 + nτ |θ̂L − θ∗ |1 ≤ 2nτ |θ̂L − θ∗ |1 + 2nτ |θ∗ |1 − 2nτ |θ̂L |1
= 2nτ |θ̂SL − θ∗ |1 + 2nτ |θ∗ |1 − 2nτ |θ̂SL |1
≤ 4nτ |θ̂SL − θ∗ |1 . (2.21)
In particular, it implies that
|θ̂SLc − θS∗ c |1 ≤ 3|θ̂SL − θS∗ |1 . (2.22)
so that θ = θ̂L − θ∗ satisfies the cone condition (2.18). Using now the Cauchy-
Schwarz inequality and Lemma 2.17 respectively, we get, since |S| ≤ k,
r
L ∗
p L ∗
p L ∗ 2k
|θ̂S − θ |1 ≤ |S||θ̂S − θ |2 ≤ |S||θ̂ − θ |2 ≤ |Xθ̂L − Xθ∗ |2 .
n
Combining this result with (2.21), we find
|Xθ̂L − Xθ∗ |22 ≤ 32nkτ 2 .
This concludes the proof of the bound on the MSE. To prove (2.20), we use
Lemma 2.17 once again to get
|θ̂L − θ∗ |22 ≤ 2MSE(Xθ̂L ) ≤ 64kτ 2 .
Note that all we required for the proof was not really incoherence but the
conclusion of Lemma 2.17:
|Xθ|22
inf inf ≥ κ, (2.23)
|S|≤k θ∈CS n|θ|22
where κ = 1/2 and CS is the cone defined by

CS = |θS c |1 ≤ 3|θS |1 .
Condition (2.23) is sometimes called restricted eigenvalue (RE) condition. Its

name comes from the following observation. Note that all k-sparse vectors θ
are in a cone CS with |S| ≤ k so that the RE condition implies that the smallest
eigenvalue of XS satisfies λmin (XS ) ≥ nκ for all S such that |S| ≤ k. Clearly,
the RE condition is weaker than incoherence and it can actually be shown
that a design matrix X of i.i.d. Rademacher random variables satisfies the RE
conditions as soon as n ≥ Ck log(d) with positive probability.
The Slope estimator

We noticed that the BIC estimator can actually obtain a rate of k log(ed/k),
where k = |θ∗ |0 , while the LASSO only achieves k log(ed). This begs the
question whether the same rate can be achieved by an efficiently computable

estimator. The answer is yes, and the estimator achieving this is a slight
modification of the LASSO, where the entries in the `1 -norm are weighted
differently according to their size in order to get rid of a slight inefficiency in
our bounds.
Definition 2.19 (Slope estimator). Let λ = (λ1 , . . . , λd ) be a non-increasing

sequence of positive real numbers, λ1 ≥ λ2 ≥ · · · ≥ λd > 0. For θ =
(θ1 , . . . , θd ) ∈ IRd , let (θ1∗ , . . . , θd∗ ) be a non-increasing rearrangement of the
modulus of the entries, |θ1 |, . . . , |θd |.
We define the sorted `1 norm of θ as
d
X
|θ|∗ = λj θj∗ , (2.24)
j=1
or equivalently as
d
X
|θ|∗ = max λj |θφ(j) |. (2.25)
φ∈Sd
j=1
The Slope estimator is then given by

1
θ̂S ∈ argmin{ |Y − Xθ|22 + 2τ |θ|∗ } (2.26)
θ∈IRd n
for a choice of tuning parameters λ and τ > 0.

Slope stands for Sorted L-One Penalized Estimation and was introduced
in [BvS+ 15], motivated by the quest for a penalized estimation procedure that
could offer a control of false discovery rate for the hypotheses H0,j : θj∗ = 0. We
can check that | . |∗ is indeed a norm and that θ̂S can be computed efficiently,
for example by proximal gradient algorithms, see [BvS+ 15].
In the following, we will consider
p
λj = log(2d/j), j = 1, . . . , d. (2.27)
With this choice, we will exhibit a scaling in τ that leads to the desired high
probability bounds, following the proofs in [BLT16].
We begin with refined bounds on the suprema of Gaussians.
Lemma 2.20. Let g1 , . . . , gd be zero-mean Gaussian variables with variance

at most σ 2 . Then, for any k ≤ d,
 
k 1−3t/8
1 X 2d  2d
IP  2 (gj∗ )2 > t log ≤ . (2.28)
kσ j=1 k k
Proof. We consider the moment generating function to apply a Chernoff bound.

Use Jensen’s inequality on the sum to get

  !
k k
3 X
∗ 2 1X 3(gj∗ )2
IE exp  (g ) ≤ IE exp
8kσ 2 j=1 j k j=1 8σ 2
d
!
1X 3gj2
≤ IE exp
k j=1 8σ 2
2d
≤ ,
k
where we used the bound on the moment generating function of the χ2 distri-
bution from Problem 1.2 to get that IE[exp(3g 2 /8)] = 2 for g ∼ N (0, 1).
The rest follows from a Chernoff bound.
Lemma 2.21. Define [d] := {1, . . . , d}. Under the same assumptions as in
Lemma 2.20,
(g ∗ ) p
sup k ≤ 4 log(1/δ), (2.29)
k∈[d] σλk
with probability at least 1 − δ, for δ < 21 .
Proof. We can estimate (gk∗ )2 by the average of all larger variables,

k
1X ∗ 2
(gk∗ )2 ≤ (g ) . (2.30)
k j=1 j
Applying Lemma 2.20 yields

1−3t/8
(gk∗ )2

2d
IP 2 2 >t ≤ (2.31)
σ λk k
For t > 8,
! d 1−3t/8
(gk∗ )2 X 2d
IP sup 2 2 > t ≤ (2.32)
k∈[d] σ λk j=1
j
d
X 1
≤ (2d)1−3t/8 2
(2.33)
j=1
j
≤ 4 · 2−3t/8 . (2.34)
In turn, this means that

(gk∗ ) p
sup ≤ 4 log(1/δ), (2.35)
k∈[d] σλk
for δ ≤ 1/2.
Theorem 2.22. Fix n ≥ 2. Assume that the linear model (2.2) holds where
ε ∼ Nn (0, σ 2 In ). Moreover, assume that |θ∗ |0 ≤ k and that X satisfies as-
sumption INC(k 0 ) with k 0 ≥ 4k log(2de/k). Then the Slope estimator θ̂S with
regularization parameter defined by
r
√ log(1/δ)
τ = 8 2σ (2.36)
n
satisfies
1 k log(2d/k) log(1/δ)
MSE(Xθ̂S ) = |Xθ̂S − Xθ∗ |22 . σ 2 (2.37)
n n
and
k log(2d/k) log(1/δ)
|θ̂S − θ∗ |22 . σ 2 . (2.38)
n
Remark 2.23. The multplicative depedence on log(1/δ) in (2.37) is an artifact

of the simplified proof we give here and can be improved to an additive one
similar to the lasso case, (2.19), by appealing to stronger and more general
concentration inequalities. For more details, see [BLT16].
Proof. By minimality of θ̂S in (2.26), adding nτ |θ̂S − θ∗ |∗ on both sides,
|Xθ̂S −Xθ∗ |22 +nτ |θ̂S −θ∗ |∗ ≤ 2ε> X(θ̂S −θ∗ )+nτ |θ̂S −θ∗ |∗ +2τ n|θ∗ |∗ −2τ n|θ̂S |∗ .
(2.39)
Set
u := θ̂S − θ∗ , gj = (X> ε)j , (2.40)
By Lemma 2.21, we can estimate
d
X d
X
ε> Xu = (X> ε)j uj ≤ gj∗ u∗j (2.41)
j=1 j=1
d
X gj∗
= (λj u∗j ) (2.42)
j=1
λ j
∗
gj
≤ sup |u|∗ (2.43)
j λ j
√ p nτ
≤ 4 2σ n log(1/δ)|u|∗ = |u|∗ , (2.44)
2
where we used that |Xj |22 ≤ 2n.
Pk
Pick a permutation φ such that |θ∗ |∗ = j=1 λj |θφ(j) | and |uφ(k+1) | ≥ · · · ≥
|uφ(d) |. Then, noting that λj is monotonically decreasing,
k
X d
X
|θ∗ |∗ − |θ̂S |∗ = ∗
λj |θφ(j) |− λj (θ̂S )∗j (2.45)
j=1 j=1
k
X d
X
∗ S S
≤ λj (|θφ(j) | − |θ̂φ(j) |) − λj |θ̂φ(j) | (2.46)
j=1 j=k+1
k
X d
X
≤ λj |uφ(j) | − λj u∗j (2.47)
j=1 j=k+1
k
X d
X
≤ λj u∗j − λj u∗j . (2.48)
j=1 j=k+1
√ p
Combined with τ = 8 2σ log(1/δ)/n and the basic inequality (2.39), we
have that
|Xθ̂S − Xθ∗ |22 + nτ |u|∗ ≤ 2nτ |u|∗ + 2τ n|θ∗ |∗ − 2τ n|θ̂S |∗ (2.49)

k
X
≤ 4nτ λj u∗j , (2.50)
j=1
whence
d
X k
X
λj u∗j ≤ 3 λj u∗j . (2.51)
j=k+1 j=1
We can now repeat the incoherence arguments from Lemma 2.17, with S
being the k largest entries of |u|, to get the same conclusion under the restriction
INC(k 0 ). First, by exactly the same argument as in Lemma 2.17, we have
|XuS |22 |uS |21 k

≥ |uS |22 − 0
≥ |uS |22 − |uS |22 . (2.52)
n 32k 32k 0
Next, for the cross terms, we have

k d
X> X 2 X ∗ X ∗
2 u>
S uS c ≤ ui uj (2.53)
n 32k 0 i=1
j=k+1
k d
2 X ∗ X λj ∗
= u u (2.54)
32k 0 i=1 i λj j
j=k+1
k d
2 X ∗ X
≤ u λj u∗j (2.55)
32k 0 λd i=1 i
j=k+1
k k
6 X ∗X
≤ u λj u∗j (2.56)
32k 0 λd i=1 i j=1
√ k
!1/2 k
6 k X X
≤ λ2j (u∗i )2 (2.57)
32k 0 λd i=1 i=1
6k p
≤ log(2de/k)|uS |22 . (2.58)
32k 0 λd
where we estimated the sum by
k
X 2d
log = k log(2d) − log(k!) ≤ k log(2d) − k log(k/e) (2.59)
j=1
j
= k log(2de/k), (2.60)
using Stirling’s inequality, k! ≥ (k/e)k .

Finally, from a similar calculation, using the cone condition twice,
|XuS c |22 9k
≥ |uS c |22 − log(2de/k)|uS |22 . (2.61)
n 32k 0 λ2d
Concluding, if INC(k 0 ) holds with k 0 ≥ 4k log(2de/k), taking into account

that λd ≥ 1/2, we have
|Xu|22 36 + 12 + 1 1
≥ |uS |22 + |uS c |22 − |uS |22 ≥ |u|22 . (2.62)
n 128 2
Hence, from (2.50),
 1/2  1/2
k
X Xk
|Xu|22 + nτ |u|∗ ≤ 4nτ  λ2j   (u∗j )2  (2.63)
j=1 j=1
√ p
≤ 4 2τ nk log(2de/k)|Xu|2 (2.64)
p p
= 26 σ log(1/δ) k log(2de/k)|Xu|2 , (2.65)
which combined yields

1 k
|Xu|22 ≤ 212 σ 2 log(2de/k) log(1/δ), (2.66)
n n
and
k
|u|22 ≤ 213 σ 2 log(2de/k) log(1/δ). (2.67)
n
2.5. Problem set 70
2.5 PROBLEM SET
Problem 2.1. Consider the linear regression model with fixed design with
d ≤ n. The ridge regression estimator is employed when rank(X> X) < d but
we are interested in estimating θ∗ . It is defined for a given parameter τ > 0 by
n1 o
θ̂τridge = argmin |Y − Xθ|22 + τ |θ|22 .
θ∈IRd n
(a) Show that for any τ , θ̂τridge is uniquely defined and give its closed form
expression.
(b) Compute the bias of θ̂τridge and show that it is bounded in absolute value
by |θ∗ |2 .
Problem 2.2. Let X = (1, Z, . . . , Z d−1 )> ∈ IRd be a random vector where Z
is a random variable. Show that the matrix IE(XX > ) is positive definite if Z
admits a probability density with respect to the Lebesgue measure on IR.
Problem 2.3. Let θ̂hrd be the hard thresholding estimator defined in Defini-
tion 2.10.
1. Show that
|θ̂hrd − θ∗ |22
|θ̂hrd |0 ≤ |θ∗ |0 +
4τ 2
2. Conclude that if τ is chosen as in Theorem 2.11, then
|θ̂hrd |0 ≤ C|θ∗ |0
with probability 1 − δ for some constant C to be made explicit.

Problem 2.4. In the proof of Theorem 2.11, show that 4 min(|θj∗ |, τ ) can be
replaced by 3 min(|θj∗ |, τ ), i.e., that on the event A, it holds
|θ̂jhrd − θj∗ | ≤ 3 min(|θj∗ |, τ ) .
Problem 2.5. For any q > 0, a vector θ ∈ IRd is said to be in a weak `q ball
of radius R if the decreasing rearrangement |θ[1] | ≥ |θ[2] | ≥ . . . satisfies
|θ[j] | ≤ Rj −1/q .
Moreover, we define the weak `q norm of θ by
|θ|w`q = max j 1/q |θ[j] |.

1≤j≤d
(a) Give examples of θ, θ0 ∈ IRd such that
|θ + θ0 |w`1 > |θ|w`1 + |θ0 |w`1
What do you conclude?

2.5. Problem set 71
(b) Show that |θ|w`q ≤ |θ|q .

(c) Given a sequence θ1 , θ2 , . . ., show that if limd→∞ |θ{1,...,d} |w`q < ∞, then
limd→∞ |θ{1,...,d} |q0 < ∞ for all q 0 > q.
(d) Show that, for any q ∈ (0, 2) if limd→∞ |θ{1,...,d} |w`q = C, there exists
a constant Cq > 0 that depends on q but not on d such that under the
assumptions of Theorem 2.11, it holds
σ 2 log 2d 1− q2
|θ̂hrd − θ∗ |22 ≤ Cq
n
with probability .99.
Problem 2.6. Show that
n o
θ̂hrd = argmin |y − θ|22 + 4τ 2 |θ|0
θ∈IRd
n o
θ̂sft = argmin |y − θ|22 + 4τ |θ|1
θ∈IRd
Problem 2.7. Assume the linear model (2.2) with ε ∼ subGn (σ 2 ) and θ∗ 6= 0.
Show that the modified BIC estimator θ̂ defined by
n1 ed o
θ̂ ∈ argmin |Y − Xθ|22 + λ|θ|0 log
θ∈IRd n |θ|0
satisfies
ed

∗ 2
log |θ ∗ |0
MSE(Xθ̂) . |θ |0 σ .
n
with probability .99, for appropriately chosen λ. What do you conclude?
Problem 2.8. Assume that the linear model (2.2) holds where ε ∼ subGn (σ 2 ).
Moreover, assume the conditions of Theorem 2.2 √ and that the columns of X
are normalized in such a way that maxj |Xj |2 ≤ n. Then the Lasso estimator
θ̂L with regularization parameter
r
2 log(2d)
2τ = 8σ ,
n
satisfies
|θ̂L |1 ≤ C|θ∗ |1
with probability 1 − (2d)−1 for some constant C to be specified.
Chapter
3
Misspecified Linear Models
Arguably, the strongest assumption that we made in Chapter 2 is that the

regression function f (x) is of the form f (x) = x> θ∗ . What if this assumption
is violated? In reality, we do not really believe in the linear model and we hope
that good statistical methods should be robust to deviations from this model.
This is the problem of model misspecification, and in particular misspecified
linear models.
Throughout this chapter, we assume the following model:
Yi = f (Xi ) + εi , i = 1, . . . , n , (3.1)
where ε = (ε1 , . . . , εn )> is sub-Gaussian with variance proxy σ 2 . Here Xi ∈ IRd .

When dealing with fixed design, it will be convenient to consider the vector
g ∈ IRn defined for any function g : IRd → IR by g = (g(X1 ), . . . , g(Xn ))> . In
this case, we can write for any estimator fˆ ∈ IRn of f ,
1 ˆ
MSE(fˆ) = |f − f |22 .
n
Even though the model may not be linear, we are interested in studying the
statistical properties of various linear estimators introduced in the previous
ls ls bic L
chapters: θ̂ls , θ̂K , θ̃X , θ̂ , θ̂ . Clearly, even with an infinite number of obser-
vations, we have no chance of finding a consistent estimator of f if we don’t
know the correct model. Nevertheless, as we will see in this chapter, something
can still be said about these estimators using oracle inequalities.
72
3.1. Oracle inequalities 73
3.1 ORACLE INEQUALITIES
Oracle inequalities
As mentioned in the introduction, an oracle is a quantity that cannot be con-
structed without the knowledge of the quantity of interest, here: the regression
function. Unlike the regression function itself, an oracle is constrained to take
a specific form. For all matter of purposes, an oracle can be viewed as an
estimator (in a given family) that can be constructed with an infinite amount
of data. This is exactly what we should aim for in misspecified models.
When employing the least squares estimator θ̂ls , we constrain ourselves to
estimating functions that are of the form x 7→ x> θ, even though f itself may
not be of this form. Therefore, the oracle fˆ is the linear function that is the
closest to f .
Rather than trying to approximate f by a linear function f (x) ≈ θ> x, we
make the model a bit more general and consider a dictionary H = {ϕ1 , . . . , ϕM }
of functions where ϕj : IRd → IR. In this case, we can actually remove the
assumption that X ∈ IRd . Indeed, the goal is now to estimate f using a linear
combination of the functions in the dictionary:
M
X
f ≈ ϕθ := θj ϕj .
j=1
Remark 3.1. If M = d and ϕj (X) = X (j) returns the jth coordinate of

X ∈ IRd then the goal is to approximate f (x) by θ> x. Nevertheless, the use of
a dictionary allows for a much more general framework.
Note that the use of a dictionary does not affect the methods that we have
been using so far, namely penalized/constrained least squares. We use the
same notation as before and define
1. The least squares estimator:

n
1X 2
θ̂ls ∈ argmin Yi − ϕθ (Xi ) (3.2)
θ∈IRM n i=1
2. The least squares estimator constrained to K ⊂ IRM :

n
ls 1X 2
θ̂K ∈ argmin Yi − ϕθ (Xi )
θ∈K n i=1
3. The BIC estimator:

n
n1 X o
2
θ̂bic ∈ argmin Yi − ϕθ (Xi ) + τ 2 |θ|0 (3.3)
θ∈IRM n i=1
4. The Lasso estimator:

n
n1 X o
2
θ̂L ∈ argmin Yi − ϕθ (Xi ) + 2τ |θ|1 (3.4)
θ∈IRM n i=1
Definition 3.2. Let R(·) be a risk function and let H = {ϕ1 , . . . , ϕM } be a

dictionary of functions from IRd to IR. Let K be a subset of IRM . The oracle
on K with respect to R is defined by ϕθ̄ , where θ̄ ∈ K is such that
R(ϕθ̄ ) ≤ R(ϕθ ) , ∀θ ∈ K .
Moreover, RK = R(ϕθ̄ ) is called oracle risk on K. An estimator fˆ is said

to satisfy an oracle inequality (over K) with remainder term φ in expectation
(resp. with high probability) if there exists a constant C ≥ 1 such that
IER(fˆ) ≤ C inf R(ϕθ ) + φn,M (K) ,

θ∈K
or
IP R(fˆ) ≤ C inf R(ϕθ ) + φn,M,δ (K) ≥ 1 − δ ,

∀δ>0
θ∈K
respectively. If C = 1, the oracle inequality is sometimes called exact.

Our goal will be to mimic oracles. The finite sample performance of an
estimator at this task is captured by an oracle inequality.
Oracle inequality for the least squares estimator

While our ultimate goal is to prove sparse oracle inequalities for the BIC and
Lasso estimator in the case of misspecified model, the difficulty of the extension
to this case for linear models is essentially already captured by the analysis for
the least squares estimator. In this simple case, can even obtain an exact oracle
inequality.
Theorem 3.3. Assume the general regression model (3.1) with ε ∼ subGn (σ 2 ).
Then, the least squares estimator θ̂ls satisfies for some numerical constant
C > 0,
σ2 M
MSE(ϕθ̂ls ) ≤ inf MSE(ϕθ ) + C log(1/δ)
θ∈IRM n
Proof. Note that by definition
|Y − ϕθ̂ls |22 ≤ |Y − ϕθ̄ |22
where ϕθ̄ denotes the orthogonal projection of f onto the linear span of ϕ1 , . . . , ϕM .
Since Y = f + ε, we get
|f − ϕθ̂ls |22 ≤ |f − ϕθ̄ |22 + 2ε> (ϕθ̂ls − ϕθ̄ )

Moreover, by Pythagoras’s theorem, we have
|f − ϕθ̂ls |22 − |f − ϕθ̄ |22 = |ϕθ̂ls − ϕθ̄ |22 .
It yields
|ϕθ̂ls − ϕθ̄ |22 ≤ 2ε> (ϕθ̂ls − ϕθ̄ ) .
Using the same steps as the ones following equation (2.5) for the well specified
case, we get
σ2 M
|ϕθ̂ls − ϕθ̄ |22 . log(1/δ)
n
with probability 1 − δ. The result of the lemma follows.
Sparse oracle inequality for the BIC estimator

The analysis for more complicated estimators such as the BIC in Chapter 2
allow us to derive oracle inequalities for these estimators.
Then, the BIC estimator θ̂bic with regularization parameter
σ2 σ 2 log(ed)
τ 2 = 16 log(6) + 32 (3.5)
n n
satisfies for some numerical constant C > 0,
n Cσ 2 o Cσ 2
MSE(ϕθ̂bic ) ≤ inf 3MSE(ϕθ )+ |θ|0 log(eM/δ) + log(1/δ)
θ∈IRM n n
Proof. Recall that the proof of Theorem 2.14 for the BIC estimator begins as
follows:
1 1
|Y − ϕθ̂bic |22 + τ 2 |θ̂bic |0 ≤ |Y − ϕθ |22 + τ 2 |θ|0 .
n n
This is true for any θ ∈ IRM . It implies
|f − ϕθ̂bic |22 + nτ 2 |θ̂bic |0 ≤ |f − ϕθ |22 + 2ε> (ϕθ̂bic − ϕθ ) + nτ 2 |θ|0 .
Note that if θ̂bic = θ, the result is trivial. Otherwise,

ϕ −ϕ
θ
2ε> (ϕθ̂bic − ϕθ ) = 2ε> θ̂ bic
|ϕθ̂bic − ϕθ |2
|ϕθ̂bic − ϕθ |2
2 h > ϕθ̂bic − ϕθ i2 α
≤ ε + |ϕθ̂bic − ϕθ |22 ,
α |ϕθ̂bic − ϕθ |2 2
2 2
where we used Young’s inequality 2ab ≤ αa + α2 b2 , which is valid for a, b ∈ IR
and α > 0. Next, since
α
|ϕ bic − ϕθ |22 ≤ α|ϕθ̂bic − f |22 + α|ϕθ − f |22 ,
2 θ̂
we get for α = 1/2,

1 3 2
|ϕ bic − f |22 ≤ |ϕθ − f |22 + nτ 2 |θ|0 + 4 ε> U(ϕθ̂bic − ϕθ ) − nτ 2 |θ̂bic |0

2 θ̂ 2
3 2
≤ |ϕθ − f |22 + 2nτ 2 |θ|0 + 4 ε> U(ϕθ̂bic −θ ) − nτ 2 |θ̂bic − θ|0 .

2
We conclude as in the proof of Theorem 2.14.
The interpretation of this theorem is enlightening. It implies that the

BIC estimator will mimic the best tradeoff between the approximation error
MSE(ϕθ ) and the complexity of θ as measured by its sparsity. In particular, this
result, which is sometimes called sparse oracle inequality, implies the following
oracle inequality. Define the oracle θ̄ to be such that
MSE(ϕθ̄ ) = min MSE(ϕθ ).

θ∈IRM
Then, with probability at least 1 − δ,
Cσ 2 h io
MSE(ϕθ̂bic ) ≤ 3MSE(ϕθ̄ ) + |θ̄|0 log(eM ) + log(1/δ)
n
If the linear model happens to be correct, then we simply have MSE(ϕθ̄ ) = 0.
Sparse oracle inequality for the Lasso

To prove an oracle inequality for the Lasso, we need additional assumptions
on the design matrix, such as incoherence. Here, the design matrix is given by
the n × M matrix Φ with elements Φi,j = ϕj (Xi ).
Moreover, assume that there exists an integer k such that the matrix Φ satisfies
assumption INC(k). Then, the Lasso estimator θ̂L with regularization param-
eter given by r r
2 log(2M ) 2 log(1/δ)
2τ = 8σ + 8σ (3.6)
n n
satisfies for some numerical constant C,
n Cσ 2 o
MSE(ϕθ̂L ) ≤ inf MSE(ϕθ ) + |θ|0 log(eM/δ)
θ∈IRM n
|θ|0 ≤k
Proof. From the definition of θ̂L , for any θ ∈ IRM ,

1 1
|Y − ϕθ̂L |22 ≤ |Y − ϕθ |22 + 2τ |θ|1 − 2τ |θ̂L |1 .
n n
Expanding the squares, adding τ |θ̂L − θ|1 on each side and multiplying by n,
we get
|ϕθ̂L − f |22 − |ϕθ − f |22 + nτ |θ̂L − θ|1

≤ 2ε> (ϕθ̂L − ϕθ ) + nτ |θ̂L − θ|1 + 2nτ |θ|1 − 2nτ |θ̂L |1 . (3.7)
√
Next, note that INC(k) for any k ≥ 1 implies that |ϕj |2 ≤ 2 n for all j =
1, . . . , M . Applying Hölder’s inequality, we get that with probability 1 − δ, it
holds that
nτ L
2ε> (ϕθ̂L − ϕθ ) ≤ |θ̂ − θ|1 .
2
Therefore, taking S = supp(θ) to be the support of θ, we get that the right-
hand side of (3.7) is bounded by
|ϕθ̂L − f |22 − |ϕθ − f |22 + nτ |θ̂L − θ|1 ≤ 2nτ |θ̂L − θ|1 + 2nτ |θ|1 − 2nτ |θ̂L |1
= 2nτ |θ̂SL − θ|1 + 2nτ |θ|1 − 2nτ |θ̂SL |1
≤ 4nτ |θ̂SL − θ|1 (3.8)
with probability 1 − δ.
It implies that either MSE(ϕθ̂L ) ≤ MSE(ϕθ ) or that
|θ̂SLc − θS c |1 ≤ 3|θ̂SL − θS |1 .
so that θ = θ̂L − θ satisfies the cone condition (2.18). Using now the Cauchy-
Schwarz inequality and Lemma 2.17, respectively, assuming that |θ|0 ≤ k, we
get p p
4nτ |θ̂SL − θ|1 ≤ 4nτ |S||θ̂SL − θ|2 ≤ 4τ 2n|θ|0 |ϕθ̂L − ϕθ |2 .
2 2
Using now the inequality 2ab ≤ αa + α2 b2 , we get
16τ 2 n|θ|0 α
4nτ |θ̂SL − θ|1 ≤ + |ϕθ̂L − ϕθ |22
α 2
2
16τ n|θ|0
≤ + α|ϕθ̂L − f |22 + α|ϕθ − f |22
α
Combining this result with (3.7) and (3.8), we find
16τ 2 |θ|0
(1 − α)MSE(ϕθ̂L ) ≤ (1 + α)MSE(ϕθ ) + .
α
To conclude the proof, it only remains to divide by 1 − α on both sides of the
above inequality and take α = 1/2.
Maurey’s argument
In there is no sparse θ such that MSE(ϕθ ) is small, Theorem 3.4 is useless
whereas the Lasso may still enjoy slow rates. In reality, no one really be-
lieves in the existence of sparse vectors but rather of approximately sparse
vectors. Zipf’s law would instead favor the existence of vectors θ with abso-
lute coefficients that decay polynomially when ordered from largest to smallest
in absolute value. This is the case for example if θ has a small `1 norm but
is not sparse. For such θ, the Lasso estimator still enjoys slow rates as in
Theorem 2.15, which can be easily extended to the misspecified case (see Prob-
lem 3.2). As a result, it seems that the Lasso estimator is strictly better than
the BIC estimator as long as incoherence holds since it enjoys both fast and
slow rates, whereas the BIC estimator seems to be tailored to the fast rate.
Fortunately, such vectors can be well approximated by sparse vectors in the
following sense: for any vector θ ∈ IRM such that |θ|1 ≤ 1, there exists a vector
θ0 that is sparse and for which MSE(ϕθ0 ) is not much larger than MSE(ϕθ ). The
following theorem quantifies exactly the tradeoff between sparsity and MSE. It
is often attributed to B. Maurey and was published by Pisier [Pis81]. This is
why it is referred to as Maurey’s argument.
Theorem 3.6. Let {ϕ1 , . . . , ϕM } be a dictionary normalized in such a way

that √
max |ϕj |2 ≤ D n .
1≤j≤M
Then for any integer k such that 1 ≤ k ≤ M and any positive R, we have
D 2 R2
min MSE(ϕθ ) ≤ min MSE(ϕθ ) + .
θ∈IRM θ∈IRM k
|θ|0 ≤k |θ|1 ≤R
Proof. Define
θ̄ ∈ argmin |ϕθ − f |22
θ∈IRM
|θ|1 ≤R
and assume without loss of generality that |θ̄1 | ≥ |θ̄2 | ≥ . . . ≥ |θ̄M |.

Let now U ∈ IRn be a random vector with values in {0, ±Rϕ1 , . . . , ±RϕM }
defined by
|θ̄j |
IP(U = R sign(θ̄j )ϕj ) = , j = 1, . . . , M,
R
|θ̄|1
IP(U = 0) = 1 − .
R
√
Note that IE[U ] = ϕθ̄ and |U |2 ≤ RD n. Let now U1 , . . . , Uk be k independent
copies of U and define their average
k
1X
Ū = Ui .
k i=1
Note that Ū = ϕθ̃ for some θ̃ ∈ IRM such that |θ̃|0 ≤ k. Moreover, using the
Pythagorean Theorem,
IE|f − Ū |22 = IE|f − ϕθ̄ + ϕθ̄ − Ū |22

= IE|f − ϕθ̄ |22 + |ϕθ̄ − Ū |22
IE|U − IE[U ]|22
= |f − ϕθ̄ |22 +
√k
2 (RD n)2
≤ |f − ϕθ̄ |2 +
k
To conclude the proof, note that
IE|f − Ū |22 = IE|f − ϕθ̃ |22 ≥ min |f − ϕθ |22

θ∈IRM
|θ|0 ≤k
and divide by n.
Maurey’s argument implies the following corollary.
Corollary 3.7. Assume that the assumptions of Theorem 3.4 hold and that
the dictionary {ϕ1 , . . . , ϕM } is normalized in such a way that
√
max |ϕj |2 ≤ n .
1≤j≤M
Then there exists a constant C > 0 such that the BIC estimator satisfies
r
n h σ 2 |θ| log(eM ) log(eM ) io
0
MSE(ϕθ̂bic ) ≤ inf 2MSE(ϕθ ) + C ∧ σ|θ|1
θ∈IRM n n
2
σ log(1/δ)
+C
n
Proof. Choosing α = 1/3 in Theorem 3.4 yields

n σ 2 |θ|0 log(eM ) o σ 2 log(1/δ)
MSE(ϕθ̂bic ) ≤ 2 inf MSE(ϕθ ) + C +C
θ∈IRM n n
Let θ0 ∈ IRM . It follows from Maurey’s argument that for any k ∈ [M ], there
exists θ = θ(θ0 , k) ∈ IRM such that |θ|0 = k and
2|θ0 |21
MSE(ϕθ ) ≤ MSE(ϕθ0 ) +
k
It implies that
σ 2 |θ|0 log(eM ) 2|θ0 |21 σ 2 k log(eM )

MSE(ϕθ ) + C ≤ MSE(ϕθ0 ) + +C
n k n
and furthermore that

σ 2 |θ|0 log(eM ) 2|θ0 |21 σ 2 k log(eM )
inf MSE(ϕθ ) + C ≤ MSE(ϕθ0 ) + +C
θ∈IRM n k n
Since the above bound holds for any θ0 ∈ RM and k ∈ [M ], we can take an
infimum with respect to both θ0 and k on the right-hand side to get
n σ 2 |θ|0 log(eM ) o
inf MSE(ϕθ ) + C
θ∈IRM n
n |θ0 |2 σ 2 k log(eM ) o
1
≤ inf MSE(ϕθ0 ) + C min +C .
θ 0 ∈IRM k k n
To control the minimum over k, we need to consider three cases for the quantity
|θ0 |1
r
n
k̄ = .
σ log(eM )
1. If 1 ≤ k̄ ≤ M , then we get
r
|θ0 |2 σ 2 k log(eM ) log(eM )
min 1
+C ≤ Cσ|θ0 |1
k k n n
2. If k̄ ≤ 1, then
σ 2 log(eM )
|θ0 |21 ≤ C ,
n
which yields
|θ0 |2 σ 2 k log(eM ) σ 2 log(eM )
1
min +C ≤C
k k n n
3. If k̄ ≥ M , then
σ 2 M log(eM ) |θ0 |21
≤C .
n M
|θ|1
Therefore, on the one hand, if M ≥ √ , we get
σ log(eM )/n
r
|θ0 |2 σ 2 k log(eM ) |θ0 |21 log(eM )
min 1
+C ≤C ≤ Cσ|θ0 |1 .
k k n M n
|θ|1
On the other hand, if M ≤ √ , then for any θ ∈ IRM , we have
σ log(eM )/n
r
σ 2 |θ|0 log(eM ) σ 2 M log(eM ) log(eM )
≤ ≤ Cσ|θ0 |1 .
n n n
3.2. Nonparametric regression 81
Combined,
n |θ0 |2 σ 2 k log(eM ) o
1
inf MSE(ϕθ0 ) + C min +C
θ 0 ∈IRM k k n
r
n log(eM σ 2 log(eM ) o
≤ inf MSE(ϕθ0 ) + Cσ|θ0 |1 +C ,
θ 0 ∈IRM n n
which together with Theorem 3.4 yields the claim.
Note that this last result holds for any estimator that satisfies an oracle
inequality with respect to the `0 norm as in Theorem 3.4. In particular, this
estimator need not be the BIC estimator. An example is the Exponential
Screening estimator of [RT11].
Maurey’s argument allows us to enjoy the best of both the `0 and the
`1 world. The rate adapts to the sparsity of the problem and can be even
generalized to `q -sparsity (see Problem 3.3). However, it is clear from the proof
that this argument is limited to squared `2 norms such as the one appearing
in MSE and extension to other risk measures is non trivial. Some work has
been done for non-Hilbert spaces [Pis81, DDGS97] using more sophisticated
arguments.
3.2 NONPARAMETRIC REGRESSION
So far, the oracle inequalities that we have derived do not deal with the
approximation error MSE(ϕθ ). We kept it arbitrary and simply hoped that
it was small. Note also that in the case of linear models, we simply assumed
that the approximation error was zero. As we will see in this section, this
error can be quantified under natural smoothness conditions if the dictionary
of functions H = {ϕ1 , . . . , ϕM } is chosen appropriately. In what follows, we
assume for simplicity that d = 1 so that f : IR → IR and ϕj : IR → IR.
Fourier decomposition
Historically, nonparametric estimation was developed before high-dimensional
statistics and most results hold for the case where the dictionary H = {ϕ1 , . . . , ϕM }
forms an orthonormal system of L2 ([0, 1]):
Z 1 Z 1
ϕ2j (x)dx = 1 , ϕj (x)ϕk (x)dx = 0, ∀ j 6= k .
0 0
We will also deal with the case where M = ∞.

When H is an orthonormal system, the coefficients θj∗ ∈ IR defined by
Z 1
θj∗ = f (x)ϕj (x)dx ,
0
are called Fourier coefficients of f .

Assume now that the regression function f admits the following decompo-
sition
∞
X
f= θj∗ ϕj .
j=1
There exists many choices for the orthonormal system and we give only two
as examples.
Example 3.8. Trigonometric basis. This is an orthonormal basis of L2 ([0, 1]).
It is defined by
ϕ1 ≡ 1
√
ϕ2k (x)= 2 cos(2πkx) ,
√
ϕ2k+1 (x) = 2 sin(2πkx) ,
for k = 1, 2, . . . and x ∈ [0, 1]. The fact that it is indeed an orthonormal system
can be easily check using trigonometric identities.
The next example has received a lot of attention in the signal (sound, image,
. . . ) processing community.
Example 3.9. Wavelets. Let ψ : IR → IR be a sufficiently smooth and
compactly supported function, called “mother wavelet”. Define the system of
functions
ψjk (x) = 2j/2 ψ(2j x − k) , j, k ∈ Z .
It can be shown that for a suitable ψ, the dictionary {ψj,k , j, k ∈ Z} forms an
orthonormal system of L2 ([0, 1]) and sometimes a basis. In the latter case, for
any function g ∈ L2 ([0, 1]), it holds
∞
X ∞
X Z 1
g= θjk ψjk , θjk = g(x)ψjk (x)dx .
j=−∞ k=−∞ 0
The coefficients θjk are called wavelet coefficients of g.

The simplest example is given by the Haar system obtained by taking ψ
to be the following piecewise constant function (see Figure 3.1). We will not
give more details about wavelets here but simply point the interested reader
to [Mal09]. 
 1 0 ≤ x < 1/2
ψ(x) = −1 1/2 ≤ x ≤ 1
0 otherwise

Sobolev classes and ellipsoids

We begin by describing a class of smooth functions where smoothness is under-
stood in terms of its number of derivatives. Recall that f (k) denotes the k-th
derivative of f .
1
0
y
−1
0.0 0.5 1.0

x
Figure 3.1. The Haar mother wavelet
Definition 3.10. Fix parameters β ∈ {1, 2, . . . } and L > 0. The Sobolev class
of functions W (β, L) is defined by
n
W (β, L) = f : [0, 1] → IR : f ∈ L2 ([0, 1]) , f (β−1) is absolutely continuous and
Z 1 o
[f (β) ]2 ≤ L2 , f (j) (0) = f (j) (1), j = 0, . . . , β − 1
0
Any function f ∈ W (β, L) can represented1 as its Fourier expansion along

the trigonometric basis:
∞
X
f (x) = θ1∗ ϕ1 (x) + ∗ ∗

θ2k ϕ2k (x) + θ2k+1 ϕ2k+1 (x) , ∀ x ∈ [0, 1] ,
k=1
where θ∗ = {θj∗ }j≥1 is in the space of squared summable sequences `2 (IN)

defined by
n X∞ o
`2 (IN) = θ : θj2 < ∞ .
j=1
For any β > 0, define the coefficients
jβ

for j even
aj = (3.9)
(j − 1)β for j odd
With these coefficients, we can define the Sobolev class of functions in terms
of Fourier coefficients.
1 In the sense that
Z 1 k
X
lim |f (t) − θj ϕj (t)|2 dt = 0
k→∞ 0 j=1
Theorem 3.11. Fix β ≥ 1 and L > 0 and let {ϕj }j≥1 denote the trigonometric
basis of L2 ([0, 1]). Moreover, let {aj }j≥1 be defined as in (3.9). A function
f ∈ W (β, L) can be represented as
∞
X
f= θj∗ ϕj ,
j=1
where the sequence {θj∗ }j≥1 belongs to Sobolev ellipsoid of `2 (IN) defined by
n ∞
X o
Θ(β, Q) = θ ∈ `2 (IN) : a2j θj2 ≤ Q
j=1
for Q = L2 /π 2β .
Proof. Let us first define the Fourier coefficients {sk (j)}k≥1 of the jth deriva-
tive f (j) of f for j = 1, . . . , β:
Z 1
s1 (j) = f (j) (t)dt = f (j−1) (1) − f (j−1) (0) = 0 ,
0
√ Z 1
s2k (j) = 2 f (j) (t) cos(2πkt)dt ,
0
√ Z 1
s2k+1 (j) = 2 f (j) (t) sin(2πkt)dt ,
0
The Fourier coefficients of f are given by θk = sk (0).

Using integration by parts, we find that
√ (β−1) 1 √ Z 1 (β−1)
s2k (β) = 2f (t) cos(2πkt) + (2πk) 2 f (t) sin(2πkt)dt
0 0
√ √ Z 1
= 2[f (β−1) (1) − f (β−1) (0)] + (2πk) 2 f (β−1) (t) sin(2πkt)dt
0
= (2πk)s2k+1 (β − 1) .
Moreover,
√ 1 √ Z 1
s2k+1 (β) = 2f (β−1) (t) sin(2πkt) − (2πk) 2 f (β−1) (t) cos(2πkt)dt
0 0
= −(2πk)s2k (β − 1) .
In particular, it yields
s2k (β)2 + s2k+1 (β)2 = (2πk)2 s2k (β − 1)2 + s2k+1 (β − 1)2

By induction, we find that for any k ≥ 1,
s2k (β)2 + s2k+1 (β)2 = (2πk)2β θ2k

2 2

+ θ2k+1
Next, it follows for the definition (3.9) of aj that

∞
X ∞
X ∞
X
2β 2 2 2β
a22k θ2k
2 2β
a22k+1 θ2k+1
2

(2πk) θ2k + θ2k+1 =π +π
k=1 k=1 k=1
∞
X
= π 2β a2j θj2 .
j=1
Together with the Parseval identity, it yields

Z 1 ∞ ∞
2 X X
f (β) (t) dt = s2k (β)2 + s2k+1 (β)2 = π 2β a2j θj2 .
0 k=1 j=1
To conclude, observe that since f ∈ W (β, L), we have

Z 1 2
f (β) (t) dt ≤ L2 ,
0
so that θ ∈ Θ(β, L2 /π 2β ) .
It can actually be shown that the reciprocal is true, that is, any function
with Fourier coefficients in Θ(β, Q) belongs to if W (β, L), but we will not be
needing this.
In what follows, we will define smooth functions as functions with Fourier
coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By
extension, we write f ∈ Θ(β, Q) in this case and consider any real value for β.
Proposition 3.12. The Sobolev ellipsoids enjoy the following properties
(i) For any Q > 0,
0 < β 0 < β ⇒ Θ(β, Q) ⊂ Θ(β 0 , Q).
(ii) For any Q > 0,

1
β> ⇒ f is continuous.
2
The proof is left as an exercise (Problem 3.6).
It turns out that the first n functions in the trigonometric basis are not only
orthonormal with respect to the inner product of L2 , but also Pnwith respect to
the inner predictor associated with fixed design, hf, gi := n1 i=1 f (Xi )g(Xi ),
when the design is chosen to be regular, i.e., Xi = (i − 1)/n, i = 1, . . . , n.
Lemma 3.13. Assume that {X1 , . . . , Xn } is the regular design, i.e., Xi =

(i − 1)/n. Then, for any M ≤ n − 1, the design matrix Φ = {ϕj (Xi )} 1≤i≤n
1≤j≤M
satisfies the ORT condition.
Proof. Fix j, j 0 ∈ {1, . . . , n − 1}, j 6= j 0 , and consider the inner product ϕ>
j ϕj 0 .
Write kj = bj/2c for the integer part of j/2 and define the vectors a, b, a0 , b0 ∈
i2πkj s i2πk 0 s
j
IRn with coordinates such that e n = as+1 +ibs+1 and e n = a0s+1 +ib0s+1
for s ∈ {0, . . . , n − 1}. It holds that
1 >
ϕ ϕj 0 ∈ {a> a0 , b> b0 , b> a0 , a> b0 } ,
2 j
depending on the parity of j and j 0 .
On the one hand, observe that if kj 6= kj 0 , we have for any σ ∈ {−1, +1},
n−1 i2πk 0 s
n−1 i2π(kj +σk 0 )s
X i2πkj s j X j
e n eσ n = e n = 0.
s=0 s=0
On the other hand

n−1 i2πk 0 s
X i2πkj s j
eσ = (a + ib)> (a0 + σib0 ) = a> a0 − σb> b0 + i b> a0 + σa> b0

e n n
s=0
so that a> a0 = ±b> b0 = 0 and b> a0 = ±a> b0 = 0 whenever, kj 6= kj 0 . It yields

ϕ>j ϕj 0 = 0.
Next, consider the case where kj = kj 0 so that
n−1 i2πk 0 s

X i2πkj s
σ j 0 if σ = 1
e n e n = .
n if σ = −1
s=0
On the one hand if j 6= j 0 , it can only be the case that ϕ> > 0 > 0
j ϕj 0 ∈ {b a , a b }
> 0 > 0
but the same argument as above yields b a = ±a b = 0 since the imaginary
part of the inner product is still 0. Hence, in that case, ϕ> j ϕj 0 = 0. On the
other hand, if j = j 0 , then a = a0 and b = b0 so that it yields a> a0 = |a|22 = n
and b> b0 = |b|22 = n which is equivalent to ϕ> 2
j ϕj = |ϕj |2 = n. Therefore, the
design matrix Φ is such that
Φ> Φ = nIM .
Integrated squared error

As mentioned in the introduction of this chapter, the smoothness assumption
allows us to control the approximation error. Before going into the details, let
us gain some insight. Note first that if θ ∈ Θ(β, Q), then a2j θj2 → 0 as j → ∞
so that |θj | = o(j −β ). Therefore, the θj s decay polynomially to zero and it
makes sense to approximate f by its truncated Fourier series
M
X
θj∗ ϕj =: ϕM
θ∗
j=1
for any fixed M . This truncation leads to a systematic error that vanishes as
M → ∞. We are interested in understanding the rate at which this happens.
The Sobolev assumption allows us to control precisely this error as a func-
tion of the tunable parameter M and the smoothness β.
Lemma 3.14. For any integer M ≥ 1, and f ∈ Θ(β, Q), β > 1/2, it holds
X
kϕM 2
θ ∗ − f kL2 = |θj∗ |2 ≤ QM −2β . (3.10)
j>M
and for M = n − 1, we have

X 2
|ϕn−1 2
θ ∗ − f |2 ≤ 2n |θj∗ | . Qn2−2β . (3.11)
j≥n
Proof. Note that for any θ ∈ Θ(β, Q), if β > 1/2, then
∞ ∞
X X 1
|θj | = aj |θj |
j=2 j=2
aj
v
u ∞ 2 2X ∞
uX
1
≤t aj θj 2 by Cauchy-Schwarz
j=2
a
j=2 j
v
u ∞ 1
u X
≤ tQ <∞
j=1
j 2β
Since {ϕj }j forms an orthonormal system in L2 ([0, 1]), we have

X
min kϕθ − f k2L2 = kϕM 2
θ ∗ − f kL2 = |θj∗ |2 .
θ∈IRM
j>M
When θ∗ ∈ Θ(β, Q), we have

X X 1 1 Q
|θj∗ |2 = a2j |θj∗ |2 ≤ 2 Q ≤ 2β .
a2j aM +1 M
j>M j>M
To prove the second part of the lemma, observe that

X
∗
√ X ∗
|ϕn−1
θ ∗ − f |2 = θ j ϕ j 2
≤ 2 2n |θj | ,
j≥n j≥n
where in√the last inequality, we used the fact that for the trigonometric basis
|ϕj |2 ≤ 2n, j ≥ 1 regardless of the choice of the design X1 , . . . , Xn . When
θ∗ ∈ Θ(β, Q), we have
sX s
∗ 1
X X X 1 1
∗
p
|θj | = aj |θj | ≤ 2 ∗
aj |θj |2 . Qn 2 −β .
aj a2j
j≥n j≥n j≥n j≥n
Note that the truncated Fourier series ϕM θ ∗ is an oracle: this is what we

see when we view f through the lens of functions with only low frequency
harmonics.
To estimate ϕθ∗ = ϕM
θ ∗ , consider the estimator ϕθ̂ ls where
n
ls
X 2
θ̂ ∈ argmin Yi − ϕθ (Xi ) ,
θ∈IRM i=1
which should be such that ϕθ̂ls is close to ϕθ∗ . For this estimator, we have
proved (Theorem 3.3) an oracle inequality for the MSE that is of the form
|ϕM
θ̂ ls
− f |22 ≤ inf |ϕM 2 2
θ − f |2 + Cσ M log(1/δ) , C > 0.
θ∈IRM
It yields
M >
|ϕM
θ̂ ls
− ϕM 2 M M 2
θ ∗ |2 ≤ 2(ϕθ̂ ls − ϕθ ∗ ) (f − ϕθ ∗ ) + Cσ M log(1/δ)
X
>
= 2(ϕM
θ̂ ls
− ϕM
θ∗ ) ( θj∗ ϕj ) + Cσ 2 M log(1/δ)
j>M
X
>
= 2(ϕM
θ̂ ls
− ϕM
θ∗ ) ( θj∗ ϕj ) + Cσ 2 M log(1/δ) ,
j≥n
where we used Lemma 3.13 in the last equality. Together with (3.11) and
Young’s inequality 2ab ≤ αa2 + b2 /α, a, b ≥ 0 for any α > 0, we get
>
X C
2(ϕM
θ̂ ls
− ϕM
θ∗ ) ( θj∗ ϕj ) ≤ α|ϕM
θ̂ ls
− ϕM 2
θ ∗ |2 + Qn2−2β ,
α
j≥n
for some positive constant C when θ∗ ∈ Θ(β, Q). As a result,
1 σ2 M
|ϕM
θ̂ ls
− ϕM 2
θ ∗ |2 . Qn2−2β + log(1/δ) (3.12)
α(1 − α) 1−α
for any α ∈ (0, 1). Since, Lemma 3.13 implies, |ϕM θ̂ ls

− ϕM 2 M
θ ∗ |2 = nkϕθ̂ ls −
ϕM 2
θ ∗ kL2 ([0,1]) , we have proved the following theorem.
√
Theorem 3.15. Fix β ≥ (1 + 5)/4 ' 0.81, Q > 0, δ > 0 and assume the
general regression model (3.1) with f ∈ Θ(β, Q) and ε ∼ subGn (σ 2 ), σ 2 ≤ 1.
1
Moreover, let M = dn 2β+1 e and n be large enough so that M ≤ n − 1. Then the
least squares estimator θ̂ls defined in (3.2) with {ϕj }Mj=1 being the trigonometric
basis, satisfies with probability 1 − δ, for n large enough,
2β
kϕM
θ̂ ls
− f k2L2 ([0,1]) . n− 2β+1 (1 + σ 2 log(1/δ)) .
where the constant factors may depend on β, Q and σ.

Proof. Choosing α = 1/2 for example and absorbing Q in the constants, we

get from (3.12) and Lemma 3.13 that for M ≤ n − 1,
M log(1/δ)
kϕM
θ̂ ls
− ϕM 2
θ ∗ kL2 ([0,1]) . n
1−2β
+ σ2 .
n
Using now Lemma 3.14 and σ 2 ≤ 1, we get
M log(1/δ)
kϕM
θ̂ ls
− f k2L2 ([0,1]) . M −2β + n1−2β + σ 2 .
n
1
Taking M = dn 2β+1 e ≤ n − 1 for n large enough yields
2β
kϕM
θ̂ ls
− f k2L2 ([0,1]) . n− 2β+1 + n1−2β σ 2 log(1/δ) .
To conclude the proof, simply note that for the prescribed β, we have n1−2β ≤
2β
n− 2β+1 .
Adaptive estimation
1
The rate attained by the projection estimator ϕθ̂ls with M = dn 2β+1 e is actually
optimal so, in this sense, it is a good estimator. Unfortunately, its implementa-
tion requires the knowledge of the smoothness parameter β which is typically
unknown, to determine the level M of truncation. The purpose of adaptive es-
timation is precisely to adapt to the unknown β, that is to build an estimator
2β
that does not depend on β and yet, attains a rate of the order of Cn− 2β+1 (up
to a logarithmic lowdown). To that end, we will use the oracle inequalities for
the BIC and Lasso estimator defined in (3.3) and (3.4) respectively. In view of
Lemma 3.13, the design matrix Φ actually satisfies the assumption ORT when
we work with the trigonometric basis. This has two useful implications:
1. Both estimators are actually thresholding estimators and can therefore
be implemented efficiently
2. The condition INC(k) is automatically satisfied for any k ≥ 1.
These observations lead to the following corollary.
√
Corollary 3.16. Fix β > (1 + 5)/4 ' 0.81, Q > 0, δ > 0 and n large enough
1
to ensure n − 1 ≥ dn 2β+1 e assume the general regression model (3.1) with
n−1
f ∈ Θ(β, Q) and ε ∼ subGn (σ 2 ), σ 2 ≤ 1. Let {ϕj }j=1 be the trigonometric
n−1 n−1
basis. Denote by ϕθ̂bic (resp. ϕθ̂L ) the BIC (resp. Lasso) estimator defined
in (3.3) (resp. (3.4)) over IRn−1 with regularization parameter given by (3.5)
(resp. (3.6)). Then ϕn−1
θ̂
, where θ̂ ∈ {θ̂bic , θ̂L } satisfies with probability 1 − δ,
2β
kϕn−1
θ̂
− f k2L2 ([0,1]) . (n/ log n)− 2β+1 (1 + σ 2 log(1/δ)) ,
where the constants may depend on β and Q.

Proof. For θ̂ ∈ {θ̂bic , θ̂L }, adapting the proofs of Theorem 3.4 for the BIC
estimator and Theorem 3.5 for the Lasso estimator, for any θ ∈ IRn−1 , with
probability 1 − δ
1 + α n−1
|ϕθ̂n−1 − f |22 ≤ |ϕ − f |22 + R(|θ|0 ) .
1−α θ
where
Cσ 2
R(|θ|0 ) := |θ0 | log(en/δ)
α(1 − α)
It yields
2α
|ϕn−1 − ϕn−1
θ |22 ≤ |ϕn−1 − f |22 + 2(ϕn−1 − ϕn−1 )> (f − ϕn−1 ) + R(|θ|0 )
θ̂ 1−α θ θ̂ θ θ
2α 1 n−1
≤ + |ϕθ − f |22 + α|ϕn−1 − ϕn−1
θ |22 + R(|θ|0 ) ,
1−α α θ̂
where we used Young’s inequality once again. Let M ∈ N be determined later

∗ ∗
and choose now α = 1/2 and θ = θM , where θM is equal to θ∗ on its first M
n−1 M
coordinates and 0 otherwise so that ϕθ∗ = ϕθ∗ . It yields
M
|ϕn−1
θ̂
− ϕn−1 2 n−1 2 n−1 n−1 2 n−1 2
θ ∗ |2 . |ϕθ ∗ − f |2 + R(M ) . |ϕθ ∗ − ϕθ ∗ |2 + |ϕθ ∗ − f |2 + R(M )
M M M
Next, it follows from (3.11) that |ϕn−1

θ∗ − f |22 . Qn2−2β . Together with
Lemma 3.13, it yields
R(M )
kϕn−1 − ϕn−1 2 n−1 n−1 2
θ ∗ kL2 ([0,1]) . kϕθ ∗ − ϕθ ∗ kL2 ([0,1]) + Qn
1−2β
+ .
θ̂ M M n
Moreover, using (3.10), we find that
M 2
kϕn−1 − f k2L2 ([0,1]) . M −2β + Qn1−2β + σ log(en/δ) .
θ̂ n
1
To conclude the proof, choose M = d(n/ log n) 2β+1 e and observe that the choice
of β ensures that n1−2β . M −2β .
While there is sometimes a (logarithmic) price to pay for adaptation, it

turns out that the extra logarithmic factor can be removed by a clever use of
blocks (see [Tsy09, Chapter 3]). The reason why we get this extra logarithmic
factor here is because we use a hammer that’s too big. Indeed, BIC and Lasso
allow for “holes” in the Fourier decomposition, so we only get part of their
potential benefits.
3.3. Problem Set 91
3.3 PROBLEM SET
Problem 3.1. Show that the least-squares estimator θ̂ls defined in (3.2) sat-
isfies the following exact oracle inequality:
M
IEMSE(ϕθ̂ls ) ≤ inf MSE(ϕθ ) + Cσ 2
θ∈IRM n
for some constant C to be specified.
2
Problem 3.2. Assume that ε ∼ subG √ n (σ ) and the vectors ϕj are normalized
in such a way that maxj |ϕj |2 ≤ n. Show that there exists a choice of τ
such that the Lasso estimator θ̂L with regularization parameter 2τ satisfies the
following exact oracle inequality:
r
n log M o
MSE(ϕθ̂L ) ≤ inf MSE(ϕθ ) + Cσ|θ|1
θ∈IRM n
with probability at least 1 − M −c for some positive constants C, c.
Problem 3.3. Let √ {ϕ1 , . . . , ϕM } be a dictionary normalized in such a way
that maxj |ϕj |2 ≤ D n. Show that for any integer k such that 1 ≤ k ≤ M , we
have 1 1 2
M q̄ − k q̄
min MSE(ϕθ ) ≤ min MSE(ϕθ ) + Cq D2 ,
θ∈IRM θ∈IRM k
|θ|0 ≤2k |θ|w`q ≤1
1
where |θ|w`q denotes the weak `q norm and q̄ is such that q + 1q̄ = 1, for q > 1.
Problem 3.4. Show that the trigonometric basis and the Haar system indeed
form an orthonormal system of L2 ([0, 1]).
Problem 3.5. Consider the n × d random matrix Φ = {ϕj (Xi )}1≤i≤n where
1≤j≤d
X1 , . . . , Xn are i.i.d uniform random variables on the interval [0, 1] and φj is
the trigonometric basis as defined in Example 3.8. Show that Φ satisfies INC(k)
with probability at least .9 as long as n ≥ Ck 2 log(d) for some large enough
constant C > 0.
Problem 3.6. If f ∈ Θ(β, Q) for β > 1/2 and Q > 0, then f is continuous.
Chapter
4
Minimax Lower Bounds
In the previous chapters, we have proved several upper bounds and the goal of
this chapter is to assess their optimality. Specifically, our goal is to answer the
following questions:
1. Can our analysis be improved? In other words: do the estimators that
we have studied actually satisfy better bounds?
2. Can any estimator improve upon these bounds?
Both questions ask about some form of optimality. The first one is about
optimality of an estimator, whereas the second one is about optimality of a
bound.
The difficulty of these questions varies depending on whether we are looking
for a positive or a negative answer. Indeed, a positive answer to these questions
simply consists in finding a better proof for the estimator we have studied
(question 1.) or simply finding a better estimator, together with a proof that
it performs better (question 2.). A negative answer is much more arduous.
For example, in question 2., it is a statement about all estimators. How can
this be done? The answer lies in information theory (see [CT06] for a nice
introduction).
In this chapter, we will see how to give a negative answer to question 2. It
will imply a negative answer to question 1.
4.1 OPTIMALITY IN A MINIMAX SENSE
Consider the Gaussian Sequence Model (GSM) where we observe Y =

(Y1 , . . . , Yd )> , defined by
Yi = θi∗ + εi , i = 1, . . . , d , (4.1)
92
4.1. Optimality in a minimax sense 93
Θ φ(Θ) Estimator Result
σ2 d
IRd θ̂ls Theorem 2.2
n
r
log d
B1 σ θ̂Bls1 Theorem 2.4
n
σ2 k
B0 (k) log(ed/k) θ̂Bls0 (k) Corollaries 2.8-2.9
n
Table 41. Rate φ(Θ) obtained for different choices of Θ.
2
where ε = (ε1 , . . . , εd )> ∼ Nd (0, σn Id ), θ∗ = (θ1∗ , . . . , θd∗ )> ∈ Θ is the parameter
of interest and Θ ⊂ IRd is a given set of parameters. We will need a more precise
notation for probabilities and expectations throughout this chapter. Denote by
IPθ∗ and IEθ∗ the probability measure and corresponding expectation that are
associated to the distribution of Y from the GSM (4.1).
Recall that GSM is a special case of the linear regression model when the
design matrix satisfies the ORT condition. In this case, we have proved several
performance guarantees (upper bounds) for various choices of Θ that can be
expressed either in the form
IE |θ̂n − θ∗ |22 ≤ Cφ(Θ)

(4.2)
or the form
|θ̂n − θ∗ |22 ≤ Cφ(Θ) , with prob. 1 − d−2 (4.3)
For some constant C. The rates φ(Θ) for different choices of Θ that we have
obtained are gathered in Table 41 together with the estimator (and the corre-
sponding result from Chapter 2) that was employed to obtain this rate. Can
any of these results be improved? In other words, does there exists another
estimator θ̃ such that supθ∗ ∈Θ IE|θ̃ − θ∗ |22 φ(Θ)?
A first step in this direction is the Cramér-Rao lower bound [Sha03] that
allows us to prove lower bounds in terms of the Fisher information. Neverthe-
less, this notion of optimality is too stringent and often leads to nonexistence
of optimal estimators. Rather, we prefer here the notion of minimax optimality
that characterizes how fast θ∗ can be estimated uniformly over Θ.
Definition 4.1. We say that an estimator θ̂n is minimax optimal over Θ if it
satisfies (4.2) and there exists C 0 > 0 such that
inf sup IEθ |θ̂ − θ|22 ≥ C 0 φ(Θ) (4.4)

θ̂ θ∈Θ
where the infimum is taker over all estimators (i.e., measurable functions of
Y). Moreover, φ(Θ) is called minimax rate of estimation over Θ.
4.2. Reduction to finite hypothesis testing 94
Note that minimax rates of convergence φ are defined up to multiplicative

constants. We may then choose this constant such that the minimax rate has
a simple form such as σ 2 d/n as opposed to 7σ 2 d/n for example.
This definition can be adapted to rates that hold with high probability. As
we saw in Chapter 2 (Cf. Table 41), the upper bounds in expectation and those
with high probability are of the same order of magnitude. It is also the case
for lower bounds. Indeed, observe that it follows from the Markov inequality
that for any A > 0,
IEθ φ−1 (Θ)|θ̂ − θ|22 ≥ AIPθ φ−1 (Θ)|θ̂ − θ|22 > A

(4.5)
Therefore, (4.4) follows if we prove that
inf sup IPθ |θ̂ − θ|22 > Aφ(Θ) ≥ C

θ̂ θ∈Θ
for some positive constants A and C”. The above inequality also implies a lower
bound with high probability. We can therefore employ the following alternate
definition for minimax optimality.
Definition 4.2. We say that an estimator θ̂ is minimax optimal over Θ if it
satisfies either (4.2) or (4.3) and there exists C 0 > 0 such that
inf sup IPθ |θ̂ − θ|22 > φ(Θ) ≥ C 0

(4.6)
θ̂ θ∈Θ
where the infimum is taker over all estimators (i.e., measurable functions of
Y). Moreover, φ(Θ) is called minimax rate of estimation over Θ.
4.2 REDUCTION TO FINITE HYPOTHESIS TESTING
Minimax lower bounds rely on information theory and follow from a simple
principle: if the number of observations is too small, it may be hard to distin-
guish between two probability distributions that are close to each other. For
example, given n i.i.d. observations, it is impossible to reliably decide whether
they are drawn from N (0, 1) or N ( n1 , 1). This simple argument can be made
precise using the formalism of statistical hypothesis testing. To do so, we reduce
our estimation problem to a testing problem. The reduction consists of two
steps.
1. Reduction to a finite number of parameters. In this step the goal
is to find the largest possible number of parameters θ1 , . . . , θM ∈ Θ under
the constraint that
|θj − θk |22 ≥ 4φ(Θ) . (4.7)
This problem boils down to a packing of the set Θ.
Then we can use the following trivial observations:
inf sup IPθ |θ̂ − θ|22 > φ(Θ) ≥ inf max IPθj |θ̂ − θj |22 > φ(Θ) .

θ̂ θ∈Θ θ̂ 1≤j≤M
4.3. Lower bounds based on two hypotheses 95
2. Reduction to a hypothesis testing problem. In this second step,

the necessity of the constraint (4.7) becomes apparent.
For any estimator θ̂, define the minimum distance test ψ(θ̂) that is asso-
ciated to it by
ψ(θ̂) = argmin |θ̂ − θj |2 ,
1≤j≤M
with ties broken arbitrarily.

Next observe that if, for some j = 1, . . . , M , ψ(θ̂) 6= j, then there exists
k 6= j such that |θ̂ − θk |2 ≤ |θ̂ − θj |2 . Together with the reverse triangle
inequality it yields
|θ̂ − θj |2 ≥ |θj − θk |2 − |θ̂ − θk |2 ≥ |θj − θk |2 − |θ̂ − θj |2
so that
1
|θj − θk |2
|θ̂ − θj |2 ≥
2
Together with constraint (4.7), it yields
|θ̂ − θj |22 ≥ φ(Θ)

As a result,
inf max IPθj |θ̂ − θj |22 > φ(Θ) ≥ inf max IPθj ψ(θ̂) 6= j

θ̂ 1≤j≤M θ̂ 1≤j≤M

≥ inf max IPθj ψ 6= j
ψ 1≤j≤M
where the infimum is taken over all tests ψ based on Y and that take
values in {1, . . . , M }.
Conclusion: it is sufficient for proving lower bounds to find θ1 , . . . , θM ∈ Θ
such that |θj − θk |22 ≥ 4φ(Θ) and
inf max IPθj ψ 6= j ≥ C 0 .

ψ 1≤j≤M
The above quantity is called minimax probability of error. In the next sections,
we show how it can be bounded from below using arguments from information
theory. For the purpose of illustration, we begin with the simple case where
M = 2 in the next section.
4.3 LOWER BOUNDS BASED ON TWO HYPOTHESES
The Neyman-Pearson Lemma and the total variation distance

Consider two probability measures IP0 and IP1 and observations X drawn from
either IP0 or IP1 . We want to know which distribution X comes from. It
corresponds to the following statistical hypothesis problem:
H0 : Z ∼ IP0
H1 : Z ∼ IP1
A test ψ = ψ(Z) ∈ {0, 1} indicates which hypothesis should be true. Any

test ψ can make two types of errors. It can commit either an error of type I
(ψ = 1 whereas Z ∼ IP0 ) or an error of type II (ψ = 0 whereas Z ∼ IP1 ). Of
course, the test may also be correct. The following fundamental result, called
the Neyman Pearson Lemma indicates that any test ψ is bound to commit one
of these two types of error with positive probability unless IP0 and IP1 have
essentially disjoint support.
Let ν be a sigma finite measure satisfying IP0 ν and IP1 ν. For example
we can take ν = IP0 + IP1 . It follows from the Radon-Nikodym theorem [Bil95]
that both IP0 and IP1 admit probability densities with respect to ν. We denote
them by p0 and p1 respectively. For any function f , we write for simplicity
Z Z
f = f (x)ν(dx)
Lemma 4.3 (Neyman-Pearson). Let IP0 and IP1 be two probability measures.
Then for any test ψ, it holds
Z
IP0 (ψ = 1) + IP1 (ψ = 0) ≥ min(p0 , p1 )
Moreover, equality holds for the Likelihood Ratio test ψ ? = 1I(p1 ≥ p0 ).

Proof. Observe first that
Z Z
? ?
IP0 (ψ = 1) + IP1 (ψ = 0) = p0 + p1
ψ ∗ =1 ψ ∗ =0
Z Z
= p0 + p1
p1 ≥p0 p1 <p0
Z Z
= min(p0 , p1 ) + min(p0 , p1 )
p1 ≥p0 p1 <p0
Z
= min(p0 , p1 ) .
Next for any test ψ, define its rejection region R = {ψ = 1}. Let R? = {p1 ≥
p0 } denote the rejection region of the likelihood ratio test ψ ? . It holds
IP0 (ψ = 1) + IP1 (ψ = 0) = 1 + IP0 (R) − IP1 (R)
Z
=1+ p0 − p1
ZR Z
=1+ p0 − p1 + p0 − p1
R∩R? R∩(R? )c
Z Z
=1− |p0 − p1 | + |p0 − p1 |
R∩R? R∩(R? )c
Z
= 1 + |p0 − p1 | 1I(R ∩ (R? )c ) − 1I(R ∩ R? )

The above quantity is clearly minimized for R = R? .

The lower bound in the Neyman-Pearson lemma is related to a well known

quantity: the total variation distance.
Definition-Proposition 4.4. The total variation distance between two prob-

ability measures IP0 and IP1 on a measurable space (X , A) is defined by
TV(IP0 , IP1 ) = sup |IP0 (R) − IP1 (R)| (i)

R∈A
Z
= sup p0 − p1 (ii)
R∈A R
Z
1
= |p0 − p1 | (iii)
2
Z
= 1 − min(p0 , p1 ) (iv)

= 1 − inf IP0 (ψ = 1) + IP1 (ψ = 0) (v)
ψ
where the infimum above is taken over all tests.
Proof. Clearly (i) = (ii) and the Neyman-Pearson Lemma gives (iv) = (v).
Moreover, by identifying a test ψ to its rejection region, it is not hard to see
that (i) = (v). Therefore it remains only to show that (iii) is equal to any
of the other expressions. Hereafter, we show that (iii) = (iv). To that end,
observe that
Z Z Z
|p0 − p1 | = p1 − p0 + p0 − p1
p ≥p p1 <p0
Z 1 0 Z Z
= p1 + p0 − min(p0 , p1 )
p1 ≥p0 p1 <p0
Z Z Z
=1− p1 + 1 − p0 − min(p0 , p1 )
p1 <p0 p1 ≥p0
Z
= 2 − 2 min(p0 , p1 )
In view of the Neyman-Pearson lemma, it is clear that if we want to prove

large lower bounds, we need to find probability distributions that are close in
total variation. Yet, this conflicts with constraint (4.7) and a tradeoff needs to
be achieved. To that end, in the Gaussian sequence model, we need to be able
2 2
to compute the total variation distance between N (θ0 , σn Id ) and N (θ1 , σn Id ).
None of the expression in Definition-Proposition 4.4 gives an easy way to do
so. The Kullback-Leibler divergence is much more convenient.
The Kullback-Leibler divergence

Definition 4.5. The Kullback-Leibler divergence between probability mea-
sures IP1 and IP0 is given by
 Z dIP
1
log dIP1 , if IP1 IP0

KL(IP1 , IP0 ) = dIP0
 ∞, otherwise .
It can be shown [Tsy09] that the integral is always well defined when IP1
IP0 (though it can be equal to ∞ even in this case). Unlike the total variation
distance, the Kullback-Leibler divergence is not a distance. Actually, it is not
even symmetric. Nevertheless, it enjoys properties that are very useful for our
purposes.
Proposition 4.6. Let IP and Q be two probability measures. Then
1. KL(IP, Q) ≥ 0.
2. The function (IP, Q) 7→ KL(IP, Q) is convex.
3. If IP and Q are product measures, i.e.,
n
O n
O
IP = IPi and Q = Qi
i=1 i=1
then
n
X
KL(IP, Q) = KL(IPi , Qi ) .
i=1
Proof. If IP is not absolutely continuous then the result is trivial. Next, assume
that IP Q and let X ∼ IP.
1. Observe that by Jensen’s inequality,
dQ dQ
KL(IP, Q) = −IE log (X) ≥ − log IE (X) = − log(1) = 0 .
dIP dIP
2. Consider the function f : (x, y) 7→ x log(x/y) and compute its Hessian:
∂2f 1 ∂2f x ∂2f 1

2
= , 2
= 2, =− ,
∂x x ∂y y ∂x∂y y
1
− y1

x
We check that the matrix H = is positive semidefinite for all
− y1 yx2
x, y > 0. To that end, compute its determinant and trace:
1 1 1 x
det(H) = 2
− 2 = 0 , Tr(H) = > 0.
y y x y2
Therefore, H has one null eigenvalue and one positive one and must therefore
be positive semidefinite so that the function f is convex on (0, ∞)2 . Since the
KL divergence is a sum of such functions, it is also convex on the space of

positive measures.
3. Note that if X = (X1 , . . . , Xn ),
dIP
KL(IP, Q) = IE log (X)
dQ
n Z dIP
i
X
= log (Xi ) dIP1 (X1 ) · · · dIPn (Xn )
i=1
dQi
n Z dIP
i
X
= log (Xi ) dIPi (Xi )
i=1
dQi
Xn
= KL(IPi , Qi )
i=1
Point 2. in Proposition 4.6 is particularly useful in statistics where obser-

vations typically consist of n independent random variables.
Example 4.7. For any θ ∈ IRd , let Pθ denote the distribution of Y ∼

N (θ, σ 2 Id ). Then
d
X (θi − θ0 )2 i |θ − θ0 |22
KL(Pθ , Pθ0 ) = = .
i=1
2σ 2 2σ 2
The proof is left as an exercise (see Problem 4.1).
The Kullback-Leibler divergence is easier to manipulate than the total vari-

ation distance but only the latter is related to the minimax probability of error.
Fortunately, these two quantities can be compared using Pinsker’s inequality.
We prove here a slightly weaker version of Pinsker’s inequality that will be
sufficient for our purpose. For a stronger statement, see [Tsy09], Lemma 2.5.
Lemma 4.8 (Pinsker’s inequality.). Let IP and Q be two probability measures

such that IP Q. Then
p
TV(IP, Q) ≤ KL(IP, Q) .
Proof. Note that

Z p
KL(IP, Q) = p log
pq>0 q
Z r q
= −2 p log
pq>0 p
Z hr q i
= −2 p log −1 +1
pq>0 p
Z hr q i
≥ −2 p −1 (by Jensen)
pq>0 p
√
Z
=2−2 pq
Next, note that

Z √ 2 Z p 2
pq = max(p, q) min(p, q)
Z Z
≤ max(p, q) min(p, q) (by Cauchy-Schwarz)
Z Z

= 2 − min(p, q) min(p, q)

= 1 + TV(IP, Q) 1 − TV(IP, Q)
= 1 − TV(IP, Q)2
The two displays yield

p
KL(IP, Q) ≥ 2 − 2 1 − TV(IP, Q)2 ≥ TV(IP, Q)2 ,
√
where we used the fact that 0 ≤ TV(IP, Q) ≤ 1 and 1 − x ≤ 1 − x/2 for
x ∈ [0, 1].
Pinsker’s inequality yields the following theorem for the GSM.
Theorem 4.9. Assume that Θ contains two hypotheses θ0 and θ1 such that
|θ0 − θ1 |22 = 8α2 σ 2 /n for some α ∈ (0, 1/2). Then
2ασ 2 1
inf sup IPθ (|θ̂ − θ|22 ≥ ) ≥ − α.
θ̂ θ∈Θ n 2
Proof. Write for simplicity IPj = IPθj , j = 0, 1. Recall that it follows from the
4.4. Lower bounds based on many hypotheses 101
reduction to hypothesis testing that
2ασ 2
inf sup IPθ (|θ̂ − θ|22 ≥ ) ≥ inf max IPj (ψ 6= j)
θ̂ θ∈Θ n ψ j=0,1
1
≥ inf IP0 (ψ = 1) + IP1 (ψ = 0)
2 ψ
1h i
= 1 − TV(IP0 , IP1 ) (Prop.-def. 4.4)
2
1h p i
≥ 1 − KL(IP1 , IP0 ) (Lemma 4.8)
2 r
1h n|θ1 − θ0 |22 i
= 1− (Example 4.7)
2 2σ 2
1 h i
= 1 − 2α
2
Clearly the result of Theorem 4.9 matches the upper bound for Θ = IRd
only for d = 1. How about larger d? A quick inspection of our proof shows
that our technique, in its present state, cannot yield better results. Indeed,
there are only two known candidates for the choice of θ∗ . With this knowledge,
one can obtain upper bounds that do not depend on d by simply projecting
Y onto the linear span of θ0 , θ1 and then solving the GSM in two dimensions.
To obtain larger lower bounds, we need to use more than two hypotheses. In
particular, in view of the above discussion, we need a set of hypotheses that
spans a linear space of dimension proportional to d. In principle, we should
need at least order d hypotheses but we will actually need much more.
4.4 LOWER BOUNDS BASED ON MANY HYPOTHESES
The reduction to hypothesis testing from Section 4.2 allows us to use more
than two hypotheses. Specifically, we should find θ1 , . . . , θM such that
inf max IPθj ψ 6= j ≥ C 0 ,

ψ 1≤j≤M
for some positive constant C 0 . Unfortunately, the Neyman-Pearson Lemma no

longer exists for more than two hypotheses. Nevertheless, it is possible to relate
the minimax probability of error directly to the Kullback-Leibler divergence,
without involving the total variation distance. This is possible using a well
known result from information theory called Fano’s inequality.
Theorem 4.10 (Fano’s inequality). Let P1 , . . . , PM , M ≥ 2 be probability dis-
tributions such that Pj Pk , ∀ j, k. Then
1
PM
M2 j,k=1 KL(Pj , Pk ) + log 2
inf max Pj ψ(X) 6= j ≥ 1 −
ψ 1≤j≤M log M
where the infimum is taken over all tests with values in {1, . . . , M }.
Proof. Define
M
1 X
pj = Pj (ψ = j) and qj = Pk (ψ = j)
M
k=1
so that
M M
1 X 1 X
p̄ = pj ∈ (0, 1) , q̄ = qj .
M j=1 M j=1
Moreover, define the KL divergence between two Bernoulli distributions:

p 1 − p
kl(p, q) = p log + (1 − p) log .
q 1−q
Note that p̄ log p̄ + (1 − p̄) log(1 − p̄) ≥ − log 2, which yields
kl(p̄, q̄) + log 2
p̄ ≤ .
log M
Next, observe that by convexity (Proposition 4.6, 2.), it holds
M M
1 X 1 X
kl(p̄, q̄) ≤ kl(pj , qj ) ≤ 2 kl(Pj (ψ = j), Pk (ψ = j)) .
M j=1 M
j.k=1
It remains to show that
kl(Pj (ψ = j), Pk (ψ = j)) ≤ KL(Pj , Pk ) .
This result can be seen to follow directly from a well known inequality often
referred to as data processing inequality but we are going to prove it directly.
Denote by dPk=j (resp. dPk6=j ) the conditional density of Pk given ψ(X) = j
(resp. ψ(X) 6= j) and recall that
Z dP
j
KL(Pj , Pk ) = log dPj
dPk
Z dP Z dP
j j
= log dPj + log dPj
ψ=j dP k ψ6=j dP k
Z dP =j P (ψ = j)
j j
= Pj (ψ = j) log dPj=j
dPk=j Pk (ψ = j)
Z dP 6=j P (ψ 6= j)
j
+ Pj (ψ 6= j) log dPj6=j
dPk6=j Pk (ψ 6= j)
P (ψ = j)
j
= Pj (ψ = j) log + KL(Pj=j , Pj=j )
Pk (ψ = j)
P (ψ 6= j)
j
+ Pj (ψ 6= j) log + KL(Pj6=j , Pj6=j )
Pk (ψ 6= j)
≥ kl(Pj (ψ = j), Pk (ψ = j)) .
where we used Proposition 4.6, 1.

We have proved that
M 1
PM
1 X M2 j,k=1 KL(Pj , Pk ) + log 2
Pj (ψ = j) ≤ .
M j=1 log M
Passing to the complemetary sets completes the proof of Fano’s inequality.
Fano’s inequality leads to the following useful theorem.
Theorem 4.11. Assume that Θ contains M ≥ 5 hypotheses θ1 , . . . , θM such

that for some constant 0 < α < 1/4, it holds
(i) |θj − θk |22 ≥ 4φ

2ασ 2
(ii) |θj − θk |22 ≤ log(M )
n
Then
1
inf sup IPθ |θ̂ − θ|22 ≥ φ ≥ − 2α .
θ̂ θ∈Θ 2
Proof. in view of (i ), it follows from the reduction to hypothesis testing that

it is sufficient to prove that
1
inf max IPθj ψ 6= j ≥ − 2α.
ψ 1≤j≤M 2
If follows from (ii ) and Example 4.7 that
n|θj − θk |22
KL(IPj , IPk ) = ≤ α log(M ) .
2σ 2
Moreover, since M ≥ 5,
1
PM
M2 j,k=1 KL(IPj , IPk ) + log 2 α log(M ) + log 2 1
≤ ≤ 2α + .
log(M − 1) log(M − 1) 2
The proof then follows from Fano’s inequality.

2
Theorem 4.11 indicates that we must take φ ≤ ασ 2n log(M ). Therefore, the
larger the M , the larger the lower bound can be. However, M cannot be ar-
bitrary larger because of the constraint (i). We are therefore facing a packing
problem where
p the goal is to “pack” as many Euclidean balls of radius propor-
tional to σ log(M )/n in Θ under the constraint that their centers remain close
together (constraint (ii)).
p If Θ = IRd , this the goal is to pack the Euclidean
p
ball of radius R = σ 2α log(M )/n with Euclidean balls of radius R 2α/γ.
This can be done using a volume argument (see Problem 4.3). However, we
will use the more versatile lemma below. It gives a a lower bound on the size
4.5. Application to the Gaussian sequence model 104
of a packing of the discrete hypercube {0, 1}d with respect to the Hamming
distance defined by
d
X
ρ(ω, ω 0 ) = 1I(ωi 6= ωj0 ) , ∀ ω, ω 0 ∈ {0, 1}d .
i=1
Lemma 4.12 (Varshamov-Gilbert). For any γ ∈ (0, 1/2), there exist binary
vectors ω1 , . . . ωM ∈ {0, 1}d such that
1
(i) ρ(ωj , ωk ) ≥ − γ d for all j 6= k ,
2
2 γ2 d
(ii) M = beγ d c ≥ e 2 .
Proof. Let ωj,i , 1 ≤ i ≤ d, 1 ≤ j ≤ M be i.i.d. Bernoulli random variables with

parameter 1/2 and observe that
d − ρ(ωj , ωk ) = X ∼ Bin(d, 1/2) .
Therefore it follows from a union bound that

1 M (M − 1) d
IP ∃j 6= k , ρ(ωj , ωk ) < −γ d ≤ IP X − > γd .
2 2 2
Hoeffding’s inequality then yields
M (M − 1) d M (M − 1)
IP X − > γd ≤ exp − 2γ 2 d + log

<1
2 2 2
as soon as
M (M − 1) < 2 exp 2γ 2 d .
2
A sufficient condition for the above inequality to hold is to take M = beγ d c ≥
γ2 d
e 2 . For this value of M , we have
1
IP ∀j 6= k , ρ(ωj , ωk ) ≥ −γ d >0
2
and by virtue of the probabilistic method, there exist ω1 , . . . ωM ∈ {0, 1}d that
satisfy (i) and (ii)
4.5 APPLICATION TO THE GAUSSIAN SEQUENCE MODEL
We are now in a position to apply Theorem 4.11 by choosing θ1 , . . . , θM

based on ω1 , . . . , ωM from the Varshamov-Gilbert Lemma.
Lower bounds for estimation

Take γ = 1/4 and apply the Varshamov-Gilbert Lemma to obtain ω1 , . . . , ωM
with M = bed/16 c ≥ ed/32 and such that ρ(ωj , ωk ) ≥ d/4 for all j 6= k. Let
θ1 , . . . , θM be such that
βσ
θj = ωj √ ,
n
for some β > 0 to be chosen later. We can check the conditions of Theorem 4.11:
β 2 σ2 β 2 σ2 d
(i) |θj − θk |22 = ρ(ωj , ωk ) ≥ 4 ;
n 16n
β 2 σ2 β 2 σ2 d 32β 2 σ 2 2ασ 2
(ii) |θj − θk |22 = ρ(ωj , ωk ) ≤ ≤ log(M ) = log(M ) ,
n n n n
√
α
for β = 4 . Applying now Theorem 4.11 yields
α σ2 d 1
inf sup IPθ |θ̂ − θ|22 ≥ ≥ − 2α .
θ̂ θ∈IRd 256 n 2
It implies the following corollary.

Corollary 4.13. The minimax rate of estimation of over IRd in the Gaussian
sequence model is φ(IRd ) = σ 2 d/n. Moreover, it is attained by the least squares
estimator θ̂ls = Y.
Note that this rate is minimax over sets Θ that are strictly smaller than IRd
(see Problem 4.4). Indeed, it is minimax over any subset of IRd that contains
θ1 , . . . , θ M .
Lower bounds for sparse estimation

It appears from Table 41 that when estimating sparse vectors, we have to pay
for an extra logarithmic term log(ed/k) for not knowing the sparsity pattern
of the unknown θ∗ . In this section, we show that this term is unavoidable as
it appears in the minimax optimal rate of estimation of sparse vectors.
Note that the vectors θ1 , . . . , θM employed in the previous subsection are
not guaranteed to be sparse because the vectors ω1 , . . . , ωM obtained from the
Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this
limitation, we need a sparse version of the Varhsamov-Gilbert lemma.
Lemma 4.14 (Sparse Varshamov-Gilbert). There exist positive constants C1
and C2 such that the following holds for any two integers k and d such that
1 ≤ k ≤ d/8. There exist binary vectors ω1 , . . . ωM ∈ {0, 1}d such that
k
(i) ρ(ωi , ωj ) ≥ for all i 6= j ,
2
k d
(ii) log(M ) ≥ log(1 + ).
8 2k
(iii) |ωj |0 = k for all j .

Proof. Take ω1 , . . . , ωM independently and uniformly at random from the set
C0 (k) = {ω ∈ {0, 1}d : |ω|0 = k}
of k-sparse binary random vectors. Note that C0 (k) has cardinality kd . To

choose ωj uniformly from C0 (k), we proceed as follows. Let U1 , . . . , Uk ∈

{1, . . . , d} be k random variables such that U1 is drawn uniformly at random
from {1, . . . , d} and for any i = 2, . . . , k, conditionally on U1 , . . . , Ui−1 , the ran-
dom variable Ui is drawn uniformly at random from {1, . . . , d}\{U1 , . . . , Ui−1 }.
Then define

1 if i ∈ {U1 , . . . , Uk }
ω=
0 otherwise .
Clearly, all outcomes in C0 (k) are equally likely under this distribution and
therefore, ω is uniformly distributed on C0 (k). Observe that
1 X k
IP ∃ ωj 6= ωk : ρ(ωj , ωk ) < k = d IP ∃ ωj 6= x : ρ(ωj , x) <
k d
2
x∈{0,1}
|x|0 =k
M
1 X X k
≤ d
IP ωj 6= x : ρ(ωj , x) <
k
2
x∈{0,1}d j=1
|x|0 =k
k
= M IP ω 6= x0 : ρ(ω, x0 ) <
,
2
where ω has the same distribution as ω1 and x0 is any k-sparse vector in
{0, 1}d . The last equality holds by symmetry since (i) all the ωj s have the
same distribution and (ii) all the outcomes of ωj are equally likely.
Note that
k
X
ρ(ω, x0 ) ≥ k − Zi ,
i=1
where Zi = 1I(Ui ∈ supp(x0 )). Indeed the left hand side is the number of
coordinates on which the vectors ω, x0 disagree and the right hand side is
the number of coordinates in supp(x0 ) on which the two vectors disagree. In
particular, we have that Z1 ∼ Ber(k/d) and for any i = 2, . . . , d, conditionally
on Z1 , . . . , Zi−i , we have Zi ∼ Ber(Qi ), where
Pi−1
k − l=1 Zl k 2k
Qi = ≤ ≤ ,
p − (i − 1) d−k d
since k ≤ d/2.
Next we apply a Chernoff bound to get that for any s > 0,
k k
k X k h X i sk
IP ω 6= x0 : ρ(ω, x0 ) < ≤ IP Zi > = IE exp s Zi e− 2
2 i=1
2 i=1
The above MGF can be controlled by induction on k as follows:
h k k−1
X i h X i
IE exp s Zi = IE exp s Zi IE exp sZk Z1 , . . . , Zk=1
i=1 i=1
h k−1
X i
= IE exp s Zi (Qk (es − 1) + 1)
i=1
k−1
h X i 2k s
≤ IE exp s Zi (e − 1) + 1
i=1
d
..
.
2k s k
≤ (e − 1) + 1
d
= 2k
d
For s = log(1 + 2k ). Putting everything together, we get
sk
IP ∃ ωj 6= ωk : ρ(ωj , ωk ) < k ≤ exp log M + k log 2 −
2
k d
= exp log M + k log 2 − log(1 + )
2 2k
k d
≤ exp log M + k log 2 − log(1 + )
2 2k
k d
≤ exp log M − log(1 + ) (for d ≥ 8k)
4 2k
< 1,
if we take M such that

k d
log M < log(1 + ).
4 2k
Apply the sparse Varshamov-Gilbert lemma to obtain ω1 , . . . , ωM with
log(M ) ≥ k8 log(1 + 2k d
) and such that ρ(ωj , ωk ) ≥ k/2 for all j 6= k. Let
θ1 , . . . , θM be such that
r
βσ d
θj = ωj √ log(1 + ),
n 2k
for some β > 0 to be chosen later. We can check the conditions of Theorem 4.11:
β 2 σ2 d β 2 σ2 d
(i) |θj − θk |22 = ρ(ωj , ωk ) log(1 + )≥4 k log(1 + );
n 2k 8n 2k
β 2 σ2 d 2kβ 2 σ 2 d 2ασ 2
(ii) |θj −θk |22 = ρ(ωj , ωk ) log(1+ ) ≤ log(1+ ) ≤ log(M ) ,
n 2k n 2k n
pα
for β = 8. Applying now Theorem 4.11 yields
α2 σ 2 d 1
inf sup IPθ |θ̂ − θ|22 ≥ k log(1 + ) ≥ − 2α .
θ̂ θ∈IRd 64n 2k 2
|θ|0 ≤k

Corollary 4.15. Recall that B0 (k) ⊂ IRd denotes the set of all k-sparse vectors
of IRd . The minimax rate of estimation over B0 (k) in the Gaussian sequence
2
model is φ(B0 (k)) = σnk log(ed/k). Moreover, it is attained by the constrained
least squares estimator θ̂Bls0 (k) .
Note that the modified BIC estimator of Problem 2.7 and the Slope (2.26)
are also minimax optimal over B0 (k). However, unlike θ̂Bls0 (k) and the modified
BIC, the Slope is also adaptive to k. For any ε > 0, the Lasso estimator and the
BIC estimator are minimax optimal for sets of parameters such that k ≤ d1−ε .
Lower bounds for estimating vectors in `1 balls

approximated a vector θ such that |θ|1 =
Recall that in Maurey’s argument, we q
0 0
R by a vector θ such that |θ |0 = σ logn d . We can essentially do the same
R
for the lower bound. √

Assume that d ≥ n and let β ∈ (0, 1) be a parameter to be chosen later
and define k to be the smallest integer such that
r
R n
k≥ √ .
βσ log(ed/ n)
Let ω1 , . . . , ωM be obtained from the sparse Varshamov-Gilbert Lemma 4.14
with this choice of k and define
R
θj = ωj .
k
Observe that |θj |1 = R for j = 1, . . . , M . We can check the conditions of
Theorem 4.11:
√
R2 R2 R 2 log(ed/ n)
(i) |θj − θk |22 = 2 ρ(ωj , ωk ) ≥ ≥ 4R min ,β σ .
k 2k 8 8n
r √
2 2R2 log(ed/ n) 2ασ 2
(ii) |θj − θk |2 ≤ ≤ 4Rβσ ≤ log(M ) ,
k n n
for β small enough if d ≥ Ck for some constant C > 0 chosen large enough.
Applying now Theorem 4.11 yields
√
2 R 2 2 log(ed/ n) 1
inf sup IPθ |θ̂ − θ|2 ≥ R min ,β σ ≥ − 2α .
θ̂ θ∈IRd 8 8n 2
4.6. Lower bounds for sparse estimation via χ2 divergence 109
Corollary 4.16. Recall that B1 (R) ⊂ IRd denotes the set vectors θ ∈ IRd such
that |θ|1 ≤ R. Then there exist a constant C > 0 such that if d ≥ n1/2+ε ,
ε > 0, the minimax rate of estimation over B1 (R) in the Gaussian sequence
model is
log d
φ(B0 (k)) = min(R2 , Rσ ).
n
Moreover, it is attained by the constrained least squares estimator θ̂Bls1 (R) if
R ≥ σ logn d and by the trivial estimator θ̂ = 0 otherwise.
Proof. To complete the proof of the statement, we need to study risk of the
trivial estimator equal to zero for small R. Note that if |θ∗ |1 ≤ R, we have
|0 − θ∗ |22 = |θ∗ |22 ≤ |θ∗ |21 = R2 .
Remark 4.17. Note that the inequality |θ∗ |22 ≤ |θ∗ |21 appears to be quite loose.
Nevertheless, it is tight up to a multiplicative constant for the vectors of the
log d
form θj = ωj Rk that are employed in the lower bound. Indeed, if R ≤ σ n ,
we have k ≤ 2/β
R2 β
|θj |22 = ≥ |θj |21 .
k 2
4.6 LOWER BOUNDS FOR SPARSE ESTIMATION VIA χ2 DIVERGENCE
In this section, we will show how to derive lower bounds by directly con-
trolling the distance between a simple and a composite hypothesis, which is
useful when investigating lower rates for decision problems instead of estima-
tion problems.
Define the χ2 divergence between two probability distributions IP, Q as
Z 2
 dIP
 − 1 dQ if IP Q,
χ2 (IP, Q) = dQ

∞ otherwise.

When we compare the expression in the case where IP Q, then wesee that
both the KL divergence and the χ2 divergence can be written as f dIP
R
dQ dQ
with f (x) = x log x and f (x) = (x − 1)2 , respectively. Also note that both of
these functions are convex in x and fulfill f (1) = 0, which by Jensen’s inequality
shows us that they are non-negative and equal to 0 if and only if IP = Q.
Firstly, let us derive another useful expression for the χ2 divergence. By
expanding the square, we have
2
(dIP)2
Z Z Z Z
2 dIP
χ (IP, Q) = − 2 dIP + dQ = dQ − 1
dQ dQ
Secondly, we note that we can bound the KL divergence from above by the
χ2 divergence, via Jensen’s inequality, as
(dIP)2
Z Z
dIP
KL(IP, Q) = log dIP ≤ log = log(1 + χ2 (IP, Q)) ≤ χ2 (IP, Q),
dQ dQ
(4.8)
where we used the concavity inequality log(1 + x) ≤ x for x > −1.

Note that this estimate is a priori very rough and that we are transitioning
from something at log scale to something on a regular scale. However, since
we are mostly interested in showing that these distances are bounded by a
small constant after adjusting the parameters of our models appropriately, it
will suffice for our purposes. In particular, we can combine (4.8) with the
Neyman-Pearson Lemma 4.3 and Pinsker’s inequality, Lemma 4.8, to get
1 √
inf IP0 (ψ = 1) + IP1 (ψ = 0) ≥ (1 − ε)
ψ 2
for distinguishing between two hypothesis with a decision rule ψ if χ2 (IP0 , IP1 ) ≤
ε.
In the following, we want to apply this observation to derive lower rates for
distinguishing between
H0 : Y ∼ IP0 H1 : Y ∼ IPu , for some u ∈ B0 (k),
where IPu denotes the probability distribution of a Gaussian sequence model
Y = u + nσ ξ, with ξ ∼ N (0, Id ).
Theorem 4.18. Consider the detection problem
H0 : θ∗ = 0, Hv : θ∗ = µv, for some v ∈ B0 (k), kvk2 = 1,
with data given by Y = θ∗ + √σn ξ, ξ ∼ N (0, Id ).
There exist ε > 0 and cε > 0 such that if
r s
σ k dε
µ≤ log 1 + 2
2 n k
then,
1 √
inf {IP0 (ψ = 1) ∨ max IPv (ψ = 0)} ≥ (1 − ε).
ψ v∈B0 (k) 2
kvk2 =1
Proof. We introduce the mixture hypothesis

¯ = 1
X
IP d
IPS ,
k S⊆[d]
|S|=k
√
with IPS being the distribution of Y = µ1IS / k + √σn ξ and give lower bounds
on distinguishing
¯
H0 : Y ∼ IP0 , H1 : Y ∼ IP,
by computing their χ2 distance,
Z 2
2 ¯ dp̄
χ (IP, IP0 ) = dIP0 − 1
dIP0
Z
1 X dIPS dIPT
= d dIP0 − 1.
k
dIP0 dIP0
S,T
The first step is to compute dIPS /(dIP0 ). Writing out the corresponding Gaus-
sian densities yields
1 n
√
√ d exp(− 2 kX − µ1IS / kk2 2)
dIPS 2σ
(X) = 2π 1 n
dIP0 √ d exp(− 2 kX − 0k2 2)
2π 2σ

n 2µ 2
= exp √ hX, 1IS i − µ
2σ 2 k
For convenience, we introduce the notation X = √σn Z for a standard normal
√ √
Z ∼ N (0, Id ), ν = µ n/(σ k), and write ZS for the restriction to the coor-
dinates in the set S. Multiplying two of these densities and integrating with
respect to IP0 in turn then reduces to computing the mgf of a Gaussian and
gives

n σ 2
IEX∼IP0 exp 2µ √ (ZS + ZT ) − 2µ
2σ 2 kn
= IEZ∼N (0,Id ) exp ν(ZS + ZT ) − ν 2 k .

P P
Decomposing ZS + ZT = 2 i∈S∩T Zi + i∈S∆T Zi and noting that |S∆T | ≤
2k, we see that
ν2

dIPS dIPT 4 2
IEIP0 = exp ν |S ∩ T + |S∆T | − ν 2 k
dIP0 dIP0 2 2
2

≤ exp 2ν |S ∩ T | .
Now, we need to take the expectation over two uniform draws of support
sets S and T , which reduces via conditioning and exploiting the independence
of the two draws to
¯ IP0 ) ≤ IES,T [exp(2ν 2 |S ∩ T |] − 1
χ2 (IP,
= IET IES [[exp(2ν 2 |S ∩ T | | T ]] − 1
= IES [exp(2ν 2 |S ∩ [k]|)] − 1.
Similar to the proof of the sparse Varshamov-Gilbert bound, Lemma 4.14, the
distribution of |S ∩ [k]| is stochastically dominated by a binomial distribution
Bin(k, k/d), so that
¯ IP0 ) ≤ IE[exp(2ν 2 Bin(k, k )] − 1

χ2 (IP,
d
k
= IE[exp(2ν 2 Ber(k/d))] − 1
k
2ν 2 k k
= e + 1− −1
d d
k
k 2
= 1 + (e2ν − 1) − 1
d
4.7. Lower bounds for estimating the `1 norm via moment
matching 112
Since (1 + x)k − 1 ≈ kx for x = o(1/k), we can take ν 2 = 21 log(1 + kd2 ε) to get

a bound of the order ε. Plugging this back into the definition of ν yields
r s
σ k dε
µ= log 1 + 2 .
2 n k
The rate for detection for arbitrary v now follows from
¯
IP0 (ψ = 1) ∨ max IPv (ψ = 0) ≥ IP0 (ψ = 1) ∨ IP(ψ = 0)
v∈B0 (k)
kvk2 =1
1 √
≥ (1 − ε).
2
Remark 4.19. If instead of the χ2 divergence, we try to compare the KL di-
¯ we are tempted to exploit its convexity properties
vergence between IP0 and IP,
to get
¯ IP0 ) ≤ 1
X
KL(IP, d
KL(IPS , IP0 ) = KL(IP[k] , IP0 ),
k S
which is just the distance between two Gaussian distributions. In turn, we do
not see any increase in complexity brought about by increasing the number of
competing hypothesis as we did in Section 4.5. By using the convexity estimate,
we have lost the granularity of controlling how much the different probability
distributions overlap. This is where the χ2 divergence helped.
Remark 4.20. This rate for detection implies lower rates for estimation of
both θ∗ and kθ∗ k2 : If given an estimator Tb(X) for T (θ) = kθk2 that achieves
an error less than µ/2 for some X, then we would get a correct decision rule
for those X by setting
 µ
0, if Tb(X) < ,
ψ(X) = 2
1, if Tb(X) ≥ µ .
2
Conversely, the lower bound above means that such an error rate can not be
achieved with vanishing error probability for µ smaller than the critical value.
Moreover, by the triangle inequality, we can the same lower bound also for
estimaton of θ∗ itself.
We note that the rate that we obtained is slightly worse in the scaling of
the log factor (d/k 2 instead of d/k) than the one obtained in Corollary 4.15.
This is because it applies to estimating the `2 norm as well, which is a strictly
easier problem and for which estimators can be constructed that achieve the
faster rate of nk log(d/k 2 ).
4.7 LOWER BOUNDS FOR ESTIMATING THE `1 NORM VIA MOMENT

MATCHING
So far, we have seen many examples of rates tha scale with n−1 for the
squared error. In this section, we are going to see an example for estimating
matching 113
a functional that can only be estimated at a much slower rate of 1/ log n,

following [CL11].
Consider the normalized `1 norm,
n
1X
T (θ) = |θi |, (4.9)
n i=1
as the target functional to be estimated, and measurements
Y ∼ N (θ, In ). (4.10)
We are going to show lower bounds for the estimation if restricted to an `∞

ball, Θn (M ) = {θ ∈ IRn : |θi | ≤ M }, and for a general θ ∈ IRn .
A constrained risk inequality

We start with a general principle that is in the same spirit as the Neyman-
Pearson lemma, (Lemma 4.3), but deals with expectations rather than prob-
abilities and uses the χ2 distance to estimate the closeness of distributions.
Moreover, contrary to what we have seen so far, it allows us to compare two
mixture distributions over candidate hypotheses, also known as priors.
In the following, we write X for a random observation coming from a dis-
tribution indexed by θ ∈ Θ, Tb = Tb(X) an estimator for a function T (θ), and
write B(θ) = IEθ Tb − T (θ) for the bias of Tb.
Let µ0 and µ1 be two prior distributions over Θ and denote by mi , vi2 the
means and variance of T (θ) under the priors µi , i ∈ {0, 1},
Z Z Z
mi = T (θ)dµi (θ), vi2 = (T (θ) − mi )2 dµi (θ).
Moreover, write Fi for the marginal distributions over the priors and denote
their density with respect to the average of the two probability distributions
by fi . With this, we write the χ2 distance between f0 and f1 as
2 2
f1 (X) f1 (X)
I 2 = χ2 (IPf0 , IPf1 ) = IEf0 −1 = IEf0 − 1.
f0 (X) f0 (X)
Theorem 4.21. (i) If IEθ (Tb(X) − T (θ))2 dµ0 (θ) ≤ ε2 ,

R
Z Z
B(θ)dµ1 (θ) − B(θ)dµ0 (θ) ≥ |m1 − m0 | − (ε + v0 )I. (4.11)
(ii) If |m1 − m0 | > v0 I and 0 ≤ λ ≤ 1,
λ(1 − λ)(|m1 − m0 | − v0 I)2

Z
IEθ (Tb(X) − T (θ))2 d(λµ0 + (1 − λ)µ1 )(θ) ≥ ,
λ + (1 − λ)(I + 1)2
(4.12)
matching 114
in particular
(|m1 − m0 | − v0 I)2
Z
max IEθ (Tb(X) − T (θ))2 dµi (θ) ≥ , (4.13)
i∈{0,1} (I + 2)2
and
(|m1 − m0 | − v0 I)2
sup IEθ (Tb(X) − T (θ))2 ≥ . (4.14)
θ∈Θ (I + 2)2
Proof. Without loss of generality, assume m1 ≥ m0 . We start by considering
the term

f1 (X) − f0 (X)
IEf0 (T (X) − m0 )
b
f0 (X)

f1 (X) − f0 (X)
= IE T (X)
b
f0 (X)
h i h i
= IEf1 m1 + T (X) − m1 − IEf0 m0 + Tb(X) − m0
b
Z Z
= m1 + B(θ)dµ1 (θ) − m0 + B(θ)dµ0 (θ) .
Moreover,
Z
IEf0 (Tb(X) − m0 )2 = IEθ (Tb(X) − m0 )2 dµ0 (θ)
Z
= IEθ (Tb(X) − T (θ) + T (θ) − m0 )2 dµ0 (θ)
Z
= IEθ (Tb(X) − T (θ))2 dµ0 (θ)
Z
+ 2 B(θ)(T (θ) − m0 )dµ0 (θ)
Z
+ (T (θ) − m0 )2 dµ0 (θ)
≤ ε2 + 2εv0 + v02 = (ε + v0 )2 .
Therefore, by Cauchy-Schwarz,

f1 (X) − f0 (X)
IEf0 (Tb(X) − m0 ) ≤ (ε + v0 )I,
f0 (X)
and hence
Z Z
m1 + B(θ)dµ1 (θ) − m0 + B(θ)dµ0 (θ) ≤ (ε + v0 )I,
whence
Z Z
B(θ)dµ0 (θ) − B(θ)dµ1 (θ) ≥ m1 − m0 − (ε + v0 )I,
matching 115
which gives us (4.11).

To estimate the risk under a mixture of µ0 and µ1 , we note that from (4.11),
we have an estimate for the risk under µ1 given an upper bound on the risk
under µ0 , which means we can reduce this problem to estimating a quadratic
from below. Consider
J(x) = λx2 + (1 − λ)(a − bx)2 ,

ab(1−λ)
with 0 < λ < 1, a > 0 and b > 0. J is minimized by x = xmin = λ+b2 (1−λ)
2
a λ(1−λ)
with a − bxmin > 0 and J(xmin ) = λ+b 2 (1−λ) . Hence, if we add a cut-off at zero
in the second term, we do not change the minimal value and obtain that
λx2 + (1 − λ)((a − bx)+ )2
has the same minimum.

From (4.11) and Jensen’s inequality, setting ε2 = B(θ)2 dµ0 (θ), we get
R
Z Z 2
B(θ)2 dµ1 (θ) ≥ B(θ)dµ1 (θ)
≥ ((m1 − m0 − (ε + v0 )I − ε)+ )2 .
Combining this with the estimate for the quadratic above yields
Z
IEθ (Tb(X) − T (θ))2 d(λµ0 + (1 − λ)µ1 )(θ)
≥ λε2 + (1 − λ)((m1 − m0 − (ε + v0 )I − ε)+ )2

λ(1 − λ)(|m1 − m0 | − v0 I)2
≥ ,
λ + (1 − λ)(I + 1)2
which is (4.12).
Finally, since the minimax risk is bigger than any mixture risk, we can
bound it from below by the maximum value of this bound, which is obtained
at λ = (I + 1)/(I + 2) to get (4.13), and by the same argument (4.14).
Unfavorable priors via polynomial approximation

In order to construct unfavorable priors for the `1 norm (4.9), we resort to
polynomial approximation theory. For a continuous function f on [−1, 1], de-
note the maximal approximation error by polynomials of degree k, written as
Pk , by
δk (f ) = inf max |f (x) − G(x)|,
G∈Pk x∈[−1,1]
and the polynomial attaining this error by G∗k . For f (t) = |t|, it is known
[Riv90] that
β∗ = lim 2kδ2k (f ) ∈ (0, ∞),
k→∞
matching 116
which is a very slow rate of convergence of the approximation, compared to

smooth functions for which we would expect geometric convergence.
We can now construct our priors using some abstract measure theoretical
results.
Lemma 4.22. For an integer k > 0, there are two probability measures ν0 and
ν1 on [−1, 1] that fulfill the following:
1. ν0 and ν1 are symmetric about 0;

2. tl dν1 (t) = tl dν0 (t) for all l ∈ {0, 1, . . . , k};
R R
R R
3. |t|dν1 (t) − |t|dν0 (t) = 2δk .
Proof. The idea is to construct the measures via the Hahn-Banach theorem
and the Riesz representation theorem.
First, consider f (t) = |t| as an element of the space C([−1, 1]) equipped with
the supremum norm and define Pk as the space of polynomials of degree up to
k, and Fk := span(Pk ∪ {f }). On Fk , define the functional T (cf + pk ) = cδk ,
which is well-defined because f is not a polynomial.
Claim: kT k = sup{T (g) : g ∈ Fk , kgk∞ ≤ 1} = 1.
kT k ≥ 1: Let G∗k be the best-approximating polynomial to f in Pk . Then,
kf − G∗k k∞ = δk , k(f − G∗k )/δk k∞ = 1, and T ((f − G∗k )/δk ) = 1.
kT k ≤ 1: Suppose g = cf + pk with pk ∈ Pk , kgk∞ = 1 and T (g) > 1.
Then c > 1/δk and
1
kf − (−pk /c)k∞ = < δk ,
c
contradicting the definition of δk .
Now, by the Hahn-Banach theorem, there is a norm-preserving extension
of T to C([−1, 1]) which we again denote by T . By the Riesz representation
theorem, there is a Borel signed measure τ with variation equal to 1 such that
Z 1
T (g) = g(t)dτ (t), for all g ∈ C([−1, 1]).
−1
The Hahn-Jordan decomposition gives two positive measures τ+ and τ− such

that τ = τ+ − τ− and
Z 1
|t|d(τ+ − τ− )(t) = δk and
−1
Z 1 Z 1
l
t dτ+ (t) = tl dτ− (t), l ∈ {0, . . . , k}.
−1 −1
Since f is a symmetric function, we can symmetrize these measures and hence

can assume that they are symmetric about 0. Finally, to get a probability
measures with the desired properties, set ν1 = 2τ+ and ν0 = 2τ− .
matching 117
Minimax lower bounds

Theorem
Pn 4.23. We have the following bounds on the minimax rate for T (θ) =
1 n
n i=1 |θi |: For θ ∈ Θn (M ) = {θ ∈ IR : |θi | ≤ M },
2
log log n
inf sup IEθ (Tb(X) − T (θ))2 ≥ β∗2 M 2 (1 + o(1)). (4.15)
Tb θ∈Θn (M ) log n
For θ ∈ IRn ,
β∗2
inf sup IEθ (Tb(X) − T (θ))2 ≥ (1 + o(1)). (4.16)
Tb θ∈IRn 16e2 log n
The last remaining ingredient for the proof are the Hermite polynomials, a
family of orthogonal polynomials with respect to the Gaussian density
1 2
ϕ(y) = √ e−y /2 .
2π
For our purposes, it is enough to define them by the derivatives of the density,
dk
ϕ(y) = (−1)k Hk (y)ϕ(y),
dy k
and to observe that they are orthogonal,
Z Z
2
Hk (y)ϕ(y)dy = k!, Hk (y)Hj (y)ϕ(y)dy = 0, for k 6= j.
Proof. We want to use Theorem 4.21. To construct the prior measures, we scale
the measures from Lemma 4.22 appropriately: Let kn be an even integer that
is to be determined, and ν0 , ν1 the two measures given by Lemma 4.22. Define
g(x) = M x and define the measures µi by dilating νi , µi (A) = νi (g −1 (A)) for
every Borel set A ⊆ [−M, M ] and i ∈ {0, 1}. Hence,
1. µ0 and µ1 are symmetric about 0;
2. tl dµ1 (t) = tl dµ0 (t) for all l ∈ {0, 1, . . . , kn };
R R
R R
3. |t|dµ1 (t) − |t|dµ0 (t) = 2M δkn .
Qn
To get priors for n i.i.d. samples, consider the product priors µni = j=1 µi .
With this, we have
IEµn1 T (θ) − IEµn0 T (θ) = IEµ1 |θ1 | − IEµ0 |θ1 | = 2M δkn ,
and
M2
IEµn0 (T (θ) − IEµn0 T (θ))2 = IEµn0 (T (θ) − IEµn0 T (θ))2 ≤ ,
n
since each θi ∈ [−M, M ].
matching 118
It remains to control the χ2 distance between the two marginal distribu-

tions. To this end, we will expand R the Gaussian density in terms of Hermite
polynomials. First, set fi (y) = ϕ(y − t)dµi (t). Since g(x) = exp(−x) is
convex and µ0 is symmetric,
(y − t)2
Z
1
f0 (y) ≥ √ exp − dµ0 (t)
2π 2
Z
1 2 2
= ϕ(y) exp − M t dν0 (t)
2

1
≥ ϕ(y) exp − M 2 .
2
Expand ϕ as
∞
X tk
ϕ(y − t) = Hk (y)ϕ(y) .
k!
k=0
Then,
∞
(f1 (y) − f0 (y))2
Z
2 X 1 2k
dy ≤ eM /2 M .
f0 (y) k!
k=kn +1
2
The χ distance for n i.i.d. observations can now be easily bounded by
Z Qn Qn
2 ( i=1 f1 (yi ) − i=1 f0 (yi ))2
In = Q n dy1 · · · dyn
i=1 f0 (yi )
Z Qn
( i=1 f1 (yi ))2
= Qn dy1 · · · dyn − 1
i=1 f0 (yi )
n Z
!
Y (f1 (yi ))2
= dyi − 1
i=1
f0 (yi )
∞
!n
M 2 /2
X 1 2k
≤ 1+e M −1
k!
k=kn +1
n
3M 2 /2 1 2kn
≤ 1+e M −1
kn !
k n !
eM 2 n

2
≤ 1 + e3M /2 − 1, (4.17)
kn
where the last step used a Stirling type estimate, k! > (k/e)k .
Note that if x = o(1/n), then (1 + x)n − 1 = o(nx) by Taylor’s formula,
so we can choose kn ≥ log n/(log log n) to guarantee that for n large enough,
In → 0. With Theorem 4.21, we have
√
(2M δkn − (M/ n)In )2
inf sup IE(Tb − T (θ))2 ≥
Tb θ∈Θn (M ) (In + 2)2
2
log log n
= β∗2 M 2 (1 + o(1)).
log n
matching 119
√
To prove the lower bound over IRn , take M = log n and kn to be the
smallest integer such that kn ≥ 2e log n and plug this into (4.17), yielding
kn !n
2 3/2 e log n
In ≤ 1 + n − 1,
2e log n
which goes to zero. Hence, we can conclude just as before.
Note that [CL11] also complement these lower bounds with upper bounds
that are based on polynomial approximation, completing the picture of esti-
mating T (θ).
4.8. Problem Set 120
4.8 PROBLEM SET
Problem 4.1. (a) Prove the statement of Example 4.7.

(b) Let Pθ denote the distribution of X ∼ Ber(θ). Show that
KL(Pθ , Pθ0 ) ≥ C(θ − θ0 )2 .
Problem 4.2. Let IP0 and IP1 be two probability measures. Prove that for
any test ψ, it holds
1 −KL(IP0 ,IP1 )
max IPj (ψ 6= j) ≥ e .
j=0,1 4
Problem 4.3. For any R > 0, θ ∈ IRd , denote by B2 (θ, R) the (Euclidean)
ball of radius R and centered at θ. For any ε > 0 let N = N (ε) be the largest
integer such that there exist θ1 , . . . , θN ∈ B2 (0, 1) for which
|θi − θj | ≥ 2ε
for all i 6= j. We call the set {θ1 , . . . , θN } an ε-packing of B2 (0, 1) .
(a) Show that there exists a constant C > 0 such that N ≤ Cd /εd .
(b) Show that for any x ∈ B2 (0, 1), there exists i = 1, . . . , N such that
|x − θi |2 ≤ 2ε.
(c) Use (b) to conclude that there exists a constant C 0 > 0 such that N ≥
Cd0 /εd .
Problem 4.4. Show that the rate φ = σ 2 d/n is the minimax rate of estimation
over:
p
(a) The Euclidean Ball of IRd with radius σ 2 d/n.
(b) The unit `∞ ball of IRd defined by
B∞ (1) = {θ ∈ IRd : max |θj | ≤ 1}
j
as long as σ 2 ≤ n.
(c) The set of nonnegative vectors of IRd .
σ
√ {0, 1}d
(d) The discrete hypercube 16 n
.
Problem 4.5. Fix β ≥ 5/3, Q > 0 and prove that the minimax rate of esti-
2β
mation over Θ(β, Q) with the k · kL2 ([0,1]) -norm is given by n− 2β+1 .
[Hint:Consider functions of the form
N
C X
fj = √ ωji ϕi
n i=1
where C is a constant, ωj ∈ {0, 1}N for some appropriately chosen

N and {ϕj }j≥1 is the trigonometric basis.]
Chapter
5
Matrix estimation
Over the past decade or so, matrices have entered the picture of high-dimensional
statistics for several reasons. Perhaps the simplest explanation is that they are
the most natural extension of vectors. While this is true, and we will see exam-
ples where the extension from vectors to matrices is straightforward, matrices
have a much richer structure than vectors allowing “interaction” between their
rows and columns. In particular, while we have been describing simple vectors
in terms of their sparsity, here we can measure the complexity of a matrix by
its rank. This feature was successfully employed in a variety of applications
ranging from multi-task learning to collaborative filtering. This last application
was made popular by the Netflix prize in particular.
In this chapter, we study several statistical problems where the parameter of
interest θ is a matrix rather than a vector. These problems include: multivari-
ate regression, covariance matrix estimation and principal component analysis.
Before getting to these topics, we begin by a quick reminder on matrices and
linear algebra.
5.1 BASIC FACTS ABOUT MATRICES
Matrices are much more complicated objects than vectors. In particular,

while vectors can be identified with linear operators from IRd to IR, matrices
can be identified to linear operators from IRd to IRn for n ≥ 1. This seemingly
simple fact gives rise to a profusion of notions and properties as illustrated by
Bernstein’s book [Ber09] that contains facts about matrices over more than a
thousand pages. Fortunately, we will be needing only a small number of such
properties, which can be found in the excellent book [GVL96], that has become
a standard reference on matrices and numerical linear algebra.
121
5.1. Basic facts about matrices 122
Singular value decomposition

Let A = {aij , 1 ≤ i ≤ m, 1 ≤ j ≤ n} be a m × n real matrix of rank r ≤
min(m, n). The Singular Value Decomposition (SVD) of A is given by
r
X
A = U DV > = λj uj vj> ,
j=1
where D is an r × r diagonal matrix with positive diagonal entries {λ1 , . . . , λr },

U is a matrix with columns {u1 , . . . , ur } ∈ IRm that are orthonormal, and V is
a matrix with columns {v1 , . . . , vr } ∈ IRn that are also orthonormal. Moreover,
it holds that
AA> uj = λ2j uj , and A> Avj = λ2j vj
for j = 1, . . . , r. The values λj > 0 are called singular values of A and are
uniquely defined. If rank r < min(n, m) then the singular values of A are
given by λ = (λ1 , . . . , λr , 0, . . . , 0)> ∈ IRmin(n,m) where there are min(n, m) − r
zeros. This way, the vector λ of singular values of a n × m matrix is a vector
in IRmin(n,m) .
In particular, if A is an n × n symmetric positive semidefinite (PSD), i.e.
A> = A and u> Au ≥ 0 for all u ∈ IRn , then the singular values of A are equal
to its eigenvalues.
The largest singular value of A denoted by λmax (A) also satisfies the fol-
lowing variational formulation:
|Ax|2 y > Ax
λmax (A) = maxn = maxn = max y > Ax .
x∈IR |x|2 x∈IR |y|2 |x|2
m
x∈S n−1
y∈IR y∈S m−1
In the case of a n × n PSD matrix A, we have
λmax (A) = max

n−1
x> Ax .
x∈S
In these notes, we always assume that eigenvectors and singular vectors

have unit norm.
Norms and inner product

Let A = {aij } and B = {bij } be two real matrices. Their size will be implicit
in the following notation.
Vector norms
The simplest way to treat a matrix is to deal with it as if it were a vector. In
particular, we can extend `q norms to matrices:
X 1/q
|A|q = |aij |q , q > 0.
ij
5.1. Basic facts about matrices 123
The cases where q ∈ {0, ∞} can also be extended matrices:

X
|A|0 = 1I(aij 6= 0) , |A|∞ = max |aij | .
ij
ij
The case q = 2 plays a particular role for matrices and |A|2 is called the
Frobenius norm of A and is often denoted by kAkF . It is also the Hilbert-
Schmidt norm associated to the inner product:
hA, Bi = Tr(A> B) = Tr(B > A) .
Spectral norms
Let λ = (λ1 , . . . , λr , 0, . . . , 0) be the singular values of a matrix A. We can
define spectral norms on A as vector norms on the vector λ. In particular, for
any q ∈ [1, ∞],
kAkq = |λ|q ,
is called Schatten q-norm of A. Here again, special cases have special names:
• q = 2: kAk2 = kAkF is the Frobenius norm defined above.
• q = 1: kAk1 = kAk∗ is called the Nuclear norm (or trace norm) of A.
• q = ∞: kAk∞ = λmax (A) = kAkop is called the operator norm (or
spectral norm) of A.
We are going to employ these norms to assess the proximity to our matrix
of interest. While the interpretation of vector norms is clear by extension from
the vector case, the meaning of “kA−Bkop is small” is not as transparent. The
following subsection provides some inequalities (without proofs) that allow a
better reading.
Useful matrix inequalities

Let A and B be two m × n matrices with singular values λ1 (A) ≥ λ2 (A) . . . ≥
λmin(m,n) (A) and λ1 (B) ≥ . . . ≥ λmin(m,n) (B) respectively. Then the following
inequalities hold:
max λk (A) − λk (B) ≤ kA − Bkop , Weyl (1912)

k
X 2
λk (A) − λk (B) ≤ kA − Bk2F , Hoffman-Weilandt (1953)
k
1 1
hA, Bi ≤ kAkp kBkq , + = 1, p, q ∈ [1, ∞] , Hölder
p q
The singular value decomposition is associated to the following useful lemma,

known as the Eckart-Young (or Eckart-Young-Mirsky) Theorem. It states that
for any given k the closest matrix of rank k to a given matrix A in Frobenius
norm is given by its truncated SVD.
5.2. Multivariate regression 124
Lemma 5.1. Let A be a rank-r matrix with singular value decomposition

r
X
A= λi ui vi> ,
i=1
where λ1 ≥ λ2 ≥ · · · ≥ λr > 0 are the ordered singular values of A. For any

k < r, let Ak to be the truncated singular value decomposition of A given by
r
X
Ak = λi ui vi> .
i=1
Then for any matrix B such that rank(B) ≤ k, it holds

kA − Ak kF ≤ kA − BkF .
Moreover,
r
X
kA − Ak k2F = λ2j .
j=k+1
Proof. Note that the last equality of the lemma is obvious since
r
X
A − Ak = λj uj vj>
j=k+1
and the matrices in the sum above are orthogonal to each other.
Thus, it is sufficient to prove that for any matrix B such that rank(B) ≤ k,
it holds
Xr
kA − Bk2F ≥ λ2j .
j=k+1
To that end, denote by σ1 ≥ σ2 ≥ · · · ≥ σk ≥ 0 the ordered singular values

of B—some may be equal to zero if rank(B) < k—and observe that it follows
from the Hoffman-Weilandt inequality that
r k r r
X 2 X 2 X X
kA − Bk2F ≥ λj − σj = λ j − σj + λ2j ≥ λ2j .
j=1 j=1 j=k+1 j=k+1
5.2 MULTIVARIATE REGRESSION
In the traditional regression setup, the response variable Y is a scalar. In

several applications, the goal is not to predict a variable but rather a vector
Y ∈ IRT , still from a covariate X ∈ IRd . A standard example arises in genomics
data where Y contains T physical measurements of a patient and X contains
the expression levels for d genes. As a result the regression function in this case
f (x) = IE[Y |X = x] is a function from IRd to IRT . Clearly, f can be estimated
independently for each coordinate, using the tools that we have developed in
the previous chapter. However, we will see that in several interesting scenar-
ios, some structure is shared across coordinates and this information can be
leveraged to yield better prediction bounds.
The model
Throughout this section, we consider the following multivariate linear regres-
sion model:
Y = XΘ∗ + E , (5.1)
where Y ∈ IRn×T is the matrix of observed responses, X is the n × d observed
design matrix (as before), Θ ∈ IRd×T is the matrix of unknown parameters and
E ∼ subGn×T (σ 2 ) is the noise matrix. In this chapter, we will focus on the
prediction task, which consists in estimating XΘ∗ .
As mentioned in the foreword of this chapter, we can view this problem as T
(univariate) linear regression problems Y (j) = Xθ∗,(j) +ε(j) , j = 1, . . . , T , where
Y (j) , θ∗,(j) and ε(j) are the jth column of Y, Θ∗ and E respectively. In particu-
lar, an estimator for XΘ∗ can be obtained by concatenating the estimators for
each of the T problems. This approach is the subject of Problem 5.1.
The columns of Θ∗ correspond to T different regression tasks. Consider the
following example as a motivation. Assume that the Subway headquarters
want to evaluate the effect of d variables (promotions, day of the week, TV
ads,. . . ) on their sales. To that end, they ask each of their T = 40, 000
restaurants to report their sales numbers for the past n = 200 days. As a
result, franchise j returns to headquarters a vector Y(j) ∈ IRn . The d variables
for each of the n days are already known to headquarters and are stored in
a matrix X ∈ IRn×d . In this case, it may be reasonable to assume that the
same subset of variables has an impact of the sales for each of the franchise,
though the magnitude of this impact may differ from franchise to franchise. As
a result, one may assume that the matrix Θ∗ has each of its T columns that
is row sparse and that they share the same sparsity pattern, i.e., Θ∗ is of the
form:  
0 0 0 0
 • • • • 
 
 • • • • 
 
Θ∗ =  0 0 0 0  ,
 
 .. .. .. .. 
 . . . . 
 
 0 0 0 0 
• • • •
where • indicates a potentially nonzero entry.
It follows from the result of Problem 5.1 that if each task is performed
individually, one may find an estimator Θ̂ such that
1 kT log(ed)
IEkXΘ̂ − XΘ∗ k2F . σ 2 ,
n n
where k is the number of nonzero coordinates in each column of Θ∗ . We
remember that the term log(ed) corresponds to the additional price to pay
for not knowing where the nonzero components are. However, in this case,
when the number of tasks grows, this should become easier. This fact was
proved in [LPTVDG11]. We will see that we can recover a similar phenomenon
when the number of tasks becomes large, though larger than in [LPTVDG11].
Indeed, rather than exploiting sparsity, observe that such a matrix Θ∗ has rank
k. This is the kind of structure that we will be predominantly using in this
chapter.
Rather than assuming that the columns of Θ∗ share the same sparsity
pattern, it may be more appropriate to assume that the matrix Θ∗ is low rank
or approximately so. As a result, while the matrix may not be sparse at all,
the fact that it is low rank still materializes the idea that some structure is
shared across different tasks. In this more general setup, it is assumed that the
columns of Θ∗ live in a lower dimensional space. Going back to the Subway
example this amounts to assuming that while there are 40,000 franchises, there
are only a few canonical profiles for these franchises and that all franchises are
linear combinations of these profiles.
Sub-Gaussian matrix model

Recall that under the assumption ORT for the design matrix, i.e., X> X = nId ,
then the univariate regression model can be reduced to the sub-Gaussian se-
quence model. Here we investigate the effect of this assumption on the multi-
variate regression model (5.1).
Observe that under assumption ORT,
1 > 1
X Y = Θ∗ + X> E .
n n
Which can be written as an equation in IRd×T called the sub-Gaussian matrix
model (sGMM):
y = Θ∗ + F , (5.2)
where y = n1 X> Y and F = n1 X> E ∼ subGd×T (σ 2 /n).
Indeed, for any u ∈ S d−1 , v ∈ S T −1 , it holds
1 1
u> F v = (Xu)> Ev = √ w> Ev ∼ subG(σ 2 /n) ,
n n
√ >
where w = Xu/ n has unit norm: |w|22 = u> X n X u = |u|22 = 1.
Akin to the sub-Gaussian sequence model, we have a direct observation
model where we observe the parameter of interest with additive noise. This
enables us to use thresholding methods for estimating Θ∗ when |Θ∗ |0 is small.
However, this also follows from Problem 5.1. The reduction to the vector case
in the sGMM is just as straightforward. The interesting analysis begins when
Θ∗ is low-rank, which is equivalent to sparsity in its unknown eigenbasis.
Consider the SVD of Θ∗ :
X
Θ∗ = λj uj vj> .
j
and recall that kΘ∗ k0 = |λ|0 . Therefore, if we knew uj and vj , we could

simply estimate the λj s by hard thresholding. It turns out that estimating
these eigenvectors by the eigenvectors of y is sufficient.
Consider the SVD of the observed matrix y:

X
y= λ̂j ûj v̂j> .
j
Definition 5.2. The singular value thresholding estimator with threshold

2τ ≥ 0 is defined by
X
Θ̂svt = λ̂j 1I(|λ̂j | > 2τ )ûj v̂j> .
j
Recall that the threshold for the hard thresholding estimator was chosen to
be the level of the noise with high probability. The singular value thresholding
estimator obeys the same rule, except that the norm in which the magnitude of
the noise is measured is adapted to the matrix case. Specifically, the following
lemma will allow us to control the operator norm of the matrix F .
Lemma 5.3. Let A be a d × T random matrix such that A ∼ subGd×T (σ 2 ).
Then p p
kAkop ≤ 4σ log(12)(d ∨ T ) + 2σ 2 log(1/δ)
Proof. This proof follows the same steps as Problem 1.4. Let N1 be a 1/4-
net for S d−1 and N2 be a 1/4-net for S T −1 . It follows from Lemma 1.18
that we can always choose |N1 | ≤ 12d and |N2 | ≤ 12T . Moreover, for any
u ∈ S d−1 , v ∈ S T −1 , it holds
1
u> Av ≤ max x> Av + max u> Av
x∈N1 4 u∈S d−1
1 1
≤ max max x> Ay + max max x> Av + max u> Av
x∈N1 y∈N2 4 x∈N1 v∈S T −1 4 u∈S d−1
1
≤ max max x> Ay + max max u> Av
x∈N1 y∈N2 2 u∈S d−1 v∈S T −1
It yields
kAkop ≤ 2 max max x> Ay
x∈N1 y∈N2
So that for any t ≥ 0, by a union bound,

X
IP x> Ay > t/2

IP kAkop > t ≤
x∈N1
y∈N2
Next, since A ∼ subGd×T (σ 2 ), it holds that x> Ay ∼ subG(σ 2 ) for any x ∈

N1 , y ∈ N2 .Together with the above display, it yields
t2
IP kAkop > t ≤ 12d+T exp − 2 ≤ δ

8σ
for p p
t ≥ 4σ log(12)(d ∨ T ) + 2σ 2 log(1/δ) .
The following theorem holds.

Theorem 5.4. Consider the multivariate linear regression model (5.1) under
the assumption ORT or, equivalently, the sub-Gaussian matrix model (5.2).
Then, the singular value thresholding estimator Θ̂svt with threshold
r r
log(12)(d ∨ T ) 2 log(1/δ)
2τ = 8σ + 4σ , (5.3)
n n
satisfies
1
kXΘ̂svt − XΘ∗ k2F = kΘ̂svt − Θ∗ k2F ≤ 144 rank(Θ∗ )τ 2
n
σ 2 rank(Θ∗ )
. d ∨ T + log(1/δ) .
n
Proof. Assume without loss of generality that the singular values of Θ∗ and y
are arranged in a non increasing order: λ1 ≥ λ2 ≥ . . . and λ̂1 ≥ λ̂2 ≥ . . . .
Define the set S = {j : |λ̂j | > 2τ }.
Observe first that it follows from Lemma 5.3 that kF kop ≤ τ for τ chosen as
in (5.3) on an event A such that IP(A) ≥ 1 − δ. The rest of the proof assumes
that the event A occurred.
Note that it follows from Weyl’s inequality that |λ̂j − λj | ≤ kF kop ≤ τ . It
implies that S ⊂ {j : |λj | > τ }P and S c ⊂ {j : |λj | ≤ 3τ }.
Next define the oracle Θ̄ = j∈S λj uj vj> and note that
kΘ̂svt − Θ∗ k2F ≤ 2kΘ̂svt − Θ̄k2F + 2kΘ̄ − Θ∗ k2F (5.4)
Using the Hölder inequality, we control the first term as follows
kΘ̂svt − Θ̄k2F ≤ rank(Θ̂svt − Θ̄)kΘ̂svt − Θ̄k2op ≤ 2|S|kΘ̂svt − Θ̄k2op
Moreover,
kΘ̂svt − Θ̄kop ≤ kΘ̂svt − ykop + ky − Θ∗ kop + kΘ∗ − Θ̄kop

≤ maxc |λ̂j | + τ + maxc |λj | ≤ 6τ .
j∈S j∈S
Therefore, X
kΘ̂svt − Θ̄k2F ≤ 72|S|τ 2 = 72 τ2 .
j∈S
The second term in (5.4) can be written as

X
kΘ̄ − Θ∗ k2F = |λj |2 .
j∈S c
Plugging the above two displays in (5.4), we get

X X
kΘ̂svt − Θ∗ k2F ≤ 144 τ2 + |λj |2
j∈S j∈S c
Since on S, τ 2 = min(τ 2 , |λj |2 ) and on S c , |λj |2 ≤ 9 min(τ 2 , |λj |2 ), it yields,

X
kΘ̂svt − Θ∗ k2F ≤ 144 min(τ 2 , |λj |2 )
j
rank(Θ∗ )
X
≤ 144 τ2
j=1
= 144 rank(Θ∗ )τ 2 .
In the next subsection, we extend our analysis to the case where X does not
necessarily satisfy the assumption ORT.
Penalization by rank
The estimator from this section is the counterpart of the BIC estimator in the
spectral domain. However, we will see that unlike BIC, it can be computed
efficiently.
Let Θ̂rk be any solution to the following minimization problem:
n1 o
min kY − XΘk2F + 2τ 2 rank(Θ) .
Θ∈IRd×T n
This estimator is called estimator by rank penalization with regularization pa-
rameter τ 2 . It enjoys the following property.
Theorem 5.5. Consider the multivariate linear regression model (5.1). Then,
the estimator by rank penalization Θ̂rk with regularization parameter τ 2 , where
τ is defined in (5.3) satisfies
1 σ 2 rank(Θ∗ )
kXΘ̂rk − XΘ∗ k2F ≤ 8 rank(Θ∗ )τ 2 . d ∨ T + log(1/δ) .
n n
Proof. We begin as usual by noting that
kY − XΘ̂rk k2F + 2nτ 2 rank(Θ̂rk ) ≤ kY − XΘ∗ k2F + 2nτ 2 rank(Θ∗ ) ,
which is equivalent to
kXΘ̂rk − XΘ∗ k2F ≤ 2hE, XΘ̂rk − XΘ∗ i − 2nτ 2 rank(Θ̂rk ) + 2nτ 2 rank(Θ∗ ) .
Next, by Young’s inequality, we have

1
2hE, XΘ̂rk − XΘ∗ i = 2hE, U i2 + kXΘ̂rk − XΘ∗ k2F ,
2
where
XΘ̂rk − XΘ∗
U= .
kXΘ̂rk − XΘ∗ kF
Write
XΘ̂rk − XΘ∗ = ΦN ,
where Φ is a n × r, r ≤ d matrix whose columns form orthonormal basis of the
column span of X. The matrix Φ can come from the SVD of X for example:
X = ΦΛΨ> . It yields
ΦN
U=
kN kF
and
kXΘ̂rk − XΘ∗ k2F ≤ 4hΦ> E, N/kN kF i2 − 4nτ 2 rank(Θ̂rk ) + 4nτ 2 rank(Θ∗ ) .

(5.5)
Note that rank(N ) ≤ rank(Θ̂rk ) + rank(Θ∗ ). Therefore, by Hölder’s in-
equality, we get
hE, U i2 = hΦ> E, N/kN kF i2

kN k21
≤ kΦ> Ek2op
kN k2F
≤ rank(N )kΦ> Ek2op
≤ kΦ> Ek2op rank(Θ̂rk ) + rank(Θ∗ ) .

Next, note that Lemma 5.3 yields kΦ> Ek2op ≤ nτ 2 so that
hE, U i2 ≤ nτ 2 rank(Θ̂rk ) + rank(Θ∗ ) .

Together with (5.5), this completes the proof.
It follows from Theorem 5.5 that the estimator by rank penalization enjoys
the same properties as the singular value thresholding estimator even when X
does not satisfy the ORT condition. This is reminiscent of the BIC estimator
which enjoys the same properties as the hard thresholding estimator. However
this analogy does not extend to computational questions. Indeed, while the
rank penalty, just like the sparsity penalty, is not convex, it turns out that
XΘ̂rk can be computed efficiently.
Note first that
1 n1 o
min kY − XΘk2F + 2τ 2 rank(Θ) = min min kY − XΘk2F + 2τ 2 k .
Θ∈IRd×T n k n Θ∈IRd×T
rank(Θ)≤k
Therefore, it remains to show that
min kY − XΘk2F
Θ∈IRd×T
rank(Θ)≤k
can be solved efficiently. To that end, let Ȳ = X(X> X)† X> Y denote the orthog-
onal projection of Y onto the image space of X: this is a linear operator from
IRd×T into IRn×T . By the Pythagorean theorem, we get for any Θ ∈ IRd×T ,
kY − XΘk2F = kY − Ȳk2F + kȲ − XΘk2F .
Next consider the SVD of Ȳ:
X
Ȳ = λj uj vj>
j
where λ1 ≥ λ2 ≥ . . . and define Ỹ by

k
X
Ỹ = λj uj vj> .
j=1
By Lemma 5.1, it holds

kȲ − Ỹk2F = min kȲ − Zk2F .
Z:rank(Z)≤k
Therefore, any minimizer of XΘ 7→ kY − XΘk2F over matrices of rank at most

k can be obtained by truncating the SVD of Ȳ at order k.
Once XΘ̂rk has been found, one may obtain a corresponding Θ̂rk by least
squares but this is not necessary for our results.
Remark 5.6. While the rank penalized estimator can be computed efficiently,
it is worth pointing out that a convex relaxation for the rank penalty can also
be used. The estimator by nuclear norm penalization Θ̂ is defined to be any
solution to the minimization problem
n1 o
min kY − XΘk2F + τ kΘk1
Θ∈IRd×T n
Clearly this criterion is convex and it can actually be implemented efficiently
using semi-definite programming. It has been popularized by matrix comple-
tion problems. Let X have the following SVD:
r
X
X= λj uj vj> ,
j=1
with λ1 ≥ λ2 ≥ . . . ≥ λr > 0. It can be shown that for some appropriate choice

of τ , it holds
1 λ1 σ 2 rank(Θ∗ )
kXΘ̂ − XΘ∗ k2F . d∨T
n λr n
with probability .99. However, the proof of this result is far more involved
than a simple adaption of the proof for the Lasso estimator to the matrix case
(the readers are invited to see that for themselves). For one thing, there is no
assumption on the design matrix (such as INC for example). This result can
be found in [KLT11].
5.3. Covariance matrix estimation 132
5.3 COVARIANCE MATRIX ESTIMATION
Empirical covariance matrix

d
X1 , .. . , Xn be n i.i.d. copies of a random vector X ∈ IR such that
Let
IE XX > = Σ for some unknown matrix Σ 0 called covariance matrix.
This matrix contains information about the moments of order 2 of the random
vector X. A natural candidate to estimate Σ is the empirical covariance matrix
Σ̂ defined by
n
1X
Σ̂ = Xi Xi> .
n i=1
Using the tools of Chapter 1, we can prove the following result.
Theorem 5.7. Let Y ∈ IRd be a random vector such that IE[Y ] = 0, IE[Y Y > ] =
Id and Y ∼ subGd (1). Let X1 , . . . , Xn be n independent copies of sub-Gaussian
random vector X = Σ1/2 Y . Then IE[X] = 0, IE[XX > ] = Σ and X ∼ subGd (kΣkop ).
Moreover,
r
d + log(1/δ) d + log(1/δ)
kΣ̂ − Σkop . kΣkop ∨ ,
n n
Proof. It follows from elementary computations that IE[X] = 0, IE[XX > ] = Σ
and X ∼ subGd (kΣkop ). To prove the deviation inequality, observe first that
without loss of generality, we can assume that Σ = Id . Indeed,
Pn
kΣ̂ − Σkop k n1 i=1 Xi Xi> − Σkop
=
kΣkop kΣkop
Pn
kΣ kop k n1 i=1 Yi Yi> − Id kop kΣ1/2 kop
1/2
≤
kΣkop
n
1X
=k Yi Yi> − Id kop .
n i=1
Thus, in the rest of the proof, we assume that Σ = Id . Let N be a 1/4-net

for S d−1 such that |N | ≤ 12d . It follows from the proof of Lemma 5.3 that
kΣ̂ − Id kop ≤ 2 max x> (Σ̂ − Id )y.
x,y∈N
Hence, for any t ≥ 0, by a union bound,

X
IP x> (Σ̂ − Id )y > t/2 .

IP kΣ̂ − Id kop > t ≤ (5.6)
x,y∈N
It holds,
n
> 1 X >
(Xi x)(Xi> y) − IE (Xi> x)(Xi> y) .

x (Σ̂ − Id )y =
n i=1
5.3. Covariance matrix estimation 133
Using polarization, we also have

2 2
Z+ − Z−
(Xi> x)(Xi> y) = ,
4
where Z+ = Xi> (x + y) and Z− = Xi> (x − y). It yields
i
IE exp s (Xi> x)(Xi> y) − IE (Xi> x)(Xi> y)

s 2 2
s 2 2
i
= IE exp Z+ − IE[Z+ ] − Z− − IE[Z− ])
4 4
s 2 2
s 2 2
1/2
≤ IE exp Z+ − IE[Z+ ] IE exp − Z− − IE[Z− ] ,
2 2
where in the last inequality, we used Cauchy-Schwarz. Next, since X ∼
subGd (1), we have Z+ , Z− ∼ subG(2), and it follows from Lemma 1.12 that
2 2 2 2
Z+ − IE[Z+ ] ∼ subE(32) , and Z− − IE[Z− ] ∼ subE(32).
Therefore, for any s ≤ 1/16, we have for any Z ∈ {Z+ , Z− } that

s 2 2
Z − IE[Z 2 ] ≤ e128s .

IE exp
2
It yields that
(Xi> x)(Xi> y) − IE (Xi> x)(Xi> y) ∼ subE(16) .

Applying now Bernstein’s inequality (Theorem 1.13), we get

h n t 2 t i
IP x> (Σ̂ − Id )y > t/2 ≤ exp − (

∧ ) . (5.7)
2 32 32
Together with (5.6), this yields
h n t 2 t i
IP kΣ̂ − Id kop > t ≤ 144d exp − (

∧ ) . (5.8)
2 32 32
In particular, the right hand side of the above inequality is at most δ ∈ (0, 1) if
t 2d 2 2d 2 1/2
≥ log(144) + log(1/δ) ∨ log(144) + log(1/δ) .
32 n n n n
This concludes our proof.
Theorem 5.7 indicates that for fixed d, the empirical covariance matrix is a
consistent estimator of Σ (in any norm as they are all equivalent in finite dimen-
sion). However, the bound that we got is not satisfactory in high-dimensions
when d n. To overcome this limitation, we can introduce sparsity as we have
done in the case of regression. The most obvious way to do so is to assume
that few of the entries of Σ are non zero and it turns out that in this case
5.4. Principal component analysis 134
thresholding is optimal. There is a long line of work on this subject (see for
example [CZZ10] and [CZ12]).
Once we have a good estimator of Σ, what can we do with it? The key
insight is that Σ contains information about the projection of the vector X
onto any direction u ∈ S d−1 . Indeed, we have that var(X > u) = u> Σu, which
d > u) = u> Σ̂u. Observe that it follows from
can be readily estimated by Var(X
Theorem 5.7 that
d > u) − Var(X > u) = u> (Σ̂ − Σ)u
Var(X
≤ kΣ̂ − Σkop
r
d + log(1/δ) d + log(1/δ)
. kΣkop ∨
n n
The above fact is useful in the Markowitz theory of portfolio section for
example [Mar52], where a portfolio of assets is a vector u ∈ IRd such that
|u|1 = 1 and the risk of a portfolio is given by the variance Var(X > u). The
goal is then to maximize reward subject to risk constraints. In most instances,
the empirical covariance matrix is plugged into the formula in place of Σ. (See
Problem 5.4).
5.4 PRINCIPAL COMPONENT ANALYSIS
Spiked covariance model

Estimating the variance in all directions is also useful for dimension reduction.
In Principal Component Analysis (PCA), the goal is to find one (or more)
directions onto which the data X1 , . . . , Xn can be projected without loosing
much of its properties. There are several goals for doing this but perhaps
the most prominent ones are data visualization (in few dimensions, one can
plot and visualize the cloud of n points) and clustering (clustering is a hard
computational problem and it is therefore preferable to carry it out in lower
dimensions). An example of the output of a principal component analysis
is given in Figure 5.1. In this figure, the data has been projected onto two
orthogonal directions PC1 and PC2, that were estimated to have the most
variance (among all such orthogonal pairs). The idea is that when projected
onto such directions, points will remain far apart and a clustering pattern
will still emerge. This is the case in Figure 5.1 where the original data is
given by d = 500, 000 gene expression levels measured on n ' 1, 387 people.
Depicted are the projections of these 1, 387 points in two dimension. This
image has become quite popular as it shows that gene expression levels can
recover the structure induced by geographic clustering. How is it possible to
“compress” half a million dimensions into only two? The answer is that the
data is intrinsically low dimensional. In this case, a plausible assumption is
that all the 1, 387 points live close to a two-dimensional linear subspace. To see
how this assumption (in one dimension instead of two for simplicity) translates
Figure 5.1. Projection onto two dimensions of 1, 387 points from gene expression data.
Source: Gene expression blog.
into the structure of the covariance matrix Σ, assume that X1 , . . . , Xn are

Gaussian random variables generated as follows. Fix a direction v ∈ S d−1
and let Y1 , . . . , Yn ∼ Nd (0, Id ) so that v > Yi are i.i.d. N (0, 1). In particular,
the vectors (v > Y1 )v, . . . , (v > Yn )v live in the one-dimensional space spanned by
v. If one would observe such data the problem would be easy as only two
observations would suffice to recover v. Instead, we observe X1 , . . . , Xn ∈ IRd
where Xi = (v > Yi )v + Zi , and Zi ∼ Nd (0, σ 2 Id ) are i.i.d. and independent of
the Yi s, that is we add isotropic noise to every point. If the σ is small enough,
we can hope to recover the direction v (See Figure 5.2). The covariance matrix
of Xi generated as such is given by
Σ = IE XX > = IE ((v > Y )v + Z)((v > Y )v + Z)> = vv > + σ 2 Id .

This model is often called the spiked covariance model. By a simple rescaling,
it is equivalent to the following definition.
Definition 5.8. A covariance matrix Σ ∈ IRd×d is said to satisfy the spiked
covariance model if it is of the form
Σ = θvv > + Id ,
Figure 5.2. Points are close to a one dimensional space spanned by v.
where θ > 0 and v ∈ S d−1 . The vector v is called the spike.

This model can be extended to more than one spike but this extension is
beyond the scope of these notes.
Clearly, under the spiked covariance model, v is the eigenvector of the
matrix Σ that is associated to its largest eigenvalue 1 + θ. We will refer to
this vector simply as largest eigenvector. To estimate it, a natural candidate
is the largest eigenvector v̂ of Σ̃, where Σ̃ is any estimator of Σ. There is a
caveat: by symmetry, if u is an eigenvector, of a symmetric matrix, then −u is
also an eigenvector associated to the same eigenvalue. Therefore, we may only
estimate v up to a sign flip. To overcome this limitation, it is often useful to
describe proximity between two vectors u and v in terms of the principal angle
between their linear span. Let us recall that for two unit vectors the principal
angle between their linear spans is denoted by ∠(u, v) and defined as
∠(u, v) = arccos(|u> v|) .
The following result from perturbation theory is known as the Davis-Kahan

sin(θ) theorem as it bounds the sin of the principal angle between eigenspaces.
This theorem exists in much more general versions that extend beyond one-
dimensional eigenspaces.
Theorem 5.9 (Davis-Kahan sin(θ) theorem). Let A, B be two d × d PSD
matrices. Let (λ1 , u1 ), . . . , (λd , ud ) (resp. (µ1 , v1 ), . . . , (µd , vd )) denote the pairs
of eigenvalues–eigenvectors of A (resp. B) ordered such that λ1 ≥ λ2 ≥ . . .
(resp. µ1 ≥ µ2 ≥ . . . ). Then
2
sin ∠(u1 , v1 ) ≤ kA − Bkop .
max(λ1 − λ2 , µ1 − µ2 )
Moreover,
min |εu1 − v1 |22 ≤ 2 sin2 ∠(u1 , v1 ) .

ε∈{±1}
Proof. Note that u>

1 Au1 = λ1 and for any x ∈ S
d−1
, it holds
d
X
x> Ax = λj (u> 2 > 2 > 2
j x) ≤ λ1 (u1 x) + λ2 (1 − (u1 x) )
j=1
= λ1 cos2 ∠(u1 , x) + λ2 sin2 ∠(u1 , x) .

Therefore, taking x = v1 , we get that on the one hand,
u> > 2
∠(u1 , x) − λ2 sin2 ∠(u1 , x)

1 Au1 − v1 Av1 ≥ λ1 − λ1 cos
= (λ1 − λ2 ) sin2 ∠(u1 , x) .

On the other hand,
u> > > > >

1 Au1 − v1 Av1 = u1 Bu1 − v1 Av1 + u1 (A − B)u1
≤ v1> Bv1 − v1> Av1 + u>

1 (A − B)u1 (v1 is leading eigenvector of B)
= hA − B, u1 u>
1− v1 v1> i
≤ kA − Bkop ku1 u> >
1 − v1 v1 k1 (Hölder)
√
≤ kA − Bkop 2ku1 u>
1 − v1 v1> k2 (Cauchy-Schwarz)
where in the last inequality, we used the fact that rank(u1 u> >
1 − v1 v1 ) = 2.
It is straightforward to check that
ku1 u> > 2 > > 2 > 2 2

1 − v1 v1 k2 = ku1 u1 − v1 v1 kF = 2 − 2(u1 v1 ) = 2 sin ∠(u1 , v1 ) .
We have proved that
(λ1 − λ2 ) sin2 ∠(u1 , x) ≤ 2kA − Bkop sin ∠(u1 , x) ,

which concludes the first part of the lemma. Note that we can replace λ1 − λ2
with µ1 − µ2 since the result is completely symmetric in A and B.
It remains to show the second part of the lemma. To that end, observe that
min |εu1 − v1 |22 = 2 − 2|ũ> > 2 2

1 v1 | ≤ 2 − 2(u1 v1 ) = 2 sin ∠(u1 , x) .
ε∈{±1}
Combined with Theorem 5.7, we immediately get the following corollary.

Corollary 5.10. Let Y ∈ IRd be a random vector such that IE[Y ] = 0, IE[Y Y > ] =
random vector X = Σ1/2 Y so that IE[X] = 0, IE[XX > ] = Σ and X ∼ subGd (kΣkop ).
Assume further that Σ = θvv > +Id satisfies the spiked covariance model. Then,
the largest eigenvector v̂ of the empirical covariance matrix Σ̂ satisfies,
r
1 + θ d + log(1/δ) d + log(1/δ)
min |εv̂ − v|2 . ∨
ε∈{±1} θ n n
This result justifies the use of the empirical covariance matrix Σ̂ as a re-
placement for the true covariance matrix Σ when performing PCA in low di-
mensions, that is when d n. In the high-dimensional case, where d n,
the above result is uninformative. As before, we resort to sparsity to overcome
this limitation.
Sparse PCA
In the example of Figure 5.1, it may be desirable to interpret the meaning of
the two directions denoted by PC1 and PC2. We know that they are linear
combinations of the original 500,000 gene expression levels. A natural question
to ask is whether only a subset of these genes could suffice to obtain similar
results. Such a discovery could have potential interesting scientific applications
as it would point to a few genes responsible for disparities between European
populations.
In the case of the spiked covariance model this amounts to having a sparse
v. Beyond interpretability as we just discussed, sparsity should also lead to
statistical stability as in the case of sparse linear regression for example. To
enforce sparsity, we will assume that v in the spiked covariance model is k-
sparse: |v|0 = k. Therefore, a natural candidate to estimate v is given by v̂
defined by
v̂ > Σ̂v̂ = max u> Σ̂u .
u∈S d−1
|u|0 =k
It is easy to check that λkmax (Σ̂) = v̂ > Σ̂v̂ is the largest of all leading eigenvalues
among all k × k sub-matrices of Σ̂ so that the maximum is indeed attained,
though there my be several maximizers. We call λkmax (Σ̂) the k-sparse leading
eigenvalue of Σ̂ and v̂ a k-sparse leading eigenvector.
Theorem 5.11. Let Y ∈ IRd be a random vector such that IE[Y ] = 0, IE[Y Y > ] =
random vector X = Σ1/2 Y so that IE[X] = 0, IE[XX > ] = Σ and X ∼ subGd (kΣkop ).
Assume further that Σ = θvv > + Id satisfies the spiked covariance model for
v such that |v|0 = k ≤ d/2. Then, the k-sparse largest eigenvector v̂ of the
empirical covariance matrix satisfies,
r
1 + θ k log(ed/k) + log(1/δ) k log(ed/k) + log(1/δ)
min |εv̂ − v|2 . ∨ .
ε∈{±1} θ n n
Proof. We begin by obtaining an intermediate result of the Davis-Kahan sin(θ)

theorem. Using the same steps as in the proof of Theorem 5.9, we get
v > Σv − v̂ > Σv̂ ≤ hΣ̂ − Σ, v̂v̂ > − vv > i
Since both v̂ and v are k sparse, there exists a (random) set S ⊂ {1, . . . , d}
such that |S| ≤ 2k and {v̂v̂ > − vv > }ij = 0 if (i, j) ∈
/ S 2 . It yields
hΣ̂ − Σ, v̂v̂ > − vv > i = hΣ̂(S) − Σ(S), v̂(S)v̂(S)> − v(S)v(S)> i
Where for any d × d matrix M , we defined the matrix M (S) to be the |S| × |S|
sub-matrix of M with rows and columns indexed by S and for any vector
x ∈ IRd , x(S) ∈ IR|S| denotes the sub-vector of x with coordinates indexed by

S. It yields by Hölder’s inequality that
v > Σv − v̂ > Σv̂ ≤ kΣ̂(S) − Σ(S)kop kv̂(S)v̂(S)> − v(S)v(S)> k1 .
Following the same steps as in the proof of Theorem 5.9, we get now that
2
sin ∠(v̂, v) ≤ sup kΣ̂(S) − Σ(S)kop .
θ S : |S|=2k
To conclude the proof, it remains to control supS : |S|=2k kΣ̂(S) − Σ(S)kop . To

that end, observe that
h i
IP sup kΣ̂(S) − Σ(S)kop > tkΣkop
S : |S|=2k
X h i
≤ IP sup kΣ̂(S) − Σ(S)kop > tkΣ(S)kop
S : |S|=2k
S : |S|=2k
h n t 2
d t i
≤ 1442k exp − ( ∧ ) .
2k 2 32 32
where we used (5.8) in the second inequality. Using Lemma 2.7, we get that
the right-hand side above is further bounded by
h n t 2 t ed i
exp − ( ∧ ) + 2k log(144) + k log
2 32 32 2k
Choosing now t such that
r
k log(ed/k) + log(1/δ) k log(ed/k) + log(1/δ)
t≥C ∨ ,
n n
for large enough C ensures that the desired bound holds with probability at
least 1 − δ.
Minimax lower bound

The last section has established that the spike
p v in the spiked covariance model
Σ = θvv > may be estimated at the rate θ−1 k log(d/k)/n, assuming that θ ≤ 1
and k log(d/k) ≤ n, which corresponds to the interesting regime. Using the
tools from Chapter 4, we can show that this rate is in fact optimal already in
the Gaussian case.
Theorem 5.12. Let X1 , . . . , Xn be n independent copies of the Gaussian ran-
dom vector X ∼ Nd (0, Σ) where Σ = θvv > + Id satisfies the spiked covariance
model for v such that |v|0 = k and write IPv the distribution of X under this
assumption and by IEv the associated expectation. Then, there exists constants
C, c > 0 such that for any n ≥ 2, d ≥ 9, 2 ≤ k ≤ (d + 7)/8, it holds
r
n
h 1 k i
inf sup IPv min |εv̂ − v|2 ≥ C log(ed/k) ≥ c .
v̂
v∈S d−1 ε∈{±1} θ n
|v|0 =k
Proof. We use the general technique developed in Chapter 4. Using this tech-
nique, we need to provide the a set v1 , . . . , vM ∈ S d−1 of k sparse vectors such
that
(i) min |εvj − vk |2 > 2ψ for all 1 ≤ j < k ≤ M ,

ε∈{−1,1}
(ii) KL(IPvj , IPvk ) ≤ c log(M )/n ,

p
where ψ = Cθ−1 k log(d/k)/n. Akin to the Gaussian sequence model, our
goal is to employ the sparse Varshamov-Gilbert lemma to construct the vj ’s.
However, the vj ’s have the extra constraint that they need to be of unit norm
so we cannot simply rescale the sparse binary vectors output by the sparse
Varshamov-Gilbert lemma. Instead, we use the last coordinate to make sure
that they have unit norm.
Recall that it follows from the sparse Varshamov-Gilbert Lemma 4.14 that
there exits ω1 , . . . , ωM ∈ {−1, 1}d−1 such that |ωj |0 = k − 1, ρ(ωi , ωj ) ≥
(k − 1)/2 for all i 6= j and log(M ) ≥ k−1 d−1
8 log(1 + 2k−2 ). Define vj = (γωj , `)
√ p
where γ < 1/ k − 1 is to be defined later and ` = 1 − γ 2 (k − 1). It is
straightforward to check that |vj |0 = k, |vj |2 = 1 and vj has only nonnegative
coordinates. Moreover, for any i 6= j,
k−1
min |εvi − vj |22 = |vi − vj |22 = γ 2 ρ(vi , vj ) ≥ γ 2 . (5.9)
ε∈±1 2
We choose
Cγ d
γ2 = 2
log ,
θ n k
so that γ 2 (k − 1) = ψ and (i) is verified. It remains to check (ii). To that end,
note that
1
KL(IPu , IPv ) = IEu X > (Id + θvv > )−1 − (Id + θvv > )−1 X

2
1
= IEu Tr X > (Id + θvv > )−1 − (Id + θvv > )−1 X

2
1
= Tr (Id + θvv > )−1 − (Id + θuu> )−1 (Id + θuu> )

2
Next, we use the Sherman-Morrison formula1 :
θ
(Id + θuu> )−1 = Id − uu> , ∀ u ∈ S d−1 .
1+θ
1 To see this, check that (I + θuu> )(I − θ
d d 1+θ
) = Id and that the two matrices on the
left-hand side commute.
5.5. Graphical models 141
It yields
θ
KL(IPu , IPv ) = Tr (uu> − vv > )(Id + θuu> )
2(1 + θ)
θ2
= (1 − (u> v)2 )
2(1 + θ)
θ2
sin2 ∠(u, v)

=
2(1 + θ)
It yields
θ2
sin2 ∠(vi , vj ) .

KL(IPvi , IPvj ) =
2(1 + θ)
Next, note that
k d log M
sin2 ∠(vi , vj ) = 1 − γ 2 ωi> ωj − `2 ≤ γ 2 (k − 1) ≤ Cγ 2 log

≤ 2 ,
θ n k θ n
for Cγ small enough. We conclude that (ii) holds, which completes our proof.
Together with the upper bound of Theorem 5.11, we get the following corol-
lary.
p
Corollary 5.13. Assume that θ ≤ 1 and k log(d/k) ≤ n. Then θ−1 k log(d/k)/n
is the minimax rate of estimation over B0 (k) in the spiked covariance model.
5.5 GRAPHICAL MODELS
Gaussian graphical models

In Section 5.3, we noted that thresholding is an appropriate way to gain ad-
vantage from sparsity in the covariance matrix when trying to estimate it.
However, in some cases, it is more natural to assume sparsity in the inverse of
the covariance matrix, Θ = Σ−1 , which is sometimes called concentration or
precision matrix. In this chapter, we will develop an approach that allows us
to get guarantees similar to the lasso for the error between the estimate and
the ground truth matrix in Frobenius norm.
We can understand why sparsity in Θ can be appropriate in the context of
undirected graphical models [Lau96]. These are models that serve to simplify
dependence relations between high dimensional distributions according to a
graph.
Definition 5.14. Let G = (V, E) be a graph on d nodes and to each node
v, associate a random variable Xv . An undirected graphical model or Markov
random field is a collection of probability distributions over Xv that factorize
according to
1 Y
p(x1 , . . . , xd ) = ψC (xC ), (5.10)
Z
C∈C
where C is the collection of all cliques (completely connected subsets) in G,

xC is the restriction of (x1 , . . . , xd ) to the subset C, and ψC are non-negative
potential functions.
This definition implies certain conditional independence relations, such as
the following.
Proposition 5.15. A graphical model as in (5.10) fulfills the global Markov
property. That is, for any triple (A, B, S) of disjoint subsets such that S sepa-
rates A and B,
A⊥ ⊥ B | S.
Proof. Let (A, B, S) a triple as in the proposition statement and define Ã to
be the connected component in the induced subgraph G[V \ S], as well as
B̃ = G[V \ (Ã ∪ S)]. By assumption, A and B have to be in different connected
components of G[V \ S], hence any clique in G has to be a subset of Ã ∪ S or
B̃ ∪ S. Denoting the former cliques by CA , we get
Y Y Y
f (x) = ψC (x) = ψC (x) ψC (x) = h(xÃ∪S )k(xB̃∪S ),
C∈C C∈CA c∈C\CA
which implies the desired conditional independence.

A weaker form of the Markov property is the so-called pairwise Markov
property, which says that the above holds only for singleton sets of the form
A = {a}, B = {b} and S = V \ {a, b} if (a, b) ∈
/ E.
In the following, we will focus on Gaussian graphical models for n i.i.d.
samples Xi ∼ N (0, (Θ∗ )−1 ). By contrast, there is a large body of work on
discrete graphical models such as the Ising model (see for example [Bre15]),
which we are not going to discuss here, although links to the following can be
established as well [LWo12].
For a Gaussian models with mean zero, by considering the density in terms
of the concentration matrix Θ,
fΘ (x) ∝ exp(−x> Θx/2),
it is immediate that the pairs a, b for which the pairwise Markov property
holds are exactly those for which Θi,j = 0. Conversely, it can be shown that
for a family of distributions with non-negative density, this actually implies the
factorization (5.10) according to the graph given by the edges {(i, j) : Θi,j 6= 0}.
This is the content of the Hammersley-Clifford theorem.
Theorem 5.16 (Hammersley-Clifford). Let P be a probability distribution with
positive density f with respect to a product measure over V . Then P factorizes
over G as in (5.10) if and only if it fulfills the pairwise Markov property.
We show the if direction. The requirement that f be positive allows us to
take logarithms. Pick a fixed element x∗ . The idea of the proof is to rewrite the
density in a way that allows us to make use of the pairwise Markov property.
The following combinatorial lemma will be used for this purpose.
Lemma 5.17 (Moebius Inversion). Let Ψ, Φ : P(V ) → IR be functions defined

for all subsets of a set V . Then, the following are equivalent:
P
1. For all A ⊆ V : Ψ(A) = B⊆A Φ(B);
2. For all A ⊆ V : Φ(A) = B⊆A (−1)|A\B Ψ(B).

P
Proof. 1. =⇒ 2.:
X X X
Φ(B) = (−1)|B\C| Ψ(C)
B⊆A B⊆A C⊆B
 
X X
= Ψ(C)  (−1)|B\C| 
C⊆A C⊆B⊆A
 
X X
= Ψ(C)  (−1)|H|  .
C⊆A H⊆A\C
Now, H⊆A\C (−1)|H| = 0 unless A \ C = ∅ because any non-empty set has

P
the same number of subsets with odd and even cardinality, which can be seen
by induction.
Proof. Define
HA (x) = log f (xA , x∗Ac )
and X
ΦA (x) = (−1)|A\B| HB (x).
B⊆A
Note that both HA and ΦA depend on x only through xA . By Lemma 5.17,

we get X
log f (x) = HV (x) = ΦA (x).
A⊆V
It remains to show that ΦA = 0 if A is not a clique. Assume A ⊆ V with

v, w ∈ A, (v, w) ∈/ E, and set C = V \ {v, w}. By adding all possible subsets
of {v, w} to every subset of C, we get every subset of A, and hence
X
ΦA (x) = (−1)|C\B| (HB − HB∪{v} − HB∪{w} + HB∪{v,w} ).
B⊆C
Writing D = V \ {v, w}, by the pairwise Markov property, we have

f (xv , xw , xB , x∗D\B
HB∪{v,w} (x) − HB∪{v} (x) = log
f (xv , x∗w , xB , x∗D\B
f (xv |xB , x∗D\B )f (xw , xB , x∗D\B )
= log
f (xv |xB , x∗D\B )f (x∗w , xB , x∗D\B )
f (x∗v |xB , x∗D\B )f (xw , xB , x∗D\B )
= log
f (x∗v |xB , x∗D\B )f (x∗w , xB , x∗D\B )
f (x∗v , xw , xB , x∗D\B )
= log
f (x∗v , x∗w , xB , x∗D\B )
= HB∪{w} (x) − HB (x).
Thus ΨA vanishes if A is not a clique.
In the following, we will show how to exploit this kind of sparsity when
trying to estimate Θ. The key insight is that we can set up an optimization
problem that has a similar error sensitivity as the lasso, but with respect to Θ.
In fact, it is the penalized version of the maximum likelihood estimator for the
Gaussian distribution. We set
Θ̂λ = argmin Tr(Σ̂Θ) − log det(Θ) + λkΘDc k1 , (5.11)
Θ0
P
where we wrote ΘDc for the off-diagonal restriction of Θ, kAk1 = i,j |Ai,j | for
the element-wise `1 norm, and Σ̂ = n1 i Xi Xi> for the empirical covariance
P
matrix of n i.i.d. observations. We will also make use of the notation
kAk∞ = max |Ai,j |
i,j∈[d]
for the element-wise ∞-norm of a matrix A ∈ IRd×d and

X
kAk∞,∞ = max |Ai,j |
i
j
for the ∞ operator norm.

We note that the function Θ 7→ − log det(Θ) has first derivative −Θ−1 and
second derivative Θ−1 ⊗ Θ−1 , which means it is convex. Derivations for this
can be found in [BV04, Appendix A.4]. This means that in the population case
and for λ = 0, the minimizer in (5.11) would coincide with Θ∗ .
First, let us derive a bound on the error in Σ̂ that is similar to Theorem 1.14,
with the difference that we are dealing with sub-Exponential distributions.
Pn
Lemma 5.18. If Xi ∼ N (0, Σ) are i.i.d. and Σ̂ = n1 i=1 Xi Xi> denotes the
empirical covariance matrix, then
r
log(d/δ)
kΣ̂ − Σk∞ . kΣkop ,
n
with probability at least 1 − δ, if n & log(d/δ).
Proof. Start as in the proof of Theorem 5.7 by writing Y = Σ−1/2 X ∼ N (0, Id )

and notice that
n
!
> 1/2 > 1X >
|ei (Σ̂ − Σ)ej | = |(Σ ei ) Yi Yi − Id (Σ1/2 ej )|
n i=1
n
!
1 X
≤ kΣ1/2 k2op |v > Yi Yi> − Id w|
n i=1
with v, w ∈ S d−1 . Hence, without loss of generality, we can assume Σ = Id .

The remaining term is sub-Exponential and can be controlled as in (5.7)
h n t 2 t i
IP e>

i (Σ̂ − Id )ej > t ≤ exp − ( ∧ ) .
2 16 16
By a union bound, we get
h n t 2 t i
IP kΣ̂ − Id k∞ > t ≤ d2 exp − (

∧ ) .
2 16 16
Hence, if n & log(d/δ), then the right-hand side above is at most δ ∈ (0, 1) if
1/2
t 2 log(d/δ)
≥ .
16 n
Pn
Theorem 5.19. Let Xi ∼ N (0, Σ) be i.i.d. and Σ̂ = n1 i=1 Xi Xi> denote
the empirical covariance matrix. Assume that |S| = k, where S = {(i, j) : i 6=
j, Θi,j 6= 0}. There exists a constant c = c(kΣkop ) such that if
r
log(d/δ)
λ=c ,
n
and n & log(d/δ), then the minimizer in (5.11) fulfills
(p + k) log(d/δ)
kΘ̂ − Θ∗ k2F .
n
Proof. We start by considering the likelihood
l(Θ, Σ) = Tr(ΣΘ) − log det(Θ).
Taylor expanding it and the mean value theorem yield
l(Θ, Σn ) − l(Θ∗ , Σn ) = Tr(Σn (Θ − Θ∗ )) − Tr(Σ∗ (Θ − Θ∗ ))

1
+ Tr(Θ̃−1 (Θ − Θ∗ )Θ̃−1 (Θ − Θ∗ )),
2
for some Θ̃ = Θ∗ + t(Θ − Θ∗ ), t ∈ [0, 1]. Note that essentially by the convexity
of log det, we have
Tr(Θ̃−1 (Θ − Θ∗ )Θ̃−1 (Θ − Θ∗ )) = kΘ̃−1 (Θ − Θ∗ )k2F ≥ λmin (Θ̃−1 )2 kΘ − Θ∗ k2F
and
λmin (Θ̃−1 ) = (λmax (Θ̃))−1 .
If we write ∆ = Θ − Θ∗ , then for k∆kF ≤ 1,
λmax (Θ̃) = kΘ̃kop = kΘ∗ +∆kop ≤ kΘ∗ kop +k∆kop ≤ kΘ∗ kop +k∆kF ≤ kΘ∗ kop +1,
and therefore
l(Θ, Σn ) − l(Θ∗ , Σn ) ≥ Tr((Σn − Σ∗ )(Θ − Θ∗ )) + ckΘ − Θ∗ k2F ,
for c = (kΘ∗ kop + 1)−2 /2, if k∆kF ≤ 1.
This takes care of the case where ∆ is small. To handle the case where it
is large, define g(t) = l(Θ∗ + t∆, Σn ) − l(Θ∗ , Σn ). By the convexity of l in Θ,
g(1) − g(0) g(t) − g(0)
≥ ,
1 t
so that plugging in t = k∆k−1
F gives

1
l(Θ, Σn ) − l(Θ∗ , Σn ) ≥ k∆kF l(Θ∗ +∆, Σn ) − l(Θ∗ , Σn )
k∆kF

1
≥ k∆kF Tr((Σn − Σ∗ ) ∆) + c
k∆kF
= Tr((Σn − Σ∗ )∆) + ck∆kF ,
for k∆kF ≥ 1.
If we now write ∆ = Θ̂ − Θ∗ and assume k∆kF ≥ 1, then by optimality of
Θ̂,
k∆kF ≤ C [Tr((Σ∗ − Σn )∆) + λ(kΘ∗Dc k1 − kΘDc k1 )]
≤ C [Tr((Σ∗ − Σn )∆) + λ(k∆S k1 − k∆S c k1 )] ,
where S = {(i, j) ∈ Dc : Θ∗i,j 6= 0} by triangle inequality. Now, split the error
contributions for the diagonal and off-diagonal elements,
Tr((Σ∗ − Σn )∆) + λ(k∆S k1 − k∆S c k1 )
≤ k(Σ∗ − Σn )D kF k∆D kF + k(Σ∗ − Σn )Dc k∞ k∆Dc k1
+ λ(k∆S k1 − k∆S c k1 ).
√
By Hölder inequality,
p k(Σ∗ −√Σn )D kF ≤ dkΣ∗ − Σn k∞ , and by Lemma 5.18,
kΣ∗ − Σn k∞ ≤ log(d/δ)/ n for n & log(d/δ). Combining these two esti-
mates,
Tr((Σ∗ − Σn )∆) + λ(k∆S k1 − k∆S c k1 )
r r
d log(d/δ) log(d/δ)
. k∆D kF + k∆Dc k1 + λ(k∆S k1 − k∆S c k1 )
n n
p
Setting λ = C log(ep/δ)/n and splitting k∆Dc k1 = k∆S k1 + k∆S c k1 yields
r r
k∆D kF + k∆Dc k1 + λ(k∆S k1 − k∆S c k1 )
n n
r r
≤ k∆D kF + k∆S k1
n n
r r
d log(d/δ) k log(d/δ)
≤ k∆D kF + k∆S kF
n n
r
(d + k) log(d/δ)
≤ k∆kF
n
Combining this with a left-hand side of k∆kF yields k∆kF = 0 for n &
(d + k) log(d/δ), a contradiction to k∆kF ≥ 1 where this bound is effective.
Combining it with a left-hand side of k∆k2F gives us
(d + k) log(d/δ)
k∆k2F . ,
n
as desired.
Write Σ∗ = W ∗ Γ∗ W ∗ , where W 2 = Σ∗D and similarly define Ŵ by Ŵ 2 =

Σ̂D . Define a slightly modified estimator by replacing Σ̂ in (5.11) by Γ̂ to get
an estimator K̂ for K = (Γ∗ )−1 .
kΓ̂ − Γk∞ = kŴ −1 Σ̂Ŵ −1 − (W ∗ )−1 Σ∗ (W ∗ )−1 k∞ (5.12)

≤ kŴ −1 − (W ∗ )−1 k∞,∞ kΣ̂ − Σ∗ k∞ kŴ −1 − (W ∗ )−1 k∞,∞

+ kŴ −1 − (W ∗ )−1 k∞,∞ kΣ̂k∞ k(W ∗ )−1 k∞ + kΣ∗ k∞ kŴ −1 k∞,∞
+ kŴ −1 k∞,∞ kΣ̂ − Σ∗ k∞ k(W ∗ )−1 k∞,∞ (5.13)
p √
Note that kAk∞ ≤ kAkop , so that if n & log(d/δ), then kΓ̂−Γk
p ∞ . log(d/δ)/ n.
From the above arguments, conclude kK̂ − K ∗ kF . k log(d/δ)/n and
with a calculation similar to (5.13) that
r
0 ∗ k log(d/δ)
kΘ̂ − Θ kop . .
n
2 2 2
p will not work in Frobenius norm since there, we only have kŴ − W kF .
This
p log(d/δ)/n.
Lower bounds
Here, we will show that the bounds in Theorem 5.19 are optimal up to log
factors. It will again be based on applying Fano’s inequality, Theorem 4.10.
In order to calculate the KL divergence between two Gaussian distributions

with densities fΣ1 and fΣ2 , we first observe that
fΣ1 (x) 1 1 1 1
log = − x> Θ1 x + log det(Θ1 ) + x> Θ2 x − log det(Θ2 )
fΣ2 (x) 2 2 2 2
, so that

fΣ1
KL(IPΣ1 , IPΣ2 ) = IEΣ1 log (X)
fΣ2
1
log det(Θ1 ) − log det(Θ2 ) + IE[Tr((Θ2 − Θ1 )XX > )]

=
2
1
= [log det(Θ1 ) − log det(Θ2 ) + Tr((Θ2 − Θ1 )Σ1 )]
2
1
= [log det(Θ1 ) − log det(Θ2 ) + Tr(Θ2 Σ1 ) − p] .
2
writing ∆ = Θ2 − Θ1 , Θ̃ = Θ1 + t∆, t ∈ [0, 1], Taylor expansion then yields

1 1 −1 −1
KL(IPΣ1 , IPΣ2 ) = −Tr(∆Σ1 ) + Tr(Θ̃ ∆Θ̃ ∆) + Tr(∆Σ1 )
2 2
1
≤ λmax (Θ̃−1 )k∆k2F
4
1
= λ−1 (Θ̃)k∆k2F .
4 min
This means we are in good shape in applying our standard tricks if we can
make sure that λmin (Θ̃) stays large enough among our hypotheses. To this end,
assume k ≤ p2 /16 and define the collection of matrices (Bj )j∈[M1 ] with entries
in {0, 1} such that they are symmetric,
√ have zero diagonal and only have non-
zero entries in a band of size k around the diagonal, and whose entries are
given by a flattening of the vectors obtained by the sparse Varshamov-Gilbert
Lemma 4.14. Moreover, let (Cj )j∈[M2 ] be diagonal matrices with entries in
{0, 1} given by the regular Varshamov-Gilbert Lemma 4.12. Then, define
β p
Θj = Θj1 ,j2 = I + √ ( log(1 + d/(2k))Bj1 + Cj2 ), j1 ∈ [M1 ], j2 ∈ [M2 ].
n
We can ensure that λmin (Θ̃) is small by the Gershgorin circle theorem. For
each row of Θ̃ − I,
√
X 2β( k + 1) p 1
|Θ̃il | ≤ √ log(1 + d/(2k) <
n 2
l
if n & k/ log(1 + d/(2k)).

Hence,
β2
kΘj − Θl k2F = (ρ(Bj1 , Bl1 ) log(1 + d/(2k)) + ρ(Cj2 , Cl2 ))
n
β2k
& (k log(1 + d/(2k)) + d),
n
and
β2
kΘj − Θl k2F . (k log(1 + d/(2k)) + d)
n
β2
≤ log(M1 M2 ),
n
which yields the following theorem.
Theorem 5.20. Denote by Bk the set of positive definite matrices with at most
k non-zero off-diagonal entries and assume n & k log(d). Then, if IPΘ denotes
a Gaussian distribution with inverse covariance matrix Θ and mean 0, there
exists a constant c > 0 such that
α2

⊗n ∗ 2 1
inf sup IPΘ∗ kΘ̂ − Θ kF ≥ c (k log(1 + d/(2k)) + d) ≥ − 2α.
∗
Θ̂ Θ ∈Bk n 2
Ising model
Like the Gaussian graphical model, the Ising model is also a model of pairwise
interactions but for random variables that take values in {−1, 1}d .
Before developing our general approach, we describe the main ideas in the
simplest Markov random field: an Ising model without2 external field [VMLC16].
Such models specify the distribution of a random vector Z = (Z (1) , . . . , Z (d) ) ∈
{−1, 1}d as follows
IP(Z = z) = exp z > W z − Φ(W ) ,

(5.14)
where W ∈ IRpd×d is an unknown matrix of interest with null diagonal elements

and X
exp z > W z ,

Φ(W ) = log
z∈{−1,1}d
is a normalization term known as log-partition function.

Fix β, λ > 0. Our goal in this section is to estimate the parameter W subject
to the constraint that the jth row e> >
j W of W satisfies |ej W |1 ≤ λ. To that end,
we observe n independent copies Z1 , . . . Zn of Z ∼ IP. To that end, estimate
the matrix W row by row using constrained likelihood maximization. Recall
from [RWL10, Eq. (4)] that for any j ∈ [d], it holds for any z (j) ∈ {−1, 1}, that
exp(2z (j) e>

j W z)
IP(Z (j) = z (j) |Z (¬j) = z (¬j) ) = , (5.15)
1 + exp(2z (j) e>
j W z)
where we used the fact that the diagonal elements W are equal to zero.
2 The presence of an external field does not change our method. It merely introduces
an intercept in the logistic regression problem and comes at the cost of more cumbersome
notation. All arguments below follow, potentially after minor modifications and explicit
computations are left to the reader.
Therefore, the jth row e>j W of W may be estimated by performing a logistic

regression of Z (j) onto Z (¬j) subject to the constraint that |e> j W |1 ≤ λ and
>
kej W k∞ ≤ β. To simplify notation, fix j and define Y = Z (j) , X = Z (¬j)
and w∗ = (e> j W)
(¬j)
. Let (X1 , Y1 ), . . . , (Xn , Yn ) be n independent copies of
(X, Y ). Then the (rescaled) log-likelihood of a candidate parameter w ∈ IRd−1
is denoted by `¯n (w) and defined by
n exp(2Y X > w)
1X i i
`¯n (w) = log
n i=1 1 + exp(2Yi Xi> w)
n n
1X 1X
Yi Xi> w − log 1 + exp(2Yi Xi> w)

=
n i=1 n i=1
With this representation, it is easy to see that w 7→ `¯n (w) is a concave function.
The (constrained) maximum likelihood estimator (MLE) ŵ ∈ IRd−1 is defined
to be any solution of the following convex optimization problem:
ŵ ∈ argmax `¯n (w) (5.16)

w∈B1 (λ)
This problem can be solved very efficiently using a variety of methods similar
to the Lasso.
The following lemma follows from [Rig12].
Lemma 5.21. Fix δ ∈ (0, 1). Conditionally on (X1 , . . . , Xn ), the constrained
MLE ŵ defined in (5.16) satisfies with probability at least 1 − δ that
n
r
1X > 2 λ log(2(p − 1)/δ)
(X (ŵ − w)) ≤ 2λe
n i=1 i 2n
Proof. We begin with some standard manipulations of the log-likelihood. Note

that
exp(2Y X > w) exp(Yi Xi> w)
i i
log = log
1 + exp(2Yi Xi> w) exp(−Yi Xi> w) + exp(Yi Xi> w)
exp(Yi Xi> w)
= log
exp(−Xi> w) + exp(Xi> w)
= (Yi + 1)Xi> w − log 1 + exp(2Xi> w)

Therefore, writing Ỹi = (Yi + 1)/2 ∈ {0, 1}, we get that ŵ is the solution to the
following minimization problem:
ŵ = argmin κ̄n (w)

w∈B1 (λ)
where
n
1 X
− 2Ỹi Xi> w + log 1 + exp(2Xi> w)

κ̄n (w) =
n i=1
For any w ∈ IRd−1 , write κ(w) = IE[κ̄n (w)] where here and throughout this
proof, all expectations are implicitly taken conditionally on X1 , . . . , Xn . Ob-
serve that
κ(w) = −IE[Ỹ1 ]X1> w + log 1 + exp(2Xi> w) .

Next, we get from the basic inequality κ̄n (ŵ) ≤ κ̄n (w) for any w ∈ B1 (λ)
that
n
1X
κ(ŵ) − κ(w) ≤ (Ỹi − IE[Ỹi ])Xi> (ŵ − w)
n i=1
n
2λ X
≤ max (Ỹi − IE[Ỹi ])Xi> ej ,
n 1≤j≤p−1 i=1
where in the second inequality, we used Hölder’s inequality and the fact |Xi |∞ ≤
1 for all i ∈ [n]. Together with Hoeffding’s inequality and a union bound, it
yields
r
log(4(p − 1)/δ) δ
κ(ŵ) − κ(w) ≤ 2λ , with probability 1 − .
2n 2
Note that the Hessian of κ is given by
n
1 X 4 exp(2Xi> w)
∇2 κ(w) = Xi Xi> .
n i=1 (1 + exp(2Xi> w))2
Moreover, for any w ∈ B1 (λ), since kXi k∞ ≤ 1, it holds
4 exp(2Xi> w) 4e2λ
>
≥ ≥ e−2λ
(1 + exp(2Xi w)) 2 (1 + e2λ )2
Next, since w∗ is a minimizer of w 7→ κ(w), we get from a second order Taylor

expansion that
n
1X >
κ(ŵ) − κ(w∗ ) ≥ e−2λ (X (ŵ − w))2 .
n i=1 i
This completes our proof.
Next, we bound from below the quantity

n
1X >
(X (ŵ − w))2 .
n i=1 i
To that end, we must exploit the covariance structure of the Ising model. The
following lemma is similar to the combination of Lemmas 6 and 7 in [VMLC16].
Lemma 5.22. Fix R > 0, δ ∈ (0, 1) and let Z ∈ IRd be distributed according
to (5.14). The following holds with probability 1−δ/2, uniformly in u ∈ B1 (2λ):
n
r
1X > 2 1 2 log(2p(p − 1)/δ)
(Z u) ≥ |u|∞ exp(−2λ) − 4λ
n i=1 i 2 2n
Proof. Note that

n
1X > 2
(Z u) = u> Sn u ,
n i=1 i
where Sn ∈ IRd×d denotes the sample covariance matrix of Z1 , . . . , Zn , defined

by
n
1X
Sn = Xi Xi> .
n i=1
Observe that
u> Sn u ≥ u> Σu − 2λ|Sn − S|∞ , (5.17)
where S = IE[Sn ]. Using Hoeffding’s inequality together with a union bound,
we get that r
log(2p(p − 1)/δ)
|Sn − S|∞ ≤ 2
2n
with probability 1 − δ/2.
Moreover, u> Su = var(X > u). Assume without loss of generality that
(1)
|u | = |u|∞ and observe that
p p
X 2 X
var(Z > u) = IE[(Z > u)2 ] = (u(1) )2 +IE u(j) Z (j) +2IE u(1) Z (1) u(j) Z (j) .

j=2 j=2
To control the cross term, let us condition on the neighborhood Z (¬1) to get
p p
X X
2 IE u(1) Z (1) u(j) Z (j) = 2 IE IE u(1) Z (1) |Z (¬1) u(j) Z (j)

j=2 j=2
p
2 X 2
≤ IE IE u(1) Z (1) |Z (¬1) + IE u(j) Z (j) ,
j=2
where we used the fact that 2|ab| ≤ a2 + b2 for all a, b ∈ IR. The above two
displays together yield
2
var(Z > u) ≥ |u|2∞ (1 − sup IE[Z (1) |Z (¬1) = z]
z∈{−1,1}p−1
To control the above supremum, recall that from (5.15), we have
exp(2e>
1 W z) − 1 exp(2λ) − 1
sup IE[Z (1) |Z (¬1) = z] = sup = .
z∈{−1,1}p−1 z∈{−1,1}p exp(2e>
1 W z) + 1 exp(2λ) +1
Therefore,
|u|2∞ 1
var(Z > u) ≥ ≥ |u|2∞ exp(−λ) .
1 + exp(2λ) 2
Together with (5.17), it yields the desired result.
Combining the above two lemmas immediately yields the following theorem.
Theorem 5.23. Let ŵ be the constrained maximum likelihood estimator (5.16)
and let w∗ = (e>j W)
(¬j)
be the jth row of W with the jth entry removed. Then
with probability 1 − δ we have
r
∗ 2 log(2p2 /δ)
|ŵ − w |∞ ≤ 9λ exp(3λ) .
n
5.6. Problem set 154
5.6 PROBLEM SET
Problem 5.1. Using the results of Chapter 2, show that the following holds
for the multivariate regression model (5.1).
1. There exists an estimator Θ̂ ∈ IRd×T such that
1 rT
kXΘ̂ − XΘ∗ k2F . σ 2
n n
with probability .99, where r denotes the rank of X .
2. There exists an estimator Θ̂ ∈ IRd×T such that
1 |Θ∗ |0 log(ed)
kXΘ̂ − XΘ∗ k2F . σ 2 .
n n
Problem 5.2. Consider the multivariate regression model (5.1) where Y has
SVD: X
Y= λ̂j ûj v̂j> .
j
Let M be defined by
X
M̂ = λ̂j 1I(|λ̂j | > 2τ )ûj v̂j> , τ > 0 .
j
1. Show that there exists a choice of τ such that

1 σ 2 rank(Θ∗ )
kM̂ − XΘ∗ k2F . (d ∨ T )
n n
2. Show that there exists a matrix n × n matrix P such that P M̂ = XΘ̂ for
some estimator Θ̂ and
1 σ 2 rank(Θ∗ )
kXΘ̂ − XΘ∗ k2F . (d ∨ T )
n n
3. Comment on the above results in light of the results obtain in Section 5.2.
Problem 5.3. Consider the multivariate regression model (5.1) and define Θ̂
be the any solution to the minimization problem
n1 o
min kY − XΘk2F + τ kXΘk1
Θ∈IRd×T n
1. Show that there exists a choice of τ such that

1 σ 2 rank(Θ∗ )
kXΘ̂ − XΘ∗ k2F . (d ∨ T )
n n
[Hint:Consider the matrix
X λ̂j + λ∗j
ûj v̂j>
j
2
where λ∗1 ≥ λ∗2 ≥ . . . and λ̂1 ≥ λ̂2 ≥ . . . are the singular values
of XΘ∗ and Y respectively and the SVD of Y is given by
X
Y= λ̂j ûj v̂j>
j
2. Find a closed form for XΘ̂.
Problem 5.4. In the Markowitz theory of portfolio selection [Mar52], a port-

Pd
folio may be identified to a vector u ∈ IRd such that uj ≥ 0 and j=1 uj = 1.
In this case, uj represent the proportion of the portfolio invested in asset j.
The vector of (random) returns of d assets is denoted by X ∈ IRd and we
assume that X ∼ subGd (1) and IE[XX > ] = Σ unknown.
In this theory, the two key characteristics of a portfolio u are it’s reward
µ(u) = IE[u> X] and its risk R(u) = varX > u. According to this theory one
should fix a minimum reward λ > 0 and choose the optimal portfolio
u∗ = argmin R(u)
u :µ(u)≥λ
when a solution exists for a given. It is the portfolio that has minimum risk
among all portfolios with reward at least λ, provided such portfolios exist.
In practice, the distribution of X is unknown. Assume that we observe n
independent copies X1 , . . . , Xn of X and use them to compute the following
estimators of µ(u) and R(u) respectively:
n
1X >
µ̂(u) = X̄ > u = X u,
n i=1 i
n
> 1 X
R̂(u) = u Σ̂u, Σ̂ = (Xi − X̄)(Xi − X̄)> .
n − 1 i=1
We use the following estimated portfolio:

û = argmin R̂(u)
u :µ̂(u)≥λ
We assume throughout that log d n d .

1. Show that for any portfolio u, it holds

1
µ̂(u) − µ(u) . √ ,
n
and
1
R̂(u) − R(u) . √ .
n
2. Show that r
log d
R̂(û) − R(û) . ,
n
3. Define the estimator ũ by:
ũ = argmin R̂(u)
u :µ̂(u)≥λ−ε
find the smallest ε > 0 (up to multiplicative constant) such that we have
R(ũ) ≤ R(u∗ ) with probability .99.
Bibliography
[AS08] Noga Alon and Joel H. Spencer. The probabilistic method. Wiley-
Interscience Series in Discrete Mathematics and Optimization.
John Wiley & Sons, Inc., Hoboken, NJ, third edition, 2008. With
an appendix on the life and work of Paul Erdős.
[AW02] Rudolf Ahlswede and Andreas Winter. Strong converse for iden-
tification via quantum channels. IEEE Transactions on Infor-
mation Theory, 48(3):569–579, 2002.
[Ber09] Dennis S. Bernstein. Matrix mathematics. Princeton University
Press, Princeton, NJ, second edition, 2009. Theory, facts, and
formulas.
[Bil95] Patrick Billingsley. Probability and measure. Wiley Series in
Probability and Mathematical Statistics. John Wiley & Sons
Inc., New York, third edition, 1995. A Wiley-Interscience Pub-
lication.
[BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Con-
centration inequalities. Oxford University Press, Oxford, 2013.
A nonasymptotic theory of independence, With a foreword by
Michel Ledoux.
[BLT16] Pierre C. Bellec, Guillaume Lecué, and Alexandre B. Tsybakov.
Slope meets lasso: Improved oracle bounds and optimality.
arXiv preprint arXiv:1605.08651, 2016.
[Bre15] Guy Bresler. Efficiently learning Ising models on arbitrary
graphs. STOC’15—Proceedings of the 2015 ACM Symposium
on Theory of Computing, pages 771–782, 2015.
[BRT09] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Si-
multaneous analysis of Lasso and Dantzig selector. Ann. Statist.,
37(4):1705–1732, 2009.
[BT09] Amir Beck and Marc Teboulle. A fast iterative shrinkage-
thresholding algorithm for linear inverse problems. SIAM J.
Imaging Sci., 2(1):183–202, 2009.
157
Bibliography 158
[BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization.

Cambridge University Press, Cambridge, 2004.
[BvS+ 15] Malgorzata Bogdan, Ewout van den Berg, Chiara Sabatti, Wei-
jie Su, and Emmanuel J. Candès. SLOPE—adaptive variable
selection via convex optimization. The annals of applied statis-
tics, 9(3):1103, 2015.
[Cav11] Laurent Cavalier. Inverse problems in statistics. In Inverse prob-
lems and high-dimensional estimation, volume 203 of Lect. Notes
Stat. Proc., pages 3–96. Springer, Heidelberg, 2011.
[CL11] T. Tony Cai and Mark G. Low. Testing composite hypothe-
ses, hermite polynomials and optimal estimation of a nonsmooth
functional. Ann. Statist., 39(2):1012–1041, 04 2011.
[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information
theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ,
second edition, 2006.
[CT07] Emmanuel Candes and Terence Tao. The Dantzig selector: sta-
tistical estimation when p is much larger than n. Ann. Statist.,
35(6):2313–2351, 2007.
[CZ12] T. Tony Cai and Harrison H. Zhou. Minimax estimation of large
covariance matrices under `1 -norm. Statist. Sinica, 22(4):1319–
1349, 2012.
[CZZ10] T. Tony Cai, Cun-Hui Zhang, and Harrison H. Zhou. Opti-
mal rates of convergence for covariance matrix estimation. Ann.
Statist., 38(4):2118–2144, 2010.
[DDGS97] M.J. Donahue, C. Darken, L. Gurvits, and E. Sontag. Rates
of convex approximation in non-hilbert spaces. Constructive
Approximation, 13(2):187–220, 1997.
[EHJT04] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tib-
shirani. Least angle regression. Ann. Statist., 32(2):407–499,
2004. With discussion, and a rejoinder by the authors.
[FHT10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Reg-
ularization paths for generalized linear models via coordinate
descent. Journal of Statistical Software, 33(1), 2010.
[FR13] Simon Foucart and Holger Rauhut. A mathematical introduc-
tion to compressive sensing. Applied and Numerical Harmonic
Analysis. Birkhäuser/Springer, New York, 2013.
[Gro11] David Gross. Recovering low-rank matrices from few coeffi-
cients in any basis. IEEE Transactions on Information Theory,
57(3):1548–1566, 2011.
Bibliography 159
[Gru03] Branko Grunbaum. Convex polytopes, volume 221 of Graduate

Texts in Mathematics. Springer-Verlag, New York, second edi-
tion, 2003. Prepared and with a preface by Volker Kaibel, Victor
Klee and Günter M. Ziegler.
[GVL96] Gene H. Golub and Charles F. Van Loan. Matrix computa-
tions. Johns Hopkins Studies in the Mathematical Sciences.
Johns Hopkins University Press, Baltimore, MD, third edition,
1996.
[HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The
elements of statistical learning. Springer Series in Statistics.
Springer-Verlag, New York, 2001. Data mining, inference, and
prediction.
[Joh11] Iain M. Johnstone. Gaussian estimation: Sequence and wavelet
models. Unpublished Manuscript., December 2011.
[KLT11] Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsy-
bakov. Nuclear-norm penalization and optimal rates for noisy
low-rank matrix completion. Ann. Statist., 39(5):2302–2329,
2011.
[Lau96] Steffen L. Lauritzen. Graphical models, volume 17 of Oxford Sta-
tistical Science Series. The Clarendon Press, Oxford University
Press, New York, 1996. Oxford Science Publications.
[Lie73] Elliott H. Lieb. Convex trace functions and the Wigner-Yanase-
Dyson conjecture. Advances in Mathematics, 11(3):267–288,
1973.
[LPTVDG11] K. Lounici, M. Pontil, A.B. Tsybakov, and S. Van De Geer.
Oracle inequalities and optimal inference under group sparsity.
Ann. Statist., 39(4):2164–2204, 2011.
[LWo12] Po-Ling Loh, Martin J. Wainwright, and others. Structure es-
timation for discrete graphical models: Generalized covariance
matrices and their inverses. In NIPS, pages 2096–2104, 2012.
[Mal09] Stéphane Mallat. A wavelet tour of signal processing. Else-
vier/Academic Press, Amsterdam, third edition, 2009. The
sparse way, With contributions from Gabriel Peyré.
[Mar52] Harry Markowitz. Portfolio selection. The journal of finance,
7(1):77–91, 1952.
[Nem00] Arkadi Nemirovski. Topics in non-parametric statistics. In Lec-
tures on probability theory and statistics (Saint-Flour, 1998),
volume 1738 of Lecture Notes in Math., pages 85–277. Springer,
Berlin, 2000.
Bibliography 160
[Pis81] G. Pisier. Remarques sur un résultat non publié de B. Maurey.

In Seminar on Functional Analysis, 1980–1981, pages Exp. No.
V, 13. École Polytech., Palaiseau, 1981.
[Rig06] Philippe Rigollet. Adaptive density estimation using the block-
wise Stein method. Bernoulli, 12(2):351–370, 2006.
[Rig12] Philippe Rigollet. Kullback-Leibler aggregation and misspecified
generalized linear models. Ann. Statist., 40(2):639–665, 2012.
[Riv90] Theodore J. Rivlin. Chebyshev Polynomials: From Approxima-
tion Theory to Algebra and Number Theory. Wiley, July 1990.
[RT11] Philippe Rigollet and Alexandre Tsybakov. Exponential screen-
ing and optimal rates of sparse estimation. Ann. Statist.,
39(2):731–771, 2011.
[Rus02] Mary Beth Ruskai. Inequalities for quantum entropy: A review
with conditions for equality. Journal of Mathematical Physics,
43(9):4358–4375, 2002.
[RWL10] Pradeep Ravikumar, Martin J. Wainwright, and John D.
Lafferty. High-dimensional Ising model selection using `1 -
regularized logistic regression. Ann. Statist., 38(3):1287–1319,
2010.
[Sha03] Jun Shao. Mathematical statistics. Springer Texts in Statistics.
Springer-Verlag, New York, second edition, 2003.
[Tib96] Robert Tibshirani. Regression shrinkage and selection via the
lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random ma-
trices. Foundations of computational mathematics, 12(4):389–
434, 2012.
[Tsy03] Alexandre B. Tsybakov. Optimal rates of aggregation. In Bern-
hard Schölkopf and Manfred K. Warmuth, editors, COLT, vol-
ume 2777 of Lecture Notes in Computer Science, pages 303–313.
Springer, 2003.
[Tsy09] Alexandre B. Tsybakov. Introduction to nonparametric estima-
tion. Springer Series in Statistics. Springer, New York, 2009.
Revised and extended from the 2004 French original, Translated
by Vladimir Zaiats.
[Ver18] Roman Vershynin. High-Dimensional Probability. Cambridge
University Press (to appear), 2018.
[vH17] Ramon van Handel. Probability in high dimension. Lecture
Notes (Princeton University), 2017.
Bibliography 161
[VMLC16] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael

Chertkov. Interaction screening: Efficient and sample-optimal
learning of ising models. In D. D. Lee, M. Sugiyama, U. V.
Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neu-
ral Information Processing Systems 29, pages 2595–2603. Curran
Associates, Inc., 2016.

Rig Notes 17

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Rig Notes 17

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Rig Notes 17

Uploaded by

Copyright:

Available Formats

High Dimensional Statistics

(This version: November 5, 2019)

Philippe Rigollet and Jan-Christian Hütter

rather quickly, sometimes at the rate of a chapter a week. I have

Functions, sets, vectors

N (µ, σ 2 ) Univariate Gaussian distribution with mean µ ∈ IR and variance σ 2 > 0

1 Sub-Gaussian Random Variables 14

2 Linear Regression Model 40

3 Misspecified Linear Models 72

4 Minimax Lower Bounds 92

5 Matrix estimation 121

This course is mainly about learning a regression function from a collection

REGRESSION ANALYSIS AND PREDICTION RISK

Model and definitions

f (x) = IE[Y |X = x] , x∈X.

As we will see, it arises naturally in the context of prediction.

Best prediction and prediction risk

IE[Y − g(X)]2 = IE[Y − f (X) + f (X) − g(X)]2

The cross-product term satisfies

The above two equations yield

IE[Y − g(X)]2 = IE[Y − f (X)]2 + IE[f (X) − g(X)]2 ≥ IE[Y − f (X)]2 ,

with equality iff f (X) = g(X) almost surely.

IE[Y − f (X)]2 = inf IE[Y − g(X)]2 ,

where the infimum is taken over all measurable functions g : X → IR.

Prediction and estimation

R(fˆn ) = IE[Y − f (X)]2 + kfˆn − f k22

where the expectation is taken with respect to the sample Dn . They

IP kfˆn − f k22 > φn (δ) ≤ δ , ∀δ ∈ (0, 1/3) .

Here 1/3 is arbitrary and can be replaced by another positive constant.

IP kfˆn − f k22 − IEkfˆn − f k22 > t

by a quantity that decays to zero exponentially fast. Concentration of measure

Other measures of error

R(fˆn ) = IE[Y − f (X)]2 + kfˆn − f k22 .

d0 (fˆn , f ) = |fˆn (x0 ) − f (x0 )| .

• Sup-norm error. Also known as the L∞ -error and defined by

d∞ (fˆn , f ) = sup |fˆn (x) − f (x)| .

It controls the worst possible pointwise error.

The choice of p is somewhat arbitrary and mostly employed as a mathe-

R(h) = IP(Y 6= h(X)) .

We will not cover this problem in this course.

MODELS AND METHODS

Empirical risk minimization

In many instances, it corresponds to the maximum likelihood estimator of IEX.

kfˆn − f k22 = kfˆn − f¯k22 + kf¯ − f k22 ,

where f¯ is the projection of f onto the linear subspace G. The systematic

High dimension and sparsity

Figure 2. Examples of structures vectors θ ∈ IR50

correct but we will make the convenient assumption that PX is (essentially)

Note that it depends on the unknown αk and define the estimator

kfˆn − f k22 = kf¯ − f k22 + kfˆn − f¯k22

− αk )2 clearly increases with k0

to control the approximation error. Therefore oracle inequalities are more

Optimality and minimax lower bounds

1.1 GAUSSIAN TAILS AND MGF

Recall that a random variable X ∈ IR has Gaussian distribution iff it has a

Proposition 1.1. Let X be a Gaussian random variable with mean µ and

Indeed, in the case of a standard Gaussian random variable, we have

1.2 SUB-GAUSSIAN RANDOM VARIABLES AND CHERNOFF BOUNDS

Definition and first properties

f (a) ≤ g(a) for a ∈ σ(X) =⇒ f (X) g(X). (1.7)

IE[eθX ] exp((eθ − θ − 1)IE[X2 ]).