Introduction
A typical data set has the form
$$
D := \{(a_j, y_j),\ j = 1, 2, \dots, m\}, \tag{1.1}
$$
where aj is a vector (or matrix) of features and yj is an observation or label. (Each pair (aj, yj) has
the same size and shape for all j = 1, 2, . . . , m.) The analysis task then consists of discovering a
function φ such that φ(aj ) ≈ yj for most j = 1, 2, . . . , m. The process of discovering the mapping
φ is often called “learning” or “training.”
The function φ is often defined in terms of a vector or matrix of parameters, which we denote
by x or X. (Other notation also appears below.) With these parametrizations, the problem
of identifying φ becomes a data-fitting problem: “Find the parameters x defining φ such that
φ(aj ) ≈ yj , j = 1, 2, . . . , m in some optimal sense.” Once we come up with a definition of the term
“optimal,” we have an optimization problem. Many such optimization formulations have objective
functions of the “summation” type
$$
L_D(x) := \frac{1}{m} \sum_{j=1}^{m} \ell(a_j, y_j; x). \tag{1.2}
$$
Here the function ℓ represents a loss paid for not properly aligning our prediction φ(a) with y, and x is the vector of parameters that determines φ. The objective LD(x) measures the average loss accrued over the entire data set when the parameter vector is equal to x.
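To fix ideas, here is a small Python sketch (our illustration, with hypothetical data and a squared loss) that evaluates an objective of the form (1.2):

```python
import numpy as np

def empirical_loss(A, y, x, loss):
    """Average loss (1.2) over the data set for parameter vector x."""
    return np.mean([loss(a, yj, x) for a, yj in zip(A, y)])

# A hypothetical choice of loss: l(a, y; x) = 0.5 * (a^T x - y)^2.
def squared_loss(a, y, x):
    return 0.5 * (a @ x - y) ** 2

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))   # rows are the feature vectors a_j
y = A @ rng.standard_normal(5)      # labels from a hypothetical linear model
print(empirical_loss(A, y, np.zeros(5), squared_loss))
```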
One use of φ is to make predictions about future data items. Given another previously unseen
item of data â of the same type as aj , j = 1, 2, . . . , m, we predict that the label ŷ associated with â
would be φ(â). The mapping may also expose other structure and properties in the data set. For
example, it may reveal that only a small fraction of the features in aj are needed to reliably predict
the label yj . (This is known as feature selection.) The function φ or its parameter x may also
reveal important structure in the data. For example, X could reveal a low-dimensional subspace
that contains most of the aj , or X could reveal a matrix with particular structure (low-rank, sparse)
such that observations of X prompted by the feature vectors aj yield results close to yj .
Examples of labels yj include the following.
• Null. Some problems only have feature vectors aj and no labels. In this case, the data analysis
task may consist of grouping the aj into clusters (where the vectors within each cluster are
deemed to be functionally similar), or identifying a low-dimensional subspace (or a collection
of low-dimensional subspaces) that approximately contains the aj. Such problems require the labels yj to be learned alongside the function φ. For example, in a clustering problem, yj
could represent the cluster to which aj is assigned.
Even after cleaning and preparation, the setup above may contain many complications that need to be dealt with when formulating the problem in rigorous mathematical terms. The quantities (aj, yj) may
contain noise, or may be otherwise corrupted. We would like the mapping φ to be robust to such
errors. There may be missing data: parts of the vectors aj may be missing, or we may not know
all the labels yj . The data may be arriving in streaming fashion rather than being available all at
once. In this case, we would learn φ in an online fashion.
One particular consideration is that we wish to avoid overfitting the model to the data set
D in (1.1). The particular data set D available to us can often be thought of as a finite sample
drawn from some underlying larger (often infinite) collection of data, and we wish the function φ
to perform well on the unobserved data points as well as the observed subset D. In other words, we
want φ to be not too sensitive to the particular sample D that is used to define empirical objective
functions such as (1.2). The optimization formulation can be modified in various ways to achieve
this goal, by the inclusion of constraints or penalty terms that limit some measure of “complexity”
of the function (such techniques are typically called regularization). So, in sum, a generic “master
problem” that balances data fit, model complexity, and model structure is
$$
\underset{x}{\text{minimize}}\ \frac{1}{m} \sum_{j=1}^{m} \ell(a_j, y_j; x) + \lambda\,\mathrm{pen}(x) \quad \text{subject to}\ x \in \Omega. \tag{1.3}
$$
The problem seeks to minimize a cost function subject to constraints. The first term in the cost
function is the average loss over the data set. The second term is a penalty term that aims to
encourage models with low complexity. The scalar λ > 0 is called a regularization parameter; it lets the practitioner trade off between fitting the data and choosing a simple model. The set Ω in the constraints is the set of parameters that we deem to be valid solutions.
We now dive into a number of special cases of problem (1.3), showing that a wide variety of applications can be formulated in this fashion and that there may be no single, simple algorithm appropriate for solving all instances.
A fundamental instance is the least-squares problem
$$
\min_x\ \frac{1}{2m} \|Ax - y\|_2^2, \tag{1.4}
$$
where A is the matrix whose rows are a_j^T, j = 1, 2, . . . , m, and y = (y1, y2, . . . , ym)^T. In the terminology above, the function φ is defined by φ(a) := a^T x. (We could also introduce a nonzero intercept by adding an extra parameter β ∈ R and defining φ(a) := a^T x + β.) This formulation can be motivated
statistically, as a maximum-likelihood estimate of x when the observations yj are exact but for i.i.d.
Gaussian noise. We can add a variety of penalty functions to this basic least squares problem to
impose desirable structure on x and hence on φ. For example, ridge regression adds a squared ℓ2-norm penalty, resulting in
$$
\min_x\ \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_2^2, \quad \text{for some parameter } \lambda > 0.
$$
Ridge regression yields a solution x with less sensitivity to perturbations in the data (aj , yj ). The
LASSO formulation
$$
\min_x\ \frac{1}{2m} \|Ax - y\|_2^2 + \lambda \|x\|_1 \tag{1.5}
$$
tends to yield solutions x that are sparse, that is, containing relatively few nonzero components
[33]. This formulation performs feature selection: The locations of the nonzero components in x
reveal those components of aj that are instrumental in determining the observation yj . Besides
its statistical appeal — predictors that depend on few features are potentially simpler and more
comprehensible than those depending on many features — feature selection has practical appeal in
making predictions about future data. Rather than gathering all components of a new data vector
â, we need to find only the “selected” features, since only these are needed to make a prediction.
The LASSO formulation (1.5) is an important prototype for many problems in data analysis, in that it involves a regularization term λ‖x‖₁ that is nonsmooth and convex, but with relatively simple structure that can potentially be exploited by algorithms.
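One way such structure can be exploited is the proximal-gradient (ISTA) iteration, whose prox step for the ℓ1 term reduces to componentwise soft-thresholding. The Python sketch below is our illustration, not a method prescribed by the text; the step size 1/L, with L = ‖A‖₂²/m, is a standard assumption.

```python
import numpy as np

def soft_threshold(z, tau):
    """Prox operator of tau * ||.||_1: componentwise soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, y, lam, iters=500):
    """Proximal-gradient sketch for (1.5): min (1/2m)||Ax-y||^2 + lam*||x||_1."""
    m, n = A.shape
    step = m / np.linalg.norm(A, 2) ** 2      # 1/L with L = ||A||_2^2 / m
    x = np.zeros(n)
    for _ in range(iters):
        grad = A.T @ (A @ x - y) / m          # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)
    return x
```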
In other problems, we seek to recover a matrix X ∈ Rn×p from m linear observations yj ≈ ⟨Aj, X⟩ by solving
$$
\min_X\ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, X \rangle - y_j)^2, \tag{1.6}
$$
where ⟨A, B⟩ := trace(A^T B). Here we can think of the Aj as “probing” the unknown matrix X.
Commonly considered types of observations are random linear combinations (where the elements
of Aj are selected i.i.d. from some distribution) or single-element observations (in which each Aj
has 1 in a single location and zeros elsewhere). A regularized version of (1.6), leading to solutions
X that are low-rank, is
$$
\min_X\ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, X \rangle - y_j)^2 + \lambda \|X\|_*, \tag{1.7}
$$
where ‖X‖∗ is the nuclear norm, which is the sum of singular values of X [28]. The nuclear norm plays a role analogous to the ℓ1 norm in (1.5): whereas the ℓ1 norm favors sparse vectors, the nuclear norm favors low-rank matrices. Although the nuclear norm is a somewhat complex nonsmooth
function, it is at least convex, so that the formulation (1.7) is also convex. This formulation can
be shown to yield a statistically valid solution when the true X is low-rank and the observation
matrices Aj satisfy a “restricted isometry” property, commonly satisfied by random matrices, but
not by matrices with just one nonzero element. The formulation is also valid in a different context,
in which the true X is incoherent (roughly speaking, it does not have a few elements that are much
larger than the others), and the observations Aj are of single elements [10].
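To make the observation model concrete, the following sketch (our illustration) verifies that ⟨A, B⟩ = trace(A^T B) is simply the sum of elementwise products, and builds a single-element probe Aj that reads off one entry of X.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
X = rng.standard_normal((n, p))   # the unknown matrix (known here for the demo)

# <A, B> := trace(A^T B) equals the sum of elementwise products.
A = rng.standard_normal((n, p))
assert np.isclose(np.trace(A.T @ X), np.sum(A * X))

# A single-element probe: 1 in position (i, k) and zeros elsewhere,
# so that <A_j, X> observes exactly the entry X[i, k].
i, k = 2, 3
A_j = np.zeros((n, p))
A_j[i, k] = 1.0
assert np.isclose(np.trace(A_j.T @ X), X[i, k])
```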
In another form of regularization, the matrix X is represented explicitly as a product of two “thin” matrices L and R, where L ∈ Rn×r and R ∈ Rp×r, with r ≪ min(n, p). We set X = LR^T
in (1.6) and solve
$$
\min_{L,R}\ \frac{1}{2m} \sum_{j=1}^{m} (\langle A_j, LR^T \rangle - y_j)^2. \tag{1.8}
$$
In this formulation, the rank r is “hard-wired” into the definition of X, so there is no need to
include a regularizing term. This formulation is also typically much more compact than (1.7);
the total number of elements in (L, R) is (n + p)r, which is much less than np. A disadvantage
is that it is nonconvex. An active line of current research, pioneered in [9] and also drawing on
statistical sources, shows that the nonconvexity is benign in many situations, and that under certain
assumptions on the data (Aj , yj ), j = 1, 2, . . . , m and careful choice of algorithmic strategy, good
solutions can be obtained from the formulation (1.8). A clue to this good behavior is that although
this formulation is nonconvex, it is in some sense an approximation to a tractable problem: If we
have a complete observation of X, then a rank-r approximation can be found by performing a
singular value decomposition of X, and defining L and R in terms of the r leading left and right
singular vectors.
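The following sketch (our illustration) carries out exactly that construction: given a fully observed X, it forms a rank-r factorization L R^T from the r leading singular vectors.

```python
import numpy as np

def rank_r_factors(X, r):
    """Best rank-r approximation X ~ L @ R.T via the truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    L = U[:, :r] * s[:r]      # absorb the singular values into L
    R = Vt[:r, :].T
    return L, R

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # rank 3
L, R = rank_r_factors(X, 3)
print(np.linalg.norm(X - L @ R.T))   # ~1e-13: exact at the true rank
```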
Some applications in computer vision, chemometrics, and document clustering require us to find factors L and R like those in (1.8) in which all elements are nonnegative. If the full matrix Y ∈ Rn×p is observed, this problem has the form
$$
\min_{L,R}\ \frac{1}{2} \|LR^T - Y\|_F^2 \quad \text{subject to } L \ge 0,\ R \ge 0,
$$
where the inequalities are understood elementwise (this is the nonnegative matrix factorization problem).
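One simple way to handle the nonnegativity constraints, shown in the sketch below (our illustration, with a hypothetical fixed step size), is projected gradient descent: take a gradient step in each factor and clip negative entries to zero.

```python
import numpy as np

def nmf_projected_gradient(Y, r, step=1e-3, iters=2000):
    """Sketch: alternating gradient steps on L and R for
    0.5 * ||L R^T - Y||_F^2, projecting onto the nonnegative orthant."""
    rng = np.random.default_rng(3)
    n, p = Y.shape
    L = np.abs(rng.standard_normal((n, r)))
    R = np.abs(rng.standard_normal((p, r)))
    for _ in range(iters):
        E = L @ R.T - Y                          # residual
        L = np.maximum(L - step * (E @ R), 0.0)  # grad wrt L is E @ R
        R = np.maximum(R - step * (E.T @ L), 0.0)
    return L, R
```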
In (binary) classification, the labels take the values yj ∈ {−1, +1}, and we seek a vector x ∈ Rn and scalar β ∈ R such that
$$
a_j^T x - \beta \ge 1 \ \text{ when } y_j = +1, \qquad a_j^T x - \beta \le -1 \ \text{ when } y_j = -1. \tag{1.9}
$$
Any pair (x, β) that satisfies these conditions defines a separating hyperplane in Rn that separates
the “positive” cases {aj | yj = +1} from the “negative” cases {aj | yj = −1}. Among all separating hyperplanes, the one that minimizes ‖x‖₂ is the one that maximizes the margin between the two classes, that is, the hyperplane whose distance to the nearest point aj of either class is greatest.
We can formulate the problem of finding a separating hyperplane as an optimization problem
by defining an objective with the summation form (1.2):
$$
H(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max(1 - y_j(a_j^T x - \beta),\, 0). \tag{1.10}
$$
Note that the jth term in this summation is zero if the conditions (1.9) are satisfied, and positive otherwise. Even if no pair (x, β) exists for which H(x, β) = 0, a pair (x, β) that minimizes (1.10) will be the one that comes as close as possible to satisfying (1.9), in some sense. A term (λ/2)‖x‖₂² (for some parameter λ > 0) is often added to (1.10), yielding the following regularized version:
$$
H(x, \beta) = \frac{1}{m} \sum_{j=1}^{m} \max(1 - y_j(a_j^T x - \beta),\, 0) + \frac{\lambda}{2} \|x\|_2^2. \tag{1.11}
$$
In contrast to our examples thus far, the SVM problem has a nonsmooth loss function and a smooth regularizer.
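Because the hinge loss is convex but nondifferentiable at its kink, a natural approach is the subgradient method. The sketch below (our illustration; the 1/(λt) step-size schedule is a common assumption, not taken from the text) minimizes (1.11).

```python
import numpy as np

def svm_subgradient(A, y, lam, iters=1000):
    """Subgradient sketch for (1.11); rows of A are a_j, y_j in {-1, +1}."""
    m, n = A.shape
    x, beta = np.zeros(n), 0.0
    for t in range(1, iters + 1):
        margins = y * (A @ x - beta)
        active = margins < 1.0                 # terms with nonzero hinge loss
        # One subgradient of (1.11) at (x, beta):
        gx = -(A[active].T @ y[active]) / m + lam * x
        gb = np.sum(y[active]) / m
        step = 1.0 / (lam * t)
        x -= step * gx
        beta -= step * gb
    return x, beta
```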
If λ is sufficiently small, and if separating hyperplanes exist, the pair (x, β) that minimizes (1.11)
is the maximum-margin separating hyperplane. The maximum-margin property is consistent with
the goals of generalizability and robustness. For example, if the observed data (aj , yj ) is drawn
from an underlying “cloud” of positive and negative cases, the maximum-margin solution usually
does a reasonable job of separating other empirical data samples drawn from the same clouds,
whereas a hyperplane that passes close by several of the observed data points may not do as well
(see Figure 1.1).
Often it is not possible to find a hyperplane that separates the positive and negative cases
well enough to be useful as a classifier. One solution is to transform all of the raw data vectors
aj by a mapping ψ, then perform the support-vector-machine classification on the vectors ψ(aj ),
j = 1, 2, . . . , m. The conditions (1.9) would thus be replaced by
$$
\psi(a_j)^T x - \beta \ge 1 \ \text{ when } y_j = +1, \qquad \psi(a_j)^T x - \beta \le -1 \ \text{ when } y_j = -1.
$$
When transformed back to Rn, the surface {a | ψ(a)^T x − β = 0} is nonlinear and possibly disconnected, and is often a much more powerful classifier than the hyperplanes resulting from (1.11).
Figure 1.1: Linear support vector machine classification, with one class represented by circles
and the other by squares. One possible choice of separating hyperplane is shown at left. If the
observed data is an empirical sample drawn from a cloud of underlying data points, this plane does
not do well in separating the two clouds (middle). The maximum-margin separating hyperplane
does better (right).
We note that the SVM is also naturally expressed as a minimization problem over a convex set. Indeed, by introducing artificial variables, the problem (1.13) (and (1.11)) can be formulated
as a convex quadratic program, that is, a problem with a convex quadratic objective and linear
constraints. By taking the dual of this problem, we obtain another convex quadratic program, in
m variables:
$$
\min_{\alpha \in \mathbb{R}^m}\ \frac{1}{2} \alpha^T Q \alpha - \mathbf{1}^T \alpha \quad \text{subject to} \quad 0 \le \alpha \le \frac{1}{\lambda} \mathbf{1},\ \ y^T \alpha = 0, \tag{1.14}
$$
where
$$
Q_{kl} = y_k y_l\, \psi(a_k)^T \psi(a_l), \qquad y = (y_1, y_2, \dots, y_m)^T, \qquad \mathbf{1} = (1, 1, \dots, 1)^T.
$$
Interestingly, problem (1.14) can be formulated and solved without explicit knowledge or definition
of the mapping ψ. We need only a technique to define the elements of Q. This can be done with
the use of a kernel function K : Rn × Rn → R, where K(ak , al ) replaces ψ(ak )T ψ(al ) [5, 11]. This
is the so-called “kernel trick.” (The kernel function K can also be used to construct a classification
function φ from the solution of (1.14).) A particularly popular choice of kernel is the Gaussian
kernel:
$$
K(a_k, a_l) := \exp(-\|a_k - a_l\|^2 / (2\sigma)),
$$
where σ is a positive parameter.
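As a small illustration (ours, not from the text), the following sketch builds the matrix Q of (1.14) from the Gaussian kernel without ever forming the mapping ψ explicitly; that is the point of the kernel trick.

```python
import numpy as np

def gaussian_kernel(ak, al, sigma):
    """K(a_k, a_l) = exp(-||a_k - a_l||^2 / (2*sigma)), as defined above."""
    return np.exp(-np.sum((ak - al) ** 2) / (2.0 * sigma))

def build_Q(A, y, sigma):
    """Q_kl = y_k * y_l * K(a_k, a_l); rows of A are the data vectors."""
    m = A.shape[0]
    K = np.array([[gaussian_kernel(A[k], A[l], sigma) for l in range(m)]
                  for k in range(m)])
    return np.outer(y, y) * K

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 3))
y = np.where(rng.standard_normal(8) > 0, 1.0, -1.0)
Q = build_Q(A, y, sigma=1.0)   # m x m, symmetric positive semidefinite
```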
In (binary) logistic regression, the labels are yj ∈ {−1, +1}, and we model the odds of a data vector a belonging to the positive class by a function p(a; x) parametrized by x; a standard choice is
$$
p(a; x) := (1 + \exp(-a^T x))^{-1}. \tag{1.15}
$$
We seek a parameter vector x such that
$$
p(a_j; x) \approx 1 \ \text{ when } y_j = +1, \qquad p(a_j; x) \approx 0 \ \text{ when } y_j = -1. \tag{1.16}
$$
(Note the similarity to (1.9).) The optimal value of x can be found by maximizing a log-likelihood
function:
$$
L(x) := \frac{1}{m} \left[ \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right]. \tag{1.17}
$$
We can perform feature selection using this model by introducing a regularizer λkxk1 , as follows:
$$
\max_x\ \frac{1}{m} \left[ \sum_{j: y_j = -1} \log(1 - p(a_j; x)) + \sum_{j: y_j = 1} \log p(a_j; x) \right] - \lambda \|x\|_1, \tag{1.18}
$$
where λ > 0 is a regularization parameter. As we see later, this term has the effect of producing
a solution in which few components of x are nonzero, making it possible to evaluate p(a; x) by
knowing only those components of a that correspond to the nonzeros in x.
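For concreteness, here is a short sketch (our illustration, assuming the logistic model (1.15)) that evaluates the log-likelihood (1.17) and its gradient, which gradient-based algorithms require.

```python
import numpy as np

def logistic_log_likelihood(A, y, x):
    """L(x) from (1.17); rows of A are a_j, labels y_j in {-1, +1}."""
    p = 1.0 / (1.0 + np.exp(-A @ x))          # p(a_j; x) for every j
    return np.mean(np.where(y == 1, np.log(p), np.log(1.0 - p)))

def logistic_grad(A, y, x):
    """Gradient of (1.17): (1/m) sum_j (1{y_j = 1} - p(a_j; x)) a_j."""
    p = 1.0 / (1.0 + np.exp(-A @ x))
    return A.T @ ((y == 1).astype(float) - p) / len(y)
```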
An important extension of this technique is to multiclass (or multinomial) logistic regression,
in which the data vectors aj belong to more than two classes. Such applications are common in
modern data analysis. For example, in a speech recognition system, the M classes could each
represent a phoneme of speech, one of the potentially thousands of distinct elementary sounds
that can be uttered by humans in a few tens of milliseconds. A multinomial logistic regression
problem requires a distinct odds function pk for each class k ∈ {1, 2, . . . , M }. These functions are
parametrized by vectors x[k] ∈ Rn , k = 1, 2, . . . , M , defined as follows:
$$
p_k(a; X) := \frac{\exp(a^T x_{[k]})}{\sum_{l=1}^{M} \exp(a^T x_{[l]})}, \qquad k = 1, 2, \dots, M, \tag{1.19}
$$
where we define X := {x[k] | k = 1, 2, . . . , M}. Note that for all a, we have pk(a; X) ∈ (0, 1) for all k = 1, 2, . . . , M, and that ∑_{k=1}^{M} pk(a; X) = 1. The functions (1.19) are (somewhat colloquially) referred to as performing a “softmax” on the quantities {a^T x[l] | l = 1, 2, . . . , M}.
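A standard implementation detail (our addition, not from the text): subtracting the largest inner product before exponentiating leaves (1.19) unchanged but avoids numerical overflow.

```python
import numpy as np

def softmax(scores):
    """p_k of (1.19) for scores_k = a^T x_[k]; the shift by the max is
    mathematically a no-op but prevents overflow in exp."""
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / np.sum(e)

p = softmax(np.array([2.0, 1.0, -1.0, 800.0]))   # no overflow despite 800
assert np.isclose(p.sum(), 1.0)
```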
In the setting of multiclass logistic regression, the labels yj are vectors in RM, whose elements are defined as follows:
$$
y_{jk} = \begin{cases} 1 & \text{when } a_j \text{ belongs to class } k, \\ 0 & \text{otherwise.} \end{cases} \tag{1.20}
$$
Similarly to (1.16), we seek to define the vectors x[k] so that
$$
p_k(a_j; X) \approx 1 \ \text{ when } y_{jk} = 1, \qquad p_k(a_j; X) \approx 0 \ \text{ when } y_{jk} = 0. \tag{1.21}
$$
The problem of finding values of x[k] that satisfy these conditions can again be formulated as one
of maximizing a log-likelihood:
$$
L(X) := \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell}\, (x_{[\ell]}^T a_j) - \log\!\left( \sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j) \right) \right]. \tag{1.22}
$$
“Group-sparse” regularization terms can be included in this formulation to select a set of features
in the vectors aj , common to each class, that distinguish effectively between the classes.
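The following sketch (our illustration) evaluates (1.22) directly, storing the vectors x[k] as the rows of an M × n matrix and the labels (1.20) as an m × M indicator matrix.

```python
import numpy as np

def multiclass_log_likelihood(A, Y, X):
    """L(X) from (1.22). Rows of A are a_j (m x n), rows of X are x_[k]
    (M x n), and Y is the m x M indicator matrix from (1.20)."""
    S = A @ X.T                              # S[j, l] = x_[l]^T a_j
    Smax = S.max(axis=1, keepdims=True)      # stabilize the log-sum-exp
    logsumexp = Smax[:, 0] + np.log(np.exp(S - Smax).sum(axis=1))
    return np.mean((Y * S).sum(axis=1) - logsumexp)
```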
Figure 1.2: A deep neural network, showing connections between adjacent layers.
Each node in the top layer corresponds to a particular class, and the output of each node corresponds
to the odds of the input vector belonging to each class. As mentioned, the “softmax” operator is
typically used to convert the transformed input vector in the second-top layer (layer D) to a set of
odds at the top layer. Associated with each input vector aj are labels yjk , defined as in (1.20) to
indicate which of the M classes aj belongs to.
The parameters in this neural network are the matrix-vector pairs (W^l, g^l), l = 1, 2, . . . , D, that transform the input vector aj into its form a_j^D at the second-top layer, together with the parameters
X of the softmax operation that takes place at the final (top) stage, where X is defined exactly
as in the discussion of the previous section on multiclass logistic regression. We aim to choose all
these parameters so that the network does a good job of classifying the training data correctly.
Using the notation w for the layer-to-layer transformations, that is,
$$
w := (W^1, g^1, W^2, g^2, \dots, W^D, g^D),
$$
we can write the training objective as a log-likelihood of the form (1.22), with a_j^D(w) in place of aj:
$$
L(w, X) := \frac{1}{m} \sum_{j=1}^{m} \left[ \sum_{\ell=1}^{M} y_{j\ell}\, (x_{[\ell]}^T a_j^D(w)) - \log\!\left( \sum_{\ell=1}^{M} \exp(x_{[\ell]}^T a_j^D(w)) \right) \right]. \tag{1.23}
$$
Here we write a_j^D(w) to make explicit the dependence of a_j^D on the transformations w as well as on the input vector aj. (We can view multiclass logistic regression as a special case of deep learning in which there are no hidden layers, so that D = 0, w is null, and a_j^D = aj, j = 1, 2, . . . , m.)
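To make the layer structure concrete, here is a sketch of the forward pass (our illustration; the specific form a ← σ(W^l a + g^l), with σ taken to be the ReLU, is an assumption consistent with, but not spelled out in, the description above).

```python
import numpy as np

def forward(a, Ws, gs):
    """Compute a_j^D(w): apply each layer a <- sigma(W^l a + g^l), with
    sigma assumed here to be the componentwise ReLU max(0, .)."""
    for W, g in zip(Ws, gs):
        a = np.maximum(W @ a + g, 0.0)
    return a

def class_odds(a, Ws, gs, X):
    """Odds at the top layer: softmax of the scores x_[k]^T a^D, as in
    (1.19) with a^D(w) in place of a; rows of X are the vectors x_[k]."""
    aD = forward(a, Ws, gs)
    s = X @ aD
    e = np.exp(s - np.max(s))
    return e / np.sum(e)
```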
Neural networks in use for particular applications (in image recognition and speech recognition,
for example, where they have been very successful) include many variants on the basic design above.
These include restricted connectivity between layers (that is, enforcing structure on the matrices W^l, l = 1, 2, . . . , D), layer arrangements that are more complex than the linear layout illustrated
in Figure 1.2, with outputs coming from different levels, connections across non-adjacent layers,
different componentwise transformations σ at different layers, and so on. Deep neural networks for
practical applications are highly engineered objects.
The loss function (1.23) shares with many other applications the “summation” form (1.2), but
it has several features that set it apart from the other applications discussed above. First, and
possibly most important, it is nonconvex in the parameters w. There is reason to believe that the
“landscape” of L is complex, with the global minimizer being exceedingly difficult to find. Second,
the total number of parameters in (w, X) is usually very large. Effective training of deep learning
classifiers typically requires a great deal of data and computation power. Huge clusters of powerful
computers, often using multicore processors, GPUs, and even specially architected processing units,
are devoted to this task.
1.6 Emphasis
The problems that we can formulate as (1.3) are varied: the cost functions might be convex or
nonconvex, smooth or nonsmooth. But there are important features that they all share.
• They can be formulated as functions of real variables, which we typically arrange in a vector in Rn.
• The functions are continuous and often smooth. When nonsmoothness appears in the formu-
lation, it does so in a structured way that can be exploited by the algorithm. Smoothness
properties allow an algorithm to make good inferences about the behavior of the function on
the basis of knowledge gained at points that have been visited previously.
• The objective is often made up in part of a summation of many terms, where each term
depends on a single item of data.
• The objective is often a sum of two terms: a “loss term” (sometimes arising from a maximum
likelihood expression for some statistical model) and a “regularization term” whose purpose
is to impose structure and “generalizability” on the recovered model.
Our treatment emphasizes algorithms for solving the different kinds of problems discussed above,
and the convergence properties of these algorithms. We pay attention to complexity guarantees,
which are bounds on the amount of computational effort required to obtain solutions of a given
accuracy. These bounds usually depend on fundamental properties of the objective function and
the data that defines it, including the dimensions of the data set and the number of variables in
the problem. This emphasis contrasts with much of the optimization literature, in which global
convergence results do not usually involve complexity bounds. (A notable exception is the analysis of interior-point methods; see [24, 36].)
At the same time, we try as much as possible to emphasize the practical concerns associated with solving these problems. Any problem presents a variety of trade-offs, and the optimizer has to evaluate which of the tools in her belt are most appropriate to use. On top of the problem formulation, it is imperative to account for the time budget for the task at hand, the sort of computer on which the problem will be solved, and the guarantees needed for the returned solution. Worst-case complexity guarantees are only a piece of the story here, and understanding the varied implementation heuristics is critical for building reliable solvers.