mlentary
A. prologue
what is learning?

← By the end of this section, you’ll be able to
← • recognize whether a learning task fits the paradigm of learning from examples and whether it’s supervised or unsupervised.
← • identify within a completed learning-from-examples project: the training inputs (outputs), testing inputs (outputs), hypothesis class, learned hypothesis; and describe which parts depend on which.

kinds of learning — How do we communicate patterns of desired behavior? We can teach:
    by instruction: “to tell whether a mushroom is poisonous, first look at its gills...”
    by example: “here are six poisonous fungi; here, six safe ones. see a pattern?”
    by reinforcement: “eat foraged mushrooms for a month; learn from getting sick.”
Machine learning is the art of programming computers to learn from such sources. We’ll focus on the most important case: learning from examples.◦
← Food For Thought: What’s something you’ve learned by instruction? By example? By reinforcement? In unit 5 we’ll see that learning by example unlocks the other modes of learning.

from examples to predictions — For us, a pattern of desired behavior is a function that for each given situation/prompt returns a favorable action/answer. We seek a program that, from a list of examples of prompts and matching answers, determines an underlying pattern. Our program is a success if this pattern accurately predicts answers for new, unseen prompts. We often define our program as a search, over some class H of candidate patterns (jargon: hypotheses), to maximize some notion of “intrinsic-plausibility plus goodness-of-fit-to-the-examples”.
supervised learning — We’ll soon allow uncertainty by letting patterns map prompts to
distributions over answers. Even if there is only one prompt — say, “produce
a beautiful melody” — we may seek to learn the complicated distribution over
answers, e.g. to generate a diversity of apt answers. Such unsupervised learning
concerns output structure. By contrast, supervised learning (our main subject),
concerns the input-output relation; it’s interesting when there are many possible
prompts. Both involve learning from examples; the distinction is no more firm
than that between sandwiches and hotdogs, but the words are good to know.
learning as... — machine learning is like science. machine learning is like automatic programming. machine learning is like curve-fitting.
← three classic threads of AI
a tiny example: classifying handwritten digits

← By the end of this section, you’ll be able to
← • write a (simple and inefficient) image-classifying ML program
← • visualize data as lying in feature space; visualize hypotheses as functions defined on feature space; and visualize the class of all hypotheses within weight space

meeting the data — Say we want to classify handwritten digits. In symbols: we’ll map X to Y with X = {grayscale 28 × 28-pixel images}, Y = {1, 3}. Each datum (x, y) arises as follows: we randomly choose a digit y ∈ Y, ask a human to write that digit in pen, and then photograph their writing to produce x ∈ X.
When we zoom in, we can see each photo’s 28 × 28 grid of pixels. On the
computer, this data is stored as a 28 × 28 grid of numbers: 0.0 for bright through
1.0 for dark. We’ll name these 28 × 28 grid locations by their row number (count-
ing from the top) followed by their column number (counting from the left). So
location (0, 0) is the upper left corner pixel; (27, 0), the lower left corner pixel.
Food For Thought: Where is location (0, 27)? Which way is (14, 14) off-center?
To get to know the data, let’s wonder how we’d hand-code a classifier (worry
not: soon we’ll do this more automatically). We want to complete the code
def hand_coded_predict(x):
return 3 if condition(x) else 1
Well, 3s tend to have more ink than 1s — should condition threshold by the photo’s brightness? Or: 1s and 3s tend to have different widths — should condition threshold by the photo’s dark part’s width?
To make this precise, let’s define a photo’s brightness as 1.0 minus its average pixel value (recall: pixel values run from 0.0 for bright to 1.0 for dark); its width as the standard deviation of the column index of its dark pixels. Such functions from inputs in X to numbers are called features.
import numpy as np

SIDE = 28
def brightness(x): return 1. - np.mean(x)
def width(x): return np.std([col for col in range(SIDE)
                                 for row in range(SIDE)
                                 if 0.5 < x[row][col] ])/(SIDE/2.0)
# (we normalized width by SIDE/2.0 so that it lies within [0., 1.])
candidate patterns — We can generalize the hand-coded hypothesis from the previous passage to other coefficients besides −1 · brightness(x) + 4 · width(x). We let our set H of candidate patterns contain all “linear hypotheses” f_{a,b}, where f_{a,b}(x) predicts 3 when a · brightness(x) + b · width(x) is positive and 1 otherwise.
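The search code below calls a helper predict that isn’t defined in this excerpt; here’s a minimal sketch consistent with the readout used later (3 exactly when the weighted vote is positive):

def predict(x,a,b):
    threeness = a*brightness(x) + b*width(x)   # the hypothesis' weighted vote of the two features
    return 3 if threeness > 0. else 1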
The brightness-width plane is called feature space: its points represent inputs x in terms of chosen features (here, brightness and width). The (a, b) plane is called weight space: its points represent linear hypotheses h in terms of the coefficients — or weights — h places on each feature (e.g. a = −1 on brightness and b = +4 on width).
← Figure 4: Hypotheses differ in training accuracy: feature space. 3 hypotheses classify training data in the brightness-width plane (axes range [0, 1.0]). Glowing colors distinguish a hypothesis’ 1 and 3 sides. For instance, the bottom-most line classifies all the training points as 3s.
Food For Thought: Which of Fig. 4’s 3 hypotheses best predicts training data?
Food For Thought: What (a, b) pairs might have produced Fig. 4’s 3 hypotheses? Can you determine (a, b) for sure, or is there ambiguity (i.e., can multiple (a, b) pairs make exactly the same predictions in brightness-width space)?
optimization — Let’s write a program to automatically find hypothesis h = (a, b) from the training data. We want to predict the labels y of yet-unseen photos x (testing examples); insofar as training data is representative of testing data, it’s sensible to return an h ∈ H that correctly classifies maximally many training examples. To do this, let’s just loop over a bunch of (a, b) pairs — say, all integer pairs in [−99, +99] — and pick one that misclassifies the fewest training examples:
def is_correct(x,y,a,b):
    return 1.0 if predict(x,a,b)==y else 0.0
def accuracy_on(examples,a,b):
    return np.mean([is_correct(x,y,a,b) for x,y in examples])
def best_hypothesis():
    # returns a pair (accuracy, hypothesis)
    return max((accuracy_on(training_data, a, b), (a,b))
               for a in np.arange(-99,+100)
               for b in np.arange(-99,+100) )
Fed our N = 20 training examples, the loop finds (a, b) = (−20, +83) as a minimizer of training error, i.e., of the fraction of training examples misclassified. It misclassifies only 10% of training examples. Yet the same hypothesis misclassifies a greater fraction — 17% — of fresh, yet-unseen testing examples. That latter number — called the testing error — represents our program’s accuracy “in the wild”; it’s the number we most care about.
The difference between training and testing error is the difference between our score on our second try on a practice exam (after we’ve reviewed our mistakes) versus our score on a real exam (where we don’t know the questions beforehand and aren’t allowed to change our answers once we get our grades back).
← Figure 5: Hypotheses differ in training accuracy: weight space. We visualize H as the (a, b)-plane (axes range [−99, +99]). Each point determines a whole line in the brightness-width plane. Shading shows training error: darker points misclassify more training examples. The least shaded, most training-accurate hypothesis is (−20, 83): the rightmost of the 3 blue squares. The orange square is the hypothesis that best fits our unseen testing data.
← Food For Thought: Suppose Fig. 4’s 3 hypotheses arose from the 3 blue squares shown here. Which hypothesis arose from which square?
← Caution: the colors in the two Figures on this page represent unrelated distinctions!
Food For Thought: In the (a, b) plane shaded by training error, we see two ‘cones’,
one dark and one light. They lie geometrically opposite to each other — why?
Food For Thought: Sketch f_{a,b}’s error on N = 1 example as a function of (a, b).
how well did we do? analyzing our error

← By the end of this section, you’ll be able to
← • automatically compute training and testing misclassification errors and describe their conceptual difference.
← • explain how the problem of achieving low testing error decomposes into the three problems of achieving low generalization, optimization, and approximation errors.

error analysis — Intuitively, our testing error of 17% comes from three sources: (a) the failure of our training set to be representative of our testing set; (b) the failure of our program to exactly minimize training error over H; and (c) the failure of our hypothesis set H to contain “the true” pattern.
These are respectively errors of generalization, optimization, approximation.
We can see generalization error when we plot testing data in the brightness-width plane: the hypothesis h = (−20, +83) that we selected based on the training data misclassifies many testing points. Whereas h misclassifies only 10% of the training data, it misclassifies 17% of the testing data. This illustrates generalization error.
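As a minimal sketch, reusing best_hypothesis and accuracy_on from before and assuming a held-out list testing_data of (x, y) pairs, both numbers can be computed automatically:

trn_acc, (a, b) = best_hypothesis()
training_error = 1.0 - trn_acc                           # here, about 0.10
testing_error  = 1.0 - accuracy_on(testing_data, a, b)   # here, about 0.17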
In our plot of the (a, b) plane, the blue square is the hypothesis h (in H) that best fits the training data. The orange square is the hypothesis (in H) that best fits the testing data. But even the latter seems suboptimal, since H only includes lines through the origin while it seems we want a line — or curve — that hits higher up on the brightness axis. This illustrates approximation error.◦
← To define approximation error, we need to specify whether the ‘truth’ we want to approximate is the training or the testing data. Either way we get a useful concept. In this paragraph we’re talking about approximating testing data; but in our notes overall we’ll focus on the concept of error in approximating training data.
Optimization error is best seen by plotting training rather than testing data. It measures the failure of our selected hypothesis h to minimize training error — i.e., the failure of the blue square to lie at a least-shaded point in the (a, b) plane, when we shade according to training error.
formalism — Here’s how we can describe learning and our error decomposition in symbols. ← VERY OPTIONAL PASSAGE
Draw training examples S : (X × Y)^N from nature’s distribution D on X × Y. A hypothesis f : X → Y has training error trn_S(f) = P_{(x,y)∼S}[f(x) ≠ y], an average over examples; and testing error tst(f) = P_{(x,y)∼D}[f(x) ≠ y], an average over nature. A learning program is a function L : (X × Y)^N → (X → Y); we want to design L so that it maps typical Ss to fs with low tst(f).
So we often define L to roughly minimize trn_S over a set H ⊆ (X → Y) of candidate patterns. Then tst decomposes into the failures of trn_S to estimate tst (generalization), of L to minimize trn_S (optimization), and of H to contain nature’s truth (approximation):
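In symbols — a sketch of one standard way to write that decomposition, with f = L(S) and the approximation term measured against training data (per the margin note above):

    tst(f) = [tst(f) − trn_S(f)] + [trn_S(f) − min_{h∈H} trn_S(h)] + [min_{h∈H} trn_S(h)]
               generalization        optimization                     approximation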
These terms are in tension. For example, as H grows, the approx. error may
decrease while the gen. error may increase — this is the “bias-variance tradeoff”.
The various hypotheses differ only in those coefficients (jargon: weights, here
−1, +4) for their linear combinations; it is these degrees of freedom that the
machine learns from data. We can diagram such hypotheses by drawing arrows:◦ ← PICTURE OF PARADIGM?
        featurize                 linearly combine              read out
   X −−−−−−−−−−−−−→  R²  −−−−−−−−−−−−−−−−−−−→  R¹  −−−−−−−−−−−−−−→  Y
      not learned                   learned!                   not learned

   read out: 3 if threeness[0] > 0. else 1
In the next three passages we address the key question: how do we compute the
weights by which we’ll compute threeness or bovinity from our features?
how good is a hypothesis? fit — We instruct our machine to find within our menu H a
hypothesis that’s as “good” as possible. That is, the hypothesis should both fit
our training data well and seem intrinsically plausible. Now we’ll quantify these
notions of goodness-of-fit and intrinsic-plausibility. As with H, the exact way
we quantify these notions is an engineering art informed by domain knowledge.
Still, there are patterns and principles — we will examine two specific quantitative notions, the perceptron loss and the SVM loss, to illustrate these principles. Later, once we understand these notions as quantifying uncertainty (i.e., as probabilistic notions), we’ll appreciate their logic. But for now we will bravely adventure forth, ad hoc!
We’ll start with goodness-of-fit. The various hypotheses correspond◦ to choices of weights. For example, the weight vector (−1, +4) determines the hypothesis listed above.
← A very careful reader might ask: can’t multiple choices of weights determine the same hypothesis? E.g. (−1, +4) and any positive rescaling of it classify every input the same way, since they either both make threeness positive or both make threeness negative. This is a very good point, dear reader, but at this stage in the course, much too pedantic! Ask again later.
By historical convention we actually like◦ to minimize badness (jargon: loss) rather than maximize goodness. So we’ll rewrite the above in terms of mistakes:
← ML is sometimes a glass half-empty kind of subject!

def leeway_before_mistake(x,y,a,b):
    threeness = a*brightness(x) + b*width(x)
    return +threeness if y==3 else -threeness
def is_mistake(x,y,a,b):
    return 0. if leeway_before_mistake(x,y,a,b)>0. else 1.
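The linear_loss mentioned below isn’t defined in this excerpt; a minimal sketch, assuming it is simply one minus the leeway (so that the floored versions below recover it):

def linear_loss(x,y,a,b):
    # very positive leeway -> very negative loss; very negative leeway -> very positive loss
    return 1. - leeway_before_mistake(x,y,a,b)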
Notice that if the leeway is very positive, then the badness is very negative (i.e.,
we are in a very good situation), and vice versa.
But, continuing the theme of pessimism, we usually feel that a “very safely classified” point (very positive leeway) shouldn’t make up for a bunch of “slightly misclassified” points (slightly negative leeway). That is, we’d rather have leeways +.1, +.1, +.1, +.1 than +10, −1, −1, −1 on four training examples.◦ But total linear loss incentivizes the opposite choice: it doesn’t capture our sense that a very positive leeway feels mildly pleasant while a very negative one feels urgently alarming. To model this sense quantitatively, we can impose a floor on linear_loss so that it can’t get too negative.◦ We get perceptron loss if we set a floor of 1; SVM loss (also known as hinge loss) if we set a floor of 0:
← Food For Thought: compute and compare the training accuracies in these two situations. As an open-ended followup, suggest reasons why considering training leeways instead of just accuracies might help improve testing accuracy.
← To impose a floor on linear_loss is to impose a ceiling on how much we care about leeway: arbitrarily positive leeway doesn’t count arbitrarily much. It’s good to get used to this mental flipping between maximizing goodness and minimizing loss.

def perceptron_loss(x,y,a,b):
    return max(1, 1 - leeway_before_mistake(x,y,a,b))
def svm_loss(x,y,a,b):
    return max(0, 1 - leeway_before_mistake(x,y,a,b))
Food For Thought: For incentives to point the right way, loss should decrease as
threeness increases when y==3 but should increase as threeness increases when
y==1. Verify these relations for the loss functions above.
2 PICTURES OF LOSS FUNCTIONS (PLOT vs dec; and PLOT in feature space); illus-
trate width dependence!! Weight space?
how good is a hypothesis? plausibility — Now to define intrinsic plausibility, also known as a regularizer term. There are many intuitions and aspects of domain knowledge we’d in practice want to capture here — symmetry comes to mind — but we’ll focus for now on capturing this particular intuition: a hypothesis that depends very much on very many features is less plausible.
That is, we find a hypothesis more plausible when its “total amount of dependence” on the features is small. One convenient way to quantify this is as proportional to a sum of squared weights:◦ if weights (a, b) give rise to hypothesis h, then implausibility of h = λ(a² + b² + · · · ). In code:
← We could just as well use 6.86a² + b² instead of a² + b². Food For Thought: When (a, b) represent weights for brightness-width digit features, how do the hypotheses with small 6.86a² + b² visually differ from those with small a² + b²?

LAMBDA = 1.
def implausibility(a,b):
    return LAMBDA * np.sum(np.square([a,b]))
Intuitively, the constant λ=LAMBDA tells us how much we care about plausibility
relative to goodness-of-fit-to-data.
Here’s what the formula means, metaphorically. Imagine three of our friends
each have a theory about which birds sing:
AJ says it all has to do with wingspan. A bird with wings shorter than 20cm can’t fly far enough to obviate the need for singing: such a bird is sure to sing. Conversely, birds with longer wings never sing.
Pat says to check whether the bird grows red feathers, eats shrimp, lives near
ice, wakes in the night, and has a bill. If, of these 5 qualities, an even number
are true, then the bird probably sings. Otherwise, it probably doesn’t.
Sandy says it has to do with wingspan and nocturnality: shorter wings and
nocturnality both make a bird moderately more likely to sing.
Which hypothesis do we prefer? Well, AJ seems way too confident. Maybe
they’re right that wingspan matters, but it seems implausible that wingspan is
so decisive. Pat, meanwhile, doesn’t make black-and-white claims, but Pat’s pre-
dictions depend substantively on many features: flipping any one quality flips
their prediction. This, too, seems implausible. By contrast, Sandy’s hypothesis
doesn’t depend too strongly on too many features. To me, a bird non-expert,
Sandy’s seems most plausible.
Now we can define the overall undesirability of a hypothesis:◦
← Here we use SVM loss but you can plug in whatever loss you want! Different losses will give different learning behaviors with pros and cons.

def objective_function(examples,a,b):
    data_term = np.sum([svm_loss(x,y,a,b) for x,y in examples])
    regularizer = implausibility(a, b)
    return data_term + regularizer
To build intuition about which hypotheses are most desirable according to that
metric, let’s suppose λ is a tiny positive number. Then minimizing the objective
function is the same as minimizing the data term, here the total SVM loss — our
notion of implausibility only becomes important as a tiebreaker.
Now, which way does it break ties? Imagine there are multiple 100% accuracy
hypotheses, e.g. some slightly nudged versions of the black line in the Figure.
Since the black hypothesis depends only on an input’s vertical coordinate (say,
this is brightness in the brightness-width plane), it arises from weights of the
form (a, b) = (a, 0).
Some (a, 0) pairs will have higher SVM loss than others. For example, if a is
nearly 0, then each datapoint will have leeway close to 0 and thus SVM loss close
to 1; conversely, if a is huge, then each datapoint will have leeway very positive
and thus SVM loss equal to the imposed floor of 0. We see that SVM loss is 0 so
long as a is big enough for each leeway to exceed 1.
But leeway equals a times brightness — or more generally weights times fea-
tures — so for leeway to exceed 1 and a to be small, brightness must be large.
Now, here’s the subtle, crucial geometry. Our decision boundary (the black
line) shows where leeway is zero. Moreover, since for a blue point leeway=
a · brightness(x),
Interpreting leeway as a measure of confidence.
The various hypotheses differ only in those coefficients (here −1, +4) for their
linear combinations; it is these degrees of freedom that the machine learns from
data. We can diagram such hypotheses by drawing arrows:◦ ← PICTURE OF PARADIGM?
If before we’d have predicted “the label is 3”, we now predict “the label is 3 with 80% chance and 1 with 20% chance”. This hard-coded 80% could suffice.◦ But let’s do better: intuitively, a 3 is more likely when threeness is huge than when threeness is nearly zero. So let’s replace that 80% by some smooth function of threeness. A popular, theoretically warranted choice is σ(z) = 1/(1 + exp(−z)):◦
← As always, it depends on what specific thing we’re trying to do!
← σ, the logistic or sigmoid function, has linear log-odds: σ(z)/(1−σ(z)) = exp(z). It tends exponentially to the step function. It’s symmetrical: σ(−z) = 1−σ(z). Its derivative concentrates near zero: σ′(z) = σ(z)σ(−z).
← Food For Thought: Plot σ(z) by hand.

sigma = lambda z : 1./(1.+np.exp(-z))
prediction = {3 : sigma(threeness[0]),
              1 : 1.-sigma(threeness[0]) }
Given training inputs x_i, a hypothesis will have “hunches” about the training outputs y_i. Three hypotheses h_three!, h_three, and h_one might, respectively, confidently assert y_42 = 3; merely lean toward y_42 = 3; and think y_42 = 1. If in reality y_42 = 1 then we’d say h_one did a good job, h_three a bad job, and h_three! a very bad job on the 42nd example. So the training set “surprises” different hypotheses to different degrees. We may seek a hypothesis h⋆ that is minimally surprised, i.e., usually confidently right and when wrong not confidently so. In short, by outputting probabilities instead of mere labels, we’ve earned this awesome upshot:
the machine can automatically calibrate its confidence levels!◦
← It’s easy to imagine applications in language, self-driving, etc.
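As a minimal sketch of this notion of surprise (reusing the prediction-style dict of probabilities above; the name surprise is ours), a hypothesis’ surprise on one example is the negative log of the probability it assigned to the true label:

def surprise(prediction, y):
    # small when the true label y got high probability; large when it got low probability
    return -np.log(prediction[y])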
Now, what does this all mean? What does it mean for “dogness vs cowness” to
vary “linearly”?
Confidence on mnist example! (2 pictures, left and right: hypotheses and (a,b plane)
A featurization is good or bad only with respect to the kinds of patterns we use it to learn: it should make task-relevant information easy for the machine to use (e.g. through apt nonlinearities) and throw away task-irrelevant information (e.g. by turning 784 pixel brightnesses into 2 meaningful numbers).
Here are two themes in the engineering art of featurization.◦
← For now, we imagine hand-coding our features rather than adapting them to training data. We’ll later discuss adapted features; simple examples include thresholding into quantiles based on sorted training data (Is x more than the median training point?), and choosing coordinate transforms that measure similarity to landmarks (How far is x from each of these 5 “representative” training points?). Deep learning is a fancy example.

Predicates. If domain knowledge suggests some subset S ⊆ X is salient, then we can define the feature

    x ↦ 1 if x lies in S else 0

The most important case helps us featurize categorical attributes (e.g. kind-of-chess-piece, biological sex, or letter-of-the-alphabet): if an attribute takes K possible values, then each value induces a subset of X and thus a feature. These features assemble into a map X → R^K. This one-hot encoding is simple, powerful, and common. Likewise, if some attribute is ordered (e.g. X contains geological strata) then interesting predicates may include thresholds.
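A minimal sketch of one-hot encoding (the function name is ours):

def one_hot(value, possible_values):
    # one feature per possible value: 1.0 for the value we observed, 0.0 for the rest
    return [1.0 if value == v else 0.0 for v in possible_values]

one_hot('knight', ['pawn','knight','bishop','rook','queen','king'])
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]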
Coordinate transforms. Applying our favorite highschool math functions gives new features tanh(x[0]) − x[1], |x[1]x[0]| exp(−x[2]²), · · · from old features x[0], x[1], · · · . We choose these functions based on domain knowledge; e.g. if x[0], x[1] represent two spatial positions, then the distance |x[0] − x[1]| may be a useful feature. One systematic way to include nonlinearities is to include all the monomials (such as x[0]x[1]²) with not too many factors — then linear combinations are polynomials. The most important nonlinear coordinate transform uses all monomial features with 0 or 1 factors — said plainly, this maps

    x ↦ (1, x)
This is the bias trick. Intuitively, it allows the machine to learn the threshold above which three-ishness implies a three.
← Figure 9: The bias trick helps us model ‘offset’ decision boundaries. Here, the origin is the lower right corner closer to the camera. Our raw inputs x = (x[0], x[1]) are 2-dimensional; we can imagine them sitting on the bottom face of the plot (bottom ends of the vertical stems). But, within that face, no line through the origin separates the data well. By contrast, when we use a featurization (1, x[0], x[1]), our data lies on the top face of the plot; now a plane through the origin (shown) successfully separates the data.
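In code, the bias trick is just prepending a constant feature (a minimal sketch; the function name is ours):

def with_bias(features):
    # a learned weight on the constant 1 acts as a threshold / offset term
    return np.concatenate([[1.0], features])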
humble models — Let’s modify logistic classification to allow for unknown unknowns. We’ll do this by allowing a classifier to allot probability mass not only among labels in Y but also to a special class ? that means “no comment” or “alien input”. A logistic classifier always sets p_{y|x}[?|x] = 0, but other probability models may put nonzero mass on “no comment”. For example, consider:
                      logistic             perceptron           svm
  p_{y|x}[+1|x]       ⊕/(⊖+⊕)              ⊕·(⊖∧⊕)/2            ⊕·(⊖∧⊕/e)/2
  p_{y|x}[−1|x]       ⊖/(⊖+⊕)              ⊖·(⊖∧⊕)/2            ⊖·(⊖/e∧⊕)/2
  p_{y|x}[ ? |x]      1 − above = 0        1 − above            1 − above
  outliers            responsive           robust               robust
  inliers             sensitive            blind                sensitive
  acc bnd             good                 bad                  good
  loss name           softplus(·)          srelu(·)             hinge(·)
  formula             log₂(1 + e^(·))      max(0, ·) + 1        max(0, · + 1)
  update              1/(1 + e^{+yd})      step(−yd)            step(1 − yd)

← Table 1: Three popular models for binary classification. Top rows: Modeled chance given x that y = +1, −1, ?. We use d = w · x, ⊕ = e^{+d/2}, ⊖ = e^{−d/2}, a ∧ b = min(a, b) to save ink. Middle rows: All models respond to misclassifications. But are they robust to well-classified outliers? Sensitive to well-classified inliers? Bottom rows: For optimization, which we’ll discuss later, we list (negative log-probability) losses. An SGD step looks like w_{t+1} = w_t + η · update · y·x.
MLE with the perceptron model or svm model minimizes the same thing, but
with srelu(z) = max(0, z) + 1 or hinge(z) = max(0, z + 1) instead of softplus(z).
graphs of prob
Two essential properties of softplus are that: (a) it is convex◦ and (b) it upper bounds the step function. Note that srelu and hinge also enjoy these properties. Property (a) ensures that the optimization problem is relatively easy — under mild conditions, gradient descent will find a global minimum. By property (b), the total loss on a training set upper bounds the rate of erroneous classification on that training set. So loss is a surrogate for (in)accuracy: if the minimized loss is nearly zero, then the training accuracy is nearly 100%.◦
← A function is convex when its graph is bowl-shaped rather than wriggly. It’s easy to minimize convex functions by ‘rolling downhill’, since we’ll never get stuck in a local wriggle. Don’t worry about remembering or understanding this word.
← The perceptron satisfies (b) in a trivial way that yields a vacuous bound of 100% on the error rate.
So we have a family of related models: logistic, perceptron, and SVM. In Project 1 we’ll find hypotheses optimal with respect to the perceptron and SVM models (the latter under a historical name of pegasos), but soon we’ll focus mainly on logistic models, since they fit best with deep learning.
DEFINE NOTION OF LOSS!
richer outputs: multiple classes — We’ve explored hypotheses f_W(x) = readout(W · featurize(x)), where W represents the linear-combination step we tune to data.
We began with hard binary classification, wherein we map inputs to definite
labels (say, y = cow or y = dog):
I hope some of you felt bugged by the above formulas’ asymmetry: W mea-
sures “cow-ishness minus dog-ishness” — why not the other way around? Let’s
describe the same set of hypotheses but in a more symmetrical way. A common
theme in mathematical problem solving is to trade irredundancy for symmetry
(or vice versa). So let’s posit both a W_cow and a W_dog. One measures “cow-ishness”; the other, “dog-ishness”. They assemble to give W, which is now a matrix of shape 2 × number-of-features. So d is now a list of 2 numbers: d_cow and d_dog. Now d_cow − d_dog plays the role that d used to play.
Then we can do hard classification by:

    readout(d) = argmax_y d_y

To make probabilities add to one, we divide by Z_d = Σ_y exp(d_y).
Behold! By rewriting our soft and hard hypotheses for binary classification,
we’ve found formulas that also make sense for more than two classes! The above
readout for soft multi-class classification is called softmax.
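A minimal sketch of that soft readout (the shift by the max is our own numerical-stability touch; it doesn’t change the output):

def softmax_readout(d):
    # d is a list of per-class scores, e.g. [d_cow, d_dog, ...]
    exp_d = np.exp(np.array(d) - np.max(d))
    return exp_d / np.sum(exp_d)    # dividing by Z_d makes the probabilities sum to one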
softmax plot
richer outputs: beyond classification — By the way, if we’re trying to predict a real-
valued output instead of a binary label — this is called hard one-output regres-
sion — we can simply return d itself as our readout:
readout(d) = d
← VERY OPTIONAL PASSAGE
This is far from the only choice! For example, if we know that the true ys
will always be positive, then readout(d) = exp(d) may make more sense. I’ve
encountered a learning task (about alternating current in power lines) where
what domain knowledge suggested — and what ended up working best — were
trigonometric functions for featurization and readout! There are also many ways
to return a distribution instead of a number. One way to do such soft one-output
regression is to use normal distributions:◦
← Ask on Piazza about the interesting world of alternatives and how they influence learning!
readout(d) = “normal distribution with mean d and variance 25”
Earlier we found a nearly best candidate by brute-force search over all hy-
potheses. But this doesn’t scale to most interesting cases wherein H is intractably
large. So: what’s a faster algorithm to find a nearly best candidate?
A common idea is to start arbitrarily with some h_0 ∈ H and repeatedly improve to get h_1, h_2, · · · . We eventually stop, say at h_10000. The key question is:◦ how do we compute an improved hypothesis h_{t+1} from our current hypothesis h_t?
← Also important are the questions of where to start and when to stop. But have patience! We’ll discuss these later.
We could just keep randomly nudging h_t until we hit on an improvement; then we define h_{t+1} as that improvement. Though this sometimes works surprisingly well,◦ we can often save time by exploiting more available information. Specifically, we can inspect h_t’s inadequacies to inform our proposal h_{t+1}. Intuitively, we can repeatedly nudge via gradient descent (GD), the engine of ML:◦
← If you’re curious, search ‘metropolis hastings’ and ‘probabilistic programming’.
← Food For Thought: Can GD directly minimize misclassification rate?
h = initialize()
for t in range(10000):
    h = h - 0.01 * gradient_badness_on_dataset(h, examples)

Since the derivative of total badness depends on all the training data, looping 10000 times is expensive. So in practice we estimate the needed derivative based on some subset (jargon: batch) of the training data — a different subset each pass through the loop — in what’s called stochastic gradient descent (SGD):
← cartoon of GD

h = initialize()
for t in range(10000):
    batch = select_subset_of(examples)
    h = h - 0.01 * gradient_badness(h, batch)
maximum likelihood estimation — When we can compute each hypothesis h’s asserted
probability that the training ys match the training xs, it seems reasonable to seek
an h for which this probability is maximal. This method is maximum likelihood
estimation (MLE). It’s convenient for the overall goodness to be a sum (or aver-
age) over each training example. But independent chances multiply rather than
add: rolling snake-eyes has chance 1/6 · 1/6, not 1/6 + 1/6. So we prefer to think
about maximizing log-probabilities instead of maximizing probabilities — it’s the
same in the end.◦ By historical convention we like to minimize badness rather than maximize goodness, so we’ll use SGD to minimize negative-log-probabilities.
← Throughout this course we make a crucial assumption that our training examples are independent from each other.
def badness(h,y,x):
return -np.log( probability_model(y,x,h) )
Let’s see this in action for the linear logistic model we developed for soft binary classification. A hypothesis w predicts that a (featurized) input x has label y = +1 or y = −1 with chance σ(+w · x) or σ(−w · x):
The key computation is the derivative of those badness terms:◦
← Remember that σ′(z) = σ(z)σ(−z). To reduce clutter we’ll temporarily write y · w · x as ywx.

    ∂(−log σ(ywx)) / ∂w = −σ(ywx) σ(−ywx) y x / σ(ywx) = −σ(−ywx) y x
Food For Thought: If you’re like me, you might’ve zoned out by now. But this stuff
is important, especially for deep learning! So please graph the above expressions
to convince yourself that our formula for derivative makes sense visually.
To summarize, we’ve found the loss gradient for the logistic model:
sigma = lambda z : 1./(1+np.exp(-z))
def badness(w,y,x): return -np.log( sigma(y*w.dot(x)) )
def gradient_badness(w,y,x): return -sigma(-y*w.dot(x)) * y*x
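Putting the pieces together — a minimal sketch of SGD on the logistic loss, where examples is an assumed list of (featurized x, y) pairs with y in {+1, −1}, DIM is the assumed number of features, and the batch size 16 and step size 0.01 are illustrative:

import random
w = np.zeros(DIM)                                  # DIM: assumed number of features
for t in range(10000):
    batch = random.sample(examples, 16)            # a small random batch
    grad = np.mean([gradient_badness(w,y,x) for x,y in batch], axis=0)
    w = w - 0.01 * grad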
priors and generalization

    A child’s education should begin at least 100 years before [they are] born.
    — oliver wendell holmes jr

on overfitting — In the Bayesian framework, we optimistically assume that our model is “correct” and that the “true” posterior over parameters is a normalized product of likelihood and prior. Therefore, our optimal guess for a new prediction is:

    p(y⋆ ; x⋆, x) = Σ_w p(y⋆ | w; x⋆) p(y | w; x) p(w) / Z_x
Then

    p(y⋆ ; x⋆, x) ≈ p(y⋆ | w⋆(y, x); x⋆)
What do we lose in this approximation?
E and max do not commute
PICTURE: “bowtie” vs “pipe”
BIC for relevant variables
log priors and bayes — fill in computation and bases; visual illustration of how choice of L2 dot product matters; ℓp regularization; sparsity; eye regularization example! TODO: rank as a prior (for multioutput models)
For 1 ≤ p ≤ ∞ and 1 ≤ q ≤ ∞ we can consider this prior:◦
← To define the cases p = ∞ or q = ∞ we take limits. To define the case p = ∞ = q we take limits while maintaining p = q.

    p(w) ∝ exp( − ( Σ_i |λ w_i|^p )^{q/p} )
For q = p this decomposes as a sum and thus has each coordinate independent.
Whereas q limits outliers, p controls the shape of level curves. Small q makes
large vectors more probable and small p makes it more probable that the entries
within a vector will be of very different sizes.
hierarchy, mixtures, transfer — k-fold cross validation; bayesian information criterion
estimating generalization — k-fold cross validation; dimension-based generalization bound; bayesian information criterion
model selection
← figure: the cases p = 2 and p = ∞
C. nonlinearities

0. fixed featurizations

    Doing ensembles and shows is one thing, but being able to front a feature is totally different. ... there’s something about ... a feature that’s unique.
    — michael b. jordan

sketching —
sensitivity analysis —
double descent —
6. kernels enrich approximations

    ... animals are divided into (a) those belonging to the emperor; (b) embalmed ones; (c) trained ones; (d) suckling pigs; (e) mermaids; (f) fabled ones; (g) stray dogs; (h) those included in this classification; (i) those that tremble as if they were mad; (j) innumerable ones; (k) those drawn with a very fine camel hair brush; (l) et cetera; (m) those that have just broken the vase; and (n) those that from afar look like flies.
    — jorge luis borges

features as pre-processing —
abstracting to dot products —
kernelized perceptron and svm —
kernelized logistic regression —
learned featurizations —
1. learned featurizations
imagining the space of feature tuples — We’ll focus on an architecture of the form
where A, B are linear maps with the specified (input × output) dimensions,
where σ is the familiar sigmoid operation, and where f applies the leaky relu
function elementwise and concatenates a 1:
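The architecture’s formula itself is missing from this excerpt; as a hedged sketch, one form matching the description (our guess at the intended composition, roughly x ↦ σ(B · f(A · x)), with an illustrative leak slope of 0.1):

import numpy as np
def f(v):
    # leaky relu applied elementwise, then concatenate a 1 (the bias trick)
    return np.concatenate([[1.0], np.where(v > 0., v, 0.1*v)])
def predict_soft(x, A, B):
    hidden = f(A @ x)                        # learned featurization: A is learned
    return 1./(1. + np.exp(-(B @ hidden)))   # familiar sigmoid readout: B is learned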
2. multiple layers
We can continue alternating learned linearities with fixed nonlinearities:
feature hierarchies —
bottlenecking —
highways —
deep q learning
learning-from-instructions ; farewell