mlentary: basics of machine learning


These optional notes for 6.86x give a big picture view of our conceptual landscape. They might help you study, but you can do all your assignments without these notes. These notes treat many but not all the topics of our course; they also treat a few topics not emphasized in our course, to fill gaps. It's probably best to read these notes alongside the lectures.

I'm happy to answer questions about the notes on Piazza. If you want to help improve these notes, ask me; I'll be happy to list you as a contributor!

We compiled these notes using the following color palette; let us know if the colors are hard to distinguish!

navigating these notes — We divide these notes into bite-size passages; each passage begins with a gray heading (e.g. 'navigating these notes' to the left). All passages are optional in the sense that you'll be able to complete all assignments without ever studying these notes. Yet (aside from a few bonus passages marked as VERY OPTIONAL) most passages offer complementary discussion of our course topics that I bet will help you understand, appreciate, and internalize the lectures. After all, only with two lines of sight can we see depth.

Clickable Table of Contents

A. prologue 2
  0 what is learning?
  1 a tiny example: handwritten digits
  2 how well did we do?
  3 how can we do better?
B. auto-predict by fitting lines to examples 8
  -1 microcosm
  0 linear approximation
  1 iterative optimization
  2 priors and generalization, model selection
C. bend those lines to capture rich patterns 23
  0 featurization
  1 learned featurizations
  2 locality and symmetry in architecture
  3 dependencies in architecture
D. thicken those lines to quantify uncertainty 27
  0 bayesian models
  1 examples of bayesian models
  2 inference algorithms for bayesian models
  3 combining with deep learning
E. beyond learning-from-examples 28
  0 reinforcement
  1 state
  2 deep q learning
  3 learning from instructions
F. some background ??
  0 probability primer
  1 linear algebra primer
  2 derivatives primer
  3 programming and numpy and pytorch primer
2

A. prologue
what is learning?

By the end of this section, you'll be able to
• recognize whether a learning task fits the paradigm of learning from examples and whether it's supervised or unsupervised.
• identify within a completed learning-from-examples project: the training inputs (outputs), testing inputs (outputs), hypothesis class, learned hypothesis; and describe which parts depend on which.

kinds of learning — How do we communicate patterns of desired behavior? We can teach:

by instruction: "to tell whether a mushroom is poisonous, first look at its gills..."
by example: "here are six poisonous fungi; here, six safe ones. see a pattern?"
by reinforcement: "eat foraged mushrooms for a month; learn from getting sick."

Machine learning is the art of programming computers to learn from such sources. We'll focus on the most important case: learning from examples.◦ ← Food For Thought: What's something you've learned by instruction? By example? By reinforcement? In unit 5 we'll see that learning by example unlocks the other modes of learning.

from examples to predictions — For us, a pattern of desired behavior is a function that for each given situation/prompt returns a favorable action/answer. We seek a program that, from a list of examples of prompts and matching answers, determines an underlying pattern. Our program is a success if this pattern accurately predicts answers for new, unseen prompts. We often define our program as a search, over some class H of candidate patterns (jargon: hypotheses), to maximize some notion of "intrinsic-plausibility plus goodness-of-fit-to-the-examples".

Figure 1: Predicting mushrooms’ poisons.


Our learning program selects from a class of
hypotheses (gray blob) a plausible hypothesis
that well fits (blue dots are close to black dots)
a given list of poison-labeled mushrooms (blue
blob). Evaluating the selected hypothesis on
new mushrooms, we predict the correspond-
ing poison levels (orange numbers).
The arrows show dataflow: how the hy-
pothesis class and the mushroom+poisonlevel
examples determine one hypothesis, which, to-
gether with new mushrooms, determines pre-
dicted poison levels. Selecting a hypothesis is
called learning; predicting unseen poison lev-
els, inference. The examples we learn from are
training data; the new mushrooms and their
true poison levels are testing data.

For example, say we want to predict poison levels (answers) of mushrooms (prompts). Among our hypotheses,◦ the GillOdor hypothesis fits the examples well: it guesses poison levels close to the truth. So the program selects GillOdor. ← We choose four hypotheses: respectively, that a mushroom's poison level is close to: — its ambient soil's percent water by weight; — its gills' odor level, in kilo-Scoville units; — its zipcode (divided by 100000); — the fraction of visible light its cap reflects.

'Wait!', you say, 'doesn't Zipcode fit the example data more closely than GillOdor?'. Yes. But a poison-zipcode proportionality is implausible: we'd need more evidence before believing Zipcode. We can easily make many oddball hypotheses; by chance some may fit our data well, but they probably won't predict well! Thus "intrinsic plausibility" and "goodness-of-fit-to-data" both play a role in learning.◦ ← We choose those two notions (and our H) based on domain knowledge. This design process is an art; we'll study some rules of thumb.

In practice we'll think of each hypothesis as mapping mushrooms to distributions over poison levels; then its "goodness-of-fit-to-data" is simply the chance it allots to the data.◦ ← That's why we'll need probability. We'll also use huge Hs: we'll combine mushroom features (wetness, odor, and shine) to make more hypotheses such as (1.0 · GillOdor − 0.2 · CapShine).◦ ← That's why we'll need linear algebra. Since we can't compute "goodness-of-fit" for so many hypotheses, we'll guess a hypothesis then repeatedly nudge it up the "goodness-of-fit" slope.◦ ← That's why we'll need derivatives.
3

supervised learning — We’ll soon allow uncertainty by letting patterns map prompts to
distributions over answers. Even if there is only one prompt — say, “produce
a beautiful melody” — we may seek to learn the complicated distribution over
answers, e.g. to generate a diversity of apt answers. Such unsupervised learning
concerns output structure. By contrast, supervised learning (our main subject),
concerns the input-output relation; it’s interesting when there are many possible
prompts. Both involve learning from examples; the distinction is no more firm
than that between sandwiches and hotdogs, but the words are good to know.
learning as... — machine learning is like science. machine learning is like automatic programming.
machine learning is like curve-fitting. three classic threads of AI
4

a tiny example: classifying handwritten digits

By the end of this section, you'll be able to
• write a (simple and inefficient) image classifying ML program
• visualize data as lying in feature space; visualize hypotheses as functions defined on feature space; and visualize the class of all hypotheses within weight space

meeting the data — Say we want to classify handwritten digits. In symbols: we'll map X to Y with X = {grayscale 28 × 28-pixel images}, Y = {1, 3}. Each datum (x, y) arises as follows: we randomly choose a digit y ∈ Y, ask a human to write that digit in pen, and then photograph their writing to produce x ∈ X.

Figure 2: Twenty example pairs. Each photo x is a 28 × 28 grid of numbers representing pixel intensities. The light gray background has intensity 0.0; the blackest pixels, intensity 1.0. Below each photo x we display the corresponding label y: either y = 1 or y = 3. We'll adhere to this color code throughout this tiny example. (Labels shown: 3 1 3 3 3 1 1 3 3 3 / 3 1 3 1 3 1 3 1 3 1.)

When we zoom in, we can see each photo’s 28 × 28 grid of pixels. On the
computer, this data is stored as a 28 × 28 grid of numbers: 0.0 for bright through
1.0 for dark. We’ll name these 28 × 28 grid locations by their row number (count-
ing from the top) followed by their column number (counting from the left). So
location (0, 0) is the upper left corner pixel; (27, 0), the lower left corner pixel.
Food For Thought: Where is location (0, 27)? Which way is (14, 14) off-center?
To get to know the data, let’s wonder how we’d hand-code a classifier (worry
not: soon we’ll do this more automatically). We want to complete the code
def hand_coded_predict(x):
    return 3 if condition(x) else 1

Well, 3s tend to have more ink than 1s — should condition threshold by the photo's brightness? Or: 1s and 3s tend to have different widths — should condition threshold by the photo's dark part's width?
To make this precise, let's define a photo's brightness as 1.0 minus its average pixel intensity; its width as the standard deviation of the column index of its dark pixels. Such functions from inputs in X to numbers are called features.
import numpy as np   # (we use numpy throughout these notes)

SIDE = 28
def brightness(x): return 1. - np.mean(x)
def width(x): return np.std([col for col in range(SIDE)
                                 for row in range(SIDE)
                                 if 0.5 < x[row][col]])/(SIDE/2.0)
# (we normalized width by SIDE/2.0 so that it lies within [0., 1.])

So we can threshold by brightness or by width. But this isn't very satisfying, since sometimes there are especially dark 1s or thin 3s. Aha! Let's use both features: 3s are darker than 1s even relative to their width. Inspecting the training data, we see that a line through the origin of slope 4 roughly separates the two classes. So let's threshold by a combination like -1*brightness(x)+4*width(x):

def condition(x):
    return -1*brightness(x)+4*width(x) > 0

Intuitively, the formula −1 · brightness + 4 · width we invented is a measure of threeness: if it's positive, we predict y = 3. Otherwise, we predict y = 1.
Food For Thought: What further features might help us separate digits 1 from 3?

Figure 3: Featurized training data. Our N = 20 many training examples, viewed in the brightness-width plane. The vertical brightness axis ranges [0.0, 1.0]; the horizontal width axis ranges [0.0, 0.5]. The origin is at the lower left. Orange dots represent y = 3 examples; blue dots, y = 1 examples. We eyeballed the line −1 · brightness + 4 · width = 0 to separate the two kinds of examples.
5

candidate patterns — We can generalize the hand-coded hypothesis from the previous passage to other coefficients besides −1 · brightness(x) + 4 · width(x). We let our set H of candidate patterns contain all "linear hypotheses" f_{a,b} defined by:

    f_{a,b}(x) = 3 if a · brightness(x) + b · width(x) > 0 else 1

Each f_{a,b} makes predictions of ys given xs. As we change a and b, we get different predictors, some more accurate than others.

def predict(x,a,b):
    return 3 if a*brightness(x) + b*width(x) > 0 else 1

The brightness-width plane is called feature space: its points represent inputs x in terms of chosen features (here, brightness and width). The (a, b) plane is called weight space: its points represent linear hypotheses h in terms of the coefficients — or weights — h places on each feature (e.g. a = −1 on brightness and b = +4 on width).

Figure 4: Hypotheses differ in training accuracy: feature space. 3 hypotheses classify training data in the brightness-width plane (axes range [0, 1.0]). Glowing colors distinguish a hypothesis' 1 and 3 sides. For instance, the bottom-most line classifies all the training points as 3s.

Food For Thought: Which of Fig. 4's 3 hypotheses best predicts training data?
Food For Thought: What (a, b) pairs might have produced Fig. 4's 3 hypotheses? Can you determine (a, b) for sure, or is there ambiguity (i.e., can multiple (a, b) pairs make exactly the same predictions in brightness-width space)?
optimization — Let's write a program to automatically find hypothesis h = (a, b) from the training data. We want to predict the labels y of yet-unseen photos x (testing examples); insofar as training data is representative of testing data, it's sensible to return an h ∈ H that correctly classifies maximally many training examples. To do this, let's just loop over a bunch of (a, b)s — say, all integer pairs in [−99, +99] — and pick one that misclassifies the fewest training examples:

def is_correct(x,y,a,b):
    return 1.0 if predict(x,a,b)==y else 0.0
def accuracy_on(examples,a,b):
    return np.mean([is_correct(x,y,a,b) for x,y in examples])
def best_hypothesis():
    # returns a pair (accuracy, hypothesis)
    return max((accuracy_on(training_data, a, b), (a,b))
               for a in np.arange(-99,+100)
               for b in np.arange(-99,+100) )

Fed our N = 20 training examples, the loop finds (a, b) = (−20, +83) as a minimizer of training error, i.e., of the fraction of training examples misclassified. It misclassifies only 10% of training examples. Yet the same hypothesis misclassifies a greater fraction — 17% — of fresh, yet-unseen testing examples. That latter number — called the testing error — represents our program's performance "in the wild"; it's the number we most care about.
The difference between training and testing error is the difference between our score on our second try on a practice exam (after we've reviewed our mistakes) versus our score on a real exam (where we don't know the questions beforehand and aren't allowed to change our answers once we get our grades back).

Figure 5: Hypotheses differ in training accuracy: weight space. We visualize H as the (a, b)-plane (axes range [−99, +99]). Each point determines a whole line in the brightness-width plane. Shading shows training error: darker points misclassify more training examples. The least shaded, most training-accurate hypothesis is (−20, 83): the rightmost of the 3 blue squares. The orange square is the hypothesis that best fits our unseen testing data. Food For Thought: Suppose Fig. 4's 3 hypotheses arose from the 3 blue squares shown here. Which hypothesis arose from which square? Caution: the colors in the two Figures on this page represent unrelated distinctions!

Food For Thought: In the (a, b) plane shaded by training error, we see two 'cones', one dark and one light. They lie geometrically opposite to each other — why?
Food For Thought: Sketch f_{a,b}'s error on N = 1 example as a function of (a, b).
6

how well did we do? analyzing our error

By the end of this section, you'll be able to
• automatically compute training and testing misclassification errors and describe their conceptual difference.
• explain how the problem of achieving low testing error decomposes into the three problems of achieving low generalization, optimization, and approximation errors.

error analysis — Intuitively, our testing error of 17% comes from three sources: (a) the failure of our training set to be representative of our testing set; (b) the failure of our program to exactly minimize training error over H; and (c) the failure of our hypothesis set H to contain "the true" pattern.
These are respectively errors of generalization, optimization, approximation.
We can see generalization error when we plot testing data in the brightness-width plane. The hypothesis h = (−20, +83) that we selected based on the training data misclassifies many testing points. Whereas h misclassifies only 10% of the training data, it misclassifies 17% of the testing data. This illustrates generalization error.
In our plot of the (a, b) plane, the blue square is the hypothesis h (in H) that best fits the training data. The orange square is the hypothesis (in H) that best fits the testing data. But even the latter seems suboptimal, since H only includes lines through the origin while it seems we want a line — or curve — that hits higher up on the brightness axis. This illustrates approximation error.◦ ← To define approximation error, we need to specify whether the 'truth' we want to approximate is the training or the testing data. Either way we get a useful concept. In this paragraph we're talking about approximating testing data; but in our notes overall we'll focus on the concept of error in approximating training data.
Optimization error is best seen by plotting training rather than testing data. It measures the failure of our selected hypothesis h to minimize training error — i.e., the failure of the blue square to lie in a least shaded point in the (a, b) plane, when we shade according to training error.

Figure 6: Testing error visualized two ways. — Left: in feature space. The hypothesis h = (−20, +83) that we selected based on the training set classifies testing data in the brightness-width plane; glowing colors distinguish a hypothesis' 1 and 3 sides. Axes range [0, 1.0]. — Right: in weight space. Each point in the (a, b) plane represents a hypothesis; darker regions misclassify a greater fraction of testing data. Axes range [−99, +99].

Here, we got optimization error ≈ 0% (albeit by unscalable brute-force). Because optimization error is zero in our case, the approximation error and training error are the same: ≈ 10%. The approximation error is so high because our straight lines are too simple: brightness and width lose useful information and the "true" boundary between digits — even on training data — may be curved. Finally, our testing error ≈ 17% exceeds our training error. We thus suffer a generalization error of ≈ 7%: we didn't perfectly extrapolate from training to testing situations. In 6.86x we'll address all three italicized issues.
Food For Thought: why is generalization error usually positive?
7

formalism — Here's how we can describe learning and our error decomposition in symbols. ← VERY OPTIONAL PASSAGE
Draw training examples S : (X × Y)^N from nature's distribution D on X × Y. A hypothesis f : X → Y has training error trn_S(f) = P_{(x,y)∼S}[f(x) ≠ y], an average over examples; and testing error tst(f) = P_{(x,y)∼D}[f(x) ≠ y], an average over nature. A learning program is a function L : (X × Y)^N → (X → Y); we want to design L so that it maps typical Ss to fs with low tst(f).
So we often define L to roughly minimize trn_S over a set H ⊆ (X → Y) of candidate patterns. Then tst decomposes into the failures of trn_S to estimate tst (generalization), of L to minimize trn_S (optimization), and of H to contain nature's truth (approximation):

    tst(L(S)) =   tst(L(S)) − trn_S(L(S))                 } generalization error
                + trn_S(L(S)) − inf_{f∈H} trn_S(f)        } optimization error
                + inf_{f∈H} trn_S(f)                      } approximation error

These terms are in tension. For example, as H grows, the approx. error may decrease while the gen. error may increase — this is the "bias-variance tradeoff".
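
To make the decomposition concrete, here is a minimal sketch (assuming the brute-force helpers accuracy_on, training_data, and testing_data from the tiny digits example, plus a finite grid candidates of (a, b) pairs) of estimating the three terms numerically:

def error_decomposition(candidates):
    trn = lambda ab: 1.0 - accuracy_on(training_data, ab[0], ab[1])   # trn_S(f), from the grid-search code
    tst = lambda ab: 1.0 - accuracy_on(testing_data,  ab[0], ab[1])   # tst(f), estimated on held-out data
    learned = min(candidates, key=trn)             # stand-in for L(S): minimize training error
    floor   = min(trn(ab) for ab in candidates)    # stand-in for inf over H of trn_S(f)
    return {'generalization': tst(learned) - trn(learned),
            'optimization':   trn(learned) - floor,    # zero here, since the search is exhaustive
            'approximation':  floor}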

how can we do better? (survey of rest of notes)


8

B. auto-predict by fitting lines to examples (unit 1)


microcosm

By the end of this section, you'll be able to
• define a class of linear hypotheses for given featurized data
• compute and efficiently optimize the perceptron and svm losses suffered by a hypothesis

what hypotheses will we consider? a recipe for H — Remember: we want our machine to find an input-to-output rule. We call such rules hypotheses. As engineers, we carve out a menu H of rules for the machine to choose from. Let's generalize our previous digit classifying example to consider hypotheses of this format: extract features of the input to make a list of numbers; then linearly combine those numbers to make another list of numbers; finally, read out a prediction from the latter list. Our digit classifying hypotheses, for instance, look like:◦ ← Our list threeness has length one: it's just a fancy way of talking about a single number. We'll later use longer lists to model richer outputs: to classify between > 2 labels, to generate a whole image instead of a class label, etc.

def predict(x):
    features   = [brightness(x), width(x)]              # featurize
    threeness  = [ -1.*features[0] + 4.*features[1] ]   # linearly combine
    prediction = 3 if threeness[0] > 0. else 1          # read out prediction
    return prediction

The various hypotheses differ only in those coefficients (jargon: weights, here −1, +4) for their linear combinations; it is these degrees of freedom that the machine learns from data. We can diagram such hypotheses by drawing arrows:◦ ← PICTURE OF PARADIGM?

    X --featurize--> R² --linearly combine--> R¹ --read out--> Y
       (not learned)       (learned!)              (not learned)

Our Unit 1 motto is to learn linearities flanked by hand-coded nonlinearities. We


design the nonlinearities to capture domain knowledge about our data and goals.
For now, we’ll assume we’ve already decided on our featurization and we’ll
use the same readout as in the code above:

3 if threeness[0]>0. else 1

Of course, if we are classifying dogs vs cows that line would read

cow if bovinity[0]>0. else dog

In the next three passages we address the key question: how do we compute the
weights by which we’ll compute threeness or bovinity from our features?
how good is a hypothesis? fit — We instruct our machine to find within our menu H a hypothesis that's as "good" as possible. That is, the hypothesis should both fit our training data well and seem intrinsically plausible. Now we'll quantify these notions of goodness-of-fit and intrinsic-plausibility. As with H, the exact way we quantify these notions is an engineering art informed by domain knowledge. Still, there are patterns and principles — we will use two specific quantitative notions, the perceptron loss and the SVM loss, to study these principles. Later, once we understand these notions as quantifying uncertainty (i.e., as probabilistic notions), we'll appreciate their logic. But for now we will bravely adventure forth, ad hoc!
We'll start with goodness-of-fit. The various hypotheses correspond◦ to choices of weights. For example, the weight vector (−1, +4) determines the hypothesis listed above. ← A very careful reader might ask: can't multiple choices of weights determine the same hypothesis? E.g. (−1, +4) and (−2, +8) classify every input the same way, since they either both make threeness positive or both make threeness negative. This is a very good point, dear reader, but at this stage in the course, much too pedantic! Ask again later.
9

One way to quantify h's goodness-of-fit to a training example (x, y) is to see whether or not h correctly predicts y from x. That is, we could quantify goodness-of-fit by training accuracy, like we did in the previous digits example:

def is_correct(x,y,a,b):
    threeness  = a*brightness(x) + b*width(x)
    prediction = 3 if threeness>0. else 1
    return 1. if prediction==y else 0.

By historical convention we actually like◦ to minimize badness (jargon: loss) rather than maximize goodness. So we'll rewrite the above in terms of mistakes: ← ML is sometimes a glass half-empty kind of subject!

def leeway_before_mistake(x,y,a,b):
    threeness = a*brightness(x) + b*width(x)
    return +threeness if y==3 else -threeness
def is_mistake(x,y,a,b):
    return 0. if leeway_before_mistake(x,y,a,b)>0. else 1.

We could define goodness-of-fit as training accuracy. But we'll enjoy better generalization and easier optimization by allowing "partial credit" for borderline predictions. E.g. we could use leeway_before_mistake as goodness-of-fit:◦ ← to define loss, we flip signs

def linear_loss(x,y,a,b):
    return 1 - leeway_before_mistake(x,y,a,b)

Notice that if the leeway is very positive, then the badness is very negative (i.e., we are in a very good situation), and vice versa.
But, continuing the theme of pessimism, we usually feel that a "very safely classified" point (very positive leeway) shouldn't make up for a bunch of "slightly misclassified" points (slightly negative leeway). That is, we'd rather have leeways +.1, +.1, +.1, +.1 than +10, −1, −1, −1 on four training examples.◦ ← Food For Thought: compute and compare the training accuracies in these two situations. As an open-ended followup, suggest reasons why considering training leeways instead of just accuracies might help improve testing accuracy. But total linear loss incentivizes the opposite choice: it doesn't capture our sense that a very positive leeway feels mildly pleasant while a very negative one feels urgently alarming. To model this sense quantitatively, we can impose a floor on linear_loss so that it can't get too negative.◦ ← To impose a floor on linear_loss is to impose a ceiling on how much we care about leeway: arbitrarily positive leeway doesn't count arbitrarily much. It's good to get used to this mental flipping between maximizing goodness and minimizing loss. We get perceptron loss if we set a floor of 1; SVM loss (also known as hinge loss) if we set a floor of 0:

def perceptron_loss(x,y,a,b):
    return max(1, 1 - leeway_before_mistake(x,y,a,b))
def svm_loss(x,y,a,b):
    return max(0, 1 - leeway_before_mistake(x,y,a,b))

Food For Thought: For incentives to point the right way, loss should decrease as
threeness increases when y==3 but should increase as threeness increases when
y==1. Verify these relations for the loss functions above.
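
As a quick numeric illustration of the incentive point above (the leeway values are made up, not real training data), here is how the three losses rank the two scenarios:

import numpy as np
safe_but_barely     = np.array([+0.1, +0.1, +0.1, +0.1])   # four barely-correct points
one_great_three_bad = np.array([+10., -1., -1., -1.])      # one very safe point, three mistakes
total_linear     = lambda leeways: np.sum(1. - leeways)
total_perceptron = lambda leeways: np.sum(np.maximum(1., 1. - leeways))
total_svm        = lambda leeways: np.sum(np.maximum(0., 1. - leeways))
for name, total in [('linear', total_linear), ('perceptron', total_perceptron), ('svm', total_svm)]:
    print(name, total(safe_but_barely), total(one_great_three_bad))
# linear loss prefers the mostly-misclassified scenario (-3.0 beats 3.6);
# perceptron (4.0 vs 7.0) and svm (3.6 vs 6.0) both prefer the four barely-correct points.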
2 PICTURES OF LOSS FUNCTIONS (PLOT vs dec; and PLOT in feature space); illus-
trate width dependence!! Weight space?
10

how good is a hypothesis? plausibility — Now to define intrinsic plausibility, also known as a regularizer term. There are many intuitions and aspects of domain knowledge we'd in practice want to capture here — symmetry comes to mind — but we'll focus for now on capturing this particular intuition: a hypothesis that depends very much on very many features is less plausible.
That is, we find a hypothesis more plausible when its "total amount of dependence" on the features is small. One convenient way is to quantify this as proportional to a sum of squared weights:◦ if weights (a, b) give rise to the hypothesis h, then implausibility of h = λ(a² + b² + · · · ). In code: ← We could just as well use 6.86a² + b² instead of a² + b². Food For Thought: When (a, b) represent weights for brightness-width digits features, how do the hypotheses with small 6.86a² + b² visually differ from those with small a² + b²?

LAMBDA = 1.
def implausibility(a,b):
    return LAMBDA * np.sum(np.square([a,b]))

Intuitively, the constant λ=LAMBDA tells us how much we care about plausibility
relative to goodness-of-fit-to-data.
Here’s what the formula means, metaphorically. Imagine three of our friends
each have a theory about which birds sing:
AJ says it all has to do with wingspan. A bird with wings shorter than 20cm
can’t fly far enough to obviate the need for singing: such a bird is sure to sing.
Conversely, birds with longer wings never sing.
Pat says to check whether the bird grows red feathers, eats shrimp, lives near
ice, wakes in the night, and has a bill. If, of these 5 qualities, an even number
are true, then the bird probably sings. Otherwise, it probably doesn’t.
Sandy says it has to do with wingspan and nocturnality: shorter wings and
nocturnality both make a bird moderately more likely to sing.
Which hypothesis do we prefer? Well, AJ seems way too confident. Maybe
they’re right that wingspan matters, but it seems implausible that wingspan is
so decisive. Pat, meanwhile, doesn’t make black-and-white claims, but Pat’s pre-
dictions depend substantively on many features: flipping any one quality flips
their prediction. This, too, seems implausible. By contrast, Sandy’s hypothesis
doesn’t depend too strongly on too many features. To me, a bird non-expert,
Sandy’s seems most plausible.
Now we can define the overall undesirability of a hypothesis:◦ ← Here we use SVM loss but you can plug in whatever loss you want! Different losses will give different learning behaviors with pros and cons.

def objective_function(examples,a,b):
    data_term   = np.sum([svm_loss(x,y,a,b) for x,y in examples])
    regularizer = implausibility(a, b)
    return data_term + regularizer

To build intuition about which hypotheses are most desirable according to that
metric, let’s suppose λ is a tiny positive number. Then minimizing the objective
function is the same as minimizing the data term, here the total SVM loss — our
notion of implausibility only becomes important as a tiebreaker.
Now, which way does it break ties? Imagine there are multiple 100% accuracy
hypotheses, e.g. some slightly nudged versions of the black line in the Figure.
Since the black hypothesis depends only on an input’s vertical coordinate (say,
this is brightness in the brightness-width plane), it arises from weights of the
form (a, b) = (a, 0).

Figure 7: For convenience we set the origin to


the intersection of the two hypotheses. That
way we can still say that every hypothesis’s de-
cision boundary goes through the origin.
11

Some (a, 0) pairs will have higher SVM loss than others. For example, if a is
nearly 0, then each datapoint will have leeway close to 0 and thus SVM loss close
to 1; conversely, if a is huge, then each datapoint will have leeway very positive
and thus SVM loss equal to the imposed floor of 0. We see that SVM loss is 0 so
long as a is big enough for each leeway to exceed 1.
But leeway equals a times brightness — or more generally weights times fea-
tures — so for leeway to exceed 1 and a to be small, brightness must be large.
Now, here’s the subtle, crucial geometry. Our decision boundary (the black
line) shows where leeway is zero. Moreover, since for a blue point leeway=
a · brightness(x),
Interpreting leeway as a measure of confidence.
12

which hypothesis is best? optimization by gradient descent —


improving approximation, optimization, generalization —
13

linear approximation

By the end of this section, you'll be able to
• define a class of linear, probabilistic hypotheses appropriate to a given classification task, by: designing features; packaging the coefficients to be learned as a matrix; and selecting a probability model (logistic, perceptron, SVM, etc).
• compute the loss suffered by a probabilistic hypothesis on given data

a recipe for our hypothesis class — Remember: we want our machine to find an input-to-output rule. We call such rules hypotheses. As engineers, we carve out a menu of rules for the machine to choose from. Let's generalize that digit classifying example to consider hypotheses of this format: extract features from the input to make a list of numbers; then linearly combine those numbers to make another list of numbers; finally, read out a prediction from that latter list. For example, each hypothesis in §A's example looks something like this:◦ ← Our list threeness has length one: it's just a fancy way of talking about a single number. We'll later use longer lists to model richer outputs: to classify between > 2 labels, to generate a whole image instead of a class label, etc.

def predict(x):
    features   = [brightness(x), width(x)]              # featurize
    threeness  = [ -1.*features[0] + 4.*features[1] ]   # linearly combine
    prediction = 3 if threeness[0] > 0. else 1          # read out prediction
    return prediction

The various hypotheses differ only in those coefficients (here −1, +4) for their linear combinations; it is these degrees of freedom that the machine learns from data. We can diagram such hypotheses by drawing arrows:◦ ← PICTURE OF PARADIGM?

    X --featurize--> R² --linearly combine--> R¹ --read out--> Y
       (not learned)       (learned!)              (not learned)

Our Unit 1 motto is to learn linearities flanked by hand-coded nonlinearities. We


design the nonlinearities to capture domain knowledge about our data and goals.
two thirds between dog and cow — We mentioned earlier that we often prefer to model uncertainty over Y. We can do this by choosing a different read-out function. For example, representing distributions by objects {3:prob_of_three, 1:prob_of_one}, we could choose:

prediction = {3 : 0.8 if threeness[0]>0. else 0.2,
              1 : 0.2 if threeness[0]>0. else 0.8 }

If before we'd have predicted "the label is 3", we now predict "the label is 3 with 80% chance and 1 with 20% chance". This hard-coded 80% could suffice.◦ ← As always, it depends on what specific thing we're trying to do! But let's do better: intuitively, a 3 is more likely when threeness is huge than when threeness is nearly zero. So let's replace that 80% by some smooth function of threeness. A popular, theoretically warranted choice is σ(z) = 1/(1 + exp(−z)):◦ ← σ, the logistic or sigmoid function, has linear log-odds: σ(z)/(1−σ(z)) = exp(z)/1. It tends exponentially to the step function. It's symmetrical: σ(−z) = 1−σ(z). Its derivative concentrates near zero: σ′(z) = σ(z)σ(−z). Food For Thought: Plot σ(z) by hand.

sigma = lambda z : 1./(1.+np.exp(-z))
prediction = {3 : sigma(threeness[0]),
              1 : 1.-sigma(threeness[0]) }

Given training inputs x_i, a hypothesis will have "hunches" about the training outputs y_i. Three hypotheses h_three!, h_three, and h_one might, respectively, confidently assert y_42 = 3; merely lean toward y_42 = 3; and think y_42 = 1. If in reality y_42 = 1 then we'd say h_one did a good job, h_three a bad job, and h_three! a very bad job on the 42nd example. So the training set "surprises" different hypotheses to different degrees. We may seek a hypothesis h⋆ that is minimally surprised, i.e., usually confidently right and when wrong not confidently so. In short, by outputting probabilities instead of mere labels, we've earned this awesome upshot: the machine can automatically calibrate its confidence levels!◦ ← It's easy to imagine applications in language, self-driving, etc.
14

Now, what does this all mean? What does it mean for “dogness vs cowness” to
vary “linearly”?
Confidence on mnist example! (2 pictures, left and right: hypotheses and (a,b plane)

interpreting weights — Note on interpreting weights
Just because two features both correlate with a positive label (y = +1) doesn't mean both features will have positive weights. In other words, it could be that the blah-feature correlates with y = +1 in the training set and yet, according to the best hypothesis for that training set, the bigger a fresh input's blah feature is, the less likely its label is to be +1, all else being equal. That last phrase "all else being equal" is crucial, since it refers to our choice of coordinates.

Figure 8: Relations between feature statistics and optimal weights. Each of these six figures shows a different binary classification task along with a visually best hypothesis. — Left: positive weights don't imply positive correlation! — Right: presenting the same information in different coordinates alters predictions!

Shearing two features together — e.g. measuring cooktime-plus-preptime together with cooktime rather than preptime together with cooktime — can impact the decision boundary. Intuitively, the more stretched out a feature axis is, the more the learned hypothesis will rely on that feature.
Stretching a single feature — for instance, measuring it in centimeters instead of meters — can impact the decision boundary as well. Intuitively, the more stretched out a feature axis is, the more the learned hypothesis will rely on that feature.
designing featurizations — We represent our input x as a fixed-length list of numbers so
that we can “do math” to x. For instance, we could represent a 28 × 28 photo by 2
numbers: its overall brightness and its dark part’s width. Or we could represent
it by 784 numbers, one for the brightness at each of the 28 · 28 = 784 many pixels.
Or by 10 numbers that respectively measure the overlap of x’s ink with that of
“representative” photos of the digits 0 through 9.
A way to represent x as a fixed-length list of numbers is a featurization. Each map from raw inputs to numbers is a feature. Different featurizations make different patterns easier to learn. We judge a featurization not in a vacuum but with respect to the kinds of patterns we use it to learn. A good featurization makes task-relevant information easy for the machine to use (e.g. through apt nonlinearities) and throws away task-irrelevant information (e.g. by turning 784 pixel brightnesses into 2 meaningful numbers).
Here are two themes in the engineering art of featurization.◦ ← For now, we imagine hand-coding our features rather than adapting them to training data. We'll later discuss adapted features; simple examples include thresholding into quantiles based on sorted training data (Is x more than the median training point?), and choosing coordinate transforms that measure similarity to landmarks (How far is x from each of these 5 "representative" training points?). Deep learning is a fancy example.
Predicates. If domain knowledge suggests some subset S ⊆ X is salient, then we can define the feature

    x ↦ 1 if x lies in S else 0

The most important case helps us featurize categorical attributes (e.g. kind-of-chess-piece, biological sex, or letter-of-the-alphabet): if an attribute takes K possible values, then each value induces a subset of X and thus a feature. These features assemble into a map X → R^K. This one-hot encoding is simple, powerful, and common. Likewise, if some attribute is ordered (e.g. X contains geological strata) then interesting predicates may include thresholds.
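
For instance, here is a minimal one-hot encoder (the attribute values below are made up for illustration):

import numpy as np
PIECES = ['pawn', 'knight', 'bishop', 'rook', 'queen', 'king']     # K = 6 possible values
def one_hot(value, possible_values=PIECES):
    return np.array([1.0 if value == v else 0.0 for v in possible_values])
one_hot('rook')   # -> array([0., 0., 0., 1., 0., 0.])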
Coordinate transforms. Applying our favorite highschool math functions gives new features tanh(x[0]) − x[1], |x[1]x[0]| exp(−x[2]²), · · · from old features x[0], x[1], · · · . We choose these functions based on domain knowledge; e.g. if x[0], x[1] represent two spatial positions, then the distance |x[0] − x[1]| may be a useful feature. One systematic way to include nonlinearities is to include all the monomials (such as x[0]x[1]²) with not too many factors — then linear combinations are polynomials. The most important nonlinear coordinate transform uses all monomial features with 0 or 1 many factors — said plainly, this maps

    x ↦ (1, x)

This is the bias trick. Intuitively, it allows the machine to learn the threshold above which three-ishness implies a three.

Figure 9: The bias trick helps us model 'offset' decision boundaries. Here, the origin is the lower right corner closer to the camera. Our raw inputs x = (x[0], x[1]) are 2-dimensional; we can imagine them sitting on the bottom face of the plot (bottom ends of the vertical stems). But, within that face, no line through the origin separates the data well. By contrast, when we use a featurization (1, x[0], x[1]), our data lies on the top face of the plot; now a plane through the origin (shown) successfully separates the data.

humble models — Let's modify logistic classification to allow for unknown unknowns. We'll do this by allowing a classifier to allot probability mass not only among labels in Y but also to a special class ? that means "no comment" or "alien input". A logistic classifier always sets p_{y|x}[?|x] = 0, but other probability models may put nonzero mass on "no comment". For example, consider:
nonzero mass on “no comment”. For example, consider: plot; now a plane through the origin (shown)
successfully separates the data.
logistic perceptron svm
py|x [+1|x] ⊕/( + ⊕) ⊕ · ( ∧ ⊕)/2 ⊕ · ( ∧ ⊕/e)/2 Table trick
curvy 1: Three popular models for binary
classification. Top rows: Modeled chance
py|x [−1|x] /( + ⊕) · ( ∧ ⊕)/2 · ( /e ∧ ⊕)/2 given x that y = +1, −1, ?. We use d = w ~ · ~x,
py|x [?|x] 1 − above = 0 1 − above 1 − above ⊕ = e+d/2 , = e−d/2 , a ∧ b = min(a, b)
outliers responsive robust robust to save ink. Middle rows: All models re-
inliers sensitive blind sensitive graphs
spond ofto prob
misclassifications. But are they robust
to well-classified outliers? Sensitive to well-
acc bnd good bad good
classified inliers? Bottom rows: For optimiza-
loss name softplus(·) srelu(·) hinge(·) tion, which we’ll discuss later, we list (negative
formula log2 (1 + e(·) ) max(1, ·) + 1 max(1, · + 1) log-probability) losses. An SGD step looks like
update 1/(1 + e+yd ) step(−yd) step(1 − yd)
~ t + η · update · y~x
~ t+1 = w
w

MLE with the perceptron model or svm model minimizes the same thing, but
with srelu(z) = max(0, z) + 1 or hinge(z) = max(0, z + 1) instead of softplus(z).

graphs of prob
16

Two essential properties of softplus are that: (a) it is convex◦ and (b) it upper bounds the step function. Note that srelu and hinge also enjoy these properties. Property (a) ensures that the optimization problem is relatively easy — under mild conditions, gradient descent will find a global minimum. By property (b), the total loss on a training set upper bounds the rate of erroneous classification on that training set. So loss is a surrogate for (in)accuracy: if the minimized loss is nearly zero, then the training accuracy is nearly 100%.◦ ← A function is convex when its graph is bowl-shaped rather than wriggly. It's easy to minimize convex functions by 'rolling downhill', since we'll never get stuck in a local wriggle. Don't worry about remembering or understanding this word. ← The perceptron satisfies (b) in a trivial way that yields a vacuous bound of 100% on the error rate.
So we have a family of related models: logistic, perceptron, and SVM. In Project 1 we'll find hypotheses optimal with respect to the perceptron and SVM models (the latter under a historical name of pegasos), but soon we'll focus mainly on logistic models, since they fit best with deep learning.
DEFINE NOTION OF LOSS!
richer outputs: multiple classes — We’ve explored hypotheses fW (x) = readout(W ·
featurize(x)) where W represents the linear-combination step we tune to data.
We began with hard binary classification, wherein we map inputs to definite
labels (say, y = cow or y = dog):

readout(d) = “cow if 0 < d else dog”

We then made this probabilistic using σ. In such soft binary classification we


return (for each given input) a distribution over labels:

readout(d) = “chance σ(d) of cow; chance 1−σ(d) of dog”

Remembering that σ(d) : (1 − σ(d)) are in the ratio exp(d) : 1, we rewrite:

readout(d) = “chance of cow is exp(d)/Zd ; of dog, exp(0)/Zd ”

I hope some of you felt bugged by the above formulas’ asymmetry: W mea-
sures “cow-ishness minus dog-ishness” — why not the other way around? Let’s
describe the same set of hypotheses but in a more symmetrical way. A common
theme in mathematical problem solving is to trade irredundancy for symmetry
(or vice versa). So let’s posit both a Wcow and a Wdog . One measures “cow-
ishness”; the other, “dog-ishness”. They assemble to give W, which is now a
matrix of shape 2×number-of-features. So d is now a list of 2 numbers: dcow and
ddog . Now dcow − ddog plays the role that d used to play.
Then we can do hard classification by:

    readout(d) = argmax_y d_y

and soft classification by:

    readout(d) = "chance of y is exp(d_y)/Z_d"

To make probabilities add to one, we divide by Z_d = Σ_y exp(d_y).
Behold! By rewriting our soft and hard hypotheses for binary classification, we've found formulas that also make sense for more than two classes! The above readout for soft multi-class classification is called softmax.
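
In code, the softmax readout might look like the following sketch (shapes are hypothetical: W is (number of classes) × (number of features)):

import numpy as np
def softmax_readout(d):
    z = np.exp(d - np.max(d))            # subtracting the max doesn't change the answer but avoids overflow
    return z / np.sum(z)                 # chance of class y is exp(d_y)/Z_d
def predict_soft(x, W, featurize):
    return softmax_readout(W @ featurize(x))
def predict_hard(x, W, featurize):
    return np.argmax(W @ featurize(x))   # argmax_y d_y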

softmax plot
17

richer outputs: beyond classification — By the way, if we're trying to predict a real-valued output instead of a binary label — this is called hard one-output regression — we can simply return d itself as our readout: ← VERY OPTIONAL PASSAGE

    readout(d) = d

This is far from the only choice! For example, if we know that the true ys will always be positive, then readout(d) = exp(d) may make more sense. I've encountered a learning task (about alternating current in power lines) where what domain knowledge suggested — and what ended up working best — were trigonometric functions for featurization and readout! There are also many ways to return a distribution instead of a number. One way to do such soft one-output regression is to use normal distributions:◦ ← Ask on Piazza about the interesting world of alternatives and how they influence learning!

    readout(d) = "normal distribution with mean d and variance 25"

Or we could allow for different variances by making d two-dimensional and saying · · · mean d_0 and variance exp(d_1). By now we know how to do multi-output regression, soft or hard: just promote W to a matrix with more output dimensions.
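
Here is a small sketch of those regression readouts (scipy's normal distribution object is used purely for illustration; the variance numbers match the text above):

import numpy as np
from scipy.stats import norm
def readout_hard(d):         return d                                   # hard one-output regression
def readout_soft_fixed(d):   return norm(loc=d, scale=np.sqrt(25.))     # mean d, variance 25
def readout_soft_learned(d): return norm(loc=d[0], scale=np.sqrt(np.exp(d[1])))  # mean d_0, variance exp(d_1)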
TODO: show pictures of 3 classes, 4 classes (e.g. digits 0,1,8,9)
Okay, so now we know how to use our methods to predict discrete labels or
real numbers. But what if we want to output structured data like text? A useful
principle is to factor the task of generating such “variable-length” data into many
smaller, simpler predictions, each potentially depending on what’s generated so
far. For example, instead of using W to tell us how to go from (features of) an
image x to a whole string y of characters, we can use W to tell us, based on an
image x together with a partial string y0 , either what the next character is OR that
the string should end. So if there are 27 possible characters (letters and space)
then this is a 27 + 1-way classification problem:

(Images × Strings) → R··· → R28 → DistributionsOn({’a’, · · · , ’z’, ’ ’, STOP})

We could implement this function as some hand-crafted featurization function


from Images × Strings to fixed-length vectors, followed by a learned W, followed
by softmax.
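
Here is a minimal sketch of that factored generation scheme (featurize and W are hypothetical, and we decode greedily for simplicity):

import numpy as np
ALPHABET = list('abcdefghijklmnopqrstuvwxyz ') + ['STOP']    # 27 characters plus the STOP outcome
def generate_string(x, W, featurize, max_len=50):
    y = ''                                         # the partial string generated so far
    for _ in range(max_len):
        d = W @ featurize(x, y)                    # scores for the 28 outcomes
        probs = np.exp(d - d.max()) / np.sum(np.exp(d - d.max()))   # softmax
        nxt = ALPHABET[int(np.argmax(probs))]      # greedy choice of next character
        if nxt == 'STOP':
            break
        y += nxt
    return y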
Food For Thought: A “symbolic expression tree” is something that looks like
(((0.686 + x)*4.2)*x) or ((x*5.9) + (x*x + x*(6.036*x))). That is, a tree is
either (a tree plus a tree) OR (a tree times a tree) OR (the symbol x) OR (some
real number). Propose an architecture that, given a short mathematical word
problem, predicts a symbolic expression tree. Don’t worry about featurization.
Food For Thought: A “phylogenetic tree” is something that looks like (dog.5mya.(cow.2mya.raccoon))
or ((chicken.63mya.snake).64mya.(cow)).120mya.(snail.1mya.slug). That is,
a tree is either a pair of trees together with a real number OR a species name.
The numbers represent how long ago various clades diverged. Propose an archi-
tecture that, given a list of species, predicts a phylogenetic tree for that species.
Don’t worry about featurization.
18

iterative optimization

By the end of this section, you'll be able to
• implement gradient descent for any given loss function and (usually) thereby automatically and efficiently find nearly-optimal linear hypotheses from data
• explain why the gradient-update formulas for common linear models are sensible, not just formally but also intuitively

(stochastic) gradient descent — We seek a hypothesis that is best (among a class H) according to some notion of how well each hypothesis models given data:

def badness(h,y,x):
    # return e.g. whether h misclassifies y,x OR h's surprise at seeing y,x OR etc
    ...
def badness_on_dataset(h, examples):
    return np.mean([badness(h,y,x) for y,x in examples])

Earlier we found a nearly best candidate by brute-force search over all hypotheses. But this doesn't scale to most interesting cases wherein H is intractably large. So: what's a faster algorithm to find a nearly best candidate?
A common idea is to start arbitrarily with some h_0 ∈ H and repeatedly improve to get h_1, h_2, · · · . We eventually stop, say at h_10000. The key question is:◦ how do we compute an improved hypothesis h_{t+1} from our current hypothesis h_t? ← Also important are the questions of where to start and when to stop. But have patience! We'll discuss these later.
We could just keep randomly nudging h_t until we hit on an improvement; then we define h_{t+1} as that improvement. Though this sometimes works surprisingly well,◦ we can often save time by exploiting more available information. Specifically, we can inspect h_t's inadequacies to inform our proposal h_{t+1}. Intuitively, if h_t misclassifies a particular (x_i, y_i) ∈ S, then we'd like h_{t+1} to be like h_t but nudged toward accurately classifying (x_i, y_i).◦ ← If you're curious, search 'metropolis hastings' and 'probabilistic programming'. ← In doing better on the ith datapoint, we might mess up how we do on the other datapoints! We'll consider this in due time.
How do we compute "a nudge toward accurately classifying (x, y)"? That is, how do we measure how slightly changing a parameter affects some result? Answer: derivatives! To make h less bad on an example (y, x), we'll nudge h a tiny bit along −g = −d badness(h, y, x)/dh. Say, h becomes h − 0.01g.◦ ← E.g. if each h is a vector and we've chosen badness(h, y, x) = −yh · x as our notion of badness, then −d badness(h, y, x)/dh = +yx, so we'll nudge h in the direction of +yx. Food For Thought: Is this update familiar? Once we write

def gradient_badness(h,y,x):
    # returns the derivative of badness(h,y,x) with respect to h
    ...
def gradient_badness_on_dataset(h, examples):
    return np.mean([gradient_badness(h,y,x) for y,x in examples], axis=0)

we can repeatedly nudge via gradient descent (GD), the engine of ML:◦ ← Food For Thought: Can GD directly minimize misclassification rate?

h = initialize()
for t in range(10000):
    h = h - 0.01 * gradient_badness_on_dataset(h, examples)

Since the derivative of total badness depends on all the training data, looping 10000 times is expensive. So in practice we estimate the needed derivative based on some subset (jargon: batch) of the training data — a different subset each pass through the loop — in what's called stochastic gradient descent (SGD):

cartoon of GD

h = initialize()
for t in range(10000):
    batch = select_subset_of(examples)
    h = h - 0.01 * gradient_badness_on_dataset(h, batch)
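
One simple way (an assumption, not the only choice) to implement select_subset_of is to sample a small batch uniformly at random each pass through the loop:

import numpy as np
BATCH_SIZE = 32
def select_subset_of(examples):
    idx = np.random.choice(len(examples), size=min(BATCH_SIZE, len(examples)), replace=False)
    return [examples[i] for i in idx]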

(S)GD requires informative derivatives. Misclassification rate has uninformative derivatives: any tiny change in h won't change the predicted labels. But when we use probabilistic models, small changes in h can lead to small changes in the predicted distribution over labels. To speak poetically: the softness of probabilistic models paves a smooth ramp over the intractably black-and-white cliffs of 'right' or 'wrong'. We now apply SGD to maximizing probabilities.

cartoon of GD
19

maximum likelihood estimation — When we can compute each hypothesis h’s asserted
probability that the training ys match the training xs, it seems reasonable to seek
an h for which this probability is maximal. This method is maximum likelihood
estimation (MLE). It’s convenient for the overall goodness to be a sum (or aver-
age) over each training example. But independent chances multiply rather than
add: rolling snake-eyes has chance 1/6 · 1/6, not 1/6 + 1/6. So we prefer to think
about maximizing log-probabilities instead of maximizing probabilities — it’s the
same in the end.◦ By historical convention we like to minimize badness rather than maximize goodness, so we'll use SGD to minimize negative-log-probabilities. ← Throughout this course we make a crucial assumption that our training examples are independent from each other.
def badness(h,y,x):
    return -np.log( probability_model(y,x,h) )

Let's see this in action for the linear logistic model we developed for soft binary classification. A hypothesis w predicts that a (featurized) input x has label y = +1 or y = −1 with chance σ(+w · x) or σ(−w · x):

    p_{y|x,w}(y | x, w) = σ(y w · x)        where σ(d) = 1/(1 + exp(−d))

So MLE with our logistic model means finding w that minimizes

    − log(prob of all y_i s given all x_i s and w) = Σ_i − log(σ(y_i w · x_i))

The key computation is the derivative of those badness terms:◦ ← Remember that σ′(z) = σ(z)σ(−z). To reduce clutter we'll temporarily write y w · x as ywx.

    ∂(− log(σ(ywx)))/∂w = −σ(ywx)σ(−ywx) y x / σ(ywx) = −σ(−ywx) y x

Food For Thought: If you’re like me, you might’ve zoned out by now. But this stuff
is important, especially for deep learning! So please graph the above expressions
to convince yourself that our formula for derivative makes sense visually.

To summarize, we’ve found the loss gradient for the logistic model:
sigma = lambda z : 1./(1+np.exp(-z))
def badness(w,y,x): return -np.log( sigma(y*w.dot(x)) )
def gradient_badness(w,y,x): return -sigma(-y*w.dot(x)) * y*x
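
A quick numerical sanity check (with made-up toy numbers, using the badness and gradient_badness just defined) that the analytic gradient matches a finite-difference estimate along one coordinate:

import numpy as np
w = np.array([0.3, -0.7]); x = np.array([1.0, 2.0]); y = +1; i = 0; eps = 1e-6
w_nudged = w.copy(); w_nudged[i] += eps
numeric  = (badness(w_nudged, y, x) - badness(w, y, x)) / eps
analytic = gradient_badness(w, y, x)[i]
print(numeric, analytic)   # the two should agree to several decimal places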

As before, we define overall badness on a dataset as an average badness over examples; and for simplicity, let's initialize gradient descent at h_0 = 0:

def gradient_badness_on_dataset(h, examples):
    return np.mean([gradient_badness(h,y,x) for y,x in examples], axis=0)
def initialize():
    return np.zeros(NUMBER_OF_DIMENSIONS, dtype=np.float32)

← show trajectory in weight space over time — see how the certainty degree of freedom is no longer redundant? ("markov")

Then we can finally write gradient descent:

h = initialize()
for t in range(10000):
    h = h - 0.01 * gradient_badness_on_dataset(h, examples)

show training and testing loss and acc over time


20

initialization, learning rate, local minima — ← VERY OPTIONAL PASSAGE


pictures of training: noise and curvature — ← VERY OPTIONAL PASSAGE

test vs train curves: overfitting


random featurization: double descent

priors and generalization

    A child's education should begin at least 100 years before [they are] born.
    — oliver wendell holmes jr

on overfitting — In the Bayesian framework, we optimistically assume that our model is "correct" and that the "true" posterior over parameters is

    p(w | y; x) = p(y | w; x) p(w) / Z_x

a normalized product of likelihood and prior. Therefore, our optimal guess for a new prediction is:

    p(y⋆; x⋆, x) = Σ_w p(y⋆ | w; x⋆) p(y | w; x) p(w) / Z_x

For computational tractability, we typically approximate the posterior over hypotheses by a point mass at the posterior's mode argmax_{w′} p(w′ | y; x):

    p(w | y, x) ≈ δ(w − w⋆(y, x))        w⋆(y, x) = argmax_{w′} p(w′ | y; x)

Then

    p(y⋆; x⋆, x) ≈ p(y⋆ | w⋆(y, x); x⋆)
What do we lose in this approximation?
E and max do not commute
PICTURE: “bowtie” vs “pipe”
BIC for relevant variables
log priors and bayes — fill in computation and bases visual illustration of how choice of L2 dot
product matters `p regularization; sparsity eye regularization example! TODO: rank as a
prior (for multioutput models)
For 1 ≤ p ≤ ∞ and 1 ≤ q ≤ ∞ we can consider this prior:◦ ← To define the cases p = ∞ or q = ∞ we take limits. To define the case p = ∞ = q we take limits while maintaining p = q.

    p(w) ∝ exp( − ( Σ_i |λ w_i|^p )^{q/p} )

For q = p this decomposes as a sum and thus has each coordinate independent. Whereas q limits outliers, p controls the shape of level curves. Small q makes large vectors more probable and small p makes it more probable that the entries within a vector will be of very different sizes.
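
A small sketch of the (unnormalized) log-density of that prior, for chosen p, q, and λ:

import numpy as np
def log_prior(w, p, q, lam=1.0):      # log p(w) up to an additive normalizing constant
    return -np.sum(np.abs(lam * np.asarray(w))**p)**(q/p)
log_prior([1.0, -2.0, 0.5], p=2, q=2)   # q = p = 2 recovers the familiar sum-of-squares penalty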
hierarchy, mixtures, transfer — k-fold cross validation bayesian information criterion
estimating generalization — k-fold cross validation dimension-based generalization bound bayesian
information criterion

model selection
21

q=1 q=2 q=∞


p=1

p=2

p=∞

    All human beings have three lives: public, private, and secret.
    — gabriel garcía marquez

taking stock so far — By model selection we mean the selection of all those design parameters — featurization and other 'architecture', optimization method.
The story of model selection has to do with approximation, optimization, and generalization.
grid/random search —
selecting prior strength —
overfitting on a validation set —
22

4. generalization bounds

    A foreign philosopher rides a train in Scotland. Looking out the window, they see a black sheep; they exclaim: "wow! in Scotland at least one side of one sheep is black!"
    — unknown

dot products and generalization —

hypothesis-geometry bounds — Suppose we are doing binary linear classification with N training samples of dimension d < N. Then with probability at least 1 − η the gen gap is at most:

    sqrt( (d log(6N/d) + log(4/η)) / N )

For example, with d = 16 features and tolerance η = 1/1000, we can achieve a gen. gap of less than 5% once we have more than N ≈ 64000 samples. This is pretty lousy. It's a worst case bound in the sense that it doesn't make any assumptions about how orderly or gnarly the data is.
If we normalize so that ‖x_i‖ ≤ R and we insist on classifiers with margin at least 0 < m ≤ R, then we may replace d by ⌈1 + (R/m)²⌉ if we wish, so long as we count each margin-violator as a training error, even if it is correctly classified. CHECK ABOVE!
Thus, if R = 1 and 1 ≤ Nm then with chance at least 1 − η:

    testing error ≤ (number of margin violators)/N + sqrt( ((2/m²) log(6Nm²) + log(4/η)) / N )
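
A quick numeric check of the first bound's example numbers (d = 16, η = 1/1000, N = 64000):

import numpy as np
def gen_gap_bound(d, N, eta):
    return np.sqrt((d*np.log(6*N/d) + np.log(4/eta)) / N)
print(gen_gap_bound(16, 64000, 1e-3))   # roughly 0.05, i.e. about a 5% generalization gap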
dimension/margin
optimization-based bounds — Another way to estimate testing error is through leave-one-out cross validation (LOOCV). This requires sacrifice of a single training point in the sense that we need N + 1 data points to do LOOCV for an algorithm that learns from N training points. The idea is that after training on the N, the testing-accuracy-of-hypotheses-learned-from-a-random-training-sample is unbiasedly estimated by the learned hypothesis's accuracy on the remaining data point. This is a very coarse, high-variance estimate. To address the variance, we can average over all N + 1 choices◦ of which data point to remove from the training set. When the different estimates are sufficiently uncorrelated, this drastically reduces the variance of our estimate. ← In principle, LOOCV requires training our model N + 1 many times; we'll soon see ways around this for the models we've talked about.
Our key to establishing sufficient un-correlation lies in algorithmic stability: the hypothesis shouldn't change too much as a function of small changes to the training set; thus, most of the variance in each LOOCV estimate is due to the testing points, which by assumption are independent.
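
Here is a minimal sketch of the LOOCV procedure for a generic learner (train and error are hypothetical helpers: train maps a list of examples to a hypothesis; error measures that hypothesis's misclassification rate on examples):

import numpy as np
def loocv_error(examples, train, error):
    estimates = []
    for i in range(len(examples)):
        held_out = [examples[i]]
        rest = examples[:i] + examples[i+1:]
        h = train(rest)                          # learn from all but one point...
        estimates.append(error(held_out, h))     # ...and test on the point held out
    return np.mean(estimates)                    # average over all choices of held-out point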
If all xs, train or test, have length at most R, then we have that with chance at least 1 − η:

    E_S [testing error] ≤ E_S [LOOCV error] + sqrt( (1 + 6R²/λ) / (2Nη) )

bayes and testing —


23

C. bend those lines to capture rich patterns (units 2,3)


featurization
24

C. nonlinearities
0. fixed featurizations

    Doing ensembles and shows is one thing, but being able to front a feature is totally different. ... there's something about ... a feature that's unique.
    — michael b. jordan

sketching —
sensitivity analysis —
double descent —
25

6. kernels enrich approximations

    ... animals are divided into (a) those belonging to the emperor; (b) embalmed ones; (c) trained ones; (d) suckling pigs; (e) mermaids; (f) fabled ones; (g) stray dogs; (h) those included in this classification; (i) those that tremble as if they were mad; (j) innumerable ones; (k) those drawn with a very fine camel hair brush; (l) et cetera; (m) those that have just broken the vase; and (n) those that from afar look like flies.
    — jorge luis borges

features as pre-processing —
abstracting to dot products —
kernelized perceptron and svm —
kernelized logistic regression —
learned featurizations —

1. learned featurizations
imagining the space of feature tuples — We'll focus on an architecture of the form

    p̂(y = +1 | x) = (σ_{1×1} ∘ A_{1×(h+1)} ∘ f_{(h+1)×h} ∘ B_{h×d})(x)

where A, B are linear maps with the specified (output × input) dimensions, where σ is the familiar sigmoid operation, and where f applies the leaky relu function elementwise and concatenates a 1:

    f((v_i : 0 ≤ i < h)) = (1,) ++ (lrelu(v_i) : 0 ≤ i < h)        lrelu(z) = max(z/10, z)

We call h the hidden dimension of the model. Intuitively, f ∘ B re-featurizes the input to a form more linearly separable (by weight vector A).
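
In code, a forward pass through this architecture might look like the following sketch (shapes as in the formula: B is h×d, A has length h+1):

import numpy as np
def lrelu(z): return np.maximum(z/10., z)
def f(v):     return np.concatenate([[1.], lrelu(v)])     # leaky relu elementwise, then prepend a 1
def predict_prob(x, A, B):
    sigma = lambda z: 1./(1.+np.exp(-z))
    return sigma(A @ f(B @ x))                             # = sigma(A(f(B x)))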
the featurization layer’s learning signal —
expressivity and local minima —
“representer theorem” —

locality and symmetry in architecture

2. multiple layers
We can continue alternating learned linearities with fixed nonlinearities:

    p̂(y = +1 | x) = (σ_{1×1} ∘ A_{1×(h″+1)} ∘ f_{(h″+1)×h″} ∘ B_{h″×(h′+1)}
                                 ∘ f_{(h′+1)×h′} ∘ C_{h′×(h+1)}
                                 ∘ f_{(h+1)×h} ∘ D_{h×d})(x)

feature hierarchies —
bottlenecking —
highways —

3. architecture and wishful thinking


representation learning —

4. architecture and symmetry

    About to speak at [conference]. Spilled Coke on left leg of jeans, so poured some water on right leg so looks like the denim fade.
    — tony hsieh

dependencies in architecture
26

5. stochastic gradient descent

    The key to success is failure.
    — michael j. jordan

6. loss landscape shape

    The virtue of maps, they show what can be done with limited space, they foresee that everything can happen therein.
    — josé saramago
27

D. thicken those lines to quantify uncertainty (unit 4)


bayesian models

examples of bayesian models

inference algorithms for bayesian models

combining with deep learning


28

E. beyond learning-from-examples (unit 5)


reinforcement ; bandits

state (dependence on prev xs and on actions ) ; RL ; partial observations

deep q learning

learning-from-instructions ; farewell
