
The Learning Theory

Lorenzo Servadei, Sebastian Schober, Daniela Lopera, Wolfgang Ecker


Agenda: The Learning Theory

• The Learning Problem

• Is Learning Feasible

• Error and Noise

• Theory of Generalization

• The VC Dimension

• Bias-Variance Tradeoff

• Overfitting

27.04.2023 2
Predict how a viewer will rate a movie

The essence of Machine Learning:

• A pattern exists.

• We cannot pin it down mathematically.

• We have data on it.

The Components of Learning

Metaphor: Credit approval

Applicant information:

  age                 23 years
  gender              male
  annual salary       $30,000
  years in residence  1 year
  years in job        1 year
  current debt        $15,000
  ···                 ···

Approve credit?
The Components of Learning
Formalization:

• Input: x (customer application)

• Output: y (good/bad customer?)

• Target function: f : X → Y (ideal credit approval formula)

• Data: (x1, y1), (x2, y2), ··· , (xN, yN) (historical records)

• Hypothesis: g : X → Y (formula to be used)
The Components of Learning

UNKNOWN TARGET FUNCTION f : X → Y
(ideal credit approval function)
        ↓
TRAINING EXAMPLES (x1, y1), ... , (xN, yN)
(historical records of credit customers)
        ↓
LEARNING ALGORITHM A  ←  HYPOTHESIS SET H
                         (set of candidate formulas)
        ↓
FINAL HYPOTHESIS g ≈ f
(final credit approval formula)
Solution Components

The 2 solution components of the learning problem:

• The Hypothesis Set

  H = {h},  g ∈ H

• The Learning Algorithm

  A

Together, they are referred to as the learning model.
A simple Hypothesis Set – the Perceptron

For input x = (x1, ··· , xd) 'attributes of a customer':

Approve credit if  Σ_{i=1}^{d} w_i x_i > threshold,

Deny credit if     Σ_{i=1}^{d} w_i x_i < threshold.

This linear formula h ∈ H can be written as

h(x) = sign( Σ_{i=1}^{d} w_i x_i − threshold )
A simple Hypothesis Set – the Perceptron

h(x) = sign( Σ_{i=1}^{d} w_i x_i + w0 ),  where w0 = −threshold

Introduce an artificial coordinate x0 = 1:

h(x) = sign( Σ_{i=0}^{d} w_i x_i )

The perceptron separates 'linearly separable' data into a + region and a − region.

In vector form, the perceptron implements:

h(x) = sign(wᵀx)
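A minimal sketch of the perceptron hypothesis in Python; the weight values and inputs below are illustrative, not taken from the slides:

```python
import numpy as np

def perceptron(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x).

    x is augmented with the artificial coordinate x0 = 1,
    so w[0] plays the role of the bias / negative threshold.
    """
    x_aug = np.concatenate(([1.0], x))          # prepend x0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1

# Illustrative weights: w0 first, then d = 2 attribute weights
w = np.array([-1.0, 0.5, 0.5])
print(perceptron(w, np.array([3.0, 3.0])))      # -1 + 1.5 + 1.5 > 0 → +1
print(perceptron(w, np.array([0.0, 0.0])))      # -1 < 0 → -1
```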
A simple learning Algorithm – the PLA

The perceptron implements

h(x) = sign(wᵀx)

Given the training set:

(x1, y1), (x2, y2), ··· , (xN, yN)

pick a misclassified point:

sign(wᵀxn) ≠ yn

and update the weight vector:

w ← w + yn xn
Iterations of the PLA

• One iteration of the PLA:

  w ← w + yx

  where (x, y) is a misclassified training point.

• At iteration t = 1, 2, 3, ··· , pick a misclassified point from

  (x1, y1), (x2, y2), ··· , (xN, yN)

  and run a PLA iteration on it.

• That's it!
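The update rule above can be sketched as a complete PLA loop; the toy data set is an illustrative linearly separable example, not from the slides:

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm: w <- w + y_n x_n on a misclassified point.

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    Returns the weight vector (bias w0 first). Converges if data is separable.
    """
    X_aug = np.hstack([np.ones((len(X), 1)), X])   # artificial coordinate x0 = 1
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_iters):
        preds = np.sign(X_aug @ w)
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:                # all points correct: done
            return w
        n = misclassified[0]                       # pick a misclassified point
        w = w + y[n] * X_aug[n]                    # PLA update
    return w

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
X_aug = np.hstack([np.ones((len(X), 1)), X])
print(np.sign(X_aug @ w))   # all training points classified correctly
```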
Related Experiment

• Consider a 'bin' with red and green marbles.

  P[ picking a red marble ] = µ

  P[ picking a green marble ] = 1 − µ

• The value of µ is unknown to us.

• We pick N marbles independently.

• The fraction of red marbles in the sample = ν

(In the picture: µ = probability of red marbles in the BIN; ν = fraction of red marbles in the SAMPLE.)
Does ν say anything about µ?

No!
The sample can be mostly green while the bin is mostly red.

Yes!
The sample frequency ν is likely close to the bin frequency µ.

Possible versus probable
What does ν say about µ?

In a big sample (large N), ν is probably close to µ (within ϵ).

Formally,

P[ |ν − µ| > ϵ ] ≤ 2e^(−2ϵ²N)

This is called Hoeffding's Inequality. The event |ν − µ| > ϵ is the bad event, ϵ is the tolerance, and 1 minus the right-hand side is the confidence.

In other words, the statement µ = ν is P.A.C. (probably approximately correct).

http://cs229.stanford.edu/extra-notes/hoeffding.pdf
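The inequality can be checked empirically; a minimal sketch simulating the marble bin, where µ, N, ϵ and the trial count are illustrative choices, not values from the slides:

```python
import math
import random

# Estimate P[|nu - mu| > eps] by simulation and compare to the Hoeffding bound.
random.seed(0)
mu, N, eps, trials = 0.6, 100, 0.1, 20000

bad = 0
for _ in range(trials):
    reds = sum(random.random() < mu for _ in range(N))   # draw N marbles
    nu = reds / N                                        # sample frequency
    if abs(nu - mu) > eps:
        bad += 1

empirical = bad / trials
bound = 2 * math.exp(-2 * eps ** 2 * N)                  # 2 e^{-2 eps^2 N}
print(empirical, "<=", round(bound, 3))
```

The empirical bad-event rate stays below the bound; note the bound does not use µ at all.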
What does ν say about µ?

• Valid for all N and ϵ

• Bound does not depend on µ

• Tradeoff: N, ϵ, and the bound.

• ν ≈ µ  ⇒  µ ≈ ν
Connection to learning?

The bin analogy: the inputs x1, ... , xN of the training examples are now generated by a probability distribution P on X, added to the learning diagram (unknown target function f : X → Y, learning algorithm A, hypothesis set H, final hypothesis g ≈ f).
Connection to learning?

Bin: the unknown is a number µ

Learning: the unknown is a function f : X → Y

Each marble is a point x ∈ X

Green marble: the hypothesis got it right, h(x) = f(x)

Red marble: the hypothesis got it wrong, h(x) ≠ f(x)
In sample – Out of sample Error

Both µ and ν depend on which hypothesis h:

ν is 'in sample', denoted by Ein(h)

µ is 'out of sample', denoted by Eout(h)

The Hoeffding inequality becomes:

P[ |Ein(h) − Eout(h)| > ϵ ] ≤ 2e^(−2ϵ²N)
Multiple Bins

Generalizing the bin model to more than one hypothesis: one bin per hypothesis h1, h2, ... , hM, each with its own µm and νm.
Multiple Bins

Not so fast!! Hoeffding doesn't apply to multiple bins.

What?

In the learning diagram, each hypothesis hm now has its own pair of errors: an out-of-sample error Eout(hm) (the bin) and an in-sample error Ein(hm) (the sample), for m = 1, ... , M.
Multiple Bins

Question: If you toss a fair coin 10 times, what is the probability that you will get 10 heads?

Answer: ≈ 0.1%

Question: If you toss 1000 fair coins 10 times each, what is the probability that some coin will get 10 heads?

Answer: ≈ 63%
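Both answers follow from a two-line computation:

```python
# Check the two coin-tossing probabilities from the slide.
p_one = 0.5 ** 10                     # one coin: 10 heads in 10 tosses
p_some = 1 - (1 - p_one) ** 1000      # 1000 coins: at least one gets 10 heads
print(round(p_one * 100, 2), "%")     # ≈ 0.1 %
print(round(p_some * 100, 1), "%")    # ≈ 62.4 %, i.e. about 63 %
```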
From Coins to Learning

With many hypotheses hi, some hypothesis can look perfect on the sample purely by chance, BINGO?, just as some coin among 1000 will get 10 heads.
A simple Solution – Union Bound

P[ |Ein(g) − Eout(g)| > ϵ ]
  ≤ P[ |Ein(h1) − Eout(h1)| > ϵ  or  ···  or  |Ein(hM) − Eout(hM)| > ϵ ]
  ≤ Σ_{m=1}^{M} P[ |Ein(hm) − Eout(hm)| > ϵ ]     ← Focus on this case! Important afterwards
Final Verdict

STOP – Important One!

P[ |Ein(g) − Eout(g)| > ϵ ] ≤ 2M e^(−2ϵ²N)
The learning Diagram – where we left it

Data Distribution

The learning Diagram – where we left it

What does "h ≈ f" mean?

Error measure: E(h, f)

Almost always a pointwise definition: e(h(x), f(x))

Examples:

Squared error: e(h(x), f(x)) = (h(x) − f(x))²

Binary error: e(h(x), f(x)) = [h(x) ≠ f(x)]
Overall Error

Overall error E(h, f) = average of pointwise errors e(h(x), f(x))

In-sample error:

Ein(h) = (1/N) Σ_{n=1}^{N} e(h(xn), f(xn))

Out-of-sample error:

Eout(h) = E_x[ e(h(x), f(x)) ]
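A minimal sketch of the in-sample error with both pointwise measures; the hypothesis and data set below are illustrative, not from the slides:

```python
import numpy as np

def e_squared(hx, fx):
    return (hx - fx) ** 2           # squared pointwise error

def e_binary(hx, fx):
    return float(hx != fx)          # binary pointwise error

def E_in(h, X, y, e):
    """E_in(h) = (1/N) * sum of pointwise errors over the data set."""
    return np.mean([e(h(x), fx) for x, fx in zip(X, y)])

h = lambda x: np.sign(x)            # a toy hypothesis
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, 1.0, 1.0, 1.0]) # h disagrees with y only at x = -1

print(E_in(h, X, y, e_binary))      # 1 error out of 4 → 0.25
print(E_in(h, X, y, e_squared))     # (-1 - 1)^2 / 4 → 1.0
```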
Diagram with pointwise error

How to Choose the Error Measure

Fingerprint Verification:

Two types of error: false accept and false reject

f = +1: you;  f = −1: intruder

How do we penalize each type?

              f = +1         f = −1
  h = +1     no error      false accept
  h = −1   false reject     no error
The Error Measure – In the Supermarket

Supermarket verifies fingerprint for discounts.

False reject is costly: the customer gets annoyed!

False accept is minor: you gave away a discount, and the intruder left their fingerprint.

              f = +1   f = −1
  h = +1        0        1
  h = −1       10        0
The Error Measure – CIA

CIA verifies fingerprint for security.

False accept is a disaster!

False reject can be tolerated: try again, you are an employee.

              f = +1   f = −1
  h = +1        0       1000
  h = −1        1        0
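The two cost matrices lead to different overall errors for the same decisions; a sketch, where the decisions h_out and truths f_out are illustrative:

```python
import numpy as np

def weighted_error(h_out, f_out, cost):
    """Cost-weighted average error; cost[h][f] with +1 → index 0, -1 → index 1."""
    idx = lambda label: 0 if label == 1 else 1
    return np.mean([cost[idx(h)][idx(f)] for h, f in zip(h_out, f_out)])

supermarket = [[0, 1],     # h=+1: false accept costs 1
               [10, 0]]    # h=-1: false reject costs 10
cia = [[0, 1000],          # false accept costs 1000
       [1, 0]]             # false reject costs 1

h_out = [1, 1, -1, -1]     # illustrative decisions
f_out = [1, -1, 1, -1]     # one false accept, one false reject

print(weighted_error(h_out, f_out, supermarket))  # (0 + 1 + 10 + 0)/4 = 2.75
print(weighted_error(h_out, f_out, cia))          # (0 + 1000 + 1 + 0)/4 = 250.25
```

The same mistakes cost 2.75 on average in the supermarket but 250.25 at the CIA, which is why the error measure should come from the user.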
Take Home Lesson

The error measure should be specified by the user.

Not always possible. Alternatives:

Some measures: squared error, cross entropy

Noisy Targets

The 'target function' is not always a function.

Consider the credit card approval:

  age                 23 years
  annual salary       $30,000
  years in residence  1 year
  years in job        1 year
  current debt        $15,000
  ···                 ···

two 'identical' customers → two different behaviors
Target Distribution

Instead of y = f(x), we use a target distribution:

P(y|x)

(x, y) is now generated by the joint distribution:

P(x) P(y|x)

Noisy target = deterministic target f(x) = E(y|x) plus noise y − f(x)

A deterministic target is a special case of a noisy target:

P(y|x) is zero except for y = f(x)
The learning Diagram + Noisy Targets

Distinction between P(y|x) and P(x)

Both convey probabilistic aspects of x and y.

The target distribution P(y|x) is what we are trying to learn.

The input distribution P(x) quantifies the relative importance of x.

Merging P(x) P(y|x) as P(x, y) mixes the two concepts.
What we know so far

Learning is feasible. It is likely that

Eout(g) ≈ Ein(g)

Is this learning?

We need g ≈ f, which means

Eout(g) ≈ 0
2 Questions of Learning

Eout(g) ≈ 0 is achieved through:

Eout(g) ≈ Ein(g)  and  Ein(g) ≈ 0

Learning is thus split into 2 questions:

1. Can we make sure that Eout(g) is close enough to Ein(g)?

2. Can we make Ein(g) small enough?
Training Setting

Where did M come from?

The bad events Bm = "|Ein(hm) − Eout(hm)| > ϵ": the union bound adds up P[B1], P[B2], P[B3], ··· as if the events did not overlap.
Can we improve M?

The bad events overlap heavily: for similar hypotheses, Eout changes very little from one hypothesis to the next, so the union bound is very loose.
Can we replace M with mH(N)?

Instead of the whole input space,

we consider a finite set of input points,

and count the number of dichotomies.
Dichotomies: mini-hypotheses

A hypothesis:  h : X → {−1, +1}

A dichotomy:  h : {x1, x2, ··· , xN} → {−1, +1}

Number of hypotheses |H| can be infinite

Number of dichotomies |H(x1, x2, ··· , xN)| is at most 2^N

Candidate for replacing M
The growth function

The growth function counts the most dichotomies on any N points:

mH(N) = max_{x1,···,xN ∈ X} |H(x1, ··· , xN)|

The growth function satisfies:

mH(N) ≤ 2^N

Let's apply the definition.
Applying the mH(N) definition – perceptrons

For 2D perceptrons: with N = 3 points in general position, all 8 dichotomies can be realized; with N = 4, no placement achieves all 16, at most 14.

mH(3) = 8        mH(4) = 14
Positive Rays

h(x) = −1 for x < a,  h(x) = +1 for x > a

        x1   x2   x3   ···   xN

H is the set of h : R → {−1, +1}

h(x) = sign(x − a)        mH(N) = N + 1
Positive Intervals

h(x) = −1, then h(x) = +1 inside the interval, then h(x) = −1 again

        x1   x2   x3   ···   xN

H is the set of h : R → {−1, +1}

Place interval ends in two of N + 1 spots.
Convex sets

H is the set of h : R² → {−1, +1}

h(x) = +1 is a convex region

mH(N) = 2^N

The N points are 'shattered' by convex sets: place them on a circle, and any subset can be made the +1 region of a convex set.
The 3 growth functions

• H is positive rays:
  mH(N) = N + 1

• H is positive intervals:
  mH(N) = ½N² + ½N + 1

• H is convex sets:
  mH(N) = 2^N
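The first two formulas can be checked by brute force; a sketch that counts dichotomies directly (only the ordering of the points matters for rays and intervals, so fixed indices 0..N−1 stand in for x1 < ··· < xN):

```python
from itertools import combinations

def dichotomies_positive_rays(N):
    """h(x) = sign(x - a): the threshold a falls into one of N + 1 gaps."""
    dichs = set()
    for cut in range(N + 1):                    # points before the cut get -1
        dichs.add(tuple(-1 if i < cut else 1 for i in range(N)))
    return len(dichs)

def dichotomies_positive_intervals(N):
    """+1 inside the interval, -1 outside: ends in two of N + 1 spots."""
    dichs = set()
    for lo, hi in combinations(range(N + 1), 2):
        dichs.add(tuple(1 if lo <= i < hi else -1 for i in range(N)))
    dichs.add(tuple([-1] * N))                  # empty interval
    return len(dichs)

for N in range(1, 7):
    assert dichotomies_positive_rays(N) == N + 1              # N + 1
    assert dichotomies_positive_intervals(N) == N * (N + 1) // 2 + 1
print("growth-function formulas confirmed for N = 1..6")
```

Note that N(N+1)/2 + 1 is the integer form of ½N² + ½N + 1.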
Back to the big picture

Remember this inequality?

P[ |Ein − Eout| > ϵ ] ≤ 2M e^(−2ϵ²N)

What happens if mH(N) replaces M?

mH(N) polynomial ⇒ Good!

Just prove that mH(N) is polynomial?
Breakpoint of H

Definition:
If no data set of size k can be shattered by H, then k is a break point for H:

mH(k) < 2^k

For 2D perceptrons, k = 4.

A bigger data set cannot be shattered either.
Breakpoints, 3 Examples

• H is positive rays:
  mH(N) = N + 1            break point k = 2

• H is positive intervals:
  mH(N) = ½N² + ½N + 1     break point k = 3

• H is convex sets:
  mH(N) = 2^N              break point k = '∞'
Main Result

No break point ⇒ mH(N) = 2^N

Any break point ⇒ mH(N) is polynomial in N
Putting it all together

Not quite:

P[ |Ein(g) − Eout(g)| > ϵ ] ≤ 2 mH(N) e^(−2ϵ²N)

But more:

P[ |Ein(g) − Eout(g)| > ϵ ] ≤ 4 mH(2N) e^(−(1/8)ϵ²N)

The Vapnik-Chervonenkis Inequality

https://web.eecs.umich.edu/~cscott/past_courses/eecs598w14/notes/05_vc_theory.pdf
Definition of the VC dimension

The VC dimension of a hypothesis set H, denoted by dvc(H), is the largest value of N for which

mH(N) = 2^N

(the most points H can shatter).

N ≤ dvc(H) ⇒ H can shatter some N points

k > dvc(H) ⇒ k is a break point for H
Examples

• H is positive rays: dvc = 1

• H is 2D perceptrons: dvc = 3

• H is convex sets: dvc = ∞
VC Dimension and Learning

dvc(H) is finite ⇒ g ∈ H will generalize

• Independent of the learning algorithm

• Independent of the input distribution

• Independent of the target function
Degrees of Freedom

Parameters create degrees of freedom.

# of parameters: analog degrees of freedom

dvc: equivalent 'binary' degrees of freedom
Not just parameters

Parameters may not contribute degrees of freedom.

dvc measures the effective number of parameters.
Number of Data Points needed

Two small quantities in the VC inequality:

P[ |Ein − Eout| > ϵ ] ≤ 4 mH(2N) e^(−(1/8)ϵ²N)

(ϵ is the tolerance; call the right-hand side δ.)

If we want a certain ϵ and δ, how does N depend on dvc?

Let us look at N^d e^(−N)
Visual Representation

Fix N^d e^(−N) = small value. How does N change with d?

(Plot: N^d e^(−N) versus N on a log scale, e.g. N^30 e^(−N), for N from 20 to 200.)

Rule of thumb:

N ≥ 10 dvc
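The dependence of N on dvc can be sketched numerically: find the smallest N that drives the VC bound below δ, using the polynomial proxy mH(2N) ≈ (2N)^dvc (an assumption for illustration; a standard bound is mH(2N) ≤ (2N)^dvc + 1). The ϵ and δ values are illustrative:

```python
import math

def n_needed(dvc, eps, delta, n_max=10**6):
    """Smallest N with 4 * (2N)^dvc * exp(-eps^2 * N / 8) <= delta."""
    for N in range(1, n_max):
        bound = 4 * (2 * N) ** dvc * math.exp(-eps ** 2 * N / 8)
        if bound <= delta:
            return N
    return None

for dvc in (3, 4, 5):
    print(dvc, n_needed(dvc, eps=0.1, delta=0.1))
```

The required N grows roughly proportionally with dvc, which is the intuition behind the N ≥ 10 dvc rule of thumb.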
Rearranging Things

Start from the VC inequality:

P[ |Eout − Ein| > ϵ ] ≤ 4 mH(2N) e^(−(1/8)ϵ²N) = δ

Get ϵ in terms of δ:

ϵ = sqrt( (8/N) ln( 4 mH(2N) / δ ) ) =: Ω(N, H, δ)

With probability ≥ 1 − δ,  |Eout − Ein| ≤ Ω(N, H, δ)
Generalization Bound

With probability ≥ 1 − δ,  |Eout − Ein| ≤ Ω

With probability ≥ 1 − δ,  Eout ≤ Ein + Ω
Generalization Bound

With probability ≥ 1 − δ,  Eout ≤ Ein + Ω

The penalty term Ω is the subject of regularization.
Approximation and Generalization Tradeoff

Small Eout: good approximation of f out of sample.

More complex H ⇒ better chance of approximating f

Less complex H ⇒ better chance of generalizing out of sample

Ideal: H = {f}, which is like holding a winning lottery ticket.
Quantifying the Tradeoff

VC analysis was one approach: Eout ≤ Ein + Ω

Bias-variance analysis is another: decomposing Eout into

1. How well H can approximate f

2. How well we can zoom in on a good h ∈ H

Applies to real-valued targets and uses squared error.
Starting with Eout

To evaluate:

E_D[ Eout( g^(D) ) ]

we compute the 'average' hypothesis ḡ(x):

ḡ(x) = E_D[ g^(D)(x) ]
Bias and Variance

E_D[ Eout( g^(D) ) ] = E_x[ bias(x) + var(x) ]

bias(x) = ( ḡ(x) − f(x) )²

var(x) = E_D[ ( g^(D)(x) − ḡ(x) )² ]
The Tradeoff

Take the expectation with respect to x, and obtain bias and variance:

E_D[ Eout ] = bias + var

More complex H:  bias ↓  var ↑

Less complex H:  bias ↑  var ↓
Example: Sine Target

f : [−1, 1] → R,  f(x) = sin(πx)

Only two training examples!  N = 2

Two models used for learning:

H0: h(x) = b

H1: h(x) = ax + b

Which is better, H0 or H1?
Approximation – H0 versus H1

Best approximation of the whole target by each model:

H0: Eout = 0.50        H1: Eout = 0.20
Learning – H0 versus H1

(Each model fit to a random data set of N = 2 points; the resulting hypotheses vary from data set to data set.)
Bias and Variance – H0

(ḡ(x) versus sin(πx), and the spread of hypotheses around ḡ(x).)
Bias and Variance – H1

(ḡ(x) versus sin(πx), and the spread of hypotheses around ḡ(x).)
And the winner is ...

H0:  bias = 0.50,  var = 0.25  →  expected Eout ≈ 0.75

H1:  bias = 0.21,  var = 1.69  →  expected Eout ≈ 1.90

The winner is H0.
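The sine experiment can be sketched as a simulation: draw many data sets of N = 2 points, fit both models, and estimate bias and variance on a grid. The number of trials, the grid, and the seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
trials, grid = 20000, np.linspace(-1, 1, 201)
f = np.sin(np.pi * grid)                       # target on the grid

g0 = np.empty((trials, grid.size))             # H0 fits: h(x) = b
g1 = np.empty((trials, grid.size))             # H1 fits: h(x) = a x + b
for t in range(trials):
    x = rng.uniform(-1, 1, 2)                  # two training points
    y = np.sin(np.pi * x)
    g0[t] = np.mean(y)                         # least-squares constant
    a = (y[1] - y[0]) / (x[1] - x[0])          # line through the 2 points
    g1[t] = a * grid + (y[0] - a * x[0])

for name, g in (("H0", g0), ("H1", g1)):
    g_bar = g.mean(axis=0)                     # average hypothesis g_bar(x)
    bias = np.mean((g_bar - f) ** 2)
    var = np.mean(g.var(axis=0))
    print(name, "bias ≈", round(bias, 2), "var ≈", round(var, 2))
```

With enough trials this reproduces the slide's numbers, roughly bias ≈ 0.50, var ≈ 0.25 for H0 and bias ≈ 0.21, var ≈ 1.69 for H1, so the simpler model wins on bias + var.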
Lesson Learned

Match the 'model complexity' to the data resources, not to the target complexity.
Expected Eout and Ein

Data set D of size N

Expected out-of-sample error: E_D[ Eout( g^(D) ) ]

Expected in-sample error: E_D[ Ein( g^(D) ) ]

How do they vary with N?
The curves

(Learning curves: expected error versus the number of data points N, for a simple model and a complex model.)

Simple Model: Ein and Eout converge quickly.

Complex Model: Ein and Eout converge more slowly, but to a lower error.
What the Theory will achieve

Characterizing the feasibility of learning for infinite M

Characterizing the tradeoff:

Model complexity ↑  ⇒  Ein ↓

Model complexity ↑  ⇒  Eout − Ein ↑

(Plot: in-sample error, model-complexity penalty, and out-of-sample error versus the VC dimension; the out-of-sample error is minimized at some d*vc.)
Sources

https://work.caltech.edu/telecourse.html

http://gruber.userweb.mwn.de/17.18.statlearn/17.18.statlearn.html
