
The Learning Theory

Lorenzo Servadei, Sebastian Schober, Daniela Lopera, Wolfgang Ecker


Agenda: The Learning Theory

• The Learning Problem

• Is Learning Feasible

• Error and Noise

• Theory of Generalization

• The VC Dimension

• Bias-Variance Tradeoff

• Overfitting

27.04.2023 2
Predict how a viewer will rate a movie

The essence of Machine Learning:

• A pattern exists.

• We cannot pin it down mathematically.

• We have data on it.

The Components of Learning

Metaphor: Credit approval

Applicant information:

  age                 23 years
  gender              male
  annual salary       $30,000
  years in residence  1 year
  years in job        1 year
  current debt        $15,000
  ···                 ···

Approve credit?
The Components of Learning
Formalization:

• Input: x (customer application)

• Output: y (good/bad customer?)

• Target function: f : X → Y (ideal credit approval formula)

• Data: (x1, y1), (x2, y2), ··· , (xN, yN) (historical records)

• Hypothesis: g : X → Y (formula to be used)
The Components of Learning

UNKNOWN TARGET FUNCTION f : X → Y
(ideal credit approval function)
        ↓
TRAINING EXAMPLES (x1, y1), ... , (xN, yN)
(historical records of credit customers)
        ↓
LEARNING ALGORITHM A  ←  HYPOTHESIS SET H
                         (set of candidate formulas)
        ↓
FINAL HYPOTHESIS g ≈ f
(final credit approval formula)
Solution Components

The 2 solution components of the learning problem:

• The Hypothesis Set

  H = {h},  g ∈ H

• The Learning Algorithm

  A

Together, they are referred to as the learning model.
A simple Hypothesis Set – the Perceptron

For input x = (x1, ··· , xd) 'attributes of a customer':

Approve credit if  Σ_{i=1}^{d} w_i x_i > threshold,

Deny credit if     Σ_{i=1}^{d} w_i x_i < threshold.

This linear formula h ∈ H can be written as

h(x) = sign( Σ_{i=1}^{d} w_i x_i − threshold )
A simple Hypothesis Set – the Perceptron

h(x) = sign( Σ_{i=1}^{d} w_i x_i + w0 ),  where w0 = −threshold

Introduce an artificial coordinate x0 = 1:

h(x) = sign( Σ_{i=0}^{d} w_i x_i )

The perceptron separates 'linearly separable' data into a + region and a − region.

In vector form, the perceptron implements:

h(x) = sign(wᵀx)
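A minimal sketch of the perceptron hypothesis in Python; the weight values and inputs below are illustrative, not taken from the slides:

```python
import numpy as np

def perceptron(w, x):
    """Perceptron hypothesis h(x) = sign(w^T x).

    x is augmented with the artificial coordinate x0 = 1,
    so w[0] plays the role of the bias / negative threshold.
    """
    x_aug = np.concatenate(([1.0], x))          # prepend x0 = 1
    return 1 if np.dot(w, x_aug) > 0 else -1

# Illustrative weights: w0 first, then d = 2 attribute weights
w = np.array([-1.0, 0.5, 0.5])
print(perceptron(w, np.array([3.0, 3.0])))      # -1 + 1.5 + 1.5 > 0 → +1
print(perceptron(w, np.array([0.0, 0.0])))      # -1 < 0 → -1
```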
A simple learning Algorithm – the PLA

The perceptron implements

h(x) = sign(wᵀx)

Given the training set:

(x1, y1), (x2, y2), ··· , (xN, yN)

pick a misclassified point:

sign(wᵀxn) ≠ yn

and update the weight vector:

w ← w + yn xn
Iterations of the PLA

• One iteration of the PLA:

  w ← w + yx

  where (x, y) is a misclassified training point.

• At iteration t = 1, 2, 3, ··· , pick a misclassified point from

  (x1, y1), (x2, y2), ··· , (xN, yN)

  and run a PLA iteration on it.

• That's it!
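The update rule above can be sketched as a complete PLA loop; the toy data set is an illustrative linearly separable example, not from the slides:

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm: w <- w + y_n x_n on a misclassified point.

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    Returns the weight vector (bias w0 first). Converges if data is separable.
    """
    X_aug = np.hstack([np.ones((len(X), 1)), X])   # artificial coordinate x0 = 1
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_iters):
        preds = np.sign(X_aug @ w)
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:                # all points correct: done
            return w
        n = misclassified[0]                       # pick a misclassified point
        w = w + y[n] * X_aug[n]                    # PLA update
    return w

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = pla(X, y)
X_aug = np.hstack([np.ones((len(X), 1)), X])
print(np.sign(X_aug @ w))   # all training points classified correctly
```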
Related Experiment

• Consider a 'bin' with red and green marbles.

  P[ picking a red marble ] = µ

  P[ picking a green marble ] = 1 − µ

• The value of µ is unknown to us.

• We pick N marbles independently.

• The fraction of red marbles in the sample = ν

(In the picture: µ = probability of red marbles in the BIN; ν = fraction of red marbles in the SAMPLE.)
Does ν say anything about µ?

No!
The sample can be mostly green while the bin is mostly red.

Yes!
The sample frequency ν is likely close to the bin frequency µ.

Possible versus probable
What does ν say about µ?

In a big sample (large N), ν is probably close to µ (within ϵ).

Formally,

P[ |ν − µ| > ϵ ] ≤ 2e^(−2ϵ²N)

This is called Hoeffding's Inequality. The event |ν − µ| > ϵ is the bad event, ϵ is the tolerance, and 1 minus the right-hand side is the confidence.

In other words, the statement µ = ν is P.A.C. (probably approximately correct).

http://cs229.stanford.edu/extra-notes/hoeffding.pdf
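The inequality can be checked empirically; a minimal sketch simulating the marble bin, where µ, N, ϵ and the trial count are illustrative choices, not values from the slides:

```python
import math
import random

# Estimate P[|nu - mu| > eps] by simulation and compare to the Hoeffding bound.
random.seed(0)
mu, N, eps, trials = 0.6, 100, 0.1, 20000

bad = 0
for _ in range(trials):
    reds = sum(random.random() < mu for _ in range(N))   # draw N marbles
    nu = reds / N                                        # sample frequency
    if abs(nu - mu) > eps:
        bad += 1

empirical = bad / trials
bound = 2 * math.exp(-2 * eps ** 2 * N)                  # 2 e^{-2 eps^2 N}
print(empirical, "<=", round(bound, 3))
```

The empirical bad-event rate stays below the bound; note the bound does not use µ at all.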
What does ν say about µ?

• Valid for all N and ϵ

• Bound does not depend on µ

• Tradeoff: N, ϵ, and the bound.

• ν ≈ µ  ⇒  µ ≈ ν
Connection to learning?

The bin analogy: the inputs x1, ... , xN of the training examples are now generated by a probability distribution P on X, added to the learning diagram (unknown target function f : X → Y, learning algorithm A, hypothesis set H, final hypothesis g ≈ f).
Connection to learning?

Bin: the unknown is a number µ

Learning: the unknown is a function f : X → Y

Each marble is a point x ∈ X

Green marble: the hypothesis got it right, h(x) = f(x)

Red marble: the hypothesis got it wrong, h(x) ≠ f(x)
In sample – Out of sample Error

Both µ and ν depend on which hypothesis h:

ν is 'in sample', denoted by Ein(h)

µ is 'out of sample', denoted by Eout(h)

The Hoeffding inequality becomes:

P[ |Ein(h) − Eout(h)| > ϵ ] ≤ 2e^(−2ϵ²N)
Multiple Bins

Generalizing the bin model to more than one hypothesis: one bin per hypothesis h1, h2, ... , hM, each with its own µm and νm.
Multiple Bins

Not so fast!! Hoeffding doesn't apply to multiple bins.

What?

In the learning diagram, each hypothesis hm now has its own pair of errors: an out-of-sample error Eout(hm) (the bin) and an in-sample error Ein(hm) (the sample), for m = 1, ... , M.
Multiple Bins

Question: If you toss a fair coin 10 times, what is the probability that you will get 10 heads?

Answer: ≈ 0.1%

Question: If you toss 1000 fair coins 10 times each, what is the probability that some coin will get 10 heads?

Answer: ≈ 63%
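Both answers follow from a two-line computation:

```python
# Check the two coin-tossing probabilities from the slide.
p_one = 0.5 ** 10                     # one coin: 10 heads in 10 tosses
p_some = 1 - (1 - p_one) ** 1000      # 1000 coins: at least one gets 10 heads
print(round(p_one * 100, 2), "%")     # ≈ 0.1 %
print(round(p_some * 100, 1), "%")    # ≈ 62.4 %, i.e. about 63 %
```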
From Coins to Learning

With many hypotheses hi, some hypothesis can look perfect on the sample purely by chance, BINGO?, just as some coin among 1000 will get 10 heads.
A simple Solution – Union Bound

P[ |Ein(g) − Eout(g)| > ϵ ]
  ≤ P[ |Ein(h1) − Eout(h1)| > ϵ  or  ···  or  |Ein(hM) − Eout(hM)| > ϵ ]
  ≤ Σ_{m=1}^{M} P[ |Ein(hm) − Eout(hm)| > ϵ ]     ← Focus on this case! Important afterwards
Final Verdict

STOP – Important One!

P[ |Ein(g) − Eout(g)| > ϵ ] ≤ 2M e^(−2ϵ²N)
The learning Diagram – where we left it

Data Distribution

The learning Diagram – where we left it

What does "h ≈ f" mean?

Error measure: E(h, f)

Almost always a pointwise definition: e(h(x), f(x))

Examples:

Squared error: e(h(x), f(x)) = (h(x) − f(x))²

Binary error: e(h(x), f(x)) = [h(x) ≠ f(x)]
Overall Error

Overall error E(h, f) = average of pointwise errors e(h(x), f(x))

In-sample error:

Ein(h) = (1/N) Σ_{n=1}^{N} e(h(xn), f(xn))

Out-of-sample error:

Eout(h) = E_x[ e(h(x), f(x)) ]
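A minimal sketch of the in-sample error with both pointwise measures; the hypothesis and data set below are illustrative, not from the slides:

```python
import numpy as np

def e_squared(hx, fx):
    return (hx - fx) ** 2           # squared pointwise error

def e_binary(hx, fx):
    return float(hx != fx)          # binary pointwise error

def E_in(h, X, y, e):
    """E_in(h) = (1/N) * sum of pointwise errors over the data set."""
    return np.mean([e(h(x), fx) for x, fx in zip(X, y)])

h = lambda x: np.sign(x)            # a toy hypothesis
X = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, 1.0, 1.0, 1.0]) # h disagrees with y only at x = -1

print(E_in(h, X, y, e_binary))      # 1 error out of 4 → 0.25
print(E_in(h, X, y, e_squared))     # (-1 - 1)^2 / 4 → 1.0
```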
Diagram with pointwise error

How to Choose the Error Measure

Fingerprint Verification:

Two types of error: false accept and false reject

f = +1: you;  f = −1: intruder

How do we penalize each type?

              f = +1         f = −1
  h = +1     no error      false accept
  h = −1   false reject     no error
The Error Measure – In the Supermarket

Supermarket verifies fingerprint for discounts.

False reject is costly: the customer gets annoyed!

False accept is minor: you gave away a discount, and the intruder left their fingerprint.

              f = +1   f = −1
  h = +1        0        1
  h = −1       10        0
The Error Measure – CIA

CIA verifies fingerprint for security.

False accept is a disaster!

False reject can be tolerated: try again, you are an employee.

              f = +1   f = −1
  h = +1        0       1000
  h = −1        1        0
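The two cost matrices lead to different overall errors for the same decisions; a sketch, where the decisions h_out and truths f_out are illustrative:

```python
import numpy as np

def weighted_error(h_out, f_out, cost):
    """Cost-weighted average error; cost[h][f] with +1 → index 0, -1 → index 1."""
    idx = lambda label: 0 if label == 1 else 1
    return np.mean([cost[idx(h)][idx(f)] for h, f in zip(h_out, f_out)])

supermarket = [[0, 1],     # h=+1: false accept costs 1
               [10, 0]]    # h=-1: false reject costs 10
cia = [[0, 1000],          # false accept costs 1000
       [1, 0]]             # false reject costs 1

h_out = [1, 1, -1, -1]     # illustrative decisions
f_out = [1, -1, 1, -1]     # one false accept, one false reject

print(weighted_error(h_out, f_out, supermarket))  # (0 + 1 + 10 + 0)/4 = 2.75
print(weighted_error(h_out, f_out, cia))          # (0 + 1000 + 1 + 0)/4 = 250.25
```

The same mistakes cost 2.75 on average in the supermarket but 250.25 at the CIA, which is why the error measure should come from the user.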
Take Home Lesson

The error measure should be specified by the user.

Not always possible. Alternatives:

Some measures: squared error, cross entropy

Noisy Targets

The 'target function' is not always a function.

Consider the credit card approval:

  age                 23 years
  annual salary       $30,000
  years in residence  1 year
  years in job        1 year
  current debt        $15,000
  ···                 ···

two 'identical' customers → two different behaviors
Target Distribution

Instead of y = f(x), we use a target distribution:

P(y|x)

(x, y) is now generated by the joint distribution:

P(x) P(y|x)

Noisy target = deterministic target f(x) = E(y|x) plus noise y − f(x)

A deterministic target is a special case of a noisy target:

P(y|x) is zero except for y = f(x)
The learning Diagram + Noisy Targets

Distinction between P(y|x) and P(x)

Both convey probabilistic aspects of x and y.

The target distribution P(y|x) is what we are trying to learn.

The input distribution P(x) quantifies the relative importance of x.

Merging P(x) P(y|x) as P(x, y) mixes the two concepts.
What we know so far

Learning is feasible. It is likely that

Eout(g) ≈ Ein(g)

Is this learning?

We need g ≈ f, which means

Eout(g) ≈ 0
2 Questions of Learning

Eout(g) ≈ 0 is achieved through:

Eout(g) ≈ Ein(g)  and  Ein(g) ≈ 0

Learning is thus split into 2 questions:

1. Can we make sure that Eout(g) is close enough to Ein(g)?

2. Can we make Ein(g) small enough?
Training Setting

Where did M come from?

The bad events Bm = "|Ein(hm) − Eout(hm)| > ϵ": the union bound adds up P[B1], P[B2], P[B3], ··· as if the events did not overlap.
Can we improve M?

The bad events overlap heavily: for similar hypotheses, Eout changes very little from one hypothesis to the next, so the union bound is very loose.
Can we replace M with mH(N)?

Instead of the whole input space,

we consider a finite set of input points,

and count the number of dichotomies.
Dichotomies: mini-hypotheses

A hypothesis:  h : X → {−1, +1}

A dichotomy:  h : {x1, x2, ··· , xN} → {−1, +1}

Number of hypotheses |H| can be infinite

Number of dichotomies |H(x1, x2, ··· , xN)| is at most 2^N

Candidate for replacing M
The growth function

The growth function counts the most dichotomies on any N points:

mH(N) = max_{x1,···,xN ∈ X} |H(x1, ··· , xN)|

The growth function satisfies:

mH(N) ≤ 2^N

Let's apply the definition.
Applying the mH(N) definition – perceptrons

For 2D perceptrons: with N = 3 points in general position, all 8 dichotomies can be realized; with N = 4, no placement achieves all 16, at most 14.

mH(3) = 8        mH(4) = 14
Positive Rays

h(x) = −1 for x < a,  h(x) = +1 for x > a

        x1   x2   x3   ···   xN

H is the set of h : R → {−1, +1}

h(x) = sign(x − a)        mH(N) = N + 1
Positive Intervals

h(x) = −1, then h(x) = +1 inside the interval, then h(x) = −1 again

        x1   x2   x3   ···   xN

H is the set of h : R → {−1, +1}

Place interval ends in two of N + 1 spots.
Convex sets

H is the set of h : R² → {−1, +1}

h(x) = +1 is a convex region

mH(N) = 2^N

The N points are 'shattered' by convex sets: place them on a circle, and any subset can be made the +1 region of a convex set.
The 3 growth functions

• H is positive rays:
  mH(N) = N + 1

• H is positive intervals:
  mH(N) = ½N² + ½N + 1

• H is convex sets:
  mH(N) = 2^N
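The first two formulas can be checked by brute force; a sketch that counts dichotomies directly (only the ordering of the points matters for rays and intervals, so fixed indices 0..N−1 stand in for x1 < ··· < xN):

```python
from itertools import combinations

def dichotomies_positive_rays(N):
    """h(x) = sign(x - a): the threshold a falls into one of N + 1 gaps."""
    dichs = set()
    for cut in range(N + 1):                    # points before the cut get -1
        dichs.add(tuple(-1 if i < cut else 1 for i in range(N)))
    return len(dichs)

def dichotomies_positive_intervals(N):
    """+1 inside the interval, -1 outside: ends in two of N + 1 spots."""
    dichs = set()
    for lo, hi in combinations(range(N + 1), 2):
        dichs.add(tuple(1 if lo <= i < hi else -1 for i in range(N)))
    dichs.add(tuple([-1] * N))                  # empty interval
    return len(dichs)

for N in range(1, 7):
    assert dichotomies_positive_rays(N) == N + 1              # N + 1
    assert dichotomies_positive_intervals(N) == N * (N + 1) // 2 + 1
print("growth-function formulas confirmed for N = 1..6")
```

Note that N(N+1)/2 + 1 is the integer form of ½N² + ½N + 1.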
Back to the big picture

Remember this inequality?

P[ |Ein − Eout| > ϵ ] ≤ 2M e^(−2ϵ²N)

What happens if mH(N) replaces M?

mH(N) polynomial ⇒ Good!

Just prove that mH(N) is polynomial?
Breakpoint of H

Definition:
If no data set of size k can be shattered by H, then k is a break point for H:

mH(k) < 2^k

For 2D perceptrons, k = 4.

A bigger data set cannot be shattered either.
Breakpoints, 3 Examples

• H is positive rays:
  mH(N) = N + 1            break point k = 2

• H is positive intervals:
  mH(N) = ½N² + ½N + 1     break point k = 3

• H is convex sets:
  mH(N) = 2^N              break point k = '∞'
Main Result

No break point ⇒ mH(N) = 2^N

Any break point ⇒ mH(N) is polynomial in N
Putting it all together

Not quite:

P[ |Ein(g) − Eout(g)| > ϵ ] ≤ 2 mH(N) e^(−2ϵ²N)

But more:

P[ |Ein(g) − Eout(g)| > ϵ ] ≤ 4 mH(2N) e^(−(1/8)ϵ²N)

The Vapnik-Chervonenkis Inequality

https://web.eecs.umich.edu/~cscott/past_courses/eecs598w14/notes/05_vc_theory.pdf
Definition of the VC dimension

The VC dimension of a hypothesis set H, denoted by dvc(H), is the largest value of N for which

mH(N) = 2^N

(the most points H can shatter).

N ≤ dvc(H) ⇒ H can shatter some N points

k > dvc(H) ⇒ k is a break point for H
Examples

• H is positive rays: dvc = 1

• H is 2D perceptrons: dvc = 3

• H is convex sets: dvc = ∞
VC Dimension and Learning

dvc(H) is finite ⇒ g ∈ H will generalize

• Independent of the learning algorithm

• Independent of the input distribution

• Independent of the target function
Degrees of Freedom

Parameters create degrees of freedom.

# of parameters: analog degrees of freedom

dvc: equivalent 'binary' degrees of freedom
Not just parameters

Parameters may not contribute degrees of freedom.

dvc measures the effective number of parameters.
Number of Data Points needed

Two small quantities in the VC inequality:

P[ |Ein − Eout| > ϵ ] ≤ 4 mH(2N) e^(−(1/8)ϵ²N)

(ϵ is the tolerance; call the right-hand side δ.)

If we want a certain ϵ and δ, how does N depend on dvc?

Let us look at N^d e^(−N)
Visual Representation

Fix N^d e^(−N) = small value. How does N change with d?

(Plot: N^d e^(−N) versus N on a log scale, e.g. N^30 e^(−N), for N from 20 to 200.)

Rule of thumb:

N ≥ 10 dvc
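The dependence of N on dvc can be sketched numerically: find the smallest N that drives the VC bound below δ, using the polynomial proxy mH(2N) ≈ (2N)^dvc (an assumption for illustration; a standard bound is mH(2N) ≤ (2N)^dvc + 1). The ϵ and δ values are illustrative:

```python
import math

def n_needed(dvc, eps, delta, n_max=10**6):
    """Smallest N with 4 * (2N)^dvc * exp(-eps^2 * N / 8) <= delta."""
    for N in range(1, n_max):
        bound = 4 * (2 * N) ** dvc * math.exp(-eps ** 2 * N / 8)
        if bound <= delta:
            return N
    return None

for dvc in (3, 4, 5):
    print(dvc, n_needed(dvc, eps=0.1, delta=0.1))
```

The required N grows roughly proportionally with dvc, which is the intuition behind the N ≥ 10 dvc rule of thumb.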
Rearranging Things

Start from the VC inequality:

P[ |Eout − Ein| > ϵ ] ≤ 4 mH(2N) e^(−(1/8)ϵ²N) = δ

Get ϵ in terms of δ:

ϵ = sqrt( (8/N) ln( 4 mH(2N) / δ ) ) =: Ω(N, H, δ)

With probability ≥ 1 − δ,  |Eout − Ein| ≤ Ω(N, H, δ)
Generalization Bound

With probability ≥ 1 − δ,  |Eout − Ein| ≤ Ω

With probability ≥ 1 − δ,  Eout ≤ Ein + Ω
Generalization Bound

With probability ≥ 1 − δ,  Eout ≤ Ein + Ω

The penalty term Ω is the subject of regularization.
Approximation and Generalization Tradeoff

Small Eout: good approximation of f out of sample.

More complex H ⇒ better chance of approximating f

Less complex H ⇒ better chance of generalizing out of sample

Ideal: H = {f}, which is like holding a winning lottery ticket.
Quantifying the Tradeoff

VC analysis was one approach: Eout ≤ Ein + Ω

Bias-variance analysis is another: decomposing Eout into

1. How well H can approximate f

2. How well we can zoom in on a good h ∈ H

Applies to real-valued targets and uses squared error.
Starting with Eout

To evaluate:

E_D[ Eout( g^(D) ) ]

we compute the 'average' hypothesis ḡ(x):

ḡ(x) = E_D[ g^(D)(x) ]
Bias and Variance

E_D[ Eout( g^(D) ) ] = E_x[ bias(x) + var(x) ]

bias(x) = ( ḡ(x) − f(x) )²

var(x) = E_D[ ( g^(D)(x) − ḡ(x) )² ]
The Tradeoff

Take the expectation with respect to x, and obtain bias and variance:

E_D[ Eout ] = bias + var

More complex H:  bias ↓  var ↑

Less complex H:  bias ↑  var ↓
Example: Sine Target

f : [−1, 1] → R,  f(x) = sin(πx)

Only two training examples!  N = 2

Two models used for learning:

H0: h(x) = b

H1: h(x) = ax + b

Which is better, H0 or H1?
Approximation – H0 versus H1

Best approximation of the whole target by each model:

H0: Eout = 0.50        H1: Eout = 0.20
Learning – H0 versus H1

(Each model fit to a random data set of N = 2 points; the resulting hypotheses vary from data set to data set.)
Bias and Variance – H0

(ḡ(x) versus sin(πx), and the spread of hypotheses around ḡ(x).)
Bias and Variance – H1

(ḡ(x) versus sin(πx), and the spread of hypotheses around ḡ(x).)
And the winner is ...

H0:  bias = 0.50,  var = 0.25  →  expected Eout ≈ 0.75

H1:  bias = 0.21,  var = 1.69  →  expected Eout ≈ 1.90

The winner is H0.
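The sine experiment can be sketched as a simulation: draw many data sets of N = 2 points, fit both models, and estimate bias and variance on a grid. The number of trials, the grid, and the seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
trials, grid = 20000, np.linspace(-1, 1, 201)
f = np.sin(np.pi * grid)                       # target on the grid

g0 = np.empty((trials, grid.size))             # H0 fits: h(x) = b
g1 = np.empty((trials, grid.size))             # H1 fits: h(x) = a x + b
for t in range(trials):
    x = rng.uniform(-1, 1, 2)                  # two training points
    y = np.sin(np.pi * x)
    g0[t] = np.mean(y)                         # least-squares constant
    a = (y[1] - y[0]) / (x[1] - x[0])          # line through the 2 points
    g1[t] = a * grid + (y[0] - a * x[0])

for name, g in (("H0", g0), ("H1", g1)):
    g_bar = g.mean(axis=0)                     # average hypothesis g_bar(x)
    bias = np.mean((g_bar - f) ** 2)
    var = np.mean(g.var(axis=0))
    print(name, "bias ≈", round(bias, 2), "var ≈", round(var, 2))
```

With enough trials this reproduces the slide's numbers, roughly bias ≈ 0.50, var ≈ 0.25 for H0 and bias ≈ 0.21, var ≈ 1.69 for H1, so the simpler model wins on bias + var.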
Lesson Learned

Match the 'model complexity' to the data resources, not to the target complexity.
Expected Eout and Ein

Data set D of size N

Expected out-of-sample error: E_D[ Eout( g^(D) ) ]

Expected in-sample error: E_D[ Ein( g^(D) ) ]

How do they vary with N?
The curves

(Learning curves: expected error versus the number of data points N, for a simple model and a complex model.)

Simple Model: Ein and Eout converge quickly.

Complex Model: Ein and Eout converge more slowly, but to a lower error.
What the Theory will achieve

Characterizing the feasibility of learning for infinite M

Characterizing the tradeoff:

Model complexity ↑  ⇒  Ein ↓

Model complexity ↑  ⇒  Eout − Ein ↑

(Plot: in-sample error, model-complexity penalty, and out-of-sample error versus the VC dimension; the out-of-sample error is minimized at some d*vc.)
Sources

https://work.caltech.edu/telecourse.html

http://gruber.userweb.mwn.de/17.18.statlearn/17.18.statlearn.html
