
More details:

General: http://www.learning-with-kernels.org/
Example of more complex bounds:
http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz

PAC-learning, VC Dimension and Margin-based Bounds

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

February 28th, 2005


Review: Generalization error in finite hypothesis spaces [Haussler '88]

- Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent with the training data,

    P(error_true(h) > ε) ≤ |H| e^(−mε)

Even if h makes zero errors on the training data, it may still make errors at test time.
Using a PAC bound

- Typically, two use cases (see the sketch after this list):
  - 1: Pick ε and δ, solve for m
  - 2: Pick m and δ, solve for ε
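A minimal Python sketch of both use cases, assuming the Haussler bound above (|H| e^(−mε) ≤ δ); the helper names are hypothetical:

```python
import math

def samples_needed(h_size: int, eps: float, delta: float) -> int:
    """Use case 1: pick eps and delta, solve |H| e^(-m*eps) <= delta for m."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def error_guarantee(h_size: int, m: int, delta: float) -> float:
    """Use case 2: pick m and delta, solve the same inequality for eps."""
    return (math.log(h_size) + math.log(1 / delta)) / m

# Illustration with |H| = 10^6 hypotheses and 95% confidence:
print(samples_needed(10**6, eps=0.05, delta=0.05))  # ~337 samples suffice
print(error_guarantee(10**6, m=1000, delta=0.05))   # eps ~ 0.017
```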
Limitations of the Haussler '88 bound

- Assumes a consistent classifier (zero training error)

- Depends on the size of the hypothesis space, which must be finite

What if our classifier does not have zero error on the training data?

- A learner with zero training error may make mistakes on the test set
- A learner with error error_D(h) on the training set may make even more mistakes on the test set
Simpler question: What's the expected error of a hypothesis?

- The error of a hypothesis is like estimating the parameter of a coin!

- Chernoff bound: for m i.i.d. coin flips x_1,…,x_m, where x_i ∈ {0,1} and P(x_i = 1) = θ, for 0 < ε < 1:

    P(θ − (1/m) Σ_i x_i > ε) ≤ e^(−2mε²)
Using the Chernoff bound to estimate the error of a single hypothesis

- Treat each training mistake as a coin flip: with probability at least 1 − δ,

    error_true(h) ≤ error_train(h) + sqrt( ln(1/δ) / (2m) )
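A small sketch of the resulting confidence interval for one fixed hypothesis; `hoeffding_interval` is a hypothetical helper, not anything from the lecture:

```python
import math

def hoeffding_interval(m: int, delta: float) -> float:
    """Half-width eps with P(error_true - error_train > eps) <= delta
    for ONE fixed hypothesis: eps = sqrt(ln(1/delta) / (2m))."""
    return math.sqrt(math.log(1 / delta) / (2 * m))

# With m = 1000 samples, the true error exceeds the training error
# by more than ~0.039 with probability at most 5%.
print(hoeffding_interval(1000, 0.05))
```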
But we are comparing many hypotheses: Union bound

- For each hypothesis h_i:

    P(error_true(h_i) − error_train(h_i) > ε) ≤ e^(−2mε²)

- What if I am comparing two hypotheses, h_1 and h_2? By the union bound:

    P( (error_true(h_1) − error_train(h_1) > ε) or (error_true(h_2) − error_train(h_2) > ε) ) ≤ 2 e^(−2mε²)
Generalization bound for |H| hypotheses

- Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h,

    P(error_true(h) − error_train(h) > ε) ≤ |H| e^(−2mε²)
PAC bound and the bias-variance tradeoff

- Setting the bound above equal to δ: |H| e^(−2mε²) ≤ δ

or, after moving some terms around, with probability at least 1 − δ:

    error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2m) )

- Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!
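A sketch of the bound as a bias-variance tradeoff; the numbers below are made-up illustrations, assuming a larger H buys a lower training error at the cost of a larger complexity term:

```python
import math

def pac_bound(train_error: float, h_size: int, m: int, delta: float) -> float:
    """error_true <= error_train + sqrt((ln|H| + ln(1/delta)) / (2m)),
    holding simultaneously for every h in H."""
    return train_error + math.sqrt((math.log(h_size) + math.log(1 / delta))
                                   / (2 * m))

# Hypothetical numbers: a bigger H fits training data better (lower bias)
# but pays a larger complexity/variance term.
print(pac_bound(0.10, 10**3, m=1000, delta=0.05))  # small H
print(pac_bound(0.02, 10**9, m=1000, delta=0.05))  # big H, lower training error
```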
What about the size of the hypothesis space?

- How large is the hypothesis space?
  - Boolean formulas with n binary features: |H| = 2^(2^n), so ln|H| = 2^n ln 2 – the sample bound is exponential in n
Number of decision trees of depth k

Recursive solution:
- Given n attributes
- H_k = number of decision trees of depth k
- H_0 = 2
- H_(k+1) = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees)
          = n × H_k × H_k

Write L_k = log2 H_k:
- L_0 = 1
- L_(k+1) = log2 n + 2 L_k
- So L_k = (2^k − 1)(1 + log2 n) + 1
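A quick sketch checking the recursion against the closed form and plugging the count into the Haussler sample bound (n = 20 attributes is an arbitrary choice):

```python
import math

def log2_num_trees(k: int, n: int) -> float:
    """log2 H_k via the recursion L_0 = 1, L_{k+1} = log2(n) + 2*L_k."""
    L = 1.0
    for _ in range(k):
        L = math.log2(n) + 2 * L
    return L

n, delta, eps = 20, 0.05, 0.1
for k in range(1, 7):
    L = log2_num_trees(k, n)
    assert abs(L - ((2**k - 1) * (1 + math.log2(n)) + 1)) < 1e-9  # closed form
    m = (L * math.log(2) + math.log(1 / delta)) / eps  # Haussler sample bound
    print(f"depth {k}: m >= {m:,.0f}")  # blows up exponentially with depth
```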
PAC bound for decision trees of depth k

- Plugging ln H_k = ln(2) · ((2^k − 1)(1 + log2 n) + 1) into m ≥ (ln|H| + ln(1/δ)) / ε:

    m ≥ ( ln(2) · ((2^k − 1)(1 + log2 n) + 1) + ln(1/δ) ) / ε

- Bad!!!
  - The number of points required is exponential in the depth!

- But, for m data points, a decision tree can't get too big…
  - The number of leaves is never more than the number of data points
Number of decision trees with k leaves

- H_k = number of decision trees with k leaves
- H_1 = 2 (a single leaf, labeled + or −)
- Loose bound: there are at most 4^(k−1) binary tree shapes with k leaves (Catalan bound), n attribute choices per internal node, and 2 labels per leaf, so

    H_k ≤ 4^(k−1) n^(k−1) 2^k,  i.e.  log2 H_k = O(k (1 + log2 n))

  Reminder: plug ln H_k into m ≥ (ln|H| + ln(1/δ)) / ε – now linear in k, not exponential as with depth.

PAC bound for decision trees with k leaves – Bias-Variance revisited

- With ln H_k = O(k ln n), with probability at least 1 − δ:

    error_true(h) ≤ error_train(h) + sqrt( (O(k ln n) + ln(1/δ)) / (2m) )
What did we learn from decision trees?

- Bias-variance tradeoff formalized

- Moral of the story:
  - Complexity of learning is measured not by the size of the hypothesis space, but by the maximum number of points that allows consistent classification
  - Complexity equal to m – no bias, lots of variance
  - Complexity lower than m – some bias, less variance
What about continuous hypothesis spaces?

- Continuous hypothesis space:
  - |H| = ∞
  - Infinite variance???

- As with decision trees, we only care about the maximum number of points that can be classified exactly!
How many points can a linear boundary classify exactly? (1-D)
- 2 points: a threshold on the line realizes all 4 labelings of 2 points, but not all labelings of 3 (e.g., +, −, + is impossible).

How many points can a linear boundary classify exactly? (2-D)
- 3 points in general position: a line realizes all 8 labelings, but no 4 points can be labeled arbitrarily (the XOR layout fails).

How many points can a linear boundary classify exactly? (d-D)
- d + 1 points, as the sketch below verifies for d = 2.
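A sketch that brute-forces the d = 2 case: it tests every labeling of a point set for strict linear separability via a small feasibility LP (using scipy.optimize.linprog; the point sets are arbitrary examples):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there (w, b) with y_i * (w.x_i + b) >= 1 for every point? (feasibility LP)"""
    n, d = X.shape
    # Variables z = [w, b]; constraint -y_i*(w.x_i + b) <= -1.
    A = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """True if a linear boundary can realize every +/- labeling of X's rows."""
    return all(separable(X, np.array(lab, dtype=float))
               for lab in itertools.product([-1.0, 1.0], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]) # contains XOR
print(shattered(three))  # True: 3 = d + 1 points are shattered in 2-D
print(shattered(four))   # False: the XOR labeling is not linearly separable
```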
PAC bound using VC dimension

- The number of training points that can be classified exactly is the VC dimension!!!
  - It measures the relevant size of the hypothesis space, as with decision trees with k leaves

- With probability at least 1 − δ:

    error_true(h) ≤ error_train(h) + sqrt( ( VC(H) (ln(2m/VC(H)) + 1) + ln(4/δ) ) / m )
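A sketch evaluating this bound, assuming the Vapnik form above; the VC dimensions mirror those in the reality-check plot later in these notes:

```python
import math

def vc_bound(train_error, vc, m, delta):
    """error_true <= error_train + sqrt((vc*(ln(2m/vc) + 1) + ln(4/delta)) / m)."""
    return train_error + math.sqrt((vc * (math.log(2 * m / vc) + 1)
                                    + math.log(4 / delta)) / m)

# Complexity penalty at m = 10^5 samples, delta = 0.05, zero training error:
for d in [2, 20, 200, 2000]:
    print(f"VC = {d}: bound = {vc_bound(0.0, d, 10**5, 0.05):.3f}")
```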
Shattering a set of points

- A set of points is shattered by H if, for every one of the 2^m possible labelings of the points, some h ∈ H classifies all of them correctly.

VC dimension

- The VC dimension of H is the size of the largest set of points that H can shatter.
Examples of VC dimension

- Linear classifiers:
  - VC(H) = d + 1, for d features plus the constant term b

- Neural networks:
  - VC(H) = # parameters
  - Local minima mean NNs will probably not find the best parameters

- 1-Nearest neighbor?
PAC bound for SVMs

- SVMs use a linear classifier
  - For d features, VC(H) = d + 1

VC dimension and SVMs: Problems!!!

- Doesn't take the margin into account

- What about kernels? (see the sketch below)
  - Polynomials: the number of features grows really fast = bad bound
    - n – input features
    - p – degree of polynomial
    - number of features: C(n + p, p)
  - Gaussian kernels can classify any set of points exactly → infinite VC dimension
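A sketch of how fast the polynomial feature count C(n + p, p) grows, for an assumed n = 100 input features:

```python
from math import comb

def poly_feature_dim(n: int, p: int) -> int:
    """Number of monomials of degree <= p in n variables: C(n + p, p)."""
    return comb(n + p, p)

# VC(H) = d + 1 for a linear classifier over the expanded features.
for p in [2, 3, 5, 10]:
    print(f"degree {p}: d = {poly_feature_dim(100, p):,}")
```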


Margin-based VC dimension

- H: class of linear classifiers w·Φ(x) (with b = 0)
  - Canonical form: min_j |w·Φ(x_j)| = 1
- VC(H) = R² w·w
  - Doesn't depend on the number of features!!!
  - R² = max_j Φ(x_j)·Φ(x_j) – the magnitude of the data
  - R² is bounded even for Gaussian kernels → bounded VC dimension

- Large margin means low w·w, which means low VC dimension – Very cool! (a sketch follows)
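A sketch of the margin-based capacity R² w·w on toy data, after rescaling w to the canonical form min_j |w·Φ(x_j)| = 1 (here Φ is the identity and the data are made up):

```python
import numpy as np

def margin_vc_estimate(X: np.ndarray, w: np.ndarray) -> float:
    """R^2 * (w.w) after rescaling w to canonical form min_j |w.x_j| = 1.
    Rows of X are the (feature-mapped) points Phi(x_j); b = 0 as in the slide."""
    w = w / np.min(np.abs(X @ w))          # canonical form: min_j |w.x_j| = 1
    R2 = np.max(np.sum(X * X, axis=1))     # R^2 = max_j Phi(x_j).Phi(x_j)
    return R2 * float(w @ w)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.5, -2.5]])
print(margin_vc_estimate(X, np.array([1.0, 1.0])))  # ~1.9
print(margin_vc_estimate(X, np.array([3.0, 3.0])))  # same value: canonicalizing
                                                    # removes the scale of w
```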


Applying margin VC to SVMs?

- VC(H) = R² w·w
  - R² = max_j Φ(x_j)·Φ(x_j) – the magnitude of the data; doesn't depend on the choice of w
- SVMs minimize w·w

- So do SVMs minimize the VC dimension to get the best bound?

- Not quite right:
  - The bound assumes the VC dimension was chosen before looking at the data
  - Fixing that would require a union bound over an infinite number of possible VC dimensions…
  - But, it can be fixed!
Structural risk minimization theorem

- For a family of hyperplanes with margin γ > 0:
  - w·w ≤ 1
- SVMs maximize the margin γ together with the hinge loss
  - Optimizing the tradeoff between training error (bias) and margin γ (variance) – a sketch of this objective follows
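A minimal sketch of this tradeoff: subgradient descent on λ·(w·w) + mean hinge loss, with b = 0 and made-up Gaussian data. This illustrates the objective, not the lecture's SVM solver:

```python
import numpy as np

def train_soft_margin(X, y, lam=0.1, lr=0.01, epochs=1000):
    """Minimize lam * (w.w) + mean hinge loss: the tradeoff between
    margin (small w.w) and training error, by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1                       # margin violators
        hinge_grad = -(y[active, None] * X[active]).sum(axis=0) / len(X)
        w -= lr * (2 * lam * w + hinge_grad)
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
y = np.array([1.0] * 50 + [-1.0] * 50)
w = train_soft_margin(X, y)
print("training error:", np.mean(np.sign(X @ w) != y), "  w.w:", float(w @ w))
```

Raising λ shrinks w·w (larger margin, more bias); lowering it chases training error (more variance).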
Reality check – Bounds are loose

[Figure: bound ε versus number of samples m (in units of 10^5), plotted for VC dimensions d = 2, 20, 200, 2000]

- The bound can be very loose; why should you care?
  - There are tighter, albeit more complicated, bounds
  - Bounds give us formal guarantees that empirical studies can't provide
  - Bounds give us intuition about the complexity of problems and the convergence rate of algorithms
What you need to know

- Finite hypothesis spaces
  - Derive the results
  - Counting the number of hypotheses
  - Mistakes on training data

- Complexity of the classifier depends on the number of points that can be classified exactly
  - Finite case – decision trees
  - Infinite case – VC dimension

- Bias-variance tradeoff in learning theory

- Margin-based bound for SVMs
- Remember: will your algorithm find the best classifier?
