03 - Non Linear Classifiers
Figure: labeled points on the real line (x ∈ R) at x = −1, 0, 1 with labels +, −, +; they cannot be separated by a single threshold in x.

x → φ(x) = [x, x²]ᵀ,   θ = [θ1, θ2]ᵀ,   x ∈ R,  φ(x) ∈ R²

Figure: the same points mapped into the (φ1, φ2) plane via φ(x) = [φ1, φ2]ᵀ = [x, x²]ᵀ, where the two classes become linearly separable.
Back To the Real Line
h(x; θ, θ0) = sign(θ · φ(x) + θ0)
            = sign(θ1 x + θ2 x² + θ0)
Figure: back on the real line, the decision boundary consists of the two roots of θ1 x + θ2 x² + θ0 = 0 (marked ×), so the regions are labeled +, −, +.
A linear classifier in the new feature coordinates implies a nonlinear classifier in the
original x space.
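As a concrete illustration, here is a minimal sketch (not from the slides; NumPy, with hand-picked parameters θ = [0, 1] and θ0 = −0.5) of how a linear rule in φ-space acts as a nonlinear rule in x:

```python
# Minimal sketch: the feature map phi(x) = [x, x^2] turns the 1-D points
# {-1, 0, +1} with labels {+, -, +} into a linearly separable problem;
# the resulting rule is nonlinear in the original x.
import numpy as np

def phi(x):
    """Feature map R -> R^2."""
    return np.array([x, x ** 2])

# Hand-picked parameters (an assumption for illustration):
# theta = [0, 1], theta_0 = -0.5 classifies by the sign of x^2 - 0.5.
theta, theta_0 = np.array([0.0, 1.0]), -0.5

def h(x):
    return np.sign(theta @ phi(x) + theta_0)

print([h(x) for x in (-1.0, 0.0, 1.0)])  # [1.0, -1.0, 1.0]
```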
2-dim example
Figure: + and − points in the (x1, x2) plane together with their images in (φ1, φ2, φ3) feature coordinates. The (φ1, φ2) plane is actually an appropriate way to separate these two sets of examples.
φ(x) = [x1, x2, x1 x2]ᵀ,   θ = [θ1, θ2, θ3]ᵀ,   θ̂ = [0, 0, 1]ᵀ,   θ̂0 = 0
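A minimal sketch of this example (assuming an XOR-like quadrant labeling of four points, which is what the figure suggests; NumPy only):

```python
# Sketch: with the extra feature x1*x2, an XOR-like pattern becomes linearly
# separable; the classifier sign(x1 * x2), i.e. theta_hat = [0, 0, 1] with
# theta_0 = 0, labels all four points correctly.
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])       # assumed labels: + in quadrants 1 and 3

def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

theta_hat, theta_0 = np.array([0.0, 0.0, 1.0]), 0.0
preds = np.array([np.sign(theta_hat @ phi(x) + theta_0) for x in X])
print(np.all(preds == y))            # True
```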
Polynomial features
• We can add more polynomial terms
x ∈ R,   φ(x) = [x, x², x³, x⁴, …]ᵀ
• In higher dimensions this means a large number of features
x = [x1, x2]ᵀ ∈ R²,   φ(x) = [x1, x2, x1², x2², √2 x1 x2]ᵀ ∈ R⁵
• For instance, x and x² are linearly independent as functions, so each added coordinate always provides something above and beyond what the previous ones already capture.
Figure: data points marked × along the x axis; e.g. φ(x) = [x, x²]ᵀ
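A small sketch of the explicit degree-2 expansion above (the helper name phi_quadratic is mine, not from the slides):

```python
# Explicit degree-2 feature expansion for x in R^2: R^2 -> R^5.
import numpy as np

def phi_quadratic(x):
    """x = (x1, x2) -> [x1, x2, x1^2, x2^2, sqrt(2)*x1*x2]."""
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

print(phi_quadratic(np.array([2.0, 3.0])))
# [2.  3.  4.  9.  8.485...]
```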
Non linear regression
φ(x) = x,    φ(x) = [x, x²]ᵀ,    φ(x) = [x, x², x³]ᵀ
One question here is which one of these should we actually choose? Which one is
the correct one?
• At the extreme, you hold out each training example in turn, in a procedure called leave-one-out cross-validation.
– Take a single training example and remove it from the training set.
– Retrain the method.
– Test how well you would predict that particular held-out example, and do that for each training example in turn.
– Average the results.
If we now use this leave-one-out accuracy as a measure for selecting which of these is the correct explanation, we would actually select the linear one. That is the correct answer, because the data was generated from a linear model with some noise added.
Figure: 5th-order and 7th-order polynomial fits.
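A sketch of leave-one-out cross-validation for choosing the polynomial degree (hypothetical data drawn from a linear model plus noise; NumPy's polyfit stands in for 'retrain the method'):

```python
# Leave-one-out cross-validation over polynomial degrees: hold out each point,
# refit on the rest, measure the squared error on the held-out point, average.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 2.0 * x + 0.3 * rng.standard_normal(x.shape)     # linear model + noise

def loo_error(x, y, degree):
    errs = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i
        coeffs = np.polyfit(x[mask], y[mask], degree)  # retrain without point i
        pred = np.polyval(coeffs, x[i])                # predict the held-out point
        errs.append((pred - y[i]) ** 2)
    return np.mean(errs)

for degree in (1, 2, 3, 5, 7):
    print(degree, loo_error(x, y, degree))
# The degree-1 fit typically attains the lowest leave-one-out error on such data.
```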
x ∈ Rᵈ,   φ(x) = [x1, …, xd, {xi xj}, {xi xj xk}, …]ᵀ
The number of terms grows as O(d) for the linear terms, O(d²) for the pairwise terms, O(d³) for the triple terms, and so on.
• We would therefore want a more efficient way of doing this: operating with high-dimensional feature vectors without explicitly having to construct them. That is what kernel methods provide.
Non linear classification
• We want a decision boundary that is a polynomial of order p
Figure: two classes of + and − points arranged so that only a nonlinear (polynomial) decision boundary separates them.
• Add new features to data vectors x
– Let φ(x) consist of all terms of order ≤ p, such as x1 x2² x3^(p−3)
– A degree-p polynomial in x ⇔ linear in φ(x)
x = (x1, x2), p = 3:
φ(x) = (x1, x2, x1², x2², x1 x2, x1³, x2³, x1² x2, x1 x2²)
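One way to build such a feature map programmatically (a sketch; phi_poly is a hypothetical helper that enumerates all monomials of total order ≤ p):

```python
# Enumerate all monomial features of total order <= p for x in R^d.
# For d = 2, p = 3 this reproduces the nine terms listed above.
from itertools import combinations_with_replacement
import numpy as np

def phi_poly(x, p):
    feats = []
    for order in range(1, p + 1):
        for idxs in combinations_with_replacement(range(len(x)), order):
            feats.append(np.prod([x[i] for i in idxs]))
    return np.array(feats)

x = np.array([2.0, 3.0])
print(len(phi_poly(x, 3)))   # 9 features for d = 2, p = 3
```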
φ(x) = [x1, x2, x1², √2 x1 x2, x2²]ᵀ
φ(x′) = [x1′, x2′, x1′², √2 x1′ x2′, x2′²]ᵀ

K(x, x′) = φ(x) · φ(x′) = (x · x′) + (x · x′)²

and, with the order-3 terms included (and their cross terms suitably scaled), K(x, x′) = (x · x′) + (x · x′)² + (x · x′)³.
The inner product between two feature vectors can be evaluated cheaply, based on
just taking inner products of the original examples and doing some nonlinear
transformations of the result.
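A quick numerical check of this identity for the five-dimensional map above (illustrative code, not part of the slides):

```python
# Verify that K(x, x') = (x.x') + (x.x')^2 equals the explicit inner product
# phi(x).phi(x') for the 5-dimensional quadratic feature map.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def K(x, xp):
    d = x @ xp
    return d + d ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), K(x, xp))   # both equal 2.0
```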
• Why is this useful? Kernels give a way to compute dot products in some feature space without even knowing what this space is or what φ is.
• Our task now is to turn our linear methods into methods that can operate in
terms of the kernels, rather than directly in terms of the feature coordinates.
• We will implicitly operate with very high dimensional feature vectors and do the
linear prediction there but actually computationally only deal with the kernel
function.
θ = ∑_{j=1}^{n} αj y^(j) φ(x^(j)),   where αj is the number of times we have updated on the j-th point.
Instead of working with θ, we can work equivalently with the coefficients α1, …, αn, since θ = ∑_{j=1}^{n} αj y^(j) φ(x^(j)).
• Having written θ this way, we will not use it directly at all; we will use the vector α instead.
• The α vector is called the dual representation of θ
• We still have to compute the dot product of two very high dimensional vectors.
• Compute φ(x) · φ(z) without ever writing out φ(x) or φ(z).
• What is φ(x) · φ(z)?
φ(x) · φ(z) = (1, √2 x1, √2 x2, x1², x2², √2 x1 x2) · (1, √2 z1, √2 z2, z1², z2², √2 z1 z2)
            = 1 + 2 x1 z1 + 2 x2 z2 + x1² z1² + x2² z2² + 2 x1 z1 x2 z2
            = (1 + x1 z1 + x2 z2)²
            = (1 + x · z)²

More generally, for x, z ∈ Rᵈ:
φ(x) · φ(z) = 1 + 2 ∑_i xi zi + ∑_i xi² zi² + 2 ∑_{i<j} xi xj zi zj
            = (1 + x1 z1 + … + xd zd)²
            = (1 + x · z)²
sign( ∑_{j=1}^{n} αj y^(j) φ(x^(j)) · φ(x) + θ0 )  =  sign( θ · φ(x) + θ0 )
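Putting the pieces together, here is a kernel perceptron sketch in the dual (α) representation; the mistake-driven update rule, the offset update, and the toy data are assumptions for illustration rather than details taken from the slides:

```python
# Kernel perceptron in the dual representation: alpha_j counts the updates on
# point j, and both training and prediction use only kernel evaluations.
import numpy as np

def poly_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p

def train_kernel_perceptron(X, y, kernel, epochs=10):
    n = len(X)
    alpha, theta_0 = np.zeros(n), 0.0
    for _ in range(epochs):
        for i in range(n):
            # the score uses only kernel values, never phi explicitly
            s = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n)) + theta_0
            if y[i] * s <= 0:            # mistake: increment alpha_i
                alpha[i] += 1.0
                theta_0 += y[i]
    return alpha, theta_0

def predict(x, X, y, alpha, theta_0, kernel):
    s = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X))) + theta_0
    return np.sign(s)

# XOR-like toy data, separable with the quadratic kernel
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([+1, +1, -1, -1])
alpha, theta_0 = train_kernel_perceptron(X, y, poly_kernel)
print([predict(x, X, y, alpha, theta_0, poly_kernel) for x in X])
# [1.0, 1.0, -1.0, -1.0]
```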
Feature engineering, Kernels
• There are two major techniques to construct valid kernel functions: either from
an explicit feature map or from other valid kernel functions.
• If K1(x, x’) and K2(x, x’) are kernels then K(x, x’) = K1(x, x’) + K2(x, x’) is a kernel, with feature map φ(x) = [φ1(x), φ2(x)]ᵀ (the two feature maps stacked into one vector).
• If K1(x, x’) and K2(x, x’) are kernels then K(x, x’) = K1(x, x’) K2(x, x’) is a
kernel.
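A sketch of these two closure rules as code (the helper names sum_kernel and product_kernel are mine, not from the slides):

```python
# Building new kernels from existing ones: the sum and the product of two
# valid kernels are again valid kernels.
import numpy as np

def linear_kernel(x, z):
    return x @ z

def quadratic_kernel(x, z):
    return (x @ z) ** 2

def sum_kernel(k1, k2):
    # feature map of the sum: the two feature maps concatenated
    return lambda x, z: k1(x, z) + k2(x, z)

def product_kernel(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)

K = sum_kernel(linear_kernel, quadratic_kernel)   # K(x, z) = x.z + (x.z)^2
x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(K(x, z))   # 2.0
```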
• A point in the dataset will affect the nearby points more than it affects the
faraway points.
h(x) = ∑_{j=1}^{n} αj y^(j) e^(−γ ‖x − x^(j)‖²)

• Let K(x, x′) = e^(−γ ‖x − x′‖²), the Radial Basis Function (RBF) kernel. Then

h(x) = ∑_{j=1}^{n} αj y^(j) K(x, x^(j))
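A sketch of the resulting RBF-kernel classifier (the α values below are placeholders; in practice they come from a training procedure such as the kernel perceptron sketched earlier):

```python
# RBF-kernel classifier: score h(x) = sum_j alpha_j * y^(j) * K(x, x^(j)),
# with K(x, x') = exp(-gamma * ||x - x'||^2); the predicted label is sign(h(x)).
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def h(x, X, y, alpha, gamma=1.0):
    return sum(alpha[j] * y[j] * rbf_kernel(x, X[j], gamma) for j in range(len(X)))

X = np.array([[0.0, 0.0], [2.0, 2.0]])
y = np.array([+1, -1])
alpha = np.ones(len(X))                                # placeholder coefficients
print(np.sign(h(np.array([0.2, 0.1]), X, y, alpha)))   # 1.0: the nearby + point dominates
```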