03 - Non Linear Classifiers

1) Linear classifiers can only produce linear decision boundaries on real-valued data. 2) To allow for nonlinear classification, data points can be mapped to a higher dimensional feature space using a feature transformation. 3) A linear classifier in the higher dimensional feature space corresponds to a nonlinear classifier in the original input space, allowing for more complex decision boundaries.

Machine Learning

Linear classifiers on the real line


h(x; θ, θ0) = sign(θx + θ0)

[Figure: the real line x ∈ R with a single threshold; θx + θ0 is negative on one side (−) and positive on the other (+), so a linear classifier splits R at one point.]


Linear classifiers on the real line
[Figure: training points labelled +, −, + at x = −1, 0, 1 on the real line; no single threshold θx + θ0 classifies all three correctly.]

We can remedy the situation by introducing a feature transformation, feeding a
different type of example to the linear classifier:

  x → φ(x) = [x, x²]ᵀ,   x ∈ R, φ(x) ∈ R²

For example, we map x now to a feature vector φ(x). Remember that x here is a
scalar; the feature vector now lives in a higher dimensional space.


Linear classifiers on the real line

x θ1 
x → φ (x) =  2  θ = 
x  θ 2 
x∈R φ (x) ∈ R2
+ − +
-1 0 1 x∈R

Quadratic function rather than a


h(x; θ, θ0) = sign (θ. φ(x) + θ0) linear one

= sign (θ1. x + θ2. x2 + θ0)

Non Linear Classifiers 4


In Feature Space

[Figure: the points mapped into the (φ1, φ2) plane under φ(x) = [x, x²]ᵀ: x = −1 ↦ (−1, 1), x = 0 ↦ (0, 0), x = 1 ↦ (1, 1). In this feature space the + points and the − point are linearly separable.]
Back To the Real Line
  h(x; θ, θ0) = sign(θ · φ(x) + θ0)
              = sign(θ1 x + θ2 x² + θ0)

[Figure: back on the real line, the quadratic decision rule changes sign at two points (marked ×), giving a +, −, + labelling of the line that classifies all three points correctly.]

A linear classifier in the new feature coordinates implies a nonlinear classifier in the
original x space.
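A minimal sketch of this idea in code (not from the slides; the data, feature map and training loop are my own illustrative choices): map scalar inputs through φ(x) = [x, x²], run an ordinary perceptron on the transformed data, and the learned rule is quadratic in the original x.

  import numpy as np

  # Toy 1-D data that no single threshold separates: labels +, -, + at x = -1, 0, 1
  X = np.array([-1.0, 0.0, 1.0])
  y = np.array([+1, -1, +1])

  def phi(x):
      """Feature map phi(x) = [x, x^2]."""
      return np.array([x, x * x])

  theta = np.zeros(2)          # ordinary (primal) perceptron on the transformed data
  theta0 = 0.0
  for _ in range(100):         # T epochs
      for xi, yi in zip(X, y):
          if yi * (theta @ phi(xi) + theta0) <= 0:   # mistake
              theta += yi * phi(xi)
              theta0 += yi

  # The learned rule sign(theta1*x + theta2*x^2 + theta0) is quadratic in x
  print([int(np.sign(theta @ phi(xi) + theta0)) for xi in X])   # [1, -1, 1]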
2-dim example
[Figure: left, + and − points in the (x1, x2) plane with opposite quadrants sharing a label, so no line separates them; right, the same points mapped into the 3-D feature space (φ1, φ2, φ3).]

The (φ1, φ2) plane is actually an appropriate way to separate these examples.

  φ(x) = [x1, x2, x1x2]ᵀ,   e.g. θ = [0, 0, 1]ᵀ, θ0 = 0  (i.e. the classifier sign(x1x2))
Polynomial features
• We can add more polynomial terms:

  x ∈ R,  φ(x) = [x, x², x³, x⁴, …]ᵀ

• This means lots of features in higher dimensions:

  x = [x1, x2]ᵀ ∈ R²,  φ(x) = [x1, x2, x1², x2², √2 x1x2]ᵀ ∈ R⁵
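For concreteness, a tiny sketch (my own code, not from the slides) of the second feature map above:

  import numpy as np

  def phi_r2_to_r5(x):
      """Explicit degree-2 feature map from R^2 to R^5: [x1, x2, x1^2, x2^2, sqrt(2)*x1*x2]."""
      x1, x2 = x
      return np.array([x1, x2, x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

  print(phi_r2_to_r5(np.array([1.0, 2.0])))   # [1. 2. 1. 4. 2.828...]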


Polynomial features
• We can get more and more powerful classifiers by adding linearly independent
  features.

• For instance, x and x² are linearly independent as functions, so each added
  coordinate provides something above and beyond what was in the previous ones.


Non linear classification and regression
• Non linear classification:
  h(x; θ, θ0) = sign(θ · φ(x) + θ0)

• Non linear regression:
  f(x; θ, θ0) = θ · φ(x) + θ0 = θ1 x + θ2 x² + θ0

[Figure: × data points in the (x, y) plane following a nonlinear trend, fitted by the regression function; e.g. φ(x) = [x, x²]ᵀ.]
Non linear regression

  φ(x) = x (Linear)          φ(x) = [x, x², x³]ᵀ (3rd order)

[Figure: four panels, Linear, 3rd Order, 5th Order and 7th Order, showing the same data fitted with polynomial features of the corresponding order.]

One question here is which one of these should we actually choose? Which one is
the correct one?
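One way to compare these fits is simply to train each of them; a minimal sketch (my own illustrative code, on assumed data) that builds explicit polynomial features and solves the least-squares problem for the linear coefficients:

  import numpy as np

  def poly_features(x, order):
      """phi(x) = [x, x^2, ..., x^order] for a 1-D input array x."""
      return np.column_stack([x ** k for k in range(1, order + 1)])

  def fit_poly_regression(x, y, order):
      """Least-squares fit of f(x) = theta . phi(x) + theta0."""
      Phi = np.column_stack([poly_features(x, order), np.ones_like(x)])  # append bias column
      coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
      return coef[:-1], coef[-1]              # theta, theta0

  # Noisy data from an underlying linear model (an assumption for the example)
  rng = np.random.default_rng(0)
  x = np.linspace(-1, 1, 12)
  y = 2 * x + 0.1 * rng.standard_normal(x.shape)

  for order in (1, 3, 5, 7):
      theta, theta0 = fit_poly_regression(x, y, order)
      pred = poly_features(x, order) @ theta + theta0
      print(order, np.mean((pred - y) ** 2))   # training error alone keeps shrinking

Training error by itself always favours the higher-order model, which is exactly why the next slides turn to validation.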


Non linear regression
• What we can do (as we've discussed before) is to introduce a validation set: hold
  out some subset of examples, train the method on the remaining samples, and
  then evaluate it on the held out examples to see how well the method would
  actually perform on those pretend test examples.

• At the extreme, you hold out each of the training examples in turn, in a procedure
  called leave-one-out cross validation:
  – Take a single training example and remove it from the training set.
  – Retrain the method.
  – Test how well you would predict that particular held out example, and do that for each
    training example in turn.
  – Average the results (a sketch of the procedure follows below).
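A minimal leave-one-out cross validation sketch (my own code, using numpy's polyfit/polyval for the regression step) for choosing the polynomial order:

  import numpy as np

  def loocv_error(x, y, order):
      """Leave-one-out cross validation error for polynomial regression of a given order."""
      errors = []
      for i in range(len(x)):
          mask = np.arange(len(x)) != i                      # remove the i-th example
          coeffs = np.polyfit(x[mask], y[mask], deg=order)   # retrain on the rest
          pred = np.polyval(coeffs, x[i])                    # predict the held-out example
          errors.append((pred - y[i]) ** 2)
      return np.mean(errors)                                 # average the results

  # Noisy data generated from a linear model (assumed for the example)
  rng = np.random.default_rng(0)
  x = np.linspace(-1, 1, 12)
  y = 2 * x + 0.1 * rng.standard_normal(x.shape)

  for order in (1, 3, 5, 7):
      print(order, loocv_error(x, y, order))   # the linear model typically wins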




Non linear regression

[Figure: the four fits again: Linear, 3rd Order, 5th Order, 7th Order.]

If we now use this leave-one-out accuracy as a measure for selecting which one of
these is the correct explanation, we would actually select the linear one. That is the
correct answer, because the data was generated from a linear model with some
noise added.


Non linear regression
• By mapping input examples explicitly into feature vectors, and performing linear
  classification or regression on top of such feature vectors, we get a lot of
  expressive power.
• The downside of this procedure, which is always available to you, is that you
  might end up with very high-dimensional feature vectors that you have to
  construct explicitly.

  x ∈ Rᵈ
  φ(x) = [x1, …, xd, {xixj}, {xixjxk}, …]ᵀ
            O(d)      O(d²)    O(d³)

• We would want a more efficient way of doing this: operating with high
  dimensional feature vectors without explicitly having to construct them. That
  is what kernel methods provide.
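To see how quickly this blows up, a small counting sketch (my own code; the monomial count C(d + k − 1, k) is standard combinatorics, and d = 784 is the MNIST input size quoted later in the deck):

  from math import comb

  def num_monomials(d, k):
      """Number of monomials of degree exactly k in d variables: C(d + k - 1, k)."""
      return comb(d + k - 1, k)

  d = 784                               # e.g. MNIST pixels
  for k in (1, 2, 3):
      print(k, num_monomials(d, k))     # grows roughly like d, d^2/2, d^3/6

The degree-2 count alone is 307,720, about 3·10⁵, which matches the dim(φ(x)) = 300000 figure quoted on the perceptron slide below.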
Non linear regression
• We want a decision boundary that is a polynomial of order p.

[Figure: + and − training points in the plane; the two classes are not linearly separable.]

• Add new features to the data vectors x:
  – Let φ(x) consist of all terms of order ≤ p, such as x1 x2² x3^(p−3).
  – A degree-p polynomial in x ⇔ linear in φ(x).

  Example: x = (x1, x2), p = 3
  φ(x) = (x1, x2, x1², x2², x1x2, x1³, x2³, x1²x2, x1x2²)
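A minimal sketch (my own illustrative code) of building such a feature vector explicitly, by enumerating all monomials of total degree 1 through p:

  import itertools
  import numpy as np

  def poly_feature_map(x, p):
      """All monomials of x = (x1, ..., xd) with total degree 1..p (no constant term)."""
      d = len(x)
      features = []
      for degree in range(1, p + 1):
          # index tuples such as (0, 1, 1) stand for x1 * x2 * x2 = x1 x2^2
          for idx in itertools.combinations_with_replacement(range(d), degree):
              features.append(np.prod([x[i] for i in idx]))
      return np.array(features)

  x = np.array([2.0, 3.0])
  print(poly_feature_map(x, 3))
  # For d = 2, p = 3 this yields the same 9 monomials listed above (in a different order)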


Non linear regression
• We want a decision boundary that is a polynomial of order p.

[Figure: the same + and − points as on the previous slide.]

• Again, the key idea is: if we want to learn a decision boundary in the original
  space which is a polynomial of order p, we can equivalently just learn a linear
  boundary in the higher dimensional feature space φ(x).
• In order to learn this linear classifier, we can use a standard method like the
  Perceptron or an SVM.


Non linear regression
• We want a decision boundary that is a polynomial of order p.

[Figure: the same + and − points as on the previous slides.]

• The only downside to all of this is that these vectors φ(x), these extended
  representations, can get enormous, so large that it would be very hard to write
  them down.
• Actually, we don't need to write them down at all.
• For the Perceptron and for the SVM, all we need to do is compute dot products.


Inner Products, Kernels
• Computing the inner product between two feature vectors can be cheap even if
  the vectors are very high dimensional.

  φ(x)  = [x1, x2, x1², √2 x1x2, x2²]ᵀ
  φ(x′) = [x1′, x2′, x1′², √2 x1′x2′, x2′²]ᵀ

  K(x, x′) = φ(x) · φ(x′) = (x · x′) + (x · x′)²

The inner product between the two feature vectors can be evaluated cheaply, based on
just taking inner products of the original examples and doing some nonlinear
transformations of the result.

The inner product of two feature vectors is known as a kernel function.


Inner Products, Kernels
• A kernel is a way of computing the dot product of two vectors x and y in some
  (possibly very high dimensional) feature space, which is why kernel functions are
  sometimes called a "generalized dot product".

• Suppose we have a mapping φ: Rⁿ → Rᵐ that brings our vectors in Rⁿ to some
  feature space Rᵐ. Then the dot product of x and y in this space is φ(x)ᵀφ(y).

• A kernel is a function K that corresponds to this dot product, i.e.
  K(x, y) = φ(x)ᵀφ(y).

• Why is this useful? Kernels give a way to compute dot products in some feature
  space without even knowing what this space is or what φ is.


Inner Products, Kernels
• For example, consider a simple polynomial kernel:
  K(x, y) = (1 + xᵀy)²  with x, y ∈ R².
• This doesn't seem to correspond to any mapping function φ; it's just a function
  that returns a real number.
• Assuming that x = (x1, x2) and y = (y1, y2), let's expand this expression:

  K(x, y) = (1 + xᵀy)² = (1 + x1y1 + x2y2)²
          = 1 + x1²y1² + x2²y2² + 2x1y1 + 2x2y2 + 2x1x2y1y2

• This is nothing else but a dot product between the two vectors
  (1, x1², x2², √2 x1, √2 x2, √2 x1x2)  and  (1, y1², y2², √2 y1, √2 y2, √2 y1y2)

• φ(x) = φ(x1, x2) = (1, x1², x2², √2 x1, √2 x2, √2 x1x2)
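A quick numerical check of this identity (my own illustrative code): evaluate (1 + xᵀy)² directly and compare it with the explicit 6-dimensional dot product φ(x)ᵀφ(y).

  import numpy as np

  def K(x, y):
      """Polynomial kernel K(x, y) = (1 + x.y)^2."""
      return (1.0 + x @ y) ** 2

  def phi(x):
      """Explicit 6-dimensional feature map corresponding to K."""
      x1, x2 = x
      s = np.sqrt(2.0)
      return np.array([1.0, x1**2, x2**2, s * x1, s * x2, s * x1 * x2])

  x = np.array([0.5, -1.0])
  y = np.array([2.0, 3.0])
  print(K(x, y))           # kernel value, computed in R^2
  print(phi(x) @ phi(y))   # same value, computed in R^6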
Inner Products, Kernels
• φ(x) = φ(x1, x2) = (1, x1², x2², √2 x1, √2 x2, √2 x1x2)

  K(x, y) = (1 + xᵀy)² = φ(x)ᵀφ(y)

  computes a dot product in a 6-dimensional space without explicitly visiting this
  space.

• Our task now is to turn our linear methods into methods that can operate in
  terms of the kernels, rather than directly in terms of the feature coordinates.

• We will implicitly operate with very high dimensional feature vectors and do the
  linear prediction there, but computationally only deal with the kernel function.


Kernels vs. Features
• For some feature maps, we can evaluate the inner product very efficiently, e.g.,
  K(x, x′) = φ(x) · φ(x′) = (1 + x · x′)ᵖ,  p = 1, 2, …

• In those cases, it's advantageous to express the linear classifier (or regression
  method) in terms of kernels rather than explicitly constructing feature vectors:

  sign(θ · φ(x) + θ0)  →  rewritten in terms of K(x, x′)


Recall perceptron
PERCEPTRON({(x(i), y(i)), i = 1, …, n}, T)
  θ = 0 (vector)
  θ0 = 0
  for t = 1, …, T do                              MNIST dataset:
    for i = 1, …, n do                            dim(x) = 784
      if y(i)·(θ · φ(x(i)) + θ0) ≤ 0 then          dim(φ(x)) = 300000
        θ ← θ + y(i)·φ(x(i))                       dim(θ) = 300000
        θ0 ← θ0 + y(i)
  return θ

Problem: the number of features has now increased dramatically.

The kernel trick: implement this without ever writing down a vector in the high
dimensional feature space.
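A runnable version of this primal perceptron (my own sketch, not the course code; phi can be any explicit feature map, e.g. the degree-2 map from earlier):

  import numpy as np

  def perceptron(X, y, phi, T=50):
      """Primal perceptron operating on explicit feature vectors phi(x)."""
      theta = np.zeros(len(phi(X[0])))
      theta0 = 0.0
      for _ in range(T):
          for xi, yi in zip(X, y):
              if yi * (theta @ phi(xi) + theta0) <= 0:   # mistake
                  theta += yi * phi(xi)
                  theta0 += yi
      return theta, theta0

  # Quadratic feature map on 2-D inputs (an illustrative choice)
  phi = lambda x: np.array([1.0, x[0]**2, x[1]**2,
                            np.sqrt(2)*x[0], np.sqrt(2)*x[1], np.sqrt(2)*x[0]*x[1]])
  X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
  y = np.array([+1, +1, -1, -1])
  theta, theta0 = perceptron(X, y, phi)
  print([int(np.sign(theta @ phi(xi) + theta0)) for xi in X])   # [1, 1, -1, -1]

The point of the slide is the cost: theta lives in the (possibly enormous) feature space, which the kernel trick below avoids.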


The kernel trick
PERCEPTRON({(x(i), y(i)), i = 1, …, n}, T)
  θ = 0 (vector)
  θ0 = 0
  for t = 1, …, T do
    for i = 1, …, n do
      if y(i)·(θ · φ(x(i)) + θ0) ≤ 0 then      ← θ is updated only when there is a mistake
        θ ← θ + y(i)·φ(x(i))
        θ0 ← θ0 + y(i)
  return θ

  θ = ∑j αj y(j) φ(x(j)),  summing over j = 1, …, n, where αj is the number of times we
  have updated on the j-th point.

Instead of working with θ, we can work equivalently with the coefficients α1, …, αn.


The kernel trick
• There's an αj coefficient for each data point.
• We can put them together into a vector α = (α1, …, αn).
• From α, we can, if we like, recover θ.
• θ is always a linear combination of the φ(x(i)):

  θ = ∑j αj y(j) φ(x(j))

• Having expressed θ this way, we will not use it directly at all; we will use the
  vector α instead.
• The α vector is called the dual representation of θ.


The kernel trick
• Compute θ · φ(x(i)) using the dual representation:

  θ · φ(x(i)) = ( ∑j αj y(j) φ(x(j)) ) · φ(x(i))
              = ∑j αj y(j) ( φ(x(j)) · φ(x(i)) )

• We still have to compute the dot product of two very high dimensional vectors.
• The trick is to compute φ(x) · φ(z) without ever writing out φ(x) or φ(z).


Computing dot products
• First in 2D.
  Suppose x = (x1, x2) and φ(x) = (x1, x2, x1², x2², x1x2).

• Actually, tweak it a little: φ(x) = (1, √2 x1, √2 x2, x1², x2², √2 x1x2).
  It doesn't change anything: any function that's linear in the original φ(x) is also
  linear in the new φ(x), and vice versa.

• What is φ(x) · φ(z)?

  φ(x) · φ(z) = (1, √2 x1, √2 x2, x1², x2², √2 x1x2) · (1, √2 z1, √2 z2, z1², z2², √2 z1z2)
              = 1 + 2x1z1 + 2x2z2 + x1²z1² + x2²z2² + 2x1z1x2z2
              = (1 + x1z1 + x2z2)²
              = (1 + x · z)²


Computing dot products
• The dot product between the higher dimensional vectors is just (1 plus the dot
  product between the original low dimensional vectors), squared.
• Exactly the same thing holds when the original vectors are d dimensional.
• Suppose x = (x1, x2, …, xd) and
  φ(x) = (1, √2 x1, …, √2 xd, x1², …, xd², √2 x1x2, …, √2 xd−1xd)

  φ(x) · φ(z) = (1, √2 x1, …, √2 xd, x1², …, xd², √2 x1x2, …, √2 xd−1xd) ·
                (1, √2 z1, …, √2 zd, z1², …, zd², √2 z1z2, …, √2 zd−1zd)
              = 1 + 2 ∑i xizi + ∑i xi²zi² + 2 ∑i<j xixjzizj
              = (1 + x1z1 + … + xdzd)²
              = (1 + x · z)²


Kernel perceptron
PERCEPTRON({(x(i), y(i)), i = 1, …, n}, T)
  θ = 0 (vector)                 α1 = … = αn = 0
  θ0 = 0
  for t = 1, …, T do
    for i = 1, …, n do
      if y(i)·(θ · φ(x(i)) + θ0) ≤ 0 then
        θ ← θ + y(i)·φ(x(i))
        θ0 ← θ0 + y(i)
  return θ

To classify a new point, we again use the dual form of θ.


Kernel perceptron
PERCEPTRON({(x(i), y(i)), i = 1, …, n}, T)   — kernelized (dual) form
  α1 = … = αn = 0
  θ0 = 0
  for t = 1, …, T do
    for i = 1, …, n do
      if y(i)·( ∑j αj y(j) φ(x(j)) · φ(x(i)) + θ0 ) ≤ 0 then
        αi ← αi + 1
        θ0 ← θ0 + y(i)
  return α, θ0

To classify a new point x, we again use the dual form of θ:

  sign( ∑j αj y(j) φ(x(j)) · φ(x) + θ0 )  =  sign(θ · φ(x) + θ0)
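A minimal kernel perceptron sketch (my own illustrative implementation, reusing the polynomial kernel K(x, x′) = (1 + x·x′)² from earlier and the same XOR-like data as in the primal sketch; names are assumptions, not course code):

  import numpy as np

  def poly_kernel(x, z, p=2):
      """Polynomial kernel K(x, z) = (1 + x.z)^p."""
      return (1.0 + x @ z) ** p

  def kernel_perceptron(X, y, kernel, T=50):
      """Dual (kernel) perceptron: learn one alpha per training point, never forming phi."""
      n = len(X)
      alpha = np.zeros(n)
      theta0 = 0.0
      K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
      for _ in range(T):
          for i in range(n):
              score = np.sum(alpha * y * K[:, i]) + theta0
              if y[i] * score <= 0:            # mistake: update the dual coefficients
                  alpha[i] += 1
                  theta0 += y[i]
      return alpha, theta0

  def predict(x, X, y, alpha, theta0, kernel):
      """sign( sum_j alpha_j y_j K(x_j, x) + theta0 ), the dual form of the classifier."""
      return np.sign(sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X))) + theta0)

  X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
  y = np.array([+1, +1, -1, -1])
  alpha, theta0 = kernel_perceptron(X, y, poly_kernel)
  print([int(predict(xi, X, y, alpha, theta0, poly_kernel)) for xi in X])   # [1, 1, -1, -1]

Nothing here ever constructs φ(x); swapping in a different kernel (for example the radial basis kernel below) changes the implicit feature space without changing the algorithm.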
Feature engineering, Kernels
• There are two major techniques to construct valid kernel functions: either from
  an explicit feature map or from other valid kernel functions.

• K(x, x′) = 1 is a kernel function, with φ(x) = 1.

• Let f: Rᵈ → R and let K(x, x′) be a kernel. Then
  K̃(x, x′) = f(x) K(x, x′) f(x′) is also a kernel, with φ̃(x) = f(x) φ(x).

• If K1(x, x′) and K2(x, x′) are kernels, then K(x, x′) = K1(x, x′) + K2(x, x′)
  is a kernel, with φ(x) = [φ1(x), φ2(x)] (the two feature vectors stacked).

• If K1(x, x′) and K2(x, x′) are kernels, then K(x, x′) = K1(x, x′) K2(x, x′) is a
  kernel.
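A small sketch (my own illustration) of composing kernels this way, with a numerical sanity check that the resulting Gram matrices are positive semi-definite, as they must be for valid kernels:

  import numpy as np

  def k_linear(x, z):
      return x @ z

  def k_poly(x, z, p=2):
      return (1.0 + x @ z) ** p

  def k_sum(x, z):                      # sum of two kernels is a kernel
      return k_linear(x, z) + k_poly(x, z)

  def k_prod(x, z):                     # product of two kernels is a kernel
      return k_linear(x, z) * k_poly(x, z)

  def gram(kernel, X):
      n = len(X)
      return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

  X = np.random.default_rng(0).standard_normal((6, 3))
  for kernel in (k_sum, k_prod):
      eigvals = np.linalg.eigvalsh(gram(kernel, X))
      print(eigvals.min() >= -1e-9)     # True: no (significantly) negative eigenvalues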


Radial Basis Kernel
• The idea is that every training point (x(i), y(i)) ∈ Sn influences h(x).

• The influence is based on ||x − x(i)|| (it acts through the distance).

• A point in the dataset will affect nearby points more than it affects
  faraway points.


Radial Basis Kernel
• Standard form:

  h(x) = ∑j αj y(j) exp(−γ ||x − x(j)||²),  summing over j = 1, …, n

• Let K(x, x′) = exp(−γ ||x − x′||²) — the radial basis function kernel. Then

  h(x) = ∑j αj y(j) K(x, x(j))
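A minimal sketch (my own code) of the radial basis kernel and the resulting classifier; the α values are assumed to come from a dual learner such as the kernel perceptron above (here they are simply set to 1 for illustration):

  import numpy as np

  def rbf_kernel(x, z, gamma=1.0):
      """Radial basis function kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
      diff = x - z
      return np.exp(-gamma * (diff @ diff))

  def rbf_classifier(x, X, y, alpha, gamma=1.0):
      """h(x) = sign( sum_j alpha_j y_j K(x, x_j) ): nearby training points dominate the sum."""
      scores = [alpha[j] * y[j] * rbf_kernel(x, X[j], gamma) for j in range(len(X))]
      return np.sign(sum(scores))

  X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
  y = np.array([+1, +1, -1, -1])
  alpha = np.ones(len(X))
  print([int(rbf_classifier(xi, X, y, alpha)) for xi in X])   # [1, 1, -1, -1]: each point is
                                                              # dominated by its own nearby label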


Radial Basis Kernel
  K(x, x′) = exp(−½ ||x − x′||²)

  h(x) = ∑j αj y(j) K(x, x(j))

[Figure slides: examples of the decision boundaries produced by the radial basis kernel classifier.]


