
A Short Note About the Application of Polynomial Kernels with Fractional Degree in Support Vector Learning

Rolf Rossius, Gérard Zenker, Andreas Ittner, and Werner Dilger

Department of Computer Science
Artificial Intelligence Group
Chemnitz University of Technology
D-09107 Chemnitz
{ros,gze,ait,wdi}@informatik.tu-chemnitz.de
http://www.tu-chemnitz.de/informatik/HomePages/KI/

Abstract. In the mid 90's a fundamentally new Machine Learning approach was developed by V. N. Vapnik: the Support Vector Machine (SVM). This new method can be regarded as a very promising approach and is getting more and more attention in the fields where neural networks and decision tree methods are applied. Whilst neural networks may be considered (correctly or not) to be well understood and are in wide use, Support Vector Learning still has some rough edges in its theoretical details, and its inherent numerical tasks prevent it from being easily applied in practice. This paper picks up a new aspect - the use of fractional degrees on polynomial kernels in the SVM - discovered in the course of an implementation of the algorithm. Fractional degrees on polynomial kernels broaden the capabilities of the SVM and offer the possibility to deal with feature spaces of infinite dimension. We introduce a method to simplify the quadratic programming problem that forms the core of the SVM.

1 Introduction

Well known representatives of classification and prediction methods in the field of Machine Learning are neural networks and methods for generating different kinds of decision trees. An innovative and still relatively unknown learning approach is the Support Vector Machine (SVM), developed by V. N. Vapnik in the mid 90's. Support Vector Learning [IRZ98] is not just another approach to learning techniques; rather, it can be regarded as a fundamentally new philosophy in the area of Machine Learning.
The underlying principle of the SVM is the principle of Structural Risk Minimization (SRM) [Vap95]. In contrast to a pure minimization of the empirical risk, the SRM is based on the "idea of simplicity" and unifies Empirical Risk Minimization and the problem of Model Selection. The binary classifier sought for the problem

$(x_1, y_1), \ldots, (x_l, y_l), \quad x_i \in \mathbb{R}^n, \; y_i \in \{+1, -1\},$

has to be a function from the set

$\{f_\alpha : \alpha \in \Gamma\}, \quad f_\alpha : \mathbb{R}^n \to \{+1, -1\}, \quad x \mapsto y,$

and should reflect the real inherent essence of the given learning problem. This essence can be regarded as the simplest (in some sense) separation of the feature space. Here simplicity will be formalized by means of the VC dimension, i. e. a measure of the capacity of the considered set of feasible functions, e. g. the family of separating hyperplanes. The SRM is enforced by a controlled bounding of the VC dimension of the set $\{f_\alpha\}$ and ensures the excellent generalization ability of the SVM. The underlying theory of the SRM will not be explained in detail in this paper. We refer to [Vap95], which covers the SRM and its application in the SVM.
The separating hyperplane is characterized by (w, x) + b = 0. The distance
between the hyperplane and the examples should be maximized, i. e. one has to
solve a problem of mathematical programming. For the non-separable case slack variables $\xi_i \ge 0$ are introduced, which leads to:

$\frac{1}{2}(w, w) + C \sum_{i=1}^{l} \xi_i \to \min$
$y_i \left[ (w, x_i) + b \right] \ge 1 - \xi_i \quad \forall i = 1, \ldots, l$   (1)
$\xi_i \ge 0 \quad \forall i = 1, \ldots, l,$

where the capacity parameter C > 0 controls the interrelationship between the accuracy of the classifier on the learning set and its ability to generalize, i. e. the accuracy on an unseen test set.
The vector w, as the solution of (1), determines the optimal hyperplane. It can be expressed as a linear combination of a possibly small subset of the whole learning data:

$w = \sum_{i=1}^{l} \alpha_i y_i x_i = \sum_{SV} \alpha_i y_i x_i .$   (2)

Support Vectors are those vectors $x_i$ which satisfy $y_i \left[ (w, x_i) + b \right] = 1$, i. e. which have a nonzero $\alpha_i$ and effectively contribute to the description of the separating hyperplane. Hence in (2) one can reduce w to a linear combination of support vectors. Less formally, these support vectors can be viewed as the examples on the frontline, guarding their own class against the examples of the other one, and they are essential for the concept to be learned.
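As an informal illustration of the expansion (2) (not part of the original implementation; the function name, the threshold on the $\alpha_i$, the offset b and the toy values are assumptions made purely for illustration), a minimal Python sketch evaluating the resulting classifier could look as follows:

import numpy as np

# Sketch of (2): given coefficients alpha_i, labels y_i, training points x_i
# and an offset b (all assumed to be already known), the normal vector w is a
# linear combination of the support vectors, i.e. the points with nonzero alpha_i.
def decision(x, alphas, ys, xs, b):
    sv = alphas > 1e-12                                        # support vectors only
    w = np.sum((alphas[sv] * ys[sv])[:, None] * xs[sv], axis=0)
    return np.sign(np.dot(w, x) + b)

alphas = np.array([0.0, 0.7, 0.7, 0.0])                        # toy values
ys = np.array([+1.0, +1.0, -1.0, -1.0])
xs = np.array([[2.0, 2.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -2.0]])
print(decision(np.array([0.5, 0.5]), alphas, ys, xs, b=0.0))   # prints 1.0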
Considering (2), one has to solve the following optimization problem:

$\Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T A \Lambda \to \max$
$0 \le \Lambda \le C \mathbf{1}$   (3)
$\Lambda^T Y = 0,$

with $\Lambda = (\alpha_1, \ldots, \alpha_l)$, $\mathbf{1} = (1, \ldots, 1)$, and $Y = (y_1, \ldots, y_l)$. The Hesse matrix A consists of the elements $A_{ij} = y_i y_j (x_i, x_j)$ for $i, j = 1, \ldots, l$ [CV95].
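The Hesse matrix itself is straightforward to set up. The sketch below (an illustration, not the authors' code; hesse_matrix and the toy data are invented names and values) builds $A_{ij} = y_i y_j (x_i, x_j)$ for the plain linear kernel:

import numpy as np

# A_ij = y_i y_j <x_i, x_j>: the Gram matrix of the data multiplied entrywise
# by the outer product of the label vector with itself.
def hesse_matrix(xs, ys):
    gram = xs @ xs.T                 # pairwise dot products <x_i, x_j>
    return np.outer(ys, ys) * gram   # attach the label signs

xs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = np.array([+1.0, -1.0, +1.0])
print(hesse_matrix(xs, ys))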
However in the general case the linear separation in the original feature space
will not provide a sufficient classifier. Therefore the original feature space is

expanded to a very high dimensional image space by (e. g.):

$\Phi : \mathbb{R}^n \to \mathbb{R}^N, \quad n \ll N, \quad \Phi(x) = \left( 1, \gamma_1 x_1, \ldots, \gamma_n x_n, \gamma_{n+1} x_1^2, \gamma_{n+2} x_1 x_2, \ldots, \gamma_k x_n^d \right),$

and in this space the linear separation is performed. An inverse transformation back into $\mathbb{R}^n$ results in a non-linear separation in the original space of the task-supplied features:

$f(x) = (w, \Phi(x)) + b .$

It is not necessary to expand the feature space explicitly. One way to do the mapping implicitly is to use kernels K(u, v) (respectively dot products). In this context the fundamental interrelation is:

$K(u, v) = (\Phi(u), \Phi(v)) .$

The symmetric function K(u, v) may be a dot product for the high dimensional image space if its eigenvalues are positive. One rather simple type of such kernels is representable as

$K(u, v) = ((u, v) + 1)^d, \quad d = 1, 2, \ldots$   (4)

with degree d as an integer. Another choice may be $K(u, v) = e^{-\|u - v\|}$. A generalized kind of the kernel (4) will be examined in this paper.
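For concreteness, a small Python sketch of both kernel types is given below (illustrative only; the function names are assumptions, and the explicit promotion to a complex number anticipates the fractional degrees of Section 2, where the base may become negative):

import numpy as np

# Polynomial kernel ((u, v) + 1)^d. For fractional d the base may be negative,
# so it is cast to a complex number and the principal value of the power is taken.
def poly_kernel(u, v, d):
    return complex(np.dot(u, v) + 1.0) ** d

# Exponential kernel exp(-||u - v||).
def exp_kernel(u, v):
    return np.exp(-np.linalg.norm(u - v))

u, v = np.array([-1.0, -0.8]), np.array([1.0, 0.9])
print(poly_kernel(u, v, 2))      # integer degree: (essentially) real value
print(poly_kernel(u, v, 2.5))    # fractional degree, negative base: complex value
print(exp_kernel(u, v))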

2 Polynomial Kernels with Fractional Degree


Interestingly, a fixed chosen kernel K(u, v) induces not just one transformation but a whole manifold of such mappings $\Phi$. Even the dimensionality of the image space $\mathbb{R}^N$ is not determined. From (4), for d = 2 and n = 2 one gets:

$\Phi(u) = \left( 1, \sqrt{2}\, u_1, \sqrt{2}\, u_2, u_1^2, \sqrt{2}\, u_1 u_2, u_2^2 \right), \quad u = (u_1, u_2),$

as well as an infinite number of others.


Therefore a question arises: choosing a kernel K(u, v) - which is the space of smallest dimension for an image of $\Phi$? The answer for $d \in \mathbb{N}$ is $\binom{n+d}{d}$ (or equivalently $\binom{n+d}{n}$). While selecting an appropriate kernel K via the exponent d, there are huge discontinuities in the dimensionalities of the corresponding image spaces. The approximation and generalization capacity may be controlled by bounding the norm of the separating hyperplane, but another tuning parameter will still be there: the dimensionality (cf. Table 1).
Using a fractional exponent in the kernel (4) we encounter an interesting property: the dot product (u, v) may be less than -1, so that we have to raise a negative base to a fractional power.¹

I" \ all 11 21 31 al 51 6 7
2 2 6 10 15 21 28 36
16 16 1531 9 6 9 4 . 8 x l 0 s 2.0×104 7.5×104 2.5×105
256 256 3.3 × 1042.9 × 106 1.9 x 10s 9.7 × 109 4,2 x 1011 1.6 × 10 la

Table 1. Dimension of image space for polynomial kernel with exponent d and n origi-
nal features. The dimension of the image space (where the linear separation takes place)
grows quite rapidly - an explicit computation in this space would be impossible. But as
mentioned before, this is fortunately not required. Rather the value itself should guide
the user to a conjecture about the separating abilities of the associated hyperplane.
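The entries of Table 1 follow directly from the binomial coefficient given above; a short illustrative check in Python:

from math import comb

# Dimension of the smallest image space for integer degree d and n features.
for n in (2, 16, 256):
    print(n, [comb(n + d, d) for d in range(1, 8)])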

Hence the Hesse matrix A will no longer be real valued and therefore symmetric ($A^T = A$), but will in fact contain complex entries. Nevertheless, A has the property of Hermiticity ($\bar{A}^T = A$). This allows for a new formulation of (3). Because

$\Lambda^T A \Lambda = \Lambda^T A^T \Lambda = \Lambda^T \tfrac{1}{2}(A + A^T) \Lambda = \Lambda^T \tfrac{1}{2}(A + \bar{A}) \Lambda = \Lambda^T \mathrm{Re}(A) \Lambda,$

we equivalently solve

$\Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T \mathrm{Re}(A) \Lambda \to \max$
$0 \le \Lambda \le C \mathbf{1}$   (5)
$\Lambda^T Y = 0$

instead, and get rid of the complex entries. (Re(A) denotes the real part.)
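As a rough illustration of (5) (not the authors' implementation), the sketch below sets up Re(A) for an arbitrary, possibly complex-valued kernel and hands the problem to a general-purpose solver. The function names, the choice of SLSQP and the toy data are assumptions; in practice a dedicated QP code would be used instead:

import numpy as np
from scipy.optimize import minimize

# Solve (5): maximize Lambda^T 1 - 1/2 Lambda^T Re(A) Lambda
# subject to 0 <= Lambda <= C 1 and Lambda^T Y = 0.
def train_svm_dual(xs, ys, kernel, C=1.0):
    l = len(ys)
    K = np.array([[kernel(xs[i], xs[j]) for j in range(l)] for i in range(l)])
    ReA = np.real(np.outer(ys, ys) * K)              # Re(A) of (5)

    def neg_objective(a):                            # minimize the negative of (5)
        return -(a.sum() - 0.5 * a @ ReA @ a)

    res = minimize(neg_objective, np.zeros(l), method="SLSQP",
                   bounds=[(0.0, C)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a @ ys}])
    return res.x                                     # the coefficients alpha_i

xs = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
ys = np.array([+1.0, +1.0, -1.0, -1.0])
alphas = train_svm_dual(xs, ys, lambda u, v: complex(np.dot(u, v) + 1.0) ** 1.5)
print(np.round(alphas, 3))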
Expanding the kernel for arbitrary exponents d we get, according to Taylor:

$((u, v) + 1)^d = 1 + d\,(u, v) + \frac{d(d-1)}{2!} (u, v)^2 + \frac{d(d-1)(d-2)}{3!} (u, v)^3 + \frac{d(d-1)(d-2)(d-3)}{4!} (u, v)^4 + \ldots$

Non-integer exponents do not terminate the series like the integer ones do, but the influence of the high-order terms decreases nevertheless. In contrast to kernels with an integer exponent, there are no mappings $\Phi$ corresponding to such a fractional exponent kernel which have an image space of finite dimension.
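A small numerical check of this behaviour (illustrative only; note that the binomial series converges only for |(u, v)| < 1, which is assumed here):

# Compare the truncated Taylor series of (1 + t)^d with the exact value,
# where t plays the role of the dot product (u, v) and d is fractional.
def truncated(t, d, terms):
    value, coeff = 0.0, 1.0                  # coeff = d(d-1)...(d-k+1) / k!
    for k in range(terms):
        value += coeff * t ** k
        coeff *= (d - k) / (k + 1)
    return value

t, d = 0.4, 2.5
exact = (1.0 + t) ** d
for terms in (2, 4, 8, 16):
    print(terms, round(truncated(t, d, terms), 6), round(exact, 6))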
Fractional degrees allow a more continuous range of concepts. The resulting separating hyperplanes smoothly change their shapes with the exponent d. This is of importance especially for domains dealing with feature spaces which already cover tens, hundreds or more dimensions (e. g. recognition of graphical images), where a lower degree of a polynomial kernel is preferred. A simple artificial problem in a two dimensional feature space is presented in Figure 1 [Fri93].

¹ One could imagine this in the original space: the representing vectors u and v of the two participating examples form a sufficiently obtuse angle.

Fig. 1. Continuous variation of exponent d. 226 examples, class distribution 93/133, 90 % used to generate the separation. Two properties of the problem are significant: the low dimensionality of the original feature space and the difficult, crossed arrangement of examples in the lower right area. As expected, a somewhat higher exponent of the polynomial kernel is necessary for the approximation of the concept.

3 The "1/2 Trick"


Realizing the SVM as a whole, the solution of the quadratic optimization problem (quadratic programming, QP) - actually a series of such problems, with different parameters - constitutes the real amount of work. Generally the QP task is for the most part determined by the calculation of function values and gradients (or estimates thereof). It causes more difficulties here because of the (potentially) large Hesse matrix and its non-sparsity.
We tackle this by choosing a kernel of the type $((u, v) + 1)^d$ with $d = m + \frac{1}{2}$ and $m \in \mathbb{N}$. The corresponding entry in the resulting Hesse matrix (Re(A) in (5)) will vanish for negative $((u, v) + 1)$.
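The effect can be checked directly (illustrative sketch assuming principal values of the complex power): for $d = m + \frac{1}{2}$ a negative base raised to d is purely imaginary, so its real part is (numerically) zero.

d = 2.5                                  # m = 2, i.e. d = m + 1/2
for base in (-3.0, -0.4, 0.7, 2.0):      # base plays the role of (u, v) + 1
    val = complex(base) ** d             # principal value; purely imaginary for base < 0
    print(base, val, "Re =", round(val.real, 12))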
The SVM algorithm selects a separating hyperplane according to a criterion of sufficient values on the training examples as well as the minimization of the norm of the hyperplane. Unfortunately, $\Phi$ is nonlinear - the resulting shape of the function, and thus the border between the predicted areas of both classes, varies with uniform translations of the examples in the feature space. For instance, the resulting separation lines for differently centered sets of the well known XOR problem are depicted in Figure 2. A second degree kernel is used.
Despite the non-invariance against uniform translation of the examples in the feature space, one can center the set at the origin of the co-ordinate system to obtain a sufficiently obtuse angle between a large number of pairs of examples. This results in a sparser Hesse matrix for the QP task. Up to 50 % of the entries may be zeroed by means of this smart approach.
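A quick experiment on random data (an illustration only; the data distribution, the degree and the zero tolerance are assumptions) shows how centering increases the number of vanishing entries in Re(A) when $d = m + \frac{1}{2}$ is used. The label factors $y_i y_j$ are omitted since they only flip signs:

import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=5.0, scale=1.0, size=(200, 2))   # data far away from the origin
d = 2.5

def re_kernel_matrix(points, d):
    base = (points @ points.T + 1.0).astype(complex)
    return np.real(base ** d)                        # real part of the kernel values

for name, pts in (("raw data", xs), ("centered", xs - xs.mean(axis=0))):
    A = re_kernel_matrix(pts, d)
    print(name, "fraction of (numerically) zero entries:",
          round(float(np.mean(np.isclose(A, 0.0))), 3))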

Fig. 2. Separation lines for differently centered versions of the XOR problem (second degree kernel). The examples $(x - \delta, y - \delta)$ and $(x + \delta, y + \delta)$ are members of one class, while the two other examples $(x - \delta, y + \delta)$ and $(x + \delta, y - \delta)$ belong to a second class. The four points are centered on (x, y).

4 Summary

The Support Vector algorithm shows some promising properties but needs some refinement, especially on the level of practical realization, to soften the enormous effort of finding the "simplest" explanation for a learning problem. Polynomial kernels with fractional degrees provide a broader range of concepts as well as a way to reduce the numerical effort to be spent in the QP.
The algorithm works well with a feature space of "similar" features. It is often preferable to apply a componentwise transformation that normalizes the data before the number crunching task of the SVM itself. For specific domains this could be done in the kernel function.
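A componentwise normalization of the kind mentioned above could, for example, look as follows (a sketch under the assumption of zero-mean, unit-variance scaling; other normalizations are equally possible):

import numpy as np

# Rescale every feature to zero mean and unit variance before training.
def normalize(xs):
    mean, std = xs.mean(axis=0), xs.std(axis=0)
    std[std == 0.0] = 1.0                 # leave constant features untouched
    return (xs - mean) / std

xs = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
print(normalize(xs))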

References

[CV95] C. Cortes and V. N. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.
[Fri93] B. Fritzke. Growing cell structures - a self-organizing network for unsupervised and supervised learning. Technical Report 93-026, International Computer Science Institute, Berkeley, California, 1993.
[IRZ98] A. Ittner, R. Rossius, and G. Zenker. Support Vector Learning. Technical Report CSR-98, Chemnitz University of Technology, Chemnitz, Germany, 1998.
[Vap95] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
