
Foundations of Machine Learning

Kernel Methods

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Motivation
• Efficient computation of inner products in high dimension.
• Non-linear decision boundary.
• Non-vectorial inputs.
• Flexible selection of more complex features.


This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels



Non-Linear Separation

Linear separation impossible in most problems.
Non-linear mapping from input space to high-dimensional feature space: Φ: X → F.
Generalization ability: independent of dim(F), depends only on margin and sample size.
Kernel Methods
Idea:
• Define K: X × X → ℝ, called a kernel, such that:
Φ(x) · Φ(y) = K(x, y).
• K is often interpreted as a similarity measure.
Benefits:
• Efficiency: K is often more efficient to compute than Φ and the dot product.
• Flexibility: K can be chosen arbitrarily so long as the existence of Φ is guaranteed (PDS condition or Mercer's condition).
PDS Condition
Definition: a kernel K: X × X → ℝ is positive definite symmetric (PDS) if for any {x1, ..., xm} ⊆ X, the matrix K = [K(xi, xj)]ij ∈ ℝ^{m×m} is symmetric positive semi-definite (SPSD).
K is SPSD if it is symmetric and one of the two equivalent conditions holds:
• its eigenvalues are non-negative;
• for any c ∈ ℝ^{m×1}, c⊤Kc = Σ_{i,j=1}^{m} ci cj K(xi, xj) ≥ 0.
Terminology: PDS for kernels, SPSD for kernel matrices (see (Berg et al., 1984)).
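As a quick numerical illustration of the PDS condition (a minimal sketch that is not part of the original slides, assuming NumPy), one can build the kernel matrix of a Gaussian kernel on a small sample and check both equivalent SPSD conditions:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # sample {x_1, ..., x_m}, m = 10

# Kernel matrix K = [K(x_i, x_j)]_ij
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])

# Condition 1: symmetry and non-negative eigenvalues (up to round-off)
assert np.allclose(K, K.T)
assert np.min(np.linalg.eigvalsh(K)) >= -1e-10

# Condition 2: c^T K c >= 0 for arbitrary coefficient vectors c
for _ in range(100):
    c = rng.normal(size=10)
    assert c @ K @ c >= -1e-10
```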
Example - Polynomial Kernels
Definition: for all x, y ∈ ℝ^N, K(x, y) = (x · y + c)^d, with c > 0.
Example: for N = 2 and d = 2,
K(x, y) = (x1 y1 + x2 y2 + c)²
        = (x1², x2², √2 x1 x2, √(2c) x1, √(2c) x2, c) · (y1², y2², √2 y1 y2, √(2c) y1, √(2c) y2, c).
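The explicit 6-dimensional feature map above can be checked directly (a minimal sketch, not from the slides, assuming NumPy):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x . y + c)^d."""
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    """Explicit feature map for N = 2, d = 2 (6 dimensions)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])
# The kernel value and the explicit inner product agree (up to round-off).
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))
```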
XOR Problem
Use the second-degree polynomial kernel with c = 1. The corresponding feature map sends the four XOR points to
(1, 1)   → (1, 1, +√2, +√2, +√2, 1)
(-1, 1)  → (1, 1, −√2, −√2, +√2, 1)
(1, -1)  → (1, 1, −√2, +√2, −√2, 1)
(-1, -1) → (1, 1, +√2, −√2, −√2, 1)
[Figure: the four points in the input space (x1, x2), where they are linearly non-separable, and in the (√2 x1, √2 x1 x2) plane of the feature space, where they are linearly separable by x1 x2 = 0.]
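A small check of this separation (a sketch that is not part of the slides, assuming NumPy; phi is the degree-2 feature map from the previous slide):

```python
import numpy as np

def phi(x, c=1.0):
    """Degree-2 polynomial feature map, c = 1."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2*c)*x1, np.sqrt(2*c)*x2, c])

points = [(1, 1), (-1, 1), (1, -1), (-1, -1)]
labels = [1, -1, -1, 1]                 # XOR: y = +1 iff x1 * x2 > 0

# In feature space the single coordinate sqrt(2) x1 x2 separates the classes:
# the hyperplane w . phi(x) = 0 with w = e_3 classifies all four points correctly.
w = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
for x, y in zip(points, labels):
    assert np.sign(w @ phi(np.array(x))) == y
```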


Normalized Kernels
Definition: the normalized kernel K' associated to a kernel K is defined for all x, x' ∈ X by
K'(x, x') = 0 if K(x, x) = 0 or K(x', x') = 0,
K'(x, x') = K(x, x') / √(K(x, x) K(x', x')) otherwise.
• If K is PDS, then K' is PDS:
Σ_{i,j=1}^{m} ci cj K(xi, xj) / √(K(xi, xi) K(xj, xj)) = Σ_{i,j=1}^{m} ci cj ⟨Φ(xi), Φ(xj)⟩_H / (∥Φ(xi)∥_H ∥Φ(xj)∥_H) = ∥ Σ_{i=1}^{m} ci Φ(xi) / ∥Φ(xi)∥_H ∥²_H ≥ 0.
• By definition, for all x with K(x, x) ≠ 0, K'(x, x) = 1.
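In matrix form, normalization divides each entry of the kernel matrix by the square roots of the corresponding diagonal entries (a minimal sketch, not from the slides, assuming NumPy):

```python
import numpy as np

def normalize_gram(K):
    """Normalized kernel matrix K'_ij = K_ij / sqrt(K_ii K_jj), 0 where a diagonal entry is 0."""
    d = np.sqrt(np.diag(K))
    with np.errstate(divide="ignore", invalid="ignore"):
        Kn = K / np.outer(d, d)
    Kn[~np.isfinite(Kn)] = 0.0
    return Kn

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = (X @ X.T + 1.0) ** 2              # polynomial-kernel Gram matrix
Kn = normalize_gram(K)
assert np.allclose(np.diag(Kn), 1.0)                 # K'(x, x) = 1
assert np.min(np.linalg.eigvalsh(Kn)) >= -1e-10      # still PSD
```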
Other Standard PDS Kernels
Gaussian kernels:
K(x, y) = exp(−∥x − y∥² / (2σ²)),  σ ≠ 0.
• Normalized kernel associated to the kernel (x, x') ↦ exp(x · x' / σ²).
Sigmoid kernels:
K(x, y) = tanh(a(x · y) + b),  a, b ≥ 0.
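The first bullet can be verified numerically: normalizing exp(x · x'/σ²) recovers the Gaussian kernel (a sketch that is not part of the slides, assuming NumPy):

```python
import numpy as np

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def exp_inner(x, y, sigma=1.0):
    """Un-normalized kernel (x, x') -> exp(x . x' / sigma^2)."""
    return np.exp(np.dot(x, y) / sigma ** 2)

x, y = np.array([0.3, -1.0]), np.array([1.5, 0.2])
lhs = gaussian(x, y)
rhs = exp_inner(x, y) / np.sqrt(exp_inner(x, x) * exp_inner(y, y))
assert np.isclose(lhs, rhs)   # the Gaussian kernel is the normalized kernel of exp(x.x'/sigma^2)
```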


Reproducing Kernel Hilbert Space
(Aronszajn, 1950)
Theorem: let K: X × X → ℝ be a PDS kernel. Then, there exists a Hilbert space H and a mapping Φ from X to H such that
∀x, y ∈ X, K(x, y) = Φ(x) · Φ(y).
Proof: for any x ∈ X, define Φ(x) ∈ ℝ^X as follows:
∀y ∈ X, Φ(x)(y) = K(x, y).
• Let H0 = { Σ_{i∈I} ai Φ(xi) : ai ∈ ℝ, xi ∈ X, card(I) < ∞ }.
• We are going to define an inner product ⟨·, ·⟩ on H0.


• Definition: for any f = Σ_{i∈I} ai Φ(xi) and g = Σ_{j∈J} bj Φ(yj),
⟨f, g⟩ = Σ_{i∈I, j∈J} ai bj K(xi, yj) = Σ_{j∈J} bj f(yj) = Σ_{i∈I} ai g(xi).
• ⟨·, ·⟩ does not depend on the representations of f and g.
• ⟨·, ·⟩ is bilinear and symmetric.
• ⟨·, ·⟩ is positive semi-definite since K is PDS: for any f,
⟨f, f⟩ = Σ_{i,j∈I} ai aj K(xi, xj) ≥ 0.
• Note: for any f1, ..., fm and c1, ..., cm,
Σ_{i,j=1}^{m} ci cj ⟨fi, fj⟩ = ⟨ Σ_{i=1}^{m} ci fi, Σ_{j=1}^{m} cj fj ⟩ ≥ 0.
Thus, ⟨·, ·⟩ is a PDS kernel on H0.
• ⟨·, ·⟩ is definite:
• first, a Cauchy-Schwarz inequality for PDS kernels: if K is PDS, then the matrix M = [K(x, x), K(x, y); K(y, x), K(y, y)] is SPSD for all x, y ∈ X. In particular, the product of its eigenvalues, det(M), is non-negative:
det(M) = K(x, x) K(y, y) − K(x, y)² ≥ 0.
• since ⟨·, ·⟩ is a PDS kernel, for any f ∈ H0 and x ∈ X,
⟨f, Φ(x)⟩² ≤ ⟨f, f⟩ ⟨Φ(x), Φ(x)⟩.
• observe the reproducing property of ⟨·, ·⟩:
∀f ∈ H0, ∀x ∈ X, f(x) = Σ_{i∈I} ai K(xi, x) = ⟨f, Φ(x)⟩.
Thus, [f(x)]² ≤ ⟨f, f⟩ K(x, x) for all x ∈ X, which shows the definiteness of ⟨·, ·⟩.
• Thus, ⟨·, ·⟩ defines an inner product on H0, which thereby becomes a pre-Hilbert space.
• H0 can be completed to form a Hilbert space H in which it is dense.
Notes:
• H is called the reproducing kernel Hilbert space (RKHS) associated to K.
• A Hilbert space such that there exists Φ: X → H with K(x, y) = Φ(x) · Φ(y) for all x, y ∈ X is also called a feature space associated to K. Φ is called a feature mapping.
• Feature spaces associated to K are in general not unique.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels



SVMs with PDS Kernels
(Boser, Guyon, and Vapnik, 1992)
Constrained optimization (K(xi, xj) plays the role of Φ(xi) · Φ(xj)):
max_α Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj K(xi, xj)
subject to: 0 ≤ αi ≤ C ∧ Σ_{i=1}^{m} αi yi = 0, i ∈ [1, m].
Solution:
h(x) = sgn( Σ_{i=1}^{m} αi yi K(xi, x) + b ),
with b = yi − Σ_{j=1}^{m} αj yj K(xj, xi) for any xi with 0 < αi < C.
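In practice this dual is what a kernel SVM solver optimizes; a minimal sketch (not part of the slides) of training with a precomputed Gaussian kernel matrix, assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] * X[:, 1])          # XOR-like labels, not linearly separable

def gram(A, B, sigma=1.0):
    """Gaussian Gram matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

clf = SVC(kernel="precomputed", C=1.0)  # solves the dual above over the alpha_i
clf.fit(gram(X, X), y)

X_test = rng.normal(size=(5, 2))
pred = clf.predict(gram(X_test, X))     # kernel values between test and training points
```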
Rad. Complexity of Kernel-Based Hypotheses

Theorem: let K: X × X → ℝ be a PDS kernel and let Φ: X → H be a feature mapping associated to K. Let S ⊆ {x : K(x, x) ≤ R²} be a sample of size m, and let H = {x ↦ w · Φ(x) : ∥w∥_H ≤ Λ}. Then,
R̂_S(H) ≤ Λ √(Tr[K]) / m ≤ √(R² Λ² / m).
Proof:
R̂_S(H) = (1/m) E_σ[ sup_{∥w∥ ≤ Λ} w · Σ_{i=1}^{m} σi Φ(xi) ] ≤ (Λ/m) E_σ[ ∥ Σ_{i=1}^{m} σi Φ(xi) ∥_H ]
≤ (Λ/m) [ E_σ ∥ Σ_{i=1}^{m} σi Φ(xi) ∥²_H ]^{1/2}      (Jensen's inequality)
= (Λ/m) [ Σ_{i=1}^{m} K(xi, xi) ]^{1/2} = Λ √(Tr[K]) / m ≤ √(R² Λ² / m).
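The empirical Rademacher complexity here equals (Λ/m) E_σ[√(σ⊤Kσ)], which can be estimated by Monte Carlo and compared with the bound Λ√(Tr[K])/m (a numerical sketch that is not part of the slides, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                   # Gaussian kernel matrix of the sample
m, Lam = len(X), 1.0

# sup_{||w|| <= Lam} w . sum_i s_i Phi(x_i) = Lam * sqrt(s^T K s)
sigmas = rng.choice([-1.0, 1.0], size=(10000, m))
rad_hat = Lam / m * np.mean(np.sqrt(np.einsum("si,ij,sj->s", sigmas, K, sigmas)))

bound = Lam * np.sqrt(np.trace(K)) / m
print(rad_hat, bound)                   # the estimate lies below the bound (Jensen's inequality)
```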
Generalization: Representer Theorem
(Kimeldorf and Wahba, 1971)
Theorem: let K: X × X → ℝ be a PDS kernel with H the corresponding RKHS. Then, for any non-decreasing function G: ℝ → ℝ and any L: ℝ^m → ℝ ∪ {+∞}, the problem
argmin_{h∈H} F(h) = argmin_{h∈H} G(∥h∥_H) + L(h(x1), ..., h(xm))
admits a solution of the form h* = Σ_{i=1}^{m} αi K(xi, ·).
If G is further assumed to be increasing, then any solution has this form.


• Proof: let H1 = span({K(xi, ·) : i ∈ [1, m]}). Any h ∈ H admits the decomposition h = h1 + h⊥ according to H = H1 ⊕ H1⊥.
• Since G is non-decreasing,
G(∥h1∥_H) ≤ G(√(∥h1∥²_H + ∥h⊥∥²_H)) = G(∥h∥_H).
• By the reproducing property, for all i ∈ [1, m],
h(xi) = ⟨h, K(xi, ·)⟩ = ⟨h1, K(xi, ·)⟩ = h1(xi).
• Thus, L(h(x1), ..., h(xm)) = L(h1(x1), ..., h1(xm)) and F(h1) ≤ F(h).
• If G is increasing, then F(h1) < F(h) when h⊥ ≠ 0, and any solution of the optimization problem must be in H1.


Kernel-Based Algorithms
PDS kernels used to extend a variety of algorithms
in classification and other areas:
• regression.
• ranking.
• dimensionality reduction.
• clustering.
But, how do we define PDS kernels?



This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels



Closure Properties of PDS Kernels
Theorem: Positive definite symmetric (PDS)
kernels are closed under:
• sum,
• product,
• tensor product,
• pointwise limit,
• composition with a power series with non-
negative coefficients.



Closure Properties - Proof
Proof: closure under sum:
(c⊤Kc ≥ 0) ∧ (c⊤K'c ≥ 0) ⇒ c⊤(K + K')c ≥ 0.
• Closure under product: writing K = MM⊤,
Σ_{i,j=1}^{m} ci cj (Kij K'ij) = Σ_{i,j=1}^{m} ci cj [ Σ_{k=1}^{m} Mik Mjk ] K'ij
= Σ_{k=1}^{m} [ Σ_{i,j=1}^{m} ci cj Mik Mjk K'ij ]
= Σ_{k=1}^{m} zk⊤ K' zk ≥ 0,   with zk = (c1 M1k, ..., cm Mmk)⊤.


• Closure under tensor product:
• definition: for all x1, x2, y1, y2 ∈ X,
(K1 ⊗ K2)(x1, y1, x2, y2) = K1(x1, x2) K2(y1, y2).
• thus, it is a PDS kernel as the product of the two kernels
(x1, y1, x2, y2) ↦ K1(x1, x2) and (x1, y1, x2, y2) ↦ K2(y1, y2).
• Closure under pointwise limit: if for all x, y ∈ X,
lim_{n→∞} Kn(x, y) = K(x, y),
then (∀n, c⊤Kn c ≥ 0) ⇒ lim_{n→∞} c⊤Kn c = c⊤Kc ≥ 0.


• Closure under composition with a power series:
• assumptions: K is a PDS kernel with |K(x, y)| < ρ for all x, y ∈ X, and f(x) = Σ_{n=0}^{∞} an xⁿ, an ≥ 0, is a power series with radius of convergence ρ.
• f ∘ K is a PDS kernel: Kⁿ is PDS by closure under product, Σ_{n=0}^{N} an Kⁿ is PDS by closure under sum, and f ∘ K follows by closure under pointwise limit.
Example: for any PDS kernel K, exp(K) is PDS.
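These closure properties can be sanity-checked on kernel matrices (a numerical sketch, not part of the slides, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K1 = np.exp(-d2 / 2.0)                 # Gaussian kernel matrix (PDS)
K2 = (X @ X.T + 1.0) ** 2              # polynomial kernel matrix (PDS)

def min_eig(K):
    return np.min(np.linalg.eigvalsh(K))

# Closure properties, checked up to round-off:
assert min_eig(K1 + K2) >= -1e-8       # sum
assert min_eig(K1 * K2) >= -1e-8       # (entrywise) product
assert min_eig(np.exp(K1)) >= -1e-8    # composition with the power series of exp
```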


This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels



Sequence Kernels
Definition: kernels defined over pairs of strings.
• Motivation: computational biology, text and speech classification.
• Idea: two sequences are related when they share some common substrings or subsequences.
• Example: bigram kernel,
K(x, y) = Σ_{bigram u} countx(u) × county(u).
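A direct implementation of the bigram kernel (a minimal sketch, not from the slides; the rational-kernel machinery below computes the same quantity via transducer composition):

```python
from collections import Counter

def bigram_counts(s):
    """Multiset of bigrams (length-2 substrings) of s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def bigram_kernel(x, y):
    """K(x, y) = sum over bigrams u of count_x(u) * count_y(u)."""
    cx, cy = bigram_counts(x), bigram_counts(y)
    return sum(cx[u] * cy[u] for u in cx)

print(bigram_kernel("abab", "bab"))   # 'ab': 2*1, 'ba': 1*1  ->  3
```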


Weighted Transducers
[Figure: a weighted transducer with states 0, 1, 2 and final state 3/0.1, with transitions such as a:b/0.1, b:a/0.2, a:b/0.5, a:a/0.4, b:a/0.3, b:a/0.6.]
T(x, y) = sum of the weights of all accepting paths with input x and output y.
T(abb, baa) = .1 × .2 × .3 × .1 + .5 × .3 × .6 × .1.


Rational Kernels over Strings
(Cortes et al., 2004)
Definition: a kernel K: Σ* × Σ* → ℝ is rational if K = T for some weighted transducer T.
Definition: let T1: Σ* × Δ* → ℝ and T2: Δ* × Ω* → ℝ be two weighted transducers. Then, the composition of T1 and T2 is defined for all x ∈ Σ*, y ∈ Ω* by
(T1 ∘ T2)(x, y) = Σ_z T1(x, z) T2(z, y).
Definition: the inverse of a transducer T: Σ* × Δ* → ℝ is the transducer T⁻¹: Δ* × Σ* → ℝ obtained from T by swapping input and output labels.
PDS Rational Kernels
General Construction
Theorem: for any weighted transducer T: Σ* × Δ* → ℝ, the function K = T ∘ T⁻¹ is a PDS rational kernel.
Proof: by definition, for all x, y ∈ Σ*,
K(x, y) = Σ_z T(x, z) T(y, z).
• K is the pointwise limit of (Kn)_{n≥0} defined by
∀x, y ∈ Σ*, Kn(x, y) = Σ_{|z|≤n} T(x, z) T(y, z).
• Kn is PDS since for any sample (x1, ..., xm),
Kn = AA⊤ with A = (T(xi, zj))_{i∈[1,m], j∈[1,N]}, where z1, ..., zN enumerate the strings of length at most n.


PDS Sequence Kernels
PDS sequence kernels in computational biology, text classification, and other applications:
• special instances of PDS rational kernels.
• PDS rational kernels are easy to define and modify.
• single general algorithm for their computation: composition + shortest-distance computation.
• no need for a specific 'dynamic-programming' algorithm and proof for each kernel instance.
• general sub-family: based on counting transducers.
Counting Transducers
[Figure: counting transducer TX with a start state 0 and a final state 1/1, self-loops a:ε/1 and b:ε/1 on both states, and a transition X:X/1 from 0 to 1. Example: X = ab, Z = bbabaabba, with the two accepting alignments εεabεεεεε and εεεεεabεε.]
X may be a string or an automaton representing a regular expression.
Count of X in Z: sum of the weights of the accepting paths of Z ∘ TX.


Transducer Counting Bigrams
[Figure: bigram-counting transducer Tbigram with states 0, 1, and final state 2/1; self-loops a:ε/1 and b:ε/1 on states 0 and 2, and transitions a:a/1, b:b/1 from 0 to 1 and from 1 to 2.]
Counts of the bigram ab in Z given by Z ∘ Tbigram ∘ ab.


Transducer Counting Gappy Bigrams

[Figure: gappy-bigram-counting transducer Tgappy bigram, identical to Tbigram except that state 1 carries self-loops a:ε/λ and b:ε/λ penalizing the gap.]
Counts of the gappy bigram ab in Z given by Z ∘ Tgappy bigram ∘ ab, with gap penalty λ ∈ (0, 1).
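What this transducer computes can be reproduced by brute-force enumeration, each occurrence weighted by λ raised to the number of skipped symbols (a minimal sketch, not from the slides; the transducer approach obtains the same counts via composition and shortest-distance computation):

```python
def gappy_bigram_count(z, u, lam):
    """Weighted count of the gappy bigram u = u0 u1 in z: each occurrence
    contributes lam ** (number of symbols skipped between u0 and u1)."""
    total = 0.0
    for i, a in enumerate(z):
        if a != u[0]:
            continue
        for j in range(i + 1, len(z)):
            if z[j] == u[1]:
                total += lam ** (j - i - 1)
    return total

def gappy_bigram_kernel(x, y, lam=0.5, alphabet="ab"):
    """K(x, y) = sum over bigrams u of the gappy counts in x and y."""
    return sum(gappy_bigram_count(x, a + b, lam) * gappy_bigram_count(y, a + b, lam)
               for a in alphabet for b in alphabet)

print(gappy_bigram_count("abb", "ab", 0.5))   # adjacent 'ab' (1) + one-gap 'a_b' (0.5) = 1.5
```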


Composition
Theorem: the composition of two weighted transducers is also a weighted transducer.
Proof: constructive proof based on the composition algorithm.
• states identified with pairs.
• ε-free case: transitions defined by
E = ⋃_{(q1, a, b, w1, q2) ∈ E1, (q1', b, c, w2, q2') ∈ E2} { ((q1, q1'), a, c, w1 w2, (q2, q2')) }.
• general case: use of an intermediate ε-filter.
Composition Algorithm
ε-Free Case
[Figure: an example of ε-free composition of two weighted transducers; each state of the result is a pair of states of the operands, and each transition weight is the product of the weights of the matched transitions.]
Complexity: O(|T1| |T2|) in general, linear in some cases.
Redundant ε-Paths Problem
(Mohri, Pereira, and Riley, 1996; Pereira and Riley, 1997)
[Figure: composition with ε-transitions; the output ε's of T1 are marked as ε2 and the input ε's of T2 as ε1, yielding T̃1 and T̃2, and a filter transducer F removes the redundant ε-paths, so that the composition is computed as T = T̃1 ∘ F ∘ T̃2.]
Kernels for Other Discrete Structures
Similarly, PDS kernels can be defined on other
discrete structures:

• images,
• graphs,
• parse trees,
• automata,
• weighted automata.
This Lecture
Kernels
Kernel-based algorithms
Closure properties
Sequence Kernels
Negative kernels



Questions
Gaussian kernels have the form exp(−d²) where d is a metric.
• For what other functions d does exp(−d²) define a PDS kernel?
• What other PDS kernels can we construct from a metric in a Hilbert space?


Negative Definite Kernels
(Schoenberg, 1938)
Definition: a function K: X × X → ℝ is said to be a negative definite symmetric (NDS) kernel if it is symmetric and if for all {x1, ..., xm} ⊆ X and c ∈ ℝ^{m×1} with 1⊤c = 0,
c⊤Kc ≤ 0.
Clearly, if K is PDS, then −K is NDS, but the converse does not hold in general.


Examples
The squared distance ∥x − y∥² in a Hilbert space H defines an NDS kernel. If Σ_{i=1}^{m} ci = 0,
Σ_{i,j=1}^{m} ci cj ∥xi − xj∥² = Σ_{i,j=1}^{m} ci cj (xi − xj) · (xi − xj)
= Σ_{i,j=1}^{m} ci cj (∥xi∥² + ∥xj∥² − 2 xi · xj)
= Σ_{i,j=1}^{m} ci cj (∥xi∥² + ∥xj∥²) − 2 Σ_{i=1}^{m} ci xi · Σ_{j=1}^{m} cj xj
≤ Σ_{i,j=1}^{m} ci cj (∥xi∥² + ∥xj∥²)
= Σ_{j=1}^{m} cj Σ_{i=1}^{m} ci ∥xi∥² + Σ_{i=1}^{m} ci Σ_{j=1}^{m} cj ∥xj∥² = 0.
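The defining inequality can be checked numerically on random coefficient vectors with zero sum (a sketch, not part of the slides, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))
K = ((X[:, None] - X[None]) ** 2).sum(-1)   # squared-distance matrix ||x_i - x_j||^2

# NDS condition: c^T K c <= 0 for every c with 1^T c = 0
for _ in range(1000):
    c = rng.normal(size=15)
    c -= c.mean()                           # project onto the hyperplane sum(c) = 0
    assert c @ K @ c <= 1e-8
```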


NDS Kernels - Property
(Schoenberg, 1938)
Theorem: let K: X × X → ℝ be an NDS kernel such that for all x, y ∈ X, K(x, y) = 0 iff x = y. Then, there exists a Hilbert space H and a mapping Φ: X → H such that
∀x, y ∈ X, K(x, y) = ∥Φ(x) − Φ(y)∥².
Thus, under the hypothesis of the theorem, √K defines a metric.


PDS and NDS Kernels
(Schoenberg, 1938)
Theorem: let K: X × X → ℝ be a symmetric kernel. Then:
• K is NDS iff exp(−tK) is a PDS kernel for all t > 0.
• Let K' be defined for any x0 by
K'(x, y) = K(x, x0) + K(y, x0) − K(x, y) − K(x0, x0)
for all x, y ∈ X. Then, K is NDS iff K' is PDS.
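Both statements can be illustrated in the NDS ⇒ PDS direction with the squared distance, which is NDS (a numerical sketch, not part of the slides, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))
K = ((X[:, None] - X[None]) ** 2).sum(-1)     # NDS kernel: squared distance

def min_eig(M):
    return np.min(np.linalg.eigvalsh(M))

# K NDS  =>  exp(-t K) PDS for every t > 0 (no significantly negative eigenvalue)
for t in (0.1, 1.0, 10.0):
    assert min_eig(np.exp(-t * K)) >= -1e-8

# K' construction with x0 = x_1:  K'(x, y) = K(x, x0) + K(y, x0) - K(x, y) - K(x0, x0)
Kp = K[:, [0]] + K[[0], :] - K - K[0, 0]
assert min_eig(Kp) >= -1e-8                   # K NDS  =>  K' PDS
```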


Example
The kernel defined by K(x, y) = exp(−t∥x − y∥²) is PDS for all t > 0 since ∥x − y∥² is NDS.
The kernel exp(−|x − y|^p) is not PDS for p > 2. Otherwise, for any t > 0, {x1, ..., xm} ⊆ X and c ∈ ℝ^{m×1},
Σ_{i,j=1}^{m} ci cj e^{−t|xi − xj|^p} = Σ_{i,j=1}^{m} ci cj e^{−|t^{1/p} xi − t^{1/p} xj|^p} ≥ 0.
This would imply that |x − y|^p is NDS for p > 2, but that cannot be (see past homework assignments).


Conclusion
PDS kernels:
• rich mathematical theory and foundation.
• general idea for extending many linear
algorithms to non-linear prediction.
• flexible method: any PDS kernel can be used.
• widely used in modern algorithms and
applications.
• can we further learn a PDS kernel and a hypothesis based on that kernel from labeled data? (see tutorial: http://www.cs.nyu.edu/~mohri/icml2011-tutorial/).
References
• N. Aronszajn, Theory of Reproducing Kernels, Trans. Amer. Math. Soc., 68, 337-404, 1950.

• Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector


machines and other pattern classifiers. In Advances in kernel methods: support vector learning,
pages 43–54. MIT Press, Cambridge, MA, USA, 1999.

• Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic Analysis on
Semigroups. Springer-Verlag: Berlin-New York, 1984.

• Bernhard Boser, Isabelle M. Guyon, and Vladimir Vapnik. A training algorithm for optimal
margin classifiers. In proceedings of COLT 1992, pages 144-152, Pittsburgh, PA, 1992.

• Corinna Cortes, Patrick Haffner, and Mehryar Mohri. Rational Kernels: Theory and
Algorithms. Journal of Machine Learning Research (JMLR), 5:1035-1062, 2004.

• Corinna Cortes and Vladimir Vapnik, Support-Vector Networks, Machine Learning, 20,
1995.

• Kimeldorf, G. and Wahba, G. Some results on Tchebycheffian Spline Functions, J. Mathematical


Analysis and Applications, 33, 1 (1971) 82-95.



References
• James Mercer. Functions of Positive and Negative Type, and Their Connection with the Theory of Integral Equations. In Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, Vol. 83, No. 559, pp. 69-70, 1909.

• Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. Weighted Automata in Text and Speech Processing, In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended finite state models of language. Budapest, Hungary, 1996.

• Fernando C. N. Pereira and Michael D. Riley. Speech Recognition by Composition of


Weighted Finite Automata. In Finite-State Language Processing, pages 431-453. MIT Press,
1997.

• I. J. Schoenberg, Metric Spaces and Positive Definite Functions. Transactions of the American Mathematical Society, Vol. 44, No. 3, pp. 522-536, 1938.

• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.


Appendix
Mercer’s Condition
(Mercer, 1909)
Theorem: let X ⊆ ℝ^N be a compact set and let K: X × X → ℝ be in L∞(X × X) and symmetric. Then, K admits a uniformly convergent expansion
K(x, y) = Σ_{n=0}^{∞} an φn(x) φn(y), with an > 0,
iff for any function c in L2(X),
∫_{X×X} c(x) c(y) K(x, y) dx dy ≥ 0.


SVMs with PDS Kernels
Constrained optimization (vector form, with ∘ denoting the Hadamard product):
max_α 2 1⊤α − (α ∘ y)⊤ K (α ∘ y)
subject to: 0 ≤ α ≤ C ∧ α⊤y = 0.
Solution:
h = sgn( Σ_{i=1}^{m} αi yi K(xi, ·) + b ),
with b = yi − (α ∘ y)⊤ K ei for any xi with 0 < αi < C.
