Weighted kernel functions for SVM learning in string domains: a distance function viewpoint

This paper extends the idea of weighted distance functions to kernels and support vector machines. We focus on applications that rely on sliding a window over a sequence of string data. For this type of problem, it is argued that a symbolic, context-based representation of the data should be preferred over a continuous, real-valued format, as this is a much more intuitive setting for working with (weighted) distance functions. It is shown how a weighted string distance can be decomposed and subsequently used in different kernel functions, and how these kernel functions correspond to inner products between real vectors. As a case study, named entity recognition is used, with information gain ratio as the weighting scheme.
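The link between a weighted string distance, a kernel, and an inner product between real vectors can be illustrated with a minimal sketch. It assumes a weighted overlap (mismatch-counting) distance over fixed-length context windows; the abstract does not specify the exact distance or kernels, so the function names, the padding symbol, and the weights below are hypothetical stand-ins (in the paper the weights would come from information gain ratio).

```python
import math

def sliding_windows(tokens, radius, pad="_"):
    """Return one symbolic context window per token (hypothetical helper)."""
    padded = [pad] * radius + list(tokens) + [pad] * radius
    width = 2 * radius + 1
    return [tuple(padded[i:i + width]) for i in range(len(tokens))]

def weighted_overlap_distance(x, y, w):
    """Sum the weights of the positions where the two contexts differ."""
    return sum(wi for xi, yi, wi in zip(x, y, w) if xi != yi)

def weighted_overlap_kernel(x, y, w):
    """Sum the weights of the positions where the two contexts agree."""
    return sum(wi for xi, yi, wi in zip(x, y, w) if xi == yi)

def one_hot_embedding(x, w, alphabet):
    """Map a context to a real vector: one block per position,
    with sqrt(w_i) at the index of the symbol seen there."""
    vec = []
    for xi, wi in zip(x, w):
        vec.extend(math.sqrt(wi) if xi == s else 0.0 for s in alphabet)
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Assumed example weights, not values from the paper.
w = [0.25, 0.5, 0.25]
x = ("a", "b", "c")
y = ("a", "d", "c")
alphabet = ["a", "b", "c", "d"]

k_xy = weighted_overlap_kernel(x, y, w)
d_xy = weighted_overlap_distance(x, y, w)
ip_xy = dot(one_hot_embedding(x, w, alphabet),
            one_hot_embedding(y, w, alphabet))
```

Under these assumptions, the kernel value equals the inner product of the embedded vectors, and distance and kernel are complementary: `d(x, y) = sum(w) - k(x, y)`. The same kernel value could then be plugged into, for example, a polynomial or radial basis kernel, which is one way to read "used in different kernel functions".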
