Text Classification using String Kernels

We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length k. A subsequence is any ordered sequence of k characters occurring in the text, though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising occurrences that are close to contiguous. Computing this feature vector directly would be prohibitively expensive even for modest values of k, since the dimension of the feature space grows exponentially with k. The paper describes how, despite this fact, the inner product can be evaluated efficiently by a dynamic programming technique. A preliminary experimental comparison of the kernel with a standard word feature space kernel [6] shows encouraging results.
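To make the dynamic programming concrete, below is a minimal Python sketch of a gap-weighted subsequence kernel of the kind the abstract describes. It follows one standard recursive formulation, in which an auxiliary quantity K'_i accumulates partial matches of length i as if each match extended to the end of the second string; the names (ssk, lam, k_prime, k_full) are illustrative, and this naive memoised recursion is a sketch rather than the paper's optimised evaluation.

    from functools import lru_cache
    import math

    def ssk(s: str, t: str, k: int, lam: float) -> float:
        """Gap-weighted string subsequence kernel: inner product of feature
        vectors indexed by all length-k character subsequences, with each
        occurrence weighted by lam raised to the span it covers."""

        @lru_cache(maxsize=None)
        def k_prime(i: int, m: int, n: int) -> float:
            # K'_i on prefixes s[:m], t[:n]: partial matches of length i,
            # weighted as if the match extended to the end of t[:n].
            if i == 0:
                return 1.0
            if m < i or n < i:
                return 0.0
            x = s[m - 1]
            # Case 1: s[m-1] takes no part in the match (span widens by one).
            total = lam * k_prime(i, m - 1, n)
            # Case 2: s[m-1] pairs with a matching t[j-1]; lam**(n - j + 2)
            # charges for both matched characters plus the gap up to n.
            for j in range(1, n + 1):
                if t[j - 1] == x:
                    total += k_prime(i - 1, m - 1, j - 1) * lam ** (n - j + 2)
            return total

        @lru_cache(maxsize=None)
        def k_full(m: int, n: int) -> float:
            # K_k on prefixes s[:m], t[:n]: the kernel value itself.
            if m < k or n < k:
                return 0.0
            x = s[m - 1]
            total = k_full(m - 1, n)
            for j in range(1, n + 1):
                if t[j - 1] == x:
                    # lam**2 pays only for the final matched pair, so the
                    # weight charged is the true span, not an extended one.
                    total += k_prime(k - 1, m - 1, j - 1) * lam ** 2
            return total

        return k_full(len(s), len(t))

    def ssk_normalised(s: str, t: str, k: int, lam: float) -> float:
        # Normalisation removes the bias towards longer documents.
        return ssk(s, t, k, lam) / math.sqrt(ssk(s, s, k, lam) * ssk(t, t, k, lam))

    print(ssk("cat", "car", k=2, lam=0.5))                        # 0.0625 == 0.5**4
    print(round(ssk_normalised("cat", "car", k=2, lam=0.5), 4))   # 0.4444

As a sanity check, ssk("cat", "car", k=2, lam=0.5) returns lam**4, since the only shared length-2 subsequence is "ca", contiguous in both strings. This memoised form costs O(k|s||t|^2) time; the kind of refinement the dynamic programming technique above alludes to brings the evaluation down to O(k|s||t|).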

[1] J. Mercer. Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations, 1909.

[2] M. Aizerman, et al. Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning, 1964.

[3] Gerard Salton, et al. A vector space model for automatic indexing, 1975, CACM.

[4] Bernhard E. Boser, et al. A training algorithm for optimal margin classifiers, 1992, COLT '92.

[5] W. B. Cavnar, et al. Using an N-Gram-Based Document Representation with a Vector Processing Retrieval Model, 1994, TREC.

[6] Stephen Huffman. Acquaintance: Language-Independent Document Categorization by N-Grams, 1995, TREC.

[7] Bernhard Schölkopf, et al. Support vector learning, 1997.

[8] Vladimir Cherkassky, et al. The Nature of Statistical Learning Theory, 1997, IEEE Trans. Neural Networks.

[9] John Shawe-Taylor, et al. Structural Risk Minimization over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[10] Vladimir Vapnik, et al. Statistical learning theory, 1998.

[11] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[12] Nello Cristianini, et al. The Kernel-Adatron Algorithm: A Fast and Simple Learning Procedure for Support Vector Machines, 1998, ICML.

[13] Thorsten Joachims, et al. Making large scale SVM learning practical, 1998.

[14] Thorsten Joachims, et al. Text categorization with support vector machines, 1999.

[15] David Haussler, et al. Convolution kernels on discrete structures, 1999.

[16] C. Watkins. Dynamic Alignment Kernels, 1999.

[17] Gunnar Rätsch, et al. Input space versus feature space in kernel-based methods, 1999, IEEE Trans. Neural Networks.

[18] Nello Cristianini, et al. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, 2000.

[19] Nello Cristianini, et al. Margin Distribution and Soft Margin, 2000.

[20] Vladimir N. Vapnik, et al. The Nature of Statistical Learning Theory, 2000, Statistics for Engineering and Information Science.

[21] B. Schölkopf, et al. Sparse Greedy Matrix Approximation for Machine Learning, 2000, ICML.

[22] Christopher K. I. Williams, et al. Using the Nyström Method to Speed Up Kernel Machines, 2000, NIPS.

[23] Bernhard Schölkopf, et al. Dynamic Alignment Kernels, 2000.

[24] N. Cristianini. On Kernel-Target Alignment, 2001, NIPS.

[25] Dustin Boswell, et al. Introduction to Support Vector Machines, 2002.