String kernels and similarity measures for information retrieval

Measuring the similarity between two strings is a fundamental step in many applications in areas such as text classification and information retrieval. Recently, kernel-based methods have been proposed for this task, for both text and biological sequences. Since kernels are inner products in a feature space, they naturally induce similarity measures. Information-theoretic approaches have also been the subject of recent research. The goal is to classify finite sequences without explicit knowledge of their statistical nature: sequences are considered similar if they are likely to have been generated by the same source. There is experimental evidence that relative entropy (albeit not a true metric) yields high accuracy in several classification tasks. Compression-based techniques, such as variants of the Ziv-Lempel algorithm for text or GenCompress for biological sequences, have been used to estimate the relative entropy. Algorithmic concepts based on Kolmogorov complexity provide the theoretical background for these approaches. This paper describes several string kernels and information-theoretic methods, and evaluates the performance of both kinds of methods on text classification tasks, namely authorship attribution, language detection, and cross-language document matching.
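As an illustrative sketch (not the paper's implementation), the two families of similarity measures discussed above can be prototyped in a few lines: a p-spectrum string kernel, which takes the inner product of k-mer count vectors, and a compression-based distance in the spirit of the normalized compression distance, here approximated with zlib in place of the Ziv-Lempel or GenCompress compressors mentioned in the abstract. The function names and the choice of zlib are assumptions for illustration only.

```python
import zlib
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """p-spectrum kernel: inner product of the k-mer count vectors of s and t."""
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    # Sum over k-mers shared by both strings.
    return sum(cs[m] * ct[m] for m in cs)

def ncd(s, t):
    """Normalized compression distance, approximated with zlib.

    Smaller values indicate that the two byte strings are likely to have
    been generated by the same source (they compress well together).
    """
    c = lambda x: len(zlib.compress(x))
    cs, ct, cst = c(s), c(t), c(s + t)
    return (cst - min(cs, ct)) / max(cs, ct)
```

For example, `spectrum_kernel("abcabc", "abc")` counts the shared trigram `abc` twice in the first string and once in the second, giving a kernel value of 2, while `ncd` assigns a smaller distance to two copies of the same text than to two unrelated texts.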
