Versatile string kernels

This paper proposes a class of string kernels that can handle a variety of subsequence-based features. Slight adaptations of the basic algorithm allow for weighting subsequence lengths, restricting or soft-penalizing gap size, weighting characters, and soft-matching of characters. A simple extension of the kernels allows run-length encoded strings to be compared with a time complexity that is independent of the length of the original strings. Such kernels have applications in image processing, computational biology, demography, and the comparison of partial rankings.
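To make the idea of a length-weighted, soft-matching subsequence kernel concrete, the following is a minimal sketch and not the paper's algorithm: it uses the textbook dynamic program that counts pairs of matching subsequence embeddings, split by subsequence length. The function name `subsequence_kernel`, the `weights` list (one weight per subsequence length, starting at length 0) and the `match` similarity function are assumptions introduced for illustration only. Gap penalties and run-length encoded input are not covered.

```python
def subsequence_kernel(s, t, weights, match=lambda a, b: 1.0 if a == b else 0.0):
    """Length-weighted count of common subsequence embeddings of s and t.

    weights[p] is the (hypothetical) weight given to subsequences of length p;
    weights must contain at least the entry for length 0.
    match(a, b) is a character similarity; the default is exact matching.
    Runs in O(P * len(s) * len(t)) time, where P is the longest weighted length.
    """
    m, n = len(s), len(t)
    max_len = min(len(weights) - 1, m, n)

    # prev[i][j]: number of matching embedding pairs of length p - 1
    # within the prefixes s[:i] and t[:j]. For length 0 this is always 1
    # (the empty subsequence).
    prev = [[1.0] * (n + 1) for _ in range(m + 1)]
    total = weights[0]

    for p in range(1, max_len + 1):
        cur = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                # Inclusion-exclusion over "last character of s unused" and
                # "last character of t unused", plus the case where both
                # last characters are used and therefore must match.
                cur[i][j] = (cur[i - 1][j] + cur[i][j - 1] - cur[i - 1][j - 1]
                             + match(s[i - 1], t[j - 1]) * prev[i - 1][j - 1])
        total += weights[p] * cur[m][n]
        prev = cur
    return total


# Count only common subsequences of length 2 in "abc" and "acb" ("ab" and "ac").
k = subsequence_kernel("abc", "acb", [0.0, 0.0, 1.0])
```

Passing a different `weights` list changes how much each subsequence length contributes, and passing a graded `match` function gives a simple form of soft character matching. Whether the result remains a valid kernel depends on the chosen `match` function being positive semidefinite.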
