Gradient Boosting for Sequence Alignment

Sequence alignment is a common subtask in many applications such as genetic matching and music information retrieval. Crucial to the performance of any sequence alignment algorithm is an accurate model of the reward of transforming one sequence into another. Using this model, we can find the optimal alignment of two sequences or perform query-based selection from a database of target sequences with a dynamic programming approach. In this paper, we describe a new algorithm to learn the reward models from positive and negative examples of matching sequences. We develop a gradient boosting approach that reduces sequence learning to a series of standard function approximation problems that can be solved by any function approximator. A key advantage of this approach is that it is able to induce complex features using function approximation rather than relying on the user to predefine such features. Our experiments on synthetic data and a fairly complex real-world music retrieval domain demonstrate that our approach can achieve better accuracy and faster learning compared to a state-of-the-art structured SVM approach.

[1]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[2]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[3]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[4]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[5]  R. Durbin,et al.  Biological sequence analysis: Background on probability , 1998 .

[6]  Y. Freund,et al.  Discussion of the Paper \additive Logistic Regression: a Statistical View of Boosting" By , 2000 .

[7]  Pierrick Philippe,et al.  NEW QUERY-BY-HUMMING MUSIC RETRIEVAL SYSTEM CONCEPTION AND EVALUATION BASED ON A QUERY NATURE STUDY , 2001 .

[8]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[9]  William P. Birmingham,et al.  Encoding Timing Information for Musical Query Matching , 2002, ISMIR.

[10]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[11]  Thomas G. Dietterich,et al.  Training conditional random fields via gradient tree boosting , 2004, ICML.

[12]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[13]  William P. Birmingham,et al.  Modelling error in query-by-humming applications , 2004 .

[14]  Shriprakash Sinha Leaf shape recognition via support vector machines with edit distance kernels , 2004 .

[15]  Ning Hu,et al.  The MUSART Testbed for Query-by-Humming Evaluation , 2004, Computer Music Journal.

[16]  Thorsten Joachims,et al.  Learning to Align Sequences: A Maximum-Margin Approach , 2006 .