Relational Sequence Alignments and Logos

The need to measure sequence similarity arises in many application domains and often coincides with sequence alignment: the more similar two sequences are, the better they can be aligned. Aligning sequences not only shows how similar they are, it also shows where they differ and where they correspond. Traditionally, alignment has been considered for sequences of flat symbols only. Many real-world sequences, such as natural language sentences and protein secondary structures, however, exhibit rich internal structure. This is akin to the problem of dealing with structured examples studied in the field of inductive logic programming (ILP). In this paper, we introduce Real, a powerful yet simple approach that aligns sequences of structured symbols by using well-established ILP distance measures within traditional alignment methods. Although the approach is straightforward, experiments on protein data and Medline abstracts show that it works well in practice, that the resulting alignments can indeed provide more information than flat ones, and that they are meaningful to experts when represented graphically.
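The idea of plugging a distance over structured symbols into a traditional alignment method can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that each structured symbol is a ground term written as a Python tuple `(functor, arg1, ..., argn)`, that the distance is a simple recursive term distance in the spirit of ILP distance measures, and that the alignment is a global, Needleman-Wunsch-style dynamic program with a linear gap cost. The names `term_distance` and `align`, the gap cost of 0.5, and the example terms are all hypothetical.

```python
from typing import List, Optional, Tuple, Union

# Assumed encoding: a term is a constant (str) or a tuple (functor, arg1, ..., argn).
Term = Union[str, tuple]


def term_distance(s: Term, t: Term) -> float:
    """Recursive distance between two ground terms, in [0, 1]."""
    if s == t:
        return 0.0
    # Different constants, or a constant compared with a compound term.
    if not (isinstance(s, tuple) and isinstance(t, tuple)):
        return 1.0
    # Different functor or arity.
    if s[0] != t[0] or len(s) != len(t):
        return 1.0
    n = len(s) - 1
    if n == 0:
        return 0.0
    # Same functor and arity: average the argument distances, halved per level.
    return sum(term_distance(a, b) for a, b in zip(s[1:], t[1:])) / (2 * n)


def align(
    xs: List[Term], ys: List[Term], gap: float = 0.5
) -> Tuple[float, List[Tuple[Optional[Term], Optional[Term]]]]:
    """Global alignment minimizing total term distance plus gap costs."""
    n, m = len(xs), len(ys)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + term_distance(xs[i - 1], ys[j - 1]),
                cost[i - 1][j] + gap,  # xs[i-1] aligned with a gap
                cost[i][j - 1] + gap,  # ys[j-1] aligned with a gap
            )
    # Trace back one optimal alignment; None marks a gap.
    pairs: List[Tuple[Optional[Term], Optional[Term]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j]
                == cost[i - 1][j - 1] + term_distance(xs[i - 1], ys[j - 1])):
            pairs.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap):
            pairs.append((xs[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ys[j - 1]))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


# Hypothetical secondary-structure sequences, invented for illustration.
xs = [("helix", "alpha", "short"), ("strand", "plus", "long"), ("helix", "alpha", "long")]
ys = [("helix", "alpha", "medium"), ("helix", "alpha", "long")]
total, pairs = align(xs, ys)
```

In this toy example the two helices that differ only in length are matched at a small cost, while the strand is aligned with a gap. A flat alignment that treats each symbol as an opaque token would see the first pair as a plain mismatch, which is the extra information a structured distance is meant to capture.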
