A General Edit Distance between RNA Structures

Arc-annotated sequences are useful for representing the structural information of RNA sequences. In general, RNA secondary and tertiary structures can be represented as a set of nested arcs and a set of crossing arcs, respectively. Since RNA function is largely determined by molecular conformation, and therefore by secondary and tertiary structure, the comparison of RNA secondary and tertiary structures has received much attention recently. In this paper, we propose a notion of edit distance to measure the similarity between two RNA secondary and tertiary structures, incorporating various edit operations performed on both bases and arcs (i.e., base pairs). Several algorithms are presented to compute the edit distance between two RNA sequences with various arc structures and under various score schemes, either exactly or approximately with provably good performance. Preliminary experimental tests confirm that our definition of edit distance and the computation model are among the most reasonable studied in the literature.
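
To make the distinction between nested and crossing arcs concrete, the following is a minimal sketch and not the paper's algorithm. It assumes an arc-annotated sequence is stored as a string plus a list of 0-based index pairs (i, j) with i < j, and that no two arcs share an endpoint. The names arcs_cross and is_nested are hypothetical, introduced only for this illustration.

```python
from itertools import combinations

def arcs_cross(a, b):
    """Return True if arcs a and b interleave, i.e. i1 < i2 < j1 < j2."""
    # Sort so that the arc with the smaller left endpoint comes first.
    (i1, j1), (i2, j2) = sorted((a, b))
    return i1 < i2 < j1 < j2

def is_nested(arcs):
    """Return True if no pair of arcs crosses (a secondary-structure-like arc set)."""
    return not any(arcs_cross(a, b) for a, b in combinations(arcs, 2))

# Illustrative data only: a hairpin-like stem on a short sequence.
sequence = "GGGAAACCC"
hairpin = [(0, 8), (1, 7), (2, 6)]   # every arc sits inside the previous one
pseudoknot = [(0, 5), (3, 8)]        # the two arcs interleave

nested_hairpin = is_nested(hairpin)
nested_pseudoknot = is_nested(pseudoknot)
```

Under these assumptions, the hairpin arc set is reported as nested, while the interleaving pair is not, which corresponds to the crossing arcs used above to represent tertiary structure.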
