On the Complexity of Sparse Exon Assembly

Gene structure prediction is one of the most important problems in computational molecular biology. It involves two steps: finding the evidence (e.g., predicting splice sites) and interpreting that evidence, that is, determining the whole gene structure by assembling its pieces. In this paper, we suggest a combinatorial solution to the second step, which is also referred to as the "Exon Assembly Problem." We use a similarity-based approach that aims to produce a single gene structure based on similarities to a known homologous sequence. We target the sparse case, where filtering has been applied to the data, resulting in a set of $O(n)$ candidate exon blocks. Our algorithm yields an $O(n^2 \sqrt{n})$ solution.
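To make the assembly step concrete, the following is a minimal sketch of a much simpler relative of the problem, classic exon chaining, and not the algorithm of this paper. It assumes each candidate exon block already carries a precomputed similarity score and integer, inclusive coordinates, and it selects a non-overlapping chain of blocks with maximum total score. The names `Block` and `best_exon_chain` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical representation: (start, end, score), inclusive integer coordinates.
Block = Tuple[int, int, float]


def best_exon_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Pick non-overlapping blocks maximizing the sum of their scores.

    Simplifying assumption: scores are given per block and are independent,
    unlike similarity-based assembly against a homologous sequence.
    """
    for start, end, _ in blocks:
        if start > end:
            raise ValueError("each block needs start <= end")

    ordered = sorted(blocks, key=lambda b: b[1])  # sort by end coordinate
    ends = [b[1] for b in ordered]
    n = len(ordered)

    best = [0.0] * (n + 1)    # best[i]: optimum over the first i sorted blocks
    take = [False] * (n + 1)  # take[i]: whether block i is used in best[i]
    prev = [0] * (n + 1)      # prev[i]: prefix length compatible with block i

    for i, (start, _end, score) in enumerate(ordered, start=1):
        # Count blocks whose end lies strictly before this block's start.
        p = bisect_right(ends, start - 1)
        with_block = best[p] + score
        if with_block > best[i - 1]:
            best[i], take[i], prev[i] = with_block, True, p
        else:
            best[i] = best[i - 1]

    # Trace back the chosen chain.
    chain: List[Block] = []
    i = n
    while i > 0:
        if take[i]:
            chain.append(ordered[i - 1])
            i = prev[i]
        else:
            i -= 1
    chain.reverse()
    return best[n], chain


if __name__ == "__main__":
    candidates = [(1, 5, 5.0), (2, 3, 3.0), (4, 8, 6.0), (9, 10, 1.0), (7, 12, 4.0)]
    print(best_exon_chain(candidates))
```

This sketch runs in $O(n \log n)$ time on $n$ blocks because block scores are fixed in advance. In the similarity-based setting described above, the contribution of a block depends on how the chain aligns to the homologous sequence, which is what makes the assembly problem harder than this simplified version.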
