Partially Local Multi-way Alignments

Multiple sequence alignments are an essential tool in bioinformatics and computational biology, where they are used to represent the mutual evolutionary relationships and similarities among a set of DNA, RNA, or protein sequences. More recently, they have also received considerable interest in other application domains, in particular in comparative linguistics. Multiple sequence alignment can be seen as a generalization of the string-to-string edit problem to more than two strings. With increasing computational power, exact dynamic programming solutions have also become feasible in practice for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far, these have remained largely unexplored. Here we introduce a general formal framework that yields a classification of partially local alignment problems. This leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
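
To make the idea of choosing "local" or "global" independently for each end of each sequence concrete, the following is a minimal sketch for the pairwise case only. It is not the framework introduced in the paper. It assumes a simple match/mismatch score with linear gap costs, and the function name `partially_local_score` and its parameters `free_start` and `free_end` are hypothetical names chosen for illustration. A "free" (local) end means that the unaligned prefix or suffix of that sequence is not penalized.

```python
def partially_local_score(a, b, free_start=(False, False), free_end=(False, False),
                          match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b (illustrative sketch, linear gap costs).

    free_start[k] / free_end[k]: whether the start / end of sequence k
    (0 for a, 1 for b) is local, i.e. may be left unaligned at no cost.
    """
    n, m = len(a), len(b)
    start_a, start_b = free_start
    end_a, end_b = free_end
    neg_inf = float("-inf")
    # S[i][j]: best score of an alignment ending exactly at a[:i], b[:j]
    S = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = neg_inf
    for i in range(n + 1):
        for j in range(m + 1):
            cur = neg_inf
            # An alignment may begin here if every skipped prefix is free.
            if (i == 0 or start_a) and (j == 0 or start_b):
                cur = 0
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                cur = max(cur, S[i - 1][j - 1] + sub)
            if i > 0:
                cur = max(cur, S[i - 1][j] + gap)  # a[i-1] against a gap
            if j > 0:
                cur = max(cur, S[i][j - 1] + gap)  # b[j-1] against a gap
            S[i][j] = cur
            # An alignment may stop here if every skipped suffix is free.
            if (i == n or end_a) and (j == m or end_b):
                best = max(best, cur)
    return best
```

With all four flags `False` this reduces to a global alignment score, and with all four `True` to a local one. For example, `partially_local_score("ACG", "TTACGTT", free_start=(False, True), free_end=(False, True))` aligns all of the first sequence inside the second without charging for the flanking characters. With more than two sequences the same two-bit choice exists per sequence, which is what makes the space of partially local problems grow quickly.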
