Compositional Properties of Alignments

Alignments, i.e., position-wise comparisons of two or more strings or ordered lists, are of utmost practical importance in computational biology and a host of other fields, including historical linguistics and emerging areas of research in the Digital Humanities. The alignment problem is well known to be computationally hard as soon as the number of input strings is not bounded. Due to its practical importance, a huge number of heuristics have been devised, and these have proved very successful in a wide range of applications. Nevertheless, alignments have received hardly any attention as formal mathematical structures. Here, we focus on the compositional aspects of alignments, which underlie most algorithmic approaches to computing alignments. We also show that the concepts generalize naturally to finite partially ordered sets and to partial maps between them that, in some sense, preserve the partial orders. As a consequence of this discussion, we observe that alignments of even more general structures, in particular graphs, are essentially characterized by the fact that the restriction of an alignment to a row must coincide with the corresponding input graph. Pairwise alignments of graphs are therefore determined completely by common induced subgraphs. In this setting, alignments of alignments are well defined, and alignments can be decomposed recursively into subalignments. This provides a general framework within which different classes of alignment algorithms can be explored for objects very different from sequences and other totally ordered data structures.
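
The characterizing property named above, that restricting an alignment to a row must give back the corresponding input, can be illustrated for the simplest case of strings. The following is only a minimal sketch under stated assumptions: an alignment is modeled as a list of columns with `None` marking a gap, and the names `from_rows`, `project`, and `is_alignment_of` are hypothetical helpers introduced for illustration, not notation from the paper.

```python
from typing import List, Optional, Sequence, Tuple

# Assumption: a column holds one entry per row; None marks a gap.
Column = Tuple[Optional[str], ...]


def from_rows(rows: Sequence[str], gap: str = "-") -> List[Column]:
    """Build a column-wise alignment from equal-length gapped strings."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all gapped rows must have the same length")
    return [tuple(None if c == gap else c for c in col) for col in zip(*rows)]


def project(alignment: Sequence[Column], row: int) -> List[str]:
    """Restrict an alignment to one row by dropping that row's gaps."""
    return [col[row] for col in alignment if col[row] is not None]


def is_alignment_of(alignment: Sequence[Column], inputs: Sequence[str]) -> bool:
    """Check that each row restricts to its input and no column is all gaps."""
    if any(len(col) != len(inputs) for col in alignment):
        return False
    if any(all(x is None for x in col) for col in alignment):
        return False
    return all(project(alignment, i) == list(s) for i, s in enumerate(inputs))


aln = from_rows(["AC-GT", "A-TGT"])
print(is_alignment_of(aln, ["ACGT", "ATGT"]))
```

The same check carries over to more than two rows, which is one way to see why an alignment of alignments can again be validated row by row.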
