A memory-efficient algorithm for multiple sequence alignment with constraints

Abstract

Motivation: Recently, the concept of constrained sequence alignment was proposed to incorporate biologists' knowledge about the structures, functionalities, or consensuses of their datasets into sequence alignment, so that user-specified residues or nucleotides are aligned together in the computed alignment. Currently available programs use the so-called progressive approach to efficiently obtain a constrained alignment of several sequences. However, the kernels of these programs, the dynamic programming algorithms that compute an optimal constrained alignment between two sequences, require 𝒪(γn²) memory, where γ is the number of constraints and n is the maximum of the sequence lengths. This high memory requirement restricts these programs to aligning short sequences only.

Results: We adopt the divide-and-conquer approach to design a memory-efficient algorithm for computing an optimal constrained alignment between two sequences. It greatly reduces the memory requirement of the dynamic programming approaches at the expense of a small constant factor in CPU time. The new algorithm consumes only 𝒪(αn) space, where α is the sum of the lengths of the constraints and usually α ≪ n in practical applications. Based on this algorithm, we have developed a memory-efficient tool for multiple sequence alignment with constraints.

Availability: http://genome.life.nctu.edu.tw/MUSICME

Contact: cllu@mail.nctu.edu.tw
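The divide-and-conquer idea described in the Results can be illustrated with a minimal Python sketch in the style of Hirschberg's linear-space method. It is not the authors' implementation, and it rests on simplifying assumptions that do not come from the abstract: each constraint is a single residue that must sit in a column where both sequences carry that residue, constraints must be satisfied in the given order, the gap penalty is linear, and the scores are toy values. The names `last_row`, `align_single`, `constrained_align`, and `alignment_score` are hypothetical. Under these assumptions α equals γ, and the working memory of each step is proportional to γ times the length of the second sequence rather than γn².

```python
NEG = float("-inf")
MATCH, MISMATCH, GAP = 1, -1, -2  # toy scores (assumed, not from the paper)

def sub(x, y):
    return MATCH if x == y else MISMATCH

def last_row(a, b, cons):
    """F[j][k]: best score of aligning all of a with b[:j] while satisfying
    cons[:k] in order. Only two rows over a are kept at any time."""
    g, n = len(cons), len(b)
    prev = [[NEG] * (g + 1) for _ in range(n + 1)]
    for j in range(n + 1):
        prev[j][0] = j * GAP  # empty prefix of a against b[:j]
    for x in a:
        cur = [[NEG] * (g + 1) for _ in range(n + 1)]
        for k in range(g + 1):
            cur[0][k] = prev[0][k] + GAP
        for j in range(1, n + 1):
            y = b[j - 1]
            for k in range(g + 1):
                best = max(prev[j - 1][k] + sub(x, y),
                           prev[j][k] + GAP,
                           cur[j - 1][k] + GAP)
                if k and x == y == cons[k - 1]:  # column satisfies constraint k
                    best = max(best, prev[j - 1][k - 1] + MATCH)
                cur[j][k] = best
        prev = cur
    return prev

def align_single(x, b, cons):
    """Base case: align the single residue x with b under at most one constraint."""
    if len(cons) > 1:
        raise ValueError("constraints cannot be satisfied")
    options = []  # (score, index in b paired with x, or None if x faces a gap)
    if not cons:
        options.append((GAP * (len(b) + 1), None))
    for j, y in enumerate(b):
        if not cons or x == y == cons[0]:
            options.append((sub(x, y) + GAP * (len(b) - 1), j))
    if not options:
        raise ValueError("constraints cannot be satisfied")
    _, j = max(options, key=lambda o: o[0])
    if j is None:
        return x + "-" * len(b), "-" + b
    return "-" * j + x + "-" * (len(b) - 1 - j), b

def constrained_align(a, b, cons):
    """Optimal constrained alignment of a and b by divide and conquer."""
    if not a:
        if cons:
            raise ValueError("constraints cannot be satisfied")
        return "-" * len(b), b
    if len(a) == 1:
        return align_single(a, b, cons)
    mid, g, n = len(a) // 2, len(cons), len(b)
    fwd = last_row(a[:mid], b, cons)
    bwd = last_row(a[mid:][::-1], b[::-1], cons[::-1])
    # Choose where to cut b and how many constraints go to the left half.
    score, j, k = max((fwd[j][k] + bwd[n - j][g - k], j, k)
                      for j in range(n + 1) for k in range(g + 1))
    del fwd, bwd  # release the score rows before recursing
    if score == NEG:
        raise ValueError("constraints cannot be satisfied")
    left = constrained_align(a[:mid], b[:j], cons[:k])
    right = constrained_align(a[mid:], b[j:], cons[k:])
    return left[0] + right[0], left[1] + right[1]

def alignment_score(top, bottom):
    return sum(GAP if "-" in (p, q) else sub(p, q) for p, q in zip(top, bottom))

top, bottom = constrained_align("MKHLLCAG", "MHKLCCAG", "HC")
print(top)
print(bottom)
print(alignment_score(top, bottom))
```

The forward and backward passes each recompute scores that a full table would have stored, which is the source of the constant-factor increase in CPU time that the abstract mentions. Handling constraints that are longer strings, as the 𝒪(αn) bound implies, would need additional state per constraint position and is outside this sketch.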
