Fast and accurate large multiple sequence alignments using root-to-leaf regressive computation

Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem forces the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences and then incorporates the remaining ones in the order imposed by a guide tree. We developed, and validated on protein sequences, a regressive algorithm that works the other way around, aligning the most dissimilar sequences first. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time, thus allowing the scale-up required for extremely large genomic analyses.

One Sentence Summary: Initiating alignments with the most dissimilar sequences allows slow and accurate methods to be used on large datasets.
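The contrast between the two traversal directions can be illustrated with a minimal sketch. It assumes a hypothetical five-sequence guide tree written as nested pairs, and it only lists the order in which the tree's splits would be visited: leaves-to-root for a progressive scheme, root-to-leaves for a regressive one. The tree, the function names, and the breadth-first choice are illustrative assumptions, not the authors' implementation, and no sequences are actually aligned.

```python
from collections import deque

# Hypothetical toy guide tree: leaves are sequence names,
# internal nodes are (left, right) pairs.
GUIDE_TREE = ((("A", "B"), "C"), ("D", "E"))


def leaves(node):
    """Return the sequence names under a node, left to right."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def progressive_order(tree):
    """Visit internal nodes in post-order (leaves to root).

    The closest sequences are paired first and the root split,
    which joins the most dissimilar groups, comes last.
    """
    order = []

    def visit(node):
        if isinstance(node, str):
            return
        left, right = node
        visit(left)
        visit(right)
        order.append((leaves(left), leaves(right)))

    visit(tree)
    return order


def regressive_order(tree):
    """Visit internal nodes breadth-first from the root (root to leaves).

    The root split, which separates the most dissimilar groups,
    is handled first and the closest pairs last.
    """
    order = []
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):
            continue
        left, right = node
        order.append((leaves(left), leaves(right)))
        queue.extend([left, right])
    return order


if __name__ == "__main__":
    print("progressive:", progressive_order(GUIDE_TREE))
    print("regressive: ", regressive_order(GUIDE_TREE))
```

In this toy tree the progressive order begins with the pair A and B and ends at the root split, whereas the root-first order begins at the root split and reaches A and B last.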
