Large multiple sequence alignments with a root-to-leaf regressive method

Multiple sequence alignments (MSAs) are used for structural [1,2] and evolutionary predictions [1,2], but the complexity of aligning large datasets requires the use of approximate solutions [3], including the progressive algorithm [4]. Progressive MSA methods start by aligning the most similar sequences and subsequently incorporate the remaining sequences, from leaf to root, based on a guide tree. Their accuracy declines substantially as the number of sequences is scaled up [5]. We introduce a regressive algorithm that enables MSA of up to 1.4 million sequences on a standard workstation and substantially improves accuracy on datasets larger than 10,000 sequences. Our regressive algorithm works in the opposite direction to the progressive algorithm: it proceeds from root to leaf and begins by aligning the most dissimilar sequences. It uses an efficient divide-and-conquer strategy to run third-party alignment methods in linear time, regardless of their original complexity. Our approach will enable analyses of extremely large genomic datasets such as the recently announced Earth BioGenome Project, which aims to sequence 1.5 million eukaryotic genomes [6].

Alignment of 1.4 million sequences is made possible by upgrading T-Coffee with a regressive alignment algorithm.
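The root-to-leaf, divide-and-conquer idea can be illustrated with a toy scheduling sketch. It is not the T-Coffee implementation and performs no alignment: it only shows the order in which batches of at most n sequences could be handed to a third-party aligner, starting at the root of a guide tree. The nested-tuple tree encoding, the function names, and the rule of letting the longest sequence represent a subtree are assumptions made for illustration only.

```python
from collections import deque

# Toy guide tree (assumed encoding): a leaf is a sequence string,
# an internal node is a tuple of two subtrees.

def leaves(node):
    """Return all sequences under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [seq for child in node for seq in leaves(child)]

def representative(node):
    """Assumed rule: the longest sequence stands in for a whole subtree."""
    return max(leaves(node), key=len)

def top_nodes(node, n):
    """Split internal nodes, oldest first, until n subtrees are collected
    or only leaves remain."""
    frontier = [node]
    while len(frontier) < n:
        internal = [i for i, sub in enumerate(frontier) if not isinstance(sub, str)]
        if not internal:
            break  # nothing left to split
        sub = frontier.pop(internal[0])
        frontier.extend(sub)  # replace the node by its two children
    return frontier

def regressive_batches(tree, n):
    """Yield batches of at most n representatives, root first.

    Each batch would be passed to an external aligner. A subtree keeps the
    representative of its parent whenever it contains it, so consecutive
    batches share a sequence that could anchor a later merge.
    """
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        subtrees = top_nodes(node, n)
        yield [representative(sub) for sub in subtrees]
        queue.extend(sub for sub in subtrees if not isinstance(sub, str))

# Example: two distant clades, each with two sub-clades.
tree = ((("AAAA", "AAA"), ("CC", "CCCCC")), (("GGGGGG", "GG"), ("T", "TTT")))
batches_n2 = list(regressive_batches(tree, 2))
batches_n4 = list(regressive_batches(tree, 4))
```

With n = 2 the first batch pairs one representative from each side of the root, that is, the two most distant clades; a progressive scheme would instead start with sibling leaves such as "AAAA" and "AAA". Because every batch is bounded by n, the cost of each aligner call is bounded as well, and the number of calls grows with the number of internal nodes of the tree.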

[1] Cédric Notredame, et al. Multiple sequence alignment modeling: methods and applications, 2016, Briefings Bioinform.

[2] Bui Quang Minh, et al. Phylogenetic diversity within seconds, 2006, Systematic biology.

[3] D. Higgins, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega, 2011, Molecular systems biology.

[4] Kazutaka Katoh, et al. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences, 2007, Bioinform.

[5] Michael S. Breen, et al. Epistasis as the primary factor in molecular evolution, 2012, Nature.

[6] Maurice Herlihy, et al. The Art of Multiprocessor Programming, Revised Reprint, 2012.

[7] Olivier Poch, et al. BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs, 1999, Bioinform.

[8] Scott D. Kahn. On the Future of Genomic Data, 2011, Science.

[9] Md. Shamsuzzoha Bayzid, et al. Statistical binning enables an accurate coalescent-based estimation of the avian tree, 2014, Science.

[10] Tandy J. Warnow, et al. PASTA for proteins, 2018, Bioinform.

[11] P. Hogeweg, et al. The alignment of sets of sequences and the construction of phyletic trees: An integrated method, 2005, Journal of Molecular Evolution.

[12] Kenji Mizuguchi, et al. HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database, 2004, Nucleic Acids Res.

[13] Kazutaka Katoh, et al. Application of the MAFFT sequence alignment program to large data—reexamination of the usefulness of chained guide trees, 2016, Bioinform.

[14] Juan Carlos Castilla-Rubio, et al. Earth BioGenome Project: Sequencing life for the future of life, 2018, Proceedings of the National Academy of Sciences.

[15] Tao Jiang, et al. On the Complexity of Multiple Sequence Alignment, 1994, J. Comput. Biol.

[16] Michael Greenacre, et al. Biplots in Practice, 2009.

[17] Serita M. Nelesen, et al. Rapid and Accurate Large-Scale Coestimation of Sequence Alignments and Phylogenetic Trees, 2009, Science.

[18] Martin Weigt, et al. Large-scale identification of coevolution signals across homo-oligomeric protein interfaces by direct coupling analysis, 2017, Proceedings of the National Academy of Sciences.

[19] K. Katoh, et al. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform, 2002, Nucleic acids research.

[20] Jeffrey M. Perkel, et al. Why Jupyter is data scientists’ computational notebook of choice, 2018, Nature.

[21] E. Birney, et al. Pfam: the protein families database, 2013, Nucleic Acids Res.

[22] Desmond G. Higgins, et al. Sequence embedding for fast construction of guide trees for multiple sequence alignment, 2010, Algorithms for Molecular Biology.

[23] K. Katoh, et al. MAFFT version 5: improvement in accuracy of multiple sequence alignment, 2005, Nucleic acids research.

[24] Tandy J. Warnow, et al. PASTA: Ultra-Large Multiple Sequence Alignment for Nucleotide and Amino-Acid Sequences, 2015, J. Comput. Biol.

[25] Paolo Di Tommaso, et al. Nextflow enables reproducible computational workflows, 2017, Nature Biotechnology.

[26] Tandy J. Warnow, et al. Ultra-large alignments using phylogeny-aware profiles, 2015, Genome Biology.