Meta-Alignment with Crumble and Prune: Partitioning very large alignment problems for performance and parallelization

Background: Continuing research into the global multiple sequence alignment problem has resulted in more sophisticated and principled alignment methods. Unfortunately, these new algorithms often require large amounts of time and memory, making them nearly impossible to run on large datasets. As a solution, we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied together to provide better scaling in both sequence length and sequence depth. Both methods partition the problem such that many of the sub-problems can be solved independently. The results are then combined to form a solution to the full alignment problem.

Results: Crumble and Prune were tested on real and simulated data. Each provides a significant performance improvement with little loss of accuracy; in some cases, a gain in accuracy was observed. Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved in parallel on a compute cluster, significantly shortening the run-time.

Conclusions: These methods enabled us to solve gigabase alignment problems. They could enable a new generation of biologically realistic alignment algorithms to be applied to real-world, large-scale alignment problems.
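The two partitioning ideas can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nested-tuple tree encoding, the fixed-size overlapping windows, and the `max_leaves` threshold are all assumptions made for illustration. In particular, the abstract does not say how Crumble chooses its cut points, so the single-coordinate windows here stand in for whatever consistent cut points a real implementation would select across all sequences.

```python
from typing import List, Tuple, Union

# Hypothetical tree encoding: a leaf is a species name, an internal
# node is a (left, right) pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def crumble_intervals(length: int, chunk: int, overlap: int) -> List[Tuple[int, int]]:
    """Split the range [0, length) into windows of at most `chunk`
    positions, where consecutive windows share `overlap` positions.
    Assumption: one coordinate system stands in for the alignment."""
    if chunk <= overlap:
        raise ValueError("chunk must be larger than overlap")
    intervals = []
    start = 0
    while True:
        end = min(start + chunk, length)
        intervals.append((start, end))
        if end == length:
            break
        start = end - overlap  # step back so neighbours overlap
    return intervals


def leaves(tree: Tree) -> List[str]:
    """Return the leaf names of a tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def prune_clades(tree: Tree, max_leaves: int) -> List[List[str]]:
    """Partition the leaves into clades of at most `max_leaves`
    sequences; each clade is an independent sub-problem."""
    names = leaves(tree)
    if isinstance(tree, str) or len(names) <= max_leaves:
        return [names]
    left, right = tree
    return prune_clades(left, max_leaves) + prune_clades(right, max_leaves)


if __name__ == "__main__":
    tree = ((("human", "chimp"), "mouse"), ("dog", "cow"))
    print(crumble_intervals(10, 4, 1))
    print(prune_clades(tree, 2))
```

Each window and each clade could then be handed to an existing aligner independently, with the partial results merged afterwards; the merging step is omitted here because the abstract does not describe it.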
