Large-scale phylogenetic analysis on current HPC architectures

Phylogenetic inference is considered a grand challenge in bioinformatics because of its immense computational requirements. The increasing popularity and availability of large multi-gene alignments and of comprehensive datasets of single nucleotide polymorphisms (SNPs) in current biological studies, coupled with the rapid accumulation of sequence data in general, pose new challenges for high-performance computing. Using RAxML as an example, which is currently among the fastest and most accurate programs for phylogenetic inference under the Maximum Likelihood (ML) criterion, we demonstrate how the phylogenetic ML function can be scaled efficiently to current supercomputer architectures such as the IBM BlueGene/L (BG/L) and SGI Altix. This is achieved by simultaneously exploiting the coarse- and fine-grained parallelism that is inherent to every ML-based biological analysis. Performance is assessed on two datasets: one of 270 sequences and 566,470 base pairs (the haplotype map dataset) and one of 2,182 sequences and 51,089 base pairs. To the best of our knowledge, these are the largest datasets analyzed under ML to date. Experimental results indicate that the fine-grained parallelization scales well up to 1,024 processors. Moreover, a larger number of processors can be exploited efficiently by combining coarse- and fine-grained parallelism. We also demonstrate that our parallelization scales equally well on an AMD Opteron cluster with a less favorable ratio of network latency to processor speed. Finally, we underline the practical relevance of our approach with a biological discussion of the results of the haplotype map dataset analysis, which revealed novel biological insights via phylogenetic inference.
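The two levels of parallelism named in the abstract can be illustrated with a minimal sketch. It is not RAxML's implementation: it assumes a toy two-sequence Jukes-Cantor likelihood in place of a full tree likelihood, uses Python threads purely to show the structure (they give no real speedup here), and all function names are hypothetical. Fine-grained parallelism splits the alignment columns across workers and sums their partial log-likelihoods; coarse-grained parallelism evaluates independent candidates (here, candidate distances standing in for independent tree searches) concurrently.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def jc69_site_loglik(a, b, t):
    """Log-likelihood of one aligned site for two sequences at Jukes-Cantor distance t."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that the base is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    return math.log(0.25 * (p_same if a == b else p_diff))


def split_sites(n_sites, n_workers):
    """Cyclic distribution of alignment columns over workers (hypothetical helper)."""
    return [range(w, n_sites, n_workers) for w in range(n_workers)]


def partial_loglik(seq_a, seq_b, t, sites):
    """Sum of per-site log-likelihoods over the columns owned by one worker."""
    return sum(jc69_site_loglik(seq_a[i], seq_b[i], t) for i in sites)


def fine_grained_loglik(seq_a, seq_b, t, n_workers=4):
    """Fine-grained level: each worker scores its columns, then the parts are reduced."""
    chunks = split_sites(len(seq_a), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(lambda s: partial_loglik(seq_a, seq_b, t, s), chunks)
        return sum(parts)


def coarse_grained_search(seq_a, seq_b, candidates, n_workers=4):
    """Coarse-grained level: independent candidates are scored concurrently,
    each one using its own group of fine-grained workers."""
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        scores = list(pool.map(
            lambda t: fine_grained_loglik(seq_a, seq_b, t, n_workers), candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Because the total log-likelihood is a sum over independent columns, the fine-grained result equals the serial one up to floating-point rounding, and the only communication needed per evaluation is the final reduction. That property is what makes long alignments, such as the 566,470-column dataset mentioned above, natural candidates for this kind of decomposition.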
