Exploiting Fine-Grained Parallelism in the Phylogenetic Likelihood Function with MPI, Pthreads, and OpenMP: A Performance Study

Emerging multi- and many-core computer architectures pose new challenges for the efficient exploitation of parallelism. In addition, it is currently unclear which parallel programming paradigm is most appropriate for such architectures, from both the efficiency and the software engineering point of view. Beyond that, high performance computing techniques and supercomputers will be essential for dealing with the explosive accumulation of sequence data. We address these issues via a thorough performance study using RAxML as an example, a widely used bioinformatics application for large-scale phylogenetic inference under the Maximum Likelihood (ML) criterion. We provide an overview of the respective parallelization strategies with MPI, Pthreads, and OpenMP, and assess the performance of these approaches on a large variety of parallel architectures. Results indicate that no paradigm is universally best suited with respect to efficiency and portability of the ML function. Therefore, we suggest that the ML function should be parallelized with MPI and Pthreads, based on software engineering criteria and in order to enforce data locality.
