Reconstruction of large phylogenetic trees: A parallel approach

Reconstruction of phylogenetic trees for very large datasets is a well-known example of a computationally hard problem. In this paper, we present a parallel computing model for the widely used Multiple Instruction Multiple Data (MIMD) architecture. Following the divide-and-conquer idea, our model adapts the recursive-DCM3 decomposition method [Roshan, U., Moret, B.M.E., Williams, T.L., Warnow, T., 2004a. Performance of supertree methods on various dataset decompositions. In: Bininda-Emonds, O.R.P. (Ed.), Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, vol. 3 of Computational Biology, Kluwer Academic, pp. 301-328; Roshan, U., Moret, B.M.E., Williams, T.L., Warnow, T., 2004b. Rec-I-DCM3: A Fast Algorithmic Technique for Reconstructing Large Phylogenetic Trees, Proceedings of the IEEE Computational Systems Bioinformatics Conference (ICSB)] to divide datasets into smaller subproblems. It distributes the computational load over multiple processors so that the processors construct subtrees for the subproblems within a batch in parallel. It then collects the resulting trees and merges them into a supertree. The proposed model is flexible with respect to the methods used for dividing datasets and merging trees. We show that our method greatly reduces the computational time relative to the sequential version of the program. As a case study, on one dataset our parallel approach takes only 22.1 h on four processors to outperform the best score to date (found at 123.7 h by the Rec-I-DCM3 program [Roshan et al., 2004a, 2004b]). Developed with the standard message-passing library, MPI, the program can be recompiled and run on any MIMD system.
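The control flow described above (divide, distribute a batch, collect, merge) can be illustrated with a minimal sketch. This is not the authors' program and not Rec-I-DCM3: every function name here is hypothetical, the decomposition is a simple overlapping split rather than DCM3, the subtree builder is a placeholder for a real tree search, and the merge step only gathers leaves instead of running a supertree method. A standard-library thread pool stands in for MPI processes, so it shows the pattern only and gives no real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def decompose(taxa, subset_size, overlap):
    """Split taxa into overlapping subsets (assumed stand-in for a DCM3-style split)."""
    step = subset_size - overlap
    if step <= 0:
        raise ValueError("subset_size must be larger than overlap")
    subsets = []
    for start in range(0, len(taxa), step):
        subsets.append(taxa[start:start + subset_size])
        if start + subset_size >= len(taxa):
            break  # the last subset already reaches the end of the list
    return subsets


def build_subtree(subset):
    """Placeholder tree search: return a caterpillar tree as nested tuples."""
    tree = subset[0]
    for taxon in subset[1:]:
        tree = (tree, taxon)
    return tree


def leaves(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return {tree}


def merge(subtrees):
    """Placeholder merge: an unresolved (star) tree over the union of all leaves."""
    all_leaves = set()
    for tree in subtrees:
        all_leaves |= leaves(tree)
    return tuple(sorted(all_leaves))


def parallel_reconstruct(taxa, subset_size=4, overlap=1, workers=4):
    """Divide, solve one batch of subproblems concurrently, then merge."""
    subsets = decompose(taxa, subset_size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the submitted subsets
        subtrees = list(pool.map(build_subtree, subsets))
    return subtrees, merge(subtrees)


if __name__ == "__main__":
    taxa = [f"t{i}" for i in range(10)]
    subtrees, supertree = parallel_reconstruct(taxa)
    print(len(subtrees), "subtrees merged into", len(supertree), "leaves")
```

In an MPI setting, the thread pool would be replaced by worker processes that receive subproblems from a coordinating process and send their subtrees back; the overlap between subsets is what a real supertree method would use to join the subtrees.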
