Scaling up accurate phylogenetic reconstruction from gene-order data

MOTIVATION: Phylogenetic reconstruction from gene-order data has attracted increasing attention from both biologists and computer scientists over the last few years. Methods used in reconstruction include distance-based methods (such as neighbor-joining), parsimony methods using sequence-based encodings, Bayesian approaches, and direct optimization. Direct optimization, pioneered by Sankoff and extended by us with the software suite GRAPPA, is the most accurate of these approaches, but it cannot handle more than about 15 genomes of limited size (e.g. organelles).

RESULTS: We report here on our successful efforts to scale up direct optimization through a two-step approach: the first step decomposes the dataset into smaller pieces and runs direct optimization (GRAPPA) on each piece, while the second step builds a tree from the results obtained on those pieces. We used the sophisticated disk-covering method (DCM) pioneered by Warnow and her group, suitably modified to take into account the computational limitations of GRAPPA. We find that DCM-GRAPPA scales gracefully to at least 1000 genomes of a few hundred genes each and retains surprisingly high accuracy throughout the range: in our experiments, the topological error rate rarely exceeded a few percent. Thus, reconstruction based on gene-order data can now be accomplished with high accuracy on datasets of significant size.
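The decomposition step can be illustrated with a minimal Python sketch. It is a simplified stand-in, not the modified DCM used with GRAPPA: it computes breakpoint distances between signed linear gene orders, connects genomes whose distance falls under a threshold, and takes the maximal cliques of that graph as overlapping subproblems. The function names, the toy genomes, and the threshold value are all assumptions made for illustration. The published DCM involves further steps omitted here (such as triangulating the threshold graph and merging the subtrees), and the expensive direct-optimization run on each subset is not shown.

```python
from itertools import combinations


def breakpoint_distance(a, b):
    """Count adjacencies of signed linear gene order `a` that are absent from `b`."""
    def adjacencies(order):
        adj = set()
        for x, y in zip(order, order[1:]):
            adj.add((x, y))
            adj.add((-y, -x))  # the same adjacency read on the opposite strand
        return adj

    adj_b = adjacencies(b)
    return sum(1 for pair in zip(a, a[1:]) if pair not in adj_b)


def threshold_graph(genomes, threshold):
    """Connect two genomes when their breakpoint distance is at most `threshold`."""
    graph = {name: set() for name in genomes}
    for u, v in combinations(genomes, 2):
        if breakpoint_distance(genomes[u], genomes[v]) <= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical toy data: three genomes over six genes, related by inversions.
genomes = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [1, -3, -2, 4, 5, 6],   # A with genes 2..3 inverted
    "C": [1, -3, -2, 4, -6, -5],  # B with genes 5..6 inverted
}

# Each clique is a subset small enough to hand to an expensive base method;
# the subsets overlap (here on genome B), which a merging step could exploit.
subproblems = maximal_cliques(threshold_graph(genomes, threshold=2))
```

With a threshold of 2, genomes A and C are too far apart to share a subproblem, so the sketch yields two overlapping subsets, {A, B} and {B, C}. In a real pipeline the threshold would be chosen so that no subset exceeds what the base method can handle (about 15 genomes for GRAPPA, per the text above).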
