Fast NJ-like algorithms to deal with incomplete distance matrices

Background: Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the resulting distance matrix. These methods are fast and fairly accurate. However, they cope poorly with incomplete distance matrices. Such matrices are frequent in recent multi-gene studies, where some pairs of species share no gene in the analyzed data. The few existing algorithms that infer trees with satisfactory accuracy from incomplete distance matrices have time complexity in O(n^4) or more, where n is the number of taxa, which precludes large-scale studies. Agglomerative distance algorithms (e.g. NJ [1, 2]) are much faster, with time complexity in O(n^3), which allows huge datasets and heavy bootstrap analyses to be dealt with. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two branches thus created, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices.

Results: We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ [3] and MVR [4]. Our adaptation generalizes to incomplete matrices the taxon pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE [5]. Steps (b) and (c) are also modified, but the O(n^3) time complexity is kept. The performance of these new algorithms is studied with large-scale simulations, which mimic multi-gene phylogenomic datasets. Our new algorithms, named NJ*, BIONJ* and MVR*, infer phylogenetic trees that are at least as accurate as those inferred by other available methods, but with much faster running times. MVR* presents the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited for multi-gene studies where some distances are accurately estimated using numerous genes, whereas others are poorly estimated (or not estimated at all) because few (or no) sequenced genes are shared by both species.

Conclusion: Our distance-based agglomerative algorithms NJ*, BIONJ* and MVR* are fast and accurate, and should be quite useful for large-scale phylogenomic studies. When combined with the SDM method [6] to estimate a distance matrix from multiple genes, they offer a relevant alternative to usual supertree techniques [7]. Binaries and all simulated data are downloadable from [8].
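The three agglomerative steps (a), (b) and (c) can be illustrated with a minimal sketch of the classic NJ algorithm on a complete distance matrix. This is not the NJ*, BIONJ* or MVR* adaptation described above: it assumes every pairwise distance is available and uses the standard NJ pair selection criterion, branch length formulas and matrix reduction. The function name `neighbor_joining`, the `frozenset`-keyed dictionary and the `node0`, `node1`, ... labels are hypothetical choices made for this illustration only.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Classic NJ on a COMPLETE distance matrix (illustrative sketch only).

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance, for every pair.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    d = dict(dist)          # working copy; reduced distances are added to it
    active = list(names)    # taxa (or internal nodes) still to be agglomerated
    edges = []
    counter = 0

    while len(active) > 2:
        r = len(active)
        # Sum of distances from each active node to all the others.
        total = {i: sum(d[frozenset((i, k))] for k in active if k != i)
                 for i in active}

        # Step (a): select the pair minimizing Q = (r - 2) d_ij - S_i - S_j.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (r - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dij = d[frozenset((i, j))]

        # Step (b): lengths of the two branches joining i and j to new node u.
        li = 0.5 * dij + (total[i] - total[j]) / (2 * (r - 2))
        lj = dij - li
        u = f"node{counter}"
        counter += 1
        edges += [(u, i, li), (u, j, lj)]

        # Step (c): reduce the matrix by replacing i and j with u.
        for k in active:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                )
        active = [k for k in active if k not in (i, j)] + [u]

    # The last two nodes are joined by a single branch.
    a, b = active
    edges.append((a, b, d[frozenset((a, b))]))
    return edges
```

The loop runs O(n) times and each iteration scans O(n^2) pairs, which gives the O(n^3) complexity mentioned above. With missing entries, the row sums and the reduction formula in this sketch are no longer defined for every pair; that gap is what the generalized criterion and the modified steps (b) and (c) address.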

[1]  N. Galtier A model of horizontal gene transfer and the bacterial phylogeny problem. , 2007, Systematic biology.

[2]  M. Sanderson,et al.  Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. , 2006, Systematic biology.

[3]  D. Penny,et al.  Neighbor-joining uses the optimal weight for net divergence. , 1993, Molecular phylogenetics and evolution.

[4]  A. Tversky,et al.  Additive similarity trees , 1977 .

[5]  R. Sanjuán,et al.  Weighted least-squares likelihood ratio test for branch testing in phylogenies reconstructed from distance measures. , 2005, Systematic biology.

[6]  D. Robinson,et al.  Comparison of weighted labelled trees , 1979 .

[7]  G. Soete Ultrametric tree representations of incomplete dissimilarity data , 1984 .

[8]  Sudhir Kumar,et al.  Efficiency of the Neighbor-Joining Method in Reconstructing Deep and Shallow Evolutionary Relationships in Large Phylogenies , 2000, Journal of Molecular Evolution.

[9]  M. Nei,et al.  Efficiencies of fast algorithms of phylogenetic inference under the criteria of maximum parsimony, minimum evolution, and maximum likelihood when a large number of sequences are used. , 2000, Molecular biology and evolution.

[10]  Boris Mirkin,et al.  Mathematical Classification and Clustering , 1996 .

[11]  Mike Steel,et al.  Maximum likelihood supertrees. , 2007, Systematic biology.

[12]  P. Holland,et al.  Phylogenomics of eukaryotes: impact of missing data on large alignments. , 2004, Molecular biology and evolution.

[13]  M. Ragan Phylogenetic inference based on matrix representation of trees. , 1992, Molecular phylogenetics and evolution.

[14]  O. Gascuel A note on Sattath and Tversky's, Saitou and Nei's, and Studier and Keppler's algorithms for inferring phylogenies from evolutionary distances. , 1994, Molecular biology and evolution.

[15]  Siu-Ming Yiu,et al.  Reconstructing an Ultrametric Galled Phylogenetic Network from a Distance Matrix , 2005, MFCS.

[16]  Vladimir Makarenkov,et al.  An Algorithm for the Fitting of a Tree Metric According to a Weighted Least-Squares Criterion , 1999 .

[17]  F. Lapointe,et al.  Estimating Phylogenies from Lacunose Distance Matrices: Additive is Superior to Ultrametric Estimation , 1996 .

[18]  D. Kendall,et al.  Mathematics in the Archaeological and Historical Sciences , 1971, The Mathematical Gazette.

[19]  O. Bininda-Emonds Phylogenetic Supertrees: Combining Information To Reveal The Tree Of Life , 2004 .

[20]  Jens Lagergren,et al.  Fast neighbor joining , 2005, Theor. Comput. Sci..

[21]  Vladimir Makarenkov,et al.  T-REX: reconstructing and visualizing phylogenetic trees and reticulation networks , 2001, Bioinform..

[22]  onrad,et al.  Resolution of a Supertree / Supermatrix Paradox , 2002 .

[23]  Siu-Ming Yiu,et al.  Reconstructing an Ultrametric Galled Phylogenetic Network from a Distance Matrix , 2006, J. Bioinform. Comput. Biol..

[24]  M. P. Cummings PHYLIP (Phylogeny Inference Package) , 2004 .

[25]  P. Buneman The Recovery of Trees from Measures of Dissimilarity , 1971 .

[26]  A. Mood,et al.  The statistical sign test. , 1946, Journal of the American Statistical Association.

[27]  F. McMorris,et al.  Mathematical Hierarchies and Biology , 1997 .

[28]  Alex Bateman,et al.  QuickTree: building huge Neighbour-Joining trees of protein sequences , 2002, Bioinform..

[29]  François-Joseph Lapointe,et al.  A weighted least-squares approach for inferring phylogenies from incomplete distance matrices , 2004, Bioinform..

[30]  J. A. Studier,et al.  A note on the neighbor-joining algorithm of Saitou and Nei. , 1988, Molecular biology and evolution.

[31]  N. Saitou,et al.  The neighbor-joining method: a new method for reconstructing phylogenetic trees. , 1987, Molecular biology and evolution.

[32]  O Gascuel,et al.  BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. , 1997, Molecular biology and evolution.

[33]  M. Nei,et al.  Relative efficiencies of the maximum parsimony and distance-matrix methods in obtaining the correct phylogenetic tree. , 1988, Molecular biology and evolution.

[34]  O. Gascuel,et al.  Neighbor-joining revealed. , 2006, Molecular biology and evolution.

[35]  Arndt von Haeseler,et al.  Shortest triplet clustering: reconstructing large phylogenies using representative sets , 2005, BMC Bioinformatics.

[36]  A. Halpern,et al.  Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. , 2000, Molecular biology and evolution.

[37]  Thomas Mailund,et al.  QuickJoin - fast neighbour-joining tree reconstruction , 2004, Bioinform..

[38]  N. Saitou,et al.  Relative Efficiencies of the Fitch-Margoliash, Maximum-Parsimony, Maximum-Likelihood, Minimum-Evolution, and Neighbor-joining Methods of Phylogenetic Tree Construction in Obtaining the Correct Tree , 1989 .

[39]  Andy Purvis,et al.  A higher-level MRP supertree of placental mammals , 2006, BMC Evolutionary Biology.

[40]  M. Rosenberg,et al.  Traditional phylogenetic reconstruction methods reconstruct shallow and deep evolutionary relationships equally well. , 2001, Molecular biology and evolution.

[41]  M. Steel,et al.  Distributions of Tree Comparison Metrics—Some New Results , 1993 .

[42]  Thylogale,et al.  THE AVERAGE CONSENSUS PROCEDURE: COMBINATION OF WEIGHTED TREES CONTAINING IDENTICAL OR OVERLAPPING SETS OF TAXA , 2009 .

[43]  Olivier Gascuel,et al.  Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle , 2002, WABI.

[44]  M. Nei,et al.  The optimization principle in phylogenetic analysis tends to give incorrect topologies when the number of nucleotides or amino acids used is small. , 1998, Proceedings of the National Academy of Sciences of the United States of America.

[45]  K. Strimmer,et al.  Accuracy of neighbor joining for n-taxon trees , 1996 .

[46]  A. Guénoche,et al.  Approximations par arbre d'une distance partielle , 1999 .

[47]  Fred R. McMorris,et al.  COMPARISON OF UNDIRECTED PHYLOGENETIC TREES BASED ON SUBTREES OF FOUR EVOLUTIONARY UNITS , 1985 .

[48]  J. Foster,et al.  Relaxed Neighbor Joining: A Fast Distance-Based Phylogenetic Tree Construction Method , 2006, Journal of Molecular Evolution.

[49]  Olivier Gascuel,et al.  A Fast and Accurate Distance Algorithm to Reconstruct Tandem Duplication Trees , 2001 .

[50]  David Bryant,et al.  On the Uniqueness of the Selection Criterion in Neighbor-Joining , 2005, J. Classif..

[51]  J. Felsenstein,et al.  A simulation comparison of phylogeny algorithms under equal and unequal evolutionary rates. , 1994, Molecular biology and evolution.

[52]  O. Gascuel,et al.  A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. , 2003, Systematic biology.

[53]  David Fernández-Baca,et al.  Performance of flip supertree construction with a heuristic algorithm. , 2004, Systematic biology.

[54]  Olivier Gascuel,et al.  An efficient and accurate distance based algorithm to reconstruct tandem duplication trees , 2002, ECCB.

[55]  E. -,et al.  Properties of Matrix Representation with Parsimony Analyses , 2000 .

[56]  Olivier Gascuel,et al.  Data Model and Classification by Trees: The Minimum Variance Reduction (MVR) Method , 2000, J. Classif..

[57]  J. Felsenstein An alternating least squares approach to inferring phylogenies from pairwise distances. , 1997, Systematic biology.

[58]  J. G. Burleigh,et al.  Prospects for Building the Tree of Life from Large Sequence Databases , 2004, Science.

[59]  G. Giribet,et al.  TNT: Tree Analysis Using New Technology , 2005 .

[60]  B. Baum Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees , 1992 .

[61]  Olivier Gascuel,et al.  SDM: a fast distance-based approach for (super) tree building in phylogenomics. , 2006, Systematic biology.

[62]  Olivier Gascuel,et al.  Concerning the NJ algorithm and its unweighted version, UNJ , 1996, Mathematical Hierarchies and Biology.