Let them fall where they may: congruence analysis in massive phylogenetically messy data sets.

Interest in congruence in phylogenetic data has largely focused on issues affecting multicellular organisms, and animals in particular, in which the level of incongruence is expected to be relatively low. In addition, assessment methods developed in the past were designed for reasonably small numbers of loci and scale poorly to larger data sets. However, there are currently over a thousand complete genome sequences available and of interest to evolutionary biologists, and these sequences are predominantly from microbial organisms, whose molecular evolution is much less frequently tree-like than that of multicellular life forms. Consequently, the level of incongruence in these data is expected to be high. We present a congruence method that accommodates both very large numbers of genes and high degrees of incongruence. Our method uses clustering algorithms to identify subsets of genes based on similarity of phylogenetic signal. It involves only a single phylogenetic analysis per gene, so computation time scales nearly linearly with the number of genes in the data set. We show that our method performs very well with sets of sequence alignments simulated under a wide variety of conditions. In addition, we present an analysis of core genes of prokaryotes, often assumed to have been largely vertically inherited, in which we identify two highly incongruent classes of genes. This result is consistent with the complexity hypothesis.
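
The general idea of grouping genes by similarity of phylogenetic signal can be illustrated with a minimal sketch. It is not the implementation described here: it assumes that one tree has already been inferred per gene, that each tree is summarized by its set of non-trivial splits, that trees are compared with the Robinson-Foulds distance, and that genes are separated into two groups by the sign of the second eigenvector of a normalized graph Laplacian. The function names, the Gaussian affinity, and the choice of `sigma` are all hypothetical choices made for illustration.

```python
import numpy as np


def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: number of splits present in exactly one tree."""
    return len(splits_a ^ splits_b)


def two_way_spectral_split(dist, sigma=None):
    """Split items into two groups from a pairwise distance matrix.

    Assumption: a Gaussian affinity and the symmetric normalized Laplacian;
    other affinities or clustering algorithms could be used instead.
    """
    dist = np.asarray(dist, dtype=float)
    if sigma is None:
        positive = dist[dist > 0]
        # Hypothetical default: median of the non-zero distances.
        sigma = np.median(positive) if positive.size else 1.0
    affinity = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))
    d_inv_sqrt = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(dist)) - d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue.
    _, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs[:, 1] > 0).astype(int)


def split(*taxa):
    """One side of a bipartition, stored as a frozenset of taxon names."""
    return frozenset(taxa)


# Toy data: six gene trees on taxa A-E, each given by its two non-trivial
# splits (smaller side only). Genes 0-2 share one topology, genes 3-5 another.
topology_1 = frozenset({split("A", "B"), split("D", "E")})
topology_2 = frozenset({split("A", "C"), split("B", "D")})
gene_trees = [topology_1] * 3 + [topology_2] * 3

# One tree per gene, then a pairwise comparison of those trees.
n = len(gene_trees)
dist = np.array([[rf_distance(gene_trees[i], gene_trees[j]) for j in range(n)]
                 for i in range(n)])
labels = two_way_spectral_split(dist)
print(labels)
```

In this toy case the two groups of genes with conflicting topologies receive different labels. The expensive step, tree inference, is done once per gene, while the pairwise comparison operates only on the already inferred trees.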
