Detecting Phylogenetic Breakpoints and Discordance from Genome-Wide Alignments for Species Tree Reconstruction

Because sequence data are now easy to acquire, it is possible to obtain and align whole genomes across multiple related species or populations. In this work, I assess the performance of a statistical method that reconstructs the full distribution of phylogenetic trees along the genome, estimates the proportion of the genome for which a given clade is true, and infers a concordance tree that summarizes the dominant vertical inheritance pattern. Whole-genome alignments raise two main issues that alignments of multiple genes do not: the size of the data and the detection of recombination breakpoints. These breakpoints partition the genomic alignment into phylogenetically homogeneous loci, such that all sites within a locus share the same tree topology. To delimit these loci, I describe a method based on the minimum description length (MDL) principle, implemented with dynamic programming for computational efficiency. Simulations show that combining MDL partitioning with Bayesian concordance analysis provides an efficient and robust way to estimate both the vertical inheritance signal and the horizontal phylogenetic signal. The method performed well in the presence of incomplete lineage sorting as well as in the presence of horizontal gene transfer. The simulations also revealed a high level of systematic bias, which highlights the need for good individual tree-building methods, since these form the basis for more elaborate gene tree/species tree reconciliation methods.
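To illustrate the general idea of partitioning an alignment by dynamic programming under a description-length-style criterion, here is a minimal sketch. It is not the implementation evaluated in this work. The abstract does not specify the cost, so the per-segment cost used here (`mismatch_cost`, which counts sites that disagree with the majority label of the segment) is a toy stand-in, and the names `mdl_partition`, `penalty`, and `max_len` are assumptions introduced for illustration only. Each site is assumed to carry a label for the topology it favors; a real analysis would instead score each candidate locus with a tree-building method.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple


def mismatch_cost(sites: Sequence[str], start: int, end: int) -> float:
    """Toy cost of the segment sites[start:end] (hypothetical stand-in).

    Counts the sites whose label differs from the segment's majority label.
    """
    counts = Counter(sites[start:end])
    return float((end - start) - max(counts.values()))


def mdl_partition(
    sites: Sequence[str],
    cost: Callable[[Sequence[str], int, int], float],
    penalty: float,
    max_len: Optional[int] = None,
) -> Tuple[List[Tuple[int, int]], float]:
    """Return the segmentation minimizing sum(cost + penalty) over segments.

    `penalty` plays the role of the per-segment model cost: each extra
    breakpoint must pay for itself through a better fit.
    """
    n = len(sites)
    best = [0.0] + [float("inf")] * n  # best[e]: optimal score of sites[:e]
    back = [0] * (n + 1)               # back[e]: start of the last segment
    for end in range(1, n + 1):
        lo = 0 if max_len is None else max(0, end - max_len)
        for start in range(lo, end):
            total = best[start] + cost(sites, start, end) + penalty
            if total < best[end]:
                best[end] = total
                back[end] = start
    # Trace the optimal breakpoints back from the end of the alignment.
    segments: List[Tuple[int, int]] = []
    end = n
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    segments.reverse()
    return segments, best[n]


# Fifteen sites whose favored topology switches from A to B and back to A.
toy_sites = list("AAAAABBBBBAAAAA")
segments, score = mdl_partition(toy_sites, mismatch_cost, penalty=1.5)
print(segments, score)  # [(0, 5), (5, 10), (10, 15)] 4.5
```

With a small penalty the two breakpoints are recovered; raising the penalty (for example to 10) makes a single segment cheaper, which mirrors the trade-off between fit and model complexity that an MDL criterion formalizes. The double loop is quadratic in the number of sites, so the optional `max_len` bound on segment length is one way to keep the search tractable on long alignments.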
