Phylogenetic inference under recombination using Bayesian stochastic topology selection

Motivation: Conventional phylogenetic analysis for characterizing the relatedness between taxa typically assumes that a single relationship exists between species at every site along the genome. This assumption fails to take into account recombination, which is a fundamental process for generating diversity, and ignoring it can lead to spurious results. Recombination induces a localized phylogenetic structure that may vary along the genome. Here, we generalize a hidden Markov model (HMM) to infer changes in phylogeny along multiple sequence alignments while accounting for rate heterogeneity; the hidden states refer to the unobserved phylogenetic topology underlying the relatedness at a genomic location. The number of hidden states (topologies) and their structure are random (not known a priori) and are sampled using Markov chain Monte Carlo algorithms. The HMM structure allows us to analytically integrate over all possible changepoints in topologies as well as all the unknown branch lengths.

Results: We demonstrate our approach on simulated data, on the genome of a suspected HIV recombinant strain, and in an investigation of recombination in the sequences of 15 laboratory mouse strains sequenced by Perlegen Sciences. Our findings indicate that our method allows us to distinguish between rate heterogeneity and variation in phylogeny caused by recombination without being restricted to 4-taxa data.

Availability: The method has been implemented in Java and is available, along with the data studied here, from http://www.stats.ox.ac.uk/~webb.

Contact: cholmes@stats.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.
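As an illustration of how an HMM lets one sum over every possible placement of topology changepoints, the following is a minimal sketch of the standard forward recursion in log space. It is not the authors' Java implementation. The names `forward_log_marginal`, `brute_force_log_marginal`, `site_loglik` and `stay_prob` are hypothetical, the per-site log-likelihoods under each candidate topology are assumed to be given (in the paper these would come from a substitution model with branch lengths integrated out), and a uniform initial distribution with a single "stay" probability is an assumed simplification of the transition structure.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_marginal(site_loglik, stay_prob):
    """Log-likelihood of an alignment, summed over all topology paths.

    site_loglik[t][k] is the (assumed given) log-likelihood of alignment
    column t under candidate topology k. stay_prob is the assumed
    probability of keeping the same topology between adjacent sites;
    the remainder is split evenly among the other topologies.
    """
    n_states = len(site_loglik[0])
    if n_states < 2 or not 0.0 < stay_prob < 1.0:
        raise ValueError("need at least 2 topologies and 0 < stay_prob < 1")
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))
    # Uniform initial distribution over topologies (an assumption).
    alpha = [-math.log(n_states) + ll for ll in site_loglik[0]]
    for row in site_loglik[1:]:
        alpha = [
            row[k] + logsumexp([
                alpha[j] + (log_stay if j == k else log_switch)
                for j in range(n_states)
            ])
            for k in range(n_states)
        ]
    return logsumexp(alpha)


def brute_force_log_marginal(site_loglik, stay_prob):
    """Same quantity by explicit enumeration of every topology path.

    Exponential in the number of sites; only useful as a check.
    """
    n_sites, n_states = len(site_loglik), len(site_loglik[0])
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))
    terms = []
    for path in product(range(n_states), repeat=n_sites):
        lp = -math.log(n_states) + site_loglik[0][path[0]]
        for t in range(1, n_sites):
            lp += log_stay if path[t] == path[t - 1] else log_switch
            lp += site_loglik[t][path[t]]
        terms.append(lp)
    return logsumexp(terms)
```

The forward recursion costs on the order of (sites × topologies²) operations, whereas explicit enumeration grows exponentially with the number of sites. In the approach described above, the set of topologies itself is not fixed but sampled by MCMC; that outer sampling step is not shown here.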
