Scalable Statistical Introgression Mapping Using Approximate Coalescent-Based Inference

Recent advances in biomolecular sequencing have revealed the important role that interspecific gene flow has played in genome evolution throughout the Tree of Life. Current and future genomic studies will bring large amounts of genomic sequence data to bear upon this topic, and scalable computational methodologies are needed to detect and analyze genomic signatures of interspecific introgression in large-scale datasets. To address this methodological gap, we introduce a new computational framework known as PHiMM (or "fast PhyloNet + Hidden Markov Model"). PHiMM pairs inference and learning under a joint model of genetic drift, substitutions, recombination, and gene flow with a coalescent-based approximation technique. We compare the performance of PHiMM against the state of the art using synthetic and empirical genomic sequence data. We find that PHiMM improves computational runtime and main memory usage by multiple orders of magnitude while returning comparable inference accuracy. An open-source software implementation of the PHiMM framework and open data are publicly available at https://gitlab.msu.edu/liulab/phimm-dataset.
