Fast Estimation of Recombination Rates Using Topological Data Analysis

Accurate estimation of recombination rates is critical for studying the origins and maintenance of genetic diversity. Because inferring recombination rates under a full evolutionary model is computationally expensive, we developed an alternative approach that applies topological data analysis (TDA) to genome sequences. We find that this method can analyze datasets larger than any existing recombination inference software can handle, and that it achieves accuracy comparable to commonly used model-based methods in significantly less processing time. Previous TDA methods used only the information contained in the first Betti number (β1) of a set of genomes, which aims to capture the number of loops that can be detected within a genealogy. These approaches have proven difficult to connect to the theory of the underlying biological process of recombination and, consequently, behave unpredictably under perturbations of the data. We introduce a new topological feature, which we call ψ, that has a natural connection to coalescent models, and we present novel arguments relating β1 to population genetic models. Using simulations, we show that ψ and β1 are differentially affected by missing data, and we package our approach as TREE (Topological Recombination Estimator). TREE’s efficiency and accuracy make it well suited as a first-pass estimator of recombination rate heterogeneity or hotspots throughout the genome. Our work empirically and theoretically justifies the use of topological statistics as summaries of genome sequences and describes a new, unintuitive relationship between topological features of the distribution of sequence data and the footprint of recombination on genomes.
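
To make the role of β1 concrete, the following is a minimal sketch, not TREE's implementation. It assumes haplotypes encoded as equal-length 0/1 strings, Hamming distance between them, and a Vietoris–Rips complex built at a single fixed scale `eps` rather than a full persistent-homology computation. The function names `gf2_rank` and `betti_1` are hypothetical and chosen for illustration only. β1 is computed over GF(2) as (number of edges) − rank ∂1 − rank ∂2.

```python
from itertools import combinations

import numpy as np


def gf2_rank(matrix):
    """Rank of a 0/1 matrix over GF(2), by Gaussian elimination."""
    m = np.array(matrix, dtype=np.uint8) % 2
    rows, cols = m.shape
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivot_rows = np.nonzero(m[rank:, col])[0]
        if pivot_rows.size == 0:
            continue
        pivot = rank + pivot_rows[0]
        m[[rank, pivot]] = m[[pivot, rank]]  # move the pivot row into place
        for r in range(rows):
            if r != rank and m[r, col]:
                m[r] ^= m[rank]  # row addition over GF(2)
        rank += 1
    return rank


def hamming(a, b):
    """Number of sites at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def betti_1(haplotypes, eps):
    """First Betti number (over GF(2)) of the Vietoris-Rips complex at scale eps.

    Illustrative assumption: one fixed scale, not a persistence barcode.
    """
    n = len(haplotypes)
    # 1-simplices: pairs of haplotypes within Hamming distance eps.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if hamming(haplotypes[i], haplotypes[j]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    # 2-simplices: triples whose three pairwise edges are all present.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if (i, j) in edge_index
                 and (i, k) in edge_index
                 and (j, k) in edge_index]
    # Boundary matrix d1: vertices x edges.
    d1 = np.zeros((n, len(edges)), dtype=np.uint8)
    for k, (i, j) in enumerate(edges):
        d1[i, k] = 1
        d1[j, k] = 1
    # Boundary matrix d2: edges x triangles.
    d2 = np.zeros((len(edges), len(triangles)), dtype=np.uint8)
    for t, (i, j, k) in enumerate(triangles):
        for e in ((i, j), (i, k), (j, k)):
            d2[edge_index[e], t] = 1
    # beta_1 = dim ker d1 - rank d2 = (#edges - rank d1) - rank d2
    return len(edges) - gf2_rank(d1) - gf2_rank(d2)


# All four two-site gametes: a pattern that, under an infinite-sites model,
# cannot arise without recombination.
four_gametes = ["00", "01", "11", "10"]
# Only three of the four gametes: consistent with a tree.
three_gametes = ["00", "01", "11"]

print(betti_1(four_gametes, eps=1))
print(betti_1(three_gametes, eps=1))
```

In this toy case the four-gamete sample forms a single loop at `eps=1`, while the three-gamete sample is tree-like and has no loop. The loop disappears at `eps=2`, once every pair is connected and triangles fill it in, which is why topological methods typically track such features across a range of scales.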
