Fast Estimation of Recombination Rates Using Topological Data Analysis

Accurate estimation of recombination rates is critical for studying the origins and maintenance of genetic diversity. Because inferring recombination rates under a full evolutionary model is computationally expensive, an alternative approach based on topological data analysis (TDA) has been proposed. Previous TDA methods used only the information contained in the first Betti number (β1) of the cloud of genomes, which relates to the number of loops that can be detected within a genealogy. Although these methods are considerably less computationally intensive than current biological model-based methods, they have proven difficult to connect to the theory of the underlying biological process of recombination, and consequently behave unpredictably under different perturbations of the data. We introduce a new topological feature with a natural connection to coalescent models, which we call ψ. We show that ψ and β1 are differentially affected by changes to the structure of the data, and we use them in conjunction to provide a robust, efficient, and accurate estimator of recombination rates, TREE. Compared to previous TDA methods, TREE more closely approximates the results of commonly used model-based methods. These characteristics make TREE well suited as a first-pass estimator of recombination rate heterogeneity or hotspots throughout the genome. In addition, we present novel arguments relating β1 to population genetic models; our work justifies the use of topological statistics as summaries of distributions of genome sequences and describes a new, unintuitive relationship between topological summaries of distance and the footprint of recombination on genome sequences.
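
To make the role of β1 concrete, the following is a minimal sketch, not the TREE estimator and not the authors' pipeline. It assumes haplotypes are given as equal-length strings, uses Hamming distance, and counts independent loops in a Vietoris-Rips complex at a single fixed scale, whereas TDA methods of the kind described above typically track such loops across all scales with persistent homology. The function names `hamming` and `betti1_at_scale` and the toy data are hypothetical.

```python
from itertools import combinations

import numpy as np


def hamming(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def betti1_at_scale(seqs, eps):
    """First Betti number (over the rationals) of the Vietoris-Rips
    complex on `seqs` at Hamming scale `eps`.

    Uses beta_1 = (#edges - rank d1) - rank d2, where d1 and d2 are the
    edge-to-vertex and triangle-to-edge boundary matrices.
    """
    n = len(seqs)
    # Edges join sequences that lie within distance eps of each other.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if hamming(seqs[i], seqs[j]) <= eps]
    if not edges:
        return 0
    edge_index = {e: k for k, e in enumerate(edges)}
    # A triangle is filled in when all three of its edges are present.
    triangles = [t for t in combinations(range(n), 3)
                 if all(p in edge_index for p in combinations(t, 2))]

    d1 = np.zeros((n, len(edges)))
    for k, (i, j) in enumerate(edges):
        d1[i, k], d1[j, k] = -1.0, 1.0

    rank2 = 0
    if triangles:
        d2 = np.zeros((len(edges), len(triangles)))
        for k, (a, b, c) in enumerate(triangles):
            d2[edge_index[(b, c)], k] = 1.0
            d2[edge_index[(a, c)], k] = -1.0
            d2[edge_index[(a, b)], k] = 1.0
        rank2 = np.linalg.matrix_rank(d2)

    return int(len(edges) - np.linalg.matrix_rank(d1) - rank2)


# Toy data: two biallelic sites. All four gametes together form a loop,
# the classic signature of recombination; three gametes are tree-like.
four_gametes = ["00", "01", "11", "10"]
three_gametes = ["00", "01", "11"]
print(betti1_at_scale(four_gametes, 1))   # one loop at scale 1
print(betti1_at_scale(three_gametes, 1))  # no loop
```

In this toy case the loop disappears once the scale reaches 2, because every triangle is then filled in. That scale dependence is why a single fixed-scale count is only an illustration of what β1 measures.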
