Inferring the ancestry of everyone

A central problem in evolutionary biology is to infer the full genealogical history of a set of DNA sequences. This history contains rich information about the forces that have influenced a sexually reproducing species. However, existing methods are limited: the most accurate cannot handle more than a few dozen samples. With modern genetic data sets rapidly approaching millions of genomes, there is an urgent need for efficient inference methods that can exploit such rich resources. We introduce an algorithm to infer whole-genome history that is comparable in accuracy to the state of the art but can process around four orders of magnitude more sequences. Additionally, our method produces an “evolutionary encoding” of the original sequence data, enabling efficient access to genealogies and calculation of genetic statistics over the data. We apply this technique to human data from the 1000 Genomes Project, the Simons Genome Diversity Project, and the UK Biobank, showing that the genealogies we estimate are both rich in biological signal and efficient to process.
