Computation of the Likelihood of Joint Site Frequency Spectra Using Orthogonal Polynomials

In population genetics, information about evolutionary forces, e.g., mutation, selection and genetic drift, is often inferred from DNA sequence data. Generally, DNA consists of two long strands of nucleotides or sites that pair via the complementary bases cytosine and guanine (C and G), on the one hand, and adenine and thymine (A and T), on the other. With whole-genome sequencing, most genomic information stored in the DNA has become available for multiple individuals of one or more populations, at least in humans and model species, such as fruit flies of the genus Drosophila. In a genome-wide sample of L sites for M (haploid) individuals, the state of each site may be made binary by binning the complementary bases, e.g., C with G to C/G, and contrasting C/G with A/T, to obtain a “site frequency spectrum” (SFS). Two such samples, taken either from a single population at different time points or from two related populations at a single time point, are called joint site frequency spectra (joint SFS). While mathematical models describing the interplay of mutation, drift and selection have been available for more than 80 years, calculation of exact likelihoods from joint SFS is difficult. Sufficient statistics for inference of, e.g., mutation or selection parameters that would make use of all the information in the genomic data are rarely available. Hence, suites of crude summary statistics are often combined in simulation-based computational approaches. In this article, we use a bi-allelic boundary-mutation and drift population genetic model to compute the transition probabilities of joint SFS using orthogonal polynomials. This allows inference of population genetic parameters, such as the mutation rate (scaled by the population size) and the time separating the two samples.
We apply this inference method to a population dataset of neutrally evolving short intronic sites from six DNA sequences of the fruit fly Drosophila melanogaster and the reference sequence of the related species Drosophila sechellia.
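The binning step described above can be illustrated with a minimal Python sketch. It only tabulates an SFS and a joint SFS from aligned haploid sequences; it does not implement the orthogonal-polynomial likelihood computation. The function names (`gc_counts`, `site_frequency_spectrum`, `joint_site_frequency_spectrum`) and the toy sequences are hypothetical, and skipping sites that contain gaps or ambiguous bases is an assumption made here, not a rule taken from the article.

```python
def gc_counts(sequences):
    """Per-site number of sequences carrying C or G.

    Returns None for a site where any sequence has a character other
    than A, C, G or T (assumption: such sites are simply skipped).
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to equal length")
    counts = []
    for column in zip(*sequences):
        bases = [base.upper() for base in column]
        if all(base in "ACGT" for base in bases):
            counts.append(sum(base in "CG" for base in bases))
        else:
            counts.append(None)
    return counts


def site_frequency_spectrum(sequences):
    """SFS of length M + 1: entry y counts the sites at which exactly
    y of the M haploid sequences are in the binned state C/G."""
    sfs = [0] * (len(sequences) + 1)
    for y in gc_counts(sequences):
        if y is not None:
            sfs[y] += 1
    return sfs


def joint_site_frequency_spectrum(sample_a, sample_b):
    """Joint SFS as an (Ma + 1) x (Mb + 1) table: entry [ya][yb] counts
    the sites with ya C/G copies in sample_a and yb in sample_b."""
    table = [[0] * (len(sample_b) + 1) for _ in range(len(sample_a) + 1)]
    for ya, yb in zip(gc_counts(sample_a), gc_counts(sample_b)):
        if ya is not None and yb is not None:
            table[ya][yb] += 1
    return table


# Toy data (hypothetical): three sequences from one sample and a single
# reference sequence from a second sample, aligned over five sites.
sample_a = ["ACGTA", "ACGAA", "GCTTA"]
sample_b = ["ACGTC"]

sfs_a = site_frequency_spectrum(sample_a)
joint = joint_site_frequency_spectrum(sample_a, sample_b)
```

For the toy data, `sfs_a` has M + 1 = 4 entries and `joint` is a 4 x 2 table, mirroring the layout of a multi-sequence population sample paired with a single reference sequence.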
