FST and kinship for arbitrary population structures II: Method of moments estimators

FST and kinship are key parameters that are routinely estimated in modern population genetics studies. Kinship matrices have also become fundamental to genome-wide association studies and heritability estimation. The most frequently used estimators of FST and kinship are method of moments estimators whose accuracy depends strongly on the existence of simple underlying forms of structure, such as the island model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure are unlikely to hold in many populations, including humans. In this work, we provide new results on the behavior of these estimators in the presence of arbitrarily complex population structures. After establishing a framework for assessing the bias and consistency of genome-wide estimators, we calculate the accuracy of FST and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges that do not arise under the models of structure these estimators originally assumed. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the island model. Using 1000 Genomes Project data, we verify that population-level pairwise FST estimates underestimate the differentiation measured by an individual-level pairwise FST estimator introduced here. We show that the calculated biases are due to unknown quantities that cannot be estimated under the established frameworks, highlighting the need for innovative estimation approaches in complex populations. We provide initial results that point towards a future estimation framework for generalized FST and kinship.
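
To make the kind of bias discussed above concrete, the following is a minimal sketch of one widely used method of moments kinship estimator, in its ratio-of-means form. It is an illustration under stated assumptions, not necessarily the exact estimator analyzed in the paper: the function name `standard_kinship`, the simulated genotypes, and all parameter values are hypothetical. Because each locus is centered by the sample allele frequency, every row of the estimated matrix sums to zero, so the estimates are forced to average to zero whatever the true kinship values are.

```python
import random


def standard_kinship(genotypes):
    """Ratio-of-means method of moments kinship estimate (illustrative sketch).

    genotypes: list of loci, each a list of allele counts (0, 1 or 2),
    one count per individual. Returns an n x n list of lists.
    """
    n = len(genotypes[0])
    num = [[0.0] * n for _ in range(n)]
    den = 0.0
    for locus in genotypes:
        # Sample allele frequency, estimated from the same individuals.
        p_hat = sum(locus) / (2.0 * n)
        if p_hat <= 0.0 or p_hat >= 1.0:
            continue  # fixed loci carry no information
        centered = [x - 2.0 * p_hat for x in locus]
        for j in range(n):
            for k in range(n):
                num[j][k] += centered[j] * centered[k]
        den += 4.0 * p_hat * (1.0 - p_hat)
    if den == 0.0:
        raise ValueError("no polymorphic loci")
    return [[num[j][k] / den for k in range(n)] for j in range(n)]


# Hypothetical toy data: 20 unstructured individuals, 200 independent loci.
rng = random.Random(0)
n_ind, n_loci = 20, 200
genotypes = []
for _ in range(n_loci):
    p = rng.uniform(0.1, 0.9)
    genotypes.append(
        [(rng.random() < p) + (rng.random() < p) for _ in range(n_ind)]
    )

kin = standard_kinship(genotypes)
# Centering by p_hat makes every row sum to zero (up to rounding).
max_abs_row_sum = max(abs(sum(row)) for row in kin)
```

The zero row sums hold for any input, structured or not, which is one simple way to see that an estimator of this form cannot recover the absolute level of kinship without additional information.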
