Genotype-Frequency Estimation from High-Throughput Sequencing Data

Rapidly improving high-throughput sequencing technologies provide unprecedented opportunities for population-genomic studies in a wide range of organisms. To take full advantage of these technologies, it is essential to estimate allele and genotype frequencies correctly, and here we present a maximum-likelihood method that accomplishes both tasks. The proposed method fully accounts for uncertainties resulting from sequencing errors and biparental chromosome sampling. At moderately high depths of coverage, it yields essentially unbiased estimates with minimal sampling variances, regardless of the mating system and structure of the population. Moreover, we have developed statistical tests for examining the significance of polymorphisms and of their genotypic deviations from Hardy–Weinberg equilibrium. We examine the performance of the proposed method by computer simulations and apply it to low-coverage human data generated by high-throughput sequencing. The results show that the proposed method improves our ability to carry out population-genomic analyses in important ways. The software package implementing the proposed method is freely available from https://github.com/Takahiro-Maruki/Package-GFE.
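To make the idea of likelihood-based genotype-frequency estimation concrete, the following is a minimal sketch and not the implementation in the GFE package. It assumes a single biallelic site, a fixed and known per-read error rate, and errors that only swap the two alleles; the function names and the read-count input format are hypothetical. Each individual contributes a mixture likelihood over the three genotypes, which reflects both sequencing error and the random sampling of the two parental chromosomes, and the genotype frequencies are then found with a simple EM iteration.

```python
def genotype_likelihoods(n_major, n_minor, err):
    """Likelihood of one individual's read counts under genotypes MM, Mm, mm.

    Simplifying assumptions (not from the paper): a biallelic site and an
    error that turns one allele into the other with probability `err`.
    The binomial coefficient is omitted because it is the same for all
    three genotypes and cancels in the EM update.
    """
    lik_MM = (1.0 - err) ** n_major * err ** n_minor
    # A heterozygote yields either allele with probability 1/2 per read;
    # with symmetric two-allele error this stays 1/2.
    lik_Mm = 0.5 ** (n_major + n_minor)
    lik_mm = err ** n_major * (1.0 - err) ** n_minor
    return (lik_MM, lik_Mm, lik_mm)


def estimate_genotype_freqs(reads, err=0.01, n_iter=200, tol=1e-10):
    """EM estimate of (P_MM, P_Mm, P_mm) from per-individual read counts.

    `reads` is a list of (n_major, n_minor) tuples, one per individual.
    No Hardy-Weinberg assumption is made: the three frequencies are free
    parameters constrained only to sum to one.
    """
    liks = [genotype_likelihoods(a, b, err) for a, b in reads]
    freqs = [1.0 / 3.0] * 3
    for _ in range(n_iter):
        totals = [0.0, 0.0, 0.0]
        for lik in liks:
            # E-step: posterior probability of each genotype for this individual.
            weighted = [f * l for f, l in zip(freqs, lik)]
            norm = sum(weighted)
            for g in range(3):
                totals[g] += weighted[g] / norm
        # M-step: new frequencies are the average posterior probabilities.
        new_freqs = [t / len(liks) for t in totals]
        converged = max(abs(a - b) for a, b in zip(new_freqs, freqs)) < tol
        freqs = new_freqs
        if converged:
            break
    return freqs


def minor_allele_freq(freqs):
    """Minor-allele frequency implied by the genotype frequencies."""
    return freqs[2] + 0.5 * freqs[1]


# Toy data: ten individuals at 20x coverage (made-up counts for illustration).
toy_reads = [(20, 0)] * 6 + [(10, 10)] * 3 + [(0, 20)] * 1
toy_freqs = estimate_genotype_freqs(toy_reads)
```

At low coverage the per-individual posteriors become diffuse, which is why working with the mixture likelihood rather than with hard genotype calls matters. A deviation from Hardy–Weinberg equilibrium could be examined in the same spirit by comparing this unconstrained likelihood with one maximized under Hardy–Weinberg genotype proportions, although the exact test statistic used by the authors is not specified here.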
