Estimation of population allele frequencies from next‐generation sequencing data: pool‐versus individual‐based genotyping

Molecular markers produced by next‐generation sequencing (NGS) technologies are revolutionizing genetic research. However, the costs of analysing large numbers of individual genomes remain prohibitive for most population genetics studies. Here, we present results based on mathematical derivations showing that, under many realistic experimental designs, NGS of DNA pools from diploid individuals allows allele frequencies at single nucleotide polymorphisms (SNPs) to be estimated with at least the same accuracy as individual‐based analyses, with considerably less library construction and sequencing effort. These findings hold even when individuals contribute substantially unequally to the final pool of sequence reads. We propose the intuitive notion of effective pool size to account for unequal pooling and derive a Bayesian hierarchical model to estimate this parameter directly from the data. We provide a user‐friendly application that assesses the accuracy of allele frequency estimation from both pool‐ and individual‐based NGS population data under various sampling, sequencing depth and experimental error designs. We illustrate our findings with theoretical examples and real data sets corresponding to SNP loci obtained using restriction site–associated DNA (RAD) sequencing in pool‐ and individual‐based experiments carried out on the same population of the pine processionary moth (Thaumetopoea pityocampa). NGS of DNA pools might not be optimal for all types of studies, but it provides a cost‐effective approach for estimating allele frequencies at very large numbers of SNPs. It thus allows comparison of genome‐wide patterns of genetic variation for large numbers of individuals in multiple populations.
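The trade-off described above can be illustrated with a minimal sketch. It is not the derivation from the paper: it assumes an idealized two-stage binomial sampling scheme (chromosomes drawn from the population, then reads drawn from the pool), equal contributions of all individuals, no sequencing or experimental error, and perfectly called genotypes in the individual-based design. The function names and the parameter values (`n_ind`, `n_pool`, `depth`) are hypothetical and chosen only for illustration.

```python
def var_individual(p, n_ind):
    """Sampling variance of the allele frequency estimate when n_ind diploid
    individuals are genotyped without error (assumption): binomial sampling
    of 2 * n_ind chromosomes."""
    return p * (1.0 - p) / (2 * n_ind)


def var_pool(p, n_pool, depth):
    """Sampling variance of the read-based allele frequency estimate for a pool
    of n_pool diploid individuals sequenced at `depth` reads per SNP.

    Assumes two nested binomial steps (chromosomes into the pool, then reads
    from the pool), equal individual contributions and no sequencing error.
    """
    chrom = 2 * n_pool
    return p * (1.0 - p) * (1.0 / chrom + (1.0 / depth) * (1.0 - 1.0 / chrom))


if __name__ == "__main__":
    p = 0.3  # hypothetical population allele frequency
    v_ind = var_individual(p, n_ind=20)
    v_pool = var_pool(p, n_pool=100, depth=50)
    print(f"individual-based, 20 diploids genotyped: variance = {v_ind:.6f}")
    print(f"pool-based, 100 diploids at 50 reads:    variance = {v_pool:.6f}")
```

Under these simplifying assumptions, a pool of 100 individuals read at a depth of 50 gives a slightly smaller sampling variance than 20 individually genotyped diploids, which is the kind of comparison the abstract refers to. Unequal contributions could be mimicked crudely by passing a smaller value for `n_pool`, but that substitution is an assumption of this sketch and not the effective pool size model derived in the paper.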
