Evaluation of Allele Frequency Estimation Using Pooled Sequencing Data Simulation

Next-generation sequencing (NGS) technology lets researchers study the genome in unprecedented detail, and it is increasingly applied to disease association studies. Unlike genotyping chips, NGS is not limited to a fixed set of SNPs. The cost of NGS is now comparable to that of SNP chips, although it can still be substantial for large studies, so pooling techniques are often used to reduce the overall cost of large-scale studies. In this study, we designed a rigorous simulation model to test the practicability of estimating allele frequency from pooled sequencing data. The model accounts for crucial factors including pool size, overall depth, average depth per sample, pooling variation, and sampling variation. We used real data to demonstrate and measure reference allele preference in DNA-seq data and implemented this bias in our simulation model. We found that pooled sequencing data can introduce a high relative error rate (defined as the error rate divided by the targeted allele frequency) and that the error is more severe for SNPs with low minor allele frequency than for SNPs with high minor allele frequency. To overcome the error introduced by pooling, we recommend a large pool size and a high average depth per sample.
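The factors listed above can be illustrated with a minimal simulation sketch. This is not the authors' model: the function names, the parameter names (`pooling_sd`, `ref_bias`), the normal distribution used for unequal DNA contribution, and the odds-style formula for reference allele preference are all assumptions chosen for illustration. The relative error follows the definition given in the abstract (error divided by the targeted allele frequency).

```python
import random


def simulate_pooled_af(true_af, pool_size, depth_per_sample,
                       pooling_sd=0.0, ref_bias=0.5, rng=None):
    """Simulate one pooled sequencing experiment at a single SNP and
    return the estimated alternate-allele frequency.

    All modelling choices below are illustrative assumptions.
    """
    rng = rng or random.Random()

    # Sampling variation: each diploid individual carries 0, 1 or 2
    # alternate alleles, drawn at the target population frequency.
    alt_counts = [sum(rng.random() < true_af for _ in range(2))
                  for _ in range(pool_size)]

    # Pooling variation (assumed normal): individuals contribute
    # unequal amounts of DNA to the pool; negative draws are clipped to 0.
    weights = [max(rng.gauss(1.0, pooling_sd), 0.0) for _ in range(pool_size)]
    total_weight = 2 * sum(weights)
    if total_weight == 0:
        return 0.0
    pool_af = sum(w * a for w, a in zip(weights, alt_counts)) / total_weight

    # Reference allele preference (assumed form): ref_bias is the chance
    # that a read from a balanced 50/50 site shows the reference allele,
    # so ref_bias = 0.5 means no bias.
    numerator = pool_af * (1.0 - ref_bias)
    denominator = numerator + (1.0 - pool_af) * ref_bias
    observed_af = numerator / denominator if denominator > 0 else 0.0

    # Read sampling: overall depth = pool size x average depth per sample.
    depth = pool_size * depth_per_sample
    alt_reads = sum(rng.random() < observed_af for _ in range(depth))
    return alt_reads / depth


def relative_error(estimate, true_af):
    """Absolute error divided by the targeted allele frequency."""
    return abs(estimate - true_af) / true_af


if __name__ == "__main__":
    rng = random.Random(1)
    for af in (0.01, 0.05, 0.25):
        errors = [relative_error(
                      simulate_pooled_af(af, pool_size=50, depth_per_sample=10,
                                         pooling_sd=0.2, ref_bias=0.52, rng=rng),
                      af)
                  for _ in range(200)]
        print(f"target AF {af}: mean relative error {sum(errors) / len(errors):.3f}")
```

Varying `pool_size` and `depth_per_sample` in the loop at the bottom is one way to explore how the relative error changes with the design, under these assumed distributions.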
