Estimation of Allele Frequencies From High-Coverage Genome-Sequencing Projects

A new generation of high-throughput sequencing strategies will soon lead to the acquisition of high-coverage genomic profiles of hundreds to thousands of individuals within species, generating unprecedented levels of information on the frequencies of nucleotides segregating at individual sites. However, because these new technologies are error-prone and yield uneven coverage of the two alleles in diploid individuals, they also introduce the need for novel methods for analyzing the raw read data. Here, a maximum-likelihood method for the estimation of allele frequencies is developed, eliminating both the need to arbitrarily discard individuals with low coverage and the requirement for an extrinsic measure of the sequence error rate. The resultant estimates are nearly unbiased with asymptotically minimal sampling variance, thereby defining the limits to our ability to estimate population-genetic parameters and providing a logical basis for the optimal design of population-genomic surveys.
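
The general idea of estimating an allele frequency and an error rate jointly from raw read counts can be illustrated with a minimal sketch. It is not the method developed in the paper: it assumes a biallelic site, Hardy-Weinberg genotype proportions, a single symmetric per-read error rate, and a coarse grid search in place of a proper optimizer. The function names (`log_likelihood`, `estimate`) and the example read counts are hypothetical.

```python
import math

def log_likelihood(p, eps, counts):
    """Log-likelihood of allele frequency p and error rate eps.

    counts holds one (k, n) pair per individual: k reads showing
    allele A out of n reads covering the site.
    """
    # Assumed Hardy-Weinberg genotype frequencies, each paired with the
    # probability that a single read reports allele A for that genotype.
    genotypes = [
        (p * p, 1.0 - eps),          # AA: a read shows A unless miscalled
        (2.0 * p * (1.0 - p), 0.5),  # Aa: either allele equally likely per read
        ((1.0 - p) ** 2, eps),       # aa: a read shows A only through error
    ]
    total = 0.0
    for k, n in counts:
        # Sum over the unknown genotype, so individuals with few reads
        # still contribute instead of being discarded.
        individual_lik = sum(
            freq * math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)
            for freq, q in genotypes
        )
        total += math.log(individual_lik)
    return total

def estimate(counts):
    """Grid-search the (p, eps) pair with the highest log-likelihood."""
    best_ll, best_p, best_eps = -math.inf, None, None
    for i in range(1, 100):          # p in 0.01 .. 0.99
        p = i / 100
        for j in range(1, 101):      # eps in 0.001 .. 0.100
            eps = j / 1000
            ll = log_likelihood(p, eps, counts)
            if ll > best_ll:
                best_ll, best_p, best_eps = ll, p, eps
    return best_p, best_eps

if __name__ == "__main__":
    # Made-up read counts for ten individuals at one site.
    example = [(10, 10), (10, 10), (9, 10), (10, 10), (10, 10),
               (5, 10), (6, 10), (4, 10), (5, 10), (0, 10)]
    p_hat, eps_hat = estimate(example)
    print(f"p_hat={p_hat:.2f}, eps_hat={eps_hat:.3f}")
```

Because the likelihood sums over the three possible genotypes for each individual, no genotype has to be called outright and no coverage threshold has to be imposed, and the error rate is fitted from the same reads rather than supplied from an external source.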
