Automating sequence-based detection and genotyping of SNPs from diploid samples

The detection of sequence variation, for which DNA sequencing has emerged as the most sensitive and automated approach, forms the basis of all genetic analysis. Here we describe and illustrate an algorithm that accurately detects and genotypes SNPs from fluorescence-based sequence data. Because the algorithm focuses particularly on detecting SNPs through the identification of heterozygous individuals, it is especially well suited to the detection of SNPs in diploid samples obtained after DNA amplification. It is substantially more accurate than existing approaches and, notably, provides a useful quantitative measure of its confidence in each potential SNP detected and in each genotype called. Calls assigned the highest confidence are sufficiently reliable to remove the need for manual review in several contexts. For example, for sequence data from 47–90 individuals sequenced on both the forward and reverse strands, the highest-confidence calls from our algorithm detected 93% of all SNPs and 100% of high-frequency SNPs, with no false positive SNPs identified and 99.9% genotyping accuracy. This algorithm is implemented in a software package, PolyPhred version 5.0, which is freely available for academic use.
