Automated identification of single nucleotide polymorphisms from sequencing data

Single nucleotide polymorphisms (SNPs) provide abundant information about genetic variation. Large scale discovery of high frequency SNPs is being undertaken using various methods. However, the publicly available SNP data are not always accurate, and therefore should be verified. If only a particular gene locus is concerned, locus-specific polymerase chain reaction amplification may be useful. Problem of this method is that the secondary peak has to be measured. We have analyzed trace data from conventional sequencing equipment and found an applicable rule to discern SNPs from noise. We have developed software that integrates this function to automatically identify SNPs. The software works accurately for high quality sequences and also can detect SNPs in low quality sequences. Further, it can determine allele frequency, display this information as a bar graph and assign corresponding nucleotide combinations. It is very useful for identifying de novo SNPs in a DNA fragment of interest.

[1]  D. Nickerson,et al.  AmpliTaq DNA polymerase, FS dye-terminator sequencing: analysis of peak height patterns. , 1996, BioTechniques.

[2]  P Green,et al.  Base-calling of automated sequencer traces using phred. II. Error probabilities. , 1998, Genome research.

[3]  N. Blin,et al.  A general method for isolation of high molecular weight DNA from eukaryotes. , 1976, Nucleic acids research.

[4]  E. Lander The New Genomics: Global Views of Biology , 1996, Science.

[5]  F. Sanger,et al.  DNA sequencing with chain-terminating inhibitors. , 1977, Proceedings of the National Academy of Sciences of the United States of America.

[6]  M. Kendall,et al.  Kendall's advanced theory of statistics , 1995 .

[7]  Eric S. Lander,et al.  An SNP map of the human genome generated by reduced representation shotgun sequencing , 2000, Nature.

[8]  N Risch,et al.  The Future of Genetic Studies of Complex Human Diseases , 1996, Science.

[9]  P. Green,et al.  Base-calling of automated sequencer traces using phred. I. Accuracy assessment. , 1998, Genome research.

[10]  J. M. Prober,et al.  A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. , 1987, Science.

[11]  Lloyd M. Smith,et al.  Fluorescence detection in automated DNA sequence analysis , 1986, Nature.

[12]  C. R. Connell,et al.  DNA sequencing with dye-labeled terminators and T7 DNA polymerase: effect of dyes and dNTPs on incorporation of dye-terminators and probability analysis of termination fragments. , 1992, Nucleic acids research.

[13]  M. Degroot,et al.  Probability and Statistics , 2021, Examining an Operational Approach to Teaching Probability.

[14]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.