Automated SNP Detection in Expressed Sequence Tags: Statistical Considerations and Application to Maritime Pine Sequences

We developed an automated pipeline for the detection of single nucleotide polymorphisms (SNPs) in expressed sequence tag (EST) data sets by combining three DNA sequence analysis programs: Phred, Phrap and PolyBayes. This application requires access to the individual electropherogram traces. First, a reference set of 65 SNPs was obtained by sequencing 30 gametes in 13 maritime pine (Pinus pinaster Ait.) gene fragments (6671 bp), corresponding to a frequency of 1 SNP every 102.6 bp. Second, the parameters of the three programs were optimized to retrieve as many true SNPs as possible while keeping the rate of false positives as low as possible. Overall, the efficiency of detection of true SNPs was 83.1%. However, this rate varied widely with the frequency of the rare SNP allele: it fell to 41% for rare SNP alleles (frequency < 10%) and rose to 98% for allele frequencies above 10%. Third, the detection method was applied to the 18498 assembled maritime pine ESTs, which allowed the identification of a total of 1400 candidate SNPs in contigs containing between 4 and 20 sequence reads. These genetic resources, described for the first time in a forest tree species, were made available at http://www.pierroton.inra/genetics/Pinesnps. We also derived an analytical expression for the SNP detection probability as a function of the SNP allele frequency, the number of haploid genomes used to generate the EST sequence database, and the sample size of the contigs considered for SNP detection. The frequency of the SNP allele was shown to be the main factor influencing the probability of SNP detection.
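The dependence of detection on allele frequency, library size and contig depth can be illustrated with a small numerical sketch. It is not the analytical expression derived in the paper, which is not reproduced here. It assumes a simple two-stage sampling model: each haploid genome in the library carries the rare allele independently with probability p, reads in a contig are drawn with replacement from those genomes, and a SNP counts as detected when the rare allele appears in at least `min_reads` reads. The function names and the `min_reads` threshold are hypothetical, and sequencing error is ignored.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def detection_probability(p, n_reads, min_reads=2):
    """P(rare allele seen in >= min_reads of n_reads reads).

    Assumption: reads are independent draws with rare-allele frequency p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    upper = min(min_reads, n_reads + 1)
    missed = sum(binom_pmf(i, n_reads, p) for i in range(upper))
    return max(0.0, 1.0 - missed)


def detection_probability_finite(p, n_genomes, n_reads, min_reads=2):
    """Same quantity when the library holds only n_genomes haploid genomes.

    Assumption: the number of genomes carrying the rare allele is
    Binomial(n_genomes, p), and reads sample genomes with replacement.
    """
    total = 0.0
    for m in range(n_genomes + 1):
        weight = binom_pmf(m, n_genomes, p)
        total += weight * detection_probability(m / n_genomes, n_reads, min_reads)
    return total


if __name__ == "__main__":
    # Contig depths 4 and 20 match the range quoted in the abstract;
    # 30 genomes is an illustrative library size, not a value from the paper.
    for p in (0.05, 0.10, 0.30, 0.50):
        shallow = detection_probability_finite(p, 30, 4)
        deep = detection_probability_finite(p, 30, 20)
        print(f"p={p:.2f}  depth 4: {shallow:.3f}  depth 20: {deep:.3f}")
```

Under these assumptions the detection probability rises with the rare-allele frequency at every contig depth, which is in line with the abstract's statement that allele frequency is the main factor.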
