SNP-PHAGE – High throughput SNP discovery pipeline

Background
Single nucleotide polymorphisms (SNPs), as defined here, are single base sequence changes or short insertions/deletions between or within individuals of a given species. As a result of their abundance and the availability of high throughput analysis technologies, SNP markers have begun to replace traditional markers such as restriction fragment length polymorphisms (RFLPs), amplified fragment length polymorphisms (AFLPs) and simple sequence repeat (SSR or microsatellite) markers for fine mapping and association studies in several species. For SNP discovery from chromatogram data, several bioinformatics programs have to be combined to generate an analysis pipeline. Results have to be stored in a relational database to facilitate interrogation through queries or to generate data for further analyses such as determination of linkage disequilibrium and identification of common haplotypes. Although these tasks are routinely performed by several groups, an integrated open source SNP discovery pipeline that can be easily adapted by new groups interested in SNP marker development is currently unavailable.

Results
We developed SNP-PHAGE (SNP discovery Pipeline with additional features for identification of common haplotypes within a sequence tagged site (Haplotype Analysis) and GenBank (-dbSNP) submissions). This tool was applied to sequence traces from diverse soybean genotypes to discover over 10,000 SNPs. The package was developed on a UNIX/Linux platform, is written in Perl, and uses a MySQL database. Scripts to generate a user-friendly web interface are also provided, with common queries for preliminary data analysis. A machine learning tool developed by this group for increasing the efficiency of SNP discovery is integrated into the package as an optional feature. The SNP-PHAGE package is being made available open source at http://bfgl.anri.barc.usda.gov/ML/snp-phage/.

Conclusion
SNP-PHAGE provides a bioinformatics solution for high throughput SNP discovery, identification of common haplotypes within an amplicon, and GenBank (dbSNP) submissions. SNP selection and visualization are aided through a user-friendly web interface. This tool is useful for analyzing sequence tagged sites (STSs) of genomic sequences, and the software can serve as a starting point for groups interested in developing SNP markers.
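
To make concrete the idea of storing SNP calls in a relational database and querying them for common haplotypes within an STS, here is a minimal sketch. It is not SNP-PHAGE code: the package itself is written in Perl and uses MySQL, whereas this illustration uses Python with the standard-library sqlite3 module as a stand-in. The table name `snp_call`, its columns, the function names and the sample data are all hypothetical. It also assumes each genotype is homozygous, so a haplotype can be read as the alleles joined in position order.

```python
import sqlite3
from collections import Counter


def build_db(calls):
    """Load (sts, genotype, position, allele) rows into an in-memory table.

    The schema is a hypothetical illustration, not the SNP-PHAGE schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snp_call "
        "(sts TEXT, genotype TEXT, position INTEGER, allele TEXT)"
    )
    conn.executemany("INSERT INTO snp_call VALUES (?, ?, ?, ?)", calls)
    conn.commit()
    return conn


def haplotype_counts(conn, sts):
    """Count haplotypes within one STS.

    Assumes homozygous genotypes: each genotype's haplotype is its
    alleles concatenated in order of SNP position.
    """
    rows = conn.execute(
        "SELECT genotype, allele FROM snp_call "
        "WHERE sts = ? ORDER BY genotype, position",
        (sts,),
    )
    haplotypes = {}
    for genotype, allele in rows:
        haplotypes[genotype] = haplotypes.get(genotype, "") + allele
    return Counter(haplotypes.values())


# Made-up example data: three genotypes, two SNP positions in one STS.
calls = [
    ("STS1", "G1", 101, "A"), ("STS1", "G1", 250, "T"),
    ("STS1", "G2", 101, "A"), ("STS1", "G2", 250, "T"),
    ("STS1", "G3", 101, "G"), ("STS1", "G3", 250, "C"),
]
conn = build_db(calls)
counts = haplotype_counts(conn, "STS1")
print(counts.most_common())  # [('AT', 2), ('GC', 1)]
```

Keeping one row per call (STS, genotype, position, allele) lets the same table answer both per-SNP queries and per-amplicon haplotype summaries. Heterozygous or missing calls would need explicit handling that this sketch omits.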
