Application of machine learning in SNP discovery

Background

Single nucleotide polymorphisms (SNP) constitute more than 90% of genetic variation and hence can account for most trait differences among individuals of a given species. The polymorphism detection programs PolyBayes and PolyPhred produce many false positive SNP predictions even with stringent parameter values. We developed a machine learning (ML) method that augments PolyBayes to improve its prediction accuracy. ML methods have also been applied successfully to other bioinformatics problems, including the prediction of genes, promoters, transcription factor binding sites and protein structures.

Results

The ML program C4.5 was applied to a set of features to build a SNP classifier from training data labeled True or False by human experts. The training data were 27,275 candidate SNP generated by sequencing 1973 STS (sequence-tagged sites; 12 Mb) in both directions from 6 diverse homozygous soybean cultivars, followed by PolyBayes analysis. Test data of 18,390 candidate SNP were generated in the same way from 1359 additional STS (8 Mb). SNP from both sets were classified by experts. After training, the ML classifier agreed with the experts on 97.3% of the test data, compared with 7.8% agreement between PolyBayes and the experts. The PolyBayes positive predictive values (PPV, i.e., the fraction of candidate SNP that are real) were 7.8% for all predictions and 16.7% for those with a 100% posterior probability of being real. Using ML improved the PPV to 84.8%, a 5- to 10-fold increase. While ML and PolyBayes produced a similar number of true positives, the ML program generated only 249 false positives compared with 16,955 for PolyBayes. The complexity of the soybean genome may have contributed to the high number of false SNP predictions by PolyBayes, and hence results may differ for other genomes.

Conclusion

A machine learning (ML) method was developed as a supplement to polymorphism detection software to improve prediction accuracy. The results of this study indicate that a trained ML classifier can significantly reduce human intervention, and in this case achieved a 5- to 10-fold gain in productivity. The optimized feature set and ML framework can also be applied to all polymorphism discovery software. The ML support software is written in Perl and can be easily integrated into an existing SNP discovery pipeline.
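
The PPV figures in the abstract can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not the authors' Perl software. It assumes that every one of the 18,390 test candidates counts as a positive PolyBayes prediction, so that PolyBayes true positives equal candidates minus false positives. The ML true-positive count is not stated in the abstract; the value computed here is only what the reported 84.8% PPV and 249 false positives would imply. All function and variable names are illustrative.

```python
def ppv(true_positives: int, false_positives: int) -> float:
    """Positive predictive value: fraction of predicted SNP that are real."""
    predicted = true_positives + false_positives
    if predicted == 0:
        raise ValueError("PPV is undefined when there are no positive predictions")
    return true_positives / predicted


# Figures quoted in the abstract (test set).
CANDIDATES = 18_390        # candidate SNP produced by PolyBayes
POLYBAYES_FP = 16_955      # PolyBayes false positives
ML_FP = 249                # ML classifier false positives
ML_PPV_REPORTED = 0.848    # reported PPV with ML
POLYBAYES_PPV_100 = 0.167  # reported PPV at 100% posterior probability

# Assumption: all candidates are PolyBayes positive predictions.
polybayes_tp = CANDIDATES - POLYBAYES_FP
polybayes_ppv = ppv(polybayes_tp, POLYBAYES_FP)

# Fold improvement of the ML PPV over each PolyBayes PPV.
fold_all = ML_PPV_REPORTED / polybayes_ppv
fold_100 = ML_PPV_REPORTED / POLYBAYES_PPV_100

# Implied (not reported) ML true positives: solve PPV = TP / (TP + FP) for TP.
ml_tp_implied = round(ML_PPV_REPORTED * ML_FP / (1 - ML_PPV_REPORTED))

print(f"PolyBayes PPV, all predictions: {polybayes_ppv:.1%}")
print(f"Fold improvement vs all predictions: {fold_all:.1f}")
print(f"Fold improvement vs 100% posterior: {fold_100:.1f}")
print(f"Implied ML true positives: {ml_tp_implied}")
```

Under these assumptions the PolyBayes PPV comes out at about 7.8%, matching the reported value, and the two ratios fall near 5 and 11, in line with the stated 5- to 10-fold increase. The implied ML true-positive count is close to the PolyBayes count, consistent with the statement that both methods produced a similar number of true positives.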
