Prediction of Binding Sites in the Mouse Genome Using Support Vector Machines

Computational prediction of cis-regulatory binding sites is widely acknowledged to be a difficult task. Many algorithms for finding binding sites are in current use, but most of them produce a high rate of false positive predictions. Moreover, many algorithmic approaches are inherently constrained in the range of binding sites that they can be expected to predict reliably. We propose using support vector machines (SVMs) to predict binding sites from multiple sources of evidence. To deal with the imbalanced nature of the data, we combine random under-sampling with the synthetic minority over-sampling technique (SMOTE). In addition, we remove some of the final predicted binding sites on the basis of their biological plausibility. The results show that the resulting prediction significantly improves on the performance of any one of the individual prediction algorithms.
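
The resampling step described in the abstract can be illustrated with a minimal sketch. It assumes each sample is a plain list of real-valued features, with the majority class standing for non-binding positions and the minority class for binding positions. The function names (`undersample`, `smote`, `rebalance`) and the parameters `n_majority`, `n_synthetic`, and `k` are hypothetical choices made for illustration; the abstract does not give the authors' settings, and SVM training on the rebalanced data is omitted here.

```python
import random


def undersample(majority, n_keep, rng):
    """Randomly keep n_keep majority-class vectors, without replacement."""
    return rng.sample(majority, min(n_keep, len(majority)))


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority vectors, SMOTE style.

    Each synthetic vector lies on the line segment between a randomly
    chosen minority vector and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority vectors by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def rebalance(majority, minority, n_majority, n_synthetic, k=5, seed=0):
    """Combine random under-sampling of the majority class with SMOTE.

    Returns (X, y), where label 0 marks majority and label 1 marks minority.
    """
    rng = random.Random(seed)
    kept = undersample(majority, n_majority, rng)
    extra = smote(minority, n_synthetic, k, rng)
    X = kept + minority + extra
    y = [0] * len(kept) + [1] * (len(minority) + len(extra))
    return X, y


if __name__ == "__main__":
    # Toy data only: 100 majority vectors and 4 minority vectors.
    majority = [[float(i), float(i % 7)] for i in range(100)]
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    X, y = rebalance(majority, minority, n_majority=20, n_synthetic=6, k=2)
    print(len(X), y.count(0), y.count(1))
```

In a sketch like this, the rebalanced `X` and `y` would then be passed to an SVM trainer. Resampling should be applied to the training data only, so that the test set keeps the original class distribution.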
