Improving Transcription Factor Binding Site Predictions by Using Randomised Negative Examples

Much of the genetic change underlying morphological evolution is known to take place in cis-regulatory regions rather than in the coding regions of genes. Identifying these sites in a genome is a non-trivial problem. Experimental methods for finding binding sites exist, but they are limited in applicability, accuracy, availability or cost. Computational prediction algorithms, on the other hand, perform rather poorly. The aim of this research is to develop and improve computational approaches for the prediction of transcription factor binding sites (TFBSs) by integrating the results of computational algorithms with other sources of complementary biological evidence, with particular emphasis on the use of the Support Vector Machine (SVM). Data from two organisms, yeast and mouse, were used in this study. The initial results were not particularly encouraging: the predictions were still of low quality. However, when the vectors labelled as non-binding sites in the training set were replaced by randomised training vectors, a significant improvement in performance was observed. The improvement was substantial for the yeast data and even greater for the mouse data. The resulting classifier found over 80% of the binding sites in the test set, and 80% of its predictions were correct.
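The two ideas in the abstract, replacing negative training vectors with randomised ones and reporting both the fraction of binding sites found and the fraction of predictions that are correct, can be illustrated with a minimal sketch. The abstract does not say how the randomised vectors were generated, so the scheme below (each feature drawn independently from the values observed for that feature in the training set) is an assumption, and the function names `randomise_negatives` and `recall_and_precision` are hypothetical. The SVM training step itself is omitted.

```python
import random


def randomise_negatives(vectors, labels, seed=0):
    """Return a copy of the training set in which every vector labelled 0
    (non-binding site) is replaced by a randomised vector.

    ASSUMPTION: each feature of a replacement vector is drawn independently
    from the values observed for that feature anywhere in the training set.
    The abstract does not specify the randomisation scheme.
    """
    rng = random.Random(seed)
    columns = list(zip(*vectors))  # one tuple of observed values per feature
    new_vectors = []
    for vec, label in zip(vectors, labels):
        if label == 1:
            new_vectors.append(list(vec))  # keep binding-site vectors as-is
        else:
            new_vectors.append([rng.choice(col) for col in columns])
    return new_vectors, list(labels)


def recall_and_precision(true_labels, predicted_labels):
    """Recall: fraction of true binding sites that were found.
    Precision: fraction of predicted binding sites that are correct."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

In these terms, "finding over 80% of the binding sites" corresponds to recall, and "80% of the predictions were correct" corresponds to precision; any classifier's test-set predictions could be passed to `recall_and_precision` to obtain both figures.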
