Ensemble Machine Methods for DNA Binding

We introduce three ensemble machine learning methods for analysis of biological DNA binding by transcription factors (TFs). The goal is to identify both TF target genes and their binding motifs. Subspace-valued weak learners (formed from an ensemble of different motif finding algorithms) combine candidate motifs as probability weight matrices (PWM), which are then translated into subspaces of a DNA k-mer (string) feature space. Assessing and then integrating highly informative subspaces by machine methods gives more reliable target classification and motif prediction. We compare these target identification methods with probability weight matrix (PWM) rescanning and use of support vector machines on the full k-mer space of the yeast S. cerevisiae. This method, SVMotif-PWM, can significantly improve accuracy in computational identification of TF targets. The software is publicly available at http://cagt10.bu.edu/SVMotif .

[1]  E. Fraenkel,et al.  WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches , 2007, Environmental health perspectives.

[2]  Ernest Fraenkel,et al.  WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches , 2007, Nucleic Acids Res..

[3]  Y. Freund,et al.  Profile-based string kernels for remote homology detection and motif extraction. , 2005, Journal of bioinformatics and computational biology.

[4]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[5]  Gary D. Stormo,et al.  Identification of consensus patterns in unaligned DNA sequences known to be functionally related , 1990, Comput. Appl. Biosci..

[6]  Nello Cristianini,et al.  An introduction to Support Vector Machines , 2000 .

[7]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[8]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[9]  Ting Wang,et al.  An improved map of conserved regulatory sites for Saccharomyces cerevisiae , 2006, BMC Bioinformatics.

[10]  Charles DeLisi,et al.  Machine learning methods for transcription data integration , 2006, IBM J. Res. Dev..

[11]  Shane T. Jensen,et al.  BioOptimizer: a Bayesian scoring function approach to motif discovery , 2004, Bioinform..

[12]  Liming Cai,et al.  BEST: Binding-site Estimation Suite of Tools , 2005, Bioinform..

[13]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[14]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[15]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[16]  Ernest Fraenkel,et al.  TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs , 2005, Bioinform..

[17]  Charles DeLisi,et al.  SVMotif: A Machine Learning Motif Algorithm , 2007, ICMLA 2007.

[18]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[19]  Z. Weng,et al.  Detection of functional DNA motifs via statistical over-representation. , 2004, Nucleic acids research.

[20]  Bernd Wachmann Technologies and Solutions for Trend Detection in Public Literature for Biomarker Discovery , 2007, ICMLA 2007.

[21]  Holger Karas,et al.  TRANSFAC: a database on transcription factors and their DNA binding sites , 1996, Nucleic Acids Res..

[22]  Nicola J. Rinaldi,et al.  Transcriptional Regulatory Networks in Saccharomyces cerevisiae , 2002, Science.

[23]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[24]  Terrence S. Furey,et al.  The UCSC Genome Browser Database , 2003, Nucleic Acids Res..

[25]  Xuegong Zhang,et al.  Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data , 2006, BMC Bioinformatics.

[26]  Bin Li,et al.  Limitations and potentials of current motif discovery algorithms , 2005, Nucleic acids research.

[27]  Wilfred W. Li,et al.  MEME: discovering and analyzing DNA and protein sequence motifs , 2006, Nucleic Acids Res..

[28]  Nak-Kyeong Kim,et al.  Adding sequence context to a Markov background model improves the identification of regulatory elements , 2006, Bioinform..

[29]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.