MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders

MOTIVATION Locating transcription factor binding sites (motifs) is a key step in understanding gene regulation. Based on Tompa's benchmark study, the performance of current de novo motif finders is far from satisfactory (with sensitivity <or=0.222 and precision <or=0.307). The same study also shows that no motif finder performs consistently well over all datasets. Hence, it is not clear which finder one should use for a given dataset. To address this issue, a class of algorithms called ensemble methods have been proposed. Though the existing ensemble methods overall perform better than stand-alone motif finders, the improvement gained is not substantial. Our study reveals that these methods do not fully exploit the information obtained from the results of individual finders, resulting in minor improvement in sensitivity and poor precision. RESULTS In this article, we identify several key observations on how to utilize the results from individual finders and design a novel ensemble method, MotifVoter, to predict the motifs and binding sites. Evaluations on 186 datasets show that MotifVoter can locate more than 95% of the binding sites found by its component motif finders. In terms of sensitivity and precision, MotifVoter outperforms stand-alone motif finders and ensemble methods significantly on Tompa's benchmark, Escherichia coli, and ChIP-Chip datasets. MotifVoter is available online via a web server with several biologist-friendly features.

[1]  E. van Nimwegen Finding regulatory elements and regulatory motifs: a general probabilistic framework , 2007, BMC bioinformatics.

[2]  Ernest Fraenkel,et al.  WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches , 2007, Nucleic Acids Res..

[3]  Ernest Fraenkel,et al.  TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs , 2005 .

[4]  Daisuke Kihara,et al.  EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences , 2006, BMC Bioinformatics.

[5]  Richard G. Jenner,et al.  Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Panayiotis V. Benos,et al.  DNA Familial Binding Profiles Made Easy: Comparison of Various Motif Alignment and Clustering Strategies , 2007, PLoS Comput. Biol..

[7]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[8]  E. Fraenkel,et al.  WebMOTIFS: automated discovery, filtering and scoring of DNA sequence motifs using multiple programs and Bayesian approaches , 2007, Environmental health perspectives.

[9]  G. Stormo,et al.  ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[10]  W. J. Kent,et al.  Environmentally Induced Foregut Remodeling by PHA-4/FoxA and DAF-12/NHR , 2004, Science.

[11]  Siu-Ming Yiu,et al.  Detection of generic spaced motifs using submotif pattern mining , 2007, Bioinform..

[12]  T. Volkert,et al.  E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. , 2002, Genes & development.

[13]  Eleazar Eskin,et al.  Finding composite regulatory patterns in DNA sequences , 2002, ISMB.

[14]  Hsien-Da Huang,et al.  Identifying transcriptional regulatory sites in the human genome using an integrated system. , 2004, Nucleic acids research.

[15]  Adam A. Margolin,et al.  NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth , 2006, Proceedings of the National Academy of Sciences.

[16]  Julio Collado-Vides,et al.  RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12 , 2004, Nucleic Acids Res..

[17]  Bin Li,et al.  Limitations and potentials of current motif discovery algorithms , 2005, Nucleic acids research.

[18]  Kathleen Marchal,et al.  A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling , 2001, Bioinform..

[19]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[20]  Charles Kooperberg,et al.  Global and gene‐specific analyses show distinct roles for Myod and Myog at a common set of promoters , 2006, The EMBO journal.

[21]  Ernest Fraenkel,et al.  Practical Strategies for Discovering Regulatory DNA Sequence Motifs , 2006, PLoS Comput. Biol..

[22]  H. K. Dai,et al.  A survey of DNA motif finding algorithms , 2007, BMC Bioinformatics.

[23]  Martha L. Bulyk,et al.  Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data , 2006, BMC Bioinformatics.

[24]  Li Cai,et al.  Measuring similarities between gene expression profiles through new data transformations , 2007, BMC Bioinformatics.

[25]  Megan F. Cole,et al.  Core Transcriptional Regulatory Circuitry in Human Embryonic Stem Cells , 2005, Cell.

[26]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[27]  Nicola J. Rinaldi,et al.  Control of Pancreas and Liver Gene Expression by HNF Transcription Factors , 2004, Science.

[28]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[29]  Robert C. Edgar,et al.  MUSCLE: multiple sequence alignment with high accuracy and high throughput. , 2004, Nucleic acids research.

[30]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[31]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[32]  Shane T. Jensen,et al.  BioOptimizer: a Bayesian scoring function approach to motif discovery , 2004, Bioinform..

[33]  Liming Cai,et al.  BEST: Binding-site Estimation Suite of Tools , 2005, Bioinform..

[34]  Richard G. Jenner,et al.  Coordinated binding of NF-kappaB family members in the response of human cells to lipopolysaccharide. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[35]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[36]  Ernest Fraenkel,et al.  TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs , 2005, Bioinform..