EMD: an ensemble algorithm for discovering regulatory motifs in DNA sequences

BackgroundUnderstanding gene regulatory networks has become one of the central research problems in bioinformatics. More than thirty algorithms have been proposed to identify DNA regulatory sites during the past thirty years. However, the prediction accuracy of these algorithms is still quite low. Ensemble algorithms have emerged as an effective strategy in bioinformatics for improving the prediction accuracy by exploiting the synergetic prediction capability of multiple algorithms.ResultsWe proposed a novel clustering-based ensemble algorithm named EMD for de novo motif discovery by combining multiple predictions from multiple runs of one or more base component algorithms. The ensemble approach is applied to the motif discovery problem for the first time. The algorithm is tested on a benchmark dataset generated from E. coli RegulonDB. The EMD algorithm has achieved 22.4% improvement in terms of the nucleotide level prediction accuracy over the best stand-alone component algorithm. The advantage of the EMD algorithm is more significant for shorter input sequences, but most importantly, it always outperforms or at least stays at the same performance level of the stand-alone component algorithms even for longer sequences.ConclusionWe proposed an ensemble approach for the motif discovery problem by taking advantage of the availability of a large number of motif discovery programs. We have shown that the ensemble approach is an effective strategy for improving both sensitivity and specificity, thus the accuracy of the prediction. The advantage of the EMD algorithm is its flexibility in the sense that a new powerful algorithm can be easily added to the system.

[1]  Arne Elofsson,et al.  3D-Jury: A Simple Approach to Improve Protein Structure Predictions , 2003, Bioinform..

[2]  Mathieu Blanchette,et al.  Algorithms for phylogenetic footprinting , 2001, RECOMB.

[3]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[4]  Daniel Fischer,et al.  3D‐SHOTGUN: A novel, cooperative, fold‐recognition meta‐predictor , 2003, Proteins.

[5]  Leszek Rychlewski,et al.  Detection of reliable and unexpected protein fold predictions using 3D-Jury , 2003, Nucleic Acids Res..

[6]  Bonnie Berger,et al.  Methods in Comparative Genomics: Genome Correspondence, Gene Identification and Regulatory Motif Discovery , 2004, J. Comput. Biol..

[7]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[8]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[9]  Julio Collado-Vides,et al.  RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12 , 2004, Nucleic Acids Res..

[10]  A. D. McLachlan,et al.  Profile analysis: detection of distantly related proteins. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Tao Jiang,et al.  Identifying transcription factor binding sites through Markov chain optimization , 2002, ECCB.

[12]  Bin Li,et al.  Limitations and potentials of current motif discovery algorithms , 2005, Nucleic acids research.

[13]  Ting Wang,et al.  Combining phylogenetic data with co-regulated genes to identify regulatory motifs , 2003, Bioinform..

[14]  Mona Singh,et al.  Comparative analysis of methods for representing and searching for transcription factor binding sites , 2004, Bioinform..

[15]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[16]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[17]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[18]  Harpreet Kaur Saini,et al.  BIOINFORMATICS APPLICATIONS NOTE Structural bioinformatics Meta-DP: domain prediction meta-server , 2022 .

[19]  Kenta Nakai,et al.  MELINA: motif extraction from promoter regions of potentially co-regulated genes , 2003, Bioinform..

[20]  Michael R. Green,et al.  Dissecting the Regulatory Circuitry of a Eukaryotic Genome , 1998, Cell.

[21]  Lisa N Kinch,et al.  CASP5 assessment of fold recognition target predictions , 2003, Proteins.

[22]  Ceslovas Venclovas,et al.  Assessment of progress over the CASP experiments , 2003, Proteins.

[23]  Jun S. Liu,et al.  Integrating regulatory motif discovery and genome-wide expression analysis , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[24]  I. Jonassen,et al.  Predicting gene regulatory elements in silico on a genomic scale. , 1998, Genome research.

[25]  Anna Tramontano,et al.  Assessment of homology‐based predictions in CASP5 , 2003, Proteins.

[26]  Vasant Honavar,et al.  Predicting binding sites of hydrolase-inhibitor complexes by combining several methods , 2004, BMC Bioinformatics.

[27]  Peter J. Myler,et al.  Importing statistical measures into Artemis enhances gene identification in the Leishmania genome project , 2003, BMC Bioinformatics.

[28]  Kathleen Marchal,et al.  A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes , 2001, RECOMB.

[29]  G. von Heijne,et al.  Prediction of partial membrane protein topologies using a consensus approach , 2002, Protein science : a publication of the Protein Society.

[30]  Susumu Goto,et al.  The KEGG resource for deciphering the genome , 2004, Nucleic Acids Res..

[31]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[32]  K Nishikawa [Prediction of protein secondary structure by a new joint method]. , 1990, Seikagaku. The Journal of Japanese Biochemical Society.

[33]  Mathieu Blanchette,et al.  PhyME: A probabilistic algorithm for finding motifs in sets of orthologous sequences , 2004, BMC Bioinformatics.

[34]  J Lundström,et al.  Pcons: A neural‐network–based consensus predictor that improves fold recognition , 2001, Protein science : a publication of the Protein Society.

[35]  A. Sandelin,et al.  Applied bioinformatics for the identification of regulatory elements , 2004, Nature Reviews Genetics.

[36]  M. Blanchette,et al.  Discovery of regulatory elements by a computational method for phylogenetic footprinting. , 2002, Genome research.

[37]  Giorgio Valle,et al.  Simple consensus procedures are effective and sufficient in secondary structure prediction. , 2003, Protein engineering.

[38]  Richard A Young,et al.  Deciphering gene expression regulatory networks. , 2002, Current opinion in genetics & development.

[39]  C. Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Machine Learning.