A novel ensemble learning method for de novo computational identification of DNA binding sites

BackgroundDespite the diversity of motif representations and search algorithms, the de novo computational identification of transcription factor binding sites remains constrained by the limited accuracy of existing algorithms and the need for user-specified input parameters that describe the motif being sought.ResultsWe present a novel ensemble learning method, SCOPE, that is based on the assumption that transcription factor binding sites belong to one of three broad classes of motifs: non-degenerate, degenerate and gapped motifs. SCOPE employs a unified scoring metric to combine the results from three motif finding algorithms each aimed at the discovery of one of these classes of motifs. We found that SCOPE's performance on 78 experimentally characterized regulons from four species was a substantial and statistically significant improvement over that of its component algorithms. SCOPE outperformed a broad range of existing motif discovery algorithms on the same dataset by a statistically significant margin.ConclusionSCOPE demonstrates that combining multiple, focused motif discovery algorithms can provide a significant gain in performance. By building on components that efficiently search for motifs without user-defined parameters, SCOPE requires as input only a set of upstream sequences and a species designation, making it a practical choice for non-expert users. A user-friendly web interface, Java source code and executables are available at http://genie.dartmouth.edu/scope.

[1]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..

[2]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[3]  Ian Witten,et al.  Data Mining , 2000 .

[4]  Saurabh Sinha,et al.  Performance comparison of algorithms for finding transcription factor binding sites , 2003, Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings..

[5]  Massimo Vergassola,et al.  Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo , 2002, BMC Bioinformatics.

[6]  D. Guhathakurta,et al.  Computational identification of transcriptional regulatory elements in DNA sequence , 2006, Nucleic acids research.

[7]  W. J. Kent,et al.  Environmentally Induced Foregut Remodeling by PHA-4/FoxA and DAF-12/NHR , 2004, Science.

[8]  A. Sandelin,et al.  Applied bioinformatics for the identification of regulatory elements , 2004, Nature Reviews Genetics.

[9]  Gary D. Stormo,et al.  DNA binding sites: representation and discovery , 2000, Bioinform..

[10]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[11]  G. Stormo,et al.  Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods. , 2002, Genome research.

[12]  C. Vinson,et al.  Clustering of DNA sequences in human promoters. , 2004, Genome research.

[13]  Nicola J. Rinaldi,et al.  Transcriptional regulatory code of a eukaryotic genome , 2004, Nature.

[14]  Robert H. Gross,et al.  Bounded search for de novo identification of degenerate cis-regulatory elements , 2006, BMC Bioinformatics.

[15]  J. T. Kadonaga,et al.  The RNA polymerase II core promoter. , 2003, Annual review of biochemistry.

[16]  Eleazar Eskin,et al.  Finding composite regulatory patterns in DNA sequences , 2002, ISMB.

[17]  Robert H. Gross,et al.  BEAM: A Beam Search Algorithm for the Identification of Cis-Regulatory Elements in Groups of Genes , 2006, J. Comput. Biol..

[18]  Jacques van Helden,et al.  Regulatory Sequence Analysis Tools , 2003, Nucleic Acids Res..

[19]  Tatsuya Akutsu,et al.  Finding optimal degenerate patterns in DNA sequences , 2003, ECCB.

[20]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[21]  Jeremy Buhler,et al.  Finding motifs using random projections , 2001, RECOMB.

[22]  Marc S Halfon,et al.  Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model. , 2002, Genome research.

[23]  S. Salzberg,et al.  Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura , 2004, Genome Biology.

[24]  J. Collado-Vides,et al.  Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. , 2000, Nucleic acids research.

[25]  Saurabh Sinha,et al.  YMF: a program for discovery of novel transcription factor binding sites by statistical overrepresentation , 2003, Nucleic Acids Res..

[26]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[27]  Robert H. Gross,et al.  SPACER: identification of cis-regulatory elements with non-contiguous critical residues , 2007, Bioinform..

[28]  C. Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Machine Learning.

[29]  William H. Press,et al.  Numerical recipes in C , 2002 .

[30]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[31]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[32]  David H. Wolpert,et al.  No free lunch theorems for optimization , 1997, IEEE Trans. Evol. Comput..

[33]  Y. Ho,et al.  Simple Explanation of the No-Free-Lunch Theorem and Its Implications , 2002 .

[34]  D. Wolpert,et al.  No Free Lunch Theorems for Search , 1995 .

[35]  Graziano Pesole,et al.  An algorithm for finding signals of unknown length in DNA sequences , 2001, ISMB.

[36]  Dmitri A. Papatsenko,et al.  Statistical extraction of Drosophila cis-regulatory modules using exhaustive assessment of local word frequency , 2003, BMC Bioinformatics.

[37]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[38]  J. Collado-Vides,et al.  Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. , 1998, Journal of molecular biology.

[39]  Bin Li,et al.  Limitations and potentials of current motif discovery algorithms , 2005, Nucleic acids research.

[40]  Kathleen Marchal,et al.  A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling , 2001, Bioinform..

[41]  Michael Q. Zhang,et al.  SCPD: a promoter database of the yeast Saccharomyces cerevisiae , 1999, Bioinform..

[42]  A Klug,et al.  Selection of DNA binding sites for zinc fingers using rationally randomized DNA reveals coded interactions. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[43]  G. Rubin,et al.  Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[44]  Ernest Fraenkel,et al.  Practical Strategies for Discovering Regulatory DNA Sequence Motifs , 2006, PLoS Comput. Biol..