BioOptimizer: a Bayesian scoring function approach to motif discovery

MOTIVATION Transcription factors (TFs) bind directly to short segments on the genome, often within hundreds to thousands of base pairs upstream of gene transcription start sites, to regulate gene expression. The experimental determination of TFs binding sites is expensive and time-consuming. Many motif-finding programs have been developed, but no program is clearly superior in all situations. Practitioners often find it difficult to judge which of the motifs predicted by these algorithms are more likely to be biologically relevant. RESULTS We derive a comprehensive scoring function based on a full Bayesian model that can handle unknown site abundance, unknown motif width and two-block motifs with variable-length gaps. An algorithm called BioOptimizer is proposed to optimize this scoring function so as to reduce noise in the motif signal found by any motif-finding program. The accuracy of BioOptimizer, which can be used in conjunction with several existing programs, is shown to be superior to using any of these motif-finding programs alone when evaluated by both simulation studies and application to sets of co-regulated genes in bacteria. In addition, this scoring function formulation enables us to compare objectively different predicted motifs and select the optimal ones, effectively combining the strengths of existing programs. AVAILABILITY BioOptimizer is available for download at www.fas.harvard.edu/~junliu/BioOptimizer/

[1]  Patrick Eichenberger,et al.  Genome-Wide Analysis of the Stationary-Phase Sigma Factor (Sigma-H) Regulon of Bacillus subtilis , 2002, Journal of bacteriology.

[2]  Jun S. Liu,et al.  Bayesian Models for Multiple Local Sequence Alignment and Gibbs Sampling Strategies , 1995 .

[3]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[4]  Shane T. Jensen,et al.  Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective , 2004 .

[5]  H. Bussemaker,et al.  Regulatory element detection using correlation with expression , 2001, Nature Genetics.

[6]  Shane T. Jensen,et al.  The sigmaE regulon and the identification of additional sporulation genes in Bacillus subtilis. , 2003, Journal of molecular biology.

[7]  Michael B. Eisen,et al.  Identification of regulatory elements using a feature selection method , 2002, Bioinform..

[8]  A. A. Reilly,et al.  An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences , 1990, Proteins.

[9]  J. Hoch,et al.  The SpoOA protein of Bacillus subtilis is a repressor of the abrB gene. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[10]  J. Helmann,et al.  RNA Polymerase and Sigma Factors , 2002 .

[11]  G. Church,et al.  Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation , 1998, Nature Biotechnology.

[12]  G. Stormo,et al.  Identifying protein-binding sites from unaligned DNA fragments. , 1989, Proceedings of the National Academy of Sciences of the United States of America.

[13]  M. Eisen,et al.  Supervised Detection of Regulatory Motifs in DNA Sequences , 2003, Statistical applications in genetics and molecular biology.

[14]  Shane T. Jensen,et al.  The Spo0A regulon of Bacillus subtilis , 2003, Molecular microbiology.

[15]  References , 1971 .

[16]  Jun S. Liu,et al.  The Collapsed Gibbs Sampler in Bayesian Computations with Applications to a Gene Regulation Problem , 1994 .

[17]  Douglas L. Brutlag,et al.  BioProspector: Discovering Conserved DNA Motifs in Upstream Regulatory Regions of Co-Expressed Genes , 2000, Pacific Symposium on Biocomputing.

[18]  Jun S. Liu,et al.  An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments , 2002, Nature Biotechnology.

[19]  Jun S. Liu,et al.  Integrating regulatory motif discovery and genome-wide expression analysis , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[20]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[21]  R. A. Leibler,et al.  On Information and Sufficiency , 1951 .

[22]  Jun S. Liu,et al.  Gibbs motif sampling: Detection of bacterial outer membrane protein repeats , 1995, Protein science : a publication of the Protein Society.