On counting position weight matrix matches in a sequence, with application to discriminative motif finding

MOTIVATION AND RESULTS: The position weight matrix (PWM) is a popular method to model transcription factor binding sites. A fundamental problem in cis-regulatory analysis is to "count" the occurrences of a PWM in a DNA sequence. We propose a novel probabilistic score to solve this problem of counting PWM occurrences. The proposed score has two important properties: (1) It gives appropriate weights to both strong and weak occurrences of the PWM, without using thresholds. (2) For any given PWM, this score can be computed in a statistically sound framework while allowing for occurrences of other PWMs that are known a priori. Additionally, the score is efficiently differentiable with respect to the PWM parameters, which has important consequences for designing search algorithms.

The second problem we address is to find, ab initio, PWMs that have high counts in one set of sequences and low counts in another. We develop a novel algorithm to solve this "discriminative motif-finding problem", using the proposed score for counting a PWM in the sequences. The algorithm is a local search technique that exploits derivative information on an objective function to enhance speed and performance. It is extensively tested on synthetic data and shown to perform better than other discriminative as well as non-discriminative PWM-finding algorithms. It is then applied to cis-regulatory modules involved in development of the fruit fly embryo, to elicit known and novel motifs. Finally, we use the algorithm on genes predictive of social behavior in the honey bee, and find interesting motifs.

AVAILABILITY: The program is available upon request from the author.
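To make the idea of threshold-free counting concrete, the sketch below computes a simple "soft count" of a PWM in a sequence. It is an illustration only and not the score proposed in the paper: it assumes a uniform background model, a hypothetical per-window prior `prior`, and it treats overlapping windows independently, which the paper's probabilistic framework need not do. The function name `soft_pwm_count` and the list-of-dicts PWM layout are likewise assumptions made for this example.

```python
BASES = "ACGT"


def soft_pwm_count(seq, pwm, prior=0.01, background=None):
    """Sum, over all windows, the posterior probability that the window
    was emitted by the PWM rather than by the background model.

    seq        -- DNA string over A, C, G, T
    pwm        -- list of dicts, one per motif column, mapping base -> probability
    prior      -- assumed prior probability that a window is a motif occurrence
    background -- dict mapping base -> probability (uniform if omitted)
    """
    if background is None:
        background = {b: 0.25 for b in BASES}
    width = len(pwm)
    total = 0.0
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        p_motif = 1.0
        p_bg = 1.0
        for column, base in zip(pwm, window):
            p_motif *= column[base]
            p_bg *= background[base]
        # Strong matches contribute close to 1, weak matches a small
        # fraction; no hard cutoff is applied anywhere.
        total += prior * p_motif / (prior * p_motif + (1.0 - prior) * p_bg)
    return total
```

Because every window contributes a value between 0 and 1, a sequence with one strong site and a sequence with several weak sites can both receive a non-trivial count, which is the qualitative behavior property (1) describes. Each term is also a smooth function of the PWM entries, so a count of this general form can be differentiated with respect to them.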
