Prediction of cis-Regulatory Elements: from High-Information Content Analysis to Motif Identification

One popular approach to predicting the binding motifs of transcription factors is to model the problem as a search for a group of l-mers (motifs), for some l > 0, one from each of the provided promoter regions of a group of co-expressed genes, that exhibit high information content when aligned without gaps. In our current work, we assume that these desired l-mers have evolved from a common ancestor, each differing from that ancestor by mutations in at most k positions, where k is substantially smaller than l. This implies that these l-mers belong to the k-neighborhood of their common ancestor, measured in terms of Hamming distance. If the ancestor is given, the problem of finding these l-mers becomes trivial. Unfortunately, identifying the unknown ancestor is probably as hard as predicting the motifs themselves. Our goal is to identify a set of l-mers that may slightly exceed the k-neighborhood of a putative ancestor but captures all the desired motifs, which leads to an efficient way of identifying the desired motifs. The main contributions of this paper are fourfold: (a) we have derived nontrivial lower and upper bounds on the information content of a set of l-mers that differ from an unknown ancestor in no more than k positions; (b) we have defined a new distance between two sequences and, based on it, a k-pseudo-neighborhood that contains the Hamming-distance k-neighborhood of the yet-to-be-identified ancestor; (c) we have developed an algorithm, implemented as the software program PROMOCO, that minimizes the sum of the distances, under the new distance, between a predicted ancestor motif and a group of l-mers from the provided promoter regions; and (d) we have tested PROMOCO and compared its prediction performance with that of two other prediction programs. PROMOCO has been used to find all conserved motifs in a set of provided promoter sequences.
Our preliminary application of PROMOCO shows that its prediction results are better than or comparable to those of popular programs for identifying cis-regulatory binding motifs. A limitation of the algorithm is that it does not work well when the set of provided promoter sequences is too small or when the desired motifs appear in only a small portion of the given sequences.
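
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not PROMOCO and does not implement the paper's new distance or k-pseudo-neighborhood, which are not defined here. It assumes the common relative-entropy definition of information content with a uniform background over A, C, G, T, and all function names (hamming, in_k_neighborhood, information_content, closest_lmer) are hypothetical. The last function shows why the problem is easy once the ancestor is known: each promoter is simply scanned for its closest l-mer.

```python
import math
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def in_k_neighborhood(lmers, ancestor: str, k: int) -> bool:
    """True if every l-mer is within Hamming distance k of the ancestor."""
    return all(hamming(m, ancestor) <= k for m in lmers)


def information_content(lmers, background=None) -> float:
    """Information content (bits) of a gapless alignment of l-mers.

    Assumption: sum over columns of the relative entropy between the
    observed base frequencies and a background (uniform by default).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    n = len(lmers)
    total = 0.0
    for j in range(len(lmers[0])):
        counts = Counter(m[j] for m in lmers)
        for base, c in counts.items():
            f = c / n
            total += f * math.log2(f / background[base])
    return total


def closest_lmer(promoter: str, ancestor: str) -> str:
    """With a known ancestor, pick the promoter's l-mer nearest to it."""
    l = len(ancestor)
    windows = (promoter[i:i + l] for i in range(len(promoter) - l + 1))
    return min(windows, key=lambda w: hamming(w, ancestor))


if __name__ == "__main__":
    ancestor = "ACGT"  # toy ancestor, chosen for illustration only
    promoters = ["TTACGATT", "GGACCTGG", "CCTCGTCC"]  # toy sequences
    motifs = [closest_lmer(p, ancestor) for p in promoters]
    print(motifs, in_k_neighborhood(motifs, ancestor, k=1))
    print(round(information_content(motifs), 3))
```

A fully conserved column contributes 2 bits under the uniform background, so a set of identical l-mers of length l scores 2l bits, and each mutation away from the ancestor lowers the score. This is the relationship that bounds on information content for a k-neighborhood make precise.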
