Apples to apples: improving the performance of motif finders and their significance analysis in the Twilight Zone

MOTIVATION Effective algorithms for finding relatively weak motifs are an important practical necessity while scanning long DNA sequences for regulatory elements. The success of such an algorithm hinges on the ability of its scoring function combined with a significance analysis test to discern real motifs from random noise. RESULTS In the first half of the paper we show that the paradigm of relying on entropy scores and their E-values can lead to undesirable results when searching for weak motifs and we offer alternate approaches to analyzing the significance of motifs. In the second half of the paper we reintroduce a scoring function and present a motif-finder that optimizes it that are more effective in finding relatively weak motifs than other tools. AVAILABILITY The GibbsILR motif finder is available at http://www.cs.cornell.edu/~keich.

[1]  Gary D. Stormo,et al.  DNA binding sites: representation and discovery , 2000, Bioinform..

[2]  Eleazar Eskin,et al.  From profiles to patterns and back again: a branch and bound algorithm for finding near optimal motif profiles , 2004, RECOMB.

[3]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[4]  Jun S. Liu,et al.  Gibbs motif sampling: Detection of bacterial outer membrane protein repeats , 1995, Protein science : a publication of the Protein Society.

[5]  Charles Elkan,et al.  The Value of Prior Knowledge in Discovering Motifs with MEME , 1995, ISMB.

[6]  Uri Keich,et al.  U Subtle motifs: defining the limits of motif finding algorithms , 2002, Bioinform..

[7]  William Stafford Noble,et al.  Assessing computational tools for the discovery of transcription factor binding sites , 2005, Nature Biotechnology.

[8]  S. Karlin,et al.  Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. , 1990, Proceedings of the National Academy of Sciences of the United States of America.

[9]  G. Church,et al.  Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. , 2000, Journal of molecular biology.

[10]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[11]  Gary D. Stormo,et al.  Identifying DNA and protein patterns with statistically significant alignments of multiple sequences , 1999, Bioinform..

[12]  Sriram Ramabhadran,et al.  Finding subtle motifs by branching from sample strings , 2003, ECCB.

[13]  Uri Keich,et al.  Computing the P-value of the information content from an alignment of multiple sequences , 2005, ISMB.

[14]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[15]  Pavel A. Pevzner,et al.  Combinatorial Approaches to Finding Subtle Signals in DNA Sequences , 2000, ISMB.

[16]  J A Swets,et al.  Measuring the accuracy of diagnostic systems. , 1988, Science.

[17]  Martin Tompa,et al.  Statistics of local multiple alignments , 2005, ISMB.

[18]  Bin Li,et al.  Limitations and potentials of current motif discovery algorithms , 2005, Nucleic acids research.

[19]  Z. Weng,et al.  Finding functional sequence elements by multiple local alignment. , 2004, Nucleic acids research.