STREME: Accurate and versatile sequence motif discovery

Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences—for example, the binding site motifs of DNA- and RNA-binding proteins. The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive, thorough and rapid than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs and Weeder). STREME’s capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME accurately estimates and reports the statistical significance of each motif that it discovers. STREME is easy to use via its web server at http://meme-suite.org, and is fully integrated with the widely-used MEME Suite of sequence analysis tools, which can be freely downloaded at the same web site for non-commercial use.

[1]  R. Fisher On the Interpretation of χ 2 from Contingency Tables , and the Calculation of P Author , 2022 .

[2]  T. Bailey,et al.  Differential motif enrichment analysis of paired ChIP-seq experiments , 2014, BMC Genomics.

[3]  Gary D. Stormo,et al.  DNA binding sites: representation and discovery , 2000, Bioinform..

[4]  Denis Thieffry,et al.  RSAT 2011: regulatory sequence analysis tools , 2011, Nucleic Acids Res..

[5]  S. Salzberg,et al.  Versatile and open software for comparing large genomes , 2004, Genome Biology.

[6]  R. Simes,et al.  An improved Bonferroni procedure for multiple tests of significance , 1986 .

[7]  T. D. Schneider,et al.  Sequence logos: a new way to display consensus sequences. , 1990, Nucleic acids research.

[8]  Peter Weiner,et al.  Linear Pattern Matching Algorithms , 1973, SWAT.

[9]  John E. Reid,et al.  STEME: efficient EM to find motifs in large data sets , 2011, Nucleic acids research.

[10]  R. Fisher On the Interpretation of χ2 from Contingency Tables, and the Calculation of P , 2018, Journal of the Royal Statistical Society Series A (Statistics in Society).

[11]  Timothy L. Bailey,et al.  Gene expression Advance Access publication May 4, 2011 DREME: motif discovery in transcription factor ChIP-seq data , 2011 .

[12]  Jacques van Helden,et al.  Regulatory Sequence Analysis Tools , 2003, Nucleic Acids Res..

[13]  Graziano Pesole,et al.  Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes , 2004, Nucleic Acids Res..

[14]  Uri Keich,et al.  Computing the P-value of the information content from an alignment of multiple sequences , 2005, ISMB.

[15]  Gene W. Yeo,et al.  Robust transcriptome-wide discovery of RNA binding protein binding sites with enhanced CLIP (eCLIP) , 2016, Nature Methods.

[16]  Juan M. Vaquerizas,et al.  DNA-Binding Specificities of Human Transcription Factors , 2013, Cell.

[17]  C. Glass,et al.  Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. , 2010, Molecular cell.

[18]  William Stafford Noble,et al.  Quantifying similarity between motifs , 2007, Genome Biology.

[19]  Charles Elkan,et al.  The Value of Prior Knowledge in Discovering Motifs with MEME , 1995, ISMB.

[20]  R. Gnanadesikan,et al.  Probability plotting methods for the analysis of data. , 1968, Biometrika.

[21]  R. Gnanadesikan,et al.  Probability plotting methods for the analysis for the analysis of data , 1968 .

[22]  Brendan J. Frey,et al.  A compendium of RNA-binding motifs for decoding gene regulation , 2013, Nature.