Maximal Motif Discovery in a Sliding Window

Motifs are relatively short sequences that are biologically significant, and their discovery in molecular sequences is a well-researched subject. A don’t care is a special letter that matches every letter in the alphabet. Formally, a motif is a sequence of letters of the alphabet and don’t care letters. A motif \(\tilde{m}_{d,k}\) that occurs at least \(k\) times in a sequence is maximal if it can be neither extended (to the left or to the right) nor specialised (that is, none of its \(d' \le d\) don’t cares can be replaced with a letter from the alphabet) without reducing its number of occurrences. Here we present a new dynamic data structure, and the first on-line algorithm, to discover all maximal motifs in a sliding window of length \(\ell\) on a sequence \(x\) of length \(n\) in \(\mathcal{O}(nd\ell + d\lceil \frac{\ell}{w}\rceil \cdot \sum_{i=\ell}^{n-1} |\textsc{diff}_{i-1}^{i}|)\) time, where \(w\) is the size of the machine word and \(\textsc{diff}_{i-1}^{i}\) is the symmetric difference of the sets of occurrences of maximal motifs at \(x[i-\ell \mathinner{.\,.} i-1]\) and at \(x[i-\ell+1 \mathinner{.\,.} i]\).
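To make the definitions concrete, the following is a minimal brute-force sketch in Python of how a motif with don’t cares matches a sequence and of what specialisation means. It is not the dynamic data structure or the on-line algorithm announced above. The symbol `*` for the don’t care letter and the function names `occurrences` and `can_specialise` are assumptions made for illustration only.

```python
DONT_CARE = "*"  # hypothetical symbol chosen here for the don't care letter


def occurrences(motif, text):
    """Return the start positions at which motif matches text.

    A don't care in the motif matches any letter of the text.
    """
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(c == DONT_CARE or c == text[i + j] for j, c in enumerate(motif))
    ]


def can_specialise(motif, text, alphabet):
    """Return True if some don't care can be replaced by a letter
    without reducing the number of occurrences of the motif.

    Specialising can only remove occurrences, so an equal count
    means the set of occurrences is unchanged.
    """
    base = len(occurrences(motif, text))
    for pos, c in enumerate(motif):
        if c != DONT_CARE:
            continue
        for letter in alphabet:
            candidate = motif[:pos] + letter + motif[pos + 1:]
            if len(occurrences(candidate, text)) == base:
                return True
    return False


text = "ACGTACCTACGT"
print(occurrences("AC*T", text))             # [0, 4, 8]
print(can_specialise("AC*T", text, "ACGT"))  # False
print(can_specialise("A*GT", text, "ACGT"))  # True
```

Here `AC*T` occurs three times and every replacement of its don’t care loses at least one occurrence, so it cannot be specialised. By contrast, `A*GT` occurs at the same two positions as `ACGT`, so it can be specialised and is therefore not maximal. The sketch leaves out the extension test and the sliding window.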
