Irredundant tandem motifs

Eliminating redundancy from a set of candidate motifs occurring in an input string is fundamental in many applications. Existing techniques for extracting irredundant motifs are not suitable when the motifs sought are structured, i.e., made of two (or several) subwords that co-occur in a text string s of length n. The main contribution of this work is the study and characterization of a compact class of tandem motifs, that is, pairs of substrings occurring in tandem within a maximum distance of d symbols in s, where d is an integer constant given as input. To this end, we first introduce the concept of maximality, related to four specific conditions that hold only for this class of motifs. We then eliminate the remaining redundancy by defining the notion of irredundancy for tandem motifs. We prove that the number of non-overlapping irredundant tandem motifs is O(d^2 n), which, when d is treated as a constant, is linear in the length of the input string. This is an order of magnitude fewer than in previously developed compact indexes for tandem extraction. The notions and bounds provided for tandem motifs are generalized to motifs composed of r >= 2 subwords. Finally, we provide an algorithm to extract irredundant tandem motifs.
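To make the notion of a tandem occurrence concrete, the following is a minimal sketch that naively enumerates pairs of substrings occurring in tandem within distance d. It is not the extraction algorithm of the paper and does not test maximality or irredundancy. Several choices are assumptions made only for illustration: the distance is measured as the gap between the end of the first subword and the start of the second, the two subwords have fixed lengths `len_u` and `len_v`, and a pair is reported when it occurs at least `quorum` times. The function names and parameters are hypothetical.

```python
from collections import defaultdict


def tandem_occurrences(s, d, len_u, len_v):
    """Map each pair (u, v) to its list of start positions (i, j).

    Assumption: u = s[i:i+len_u] and v = s[j:j+len_v] do not overlap,
    and the gap j - (i + len_u) lies in the range 0..d.
    """
    occ = defaultdict(list)
    n = len(s)
    for i in range(n - len_u + 1):
        u = s[i:i + len_u]
        first_j = i + len_u                   # v starts right after u at the earliest
        last_j = min(first_j + d, n - len_v)  # gap of at most d, and v must fit in s
        for j in range(first_j, last_j + 1):
            occ[(u, s[j:j + len_v])].append((i, j))
    return dict(occ)


def repeated_tandems(s, d, len_u, len_v, quorum=2):
    """Keep only the pairs that occur at least `quorum` times (hypothetical threshold)."""
    all_occ = tandem_occurrences(s, d, len_u, len_v)
    return {pair: pos for pair, pos in all_occ.items() if len(pos) >= quorum}


if __name__ == "__main__":
    for pair, positions in repeated_tandems("ACGTACGT", d=1, len_u=2, len_v=2).items():
        print(pair, positions)
```

For fixed subword lengths, each start position i is paired with at most d + 1 positions j, so this sketch records at most (d + 1) n occurrences. This count concerns the naive enumeration only and should not be confused with the O(d^2 n) bound on irredundant tandem motifs stated above.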
