Characterization and Extraction of Irredundant Tandem Motifs

We address the problem of extracting pairs of subwords (m1, m2) from a text string s of length n such that, given an integer constant d as input, m1 and m2 occur in tandem within a maximum distance of d symbols in s. The main effort of this work is to eliminate redundancy from the candidate set of tandem motifs found in this way. To this end, we first introduce the concept of maximality for tandem motifs, characterized by four specific conditions, and show that it cannot be deduced from the corresponding notion of maximality already defined for "simple" (i.e., non-tandem) motifs. We then eliminate the remaining redundancy by defining the concept of irredundancy for tandem motifs. We prove that the number of non-overlapping irredundant tandems is O(d^2 n) which, when d is treated as a constant, is linear in the length of the input string. This is an order of magnitude smaller than previously developed compact indexes for tandem extraction. As a further contribution, we present an algorithm to extract this compact irredundant index.
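
To make the problem statement concrete, the following is a minimal brute-force sketch of what it means for m1 and m2 to occur in tandem within distance d. It is not the extraction algorithm of the paper and says nothing about maximality or irredundancy. The function names are hypothetical, and the distance is assumed here to be the number of symbols between the end of m1 and the start of m2 (so m2 may not overlap m1); the paper's exact definition may differ.

```python
def occurrences(s, m):
    """Start positions of every (possibly overlapping) occurrence of m in s."""
    return [i for i in range(len(s) - len(m) + 1) if s.startswith(m, i)]


def tandem_occurrences(s, m1, m2, d):
    """Pairs (i, j) where m1 starts at i, m2 starts at j, and m2 begins
    at most d symbols after the end of m1.

    Assumption (not taken from the paper): the distance is the gap
    between the last symbol of m1 and the first symbol of m2, and it
    must lie in the range 0..d.
    """
    starts2 = occurrences(s, m2)
    pairs = []
    for i in occurrences(s, m1):
        end1 = i + len(m1)          # first position after this occurrence of m1
        for j in starts2:
            if 0 <= j - end1 <= d:  # m2 follows m1 within the allowed gap
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    s = "abxxcdabcd"
    print(tandem_occurrences(s, "ab", "cd", 2))
```

Enumerating all subword pairs this way is far too expensive for long strings and produces a highly redundant candidate set, which is the motivation for the compact index described above.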
