Property matching and weighted matching

In many pattern matching applications, the text has properties attached to its various parts. Pattern Matching with Properties (Property Matching, for short) requires both a string match between the pattern and the text and that the matched part of the text satisfy some property. Immediate examples come from molecular biology, where it has long been common practice to single out special areas of the genome by their structure. Sequential matching in a text with properties is straightforward. Indexing a text with properties, however, becomes difficult if the query time is to be output dependent. We present an algorithm for indexing a text with properties in $O(n \log|\Sigma| + n \log\log n)$ preprocessing time and $O(|P| \log|\Sigma| + tocc_\pi)$ time per query, where $n$ is the length of the text, $P$ is the sought pattern, $\Sigma$ is the alphabet, and $tocc_\pi$ is the number of occurrences of the pattern that satisfy some property $\pi$. As a practical use of Property Matching, we show how to solve Weighted Matching problems using techniques from Property Matching. Weighted sequences have recently been introduced as a tool for handling a set of sequences that are not identical but have many local similarities. The weighted sequence is a "statistical image" of this set, in which we are given the probability of every symbol's occurrence at every text location. Weighted matching problems are pattern matching problems in which the given text is weighted. We present a reduction from Weighted Matching to Property Matching that allows off-the-shelf solutions to numerous weighted matching problems, including indexing, swapped matching, parameterized matching, approximate matching, and many more. Assuming that one seeks the occurrences of pattern $P$ with probability $\epsilon$ in a weighted text $T$ of length $n$, we reduce the problem to a property matching problem of pattern $P$ in a text $T'$ of length $O(n (1/\epsilon)^2 \log(1/\epsilon))$.
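To make the two problem statements concrete, here is a minimal Python sketch of the straightforward sequential versions only; it is not the indexing algorithm or the reduction described above. It rests on assumptions that the abstract does not state: a property is modeled as a set of inclusive, 0-based intervals of the text, an occurrence satisfies the property when it lies entirely inside one interval, a weighted text is a list of per-position symbol probabilities, and positions are treated as independent so that the probability of an occurrence is the product of its symbol probabilities. The names `property_match` and `weighted_match` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed encoding of a property: inclusive (start, end) intervals, 0-based.
Interval = Tuple[int, int]


def property_match(text: str, pattern: str,
                   intervals: Sequence[Interval]) -> List[int]:
    """Return start positions where `pattern` occurs in `text` and the
    occurrence lies entirely inside at least one interval."""
    m = len(pattern)
    hits: List[int] = []
    if m == 0:
        return hits
    start = text.find(pattern)
    while start != -1:
        end = start + m - 1
        # Keep the occurrence only if some interval covers all of it.
        if any(s <= start and end <= f for s, f in intervals):
            hits.append(start)
        # Step by one so that overlapping occurrences are also found.
        start = text.find(pattern, start + 1)
    return hits


def weighted_match(weighted_text: Sequence[Dict[str, float]],
                   pattern: str, eps: float) -> List[int]:
    """Return start positions where `pattern` occurs with probability at
    least `eps`, assuming independent positions (product of probabilities)."""
    n, m = len(weighted_text), len(pattern)
    hits: List[int] = []
    if m == 0:
        return hits
    for i in range(n - m + 1):
        prob = 1.0
        for j, ch in enumerate(pattern):
            prob *= weighted_text[i + j].get(ch, 0.0)
            if prob < eps:
                break  # the product can only shrink, so stop early
        else:
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(property_match("abcabcabc", "abc", [(0, 4), (5, 8)]))
    weighted = [{"a": 1.0}, {"b": 0.5, "c": 0.5},
                {"a": 0.5, "b": 0.5}, {"b": 1.0}]
    print(weighted_match(weighted, "ab", 0.5))
```

Both functions scan the whole text on every query, so their running time depends on the text length rather than on the number of reported occurrences. That gap is what motivates an index with output-dependent query time.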
