Languages with Mismatches and an Application to Approximate Indexing

In this paper we describe a factorial language, denoted by L(S,k,r), that contains all words occurring in a string S up to k mismatches every r symbols. We then give some combinatorial properties of a parameter called the repetition index, denoted by R(S,k,r) and defined as the smallest integer h ≥ 1 such that every string of length h occurs in at most one position of the text S, up to k mismatches every r symbols. We prove that R(S,k,r) is a non-increasing function of r and a non-decreasing function of k, and that the equation r = R(S,k,r) admits a unique solution. The repetition index plays an important role in the construction of an indexing data structure based on a trie that represents the set of all factors of L(S,k,r) of length R(S,k,r). For each word x ∈ L(S,k,r), this data structure allows us to find the list occ(x) of all occurrences of x in the text S, up to k mismatches every r symbols, in time proportional to |x| + |occ(x)|.
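To make the definitions concrete, the following is a minimal brute-force sketch in Python, not the trie-based index described in the paper. It assumes that "up to k mismatches every r symbols" means at most k mismatches in every window of r consecutive positions (and at most k mismatches overall when the word is shorter than r). The function names kr_match, occ and repetition_index are hypothetical, chosen for illustration only, and the search is exponential in h, so it is only suitable for very short texts.

```python
from itertools import product


def kr_match(u, v, k, r):
    """Return True if u and v have equal length and differ in at most k
    positions inside every window of r consecutive symbols.
    Assumption: words shorter than r are treated as a single window."""
    if len(u) != len(v):
        return False
    mismatches = [a != b for a, b in zip(u, v)]
    if len(u) <= r:
        return sum(mismatches) <= k
    return all(sum(mismatches[i:i + r]) <= k
               for i in range(len(u) - r + 1))


def occ(S, x, k, r):
    """Naive scan: start positions where x occurs in S up to k mismatches
    every r symbols. Runs in O(|S| * |x| * r), not O(|x| + |occ(x)|)."""
    m = len(x)
    return [i for i in range(len(S) - m + 1)
            if kr_match(x, S[i:i + m], k, r)]


def repetition_index(S, k, r):
    """Smallest h >= 1 such that every word of length h has at most one
    approximate occurrence in S (S is assumed to be non-empty).
    Only words over the symbols of S are tried: replacing a foreign
    symbol by one from S can never reduce the set of occurrences."""
    alphabet = sorted(set(S))
    for h in range(1, len(S) + 1):
        if all(len(occ(S, "".join(w), k, r)) <= 1
               for w in product(alphabet, repeat=h)):
            return h
    return len(S)  # fallback; h = |S| always satisfies the condition
```

For example, occ("abaab", "ab", 0, 2) lists the two exact occurrences of "ab", and comparing repetition_index("abaab", 0, 2) with repetition_index("abaab", 1, 2) illustrates that the index does not decrease when k grows.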
