Languages with Mismatches and an Application to Approximate Indexing

In this paper we describe a factorial language, denoted by L(S,k,r), that contains all words occurring in a string S with up to k mismatches every r symbols. We then give some combinatorial properties of a parameter called the repetition index, denoted by R(S,k,r) and defined as the smallest integer h ≥ 1 such that every string of length h occurs in at most one position of the text S, up to k mismatches every r symbols. We prove that R(S,k,r) is a non-increasing function of r and a non-decreasing function of k, and that the equation r = R(S,k,r) admits a unique solution. The repetition index plays an important role in the construction of an indexing data structure based on a trie that represents the set of all factors of L(S,k,r) having length equal to R(S,k,r). For each word x ∈ L(S,k,r), this data structure allows us to find the list occ(x) of all occurrences of x in the text S, up to k mismatches every r symbols, in time proportional to |x| + |occ(x)|.
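The definitions above can be made concrete with a small brute-force sketch. It is not the trie-based index from the paper and does not achieve the |x| + |occ(x)| query time; it only spells out the definitions on tiny inputs. It assumes one reading of "up to k mismatches every r symbols": every window of min(r, |x|) consecutive aligned symbols contains at most k mismatches. The function names `occurs_at`, `occurrences`, and `repetition_index` are hypothetical and chosen for illustration.

```python
from itertools import product


def occurs_at(x, s, pos, k, r):
    """True if x matches s at pos with at most k mismatches in every
    window of min(r, len(x)) consecutive symbols (assumed reading)."""
    if pos < 0 or pos + len(x) > len(s):
        return False
    mismatches = [a != b for a, b in zip(x, s[pos:pos + len(x)])]
    w = min(r, len(x))
    return all(sum(mismatches[i:i + w]) <= k
               for i in range(len(x) - w + 1))


def occurrences(x, s, k, r):
    """Brute-force occ(x): all positions of s where x occurs approximately."""
    return [p for p in range(len(s) - len(x) + 1) if occurs_at(x, s, p, k, r)]


def repetition_index(s, k, r):
    """Smallest h >= 1 such that every word of length h occurs in at most
    one position of s. Exponential in h; only for very short strings."""
    alphabet = sorted(set(s))
    for h in range(1, len(s) + 1):
        if all(len(occurrences("".join(w), s, k, r)) <= 1
               for w in product(alphabet, repeat=h)):
            return h
    return len(s)


print(occurrences("ab", "abaab", 0, 2))   # [0, 3]
print(occurrences("ab", "abaab", 1, 2))   # [0, 2, 3]
print(repetition_index("abaab", 0, 2))    # 3
```

Enumerating only words over the letters of S is enough here: a letter that never appears in S always mismatches, so replacing it with a letter of S cannot remove an occurrence. The search stops at h = |S| at the latest, since a word of that length can only be aligned at position 0.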
