Cache-oblivious index for approximate string matching

This paper revisits the problem of indexing a text for approximate string matching. Specifically, given a text $T$ of length $n$ and a positive integer $k$, we want to construct an index of $T$ such that, for any input pattern $P$, we can find all its $k$-error matches in $T$ efficiently. This problem is well studied in the internal-memory setting. Here, we extend some of the recent internal-memory results to external-memory solutions that are also cache-oblivious. Our first index occupies $O((n \log^k n)/B)$ disk pages and finds all $k$-error matches with $O((|P| + \mathit{occ})/B + \log^k n \log\log_B n)$ I/Os, where $B$ denotes the number of words in a disk page and $\mathit{occ}$ the number of matches reported. To the best of our knowledge, this index is the first external-memory data structure that does not require $\Omega(|P| + \mathit{occ} + \mathrm{poly}(\log n))$ I/Os. The second index reduces the space to $O((n \log n)/B)$ disk pages, and the I/O complexity is $O((|P| + \mathit{occ})/B + \log^{k(k+1)} n \log\log n)$.
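
To make the query concrete, the following is a minimal brute-force sketch of what "finding all k-error matches" asks for. It is not the index described above. It assumes that errors are counted as edit operations (insertion, deletion, substitution), which the abstract does not state explicitly, and it scans the whole text in O(n|P|) time in internal memory using the classical dynamic-programming recurrence. The function name `k_error_matches` is hypothetical.

```python
def k_error_matches(text, pattern, k):
    """Return every end position j >= 1 such that some substring of
    text ending at position j (1-based, inclusive) is within edit
    distance k of pattern.

    Assumption: "k-error" means edit distance at most k. This is a
    baseline scan for illustration, not the paper's index.
    """
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = col[0]   # value for (i-1, j-1)
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cur = col[i]     # value for (i, j-1), saved before overwrite
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         cur + 1,           # extra character in text
                         col[i - 1] + 1)    # missing character in text
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends
```

With `k = 0` the function reduces to exact matching, and each reported position is the end of an occurrence. An index such as the ones summarized above aims to answer the same query without scanning the text, at the cost of extra space.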
