Opportunistic data structures with applications

We address the issue of compressing and indexing data. We devise a data structure whose space occupancy is a function of the entropy of the underlying data set. We call the data structure opportunistic because its space occupancy decreases when the input is compressible, and this space reduction is achieved with no significant slowdown in query performance. More precisely, its space occupancy is optimal in an information-content sense because a text $T[1,u]$ is stored using $O(H_k(T)) + o(1)$ bits per input symbol in the worst case, where $H_k(T)$ is the $k$th order empirical entropy of $T$ (the bound holds for any fixed $k$). Given an arbitrary string $P[1,p]$, the opportunistic data structure allows one to search for the $\mathrm{occ}$ occurrences of $P$ in $T$ in $O(p + \mathrm{occ} \log^{\epsilon} u)$ time (for any fixed $\epsilon > 0$). If the data are incompressible, we achieve the best space bound currently known (Grossi and Vitter, 2000); on compressible data, our solution improves on the succinct suffix array of (Grossi and Vitter, 2000) and on the classical suffix tree and suffix array data structures in space, in query time, or in both. We also study our opportunistic data structure in a dynamic setting and devise a variant achieving effective search and update time bounds. Finally, we show how to plug our opportunistic data structure into the Glimpse tool (Manber and Wu, 1994). The result is an indexing tool that achieves sublinear space and sublinear query time complexity.
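To make the space bound concrete, the sketch below computes the empirical entropy $H_k(T)$ directly from a string. It assumes the standard definition, which the abstract does not spell out: $H_k(T) = \frac{1}{u} \sum_{w} |w_T| \, H_0(w_T)$, where $w$ ranges over the length-$k$ substrings of $T$ and $w_T$ collects the symbols that immediately follow each occurrence of $w$. The function names `h0` and `hk` are illustrative only and do not come from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum(c / n * log2(c / n) for c in counts.values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Assumed definition: group every symbol by the k symbols that precede
    it, take the zeroth-order entropy of each group, and average the
    groups weighted by their size, normalising by the text length.
    """
    if not text:
        return 0.0
    followers = defaultdict(list)
    for i in range(k, len(text)):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(text)


if __name__ == "__main__":
    t = "abababab"
    print(hk(t, 0), hk(t, 1))
```

For the string `abababab`, every `a` is followed by `b` and every `b` by `a`, so the first-order entropy is 0 even though the zeroth-order entropy is 1 bit per symbol. This is the kind of gap that a space bound stated in terms of $H_k(T)$ can exploit on compressible inputs.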

[1]  Udi Manber,et al.  GLIMPSE: A Tool to Search Through Entire File Systems , 1994, USENIX Winter.

[2]  Gonzalo Navarro,et al.  Block addressing indices for approximate text retrieval , 1997, International Conference on Information and Knowledge Management.

[3]  Edward M. McCreight,et al.  A Space-Economical Suffix Tree Construction Algorithm , 1976, JACM.

[4]  Eugene W. Myers,et al.  Suffix arrays: a new method for on-line string searches , 1993, SODA '90.

[5]  Robert E. Tarjan,et al.  A Locally Adaptive Data , 1986 .

[6]  Anna R. Karlin,et al.  Markov paging , 1992, Proceedings., 33rd Annual Symposium on Foundations of Computer Science.

[7]  Esko Ukkonen,et al.  Lempel-Ziv parsing and sublinear-size index structures for string matching , 1996 .

[8]  Gary Benson,et al.  Let sleeping files lie: pattern matching in Z-compressed files , 1994, SODA '94.

[9]  Arne Andersson.  Sorting and Searching Revisited , 1996, SWAT.

[10]  Erkki Sutinen,et al.  Lempel-Ziv Index for q-Grams , 1998, Algorithmica.

[11]  Anna R. Karlin,et al.  Markov Paging (Extended Abstract) , 1992, FOCS 1992.

[12]  Gary Benson,et al.  Efficient two-dimensional compressed matching , 1992, Data Compression Conference, 1992.

[13]  Giovanni Manzini,et al.  An analysis of the Burrows-Wheeler transform , 2001, SODA '99.

[14]  Jan van Leeuwen,et al.  Worst-Case Optimal Insertion and Deletion Methods for Decomposable Searching Problems , 1981, Inf. Process. Lett.

[15]  Donald E. Knuth,et al.  Sorting and Searching , 1973 .

[16]  Ian H. Witten,et al.  Managing Gigabytes: Compressing and Indexing Documents and Images , 1999 .

[17]  P. Krishnan,et al.  Optimal prediction for prefetching in the worst case , 1994, SODA '94.

[18]  Mikkel Thorup,et al.  String Matching in Lempel-Ziv Compressed Strings , 1998, Algorithmica.

[19]  John L. Smith Tables , 1969, Neuromuscular Disorders.

[20]  Kunihiko Sadakane.  A modified Burrows-Wheeler transformation for case-insensitive search with application to suffix array compression , 1999, Proceedings DCC'99 Data Compression Conference (Cat. No. PR00096).

[21]  Kurt Mehlhorn,et al.  Optimal Dynamization of Decomposable Searching Problems , 1981, Inf. Process. Lett.

[22]  John H. Reif,et al.  Using difficulty of prediction to decrease computation: fast sort, priority queue and convex hull on entropy bounded inputs , 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[23]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[24]  Ricardo Baeza-Yates,et al.  Block addressing indices for approximate text retrieval , 2000 .

[25]  Mark de Berg,et al.  Multi-method dispatching: a geometric approach with applications to string matching problems , 1999, STOC '99.

[26]  Roberto Grossi,et al.  The string B-tree: a new data structure for string search in external memory and its applications , 1999, JACM.

[27]  Jon Louis Bentley,et al.  Programming pearls , 1987, CACM.