Indexing Compressed Text

As the volume of electronic data grows rapidly, text compression and indexing techniques are receiving increasing attention. The two are usually treated as independent problems, but approaches that combine them have recently attracted researchers' interest. In this thesis, we review and test some of the more effective and more theoretically interesting techniques. We present various compression and indexing techniques, along with two compressed text indices. Based on these techniques, we implement a compressed full-text index, so that compressed texts can be indexed to support fast queries without decompressing the whole text. Our experiments show that the index is compact and supports fast search.
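
The abstract does not say which index structure the thesis implements, so the following is only an illustrative sketch of the general idea of querying a transformed text without reconstructing it. It assumes a Burrows-Wheeler-based index with backward search, one common design for this kind of index and not necessarily the one used here. The function names (`build_bwt`, `build_tables`, `count_occurrences`) are hypothetical, and the tables are stored uncompressed for clarity, whereas a practical index would compress them.

```python
def build_bwt(text):
    """Return the Burrows-Wheeler transform of text plus a sentinel.

    Uses naive suffix sorting, which is fine for a small illustration
    but far too slow for large texts.
    """
    s = text + "\0"  # sentinel that sorts before every other character
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each sorted suffix.
    return "".join(s[i - 1] for i in suffix_array)


def build_tables(bwt):
    """Build the two tables used by backward search (assumed layout).

    C[c]      : number of characters in the text strictly smaller than c
    occ[c][i] : number of occurrences of c in bwt[:i]
    """
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1

    C = {}
    total = 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count_occurrences(pattern, bwt, C, occ):
    """Count occurrences of pattern using only the BWT and its tables."""
    lo, hi = 0, len(bwt)  # current half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt = build_bwt("abracadabra")
    C, occ = build_tables(bwt)
    print(count_occurrences("abra", bwt, C, occ))
```

Each pattern character costs two table lookups, so the query time depends on the pattern length rather than on the text length, and the original text is never rebuilt during the search.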
