Recent research has demonstrated beyond doubt the benefits of compressing natural language texts using word-based statistical semistatic compression. Not only does it achieve extremely competitive compression rates, but direct search on the compressed text can also be carried out faster than on the original text; indexing based on inverted lists benefits from compression as well.
Such compression methods assign a variable-length codeword to each different text word. Some coding methods (Plain Huffman and Restricted Prefix Byte Codes) do not clearly mark codeword boundaries, and hence cannot be accessed at random positions or searched with the fastest text search algorithms. Other coding methods (Tagged Huffman, End-Tagged Dense Code, and (s, c)-Dense Code) do mark codeword boundaries, achieving a self-synchronization property that enables fast search and random access, in exchange for some loss in compression effectiveness.
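To make the boundary-marking idea concrete, here is a minimal sketch (our own illustration, not code from the paper) of the standard End-Tagged Dense Code byte assignment: every byte of a codeword has its highest bit clear except the last one, so codeword ends can be recognized locally, which is what allows fast Boyer-Moore-type searching and resynchronization at arbitrary positions.

```python
# Minimal sketch of End-Tagged Dense Code (ETDC): words are ranked by
# frequency and the i-th word receives the i-th codeword of a dense
# base-128 enumeration; the tag bit (0x80) is set only on the last byte.

def etdc_encode(rank):
    """Encode a 0-based frequency rank as an ETDC codeword (bytes)."""
    digits = []
    while True:
        digits.append(rank % 128)      # least significant digit first
        if rank < 128:
            break
        rank = rank // 128 - 1         # dense code: all byte values reused
    digits[0] += 128                   # tag bit marks the *last* byte
    return bytes(reversed(digits))

def etdc_decode_stream(data):
    """Decode a concatenation of ETDC codewords back to ranks.

    A byte >= 128 closes the current codeword, so decoding (or searching)
    can resynchronize at any codeword boundary without external pointers.
    """
    ranks, value, inside = [], 0, False
    for b in data:
        digit = b & 0x7F
        value = (value + 1) * 128 + digit if inside else digit
        inside = b < 128               # tag bit absent: codeword continues
        if not inside:
            ranks.append(value)
            value = 0
    return ranks

if __name__ == "__main__":
    ranks = [0, 127, 128, 16511, 16512]
    stream = b"".join(etdc_encode(r) for r in ranks)
    assert etdc_decode_stream(stream) == ranks
```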
In this paper, we show that by simply reordering the target symbols of the compressed text (more precisely, reorganizing its bytes into a wavelet-tree-like shape) and using little additional space, the searching capabilities are greatly improved without a drastic impact on compression and decompression times. With this approach, all the codes achieve synchronism and can be searched fast and accessed at arbitrary points. Moreover, the reordered compressed text becomes an implicitly indexed representation of the text, which can be searched for words in time independent of the text length. That is, we achieve not only fast sequential search time, but indexed search time, for almost no extra space cost.
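The following sketch (ours, under simplifying assumptions: Python dictionaries for the tree nodes and a naive linear-time rank instead of the compact level layout and sublinear rank structures a real implementation would use) illustrates the reordering idea: the i-th byte of every codeword is stored at depth i, grouped by the codeword's preceding bytes, so the i-th codeword can be recovered with one rank operation per level rather than a sequential scan.

```python
# Sketch of the wavelet-tree-like byte reordering of a compressed text.

from collections import defaultdict

def build_levels(codewords):
    """levels[d] maps a d-byte prefix to the (d+1)-th bytes, in text order,
    of all codewords that share that prefix."""
    depth = max(len(cw) for cw in codewords)
    levels = []
    for d in range(depth):
        node = defaultdict(list)
        for cw in codewords:
            if len(cw) > d:
                node[cw[:d]].append(cw[d])
        levels.append(node)
    return levels

def access(levels, i):
    """Recover the i-th codeword (0-based) from the reordered layout."""
    prefix, pos, out = b"", i, bytearray()
    for node in levels:
        seq = node.get(prefix)
        if seq is None:                 # codeword ended at the previous level
            break
        byte = seq[pos]
        out.append(byte)
        # rank: earlier codewords in this node that continue with `byte`
        # give the position inside the child node for prefix + byte.
        pos = sum(1 for x in seq[:pos] if x == byte)
        prefix = bytes(out)
    return bytes(out)

if __name__ == "__main__":
    # Toy vocabulary with 1- and 2-byte codewords (any prefix-free byte code).
    code = {"the": b"\x01", "rose": b"\x02\x01", "is": b"\x02\x02"}
    words = ["the", "rose", "is", "the", "rose"]
    cws = [code[w] for w in words]
    levels = build_levels(cws)
    assert [access(levels, i) for i in range(len(words))] == cws
```

Searching for a word proceeds symmetrically, following its codeword bytes down the levels with rank operations, which is what yields word searches in time independent of the text length.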
We experiment with three well-known word-based compression techniques with different characteristics (Plain Huffman, End-Tagged Dense Code, and Restricted Prefix Byte Codes), and show, on several corpora, the search capabilities achieved by reordering the compressed representation. We show that the reordered versions are not only much more efficient than their classical counterparts, but also more efficient than explicit inverted indexes built on the collection, when using the same amount of space.