Compact in-memory models for compression of large text databases

For compression of text databases, semi-static word-based models are a pragmatic choice. Previous experiments have shown that, where there is not sufficient memory to store a full word-based model, encoding rare words as sequences of characters can still allow good compression, whereas a pure character-based model compresses poorly. We propose a further kind of model that reduces main-memory costs: approximate models, in which each rare word is represented by a similarly spelt common word together with a sequence of edits. We investigate the compression available with different models, including characters, words, word pairs, and edits, and with combinations of these approaches. We show experimentally that carefully chosen combinations of models can improve the compression available in limited memory and greatly reduce overall memory requirements.
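To make the idea of an approximate model concrete, the following is a minimal sketch and not the authors' implementation. It assumes a small in-memory list of common words and uses Python's standard `difflib` both to pick a similarly spelt common word and to derive the edits; the function names `encode_rare` and `decode_rare`, the edit tuple layout, and the similarity measure are all illustrative assumptions. A real coder would additionally entropy-code the word index and the edits.

```python
from difflib import SequenceMatcher, get_close_matches


def encode_rare(word, common_words):
    """Represent `word` as (index of a similar common word, list of edits).

    Each edit is (start, end, replacement): replace base[start:end] with
    `replacement`. This layout is an assumption made for illustration.
    """
    # cutoff=0.0 means the best-scoring common word is always returned.
    base = get_close_matches(word, common_words, n=1, cutoff=0.0)[0]
    ops = SequenceMatcher(None, base, word, autojunk=False).get_opcodes()
    # Keep only the spans where the base word and the rare word differ.
    edits = [(i1, i2, word[j1:j2])
             for tag, i1, i2, j1, j2 in ops if tag != "equal"]
    return common_words.index(base), edits


def decode_rare(base_index, edits, common_words):
    """Rebuild the rare word by applying the edits to the common word."""
    base = common_words[base_index]
    out, pos = [], 0
    for start, end, replacement in edits:
        out.append(base[pos:start])  # unchanged text before the edit
        out.append(replacement)      # inserted or substituted text
        pos = end                    # skip the deleted or replaced span
    out.append(base[pos:])           # unchanged tail
    return "".join(out)


if __name__ == "__main__":
    common = ["compress", "model", "memory", "database"]
    index, edits = encode_rare("compressor", common)
    print(common[index], edits)
    print(decode_rare(index, edits, common))
```

Only the common words need to be held in memory under this scheme; each rare word costs an index plus its edits in the compressed stream, which is the trade-off the abstract describes.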
