The authors have explored the particular needs of large information retrieval systems, in which hundreds of megabytes of data are stored, retrieval is non-sequential, and new text is continually being appended. It has been shown that the word-based model can be adapted to cope well both with dynamic environments and with situations in which decode-time memory is limited. In the latter case, as little as 100 KB of main memory is sufficient to achieve excellent compression, provided a suitable choice of tokens is used as the compression lexicon. To solve the former problem, a new compression paradigm has been introduced in which some components of the compression model are required to remain static, so that all parts of the text can be decoded, while others are extensible, so that new text can also influence the assignment of codewords. An additional heuristic, Swap-to-Near-the-Front, allows collections to be seeded with as little as 1/1000 of their final text with minimal loss of compression efficiency. The resulting "almost static" compression method is ideal for large, dynamic collections.
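The abstract names the Swap-to-Near-the-Front heuristic without spelling out its mechanics. Below is a minimal sketch of one plausible reading, in which list position stands in for codeword length (earlier entries get shorter codewords) and an accessed word is swapped with the entry halfway between its current position and the front, rather than being moved all the way to the front. The function and variable names (access, symbols, index) are illustrative, not taken from the paper, and the halving policy is an assumption; the published heuristic may differ in detail.

    def access(symbols, index, word):
        """Sketch of a swap-to-near-the-front list update.

        symbols : list of words, ordered so that position ~ codeword length
        index   : dict mapping each word to its position in `symbols`
        word    : the word just encountered in the incoming text

        Assumed policy: swap the word with the entry halfway to the
        front, so frequent words drift toward short codewords without
        the volatility of a full move-to-front update.
        """
        pos = index[word]
        target = pos // 2              # halfway to the front (assumed policy)
        other = symbols[target]
        symbols[target], symbols[pos] = symbols[pos], symbols[target]
        index[word], index[other] = target, pos

    # Example: "dynamic" moves from slot 3 to slot 1, displacing "of".
    symbols = ["the", "of", "compression", "dynamic"]
    index = {w: i for i, w in enumerate(symbols)}
    access(symbols, index, "dynamic")
    # symbols is now ["the", "dynamic", "compression", "of"]

Under this reading, a word at rank r needs roughly log2(r) accesses to reach the front instead of one, which damps the noise introduced when the model is seeded from a very small (1/1000) sample of the final collection.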