Searching large text collections, such as repositories of Web pages, is today one of the most common uses of computers. Before a collection can be searched, it must be indexed. One of the main tasks in constructing an index is identifying the set of unique words occurring in the collection, that is, extracting its vocabulary. The vocabulary is used during index construction to accumulate statistics and temporary inverted lists, and at query time both to fetch inverted lists and as a source of information about the repository. In English text, the frequency of occurrence of words is highly skewed and follows the Zipf distribution [8], so the vocabulary is typically small enough to fit in main memory. As an example, a medium-size collection of around 1 GB of English text derived from the TREC world-wide web data [2] contains around 170 million word occurrences, of which just under 2 million are distinct words. The single most frequent word, “the”, occurs almost 6.5 million times, almost twice as often as the second most frequent word, “of”, consistent with the Zipfian prediction that a word's frequency is roughly inversely proportional to its rank.
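Because the vocabulary fits in memory, it can be accumulated in a single pass over the text with an in-memory structure such as a hash table. The following is a minimal sketch in C of this accumulation step, assuming a fixed-size chained hash table and a generic shift-add-xor string hash in the style examined in [5]; the parsing rule (maximal alphabetic runs, case-folded), the table size, and the seed are illustrative choices, not those of any particular system.

/* Sketch: accumulate a text vocabulary in a chained hash table,
 * mapping each distinct word to its occurrence count.
 * Assumptions: fixed table size, simple alphabetic word parsing. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define TABLE_SIZE 1048576          /* illustrative power-of-two size */

typedef struct node {
    char *word;                     /* distinct word */
    unsigned long count;            /* number of occurrences so far */
    struct node *next;              /* collision chain */
} node;

static node *table[TABLE_SIZE];     /* zero-initialised slot array */

/* Generic shift-add-xor string hash; seed is an arbitrary choice. */
static unsigned int hash(const char *word)
{
    unsigned int h = 220373;
    for (; *word != '\0'; word++)
        h ^= (h << 5) + (h >> 2) + (unsigned char)*word;
    return h & (TABLE_SIZE - 1);    /* fast modulo, power-of-two size */
}

/* Record one occurrence of word, inserting it on first sight. */
static void accumulate(const char *word)
{
    unsigned int slot = hash(word);
    node *n;
    for (n = table[slot]; n != NULL; n = n->next)
        if (strcmp(n->word, word) == 0) {   /* already in vocabulary */
            n->count++;
            return;
        }
    n = malloc(sizeof *n);          /* new distinct word */
    n->word = strdup(word);
    n->count = 1;
    n->next = table[slot];
    table[slot] = n;
}

int main(void)
{
    char word[256];
    int c, len = 0;
    unsigned long distinct = 0, occurrences = 0;
    unsigned int i;
    node *n;

    /* Crude parser: maximal runs of alphabetic characters, folded
     * to lower case, are words. */
    while ((c = getchar()) != EOF) {
        if (isalpha(c) && len < 255) {
            word[len++] = (char)tolower(c);
        } else if (len > 0) {
            word[len] = '\0';
            accumulate(word);
            len = 0;
        }
    }
    if (len > 0) { word[len] = '\0'; accumulate(word); }

    /* Report the vocabulary statistics discussed above. */
    for (i = 0; i < TABLE_SIZE; i++)
        for (n = table[i]; n != NULL; n = n->next) {
            distinct++;
            occurrences += n->count;
        }
    printf("%lu occurrences, %lu distinct words\n", occurrences, distinct);
    return 0;
}

Run over a text stream on standard input, the sketch reports total and distinct word counts; a production vocabulary accumulator would additionally grow the table as the vocabulary expands and attach per-word statistics or temporary inverted lists to each node.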
[1] Michael J. Carey et al. A Study of Index Structures for a Main Memory Database Management System. HPTS, 1986.
[2] Ian H. Witten et al. Source Models for Natural Language Text. Int. J. Man Mach. Stud., 1990.
[3] William Pugh et al. Skip Lists: A Probabilistic Alternative to Balanced Trees. WADS, 1989.
[4] Robert Sedgewick et al. Fast Algorithms for Sorting and Searching Strings. SODA '97, 1997.
[5] Justin Zobel et al. Performance in Practice of String Hashing Functions. DASFAA, 1997.
[6] Donna K. Harman et al. Overview of the Second Text REtrieval Conference (TREC-2). HLT, 1994.
[7] Robert Sedgewick et al. Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching (3rd ed.). 1997.