Memory Efficient Ranking

Abstract

Fast and effective ranking of a collection of documents with respect to a query requires several structures, including a vocabulary, inverted file entries, arrays of term weights and document lengths, a set of partial similarity accumulators, and address tables for inverted file entries and documents. Of all of these structures, the array of document lengths and the set of accumulators are the components accessed most frequently in a ranked query, and it is crucial to acceptable performance that they be held in main memory. Here we describe an approximate ranking process that makes use of a compact array of in-memory, low-precision approximations for the lengths. Combined with another simple rule for reducing the memory required by the partial similarity accumulators, the approximation heuristic allows the ranking of large document collections using less than one byte of memory per document, an eight-fold reduction compared with conventional techniques. Moreover, in our experiments retrieval effectiveness was largely unaffected by the use of these heuristics.
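As an illustration of what a compact array of low-precision length approximations might look like, the sketch below stores a small integer code per document in place of a full-precision length. The abstract does not specify the coding scheme, so the geometric bucketing used here, the 6-bit code width, and the names `build_quantizer`, `encode`, and `decode` are all assumptions made for this example, not the authors' method.

```python
import math

def build_quantizer(lengths, bits=6):
    """Return (lo, base, levels) for a geometric bucketing of positive lengths.

    Assumption: bucket boundaries are lo * base**k, chosen so that
    2**bits buckets span the range from the smallest to the largest length.
    """
    lo, hi = min(lengths), max(lengths)
    levels = 1 << bits
    base = (hi / lo) ** (1.0 / levels) if hi > lo else 1.0
    return lo, base, levels

def encode(length, lo, base, levels):
    """Map a document length to its bucket number (a small integer code)."""
    if base == 1.0:
        return 0
    code = int(math.log(length / lo) / math.log(base))
    return min(max(code, 0), levels - 1)  # clamp, so the maximum fits the top bucket

def decode(code, lo, base):
    """Approximate a length by the geometric midpoint of its bucket."""
    return lo * base ** (code + 0.5)

# Hypothetical document lengths; real values would come from the index.
lengths = [1.0, 2.5, 10.0, 300.0, 4000.0]
lo, base, levels = build_quantizer(lengths, bits=6)

# One byte per slot for simplicity; packing codes to exactly `bits` bits is omitted.
codes = bytearray(encode(x, lo, base, levels) for x in lengths)
approx = [decode(c, lo, base) for c in codes]
```

Under these assumptions each approximate length differs from the true length by a factor of at most `sqrt(base)`, so the relative error is the same for short and long documents. The approximate lengths would then stand in for the exact ones when similarity scores are normalised.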
