Access-Ordered Indexes

Search engines are an essential tool for modern life. We use them to discover new information on diverse topics and to locate a wide range of resources. In all practical search engines, the search process is supported by an inverted index that stores every search term and its locations within the searchable document collection. Inverted indexes are highly optimised, and significant work over the past fifteen years has gone into storing, retrieving, and compressing these structures and into understanding the heuristics used with them. In this paper, we propose a new self-organising inverted index based on past queries. We show that this access-ordered index improves query evaluation speed by 25% to 40% over a conventional, optimised approach, with almost indistinguishable accuracy. We conclude that access-ordered indexes are a valuable new tool to support fast and accurate web search.
