Revisiting globally sorted indexes for efficient document retrieval

Efficient document retrieval has been studied extensively in both the IR and web search communities. One important technique for improving retrieval efficiency is early termination, which speeds up query processing by avoiding a scan of the entire inverted lists. Most early termination techniques first build new inverted indexes by sorting the inverted lists by either term-dependent information (e.g., term frequencies or term IR scores) or term-independent information (e.g., the static rank of the document), and then apply appropriate retrieval strategies to the resulting indexes. Although methods based only on static rank have been shown to be ineffective for early termination, methods based on term-independent information still offer many advantages. In this paper, we propose new techniques for organizing inverted indexes based on term-independent information beyond static rank, and we study new retrieval strategies on the resulting indexes. We perform a detailed experimental evaluation of our new techniques and compare them with existing approaches. Our results on the TREC GOV and GOV2 data sets show that our techniques can improve query efficiency significantly.
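
To make the general idea of early termination over a term-independent order concrete, here is a minimal Python sketch. It is not the technique proposed in the paper. It assumes that document ids are assigned in descending static-rank order (so every inverted list shares one global order), that queries are conjunctive, and that a document's score is its static rank plus the sum of its per-term scores. The function name `top_k_early_termination` and the data layout are hypothetical and chosen only for illustration.

```python
import heapq

def top_k_early_termination(index, static_rank, query_terms, k):
    """Conjunctive top-k over lists that share one global document order.

    index: term -> list of (doc_id, term_score), sorted by doc_id.
    static_rank: list indexed by doc_id, non-increasing (assumed layout).
    Returns (results sorted by descending score, postings scanned).
    """
    lists = [index[t] for t in query_terms]
    # Upper bound on the term-dependent part of any document's score.
    max_term_sum = sum(max(s for _, s in lst) for lst in lists)
    positions = [0] * len(lists)
    heap = []      # min-heap of (score, doc_id) holding the current top k
    scanned = 0
    while all(p < len(lst) for p, lst in zip(positions, lists)):
        candidate = max(lst[p][0] for p, lst in zip(positions, lists))
        # Static rank only decreases from here on, so no remaining document
        # can score above this bound; stop once it cannot beat the k-th best.
        if len(heap) == k and static_rank[candidate] + max_term_sum <= heap[0][0]:
            break
        # Advance every list to the candidate document.
        aligned = True
        for i, lst in enumerate(lists):
            while positions[i] < len(lst) and lst[positions[i]][0] < candidate:
                positions[i] += 1
                scanned += 1
            if positions[i] == len(lst) or lst[positions[i]][0] != candidate:
                aligned = False
        if not aligned:
            continue  # candidate is missing from at least one list
        score = static_rank[candidate] + sum(
            lst[p][1] for p, lst in zip(positions, lists))
        if len(heap) < k:
            heapq.heappush(heap, (score, candidate))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, candidate))
        for i in range(len(lists)):
            positions[i] += 1
            scanned += 1
    return sorted(heap, reverse=True), scanned
```

In this sketch the scan stops as soon as the static rank at the current position plus the largest possible term contribution cannot exceed the k-th best score found so far. When term scores vary widely relative to static rank, that bound is loose and little of the list is skipped, which is consistent with the abstract's remark that static rank alone is a weak basis for early termination.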
