Pruning policies for two-tiered inverted index with correctness guarantee

Web search engines maintain large-scale inverted indexes that users query thousands of times per second. To cope with this query load, search engines prune their index so that it keeps only the documents likely to be returned as top results, and use the pruned index to compute the first batches of results. Although this approach improves performance by reducing the size of the index, computing the top results only from the pruned index can significantly degrade result quality: a document that belongs in the top results but was left out of the pruned index will be ranked behind the results computed from the pruned index. Given the fierce competition in the online search market, this degradation is clearly undesirable. In this paper, we study how to avoid any degradation of result quality due to pruning-based performance optimization while still realizing most of its benefit. Our contribution is a set of modifications to the pruning techniques used to create the pruned index, together with a new result computation algorithm that guarantees that the top-matching pages always appear at the top of the search results, even though the first batch is computed from the pruned index most of the time. We also show how to determine the optimal size of a pruned index, and we experimentally evaluate our algorithms on a collection of 130 million Web pages.
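The abstract does not spell out the result computation algorithm, so the following is only a minimal sketch of the two-tier idea under stated assumptions, not the authors' method. It assumes a keyword-style pruning policy (the pruned index keeps complete posting lists for a chosen subset of terms), conjunctive queries, and additive per-term scores. Under those assumptions, a query whose terms are all present in the pruned index can be answered from it with exactly the same ranking as the full index; any other query falls back to the full index. All names (`prune_by_keyword`, `top_k`, `two_tier_top_k`) are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed toy representation: term -> {doc_id: score contribution of that term}
Index = Dict[str, Dict[int, float]]


def prune_by_keyword(full: Index, kept_terms: Set[str]) -> Index:
    """Keep the complete posting list of every term in kept_terms (assumed policy)."""
    return {term: postings for term, postings in full.items() if term in kept_terms}


def top_k(index: Index, terms: List[str], k: int) -> List[Tuple[int, float]]:
    """Rank documents containing all query terms by the sum of their term scores."""
    postings = [index.get(term, {}) for term in terms]
    if not postings:
        return []
    # Conjunctive (AND) semantics: a document must appear in every posting list.
    docs = set(postings[0]).intersection(*postings[1:])
    scored = [(doc, sum(p[doc] for p in postings)) for doc in docs]
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def two_tier_top_k(
    pruned: Index, full: Index, terms: List[str], k: int
) -> Tuple[List[Tuple[int, float]], str]:
    """Answer from the pruned tier only when that is provably safe."""
    # Safe case: every query term kept its full posting list in the pruned tier,
    # so the pruned tier sees exactly the same candidates and scores.
    if all(term in pruned for term in terms):
        return top_k(pruned, terms, k), "pruned"
    # Otherwise the pruned tier could miss a top document: use the full index.
    return top_k(full, terms, k), "full"


full_index: Index = {
    "web": {1: 0.5, 2: 0.75, 3: 0.25},
    "search": {1: 0.25, 2: 0.5},
    "rare": {3: 0.5},
}
pruned_index = prune_by_keyword(full_index, {"web", "search"})

results, tier = two_tier_top_k(pruned_index, full_index, ["web", "search"], 1)
fallback_results, fallback_tier = two_tier_top_k(
    pruned_index, full_index, ["web", "rare"], 1
)
```

The design choice illustrated here is that the correctness check is made per query, before any results are returned: the pruned tier is used only when it cannot change the ranking, which is one simple way to obtain a guarantee of the kind the abstract describes.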
