Optimizing result prefetching in web search engines with segmented indices

We study the process in which search engines with segmented indices serve queries. In particular, we investigate the number of result pages that search engines should prepare during the query processing phase.

Search engine users have been observed to browse through very few pages of results for queries that they submit. This behavior of users suggests that prefetching many results upon processing an initial query is not efficient, since most of the prefetched results will not be requested by the user who initiated the search. However, a policy that abandons result prefetching in favor of retrieving just the first page of search results might not make optimal use of system resources either.

We argue that for a certain behavior of users, engines should prefetch a constant number of result pages per query. We define a concrete query processing model for search engines with segmented indices, and analyze the cost of such prefetching policies. Based on these costs, we show how to determine the constant that optimizes the prefetching policy. Our results are mostly applicable to local index partitions of the inverted files, but are also applicable to processing short queries in global index architectures.
