Optimal crawling strategies for web search engines

Web search engines employ multiple so-called crawlers to maintain local copies of web pages. But these web pages are frequently updated by their owners, and therefore the crawlers must regularly revisit the web pages to maintain the freshness of their local copies. In this paper, we propose a two-part scheme to optimize this crawling process. One goal might be the minimization of the average level of staleness over all web pages, and the scheme we propose can solve this problem. Alternatively, the same basic scheme could be used to minimize a possibly more important search engine embarrassment level metric: the frequency with which a client makes a search engine query and then clicks on a returned URL only to find that the result is incorrect. The first part of our scheme determines the (nearly) optimal crawling frequencies, as well as the theoretically optimal times to crawl each web page. It does so within an extremely general stochastic framework, one which supports a wide range of complex update patterns found in practice. It uses techniques from probability theory and the theory of resource allocation problems that are highly computationally efficient, which is crucial for practicality because the size of the problem in the web environment is immense. The second part employs these crawling frequencies and ideal crawl times as input, and creates an optimal achievable schedule for the crawlers. Our solution, based on network flow theory, is exact as well as highly efficient. An analysis of the update patterns from a highly accessed and highly dynamic web site is used to gain some insights into the properties of page updates in practice. Then, based on this analysis, we perform a set of detailed simulation experiments to demonstrate the quality and speed of our approach.
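
To make the first part of the scheme more concrete, the following is a minimal sketch of how crawl frequencies could be chosen by greedy marginal analysis, a standard technique for separable concave resource allocation problems. It is not the algorithm of the paper: it assumes, purely for illustration, that each page is updated according to a Poisson process with a known rate and is crawled at evenly spaced times, whereas the paper's stochastic framework is far more general. The names `expected_freshness` and `allocate_crawls`, the `weights` parameter (which could stand in for page importance under an embarrassment-style metric), and the `period` parameter are all hypothetical.

```python
import heapq
import math


def expected_freshness(rate, crawls, period=1.0):
    """Time-averaged probability that the local copy is fresh.

    Assumes Poisson updates with the given rate and `crawls` evenly
    spaced crawls per period (an illustrative assumption only).
    """
    if crawls == 0:
        return 0.0          # never crawled: treated as always stale
    if rate == 0:
        return 1.0          # page never changes once it has been crawled
    u = rate * period / crawls          # expected updates between crawls
    return -math.expm1(-u) / u          # (1 - e^{-u}) / u


def allocate_crawls(rates, weights, budget, period=1.0):
    """Greedily assign `budget` crawls to pages, one crawl at a time.

    Each step gives the next crawl to the page with the largest weighted
    gain in expected freshness. Because the freshness function above is
    concave in the number of crawls, this greedy rule is optimal for
    this simplified model.
    """
    counts = [0] * len(rates)

    def gain(i):
        # Weighted improvement from giving page i one more crawl.
        now = expected_freshness(rates[i], counts[i], period)
        nxt = expected_freshness(rates[i], counts[i] + 1, period)
        return weights[i] * (nxt - now)

    # Max-heap via negated gains; ties are broken by page index.
    heap = [(-gain(i), i) for i in range(len(rates))]
    heapq.heapify(heap)
    for _ in range(budget):
        if not heap:
            break
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-gain(i), i))
    return counts
```

For example, `allocate_crawls([1.0, 1.0], [2.0, 1.0], 3)` splits three crawls between two pages with the same update rate but different weights. The second part of the scheme, which turns frequencies and ideal crawl times into an achievable schedule using network flow, is not covered by this sketch.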
