Tractable near-optimal policies for crawling

Significance

We present a tractable algorithm that provides a near-optimal solution to the crawling problem, a fundamental challenge at the heart of web search: Given a large quantity of distributed and dynamic web content, which pages do we choose to update in a local cache, with the goal of serving up-to-date pages to client requests? Solving this optimization requires identifying the best set of pages to refresh given popularity rates and change rates, an intractable problem in the general case. To overcome this intractability, we show that the optimal randomized strategy can be efficiently determined (in near-linear time) and then use it to produce a deterministic policy that exhibits excellent performance in experiments.

The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.
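The two-stage idea (optimal polling probabilities first, a deterministic schedule second) can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the abstract: page i has request weight mu[i] and change rate h[i], polls and changes arrive as independent Poisson processes, so that a page polled at rate p[i] is fresh with probability p[i] / (p[i] + h[i]), and the polling budget is normalized to sum(p) = 1. Under that simplified objective, the optimum has a water-filling form that a sort plus one linear scan recovers. The function names and the largest-deficit scheduler are hypothetical stand-ins, not the paper's own construction.

```python
import math


def optimal_poll_probabilities(mu, h):
    """Maximize sum(mu[i] * p[i] / (p[i] + h[i])) s.t. sum(p) == 1, p >= 0.

    Assumed model (not from the abstract): mu[i] > 0 is the request weight
    and h[i] > 0 the change rate of page i. Stationarity gives
    p[i] = sqrt(mu[i] * h[i] / lam) - h[i] for pages that receive any budget.
    """
    n = len(mu)
    # Pages with the largest marginal gain at p = 0 (mu/h) enter first.
    order = sorted(range(n), key=lambda i: mu[i] / h[i], reverse=True)

    s = 0.0          # running sum of sqrt(mu * h) over active pages
    total_h = 0.0    # running sum of h over active pages
    root_lam = None  # sqrt of the Lagrange multiplier for the active set
    active = 0
    for i in order:
        s_new = s + math.sqrt(mu[i] * h[i])
        h_new = total_h + h[i]
        candidate = s_new / (1.0 + h_new)
        # Page i would get a non-positive share: stop, later pages fail too.
        if math.sqrt(mu[i] / h[i]) <= candidate:
            break
        s, total_h, root_lam = s_new, h_new, candidate
        active += 1

    p = [0.0] * n
    for i in order[:active]:
        p[i] = math.sqrt(mu[i] * h[i]) / root_lam - h[i]
    return p


def deficit_schedule(p, steps):
    """Deterministic stand-in: at step t poll the page furthest behind t * p[i].

    This simple rule costs O(n) per step; it only illustrates turning
    probabilities into a fixed schedule.
    """
    counts = [0] * len(p)
    schedule = []
    for t in range(1, steps + 1):
        i = max(range(len(p)), key=lambda j: t * p[j] - counts[j])
        counts[i] += 1
        schedule.append(i)
    return schedule


if __name__ == "__main__":
    probs = optimal_poll_probabilities([0.6, 0.3, 0.1], [0.05, 0.2, 1.0])
    print(probs, sum(probs))
    print(deficit_schedule(probs, 12))
```

The sort dominates the cost, which matches the O(n log n) flavor of the claim above, although the objective used here is only an assumed simplification of the paper's model.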
