Optimal Freshness Crawl Under Politeness Constraints

A Web crawler is an essential part of a search engine: it procures the information that the search engine subsequently serves to its users. As the Web becomes increasingly dynamic, a crawler must not only discover new web pages but also keep revisiting those already in the search engine's index, so that the index stays fresh by picking up the pages' changed content. Determining how often to recrawl pages requires making tradeoffs based on the pages' relative importance and change rates, subject to multiple resource constraints: the limited daily budget of crawl requests on the search engine's end, and politeness constraints restricting the rate at which pages can be requested from a given host. In this paper, we introduce PoliteBinaryLambdaCrawl, the first optimal algorithm for freshness crawl scheduling in the presence of politeness constraints as well as non-uniform page importance scores and the crawler's own crawl request limit. We also propose an approximation for it, state the conditions under which that approximation is theoretically optimal, and in the process discover a connection to an approach previously thought of as a mere heuristic for freshness crawl scheduling. We explore the relative performance of PoliteBinaryLambdaCrawl and other methods for handling politeness constraints on a dataset collected by crawling over 18.5M URLs daily over 14 weeks.
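To make the tradeoff concrete, the following is a minimal illustrative sketch of a recrawl-rate allocation problem with both a global budget and per-host caps. It is not PoliteBinaryLambdaCrawl. It rests on assumptions that the abstract does not state: each page changes as a Poisson process with rate `d`, is crawled at Poisson times with rate `r`, and is therefore fresh a fraction `r / (r + d)` of the time; the objective is the importance-weighted sum of these fractions. All function names, parameters, and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def rates_at(pages, lam):
    """Rates where marginal gain w*d/(r+d)^2 equals lam, clipped at zero.

    pages is a list of (importance w, change rate d) pairs (assumed model).
    """
    return [max(0.0, math.sqrt(w * d / lam) - d) for w, d in pages]

def solve_budget(pages, budget, iters=100):
    """Maximize sum of w*r/(r+d) subject to sum(r) <= budget (no host caps)."""
    lo = 1e-12                           # tiny multiplier: rates are very large
    hi = max(w / d for w, d in pages)    # at this multiplier every rate is zero
    for _ in range(iters):
        mid = math.sqrt(lo * hi)         # bisect on a log scale
        if sum(rates_at(pages, mid)) > budget:
            lo = mid
        else:
            hi = mid
    return rates_at(pages, hi)           # hi always satisfies the budget

def solve_polite(pages, hosts, budget, host_cap, iters=100):
    """Same objective, plus a cap on the total crawl rate per host.

    hosts[i] is the host of pages[i]; host_cap maps host -> maximum rate.
    """
    by_host = defaultdict(list)
    for idx, host in enumerate(hosts):
        by_host[host].append(idx)

    def allocation(lam):
        rates = [0.0] * len(pages)
        for host, idxs in by_host.items():
            sub = [pages[i] for i in idxs]
            r = rates_at(sub, lam)
            if sum(r) > host_cap[host]:
                # Host cap binds: spend exactly the cap within this host.
                r = solve_budget(sub, host_cap[host])
            for i, ri in zip(idxs, r):
                rates[i] = ri
        return rates

    lo = 1e-12
    hi = max(w / d for w, d in pages)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if sum(allocation(mid)) > budget:
            lo = mid
        else:
            hi = mid
    return allocation(hi)

# Toy instance: two ordinary pages on host "a", one important page on host "b".
pages = [(1.0, 1.0), (1.0, 1.0), (4.0, 1.0)]
hosts = ["a", "a", "b"]
rates = solve_polite(pages, hosts, budget=3.0, host_cap={"a": 10.0, "b": 1.0})
```

In this toy instance the cap on host "b" prevents the important page from absorbing most of the budget, so the remainder shifts to the pages on host "a". The nested bisection is chosen for readability, not efficiency, and would not be suitable at the scale of the dataset described above.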
