Staying up to Date with Online Content Changes Using Reinforcement Learning for Scheduling

From traditional Web search engines to virtual assistants and Web accelerators, services that rely on online information need to continually keep track of remote content changes by explicitly requesting content updates from remote sources (e.g., Web pages). We propose a novel optimization objective for this setting that has several practically desirable properties, along with efficient algorithms for optimizing it that carry optimality guarantees even under mixed content change observability and initially unknown change model parameters. Experiments on 18.5M URLs crawled daily for 14 weeks show significant advantages of this approach over prior art.

[1]  Kenneth Dixon,et al.  Introduction to Stochastic Modeling , 2011 .

[2]  Hector Garcia-Molina,et al.  Synchronizing a database to improve freshness , 2000, SIGMOD '00.

[3]  Sandeep Pandey,et al.  Monitoring the dynamic web to respond to continuous queries , 2003, WWW '03.

[4]  GalAvigdor,et al.  Managing periodically updated data in relational databases , 2001 .

[5]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[6]  Dafna Shahaf,et al.  Tractable near-optimal policies for crawling , 2018, Proceedings of the National Academy of Sciences.

[7]  Antal van den Bosch,et al.  A Longitudinal Analysis of Search Engine Index Size , 2015, ISSI.

[8]  Philip S. Yu,et al.  Optimal crawling strategies for web search engines , 2002, WWW '02.

[9]  Nicole Immorlica,et al.  Recharging Bandits , 2018, 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS).

[10]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[11]  Sandeep Pandey,et al.  WIC: A General-Purpose Algorithm for Monitoring Web Information Sources , 2004, VLDB.

[12]  Le Song,et al.  Variational Policy for Guiding Point Processes , 2017, ICML.

[13]  Kevin D. Glazebrook,et al.  An index policy for a stochastic scheduling model with improving/deteriorating jobs , 2002 .

[14]  Sandeep Pandey,et al.  User-centric Web crawling , 2005, WWW '05.

[15]  Sébastien Bubeck,et al.  Convex Optimization: Algorithms and Complexity , 2014, Found. Trends Mach. Learn..

[16]  J. Miller Numerical Analysis , 1966, Nature.

[17]  Toshihide Ibaraki,et al.  Resource allocation problems - algorithmic approaches , 1988, MIT Press series in the foundations of computing.

[18]  Le Song,et al.  Smart Broadcasting: Do You Want to be Seen? , 2016, KDD.

[19]  Peter Clark,et al.  Learning Knowledge Graphs for Question Answering through Conversational Dialog , 2015, NAACL.

[20]  J. Kiefer,et al.  Sequential minimax search for a maximum , 1953 .

[21]  Hector Garcia-Molina,et al.  Estimating frequency of change , 2003, TOIT.

[22]  Eduardo F. Morales,et al.  An Introduction to Reinforcement Learning , 2011 .

[23]  Sandeep Pandey,et al.  Recrawl scheduling based on information longevity , 2008, WWW.

[24]  Zhen Liu,et al.  Optimal Robot Scheduling for Web Search Engines , 1998 .

[25]  Utkarsh Upadhyay,et al.  Deep Reinforcement Learning of Marked Temporal Point Processes , 2018, NeurIPS.

[26]  Celia A. Glass,et al.  The Scheduling of Maintenance Service , 1998, Discret. Appl. Math..

[27]  Hector Garcia-Molina,et al.  The Evolution of the Web and Implications for an Incremental Crawler , 2000, VLDB.

[28]  Hamid R. Rabiee,et al.  RedQueen: An Online Algorithm for Smart Broadcasting in Social Networks , 2016, WSDM.

[29]  Kevin D. Glazebrook,et al.  Index policies for the maintenance of a collection of machines by a set of repairmen , 2005, Eur. J. Oper. Res..

[30]  Moses Charikar,et al.  Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[31]  Randeep Bhatia,et al.  Minimizing service and operation costs of periodic scheduling , 2002, SODA '98.

[32]  P. Whittle Restless Bandits: Activity Allocation in a Changing World , 1988 .

[33]  Eric Horvitz,et al.  Optimal Freshness Crawl Under Politeness Constraints , 2019, SIGIR.

[34]  Alexandros Ntoulas,et al.  Effective Change Detection Using Sampling , 2002, VLDB.

[35]  Paul N. Bennett,et al.  Predicting content change on the web , 2013, WSDM.

[36]  Hector Garcia-Molina,et al.  Effective page refresh policies for Web crawlers , 2003, TODS.