Effective page refresh policies for Web crawlers

In this article, we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so we need to poll the sources periodically to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date. This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model for describing the changes of Web pages, and that our proposed refresh policies improve the "freshness" of data very significantly; in certain cases, we obtained orders-of-magnitude improvement over existing policies.
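
To make the framework concrete, here is a minimal Monte-Carlo sketch, assuming the Poisson change model described above: a single page changes at rate λ, a crawler re-downloads it every I time units, and freshness is the fraction of time the local copy matches the source. The function name simulate_freshness, the parameter values, and the fixed-interval policy are illustrative assumptions, not code or settings from the article; the analytical value (1 − e^(−λI))/(λI) is simply the expected freshness implied by the Poisson assumption.

```python
import math
import random

def simulate_freshness(change_rate, refresh_interval, horizon=10_000.0, seed=0):
    """Fraction of time a local copy of one page stays fresh when the source
    changes according to a Poisson process with rate `change_rate` and the
    crawler re-downloads the page every `refresh_interval` time units.
    (Illustrative sketch, not the article's code.)"""
    rng = random.Random(seed)
    next_change = rng.expovariate(change_rate)   # time of first source change
    next_refresh = refresh_interval              # time of first re-download
    is_fresh = True                              # copy starts synchronized
    fresh_time = 0.0
    last_event = 0.0

    while last_event < horizon:
        t = min(next_change, next_refresh, horizon)
        if is_fresh:
            fresh_time += t - last_event         # accumulate time spent fresh
        last_event = t
        if t >= horizon:
            break
        if next_change <= next_refresh:
            is_fresh = False                             # source changed; copy is now stale
            next_change += rng.expovariate(change_rate)  # schedule the next change
        else:
            is_fresh = True                              # crawler refreshed the copy
            next_refresh += refresh_interval             # schedule the next refresh

    return fresh_time / horizon

if __name__ == "__main__":
    lam, interval = 2.0, 1.0   # assumed: 2 changes per time unit, one refresh per time unit
    expected = (1 - math.exp(-lam * interval)) / (lam * interval)
    print(f"simulated freshness : {simulate_freshness(lam, interval):.3f}")
    print(f"analytical freshness: {expected:.3f}")
```

With λ = 2 and I = 1 the analytical freshness is about 0.43, and the simulation converges to the same value. How such numbers change when a fixed crawl budget is allocated across pages with different change rates is precisely the trade-off that the article's refresh policies address.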
