Efficiently Detecting Webpage Updates Using Samples

Due to resource constraints, Web archiving systems and search engines usually have difficulty keeping their local repositories completely synchronized with the Web. To address this problem, sampling-based techniques periodically poll a subset of the webpages in the local repository to detect changes on the Web and update the local copies accordingly. The goal of such an approach is to discover as many changed webpages as possible within the available resources. In this paper we advance the state of the art of sampling-based techniques by answering a challenging question: given a sampled webpage that has been updated, which other webpages are also likely to have changed? We propose a set of sampling policies with various downloading granularities that take into account the link structure, the directory structure, and content-based features. We also exploit the update history and the popularity of webpages to adaptively model the download probability. We ran extensive experiments on a real web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that our sampling-based algorithm can detect about three times as many changed webpages as the baseline algorithm. They also show that changed webpages are most likely to be found in the same directory as, or in the upper directories of, a changed sample. By applying a clustering algorithm to all the webpages, pages with similar change patterns are grouped together, so that updated webpages can be found in the same cluster as a changed sample. Moreover, our adaptive downloading strategies significantly outperform static ones in detecting changes to popular webpages.
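To make the directory-based intuition concrete, the following is a minimal sketch (not the paper's actual algorithm) of a download policy that, given a changed sample, prioritizes repository URLs in the same directory and then in ancestor directories, up to a fixed download budget. The function names, the priority scheme, and the budget parameter are illustrative assumptions.

```python
# Hedged sketch of a directory-based sampling policy: when a sampled page is
# detected as changed, rank the remaining repository URLs so that pages in the
# same directory come first, then pages in ancestor directories, and download
# only up to a fixed budget. This is an illustration, not the authors' code.
from posixpath import dirname
from urllib.parse import urlparse


def directory_of(url: str) -> str:
    """Return the host-qualified directory path of a URL."""
    parsed = urlparse(url)
    return parsed.netloc + dirname(parsed.path)


def candidates_for_changed_sample(changed_url: str, repository_urls, budget: int):
    """Rank un-sampled URLs for downloading after `changed_url` was found updated."""
    changed_dir = directory_of(changed_url)

    def priority(url: str) -> int:
        d = directory_of(url)
        if d == changed_dir:
            return 0  # same directory as the changed sample: highest priority
        # ancestor directory: closer ancestors get smaller (better) scores
        if changed_dir.startswith(d.rstrip("/") + "/"):
            return len(changed_dir) - len(d)
        return 10**6  # unrelated directories: considered last

    ranked = sorted((u for u in repository_urls if u != changed_url), key=priority)
    return ranked[:budget]


if __name__ == "__main__":
    repo = [
        "http://example.org/news/2024/a.html",
        "http://example.org/news/2024/b.html",
        "http://example.org/news/index.html",
        "http://example.org/about.html",
    ]
    # b.html (same directory) and index.html (parent directory) rank first.
    print(candidates_for_changed_sample("http://example.org/news/2024/a.html", repo, budget=3))
```

In a full system this ranking would be combined with the link-structure, content-similarity, and adaptive (history- and popularity-based) signals described above; the sketch isolates only the directory heuristic.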
