A clustering-based sampling approach for refreshing search engine's database

Due to resource constraints, search engines usually have difficulty keeping their local databases completely synchronized with the Web. To detect as many changes as possible, the crawler used by a search engine should be able to predict the change behavior of webpages so that it can use its limited resources to download those webpages that are most likely to change. Towards this goal, we propose a sampling approach that operates at the cluster level. We first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster, and clusters containing webpages with higher change frequencies are revisited more often by our crawler. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results show that our clustering algorithm effectively groups pages with similar change patterns together, and that our proposal significantly outperforms the comparison approaches in terms of the average freshness of the local database.

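As a rough illustration of the sampling idea described in the abstract, the following Python sketch shows how a crawler might estimate per-cluster change rates from a small sample of pages and then split its revisit budget across clusters accordingly. The clustering step is assumed to have already run; the names clusters, has_changed, sample_size, and budget are hypothetical placeholders, and this is a minimal sketch rather than the paper's actual implementation.

    import random

    def estimate_cluster_change_rates(clusters, has_changed, sample_size=5, seed=0):
        # clusters: dict mapping cluster id -> list of URLs with similar change patterns
        # has_changed: callable that re-downloads a URL and reports whether it changed
        rng = random.Random(seed)
        rates = {}
        for cluster_id, urls in clusters.items():
            # Download only a small random sample instead of the whole cluster.
            sample = rng.sample(urls, min(sample_size, len(urls)))
            changed = sum(1 for url in sample if has_changed(url))
            # The sample change ratio serves as the estimated change frequency
            # for every page in the cluster (0.0 if the cluster happens to be empty).
            rates[cluster_id] = changed / len(sample) if sample else 0.0
        return rates

    def allocate_revisits(clusters, rates, budget):
        # Split a fixed download budget across clusters in proportion to their
        # estimated change rate and size, so fast-changing clusters are
        # revisited more often.
        weights = {c: rates[c] * len(urls) for c, urls in clusters.items()}
        total = sum(weights.values()) or 1.0
        return {c: max(1, round(budget * w / total)) for c, w in weights.items()}

In this sketch the revisit allocation is simply proportional to the estimated change rate times the cluster size; the paper's actual clustering algorithm and revisit policy may differ.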