Designing clustering-based web crawling policies for search engine crawlers

The World Wide Web is growing and changing at an astonishing rate. Web information systems such as search engines have to keep up with the growth and change of the Web. Due to resource constraints, search engines usually have difficulty keeping their local database completely synchronized with the Web. In this paper, we study how to make good use of limited system resources to detect as many changes as possible. Towards this goal, a crawler for a Web search engine should be able to predict the change behavior of webpages. We propose a clustering-based sampling approach. Specifically, we first group all the local webpages into clusters such that each cluster contains webpages with similar change patterns. We then sample webpages from each cluster to estimate the change frequency of all the webpages in that cluster. Finally, we let the crawler revisit clusters containing webpages with a higher change frequency with a higher probability. To evaluate the performance of an incremental crawler for a Web search engine, we measure both the freshness and the quality of the query results provided by the search engine. We run extensive experiments on a real Web data set of about 300,000 distinct URLs distributed among 210 websites. The results demonstrate that our clustering algorithm effectively groups pages with similar change patterns, and that our solution significantly outperforms existing methods: it detects more changed webpages and improves the quality of the user experience for those who query the search engine.
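The sample-then-revisit steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: it assumes the clustering step has already produced a mapping from cluster ID to page list, and the function names, the `has_changed` callback, and the choice to weight clusters in exact proportion to their estimated change rate are all hypothetical (the abstract only states that clusters that change more often are revisited with a higher probability).

```python
import random


def estimate_cluster_change_rates(clusters, has_changed, sample_size, rng):
    """Estimate each cluster's change rate from a random sample of its pages.

    clusters: dict mapping a cluster ID to a list of page identifiers
              (assumed to come from an earlier clustering step).
    has_changed: hypothetical callback that re-fetches a page and returns
                 True if it differs from the local copy.
    """
    rates = {}
    for cluster_id, pages in clusters.items():
        if not pages:
            continue  # nothing to sample from an empty cluster
        k = min(sample_size, len(pages))
        sample = rng.sample(pages, k)
        changed = sum(1 for page in sample if has_changed(page))
        rates[cluster_id] = changed / k
    return rates


def pick_clusters_to_revisit(rates, num_visits, rng):
    """Draw clusters with probability proportional to estimated change rate.

    Proportional weighting is an assumption made for this sketch.
    """
    ids = list(rates)
    weights = [rates[c] for c in ids]
    if sum(weights) == 0:
        # No sampled page changed anywhere: fall back to uniform choice.
        return [rng.choice(ids) for _ in range(num_visits)]
    return rng.choices(ids, weights=weights, k=num_visits)


# Toy data: every "news" page changed, no "archive" page did.
clusters = {
    "news": ["n1", "n2", "n3", "n4"],
    "archive": ["a1", "a2", "a3", "a4"],
}
changed_pages = {"n1", "n2", "n3", "n4"}

rng = random.Random(0)
rates = estimate_cluster_change_rates(
    clusters, lambda page: page in changed_pages, sample_size=2, rng=rng
)
visits = pick_clusters_to_revisit(rates, num_visits=5, rng=rng)
```

In this toy setup the estimated rate is 1.0 for `news` and 0.0 for `archive`, so every revisit is drawn from `news`. A real crawler would also need to revisit low-rate clusters occasionally so that their estimates do not go stale.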
